arXivDaily: arXiv daily academic digest, updated Monday through Friday
2602.16709 2026-02-19 cs.LG math.ST stat.ME stat.TH

Knowledge-Embedded Latent Projection for Robust Representation Learning

Weijing Tang, Ming Yuan, Zongqi Xia, Tianxi Cai

Latent space models are widely used for analyzing high-dimensional discrete data matrices, such as patient-feature matrices in electronic health records (EHRs), by capturing complex dependence structures through low-dimensional embeddings. However, estimation becomes challenging in the imbalanced regime, where one matrix dimension is much larger than the other. In EHR applications, cohort sizes are often limited by disease prevalence or data availability, whereas the feature space remains extremely large due to the breadth of medical coding systems. Motivated by the increasing availability of external semantic embeddings, such as pre-trained embeddings of clinical concepts in EHRs, we propose a knowledge-embedded latent projection model that leverages semantic side information to regularize representation learning. Specifically, we model column embeddings as smooth functions of semantic embeddings via a mapping in a reproducing kernel Hilbert space. We develop a computationally efficient two-step estimation procedure that combines semantically guided subspace construction via kernel principal component analysis with scalable projected gradient descent. We establish estimation error bounds that characterize the trade-off between statistical error and approximation error induced by the kernel projection. Furthermore, we provide local convergence guarantees for our non-convex optimization procedure. Extensive simulation studies and a real-world EHR application demonstrate the effectiveness of the proposed method.

2602.16690 2026-02-19 stat.ME cs.LG stat.ML

Synthetic-Powered Multiple Testing with FDR Control

Yonghoon Lee, Meshi Bashari, Edgar Dobriban, Yaniv Romano

Multiple hypothesis testing with false discovery rate (FDR) control is a fundamental problem in statistical inference, with broad applications in genomics, drug screening, and outlier detection. In many such settings, researchers may have access not only to real experimental observations but also to auxiliary or synthetic data -- from past, related experiments or generated by generative models -- that can provide additional evidence about the hypotheses of interest. We introduce SynthBH, a synthetic-powered multiple testing procedure that safely leverages such synthetic data. We prove that SynthBH guarantees finite-sample, distribution-free FDR control under a mild PRDS-type positive dependence condition, without requiring the pooled-data p-values to be valid under the null. The proposed method adapts to the (unknown) quality of the synthetic data: it enhances the sample efficiency and may boost the power when synthetic data are of high quality, while controlling the FDR at a user-specified level regardless of their quality. We demonstrate the empirical performance of SynthBH on tabular outlier detection benchmarks and on genomic analyses of drug-cancer sensitivity associations, and further study its properties through controlled experiments on simulated data.
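
For background, the classical Benjamini-Hochberg (BH) step-up procedure, the standard baseline for FDR control, can be sketched as follows. This is a minimal illustration of plain BH on ordinary p-values, not the SynthBH procedure, whose details the abstract does not give. The function name `benjamini_hochberg` and the example p-values are assumptions chosen for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.1):
    """Return the sorted indices of hypotheses rejected by classical BH.

    Illustrative baseline only: this is NOT SynthBH, just the standard
    step-up rule applied to a list of p-values.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Sort hypothesis indices by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= alpha * k / m.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            k = rank
    # Reject the k hypotheses with the smallest p-values.
    return sorted(order[:k])


# Hypothetical p-values for ten hypotheses.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p, alpha=0.05))  # prints [0, 1]
```

The step-up rule rejects every hypothesis up to the largest rank that passes its threshold, even if some smaller ranks fail their own thresholds.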

2602.16634 2026-02-19 stat.ML cs.AI cs.LG physics.bio-ph physics.chem-ph

Enhanced Diffusion Sampling: Efficient Rare Event Sampling and Free Energy Calculation with Diffusion Models

Yu Xie, Ludwig Winkler, Lixin Sun, Sarah Lewis, Adam E. Foster, José Jiménez Luna, Tim Hempel, Michael Gastegger, Yaoyi Chen, Iryna Zaporozhets, Cecilia Clementi, Christopher M. Bishop, Frank Noé

The rare-event sampling problem has long been the central limiting factor in molecular dynamics (MD), especially in biomolecular simulation. Recently, diffusion models such as BioEmu have emerged as powerful equilibrium samplers that generate independent samples from complex molecular distributions, eliminating the cost of sampling rare transition events. However, a sampling problem remains when computing observables that rely on states which are rare in equilibrium, for example folding free energies. Here, we introduce enhanced diffusion sampling, enabling efficient exploration of rare-event regions while preserving unbiased thermodynamic estimators. The key idea is to perform quantitatively accurate steering protocols to generate biased ensembles and subsequently recover equilibrium statistics via exact reweighting. We instantiate our framework in three algorithms: UmbrellaDiff (umbrella sampling with diffusion models), $Δ$G-Diff (free-energy differences via tilted ensembles), and MetaDiff (a batchwise analogue for metadynamics). Across toy systems, protein folding landscapes and folding free energies, our methods achieve fast, accurate, and scalable estimation of equilibrium properties within GPU-minutes to hours per system -- closing the rare-event sampling gap that remained after the advent of diffusion-model equilibrium samplers.

2602.16616 2026-02-19 stat.AP

Design and Analysis Strategies for Pooling in High Throughput Screening: Application to the Search for a New Anti-Microbial

Byran Smucker, Benjamin Brennan, Emily Rego, Meng Wu, Zhihong Lin, Brian Ahmer, Blake Peterson

A major public health issue is the growing resistance of bacteria to antibiotics. An important part of the needed response is the discovery and development of new antimicrobial strategies. These require the screening of potential new drugs, typically accomplished using high-throughput screening (HTS). Traditionally, HTS is performed by examining one compound per well, but a more efficient strategy pools multiple compounds per well. In this work, we study several recently proposed pooling construction methods, as well as a variety of pooled high-throughput screening analysis methods, in order to provide guidance to practitioners on which methods to use. This is done in the context of an application of the methods to the search for new drugs to combat bacterial infection. We discuss both an extensive pilot study and a small screening campaign, and highlight both the successes and challenges of the pooling approach.

2602.16606 2026-02-19 math.ST stat.ME stat.TH

On Sharpened Convergence Rate of Generalized Sliced Inverse Regression for Nonlinear Sufficient Dimension Reduction

Chak Fung Choi, Yin Tang, Bing Li

Generalized Sliced Inverse Regression (GSIR) is one of the most important methods for nonlinear sufficient dimension reduction. As shown in Li and Song (2017), it enjoys a convergence rate that is independent of the dimension of the predictor, thus avoiding the curse of dimensionality. In this paper we establish an improved convergence rate of GSIR under additional mild eigenvalue decay rate and smoothness conditions. Our convergence rate can be made arbitrarily close to $n^{-1/3}$ under appropriate decay rate and smoothness parameters. As a comparison, the rate of Li and Song (2017) is $n^{-1/4}$ under the best conditions. This improvement is significant because, for example, in a semiparametric estimation problem involving an infinite-dimensional nuisance parameter, the convergence rate of the estimator of the nuisance parameter is often required to be faster than $n^{-1/4}$ to guarantee desired semiparametric properties such as asymptotic efficiency. This can be achieved by the improved convergence rate, but not by the original rate. The sharpened convergence rate can also be established for GSIR in more general settings, such as functional sufficient dimension reduction.

2602.16583 2026-02-19 stat.AP

Physical Activity Trajectories Preceding Incident Major Depressive Disorder Diagnosis Using Consumer Wearable Devices in the All of Us Research Program: Case-Control Study

Yuezhou Zhang, Amos Folarin, Hugh Logan Ellis, Rongrong Zhong, Callum Stewart, Heet Sankesara, Hyunju Kim, Shaoxiong Sun, Abhishek Pratap, Richard JB Dobson

Low physical activity is a known risk factor for major depressive disorder (MDD), but changes in activity before a first clinical diagnosis remain unclear, especially using long-term objective measurements. This study characterized trajectories of wearable-measured physical activity during the year preceding incident MDD diagnosis. We conducted a retrospective nested case-control study using linked electronic health record and Fitbit data from the All of Us Research Program. Adults with at least 6 months of valid wearable data in the year before diagnosis were eligible. Incident MDD cases were matched to controls on age, sex, body mass index, and index time (up to four controls per case). Daily step counts and moderate-to-vigorous physical activity (MVPA) were aggregated into monthly averages. Linear mixed-effects models compared trajectories from 12 months before diagnosis to diagnosis. Within cases, contrasts identified when activity first significantly deviated from levels 12 months prior. The cohort included 4,104 participants (829 cases and 3,275 controls; 81.7% women; median age 48.4 years). Compared with controls, cases showed consistently lower activity and significant downward trajectories in both step counts and MVPA during the year before diagnosis (P < 0.001). Significant declines appeared about 4 months before diagnosis for step counts and 5 months for MVPA. Exploratory analyses suggested subgroup differences, including steeper declines in men, greater intensity reductions at older ages, and persistently low activity among individuals with obesity. Sustained within-person declines in physical activity emerged months before incident MDD diagnosis. Longitudinal wearable monitoring may provide early signals to support risk stratification and earlier intervention.

2602.16581 2026-02-19 math.NA cs.NA stat.CO

Whittle-Matérn Fields with Variable Smoothness

Hamza Ruzayqat, Wenyu Lei, David Bolin, George Turkiyyah, Omar Knio

Comments 24 pages, 5 figures, 2 tables

We introduce and analyze a nonlocal generalization of Whittle-Matérn Gaussian fields in which the smoothness parameter varies in space through the fractional order, $s=s(x)\in[\underline{s},\bar{s}]\subset(0,1)$. The model is defined via an integral-form operator whose kernel is constructed from the modified Bessel function of the second kind and whose local singularity is governed by the symmetric exponent $\beta(x,y)=(s(x)+s(y))/2$. This variable-order nonlocal formulation departs from the classical constant-order pseudodifferential setting and raises new analytic and numerical challenges. We develop a novel variational framework adapted to the kernel, prove existence and uniqueness of weak solutions on truncated bounded domains, and derive Sobolev regularity of the Gaussian (spectral) solution controlled by the minimal local order: realizations lie in $H^r(G)$ for every $r<2\underline{s}-\tfrac{d}{2}$ (here $H^r(G)$ denotes the Sobolev space on the bounded domain $G$), hence in $L_2(G)$ when $\underline{s}>d/4$. We also present a finite-element sampling method for the integral model, derive error estimates, and provide numerical experiments in one dimension that illustrate the impact of spatially varying smoothness on sample covariances. Computational aspects and directions for scalable implementations are discussed.

2602.16568 2026-02-19 math.ST cs.DS cs.LG math.OC stat.ML stat.TH

Separating Oblivious and Adaptive Models of Variable Selection

Ziyun Chen, Jerry Li, Kevin Tian, Yusong Zhu

Comments 40 pages

Sparse recovery is among the most well-studied problems in learning theory and high-dimensional statistics. In this work, we investigate the statistical and computational landscapes of sparse recovery with $\ell_\infty$ error guarantees. This variant of the problem is motivated by variable selection tasks, where the goal is to estimate the support of a $k$-sparse signal in $\mathbb{R}^d$. Our main contribution is a provable separation between the oblivious ("for each") and adaptive ("for all") models of $\ell_\infty$ sparse recovery. We show that under an oblivious model, the optimal $\ell_\infty$ error is attainable in near-linear time with $\approx k\log d$ samples, whereas in an adaptive model, $\gtrsim k^2$ samples are necessary for any algorithm to achieve this bound. This establishes a surprising contrast with the standard $\ell_2$ setting, where $\approx k \log d$ samples suffice even for adaptive sparse recovery. We conclude with a preliminary examination of a partially-adaptive model, where we show nontrivial variable selection guarantees are possible with $\approx k\log d$ measurements.

2602.16540 2026-02-19 stat.ME math.ST stat.TH

Generalised Linear Models Driven by Latent Processes: Asymptotic Theory and Applications

Wagner Barreto-Souza, Ngai Hang Chan

Comments Paper submitted for publication

This paper introduces a class of generalised linear models (GLMs) driven by latent processes for modelling count, real-valued, binary, and positive continuous time series. Extending earlier latent-process regression frameworks based on Poisson or one-parameter exponential family assumptions, we allow the conditional distribution of the response to belong to a bi-parameter exponential family, with the latent process entering the conditional mean multiplicatively. This formulation substantially broadens the scope of latent-process GLMs: for instance, it naturally accommodates gamma responses for positive continuous data, enables estimation of an unknown dispersion parameter via the method of moments, and avoids restrictive conditions on link functions that arise under existing formulations. We establish the asymptotic normality of the GLM estimators obtained from the GLM likelihood that ignores the latent process, and we derive the correct information matrix for valid inference. In addition, we provide a principled approach to prediction and forecasting in GLMs driven by latent processes, a topic not previously addressed in the literature. We present two real data applications on measles infections in North Rhine-Westphalia (Germany) and paleoclimatic glacial varves, which highlight the practical advantages and enhanced flexibility of the proposed modelling framework.

2602.16505 2026-02-19 stat.ML cs.LG

Functional Decomposition and Shapley Interactions for Interpreting Survival Models

Sophie Hanna Langbein, Hubert Baniecki, Fabian Fumagalli, Niklas Koenen, Marvin N. Wright, Julia Herbinger

Hazard and survival functions are natural, interpretable targets in time-to-event prediction, but their inherent non-additivity fundamentally limits standard additive explanation methods. We introduce Survival Functional Decomposition (SurvFD), a principled approach for analyzing feature interactions in machine learning survival models. By decomposing higher-order effects into time-dependent and time-independent components, SurvFD offers a previously unrecognized perspective on survival explanations, explicitly characterizing when and why additive explanations fail. Building on this theoretical decomposition, we propose SurvSHAP-IQ, which extends Shapley interactions to time-indexed functions, providing a practical estimator for higher-order, time-dependent interactions. Together, SurvFD and SurvSHAP-IQ establish an interaction- and time-aware interpretability approach for survival modeling, with broad applicability across time-to-event prediction tasks.

2602.16497 2026-02-19 stat.ME

Factor-Adjusted Multiple Testing for High-Dimensional Individual Mediation Effects

Chen Shi, Zhao Chen, Christina Dan Wang

Identifying individual mediators is a central goal of high-dimensional mediation analysis, yet pervasive dependence among mediators can invalidate standard debiased inference and lead to substantial false discovery rate (FDR) inflation. We propose a Factor-Adjusted Debiased Mediation Testing (FADMT) framework that enables large-scale inference for individual mediation effects with FDR control under complex dependence structures. Our approach posits an approximate factor structure on the unobserved errors of the mediator model, extracts common latent factors, and constructs decorrelated pseudo-mediators for the subsequent inferential procedure. We establish the asymptotic normality of the debiased estimator and develop a multiple testing procedure with theoretical FDR control under mild high-dimensional conditions. By adjusting for latent-factor-induced dependence, FADMT also improves robustness to spurious associations driven by shared latent variation in observational studies. Extensive simulations demonstrate superior finite-sample performance across a wide range of correlation structures. Applications to TCGA-BRCA multi-omics data and to China's stock connect study further illustrate the practical utility of the proposed method.

2602.16476 2026-02-19 stat.ML cs.LG

Learning Preference from Observed Rankings

Yu-Chang Chen, Chen Chian Fuh, Shang En Tsai

Estimating consumer preferences is central to many problems in economics and marketing. This paper develops a flexible framework for learning individual preferences from partial ranking information by interpreting observed rankings as collections of pairwise comparisons with logistic choice probabilities. We model latent utility as the sum of interpretable product attributes, item fixed effects, and a low-rank user-item factor structure, enabling both interpretability and information sharing across consumers and items. We further correct for selection in which comparisons are observed: a comparison is recorded only if both items enter the consumer's consideration set, inducing exposure bias toward frequently encountered items. We model pair observability as the product of item-level observability propensities and estimate these propensities with a logistic model for the marginal probability that an item is observable. Preference parameters are then estimated by maximizing an inverse-probability-weighted (IPW), ridge-regularized log-likelihood that reweights observed comparisons toward a target comparison population. To scale computation, we propose a stochastic gradient descent (SGD) algorithm based on inverse-probability resampling, which draws comparisons in proportion to their IPW weights. In an application to transaction data from an online wine retailer, the method improves out-of-sample recommendation performance relative to a popularity-based benchmark, with particularly strong gains in predicting purchases of previously unconsumed products.
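
The weighted objective described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: each item's latent utility is treated as a free parameter (the abstract's attribute, fixed-effect, and low-rank structure is omitted), the item-level propensities are taken as given rather than estimated, and the names `ipw_ridge_loglik`, `utilities`, `propensities`, and `ridge` are hypothetical.

```python
import math


def ipw_ridge_loglik(utilities, comparisons, propensities, ridge=0.1):
    """IPW, ridge-penalized log-likelihood for pairwise logistic comparisons.

    utilities:    latent utility per item (assumed free parameters here)
    comparisons:  (winner, loser) item-index pairs that were observed
    propensities: per-item probability of being observable (assumed known)
    """
    total = 0.0
    for winner, loser in comparisons:
        diff = utilities[winner] - utilities[loser]
        # log P(winner preferred) under a logistic choice model.
        log_prob = -math.log1p(math.exp(-diff))
        # Pair observability modelled as the product of item propensities,
        # so the inverse-probability weight is its reciprocal.
        weight = 1.0 / (propensities[winner] * propensities[loser])
        total += weight * log_prob
    # Ridge penalty on the parameters.
    penalty = ridge * sum(u * u for u in utilities)
    return total - penalty


# Hypothetical data: three items, item 2 is rarely observable.
utilities = [0.5, 0.0, -0.5]
comparisons = [(0, 1), (0, 2), (1, 2)]
propensities = [0.9, 0.8, 0.2]
print(ipw_ridge_loglik(utilities, comparisons, propensities))
```

Comparisons involving rarely observed items receive larger weights, which is how the reweighting counteracts exposure bias toward frequently encountered items.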

2602.16466 2026-02-19 math.ST stat.TH

Estimation of Conformal Metrics

Jérôme Taupin

We study deformations of the geodesic distances on a domain of $\mathbb{R}^N$ induced by a function called the conformal factor. We show that under a positive reach assumption on the domain (not necessarily a submanifold) and mild assumptions on the conformal factor, geodesics for the conformal metric have good regularity properties in the form of a lower bounded reach. This regularity allows for efficient estimation of the conformal metric from a random point cloud with a relative error proportional to the Hausdorff distance between the point cloud and the original domain. We then establish convergence rates of order $n^{-1/d}$ that are close to sharp when the intrinsic dimension $d$ of the domain is large, for an estimator that can be computed in $O(n^2)$ time. Finally, this paper includes a useful equivalence result between ball graphs and nearest-neighbor graphs under Ahlfors regularity of the sampling measure, allowing results to be transferred from one setting to the other.

2602.16463 2026-02-19 stat.ME

Focused Relative Risk Information Criterion for Variable Selection in Linear Regression

Nils Lid Hjort

Comments 19 pages, 5 figures; technical report of July 2020 (Department of Mathematics, University of Oslo), from which a modified version will be written and submitted for journal publication

This paper motivates and develops a novel and focused approach to variable selection in linear regression models. For estimating the regression mean $\mu=\mathrm{E}\,(Y\mid x_0)$, for the covariate vector of a given individual, there is a list of competing estimators, say $\hat\mu_S$ for each submodel $S$. Exact expressions are found for the relative mean squared error risks, when compared to the widest model available, say $\mathrm{mse}_S/\mathrm{mse}_{\mathrm{wide}}$. The theory of confidence distributions is used for accurate assessments of these relative risks. This leads to certain Focused Relative Risk Information Criterion scores, and associated FRIC plots and FRIC tables, as well as to Confidence plots to exhibit the confidence the data give in the submodels. The machinery is extended to handle many focus parameters at the same time, with appropriate averaged FRIC scores. The particular case where all available covariate vectors have equal importance yields a new overall criterion for variable selection, balancing complexity and fit in a natural fashion. A connection to the Mallows criterion is demonstrated, leading also to natural modifications of the latter. The FRIC and AFRIC strategies are illustrated for real data.

2602.16436 2026-02-19 cs.LG cs.CR stat.ML

Learning with Locally Private Examples by Inverse Weierstrass Private Stochastic Gradient Descent

Jean Dufraiche, Paul Mangold, Michaël Perrot, Marc Tommasi

Comments 30 pages, 8 figures

Releasing data once and for all under noninteractive Local Differential Privacy (LDP) enables complete data reusability, but the resulting noise may create bias in subsequent analyses. In this work, we leverage the Weierstrass transform to characterize this bias in binary classification. We prove that inverting this transform leads to a bias-correction method to compute unbiased estimates of nonlinear functions on examples released under LDP. We then build a novel stochastic gradient descent algorithm called Inverse Weierstrass Private SGD (IWP-SGD). It converges to the true population risk minimizer at a rate of $\mathcal{O}(1/n)$, with $n$ the number of examples. We empirically validate IWP-SGD on binary classification tasks using synthetic and real-world datasets.

2602.15385 2026-02-19 stat.AP q-fin.RM stat.ML

From Chain-Ladder to Individual Claims Reserving

Ronald Richman, Mario V. Wüthrich

The chain-ladder (CL) method is the most widely used claims reserving technique in non-life insurance. This manuscript introduces a novel approach to computing the CL reserves based on a fundamental restructuring of the data utilization for the CL prediction procedure. Instead of rolling forward the cumulative claims with estimated CL factors, we estimate multi-period factors that project the latest observations directly to the ultimate claims. This alternative perspective on CL reserving creates a natural pathway for the application of machine learning techniques to individual claims reserving. As a proof of concept, we present a small-scale real data application employing neural networks for individual claims reserving.
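
For context, the classical chain-ladder computation can be sketched as follows. In the classical method, rolling cumulative claims forward one period at a time is arithmetically the same as multiplying the latest observation by the product of the remaining development factors; the sketch shows that baseline only, not the multi-period factor estimator proposed in the manuscript. The function name `chain_ladder` and the toy triangle are assumptions for illustration, and the first row is assumed to be fully developed.

```python
def chain_ladder(triangle):
    """Classical chain-ladder on a cumulative claims triangle.

    triangle: list of rows (one per accident period), each row holding the
    cumulative claims observed so far; row i has len(triangle) - i entries.
    Returns (development_factors, projected_ultimates).
    """
    n = len(triangle)
    factors = []
    for j in range(n - 1):
        # Use only rows where both development periods j and j+1 are observed.
        rows = [row for row in triangle if len(row) > j + 1]
        numerator = sum(row[j + 1] for row in rows)
        denominator = sum(row[j] for row in rows)
        factors.append(numerator / denominator)

    ultimates = []
    for row in triangle:
        last = len(row) - 1
        # Product of the remaining one-period factors: the factor that takes
        # the latest observation directly to the ultimate claim.
        to_ultimate = 1.0
        for j in range(last, n - 1):
            to_ultimate *= factors[j]
        ultimates.append(row[-1] * to_ultimate)
    return factors, ultimates


# Hypothetical 3x3 cumulative triangle.
triangle = [
    [100.0, 150.0, 165.0],
    [110.0, 165.0],
    [120.0],
]
factors, ultimates = chain_ladder(triangle)
print(factors)
print(ultimates)
```

On this toy triangle the development factors come out at about 1.5 and 1.1, and the projected ultimates at about 165, 181.5, and 198.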

2602.14981 2026-02-19 stat.ME math.ST stat.CO stat.ML stat.TH

Block Empirical Likelihood Inference for Longitudinal Generalized Partially Linear Single-Index Models

Tianni Zhang, Yuyao Wang, Yu Lu, Mengfei Ran

Generalized partially linear single-index models (GPLSIMs) provide a flexible and interpretable semiparametric framework for longitudinal outcomes by combining a low-dimensional parametric component with a nonparametric index component. For repeated measurements, valid inference is challenging because within-subject correlation induces nuisance parameters and variance estimation can be unstable in semiparametric settings. We propose a profile estimating-equation approach based on spline approximation of the unknown link function and construct a subject-level block empirical likelihood (BEL) for joint inference on the parametric coefficients and the single-index direction. The resulting BEL ratio statistic enjoys a Wilks-type chi-square limit, yielding likelihood-free confidence regions without explicit sandwich variance estimation. We also discuss practical implementation, including constrained optimization for the index parameter, working-correlation choices, and bootstrap-based confidence bands for the nonparametric component. Simulation studies and an application to the epilepsy longitudinal study illustrate the finite-sample performance.

2602.10749 2026-02-19 stat.AP

The Dataset of Daily Air Quality for the Years 2013-2023 in Italy

Alessandro Fusta Moro, Alessandro Fassò, Jacopo Rodeschini

Air quality and climate are major issues in Italian society and lie at the intersection of many research fields, including public health and policy planning. There is an increasing need for readily available, easily accessible, ready-to-use and well-documented datasets on air quality and climate. In this paper, we present the GRINS AQCLIM dataset, created under the GRINS project framework and covering the Italian domain for an extensive time period. It includes daily statistics (e.g., minimum, quartiles, mean, median and maximum) for a collection of air pollutant concentrations and climate variables at the locations of the 700+ available monitoring stations. Input data are retrieved from the European Environment Agency and the Copernicus Programme and were subjected to multiple processing steps to ensure their reliability and quality. These steps include automatic procedures for fixing raw files, manual inspection of station information, the detection and removal of anomalies, and temporal harmonisation on a daily basis. Datasets are hosted on Zenodo under open-access principles.

2602.05298 2026-02-19 stat.ML cs.LG math.OC

Logarithmic-time Schedules for Scaling Language Models with Momentum

Damien Ferbach, Courtney Paquette, Gauthier Gidel, Katie Everett, Elliot Paquette

In practice, the hyperparameters $(β_1, β_2)$ and weight-decay $λ$ in AdamW are typically kept at fixed values. Is there any reason to do otherwise? We show that for large-scale language model training, the answer is yes: by exploiting the power-law structure of language data, one can design time-varying schedules for $(β_1, β_2, λ)$ that deliver substantial performance gains. We study logarithmic-time scheduling, in which the optimizer's gradient memory horizon grows with training time. Although naive variants of this are unstable, we show that suitable damping mechanisms restore stability while preserving the benefits of longer memory. Based on this, we present ADANA, an AdamW-like optimizer that couples log-time schedules with explicit damping to balance stability and performance. We empirically evaluate ADANA across transformer scalings (45M to 2.6B parameters), comparing against AdamW, Muon, and AdEMAMix. When properly tuned, ADANA achieves up to 40% compute efficiency relative to a tuned AdamW, with gains that persist, and even improve, as model scale increases. We further show that similar benefits arise when applying logarithmic-time scheduling to AdEMAMix, and that logarithmic-time weight-decay alone can yield significant improvements. Finally, we present variants of ADANA that mitigate potential failure modes and improve robustness.

2601.21106 2026-02-19 stat.ME

Scalable Dirichlet Process Mixture Models with Unknown Concentration and Adaptive Covariance for High-Dimensional Clustering Applied to Leukemia Transcriptomics

Annesh Pal, Aguirre Mimoun, Rodolphe Thiébaut, Boris P. Hejblum

Comments 22 pages with 5 figures and 1 table

We propose a novel method that performs adaptive clustering with Dirichlet process mixture models (DPMMs) using collapsed variational inference (VI), while incorporating weakly-informative priors for the DP concentration parameter $\alpha$ and the base distribution $G_0$. We illustrate the importance of the $G_0$ covariance structure and prior choice by considering different parameterisations of the data covariance matrix. On high-dimensional Gaussian simulations, our model demonstrates substantially faster convergence than a state-of-the-art MCMC slice sampler. We further evaluate performance on Negative Binomial simulations and conduct sensitivity analyses to assess robustness under realistic data conditions. Application to a publicly available leukemia transcriptomic data set comprising 72 samples and 2,194 gene expressions successfully recovers every known sub-type, while also identifying additional gene expression-based sub-clusters with meaningful biological interpretation.

2512.09530 2026-02-19 stat.ML cs.LG

Transformers for Tabular Data: A Training Perspective of Self-Attention via Optimal Transport

Alessandro Quadrio, Antonio Candelieri

This thesis examines self-attention training through the lens of Optimal Transport (OT) and develops an OT-based alternative for tabular classification. The study tracks intermediate projections of the self-attention layer during training and evaluates their evolution using discrete OT metrics, including Wasserstein distance, Monge gap, optimality, and efficiency. Experiments are conducted on classification tasks with two and three classes, as well as on a biomedical dataset. Results indicate that the final self-attention mapping often approximates the OT optimal coupling, yet the training trajectory remains inefficient. Pretraining the MLP section on synthetic data partially improves convergence but is sensitive to their initialization. To address these limitations, an OT-based algorithm is introduced: it generates class-specific dummy Gaussian distributions, computes an OT alignment with the data, and trains an MLP to generalize this mapping. The method achieves accuracy comparable to Transformers while reducing computational cost and scaling more efficiently under standardized inputs, though its performance depends on careful dummy-geometry design. All experiments and implementations are conducted in R.

2510.16161 2026-02-19 cs.LG stat.ML

Still Competitive: Revisiting Recurrent Models for Irregular Time Series Prediction

Ankitkumar Joshi, Milos Hauskrecht

Comments Published in Transactions on Machine Learning Research, 2026

Modeling irregularly sampled multivariate time series is a persistent challenge in domains like healthcare and sensor networks. While recent works have explored a variety of complex learning architectures to solve the prediction problems for irregularly sampled time series, it remains unclear what the true benefits of some of these architectures are, and whether clever modifications of simpler and more efficient RNN-based algorithms are still competitive, i.e., on par with or even superior to these methods. In this work, we propose and study GRUwE (Gated Recurrent Unit with Exponential basis functions), which builds upon RNN-based architectures for observations made at irregular times. GRUwE supports both regression-based and event-based predictions in continuous time. GRUwE works by maintaining a Markov state representation of the time series that updates with the arrival of irregular observations. The Markov state update relies on two reset mechanisms: (i) an observation-triggered reset to account for the new observation, and (ii) a time-triggered reset that relies on learnable exponential decays, to support predictions in continuous time. Our empirical evaluations across several real-world benchmarks on next-observation and next-event prediction tasks demonstrate that GRUwE can indeed achieve competitive or superior performance compared to recent state-of-the-art (SOTA) methods. Thanks to its simplicity, GRUwE offers compelling advantages: it is easy to implement, requires minimal hyper-parameter tuning effort, and significantly reduces the computational overhead in online deployment.
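
The idea of a time-triggered exponential decay can be illustrated with a small sketch. This shows one common form, an elementwise decay $h_k \mapsto h_k e^{-\lambda_k \Delta t}$ with per-dimension rates, and is an assumption made for illustration; the abstract does not give the actual GRUwE update equations. The names `decay_state`, `state`, and `rates` are hypothetical, and the rates that would be learned in a real model are fixed constants here.

```python
import math


def decay_state(state, rates, dt):
    """Decay each hidden-state coordinate exponentially over a time gap dt.

    state: hidden state values
    rates: non-negative decay rates, one per coordinate (learnable in a
           real model; fixed constants in this sketch)
    dt:    elapsed time since the last update
    """
    return [h * math.exp(-r * dt) for h, r in zip(state, rates)]


# Hypothetical two-dimensional state with one slow and one fast coordinate.
state = [1.0, 1.0]
rates = [0.1, 2.0]

# Decaying in two steps matches decaying once over the total gap, so the
# state at any query time depends only on the last state and elapsed time.
two_steps = decay_state(decay_state(state, rates, 0.3), rates, 0.7)
one_step = decay_state(state, rates, 1.0)
print(two_steps)
print(one_step)
```

Because exponential decay composes over time gaps, the state can be evaluated at arbitrary query times between observations, which is what makes this kind of update convenient for continuous-time prediction.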

2510.08102 2026-02-19 cs.CL cs.AI cs.LG stat.ML

Lossless Vocabulary Reduction for Auto-Regressive Language Models

Daiki Chijiwa, Taku Hasegawa, Kyosuke Nishida, Shin'ya Yamaguchi, Tomoya Ohba, Tamao Sakao, Susumu Takeuchi

Comments The Fourteenth International Conference on Learning Representations (ICLR 2026)

Abstract

Tokenization -- the process of decomposing a given text into a sequence of subwords called tokens -- is one of the key components in the development of language models. In particular, auto-regressive language models generate texts token by token, i.e., by predicting the next-token distribution given the previous ones, and thus tokenization directly affects their efficiency in text generation. Since each language model has its own vocabulary as a set of possible tokens, models struggle to cooperate with each other at the level of next-token distributions, for example in model ensembles. In this paper, we establish a theoretical framework of lossless vocabulary reduction, which efficiently converts a given auto-regressive language model into one with an arbitrarily small vocabulary without any loss in accuracy. This framework allows language models with different tokenization to cooperate with each other efficiently by reduction to their maximal common vocabulary. Specifically, we empirically demonstrate its applicability to model ensemble with different tokenization.

2509.18406 2026-02-19 stat.ME

A constrained iteratively-reweighted least-squares framework for generalised linear models

Pierre Masselot, Devon Nenon, Jacopo Vanoli, Zaid Chalabi, Antonio Gasparrini

Comments Submitted for peer-reviewed publication. V3 changes: (i) Introduction and abstract have been reworked, (ii) improvement in the evaluation of degrees of freedom formulae and (iii) modification of the first application (global warming)

Abstract

Many applications of generalised linear models (GLMs) can be improved by applying constraints that impose assumptions on the associations or improve consistency of the estimators. Yet, there are still barriers to the implementation and practical application of constrained GLMs. We present a general framework for fitting GLMs subject to linear constraints on the coefficients that offers original and interesting features. First, estimation is performed using constrained iteratively-reweighted least-squares (CIRLS), offering fast and stable algorithms with excellent convergence performance. Second, the development includes advanced inferential procedures based on truncated multivariate normal distribution and corrected degrees of freedom that account for the constrained nature of the estimation problem. Extensive simulation studies indicate good inferential and computational properties, even in the case of slightly overconstrained models. Third, the proposed methods are fully implemented in the 'cirls' library for the R software, embedding constrained estimation in standard regression routines with simple usage and syntax. Two real-data case studies provide examples of applications for constrained dose-response estimation and compositional data analysis. The CIRLS framework and software offer a unified approach for various constrained estimation problems across a wide range of research areas.

2508.12926 2026-02-19 math.ST math.PR stat.ML stat.TH

On the distance between mean and geometric median in high dimensions

Richard Schwank, Mathias Drton

Comments Background section added and proofs shortened

Abstract

The geometric median, a notion of center for multivariate distributions, has gained recent attention in robust statistics and machine learning. Although conceptually distinct from the mean (i.e., expectation), we demonstrate that both are very close in high dimensions when the dependence between the distribution components is suitably controlled. Concretely, we find an upper bound on the distance that vanishes with the dimension asymptotically, and derive a rate-matching first order expansion of the geometric median components. Simulations illustrate and confirm our results.
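The closeness of the two centers can be probed numerically. The following is a minimal sketch, not the authors' simulation code: it computes a sample geometric median with the classical Weiszfeld iteration (a swapped-in, standard algorithm that the abstract does not mention) and compares it with the sample mean. The function name `geometric_median`, the iteration limits, and the exponential test distribution are assumptions made for illustration.

```python
import numpy as np

def geometric_median(points, n_iter=200, tol=1e-10):
    """Approximate the geometric median of the rows of `points` (Weiszfeld iteration)."""
    points = np.asarray(points, dtype=float)
    x = points.mean(axis=0)  # start from the sample mean
    for _ in range(n_iter):
        dist = np.linalg.norm(points - x, axis=1)
        dist = np.maximum(dist, 1e-12)  # guard against division by zero
        weights = 1.0 / dist
        x_new = (weights[:, None] * points).sum(axis=0) / weights.sum()
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Illustrative use: skewed data with independent coordinates (an assumed setup).
rng = np.random.default_rng(0)
sample = rng.exponential(size=(500, 50))
gap = np.linalg.norm(sample.mean(axis=0) - geometric_median(sample))
```

Repeating the last three lines for increasing dimension gives a rough empirical view of how the gap behaves; the theoretical statement in the abstract concerns the population quantities, so this only illustrates a sample analogue.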

2508.02922 2026-02-19 stat.ME

A multi-stage Bayesian approach to fit spatial point process models

Rachael Ren, Mevin B. Hooten, Toryn L. J. Schafer, Nicholas M. Calzada, Benjamin Hoose, Jamie N. Womble, Scott Gende

Comments 51 pages, 24 figures

Abstract

Spatial point process (SPP) models are commonly used to analyze point pattern data in many fields, including presence-only data in ecology. Existing exact Bayesian methods for fitting these models are computationally expensive because they require approximating an intractable integral each time parameters are updated and often involve algorithm supervision (i.e., tuning in the Bayesian setting). We propose a flexible, efficient, and exact multi-stage recursive Bayesian approach to fitting SPP models that leverages parallel computing resources to obtain realizations from the joint posterior, which can then be used to obtain inference on derived quantities. We outline potential extensions, including a framework for analyzing study designs with compact observation windows and a neural network basis expansion for increased model flexibility. We demonstrate this approach and its extensions using a simulation study and analyze data from aerial imagery surveys to improve our understanding of spatially explicit abundance of harbor seal (Phoca vitulina) pups in Johns Hopkins Inlet, a protected tidewater glacial fjord in Glacier Bay National Park, Alaska.

2507.04033 2026-02-19 cs.LG cs.CY math.OC stat.ML

Benchmarking Stochastic Approximation Algorithms for Fairness-Constrained Training of Deep Neural Networks

Andrii Kliachkin, Jana Lepšová, Gilles Bareilles, Jakub Mareček

Journal ref
14th International Conference on Learning Representations, 2026
Abstract

The ability to train Deep Neural Networks (DNNs) with constraints is instrumental in improving the fairness of modern machine-learning models. Many algorithms have been analysed in recent years, and yet there is no standard, widely accepted method for the constrained training of DNNs. In this paper, we provide a challenging benchmark of real-world large-scale fairness-constrained learning tasks, built on top of the US Census (Folktables). We point out the theoretical challenges of such tasks and review the main approaches in stochastic approximation algorithms. Finally, we demonstrate the use of the benchmark by implementing and comparing three recently proposed, but as-of-yet unimplemented, algorithms both in terms of optimization performance, and fairness improvement. We release the code of the benchmark as a Python package at https://github.com/humancompatible/train.

2506.17773 2026-02-19 stat.ME

Selection of functional predictors and smooth coefficient estimation for scalar-on-function regression models

Hedayat Fathi, Marzia A. Cremona, Federico Severino

Comments 46 pages, 6 figures

Abstract

In the framework of scalar-on-function regression models, in which several functional variables are employed to predict a scalar response, we propose a methodology for selecting relevant functional predictors while simultaneously providing accurate smooth (or, more generally, regular) estimates of the functional coefficients. We suppose that the functional predictors belong to a real separable Hilbert space, while the functional coefficients belong to a specific subspace of this Hilbert space. Such a subspace can be a Reproducing Kernel Hilbert Space (RKHS) to ensure the desired regularity characteristics, such as smoothness or periodicity, for the coefficient estimates. Our procedure, called SOFIA (Scalar-On-Function Integrated Adaptive Lasso), is based on an adaptive penalized least squares algorithm that leverages functional subgradients to efficiently solve the minimization problem. We demonstrate that the proposed method satisfies the functional oracle property, even when the number of predictors exceeds the sample size. SOFIA's effectiveness in variable selection and coefficient estimation is evaluated through extensive simulation studies and a real-data application to GDP growth prediction.

2412.06004 2026-02-19 math.ST math.PR q-bio.PE stat.CO stat.TH

Large-sample analysis of cost functionals for inference under the coalescent

Martina Favero, Jere Koskela

Comments 34 pages, 7 figures

Journal ref
Stochastic Processes and their Applications 195 (2026) 104894
Abstract

The coalescent is a foundational model of latent genealogical trees under neutral evolution, but suffers from intractable sampling probabilities. Methods for approximating these sampling probabilities either introduce bias or fail to scale to large sample sizes. We show that a class of cost functionals of the coalescent with recurrent mutation and a finite number of alleles converge to tractable processes in the infinite-sample limit. A particular choice of costs yields insight about importance sampling methods, which are a classical tool for coalescent sampling probability approximation. These insights reveal that the behaviour of coalescent importance sampling algorithms differs markedly from standard sequential importance samplers, with or without resampling. We conduct a simulation study to verify that our asymptotics are accurate for algorithms with finite (and moderate) sample sizes. Our results constitute the first theoretical description of large-sample importance sampling algorithms for the coalescent, provide heuristics for the a priori optimisation of computational effort, and identify settings where resampling is harmful for algorithm performance. We observe strikingly different behaviour for importance sampling methods under the infinite sites model of mutation, which is regarded as a good and more tractable approximation of finite alleles mutation in most respects.

2409.12019 2026-02-19 math.ST stat.TH

Asymptotics for conformal inference

Ulysse Gazin

Comments 39 pages, 3 figures, 2 tables

Abstract

Conformal inference is a versatile tool for building prediction sets in regression or classification. We study the false coverage proportion (FCP) in a simultaneous inference setting with a calibration sample of $n$ points and a test sample of $m$ points. We identify the exact, distribution-free, asymptotic distribution of the FCP when both $n$ and $m$ tend to infinity. This shows in particular that FCP control can be achieved by using the well-known Kolmogorov distribution, and puts forward that the asymptotic variance is decreasing in the ratio $n/m$. We then provide a number of extensions by considering the problems of novelty detection, weighted conformal inference or distribution shift between the calibration sample and the test sample. In particular, our asymptotic results allow us to accurately quantify the asymptotic behavior of the errors (a miscovering interval or declaring a false novelty) when weighted conformal inference is used.

2602.16352 2026-02-19 stat.ML cs.CY cs.LG

Machine Learning in Epidemiology

Marvin N. Wright, Lukas Burk, Pegah Golchian, Jan Kapar, Niklas Koenen, Sophie Hanna Langbein

Journal ref
In: Ahrens, W., Pigeot, I. (Eds.) Handbook of Epidemiology. Springer, New York (2025)
Abstract

In the age of digital epidemiology, epidemiologists are faced with an increasing amount of data of growing complexity and dimensionality. Machine learning is a set of powerful tools that can help to analyze such enormous amounts of data. This chapter lays the methodological foundations for successfully applying machine learning in epidemiology. It covers the principles of supervised and unsupervised learning and discusses the most important machine learning methods. Strategies for model evaluation and hyperparameter optimization are developed and interpretable machine learning is introduced. All these theoretical parts are accompanied by code examples in R, where an example dataset on heart disease is used throughout the chapter.

2602.16328 2026-02-19 stat.ME

A general framework for modeling Gaussian process with qualitative and quantitative factors

Linsui Deng, C. F. Jeff Wu

Abstract

Computer experiments involving both qualitative and quantitative (QQ) factors have attracted increasing attention. Gaussian process (GP) models have proven effective in this context by choosing specialized covariance functions for QQ factors. In this work, we extend the latent variable-based GP approach, which maps qualitative factors into a continuous latent space, by establishing a general framework to apply standard kernel functions to continuous latent variables. This approach provides a novel perspective for interpreting some existing GP models for QQ factors and introduces new covariance structures in some situations. The ordinal structure can be incorporated naturally and seamlessly in this framework. Furthermore, the Bayesian information criterion and leave-one-out cross-validation are employed for model selection and model averaging. The performance of the proposed method is comprehensively studied on several examples.

2602.16310 2026-02-19 stat.ME econ.EM math.ST stat.AP stat.TH

Introducing the b-value: combining unbiased and biased estimators from a sensitivity analysis perspective

Zhexiao Lin, Peter J. Bickel, Peng Ding

Comments 53 pages

Abstract

In empirical research, when we have multiple estimators for the same parameter of interest, a central question arises: how do we combine unbiased but less precise estimators with biased but more precise ones to improve the inference? Under this setting, the point estimation problem has attracted considerable attention. In this paper, we focus on a less studied inference question: how can we conduct valid statistical inference in such settings with unknown bias? We propose a strategy to combine unbiased and biased estimators from a sensitivity analysis perspective. We derive a sequence of confidence intervals indexed by the magnitude of the bias, which enable researchers to assess how conclusions vary with the bias levels. Importantly, we introduce the notion of the b-value, a critical value of the unknown maximum relative bias at which combining estimators does not yield a significant result. We apply this strategy to three canonical combined estimators: the precision-weighted estimator, the pretest estimator, and the soft-thresholding estimator. For each estimator, we characterize the sequence of confidence intervals and determine the bias threshold at which the conclusion changes. Based on the theory, we recommend reporting the b-value based on the soft-thresholding estimator and its associated confidence intervals, which are robust to unknown bias and achieve the lowest worst-case risk among the alternatives.

2602.16283 2026-02-19 math.ST stat.OT stat.TH

Orthogonal parametrisations of Extreme-Value distributions

Nathan Huet, Ilaria Prosdocimi

Abstract

Extreme value distributions are routinely employed to assess risks connected to extreme events in a large number of applications. They are typically two- or three-parameter distributions: the inference can be unstable, which is particularly problematic given that these distributions are often fitted to small samples. Furthermore, the distribution's parameters are generally not directly interpretable and not the key aim of the estimation. We present several orthogonal reparametrisations of the main extreme-value distributions, key in the modelling of rare events. In particular, we apply the theory developed in Cox and Reid (1987) to the Generalised Extreme-Value, Generalised Pareto, and Gumbel distributions. We illustrate the principal advantage of these reparametrisations in a simulation study.

2602.16259 2026-02-19 math.ST stat.CO stat.ME stat.TH

HAL-MLE Log-Splines Density Estimation (Part I: Univariate)

Yilong Hou, Zhengpu Zhao, Yi Li, Mark van der Laan

Comments 75 pages

Abstract

We study nonparametric maximum likelihood estimation of probability densities under a total variation (TV) type penalty, the sectional variation norm (also known as Hardy-Krause variation). TV regularization has a long history in regression and density estimation, including results on $L^2$ and KL divergence convergence rates. Here, we revisit this task using the Highly Adaptive Lasso (HAL) framework. We formulate a HAL-based maximum likelihood estimator (HAL-MLE) using the log-spline link function from \citet{kooperberg1992logspline}, and show that in the univariate setting the bounded sectional variation norm assumption underlying HAL coincides with the classical bounded TV assumption. This equivalence directly connects HAL-MLE to existing TV-penalized approaches such as local adaptive splines \citep{mammen1997locally}. We establish three new theoretical results: (i) the univariate HAL-MLE is asymptotically linear, (ii) it admits pointwise asymptotic normality, and (iii) it achieves uniform convergence at rate $n^{-(k+1)/(2k+3)}$ up to logarithmic factors for the smoothness order $k \geq 1$. These results extend existing results from \citet{van2017uniform}, which previously guaranteed only uniform consistency without rates when $k=0$. We will include the uniform convergence for general dimension $d$ in the follow-up work of this paper. The intention of this paper is to provide a unified framework for the TV-penalized density estimation methods, and to connect the HAL-MLE to the existing TV-penalized methods in the univariate case, even though the general HAL-MLE is defined for multivariate cases.

2602.16223 2026-02-19 math.ST math.PR stat.TH

Nonparametric estimation of linear multiplier for processes driven by a Hermite process

B. L. S. Prakasa Rao

Abstract

We study the problem of nonparametric estimation of the linear multiplier function $\theta(t)$ for processes satisfying stochastic differential equations of the type $$dX_t=\theta(t) X_t\,dt+ \varepsilon\, dZ^{q,H}_t, \quad X_0=x_0, \quad 0\leq t \leq T,$$ where $\{Z^{q,H}_t, t \geq 0\}$ is a Hermite process with known order $q$ and known self-similarity parameter $H \in (\frac{1}{2},1)$. We investigate the asymptotic behaviour of the estimator of the unknown function $\theta(t)$ as $\varepsilon\rightarrow 0$.

2602.16218 2026-02-19 cs.LG cs.NA math.NA stat.ML

Bayesian Quadrature: Gaussian Processes for Integration

Maren Mahsereci, Toni Karvonen

Abstract

Bayesian quadrature is a probabilistic, model-based approach to numerical integration, the estimation of intractable integrals, or expectations. Although Bayesian quadrature was popularised already in the 1980s, no systematic and comprehensive treatment has been published. The purpose of this survey is to fill this gap. We review the mathematical foundations of Bayesian quadrature from different points of view; present a systematic taxonomy for classifying different Bayesian quadrature methods along the three axes of modelling, inference, and sampling; collect general theoretical guarantees; and provide a controlled numerical study that explores and illustrates the effect of different choices along the axes of the taxonomy. We also provide a realistic assessment of practical challenges and limitations to application of Bayesian quadrature methods and include an up-to-date and nearly exhaustive bibliography that covers not only machine learning and statistics literature but all areas of mathematics and engineering in which Bayesian quadrature or equivalent methods have seen use.

2602.16183 2026-02-19 cs.GT cs.LG stat.ML

Multi-Agent Combinatorial-Multi-Armed-Bandit framework for the Submodular Welfare Problem under Bandit Feedback

Subham Pokhriyal, Shweta Jain, Vaneet Aggarwal

Abstract

We study the \emph{Submodular Welfare Problem} (SWP), where items are partitioned among agents with monotone submodular utilities to maximize the total welfare under \emph{bandit feedback}. Classical SWP assumes full value-oracle access, achieving $(1-1/e)$ approximations via continuous-greedy algorithms. We extend this to a \emph{multi-agent combinatorial bandit} framework (\textsc{MA-CMAB}), where actions are partitions under full-bandit feedback with non-communicating agents. Unlike prior single-agent or separable multi-agent CMAB models, our setting couples agents through shared allocation constraints. We propose an explore-then-commit strategy with randomized assignments, achieving $\tilde{\mathcal{O}}(T^{2/3})$ regret against a $(1-1/e)$ benchmark, the first such guarantee for partition-based submodular welfare problem under bandit feedback.

2602.16137 2026-02-19 stat.ME

Experimental Assortments for Choice Estimation and Nest Identification

Xintong Yu, Will Ma, Michael Zhao

Abstract

What assortments (subsets of items) should be offered, to collect data for estimating a choice model over $n$ total items? We propose a structured, non-adaptive experiment design requiring only $O(\log n)$ distinct assortments, each offered repeatedly, that consistently outperforms randomized and other heuristic designs across an extensive numerical benchmark that estimates multiple different choice models under a variety of (possibly mis-specified) ground truths. We then focus on Nested Logit choice models, which cluster items into "nests" of close substitutes. Whereas existing Nested Logit estimation procedures assume the nests to be known and fixed, we present a new algorithm to identify nests based on collected data, which when used in conjunction with our experiment design, guarantees correct identification of nests under any Nested Logit ground truth. Our experiment design was deployed to collect data from over 70 million users at Dream11, an Indian fantasy sports platform that offers different types of betting contests, with rich substitution patterns between them. We identify nests based on the collected data, which lead to better out-of-sample choice prediction than ex-ante clustering from contest features. Our identified nests are ex-post justifiable to Dream11 management.

2602.16131 2026-02-19 stat.ML cs.LG

Empirical Cumulative Distribution Function Clustering for LLM-based Agent System Analysis

Chihiro Watanabe, Jingyu Sun

Abstract

Large language models (LLMs) are increasingly used as agents to solve complex tasks such as question answering (QA), scientific debate, and software development. A standard evaluation procedure aggregates multiple responses from LLM agents into a single final answer, often via majority voting, and compares it against reference answers. However, this process can obscure the quality and distributional characteristics of the original responses. In this paper, we propose a novel evaluation framework based on the empirical cumulative distribution function (ECDF) of cosine similarities between generated responses and reference answers. This enables a more nuanced assessment of response quality beyond exact match metrics. To analyze the response distributions across different agent configurations, we further introduce a clustering method for ECDFs using their distances and the $k$-medoids algorithm. Our experiments on a QA dataset demonstrate that ECDFs can distinguish between agent settings with similar final accuracies but different quality distributions. The clustering analysis also reveals interpretable group structures in the responses, offering insights into the impact of temperature, persona, and question topics.
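The ECDF step described above can be sketched as follows. This is only an illustration under stated assumptions: responses and the reference answer are taken to be already embedded as numeric vectors, and the distance between two ECDFs is the mean absolute gap on a fixed grid, which is one possible choice and not necessarily the distance used in the paper. The function names are hypothetical.

```python
import numpy as np

def cosine_similarities(responses, reference):
    """Cosine similarity of each response embedding (row) with the reference embedding."""
    responses = np.asarray(responses, dtype=float)
    reference = np.asarray(reference, dtype=float)
    numerator = responses @ reference
    denominator = np.linalg.norm(responses, axis=1) * np.linalg.norm(reference)
    return numerator / denominator

def ecdf_on_grid(values, grid):
    """Empirical CDF of `values`, evaluated at each point of `grid`."""
    values = np.sort(np.asarray(values, dtype=float))
    return np.searchsorted(values, grid, side="right") / values.size

def ecdf_distance(values_a, values_b, grid):
    """Mean absolute gap between two ECDFs on `grid` (an assumed distance)."""
    gap = ecdf_on_grid(values_a, grid) - ecdf_on_grid(values_b, grid)
    return float(np.mean(np.abs(gap)))
```

A pairwise matrix of `ecdf_distance` values over agent configurations could then be passed to any $k$-medoids implementation that accepts precomputed distances.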

2602.16120 2026-02-19 cs.LG stat.AP stat.ML

Feature-based morphological analysis of shape graph data

Murad Hossen, Demetrio Labate, Nicolas Charon

Abstract

This paper introduces and demonstrates a computational pipeline for the statistical analysis of shape graph datasets, namely geometric networks embedded in 2D or 3D spaces. Unlike traditional abstract graphs, our purpose is not only to retrieve and distinguish variations in the connectivity structure of the data but also geometric differences of the network branches. Our proposed approach relies on the extraction of a specifically curated and explicit set of topological, geometric and directional features, designed to satisfy key invariance properties. We leverage the resulting feature representation for tasks such as group comparison, clustering and classification on cohorts of shape graphs. The effectiveness of this representation is evaluated on several real-world datasets including urban road/street networks, neuronal traces and astrocyte imaging. These results are benchmarked against several alternative methods, both feature-based and not.

2602.16111 2026-02-19 stat.AP cs.AI

Surrogate-Based Prevalence Measurement for Large-Scale A/B Testing

Zehao Xu, Tony Paek, Kevin O'Sullivan, Attila Dobi

Abstract

Online media platforms often need to measure how frequently users are exposed to specific content attributes in order to evaluate trade-offs in A/B experiments. A direct approach is to sample content, label it using a high-quality rubric (e.g., an expert-reviewed LLM prompt), and estimate impression-weighted prevalence. However, repeatedly running such labeling for every experiment arm and segment is too costly and slow to serve as a default measurement at scale. We present a scalable \emph{surrogate-based prevalence measurement} framework that decouples expensive labeling from per-experiment evaluation. The framework calibrates a surrogate signal to reference labels offline and then uses only impression logs to estimate prevalence for arbitrary experiment arms and segments. We instantiate this framework using \emph{score bucketing} as the surrogate: we discretize a model score into buckets, estimate bucket-level prevalences from an offline labeled sample, and combine these calibrated bucket level prevalences with the bucket distribution of impressions in each arm to obtain fast, log-based estimates. Across multiple large-scale A/B tests, we validate that the surrogate estimates closely match the reference estimates for both arm-level prevalence and treatment--control deltas. This enables scalable, low-latency prevalence measurement in experimentation without requiring per-experiment labeling jobs.
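The score-bucketing estimator described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the bucket edges, and the toy numbers are assumptions, and labels are taken to be binary (1 if the content has the attribute).

```python
import numpy as np

def bucket_prevalence(scores, labels, edges):
    """Offline calibration: share of positive labels in each score bucket."""
    idx = np.digitize(scores, edges)          # bucket index in 0..len(edges)
    n_buckets = len(edges) + 1
    positives = np.bincount(idx, weights=labels, minlength=n_buckets)
    counts = np.bincount(idx, minlength=n_buckets)
    # Buckets with no labeled items stay NaN so they cannot be used silently.
    return np.divide(positives, counts,
                     out=np.full(n_buckets, np.nan), where=counts > 0)

def arm_prevalence(arm_scores, arm_impressions, edges, bucket_prev):
    """Log-based estimate: impression-weighted mix of calibrated bucket prevalences."""
    idx = np.digitize(arm_scores, edges)
    n_buckets = len(edges) + 1
    impressions = np.bincount(idx, weights=arm_impressions, minlength=n_buckets)
    share = impressions / impressions.sum()
    used = impressions > 0                     # only buckets the arm actually served
    return float(np.dot(share[used], bucket_prev[used]))

# Toy example with a single bucket boundary at 0.5 (all numbers are made up).
edges = [0.5]
calibrated = bucket_prevalence(
    scores=[0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9],
    labels=[0, 0, 0, 1, 1, 1, 0, 1],
    edges=edges,
)
estimate = arm_prevalence([0.2, 0.8], [300, 100], edges, calibrated)
```

In the toy example the calibrated bucket prevalences are 0.25 and 0.75, the arm's impressions split 75% / 25% across the two buckets, and the estimate is 0.75 × 0.25 + 0.25 × 0.75 = 0.375. A treatment-control delta would be the difference of two such estimates computed with the same calibration.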

2602.16099 2026-02-19 stat.CO stat.ME stat.ML

Quantifying and Attributing Submodel Uncertainty in Stochastic Simulation Models and Digital Twins

Mohammadmahdi Ghasemloo, David J. Eckman, Yaxian Li

Abstract

Stochastic simulation is widely used to study complex systems composed of various interconnected subprocesses, such as input processes, routing and control logic, optimization routines, and data-driven decision modules. In practice, these subprocesses may be inherently unknown or too computationally intensive to directly embed in the simulation model. Replacing these elements with estimated or learned approximations introduces a form of epistemic uncertainty that we refer to as submodel uncertainty. This paper investigates how submodel uncertainty affects the estimation of system performance metrics. We develop a framework for quantifying submodel uncertainty in stochastic simulation models and extend the framework to digital-twin settings, where simulation experiments are repeatedly conducted with the model initialized from observed system states. Building on approaches from input uncertainty analysis, we leverage bootstrapping and Bayesian model averaging to construct quantile-based confidence or credible intervals for key performance indicators. We propose a tree-based method that decomposes total output variability and attributes uncertainty to individual submodels in the form of importance scores. The proposed framework is model-agnostic and accommodates both parametric and nonparametric submodels under frequentist and Bayesian modeling paradigms. A synthetic numerical experiment and a more realistic digital-twin simulation of a contact center illustrate the importance of understanding how and how much individual submodels contribute to overall uncertainty.

2602.16065 2026-02-19 cs.LG cs.AI math.ST stat.ML stat.TH

Can Generative Artificial Intelligence Survive Data Contamination? Theoretical Guarantees under Contaminated Recursive Training

Kevin Wang, Hongqian Niu, Didong Li

Abstract

Generative Artificial Intelligence (AI), such as large language models (LLMs), has become a transformative force across science, industry, and society. As these systems grow in popularity, web data becomes increasingly interwoven with this AI-generated material and it is increasingly difficult to separate them from naturally generated content. As generative models are updated regularly, later models will inevitably be trained on mixtures of human-generated data and AI-generated data from earlier versions, creating a recursive training process with data contamination. Existing theoretical work has examined only highly simplified settings, where both the real data and the generative model are discrete or Gaussian, where it has been shown that such recursive training leads to model collapse. However, real data distributions are far more complex, and modern generative models are far more flexible than Gaussian and linear mechanisms. To fill this gap, we study recursive training in a general framework with minimal assumptions on the real data distribution and allow the underlying generative model to be a general universal approximator. In this framework, we show that contaminated recursive training still converges, with a convergence rate equal to the minimum of the baseline model's convergence rate and the fraction of real data used in each iteration. To the best of our knowledge, this is the first (positive) theoretical result on recursive training without distributional assumptions on the data. We further extend the analysis to settings where sampling bias is present in data collection and support all theoretical results with empirical studies.

2602.16063 2026-02-19 eess.SY cs.CE cs.ET cs.LG cs.SY stat.CO

MARLEM: A Multi-Agent Reinforcement Learning Simulation Framework for Implicit Cooperation in Decentralized Local Energy Markets

Nelson Salazar-Pena, Alejandra Tabares, Andres Gonzalez-Mancera

Comments 32 pages, 7 figures, 1 table, 1 algorithm

Abstract

This paper introduces a novel, open-source MARL simulation framework for studying implicit cooperation in LEMs, modeled as a decentralized partially observable Markov decision process and implemented as a Gymnasium environment for MARL. Our framework features a modular market platform with plug-and-play clearing mechanisms, physically constrained agent models (including battery storage), a realistic grid network, and a comprehensive analytics suite to evaluate emergent coordination. The main contribution is a novel method to foster implicit cooperation, where agents' observations and rewards are enhanced with system-level key performance indicators to enable them to independently learn strategies that benefit the entire system and aim for collectively beneficial outcomes without explicit communication. Through representative case studies (available in a dedicated GitHub repository at https://github.com/salazarna/marlem), we show the framework's ability to analyze how different market configurations (such as varying storage deployment) impact system performance. This illustrates its potential to facilitate emergent coordination, improve market efficiency, and strengthen grid stability. The proposed simulation framework is a flexible, extensible, and reproducible tool for researchers and practitioners to design, test, and validate strategies for future intelligent, decentralized energy systems.

2602.16062 2026-02-19 eess.SY cs.CE cs.LG cs.MA cs.SY stat.AP

Harnessing Implicit Cooperation: A Multi-Agent Reinforcement Learning Approach Towards Decentralized Local Energy Markets

Nelson Salazar-Pena, Alejandra Tabares, Andres Gonzalez-Mancera

Comments 42 pages, 7 figures, 10 tables

Abstract

This paper proposes implicit cooperation, a framework enabling decentralized agents to approximate optimal coordination in local energy markets without explicit peer-to-peer communication. We formulate the problem as a decentralized partially observable Markov decision problem that is solved through a multi-agent reinforcement learning task in which agents use stigmergic signals (key performance indicators at the system level) to infer and react to global states. Through a 3x3 factorial design on an IEEE 34-node topology, we evaluated three training paradigms (CTCE, CTDE, DTDE) and three algorithms (PPO, APPO, SAC). Results identify APPO-DTDE as the optimal configuration, achieving a coordination score of 91.7% relative to the theoretical centralized benchmark (CTCE). However, a critical trade-off emerges between efficiency and stability: while the centralized benchmark maximizes allocative efficiency with a peer-to-peer trade ratio of 0.6, the fully decentralized approach (DTDE) demonstrates superior physical stability. Specifically, DTDE reduces the variance of grid balance by 31% compared to hybrid architectures, establishing a highly predictable, import-biased load profile that simplifies grid regulation. Furthermore, topological analysis reveals emergent spatial clustering, where decentralized agents self-organize into stable trading communities to minimize congestion penalties. While SAC excelled in hybrid settings, it failed in decentralized environments due to entropy-driven instability. This research proves that stigmergic signaling provides sufficient context for complex grid coordination, offering a robust, privacy-preserving alternative to expensive centralized communication infrastructure.

2602.16041 2026-02-19 stat.ME

Predictive Subsampling for Scalable Inference in Networks

Arpan Kumar, Minh Tang, Srijan Sengupta

Abstract

Network datasets appear across a wide range of scientific fields, including biology, physics, and the social sciences. To enable data-driven discoveries from these networks, statistical inference techniques like estimation and hypothesis testing are crucial. However, the size of modern networks often exceeds the storage and computational capacities of existing methods, making timely, statistically rigorous inference difficult. In this work, we introduce a subsampling-based approach aimed at reducing the computational burden associated with estimation and two-sample hypothesis testing. Our strategy involves selecting a small random subset of nodes from the network, conducting inference on the resulting subgraph, and then using interpolation based on the observed connections between the subsample and the rest of the nodes to estimate the entire graph. We develop the methodology under the generalized random dot product graph framework, which affords broad applicability and permits rigorous analysis. Within this setting, we establish consistency guarantees and corroborate the practical effectiveness of the approach through comprehensive simulation studies.

2602.16040 2026-02-19 stat.ME

Covariate Adjustment for Wilcoxon Two Sample Statistic and Test

Zhilan Lou, Jun Shao, Ting Ye, Tuo Wang, Yanyao Yi, Yu Du

Comments 18 pages, 0 figures, 3 tables

Abstract

We apply covariate adjustment to the Wilcoxon two sample statistic and Wilcoxon-Mann-Whitney test in comparing two treatments. The covariate adjustment through calibration not only improves efficiency in estimation/inference but also widens the application scope of the Wilcoxon two sample statistic and Wilcoxon-Mann-Whitney test to situations where covariate-adaptive randomization is used. We motivate how to adjust covariates to reduce variance, establish the asymptotic distribution of the adjusted Wilcoxon two sample statistic, and provide explicitly the guaranteed efficiency gain. The asymptotic distribution of the adjusted Wilcoxon two sample statistic is invariant to all commonly used covariate-adaptive randomization schemes so that a unified formula can be used in inference regardless of which covariate-adaptive randomization is applied.
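
For orientation, the unadjusted two-sample statistic that this abstract starts from can be sketched as the fraction of treated-control pairs in which the treated outcome is larger. This is a minimal illustration only: the function name `wilcoxon_two_sample`, the half-weight given to ties, and the example data are assumptions, and the covariate-adjusted version proposed in the paper is not reproduced here.

```python
def wilcoxon_two_sample(treated, control):
    """Unadjusted two-sample (Mann-Whitney type) statistic.

    Returns the fraction of (treated, control) pairs in which the treated
    outcome is larger, counting each tie as one half (an assumed convention).
    """
    if not treated or not control:
        raise ValueError("both samples must be non-empty")
    total = 0.0
    for y1 in treated:
        for y0 in control:
            if y1 > y0:
                total += 1.0
            elif y1 == y0:
                total += 0.5
    return total / (len(treated) * len(control))


# Hypothetical outcomes for two small treatment arms.
print(wilcoxon_two_sample([3.1, 4.0, 5.2], [1.0, 2.5, 4.0]))
```

A value near 0.5 indicates no tendency for either arm to produce larger outcomes; values near 1 favor the treated arm.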

2602.16031 2026-02-19 stat.ME stat.AP

Competing Risk Analysis in Cardiovascular Outcome Trials: A Simulation Comparison of Cox and Fine-Gray Models

Tuo Wang, Yu Du

Comments 18 pages, 6 figures

Abstract

Cardiovascular outcome trials commonly face competing risks when non-CV death prevents observation of major adverse cardiovascular events (MACE). While Cox proportional hazards models treat competing events as independent censoring, Fine-Gray subdistribution hazard models explicitly handle competing risks, targeting different estimands. This simulation study using bivariate copula models systematically varies competing event rates (0.5%-5% annually), treatment effects on competing events (50% reduction to 50% increase), and correlation structures to compare these approaches. At competing event rates typical of CV outcome trials (~1% annually), Cox and Fine-Gray produce nearly identical hazard ratio estimates regardless of correlation strength or treatment effect direction. Substantial divergence occurs only with high competing rates and directionally discordant treatment effects, though neither estimator provides unbiased estimates of true marginal hazard ratios under these conditions. In typical CV trial settings with low competing event rates, Cox models remain appropriate for primary analysis due to superior interpretability. Pre-specified Cox models should not be abandoned for competing risk methods. Importantly, Fine-Gray models do not constitute proper sensitivity analyses to Cox models per ICH E9(R1), as they target different estimands rather than testing assumptions. As a supplementary analysis, cumulative incidence using the Aalen-Johansen estimator can provide transparency about competing risk impact. Under high competing-risk scenarios, alternative approaches such as inverse probability of censoring weighting, multiple imputation, or inclusion of all-cause mortality in primary endpoints warrant consideration.

2602.15972 2026-02-19 cs.LG stat.ML

Fast Online Learning with Gaussian Prior-Driven Hierarchical Unimodal Thompson Sampling

Tianchi Zhao, He Liu, Hongyin Shi, Jinliang Li

Abstract

We study a type of Multi-Armed Bandit (MAB) problem in which arms with Gaussian reward feedback are clustered. Such an arm setting finds applications in many real-world problems, for example, mmWave communications and portfolio management with risky assets, as a result of the universality of the Gaussian distribution. Based on the Thompson Sampling with Gaussian prior (TSG) algorithm for the selection of the optimal arm, we propose our Thompson Sampling with Clustered arms under Gaussian prior (TSCG) specific to the 2-level hierarchical structure. We prove that by utilizing the 2-level structure, we can achieve a lower regret bound than we do with ordinary TSG. In addition, when the reward is unimodal, we can reach an even lower bound on the regret by our Unimodal Thompson Sampling algorithm with Clustered Arms under Gaussian prior (UTSCG). Each of our proposed algorithms is accompanied by a theoretical evaluation of the upper regret bound, and our numerical experiments confirm the advantage of our proposed algorithms.

2602.15955 2026-02-19 cs.LG stat.AP

Adaptive Semi-Supervised Training of P300 ERP-BCI Speller System with Minimum Calibration Effort

Shumeng Chen, Jane E. Huggins, Tianwen Ma

Comments 8 pages, 8 figures

Abstract

A P300 ERP-based Brain-Computer Interface (BCI) speller is an assistive communication tool. It searches for the P300 event-related potential (ERP) elicited by target stimuli, distinguishing it from the neural responses to non-target stimuli embedded in electroencephalogram (EEG) signals. Conventional methods require a lengthy calibration procedure to construct the binary classifier, which reduces overall efficiency. Thus, we proposed a unified framework with minimum calibration effort such that, given a small amount of labeled calibration data, we employed an adaptive semi-supervised EM-GMM algorithm to update the binary classifier. We evaluated our method based on character-level prediction accuracy, information transfer rate (ITR), and BCI utility. We applied calibration on training data and reported results on testing data. Our results indicate that, out of 15 participants, 9 participants exceed the minimum character-level accuracy of 0.7 using either our adaptive method or the benchmark, and 7 out of these 9 participants showed that our adaptive method performed better than the benchmark. The proposed semi-supervised learning framework provides a practical and efficient alternative to improve the overall spelling efficiency in the real-time BCI speller system, particularly in contexts with limited labeled data.
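
The abstract does not state which ITR definition is used. A common choice in the BCI literature is the Wolpaw formula for bits per selection, sketched below under that assumption. The function name `wolpaw_itr_bits` and the 36-symbol example are illustrative and not taken from the paper.

```python
import math


def wolpaw_itr_bits(n_classes, accuracy):
    """Bits per selection under the Wolpaw ITR formula (assumed definition).

    B = log2(N) + P*log2(P) + (1 - P)*log2((1 - P) / (N - 1))
    where N is the number of selectable symbols and P the selection accuracy.
    """
    if n_classes < 2:
        raise ValueError("n_classes must be at least 2")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    bits = math.log2(n_classes)
    # P*log2(P) tends to 0 as P -> 0, so that term is skipped at P = 0.
    if accuracy > 0.0:
        bits += accuracy * math.log2(accuracy)
    # The (1 - P) term tends to 0 as P -> 1, so it is skipped at P = 1.
    if accuracy < 1.0:
        bits += (1.0 - accuracy) * math.log2((1.0 - accuracy) / (n_classes - 1))
    return bits


# Hypothetical 36-symbol speller at the 0.7 accuracy threshold.
print(round(wolpaw_itr_bits(36, 0.7), 3))
```

Multiplying the result by the number of selections per minute gives bits per minute. At chance-level accuracy (P = 1/N) the formula returns zero.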

2602.15925 2026-02-19 stat.ML cs.LG

Robust Stochastic Gradient Posterior Sampling with Lattice Based Discretisation

Zier Mensch, Lars Holdijk, Samuel Duffield, Maxwell Aifer, Patrick J. Coles, Max Welling, Miranda C. N. Cheng

Abstract

Stochastic-gradient MCMC methods enable scalable Bayesian posterior sampling but often suffer from sensitivity to minibatch size and gradient noise. To address this, we propose Stochastic Gradient Lattice Random Walk (SGLRW), an extension of the Lattice Random Walk discretization. Unlike conventional Stochastic Gradient Langevin Dynamics (SGLD), SGLRW introduces stochastic noise only through the off-diagonal elements of the update covariance; this yields greater robustness to minibatch size while retaining asymptotic correctness. Furthermore, as a comparison, we analyze a natural analogue of SGLD utilizing gradient clipping. Experimental validation on Bayesian regression and classification demonstrates that SGLRW remains stable in regimes where SGLD fails, including in the presence of heavy-tailed gradient noise, and matches or improves predictive performance.

2602.15920 2026-02-19 stat.ML cs.LG eess.SP

Including Node Textual Metadata in Laplacian-constrained Gaussian Graphical Models

Jianhua Wang, Killian Cressant, Pedro Braconnot Velloso, Arnaud Breloy

Comments Submitted to EUSIPCO 2026

Abstract

This paper addresses graph learning in Gaussian Graphical Models (GGMs). In this context, data matrices often come with auxiliary metadata (e.g., textual descriptions associated with each node) that is usually ignored in traditional graph estimation processes. To fill this gap, we propose a graph learning approach based on Laplacian-constrained GGMs that jointly leverages the node signals and such metadata. The resulting formulation yields an optimization problem, for which we develop an efficient majorization-minimization (MM) algorithm with closed-form updates at each iteration. Experimental results on a real-world financial dataset demonstrate that the proposed method significantly improves graph clustering performance compared to state-of-the-art approaches that use either signals or metadata alone, thus illustrating the interest of fusing both sources of information.

2602.15916 2026-02-19 stat.ME stat.ML

Nonparametric Identification and Inference for Counterfactual Distributions with Confounding

Jianle Sun, Kun Zhang

Comments 35 pages for Main text, 22 pages for Appendices, 6 figures

Abstract

We propose nonparametric identification and semiparametric estimation of joint potential outcome distributions in the presence of confounding. First, in settings with observed confounding, we derive tighter, covariate-informed bounds on the joint distribution by leveraging conditional copulas. To overcome the non-differentiability of bounding min/max operators, we establish the asymptotic properties for both a direct estimator with a polynomial margin condition and a smooth approximation with a log-sum-exp operator, facilitating valid inference for individual-level effects under the canonical rank-preserving assumption. Second, we tackle the challenge of unmeasured confounding by introducing a causal representation learning framework. By utilizing instrumental variables, we prove the nonparametric identifiability of the latent confounding subspace under injectivity and completeness conditions. We develop a ``triple machine learning'' estimator that employs a cross-fitting scheme to sequentially handle the learned representation, nuisance parameters, and target functional. We characterize the asymptotic distribution with variance inflation induced by representation learning error, and provide conditions for semiparametric efficiency. We also propose a practical VAE-based algorithm for confounding representation learning. Simulations and real-world analysis validate the effectiveness of the proposed methods. By bridging classical semiparametric theory with modern representation learning, this work provides a robust statistical foundation for distributional and counterfactual inference in complex causal systems.
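
The smooth approximation mentioned in this abstract replaces a non-differentiable max (or min) with a log-sum-exp. A minimal sketch follows, assuming a temperature parameter `t` and the hypothetical helper names `smooth_max` and `smooth_min`. The paper's actual estimator and tuning are not reproduced here. For n inputs, the smooth max exceeds the true max by at most log(n)/t.

```python
import math


def smooth_max(values, t):
    """Log-sum-exp approximation of max: (1/t) * log(sum(exp(t * v))).

    Larger t gives a tighter approximation. The maximum is subtracted
    before exponentiating to avoid overflow.
    """
    if t <= 0:
        raise ValueError("t must be positive")
    m = max(values)
    return m + math.log(sum(math.exp(t * (v - m)) for v in values)) / t


def smooth_min(values, t):
    """Smooth approximation of min, obtained by negating smooth_max."""
    return -smooth_max([-v for v in values], t)


# Hypothetical inputs; the gap to the true max shrinks as t grows.
vals = [0.2, 0.5, 0.9]
for t in (1.0, 10.0, 100.0):
    print(t, smooth_max(vals, t) - max(vals))
```

The choice of `t` trades smoothness against approximation error: small values give a well-behaved gradient but a loose bound, large values the reverse.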

2602.13158 2026-02-19 stat.ME

A new mixture model for spatiotemporal exceedances with flexible tail dependence

Ryan Li, Emily C. Hector, Brian J. Reich, Reetam Majumder

Abstract

We propose a new model and estimation framework for spatiotemporal streamflow exceedances above a threshold that flexibly captures asymptotic dependence and independence in the tail of the distribution. We model streamflow using a mixture of processes with spatial, temporal and spatiotemporal asymptotic dependence regimes. A censoring mechanism allows us to use only observations above a threshold to estimate marginal and joint probabilities of extreme events. As the likelihood is intractable, we use simulation-based inference powered by random forests to estimate model parameters from summary statistics of the data. Simulations and modeling of streamflow data from the U.S. Geological Survey illustrate the feasibility and practicality of our approach.

2602.10531 2026-02-19 stat.ML cs.LG math.ST stat.TH

From Collapse to Improvement: Statistical Perspectives on the Evolutionary Dynamics of Iterative Training on Contaminated Sources

Soham Bakshi, Sunrit Chakraborty

Abstract

The problem of model collapse has presented new challenges in iterative training of generative models, where such training with synthetic data leads to an overall degradation of performance. This paper looks at the problem from a statistical viewpoint, illustrating that one can actually hope for improvement when models are trained on data contaminated with synthetic samples, as long as there is some amount of fresh information from the true target distribution. In particular, we consider iterative training on samples sourced from a mixture of the true target and synthetic distributions. We analyze the entire iterative evolution in a next-token prediction language model, capturing how the interplay between the mixture weights and the sample size controls the overall long-term performance. With non-trivial mixture weight of the true distribution, even if it decays over time, simply training the model in a contamination-agnostic manner with appropriate sample sizes can avoid collapse and even recover the true target distribution under certain conditions. Simulation studies support our findings and also show that such behavior is more general for other classes of models.

2601.21093 2026-02-19 stat.ML cs.LG math.OC math.PR math.ST stat.TH

High-dimensional learning dynamics of multi-pass Stochastic Gradient Descent in multi-index models

Zhou Fan, Leda Wang

Abstract

We study the learning dynamics of a multi-pass, mini-batch Stochastic Gradient Descent (SGD) procedure for empirical risk minimization in high-dimensional multi-index models with isotropic random data. In an asymptotic regime where the sample size $n$ and data dimension $d$ increase proportionally, for any sub-linear batch size $κ\asymp n^α$ where $α\in [0,1)$, and for a commensurate ``critical'' scaling of the learning rate, we provide an asymptotically exact characterization of the coordinate-wise dynamics of SGD. This characterization takes the form of a system of dynamical mean-field equations, driven by a scalar Poisson jump process that represents the asymptotic limit of SGD sampling noise. We develop an analogous characterization of the Stochastic Modified Equation (SME) which provides a Gaussian diffusion approximation to SGD. Our analyses imply that the limiting dynamics for SGD are the same for any batch size scaling $α\in [0,1)$, and that under a commensurate scaling of the learning rate, dynamics of SGD, SME, and gradient flow are mutually distinct, with those of SGD and SME coinciding in the special case of a linear model. We recover a known dynamical mean-field characterization of gradient flow in a limit of small learning rate, and of one-pass/online SGD in a limit of increasing sample size $n/d \to \infty$.

2601.17973 2026-02-19 stat.ML cs.LG

Boosting methods for interval-censored data with regression and classification

Yuan Bian, Grace Y. Yi, Wenqing He

Journal ref
In The 13th International Conference on Learning Representations (2025)
Abstract

Boosting has garnered significant interest across both machine learning and statistical communities. Traditional boosting algorithms, designed for fully observed random samples, often struggle with real-world problems, particularly with interval-censored data. This type of data is common in survival analysis and time-to-event studies where exact event times are unobserved but fall within known intervals. Effective handling of such data is crucial in fields like medical research, reliability engineering, and social sciences. In this work, we introduce novel nonparametric boosting methods for regression and classification tasks with interval-censored data. Our approaches leverage censoring unbiased transformations to adjust loss functions and impute transformed responses while maintaining model accuracy. Implemented via functional gradient descent, these methods ensure scalability and adaptability. We rigorously establish their theoretical properties, including optimality and mean squared error trade-offs. Our proposed methods not only offer a robust framework for enhancing predictive accuracy in domains where interval-censored data are common but also complement existing work, expanding the applicability of existing boosting techniques. Empirical studies demonstrate robust performance across various finite-sample scenarios, highlighting the practical utility of our approaches.

2511.03952 2026-02-19 stat.ML cs.LG

High-dimensional limit theorems for SGD: Momentum and Adaptive Step-sizes

Aukosh Jagannath, Taj Jones-McCormick, Varnan Sarangian

Abstract

We develop a high-dimensional scaling limit for Stochastic Gradient Descent with Polyak Momentum (SGD-M) and adaptive step-sizes. This provides a framework to rigorously compare online SGD with some of its popular variants. We show that the scaling limits of SGD-M coincide with those of online SGD after an appropriate time rescaling and a specific choice of step-size. However, if the step-size is kept the same between the two algorithms, SGD-M will amplify high-dimensional effects, potentially degrading performance relative to online SGD. We demonstrate our framework on two popular learning problems: Spiked Tensor PCA and Single Index Models. In both cases, we also examine online SGD with an adaptive step-size based on normalized gradients. In the high-dimensional regime, this algorithm yields multiple benefits: its dynamics admit fixed points closer to the population minimum, and it widens the range of admissible step-sizes for which the iterates converge to such solutions. These examples provide a rigorous account, aligning with empirical motivation, of how early preconditioners can stabilize and improve dynamics in settings where online SGD fails.

2510.20755 2026-02-19 math.ST math.CO stat.ME stat.ML stat.TH

Incomplete U-Statistics of Equireplicate Designs: Berry-Esseen Bound and Efficient Construction

Cesare Miglioli, Jordan Awan

Abstract

U-statistics are a fundamental class of estimators that generalize the sample mean and underpin much of nonparametric statistics. Although extensively studied in both statistics and probability, key challenges remain: their high computational cost - addressed partly through incomplete U-statistics - and their non-standard asymptotic behavior in the degenerate case, which typically requires resampling methods for hypothesis testing. This paper presents a novel perspective on U-statistics, grounded in hypergraph theory and combinatorial designs. Our approach bypasses the traditional Hoeffding decomposition, the main analytical tool in this literature but one that is highly sensitive to degeneracy. By characterizing the dependence structure of a U-statistic, we derive a Berry-Esseen bound valid for incomplete U-statistics of deterministic designs, yielding conditions under which Gaussian limiting distributions can be established even in degenerate cases and when the order diverges. We also introduce efficient algorithms to construct incomplete U-statistics based on equireplicate designs, a subclass of deterministic designs that, in certain cases, achieve minimum variance. Beyond its theoretical contributions, our framework provides a systematic way to construct permutation-free counterparts to tests based on degenerate U-statistics, as demonstrated in experiments with kernel-based tests using the Maximum Mean Discrepancy and the Hilbert-Schmidt Independence Criterion.

2509.20928 2026-02-19 stat.ML cs.LG

Conditionally Whitened Generative Models for Probabilistic Time Series Forecasting

Yanfeng Yang, Siwei Chen, Pingping Hu, Zhaotong Shen, Yingjie Zhang, Zhuoran Sun, Shuai Li, Ziqi Chen, Kenji Fukumizu

Comments Accepted by the fourteenth International Conference on Learning Representations (ICLR 2026). https://openreview.net/forum?id=GG01lCopSK

Abstract

Probabilistic forecasting of multivariate time series is challenging due to non-stationarity, inter-variable dependencies, and distribution shifts. While recent diffusion and flow matching models have shown promise, they often ignore informative priors such as conditional means and covariances. In this work, we propose Conditionally Whitened Generative Models (CW-Gen), a framework that incorporates prior information through conditional whitening. Theoretically, we establish sufficient conditions under which replacing the traditional terminal distribution of diffusion models, namely the standard multivariate normal, with a multivariate normal distribution parameterized by estimators of the conditional mean and covariance improves sample quality. Guided by this analysis, we design a novel Joint Mean-Covariance Estimator (JMCE) that simultaneously learns the conditional mean and sliding-window covariance. Building on JMCE, we introduce Conditionally Whitened Diffusion Models (CW-Diff) and extend them to Conditionally Whitened Flow Matching (CW-Flow). Experiments on five real-world datasets with six state-of-the-art generative models demonstrate that CW-Gen consistently enhances predictive performance, capturing non-stationary dynamics and inter-variable correlations more effectively than prior-free approaches. Empirical results further demonstrate that CW-Gen can effectively mitigate the effects of distribution shift.

2505.24205 2026-02-19 cs.LG stat.ML

On the Expressive Power of Mixture-of-Experts for Structured Complex Tasks

Mingze Wang, Weinan E

Comments 28 pages, NeurIPS 2025 Spotlight

Abstract

Mixture-of-experts networks (MoEs) have demonstrated remarkable efficiency in modern deep learning. Despite their empirical success, the theoretical foundations underlying their ability to model complex tasks remain poorly understood. In this work, we conduct a systematic study of the expressive power of MoEs in modeling complex tasks with two common structural priors: low-dimensionality and sparsity. For shallow MoEs, we prove that they can efficiently approximate functions supported on low-dimensional manifolds, overcoming the curse of dimensionality. For deep MoEs, we show that $\mathcal{O}(L)$-layer MoEs with $E$ experts per layer can approximate piecewise functions comprising $E^L$ pieces with compositional sparsity, i.e., they can exhibit an exponential number of structured tasks. Our analysis reveals the roles of critical architectural components and hyperparameters in MoEs, including the gating mechanism, expert networks, the number of experts, and the number of layers, and offers natural suggestions for MoE variants.

2411.16370 2026-02-19 cs.CV cs.AI cs.LG eess.IV stat.ML

A Review of Bayesian Uncertainty Quantification in Deep Probabilistic Image Segmentation

M. M. A. Valiuddin, R. J. G. van Sloun, C. G. A. Viviers, P. H. N. de With, F. van der Sommen

Comments TMLR

Abstract

Advances in architectural design, data availability, and compute have driven remarkable progress in semantic segmentation. Yet, these models often rely on relaxed Bayesian assumptions, omitting critical uncertainty information needed for robust decision-making. Despite growing interest in probabilistic segmentation to address point-estimate limitations, the research landscape remains fragmented. In response, this review synthesizes foundational concepts in uncertainty modeling, analyzing how feature- and parameter-distribution modeling impact four key segmentation tasks: Observer Variability, Active Learning, Model Introspection, and Model Generalization. Our work establishes a common framework by standardizing theory, notation, and terminology, thereby bridging the gap between method developers, task specialists, and applied researchers. We then discuss critical challenges, including the nuanced distinction between uncertainty types, strong assumptions in spatial aggregation, the lack of standardized benchmarks, and pitfalls in current quantification methods. We identify promising avenues for future research, such as uncertainty-aware active learning, data-driven benchmarks, transformer-based models, and novel techniques to move from simple segmentation problems to uncertainty in holistic scene understanding. Based on our analysis, we offer practical guidelines for researchers on method selection, evaluation, reproducibility, and meaningful uncertainty estimation. Ultimately, our goal is to facilitate the development of more reliable, efficient, and interpretable segmentation models that can be confidently deployed in real-world applications.

2409.19642 2026-02-19 stat.CO cs.NA math.NA math.OC stat.ME

Solving Fredholm Integral Equations of the Second Kind via Wasserstein Gradient Flows

Francesca R. Crucinio, Adam M. Johansen

Abstract

Motivated by a recent method for approximate solution of Fredholm equations of the first kind, we develop a corresponding method for a class of Fredholm equations of the \emph{second kind}. In particular, we consider the class of equations for which the solution is a probability measure. The approach centres around specifying a functional whose gradient flow admits a minimizer corresponding to a regularized version of the solution of the underlying equation and using a mean-field particle system to approximately simulate that flow. Theoretical support for the method is presented, along with some illustrative numerical results.

2409.04332 2026-02-19 cs.LG stat.ML

Amortized Bayesian Workflow

Chengkun Li, Aki Vehtari, Paul-Christian Bürkner, Stefan T. Radev, Luigi Acerbi, Marvin Schmitt

Comments Accepted in Transactions on Machine Learning Research

Abstract

Bayesian inference often faces a trade-off between computational speed and sampling accuracy. We propose an adaptive workflow that integrates rapid amortized inference with gold-standard MCMC techniques to achieve a favorable combination of both speed and accuracy when performing inference on many observed datasets. Our approach uses principled diagnostics to guide the choice of inference method for each dataset, moving along the Pareto front from fast amortized sampling via generative neural networks to slower but guaranteed-accurate MCMC when needed. By reusing computations across steps, our workflow synergizes amortized and MCMC-based inference. We demonstrate the effectiveness of this integrated approach on several synthetic and real-world problems with tens of thousands of datasets, showing efficiency gains while maintaining high posterior quality.

2408.09760 2026-02-19 stat.ME cs.SI econ.GN q-fin.EC stat.AP

Regional and spatial dependence of poverty factors in Thailand, and its use into Bayesian hierarchical regression analysis

Irving Gómez-Méndez, Chainarong Amornbunchornvej

Comments Codes to reproduce our results are available in https://github.com/IrvingGomez/SpatialPovertyFactors

Journal ref
Statistical Journal of the IAOS. 2026
Abstract

Poverty is a serious issue that hinders human progress. The simplest solution is to use a one-size-fits-all policy to alleviate it. Nevertheless, each region has its unique issues, which require a unique solution. From the perspective of spatial analysis, neighboring regions can provide useful information for analyzing the issues of a given region. In this work, we propose inferred boundaries for regions of Thailand that explain the poverty dynamics better than the usual government administrative regions. The proposed regions maximize a trade-off between poverty-related features and geographical coherence. We use a spatial analysis together with Moran's cluster algorithms and Bayesian hierarchical regression models, with the potential to assist the implementation of the right policy to alleviate the poverty phenomenon. We found that all variables considered show a positive spatial autocorrelation. The results of the analysis illustrate that 1) Northern, Northeastern Thailand, and to a lesser extent Northcentral Thailand are the regions that require more attention with respect to poverty issues, 2) Northcentral, Northeastern, Northern and Southern Thailand present dramatically low levels of education, income and amount of savings compared with large cities such as Bangkok-Pattaya and Central Thailand, and 3) Bangkok-Pattaya is the only region whose average years of education is above 12 years, which corresponds (approx.) to a complete senior high school education.

2402.19456 2026-02-19 quant-ph cs.DS math.PR math.ST stat.ML stat.TH

Statistical Estimation in the Spiked Tensor Model via the Quantum Approximate Optimization Algorithm

Leo Zhou, Joao Basso, Song Mei

Comments 51 pages, 4 figures, 1 table

Abstract

The quantum approximate optimization algorithm (QAOA) is a general-purpose algorithm for combinatorial optimization. In this paper, we analyze the performance of the QAOA on a statistical estimation problem, namely, the spiked tensor model, which exhibits a statistical-computational gap classically. We prove that the weak recovery threshold of $1$-step QAOA matches that of $1$-step tensor power iteration. Additional heuristic calculations suggest that the weak recovery threshold of $p$-step QAOA matches that of $p$-step tensor power iteration when $p$ is a fixed constant. This further implies that multi-step QAOA with tensor unfolding could achieve, but not surpass, the classical computation threshold $Θ(n^{(q-2)/4})$ for spiked $q$-tensors. Meanwhile, we characterize the asymptotic overlap distribution for $p$-step QAOA, finding an intriguing sine-Gaussian law verified through simulations. For some $p$ and $q$, the QAOA attains an overlap that is larger by a constant factor than the tensor power iteration overlap. Of independent interest, our proof techniques employ the Fourier transform to handle difficult combinatorial sums, a novel approach differing from prior QAOA analyses on spin-glass models without planted structure.

2402.11020 2026-02-19 stat.ME

Proximal Causal Inference for Conditional Separable Effects

Chan Park, Mats Stensrud, Eric Tchetgen Tchetgen

Abstract

Scientists regularly pose questions about treatment effects on outcomes conditional on a post-treatment event. However, causal inference in such settings requires care, even in perfectly executed randomized experiments. Recently, the conditional separable effect (CSE) was proposed as an interventionist estimand that corresponds to scientifically meaningful questions in these settings. However, existing results for the CSE require no unmeasured confounding between the outcome and post-treatment event, an assumption frequently violated in practice. In this work, we address this concern by developing new identification and estimation results for the CSE that allow for unmeasured confounding. We establish nonparametric identification of the CSE in observational and experimental settings with time-varying confounders, provided that certain proxy variables for hidden common causes of the post-treatment event and outcome are available. For inference, we characterize an influence function for the CSE under a semiparametric model where nuisance functions are a priori unrestricted. Using modern machine learning methods, we construct nonparametric nuisance function estimators and establish convergence rates that improve upon existing results. Moreover, we develop a consistent, asymptotically linear, and locally semiparametric efficient estimator of the CSE. We illustrate our framework with simulation studies and a real-world cancer therapy trial.

2305.12288 2026-02-19 stat.AP cs.NA math.NA

A Cost-Effective Slag-based Mix Activated with Soda Ash and Hydrated Lime: A Pilot Study

Jayashree Sengupta, Nirjhar Dhang, Arghya Deb

Abstract

The present study explores a cost-effective method for using activated ground granulated blast furnace slag (GGBFS) and silica fume (SF) as cement substitutes. Instead of activating them with expensive alkali solutions, the present study employs industrial-grade powdered soda ash (SA) and hydrated lime (HL) as activators, reducing expenses by about 94.5% compared to their corresponding analytical-grade counterparts. Herein, the exclusivity is depicted using less pure chemicals rather than relying on reagents with 99% purity. Two mixing techniques are compared: one involves directly introducing powdered SA and HL, while the other pre-mixes SA with water before adding it to a dry powder mixture of GGBFS, SF, and HL. Microstructural analysis reveals that the initial strength results from various hydrate phases, including calcium-sodium-aluminate-silicate hydrate (CNASH). The latter strength is attributed to the coexistence of calcium-silicate hydrate (CSH), calcium-aluminate-silicate hydrate (CASH) and sodium-aluminate-silicate hydrate (NASH), with contributions from calcite and hydrotalcite. The SF content significantly influenced the formation of these gel phases. Thermogravimetric analysis (TGA) reveals phase transitions and bound water related to hydration products. The optimal mix comprises 10% SF, 90% GGBFS, 9.26% HL, and 13.25% SA, with a water-to-solids ratio of 0.45. This approach yields a compressive strength of 35.1 MPa after 28 days and 41.33 MPa after 120 days, hence suitable for structural construction.

2208.14153 2026-02-19 cs.LG stat.ML

Identifying Weight-Variant Latent Causal Models

Yuhang Liu, Zhen Zhang, Dong Gong, Mingming Gong, Biwei Huang, Anton van den Hengel, Kun Zhang, Javen Qinfeng Shi

Abstract

The task of causal representation learning aims to uncover latent higher-level causal variables that affect lower-level observations. Identifying the true latent causal variables from observed data, while allowing instantaneous causal relations among latent variables, remains a challenge, however. To this end, we start with the analysis of three intrinsic indeterminacies in identifying latent variables from observations: transitivity, permutation indeterminacy, and scaling indeterminacy. We find that transitivity plays a key role in impeding the identifiability of latent causal variables. To address the unidentifiability issue due to transitivity, we introduce a novel identifiability condition where the underlying latent causal model satisfies a linear-Gaussian model, in which the causal coefficients and the distribution of Gaussian noise are modulated by an additional observed variable. Under certain assumptions, including the existence of a reference condition under which latent causal influences vanish, we can show that the latent causal variables can be identified up to trivial permutation and scaling, and that partial identifiability results can still be obtained when this reference condition is violated for a subset of latent variables. Furthermore, based on these theoretical results, we propose a novel method, termed Structural caUsAl Variational autoEncoder (SuaVE), which directly learns causal representations and causal relationships among them, together with the mapping from the latent causal variables to the observed ones. Experimental results on synthetic and real data demonstrate the identifiability and consistency results and the efficacy of SuaVE in learning causal representations.