arXivDaily: arXiv daily academic digest, updated Monday to Friday
2601.23239 2026-02-02 stat.ML cs.IT cs.LG cs.SI math.IT math.ST stat.TH

Graph Attention Network for Node Regression on Random Geometric Graphs with Erdős–Rényi contamination

Somak Laha, Suqi Liu, Morgane Austern

Comments 62 pages, 2 figures, 2 tables

Abstract

Graph attention networks (GATs) are widely used and often appear robust to noise in node covariates and edges, yet rigorous statistical guarantees demonstrating a provable advantage of GATs over non-attention graph neural networks (GNNs) are scarce. We partially address this gap for node regression with graph-based errors-in-variables models under simultaneous covariate and edge corruption: responses are generated from latent node-level covariates, but only noise-perturbed versions of the latent covariates are observed; and the sample graph is a random geometric graph created from the node covariates but contaminated by independent Erdős–Rényi edges. We propose and analyze a carefully designed, task-specific GAT that constructs denoised proxy features for regression. We prove that regressing the response variables on the proxies achieves lower error asymptotically in (a) estimating the regression coefficient compared to the ordinary least squares (OLS) estimator on the noisy node covariates, and (b) predicting the response for an unlabelled node compared to a vanilla graph convolutional network (GCN) -- under mild growth conditions. Our analysis leverages high-dimensional geometric tail bounds and concentration for neighbourhood counts and sample covariances. We verify our theoretical findings through experiments on synthetically generated data. We also perform experiments on real-world graphs and demonstrate the effectiveness of the attention mechanism in several node regression tasks.
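To make the data model in this abstract concrete, the following minimal sketch generates a random geometric graph from latent covariates and then adds independent Erdős–Rényi edges. The function name `contaminated_rgg`, the Euclidean threshold `radius`, the contamination probability `p_noise`, and the Gaussian noise levels are illustrative assumptions; the paper's exact construction may differ.

```python
import math
import random

def contaminated_rgg(points, radius, p_noise, rng):
    """Edge set of a geometric graph (distance <= radius) plus independent noise edges."""
    n = len(points)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            geometric = math.dist(points[i], points[j]) <= radius
            noise = rng.random() < p_noise  # Erdős–Rényi contamination (assumed form)
            if geometric or noise:
                edges.add((i, j))
    return edges

rng = random.Random(0)
# Latent covariates drive the graph; only noisy versions would be observed.
latent = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
observed = [(x + rng.gauss(0, 0.3), y + rng.gauss(0, 0.3)) for x, y in latent]
clean = contaminated_rgg(latent, radius=0.5, p_noise=0.0, rng=rng)
noisy = contaminated_rgg(latent, radius=0.5, p_noise=0.05, rng=rng)
print(len(clean), len(noisy))
```

In this sketch every geometric edge survives contamination, so `clean` is always a subset of `noisy`; the extra edges are the ones an attention mechanism would ideally learn to down-weight.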

2601.23208 2026-02-02 stat.ML cs.LG

A Random Matrix Theory of Masked Self-Supervised Regression

Arie Wortsman Zurich, Federica Gerace, Bruno Loureiro, Yue M. Lu

Abstract

In the era of transformer models, masked self-supervised learning (SSL) has become a foundational training paradigm. A defining feature of masked SSL is that training aggregates predictions across many masking patterns, giving rise to a joint, matrix-valued predictor rather than a single vector-valued estimator. This object encodes how coordinates condition on one another and poses new analytical challenges. We develop a precise high-dimensional analysis of masked modeling objectives in the proportional regime where the number of samples scales with the ambient dimension. Our results provide explicit expressions for the generalization error and characterize the spectral structure of the learned predictor, revealing how masked modeling extracts structure from data. For spiked covariance models, we show that the joint predictor undergoes a Baik–Ben Arous–Péché (BBP)-type phase transition, identifying when masked SSL begins to recover latent signals. Finally, we identify structured regimes in which masked self-supervised learning provably outperforms PCA, highlighting potential advantages of SSL objectives over classical unsupervised methods.

2601.23203 2026-02-02 stat.AP

Beyond the Null Effect: Unmasking the True Impact of Teacher-Child Interaction Quality on Child Outcomes in Early Head Start

JoonHo Lee, Alison Hooper

Abstract

In Early Head Start (EHS), teacher-child interactions are widely believed to shape infant-toddler outcomes, yet large-scale studies often find only modest or null associations. This study addresses four methodological sources of attenuation -- item-level measurement error, center-level confounding, teacher- and classroom-level covariate imbalance, and overlooked nonlinearities -- to clarify classroom process quality's true influence on child development. Using data from the 2018 wave of the Early Head Start Family and Child Experiences Survey (Baby FACES), we applied a three-level generalized additive latent and mixed model (GALAMM) to distinguish genuine classroom-level variability in process quality, as measured by the Classroom Assessment Scoring System (CLASS) and Quality of Caregiver-Child Interactions for Infants and Toddlers (QCIT), from item-level noise and center-level effects. We then estimated dose-response relationships with children's language and socioemotional outcomes, employing covariate balancing weights and generalized additive models. Results show that nearly half of each item's variance reflects classroom-level processes, with the remainder tied to measurement error or center-wide influences, masking true classroom effects. After correcting for these biases, domain-focused dose-response analyses reveal robust linear associations between cognitive/language supports and children's English communicative skills, while emotional-behavioral supports better predict social-emotional competence. Some domains display plateaus when pushed to extremes, underscoring potential nonlinearities. These findings challenge the "null effect" narrative, demonstrating that rigorous methodology can uncover the critical, domain-specific impacts of teacher-child interaction quality, offering clearer guidance for targeted professional development and policy in EHS.

2601.23171 2026-02-02 stat.OT

Revisiting the Lost Submarine Problem: A Decision Theoretic Approach

Anthony Almudevar

Comments 2 figures, 11 pages

Abstract

This article includes a discussion of the "lost submarine problem", following Morey et al. (2016). As the title of that paper suggests (The fallacy of placing confidence in confidence intervals), the example is intended to illustrate the futility of relying on the confidence interval as a formal inference statement. In the view of this author, the misgivings expressed in Morey et al. (2016) can be resolved using a decision theoretic approach. While it is true that a variety of statistical methods lead to a variety of confidence intervals, once we precisely define their purpose, a single optimal choice emerges. Furthermore, distinct purposes lead to distinct optimal choices. Therefore, that a variety of procedures exist is an advantage rather than a liability.

2601.23124 2026-02-02 math.ST stat.TH

Semi-knockoffs: a model-agnostic conditional independence testing method with finite-sample guarantees

Angel Reyero-Lobo, Bertrand Thirion, Pierre Neuvial

Abstract

Conditional independence testing (CIT) is essential for reliable scientific discovery. It prevents spurious findings and enables controlled feature selection. Recent CIT methods have used machine learning (ML) models as surrogates of the underlying distribution. However, model-agnostic approaches require a train-test split, which reduces statistical power. We introduce Semi-knockoffs, a CIT method that can accommodate any pre-trained model, avoids this split, and provides valid p-values and false discovery rate (FDR) control for high-dimensional settings. Unlike methods that rely on the model-$X$ assumption (known input distribution), Semi-knockoffs only require conditional expectations for continuous variables. This makes the procedure less restrictive and more practical for machine learning integration. To ensure validity when estimating these expectations, we present two new theoretical results of independent interest: (i) stability for regularized models trained with a null feature and (ii) the double-robustness property.

2601.23031 2026-02-02 stat.ML cs.LG

Asymptotic Theory of Iterated Empirical Risk Minimization, with Applications to Active Learning

Hugo Cui, Yue M. Lu

Abstract

We study a class of iterated empirical risk minimization (ERM) procedures in which two successive ERMs are performed on the same dataset, and the predictions of the first estimator enter as an argument in the loss function of the second. This setting, which arises naturally in active learning and reweighting schemes, introduces intricate statistical dependencies across samples and fundamentally distinguishes the problem from classical single-stage ERM analyses. For linear models trained with a broad class of convex losses on Gaussian mixture data, we derive a sharp asymptotic characterization of the test error in the high-dimensional regime where the sample size and ambient dimension scale proportionally. Our results provide explicit, fully asymptotic predictions for the performance of the second-stage estimator despite the reuse of data and the presence of prediction-dependent losses. We apply this theory to revisit a well-studied pool-based active learning problem, removing oracle and sample-splitting assumptions made in prior work. We uncover a fundamental tradeoff in how the labeling budget should be allocated across stages, and demonstrate a double-descent behavior of the test error driven purely by data selection, rather than model size or sample count.

2601.23021 2026-02-02 stat.ME

Differences in Performance of Bayesian Dynamic Borrowing and Synthetic Control Methods: A Case Study of Pediatric Atopic Dermatitis

Nicole Cizauskas, Foteini Strimenopoulou, Svetlana S. Cherlin, James M. S. Wason

Comments 13 pages, 1 table, 2 figures

Abstract

Bayesian dynamic borrowing (BDB) and synthetic control methods (SCM) are both used in clinical trial design when recruitment, retention, or allocation is a challenge. The performance of these approaches has not previously been directly compared due to differences in application, product, and measurement metrics. This study compares the power and type 1 error rates of BDB (using a meta-analytic predictive (MAP) prior) and SCM in a case study of pediatric atopic dermatitis. Six historical randomised controlled trials were selected for use in the creation of both the MAP prior and the synthetic control arm. The R package RBesT was used to create the MAP prior and the R package Synthpop was used to create the synthetic control arm for the SCM. Power and type 1 error rate were used as comparison metrics. BDB produced a power of 0.580 and a type 1 error rate of 0.026. SCM produced a power of 0.641 and a type 1 error rate of 0.027. In this case study, SCM produced a higher power than BDB with a similar type 1 error rate. However, the decision to use SCM or BDB should come from the specific needs of the potential trial, since their power and type 1 error rate may differ on a case-by-case basis.

2601.22999 2026-02-02 stat.ME stat.CO

Computationally efficient segmentation for non-stationary time series with oscillatory patterns

Nicolas Bianco, Lorenzo Cappello

Abstract

We propose a novel approach for change-point detection and parameter learning in multivariate non-stationary time series exhibiting oscillatory behaviour. We approximate the process through a piecewise function defined by a sum of sinusoidal functions with unknown frequencies and amplitudes plus noise. The inference for this model is non-trivial. However, discretising the parameter space allows us to recast this complex estimation problem into a more tractable linear model, where the covariates are Fourier basis functions. Then, any change-point detection algorithm for segmentation can be used. The advantage of our proposal is that it bypasses the need for trans-dimensional Markov chain Monte Carlo algorithms used by state-of-the-art methods. Through simulations, we demonstrate that our method is significantly faster than existing approaches while maintaining comparable numerical accuracy. We also provide high probability bounds on the change-point localization error. We apply our methodology to climate and EEG sleep data.
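The discretisation idea in this abstract can be illustrated on a single stationary segment: once the frequency is restricted to a grid, fitting a sinusoid reduces to a linear problem in cosine and sine columns. The sketch below is a periodogram-style simplification under that assumption, not the authors' algorithm; `best_grid_frequency` and the synthetic signal are hypothetical, and no change-point step is included.

```python
import math

def best_grid_frequency(y, grid):
    """Pick the grid frequency whose cosine/sine pair explains the most energy in y."""
    n = len(y)

    def power(f):
        # Projections of y onto the two Fourier basis columns at frequency f.
        c = sum(v * math.cos(2 * math.pi * f * t) for t, v in enumerate(y))
        s = sum(v * math.sin(2 * math.pi * f * t) for t, v in enumerate(y))
        return (c * c + s * s) / n

    return max(grid, key=power)

n = 200
# Noise-free test signal with frequency 0.05 cycles per sample and a phase shift.
y = [2.0 * math.cos(2 * math.pi * 0.05 * t + 0.3) for t in range(n)]
grid = [k / n for k in range(1, n // 2)]
f_hat = best_grid_frequency(y, grid)
print(f_hat)
```

Because the unknown phase is absorbed by the cosine and sine coefficients, only the frequency has to be searched over the grid; the amplitudes enter linearly.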

2601.22971 2026-02-02 stat.ME q-bio.QM

Dynamic modelling and evaluation of preclinical trials in acute leukaemia

Julian Wäsche, Romina Ludwig, Irmela Jeremias, Christiane Fuchs

Abstract

Dynamic models are widely used to mathematically describe biological phenomena that evolve over time. One important area of application is leukaemia research, where leukaemia cells are genetically modified in preclinical studies to explore new therapeutic targets for reducing leukaemic burden. In advanced experiments, these studies are often conducted in mice and generate time-resolved data, the analysis of which may reveal growth-inhibiting effects of the investigated gene modifications. However, the experimental data are often evaluated using statistical tests which compare measurements from only two different time points. This approach not only reduces the time series to two instances but also neglects biological knowledge about cell mechanisms. Such knowledge, translated into mathematical models, expands the power to investigate and understand effects of modifications on underlying mechanisms based on experimental data. We utilise two population growth models -- an exponential and a logistic growth model -- to capture cell dynamics over the whole experimental time horizon and to consider all measurement times jointly. This approach enables us to derive modification effects from estimated model parameters. We demonstrate that the exponential growth model recognises simulated scenarios more reliably than the other candidate model and than a statistical test. Moreover, we apply the population growth models to evaluate the efficacy of candidate gene knockouts in patient-derived xenograft (PDX) models of acute leukaemia.
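As a rough illustration of deriving a modification effect from estimated growth parameters, the sketch below fits an exponential growth model to all time points at once and compares growth rates between two groups. The log-linear least-squares fit, the function name `fit_exponential_growth`, and the noise-free synthetic counts are assumptions made for this example; the paper's estimation procedure may differ.

```python
import math

def fit_exponential_growth(times, counts):
    """Least-squares fit of log(count) = log(n0) + rate * t; returns (n0, rate)."""
    logs = [math.log(c) for c in counts]
    t_bar = sum(times) / len(times)
    l_bar = sum(logs) / len(logs)
    sxx = sum((t - t_bar) ** 2 for t in times)
    sxy = sum((t - t_bar) * (l - l_bar) for t, l in zip(times, logs))
    rate = sxy / sxx
    n0 = math.exp(l_bar - rate * t_bar)
    return n0, rate

times = [0, 7, 14, 21, 28]  # hypothetical measurement days
control = [1000 * math.exp(0.20 * t) for t in times]
knockout = [1000 * math.exp(0.12 * t) for t in times]
_, r_control = fit_exponential_growth(times, control)
_, r_knockout = fit_exponential_growth(times, knockout)
effect = r_knockout - r_control  # negative values indicate growth inhibition
print(round(effect, 3))
```

Unlike a two-time-point test, the estimated rate uses every measurement, which is the point the abstract makes about considering all measurement times jointly.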

2601.22951 2026-02-02 stat.ML cs.LG

OneFlowSBI: One Model, Many Queries for Simulation-Based Inference

Mayank Nautiyal, Li Ju, Melker Ernfors, Klara Hagland, Ville Holma, Maximilian Werkö Söderholm, Andreas Hellander, Prashant Singh

Abstract

We introduce OneFlowSBI, a unified framework for simulation-based inference that learns a single flow-matching generative model over the joint distribution of parameters and observations. Leveraging a query-aware masking distribution during training, the same model supports multiple inference tasks, including posterior sampling, likelihood estimation, and arbitrary conditional distributions, without task-specific retraining. We evaluate OneFlowSBI on ten benchmark inference problems and two high-dimensional real-world inverse problems across multiple simulation budgets. OneFlowSBI is shown to deliver competitive performance against state-of-the-art generalized inference solvers and specialized posterior estimators, while enabling efficient sampling with few ODE integration steps and remaining robust under noisy and partially observed data.

2601.22950 2026-02-02 cs.LG cs.AI cs.CL stat.ML

Perplexity Cannot Always Tell Right from Wrong

Petar Veličković, Federico Barbero, Christos Perivolaropoulos, Simon Osindero, Razvan Pascanu

Comments 11 pages, 4 figures

Abstract

Perplexity -- a function measuring a model's overall level of "surprise" when encountering a particular output -- has gained significant traction in recent years, both as a loss function and as a simple-to-compute metric of model quality. Prior studies have pointed out several limitations of perplexity, often from an empirical standpoint. Here we leverage recent results on Transformer continuity to show rigorously how perplexity may be an unsuitable metric for model selection. Specifically, we prove that, if there is any sequence that a compact decoder-only Transformer model predicts accurately and confidently -- a necessary prerequisite for strong generalisation -- then there must exist another sequence with very low perplexity that is not predicted correctly by that same model. Further, by analytically studying iso-perplexity plots, we find that perplexity will not always select the more accurate model -- rather, any increase in model confidence must be accompanied by a commensurate rise in accuracy for the new model to be selected.
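For reference, perplexity is the exponential of the mean negative log-probability that a model assigns to the tokens of a sequence. The toy example below (hypothetical probabilities, not the paper's construction) shows how a sequence can have very low perplexity even though one of its tokens receives less than half the probability mass, so greedy decoding could pick a different token at that position.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability of the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A uniform probability p per token gives perplexity 1 / p.
uniform = perplexity([0.5, 0.5, 0.5, 0.5])

# 99 tokens predicted with probability 0.99, one token with probability 0.4.
# A probability of 0.4 leaves room for a competing token with probability 0.6.
mostly_confident = perplexity([0.99] * 99 + [0.4])
print(uniform, mostly_confident)
```

Averaging over the sequence is what hides the single poorly predicted token: its contribution to the mean log-probability is diluted by the 99 confident ones.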

2601.22911 2026-02-02 stat.CO math.CT math.PR

A categorical account of the Metropolis-Hastings algorithm

Rob Cornish, Andi Q. Wang

Abstract

Metropolis-Hastings (MH) is a foundational Markov chain Monte Carlo (MCMC) algorithm. In this paper, we ask whether it is possible to formulate and analyse MH in terms of categorical probability, using a recent involutive framework for MH-type procedures as a concrete case study. We show how basic MCMC concepts such as invariance and reversibility can be formulated in Markov categories, and how one part of the MH kernel can be analysed using standard CD categories. To go further, we then study enrichments of CD categories over commutative monoids. This gives an expressive setting for reasoning abstractly about a range of important probabilistic concepts, including substochastic kernels, finite and $\sigma$-finite measures, absolute continuity, singular measures, and Lebesgue decompositions. Using these tools, we give synthetic necessary and sufficient conditions for a general MH-type sampler to be reversible with respect to a given target distribution.
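For readers who want the concrete algorithm behind the categorical treatment, here is a minimal random-walk Metropolis-Hastings sampler. It is the textbook version with a symmetric Gaussian proposal, not the involutive framework the paper studies; the target (a standard normal), the step size, and the sample count are arbitrary choices for illustration.

```python
import math
import random

def metropolis_hastings(log_target, x0, step, n_samples, rng):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    x, log_px = x0, log_target(x0)
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        log_pp = log_target(proposal)
        log_ratio = log_pp - log_px
        # Accept with probability min(1, target(proposal) / target(x)).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x, log_px = proposal, log_pp
        samples.append(x)  # a rejected proposal repeats the current state
    return samples

rng = random.Random(1)
draws = metropolis_hastings(lambda x: -0.5 * x * x, 0.0, 2.0, 20000, rng)
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
print(mean, var)
```

The accept/reject step is what makes the chain reversible with respect to the target, which is the property the paper characterises abstractly.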

2601.22890 2026-02-02 stat.CO cond-mat.mtrl-sci math.OC math.ST stat.TH

A Framework for the Bayesian Calibration of Complex and Data-Scarce Models in Applied Sciences

Christina Schenk, Ignacio Romero

Comments 57, 23 figures (includes supplementary information)

Abstract

In this work, we review the theory involved in the Bayesian calibration of complex computer models, with particular emphasis on their use for applications involving computationally expensive simulations and scarce experimental data. In the article, we present a unified framework that incorporates various Bayesian calibration methods, including well-established approaches. Furthermore, we describe their implementation and use with a new, open-source Python library, ACBICI (A Configurable BayesIan Calibration and Inference Package). All algorithms are implemented with an object-oriented structure designed to be both easy to use and readily extensible. In particular, single-output and multiple-output calibration are addressed in a consistent manner. The article completes the theory and its implementation with practical recommendations for calibrating the problems of interest. These guidelines -- currently unavailable in a unified form elsewhere -- together with the open-source Python library, are intended to support the reliable calibration of computational codes and models commonly used in engineering and related fields. Overall, this work aims to serve both as a comprehensive review of the statistical foundations and (computational) tools required to perform such calculations, and as a practical guide to Bayesian calibration with modern software tools.

2601.22884 2026-02-02 stat.ME stat.CO

Depth-based estimation for multivariate functional data with phase variability

Ana Arribas-Gil, Sara López-Pintado

Comments 34 pages, 11 figures, 6 tables

Abstract

In the context of multivariate functional data with individual phase variation, we develop a robust depth-based approach to estimate the main pattern function when cross-component time warping is also present. In particular, we consider the latent deformation model (Carroll and Müller, 2023) in which the different components of a multivariate functional variable are also time-distorted versions of a common template function. Rather than focusing on a particular functional depth measure, we discuss the necessary conditions on a depth function to be able to provide a consistent estimation of the central pattern, considering different model assumptions. We evaluate the method performance and its robustness against atypical observations and violations of the model assumptions through simulations, and illustrate its use on two real data sets.

2601.22834 2026-02-02 math.ST stat.TH

Asymmetric conformal prediction with penalized kernel sum-of-squares

Louis Allain, Sébastien Da Veiga, Brian Staber

Abstract

Conformal prediction (CP) is a distribution-free method to construct reliable prediction intervals that has gained significant attention in recent years. Despite its success and various proposed extensions, a significant practical feature which has been overlooked in previous research is the potentially skewed nature of the noise, or of the residuals when the predictive model exhibits bias. In this work, we leverage recent developments in CP to propose a new asymmetric procedure that bridges the gap between skewed and non-skewed noise distributions, while still maintaining adaptivity of the prediction intervals. We introduce a new statistical learning problem to construct adaptive and asymmetric prediction bands, with a unique feature based on a penalty which promotes symmetry: when its intensity varies, the intervals smoothly change from symmetric to asymmetric ones. This learning problem is based on reproducing kernel Hilbert spaces and the recently introduced kernel sum-of-squares framework. First, we establish representer theorems to make our problem tractable in practice, and derive dual formulations which are essential for scalability to larger datasets. Second, the intensity of the penalty is chosen using a novel data-driven method which automatically identifies the symmetric nature of the noise. We show that consenting to some asymmetry can let the learned prediction bands better adapt to small sample regimes or biased predictive models.

2601.22799 2026-02-02 math.ST stat.TH

Convergence of Multi-Level Markov Chain Monte Carlo Adaptive Stochastic Gradient Algorithms

Antoine Godichon-Baggioni, Gabriel Lang, Sylvain Le Corff, Julien Stoehr, Sobihan Surendran

Abstract

Stochastic optimization in learning and inference often relies on Markov chain Monte Carlo (MCMC) to approximate gradients when exact computation is intractable. However, finite-time MCMC estimators are biased, and reducing this bias typically comes at a higher computational cost. We propose a multilevel Monte Carlo gradient estimator whose bias decays as $O(T_n^{-1})$ while its expected computational cost grows only as $O(\log T_n)$, where $T_n$ is the maximal truncation level at iteration $n$. Building on this approach, we introduce a multilevel MCMC framework for adaptive stochastic gradient methods, leading to new multilevel variants of Adagrad and AMSGrad algorithms. Under conditions controlling the estimator bias and its second and third moments, we establish a convergence rate of order $O(n^{-1/2})$ up to logarithmic factors. Finally, we illustrate these results on Importance-Weighted Autoencoders trained with the proposed multilevel adaptive methods.

2601.22790 2026-02-02 cs.AI math.ST stat.TH

Conditional Performance Guarantee for Large Reasoning Models

Jianguo Huang, Hao Zeng, Bingyi Jing, Hongxin Wei, Bo An

Abstract

Large reasoning models have shown strong performance through extended chain-of-thought reasoning, yet their computational cost remains significant. Probably approximately correct (PAC) reasoning provides statistical guarantees for efficient reasoning by adaptively switching between thinking and non-thinking models, but the guarantee holds only in the marginal case and does not provide exact conditional coverage. We propose G-PAC reasoning, a practical framework that provides PAC-style guarantees at the group level by partitioning the input space. We develop two instantiations: Group PAC (G-PAC) reasoning for known group structures and Clustered PAC (C-PAC) reasoning for unknown groupings. We prove that both G-PAC and C-PAC achieve group-conditional risk control, and that grouping can strictly improve efficiency over marginal PAC reasoning in heterogeneous settings. Our experiments on diverse reasoning benchmarks demonstrate that G-PAC and C-PAC successfully achieve group-conditional risk control while maintaining substantial computational savings.

2601.22782 2026-02-02 stat.ME

Optimal Sample Splitting for Observational Studies

Qishuo Yin, Dylan S. Small

Abstract

In observational studies of treatment effects, estimates may be biased by unmeasured confounders, which can potentially affect the validity of the results. Understanding sensitivity to such biases helps assess how unmeasured confounding impacts credibility. The design of an observational study strongly influences its sensitivity to bias. Previous work has shown that the sensitivity to bias can be reduced by dividing a dataset into a planning sample and a larger analysis sample, where the planning sample guides design decisions. But the choice of what fraction of the data to put in the planning sample vs. the analysis sample was ad hoc. Here, we develop an approach to find the optimal fraction using plasmode datasets. We show that our method works well in high-dimensional outcome spaces. We apply our method to study the effects of exposure to second-hand smoke in children. The OptimalSampling R package implementing our method is available at GitHub.

2601.22771 2026-02-02 stat.ML cs.LG

GRANITE: A Generalized Regional Framework for Identifying Agreement in Feature-Based Explanations

Julia Herbinger, Gabriel Laberge, Maximilian Muschalik, Yann Pequignot, Marvin N. Wright, Fabian Fumagalli

Abstract

Feature-based explanation methods aim to quantify how features influence the model's behavior, either locally or globally, but different methods often disagree, producing conflicting explanations. This disagreement arises primarily from two sources: how feature interactions are handled and how feature dependencies are incorporated. We propose GRANITE, a generalized regional explanation framework that partitions the feature space into regions where interaction and distribution influences are minimized. This approach aligns different explanation methods, yielding more consistent and interpretable explanations. GRANITE unifies existing regional approaches, extends them to feature groups, and introduces a recursive partitioning algorithm to estimate such regions. We demonstrate its effectiveness on real-world datasets, providing a practical tool for consistent and interpretable feature explanations.

2601.19888 2026-02-02 stat.ME cs.AI cs.LG

M-SGWR: Multiscale Similarity and Geographically Weighted Regression

M. Naser Lessani, Zhenlong Li, Manzhu Yu, Helen Greatrex, Chan Shen

Abstract

The first law of geography is a cornerstone of spatial analysis, emphasizing that nearby and related locations tend to be more similar. However, defining what constitutes "near" and "related" remains challenging, as different phenomena exhibit distinct spatial patterns. Traditional local regression models, such as Geographically Weighted Regression (GWR) and Multiscale GWR (MGWR), quantify spatial relationships solely through geographic proximity. In an era of globalization and digital connectivity, however, geographic proximity alone may be insufficient to capture how locations are interconnected. To address this limitation, we propose a new multiscale local regression framework, termed M-SGWR, which characterizes spatial interaction across two dimensions: geographic proximity and attribute (variable) similarity. For each predictor, geographic and attribute-based weight matrices are constructed separately and then combined using an optimized parameter, alpha, which governs their relative contribution to local model fitting. Analogous to variable-specific bandwidths in MGWR, the optimal alpha varies by predictor, allowing the model to flexibly account for geographic, mixed, or non-spatial (remote similarity) effects. Results from two simulation experiments and one empirical application demonstrate that M-SGWR consistently outperforms GWR, SGWR, and MGWR across all goodness-of-fit metrics.
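One plausible reading of "combined using an optimized parameter, alpha" is a convex combination of a geographic kernel weight and an attribute-similarity kernel weight. The sketch below assumes exactly that, with Gaussian kernels; the function names, the kernel choice, and the bandwidths are hypothetical, and the abstract does not state the actual combination rule.

```python
import math

def gaussian_kernel(distance, bandwidth):
    """Weight that decays smoothly from 1 as distance grows."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)

def combined_weight(geo_dist, attr_dist, alpha, geo_bw, attr_bw):
    """Assumed form: alpha * geographic weight + (1 - alpha) * attribute weight."""
    w_geo = gaussian_kernel(geo_dist, geo_bw)
    w_attr = gaussian_kernel(attr_dist, attr_bw)
    return alpha * w_geo + (1.0 - alpha) * w_attr

# A location that is geographically remote but very similar in attributes.
purely_geographic = combined_weight(10.0, 0.1, alpha=1.0, geo_bw=1.0, attr_bw=1.0)
purely_attribute = combined_weight(10.0, 0.1, alpha=0.0, geo_bw=1.0, attr_bw=1.0)
mixed = combined_weight(10.0, 0.1, alpha=0.5, geo_bw=1.0, attr_bw=1.0)
print(purely_geographic, mixed, purely_attribute)
```

Under this assumed form, alpha = 1 recovers a purely geographic weighting as in GWR, while alpha = 0 lets remote but similar locations contribute fully to the local fit.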

2601.05219 2026-02-02 stat.ML cs.AI cs.LG

CAOS: Conformal Aggregation of One-Shot Predictors

Maja Waldron

Abstract

One-shot prediction enables rapid adaptation of pretrained foundation models to new tasks using only one labeled example, but lacks principled uncertainty quantification. While conformal prediction provides finite-sample coverage guarantees, standard split conformal methods are inefficient in the one-shot setting due to data splitting and reliance on a single predictor. We propose Conformal Aggregation of One-Shot Predictors (CAOS), a conformal framework that adaptively aggregates multiple one-shot predictors and uses a leave-one-out calibration scheme to fully exploit scarce labeled data. Despite violating classical exchangeability assumptions, we prove that CAOS achieves valid marginal coverage using a monotonicity-based argument. Experiments on one-shot facial landmarking and RAFT text classification tasks show that CAOS produces substantially smaller prediction sets than split conformal baselines while maintaining reliable coverage.
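The split conformal baseline mentioned in this abstract can be sketched in a few lines: hold out a calibration set, compute absolute residuals, and take a finite-sample-corrected quantile as the interval half-width. This is the standard procedure for regression-style scores, not CAOS itself; the function name and the toy residuals are illustrative.

```python
import math

def split_conformal_halfwidth(calibration_residuals, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest absolute residual."""
    scores = sorted(abs(r) for r in calibration_residuals)
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for the requested coverage level.
        return math.inf
    return scores[k - 1]

halfwidth = split_conformal_halfwidth([3, -1, 4, -2, 7, 5, -6], alpha=0.25)
too_few = split_conformal_halfwidth([1, -2], alpha=0.25)
print(halfwidth, too_few)
```

The second call shows the inefficiency the abstract alludes to: with very few calibration points the quantile falls off the end of the sorted scores and the interval becomes infinite, which motivates using scarce labeled data more fully.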

2512.15429 2026-02-02 stat.ME

Accounting for missing data when modelling block maxima

Emma S. Simpson, Paul J. Northrop

Journal ref Environmetrics 2026, Volume 37, Issue 2, e70075

Abstract

Modelling block maxima using the generalised extreme value (GEV) distribution is a classical and widely used method for studying univariate extremes. It allows for theoretically motivated estimation of return levels, including extrapolation beyond the range of observed data. A frequently overlooked challenge in applying this methodology comes from handling datasets containing missing values. In this case, one cannot be sure whether the true maximum has been recorded in each block, and simply ignoring the issue can lead to biased parameter estimators and, crucially, underestimated return levels. We propose an extension of the standard block maxima approach to overcome such missing data issues. This is achieved by explicitly accounting for the proportion of missing values in each block within the GEV model. Inference is carried out using likelihood-based techniques, and we propose an update to commonly used diagnostic plots to assess model fit. We assess the performance of our method via a simulation study, with results that are competitive with the "ideal" case of having no missing values. The practical use of our methodology is demonstrated on sea surge data from Brest, France, and air pollution data from Plymouth, U.K.
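One standard way to account for an observed fraction $p$ of a block uses the max-stability of the GEV family: if the full-block maximum has distribution function $G$ and observations within a block are i.i.d., the maximum over a fraction $p$ of the block has distribution function $G^p$, which is again GEV with the same shape and adjusted location and scale. The sketch below checks this identity numerically. It assumes shape $\xi \neq 0$ and missingness unrelated to the values; whether the paper uses exactly this parametrisation is not stated in the abstract.

```python
import math

def gev_cdf(x, mu, sigma, xi):
    """GEV distribution function for shape xi != 0."""
    t = 1.0 + xi * (x - mu) / sigma
    if t <= 0.0:
        # Outside the support: below the lower endpoint (xi > 0) or above the upper one.
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))

def adjust_for_observed_fraction(mu, sigma, xi, p):
    """Location and scale of the maximum over a fraction p of the block."""
    sigma_p = sigma * p ** xi
    mu_p = mu + sigma * (p ** xi - 1.0) / xi
    return mu_p, sigma_p

mu, sigma, xi, p = 10.0, 2.0, 0.1, 0.7
mu_p, sigma_p = adjust_for_observed_fraction(mu, sigma, xi, p)
lhs = gev_cdf(13.0, mu, sigma, xi) ** p
rhs = gev_cdf(13.0, mu_p, sigma_p, xi)
print(lhs, rhs)
```

The adjusted location is smaller than the original one, which matches the abstract's warning that ignoring missing values leads to underestimated return levels.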

2511.16377 2026-02-02 cs.LG cs.CR stat.ML

Optimal Fairness under Local Differential Privacy

Hrad Ghoukasian, Shahab Asoodeh

Comments 21 pages, 6 figures, 2 tables

Abstract

We investigate how to optimally design local differential privacy (LDP) mechanisms that reduce data unfairness and thereby improve fairness in downstream classification. We first derive a closed-form optimal mechanism for binary sensitive attributes and then develop a tractable optimization framework that yields the corresponding optimal mechanism for multi-valued attributes. As a theoretical contribution, we establish that for discrimination-accuracy optimal classifiers, reducing data unfairness necessarily leads to lower classification unfairness, thus providing a direct link between privacy-aware pre-processing and classification fairness. Empirically, we demonstrate that our approach consistently outperforms existing LDP mechanisms in reducing data unfairness across diverse datasets and fairness metrics, while maintaining accuracy close to that of non-private models. Moreover, compared with leading pre-processing and post-processing fairness methods, our mechanism achieves a more favorable accuracy-fairness trade-off while simultaneously preserving the privacy of sensitive attributes. Taken together, these results highlight LDP as a principled and effective pre-processing fairness intervention technique.
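As background for the binary case, the classic randomized response mechanism is a simple example of a local differential privacy mechanism for a binary sensitive attribute. It is shown here only as a baseline illustration; it is not the optimal mechanism derived in the paper, and the function names are hypothetical.

```python
import math
import random

def keep_probability(epsilon):
    """Probability of reporting the true bit under epsilon-LDP randomized response."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))

def randomized_response(bit, epsilon, rng):
    """Report the true bit with probability keep_probability(epsilon), else flip it."""
    return bit if rng.random() < keep_probability(epsilon) else 1 - bit

rng = random.Random(0)
keep = keep_probability(1.0)
reports = [randomized_response(1, 1.0, rng) for _ in range(1000)]
print(keep, sum(reports) / len(reports))
```

The ratio of the keep probability to the flip probability equals $e^{\epsilon}$, which is exactly the bound that the definition of $\epsilon$-local differential privacy places on the likelihood ratio between the two possible inputs.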

2511.04576 2026-02-02 stat.ML cs.LG

Physics-Informed Neural Networks and Neural Operators for Parametric PDEs

Zhuo Zhang, Xiong Xiong, Sen Zhang, Yuan Zhao, Xi Yang

Comments 61 pages, 3 figures. Submitted to The 1st International Conference on AI Scientists (ICAIS 2025). This revision corrects the bibliography mismatch caused by hallucination issues

Abstract

PDEs arise ubiquitously in science and engineering, where solutions depend on parameters (physical properties, boundary conditions, geometry). Traditional numerical methods require re-solving the PDE for each parameter, making parameter space exploration prohibitively expensive. Recent machine learning advances, particularly physics-informed neural networks (PINNs) and neural operators, have revolutionized parametric PDE solving by learning solution operators that generalize across parameter spaces. We critically analyze two main paradigms: (1) PINNs, which embed physical laws as soft constraints and excel at inverse problems with sparse data, and (2) neural operators (e.g., DeepONet, Fourier Neural Operator), which learn mappings between infinite-dimensional function spaces and achieve unprecedented generalization. Through comparisons across fluid dynamics, solid mechanics, heat transfer, and electromagnetics, we show neural operators can achieve computational speedups of $10^3$ to $10^5$ times over traditional solvers for multi-query scenarios, while maintaining comparable accuracy. We provide practical guidance for method selection, discuss theoretical foundations (universal approximation, convergence), and identify critical open challenges: high-dimensional parameters, complex geometries, and out-of-distribution generalization. This work establishes a unified framework for understanding parametric PDE solvers via operator learning, offering a comprehensive, incrementally updated resource for this rapidly evolving field.

2510.25128 2026-02-02 cs.LG stat.ML

An Analysis of Causal Effect Estimation using Outcome Invariant Data Augmentation

Uzair Akbar, Niki Kilbertus, Hao Shen, Krikamol Muandet, Bo Dai

Comments Accepted at NeurIPS 2025

Abstract

The technique of data augmentation (DA) is often used in machine learning for regularization purposes to better generalize under i.i.d. settings. In this work, we present a framework that unifies DA with topics in causal inference to make the case for using DA not only in the i.i.d. setting but also for generalization across interventions. Specifically, we argue that when the outcome generating mechanism is invariant to our choice of DA, then such augmentations can effectively be thought of as interventions on the treatment generating mechanism itself. This can potentially help to reduce bias in causal effect estimation arising from hidden confounders. In the presence of such unobserved confounding we typically make use of instrumental variables (IVs) -- sources of treatment randomization that are conditionally independent of the outcome. However, IVs may not be as readily available as DA for many applications, which is the main motivation behind this work. By appropriately regularizing IV based estimators, we introduce the concept of IV-like (IVL) regression for mitigating confounding bias and improving predictive performance across interventions even when certain IV properties are relaxed. Finally, we cast parameterized DA as an IVL regression problem and show that, when used in composition, it can simulate a worst-case application of such DA, further improving performance on causal estimation and generalization tasks beyond what simple DA may offer. This is shown both theoretically for the population case and via simulation experiments for the finite sample case using a simple linear example. We also present real data experiments to support our case.

2510.12442 2026-02-02 math.ST stat.TH

On estimation of weighted cumulative residual Tsallis entropy for complete and censored samples

Siddhartha Chakraborty, Asok K. Nanda

Comments 30 pages

Abstract

Recently, weighted cumulative residual Tsallis entropy has been introduced in the literature as a generalization of weighted cumulative residual entropy. We study some new properties of the weighted cumulative residual Tsallis entropy measure. Next, we propose some non-parametric estimators of this measure. Asymptotic properties of these estimators are discussed. The performance of these estimators is compared by mean squared error. Non-parametric estimators for the weighted cumulative residual entropy measure are also discussed. An estimator for weighted cumulative residual Tsallis entropy for progressive type-II censored data is proposed and its performance is investigated by Monte-Carlo simulations for various censoring schemes. Two uniformity tests for complete samples are proposed based on an estimator of these two measures and the power of the tests is compared with some popular tests. The tests perform reasonably well. A uniformity test under progressively type-II censored data is also developed. Some real datasets are analysed for illustration.

2510.09133 2026-02-02 cs.AI cs.LG math.ST stat.TH

On the Provable Performance Guarantee of Efficient Reasoning Models

Hao Zeng, Jianguo Huang, Bingyi Jing, Hongxin Wei, Bo An

Abstract

Large reasoning models (LRMs) have achieved remarkable progress in complex problem-solving tasks. Despite this success, LRMs typically suffer from high computational costs during deployment, highlighting a need for efficient inference. A practical direction of efficiency improvement is to switch the LRM between thinking and non-thinking modes dynamically. However, such approaches often introduce additional reasoning errors and lack statistical guarantees for the performance loss, which are critical for high-stakes applications. In this work, we propose Probably Approximately Correct (PAC) reasoning that controls the performance loss under the user-specified tolerance. Specifically, we construct an upper confidence bound on the performance loss and determine a threshold for switching to the non-thinking model. Theoretically, using the threshold to switch between the thinking and non-thinking modes ensures bounded performance loss in a distribution-free manner. Our comprehensive experiments on reasoning benchmarks show that the proposed method can save computational budgets and control the user-specified performance loss.

2509.12206 2026-02-02 math.ST stat.ME stat.TH

Hausdorff consistency of MLE in folded normal and Gaussian mixtures

Koustav Mallik

Comments This is a series of works on nonidentifiable models

Abstract

We develop a constant-tracking likelihood theory for two nonregular models: the folded normal and finite Gaussian mixtures. For the folded normal, we prove boundary coercivity for the profiled likelihood, show that the profile path of the location parameter exists and is strictly decreasing by an implicit-function argument, and establish a unique profile maximizer in the scale parameter. Deterministic envelopes for the log-likelihood, the score, and the Hessian yield elementary uniform laws of large numbers with finite-sample bounds, avoiding covering numbers. Identification and Kullback-Leibler separation deliver consistency. A sixth-order expansion of the log hyperbolic cosine creates a quadratic-minus-quartic contrast around zero, leading to a nonstandard one-fourth-power rate for the location estimator at the kink and a standard square-root rate for the scale estimator, with a uniform remainder bound. For finite Gaussian mixtures with distinct components and positive weights, we give a short identifiability proof up to label permutations via Fourier and Vandermonde ideas, derive two-sided Gaussian envelopes and responsibility-based gradient bounds on compact sieves, and obtain almost-sure and high-probability uniform laws with explicit constants. Using a minimum-matching distance on permutation orbits, we prove Hausdorff consistency on fixed and growing sieves. We quantify variance-collapse spikes via an explicit spike-bonus bound and show that a quadratic penalty in location and log-scale dominates this bonus, making penalized likelihood coercive; when penalties shrink but sample size times penalty diverges, penalized estimators remain consistent. All proofs are constructive, track constants, verify measurability of maximizers, and provide practical guidance for tuning sieves, penalties, and EM-style optimization.

2506.06446 2026-02-02 cs.CL cs.AI cs.LG stat.ML

Tokenization Multiplicity Leads to Arbitrary Price Variation in LLM-as-a-service

Ivi Chatzi, Nina Corvelo Benz, Stratis Tsirtsis, Manuel Gomez-Rodriguez

Abstract

Providers of LLM-as-a-service have predominantly adopted a simple pricing model: users pay a fixed price per token. Consequently, one may think that the price two different users would pay for the same output string under the same input prompt is the same. In our work, we show that, surprisingly, this is not (always) true. We find empirical evidence that, particularly for non-English outputs, both proprietary and open-weights LLMs often generate the same (output) string with multiple different tokenizations, even under the same input prompt, and this in turn leads to arbitrary price variation. To address the problem of tokenization multiplicity, we introduce canonical generation, a type of constrained generation that restricts LLMs to only generate canonical tokenizations -- the unique tokenization in which each string is tokenized during the training process of an LLM. Further, we introduce an efficient sampling algorithm for canonical generation based on the Gumbel-Max trick. Experiments on a variety of natural language tasks demonstrate that our sampling algorithm for canonical generation is comparable to standard sampling in terms of performance and runtime, and it solves the problem of tokenization multiplicity.
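For readers unfamiliar with the Gumbel-Max trick mentioned in this abstract, the following is a minimal sketch of the generic trick only, not the paper's canonical-generation algorithm. It draws an index from a softmax distribution by adding independent Gumbel noise to each logit and taking the argmax. The function names and the use of a logit of negative infinity to mark a disallowed option are illustrative assumptions.

```python
import math
import random


def gumbel_max_sample(logits, rng):
    """Draw one index from softmax(logits) via the Gumbel-Max trick.

    Options whose logit is -inf are treated as disallowed (an assumption
    made here to illustrate how a constraint might be encoded).
    """
    best_index, best_score = -1, -math.inf
    for i, logit in enumerate(logits):
        if logit == -math.inf:
            continue  # disallowed option, never selected
        u = rng.random()
        while u == 0.0:  # avoid log(0); random() returns values in [0, 1)
            u = rng.random()
        gumbel = -math.log(-math.log(u))  # standard Gumbel(0, 1) noise
        score = logit + gumbel
        if score > best_score:
            best_index, best_score = i, score
    return best_index


def empirical_frequencies(logits, n_draws, seed=0):
    """Repeat the sampler and return the observed frequency of each index."""
    rng = random.Random(seed)
    counts = [0] * len(logits)
    for _ in range(n_draws):
        counts[gumbel_max_sample(logits, rng)] += 1
    return [c / n_draws for c in counts]
```

With logits equal to the logs of target probabilities, the observed frequencies should approach those probabilities as the number of draws grows.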

2505.24769 2026-02-02 stat.ML cond-mat.dis-nn cs.LG math.ST stat.TH

Generalization Dynamics of Linear Diffusion Models

Claudia Merger, Sebastian Goldt

Abstract

Diffusion models are powerful generative models that produce high-quality samples from complex data. While their infinite-data behavior is well understood, their generalization with finite data remains less clear. Classical learning theory predicts that generalization occurs at a sample complexity that is exponential in the dimension, far exceeding practical needs. We address this gap by analyzing diffusion models through the lens of data covariance spectra, which often follow power-law decays, reflecting the hierarchical structure of real data. To understand whether such a hierarchical structure can benefit learning in diffusion models, we develop a theoretical framework based on linear neural networks, congruent with a Gaussian hypothesis on the data. We quantify how the hierarchical organization of variance in the data and regularization impacts generalization. We find two regimes: when $N < d$, not all directions of variation are present in the training data, which results in a large gap between training and test loss. In this regime, we demonstrate how a strongly hierarchical data structure, as well as regularization and early stopping, help to prevent overfitting. For $N > d$, we find that the sampling distributions of linear diffusion models approach their optimum (measured by the Kullback-Leibler divergence) linearly with $d/N$, independent of the specifics of the data distribution. Our work clarifies how sample complexity governs generalization in a simple model of diffusion-based generative models.

2505.19589 2026-02-02 cs.LG stat.ML

Model Agnostic Differentially Private Causal Inference

Christian Janos Lebeda, Mathieu Even, Aurélien Bellet, Julie Josse

Abstract

Estimating causal effects from observational data is essential in fields such as medicine, economics and social sciences, where privacy concerns are paramount. We propose a general, model-agnostic framework for differentially private estimation of average treatment effects (ATE) that avoids strong structural assumptions on the data-generating process or the models used to estimate propensity scores and conditional outcomes. In contrast to prior work, which enforces differential privacy by directly privatizing these nuisance components, our approach decouples nuisance estimation from privacy protection. This separation allows the use of flexible, state-of-the-art black-box models, while differential privacy is achieved by perturbing only predictions and aggregation steps within a fold-splitting scheme with ensemble techniques. We instantiate the framework for three classical estimators -- the G-Formula, inverse propensity weighting (IPW), and augmented IPW (AIPW) -- and provide formal utility and privacy guarantees, together with privatized confidence intervals. Empirical results on synthetic and real data show that our methods maintain competitive performance under realistic privacy budgets.
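The three classical estimators named in this abstract can be sketched in their standard, non-private form as follows. This is an illustration under stated assumptions, not the paper's private procedure: the nuisance predictions (propensity scores `e`, outcome predictions `mu1` and `mu0`) are taken as given inputs, and no noise, fold-splitting, or ensembling is applied. All function and argument names are hypothetical.

```python
def g_formula(mu1, mu0):
    """Plug-in ATE: average difference of predicted outcomes under treatment/control."""
    return sum(m1 - m0 for m1, m0 in zip(mu1, mu0)) / len(mu1)


def ipw(y, t, e):
    """Inverse propensity weighting ATE, with t in {0, 1} and e = P(T = 1 | X)."""
    total = 0.0
    for yi, ti, ei in zip(y, t, e):
        total += ti * yi / ei - (1 - ti) * yi / (1 - ei)
    return total / len(y)


def aipw(y, t, e, mu1, mu0):
    """Augmented IPW: the plug-in estimate plus an IPW correction on the residuals."""
    total = 0.0
    for yi, ti, ei, m1, m0 in zip(y, t, e, mu1, mu0):
        total += (m1 - m0
                  + ti * (yi - m1) / ei
                  - (1 - ti) * (yi - m0) / (1 - ei))
    return total / len(y)
```

The AIPW form makes the decoupling idea easy to see: the nuisance predictions enter only through per-unit terms that are then averaged, so a privacy mechanism could in principle act on those predictions and the aggregation step rather than on the fitted models.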

2503.03020 2026-02-02 math.ST stat.TH

Adaptive monotonicity testing in sublinear time

Housen Li, Zhi Liu, Axel Munk

Comments The implementation in R is available at https://github.com/liuzhi1993/FOMT

Journal ref IEEE Transactions on Information Theory, vol. 72, no. 2, pp. 1240-1275, Feb. 2026

Abstract

Modern large-scale data analysis increasingly faces the challenge of achieving computational efficiency as well as statistical accuracy, as classical statistically efficient methods often fall short in the first regard. In the context of testing monotonicity of a regression function, we propose FOMT (Fast and Optimal Monotonicity Test), a novel methodology tailored to meet these dual demands. FOMT employs a sparse collection of local tests, strategically generated at random, to detect violations of monotonicity scattered throughout the domain of the regression function. This sparsity enables significant computational efficiency, achieving sublinear runtime in most cases, and quasilinear runtime (i.e., linear up to a log factor) in the worst case. In contrast, existing statistically optimal tests typically require at least quadratic runtime. FOMT's statistical accuracy is achieved through the precise calibration of these local tests and their effective combination, ensuring both sensitivity to violations and control over false positives. More precisely, we show that FOMT separates the null and alternative hypotheses at minimax optimal rates over Hölder function classes of smoothness order in $(0,2]$. Further, when the smoothness is unknown, we introduce an adaptive version of FOMT, based on a modified Lepskii principle, which attains statistical optimality and meanwhile maintains the same computational complexity as if the intrinsic smoothness were known. Extensive simulations confirm the competitiveness and effectiveness of both FOMT and its adaptive variant.

2502.13790 2026-02-02 stat.ME stat.AP stat.CO

A Zero-Inflated Poisson Latent Position Cluster Model

Chaoyi Lu, Riccardo Rastelli, Nial Friel

Comments 43 pages, 16 figures, 3 tables

Journal ref Net Sci 14 (2026) e2

Abstract

The latent position network model (LPM) is a popular approach for the statistical analysis of network data. A central aspect of this model is that it assigns nodes to random positions in a latent space, such that the probability of an interaction between each pair of individuals or nodes is determined by their distance in this latent space. A key feature of this model is that it allows one to visualize nuanced structures via the latent space representation. The LPM can be further extended to the Latent Position Cluster Model (LPCM), to accommodate the clustering of nodes by assuming that the latent positions are distributed following a finite mixture distribution. In this paper, we extend the LPCM to accommodate missing network data and apply this to non-negative discrete weighted social networks. By treating missing data as "unusual" zero interactions, we propose a combination of the LPCM with the zero-inflated Poisson distribution. Statistical inference is based on a novel partially collapsed Markov chain Monte Carlo algorithm, where a Mixture-of-Finite-Mixtures (MFM) model is adopted to automatically determine the number of clusters and optimal group partitioning. Our algorithm features a truncated absorb-eject move, which is a novel adaptation of an idea commonly used in collapsed samplers, within the context of MFMs. Another aspect of our work is that we illustrate our results on 3-dimensional latent spaces, maintaining clear visualizations while achieving more flexibility than 2-dimensional models. The performance of this approach is illustrated via three carefully designed simulation studies, as well as four different publicly available real networks, where some interesting new perspectives are uncovered.
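As background for the zero-inflated Poisson distribution used in this abstract, here is a small sketch of its probability mass function. It shows only the standard distribution with free parameters; how the paper links the rate and the zero-inflation probability to latent positions and clusters is not reproduced here, and the names `lam` and `pi_zero` are assumptions.

```python
import math


def zip_pmf(k, lam, pi_zero):
    """P(Y = k) for a zero-inflated Poisson variable.

    With probability pi_zero the outcome is a structural ("unusual") zero;
    otherwise it is drawn from a Poisson distribution with rate lam.
    """
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        return pi_zero + (1 - pi_zero) * poisson
    return (1 - pi_zero) * poisson
```

A zero count can therefore arise in two ways, which is what lets a model of this kind separate missing interactions from genuinely absent ones.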

2409.17910 2026-02-02 math.ST stat.TH

On the tails of log-concave density estimators

Didier B. Ryter, Lutz Duembgen

Abstract

It is shown that the nonparametric maximum likelihood estimator of a univariate log-concave probability density satisfies desirable consistency properties in the tail regions. Specifically, let $P$ and $f$ denote the true underlying distribution and density, respectively. If $\hat{f}_n$ is the estimated log-concave density, and $\hat{\varphi}_n = \log \hat{f}_n$, then we specify sequences $(b_n)_{n\in \mathbb{N}}$ such that $P([b_n,\infty)) \to 0$ at a specific speed, ensuring that the absolute errors or absolute relative errors of $\hat{f}_n$, $\hat{\varphi}_n$ and $\hat{\varphi}_n'$ converge to zero uniformly on sets $[a, b_n]$. The main tools, besides characterizations of $\hat{f}_n$, are exponential and maximal inequalities for truncated moments of log-concave distributions, which are of independent interest.

2408.01517 2026-02-02 cs.LG cs.AI math-ph math.MP math.OC stat.ML

Gradient flow in parameter space is equivalent to linear interpolation in output space

Thomas Chen, Patrícia Muñoz Ewald

Comments To appear in Journal of Geometry and Physics

Journal ref J. Geom. Phys., 222, Article No. 105765 (2026)

Abstract

We prove that the standard gradient flow in parameter space that underlies many training algorithms in deep learning can be continuously deformed into an adapted gradient flow which yields (constrained) Euclidean gradient flow in output space. Moreover, for the $L^{2}$ loss, if the Jacobian of the outputs with respect to the parameters is full rank (for fixed training data), then the time variable can be reparametrized so that the resulting flow is simply linear interpolation, and a global minimum can be achieved. For the cross-entropy loss, under the same rank condition and assuming the labels have positive components, we derive an explicit formula for the unique global minimum.
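To make the linear-interpolation statement concrete, the simplest special case can be worked out by hand. The sketch below assumes the output $x(s)$ itself follows unconstrained Euclidean gradient flow for the loss $\frac{1}{2}\lVert x - y\rVert^2$; the symbols $x$, $x_0$, $y$, $s$ and $t$ are notation chosen here for illustration and are not taken from the paper.

```latex
\begin{align*}
\partial_s x(s) &= -\nabla_x \tfrac{1}{2}\lVert x(s) - y \rVert^2 = -\bigl(x(s) - y\bigr), \\
x(s) &= y + e^{-s}\,(x_0 - y), \\
t := 1 - e^{-s} \in [0, 1) \;\Longrightarrow\; x &= (1 - t)\,x_0 + t\,y .
\end{align*}
```

In the reparametrized time $t$ the trajectory is the straight line from the initial output $x_0$ to the target $y$, which is reached in the limit $t \to 1$.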

2405.01761 2026-02-02 stat.ML cs.LG

Multivariate Bayesian Last Layer for Regression with Uncertainty Quantification and Decomposition

Han Wang, Eiji Kawasaki, Guillaume Damblin, Geoffrey Daniel

Abstract

We present new Bayesian Last Layer neural network models in the setting of multivariate regression under heteroscedastic noise, and propose EM algorithms for parameter learning. Bayesian modeling of a neural network's final layer has the attractive property of uncertainty quantification with a single forward pass. The proposed framework is capable of disentangling the aleatoric and epistemic uncertainty, and can be used to enhance a canonically trained deep neural network with uncertainty-aware capabilities.

2404.15133 2026-02-02 stat.CO stat.ME

Bayesian Strategies for Repulsive Spatial Point Processes

Chaoyi Lu, Nial Friel

Comments 29 pages, 5 figures, 5 tables

Abstract

There is increasing interest in developing Bayesian inferential algorithms for point process models with intractable likelihoods. One purpose of this paper is to illustrate the utility of simulation-based strategies, including Approximate Bayesian Computation (ABC) and Markov Chain Monte Carlo (MCMC) methods, for this task. Shirota and Gelfand (2017) proposed an extended version of an ABC approach for Repulsive Spatial Point Processes (RSPP), but their algorithm was not correctly detailed. In this paper, we correct their method and, based on this, propose a new ABC-MCMC algorithm that, unlike a typical ABC method, has the Markov property. Though it is generally impractical to use, Monte Carlo approximations can be leveraged for intractable terms. Another aspect of this paper is to explore the use of the exchange algorithm and the noisy Metropolis-Hastings algorithm (Alquier et al., 2016) on RSPP. Comparisons to ABC-MCMC methods are also provided. We find that the inferential approaches outlined above yield good performance for RSPP in both simulated and real data applications and should be considered as viable approaches for the analysis of these models.

2402.17233 2026-02-02 cs.LG stat.AP stat.ME

Hybrid$^2$ Neural ODE Causal Modeling and an Application to Glycemic Response

Bob Junyi Zou, Matthew E. Levine, Dessi P. Zaharieva, Ramesh Johari, Emily B. Fox

Journal ref Proceedings of the 41st International Conference on Machine Learning, PMLR 235:62934-62963, 2024

Abstract

Hybrid models composing mechanistic ODE-based dynamics with flexible and expressive neural network components have grown rapidly in popularity, especially in scientific domains where such ODE-based modeling offers important interpretability and validated causal grounding (e.g., for counterfactual reasoning). The incorporation of mechanistic models also provides inductive bias in standard blackbox modeling approaches, critical when learning from small datasets or partially observed, complex systems. Unfortunately, as the hybrid models become more flexible, the causal grounding provided by the mechanistic model can quickly be lost. We address this problem by leveraging another common source of domain knowledge: ranking of treatment effects for a set of interventions, even if the precise treatment effect is unknown. We encode this information in a causal loss that we combine with the standard predictive loss to arrive at a hybrid loss that biases our learning towards causally valid hybrid models. We demonstrate our ability to achieve a win-win (state-of-the-art predictive performance and causal validity) in the challenging task of modeling glucose dynamics post-exercise in individuals with type 1 diabetes.

2402.12683 2026-02-02 cs.LG cs.CV math.ST stat.TH

TorchCP: A Python Library for Conformal Prediction

Jianguo Huang, Jianqing Song, Xuanning Zhou, Bingyi Jing, Hongxin Wei

Abstract

Conformal prediction (CP) is a powerful statistical framework that generates prediction intervals or sets with guaranteed coverage probability. While CP algorithms have evolved beyond traditional classifiers and regressors to sophisticated deep learning models like deep neural networks (DNNs), graph neural networks (GNNs), and large language models (LLMs), existing CP libraries often lack the model support and scalability for large-scale deep learning (DL) scenarios. This paper introduces TorchCP, a PyTorch-native library designed to integrate state-of-the-art CP algorithms into DL techniques, including DNN-based classifiers/regressors, GNNs, and LLMs. Released under the LGPL-3.0 license, TorchCP comprises about 16k lines of code, validated with 100% unit test coverage and detailed documentation. Notably, TorchCP enables CP-specific training algorithms, online prediction, and GPU-accelerated batch processing, achieving up to 90% reduction in inference time on large datasets. With its low-coupling design, comprehensive suite of advanced methods, and full GPU scalability, TorchCP empowers researchers and practitioners to enhance uncertainty quantification across cutting-edge applications.
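The coverage guarantee mentioned in this abstract is easiest to see in split conformal prediction for regression, sketched below in plain Python. This deliberately does not use TorchCP or claim to mirror its API; the function names are hypothetical, and `predict` stands for any already-fitted point predictor.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-based rank of the order statistic
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]


def split_conformal_interval(predict, calib_x, calib_y, x_new, alpha):
    """Prediction interval for x_new from absolute residuals on a held-out calibration set."""
    scores = [abs(y - predict(x)) for x, y in zip(calib_x, calib_y)]
    q = conformal_quantile(scores, alpha)
    center = predict(x_new)
    return center - q, center + q
```

Under exchangeability of the calibration and test points, an interval built this way covers the true response with probability at least 1 - alpha, regardless of how good the underlying predictor is; a poor predictor simply yields wider intervals.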

2402.03113 2026-02-02 math.OC math.ST stat.TH

Optimal sampling for stochastic and natural gradient descent

Robert Gruhlke, Anthony Nouy, Philipp Trunschke

Abstract

We consider the problem of optimising the expected value of a loss functional over a nonlinear model class of functions, assuming that we only have access to realisations of the gradient of the loss. This is a classical task in statistics, machine learning and physics-informed machine learning. A straightforward solution is to replace the exact objective with a Monte Carlo estimate before employing standard first-order methods like gradient descent, which yields the classical stochastic gradient descent method. But replacing the true objective with an estimate incurs a generalisation error. Rigorous bounds for this error typically require strong compactness and Lipschitz continuity assumptions while providing a very slow decay with sample size. To alleviate these issues, we propose a version of natural gradient descent that is based on optimal sampling methods. Under classical assumptions on the loss and the nonlinear model class, we prove that this scheme converges almost surely monotonically to a stationary point of the true objective. Under Polyak-Lojasiewicz-type conditions, this provides bounds for the generalisation error. As a remarkable result, we show that our stochastic optimisation scheme achieves the linear or exponential convergence rates of deterministic first order descent methods under suitable conditions.

2306.10987 2026-02-02 stat.ML cs.LG

A VAE Approach to Sample Multivariate Extremes

Nicolas Lafon, Philippe Naveau, Ronan Fablet

Abstract

Generating accurate extremes from an observational data set is crucial when seeking to estimate risks associated with the occurrence of future extremes which could be larger than those already observed. Applications range from the occurrence of natural disasters to financial crashes. Generative approaches from the machine learning community do not apply to extreme samples without careful adaptation. Besides, asymptotic results from extreme value theory (EVT) give a theoretical framework to model multivariate extreme events, especially through the notion of multivariate regular variation. Bridging these two fields, this paper details a variational autoencoder (VAE) approach for sampling multivariate heavy-tailed distributions, i.e., distributions likely to have extremes of particularly large intensities. We illustrate the relevance of our approach on a synthetic data set and on a real data set of discharge measurements along the Danube river network. The latter shows the potential of our approach for flood risk assessment. In addition to outperforming the standard VAE for the tested data sets, we also provide a comparison with a competing EVT-based generative approach. On the tested cases, our approach improves the learning of the dependency structure between extremes.

2111.08953 2026-02-02 stat.ML cs.LG stat.ME

Three approaches to supervised learning for compositional data with pairwise logratios

Germa Coenders, Michael Greenacre

Comments 17 pages, 3 figures, 5 tables

Journal ref Journal of Applied Statistics, 50, 16 (2023), 3272-3293

Abstract

The common approach to compositional data analysis is to transform the data by means of logratios. Logratios between pairs of compositional parts (pairwise logratios) are the easiest to interpret in many research problems. When the number of parts is large, some form of logratio selection is a must, for instance by means of an unsupervised learning method based on a stepwise selection of the pairwise logratios that explain the largest percentage of the logratio variance in the compositional dataset. In this article we present three alternative stepwise supervised learning methods to select the pairwise logratios that best explain a dependent variable in a generalized linear model, each geared for a specific problem. The first method features unrestricted search, where any pairwise logratio can be selected. This method has a complex interpretation if some pairs of parts in the logratios overlap, but it leads to the most accurate predictions. The second method restricts parts to occur only once, which makes the corresponding logratios intuitively interpretable. The third method uses additive logratios, so that $K-1$ selected logratios involve exactly $K$ parts. This method in fact searches for the subcomposition with the highest explanatory power. Once the subcomposition is identified, the researcher's favourite logratio representation may be used in subsequent analyses, not only pairwise logratios. Our methodology allows logratios or non-compositional covariates to be forced into the models based on theoretical knowledge, and various stopping criteria are available based on information measures or statistical significance with the Bonferroni correction. We present an illustration of the three approaches on a dataset from a study predicting Crohn's disease. The first method excels in terms of predictive power, and the other two in interpretability.
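A rough sketch of the first step of an unrestricted stepwise search over pairwise logratios is given below. It is an illustration only: the paper fits generalized linear models and offers several stopping criteria, whereas this sketch scores each candidate by absolute Pearson correlation with the response as a simple stand-in. All function names are hypothetical.

```python
import math
from itertools import combinations


def pairwise_logratios(composition):
    """All log(x_i / x_j) for i < j of one composition with strictly positive parts."""
    return {(i, j): math.log(composition[i] / composition[j])
            for i, j in combinations(range(len(composition)), 2)}


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def best_first_logratio(compositions, y):
    """Pick the pairwise logratio most correlated (in absolute value) with y."""
    n_parts = len(compositions[0])
    best_pair, best_score = None, -1.0
    for i, j in combinations(range(n_parts), 2):
        column = [math.log(row[i] / row[j]) for row in compositions]
        score = abs(pearson(column, y))
        if score > best_score:
            best_pair, best_score = (i, j), score
    return best_pair, best_score
```

Later steps would repeat the search conditional on the logratios already selected; the restricted variants described in the abstract would additionally exclude candidates that reuse a part or that break the additive-logratio structure.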

2601.22717 2026-02-02 stat.ME

Policy learning under constraint: Maximizing a primary outcome while controlling an adverse event

Laura Fuentes-Vicente, Mathieu Even, Gaelle Dormion, Julie Josse, Antoine Chambaz

Abstract

A medical policy aims to support decision-making by mapping patient characteristics to individualized treatment recommendations. Standard approaches typically optimize a single outcome criterion. For example, recommending treatment according to the sign of the Conditional Average Treatment Effect (CATE) maximizes the policy "value" by exploiting treatment effect heterogeneity. This point of view shifts policy learning towards the challenge of learning a reliable CATE estimator. However, in multi-outcome settings, such strategies ignore the risk of adverse events, despite their relevance. PLUC (Policy Learning Under Constraint) addresses this challenge by learning an estimator of the CATE that yields smoothed policies controlling the probability of an adverse event in observational settings. Inspired by insights from EP-learning, PLUC involves the optimization of strongly convex Lagrangian criteria over a convex hull of functions. Its alternating procedure iteratively applies the Frank-Wolfe algorithm to minimize the current criterion, then performs a targeting step that updates the criterion so that its evaluations at previously visited landmarks become targeted estimators of the corresponding theoretical quantities. An R package PLUC-R provides a practical implementation. We illustrate PLUC's performance through a series of numerical experiments.

2601.22652 2026-02-02 stat.ML cs.LG

Spectral Gradient Descent Mitigates Anisotropy-Driven Misalignment: A Case Study in Phase Retrieval

Guillaume Braun, Han Bao, Wei Huang, Masaaki Imaizumi

Comments 53 pages, 8 figures

Abstract

Spectral gradient methods, such as the Muon optimizer, modify gradient updates by preserving directional information while discarding scale, and have shown strong empirical performance in deep learning. We investigate the mechanisms underlying these gains through a dynamical analysis of a nonlinear phase retrieval model with anisotropic Gaussian inputs, equivalent to training a two-layer neural network with the quadratic activation and fixed second-layer weights. Focusing on a spiked covariance setting where the dominant variance direction is orthogonal to the signal, we show that gradient descent (GD) suffers from a variance-induced misalignment: during the early escaping stage, the high-variance but uninformative spike direction is multiplicatively amplified, degrading alignment with the true signal under strong anisotropy. In contrast, spectral gradient descent (SpecGD) removes this spike amplification effect, leading to stable alignment and accelerated noise contraction. Numerical experiments confirm the theory and show that these phenomena persist under broader anisotropic covariances.

2601.22650 2026-02-02 stat.ML cs.LG

Generative and Nonparametric Approaches for Conditional Distribution Estimation: Methods, Perspectives, and Comparative Evaluations

Yen-Shiu Chin, Zhi-Yu Jou, Toshinari Morimoto, Chia-Tse Wang, Ming-Chung Chang, Tso-Jung Yen, Su-Yun Huang, Tailen Hsing

Comments 22 pages, 2 figures, 2 tables

Abstract

The inference of conditional distributions is a fundamental problem in statistics, essential for prediction, uncertainty quantification, and probabilistic modeling. A wide range of methodologies have been developed for this task. This article reviews and compares several representative approaches spanning classical nonparametric methods and modern generative models. We begin with the single-index method of Hall and Yao (2005), which estimates the conditional distribution through a dimension-reducing index and nonparametric smoothing of the resulting one-dimensional cumulative conditional distribution function. We then examine the basis-expansion approaches, including FlexCode (Izbicki and Lee, 2017) and DeepCDE (Dalmasso et al., 2020), which convert conditional density estimation into a set of nonparametric regression problems. In addition, we discuss two recent generative simulation-based methods that leverage modern deep generative architectures: the generative conditional distribution sampler (Zhou et al., 2023) and the conditional denoising diffusion probabilistic model (Fu et al., 2024; Yang et al., 2025). A systematic numerical comparison of these approaches is provided using a unified evaluation framework that ensures fairness and reproducibility. The performance metrics used for the estimated conditional distribution include the mean-squared errors of conditional mean and standard deviation, as well as the Wasserstein distance. We also discuss their flexibility and computational costs, highlighting the distinct advantages and limitations of each approach.

2601.22625 2026-02-02 stat.ML cs.LG

RPWithPrior: Label Differential Privacy in Regression

Haixia Liu, Ruifan Huang

Comments 20 pages

Details
English abstract

With the wide application of machine learning techniques in practice, privacy preservation has gained increasing attention. Protecting user privacy with minimal accuracy loss is a fundamental task in the data analysis and mining community. In this paper, we focus on regression tasks under $ε$-label differential privacy guarantees. Some existing methods for regression with $ε$-label differential privacy, such as the RR-On-Bins mechanism, discretize the output space into finitely many bins and then apply the randomized response (RR) algorithm. To determine these bins efficiently, the authors rounded the original responses down to integer values. However, such operations do not align well with real-world scenarios. To overcome these limitations, we model both original and randomized responses as continuous random variables, avoiding discretization entirely. Our novel approach estimates an optimal interval for randomized responses and introduces new algorithms designed for scenarios where a prior is either known or unknown. Additionally, we prove that our algorithm, RPWithPrior, guarantees $ε$-label differential privacy. Numerical results demonstrate that our approach outperforms the Gaussian, Laplace, Staircase, RRonBins, and Unbiased mechanisms on the Communities and Crime, Criteo Sponsored Search Conversion Log, and California Housing datasets.

2601.22602 2026-02-02 math.ST stat.ME stat.TH

A spectral approach for online covariance change point detection

Zhigang Bao, Kha Man Cheong, Yuji Li, Jiaxin Qiu

Details
English abstract

Change point detection in covariance structures is a fundamental problem for sequential data. In the high-dimensional setting, most existing research has focused on identifying change points in historical data. However, there is a significant lack of studies on the practically relevant online change point problem, which means promptly detecting change points as they occur. In this paper, applying the limiting theory of linear spectral statistics for random matrices, we propose a class of spectrum-based CUSUM-type statistics. We first construct a martingale from the difference of linear spectral statistics of sequential sample Fisher matrices, which converges to a Brownian motion. Our CUSUM-type statistic is then defined as the maximum of a variant of this process. Finally, we develop our detection procedure based on the invariance principle. Simulation results show that our detection method is highly sensitive to the occurrence of change points and is able to identify them shortly after they arise, outperforming the existing approaches.

2601.22600 2026-02-02 stat.ML cs.LG

An Efficient Algorithm for Thresholding Monte Carlo Tree Search

Shoma Nameki, Atsuyoshi Nakamura, Junpei Komiyama, Koji Tabata

Details
English abstract

We introduce the Thresholding Monte Carlo Tree Search problem, in which, given a tree $\mathcal{T}$ and a threshold $θ$, a player must answer whether the root node value of $\mathcal{T}$ is at least $θ$ or not. In the given tree, 'MAX' or 'MIN' is labeled on each internal node, and the value of a 'MAX'-labeled ('MIN'-labeled) internal node is the maximum (minimum) of its child values. The value of a leaf node is the mean reward of an unknown distribution, from which the player can sample rewards. For this problem, we develop a $δ$-correct sequential sampling algorithm based on the Track-and-Stop strategy that has asymptotically optimal sample complexity. We show that a ratio-based modification of the D-Tracking arm-pulling strategy leads to a substantial improvement in empirical sample complexity, as well as reducing the per-round computational cost from linear to logarithmic in the number of arms.
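
The root value defined in this abstract can be made concrete with a small sketch. It assumes the leaf means are already known, so it illustrates only the MAX/MIN recursion and the threshold question, not the paper's sampling algorithm. The `Node` class and the function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node: internal nodes carry a label, leaves a mean."""
    label: Optional[str] = None      # "MAX" or "MIN" for internal nodes
    mean: Optional[float] = None     # mean reward, used for leaves only
    children: List["Node"] = field(default_factory=list)


def node_value(node: Node) -> float:
    """Value of a node: leaf mean, or max/min of the child values."""
    if not node.children:
        return node.mean
    values = [node_value(child) for child in node.children]
    return max(values) if node.label == "MAX" else min(values)


def root_at_least(root: Node, theta: float) -> bool:
    """Answer the thresholding question for known leaf means."""
    return node_value(root) >= theta


# A MAX root over two MIN nodes: min(0.2, 0.9) = 0.2, min(0.5, 0.6) = 0.5.
tree = Node("MAX", children=[
    Node("MIN", children=[Node(mean=0.2), Node(mean=0.9)]),
    Node("MIN", children=[Node(mean=0.5), Node(mean=0.6)]),
])
print(node_value(tree))          # 0.5
print(root_at_least(tree, 0.4))  # True
```

In the actual problem the leaf means are unknown and must be estimated from sampled rewards, which is where the sequential sampling strategy comes in.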

2601.22572 2026-02-02 stat.ME

Propensity score weighted Cox regression for survival outcomes in observational studies with multiple or factorial treatments

Zixian Zhao, Chengxin Yang, Fan Li

Comments Correspondence: Fan Li, fl35@duke.edu

Details
English abstract

In observational studies with survival or time-to-event outcomes, a propensity score weighted marginal Cox proportional hazards model with the treatment variable as the only predictor is commonly used to estimate the causal marginal hazard ratio between two treatments. Observational studies often have more than two treatments, but corresponding analysis methods are limited. In this paper, we combine the propensity score weighting method for multiple treatments and a marginal Cox model with indicators for each treatment to estimate the causal hazard ratios between multiple treatments and a common reference treatment. We illustrate two weighting schemes: inverse probability of treatment weighting and overlap weighting. We prove the consistency of the maximum weighted partial likelihood estimator of the causal marginal hazard ratio and derive a robust sandwich variance estimator. As an important special case of multiple treatments, we elaborate on the Cox model for two-way factorial treatments. We apply the method to evaluate the real-world comparative effectiveness of three types of anti-obesity medications on heart failure. We develop an associated R package 'PSsurvival'.
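
The two weighting schemes named in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the API of the 'PSsurvival' package: the function names are hypothetical, the generalized propensity scores are taken as given, and the overlap weight uses the harmonic-mean tilting form $h(x) = (\sum_k 1/e_k(x))^{-1}$, which the abstract itself does not spell out.

```python
import numpy as np


def iptw_weights(ps, treatment):
    """Inverse probability of treatment weights: 1 / e_j(x) for the
    treatment j actually received. `ps` has one row per unit and one
    column per treatment; `treatment` holds column indices."""
    ps = np.asarray(ps, dtype=float)
    treatment = np.asarray(treatment)
    received = ps[np.arange(len(treatment)), treatment]
    return 1.0 / received


def overlap_weights(ps, treatment):
    """Generalized overlap weights h(x) / e_j(x), assuming the tilting
    function h(x) = 1 / sum_k (1 / e_k(x))."""
    ps = np.asarray(ps, dtype=float)
    treatment = np.asarray(treatment)
    received = ps[np.arange(len(treatment)), treatment]
    tilt = 1.0 / np.sum(1.0 / ps, axis=1)
    return tilt / received


ps = [[0.5, 0.25, 0.25], [0.2, 0.3, 0.5]]   # made-up propensity scores
treatment = [0, 2]                           # observed treatment indices
print(iptw_weights(ps, treatment))
print(overlap_weights(ps, treatment))
```

With two treatments the overlap weight reduces to the familiar $1 - e(x)$ for treated units and $e(x)$ for controls. Either set of weights would then be passed to a weighted Cox fit with treatment indicators as the only predictors.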

2601.22539 2026-02-02 cs.LG stat.CO stat.ML

Neural-Inspired Posterior Approximation (NIPA)

Babak Shahbaba, Zahra Moslemi

Comments 13 pages, 4 tables

Details
English abstract

Humans learn efficiently from their environment by engaging multiple interacting neural systems that support distinct yet complementary forms of control, including model-based (goal-directed) planning, model-free (habitual) responding, and episodic memory-based learning. Model-based mechanisms compute prospective action values using an internal model of the environment, supporting flexible but computationally costly planning; model-free mechanisms cache value estimates and build heuristics that enable fast, efficient habitual responding; and memory-based mechanisms allow rapid adaptation from individual experience. In this work, we aim to elucidate the computational principles underlying this biological efficiency and translate them into a sampling algorithm for scalable Bayesian inference through effective exploration of the posterior distribution. More specifically, our proposed algorithm comprises three components: a model-based module that uses the target distribution for guided but computationally slow sampling; a model-free module that uses previous samples to learn patterns in the parameter space, enabling fast, reflexive sampling without directly evaluating the expensive target distribution; and an episodic-control module that supports rapid sampling by recalling specific past events (i.e., samples). We show that this approach advances Bayesian methods and facilitates their application to large-scale statistical machine learning problems. In particular, we apply our proposed framework to Bayesian deep learning, with an emphasis on proper and principled uncertainty quantification.

2601.22525 2026-02-02 stat.ME

Group Sequential Methods for the Win Ratio

Tracy Bergemann, Tim Hanson

Comments 26 pages, 2 figures, 2 tables

Details
English abstract

The win ratio is increasingly used in randomized trials due to its intuitive clinical interpretation, its ability to incorporate the relative importance of composite endpoints, and its capacity to combine different types of outcomes (e.g., time-to-event, binary, and count outcomes). There are open questions, however, about how to implement adaptive design approaches when the primary endpoint is a win ratio, including in group sequential designs. A key requirement allowing for straightforward application of classical group sequential methods is the independence of incremental interim test statistics. This paper derives the covariance structure of incremental U-statistics that evaluate the win ratio under its asymptotic distribution. The derived covariance shows that the independent increments assumption holds for the asymptotic distribution of U-statistics that test the win ratio. Simulations confirm that traditional $α$-spending preserves Type I error across interim looks. A retrospective look at the IN.PACT SFA clinical trial data illustrates the potential for stopping early in a group sequential design using the win ratio. We have demonstrated that straightforward use of Lan-DeMets $α$-spending is possible for randomized trials involving the win ratio under certain common conditions. Thus, existing software capable of computing traditional group sequential boundaries can be employed.
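
For readers unfamiliar with the statistic, a minimal sketch of a win ratio is given below. It assumes a single numeric outcome where larger is better and compares every treatment patient with every control patient; trials typically use a hierarchy of outcomes, and the group sequential machinery of the paper is not reproduced here. The function name is hypothetical.

```python
def win_ratio(treatment, control):
    """Number of treatment wins divided by number of treatment losses
    over all treatment-control pairs; ties count for neither side."""
    wins = sum(t > c for t in treatment for c in control)
    losses = sum(t < c for t in treatment for c in control)
    if losses == 0:
        raise ValueError("win ratio is undefined when there are no losses")
    return wins / losses


# Pairs: (3,1) win, (3,4) loss, (5,1) win, (5,4) win -> 3 wins, 1 loss.
print(win_ratio([3, 5], [1, 4]))  # 3.0
```

Because wins and losses are sums over pairs of patients, the statistic is built from U-statistics, which is why the covariance of its increments across interim looks needs a dedicated derivation.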

2601.22441 2026-02-02 stat.ML cs.LG

Simulation-based Bayesian inference with ameliorative learned summary statistics -- Part I

Getachew K. Befekadu

Comments 13 pages

Details
English abstract

This paper, Part I of a two-part series, considers simulation-based inference with learned summary statistics, in which the learned summary statistic serves as an empirical likelihood with ameliorative effects in the Bayesian setting, when the exact likelihood function associated with the observation data and the simulation model is difficult to obtain in closed form or is computationally intractable. In particular, a transformation technique that leverages the Cressie-Read discrepancy criterion under moment restrictions is used for summarizing the learned statistics between the observation data and the simulation outputs, while preserving the statistical power of the inference. This transformation from data to learned summary statistics also allows the simulation outputs to be conditioned on the observation data, so that the inference task can be performed over certain sample sets of the observation data that are considered empirically relevant or believed to be of particular importance. Moreover, the simulation-based inference framework discussed in this paper can be extended further to handle weakly dependent observation data. Finally, we remark that such an inference framework is suitable for implementation in distributed computing, i.e., computational tasks involving both the data-to-learned summary statistics and the Bayesian inference problem can be posed as a unified distributed inference problem that exploits distributed optimization and MCMC algorithms to support large datasets associated with complex simulation models.

2601.22380 2026-02-02 stat.ME stat.CO

Mixed Latent Position Cluster Models for Networks

Chaoyi Lu, Riccardo Rastelli

Details
English abstract

Over the last two decades, the Latent Position Model (LPM) has become a prominent tool to obtain model-based visualizations of networks. However, the geometric structure of the LPM is inherently symmetric, in the sense that outgoing and incoming edges are assumed to follow the same statistical distribution. As a consequence, the canonical LPM framework is not ideal for the analysis of directed networks. In addition, edges may be weighted to describe the duration or intensity of a connection. This can lead to disassortative patterns and other motifs that cannot be easily captured by the underlying geometry. To address these limitations, we develop a novel extension of the LPM, called the Mixed Latent Position Cluster Model (MLPCM), which can deal with asymmetry and non-Euclidean patterns, while providing new interpretations of the latent space. We dissect the directed edges of the network by formally disentangling how a node behaves from how it is perceived by others. This leads to a dual representation of a node's profile, identifying its "overt" and "covert" social positions. In order to efficiently estimate the parameters of our model, we develop a variational Bayes approach to approximate the posterior distribution. Unlike many existing variational frameworks, our algorithm does not require any additional numerical approximations. Model selection is performed by introducing a novel partially integrated complete likelihood criterion, which builds upon the literature on penalized likelihood methods. We demonstrate the accuracy of our proposed methodology using synthetic datasets, and we illustrate its practical utility with an application to a dataset of international arms transfers.

2601.22336 2026-02-02 stat.ML cs.LG stat.ME

Dependence-Aware Label Aggregation for LLM-as-a-Judge via Ising Models

Krishnakumar Balasubramanian, Aleksandr Podkopaev, Shiva Prasad Kasiviswanathan

Details
English abstract

Large-scale AI evaluation increasingly relies on aggregating binary judgments from $K$ annotators, including LLMs used as judges. Most classical methods, e.g., Dawid-Skene or (weighted) majority voting, assume annotators are conditionally independent given the true label $Y\in\{0,1\}$, an assumption often violated by LLM judges due to shared data, architectures, prompts, and failure modes. Ignoring such dependencies can yield miscalibrated posteriors and even confidently incorrect predictions. We study label aggregation through a hierarchy of dependence-aware models based on Ising graphical models and latent factors. For class-dependent Ising models, the Bayes log-odds is generally quadratic in votes; for class-independent couplings, it reduces to a linear weighted vote with correlation-adjusted parameters. We present finite-$K$ examples showing that methods based on conditional independence can flip the Bayes label despite matching per-annotator marginals. We prove separation results demonstrating that these methods remain strictly suboptimal as the number of judges grows, incurring nonvanishing excess risk under latent factors. Finally, we evaluate the proposed method on three real-world datasets, demonstrating improved performance over the classical baselines.

2601.22335 2026-02-02 cs.LG stat.ML

Knowledge Gradient for Preference Learning

Kaiwen Wu, Jacob R. Gardner

Details
English abstract

The knowledge gradient is a popular acquisition function in Bayesian optimization (BO) for optimizing black-box objectives with noisy function evaluations. Many practical settings, however, allow only pairwise comparison queries, yielding a preferential BO problem where direct function evaluations are unavailable. Extending the knowledge gradient to preferential BO is hindered by its computational cost. At its core, the look-ahead step in the preferential setting requires computing a non-Gaussian posterior, which was previously considered intractable. In this paper, we address this challenge by deriving an exact and analytical knowledge gradient for preferential BO. We show that the exact knowledge gradient performs strongly on a suite of benchmark problems, often outperforming existing acquisition functions. We also present a case study illustrating the limitation of the knowledge gradient in certain scenarios.

2601.22331 2026-02-02 cs.LG stat.CO

Scalable Batch Correction for Cell Painting via Batch-Dependent Kernels and Adaptive Sampling

Aditya Narayan Ravi, Snehal Vadvalkar, Abhishek Pandey, Ilan Shomorony

Comments 40 pages, many figures

Details
English abstract

Cell Painting is a microscopy-based, high-content imaging assay that produces rich morphological profiles of cells and can support drug discovery by quantifying cellular responses to chemical perturbations. At scale, however, Cell Painting data is strongly affected by batch effects arising from differences in laboratories, instruments, and protocols, which can obscure biological signal. We present BALANS (Batch Alignment via Local Affinities and Subsampling), a scalable batch-correction method that aligns samples across batches by constructing a smoothed affinity matrix from pairwise distances. Given $n$ data points, BALANS builds a sparse affinity matrix $A \in \mathbb{R}^{n \times n}$ using two ideas. (i) For points $i$ and $j$, it sets a local scale using the distance from $i$ to its $k$-th nearest neighbor within the batch of $j$, then computes $A_{ij}$ via a Gaussian kernel calibrated by these batch-aware local scales. (ii) Rather than forming all $n^2$ entries, BALANS uses an adaptive sampling procedure that prioritizes rows with low cumulative neighbor coverage and retains only the strongest affinities per row, yielding a sparse but informative approximation of $A$. We prove that this sampling strategy is order-optimal in sample complexity and provides an approximation guarantee, and we show that BALANS runs in nearly linear time in $n$. Experiments on diverse real-world Cell Painting datasets and controlled large-scale synthetic benchmarks demonstrate that BALANS scales to large collections while improving runtime over native implementations of widely used batch-correction methods, without sacrificing correction quality.
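
Idea (i) in this abstract can be illustrated with a dense toy sketch. It is not the BALANS implementation: the exact kernel normalization is an assumption (here $A_{ij} = \exp(-d_{ij}^2 / \sigma_{i,b(j)}^2)$, with $\sigma_{i,b(j)}$ the distance from $i$ to its $k$-th nearest neighbor in the batch of $j$), all $n^2$ entries are formed, and the adaptive sampling and sparsification of idea (ii) are omitted. The function name is hypothetical, and the sketch assumes distinct points and more than $k$ points per batch.

```python
import numpy as np


def batch_aware_affinity(X, batches, k=1):
    """Dense Gaussian affinity with batch-aware local scales (assumed form)."""
    X = np.asarray(X, dtype=float)
    batches = np.asarray(batches)
    n = X.shape[0]
    # Pairwise Euclidean distances d_ij.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    labels = np.unique(batches)
    # scale[i, b] = distance from point i to its k-th nearest neighbor in batch b.
    scale = np.empty((n, labels.size))
    for col, b in enumerate(labels):
        members = np.flatnonzero(batches == b)
        for i in range(n):
            others = members[members != i]      # never count i as its own neighbor
            scale[i, col] = np.sort(D[i, others])[k - 1]
    # sigma[i, j] = scale of point i with respect to the batch of point j.
    col_of = np.searchsorted(labels, batches)
    sigma = scale[np.arange(n)[:, None], col_of[None, :]]
    return np.exp(-(D ** 2) / (sigma ** 2))


X = [[0.0], [1.0], [3.0], [10.0], [11.0], [13.0]]
batches = [0, 0, 0, 1, 1, 1]
A = batch_aware_affinity(X, batches, k=1)
print(np.round(A, 3))
```

Because the scale depends on the batch of $j$, a point's nearest neighbor in another batch receives the same affinity as its nearest neighbor in its own batch, which is what lets the kernel align batches that sit far apart. The resulting matrix is generally not symmetric.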

2601.22326 2026-02-02 cs.LG stat.AP

Label-Efficient Monitoring of Classification Models via Stratified Importance Sampling

Lupo Marsigli, Angel Lopez de Haro

Comments 24 pages

Details
English abstract

Monitoring the performance of classification models in production is critical yet challenging due to strict labeling budgets, one-shot batch acquisition of labels and extremely low error rates. We propose a general framework based on Stratified Importance Sampling (SIS) that directly addresses these constraints in model monitoring. While SIS has previously been applied in specialized domains, our theoretical analysis establishes its broad applicability to the monitoring of classification models. Under mild conditions, SIS yields unbiased estimators with strict finite-sample mean squared error (MSE) improvements over both importance sampling (IS) and stratified random sampling (SRS). The framework does not rely on optimally defined proposal distributions or strata: even with noisy proxies and sub-optimal stratification, SIS can improve estimator efficiency compared to IS or SRS individually, though extreme proposal mismatch may limit these gains. Experiments across binary and multiclass tasks demonstrate consistent efficiency improvements under fixed label budgets, underscoring SIS as a principled, label-efficient, and operationally lightweight methodology for post-deployment model monitoring.

2601.22282 2026-02-02 stat.AP

A Time-Varying Branching Process Approach to Model Self-Renewing Cells

Huyen Nguyen, Haim Bar, Zhiyi Chi, Vladimir Pozdnyakov

Details
English abstract

Stem cells, through their ability to produce daughter stem cells and differentiate into specialized cells, are essential in the growth, maintenance, and repair of biological tissues. Understanding the dynamics of cell populations in the proliferation process not only uncovers proliferative properties of stem cells, but also offers insight into tissue development under both normal conditions and pathological disruption. In this paper, we develop a continuous-time branching process model with a time-dependent offspring distribution to characterize the stem cell proliferation process. We derive analytical expressions for the mean, variance, and autocovariance of the stem cell counts, and develop likelihood-based inference procedures to estimate model parameters. In particular, we construct a forward-algorithm likelihood to handle situations in which some cell types cannot be directly observed. Simulation results demonstrate that our estimation method recovers the time-dependent division probabilities with good accuracy.

2601.22206 2026-02-02 cs.LG stat.ME stat.ML

Causal Imitation Learning Under Measurement Error and Distribution Shift

Shi Bo, AmirEmad Ghassami

Comments 28 pages, 3 figures

Details
English abstract

We study offline imitation learning (IL) when part of the decision-relevant state is observed only through noisy measurements and the distribution may change between training and deployment. Such settings induce spurious state-action correlations, so standard behavioral cloning (BC) -- whether conditioning on raw measurements or ignoring them -- can converge to systematically biased policies under distribution shift. We propose a general framework for IL under measurement error, inspired by explicitly modeling the causal relationships among the variables, yielding a target that retains a causal interpretation and is robust to distribution shift. Building on ideas from proximal causal inference, we introduce CausIL, which treats noisy state observations as proxy variables, and we provide identification conditions under which the target policy is recoverable from demonstrations without rewards or interactive expert queries. We develop estimators for both discrete and continuous state spaces; for continuous settings, we use an adversarial procedure over RKHS function classes to learn the required parameters. We evaluate CausIL on semi-simulated longitudinal data from the PhysioNet/Computing in Cardiology Challenge 2019 cohort and demonstrate improved robustness to distribution shift compared to BC baselines.

2601.22200 2026-02-02 q-fin.ST cs.LG cs.MS cs.NA math.NA stat.ML

Adaptive Benign Overfitting (ABO): Overparameterized RLS for Online Learning in Non-stationary Time-series

Luis Ontaneda Mijares, Nick Firoozye

Comments 32 pages, 3 figures, 10 tables

Details
English abstract

Overparameterized models have recently challenged conventional learning theory by exhibiting improved generalization beyond the interpolation limit, a phenomenon known as benign overfitting. This work introduces Adaptive Benign Overfitting (ABO), extending the recursive least-squares (RLS) framework to this regime through a numerically stable formulation based on orthogonal-triangular updates. A QR-based exponentially weighted RLS (QR-EWRLS) algorithm is introduced, combining random Fourier feature mappings with forgetting-factor regularization to enable online adaptation under non-stationary conditions. The orthogonal decomposition prevents the numerical divergence associated with covariance-form RLS while retaining adaptability to evolving data distributions. Experiments on nonlinear synthetic time series confirm that the proposed approach maintains bounded residuals and stable condition numbers while reproducing the double-descent behavior characteristic of overparameterized models. Applications to forecasting foreign exchange and electricity demand show that ABO is highly accurate (comparable to baseline kernel methods) while achieving speed improvements of between 20 and 40 percent. The results provide a unified view linking adaptive filtering, kernel approximation, and benign overfitting within a stable online learning framework.

2601.22170 2026-02-02 math.NA cs.LG cs.NA stat.ML

Large Language Models: A Mathematical Formulation

Ricardo Baptista, Andrew Stuart, Son Tran

Comments 51 pages, 2 figures

Details
English abstract

Large language models (LLMs) process and predict sequences containing text to answer questions, and address tasks including document summarization, providing recommendations, writing software and solving quantitative problems. We provide a mathematical framework for LLMs by describing the encoding of text sequences into sequences of tokens, defining the architecture for next-token prediction models, explaining how these models are learned from data, and demonstrating how they are deployed to address a variety of tasks. The mathematical sophistication required to understand this material is not high, and relies on straightforward ideas from information theory, probability and optimization. Nonetheless, the combination of ideas resting on these different components from the mathematical sciences yields a complex algorithmic structure; and this algorithmic structure has demonstrated remarkable empirical successes. The mathematical framework established here provides a platform from which it is possible to formulate and address questions concerning the accuracy, efficiency and robustness of the algorithms that constitute LLMs. The framework also suggests directions for development of modified and new methodologies.

2512.21399 2026-02-02 stat.ME stat.AP

Standardized Descriptive Index for Measuring Deviation and Uncertainty in Psychometric Indicators

Mark Dominique Dalipe Muñoz

Comments 21 pages, 4 figures, 1 table

Details
English abstract

The use of descriptive statistics in pilot testing procedures requires objective, standard diagnostic tools that are feasible for small sample sizes. While current psychometric practices report item-level statistics, they often report these raw descriptives separately rather than consolidating both mean and standard deviation into a single diagnostic tool to directly measure item quality. By leveraging the analytical properties of Cohen's d, this article repurposes its use in scale development as a standardized item deviation index. This measures the extent of an item's raw deviation relative to its scale midpoint while accounting for its own uncertainty. Analytical properties such as boundedness, scale invariance, and bias are explored to further understand how the index values behave, which will aid future efforts to establish empirical thresholds that characterize redundancy among formative indicators and consistency among reflective indicators.
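
The index described in this abstract can be read as a one-sample Cohen's d against the scale midpoint. The sketch below follows that reading; the function name is hypothetical, and the use of the sample standard deviation (denominator $n - 1$) is an assumption that the abstract does not confirm.

```python
import statistics


def item_deviation_index(responses, scale_min, scale_max):
    """(item mean - scale midpoint) / item standard deviation."""
    midpoint = (scale_min + scale_max) / 2
    mean = statistics.fmean(responses)
    sd = statistics.stdev(responses)   # sample SD, n - 1 denominator (assumed)
    if sd == 0:
        raise ValueError("index is undefined when all responses are identical")
    return (mean - midpoint) / sd


# Five responses on a 1-5 scale: mean 4, midpoint 3, SD sqrt(0.5).
print(item_deviation_index([4, 5, 3, 4, 4], 1, 5))  # about 1.414
```

A value near zero indicates an item centered on the midpoint, while a large absolute value indicates responses that cluster tightly away from it.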

2512.20523 2026-02-02 econ.EM cs.LG math.ST stat.ME stat.ML stat.TH

ScoreMatchingRiesz: Score Matching for Debiased Machine Learning and Policy Path Estimation

Masahiro Kato

Details
English abstract

We propose ScoreMatchingRiesz, a family of Riesz representer estimators based on score matching. The Riesz representer is a key nuisance component in debiased machine learning, enabling $\sqrt{n}$-consistent and asymptotically efficient estimation of causal and structural targets via Neyman-orthogonal scores. We formulate Riesz representer estimation as a score estimation problem. This perspective stabilizes representer estimation by allowing us to leverage denoising score matching and telescoping density ratio estimation. We also introduce the policy path, a parameter that captures how policy effects evolve under continuous treatments. We show that the policy path can be estimated via score matching by smoothly connecting average marginal effect (AME) and average policy effect (APE) estimation, which improves the interpretability of policy effects.

2512.03393 2026-02-02 cs.LG stat.ML

Tuning-Free Structured Sparse Recovery of Multiple Measurement Vectors using Implicit Regularization

Lakshmi Jayalal, Sheetal Kalyani

Details
English abstract

Recovering jointly sparse signals in the multiple measurement vectors (MMV) setting is a fundamental problem in machine learning, but traditional methods often require careful parameter tuning or prior knowledge of the sparsity of the signal and/or noise variance. We propose a tuning-free framework that leverages implicit regularization (IR) from overparameterization to overcome this limitation. Our approach reparameterizes the estimation matrix into factors that decouple the shared row-support from individual vector entries and applies gradient descent to a standard least-squares objective. We prove that with a sufficiently small and balanced initialization, the optimization dynamics exhibit a "momentum-like" effect where the true support grows significantly faster. Leveraging a Lyapunov-based analysis of the gradient flow, we further establish formal guarantees that the solution trajectory converges towards an idealized row-sparse solution. Empirical results demonstrate that our tuning-free approach achieves performance comparable to optimally tuned established methods. Furthermore, our framework significantly outperforms these baselines in scenarios where accurate priors are unavailable to the baselines.

2511.10625 2026-02-02 math.ST stat.ME stat.TH

Model-oriented Graph Distances via Partially Ordered Sets

Armeen Taeb, F. Richard Guo, Leonard Henckel

Details
English abstract

A well-defined distance on the parameter space is key to evaluating estimators, ensuring consistency, and building confidence sets. While there are typically standard distances to adopt in a continuous space, this is not the case for combinatorial parameters such as graphs that represent statistical models. Defined on the graphs alone, existing proposals like the structural Hamming distance ignore the structure of the model space and can thus exhibit undesirable behaviors. We propose a model-oriented framework for defining the distance between graphs that is applicable across different graph classes. Our approach treats each graph as a statistical model and organizes the graphs in a partially ordered set based on model inclusion. This induces a neighborhood structure, from which we define the model-oriented distance as the length of a shortest path through neighbors, yielding a metric in the space of graphs. We apply this framework to probabilistic undirected graphs, causal directed acyclic graphs, probabilistic completed partially directed acyclic graphs, and causal maximally oriented partially directed acyclic graphs. We analyze theoretical and empirical behaviors of the model-oriented distance and draw comparison with existing distances. By exploiting the underlying poset structures, we develop algorithms for computing and bounding the proposed distance that scale to moderate-sized graphs.

2510.21800 2026-02-02 cs.LG math.OC stat.ML

MARS-M: When Variance Reduction Meets Matrices

Yifeng Liu, Angela Yuan, Quanquan Gu

Details
English abstract

Matrix-based preconditioned optimizers, such as Muon, have recently been shown to be more efficient than scalar-based optimizers for training large-scale neural networks, including large language models (LLMs). Recent benchmark studies of LLM pretraining optimizers have demonstrated that variance-reduction techniques such as MARS can substantially speed up training compared with standard optimizers that do not employ variance reduction. In this paper, we introduce MARS-M, a new optimizer that integrates MARS-style variance reduction with Muon. Under standard regularity conditions, we prove that MARS-M converges to a first-order stationary point at a rate of $\tilde{\mathcal{O}}(T^{-1/3})$, improving upon the $\tilde{\mathcal{O}}(T^{-1/4})$ rate attained by Muon. Empirical results on language modeling and computer vision tasks demonstrate that MARS-M consistently yields lower losses and improved performance across various downstream benchmarks. The implementation of MARS-M is available at https://github.com/AGI-Arena/MARS/tree/main/MARS_M.

2510.02291 2026-02-02 cs.LG cs.CV stat.ML

Test-Time Anchoring for Discrete Diffusion Posterior Sampling

Litu Rout, Andreas Lugmayr, Yasamin Jafarian, Srivatsan Varadharajan, Constantine Caramanis, Sanjay Shakkottai, Ira Kemelmacher-Shlizerman

Comments Preprint

Details
English abstract

While continuous diffusion models have achieved remarkable success, discrete diffusion offers a unified framework for jointly modeling text and images. Beyond unification, discrete diffusion provides faster inference, finer control, and principled training-free guidance, making it well-suited for posterior sampling. Existing approaches to posterior sampling using discrete diffusion face severe challenges: derivative-free guidance yields sparse signals, continuous relaxations limit applicability, and split Gibbs samplers suffer from the curse of dimensionality. To overcome these limitations, we introduce Anchored Posterior Sampling (APS), built on two key innovations: quantized expectation for gradient-like guidance in discrete embedding space, and anchored remasking for adaptive decoding. APS achieves state-of-the-art performance among discrete diffusion samplers on both linear and nonlinear inverse problems across the standard image benchmarks. We demonstrate the generality of APS through training-free stylization and text-guided editing. We further apply APS to a large-scale diffusion language model, showing consistent improvement in question answering.

2509.22122 2026-02-02 econ.EM cs.LG math.ST stat.ME stat.ML stat.TH

Direct Bias-Correction Term Estimation for Average Treatment Effect Estimation

Masahiro Kato

Details
English abstract

This study considers the estimation of the direct bias-correction term for estimating the average treatment effect (ATE). Let $\{(X_i, D_i, Y_i)\}_{i=1}^{n}$ be the observations, where $X_i$ denotes $K$-dimensional covariates, $D_i \in \{0, 1\}$ denotes a binary treatment assignment indicator, and $Y_i$ denotes an outcome. In ATE estimation, $h_0(D_i, X_i) = \frac{1[D_i = 1]}{e_0(X_i)} - \frac{1[D_i = 0]}{1 - e_0(X_i)}$ is called the bias-correction term, where $e_0(X_i)$ is the propensity score. The bias-correction term is also referred to as the Riesz representer or clever covariates, depending on the literature, and plays an important role in construction of efficient ATE estimators. In this study, we propose estimating $h_0$ by directly minimizing the Bregman divergence between its model and $h_0$, which includes squared error and Kullback--Leibler divergence as special cases. Our proposed method is inspired by direct density ratio estimation methods and generalizes existing bias-correction term estimation methods, such as covariate balancing weights, Riesz regression, and nearest neighbor matching. Importantly, under specific choices of bias-correction term models and Bregman divergence, we can automatically ensure the covariate balancing property. Thus, our study provides a practical modeling and estimation approach through a generalization of existing methods.
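
The bias-correction term $h_0$ defined in this abstract is straightforward to evaluate once a propensity score is available. The sketch below plugs in known propensity scores purely for illustration; it does not implement the paper's Bregman-divergence estimator, and the function names are hypothetical. The second function uses the standard fact that averaging $h_0(D_i, X_i) Y_i$ gives the inverse probability weighting estimate of the ATE.

```python
import numpy as np


def bias_correction_term(d, e):
    """h_0(D, X) = 1[D = 1] / e_0(X) - 1[D = 0] / (1 - e_0(X))."""
    d = np.asarray(d)
    e = np.asarray(e, dtype=float)
    return (d == 1) / e - (d == 0) / (1.0 - e)


def ipw_ate(y, d, e):
    """Inverse probability weighting estimate: mean of h_0(D, X) * Y."""
    y = np.asarray(y, dtype=float)
    return float(np.mean(bias_correction_term(d, e) * y))


d = [1, 0]            # treatment indicators (made-up data)
e = [0.5, 0.25]       # propensity scores e_0(X_i), assumed known here
y = [1.0, 3.0]        # outcomes
print(bias_correction_term(d, e))  # [ 2.         -1.33333333]
print(ipw_ate(y, d, e))            # -1.0
```

Small propensity scores in the denominators are what make plug-in estimates of $h_0$ unstable, which is the usual motivation for estimating the term directly.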

2509.06468 2026-02-02 q-fin.ST stat.AP

The use of financial and sustainability ratios to map a sector. An approach using compositional data

Elena Rondós-Casas, Germà Coenders, Miquel Carreras-Simó, Núria Arimany-Serrat

Comments 19 pages, 1 table, 1 figure, 8551 words

English abstract

Purpose: The article aims to visualise in a single graph fish and meat processing company groups in Spain with respect to long-term solvency, energy, waste and water intensity and gender employment gap. Design/methodology/approach: The selected financial, environmental and social indicators are ratios, which require specific statistical analysis methods to prevent severe skewness and outliers. We use the compositional data methodology and the principal-component analysis biplot. Findings: Fish-processing companies have more homogeneous financial, environmental and social performance than their meat-processing counterparts. Specific company groups in both sectors can be identified as poor performers in some of the indicators. Firms with higher solvency tend to be less efficient in energy and water use. Two clusters of company groups with similar performances are identified. Research limitations/implications: As of now, few firms publish reports according to the EU Corporate Sustainability Reporting Directive. In future research larger samples will be available. Social Implications: Firm groups can visually see their areas of improvement in their financial, environmental and social performance compared to their competitors in the sector. Originality/value: This is the first time in which visualization tools have combined financial, environmental and social indicators. All individual firms can be visually ordered along all indicators simultaneously.

2508.11843 2026-02-02 stat.ME

Post-selection inference with a single realization of a network

Ethan Ancell, Daniela Witten, Daniel Kessler

Comments 67 pages, 10 figures

English abstract

Given a dataset consisting of a single realization of a network, we consider conducting inference on a parameter selected from the data. In particular, we focus on the setting where the parameter of interest is a linear combination of the mean connectivities within and between estimated communities. Inference in this setting poses a challenge, since the communities are themselves estimated from the data. Furthermore, since only a single realization of the network is available, sample splitting is not possible. In this paper, we show that it is possible to split a single realization of a network consisting of $n$ nodes into two (or more) networks involving the same $n$ nodes; the first network can be used to select a data-driven parameter, and the second to conduct inference on that parameter. In the case of weighted networks with Poisson or Gaussian edges, we obtain two independent realizations of the network; by contrast, in the case of Bernoulli edges, the two realizations are dependent, and so extra care is required. We establish the theoretical properties of our estimators, in the sense of confidence intervals that attain the nominal (selective) coverage, and demonstrate their utility in numerical simulations and in application to a dataset representing the relationships among dolphins in Doubtful Sound, New Zealand.
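
For Poisson edges, one standard way to obtain two independent networks on the same nodes is binomial thinning. If $A_{ij} \sim \mathrm{Poisson}(\lambda_{ij})$ and each of the $A_{ij}$ counts is kept independently with probability $\epsilon$, the kept and discarded counts are independent Poisson variables with means $\epsilon\lambda_{ij}$ and $(1-\epsilon)\lambda_{ij}$. The sketch below illustrates that identity only. Whether it matches the authors' exact construction is an assumption, and all names and parameter values are hypothetical.

```python
import math
import random


def sample_poisson(rng, rate):
    """Draw one Poisson(rate) variate with Knuth's multiplication method (fine for small rates)."""
    threshold, count, product = math.exp(-rate), 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return count
        count += 1


def thin_edges(adjacency, eps, rng):
    """Split each undirected edge count into a Binomial(count, eps) part and the remainder."""
    n = len(adjacency)
    first = [[0] * n for _ in range(n)]
    second = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            kept = sum(1 for _ in range(adjacency[i][j]) if rng.random() < eps)
            first[i][j] = first[j][i] = kept
            second[i][j] = second[j][i] = adjacency[i][j] - kept
    return first, second


# Hypothetical example: 20 nodes, every edge weight drawn from Poisson(3).
rng = random.Random(0)
n_nodes = 20
adjacency = [[0] * n_nodes for _ in range(n_nodes)]
for i in range(n_nodes):
    for j in range(i + 1, n_nodes):
        adjacency[i][j] = adjacency[j][i] = sample_poisson(rng, 3.0)

# One network would be used to estimate communities, the other for inference.
selection_net, inference_net = thin_edges(adjacency, 0.5, rng)
```

The same trick does not give independent pieces for Bernoulli edges, which is consistent with the abstract's remark that the Bernoulli case requires extra care.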

2506.08337 2026-02-02 cs.LG stat.ML

Diffusion Models under Alternative Noise: Simplified Analysis and Sensitivity

Juhyeok Choi, Chenglin Fan

Comments 19 pages

English abstract

Diffusion models, typically formulated as discretizations of stochastic differential equations (SDEs), have achieved state-of-the-art performance in generative tasks. However, their theoretical analysis often involves complex proofs. In this work, we present a simplified framework for analyzing the Euler--Maruyama discretization of variance-preserving SDEs (VP-SDEs). Using Grönwall's inequality, we derive a convergence rate of $O(T^{-1/2})$ under standard Lipschitz assumptions, streamlining prior analyses. We then demonstrate that the standard Gaussian noise can be replaced by computationally cheaper discrete random variables (e.g., Rademacher) without sacrificing this convergence guarantee, provided the mean and variance are matched. Our experiments validate this theory, showing that (i) discrete noise achieves sample quality comparable to Gaussian noise provided the variance is matched correctly, and (ii) performance degrades if the noise variance is scaled incorrectly.
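
The noise-replacement claim can be illustrated with a minimal sketch. It assumes the forward VP-SDE $dx = -\tfrac{1}{2}\beta x\,dt + \sqrt{\beta}\,dW$ with a constant $\beta$, discretized by Euler--Maruyama with step size $h$. The constant schedule, the step size, and the function names are assumptions for illustration and are not taken from the paper. Because Rademacher draws match the mean (0) and variance (1) of standard Gaussian draws, both runs should end with a sample variance close to 1.

```python
import math
import random


def simulate_vp_sde(noise, n_paths=2000, n_steps=250, h=0.02, beta=1.0, seed=0):
    """Euler--Maruyama for dx = -0.5*beta*x dt + sqrt(beta) dW, started at x = 0."""
    rng = random.Random(seed)
    draw = {
        "gaussian": lambda: rng.gauss(0.0, 1.0),
        # Rademacher: +1 or -1 with equal probability (mean 0, variance 1).
        "rademacher": lambda: 1.0 if rng.random() < 0.5 else -1.0,
    }[noise]
    finals = []
    for _ in range(n_paths):
        x = 0.0
        for _ in range(n_steps):
            x += -0.5 * beta * x * h + math.sqrt(beta * h) * draw()
        finals.append(x)
    return finals


def sample_variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


var_gauss = sample_variance(simulate_vp_sde("gaussian"))
var_rademacher = sample_variance(simulate_vp_sde("rademacher"))
print(var_gauss, var_rademacher)
```

Multiplying the Rademacher draws by a constant other than 1 changes the limiting variance, which corresponds to the mis-scaled regime the abstract reports as degrading performance.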

2506.06185 2026-02-02 cs.LG cs.NA math.NA stat.CO stat.ML

Antithetic Noise in Diffusion Models

Jing Jia, Sifan Liu, Bowen Song, Wei Yuan, Liyue Shen, Guanyang Wang

Comments Code: https://github.com/jjia131/Antithetic-Noise-in-Diffusion-Models-page, Project Page: https://jjia131.github.io/Antithetic-Noise-in-Diffusion-Models-page/, Blog: https://jjia131.github.io/Antithetic-Noise-in-Diffusion-Models-page/static/blog/blog.html

English abstract

We systematically study antithetic initial noise in diffusion models, discovering that pairing each noise sample with its negation consistently produces strong negative correlation. This universal phenomenon holds across datasets, model architectures, conditional and unconditional sampling, and even other generative models such as VAEs and Normalizing Flows. To explain it, we combine experiments and theory and propose a \textit{symmetry conjecture} that the learned score function is approximately affine antisymmetric (odd symmetry up to a constant shift), supported by empirical evidence. This negative correlation leads to substantially more reliable uncertainty quantification with up to $90\%$ narrower confidence intervals. We demonstrate these gains on tasks including estimating pixel-wise statistics and evaluating diffusion inverse solvers. We also provide extensions with randomized quasi-Monte Carlo noise designs for uncertainty quantification, and explore additional applications of the antithetic noise design to improve image editing and generation diversity. Our framework is training-free, model-agnostic, and adds no runtime overhead. Code is available at https://github.com/jjia131/Antithetic-Noise-in-Diffusion-Models-page.
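
The antithetic pairing can be sketched on a toy map. The function `toy_generator` below is a hypothetical stand-in for a statistic of a generated sample, not a diffusion model. It is built as a constant plus an odd part plus a small even part, loosely mirroring the "approximately affine antisymmetric" conjecture. Under that assumption, the outputs for $z$ and $-z$ should be strongly negatively correlated, so averaging each pair has lower variance than averaging two independent draws.

```python
import math
import random


def toy_generator(z):
    # Hypothetical stand-in: constant shift + odd part + small even perturbation.
    return 1.0 + math.tanh(z) + 0.1 * z * z


def correlation(xs, ys):
    """Pearson correlation of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(5000)]
outputs = [toy_generator(z) for z in noise]
antithetic_outputs = [toy_generator(-z) for z in noise]  # same noise, negated

pair_correlation = correlation(outputs, antithetic_outputs)
# Antithetic estimator: average each output with its negated-noise partner.
pair_means = [(a + b) / 2 for a, b in zip(outputs, antithetic_outputs)]
print(pair_correlation)
```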

2506.03104 2026-02-02 stat.ME stat.AP

Two-Phase Treatment with Noncompliance: Identifying the Cumulative Average Treatment Effect via Multisite Instrumental Variables

Guanglei Hong, Xu Qin, Zhengyan Xu, Fan Yang

Comments 36 pages; 1 figure; 6 tables

English abstract

When evaluating a two-phase intervention, the cumulative average treatment effect (ATE) is often the primary causal estimand of interest. However, some individuals who do not respond well to the Phase I treatment may subsequently display noncompliant behaviors. At the same time, exposure to the Phase I treatment is expected to directly influence an individual's potential outcomes, thereby violating the exclusion restriction. Building on an instrumental variable (IV) strategy for multisite trials, we clarify the conditions under which the cumulative ATE of a two-phase treatment can be identified by employing the random assignment of the Phase I treatment as the instrument. Our strategy relaxes both the conventional exclusion restriction and sequential ignorability assumptions. We assess the performance of the new strategy through simulation studies. Additionally, we reanalyze data from the Tennessee class size study, in which students and teachers were randomly assigned to either small or regular class types in kindergarten (Phase I) with noncompliance emerging in Grade 1 (Phase II). Applying our new strategy, we estimate the cumulative ATE of receiving two consecutive years of instruction in a small versus regular class.

2505.17965 2026-02-02 math.OC cs.LG stat.ML

Bias-Optimal Bounds for SGD: A Computer-Aided Lyapunov Analysis

Daniel Cortild, Lucas Ketels, Juan Peypouquet, Guillaume Garrigos

Comments 35 pages, 2 figures. Under review

English abstract

The non-asymptotic analysis of Stochastic Gradient Descent (SGD) typically yields bounds that decompose into a bias term and a variance term. In this work, we focus on the bias component and study the extent to which SGD can match the optimal convergence behavior of deterministic gradient descent. Assuming only (strong) convexity and smoothness of the objective, we derive new bounds that are bias-optimal, in the sense that the bias term coincides with the worst-case rate of gradient descent. Our results hold for the full range of constant step-sizes $γL \in (0,2)$, including critical and large step-size regimes that were previously unexplored without additional variance assumptions. The bounds are obtained through the construction of a simple Lyapunov energy whose monotonicity yields sharp convergence guarantees. To design the parameters of this energy, we employ the Performance Estimation Problem framework, which we also use to provide numerical evidence for the optimality of the associated variance terms.

2504.19726 2026-02-02 stat.ME

Discrimination performance in illness-death models with interval-censored disease data

Marta Spreafico, Anja J. Rueten-Budde, Hein Putter, Marta Fiocco

Comments Author order updated to match the published version (https://journals.sagepub.com/doi/10.1177/09622802251412855); preprint replaced with the accepted manuscript

Journal ref Statistical Methods in Medical Research 2026

English abstract

In clinical studies, the illness-death model is often used to describe disease progression. A subject starts disease-free, may develop the disease and then die, or die directly. In clinical practice, disease can only be diagnosed at pre-specified follow-up visits, so the exact time of disease onset is often unknown, resulting in interval-censored data. This study examines the impact of ignoring this interval-censored nature of disease data on the discrimination performance of illness-death models, focusing on the time-specific Area Under the receiver operating characteristic Curve (AUC) in both incident/dynamic and cumulative/dynamic definitions. A simulation study with data simulated from Weibull transition hazards and disease state censored at regular intervals is conducted. Estimates are derived using different methods: the Cox model with a time-dependent binary disease marker, which ignores interval-censoring, and the illness-death model for interval-censored data estimated with three implementations: the piecewise-constant model from the msm package, and the Weibull and M-spline models from the SmoothHazard package. These methods are also applied to a dataset of 2232 patients with high-grade soft tissue sarcoma, where the interval-censored disease state is the post-operative development of distant metastases. The results suggest that, in the presence of interval-censored disease times, it is important to account for interval-censoring not only when estimating the parameters of the model but also when evaluating the discrimination performance of the disease.

2503.17300 2026-02-02 math.PR math.ST stat.ML stat.TH

Variational Tail Bounds for Norms of Random Vectors and Matrices

Sohail Bahmani

Comments reorganized + some examples are consolidated into Theorem 1; a random matrix series example added in Section 4.3; the generalization via coupling is further developed

English abstract

We propose a variational tail bound for norms of random vectors under moment assumptions on their one-dimensional marginals. A simplified version of the bound that parametrizes the "aggregating distribution" using a certain pushforward of the Gaussian distribution is also provided. We apply the proposed method to reproduce some of the well-known bounds on norms of Gaussian random vectors, and also obtain dimension-free tail bounds for the Euclidean norm of random vectors with arbitrary moment profiles. Furthermore, we reproduce a dimension-free concentration inequality for sums of independent and identically distributed positive semidefinite matrices with sub-exponential marginals, and obtain a concentration inequality for the sample covariance matrix of sub-exponential random vectors. We also obtain a tail bound for the operator norm of a random matrix series whose random coefficients may have arbitrary moment profiles. Furthermore, we use coupling to formulate an abstraction of the proposed approach that applies more broadly.

2503.11599 2026-02-02 stat.AP

Quantifying sleep apnea heterogeneity using hierarchical Bayesian modeling

Glenn Palmer, Narat Srivali, David B. Dunson

English abstract

Obstructive Sleep Apnea (OSA) is a breathing disorder during sleep that affects millions of people worldwide. The diagnosis of OSA often occurs through an overnight polysomnogram (PSG) sleep study that generates a massive amount of physiological data. However, despite the evidence of substantial heterogeneity in the expression and symptoms of OSA, diagnosis and scientific analysis of severity typically focus on a single summary statistic, the Apnea-Hypopnea Index (AHI). We address the limitations of this approach through hierarchical Bayesian modeling of PSG data. Our approach produces interpretable random effects for each patient, which govern sleep-stage dynamics, rates of OSA events, and impacts of OSA events on subsequent sleep-stage dynamics. We propose a novel approach for using these random effects to produce a Bayes optimal clustering of patients. We use the proposed approach to analyze data from the APPLES study. Our analysis produces clinically interesting groups of patients with sleep apnea and a novel finding of an association between OSA expression and cognitive performance that is missed by an AHI-based analysis.

2502.13157 2026-02-02 stat.ME stat.AP

Bayesian Kernel Machine Regression via Random Fourier Features for Estimating Joint Health Effects of Multiple Exposures

Danlu Zhang, Stephanie M. Eick, Howard H. Chang

English abstract

Environmental epidemiology has traditionally examined single exposures one at a time. Advances in exposure assessment and statistical methods now enable studies of multiple exposures and their combined health impacts. Bayesian Kernel Machine Regression (BKMR) is a widely used approach to flexibly estimate joint, nonlinear effects of multiple exposures. But BKMR is computationally intensive for large datasets, as repeated kernel inversion in Markov chain Monte Carlo (MCMC) can be time-consuming and often infeasible in practice. To address this issue, we propose using supervised random Fourier basis functions to replace the Gaussian process random effects. This re-frames the kernel machine regression into a linear mixed-effect model that facilitates computationally efficient estimation and prediction. Bayesian inference is conducted using MCMC with Hamiltonian Monte Carlo algorithms. Simulation studies demonstrate that our method yields results comparable to BKMR while significantly reducing the computation time. Our approach outperforms BKMR when the exposure-response surface has stronger dependency and when using predictive process as an alternative approximation method. Finally, we applied this approach to analyze over 270,000 birth records, examining associations between multiple ambient air pollutants and birthweight in Georgia.
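
To show why random Fourier features avoid kernel inversion, the sketch below uses the classical (unsupervised) construction for a Gaussian kernel $k(x, y) = \exp(-\|x - y\|^2 / (2\ell^2))$: draw frequencies $w \sim N(0, \ell^{-2} I)$ and offsets $b \sim U[0, 2\pi]$, and use features $\sqrt{2/D}\cos(w^\top x + b)$. The inner product of two feature vectors then approximates the kernel, so the model becomes linear in $D$ basis functions. The supervised variant proposed in the abstract is not reproduced here. The function names, lengthscale, and number of features are illustrative assumptions.

```python
import math
import random


def make_rff(dim, n_features, lengthscale, seed=0):
    """Return a feature map whose inner products approximate a Gaussian kernel."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0 / lengthscale) for _ in range(dim)]
               for _ in range(n_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def features(x):
        return [scale * math.cos(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)
                for w, b in zip(weights, offsets)]

    return features


def gaussian_kernel(x, y, lengthscale):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * lengthscale ** 2))


# Hypothetical two-exposure example.
features = make_rff(dim=2, n_features=2000, lengthscale=1.0)
x, y = [0.3, -0.2], [1.0, 0.5]
approx = sum(a * b for a, b in zip(features(x), features(y)))
exact = gaussian_kernel(x, y, 1.0)
print(approx, exact)
```

With $D$ features the regression involves a $D$-dimensional coefficient vector instead of an $n \times n$ kernel matrix, which is the source of the computational savings when $n$ is in the hundreds of thousands.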

2502.10939 2026-02-02 stat.ME

Model-assisted inference for dynamic causal effects in staggered rollout cluster randomized experiments

Xinyuan Chen, Fan Li

English abstract

Staggered rollout cluster randomized experiments (SR-CREs) involve sequential treatment adoption across clusters, requiring analysis methods that address a general class of dynamic causal effects, anticipation, and non-ignorable cluster-period sizes. Without imposing any outcome modeling assumptions, we study regression estimators using individual data, cluster-period averages, and scaled cluster-period totals, with and without covariate adjustment from a design-based perspective. We establish consistency and asymptotic normality of each estimator under a randomization-based framework and prove that the associated variance estimators are asymptotically conservative in the Löwner ordering. Furthermore, we conduct a unified efficiency comparison of the estimators and provide recommendations. We highlight the efficiency advantage of using estimators based on scaled cluster-period totals with covariate adjustment over their counterparts using individual-level data and cluster-period averages. Our results rigorously justify linear regression estimators as model-assisted methods to address an entire class of dynamic causal effects in SR-CREs.

2412.01496 2026-02-02 cs.CV cs.LG eess.IV stat.ML

Fréchet Radiomic Distance (FRD): A Versatile Metric for Comparing Medical Imaging Datasets

Nicholas Konz, Richard Osuala, Preeti Verma, Yuwen Chen, Hanxue Gu, Haoyu Dong, Yaqian Chen, Andrew Marshall, Lidia Garrucho, Kaisar Kushibar, Daniel M. Lang, Gene S. Kim, Lars J. Grimm, John M. Lewin, James S. Duncan, Julia A. Schnabel, Oliver Diaz, Karim Lekadir, Maciej A. Mazurowski

Comments Codebase for FRD computation: https://github.com/RichardObi/frd-score. Codebase for medical image similarity metric evaluation framework: https://github.com/mazurowski-lab/medical-image-similarity-metrics

Journal ref Medical Image Analysis, 103943 (2026)

English abstract

Determining whether two sets of images belong to the same or different distributions or domains is a crucial task in modern medical image analysis and deep learning; for example, to evaluate the output quality of image generative models. Currently, metrics used for this task either rely on the (potentially biased) choice of some downstream task, such as segmentation, or adopt task-independent perceptual metrics (e.g., Fréchet Inception Distance/FID) from natural imaging, which we show insufficiently capture anatomical features. To this end, we introduce a new perceptual metric tailored for medical images, FRD (Fréchet Radiomic Distance), which utilizes standardized, clinically meaningful, and interpretable image features. We show that FRD is superior to other image distribution metrics for a range of medical imaging applications, including out-of-domain (OOD) detection, the evaluation of image-to-image translation (by correlating more with downstream task performance as well as anatomical consistency and realism), and the evaluation of unconditional image generation. Moreover, FRD offers additional benefits such as stability and computational efficiency at low sample sizes, sensitivity to image corruptions and adversarial attacks, feature interpretability, and correlation with radiologist-perceived image quality. Additionally, we address key gaps in the literature by presenting an extensive framework for the multifaceted evaluation of image similarity metrics in medical imaging -- including the first large-scale comparative study of generative models for medical image translation -- and release an accessible codebase to facilitate future research. Our results are supported by thorough experiments spanning a variety of datasets, modalities, and downstream tasks, highlighting the broad potential of FRD for medical image analysis.

2410.11771 2026-02-02 stat.ML cs.NA math.NA

Stein's method for marginals on large graphical models

Tiangang Cui, Shuigen Liu, Xin T. Tong

English abstract

Many spatial models exhibit locality structures that effectively reduce their intrinsic dimensionality, enabling efficient approximation and sampling of high-dimensional distributions. However, existing approximation techniques primarily focus on joint distributions and do not provide precise accuracy control for low-dimensional marginals, which are of primary interest in many practical scenarios. By leveraging the locality structures, we establish a dimension-independent uniform error bound for the marginals of approximate distributions. Inspired by Stein's method, we introduce a novel $δ$-locality condition that quantifies the locality in distributions, and link it to structural assumptions such as sparse graphical models. The theoretical guarantee motivates the localization of existing sampling methods, as we illustrate through the localized likelihood-informed subspace method and localized score matching. We show that by leveraging the locality structure, these methods greatly reduce the sample complexity and computational cost via localized and parallel implementations.

2410.02025 2026-02-02 math.ST cs.AI cs.LG stat.ME stat.ML stat.TH

A Likelihood Based Approach to Distribution Regression Using Conditional Deep Generative Models

Shivam Kumar, Yun Yang, Lizhen Lin

Comments arXiv admin note: text overlap with arXiv:1708.06633 by other authors

Journal ref Proc. 42nd Int. Conf. on Machine Learning (ICML 2025), PMLR 267:31964-31990, 2025

English abstract

In this work, we explore the theoretical properties of conditional deep generative models under the statistical framework of distribution regression where the response variable lies in a high-dimensional ambient space but concentrates around a potentially lower-dimensional manifold. More specifically, we study the large-sample properties of a likelihood-based approach for estimating these models. Our results lead to the convergence rate of a sieve maximum likelihood estimator (MLE) for estimating the conditional distribution (and its devolved counterpart) of the response given predictors in the Hellinger (Wasserstein) metric. Our rates depend solely on the intrinsic dimension and smoothness of the true conditional distribution. These findings provide an explanation of why conditional deep generative models can circumvent the curse of dimensionality from the perspective of statistical foundations and demonstrate that they can learn a broader class of nearly singular conditional distributions. Our analysis also emphasizes the importance of introducing a small noise perturbation to the data when they are supported sufficiently close to a manifold. Finally, in our numerical studies, we demonstrate the effective implementation of the proposed approach using both synthetic and real-world datasets, which also provide complementary validation to our theoretical findings.

2409.12578 2026-02-02 stat.CO stat.AP

CLE-SH: Comprehensive Literal Explanation package for SHapley values by statistical validity

Kyungjin Kim, Youngro Lee, Jongmo Seo

Journal ref IEEE Access, vol. 14, pp. 12514-12525, 2026

English abstract

Recently, SHapley Additive exPlanations (SHAP) has been widely utilized in various research domains. This is particularly evident in application fields, where SHAP analysis serves as a crucial tool for identifying biomarkers and assisting in result validation. However, despite its frequent usage, SHAP is often not applied in a manner that maximizes its potential contributions. A review of recent papers employing SHAP reveals that many studies subjectively select a limited number of features as 'important' and analyze SHAP values by approximately observing plots without assessing statistical significance. Such superficial application may hinder meaningful contributions to the applied fields. To address this, we propose a library package designed to simplify the interpretation of SHAP values. By simply inputting the original data and SHAP values, our library provides: 1) the number of important features to analyze, 2) the pattern of each feature via univariate analysis, and 3) the interaction between features. All information is extracted based on its statistical significance and presented in simple, comprehensible sentences, enabling users of all levels to understand the interpretations. We hope this library fosters a comprehensive understanding of statistically valid SHAP results.

2407.03389 2026-02-02 stat.ME cs.LG stat.ML

A Deterministic Information Bottleneck Method for Clustering Mixed-Type Data

Efthymios Costa, Ioanna Papatsouma, Angelos Markos

Comments 35 pages

English abstract

In this paper, we present an information-theoretic method for clustering mixed-type data, that is, data consisting of both continuous and categorical variables. The proposed approach extends the Information Bottleneck principle to heterogeneous data through generalised product kernels, integrating continuous, nominal, and ordinal variables within a unified optimization framework. We address the following challenges: developing a systematic bandwidth selection strategy that equalises contributions across variable types, and proposing an adaptive hyperparameter updating scheme that ensures a valid solution into a predetermined number of potentially imbalanced clusters. Through simulations on 28,800 synthetic data sets and ten publicly available benchmarks, we demonstrate that the proposed method, named DIBmix, achieves superior performance compared to four established methods (KAMILA, K-Prototypes, FAMD with K-Means, and PAM with Gower's dissimilarity). Results show DIBmix particularly excels when clusters exhibit size imbalances, data contain low or moderate cluster overlap, and categorical and continuous variables are equally represented. The method presents a significant advantage over traditional centroid-based algorithms, establishing DIBmix as a competitive and theoretically grounded alternative for mixed-type data clustering.

2405.05459 2026-02-02 stat.ME math.ST stat.TH

Estimation and Inference for Change Points in Functional Regression Time Series

Shivam Kumar, Haotian Xu, Haeran Cho, Daren Wang

English abstract

In this paper, we study the estimation and inference of change points under a functional linear regression model with changes in the slope function. We present a novel Functional Regression Binary Segmentation (FRBS) algorithm which is computationally efficient and achieves consistency in multiple change point detection. This algorithm utilizes the predictive power of piece-wise constant functional linear regression models in the reproducing kernel Hilbert space framework. We further propose a refinement step that improves the localization rate of the initial estimator output by FRBS, and derive asymptotic distributions of the refined estimators for two different regimes determined by the magnitude of a change. To facilitate the construction of confidence intervals for underlying change points based on the limiting distribution, we propose a consistent block-type long-run variance estimator. Our theoretical justifications for the proposed approach accommodate temporal dependence and heavy-tailedness in both the functional covariates and the measurement errors. Empirical effectiveness of our methodology is demonstrated through extensive simulation studies and an application to the Standard and Poor's 500 index dataset.

2404.09113 2026-02-02 stat.ML cs.LG math.ST stat.TH

Extending Mean-Field Variational Inference via Entropic Regularization: Theory and Computation

Bohan Wu, David Blei

English abstract

Variational inference (VI) has emerged as a popular method for approximate inference for high-dimensional Bayesian models. In this paper, we propose a novel VI method that extends the naive mean field via entropic regularization, referred to as $Ξ$-variational inference ($Ξ$-VI). $Ξ$-VI has a close connection to the entropic optimal transport problem and benefits from the computationally efficient Sinkhorn algorithm. We show that $Ξ$-variational posteriors effectively recover the true posterior dependency, where the dependence is downweighted by the regularization parameter. We analyze the role of dimensionality of the parameter space on the accuracy of $Ξ$-variational approximation and how it affects computational considerations, providing a rough characterization of the statistical-computational trade-off in $Ξ$-VI. We also investigate the frequentist properties of $Ξ$-VI and establish results on consistency, asymptotic normality, high-dimensional asymptotics, and algorithmic stability. We provide sufficient criteria for achieving polynomial-time approximate inference using the method. Finally, we demonstrate the practical advantage of $Ξ$-VI over mean-field variational inference on simulated and real data.

2402.02277 2026-02-02 cs.LG stat.ML

Causal Bayesian Optimization via Exogenous Distribution Learning

Shaogang Ren, Zihao Wang, Yuzhou Chen, Xiaoning Qian

English abstract

Maximizing a target variable as an operational objective within a structural causal model is a fundamental problem. Causal Bayesian Optimization (CBO) approaches typically achieve this either by performing interventions that modify the causal structure to increase the reward or by introducing action nodes to endogenous variables, thereby adjusting the data-generating mechanisms to meet the objective. In this paper, we propose a novel method that learns the distribution of exogenous variables, an aspect often ignored or marginalized through expectation in existing CBO frameworks. By modeling the exogenous distribution, we enhance the approximation fidelity of the data-generating structural causal models (SCMs) used in surrogate models, which are commonly trained on limited observational data. Furthermore, the ability to recover exogenous variables enables the application of our approach to more general causal structures beyond the confines of Additive Noise Models (ANMs) and single-mode Gaussian, allowing the use of more expressive priors for context noise. We incorporate the learned exogenous distribution into a new CBO method, demonstrating its advantages across diverse datasets and application scenarios.

2309.08556 2026-02-02 math.ST stat.TH

High-Dimensional Bernstein Von-Mises Theorems for Covariance and Precision Matrices

Partha Sarkar, Kshitij Khare, Malay Ghosh, Matt P. Wand

English abstract

This paper aims to examine the characteristics of the posterior distribution of covariance/precision matrices in a "large $p$, large $n$" scenario, where $p$ represents the number of variables and $n$ is the sample size. Our analysis focuses on establishing asymptotic normality of the posterior distribution of the entire covariance/precision matrices under specific growth restrictions on $p_n$ and other mild assumptions. In particular, the limiting distribution turns out to be a symmetric matrix variate normal distribution whose parameters depend on the maximum likelihood estimate. Our results hold for a wide class of prior distributions which includes standard choices used by practitioners. Next, we consider Gaussian graphical models which induce sparsity in the precision matrix. Asymptotic normality of the corresponding posterior distribution is established under mild assumptions on the prior and true data-generating mechanism.

2308.14735 2026-02-02 math.ST stat.TH

Logarithmic Asymptotic Relations Between $p$-Values and Mutual Information

Tsutomu Mori, Takashi Kawamura

Comments 21 pages, 3 figures, 7 tables

English abstract

We establish a precise connection between statistical significance in dependence testing and information-theoretic dependence as quantified by Shannon mutual information (MI). In the absence of prior distributional information, we consider a maximum-entropy model and show that the probability associated with the realization of a given magnitude of MI takes an exponential form, yielding a corresponding tail-probability interpretation of a $p$-value. In contingency tables with fixed marginal frequencies, we analyze Fisher's exact test and prove that its $p$-value $P_F$ satisfies a logarithmic asymptotic relation of the form $MI=-(1/N)\log P_F + O(\log(N+1)/N)$ as the sample size $N\to\infty$. These results clarify the role of MI as the exponential rate governing the asymptotic behavior of $p$-values in the settings studied here, and they enable principled comparisons of dependence across datasets with different sample sizes. We further discuss implications for combining evidence across studies via meta-analysis, allowing mutual information and its statistical significance to be integrated in a unified framework.
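
The relation $MI = -(1/N)\log P_F + O(\log(N+1)/N)$ can be checked numerically on a $2 \times 2$ table. The sketch below computes the empirical mutual information in nats and a two-sided Fisher exact $p$-value, defined here as the total hypergeometric probability of all tables with the same margins that are no more likely than the observed one. That two-sided convention, the function names, and the example tables are assumptions for illustration. The paper's exact definition of $P_F$ is not given in the abstract.

```python
import math


def mutual_information(table):
    """Empirical mutual information (in nats) of a 2x2 contingency table."""
    n = sum(sum(row) for row in table)
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i in range(2):
        for j in range(2):
            if table[i][j] > 0:
                mi += table[i][j] / n * math.log(table[i][j] * n / (rows[i] * cols[j]))
    return mi


def fisher_exact_p(table):
    """Two-sided Fisher exact p-value with both margins held fixed."""
    (a, b), (c, d) = table
    r1, c1, n = a + b, a + c, a + b + c + d
    denom = math.comb(n, c1)

    def prob(k):
        # Hypergeometric probability that the top-left cell equals k.
        return math.comb(r1, k) * math.comb(n - r1, c1 - k) / denom

    observed = prob(a)
    lo, hi = max(0, r1 + c1 - n), min(r1, c1)
    return sum(p for p in (prob(k) for k in range(lo, hi + 1))
               if p <= observed * (1 + 1e-9))


# Same cell proportions at two sample sizes: MI is unchanged, and the rate approaches it.
for scale in (5, 50):
    table = [[3 * scale, scale], [scale, 3 * scale]]
    n = 8 * scale
    rate = -math.log(fisher_exact_p(table)) / n
    print(n, mutual_information(table), rate)
```

Scaling every cell by the same factor leaves the mutual information fixed, so the gap between it and `rate` reflects only the $O(\log(N+1)/N)$ correction.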

2307.16373 2026-02-02 astro-ph.HE astro-ph.IM hep-ex stat.ML

2D Convolutional Neural Network for Event Reconstruction in IceCube DeepCore

J. H. Peterson, M. Prado Rodriguez, K. Hanson

Comments Presented at the 38th International Cosmic Ray Conference (ICRC2023). See arXiv:2307.13047 for all IceCube contributions

Details
English abstract

IceCube DeepCore is an extension of the IceCube Neutrino Observatory designed to measure GeV-scale atmospheric neutrino interactions for neutrino oscillation studies. Distinguishing muon neutrinos from other flavors and reconstructing inelasticity are especially difficult at GeV-scale energies in IceCube DeepCore because of its sparse instrumentation. Convolutional neural networks (CNNs) have been found to reconstruct neutrino events more successfully than conventional likelihood-based methods. In this contribution, we present a new CNN model that exploits time and depth translational symmetry in IceCube DeepCore data, and we report the model's performance, specifically for flavor identification and inelasticity reconstruction.

2211.10547 2026-02-02 stat.AP

Leaf clustering using circular densities

Luis E. Nieto-Barajas

Details
English abstract

In botany, leaf shape recognition is an important task. One way of characterising leaf shape is through the centroid contour distances (CCD). CCD paths may differ in resolution, so each contour is normalised by associating it with a circular density. Densities are rotated by subtracting the mean or modal preferred direction. Distance measures between densities are then used in a hierarchical clustering method to group the leaves. We illustrate our approach with a small motivating dataset as well as a larger dataset.

1709.10066 2026-02-02 stat.ME

Empirical Bayes Shrinkage and False Discovery Rate Estimation, Allowing For Unwanted Variation

David Gerard, Matthew Stephens

Comments 42 pages, 11 figures, 3 tables

Journal ref Biostatistics 21 (2020) 15--32

Details
English abstract

We combine two important ideas in the analysis of large-scale genomics experiments (e.g. experiments that aim to identify genes that are differentially expressed between two conditions). The first is the use of Empirical Bayes (EB) methods to handle the large number of potentially sparse effects and to estimate false discovery rates and related quantities. The second is the use of factor analysis methods to deal with sources of unwanted variation such as batch effects and unmeasured confounders. We describe a simple modular fitting procedure that combines key ideas from both these lines of research. This yields new, powerful EB methods for analyzing genomics experiments that account for both sparse effects and unwanted variation. In realistic simulations, these new methods provide significant gains in power and calibration over competing methods. In real data analysis, we find that different methods, while often conceptually similar, can vary widely in their assessments of statistical significance. This highlights the need for care in both the choice of methods and the interpretation of results. All methods introduced in this paper are implemented in the R package vicar, available at https://github.com/dcgerard/vicar .

1705.08393 2026-02-02 stat.ME math.ST stat.TH

Unifying and Generalizing Methods for Removing Unwanted Variation Based on Negative Controls

David Gerard, Matthew Stephens

Comments 34 pages, 6 figures, methods implemented at https://github.com/dcgerard/vicar , results reproducible at https://github.com/dcgerard/ruvb_sims

Journal ref Statistica Sinica 31 (2021) 1145--1166

Details
English abstract

Unwanted variation, including hidden confounding, is a well-known problem in many fields, particularly large-scale gene expression studies. Recent proposals to use control genes --- genes assumed to be unassociated with the covariates of interest --- have led to new methods to deal with this problem. Going by the moniker Removing Unwanted Variation (RUV), there are many versions --- RUV1, RUV2, RUV4, RUVinv, RUVrinv, RUVfun. In this paper, we introduce a general framework, RUV*, that both unites and generalizes these approaches. This unifying framework helps clarify connections between existing methods. In particular we provide conditions under which RUV2 and RUV4 are equivalent. The RUV* framework also preserves an advantage of RUV approaches --- their modularity --- which facilitates the development of novel methods based on existing matrix imputation algorithms. We illustrate this by implementing RUVB, a version of RUV* based on Bayesian factor analysis. In realistic simulations based on real data we found that RUVB is competitive with existing methods in terms of both power and calibration, although we also highlight the challenges of providing consistently reliable calibration among data sets.