arXivDaily: arXiv daily academic digest, updated Monday through Friday
2602.02491 2026-02-03 math.ST stat.ML stat.TH

New explanations and inference for least angle regression

Karl B. Gregory, Daniel J. Nordman

Comments 50 pages, 9 figures

Abstract

Efron et al. (2004) introduced least angle regression (LAR) as an algorithm for linear predictions, intended as an alternative to forward selection with connections to penalized regression. However, LAR has remained somewhat of a "black box," where some basic behavioral properties of LAR output are not well understood, including an appropriate termination point for the algorithm. We provide a novel framework for inference with LAR, which also allows LAR to be understood from new perspectives with several newly developed mathematical properties. The LAR algorithm at a data level can be viewed as estimating a population counterpart "path" that organizes a response mean along regressor variables which are ordered according to a decreasing series of population "correlation" parameters; such parameters are shown to have meaningful interpretations for explaining variable contributions whereby zero correlations denote unimportant variables. In the output of LAR, estimates of all non-zero population correlations turn out to have independent normal distributions for use in inference, while estimates of zero-valued population correlations have a certain non-normal joint distribution. These properties help to provide a formal rule for stopping the LAR algorithm. While the standard bootstrap for regression can fail for LAR, a modified bootstrap provides a practical and formally justified tool for interpreting the entrance of variables and quantifying uncertainty in estimation. The LAR inference method is studied through simulation and illustrated with data examples.

2602.02445 2026-02-03 cs.LG math.ST stat.TH

Finite-Sample Wasserstein Error Bounds and Concentration Inequalities for Nonlinear Stochastic Approximation

Seo Taek Kong, R. Srikant

Abstract

This paper derives non-asymptotic error bounds for nonlinear stochastic approximation algorithms in the Wasserstein-$p$ distance. To obtain explicit finite-sample guarantees for the last iterate, we develop a coupling argument that compares the discrete-time process to a limiting Ornstein-Uhlenbeck process. Our analysis applies to algorithms driven by general noise conditions, including martingale differences and functions of ergodic Markov chains. Complementing this result, we handle the convergence rate of the Polyak-Ruppert average through a direct analysis that applies under the same general setting. Assuming the driving noise satisfies a non-asymptotic central limit theorem, we show that the normalized last iterates converge to a Gaussian distribution in the $p$-Wasserstein distance at a rate of order $γ_n^{1/6}$, where $γ_n$ is the step size. Similarly, the Polyak-Ruppert average is shown to converge in the Wasserstein distance at a rate of order $n^{-1/6}$. These distributional guarantees imply high-probability concentration inequalities that improve upon those derived from moment bounds and Markov's inequality. We demonstrate the utility of this approach by considering two applications: (1) linear stochastic approximation, where we explicitly quantify the transition from heavy-tailed to Gaussian behavior of the iterates, thereby bridging the gap between recent finite-sample analyses and asymptotic theory, and (2) stochastic gradient descent, where we establish a rate of convergence in the central limit theorem.
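To make the two objects in this abstract concrete, here is a minimal sketch of a scalar nonlinear stochastic approximation recursion that tracks both the last iterate and its Polyak-Ruppert average. The mean field `tanh(x - theta)`, the step-size schedule `k ** -0.75`, and the Gaussian noise are illustrative assumptions, not taken from the paper.

```python
import math
import random

def stochastic_approximation(n_steps, theta=2.0, seed=0):
    """Run x_{k+1} = x_k - gamma_k * (tanh(x_k - theta) + noise_k).

    Returns the last iterate and the Polyak-Ruppert (running) average.
    The field tanh(x - theta) is a hypothetical nonlinear example with root theta.
    """
    rng = random.Random(seed)
    x = 0.0      # initial iterate
    avg = 0.0    # running average of x_1, ..., x_k
    for k in range(1, n_steps + 1):
        gamma = 1.0 / k ** 0.75              # assumed decaying step size
        noise = rng.gauss(0.0, 1.0)          # martingale-difference noise
        x -= gamma * (math.tanh(x - theta) + noise)
        avg += (x - avg) / k                 # online update of the mean
    return x, avg

last_iterate, pr_average = stochastic_approximation(20000)
```

In runs of this kind the averaged iterate typically fluctuates less than the last iterate, which is the behavior the abstract quantifies with separate rates for the two.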

2602.02432 2026-02-03 cs.LG math.OC stat.ML

Maximizing Reliability with Bayesian Optimization

Jack M. Buckingham, Ivo Couckuyt, Juergen Branke

Comments 25 pages, 9 figures

Abstract

Bayesian optimization (BO) is a popular, sample-efficient technique for expensive, black-box optimization. One such problem arising in manufacturing is that of maximizing the reliability, or equivalently minimizing the probability of a failure, of a design which is subject to random perturbations - a problem that can involve extremely rare failures ($P_\mathrm{fail} = 10^{-6}-10^{-8}$). In this work, we propose two BO methods based on Thompson sampling and knowledge gradient, the latter approximating the one-step Bayes-optimal policy for minimizing the logarithm of the failure probability. Both methods incorporate importance sampling to target extremely small failure probabilities. Empirical results show the proposed methods outperform existing methods in both extreme and non-extreme regimes.
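As a rough illustration of why importance sampling helps with failure probabilities of this magnitude, the sketch below estimates a Gaussian tail probability of order $10^{-6}$ with a mean-shifted proposal. It is a generic textbook construction under assumed names (`rare_failure_probability`, `exact_tail`), not the estimator used in the paper.

```python
import math
import random

def rare_failure_probability(threshold, n_samples, seed=0):
    """Estimate P(Z > threshold) for Z ~ N(0, 1) by importance sampling.

    Samples come from the proposal N(threshold, 1), so roughly half of them
    land in the failure region instead of about one in a million.
    """
    rng = random.Random(seed)
    shift = threshold
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(shift, 1.0)
        if x > threshold:  # failure event
            # Likelihood ratio N(0, 1) / N(shift, 1) evaluated at x.
            total += math.exp(-shift * x + 0.5 * shift * shift)
    return total / n_samples

def exact_tail(threshold):
    """Closed-form Gaussian tail probability, used only as a reference."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))

estimate = rare_failure_probability(4.75, 20000)
reference = exact_tail(4.75)
```

Plain Monte Carlo with the same 20,000 samples would most likely observe no failures at all, which is the motivation for reweighting.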

2602.02403 2026-02-03 econ.GN q-fin.EC stat.AP

Strategic Interactions in Science and Technology Networks: Substitutes or Complements?

Michael Balzer, Adhen Benlahlou

Abstract

This paper develops a theory of scientific and technological peer effects to study how individuals' productivity responds to the behavior and network positions of their collaborators across both scientific and inventive activities. Building on a simultaneous equation network framework, the model predicts that productivity in each activity increases in a variation of the Katz-Bonacich centrality that captures within-activity and cross-activity strategic complementarities. To test these predictions, we assemble the universe of cancer-related publications and patents and construct coauthorship and coinventorship networks that jointly map the collaboration structure of researchers active in both spheres. Using an instrumental-variables approach based on predicted link formation from exogenous dyadic characteristics, and incorporating community fixed effects to address endogenous network formation, we show that both authors' and inventors' outputs rise with their network centrality, consistent with the theory. Moreover, scientific productivity significantly enhances technological productivity, while technological output does not exert a detectable reciprocal effect on scientific production, highlighting an asymmetric linkage aligned with a science-driven model of innovation. These findings provide the first empirical evidence on the joint dynamics of scientific and inventive peer effects, underscore the micro-foundations of the co-evolution of science and technology, and reveal how collaboration structures can be leveraged to design policies that enhance collective knowledge creation and downstream innovation.

2602.02398 2026-02-03 stat.AP

Counting models with excessive zeros ensuring stochastic monotonicity

Hyemin Lee, Dohee Kim, Banghee So, Jae Youn Ahn

Abstract

Standard count models such as the Poisson and Negative Binomial models often fail to capture the large proportion of zero claims commonly observed in insurance data. To address this issue of excessive zeros, zero-inflated and hurdle models introduce additional parameters that explicitly account for excess zeros, thereby improving the joint representation of zero and positive claim outcomes. These models have further been extended with random effects to accommodate longitudinal dependence and unobserved heterogeneity. However, their consistency with fundamental probabilistic principles in insurance, particularly stochastic monotonicity, has not been formally examined. This paper provides a rigorous analysis showing that standard counting random-effect models for excessive zeros may violate this property, leading to inconsistencies in posterior credibility. We then propose new classes of counting random-effect models that both accommodate excessive zeros and ensure stochastic monotonicity, thereby providing fair and theoretically coherent credibility adjustments as claim histories evolve.

2602.02371 2026-02-03 cs.LG stat.ML

C-kNN-LSH: A Nearest-Neighbor Algorithm for Sequential Counterfactual Inference

Jing Wang, Jie Shen, Qiaomin Xie, Jeremy C Weiss

Abstract

Estimating causal effects from longitudinal trajectories is central to understanding the progression of complex conditions, such as comorbidities and long COVID recovery, and to optimizing clinical decision-making. We introduce C-kNN-LSH, a nearest-neighbor framework for sequential causal inference designed to handle such high-dimensional, confounded situations. By utilizing locality-sensitive hashing, we efficiently identify "clinical twins" with similar covariate histories, enabling local estimation of conditional treatment effects across evolving disease states. To mitigate bias from irregular sampling and shifting patient recovery profiles, we integrate the neighborhood estimator with a doubly-robust correction. Theoretical analysis guarantees our estimator is consistent and second-order robust to nuisance error. Evaluated on a real-world Long COVID cohort with 13,511 participants, C-kNN-LSH demonstrates superior performance in capturing recovery heterogeneity and estimating policy values compared to existing baselines.

2602.02358 2026-02-03 stat.ML cs.LG

Transfer Learning Through Conditional Quantile Matching

Yikun Zhang, Steven Wilkins-Reeves, Wesley Lee, Aude Hofleitner

Comments 24 pages (8 pages for the main paper), 3 figures, 3 tables

Abstract

We introduce a transfer learning framework for regression that leverages heterogeneous source domains to improve predictive performance in a data-scarce target domain. Our approach learns a conditional generative model separately for each source domain and calibrates the generated responses to the target domain via conditional quantile matching. This distributional alignment step corrects general discrepancies between source and target domains without imposing restrictive assumptions such as covariate or label shift. The resulting framework provides a principled and flexible approach to high-quality data augmentation for downstream learning tasks in the target domain. From a theoretical perspective, we show that an empirical risk minimizer (ERM) trained on the augmented dataset achieves a tighter excess risk bound than the target-only ERM under mild conditions. In particular, we establish new convergence rates for the quantile matching estimator that governs the transfer bias-variance tradeoff. From a practical perspective, extensive simulations and real data applications demonstrate that the proposed method consistently improves prediction accuracy over target-only learning and competing transfer learning methods.

2602.02316 2026-02-03 math.ST stat.AP stat.ME stat.TH

A Kullback-Leibler divergence test for multivariate extremes: theory and practice

Sebastian Engelke, Philippe Naveau, Chen Zhou

Abstract

Testing whether two multivariate samples exhibit the same extremal behavior is an important problem in various fields including environmental and climate sciences. While several ad-hoc approaches exist in the literature, they often lack theoretical justification and statistical guarantees. On the other hand, extreme value theory provides the theoretical foundation for constructing asymptotically justified tests. We combine this theory with Kullback-Leibler divergence, a fundamental concept in information theory and statistics, to propose a test for equality of extremal dependence structures in practically relevant directions. Under suitable assumptions, we derive the limiting distributions of the proposed statistic under null and alternative hypotheses. Importantly, our test is fast to compute and easy to interpret by practitioners, making it attractive in applications. Simulations provide evidence of the power of our test. In a case study, we apply our method to show the strong impact of seasons on the strength of dependence between different aggregation periods (daily versus hourly) of heavy rainfall in France.

2602.02283 2026-02-03 cs.LG stat.ML

Choice-Model-Assisted Q-learning for Delayed-Feedback Revenue Management

Owen Shen, Patrick Jaillet

Abstract

We study reinforcement learning for revenue management with delayed feedback, where a substantial fraction of value is determined by customer cancellations and modifications observed days after booking. We propose choice-model-assisted RL: a calibrated discrete choice model is used as a fixed partial world model to impute the delayed component of the learning target at decision time. In the fixed-model deployment regime, we prove that tabular Q-learning with model-imputed targets converges to an $O(\varepsilon/(1-γ))$ neighborhood of the optimal Q-function, where $\varepsilon$ summarizes partial-model error, with an additional $O(t^{-1/2})$ sampling term. Experiments in a simulator calibrated from 61,619 hotel bookings (1,088 independent runs) show: (i) no statistically detectable difference from a maturity-buffer DQN baseline in stationary settings; (ii) positive effects under in-family parameter shifts, with significant gains in 5 of 10 shift scenarios after Holm-Bonferroni correction (up to 12.4%); and (iii) consistent degradation under structural misspecification, where the choice model assumptions are violated (1.4-2.6% lower revenue). These results characterize when partial behavioral models improve robustness under shift and when they introduce harmful bias.
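The $O(\varepsilon/(1-γ))$ neighborhood can be seen in a toy setting. The sketch below runs tabular Q-learning on a hypothetical one-state problem with deterministic rewards, once with the true delayed reward and once with an imputed value that is off by at most `eps` per action. The reward numbers and function name are assumptions made for illustration and are unrelated to the paper's simulator.

```python
def q_learning_one_state(immediate, delayed, gamma=0.9, alpha=0.5, sweeps=2000):
    """Tabular Q-learning on a single-state MDP with deterministic rewards.

    Each target is: immediate reward + (possibly imputed) delayed reward
    + discounted value of the best action in the next step.
    """
    q = [0.0] * len(immediate)
    for _ in range(sweeps):
        best_next = max(q)
        for a in range(len(q)):
            target = immediate[a] + delayed[a] + gamma * best_next
            q[a] += alpha * (target - q[a])
    return q

gamma, eps = 0.9, 0.05
immediate = [1.0, 0.8]
true_delayed = [-0.20, -0.10]       # e.g. expected cancellation losses
imputed_delayed = [-0.15, -0.13]    # model output, each entry within eps of the truth

q_true = q_learning_one_state(immediate, true_delayed, gamma)
q_model = q_learning_one_state(immediate, imputed_delayed, gamma)
gap = max(abs(a - b) for a, b in zip(q_true, q_model))
bound = eps / (1.0 - gamma)
```

Here `gap` stays within `bound`, because a per-step reward error of at most `eps` accumulates over an effective horizon of $1/(1-γ)$ steps.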

2602.02277 2026-02-03 stat.ME

A spatial random forest algorithm for population-level epidemiological risk assessment

Duncan Lee, Vinny Davies

Abstract

Spatial epidemiology identifies the drivers of elevated population-level disease risks, using disease counts, exposures and known confounders at the areal unit level. Poisson regression models are typically used for inference, which incorporate a linear/additive regression component and allow for unmeasured confounding via a set of spatially autocorrelated random effects. This approach requires the confounder interactions and their functional relationships with disease risk to be specified in advance, rather than being learned from the data. Therefore, this paper proposes the SPAR-Forest-ERF algorithm, which is the first fusion of random forests for capturing non-linear and interacting confounder-response effects with Bayesian spatial autocorrelation models that can estimate interpretable exposure response functions (ERF) with full uncertainty quantification. Methodologically, we extend existing methods set in a prediction context by propagating uncertainty between both the ML and statistical models, developing a new stopping criterion designed to ensure the stability of the primary inferential target, and incorporating a range of different ERFs for maximum model flexibility. This methodology is motivated by a new study quantifying the impact of air pollution concentrations on self-rated health in Scotland, using data from the recently released 2022 national census.

2602.02265 2026-02-03 stat.ME

Nonparametric Inference with an Instrumental Variable under a Separable Binary Treatment Choice Model

Chan Park, Eric Tchetgen Tchetgen

Abstract

Instrumental variable (IV) methods are widely used to infer treatment effects in the presence of unmeasured confounding. In this paper, we study nonparametric inference with an IV under a separable binary treatment choice model, which posits that the odds of the probability of taking the treatment, conditional on the instrument and the treatment-free potential outcome, factor into separable components for each variable. While nonparametric identification of smooth functionals of the treatment-free potential outcome among the treated, such as the average treatment effect on the treated, has been established under this model, corresponding nonparametric efficient estimation has proven elusive due to variationally dependent nuisance parameters defined in terms of counterfactual quantities. To address this challenge, we introduce a new variationally independent parameterization based on nuisance functions defined directly from the observed data. This parameterization, coupled with a novel fixed-point argument, enables the use of modern machine learning methods for nuisance function estimation. We characterize the semiparametric efficiency bound for any smooth functional of the treatment-free potential outcome among the treated and construct a corresponding semiparametric efficient estimator without imposing any unnecessary restriction on nuisance functions. Furthermore, we describe a straightforward generative model justifying our identifying assumptions and characterize empirically falsifiable implications of the framework to evaluate our assumptions in practical settings. Our approach seamlessly extends to nonlinear treatment effects, population-level effects, and nonignorable missing data settings. We illustrate our methods through simulation studies and an application to the Job Corps study.

2602.02246 2026-02-03 stat.ME

Cumulative Treatment Effect Testing under Continuous Time Reinforcement Learning

Jiuchen Zhang, Annie Qu

Abstract

Understanding the impact of a treatment over time is a fundamental aspect of many scientific and medical studies. In this paper, we introduce a novel approach under a continuous-time reinforcement learning framework for testing a treatment effect. Specifically, our method provides an effective test on carryover effects of treatment over time utilizing the average treatment effect (ATE). The average treatment effect is defined as the difference of value functions over an infinite horizon, which accounts for cumulative treatment effects, both immediate and carryover. The proposed method outperforms existing testing procedures such as discrete-time reinforcement learning strategies in multi-resolution observation settings where observation times can be irregular. Another advantage of the proposed method is that it can capture treatment effects of a shorter duration and provide greater accuracy compared to discrete-time approximations, through the use of continuous-time estimation for the value function. We establish the asymptotic normality of the proposed test statistic and apply it to the OhioT1DM diabetes data to evaluate the cumulative treatment effects of bolus insulin on patients' glucose levels.

2602.02240 2026-02-03 stat.ME stat.ML

Causal Inference for Preprocessed Outcomes with an Application to Functional Connectivity

Zihang Wang, Razieh Nabi, Benjamin B. Risk

Comments 55 pages, 5 figures, 2 tables

Abstract

In biomedical research, repeated measurements within each subject are often processed to remove artifacts and unwanted sources of variation. The resulting data are used to construct derived outcomes that act as proxies for scientific outcomes that are not directly observable. Although intra-subject processing is widely used, its impact on inter-subject statistical inference has not been systematically studied, and a principled framework for causal analysis in this setting is lacking. In this article, we propose a semiparametric framework for causal inference with derived outcomes obtained after intra-subject processing. This framework applies to settings with a modular structure, where intra-subject analyses are conducted independently across subjects and are followed by inter-subject analyses based on parameters from the intra-subject stage. We develop multiply robust estimators of causal parameters under rate conditions on both intra-subject and inter-subject models, which allows the use of flexible machine learning. We specialize the framework to a mediation setting and focus on the natural direct effect. For high dimensional inference, we employ a step-down procedure that controls the exceedance rate of the false discovery proportion. Simulation studies demonstrate the superior performance of the proposed approach. We apply our method to estimate the impact of stimulant medication on brain connectivity in children with autism spectrum disorder.

2602.02224 2026-02-03 cs.LG cs.AI math.SP stat.ML

Spectral Superposition: A Theory of Feature Geometry

Georgi Ivanov, Narmeen Oozeer, Shivam Raval, Tasana Pejovic, Shriyash Upadhyay, Amir Abdullah

Abstract

Neural networks represent more features than they have dimensions via superposition, forcing features to share representational space. Current methods decompose activations into sparse linear features but discard geometric structure. We develop a theory for studying the geometric structure of features by analyzing the spectra (eigenvalues, eigenspaces, etc.) of weight-derived matrices. In particular, we introduce the frame operator $F = WW^\top$, which gives us a spectral measure that describes how each feature allocates norm across eigenspaces. While previous tools could describe the pairwise interactions between features, spectral methods capture the global geometry ("how do all features interact?"). In toy models of superposition, we use this theory to prove that capacity saturation forces spectral localization: features collapse onto single eigenspaces, organize into tight frames, and admit discrete classification via association schemes, classifying all geometries from prior work (simplices, polygons, antiprisms). The spectral measure formalism applies to arbitrary weight matrices, enabling diagnosis of feature localization beyond toy settings. These results point toward a broader program: applying operator theory to interpretability.
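A small numerical check may help fix the definitions. The sketch below builds the frame operator $F = WW^\top$ from a list of feature vectors and tests whether they form a tight frame, meaning that $F$ is a multiple of the identity. It assumes the columns of $W$ are the feature directions; the helper names are hypothetical.

```python
import math

def frame_operator(features):
    """Return F = W W^T as a nested list, where the columns of W are the features."""
    d = len(features[0])
    return [[sum(f[i] * f[j] for f in features) for j in range(d)]
            for i in range(d)]

def is_tight_frame(features, tol=1e-9):
    """A frame is tight when F = c * I for some constant c > 0."""
    F = frame_operator(features)
    d = len(F)
    c = sum(F[i][i] for i in range(d)) / d
    return all(abs(F[i][j] - (c if i == j else 0.0)) <= tol
               for i in range(d) for j in range(d))

# Three unit vectors spaced 120 degrees apart in the plane:
# three features sharing two dimensions (a triangle geometry).
triangle = [(math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
            for k in range(3)]
# Two features duplicated on one axis and one on the other: not tight.
lopsided = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

F_triangle = frame_operator(triangle)
```

For the triangle, $F = \tfrac{3}{2} I$, so the whole space is a single eigenspace and every feature places all of its norm there. The lopsided set gives $F = \mathrm{diag}(2, 1)$, with two distinct eigenspaces.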

2602.02217 2026-02-03 math.PR math.ST stat.TH

Refined Berry-Esseen bounds under local dependence

Zhi-Jun Cai, Qi-Man Shao, Zhuo-Song Zhang

Comments 65 pages

Abstract

In this paper, we establish Berry-Esseen bounds for both self-normalized and non-self-normalized sums of locally dependent random variables. The proofs are based on Stein's method together with a concentration inequality approach. We develop a new class of concentration inequalities that extend classical results and achieve optimal convergence rates under more general dependence structures. As applications, we use our main results to derive sharper Berry-Esseen bounds for graph dependency, distributed $U$-statistics, constrained $U$-statistics, and decorated injective homomorphism sums.

2602.02216 2026-02-03 stat.ME math.ST stat.CO stat.TH

Posterior Uncertainty for Targeted Parameters in Bayesian Bootstrap Procedures

Magid Sabbagh, David A. Stephens

Comments 20 pages

Abstract

We propose a general method to carry out a valid Bayesian analysis of a finite-dimensional 'targeted' parameter in the presence of a finite-dimensional nuisance parameter. We apply our methods to causal inference based on estimating equations. While much of the literature in Bayesian causal inference has relied on the conventional 'likelihood times prior' framework, a recently proposed method, the 'Linked Bayesian Bootstrap', deviated from this classical setting to obtain valid Bayesian inference using the Dirichlet process and the Bayesian bootstrap. These methods rely on an adjustment based on the propensity score and explain how to handle the uncertainty concerning it when studying the posterior distribution of a treatment effect. We examine theoretically the asymptotic properties of the posterior distribution obtained and show that our proposed method, a generalized version of the 'Linked Bayesian Bootstrap', enjoys desirable frequentist properties. In addition, we show that the credible intervals have asymptotically the correct coverage properties. We discuss the applications of our method to mis-specified and singly-robust models in causal inference.
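For readers unfamiliar with the building block named here, the following is a minimal sketch of the plain Bayesian bootstrap for a mean: each posterior draw reweights the observed data with Dirichlet(1, ..., 1) weights. It does not implement the Linked Bayesian Bootstrap or any propensity-score adjustment; the data values and function name are made up for illustration.

```python
import random

def bayesian_bootstrap_means(data, n_draws, seed=0):
    """Draw posterior samples of the mean under the plain Bayesian bootstrap."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        # Dirichlet(1, ..., 1) weights obtained by normalizing Exp(1) variables.
        g = [rng.expovariate(1.0) for _ in data]
        total = sum(g)
        draws.append(sum(gi * xi for gi, xi in zip(g, data)) / total)
    return draws

data = [1.2, 0.7, 2.9, 1.8, 2.2, 0.4, 1.5, 2.6]   # hypothetical observations
draws = sorted(bayesian_bootstrap_means(data, 4000))
# Central 95% credible interval read off the sorted draws.
interval = (draws[int(0.025 * len(draws))], draws[int(0.975 * len(draws))])
```

The same reweighting idea extends to parameters defined by estimating equations: solve the weighted equation once per draw instead of taking a weighted mean.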

2602.02190 2026-02-03 stat.ML cs.LG

PCA of probability measures: Sparse and Dense sampling regimes

Gachon Erell, Jérémie Bigot, Elsa Cazelles

Abstract

A common approach to perform PCA on probability measures is to embed them into a Hilbert space where standard functional PCA techniques apply. While convergence rates for estimating the embedding of a single measure from $m$ samples are well understood, the literature has not addressed the setting involving multiple measures. In this paper, we study PCA in a double asymptotic regime where $n$ probability measures are observed, each through $m$ samples. We derive convergence rates of the form $n^{-1/2} + m^{-α}$ for the empirical covariance operator and the PCA excess risk, where $α>0$ depends on the chosen embedding. This characterizes the relationship between the number $n$ of measures and the number $m$ of samples per measure, revealing a sparse (small $m$) to dense (large $m$) transition in the convergence behavior. Moreover, we prove that the dense-regime rate is minimax optimal for the empirical covariance error. Our numerical experiments validate these theoretical rates and demonstrate that appropriate subsampling preserves PCA accuracy while reducing computational cost.
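As a back-of-the-envelope step that the abstract does not state explicitly, balancing the two terms of the rate suggests where the sparse-to-dense transition might sit, ignoring constants and logarithmic factors:

```latex
\[
  n^{-1/2} \asymp m^{-\alpha}
  \quad\Longleftrightarrow\quad
  m \asymp n^{1/(2\alpha)} .
\]
```

Under this reading, when $m \gg n^{1/(2α)}$ the $n^{-1/2}$ term dominates and additional samples per measure bring little gain, whereas when $m \ll n^{1/(2α)}$ the $m^{-α}$ term dominates. The paper's precise definition of the two regimes may differ.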

2602.02172 2026-02-03 stat.ME

Neural Network Machine Regression (NNMR): A Deep Learning Framework for Uncovering High-order Synergistic Effects

Jiuchen Zhang, Ling Zhou, Peter Song

Abstract

We propose a new neural network framework, termed Neural Network Machine Regression (NNMR), which integrates trainable input gating and adaptive depth regularization to jointly perform feature selection and function estimation in an end-to-end manner. By penalizing both gating parameters and redundant layers, NNMR yields sparse and interpretable architectures while capturing complex nonlinear relationships driven by high-order synergistic effects. We further develop a post-selection inference procedure based on split-sample, permutation-based hypothesis testing, enabling valid inference without restrictive parametric assumptions. Compared with existing methods, including Bayesian kernel machine regression and widely used post hoc attribution techniques, NNMR scales efficiently to high-dimensional feature spaces while rigorously controlling type I error. Simulation studies demonstrate its superior selection accuracy and inference reliability. Finally, an empirical application reveals sparse, biologically meaningful food group predictors associated with somatic growth among adolescents living in Mexico City.

2602.02153 2026-02-03 stat.ML cs.LG

Learning Beyond the Gaussian Data: Learning Dynamics of Neural Networks on an Expressive and Cumulant-Controllable Data Model

Onat Ure, Samet Demir, Zafer Dogan

Comments ICASSP 2026, 5 pages, 2 figures

Abstract

We study the effect of high-order statistics of data on the learning dynamics of neural networks (NNs) by using a moment-controllable non-Gaussian data model. Considering the expressivity of two-layer neural networks, we first construct the data model as a generative two-layer NN where the activation function is expanded by using Hermite polynomials. This allows us to achieve interpretable control over high-order cumulants such as skewness and kurtosis through the Hermite coefficients while keeping the data model realistic. Using samples generated from the data model, we perform controlled online learning experiments with a two-layer NN. Our results reveal a moment-wise progression in training: networks first capture low-order statistics such as mean and covariance, and progressively learn high-order cumulants. Finally, we pretrain the generative model on the Fashion-MNIST dataset and leverage the generated samples for further experiments. The results of these additional experiments confirm our conclusions and show the utility of the data model in a real-world scenario. Overall, our proposed approach bridges simplified data assumptions and practical data complexity, which offers a principled framework for investigating distributional effects in machine learning and signal processing.

2602.02113 2026-02-03 stat.ML cs.LG cs.NA math.NA

Training-free score-based diffusion for parameter-dependent stochastic dynamical systems

Minglei Yang, Sicheng He

Abstract

Simulating parameter-dependent stochastic differential equations (SDEs) presents significant computational challenges, as separate high-fidelity simulations are typically required for each parameter value of interest. Despite the success of machine learning methods in learning SDE dynamics, existing approaches either require expensive neural network training for score function estimation or lack the ability to handle continuous parameter dependence. We present a training-free conditional diffusion model framework for learning stochastic flow maps of parameter-dependent SDEs, where both drift and diffusion coefficients depend on physical parameters. The key technical innovation is a joint kernel-weighted Monte Carlo estimator that approximates the conditional score function using trajectory data sampled at discrete parameter values, enabling interpolation across both state space and the continuous parameter domain. Once trained, the resulting generative model produces sample trajectories for any parameter value within the training range without retraining, significantly accelerating parameter studies, uncertainty quantification, and real-time filtering applications. The performance of the proposed approach is demonstrated via three numerical examples of increasing complexity, showing accurate approximation of conditional distributions across varying parameter values.

2602.02087 2026-02-03 cs.LG stat.ML

Efficient Swap Regret Minimization in Combinatorial Bandits

Andreas Kontogiannis, Vasilis Pollatos, Panayotis Mertikopoulos, Ioannis Panageas

Comments Accepted at AISTATS 2026

Abstract

This paper addresses the problem of designing efficient no-swap regret algorithms for combinatorial bandits, where the number of actions $N$ is exponentially large in the dimensionality of the problem. In this setting, designing efficient no-swap regret translates to sublinear -- in horizon $T$ -- swap regret with polylogarithmic dependence on $N$. In contrast to the weaker notion of external regret minimization (a problem which is fairly well understood in the literature), achieving no-swap regret with a polylogarithmic dependence on $N$ has remained elusive in combinatorial bandits. Our paper resolves this challenge by introducing a no-swap-regret learning algorithm with regret that scales polylogarithmically in $N$ and is tight for the class of combinatorial bandits. To ground our results, we also demonstrate how to implement the proposed algorithm efficiently -- that is, with a per-iteration complexity that also scales polylogarithmically in $N$ -- across a wide range of well-studied applications.

2602.02083 2026-02-03 math.ST stat.ML stat.TH

Handling Covariate Mismatch in Federated Linear Prediction

Alexis Ayme, Rémi Khellaf

Abstract

Federated learning enables institutions to train predictive models collaboratively without sharing raw data, addressing privacy and regulatory constraints. In the standard horizontal setting, clients hold disjoint cohorts of individuals and collaborate to learn a shared predictor. Most existing methods, however, assume that all clients measure the same features. We study the more realistic setting of covariate mismatch, where each client observes a different subset of features, which typically arises in multicenter collaborations with no prior agreement on data collection. We formalize learning a linear prediction under client-wise MCAR patterns and develop two modular approaches tailored to the dimensional regime and communication budget. In the low-dimensional setting, we propose a plug-in estimator that approximates the oracle linear predictor by aggregating sufficient statistics to estimate the covariance and cross-moment terms. In higher dimensions, we study an impute-then-regress strategy: (i) impute missing covariates using any exchangeability-preserving imputation procedure, and (ii) fit a ridge-regularized linear model on the completed data. We provide asymptotic and finite-sample learning rates for our predictors, explicitly characterizing their behaviour with the global dimension, the client-specific feature partition, and the distribution of samples across sites.

2602.02013 2026-02-03 cs.LG stat.ML

SNAP: A Self-Consistent Agreement Principle with Application to Robust Computation

Xiaoyi Jiang, Andreas Nienkötter

Abstract

We introduce SNAP (Self-coNsistent Agreement Principle), a self-supervised framework for robust computation based on mutual agreement. Based on an Agreement-Reliability Hypothesis, SNAP assigns weights that quantify agreement, emphasizing trustworthy items and downweighting outliers without supervision or prior knowledge. A key result is the Exponential Suppression of Outlier Weights, ensuring that outliers contribute negligibly to computations, even in high-dimensional settings. We study properties of the SNAP weighting scheme and show its practical benefits for vector averaging and subspace estimation. In particular, we demonstrate that non-iterative SNAP outperforms the iterative Weiszfeld algorithm and two variants of the multivariate median of means. SNAP thus provides a flexible, easy-to-use, broadly applicable approach to robust computation.
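The abstract does not spell out the SNAP weights, so no attempt is made to reproduce them here. For context, the sketch below shows the iterative Weiszfeld baseline it is compared against: a geometric median computed by repeatedly averaging the points with weights inversely proportional to their distance from the current estimate. The sample points are invented for illustration.

```python
import math

def weiszfeld(points, iters=200, eps=1e-12):
    """Approximate the geometric median by Weiszfeld's reweighted averaging."""
    dim = len(points[0])
    # Start from the ordinary (outlier-sensitive) mean.
    y = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    for _ in range(iters):
        num = [0.0] * dim
        den = 0.0
        for p in points:
            w = 1.0 / max(math.dist(p, y), eps)  # guard against zero distance
            den += w
            for i in range(dim):
                num[i] += w * p[i]
        y = [v / den for v in num]
    return y

# Four inliers on the unit square plus one gross outlier.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (100.0, 100.0)]
mean = [sum(p[i] for p in points) / len(points) for i in range(2)]
median = weiszfeld(points)
```

The mean is dragged to roughly (20.4, 20.4) by the single outlier, while the Weiszfeld estimate stays inside the unit square, which is the kind of robustness the vector-averaging comparison is about.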

2602.01993 2026-02-03 stat.ME math.ST stat.TH

Exchangeable random permutations with an application to Bayesian graph matching

Francesco Gaffi, Nathaniel Josephs, Lizhen Lin

Comments 41 pages, 4 figures

Abstract

We introduce a general Bayesian framework for graph matching grounded in a new theory of exchangeable random permutations. Leveraging the cycle representation of permutations and the literature on exchangeable random partitions, we define, characterize, and study the structural and predictive properties of these probabilistic objects. A novel sequential metaphor, the position-aware generalized Chinese restaurant process, provides a constructive foundation for this theory and supports practical algorithmic design. Exchangeable random permutations offer flexible priors for a wide range of inferential problems centered on permutations. As an application, we develop a Bayesian model for graph matching that integrates a correlated stochastic block model with our novel class of priors. The cycle structure of the matching is linked to latent node partitions that explain connectivity patterns, an assumption consistent with the homogeneity requirement underlying the graph matching task itself. Posterior inference is performed through a node-wise blocked Gibbs sampler directly enabled by the proposed sequential construction. To summarize posterior uncertainty, we introduce perSALSO, an adaptation of SALSO to the permutation domain that provides principled point estimation and interpretable posterior summaries. Together, these contributions establish a unified probabilistic framework for modeling, inference, and uncertainty quantification over permutations.

2602.01988 2026-02-03 stat.ML cs.LG

Stochastic Interpolants in Hilbert Spaces

James Boran Yu, RuiKang OuYang, Julien Horwood, José Miguel Hernández-Lobato

Comments 8 pages, 1 figure, 2 tables

Abstract

Although diffusion models have successfully extended to function-valued data, stochastic interpolants -- which offer a flexible way to bridge arbitrary distributions -- remain limited to finite-dimensional settings. This work bridges this gap by establishing a rigorous framework for stochastic interpolants in infinite-dimensional Hilbert spaces. We provide comprehensive theoretical foundations, including proofs of well-posedness and explicit error bounds. We demonstrate the effectiveness of the proposed framework for conditional generation, focusing particularly on complex PDE-based benchmarks. By enabling generative bridges between arbitrary functional distributions, our approach achieves state-of-the-art results, offering a powerful, general-purpose tool for scientific discovery.

2602.01953 2026-02-03 cs.LG stat.ML

Deep Multivariate Models with Parametric Conditionals

Dmitrij Schlesinger, Boris Flach, Alexander Shekhovtsov

Abstract

We consider deep multivariate models for heterogeneous collections of random variables. In the context of computer vision, such collections may e.g. consist of images, segmentations, image attributes, and latent variables. When developing such models, most existing works start from an application task and design the model components and their dependencies to meet the needs of the chosen task. This has the disadvantage of limiting the applicability of the resulting model for other downstream tasks. Here, instead, we propose to represent the joint probability distribution by means of conditional probability distributions for each group of variables conditioned on the rest. Such models can then be used for practically any possible downstream task. Their learning can be approached as training a parametrised Markov chain kernel by maximising the data likelihood of its limiting distribution. This has the additional advantage of allowing a wide range of semi-supervised learning scenarios.

2602.01912 2026-02-03 stat.ML cs.AI cs.LG q-fin.RM

Reliable Real-Time Value at Risk Estimation via Quantile Regression Forest with Conformal Calibration

Du-Yi Wang, Guo Liang, Kun Zhang, Qianwen Zhu

Abstract

Rapidly evolving market conditions call for real-time risk monitoring, but its online estimation remains challenging. In this paper, we study the online estimation of one of the most widely used risk measures, Value at Risk (VaR). Its accurate and reliable estimation is essential for timely risk control and informed decision-making. We propose to use the quantile regression forest in the offline-simulation-online-estimation (OSOA) framework. Specifically, the quantile regression forest is trained offline to learn the relationship between the online VaR and risk factors, and real-time VaR estimates are then produced online by incorporating observed risk factors. To further ensure reliability, we develop a conformalized estimator that calibrates the online VaR estimates. To the best of our knowledge, we are the first to leverage conformal calibration to estimate real-time VaR reliably based on the OSOA formulation. Theoretical analysis establishes the consistency and coverage validity of the proposed estimators. Numerical experiments confirm the proposed method and demonstrate its effectiveness in practice.

2602.01898 2026-02-03 cs.LG stat.ML

Observation-dependent Bayesian active learning via input-warped Gaussian processes

Sanna Jarl, Maria Bånkestad, Jonathan J. S. Scragg, Jens Sjölund

Comments 13 pages

Abstract

Bayesian active learning relies on the precise quantification of predictive uncertainty to explore unknown function landscapes. While Gaussian process surrogates are the standard for such tasks, an underappreciated fact is that their posterior variance depends on the observed outputs only through the hyperparameters, rendering exploration largely insensitive to the actual measurements. We propose to inject observation-dependent feedback by warping the input space with a learned, monotone reparameterization. This mechanism allows the design policy to expand or compress regions of the input space in response to observed variability, thereby shaping the behavior of variance-based acquisition functions. We demonstrate that while such warps can be trained via marginal likelihood, a novel self-supervised objective yields substantially better performance. Our approach improves sample efficiency across a range of active learning benchmarks, particularly in regimes where non-stationarity challenges traditional methods.

2602.01863 2026-02-03 stat.ML cs.LG

Transformers as Measure-Theoretic Associative Memory: A Statistical Perspective and Minimax Optimality

Ryotaro Kawata, Taiji Suzuki

Abstract

Transformers excel through content-addressable retrieval and the ability to exploit contexts of, in principle, unbounded length. We recast associative memory at the level of probability measures, treating a context as a distribution over tokens and viewing attention as an integral operator on measures. Concretely, for mixture contexts $ν = I^{-1} \sum_{i=1}^I μ^{(i)}$ and a query $x_{\mathrm{q}}(i^*)$, the task decomposes into (i) recall of the relevant component $μ^{(i^*)}$ and (ii) prediction from $(μ^{(i^*)}, x_{\mathrm{q}})$. We study learned softmax attention (not a frozen kernel) trained by empirical risk minimization and show that a shallow measure-theoretic Transformer composed with an MLP learns the recall-and-predict map under a spectral assumption on the input densities. We further establish a matching minimax lower bound with the same rate exponent (up to multiplicative constants), proving sharpness of the convergence order. The framework offers a principled recipe for designing and analyzing Transformers that recall from arbitrarily long, distributional contexts with provable generalization guarantees.

2602.01853 2026-02-03 cs.LG stat.ME stat.ML

Designing Time Series Experiments in A/B Testing with Transformer Reinforcement Learning

Xiangkun Wu, Qianglin Wen, Yingying Zhang, Hongtu Zhu, Ting Li, Chengchun Shi

Abstract

A/B testing has become a gold standard for modern technological companies to conduct policy evaluation. Yet, its application to time series experiments, where policies are sequentially assigned over time, remains challenging. Existing designs suffer from two limitations: (i) they do not fully leverage the entire history for treatment allocation; (ii) they rely on strong assumptions to approximate the objective function (e.g., the mean squared error of the estimated treatment effect) for optimizing the design. We first establish an impossibility theorem showing that failure to condition on the full history leads to suboptimal designs, due to the dynamic dependencies in time series experiments. To address both limitations simultaneously, we next propose a transformer reinforcement learning (RL) approach which leverages transformers to condition allocation on the entire history and employs RL to directly optimize the MSE without relying on restrictive assumptions. Empirical evaluations on synthetic data, a publicly available dispatch simulator, and a real-world ridesharing dataset demonstrate that our proposal consistently outperforms existing designs.

2602.01825 2026-02-03 stat.ME cs.LG math.OC stat.ML

Learning Sequential Decisions from Multiple Sources via Group-Robust Markov Decision Processes

Mingyuan Xu, Zongqi Xia, Tianxi Cai, Doudou Zhou, Nian Si

Abstract

We often collect data from multiple sites (e.g., hospitals) that share common structure but also exhibit heterogeneity. This paper aims to learn robust sequential decision-making policies from such offline, multi-site datasets. To model cross-site uncertainty, we study distributionally robust MDPs with a group-linear structure: all sites share a common feature map, and both the transition kernels and expected reward functions are linear in these shared features. We introduce feature-wise (d-rectangular) uncertainty sets, which preserve tractable robust Bellman recursions while maintaining key cross-site structure. Building on this, we then develop an offline algorithm based on pessimistic value iteration that includes: (i) per-site ridge regression for Bellman targets, (ii) feature-wise worst-case (row-wise minimization) aggregation, and (iii) a data-dependent pessimism penalty computed from the diagonals of the inverse design matrices. We further propose a cluster-level extension that pools similar sites to improve sample efficiency, guided by prior knowledge of site similarity. Under a robust partial coverage assumption, we prove a suboptimality bound for the resulting policy. Overall, our framework addresses multi-site learning with heterogeneous data sources and provides a principled approach to robust planning without relying on strong state-action rectangularity assumptions.

2602.01770 2026-02-03 stat.CO

A multifidelity approximate Bayesian computation with pre-filtering

Xuefei Cao, Shijia Wang, Yongdao Zhou

Abstract

Approximate Bayesian Computation (ABC) methods often require extensive simulations, resulting in high computational costs. This paper focuses on multifidelity simulation models and proposes a pre-filtering hierarchical importance sampling algorithm. Under mild assumptions, we theoretically prove that the proposed algorithm satisfies posterior concentration properties, characterize the error upper bound and the relationship between algorithmic efficiency and pre-filtering criteria. Additionally, we provide a practical strategy to assess the suitability of multifidelity models for the proposed method. Finally, we develop a multifidelity ABC sequential Monte Carlo with adaptive pre-filtering strategy. Numerical experiments are used to demonstrate the effectiveness of the proposed approach. We develop an R package that is available at https://github.com/caofff/MAPS

2602.01691 2026-02-03 stat.ME

Locally sparse estimation for simultaneous functional quantile regression

Boyi Hu, Jiguo Cao

Abstract

Motivated by the study of how daily temperature affects soybean yield, this article proposes a simultaneous functional quantile regression (FQR) model featuring a locally sparse bivariate slope function indexed by both quantile and time and linked to a functional predictor. The slope function's local sparsity means it holds non-zero values only in certain segments of its domain, remaining zero elsewhere. These zero-slope regions, which vary by quantile, indicate times when the functional predictor has no discernible impact on the response variable. This feature boosts the model's interpretability. Unlike traditional FQR models, which fit one quantile at a time and have several limitations, our proposed method can handle a spectrum of quantiles simultaneously. We tested the new approach through simulation studies, demonstrating its clear advantages over standard techniques. To validate its practical use, we applied the method to soybean yield data, pinpointing the time periods when daily temperature doesn't affect yield. This insight could be crucial for agricultural planning and crop management.

2602.01648 2026-02-03 stat.ME

Demystify Doubly-Robust Estimation: The Role of Overlap

Chengxin Yang, Laine E. Thomas, Fan Li

Comments Corresponding to Fan Li, Department of Statistical Science, Duke University. Email: fl35@duke.edu

Abstract

The doubly-robust (DR) estimator is popular for evaluating causal effects in observational studies and is often perceived as more desirable than inverse probability weighting (IPW) or outcome modeling alone because it provides extra protection against model misspecification. However, double robustness is an asymptotic property that may not hold in finite samples. We investigate how the finite sample performance of the DR estimator depends on the degree of covariate overlap between comparison groups. Using analytical illustrations and extensive simulations under various scenarios with different degrees of covariate overlap and model specifications, we examine the bias and variance of the DR estimator relative to IPW and outcome modeling estimators. We find that: (i) specification of the outcome model has a stronger influence on the DR estimates than specification of the propensity score model, and this dominance increases as overlap decreases; (ii) with poor overlap, the DR estimator generally amplifies the adverse consequences of extreme weights (large bias and/or variance) regardless of model specifications, and is often inferior to both the IPW and outcome modeling estimators. As a practical guide, we recommend always first checking the degree of overlap in applications. In the case of poor overlap, analysts should consider shifting the target population to a subpopulation with adequate overlap via methods such as trimming or overlap weighting.
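
For readers who want the estimators side by side, the following is a minimal sketch of the textbook augmented-IPW form of a DR estimator together with plain IPW, for the average treatment effect. It assumes the fitted propensity scores `e_hat` and outcome predictions `m1_hat`, `m0_hat` are supplied by the caller. These names are illustrative, and the paper's exact estimators may differ.

```python
def ipw_ate(y, t, e_hat):
    """Inverse probability weighting estimate of the average treatment effect."""
    total = 0.0
    for yi, ti, ei in zip(y, t, e_hat):
        total += ti * yi / ei - (1 - ti) * yi / (1 - ei)
    return total / len(y)


def aipw_ate(y, t, e_hat, m1_hat, m0_hat):
    """Augmented IPW (doubly-robust) estimate of the average treatment effect.

    Each unit contributes the outcome-model prediction plus a weighted
    residual; the residual weights 1/e and 1/(1-e) are what blow up
    when overlap is poor (e_hat close to 0 or 1).
    """
    total = 0.0
    for yi, ti, ei, m1, m0 in zip(y, t, e_hat, m1_hat, m0_hat):
        mu1 = m1 + ti * (yi - m1) / ei
        mu0 = m0 + (1 - ti) * (yi - m0) / (1 - ei)
        total += mu1 - mu0
    return total / len(y)
```

When the outcome predictions match the observed outcomes exactly, the residual terms vanish and the propensity scores no longer matter. With imperfect outcome models, the residuals are multiplied by weights that grow without bound as `e_hat` approaches 0 or 1.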

2602.01631 2026-02-03 stat.ME

Difference-in-Differences under Local Dependence on Networks

Akihiro Sato, Shonosuke Sugasawa

Comments 34 pages (main) + 8 pages (supplement)

Abstract

Estimating causal effects under interference, where the stable unit treatment value assumption is violated, is critical in fields such as regional and public economics. Much of the existing research on causal inference under interference relies on a pre-specified "exposure mapping". This paper focuses on difference-in-differences and proposes a nonparametric identification strategy for direct and indirect average treatment effects under local interference on an observed network. In particular, we propose a new concept of an indirect effect measuring the total outward influence of the intervention. Based on a parallel trends assumption conditional on the neighborhood treatment vector, we develop inverse probability weighted and doubly robust estimators. We establish their asymptotic properties, including consistency under misspecification of nuisance models under some regularity conditions. Simulation studies and an empirical application demonstrate the effectiveness of the proposed method.

2602.01605 2026-02-03 cs.LG stat.ML

Universal Redundancies in Time Series Foundation Models

Anthony Bao, Venkata Hasith Vattikuti, Jeffrey Lai, William Gilpin

Abstract

Time Series Foundation Models (TSFMs) leverage extensive pretraining to accurately predict unseen time series during inference, without the need for task-specific fine-tuning. Through large-scale evaluations on standard benchmarks, we find that leading transformer-based TSFMs exhibit redundant components in their intermediate layers. We introduce a set of tools for mechanistic interpretability of TSFMs, including ablations of specific components and direct logit attribution on the residual stream. Our findings are consistent across several leading TSFMs with diverse architectures, and across a diverse set of real-world and synthetic time-series datasets. We discover that all models in our study are robust to ablations of entire layers. Furthermore, we develop a theoretical framework framing transformers as kernel regressors, motivating a purely intrinsic strategy for ablating heads based on the stable rank of the per-head projection matrices. Using this approach, we uncover the specific heads responsible for degenerate phenomena widely observed in TSFMs, such as parroting of motifs from the context and seasonality bias. Our study sheds light on the universal properties of this emerging class of architectures for continuous-time sequence modeling.

2602.01603 2026-02-03 stat.ML cs.LG

Inference-Aware Meta-Alignment of LLMs via Non-Linear GRPO

Shokichi Takakura, Akifumi Wachi, Rei Higuchi, Kohei Miyaguchi, Taiji Suzuki

Abstract

Aligning large language models (LLMs) to diverse human preferences is fundamentally challenging since criteria can often conflict with each other. Inference-time alignment methods have recently gained popularity as they allow LLMs to be aligned to multiple criteria via different alignment algorithms at inference time. However, inference-time alignment is computationally expensive since it often requires multiple forward passes of the base model. In this work, we propose inference-aware meta-alignment (IAMA), a novel approach that enables LLMs to be aligned to multiple criteria with limited computational budget at inference time. IAMA trains a base model such that it can be effectively aligned to multiple tasks via different inference-time alignment algorithms. To solve the non-linear optimization problems involved in IAMA, we propose non-linear GRPO, which provably converges to the optimal solution in the space of probability measures.

2602.01595 2026-02-03 stat.ME math.ST stat.TH

Data-Driven Uniform Inference for General Continuous Treatment Models via Minimum-Variance Weighting

Chunrong Ai, Wei Huang, Zheng Zhang

Abstract

Ai et al. (2021) studied the estimation of a general dose-response function (GDRF) of a continuous treatment that includes the average dose-response function, the quantile dose-response function, and other expectiles of the dose-response distribution. They specified the GDRF as a parametric function of the treatment status only and proposed a weighted regression with the weighting function estimated using the maximum entropy approach. This paper specifies the GDRF as a nonparametric function of the treatment status, proposes a weighted local linear regression for estimating GDRF, and develops a bootstrap procedure for constructing the uniform confidence bands. We propose stable weights with minimum sample variance while eliminating the sample association between the treatment and the confounding variables. The proposed weights admit a closed-form expression, allowing them to be computed efficiently in the bootstrap sampling. Under certain conditions, we derive the uniform Bahadur representation for the proposed estimator of GDRF and establish the validity of the corresponding uniform confidence bands. A fully data-driven approach to choosing the undersmooth tuning parameters and a data-driven bias-control confidence band are included. A simulation study and an application demonstrate the usefulness of the proposed approach.

2602.01573 2026-02-03 stat.ME stat.ML

When Is Generalized Bayes Bayesian? A Decision-Theoretic Characterization of Loss-Based Updating

Kenichiro McAlinn, Kōsaku Takanashi

Abstract

Loss-based updating, including generalized Bayes, Gibbs, and quasi-posteriors, replaces likelihoods by a user-chosen loss and produces a posterior-like distribution via exponential tilt. We give a decision-theoretic characterization that separates \emph{belief posteriors} -- conditional beliefs justified by the foundations of Savage and Anscombe-Aumann under a joint probability model -- from \emph{decision posteriors} -- randomized decision rules justified by preferences over decision rules. We make explicit that a loss-based posterior coincides with ordinary Bayes if and only if the loss is, up to scale and a data-only term, negative log-likelihood. We then show that generalized marginal likelihood is not evidence for decision posteriors, and Bayes factors are not well-defined without additional structure. In the decision posterior regime, non-degenerate posteriors require nonlinear preferences over decision rules. Under sequential coherence and separability, these lead to an entropy-penalized variational representation yielding generalized Bayes as the optimal rule.
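
The "if" direction of the coincidence with ordinary Bayes can be checked in one line. The notation below is an assumption, not taken from the abstract: prior $\pi(\theta)$, loss $\ell(\theta; x)$, learning rate $\eta > 0$ standing for the scale, and $c(x)$ for the data-only term.

```latex
\begin{align*}
\pi_\eta(\theta \mid x) &\propto \pi(\theta)\,\exp\{-\eta\,\ell(\theta; x)\}, \\
\ell(\theta; x) = -\tfrac{1}{\eta}\log p(x \mid \theta) + c(x)
\;&\Longrightarrow\;
\pi_\eta(\theta \mid x) \propto \pi(\theta)\,p(x \mid \theta)\,e^{-\eta c(x)}
\propto \pi(\theta)\,p(x \mid \theta).
\end{align*}
```

The factor $e^{-\eta c(x)}$ does not depend on $\theta$, so it is absorbed by the normalizing constant and the tilted distribution is the ordinary Bayes posterior.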

2602.01551 2026-02-03 stat.AP

Bayesian brain mapping: population-informed individualized functional topography and connectivity

Nohelia Da Silva Sanchez, Diego Derman, Damon D. Pham, Ellyn R. Butler, Mary Beth Nebel, Amanda F. Mejia

Abstract

The spatial topography of brain functional organization is increasingly recognized to play an important role in cognition and disease. Accounting for individual differences in functional topography is also crucial for accurately distinguishing spatial and temporal aspects of brain organization. Yet, accurate estimation of individual functional brain networks from functional magnetic resonance imaging (fMRI) without extensive scanning remains challenging, due to low signal-to-noise ratio. Here, we describe Bayesian brain mapping (BBM), a technique for individual functional topography and connectivity leveraging population information. Population-derived priors for both spatial topography and functional connectivity based on existing spatial templates, such as parcellations or continuous network maps, are used to guide subject-level estimation and combat noise. BBM is highly flexible, avoiding strong spatial or temporal constraints and allowing for overlap between networks and heterogeneous patterns of engagement. Unlike multi-subject hierarchical models, BBM is designed for single-subject analysis, making it highly computationally efficient and translatable to clinical settings. Here, we describe the BBM model and illustrate the use of the BayesBrainMap R package to construct population-derived priors, fit the model, and perform inference to identify engagements. A demo is provided in an accompanying GitHub repository. We also share priors derived from the Human Connectome Project database and provide code to support the construction of priors from different data sources, lowering the barrier to adoption of BBM for studies of individual brain organization.

2602.01485 2026-02-03 cs.LG stat.ML

Predicting and improving test-time scaling laws via reward tail-guided search

Muheng Li, Jian Qian, Wenlong Mou

Comments 33 pages, 5 figures

Abstract

Test-time scaling has emerged as a critical avenue for enhancing the reasoning capabilities of Large Language Models (LLMs). Though the straightforward "best-of-$N$" (BoN) strategy has already demonstrated significant improvements in performance, it lacks principled guidance on the choice of $N$, budget allocation, and multi-stage decision-making, thereby leaving substantial room for optimization. While many works have explored such optimization, rigorous theoretical guarantees remain limited. In this work, we propose new methodologies to predict and improve scaling properties via tail-guided search. By estimating the tail distribution of rewards, our method predicts the scaling law of LLMs without the need for exhaustive evaluations. Leveraging this prediction tool, we introduce Scaling-Law Guided (SLG) Search, a new test-time algorithm that dynamically allocates compute to identify and exploit intermediate states with the highest predicted potential. We theoretically prove that SLG achieves vanishing regret compared to perfect-information oracles, and achieves expected rewards that would otherwise require a polynomially larger compute budget when using BoN. Empirically, we validate our framework across different LLMs and reward models, confirming that tail-guided allocation consistently achieves higher reward yields than Best-of-$N$ under identical compute budgets. Our code is available at https://github.com/PotatoJnny/Scaling-Law-Guided-search.
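
As a point of reference for the BoN baseline only (not the SLG algorithm), here is a minimal Monte Carlo sketch of how the expected best-of-$N$ reward grows with $N$. The reward sampler is a stand-in assumption: rewards are drawn uniformly on [0, 1], for which the expected maximum of $N$ draws is $N/(N+1)$. The function names are hypothetical.

```python
import random


def best_of_n(sample_reward, n):
    """Draw n candidate rewards and keep the largest (the BoN selection rule)."""
    return max(sample_reward() for _ in range(n))


def mean_best_of_n(sample_reward, n, trials=20000):
    """Monte Carlo estimate of the expected best-of-n reward."""
    return sum(best_of_n(sample_reward, n) for _ in range(trials)) / trials


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs agree
    for n in (1, 2, 4, 8, 16):
        # For uniform rewards the exact value is n / (n + 1).
        print(n, round(mean_best_of_n(rng.random, n), 3), round(n / (n + 1), 3))
```

How fast this curve rises depends on the upper tail of the reward distribution, which is the quantity the abstract proposes to estimate.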

2602.01480 2026-02-03 cs.LG cs.AI math.OC stat.ML

Rod Flow: A Continuous-Time Model for Gradient Descent at the Edge of Stability

Eric Regis, Sinho Chewi

Abstract

How can we understand gradient-based training over non-convex landscapes? The edge of stability phenomenon, introduced in Cohen et al. (2021), indicates that the answer is not so simple: namely, gradient descent (GD) with large step sizes often diverges away from the gradient flow. In this regime, the "Central Flow", recently proposed in Cohen et al. (2025), provides an accurate ODE approximation to the GD dynamics over many architectures. In this work, we propose Rod Flow, an alternative ODE approximation, which carries the following advantages: (1) it rests on a principled derivation stemming from a physical picture of GD iterates as an extended one-dimensional object -- a "rod"; (2) it better captures GD dynamics for simple toy examples and matches the accuracy of Central Flow for representative neural network architectures; and (3) it is explicit and cheap to compute. Theoretically, we prove that Rod Flow correctly predicts the critical sharpness threshold and explains self-stabilization in quartic potentials. We validate our theory with a range of numerical experiments.
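
The sharpness threshold behind the edge of stability can be seen on the simplest possible example. This is a minimal sketch on a one-dimensional quadratic $f(x) = \tfrac{1}{2} a x^2$, chosen here for illustration and not taken from the paper: GD with step size $\eta$ multiplies the iterate by $(1 - \eta a)$ at every step, so it contracts only when the sharpness $a$ is below $2/\eta$.

```python
def gd_quadratic(sharpness, step_size, x0=1.0, n_steps=50):
    """Run gradient descent on f(x) = 0.5 * sharpness * x**2 and return the last iterate."""
    x = x0
    for _ in range(n_steps):
        grad = sharpness * x      # f'(x) = sharpness * x
        x -= step_size * grad     # equivalent to x *= (1 - step_size * sharpness)
    return x


if __name__ == "__main__":
    step = 0.1                    # threshold is 2 / step = 20
    for a in (5.0, 19.0, 21.0):
        print(a, abs(gd_quadratic(a, step)))
```

Below the threshold the iterates shrink (oscillating in sign once $\eta a > 1$). Above it they grow geometrically. Rod Flow and Central Flow concern what happens on non-quadratic losses, where training hovers near this threshold instead of diverging.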

2602.01477 2026-02-03 stat.ML cs.LG

Density-Informed Pseudo-Counts for Calibrated Evidential Deep Learning

Pietro Carlotti, Nevena Gligić, Arya Farahi

Abstract

Evidential Deep Learning (EDL) is a popular framework for uncertainty-aware classification that models predictive uncertainty via Dirichlet distributions parameterized by neural networks. Despite its popularity, its theoretical foundations and behavior under distributional shift remain poorly understood. In this work, we provide a principled statistical interpretation by proving that EDL training corresponds to amortized variational inference in a hierarchical Bayesian model with a tempered pseudo-likelihood. This perspective reveals a major drawback: standard EDL conflates epistemic and aleatoric uncertainty, leading to systematic overconfidence on out-of-distribution (OOD) inputs. To address this, we introduce Density-Informed Pseudo-count EDL (DIP-EDL), a new parametrization that decouples class prediction from the magnitude of uncertainty by separately estimating the conditional label distribution and the marginal covariate density. This separation preserves evidence in high-density regions while shrinking predictions toward a uniform prior for OOD data. Theoretically, we prove that DIP-EDL achieves asymptotic concentration. Empirically, we show that our method enhances interpretability and improves robustness and uncertainty calibration under distributional shift.

2602.01468 2026-02-03 cs.LG stat.ML

A Statistical Theory of Gated Attention through the Lens of Hierarchical Mixture of Experts

Viet Nguyen, Tuan Minh Pham, Thinh Cao, Tan Dinh, Huy Nguyen, Nhat Ho, Alessandro Rinaldo

Comments Viet Nguyen, Tuan Minh Pham, and Thinh Cao contributed equally to this work

Abstract

Self-attention has greatly contributed to the success of the widely used Transformer architecture by enabling learning from data with long-range dependencies. In an effort to improve performance, a gated attention model that leverages a gating mechanism within the multi-head self-attention has recently been proposed as a promising alternative. Gated attention has been empirically demonstrated to increase the expressiveness of low-rank mapping in standard attention and even to eliminate the attention sink phenomenon. Despite its efficacy, a clear theoretical understanding of gated attention's benefits remains lacking in the literature. To close this gap, we rigorously show that each entry in a gated attention matrix or a multi-head self-attention matrix can be written as a hierarchical mixture of experts. By recasting learning as an expert estimation problem, we demonstrate that gated attention is more sample-efficient than multi-head self-attention. In particular, while the former needs only a polynomial number of data points to estimate an expert, the latter requires exponentially many data points to achieve the same estimation error. Furthermore, our analysis also provides a theoretical justification for why gated attention yields higher performance when a gate is placed at the output of the scaled dot product attention or the value map rather than at other positions in the multi-head self-attention architecture.

2602.01466 2026-02-03 stat.ML cs.LG

Rethinking Multinomial Logistic Mixture of Experts with Sigmoid Gating Function

Tuan Minh Pham, Thinh Cao, Viet Nguyen, Huy Nguyen, Nhat Ho, Alessandro Rinaldo

Comments Tuan Minh Pham, Thinh Cao, and Viet Nguyen contributed equally to this work

Abstract

The sigmoid gate in mixture-of-experts (MoE) models has been empirically shown to outperform the softmax gate across several tasks, ranging from approximating feed-forward networks to language modeling. Additionally, recent efforts have demonstrated that the sigmoid gate is provably more sample-efficient than its softmax counterpart under regression settings. Nevertheless, there are three notable concerns that have not been addressed in the literature, namely (i) the benefits of the sigmoid gate have not been established under classification settings; (ii) existing sigmoid-gated MoE models may not converge to their ground-truth; and (iii) the effects of a temperature parameter in the sigmoid gate remain theoretically underexplored. To tackle these open problems, we perform a comprehensive analysis of multinomial logistic MoE equipped with a modified sigmoid gate to ensure model convergence. Our results indicate that the sigmoid gate exhibits a lower sample complexity than the softmax gate for both parameter and expert estimation. Furthermore, we find that incorporating a temperature into the sigmoid gate leads to a sample complexity of exponential order due to an intrinsic interaction between the temperature and gating parameters. To overcome this issue, we propose replacing the vanilla inner product score in the gating function with a Euclidean score that effectively removes that interaction, thereby substantially improving the sample complexity to a polynomial order.

2601.23172 2026-02-03 q-fin.ST math.PR q-fin.MF q-fin.TR stat.AP

A unified theory of order flow, market impact, and volatility

Johannes Muhle-Karbe, Youssef Ouazzani Chahdi, Mathieu Rosenbaum, Grégoire Szymanski

Abstract

We propose a microstructural model for the order flow in financial markets that distinguishes between {\it core orders} and {\it reaction flow}, both modeled as Hawkes processes. This model has a natural scaling limit that reconciles a number of salient empirical properties: persistent signed order flow, rough trading volume and volatility, and power-law market impact. In our framework, all these quantities are pinned down by a single statistic $H_0$, which measures the persistence of the core flow. Specifically, the signed flow converges to the sum of a fractional process with Hurst index $H_0$ and a martingale, while the limiting traded volume is a rough process with Hurst index $H_0-1/2$. No-arbitrage constraints imply that volatility is rough, with Hurst parameter $2H_0-3/2$, and that the price impact of trades follows a power law with exponent $2-2H_0$. The analysis of signed order flow data yields an estimate $H_0 \approx 3/4$. This is not only consistent with the square-root law of market impact, but also turns out to match estimates for the roughness of traded volumes and volatilities remarkably well.
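
Plugging the empirical estimate into the stated exponents makes the consistency claims concrete. The worked arithmetic below treats $H_0 \approx 3/4$ as an equality for illustration.

```latex
\begin{align*}
\text{Hurst index of traded volume:}\quad & H_0 - \tfrac{1}{2} = \tfrac{3}{4} - \tfrac{1}{2} = \tfrac{1}{4}, \\
\text{Hurst parameter of volatility:}\quad & 2H_0 - \tfrac{3}{2} = \tfrac{3}{2} - \tfrac{3}{2} = 0, \\
\text{price-impact exponent:}\quad & 2 - 2H_0 = 2 - \tfrac{3}{2} = \tfrac{1}{2}.
\end{align*}
```

An impact exponent of $1/2$ is the square-root law, and a volatility Hurst parameter near zero corresponds to very rough volatility.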

2601.16979 2026-02-03 cs.LG cond-mat.dis-nn cs.AI stat.ML

A Scalable Measure of Loss Landscape Curvature for Analyzing the Training Dynamics of LLMs

Dayal Singh Kalra, Jean-Christophe Gagnon-Audet, Andrey Gromov, Ishita Mediratta, Kelvin Niu, Alexander H Miller, Michael Shvartsman

Comments Improved Appendix D proofs, text for clarity, added more related works

Understanding the curvature evolution of the loss landscape is fundamental to analyzing the training dynamics of neural networks. The most commonly studied measure, Hessian sharpness ($λ_{\max}^H$) -- the largest eigenvalue of the loss Hessian -- determines local training stability and interacts with the learning rate throughout training. Despite its significance in analyzing training dynamics, direct measurement of Hessian sharpness remains prohibitive for Large Language Models (LLMs) due to high computational cost. We analyze $\textit{critical sharpness}$ ($λ_c$), a computationally efficient measure requiring fewer than $10$ forward passes given the update direction $Δ\mathbf{θ}$. Critically, this measure captures well-documented Hessian sharpness phenomena, including progressive sharpening and Edge of Stability. Using this measure, we provide the first demonstration of these sharpness phenomena at scale, up to $7$B parameters, spanning both pre-training and mid-training of OLMo-2 models. We further introduce $\textit{relative critical sharpness}$ ($λ_c^{1\to 2}$), which quantifies the curvature of one loss landscape while optimizing another, to analyze the transition from pre-training to fine-tuning and guide data mixing strategies. Critical sharpness provides practitioners with a practical tool for diagnosing curvature dynamics and informing data composition choices at scale. More broadly, our work shows that scalable curvature measures can provide actionable insights for large-scale training.

2601.16318 2026-02-03 stat.ME

Orthogonal factorial designs for trials of therapist-delivered interventions: Randomising intervention-therapist combinations to patients

Rebecca EA Walwyn, Rosemary A Bailey, Arpan Singh, Neil Corrigan, Steven G Gilmour

It is recognised that treatment-related clustering should be allowed for in the sample size and analyses of individually-randomised parallel-group trials that evaluate therapist-delivered interventions such as psychotherapy. Here, interventions are a treatment factor, but therapists are not. If the aim of a trial is to separate effects of therapists from those of interventions, we propose that interventions and therapists should be regarded as two potentially interacting treatment factors (one fixed, one random) with a factorial structure. We consider the specific design where each therapist delivers each intervention (crossed therapist-intervention design), and the resulting therapist-intervention combinations are randomised to patients. We adopt a classical Design of Experiments (DoE) approach to propose a family of orthogonal factorial designs and their associated data analyses, which allow for therapist learning and centre too. We set out the associated data analyses using ANOVA and regression and report the results of a small simulation study conducted to explore the performance of the proposed randomisation methods in estimating the intervention effect and its standard error, the between-therapist variance and the between-therapist variance in the intervention effect. We conclude that more purposeful trial design has the potential to lead to better evidence on a range of complex interventions and outline areas for further methodological research.

2601.00521 2026-02-03 eess.SY cs.AI cs.SY stat.AP

Probability-Aware Parking Selection

Cameron Hickert, Sirui Li, Zhengbing He, Cathy Wu

Comments 10 pages, 8 figures, 3 tables. To be published in IEEE Transactions on Intelligent Transportation Systems

Current navigation systems conflate time-to-drive with the true time-to-arrive by ignoring parking search duration and the final walking leg. Such underestimation can significantly affect user experience, mode choice, congestion, and emissions. To address this issue, this paper introduces the probability-aware parking selection problem, which aims to direct drivers to the best parking location rather than straight to their destination. An adaptable dynamic programming framework is proposed that leverages probabilistic, lot-level availability to minimize the expected time-to-arrive. Closed-form analysis determines when it is optimal to target a specific parking lot or explore alternatives, as well as the expected time cost. Sensitivity analysis and three illustrative cases are examined, demonstrating the model's ability to account for the dynamic nature of parking availability. Given the high cost of permanent sensing infrastructure, we assess the error rates of using stochastic observations to estimate availability. Experiments with real-world data from the US city of Seattle indicate this approach's viability, with mean absolute error decreasing from 7% to below 2% as observation frequency increases. In data-based simulations, probability-aware strategies demonstrate time savings up to 66% relative to probability-unaware baselines, yet still take up to 123% longer than time-to-drive estimates.

2512.23973 2026-02-03 cs.SI cs.AI stat.ML

A Community-Aware Framework for Influence Maximization with Explicit Accounting for Inter-Community Influence

Eliot W. Robson, Abhishek K. Umrawal

Comments 7 pages, 4 figures, and 1 table

Influence Maximization (IM) seeks to identify a small set of seed nodes in a social network to maximize expected information spread under a diffusion model. While community-based approaches improve scalability by exploiting modular structure, they typically assume independence between communities, overlooking inter-community influence, a limitation that reduces effectiveness in real-world networks. We introduce Community-IM++, a scalable framework that explicitly models cross-community diffusion through a principled heuristic based on community-based diffusion degree (CDD) and a progressive budgeting strategy. The algorithm partitions the network, computes CDD to prioritize bridging nodes, and allocates seeds adaptively across communities using lazy evaluation to minimize redundant computations. Experiments on large real-world social networks under different edge weight models show that Community-IM++ achieves near-greedy influence spread at up to 100 times lower runtime, while outperforming Community-IM and degree heuristics across budgets and structural conditions. These results demonstrate the practicality of Community-IM++ for large-scale applications such as viral marketing, misinformation control, and public health campaigns, where efficiency and cross-community reach are critical.

2511.23402 2026-02-03 cs.LG stat.ML

Quantized-Tinyllava: a new multimodal foundation model enables efficient split learning

Jiajun Guo, Xin Luo, Jiayin Zheng, Yiqun Wang, Kai-Wei Chang, Wei Wang, Jie Liu

Multimodal foundation models are increasingly trained on sensitive data across domains such as finance, biomedicine, and personal identifiers. However, this distributed setup raises serious privacy concerns due to the need for cross-partition data sharing. Split learning addresses these concerns by enabling collaborative model training without raw data exchange between partitions, yet it introduces a significant challenge: transmitting high-dimensional intermediate feature representations between partitions leads to substantial communication costs. To address this challenge, we propose Quantized-TinyLLaVA, a multimodal foundation model with an integrated communication-efficient split learning framework. Our approach adopts a compression module that quantizes intermediate features into discrete representations before transmission, substantially reducing communication overhead. In addition, we derive a principled quantization strategy grounded in entropy coding theory to determine the optimal number of discrete representation levels. We deploy our framework in a two-partition setting, with one partition operating as the client and the other as the server, to realistically simulate distributed training. Under this setup, Quantized-TinyLLaVA achieves an approximate 87.5% reduction in communication overhead with 2-bit quantization, while maintaining the performance of the original 16-bit model across five benchmark datasets. Furthermore, our compressed representations exhibit enhanced resilience against feature inversion attacks, validating the privacy of transmission. The code is available at https://github.com/anonymous-1742/Quantized-TinyLLaVA.
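
The reported 87.5% figure is consistent with a simple bit-width ratio between 2-bit and 16-bit representations. The sketch below assumes that communication cost scales linearly with the number of bits per transmitted value and ignores any codebook or metadata overhead; the abstract does not state how the figure was computed, and the function `communication_reduction` is a hypothetical helper.

```python
def communication_reduction(original_bits, quantized_bits):
    """Fraction of traffic saved, assuming cost is proportional to bits per value."""
    return 1.0 - quantized_bits / original_bits


# 16-bit features quantized to 2 bits, as in the abstract.
reduction = communication_reduction(16, 2)
print(f"{reduction:.1%}")
```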

2511.06189 2026-02-03 stat.ME math.ST stat.ML stat.TH

Counterfactual Forecasting for Panel Data

Navonil Deb, Raaz Dwivedi, Sumanta Basu

Comments 39 pages, 10 figures

We address the challenge of forecasting counterfactual outcomes in panel data with missing entries and temporally dependent latent factors -- a common scenario in causal inference, where estimating unobserved potential outcomes ahead of time is essential. We propose Forecasting Counterfactuals under Stochastic Dynamics (FOCUS), a method that extends traditional matrix completion methods by leveraging time series dynamics of the factors, thereby enhancing the prediction accuracy of future counterfactuals. Building upon a consistent estimator of the factors, our method accommodates both stochastic and deterministic components within the factors, and provides a flexible framework for various applications. In the case of stationary autoregressive factors and under standard conditions, we derive error bounds and establish asymptotic normality of our estimator. Empirical evaluations demonstrate that our method outperforms existing benchmarks when the latent factors have an autoregressive component. We apply FOCUS to HeartSteps, a mobile health study, illustrating its effectiveness in forecasting step counts for users receiving activity prompts, thereby leveraging temporal patterns in user behavior.

2507.13024 2026-02-03 stat.ML cs.LG

When Pattern-by-Pattern Works: Theoretical and Empirical Insights for Logistic Models with Missing Values

Christophe Muller, Erwan Scornet, Julie Josse

Predicting with missing inputs challenges even parametric models, as parameter estimation alone is insufficient for prediction on incomplete data. While several works study prediction in linear models, we focus on logistic models, where optimal predictors lack closed-form expressions. We prove that a Pattern-by-Pattern strategy (PbP), which learns one logistic model per missingness pattern, accurately approximates Bayes probabilities under a Gaussian Pattern Mixture Model (GPMM). Crucially, this result holds across standard missing data scenarios (MCAR and MAR) and, notably, in Missing Not at Random (MNAR) settings where standard methods often fail. Empirically, we compare PbP against imputation and EM methods across classification, probability estimation, calibration, and inference. Our analysis provides a comprehensive view of logistic regression with missing values. It reveals that mean imputation can be used as a baseline for low sample sizes and PbP for large sample sizes, as both methods are fast to train and can perform well in some settings. The best performance is achieved by non-linear multiple iterative imputation techniques that include the response label (Random Forest MICE with response), which are more computationally expensive.

2506.02754 2026-02-03 stat.ML cs.LG

Safely Learning Controlled Stochastic Dynamics

Luc Brogat-Motte, Alessandro Rudi, Riccardo Bonalli

We address the problem of safely learning controlled stochastic dynamics from discrete-time trajectory observations, ensuring system trajectories remain within predefined safe regions during both training and deployment. Safety-critical constraints of this kind are crucial in applications such as autonomous robotics, finance, and biomedicine. We introduce a method that ensures safe exploration and efficient estimation of system dynamics by iteratively expanding an initial known safe control set using kernel-based confidence bounds. After training, the learned model enables predictions of the system's dynamics and permits safety verification of any given control. Our approach requires only mild smoothness assumptions and access to an initial safe control set, enabling broad applicability to complex real-world systems. We provide theoretical guarantees for safety and derive adaptive learning rates that improve with increasing Sobolev regularity of the true dynamics. Experimental evaluations demonstrate the practical effectiveness of our method in terms of safety, estimation accuracy, and computational efficiency.

2506.01582 2026-02-03 cs.LG cond-mat.dis-nn cs.IT math.IT stat.ML

Bayes optimal learning of attention-indexed models

Fabrizio Boncoraglio, Emanuele Troiani, Vittorio Erba, Lenka Zdeborová

Journal ref NeurIPS 2025

We introduce the attention-indexed model (AIM), a theoretical framework for analyzing learning in deep attention layers. Inspired by multi-index models, AIM captures how token-level outputs emerge from layered bilinear interactions over high-dimensional embeddings. Unlike prior tractable attention models, AIM allows full-width key and query matrices, aligning more closely with practical transformers. Using tools from statistical mechanics and random matrix theory, we derive closed-form predictions for Bayes-optimal generalization error and identify sharp phase transitions as a function of sample complexity, model width, and sequence length. We propose a matching approximate message passing algorithm and show that gradient descent can reach optimal performance. AIM offers a solvable playground for understanding learning in self-attention layers, which are key components of modern architectures.

2505.22518 2026-02-03 stat.ML cs.LG

IGNIS: A Robust Neural Network Framework for Constrained Parameter Estimation in Archimedean Copulas

Agnideep Aich

Classical estimators, the cornerstones of statistical inference, face insurmountable challenges when applied to important emerging classes of Archimedean copulas. These models exhibit pathological properties, including numerically unstable densities, a restrictive lower bound on Kendall's tau, and vanishingly small likelihood gradients, making MLE brittle and limiting MoM's applicability to datasets with sufficiently strong dependence (i.e., only when the empirical Kendall's $τ$ exceeds the family's lower bound $\approx 0.545$). We introduce IGNIS, a unified neural estimation framework that sidesteps these barriers by learning a direct, robust mapping from data-driven dependency measures to the underlying copula parameter $θ$. IGNIS utilizes a multi-input architecture and a theory-guided output layer ($\mathrm{softplus}(z) + 1$) to automatically enforce the domain constraint $\hat{θ} \geq 1$. Trained and validated on four families (Gumbel, Joe, and the numerically challenging A1/A2), IGNIS delivers accurate and stable estimates for real-world financial and health datasets, demonstrating its necessity for reliable inference in modern, complex dependence models where traditional methods fail. To our knowledge, IGNIS is the first standalone, general-purpose neural estimator for Archimedean copulas (not a generative model or likelihood optimizer), delivering direct, constraint-aware $\hat{θ}$ and readily extensible to additional families via retraining or minor output-layer adaptations.
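
The output-layer constraint described in this abstract, $\mathrm{softplus}(z) + 1$, can be illustrated in isolation. The sketch below is not the IGNIS network; it only shows, for a raw scalar output `z` (a hypothetical name), why the transformed value can never fall below 1. A numerically stable form of softplus is used so that large inputs do not overflow.

```python
import math


def softplus(z):
    """Stable softplus: log(1 + exp(z)) = max(z, 0) + log1p(exp(-|z|))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def constrained_theta(z):
    """Map an unconstrained raw output z to an estimate satisfying theta >= 1."""
    return softplus(z) + 1.0


# softplus is non-negative for every real z, so the result is at least 1.
for z in (-50.0, -1.0, 0.0, 3.0, 800.0):
    print(z, constrained_theta(z))
```

Because softplus is smooth and strictly increasing, this mapping keeps gradients available during training, which is a common reason to prefer it over hard clipping at 1.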

2505.20757 2026-02-03 stat.AP stat.ME

Performance of prior event rate ratio method in the presence of differential mortality or dropout

Yin Bun Cheung, Xiangmei Ma

Comments 13 pages, including 2 figures

Purpose: The prior event rate ratio (PERR) method was proposed to control for measured or unmeasured confounders in real-world evaluation of effectiveness and safety of medical treatments using electronic medical records data. A widely cited simulation study showed that the PERR estimate of treatment effect was biased in the presence of differential mortality/dropout. However, the study only considered one specific PERR estimator of treatment effect and one specific scenario of differential mortality/dropout. To enhance understanding of the method, we replicated and extended the simulation to consider an alternative PERR estimator and multiple scenarios. Methods: Simulation studies were performed with varying rates of mortality/dropout, including the scenario in the previous study in which mortality/dropout was simultaneously influenced by treatment, confounder and prior event, and scenarios that differed in the determinants of mortality/dropout. In addition to the PERR estimator used in the previous study (PERR_Prev) that involved data from both completers and non-completers, we also evaluated an alternative PERR estimator (PERR_Comp) that used data only from completers. Results: The bias of PERR_Prev in the previously considered mortality/dropout scenario was replicated. The bias of PERR_Comp was only about one-third in magnitude compared to that of PERR_Prev in this scenario. Furthermore, PERR_Prev gave biased estimates of treatment effect, but PERR_Comp did not, in scenarios in which mortality/dropout was influenced by treatment or confounder but not prior event. Conclusion: The PERR is better seen as a methodological framework within which there is more than one way to operationalize the estimation. Its performance depends on the specific operationalization. PERR_Comp provides unbiased estimates unless mortality/dropout is affected by prior event.

2505.17958 2026-02-03 stat.ML cond-mat.dis-nn cs.IT cs.LG math.IT

The Nuclear Route: Sharp Asymptotics of ERM in Overparameterized Quadratic Networks

Vittorio Erba, Emanuele Troiani, Lenka Zdeborová, Florent Krzakala

Journal ref NeurIPS 2025

We study the high-dimensional asymptotics of empirical risk minimization (ERM) in over-parametrized two-layer neural networks with quadratic activations trained on synthetic data. We derive sharp asymptotics for both training and test errors by mapping the $\ell_2$-regularized learning problem to a convex matrix sensing task with nuclear norm penalization. This reveals that capacity control in such networks emerges from a low-rank structure in the learned feature maps. Our results characterize the global minima of the loss and yield precise generalization thresholds, showing how the width of the target function governs learnability. This analysis bridges and extends ideas from spin-glass methods, matrix factorization, and convex optimization and emphasizes the deep link between low-rank matrix sensing and learning in quadratic neural networks.

2505.12801 2026-02-03 stat.ML cs.LG

Transportability without Graphs: A Bayesian Approach to Identifying s-Admissible Backdoor Sets

Konstantina Lelova, Gregory F. Cooper, Sofia Triantafillou

Transporting causal information across populations is a critical challenge in clinical decision-making. Causal modeling provides criteria for identifiability and transportability, but these require knowledge of the causal graph, which rarely holds in practice. We propose a Bayesian method that combines observational data from the target domain with experimental data from a different domain to identify s-admissible backdoor sets, which enable unbiased estimation of causal effects across populations, without requiring the causal graph. We prove that if such a set exists, we can always find one within the Markov boundary of the outcome, narrowing the search space, and we establish asymptotic convergence guarantees for our method. We develop a greedy algorithm that reframes transportability as a feature selection problem, selecting conditioning sets that maximize the marginal likelihood of experimental data given observational data. In simulated and semi-synthetic data, our method correctly identifies transportability bias, improves causal effect estimation, and performs favorably against alternatives.

2505.09134 2026-02-03 cs.LG stat.ML

Scaling Gaussian Process Regression with Full Derivative Observations

Daniel Huang

Comments 13 pages, Published in TMLR

We present a scalable Gaussian Process (GP) method called DSoftKI that can fit and predict full derivative observations. It extends SoftKI, a method that approximates a kernel via softmax interpolation, to the setting with derivatives. DSoftKI enhances SoftKI's interpolation scheme by replacing its global temperature vector with local temperature vectors associated with each interpolation point. This modification allows the model to encode local directional sensitivity, enabling the construction of a scalable approximate kernel, including its first and second-order derivatives, through interpolation. Moreover, the interpolation scheme eliminates the need for kernel derivatives, facilitating extensions such as Deep Kernel Learning (DKL). We evaluate DSoftKI on synthetic benchmarks, a toy n-body physics simulation, standard regression datasets with synthetic gradients, and high-dimensional molecular force field prediction (100-1000 dimensions). Our results demonstrate that DSoftKI is accurate and scales to larger datasets with full derivative observations than previously possible.

2504.18677 2026-02-03 math.NA cs.NA stat.CO

Empirical Bernstein and betting confidence intervals for randomized quasi-Monte Carlo

Aadit Jain, Fred J. Hickernell, Art B. Owen, Aleksei G. Sorokin

Randomized quasi-Monte Carlo (RQMC) methods estimate the mean of a random variable by sampling an integrand at $n$ equidistributed points. For scrambled digital nets, the resulting variance is typically $\tilde O(n^{-θ})$ where $θ\in[1,3]$ depends on the smoothness of the integrand and $\tilde O$ neglects logarithmic factors. While RQMC can be far more accurate than plain Monte Carlo (MC) it remains difficult to get confidence intervals on RQMC estimates. We investigate some empirical Bernstein confidence intervals (EBCI) and hedged betting confidence intervals (HBCI), both from Waudby-Smith and Ramdas (2024), when the random variable of interest is subject to known bounds. When there are $N$ integrand evaluations partitioned into $R$ independent replicates of $n=N/R$ RQMC points, and the RQMC variance is $Θ(n^{-θ})$, then an oracle minimizing the width of a Bennett confidence interval would choose $n =Θ(N^{1/(θ+1)})$. The resulting intervals have a width that is $Θ(N^{-θ/(θ+1)})$. Our empirical investigations had optimal values of $n$ grow slowly with $N$, HBCI intervals that were usually narrower than the EBCI ones, and optimal values of $n$ for HBCI that were equal to or smaller than the ones for the oracle.
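
The oracle scaling quoted in this abstract can be tabulated directly. The sketch below evaluates $n = N^{1/(θ+1)}$ and the width rate $N^{-θ/(θ+1)}$ with all constants hidden in the $Θ(\cdot)$ notation set to 1, which is an assumption made purely for illustration; the function name `oracle_allocation` is hypothetical.

```python
def oracle_allocation(total_evals, theta):
    """Oracle-rate split of N evaluations into R replicates of n RQMC points.

    Constants hidden by the Theta notation are taken to be 1 (illustrative only).
    """
    n = total_evals ** (1.0 / (theta + 1.0))              # points per replicate
    replicates = total_evals / n                          # R = N / n
    width_rate = total_evals ** (-theta / (theta + 1.0))  # interval width rate
    return n, replicates, width_rate


N = 2 ** 16
for theta in (1.0, 2.0, 3.0):
    n, r, width = oracle_allocation(N, theta)
    print(f"theta={theta}: n~{n:.1f}, R~{r:.1f}, width~{width:.2e}")
```

Under this rate, a smoother integrand (larger $θ$) calls for fewer points per replicate and more replicates, while the achievable width still improves.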

2504.01842 2026-02-03 cs.LG stat.CO

shapr: Explaining Machine Learning Models with Conditional Shapley Values in R and Python

Martin Jullum, Lars Henry Berge Olsen, Jon Lachmann, Annabelle Redelmeier

This paper introduces the shapr R package, a versatile tool for generating Shapley value-based prediction explanations for machine learning and statistical regression models. Moreover, the shaprpy Python library brings the core capabilities of shapr to the Python ecosystem. Shapley values originate from cooperative game theory in the 1950s, but have over the past few years become a widely used method for quantifying how a model's features/covariates contribute to specific prediction outcomes. The shapr package emphasizes conditional Shapley value estimates, providing a comprehensive range of approaches for accurately capturing feature dependencies -- a crucial aspect for correct model explanation, typically lacking in similar software. In addition to regular tabular data, the shapr R package includes specialized functionality for explaining time series forecasts. The package offers a minimal set of user functions with sensible default values for most use cases while providing extensive flexibility for advanced users to fine-tune computations. Additional features include parallelized computations, iterative estimation with convergence detection, and rich visualization tools. shapr also extends its functionality to compute causal and asymmetric Shapley values when causal information is available. Overall, the shapr and shaprpy packages aim to enhance the interpretability of predictive models within a powerful and user-friendly framework.

2503.24004 2026-02-03 math.ST stat.ME stat.TH

Multivariate Species Sampling Models

Beatrice Franzolini, Antonio Lijoi, Igor Prünster, Giovanni Rebaudo

Species sampling processes have long served as the fundamental framework for modeling random discrete distributions and exchangeable sequences. However, data arising from distinct but related sources require a broader notion of probabilistic invariance, making partial exchangeability a natural choice. Countless models for partially exchangeable data, collectively known as dependent nonparametric priors, have been proposed. These include hierarchical, nested and additive processes, widely used in statistics and machine learning. Still, a unifying framework is lacking and key questions about their underlying learning mechanisms remain unanswered. We fill this gap by introducing multivariate species sampling models, a new general class of nonparametric priors that encompasses most existing finite- and infinite-dimensional dependent processes. They are characterized by the induced partially exchangeable partition probability function encoding their multivariate clustering structure. We establish their core distributional properties and analyze their dependence structure, demonstrating that borrowing of information across groups is entirely determined by shared ties. This provides new insights into the underlying learning mechanisms, offering, for instance, a principled rationale for the previously unexplained correlation structure observed in existing models. Beyond providing a cohesive theoretical foundation, our approach serves as a constructive tool for developing new models and opens novel research directions for capturing richer dependence structures beyond the framework of multivariate species sampling processes.

2502.11820 2026-02-03 stat.AP stat.OT

A Diagnostic to Find and Help Combat Stochastic Positivity Issues -- with a Focus on Continuous Treatments

Katharina Ring, Michael Schomaker

Comments 33 pages (24 without appendix), 12 figures (7 without appendix)

The positivity assumption is central in the identification of a causal effect, and especially the stochastic variant is an issue many applied researchers face, yet is rarely discussed, especially in conjunction with continuous treatments or Modified Treatment Policies. One common recommendation for dealing with a violation is to change the estimand. However, an applied researcher is faced with two problems: First, how can she tell whether there is a stochastic positivity violation given her estimand of interest, preferably without having to estimate a model first? Second, if she finds a problem with stochastic positivity, how should she change her estimand in order to arrive at an estimand which does not face the same issues? We suggest a novel diagnostic which allows the researcher to answer both questions by providing insights into how well an estimation for a certain estimand can be made for each observation using the data at hand. We provide a simulation study on the general behaviour of different Modified Treatment Policies (MTPs) at different levels of stochastic positivity violations and show how the diagnostic helps understand where bias is to be expected. We illustrate the application of our proposed diagnostic in a pharmacoepidemiological study based on data from CHAPAS-3, a trial comparing different treatment regimens for children living with HIV.

2501.16538 2026-02-03 stat.AP stat.CO

Synchronized step multilevel Markov chain Monte Carlo

Sanjan C. Muchandimath, Alex A. Gorodetsky

We propose SYNCE (synchronized step correlation enhancement), a new algorithm for coupling Markov chains within multilevel Markov chain Monte Carlo (ML-MCMC) estimators. We apply this algorithm to solve Bayesian inverse problems using multiple model fidelities. SYNCE is inspired by the concept of common random number coupling in Markov chain Monte Carlo sampling. Unlike state-of-the-art methods that rely on the overlap of level-wise posteriors, our approach enables effective coupling even when posteriors differ substantially. This overlap-independence generates significantly higher correlation between samples at different fidelity levels, improving variance reduction and computational efficiency in the ML-MCMC estimator. We prove that SYNCE admits a unique invariant probability measure and demonstrate that the coupled chains converge to this measure faster than existing overlap-dependent methods, particularly when models are dissimilar. Numerical experiments validate that SYNCE consistently outperforms current coupling strategies in terms of computational efficiency and scalability across varying model fidelities and problem dimensions.

2411.11270 2026-02-03 stat.ME cs.NA math.NA math.PR stat.CO

Unbiased Approximations for Stationary Distributions of McKean-Vlasov SDEs

Elsiddig Awadelkarim, Neil K. Chada, Ajay Jasra

We consider the development of unbiased estimators to approximate the stationary distribution of McKean-Vlasov stochastic differential equations (MVSDEs). These are an important class of processes, which frequently appear in applications such as mathematical finance, biology and opinion dynamics. Typically the stationary distribution is unknown and indeed one cannot simulate such processes exactly. As a result one commonly requires a time-discretization scheme which results in a discretization bias and a bias from not being able to simulate the associated stationary distribution. To overcome this bias, we present a new unbiased estimator taking motivation from the literature on unbiased Monte Carlo. We prove the unbiasedness of our estimator, under assumptions. In order to prove this we require developing ergodicity results of various discrete time processes, through an appropriate discretization scheme, towards the invariant measure. Numerous numerical experiments are provided, on a range of MVSDEs, to demonstrate the effectiveness of our unbiased estimator. Such examples include the Curie-Weiss model, a 3D neuroscience model and a parameter estimation problem.

2410.13115 2026-02-03 stat.ME

Online conformal inference for multi-step time series forecasting

Xiaoqian Wang, Rob J Hyndman

We consider the problem of constructing distribution-free prediction intervals for multi-step time series forecasting, with a focus on the temporal dependencies inherent in multi-step forecast errors. We establish that the optimal $h$-step-ahead forecast errors exhibit serial correlation up to lag $(h-1)$ under a general non-stationary autoregressive data generating process. To leverage these properties, we propose the Autocorrelated Multi-step Conformal Prediction (AcMCP) method, which effectively incorporates autocorrelations in multi-step forecast errors, resulting in more statistically efficient prediction intervals. This method guarantees asymptotic marginal coverage for multi-step prediction intervals, though we note that, for finite samples, the coverage error admits an upper bound that increases with the forecasting horizon. Additionally, we extend several easy-to-implement conformal prediction methods, originally designed for single-step forecasting, to accommodate multi-step scenarios. Through empirical evaluations, including simulations and applications to data, we demonstrate that AcMCP achieves coverage that closely aligns with the target within local windows, while providing adaptive prediction intervals that effectively respond to varying conditions.
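
The claim that $h$-step-ahead forecast errors are serially correlated only up to lag $h-1$ can be seen in a simple special case. The sketch below assumes a stationary AR(1) process $y_t = φ y_{t-1} + ε_t$ with white-noise variance $σ^2$, which is narrower than the general non-stationary setting of the abstract. In that case the optimal $h$-step error is $\sum_{j=0}^{h-1} φ^j ε_{t+h-j}$, so errors made $k$ steps apart share shocks only when $k < h$. The function name is hypothetical.

```python
def forecast_error_autocov(phi, h, lag, sigma2=1.0):
    """Autocovariance at a given lag of optimal h-step AR(1) forecast errors.

    The error is a moving average of the h most recent shocks, so two errors
    made `lag` steps apart overlap in h - lag shocks, or in none if lag >= h.
    """
    lag = abs(lag)
    if lag >= h:
        return 0.0
    return sigma2 * sum(phi ** j * phi ** (j + lag) for j in range(h - lag))


# For h = 3 the autocovariance is non-zero at lags 0, 1, 2 and vanishes from lag 3.
for k in range(5):
    print(k, forecast_error_autocov(phi=0.5, h=3, lag=k))
```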

2410.06329 2026-02-03 stat.ML cs.LG eess.SP

Joint Bayesian Parameter and Model Order Estimation for Low-Rank Probability Mass Tensors

Joseph K. Chege, Arie Yeredor, Martin Haardt

Abstract

Obtaining a reliable estimate of the joint probability mass function (PMF) of a set of random variables from observed data is a significant objective in statistical signal processing and machine learning. Modelling the joint PMF as a tensor that admits a low-rank canonical polyadic decomposition (CPD) has enabled the development of efficient PMF estimation algorithms. However, these algorithms require the rank (model order) of the tensor to be specified beforehand. In real-world applications, the true rank is unknown. Therefore, an appropriate rank is usually selected from a candidate set either by observing validation errors or by computing various likelihood-based information criteria, a procedure that could be costly in terms of computational time or hardware resources, or could result in mismatched models which affect the model accuracy. This paper presents a novel Bayesian framework for estimating the low-rank components of a joint PMF tensor and simultaneously inferring its rank from the observed data. We specify a Bayesian PMF estimation model and employ appropriate prior distributions for the model parameters, allowing the rank to be inferred without cross-validation. We then derive a deterministic solution based on variational inference (VI) to approximate the posterior distributions of various model parameters. Numerical experiments involving both synthetic data and real classification and item recommendation data illustrate the advantages of our VI-based method in terms of estimation accuracy, automatic rank detection, and computational efficiency.

2405.10712 2026-02-03 stat.AP

Enhancing the statistical evaluation of earthquake forecasts -- An application to Italy

Jonas R. Brehmer, Kristof Kraus, Tilmann Gneiting, Marcus Herrmann, Warner Marzocchi

Journal ref Seismological Research Letters 96 (3): 1966-1988 (2025)

Abstract

Testing earthquake forecasts is essential to obtain scientific information on forecasting models and sufficient credibility for societal usage. We aim at enhancing the testing phase proposed by the Collaboratory for the Study of Earthquake Predictability (CSEP, Schorlemmer et al., 2018) with new statistical methods supported by mathematical theory. To demonstrate their applicability, we evaluate three short-term forecasting models that were submitted to the CSEP-Italy experiment, and two ensemble models thereof. The models produce weekly overlapping forecasts for the expected number of M4+ earthquakes in a collection of grid cells. We compare the models' forecasts using consistent scoring functions for means or expectations, which are widely used and theoretically principled tools for forecast evaluation. We further discuss and demonstrate their connection to CSEP-style earthquake likelihood model testing, and specifically suggest an improvement of the T-test. Then, using tools from isotonic regression, we investigate forecast reliability and apply score decompositions in terms of calibration and discrimination. Our results show where and how models outperform their competitors and reveal a substantial lack of calibration for various models. The proposed methods also apply to full-distribution (e.g., catalog-based) forecasts, without requiring Poisson distributions or making any other type of parametric assumption.

2310.00617 2026-02-03 stat.ME math.PR math.ST stat.TH

Nonparametric priors with full-range borrowing of information

Filippo Ascolani, Beatrice Franzolini, Antonio Lijoi, Igor Prünster

Abstract

Modeling of the dependence structure across heterogeneous data is crucial for Bayesian inference since it directly impacts the borrowing of information. Despite the extensive advances over the last two decades, most available proposals allow only for non-negative correlations. We derive a new class of dependent nonparametric priors that can induce correlations of any sign, thus introducing a new and more flexible idea of borrowing of information. This is achieved thanks to a novel concept, which we term hyper-tie, and represents a direct and simple measure of dependence. We investigate prior and posterior distributional properties of the model and develop algorithms to perform posterior inference. Illustrative examples on simulated and real data show that our proposal outperforms alternatives in terms of prediction and clustering.

2307.10065 2026-02-03 stat.ME stat.CO stat.ML

Entropy regularization in probabilistic clustering

Beatrice Franzolini, Giovanni Rebaudo

Abstract

Bayesian nonparametric mixture models are widely used to cluster observations. However, one major drawback of the approach is that the estimated partition often presents unbalanced cluster frequencies, with only a few dominating clusters and a large number of sparsely-populated ones. This feature translates into results that are often uninterpretable unless we are willing to ignore a considerable number of observations and clusters. Interpreting the posterior distribution as penalized likelihood, we show how the unbalance can be explained as a direct consequence of the cost functions involved in estimating the partition. In light of our findings, we propose a novel Bayesian estimator of the clustering configuration. The proposed estimator is equivalent to a post-processing procedure that reduces the number of sparsely-populated clusters and enhances interpretability. The procedure takes the form of entropy-regularization of the Bayesian estimate. While being computationally convenient with respect to alternative strategies, it is also theoretically justified as a correction to the Bayesian loss function used for point estimation and, as such, can be applied to any posterior distribution of clusters, regardless of the specific model used.

2305.03038 2026-02-03 math.ST eess.SP math.PR stat.TH

The envelope of a complex Gaussian random variable

Sattwik Ghosal, Ranjan Maitra

Comments 24 pages, 4 figures, 1 table

Abstract

The envelope of an elliptical Gaussian complex vector, or equivalently, the amplitude or norm of a bivariate normal random vector has application in many weather and signal processing contexts. We explicitly characterize its distribution in the general case through its probability density, cumulative distribution and moment generating function. Moments and limiting distributions are also derived. These derivations are exploited to also characterize the special cases where the bivariate Gaussian mean vector and covariance matrix have a simpler structure, providing new additional insights in many cases. Simulations illustrate the benefits of using our formulae over Monte Carlo methods. We also use our derivations to get a better initial characterization of the distribution of the observed values in structural Magnetic Resonance Imaging datasets, and of wind speed.
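
The Monte Carlo baseline mentioned in this abstract can be sketched as follows. This is only an illustrative sketch, not the authors' code: the function name `sample_envelope` and all parameter values are hypothetical. It draws a bivariate normal vector and takes its Euclidean norm as the envelope. In the special case of zero mean, equal unit variances and no correlation, the envelope follows a Rayleigh distribution, whose mean $\sqrt{\pi/2}$ serves as a sanity check.

```python
import math
import random


def sample_envelope(mu_x, mu_y, sigma_x, sigma_y, rho, n, seed=0):
    """Draw n envelope values R = sqrt(X^2 + Y^2) for a bivariate normal (X, Y).

    Hypothetical helper: rho is the correlation between X and Y.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        x = mu_x + sigma_x * z1
        # Mix z1 and z2 so that corr(X, Y) = rho
        y = mu_y + sigma_y * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
        values.append(math.hypot(x, y))
    return values


# Special case: zero mean, unit variances, no correlation (Rayleigh envelope)
samples = sample_envelope(0.0, 0.0, 1.0, 1.0, 0.0, n=20000)
mc_mean = sum(samples) / len(samples)
rayleigh_mean = math.sqrt(math.pi / 2.0)  # known mean of a unit-scale Rayleigh
print(mc_mean, rayleigh_mean)
```

The Monte Carlo estimate carries sampling error of order $1/\sqrt{n}$, which is the kind of cost that closed-form expressions avoid.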

2208.00952 2026-02-03 stat.ME q-fin.ST stat.AP

Change point detection in dynamic Gaussian graphical models: the impact of COVID-19 pandemic on the US stock market

Beatrice Franzolini, Alexandros Beskos, Maria De Iorio, Warrick Poklewski Koziell, Karolina Grzeszkiewicz

Journal ref The Annals of Applied Statistics, 18(1), 555-584, 2024

Abstract

Reliable estimates of volatility and correlation are fundamental in economics and finance for understanding the impact of macroeconomic events on the market and guiding future investments and policies. Dependence across financial returns is likely to be subject to sudden structural changes, especially in correspondence with major global events, such as the COVID-19 pandemic. In this work, we are interested in capturing abrupt changes over time in the dependence across US industry stock portfolios, over a time horizon that covers the COVID-19 pandemic. The selected stocks give a comprehensive picture of the US stock market. To this end, we develop a Bayesian multivariate stochastic volatility model based on a time-varying sequence of graphs capturing the evolution of the dependence structure. The model builds on the Gaussian graphical models and the random change points literature. In particular, we treat the number and position of change points, and the graphs, as objects of posterior inference, allowing for sparsity in graph recovery and change point detection. The high dimension of the parameter space poses complex computational challenges. However, the model admits a hidden Markov model formulation. This leads to the development of an efficient computational strategy, based on a combination of sequential Monte-Carlo and Markov chain Monte-Carlo techniques. Model and computational development are widely applicable, beyond the scope of the application of interest in this work.

2203.15782 2026-02-03 stat.ME stat.AP

Model Selection for Maternal Hypertensive Disorders with Symmetric Hierarchical Dirichlet Processes

Beatrice Franzolini, Antonio Lijoi, Igor Prünster

Journal ref The Annals of Applied Statistics, 17(1): 313-332, 2023

Abstract

Hypertensive disorders of pregnancy occur in about 10% of pregnant women around the world. Though there is evidence that hypertension impacts maternal cardiac functions, the relation between hypertension and cardiac dysfunctions is only partially understood. The study of this relationship can be framed as a joint inferential problem on multiple populations, each corresponding to a different hypertensive disorder diagnosis, that combines multivariate information provided by a collection of cardiac function indexes. A Bayesian nonparametric approach seems particularly suited for this setup and we demonstrate it on a dataset consisting of transthoracic echocardiography results of a cohort of Indian pregnant women. We are able to perform model selection, provide density estimates of cardiac function indexes and a latent clustering of patients: these readily interpretable inferential outputs allow us to single out modified cardiac functions in hypertensive patients compared to healthy subjects and progressively increased alterations with the severity of the disorder. The analysis is based on a Bayesian nonparametric model that relies on a novel hierarchical structure, called symmetric hierarchical Dirichlet process. This is suitably designed so that the mean parameters are identified and used for model selection across populations, a penalization for multiplicity is enforced, and the presence of unobserved relevant factors is investigated through a latent clustering of subjects. Posterior inference relies on a suitable Markov Chain Monte Carlo algorithm and the model behaviour is also showcased on simulated data.

2008.13619 2026-02-03 stat.AP stat.ME

A model and method for analyzing the precision of binary measurement methods based on beta-binomial distributions, and related statistical tests

Jun-ichi Takeshita, Tomomichi Suzuki

Journal ref Quality & Quantity, 59 (2025), 1323-1352

Abstract

This study developed a new statistical model and method for analyzing the precision of binary measurement methods from collaborative studies. The model is based on beta-binomial distributions. In other words, it assumes that the sensitivity of each laboratory obeys a beta distribution, and the binary measured values under a given sensitivity follow a binomial distribution. We propose the key precision measures of repeatability and reproducibility for the model, and provide their unbiased estimates. Further, through consideration of a number of statistical test methods for homogeneity of proportions, we propose appropriate methods for determining laboratory effects in the new model. Finally, we apply the results to real-world examples in the fields of food safety and chemical risk assessment and management.
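
The two-stage data-generating assumption described in this abstract (a beta-distributed sensitivity per laboratory, then binomial counts given that sensitivity) can be simulated with a short sketch. The function name `simulate_collaborative_study` and the parameter values below are hypothetical, chosen only for illustration; the paper's precision estimators are not reproduced here.

```python
import random


def simulate_collaborative_study(n_labs, n_repeats, alpha, beta, seed=0):
    """Simulate positive counts per laboratory under a beta-binomial model.

    Hypothetical helper: each laboratory draws its own sensitivity from
    Beta(alpha, beta), then reports n_repeats binary results.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_labs):
        sensitivity = rng.betavariate(alpha, beta)  # laboratory-level sensitivity
        # Binomial(n_repeats, sensitivity) as a sum of Bernoulli trials
        positives = sum(rng.random() < sensitivity for _ in range(n_repeats))
        counts.append(positives)
    return counts


# Illustrative values only: 10 laboratories, 20 repeated measurements each
counts = simulate_collaborative_study(n_labs=10, n_repeats=20, alpha=8.0, beta=2.0)
print(counts)
```

Variation between the entries of `counts` mixes within-laboratory binomial noise with between-laboratory differences in sensitivity, which is the distinction that repeatability and reproducibility measures are meant to separate.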

2602.01437 2026-02-03 cs.LG stat.ML

Theoretical Analysis of Measure Consistency Regularization for Partially Observed Data

Yinsong Wang, Shahin Shahrampour

Abstract

The problem of corrupted data, missing features, or missing modalities continues to plague the modern machine learning landscape. To address this issue, a class of regularization methods that enforce consistency between imputed and fully observed data has emerged as a promising approach for improving model generalization, particularly in partially observed settings. We refer to this class of methods as Measure Consistency Regularization (MCR). Despite its empirical success in various applications, such as image inpainting, data imputation and semi-supervised learning, a fundamental understanding of the theoretical underpinnings of MCR remains limited. This paper bridges this gap by offering theoretical insights into why, when, and how MCR enhances imputation quality under partial observability, viewed through the lens of neural network distance. Our theoretical analysis identifies the term responsible for MCR's generalization advantage and extends to the imperfect training regime, demonstrating that this advantage is not always guaranteed. Guided by these insights, we propose a novel training protocol that monitors the duality gap to determine an early stopping point that preserves the generalization benefit. We then provide detailed empirical evidence to support our theoretical claims and to show the effectiveness and accuracy of our proposed stopping condition. We further provide a set of real-world data simulations to show the versatility of MCR under different model architectures designed for different data sources.

2602.01412 2026-02-03 stat.ML cs.LG

Importance Weighted Variational Inference without the Reparameterization Trick

Kamélia Daudel, Minh-Ngoc Tran, Cheng Zhang

Abstract

Importance weighted variational inference (VI) approximates densities known up to a normalizing constant by optimizing bounds that tighten with the number of Monte Carlo samples $N$. Standard optimization relies on reparameterized gradient estimators, which are well-studied theoretically yet restrict both the choice of the data-generating process and the variational approximation. While REINFORCE gradient estimators do not suffer from such restrictions, they lack rigorous theoretical justification. In this paper, we provide the first comprehensive analysis of REINFORCE gradient estimators in importance weighted VI, leveraging this theoretical foundation to diagnose and resolve fundamental deficiencies in current state-of-the-art estimators. Specifically, we introduce and examine a generalized family of variational inference for Monte Carlo objectives (VIMCO) gradient estimators. We prove that state-of-the-art VIMCO gradient estimators exhibit a vanishing signal-to-noise ratio (SNR) as $N$ increases, which prevents effective optimization. To overcome this issue, we propose the novel VIMCO-$\star$ gradient estimator and show that it averts the SNR collapse of existing VIMCO gradient estimators by achieving a $\sqrt{N}$ SNR scaling instead. We demonstrate its superior empirical performance compared to current VIMCO implementations in challenging settings where reparameterized gradients are typically unavailable.

2602.01400 2026-02-03 stat.ML cs.LG

Online Social Welfare Function-based Resource Allocation

Kanad Pardeshi, Samsara Foubert, Aarti Singh

Abstract

In many real-world settings, a centralized decision-maker must repeatedly allocate finite resources to a population over multiple time steps. Individuals who receive a resource derive some stochastic utility; to characterize the population-level effects of an allocation, the expected individual utilities are then aggregated using a social welfare function (SWF). We formalize this setting and present a general confidence sequence framework for SWF-based online learning and inference, valid for any monotonic, concave, and Lipschitz-continuous SWF. Our key insight is that monotonicity alone suffices to lift confidence sequences from individual utilities to anytime-valid bounds on optimal welfare. Building on this foundation, we propose SWF-UCB, a SWF-agnostic online learning algorithm that achieves near-optimal $\tilde{O}(n+\sqrt{nkT})$ regret (for $k$ resources distributed among $n$ individuals at each of $T$ time steps). We instantiate our framework on three normatively distinct SWF families: Weighted Power Mean, Kolm, and Gini, providing bespoke oracle algorithms for each. Experiments confirm $\sqrt{T}$ scaling and reveal rich interactions between $k$ and SWF parameters. This framework naturally supports inference applications such as sequential hypothesis testing, optimal stopping, and policy evaluation.

2602.01381 2026-02-03 cs.CL stat.ML

On the Power of (Approximate) Reward Models for Inference-Time Scaling

Youheng Zhu, Yiping Lu

Abstract

Inference-time scaling has recently emerged as a powerful paradigm for improving the reasoning capability of large language models. Among various approaches, Sequential Monte Carlo (SMC) has become a particularly important framework, enabling iterative generation, evaluation, rejection, and resampling of intermediate reasoning trajectories. A central component in this process is the reward model, which evaluates partial solutions and guides the allocation of computation during inference. However, in practice, true reward models are never available. All deployed systems rely on approximate reward models, raising a fundamental question: Why and when do approximate reward models suffice for effective inference-time scaling? In this work, we provide a theoretical answer. We identify the Bellman error of the approximate reward model as the key quantity governing the effectiveness of SMC-based inference-time scaling. For a reasoning process of length $T$, we show that if the Bellman error of the approximate reward model is bounded by $O(1/T)$, then combining this reward model with SMC reduces the computational complexity of reasoning from exponential in $T$ to polynomial in $T$. This yields an exponential improvement in inference efficiency despite using only approximate rewards.

2602.01378 2026-02-03 cs.CL cs.AI stat.ML

Context Dependence and Reliability in Autoregressive Language Models

Poushali Sengupta, Shashi Raj Pandey, Sabita Maharjan, Frank Eliassen

Abstract

Large language models (LLMs) generate outputs by utilizing extensive context, which often includes redundant information from prompts, retrieved passages, and interaction history. In critical applications, it is vital to identify which context elements actually influence the output, as standard explanation methods struggle with redundancy and overlapping context. Minor changes in input can lead to unpredictable shifts in attribution scores, undermining interpretability and raising concerns about risks like prompt injection. This work addresses the challenge of distinguishing essential context elements from correlated ones. We introduce RISE (Redundancy-Insensitive Scoring of Explanation), a method that quantifies the unique influence of each input relative to others, minimizing the impact of redundancies and providing clearer, stable attributions. Experiments demonstrate that RISE offers more robust explanations than traditional methods, emphasizing the importance of conditional information for trustworthy LLM explanations and monitoring.

2602.01366 2026-02-03 stat.ME math.PR

A Fractional M/M/1 Queue Governed by Stretched Non-Local Time Operators

Mehmet Sıddık Çadırcı

Comments 13 pages, 3 figures, 1 table

Abstract

We introduce a non-Markovian generalization of the classical M/M/1 queue by incorporating extended nonlocal time dynamics into Kolmogorov forward equations. We obtain the model by replacing the standard time derivative with an extended Caputo-type operator. It preserves the birth-death transition structure of the standard queue while introducing memory effects into the temporal evolution. Using Laplace transform techniques, we derive explicit representations for transient state probabilities in terms of the Kilbas-Saigo function, which naturally emerges as the relaxation kernel associated with the stretched operator. We construct a time-varying interpretation and show that the fractional queue can be viewed as a distribution of a classical M/M/1 process evaluated at a non-decreasing random time. We prove that under the standard stability condition $ρ<1$, the steady-state distribution remains geometric and coincides with the distribution of the classical queue, whilst the stretched fractional parameters significantly affect the convergence rate in the transient regime. Numerical examples based on Monte Carlo simulations highlight the effect of the parameters $(α,γ)$ on the distribution of empty states, tail length distributions, and the average tail evolution, and validate the flexibility of the proposed framework in capturing long-memory tail dynamics.
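
For reference, the geometric steady-state law of the classical M/M/1 queue, which the abstract states is preserved under $ρ<1$, is $π_n = (1-ρ)ρ^n$ with $ρ$ the ratio of arrival rate to service rate. The sketch below evaluates it; the function name `mm1_steady_state` and the value $ρ = 0.5$ are illustrative assumptions, and nothing here models the fractional transient behaviour.

```python
def mm1_steady_state(rho, n_max):
    """Return [pi_0, ..., pi_{n_max}] for the classical M/M/1 queue.

    Hypothetical helper: rho is the traffic intensity (arrival rate
    divided by service rate) and must satisfy 0 <= rho < 1.
    """
    if not 0.0 <= rho < 1.0:
        raise ValueError("stability requires 0 <= rho < 1")
    return [(1.0 - rho) * rho ** n for n in range(n_max + 1)]


# Illustrative traffic intensity
probs = mm1_steady_state(0.5, 5)
print(probs[0])    # probability that the system is empty, equal to 1 - rho
print(sum(probs))  # partial sum, equal to 1 - rho ** 6
```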

2602.01342 2026-02-03 cs.CR cs.AI stat.AP

Adaptive Quantum-Safe Cryptography for 6G Vehicular Networks via Context-Aware Optimization

Poushali Sengupta, Mayank Raikwar, Sabita Maharjan, Frank Eliassen, Yan Zhang

Comments Accepted for presentation at NDSS 2026 - FutureG Workshop, 23 February 2026. (10 pages, 5 figures.)

Abstract

Powerful quantum computers in the future may be able to break the security used for communication between vehicles and other devices (Vehicle-to-Everything, or V2X). New security methods called post-quantum cryptography can help protect these systems, but they often require more computing power and can slow down communication, posing a challenge for fast 6G vehicle networks. In this paper, we propose an adaptive post-quantum cryptography (PQC) framework that predicts short-term mobility and channel variations and dynamically selects suitable lattice-, code-, or hash-based PQC configurations using a predictive multi-objective evolutionary algorithm (APMOEA) to meet vehicular latency and security constraints. However, frequent cryptographic reconfiguration in dynamic vehicular environments introduces new attack surfaces during algorithm transitions. A secure monotonic-upgrade protocol prevents downgrade, replay, and desynchronization attacks during transitions. Theoretical results show decision stability under bounded prediction error, latency boundedness under mobility drift, and correctness under small forecast noise. These results demonstrate a practical path toward quantum-safe cryptography in future 6G vehicular networks. Through extensive experiments based on realistic mobility (LuST), weather (ERA5), and NR-V2X channel traces, we show that the proposed framework reduces end-to-end latency by up to 27%, lowers communication overhead by up to 65%, and effectively stabilizes cryptographic switching behavior using reinforcement learning. Moreover, under the evaluated adversarial scenarios, the monotonic-upgrade protocol successfully prevents downgrade, replay, and desynchronization attacks.

2602.01285 2026-02-03 cs.LG cs.AI stat.ML

Multi-LLM Adaptive Conformal Inference for Reliable LLM Responses

Kangjun Noh, Seongchan Lee, Ilmun Kim, Kyungwoo Song

Comments Accepted to ICLR 2026

Abstract

Ensuring factuality is essential for the safe use of Large Language Models (LLMs) in high-stakes domains such as medicine and law. Conformal inference provides distribution-free guarantees, but existing approaches are either overly conservative, discarding many true-claims, or rely on adaptive error rates and simple linear models that fail to capture complex group structures. To address these challenges, we reformulate conformal inference in a multiplicative filtering setting, modeling factuality as a product of claim-level scores. Our method, Multi-LLM Adaptive Conformal Inference (MACI), leverages ensembles to produce more accurate factuality-scores, which in our experiments led to higher retention, while validity is preserved through group-conditional calibration. Experiments show that MACI consistently achieves user-specified coverage with substantially higher retention and lower time cost than baselines. Our repository is available at https://github.com/MLAI-Yonsei/MACI

2602.01245 2026-02-03 stat.ME

Explicit Expressions for Multidimensional Value-at-Risk under Archimedean Copulas

Dotamana Yéo, Saralees Nadarajah, Amadou Sawadogo

Comments 17 pages, 1 table

Abstract

This paper studies multivariate Value-at-Risk (VaR) for financial portfolios with a focus on modeling dependence structures through Archimedean copulas. Using the generator representation of Archimedean copulas, we derive explicit analytical expressions for the marginal lower-tail multivariate VaR in arbitrary dimensions. Closed-form formulas are obtained for several commonly used copula families, including Clayton, Frank, Gumbel-Hougaard, Joe and Ali--Mikhail--Haq copulas, allowing a direct assessment of the impact of dependence on multivariate risk. These results complement existing approaches, which largely rely on numerical or simulation-based methods, by providing tractable alternatives for theoretical and applied risk analysis. Monte Carlo simulations are conducted to evaluate the finite-sample performance of the proposed VaR estimator and to illustrate the role of different dependence structures. The proposed analytical setting offers transparent tools for multivariate risk measurement and systemic risk assessment.

2602.01144 2026-02-03 math.ST stat.TH

Estimating Conditional Distributions via Sklar's Theorem and Empirical Checkerboard Approximations, with Consequences to Nonparametric Regression

Kai Schärer, Wolfgang Trutschnig

Comments 29 pages, 12 figures

Abstract

We tackle the natural question of whether it is possible to estimate conditional distributions via Sklar's theorem by separately estimating the conditional distributions of the underlying copula and the marginals. Working with so-called empirical checkerboard/Bernstein approximations with suitably chosen resolution/degree, we first show that uniform weak convergence to the true underlying copula can be established under very mild regularity assumptions. Building upon these results and plugging in the univariate empirical marginal distribution functions, we then provide an affirmative answer to the aforementioned question and prove strong consistency of the resulting estimators for the conditional distributions. Moreover, we show that aggregating our estimators allows us to construct consistent nonparametric estimators for the mean, the quantile, and the expectile regression function, and beyond. Some simulations illustrating the performance of the estimators and a real data example complement the established theoretical results.

2602.01045 2026-02-03 cs.LG cs.AI physics.data-an stat.ML

Superposition unifies power-law training dynamics

Zixin Jessie Chen, Hao Chen, Yizhou Liu, Jeff Gore

Comments 17 pages, 14 figures

Abstract

We investigate the role of feature superposition in the emergence of power-law training dynamics using a teacher-student framework. We first derive an analytic theory for training without superposition, establishing that the power-law training exponent depends on both the input data statistics and channel importance. Remarkably, we discover that a superposition bottleneck induces a transition to a universal power-law exponent of $\sim 1$, independent of data and channel statistics. This one-over-time training with superposition represents up to a tenfold acceleration compared to the purely sequential learning that takes place in the absence of superposition. Our finding that superposition leads to rapid training with a data-independent power-law exponent may have important implications for a wide range of neural networks that employ superposition, including production-scale large language models.

2602.00960 2026-02-03 cs.LG cs.AI cs.CE stat.CO stat.ML

Multimodal Scientific Learning Beyond Diffusions and Flows

Leonardo Ferreira Guilhoto, Akshat Kaushal, Paris Perdikaris

Abstract

Scientific machine learning (SciML) increasingly requires models that capture multimodal conditional uncertainty arising from ill-posed inverse problems, multistability, and chaotic dynamics. While recent work has favored highly expressive implicit generative models such as diffusion and flow-based methods, these approaches are often data-hungry, computationally costly, and misaligned with the structured solution spaces frequently found in scientific problems. We demonstrate that Mixture Density Networks (MDNs) provide a principled yet largely overlooked alternative for multimodal uncertainty quantification in SciML. As explicit parametric density estimators, MDNs impose an inductive bias tailored to low-dimensional, multimodal physics, enabling direct global allocation of probability mass across distinct solution branches. This structure delivers strong data efficiency, allowing reliable recovery of separated modes in regimes where scientific data is scarce. We formalize these insights through a unified probabilistic framework contrasting explicit and implicit distribution networks, and demonstrate empirically that MDNs achieve superior generalization, interpretability, and sample efficiency across a range of inverse, multistable, and chaotic scientific regression tasks.

2602.00939 2026-02-03 math.ST cs.LG stat.TH

Improving Minimax Estimation Rates for Contaminated Mixture of Multinomial Logistic Experts via Expert Heterogeneity

Fanqi Yan, Dung Le, Trang Pham, Huy Nguyen, Nhat Ho

Comments Fanqi Yan, Dung Le contributed equally to this work. 41 pages, 3 figures, 1 table

Abstract

Contaminated mixture of experts (MoE) is motivated by transfer learning methods where a pre-trained model, acting as a frozen expert, is integrated with an adapter model, functioning as a trainable expert, in order to learn a new task. Despite recent efforts to analyze the convergence behavior of parameter estimation in this model, there are still two unresolved problems in the literature. First, the contaminated MoE model has been studied solely in regression settings, while its theoretical foundation in classification settings remains absent. Second, previous works on MoE models for classification capture pointwise convergence rates for parameter estimation without any guarantee of minimax optimality. In this work, we close these gaps by performing, for the first time, the convergence analysis of a contaminated mixture of multinomial logistic experts with homogeneous and heterogeneous structures, respectively. In each regime, we characterize uniform convergence rates for estimating parameters under challenging settings where ground-truth parameters vary with the sample size. Furthermore, we also establish corresponding minimax lower bounds to ensure that these rates are minimax optimal. Notably, our theories offer an important insight into the design of contaminated MoE, that is, expert heterogeneity yields faster parameter estimation rates and, therefore, is more sample-efficient than expert homogeneity.

2602.00903 2026-02-03 stat.ME

A Graph-based Framework for Coverage Analysis in Autonomous Driving

Thomas Muehlenstädt, Marius Bause

Abstract

Coverage analysis is essential for validating the safety of autonomous driving systems, yet existing approaches typically assess coverage factors individually or in limited combinations, struggling to capture the complex interactions inherent in traffic scenes. This paper proposes a graph-based framework for coverage analysis that represents traffic scenes as hierarchical graphs, combining map topology with actor relationships. The framework introduces a two-phase graph construction algorithm that systematically captures spatial relationships between traffic participants, including leading, following, neighboring, and opposing configurations. Two complementary coverage analysis methods are presented. First, a sub-graph isomorphism approach matches traffic scenes against a set of manually defined archetype graphs representing common driving scenarios. Second, a graph embedding approach utilizes Graph Isomorphism Networks with Edge features (GINE) trained via self-supervised contrastive learning to project traffic scenes into a vector space, enabling similarity-based coverage assessment. The framework is validated on both real-world data from the Argoverse 2.0 dataset and synthetic data from the CARLA simulator. The subgraph isomorphism method is used to calculate node coverage percentages using predefined archetypes, while the embedding approach reveals meaningful structure in the latent space suitable for clustering and anomaly detection. The proposed approach offers significant advantages over traditional methods by scaling efficiently to diverse traffic scenarios without requiring scenario-specific handling, and by naturally accommodating varying numbers of actors in a scene.

2602.00901 2026-02-03 math.ST stat.TH

Efficient Bayesian Inference in Strictly Semi-parametric Linear Inverse Problems

Adel Magra, Aad van der Vaart

English abstract

We consider the efficient inference of finite dimensional parameters arising in the context of inverse problems. Our setup is the observation of a transformation of an unknown infinite dimensional signal $f$ corrupted by statistical noise, with the transformation $K_θ$ being linear but unknown up to a scalar $θ$. We adopt a Bayesian approach and put a prior on the pair $(θ,f)$ and prove a Bernstein-von Mises theorem for the marginal posterior of $θ$ under regularity conditions on the operators $K_θ$ and on the prior. We apply our results to the recovery of location parameters in semi-blind deconvolution problems and to the recovery of attenuation constants in X-ray tomography.

2602.00890 2026-02-03 stat.AP

Boundary-Induced Biases in Climate Networks of Extreme Precipitation and Temperature

Behzad Ghanbarian, Victor Oladoja, Kehinde Bosikun, Tayeb Jamali, Jürgen Kurths

English abstract

To address spatial boundary effects in climate networks, two surrogate-based correction methods, (1) subtraction and (2) division, have been widely applied in the literature. In the subtraction method, an original network measure is adjusted by subtracting the expected value obtained from a surrogate ensemble, whereas in the division method, it is normalized by dividing by this expected value. However, to the best of our knowledge, no prior study has assessed whether these two correction approaches yield statistically different results. In this study, we constructed complex networks of extreme precipitation and temperature events (EPEs and ETEs) across the CONUS for both summer (June-August, JJA) and winter (December-February, DJF) seasons. We computed four key network metrics, namely degree centrality (DC), clustering coefficient (CC), mean geographic distance (MGD), and betweenness centrality (BC), and applied both correction methods. Although the corrected spatial patterns generally appeared visually similar, statistical analyses revealed that the network measures derived from the subtraction and division methods were significantly different at the 95 percent confidence level. Across the CONUS, network hubs of EPEs were primarily concentrated in the northwestern United States during summer and shifted toward the east during winter, reflecting seasonal differences in the dominant atmospheric drivers. In contrast, the ETE networks showed strong spatial coherence and pronounced regional teleconnections in both seasons, with higher connectivity and longer synchronization distances in winter, consistent with large-scale circulation patterns such as the Pacific-North American and North Atlantic Oscillation modes. Our results indicated that the network metrics CC and MGD were more sensitive to the correction methods than DC and BC, particularly in the EPE networks.
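The two surrogate-based corrections described in this abstract can be illustrated with a minimal Python sketch. The function name, the toy node values, and the three-member surrogate ensemble are hypothetical and are not taken from the paper; the sketch only assumes that the "expected value" is the per-node mean over the surrogate ensemble.

```python
from statistics import mean

def surrogate_corrections(observed, surrogates):
    """Return (subtracted, divided) corrections of one network measure.

    observed   : measure per node, e.g. degree centrality (list of floats)
    surrogates : list of surrogate realizations, each aligned with `observed`
    """
    # Expected value of the measure at each node over the surrogate ensemble.
    expected = [mean(values) for values in zip(*surrogates)]
    subtracted = [obs - exp for obs, exp in zip(observed, expected)]
    # Division is undefined where the surrogate expectation is zero.
    divided = [obs / exp if exp != 0 else float("nan")
               for obs, exp in zip(observed, expected)]
    return subtracted, divided

# Hypothetical toy values for three nodes (assumed, for illustration only).
observed = [0.30, 0.20, 0.10]
surrogates = [[0.20, 0.15, 0.05],
              [0.30, 0.25, 0.05],
              [0.25, 0.20, 0.05]]
subtracted, divided = surrogate_corrections(observed, surrogates)
```

In this toy case the first and third nodes receive the same subtraction-corrected value (about 0.05) but different division-corrected values (about 1.2 and 2.0), which shows how the two corrections can rank nodes differently.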

2602.00889 2026-02-03 math.ST stat.TH

Semi-parametric Bernstein-von Mises Theorem in a Parabolic PDE Problem

Adel Magra, Frank van der Meulen, Aad van der Vaart

English abstract

We consider the heat equation with absorption in a bounded domain of $\mathbb{R}^d$, where both the scalar diffusivity and the absorption function are unknown. We investigate a Bayesian approach for recovering the diffusivity from a noisy observation of the solution to the PDE over the domain. Given a Gaussian process prior on the absorption function, we derive a Bernstein-von Mises theorem for the marginal posterior distribution of the diffusivity under assumptions on the prior and on smoothness properties of the absorption.

2602.00844 2026-02-03 stat.ML cs.LG stat.AP

Multivariate Time Series Data Imputation via Distributionally Robust Regularization

Che-Yi Liao, Zheng Dong, Gian-Gabriel Garcia, Kamran Paynabar

English abstract

Multivariate time series (MTS) imputation is often compromised by mismatch between observed and true data distributions -- a bias exacerbated by non-stationarity and systematic missingness. Standard methods that minimize reconstruction error or encourage distributional alignment risk overfitting these biased observations. We propose the Distributionally Robust Regularized Imputer Objective (DRIO), which jointly minimizes reconstruction error and the divergence between the imputer and a worst-case distribution within a Wasserstein ambiguity set. We derive a tractable dual formulation that reduces infinite-dimensional optimization over measures to adversarial search over sample trajectories, and propose an adversarial learning algorithm compatible with flexible deep learning backbones. Comprehensive experiments on diverse real-world datasets show DRIO consistently improves imputation under both missing-completely-at-random and missing-not-at-random settings, reaching Pareto-optimal trade-offs between reconstruction accuracy and distributional alignment.

2602.00836 2026-02-03 stat.ME econ.EM

Dynamic causal inference with time series data

Tanique Schaffe-Odeleye, Kōsaku Takanashi, Vishesh Karwa, Edoardo M. Airoldi, Kenichiro McAlinn

English abstract

We generalize the potential outcome framework to time series with an intervention by defining causal effects on stochastic processes. Interventions in dynamic systems alter not only outcome levels but also evolutionary dynamics -- changing persistence and transition laws. Our framework treats potential outcomes as entire trajectories, enabling causal estimands, identification conditions, and estimators to be formulated directly on path space. The resulting Dynamic Average Treatment Effect (DATE) characterizes how causal effects evolve through time and reduces to the classical average treatment effect under one period of time. For observational data, we derive a dynamic inverse-probability weighting estimator that is unbiased under dynamic ignorability and positivity. When treated units are scarce, we show that conditional mean trajectories underlying the DATE admit a linear state-space representation, yielding a dynamic linear model implementation. Simulations demonstrate that modeling time as intrinsic to the causal mechanism exposes dynamic effects that static methods systematically misestimate. An empirical study of COVID-19 lockdowns illustrates the framework's practical value for estimating and decomposing treatment effects.

2602.00835 2026-02-03 stat.ML cs.LG

Score-based Metropolis-Hastings for Fractional Langevin Algorithms

Ahmed Aloui, Junyi Liao, Ali Hasan, Jose Blanchet, Vahid Tarokh

English abstract

Sampling from heavy-tailed and multimodal distributions is challenging when neither the target density nor the proposal density can be evaluated, as in $α$-stable Lévy-driven fractional Langevin algorithms. While the target distribution can be estimated from data via score-based or energy-based models, the $α$-stable proposal density and its score are generally unavailable, rendering classical density-based Metropolis-Hastings (MH) corrections impractical. Consequently, existing fractional Langevin methods operate in an unadjusted regime and can exhibit substantial finite-time errors and poor empirical control of tail behavior. We introduce the Metropolis-Adjusted Fractional Langevin Algorithm (MAFLA), an MH-inspired, fully score-based correction mechanism. MAFLA employs designed proxies for fractional proposal score gradients under isotropic symmetric $α$-stable noise and learns an acceptance function via Score Balance Matching. We empirically illustrate the strong performance of MAFLA on a series of tasks, including combinatorial optimization problems, where the method significantly improves finite-time sampling accuracy over unadjusted fractional Langevin dynamics.

2602.00825 2026-02-03 stat.ML cs.LG

Harmful Overfitting in Sobolev Spaces

Kedar Karhadkar, Alexander Sietsema, Deanna Needell, Guido Montufar

English abstract

Motivated by recent work on benign overfitting in overparameterized machine learning, we study the generalization behavior of functions in Sobolev spaces $W^{k, p}(\mathbb{R}^d)$ that perfectly fit a noisy training data set. Under assumptions of label noise and sufficient regularity in the data distribution, we show that approximately norm-minimizing interpolators, which are canonical solutions selected by smoothness bias, exhibit harmful overfitting: even as the training sample size $n \to \infty$, the generalization error remains bounded below by a positive constant with high probability. Our results hold for arbitrary values of $p \in [1, \infty)$, in contrast to prior results studying the Hilbert space case ($p = 2$) using kernel methods. Our proof uses a geometric argument which identifies harmful neighborhoods of the training data using Sobolev inequalities.

2602.00822 2026-02-03 stat.ML cs.LG

Safety-Efficacy Trade Off: Robustness against Data-Poisoning

Diego Granziol

English abstract

Backdoor and data poisoning attacks can achieve high attack success while evading existing spectral and optimisation-based defences. We show that this behaviour is not incidental, but arises from a fundamental geometric mechanism in input space. Using kernel ridge regression as an exact model of wide neural networks, we prove that clustered dirty-label poisons induce a rank-one spike in the input Hessian whose magnitude scales quadratically with attack efficacy. Crucially, for nonlinear kernels we identify a near-clone regime in which poison efficacy remains order one while the induced input curvature vanishes, making the attack provably spectrally undetectable. We further show that input gradient regularisation contracts poison-aligned Fisher and Hessian eigenmodes under gradient flow, yielding an explicit and unavoidable safety-efficacy trade-off by reducing data-fitting capacity. For exponential kernels, this defence admits a precise interpretation as an anisotropic high-pass filter that increases the effective length scale and suppresses near-clone poisons. Extensive experiments on linear models and deep convolutional networks across MNIST, CIFAR-10, and CIFAR-100 validate the theory, demonstrating consistent lags between attack success and spectral visibility, and showing that regularisation and data augmentation jointly suppress poisoning. Our results establish when backdoors are inherently invisible, and provide the first end-to-end characterisation of poisoning, detectability, and defence through input-space curvature.

2602.00816 2026-02-03 stat.ML cs.LG

Hessian Spectral Analysis at Foundation Model Scale

Diego Granziol, Khurshid Juarev

English abstract

Accurate Hessian spectra of foundation models have remained out of reach, leading most prior work to rely on small models or strong structural approximations. We show that faithful spectral analysis of the true Hessian is tractable at frontier scale. Using shard-local finite-difference Hessian vector products compatible with Fully Sharded Data Parallelism, we perform stochastic Lanczos quadrature on open-source language models with up to 100B parameters, producing the first large-scale spectral density estimates beyond the sub-10B regime. We characterize the numerical behavior of this pipeline, including finite-difference bias, floating-point noise amplification, and their effect on Krylov stability in fp32 and bf16, and derive practical operating regimes that are validated empirically. We further provide end-to-end runtime and memory scaling laws, showing that full-operator spectral probing incurs only a modest constant-factor overhead over first-order training. Crucially, direct access to the Hessian reveals that widely used block-diagonal curvature approximations can fail catastrophically, exhibiting order-one relative error and poor directional alignment even in mid-scale LLMs. Together, our results demonstrate that foundation-model Hessian spectra are both computable and qualitatively misrepresented by prevailing approximations, opening the door to principled curvature-based analysis at scale.
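The finite-difference Hessian-vector product that this abstract builds on can be illustrated with a small Python sketch. It assumes a central-difference scheme and a toy two-parameter quadratic loss; the functions `grad` and `fd_hvp` and the step size `eps` are hypothetical, and the shard-local, FSDP-compatible implementation and the Lanczos quadrature step from the paper are not reproduced here.

```python
def grad(theta):
    """Gradient of the toy loss f(x, y) = 0.5 * (3x^2 + 2xy + 2y^2)."""
    x, y = theta
    return [3.0 * x + 1.0 * y, 1.0 * x + 2.0 * y]

def fd_hvp(grad_fn, theta, v, eps=1e-3):
    """Central-difference estimate of H(theta) @ v from two gradient calls.

    Uses (grad(theta + eps*v) - grad(theta - eps*v)) / (2*eps), so the
    Hessian itself is never formed.
    """
    plus = grad_fn([t + eps * vi for t, vi in zip(theta, v)])
    minus = grad_fn([t - eps * vi for t, vi in zip(theta, v)])
    return [(p - m) / (2.0 * eps) for p, m in zip(plus, minus)]

theta = [0.5, -1.0]
v = [1.0, 2.0]
# The exact Hessian of the toy loss is [[3, 1], [1, 2]], so H v = [5, 5].
hv = fd_hvp(grad, theta, v)
```

For a quadratic loss the central difference is exact up to rounding. For a general loss the choice of `eps` trades truncation bias against floating-point noise, which is the kind of numerical behavior the abstract says the paper characterizes.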

2602.00790 2026-02-03 eess.SP cs.NA math.NA stat.ME

Denoising deterministic networks using iterative Fourier transforms

H. Robert Frost

English abstract

We detail a novel Fourier-based approach (IterativeFT) for identifying deterministic network structure in the presence of both edge pruning and Gaussian noise. This technique involves the iterative execution of forward and inverse 2D discrete Fourier transforms on a target network adjacency matrix. The denoising ability of the method is achieved via the application of a sparsification operation to both the real and frequency domain representations of the adjacency matrix, with algorithm convergence achieved when the real domain sparsity pattern stabilizes. To demonstrate the effectiveness of the approach, we apply it to noisy versions of several deterministic models including Kautz, lattice, tree and bipartite networks. For contrast, we also evaluate preferential attachment networks to illustrate the behavior on stochastic graphs. We compare the performance of IterativeFT against simple real domain and frequency domain thresholding, reduced rank reconstruction and locally adaptive network sparsification. Relative to the comparison network denoising approaches, the proposed IterativeFT method provides the best overall performance for lattice and Kautz networks, with competitive performance on tree and bipartite networks. Importantly, the IterativeFT technique is effective at both filtering noisy edges and recovering true edges that are missing from the observed network.
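The loop structure described in this abstract (forward 2D DFT, sparsify, inverse 2D DFT, sparsify, stop when the real-domain sparsity pattern stabilizes) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the author's implementation: the abstract does not specify the sparsification operation, so a hard threshold at a fraction of the largest magnitude is assumed, and the function names, thresholds, and toy ring network are all hypothetical.

```python
import cmath

def dft2(a, inverse=False):
    """Naive 2D DFT (or inverse DFT) of a square matrix given as nested lists."""
    n = len(a)
    sign = 1 if inverse else -1
    # w[r][c] = exp(sign * 2*pi*i * r*c / n); this matrix is symmetric.
    w = [[cmath.exp(sign * 2j * cmath.pi * r * c / n) for c in range(n)]
         for r in range(n)]
    rows = [[sum(w[u][x] * a[x][y] for x in range(n)) for y in range(n)]
            for u in range(n)]
    out = [[sum(rows[u][y] * w[y][v] for y in range(n)) for v in range(n)]
           for u in range(n)]
    if inverse:
        out = [[z / (n * n) for z in row] for row in out]
    return out

def hard_threshold(m, frac):
    """Assumed sparsifier: zero entries with magnitude below frac * max magnitude."""
    cutoff = frac * max(abs(z) for row in m for z in row)
    return [[z if abs(z) >= cutoff else 0.0 for z in row] for row in m]

def iterative_ft(adj, real_frac=0.5, freq_frac=0.05, max_iter=50):
    """Alternate real- and frequency-domain thresholding until the
    real-domain sparsity pattern stops changing (or max_iter is reached)."""
    x = [list(row) for row in adj]
    pattern = None
    for _ in range(max_iter):
        x = hard_threshold(x, real_frac)           # sparsify in the real domain
        new_pattern = [[v != 0 for v in row] for row in x]
        if new_pattern == pattern:                 # pattern stabilized: stop
            break
        pattern = new_pattern
        freq = hard_threshold(dft2(x), freq_frac)  # sparsify in the frequency domain
        x = [[z.real for z in row] for row in dft2(freq, inverse=True)]
    return pattern

# Toy input: a 6-node ring lattice, then a copy with one weak spurious edge
# added and one true edge pruned (hypothetical values).
n = 6
ring = [[1 if (i - j) % n in (1, n - 1) else 0 for j in range(n)] for i in range(n)]
noisy = [list(row) for row in ring]
noisy[0][3] = noisy[3][0] = 0.2
noisy[0][1] = noisy[1][0] = 0.0
pattern = iterative_ft(noisy)
```

The two threshold fractions here are arbitrary and would need tuning on real data; the naive transform is only practical for small matrices, and a fast Fourier transform routine would replace `dft2` in practice.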

2602.00641 2026-02-03 stat.ML cs.LG stat.CO

Sampling from multi-modal distributions on Riemannian manifolds with training-free stochastic interpolants

Alain Durmus, Maxence Noble, Thibaut Pellerin

English abstract

In this paper, we propose a general methodology for sampling from un-normalized densities defined on Riemannian manifolds, with a particular focus on multi-modal targets that remain challenging for existing sampling methods. Inspired by the framework of diffusion models developed for generative modeling, we introduce a sampling algorithm based on the simulation of a non-equilibrium deterministic dynamics that transports an easy-to-sample noise distribution toward the target. At the marginal level, the induced density path follows a prescribed stochastic interpolant between the noise and target distributions, specifically constructed to respect the underlying Riemannian geometry. In contrast to related generative modeling approaches that rely on machine learning, our method is entirely training-free. It instead builds on iterative posterior sampling procedures using only standard Monte Carlo techniques, thereby extending recent diffusion-based sampling methodologies beyond the Euclidean setting. We complement our approach with a rigorous theoretical analysis and demonstrate its effectiveness on a range of multi-modal sampling problems, including high-dimensional and heavy-tailed examples.

2602.00629 2026-02-03 stat.ML cs.AI cs.LG

Action-Free Offline-to-Online RL via Discretised State Policies

Natinael Solomon Neggatu, Jeremie Houssineau, Giovanni Montana

Comments ICLR 2026

English abstract

Most existing offline RL methods presume the availability of action labels within the dataset, but in many practical scenarios, actions may be missing due to privacy, storage, or sensor limitations. We formalise the setting of action-free offline-to-online RL, where agents must learn from datasets consisting solely of $(s,r,s')$ tuples and later leverage this knowledge during online interaction. To address this challenge, we propose learning state policies that recommend desirable next-state transitions rather than actions. Our contributions are twofold. First, we introduce a simple yet novel state discretisation transformation and propose Offline State-Only DecQN, a value-based algorithm designed to pre-train state policies from action-free data. The algorithm integrates the transformation to scale efficiently to high-dimensional problems while avoiding instability and overfitting associated with continuous state prediction. Second, we propose a novel mechanism for guided online learning that leverages these pre-trained state policies to accelerate the learning of online agents. Together, these components establish a scalable and practical framework for leveraging action-free datasets to accelerate online RL. Empirical results across diverse benchmarks demonstrate that our approach improves convergence speed and asymptotic performance, while analyses reveal that discretisation and regularisation are critical to its effectiveness.

2602.00512 2026-02-03 stat.CO

Exact Gibbs sampling for stochastic differential equations with gradient drift and constant diffusion

Xinyi Pei, Minhyeok Kim, Vinayak Rao

Comments Main document: 18 pages, 4 figures. Supplementary material: 12 pages, 7 figures

English abstract

Stochastic differential equations (SDEs) are an important class of time-series models, used to describe stochastic systems evolving in continuous time. Simulating paths from these processes, particularly after conditioning on noisy observations of the latent path, remains a challenge. Existing methods often introduce bias through time-discretization, require involved rejection sampling or debiasing schemes or are restricted to a narrow family of diffusions. In this work, we propose an exact Markov chain Monte Carlo (MCMC) sampling algorithm that is applicable to a broad subset of all SDEs with unit diffusion coefficient; after suitable transformation, this includes an even larger class of multivariate SDEs and most 1-d SDEs. We develop a Gibbs sampling framework that allows exact MCMC for such diffusions, without any discretization error. We demonstrate how our MCMC methodology requires only fairly straightforward simulation steps. Our framework can be extended to include parameter simulation, and allows tools from the Gaussian process literature to be easily applied. We evaluate our method on synthetic and real datasets, demonstrating superior performance to particle MCMC approaches.

2601.18013 2026-02-03 stat.ME

Examining the Efficacy of Coarsened Exact Matching as an Alternative to Propensity Score Matching

Fei Wan

English abstract

Coarsened exact matching (CEM) is often promoted as a superior alternative to propensity score matching (PSM) for addressing imbalance, model dependence, bias, and efficiency. However, this recommendation remains uncertain. First, CEM is commonly mischaracterized as exact matching, despite relying on coarsened rather than original variables. This inexactness in matching introduces residual confounding, which necessitates accurate modeling of the outcome-confounder relationship post-matching to mitigate bias, thereby increasing vulnerability to model misspecification. Second, prior studies overlook that any imbalance between treated and untreated subjects matched on the same propensity score is attributable to random variation. Thus, claims that CEM outperforms PSM in reducing imbalance are unfounded, particularly when using metrics like Mahalanobis distance, which do not account for chance imbalance in PSM. Our simulations show that PSM reduces imbalance more effectively than CEM when evaluated with multivariate standardized mean differences (SMD), and unadjusted analyses indicate greater bias with CEM. While adjusted analyses in both CEM with autocoarsening and PSM may perform similarly when matching on few variables, CEM suffers from the curse of dimensionality as the number of factors increases, resulting in substantial data loss and unstable estimates. Increasing the level of coarsening may mitigate data loss but exacerbates residual confounding and model dependence. In contrast, both analytical results and simulations demonstrate that PSM is more robust to model misspecification and thus less model-dependent. Therefore, CEM is not a viable alternative to PSM when matching on a large number of covariates.

2601.17855 2026-02-03 cs.DC stat.ML

A Universal Load Balancing Principle and Its Application to Large Language Model Serving

Zixi Chen, Tianci Bu, Chendong Song, Xin Lu, Yinyu Ye, Zijie Zhou

English abstract

Over 40% of computational power in Large Language Model (LLM) serving systems can be systematically wasted, not because of hardware limits but because of load imbalance in barrier-synchronized parallel processing. When progress is gated by the slowest worker at each step, heterogeneous and evolving workloads create persistent stragglers; faster workers idle while drawing power, producing nothing. In large language model inference alone, this translates to gigawatt-hours of wasted electricity daily. Here we develop a universal load-balancing principle for barrier-synchronized systems with non-migratable state. We prove worst-case theoretical guarantees: imbalance reduction grows with system scale, and the resulting energy savings can exceed 52% for modern hardware at fleet scale. Experiments corroborate the theory, demonstrating 28% energy reduction alongside substantial throughput and latency improvements. Formulated as an online integer optimization with provable guarantees, the principle extends beyond LLM serving to broad classes of barrier-synchronized parallel systems, establishing a theoretical foundation for sustainable high-performance computing.

2512.22515 2026-02-03 stat.AP stat.ME

Robust Liu-Type Estimation for Multicollinearity in Fuzzy Logistic Regression

Ayad Habib Shemail, Ahmed Razzaq Al-Lami, Amal Hadi Rashid

English abstract

This article addresses the fuzzy logistic regression model under conditions of multicollinearity, which causes instability and inflated variance in parameter estimation. In this model, both the response variable and parameters are represented as fuzzy triangular numbers. To overcome the multicollinearity problem, various Liu-type estimators were employed: Fuzzy Maximum Likelihood Estimators (FMLE), Fuzzy Logistic Ridge Estimators (FLRE), Fuzzy Logistic Liu Estimators (FLLE), Fuzzy Logistic Liu-type Estimators (FLLTE), and Fuzzy Logistic Liu-type Parameter Estimators (FLLTPE). Through simulations with various sample sizes and application to real fuzzy data on kidney failure, model performance was evaluated using mean square error (MSE) and goodness of fit criteria. Results demonstrated superior performance of FLLTPE and FLLTE compared to other estimators.

2512.20169 2026-02-03 cs.LG cs.CL stat.ML

Learning to Reason in LLMs by Expectation Maximization

Junghyun Lee, Branislav Kveton, Anup Rao, Subhojyoti Mukherjee, Ryan A. Rossi, Sunav Choudhary, Alexa Siu

Comments 27 pages, 15 figures, 5 tables (ver2: major revision, including new experiments, reorganization, etc)

English abstract

Large language models (LLMs) solve reasoning problems by first generating a rationale and then answering. We formalize reasoning as a latent variable model and derive a reward-based filtered expectation-maximization (FEM) objective for learning to reason. This view connects EM and modern reward-based optimization, and shows that the main challenge lies in designing a sampling distribution of rationales that justify correct answers. We instantiate and compare three sampling schemes: rejection sampling with a budget, self-taught reasoner (STaR), and prompt posterior sampling (PPS), which keeps only the rationalization stage of STaR that conditions on the correct answer in the prompt. We experiment with LLM-as-a-judge calibration and summarization from feedback tasks, where conditioning on the correct answer provides strong guidance for generating rationales. Our experiments show the efficacy of PPS over other sampling schemes, and that the sampling scheme can have a significant impact on performance.

2511.12698 2026-02-03 stat.ME

Optimal Hold-Out Size in Cross-Validation

Kenichiro McAlinn, Kōsaku Takanashi

English abstract

Cross-validation (CV) is routinely used across the sciences to select models and tune parameters, and the resulting choices are often interpreted as substantive scientific conclusions (e.g., which variables, mechanisms, or risk factors are "supported by the data"). A key part of the CV procedure -- the hold-out size, or equivalently the fold count $K$ -- is typically set by convention (e.g., 80/20, $K=5$) rather than by a principled criterion. Central to the issue is the tradeoff between training and testing: increasing the training sample size improves model accuracy, while sacrificing certainty around the accuracy itself. We formalize the tradeoff by targeting predictive performance and explicitly penalizing evaluation uncertainty, which cannot be identified from the data without additional assumptions. We derive finite-sample expressions of this evaluation uncertainty under symmetric errors and general upper bounds under broader error conditions, yielding a transparent utility-based rule for selecting the hold-out size as a function of an irreducible-noise parameter. Empirical analyses with linear regression and random forests across multiple domains, and a high-dimensional genomics application, show that (i) the choice of $K$ depends on the data and model, (ii) the optimal $K$ varies with the assumption on the irreducible error, and (iii) the implied inferential conclusions can change materially as the irreducible error, and thus $K$, varies. The resulting framework replaces a one-size-fits-all convention with a context-specific, assumption-explicit choice of $K$, enabling more reliable model comparisons and downstream scientific inference.

2511.04666 2026-02-03 cs.LG stat.ML

Forgetting is Everywhere

Ben Sanati, Thomas L. Lee, Trevor McInroe, Aidan Scannell, Nikolay Malkin, David Abel, Amos Storkey

Comments Project page: https://ben-sanati.github.io/forgetting-is-everywhere-project/

English abstract

A fundamental challenge in developing general learning algorithms is their tendency to forget past knowledge when adapting to new data. Addressing this problem requires a principled understanding of forgetting; yet, despite decades of study, no unified definition has emerged that provides insights into the underlying dynamics of learning. We propose an algorithm- and task-agnostic theory that characterises forgetting as a lack of self-consistency in a learner's predictive distribution, manifesting as a loss of predictive information. Our theory naturally yields a general measure of an algorithm's propensity to forget and demonstrates that exact Bayesian inference allows for adaptation without forgetting. To validate the theory, we design a comprehensive set of experiments that span classification, regression, generative modelling, and reinforcement learning. We empirically demonstrate how forgetting is present across all deep learning settings and plays a significant role in determining learning efficiency. Together, these results establish a principled understanding of forgetting and lay the foundation for analysing and improving the information retention capabilities of general learning algorithms.

2510.01268 2026-02-03 cs.CL cs.AI cs.LG stat.ML

AdaDetectGPT: Adaptive Detection of LLM-Generated Text with Statistical Guarantees

Hongyi Zhou, Jin Zhu, Pingfan Su, Kai Ye, Ying Yang, Shakeel A O B Gavioli-Akilagun, Chengchun Shi

Comments Accepted by NeurIPS2025

English abstract

We study the problem of determining whether a piece of text has been authored by a human or by a large language model (LLM). Existing state-of-the-art logits-based detectors make use of statistics derived from the log-probability of the observed text evaluated using the distribution function of a given source LLM. However, relying solely on log probabilities can be sub-optimal. In response, we introduce AdaDetectGPT -- a novel classifier that adaptively learns a witness function from training data to enhance the performance of logits-based detectors. We provide statistical guarantees on its true positive rate, false positive rate, true negative rate and false negative rate. Extensive numerical studies show AdaDetectGPT nearly uniformly improves the state-of-the-art method in various combinations of datasets and LLMs, and the improvement can reach up to 37%. A Python implementation of our method is available at https://github.com/Mamba413/AdaDetectGPT.

2509.25741 2026-02-03 stat.ML cs.LG

Test time training enhances in-context learning of nonlinear functions

Kento Kuwataka, Taiji Suzuki

Comments Under review at ICML 2026. 34 pages, 2 figures, appendix included; revised synthetic experiment and corrected mistakes

English abstract

Test-time training (TTT) enhances model performance by explicitly updating designated parameters prior to each prediction to adapt to the test data. While TTT has demonstrated considerable empirical success, its theoretical underpinnings remain limited, particularly for nonlinear models. In this paper, we investigate the combination of TTT with in-context learning (ICL), where the model is given a few examples from the target distribution at inference time. We analyze this framework in the setting of single-index models $y=σ_*(\langle β, \mathbf{x} \rangle)$, where the feature vector $β$ is drawn from a hidden low-dimensional subspace. For single-layer transformers trained with gradient-based algorithms and adopting TTT, we establish an upper bound on the prediction risk. Our theory reveals that TTT enables the single-layer transformers to adapt to both the feature vector $β$ and the link function $σ_*$, which vary across tasks. This creates a sharp contrast with ICL alone, which is theoretically difficult to adapt to shifts in the link function. Moreover, we provide the convergence rate with respect to the data length, showing the predictive error can be driven arbitrarily close to the noise level as the context size and the network width grow.

2509.23162 2026-02-03 cs.LG cs.AI math.ST stat.ML stat.TH

Dense associative memory for Gaussian distributions

Chandan Tankala, Krishnakumar Balasubramanian

English abstract

Dense associative memories (DAMs) store and retrieve patterns via energy-function based fixed points, but existing models are limited to vector representations. We extend DAMs to Gaussian densities equipped with the 2-Wasserstein distance. Our framework defines a log-sum-exp energy over stored distributions and a retrieval dynamics aggregating optimal transport maps in a Gibbs-weighted manner. Stationary points correspond to self-consistent Wasserstein barycenters, generalizing classical DAM fixed points. We prove exponential storage capacity and provide quantitative retrieval guarantees under Wasserstein perturbations. We validate the method on synthetic and real-world image (CelebA and CIFAR-10 datasets) and text (text8 and NLI corpus) datasets. By generalizing from vectors to distributions, our work bridges classical DAMs with modern generative modeling and paves the way for distributional storage and retrieval in memory-augmented learning.
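The retrieval idea in this abstract can be illustrated for one-dimensional Gaussians, where both the 2-Wasserstein distance and the optimal transport map have closed forms. The sketch below is a simplified illustration under assumptions, not the paper's construction: the energy is assumed to be $-\frac{1}{β}\log\sum_i \exp(-β W_2^2(\cdot, X_i))$, and the function names, the value of `beta`, and the stored patterns are hypothetical. In one dimension, averaging the transport maps $x \mapsto m_i + (s_i/s)(x - m)$ with Gibbs weights $w_i$ sends $N(m, s^2)$ to the Gaussian with mean $\sum_i w_i m_i$ and standard deviation $\sum_i w_i s_i$.

```python
import math

def w2_sq(p, q):
    """Squared 2-Wasserstein distance between 1D Gaussians given as (mean, std)."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def gibbs_weights(query, stored, beta):
    """Softmax of -beta * W2^2 over the stored patterns (shifted for stability)."""
    scores = [-beta * w2_sq(query, s) for s in stored]
    top = max(scores)
    exps = [math.exp(sc - top) for sc in scores]
    total = sum(exps)
    return [e / total for e in exps]

def energy(query, stored, beta):
    """Assumed log-sum-exp energy: -(1/beta) * log sum_i exp(-beta * W2^2)."""
    scores = [-beta * w2_sq(query, s) for s in stored]
    top = max(scores)
    return -(top + math.log(sum(math.exp(sc - top) for sc in scores))) / beta

def retrieve(query, stored, beta, steps=20):
    """Iterate the Gibbs-weighted average of 1D optimal transport maps."""
    for _ in range(steps):
        w = gibbs_weights(query, stored, beta)
        query = (sum(wi * s[0] for wi, s in zip(w, stored)),
                 sum(wi * s[1] for wi, s in zip(w, stored)))
    return query

stored = [(-3.0, 1.0), (0.0, 0.5), (4.0, 2.0)]  # hypothetical stored patterns
noisy = (3.5, 1.7)                              # perturbed copy of the last one
recovered = retrieve(noisy, stored, beta=5.0)
```

With these toy values the weights concentrate almost entirely on the nearest stored pattern, so the iteration settles close to (4.0, 2.0). A smaller `beta` would blend several stored patterns instead.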

2509.22505 2026-02-03 cs.HC cs.AI cs.CL cs.CY stat.AP

Mental Health Impacts of AI Companions: Triangulating Social Media Quasi-Experiments, User Perspectives, and Relational Theory

Yunhao Yuan, Jiaxun Zhang, Talayeh Aledavood, Renwen Zhang, Koustuv Saha

Journal ref Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems

English abstract

AI-powered companion chatbots (AICCs) such as Replika are increasingly popular, offering empathetic interactions, yet their psychosocial impacts remain unclear. We examined how engaging with AICCs shaped wellbeing and how users perceived these experiences. First, we conducted a large-scale quasi-experimental study of longitudinal Reddit data, applying stratified propensity score matching and Difference-in-Differences regression. Findings revealed mixed effects -- greater grief expression and interpersonal focus, alongside increases in language about loneliness, depression, and suicidal ideation. Second, we complemented these results with 18 semi-structured interviews, which we thematically analyzed and contextualized using Knapp's relationship development model. We identified trajectories of initiation, escalation, and bonding, wherein AICCs provided emotional validation and social rehearsal but also carried risks of over-reliance and withdrawal. Triangulating across methods, we offer design implications for AI companions that scaffold healthy boundaries, support mindful engagement, support disclosure without dependency, and surface relationship stages -- maximizing psychosocial benefits while mitigating risks.

2509.07031 2026-02-03 stat.ME

Scalable Sample-to-Population Estimation of Hyperbolic Space Models for Hypergraphs

Cornelius Fritz, Yubai Yuan, Michael Schweinberger

Details
English abstract

Hypergraphs are useful mathematical representations of overlapping and nested subsets of interacting units, including groups of genes or brain regions, economic cartels, political or military coalitions, and groups of products that are purchased together. Despite the vast range of applications, the statistical analysis of hypergraphs is challenging: There are many hyperedges of small and large sizes, and hyperedges can overlap or be nested. Existing approaches to hypergraphs are either not scalable or achieve scalability at the expense of model realism. We develop a statistical framework that enables scalable estimation, simulation, and model assessment of hypergraph models, which is supported by non-asymptotic and asymptotic theoretical guarantees. First, we introduce a novel model of hypergraphs capturing core-periphery structure in addition to proximity, by embedding units in an unobserved hyperbolic space. Second, we achieve scalability by developing manifold optimization algorithms for learning hyperbolic space models based on samples from a population hypergraph. Third, we provide non-asymptotic and asymptotic theoretical guarantees for learning hyperbolic space models based on samples from a population hypergraph. We use the proposed statistical framework to detect core-periphery structure along with proximity among U.S. politicians based on historical media reports.

2507.18505 2026-02-03 math.ST stat.TH

LSD of sample covariances of superposition of matrices with separable covariance structure

Javed Hazarika, Debashis Paul

Details
English abstract

We study the asymptotic behavior of the spectra of matrices of the form $S_n = \frac{1}{n}XX^*$ where $X =\sum_{r=1}^K X_r$, with $X_r = A_r^\frac{1}{2}Z_rB_r^\frac{1}{2}$, $K \in \mathbb{N}$ and $A_r,B_r$ sequences of positive semi-definite matrices of dimensions $p\times p$ and $n\times n$, respectively. We establish the existence of a limiting spectral distribution (LSD) for $S_n$ by assuming that the matrices $\{A_r\}_{r=1}^K$ are simultaneously diagonalizable and $\{B_r\}_{r=1}^K$ are simultaneously diagonalizable, and that the joint spectral distributions of $\{A_r\}_{r=1}^K$ and $\{B_r\}_{r=1}^K$ converge to $K$-dimensional distributions, as $p,n\to \infty$ such that $p/n \to c \in (0,\infty)$. The LSD of $S_n$ is characterized by a system of equations with unique solutions within the class of Stieltjes transforms of measures on $\mathbb{R}_+$. These results generalize existing results on the LSD of sample covariances when the data matrices have a separable covariance structure.

2507.07660 2026-02-03 cs.SI stat.CO

Scalable Signed Exponential Random Graph Models under Local Dependence

Marc Schalberger, Cornelius Fritz

Details
English abstract

Traditional network analysis focuses on binary edges, while real-world relationships are more nuanced, encompassing cooperation, neutrality, and conflict. The rise of negative edges in social media discussions spurred interest in analyzing signed interactions, especially in polarized debates. However, the vast data generated by digital networks presents challenges for traditional methods like Stochastic Block Models (SBM) and Exponential Family Random Graph Models (ERGM), particularly due to the homogeneity assumption and global dependence, which become increasingly unrealistic as network size grows. To address this, we propose a novel method that combines the strengths of SBM and ERGM while mitigating their weaknesses by incorporating local dependence based on nonoverlapping blocks. Our approach involves a two-step process: First, decomposing the network into sub-networks using SBM approximation, and, second, estimating parameters using ERGM methods. We validate our method on large synthetic networks and apply it to a signed Wikipedia network of thousands of editors. Through the use of local dependence, we find patterns consistent with structural balance theory.

2507.00651 2026-02-03 cs.LG cs.CV stat.ML

Bridging GANs and Bayesian Neural Networks via Partial Stochasticity

Maurizio Filippone, Marius P. Linhard

Details
English abstract

Generative Adversarial Networks (GANs) are popular and successful generative models. Despite their success, optimization is notoriously challenging. In this work, we explain the success and limitations of GANs by casting them as Bayesian neural networks with partial stochasticity. This interpretation allows us to establish conditions of universal approximation and to rewrite the adversarial-style optimization of several variants of GANs as the optimization of a proxy for the likelihood obtained by marginalizing out the stochastic variables. Following this interpretation, the need for regularization becomes apparent, and we propose to adopt strategies to smooth the loss landscape and methods to search for solutions with minimum description length, which are associated with flat minima and good generalization. Results obtained on a wide range of experiments indicate that these strategies lead to performance improvements and pave the way to a deeper understanding of GANs.

2506.20124 2026-02-03 math.ST stat.TH

Modifications of the BIC for order selection in finite mixture models

Hien Duy Nguyen, TrungTin Nguyen

Details
English abstract

Finite mixture models are ubiquitous in modern statistical modeling, and a recurring practical issue is choosing the model order. In Keribin (2000, Sankhyā Series A, 62, pp. 49--66), the Bayesian information criterion (BIC) was proved consistent in mixtures, but under strong regularity, including high moments and high-order derivatives of the component density. We introduce the $ν$-BIC and $ε$-BIC, which weight the BIC penalty by negligibly small logarithmic factors immaterial in practice. This minor modification yields consistency under substantially weaker conditions, without differentiability and with mild moment assumptions, and we also give a misspecification result: when the truth lies outside the candidate family, any vanishing-penalty IC eventually selects a Kullback--Leibler optimal order among candidates. Finally, we clarify two limitations of consistent IC-based selection in mixtures: there is no universally minimal BIC-scale penalty within our sufficient conditions, and order consistency can conflict with minimax optimality in Hellinger risk. We illustrate the theory for Gaussian mixtures, non-differentiable Laplace mixtures, heavy-tailed $t$-mixtures, and mixtures of regression models.

2505.18410 2026-02-03 stat.ML cs.LG

On Theoretical Identifiability of Discrete Latent Causal Graphical Models

Seunghyun Lee, Yuqi Gu

Details
English abstract

This paper considers a challenging problem of identifying a causal graphical model under the presence of latent variables. While various identifiability conditions have been proposed in the literature, they often require multiple pure children per latent variable or restrictions on the latent causal graph. Furthermore, it is common for all observed variables to exhibit the same modality. Consequently, the existing identifiability conditions are often too stringent for complex real-world data. We consider a general nonparametric measurement model with arbitrary observed variable types and binary latent variables, and propose a double triangular graphical condition that guarantees identifiability of the entire causal graphical model. The proposed condition significantly relaxes the popular pure children condition. We also establish necessary conditions for identifiability and provide valuable insights into fundamental limits of identifiability. Simulation studies verify that latent structures satisfying our conditions can be accurately estimated from data.

2505.13815 2026-02-03 stat.CO cs.NA math.NA

Dimension-independent convergence rates of randomized nets using median-of-means

Zexin Pan

Details
English abstract

Recent advances in quasi-Monte Carlo integration demonstrate that the median of linearly scrambled digital net estimators achieves near-optimal convergence rates for high-dimensional integrals without requiring a priori knowledge of the integrand's smoothness. Building on this framework, we prove that the median estimator attains dimension-independent convergence, a property known as strong tractability in complexity theory, under tractability conditions characterized by low effective dimensionality. Using a probabilistic, integrand-specific error criterion, our analysis establishes both faster and dimension-independent convergence under weaker assumptions than previously possible in the worst-case setting.

2505.10860 2026-02-03 cs.LG stat.ML

On DeepSeekMoE: Statistical Benefits of Shared Experts and Normalized Sigmoid Gating

Huy Nguyen, Thong T. Doan, Quang Pham, Nghi D. Q. Bui, Nhat Ho, Alessandro Rinaldo

Comments 97 pages

Details
English abstract

Mixture of experts (MoE) methods are a key component in most large language model architectures, including the recent series of DeepSeek models. Compared to other MoE implementations, DeepSeekMoE stands out because of two unique features: the deployment of a shared expert strategy and of the normalized sigmoid gating mechanism. Despite the prominent role of DeepSeekMoE in the success of the DeepSeek series of models, there have been only a few attempts to justify theoretically the value of the shared expert strategy, while its normalized sigmoid gating has remained unexplored. To bridge this gap, we undertake a comprehensive theoretical study of these two features of DeepSeekMoE from a statistical perspective. We perform a convergence analysis of the expert estimation task to highlight the gains in sample efficiency for both the shared expert strategy and the normalized sigmoid gating, offering useful insights into the design of expert and gating structures. To verify empirically our theoretical findings, we carry out several experiments on both synthetic data and real-world datasets for (vision) language modeling tasks. Finally, we conduct an extensive empirical analysis of the router behaviors, ranging from router saturation, router change rate, to expert utilization.

2505.10007 2026-02-03 cs.LG math.OC stat.ML

Sample Complexity of Distributionally Robust Average-Reward Reinforcement Learning

Zijun Chen, Shengbo Wang, Nian Si

Comments Accepted at NeurIPS 2025. Updated with minor corrections and additional experiments

Details
English abstract

Motivated by practical applications where stable long-term performance is critical -- such as robotics, operations research, and healthcare -- we study the problem of distributionally robust (DR) average-reward reinforcement learning. We propose two algorithms that achieve near-optimal sample complexity. The first reduces the problem to a DR discounted Markov decision process (MDP), while the second, Anchored DR Average-Reward MDP, introduces an anchoring state to stabilize the controlled transition kernels within the uncertainty set. Assuming the nominal MDP is uniformly ergodic, we prove that both algorithms attain a sample complexity of $\widetilde{O}\left(|\mathbf{S}||\mathbf{A}| t_{\mathrm{mix}}^2\varepsilon^{-2}\right)$ for estimating the optimal policy as well as the robust average reward under KL and $f_k$-divergence-based uncertainty sets, provided the uncertainty radius is sufficiently small. Here, $\varepsilon$ is the target accuracy, $|\mathbf{S}|$ and $|\mathbf{A}|$ denote the sizes of the state and action spaces, and $t_{\mathrm{mix}}$ is the mixing time of the nominal MDP. This represents the first finite-sample convergence guarantee for DR average-reward reinforcement learning. We further validate the convergence rates of our algorithms through numerical experiments.

2505.00961 2026-02-03 stat.ML cs.LG

DOLCE: Decomposing Off-Policy Evaluation/Learning into Lagged and Current Effects

Shu Tamano

Details
English abstract

Off-policy evaluation and learning in contextual bandits use logged interaction data to estimate and optimize the value of a target policy. Most existing methods require sufficient action overlap between the logging and target policies, and violations can bias value and policy gradient estimates. To address this issue, we propose DOLCE (Decomposing Off-policy evaluation/learning into Lagged and Current Effects), which uses only lagged contexts already stored in bandit logs to construct lag-marginalized importance weights and to decompose the objective into a support-robust lagged correction term and a current, model-based term, yielding bias cancellation when the reward-model residual is conditionally mean-zero given the lagged context and action. With multiple candidate lags, DOLCE softly aggregates lag-specific estimates, and we introduce a moment-based training procedure that promotes the desired invariance using only logged lag-augmented data. We show that DOLCE is unbiased in an idealized setting and yields consistent and asymptotically normal estimates with cross-fitting under standard conditions. Our experiments demonstrate that DOLCE achieves substantial improvements in both off-policy evaluation and learning, particularly as the proportion of individuals who violate support increases.

2502.20177 2026-02-03 stat.ME

The Marginal Likelihood of two-way tables and Ecological Inference

Antonio Forcina

Details
English abstract

The paper derives new results on the marginal likelihood of a two-way table which clarify the conditions under which ecological inference is possible and lead to an efficient algorithm for maximizing the exact multinomial likelihood. The first part generalizes the work of Plackett (1977) on the marginal likelihood of a 2 x 2 table to a general R x C table. In doing so, new conceptual tools are introduced and new insights on the geometry of the collection of tables having fixed row and column margins and the extended hypergeometric distribution are derived. In the second part, when observations on the row and the column marginal distributions are available for a collection of two-way tables sharing the same association structure, an efficient Fisher scoring algorithm for maximizing the exact likelihood under multinomial sampling is introduced and a small simulation study is used to compare the performance of the proposed method with two well-established ones.

2502.04204 2026-02-03 cs.LG cs.CR stat.ML

Short-length Adversarial Training Helps LLMs Defend Long-length Jailbreak Attacks: Theoretical and Empirical Evidence

Shaopeng Fu, Liang Ding, Jingfeng Zhang, Di Wang

Comments The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025)

Details
English abstract

Jailbreak attacks against large language models (LLMs) aim to induce harmful behaviors in LLMs through carefully crafted adversarial prompts. To mitigate attacks, one way is to perform adversarial training (AT)-based alignment, i.e., training LLMs on some of the most adversarial prompts to help them learn how to behave safely under attacks. During AT, the length of adversarial prompts plays a critical role in the robustness of aligned LLMs. While long-length adversarial prompts during AT might lead to strong LLM robustness, their synthesis however is very resource-consuming, which may limit the application of LLM AT. This paper focuses on adversarial suffix jailbreak attacks and unveils that to defend against a jailbreak attack with an adversarial suffix of length $Θ(M)$, it is enough to align LLMs on prompts with adversarial suffixes of length $Θ(\sqrt{M})$. Theoretically, we analyze the adversarial in-context learning of linear transformers on linear regression tasks and prove a robust generalization bound for trained transformers. The bound depends on the term $Θ(\sqrt{M_{\text{test}}}/M_{\text{train}})$, where $M_{\text{train}}$ and $M_{\text{test}}$ are the numbers of adversarially perturbed in-context samples during training and testing. Empirically, we conduct AT on popular open-source LLMs and evaluate their robustness against jailbreak attacks of different adversarial suffix lengths. Results confirm a positive correlation between the attack success rate and the ratio of the square root of the adversarial suffix length during jailbreaking to the length during AT. Our findings show that it is practical to defend against "long-length" jailbreak attacks via efficient "short-length" AT. The code is available at https://github.com/fshp971/adv-icl.

2501.17520 2026-02-03 math.ST stat.TH

Conditional Feature Importance revisited: Double Robustness, Efficiency and Inference

Angel Reyero-Lobo, Pierre Neuvial, Bertrand Thirion

Details
English abstract

Conditional Feature Importance (CFI) is a classical variable importance measure that accounts for the relationship between the studied feature and the others. However, CFI has not yet been studied from a theoretical perspective because the conditional sampling step has generally been considered a purely practical problem. In this article, we demonstrate that the recent Conditional Permutation Importance (CPI) is indeed a valid implementation of this concept. Under the conditional null hypothesis, we then establish a double robustness property that can be used for variable selection. With either a valid model or a valid conditional sampler, the method correctly identifies null coordinates. Under the alternative hypothesis, we study the theoretical target and link it to the popular Total Sobol Index (TSI). We introduce the Sobol-CPI, which generalizes CPI/CFI, prove that it is nonparametrically efficient, and provide a bias correction. Finally, we propose a consistent and valid type-I error test and present numerical experiments that illustrate our findings.

2412.20727 2026-02-03 cs.LG stat.ML

AverageTime: Enhance Long-Term Time Series Forecasting with Simple Averaging

Gaoxiang Zhao, Chunmao Huang, Li Zhou, Xiaoqiang Wang

Details
English abstract

Multivariate long-term time series forecasting aims to predict future sequences by utilizing historical observations, with a core focus on modeling intra-sequence and cross-channel dependencies. Numerous studies have developed diverse architectures to capture these patterns, achieving significant improvements in forecasting accuracy. Among them, iTransformer, a representative method for channel information extraction, leverages the Transformer architecture to model channel-wise dependencies, thereby facilitating sequence transformation for enhanced forecasting performance. Building upon iTransformer's channel extraction concept, we propose AverageTime, a simple, efficient, and scalable forecasting model. Beyond iTransformer, AverageTime retains the original sequence information and reframes channel extraction as a stackable and extensible architecture. This allows the model to generate multiple novel sequences through various structural mechanisms, rather than being limited to transforming the original input. Moreover, the newly extracted sequences are not restricted to channel processing; other techniques such as series decomposition can also be incorporated to enhance predictive accuracy. Additionally, we introduce a channel clustering technique into AverageTime, which substantially improves training and inference efficiency with negligible performance loss. Experiments on real-world datasets demonstrate that with only two straightforward averaging operations, applied to both the extracted sequences and the original series, AverageTime surpasses state-of-the-art models in forecasting performance while maintaining near-linear complexity. This work offers a new perspective on time series forecasting: enriching sequence information through extraction and fusion. The source code is available at https://github.com/UniqueoneZ/AverageTime.

2411.18794 2026-02-03 stat.ML cs.LG

Graph Max Shift: A Hill-Climbing Method for Graph Clustering

Ery Arias-Castro, Elizabeth Coda, Wanli Qiao

Details
English abstract

We present a method for graph clustering that is analogous to gradient ascent methods previously proposed for clustering points in space. The algorithm, which can be viewed as a max-degree hill-climbing procedure on the graph, iteratively moves each node to a neighboring node of highest degree. We show that, when applied to a random geometric graph whose nodes correspond to data drawn i.i.d. from a density with Morse regularity, the method is asymptotically consistent. Here, consistency is in the sense of Fukunaga and Hostetler, meaning, with respect to the partition of the support of the density defined by the basins of attraction of the density gradient flow.
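The hill-climbing step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `graph_max_shift`, the adjacency-dictionary input, the rule that a node only moves when a neighbor has strictly higher degree, and the tie-break by smallest label are all assumptions made for illustration.

```python
def graph_max_shift(adj):
    """Assign each node to the node reached by max-degree hill climbing.

    adj: dict mapping each node to the set of its neighbors (undirected graph).
    Returns a dict mapping each node to its terminal node (cluster label).
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}

    def step(v):
        # Highest-degree node in the closed neighborhood of v.
        # Assumptions: on a degree tie the current node stays put,
        # otherwise the smallest label wins (sorted + first maximum).
        candidates = sorted(adj[v] | {v})
        return max(candidates, key=lambda u: (degree[u], u == v))

    labels = {}
    for v in adj:
        cur = v
        while True:
            nxt = step(cur)
            if nxt == cur:  # local degree maximum reached
                break
            cur = nxt       # degree strictly increases, so this terminates
        labels[v] = cur
    return labels


# Two star graphs (centers 0 and 4) joined by the edge 3-5.
example = {
    0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0, 5},
    4: {5, 6, 7}, 5: {3, 4}, 6: {4}, 7: {4},
}
clusters = graph_max_shift(example)
print(clusters)
```

In this toy graph, nodes 0 to 3 climb to the center 0 and nodes 4 to 7 climb to the center 4, so the two stars are recovered as two clusters.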

2411.01249 2026-02-03 stat.ME

A robust regression approach to synthetic control with interference

Peiyu He, Yilin Li, Xu Shi, Wang Miao

Comments 95 pages, 12 figures

Details
English abstract

Synthetic control methods are widely used for policy evaluation, but most existing approaches rule out interference among units, compromising validity when such effects are present. We develop a framework that accommodates contaminated donor pools and unknown interference patterns through two stages: factor-model adjustment for unobserved confounding, followed by robust regression in which direct and interference effects appear as a sparse outlier component. We study two asymptotic regimes. When the number of units is fixed and at least half are unaffected by interference, high-breakdown robust regression yields consistent identification of valid controls and asymptotically normal inference. When the number of units diverges, we allow for sparse large and dense weak interference, with robust M-estimation remaining valid even when the post-intervention period is short. Unlike existing approaches requiring prespecification of valid controls or parametric modeling of interference, our framework relies only on coarse sparsity information and enables formal inference on both direct and interference effects. We assess the proposed methods through simulations and two empirical applications. An analysis of the US embassy relocation to Jerusalem reveals significant interference effects on conflict outcomes in Jordan, and an analysis of Beijing's air pollution policy uncovers spatial interference patterns consistent with prevailing wind directions.

2404.11509 2026-02-03 stat.ML cs.LG

VC Theory for Inventory Policies

Yaqi Xie, Will Ma, Linwei Xin

Details
English abstract

There has been growing interest in applying reinforcement learning (RL) to inventory management, either by optimizing over temporal transitions or by learning directly from full historical demand trajectories. This contrasts sharply with classical data-driven approaches, which first estimate demand distributions from past data and then compute well-structured optimal policies via dynamic programming. This paper considers a hybrid approach that combines trajectory-based RL with policy regularization imposing base-stock and $(s, S)$ structures. We provide generalization guarantees for this combined approach for several well-known classes in a $T$-period dynamic inventory model, using tools from the celebrated Vapnik-Chervonenkis (VC) theory, such as the Pseudo-dimension and Fat-shattering dimension. Our results have implications for regret against the best-in-class policies, and allow for an arbitrary distribution over demand sequences, which makes no assumptions such as independence across time. Surprisingly, we prove that the class of policies defined by $T$ non-stationary base-stock levels exhibits a generalization error that does not grow with $T$, whereas the two-parameter $(s, S)$ policy class has a generalization error growing logarithmically with $T$. Overall, our analysis leverages specific inventory structures within the learning theory framework, and improves sample complexity guarantees even compared to existing results assuming independent demands.

2306.02584 2026-02-03 econ.EM stat.ME

Synthetic Regressing Control

Rong J. B. Zhu

Journal ref Observational Studies, 2026

Details
English abstract

Estimating weights in the synthetic control method, typically resulting in sparse weights where only a few control units have non-zero weights, involves an optimization procedure that selects and combines control units to closely match the treated unit. However, it is not uncommon for the linear combination of pre-treatment period outcomes for the control units, using nonnegative weights with the constraint that their sum equals one, to inadequately approximate the pre-treatment outcomes for the treated unit. To address the issue, this paper proposes a simple and effective method called Synthetic Regressing Control (SRC). The SRC method begins by performing a univariate linear regression to appropriately align the pre-treatment periods of the control units with the treated unit. Subsequently, an SRC estimator is obtained by synthesizing the regressed controls. To determine the weights in the synthesis procedure, we propose an approach that utilizes a criterion of an unbiased risk estimator. Theoretically, we show that the synthesis procedure is asymptotically optimal in the sense of achieving the minimum loss of the infeasible best possible synthetic estimator. Extensive numerical experiments highlight the advantages of the SRC method.

1908.10493 2026-02-03 cs.LG math.FA stat.ML

The Function Representation of Artificial Neural Network

Zhongkui Ma

Comments This submission is withdrawn due to factual errors identified after posting that affect the validity of the results

Details
English abstract

This paper expresses the structure of an artificial neural network (ANN) in a functional form, using the activation integral concept derived from the activation function. In this way, the structure of an ANN can be represented by a simple function, and it is possible to find the mathematical solutions of an ANN. Thus, it can be recognized that current ANNs can be placed in a more reasonable framework. Perhaps all questions about ANNs will be eliminated.

1811.08083 2026-02-03 econ.EM stat.ME

Complete Subset Averaging with Many Instruments

Seojeong Lee, Youngki Shin

Comments 56 pages, 3 figures, 10 tables

Journal ref Econometrics Journal, 24(2), 2021, pp. 290-314

Details
English abstract

We propose a two-stage least squares (2SLS) estimator whose first stage is the equal-weighted average over a complete subset with $k$ instruments among $K$ available, which we call the complete subset averaging (CSA) 2SLS. The approximate mean squared error (MSE) is derived as a function of the subset size $k$ by the Nagar (1959) expansion. The subset size is chosen by minimizing the sample counterpart of the approximate MSE. We show that this method achieves the asymptotic optimality among the class of estimators with different subset sizes. To deal with averaging over a growing set of irrelevant instruments, we generalize the approximate MSE to find that the optimal $k$ is larger than otherwise. An extensive simulation experiment shows that the CSA-2SLS estimator outperforms the alternative estimators when instruments are correlated. As an empirical illustration, we estimate the logistic demand function in Berry, Levinsohn, and Pakes (1995) and find the CSA-2SLS estimate is better supported by economic theory than the alternative estimates.

1604.07299 2026-02-03 stat.AP stat.CO

Bayesian modelling and quantification of Raman spectroscopy

Matthew Moores, Kirsten Gracie, Jake Carson, Karen Faulds, Duncan Graham, Mark Girolami

Details
English abstract

Raman spectroscopy can be used to identify molecules such as DNA by the characteristic scattering of light from a laser. It is sensitive at very low concentrations and can accurately quantify the amount of a given molecule in a sample. The presence of a large, nonuniform background presents a major challenge to analysis of these spectra. To overcome this challenge, we introduce a sequential Monte Carlo (SMC) algorithm to separate the observed spectrum into a series of peaks plus a smoothly-varying baseline, corrupted by additive white noise. The peaks are modelled using Lorentzian or Gaussian broadening functions, while the baseline is estimated using a penalised cubic spline. This latent continuous representation accounts for differences in resolution between measurements. By incorporating this representation in a Bayesian model, we can quantify the relationship between molecular concentration and peak intensity, thereby providing an improved estimate of the limit of detection (LOD), which is of major importance in analytical chemistry.

2602.00458 2026-02-03 cs.LG cs.AI cs.RO stat.ML

LatentTrack: Sequential Weight Generation via Latent Filtering

Omer Haq

Details
English abstract

We introduce LatentTrack (LT), a sequential neural architecture for online probabilistic prediction under nonstationary dynamics. LT performs causal Bayesian filtering in a low-dimensional latent space and uses a lightweight hypernetwork to generate predictive model parameters at each time step, enabling constant-time online adaptation without per-step gradient updates. At each time step, a learned latent model predicts the next latent distribution, which is updated via amortized inference using new observations, yielding a predict--generate--update filtering framework in function space. The formulation supports both structured (Markovian) and unstructured latent dynamics within a unified objective, while Monte Carlo inference over latent trajectories produces calibrated predictive mixtures with fixed per-step cost. Evaluated on long-horizon online regression using the Jena Climate benchmark, LT consistently achieves lower negative log-likelihood and mean squared error than stateful sequential and static uncertainty-aware baselines, with competitive calibration, demonstrating that latent-conditioned function evolution is an effective alternative to traditional latent-state modeling under distribution shift.

2602.00427 2026-02-03 stat.ML cs.LG math.MG

Topological Residual Asymmetry for Bivariate Causal Direction

Mouad El Bouchattaoui

Details
English abstract

Inferring causal direction from purely observational bivariate data is fragile: many methods commit to a direction even in ambiguous or near non-identifiable regimes. We propose Topological Residual Asymmetry (TRA), a geometry-based criterion for additive-noise models. TRA compares the shapes of two cross-fitted regressor-residual clouds after rank-based copula standardization: in the correct direction, residuals are approximately independent, producing a two-dimensional bulk, while in the reverse direction -- especially under low noise -- the cloud concentrates near a one-dimensional tube. We quantify this bulk-tube contrast using a 0D persistent-homology functional, computed efficiently from Euclidean MST edge-length profiles. We prove consistency in a triangular-array small-noise regime, extend the method to fixed noise via a binned variant (TRA-s), and introduce TRA-C, a confounding-aware abstention rule calibrated by a Gaussian-copula plug-in bootstrap. Extensive experiments across many challenging synthetic and real-data scenarios demonstrate the method's superiority.

2602.00413 2026-02-03 stat.ML cs.LG

Alignment of Diffusion Model and Flow Matching for Text-to-Image Generation

Yidong Ouyang, Liyan Xie, Hongyuan Zha, Guang Cheng

Details
English abstract

Diffusion models and flow matching have demonstrated remarkable success in text-to-image generation. While many existing alignment methods primarily focus on fine-tuning pre-trained generative models to maximize a given reward function, these approaches require extensive computational resources and may not generalize well across different objectives. In this work, we propose a novel alignment framework by leveraging the underlying nature of the alignment problem -- sampling from reward-weighted distributions -- and show that it applies to both diffusion models (via score guidance) and flow matching models (via velocity guidance). The score function (velocity field) required for the reward-weighted distribution can be decomposed into the pre-trained score (velocity field) plus a conditional expectation of the reward. For the alignment on the diffusion model, we identify a fundamental challenge: the adversarial nature of the guidance term can introduce undesirable artifacts in the generated images. Therefore, we propose a finetuning-free framework that trains a guidance network to estimate the conditional expectation of the reward. We achieve comparable performance to finetuning-based models with one-step generation with at least a 60% reduction in computational cost. For the alignment on flow matching, we propose a training-free framework that improves the generation quality without additional computational cost.

2602.00399 2026-02-03 stat.ML cs.LG

Reinforcement Learning for Control Systems with Time Delays: A Comprehensive Survey

Armando Alves Neto

Comments 30 pages

Abstract

In the last decade, Reinforcement Learning (RL) has achieved remarkable success in the control and decision-making of complex dynamical systems. However, most RL algorithms rely on the Markov Decision Process assumption, which is violated in practical cyber-physical systems affected by sensing delays, actuation latencies, and communication constraints. Such time delays introduce memory effects that can significantly degrade performance and compromise stability, particularly in networked and multi-agent environments. This paper presents a comprehensive survey of RL methods designed to address time delays in control systems. We first formalize the main classes of delays and analyze their impact on the Markov property. We then systematically categorize existing approaches into five major families: state augmentation and history-based representations, recurrent policies with learned memory, predictor-based and model-aware methods, robust and domain-randomized training strategies, and safe RL frameworks with explicit constraint handling. For each family, we discuss underlying principles, practical advantages, and inherent limitations. A comparative analysis highlights key trade-offs among these approaches and provides practical guidelines for selecting suitable methods under different delay characteristics and safety requirements. Finally, we identify open challenges and promising research directions, including stability certification, large-delay learning, multi-agent communication co-design, and standardized benchmarking. This survey aims to serve as a unified reference for researchers and practitioners developing reliable RL-based controllers in delay-affected cyber-physical systems.

2602.00383 2026-02-03 q-fin.ST math.DS math.ST stat.TH

Null-Validated Topological Signatures of Financial Market Dynamics

Samuel W. Akingbade

Comments 22 pages, 9 figures

Abstract

Financial markets exhibit temporal organization that is not fully captured by volatility measures or linear correlation structure. We study a null-validated topological approach for quantifying market complexity and apply it to Bitcoin daily log returns. The analysis uses the $L^1$ norm of persistence landscapes computed from sliding-window delay embeddings. This quantity shows strong co-movement with stochastic volatility during periods of market stress, but remains intermittently elevated during low-volatility regimes, indicating dynamical structure beyond fluctuation scale. Rolling correlation analysis reveals that the dependence between geometry and volatility is not stationary. Surrogate-based null models provide statistical validation of these observations. Rejection of shuffle surrogates rules out explanations based on marginal distributions alone, while departures from phase-randomized surrogates indicate sensitivity to nonlinear and phase-dependent temporal organization beyond linear correlations. These results demonstrate that persistence landscape norms provide complementary information about market dynamics across market conditions.

2602.00291 2026-02-03 stat.ME stat.AP

A Bayesian Prevalence Incidence Cure model for estimating survival using Electronic Health Records with incomplete baseline diagnoses

Matilda Pitt, Robert J. B. Goudie

Comments 29 pages, 2 figures

Abstract

Retrospective cohorts can be extracted from Electronic Health Records (EHR) to study prevalence, time until disease or event occurrence and cure proportion in real world scenarios. However, EHR are collected for patient care rather than research, so they typically have complexities, such as patients with missing baseline disease status. Prevalence-Incidence (PI) models, which use a two-component mixture model to account for this missing data, have been proposed. However, PI models are biased in settings in which some individuals will never experience the endpoint (they are 'cured'). To address this, we propose a Prevalence Incidence Cure (PIC) model, a three-component mixture model that combines the PI model framework with a cure model. Our PIC model enables estimation of the prevalence, time-to-incidence, and the cure proportion, and allows for covariates to affect these. We adopt a Bayesian inference approach, and focus on the interpretability of the prior. We show in a simulation study that the PIC model has smaller bias than a PI model for the survival probability, and we compare inference under vague, informative and misspecified priors. We illustrate our model using a dataset of 1964 patients undergoing treatment for Diabetic Macular Oedema, demonstrating improved fit under the PIC model.

2602.00218 2026-02-03 cs.LG stat.ML

GRIP2: A Robust and Powerful Deep Knockoff Method for Feature Selection

Bob Junyi Zou, Lu Tian

Abstract

Identifying truly predictive covariates while strictly controlling false discoveries remains a fundamental challenge in nonlinear, highly correlated, and low signal-to-noise regimes, where deep learning based feature selection methods are most attractive. We propose Group Regularization Importance Persistence in 2 Dimensions (GRIP2), a deep knockoff feature importance statistic that integrates first-layer feature activity over a two-dimensional regularization surface controlling both sparsity strength and sparsification geometry. To approximate this surface integral in a single training run, we introduce efficient block-stochastic sampling, which aggregates feature activity magnitudes across diverse regularization regimes along the optimization trajectory. The resulting statistics are antisymmetric by construction, ensuring finite-sample FDR control. In extensive experiments on synthetic and semi-real data, GRIP2 demonstrates improved robustness to feature correlation and noise level: in high correlation and low signal-to-noise ratio regimes where standard deep learning based feature selectors may struggle, our method retains high power and stability. Finally, on real-world HIV drug resistance data, GRIP2 recovers known resistance-associated mutations with power better than established linear baselines, confirming its reliability in practice.

2602.00194 2026-02-03 stat.ME cs.AI math.ST stat.TH

On the calibration of survival models with competing risks

Julie Alberge, Tristan Haugomat, Gaël Varoquaux, Judith Abécassis

Journal ref International Conference on Artificial Intelligence and Statistics (AISTATS) 2026, May 2026, Tanger, Morocco

Abstract

Survival analysis deals with modeling the time until an event occurs, and accurate probability estimates are crucial for decision-making, particularly in the competing-risks setting where multiple events are possible. While recent work has addressed calibration in standard survival analysis, the competing-risks setting remains under-explored as it is harder (the calibration applies to both probabilities across classes and time horizon). We show that existing calibration measures are not suited to the competing-risks setting and that recent models do not give well-behaved probabilities. To address this, we introduce a dedicated framework with two novel calibration measures that are minimized for oracle estimators (i.e., both measures are proper). We also introduce some methods to estimate, test, and correct the calibration. Our recalibration methods yield good probabilities while preserving discrimination.

2602.00172 2026-02-03 stat.ML cs.LG

Neuron Block Dynamics for XOR Classification with Zero-Margin

Guillaume Braun, Masaaki Imaizumi

Comments 47 pages, 9 figures

Journal ref Proceedings of the 29th International Conference on Artificial Intelligence and Statistics (AISTATS) 2026

Abstract

The ability of neural networks to learn useful features through stochastic gradient descent (SGD) is a cornerstone of their success. Most theoretical analyses focus on regression or on classification tasks with a positive margin, where worst-case gradient bounds suffice. In contrast, we study zero-margin nonlinear classification by analyzing the Gaussian XOR problem, where inputs are Gaussian and the XOR decision boundary determines labels. In this setting, a non-negligible fraction of data lies arbitrarily close to the boundary, breaking standard margin-based arguments. Building on Glasgow's (2024) analysis, we extend the study of training dynamics from discrete to Gaussian inputs and develop a framework for the dynamics of neuron blocks. We show that neurons cluster into four directions and that block-level signals evolve coherently, a phenomenon essential in the Gaussian setting where individual neuron signals vary significantly. Leveraging this block perspective, we analyze generalization without relying on margin assumptions, adopting an average-case view that distinguishes regions of reliable prediction from regions of persistent error. Numerical experiments confirm the predicted two-phase block dynamics and demonstrate their robustness beyond the Gaussian setting.

2602.00171 2026-02-03 stat.ML cs.LG

Uncertainty-Aware Multimodal Learning via Conformal Shapley Intervals

Mathew Chandy, Michael Johnson, Judong Shen, Devan V. Mehrotra, Hua Zhou, Jin Zhou, Xiaowu Dai

Abstract

Multimodal learning combines information from multiple data modalities to improve predictive performance. However, modalities often contribute unequally and in a data-dependent way, making it unclear which data modalities are genuinely informative and to what extent their contributions can be trusted. Quantifying modality-level importance together with uncertainty is therefore central to interpretable and reliable multimodal learning. We introduce conformal Shapley intervals, a framework that combines Shapley values with conformal inference to construct uncertainty-aware importance intervals for each modality. Building on these intervals, we propose a modality selection procedure with a provable optimality guarantee: conditional on the observed features, the selected subset of modalities achieves performance close to that of the optimal subset. We demonstrate the effectiveness of our approach on multiple datasets, showing that it provides meaningful uncertainty quantification and strong predictive performance while relying on only a small number of informative modalities.

2602.00143 2026-02-03 q-bio.QM cs.LG stat.ML

Early warning prediction: Onsager-Machlup vs Schrödinger

Xiaoai Xu, Yixuan Zhou, Xiang Zhou, Jingqiao Duan, Ting Gao

Comments 20 pages

Abstract

Predicting critical transitions in complex systems, such as epileptic seizures in the brain, represents a major challenge in scientific research. The high-dimensional characteristics and hidden critical signals further complicate early-warning tasks. This study proposes a novel early-warning framework that integrates manifold learning with stochastic dynamical system modeling. Through systematic comparison, six methods including diffusion maps (DM) are selected to construct low-dimensional representations. Based on these, a data-driven stochastic differential equation model is established to robustly estimate the probability evolution scoring function of the system. Building on this, a new Score Function (SF) indicator is defined by incorporating Schrödinger bridge theory to quantify the likelihood of significant state transitions in the system. Experiments demonstrate that this indicator exhibits higher sensitivity and robustness in epilepsy prediction, enables earlier identification of critical points, and clearly captures dynamic features across various stages before and after seizure onset. This work provides a systematic theoretical framework and practical methodology for extracting early-warning signals from high-dimensional data.

2602.00073 2026-02-03 q-fin.ST cs.LG stat.ML

Test-Time Adaptation for Non-stationary Time Series: From Synthetic Regime Shifts to Financial Markets

Yurui Wu, Qingying Deng, Wonou Chung, Mairui Li

Abstract

Time series encountered in practice are rarely stationary. When the data distribution changes, a forecasting model trained on past observations can lose accuracy. We study a small-footprint test-time adaptation (TTA) framework for causal time series forecasting and direction classification. The backbone is frozen, and only normalization affine parameters are updated using recent unlabeled windows. For classification we minimize entropy and enforce temporal consistency; for regression we minimize prediction variance across weak time-preserving augmentations and optionally distill from an EMA teacher. A quadratic drift penalty and an uncertainty-triggered fallback keep updates stable. We evaluate this framework in two stages: synthetic regime shifts on ETT benchmarks, and daily equity and FX series (SPY, QQQ, EUR/USD) across pandemic, high-inflation, and recovery regimes. On synthetic gradual drift, normalization-based TTA improves forecasting error, while in financial markets a simple batch-normalization statistics update is a robust default and more aggressive norm-only adaptation can even hurt. Our results provide practical guidance for deploying TTA on non-stationary time series.

2602.00072 2026-02-03 cs.LG stat.ML

Generative AI-enhanced Probabilistic Multi-Fidelity Surrogate Modeling Via Transfer Learning

Jice Zeng, David Barajas-Solano, Hui Chen

Abstract

The performance of machine learning surrogates is critically dependent on data quality and quantity. This presents a major challenge, as high-fidelity (HF) data is often scarce and computationally expensive to acquire, while low-fidelity (LF) data is abundant but less accurate. To address this data scarcity problem, we develop a probabilistic multi-fidelity surrogate framework based on generative transfer learning. We employ a normalizing flow (NF) generative model as the backbone, which is trained in two phases: (i) the NF is first pretrained on a large LF dataset to learn a probabilistic forward model; (ii) the pretrained model is then fine-tuned on a small HF dataset, allowing it to correct for LF-HF discrepancies via knowledge transfer. To relax the dimension-preserving constraint of standard bijective NFs, we integrate surjective (dimension-reducing) layers with standard coupling blocks. This architecture enables learned dimension reduction while preserving the ability to train with exact likelihoods. The resulting surrogate provides fast probabilistic predictions with quantified uncertainty and significantly outperforms LF-only baselines while using fewer HF evaluations. We validate the approach on a reinforced concrete slab benchmark, combining many coarse-mesh (LF) simulations with a limited set of fine-mesh (HF) simulations. The proposed model achieves probabilistic predictions with HF accuracy, demonstrating a practical path toward data-efficient, generative AI-driven surrogates for complex engineering systems.

2512.24145 2026-02-03 cs.LG stat.ML

When Does Pairing Seeds Reduce Variance? Evidence from a Multi-Agent Economic Simulation

Udit Sharma

Abstract

Machine learning systems appear stochastic but are deterministically random, as seeded pseudorandom number generators produce identical realisations across repeated executions. Standard evaluation practice typically treats runs across alternatives as independent and does not exploit shared sources of randomness. This paper analyses the statistical structure of comparative evaluation under shared random seeds. Under this design, competing systems are evaluated using identical seeds, inducing matched stochastic realisations and yielding strict variance reduction whenever outcomes are positively correlated at the seed level. We demonstrate these effects using an extended learning-based multi-agent economic simulator, where paired evaluation exposes systematic differences in aggregate and distributional outcomes that remain statistically inconclusive under independent evaluation at fixed budgets.
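The variance-reduction claim follows from the identity $\mathrm{Var}(A - B) = \mathrm{Var}(A) + \mathrm{Var}(B) - 2\,\mathrm{Cov}(A, B)$, so pairing helps whenever the seed-level covariance is positive. The toy sketch below illustrates this with a hypothetical evaluation function `run`; the noise model, effect sizes, and seed-string scheme are assumptions for this example and are not taken from the paper's simulator.

```python
import random
import statistics


def run(system_id, effect, seed):
    """Hypothetical evaluation: system effect + seed-level noise + system-specific noise."""
    env_noise = random.Random(f"env-{seed}").gauss(0.0, 1.0)  # shared by all systems at this seed
    own_noise = random.Random(f"own-{seed}-{system_id}").gauss(0.0, 0.3)
    return effect + env_noise + own_noise


n = 500  # runs per system in both designs (same budget)

# Paired design: both systems are evaluated on identical seeds.
paired = [run(1, 0.2, s) - run(0, 0.0, s) for s in range(n)]

# Independent design: the two systems use disjoint seeds.
independent = [run(1, 0.2, s) - run(0, 0.0, s + n) for s in range(n)]

var_paired = statistics.variance(paired)
var_independent = statistics.variance(independent)
print(f"variance of difference, paired: {var_paired:.3f}")
print(f"variance of difference, independent: {var_independent:.3f}")
```

In this toy model the shared seed-level noise cancels in the paired differences, leaving only the system-specific noise, whereas the independent design carries both noise sources twice.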

2512.15676 2026-02-03 stat.ME stat.AP

Data-driven controlled subgroup selection in clinical trials

Manuel M. Müller, Björn Bornkamp, Frank Bretz, Timothy I. Cannings, Wei Liu, Henry W. J. Reeve, Richard J. Samworth, Nikolaos Sfikas, Fang Wan, Konstantinos Sechidis

Comments 37 pages, 10 figures

Abstract

Subgroup selection in clinical trials is essential for identifying patient groups that react differently to a treatment, thereby enabling personalised medicine. In particular, subgroup selection can identify patient groups that respond particularly well to a treatment or that encounter adverse events more often. However, this is a post-selection inference problem, which may pose challenges for traditional techniques used for subgroup analysis, such as increased Type I error rates and potential biases from data-driven subgroup identification. In this paper, we present two methods for subgroup selection in regression problems: one based on generalised linear modelling and another on isotonic regression. We demonstrate how these methods can be used for data-driven subgroup identification in the analysis of clinical trials, focusing on two distinct tasks: identifying patient groups that are safe from manifesting adverse events and identifying patient groups with high treatment effect, while controlling for Type I error in both cases. A thorough simulation study is conducted to evaluate the strengths and weaknesses of each method, providing detailed insight into the sensitivity of the Type I error rate control to modelling assumptions.

2510.17794 2026-02-03 cs.LG stat.ML

Functional Distribution Networks (FDN)

Omer Haq

Abstract

Modern probabilistic regressors often remain overconfident under distribution shift. Functional Distribution Networks (FDN) place input-conditioned distributions over network weights, producing predictive mixtures whose dispersion adapts to the input; we train them with a Monte Carlo beta-ELBO objective. We pair FDN with an evaluation protocol that separates interpolation from extrapolation and emphasizes simple OOD sanity checks. On controlled 1D tasks and small/medium UCI-style regression benchmarks, FDN remains competitive in accuracy with strong Bayesian, ensemble, dropout, and hypernetwork baselines, while providing strongly input-dependent, shift-aware uncertainty and competitive calibration under matched parameter and update budgets.

2509.23494 2026-02-03 cs.LG cs.AI stat.ML

Revisiting Multivariate Time Series Forecasting with Missing Values

Jie Yang, Yifan Hu, Kexin Zhang, Luyang Niu, Philip S. Yu, Kaize Ding

Abstract

Missing values are common in real-world time series, and multivariate time series forecasting with missing values (MTSF-M) has become a crucial area of research for ensuring reliable predictions. To address the challenge of missing data, current approaches have developed an imputation-then-prediction framework that uses imputation modules to fill in missing values, followed by forecasting on the imputed data. However, this framework overlooks a critical issue: there is no ground truth for the missing values, making the imputation process susceptible to errors that can degrade prediction accuracy. In this paper, we conduct a systematic empirical study and reveal that imputation without direct supervision can corrupt the underlying data distribution and actively degrade prediction accuracy. To address this, we propose a paradigm shift that moves away from imputation and directly predicts from the partially observed time series. We introduce Consistency-Regularized Information Bottleneck (CRIB), a novel framework built on the Information Bottleneck principle. CRIB combines a unified-variate attention mechanism with a consistency regularization scheme to learn robust representations that filter out noise introduced by missing values while preserving essential predictive signals. Comprehensive experiments on four real-world datasets demonstrate the effectiveness of CRIB, which predicts accurately even under high missing rates. Our code is available at https://github.com/Muyiiiii/CRIB.

2506.19294 2026-02-03 math.OC math.PR q-fin.PM stat.ML

Duality and Policy Evaluation in Distributionally Robust Bayesian Diffusion Control

Jose Blanchet, Jiayi Cheng, Yuewei Ling, Hao Liu, Yang Liu

Abstract

We study diffusion control problems under parameter uncertainty. Controllers based on plug-in estimation can be brittle due to potential distribution shifts. Bayesian control with a prior on the parameters offers a formulation with beliefs about such shifts. However, as with any Bayesian model, the prior may be misspecified. To mitigate misspecification and reduce over-pessimism compared to classical robust control approaches (e.g. \citet{hansen2008robustness}), we propose a distributionally robust Bayesian control (DRBC) formulation in which an adversary perturbs the prior within a divergence neighborhood of a baseline prior. We develop a strong duality result that reduces the distributionally robust prior evaluation to a low-dimensional optimization and yields a practical simulation-based policy evaluation and learning procedure with structured policy parameterizations. We validate the efficiency of the algorithm on a synthetic linear-quadratic control example and real-data portfolio selection.

2506.17093 2026-02-03 cs.LG cs.AI math.AG stat.ML

Identifiability of Deep Polynomial Neural Networks

Konstantin Usevich, Ricardo Borsoi, Clara Dérand, Marianne Clausel

Comments NeurIPS final version

Abstract

Polynomial Neural Networks (PNNs) possess a rich algebraic and geometric structure. However, their identifiability -- a key property for ensuring interpretability -- remains poorly understood. In this work, we present a comprehensive analysis of the identifiability of deep PNNs, including architectures with and without bias terms. Our results reveal an intricate interplay between activation degrees and layer widths in achieving identifiability. As special cases, we show that architectures with non-increasing layer widths are generically identifiable under mild conditions, while encoder-decoder networks are identifiable when the decoder widths do not grow too rapidly compared to the activation degrees. Our proofs are constructive and center on a connection between deep PNNs and low-rank tensor decompositions, and Kruskal-type uniqueness theorems. We also settle an open conjecture on the dimension of PNNs' neurovarieties, and provide new bounds on the activation degrees required for it to reach the expected dimension.

2506.13680 2026-02-03 cs.LG stat.ME

Hybrid Meta-learners for Estimating Heterogeneous Treatment Effects

Zhongyuan Liang, Lars van der Laan, Ahmed Alaa

Abstract

Estimating conditional average treatment effects (CATE) from observational data involves modeling decisions that differ from supervised learning, particularly concerning how to regularize model complexity. Previous approaches can be grouped into two primary "meta-learner" paradigms that impose distinct inductive biases. Indirect meta-learners first fit and regularize separate potential outcome (PO) models and then estimate CATE by taking their difference, whereas direct meta-learners construct and directly regularize estimators for the CATE function itself. Neither approach consistently outperforms the other across all scenarios: indirect learners perform well when the PO functions are simple, while direct learners outperform when the CATE is simpler than individual PO functions. In this paper, we introduce the Hybrid Learner (H-learner), a novel regularization strategy that interpolates between the direct and indirect regularizations depending on the dataset at hand. The H-learner achieves this by learning intermediate functions whose difference closely approximates the CATE without necessarily requiring accurate individual approximations of the POs themselves. We demonstrate that intentionally allowing suboptimal fits to the POs improves the bias-variance tradeoff in estimating CATE. Experiments conducted on semi-synthetic and real-world benchmark datasets illustrate that the H-learner consistently operates at the Pareto frontier, effectively combining the strengths of both direct and indirect meta-learners.

2506.04192 2026-02-03 math.OC stat.ML

Lions and Muons: Optimization via Stochastic Frank-Wolfe

Maria-Eleni Sfyraki, Jun-Kun Wang

Abstract

Stochastic Frank-Wolfe is a classical optimization method for solving constrained optimization problems. On the other hand, recent optimizers such as Lion and Muon have gained significant popularity in deep learning. In this work, building on recent initiatives, we provide a unifying perspective by interpreting these seemingly disparate methods through the lens of Stochastic Frank-Wolfe. Specifically, we show that Lion and Muon with weight decay can be viewed as special instances of Stochastic Frank-Wolfe, and we establish their convergence guarantees in terms of the Frank-Wolfe gap, a standard stationarity measure in non-convex optimization for Frank-Wolfe methods. We further find that convergence to this gap implies convergence to a KKT point of the original problem under a norm constraint for Lion and Muon. Moreover, motivated by recent empirical findings that stochastic gradients in modern machine learning tasks often exhibit heavy-tailed distributions, we extend Stochastic Frank-Wolfe to settings with heavy-tailed noise by developing two robust variants with strong theoretical guarantees that hold for general compact convex sets without the need for a large batch size, filling the gap in the literature on Stochastic Frank-Wolfe for non-convex optimization. Our contributions in the later part of this work, in turn, yield new variants of Lion and Muon that better accommodate heavy-tailed gradient noise, thereby enhancing their practical scope.
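One way to see the connection for a sign-based update is a short algebraic check: a sign step with decoupled weight decay, $x - \eta(\mathrm{sign}(c) + \lambda x)$, equals a Frank-Wolfe step $(1-\gamma)x + \gamma s$ over an $\ell_\infty$ ball of radius $1/\lambda$ with $\gamma = \eta\lambda$. The sketch below verifies this numerically. It is only an illustration under assumptions: the update direction `c` is taken as given (momentum is not modeled), and the function names are invented for this example rather than taken from the paper.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def sign_step_with_weight_decay(x, c, lr, wd):
    """Lion-style update: x - lr * (sign(c) + wd * x)."""
    return [xi - lr * (sign(ci) + wd * xi) for xi, ci in zip(x, c)]


def lmo_linf(c, radius):
    """Linear minimization oracle: a minimizer of <c, s> over the l_inf ball of given radius."""
    return [-radius * sign(ci) for ci in c]


def frank_wolfe_step(x, c, gamma, radius):
    """Frank-Wolfe update: convex combination of x and the LMO output."""
    s = lmo_linf(c, radius)
    return [(1.0 - gamma) * xi + gamma * si for xi, si in zip(x, s)]


# Hypothetical iterate, update direction, learning rate, and weight decay.
x = [0.5, -1.2, 2.0]
c = [0.3, -0.7, 0.0]
lr, wd = 0.01, 0.1

lion_like = sign_step_with_weight_decay(x, c, lr, wd)
fw = frank_wolfe_step(x, c, gamma=lr * wd, radius=1.0 / wd)
max_gap = max(abs(a - b) for a, b in zip(lion_like, fw))
print(f"largest coordinate difference: {max_gap:.2e}")
```

The two updates agree up to floating-point rounding, which shows how the weight-decay coefficient plays the role of an implicit norm constraint in this simplified setting.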

2505.19145 2026-02-03 stat.ME cs.LG stat.AP

Do Large Language Models (Really) Need Statistical Foundations?

Weijie Su

Comments Accepted at The Annals of Applied Statistics; Added discussions on more topics

Abstract

Large language models (LLMs) represent a new paradigm for processing unstructured data, with applications across an unprecedented range of domains. In this paper, we address, through two arguments, whether the development and application of LLMs would genuinely benefit from foundational contributions from the statistics discipline. First, we argue affirmatively, beginning with the observation that LLMs are inherently statistical models due to their profound data dependency and stochastic generation processes, where statistical insights are naturally essential for handling variability and uncertainty. Second, we argue that the persistent black-box nature of LLMs -- stemming from their immense scale, architectural complexity, and development practices often prioritizing empirical performance over theoretical interpretability -- renders closed-form or purely mechanistic analyses generally intractable, thereby necessitating statistical approaches due to their flexibility and often demonstrated effectiveness. To substantiate these arguments, the paper outlines several research areas -- including alignment, watermarking, uncertainty quantification, evaluation, and data mixture optimization -- where statistical methodologies are critically needed and are already beginning to make valuable contributions. We conclude with a discussion suggesting that statistical research concerning LLMs will likely form a diverse "mosaic" of specialized topics rather than deriving from a single unifying theory, and highlighting the importance of timely engagement by our statistics community in LLM research.

2503.23158 2026-02-03 stat.ME

Active Learning with Adaptive Non-Stationary Kernel for Continuous-Fidelity Surrogate Models

Romain Boutelet, Chih-Li Sung

Abstract

Simulating complex physical processes across a domain of input parameters can be very computationally expensive. Multi-fidelity surrogate modeling can resolve this issue by integrating cheaper simulations with the expensive ones in order to obtain better predictions at a reasonable cost. We are specifically interested in computer experiments where real-valued fidelity parameters determine the fidelity of the numerical output, such as finite element methods. In these cases, integrating this fidelity parameter in the analysis enables us to make inference on fidelity levels that have not been observed yet. Such models have been developed, and we propose a new adaptive non-stationary kernel function which more accurately reflects the behavior of computer simulation outputs. In addition, we develop an active learning strategy based on the integrated mean squared prediction error (IMSPE) to identify the best design points across input parameters and fidelity parameters, while taking into account the computational cost associated with the fidelity parameters. We illustrate this methodology through numerical examples and applications to finite element methods. An $\textsf{R}$ package for the proposed methodology is provided in an open repository.

2502.02925 2026-02-03 stat.ME cs.LG math.PR math.ST stat.TH

Data denoising with self consistency, variance maximization, and the Kantorovich dominance

Joshua Zoen-Git Hiew, Tongseok Lim, Brendan Pass, Marcelo Cruz de Souza

Comments v2 will be published in the SIAM Journal on Mathematics of Data Science (SIMODS). We greatly appreciate the valuable feedback and guidance from the reviewers and the associate editor

Abstract

We introduce a new framework for data denoising, partially inspired by martingale optimal transport. For a given noisy distribution (the data), our approach involves finding the closest distribution to it among all distributions which 1) have a particular prescribed structure (expressed by requiring they lie in a particular domain), and 2) are self-consistent with the data. We show that this amounts to maximizing the variance among measures in the domain which are dominated in convex order by the data. For particular choices of the domain, this problem and a relaxed version of it, in which the self-consistency condition is removed, are intimately related to various classical approaches to denoising. We prove that our general problem has certain desirable features: solutions exist under mild assumptions, have certain robustness properties, and, for very simple domains, coincide with solutions to the relaxed problem. We also introduce a novel relationship between distributions, termed Kantorovich dominance, which retains certain aspects of the convex order while being a weaker, more robust, and easier-to-verify condition. Building on this, we propose and analyze a new denoising problem by substituting the convex order in the previously described framework with Kantorovich dominance. We demonstrate that this revised problem shares some characteristics with the full convex order problem but offers enhanced stability, greater computational efficiency, and, in specific domains, more meaningful solutions. Finally, we present simple numerical examples illustrating solutions for both the full convex order problem and the Kantorovich dominance problem.

2501.06652 2026-02-03 math.ST cs.NA math.NA math.OC stat.ME stat.TH

High-order Accurate Inference on Manifolds

Chengzhu Huang, Anru R. Zhang

Comments The Annals of Statistics, to appear

Abstract

We present a new framework for statistical inference on Riemannian manifolds that achieves high-order accuracy, addressing the challenges posed by non-Euclidean parameter spaces frequently encountered in modern data science. Our approach leverages a novel and computationally efficient procedure to reach higher-order asymptotic precision. In particular, we develop a bootstrap algorithm on Riemannian manifolds that is both computationally efficient and accurate for hypothesis testing and confidence region construction. Although locational hypothesis testing can be reformulated as a standard Euclidean problem, constructing high-order accurate confidence regions necessitates careful treatment of manifold geometry. To this end, we establish high-order asymptotics under an appropriate coordinate representation induced by a second-order retraction, thereby enabling precise expansions that incorporate curvature effects. We demonstrate the versatility of this framework across various manifold settings, including spheres, the Stiefel manifold, fixed-rank matrix manifolds, and rank-one tensor manifolds; for Euclidean submanifolds, we also introduce a class of projection-like coordinate charts with strong consistency properties. Finally, numerical studies confirm the practical merits of the proposed procedure.

2403.19515 2026-02-03 stat.ME

Bootstrapping Lasso in Generalized Linear Models

Mayukh Choudhury, Debraj Das

English abstract

Generalized linear models (GLMs) constitute a large class of models and essentially extend ordinary linear regression by connecting the mean of the response variable with the covariates through appropriate link functions. Lasso, in turn, is a popular and easy-to-implement penalization method in regression when not all covariates are relevant. However, the asymptotic distributional properties of the Lasso estimator in GLM are still unknown. In this paper, we show that the Lasso estimator in GLM does not have a tractable form and subsequently, we develop two Bootstrap methods, namely the Perturbation Bootstrap and Pearson's Residual Bootstrap methods, for approximating the distribution of the Lasso estimator in GLM. As a result, our Bootstrap methods can be used to draw valid statistical inferences for any sub-model of GLM. We support our theoretical findings by showing good finite-sample properties of the proposed Bootstrap methods through a moderately large simulation study. We also implement one of our Bootstrap methods on a real data set.

2309.15983 2026-02-03 stat.ME econ.EM stat.AP

Causal Panel Analysis under Parallel Trends: Lessons from a Large Reanalysis Study

Albert Chiu, Xingchen Lan, Ziyi Liu, Yiqing Xu

Journal ref American Political Science Review, Vol. 120, Iss. 1, February 2026, pp. 245--266

English abstract

Two-way fixed effects (TWFE) models are widely used in political science to establish causality, but recent methodological discussions highlight their limitations under heterogeneous treatment effects (HTE) and violations of the parallel trends (PT) assumption. This growing literature has introduced numerous new estimators and procedures, causing confusion among researchers about the reliability of existing results and best practices. To address these concerns, we replicated and reanalyzed 49 studies from leading journals that employ TWFE models for causal inference using observational panel data with binary treatments. Using six HTE-robust estimators, diagnostic tests, and sensitivity analyses, we find: (i) HTE-robust estimators yield qualitatively similar but highly variable results; (ii) while a few studies show clear signs of PT violations, many lack evidence to support this assumption; and (iii) many studies are underpowered when accounting for HTE and potential PT violations. We emphasize the importance of strong research designs and rigorous validation of key identifying assumptions.
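As a rough illustration of the TWFE estimator the abstract refers to (not the authors' code or replication pipeline), the sketch below computes the TWFE coefficient on a balanced panel by two-way demeaning, which on a balanced panel gives the same coefficient as OLS with unit and period dummies. The function name `twfe_estimate` and the toy panel are hypothetical, and the toy data assume a homogeneous treatment effect with no noise, which is exactly the setting in which TWFE is unproblematic.

```python
def twfe_estimate(y, d):
    """TWFE coefficient on a balanced panel via two-way demeaning.

    y, d: dicts mapping (unit, period) -> outcome / binary treatment.
    Assumes every unit is observed in every period (balanced panel).
    """
    units = sorted({u for u, _ in y})
    periods = sorted({t for _, t in y})

    def demean(x):
        # Subtract unit means and period means, then add back the grand mean.
        grand = sum(x.values()) / len(x)
        unit_mean = {u: sum(x[u, t] for t in periods) / len(periods) for u in units}
        time_mean = {t: sum(x[u, t] for u in units) / len(units) for t in periods}
        return {(u, t): x[u, t] - unit_mean[u] - time_mean[t] + grand for (u, t) in x}

    y_dm, d_dm = demean(y), demean(d)
    # Univariate OLS slope of demeaned outcome on demeaned treatment.
    num = sum(d_dm[k] * y_dm[k] for k in y)
    den = sum(d_dm[k] ** 2 for k in y)
    return num / den


# Hypothetical toy panel: 4 units, 3 periods, units 0 and 1 treated from period 1.
unit_effect = {0: 1.0, 1: -0.5, 2: 2.0, 3: 0.0}
time_effect = {0: 0.0, 1: 0.3, 2: 0.7}
true_effect = 2.0  # homogeneous effect, assumed for illustration

d = {(u, t): 1.0 if (u in (0, 1) and t >= 1) else 0.0
     for u in unit_effect for t in time_effect}
y = {(u, t): unit_effect[u] + time_effect[t] + true_effect * d[u, t]
     for u in unit_effect for t in time_effect}

estimate = twfe_estimate(y, d)
print(estimate)  # recovers the effect of 2 up to floating-point error
```

With heterogeneous effects and staggered treatment timing, this same ratio can weight unit-period effects in unintuitive ways, which is the limitation that motivates the HTE-robust estimators discussed in the abstract.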