arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
2602.17633 2026-02-20 cs.LG cs.AI stat.ML

When to Trust the Cheap Check: Weak and Strong Verification for Reasoning

Shayan Kiyani, Sima Noorani, George Pappas, Hamed Hassani

Abstract

Reasoning with LLMs increasingly unfolds inside a broader verification loop. Internally, systems use cheap checks, such as self-consistency or proxy rewards, which we call weak verification. Externally, users inspect outputs and steer the model through feedback until results are trustworthy, which we call strong verification. These signals differ sharply in cost and reliability: strong verification can establish trust but is resource-intensive, while weak verification is fast and scalable but noisy and imperfect. We formalize this tension through weak-strong verification policies, which decide when to accept or reject based on weak verification and when to defer to strong verification. We introduce metrics capturing incorrect acceptance, incorrect rejection, and strong-verification frequency. At the population level, we show that optimal policies admit a two-threshold structure and that calibration and sharpness govern the value of weak verifiers. Building on this, we develop an online algorithm that provably controls acceptance and rejection errors without assumptions on the query stream, the language model, or the weak verifier.
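The two-threshold structure mentioned in the abstract can be pictured with a minimal sketch. It assumes the weak verifier returns a score in [0, 1] where higher means "more likely correct"; the names `Decision` and `two_threshold_policy` and the threshold values are hypothetical illustrations, not the paper's algorithm or its online calibration procedure.

```python
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"   # trust the weak check and keep the answer
    REJECT = "reject"   # trust the weak check and discard the answer
    DEFER = "defer"     # pay for strong verification


def two_threshold_policy(weak_score: float, lower: float, upper: float) -> Decision:
    """Accept above `upper`, reject below `lower`, defer in between.

    Assumption: `weak_score` is a confidence-like score in [0, 1].
    """
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if weak_score >= upper:
        return Decision.ACCEPT
    if weak_score <= lower:
        return Decision.REJECT
    return Decision.DEFER


if __name__ == "__main__":
    for score in (0.95, 0.50, 0.05):
        print(score, two_threshold_policy(score, lower=0.2, upper=0.8).value)
```

Widening the gap between the two thresholds sends more queries to strong verification, which lowers both error types at a higher cost; narrowing it does the opposite.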

2602.17608 2026-02-20 cs.LG cs.AI stat.ML

Towards Anytime-Valid Statistical Watermarking

Baihe Huang, Eric Xu, Kannan Ramchandran, Jiantao Jiao, Michael I. Jordan

Abstract

The proliferation of Large Language Models (LLMs) necessitates efficient mechanisms to distinguish machine-generated content from human text. While statistical watermarking has emerged as a promising solution, existing methods suffer from two critical limitations: the lack of a principled approach for selecting sampling distributions and the reliance on fixed-horizon hypothesis testing, which precludes valid early stopping. In this paper, we bridge this gap by developing the first e-value-based watermarking framework, Anchored E-Watermarking, that unifies optimal sampling with anytime-valid inference. Unlike traditional approaches where optional stopping invalidates Type-I error guarantees, our framework enables valid, anytime-inference by constructing a test supermartingale for the detection process. By leveraging an anchor distribution to approximate the target model, we characterize the optimal e-value with respect to the worst-case log-growth rate and derive the optimal expected stopping time. Our theoretical claims are substantiated by simulations and evaluations on established benchmarks, showing that our framework can significantly enhance sample efficiency, reducing the average token budget required for detection by 13-15% relative to state-of-the-art baselines.
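As a generic illustration of anytime-valid testing with e-values (not the Anchored E-Watermarking construction itself), the sketch below runs a sequential test on binary observations. It assumes a simple null of Bernoulli(`p0`) and a fixed alternative Bernoulli(`p1`); the running product of likelihood ratios is a nonnegative martingale under the null, so by Ville's inequality the chance that it ever reaches 1/alpha is at most alpha, whenever one chooses to stop. The function name and parameters are hypothetical.

```python
import math
from typing import Iterable, Optional


def sequential_e_test(observations: Iterable[int], p0: float, p1: float,
                      alpha: float) -> Optional[int]:
    """Return the first step at which the e-process reaches 1/alpha, else None.

    Assumption: each observation is 0 or 1; H0 says P(x = 1) = p0.
    """
    log_wealth = 0.0                      # log of the running likelihood-ratio product
    log_threshold = math.log(1.0 / alpha)
    for step, x in enumerate(observations, start=1):
        numerator = p1 if x else 1.0 - p1
        denominator = p0 if x else 1.0 - p0
        log_wealth += math.log(numerator / denominator)
        if log_wealth >= log_threshold:   # stopping here keeps Type-I error <= alpha
            return step
    return None


if __name__ == "__main__":
    # Six consecutive ones: 1.8**5 < 20 <= 1.8**6, so the test stops at step 6.
    print(sequential_e_test([1] * 10, p0=0.5, p1=0.9, alpha=0.05))
```

Because the guarantee holds at every step, detection can stop as soon as the threshold is crossed rather than waiting for a fixed number of tokens.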

2602.17603 2026-02-20 stat.ML stat.AP

SOLVAR: Fast covariance-based heterogeneity analysis with pose refinement for cryo-EM

Roey Yadgar, Roy R. Lederman, Yoel Shkolnisky

Abstract

Cryo-electron microscopy (cryo-EM) has emerged as a powerful technique for resolving the three-dimensional structures of macromolecules. A key challenge in cryo-EM is characterizing continuous heterogeneity, where molecules adopt a continuum of conformational states. Covariance-based methods offer a principled approach to modeling structural variability. However, estimating the covariance matrix efficiently remains a challenging computational task. In this paper, we present SOLVAR (Stochastic Optimization for Low-rank Variability Analysis), which leverages a low-rank assumption on the covariance matrix to provide a tractable estimator for its principal components, despite the apparently prohibitively large size of the covariance matrix. Under this low-rank assumption, our estimator can be formulated as an optimization problem that can be solved quickly and accurately. Moreover, our framework enables refinement of the poses of the input particle images, a capability absent from most heterogeneity-analysis methods and from all covariance-based methods. Numerical experiments on both synthetic and experimental datasets demonstrate that the algorithm accurately captures dominant components of variability while maintaining computational efficiency. SOLVAR achieves state-of-the-art performance across multiple datasets in a recent heterogeneity benchmark. The code of the algorithm is freely available at https://github.com/RoeyYadgar/SOLVAR.

2602.17592 2026-02-20 stat.ME stat.AP

BMW: Bayesian Model-Assisted Adaptive Phase II Clinical Trial Design for Win Ratio Statistic

Di Zhu, Yong Zang

Comments 32 pages, 2 figures

Abstract

The win ratio (WR) statistic is increasingly used to evaluate treatment effects based on prioritized composite endpoints, yet existing Bayesian adaptive designs are not directly applicable because the WR is a summary statistic derived from pairwise comparisons and does not correspond to a unique data-generating mechanism. We propose a Bayesian model-assisted adaptive design for randomized phase II clinical trials based on the WR statistic, referred to as the BMW design. The proposed design uses the joint asymptotic distribution of WR test statistics across interim and final analyses to compute posterior probabilities without specifying the underlying outcome distribution. The BMW design allows flexible interim monitoring with early stopping for futility or superiority and is extended to jointly evaluate efficacy and toxicity using a graphical testing procedure that controls the family-wise error rate (FWER). Simulation studies demonstrate that the BMW design maintains valid type I error and FWER control, achieves power comparable to conventional methods, and substantially reduces expected sample size. An R Shiny application is provided to facilitate practical implementation.
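For readers unfamiliar with the win ratio, a minimal sketch of the statistic is given here. It assumes fully observed outcomes (no censoring), larger values being better, and each patient's outcomes listed as a tuple in priority order; the function name and data are hypothetical, and the BMW design's interim monitoring and posterior computations are not reproduced.

```python
from typing import Sequence, Tuple

Outcome = Tuple[float, ...]  # endpoints in priority order, larger is better


def win_ratio(treatment: Sequence[Outcome], control: Sequence[Outcome]) -> float:
    """Wins divided by losses over all treatment-control pairs."""
    wins = 0
    losses = 0
    for t_patient in treatment:
        for c_patient in control:
            # Walk down the endpoint hierarchy until one patient is better.
            for t_value, c_value in zip(t_patient, c_patient):
                if t_value > c_value:
                    wins += 1
                    break
                if t_value < c_value:
                    losses += 1
                    break
            # Pairs tied on every endpoint count as neither win nor loss.
    if losses == 0:
        raise ZeroDivisionError("win ratio is undefined when there are no losses")
    return wins / losses


if __name__ == "__main__":
    treated = [(10.0, 1.0), (8.0, 3.0)]
    controls = [(10.0, 0.0), (7.0, 2.0)]
    print(win_ratio(treated, controls))  # 3 wins, 1 loss
```

The pairwise construction is what the abstract refers to when it says the WR does not correspond to a unique data-generating mechanism: many outcome distributions yield the same win and loss counts.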

2602.17577 2026-02-20 cs.DS cs.LG stat.ML

Simultaneous Blackwell Approachability and Applications to Multiclass Omniprediction

Lunjia Hu, Kevin Tian, Chutong Yang

Abstract

Omniprediction is a learning problem that requires suboptimality bounds for each of a family of losses $\mathcal{L}$ against a family of comparator predictors $\mathcal{C}$. We initiate the study of omniprediction in a multiclass setting, where the comparator family $\mathcal{C}$ may be infinite. Our main result is an extension of the recent binary omniprediction algorithm of [OKK25] to the multiclass setting, with sample complexity (in statistical settings) or regret horizon (in online settings) $\approx \varepsilon^{-(k+1)}$, for $\varepsilon$-omniprediction in a $k$-class prediction problem. En route to proving this result, we design a framework of potential broader interest for solving Blackwell approachability problems where multiple sets must simultaneously be approached via coupled actions.

2602.17565 2026-02-20 math.ST cs.LG stat.ML stat.TH

Optimal Unconstrained Self-Distillation in Ridge Regression: Strict Improvements, Precise Asymptotics, and One-Shot Tuning

Hien Dang, Pratik Patil, Alessandro Rinaldo

Comments 78 pages, 25 figures

Abstract

Self-distillation (SD) is the process of retraining a student on a mixture of ground-truth labels and the teacher's own predictions using the same architecture and training data. Although SD has been empirically shown to often improve generalization, its formal guarantees remain limited. We study SD for ridge regression in an unconstrained setting in which the mixing weight $\xi$ may be outside the unit interval. Conditioned on the training data and without any distributional assumptions, we prove that for any squared prediction risk (including out-of-distribution), the optimally mixed student strictly improves upon the ridge teacher for every regularization level $\lambda > 0$ at which the teacher ridge risk $R(\lambda)$ is nonstationary (i.e., $R'(\lambda) \neq 0$). We obtain a closed-form expression for the optimal mixing weight $\xi^\star(\lambda)$ for any value of $\lambda$ and show that it obeys the sign rule: $\operatorname{sign}(\xi^\star(\lambda))=-\operatorname{sign}(R'(\lambda))$. In particular, $\xi^\star(\lambda)$ can be negative, which is the case in over-regularized regimes. To quantify the risk improvement due to SD, we derive exact deterministic equivalents for the optimal SD risk in the proportional asymptotics regime (where the sample and feature sizes $n$ and $p$ both diverge but their aspect ratio $p/n$ converges) under general anisotropic covariance and deterministic signals. Our asymptotic analysis extends standard second-order ridge deterministic equivalents to their fourth-order analogs using block linearization, which may be of independent interest. From a practical standpoint, we propose a consistent one-shot tuning method to estimate $\xi^\star$ without grid search, sample splitting, or refitting. Experiments on real-world datasets and pretrained neural network features support our theory and the one-shot tuning method.
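A toy sketch may help picture why a negative mixing weight can help. It assumes a single feature with no intercept, and it assumes the student's labels are formed as (1 - mix) * y + mix * (teacher prediction), which is one natural reading of the abstract rather than the paper's confirmed convention. The function names are hypothetical, and the paper's closed-form optimal weight and one-shot tuning are not reproduced.

```python
from typing import Sequence


def ridge_slope(x: Sequence[float], y: Sequence[float], lam: float) -> float:
    """One-feature ridge regression without intercept: sum(xy) / (sum(x^2) + lam)."""
    sxx = sum(v * v for v in x)
    sxy = sum(a * b for a, b in zip(x, y))
    return sxy / (sxx + lam)


def self_distilled_slope(x: Sequence[float], y: Sequence[float],
                         lam: float, mix: float) -> float:
    """Refit ridge on labels mixed with the teacher's predictions.

    Assumption: mixed label = (1 - mix) * y + mix * teacher_prediction,
    with the same penalty `lam` for teacher and student.
    """
    teacher = ridge_slope(x, y, lam)
    mixed = [(1.0 - mix) * label + mix * teacher * feature
             for feature, label in zip(x, y)]
    return ridge_slope(x, mixed, lam)


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    y = [2.0, 4.0, 6.0]      # noiseless data with true slope 2
    lam = 14.0               # heavy penalty: the teacher slope shrinks to 1.0
    for mix in (1.0, 0.0, -1.0):
        print(mix, self_distilled_slope(x, y, lam, mix))
```

In this over-regularized toy case the teacher slope is 1.0; a mixing weight of 1 shrinks the student further to 0.5, while a weight of -1 moves it to 1.5, closer to the true slope of 2.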

2602.17543 2026-02-20 stat.ML cs.LG econ.EM math.ST stat.ME stat.TH

genriesz: A Python Package for Automatic Debiased Machine Learning with Generalized Riesz Regression

Masahiro Kato

Abstract

Efficient estimation of causal and structural parameters can be automated using the Riesz representation theorem and debiased machine learning (DML). We present genriesz, an open-source Python package that implements automatic DML and generalized Riesz regression, a unified framework for estimating Riesz representers by minimizing empirical Bregman divergences. This framework includes covariate balancing, nearest-neighbor matching, calibrated estimation, and density ratio estimation as special cases. A key design principle of the package is automatic regressor balancing (ARB): given a Bregman generator $g$ and a representer model class, genriesz automatically constructs a compatible link function so that the generalized Riesz regression estimator satisfies balancing (moment-matching) optimality conditions in a user-chosen basis. The package provides a modular interface for specifying (i) the target linear functional via a black-box evaluation oracle, (ii) the representer model via basis functions (polynomial, RKHS approximations, random forest leaf encodings, neural embeddings, and a nearest-neighbor catchment basis), and (iii) the Bregman generator, with optional user-supplied derivatives. It returns regression adjustment (RA), Riesz weighting (RW), augmented Riesz weighting (ARW), and TMLE-style estimators with cross-fitting, confidence intervals, and $p$-values. We highlight representative workflows for estimation problems such as the average treatment effect (ATE), ATE on treated (ATT), and average marginal effect estimation. The Python package is available at https://github.com/MasaKat0/genriesz and on PyPI.

2602.17414 2026-02-20 stat.CO astro-ph.IM stat.ME

Nested Sampling with Slice-within-Gibbs: Efficient Evidence Calculation for Hierarchical Bayesian Models

David Yallup

Comments 26 pages, 6 figures

Abstract

We present Nested Sampling with Slice-within-Gibbs (NS-SwiG), an algorithm for Bayesian inference and evidence estimation in high-dimensional models whose likelihood admits a factorization, such as hierarchical Bayesian models. We construct a procedure to sample from the likelihood-constrained prior using a Slice-within-Gibbs kernel: an outer update of hyperparameters followed by inner block updates over local parameters. A likelihood-budget decomposition caches per-block contributions so that each local update checks feasibility in constant time rather than recomputing the global constraint at linearly growing cost. This reduces the per-replacement cost from quadratic to linear in the number of groups, and the overall algorithmic complexity from cubic to quadratic under standard assumptions. The decomposition extends naturally beyond independent observations, and we demonstrate this on Markov-structured latent variables. We evaluate NS-SwiG on challenging benchmarks, demonstrating scalability to thousands of dimensions and accurate evidence estimates even on posterior geometries where state-of-the-art gradient-based samplers can struggle.

2602.16265 2026-02-20 stat.ML cs.LG

On sparsity, extremal structure, and monotonicity properties of Wasserstein and Gromov-Wasserstein optimal transport plans

Titouan Vayer

Abstract

This note gives a self-contained overview of some important properties of the Gromov-Wasserstein (GW) distance, compared with the standard linear optimal transport (OT) framework. More specifically, I explore the following questions: are GW optimal transport plans sparse? Under what conditions are they supported on a permutation? Do they satisfy a form of cyclical monotonicity? In particular, I present the conditionally negative semi-definite property and show that, when it holds, there are GW optimal plans that are sparse and supported on a permutation.

2601.16174 2026-02-20 stat.ML cs.LG

Beyond Predictive Uncertainty: Reliable Representation Learning with Structural Constraints

Yiyao Yang

Comments 22 pages, 5 figures, 5 propositions

Abstract

Uncertainty estimation in machine learning has traditionally focused on the prediction stage, aiming to quantify confidence in model outputs while treating learned representations as deterministic and reliable by default. In this work, we challenge this implicit assumption and argue that reliability should be regarded as a first-class property of learned representations themselves. We propose a principled framework for reliable representation learning that explicitly models representation-level uncertainty and leverages structural constraints as inductive biases to regularize the space of feasible representations. Our approach introduces uncertainty-aware regularization directly in the representation space, encouraging representations that are not only predictive but also stable, well-calibrated, and robust to noise and structural perturbations. Structural constraints, such as sparsity, relational structure, or feature-group dependencies, are incorporated to define meaningful geometry and reduce spurious variability in learned representations, without assuming fully correct or noise-free structure. Importantly, the proposed framework is independent of specific model architectures and can be integrated with a wide range of representation learning methods.

2601.14969 2026-02-20 q-bio.GN stat.ML

Robust Machine Learning for Regulatory Sequence Modeling under Biological and Technical Distribution Shifts

Yiyao Yang

Comments 20 pages, 16 figures

Abstract

Robust machine learning for regulatory genomics is studied under biologically and technically induced distribution shifts. Deep convolutional and attention-based models achieve strong in-distribution performance on DNA regulatory sequence prediction tasks but are usually evaluated under i.i.d. assumptions, even though real applications involve cell-type-specific programs, evolutionary turnover, assay protocol changes, and sequencing artifacts. We introduce a robustness framework that combines a mechanistic simulation benchmark with real data analysis on a massively parallel reporter assay (MPRA) dataset to quantify performance degradation, calibration failures, and uncertainty-based reliability. In simulation, motif-driven regulatory outputs are generated with cell-type-specific programs, PWM perturbations, GC bias, depth variation, batch effects, and heteroscedastic noise, and CNN, BiLSTM, and transformer models are evaluated. Models remain accurate and reasonably calibrated under mild GC-content shifts but show higher error, severe variance miscalibration, and coverage collapse under motif-effect rewiring and noise-dominated regimes, revealing robustness gaps invisible to standard i.i.d. evaluation. Adding simple biological structural priors (motif-derived features in simulation and global GC content in MPRA) improves in-distribution error and yields consistent robustness gains under biologically meaningful genomic shifts, while providing only limited protection against strong assay noise. Uncertainty-aware selective prediction offers an additional safety layer: risk-coverage analyses on simulated and MPRA data show that filtering low-confidence inputs recovers low-risk subsets, including under GC-based out-of-distribution conditions, although reliability gains diminish when noise dominates.

2512.18118 2026-02-20 stat.AP stat.ME

Distribution-Free Selection of Low-Risk Oncology Patients for Survival Beyond a Time Horizon

Matteo Sesia, Vladimir Svetnik

Abstract

We study the problem of selecting a subset of patients who are unlikely to experience an adverse event within a fixed time horizon by calibrating a screening rule based on a black-box survival model. We consider two complementary, distribution-free frameworks for this task. The first extends classical calibration ideas (estimating the event rate among selected patients using a hold-out dataset) by integrating them with the Learn-Then-Test (LTT) framework, yielding high-probability guarantees for data-adaptively tuned screening rules. The second takes a different perspective by reformulating screening as a hypothesis testing problem on future patient outcomes, enabling false discovery rate (FDR) control via the Benjamini-Hochberg procedure applied to selective conformal p-values, and providing guarantees in expectation. We clarify the theoretical relationship between these approaches, explain how both can be adapted to right-censored time-to-event data via inverse probability of censoring weighting, and compare them empirically using simulations and oncology data from the Flatiron Health Research Database. Our results reveal a trade-off between efficiency and strength of guarantees: FDR-based screening is typically more powerful, while LTT-based calibration is more conservative but offers stronger guarantees. We also provide practical guidance on implementation and tuning.
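The Benjamini-Hochberg step named in the abstract can be sketched in a few lines. This is the standard step-up procedure applied to an arbitrary list of p-values; how the paper constructs its selective conformal p-values and handles censoring is not shown, and the function name is hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], alpha: float) -> List[int]:
    """Return the indices rejected by the BH step-up procedure at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        # Step-up rule: keep the largest rank whose p-value is under its threshold.
        if p_values[index] <= alpha * rank / m:
            largest_rank = rank
    return sorted(order[:largest_rank])


if __name__ == "__main__":
    p = [0.01, 0.03, 0.035, 0.5]
    # 0.03 misses its own threshold (0.025) but is still rejected because
    # the third-smallest p-value 0.035 passes its threshold (0.0375).
    print(benjamini_hochberg(p, alpha=0.05))
```

In the screening setting, a rejected index would correspond to a patient selected as low-risk, so the FDR bounds the expected fraction of selected patients who in fact experience the event.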

2511.11166 2026-02-20 stat.ME

Choosing the nominal level post-hoc with knockoffs using e-values

Lasse Fischer, Konstantinos Sechidis

Abstract

The knockoff filter is a powerful tool for controlled variable selection with false discovery rate (FDR) control. In this paper, we leverage e-values to allow the nominal FDR level to be switched post-hoc, after looking at the data and applying the knockoff procedure. This approach addresses a significant limitation of standard knockoffs: while frequently used in high-dimensional regressions, they often lack power in low-dimensional and sparse signal settings. One of the main reasons for this is that the knockoff filter requires a minimum number of selections that depends strictly on the nominal FDR level. By utilizing e-values, we can increase the nominal level in cases where the original procedure makes no discoveries, or decrease it to improve precision when discoveries are abundant. These improvements come without any costs, meaning the results of our post-hoc procedure are always more informative than those of the original knockoff filter. We extend this methodology to recently proposed derandomized knockoff procedures and demonstrate its utility in variable selection problems relevant to drug development using real clinical trial data.

2511.08192 2026-02-20 stat.ME

Geometric modelling of spatial extremes

Lydia Kakampakou, Jennifer L. Wadsworth

Comments 35 pages, 15 figures

Abstract

Recent developments in extreme value statistics have established the so-called geometric approach as a powerful modelling tool for multivariate extremes. We tailor these methods to the case of spatial modelling and examine their efficacy at inferring extremal dependence and performing extrapolation. The geometric approach is based around a limit set described by a gauge function, which is a key target for inference. We consider a variety of spatially-parameterised gauge functions and perform inference on them by building on the framework of Wadsworth and Campbell (2024), where extreme radii are modelled via a truncated gamma distribution. We also consider spatial modelling of the angular distribution, for which we propose two candidate models. Estimation of extreme event probabilities is possible by combining draws from the radial and angular models respectively. We compare our method with two other established frameworks for spatial extreme value analysis and show that our approach generally allows for unbiased, albeit more uncertain, inference compared to the more classical models. We illustrate the methodology on a space weather dataset of daily geomagnetic field fluctuations.

2509.03315 2026-02-20 stat.ME

The super learner for time-to-event outcomes: A tutorial

Ruth H. Keogh, Karla Diaz-Ordaz, Nan van Geloven, Jon Michael Gran, Kamaryn T. Tanner

Abstract

Estimating risks or survival probabilities conditional on individual characteristics based on censored time-to-event data is a commonly faced task. This may be for the purpose of developing a prediction model or may be part of a wider estimation procedure, such as in causal inference. A challenge is that it is impossible to know at the outset which of a set of candidate models will provide the best risk estimates. The super learner is a powerful approach for finding the best model or combination of models ('ensemble') among a pre-specified set of candidate models or 'learners', which can include both 'statistical' models (e.g. parametric, semi-parametric models) and 'machine learning' models. Super learners for time-to-event outcomes have been developed, but the literature is technical and the full details of how these methods work and can be implemented in practice have not previously been presented in an accessible format. In this paper we provide a practical tutorial on super learner methods for time-to-event outcomes. An overview of the general steps involved in the super learner is given, followed by details of three specific implementations for time-to-event outcomes. These include the originally proposed super learner, which involves using a discrete time scale, and two more recently proposed versions of the super learner for continuous-time data. We compare the properties of the methods and provide information on how they can be implemented in R. The methods are illustrated using an open access data set and R code is provided.

2507.05634 2026-02-20 math.ST stat.TH

A Note on Inferential Decisions, Errors and Path-Dependency

Kangda K. Wren

Comments 12 pages: 1 highlight, 7 main text, 3 appendix and 1 bibliography

Abstract

Consider the sequential testing of binary outcomes. The a posteriori belief process and its objective conditional-probability counterpart generally differ but converge to the same result in well-defined tests. We show that unless the two processes are 'essentially identical', differing only by an a priori factor, time-homogeneous continuous decisions based on the former are path-dependent with respect to state-variables based on the latter or any other non-essentially-identical processes. Inferential error decomposes into a path-dependent and a path-independent component, whose distinct properties are relevant to error mitigation.

2503.17338 2026-02-20 cs.AI cs.LG stat.ML

Capturing Individual Human Preferences with Reward Features

André Barreto, Vincent Dumoulin, Yiran Mao, Mark Rowland, Nicolas Perez-Nieves, Bobak Shahriari, Yann Dauphin, Doina Precup, Hugo Larochelle

Comments Published at NeurIPS 2025

Abstract

Reinforcement learning from human feedback usually models preferences using a reward function that does not distinguish between people. We argue that this is unlikely to be a good design choice in contexts with high potential for disagreement, like in the training of large language models. We formalise and analyse the problem of learning a reward model that can be specialised to a user. Using the principle of empirical risk minimisation, we derive a probably approximately correct (PAC) bound showing the dependency of the approximation error on the number of training examples, as usual, and also on the number of human raters who provided feedback on them. Based on our theoretical findings, we discuss how to best collect pairwise preference data and argue that adaptive reward models should be beneficial when there is considerable disagreement among users. We also propose a concrete architecture for an adaptive reward model. Our approach leverages the observation that individual preferences can be captured as a linear combination of a set of general reward features. We show how to learn such features and subsequently use them to quickly adapt the reward model to a specific individual, even if their preferences are not reflected in the training data. We present experiments with large language models illustrating our theoretical results and comparing the proposed architecture with a non-adaptive baseline. Consistent with our analysis, the benefits provided by our model increase with the number of raters and the heterogeneity of their preferences. We also show that our model compares favourably to adaptive counterparts, including those performing in-context personalisation.

2501.14118 2026-02-20 cs.LG stat.AP stat.ML

Selecting Critical Scenarios of DER Adoption in Distribution Grids Using Bayesian Optimization

Olivier Mulkin, Miguel Heleno, Mike Ludkovski

Comments 12 pages, 4 tables, 12 figures

Abstract

We develop a new methodology to select scenarios of DER adoption most critical for distribution grids. Anticipating risks of future voltage and line flow violations due to additional PV adopters is central for utility investment planning but continues to rely on deterministic or ad hoc scenario selection. We propose a highly efficient search framework based on multi-objective Bayesian Optimization. We treat underlying grid stress metrics as computationally expensive black-box functions, approximated via Gaussian Process surrogates, and design an acquisition function based on the probability of scenarios being Pareto-critical across a collection of line- and bus-based violation objectives. Our approach provides a statistical guarantee and offers an order of magnitude speed-up relative to a conservative exhaustive search. Case studies on realistic feeders with 200-400 buses demonstrate the effectiveness and accuracy of our approach.

2411.02137 2026-02-20 math.ST cs.LG stat.ML stat.TH

Finite-sample performance of the maximum likelihood estimator in logistic regression

Hugo Chardon, Matthieu Lerasle, Jaouad Mourtada

Comments Minor revision

Abstract

Logistic regression is a classical model for describing the probabilistic dependence of binary responses on multivariate covariates. We consider the predictive performance of the maximum likelihood estimator (MLE) for logistic regression, assessed in terms of logistic risk. We consider two questions: first, that of the existence of the MLE (which occurs when the dataset is not linearly separated), and second, that of its accuracy when it exists. These properties depend on both the dimension of covariates and the signal strength. In the case of Gaussian covariates and a well-specified logistic model, we obtain sharp non-asymptotic guarantees for the existence and excess logistic risk of the MLE. We then generalize these results in two ways: first, to non-Gaussian covariates satisfying a certain two-dimensional margin condition, and second to the general case of statistical learning with a possibly misspecified logistic model. Finally, we consider the case of a Bernoulli design, where the behavior of the MLE is highly sensitive to the parameter direction.

2408.13220 2026-02-20 stat.CO stat.AP

A New Perspective to Fish Trajectory Imputation: A Methodology for Spatiotemporal Modeling of Acoustically Tagged Fish Data

Mahshid Ahmadian, Edward L. Boone, Grace S. Chiu

Abstract

The focus of this paper is a key component of a methodology for understanding, interpolating, and predicting fish movement patterns based on spatiotemporal data recorded by spatially static acoustic receivers. Unlike GPS trackers which emit satellite signals from the animal's location, acoustic receivers are akin to stationary motion sensors that record movements within their detection range. Thus, for periods of time, fish may be far from the receivers, resulting in the absence of observations. The lack of information on the fish's location for extended time periods poses challenges to the understanding of fish movement patterns, and hence, the identification of proper statistical inference frameworks for modeling the trajectories. As the initial step in our methodology, in this paper, we devise and implement a simulation-based imputation strategy that relies on both Markov chain and random-walk principles to enhance our dataset over time. This methodology will be generalizable and applicable to all fish species with similar migration patterns or data with similar structures due to the use of static acoustic receivers.

2408.10478 2026-02-20 stat.ME

Reconciliating Bayesian and frequentist approaches to robustness against outliers

Philippe Gagnon, Alain Desgagné

Abstract

Heavy-tailed models are used as a way to gain robustness against outliers in Bayesian analyses. In frequentist analyses, M-estimators are often employed. In this paper, the two approaches are tentatively reconciled by considering M-estimators as maximum likelihood estimators of heavy-tailed models. From this perspective, it is realized that a fundamental difference exists as frequentists, contrary to Bayesians, do not require these heavy-tailed models to be proper. For instance, a popular robust estimator in linear regression, Tukey's biweight M-estimator, does not correspond to a proper heavy-tailed model. Thus, a Bayesian practitioner does not have access to the same range of tools as a frequentist practitioner. It is shown through two real-data linear regression analyses that the former may in consequence obtain significantly different estimation results than the latter, where the difference is due to a more pronounced influence by the outliers in the former case. It is highlighted that a way to give these practitioners access to the same range of tools is for the Bayesian to adopt the generalized Bayesian framework of Bissiri et al. (2016) which allows the use of improper models (Jewson and Rossell, 2022), in combination with proper prior distributions yielding proper generalized posterior distributions. A complete reconciliation of the Bayesian and frequentist approaches to robustness is then achieved. An extensive theoretical study of the generalized Bayesian counterpart of Tukey's biweight M-estimator is provided, which includes a robustness characterization result and a Bernstein-von Mises result, the latter allowing the generalized posterior distribution to be calibrated for meaningful uncertainty quantification. After adopting the generalized Bayesian framework, the Bayesian practitioner obtains results similar to those of the frequentist practitioner in the aforementioned examples.

2408.05826 2026-02-20 math.ST math.CO math.PR stat.TH

Möbius inversion and the iterated bootstrap

Florian Schäfer

Abstract

Estimating nonlinear functionals of probability distributions from samples is a fundamental statistical problem. The "plug-in" estimator obtained by applying the target functional to the empirical distribution of samples is biased. Resampling methods such as the bootstrap derive artificial datasets from the original one by resampling. Comparing the outcome of the plug-in estimator in the original and resampled datasets allows estimating and thus correcting the bias. In the asymptotic setting, iterations of this procedure attain an arbitrarily high order of bias correction, but finite sample results are scarce. This work develops a new theoretical understanding of bootstrap bias correction by viewing it as an iterative linear solver for the combinatorial operation of Möbius inversion. It sharply characterizes the regime of linear convergence of the bootstrap bias reduction for moment polynomials. It uses these results to show its superalgebraic convergence rate for band-limited functionals. Finally, it derives a modified bootstrap iteration enabling the unbiased estimation of unknown order-$m$ moment polynomials in $m$ bootstrap iterations.
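One round of the bootstrap bias correction described above can be sketched exactly for a tiny sample. The sketch assumes the plug-in variance as the target functional and enumerates all n^n resamples instead of drawing them at random, which is only feasible for very small n; the function names are hypothetical, and the paper's Möbius-inversion analysis and modified iteration are not reproduced.

```python
from itertools import product
from statistics import fmean
from typing import Callable, Sequence


def plug_in_variance(sample: Sequence[float]) -> float:
    """Variance of the empirical distribution (divides by n, hence biased)."""
    center = fmean(sample)
    return fmean([(value - center) ** 2 for value in sample])


def bootstrap_bias_corrected(sample: Sequence[float],
                             estimator: Callable[[Sequence[float]], float]) -> float:
    """Subtract the exact bootstrap bias estimate from the plug-in estimate."""
    n = len(sample)
    estimate = estimator(sample)
    # Exact bootstrap expectation: average over every resample of size n.
    bootstrap_mean = fmean([estimator(resample)
                            for resample in product(sample, repeat=n)])
    estimated_bias = bootstrap_mean - estimate
    return estimate - estimated_bias


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0]
    print(plug_in_variance(data))                           # 1.25
    print(bootstrap_bias_corrected(data, plug_in_variance))  # 1.5625
```

For this sample the unbiased variance estimate is 5/3, so a single correction moves the plug-in value 1.25 most of the way there; iterating the procedure is what the abstract analyses as an iterative linear solver.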

2407.01566 2026-02-20 q-fin.CP cs.GT cs.LG stat.ML

A Parametric Contextual Online Learning Theory of Brokerage

François Bachoc, Tommaso Cesari, Roberto Colomboni

Abstract

We study the role of contextual information in the online learning problem of brokerage between traders. In this sequential problem, at each time step, two traders arrive with secret valuations about an asset they wish to trade. The learner (a broker) suggests a trading (or brokerage) price based on contextual data about the asset and the market conditions. Then, the traders reveal their willingness to buy or sell based on whether their valuations are higher or lower than the brokerage price. A trade occurs if one of the two traders decides to buy and the other to sell, i.e., if the broker's proposed price falls between the smallest and the largest of their two valuations. We design algorithms for this problem and prove optimal theoretical regret guarantees under various standard assumptions.
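
The feedback and trade rule described above can be sketched in a few lines. This is only an illustration of the interaction protocol, not of the paper's algorithms; the function names and the tie-breaking rule at equality (a trader whose valuation equals the price buys) are assumptions.

```python
def feedback(price, valuation_a, valuation_b):
    """What each trader reveals: 'buy' if valuation >= price, else 'sell'.

    The tie-breaking at equality is an illustrative assumption.
    """
    return tuple("buy" if v >= price else "sell" for v in (valuation_a, valuation_b))


def trade_occurs(price, valuation_a, valuation_b):
    """A trade happens exactly when one trader buys and the other sells."""
    return set(feedback(price, valuation_a, valuation_b)) == {"buy", "sell"}
```

For example, with valuations 0.2 and 0.8, a price of 0.5 produces a trade, whereas a price of 0.9 leaves both traders wanting to sell.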

2602.17272 2026-02-20 stat.ME

Estimating Zero-inflated Negative Binomial GAMLSS via a Balanced Gradient Boosting Approach with an Application to Antenatal Care Data from Nigeria

Alexandra Daub, Elisabeth Bergherr

Abstract

Statistical boosting algorithms are renowned for their intrinsic variable selection and enhanced predictive performance compared to classical statistical methods, making them especially useful for complex models such as generalized additive models for location, scale and shape (GAMLSS). Boosting this model class can suffer from imbalanced updates across the distribution parameters as well as long computation times. Shrunk optimal step lengths have been shown to address these issues. To examine the influence of socio-economic factors on the distribution of the number of antenatal care visits in Nigeria, we generalize boosting of GAMLSS with shrunk optimal step lengths to base-learners beyond simple linear models and to a more complex response variable distribution. In an extensive simulation study and in the application we demonstrate that shrunk optimal step lengths yield a more balanced regularization of the overall model and enhance computational efficiency across diverse settings, in particular in the presence of base-learners penalizing the size of the fit.

2602.17261 2026-02-20 stat.ME

Parametric or nonparametric: the FIC approach for stationary time series

Gudmund Hermansen, Nils Lid Hjort, Martin Jullum

Comments 21 pages, 6 figures; Statistical Research Report (Department of Mathematics, University of Oslo), from December 2015, but arXiv'd February 2026; a later modified and extended version might then become a journal paper

Abstract

We seek to narrow the gap between parametric and nonparametric modelling of stationary time series processes. The approach is inspired by recent advances in focused inference and model selection techniques. The paper generalises and extends recent work by developing a new version of the focused information criterion (FIC), directly comparing the performance of parametric time series models with a nonparametric alternative. For a pre-specified focused parameter, for which scrutiny is considered valuable, this is achieved by comparing the mean squared error of the model-based estimators of this quantity. In particular, this yields FIC formulae for covariances or correlations at specified lags, for the probability of reaching a threshold, etc. Suitable weighted average versions, the AFIC, also lead to model selection strategies for finding the best model for the purpose of estimating e.g. a sequence of correlations.

2602.17255 2026-02-20 stat.ME stat.AP

Selection and Collider Restriction Bias Due to Predictor Availability in Prognostic Models

Marc Delord

Abstract

This methodological note investigates and discusses possible selection and collider restriction bias due to predictor availability in prognostic models.

2602.17225 2026-02-20 cond-mat.mtrl-sci physics.data-an stat.ME

Wide-Surface Furnace for In Situ X-Ray Diffraction of Combinatorial Samples using a High-Throughput Approach

Giulio Cordaro, Juande Sirvent, Cristian Mocuta, Fjorelo Buzi, Thierry Martin, Federico Baiutti, Alex Morata, Albert Tarancòn, Dominique Thiaudière, Guilhem Dezanneau

Abstract

The combinatorial approach applied to functional oxides has enabled the production of material libraries that formally contain infinite compositions. A complete ternary diagram can be obtained by pulsed laser deposition (PLD) on 100 mm silicon wafers. However, interest in such materials libraries is only meaningful if high-throughput characterization enables the information extraction from the as-deposited library in a reasonable time. While much commercial equipment allows for XY-resolved characterization at room temperature, very few sample holders have been made available to investigate structural, chemical, and functional properties at high temperatures in controlled atmospheres. In the present work, we present a furnace that enables the study of 100 mm wafers as a function of temperature. This furnace has a dome to control the atmosphere, typically varying from nitrogen gas to pure oxygen atmosphere with external control. We present the design of such a furnace and an example of X-ray diffraction (XRD) and fluorescence (XRF) measurements performed at the DiffAbs beamline of the SOLEIL synchrotron. We apply this high-throughput approach to a combinatorial library up to 735 °C in nitrogen and calculate the thermal expansion coefficients (TEC) of the ternary system using custom-made MATLAB codes. The TEC analysis revealed the potential limitations of Vegard's law in predicting lattice variations for high-entropy materials.

2602.17211 2026-02-20 stat.ML cs.LG

MGD: Moment Guided Diffusion for Maximum Entropy Generation

Etienne Lempereur, Nathanaël Cuvelle-Magar, Florentin Coeurdoux, Stéphane Mallat, Eric Vanden-Eijnden

Abstract

Generating samples from limited information is a fundamental problem across scientific domains. Classical maximum entropy methods provide principled uncertainty quantification from moment constraints but require sampling via MCMC or Langevin dynamics, which typically exhibit exponential slowdown in high dimensions. In contrast, generative models based on diffusion and flow matching efficiently transport noise to data but offer limited theoretical guarantees and can overfit when data is scarce. We introduce Moment Guided Diffusion (MGD), which combines elements of both approaches. Building on the stochastic interpolant framework, MGD samples maximum entropy distributions by solving a stochastic differential equation that guides moments toward prescribed values in finite time, thereby avoiding slow mixing in equilibrium-based methods. We formally obtain, in the large-volatility limit, convergence of MGD to the maximum entropy distribution and derive a tractable estimator of the resulting entropy computed directly from the dynamics. Applications to financial time series, turbulent flows, and cosmological fields using wavelet scattering moments yield estimates of negentropy for high-dimensional multiscale processes.

2602.17161 2026-02-20 stat.ME

Dynamic likelihood hazard rate estimation

Nils Lid Hjort

Comments 20 pages, no figures; Statistical Research Report from 1993 (Department of Mathematics, University of Oslo); accepted with "minor revision" by Biometrika then, but somehow I never got around to do the final polish. This report, arXiv'd now in 2026, might be modified and updated (and illustrated with real data) for later journal publication

Abstract

The best known methods for estimating hazard rate functions in survival analysis models are either purely parametric or purely nonparametric. The parametric ones are sometimes too biased while the nonparametric ones are sometimes too variable. In the present paper a certain semiparametric approach to hazard rate estimation, proposed in Hjort (1991), is developed further, aiming to combine parametric and nonparametric features. It uses a dynamic local likelihood approach to fit the locally most suitable member in a given parametric class of hazard rates, and amounts to a version of nonparametric parameter smoothing within the parametric class. Thus the parametric hazard rate estimate at time $s$ inserts a parameter estimate that also depends on $s$. We study bias and variance properties of the resulting estimator and methods for choosing the local smoothing parameter. It is shown that dynamic likelihood estimation often leads to better performance than the purely nonparametric methods, while also having capacity for not losing much to the parametric methods in cases where the model being smoothed is adequate.

2602.17144 2026-02-20 cs.LG stat.ML

When More Experts Hurt: Underfitting in Multi-Expert Learning to Defer

Shuqi Liu, Yuzhou Cao, Lei Feng, Bo An, Luke Ong

Abstract

Learning to Defer (L2D) enables a classifier to abstain from predictions and defer to an expert, and has recently been extended to multi-expert settings. In this work, we show that multi-expert L2D is fundamentally more challenging than the single-expert case. With multiple experts, the classifier's underfitting becomes inherent, which seriously degrades prediction performance, whereas in the single-expert setting it arises only under specific conditions. We theoretically reveal that this stems from an intrinsic expert identifiability issue: learning which expert to trust from a diverse pool, a problem that is absent in the single-expert case and that renders existing underfitting remedies ineffective. To tackle this issue, we propose PiCCE (Pick the Confident and Correct Expert), a surrogate-based method that adaptively identifies a reliable expert based on empirical evidence. PiCCE effectively reduces multi-expert L2D to a single-expert-like learning problem, thereby resolving multi-expert underfitting. We further prove its statistical consistency and ability to recover class probabilities and expert accuracies. Extensive experiments across diverse settings, including real-world expert scenarios, validate our theoretical results and demonstrate improved performance.

2602.17115 2026-02-20 stat.ML cs.LG

Semi-Supervised Learning on Graphs using Graph Neural Networks

Juntong Chen, Claire Donnat, Olga Klopp, Johannes Schmidt-Hieber

Comments 57 pages, 7 figures

Abstract

Graph neural networks (GNNs) work remarkably well in semi-supervised node regression, yet a rigorous theory explaining when and why they succeed remains lacking. To address this gap, we study an aggregate-and-readout model that encompasses several common message passing architectures: node features are first propagated over the graph then mapped to responses via a nonlinear function. For least-squares estimation over GNNs with linear graph convolutions and a deep ReLU readout, we prove a sharp non-asymptotic risk bound that separates approximation, stochastic, and optimization errors. The bound makes explicit how performance scales with the fraction of labeled nodes and graph-induced dependence. Approximation guarantees are further derived for graph-smoothing followed by smooth nonlinear readouts, yielding convergence rates that recover classical nonparametric behavior under full supervision while characterizing performance when labels are scarce. Numerical experiments validate our theory, providing a systematic framework for understanding GNN performance and limitations.

2602.17103 2026-02-20 cs.LG stat.ML

Online Learning with Improving Agents: Multiclass, Budgeted Agents and Bandit Learners

Sajad Ashkezari, Shai Ben-David

Abstract

We investigate the recently introduced model of learning with improvements, where agents are allowed to make small changes to their feature values to warrant a more desirable label. We substantially extend previously published results by providing combinatorial dimensions that characterize online learnability in this model, and by analyzing the multiclass setup, learnability under bandit feedback, agents' costs for making improvements, and more.

2602.17087 2026-02-20 math.PR stat.CO

Diffusive Scaling Limits of Forward Event-Chain Monte Carlo: Provably Efficient Exploration with Partial Refreshment

Hirofumi Shiba, Kengo Kamatani

Comments 43 pages, 5 figures

Abstract

Piecewise deterministic Markov process samplers are attractive alternatives to Metropolis-Hastings algorithms. A central design question is how to incorporate partial velocity refreshment to ensure ergodicity without injecting excessive noise. Forward Event-Chain Monte Carlo (FECMC) is a generalization of the Bouncy Particle Sampler (BPS) that addresses this issue through a stochastic reflection mechanism, thereby reducing reliance on global refreshment moves. Despite promising empirical performance, its theoretical efficiency remains largely unexplored. We develop a high-dimensional scaling analysis for standard Gaussian targets and prove that the negative log-density (or potential) process of FECMC converges to an Ornstein-Uhlenbeck diffusion, under the same scaling as BPS. We derive closed-form expressions for the limiting diffusion coefficients of both methods by analyzing their associated radial momentum processes and solving the corresponding Poisson equations. These expressions yield a sharp efficiency comparison: the diffusion coefficient of FECMC is strictly larger than that of optimally tuned BPS, and the optimum for FECMC is attained at zero global refreshment. Specifically, they imply an approximately eightfold increase in effective sample size per event over optimal BPS. Numerical experiments confirm the predicted diffusion coefficients and show that the resulting efficiency gains remain substantial for a range of non-Gaussian targets. Finally, as an application of these results, we propose an asymptotic variance estimator for piecewise deterministic Markov processes that becomes increasingly efficient in high dimensions by extracting information from the velocity variable.

2602.17086 2026-02-20 econ.TH cs.LG math.ST stat.TH

Dynamic Decision-Making under Model Misspecification: A Stochastic Stability Approach

Xinyu Dai, Daniel Chen, Yian Qian

Abstract

Dynamic decision-making under model uncertainty is central to many economic environments, yet existing bandit and reinforcement learning algorithms rely on the assumption of correct model specification. This paper studies the behavior and performance of one of the most commonly used Bayesian reinforcement learning algorithms, Thompson Sampling (TS), when the model class is misspecified. We first provide a complete dynamic classification of posterior evolution in a misspecified two-armed Gaussian bandit, identifying distinct regimes: correct model concentration, incorrect model concentration, and persistent belief mixing, characterized by the direction of statistical evidence and the model-action mapping. These regimes yield sharp predictions for limiting beliefs, action frequencies, and asymptotic regret. We then extend the analysis to a general finite model class and develop a unified stochastic stability framework that represents posterior evolution as a Markov process on the belief simplex. This approach characterizes two sufficient conditions to classify the ergodic and transient behaviors and provides inductive dimensional reductions of the posterior dynamics. Our results offer the first qualitative and geometric classification of TS under misspecification, bridging Bayesian learning with evolutionary dynamics, and also build the foundations of robust decision-making in structured bandits.

2602.17079 2026-02-20 stat.AP

Environmental policy in the context of complex systems: Statistical optimization and sensitivity analysis for ABMs

Dylan Munson, Arijit Dey, Simon Mak

Abstract

Coupled human-environment systems are increasingly being understood as complex adaptive systems (CAS), in which micro-level interactions between components lead to emergent behavior. Agent-based models (ABMs) hold great promise for environmental policy design by capturing such complex behavior, enabling a sophisticated understanding of potential interventions. One limitation, however, is that ABMs can be computationally costly to simulate, which hinders their use for policy optimization. To address this, we propose a new statistical framework that exploits machine learning techniques to accelerate policy optimization with costly ABMs. We first develop a statistical approach for sensitivity testing of the optimal policy, then leverage a reinforcement learning method for efficient policy optimization. We test this framework on the classic "Sugarscape" model, an ABM for resource harvesting. We show that our approach can quickly identify optimal and interpretable policies that improve upon baseline techniques, with insightful sensitivity and dynamic analyses that connect back to economic theory.

2602.17070 2026-02-20 stat.ME cs.AI

General sample size analysis for probabilities of causation: a delta method approach

Tianyuan Cheng, Ruirui Mao, Judea Pearl, Ang Li

Abstract

Probabilities of causation (PoCs), such as the probability of necessity and sufficiency (PNS), are important tools for decision making but are generally not point identifiable. Existing work has derived bounds for these quantities using combinations of experimental and observational data. However, there is very limited research on sample size analysis, namely, how many experimental and observational samples are required to achieve a desired margin of error. In this paper, we propose a general sample size framework based on the delta method. Our approach applies to settings in which the target bounds of PoCs can be expressed as finite minima or maxima of linear combinations of experimental and observational probabilities. Through simulation studies, we demonstrate that the proposed sample size calculations lead to stable estimation of these bounds.

2602.17052 2026-02-20 stat.ME econ.EM

Generative modeling for the bootstrap

Leon Tran, Ting Ye, Peng Ding, Fang Han

Comments 62 pages

Abstract

Generative modeling builds on and substantially advances the classical idea of simulating synthetic data from observed samples. This paper shows that this principle is not only natural but also theoretically well-founded for bootstrap inference: it yields statistically valid confidence intervals that apply simultaneously to both regular and irregular estimators, including settings in which Efron's bootstrap fails. In this sense, the generative modeling-based bootstrap can be viewed as a modern version of the smoothed bootstrap: it could mitigate the curse of dimensionality and remain effective in challenging regimes where estimators may lack root-$n$ consistency or a Gaussian limit.

2602.17043 2026-02-20 stat.AP

Quantifying the limits of human athletic performance: A Bayesian analysis of elite decathletes

Paul-Hieu V. Nguyen, James M. Smoliga, Benton Lindaman, Sameer K. Deshpande

Abstract

Because the decathlon tests many facets of athleticism, including sprinting, throwing, jumping, and endurance, many consider it to be the ultimate test of athletic ability. On this view, estimating the maximal decathlon score and understanding what it would take to achieve that score provides insight into the upper limits of human athletic potential. To this end, we develop a Bayesian composition model for forecasting how individual athletes perform in each of the 10 decathlon events over time. Besides capturing potential non-linear temporal trends in performance, our model carefully captures the dependence between performance in an event and all preceding events. Using our model, we can simulate and evaluate the distribution of the maximal possible scores and identify profiles of athletes who could realistically attain scores approaching this limit.

2602.17034 2026-02-20 stat.AP

Using Time Series Measures to Explore Family Planning Survey Data and Model-based Estimates

Oluwayomi Akinfenwa, Niamh Cahill, Catherine Hurley

Abstract

Family planning is a global development priority and a key indicator of reproductive health. Monitoring progress is challenged by gaps in survey data across countries. The United Nations Population Division addresses this with the Family Planning Estimation Model (FPEM), a Bayesian hierarchical time series model producing annual estimates of modern contraceptive use while sharing information across countries and regions. This paper evaluates how well FPEM estimates align with survey data using time series diagnostic indices from the wdiexplorer R package, which account for countries nested within sub-regions. Visualisation of survey data, modelled trajectories, and diagnostics enables assessment of model performance, highlighting where trends align and where discrepancies occur.

2602.17015 2026-02-20 cs.AI stat.AP

Cinder: A fast and fair matchmaking system

Saurav Pal

Abstract

A fair and fast matchmaking system is an important component of modern multiplayer online games, directly impacting player retention and satisfaction. However, creating fair matches between lobbies (pre-made teams) of heterogeneous skill levels presents a significant challenge. Matching based simply on average team skill metrics, such as mean or median rating or rank, often results in unbalanced and one-sided games, particularly when skill distributions are wide or skewed. This paper introduces Cinder, a two-stage matchmaking system designed to provide fast and fair matches. Cinder first employs a rapid preliminary filter by comparing the "non-outlier" skill range of lobbies using the Ruzicka similarity index. Lobbies that pass this initial check are then evaluated using a more precise fairness metric. This second stage involves mapping player ranks to a non-linear set of skill buckets, generated from an inverted normal distribution, to provide higher granularity at average skill levels. The fairness of a potential match is then quantified using the Kantorovich distance on the lobbies' sorted bucket indices, producing a "Sanction Score." We demonstrate the system's viability by analyzing the distribution of Sanction Scores from 140 million simulated lobby pairings, providing a robust foundation for fair matchmaking thresholds.
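
The two named measures can be sketched as follows, using their textbook definitions. How Cinder encodes a lobby's "non-outlier" skill range as a vector, how the skill buckets are generated, and how the Sanction Score is scaled are not given in the abstract, so the inputs here (non-negative vectors, and integer bucket indices for lobbies of equal size) are assumptions made for illustration.

```python
def ruzicka_similarity(a, b):
    """Ruzicka (weighted Jaccard) similarity of two equal-length non-negative vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have equal length")
    denom = sum(max(x, y) for x, y in zip(a, b))
    if denom == 0:
        return 1.0  # two all-zero vectors are treated as identical
    return sum(min(x, y) for x, y in zip(a, b)) / denom


def kantorovich_distance(buckets_a, buckets_b):
    """Kantorovich (Wasserstein-1) distance between two equal-size lobbies.

    For equally weighted points on a line, the optimal transport plan pairs
    the sorted values, so the distance is the mean absolute difference.
    """
    if len(buckets_a) != len(buckets_b):
        raise ValueError("this sketch assumes equal lobby sizes")
    pairs = zip(sorted(buckets_a), sorted(buckets_b))
    return sum(abs(x - y) for x, y in pairs) / len(buckets_a)
```

Under these assumptions, a two-stage check would first discard lobby pairs whose Ruzicka similarity falls below some threshold, and only then compute the more expensive distance on the survivors.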

2602.16992 2026-02-20 stat.ME

Modeling Multivariate Missingness with Tree Graphs and Conjugate Odds

Daniel Suen, Yen-Chi Chen

Comments 82 pages, 15 figures

Abstract

In this paper, we analyze a specific class of missing not at random (MNAR) assumptions called tree graphs, extending the work on pattern graphs. We build off previous work by introducing the idea of a conjugate odds family in which certain parametric models on the selection odds can preserve the data distribution family across all missing data patterns. Under a conjugate odds family and a tree graph assumption, we are able to model the full data distribution elegantly in the sense that for the observed data, we obtain a model that is conjugate from the complete-data, and for the missing entries, we create a simple imputation model. In addition, we investigate the problem of graph selection, sensitivity analysis, and statistical inference. Using both simulations and real data, we illustrate the applicability of our method.

2602.16970 2026-02-20 stat.AP

Temperature and Respiratory Emergency Department Visits: A Mediation Analysis with Ambient Ozone Exposure

Chen Li, Thomas W. Hsiao, Stefanie Ebelt, Rebecca H. Zhang, Howard H. Chang

Abstract

High temperatures are associated with adverse respiratory health outcomes and increases in ambient air pollution. Limited research has quantified air pollution's mediating role in the relationship between temperature and respiratory morbidity, such as emergency department (ED) visits. In this study, we conducted a causal mediation analysis to decompose the total effect of daily temperature on respiratory ED visits in Los Angeles from 2005 to 2016. We focused on ambient ozone as a mediator because its precursors and formation are directly driven by sunlight and temperature. We estimated natural direct, indirect, and total effects on the relative risk scale across deciles of temperature exposure compared to the median. We utilized Bayesian additive regression trees (BART) to flexibly characterize the nonlinear relationship between temperature and ozone and quantified uncertainty via posterior prediction and the Bayesian bootstrap. Our results showed that ozone partially mediated the association between high temperatures and respiratory ED visits, particularly at moderately high temperatures. We also validated our modeling approach through simulation studies. This study extends the existing literature by considering acute respiratory morbidity and employing a flexible modeling approach, offering new insights into the mechanisms underlying temperature-related health risks.

2602.16923 2026-02-20 stat.ML cs.LG

Poisson-MNL Bandit: Nearly Optimal Dynamic Joint Assortment and Pricing with Decision-Dependent Customer Arrivals

Junhui Cai, Ran Chen, Qitao Huang, Linda Zhao, Wu Zhu

Abstract

We study dynamic joint assortment and pricing where a seller updates decisions at regular accounting/operating intervals to maximize the cumulative per-period revenue over a horizon $T$. In many settings, assortment and prices affect not only what an arriving customer buys but also how many customers arrive within the period, whereas classical multinomial logit (MNL) models treat arrivals as fixed, potentially leading to suboptimal decisions. We propose a Poisson-MNL model that couples a contextual MNL choice model with a Poisson arrival model whose rate depends on the offered assortment and prices. Building on this model, we develop an efficient algorithm PMNL based on the idea of upper confidence bound (UCB). We establish its (near) optimality by proving a non-asymptotic regret bound of order $\sqrt{T\log{T}}$ and a matching lower bound (up to $\log T$). Simulation studies underscore the importance of accounting for the dependency of arrival rates on assortment and pricing: PMNL effectively learns customer choice and arrival models and provides joint assortment-pricing decisions that outperform others that assume fixed arrival rates.
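
A minimal sketch of how an MNL choice model and a Poisson arrival model combine into an expected per-period revenue is given below. The MNL probabilities use the standard form with a zero-utility no-purchase option. The arrival-rate link `log_linear_rate` is purely hypothetical, since the abstract does not specify how the rate depends on assortment and prices, and none of this is the PMNL learning algorithm itself.

```python
import math


def mnl_choice_probs(utilities):
    """MNL purchase probabilities; the no-purchase option has utility 0."""
    weights = [math.exp(u) for u in utilities]
    denom = 1.0 + sum(weights)
    return [w / denom for w in weights]


def expected_period_revenue(prices, utilities, arrival_rate):
    """Expected revenue in one period: mean Poisson arrivals times revenue per arrival."""
    probs = mnl_choice_probs(utilities)
    per_customer = sum(p * q for p, q in zip(prices, probs))
    return arrival_rate * per_customer


def log_linear_rate(base, attractiveness, utilities):
    """Hypothetical arrival-rate link: the rate rises with total assortment appeal."""
    appeal = math.log1p(sum(math.exp(u) for u in utilities))
    return math.exp(base + attractiveness * appeal)
```

With `attractiveness` set to zero the rate no longer depends on the decision, which corresponds to the fixed-arrival assumption that the abstract argues against. A positive value makes the revenue-maximising assortment and prices depend on their effect on traffic as well as on conversion.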

2602.16914 2026-02-20 stat.ME cs.LG stat.ML

A statistical perspective on transformers for small longitudinal cohort data

Kiana Farhadyar, Maren Hackenberg, Kira Ahrens, Charlotte Schenk, Bianca Kollmann, Oliver Tüscher, Klaus Lieb, Michael M. Plichta, Andreas Reif, Raffael Kalisch, Martin Wolkewitz, Moritz Hess, Harald Binder

Abstract

Modeling of longitudinal cohort data typically involves complex temporal dependencies between multiple variables. There, the transformer architecture, which has been highly successful in language and vision applications, allows us to account for the fact that the most recently observed time points in an individual's history may not always be the most important for the immediate future. This is achieved by assigning attention weights to observations of an individual based on a transformation of their values. One reason why these ideas have not yet been fully leveraged for longitudinal cohort data is that typically, large datasets are required. Therefore, we present a simplified transformer architecture that retains the core attention mechanism while reducing the number of parameters to be estimated, to be more suitable for small datasets with few time points. Guided by a statistical perspective on transformers, we use an autoregressive model as a starting point and incorporate attention as a kernel-based operation with temporal decay, where aggregation of multiple transformer heads, i.e. different candidate weighting schemes, is expressed as accumulating evidence on different types of underlying characteristics of individuals. This also enables a permutation-based statistical testing procedure for identifying contextual patterns. In a simulation study, the approach is shown to recover contextual dependencies even with a small number of individuals and time points. In an application to data from a resilience study, we identify temporal patterns in the dynamics of stress and mental health. This indicates that properly adapted transformers can not only achieve competitive predictive performance, but also uncover complex context dependencies in small data settings.

2602.16876 2026-02-20 cs.LG stat.ML

ML-driven detection and reduction of ballast information in multi-modal datasets

Yaroslav Solovko

Comments 20 pages, 27 figures, 10 tables

Abstract

Modern datasets often contain ballast, that is, redundant or low-utility information that increases dimensionality, storage requirements, and computational cost without contributing meaningful analytical value. This study introduces a generalized, multimodal framework for ballast detection and reduction across structured, semi-structured, unstructured, and sparse data types. Using diverse datasets, entropy, mutual information, Lasso, SHAP, PCA, topic modelling, and embedding analysis are applied to identify and eliminate ballast features. A novel Ballast Score is proposed to integrate these signals into a unified, cross-modal pruning strategy. Experimental results demonstrate that significant portions of the feature space, often exceeding 70% in sparse or semi-structured data, can be pruned with minimal loss of, or even improved, classification performance, along with substantial reductions in training time and memory footprint. The framework reveals distinct ballast typologies (e.g. statistical, semantic, infrastructural), and offers practical guidance for leaner, more efficient machine learning pipelines.

2602.16857 2026-02-20 physics.comp-ph physics.data-an physics.geo-ph stat.ML

Distillation and Interpretability of Ensemble Forecasts of ENSO Phase using Entropic Learning

Michael Groom, Davide Bassetti, Illia Horenko, Terence J. O'Kane

Abstract

This paper introduces a distillation framework for an ensemble of entropy-optimal Sparse Probabilistic Approximation (eSPA) models, trained exclusively on satellite-era observational and reanalysis data to predict ENSO phase up to 24 months in advance. While eSPA ensembles yield state-of-the-art forecast skill, they are harder to interpret than individual eSPA models. We show how to compress the ensemble into a compact set of "distilled" models by aggregating the structure of only those ensemble members that make correct predictions. This process yields a single, diagnostically tractable model for each forecast lead time that preserves forecast performance while also enabling diagnostics that are impractical to implement on the full ensemble. An analysis of the regime persistence of the distilled model "superclusters", as well as cross-lead clustering consistency, shows that the discretised system accurately captures the spatiotemporal dynamics of ENSO. By considering the effective dimension of the feature importance vectors, the complexity of the input space required for correct ENSO phase prediction is shown to peak when forecasts must cross the boreal spring predictability barrier. Spatial importance maps derived from the feature importance vectors are introduced to identify where predictive information resides in each field and are shown to include known physical precursors at certain lead times. Case studies of key events are also presented, showing how fields reconstructed from distilled model centroids trace the evolution from extratropical and inter-basin precursors to the mature ENSO state. Overall, the distillation framework enables a rigorous investigation of long-range ENSO predictability that complements real-time data-driven operational forecasts.

2602.16849 2026-02-20 cs.LG math.OC stat.ML

On the Mechanism and Dynamics of Modular Addition: Fourier Features, Lottery Ticket, and Grokking

Jianliang He, Leda Wang, Siyu Chen, Zhuoran Yang

Abstract

We present a comprehensive analysis of how two-layer neural networks learn features to solve the modular addition task. Our work provides a full mechanistic interpretation of the learned model and a theoretical explanation of its training dynamics. While prior work has identified that individual neurons learn single-frequency Fourier features and phase alignment, it does not fully explain how these features combine into a global solution. We bridge this gap by formalizing a diversification condition that emerges during training when overparametrized, consisting of two parts: phase symmetry and frequency diversification. We prove that these properties allow the network to collectively approximate a flawed indicator function on the correct logic for the modular addition task. While individual neurons produce noisy signals, the phase symmetry enables a majority-voting scheme that cancels out noise, allowing the network to robustly identify the correct sum. Furthermore, we explain the emergence of these features under random initialization via a lottery ticket mechanism. Our gradient flow analysis proves that frequencies compete within each neuron, with the "winner" determined by its initial spectral magnitude and phase alignment. From a technical standpoint, we provide a rigorous characterization of the layer-wise phase coupling dynamics and formalize the competitive landscape using the ODE comparison lemma. Finally, we use these insights to demystify grokking, characterizing it as a three-stage process involving memorization followed by two generalization phases, driven by the competition between loss minimization and weight decay.

2602.16830 2026-02-20 stat.AP cs.LG

The Impact of Formations on Football Matches Using Double Machine Learning. Is it worth parking the bus?

Genís Ruiz-Menárguez, Llorenç Badiella

Comments 17 pages, 5 figures, 3 tables

Abstract

This study addresses a central tactical dilemma for football coaches: whether to employ a defensive strategy, colloquially known as "parking the bus", or a more offensive one. Using an advanced Double Machine Learning (DML) framework, this project provides a robust and interpretable tool to estimate the causal impact of different formations on key match outcomes such as goal difference, possession, corners, and disciplinary actions. Leveraging a dataset of over 22,000 matches from top European leagues, formations were categorized into six representative types based on tactical structure and expert consultation. A major methodological contribution lies in the adaptation of DML to handle categorical treatments, specifically formation combinations, through a novel matrix-based residualization process, allowing for a detailed estimation of formation-versus-formation effects that can inform a coach's tactical decision-making. Results show that while offensive formations like 4-3-3 and 4-2-3-1 offer modest statistical advantages in possession and corners, their impact on goals is limited. Furthermore, no evidence supports the idea that defensive formations, commonly associated with parking the bus, increase a team's winning potential. Additionally, red cards appear unaffected by formation choice, suggesting other behavioral factors dominate. Although this approach does not fully capture all aspects of playing style or team strength, it provides a valuable framework for coaches to analyze tactical efficiency and sets a precedent for future research in sports analytics.

2602.16789 2026-02-20 stat.ME math.ST stat.TH

First versus full or first versus last: U-statistic change-point tests under fixed and local alternatives

Herold Dehling, Daniel Vogel, Martin Wendler

Comments 56 pages: 22 pages main document, 34 pages appendices (containing proofs), one reference list at the end

Abstract

The use of U-statistics in the change-point context has received considerable attention in the literature. We compare two approaches to constructing CUSUM-type change-point tests, which we call the first-vs-full and first-vs-last approaches. Both have been pursued by different authors. The question naturally arises whether the two tests differ substantially and, if so, which of them is better in which data situation. In large samples, both tests are similar: they are asymptotically equivalent under the null hypothesis and under sequences of local alternatives. In small samples, there may be quite noticeable differences, which is in line with their different asymptotic behavior under fixed alternatives. We derive a simple criterion for deciding which test is more powerful. We examine the examples of Gini's mean difference, the sample variance, and Kendall's tau in detail. In particular, when testing for changes in scale by Gini's mean difference, we show that the first-vs-full approach has higher power if and only if the scale changes from a smaller to a larger value -- regardless of the population distribution or the location of the change. The asymptotic derivations are under weak dependence. The results are illustrated by numerical simulations and data examples.

2602.16784 2026-02-20 cs.LG cs.CL stat.ME

Omitted Variable Bias in Language Models Under Distribution Shift

Victoria Lin, Louis-Philippe Morency, Eli Ben-Michael

Abstract

Despite their impressive performance on a wide variety of tasks, modern language models remain susceptible to distribution shifts, exhibiting brittle behavior when evaluated on data that differs in distribution from their training data. In this paper, we describe how distribution shifts in language models can be separated into observable and unobservable components, and we discuss how established approaches for dealing with distribution shift address only the former. Importantly, we identify that the resulting omitted variable bias from unobserved variables can compromise both evaluation and optimization in language models. To address this challenge, we introduce a framework that maps the strength of the omitted variables to bounds on the worst-case generalization performance of language models under distribution shift. In empirical experiments, we show that using these bounds directly in language model evaluation and optimization provides more principled measures of out-of-distribution performance, improves true out-of-distribution performance relative to standard distribution shift adjustment methods, and further enables inference about the strength of the omitted variables when target distribution labels are available.

2602.16177 2026-02-20 stat.ML cs.AI cs.LG

Conjugate Learning Theory: Uncovering the Mechanisms of Trainability and Generalization in Deep Neural Networks

Binchuan Qi

Abstract

In this work, we propose a notion of practical learnability grounded in finite sample settings, and develop a conjugate learning theoretical framework based on convex conjugate duality to characterize this learnability property. Building on this foundation, we demonstrate that training deep neural networks (DNNs) with mini-batch stochastic gradient descent (SGD) achieves global optima of empirical risk by jointly controlling the extreme eigenvalues of a structure matrix and the gradient energy, and we establish a corresponding convergence theorem. We further elucidate the impact of batch size and model architecture (including depth, parameter count, sparsity, skip connections, and other characteristics) on non-convex optimization. Additionally, we derive a model-agnostic lower bound for the achievable empirical risk, theoretically demonstrating that data determines the fundamental limit of trainability. On the generalization front, we derive deterministic and probabilistic bounds on generalization error based on generalized conditional entropy measures. The former explicitly delineates the range of generalization error, while the latter characterizes the distribution of generalization error relative to the deterministic bounds under independent and identically distributed (i.i.d.) sampling conditions. Furthermore, these bounds explicitly quantify the influence of three key factors: (i) information loss induced by irreversibility in the model, (ii) the maximum attainable loss value, and (iii) the generalized conditional entropy of features with respect to labels. Moreover, they offer a unified theoretical lens for understanding the roles of regularization, irreversible transformations, and network depth in shaping the generalization behavior of deep neural networks. Extensive experiments validate all theoretical predictions, confirming the framework's correctness and consistency.

2602.15438 2026-02-20 cs.LG cs.AI stat.ML

Logit Distance Bounds Representational Similarity

Beatrix M. G. Nielsen, Emanuele Marconato, Luigi Gresele, Andrea Dittadi, Simon Buchholz

Abstract

For a broad family of discriminative models that includes autoregressive language models, identifiability results imply that if two models induce the same conditional distributions, then their internal representations agree up to an invertible linear transformation. We ask whether an analogous conclusion holds approximately when the distributions are close instead of equal. Building on the observation of Nielsen et al. (2025) that closeness in KL divergence need not imply high linear representational similarity, we study a distributional distance based on logit differences and show that closeness in this distance does yield linear similarity guarantees. Specifically, we define a representational dissimilarity measure based on the models' identifiability class and prove that it is bounded by the logit distance. We further show that, when model probabilities are bounded away from zero, KL divergence upper-bounds logit distance; yet the resulting bound fails to provide nontrivial control in practice. As a consequence, KL-based distillation can match a teacher's predictions while failing to preserve linear representational properties, such as linear-probe recoverability of human-interpretable concepts. In distillation experiments on synthetic and image datasets, logit-distance distillation yields students with higher linear representational similarity and better preservation of the teacher's linearly recoverable concepts.

2602.12577 2026-02-20 stat.ME stat.AP

Conjugating Variational Inference for Large Mixed Multinomial Logit Models and Consumer Choice

Weiben Zhang, Ruben Loaiza-Maya, Michael Stanley Smith, Worapree Maneesoonthorn

Abstract

Heterogeneity in multinomial choice data is often accounted for using logit models with random coefficients. Such models are called "mixed", but they can be difficult to estimate for large datasets. We review current Bayesian variational inference (VI) methods that can do so, and propose a new VI method that scales more effectively. The key innovation is a step that updates efficiently a Gaussian approximation to the conditional posterior of the random coefficients, addressing a bottleneck within the variational optimization. The approach is used to estimate three types of mixed logit models: standard, nested and bundle variants. We first demonstrate the improvement of our new approach over existing VI methods using simulations. Our method is then applied to a large scanner panel dataset of pasta choice. We find consumer response to price and promotion variables exhibits substantial heterogeneity at the grocery store and product levels. Store size, premium and geography are found to be drivers of store level estimates of price elasticities. Extension to bundle choice with pasta sauce improves model accuracy further. Predictions from the mixed models are more accurate than those from fixed coefficients equivalents, and our VI method provides insights in circumstances which other methods find challenging.

2601.20591 2026-02-20 math.DS stat.AP

Exploring Memory Effects: Sparse Identification in Vector-Borne Diseases

Dimitri Breda, Muhammad Tanveer, Jianhong Wu, Xue Zhang

Abstract

Predicting the human burden of vector-borne diseases from limited surveillance data remains a major challenge, particularly in the presence of nonlinear transmission dynamics and delayed effects arising from vector ecology and human behavior. We develop a data-driven framework based on an extension of Sparse Identification of Nonlinear Dynamics (SINDy) to systems with distributed memory, enabling discovery of transmission mechanisms directly from time series data. Using severe fever with thrombocytopenia syndrome (SFTS) as a case study, we show that this approach can uncover key features of tick-borne disease dynamics using only human incidence and local temperature data, without imposing predefined assumptions on human case reporting. We further demonstrate that predictive performance is substantially enhanced when the data-driven model is coupled with mechanistic representations of tick-host transmission pathways informed by empirical studies. The framework supports systematic sensitivity analysis of memory kernels and behavioral parameters, identifying those most influential for prediction accuracy. Although the approach prioritizes predictive accuracy over mechanistic transparency, it yields sparse, interpretable integral representations suitable for epidemiological forecasting. This hybrid methodology provides a scalable strategy for forecasting vector-borne disease risk and informing public health decision-making under data limitations.

2509.22860 2026-02-20 math.OC cs.DC cs.LG stat.ML

Ringleader ASGD: The First Asynchronous SGD with Optimal Time Complexity under Data Heterogeneity

Artavazd Maranjyan, Peter Richtárik

Journal ref
The Fourteenth International Conference on Learning Representations (ICLR 2026)
Abstract

Asynchronous stochastic gradient methods are central to scalable distributed optimization, particularly when devices differ in computational capabilities. Such settings arise naturally in federated learning, where training takes place on smartphones and other heterogeneous edge devices. In addition to varying computation speeds, these devices often hold data from different distributions. However, existing asynchronous SGD methods struggle in such heterogeneous settings and face two key limitations. First, many rely on unrealistic assumptions of similarity across workers' data distributions. Second, methods that relax this assumption still fail to achieve theoretically optimal performance under heterogeneous computation times. We introduce Ringleader ASGD, the first asynchronous SGD algorithm that attains the theoretical lower bounds for parallel first-order stochastic methods in the smooth nonconvex regime, thereby achieving optimal time complexity under data heterogeneity and without restrictive similarity assumptions. Our analysis further establishes that Ringleader ASGD remains optimal under arbitrary and even time-varying worker computation speeds, closing a fundamental gap in the theory of asynchronous optimization.

2506.20749 2026-02-20 econ.EM stat.ME

Analytic inference with two-way clustering

Laurent Davezies, Xavier D'Haultfœuille, Yannick Guyonvarch

Comments 69 pages, supplement starts at p.43

Abstract

This paper studies analytic inference along two dimensions of clustering. In such setups, the commonly used approach has two drawbacks. First, the corresponding variance estimator is not necessarily positive. Second, inference is invalid in non-Gaussian regimes, namely when the estimator of the parameter of interest is not asymptotically Gaussian. We consider a simple fix that addresses both issues. In Gaussian regimes, the corresponding tests are asymptotically exact and equivalent to usual ones. Otherwise, the new tests are asymptotically conservative. We also establish their uniform validity over a certain class of data generating processes. Independently of our tests, we highlight potential issues with multiple testing and nonlinear estimators under two-way clustering. Finally, we compare our approach with existing ones through simulations.

2505.00292 2026-02-20 math.ST eess.SP stat.ME stat.TH

Offline changepoint localization using a matrix of conformal p-values

Sanjit Dandapanthula, Aaditya Ramdas

Abstract

Changepoint localization is the problem of estimating the index at which a change occurred in the data generating distribution of an ordered list of data, or declaring that no change occurred. We present the broadly applicable MCP algorithm, which uses a matrix of conformal p-values to produce a confidence interval for a (single) changepoint under the mild assumption that the pre-change and post-change distributions are each exchangeable. We prove a novel conformal Neyman-Pearson lemma, motivating practical classifier-based choices for our conformal score function. Finally, we exemplify the MCP algorithm on a variety of synthetic and real-world datasets, including using black-box pre-trained classifiers to detect changes in sequences of images, text, and accelerometer data.

2412.00926 2026-02-20 stat.ME

A sensitivity analysis approach to principal stratification with a continuous longitudinal intermediate outcome: Applications to a cohort stepped wedge trial

Lei Yang, Michael J. Daniels, Fan Li

Abstract

Causal inference in the presence of intermediate variables is a challenging problem in many applications. Principal stratification (PS) provides a framework to estimate principal causal effects (PCE) in such settings. However, existing PS methods primarily focus on settings with binary intermediate variables. We propose a novel approach to estimate PCE with continuous intermediate variables in the context of stepped wedge cluster randomized trials (SW-CRTs). Our method leverages the time-varying treatment assignment in SW-CRTs to calibrate sensitivity parameters and identify the PCE under realistic assumptions. We demonstrate the application of our approach using data from a cohort SW-CRT evaluating the effect of a crowdsourcing intervention on HIV testing uptake among men who have sex with men in China, with social norms as a continuous intermediate variable. The proposed methodology expands the scope of PS to accommodate continuous variables and provides a practical tool for causal inference in SW-CRTs.

2409.20250 2026-02-20 stat.ML cs.LG

Input-Label Correlation Governs a Linear-to-Nonlinear Transition in Random Features under Spiked Covariance

Samet Demir, Zafer Dogan

Comments 30 pages, 7 figures

Abstract

Random feature models (RFMs), two-layer networks with a randomly initialized fixed first layer and a trained linear readout, are among the simplest nonlinear predictors. Prior asymptotic analyses in the proportional high-dimensional regime show that, under isotropic data, RFMs reduce to noisy linear models and offer no advantage over classical linear methods such as ridge regression. Yet RFMs frequently outperform linear baselines on structured real data. We show that this tension is explained by a correlation-driven phase transition: under spiked-covariance designs, the interaction between anisotropy and input-label correlation determines whether the RFM behaves as an effectively linear predictor or exhibits genuinely nonlinear gains. Concretely, we establish a universality principle under anisotropy and characterize the RFM generalization error via an equivalent noisy polynomial model. The effective degree of this polynomial, equivalently, which Hermite orders of the activation survive, is governed by the strength of input-label correlation, yielding an explicit boundary in the correlation-spike-magnitude plane. Below the boundary, the RFM collapses to a linear surrogate and can underperform strong linear baselines; above it, higher-order terms persist and the RFM achieves a clear nonlinear advantage. Numerical simulations and real-data experiments corroborate the theory and delineate the transition between these two regimes.

2403.11332 2026-02-20 cs.LG cs.SI stat.ME

Graph Machine Learning based Doubly Robust Estimator for Network Causal Effects

Seyedeh Baharan Khatami, Harsh Parikh, Haowei Chen, Sudeepa Roy, Babak Salimi

Journal ref
Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, PMLR 258:4366-4374, 2025
Abstract

We address the challenge of inferring causal effects in social network data. Such data pose challenges due to interference -- where a unit's outcome is affected by neighbors' treatments -- and network-induced confounding factors. While there is extensive literature on estimating causal effects in social network setups, most of it makes prior assumptions about the form of network-induced confounding mechanisms. Such strong assumptions rarely hold, especially in high-dimensional networks. We propose a novel methodology that combines graph machine learning approaches with the double machine learning framework to enable accurate and efficient estimation of direct and peer effects using a single observational social network. We demonstrate the semiparametric efficiency of our proposed estimator under mild regularity conditions, allowing for consistent uncertainty quantification. We demonstrate that our method is accurate, robust, and scalable via an extensive simulation study. We use our method to investigate the impact of Self-Help Group participation on financial risk tolerance.