arXivDaily: arXiv daily academic digest, updated Monday through Friday
2602.18401 2026-02-23 cs.LG cs.AI q-bio.NC stat.ML

Leakage and Second-Order Dynamics Improve Hippocampal RNN Replay

Josue Casco-Rodriguez, Nanda H. Krishna, Richard G. Baraniuk

Abstract

Biological neural networks (like the hippocampus) can internally generate "replay" resembling stimulus-driven activity. Recent computational models of replay use noisy recurrent neural networks (RNNs) trained to path-integrate. Replay in these networks has been described as Langevin sampling, but new modifiers of noisy RNN replay have surpassed this description. We re-examine noisy RNN replay as sampling to understand or improve it in three ways: (1) Under simple assumptions, we prove that the gradients replay activity should follow are time-varying and difficult to estimate, but readily motivate the use of hidden state leakage in RNNs for replay. (2) We confirm that hidden state adaptation (negative feedback) encourages exploration in replay, but show that it incurs non-Markov sampling that also slows replay. (3) We propose the first model of temporally compressed replay in noisy path-integrating RNNs through hidden state momentum, connect it to underdamped Langevin sampling, and show that, together with adaptation, it counters slowness while maintaining exploration. We verify our findings via path-integration of 2D triangular and T-maze paths and of high-dimensional paths of synthetic rat place cell activity.

2602.18396 2026-02-23 cs.LG eess.SP math.PR stat.AP stat.ML

PRISM-FCP: Byzantine-Resilient Federated Conformal Prediction via Partial Sharing

Ehsan Lari, Reza Arablouei, Stefan Werner

Comments 13 pages, 5 figures, 2 tables, Submitted to IEEE Transactions on Signal Processing (TSP)

Abstract

We propose PRISM-FCP (Partial shaRing and robust calIbration with Statistical Margins for Federated Conformal Prediction), a Byzantine-resilient federated conformal prediction framework that utilizes partial model sharing to improve robustness against Byzantine attacks during both model training and conformal calibration. Existing approaches address adversarial behavior only in the calibration stage, leaving the learned model susceptible to poisoned updates. In contrast, PRISM-FCP mitigates attacks end-to-end. During training, clients partially share updates by transmitting only $M$ of $D$ parameters per round. This attenuates the expected energy of an adversary's perturbation in the aggregated update by a factor of $M/D$, yielding lower mean-square error (MSE) and tighter prediction intervals. During calibration, clients convert nonconformity scores into characterization vectors, compute distance-based maliciousness scores, and downweight or filter suspected Byzantine contributions before estimating the conformal quantile. Extensive experiments on both synthetic data and the UCI Superconductivity dataset demonstrate that PRISM-FCP maintains nominal coverage guarantees under Byzantine attacks and avoids the interval inflation observed in standard FCP, all with reduced communication, providing a robust and communication-efficient approach to federated uncertainty quantification.
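The $M/D$ attenuation mentioned in the abstract can be checked on a toy case. The sketch below is not the authors' implementation: it assumes the $M$ shared coordinates are chosen uniformly at random and independently of the perturbation (the abstract does not state the selection scheme), and the function name and example vector are hypothetical. It averages the squared norm of a masked perturbation over every possible mask, using exact fractions.

```python
from fractions import Fraction
from itertools import combinations


def expected_masked_energy(perturbation, m):
    """Average squared norm of the perturbation over all masks that keep m coordinates.

    Assumption: each size-m subset of coordinates is equally likely to be shared.
    """
    d = len(perturbation)
    masks = list(combinations(range(d), m))
    total = sum(
        sum(Fraction(perturbation[i]) ** 2 for i in mask) for mask in masks
    )
    return total / len(masks)


# Hypothetical adversarial perturbation with D = 4 parameters, M = 2 shared per round.
perturbation = [3, -1, 2, 5]
full_energy = sum(Fraction(x) ** 2 for x in perturbation)
ratio = expected_masked_energy(perturbation, 2) / full_energy
```

Under this uniform-mask assumption every coordinate is kept in the same share of masks, so `ratio` equals $M/D$ regardless of the perturbation's direction.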

2602.18369 2026-02-23 stat.AP stat.ME

Hidden multistate models to study multimorbidity trajectories

Valentina Manzoni, Francesca Ieva, Amaia Calderón-Larrañaga, Davide Liborio Vetrano, Caterina Gregorio

Abstract

Multimorbidity in older adults is common, heterogeneous, and highly dynamic, and it is strongly associated with disability and increased healthcare utilization. However, existing approaches to studying multimorbidity trajectories are largely descriptive or rely on discrete-time models, which struggle to handle irregular observation intervals and right-censoring. We developed a continuous-time hidden multistate modeling framework to capture transitions among latent multimorbidity patterns while accounting for interval censoring and misclassification. A simulation study compared alternative model specifications under varying sample sizes and follow-up schemes, and the best-performing specification was applied to longitudinal data from the Swedish National study on Aging and Care-Kungsholmen (SNAC-K), including 2,716 multimorbid participants followed for up to 18 years. Simulation results showed that hidden multistate models substantially reduced bias in transition hazard estimates compared to non-hidden models, with fully time-inhomogeneous models outperforming piecewise approximations. Application to SNAC-K confirmed the feasibility and practical utility of this framework, enabling identification of risk factors for accelerated progression toward complex multimorbidity and revealing a gradient of mortality risk across patterns. Continuous-time hidden multistate models provide a robust alternative to traditional approaches, supporting individualized predictions and informing targeted interventions and secondary prevention strategies for multimorbidity in aging populations.

2602.18277 2026-02-23 cs.LG cs.AI stat.ML

PRISM: Parallel Reward Integration with Symmetry for MORL

Finn van der Knaap, Kejiang Qian, Zheng Xu, Fengxiang He

Abstract

This work studies heterogeneous Multi-Objective Reinforcement Learning (MORL), where objectives can differ sharply in temporal frequency. Such heterogeneity allows dense objectives to dominate learning, while sparse long-horizon rewards receive weak credit assignment, leading to poor sample efficiency. We propose a Parallel Reward Integration with Symmetry (PRISM) algorithm that enforces reflectional symmetry as an inductive bias in aligning reward channels. PRISM introduces ReSymNet, a theory-motivated model that reconciles temporal-frequency mismatches across objectives, using residual blocks to learn a scaled opportunity value that accelerates exploration while preserving the optimal policy. We also propose SymReg, a reflectional equivariance regulariser that enforces agent mirroring and constrains policy search to a reflection-equivariant subspace. This restriction provably reduces hypothesis complexity and improves generalisation. Across MuJoCo benchmarks, PRISM consistently outperforms both a sparse-reward baseline and an oracle trained with full dense rewards, improving Pareto coverage and distributional balance: it achieves hypervolume gains exceeding 100% over the baseline and up to 32% over the oracle. The code is at https://github.com/EVIEHub/PRISM.

2602.18271 2026-02-23 stat.ME

Two-Stage Multiple Test Procedures Controlling False Discovery Rate with auxiliary variable and their Application to Set4Delta Mutant Data

Seohwa Hwang, Mark Louie Ramos, DoHwan Park, Junyong Park, Johan Lim, Erin Green

Comments 24 pages, 5 figures

Abstract

In this paper, we present novel methodologies that incorporate auxiliary variables for multiple hypotheses testing related to the main point of interest while effectively controlling the false discovery rate. When dealing with multiple tests concerning the primary variable of interest, researchers can use auxiliary variables to set preconditions for the significance of primary variables, thereby enhancing test efficacy. Depending on the auxiliary variable's role, we propose two approaches: one terminates testing of the primary variable if it does not meet predefined conditions, and the other adjusts the evaluation criteria based on the auxiliary variable. Employing the copula method, we elucidate the dependence between the auxiliary and primary variables by deriving their joint distribution from individual marginal distributions. Our numerical studies, compared with existing methods, demonstrate that the proposed methodologies effectively control the FDR and yield greater statistical power than previous approaches solely based on the primary variable. As an illustrative example, we apply our methods to the Set4$Δ$ mutant dataset. Our findings highlight the distinctions between our methodologies and traditional approaches, emphasising the potential advantages of our methods in introducing the auxiliary variable for selecting more genes.

2602.18242 2026-02-23 stat.OT

Reflections on the Future of Statistics Education in a Technological Era

Craig Alexander, Jennifer Gaskell, Vinny Davies

Abstract

Keeping pace with rapidly evolving technology is a key challenge in teaching statistics. To equip students with essential skills for the modern workplace, educators must integrate relevant technologies into the statistical curriculum where possible. University-level statistics education has experienced substantial technological change, particularly in the tools and practices that underpin teaching and learning. Statistical programming has become central to many courses, with R widely used and Python increasingly incorporated into statistics and data analytics programmes. Additionally, coding practices, database management, and machine learning now feature within some statistics curricula. Looking ahead, we anticipate a growing emphasis on artificial intelligence (AI), particularly the pedagogical implications of generative AI tools such as ChatGPT. In this article, we explore these technological developments and discuss strategies for their integration into contemporary statistics education.

2602.18214 2026-02-23 math.ST math-ph math.MP math.SP stat.TH

Quantitative concentration inequalities for the uniform approximation of the IDS

Max Kämper, Christoph Schumacher, Fabian Schwarzenberger, Ivan Veselic

Abstract

The integrated density of states (IDS) is a fundamental spectral quantity for quantum Hamiltonians modeling condensed matter systems, describing how densely energy levels are distributed. It can be interpreted as a volume-averaged spectral distribution. Hence, there are two equivalent definitions of the IDS related by the Pastur-Shubin formula: an operator-theoretic trace formula and a limit of normalized eigenvalue counting functions on finite volumes. We study a discrete random Schrödinger operator with bounded random potentials of finite-range correlations and prove a quantitative concentration inequality ensuring, with explicit high probability, that the empirical IDS (normalized eigenvalue counting function) uniformly approximates the abstract IDS trace formula within a prescribed error, thereby implying confidence regions for the IDS.

2602.18210 2026-02-23 stat.ME math.ST stat.TH

Semiparametric Uncertainty Quantification via Isotonized Posterior for Deconvolutions

Francesco Gili, Geurt Jongbloed

Abstract

We address the problem of uncertainty quantification for the deconvolution model \(Z = X + Y\), where \(X\) and \(Y\) are nonnegative random variables and the goal is to estimate the distribution \(F_0\) of the signal \(X\), supported on \([0,\infty)\), from observations where the noise distribution is known. Existing frequentist methods often produce confidence intervals for \(F_0(x)\) that depend on unknown nuisance parameters, such as the density of \(X\) and its derivative, which are difficult to estimate in practice. This paper introduces a novel and computationally efficient nonparametric Bayesian approach, based on projecting the posterior, to overcome this limitation. Our method leverages the solution \(p\) to a specific Volterra integral equation as in \cite{74}, which relates the cumulative distribution function (CDF) of the signal, \(F_0\), to the distribution of the observables. We place a Dirichlet Process prior directly on the distribution of the observed data \(Z\), yielding a simple, conjugate posterior. To ensure the resulting estimates for \(F_0\) are valid CDFs, we isotonize posterior draws by taking the Greatest Convex Majorant of the primitive of the posterior draws, defining what we term the Isotonic Inverse Posterior. We show that this framework yields posterior credible sets for \(F_0\) that are not only computationally fast to generate but also possess asymptotically correct frequentist coverage after a straightforward recalibration technique for the so-called Bayes Chernoff distribution introduced in \cite{54}. Our approach thus does not require the estimation of nuisance parameters to deliver uncertainty quantification for the parameter of interest \(F_0(x)\). The practical effectiveness and robustness of the method are demonstrated through a simulation study with various noise distributions for \(Y\).

2512.23211 2026-02-23 econ.EM math.ST stat.ME stat.TH

Nonparametric Identification of Demand without Exogenous Product Characteristics

Kirill Borusyak, Jiafeng Chen, Peter Hull, Lihua Lei

Abstract

We study identification of differentiated product demand from market-level data when product characteristics can be endogenous. Past work suggests nonparametric identification may be impossible: that is, in addition to standard price instruments, exogenous characteristic-based instruments are essentially necessary to identify sufficiently flexible demand models with standard index restrictions. We show, however, that price counterfactuals are nonparametrically identified using recentered instruments -- which combine exogenous price instruments with possibly endogenous product characteristics -- under a weaker index restriction and a new condition we term faithfulness. We argue that faithfulness, like the usual completeness condition for nonparametric instrumental variable identification, is best viewed as a technical requirement on the strength of identifying variation rather than a substantive economic or statistical restriction. We show the two conditions are closely related, though generally distinct. We conclude with several practical implications for the parametric estimation of demand counterfactuals.

2512.00315 2026-02-23 physics.soc-ph math.PR q-bio.PE stat.AP

Correlation-Weighted Communicability Curvature as a Structural Driver of Dengue Spread: A Bayesian Spatial Analysis of Recife (2015-2024)

Marcílio Ferreira dos Santos, Cleiton de Lima Ricardo, Andreza dos Santos Rodrigues de Melo

Comments 18 pages, 2 figures, tables. Accepted for publication

Journal ref Chaos, Solitons & Fractals Volume 208, Part 1, July 2026, 118089

Abstract

We investigate whether the structural connectivity of urban road networks helps explain dengue incidence in Recife, Brazil (2015-2024). For each neighborhood, we compute the average communicability curvature, a graph-theoretic measure capturing the ability of a locality to influence others through multiple network paths. We integrate this metric into Negative Binomial models, fixed-effects regressions, SAR/SAC spatial models, and a hierarchical INLA/BYM2 specification. Across all frameworks, curvature is the strongest and most stable predictor of dengue risk. In the BYM2 model, the structured spatial component collapses ($\phi \approx 0$), indicating that functional network connectivity explains nearly all spatial dependence typically attributed to adjacency-based CAR terms. The results show that dengue spread in Recife is driven less by geographic contiguity and more by network-mediated structural flows.

2510.13887 2026-02-23 eess.IV cs.AI cs.LG stat.ML

Incomplete Multi-view Clustering via Hierarchical Semantic Alignment and Cooperative Completion

Xiaojian Ding, Lin Zhao, Xian Li, Xiaoying Zhu

Comments 13 pages, conference paper. Accepted to the Thirty-ninth Conference on Neural Information Processing Systems (NeurIPS 2025)

Abstract

Incomplete multi-view data, where certain views are entirely missing for some samples, poses significant challenges for traditional multi-view clustering methods. Existing deep incomplete multi-view clustering approaches often rely on static fusion strategies or two-stage pipelines, leading to suboptimal fusion results and error propagation issues. To address these limitations, this paper proposes a novel incomplete multi-view clustering framework based on Hierarchical Semantic Alignment and Cooperative Completion (HSACC). HSACC achieves robust cross-view fusion through a dual-level semantic space design. In the low-level semantic space, consistency alignment is ensured by maximizing mutual information across views. In the high-level semantic space, adaptive view weights are dynamically assigned based on the distributional affinity between individual views and an initial fused representation, followed by weighted fusion to generate a unified global representation. Additionally, HSACC implicitly recovers missing views by projecting aligned latent representations into high-dimensional semantic spaces and jointly optimizes reconstruction and clustering objectives, enabling cooperative learning of completion and clustering. Experimental results demonstrate that HSACC significantly outperforms state-of-the-art methods on five benchmark datasets. Ablation studies validate the effectiveness of the hierarchical alignment and dynamic weighting mechanisms, while parameter analysis confirms the model's robustness to hyperparameter variations. The code is available at https://github.com/XiaojianDing/2025-NeurIPS-HSACC.

2510.01915 2026-02-23 stat.ME stat.ML

Predictively Oriented Posteriors

Yann McLatchie, Badr-Eddine Cherief-Abdellatif, David T. Frazier, Jeremias Knoblauch

Comments 37 (+36 supplementary) pages (excluding references); 11 (+4 supplementary) figures

Abstract

We advocate for a new statistical principle that combines the most desirable aspects of both parameter inference and density estimation. This leads us to the predictively oriented (PrO) posterior, which expresses uncertainty as a consequence of predictive ability. Doing so leads to inferences which predictively dominate both classical and generalised Bayes posterior predictive distributions: up to logarithmic factors, PrO posteriors converge to the predictively optimal model average. Whereas classical and generalised Bayes posteriors only achieve this rate if the model can recover the data-generating process, PrO posteriors adapt to the level of model misspecification. This means that they concentrate around the true model in the same way as Bayes and Gibbs posteriors if the model can recover the data-generating distribution, but do not concentrate in the presence of non-trivial forms of model misspecification. Instead, they stabilise towards a predictively optimal posterior whose degree of irreducible uncertainty admits an interpretation as the degree of model misspecification -- a sharp contrast to how Bayesian uncertainty and its existing extensions behave. Lastly, we show that PrO posteriors can be sampled from by evolving particles based on mean field Langevin dynamics, and verify the practical significance of our theoretical developments on a number of numerical examples.

2509.25753 2026-02-23 math.NA cs.CE cs.NA stat.CO

Quasi-Monte Carlo methods for uncertainty quantification of tumor growth modeled by a parametric semi-linear parabolic reaction-diffusion equation

Alexander D. Gilbert, Frances Y. Kuo, Dirk Nuyens, Graham Pash, Ian H. Sloan, Karen E. Willcox

Abstract

We study the application of a quasi-Monte Carlo (QMC) method to a class of semi-linear parabolic reaction-diffusion partial differential equations used to model tumor growth. Mathematical models of tumor growth are largely phenomenological in nature, capturing infiltration of the tumor into surrounding healthy tissue, proliferation of the existing tumor, and patient response to therapies, such as chemotherapy and radiotherapy. Considerable inter-patient variability, inherent heterogeneity of the disease, sparse and noisy data collection, and model inadequacy all contribute to significant uncertainty in the model parameters. It is crucial that these uncertainties can be efficiently propagated through the model to compute quantities of interest (QoIs), which in turn may be used to inform clinical decisions. We show that QMC methods can be successful in computing expectations of meaningful QoIs. Well-posedness results are developed for the model and used to show a theoretical error bound for the case of uniform random fields. The theoretical linear error rate, which is superior to that of standard Monte Carlo, is verified numerically. Encouraging computational results are also provided for lognormal random fields, prompting further theoretical development.

2509.02871 2026-02-23 stat.AP stat.CO

Learning from geometry-aware near misses to real-time COR: A corridor-wide grouped random parameters GEV framework

Mohammad Anis, Yang Zhou, Dominique Lord

Comments 13 figures, 8 Tables

Abstract

Real-time corridor-wide crash-occurrence risk (COR) prediction is challenging because existing near-miss extreme value theory (EVT) models often oversimplify collision geometry, neglect vehicle-infrastructure (V-I) interactions, and inadequately account for spatial heterogeneity in traffic and roadway conditions. This study develops a geometry-aware two-dimensional time-to-collision (2D-TTC) near-miss extraction framework and integrates it with a hierarchical Bayesian grouped random parameter unified generalized extreme value model (HBSGRP-UGEV) to estimate short-term COR in urban corridors. The proposed framework builds on prior grouped EVT formulations while explicitly accommodating both vehicle-vehicle (V-V) and vehicle-infrastructure (V-I) near-miss processes within a unified corridor-wide modeling structure. High-resolution trajectories from the Argoverse-2 dataset were analyzed across 28 sites along Miami's Biscayne Boulevard to extract extreme near-miss events. The model incorporates vehicle dynamics and roadway features as covariates, with partial pooling across segments and intersections to capture corridor-wide heterogeneity. Results indicate that the HBSGRP-UGEV framework outperforms the fixed-parameter HBSFP-UGEV model, reducing the deviance information criterion (DIC) by up to 7.5 percent for V-V interactions and 3.1 percent for V-I interactions. Predictive validation using receiver operating characteristic area under the curve (ROC-AUC) demonstrates strong classification performance, with values of 0.89 for V-V segments, 0.82 for V-V intersections, 0.79 for V-I segments, and 0.75 for V-I intersections.

2508.13076 2026-02-23 econ.EM stat.ME

The purpose of an estimator is what it does: Misspecification, estimands, and over-identification

Isaiah Andrews, Jiafeng Chen, Otavio Tecchio

Comments to be published in Econometric Society Monographs, 2025 World Congress volumes: Volume 1, Chapter 8

Abstract

In over-identified models, misspecification -- the norm rather than exception -- fundamentally changes what estimators estimate. Different estimators imply different estimands rather than different efficiency for the same target. A review of recent applications of generalized method of moments in the American Economic Review suggests widespread acceptance of this fact: There is little formal specification testing and widespread use of estimators that would be inefficient were the model correct, including the use of "hand-selected" moments and weighting matrices. Motivated by these observations, we review and synthesize recent results on estimation under model misspecification, providing guidelines for transparent and robust empirical research. We also provide a new theoretical result, showing that Hansen's J-statistic measures, asymptotically, the range of estimates achievable at a given standard error. Given the widespread use of inefficient estimators and the resulting researcher degrees of freedom, we thus particularly recommend the broader reporting of J-statistics.

2503.19095 2026-02-23 econ.EM stat.ME

Empirical Bayes shrinkage (mostly) does not correct the measurement error in regression

Jiafeng Chen, Jiaying Gu, Soonwoo Kwon

Abstract

In the value-added literature, it is often claimed that regressing on empirical Bayes shrinkage estimates corrects for the measurement error problem in linear regression. We clarify the conditions needed; we argue that these conditions are stronger than those needed for classical measurement error correction, which we advocate for instead. Moreover, we show that the classical estimator cannot be improved without stronger assumptions. We extend these results to regressions on nonlinear transformations of the latent attribute and find generically slow minimax estimation rates.

2502.05351 2026-02-23 astro-ph.SR cs.LG stat.ML

Deep Generative model that uses physical quantities to generate and retrieve solar magnetic active regions

Subhamoy Chatterjee, Andres Munoz-Jaramillo, Anna Malanushenko

Comments 14 pages, 9 figures, accepted for publication in ApJS

Abstract

Deep generative models have shown immense potential in generating unseen data that has properties of real data. These models learn complex data-generating distributions starting from a smaller set of latent dimensions. However, generative models have encountered great skepticism in scientific domains due to the disconnection between generative latent vectors and scientifically relevant quantities. In this study, we integrate three types of machine learning models to generate solar magnetic patches in a physically interpretable manner and use those as a query to find matching patches in real observations. We use the magnetic field measurements from Space-weather HMI Active Region Patches (SHARPs) to train a Generative Adversarial Network (GAN). We connect the physical properties of GAN-generated images with their latent vectors to train Support Vector Machines (SVMs) that map between physical and latent spaces. These produce directions in the GAN latent space along which known physical parameters of the SHARPs change. We train a self-supervised learner (SSL) to make queries with generated images and find matches from real data. We find that the GAN-SVM combination enables users to produce high-quality patches that change smoothly only with a prescribed physical quantity, making generative models physically interpretable. We also show that GAN outputs can be used to retrieve real data that shares the same physical properties as the generated query. This elevates Generative Artificial Intelligence (AI) from a means of producing artificial data to a novel tool for scientific data interrogation, supporting its applicability beyond the domain of heliophysics.

2410.22333 2026-02-23 stat.ME astro-ph.IM hep-ph physics.data-an stat.AP

Hypothesis tests and model parameter estimation on data sets with missing correlation information

Lukas Koch

Comments 18 pages, 10 figures; follow-up of arxiv.org:2102.06172; Fixed layout

Abstract

Ideally, all analyses of normally distributed data should include the full covariance information between all data points. In practice, the full covariance matrix between all data points is not always available, either because a result was published without a covariance matrix or because one tries to combine multiple results from separate publications. For simple hypothesis tests, it is possible to define robust test statistics that will behave conservatively in the presence of unknown correlations. For model parameter fits, one can inflate the variance by a factor to ensure that things remain conservative at least up to a chosen confidence level. This paper describes a class of robust test statistics for simple hypothesis tests, as well as an algorithm to determine the necessary inflation factor for model parameter fits, goodness-of-fit tests, and composite hypothesis tests. It then presents some example applications of the methods to real neutrino interaction data and model comparisons.

2404.11739 2026-02-23 econ.EM stat.ME

Testing Mechanisms

Soonwoo Kwon, Jonathan Roth

Abstract

Economists are often interested in the mechanisms by which a treatment affects an outcome. We develop tests for the "sharp null of full mediation" that a treatment $D$ affects an outcome $Y$ only through a particular mechanism (or set of mechanisms) $M$. Our approach exploits connections between mediation analysis and the econometric literature on testing instrument validity. We also provide tools for quantifying the magnitude of alternative mechanisms when the sharp null is rejected: we derive sharp lower bounds on the fraction of individuals whose outcome is affected by the treatment despite having the same value of $M$ under both treatments ("always-takers"), as well as sharp bounds on the average effect of the treatment for such always-takers. An advantage of our approach relative to existing tools for mediation analysis is that it does not require stringent assumptions about how $M$ is assigned. We illustrate our methodology in two empirical applications.

2312.12715 2026-02-23 stat.ML cs.LG

Learning Performance Maximizing Ensembles with Explainability Guarantees

Vincent Pisztora, Jia Li

Abstract

In this paper we propose a method for the optimal allocation of observations between an intrinsically explainable glass box model and a black box model. An optimal allocation is defined as one which, for any given explainability level (i.e. the proportion of observations for which the explainable model is the prediction function), maximizes the performance of the ensemble on the underlying task, and maximizes performance of the explainable model on the observations allocated to it, subject to the maximal ensemble performance condition. The proposed method is shown to produce such explainability optimal allocations on a benchmark suite of tabular datasets across a variety of explainable and black box model types. These learned allocations are found to consistently maintain ensemble performance at very high explainability levels (explaining 74% of observations on average), and in some cases even to outperform both the component explainable and black box models while improving explainability.

2312.07882 2026-02-23 stat.ME cs.GT stat.AP

A semi-parametric approach for estimating consumer valuation distributions using second price auctions

Sourav Mukherjee, Ziqian Yang, Rohit K Patra, Kshitij Khare

Comments 41 pages, 13 figures

Abstract

We focus on online second price auctions, where bids are made sequentially, and the winning bidder pays the maximum of the second-highest bid and a seller specified starting price. For many such auctions, the seller does not see all the bids or the total number of bidders accessing the auction, and only observes the current selling prices throughout the course of the auction. We develop a novel semi-parametric approach to estimate the underlying consumer valuation distribution based on this data. Previous semi-parametric or non-parametric approaches in the literature only use the final selling price and assume knowledge of the total number of bidders. The resulting estimate, in particular, can be used by the seller to compute the optimal profit-maximizing price for the product. Our approach is free of tuning parameters, and we demonstrate its computational and statistical efficiency in a variety of simulation settings, and also on an Xbox 7-day auction dataset on eBay.
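The payment rule stated in the abstract (the winner pays the maximum of the second-highest bid and the seller's starting price) can be written down directly. This is a minimal sketch, not the authors' estimator: the function name is hypothetical, and it assumes that no sale occurs when the highest bid is below the starting price and that there are no bid increments.

```python
def selling_price(bids, starting_price):
    """Price paid by the winner of a second price auction with a starting price.

    Returns None when no bid reaches the starting price (assumed: no sale).
    """
    ranked = sorted(bids, reverse=True)
    if not ranked or ranked[0] < starting_price:
        return None
    # With a single bidder there is no second-highest bid, so the starting price applies.
    second_highest = ranked[1] if len(ranked) > 1 else starting_price
    return max(second_highest, starting_price)


# Hypothetical bids for illustration.
price_three_bidders = selling_price([120, 95, 150], starting_price=100)
price_low_runner_up = selling_price([150, 80], starting_price=100)
price_no_sale = selling_price([90], starting_price=100)
```

In the first call the runner-up bid sets the price; in the second the runner-up is below the starting price, so the starting price binds.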

1903.06568 2026-02-23 stat.CO astro-ph.IM physics.data-an stat.ME stat.OT

A response-matrix-centred approach to presenting cross-section measurements

Lukas Koch

Comments 26 pages, added reference to Phystat-nu

Abstract

The current canonical approach to publishing cross-section data is to unfold the reconstructed distributions. Detector effects like efficiency and smearing are undone mathematically, yielding distributions in true event properties. This is an ill-posed problem, as even small statistical variations in the reconstructed data can lead to large changes in the unfolded spectra. This work presents an alternative or complementary approach: the response-matrix-centred forward-folding approach. It offers a convenient way to forward-fold model expectations in truth space to reconstructed quantities. These can then be compared to the data directly, similar to what is usually done with full detector simulations within the experimental collaborations. For this, the detector response (efficiency and smearing) is parametrised as a matrix. The effect of the detector on the measurement of a given model is simulated by simply multiplying the binned truth expectation values by this response matrix. Systematic uncertainties in the detector response are handled by providing a set of matrices according to the prior distribution of the detector properties and marginalising over them. Background events can be included in the likelihood calculation by giving background events their own bins in truth space. To facilitate a straightforward use of response matrices, a new software framework has been developed: the Response Matrix Utilities (ReMU). ReMU is a Python package distributed via the Python Package Index. It only uses widely available, standard scientific Python libraries and does not depend on any custom experiment-specific software. It offers all methods needed to build response matrices from Monte Carlo data sets, use the response matrix to forward-fold truth-level model predictions, and compare the predictions to real data using Bayesian or frequentist statistical inference.
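The forward-folding step described in the abstract is a matrix-vector product. The sketch below illustrates it in plain Python and deliberately does not use the ReMU API; the function name, the toy response matrix, and the truth expectations are all hypothetical. It assumes `response[i][j]` is the probability that an event in truth bin `j` is reconstructed in bin `i`, so column sums below 1 encode detector inefficiency.

```python
def forward_fold(response, truth):
    """Fold binned truth expectations into reconstructed-bin expectations.

    response: list of rows, one per reconstructed bin, one entry per truth bin.
    truth: expected event counts per truth bin.
    """
    n_truth = len(truth)
    if any(len(row) != n_truth for row in response):
        raise ValueError("each response row needs one entry per truth bin")
    return [sum(r_ij * t_j for r_ij, t_j in zip(row, truth)) for row in response]


# Hypothetical toy detector: 3 truth bins smeared into 2 reconstructed bins.
response = [
    [0.5, 0.25, 0.0],
    [0.0, 0.5, 0.75],
]
truth = [100.0, 200.0, 40.0]
reco = forward_fold(response, truth)
```

With NumPy arrays the same operation is `response @ truth`. Marginalising over detector systematics, as the abstract describes, would repeat this fold for each matrix in the provided set.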

1803.09319 2026-02-23 cs.LG stat.ML

SUNLayer: Stable denoising with generative networks

Ruhui Jin, Dustin G. Mixon, Soledad Villar

Abstract

Deep neural networks are often used to implement powerful generative models for real-world data. Notable applications include image denoising, as well as other classical inverse problems like compressed sensing and super-resolution. To provide a rigorous but simplified analysis of generative models, in this work, we introduce an elegant theoretical framework based on spherical harmonics, namely SUNLayer. Our theoretical framework identifies explicit conditions on activation functions that guarantee denoising under local optimization. Numerical experiments examine the theoretical properties on commonly used activation functions and demonstrate their stable denoising performance.

2602.18186 2026-02-23 stat.ML cs.LG

Box Thirding: Anytime Best Arm Identification under Insufficient Sampling

Seohwa Hwang, Junyong Park

Comments 29 pages, 5 figures

Abstract

We introduce Box Thirding (B3), a flexible and efficient algorithm for Best Arm Identification (BAI) under fixed-budget constraints. It is designed for both anytime BAI and scenarios with large N, where the number of arms is too large for exhaustive evaluation within a limited budget T. The algorithm employs an iterative ternary comparison: in each iteration, three arms are compared; the best-performing arm is explored further, the median is deferred for future comparisons, and the weakest is discarded. Even without prior knowledge of T, B3 achieves an epsilon-best arm misidentification probability comparable to Successive Halving (SH), which requires T as a predefined parameter, applied to a randomly selected subset of c0 arms that fit within the budget. Empirical results show that B3 outperforms existing methods under limited-budget constraints in terms of simple regret, as demonstrated on the New Yorker Cartoon Caption Contest dataset.
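For context, the Successive Halving baseline named in this abstract can be sketched as follows. This is not the B3 algorithm, whose details the abstract does not give. It is a minimal textbook-style version of SH, with arms modelled as hypothetical zero-argument callables that return a reward, and it shows why SH needs the total budget T up front: the number of pulls per round is derived from it.

```python
import math

def successive_halving(arms, budget):
    """Return the index of the arm kept after repeatedly discarding the worse half.

    arms:   list of zero-argument callables, each returning one reward sample
    budget: total number of pulls T, which must be known in advance
    """
    survivors = list(range(len(arms)))
    rounds = max(1, math.ceil(math.log2(len(arms))))
    for _ in range(rounds):
        # Split the budget evenly over rounds, then over surviving arms.
        pulls = budget // (len(survivors) * rounds)
        if pulls == 0:
            raise ValueError("budget too small for this many arms")
        # Empirical means use only the pulls made in the current round.
        means = {a: sum(arms[a]() for _ in range(pulls)) / pulls for a in survivors}
        survivors.sort(key=means.get, reverse=True)
        survivors = survivors[:math.ceil(len(survivors) / 2)]
    return survivors[0]

# Hypothetical noiseless arms, used only to exercise the control flow.
arms = [lambda m=m: m for m in (0.1, 0.9, 0.5, 0.3)]
best = successive_halving(arms, budget=40)
```

When the number of arms is large relative to `budget`, `pulls` drops to zero. That is the regime in which the abstract applies SH to a random subset of arms instead.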

2602.18170 2026-02-23 stat.ME

Minimum L2 and robust Kullback-Leibler estimation

Nils Lid Hjort

Comments 4 pages, 0 figure. This arXiv'd February 2026 paper is from the 12th Prague Conference on Information Theory, Statistical Decision Functions and Random Processes proceedings volume, 1994, pages 102-105. The material preshadows local likelihood and BHHJ estimation

Abstract

This paper introduces two new robust methods for estimation of parameters in a given parametric family. The first method is that of 'minimum weighted L2', effectively minimising an estimate of the integrated (and possibly weighted) squared error. The second is 'robust Kullback-Leibler', consisting of minimising a robust version of the empirical Kullback-Leibler distance, and can be viewed as a general robust modification of the maximum likelihood procedure. This second method is also related to recent local likelihood ideas for semiparametric density estimation. The methods are described, influence functions are found, as are formulae for asymptotic variances. In particular large-sample efficiencies are computed under the home turf conditions of the underlying parametric model. The methods and formulae are illustrated for the normal model.

2602.18161 2026-02-23 stat.ME

Equal Marginal Power for Co-Primary Endpoints

Simon Bond

Comments 10 pages, 3 figures

Abstract

The choice of sample size in the context of co-primary endpoints for a randomised trial is discussed. Current guidance can leave endpoints with unequal marginal power. A method is provided to achieve equal marginal power by using the flexibility provided in multiple testing procedures. A comparison is made to several choices of rule to determine the sample size, in terms of the study design and its operating characteristics.

2602.18150 2026-02-23 stat.ME

Inclusive Ranking of Indian States via Bayesian Bradley-Terry Model

Arshi Rizvi, Rahul Singh

Comments 24 pages, 15 figures

Abstract

Evaluating the performance of different administrative regions within a country is crucial for its development and policy formulation. The performance evaluators are mostly based on health, education, per capita income, awareness, family planning and so on. Not only evaluating regions, but also ranking them is a crucial step, and various methods have been proposed to date. We aim to provide a ranking system for Indian states that uses a Bayesian approach via the famous Bradley-Terry model for paired comparisons. The ranking method uses indicators from the NFHS-5 dataset with the prior information of per-capita incomes of the states/UTs, thus leading to a holistic ranking, which not only includes human development factors but also takes into account the economic background of the states. We also carried out various Markov chain Monte Carlo diagnostics required for the reliability of the estimates of merits for these states. These merits thus provide a ranking for the states/UTs and can further be utilised to make informed policy decisions.
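As a reminder of the model behind this ranking, the Bradley-Terry probability that item i beats item j is exp(m_i) / (exp(m_i) + exp(m_j)) for log-merits m. The sketch below evaluates that likelihood together with an independent normal prior on the log-merits. The prior form, the region names, and the comparison counts are hypothetical placeholders; the abstract does not say how the per-capita income information enters the prior.

```python
import math

def win_probability(merit_i, merit_j):
    """Bradley-Terry probability that i beats j, with merits on the log scale."""
    return 1.0 / (1.0 + math.exp(merit_j - merit_i))

def log_likelihood(log_merits, wins):
    """wins[(i, j)] is the number of paired comparisons in which i beat j."""
    return sum(count * math.log(win_probability(log_merits[i], log_merits[j]))
               for (i, j), count in wins.items())

def log_posterior(log_merits, wins, prior_means, prior_sd=1.0):
    """Unnormalised log-posterior under an assumed independent normal prior."""
    log_prior = sum(-0.5 * ((log_merits[k] - prior_means[k]) / prior_sd) ** 2
                    for k in log_merits)
    return log_likelihood(log_merits, wins) + log_prior

# Hypothetical regions and comparison counts, for illustration only.
merits = {"A": 0.8, "B": 0.0, "C": -0.5}
wins = {("A", "B"): 7, ("B", "A"): 3, ("B", "C"): 6, ("C", "B"): 4}
ranking = sorted(merits, key=merits.get, reverse=True)
```

In a Bayesian fit, `log_posterior` would be the target density handed to an MCMC sampler, and regions would be ranked by their posterior merit estimates.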

2602.18087 2026-02-23 stat.ME

Optimal inference via confidence distributions for two-by-two tables modelled as Poisson pairs: fixed and random effects

Céline Cunen, Nils Lid Hjort

Comments 6 pages, 3 figures; this article has appeared in essentially this form in the International Statistical Institute 2015 Rio World Conference proceedings volume. The present 2026 arXiv'd version might be further extended by the authors for a fuller journal publication

Abstract

This paper presents methods for meta-analysis of $2 \times 2$ tables, both with and without allowing heterogeneity in the treatment effects. Meta-analysis is common in medical research, but most existing methods are unsuited for $2 \times 2$ tables with rare events. Usually the tables are modelled as pairs of binomial variables, but we will model them as Poisson pairs. The methods presented here are based on confidence distributions, and offer optimal inference for the treatment effect parameter. We also propose an optimal method for inference on the ratio between treatment effects, and illustrate our methods on a real dataset.

2602.18058 2026-02-23 astro-ph.EP astro-ph.IM cs.SY eess.SY math.OC physics.data-an physics.space-ph stat.AP

Probabilistic Methods for Initial Orbit Determination and Orbit Determination in Cislunar Space

Ishan Paranjape, Tarun Hejmadi, Suman Chakravorty

Comments To be submitted to the Journal of Astronautical Sciences DISTRIBUTION A: Approved for public release; distribution is unlimited. Public affairs approval #AFRL-2026-0779

Abstract

In orbital mechanics, Gauss's method for orbit determination (OD) is a popular, minimal assumption solution for obtaining the initial state estimate of a passing resident space object (RSO). Since much of the cislunar domain relies on three-body dynamics, a key assumption of Gauss's method is rendered incompatible, creating a need for a new, minimal assumption method for initial orbit determination (IOD). In this work, we present a framework for short and long term probabilistic target tracking in cislunar space which produces an initial state estimate with as few assumptions as possible. Specifically, we propose an IOD method involving the kinematic fitting of several series of noisy, consecutive ground-based observations. Once a probabilistic initial state estimate in the form of a particle cloud is formed, we apply the powerful Particle Gaussian Mixture (PGM) Filter to reduce the uncertainty of our state estimate over time. This combined IOD/OD framework is demonstrated for several classes of trajectories in cislunar space and compared to better-known filtering frameworks.

2602.18053 2026-02-23 stat.ML cs.LG math.ST stat.TH

On the Generalization and Robustness in Conditional Value-at-Risk

Dinesh Karthik Mulumudi, Piyushi Manupriya, Gholamali Aminian, Anant Raj

Abstract

Conditional Value-at-Risk (CVaR) is a widely used risk-sensitive objective for learning under rare but high-impact losses, yet its statistical behavior under heavy-tailed data remains poorly understood. Unlike expectation-based risk, CVaR depends on an endogenous, data-dependent quantile, which couples tail averaging with threshold estimation and fundamentally alters both generalization and robustness properties. In this work, we develop a learning-theoretic analysis of CVaR-based empirical risk minimization under heavy-tailed and contaminated data. We establish sharp, high-probability generalization and excess risk bounds under minimal moment assumptions, covering fixed hypotheses, finite and infinite classes, and extending to $\beta$-mixing dependent data; we further show that these rates are minimax optimal. To capture the intrinsic quantile sensitivity of CVaR, we derive a uniform Bahadur-Kiefer type expansion that isolates a threshold-driven error term absent in mean-risk ERM and essential in heavy-tailed regimes. We complement these results with robustness guarantees by proposing a truncated median-of-means CVaR estimator that achieves optimal rates under adversarial contamination. Finally, we show that CVaR decisions themselves can be intrinsically unstable under heavy tails, establishing a fundamental limitation on decision robustness even when the population optimum is well separated. Together, our results provide a principled characterization of when CVaR learning generalizes and is robust, and when instability is unavoidable due to tail scarcity.
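The coupling between tail averaging and threshold estimation mentioned in this abstract is visible in the standard Rockafellar-Uryasev variational form of CVaR, min over tau of tau + E[(L - tau)_+] / tail. The sketch below is the plain empirical version, not the paper's truncated median-of-means estimator. It assumes the convention that `tail` is the fraction of worst losses being averaged; some texts parametrise by the confidence level 1 - tail instead.

```python
def empirical_cvar(losses, tail):
    """Empirical CVaR: minimise tau + mean((loss - tau)_+) / tail over tau.

    The objective is convex and piecewise linear in tau, so its minimum is
    attained at one of the observed losses; that minimiser plays the role of
    the data-dependent quantile (threshold).
    """
    if not losses or not 0.0 < tail <= 1.0:
        raise ValueError("need at least one loss and 0 < tail <= 1")
    n = len(losses)

    def objective(tau):
        return tau + sum(max(loss - tau, 0.0) for loss in losses) / (n * tail)

    threshold = min(losses, key=objective)
    return objective(threshold), threshold

cvar_half, threshold_half = empirical_cvar([1.0, 2.0, 3.0, 4.0], tail=0.5)
```

With `tail=0.5` on these four losses the value is the average of the two largest, 3.5, and with `tail=1.0` it reduces to the ordinary mean. The threshold is itself estimated from the sample, which is the source of the extra quantile-driven error the abstract refers to.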

2602.18039 2026-02-23 stat.AP

A context-specific causal model for estimating the effect of extended length of overnight stay on traveller's total expenditure

Lauri Valkonen, Juha Karvanen

Abstract

Tourism significantly affects the economies of many countries. Understanding the causal relationship between the length of overnight stay and traveller's expenditure is crucial for stakeholders to characterize spending profiles and to design marketing strategies. Causal mechanisms differ between personal and work-related travel because the decision-making processes have different drivers and constraints. We apply context-specific independence relations to model causal mechanisms in contexts specified by trip purpose and identify the causal effect of the length of stay on expenditure. Using the international visitor survey data on foreign travellers to Finland, we fit a hierarchical Bayesian model to estimate the posterior distribution of the counterfactual expenditure due to extending the length of stay by one night. We also perform a Bayesian sensitivity analysis of the estimated causal effect with respect to omitted variable bias.

2602.18014 2026-02-23 cs.RO cs.SY eess.SY stat.ML

Quasi-Periodic Gaussian Process Predictive Iterative Learning Control

Unnati Nigam, Radhendushka Srivastava, Faezeh Marzbanrad, Michael Burke

Abstract

Repetitive motion tasks are common in robotics, but performance can degrade over time due to environmental changes and robot wear and tear. Iterative learning control (ILC) improves performance by using information from previous iterations to compensate for expected errors in future iterations. This work incorporates the use of Quasi-Periodic Gaussian Processes (QPGPs) into a predictive ILC framework to model and forecast disturbances and drift across iterations. Using a recent structural equation formulation of QPGPs, the proposed approach enables efficient inference with complexity $\mathcal{O}(p^3)$ instead of $\mathcal{O}(i^2p^3)$, where $p$ denotes the number of points within an iteration and $i$ represents the total number of iterations, especially for larger $i$. This formulation also enables parameter estimation without loss of information, making continual GP learning computationally feasible within the control loop. By predicting next-iteration error profiles rather than relying only on past errors, the controller achieves faster convergence and maintains this under time-varying disturbances. We benchmark the method against both standard ILC and conventional Gaussian Process (GP)-based predictive ILC on three tasks: autonomous vehicle trajectory tracking, a three-link robotic manipulator, and a real-world Stretch robot experiment. Across all cases, the proposed approach converges faster and remains robust under injected and natural disturbances while reducing computational cost. This highlights its practicality across a range of repetitive dynamical systems.

2602.18004 2026-02-23 stat.ME stat.CO stat.ML

Preconditioned Robust Neural Posterior Estimation for Misspecified Simulators

Ryan P. Kelly, David T. Frazier, David J. Warne, Christopher C. Drovandi

Abstract

Simulation-based inference (SBI) enables parameter estimation for complex stochastic models with intractable likelihoods when model simulation is feasible. Neural posterior estimation (NPE) is a popular SBI approach that often achieves accurate inference with far fewer simulations than classical approaches. But in practice, neural approaches can be unreliable for two reasons: incompatible data summaries arising from model misspecification yield unreliable posteriors due to extrapolation, and prior-predictive draws can produce extreme summaries that lead to difficulties in obtaining an accurate posterior for the observed data of interest. Existing preconditioning schemes target well-specified settings, and their behaviour under misspecification remains unexplored. We study preconditioning under misspecification and propose preconditioned robust neural posterior estimation, which computes data-dependent weights that focus training near the observed summaries and fits a robust neural posterior approximation. We also introduce a forest-proximity preconditioning approach that uses tree-based proximity scores to down-weight outlying simulations and concentrate computation around the observed dataset. Across two synthetic examples and one real example with incompatible summaries and extreme prior-predictive behaviour, we demonstrate that preconditioning combined with robust NPE increases stability and improves accuracy, calibration, and posterior-predictive fit over standard baseline methods.

2602.17995 2026-02-23 stat.ME stat.AP

Hybrid Non-informative and Informative Prior Model-assisted Designs for Mid-trial Dose Insertion

Kana Yamada, Hisato Sunami, Kentaro Takeda, Keisuke Hanada, Masahiro Kojima

Abstract

In oncology phase I trials, model-assisted designs have been increasingly adopted because they enable adaptive yet operationally simple dose adjustment based on accumulating safety data, leading to a paradigm shift in dose-escalation methodology. In practice, a single mid-trial dose insertion may be considered to examine safer doses and/or to collect more informative efficacy data. In this study, we investigate methods to improve dose assignment and the selection of the maximum tolerated dose (MTD) or the optimal biological dose (OBD) when a new dose level is added during an ongoing trial under a model-assisted framework, by assigning informative prior information to the inserted dose. We propose a hybrid design that uses a non-informative model-assisted design at trial initiation and, upon dose insertion, applies an informative-prior extension only to the newly added dose. In addition, to address potential skeleton misspecification, we propose two adaptive extensions: (i) an online-weighting approach that updates the skeleton over time, and (ii) a Bayesian-mixture approach that robustly combines multiple candidate skeletons. We evaluate the proposed methods through simulation studies.

2602.17985 2026-02-23 cs.LG stat.ML

Learning Without Training

Ryan O'Dowd

Comments PhD Dissertation of Ryan O'Dowd, defended successfully at Claremont Graduate University on 1/28/2026

Abstract

Machine learning is at the heart of managing the real-world problems associated with massive data. With the success of neural networks on such large-scale problems, more research in machine learning is being conducted now than ever before. This dissertation focuses on three different projects rooted in mathematical theory for machine learning applications. The first project deals with supervised learning and manifold learning. In theory, one of the main problems in supervised learning is that of function approximation: that is, given some data set $\mathcal{D}=\{(x_j,f(x_j))\}_{j=1}^M$, can one build a model $F\approx f$? We introduce a method which aims to remedy several of the theoretical shortcomings of the current paradigm for supervised learning. The second project deals with transfer learning, which is the study of how an approximation process or model learned on one domain can be leveraged to improve the approximation on another domain. We study such liftings of functions when the data is assumed to be known only on a part of the whole domain. We are interested in determining subsets of the target data space on which the lifting can be defined, and how the local smoothness of the function and its lifting are related. The third project is concerned with the classification task in machine learning, particularly in the active learning paradigm. Classification has often been treated as an approximation problem as well, but we propose an alternative approach leveraging techniques originally introduced for signal separation problems. We introduce theory to unify signal separation with classification and a new algorithm which yields competitive accuracy to other recent active learning algorithms while providing results much faster.

2602.17984 2026-02-23 stat.ME

Developing Performance-Guaranteed Biomarker Combination Rules with Integrated External Information under Practical Constraint

Albert Osom, Camden Lopez, Ashley Alexander, Suresh Chari, Ziding Feng, Ying-Qi Zhao

Abstract

In clinical practice, there is significant interest in integrating novel biomarkers with existing clinical data to construct interpretable and robust decision rules. Motivated by the need to improve decision-making for early disease detection, we propose a framework for developing an optimal biomarker-based clinical decision rule that is both clinically meaningful and practically feasible. Specifically, our procedure constructs a linear decision rule designed to achieve optimal performance among the class of linear rules by maximizing the true positive rate while adhering to a pre-specified positive predictive value constraint. Additionally, our method can adaptively incorporate individual risk information from an external source to enhance performance when such information is beneficial. We establish the asymptotic properties of our proposed estimator and compare it to the standard approach used in practice through extensive simulation studies. Results indicate that our approach offers strong finite-sample performance. We also apply the proposed methods to develop biomarker-based screening rules for pancreatic ductal adenocarcinoma (PDAC) among new-onset diabetes (NOD) patients.

2602.17967 2026-02-23 math.ST stat.ME stat.ML stat.TH

Minimax optimal adaptive structured transfer learning through semi-parametric domain-varying coefficient model

Hanxiao Chen, Debarghya Mukherjee

Comments 86 pages, 8 figures

Abstract

Transfer learning aims to improve inference in a target domain by leveraging information from related source domains, but its effectiveness critically depends on how cross-domain heterogeneity is modeled and controlled. When the conditional mechanism linking covariates and responses varies across domains, indiscriminate information pooling can lead to negative transfer, degrading performance relative to target-only estimation. We study a multi-source, single-target transfer learning problem under conditional distributional drift and propose a semiparametric domain-varying coefficient model (DVCM), in which domain-relatedness is encoded through an observable domain identifier. This framework generalizes classical varying-coefficient models to structured transfer learning and interpolates between invariant and fully heterogeneous regimes. Building on this model, we develop an adaptive transfer learning estimator that selectively borrows strength from informative source domains while provably safeguarding against negative transfer. Our estimator is computationally efficient and easy to implement; we also show that it is minimax rate-optimal and derive its asymptotic distribution, enabling valid uncertainty quantification and hypothesis testing despite data-adaptive pooling and shrinkage. Our results precisely characterize the interplay among domain heterogeneity, the smoothness of the underlying mean function, and the number of source domains and are corroborated by comprehensive numerical experiments and two real-data applications.

2602.17956 2026-02-23 stat.ME math.ST stat.CO stat.TH

A variational framework for modal estimation

Tâm LeMinh, Julyan Arbel, Florence Forbes, Hien Duy Nguyen

Abstract

We approach multivariate mode estimation through Gibbs distributions and introduce GERVE (Gibbs-measure Entropy-Regularised Variational Estimation), a likelihood-free framework that approximates Gibbs measures directly from samples by maximizing an entropy-regularised variational objective with natural-gradient updates. GERVE brings together kernel density estimation, mean-shift, variational inference, and annealing in a single platform for mode estimation. It fits Gaussian mixtures that concentrate on high-density regions and yields cluster assignments from responsibilities, with reduced sensitivity to the chosen number of components. We provide theory in two regimes: as the Gibbs temperature approaches zero, mixture components converge to population modes; at fixed temperature, maximisers of the empirical objective exist, are consistent, and are asymptotically normal. We also propose a bootstrap procedure for per-mode confidence ellipses and stability scores. Simulation and real-data studies show accurate mode recovery and emergent clustering, robust to mixture overspecification. GERVE is a practical likelihood-free approach when the number of modes or groups is unknown and full density estimation is impractical.

2602.17923 2026-02-23 stat.ME cs.NA math.NA stat.CO

Model Error Embedding with Orthogonal Gaussian Processes

Mridula Kuppa, Khachik Sargsyan, Marco Panesi, Habib N. Najm

Comments 30 pages, 26 figures

Abstract

Computational models of complex physical systems often rely on simplifying assumptions which inevitably introduce model error, with consequent predictive errors. Given data on model observables, the estimation of parameterized model-error representations, along with other model parameters, would be ideally done while separating the contributions of each of the two sets of parameters, in order to ensure meaningful stand-alone model predictions. This work builds an embedded model error framework using a weight-space representation of Gaussian processes (GPs) to flexibly capture model-error spatiotemporal correlations and enable inference with GP-embedding in non-linear models. To disambiguate model and model-error/bias parameters, we extend an existing orthogonal GP method to the embedded model-error setting and derive appropriate orthogonality constraints. To address the increased dimensionality introduced by the GP representation, we employ the likelihood-informed subspace method. The construction is demonstrated on linear and non-linear examples, where it effectively corrects model predictions to match data trends. Extrapolation beyond the training data recovers the prior predictive distribution, and the orthogonality constraints lead to meaningful stand-alone model predictions and nearly uncorrelated posteriors between model and model-error parameters.

2602.17896 2026-02-23 math.ST stat.OT stat.TH

Central limit theorem for the global clustering coefficient of random geometric graphs

Mingao Yuan, Md. Niamul Islam Sium

Abstract

The global clustering coefficient serves as a powerful metric for the structural analysis and comparison of complex networks. Random geometric graphs offer a realistic framework for representing the spatial constraints and geometry often found in real-world network datasets. In this paper, we establish a central limit theorem for the global clustering coefficient of random geometric graphs. Our main result identifies the centering and scaling sequences required for convergence in law to the standard normal distribution. Our approach varies by regime: in the dense case, we employ the Lyapunov CLT; in the intermediate case, we utilize the asymptotic theory of $U$-statistics with sample-size-dependent kernels; and in the sparse regime, we use the method of moments to derive the asymptotic distribution. Notably, the convergence rates for non-uniform and uniform random geometric graphs diverge in the dense regime, yet they coincide in the sparse regime. In addition, we find that the global clustering coefficient for both uniform and non-uniform RGGs is asymptotically equal to $3/4$.
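The statistic studied here is the usual global clustering coefficient, three times the number of triangles divided by the number of connected triples (paths of length two). The sketch below computes it on a simulated random geometric graph. The dimension (points in the unit square), the uniform sampling, the untreated boundary, and the chosen radius are assumptions of this illustration and need not match the setting in which the paper derives its limit.

```python
import itertools
import math
import random

def random_geometric_graph(n, radius, seed=0):
    """Adjacency sets for n uniform points in the unit square, linked if within radius."""
    rng = random.Random(seed)
    points = [(rng.random(), rng.random()) for _ in range(n)]
    neighbours = [set() for _ in range(n)]
    for i, j in itertools.combinations(range(n), 2):
        if math.dist(points[i], points[j]) <= radius:
            neighbours[i].add(j)
            neighbours[j].add(i)
    return neighbours

def global_clustering_coefficient(neighbours):
    """3 * (number of triangles) / (number of connected triples)."""
    # A vertex of degree d is the centre of d * (d - 1) / 2 connected triples.
    triples = sum(len(nb) * (len(nb) - 1) // 2 for nb in neighbours)
    # Summing common neighbours over all edges counts every triangle three times.
    closed = sum(len(neighbours[i] & neighbours[j])
                 for i in range(len(neighbours))
                 for j in neighbours[i] if i < j)
    return closed / triples if triples else 0.0

graph = random_geometric_graph(n=200, radius=0.1, seed=1)
coefficient = global_clustering_coefficient(graph)
```

Repeating the simulation over many seeds and graph sizes gives an empirical distribution of `coefficient` that can be compared with a normal limit after suitable centring and scaling.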

2602.17876 2026-02-23 stat.ML cs.LG math.ST stat.TH

Interactive Learning of Single-Index Models via Stochastic Gradient Descent

Nived Rajaraman, Yanjun Han

Comments 26 pages, 2 figures

Abstract

Stochastic gradient descent (SGD) is a cornerstone algorithm for high-dimensional optimization, renowned for its empirical successes. Recent theoretical advances have provided a deep understanding of how SGD enables feature learning in high-dimensional nonlinear models, most notably the single-index model with i.i.d. data. In this work, we study the sequential learning problem for single-index models, also known as generalized linear bandits or ridge bandits, where SGD is a simple and natural solution, yet its learning dynamics remain largely unexplored. We show that, similar to the optimal interactive learner, SGD undergoes a distinct "burn-in" phase before entering the "learning" phase in this setting. Moreover, with an appropriately chosen learning rate schedule, a single SGD procedure simultaneously achieves near-optimal (or best-known) sample complexity and regret guarantees across both phases, for a broad class of link functions. Our results demonstrate that SGD remains highly competitive for learning single-index models under adaptive data.

2602.17830 2026-02-23 stat.ML cs.LG

Drift Estimation for Stochastic Differential Equations with Denoising Diffusion Models

Marcos Tapia Costa, Nikolas Kantas, George Deligiannidis

Abstract

We study the estimation of time-homogeneous drift functions in multivariate stochastic differential equations with known diffusion coefficient, from multiple trajectories observed at high frequency over a fixed time horizon. We formulate drift estimation as a denoising problem conditional on previous observations, and propose an estimator of the drift function which is a by-product of training a conditional diffusion model capable of simulating new trajectories dynamically. Across different drift classes, the proposed estimator was found to match classical methods in low dimensions and remained consistently competitive in higher dimensions, with gains that cannot be attributed to architectural design choices alone.

2602.17827 2026-02-23 cs.LG stat.ML

Avoid What You Know: Divergent Trajectory Balance for GFlowNets

Pedro Dall'Antonia, Tiago da Silva, Daniel Csillag, Salem Lahlou, Diego Mesquita

Comments 20 pages, under review

Abstract

Generative Flow Networks (GFlowNets) are a flexible family of amortized samplers trained to generate discrete and compositional objects with probability proportional to a reward function. However, learning efficiency is constrained by the model's ability to rapidly explore diverse high-probability regions during training. To mitigate this issue, recent works have focused on incentivizing the exploration of unvisited and valuable states via curiosity-driven search and self-supervised random network distillation, which tend to waste samples on already well-approximated regions of the state space. In this context, we propose Adaptive Complementary Exploration (ACE), a principled algorithm for the effective exploration of novel and high-probability regions when learning GFlowNets. To achieve this, ACE introduces an exploration GFlowNet explicitly trained to search for high-reward states in regions underexplored by the canonical GFlowNet, which learns to sample from the target distribution. Through extensive experiments, we show that ACE significantly improves upon prior work in terms of approximation accuracy to the target distribution and discovery rate of diverse high-reward states.

2602.17792 2026-02-23 stat.ME

Spatial Confounding: A review of concepts, challenges, and current approaches

Isaque Vieira Machado Pim, Luiz Max Fagundes de Carvalho, Marcos Oliveira Prates

Comments 34 pages, 4 figures

Abstract

Spatial confounding is a persistent challenge in spatial statistics, influencing the validity of statistical inference in models that analyze spatially-structured data. The concept has been interpreted in various ways but is broadly defined as bias in estimates arising from unmeasured spatial variation. In this paper we review definitions, classical spatial models, and recent methodological advances, including approaches from spatial statistics and causal inference. We provide a unified view of the many available approaches for areal as well as geostatistical data and discuss their relative merits both theoretically and empirically with a head-to-head comparison on real datasets. Finally, we leverage the results of the empirical comparisons to discuss directions for future research.

2602.17779 2026-02-23 stat.ML cond-mat.dis-nn cs.LG

Topological Exploration of High-Dimensional Empirical Risk Landscapes: general approach, and applications to phase retrieval

Antoine Maillard, Tony Bonnaire, Giulio Biroli

Comments 43 pages, 14 figures

Abstract

We consider the landscape of empirical risk minimization for high-dimensional Gaussian single-index models (generalized linear models). The objective is to recover an unknown signal $\boldsymbol{\theta}^\star \in \mathbb{R}^d$ (where $d \gg 1$) from a loss function $\hat{R}(\boldsymbol{\theta})$ that depends on pairs of labels $(\mathbf{x}_i \cdot \boldsymbol{\theta}, \mathbf{x}_i \cdot \boldsymbol{\theta}^\star)_{i=1}^n$, with $\mathbf{x}_i \sim \mathcal{N}(0, I_d)$, in the proportional asymptotic regime $n \asymp d$. Using the Kac-Rice formula, we analyze different complexities of the landscape -- defined as the expected number of critical points -- corresponding to various types of critical points, including local minima. We first show that some variational formulas previously established in the literature for these complexities can be drastically simplified, reducing to explicit variational problems over a finite number of scalar parameters that we can efficiently solve numerically. Our framework also provides detailed predictions for properties of the critical points, including the spectral properties of the Hessian and the joint distribution of labels. We apply our analysis to the real phase retrieval problem for which we derive complete topological phase diagrams of the loss landscape, characterizing notably BBP-type transitions where the Hessian at local minima (as predicted by the Kac-Rice formula) becomes unstable in the direction of the signal. We test the predictive power of our analysis to characterize gradient flow dynamics, finding excellent agreement with finite-size simulations of local optimization algorithms, and capturing fine-grained details such as the empirical distribution of labels. Overall, our results open new avenues for the asymptotic study of loss landscapes and topological trivialization phenomena in high-dimensional statistical models.

2602.17744 2026-02-23 cs.LG cs.CL math.ST stat.ML stat.TH

Bayesian Optimality of In-Context Learning with Selective State Spaces

Di Zhang, Jiaqi Xing

Comments 17 pages

Abstract

We propose Bayesian optimal sequential prediction as a new principle for understanding in-context learning (ICL). Unlike interpretations framing Transformers as performing implicit gradient descent, we formalize ICL as meta-learning over latent sequence tasks. For tasks governed by Linear Gaussian State Space Models (LG-SSMs), we prove a meta-trained selective SSM asymptotically implements the Bayes-optimal predictor, converging to the posterior predictive mean. We further establish a statistical separation from gradient descent, constructing tasks with temporally correlated noise where the optimal Bayesian predictor strictly outperforms any empirical risk minimization (ERM) estimator. Since Transformers can be seen as performing implicit ERM, this demonstrates selective SSMs achieve lower asymptotic risk due to superior statistical efficiency. Experiments on synthetic LG-SSM tasks and a character-level Markov benchmark confirm selective SSMs converge faster to Bayes-optimal risk, show superior sample efficiency with longer contexts in structured-noise settings, and track latent states more robustly than linear Transformers. This reframes ICL from "implicit optimization" to "optimal inference," explaining the efficiency of selective SSMs and offering a principled basis for architecture design.

2602.17743 2026-02-23 cs.LG stat.ML

Provable Adversarial Robustness in In-Context Learning

Di Zhang

Comments 16 pages

Details
English abstract

Large language models adapt to new tasks through in-context learning (ICL) without parameter updates. Current theoretical explanations for this capability assume test tasks are drawn from a distribution similar to that seen during pretraining. This assumption overlooks adversarial distribution shifts that threaten real-world reliability. To address this gap, we introduce a distributionally robust meta-learning framework that provides worst-case performance guarantees for ICL under Wasserstein-based distribution shifts. Focusing on linear self-attention Transformers, we derive a non-asymptotic bound linking adversarial perturbation strength ($\rho$), model capacity ($m$), and the number of in-context examples ($N$). The analysis reveals that model robustness scales with the square root of its capacity ($\rho_{\text{max}} \propto \sqrt{m}$), while adversarial settings impose a sample complexity penalty proportional to the square of the perturbation magnitude ($N_\rho - N_0 \propto \rho^2$). Experiments on synthetic tasks confirm these scaling laws. These findings advance the theoretical understanding of ICL's limits under adversarial conditions and suggest that model capacity serves as a fundamental resource for distributional robustness.

2602.17699 2026-02-23 cs.LG math.RA stat.ML

Certified Learning under Distribution Shift: Sound Verification and Identifiable Structure

Chandrasekhar Gokavarapu, Sudhakar Gadde, Y. Rajasekhar, S. R. Bhargava

Details
English abstract

Proposition. Let $f$ be a predictor trained on a distribution $P$ and evaluated on a shifted distribution $Q$. Under verifiable regularity and complexity constraints, the excess risk under shift admits an explicit upper bound determined by a computable shift metric and model parameters. We develop a unified framework in which (i) risk under distribution shift is certified by explicit inequalities, (ii) verification of learned models is sound for nontrivial sizes, and (iii) interpretability is enforced through identifiability conditions rather than post hoc explanations. All claims are stated with explicit assumptions. Failure modes are isolated. Non-certifiable regimes are characterized.

2602.16146 2026-02-23 stat.ME

Uncertainty-Aware Neural Multivariate Geostatistics

Yeseul Jeon, Aaron Scheffler, Rajarshi Guhaniyogi

Details
English abstract

We propose Deep Neural Coregionalization, a scalable framework for uncertainty-aware multivariate geostatistics. DNC models multivariate spatial effects through spatially varying latent factors and loadings, assigning deep Gaussian process (DGP) priors to both the factors and the entries of the loading matrix. This joint construction learns shared latent spatial structure together with response-specific, location-dependent mixing weights, enabling flexible nonlinear and space-dependent associations within and across variables. A key contribution is a variational formulation that makes the DGP to deep neural network (DNN) correspondence explicit: maximizing the DGP evidence lower bound (ELBO) is equivalent to training DNNs with weight decay and Monte Carlo (MC) dropout. This yields fast mini-batch stochastic optimization without Markov Chain Monte Carlo (MCMC), while providing principled uncertainty quantification through MC-dropout forward passes as approximate posterior draws, producing calibrated credible surfaces for prediction and spatial effect estimation. Across simulations, DNC is competitive with existing spatial factor models, particularly under strong nonstationarity and complex cross-dependence, while delivering substantial computational gains. In a multivariate environmental case study, DNC captures spatially varying cross-variable interactions, produces interpretable maps of multivariate outcomes, and scales uncertainty quantification to large datasets with orders-of-magnitude reductions in runtime.

2602.15328 2026-02-23 math.ST stat.TH

Non-Stationary Covariance Functions for Spatial Data on Linear Networks

Alfredo Alegría

Details
English abstract

We introduce a novel class of non-stationary covariance functions for random fields on linear networks that allows both the variance and the correlation range of the random field to vary spatially. The proposed covariance functions are useful to model random fields with a spatial dependence that is locally isotropic with respect to the resistance metric, a distance that reflects the topology of the network. The framework admits explicit stochastic representations of the associated random fields and can be naturally extended to matrix-valued covariance functions for vector-valued random fields. We assess the statistical and computational performance of a weighted local likelihood estimator for the proposed models using synthetic data generated on the street network of the University of Chicago neighborhood.

2602.14835 2026-02-23 stat.ME cs.CY stat.AP

The Global Representativeness Index: A Total Variation Distance Framework for Measuring Demographic Fidelity in Survey Research

Evan Hadfield, Andrew Konya

Comments 2 figures, 9 tables. Open-source library: https://github.com/collect-intel/gri v2: added co-author

Details
English abstract

Global survey research increasingly informs high-stakes decisions in AI governance and cross-cultural policy, yet no standardized metric quantifies how well a sample's demographic composition matches its target population. Response rates and demographic quotas -- the prevailing proxies for sample quality -- measure effort and coverage but not distributional fidelity. This paper introduces the Global Representativeness Index (GRI), a framework grounded in Total Variation Distance that scores any survey sample against population benchmarks across multiple demographic dimensions on a [0, 1] scale. Validation on seven waves of the Global Dialogues survey (N = 7,500 across 60+ countries) finds fine-grained demographic GRI scores of only 0.33--0.36 -- roughly 43% of the theoretical maximum at that sample size. Cross-validation on the World Values Survey (seven waves, N = 403,000), Afrobarometer Round 9 (N = 53,000), and Latinobarometro (N = 19,000) reveals that even large probability surveys score below 0.22 on fine-grained global demographics when country coverage is limited. The GRI connects to classical survey statistics through the design effect; both metrics are recommended as a minimum summary of sample quality, since GRI quantifies demographic distance symmetrically while effective N captures the asymmetric inferential cost of underrepresentation. The framework is released as an open-source Python library with UN and Pew Research Center population benchmarks, applicable to survey research, machine learning dataset auditing, and AI evaluation benchmarks.
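As a rough illustration of a Total Variation Distance based score, the sketch below compares sample and population proportions over one demographic dimension. The `1 - TVD` form, the function names, and the age-band figures are assumptions made for this example; they are not taken from the GRI library, whose actual scoring and API may differ.

```python
def total_variation_distance(sample, population):
    """Half the L1 distance between two categorical distributions (dicts of proportions)."""
    categories = set(sample) | set(population)
    return 0.5 * sum(
        abs(sample.get(k, 0.0) - population.get(k, 0.0)) for k in categories
    )


def representativeness_score(sample, population):
    """Assumed score on [0, 1]: 1 means the sample matches the benchmark exactly."""
    return 1.0 - total_variation_distance(sample, population)


# Hypothetical age-band proportions (not real benchmark data).
population = {"18-29": 0.25, "30-49": 0.35, "50-64": 0.25, "65+": 0.15}
sample = {"18-29": 0.45, "30-49": 0.35, "50-64": 0.15, "65+": 0.05}

score = representativeness_score(sample, population)
print(round(score, 3))
```

Because the distance is symmetric, over- and under-representation of a group lower this score equally, which is the property the abstract contrasts with effective N.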

2602.08347 2026-02-23 stat.ME math.ST stat.TH

Estimating the Shannon Entropy Using the Pitman--Yor Process

Takato Hashino, Koji Tsukuda

Comments 23 pages, 5 figures. Corrected errors

Details
English abstract

The Shannon entropy is a fundamental measure for quantifying diversity and model complexity in fields such as information theory, ecology, and genetics. However, many existing studies assume that the number of species is known, an assumption that is often unrealistic in practice. In recent years, efforts have been made to relax this restriction. Motivated by these developments, this study proposes an entropy estimation method based on the Pitman--Yor process, a representative approach in Bayesian nonparametrics. By approximating the true distribution as an infinite-dimensional process, the proposed method enables stable estimation even when the number of observed species is smaller than the true number of species. This approach provides a principled way to deal with the uncertainty in species diversity and enhances the reliability and robustness of entropy-based diversity assessment. In addition, we investigate the convergence property of the Shannon entropy for regularly varying distributions and use this result to establish the consistency of the proposed estimator. Finally, we demonstrate the effectiveness of the proposed method through numerical experiments.
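To make the unseen-species problem concrete, here is a minimal sketch of the naive plug-in entropy estimator, which only sees observed species. It is a baseline for comparison, not the Pitman--Yor estimator proposed in the paper; the four-species community and the sample are hypothetical.

```python
import math
from collections import Counter


def plugin_entropy(observations):
    """Naive plug-in (maximum-likelihood) Shannon entropy, in nats."""
    counts = Counter(observations)
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())


# Hypothetical community with four equally common species: a, b, c, d.
true_entropy = math.log(4)

# A small sample that happens to miss species "d".
sample = ["a", "a", "b", "c", "a", "b"]
estimate = plugin_entropy(sample)

print(estimate < true_entropy)
```

With species "d" unobserved, the plug-in estimate cannot exceed `log(3)`, so it falls short of the true value here. That gap is what estimators allowing for unobserved species aim to close.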

2602.07480 2026-02-23 math.ST stat.TH

Asymptotically normal estimators in high-dimensional linear regression

Kou Fujimori, Koji Tsukuda

Comments 8 pages. Refined assumptions, corrected minor errors, and improved the presentation

Details
English abstract

We establish asymptotic normality for estimators in high-dimensional linear regression by proving weak convergence in a separable Hilbert space, thereby enabling direct use of standard asymptotic tools, for example, the continuous mapping theorem. The approach allows the number of non-zero coefficients to grow, provided only a fixed number have moderate magnitude. As an application, we test linear hypotheses with a statistic whose null limit is a finite weighted sum of independent chi-squared variables, yielding plug-in critical values with asymptotically correct size.
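A critical value for a finite weighted sum of independent chi-squared variables can be approximated by simulation. The sketch below assumes one degree of freedom per term and uses hypothetical weights; in the paper's setting the weights would be plug-in estimates, and the paper's own procedure is not shown in the abstract.

```python
import random


def weighted_chi2_critical_value(weights, level=0.05, draws=50_000, seed=0):
    """Monte Carlo upper-`level` quantile of sum_k w_k * Z_k^2 with Z_k i.i.d. N(0, 1)."""
    rng = random.Random(seed)
    samples = sorted(
        sum(w * rng.gauss(0.0, 1.0) ** 2 for w in weights)
        for _ in range(draws)
    )
    index = min(draws - 1, int((1.0 - level) * draws))
    return samples[index]


# Sanity check: a single unit weight is a chi-squared variable with 1 degree of freedom.
critical_single = weighted_chi2_critical_value([1.0])

# Hypothetical weights standing in for estimated ones.
critical_mixed = weighted_chi2_critical_value([2.0, 1.0, 0.5])

print(critical_single, critical_mixed)
```

A test at the chosen level rejects the null when the observed statistic exceeds the simulated critical value.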

2601.01679 2026-02-23 stat.ML cs.LG

Simplex Deep Linear Discriminant Analysis

Maxat Tezekbayev, Arman Bolatov, Zhenisbek Assylbekov

Comments Accepted at CPAL 2026. Camera-ready version

Details
English abstract

We revisit Deep Linear Discriminant Analysis (Deep LDA) from a likelihood-based perspective. While classical LDA is a simple Gaussian model with linear decision boundaries, attaching an LDA head to a neural encoder raises the question of how to train the resulting deep classifier by maximum likelihood estimation (MLE). We first show that end-to-end MLE training of an unconstrained Deep LDA model ignores discrimination: when both the LDA parameters and the encoder parameters are learned jointly, the likelihood admits a degenerate solution in which some of the class clusters may heavily overlap or even collapse, and classification performance deteriorates. Batchwise moment re-estimation of the LDA parameters does not remove this failure mode. We then propose a constrained Deep LDA formulation that fixes the class means to the vertices of a regular simplex in the latent space and restricts the shared covariance to be spherical, leaving only the priors and a single variance parameter to be learned along with the encoder. Under these geometric constraints, MLE becomes stable and yields well-separated class clusters in the latent space. On images (Fashion-MNIST, CIFAR-10, CIFAR-100) and texts (AG News, CLINC150), the resulting Deep LDA models achieve accuracy competitive with softmax baselines while offering a simple, interpretable latent geometry that is clearly visible in two-dimensional projections.
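The geometric constraint can be sketched as follows: class means sit at the vertices of a regular simplex, and with a shared spherical covariance the LDA rule reduces to a prior-adjusted nearest-mean rule. The centered one-hot construction, the unit-norm scaling, and the function names are assumptions for illustration; the paper's exact parameterization is not given in the abstract.

```python
import math


def simplex_vertices(num_classes):
    """Unit vectors forming a regular simplex (requires num_classes >= 2).

    Built by centering the one-hot vectors and rescaling to unit norm; the
    vectors live in R^num_classes but span a (num_classes - 1)-dim subspace.
    """
    c = num_classes
    scale = 1.0 / math.sqrt(1.0 - 1.0 / c)
    return [
        [((1.0 if i == j else 0.0) - 1.0 / c) * scale for j in range(c)]
        for i in range(c)
    ]


def predict(z, means, log_priors, variance):
    """LDA rule with spherical covariance: argmax_k log(prior_k) - ||z - mu_k||^2 / (2 * variance)."""
    def score(k):
        sq_dist = sum((a - b) ** 2 for a, b in zip(z, means[k]))
        return log_priors[k] - sq_dist / (2.0 * variance)

    return max(range(len(means)), key=score)


means = simplex_vertices(4)
log_priors = [math.log(0.25)] * 4
label = predict(means[2], means, log_priors, variance=0.5)
print(label)
```

Every pair of vertices is equally far apart, so no class is geometrically favored; only the priors and the single variance remain to be learned alongside the encoder.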

2512.12051 2026-02-23 stat.CO

StochTree: BART-based modeling in R and Python

Andrew Herren, P. Richard Hahn, Jared Murray, Carlos Carvalho

Details
English abstract

stochtree is a C++ library for Bayesian tree ensemble models such as BART and Bayesian Causal Forests (BCF), as well as user-specified variations. Unlike previous BART packages, stochtree provides bindings to both R and Python for full interoperability. stochtree boasts a more comprehensive range of models relative to previous packages, including heteroskedastic forests, random effects, and treed linear models. Additionally, stochtree offers flexible handling of model fits: the ability to save model fits, reinitialize models from existing fits (facilitating improved model initialization heuristics), and pass fits between R and Python. On both platforms, stochtree exposes lower-level functionality, allowing users to specify models incorporating Bayesian tree ensembles without needing to modify C++ code. We illustrate the use of stochtree in three settings: i) straightforward applications of existing models such as BART and BCF, ii) models that include more sophisticated components like heteroskedasticity and leaf-wise regression models, and iii) as a component of custom MCMC routines to fit nonstandard tree ensemble models.

2512.09273 2026-02-23 stat.ME math.ST stat.TH

On the inverse of covariance matrices for unbalanced crossed designs

Ziyang Lyu, S. A. Sisson, A. H. Welsh

Comments 43 pages

Details
English abstract

This paper addresses a long-standing open problem in the analysis of linear mixed models with crossed random effects under unbalanced designs: how to find an analytic expression for the inverse of $\mathbf{V}$, the covariance matrix of the observed response. The inverse matrix $\mathbf{V}^{-1}$ is required for likelihood-based estimation and inference. However, for unbalanced crossed designs, $\mathbf{V}$ is dense and the lack of a closed-form representation for $\mathbf{V}^{-1}$, until now, has made using likelihood-based methods computationally challenging and difficult to analyse mathematically. We use the Khatri--Rao product to represent $\mathbf{V}$ and then to construct a modified covariance matrix whose inverse admits an exact spectral decomposition. Building on this construction, we obtain an elegant and simple approximation to $\mathbf{V}^{-1}$ for asymptotic unbalanced designs. For non-asymptotic settings, we derive an accurate and interpretable approximation under mildly unbalanced data and establish an exact inverse representation as a low-rank correction to this approximation, applicable to arbitrary degrees of unbalance. Simulation studies demonstrate the accuracy, stability, and computational tractability of the proposed framework.

2511.18555 2026-02-23 stat.ME cs.LG math.DS stat.ML

A joint optimization approach to identifying sparse dynamics using least squares kernel collocation

Alexander W. Hsu, Ike Griss Salas, Jacob M. Stevens-Haas, J. Nathan Kutz, Aleksandr Aravkin, Bamdad Hosseini

Details
English abstract

We develop an all-at-once modeling framework for learning systems of ordinary differential equations (ODEs) from scarce, partial, and noisy observations of the states. The proposed methodology combines sparse recovery strategies for the ODE over a function library with techniques from reproducing kernel Hilbert space (RKHS) theory for estimating the state and discretizing the ODE. Our numerical experiments reveal that the proposed strategy leads to significant gains in accuracy, sample efficiency, and robustness to noise, both in learning the equation and in estimating the unknown states. This work demonstrates capabilities well beyond existing and widely used algorithms while extending the modeling flexibility of other recent developments in equation discovery.

2511.05983 2026-02-23 stat.ML cs.LG

Benchmarking of Clustering Validity Measures Revisited

Connor Simpson, Ricardo J. G. B. Campello, Elizabeth Stojanovski

Comments 48 pages, 17 tables, 17 figures

Details
English abstract

Validation plays a crucial role in the clustering process. Many different internal validity indexes exist for the purpose of determining the best clustering solution(s) from a given collection of candidates, e.g., as produced by different algorithms or different algorithm hyper-parameters. In this study, we present a comprehensive benchmark study of 26 internal validity indexes, which includes highly popular classic indexes as well as more recently developed ones. We adopted an enhanced revision of the methodology presented in Vendramin et al. (2010), developed here to address several shortcomings of this previous work. This overall new approach consists of three complementary custom-tailored evaluation sub-methodologies, each of which has been designed to assess specific aspects of an index's behaviour while preventing potential biases of the other sub-methodologies. Each sub-methodology features two complementary measures of performance, alongside mechanisms that allow for an in-depth investigation of more complex behaviours of the internal validity indexes under study. Additionally, a new collection of 16177 datasets has been produced, paired with eight widely-used clustering algorithms, for a wider applicability scope and representation of more diverse clustering scenarios.

2510.18322 2026-02-23 cs.LG stat.ML

Uncertainty Estimation by Flexible Evidential Deep Learning

Taeseong Yoon, Heeyoung Kim

Comments NeurIPS 2025

Details
English abstract

Uncertainty quantification (UQ) is crucial for deploying machine learning models in high-stakes applications, where overconfident predictions can lead to serious consequences. An effective UQ method must balance computational efficiency with the ability to generalize across diverse scenarios. Evidential deep learning (EDL) achieves efficiency by modeling uncertainty through the prediction of a Dirichlet distribution over class probabilities. However, the restrictive assumption of Dirichlet-distributed class probabilities limits EDL's robustness, particularly in complex or unforeseen situations. To address this, we propose \textit{flexible evidential deep learning} ($\mathcal{F}$-EDL), which extends EDL by predicting a flexible Dirichlet distribution -- a generalization of the Dirichlet distribution -- over class probabilities. This approach provides a more expressive and adaptive representation of uncertainty, significantly enhancing UQ generalization and reliability under challenging scenarios. We theoretically establish several advantages of $\mathcal{F}$-EDL and empirically demonstrate its state-of-the-art UQ performance across diverse evaluation settings, including classical, long-tailed, and noisy in-distribution scenarios.
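For context, the sketch below shows how standard evidential deep learning turns non-negative per-class evidence into a Dirichlet distribution and a scalar uncertainty. It illustrates the baseline EDL parameterization only, not the flexible Dirichlet of $\mathcal{F}$-EDL; the evidence values are hypothetical stand-ins for network outputs.

```python
def dirichlet_summary(evidence):
    """Map non-negative class evidence to expected probabilities and vacuity.

    Standard EDL convention: alpha_k = evidence_k + 1, expected probability
    alpha_k / S, and vacuity K / S, where S is the sum of the alphas.
    """
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    probs = [a / strength for a in alpha]
    vacuity = len(alpha) / strength
    return probs, vacuity


# Strong evidence for class 0 versus no evidence at all (hypothetical outputs).
probs_confident, vacuity_confident = dirichlet_summary([9.0, 0.0, 0.0])
probs_unknown, vacuity_unknown = dirichlet_summary([0.0, 0.0, 0.0])

print(probs_confident, vacuity_confident)
print(probs_unknown, vacuity_unknown)
```

With zero evidence the Dirichlet is uniform and vacuity equals 1, so a single forward pass yields both a prediction and an uncertainty estimate.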

2510.17561 2026-02-23 math.ST cond-mat.dis-nn cs.LG stat.ML stat.TH

Spectral Thresholds in Correlated Spiked Models and Fundamental Limits of Partial Least Squares

Pierre Mergny, Lenka Zdeborová

Comments 24 pages, 4 figures

Journal ref AISTATS 2026

Details
English abstract

We provide a rigorous random matrix theory analysis of spiked cross-covariance models where the signals across two high-dimensional data channels are partially aligned. These models are motivated by multi-modal learning and form the standard generative setting underlying Partial Least Squares (PLS), a widely used yet theoretically underdeveloped method. We show that the leading singular values of the sample cross-covariance matrix undergo a Baik-Ben Arous-Péché (BBP)-type phase transition, and we characterize the precise thresholds for the emergence of informative components. Our results yield the first sharp asymptotic description of the signal recovery capabilities of PLS in this setting, revealing a fundamental performance gap between PLS and the Bayes-optimal estimator. In particular, we identify the SNR and correlation regimes where PLS fails to recover any signal, despite detectability being possible in principle. These findings clarify the theoretical limits of PLS and provide guidance for the design of reliable multi-modal inference methods in high dimensions.

2510.00545 2026-02-23 stat.ML cs.LG

Bayesian Neural Networks for Functional ANOVA model

Seokhun Park, Choeun Kim, Jihu Lee, Yunseop Shin, Insung Kong, Yongdai Kim

Details
English abstract

With the increasing demand for interpretability in machine learning, functional ANOVA decomposition has gained renewed attention as a principled tool for breaking down a high-dimensional function into low-dimensional components that reveal the contributions of different variable groups. Recently, the Tensor Product Neural Network (TPNN) has been developed and applied as basis functions in the functional ANOVA model, referred to as ANOVA-TPNN. A disadvantage of ANOVA-TPNN, however, is that the components to be estimated must be specified in advance, which makes it difficult to incorporate higher-order TPNNs into the functional ANOVA model due to computational and memory constraints. In this work, we propose Bayesian-TPNN, a Bayesian inference procedure for the functional ANOVA model with TPNN basis functions, enabling the detection of higher-order components with reduced computational cost compared to ANOVA-TPNN. We develop an efficient MCMC algorithm and demonstrate that Bayesian-TPNN performs well by analyzing multiple benchmark datasets. Theoretically, we prove that the posterior of Bayesian-TPNN is consistent.

2509.02073 2026-02-23 stat.ML cond-mat.dis-nn cond-mat.stat-mech cs.LG physics.soc-ph

Inference in Spreading Processes with Neural-Network Priors

Davide Ghio, Fabrizio Boncoraglio, Lenka Zdeborová

Comments 26 pages, 13 figures

Journal ref Phys. Rev. E 113, 015301, 2026

Details
English abstract

Stochastic processes on graphs are a powerful tool for modelling complex dynamical systems such as epidemics. A recent line of work focused on the inference problem where one aims to estimate the state of every node at every time, starting from partial observation of a subset of nodes at a subset of times. In these works, the initial state of the process was assumed to be random i.i.d. over nodes. Such an assumption may not be realistic in practice, where one may have access to a set of covariate variables for every node that influence the initial state of the system. In this work, we will assume that the initial state of a node is an unknown function of such covariate variables. Given that functions can be represented by neural networks, we will study a model where the initial state is given by a simple neural network -- notably the single-layer perceptron acting on the known node-wise covariate variables. Within a Bayesian framework, we study how such neural-network prior information enhances the recovery of initial states and spreading trajectories. We derive a hybrid belief propagation and approximate message passing (BP-AMP) algorithm that handles both the spreading dynamics and the information included in the node covariates, and we assess its performance against the estimators that either use only the spreading information or use only the information from the covariate variables. We show that in some regimes, the model can exhibit first-order phase transitions when using a Rademacher distribution for the neural-network weights. These transitions create a statistical-to-computational gap where even the BP-AMP algorithm, despite the theoretical possibility of perfect recovery, fails to achieve it.

2508.15637 2026-02-23 cs.LG cs.CL stat.AP

Classification errors distort findings in automated speech processing: examples and solutions from child-development research

Lucas Gautheron, Evan Kidd, Anton Malko, Marvin Lavechin, Alejandrina Cristia

Details
English abstract

With the advent of wearable recorders, scientists are increasingly turning to automated methods of analysis of audio and video data in order to measure children's experience, behavior, and outcomes, with a sizable literature employing long-form audio-recordings to study language acquisition. While numerous articles report on the accuracy and reliability of the most popular automated classifiers, less has been written on the downstream effects of classification errors on measurements and statistical inferences (e.g., the estimate of correlations and effect sizes in regressions). This paper's main contributions are drawing attention to downstream effects of confusion errors, and providing an approach to measure and potentially recover from these errors. Specifically, we use a Bayesian approach to study the effects of algorithmic errors on key scientific questions, including the effect of siblings on children's language experience and the association between children's production and their input. By fitting a joint model of speech behavior and algorithm behavior on real and simulated data, we show that classification errors can significantly distort estimates for both LENA, the most commonly used system, and a slightly more accurate open-source alternative (the Voice Type Classifier from the ACLEW system). We further show that a Bayesian calibration approach for recovering unbiased estimates of effect sizes can be effective and insightful, but does not provide a fool-proof solution.

2507.17316 2026-02-23 stat.ML cs.LG

Nearly Minimax Discrete Distribution Estimation in Kullback-Leibler Divergence with High Probability

Dirk van der Hoeven, Julia Olkhovskaia, Tim van Erven

Details
English abstract

We consider the fundamental problem of estimating a discrete distribution on a domain of size $K$ with high probability in Kullback-Leibler divergence. We provide upper and lower bounds on the minimax estimation rate, which show that the optimal rate is between $\big(K + \ln(K)\ln(1/\delta)\big)/n$ and $\big(K\ln\ln(K) + \ln(K)\ln(1/\delta)\big)/n$ at error probability $\delta$ and sample size $n$, which pins down the rate up to the doubly logarithmic factor $\ln \ln K$ that multiplies $K$. Our upper bound uses techniques from online learning to construct a novel estimator via online-to-batch conversion. Perhaps surprisingly, the tail behavior of the minimax rate is worse than for the squared total variation and squared Hellinger distance, for which it is $\big(K + \ln(1/\delta)\big)/n$, i.e., without the $\ln K$ multiplying $\ln(1/\delta)$. As a consequence, we cannot obtain a fully tight lower bound from the usual reduction to these smaller distances. Moreover, we show that this lower bound cannot be achieved by the standard lower bound approach based on a reduction to hypothesis testing, and instead we need to introduce a new reduction to what we call weak hypothesis testing. We investigate the source of the gap with other divergences further in refined results, which show that the total variation rate is achievable for Kullback-Leibler divergence after all (in fact by the maximum likelihood estimator) if we rule out outcome probabilities smaller than $O(\ln(K/\delta) / n)$, which is a vanishing set as $n$ increases for fixed $K$ and $\delta$. This explains why minimax Kullback-Leibler estimation is more difficult than asymptotic estimation.
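The role of small outcome probabilities can be seen in a toy computation: the maximum likelihood estimator incurs infinite Kullback-Leibler loss as soon as an outcome with positive probability goes unobserved, whereas a smoothed estimator stays finite. The add-one (Laplace) estimator below is a textbook stand-in chosen for illustration, not the online-to-batch estimator of the paper; the distribution and counts are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; infinite if q assigns zero mass where p does not."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue
        if q_i == 0.0:
            return math.inf
        total += p_i * math.log(p_i / q_i)
    return total


def mle(counts):
    """Empirical frequencies."""
    n = sum(counts)
    return [c / n for c in counts]


def add_constant(counts, alpha=1.0):
    """Add-alpha smoothing; alpha = 1 is the Laplace estimator."""
    n = sum(counts)
    k = len(counts)
    return [(c + alpha) / (n + alpha * k) for c in counts]


p = [0.7, 0.2, 0.1]   # hypothetical true distribution
counts = [8, 2, 0]    # a sample of size 10 that misses the rarest outcome

kl_mle = kl_divergence(p, mle(counts))
kl_smooth = kl_divergence(p, add_constant(counts))
print(kl_mle, kl_smooth)
```

Excluding probabilities below a threshold of order $\ln(K/\delta)/n$, as in the refined results, makes such missed outcomes unlikely, which is consistent with the maximum likelihood estimator becoming competitive in that regime.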

2507.09043 2026-02-23 cs.LG stat.ML

GAGA: Gaussianity-Aware Gaussian Approximation for Efficient 3D Molecular Generation

Jingxiang Qu, Wenhan Gao, Ruichen Xu, Yi Liu

Details
English abstract

Gaussian Probability Path based Generative Models (GPPGMs) generate data by reversing a stochastic process that progressively corrupts samples with Gaussian noise. Despite state-of-the-art results in 3D molecular generation, their deployment is hindered by the high cost of long generative trajectories, often requiring hundreds to thousands of steps during training and sampling. In this work, we propose a principled method, named GAGA, to improve generation efficiency without sacrificing training granularity or inference fidelity of GPPGMs. Our key insight is that different data modalities obtain sufficient Gaussianity at markedly different steps during the forward process. Based on this observation, we analytically identify a characteristic step at which molecular data attains sufficient Gaussianity, after which the trajectory can be replaced by a closed-form Gaussian approximation. Unlike existing accelerators that coarsen or reformulate trajectories, our approach preserves full-resolution learning dynamics while avoiding redundant transport through truncated distributional states. Experiments on 3D molecular generation benchmarks demonstrate that our GAGA achieves substantial improvement on both generation quality and computational efficiency.

2506.08864 2026-02-23 stat.ME

Safety-Driven Response Adaptive Randomisation: An Application in Non-inferiority Oncology Trials

Maria Vittoria Chiaruttini, Lukas Pin, Sofia S. Villar

Comments 31 pages, 5 figures, 4 tables

Details
English abstract

The majority of response-adaptive randomisation (RAR) designs in the literature rely on efficacy data to guide dynamic patient allocation. However, their applicability becomes limited in settings where efficacy outcomes, such as survival, are observed with a random delay. To address this limitation, we introduce SAFER, a novel RAR design that leverages early-emerging safety data to inform treatment allocation decisions, particularly in oncology trials. The design is broadly applicable to contexts where prioritizing the arm with a superior safety is desirable. This is especially relevant in non-inferiority trials, to demonstrate that an experimental treatment is not inferior to the standard of care, while potentially offering improved tolerability. In such trials, an unavoidable trade-off arises: maintaining statistical efficiency for the efficacy hypothesis while integrating safety-driven adaptations through RAR. The SAFER design addresses this trade-off by dynamically adjusting the allocation proportion based on the observed association between safety and efficacy endpoints. We illustrate the performance of SAFER through a simulation study inspired by the CAPP-IT Phase III oncology trial. Results show that SAFER preserves statistical power, reduces the adverse event rate, and offers flexible adaptation speed depending on the temporal alignment of the endpoints.

2506.02394 2026-02-23 stat.ME cs.LG

Joint modeling for learning decision-making dynamics in behavioral experiments

Yuan Bian, Xingche Guo, Yuanjia Wang

Journal ref The Annals of Applied Statistics, 19(4): 3372-3393, 2025

Details
English abstract

Major depressive disorder (MDD), a leading cause of disability and mortality, is associated with reward-processing abnormalities and concentration issues. Motivated by the probabilistic reward task from the Establishing Moderators and Biosignatures of Antidepressant Response in Clinical Care (EMBARC) study, we propose a novel framework that integrates the reinforcement learning (RL) model and drift-diffusion model (DDM) to jointly analyze reward-based decision-making with response times. To account for emerging evidence suggesting that decision-making may alternate between multiple interleaved strategies, we model latent state switching using a hidden Markov model (HMM). In the "engaged" state, decisions follow an RL-DDM, simultaneously capturing reward processing, decision dynamics, and temporal structure. In contrast, in the "lapsed" state, decision-making is modeled using a simplified DDM, where specific parameters are fixed to approximate random guessing with equal probability. The proposed method is implemented using a computationally efficient generalized expectation-maximization (EM) algorithm with forward-backward procedures. Through extensive numerical studies, we demonstrate that our proposed method outperforms competing approaches across various reward-generating distributions, under both strategy-switching and non-switching scenarios, as well as in the presence of input perturbations. When applied to the EMBARC study, our framework reveals that MDD patients exhibit lower overall engagement than healthy controls and experience longer decision times when they do engage. Additionally, we show that neuroimaging measures of brain activities are associated with decision-making characteristics in the "engaged" state but not in the "lapsed" state, providing evidence of brain-behavior association specific to the "engaged" state.

2505.18150 2026-02-23 cs.LG q-bio.QM stat.ML

Generative Distribution Embeddings: Lifting autoencoders to the space of distributions for multiscale representation learning

Nic Fishman, Gokul Gowri, Peng Yin, Jonathan Gootenberg, Omar Abudayyeh

Comments NeurIPS 2025

Details
English abstract

Many real-world problems require reasoning across multiple scales, demanding models which operate not on single data points, but on entire distributions. We introduce generative distribution embeddings (GDE), a framework that lifts autoencoders to the space of distributions. In GDEs, an encoder acts on sets of samples, and the decoder is replaced by a generator which aims to match the input distribution. This framework enables learning representations of distributions by coupling conditional generative models with encoder networks which satisfy a criterion we call distributional invariance. We show that GDEs learn predictive sufficient statistics embedded in the Wasserstein space, such that latent GDE distances approximately recover the $W_2$ distance, and latent interpolation approximately recovers optimal transport trajectories for Gaussian and Gaussian mixture distributions. We systematically benchmark GDEs against existing approaches on synthetic datasets, demonstrating consistently stronger performance. We then apply GDEs to six key problems in computational biology: learning donor-level representations from single-nuclei RNA sequencing data (6M cells), capturing clonal dynamics in lineage-traced RNA sequencing data (150K cells), predicting perturbation effects on transcriptomes (1M cells), predicting perturbation effects on cellular phenotypes (20M single-cell images), designing synthetic yeast promoters (34M sequences), and spatiotemporal modeling of viral protein sequences (1M sequences).

2505.14825 2026-02-23 cs.LG math.ST physics.data-an stat.ME stat.ML stat.TH

Assimilative Causal Inference

Marios Andreou, Nan Chen, Erik Bollt

Comments 47 pages (Main Text pp. 1--17; Supplementary Information pp. 18--47), 11 figures (3 in Main Text, 8 in Supplementary Information). Published in Nature Communications. The MATLAB code used in the analyses and to generate the figures in this work can be found at https://github.com/marandmath/ACI_code. For further details, visit https://mariosandreou.short.gy/ACI

Journal ref Nature Communications 17, 1854 (2026)

Abstract

Causal inference is fundamental across scientific disciplines, yet existing methods struggle to capture instantaneous, time-evolving causal relationships in complex, high-dimensional systems. This paper develops assimilative causal inference (ACI), a methodological framework that leverages Bayesian data assimilation to trace causes backward from observed effects. ACI solves the inverse problem rather than quantifying forward influence. It uniquely identifies dynamic causal interactions without requiring observations of candidate causes, accommodates short datasets, and, in principle, can be implemented in high-dimensional settings by employing efficient data assimilation algorithms. Crucially, it provides online tracking of causal roles that may reverse intermittently and facilitates a mathematically rigorous criterion for the causal influence range, revealing how far effects propagate. The effectiveness of ACI is demonstrated on complex dynamical systems showcasing intermittency and extreme events. ACI opens valuable pathways for studying complex systems, where transient causal structures are critical.

2503.18980 2026-02-23 stat.ML cs.AI cs.LG

CAE: Repurposing the Critic as an Explorer in Deep Reinforcement Learning

Yexin Li

Abstract

Exploration remains a fundamental challenge in reinforcement learning, as many existing methods either lack theoretical guarantees or fall short in practical effectiveness. In this paper, we propose CAE, i.e., the Critic as an Explorer, a lightweight approach that repurposes the value networks in standard deep RL algorithms to drive exploration, without introducing additional parameters. CAE leverages multi-armed bandit techniques combined with a tailored scaling strategy, enabling efficient exploration with provable sub-linear regret bounds and strong empirical stability. Remarkably, it is simple to implement, requiring only about 10 lines of code. For complex tasks where learning reliable value networks is difficult, we introduce CAE+, an extension of CAE that incorporates an auxiliary network. CAE+ increases the parameter count by less than 1% while preserving implementation simplicity, adding roughly 10 additional lines of code. Extensive experiments on MuJoCo, MiniHack, and Habitat validate the effectiveness of CAE and CAE+, highlighting their ability to unify theoretical rigor with practical efficiency.

2503.07313 2026-02-23 stat.ML cs.LG

The influence of missing data mechanisms and simple missing data handling techniques on fairness

Aeysha Bhatti, Trudie Sandrock, Johane Nienkemper-Swanepoel

Abstract

Machine learning algorithms permeate the day-to-day aspects of our lives, and therefore studying the fairness of these algorithms before implementation is crucial. One way in which bias can manifest in a dataset is through missing values. Missing data are often assumed to be missing completely at random; in reality, the propensity of data being missing is often tied to the demographic characteristics of individuals. There is limited research into how missing values and the handling thereof can impact the fairness of an algorithm. Most researchers either apply listwise deletion or tend to use simpler methods of imputation (e.g. mean or mode) compared to more advanced approaches (e.g. multiple imputation). This study considers the fairness of various classification algorithms after a range of missing data handling strategies is applied. Missing values are generated (i.e. amputed) in three popular datasets for classification fairness, by creating a high percentage of missing values using three missing data mechanisms. The results show that the missing data mechanism does not significantly impact fairness; across the missing data handling techniques, listwise deletion gives the highest fairness on average, and amongst the classification algorithms, random forests lead to the highest fairness on average. The interaction effect of the missing data handling technique and the classification algorithm is also often significant.
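The two simplest handling strategies named above, listwise deletion and mean imputation, can be sketched as follows. This is a minimal illustration on a toy table, not the study's pipeline; the function names and data are made up for the example, and missing entries are represented as `None`.

```python
from statistics import mean


def listwise_delete(rows):
    """Drop every row that contains at least one missing value."""
    return [r for r in rows if all(v is not None for v in r)]


def mean_impute(rows):
    """Replace each missing value with the mean of the observed values
    in its column. Assumes every column has at least one observed value."""
    n_cols = len(rows[0])
    col_means = [
        mean(r[j] for r in rows if r[j] is not None) for j in range(n_cols)
    ]
    return [
        [col_means[j] if r[j] is None else r[j] for j in range(n_cols)]
        for r in rows
    ]


# Toy data: two numeric features, two rows with one missing entry each
rows = [[1.0, 10.0], [None, 20.0], [3.0, None], [5.0, 30.0]]
deleted = listwise_delete(rows)
imputed = mean_impute(rows)
```

Deletion shrinks the sample (here from four rows to two), which can change the demographic make-up of the training data when missingness depends on group membership; imputation keeps every row but pulls filled-in values towards the column mean.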

2502.21276 2026-02-23 stat.ME

Boosting prediction with data missing not at random

Yuan Bian, Grace Y. Yi, Wenqing He

Journal ref Journal of Computational and Graphical Statistics, 35(1): 406-417, 2026

Abstract

Boosting has emerged as a useful machine learning technique over the past three decades, attracting increased attention. Most advancements in this area, however, have primarily focused on numerical implementation procedures, often lacking rigorous theoretical justifications. Moreover, these approaches are generally designed for datasets with fully observed data, and their validity can be compromised by the presence of missing observations. In this paper, we employ semiparametric estimation approaches to develop boosting prediction methods for data with missing responses. We explore two strategies for adjusting the loss functions to account for missingness effects. The proposed methods are implemented using a functional gradient descent algorithm, and their theoretical properties, including algorithm convergence and estimator consistency, are rigorously established. Numerical studies demonstrate that the proposed methods perform well in finite sample settings.

2501.00755 2026-02-23 stat.ML cs.AI cs.LG stat.ME

An AI-powered Bayesian generative modeling approach for causal inference in observational studies

Qiao Liu, Wing Hung Wong

Abstract

Causal inference in observational studies with high-dimensional covariates presents significant challenges. We introduce CausalBGM, an AI-powered Bayesian generative modeling approach that captures the causal relationship among covariates, treatment, and outcome. The core innovation is to estimate the individual treatment effect (ITE) by learning the individual-specific distribution of a low-dimensional latent feature set (e.g., latent confounders) that drives changes in both treatment and outcome. This individualized posterior representation yields estimates of the ITE together with well-calibrated posterior intervals while mitigating confounding effects. CausalBGM is fitted through an iterative algorithm that updates the model parameters and the latent features until convergence. This framework leverages the power of AI to capture complex dependencies among variables while adhering to Bayesian principles. Extensive experiments demonstrate that CausalBGM consistently outperforms state-of-the-art methods, particularly in scenarios with high-dimensional covariates and large-scale datasets. By addressing key limitations of existing methods, CausalBGM emerges as a robust and promising framework for advancing causal inference in a wide range of modern applications. The code for CausalBGM is available at https://github.com/liuq-lab/bayesgm. The documentation for CausalBGM is available at https://bayesgm.readthedocs.io.

2411.18799 2026-02-23 stat.AP stat.ML

Density correction for multivariate spatial fields of global climate model output using deep learning

Reetam Majumder, Shiqi Fang, A. Sankarasubramanian, Emily C. Hector, Brian J. Reich

Abstract

Global Climate Models (GCMs) are numerical models that simulate complex physical processes within the Earth's climate system and are essential for understanding and predicting climate change. However, GCMs suffer from systemic biases due to simplifications made to the underlying physical processes. GCM output therefore needs to be bias corrected before it can be used for future climate projections. Most common bias correction methods, however, cannot preserve spatial, temporal, or inter-variable dependencies. We propose a new semi-parametric estimation of conditional densities (SPECD) approach for density correction of the joint distribution of daily precipitation and maximum temperature data obtained from gridded GCM spatial fields. The Vecchia approximation is employed to preserve dependencies in the observed field during the density correction process, which is carried out using semi-parametric quantile regression. The ability to calibrate joint distributions of GCM projections has potential advantages not only in estimating extremes, but also in better estimating compound hazards, like heat waves and drought, under potential climate change. An illustration on historical data from 1951-2014 over two 5 x 5 spatial grids in the US indicates that SPECD can preserve key marginal and joint distribution properties of precipitation and maximum temperature, and that predictions obtained using SPECD are better calibrated than predictions using asynchronous quantile mapping and canonical correlation analysis, two commonly used bias correction approaches.

2403.08691 2026-02-23 math.PR math.ST stat.TH

Large deviations for Independent Metropolis Hastings and Metropolis-adjusted Langevin algorithm

Federica Milinanni, Pierre Nyquist

Comments 30 pages, 0 figures, 1 table

Abstract

In this paper, we prove large deviation principles for the empirical measures associated with the Independent Metropolis Hastings (IMH) sampler and the Metropolis-adjusted Langevin Algorithm (MALA). These are the first large deviation results for empirical measures of Markov chains arising from specific Metropolis-Hastings methods on a continuous state space. Moreover, we show that the existing large deviation framework that we developed in previous work (Milinanni and Nyquist, 2024) does not cover the Random Walk Metropolis sampler, even in cases when the underlying Markov chain is geometrically ergodic.
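For orientation, a textbook one-dimensional MALA chain can be sketched as below: a Langevin proposal followed by a Metropolis-Hastings accept/reject step. This is a generic illustration, not code from the paper; the function name `mala_chain`, the standard normal target, the step size 0.5, and the chain length are all choices made for the example.

```python
import math
import random


def mala_chain(log_density, grad_log_density, x0, step, n_steps, rng):
    """Run a 1D Metropolis-adjusted Langevin chain and return all states."""

    def log_q(dst, src):
        # Log density (up to a constant) of proposing dst from src:
        # Gaussian with mean src + step * grad and variance 2 * step
        mu = src + step * grad_log_density(src)
        return -((dst - mu) ** 2) / (4.0 * step)

    x = x0
    samples = []
    for _ in range(n_steps):
        # Langevin proposal: gradient drift plus Gaussian noise
        y = x + step * grad_log_density(x) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
        # Metropolis-Hastings log acceptance ratio
        log_alpha = log_density(y) + log_q(x, y) - log_density(x) - log_q(y, x)
        if rng.random() < math.exp(min(0.0, log_alpha)):
            x = y
        samples.append(x)
    return samples


# Illustrative target: standard normal, log pi(x) = -x^2 / 2 up to a constant
rng = random.Random(0)
samples = mala_chain(lambda x: -0.5 * x * x, lambda x: -x, 0.0, 0.5, 20000, rng)
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
```

The empirical measure studied in the abstract is the distribution that puts equal weight on each entry of `samples`; quantities such as `sample_mean` and `sample_var` are integrals against it.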

2402.08621 2026-02-23 cs.LG math.OC stat.ML

A Unified Framework for Analyzing Meta-algorithms in Online Convex Optimization

Mohammad Pedramfar, Vaneet Aggarwal

Comments in Proc. AAMAS 2026

Journal ref Proc. AAMAS 2026

Abstract

In this paper, we analyze the problem of online convex optimization in different settings, including different feedback types (full-information/semi-bandit/bandit/etc.) in either stochastic or non-stochastic settings and different notions of regret (static adversarial regret/dynamic regret/adaptive regret). This is done through a framework which allows us to systematically propose and analyze meta-algorithms for the various settings described above. We show that any algorithm for online linear optimization with deterministic gradient feedback against fully adaptive adversaries is an algorithm for online convex optimization. We also show that any such algorithm that requires full-information feedback may be transformed into an algorithm with semi-bandit feedback with a comparable regret bound. We further show that algorithms that are designed for fully adaptive adversaries using deterministic semi-bandit feedback can obtain similar bounds using only stochastic semi-bandit feedback when facing oblivious adversaries. We use this to describe general meta-algorithms to convert first-order algorithms to zeroth-order algorithms with comparable regret bounds. Our framework allows us to analyze online optimization in various settings, recovers several results in the literature with a simplified proof technique, and provides new results.

2312.12710 2026-02-23 stat.ME

Semiparametric Copula Estimation for Spatially Correlated Multivariate Mixed Outcomes: Analyzing Visual Sightings of Fin Whales from a Line Transect Survey

Tomotaka Momozaki, Tomoyuki Nakagawa, Shonosuke Sugasawa, Hiroko Kato Solvang

Comments 65 pages, 26 figures

Abstract

For marine biologists, ascertaining the dependence structures between marine species and marine environments, such as sea surface temperature and ocean depth, is imperative for defining ecosystem functioning and providing insights into the dynamics of marine ecosystems. However, obtained data include not only continuous but also discrete data, such as binaries and counts (referred to as mixed outcomes), as well as spatial correlations, both of which make conventional multivariate analysis tools impractical. To solve this issue, we propose semiparametric Bayesian inference and develop an efficient algorithm for computing the posterior of the dependence structure based on the rank likelihood under a latent multivariate spatial Gaussian process using the Markov chain Monte Carlo method. To alleviate the computational intractability caused by the Gaussian process, we also provide a scalable implementation that leverages the nearest-neighbor Gaussian process. Extensive numerical experiments reveal that the proposed method reliably infers the dependence structures of spatially correlated mixed outcomes. Finally, we apply the proposed method to a dataset collected during an international synoptic krill survey in the Scotia Sea of the Antarctic Peninsula to infer the dependence structure between fin whales (Balaenoptera physalus), krill biomass, and relevant oceanographic data.

2203.16462 2026-02-23 cs.LG cs.NE math.OC math.PR stat.ML

Convergence of gradient descent for deep neural networks

Sourav Chatterjee

Comments 30 pages, 3 figures. Numerical experiments added in this revision

Abstract

We give a simple local Polyak-Lojasiewicz (PL) criterion that guarantees linear (exponential) convergence of gradient flow and gradient descent to a zero-loss solution of a nonnegative objective. We then verify this criterion for the squared training loss of a feedforward neural network with smooth, strictly increasing activation functions, in a regime that is complementary to the usual over-parameterized analyses: the network width and depth are fixed, while the input data vectors are assumed to be linearly independent (in particular, the ambient input dimension is at least the number of data points). A notable feature of the verification is that it is constructive: it leads to a simple "positive" initialization (zero first-layer weights, strictly positive hidden-layer weights, and sufficiently large output-layer weights) under which gradient descent provably converges to an interpolating global minimizer of the training loss. We also discuss a probabilistic corollary for random initializations, clarify its dependence on the probability of the required initialization event, and provide numerical experiments showing that this theory-guided initialization can substantially accelerate optimization relative to standard random initializations at the same width.
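As background, the standard global form of the PL inequality and the one-line argument for exponential convergence of gradient flow are sketched below. This is the textbook statement for a nonnegative objective $f$ with minimum value zero and an assumed constant $\mu > 0$; the paper's local criterion is a different, weaker condition and is not reproduced here.

```latex
% Global PL inequality for a nonnegative differentiable f with \min f = 0
\|\nabla f(x)\|^2 \;\ge\; 2\mu\, f(x) \qquad \text{for all } x.

% Along gradient flow \dot{x}(t) = -\nabla f(x(t)):
\frac{d}{dt} f(x(t)) \;=\; -\|\nabla f(x(t))\|^2 \;\le\; -2\mu\, f(x(t))
\quad\Longrightarrow\quad
f(x(t)) \;\le\; e^{-2\mu t}\, f(x(0)).
```

The implication is Grönwall's inequality applied to $t \mapsto f(x(t))$, which is why "linear" convergence in the optimization sense corresponds to exponential decay of the loss.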

2112.05128 2026-02-23 stat.ML cs.LG

Fair Community Detection and Structure Learning in Heterogeneous Graphical Models

Davoud Ataee Tarzanagh, Laura Balzano, Alfred O. Hero

Abstract

Inference of community structure in probabilistic graphical models may not be consistent with fairness constraints when nodes have demographic attributes. Certain demographics may be over-represented in some detected communities and under-represented in others. This paper defines a novel $\ell_1$-regularized pseudo-likelihood approach for fair graphical model selection. In particular, we assume there is some community or clustering structure in the true underlying graph, and we seek to learn a sparse undirected graph and its communities from the data such that demographic groups are fairly represented within the communities. In the case when the graph is known a priori, we provide a convex semidefinite programming approach for fair community detection. We establish the statistical consistency of the proposed method for both a Gaussian graphical model and an Ising model for, respectively, continuous and binary data, proving that our method can recover the graphs and their fair communities with high probability.