arXivDaily: arXiv daily academic digest, updated Monday through Friday
2603.13207 2026-03-16 math.ST stat.TH

Estimating the Missing Mass, Partition Function or Evidence for a Case of Sampling from a Discrete Set

Bastiaan J. Braams

Comments 20 pages

Abstract

We consider the problem of estimating the missing mass, partition function or evidence and its probability distribution in the case that for each sample point in the discrete sample space its (unnormalized) probability mass is revealed. Estimating the missing mass or partition function (evidence) is a well-studied problem for which, in different contexts, the harmonic mean estimator and the Good-Turing (and related) estimators are available. For sampling on a discrete set with revealed probability masses these estimators can be Rao-Blackwellized, leading to self-consistent estimators not involving an auxiliary distribution with known total mass. For the case of sampling from a mixture distribution this offers the perspective of anchoring the estimator at both ends: at the diffuse end (high temperature in statistical physics) via an explicit expression for the total probability mass and at the peaked end (low temperature) via the feature of repeated entries in the sample. Estimation is model-free, but to provide a probability distribution for the missing mass or partition function a model is needed for the distribution of mass. We present one such model, identify sufficient reduced statistics, and analyze the model in various ways -- Bayesian, profile likelihood, maximum likelihood and moment matching -- with the objective of eliminating the mathematical (nuisance) parameters for a final expression in terms of the observed data. The most satisfactory (explicit and transparent) result is obtained by a mixed method that combines Bayesian marginalization or profile likelihood optimization for all but one of the parameters with plain maximum likelihood optimization of the final parameter.
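
As a rough illustration of the Good-Turing idea mentioned in the abstract, the sketch below estimates the missing-mass fraction from the share of singletons in the sample and then forms a naive plug-in estimate of the partition function from the revealed masses. This is not the Rao-Blackwellized estimator of the paper; the function names, the sample, and the mass values are hypothetical.

```python
from collections import Counter

def good_turing_missing_mass(sample):
    """Good-Turing estimate of the missing-mass fraction: singletons / sample size."""
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(sample)

def plug_in_partition_function(sample, revealed_mass):
    """Naive plug-in: observed unnormalized mass divided by the estimated covered fraction.

    Assumes the Good-Turing estimate is strictly below 1 (at least one repeated entry).
    """
    observed = sum(revealed_mass[x] for x in set(sample))
    return observed / (1.0 - good_turing_missing_mass(sample))

# Hypothetical draws and the unnormalized masses revealed for the observed points.
sample = ["a", "b", "a", "c", "a", "d"]
revealed_mass = {"a": 5.0, "b": 2.0, "c": 1.0, "d": 1.0}

print(good_turing_missing_mass(sample))
print(plug_in_partition_function(sample, revealed_mass))
```

With so few draws the plug-in value is crude; the point is only to show how repeated entries and revealed masses enter the calculation.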

2603.13156 2026-03-16 stat.ME stat.ML

When Your Model Stops Working: Anytime-Valid Calibration Monitoring

Tristan Farran

Abstract

Practitioners monitoring deployed probabilistic models face a fundamental trap: any fixed-sample test applied repeatedly over an unbounded stream will eventually raise a false alarm, even when the model remains perfectly stable. Existing methods typically lack formal error guarantees, conflate alarm time with changepoint location, and monitor indirect signals that do not fully characterize calibration. We present PITMonitor, an anytime-valid calibration-specific monitor that detects distributional shifts in probability integral transforms via a mixture e-process, providing Type I error control over an unbounded monitoring horizon as well as Bayesian changepoint estimation. On river's FriedmanDrift benchmark, PITMonitor achieves detection rates competitive with the strongest baselines across all three scenarios, although detection delay is substantially longer under local drift.
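
To make the anytime-valid idea concrete, here is a minimal sketch of a generic mixture e-process that tests whether probability integral transforms (PITs) are uniform. It is not PITMonitor's algorithm: the betting densities, the grid of `lambdas`, and the function names are assumptions chosen for illustration. Each component multiplies its wealth by a density on [0, 1], so under the null the averaged wealth is a nonnegative martingale and Ville's inequality bounds the probability of ever crossing 1/alpha by alpha.

```python
def mixture_e_process(pits, lambdas=(-0.5, 0.5)):
    """Running mixture e-process for H0: PITs are i.i.d. Uniform(0, 1).

    Each component bets with q(u) = 1 + lam * (2u - 1), a valid density on
    [0, 1] whenever |lam| <= 1, so each running product has expectation 1
    under H0, and so does their average.
    """
    wealth = [1.0] * len(lambdas)
    path = []
    for u in pits:
        wealth = [w * (1.0 + lam * (2.0 * u - 1.0)) for w, lam in zip(wealth, lambdas)]
        path.append(sum(wealth) / len(wealth))
    return path

def first_alarm(path, alpha=0.05):
    """Index of the first time the e-process reaches 1/alpha, or None if it never does."""
    for t, e in enumerate(path):
        if e >= 1.0 / alpha:
            return t
    return None

# Hypothetical stream: PITs piled up near 1 signal miscalibration.
print(first_alarm(mixture_e_process([0.5] * 20)))
print(first_alarm(mixture_e_process([1.0] * 20)))
```

Because the threshold is fixed at 1/alpha for all time, the stream can be checked after every observation without inflating the false-alarm rate.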

2603.13009 2026-03-16 stat.ME stat.CO

TwoTimeScales: An R-package for Smoothing Hazards with Two Time Scales

Angela Carollo, Paul H. C. Eilers, Hein Putter, Jutta Gampe

Comments 15 pages, 6 figures

Abstract

Background: Time-to-event data with multiple time scales are observed in many epidemiological and clinical studies. While models that allow for simultaneous consideration of multiple time scales for the hazard of an event have been proposed, their use is still not wide-spread in applied research. One reason for this might be the lack of convenient statistical software to estimate such models. Here we introduce the R-package TwoTimeScales. The package provides tools to estimate models for hazards that vary smoothly over two time scales, including proportional hazards models with such a two-dimensional baseline hazard. Extensions to competing risks models are implemented as well. Methodology is based on two-dimensional smoothing with P-splines. Results: We demonstrate the features of the R-package by analysing a freely available dataset containing post-surgery follow-up data on patients with breast cancer. We present two examples, a proportional hazards regression and a competing risks problem. Besides estimation, we illustrate the plotting utilities of the package. Conclusion: The R-package TwoTimeScales can be easily used to fit flexible hazard models with two time scales, allowing new perspectives in the analysis of time-to-event data with multiple time scales.

2603.12920 2026-03-16 cs.CL stat.ML

HMS-BERT: Hybrid Multi-Task Self-Training for Multilingual and Multi-Label Cyberbullying Detection

Zixin Feng, Xinying Cui, Yifan Sun, Zheng Wei, Jiachen Yuan, Jiazhen Hu, Ning Xin, Md Maruf Hasan

Abstract

Cyberbullying on social media is inherently multilingual and multi-faceted, where abusive behaviors often overlap across multiple categories. Existing methods are commonly limited by monolingual assumptions or single-task formulations, which restrict their effectiveness in realistic multilingual and multi-label scenarios. In this paper, we propose HMS-BERT, a hybrid multi-task self-training framework for multilingual and multi-label cyberbullying detection. Built upon a pretrained multilingual BERT backbone, HMS-BERT integrates contextual representations with handcrafted linguistic features and jointly optimizes a fine-grained multi-label abuse classification task and a three-class main classification task. To address labeled data scarcity in low-resource languages, an iterative self-training strategy with confidence-based pseudo-labeling is introduced to facilitate cross-lingual knowledge transfer. Experiments on four public datasets demonstrate that HMS-BERT achieves strong performance, attaining a macro F1-score of up to 0.9847 on the multi-label task and an accuracy of 0.6775 on the main classification task. Ablation studies further verify the effectiveness of the proposed components.

2603.12893 2026-03-16 cs.CV cs.AI cs.LG cs.NE stat.ML

Finite Difference Flow Optimization for RL Post-Training of Text-to-Image Models

David McAllister, Miika Aittala, Tero Karras, Janne Hellsten, Angjoo Kanazawa, Timo Aila, Samuli Laine

Comments Code available at https://github.com/NVlabs/finite-difference-flow-optimization

Abstract

Reinforcement learning (RL) has become a standard technique for post-training diffusion-based image synthesis models, as it enables learning from reward signals to explicitly improve desirable aspects such as image quality and prompt alignment. In this paper, we propose an online RL variant that reduces the variance in the model updates by sampling paired trajectories and pulling the flow velocity in the direction of the more favorable image. Unlike existing methods that treat each sampling step as a separate policy action, we consider the entire sampling process as a single action. We experiment with both high-quality vision language models and off-the-shelf quality metrics for rewards, and evaluate the outputs using a broad set of metrics. Our method converges faster and yields higher output quality and prompt alignment than previous approaches.

2603.11240 2026-03-16 stat.OT

Statistical Methodology Groups in the Pharmaceutical Industry

Jenny Devenport, Tobias Mielke, Mouna Akacha, Kaspar Rufibach, Alex Ocampo, Vivian Lanius, Marc Vandemeulebroecke, Philip Hougaard, Pierre Collin, David Wright, Jurgen Hummel, Cornelia Ursula Kunz, Mike Krams

Comments 39 pages, 2 figures, 1 table

Abstract

Research and Development is the largest budget position in the pharmaceutical industry, with clinical trials being a critical, yet costly and time-consuming component to inform decisions. Beyond drug efficacy, the probability of success and efficiency of research and development are highly dependent on the approaches used for designing, analyzing, and interpreting clinical trials. Deep understanding of statistical methodology and quantitative approaches is therefore essential. Consequently, dedicated methodology groups have emerged in mid-size and large pharmaceutical companies and CROs. Their remit is to lead the conception and implementation of innovative quantitative methodologies in order to improve drug development, often by addressing complexities or offering more efficient designs. To achieve this, they collaborate internally and externally (e.g., with academics, regulators) to identify common challenges and tear down silos in order to invest in methods with the highest impact on efficiency and value to the portfolio. Given the immense financial stakes of drug development -- where delays carry massive implications -- these groups represent a critical strategic investment. However, to realize this business impact, statistical innovations must be rigorously validated and seamlessly integrated. This manuscript explores the setup, remit, and value of dedicated methodology groups, alongside the critical organizational considerations and success factors required to maximize their impact on the speed, efficiency, and probability of success.

2603.07227 2026-03-16 physics.ao-ph stat.AP

Estimating changes in extreme quantiles over time, applied to desert temperatures

Callum Leach, Kevin Ewans, Philip Jonathan

Abstract

We quantify changes $\Delta Q$ in 100-year return values for regional annual maxima and minima of near-surface atmospheric temperature from output of five CMIP6 models, for five of the Earth's desert regions, over the interval (2025, 2125). We use generalised extreme value (GEV) regression to characterise changes in extremes, considering a range of different parametric forms for the variation of GEV parameters with time, and coupling models for different scenarios so that they provide a common GEV tail in the first year of observation. Parameters are estimated using Bayesian inference. We perform a simulation study using ground truth models generating data qualitatively similar to the CMIP6 output, to assess the relative performance of different information criteria in selecting models from a set of candidates, to minimise error in predictions of $\Delta Q$. The Bayesian information criterion (BIC) provides best performance, out-performing the divergence and widely-applicable information criteria in particular. Using BIC-selected GEV regression models, we estimate joint posterior distributions of $\Delta Q$ over three forcing scenarios, for different combinations of region, GCM and climate ensemble. Estimates show a consistent trend across regions, GCMs and climate ensembles, of $\Delta Q$ increasing with climate scenario for both regional annual maxima and minima. Aggregating posterior distributions over climate ensembles and GCMs, we find evidence for significant increases in $\Delta Q$ for regional annual maxima under more severe forcing scenarios for all desert regions. Similar but weaker and less significant trends are observed for regional annual minima.
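
For readers unfamiliar with return values, the sketch below evaluates the standard GEV return-level formula for annual maxima and the resulting change over 2025 to 2125 under one simple assumption: only the location parameter drifts, linearly in time. The abstract considers a range of parametric forms, so this particular form, the function names, and the parameter values are hypothetical.

```python
import math

def gev_return_level(mu, sigma, xi, period=100.0):
    """Level exceeded by the annual maximum with probability 1/period in any year."""
    y = -math.log(1.0 - 1.0 / period)
    if abs(xi) < 1e-12:  # Gumbel limit as xi -> 0
        return mu - sigma * math.log(y)
    return mu + (sigma / xi) * (y ** (-xi) - 1.0)

def delta_q(mu0, mu_slope, sigma, xi, start=2025, end=2125):
    """Change in the 100-year return value when only the location drifts linearly."""
    q_start = gev_return_level(mu0, sigma, xi)
    q_end = gev_return_level(mu0 + mu_slope * (end - start), sigma, xi)
    return q_end - q_start

# Hypothetical parameters: location 45 degrees C, warming of 0.03 degrees per year.
print(gev_return_level(45.0, 1.5, -0.2))
print(delta_q(mu0=45.0, mu_slope=0.03, sigma=1.5, xi=-0.2))
```

When scale or shape also vary with time, the change in return value no longer reduces to the shift in location, which is one reason model selection matters here.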

2512.22587 2026-03-16 cs.LG stat.ML

Structural Incompatibility of Differentiable Sorting and Within-Vector Rank Normalization

Taeyun Kim

Comments 6 pages

Abstract

We show that differentiable sorting and ranking operators are structurally incompatible with within-vector rank normalization. We formalize admissibility through monotone invariance (C1), batch independence (C2), and a rank-space stability condition (C3). Gap-sensitive relaxations such as SoftSort violate (C1) by a quantitative margin that depends on the temperature and input scale. Batchwise rank relaxations such as SinkhornSort violate (C2): the same sample can be assigned outputs arbitrarily close to 0 or 1 depending solely on batch context. Condition (C3) implies (C1) under the rank representation used here and should not be read as a third independent failure mode. We also characterize the admissible class: any admissible operator must factor through the rank representation via a Lipschitz function.
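
The failure of monotone invariance can be seen in a few lines. The sketch below implements the SoftSort relaxation (a row-wise softmax over negative absolute differences between sorted and unsorted scores) in plain Python and applies it to a score vector and to a strictly increasing rescaling of it. The hard ranks agree, while the relaxed permutation matrices differ. The example vector, the scaling factor, and the helper names are assumptions for illustration.

```python
import math

def softsort(scores, tau=1.0):
    """SoftSort relaxed permutation matrix: softmax over -|sorted_i - s_j| / tau per row."""
    ordered = sorted(scores, reverse=True)
    rows = []
    for target in ordered:
        logits = [-abs(target - s) / tau for s in scores]
        top = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(v - top) for v in logits]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows

def hard_ranks(scores):
    """Rank of each entry in descending order (0 = largest)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for rank, index in enumerate(order):
        ranks[index] = rank
    return ranks

s = [0.1, 0.5, 0.2]
t = [10.0 * x for x in s]  # strictly increasing transform of s

gap = max(abs(a - b) for row_s, row_t in zip(softsort(s), softsort(t))
          for a, b in zip(row_s, row_t))
print(hard_ranks(s) == hard_ranks(t))
print(gap)
```

The size of the gap changes with both `tau` and the scaling factor, consistent with the abstract's statement that the violation depends on temperature and input scale.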

2512.11946 2026-03-16 cs.LG cs.AI stat.ML

Global Sensitivity Analysis for Engineering Design Based on Individual Conditional Expectations

Pramudita Satria Palar, Paul Saves, Rommel G. Regis, Koji Shimoyama, Shigeru Obayashi, Nicolas Verstaevel, Joseph Morlier

Comments Published in Aerospace Science and Technology, 2026

Abstract

Explainable machine learning techniques have gained increasing attention in engineering applications, especially in aerospace design and analysis, where understanding how input variables influence data-driven models is essential. Partial Dependence Plots (PDPs) are widely used for interpreting black-box models by showing the average effect of an input variable on the prediction. However, their global sensitivity metric can be misleading when strong interactions are present, as averaging tends to obscure interaction effects. To address this limitation, we propose a global sensitivity metric based on Individual Conditional Expectation (ICE) curves. The method computes the expected feature importance across ICE curves, along with their standard deviation, to more effectively capture the influence of interactions. We provide a mathematical proof demonstrating that the PDP-based sensitivity is a lower bound of the proposed ICE-based metric under truncated orthogonal polynomial expansion. In addition, we introduce an ICE-based correlation value to quantify how interactions modify the relationship between inputs and the output. Comparative evaluations were performed on three cases: a 5-variable analytical function, a 5-variable wind-turbine fatigue problem, and a 9-variable airfoil aerodynamics case, where ICE-based sensitivity was benchmarked against PDP, SHapley Additive exPlanations (SHAP), and Sobol' indices. The results show that ICE-based feature importance provides richer insights than the traditional PDP-based approach, while visual interpretations from PDP, ICE, and SHAP complement one another by offering multiple perspectives.
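
A toy case shows why averaging can hide interactions. In the sketch below the hypothetical model is a pure interaction, `x1 * x2`, with `x2` symmetric around zero. Each ICE curve for `x1` has a clear slope, but the slopes cancel in the PDP. The importance score used here (standard deviation of a curve over the grid) is an assumption for illustration and may differ from the metric defined in the paper.

```python
from statistics import mean, pstdev

def model(x1, x2):
    """Hypothetical black-box model consisting of a pure interaction term."""
    return x1 * x2

grid = [i / 10 for i in range(-10, 11)]   # values of x1 at which curves are evaluated
x2_samples = [-1.0, -0.5, 0.5, 1.0]       # observed values of the other input

# One ICE curve per sample: vary x1 over the grid, hold x2 at its observed value.
ice_curves = [[model(g, x2) for g in grid] for x2 in x2_samples]

# The PDP is the pointwise average of the ICE curves.
pdp_curve = [mean(curve[k] for curve in ice_curves) for k in range(len(grid))]

pdp_importance = pstdev(pdp_curve)
ice_importance = mean(pstdev(curve) for curve in ice_curves)

print(pdp_importance, ice_importance)
```

The PDP-based score vanishes while the ICE-based score does not, in line with the lower-bound relationship stated in the abstract.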

2511.13421 2026-03-16 cs.LG stat.ML

Larger Datasets Can Be Repeated More: A Theoretical Analysis of Multi-Epoch Scaling in Linear Regression

Tingkai Yan, Haodong Wen, Binghui Li, Kairong Luo, Wenguang Chen, Kaifeng Lyu

Abstract

While data scaling laws of large language models (LLMs) have been widely examined in the one-pass regime with massive corpora, their form under limited data and repeated epochs remains largely unexplored. This paper presents a theoretical analysis of how a common workaround, training for multiple epochs on the same dataset, reshapes the data scaling laws in linear regression. Concretely, we ask: to match the performance of training on a dataset of size $N$ for $K$ epochs, how much larger must a dataset be if the model is trained for only one pass? We quantify this using the effective reuse rate of the data, $E(K, N)$, which we define as the multiplicative factor by which the dataset must grow under one-pass training to achieve the same test loss as $K$-epoch training. Our analysis precisely characterizes the scaling behavior of $E(K, N)$ for SGD in linear regression under either strong convexity or Zipf-distributed data: (1) When $K$ is small, we prove that $E(K, N) \approx K$, indicating that every new epoch yields a linear gain; (2) As $K$ increases, $E(K, N)$ plateaus at a problem-dependent value that grows with $N$ ($\Theta(\log N)$ for the strongly-convex case), implying that larger datasets can be repeated more times before the marginal benefit vanishes. These theoretical findings point out a neglected factor in a recent empirical study (Muennighoff et al. (2023)), which claimed that training LLMs for up to $4$ epochs results in negligible loss differences compared to using fresh data at each step, i.e., $E(K, N) \approx K$ for $K \le 4$ in our notation. Supported by further empirical validation with LLMs, our results reveal that the maximum $K$ value for which $E(K, N) \approx K$ in fact depends on the data size and distribution, and underscore the need to explicitly model both factors in future studies of scaling laws with data reuse.
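
The verbal definition of the effective reuse rate can be written as a single equation. The symbol $\mathcal{L}_K(N)$, for the test loss after $K$ epochs on $N$ samples, is notation assumed here for illustration and does not appear in the abstract; the two regimes restate claims (1) and (2).

```latex
\mathcal{L}_{1}\bigl(E(K, N)\cdot N\bigr) = \mathcal{L}_{K}(N),
\qquad
E(K, N) \approx K \quad \text{(small } K\text{)},
\qquad
E(K, N) = \Theta(\log N) \quad \text{(large } K\text{, strongly convex case)}.
```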

2511.04974 2026-03-16 stat.AP

Estimating Inhomogeneous Spatio-Temporal Background Intensity Functions using Graphical Dirichlet Processes

Isaías Bañales, Tomoaki Nishikawa, Yoshihiro Ito, Manuel J. Aguilar-Velázquez

Abstract

Improvements in seismic instrumentation have been shown to affect the number of observed earthquakes, since denser networks usually record more events. However, phenomena such as strong earthquakes or even aseismic transients, such as slow slip earthquakes, may also alter the occurrence of earthquakes. In seismology, it is standard practice to model background seismicity as a Poisson process. Based on this idea, this work proposes a model that can incorporate the evolving spatial intensity of Poisson processes over time (i.e., we include temporal changes in the background seismicity when modeling). In recent years, novel methodologies have been developed for quantifying the uncertainty in the estimation of the background seismicity in homogeneous cases using Bayesian non-parametric techniques. This work proposes a novel methodology based on graphical Dirichlet processes for incorporating spatial and temporal inhomogeneities in background seismicity. The proposed model is applied to study seismicity in southern Mexico, using data recorded from 2000 to 2015.

2510.01930 2026-03-16 stat.ML cond-mat.dis-nn cs.LG

Precise Dynamics of Diagonal Linear Networks: A Unifying Analysis by Dynamical Mean-Field Theory

Sota Nishiyama, Masaaki Imaizumi

Comments 48 pages, accepted at AISTATS 2026 (Spotlight)

Abstract

Diagonal linear networks (DLNs) are a tractable model that captures several nontrivial behaviors in neural network training, such as initialization-dependent solutions and incremental learning. These phenomena are typically studied in isolation, leaving the overall dynamics insufficiently understood. In this work, we present a unified analysis of various phenomena in the gradient flow dynamics of DLNs. Using Dynamical Mean-Field Theory (DMFT), we derive a low-dimensional effective process that captures the asymptotic gradient flow dynamics in high dimensions. Analyzing this effective process yields new insights into DLN dynamics, including loss convergence rates and their trade-off with generalization, and systematically reproduces many of the previously observed phenomena. These findings deepen our understanding of DLNs and demonstrate the effectiveness of the DMFT approach in analyzing high-dimensional learning dynamics of neural networks.

2507.14389 2026-03-16 stat.AP econ.EM math.ST stat.ME stat.TH

Spatiotemporal Autoregressive Models for Areal Compositional Data

Matthias Eckardt, Philipp Otto

Abstract

Compositional data, such as regional shares of economic sectors or property transactions, are central to understanding structural change in economic systems across space and time. This paper introduces a spatiotemporal multivariate autoregressive model tailored for panel data with composition-valued responses at each areal unit and time point. The proposed framework enables the joint modelling of temporal dynamics and spatial dependence under compositional constraints, and is estimated via a quasi-maximum likelihood approach. We build on recent theoretical advances to establish the identifiability and asymptotic properties of the estimator as both the number of regions and the number of time points grow. The utility and flexibility of the model are demonstrated through two applications: analysing property transaction compositions in an intra-city housing market (Berlin), and regional sectoral compositions in Spain's economy. These case studies highlight how the proposed framework captures key features of spatiotemporal economic processes that are often missed by conventional methods.

2506.20021 2026-03-16 stat.ME

Speeding up the ordered allocation sampler

Maria F. Gil-Leyva, Fidel Selva, Pierpaolo De Blasi

Comments Change from v1: added acknowledgment

Abstract

The ordered allocation sampler is a Gibbs sampler designed to explore the posterior distribution in nonparametric mixture models. It encompasses both infinite mixtures and finite mixtures with a random number of components, and it has been shown to possess mixing properties that pair well with collapsed, or marginal, samplers that integrate out the mixing distribution. The main advantage is that it adapts to mixing priors that do not enjoy the tractable predictive structures needed for the implementation of marginal sampling methods. Thus it is as widely applicable as other conditional samplers while enjoying better algorithmic performance. In this paper we provide a modification of the ordered allocation sampler that enhances its performance in a substantial way while easing its implementation. In addition, exploiting the similarity with marginal samplers, we are able to adapt the split-merge moves of Jain and Neal to the new version of the sampler. Simulation studies confirm these findings.

2502.20114 2026-03-16 stat.CO cond-mat.stat-mech cs.NA math.NA math.PR

Scalability of the second-order reliability method for stochastic differential equations with multiplicative noise

Timo Schorlepp, Tobias Grafke

Comments 59 pages, 9 figures

Abstract

We show how to efficiently compute asymptotically sharp estimates of extreme event probabilities in stochastic differential equations (SDEs) with small multiplicative Brownian noise. The underlying approximation is known as sharp large deviation theory or precise Laplace asymptotics in mathematics, the second-order reliability method (SORM) in reliability engineering, and the instanton or optimal fluctuation method with 1-loop corrections in physics. It is based on approximating the tail probability in question with the most probable realization of the stochastic process, and local perturbations around this realization. We first recall and contextualize the relevant classical theoretical result on precise Laplace asymptotics of diffusion processes [Ben Arous (1988), Stochastics, 25(3), 125-153], and then show how to compute the involved infinite-dimensional quantities - operator traces and Carleman-Fredholm determinants - numerically in a way that is scalable with respect to the time discretization and remains feasible in high spatial dimensions. Using tools from automatic differentiation, we achieve a straightforward black-box numerical computation of the SORM estimates in JAX. The method is illustrated in examples of SDEs and stochastic partial differential equations, including a two-dimensional random advection-diffusion model of a passive scalar. We thereby demonstrate that it is possible to obtain efficient and accurate SORM estimates for very high-dimensional problems, as long as the infinite-dimensional structure of the problem is correctly taken into account. Our JAX implementation of the method is made publicly available.

2501.15194 2026-03-16 cs.LG stat.CO stat.ML

Reliable Pseudo-labeling via Optimal Transport with Attention for Short Text Clustering

Zhihao Yao

Comments arXiv admin comment: This version has been removed by arXiv administrators as the submitter did not have the rights to agree to the license at the time of submission

Abstract

Short text clustering has gained significant attention in the data mining community. However, the limited valuable information contained in short texts often leads to low-discriminative representations, increasing the difficulty of clustering. This paper proposes a novel short text clustering framework, called Reliable Pseudo-labeling via Optimal Transport with Attention for Short Text Clustering (POTA), that generates reliable pseudo-labels to aid discriminative representation learning for clustering. Specifically, POTA first implements an instance-level attention mechanism to capture the semantic relationships among samples, which are then incorporated as a semantic consistency regularization term into an optimal transport problem. By solving this OT problem, we obtain reliable pseudo-labels that simultaneously account for sample-to-sample semantic consistency and sample-to-cluster global structure information. Additionally, the proposed OT can adaptively estimate cluster distributions, making POTA well-suited for varying degrees of imbalanced datasets. Then, we utilize the pseudo-labels to guide contrastive learning to generate discriminative representations and achieve efficient clustering. Extensive experiments demonstrate that POTA outperforms state-of-the-art methods. The code is available at https://github.com/YZH0905/POTA-STC/tree/main.

2410.17046 2026-03-16 stat.ME stat.AP

Mesoscale two-sample testing for networks

Peter W. MacDonald, Elizaveta Levina, Ji Zhu

Comments 59 pages, 9 figures

Abstract

Networks arise naturally in many scientific fields as a representation of pairwise connections. Statistical network analysis has most often considered a single large network, but it is common in a number of applications to observe multiple networks on a shared node set. When these networks are grouped by case-control status or another categorical covariate, the classical statistical question of two-sample comparison arises. In this work, we address the problem of testing for statistically significant differences in a given arbitrary subset of connections. This general framework allows an analyst to focus on a single node, a specific region of interest, or compare whole networks. Our ability to conduct "mesoscale" testing on a meaningful group of edges is particularly relevant for applications such as neuroimaging and distinguishes our approach from prior work, which tends to focus either on a single node or the whole network. In this mesoscale setting, we develop statistically sound projection-based tests for two-sample comparison in both weighted and binary edge networks. The key to our approach is to leverage network information from outside the set of interest to learn informative low-rank projections, which lead to more powerful tests.

2406.03821 2026-03-16 stat.AP stat.ME

Bayesian generalized method of moments applied to pseudo-observations in survival analysis

Léa Orsini, Caroline Brard, Emmanuel Lesaffre, Guosheng Yin, David Dejardin, Gwénaël Le Teuff

Abstract

Bayesian inference for survival regression modeling offers numerous advantages, especially for decision-making and external data borrowing, but demands the specification of the baseline hazard function, which may be a challenging task. We propose an alternative approach that does not need the specification of this function. Our approach combines pseudo-observations to convert censored data into longitudinal data with the Generalized Method of Moments (GMM) to estimate the parameters of interest from the survival function directly. GMM may be viewed as an extension of the Generalized Estimating Equation (GEE) currently used for frequentist pseudo-observations analysis and can be extended to the Bayesian framework using a pseudo-likelihood function. We assessed the behavior of the frequentist and Bayesian GMM in the new context of analyzing pseudo-observations. We compared their performance to the Cox, GEE, and Bayesian piecewise exponential models through a simulation study of two-arm randomized clinical trials. Frequentist and Bayesian GMM gave valid inferences with performance similar to the three benchmark methods, except for small sample sizes and high censoring rates. For illustration, three post-hoc efficacy analyses were performed on randomized clinical trials involving patients with Ewing Sarcoma, producing results similar to those of the benchmark methods. Through a simple application of estimating hazard ratios, these findings confirm the effectiveness of this new Bayesian approach based on pseudo-observations and the generalized method of moments. This offers new insights on using pseudo-observations for Bayesian survival analysis.
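
The pseudo-observation step can be sketched with the usual jackknife construction on the Kaplan-Meier estimator: for subject i, take n times the full-sample estimate minus (n - 1) times the leave-one-out estimate. The code below is a minimal plain-Python version for the survival probability at a single time point; the function names and data are hypothetical, and the GMM or Bayesian fitting stage is not shown.

```python
def kaplan_meier(times, events, t):
    """Kaplan-Meier estimate of S(t); events[i] is 1 for an observed event, 0 for censoring."""
    surv = 1.0
    event_times = sorted({ti for ti, e in zip(times, events) if e == 1 and ti <= t})
    for u in event_times:
        at_risk = sum(1 for ti in times if ti >= u)
        deaths = sum(1 for ti, e in zip(times, events) if ti == u and e == 1)
        surv *= 1.0 - deaths / at_risk
    return surv

def pseudo_observations(times, events, t):
    """Jackknife pseudo-observations n * S(t) - (n - 1) * S_{-i}(t) for each subject i."""
    n = len(times)
    full = kaplan_meier(times, events, t)
    values = []
    for i in range(n):
        loo_times = times[:i] + times[i + 1:]
        loo_events = events[:i] + events[i + 1:]
        values.append(n * full - (n - 1) * kaplan_meier(loo_times, loo_events, t))
    return values

# Hypothetical follow-up times; the third subject is censored at time 3.
times = [1.0, 2.0, 3.0, 4.0, 5.0]
events = [1, 1, 0, 1, 1]
print(pseudo_observations(times, events, t=3.5))
```

Without censoring, each pseudo-observation reduces to the indicator that the subject survived beyond t, which is why these values can stand in for the unobserved outcome in a regression.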

2311.08365 2026-03-16 math.ST stat.TH

Local asymptotics of selection models with applications in Bayesian selective inference

Daniel G. Rasines, G. Alastair Young

Comments 30 pages, 7 figures, 1 table

Abstract

Contemporary focus on selective inference has renewed interest in the theory of selection models. In this paper, we analyze the asymptotic properties of selection models built on independent and identically distributed observations. We show that, under suitable regularity conditions, they behave asymptotically like a sequence of Gaussian selection models. This provides a natural generalization of the Local Asymptotic Normality framework of Le Cam (1960), and indicates a notion of local asymptotic selective normality as the appropriate simplifying theoretical framework for analysis of selective inference. As a key application, we consider the methodological consequences of the asymptotic theory for Bayesian selective inference. Specifically, we prove that the posterior distribution constructed from a selection model under a fixed prior is asymptotically equivalent to the posterior derived in the corresponding asymptotic Gaussian selection model under a uniform prior. Notably, the latter is often mis-calibrated in a frequentist sense, particularly for one-sided selection mechanisms. This demonstrates that the familiar asymptotic equivalence between Bayesian and frequentist approaches does not hold under selection.

2311.07733 2026-03-16 stat.ME math.PR

Credible Intervals for Probability of Failure with Gaussian Processes

Aleksei G. Sorokin, Vishwas Rao

Abstract

Estimating the probability of failure for expensive simulations is a central task in reliability analysis for structural design, power grid design, and safety certification, among other areas. This work derives credible intervals on the probability of failure by modeling the simulation as a realization of a Gaussian process surrogate. These intervals are governed by the pointwise binary classification error of the surrogate and are compatible with the broad class of adaptive sampling schemes proposed in the literature. We further propose a novel batch sampling scheme that suggests multiple evaluation points per iteration, enabling parallel simulation on HPC systems. The method is empirically validated using our scalable, open-source implementation on a variety of test problems, including a tsunami model where failure is quantified in terms of maximum wave height.

2303.07167 2026-03-16 stat.ME stat.AP stat.ML

When Respondents Don't Care Anymore: Identifying the Onset of Careless Responding

Max Welz, Andreas Alfons

Abstract

Questionnaires in the behavioral sciences tend to be lengthy. However, literature suggests that survey length is a contributing factor to careless responding, with longer questionnaires yielding higher probability that participants start responding carelessly. Consequently, in long surveys a large number of participants may engage in careless responding, posing a major threat to internal validity. We propose a novel method for identifying the onset of careless responding (or an absence thereof) that searches for a changepoint in combined measurements of multiple dimensions in which carelessness may manifest, such as inconsistency and invariability. It is highly flexible, based on machine learning, and provides statistical guarantees for controlling the false positive rate. In simulation experiments, the proposed method achieves high accuracy in identifying carelessness onset and discriminates well between attentive and various types of careless responding, even when a large number of careless respondents are present. An empirical application highlights how identifying partial carelessness uncovers novel insights on careless responding behavior. Furthermore, we provide the freely available open source software package "carelessonset" to facilitate adoption by empirical researchers.

2208.13701 2026-03-16 stat.ME cs.LG math.OC stat.ML

Data-Driven Influence Functions for Optimization-Based Causal Inference

Michael I. Jordan, Yixin Wang, Angela Zhou

Comments Revision

Abstract

We study a constructive algorithm that approximates Gateaux derivatives for statistical functionals by finite differencing, with a focus on functionals that arise in causal inference. We study the case where probability distributions are not known a priori but need to be estimated from data. These estimated distributions lead to empirical Gateaux derivatives, and we study the relationships between empirical, numerical, and analytical Gateaux derivatives. Starting with a case study of the interventional mean (average potential outcome), we delineate the relationship between finite differences and the analytical Gateaux derivative. We then derive requirements on the rates of numerical approximation in perturbation and smoothing that preserve the statistical benefits of one-step adjustments, such as rate double robustness. We then study more complicated functionals such as dynamic treatment regimes, the linear-programming formulation for policy optimization in infinite-horizon Markov decision processes, and sensitivity analysis in causal inference. More broadly, we study optimization-based estimators, since this begets a class of estimands where identification via regression adjustment is straightforward but obtaining influence functions under minor variations thereof is not. The ability to approximate bias adjustments in the presence of arbitrary constraints illustrates the usefulness of constructive approaches for Gateaux derivatives. We also find that the statistical structure of the functional (rate double robustness) can permit less conservative rates for finite-difference approximation. This property, however, can be specific to particular functionals; e.g., it occurs for the average potential outcome (hence average treatment effect) but not the infinite-horizon MDP policy value.
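The finite-differencing idea in this abstract can be illustrated with a toy sketch. It is not the authors' algorithm: it assumes a known discrete distribution stored as a dictionary of value-to-probability pairs, perturbs it toward a point mass, and differences the functional. The helper names (`mix`, `mean_functional`, `variance_functional`, `gateaux_fd`) and the step size `eps` are hypothetical choices made for illustration.

```python
def mix(p, x, eps):
    """Return the contaminated distribution (1 - eps) * p + eps * delta_x."""
    q = {v: (1.0 - eps) * w for v, w in p.items()}
    q[x] = q.get(x, 0.0) + eps
    return q


def mean_functional(p):
    """Mean of a discrete distribution given as {value: probability}."""
    return sum(v * w for v, w in p.items())


def variance_functional(p):
    """Variance of a discrete distribution given as {value: probability}."""
    m = mean_functional(p)
    return sum(w * (v - m) ** 2 for v, w in p.items())


def gateaux_fd(functional, p, x, eps=1e-6):
    """One-sided finite-difference approximation of the Gateaux derivative
    of `functional` at p in the direction of a point mass at x."""
    return (functional(mix(p, x, eps)) - functional(p)) / eps


if __name__ == "__main__":
    p = {0.0: 0.5, 2.0: 0.5}  # mean 1, variance 1
    # Analytical derivatives at x = 3: x - mean = 2 for the mean,
    # (x - mean)^2 - variance = 3 for the variance.
    print(gateaux_fd(mean_functional, p, 3.0))
    print(gateaux_fd(variance_functional, p, 3.0))
```

For the mean the functional is linear in the distribution, so the finite difference matches the analytical derivative up to rounding. For the variance the one-sided difference carries an error of order `eps`, which is the kind of approximation rate the abstract refers to.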

2603.12867 2026-03-16 stat.ME

Breaking the Winner's Curse with Bayesian Hybrid Shrinkage

Richard Mudd, Abbas Zaidi, Rina Friedberg, Ilya Gorbachev, Anchal Choubey, Houssam Nassif

Abstract

The widespread adoption of randomized controlled trials (A/B Tests) for decision-making has introduced a pervasive "Winner's Curse": experiments selected for launch often exhibit upwardly biased effect estimates and invalid confidence intervals. This selection bias leads to over-optimistic impact projections and undermines decision-making, particularly in low-power regimes. We propose Bayesian Hybrid Shrinkage (BHS), an empirical Bayes (EB) framework that leverages data-driven priors to mitigate selection bias and provides accurate uncertainty quantification. Unlike traditional EB methods that apply uniform shrinkage, BHS introduces an experiment-specific "local" shrinkage factor that incorporates individual experiment characteristics, improving robustness against prior misspecification. We also derive a closed-form inference strategy designed for high-throughput production environments. Extensive simulations and real-world evaluations at Meta Platforms demonstrate that BHS outperforms existing methods in terms of bias reduction and interval coverage, even under substantial violations of modeling assumptions.

2603.12843 2026-03-16 math.ST stat.ME stat.TH

The geometry of Stein's method of moments: A canonical decomposition via score matching

Mitsuki Nagai, Keisuke Yano

Abstract

In this paper, we elucidate the geometry of Stein's method of moments (SMoM). SMoM is a parameter estimation method based on the Stein operator, and yields a wide class of estimators that do not depend on the normalizing constant. We present a canonical decomposition of an SMoM estimator after centering the score matching estimator, which sheds light on the central role of the score matching within the SMoM framework. Using this decomposition, we construct an SMoM estimator that improves upon the score matching estimator in the asymptotic variance. We also discuss the connection between SMoM and the Wasserstein geometry. Specifically, using the Wasserstein score function, we provide a geometrical interpretation of the gap in the asymptotic variance between the score matching estimator and the maximum likelihood estimator. Furthermore, it is shown that the score matching estimator is asymptotically efficient if and only if the Fisher score functions span the same space as the Wasserstein score functions.

2603.12838 2026-03-16 math.OC cs.DC stat.ML

A New Kernel Regularity Condition for Distributed Mirror Descent: Broader Coverage and Simpler Analysis

Junwen Qiu, Ziyang Zeng, Leilei Mei, Junyu Zhang

Comments 25 pages, 4 figures

Abstract

Existing convergence analyses of distributed optimization methods in non-Euclidean geometries typically rely on kernel assumptions: (i) global Lipschitz smoothness and (ii) bi-convexity of the associated Bregman divergence function. Unfortunately, these conditions are violated by nearly all kernels used in practice, leaving a huge theory-practice gap. This work closes this gap by developing a unified analytical tool that guarantees convergence under mild conditions. Specifically, we introduce Hessian relative uniform continuity (HRUC), a regularity satisfied by nearly all standard kernels. Importantly, HRUC is closed under concatenation, positive scaling, composition, and various kernel combinations. Leveraging the geometric structure induced by HRUC, we derive convergence guarantees for mirror descent-based gradient tracking without imposing any restrictive assumptions. More broadly, our analysis techniques extend seamlessly to other decentralized optimization methods in genuinely non-Euclidean and non-Lipschitz settings.

2603.12780 2026-03-16 math.ST math.PR stat.TH

Functional CLT for general sample covariance matrices

Jian Cui, Zhijun Liu, Jiang Hu, Zhidong Bai

Abstract

This paper studies the central limit theorems (CLTs) for linear spectral statistics (LSSs) of general sample covariance matrices, when the test functions belong to $C^3$, the class of functions with continuous third order derivatives. We consider matrices of the form $B_n=(1/n)T_p^{1/2}X_nX_n^{*}T_p^{1/2},$ where $X_n= (x_{i j} ) $ is a $p \times n$ matrix whose entries are independent and identically distributed (i.i.d.) real or complex random variables, and $T_p$ is a $p\times p$ nonrandom Hermitian nonnegative definite matrix with its spectral norm uniformly bounded in $p$. By using Bernstein polynomial approximation, we show that, under $\mathbb{E}|x_{ij}|^{8}<\infty$, the centered LSSs of $B_n$ have Gaussian limits. Under the stronger $\mathbb{E}|x_{ij}|^{10}<\infty$, we further establish convergence rates $O(n^{-1/2+\kappa})$ in Kolmogorov--Smirnov distance, for any fixed $\kappa>0$.

2603.12753 2026-03-16 stat.ME cs.CR

Balancing the privacy-utility trade-off: How to draw reliable conclusions from private data

Raphaël de Fondeville

Abstract

Absolute anonymization, conceived as an irreversible transformation that prevents re-identification and sensitive value disclosure, has proven to be a broken promise. Consequently, modern data protection must shift toward a privacy-utility trade-off grounded in risk mitigation. Differential Privacy (DP) offers a rigorous mathematical framework for balancing quantified disclosure risk with analytical usefulness. Nevertheless, widespread adoption remains limited, largely because effective translation of complex technical concepts, such as privacy-loss parameters, into forms meaningful to non-technical stakeholders has yet to be achieved. This difficulty arises from the inherent use of randomization: both legitimate analysts and potential adversaries must draw conclusions from uncertain observations rather than deterministic values. In this work, we propose a new interpretation of the privacy-utility trade-off based on hypothesis testing. This perspective explicitly accounts for the uncertainty introduced by randomized mechanisms in both membership inference scenarios and general data analysis. In particular, we introduce the concept of relative disclosure risk to quantify the maximum reduction in uncertainty an adversary can obtain from protected outputs, and we show that this measure is directly related to standard privacy-loss parameters. At the same time, we analyze how DP affects analytical validity by studying its impact on hypothesis tests commonly used to assess the statistical significance of empirical results. Finally, we provide practical guidance, accessible to non-experts, for navigating the privacy-utility trade-off, aiding in the selection of suitable protection mechanisms and the values for the privacy-loss parameters.

2603.12734 2026-03-16 stat.ML cs.LG

VecMol: Vector-Field Representations for 3D Molecule Generation

Yuchen Hua, Xingang Peng, Jianzhu Ma, Muhan Zhang

Abstract

Generative modeling of three-dimensional (3D) molecules is a fundamental yet challenging problem in drug discovery and materials science. Existing approaches typically represent molecules as 3D graphs and co-generate discrete atom types with continuous atomic coordinates, leading to intrinsic learning difficulties such as heterogeneous modality entanglement and geometry-chemistry coherence constraints. We propose VecMol, a paradigm-shifting framework that reimagines molecular representation by modeling 3D molecules as continuous vector fields over Euclidean space, where vectors point toward nearby atoms and implicitly encode molecular structure. The vector field is parameterized by a neural field and generated using a latent diffusion model, avoiding explicit graph generation and decoupling structure learning from discrete atom instantiation. Experiments on the QM9 and GEOM-Drugs benchmarks validate the feasibility of this novel approach, suggesting vector-field-based representations as a promising new direction for 3D molecular generation.

2603.12672 2026-03-16 math.ST stat.TH

Multivariate normality test based on the uniform distribution on the Stiefel manifold

Koki Shimizu, Toshiya Iwashita

Abstract

This study presents a new procedure for necessary tests of multivariate normality based on the uniform distribution on the Stiefel manifold. We demonstrate that the test statistic, which is formed by the product of the scaled residual matrix and the symmetric square root of a Wishart matrix, is exactly distributed as a matrix-variate normal distribution under the null hypothesis. Monte Carlo simulations are conducted to assess the Type I error rate and power in non-asymptotic settings.

2603.12627 2026-03-16 stat.ML cs.IT cs.LG math.IT

Batched Kernelized Bandits: Refinements and Extensions

Chenkai Ma, Keqin Chen, Jonathan Scarlett

Abstract

In this paper, we consider the problem of black-box optimization with noisy feedback revealed in batches, where the unknown function to optimize has a bounded norm in some Reproducing Kernel Hilbert Space (RKHS). We refer to this as the Batched Kernelized Bandits problem, and refine and extend existing results on regret bounds. For algorithmic upper bounds, (Li and Scarlett, 2022) shows that $B=O(\log\log T)$ batches suffice to attain near-optimal regret, where $T$ is the time horizon and $B$ is the number of batches. We further refine this by (i) finding the optimal number of batches including constant factors (to within $1+o(1)$), and (ii) removing a factor of $B$ in the regret bound. For algorithm-independent lower bounds, noticing that existing results only apply when the batch sizes are fixed in advance, we present novel lower bounds when the batch sizes are chosen adaptively, and show that adaptive batches have essentially the same minimax regret scaling as fixed batches. Furthermore, we consider a robust setting where the goal is to choose points for which the function value remains high even after an adversarial perturbation. We present the robust-BPE algorithm, and show that a suitably-defined cumulative regret notion incurs the same bound as the non-robust setting, and derive a simple regret bound significantly below that of previous work.

2603.12562 2026-03-16 stat.ML cs.CV cs.LG

Variational Garrote for Sparse Inverse Problems

Kanghun Lee, Hyungjoon Soh, Junghyo Jo

Comments 10 pages, 4 figures

Abstract

Sparse regularization plays a central role in solving inverse problems arising from incomplete or corrupted measurements. Different regularizers correspond to different prior assumptions about the structure of the unknown signal, and reconstruction performance depends on how well these priors match the intrinsic sparsity of the data. This work investigates the effect of sparsity priors in inverse problems by comparing conventional L1 regularization with the Variational Garrote (VG), a probabilistic method that approximates L0 sparsity through variational binary gating variables. A unified experimental framework is constructed across multiple reconstruction tasks including signal resampling, signal denoising, and sparse-view computed tomography. To enable consistent comparison across models with different parameterizations, regularization strength is swept across wide ranges and reconstruction behavior is analyzed through train-generalization error curves. Experiments reveal characteristic bias-variance tradeoff patterns across tasks and demonstrate that VG frequently achieves lower minimum generalization error and improved stability in strongly underdetermined regimes where accurate support recovery is critical. These results suggest that sparsity priors closer to spike-and-slab structure can provide advantages when the underlying coefficient distribution is strongly sparse. The study highlights the importance of prior-data alignment in sparse inverse problems and provides empirical insights into the behavior of variational L0-type methods across different information bottlenecks.

2603.12561 2026-03-16 stat.ME

Consistent and powerful CUSUM change-point test for panel data with changes in variance

Wenzhi Yang, Yueting Xu, Xiaoping Shi, Qiong Li

Abstract

This paper investigates change-points in variance in panel data models with $\alpha$-mixing time series. Based on the cumulative sum (CUSUM) method and the individual differences, we construct a CUSUM test for panel data models to detect variance changes. Under the null hypothesis, we derive the limit distribution of this test, which can be used to detect the change-point of variance. Under the alternative hypothesis, the limit behavior of the CUSUM test is also derived. To validate the performance of the test, we conducted simulation analyses with Gaussian and Gamma errors. The results demonstrate that this testing method significantly outperforms existing approaches, particularly in detecting sparse variance changes. Finally, we conducted a practical case study using panel data from the Shanghai Shenzhen CSI 300 Index Components. Not only did we successfully identify the change-points of variance, but we also delved deeper into the underlying economic drivers behind these changes.
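To make the CUSUM idea concrete, here is a minimal single-series sketch. It is not the panel statistic proposed in the paper: it swaps in the classical centered cumulative-sum-of-squares statistic (in the style of Inclán and Tiao), assumes a mean-zero series, and ignores the mixing dependence and the cross-sectional dimension. The function name `cusum_of_squares` is a hypothetical choice.

```python
import math


def cusum_of_squares(x):
    """Centered cumulative sum of squares for one mean-zero series.

    Computes D_k = C_k / C_n - k / n with C_k = sum of x_t^2 for t <= k, and
    returns (sqrt(n / 2) * max_k |D_k|, argmax k). A large statistic
    suggests a change in variance after observation k (1-indexed).
    """
    n = len(x)
    total = sum(v * v for v in x)
    if n < 2 or total == 0:
        raise ValueError("need at least two observations with nonzero energy")
    running = 0.0
    best_k, best_abs = 0, -1.0
    for k, v in enumerate(x, start=1):
        running += v * v
        d = running / total - k / n
        if abs(d) > best_abs:
            best_k, best_abs = k, abs(d)
    return math.sqrt(n / 2) * best_abs, best_k


if __name__ == "__main__":
    # Deterministic toy series: the scale triples after the 100th observation.
    series = [1.0, -1.0] * 50 + [3.0, -3.0] * 50
    stat, k_hat = cusum_of_squares(series)
    print(stat, k_hat)
```

In this toy series the statistic peaks at the true break, since the running share of squared values falls furthest behind the uniform share `k / n` exactly where the variance jumps.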

2603.12552 2026-03-16 cs.LG math.OC stat.ML

Asymptotic and Finite-Time Guarantees for Langevin-Based Temperature Annealing in InfoNCE

Faris Chaudhry

Comments Accepted at the Optimization for Machine Learning Workshop (NeurIPS 2025)

Abstract

The InfoNCE loss in contrastive learning depends critically on a temperature parameter, yet its dynamics under fixed versus annealed schedules remain poorly understood. We provide a theoretical analysis by modeling embedding evolution under Langevin dynamics on a compact Riemannian manifold. Under mild smoothness and energy-barrier assumptions, we show that classical simulated annealing guarantees extend to this setting: slow logarithmic inverse-temperature schedules ensure convergence in probability to a set of globally optimal representations, while faster schedules risk becoming trapped in suboptimal minima. Our results establish a link between contrastive learning and simulated annealing, providing a principled basis for understanding and tuning temperature schedules.

2603.12525 2026-03-16 stat.ML cond-mat.dis-nn cs.LG

EB-RANSAC: Random Sample Consensus based on Energy-Based Model

Muneki Yasuda, Nao Watanabe, Kaiji Sekimoto

Abstract

Random sample consensus (RANSAC), which is based on a repetitive sampling from a given dataset, is one of the most popular robust estimation methods. In this study, an energy-based model (EBM) for robust estimation that has a similar scheme to RANSAC, energy-based RANSAC (EB-RANSAC), is proposed. EB-RANSAC is applicable to a wide range of estimation problems similar to RANSAC. However, unlike RANSAC, EB-RANSAC does not require a troublesome sampling procedure and has only one hyperparameter. The effectiveness of EB-RANSAC is numerically demonstrated in two applications: a linear regression and maximum likelihood estimation.

2603.12523 2026-03-16 stat.ME math.ST stat.TH

Inference for function-on-function regression: central limit theorem and residual bootstrap

Hyemin Yeon

Abstract

We investigate asymptotic inference in a linear regression model where both response and regressors are functions, using an estimator based on functional principal components analysis. Although this approach is widely used in functional data analysis, there remains significant room for developing its asymptotic properties for function-on-function regression. Our study targets the mean response at a new regressor with two primary aims. First, we refine the existing central limit theorem by relaxing certain technical conditions, which include generalizing the scaling factor, resulting in incorporating a broader class of random functions beyond those having scores with independence or finite higher moments. Second, we introduce a residual bootstrap method that enhances the calibration of various confidence sets for quantities related to mean response, while its consistency is rigorously verified. Numerical studies compare the finite sample performance of both asymptotic and bootstrap approaches, demonstrating higher accuracy of the latter. To illustrate bootstrap inference for mean response, we apply it to the Canadian weather dataset.

2603.12518 2026-03-16 math.ST stat.ME stat.TH

Gaussian and bootstrap approximations for functional principal component regression

Hyemin Yeon

Abstract

Asymptotic inference using functional principal component regression (FPCR) has long been considered difficult, largely because, upon any scalar scaling, the FPCR estimator fails to satisfy a central limit theorem, leading to the prevailing belief that it is unsuitable for direct statistical inference. In this paper, we upend this traditional viewpoint by establishing a new result: upon suitable operator scaling, valid Gaussian and bootstrap approximations hold for the FPCR estimator. We apply this surprising finding to hypothesis testing for the significance of the slope function in functional regression models and demonstrate the strong numerical performance of the resulting tests. While concise, our results yield powerful inferential tools for functional regression. We believe they pave the way for new lines of inferential methodology for more complex functional regression settings.

2603.12448 2026-03-16 stat.CO cs.NA math.NA

Sampling through iterated approximation: Gradient-free and multi-fidelity Bayesian inference via transport

Daniel Sharp, Bart van Bloemen Waanders, Youssef Marzouk

Abstract

We develop an iterative framework for Bayesian inference problems where the posterior distribution may involve computationally intensive models, intractable gradients, significant posterior concentration, and pronounced non-Gaussianity. Our approach integrates: (i) a generalized annealing scheme that combines geometric tempering with multi-fidelity modeling; (ii) expressive measure transport surrogates for the intermediate annealed and final target distributions, learned variationally without evaluating gradients of the target density; and (iii) an importance-weighting scheme to combine multiple quadrature rules, which recycles and reweighs expensive model evaluations as successive posterior approximations are built. Our scheme produces both a quadrature rule for computing posterior expectations and a transport-based approximation of the posterior from which we can easily generate independent Monte Carlo samples. We demonstrate the efficiency and accuracy of our approach on low-dimensional but strongly non-Gaussian Bayesian inverse problems involving partial differential equations.

2603.12394 2026-03-16 stat.AP

Spatio-temporal evolution of surface temperature trends in Ghana (1983-2021): a multi-station approach

John Bagiliko, David Stern, Denis Ndanguza

Abstract

Surface temperature is a fundamental Essential Climate Variable, serving as a primary indicator of climate change and exerting a profound influence on ecosystems, agriculture, and human livelihoods. Although existing research provides a foundation for understanding the climate of Ghana, there remains an opportunity to enhance this landscape with granular station-level analysis. Such high-resolution analysis complements existing studies by capturing localised climatic nuances. This study conducts a detailed spatio-temporal analysis of temperature trends across 22 meteorological stations from 1983 to 2021. Using daily maximum (Tmax) and minimum (Tmin) observations, data were subjected to quality control, homogeneity testing, and homogenisation according to World Meteorological Organisation (WMO) standards, using AgERA5 reanalysis as a reference. The significance and magnitude of trends were determined using the Modified Mann-Kendall test, which is robust in handling potential effects of autocorrelation, and Sen's slope estimator. Results revealed that temperature trends in Ghana are highly localised and seasonal, highlighting the necessity for more studies of this nature. A critical finding is the asymmetric warming across the country, with minimum temperatures rising at an accelerated rate compared to maximum temperatures. This narrowing of the diurnal temperature range poses significant threats to agricultural stability and public health because nocturnal cooling is diminished. These findings underscore the urgent need for site-specific, seasonal climate monitoring to inform customised adaptation strategies. To mitigate these impacts, the study recommends a robust policy framework focusing on afforestation and the transition to green energy.
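The two trend tools named in this abstract can be sketched in a few lines. The code below is an illustration only: it computes Sen's slope as the median of all pairwise slopes and the classical Mann-Kendall S statistic, without the autocorrelation correction of the Modified Mann-Kendall test that the study uses. The function names and the toy inputs are hypothetical and are not data from the paper.

```python
from itertools import combinations
from statistics import median


def sens_slope(times, values):
    """Sen's slope: median of (v_j - v_i) / (t_j - t_i) over all pairs i < j."""
    slopes = [
        (v2 - v1) / (t2 - t1)
        for (t1, v1), (t2, v2) in combinations(zip(times, values), 2)
        if t2 != t1
    ]
    if not slopes:
        raise ValueError("need at least two observations at distinct times")
    return median(slopes)


def mann_kendall_s(values):
    """Classical Mann-Kendall S: sum of signs of later-minus-earlier differences."""
    return sum((b > a) - (b < a) for a, b in combinations(values, 2))


if __name__ == "__main__":
    years = [0, 1, 2, 3, 4]
    temps = [0.0, 1.0, 2.0, 3.0, 100.0]  # toy series with one gross outlier
    print(sens_slope(years, temps))      # the median slope ignores the outlier
    print(mann_kendall_s(temps))
```

Because the estimate is a median over pairwise slopes, a single extreme observation shifts it far less than it would shift a least-squares slope, which is one reason the estimator is common for station records.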

2603.12356 2026-03-16 stat.AP

Modeling diesel output particulate matter as the Ornstein-Uhlenbeck process

Maxwell Bolt, Alex Alberts, Akash S. Desai, Peter Meckl, Ilias Bilionis

Comments 18 pages, 8 figures

Abstract

Diesel engine particulate matter (PM) is one of the most challenging emission constituents to predict. As engines become cleaner and emissions levels drop, manufacturers need reliable methods to quantify the PM generated by production engines. Due to the inaccuracy of commercial-grade sensors, they turn to predictive models to accurately estimate PM. In practice, this requires a computationally inexpensive model that provides PM estimates with calibrated uncertainty. Complex, multiscale physics make mechanistic models intractable and traditional data-driven methods struggle in transient drive cycles due to the stochastic nature of PM generation. Leveraging recent innovations in PM measurement technology, we introduce a novel PM model based on the Ornstein-Uhlenbeck (OU) process. The OU process is a mean-reverting stochastic process commonly used in financial modeling, now being explored for engineering applications, and can be described as a stochastic differential equation (SDE). We modify the OU process by parameterizing the terms of the SDE as functions of the engine state, which are then fit with a maximum likelihood estimate. In a synthetic example, we verify the ability of our model to learn a time-varying, parametrized OU process. We then train the model using real experimental data designed to dynamically cover the engine operating space and test the trained model on EPA-regulated drive cycles. For most drive cycles, we find the method accurately predicts cumulative output of PM across time.
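For readers unfamiliar with the Ornstein-Uhlenbeck process, the sketch below simulates the SDE dX = theta (mu - X) dt + sigma dW with an Euler-Maruyama step. It assumes constant parameters, whereas the paper parameterizes the SDE terms as functions of the engine state and fits them by maximum likelihood. The function name `simulate_ou` and all numerical values are hypothetical.

```python
import math
import random


def simulate_ou(x0, theta, mu, sigma, dt, n_steps, seed=0):
    """Euler-Maruyama path of dX = theta * (mu - X) dt + sigma dW.

    theta: mean-reversion rate, mu: long-run mean, sigma: noise scale.
    Returns a list of n_steps + 1 values starting at x0.
    """
    rng = random.Random(seed)  # seeded generator for reproducible paths
    path = [x0]
    for _ in range(n_steps):
        x = path[-1]
        drift = theta * (mu - x) * dt
        diffusion = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append(x + drift + diffusion)
    return path


if __name__ == "__main__":
    path = simulate_ou(x0=0.0, theta=1.0, mu=2.0, sigma=0.3, dt=0.01, n_steps=1000)
    print(path[-1])  # fluctuates around the long-run mean mu
```

Setting `sigma=0` turns the recursion into a deterministic relaxation toward `mu`, which is a convenient sanity check on the drift term before any noise is added.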

2603.12352 2026-03-16 stat.ME

Bayesian Covariate-Varying Interaction Analysis for Multivariate Count Data: Application to Microbiome Studies

Shuangjie Zhang, Michael L. Patnode, Juhee Lee

Comments 33 pages, 10 figures

Abstract

Understanding covariate-varying interdependencies among features is of great interest in various applications. Motivated by microbiome studies where microbial abundances and interactions vary with environmental factors, we develop a Bayesian covariate-varying factor model. This model flexibly estimates heteroscedasticity in the covariance matrix as a function of covariates. Specifically, our approach employs covariance regression through linear regression on a lower-dimensional factor loading matrix. This formulation, combined with joint sparsity induced by the Dirichlet--Horseshoe prior for the factor loadings, provides robust estimation of covariate-varying covariance in high-dimensional settings. The model simultaneously incorporates a regression structure for the mean abundance and jointly addresses the covariate-varying mean and covariance structure. Furthermore, the model tackles key statistical challenges such as discreteness, over-dispersion, compositionality, and high dimensionality, common in microbiome data analysis, using a flexible nonparametric Bayesian framework. We thoroughly investigate the properties of the model and conduct extensive simulation studies to examine its performance. Real microbiome data examples are provided for illustration.

2603.12351 2026-03-16 stat.ML cs.LG q-bio.QM stat.CO stat.ME

Probabilistic Joint and Individual Variation Explained (ProJIVE) for Data Integration

Raphiel J. Murden, Ganzhong Tian, Deqiang Qiu, Benjamin B. Risk

Journal ref Journal of Computational and Graphical Statistics (2026)
Abstract

Collecting multiple types of data on the same set of subjects is common in modern scientific applications including genomics, metabolomics, and neuroimaging. Joint and Individual Variation Explained (JIVE) seeks a low-rank approximation of the joint variation between two or more sets of features captured on common subjects and isolates this variation from that unique to each set of features. We develop an expectation-maximization (EM) algorithm to estimate a probabilistic model for the JIVE framework. The model extends probabilistic principal components analysis to multiple data sets. Our maximum likelihood approach simultaneously estimates joint and individual components, which can lead to greater accuracy compared to other methods. We apply ProJIVE to measures of brain morphometry and cognition in Alzheimer's disease. ProJIVE learns biologically meaningful courses of variation, and the joint morphometry and cognition subject scores are strongly related to more expensive existing biomarkers. Data used in preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. Code to reproduce the analysis is available on our GitHub page.

2603.12349 2026-03-16 cs.LG cs.AI q-bio.QM stat.ML

Budget-Sensitive Discovery Scoring: A Formally Verified Framework for Evaluating AI-Guided Scientific Selection

Abhinaba Basu, Pavan Chakraborty

Abstract

Scientific discovery increasingly relies on AI systems to select candidates for expensive experimental validation, yet no principled, budget-aware evaluation framework exists for comparing selection strategies -- a gap intensified by large language models (LLMs), which generate plausible scientific proposals without reliable downstream evaluation. We introduce the Budget-Sensitive Discovery Score (BSDS), a formally verified metric -- 20 theorems machine-checked by the Lean 4 proof assistant -- that jointly penalizes false discoveries (lambda-weighted FDR) and excessive abstention (gamma-weighted coverage gap) at each budget level. Its budget-averaged form, the Discovery Quality Score (DQS), provides a single summary statistic that no proposer can inflate by performing well at a cherry-picked budget. As a case study, we apply BSDS/DQS to the question of whether LLMs add marginal value to an existing ML pipeline for drug discovery candidate selection. We evaluate 39 proposers -- 11 mechanistic variants, 14 zero-shot LLM configurations, and 14 few-shot LLM configurations -- using SMILES representations on MoleculeNet HIV (41,127 compounds, 3.5% active, 1,000 bootstrap replicates) under both random and scaffold splits. Three findings emerge. First, the simple RF-based Greedy-ML proposer achieves the best DQS (-0.046), outperforming all MLP variants and LLM configurations. Second, no LLM surpasses the Greedy-ML baseline under zero-shot or few-shot evaluation on HIV or Tox21, establishing that LLMs provide no marginal value over an existing trained classifier. Third, the proposer hierarchy generalizes across five MoleculeNet benchmarks spanning 0.18%-46.2% prevalence, a non-drug AV safety domain, and a 9x7 grid of penalty parameters (tau >= 0.636, mean tau = 0.863). The framework applies to any setting where candidates are selected under budget constraints and asymmetric error costs.

2603.12297 2026-03-16 cs.IT math.IT math.PR math.ST stat.TH

Complex-Valued Probability Measures and Their Applications in Information Theory

Siang Cheng, Hejun Xu, Tianxiao Pang

Comments 23 pages, 3 tables

Abstract

This paper introduces a comprehensive framework for complex-valued probability measures and explores their novel applications in information theory and statistical analysis. We define a complex probability measure as a phase-modulated extension of a classical probability measure. Building upon this foundation, we propose three fundamental information-theoretic quantities: complex entropy, which quantifies distribution uniformity through phase coherence; complex divergence, an asymmetric measure of dissimilarity between distributions; and the complex metric, a symmetric distance function satisfying the triangle inequality. We establish these concepts rigorously for both continuous and discrete probability distributions, proving key properties such as boundedness, continuity under total variation convergence, and clear extremal behaviors. A detailed comparative analysis with classical measures (Shannon entropy and Kullback-Leibler divergence) highlights the unique geometric and interpretive advantages of the proposed framework, particularly its sensitivity to distributional shape via a tunable phase parameter. We elucidate a profound formal analogy between the complex entropy integral and Feynman's path integral formulation of quantum mechanics, suggesting a deeper conceptual bridge. Finally, we demonstrate the practical utility of the complex metric through a detailed application in nonparametric two-sample hypothesis testing, outlining the testing procedure, advantages, limitations, and providing a conceptual simulation. This work opens new avenues for analyzing probability distributions through the lens of complex analysis and interference phenomena, with potential impacts across information theory, statistical inference, and machine learning.

2603.12288 2026-03-16 cs.LG cs.AI stat.ML

From Garbage to Gold: A Data-Architectural Theory of Predictive Robustness

Terrence J. Lee-St. John, Jordan L. Lawson, Bartlomiej Piechowski-Jozwiak

Comments 120 pages, 12 figures, 3 tables. Simulation code and documentation available at: https://github.com/tjleestjohn/from-garbage-to-gold

Abstract

Tabular machine learning presents a paradox: modern models achieve state-of-the-art performance using high-dimensional (high-D), collinear, error-prone data, defying the "Garbage In, Garbage Out" mantra. To help resolve this, we synthesize principles from Information Theory, Latent Factor Models, and Psychometrics, clarifying that predictive robustness arises not solely from data cleanliness, but from the synergy between data architecture and model capacity. Partitioning predictor-space "noise" into "Predictor Error" and "Structural Uncertainty" (informational deficits from stochastic generative mappings), we prove that leveraging high-D sets of error-prone predictors asymptotically overcomes both types of noise, whereas cleaning a low-D set is fundamentally bounded by Structural Uncertainty. We demonstrate why "Informative Collinearity" (dependencies from shared latent causes) enhances reliability and convergence efficiency, and explain why increased dimensionality reduces the latent inference burden, enabling feasibility with finite samples. To address practical constraints, we propose "Proactive Data-Centric AI" to identify predictors that enable robustness efficiently. We also derive boundaries for Systematic Error Regimes and show why models that absorb "rogue" dependencies can mitigate assumption violations. Linking latent architecture to Benign Overfitting, we offer a first step towards a unified view of robustness to Outcome Error and predictor-space noise, while also delineating when traditional DCAI's focus on label cleaning remains powerful. By redefining data quality from item-level perfection to portfolio-level architecture, we provide a theoretical rationale for "Local Factories" -- learning from live, uncurated enterprise "data swamps" -- supporting a deployment paradigm shift from "Model Transfer" to "Methodology Transfer" to overcome static generalizability limitations.

2603.12284 2026-03-16 stat.ME stat.ML

Bayesian Conservative Policy Optimization (BCPO): A Novel Uncertainty-Calibrated Offline Reinforcement Learning with Credible Lower Bounds

Debashis Chatterjee

Details
English abstract

Offline reinforcement learning (RL) aims to learn decision policies from a fixed batch of logged transitions, without additional environment interaction. Despite remarkable empirical progress, offline RL remains fragile under distribution shifts: value-based methods can overestimate the value of unseen actions, yielding policies that exploit model errors rather than genuine long-term rewards. We propose Bayesian Conservative Policy Optimization (BCPO), a unified framework that converts epistemic uncertainty into provably conservative policy improvement. BCPO maintains a hierarchical Bayesian posterior over environment/value models, constructs a credible lower bound (LCB) on action values, and performs policy updates under explicit KL regularization toward the behavior distribution. This yields an uncertainty-calibrated analogue of conservative policy iteration in the offline regime. We provide a finite-MDP theory showing that the pessimistic fixed point lower-bounds the true value function with high probability and that KL-controlled updates improve a computable return lower bound. Empirically, we verify the methodology on a real offline replay dataset for the CartPole benchmark obtained via the d3rlpy ecosystem, and report diagnostics that link uncertainty growth and policy drift to offline instability, motivating principled early stopping and calibration.

2603.11829 2026-03-16 stat.ME

Robust Sequential Hypothesis Testing with Generalized Estimating Equations for Incomplete Clustered and Longitudinal Data

Nathan T. Provost, Abdus S. Wahed

Comments VERSION 2: First version accidentally used older abbreviated title, this has been corrected. 24 pages; 1 figure

Details
English abstract

Existing sequential generalized estimating equation methodology for longitudinal and group-correlated data focuses on narrow hypotheses concerning treatment efficacy and often makes modeling assumptions that impede the desirable robustness of the involved test statistics. Drawing upon the well-established theory of incremental information gain for well-posed sequential analyses, we develop an approach that does not rely on modeling assumptions that infringe upon the robustness of the resulting estimators while simultaneously testing a much wider range of hypotheses. Our methodology provides general submatrix-level asymptotic theory for the evaluation of joint covariance matrices of sequential test statistics. Moreover, this framework allows us to construct a novel approach to computing efficacy boundaries, the likes of which can be estimated with greater precision at later interim times. These constructions also accommodate accessible multiple imputation procedures, thereby allowing for our approach to be applied to incomplete datasets. Type I error and power are assessed through a series of comprehensive simulations mirroring the simulations of recent work to facilitate a proper comparison. We conclude by applying our methods to a dataset from a longitudinal study concerning the impact of race on the efficacy of a treatment for hepatitis C.

2602.21130 2026-03-16 stat.ML cs.LG

An Enhanced Projection Pursuit Tree Classifier with Visual Methods for Assessing Algorithmic Improvements

Natalia da Silva, Dianne Cook, Eun-Kyung Lee

Details
English abstract

This paper presents enhancements to the projection pursuit tree classifier and visual diagnostic methods for assessing their impact in high dimensions. The original algorithm uses linear combinations of variables in a tree structure where depth is constrained to be less than the number of classes -- a limitation that proves too rigid for complex classification problems. Our extensions improve performance in multi-class settings with unequal variance-covariance structures and nonlinear class separations by allowing more splits and more flexible class groupings in the projection pursuit computation. Proposing algorithmic improvements is straightforward; demonstrating their actual utility is not. We therefore develop two visual diagnostic approaches to verify that the enhancements perform as intended. Using high-dimensional visualization techniques, we examine model fits on benchmark datasets to assess whether the algorithm behaves as theorized. An interactive web application enables users to explore the behavior of both the original and enhanced classifiers under controlled scenarios. The enhancements are implemented in the R package PPtreeExt.

2601.02610 2026-03-16 stat.ME stat.ML

Conformal novelty detection with false discovery rate control at the boundary

Zijun Gao, Etienne Roquain, Daniel Xiang

Comments 43 pages, 17 figures, 1 table

Details
English abstract

Conformal novelty detection is a classical machine learning task for which uncertainty quantification is essential for providing reliable results. Recent work has shown that the BH procedure applied to conformal p-values controls the false discovery rate (FDR). Unfortunately, the BH procedure can lead to over-optimistic assessments near the rejection threshold, with an increase of false discoveries at the margin as pointed out by Soloff et al. (2024). This issue is solved therein by the support line (SL) correction, which is proven to control the boundary false discovery rate (bFDR) in the independent, non-conformal setting. The present work extends the SL method to the conformal setting: first, we show that the SL procedure can violate the bFDR control in this specific setting. Second, we propose several alternatives that provably control the bFDR in the conformal setting. Finally, numerical experiments with both synthetic and real data support our theoretical findings and show the relevance of the new proposed procedures.
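
As a point of reference for the baseline mentioned in this abstract, the following is a minimal sketch of the BH procedure applied to conformal p-values. It is not the support line (SL) correction or any of the bFDR-controlling procedures proposed in the paper. The function names, the convention that a larger score means "more novel", and the default level `alpha` are assumptions made for illustration only.

```python
import numpy as np


def conformal_pvalues(cal_scores, test_scores):
    """Conformal p-values from calibration (null) scores.

    Assumed convention: larger score = more novel, so
    p_j = (1 + #{i : cal_i >= test_j}) / (n + 1).
    """
    cal = np.sort(np.asarray(cal_scores, dtype=float))
    test = np.asarray(test_scores, dtype=float)
    n = cal.size
    # Index of the first calibration score >= test score; everything
    # from that index onward counts toward the p-value.
    n_ge = n - np.searchsorted(cal, test, side="left")
    return (1.0 + n_ge) / (n + 1.0)


def benjamini_hochberg(pvals, alpha=0.1):
    """Boolean rejection mask of the BH step-up procedure at level alpha."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        # Largest rank k with p_(k) <= alpha * k / m; reject ranks 1..k.
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    return reject
```

A caller would compute `benjamini_hochberg(conformal_pvalues(cal, test), alpha)` and flag the test points where the mask is true. The rejections closest to the threshold are the ones the abstract describes as over-optimistic.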

2512.13622 2026-03-16 stat.ME stat.AP

Empirical Bayes learning from selectively reported confidence intervals

Hunter Chen, Junming Guan, Erik van Zwet, Nikolaos Ignatiadis

Details
English abstract

We develop a statistical framework for empirical Bayes learning from selectively reported confidence intervals, and apply it to provide context for interpreting results published in MEDLINE abstracts. We use a collection of 326,060 z-scores from MEDLINE abstracts (2000-2018) as the input for an empirical Bayes analysis, with publication bias as a key methodological challenge. We address publication bias through a selective tilting approach that extends empirical Bayes confidence intervals to truncated sampling. Our framework provides coverage guarantees for functionals including posterior estimands describing idealized replications and the symmetrized posterior mean, which we justify decision-theoretically as optimal among sign-equivariant (odd) estimators.

2510.09816 2026-03-16 q-bio.NC math.OC physics.bio-ph physics.data-an stat.ML

A mathematical theory for understanding when abstract representations emerge in neural networks

Bin Wang, W. Jeffrey Johnston, Stefano Fusi

Comments 19 pages, 8 figures

Details
English abstract

Recent experiments in neuroscience reveal that task-relevant variables are often encoded in approximately orthogonal subspaces of neural population activity. These disentangled, or abstract, representations have been observed in multiple brain areas and across different species. These representations have been shown to support out-of-distribution generalization and rapid learning of novel tasks. The mechanisms by which these representations emerge remain poorly understood, especially in the case of supervised task behavior. Here, we show mathematically that abstract representations of latent variables are guaranteed to appear in the hidden layer of feedforward nonlinear networks when they are trained on tasks that depend directly on these latent variables. These learned abstract representations reflect the semantics of the input stimuli. To show this, we reformulate the usual optimization over the network weights into a mean field optimization problem over the distribution of neural preactivations. We then apply this framework to finite-width ReLU networks and show that the hidden layer of these networks will exhibit an abstract representation at all global minima of the task objective. Finally, we extend our findings to two broad families of activation functions as well as deep feedforward architectures. Together, our results provide an explanation for the widely observed abstract representations in both the brain and artificial neural networks. In addition, the general framework that we develop here provides a mathematically tractable toolkit for understanding the emergence of different kinds of representations in task-optimized, feature-learning network models.

2510.09598 2026-03-16 stat.ME

Defensive Model Expansion for Robust Bayesian Inference

Antonio R. Linero

Details
English abstract

Some applied researchers hesitate to use nonparametric methods, worrying that they will lose power in small samples or overfit the data when simpler models are sufficient. We argue that at least some of these concerns are unfounded when nonparametric models are strongly shrunk toward parametric submodels. We consider expanding a parametric model with a nonparametric component $r(x)$ that is heavily shrunk toward zero. This construction allows the model to adapt automatically: if the parametric model is correct, the nonparametric component disappears, recovering parametric efficiency, while if it is misspecified, the flexible component activates to capture the missing signal. We show that this adaptive behavior follows from simple and general conditions. Specifically, we prove that Bayesian nonparametric models anchored to linear regression, including variants of Gaussian process regression and Bayesian additive regression trees, consistently identify the correct parametric submodel when it holds and give asymptotically efficient inference for regression coefficients. In simulations, we find that the general BART model performs identically to correctly specified linear regression when the parametric model holds, and substantially outperforms it when nonlinear effects are present. This suggests a practical paradigm: defensive model expansion as a safeguard against model misspecification.

2510.05645 2026-03-16 math.ST stat.TH

Weak convergence of Bayes estimators under general loss functions

Robin Requadt, Housen Li, Axel Munk

Details
English abstract

We investigate the asymptotic behavior of parametric Bayes estimators under a broad class of loss functions that extend beyond the classical translation-invariant setting. To this end, we develop a unified theoretical framework for loss functions exhibiting locally polynomial structure. This general theory encompasses important examples such as the squared Wasserstein distance, the Sinkhorn divergence and Stein discrepancies, which have gained prominence in modern statistical inference and machine learning. Building on the classical Bernstein--von Mises theorem, we establish sufficient conditions under which Bayes estimators inherit the posterior's asymptotic normality. As a by-product, we also derive conditions for the differentiability of Wasserstein-induced loss functions and provide new consistency results for Bayes estimators. Several examples and numerical experiments demonstrate the relevance and accuracy of the proposed methodology.

2508.21742 2026-03-16 cs.AI stat.ME

Orientability of Causal Relations in Time Series using Summary Causal Graphs and Faithful Distributions

Timothée Loranchet, Charles K. Assaad

Comments Accepted to AISTATS 2026

Details
English abstract

Understanding causal relations between temporal variables is a central challenge in time series analysis, particularly when the full causal structure is unknown. Even when the full causal structure cannot be fully specified, experts often succeed in providing a high-level abstraction of the causal graph, known as a summary causal graph, which captures the main causal relations between different time series while abstracting away micro-level details. In this work, we present conditions that guarantee the orientability of micro-level edges between temporal variables given the background knowledge encoded in a summary causal graph and assuming access to a faithful and causally sufficient distribution with respect to the true unknown graph. Our results provide theoretical guarantees for edge orientation at the micro-level, even in the presence of cycles or bidirected edges at the macro-level. These findings offer practical guidance for leveraging SCGs to inform causal discovery in complex temporal systems and highlight the value of incorporating expert knowledge to improve causal inference from observational time series data.

2505.10628 2026-03-16 stat.ML cs.LG math.PR

Minimax learning rates for estimating binary classifiers under margin conditions

Jonathan García, Philipp Petersen

Details
English abstract

We study classification problems using binary estimators where the decision boundary is described by horizon functions and where the data distribution satisfies a geometric margin condition. A key novelty of our work is the derivation of lower bounds for the worst-case learning rates over broad classes of functions, under a geometric margin condition -- a setting that is almost universally satisfied in practice, but remains theoretically challenging. Moreover, we work in the noiseless setting, where lower bounds are particularly hard to establish. Our general results cover, in particular, classification problems with decision boundaries belonging to several classes of functions: for Barron-regular functions, Hölder-continuous functions, and convex-Lipschitz functions with strong margins, we identify optimal rates close to the fast learning rates of $\mathcal{O}(n^{-1})$ for $n \in \mathbb{N}$ samples.

2411.12367 2026-03-16 stat.ME stat.AP

Left-truncated discrete lifespans: The AFiD enterprise panel

Eric Scholz, Rafael Weißbach

Comments 42 pages, 2 figures, 4 tables

Details
English abstract

Our model for the lifespan of an enterprise is the geometric distribution. We do not formulate a model for enterprise foundation, but assume that foundations and lifespans are independent. We aim to fit the model to information about foundation and closure of German enterprises in the AFiD panel. The lifespan for an enterprise that has been founded before the first wave of the panel is either left truncated, when the enterprise is contained in the panel, or missing, when it already closed down before the first wave. Marginalizing the likelihood to that part of the enterprise history after the first wave contributes to the aim of a closed-form estimate and standard error. Invariance under the foundation distribution is achieved by conditioning on observability of the enterprises. The conditional marginal likelihood can be written as a function of a martingale. The latter arises when calculating the compensator, with respect to some filtration, of a process that counts the closures. The estimator itself can then also be written as a martingale transform and consistency as well as asymptotic normality are easily proven. The life expectancy of German enterprises, estimated from the demographic information about 1.4 million enterprises for the years 2018 and 2019, is ten years. The width of the confidence interval is two months. Closure after the last wave is taken into account as right censored.

2411.07993 2026-03-16 stat.AP

Markov Processes for Enhanced Deepfake Generation and Detection

Michael A. Kouritzin, Ian Zhang, Jyoti Bhadana, Seoyeon Park

Details
English abstract

New and existing methods for generating, and especially detecting, deepfakes are investigated and compared on the simple problem of authenticating coin flip data. Importantly, an alternative approach to deepfake generation and detection, which uses a Markov Observation Model (MOM), is introduced and compared on detection ability to the traditional Generative Adversarial Network (GAN) approach as well as Support Vector Machine (SVM), Branching Particle Filtering (BPF) and human alternatives. MOM was also compared on generative and discrimination ability to GAN, filtering and humans (as SVM does not have generative ability). Humans are shown to perform the worst, followed in order by GAN, SVM, BPF and MOM, which was the best at the detection of deepfakes. Unsurprisingly, the order was maintained on the generation problem with removal of SVM as it does not have generation ability.

2410.18613 2026-03-16 cs.LG cs.CV stat.ML

Rethinking Attention: Polynomial Alternatives to Softmax in Transformers

Hemanth Saratchandran, Jianqiao Zheng, Yiping Ji, Wenbo Zhang, Simon Lucey

Details
English abstract

This paper questions whether the strong performance of softmax attention in transformers stems from producing a probability distribution over inputs. Instead, we argue that softmax's effectiveness lies in its implicit regularization of the Frobenius norm of the attention matrix, which stabilizes training. Motivated by this, we explore alternative activations, specifically polynomials, that achieve a similar regularization effect. Our theoretical analysis shows that certain polynomials can serve as effective substitutes for softmax, achieving strong performance across transformer applications despite violating softmax's typical properties of positivity, normalization, and sparsity. Extensive experiments support these findings, offering a new perspective on attention mechanisms.

2410.03191 2026-03-16 stat.ML cs.LG

Nested Deep Learning Model Towards A Foundation Model for Brain Signal Data

Fangyi Wei, Jiajie Mo, Kai Zhang, Haipeng Shen, Srikantan Nagarajan, Fei Jiang

Comments 56 pages; paper structure updated

Details
English abstract

Epilepsy affects around 50 million people globally. Electroencephalography (EEG) or Magnetoencephalography (MEG) based spike detection plays a crucial role in diagnosis and treatment. Manual spike identification is time-consuming and requires specialized training that further limits the number of qualified professionals. To ease the difficulty, various algorithmic approaches have been developed. However, the existing methods face challenges in handling varying channel configurations and in identifying the specific channels where the spikes originate. A novel Nested Deep Learning (NDL) framework is proposed to overcome these limitations. NDL applies a weighted combination of signals across all channels, ensuring adaptability to different channel setups, and allows clinicians to identify key channels more accurately. Through theoretical analysis and empirical validation on real EEG/MEG datasets, NDL is shown to improve prediction accuracy, achieve channel localization, support cross-modality data integration, and adapt to various neurophysiological applications.

2407.15693 2026-03-16 math.AP cs.LG math.FA math.ST stat.TH

Fisher-Rao Gradient Flow: Geodesic Convexity and Functional Inequalities

José A. Carrillo, Yifan Chen, Daniel Zhengyu Huang, Jiaoyang Huang, Dongyi Wei

Comments 38 pages

Details
English abstract

The dynamics of probability density functions have been extensively studied in computational science and engineering to understand physical phenomena and facilitate algorithmic design. Of particular interest are dynamics formulated as gradient flows of energy functionals under the Wasserstein metric. The development of functional inequalities, such as the log-Sobolev inequality, plays a pivotal role in analyzing the convergence of these dynamics. This paper aims to extend the success of functional inequality techniques to dynamics that are gradient flows under the Fisher-Rao metric, with various $f$-divergences serving as energy functionals. Such dynamics take the form of nonlocal differential equations, for which existing analyses critically rely on explicit solution formulas in special cases. We provide a comprehensive study of functional inequalities and the relevant geodesic convexity for Fisher-Rao gradient flows under minimal assumptions. A notable feature of our functional inequalities is their independence from the log-concavity or log-Sobolev constants of the target distribution. Consequently, the convergence rate of the dynamics (assuming well-posedness) remains uniform across general target distributions.

2401.02739 2026-03-16 cs.LG q-bio.QM stat.ML

Denoising Diffusion Variational Inference: Diffusion Models as Expressive Variational Posteriors

Wasu Top Piriyakulkij, Yingheng Wang, Volodymyr Kuleshov

Comments published at AAAI 2025; the first two authors contribute equally to this work; code available at https://github.com/topwasu/DDVI

Details
English abstract

We propose denoising diffusion variational inference (DDVI), a black-box variational inference algorithm for latent variable models which relies on diffusion models as flexible approximate posteriors. Specifically, our method introduces an expressive class of diffusion-based variational posteriors that perform iterative refinement in latent space; we train these posteriors with a novel regularized evidence lower bound (ELBO) on the marginal likelihood inspired by the wake-sleep algorithm. Our method is easy to implement (it fits a regularized extension of the ELBO), is compatible with black-box variational inference, and outperforms alternative classes of approximate posteriors based on normalizing flows or adversarial networks. We find that DDVI improves inference and learning in deep latent variable models across common benchmarks as well as on a motivating task in biology -- inferring latent ancestry from human genomes -- where it outperforms strong baselines on the Thousand Genomes dataset.

2311.09838 2026-03-16 stat.ME q-bio.GN q-bio.PE stat.AP stat.CO

Bayesian Inference of Reproduction Number from Epidemiological and Genetic Data Using Particle MCMC

Alicia Gill, Jere Koskela, Xavier Didelot, Richard G. Everitt

Comments 24 pages, 11 figures (30 pages, 19 figures including appendices)

Details
English abstract

Inference of the reproduction number through time is of vital importance during an epidemic outbreak. Typically, epidemiologists tackle this using observed prevalence or incidence data. However, prevalence and incidence data alone are often noisy or partial. Models can also have identifiability issues with determining whether a large amount of a small epidemic or a small amount of a large epidemic has been observed. Sequencing data, however, are becoming more abundant, so approaches which can incorporate genetic data are an active area of research. We propose using particle MCMC methods to infer the time-varying reproduction number from a combination of prevalence data reported at a set of discrete times and a dated phylogeny reconstructed from sequences. We validate our approach on simulated epidemics with a variety of scenarios. We then apply the method to real data sets of HIV-1 in North Carolina, USA and tuberculosis in Buenos Aires, Argentina. The models and algorithms are implemented in an open source R package called EpiSky which is available at https://github.com/alicia-gill/EpiSky.

2303.07287 2026-03-16 stat.ML cs.LG econ.EM

Tight Non-asymptotic Inference via Sub-Gaussian Intrinsic Moment Norm

Huiming Zhang, Haoyu Wei, Guang Cheng

Comments This manuscript has been withdrawn by the authors as it is not yet ready for public release. Further improvements and revisions are required before a final version can be considered for distribution

Details
English abstract

In non-asymptotic learning, variance-type parameters of sub-Gaussian distributions are of paramount importance. However, directly estimating these parameters using the empirical moment generating function (MGF) is infeasible. To address this, we suggest using the sub-Gaussian intrinsic moment norm [Buldygin and Kozachenko (2000), Theorem 1.3] achieved by maximizing a sequence of normalized moments. Significantly, the suggested norm can not only reconstruct the exponential moment bounds of MGFs but also provide tighter sub-Gaussian concentration inequalities. In practice, we provide an intuitive method for assessing whether data with a finite sample size is sub-Gaussian, utilizing the sub-Gaussian plot. The intrinsic moment norm can be robustly estimated via a simple plug-in approach. Our theoretical findings are also applicable to reinforcement learning, including the multi-armed bandit scenario.