arXivDaily: arXiv daily academic digest, updated Monday through Friday
2603.06502 2026-03-09 cs.SI stat.AP

Mapping the long-term trajectories of political violence in Africa

Steven M. Radil, Nick Dorward, Olivier Walther, Levi John Wolf

Abstract

Existing models of political violence often emphasize discrete transitions, when conflicts emerge, escalate, or subside, without considering the longer trajectories of violence that accumulate across time and space. This paper introduces a spatially explicit longitudinal sequence analysis to address this gap. Using event-level data from the Armed Conflict Location and Event Dataset covering Africa from 1997 to 2024, we classify locations according to the intensity and spatial concentration of violence, tracing how these states evolve into distinct conflict trajectories. Applying optimal matching and clustering techniques, we identify six recurrent patterns ranging from short-lived, localized outbreaks to protracted high-intensity conflicts. We further assess how these trajectories align across neighboring areas, revealing evidence of spatial interdependence, particularly in border regions. By highlighting the temporal rhythms and geographic linkages of political violence, the study advances conflict research beyond isolated transitions and provides a framework for understanding the life cycles of violence.
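
As a rough illustration of the optimal matching step mentioned in the abstract above, the sketch below computes an edit-style distance between two state sequences by dynamic programming. The state labels and the substitution and indel costs are hypothetical choices for this example, not values taken from the paper.

```python
def optimal_matching_distance(seq_a, seq_b, sub_cost=2.0, indel_cost=1.0):
    """Minimum total cost of substitutions, insertions and deletions
    needed to turn seq_a into seq_b (costs are illustrative assumptions)."""
    n, m = len(seq_a), len(seq_b)
    # dist[i][j] = cost of aligning the first i states of seq_a
    # with the first j states of seq_b.
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i * indel_cost
    for j in range(1, m + 1):
        dist[0][j] = j * indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0.0 if seq_a[i - 1] == seq_b[j - 1] else sub_cost
            dist[i][j] = min(
                dist[i - 1][j] + indel_cost,        # delete a state from seq_a
                dist[i][j - 1] + indel_cost,        # insert a state from seq_b
                dist[i - 1][j - 1] + substitution,  # match or substitute
            )
    return dist[n][m]


# Hypothetical yearly violence states for two locations.
location_a = ["none", "low", "high", "high"]
location_b = ["none", "high", "high", "high"]
distance = optimal_matching_distance(location_a, location_b)
```

A matrix of such pairwise distances is the usual input to a clustering routine; which clustering method the authors use is not stated in the abstract.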

2603.06493 2026-03-09 stat.ME

Balancing Efficiency and Feasibility: A Sensitivity Analysis of the Augmentation Parameter in the Finite Selection Model

Safaa K. Kadhem

Abstract

This paper investigates the role of the augmentation parameter in the Finite Selection Model (FSM) and its impact on estimator performance. Through a comprehensive Monte Carlo simulation study, we analyze the sensitivity of bias, variance, and mean squared error to different values of the augmentation parameter. The results demonstrate that moderate augmentation improves covariate balance while maintaining estimation efficiency. However, excessive augmentation may increase variance and reduce estimator stability. The findings provide practical guidelines for selecting the augmentation parameter in applied experimental design settings.

2603.06462 2026-03-09 stat.ME stat.ML

Bayesian Additive Distribution Regression

Antonio R. Linero, Soumyabrata Bose, Jared Murray

Abstract

Distribution regression, where the goal is to predict a scalar response from a distribution-valued predictor, arises naturally in settings where observations are grouped and outcomes depend on group-level characteristics rather than on individual measurements. We introduce DistBART, a Bayesian nonparametric approach to distribution regression that models the regression function as a linear functional with the Riesz representer assigned a Bayesian additive regression trees (BART) prior. We argue that shallow decision tree ensembles encode reasonable inductive biases for tabular data, making them appropriate in settings where the functional depends primarily on low-dimensional marginals of the distributions. We show this both empirically on synthetic and real data and theoretically through an adaptive posterior concentration result. We also establish connections to kernel methods, and use this connection to motivate variants of DistBART that can learn nonlinear functionals. To enable scalability to large datasets, we develop a random-feature approximation that samples trees from the BART prior and reduces inference to sparse Bayesian linear regression, achieving computational efficiency while retaining uncertainty quantification.

2603.06437 2026-03-09 stat.AP

Estimating Residential Displacement in the Central Puget Sound Region using Household Survey Data

Ameer Dharamshi, Mary Richards, Suzanne Childress, Brian Lee, Daniel Casey

Abstract

Housing instability is a persistent challenge faced by households in cities across the United States. In worst-case scenarios, households are displaced from their residences and forced to start anew. In an effort to mitigate the harms of residential displacement, local policymakers have an interest in monitoring residential displacement within their communities. In this work, we propose a new strategy to estimate sub-county residential displacement within the Central Puget Sound Region using data from three household survey programs. We first estimate residential displacement between 2016 and 2023 from a local household travel survey using a Bayesian spatiotemporal model, and poststratify with data from the American Community Survey. We then benchmark these estimates to the American Housing Survey to ensure consistency across sources. The results reveal east-west and north-south differences in residential displacement rates within the region as well as a temporary moderation of displacement in the 2020-2021 cohort of movers. Our estimates are publicly available for interested stakeholders to further study trends in residential displacement in the Central Puget Sound Region, and our methodology is transportable to other jurisdictions with similar data contexts.

2510.17798 2026-03-09 eess.SY cs.SY stat.AP

Admittance Matrix Concentration Inequalities for Understanding Uncertain Power Networks

Samuel Talkington, Cameron Khanpour, Rahul K. Gupta, Sergio A. Dorado-Rojas, Daniel Turizo, Hyeongon Park, Dmitrii M. Ostrovskii, Daniel K. Molzahn

Comments 9 pages, 2 figures

Abstract

This paper presents conservative probabilistic bounds for the spectrum of the admittance matrix and classical linear power flow models under uncertain network parameters; for example, probabilistic line contingencies. Our proposed approach imports tools from probability theory, such as concentration inequalities for random matrices. This provides a theoretical framework for understanding error bounds of common approximations of the AC power flow equations under parameter uncertainty, including the DC and LinDistFlow approximations. Additionally, we show that the upper bounds scale as functions of nodal criticality. This network-theoretic quantity captures how uncertainty concentrates at critical nodes for use in contingency analysis. We validate these bounds on IEEE test networks, demonstrating that they correctly capture the scaling behavior of spectral perturbations up to conservative constants.

2510.02852 2026-03-09 stat.AP cs.NA math.NA

Data-Driven Bed Capacity Planning Using $M_t/G_t/\infty$ Queueing Models with an Application to Neonatal Intensive Care Units

Maryam Akbari-Moghaddam, Douglas G. Down, Na Li, Catherine Eastwood, Ayman Abou Mehrem, Alexandra Howlett

Comments This paper has been submitted to the Operations Research, Data Analytics and Logistics journal

Abstract

Hospitals face challenges in long-term intensive care unit (ICU) capacity planning under uncertain demand. Admission rates fluctuate over time, and length-of-stay (LOS) distributions vary with patient heterogeneity, hospital location, case mix, and clinical practice. Common approaches rely on steady-state queueing models or heuristic rules with fixed parameters, which often fail to capture real occupancy dynamics. The widely used 85% occupancy rule, for example, recommends keeping average utilization below this level to preserve responsiveness, yet it is grounded in stationary assumptions and may lack resilience in time-varying systems. Our analysis shows that even when long-run utilization targets are satisfied, daily occupancy often exceeds 100% capacity. We propose a data-driven framework to estimate ICU bed occupancy using an $M_t/G_t/\infty$ queueing model with time-varying arrival rates and empirically fitted LOS distributions. The approach combines statistical decomposition and parametric fitting to capture temporal patterns in admissions and LOS, and is applied to multi-year data from neonatal ICUs (NICUs) in Calgary. We evaluate capacity scenarios including average-based thresholds and Poisson-based surge estimates. Results show that static heuristics are inadequate under fluctuating demand and underscore the importance of modeling LOS variability when estimating bed needs. Although the case study focuses on NICUs, the framework has potential applicability to other ICU settings and provides interpretable, data-informed support for systems facing rising demand and constrained capacity.
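
To make the infinite-server idea above concrete, here is a minimal discrete-time sketch. It relies on the standard result that, with Poisson arrivals and no capacity limit, occupancy at time t is Poisson with mean equal to the sum over past days of the arrival rate times the probability that a stay is still ongoing. The arrival pattern, the geometric LOS survival curve, and the 95% service level are all assumptions made for illustration; the sketch also keeps the LOS distribution fixed over time, which is simpler than the paper's $G_t$ setting.

```python
import math
import numpy as np


def mean_occupancy(arrival_rate, survival):
    """Mean number of occupied beds per day for an infinite-server queue.

    arrival_rate[s] : expected admissions on day s
    survival[u]     : P(LOS > u days), with survival[0] == 1
    The system is assumed empty before day 0, so early days are a warm-up.
    """
    horizon = len(arrival_rate)
    occupancy = np.zeros(horizon)
    for t in range(horizon):
        for s in range(t + 1):
            age = t - s  # days since admission
            if age < len(survival):
                occupancy[t] += arrival_rate[s] * survival[age]
    return occupancy


def poisson_quantile(mean, q):
    """Smallest k such that P(Poisson(mean) <= k) >= q."""
    k = 0
    pmf = math.exp(-mean)
    cdf = pmf
    while cdf < q:
        k += 1
        pmf *= mean / k
        cdf += pmf
    return k


# Hypothetical inputs: weekly seasonality in admissions, geometric LOS tail.
days = np.arange(120)
rate = 2.0 + 0.5 * np.sin(2 * np.pi * days / 7)
survival = 0.8 ** np.arange(60)
occupancy = mean_occupancy(rate, survival)
beds_95 = [poisson_quantile(m, 0.95) for m in occupancy]
```

Comparing `beds_95` with a bed count derived from the average occupancy alone shows how a fixed average-based threshold can understate the capacity needed on peak days.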

2509.14961 2026-03-09 stat.ML cond-mat.mtrl-sci cs.LG physics.chem-ph

Spectral/Spatial Tensor Atomic Cluster Expansion with Universal Embeddings in Cartesian Space

Zemin Xu, Wenbo Xie, P. Hu

Abstract

Equivariant atomistic machine learning models have largely been built on spherical-tensor representations, where explicit angular-momentum coupling introduces substantial complexity and systematic extensions beyond energies and forces remain challenging, often requiring problem-specific architectural choices. Here we introduce the Tensor Atomic Cluster Expansion (TACE), which unifies scalar and tensorial modeling in Cartesian space by decomposing local environments into irreducible Cartesian tensors (ICTs) and constructing a controlled many-body hierarchy with the atomic cluster expansion (ACE). In addition to performing ACE in the frequency domain, we propose an efficient Clebsch-Gordan-free alternative in the spatial domain. TACE provides universal invariant (e.g., fidelity tags and charges) and equivariant (e.g., external electric fields and non-collinear magnetic moments) embeddings; these embeddings and the predicted tensorial observables are handled on equal footing, enabling explicit control at inference. We demonstrate its accuracy, stability, and efficiency across finite molecules and extended materials, including in-domain and out-of-domain benchmarks, spectra, Hessians, external-field responses, charged systems, and multi-fidelity/head training. We further show its robustness on nonequilibrium/reactive datasets and controlled scaling when extending to large foundation model datasets.

2508.12860 2026-03-09 econ.EM stat.ME

Estimation and exclusion restrictions in clustered linear models

Anna Mikusheva, Mikkel Sølvsten, Baiyun Jing

Comments 48 pages, 3 figures

Abstract

We study linear regression models with clustered data, high-dimensional controls, and intricate exclusion restrictions. We propose a correctly centered internal instrument IV estimator that accommodates a broad class of exclusion restrictions and allows within-cluster dependence. The estimator admits a simple leave-out interpretation and is computationally tractable. We derive a central limit theorem for the associated quadratic form and propose a robust variance estimator. We also develop identification-robust inference procedures. Our framework extends dynamic panel methods to general clustered settings. We illustrate the approach in a large-scale fiscal intervention in rural Kenya, where spatial interference generates the exclusion-restriction pattern.

2508.01068 2026-03-09 physics.chem-ph cond-mat.mtrl-sci stat.ML

Learning the action for long-time-step simulations of molecular dynamics

Filippo Bigi, Johannes Spies, Michele Ceriotti

Comments 16 pages, 7 figures

Abstract

The equations of classical mechanics can be used to model the time evolution of countless physical systems, from the astrophysical to the atomic scale. Accurate numerical integration requires small time steps, which limits the computational efficiency -- especially in cases such as molecular dynamics that span wildly different time scales. Using machine-learning (ML) algorithms to predict trajectories allows one to greatly extend the integration time step, at the cost of introducing artifacts such as lack of energy conservation and loss of equipartition between different degrees of freedom of a system. We propose learning data-driven structure-preserving (symplectic and time-reversible) maps to generate long time-step classical dynamics and show that this method is equivalent to learning the mechanical action of the system of interest. These models can be learned based on short reference trajectories, and be transferred across thermodynamic conditions and chemical composition. We show that an action-derived ML integrator eliminates the pathological behavior of non-structure-preserving ML predictors, and that the method can be applied iteratively, serving as a correction to computationally cheaper direct predictors.

2506.15735 2026-03-09 cs.AI cs.LG stat.ML

ContextBench: Modifying Contexts for Targeted Latent Activation

Robert Graham, Edward Stevinson, Leo Richter, Alexander Chia, Joseph Miller, Joseph Isaac Bloom

Comments Published at ICLR 2026

Journal ref
Proceedings of the International Conference on Learning Representations (ICLR), 2026
Abstract

Identifying inputs that trigger specific behaviours or latent features in language models could have a wide range of safety use cases. We investigate a class of methods capable of generating targeted, linguistically fluent inputs that activate specific latent features or elicit model behaviours. We formalise this approach as context modification and present ContextBench -- a benchmark with tasks assessing core method capabilities and potential safety applications. Our evaluation framework measures both elicitation strength (activation of latent features or behaviours) and linguistic fluency, highlighting how current state-of-the-art methods struggle to balance these objectives. We enhance Evolutionary Prompt Optimisation (EPO) with LLM-assistance and diffusion model inpainting, and demonstrate that these variants achieve state-of-the-art performance in balancing elicitation effectiveness and fluency.

2505.02614 2026-03-09 math.OC cs.LG stat.ML

Entropic Mirror Descent for Linear Systems: Polyak's Stepsize and Implicit Bias

Yura Malitsky, Alexander Posch

Comments 20 pages, 2 figures

Abstract

This paper focuses on applying entropic mirror descent to solve linear systems, where the main challenge for the convergence analysis stems from the unboundedness of the domain. To overcome this without imposing restrictive assumptions, we introduce a variant of Polyak-type stepsizes. Along the way, we strengthen the bound for $\ell_1$-norm implicit bias, obtain sublinear and linear convergence results, and generalize the convergence result to arbitrary convex $L$-smooth functions. We also propose an alternative method that avoids exponentiation, resembling the original Hadamard descent, but with provable convergence.
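
For readers unfamiliar with the basic iteration, the sketch below applies entropic mirror descent on the positive orthant to the least-squares objective $\frac{1}{2}\|Ax - b\|^2$. The multiplicative update is the standard form of this method. The fixed step size, the tiny example system, and the function name are assumptions made here for illustration; the paper's Polyak-type stepsize variant is not reproduced.

```python
import numpy as np


def entropic_mirror_descent(A, b, x0, step=0.1, iters=200):
    """Multiplicative update x <- x * exp(-step * grad f(x)) for
    f(x) = 0.5 * ||A x - b||^2, which keeps every coordinate positive.
    A fixed step is used here purely for simplicity (an assumption)."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(iters):
        grad = A.T @ (A @ x - b)
        x = x * np.exp(-step * grad)
    return x


# Hypothetical underdetermined system with a positive solution.
A = np.array([[1.0, 2.0]])
b = np.array([2.0])
x = entropic_mirror_descent(A, b, x0=np.ones(2))
residual = np.linalg.norm(A @ x - b)
```

Because the update exponentiates the gradient, a step that is too large can overflow or oscillate on less benign problems, which is one reason adaptive step rules are of interest in this setting.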

2410.10354 2026-03-09 stat.ME

Bayesian nonparametric modeling of heterogeneous populations of networks

Francesco Barile, Simón Lunagómez, Bernardo Nipoti

Comments A version of this article has been accepted for publication in Bayesian Analysis

Abstract

The increasing availability of multiple network data has highlighted the need for statistical models for heterogeneous populations of networks. A convenient framework makes use of metrics to measure similarity between networks. In this context, we propose a novel Bayesian nonparametric model that identifies clusters of networks characterized by similar connectivity patterns. Our approach relies on a location-scale Dirichlet process mixture of centered Erdős--Rényi kernels, with components parametrized by a unique network representative, or mode, and a univariate measure of dispersion around the mode. We demonstrate that this model has full support in the Kullback--Leibler sense and is strongly consistent. An efficient Markov chain Monte Carlo scheme is proposed for posterior inference and clustering of multiple network data. The performance of the model is validated through extensive simulation studies, showing improvements over state-of-the-art methods. Additionally, we present an effective strategy to extend the application of the proposed model to datasets with a large number of nodes. We illustrate our approach with the analysis of human brain network data.

1905.05141 2026-03-09 math.ST math.AG stat.TH

Moment Identifiability of Homoscedastic Gaussian Mixtures

Daniele Agostini, Carlos Améndola, Kristian Ranestad

Comments 27 pages, 1 table, 1 figure

Abstract

We consider the problem of identifying a mixture of Gaussian distributions with the same unknown covariance matrix by their sequence of moments up to a certain order. Our approach rests on studying the moment varieties obtained by taking special secants to the Gaussian moment varieties, defined by their natural polynomial parametrization in terms of the model parameters. When the order of the moments is at most three, we prove an analogue of the Alexander-Hirschowitz theorem classifying all cases of homoscedastic Gaussian mixtures that produce defective moment varieties. As a consequence, identifiability is determined when the number of mixed distributions is smaller than the dimension of the space. In the two-component setting we provide a closed-form solution for parameter recovery based on moments up to order four, while in the one-dimensional case we interpret the rank estimation problem in terms of secant varieties of rational normal curves.

1903.08611 2026-03-09 math.ST math.AG stat.TH

Autocovariance Varieties of Moving Average Random Fields

Carlos Améndola, Viet Son Pham

Comments 20 pages, 5 tables, 2 figures

Abstract

We study the autocovariance functions of moving average random fields over the integer lattice $\mathbb{Z}^d$ from an algebraic perspective. These autocovariances are parametrized polynomially by the moving average coefficients, hence tracing out algebraic varieties. We derive the dimension and degree of these varieties and use their algebraic properties to obtain statistical consequences such as identifiability of model parameters. We connect the problem of parameter estimation to the algebraic invariants known as the Euclidean distance degree and the maximum likelihood degree. Throughout, we illustrate the results with concrete examples. In our computations we use tools from commutative algebra and numerical algebraic geometry.

2603.06328 2026-03-09 stat.OT stat.ME

Variable selection in linear mixed model meta-regression with suspected interaction effects -- How can tree-based methods help?

Jan-Bernd Igelmann, Paula Lorenz, Markus Pauly

Comments 25 pages, 5 figures. Supplementary Materials at https://doi.org/10.17877/TUDODATA-2026-3CDZSS

Abstract

Detecting interaction effects (IEs) in meta-regression is challenging, especially when few studies are available and many plausible interactions are considered. In many meta-analyses, interpretability is essential, which limits the use of complex machine learning methods. Tree-based approaches offer a potentially useful compromise, but their role in meta-regression with random effects is not yet well understood. This paper examines how traditional linear and tree-based methods can support variable selection for IEs in random effects meta-regression. We compare test-based and information-criterion-based linear selection procedures with meta-CART approaches. These include fixed effect and random effects trees and their stability-selected ensemble variants. All methods are evaluated using a real-world meta-analytic dataset and a plasmode simulation study. The data-generating process assumes linear IEs and is complemented by settings with nonlinear interactions. Our results show that under strictly linear interactions, linear selection methods perform as expected and achieve superior performance for IE detection. Tree-based methods are more conservative when the number of studies is small, but become competitive as sample size increases, particularly the stability-selected variants. When IEs deviate from strict linearity, even in simple ways, the performance of linear methods deteriorates, whereas tree-based approaches, especially stability-selected fixed effect trees, provide a more robust alternative. Overall, stability-selected random effects trees are useful complementary tools for IE detection in applied meta-regression, particularly for metric covariates. They are well suited for pre-selection and sensitivity analyses, and selection frequency patterns in tree ensembles can help reveal structural patterns in the data.

2603.06293 2026-03-09 stat.ME stat.AP

Large Wave Direction Data Modeling Using Wrapped Spatial Gaussian Markov Random Fields

Arnab Hazra

Comments 31 pages, 6 figures, 2 tables

Abstract

Statistical modeling of dependent directional data remains relatively underexplored, particularly in high-dimensional spatial settings. Existing approaches for spatial angular data primarily rely on wrapped Gaussian process (WGP) models, which provide a coherent framework for capturing spatial dependence on the circle. However, WGP-based methods become computationally challenging when the spatial domain is large, and observations are available at high resolution. This limitation is especially relevant in the analysis of large-scale geological and climate phenomena, such as tsunamis and hurricanes, where directional measurements (e.g., wave or wind directions) may be available over an entire ocean basin. To address these challenges, we propose a wrapped Gaussian Markov random field (WGMRF) model for large spatial directional datasets. By exploiting the sparse precision structure inherent in Gaussian Markov random fields, the proposed approach achieves substantial computational gains while preserving flexible spatial dependence on the circular scale. We discuss key properties of the model, including its identifiability and dependence characteristics. The model fitting involves standard Markov chain Monte Carlo techniques. Through extensive simulation studies and an application to the wave direction data across the Indian Ocean during the 2004 Indian Ocean Tsunami, we compare the proposed method with both a non-spatial wrapped Gaussian model and a low-rank WGP alternative. The results demonstrate that the WGMRF offers improved predictive performance and scalability in large-domain applications.
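
The wrapping construction described above can be sketched in a few lines: draw a Gaussian Markov random field from its precision matrix, add a mean direction, and reduce modulo $2\pi$. The one-dimensional first-order neighbourhood structure, the parameter names, and the dense linear algebra below are simplifying assumptions for this example, not the model specification from the paper.

```python
import numpy as np


def sample_wrapped_gmrf(n, kappa=0.5, mean_direction=np.pi, seed=0):
    """Draw one realization of angles in [0, 2*pi) by wrapping a GMRF.

    The precision matrix Q is tridiagonal (first-order neighbours on a
    line); kappa > 0 makes it strictly diagonally dominant, hence
    positive definite. Dense arrays are used for brevity; a real
    large-scale implementation would exploit the sparsity of Q.
    """
    rng = np.random.default_rng(seed)
    Q = (2.0 + kappa) * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    L = np.linalg.cholesky(Q)            # Q = L @ L.T, L lower triangular
    z = rng.standard_normal(n)
    x = np.linalg.solve(L.T, z)          # x has covariance inv(Q)
    return np.mod(mean_direction + x, 2 * np.pi)


theta = sample_wrapped_gmrf(50)
```

Sampling through the Cholesky factor of the precision matrix, rather than the covariance, is what lets sparse solvers scale to large grids.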

2603.06283 2026-03-09 stat.ME

Optimizing Complex Health Intervention Packages through the Learn-As-you-GO (LAGO) Design

Donna Spiegelman, Dong Roman Xu, Ante Bing, Guangyu Tong, Mona Abdo, Jingyu Cui, Charles Goss, John Baptist Kiggundu, Chris T. Longenecker, LaRon Nelson, Drew Cameron, Fred Semitala, Xin Zhou, Judith J. Lok

Abstract

In the face of vast numbers of preventable deaths worldwide and gaping disparities in their distribution, we cannot afford to conduct null and inconclusive effectiveness and implementation trials of evidence-based interventions. The gold standard in biomedical research, the individually randomized clinical trial, is ill-suited as the primary tool for knowledge generation for contextually relevant, scalable, complex public health interventions of multi-component strategies. In this paper, we discuss the new Learn-As-you-GO (LAGO) design. In LAGO trials, the components of a complex intervention package are repeatedly optimized in pre-planned stages, until the package achieves its outcome and power goals with minimized cost and/or other optimization criteria, such as maximizing patient satisfaction. In this paper, the inputs to, and outputs of, LAGO are described, along with its general methodology. The methods are illustrated in the BetterBirth study, a large trial that aimed to reduce maternal and neonatal mortality in Uttar Pradesh, India, using the WHO essential birth practices checklist. Despite its scale, the BetterBirth study failed to demonstrate a significant effect of the intervention package on the primary health endpoint that included maternal mortality. We show how this unfortunate outcome could have been remedied had LAGO been used. LAGO is further illustrated through the discussion of several ongoing LAGO-informed implementation trials of HIV and non-communicable diseases in the United States and Sub-Saharan Africa. The Learn-As-you-GO (LAGO) design optimizes a complex, multi-level intervention for minimum cost, pre-specified power, and a pre-specified effectiveness goal, by adapting the intervention as the study is conducted, reducing risk of trial failure.

2603.06252 2026-03-09 cs.LG stat.ML

Synthetic Monitoring Environments for Reinforcement Learning

Leonard Pleiss, Carolin Schmidt, Maximilian Schiffer

Abstract

Reinforcement Learning (RL) lacks benchmarks that enable precise, white-box diagnostics of agent behavior. Current environments often entangle complexity factors and lack ground-truth optimality metrics, making it difficult to isolate why algorithms fail. We introduce Synthetic Monitoring Environments (SMEs), an infinite suite of continuous control tasks. SMEs provide fully configurable task characteristics and known optimal policies. As such, SMEs allow for the exact calculation of instantaneous regret. Their rigorous geometric state space bounds allow for systematic within-distribution (WD) and out-of-distribution (OOD) evaluation. We demonstrate the framework's benefit through multidimensional ablations of PPO, TD3, and SAC, revealing how specific environmental properties - such as action or state space size, reward sparsity and complexity of the optimal policy - impact WD and OOD performance. We thereby show that SMEs offer a standardized, transparent testbed for transitioning RL evaluation from empirical benchmarking toward rigorous scientific analysis.

2603.06251 2026-03-09 stat.ML cs.LG

SPPCSO: Adaptive Penalized Estimation Method for High-Dimensional Correlated Data

Ying Hu, Hu Yang

Abstract

With the rise of high-dimensional correlated data, multicollinearity poses a significant challenge to model stability, often leading to unstable estimation and reduced predictive accuracy. This work proposes the Single-Parametric Principal Component Selection Operator (SPPCSO), an innovative penalized estimation method that integrates single-parametric principal component regression and $L_{1}$ regularization to adaptively adjust the shrinkage factor by incorporating principal component information. This approach achieves a balance between variable selection and coefficient estimation, ensuring model stability and robust estimation even in high-dimensional, high-noise environments. The primary contribution lies in addressing the instability of traditional variable selection methods when applied to high-noise, high-dimensional correlated data. Theoretically, our method exhibits selection consistency and achieves a smaller estimation error bound compared to traditional penalized estimation approaches. Extensive numerical experiments demonstrate that SPPCSO not only delivers stable and reliable estimation in high-noise settings but also accurately distinguishes signal variables from noise variables in group-effect structured data with highly correlated noise variables, effectively eliminating redundant variables and achieving more stable variable selection. Furthermore, SPPCSO successfully identifies disease-associated genes in gene expression data analysis, showcasing strong practical value. The results indicate that SPPCSO serves as an ideal tool for high-dimensional variable selection, offering an efficient and interpretable solution for modeling correlated data.

2603.06248 2026-03-09 cs.LG math.OC stat.ML

Gradient Flow Polarizes Softmax Outputs towards Low-Entropy Solutions

Aditya Varre, Mark Rofin, Nicolas Flammarion

Comments 35 pages, 21 figures

Abstract

Understanding the intricate non-convex training dynamics of softmax-based models is crucial for explaining the empirical success of transformers. In this article, we analyze the gradient flow dynamics of the value-softmax model, defined as ${L}(\mathbf{V} \sigma(\mathbf{a}))$, where $\mathbf{V}$ and $\mathbf{a}$ are a learnable value matrix and attention vector, respectively. As the matrix times softmax vector parameterization constitutes the core building block of self-attention, our analysis provides direct insight into transformers' training dynamics. We reveal that gradient flow on this structure inherently drives the optimization toward solutions characterized by low-entropy outputs. We demonstrate the universality of this polarizing effect across various objectives, including logistic and square loss. Furthermore, we discuss the practical implications of these theoretical results, offering a formal mechanism for empirical phenomena such as attention sinks and massive activations.
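
The quantities named in the abstract above are easy to write down. The sketch below evaluates the value-softmax output $\mathbf{V}\sigma(\mathbf{a})$ and the Shannon entropy of the softmax vector, assuming that "low-entropy" refers to the entropy of $\sigma(\mathbf{a})$. The function names and example inputs are hypothetical, and no training dynamics are simulated.

```python
import numpy as np


def softmax(a):
    """Numerically stable softmax of a 1-D array."""
    z = np.exp(a - np.max(a))
    return z / z.sum()


def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(-(p * np.log(p)).sum())


def value_softmax_output(V, a):
    """Output V @ softmax(a) of the value-softmax parameterization."""
    return V @ softmax(a)


# Hypothetical inputs: a uniform and a sharply peaked attention vector.
V = np.eye(4)
uniform_entropy = entropy(softmax(np.zeros(4)))
peaked_entropy = entropy(softmax(np.array([50.0, 0.0, 0.0, 0.0])))
output = value_softmax_output(V, np.zeros(4))
```

The uniform vector attains the maximum entropy $\log 4$, while the peaked one is close to zero, which is the kind of polarized solution the abstract describes.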

2603.06212 2026-03-09 cs.LG stat.AP

Topological descriptors of foot clearance gait dynamics improve differential diagnosis of Parkinsonism

Jhonathan Barrios, Wolfram Erlhagen, Miguel F. Gago, Estela Bicho, Flora Ferreira

Comments 17 pages, 12 figures, Under review

Abstract

Differential diagnosis among parkinsonian syndromes remains a clinical challenge due to overlapping motor symptoms and subtle gait abnormalities. Accurate differentiation is crucial for treatment planning and prognosis. While gait analysis is a well-established approach for assessing motor impairments, conventional methods often overlook hidden nonlinear and structural features embedded in foot clearance patterns. We evaluated Topological Data Analysis (TDA) as a complementary tool for Parkinsonism classification using foot clearance time series. Persistent homology produced Betti curves, persistence landscapes, and silhouettes, which were used as features for a Random Forest classifier. The dataset comprised 15 controls (CO), 15 idiopathic Parkinson's disease (IPD), and 14 vascular Parkinsonism (VaP). Models were assessed with leave-one-out cross-validation (LOOCV). Betti-curve descriptors consistently yielded the strongest results. For IPD vs VaP, foot clearance variables minimum toe clearance, maximum toe late swing, and maximum heel clearance achieved 83% accuracy and AUC=0.89 under LOOCV in the medicated (On) state. Performance improved in the On state and further when both Off and On states were considered, indicating sensitivity of the topological features to levodopa-related gait changes. These findings support integrating TDA with machine learning to improve clinical gait analysis and aid differential diagnosis across parkinsonian disorders.
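
A Betti curve, the descriptor highlighted above, counts how many topological features are alive at each filtration value. The sketch below computes it from a persistence diagram given as (birth, death) pairs. The diagram and grid values are made up for illustration, and computing the diagram itself from a gait time series (which needs a persistent homology library) is outside this example.

```python
import numpy as np


def betti_curve(diagram, grid):
    """Number of (birth, death) intervals containing each grid value,
    using the half-open convention birth <= t < death."""
    diagram = np.asarray(diagram, dtype=float)
    grid = np.asarray(grid, dtype=float)
    births = diagram[:, 0][:, None]
    deaths = diagram[:, 1][:, None]
    alive = (births <= grid[None, :]) & (grid[None, :] < deaths)
    return alive.sum(axis=0)


# Hypothetical persistence diagram and evaluation grid.
diagram = [(0.0, 1.0), (0.2, 0.5), (0.4, 0.9)]
grid = [0.0, 0.3, 0.45, 0.95]
curve = betti_curve(diagram, grid)
```

Evaluating the curve on a fixed grid turns each diagram into a fixed-length vector, which is what makes it usable as input to a classifier such as a random forest.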

2603.06176 2026-03-09 math.ST stat.TH

Sparse Estimation for High-Dimensional Lévy-driven Ornstein--Uhlenbeck Processes from Discrete Observations

Niklas Dexheimer, Natalia Jeszka

Abstract

We study high-dimensional drift estimation for Lévy-driven Ornstein--Uhlenbeck processes based on discrete observations. Assuming sparsity of the drift matrix, we analyze Lasso and Slope estimators constructed from approximate likelihoods and derive sharp nonasymptotic oracle inequalities. Our bounds disentangle the contributions of discretization error and stochastic fluctuations, and establish minimax optimal convergence rates under suitable choices of tuning parameters in a high-frequency regime. We further quantify the sample complexity required to attain these rates depending on the Lévy noise. The results extend the theory of high-dimensional statistics for stochastic processes to a substantially broader class of noise mechanisms, in particular pure jump processes. They also demonstrate that Lasso and Slope remain competitive for jump-driven systems, providing practical guidance for inference in applications where Lévy processes are a natural modeling choice.

2603.06142 2026-03-09 cs.LG cond-mat.dis-nn cs.AI cs.NE stat.ML

Predictive Coding Graphs are a Superset of Feedforward Neural Networks

Björn van Zwol

Comments 11 pages, 3 figures. Accepted at the NeuroAI Workshop @ NeurIPS 2024. OpenReview: https://openreview.net/forum?id=J36z3R0sNq

Abstract

Predictive coding graphs (PCGs) are a recently introduced generalization of predictive coding networks (PCNs), a neuroscience-inspired probabilistic latent variable model. Here, we prove that PCGs define a mathematical superset of feedforward artificial neural networks (multilayer perceptrons). This positions PCNs more strongly within contemporary machine learning (ML) and reinforces earlier proposals to study the use of non-hierarchical neural networks for ML tasks and, more generally, the notion of topology in neural networks.

2603.06078 2026-03-09 stat.ME

Simultaneously accounting for winner's curse and sample structure in Mendelian randomization: bivariate rerandomized inverse variance weighted estimator

Xin Liu, Ping Yin, Peng Wang

Abstract

The recently developed rerandomized inverse variance weighted (RIVW) estimator provides a simple and efficient framework to break the winner's curse in two-sample Mendelian randomization (MR). However, this method ignores the possible presence of sample structure (e.g., residual population stratification and sample overlap), a common confounding factor in MR studies. Sample structure can not only distort SNP-exposure and SNP-outcome association estimates but also induce correlation between them, so that exposure-side instrument selection propagates bias to the outcome side. To address this challenge, we propose the bivariate RIVW (BRIVW) estimator that can simultaneously account for the winner's curse and sample structure. The BRIVW estimator extends the RIVW framework by modeling the joint distribution of SNP-exposure and SNP-outcome associations, first adjusting their covariance matrix via linkage disequilibrium score regression to account for sample structure, and then applying randomized instrument selection and Rao-Blackwellization to obtain unbiased post-selection association estimates as well as their covariance matrix. Under appropriate conditions, we show that the BRIVW estimator is consistent and asymptotically normal. Extensive simulations and real data analyses demonstrate that the BRIVW estimator provides more accurate causal effect estimates than existing methods.

2603.06072 2026-03-09 stat.ME math.OC math.ST stat.OT stat.TH

A Hierarchical Bayesian Dynamic Game for Competitive Inventory and Pricing under Incomplete Information: Learning, Credible Risk, and Equilibrium

Debashis Chatterjee

Abstract

We develop a hierarchical Bayesian dynamic game for competitive inventory and pricing under incomplete information. Two firms repeatedly choose order quantities and prices while facing two layers of uncertainty: unknown market demand and private rival characteristics. The framework combines Bayesian learning about demand and substitution with strategic belief updating about rival types. To make decisions robust to posterior uncertainty, we introduce a credible-risk criterion that rewards expected future profit while penalizing posterior predictive dispersion. This yields a conservative equilibrium concept in which firms learn, compete, and adapt simultaneously. The paper provides the model formulation, information structure, posterior updating mechanism, equilibrium definition, and a computational strategy based on belief-state dynamic programming. A simulation study shows that Bayesian learning is crucial for strong performance and that the credible-risk rule is especially effective as an operational regularizer under uncertainty. A real-data illustration on a high-dimensional protein-expression dataset demonstrates that the same uncertainty-aware Bayesian principle can produce biologically interpretable subgroup and latent-state findings. The proposed framework offers a unified bridge between Bayesian game theory and operations research, with practical relevance for competitive decision-making in uncertain and information-limited environments.
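
As a rough illustration of the credible-risk idea described above, the sketch below scores each candidate action by its posterior predictive mean profit minus a penalty proportional to the posterior predictive standard deviation. The abstract does not give the exact functional form, so the mean-minus-weighted-deviation rule, the `risk_weight` parameter, and both function names are assumptions made for illustration only.

```python
import statistics


def credible_risk_score(profit_draws, risk_weight):
    """Mean posterior predictive profit minus a dispersion penalty.

    profit_draws: simulated profits for one action under posterior draws.
    risk_weight:  hypothetical penalty weight (0 recovers the plain expected profit).
    """
    mean_profit = statistics.fmean(profit_draws)
    dispersion = statistics.pstdev(profit_draws)  # posterior predictive spread
    return mean_profit - risk_weight * dispersion


def choose_action(draws_by_action, risk_weight):
    """Pick the action with the highest credible-risk score."""
    return max(
        draws_by_action,
        key=lambda action: credible_risk_score(draws_by_action[action], risk_weight),
    )


if __name__ == "__main__":
    draws = {
        "safe": [10.0, 10.0, 10.0, 10.0],
        "risky": [0.0, 24.0, 0.0, 24.0],
    }
    for weight in (0.0, 0.5):
        print(weight, choose_action(draws, weight))
```

With a zero penalty the rule reduces to expected-profit maximization; raising the weight makes the choice more conservative, which is the regularizing behavior the abstract attributes to the criterion.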

2603.06062 2026-03-09 math.ST stat.TH

Estimation of Lévy-driven CARMA models under renewal sampling

Frank Bosserhoff, Giacomo Francisci, Robert Stelzer

Comments 37 pages

Abstract

Continuous-time autoregressive and moving average (CARMA) models are extensively used to model high-frequency and irregularly sampled data. We study Whittle estimation for the model parameters when the process is observed at renewal times. The driving noise is assumed to be a Lévy process allowing for more flexibility including heavy-tailed marginal distributions and jumps in the sample paths. We show that the Whittle estimator based on the integrated periodogram is consistent and asymptotically normal under very mild conditions. To obtain these results, we establish the asymptotic normality of the integrated periodogram.

2603.06027 2026-03-09 cs.LG cs.DS stat.ML

Agnostic learning in (almost) optimal time via Gaussian surface area

Lucas Pesenti, Lucas Slot, Manuel Wiedmer

Comments 20 pages

Abstract

The complexity of learning a concept class under Gaussian marginals in the difficult agnostic model is closely related to its $L_1$-approximability by low-degree polynomials. For any concept class with Gaussian surface area at most $Γ$, Klivans et al. (2008) show that degree $d = O(Γ^2 / \varepsilon^4)$ suffices to achieve an $\varepsilon$-approximation. This leads to the best-known bounds on the complexity of learning a variety of concept classes. In this note, we improve their analysis by showing that degree $d = \tilde O (Γ^2 / \varepsilon^2)$ is enough. In light of lower bounds due to Diakonikolas et al. (2021), this yields (near) optimal bounds on the complexity of agnostically learning polynomial threshold functions in the statistical query model. Our proof relies on a direct analogue of a construction of Feldman et al. (2020), who considered $L_1$-approximation on the Boolean hypercube.

2603.05988 2026-03-09 stat.ME

On parameter estimation for the truncated skew-normal distribution

Kwangok Seo, Seul Lee, Johan Lim

Abstract

Parameter estimation for the truncated skew-normal distribution is challenging, as truncation introduces additional nonlinearity into the likelihood function and often leads to numerical instability in existing estimation procedures. In this paper, we propose a grid-based estimation method, referred to as GRID-MOM, for parameter estimation in the truncated skew-normal distribution. The proposed approach fixes the shape parameter on a pre-specified grid and, for each grid point, estimates the location and scale parameters using the method of moments. The optimal value of the shape parameter is then selected via likelihood-based comparison, yielding the final parameter estimates. By decoupling the estimation of the shape parameter from that of the location and scale parameters, the proposed method reduces the complexity of the optimization problem and improves numerical stability. We evaluate the finite-sample performance of the proposed estimator through an extensive numerical study, comparing it with existing methods under a variety of scenarios. The results demonstrate that the proposed method provides stable and accurate estimation, particularly for the shape parameter, suggesting that the proposed method offers a practical alternative for inference in truncated skew-normal models. We further demonstrate the practical applicability of the proposed method using phosphoproteomics data and hospital admission data.

2603.05938 2026-03-09 stat.AP

Modeling Animal Communication Using Multivariate Hawkes Processes with Additive Excitation and Multiplicative Inhibition

Bokgyeong Kang, Erin M. Schliep, Alan E. Gelfand, Ariana Strandburg-Peshkin, Robert S. Schick

Abstract

Animal acoustic communication often exhibits temporal dependence, with calls triggering or suppressing subsequent calls within and across call types, individuals, or species. While Hawkes processes provide a natural framework for modeling excitation, incorporating inhibition in multivariate settings can raise identifiability issues and complicate parameter interpretation. We propose a flexible class of multivariate Hawkes processes that combines additive excitation with multiplicative inhibition. This formulation preserves the branching process interpretation of excitation while reducing confounding between excitation and inhibition, and allows direct quantification of background and excitation contributions to the event rate. Bayesian inference is conducted via Markov chain Monte Carlo, and model adequacy is assessed using the random time change theorem. The proposed methodology is evaluated through simulation and applied to two acoustic communication datasets: group-living meerkats, for which we analyze three selected call types with distinct behavioral roles, and a two-species baleen whale dataset involving humpback and North Atlantic right whales. The meerkat analysis reveals significant within- and cross-type excitation with cross-type inhibition, whereas the whale data show evidence primarily of within-species excitation.

2603.05897 2026-03-09 math.ST stat.TH

A Minimax Theory of Nonparametric Regression Under Covariate Shift

Petr Zamolodtchikov

Abstract

We consider nonparametric regression under covariate shift, where we observe samples from both the target distribution and a related but distinct source distribution. We introduce a novel object, the transfer function, and show that properties of its domain determine our minimax rates. These rates exhibit a variety of regimes, including classical rates, governed by the better of source-only and target-only rates, as well as regimes in which the convergence rates exhibit multiplicative interactions between the sample sizes and are faster than the best-of-two benchmark. The rates are shown to be achieved up to logarithmic factors by a design-adaptive estimator. Compared with existing theory, our results cover the case in which covariates have unbounded support.

2603.05885 2026-03-09 stat.OT math.OC

Bayesian Linear Programming under Learned Uncertainty: Posterior Feasibility Guarantees, Scenario Certification, and Applications

Debashis Chatterjee

Abstract

Linear programming is widely used for decision-making in science, engineering, and operations research, yet in many modern applications the coefficients entering the constraints and objective are not known exactly and must be learned from data. Classical stochastic and robust optimization offer two influential paradigms for handling such uncertainty, but they typically treat the underlying uncertainty description as given and do not directly integrate the updating of priors to posteriors into their guarantees. This paper develops a Bayesian framework for linear programming in which uncertain quantities are modeled probabilistically, updated through observed data, and propagated into optimization through posterior feasibility requirements. We present two complementary computational strategies: a credible-region robustification that converts posterior uncertainty into deterministic protection, and a posterior-scenario approach that uses sampled posterior realizations to construct tractable optimization problems with finite-sample interpretability. We also propose a Monte Carlo certification procedure that provides conservative, data-conditioned assessments of residual infeasibility. Simulation experiments show that the proposed framework substantially improves safety relative to naive plug-in decisions, while a real-data study on single-cell transcriptomic data demonstrates that the approach can produce scientifically interpretable decisions together with explicit uncertainty-aware feasibility diagnostics. The proposed methodology offers a unified bridge between Bayesian learning, optimization under uncertainty, and practical decision certification.

2603.05817 2026-03-09 stat.CO stat.AP stat.ME

Two Localization Strategies for Sequential MCMC Data Assimilation with Applications to Nonlinear Non-Gaussian Geophysical Models

Hamza Ruzayqat, Hristo G. Chipilski, Omar Knio

Comments 31 pages, 19 figures, 9 tables. arXiv admin note: text overlap with arXiv:2409.07111

Abstract

We present a localized data assimilation (DA) scheme based on the sequential Markov chain Monte Carlo (SMCMC) technique [Ruzayqat et al., 2024], a provably convergent method for filtering high-dimensional, nonlinear, and potentially non-Gaussian state-space models. Unlike particle filters, which are exact methods for nonlinear non-Gaussian models, SMCMC does not assign weights to samples and therefore avoids weight degeneracy in small-ensemble regimes. We design two localization approaches within the SMCMC framework that exploit spatial sparsity of observations to reduce the effective degrees of freedom and improve efficiency. The first variant collects observed blocks into a single reduced domain and runs parallel MCMC chains over this combined region. The second variant further reduces the per-chain state dimension by decomposing the observed region into independent blocks, each augmented with a compact halo, and applying Gaspari--Cohn observation-noise tapering to smoothly down-weight distant observations. When the observation model is linear and Gaussian, we show that our approximate filtering density reduces to a Gaussian mixture from which independent samples can be drawn exactly. For nonlinear or non-Gaussian observation models, we employ an MCMC kernel. We test on high-dimensional ($d \sim 10^4 - 10^5$) state-space models, including a linear Gaussian model and a nonlinear multilayer shallow water equation with both linear and nonlinear observation operators. We consider Gaussian and non-Gaussian (Student-$t$) observation noise, showing that localized SMCMC (LSMCMC) naturally handles heavy-tailed errors that cause ensemble Kalman methods to diverge. Observations include synthetic and real data from the Surface Water and Ocean Topography (SWOT) mission (NASA) and ocean drifter data (NOAA). We compare the two variants against each other and the local ensemble transform Kalman filter (LETKF).
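
The Gaspari--Cohn tapering mentioned above can be sketched with the standard fifth-order compactly supported taper function. The taper formula itself is the well-known one; how it is applied here (dividing the observation-noise variance by the taper value, so that distant observations become effectively uninformative) is a common localization convention and only an assumption about what the paper does. The names `half_width` and `tapered_obs_variance` are illustrative.

```python
def gaspari_cohn(distance, half_width):
    """Gaspari-Cohn fifth-order taper: 1 at distance 0, exactly 0 beyond 2 * half_width."""
    r = abs(distance) / half_width
    if r <= 1.0:
        value = -0.25 * r**5 + 0.5 * r**4 + 0.625 * r**3 - (5.0 / 3.0) * r**2 + 1.0
    elif r < 2.0:
        value = (
            r**5 / 12.0
            - 0.5 * r**4
            + 0.625 * r**3
            + (5.0 / 3.0) * r**2
            - 5.0 * r
            + 4.0
            - 2.0 / (3.0 * r)
        )
    else:
        value = 0.0
    return max(0.0, value)  # guard against tiny negative rounding errors near r = 2


def tapered_obs_variance(obs_variance, distance, half_width):
    """Inflate the observation-noise variance by 1 / taper (assumed convention).

    An infinite variance means the observation carries no weight for this block.
    """
    weight = gaspari_cohn(distance, half_width)
    return obs_variance / weight if weight > 0.0 else float("inf")


if __name__ == "__main__":
    for d in (0.0, 0.5, 1.0, 1.5, 2.0):
        print(d, gaspari_cohn(d, 1.0), tapered_obs_variance(0.1, d, 1.0))
```

Because the taper decays smoothly to zero, observations near the edge of a block's halo are down-weighted gradually rather than cut off abruptly.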

2603.05794 2026-03-09 stat.ME

Robust Estimation of Location in Matrix Manifolds Using the Projected Frobenius Median

Houren Hong, Kassel Liam Hingee, Janice L. Scealy, Andrew T. A. Wood

Abstract

We propose a robust method for location estimation in various matrix manifolds based on the projected Frobenius median, which is closely related to the spatial median. This method applies broadly to matrix manifolds, including Stiefel and Grassmann manifolds, Kendall shape spaces, and projective Stiefel manifolds, a type of quotient space of a Stiefel manifold. Our approach involves computation of the Frobenius median in an ambient Euclidean space followed by projection onto the relevant matrix manifold. Our estimation method is computationally attractive, has a unique solution provided the sample data are not collinear in the ambient Euclidean space, has desirable robustness features and has appropriate equivariance properties under natural groups of transformations. We establish asymptotic normality under mild conditions and derive the influence function for matrix manifolds of interest. Simulation studies on the rank-1 complex Grassmann manifold and the projective Stiefel manifold further show the applicability and robustness of our method. We also apply our method to a real-world earthquake moment tensor dataset.

2603.05741 2026-03-09 stat.AP

Preoperative Decline and Postoperative Recovery of Wearable-Derived Physical Activity Over a Four-Year Perioperative Period in Total Knee and Hip Arthroplasty: Evidence from the All of Us Research Program

Yuezhou Zhang, Amos Folarin, Callum Stewart, Hyunju Kim, Rongrong Zhong, Shaoxiong Sun, Richard JB Dobson

Abstract

Total knee arthroplasty (TKA) and total hip arthroplasty (THA) improve symptoms in end-stage osteoarthritis, yet long-term objective characterization of perioperative physical activity trajectories remains limited. We conducted a longitudinal observational study within the All of Us Research Program dataset, linking electronic health records with continuous Fitbit-derived step count data over a four-year perioperative window (two years before and two years after arthroplasty). Piecewise linear mixed-effects models characterized preoperative declines and postoperative recovery trajectories, and time-to-recovery was evaluated using Kaplan-Meier curves and Cox proportional hazards models under remote and immediate preoperative physical activity baseline definitions. Among 238 participants (147 TKA; 91 THA), both procedures exhibited progressive preoperative decline with distinct procedure-specific patterns and staged postoperative recovery: rapid improvement during weeks 1-6, decelerating gains through weeks 7-19/20, and subsequent stabilization through week 104. Recovery to remote and immediate baselines differed in timing (median 22 vs 13 weeks) and associated predictors. Higher immediate preoperative activity was associated with greater likelihood of recovery to habitual activity levels, underscoring the relevance of preoperative functional reserve and surgical timing. These findings demonstrate the potential of long-term wearable monitoring to refine assessment of functional outcomes, guide recovery expectations, and support perioperative management.

2603.05703 2026-03-09 stat.ME cs.LG math.ST stat.TH

Random Dot Product Graphs as Dynamical Systems: Limitations and Opportunities

Giulio Valentino Dalla Riva

Comments 39 pages, 3 figures

Abstract

Can we learn the differential equations governing the evolution of a temporal network? We investigate this within Random Dot Product Graphs (RDPGs), where each network snapshot is generated from latent positions evolving under unknown dynamics. We identify three fundamental obstructions: gauge freedom from rotational ambiguity in latent positions, realizability constraints from the manifold structure of the probability matrix, and trajectory recovery artifacts from spectral embedding. We develop a geometric framework based on principal fiber bundles that formalizes these obstructions. We characterize invisible dynamics as exactly the skew-symmetric generators, and show the realizable tangent space has dimension $nd - d(d-1)/2$. A holonomy dichotomy emerges: polynomial dynamics have commuting generators, stationary eigenvectors, and trivial holonomy, making gauge alignment purely statistical; Laplacian dynamics satisfy a non-commutativity criterion producing nontrivial holonomy, with curvature weighted by $1/(λ_ι+ λ_γ)$ linking gauge sensitivity to the spectral gap. In $d=2$ this yields full restricted holonomy $\mathrm{SO}(2)$; for $d \ge 3$ generic full $\mathrm{SO}(d)$ remains conjectural. Cramér--Rao lower bounds reveal that the same spectral gap controlling curvature and injectivity simultaneously controls Fisher information, so geometric and statistical difficulty are inextricable. We prove an identifiability principle: symmetric dynamics cannot absorb skew-symmetric gauge contamination, so dynamics structure can resolve gauge ambiguity. We demonstrate this constructively with anchor-based alignment and a UDE pipeline recovering vector fields from noisy graph sequences. Yet finite-sample interactions between noise, gauge, and dynamics expressiveness remain beyond the asymptotic theory. We frame this gap as an open challenge.

2603.05700 2026-03-09 stat.ME

Change Point Detection for Cell Populations Measured via Flow Cytometry

Yik Lun Kei, Qi Wang, Paul Parker, Francois Ribalet, Sangwon Hyun

Abstract

The ocean is filled with phytoplankton that contribute as much photosynthesis as all land plants combined, making them vital to the carbon cycle and climate system. Recent advances in flow cytometry allow oceanographers to measure the optical traits of individual cells along research cruise tracks, generating single-cell resolution microbial data. In marine microbial ecology, detecting locations of abrupt changes in the environmental response of cytometric plankton distributions is an important task. This manuscript proposes a latent space Gaussian mixture-of-experts model, facilitating change point detection in replicated and clustered phytoplankton observations. Change points are identified through shifts in prior means of low-dimensional representations, with piecewise-constant structure enforced by a group-fused LASSO penalty. The optimization problem is then solved via Alternating Direction Method of Multipliers. Applied to flow cytometry data, the proposed method identifies a scientifically important change point that aligns with a transition zone between two marine provinces.

2603.05619 2026-03-09 stat.AP cs.GT

Test-then-Punish: A Statistical Approach to Repeated Games

Aymeric Capitaine, Antoine Scheid, Etienne Boursier, Alain Durmus, Michael I. Jordan

Abstract

We study discounted infinitely repeated games in which players agree on a cooperative mixed action profile but, at each step, observe only the realized pure actions. This form of imperfect monitoring breaks classical trigger strategies, since deviations cannot be identified with certainty. To address this problem, we study how hypothesis testing can be used to sustain cooperation. First, we develop a framework that embeds statistical inference directly into strategic behavior. We introduce relaxed equilibrium notions that allow players to ignore vanishing-probability histories arising from rare but extreme realizations of the monitoring process. Within this framework, we formalize a generic test-then-punish strategy: players commit ex ante to a cooperative mixed action profile, continuously test whether observed play is consistent with this prescription, and permanently switch to punishment once sufficient statistical evidence of deviation accumulates. Under mild conditions on the testing procedure, this construction sustains any feasible and individually rational payoff for sufficiently patient players, yielding a Folk-theorem-type result under imperfect monitoring. We then propose two explicit implementations of this strategy. The first relies on anytime-valid sequential tests, providing uniform control of Type I error over an infinite horizon and a finite expected detection time for payoff-relevant deviations. However, this strategy only accounts for stationary deviations and yields a Nash equilibrium. The second uses testing over batches with a fixed size, accommodating arbitrary deviations and achieving subgame perfect Nash equilibrium, at the cost of losing global anytime guarantees on false punishments.

2603.05612 2026-03-09 q-bio.NC cs.LG stat.AP stat.ML

Behavior-dLDS: A decomposed linear dynamical systems model for neural activity partially constrained by behavior

Eva Yezerets, En Yang, Misha B. Ahrens, Adam S. Charles

Abstract

Brain-wide recordings of large-scale networks of neurons now provide an unprecedented view into how the brain drives behavior. However, brain activity contains both information directly related to behavior and the potential for many internal computations. Moreover, observable behavior is executed not only by the brain, but also by the spinal cord and peripheral nervous system. Behavior is a coarse-grained product of neural activity, and we thus take the view that it can be best represented by lower-dimensional latent neural dynamics. Capturing this indirect relationship while disambiguating behavior-generating networks from internal computations running in parallel requires new modeling approaches that can embody the parallel and distributed nature of large-scale neural populations. We thus present behavior-decomposed linear dynamical systems (b-dLDS) to disentangle simultaneously recorded subsystems and identify how the latent neural subsystems relate to behavior. We demonstrate the ability of b-dLDS to decouple behavioral vs. internal computations on controlled, simulated data, showing improvements over a state-of-the-art model that uses behavior to supervise all dynamics. We then show that b-dLDS can further scale up to tens of thousands of neurons by applying our model to a large-scale recording of a zebrafish hindbrain during the complex positional homeostasis behavior, wherein b-dLDS highlights behavior-related dynamic connectivity networks.

2603.05575 2026-03-09 stat.ML cs.LG

Prediction-Powered Conditional Inference

Yang Sui, Jin Zhou, Hua Zhou, Xiaowu Dai

Abstract

We study prediction-powered conditional inference in the setting where labeled data are scarce, unlabeled covariates are abundant, and a black-box machine-learning predictor is available. The goal is to perform statistical inference on conditional functionals evaluated at a fixed test point, such as conditional means, without imposing a parametric model for the conditional relationship. Our approach combines localization with prediction-based variance reduction. First, we introduce a reproducing kernel-based localization method that learns a data-adaptive weight function from covariates and reformulates the target conditional moment at the test point as a weighted unconditional moment. Second, we incorporate machine-learning predictions through a correction-based decomposition of this localized moment, yielding a prediction-powered estimator and confidence interval that reduce variance when the predictor is informative while preserving validity regardless of predictor accuracy. We establish nonasymptotic error bounds and minimax-optimal convergence rates for the resulting estimator, prove pointwise asymptotic normality with consistent variance estimation, and provide an explicit variance decomposition that characterizes how machine-learning predictions and unlabeled covariates improve statistical efficiency. Numerical experiments on simulated and real datasets demonstrate valid conditional coverage and substantially sharper confidence intervals than alternative methods.
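
The correction-based decomposition described above can be illustrated with a deliberately simplified estimator of a conditional mean at a test point: a kernel-weighted average of predictions on the unlabeled covariates, plus a kernel-weighted average of labeled residuals that removes the predictor's local bias. This sketch uses a fixed Gaussian kernel in place of the learned reproducing-kernel weight function, handles scalar covariates only, and omits the confidence interval; the function names and the `bandwidth` parameter are assumptions, not the paper's construction.

```python
import math


def kernel_weight(x, x0, bandwidth):
    """Gaussian localization weight around the test point x0 (stand-in for a learned weight)."""
    return math.exp(-0.5 * ((x - x0) / bandwidth) ** 2)


def pp_conditional_mean(x0, labeled, unlabeled_x, predictor, bandwidth):
    """Localized prediction-powered estimate of E[Y | X = x0].

    labeled:     list of (x, y) pairs (scarce).
    unlabeled_x: list of covariate values without labels (abundant).
    predictor:   black-box function x -> predicted y.
    """
    w_unlabeled = [kernel_weight(x, x0, bandwidth) for x in unlabeled_x]
    w_labeled = [kernel_weight(x, x0, bandwidth) for x, _ in labeled]

    # Localized average of predictions, computed on the large unlabeled sample.
    prediction_term = sum(
        w * predictor(x) for w, x in zip(w_unlabeled, unlabeled_x)
    ) / sum(w_unlabeled)

    # Localized average of residuals, computed on the labeled sample;
    # this corrects whatever bias the predictor has near x0.
    correction_term = sum(
        w * (y - predictor(x)) for w, (x, y) in zip(w_labeled, labeled)
    ) / sum(w_labeled)

    return prediction_term + correction_term


if __name__ == "__main__":
    labeled = [(0.0, 1.1), (0.5, 1.4), (1.0, 2.1)]
    unlabeled_x = [0.1 * i for i in range(11)]
    print(pp_conditional_mean(0.5, labeled, unlabeled_x, lambda x: 1.0 + x, 0.3))
```

When the predictor is accurate the residuals are small and most of the variance comes from the large unlabeled sample; when it is poor, the correction term restores the target, which mirrors the validity-regardless-of-accuracy property stated in the abstract.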

2603.05568 2026-03-09 stat.ML cs.LG

Learning Optimal Distributionally Robust Individualized Treatment Rules Integrating Multi-Source Data

Wenhai Cui, Wen Su, Xingqiu Zhao

Abstract

Integrative analysis of multiple datasets for estimating optimal individualized treatment rules (ITRs) can enhance decision efficiency. A central challenge is posterior shift, wherein the conditional distribution of potential outcomes given covariates differs between source and target populations. We propose a prior information-based distributionally robust ITR (PDRO-ITR) that maximizes the worst-case policy value over a covariate-dependent distributional uncertainty set, ensuring robust performance under posterior shift. The uncertainty set is constructed as an individualized combination of source distributions, with weights combining prior source-membership probabilities and deviation terms constrained to the probability simplex to accommodate posterior shift. We derive a closed-form solution for the PDRO-ITR and develop an adaptive procedure to tune the uncertainty level. We establish risk bounds for the PDRO-ITR estimator, which guarantees robust performance under the worst case. Extensive simulations and two real-data applications demonstrate that the proposed method achieves superior performance compared to existing approaches.

2603.05561 2026-03-09 stat.ME stat.AP

Two-stage Adaptive Design Cluster Randomised Trials

Samuel I. Watson, James Martin

Abstract

Adaptive sample size re-estimation, early stopping, and trial re-design at interim analyses can reduce expected sample sizes in randomised trials. Cluster randomised trials, in which groups of participants are randomly allocated to treatment status, may particularly benefit as they can be costly and their required sample sizes depend on one or more auxiliary parameters governing correlations within and between clusters, which are often estimated with high uncertainty. We adapt a combination test approach to the cluster trial setting allowing for early stopping for futility or efficacy and accounting for correlations between trial stages and other nuisance parameters. We consider design decisions for multi-dimensional sample sizes involving clusters, participants, and time and allowing for modifications to intervention roll-out patterns. We use a Pareto optimality approach to balance objectives relating to different components of the sample size and costs. We also examine the interim estimation of auxiliary parameters and trial re-design for efficiency. We illustrate the methods including examples of stepped-wedge trial re-design and a re-analysis of the large cluster randomised trial E-MOTIVE.

2603.05544 2026-03-09 stat.ME cs.LG stat.AP

An intuitive rearranging of the Yates covariance decomposition for probabilistic verification of forecasts with the Brier score

Bruno Hebling Vieira

Comments 4 pages, 0 figures

Abstract

Proper scoring rules are essential for evaluating probabilistic forecasts. We propose a simple algebraic rearrangement of the Yates covariance decomposition of the Brier score into three independently non-negative terms: a variance mismatch term, a correlation deficit term, and a calibration-in-the-large term. This rearrangement makes the optimality conditions for perfect forecasting transparent: the optimal forecast must simultaneously match the variance of outcomes, achieve perfect positive correlation with outcomes, and match the mean of outcomes. Any deviation from these conditions results in a positive contribution to the Brier score.
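
One way to write a rearrangement with the three terms named above is $\mathrm{BS} = (s_f - s_y)^2 + 2 s_f s_y (1 - r_{fy}) + (\bar f - \bar y)^2$, where $s_f$ and $s_y$ are the standard deviations of forecasts and outcomes, $r_{fy}$ is their correlation, and $\bar f$, $\bar y$ are their means. This follows algebraically from the Yates form $\mathrm{BS} = s_f^2 + s_y^2 - 2 s_{fy} + (\bar f - \bar y)^2$; whether it matches the paper's exact notation is an assumption. The sketch below checks the identity numerically using population (divide-by-$n$) moments; the function name is illustrative.

```python
import math


def brier_terms(forecasts, outcomes):
    """Return (brier, variance_mismatch, correlation_deficit, calibration_in_the_large)."""
    n = len(forecasts)
    mean_f = sum(forecasts) / n
    mean_y = sum(outcomes) / n
    # Population (divide-by-n) moments so the identity holds exactly.
    var_f = sum((f - mean_f) ** 2 for f in forecasts) / n
    var_y = sum((y - mean_y) ** 2 for y in outcomes) / n
    cov_fy = sum((f - mean_f) * (y - mean_y) for f, y in zip(forecasts, outcomes)) / n
    s_f, s_y = math.sqrt(var_f), math.sqrt(var_y)

    brier = sum((f - y) ** 2 for f, y in zip(forecasts, outcomes)) / n
    variance_mismatch = (s_f - s_y) ** 2
    # Equals 2 * s_f * s_y * (1 - r); written with the covariance to avoid
    # dividing by zero when a series is constant. Non-negative up to rounding.
    correlation_deficit = 2.0 * (s_f * s_y - cov_fy)
    calibration_in_the_large = (mean_f - mean_y) ** 2
    return brier, variance_mismatch, correlation_deficit, calibration_in_the_large


if __name__ == "__main__":
    forecasts = [0.9, 0.2, 0.7, 0.1, 0.6]
    outcomes = [1, 0, 1, 0, 0]
    brier, mismatch, deficit, calibration = brier_terms(forecasts, outcomes)
    print(brier, mismatch + deficit + calibration)
```

Each term vanishes only under its own condition (equal spread, perfect positive correlation, equal means), so a zero Brier score requires all three at once.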

2512.20219 2026-03-09 stat.ME stat.AP

Estimation and Inference for Causal Explainability

Weihan Zhang, Zijun Gao

Comments 35 pages, 5 figures, 7 tables

Abstract

Understanding how much each variable contributes to an outcome is a central question across disciplines. A causal view of explainability is favorable for its ability to uncover underlying mechanisms and generalize to new contexts. Based on a family of causal explainability quantities, we develop methods for their estimation and inference. In particular, we construct a one-step correction estimator using semi-parametric efficiency theory, which explicitly leverages the independence structure of variables to reduce the asymptotic variance. For a null hypothesis on the boundary, i.e., zero explainability, we show its equivalence to Fisher's sharp null, which motivates a randomization-based inference procedure. Finally, we illustrate the empirical efficacy of our approach through simulations as well as an immigration experiment dataset, where we investigate how features and their interactions shape public opinion toward admitting immigrants.

2512.00566 2026-03-09 econ.EM stat.ME

Improved inference for nonparametric regression and regression-discontinuity designs

Giuseppe Cavaliere, Sílvia Gonçalves, Morten Ørregaard Nielsen, Edoardo Zanelli

Abstract

Nonparametric regression and regression-discontinuity designs suffer from smoothing bias that distorts conventional confidence intervals. Solutions based on robust bias correction (RBC) are now central to the economist's toolbox. In this paper, we establish a novel connection between RBC methods and bootstrap prepivoting. Revisiting RBC through the lens of bootstrapping allows us to develop a novel bias correction procedure which delivers improved nonparametric inference. The resulting confidence intervals are 17% shorter than the usual intervals employed in curve estimation and regression discontinuity designs, without compromising asymptotic coverage. This holds regardless of evaluation point location, bandwidth choice, or regressor and error distribution.

2511.09823 2026-03-09 stat.CO stat.ME

Diagnostics for Semiparametric Accelerated Failure Time Models with R Package afttest

Woojung Bae, Dongrak Choi, Jun Yan, Sangwook Kang

Abstract

The semiparametric accelerated failure time (AFT) model offers a direct and interpretable alternative to the Cox proportional hazards model, yet practical diagnostic tools for this framework remain limited. We introduce afttest, an R package that implements martingale-residual-based goodness-of-fit procedures for semiparametric AFT models. In addition to the recently developed multiplier bootstrap diagnostics, the package introduces a new computationally efficient resampling strategy based on an influence-function linear approximation. Unlike the original approach, which requires repeatedly solving estimating equations for each bootstrap replicate, the proposed method avoids iterative optimization and substantially reduces computation time while preserving asymptotic validity. Both the standard multiplier bootstrap and the accelerated linear approximation are implemented, allowing users to balance finite-sample performance and computational scalability. The package supports rank-based and least-squares estimators, provides omnibus, link function, and functional form tests, and includes graphical tools for visualizing residual processes. An application to the Mayo Clinic primary biliary cirrhosis study illustrates the workflow.

2511.05826 2026-03-09 cs.LG stat.ML

CADM: Cluster-customized Adaptive Distance Metric for Categorical Data Clustering

Taixi Chen, Yiu-ming Cheung, Yiqun Zhang

Comments Accepted by ICASSP 2026

Details
English abstract

An appropriate distance metric is crucial for categorical data clustering, as the distance between categorical data cannot be directly calculated. However, the distances between attribute values usually vary across clusters because of their different distributions, which has not been taken into account, thus leading to unreasonable distance measurement. Therefore, we propose a cluster-customized distance metric for categorical data clustering, which can competitively update distances based on different distributions of attributes in each cluster. In addition, we extend the proposed distance metric to mixed data that contains both numerical and categorical attributes. Experiments demonstrate the efficacy of the proposed method, i.e., achieving an average rank close to first across fourteen datasets. The source code is available at https://anonymous.4open.science/r/CADM-47D8

2510.22558 2026-03-09 stat.ME

Surface decomposition method for sensitivity analysis of first-passage dynamic reliability of linear systems

Jianhua Xian, Sai Hung Cheung, Cheng Su

Details
English abstract

This work presents a novel surface decomposition method for the sensitivity analysis of first-passage dynamic reliability of linear systems subjected to Gaussian random excitations. The method decomposes the sensitivity of first-passage failure probability into a sum of surface integrals over the constrained component limit-state hypersurfaces. The evaluation of these surface integrals can be accomplished, owing to the availability of closed-form linear expressions of both the component limit-state functions and their sensitivities for linear systems. An importance sampling strategy is introduced to further enhance the efficiency for estimating the sum of these surface integrals. The number of function evaluations required for the reliability sensitivity analysis is typically on the order of 10^2 to 10^3. The approach is particularly advantageous when a large number of design parameters are considered, as the results of function evaluations can be reused across different parameters. Three numerical examples are investigated to demonstrate the effectiveness of the proposed method.

2510.16657 2026-03-09 stat.ML cs.LG

Escaping Model Collapse via Synthetic Data Verification: Near-term Improvements and Long-term Convergence

Bingji Yi, Qiyuan Liu, Yuwei Cheng, Haifeng Xu

Comments 29 pages, 10 figures

Details
English abstract

Synthetic data has been increasingly used to train frontier generative models. However, recent studies raise key concerns that iteratively retraining a generative model on its self-generated synthetic data may keep deteriorating model performance, a phenomenon often termed model collapse. In this paper, we investigate ways to modify the synthetic retraining process to avoid model collapse, and even possibly help reverse the trend from collapse to improvement. Our key finding is that by injecting information through an external synthetic data verifier, whether a human or a better model, synthetic retraining will not cause model collapse. Specifically, we situate our theoretical analysis in the fundamental linear regression setting, showing that verifier-guided retraining can yield near-term improvements, but ultimately drives the parameter estimate to the verifier's "knowledge center" in the long run. Our theory further predicts that, unless the verifier is perfectly reliable, these early gains will plateau and may even reverse. Indeed, our experiments across linear regression, Variational Autoencoders (VAEs) trained on MNIST, and fine-tuning SmolLM2-135M on the XSUM task confirm these theoretical insights.

2510.03929 2026-03-09 stat.ML cs.LG

Self-Speculative Masked Diffusions

Andrew Campbell, Valentin De Bortoli, Jiaxin Shi, Arnaud Doucet

Comments 32 pages, 7 figures, 4 tables

Details
Journal ref
ICLR 2026
English abstract

We present self-speculative masked diffusions, a new class of masked diffusion generative models for discrete data that require significantly fewer function evaluations to generate samples. Standard masked diffusion models predict factorized logits over currently masked positions. A number of masked positions are then sampled; however, the factorization approximation means that sampling too many positions in one go leads to poor sample quality. As a result, many simulation steps and therefore neural network function evaluations are required to generate high-quality data. We reduce the computational burden by generating non-factorized predictions over masked positions. This is achieved by modifying the final transformer attention mask from non-causal to causal, enabling draft token generation and parallel validation via a novel, model-integrated speculative sampling mechanism. This results in a non-factorized predictive distribution over masked positions in a single forward pass. We apply our method to GPT2-scale text modelling and protein sequence generation, finding that we can achieve a ~2x reduction in the required number of network forward passes relative to standard masked diffusion models.

2509.16337 2026-03-09 stat.ME math.ST stat.AP stat.ML stat.TH

Learning Centre Partitions from Summaries

Zinsou Max Debaly, Jean-Francois Ethier, Michael H. Neumann, Félix Camirand-Lemyre

Details
English abstract

Multi-centre studies increasingly rely on distributed inference, where sites share only centre-level summaries. Homogeneity of parameters across centres is often violated, motivating methods that both test for equality and learn centre groupings before estimation. We develop multivariate Cochran-type tests that operate on summary statistics and embed them in a sequential, test-driven Clusters-of-Centres (CoC) algorithm that merges centres (or blocks) only when equality is not rejected. We derive the asymptotic $χ^2$-mixture distributions of the test statistics and provide plug-in estimators for implementation. To improve finite-sample integration, we introduce a multi-round bootstrap CoC that re-evaluates merges across independently resampled summary sets; under mild regularity and a separation condition, we prove a golden-partition recovery result: as the number of rounds grows with $n$, the true partition is recovered with probability tending to one. We also give simple numerical guidelines, including a plateau-based stopping rule, to make the multi-round procedure reproducible. Simulations and a real-data analysis of U.S. airline on-time performance (2007) show accurate heterogeneity detection and partitions that change little with the choice of resampling scheme.
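
To make the idea of testing homogeneity from centre-level summaries concrete, here is a minimal sketch of the classical univariate Cochran Q statistic. It is not the multivariate test or the CoC algorithm of the paper, which the abstract does not specify in detail; the function name `cochran_q` and the inputs (one estimate and one variance per centre) are assumptions made for illustration. Under homogeneity, Q is approximately chi-squared with K - 1 degrees of freedom in this simple setting.

```python
def cochran_q(estimates, variances):
    """Classical Cochran Q from per-centre summaries (hypothetical helper).

    estimates : one scalar estimate per centre
    variances : the estimated variance of each estimate
    Returns (Q, degrees of freedom, inverse-variance pooled estimate).
    """
    # Inverse-variance weights: precise centres count more.
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    pooled = sum(w * t for w, t in zip(weights, estimates)) / total
    # Weighted squared deviations from the pooled estimate.
    q = sum(w * (t - pooled) ** 2 for w, t in zip(weights, estimates))
    return q, len(estimates) - 1, pooled
```

Only the summaries enter the computation, so no individual-level data need to leave a centre; a large Q relative to its reference distribution suggests the centres should not be merged.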

2509.07289 2026-03-09 stat.ML cs.CV cs.LG

Kernel VICReg for Self-Supervised Learning in Reproducing Kernel Hilbert Space

M. Hadi Sepanj, Benyamin Ghojogh, Saed Moradi, Paul Fieguth

Comments Published in Big Data and Cognitive Computing, 2026, volume 10, issue 3, https://doi.org/10.3390/bdcc10030078

Details
Journal ref
Big Data and Cognitive Computing, 2026, volume 10, issue 3
English abstract

Self-supervised learning (SSL) has emerged as a powerful paradigm for representation learning by optimizing geometric objectives, such as invariance to augmentations, variance preservation, and feature decorrelation, without requiring labels. However, most existing methods operate in Euclidean space, limiting their ability to capture nonlinear dependencies and geometric structures. In this work, we propose Kernel VICReg, a novel self-supervised learning framework that lifts the VICReg objective into a Reproducing Kernel Hilbert Space (RKHS). By kernelizing each term of the loss (variance, invariance, and covariance), we obtain a general formulation that operates on double-centered kernel matrices and Hilbert-Schmidt norms, enabling nonlinear feature learning without explicit mappings. We demonstrate that Kernel VICReg mitigates the risk of representational collapse under challenging conditions and improves performance on datasets exhibiting nonlinear structure or limited sample regimes. Empirical evaluations across MNIST, CIFAR-10, STL-10, TinyImageNet, and ImageNet100 show consistent gains over Euclidean VICReg, with particularly strong improvements on datasets where nonlinear structures are prominent. UMAP visualizations are provided only as a qualitative illustration of embedding geometry and are not used as a calibration or statistical validation. Our results suggest that kernelizing SSL objectives is a promising direction for bridging classical kernel methods with modern representation learning.

2505.21453 2026-03-09 stat.AP

An Integrated Time-Varying Ornstein-Uhlenbeck Process for Jointly Modeling Individual and Population-Level Movement of Golden Eagles

Michael L. Shull, Ephraim M. Hanks, James C. Russell, Robert K. Murphy, Frances E. Buderman

Details
English abstract

With technological advancements, the quantity and quality of animal movement data have increased greatly. Currently, no movement model can be used to describe full-year data from migratory species by leveraging both individual movement and species distribution data. Herein we propose a full-year stochastic differential equation model for jointly modeling both individual movement and species distribution data. We show that this joint model, under certain assumptions, results in efficient computation of the spatio-temporal dynamics of the entire population, and thus provides straightforward inference on the species distribution data. We illustrate this model by analyzing 215 bird-years of golden eagle movement in western North America jointly with relative abundance data from eBird. We use the results to estimate wind project risk for these eagles and predict where they came from earlier in the year based on a single telemetry observation from later in the year. Our joint model enables additional inference and greater predictive power than afforded by sole use of eBird relative abundance.

2504.13057 2026-03-09 stat.ME

Covariate balancing estimation and model selection for difference-in-differences approach

Takamichi Baba, Yoshiyuki Ninomiya

Comments 32 pages, 6 tables

Details
English abstract

Remarkable progress has been made in difference-in-differences (DID) approaches to causal inference that estimate the average effect of a treatment on the treated (ATT). Of these, the semiparametric DID (SDID) approach incorporates a propensity score analysis into the DID setup. Supposing that the ATT is a function of covariates, we estimate it by weighting with the inverse of the propensity score. In this study, as one way to make the estimation robust to the propensity score modeling, we incorporate covariate balancing. Then, by carefully constructing the moment conditions used in the covariate balancing, we show that the proposed estimator is doubly robust. In addition to the estimation, we also address model selection. In practice, covariate selection is an essential task in statistical analysis, but even in the basic setting of the SDID approach, there are no reasonable information criteria. Here, we derive a model selection criterion as an asymptotically bias-corrected estimator of risk based on the loss function used in the SDID estimation. We show that a penalty term can be derived that is considerably different from almost twice the number of parameters that often appears in AIC-type information criteria. Numerical experiments show that the proposed method estimates the ATT more robustly compared with the method using propensity scores given by maximum likelihood estimation, and that the proposed criterion clearly reduces the risk targeted in the SDID approach in comparison with the intuitive generalization of the existing information criterion. In addition, real data analysis confirms that there is a large difference between the results of the proposed method and those of the existing method.

2501.11268 2026-03-09 cs.LG stat.ML

L0-Regularized Quadratic Surface Support Vector Machines

Ahmad Mousavi, Ramin Zandvakili, Zheming Gao

Details
English abstract

Kernel-free quadratic surface support vector machines (QSVM) have recently gained traction due to their flexibility in modeling nonlinear decision boundaries without relying on kernel functions. However, the introduction of a full quadratic classifier significantly increases the number of model parameters, scaling quadratically with data dimensionality, which often leads to overfitting and makes interpretation difficult. To address these challenges, we propose sparse variants of the QSVM by enforcing a cardinality constraint on the model parameters. While enhancing generalization and promoting sparsity, leveraging the $\ell_0$-norm inevitably incurs additional computational complexity. To tackle this, we develop a penalty decomposition algorithm capable of producing solutions that provably satisfy the first-order Lu-Zhang optimality conditions. We show that the subproblems arising within the algorithm either admit closed-form solutions or can be solved efficiently through dual formulations, which contributes to the method's overall effectiveness. We also analyze the convergence behavior of the algorithm under both loss settings. In addition, numerical experiments on public benchmark datasets indicate that the proposed model is competitive with commonly used SVM variants and produces sparse solutions as expected. Moreover, its strong performance on real-world credit datasets demonstrates its potential for credit scoring applications.

2412.05905 2026-03-09 stat.ME

Fast QR updating methods for statistical applications

Mauro Bernardi, Claudio Busatto, Manuela Cattelan

Details
English abstract

This paper introduces fast R updating algorithms specifically designed for statistical applications, including regression, filtering, and model selection, where data structures change frequently. Although traditional QR decomposition is essential for matrix operations, it becomes computationally intensive when dynamically updating the design matrix in statistical models. The proposed algorithms efficiently update the R matrix without the need for recalculation of Q, thereby significantly reducing computational costs in practical computational scenarios. The provision of scalable solutions for high-dimensional regression models is a key strength of these algorithms, enhancing the feasibility of large-scale statistical analyses and model selection in data-intensive fields. A thorough simulation study and the analysis of real-world data demonstrate that the methods achieve a substantial reduction in computational time without compromising accuracy. The discussion illustrates the benefits of these algorithms across a wide range of models and applications in statistics and machine learning.
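
As a hedged illustration of updating R without recomputing Q, the sketch below extends the triangular factor when one column is appended to the design matrix. It relies on the textbook identity R^T R = X^T X and is not the authors' algorithm; the function name `append_column_to_r` and its inputs (the cross-products X^T x and x^T x) are assumptions chosen for this example. Because it works from cross-products, it can be less numerically stable than orthogonal-transformation updates when columns are nearly collinear.

```python
import math

def append_column_to_r(R, Xt_x, xtx):
    """Extend upper-triangular R (with R^T R = X^T X) after appending column x to X.

    R    : list of p rows of length p, upper triangular with nonzero diagonal
    Xt_x : list of length p holding the cross-products X^T x
    xtx  : float, the squared norm x^T x
    Returns the (p+1) x (p+1) upper-triangular factor of the enlarged matrix.
    """
    p = len(R)
    # Forward substitution for R^T r = X^T x (R^T is lower triangular).
    r = [0.0] * p
    for i in range(p):
        s = Xt_x[i] - sum(R[k][i] * r[k] for k in range(i))
        r[i] = s / R[i][i]
    # Squared length of the part of x orthogonal to the existing columns.
    resid = xtx - sum(v * v for v in r)
    if resid <= 0.0:
        raise ValueError("new column is numerically in the span of existing columns")
    rho = math.sqrt(resid)
    # Old R gains one column on the right and one row at the bottom.
    new_R = [list(R[i]) + [r[i]] for i in range(p)]
    new_R.append([0.0] * p + [rho])
    return new_R
```

The update costs one triangular solve, O(p^2), instead of a fresh O(n p^2) factorization, which is the kind of saving that matters when candidate covariates are added one at a time during model selection.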

2411.11824 2026-03-09 math.ST stat.ME stat.ML stat.TH

Theoretical Foundations of Conformal Prediction

Anastasios N. Angelopoulos, Rina Foygel Barber, Stephen Bates

Comments This material will be published by Cambridge University Press as Theoretical Foundations of Conformal Prediction by Anastasios N. Angelopoulos, Rina Foygel Barber, and Stephen Bates. This prepublication version is free to view/download for personal use only. Not for redistribution/resale/use in derivative works. Copyright Anastasios N. Angelopoulos, Rina Foygel Barber, and Stephen Bates, 2025

Details
English abstract

This book is about conformal prediction and related inferential techniques that build on permutation tests and exchangeability. These techniques are useful in a diverse array of tasks, including hypothesis testing and providing uncertainty quantification guarantees for machine learning systems. Much of the current interest in conformal prediction is due to its ability to integrate into complex machine learning workflows, solving the problem of forming prediction sets without any assumptions on the form of the data generating distribution. Since contemporary machine learning algorithms have generally proven difficult to analyze directly, conformal prediction's main appeal is its ability to provide formal, finite-sample guarantees when paired with such methods. The goal of this book is to teach the reader about the fundamental technical arguments that arise when researching conformal prediction and related questions in distribution-free inference. Many of these proof strategies, especially the more recent ones, are scattered among research papers, making it difficult for researchers to understand where to look, which results are important, and how exactly the proofs work. We hope to bridge this gap by curating what we believe to be some of the most important results in the literature and presenting their proofs in a unified language, with illustrations, and with an eye towards pedagogy.
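
The finite-sample, distribution-free guarantee mentioned above can be illustrated with a minimal sketch of split conformal prediction for regression. This is a generic textbook construction and not code from the book; the names `conformal_quantile` and `prediction_interval`, the absolute-residual score, and the caller-supplied `predict` function are assumptions for illustration. Assuming the calibration points and the new point are exchangeable, the interval covers the new response with probability at least 1 - alpha.

```python
import math

def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    Returns math.inf when n is too small for the requested level.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]

def prediction_interval(predict, calib_x, calib_y, x_new, alpha=0.1):
    """Split conformal interval around predict(x_new).

    predict must have been fitted on data disjoint from the calibration set.
    """
    # Nonconformity score: absolute residual on held-out calibration data.
    scores = [abs(y - predict(x)) for x, y in zip(calib_x, calib_y)]
    q = conformal_quantile(scores, alpha)
    center = predict(x_new)
    return center - q, center + q
```

The predictor is treated as a black box, which is why the construction can wrap complex machine learning models; the price is that the interval width depends entirely on how good the calibration residuals are.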

2409.16240 2026-03-09 math.ST stat.TH

Axiomatic characterisation of generalized $ψ$-estimators

Matyas Barczy, Zsolt Páles

Comments 20 pages

Details
English abstract

We give axiomatic characterisations of generalized $ψ$-estimators and (usual) $ψ$-estimators (also called $Z$-estimators), respectively. The key properties of estimators that come into play in the characterisation theorems are the symmetry, the (strong) internality and the asymptotic idempotency. In the proofs, a separation theorem for Abelian subsemigroups plays a crucial role.

2407.04117 2026-03-09 cs.LG cond-mat.dis-nn cs.AI cs.NE stat.ML

Predictive Coding Networks and Inference Learning: Tutorial and Survey

Björn van Zwol, Ro Jefferson, Egon L. van den Broek

Comments 47 pages, 11 figures, 9 tables

Details
English abstract

Recent years have witnessed a growing call for renewed emphasis on neuroscience-inspired approaches in artificial intelligence research, under the banner of NeuroAI. A prime example of this is predictive coding networks (PCNs), based on the neuroscientific framework of predictive coding. This framework views the brain as a hierarchical Bayesian inference model that minimizes prediction errors through feedback connections. Unlike traditional neural networks trained with backpropagation (BP), PCNs utilize inference learning (IL), a more biologically plausible algorithm that explains patterns of neural activity that BP cannot. Historically, IL has been more computationally intensive, but recent advancements have demonstrated that it can achieve higher efficiency than BP with sufficient parallelization. Furthermore, PCNs can be mathematically considered a superset of traditional feedforward neural networks (FNNs), significantly extending the range of trainable architectures. As inherently probabilistic (graphical) latent variable models, PCNs provide a versatile framework for both supervised learning and unsupervised (generative) modeling that goes beyond traditional artificial neural networks. This work provides a comprehensive review and detailed formal specification of PCNs, particularly situating them within the context of modern ML methods. This positions PC as a promising framework for future ML innovations.

2309.12032 2026-03-09 cs.LG stat.ML

Expert-Aided Causal Discovery of Ancestral Graphs

Tiago da Silva, Bruna Bazaluk, Eliezer de Souza da Silva, António Góis, Salem Lahlou, Dominik Heider, Samuel Kaski, Diego Mesquita, Adèle Helena Ribeiro

Details
English abstract

Causal discovery (CD) is an important component of many scientific applications, yet most techniques produce unreliable point estimates that often contradict expert knowledge. To mitigate this, recent research has focused on ex-ante incorporation of background knowledge into the CD process, typically under an unrealistic causal sufficiency assumption. When probing experts is costly (e.g., hidden behind expensive LLM APIs), however, ex-post model refinement that maximizes query utility is preferable. Also, when independent experts provide conflicting but better-than-random feedback, a principled aggregation method is required. In this context, we introduce the first CD algorithm that enables (i) distributional inference over ancestral graphs (AGs), which represent causal systems under latent confounding, and (ii) integration of both ex-ante and uncertain ex-post expert knowledge. Briefly, our method is a diversity-seeking reinforcement learning algorithm, termed Ancestral GFlowNet (AGFN), whose policy we iteratively refine based on a Bayesian model of the noisy expert feedback. Importantly, we prove convergence to the true AG given sufficiently accurate responses. Through validation on synthetic and realistic datasets using simulated humans and LLMs, we show AGFN is competitive with or superior to strong baselines in terms of structural Hamming distance and Bayesian Information Criterion.

2209.04757 2026-03-09 math.ST math.PR stat.TH

Normal approximations for the multivariate inverse Gaussian distribution and asymmetric kernel smoothing on $d$-dimensional half-spaces

Léo R. Belzile, Alain Desgagné, Christian Genest, Frédéric Ouimet

Comments 59 pages, 11 figures, 1 table

Details
Journal ref
Electronic Journal of Statistics (2025), 19 (2), 3134-3187
English abstract

This paper introduces a novel density estimator supported on $d$-dimensional half-spaces. It stands out as the first asymmetric kernel density estimator for half-spaces in the literature. Using the multivariate inverse Gaussian (MIG) density from Minami (2003) as the kernel and incorporating locally adaptive parameters, the estimator achieves desirable boundary properties. To analyze its mean integrated squared error (MISE) and asymptotic normality, a local limit theorem and probability metric bounds are established between the MIG and the corresponding multivariate Gaussian distribution with the same mean vector and covariance matrix, which may also be of independent interest. Additionally, a new algorithm for generating MIG random vectors is developed, proving to be faster and more accurate than Minami's algorithm based on a Brownian first-hitting location representation. This algorithm is then used to discuss and compare optimal MISE and likelihood cross-validation bandwidths for the estimator in a simulation study under various target distributions. As an application, the MIG asymmetric kernel is used to smooth the posterior distribution of a generalized Pareto model fitted to large electromagnetic storms.

2110.10296 2026-03-09 stat.ME stat.CO

A Bayesian Approach for the Variance of Fine Stratification

Sepideh Mosaferi

Comments Please see the final version (arXiv:2603.03569)

Details
English abstract

Fine stratification is a popular design as it permits the stratification to be carried out to the fullest possible extent. Some examples include the Current Population Survey and National Crime Victimization Survey, both conducted by the U.S. Census Bureau, and the National Survey of Family Growth conducted by the University of Michigan's Institute for Social Research. Clearly, the fine stratification survey has proved useful in many applications as its point estimator is unbiased and efficient. A common practice to estimate the variance in this context is collapsing the adjacent strata to create pseudo-strata and then estimating the variance, but the attained estimator of variance is not design-unbiased, and the bias increases as the population means of the pseudo-strata become more variable. Additionally, the estimator may suffer from a large mean squared error (MSE). In this paper, we propose a hierarchical Bayesian estimator for the variance of collapsed strata and compare the results with a nonparametric Bayes variance estimator. Additionally, we make comparisons with a kernel-based variance estimator recently proposed by Breidt et al. (2016). We show that our proposed estimator is superior to the alternatives given in the literature in that it has a smaller frequentist MSE and bias. We verify this through multiple simulation studies and data analysis from the 2007-8 National Health and Nutrition Examination Survey and the 1998 Survey of Mental Health Organizations.

2106.00283 2026-03-09 stat.AP cs.LG

A mixed-frequency approach for exchange rates predictions

Raffaele Mattera, Michelangelo Misuraca, Germana Scepi, Maria Spano

Details
Journal ref
Electron J Appl Stat Anal 14 (2021) 230-253
English abstract

Selecting an appropriate statistical model to forecast exchange rates remains a relevant issue for policymakers and central bankers. The so-called Meese and Rogoff puzzle asserts that exchange rate fluctuations are unpredictable. Many studies in the literature have tried to solve the puzzle by finding alternative predictors and statistical models based on temporal aggregation. In this paper, we propose an approach based on mixed-frequency models to overcome the lack of information caused by temporal aggregation. We show the effectiveness of our approach in comparison with other proposed methods by performing CAD/USD exchange rate predictions.

2005.12840 2026-03-09 cs.IR stat.AP stat.CO

Sentiment Analysis for Education with R: packages, methods and practical applications

Michelangelo Misuraca, Alessia Forciniti, Germana Scepi, Maria Spano

Details
Journal ref
Studies in Educational Evaluation, Volume 68, March 2021, 100979
English abstract

Sentiment Analysis (SA) refers to a family of techniques at the crossroads of statistics, natural language processing, and computational linguistics. The primary goal is to detect the semantic orientation of individual opinions and comments expressed in written texts. SA has practical applications in several domains. In an educational context, this approach allows students' feedback to be processed, with the aim of monitoring the teaching effectiveness of instructors and enhancing the learning experience. This paper reviews the different R packages that can be used to carry out SA, comparing the implemented methods, discussing their characteristics, and showing how they perform on a simple example.