arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
2604.07063 2026-04-09 stat.ME

Introduction to Relational Event Modelling

Martina Boschi, Ernst C. Wit

Interactions and time shape many aspects of life. Everyday activities -- like conversations, emails, money transfers, citations, and even acts of violence -- are relational events: interactions between a sender and a receiver at a specific moment. At the intersection of event-history analysis and network modelling, relational event models (REMs) offer a powerful framework for studying when and why these events occur. Recent advances have made it possible to express REMs as generalized additive models, allowing researchers to capture complex, non-linear patterns over time. While an essay and a comprehensive review exist, a hands-on tutorial paper on REMs is still missing. This work fills that gap. It provides a practical introduction to REMs, incorporating the latest developments in the field. It demonstrates how to simulate synthetic relational-event data and walks through several empirical applications, comparing different modelling and inference strategies. By bringing together theory, simulation, and application, this tutorial lowers the barrier to entry and makes REMs a more accessible and practical tool.
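
As a rough illustration of what simulating synthetic relational-event data can look like, the sketch below draws events from a model in which every ordered sender-receiver pair has its own constant hazard. The function name `simulate_events`, the `rates` dictionary, and the constant-hazard assumption are illustrative choices made here; they are not taken from the tutorial itself.

```python
import random

def simulate_events(rates, n_events, seed=0):
    """Simulate relational events as (time, sender, receiver) tuples.

    rates: dict mapping (sender, receiver) -> constant hazard rate (> 0).
    Assumption (not from the paper): hazards are constant over time, so the
    waiting time to the next event is exponential with the total rate, and
    the next dyad is chosen with probability proportional to its own rate.
    """
    rng = random.Random(seed)
    dyads = list(rates)
    weights = [rates[d] for d in dyads]
    total_rate = sum(weights)
    t = 0.0
    events = []
    for _ in range(n_events):
        t += rng.expovariate(total_rate)  # waiting time until the next event
        sender, receiver = rng.choices(dyads, weights=weights, k=1)[0]
        events.append((t, sender, receiver))
    return events

# Hypothetical three-dyad network used only for this example.
rates = {("a", "b"): 2.0, ("b", "a"): 0.5, ("a", "c"): 1.0}
events = simulate_events(rates, n_events=5)
for time, s, r in events:
    print(f"{time:.3f}: {s} -> {r}")
```

A realistic REM would let each hazard depend on covariates and on the event history; the constant rates here only show the sampling mechanics.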

2604.07018 2026-04-09 stat.ME stat.ML

Time Series Gaussian Chain Graph Models

Qin Fang, Xinghao Qiao, Zihan Wang

Time series graphical models have recently received considerable attention for characterizing (conditional) dependence structures in multivariate time series. In many applications, the multivariate series exhibit variable-partitioned blockwise dependence, with distinct patterns within and across blocks. In this paper, we introduce a new class of time series Gaussian chain graph models that represent contemporaneous and lagged causal relations via directed edges across blocks, while capturing within-block conditional dependencies through undirected edges. In the frequency domain, this formulation induces a cross-frequency shared group sparse plus group low-rank decomposition of the inverse spectral density matrices, which we exploit to establish identifiability of the time series chain graph structure. Building on this, we then propose a three-stage learning procedure for estimating the undirected and directed edge sets, which involves optimizing a regularized Whittle likelihood with a group lasso penalty to encourage group sparsity and a novel tensor-unfolding nuclear norm penalty to enforce group low-rank structure. We investigate the asymptotic properties of the proposed method, ensuring its consistency for exact recovery of the chain graph structure. The superior empirical performance of the proposed method is demonstrated through both extensive simulation studies and an application to U.S. macroeconomic data that highlights key monetary policy transmission mechanisms.

2604.06915 2026-04-09 stat.ME

Covariance Correction for Permutation Statistics in Multiple Testing Problems

Merle Munko, Paavo Sattler

In qualitative statistics, permutation tests are very popular, mainly because of their finite-sample exactness under exchangeability. However, in non-exchangeable settings, the covariance structure of permuted statistics typically differs from that of the original statistic. A common solution is studentization, which restores asymptotic correctness for general hypotheses while preserving exactness under exchangeability. In multiple testing settings, however, standard studentization fails to provide the correct joint limiting distribution. Existing solutions such as prepivoting address this issue but are computationally expensive and therefore rarely used in practice. We propose a general, computationally more efficient methodology that overcomes this fundamental limitation. By appropriately correcting the covariance matrix of multiple permutation statistics, our approach restores the correct joint asymptotic dependence structure, enabling asymptotically valid permutation tests in broad multiple testing frameworks. The proposed method is highly flexible: it accommodates singular covariance structures and is not tied to specific parameters, test statistics, or permutation schemes. This generality makes it applicable across a wide range of problems. Extensive simulation studies demonstrate that our approach results in reliable inference and outperforms existing methods across diverse settings.
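
For readers unfamiliar with the baseline being corrected, here is a minimal sketch of a plain two-sample permutation test for a difference in means. It is the unstudentized, single-hypothesis version; it does not implement the covariance correction proposed in the paper, and the function name and data are hypothetical.

```python
import random

def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided Monte Carlo permutation p-value for a difference in means."""
    rng = random.Random(seed)
    observed = abs(sum(x) / len(x) - sum(y) / len(y))
    pooled = list(x) + list(y)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of the pooled sample
        px, py = pooled[:len(x)], pooled[len(x):]
        stat = abs(sum(px) / len(px) - sum(py) / len(py))
        if stat >= observed - 1e-12:  # small tolerance for float rounding
            count += 1
    # The +1 terms count the observed labelling itself.
    return (count + 1) / (n_perm + 1)

# Hypothetical data: every value in x exceeds every value in y.
x = [5.1, 4.9, 5.6, 5.8, 6.0]
y = [4.2, 4.0, 4.5, 4.8, 4.1]
p = permutation_pvalue(x, y)
print(p)
```

Under exchangeability this p-value is valid in finite samples; the abstract's point is that this guarantee does not carry over to the joint distribution of several such statistics without a correction.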

2604.06894 2026-04-09 stat.AP

How Does LLM Help Regional CPI Forecast: An LLM-powered Deep Panel Modeling Framework

Tianchen Gao, Ao Sun, Yurou Wang, Jingyuan Liu, Cheng Hsiao

Understanding regional Consumer Price Index (CPI) dynamics is essential for timely and effective economic policymaking. However, traditional modeling procedures typically rely only on parametric panel modeling with low-frequency and high-cost macroeconomic indicators, which often fail to capture rapid market fluctuations and lead to inaccurate predictions. To this end, we propose a residual-joint-modeling framework that integrates large language model (LLM) analyses and social media narratives via a new deep neural network based panel modeling. Specifically, we construct a large narrative corpus from a newly collected Sina Weibo dataset, and develop a prompt-based GPT model and a series of fine-tuned BERT models to generate high-frequency LLM-induced surrogates for regional CPI. A novel joint modeling strategy is then advocated to transfer the information from these surrogates to the target regional CPI data and hence empower CPI prediction. To solve the joint objectives, we further introduce a new deep panel learning procedure with region-wise homogeneity pursuit, which has its own significance in panel data analysis literature. In addition, conformal-based panel prediction intervals are provided to quantify the uncertainty of the LLM-powered prediction. The proposed approach significantly reduces short-term forecasting errors and more effectively captures abrupt inflationary shifts compared to traditional econometric models. While demonstrated for regional CPI forecasting, the proposed framework is broadly applicable for incorporating insights from LLMs to enhance traditional statistical modeling.

2604.06864 2026-04-09 stat.ML cs.LG

A Data-Informed Variational Clustering Framework for Noisy High-Dimensional Data

Wan Ping Chen

Clustering in high-dimensional settings with severe feature noise remains challenging, especially when only a small subset of dimensions is informative and the final number of clusters is not specified in advance. In such regimes, partition recovery, feature relevance learning, and structural adaptation are tightly coupled, and standard likelihood-based methods can become unstable or overly sensitive to noisy dimensions. We propose DIVI, a data-informed variational clustering framework that combines global feature gating with split-based adaptive structure growth. DIVI uses informative prior initialization to stabilize optimization, learns feature relevance in a differentiable manner, and expands model complexity only when local diagnostics indicate underfit. Beyond clustering performance, we also examine runtime scalability and parameter sensitivity in order to clarify the computational and practical behavior of the framework. Empirically, we find that DIVI performs competitively under severe feature noise, remains computationally feasible, and yields interpretable feature-gating behavior, while also exhibiting conservative growth and identifiable failure regimes in challenging settings. Overall, DIVI is best viewed as a practical variational clustering framework for noisy high-dimensional data rather than as a fully Bayesian generative solution.

2604.06701 2026-04-09 cs.LG stat.ML

Bi-Lipschitz Autoencoder With Injectivity Guarantee

Qipeng Zhan, Zhuoping Zhou, Zexuan Wang, Qi Long, Li Shen

Comments Accepted for publication at ICLR 2026, 27 Pages, 15 Figures

Autoencoders are widely used for dimensionality reduction, based on the assumption that high-dimensional data lies on low-dimensional manifolds. Regularized autoencoders aim to preserve manifold geometry during dimensionality reduction, but existing approaches often suffer from non-injective mappings and overly rigid constraints that limit their effectiveness and robustness. In this work, we identify encoder non-injectivity as a core bottleneck that leads to poor convergence and distorted latent representations. To ensure robustness across data distributions, we formalize the concept of admissible regularization and provide sufficient conditions for its satisfaction. We then propose the Bi-Lipschitz Autoencoder (BLAE), which introduces two key innovations: (1) an injective regularization scheme based on a separation criterion to eliminate pathological local minima, and (2) a bi-Lipschitz relaxation that preserves geometry and exhibits robustness to data distribution drift. Empirical results on diverse datasets show that BLAE consistently outperforms existing methods in preserving manifold structure while remaining resilient to sampling sparsity and distribution shifts. Code is available at https://github.com/qipengz/BLAE.

2604.06659 2026-04-09 stat.ME

Transfer Learning for Robust Structured Regression with Bi-level Source Detection

Haoming Shi, Yang Feng, Xiaoqian Liu

Comments 34 pages, 7 Figures

High-dimensional data in modern applications, such as COVID-19 mortality, often span multiple domains. Leveraging auxiliary information from source domains to improve performance in a target domain motivates the use of transfer learning. However, a practical issue that has been overlooked is data contamination, which induces heterogeneity and can significantly degrade transfer learning performance. To address this challenge, we propose a novel approach that tackles transfer learning under data contamination within a structured regression setting. By employing the robust L2E criterion, we develop the TransL2E method that accounts for contamination in both target and source data while effectively transferring relevant information. Beyond robust estimation, TransL2E introduces a data-driven bi-level source detection mechanism, operating at both individual and cohort levels, which possesses multiple advantages over existing source detection approaches. Comprehensive simulation studies and a real data application demonstrate the superior performance of TransL2E in both robust estimation and structure recovery in the presence of data limitation and contamination.

2604.06621 2026-04-09 cs.GL cs.LG stat.ML

The Theorems of Dr. David Blackwell and Their Contributions to Artificial Intelligence

Napoleon Paxton

Comments Survey article, 19 pages, 1 figure, 2 tables

Dr. David Blackwell was a mathematician and statistician of the first rank, whose contributions to statistical theory, game theory, and decision theory predated many of the algorithmic breakthroughs that define modern artificial intelligence. This survey examines three of his most consequential theoretical results: the Rao-Blackwell theorem, the Blackwell Approachability theorem, and the Blackwell Informativeness theorem (comparison of experiments), and traces their direct influence on contemporary AI and machine learning. We show that these results, developed primarily in the 1940s and 1950s, remain technically live across modern subfields including Markov Chain Monte Carlo inference, autonomous mobile robot navigation (SLAM), generative model training, no-regret online learning, reinforcement learning from human feedback (RLHF), large language model alignment, and information design. NVIDIA's 2024 decision to name their flagship GPU architecture (Blackwell) provides vivid testament to his enduring relevance. We also document an emerging frontier: explicit Rao-Blackwellized variance reduction in LLM RLHF pipelines, recently proposed but not yet standard practice. Together, Blackwell's theorems form a unified framework addressing information compression, sequential decision making under uncertainty, and the comparison of information sources, which are precisely the problems at the core of modern AI.

2604.06548 2026-04-09 cs.CE stat.AP

A Rolling-Horizon Stochastic Optimization Framework for NBA Franchise Management with Distributionally Robust Risk Constraints

Siming Zhang, Zhehui Shen, Shijie Chen, Jian Zhou

Comments 27 pages, 12 figures

NBA franchise management is not a sequence of independent tasks, but a single dynamic control problem in which roster construction, cash-flow discipline, media strategy, external market shocks, and player-health uncertainty interact over time. Using the New York Knicks as a case study, this paper develops a unified decision architecture for franchise management under competitive, financial, and regulatory constraints. The core layer is formulated as a rolling-horizon stochastic mixed-integer program augmented with distributionally robust optimization and conditional value-at-risk constraints, so that long-run franchise value can be optimized while downside exposure remains explicitly controlled. On top of this core layer, we construct coordinated modules for transaction execution, league-expansion shock transmission, media-rights regime transition, and injury-triggered re-optimization. This integrated design reframes multiple managerial mechanisms inside one research problem: how should an NBA franchise allocate resources and update decisions when performance objectives and commercial objectives are jointly determined under uncertainty? The manuscript is organized around problem formulation, model architecture, empirical validation, robustness analysis, and managerial interpretation.
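
To make the conditional value-at-risk (CVaR) constraint mentioned above more concrete, the following sketch computes a simple empirical CVaR as the average of the worst scenarios and checks it against a budget. It uses the plain empirical distribution, not the distributionally robust version in the paper, and the function names, scenario losses, and budget are all hypothetical.

```python
def empirical_cvar(losses, alpha=0.95):
    """Average of the worst (1 - alpha) fraction of scenario losses.

    Simple tail-average estimator; at least one scenario is always kept.
    """
    ordered = sorted(losses)
    n_tail = max(1, round((1 - alpha) * len(ordered)))
    tail = ordered[-n_tail:]
    return sum(tail) / len(tail)

def satisfies_risk_budget(losses, budget, alpha=0.95):
    """Check a constraint of the form CVaR_alpha(loss) <= budget."""
    return empirical_cvar(losses, alpha) <= budget

# Hypothetical scenario losses (arbitrary units).
losses = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
cvar_80 = empirical_cvar(losses, alpha=0.8)  # mean of the two worst losses
print(cvar_80, satisfies_risk_budget(losses, budget=9.0, alpha=0.8))
```

In an optimization model the same quantity would be expressed through auxiliary variables and linear constraints rather than by sorting, but the tail-average interpretation is the same.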

2604.06499 2026-04-09 stat.AP stat.ML

Equivalence Testing Under Privacy Constraints

Savita Pareek, Luca Insolia, Roberto Molinari, Stéphane Guerrier

Protecting individual privacy is essential across research domains, from socio-economic surveys to big-tech user data. This need is particularly acute in healthcare, where analyses often involve sensitive patient information. A typical example is comparing treatment efficacy across hospitals or ensuring consistency in diagnostic laboratory calibrations, both requiring privacy-preserving statistical procedures. However, standard equivalence testing procedures for differences in proportions or means, commonly used to assess average equivalence, can inadvertently disclose sensitive information. To address this problem, we develop differentially private equivalence testing procedures that rely on simulation-based calibration, as the finite-sample distribution is analytically intractable. Our approach introduces a unified framework, termed DP-TOST, for conducting differentially private equivalence testing of both means and proportions. Through numerical simulations and real-world applications, we demonstrate that the proposed method maintains type-I error control at the nominal level and achieves power comparable to its non-private counterpart as the privacy budget and/or sample size increases, while ensuring strong privacy guarantees. These findings establish a reliable and practical framework for privacy-preserving equivalence testing in high-stakes fields such as healthcare, among others.
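
As background, the sketch below implements the classical, non-private two one-sided tests (TOST) procedure for the equivalence of two means under a large-sample normal approximation. It is not DP-TOST: no privacy noise is added and no simulation-based calibration is performed. The function name, the data, and the equivalence margin are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def tost_means(x, y, margin, alpha=0.05):
    """Non-private TOST for equivalence of two means (normal approximation).

    Equivalence is declared when both one-sided tests reject, i.e. when the
    data support -margin < mean(x) - mean(y) < margin.
    """
    diff = mean(x) - mean(y)
    se = sqrt(stdev(x) ** 2 / len(x) + stdev(y) ** 2 / len(y))
    z_lower = (diff + margin) / se  # tests H0: diff <= -margin
    z_upper = (diff - margin) / se  # tests H0: diff >= +margin
    p_lower = 1 - NormalDist().cdf(z_lower)
    p_upper = NormalDist().cdf(z_upper)
    p_value = max(p_lower, p_upper)  # both one-sided tests must reject
    return p_value, p_value < alpha

# Hypothetical measurements from two sites, both centred near 10.
x = [10.0 + 0.1 * ((i * 7) % 11 - 5) for i in range(40)]
y = [10.0 + 0.1 * ((i * 3) % 11 - 5) for i in range(40)]
p_wide, equivalent_wide = tost_means(x, y, margin=0.5)
p_tight, equivalent_tight = tost_means(x, y, margin=0.01)
print(equivalent_wide, equivalent_tight)
```

The margin drives the conclusion: the same data support equivalence for a generous margin but not for a very tight one.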

2604.06492 2026-04-09 cs.LG cs.CR stat.ML

Optimal Rates for Pure $\varepsilon$-Differentially Private Stochastic Convex Optimization with Heavy Tails

Andrew Lowy

We study stochastic convex optimization (SCO) with heavy-tailed gradients under pure epsilon-differential privacy (DP). Instead of assuming a bound on the worst-case Lipschitz parameter of the loss, we assume only a bounded k-th moment. This assumption allows for unbounded, heavy-tailed stochastic gradient distributions, and can yield sharper excess risk bounds. The minimax optimal rate for approximate (epsilon, delta)-DP SCO is known in this setting, but the pure epsilon-DP case has remained open. We characterize the minimax optimal excess-risk rate for pure epsilon-DP heavy-tailed SCO up to logarithmic factors. Our algorithm achieves this rate in polynomial time with high probability. Moreover, it runs in polynomial time with probability 1 when the worst-case Lipschitz parameter is polynomially bounded. For important structured problem classes - including hinge/ReLU-type and absolute-value losses on Euclidean balls, ellipsoids, and polytopes - we achieve the same excess-risk guarantee in polynomial time with probability 1 even when the worst-case Lipschitz parameter is infinite. Our approach is based on a novel framework for privately optimizing Lipschitz extensions of the empirical loss. We complement our excess risk upper bound with a novel high probability lower bound.

2604.06464 2026-04-09 cs.LG physics.app-ph stat.ML

Weighted Bayesian Conformal Prediction

Xiayin Lou, Peng Luo

Conformal prediction provides distribution-free prediction intervals with finite-sample coverage guarantees, and recent work by Snell & Griffiths reframes it as Bayesian Quadrature (BQ-CP), yielding powerful data-conditional guarantees via Dirichlet posteriors over thresholds. However, BQ-CP fundamentally requires the i.i.d. assumption -- a limitation the authors themselves identify. Meanwhile, weighted conformal prediction handles distribution shift via importance weights but remains frequentist, producing only point-estimate thresholds. We propose Weighted Bayesian Conformal Prediction (WBCP), which generalizes BQ-CP to arbitrary importance-weighted settings by replacing the uniform Dirichlet $\mathrm{Dir}(1,\ldots,1)$ with a weighted Dirichlet $\mathrm{Dir}(n_{\mathrm{eff}} \cdot \tilde{w}_1, \ldots, n_{\mathrm{eff}} \cdot \tilde{w}_n)$, where $n_{\mathrm{eff}}$ is Kish's effective sample size. We prove four theoretical results: (1) $n_{\mathrm{eff}}$ is the unique concentration parameter matching frequentist and Bayesian variances; (2) posterior standard deviation decays as $O(1/\sqrt{n_{\mathrm{eff}}})$; (3) BQ-CP's stochastic dominance guarantee extends to per-weight-profile data-conditional guarantees; (4) the HPD threshold provides $O(1/\sqrt{n_{\mathrm{eff}}})$ improvement in conditional coverage. We instantiate WBCP for spatial prediction as Geographical BQ-CP, where kernel-based spatial weights yield per-location posteriors with interpretable diagnostics. Experiments on synthetic and real-world spatial datasets demonstrate that WBCP maintains coverage guarantees while providing substantially richer uncertainty information.
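
The weighted Dirichlet parameters described above can be sketched in a few lines. The code computes Kish's effective sample size, (sum of weights)^2 / (sum of squared weights), and scales the normalized weights by it. The function names are hypothetical, and normalizing the weights to sum to one is an assumption about how the tilde-weights are defined.

```python
def kish_effective_sample_size(weights):
    """Kish's effective sample size: (sum w)^2 / sum(w^2)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)

def weighted_dirichlet_concentration(weights):
    """Concentration parameters n_eff * w_i / sum(w) for each calibration point."""
    n_eff = kish_effective_sample_size(weights)
    total = sum(weights)
    return [n_eff * w / total for w in weights]

# Uniform weights recover the unweighted case, one unit of mass per point.
uniform = weighted_dirichlet_concentration([1.0, 1.0, 1.0, 1.0])
# Unequal weights shrink the effective sample size below n.
skewed_n_eff = kish_effective_sample_size([3.0, 1.0])
skewed = weighted_dirichlet_concentration([3.0, 1.0])
print(uniform, skewed_n_eff, skewed)
```

With equal weights the effective sample size equals the number of points and every parameter equals 1, which matches the uniform Dirichlet the abstract starts from.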

2604.06445 2026-04-09 stat.ME

From Simple to Composite Perturbations: A Unified Decomposition Framework for Stochastic Block Models

Jianwei Hu, Ding Chen, Ji Zhu

Statistical inference for stochastic block models typically relies on the spectrum of the normalized adjacency matrix $\mathbf{A}^*$. In practice, the true probability matrix $\mathbf{B}$ is unknown and must be replaced by a plug-in estimator $\hat{\mathbf{B}}$. This substitution introduces two distinct types of estimation error: a simple perturbation $\boldsymbol{\Delta}$, arising when $\hat{\mathbf{B}}$ replaces $\mathbf{B}$ only in the numerator, and a composite perturbation $\tilde{\boldsymbol{\Delta}}$, arising when the replacement occurs in both the numerator and the denominator. Under both perturbation regimes, we decompose the total sum of squares into three components and conduct a detailed analysis of their asymptotic properties. This reveals a key, and perhaps surprising, distinction between simple and composite perturbations: the cross term $\operatorname{tr}(\mathbf{A}^*\boldsymbol{\Delta})$ is asymptotically negligible, whereas its composite counterpart $\operatorname{tr}(\mathbf{A}^*\tilde{\boldsymbol{\Delta}})$ is not. Motivated by this, we develop a unified decomposition framework, expressing the composite perturbation matrix as $\tilde{\boldsymbol{\Delta}}=\check{\mathbf{A}}+\boldsymbol{\Delta}+\check{\boldsymbol{\Delta}}$, where $\check{\mathbf{A}}$ is a bias matrix of the normalized adjacency matrix, $\boldsymbol{\Delta}$ is the simple perturbation, and $\check{\boldsymbol{\Delta}}$ is a bias matrix of $\boldsymbol{\Delta}$. This structured decomposition allows us to precisely isolate and control each source of error, leading to a refined limiting theory for two key classes of test statistics. Concretely, for the largest eigenvalue statistic, we improve the existing condition from $K=O(n^{1/6-\tau})$ to the optimal rate $K=o(n^{1/6})$ under both simple and composite perturbations. For the linear spectral statistic, our unified decomposition framework provides the necessary structure to systematically control these errors term by term, leading to a complete and rigorous proof of asymptotic normality.

2604.06438 2026-04-09 stat.AP cs.LG

Learning Debt and Cost-Sensitive Bayesian Retraining: A Forecasting Operations Framework

Harrison Katz

Forecasters often choose retraining schedules by convention rather than by an explicit decision rule. This paper gives that decision a posterior-space language. We define learning debt as the divergence between the deployed and continuously updated posteriors, define actionable staleness as the policy-relevant latent state, and derive a one-step Bayes retraining rule under an excess-loss formulation. In an online conjugate simulation using the exact Kullback-Leibler divergence between deployed and shadow normal-inverse-gamma posteriors, a debt-filter beats a default 10-period calendar baseline in 15 of 24 abrupt-shift cells, all 24 gradual-drift cells, and 17 of 24 variance-shift cells, and remains below the best fixed cadence in a grid of cadences (5, 10, 20, and 40 periods) in 10, 24, and 17 cells, respectively. Fixed-threshold CUSUM remains a strong benchmark, while a proxy filter built from indirect diagnostics performs poorly. A retrospective Airbnb production backtest shows how the same decision logic behaves around a known payment-policy shock.
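
The idea of measuring learning debt as a divergence between two posteriors can be illustrated with a deliberately simplified stand-in. The sketch below uses the closed-form Kullback-Leibler divergence between two univariate normal distributions instead of the normal-inverse-gamma posteriors in the paper, and a plain threshold instead of the paper's Bayes retraining rule. The direction of the divergence (shadow relative to deployed), the function names, and the threshold are assumptions made for this example.

```python
from math import log

def kl_normal(mean_p, sd_p, mean_q, sd_q):
    """KL( N(mean_p, sd_p^2) || N(mean_q, sd_q^2) ) for univariate normals."""
    return (log(sd_q / sd_p)
            + (sd_p ** 2 + (mean_p - mean_q) ** 2) / (2 * sd_q ** 2)
            - 0.5)

def should_retrain(deployed, shadow, threshold):
    """Flag retraining when the debt exceeds a fixed threshold.

    deployed, shadow: (mean, sd) summaries of the two posteriors.
    Assumption: debt is KL(shadow || deployed); the paper may define it
    differently.
    """
    debt = kl_normal(*shadow, *deployed)
    return debt > threshold, debt

no_drift = kl_normal(0.0, 1.0, 0.0, 1.0)
retrain, debt = should_retrain(deployed=(0.0, 1.0), shadow=(1.0, 1.0), threshold=0.25)
print(no_drift, retrain, debt)
```

A cost-sensitive rule would additionally weigh the debt against the cost of retraining, which this threshold check leaves out.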

2604.06417 2026-04-09 stat.CO

Niching Importance Sampling for Multi-modal Rare-event Simulation

Hugh J. Kinnear, F. A. DiazDelaO

This paper proposes niching importance sampling, a framework that combines concepts from reliability analysis, e.g. Markov chains, importance sampling, and relative cross entropy minimisation, with niching techniques from evolutionary multi-modal optimisation. The result is a highly robust estimator of the probability of failure that can tackle sampling challenges posed by the underlying geometry of a reliability problem. Niching importance sampling is tested on a range of numerical examples and is shown to consistently avoid the degenerate behaviour observed for existing reliability methods on several multi-modal performance functions.
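
For context, the following sketch shows ordinary importance sampling on a single-mode rare event: estimating P(X > 4) for a standard normal X by sampling from a normal proposal centred at the threshold. It does not include the niching step or any of the paper's machinery; the function name, proposal, and sample size are illustrative assumptions.

```python
import math
import random

def tail_probability_is(threshold, n_samples=20000, seed=0):
    """Estimate P(X > threshold), X ~ N(0, 1), with proposal N(threshold, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(threshold, 1.0)  # draw from the shifted proposal
        if x > threshold:
            # Likelihood ratio phi(x) / phi(x - threshold).
            total += math.exp(-threshold * x + 0.5 * threshold ** 2)
    return total / n_samples

estimate = tail_probability_is(4.0)
exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))  # closed-form normal tail
print(estimate, exact)
```

A single shifted proposal works here because the failure region has one dominant mode; with several separated failure regions such a proposal can miss some of them, which is the situation the abstract targets.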

2604.06407 2026-04-09 stat.ME

Dealing with positivity violations in mediation analysis via weighted controlled effects, with application to assessing immune correlates of protection in antigen-experienced participants

Qijia He, Bo Zhang

Causal mediation analysis has become an important and increasingly used framework for evaluating candidate immune response biomarkers in vaccine research. A controlled effects approach has been proposed to estimate controlled risk curves under a counterfactual scenario in which the entire study population is vaccinated and their post-vaccination immune responses are set to a range of fixed levels. This framework performs well when the study population is antigenically naïve, that is, individuals have not been previously exposed to the antigen, as is common in HIV-1 vaccine research and during the early phases of the COVID-19 pandemic. However, the controlled effects framework becomes more challenging to apply in antigen-experienced populations, where prior vaccination or infection has occurred, as in the case of influenza, dengue, and more recent phases of the COVID-19 pandemic. In such settings, a key identification assumption for valid causal mediation analysis, the positivity assumption, is violated: it is no longer plausible to conceive of a hypothetical intervention that sets a post-vaccination immune marker to a fixed level below an individual's baseline immune level. In this article, we introduce a weighted controlled risk approach that targets a subpopulation for whom there is a prespecified probability of attaining a post-vaccination immune marker level. We further generalize this framework to study contrasts of controlled risks for relevant subpopulations. We demonstrate the validity of the proposed estimators through simulation studies and apply the method to reanalyze post-vaccination neutralizing antibody titers against Omicron BA.4/BA.5 as an immune correlate of COVID-19 in the Coronavirus Variant Immunologic Landscape (COVAIL) trial. R code to implement the proposed method can be found on Github: https://github.com/Qijia-He/weighted_CVE.

2604.06395 2026-04-09 cs.LG q-bio.NC stat.ML

Bridging Theory and Practice in Crafting Robust Spiking Reservoirs

Ruggero Freddi, Nicolas Seseri, Diana Nigrisoli, Alessio Basti

Spiking reservoir computing provides an energy-efficient approach to temporal processing, but reliably tuning reservoirs to operate at the edge-of-chaos is challenging due to experimental uncertainty. This work bridges abstract notions of criticality and practical stability by introducing and exploiting the robustness interval, an operational measure of the hyperparameter range over which a reservoir maintains performance above task-dependent thresholds. Through systematic evaluations of Leaky Integrate-and-Fire (LIF) architectures on both static (MNIST) and temporal (synthetic Ball Trajectories) tasks, we identify consistent monotonic trends in the robustness interval across a broad spectrum of network configurations: the robustness-interval width decreases with presynaptic connection density $β$ (i.e., directly with sparsity) and directly with the firing threshold $θ$. We further identify specific $(β, θ)$ pairs that preserve the analytical mean-field critical point $w_{\text{crit}}$, revealing iso-performance manifolds in the hyperparameter space. Control experiments on Erdős-Rényi graphs show the phenomena persist beyond small-world topologies. Finally, our results show that $w_{\text{crit}}$ consistently falls within empirical high-performance regions, validating $w_{\text{crit}}$ as a robust starting coordinate for parameter search and fine-tuning. To ensure reproducibility, the full Python code is publicly available.

2604.06394 2026-04-09 stat.ME

Depth-Based Vector Median Absolute Deviation Moments for Robust Multivariate Shape Analysis

Elsayed Elamir

Comments 14 pages, 3 figures

Classical multivariate shape analysis relies on covariance-standardized moments, such as Mardia skewness and kurtosis, which are sensitive to outliers and require finite moments. This paper introduces vector median absolute deviation (VMedAD) moments for robust multivariate shape analysis. The proposed framework replaces moment aggregation and covariance standardization with median-based center-outward contrasts defined through data depth, yielding affine equivariance and moment-free vector moments. VMedAD moments provide direction-preserving measures of multivariate skewness and directional peripheral dominance, separating central structure from tail-driven behavior. Consistency, breakdown properties, and affine equivariance are established, and simulation and real dataset examples demonstrate improved robustness and geometric interpretability over classical and projection-based methods.

2604.06366 2026-04-09 cs.LG stat.ML

Stochastic Gradient Descent in the Saddle-to-Saddle Regime of Deep Linear Networks

Guillaume Corlouer, Avi Semler, Alexander Strang, Alexander Gietelink Oldenziel

Deep linear networks (DLNs) are used as an analytically tractable model of the training dynamics of deep neural networks. While gradient descent in DLNs is known to exhibit saddle-to-saddle dynamics, the impact of stochastic gradient descent (SGD) noise on this regime remains poorly understood. We investigate the dynamics of SGD during training of DLNs in the saddle-to-saddle regime. We model the training dynamics as stochastic Langevin dynamics with anisotropic, state-dependent noise. Under the assumption of aligned and balanced weights, we derive an exact decomposition of the dynamics into a system of one-dimensional per-mode stochastic differential equations. This establishes that the maximal diffusion along a mode precedes the corresponding feature being completely learned. We also derive the stationary distribution of SGD for each mode: in the absence of label noise, its marginal distribution along specific features coincides with the stationary distribution of gradient flow, while in the presence of label noise it approximates a Boltzmann distribution. Finally, we confirm experimentally that the theoretical results hold qualitatively even without aligned or balanced weights. These results establish that SGD noise encodes information about the progression of feature learning but does not fundamentally alter the saddle-to-saddle dynamics.

2604.06282 2026-04-09 stat.ML cs.LG

Tight Convergence Rates for Online Distributed Linear Estimation with Adversarial Measurements

Nibedita Roy, Vishal Halder, Gugan Thoppe, Alexandre Reiffers-Masson, Mihir Dhanakshirur, Naman, Alexandre Azor

Comments Preprint

We study mean estimation of a random vector $X$ in a distributed parameter-server-worker setup. Worker $i$ observes samples of $a_i^\top X$, where $a_i^\top$ is the $i$th row of a known sensing matrix $A$. The key challenges are adversarial measurements and asynchrony: a fixed subset of workers may transmit corrupted measurements, and workers are activated asynchronously--only one is active at any time. In our previous work, we proposed a two-timescale $\ell_1$-minimization algorithm and established asymptotic recovery under a null-space-property-like condition on $A$. In this work, we establish tight non-asymptotic convergence rates under the same null-space-property-like condition. We also identify relaxed conditions on $A$ under which exact recovery may fail but recovery of a projected component of $\mathbb{E}[X]$ remains possible. Overall, our results provide a unified finite-time characterization of robustness, identifiability, and statistical efficiency in distributed linear estimation with adversarial workers, with implications for network tomography and related distributed sensing problems.

2604.06281 2026-04-09 stat.ML math.PR

Generalization error bounds for two-layer neural networks with Lipschitz loss function

Jiang Yu Nguwi, Nicolas Privault

We derive generalization error bounds for the training of two-layer neural networks without assuming boundedness of the loss function, using Wasserstein distance estimates on the discrepancy between a probability distribution and its associated empirical measure, together with moment bounds for the associated stochastic gradient method. In the case of independent test data, we obtain a dimension-free rate of order $O(n^{-1/2} )$ on the $n$-sample generalization error, whereas without independence assumption, we derive a bound of order $O(n^{-1 / ( d_{\rm in}+d_{\rm out} )} )$, where $d_{\rm in}$, $d_{\rm out}$ denote input and output dimensions. Our bounds and their coefficients can be explicitly computed prior to the training of the model, and are confirmed by numerical simulations.

2604.06251 2026-04-09 cs.AI cs.LG stat.AP

Toward Reducing Unproductive Container Moves: Predicting Service Requirements and Dwell Times

Elena Villalobos, Adolfo De Unánue T., Fernanda Sobrino, David Aké, Stephany Cisneros, Jorge Lecona, Alejandra Matadamaz

Comments Preprint, 20 pages, 9 figures, 5 tables (including appendices)

This article presents the results of a data science study conducted at a container terminal, aimed at reducing unproductive container moves through the prediction of service requirements and container dwell times. We develop and evaluate machine learning models that leverage historical operational data to anticipate which containers will require pre-clearance handling services prior to cargo release and to estimate how long they are expected to remain in the terminal. As part of the data preparation process, we implement a classification system for cargo descriptions and perform deduplication of consignee records to improve data consistency and feature quality. These predictive capabilities provide valuable inputs for strategic planning and resource allocation in yard operations. Across multiple temporal validation periods, the proposed models consistently outperform existing rule-based heuristics and random baselines in precision and recall. These results demonstrate the practical value of predictive analytics for improving operational efficiency and supporting data-driven decision-making in container terminal logistics.

2604.04868 2026-04-09 cs.LG cs.AI stat.ML

Noise Immunity in In-Context Tabular Learning: An Empirical Robustness Analysis of TabPFN's Attention Mechanisms

James Hu, Mahdi Ghelichi

Abstract

Tabular foundation models (TFMs) such as TabPFN (Tabular Prior-Data Fitted Network) are designed to generalize across heterogeneous tabular datasets through in-context learning (ICL). They perform prediction in a single forward pass conditioned on labeled examples without dataset-specific parameter updates. This paradigm is particularly attractive in industrial domains (e.g., finance and healthcare) where tabular prediction is pervasive. Retraining a bespoke model for each new table can be costly or infeasible in these settings, while data quality issues such as irrelevant predictors, correlated feature groups, and label noise are common. In this paper, we provide strong empirical evidence that TabPFN is highly robust under these sub-optimal conditions. We study TabPFN and its attention mechanisms for binary classification problems with controlled synthetic perturbations that vary: (i) dataset width by injecting random uncorrelated features and by introducing nonlinearly correlated features, (ii) dataset size by increasing the number of training rows, and (iii) label quality by increasing the fraction of mislabeled targets. Beyond predictive performance, we analyze internal signals including attention concentration and attention-based feature ranking metrics. Across these parametric tests, TabPFN is remarkably resilient: ROC-AUC remains high, attention stays structured and sharp, and informative features are highly ranked by attention-based metrics. Qualitative visualizations with attention heatmaps, feature-token embeddings, and SHAP plots further support a consistent pattern across layers in which TabPFN increasingly concentrates on useful features while separating their signals from noise. Together, these findings suggest that TabPFN is a robust TFM capable of maintaining both predictive performance and coherent internal behavior under various scenarios of data imperfections.

2603.11090 2026-04-09 cs.LG stat.ME

Interventional Time Series Priors for Causal Foundation Models

Dennis Thumm, Ying Chen

Comments ICLR 2026 1st Workshop on Time Series in the Age of Large Models (TSALM)

Abstract

Prior-data fitted networks (PFNs) have emerged as powerful foundation models for tabular causal inference, yet their extension to time series remains limited by the absence of synthetic data generators that provide interventional targets. Existing time series benchmarks generate observational data with ground-truth causal graphs but lack the interventional data required for training causal foundation models. To address this, we propose CausalTimePrior, a principled framework for generating synthetic temporal structural causal models (TSCMs) with paired observational and interventional time series. Our prior supports configurable causal graph structures, nonlinear autoregressive mechanisms, regime-switching dynamics, and multiple intervention types (hard, soft, time-varying). We demonstrate that PFNs trained on CausalTimePrior can perform in-context causal effect estimation on held-out TSCMs, establishing a pathway toward foundation models for time series causal inference.

2603.06257 2026-04-09 stat.ML cs.LG

Robust support vector model based on bounded asymmetric elastic net loss for binary classification

Haiyan Du, Hu Yang

Comments Upon re-examination, we found fundamental flaws in the BAEN-SVM model that undermine our conclusions. The design inadequately addresses geometrical rationality on slack variables, questioning generalizability. Thus, we retract this manuscript. We are exploring a different model and will resubmit after thorough validation. We apologize for any confusion

Abstract

In this paper, we propose a novel bounded asymmetric elastic net ($L_{baen}$) loss function and combine it with the support vector machine (SVM), resulting in the BAEN-SVM. The $L_{baen}$ loss is bounded and asymmetric and can reduce to the asymmetric elastic net hinge loss, pinball loss, and asymmetric least squares loss. BAEN-SVM not only effectively handles noise-contaminated data but also addresses the geometric irrationalities in the traditional SVM. By proving the violation tolerance upper bound (VTUB) of BAEN-SVM, we show that the model is geometrically well-defined. Furthermore, we derive that the influence function of BAEN-SVM is bounded, providing a theoretical guarantee of its robustness to noise. The Fisher consistency of the model further ensures its generalization capability. Since the $L_{baen}$ loss is non-convex, we designed a clipping dual coordinate descent-based half-quadratic algorithm to solve the non-convex optimization problem efficiently. Experimental results on artificial and benchmark datasets indicate that the proposed method outperforms classical and advanced SVMs, particularly in noisy environments.

2602.15889 2026-04-09 stat.AP cs.AI cs.CL physics.ed-ph

Daily and Weekly Periodicity in Large Language Model Performance and Its Implications for Research

Paul Tschisgale, Peter Wulff

Comments The Supplementary Information can be found in the OSF repository cited in the Data Availability Statement

Abstract

Large language models (LLMs) are increasingly used in research as both tools and objects of study. Much of this work assumes that LLM performance under fixed conditions (identical model snapshot, hyperparameters, and prompt) is time-invariant, meaning that average output quality remains stable over time; otherwise, reliability and reproducibility would be compromised. To test the assumption of time invariance, we conducted a longitudinal study of GPT-4o's average performance under fixed conditions. The LLM was queried to solve the same physics task ten times every three hours over approximately three months. Spectral (Fourier) analysis of the resulting time series revealed substantial periodic variability, accounting for about 20% of total variance. The observed periodic patterns are consistent with interacting daily and weekly rhythms. These findings challenge the assumption of time invariance and carry important implications for research involving LLMs.
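The spectral step described in this abstract can be illustrated with a small periodogram. The following is a minimal sketch on synthetic data, not the study's data or code: the sampling grid (eight measurements per day, matching one every three hours), the amplitudes, and the noise level are all assumptions chosen for illustration.

```python
import cmath
import math
import random

SAMPLES_PER_DAY = 8  # assumption: one averaged score every three hours


def power_spectrum(series):
    """Return {k: power} for the positive DFT frequencies k/n of a mean-centred series."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    powers = {}
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        powers[k] = abs(coeff) ** 2 / n
    return powers


def dominant_periods_in_days(series, top=2):
    """Convert the `top` strongest frequency bins into periods measured in days."""
    n = len(series)
    powers = power_spectrum(series)
    strongest = sorted(powers, key=powers.get, reverse=True)[:top]
    return sorted(n / k / SAMPLES_PER_DAY for k in strongest)


# Synthetic stand-in for "average score per query round" over twelve weeks:
# a daily rhythm, a weaker weekly rhythm, and Gaussian noise (all hypothetical).
rng = random.Random(0)
n_samples = 12 * 7 * SAMPLES_PER_DAY
series = [
    0.6
    + 0.05 * math.sin(2 * math.pi * t / SAMPLES_PER_DAY)
    + 0.03 * math.sin(2 * math.pi * t / (7 * SAMPLES_PER_DAY))
    + rng.gauss(0.0, 0.02)
    for t in range(n_samples)
]
periods = dominant_periods_in_days(series)
print(periods)
```

The share of variance attributed to periodic components could then be estimated by comparing the power in the selected bins with the total power, though the paper's exact procedure is not given in the abstract.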

2512.10717 2026-04-09 stat.ME

Dynamic sparse graphs with overlapping communities

Xenia Miscouridou, Francesca Panero, Antreas Laos

Abstract

Dynamic community detection concerns inferring how community memberships evolve over time, including the emergence, persistence, merging, and dissolution of groups in temporal networks. We propose a Bayesian nonparametric model for time-evolving sparse networks, which captures power-law degree distributions and dynamically overlapping communities. The model is constructed from vectors of completely random measures coupled through a latent Markov process governing the evolution of node affiliations. This construction provides a flexible and interpretable approach to model dynamic communities, naturally generalizing existing overlapping block models to the sparse and scale-free regimes. We establish asymptotic results characterizing sparsity and degree heterogeneity over time, and develop an approximate inference procedure for recovering time-varying community trajectories. Applications to synthetic and real-world dynamic networks show that the model accurately uncovers evolving community structure and yields interpretable temporal patterns.

2512.01423 2026-04-09 stat.ME

Active Hypothesis Testing under Computational Budgets with Applications to GWAS and LLM

Qi Kuang, Bowen Gang, Yin Xia

Abstract

In large-scale hypothesis testing, computing exact $p$-values or $e$-values is often resource-intensive, creating a need for budget-aware inferential methods. We propose a general framework for active hypothesis testing that leverages inexpensive auxiliary statistics to allocate a global computational budget. For each hypothesis, our data-adaptive procedure probabilistically decides whether to compute the exact test statistic or a transformed proxy, guaranteeing a valid $p$-value or $e$-value while satisfying the exact budget constraint. Theoretical guarantees are established for our constructions, showing that the procedure achieves optimality for $e$-values and for $p$-values under independence, and admissibility for $p$-values under general dependence. Empirical results from simulations and two real-world applications, including a large-scale genome-wide association study (GWAS) and a clinical prediction task leveraging large language models (LLM), demonstrate that our framework improves statistical efficiency under fixed resource limits.

2511.05834 2026-04-09 stat.OT

Impacts of Data Splitting Strategies on Parameterized Link Prediction Algorithms

Xinshan Jiao, Yuxin Luo, Yilin Bi, Tao Zhou

Comments 18 pages, 3 figures. Published in Physica A (2026)

Journal ref
Physica A: Statistical Mechanics and its Applications, 692 (2026), 131545
Abstract

Link prediction is a fundamental problem in network science, aiming to infer potential or missing links based on observed network structures. With the increasing adoption of parameterized models, the rigor of evaluation protocols has become critically important. However, a previously common practice of using the test set during hyperparameter tuning has led to human-induced information leakage, thereby inflating the reported model performance. To address this issue, this study introduces a novel evaluation metric, Loss Ratio, which quantitatively measures the extent of performance overestimation. We conduct large-scale experiments on 60 real-world networks across six domains. The results demonstrate that the information leakage leads to an average overestimation of about 3.6%, with the bias reaching over 15% for specific algorithms. Meanwhile, heuristic and random-walk-based methods exhibit greater robustness and stability. The analysis uncovers a pervasive information leakage issue in link prediction evaluation and underscores the necessity of adopting standardized data splitting strategies to enable fair and reproducible benchmarking of link prediction models.

2511.01028 2026-04-09 quant-ph math-ph math.MP math.ST stat.TH

Pseudo quantum advantages in perceptron storage capacity

Fabio Benatti, Masoud Gharahi, Giovanni Gramegna, Stefano Mancini, Vincenzo Parisi

Comments 24 pages, 1 figure; minor changes, typos corrected

Journal ref
J. Phys. A: Math. Theor. 59 145203 (2026)
Abstract

We investigate a generalized quantum perceptron architecture characterized by an oscillating activation function with a tunable frequency ranging from zero to infinity. Employing analytical techniques from statistical mechanics, we derive the optimal storage capacity and demonstrate that the classical result is recovered in the limit of vanishing frequency. As the frequency increases, however, the architecture exhibits enhanced quantum storage capabilities. Notably, this improvement stems solely from the specific form of the activation function and, in principle, could be emulated within a classical framework. Accordingly, we refer to this enhancement as a pseudo quantum advantage.

2510.11169 2026-04-09 stat.ML cs.LG

PAC-Bayesian Bounds on Constrained f-Entropic Risk Measures

Hind Atbir, Farah Cherfaoui, Guillaume Metzler, Emilie Morvant, Paul Viallard

Comments Accepted at the 29th International Conference on Artificial Intelligence and Statistics (AISTATS 2026)

Abstract

PAC generalization bounds on the risk, when expressed in terms of the expected loss, are often insufficient to capture imbalances between subgroups in the data. To overcome this limitation, we introduce a new family of risk measures, called constrained f-entropic risk measures, which enable finer control over distributional shifts and subgroup imbalances via f-divergences, and include the Conditional Value at Risk (CVaR), a well-known risk measure. We derive both classical and disintegrated PAC-Bayesian generalization bounds for this family of risks, providing the first disintegrated PAC-Bayesian guarantees beyond standard risks. Building on this theory, we design a self-bounding algorithm that minimizes our bounds directly, yielding models with guarantees at the subgroup level. Finally, we empirically demonstrate the usefulness of our approach.
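To make the contrast between expected loss and CVaR concrete, here is a minimal sketch of an empirical CVaR. It uses one common convention (the mean of the worst `tail_fraction` share of losses); the function name, the convention, and the toy losses are assumptions for illustration and are not taken from the paper.

```python
import math


def empirical_cvar(losses, tail_fraction):
    """Mean of the largest ceil(tail_fraction * n) losses (one common empirical convention)."""
    if not 0.0 < tail_fraction <= 1.0:
        raise ValueError("tail_fraction must lie in (0, 1]")
    ordered = sorted(losses, reverse=True)
    tail_size = max(1, math.ceil(tail_fraction * len(ordered)))
    return sum(ordered[:tail_size]) / tail_size


# Hypothetical per-example losses: a majority subgroup with small losses
# and a minority subgroup (one quarter of the data) with large losses.
losses = [0.1] * 6 + [0.9] * 2
mean_loss = sum(losses) / len(losses)
tail_loss = empirical_cvar(losses, tail_fraction=0.25)
print(mean_loss, tail_loss)
```

The average loss looks moderate while the tail average exposes the poorly served quarter of the data, which is the kind of imbalance the expected loss alone can hide.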

2510.08974 2026-04-09 stat.CO cs.NA math.NA

Bayesian Active Learning for Bayesian Model Updating: the Art of Acquisition Functions and Beyond

Jingwen Song, Pengfei Wei

Comments 47 pages, 15 figures, submitted to Elsevier journal

Journal ref
Mechanical Systems and Signal Processing 251 (2026) 114237
Abstract

Estimating posteriors and the associated model evidences, with desired accuracy and affordable computational cost, is a core issue of Bayesian model updating, and can be very challenging given expensive-to-evaluate models and posteriors with complex features such as multi-modalities of unequal importance, nonlinear dependencies and high sharpness. Bayesian Quadrature (BQ) equipped with active learning has emerged as a competitive framework for tackling this challenge, as it provides a flexible balance between computational cost and accuracy. The performance of a BQ scheme is fundamentally dictated by the acquisition function, as it exclusively governs the active generation of integration points. After reexamining one of the most advanced acquisition functions from a prospective inference perspective and reformulating the quadrature rules for prediction, four new acquisition functions, inspired by distinct intuitions on expected rewards, are primarily developed, all of which are accompanied by elegant interpretations and highly efficient numerical estimators. Mathematically, these four acquisition functions measure, respectively, the prediction uncertainty of posterior, the contribution to prediction uncertainty of evidence, as well as the expected reduction of prediction uncertainties concerning posterior and evidence, and thus provide flexibility for highly effective design of integration points. These acquisition functions are further extended to the transitional BQ scheme, along with several specific refinements, to tackle the above-mentioned challenges with high efficiency and robustness. Effectiveness of the developments is ultimately demonstrated with extensive benchmark studies and application to an engineering example.

2508.05423 2026-04-09 cs.LG stat.ML

Negative Binomial Variational Autoencoders for Overdispersed Latent Modeling

Yixuan Zhang, Jinhao Sheng, Wenxin Zhang, Quyu Kong, Feng Zhou

Abstract

Although artificial neural networks are often described as brain-inspired, their representations typically rely on continuous activations, such as the continuous latent variables in variational autoencoders (VAEs), which limits their biological plausibility compared to the discrete spike-based signaling in real neurons. Extensions like the Poisson VAE introduce discrete count-based latents, but their equal mean-variance assumption fails to capture overdispersion in neural spikes, leading to less expressive and informative representations. To address this, we propose NegBio-VAE, a negative-binomial latent-variable model with a dispersion parameter for flexible spike count modeling. NegBio-VAE preserves interpretability while improving representation quality and training feasibility via novel KL estimation and reparameterization. Experiments on four datasets demonstrate that NegBio-VAE consistently achieves superior reconstruction and generation performance compared to competing single-layer VAE baselines, and yields robust, informative latent representations for downstream tasks. Extensive ablation studies are performed to verify the model's robustness w.r.t. various components. Our code is available at https://github.com/co234/NegBio-VAE.
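The overdispersion argument can be checked numerically. The sketch below assumes the standard mean-dispersion parameterization of the negative binomial (variance `mean + mean**2 / dispersion`), sampled as a gamma-Poisson mixture; it is not the NegBio-VAE implementation, and all function names and parameter values are hypothetical.

```python
import math
import random


def nb_variance(mean, dispersion):
    """Variance of a negative binomial in the mean-dispersion parameterization."""
    return mean + mean ** 2 / dispersion


def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_negative_binomial(mean, dispersion, rng):
    """Gamma-Poisson mixture: draw a gamma-distributed rate, then a Poisson count."""
    rate = rng.gammavariate(dispersion, mean / dispersion)  # shape, scale
    return sample_poisson(rate, rng)


def fano_factor(samples):
    """Variance-to-mean ratio; equals 1 in expectation for Poisson counts."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((s - mean) ** 2 for s in samples) / (n - 1)
    return variance / mean


rng = random.Random(1)
poisson_counts = [sample_poisson(4.0, rng) for _ in range(20000)]
nb_counts = [sample_negative_binomial(4.0, 2.0, rng) for _ in range(20000)]
poisson_fano = fano_factor(poisson_counts)
nb_fano = fano_factor(nb_counts)
print(poisson_fano, nb_fano, nb_variance(4.0, 2.0))
```

With a mean of 4 and a dispersion of 2, the theoretical variance is 12, so the variance-to-mean ratio is about 3 rather than the value 1 that a Poisson latent would force.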

2507.18937 2026-04-09 physics.ao-ph cs.AI cs.LG stat.ML

CNN-based Surface Temperature Forecasts with Ensemble Numerical Weather Prediction

Takuya Inoue, Takuya Kawabata

Comments 48 pages, 14 figures

Abstract

Due to limited computational resources, medium-range temperature forecasts typically rely on low-resolution numerical weather prediction (NWP) models, which are prone to systematic and random errors. We propose a method that integrates a convolutional neural network (CNN) with an ensemble of low-resolution NWP models (40-km horizontal resolution) to produce high-resolution (5-km) surface temperature forecasts with lead times extending up to 5.5 days (132 h). First, CNN-based post-processing (bias correction and spatial downscaling) is applied to individual ensemble members to reduce systematic errors and perform downscaling, which improves the deterministic forecast accuracy. Second, this member-wise correction is applied to all 51 ensemble members to construct a new high-resolution ensemble forecasting system with an improved probabilistic reliability and spread-skill ratio that differs from the simple error reduction mechanism of ensemble averaging. Whereas averaging reduces forecast errors by smoothing spatial fields, our member-wise CNN correction reduces error from noise while maintaining forecast information at a level comparable to that of other high-resolution forecasts. Experimental results indicate that the proposed method provides a practical and scalable solution for improving medium-range temperature forecasts, which is particularly valuable for use in operational centers with limited computational resources.

2507.10303 2026-04-09 stat.ML cs.LG stat.CO stat.ME

MF-GLaM: A multifidelity stochastic emulator using generalized lambda models

K. Giannoukou, X. Zhu, S. Marelli, B. Sudret

Journal ref
Computer Methods in Applied Mechanics and Engineering, Volume 448, Part B, January 2026, 118498
Abstract

Stochastic simulators exhibit intrinsic stochasticity due to unobservable, uncontrollable, or unmodeled input variables, resulting in random outputs even at fixed input conditions. Such simulators are common across various scientific disciplines; however, emulating their entire conditional probability distribution is challenging, as it is a task traditional deterministic surrogate modeling techniques are not designed for. Additionally, accurately characterizing the response distribution can require prohibitively large datasets, especially for computationally expensive high-fidelity (HF) simulators. When lower-fidelity (LF) stochastic simulators are available, they can enhance limited HF information within a multifidelity surrogate modeling (MFSM) framework. While MFSM techniques are well-established for deterministic settings, constructing multifidelity emulators to predict the full conditional response distribution of stochastic simulators remains a challenge. In this paper, we propose multifidelity generalized lambda models (MF-GLaMs) to efficiently emulate the conditional response distribution of HF stochastic simulators by exploiting data from LF stochastic simulators. Our approach builds upon the generalized lambda model (GLaM), which represents the conditional distribution at each input by a flexible, four-parameter generalized lambda distribution. MF-GLaMs are non-intrusive, requiring no access to the internal stochasticity of the simulators nor multiple replications of the same input values. We demonstrate the efficacy of MF-GLaM through synthetic examples of increasing complexity and a realistic earthquake application. Results show that MF-GLaMs can achieve improved accuracy at the same cost as single-fidelity GLaMs, or comparable performance at significantly reduced cost.

2507.06580 2026-04-09 math.PR math.ST stat.TH

On the rate of convergence to the Boolean extreme value distribution under the von Mises condition

Yuki Ueda

Comments 15 pages. This version has been revised from the previous one (see Section 4.2). Accepted in IDAQP

Abstract

We investigate the rate of convergence toward the Boolean extreme value distribution, which is the universal limiting law for the normalized spectral maximum of Boolean independent and identically distributed positive operators, under the von Mises condition.

2503.02129 2026-04-09 cs.LG cs.AI math.ST stat.TH

Path Regularization: A Near-Complete and Optimal Nonasymptotic Generalization Theory for Multilayer Neural Networks and Double Descent Phenomenon

Hao Yu

Abstract

Path regularization has been shown to be a very effective regularization for training neural networks, leading to a better generalization property than common regularizations such as weight decay. We propose a first near-complete (as will be made explicit in the main text) nonasymptotic generalization theory for multilayer neural networks with path regularizations for general learning problems. In particular, it does not require the boundedness of the loss function, as is commonly assumed in the literature. Our theory goes beyond the bias-variance tradeoff and aligns with phenomena typically encountered in deep learning. It is therefore sharply different from other existing nonasymptotic generalization error bounds. More explicitly, we propose an explicit generalization error upper bound for multilayer neural networks with $σ(0)=0$ and sufficiently broad Lipschitz loss functions, without requiring the width, depth, or other hyperparameters of the neural network to approach infinity, a specific neural network architecture (e.g., sparsity, boundedness of some norms), a particular optimization algorithm, or boundedness of the loss function, while also taking approximation error into consideration. A key feature of our theory is that it also considers approximation errors. In particular, we solve an open problem proposed by Weinan E et al. regarding the approximation rates in generalized Barron spaces. Furthermore, we show the near-minimax optimality of our theory for regression problems with ReLU activations. Notably, our upper bound exhibits the famous double descent phenomenon for such networks, which is the most distinguished characteristic compared with other existing results. We argue that it is highly possible that our theory reveals the true underlying mechanism of the double descent phenomenon.

2501.10806 2026-04-09 math.OC cs.LG cs.SY eess.SY stat.ML

Non-Expansive Mappings in Two-Time-Scale Stochastic Approximation: Finite-Time Analysis

Siddharth Chandak

Comments Accepted for publication to SIAM Journal on Control and Optimization

Abstract

Two-time-scale stochastic approximation algorithms are iterative methods used in applications such as optimization, reinforcement learning, and control. Finite-time analysis of these algorithms has primarily focused on fixed point iterations where both time-scales have contractive mappings. In this work, we broaden the scope of such analyses by considering settings where the slower time-scale has a non-expansive mapping. For such algorithms, the slower time-scale can be viewed as a stochastic inexact Krasnoselskii-Mann iteration. We also study a variant where the faster time-scale has a projection step which leads to non-expansiveness in the slower time-scale. We show that the last-iterate mean square residual error for such algorithms decays at a rate $O(1/k^{1/4-ε})$, where $ε>0$ is arbitrarily small. We further establish almost sure convergence of iterates to the set of fixed points. We demonstrate the applicability of our framework by applying our results to minimax optimization, linear stochastic approximation, and Lagrangian optimization.
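For readers unfamiliar with the Krasnoselskii-Mann (KM) iteration mentioned here, the following is a minimal deterministic sketch. It deliberately omits the noise, inexactness, and second time-scale that the paper analyses; the quarter-turn rotation and the step size of 0.5 are assumptions chosen because the rotation is non-expansive but not contractive.

```python
import math


def rotate_quarter_turn(point):
    """A non-expansive (norm-preserving) map whose only fixed point is the origin."""
    x, y = point
    return (-y, x)


def km_step(point, mapping, step_size):
    """One Krasnoselskii-Mann update: average the current point with its image."""
    image = mapping(point)
    return tuple((1.0 - step_size) * p + step_size * q for p, q in zip(point, image))


start = (1.0, 1.0)

# Plain fixed-point iteration x <- T(x): the iterate circles the origin forever.
plain_point = start
for _ in range(50):
    plain_point = rotate_quarter_turn(plain_point)

# KM iteration x <- (1 - a) x + a T(x) with a = 0.5: the iterate spirals inward.
km_point = start
for _ in range(50):
    km_point = km_step(km_point, rotate_quarter_turn, step_size=0.5)

print(math.hypot(*plain_point), math.hypot(*km_point))
```

In this toy case each KM step halves the squared norm, whereas the plain iteration never approaches the fixed point, which is why averaging is needed once the mapping is only non-expansive.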

2411.11728 2026-04-09 stat.ME

Davis-Kahan Theorem in the two-to-infinity norm and its application to perfect clustering

Marianna Pensky

Comments 45 pages

Abstract

Many statistical applications, such as Principal Component Analysis, matrix completion, tensor regression and many others, rely on accurate estimation of leading eigenvectors of a matrix. The Davis-Kahan theorem is known to be instrumental for bounding above the distances between matrices $U$ and $\widehat{U}$ of population eigenvectors and their sample versions. While those distances can be measured in various metrics, recent developments have shown advantages of evaluating the deviation in the two-to-infinity norm. The purpose of this paper is to develop a toolbox for derivation of upper bounds for the distances between $U$ and $\widehat{U}$ in the two-to-infinity norm for a variety of possible scenarios. Although this problem has been studied by several authors, the difference between this paper and its predecessors is that the upper bounds are obtained under various sets of assumptions. The upper bounds are initially derived with no or mild probabilistic assumptions on the error, and are subsequently refined, when some generic probabilistic assumptions on the errors hold. The paper also provides rectification of the upper bounds in the cases of heavy-tailed or exponentially fast decaying errors. In addition, the paper suggests alternative methods for evaluation of $\widehat{U}$ and, therefore, enables one to compare the resulting accuracies. As an example of an application of the techniques in the paper, we derive sufficient conditions for perfect clustering in a generic setting, and then employ them in various scenarios.
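The two-to-infinity norm of a matrix is its largest row-wise Euclidean norm. The sketch below computes it alongside the Frobenius norm for a small hypothetical pair of eigenvector matrices. It compares the matrices directly and ignores the orthogonal alignment that such distances usually include, so it illustrates the norm only and is not the paper's procedure.

```python
import math


def two_to_infinity_norm(matrix):
    """Largest Euclidean norm among the rows of the matrix."""
    return max(math.sqrt(sum(v * v for v in row)) for row in matrix)


def frobenius_norm(matrix):
    """Square root of the sum of all squared entries."""
    return math.sqrt(sum(v * v for row in matrix for v in row))


def subtract(left, right):
    """Entrywise difference of two matrices given as lists of rows."""
    return [[a - b for a, b in zip(r1, r2)] for r1, r2 in zip(left, right)]


# Hypothetical 3 x 2 matrices with orthonormal columns.
U = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
U_hat = [[1.0, 0.0], [0.0, 0.8], [0.0, 0.6]]

difference = subtract(U_hat, U)
row_wise_error = two_to_infinity_norm(difference)
overall_error = frobenius_norm(difference)
print(row_wise_error, overall_error)
```

A small two-to-infinity error bounds the error of every row at once, which is what makes it useful for clustering arguments where each row corresponds to one node or observation.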

2411.10858 2026-04-09 stat.ME

Scalable Gaussian Process Regression Via Median Posterior Inference for Estimating Multi-Pollutant Mixture Health Effects

Aaron Sonabend, Jiangshan Zhang, Edgar Castro, Joel Schwartz, Brent A. Coull, Junwei Lu

Abstract

Humans are exposed to complex mixtures of environmental pollutants rather than single chemicals, necessitating methods to quantify the health effects of such mixtures. Research on environmental mixtures provides insights into realistic exposure scenarios, informing regulatory policies that better protect public health. However, statistical challenges, including complex correlations among pollutants and nonlinear multivariate exposure-response relationships, complicate such analyses. A popular Bayesian semi-parametric Gaussian process regression framework (Coull et al., 2015) addresses these challenges by modeling exposure-response functions with Gaussian processes and performing feature selection to manage high-dimensional exposures while accounting for confounders. Originally designed for small to moderate-sized cohort studies, this framework does not scale well to massive datasets. To address this, we propose a divide-and-conquer strategy, partitioning data, computing posterior distributions in parallel, and combining results using the generalized median. While we focus on Gaussian process models for environmental mixtures, the proposed distributed computing strategy is broadly applicable to other Bayesian models with computationally prohibitive full-sample Markov Chain Monte Carlo fitting. We provide theoretical guarantees for the convergence of the proposed posterior distributions to those derived from the full sample. We apply this method to estimate associations between a mixture of ambient air pollutants and ~650,000 birthweights recorded in Massachusetts during 2001-2012. Our results reveal negative associations between birthweight and traffic pollution markers, including elemental and organic carbon and PM2.5, and positive associations with ozone and vegetation greenness.

2410.02941 2026-04-09 stat.ME

Efficient collaborative learning of the average treatment effect

Sijia Li, Rui Duan

Comments 30 pages, 6 figures

Abstract

In response to the growing need for generating real-world evidence from multi-site collaborative studies, we introduce an efficient collaborative learning approach to evaluate the average treatment effect (ECO-ATE) in a multi-site setting under data sharing constraints. Specifically, ECO-ATE operates in a federated manner, using individual-level data from a user-defined target population and summary statistics from other source populations, to construct an efficient estimator for the average treatment effect on the target population of interest. Our federated approach does not require iterative communications between sites, making it particularly suitable for research consortia with limited resources for developing automated data-sharing infrastructures. Compared to existing data integration methods in causal inference, ECO-ATE allows distributional shifts in outcomes, treatments and baseline covariate distributions, and achieves the semiparametric efficiency bound under appropriate conditions. We conduct simulation studies to demonstrate the extent of efficiency gains achieved by incorporating additional data sources, as well as the robustness of our approach against varying levels of distributional shifts and overparameterization, compared to existing benchmarks. We apply ECO-ATE to a case study examining the effect of insulin vs. non-insulin treatments on heart failure for patients with type II diabetes using electronic health record data collected from the All of Us program.

2409.14590 2026-04-09 cs.LG cs.AI stat.ML

Explainable AI needs formalization

Stefan Haufe, Rick Wilming, Benedict Clark, Rustam Zhumagambetov, Ahcène Boubekki, Jörg Martin, Danny Panknin

Abstract

The field of "explainable artificial intelligence" (XAI) seemingly addresses the desire that decisions of machine learning systems should be human-understandable. However, in its current state, XAI itself needs scrutiny. Popular methods cannot reliably answer relevant questions about ML models, their training data, or test inputs, because they systematically attribute importance to input features that are independent of the prediction target. This limits the utility of XAI for diagnosing and correcting data and models, for scientific discovery, and for identifying intervention targets. The fundamental reason for this is that current XAI methods do not address well-defined problems and are not evaluated against targeted criteria of explanation correctness. Researchers should formally define the problems they intend to solve and design methods accordingly. This will lead to diverse use-case-dependent notions of explanation correctness and objective metrics of explanation performance that can be used to validate XAI algorithms.

2409.06490 2026-04-09 cs.CV stat.AP

UAVDB: Point-Guided Masks for UAV Detection and Segmentation

Yu-Hsi Chen

Comments 14 pages, 4 figures, 4 tables

Abstract

Accurate detection of Unmanned Aerial Vehicles (UAVs) is critical for surveillance, security, and airspace monitoring. However, existing datasets remain limited in scale, resolution, and the ability to capture objects across extreme size variations. To address these challenges, we present UAVDB, a benchmark dataset for UAV detection and segmentation, constructed via a point-guided weak supervision pipeline. We introduce Patch Intensity Convergence (PIC), a lightweight annotation method that converts trajectory points into bounding boxes, eliminating the need for manual labeling while preserving precise spatial localization. Building upon these annotations, we further generate segmentation masks using SAM2, enriching the dataset with multi-task labels. UAVDB consists of RGB frames from a fixed-camera multi-view video dataset, capturing UAVs across scales ranging from clearly visible objects to near single-pixel instances under diverse conditions. Quantitative results show that PIC combined with SAM2 outperforms existing annotation techniques in terms of IoU. Furthermore, we benchmark YOLO-based detectors on UAVDB, establishing baselines for future research.

2407.20162 2026-04-09 math.ST stat.TH

Non-standard boundary behaviour in two-component mixture models

Heather Battey, Peter McCullagh, Daniel Xiang

Abstract

Consider a binary mixture model of the form $F_θ = (1-θ)F_0 + θF_1$, where $F_0$ is standard Gaussian and $F_1$ is a completely specified heavy-tailed distribution with the same support. For a sample of $n$ independent and identically distributed values $X_i \sim F_θ$, the maximum likelihood estimator $\hat{θ}_n$ is asymptotically normal provided that $0 < θ < 1$ is an interior point. This paper investigates the large-sample behaviour for boundary points, which is entirely different and strikingly asymmetric for $θ=0$ and $θ=1$. The reason for the asymmetry has to do with typical choices such that $F_0$ is an extreme boundary point and $F_1$ is usually not extreme. On the right boundary, well known results on boundary parameter problems are recovered, giving $\lim \mathbb{P}_1(\hat{θ}_n < 1) = 1/2$. On the left boundary, $\lim \mathbb{P}_0(\hat{θ}_n > 0) = 1-1/α$, where $1 \leq α \leq 2$ indexes the domain of attraction of the density ratio $f_1(X)/f_0(X)$ when $X \sim F_0$. For $α=1$, which is the most important case in practice, we show how the tail behaviour of $F_1$ governs the rate at which $\mathbb{P}_0(\hat{θ}_n > 0)$ tends to zero. A new limit theorem for the joint distribution of the sample maximum and sample mean conditional on positivity establishes multiple inferential anomalies. Most notably, given $\hat{θ}_n > 0$, the likelihood ratio statistic has a conditional null limit distribution $G \neq χ^2_1$ determined by the joint limit theorem. We show through this route that no advantage is gained by extending the single distribution $F_1$ to the nonparametric composite mixture generated by the same tail-equivalence class.

2406.06408 2026-04-09 stat.ML cs.CR cs.LG math.ST stat.TH

Differentially Private Best-Arm Identification

Achraf Azize, Marc Jourdan, Aymen Al Marjani, Debabrota Basu

Comments 85 pages, 5 figures, 3 tables, 11 algorithms. To be published in the Journal of Machine Learning Research 27. This journal paper is an extended version of the conference paper Azize et al. ("On the Complexity of Differentially Private Best-Arm Identification with Fixed Confidence", NeurIPS 2023)

Abstract

Best Arm Identification (BAI) problems are progressively used for data-sensitive applications, such as designing adaptive clinical trials, tuning hyper-parameters, and conducting user studies. Motivated by the data privacy concerns invoked by these applications, we study the problem of BAI with fixed confidence in both the local and central models, i.e. $ε$-local and $ε$-global Differential Privacy (DP). First, to quantify the cost of privacy, we derive lower bounds on the sample complexity of any $δ$-correct BAI algorithm satisfying $ε$-global DP or $ε$-local DP. Our lower bounds suggest the existence of two privacy regimes. In the high-privacy regime, the hardness depends on a coupled effect of privacy and novel information-theoretic quantities involving the Total Variation. In the low-privacy regime, the lower bounds reduce to the non-private lower bounds. We propose $ε$-local DP and $ε$-global DP variants of a Top Two algorithm, namely CTB-TT and AdaP-TT*, respectively. For $ε$-local DP, CTB-TT is asymptotically optimal by plugging in a private estimator of the means based on Randomised Response. For $ε$-global DP, our private estimator of the mean runs in arm-dependent adaptive episodes and adds Laplace noise to ensure a good privacy-utility trade-off. By adapting the transportation costs, the expected sample complexity of AdaP-TT* reaches the asymptotic lower bound up to multiplicative constants.
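The two noise mechanisms named in the abstract can be sketched in their generic textbook form. This is not CTB-TT or AdaP-TT*: the adaptive episodes and the Top Two sampling rule are omitted, rewards are assumed to lie in $[0, 1]$ for the Laplace mechanism and to be binary for Randomised Response, and all function names are hypothetical.

```python
import math
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) draw, built as the difference of two exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def private_mean_global(rewards, epsilon, rng):
    """epsilon-global-DP mean of rewards in [0, 1]; the mean has sensitivity 1/n."""
    n = len(rewards)
    return sum(rewards) / n + laplace_noise(1.0 / (n * epsilon), rng)

def randomised_response(bit, epsilon, rng):
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it."""
    keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if rng.random() < keep else 1 - bit

def private_mean_local(bits, epsilon, rng):
    """Debiased mean of randomised-response reports of binary rewards."""
    keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    reports = [randomised_response(b, epsilon, rng) for b in bits]
    observed = sum(reports) / len(reports)
    # E[report] = (2 * keep - 1) * mean + (1 - keep); invert that relation.
    return (observed - (1.0 - keep)) / (2.0 * keep - 1.0)
```

In both cases a smaller $ε$ means more noise in the estimated mean, which is the cost of privacy that the lower bounds quantify.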

2405.08253 2026-04-09 stat.ML cs.LG math.OC

Thompson Sampling for Infinite-Horizon Discounted Decision Processes

Daniel Adelman, Cagla Keceli, Alba V. Olivares-Nadal

Abstract

This paper develops a viable notion of learning for sampling-based algorithms that applies in broader settings than previously considered. More specifically, we model discounted infinite-horizon MDPs with Borel state and action spaces, whose rewards and transitions depend on an unknown parameter. To analyze adaptive learning algorithms based on sampling, we introduce a general canonical probability space in this setting. Since standard definitions of regret are inadequate for policy evaluation in this setting, we propose new metrics that arise from decomposing the standard expected regret in discounted infinite-horizon MDPs into three terms: (i) the expected finite-time regret, (ii) the expected state regret, and (iii) the expected residual regret. Component (i) translates into the traditional concept of expected regret over a finite horizon. Term (ii) reflects how much future performance is compromised at a given time because earlier decisions have led the system to a less favorable state than under an optimal policy. Finally, metric (iii) measures regret with respect to the optimal reward from the current period onward, disregarding the irreversible consequences of past decisions. We further disaggregate this term by introducing the probabilistic residual regret, a finer, sample-path version of (iii) that captures the remaining loss in future performance from the current period onward, conditional on the observed history. Its expectation coincides with (iii). We then focus on Thompson sampling (TS); under assumptions that extend those used in prior work on finite state and action spaces to the Borel setting, we show that component (iii) for TS converges to zero exponentially fast. We further show that, under mild conditions ensuring the existence of the relevant limits, its probabilistic counterpart converges to zero almost surely and TS achieves complete learning.

2404.04794 2026-04-09 stat.ME

Local Balance Calibration for Nonparametric Propensity Score Estimation

Maosen Peng, Yan Li, Chong Wu, Liang Li

Comments Corresponding author: Chong Wu (Email: CWu18@mdanderson.org) and Liang Li (Email: LLi15@mdanderson.org)

Abstract

The propensity score is widely used for causal inference in observational studies, but common parametric estimators can produce biased and inefficient effect estimates when model assumptions are violated. Nonparametric approaches reduce sensitivity to misspecification but often yield unstable weights and inadequate covariate balance. We propose Local Balance with Calibration, implemented by Neural Networks, a weighting method that combines flexible function approximation with the explicit enforcement of covariate balance and calibration. When used with inverse probability weighting, the proposed estimator produces more stable weights, improved covariate balance, and reduced bias in average treatment effect estimation compared with existing approaches. We further develop an influence-function-based variance estimator that provides accurate uncertainty quantification for the resulting weighted estimators. Numerical studies demonstrate improved efficiency and reliable variance estimation across a range of data-generating scenarios. The method is implemented using the publicly available R package LBCNet.
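For orientation, the inverse probability weighting step that consumes the estimated propensity scores can be sketched as follows. This is the standard normalised (Hájek) estimator, shown under the assumption that propensity scores have already been estimated. It does not reproduce LBC-Net or the R package's interface, and the function names are hypothetical.

```python
def ipw_ate(outcomes, treatments, propensities):
    """Normalised (Hajek) inverse-probability-weighted ATE estimate.

    treatments are 0/1 indicators; propensities are estimated P(T = 1 | X)
    and must lie strictly between 0 and 1.
    """
    w1 = [t / e for t, e in zip(treatments, propensities)]
    w0 = [(1 - t) / (1.0 - e) for t, e in zip(treatments, propensities)]
    mean1 = sum(w * y for w, y in zip(w1, outcomes)) / sum(w1)
    mean0 = sum(w * y for w, y in zip(w0, outcomes)) / sum(w0)
    return mean1 - mean0

def weighted_covariate_gap(covariate, treatments, propensities):
    """Difference in weighted covariate means between arms (0 means balanced).

    It is the same formula as the ATE estimate, applied to a covariate.
    """
    return ipw_ate(covariate, treatments, propensities)
```

Propensities close to 0 or 1 produce very large weights, which is the instability the abstract refers to. A small `weighted_covariate_gap` for each covariate is one simple indication of balance.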

2403.07628 2026-04-09 math.PR math-ph math.MP math.ST stat.TH

Asymptotic Expansions of the Limit Laws of Gaussian and Laguerre (Wishart) Ensembles at the Soft Edge

Folkmar Bornemann

Comments V5: using an alternative expression for the parameter tau that better fits the style of the other parameters in the Laguerre/Wishart cases, more remarks on the rationale of the scaling in the symplectic cases; 70 pages, 8 figures

Journal ref
Constr. Approx., 2026, 97pp
Abstract

The large-matrix limit laws of the rescaled largest eigenvalue of the orthogonal, unitary, and symplectic $n$-dimensional Gaussian ensembles -- and of the corresponding Laguerre ensembles (Wishart distributions) for various regimes of the parameter $α$ (degrees of freedom $p$) -- are known to be the Tracy-Widom distributions $F_β$ ($β=1,2,4$). We establish (paying particular attention to large or small ratios $p/n$) that, with careful choices of the rescaling constants and of the expansion parameter $h$, the limit laws embed into asymptotic expansions in powers of $h$, where $h \asymp n^{-2/3}$ resp. $h \asymp (n\,\wedge\,p)^{-2/3}$. We find explicit analytic expressions of the first few expansion terms as linear combinations of higher-order derivatives of the limit law $F_β$ with rational polynomial coefficients. The parametrizations are fine-tuned so that the expansion coefficients in the Gaussian cases are, for given $n$, the limits $p\to\infty$ of those of the Laguerre cases. Whereas the results for $β=2$ are presented with proof, the discussion of the cases $β=1,4$ is based on some hypotheses, focusing on the algebraic aspects of actually computing the polynomial coefficients. For the purposes of illustration and validation, the various results are checked against simulation data with large sample sizes.

2403.05281 2026-04-09 stat.ML math.ST stat.TH

A Generative Approach to Quasi-Random Sampling from Copulas via Space-Filling Designs

Sumin Wang, Chenxian Huang, Yongdao Zhou, Min-Qian Liu

Comments 42 pages, 5 figures

Abstract

Exploring the dependence between covariates across distributions is crucial for many applications. Copulas serve as a powerful tool for modeling joint variable dependencies and have been effectively applied in various practical contexts due to their intuitive properties. However, existing computational methods lack the capability for feasible inference and sampling of any copula, preventing their widespread use. This paper introduces an innovative quasi-random sampling approach for copulas, utilizing generative adversarial networks (GANs) and space-filling designs. The proposed framework constructs a direct mapping from low-dimensional uniform distributions to high-dimensional copula structures using GANs, and generates quasi-random samples for any copula structure from point sets of space-filling designs. In high-dimensional situations with limited data, the proposed approach significantly enhances sampling accuracy and computational efficiency compared to existing methods. Additionally, we develop convergence rate theory for quasi-Monte Carlo estimators, providing rigorous upper bounds for bias and variance. Both simulated experiments and practical implementations, particularly in risk management, validate the proposed method and showcase its superiority over existing alternatives.

2403.03208 2026-04-09 stat.ML cs.LG stat.ME

Active Statistical Inference

Tijana Zrnic, Emmanuel J. Candès

Abstract

Inspired by the concept of active learning, we propose active inference -- a methodology for statistical inference with machine-learning-assisted data collection. Assuming a budget on the number of labels that can be collected, the methodology uses a machine learning model to identify which data points would be most beneficial to label, thus effectively utilizing the budget. It operates on a simple yet powerful intuition: prioritize the collection of labels for data points where the model exhibits uncertainty, and rely on the model's predictions where it is confident. Active inference constructs provably valid confidence intervals and hypothesis tests while leveraging any black-box machine learning model and handling any data distribution. The key point is that it achieves the same level of accuracy with far fewer samples than existing baselines relying on non-adaptively-collected data. This means that for the same number of collected samples, active inference enables smaller confidence intervals and more powerful p-values. We evaluate active inference on datasets from public opinion research, census analysis, and proteomics.
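The intuition of labelling where the model is uncertain and trusting it elsewhere can be sketched for the simplest target, a population mean. The sketch assumes an inverse-probability-weighted correction of the model's predictions and a sampling rule proportional to a positive uncertainty score. Both are illustrative assumptions and not necessarily the paper's exact procedure, and the function names are hypothetical.

```python
import random

def sampling_probs(uncertainties, budget_fraction, floor=0.01):
    """Label probabilities proportional to model uncertainty (assumed rule),
    scaled so their average is roughly the budget, then clipped to [floor, 1].
    Uncertainties are assumed to be positive."""
    mean_u = sum(uncertainties) / len(uncertainties)
    return [min(1.0, max(floor, budget_fraction * u / mean_u))
            for u in uncertainties]

def active_mean(predictions, probs, get_label, rng):
    """Inverse-probability-weighted estimate of the mean label.

    Every point contributes its prediction; if its label is collected
    (with probability probs[i]) it also contributes the correction
    (label - prediction) / probs[i], which keeps the estimate unbiased.
    """
    total = 0.0
    for i, (f, p) in enumerate(zip(predictions, probs)):
        total += f
        if rng.random() < p:
            total += (get_label(i) - f) / p
    return total / len(predictions)
```

When the predictions are accurate the corrections are small, so few labels are needed for a precise estimate. When every probability is 1 the estimate reduces to the ordinary sample mean of the labels.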

2304.07797 2026-04-09 math.ST math.OC math.PR stat.TH

Optimal distributions for randomized unbiased estimators with an infinite horizon and an adaptive algorithm

Chao Zheng, Jiangtao Pan, Qun Wang

Journal ref
IMA Journal of Numerical Analysis, 2026
Abstract

The randomized unbiased estimators of Rhee and Glynn (Operations Research, 63(5), 1026-1043, 2015) can be highly efficient at approximating expectations of path functionals associated with stochastic differential equations (SDEs). However, there is a lack of algorithms for calculating the optimal distributions with an infinite horizon. In this article, based on the method of Cui et al. (Operations Research Letters, 477-484, 2021), we prove that, under mild assumptions, there is a simple representation of the optimal distributions. Then, we develop an adaptive algorithm to compute the optimal distributions with an infinite horizon, which requires only a small amount of computational time in prior estimation. Finally, we provide numerical results to illustrate the efficiency of our adaptive algorithm.
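To make the object of study concrete, here is a toy version of the coupled-sum randomized estimator $Z = \sum_{n=0}^{N} Δ_n / \mathbb{P}(N \geq n)$, where $Δ_n = Y_n - Y_{n-1}$ and $N$ is a random truncation level. The sketch assumes a deterministic sequence $Y_n = 1 - 2^{-n}$ and a fixed geometric distribution for $N$. It does not compute the optimal distribution of $N$, which is the paper's subject, and all names are hypothetical.

```python
import random

def approximation(n):
    """Toy sequence Y_n = 1 - 2**(-n), converging to the target value 1."""
    return 1.0 - 2.0 ** (-n)

def coupled_sum_estimate(rng, p=0.4):
    """One draw of Z = sum_{n=0}^{N} Delta_n / P(N >= n), with N geometric."""
    # N counts failures before the first success, so P(N >= n) = (1 - p)**n.
    big_n = 0
    while rng.random() >= p:
        big_n += 1
    z = approximation(0)  # Delta_0 = Y_0, and P(N >= 0) = 1
    for n in range(1, big_n + 1):
        delta = approximation(n) - approximation(n - 1)
        z += delta / (1.0 - p) ** n
    return z

def estimate(reps, seed=0):
    """Average of independent draws; unbiased for the limit of Y_n."""
    rng = random.Random(seed)
    return sum(coupled_sum_estimate(rng) for _ in range(reps)) / reps
```

Each draw evaluates only finitely many levels, yet its expectation equals the limit of $Y_n$. The choice of the distribution of $N$ trades variance against expected work per draw.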

2303.05443 2026-04-09 stat.ME

Likelihood-based Inference for Skewed Responses in a Crossover Trial Setup

Savita Pareek, Kalyan Das, Siuli Mukhopadhyay

Journal ref
Communications in Statistics - Simulation and Computation 2025
Abstract

This work proposes a statistical model for crossover trials with multiple skewed responses measured in each period. Data from a 3 $\times$ 3 crossover trial, in which different drug doses were administered to subjects with a history of seasonal asthma rhinitis to grass pollen, are used for motivation. In each period, gene expression values for ten genes were measured from each subject. The model is a linear mixed effect model with a skew-normally distributed random effect or random error term, which captures the asymmetric responses in the crossover trial. The paper examines two cases: (i) the random effect follows a skew-normal distribution, and (ii) the random error follows a skew-normal distribution. The EM algorithm is used in both cases to compute maximum likelihood estimates of parameters. Simulations and crossover data from the gene expression study illustrate the proposed approach. Keywords: Crossover design, Mixed effect models, Skew-normal distribution, EM algorithm.

2105.07446 2026-04-09 stat.ML cs.LG math.ST stat.TH

Sobolev Norm Learning Rates for Conditional Mean Embeddings

Prem Talwai, Ali Shameli, David Simchi-Levi

Comments Appears in AISTATS 2022

Abstract

We develop novel learning rates for conditional mean embeddings by applying the theory of interpolation for reproducing kernel Hilbert spaces (RKHS). We derive explicit, adaptive convergence rates for the sample estimator under the misspecified setting, where the target operator is not Hilbert-Schmidt or bounded with respect to the input/output RKHSs. We demonstrate that in certain parameter regimes, we can achieve uniform convergence rates in the output RKHS. We hope our analyses will allow the much broader application of conditional mean embeddings to more complex ML/RL settings involving infinite dimensional RKHSs and continuous state spaces.

2604.07325 2026-04-09 stat.ME math.ST stat.ML stat.TH

Conformal Prediction with Time-Series Data via Sequential Conformalized Density Regions

M. Sampson, K. S. Chan

Abstract

We propose a new conformal prediction method for time-series data with a guaranteed asymptotic conditional coverage rate, Sequential Conformalized Density Regions (SCDR), which is flexible enough to produce both prediction intervals and disconnected prediction sets, signifying the emergence of bifurcations. Our approach uses existing estimated conditional highest density predictive regions to form initial predictive regions. We then use a quantile random forest conformal adjustment to provide guaranteed coverage while adaptively changing to take the non-exchangeable nature of time-series data into account. We show that the proposed method achieves the guaranteed coverage rate asymptotically under certain regularity conditions. In particular, the method is doubly robust -- it works if the predictive density model is correctly specified and/or if the scores follow a nonlinear autoregressive model with the correct order specified. Simulations reveal that the proposed method outperforms existing methods in terms of empirical coverage rates and set sizes. We illustrate the method using two real datasets, the Old Faithful geyser dataset and the Australian electricity usage dataset. Prediction sets formed using SCDR for the geyser eruption durations include both single intervals and unions of two intervals, whereas existing methods produce wider, less informative, single-interval prediction sets.

2604.07323 2026-04-09 stat.ML cs.LG math.PR

Gaussian Approximation for Asynchronous Q-learning

Artemy Rubtsov, Sergey Samsonov, Vladimir Ulyanov, Alexey Naumov

Comments 41 pages

Abstract

In this paper, we derive rates of convergence in the high-dimensional central limit theorem for Polyak-Ruppert averaged iterates generated by the asynchronous Q-learning algorithm with a polynomial stepsize $k^{-ω},\, ω\in (1/2, 1]$. Assuming that the sequence of state-action-next-state triples $(s_k, a_k, s_{k+1})_{k \geq 0}$ forms a uniformly geometrically ergodic Markov chain, we establish a rate of order up to $n^{-1/6} \log^{4} (nS A)$ over the class of hyper-rectangles, where $n$ is the number of samples used by the algorithm and $S$ and $A$ denote the numbers of states and actions, respectively. To obtain this result, we prove a high-dimensional central limit theorem for sums of martingale differences, which may be of independent interest. Finally, we present bounds for high-order moments for the algorithm's last iterate.
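The algorithm under analysis can be sketched on a toy problem: asynchronous Q-learning with stepsize $k^{-ω}$ and Polyak-Ruppert averaging of the iterates. The example assumes a deterministic MDP in which taking action $a$ moves to state $a$, together with a uniform behaviour policy. Both are choices made only for illustration, and the function name is hypothetical.

```python
import random

def averaged_q_learning(rewards, gamma, steps, omega=0.6, seed=0):
    """Asynchronous Q-learning with stepsize k**(-omega) and Polyak-Ruppert
    averaging, on a toy MDP where taking action a moves to state a
    (so the number of states must equal the number of actions)."""
    rng = random.Random(seed)
    n_states, n_actions = len(rewards), len(rewards[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    q_bar = [[0.0] * n_actions for _ in range(n_states)]
    state = 0
    for k in range(1, steps + 1):
        action = rng.randrange(n_actions)   # uniform behaviour policy
        next_state = action                 # deterministic toy transition
        target = rewards[state][action] + gamma * max(q[next_state])
        # Asynchronous update: only the visited (state, action) entry changes.
        q[state][action] += k ** (-omega) * (target - q[state][action])
        # Running average of all iterates Q_1, ..., Q_k.
        for s in range(n_states):
            for a in range(n_actions):
                q_bar[s][a] += (q[s][a] - q_bar[s][a]) / k
        state = next_state
    return q_bar
```

With `rewards = [[1.0, 0.5], [0.0, 2.0]]` and `gamma = 0.5`, solving the Bellman optimality equations by hand gives $Q^* = [[2.25, 2.5], [1.25, 4.0]]$, which the averaged iterate should approach as the number of steps grows.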

2604.07290 2026-04-09 physics.ins-det physics.geo-ph stat.AP

Multispectral representation of Distributed Acoustic Sensing data: a framework for physically interpretable feature extraction and visualization

Sergio Morell-Monzó, Dídac Diego-Tortosa, Isabel Pérez-Arjona, Víctor Espinosa

Abstract

Distributed Acoustic Sensing (DAS) enables continuous monitoring of dynamic strain along tens of kilometers of optical fiber, generating massive datasets whose interpretation and automated analysis remain challenging. DAS measurements often lack a standardized visual representation, and their physical interpretation depends strongly on acquisition conditions and signal processing choices. This work introduces a systematic framework for visualization and feature extraction of DAS data based on a multispectral signal representation. The approach decomposes strain-rate measurements into predefined frequency bands and computes band-limited energy images that describe the spatial and temporal distribution of acoustic energy across distinct spectral regimes. The framework is evaluated using DAS recordings containing Fin Whale (Balaenoptera physalus) and Blue Whale (Balaenoptera musculus) vocalizations. Three experiments are conducted to assess the approach: enhanced visualization of bioacoustic signals, unsupervised clustering of acoustic patterns, and supervised event detection using a convolutional neural network. Using multispectral composites as input, a ResNet-18 classifier achieves an accuracy of 97.3% in whale vocalization detection, demonstrating that the proposed representation captures biologically meaningful spectral structure and provides an effective feature space for automated analysis of DAS data.

2604.07267 2026-04-09 stat.ML cs.LG

The Theory and Practice of Highly Scalable Gaussian Process Regression with Nearest Neighbours

Robert Allison, Tomasz Maciazek, Anthony Stephenson

Comments 92 pages (35-page main text + self-contained appendix with theorem proofs and auxiliary lemmas)

Abstract

Gaussian process ($GP$) regression is a widely used non-parametric modeling tool, but its cubic complexity in the training size limits its use on massive data sets. A practical remedy is to predict using only the nearest neighbours of each test point, as in Nearest Neighbour Gaussian Process ($NNGP$) regression for geospatial problems and the related scalable $GPnn$ method for more general machine-learning applications. Despite their strong empirical performance, the large-$n$ theory of $NNGP/GPnn$ remains incomplete. We develop a theoretical framework for $NNGP$ and $GPnn$ regression. Under mild regularity assumptions, we derive almost sure pointwise limits for three key predictive criteria: mean squared error ($MSE$), calibration coefficient ($CAL$), and negative log-likelihood ($NLL$). We then study the $L_2$-risk, prove universal consistency, and show that the risk attains Stone's minimax rate $n^{-2α/(2p+d)}$, where $α$ and $p$ capture regularity of the regression problem. We also prove uniform convergence of $MSE$ over compact hyper-parameter sets and show that its derivatives with respect to lengthscale, kernel scale, and noise variance vanish asymptotically, with explicit rates. This explains the observed robustness of $GPnn$ to hyper-parameter tuning. These results provide a rigorous statistical foundation for $NNGP/GPnn$ as a highly scalable and principled alternative to full $GP$ models.

2604.07179 2026-04-09 stat.ME

NLP-Informed Dynamic Cognitive Diagnosis Modelling

Yawen Ma, Sahoko Ishida, Kate Cain, Gabriel Wallin

Abstract

Digital learning platforms are increasingly used to support reading development while generating rich log files and item-level textual content. Using these data, this study proposes a dynamic cognitive diagnostic modelling (CDM) framework that incorporates text-derived semantic information to inform the estimation of the Q-matrix. We construct item-level semantic representations of question text and response options, and use these representations to define an informative prior on the Q-matrix. This approach treats text-derived signals as proxies for item complexity and cognitive demands, guiding the item-skill mapping in a data-driven manner. The proposed framework jointly estimates latent skill mastery profiles, item parameters, and transition dynamics over time within a Bayesian framework. We apply the model to data from Boost Reading, a digital reading supplement, focusing on students' vocabulary and comprehension skill development. We compare the proposed framework with a baseline model without any text information and show that the text-derived prior can improve Q-matrix recovery, particularly in settings where response data alone provide limited identification, as well as other model parameters for varying scenarios. This study provides a novel integration of natural language processing and dynamic CDMs, offering a data-driven approach to modelling skill acquisition and item-skill relationships in digital learning environments.

2604.07153 2026-04-09 math.ST stat.ME stat.TH

Non-asymptotic two-sample kernel testing with the spectrally truncated normalized MMD

Perrine Lacroix, Bertrand Michel, Franck Picard, Vincent Rivoirard

Abstract

Kernel methods provide a flexible and powerful framework for nonparametric statistical testing by embedding probability distributions into a reproducing kernel Hilbert space (RKHS). In this work, we study the kernel two-sample testing problem and focus on a normalized version of the Maximum Mean Discrepancy (MMD) as a test statistic, which scales the discrepancy by the within-group covariance operator to account for data variability. This normalization has been shown to improve test power in both theoretical and empirical settings. Because this normalization requires regularization, we study the non-asymptotic properties of the spectrally truncated normalized MMD (st-nMMD) and derive an exponential upper bound under the null hypothesis. Thanks to this result we propose a sharp and explicit upper bound for the corresponding non-asymptotic quantile, along with a data-adaptive estimator. We further propose an algorithm to tune the hyperparameters involved in the quantile estimation, including the truncation level, without requiring data splitting. We demonstrate the performance of the st-nMMD through numerical experiments under both the null and alternative hypotheses.
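For reference, the plain (un-normalised) squared MMD that the normalised statistic rescales can be computed as below. This is the standard biased V-statistic estimate with a Gaussian kernel, which is an assumed kernel choice. It is not the st-nMMD itself, since the covariance normalisation and the spectral truncation are omitted, and the function names are hypothetical.

```python
import math

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel between two points given as equal-length tuples."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * bandwidth ** 2))

def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    return (mean_kernel(xs, xs, bandwidth)
            + mean_kernel(ys, ys, bandwidth)
            - 2.0 * mean_kernel(xs, ys, bandwidth))
```

The estimate is zero when both samples coincide and grows as the samples separate. A test rejects the null hypothesis when the statistic exceeds a quantile, which is the quantity the paper bounds non-asymptotically for its normalised version.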

2604.07143 2026-04-09 cs.LG stat.AP stat.ML

Lumbermark: Resistant Clustering by Chopping Up Mutual Reachability Minimum Spanning Trees

Marek Gagolewski

Abstract

We introduce Lumbermark, a robust divisive clustering algorithm capable of detecting clusters of varying sizes, densities, and shapes. Lumbermark iteratively chops off large limbs connected by protruding segments of a dataset's mutual reachability minimum spanning tree. The use of mutual reachability distances smoothens the data distribution and decreases the influence of low-density objects, such as noise points between clusters or outliers at their peripheries. The algorithm can be viewed as an alternative to HDBSCAN that produces partitions with user-specified sizes. A fast, easy-to-use implementation of the new method is available in the open-source 'lumbermark' package for Python and R. We show that Lumbermark performs well on benchmark data and hope it will prove useful to data scientists and practitioners across different fields.
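The tree that the algorithm operates on can be sketched as follows. The code assumes the usual HDBSCAN-style definition of mutual reachability, $d_M(i,j) = \max\{c_M(i), c_M(j), d(i,j)\}$ with $c_M(i)$ the distance from point $i$ to its $M$-th nearest neighbour, and builds the minimum spanning tree with Prim's algorithm. It does not use the 'lumbermark' package or reproduce its interface, and the limb-chopping step is not shown.

```python
import math

def mutual_reachability(points, m):
    """d_M(i, j) = max(core_i, core_j, d(i, j)), with core_i the distance
    from point i to its m-th nearest neighbour (HDBSCAN-style, assumed)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    core = [sorted(row)[m] for row in dist]   # index 0 is the point itself
    return [[0.0 if i == j else max(core[i], core[j], dist[i][j])
             for j in range(n)] for i in range(n)]

def minimum_spanning_tree(weights):
    """Prim's algorithm on a dense weight matrix; returns (weight, i, j) edges."""
    n = len(weights)
    in_tree = [False] * n
    best = [(math.inf, -1)] * n   # cheapest known (weight, parent) per node
    best[0] = (0.0, -1)
    edges = []
    for _ in range(n):
        i = min((v for v in range(n) if not in_tree[v]),
                key=lambda v: best[v][0])
        in_tree[i] = True
        if best[i][1] >= 0:
            edges.append((best[i][0], best[i][1], i))
        for j in range(n):
            if not in_tree[j] and weights[i][j] < best[j][0]:
                best[j] = (weights[i][j], i)
    return edges
```

For the one-dimensional points `[(0.0,), (1.0,), (2.0,), (10.0,)]` with `m=1`, the outlier at 10 has core distance 8, so every edge touching it has weight at least 8. This is how low-density points are pushed away from the rest of the tree.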

2604.07135 2026-04-09 stat.ME

Private Federated Learning for High-dimensional Time Series

Kejun Chen, Qianqian Zhu

Abstract

In the era of big data, leveraging information from multiple clients while preserving data privacy has emerged as a critical challenge in modern statistical modeling and forecasting. This paper introduces a privacy-preserving federated learning framework for high-dimensional vector autoregressive models, where each client's dynamics are characterized by a common low-rank structure augmented with sparse client-specific deviations. We develop a two-stage estimation procedure that integrates differentially private representation learning for the shared component with local personalization for client-specific adjustments, enabling effective information pooling under selective privacy constraints. Non-asymptotic error bounds are established for both the single-client and federated estimators to characterize the inherent privacy-utility trade-off, and consistency of a ridge-type rank selection criterion is proved. Simulation studies demonstrate that federation substantially improves estimation accuracy when local sample sizes are limited. Two empirical applications, analyzing electricity-economy linkages across U.S. states and conducting multi-task macroeconomic forecasting across countries, highlight the superior predictive accuracy of the proposed method over existing single-client benchmarks.

2604.06104 2026-04-09 physics.soc-ph stat.AP

Modeling Disruptions to Urban Metabolism using Interconnected Networks

Bharat Sharma, Abhilasha J. Saroj, Evan Scherrer, Melissa R. Allen-Dumas

Abstract

Representation of cities as organisms with metabolic processes is a useful analogy for urban design, development and sustainability. Urban metabolism can be modeled by representing urban systems as networks. The various networks included in a city's metabolism are interdependent in complex ways. Thus, understanding the interaction among these networks is essential to understanding how a healthy urban metabolism is sustained and how injuries to the metabolic system can "heal". It is particularly important to understand how disruptions to one system in an urban area affect the functioning of other systems. Using distribution-level data from a real U.S. city on the electricity distribution system and road geometry, we apply connected network modeling to two critical inter-connected urban infrastructure sectors: energy and transportation. We quantify the robustness of these interdependent networks by evaluating the connectivity disruptions that may occur due to natural or synthetic disruptive events, using both unweighted and weighted metrics.
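One simple unweighted robustness measure of the kind described is the share of nodes that remain in the largest connected component after a disruption. The sketch below assumes an undirected network stored as an adjacency dictionary and a hypothetical dependency map from road nodes to the power-network nodes that supply them. Neither the data layout nor the metric is taken from the paper.

```python
from collections import deque

def largest_component_fraction(adjacency, removed=()):
    """Share of the original nodes in the largest connected component
    once the nodes in `removed` (and their edges) are deleted."""
    removed = set(removed)
    seen = set()
    largest = 0
    for start in adjacency:
        if start in removed or start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:                      # breadth-first search
            node = queue.popleft()
            size += 1
            for nbr in adjacency[node]:
                if nbr not in removed and nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        largest = max(largest, size)
    return largest / len(adjacency)

def dependent_failures(depends_on, failed_suppliers):
    """Nodes whose (hypothetical) supplier in the other network has failed."""
    failed = set(failed_suppliers)
    return {node for node, supplier in depends_on.items() if supplier in failed}
```

Passing the output of `dependent_failures` as `removed` shows how a failure in the energy network propagates into lost connectivity in the transportation network.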

2603.14984 2026-04-09 stat.ME stat.AP

Spatiotemporally Consistent Multivariate Bias Correction for Climate Projections via Nested Vine Copulas

Theresa Meier, Erwan Koch, Valérie Chavez-Demoulin, Thibault Vatter

Comments 58 pages, 15 figures, 7 tables

Abstract

Climate models are essential for understanding large-scale climate dynamics and long-term climate change, yet they exhibit systematic biases when compared with historical observations. Existing multivariate bias correction (MBC) approaches do not explicitly handle spatiotemporal dependence. However, preserving both spatiotemporal and inter-variable consistency is essential for realistic climate dynamics and reliable regional impact assessments. To address this gap, we propose a novel MBC method called GN-VBC that uses generalized additive models (GAMs) to disentangle spatiotemporal deterministic effects from stochastic residuals. To model joint distributions and dependencies across variables and locations, we introduce nested vine copulas (NVCs), a hierarchical vine merging strategy. NVC in the context of MBC combines two dependence levels: (i) spatial dependence across locations, modeled separately for each variable, and (ii) inter-variable dependence modeled at a selected reference location, which links the spatial models into a coherent multivariate and spatial structure. An application to Switzerland shows improvements in preserving inter-variable, spatial and temporal dependence across a wide range of evaluation metrics.

2603.14135 2026-04-09 stat.ML cs.LG

Conditional flow matching for physics-constrained inverse problems with finite training data

Agnimitra Dasgupta, Ali Fardisi, Mehrnegar Aminy, Brianna Binder, Bryan Shaddy, Saeed Moazami, Assad Oberai

Abstract

This study presents a conditional flow matching framework for solving physics-constrained Bayesian inverse problems. In this setting, samples from the joint distribution of inferred variables and measurements are assumed available, while explicit evaluation of the prior and likelihood densities is not required. We derive a simple and self-contained formulation of both the unconditional and conditional flow matching algorithms, tailored specifically to inverse problems. In the conditional setting, a neural network is trained to learn the velocity field of a probability flow ordinary differential equation that transports samples from a chosen source distribution directly to the posterior distribution conditioned on observed measurements. This black-box formulation accommodates nonlinear, high-dimensional, and potentially non-differentiable forward models without restrictive assumptions on the noise model. We further analyze the behavior of the learned velocity field in the regime of finite training data. Under mild architectural assumptions, we show that overtraining can induce degenerate behavior in the generated conditional distributions, including variance collapse and a phenomenon termed selective memorization, wherein generated samples concentrate around training data points associated with similar observations. A simplified theoretical analysis explains this behavior, and numerical experiments confirm it in practice. We demonstrate that standard early-stopping criteria based on monitoring test loss effectively mitigate such degeneracy. The proposed method is evaluated on several physics-based inverse problems. We investigate the impact of different choices of source distributions, including Gaussian and data-informed priors. Across these examples, conditional flow matching accurately captures complex, multimodal posterior distributions while maintaining computational efficiency.

2506.13017 2026-04-09 stat.AP

Spatially Varying Deep Functional Neural Network: Application in Large-Scale Crop Yield Prediction

Yeonjoo Park, Bo Li, Yehua Li

Journal ref
Journal of the Royal Statistical Society Series C: Applied Statistics (2026)
Abstract

Accurate prediction of crop yield is critical for supporting food security, agricultural planning, and economic decision-making. However, yield forecasting remains a significant challenge due to the complex and nonlinear relationships between weather variables and crop production, as well as spatial heterogeneity across agricultural regions. We propose DSNet, a deep neural network architecture that integrates functional and scalar predictors with spatially varying coefficients and spatial random effects. The method is designed to flexibly model spatially indexed functional data, such as daily temperature curves, and their relationship to variability in the response, while accounting for spatial correlation. DSNet mitigates the curse of dimensionality through a low-rank structure inspired by the spatially varying functional index model (SVFIM). Through comprehensive simulations, we demonstrate that DSNet outperforms state-of-the-art functional regression models for spatial data, when the functional predictors exhibit complex structure and their relationship with the response varies spatially in a potentially nonstationary manner. Application to corn yield data from the U.S. Midwest demonstrates that DSNet achieves superior predictive accuracy compared to both leading machine learning approaches and parametric statistical models. These results highlight the model's robustness and its potential applicability to other weather-sensitive crops.

2503.24209 2026-04-09 math.ST math.PR stat.TH

Optimal low-rank posterior mean and distribution approximation in linear Gaussian inverse problems on Hilbert spaces

Giuseppe Carere, Han Cheng Lie

Comments To be published in Inverse Problems and Imaging, 43 pages, 5 figures

Abstract

We construct optimal low-rank approximations for the Gaussian posterior distribution in linear Gaussian inverse problems with possibly infinite-dimensional separable Hilbert parameter spaces and finite-dimensional data spaces. We first consider approximate posteriors in which the means vary and the posterior covariance is kept fixed, for all possible realisations of the data simultaneously. We give necessary and sufficient conditions for these approximating posteriors to be equivalent to the exact posterior. For such approximations, we measure the data-averaged approximation error with the Kullback-Leibler, Rényi and Amari $α$-divergences for $α\in(0,1)$, and the Hellinger distance. With the loss in Kullback-Leibler and Rényi divergences, we find the optimal approximations and formulate an equivalent condition for their uniqueness, extending the work in finite dimensions of Spantini et al. (SIAM J. Sci. Comput. 2015). We then consider joint low-rank approximation of the mean and covariance. For the reverse Kullback-Leibler divergence, the optimal approximations of the mean and of the covariance yield an optimal joint approximation of the mean and covariance. We interpret one such joint approximation in terms of an optimal projector in parameter space, and show that this approximation amounts to solving a Bayesian inverse problem with projected forward model. Extensive numerical examples demonstrate some of our theoretical findings.

2503.08028 2026-04-09 stat.ML cs.LG

Computational bottlenecks for denoising diffusions

Andrea Montanari, Viet Vu

Comments: 51 pages; 2 figures

Details
English abstract

Denoising diffusions sample from a probability distribution $μ$ in $\mathbb{R}^d$ by constructing a stochastic process $({\hat{\boldsymbol x}}_t:t\ge 0)$ in $\mathbb{R}^d$ such that ${\hat{\boldsymbol x}}_0$ is easy to sample, but the distribution of $\hat{\boldsymbol x}_T$ at large $T$ approximates $μ$. The drift ${\boldsymbol m}:\mathbb{R}^d\times\mathbb{R}\to\mathbb{R}^d$ of this diffusion process is learned by minimizing a score-matching objective. Is every probability distribution $μ$, for which sampling is tractable, also amenable to sampling via diffusions? We provide evidence to the contrary by studying a probability distribution $μ$ for which sampling is easy, but the drift of the diffusion process is intractable -- under a popular conjecture on information-computation gaps in statistical estimation. We show that there exist drifts that are superpolynomially close to the optimum value (among polynomial time drifts) and yet yield samples with distribution that is very far from the target one.

2411.19653 2026-04-09 stat.ML cs.LG

Nonparametric Instrumental Regression via Kernel Methods is Minimax Optimal

Dimitri Meunier, Zhu Li, Tim Christensen, Arthur Gretton

Details
English abstract

We study the kernel instrumental variable (KIV) algorithm, a kernel-based two-stage least-squares method for nonparametric instrumental variable regression. We provide a convergence analysis covering both identified and non-identified regimes: when the structural function is not identified, we show that the KIV estimator converges to the minimum-norm IV solution in the reproducing kernel Hilbert space associated with the kernel. Crucially, we establish convergence in the strong $L_2$ norm, rather than only in a pseudo-norm. We quantify statistical difficulty through a link condition that compares the covariance structure of the endogenous regressor with that induced by the instrument, yielding an interpretable measure of ill-posedness. Under standard eigenvalue-decay and source assumptions, we derive strong $L_2$ learning rates for KIV and prove that they are minimax-optimal over fixed smoothness classes. Finally, we replace the stage-1 Tikhonov step by general spectral regularization, thereby avoiding saturation and improving rates for smoother first-stage targets. The matching lower bound shows that instrumental regression induces an unavoidable slowdown relative to ordinary kernel ridge regression.

2402.14260 2026-04-09 stat.ME

A New Regression Lens on Multi-Class Classification

Xin Bing, Bingqing Li, Marten Wegkamp

Details
English abstract

Linear Discriminant Analysis (LDA) is a fundamental method for classification. Its simple linear structure facilitates interpretation, and it is naturally suited to multi-class settings. LDA is also closely connected to several classical multivariate techniques, including Fisher's discriminant analysis, canonical correlation analysis, and linear regression. In this paper, we strengthen the connection between LDA and multivariate response regression by establishing an explicit relationship between discriminant directions and regression coefficients. This characterization yields a new regression-based framework for multi-class classification that accommodates structured, regularized, and even non-parametric regression methods. In contrast to existing regression-based approaches, our formulation is particularly amenable to theoretical analysis: we develop a general strategy for deriving bounds on the excess misclassification risk of the proposed classifier across all such regression procedures. As concrete applications, we provide complete theoretical guarantees for two widely used methods -- $\ell_1$-regularization and reduced-rank regression -- neither of which has previously been fully analyzed in the LDA context. The theoretical results are supported by extensive simulation studies and empirical evaluations on real data.
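
The link between LDA and linear regression mentioned in the abstract can be checked numerically in its simplest classical form: with two classes, the least-squares slope from regressing a 0/1 class label on the features is proportional to the LDA direction. The sketch below illustrates only this textbook two-class fact, not the paper's multi-class characterization; the simulated data, sample sizes, and mean shift are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n0, n1, p = 60, 40, 3                       # class sizes and dimension (assumed)
X0 = rng.normal(size=(n0, p))               # class 0 samples
X1 = rng.normal(size=(n1, p)) + np.array([1.0, -0.5, 0.25])  # shifted class 1
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(n0), np.ones(n1)])  # 0/1 class labels

# LDA direction: pooled within-class scatter inverse times the mean difference.
m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
S_w = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
lda_dir = np.linalg.solve(S_w, m1 - m0)

# Ordinary least squares of the label on the features, with an intercept.
design = np.column_stack([np.ones(len(y)), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
ols_dir = coef[1:]                          # drop the intercept

# Cosine of the angle between the two directions.
cosine = ols_dir @ lda_dir / (np.linalg.norm(ols_dir) * np.linalg.norm(lda_dir))
```

The two directions coincide up to a positive scale factor, so `cosine` equals 1 up to floating-point error; the scale and the intercept differ, which is why the regression fit alone does not give the LDA decision threshold.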

2307.03571 2026-04-09 cs.LG math.OC stat.ML

Smoothing the Edges: Smooth Optimization for Sparse Regularization using Hadamard Overparametrization

Chris Kolb, Christian L. Müller, Bernd Bischl, David Rügamer

Details
English abstract

We present a framework for smooth optimization of explicitly regularized objectives for (structured) sparsity. These non-smooth and possibly non-convex problems typically rely on solvers tailored to specific models and regularizers. In contrast, our method enables fully differentiable and approximation-free optimization and is thus compatible with the ubiquitous gradient descent paradigm in deep learning. The proposed optimization transfer comprises an overparameterization of selected parameters and a change of penalties. In the overparametrized problem, smooth surrogate regularization induces non-smooth, sparse regularization in the base parametrization. We prove that the surrogate objective is equivalent in the sense that it not only has identical global minima but also matching local minima, thereby avoiding the introduction of spurious solutions. Additionally, our theory establishes results of independent interest regarding matching local minima for arbitrary, potentially unregularized, objectives. We comprehensively review sparsity-inducing parametrizations across different fields that are covered by our general theory, extend their scope, and propose improvements in several aspects. Numerical experiments further demonstrate the correctness and effectiveness of our approach on several sparse learning problems ranging from high-dimensional regression to sparse neural network training.
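
A minimal sketch of the overparametrization idea in its simplest well-known instance: writing β = u ⊙ v and placing a smooth ridge penalty on u and v induces an ℓ1 penalty on β, because for fixed β the minimum of (‖u‖² + ‖v‖²)/2 over all factorizations equals ‖β‖₁. The example below assumes an identity design, so the lasso solution is available in closed form by soft-thresholding and can be compared with plain gradient descent on the smooth surrogate. The data vector, penalty level, initial values, step size, and iteration count are illustrative assumptions, not settings from the paper.

```python
import numpy as np

y = np.array([3.0, -2.0, 0.5, -0.3, 1.5])   # observations (assumed), design = identity
lam = 1.0                                   # penalty level (assumed)

# Reference: minimizer of 0.5 * ||y - beta||^2 + lam * ||beta||_1 (soft-thresholding).
beta_lasso = np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

# Smooth surrogate: 0.5 * ||y - u * v||^2 + 0.5 * lam * (||u||^2 + ||v||^2).
# u and v start at different values: with u == v the product u * v could never
# become negative, so negative coefficients would be unreachable.
u = np.full_like(y, 1.0)
v = np.full_like(y, 0.5)
step = 0.05
for _ in range(5000):
    resid = y - u * v
    grad_u = -resid * v + lam * u
    grad_v = -resid * u + lam * v
    u, v = u - step * grad_u, v - step * grad_v

beta_hadamard = u * v                       # map back to the base parametrization
```

In this toy setting `beta_hadamard` agrees with `beta_lasso` to high precision, including the coefficients that are shrunk to zero, even though the optimized objective is differentiable everywhere.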