arXivDaily: arXiv daily academic digest, updated Monday to Friday
2603.26644 2026-03-30 cs.LG astro-ph.IM stat.ME

Automatic Laplace Collapsed Sampling: Scalable Marginalisation of Latent Parameters via Automatic Differentiation

Toby Lovick, David Yallup, Will Handley

Comments 28 Pages, 7 Figures. Comments welcome

Abstract

We present Automatic Laplace Collapsed Sampling (ALCS), a general framework for marginalising latent parameters in Bayesian models using automatic differentiation, which we combine with nested sampling to explore the hyperparameter space in a robust and efficient manner. At each nested sampling likelihood evaluation, ALCS collapses the high-dimensional latent variables $z$ to a scalar contribution via maximum a posteriori (MAP) optimisation and a Laplace approximation, both computed using autodiff. This reduces the effective dimension from $d_\theta + d_z$ to just $d_\theta$, making Bayesian evidence computation tractable for high-dimensional settings without hand-derived gradients or Hessians, and with minimal model-specific engineering. The MAP optimisation and Hessian evaluation are parallelised across live points on GPU hardware, making the method practical at scale. We also show that automatic differentiation enables local approximations beyond Laplace to parametric families such as the Student-$t$, which improves evidence estimates for heavy-tailed latents. We validate ALCS on a suite of benchmarks spanning hierarchical, time-series, and discrete-likelihood models and establish where the Gaussian approximation holds. This enables a post-hoc ESS diagnostic that localises failures across hyperparameter space without expensive joint sampling.
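As a rough illustration of the "collapse" step described in this abstract, the sketch below marginalises a single latent $z$ with a MAP search followed by a Laplace approximation. It is not the authors' implementation: the toy model ($y_i \sim N(z, \sigma^2)$, $z \sim N(0, \tau^2)$), the function names, and the use of central finite differences in place of automatic differentiation are all assumptions made for this example. Because the toy model is Gaussian, the Laplace value can be checked against the closed-form marginal likelihood.

```python
import math

def log_joint(z, y, sigma, tau):
    """log p(y, z | theta) for the toy model y_i ~ N(z, sigma^2), z ~ N(0, tau^2)."""
    log_lik = sum(
        -0.5 * math.log(2 * math.pi * sigma**2) - 0.5 * ((yi - z) / sigma) ** 2
        for yi in y
    )
    log_prior = -0.5 * math.log(2 * math.pi * tau**2) - 0.5 * (z / tau) ** 2
    return log_lik + log_prior

def laplace_log_marginal(f, z0=0.0, h=1e-3, steps=20):
    """Approximate log of the integral of exp(f(z)) dz for scalar z.

    Finite differences stand in for autodiff here (an assumption of this sketch).
    """
    z = z0
    for _ in range(steps):
        grad = (f(z + h) - f(z - h)) / (2 * h)
        hess = (f(z + h) - 2 * f(z) + f(z - h)) / h**2
        z -= grad / hess  # Newton step towards the MAP estimate
    hess = (f(z + h) - 2 * f(z) + f(z - h)) / h**2
    # Laplace: f(z*) + (d/2) log(2 pi) - (1/2) log det(-Hessian), with d = 1
    return f(z) + 0.5 * math.log(2 * math.pi) - 0.5 * math.log(-hess)

def exact_log_marginal(y, sigma, tau):
    """Closed form: y ~ N(0, sigma^2 I + tau^2 11^T)."""
    n = len(y)
    s = sum(y)
    ss = sum(yi**2 for yi in y)
    logdet = (n - 1) * math.log(sigma**2) + math.log(sigma**2 + n * tau**2)
    quad = (ss - tau**2 * s**2 / (sigma**2 + n * tau**2)) / sigma**2
    return -0.5 * (n * math.log(2 * math.pi) + logdet + quad)

y = [0.8, 1.3, 0.4, 1.9, 1.1]  # hypothetical observations
sigma, tau = 1.0, 2.0          # hyperparameters theta, held fixed here
approx = laplace_log_marginal(lambda z: log_joint(z, y, sigma, tau))
exact = exact_log_marginal(y, sigma, tau)
```

In a sampler over the hyperparameters, a function like `laplace_log_marginal` would be called once per proposed $(\sigma, \tau)$, so the outer sampler only sees a $d_\theta$-dimensional problem.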

2603.26618 2026-03-30 math.ST stat.TH

Statistical inference for extremal directions in high-dimensional spaces

Lucas Butsch, Vicky Fasen-Hartmann

Abstract

In multivariate extreme value statistics, the first step in understanding the dependence structure of extremes is identifying the directions in which they occur. The novelty of this paper is the analysis of high-dimensional extreme value models in which both the model dimension and the number of bias directions go to infinity as the number of observations tends to infinity; we estimate the number of extremal directions. To address the curse of dimensionality, we extend and investigate the information criteria (AIC, BICU, BICL, QAIC and MSEIC) from the fixed-dimensional case (Butsch and Fasen-Hartmann, 2025a; Meyer and Wintenberger, 2023), which employ the concept of sparse regular variation that is closely related to multivariate regular variation, for the estimation of the number of extremal directions. For all information criteria, we derive sufficient conditions for consistency. Unlike in the fixed-dimensional case, where only the Bayesian information criteria (BICU and BICL) and the QAIC are consistent, the AIC and MSEIC are also consistent in high dimensions under certain model assumptions. We compare the performance of the different information criteria in a simulation study that includes a detailed analysis of the model assumptions and the necessary and sufficient conditions for consistency.

2603.26611 2026-03-30 cs.LG stat.ME stat.ML

Benchmarking Tabular Foundation Models for Conditional Density Estimation in Regression

Rafael Izbicki, Pedro L. C. Rodrigues

Abstract

Conditional density estimation (CDE) - recovering the full conditional distribution of a response given tabular covariates - is essential in settings with heteroscedasticity, multimodality, or asymmetric uncertainty. Recent tabular foundation models, such as TabPFN and TabICL, naturally produce predictive distributions, but their effectiveness as general-purpose CDE methods has not been systematically evaluated, unlike their performance for point prediction, which is well studied. We benchmark three tabular foundation model variants against a diverse set of parametric, tree-based, and neural CDE baselines on 39 real-world datasets, across training sizes from 50 to 20,000, using six metrics covering density accuracy, calibration, and computation time. Across all sample sizes, foundation models achieve the best CDE loss, log-likelihood, and CRPS on the large majority of datasets tested. Calibration is competitive at small sample sizes but, for some metrics and datasets, lags behind task-specific neural baselines at larger sample sizes, suggesting that post-hoc recalibration may be a valuable complement. In a photometric redshift case study using SDSS DR18, TabPFN exposed to 50,000 training galaxies outperforms all baselines trained on the full 500,000-galaxy dataset. Taken together, these results establish tabular foundation models as strong off-the-shelf conditional density estimators.

2603.26548 2026-03-30 stat.AP

Impact of Residential Retrofits on Gas and Electricity Consumption in France

Charly Andral, Laetitia Leduc, Guillaume Matheron, Yukihide Nakada

Abstract

This study examines the impact of residential energy retrofits on household energy consumption in France, using smart meter data from nearly 2,500 Hello Watt users and a two-period difference-in-differences design. The dataset combines daily electricity and gas consumption collected through smart meters, hourly temperatures from Météo France, and user-declared home and retrofit information. As a control, we use a group composed of homes of Hello Watt users that are similar to the treated homes but did not undergo any renovations. The average treatment effect on the treated is estimated with the estimator of Sant'Anna & Zhao (2020). Estimates are reported by energy source (electricity vs. gas) and by retrofit type. The retrofit measures considered are limited to single interventions: wall insulation, attic insulation, floor insulation, installation of an air-to-air heat pump, or installation of an air-to-water heat pump. A comprehensive retrofit is defined separately as the simultaneous implementation of at least two of these measures. Our results show that insulation works cause a significant decrease in both electricity and gas consumption (3% to 13% and 5% to 16% respectively, depending on the retrofit type). We also estimate the reduction in heating consumption alone (7% to 27% for electrical heating and 7% to 19% for gas heating). Finally, we study retrofits that consist of replacing a gas boiler with an air-to-water heat pump, resulting in a cut of 85% in carbon emissions.
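For readers unfamiliar with the two-period difference-in-differences design mentioned in this abstract, a minimal sketch follows. It computes the plain, unadjusted estimate (change in the treated group minus change in the control group); it is not the Sant'Anna & Zhao (2020) estimator used in the study, which additionally adjusts for covariates. The function name and the consumption figures are hypothetical.

```python
from statistics import mean

def did_att(pre_treated, post_treated, pre_control, post_control):
    """Unadjusted two-period difference-in-differences estimate of the ATT."""
    treated_change = mean(post_treated) - mean(pre_treated)
    control_change = mean(post_control) - mean(pre_control)
    return treated_change - control_change

# Hypothetical consumption (kWh per period) for three homes per group.
pre_treated = [100, 120, 110]
post_treated = [90, 105, 99]
pre_control = [105, 115, 110]
post_control = [103, 113, 108]

att = did_att(pre_treated, post_treated, pre_control, post_control)
relative_effect = att / mean(pre_treated)  # effect as a share of baseline use
```

With these made-up numbers the treated homes drop by 12 and the control homes by 2, so the estimate attributes a change of 10 units to the retrofit, under the usual parallel-trends assumption.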

2603.26502 2026-03-30 stat.ME stat.AP stat.ML

Targeted learning of heterogeneous treatment effect curves for right censored or left truncated time-to-event data

Matthew Pryce, Karla Diaz-Ordaz, Ruth H. Keogh, Stijn Vansteelandt

Abstract

In recent years, there has been growing interest in causal machine learning estimators for quantifying subject-specific effects of a binary treatment on time-to-event outcomes. Estimation approaches have been proposed which attenuate the inherent regularisation bias in machine learning predictions, with each of these estimators addressing measured confounding, right censoring, and in some cases, left truncation. However, the existing approaches are found to exhibit suboptimal finite-sample performance, with none of the existing estimators fully leveraging the temporal structure of the data, yielding non-smooth treatment effects over time. We address these limitations by introducing surv-iTMLE, a targeted learning procedure for estimating the difference in the conditional survival probabilities under two treatments. Unlike existing estimators, surv-iTMLE accommodates both left truncation and right censoring while enforcing smoothness and boundedness of the estimated treatment effect curve over time. Through extensive simulation studies under both right censoring and left truncation scenarios, we demonstrate that surv-iTMLE outperforms existing methods in terms of bias and smoothness of time-varying effect estimates in finite samples. We then illustrate surv-iTMLE's practical utility by exploring heterogeneity in the effects of immunotherapy on survival among non-small cell lung cancer (NSCLC) patients, revealing clinically meaningful temporal patterns that existing estimators may obscure.

2603.26478 2026-03-30 cs.SD stat.ME stat.ML

Probabilistic Multilabel Graphical Modelling of Motif Transformations in Symbolic Music

Ron Taieb, Yoel Greenberg, Barak Sober

Comments 23 pages (21 pages main text), 2 figures. Submitted to Journal of New Music Research (Special Issue on Computational and Cognitive Musicology)

Abstract

Motifs often recur in musical works in altered forms, preserving aspects of their identity while undergoing local variation. This paper investigates how such motivic transformations occur within their musical context in symbolic music. To support this analysis, we develop a probabilistic framework for modeling motivic transformations and apply it to Beethoven's piano sonatas by integrating multiple datasets that provide melodic, rhythmic, harmonic, and motivic information within a unified analytical representation. Motif transformations are represented as multilabel variables by comparing each motif instance to a designated reference occurrence within its local context, ensuring consistent labeling across transformation families. We introduce a multilabel Conditional Random Field to model how motif-level musical features influence the occurrence of transformations and how different transformation families tend to co-occur. Our goal is to provide an interpretable, distributional analysis of motivic transformation patterns, enabling the study of their structural relationships and stylistic variation. By linking computational modeling with music-theoretical interpretation, the proposed framework supports quantitative investigation of musical structure and complexity in symbolic corpora and may facilitate the analysis of broader compositional patterns and writing practices.

2603.26460 2026-03-30 math.ST stat.TH

The relative value of interventional and observational samples in Bayesian Causal Linear Gaussian Models

Valentinian Lungu, Anish Dhir, Mark van der Wilk, Ioannis Kontoyiannis

Abstract

We investigate the asymptotic properties of Bayesian bivariate causal discovery for Gaussian Linear Structural Equation Models (SEMs) with heteroscedastic noise. We demonstrate that with purely observational data, the posterior distribution over the models fails to consistently identify the true causal structure - a consequence of the fundamental non-identifiability within the Markov Equivalence Class. Specifically, if the true generating mechanism corresponds to a connected graph ($A \to B$ or $B \to A$), the asymptotic behavior of the posterior is given by the ratio between the prior on the true model and the push-forward prior of the alternative. In contrast, for the independence model, we establish that the posterior concentrates at a stochastic polynomial rate of $O_p(n^{-1/2})$. To resolve this non-identifiability, we incorporate $m$ interventional samples and characterize the concentration rates as a function of the observational-to-total sample ratio, $\eta$. We identify a sharp concentration dichotomy: while the independence graph maintains a polynomial $O_p(N^{-1/2})$ rate (where $N = n + m$), connected graphs undergo a phase transition to exponentially fast convergence. This highlights an exponential relative importance between the two data types, as altering the amount of one data type directly changes the exponent governing the concentration speed. We derive explicit formulae for the exponential decay rates and provide precise conditions under which mixing observational and interventional data optimizes concentration speed. Finally, our theoretical findings are validated through empirical simulations in Bayesian Gaussian equivalent (BGe)-style prior specifications, offering a principled foundation for experimental design in Bayesian causal discovery.

2603.26418 2026-03-30 stat.ML cs.LG math.FA

Kantorovich-Kernel Neural Operators: Approximation Theory, Asymptotics, and Neural Network Interpretation

Tian-Xiao He

Abstract

This paper studies a class of multivariate Kantorovich-kernel neural network operators, including the deep Kantorovich-type neural network operators studied by Sharma and Singh. We prove density results, establish quantitative convergence estimates, derive Voronovskaya-type theorems, analyze the limits of partial differential equations for deep composite operators, prove Korovkin-type theorems, and propose inversion theorems. Furthermore, this paper discusses the connection between neural network architectures and the classical positive operators proposed by Chui, Hsu, He, Lorentz, and Korovkin.

2603.26415 2026-03-30 cs.LG cs.AI stat.AP

KMM-CP: Practical Conformal Prediction under Covariate Shift via Selective Kernel Mean Matching

Siddhartha Laghuvarapu, Rohan Deb, Jimeng Sun

Abstract

Uncertainty quantification is essential for deploying machine learning models in high-stakes domains such as scientific discovery and healthcare. Conformal Prediction (CP) provides finite-sample coverage guarantees under exchangeability, an assumption often violated in practice due to distribution shift. Under covariate shift, restoring validity requires importance weighting, yet accurate density-ratio estimation becomes unstable when training and test distributions exhibit limited support overlap. We propose KMM-CP, a conformal prediction framework based on Kernel Mean Matching (KMM) for covariate-shift correction. We show that KMM directly controls the bias-variance components governing conformal coverage error by minimizing RKHS moment discrepancy under explicit weight constraints, and establish asymptotic coverage guarantees under mild conditions. We then introduce a selective extension that identifies regions of reliable support overlap and restricts conformal correction to this subset, further improving stability in low-overlap regimes. Experiments on molecular property prediction benchmarks with realistic distribution shifts show that KMM-CP reduces coverage gap by over 50% compared to existing approaches. The code is available at https://github.com/siddharthal/KMM-CP.
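To make the role of importance weights in conformal prediction under covariate shift concrete, here is a minimal sketch of the standard weighted split-conformal quantile. It assumes the weights are already available (in KMM-CP they would come from Kernel Mean Matching, which is not implemented here) and does not reproduce the paper's selective extension. The function name and the toy scores are hypothetical.

```python
import math

def weighted_conformal_quantile(scores, weights, test_weight, alpha):
    """Weighted (1 - alpha) quantile of calibration scores.

    scores: nonconformity scores on the calibration set
    weights: importance weight of each calibration point
    test_weight: importance weight of the test point, whose unknown
        score is treated as +infinity
    """
    total = sum(weights) + test_weight
    cumulative = 0.0
    for score, weight in sorted(zip(scores, weights)):
        cumulative += weight / total
        if cumulative >= 1.0 - alpha:
            return score
    # Not enough calibration mass: the prediction set is unbounded.
    return math.inf

# With equal weights this reduces to ordinary split conformal prediction.
q_equal = weighted_conformal_quantile([3, 1, 4, 2], [1, 1, 1, 1], 1, alpha=0.5)
# Up-weighting the largest score pulls the quantile towards it.
q_shift = weighted_conformal_quantile([1, 2, 3, 4], [1, 1, 1, 5], 2, alpha=0.25)
```

For absolute-residual scores, the resulting interval for a test input would be the point prediction plus or minus the returned quantile.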

2603.26375 2026-03-30 stat.AP

Summarising mortality data with a time-dependent beta latent variable model

Pedro Menezes de Araújo, Isobel Claire Gormley, Thomas Brendan Murphy

Abstract

Age-specific probabilities of death provide a snapshot of population mortality at the country level at a given point in time. Due to the high dimensionality of the data, summarising mortality information is essential for various analyses, such as visualisation and clustering. We propose the use of beta latent variable (BLV) models to summarise mortality information without data transformation. A time-dependent version of the BLV model is developed by incorporating an autoregressive prior for the latent effects. This model aims to represent mortality data with a small set of $K$ latent effects while accounting for time dependence between these effects. Inference is performed using Bayesian methods, with posterior samples generated via Hamiltonian Monte Carlo. The BLV model is applied to probabilities of death from the Human Mortality Database, covering 41 countries and 23 age-specific probabilities of death over several periods. The time-dependent BLV model with $K=6$ latent effects accurately reconstructs observed mortality data, and the model parameters have intuitive and insightful interpretations. The time-dependent BLV outperforms the standard Gaussian factor analysis model applied to logit probability of death, and demonstrates that BLV models can effectively summarise mortality data.

2603.26369 2026-03-30 math.ST stat.ME stat.TH

Validating spatial-temporal separability for stationary processes

Lujia Bai, Holger Dette, Zihao Yuan

Abstract

A crucial assumption to reduce computational complexity in spatial-temporal data analysis is separability, which factors the covariance structure into a purely spatial and a purely temporal component. In this paper, we develop statistical inference tools for validating this assumption for a second-order stationary process under both domain-expanding-infill asymptotics and domain-expanding asymptotics. In contrast to previous work on this subject, the methodology neither requires the assumption of normally distributed data, nor uses spectral methods. Our approach is based on nonparametric estimates of measures for the deviation between the covariance matrix and separable approximations, which vanish if and only if the assumption of separability is satisfied. We derive the asymptotic distributions of appropriate estimators for these measures with non-standard limiting distributions and use these results to develop inference tools for validating the assumption of separability. More specifically, we derive confidence intervals for the deviation measures, tests for the hypothesis of exact separability, and for the hypothesis that the deviation from separability is smaller than a prespecified threshold.

2603.26358 2026-03-30 stat.ME stat.AP

Mixed Time Series Quasi-Likelihood Models for Uncovering Covid-19 Viral Load and Mortality Dynamics

Kejin Wu, Raanju R. Sundararajan, Michel F. C. Haddad, Luiza S. C. Piancastelli, Wagner Barreto-Souza

Comments Paper submitted for publication

Abstract

Accurate real-time monitoring of disease transmission is crucial for epidemic control, which has conventionally relied on reported cases or hospital admissions. Such metrics are frequently susceptible to delays in reporting, various forms of bias, and under-ascertainment. Cycle threshold values obtained from reverse transcription quantitative polymerase chain reaction offer a promising alternative, serving as a proxy for viral load. In this paper, we aim to jointly model the viral load and the number of deaths (mortality), which involves a continuous bounded and a count time series, and therefore, a proper mixed-type model is needed. This is the motivation to introduce a new mixed-valued time series quasi-likelihood (MixTSQL) model capable of analyzing multivariate time series of different types, like continuous, discrete, bounded, and continuous positive. The MixTSQL model only requires a mean-variance specification with no distributional assumptions needed, and allows for testing Granger causality. Statistical guarantees are provided to ensure consistency and asymptotic normality of the proposed quasi-maximum likelihood estimators. We analyze weekly viral load and Covid-19 death counts in São Paulo, Brazil, using our MixTSQL model, which not only establishes the temporal order in which viral load Granger-causes mortality but also offers a comprehensive joint statistical analysis.

2603.26349 2026-03-30 stat.ML cs.AI cs.LG

Generative Score Inference for Multimodal Data

Xinyu Tian, Xiaotong Shen

Comments 25 pages, 4 figures

Abstract

Accurate uncertainty quantification is crucial for making reliable decisions in various supervised learning scenarios, particularly when dealing with complex, multimodal data such as images and text. Current approaches often face notable limitations, including rigid assumptions and limited generalizability, constraining their effectiveness across diverse supervised learning tasks. To overcome these limitations, we introduce Generative Score Inference (GSI), a flexible inference framework capable of constructing statistically valid and informative prediction and confidence sets across a wide range of multimodal learning problems. GSI utilizes synthetic samples generated by deep generative models to approximate conditional score distributions, facilitating precise uncertainty quantification without imposing restrictive assumptions about the data or tasks. We empirically validate GSI's capabilities through two representative scenarios: hallucination detection in large language models and uncertainty estimation in image captioning. Our method achieves state-of-the-art performance in hallucination detection and robust predictive uncertainty in image captioning, and its performance is positively influenced by the quality of the underlying generative model. These findings underscore the potential of GSI as a versatile inference framework, significantly enhancing uncertainty quantification and trustworthiness in multimodal learning.

2603.26344 2026-03-30 stat.ML cs.LG cs.SD eess.AS eess.SP

A Power-Weighted Noncentral Complex Gaussian Distribution

Toru Nakashika

Abstract

The complex Gaussian distribution has been widely used as a fundamental spectral and noise model in signal processing and communication. However, its Gaussian structure often limits its ability to represent the diverse amplitude characteristics observed in individual source signals. On the other hand, many existing non-Gaussian amplitude distributions derived from hyperspherical models achieve good empirical fit due to their power-law structures, while they do not explicitly account for the complex-plane geometry inherent in complex-valued observations. In this paper, we propose a new probabilistic model for complex-valued random variables, which can be interpreted as a power-weighted noncentral complex Gaussian distribution. Unlike conventional hyperspherical amplitude models, the proposed model is formulated directly on the complex plane and preserves the geometric structure of complex-valued observations while retaining a higher-dimensional interpretation. The model introduces a nonlinear phase diffusion through a single shape parameter, enabling continuous control of the distributional geometry from arc-shaped diffusion along the phase direction to concentration of probability mass toward the origin. We formulate the proposed distribution and analyze the statistical properties of the induced amplitude distribution. The derived amplitude and power distributions provide a unified framework encompassing several widely used distributions in signal modeling, including the Rice, Nakagami, and gamma distributions. Experimental results on speech power spectra demonstrate that the proposed model consistently outperforms conventional distributions in terms of log-likelihood.

2603.26334 2026-03-30 stat.AP physics.data-an

Bayesian estimation of optical constants using mixtures of Gaussian process experts

Teemu Härkönen, Hui Chen, Erik Vartiainen

Abstract

We propose modeling absorption spectrum measurements as mixtures of Gaussian process experts. This enables us to construct a flexible statistical model for interpolating and extrapolating measurements, facilitating statistical integration of Kramers-Kronig relations to estimate the whole complex refractive index. Additionally, we statistically model the anchoring points used in subtractive Kramers-Kronig relations to account for possible measurement errors of the anchor point. In addition to flexible statistical modeling, the mixtures of Gaussian process formulation enables automatic selection of measurement points to use for extrapolation. We apply the method to experimental absorption spectrum measurements of gallium arsenide, potassium chloride, and transparent wood.

2603.26327 2026-03-30 stat.ME cs.LG

Making Multi-Axis Models Robust to Multiplicative Noise: How, and Why?

Bailey Andrew, David R. Westhead, Luisa Cutillo

Comments 9 pages (26 with supplemental), 4 figures (+2 in supplemental), preprint

Abstract

In this paper we develop a graph-learning algorithm, MED-MAGMA, to fit multi-axis (Kronecker-sum-structured) models corrupted by multiplicative noise. This type of noise is natural in many application domains, such as that of single-cell RNA sequencing, in which it naturally captures technical biases of RNA sequencing platforms. Our work is evaluated against prior work on each and every public dataset in the Single Cell Expression Atlas under a certain size, demonstrating that our methodology learns networks with better local and global structure. MED-MAGMA is made available as a Python package (MED-MAGMA).

2603.26309 2026-03-30 stat.AP cs.LG q-fin.RM

Semi-structured multi-state delinquency model for mortgage default

Victor Medina-Olivares, Wangzhen Xia, Stefan Lessmann, Nadja Klein

Abstract

We propose a semi-structured discrete-time multi-state model to analyse mortgage delinquency transitions. This model combines an easy-to-understand structured additive predictor, which includes linear effects and smooth functions of time and covariates, with a flexible neural network component that captures complex nonlinearities and higher-order interactions. To ensure identifiability when covariates are present in both components, we orthogonalise the unstructured part relative to the structured design. For discrete-time competing transitions, we derive exact transformations that map binary logistic models to valid competing transition probabilities, avoiding the need for continuous-time approximations. In simulations, our framework effectively recovers structured baseline and covariate effects while using the neural component to detect interaction patterns. We demonstrate the method using the Freddie Mac Single-Family Loan-Level Dataset, employing an out-of-time test design. Compared with a structured generalised additive benchmark, the semi-structured model provides modest but consistent gains in discrimination across the earliest prediction spans, while maintaining similar Brier scores. Adding macroeconomic indicators provides limited incremental benefit in this out-of-time evaluation and does not materially change the estimated borrower-, loan-, or duration-driven effects. Overall, semi-structured multi-state modelling offers a practical compromise between transparent effect estimates and flexible pattern learning, with potential applications beyond credit-transition forecasting.

2603.26301 2026-03-30 stat.ME math.CO math.PR math.ST stat.ML stat.TH

Complete Causal Identification from Ancestral Graphs under Selection Bias

Leihao Chen, Joris M. Mooij

Abstract

Many causal discovery algorithms, including the celebrated FCI algorithm, output a Partial Ancestral Graph (PAG). PAGs serve as an abstract graphical representation of the underlying causal structure, modeled by directed acyclic graphs with latent and selection variables. This paper develops a characterization of the set of extended-type conditional independence relations that are invariant across all causal models represented by a PAG. This theory allows us to formulate a general measure-theoretic version of Pearl's causal calculus and a sound and complete identification algorithm for PAGs under selection bias. Our results also apply when PAGs are learned by certain algorithms that integrate observational data with experimental data and incorporate background knowledge.

2603.26297 2026-03-30 stat.ME

Attribution of Spurious Factors from High-Dimensional Functional Time Series

Adam Nie, Yanrong Yang, Han Lin Shang, Yi He

Comments 35 pages, 7 figures, 1 table

Abstract

This article explores a general factor structure for high-dimensional nonstationary functional time series, encompassing a wide range of factor models studied in the existing literature. We investigate the asymptotic spectral behaviors of the sample covariance operator under this general data structure. A novel fundamental sufficient condition, formulated in terms of a newly introduced effective rank tailored to this setup, is established under which empirical eigen-analysis yields spurious results, rendering sample eigenvalues and eigenvectors unreliable for accurately recovering the underlying factor structure. This generalizes the results of Onatski and Wang [2021] from typical high-dimensional time series (HDTS) to the more intricate functional framework. The newly defined effective rank is rigorously analyzed through a decomposition of the effects attributable to functional factor loadings and functional factors. Contrary to the findings in the HDTS setting, empirical eigen-analysis of models with only a small number of strong non-stationary factors may still produce spurious limits in the functional framework. Therefore, additional caution is warranted when applying covariance-based statistical methods to potentially nonstationary functional data. Simulation studies are performed to determine conditions under which spurious limits occur. Real data analysis on age-specific mortality rate data from multiple locations is conducted for evidence of spurious factors induced by empirical eigen-analysis.

2603.26296 2026-03-30 stat.AP

Adaptation and Validation of the Turkish Version of the Large Language Model Dependency Scale (LLM-D12)

Tugba Coskun Aslan, Gulser Uncular, Hasan Durmus, Yasin Kavla, Arda Borlu, Sameha Alshakhsi, Ala Yankouskaya, Raian Ali

Abstract

This study aimed to adapt the Dual-Dimensional Scale of Instrumental and Relational Dependencies on Large Language Models (LLM-D12) into Turkish and evaluate its psychometric properties among regular LLM users. A sample of 387 participants (68.5% female; mean age = 25.22 +/- 7.13) completed the translated scale, which underwent cultural-linguistic validation through forward-backward translation and expert review. Confirmatory factor analysis supported the original two-factor structure after removing one item, with strong model fit (CFI = 0.993, RMSEA = 0.073). Internal consistency was high across both subscales: Cronbach's alpha = 0.831 (instrumental), 0.876 (relational), and 0.868 (total); McDonald's omega = 0.834, 0.880, and 0.900, respectively. Test-retest reliability and item monotonicity were satisfactory. External validity was demonstrated via significant associations with ATAI, IA, and PTLLM scores. Interestingly, the lack of association with need for cognition (NFC) suggests that LLM dependency may reflect strategic cognitive offloading rather than cognitive avoidance. The Turkish version of the LLM-D12 is a valid and reliable 11-item tool for assessing both instrumental and relational dependencies on LLMs.
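The internal-consistency figures quoted in this abstract are Cronbach's alpha values. As a reminder of how that statistic is computed, the sketch below applies the textbook formula $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_i s_i^2}{s_T^2}\right)$, where $k$ is the number of items, $s_i^2$ the sample variance of item $i$, and $s_T^2$ the sample variance of the total score. The function name and the response data are hypothetical and unrelated to the study's sample.

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha; items is a list of k lists, one score per respondent."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # total score per respondent
    item_variance_sum = sum(variance(column) for column in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))

# Hypothetical responses of five people to three items.
consistent = [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]
noisy = [[1, 2, 3, 4, 5], [2, 1, 4, 3, 5], [1, 3, 2, 5, 4]]

alpha_consistent = cronbach_alpha(consistent)
alpha_noisy = cronbach_alpha(noisy)
```

Items that move together perfectly give an alpha of 1, and disagreement between items lowers it.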

2603.26261 2026-03-30 cs.LG stat.ML

Contrastive Conformal Sets

Yahya Alkhatib, Wee Peng Tay

Abstract

Contrastive learning produces coherent semantic feature embeddings by encouraging positive samples to cluster closely while separating negative samples. However, existing contrastive learning methods lack principled guarantees on coverage within the semantic feature space. We extend conformal prediction to this setting by introducing minimum-volume covering sets equipped with learnable generalized multi-norm constraints. We propose a method that constructs conformal sets guaranteeing user-specified coverage of positive samples while maximizing negative sample exclusion. We establish theoretically that volume minimization serves as a proxy for negative exclusion, enabling our approach to operate effectively even when negative pairs are unavailable. The positive inclusion guarantee inherits the distribution-free coverage property of conformal prediction, while negative exclusion is maximized through learned set geometry optimized on a held-out training split. Experiments on simulated and real-world image datasets demonstrate improved inclusion-exclusion trade-offs compared to standard distance-based conformal baselines.

2603.26225 2026-03-30 math.ST stat.TH

Dependencies in Multiplex Networks: A Motif Count Approach

Karl Sawaya, Sofia Olhede

Abstract

Multiplex networks are a powerful framework for representing systems with multiple types of interactions among a common set of entities. Understanding their structure requires statistical tools capturing higher-order cross-layer correlations. We develop a comprehensive framework for estimating and testing dependence in exchangeable multiplex networks through motif counts. We first propose a moment-based estimation methodology that extends the multi-layer stochastic block model network histogram to arbitrary motif counts. This allows us to estimate the $2^d-1$ graphons defining a $d$-layer multiplex network. We then derive the joint asymptotic distribution of cross-layer motif counts, that is, aligned motifs shared across layers. Extending existing results from the unilayer setting, we show that the limiting distribution in the motif-regular case exhibits a covariance structure involving minimum-based distances between graphons. Finally, we construct hypothesis tests to detect inter-layer similarity and dependence. This work provides a rigorous extension of motif-count asymptotics and inference procedures to the multiplex setting, providing new tools to study high-order dependencies in complex networks.

2603.26166 2026-03-30 stat.ME

Unifying the Hoover and Gini indices: Analytical, bias, and computational aspects

Roberto Vila, Helton Saulo, Felipe Quintino

Comments 19 pages, 2 figures

Abstract

We propose a new family of inequality indices that bridges the Hoover index and the Gini coefficient. The measure is defined as the normalized expected absolute value of a convex combination of deviations from the mean and pairwise differences, providing a continuous interpolation between these two classical indices. We establish key theoretical properties, including scale invariance, boundedness, continuity, and compliance with the Pigou-Dalton transfer principle. Analytical representations are derived, allowing explicit evaluation under gamma distributions and leading to closed-form expressions involving incomplete gamma functions. From a statistical perspective, we study the plug-in estimator, obtaining a general expression for its expectation and explicit formulas for its bias under gamma populations. Simulation results indicate good finite-sample performance, with decreasing bias and mean squared error as the sample size increases. An empirical application to GDP per capita data illustrates the practical usefulness of the proposed index as a flexible tool for inequality analysis.
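For reference, the two classical endpoints of this family can be computed from a sample as sketched below. This is only an illustration of the standard Hoover and Gini sample definitions; the interpolating index itself is not reproduced because the abstract does not state its normalization, and the function names are hypothetical.

```python
import numpy as np

def hoover_index(x):
    """Hoover index: half the mean absolute deviation from the mean, relative to the mean."""
    x = np.asarray(x, dtype=float)
    return np.abs(x - x.mean()).sum() / (2.0 * x.size * x.mean())

def gini_index(x):
    """Gini coefficient: half the mean absolute pairwise difference, relative to the mean."""
    x = np.asarray(x, dtype=float)
    # Sum of |x_i - x_j| over all ordered pairs (i, j).
    pairwise = np.abs(x[:, None] - x[None, :]).sum()
    return pairwise / (2.0 * x.size**2 * x.mean())

incomes = [1.0, 2.0, 3.0]  # illustrative data, not from the paper
print(hoover_index(incomes), gini_index(incomes))
```

Both functions return 0 for perfectly equal incomes, and both are unchanged when all incomes are multiplied by the same positive constant, which is the scale invariance mentioned in the abstract.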

2603.26097 2026-03-30 cs.LG cs.AI stat.ML

Dynamic Tokenization via Reinforcement Patching: End-to-end Training and Zero-shot Transfer

Yulun Wu, Sravan Kumar Ankireddy, Samuel Sharpe, Nikita Seleznev, Dehao Yuan, Hyeji Kim, Nam H. Nguyen

Abstract

Efficiently aggregating spatial or temporal horizons to acquire compact representations has become a unifying principle in modern deep learning models, yet learning data-adaptive representations for long-horizon sequence data, especially continuous sequences like time series, remains an open challenge. While fixed-size patching has improved scalability and performance, discovering variable-sized, data-driven patches end-to-end often forces models to rely on soft discretization, specific backbones, or heuristic rules. In this work, we propose Reinforcement Patching (ReinPatch), the first framework to jointly optimize a sequence patching policy and its downstream sequence backbone model using reinforcement learning. By formulating patch boundary placement as a discrete decision process optimized via Group Relative Policy Gradient (GRPG), ReinPatch bypasses the need for continuous relaxations and performs dynamic patching policy optimization in a natural manner. Moreover, our method allows strict enforcement of a desired compression rate, freeing the downstream backbone to scale efficiently, and naturally supports multi-level hierarchical modeling. We evaluate ReinPatch on time-series forecasting datasets, where it demonstrates compelling performance compared to state-of-the-art data-driven patching strategies. Furthermore, our detached design allows the patching module to be extracted as a standalone foundation patcher, providing the community with visual and empirical insights into the segmentation behaviors preferred by a purely performance-driven neural patching strategy.

2603.26048 2026-03-30 stat.ML cs.LG math.ST stat.TH

Asymptotic Optimism for Tensor Regression Models with Applications to Neural Network Compression

Haoming Shi, Eric C. Chi, Hengrui Luo

Comments 62 pages, 11 figures

Abstract

We study rank selection for low-rank tensor regression under random covariates design. Under a Gaussian random-design model and some mild conditions, we derive population expressions for the expected training-testing discrepancy (optimism) for both CP and Tucker decomposition. We further demonstrate that the optimism is minimized at the true tensor rank for both CP and Tucker regression. This yields a prediction-oriented rank-selection rule that aligns with cross-validation and extends naturally to tensor-model averaging. We also discuss conditions under which under- or over-ranked models may appear preferable, thereby clarifying the scope of the method. Finally, we showcase its practical utility on a real-world image regression task and extend its application to tensor-based compression of neural networks, highlighting its potential for model selection in deep learning.

2603.26026 2026-03-30 stat.OT

Hybrid physics-data driven spectral forecasts of semisubmersible response

Ian Milne, Lachlan Astfalck, Matthew Zed, Jack Lee-Kopij, Edward Cripps

Abstract

A framework for probabilistic forecasting of vessel motion is developed and validated for a semisubmersible operating in long period swell. Bayesian statistical methods are applied to predictions of the heave response from a physics model using numerical wave spectra and measured motion data. Model diagnoses motivate an additional level of complexity required for the error structure in the Bayesian model, specifically to account for heteroskedasticity and time-correlated errors. The hybrid model forecasts were evaluated during periods where the heave resonance and cancellation frequencies were excited. The method is demonstrated to be effective for providing reliable quantification of uncertainty and correcting bias in the raw physics model predictions. This justifies its value for improving the efficiency and safety of offshore operations.

2603.24999 2026-03-30 stat.AP cs.AI

Efficient Detection of Bad Benchmark Items with Novel Scalability Coefficients

Michael Hardy, Joshua Gilbert, Benjamin Domingue

Abstract

The validity of assessments, from large-scale AI benchmarks to human classrooms, depends on the quality of individual items, yet modern evaluation instruments often contain thousands of items with minimal psychometric vetting. We introduce a new family of nonparametric scalability coefficients based on interitem isotonic regression for efficiently detecting globally bad items (e.g., miskeyed, ambiguously worded, or construct-misaligned). The central contribution is the signed isotonic $R^2$, which measures the maximal proportion of variance in one item explainable by a monotone function of another while preserving the direction of association via Kendall's $τ$. Aggregating these pairwise coefficients yields item-level scores that sharply separate problematic items from acceptable ones without assuming linearity or committing to a parametric item response model. We show that the signed isotonic $R^2$ is extremal among monotone predictors (it extracts the strongest possible monotone signal between any two items) and show that this optimality property translates directly into practical screening power. Across three AI benchmark datasets (HS Math, GSM8K, MMLU) and two human assessment datasets, the signed isotonic $R^2$ consistently achieves top-tier AUC for ranking bad items above good ones, outperforming or matching a comprehensive battery of classical test theory, item response theory, and dimensionality-based diagnostics. Crucially, the method remains robust under the small-n/large-p conditions typical of AI evaluation, requires only bivariate monotone fits computable in seconds, and handles mixed item types (binary, ordinal, continuous) without modification. It is a lightweight, model-agnostic filter that can materially reduce the reviewer effort needed to find flawed items in modern large-scale evaluation regimes.

2603.24970 2026-03-30 econ.EM stat.ME

Randomization Inference For the Always-Reporter Average Treatment Effect

Haoge Chang, Zeyang Yu

Abstract

This article studies randomization inference for treatment effects in randomized controlled trials with attrition, where outcomes are observed for only a subset of units. We assume monotonicity in reporting behavior as in Lee (2009) and focus on the average treatment effect for always-reporters (AR-ATE), defined as units whose outcomes are observed under both treatment and control. Because always-reporter status is only partially revealed by observed assignment and response patterns, we propose a worst-case randomization test that maximizes the randomization p-value over all always-reporter configurations consistent with the data, with an optional pretest to prune implausible configurations. Using studentized Hajek- and chi-square-type statistics, we show the resulting procedure is finite-sample valid for the sharp null and asymptotically valid for the weak null. We also discuss computational implementations for discrete outcomes and integer-programming-based bounds for continuous outcomes.

2603.24771 2026-03-30 stat.ME stat.ML

Identifiable Deep Latent Variable Models for MNAR Data

Huiming Xie, Fei Xue, Xiao Wang

Abstract

Missing data is a ubiquitous challenge in data analysis, often leading to biased and inaccurate results. Traditional imputation methods usually assume that the missingness mechanism is missing-at-random (MAR), where the missingness is independent of the missing values themselves. This assumption is frequently violated in real-world scenarios, which has prompted recent advances in deep-learning-based imputation methods to address this challenge. However, these methods neglect the crucial issue of nonparametric identifiability in missing-not-at-random (MNAR) data, which can lead to biased and unreliable results. This paper seeks to bridge this gap by proposing a novel framework based on deep latent variable models for MNAR data. Building on the assumption of conditional no self-censoring given latent variables, we establish the identifiability of the data distribution. This crucial theoretical result guarantees the feasibility of our approach. To effectively estimate unknown parameters, we develop an efficient algorithm utilizing importance-weighted autoencoders. We demonstrate, both theoretically and empirically, that our estimation process accurately recovers the ground-truth joint distribution under specific regularity conditions. Extensive simulation studies and real-world data experiments showcase the advantages of our proposed method compared to various classical and state-of-the-art approaches to missing data imputation.

2603.24763 2026-03-30 math.ST cs.LG stat.ML stat.TH

Binary Expansion Group Intersection Network

Sicheng Zhou, Kai Zhang

Abstract

Conditional independence is central to modern statistics, but beyond special parametric families it rarely admits an exact covariance characterization. We introduce the binary expansion group intersection network (BEGIN), a distribution-free graphical representation for multivariate binary data and bit-encoded multinomial variables. For arbitrary binary random vectors and bit representations of multinomial variables, we prove that conditional independence is equivalent to a sparse linear representation of conditional expectations, to a block factorization of the corresponding interaction covariance matrix, and to block diagonality of an associated generalized Schur complement. The resulting graph is indexed by the intersection of multiplicative groups of binary interactions, yielding an analogue of Gaussian graphical modeling beyond the Gaussian setting. This viewpoint treats data bits as atoms and local BEGIN molecules as building blocks for large Markov random fields. We also show how dyadic bit representations allow BEGIN to approximate conditional independence for general random vectors under mild regularity conditions. A key technical device is the Hadamard prism, a linear map that links interaction covariances to group structure.

2603.19778 2026-03-30 cs.CE math.ST stat.TH

Uniform Maximum Projection Designs for Computer Experiments

Miroslav Vořechovský, Jan Mašek

Comments Accepted in Computers and Structures

Journal ref
Computers and Structures, ISSN 1879-2243 (0045-7949), 325:108209, 2026
Abstract

Space-filling experimental designs are widely used in engineering computer experiments, where only a limited number of expensive model evaluations can be afforded. Distance-based designs such as Maximin or Minimax ensure global space-filling, while Latin hypercube sampling enforces uniform one-dimensional projections, yet neither guarantees uniformity in low-dimensional subspaces. Maximum Projection (MaxPro) designs were introduced to improve uniformity in low-dimensional subspaces, yet their original formulation relies on the Euclidean distance and may induce systematic density distortions in bounded domains. We demonstrate that the standard MaxPro criterion leads to statistically non-uniform sampling, resulting in undersampling of corner regions and biased Monte Carlo estimates. To remedy this issue, we introduce a periodic variant of the criterion, termed Uniform Maximum Projection (uMaxPro), in which the Euclidean metric is replaced by a periodic distance based on the minimum image convention. The proposed uMaxPro designs preserve the projection-aware structure of MaxPro while achieving statistical uniformity of the design-generation mechanism. Numerical experiments show unbiased Monte Carlo integration with reduced variance, excellent subspace projection performance, and competitive discrepancy properties. The methodology is further validated on benchmark engineering problems, including a meso-scale finite element model of concrete, demonstrating improved accuracy in surrogate modeling and probabilistic estimation. The resulting criterion provides a simple and computationally efficient modification of MaxPro that enhances its robustness for nonadaptive computer experiments. The construction algorithm, open-source implementation, and reproducible optimized designs are provided to facilitate practical adoption of the method.
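The minimum image convention mentioned in this abstract can be illustrated with a short sketch. It assumes the design domain is the unit hypercube with wrap-around in every coordinate; the function name and the `length` parameter are illustrative, and the uMaxPro criterion itself is not reproduced here.

```python
import numpy as np

def periodic_distance(x, y, length=1.0):
    """Euclidean distance on a periodic box: each coordinate uses the shorter way around."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    delta = np.abs(x - y) % length
    # Minimum image: a gap of 0.9 on a unit circle is really a gap of 0.1.
    delta = np.minimum(delta, length - delta)
    return float(np.sqrt(np.sum(delta**2)))

a, b = [0.05, 0.5], [0.95, 0.5]
print(np.linalg.norm(np.subtract(a, b)))  # plain Euclidean distance
print(periodic_distance(a, b))            # periodic distance
```

Under this metric, two points near opposite faces of the domain are treated as neighbours, so no location is privileged by the boundary.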

2603.02460 2026-03-30 stat.ML cs.LG

Conformal Graph Prediction with Z-Gromov Wasserstein Distances

Gabriel Melo, Thibaut de Saivre, Anna Calissano, Florence d'Alché-Buc

Abstract

Supervised graph prediction addresses regression problems where the outputs are structured graphs. Although several approaches exist for graph-valued prediction, principled uncertainty quantification remains limited. We propose a conformal prediction framework for graph-valued outputs, providing distribution-free coverage guarantees in structured output spaces. Our method defines nonconformity via the Z-Gromov-Wasserstein distance, instantiated in practice through Fused Gromov-Wasserstein (FGW), enabling permutation invariant comparison between predicted and candidate graphs. To obtain adaptive prediction sets, we introduce Score Conformalized Quantile Regression (SCQR), an extension of Conformalized Quantile Regression (CQR) to handle complex output spaces such as graph-valued outputs. We evaluate the proposed approach on a synthetic task.
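The distribution-free coverage guarantee rests on the standard split-conformal calibration step, sketched below for generic nonconformity scores. The scores here are plain numbers standing in for graph distances; no FGW computation or SCQR step is implemented, and the function names are hypothetical.

```python
import math

def conformal_threshold(calibration_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        # Too few calibration points for this level: the set must contain everything.
        return math.inf
    return sorted(calibration_scores)[k - 1]

def in_prediction_set(candidate_score, threshold):
    """A candidate output is kept when its nonconformity score is within the threshold."""
    return candidate_score <= threshold

scores = [3.0, 1.0, 4.0, 1.5, 9.0, 2.6, 5.0, 3.5, 8.0]  # illustrative values
threshold = conformal_threshold(scores, alpha=0.5)
print(threshold, in_prediction_set(2.0, threshold))
```

With exchangeable calibration and test data, keeping every candidate whose score is at most this threshold gives coverage of at least 1 - alpha, whatever distance defines the score.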

2602.20396 2026-03-30 cs.LG stat.ME

cc-Shapley: Measuring Multivariate Feature Importance Needs Causal Context

Jörg Martin, Stefan Haufe

Abstract

Explainable artificial intelligence promises to yield insights into relevant features, thereby enabling humans to examine and scrutinize machine learning models or even facilitating scientific discovery. Considering the widespread technique of Shapley values, we find that purely data-driven operationalization of multivariate feature importance is unsuitable for such purposes. Even for simple problems with two features, spurious associations due to collider bias and suppression arise from considering one feature only in the observational context of the other, which can lead to misinterpretations. Causal knowledge about the data-generating process is required to identify and correct such misleading feature attributions. We propose cc-Shapley (causal context Shapley), an interventional modification of conventional observational Shapley values leveraging knowledge of the data's causal structure, thereby analyzing the relevance of a feature in the causal context of the remaining features. We show theoretically that this eradicates spurious association induced by collider bias. We compare the behavior of Shapley and cc-Shapley values on various synthetic and real-world datasets. We observe nullification or reversal of associations compared to univariate feature importance when moving from observational to cc-Shapley.

2601.02226 2026-03-30 stat.AP

Initial data analysis of the national German transplantation registry with a focus on kidney transplantation

Lukas Klein, Gunter Grieser, Carl-Ludwig Fischer-Fröhlich, Axel Rahmel, Henrik Stahl, Andreas Wienke, Antje Jahn-Eimermacher

Comments 31 pages, 9 figures, 1 supplementary document, Submitted to BMC Medical Research Methodology

Abstract

This study presents an Initial Data Analysis (IDA) of the German Transplantation Registry (TxReg) data to improve understanding of the data and to inform future analyses. The IDA focuses on data on first-time kidney-only transplantations in adult recipients from deceased donors between 2006 and 2016 and refers to data from 14,954 recipients and 9,964 donors across 25 tables. Investigated aspects include missing data patterns and structure, data consistency, and availability of event time data. Results show that missing data proportions vary widely, with some tables nearly complete while others have over 50% missing values. Missing data patterns are identified using a decision tree approach. An influx and outflux analysis demonstrates that some variables have high potential for imputing missing data, while others are less suitable for imputation. We identified 168 multi-sourced variables that are reported by multiple data providers in parallel, leading to discrepancies for some variables but also providing opportunities for missing data imputation. Our findings on event time data demonstrate the importance of carefully selecting the variables used for event time analyses, as results will strongly depend on this selection. In summary, our findings highlight the challenges of utilizing the TxReg data for research and provide recommendations for data preprocessing and analysis in future work.

2512.23138 2026-03-30 astro-ph.IM astro-ph.SR cs.LG stat.ML

Why Machine Learning Models Systematically Underestimate Extreme Values II: How to Fix It with LatentNN

Yuan-Sen Ting

Comments 17 pages, 7 figures. Published in the Open Journal of Astrophysics

Abstract

Attenuation bias -- the systematic underestimation of regression coefficients due to measurement errors in input variables -- affects astronomical data-driven models. For linear regression, this problem was solved by treating the true input values as latent variables to be estimated alongside model parameters. In this paper, we show that neural networks suffer from the same attenuation bias and that the latent variable solution generalizes directly to neural networks. We introduce LatentNN, a method that jointly optimizes network parameters and latent input values by maximizing the joint likelihood of observing both inputs and outputs. We demonstrate the correction on one-dimensional regression, multivariate inputs with correlated features, and stellar spectroscopy applications. LatentNN reduces attenuation bias across a range of signal-to-noise ratios where standard neural networks show large bias. This provides a framework for improved neural network inference in the low signal-to-noise regime characteristic of astronomical data. This bias correction is most effective when measurement errors are less than roughly half the intrinsic data range; it is less effective in the regime of very low signal-to-noise and few informative features. Code is available at https://github.com/tingyuansen/LatentNN.
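The attenuation bias described here is easy to reproduce in the linear case. The simulation below is a minimal sketch with made-up parameters: it fits ordinary least squares on noisy inputs and compares the slope with the classical attenuation factor var(x) / (var(x) + var(error)). It does not implement LatentNN.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_slope = 2.0
sigma_x, sigma_err = 1.0, 1.0  # spread of true inputs and of their measurement error

x_true = rng.normal(0.0, sigma_x, n)
x_obs = x_true + rng.normal(0.0, sigma_err, n)      # inputs as actually measured
y = true_slope * x_true + rng.normal(0.0, 0.1, n)   # outputs depend on the true inputs

# Ordinary least-squares slope of y on the noisy inputs.
naive_slope = np.cov(x_obs, y)[0, 1] / np.var(x_obs, ddof=1)

# Classical attenuation factor for linear regression with noisy inputs.
attenuation = sigma_x**2 / (sigma_x**2 + sigma_err**2)
print(naive_slope, true_slope * attenuation)
```

With equal input and noise variances the fitted slope is about half the true slope, so predictions at extreme inputs are pulled toward the mean.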

2512.03321 2026-03-30 stat.CO

Numerical optimization for the compatibility constant of the lasso

Kei Hirose

Abstract

The compatibility constant plays an important role in evaluating the prediction error of the lasso in high-dimensional settings. However, the computation of the compatibility constant is generally difficult because it is a complicated nonconvex optimization problem. In this study, we present a numerical approach to compute the compatibility constant when the support of true regression coefficients is given. We show that the optimization problem reduces to a quadratic programming (QP) once the signs of the nonzero coefficients are specified. In this case, the compatibility constant can be obtained by solving QPs for all possible sign combinations. We also formulate a mixed-integer QP (MIQP) approach that can be applied when the number of true nonzero coefficients is large. We investigate the finite-sample behavior of the compatibility constant for simulated data under various parameter settings and compare the prediction error with its theoretical upper bound. The behavior of the compatibility constant in finite samples is also investigated through a real data analysis.

2511.19234 2026-03-30 stat.ME stat.AP

Integrating Complex Covariate Transformations in Generalized Additive Models

Claudia Collarin, Matteo Fasiolo, Yannig Goude, Simon N. Wood

Abstract

Transformations of covariates are widely used in applied statistics to improve interpretability and to satisfy assumptions required for valid inference. More broadly, feature engineering encompasses a wider set of practices aimed at enhancing predictive performance, and is typically performed as part of a data pre-processing step. In contrast, this paper integrates a substantial component of the feature engineering process directly into the modelling stage. This is achieved by introducing a novel general framework for embedding interpretable covariate transformations within multi-parameter Generalised Additive Models (GAMs). Our framework accommodates any sufficiently differentiable scalar-valued transformation of potentially high-dimensional and complex covariates. These transformations are treated as integral model components, with their parameters estimated jointly with regression coefficients via maximum a posteriori (MAP) methods, and joint uncertainty quantified via approximate Bayesian techniques. Smoothing parameters are selected in an empirical Bayes framework using a Laplace approximation to the marginal likelihood, supported by efficient computation based on implicit differentiation methods. We demonstrate the flexibility and practical value of the proposed methodology through applications to forecasting electricity net-demand in Great Britain and to modelling house prices in London. Methods for building and fitting GAMs with nested transformations are provided by the gamFactory R package, available at https://github.com/mfasiolo/gamFactory, while the code for reproducing the results in this paper is available at https://doi.org/10.5281/zenodo.19239350.

2511.04206 2026-03-30 math.ST stat.TH

Goodness-of-fit testing of the distribution of posterior classification probabilities for validating model-based clustering

Salima El Kolei, Matthieu Marbac

Abstract

We present the first method for assessing the relevance of a model-based clustering result in a general framework. Standard validation criteria, like the adjusted Rand index, rely on external labels to assess partition accuracy; consequently, they are inapplicable to real-world clustering problems where labels are missing. In contrast, our method offers an internal goodness-of-fit diagnostic, since it evaluates the validity of the clustering mechanism by testing the specification of the posterior probabilities of classification defined on the unit simplex. Because this simplex dimension is fixed by the number of clusters, the procedure naturally circumvents the curse of dimensionality, making it applicable to high-dimensional data where traditional density-based tests fail. The testing procedure requires only a consistent estimator of the parameters and the associated posterior classification probabilities for each observation, and its implementation is straightforward, as no additional model fitting is needed. Under the null hypothesis, the method exploits the fact that any functional transformation of the posterior probabilities has the same expectation under both the model being tested and the true data-generating process. The resulting goodness-of-fit test is constructed via an empirical likelihood approach with a growing number of moment conditions, allowing asymptotic detection of any alternative. A block-splitting strategy, employed to account for parameter estimation, provides a vector of test statistics that behave like a vector of independent chi-square random variables. Therefore, the goodness-of-fit of the posterior classification probabilities is assessed via the goodness-of-fit of the vector of empirical likelihood ratio test statistics. 
Hence, based on the distribution of this vector of statistics, different goodness-of-fit tests (e.g., Kolmogorov-Smirnov) can be used to investigate the distribution of the vector of test statistics with an exact asymptotic significance level.

2510.27643 2026-03-30 stat.ML cs.LG cs.NA math.NA math.OC stat.CO

Bayesian Optimization on Networks

Wenwen Li, Daniel Sanz-Alonso, Ruiyi Yang

Comments 40 pages, 10 figures; includes appendices

Abstract

This paper studies optimization on networks modeled as metric graphs. Motivated by applications where the objective function is expensive to evaluate or only available as a black box, we develop Bayesian optimization algorithms that sequentially update a Gaussian process surrogate model of the objective to guide the acquisition of query points. To ensure that the surrogates are tailored to the network's geometry, we adopt Whittle-Matérn Gaussian process prior models defined via stochastic partial differential equations on metric graphs. In addition to establishing regret bounds for optimizing sufficiently smooth objective functions, we analyze the practical case in which the smoothness of the objective is unknown and the Whittle-Matérn prior is represented using finite elements. Numerical results demonstrate the effectiveness of our algorithms for optimizing benchmark objective functions on a synthetic metric graph and for Bayesian inversion via maximum a posteriori estimation on a telecommunication network.

2510.11239 2026-03-30 stat.CO

Spline Interpolation on Compact Riemannian Manifolds

Charlie Sire, Mike Pereira, Thomas Romary

Abstract

Spline interpolation is a widely used class of methods for solving interpolation problems by constructing smooth interpolants that minimize a regularized energy functional involving the Laplacian operator. While many existing approaches focus on Euclidean domains or the sphere, relying on the spectral properties of the Laplacian, this work introduces a method for spline interpolation on general manifolds by exploiting its equivalence with kriging. Specifically, the proposed approach uses finite element approximations of random fields defined over the manifold, based on Gaussian Markov Random Fields and a discretization of the Laplace-Beltrami operator on a triangulated mesh. This framework enables the modeling of spatial fields with smooth variations and local anisotropies via domain deformation. The method is first validated on the sphere using both analytical test cases and a pollution-related study, and is compared to the classical spherical harmonics-based method. Additional experiments on the surface of a cylinder further illustrate the generality of the approach.

2510.07235 2026-03-30 math.ST stat.ME stat.TH

A Bernstein polynomial approach for the estimation of cumulative distribution functions in the presence of missing data

Rihab Gharbi, Wissem Jedidi, Salah Khardani, Frédéric Ouimet

Comments 33 pages, 2 figures, 10 tables

Abstract

We study nonparametric estimation of univariate cumulative distribution functions (CDFs) pertaining to data missing at random. The proposed estimators smooth the inverse probability weighted (IPW) empirical CDF with the Bernstein operator, yielding monotone, $[0,1]$-valued curves that automatically adapt to bounded supports. We analyze two versions: a pseudo estimator that uses known propensities and a feasible estimator that uses propensities estimated nonparametrically from discrete auxiliary variables, the latter scenario being much more common in practice. For both, we derive pointwise bias and variance expansions, establish the optimal polynomial degree $m$ with respect to the mean integrated squared error, and prove the asymptotic normality. A key finding is that the feasible estimator has a smaller variance than the pseudo estimator by an explicit nonnegative correction term. We also develop an efficient degree selection procedure via least-squares cross-validation. Monte Carlo experiments show that, for small to moderate sample sizes, the Bernstein-smoothed pseudo and feasible estimators outperform their unsmoothed counterparts and the integrated version of the IPW kernel density estimator proposed by Dubnicka (2009), under certain models. A real-data application to fasting plasma glucose from the 2017-2018 NHANES survey illustrates the method in a practical setting. All code needed to reproduce our analyses is readily accessible on GitHub.
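The construction can be sketched as follows, under several assumptions that are not taken from the paper: the data lie in [0, 1], the IPW weights are normalized to sum to one, and the propensities are treated as known (the "pseudo" setting). The function names and the simulated data are illustrative only.

```python
import math
import numpy as np

def ipw_ecdf(t, y, observed, propensity):
    """Normalized inverse-probability-weighted empirical CDF at t."""
    w = observed / propensity  # missing responses get weight zero
    return float(np.sum(w * (y <= t)) / np.sum(w))

def bernstein_cdf(x, y, observed, propensity, m):
    """Bernstein smoothing: sum_k F_n(k/m) * C(m, k) * x^k * (1 - x)^(m - k)."""
    total = 0.0
    for k in range(m + 1):
        basis = math.comb(m, k) * x**k * (1.0 - x) ** (m - k)
        total += ipw_ecdf(k / m, y, observed, propensity) * basis
    return total

rng = np.random.default_rng(1)
n = 500
z = rng.integers(0, 2, n)                # discrete auxiliary variable
y = rng.beta(2.0, 3.0, n)                # responses on (0, 1)
propensity = np.where(z == 1, 0.9, 0.5)  # probability of observing y given z
observed = (rng.random(n) < propensity).astype(float)

grid = np.linspace(0.0, 1.0, 11)
smooth = [bernstein_cdf(x, y, observed, propensity, m=20) for x in grid]
print(np.round(smooth, 3))
```

Because the weighted empirical CDF is nondecreasing and the Bernstein basis functions sum to one, the smoothed curve is monotone and stays in [0, 1], matching the properties stated in the abstract.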

2510.05646 2026-03-30 stat.AP math.ST stat.TH

Geographically Weighted Regression for Air Quality Low-Cost Sensor Calibration

Jean-Michel Poggi, Bruno Portier, Emma Thulliez

Abstract

This article focuses on the use of the Geographically Weighted Regression (GWR) method to correct measurements from low-cost air quality sensors. These sensors are of major interest in the current era of high-resolution air quality monitoring at the urban scale, but they require calibration against reference analyzers. Results for NO2 are provided, along with comments on the estimated GWR model and the spatial content of the estimated coefficients. The study was carried out using the publicly available SensEURCity dataset for Antwerp, which is especially relevant because it includes 9 reference stations and 34 low-cost sensors collocated and deployed within the city.

2510.03587 2026-03-30 stat.CO stat.ME stat.ML

Exact and Approximate MCMC for Doubly-intractable Probabilistic Graphical Models Leveraging the Underlying Independence Model

Yujie Chen, Antik Chakraborty, Anindya Bhadra

Comments To appear in Proceedings of the 29th International Conference on Artificial Intelligence and Statistics (AISTATS) 2026, Tangier, Morocco

Abstract

Bayesian inference for doubly-intractable pairwise exponential graphical models typically involves variations of the exchange algorithm or approximate Markov chain Monte Carlo (MCMC) samplers. However, existing methods for both classes of algorithms require either perfect samplers or sequential samplers for complex models, which are often either not available, or suffer from poor mixing, especially in high dimensions. We develop a method that does not require perfect or sequential sampling, and can be applied to both classes of methods: exact and approximate MCMC. The key to our approach is to utilize the tractable independence model underlying the intractable probabilistic graphical model for the purpose of constructing a finite sample unbiased Monte Carlo (and not MCMC) estimate of the Metropolis--Hastings ratio. This innovation turns out to be crucial for scalability in high dimensions. The method is demonstrated on the Ising model. Gradient-based alternatives to construct a proposal, such as Langevin and Hamiltonian Monte Carlo approaches, also arise as a natural corollary to our general procedure, and are demonstrated as well.

2509.25419 2026-03-30 stat.ME stat.CO

Bias-Reduced Estimation of Structural Equation Models

Haziq Jamil, Yves Rosseel, Oliver Kemp, Ioannis Kosmidis

Journal ref
Structural Equation Modeling: A Multidisciplinary Journal (2026)
Abstract

Finite-sample bias is a pervasive challenge in the estimation of structural equation models (SEMs), especially when sample sizes are small or measurement reliability is low. A range of methods have been proposed to improve finite-sample bias in the SEM literature, ranging from analytic bias corrections to resampling-based techniques, with each carrying trade-offs in scope, computational burden, and statistical performance. We apply the reduced-bias M-estimation framework (RBM, Kosmidis & Lunardon, 2024, J. R. Stat. Soc. Series B Stat. Methodol.) to SEMs. The RBM framework is attractive as it requires only first- and second-order derivatives of the log-likelihood, which renders it both straightforward to implement, and computationally more efficient compared to resampling-based alternatives such as bootstrap and jackknife. It is also robust to departures from modelling assumptions. Using the same simulation setup as in Dhaene and Rosseel (2022), we illustrate that RBM estimators consistently reduce mean bias in the estimation of SEMs without inflating mean squared error. They also deliver improvements in both median bias and inference relative to maximum likelihood estimators, while maintaining robustness under non-normality. Our findings suggest that RBM offers a promising, practical, and broadly applicable tool for mitigating bias in the estimation of SEMs, particularly in small-sample research contexts.

2507.18951 2026-03-30 math.AP math.ST stat.CO stat.TH

Elliptic Bayesian Inverse Problems on Metric Graphs

David Bolin, Wenwen Li, Daniel Sanz-Alonso

Comments 38 pages, 7 figures, including appendices

Details
English abstract

This paper studies the formulation, well-posedness, and numerical solution of Bayesian inverse problems on metric graphs, in which the edges represent one-dimensional wires connecting vertices. We focus on the inverse problem of recovering the diffusion coefficient of a (fractional) elliptic equation on a metric graph from noisy measurements of the solution. Well-posedness hinges on both stability of the forward model and an appropriate choice of prior. We establish the stability of elliptic and fractional elliptic forward models using recent regularity theory for differential equations on metric graphs. For the prior, we leverage modern Gaussian Whittle--Matérn process models on metric graphs with sufficiently smooth sample paths. Numerical results demonstrate accurate reconstruction and effective uncertainty quantification.

2505.17564 2026-03-30 stat.AP math.ST stat.TH

Using low-cost sensors to improve NO2 concentration maps derived from physico-chemical models

Emma Thulliez, Camille Coron

Details
English abstract

Urban air quality is a major concern today. Concentrations of pollutants, such as nitrogen dioxide, must be monitored to ensure that they do not exceed hazardous thresholds. For this reason, scarce reference stations, which are generally managed by air quality monitoring associations, are located in major cities. Two recent approaches enable fine-scale mapping of pollutant concentrations. The first relies on deterministic physico-chemical models that incorporate the street network and compute concentration estimates on a grid, producing spatial maps. The second is based on the emergence of low-cost sensors, which enable monitoring organizations to increase the density of their measurement networks. However, these sensors are unreliable and require regular and substantial calibration. We propose to combine these approaches and improve maps generated by deterministic models by integrating data from multiple sensor networks. Specifically, we model the bias of deterministic models and estimate its parameters using measurements, through a Bayesian nested framework. Our approach simultaneously enables the calibration of low-cost sensors and the correction of deterministic model outputs. This method, although general, is applied to the city of Rouen (France), combining outputs of the physico-chemical model SIRANE (Soulhac et al. 2011) and the measurements provided both by 4 reference monitoring stations and 10 low-cost sensors during December 2022. Results show that the method indeed corrects the concentration maps, reducing the root mean squared error by approximately 12.4%, and that low-cost sensors play an essential role in this correction.

2504.06215 2026-03-30 stat.ME econ.EM

Randomization Inference in Two-Sided Market Experiments

Jizhou Liu, Azeem M. Shaikh, Panos Toulis

Details
English abstract

Randomized experiments are increasingly employed in two-sided markets, such as buyer--seller platforms, to evaluate the effects of marketplace interventions. These experiments must reflect the underlying two-sided market structure in their design and can therefore be challenging to analyze. In this paper, we develop a randomization inference framework for outcomes from two-sided experiments, with a focus on testing and inference for two-sided spillover effects. Our approach is finite-sample valid under sharp null hypotheses. Regarding weak null hypotheses, we find that the commonly used Neyman-style studentization does not universally ensure asymptotic validity, and we document how it depends on the specific formulation of the null. We then propose a two-way variance estimator for studentization that restores asymptotic validity. We further propose methods to improve testing power by exploiting the two-sided structure of the problem, which we validate empirically. We demonstrate our methods through a series of simulation studies and an applied example from a network experiment in micro-lending.

2502.18253 2026-03-30 econ.GN q-fin.EC stat.AP

Enhancing External Validity of Experiments with Ongoing Sampling

Chen Wang, Shichao Han, Shan Huang

Details
English abstract

Participants in online experiments often enroll over time, which can compromise sample representativeness due to temporal shifts in covariates. This issue is particularly critical in A/B tests, online controlled experiments extensively used to evaluate product updates, since these tests are cost-sensitive and typically short in duration. We propose a novel framework that dynamically assesses sample representativeness by dividing the ongoing sampling process into three stages. We then develop stage-specific estimators for Population Average Treatment Effects (PATE), ensuring that experimental results remain generalizable across varying experiment durations. Leveraging survival analysis, we develop a heuristic function that identifies these stages without requiring prior knowledge of population or sample characteristics, thereby keeping implementation costs low. Our approach bridges the gap between experimental findings and real-world applicability, enabling product decisions to be based on evidence that accurately represents the broader target population. We validate the effectiveness of our framework on three levels: (1) through a real-world online experiment conducted on WeChat; (2) via a synthetic experiment; and (3) by applying it to 600 A/B tests on WeChat in a platform-wide application. Additionally, we provide practical guidelines for practitioners to implement our method in real-world settings.

2502.14950 2026-03-30 quant-ph cs.LG math.ST stat.ML stat.TH

Symmetric observations without symmetric causal explanations

Christian William, Patrick Remy, Jean-Daniel Bancal, Yu Cai, Nicolas Brunner, Alejandro Pozas-Kerstjens

Comments 8+3 pages, 4+1 figures, RevTeX 4.2. The computational appendix is available at https://www.github.com/apozas/symmetric-causal. V2: published version

Details
Journal ref
Phys. Rev. A 113, 032219 (2026)
English abstract

Inferring causal models from observed correlations is a challenging task, crucial to many areas of science. In order to alleviate the computational effort when sifting through possible causal explanations for some given observations, it is important to know whether symmetries in the observations correspond to symmetries in the underlying realization so that one can quickly discard impossible explanations. Via an explicit example, we demonstrate that, in general, symmetries cannot be exploited to reduce the hypothesis space. We use a tripartite probability distribution over binary events that is realized by using three (different) independent sources of classical randomness. We prove that even removing the condition that the sources distribute systems described by classical physics, the requirements that (i) the sources distribute the same physical systems, (ii) these physical systems respect relativistic causality, and (iii) the correlations are the observed ones are incompatible.

2502.01557 2026-03-30 cs.LG math.DS stat.ML

How iteration order influences convergence and stability in deep learning

Benoit Dherin, Benny Avelin, Anders Karlsson, Hanna Mazzawi, Javier Gonzalvo, Michael Munn

Details
English abstract

Despite exceptional achievements, training neural networks remains computationally expensive and is often plagued by instabilities that can degrade convergence. While learning rate schedules can help mitigate these issues, finding optimal schedules is time-consuming and resource-intensive. This work explores theoretical issues concerning training stability in the constant-learning-rate (i.e., without schedule) and small-batch-size regime. Surprisingly, we show that the composition order of gradient updates affects stability and convergence in gradient-based optimizers. We illustrate this new line of thinking using backward-SGD, which produces parameter iterates at each step by reversing the usual forward composition order of batch gradients. Our theoretical analysis shows that in contractive regions (e.g., around minima) backward-SGD converges to a point while the standard forward-SGD generally only converges to a distribution. This leads to improved stability and convergence, which we demonstrate experimentally. While full backward-SGD is computationally intensive in practice, it highlights that the extra freedom of modifying the usual iteration composition by creatively reusing previous batches at each optimization step may have important beneficial effects in improving training. Our experiments provide a proof of concept supporting this phenomenon. To our knowledge, this represents a new and unexplored avenue in deep learning optimization.
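The difference between the two composition orders can be illustrated with a minimal sketch. It is not the authors' code: it assumes a one-dimensional quadratic loss per batch, so that each SGD step is the contraction `T_i(theta) = theta - lr * (theta - b_i)`, and it reads "forward" as applying `T_1` first and "backward" as applying `T_n` first. The names `sgd_map`, `forward_iterate`, and `backward_iterate` are hypothetical.

```python
import numpy as np

def sgd_map(theta, b, lr):
    """One SGD step on the toy per-batch loss 0.5 * (theta - b)**2 (assumed loss)."""
    return theta - lr * (theta - b)

def forward_iterate(theta0, batches, lr):
    """Usual order: T_n(...T_2(T_1(theta0))), i.e. batch 1 is applied first."""
    theta = theta0
    for b in batches:
        theta = sgd_map(theta, b, lr)
    return theta

def backward_iterate(theta0, batches, lr):
    """Reversed order: T_1(T_2(...T_n(theta0))), i.e. the newest batch is applied first."""
    theta = theta0
    for b in batches[::-1]:
        theta = sgd_map(theta, b, lr)
    return theta

rng = np.random.default_rng(0)
batches = rng.normal(size=200)   # toy "batch targets" b_1, ..., b_200
lr = 0.1                         # constant learning rate

# Recompute each iterate from scratch, which is why the full backward
# scheme is expensive: step n needs all n maps again.
fwd = [forward_iterate(0.0, batches[:n], lr) for n in range(1, 201)]
bwd = [backward_iterate(0.0, batches[:n], lr) for n in range(1, 201)]

print("last forward step size :", abs(fwd[-1] - fwd[-2]))
print("last backward step size:", abs(bwd[-1] - bwd[-2]))
```

In this toy setting consecutive backward iterates differ by `(1 - lr)**(n - 1) * lr * (b_n - theta0)`, so the sequence settles to a point, whereas each forward step still moves by `lr * (b_n - theta_{n-1})`, which keeps fluctuating with the batches.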

2501.03501 2026-03-30 stat.AP

Modeling Cell Developmental Trajectory using Multinomial Unbalanced Optimal Transport

Junhao Zhu, Kevin Zhang, Zhaolei Zhang, Dehan Kong

Details
English abstract

Single-cell trajectory analysis aims to reconstruct the biological developmental processes of cells as they evolve over time, leveraging temporal correlations in gene expression. During cellular development, gene expression patterns typically change and vary across different cell types. A significant challenge in this analysis is that RNA sequencing destroys the cell, making it impossible to track gene expression across multiple stages for the same cell. Recent advances have introduced the use of optimal transport tools to model the trajectory of individual cells. In this paper, our focus shifts to a question of greater practical importance: we examine the differentiation of cell types over time. Specifically, we propose a novel method based on discrete unbalanced optimal transport to model the developmental trajectory of cell types. Our method detects biological changes in cell types and infers their transitions to different states by analyzing the transport matrix. We validated our method using single-cell RNA sequencing data from mouse embryonic fibroblasts. The results accurately identified major developmental changes in cell types, which were corroborated by experimental evidence. Furthermore, the inferred transition probabilities between cell types are highly congruent with the biological ground truth.

2408.00949 2026-03-30 cs.LG math.GR math.RT stat.ML

Equivariant neural networks and piecewise linear representation theory

Joel Gibson, Daniel Tubbenhauer, Geordie Williamson

Comments 23 pages, many figures, revision, to appear in Contemp. Math., comments welcome

Details
Journal ref
Contemp. Math. 829 (2025), 157-192
English abstract

Equivariant neural networks are neural networks with symmetry. Motivated by the theory of group representations, we decompose the layers of an equivariant neural network into simple representations. The nonlinear activation functions lead to interesting nonlinear equivariant maps between simple representations. For example, the rectified linear unit (ReLU) gives rise to piecewise linear maps. We show that these considerations lead to a filtration of equivariant neural networks, generalizing Fourier series. This observation might provide a useful tool for interpreting equivariant neural networks.

2405.16885 2026-03-30 stat.ME q-bio.PE

Hidden Markov modelling of spatio-temporal dynamics of measles in 1750-1850 Finland

Tiia-Maria Pasanen, Jouni Helske, Tarmo Ketola

Details
Journal ref
Journal of Applied Statistics, 1-25. (2026)
English abstract

Real-world spatio-temporal datasets, and phenomena related to them, are often challenging to visualise or gain a general overview of. In order to summarise information encompassed in such data, we combine two well-known statistical modelling methods. To account for the spatial dimension, we use the intrinsic modification of the conditional autoregression, and incorporate it with the hidden Markov model, allowing the spatial patterns to vary over time. We apply our method to parish register data considering deaths caused by measles in Finland in 1750-1850, and gain novel insight into previously undiscovered infection dynamics. Five distinctive, recurring states, describing spatially and temporally differing infection burden and potential routes of spread, are identified. We also find that there is a change in the occurrences of the most typical spatial patterns circa 1812, possibly due to changes in communication networks after major administrative transformations in Finland.

2403.08079 2026-03-30 cs.SE stat.ME

BayesFLo: Bayesian fault localization of complex software systems

Yi Ji, Simon Mak, Ryan Lekivetz, Joseph Morgan

Details
English abstract

Software testing is essential for the reliable development of complex software systems. A key step in software testing is fault localization, which uses test data to pinpoint failure-inducing combinations for further diagnosis. Existing fault localization methods have two key limitations: they (i) largely do not incorporate domain and/or structural knowledge from test engineers, and (ii) do not provide a probabilistic assessment of risk for potential root causes. Such methods can thus fail to confidently whittle down the combinatorial number of potential root causes in complex systems, resulting in prohibitively high debugging costs. To address this, we propose a novel Bayesian fault localization framework called BayesFLo, which leverages a flexible Bayesian model for identifying potential root causes with probabilistic uncertainties. Using a carefully-specified prior on root cause probabilities, BayesFLo permits the integration of domain and structural knowledge via the principles of combination hierarchy and heredity, which capture the expected structure of failure-inducing combinations. We then develop new algorithms for efficient computation of posterior root cause probabilities, leveraging recent tools from integer programming and graph representations. Finally, we demonstrate the effectiveness of BayesFLo over existing methods in two fault localization case studies, the first on the Traffic Alert and Collision Avoidance System for aircraft collision avoidance, and the second on the Vulnerable Road User protection tests for safe autonomous driving.

2311.12634 2026-03-30 math.PR math.ST stat.TH

On $q$-Order Statistics

Malvina Vamvakari

Details
Journal ref
Seminaire Lotharingien de Combinatoire, 87B (2023), Article 9, 25 pp
English abstract

Building on the notion of $q$-integral introduced by Thomae in 1869, we introduce $q$-order statistics (that is, $q$-analogues of the classical order statistics, for $0<q<1$) which arise from dependent and not identically distributed $q$-continuous random variables and study their distributional properties. We study the $q$-distribution functions and the $q$-density functions of the relative $q$-ordered random variables. We focus on $q$-ordered variables arising from dependent and not identically $q$-uniformly distributed random variables and we derive their $q$-distributions, including $q$-power law, $q$-beta and $q$-Dirichlet distributions.

2311.02543 2026-03-30 stat.ME

Pairwise likelihood estimation and limited information goodness-of-fit test statistics for binary factor analysis models under complex survey sampling

Haziq Jamil, Irini Moustaki, Chris Skinner

Details
Journal ref
British Journal of Mathematical and Statistical Psychology, 2025, 78
English abstract

This paper discusses estimation and limited information goodness-of-fit test statistics in factor models for binary data using pairwise likelihood estimation and sampling weights. The paper extends the applicability of pairwise likelihood estimation for factor models with binary data to accommodate complex sampling designs. Additionally, it introduces two key limited information test statistics: the Pearson chi-squared test and the Wald test. To enhance computational efficiency, the paper introduces modifications to both test statistics. The performance of the estimation and the proposed test statistics under simple random sampling and unequal probability sampling is evaluated using simulated data.

2201.10300 2026-03-30 math.CA cs.NA math.NA math.ST stat.TH

The Inverse Problem for Single Trajectories of Rough Differential Equations

Thomas Morrish, Theodore Papamarkou, Anastasia Papavasiliou, Yang Zhao

Comments Final version, accepted for publication in the SIAM/ASA Journal on Uncertainty Quantification

Details
English abstract

Motivated by the need to develop a general framework for performing statistical inference for discretely observed random rough differential equations, our aim is to construct a geometric $p$-rough path ${\bf X}$ whose response $Y$, when driving a rough differential equation, matches the observed trajectory $y$. We call this the \textit{continuous inverse problem} and start by rigorously defining its solution. We then develop a framework where the solution can be constructed as a limit of solutions to appropriately designed \textit{discrete inverse problems}, so that convergence holds in $p$-variation. Our approach is based on calibrating the bounded variation paths whose limit defines the rough path `lift' of path $X$ to rough path ${\bf X}$ to the observed trajectory $y$. Moreover, we develop a general numerical algorithm for constructing the solution to the discrete inverse problem. The core idea of the algorithm is to use the signature representation of the path, iterating between the response and the control, each time correcting according to the required properties. We apply our framework to the case where the geometric $p$-rough path ${\bf X}$ is defined as the limit of piecewise linear paths in the $p$-variation topology. We express the discrete inverse problem for a fixed observation rate as a solution to a system of equations driven by piecewise linear paths and prove convergence to the solution of the continuous inverse problem for observation time $δ\to 0$. Finally, we show that, in this context, the numerical algorithm for solving the discrete inverse problem simplifies to an iterative simultaneous update of the local gradients and we prove that it converges in $p$-variation uniformly with respect to $δ$.

2603.26002 2026-03-30 math.ST math.PR stat.TH

Quasi-Banach spaces of random variables and stochastic processes

Yuriy Kozachenko, Yuriy Mlavets, Oleksandr Mokliachuk

Details
English abstract

This book develops the theory of quasi-Banach $K_σ$-spaces $\mathbf{F}_ψ(Ω)$, $\mathbf{F}_ψ^*(Ω)$, and $D_{V,W}(Ω)$ of random variables and stochastic processes, extending the classical framework of Orlicz spaces, $Sub_φ(Ω)$ and $V(φ,ψ)$ spaces. The book consists of eleven chapters. The first two chapters establish the foundational theory: stochastic processes from quasi-Banach $K_σ$-spaces are introduced, and the fundamental properties of $\mathbf{F}_ψ(Ω)$ are studied in detail. The third chapter derives distribution estimates for suprema of processes from $\mathbf{F}_ψ^*(Ω)$, and the fourth addresses approximation theory in $SF_ψ(Ω)$. The fifth chapter examines Orlicz spaces and their connections to $\mathbf{F}_ψ(Ω)$. Chapters six and seven treat the pre-Banach $K_σ$-spaces $D_{V,W}(Ω)$, establishing their essential properties and evaluating reliability and accuracy of stochastic process models. The eighth chapter provides norm distribution estimates in $L_p(T)$ for processes from $\mathbf{F}_ψ(Ω)$. The ninth chapter develops the Monte Carlo method for multiple integrals over $\mathbb{R}^n$ with prescribed reliability and accuracy. The final two chapters treat modeling of $Sub_φ(Ω)$ processes - subclasses of $K_σ$-spaces - with given reliability and accuracy in $L_p(T)$ and $C(T)$ respectively. The results are substantially based on the authors' original work and that of their co-authors.

2603.25970 2026-03-30 stat.AP stat.CO

Bayesian Deep Count Regression and Anomaly Detection: Evidence from GDELT Event Panels

Hsin-Hsiung Huang, Yuh-Haur Chen, Mahlon Scott

Details
English abstract

The Global Database of Events, Language and Tone (GDELT) provides geolocated event records that can be aggregated into weekly spatiotemporal panels of event counts across regions, actors, and event types. These panels are typically sparse, bursty, and overdispersed, so calibrated probabilistic forecasting is essential for monitoring rare surges. We propose Bayesian count regression pipelines that pair deterministic deep temporal encoders with negative binomial (NB2) and zero-inflated negative binomial (ZINB2) likelihood heads. Posterior predictive simulation yields predictive quantiles and right-tail probabilities that support both forecasting and anomaly scoring. For interpretable spillover attribution, we also fit a Bayesian generalised linear model with high-dimensional lagged cross-series predictors and a two-step screen-and-refit procedure under a three-parameter beta-normal (TPBN) shrinkage prior. To connect spillovers to directional statistics, active cross-region effects are mapped to geodesic bearings on the World Geodetic System 1984 ellipsoid (WGS84) and summarised using weighted circular moments, rose diagrams, and bearing-field maps. Simulations with known spillovers and conflict-panel case studies show accurate right-tail behaviour and a practical workflow for detecting and interpreting geopolitical shocks.

2603.25966 2026-03-30 math.PR math.ST stat.TH

Besov-Orlicz moduli of Brownian motion and polygonal partial sum processes

Fabian Mies

Details
English abstract

The sample paths of Brownian motion are known to admit the exact Besov-type smoothness exponent 1/2 when measured in the sub-Gaussian Orlicz norm. We extend these regularity results by deriving the exact limit of the sub-Gaussian Orlicz modulus for Brownian motion in Banach spaces, and we provide a rate of convergence towards this limiting value. The central technique is a new chaining bound for the Orlicz modulus of a stochastic process. The latter also applies to polygonal partial sum processes of functional random variables and allows us to strengthen Donsker's invariance principle to all function spaces on the Besov-Orlicz scale up to the exact modulus with exponent 1/2. For the critical case, we establish the thresholded weak convergence of the Besov-Orlicz seminorm of the partial sum process. The analytical results find application in a nonparametric statistical testing problem, where Besov-Orlicz statistics are shown to detect a broader range of alternatives compared to Hölderian multiscale statistics.

2603.25964 2026-03-30 stat.AP

Assessing Reporting Delays in ACLED Conflict Event Data

Faniry A. Razakason, Daniel Racek, Paul W. Thurner, Göran Kauermann

Details
English abstract

Timely and accurate conflict event data are essential for real-time monitoring, forecasting, and policy response. Yet near-real-time conflict datasets such as the Armed Conflict Location & Event Data Project (ACLED) are subject to reporting delays, that is, delays between event occurrence and first inclusion in the database. Such delays can introduce bias in short-term analyses and forecasts. This study provides a statistical analysis of reporting delays for African events recorded in ACLED's weekly releases from June 30, 2024, to June 1, 2025. Treating delay as a discrete time duration, we estimate grouped proportional hazards models with additive-linear and smooth terms incorporating event-level, spatial, and country-level covariates. Our results show that more than half of events are reported within two weeks, but delays vary systematically by event type, fatalities, geographic location, and political regime. Higher-fatality events are reported more quickly, while events in more restrictive political and informational environments tend to be reported more slowly. We also find substantial between-country heterogeneity, and country-specific analyses indicate that event-level effects differ across contexts. These findings show that reporting delays are structured rather than random and that real-time conflict analysis must account for them. More broadly, they provide an empirical foundation for developing nowcasting approaches to correct short-term underreporting in conflict event data.

2603.25934 2026-03-30 math.ST math.PR stat.ML stat.TH

Sharp Concentration Inequalities: Phase Transition and Mixing of Orlicz Tails with Variance

Yinan Shen, Jinchi Lv

Details
English abstract

In this work, we investigate how to develop sharp concentration inequalities for sub-Weibull random variables, including sub-Gaussian and sub-exponential distributions. Although the random variables may not be sub-Gaussian, the tail probability around the origin behaves as if they were sub-Gaussian, and the tail probability decay aligns with the Orlicz $Ψ_α$-tail elsewhere. Specifically, for independent and identically distributed (i.i.d.) $\{X_i\}_{i=1}^n$ with finite Orlicz norm $\|X\|_{Ψ_α}$, our theory unveils that there is an interesting phase transition at $α= 2$ in that $\mathbb{P}\left(\left|\sum_{i=1}^n X_i \right| \geq t\right)$ with $t > 0$ is upper bounded by $2\exp\left(-C\max\left\{\frac{t^2}{n\|X\|_{Ψ_α}^2},\frac{t^α}{ n^{α-1} \|X\|_{Ψ_α}^α}\right\}\right)$ for $α\geq 2$, and by $2\exp\left(-C\min\left\{\frac{t^2}{n\|X\|_{Ψ_α}^2},\frac{t^α}{ n^{α-1} \|X\|_{Ψ_α}^α}\right\}\right)$ for $1\leq α\leq 2$ with some positive constant $C$. In many scenarios, it is often necessary to distinguish the standard deviation from the Orlicz norm when the latter can exceed the former greatly. To accommodate this, we build a new theoretical analysis framework, and our sharp, flexible concentration inequalities involve the variance and a mixing of Orlicz $Ψ_α$-tails through the min and max functions. Our theory yields new, improved concentration inequalities even for the cases of sub-Gaussian and sub-exponential distributions with $α= 2$ and $1$, respectively. We further demonstrate our theory on martingales, random vectors, random matrices, and covariance matrix estimation. These sharp concentration inequalities can empower more precise non-asymptotic analyses across different statistical and machine learning applications.
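A short worked check of the two exponents quoted in the abstract may help locate the phase transition. It uses only the quantities stated there; the shorthand $K = \|X\|_{\Psi_\alpha}$ is introduced here for readability and is not the authors' notation.

```latex
% Shorthand (assumed here): K = \|X\|_{\Psi_\alpha}.
% The second exponent is the first one times a power of t/(nK):
\[
\frac{t^{\alpha}}{n^{\alpha-1}K^{\alpha}}
  = \frac{t^{2}}{nK^{2}}\left(\frac{t}{nK}\right)^{\alpha-2}.
\]
% Hence the two exponents coincide at t = nK, and for \alpha = 2 they are
% identical for every t, so the max and min bounds both reduce to
\[
\mathbb{P}\left(\left|\sum_{i=1}^{n} X_i\right| \geq t\right)
  \leq 2\exp\left(-\frac{C\,t^{2}}{n\,\|X\|_{\Psi_2}^{2}}\right).
\]
```

For $t \leq nK$ the factor $(t/(nK))^{\alpha-2}$ is at most 1 when $\alpha \geq 2$ and at least 1 when $\alpha \leq 2$, so the max (respectively the min) selects the sub-Gaussian term $t^2/(nK^2)$ in both regimes; for $t > nK$ the $t^\alpha$ term takes over. This matches the abstract's statement that the tail behaves as sub-Gaussian near the origin and as a $\Psi_\alpha$-tail elsewhere.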

2603.25919 2026-03-30 stat.ME

Regularized Regression by Composition: Identifiability, Structured Penalization, and Statistical Guarantees for Multi-Flow Distributional Models

Safaa K. Kadhem

Details
English abstract

Regression by composition provides a flexible framework for constructing conditional distributions through sequential group actions. However, when multiple flows act on the same distribution, the model becomes non-identifiable, leading to flat likelihood regions and unstable estimates. We introduce a structured regularization framework that resolves this issue by assigning flow-specific penalties. The resulting estimator is defined as a penalized maximum likelihood problem with heterogeneous regularization across flows. We establish theoretical properties, including identifiability under penalization, uniqueness of the minimizer via strict convexification, and asymptotic consistency. For the adaptive Lasso, we further prove the oracle property. An efficient proximal gradient algorithm handles non-smooth penalties. Extensive simulation studies evaluate performance under varying sample sizes, correlation structures, and signal-to-noise ratios, demonstrating that regularized methods (Lasso and Elastic Net) successfully break non-identifiability and achieve low estimation error with controlled false positive rates. An application to NHANES data on asthma and lead exposure illustrates the practical utility: the unregularized estimator yields implausible coefficients, whereas regularized estimators produce stable and interpretable models and automatically select the relevant risk transformation. The L'Abbé plots derived from regularized estimators indicate a protective effect of reducing lead exposure. The proposed framework bridges identifiability theory with penalized estimation and opens the door to high-dimensional and longitudinal extensions.

2603.25916 2026-03-30 cs.LG stat.ML

Parameter-Free Dynamic Regret for Unconstrained Linear Bandits

Alberto Rumi, Andrew Jacobsen, Nicolò Cesa-Bianchi, Fabio Vitale

Comments 10 pages. v1: AISTATS 2026

Details
English abstract

We study dynamic regret minimization in unconstrained adversarial linear bandit problems. In this setting, a learner must minimize the cumulative loss relative to an arbitrary sequence of comparators $\boldsymbol{u}_1,\ldots,\boldsymbol{u}_T$ in $\mathbb{R}^d$, but receives only point-evaluation feedback on each round. We provide a simple approach to combining the guarantees of several bandit algorithms, allowing us to optimally adapt to the number of switches $S_T = \sum_t\mathbb{I}\{\boldsymbol{u}_t \neq \boldsymbol{u}_{t-1}\}$ of an arbitrary comparator sequence. In particular, we provide the first algorithm for linear bandits achieving the optimal regret guarantee of order $\mathcal{O}\big(\sqrt{d(1+S_T) T}\big)$ up to poly-logarithmic terms without prior knowledge of $S_T$, thus resolving a long-standing open problem.
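The switch count $S_T$ in the abstract can be made concrete with a small sketch. It assumes the sum runs over $t = 2,\ldots,T$ (the abstract does not state the range) and that the comparators are stored as rows of a `(T, d)` array; the function name `count_switches` is hypothetical.

```python
import numpy as np

def count_switches(comparators):
    """Count rounds t >= 2 where the comparator u_t differs from u_{t-1}.

    `comparators` is a (T, d) array-like whose row t-1 holds u_t.
    The starting index t = 2 is an assumption of this sketch.
    """
    u = np.asarray(comparators, dtype=float)
    if u.shape[0] < 2:
        return 0
    # A switch occurs when any coordinate changes between consecutive rows.
    changed = np.any(u[1:] != u[:-1], axis=1)
    return int(changed.sum())

# Example: the comparator changes after round 2 and again after round 4.
sequence = [[0, 0], [0, 0], [1, 0], [1, 0], [0, 1]]
print(count_switches(sequence))
```

A static comparator gives $S_T = 0$, in which case the quoted rate $\sqrt{d(1+S_T)T}$ reduces to $\sqrt{dT}$.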

2603.25911 2026-03-30 stat.ME stat.ML

Robust Tensor-on-Tensor Regression

Mehdi Hirari, Fabio Centofanti, Mia Hubert, Stefan Van Aelst

Details
English abstract

Tensor-on-tensor (TOT) regression is an important tool for the analysis of tensor data, aiming to predict a set of response tensors from a corresponding set of predictor tensors. However, standard TOT regression is sensitive to outliers, which may be present in both the response and the predictor. It can be affected by casewise outliers, which are observations that deviate from the bulk of the data, as well as by cellwise outliers, which are individual anomalous cells within the tensors. The latter are particularly common due to the typically large number of cells in tensor data. This paper introduces a novel robust TOT regression method, named ROTOT, that can handle both types of outliers simultaneously, and can cope with missing values as well. This method uses a single loss function to reduce the influence of both casewise and cellwise outliers in the response. The outliers in the predictor are handled using a robust Multilinear Principal Component Analysis method. Graphical diagnostic tools are also proposed to identify the different types of outliers detected. The performance of ROTOT is evaluated through extensive simulations and further illustrated using the Labeled Faces in the Wild dataset, where ROTOT is applied to predict facial attributes.

2603.25910 2026-03-30 math.ST stat.TH

Finite-Time Observability of Oscillatory Instabilities in Synchronous p-bit Dynamics

Naoya Onizawa, Shunsuke Koshita, Takahiro Hanyu

Comments Submitted to Physical Review E

Details
English abstract

Synchronous update schemes in p-bit annealing offer a natural route to massive parallelism, but they can also induce period-2 oscillations that degrade optimization performance. In practical solvers, such oscillations matter only if they become observable within the finite runtime of the device or simulation, yet most existing analyses are formulated in terms of asymptotic stability. As a result, they do not directly address the experimentally relevant question of when oscillatory modes actually appear during finite-duration annealing. Here we develop a finite-time observability framework for synchronous tick-random p-bit dynamics. Starting from a linearized mean-field description, we derive a graph-dependent criterion that predicts whether unstable modes amplify enough within a finite observation window to produce visible signatures in quantities such as the one-step autocorrelation and the energy trace. This shifts the analysis from asymptotic instability to practical detectability and yields a principled estimate of the minimum reduction in synchrony required to suppress oscillations. We validate the framework on G-set benchmark graphs and illustrative graph families. The predicted thresholds capture both the graph dependence of oscillation onset and the finite-time conditions under which oscillations become practically observable. These results provide a graph-aware basis for selecting the update probability in synchronous p-bit annealers without relying on exhaustive instance-by-instance parameter sweeps.

2603.25869 2026-03-30 eess.IV cs.CV stat.ML

Learning to Recorrupt: Noise Distribution Agnostic Self-Supervised Image Denoising

Brayan Monroy, Jorge Bacca, Julián Tachella

Abstract

Self-supervised image denoising methods have traditionally relied on either architectural constraints or specialized loss functions that require prior knowledge of the noise distribution to avoid the trivial identity mapping. Among these, approaches such as Noisier2Noise or Recorrupted2Recorrupted create training pairs by adding synthetic noise to the noisy images. While effective, these recorruption-based approaches require precise knowledge of the noise distribution, which is often not available. We present Learning to Recorrupt (L2R), a noise distribution-agnostic denoising technique that eliminates the need for knowledge of the noise distribution. Our method introduces a learnable monotonic neural network that learns the recorruption process through a min-max saddle-point objective. The proposed method achieves state-of-the-art performance across unconventional and heavy-tailed noise distributions, such as log-gamma, Laplace, and spatially correlated noise, as well as signal-dependent noise models such as Poisson-Gaussian noise.

2603.25854 2026-03-30 stat.ME

Modeling with Categorical Features via Exact Fusion and Sparsity Regularisation

Kayhan Behdin, Riade Benbaki, Peter Radchenko, Rahul Mazumder

Comments Journal of Royal Statistical Society, Series B (to appear)

Abstract

We study the high-dimensional linear regression problem with categorical predictors that have many levels. We propose a new estimation approach, which performs model compression via two mechanisms by simultaneously encouraging (a) clustering of the regression coefficients to collapse some of the categorical levels together; and (b) sparsity of the regression coefficients. We present novel mixed integer programming formulations for our estimator, and develop a custom row generation procedure to speed up the exact off-the-shelf solvers. We also propose a fast approximate algorithm for our method that obtains high-quality feasible solutions via block coordinate descent. As the main building block of our algorithm, we develop an exact algorithm for the univariate case based on dynamic programming, which can be of independent interest. We establish new theoretical guarantees for both the prediction and the cluster recovery performance of our estimator. Our numerical experiments on synthetic and real datasets demonstrate that our proposed estimator tends to outperform the state-of-the-art.

2603.25838 2026-03-30 stat.ME

Causal Network Discovery from Interventional Count Data with Latent Linear DAGs

Yijiao Zhang, Hongzhe Li

Comments 35 pages, 5 figures

Abstract

The increasing availability of interventional data offers new opportunities for causal discovery, with gene perturbation studies providing a prominent example. Such data are typically count-valued and subject to substantial measurement error arising from technical variability and latent state heterogeneity. Motivated by these challenges, we study identification and estimation in latent linear structural causal models for interventional count data. We propose a latent linear Gaussian directed acyclic graph (DAG) model with Poisson measurement error that explicitly separates the latent causal structure from the observed counts. Under a mean-shift intervention design, we establish population-level identifiability of the latent causal DAG. Building on these identification results, we develop an estimation procedure based on sparse inverse matrix estimation and provide theoretical guarantees on estimation error and finite-sample causal discovery. Simulation studies and applications to Perturb-seq data demonstrate the practical effectiveness of the proposed method.

2603.25796 2026-03-30 stat.ML cs.AI cs.LG math.ST stat.TH

Beyond identifiability: Learning causal representations with few environments and finite samples

Inbeom Lee, Tongtong Jin, Bryon Aragam

Abstract

We provide explicit, finite-sample guarantees for learning causal representations from data with a sublinear number of environments. Causal representation learning seeks to provide a rigorous foundation for the general representation learning problem by bridging causal models with latent factor models in order to learn interpretable representations with causal semantics. Despite a blossoming theory of identifiability in causal representation learning, estimation and finite-sample bounds are less well understood. We show that causal representations can be learned with only a logarithmic number of unknown, multi-node interventions, and that the intervention targets need not be carefully designed in advance. Through a careful perturbation analysis, we provide a new analysis of this problem that guarantees consistent recovery of (a) the latent causal graph, (b) the mixing matrix and representations, and (c) \emph{unknown} intervention targets.

2603.25776 2026-03-30 stat.ML cs.LG

SAHMM-VAE: A Source-Wise Adaptive Hidden Markov Prior Variational Autoencoder for Unsupervised Blind Source Separation

Yuan-Hao Wei

Abstract

We propose SAHMM-VAE, a source-wise adaptive Hidden Markov prior variational autoencoder for unsupervised blind source separation. Instead of treating the latent prior as a single generic regularizer, the proposed framework assigns each latent dimension its own adaptive regime-switching prior, so that different latent dimensions are pulled toward different source-specific temporal organizations during training. Under this formulation, source separation is not implemented as an external post-processing step; it is embedded directly into variational learning itself. The encoder, decoder, posterior parameters, and source-wise prior parameters are optimized jointly, where the encoder progressively learns an inference map that behaves like an approximate inverse of the mixing transformation, while the decoder plays the role of the generative mixing model. Through this coupled optimization, the gradual alignment between posterior source trajectories and heterogeneous HMM priors becomes the mechanism through which different latent dimensions separate into different source components. To instantiate this idea, we develop three branches within one common framework: a Gaussian-emission HMM prior, a Markov-switching autoregressive HMM prior, and an HMM state-flow prior with state-wise autoregressive flow transformations. Experiments show that the proposed framework achieves unsupervised source recovery while also learning meaningful source-wise switching structures. More broadly, the method extends our structured-prior VAE line from smooth, mixture-based, and flow-based latent priors to adaptive switching priors, and provides a useful basis for future work on interpretable and potentially identifiable latent source modeling.

2603.25755 2026-03-30 physics.chem-ph cs.LG q-bio.QM stat.ML

KANEL: Kolmogorov-Arnold Network Ensemble Learning Enables Early Hit Enrichment in High-Throughput Virtual Screening

Pavel Koptev, Nikita Krainov, Konstantin Malkov, Alexander Tropsha

Comments 8 Pages

Abstract

Machine learning models of chemical bioactivity are increasingly used for prioritizing a small number of compounds in virtual screening libraries for experimental follow-up. In these applications, assessing model accuracy by early hit enrichment such as Positive Predicted Value (PPV) calculated for top N hits (PPV@N) is more appropriate and actionable than traditional global metrics such as AUC. We present KANEL, an ensemble workflow that combines interpretable Kolmogorov-Arnold Networks (KANs) with XGBoost, random forest, and multilayer perceptron models trained on complementary molecular representations (LillyMol descriptors, RDKit-derived descriptors, and Morgan fingerprints).
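To make the PPV@N metric concrete, here is a minimal sketch of how it could be computed from model scores and binary activity labels. The function name `ppv_at_n` and the toy data are illustrative assumptions, not part of the KANEL workflow described above.

```python
from typing import Sequence


def ppv_at_n(scores: Sequence[float], labels: Sequence[int], n: int) -> float:
    """Fraction of true actives among the n highest-scored compounds.

    scores: predicted activity scores (higher means more likely active).
    labels: 1 for an experimentally confirmed active, 0 otherwise.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not 0 < n <= len(scores):
        raise ValueError("n must be between 1 and the number of compounds")
    # Rank compound indices by descending score (ties keep input order).
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_n = ranked[:n]
    return sum(labels[i] for i in top_n) / n


# Hypothetical toy library of six compounds.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.2, 0.1]
toy_labels = [1, 0, 1, 0, 0, 1]
ppv_top3 = ppv_at_n(toy_scores, toy_labels, 3)  # two of the top three are active
```

Unlike AUC, this quantity depends only on the top of the ranking, which is the part of the list that would be sent for experimental follow-up.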

2602.04668 2026-03-30 math.ST stat.TH

Estimation of reliability and accuracy of models of $φ$-sub-Gaussian process using generating functions of polynomial expansions

Oleksandr Mokliachuk

Abstract

Stochastic processes are often represented through orthonormal series expansions, a framework originating in the classical works of Loève and Karhunen and widely used for simulation and numerical approximation. While truncation error in such expansions has been extensively studied, practical models frequently involve an additional source of error arising from the approximation of coefficient functions when closed-form expressions are unavailable. The combined effect of these two errors remains insufficiently addressed in the literature. Building on the author's earlier work on reliability and accuracy estimates for $φ$-sub-Gaussian processes, this paper extends the methodology to orthonormal polynomial systems that do not possess normalized generating functions in analytical form, including the Legendre, generalized Laguerre, and Gegenbauer families. New bounds are derived for models in $L_p(T)$ space that simultaneously account for truncation and coefficient approximation. The resulting criteria provide practical guidance for selecting the number of series terms required to achieve prescribed levels of reliability and accuracy across a broader class of polynomial-based stochastic process models.

2512.17374 2026-03-30 stat.ML math.OC

Generative modeling of conditional probability distributions on the level-sets of collective variables

Fatima-Zahrae Akhyar, Wei Zhang, Gabriel Stoltz, Christof Schütte

Abstract

Given a probability distribution $μ$ in $\mathbb{R}^d$ represented by data, we study in this paper the generative modeling of the corresponding conditional probability distributions on the level-sets of a collective variable $ξ:\mathbb{R}^d \rightarrow \mathbb{R}^k$, where $1 \le k<d$. We propose a general and efficient learning approach that can learn generative models on different level-sets of $ξ$ simultaneously. To improve the learning quality on level-sets in low-probability regions, we also propose a data enrichment strategy by utilizing data from enhanced sampling techniques. We demonstrate the effectiveness of our proposed learning approach through concrete numerical examples. The proposed approach is potentially useful for the generative modeling of molecular systems in biophysics.

2511.19742 2026-03-30 stat.AP stat.ME

Anchoring Convenience Survey Samples to a Baseline Census for Vaccine Coverage Monitoring in Global Health

Nathaniel Dyrkton, Shomoita Alam, Susan Shepherd, Ibrahim Sana, Kevin Phelan, Jay JH Park

Comments 5 figures, 2 tables. Includes updates to DGM, Results, and added clarification

Abstract

While conducting probabilistic surveys is the gold standard for assessing vaccine coverage, implementing these surveys poses challenges for global health. There is a need for a more convenient option that is more affordable and practical. Motivated by childhood vaccine monitoring programs in rural areas of Chad and Niger, we conducted a simulation study to evaluate calibration-weighted design-based and logistic regression-based imputation estimators of the finite-population proportion of MCV1 coverage. These estimators use a hybrid approach that anchors a non-probabilistic follow-up survey to a probabilistic baseline census to account for selection bias. We explored varying degrees of non-ignorable selection bias (odds ratios from 1.0-1.5), percentage of villages sampled (25-75%), and village-level survey response rate to the follow-up survey (50-80%). Our performance metrics included bias, coverage, and proportion of simulated 95% confidence intervals falling within equivalence margins of 5% and 7.5% (equivalence tolerance). For both adjustment methods, the performance worsened with higher selection bias and lower response rate and generally improved as a larger proportion of villages was sampled. Under the worst scenario, with an OR of 1.5, 25% of villages sampled, and a 50% survey response rate, both methods showed empirical biases of 2% or less, coverage below 95%, and low equivalence tolerances. In more realistic scenarios, our estimators showed lower biases and close to 95% coverage. For example, at OR$\leq$1.2, both methods showed high performance, except at the lowest village sampling and participation rates. Our simulations show that a hybrid anchoring survey approach is a feasible survey option for vaccine monitoring.

2511.09500 2026-03-30 stat.ML cs.LG math.ST stat.ME stat.TH

Distributional Shrinkage I: Universal Denoiser Beyond Tweedie's Formula

Tengyuan Liang

Comments 27 pages, 5 figures

Abstract

We study the problem of denoising when only the noise level is known, not the noise distribution. Independent noise $Z$ corrupts a signal $X$, yielding the observation $Y = X + σZ$ with known $σ\in (0,1)$. We propose \emph{universal} denoisers, agnostic to both signal and noise distributions, that recover the signal distribution $P_X$ from $P_Y$. When the focus is on distributional recovery of $P_X$ rather than on individual realizations of $X$, our denoisers achieve order-of-magnitude improvements over the Bayes-optimal denoiser derived from Tweedie's formula, which achieves $O(σ^2)$ accuracy. They shrink $P_Y$ toward $P_X$ with $O(σ^4)$ and $O(σ^6)$ accuracy in matching generalized moments and densities. Drawing on optimal transport theory, our denoisers approximate the Monge--Ampère equation with higher-order accuracy and can be implemented efficiently via score matching. Let $q$ denote the density of $P_Y$. For distributional denoising, we propose replacing the Bayes-optimal denoiser, $$\mathbf{T}^*(y) = y + σ^2 \nabla \log q(y),$$ with denoisers exhibiting less-aggressive distributional shrinkage, $$\mathbf{T}_1(y) = y + \frac{σ^2}{2} \nabla \log q(y),$$ $$\mathbf{T}_2(y) = y + \frac{σ^2}{2} \nabla \log q(y) - \frac{σ^4}{8} \nabla \!\left( \frac{1}{2} \| \nabla \log q(y) \|^2 + \nabla \cdot \nabla \log q(y) \right)\!.$$
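As a sanity check on the stated accuracy orders, the sketch below works through a one-dimensional special case that is an assumption of this example, not a setting taken from the paper: $X \sim N(0,1)$ and $Z \sim N(0,1)$, so $Y \sim N(0, 1+σ^2)$ and $\nabla \log q(y) = -y/(1+σ^2)$. All three maps are then linear in $y$, and the variance of the denoised output can be compared with the target $\mathrm{Var}(X) = 1$ in closed form. The function name `gaussian_output_variances` is hypothetical.

```python
def gaussian_output_variances(sigma: float) -> tuple[float, float, float]:
    """Variance of T*(Y), T_1(Y), T_2(Y) when X ~ N(0,1) and Z ~ N(0,1).

    Assumed toy setting: Y ~ N(0, s) with s = 1 + sigma^2, so the score is
    grad log q(y) = -y / s and each denoiser reduces to y times a constant.
    """
    s = 1.0 + sigma**2
    a = sigma**2 / s
    # T*(y)  = y + sigma^2 * score(y)              = (1 - a) * y
    var_star = s * (1.0 - a) ** 2
    # T_1(y) = y + (sigma^2 / 2) * score(y)        = (1 - a/2) * y
    var_1 = s * (1.0 - a / 2.0) ** 2
    # T_2(y): the extra term is -(sigma^4 / 8) * d/dy[ y^2/(2 s^2) - 1/s ]
    #         = -(a^2 / 8) * y, giving (1 - a/2 - a^2/8) * y
    var_2 = s * (1.0 - a / 2.0 - a**2 / 8.0) ** 2
    return var_star, var_1, var_2


def variance_errors(sigma: float) -> tuple[float, float, float]:
    """Absolute deviation of each output variance from Var(X) = 1."""
    return tuple(abs(v - 1.0) for v in gaussian_output_variances(sigma))


errors_coarse = variance_errors(0.2)
errors_fine = variance_errors(0.1)
```

In this toy case the variances simplify to $1/(1+σ^2)$ for $\mathbf{T}^*$ and $1 + σ^4/(4(1+σ^2))$ for $\mathbf{T}_1$, so the errors scale as $σ^2$ and $σ^4$ respectively, while the error for $\mathbf{T}_2$ scales as $σ^6$. This matches the orders quoted in the abstract for this single example only.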

2501.06360 2026-03-30 stat.ME

Borrowing Information from an Unidentifiable Model: Guaranteed Efficiency Gain with a Dichotomized Outcome in the External Data

Lu Wang, Yanyuan Ma, Jiwei Zhao

Abstract

In the era of big data, the increasing availability of diverse data sources has driven interest in analytical approaches that integrate information across sources to enhance statistical accuracy, efficiency, and scientific insights. Many existing methods assume exchangeability among data sources and often implicitly require that sources measure identical covariates or outcomes, or that the error distribution is correctly specified, assumptions that may not hold in complex real-world scenarios. This paper explores the integration of data from sources with distinct outcome scales, focusing on leveraging external data to improve statistical efficiency. Specifically, we consider a scenario where the primary dataset includes a continuous outcome, and external data provides a dichotomized version of the same outcome. We propose two novel estimators: the first estimator remains asymptotically consistent even when the error distribution is potentially misspecified, while the second estimator guarantees an efficiency gain over weighted least squares estimation that uses the primary study data alone. Theoretical properties of these estimators are rigorously derived, and extensive simulation studies are conducted to highlight their robustness and efficiency gains across various scenarios. Finally, a real-world application using the NHANES dataset demonstrates the practical utility of the proposed methods.

2412.04882 2026-03-30 cs.LG stat.ML

Nonmyopic Global Optimisation via Approximate Dynamic Programming

Filippo Airaldi, Bart De Schutter, Azita Dabiri

Comments 36 pages, 6 figures, 2 tables, submitted to Springer Machine Learning

Abstract

Global optimisation aims to optimise expensive-to-evaluate black-box functions without gradient information. Bayesian optimisation, one of the most well-known techniques, typically employs Gaussian processes as surrogate models, leveraging their probabilistic nature to balance exploration and exploitation. However, these processes become computationally prohibitive in high-dimensional spaces. Recent alternatives, based on inverse distance weighting (IDW) and radial basis functions (RBFs), offer competitive, computationally lighter solutions. Despite their efficiency, both traditional global and Bayesian optimisation strategies suffer from the myopic nature of their acquisition functions, which focus on immediate improvement while neglecting future implications of the sequential decision-making process. Nonmyopic acquisition functions devised for the Bayesian setting have shown promise in improving long-term performance. Yet, their combination with deterministic surrogate models remains unexplored. In this work, we introduce novel nonmyopic acquisition strategies tailored to IDW and RBF based on approximate dynamic programming paradigms, including rollout and multi-step scenario-based optimisation schemes, to enable lookahead acquisition. These methods optimise a sequence of query points over a horizon by predicting the evolution of the surrogate model, inherently managing the exploration-exploitation trade-off via optimisation techniques. The proposed approach represents a significant advance in extending nonmyopic acquisition principles, previously confined to Bayesian optimisation, to deterministic models. Empirical results on synthetic and hyperparameter tuning benchmark problems, a constrained problem, as well as on a data-driven predictive control application, demonstrate that these nonmyopic methods outperform conventional myopic approaches, leading to faster and more robust convergence.
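For readers unfamiliar with deterministic surrogates, the following is a minimal sketch of plain inverse distance weighting interpolation. It uses the textbook weights $w_i = 1/d_i^p$ and is an assumption of this example; the surrogate and acquisition functions used in the paper may differ. The name `idw_predict` and the sample data are hypothetical.

```python
import math
from typing import Sequence


def idw_predict(
    x: Sequence[float],
    sample_points: Sequence[Sequence[float]],
    sample_values: Sequence[float],
    power: float = 2.0,
) -> float:
    """Textbook IDW: weighted average of observed values, weights 1 / d**power."""
    numerator = 0.0
    denominator = 0.0
    for point, value in zip(sample_points, sample_values):
        distance = math.dist(x, point)
        if distance == 0.0:
            # The surrogate interpolates: return the observed value exactly.
            return value
        weight = 1.0 / distance**power
        numerator += weight * value
        denominator += weight
    return numerator / denominator


# Hypothetical evaluations of an expensive black-box function in 2-D.
observed_x = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
observed_f = [0.0, 2.0, 4.0]
midpoint_estimate = idw_predict((1.0, 0.0), observed_x[:2], observed_f[:2])
```

Because the prediction is a convex combination of the observed values, evaluating it costs one pass over the samples and requires no matrix factorisation, which is the sense in which such surrogates are computationally lighter than Gaussian processes.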

2307.15181 2026-03-30 econ.EM math.ST stat.ME stat.TH

On the Efficiency of Highly Stratified Experiments

Yuehao Bai, Jizhou Liu, Azeem M. Shaikh, Max Tabord-Meehan

Abstract

This paper studies the use of highly stratified designs for the efficient estimation of a large class of treatment effect parameters that arise in the analysis of experiments. By a "highly stratified" design, we mean experiments in which units are divided into blocks of a fixed size and a proportion within each block is assigned to a binary treatment uniformly at random. The class of parameters considered are those that can be expressed as the solution to a set of moment conditions constructed using a known function of the observed data. They include, among other things, average treatment effects, quantile treatment effects, and local average treatment effects as well as the counterparts to these quantities in experiments in which the unit is itself a cluster. In this setting, we establish three results. First, we show that under a highly stratified design, the naïve method of moments estimator achieves the same asymptotic variance as what could typically be attained under alternative treatment assignment mechanisms only through ex post covariate adjustment. Second, we argue that the naïve method of moments estimator under a highly stratified design is asymptotically efficient by deriving a lower bound on the asymptotic variance of regular estimators of the parameter of interest in the form of a convolution theorem. In this sense, highly stratified experiments are attractive because they lead to efficient estimators of treatment effect parameters "by design." Finally, we strengthen this conclusion by establishing conditions under which a "fast-balancing" property of highly stratified designs is in fact necessary for the naïve method of moments estimator to attain the efficiency bound.
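The assignment mechanism described above can be sketched as follows. This is an illustrative example only: it assumes units are already ordered so that consecutive units form a block (how blocks are formed is not specified here), and the name `stratified_assignment` and its parameters are hypothetical.

```python
import random


def stratified_assignment(
    n_units: int,
    block_size: int,
    n_treated_per_block: int,
    rng: random.Random,
) -> list[int]:
    """Assign a binary treatment within fixed-size blocks.

    Units start..start+block_size-1 form a block (assumed pre-sorted), and
    exactly n_treated_per_block of them are treated, chosen uniformly at random.
    """
    if n_units % block_size != 0:
        raise ValueError("n_units must be a multiple of block_size")
    if not 0 < n_treated_per_block < block_size:
        raise ValueError("each block needs both treated and control units")
    assignment = [0] * n_units
    for start in range(0, n_units, block_size):
        block = range(start, start + block_size)
        for unit in rng.sample(block, n_treated_per_block):
            assignment[unit] = 1
    return assignment


# Matched pairs: blocks of size 2 with one treated unit per block.
pairs = stratified_assignment(8, 2, 1, random.Random(0))
# Blocks of four with half of each block treated.
quads = stratified_assignment(12, 4, 2, random.Random(0))
```

Every block contributes the same number of treated units, so the treated fraction is fixed by construction rather than varying from draw to draw as it would under independent coin flips.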