arXivDaily: arXiv daily academic digest, updated Monday through Friday
2601.20830 2026-01-29 stat.ML cs.LG stat.CO

VSCOUT: A Hybrid Variational Autoencoder Approach to Outlier Detection in High-Dimensional Retrospective Monitoring

Waldyn G. Martinez

Abstract

Modern industrial and service processes generate high-dimensional, non-Gaussian, and contamination-prone data that challenge the foundational assumptions of classical Statistical Process Control (SPC). Heavy tails, multimodality, nonlinear dependencies, and sparse special-cause observations can distort baseline estimation, mask true anomalies, and prevent reliable identification of an in-control (IC) reference set. To address these challenges, we introduce VSCOUT, a distribution-free framework designed specifically for retrospective (Phase I) monitoring in high-dimensional settings. VSCOUT combines an Automatic Relevance Determination Variational Autoencoder (ARD-VAE) architecture with ensemble-based latent outlier filtering and changepoint detection. The ARD prior isolates the most informative latent dimensions, while the ensemble and changepoint filters identify pointwise and structural contamination within the determined latent space. A second-stage retraining step removes flagged observations and re-estimates the latent structure using only the retained inliers, mitigating masking and stabilizing the IC latent manifold. This two-stage refinement produces a clean and reliable IC baseline suitable for subsequent Phase II deployment. Extensive experiments across benchmark datasets demonstrate that VSCOUT achieves superior sensitivity to special-cause structure while maintaining controlled false alarms, outperforming classical SPC procedures, robust estimators, and modern machine-learning baselines. Its scalability, distributional flexibility, and resilience to complex contamination patterns position VSCOUT as a practical and effective method for retrospective modeling and anomaly detection in AI-enabled environments.

2601.20821 2026-01-29 stat.AP

A Survival Framework for Estimating Child Mortality Rates using Multiple Data Types

Katherine R Paulson, Taylor Okonek, Jon Wakefield

Abstract

Child mortality is an important population health indicator. However, many countries lack high-quality vital registration to measure child mortality rates precisely and reliably over time. Research endeavors such as those by the United Nations Inter-agency Group for Child Mortality Estimation (UN IGME) and the Global Burden of Disease (GBD) study leverage statistical models and available data to estimate child survival summaries including neonatal, infant, and under-five mortality rates. UN IGME fits separate models for each age group and the GBD uses a multi-step modeling process. We propose a Bayesian survival framework to estimate temporal trends in the probability of survival as a function of age, up to the fifth birthday, with a single model. Our framework integrates all data types that are used by UN IGME: household surveys, vital registration, and other pre-processed mortality rates. We demonstrate that our framework is applicable to any country using log-logistic and piecewise-exponential survival functions, and discuss findings for four example countries with diverse data profiles: Kenya, Brazil, Estonia, and Syrian Arab Republic. Our model produces estimates of the three survival summaries that are in broad agreement with both the data and the UN IGME estimates, but in addition gives the complete survival curve.

2601.20819 2026-01-29 stat.ML cs.LG

Demystifying Prediction Powered Inference

Yilin Song, Dan M. Kluger, Harsh Parikh, Tian Gu

Abstract

Machine learning predictions are increasingly used to supplement incomplete or costly-to-measure outcomes in fields such as biomedical research, environmental science, and social science. However, treating predictions as ground truth introduces bias while ignoring them wastes valuable information. Prediction-Powered Inference (PPI) offers a principled framework that leverages predictions from large unlabeled datasets to improve statistical efficiency while maintaining valid inference through explicit bias correction using a smaller labeled subset. Despite its potential, the growing PPI variants and the subtle distinctions between them have made it challenging for practitioners to determine when and how to apply these methods responsibly. This paper demystifies PPI by synthesizing its theoretical foundations, methodological extensions, connections to existing statistics literature, and diagnostic tools into a unified practical workflow. Using the Mosaiks housing price data, we show that PPI variants produce tighter confidence intervals than complete-case analysis, but that double-dipping, i.e. reusing training data for inference, leads to anti-conservative confidence intervals and coverages. Under missing-not-at-random mechanisms, all methods, including classical inference using only labeled data, yield biased estimates. We provide a decision flowchart linking assumption violations to appropriate PPI variants, a summary table of selective methods, and practical diagnostic strategies for evaluating core assumptions. By framing PPI as a general recipe rather than a single estimator, this work bridges methodological innovation and applied practice, helping researchers responsibly integrate predictions into valid inference.
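
The bias-correction idea described in this abstract can be illustrated for the simplest target, a population mean. The Python sketch below is a minimal illustration and not the paper's workflow: the helper name `ppi_mean`, the synthetic data, and the deliberately biased predictor are all assumptions made for the example.

```python
import numpy as np

def ppi_mean(y_labeled, pred_labeled, pred_unlabeled):
    """Prediction-powered estimate of E[Y] with a normal-approximation standard error.

    Hypothetical helper: the mean prediction on the large unlabeled set is
    corrected by the average residual (the "rectifier") on the labeled set.
    """
    y_labeled = np.asarray(y_labeled, dtype=float)
    pred_labeled = np.asarray(pred_labeled, dtype=float)
    pred_unlabeled = np.asarray(pred_unlabeled, dtype=float)
    rectifier = y_labeled - pred_labeled
    estimate = pred_unlabeled.mean() + rectifier.mean()
    # The two samples are independent, so their variances add.
    se = np.sqrt(pred_unlabeled.var(ddof=1) / len(pred_unlabeled)
                 + rectifier.var(ddof=1) / len(rectifier))
    return estimate, se

def biased_predictor(x):
    """Assumed stand-in for an ML model; it overshoots the truth by 0.5."""
    return 2.5 + x

rng = np.random.default_rng(0)
n_labeled, n_unlabeled = 200, 20000
x_lab = rng.normal(size=n_labeled)
x_unlab = rng.normal(size=n_unlabeled)
y_lab = 2.0 + x_lab + rng.normal(scale=0.5, size=n_labeled)  # true mean of Y is 2.0

est, se = ppi_mean(y_lab, biased_predictor(x_lab), biased_predictor(x_unlab))
naive = biased_predictor(x_unlab).mean()  # treats predictions as ground truth
print(f"naive: {naive:.3f}  prediction-powered: {est:.3f} (se {se:.3f})")
```

In this toy setting the naive estimate inherits the predictor's bias, while the rectified estimate is centered on the true mean. The predictor here is fixed in advance rather than trained on the labeled sample, so the example does not exhibit the double-dipping problem the abstract warns about.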

2601.20812 2026-01-29 stat.ME math.ST stat.TH

Effective Sample Size for Functional Spatial Data

Alfredo Alegría, John Gómez, Jorge Mateu, Ronny Vallejos

Abstract

The effective sample size quantifies the amount of independent information contained in a dataset, accounting for redundancy due to correlation between observations. While widely used in geostatistics for scalar data, its extension to functional spatial data has remained largely unexplored. In this work, we introduce a novel definition of the effective sample size for functional geostatistical data, employing the trace-covariogram as a measure of correlation, and show that it retains the intuitive properties of the classical scalar ESS. We illustrate the behavior of this measure using a functional autoregressive process, demonstrating how serial dependence and the allocation of variability across eigen-directions influence the resulting functional ESS. Finally, the approach is applied to a real meteorological dataset of geometric vertical velocities over a portion of the Earth, showing how the method can quantify redundancy and determine the effective number of independent curves in functional spatial datasets.
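
For orientation, the classical scalar ESS that this abstract generalizes can be computed directly from a correlation matrix. The sketch below uses one common geostatistical definition, ESS = 1ᵀR⁻¹1, and an exponential correlation function; both choices, as well as the function names, are assumptions for illustration. The paper's functional definition based on the trace-covariogram is not reproduced here.

```python
import numpy as np

def scalar_ess(corr):
    """Classical scalar effective sample size, 1^T R^{-1} 1 (assumed definition)."""
    corr = np.asarray(corr, dtype=float)
    ones = np.ones(corr.shape[0])
    return float(ones @ np.linalg.solve(corr, ones))

def exponential_corr(coords, range_param):
    """Exponential correlation exp(-distance / range) between all pairs of sites."""
    coords = np.asarray(coords, dtype=float)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return np.exp(-dist / range_param)

rng = np.random.default_rng(1)
coords = rng.uniform(size=(50, 2))  # 50 hypothetical sites in the unit square

ess_indep = scalar_ess(np.eye(50))                         # no correlation
ess_weak = scalar_ess(exponential_corr(coords, 0.01))      # very short range
ess_strong = scalar_ess(exponential_corr(coords, 1.0))     # long range
print(ess_indep, ess_weak, ess_strong)
```

With uncorrelated observations the ESS equals the number of sites, and it shrinks as the correlation range grows. For an equicorrelated matrix with common correlation ρ, the same formula reduces to n / (1 + (n − 1)ρ).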

2601.20809 2026-01-29 stat.ME

Joint estimation of the basic reproduction number and serial interval using Sequential Bayes

Tatiana Krikella, Jane M. Heffernan, Hanna Jankowski

Abstract

Early in an infectious disease outbreak, timely and accurate estimation of the basic reproduction number ($R_0$) and the serial interval (SI) is critical for understanding transmission dynamics and informing public health responses. While many methods estimate these quantities separately, and a small number jointly estimate them from incidence data, existing joint approaches are largely likelihood-based and do not fully exploit prior information. We propose a novel Bayesian framework for the joint estimation of $R_0$ and the serial interval using only case count data, implemented through a sequential Bayes approach. Our method assumes an SIR model and employs a mildly informative joint prior constructed by linking log-Gamma marginal distributions for $R_0$ and the SI via a Gaussian copula, explicitly accounting for their dependence. The prior is updated sequentially as new incidence data become available, allowing for real-time inference. We assess the performance of the proposed estimator through extensive simulation studies under correct model specification as well as under model misspecification, including when the true data come from an SEIR or SEAIR model, and under varying degrees of prior misspecification. Comparisons with the widely used White and Pagano likelihood-based joint estimator show that our approach yields substantially more precise and stable estimates of $R_0$, with comparable or improved bias, particularly in the early stages of an outbreak. Estimation of the SI is more sensitive to prior misspecification; however, when prior information is reasonably accurate, our method provides reliable SI estimates and remains more stable than the competing approach. We illustrate the practical utility of the proposed method using Canadian COVID-19 incidence data at both national and provincial levels.

2506.05544 2026-01-29 stat.ML cs.LG

Online Conformal Model Selection for Nonstationary Time Series

Shibo Li, Yao Zheng

Abstract

This paper introduces the MPS (Model Prediction Set), a novel framework for online model selection for nonstationary time series. Classical model selection methods, such as information criteria and cross-validation, rely heavily on the stationarity assumption and often fail in dynamic environments which undergo gradual or abrupt changes over time. Yet real-world data are rarely stationary, and model selection under nonstationarity remains a largely open problem. To tackle this challenge, we combine conformal inference with model confidence sets to develop a procedure that adaptively selects models best suited to the evolving dynamics at any given time. Concretely, the MPS updates in real time a confidence set of candidate models that covers the best model for the next time period with a specified long-run probability, while adapting to nonstationarity of unknown forms. Through simulations and real-world data analysis, we demonstrate that MPS reliably and efficiently identifies optimal models under nonstationarity, an essential capability lacking in offline methods. Moreover, MPS frequently produces high-quality sets with small cardinality, whose evolution offers deeper insights into changing dynamics. As a generic framework, MPS accommodates any data-generating process, data structure, model class, training method, and evaluation metric, making it broadly applicable across diverse problem settings.

2408.00014 2026-01-29 cs.DC stat.CO stat.ML

Optimization of Energy Consumption Forecasting in Puno using Parallel Computing and ARIMA Models: An Innovative Approach to Big Data Processing

Cliver W. Vilca-Tinta, Fred Torres-Cruz, Josefh J. Quispe-Morales

Comments In preparation for Journal Submission

Abstract

This research presents an innovative use of parallel computing with the ARIMA (AutoRegressive Integrated Moving Average) model to forecast energy consumption in Peru's Puno region. The study conducts a thorough and multifaceted analysis, focusing on the execution speed, prediction accuracy, and scalability of both sequential and parallel implementations. A significant emphasis is placed on efficiently managing large datasets. The findings demonstrate notable improvements in computational efficiency and data processing capabilities through the parallel approach, all while maintaining the accuracy and integrity of predictions. This new method provides a versatile and reliable solution for real-time predictive analysis and enhances energy resource management, which is particularly crucial for developing areas. In addition to highlighting the technical advantages of parallel computing in this field, the study explores its practical impacts on energy planning and sustainable development in regions like Puno.

2601.20725 2026-01-29 stat.AP

Comparing causal estimands from sequential nested versus single point target trials: A simulation study

Catherine Wiener, Chase D. Latour, Kathleen Hurwitz, Xiaojuan Li, Catherine R. Lesko, Alexander Breskin, M. Alan Brookhart

Comments 32 pages, 3 main figures, 3 supplemental figures

Abstract

Sequential nested trial (SNT) emulation is a powerful approach for maximizing precision and avoiding time-related biases. However, there exists little discussion about the implied causal estimands in comparison to a real-world single point trial. We used Monte Carlo simulation to compare treatment effect estimates from an SNT emulation that re-indexed patients annually and an SNT emulation with a treatment decision design to the estimates from a single point trial. We generated 5,000 cohorts of 5,000 people with 3 years of follow-up. For the single point trial, patients were randomized to initiate or not initiate treatment at Visit 1. For the SNT emulations, simulated patients could contribute up to two index dates. When disease severity did not modify the treatment effect, both SNT approaches returned treatment effect estimates identical to the single point trial. In the presence of treatment effect modification by disease severity, both SNT approaches returned treatment effect estimates that diverged from the single point trial even after confounding adjustment. These findings underscore the difficulties of interpreting causal estimands from an SNT emulation: the target population does not correspond to a single time point trial. Such implications are important for communicating study results for evidence-based decision-making.

2601.20710 2026-01-29 stat.ME

Two-dose vs. Three-Dose Optimization Under Sample Size Constraint

Linda Sun, Yixin Ren, Cong Chen

Comments 14 pages; 4 figures

Abstract

Dose optimization is a hallmark of Project Optimus for oncology drug development. The number of doses to include in a dose optimization study depends on the totality of evidence, which is often unclear in early-phase development. With equal sample sizes per dose, carrying three doses is clearly more advantageous than two for optimization. In this paper, we show that, even when the total sample size is fixed, it is still preferable to carry three unless there is very strong evidence that one can be dropped. A mathematical approximation is applied to guide the investigation, followed by a simulation study to complement the theoretical findings. Semi-quantitative guidance is provided for practitioners, addressing both randomized and non-randomized dose optimization while considering population homogeneity.

2601.20699 2026-01-29 cs.IT math.IT stat.AP

Reflected wireless signals under random spatial sampling

H. Paul Keeler

Abstract

We present a propagation model showing that a transmitter randomly positioned in space generates unbounded peaks in the histogram of the resulting power, provided the signal strength is an oscillating or non-monotonic function of distance. Specifically, these peaks are singularities in the empirical probability density that occur at turning point values of the deterministic propagation model. We explain the underlying mechanism of this phenomenon through a concise mathematical argument. This observation has direct implications for estimating random propagation effects such as fading, particularly when reflections off walls are involved. Motivated by understanding intelligent surfaces, we apply this fundamental result to a physical model consisting of a single transmitter between two parallel passive walls. We analyze signal fading due to reflections and observe power oscillations resulting from wall reflections, a phenomenon long studied in waveguides but relatively unexplored in wireless networks. For the special case where the transmitter is placed halfway between the walls, we present a compact closed-form expression for the received signal involving the Lerch transcendent function. The insights from this work can inform design decisions for intelligent surfaces deployed in cities.
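
The mechanism behind the unbounded peaks can be sketched with the standard change-of-variables formula for densities. The notation below (distance D with density f_D, propagation model g, power P = g(D)) is assumed for illustration, and the argument may differ from the one given in the paper.

```latex
% Assumed notation: D is the random transmitter distance with density f_D,
% g is the deterministic propagation model, and P = g(D) is the received power.
f_P(p) \;=\; \sum_{d \,:\, g(d) = p} \frac{f_D(d)}{\lvert g'(d) \rvert}.

% Near a turning point d^* with g'(d^*) = 0, g''(d^*) \neq 0, and p^* = g(d^*):
g(d) \approx p^* + \tfrac{1}{2}\, g''(d^*)\,(d - d^*)^2
\quad\Longrightarrow\quad
\lvert g'(d) \rvert \approx \sqrt{2\,\lvert g''(d^*) \rvert\,\lvert p - p^* \rvert}.

% The two roots on either side of d^* together contribute
f_P(p) \;\approx\; f_D(d^*)\,\sqrt{\frac{2}{\lvert g''(d^*) \rvert\,\lvert p - p^* \rvert}}
\qquad \text{as } p \to p^* \text{ from the side on which roots exist.}
```

Under these assumptions the density diverges like an inverse square root at each turning-point value p*, which is an integrable singularity: the histogram peak grows without bound as the bins shrink, while the total probability remains finite.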

2601.20643 2026-01-29 q-fin.PM stat.AP

Shrinkage Estimators for Mean and Covariance: Evidence on Portfolio Efficiency Across Market Dimensions

Rupendra Yadav, Amita Sharma, Aparna Mehra

Comments 29 pages, 3 figures

Abstract

The mean-variance model remains the most prevalent investment framework, built on diversification principles. However, it consistently struggles with estimation errors in expected returns and the covariance matrix, its core parameters. To address this concern, this research evaluates the performance of mean-variance (MV) and global minimum-variance (GMV) models across various shrinkage estimators designed to improve these parameters. Specifically, we examine five shrinkage estimators for expected returns and eleven for the covariance matrix. To compare multiple portfolios, we employ a super-efficient data envelopment analysis model to rank the portfolios according to investors' risk-return preferences. Our comprehensive empirical investigation utilizes six real-world datasets with different dimensional characteristics, applying a rolling-window methodology across three out-of-sample testing periods. Following the ranking process, we examine the chosen shrinkage-based MV or GMV portfolios against five traditional portfolio optimization techniques: classical MV and GMV for sample estimates, MiniMax, conditional value at risk, and semi-mean absolute deviation risk measures. Our empirical findings reveal that, in most scenarios, the GMV model combined with the Ledoit-Wolf two-parameter shrinkage covariance estimator (COV2) represents the optimal selection for a broad spectrum of investors. Meanwhile, the MV model utilizing COV2 alongside the sample mean (SM) proves more suitable for return-oriented investors. These two identified models demonstrate superior performance compared to traditional benchmark approaches. Overall, this study lays the groundwork for a more comprehensive understanding of how specific shrinkage models perform across diverse investor profiles and market setups.
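
To make the combination of covariance shrinkage and a GMV portfolio concrete, the following Python sketch shrinks a sample covariance matrix toward a scaled identity and then computes the closed-form GMV weights w = Σ⁻¹1 / (1ᵀΣ⁻¹1). It is only a sketch under stated assumptions: the shrinkage intensity is a fixed, hypothetical value rather than a data-driven one, the target is the simplest possible choice, and the paper's COV2 estimator is not reproduced. The returns are simulated.

```python
import numpy as np

def shrink_to_identity(sample_cov, intensity):
    """Convex combination of the sample covariance and a scaled-identity target.

    `intensity` in [0, 1] is a fixed, hypothetical tuning value here; shrinkage
    estimators in the literature estimate it from the data.
    """
    p = sample_cov.shape[0]
    target = (np.trace(sample_cov) / p) * np.eye(p)
    return (1.0 - intensity) * sample_cov + intensity * target

def gmv_weights(cov):
    """Global minimum-variance weights: Sigma^{-1} 1 / (1^T Sigma^{-1} 1)."""
    ones = np.ones(cov.shape[0])
    raw = np.linalg.solve(cov, ones)
    return raw / raw.sum()

rng = np.random.default_rng(2)
returns = rng.normal(scale=0.02, size=(60, 40))  # 60 periods, 40 assets (simulated)
sample_cov = np.cov(returns, rowvar=False)
shrunk_cov = shrink_to_identity(sample_cov, intensity=0.3)

w = gmv_weights(shrunk_cov)
print("weights sum:", w.sum())
print("condition number, sample vs shrunk:",
      np.linalg.cond(sample_cov), np.linalg.cond(shrunk_cov))
```

With few observations relative to the number of assets, the sample covariance is poorly conditioned, and the inverse needed for the GMV weights amplifies estimation error. Shrinkage toward a well-conditioned target reduces the condition number, which is one reason shrinkage estimators are paired with GMV portfolios.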

2601.20610 2026-01-29 stat.ME math.ST stat.TH

Causal Inference in Biomedical Imaging via Functional Linear Structural Equation Models

Ting Li, Ethan Fan, Tengfei Li, Hongtu Zhu

Abstract

Understanding the causal effects of organ-specific features from medical imaging on clinical outcomes is essential for biomedical research and patient care. We propose a novel Functional Linear Structural Equation Model (FLSEM) to capture the relationships among clinical outcomes, functional imaging exposures, and scalar covariates like genetics, sex, and age. Traditional methods struggle with the infinite-dimensional nature of exposures and complex covariates. Our FLSEM overcomes these challenges by establishing identifiable conditions using scalar instrumental variables. We develop the Functional Group Support Detection and Root Finding (FGS-DAR) algorithm for efficient variable selection, supported by rigorous theoretical guarantees, including selection consistency and accurate parameter estimation. We further propose a test statistic to test the nullity of the functional coefficient, establishing its null limit distribution. Our approach is validated through extensive simulations and applied to UK Biobank data, demonstrating robust performance in detecting causal relationships from medical imaging.

2601.20533 2026-01-29 stat.ML cs.LG q-fin.RM

Incorporating data drift to perform survival analysis on credit risk

Jianwei Peng, Stefan Lessmann

Comments 27 pages, 2 figures

Abstract

Survival analysis has become a standard approach for modelling time to default with time-varying covariates in credit risk. Most existing methods implicitly assume a stationary data-generating process; in practice, however, mortgage portfolios are exposed to various forms of data drift caused by changing borrower behaviour, macroeconomic conditions, policy regimes and so on. This study investigates the impact of data drift on survival-based credit risk models and proposes a dynamic joint modelling framework to improve robustness under non-stationary environments. The proposed model integrates a longitudinal behavioural marker derived from balance dynamics with a discrete-time hazard formulation, combined with landmark one-hot encoding and isotonic calibration. Three types of data drift (sudden, incremental and recurring) are simulated and analysed on mortgage loan datasets from Freddie Mac. Experiments and corresponding evidence show that the proposed landmark-based joint model consistently outperforms classical survival models, tree-based drift-adaptive learners and gradient boosting methods in terms of discrimination and calibration across all drift scenarios, which confirms the superiority of our model design.

2601.20528 2026-01-29 math.ST math.PR stat.ML stat.TH

Spectral Bayesian Regression on the Sphere

Claudio Durastanti

Comments 42 pages, 3 figures, 2 tables

Abstract

We develop a fully intrinsic Bayesian framework for nonparametric regression on the unit sphere based on isotropic Gaussian field priors and the harmonic structure induced by the Laplace-Beltrami operator. Under uniform random design, the regression model admits an exact diagonalization in the spherical harmonic basis, yielding a Gaussian sequence representation with frequency-dependent multiplicities. Exploiting this structure, we derive closed-form posterior distributions, optimal spectral truncation schemes, and sharp posterior contraction rates under integrated squared loss. For Gaussian priors with polynomially decaying angular power spectra, including spherical Matérn priors, we establish posterior contraction rates over Sobolev classes, which are minimax-optimal under correct prior calibration. We further show that the posterior mean admits an exact variational characterization as a geometrically intrinsic penalized least-squares estimator, equivalent to a Laplace-Beltrami smoothing spline.

2601.20522 2026-01-29 math.ST cs.DS math.PR stat.TH

Improved Computational Lower Bound of Estimation for Multi-Frequency Group Synchronization

Zhangsong Li

Comments 22 pages

Abstract

We study the computational phase transition in a multi-frequency group synchronization problem, where pairwise relative measurements of group elements are observed across multiple frequency channels and corrupted by Gaussian noise. Using the framework of low-degree polynomial algorithms, we analyze the task of estimating the structured signal in such observations. We show that, assuming the low-degree heuristic, in synchronization models over the circle group $\mathsf{SO}(2)$, a simple spectral method is computationally optimal among all polynomial-time estimators when the number of frequencies satisfies $L=n^{o(1)}$. This significantly extends prior work \cite{KBK24+}, which only applied to a fixed constant number of frequencies. Together with known upper bounds on the statistical threshold \cite{PWBM18a}, our results establish the existence of a statistical-to-computational gap in this model when the number of frequencies is sufficiently large.

2601.20498 2026-01-29 math.PR stat.ML

Spectral Diffusion Models on the Sphere

Pierpaolo Brutti, Claudio Durastanti, Francesco Mari

Comments 28 pages, 1 figure

Abstract

Diffusion models provide a principled framework for generative modeling via stochastic differential equations and time-reversed dynamics. Extending spectral diffusion approaches to spherical data, however, raises nontrivial geometric and stochastic issues that are absent in the Euclidean setting. In this work, we develop a diffusion modeling framework defined directly on finite-dimensional spherical harmonic representations of real-valued functions on the sphere. We show that the spherical discrete Fourier transform maps spatial Brownian motion to a constrained Gaussian process in the frequency domain with deterministic, generally non-isotropic covariance. This induces modified forward and reverse-time stochastic differential equations in the spectral domain. As a consequence, spatial and spectral score matching objectives are no longer equivalent, even in the band-limited setting, and the frequency-domain formulation introduces a geometry-dependent inductive bias. We derive the corresponding diffusion equations and characterize the induced noise covariance.

2601.20496 2026-01-29 stat.ML cs.LG

Physics-informed Blind Reconstruction of Dense Fields from Sparse Measurements using Neural Networks with a Differentiable Simulator

Ofek Aloni, Barak Fishbain

Abstract

Generating dense physical fields from sparse measurements is a fundamental question in sampling, signal processing, and many other applications. State-of-the-art methods either use spatial statistics or rely on examples of dense fields in the training phase, which are often not available, and thus rely on synthetic data. Here, we present a reconstruction method that generates dense fields from sparse measurements without assuming the availability of either spatial statistics or examples of the dense fields. This is made possible through the introduction of an automatically differentiable numerical simulator into the training phase of the method. The method is shown to outperform statistical and neural-network-based methods on a set of three standard problems from fluid mechanics.

2601.20442 2026-01-29 math.ST stat.ME stat.TH

Blessing of dimensionality in cross-validated bandwidth selection on the sphere

José E. Chacón, Eduardo García-Portugués, Andrea Meilán-Vila

Comments 25 pages, 6 figures, 2 tables. Supplementary material: 43 pages, 4 figures

Abstract

We study the asymptotic behavior of least-squares cross-validation bandwidth selection in kernel density estimation on the $d$-dimensional hypersphere, $d\geq 1$. We show that the exact rate of convergence with respect to the optimal bandwidth minimizing the mean integrated squared error, shown to exist under mild non-uniformity conditions, is $n^{-d/(2d+8)}$, thus approaching the $n^{-1/2}$ parametric rate as $d$ grows. This "blessing of dimensionality" in bandwidth selection offers theoretical support for utilizing the conceptually simpler cross-validation selector over plug-in techniques for larger dimensions $d$. We compare this result with its counterpart for bandwidth estimation on the $d$-dimensional Euclidean space through explicit expressions for the asymptotic variance functionals. Numerical experiments corroborate the speed of this convergence in an array of scenarios and dimensions, precisely illustrating the tipping dimension where cross-validation outperforms plug-in approaches.
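
The stated rate $n^{-d/(2d+8)}$ is easy to tabulate. The short Python sketch below evaluates the exponent $d/(2d+8)$ exactly for a few dimensions; the helper name `cv_rate_exponent` and the chosen values of $d$ are illustrative assumptions, and only the formula itself comes from the abstract.

```python
from fractions import Fraction

def cv_rate_exponent(d):
    """Exponent a in the rate n^{-a}, with a = d / (2d + 8) as stated in the abstract."""
    return Fraction(d, 2 * d + 8)

exponents = {d: cv_rate_exponent(d) for d in (1, 2, 3, 10, 100)}
for d, a in exponents.items():
    gap = Fraction(1, 2) - a  # distance from the parametric exponent 1/2
    print(f"d={d:>3}: exponent {a} (about {float(a):.3f}), gap to 1/2 is {gap}")
```

The gap to the parametric exponent simplifies to $1/2 - d/(2d+8) = 2/(d+4)$, so the exponent rises from $1/10$ at $d=1$ toward $1/2$ as the dimension increases.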

2601.20428 2026-01-29 cs.LG stat.AP

Nonlinear Dimensionality Reduction with Diffusion Maps in Practice

Sönke Beier, Paula Pirker-Díaz, Friedrich Pagenkopf, Karoline Wiesner

Abstract

Diffusion Map is a spectral dimensionality reduction technique that can uncover nonlinear submanifolds in high-dimensional data. It is increasingly applied across a wide range of scientific disciplines, such as biology, engineering, and social sciences. However, data preprocessing, parameter settings, and component selection have a significant influence on the resulting manifold, something that has not been comprehensively discussed in the literature so far. We provide a practice-oriented review of the Diffusion Map technique, illustrate pitfalls, and showcase a recently introduced technique for identifying the most relevant components. Our results show that the first components are not necessarily the most relevant ones.
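
A basic diffusion map fits in a few lines of NumPy, which makes the parameter choices mentioned in this abstract visible. The sketch below is a textbook-style version under stated assumptions: a Gaussian kernel with a hand-picked bandwidth `epsilon`, no density normalization, no preprocessing, and a toy dataset of points on a circle. It does not implement the component-selection technique the paper showcases.

```python
import numpy as np

def diffusion_map(X, epsilon, n_components=2, t=1):
    """Minimal diffusion map: Gaussian kernel, Markov normalization, eigendecomposition.

    `epsilon` (kernel bandwidth) and `t` (diffusion time) are user-chosen
    parameters; the values used below are arbitrary illustrative choices.
    """
    X = np.asarray(X, dtype=float)
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dist / epsilon)
    deg = K.sum(axis=1)
    # S = D^{-1/2} K D^{-1/2} is symmetric and has the same eigenvalues as
    # the Markov matrix P = D^{-1} K, so a symmetric eigensolver can be used.
    S = K / np.sqrt(np.outer(deg, deg))
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    psi = vecs / np.sqrt(deg)[:, None]  # right eigenvectors of P
    # Component 0 is trivial (eigenvalue 1, constant vector) and is skipped.
    coords = psi[:, 1:n_components + 1] * vals[1:n_components + 1] ** t
    return vals, coords

theta = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)
X = np.column_stack([np.cos(theta), np.sin(theta)])  # toy data: points on a circle
vals, coords = diffusion_map(X, epsilon=0.1)
print("leading eigenvalues:", vals[:4])
print("embedding shape:", coords.shape)
```

This sketch keeps the leading nontrivial components by eigenvalue order, which is the default the abstract cautions against relying on: changing `epsilon` or the preprocessing changes the spectrum, and the components with the largest eigenvalues need not be the most informative ones.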

2601.20405 2026-01-29 stat.OT

Position: A Potential Outcomes Perspective on Pearl's Causal Hierarchy

Peng Wu, Linbo Wang

Abstract

Pearl's causal hierarchy has garnered sustained attention as a foundational lens for formulating and understanding causal questions, and has been extensively discussed within the framework of structural causal models. In this paper, we revisit the hierarchy from a potential outcomes perspective and provide a formal, systematic classification of how various causal estimands are mapped to specific layers. Building on this classification, we summarize key identifiability challenges for estimands at different layers and review general strategies for achieving identification under varying assumptions. Our perspective is both intuitive and theoretically grounded, as higher layers of the hierarchy correspond to progressively richer features of the potential outcomes distribution, which in turn require stronger assumptions for identification. We expect this perspective to help clarify and deepen understanding of various causal estimands, particularly those in the third layer of the causal hierarchy, along with their associated identifiability challenges, identifiability strategies, and application scenarios.

2601.20341 2026-01-29 math.ST stat.TH

Partial heteroscedastic deconvolution estimation in nonparametric regression

Baba Thiam

Abstract

In this paper, we consider a partial deconvolution kernel estimator for nonparametric regression when some covariates are measured with error while others are observed without error. We focus on a general and realistic setting in which the measurement errors are heteroscedastic. We propose a kernel-based estimator of the regression function in this framework and show that it achieves the optimal convergence rate under suitable regularity conditions. The finite-sample performance of the proposed estimator is illustrated through simulation studies.

2601.20320 2026-01-29 stat.ME

Confidence intervals for maximum unseen probabilities, with application to sequential sampling design

Alessandro Colombi, Mario Beraha, Amichai Painsky, Stefano Favaro

Abstract

Discovery problems often require deciding whether additional sampling is needed to detect all categories whose prevalence exceeds a prespecified threshold. We study this question under a Bernoulli product (incidence) model, where categories are observed only through presence-absence across sampling units. Our inferential target is the maximum unseen probability, the largest prevalence among categories not yet observed. We develop nonasymptotic, distribution-free upper confidence bounds for this quantity in two regimes: bounded alphabets (finite and known number of categories) and unbounded alphabets (countably infinite under a mild summability condition). We characterise the limits of data-independent worst-case bounds, showing that in the unbounded regime no nontrivial data-independent procedure can be uniformly valid. We then propose data-dependent bounds in both regimes and establish matching lower bounds demonstrating their near-optimality. We compare empirically the resulting procedures in both simulated and real datasets. Finally, we use these bounds to construct sequential stopping rules with finite-sample guarantees, and demonstrate robustness to contamination that introduces spurious low-prevalence categories.

2601.20254 2026-01-29 stat.ME

Wavelet Tree Ensembles for Triangulable Manifolds

Hengrui Luo, Akira Horiguchi, Li Ma

Comments 56 pages, 16 figures

Abstract

We develop unbalanced Haar (UH) wavelet tree ensembles for regression on triangulable manifolds. Given data sampled on a triangulated manifold, we construct UH wavelet trees whose atoms are supported on geodesic triangles and form an orthonormal system in $L^2(\mu_n)$, where $\mu_n$ is the empirical measure on the sample, which allows us to use UH trees as weak learners in additive ensembles. Our construction extends classical UH wavelet trees from regular Euclidean grids to generic triangulable manifolds while preserving three key properties: (i) orthogonality and exact reconstruction at the sampled locations, (ii) recursive, data-driven partitions adapted to the geometry of the manifold via geodesic triangulations, and (iii) compatibility with optimization-based and Bayesian ensemble building. In Euclidean settings, the framework reduces to standard UH wavelet tree regression and provides a baseline for comparison. We illustrate the method on synthetic regression on the sphere and on climate anomaly fields on a spherical mesh, where UH ensembles on triangulated manifolds substantially outperform classical tree ensembles and non-adaptive mesh-based wavelets. For completeness, we also report results on image denoising on regular grids. A Bayesian variant (RUHWT) provides posterior uncertainty quantification for function estimates on manifolds. Our implementation is available at http://www.github.com/hrluo/WaveletTrees.

2601.20250 2026-01-29 cs.LG cs.AI cs.IT math.IT stat.ML

Order-Optimal Sample Complexity of Rectified Flows

Hari Krishna Sahoo, Mudit Gaur, Vaneet Aggarwal

Details
English abstract

Recently, flow-based generative models have shown superior efficiency compared to diffusion models. In this paper, we study rectified flow models, which constrain transport trajectories to be linear from the base distribution to the data distribution. This structural restriction greatly accelerates sampling, often enabling high-quality generation with a single Euler step. Under standard assumptions on the neural network classes used to parameterize the velocity field and data distribution, we prove that rectified flows achieve sample complexity $\tilde{O}(\varepsilon^{-2})$. This improves on the best known $O(\varepsilon^{-4})$ bounds for flow matching models and matches the optimal rate for mean estimation. Our analysis exploits the particular structure of rectified flows: because the model is trained with a squared loss along linear paths, the associated hypothesis class admits a sharply controlled localized Rademacher complexity. This yields the improved, order-optimal sample complexity and provides a theoretical explanation for the strong empirical performance of rectified flow models.
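
The linear-path idea in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names are hypothetical, and the `oracle` velocity below stands in for a trained network. It shows the straight-line interpolation between a base point and a data point, the constant velocity used as the squared-loss regression target, and a single Euler step.

```python
def interpolate(x0, x1, t):
    """Point at time t on the straight line from x0 (base) to x1 (data)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target along a linear path: the constant velocity x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def squared_loss(pred, target):
    """Squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target))


def euler_step(x0, velocity_fn, dt=1.0):
    """One Euler step from time 0, using the velocity evaluated at (x0, 0)."""
    v = velocity_fn(x0, 0.0)
    return [a + dt * b for a, b in zip(x0, v)]


# Toy pair: one base point and one data point (illustrative values only).
x0 = [0.0, 0.0]
x1 = [2.0, -1.0]

# Hypothetical stand-in for a trained velocity network on this single pair.
oracle = lambda x, t: target_velocity(x0, x1)

midpoint = interpolate(x0, x1, 0.5)
sample = euler_step(x0, oracle)
loss = squared_loss(oracle(midpoint, 0.5), target_velocity(x0, x1))
```

When the velocity field is exactly constant along the path, as in this toy case, the single Euler step lands on the data point; a learned field only approximates this.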

2601.20219 2026-01-29 stat.ME

Joint Estimation of Edge Probabilities for Multi-layer Networks via Neighborhood Smoothing

Yong He, Zizhou Huang, Bingyi Jing, Diqing Li

Details
English abstract

In this paper we focus on jointly estimating the edge probabilities for multi-layer networks. We define a novel multi-layer graphon, a ternary function (in contrast to the bivariate graphon function in the literature) obtained by introducing an additional latent layer position parameter, which is model-free and covers a wide range of multi-layer networks. We develop a computationally efficient two-step neighborhood smoothing algorithm to estimate the edge probabilities of multi-layer networks, which requires little tuning and fully utilizes the similarity across both network layers and nodes. Numerical experiments demonstrate the advantages of our method over the existing state-of-the-art ones. A real Worldwide Food Import/Export Network dataset example is analyzed to illustrate the better performance of the proposed method over benchmark methods in terms of link prediction.

2601.20132 2026-01-29 stat.ME

Connecting reflective asymmetries in multivariate spatial and spatio-temporal covariances

Drew Yarger

Details
English abstract

In the analysis of multivariate spatial and univariate spatio-temporal data, it is commonly recognized that asymmetric dependence may exist, which can be addressed using an asymmetric (matrix or space-time, respectively) covariance function within a Gaussian process framework. This paper introduces a new paradigm for constructing asymmetric space-time covariances, which we refer to as "reflective asymmetric," by leveraging recently introduced models for multivariate spatial data. We first provide new results for reflective asymmetric multivariate spatial models that extend their applicability. We then propose their asymmetric space-time extension, which comes from a substantially different perspective than Lagrangian asymmetric space-time covariances. The new models have fewer parameters, one controls both the spatial and temporal marginal covariances, and the standard separable model is a special case. In simulation studies and analysis of the frequently studied Irish wind data, these new models also improve model fit and prediction performance, and they can be easier to estimate. These features indicate broad applicability for improved analysis in environmental and other space-time data.

2601.20120 2026-01-29 cs.LG stat.ME

Going NUTS with ADVI: Exploring various Bayesian Inference techniques with Facebook Prophet

Jovan Krajevski, Biljana Tojtovska Ribarski

Comments 6 pages, 5 figures, Published in Proceedings of the 22nd International Conference for Informatics and Information Technologies - CiiT 2025

Journal ref Proceedings of the 22nd International Conference for Informatics and Information Technologies, pp. 260-265, 2025, ISBN: 978-608-4699-22-4

Details
English abstract

Since its introduction, Facebook Prophet has attracted positive attention from both classical statisticians and the Bayesian statistics community. The model provides two built-in inference methods: maximum a posteriori estimation using the L-BFGS-B algorithm, and Markov Chain Monte Carlo (MCMC) sampling via the No-U-Turn Sampler (NUTS). While exploring various time-series forecasting problems using Bayesian inference with Prophet, we encountered limitations stemming from the inability to apply alternative inference techniques beyond those provided by default. Additionally, the fluent API design of Facebook Prophet proved insufficiently flexible for implementing our custom modeling ideas. To address these shortcomings, we developed a complete reimplementation of the Prophet model in PyMC, which enables us to extend the base model and evaluate and compare multiple Bayesian inference methods. In this paper, we present our PyMC-based implementation and analyze in detail the implementation of different Bayesian inference techniques. We consider full MCMC techniques, MAP estimation, and variational inference techniques on a time-series forecasting problem. We discuss in detail the sampling approach, convergence diagnostics, and forecasting metrics, as well as their computational efficiency, and identify possible issues which will be addressed in our future work.

2601.20047 2026-01-29 stat.ML cs.LG

Minimax Rates for Hyperbolic Hierarchical Learning

Divit Rawal, Sriram Vishwanath

Details
English abstract

We prove an exponential separation in sample complexity between Euclidean and hyperbolic representations for learning on hierarchical data under standard Lipschitz regularization. For depth-$R$ hierarchies with branching factor $m$, we first establish a geometric obstruction for Euclidean space: any bounded-radius embedding forces volumetric collapse, mapping exponentially many tree-distant points to nearby locations. This necessitates Lipschitz constants scaling as $\exp(Ω(R))$ to realize even simple hierarchical targets, yielding exponential sample complexity under capacity control. We then show this obstruction vanishes in hyperbolic space: constant-distortion hyperbolic embeddings admit $O(1)$-Lipschitz realizability, enabling learning with $n = O(mR \log m)$ samples. A matching $Ω(mR \log m)$ lower bound via Fano's inequality establishes that hyperbolic representations achieve the information-theoretic optimum. We also show a geometry-independent bottleneck: any rank-$k$ prediction space captures only $O(k)$ canonical hierarchical contrasts.

2601.20046 2026-01-29 cs.LG stat.AP

Externally Validated Longitudinal GRU Model for Visit-Level 180-Day Mortality Risk in Metastatic Castration-Resistant Prostate Cancer

Javier Mencia-Ledo, Mohammad Noaeen, Zahra Shakeri

Comments 7 pages, 4 figures

Details
English abstract

Metastatic castration-resistant prostate cancer (mCRPC) is a highly aggressive disease with poor prognosis and heterogeneous treatment response. In this work, we developed and externally validated a visit-level 180-day mortality risk model using longitudinal data from two Phase III cohorts (n=526 and n=640). Only visits with observable 180-day outcomes were labeled; right-censored cases were excluded from analysis. We compared five candidate architectures: Long Short-Term Memory, Gated Recurrent Unit (GRU), Cox Proportional Hazards, Random Survival Forest (RSF), and Logistic Regression. For each dataset, we selected the smallest risk threshold that achieved an 85% sensitivity floor. The GRU and RSF models showed high discrimination capabilities initially (C-index: 87% for both). In external validation, the GRU achieved better calibration (slope: 0.93; intercept: 0.07) and a PR-AUC of 0.87. Clinical impact analysis showed a median time-in-warning of 151.0 days for true positives (59.0 days for false positives) and 18.3 alerts per 100 patient-visits. Given late-stage frailty or cachexia and hemodynamic instability, permutation importance ranked BMI and systolic blood pressure as the strongest associations. These results suggest that longitudinal routine clinical markers can estimate short-horizon mortality risk in mCRPC and support proactive care planning over a multi-month window.

2601.20043 2026-01-29 cs.LG stat.ML

Regime-Adaptive Bayesian Optimization via Dirichlet Process Mixtures of Gaussian Processes

Yan Zhang, Xuefeng Liu, Sipeng Chen, Sascha Ranftl, Chong Liu, Shibo Li

Details
English abstract

Standard Bayesian Optimization (BO) assumes uniform smoothness across the search space, an assumption violated in multi-regime problems such as molecular conformation search through distinct energy basins or drug discovery across heterogeneous molecular scaffolds. A single GP either oversmooths sharp transitions or hallucinates noise in smooth regions, yielding miscalibrated uncertainty. We propose RAMBO, a Dirichlet Process Mixture of Gaussian Processes that automatically discovers latent regimes during optimization, each modeled by an independent GP with locally optimized hyperparameters. We derive collapsed Gibbs sampling that analytically marginalizes latent functions for efficient inference, and introduce adaptive concentration parameter scheduling for coarse-to-fine regime discovery. Our acquisition functions decompose uncertainty into intra-regime and inter-regime components. Experiments on synthetic benchmarks and real-world applications, including molecular conformer optimization, virtual screening for drug discovery, and fusion reactor design, demonstrate consistent improvements over state-of-the-art baselines on multi-regime objectives.

2601.20031 2026-01-29 stat.AP stat.ME

Scalable Decisions using a Bayesian Decision-Theoretic Approach

Hoiyi Ng, Guido Imbens

Details
English abstract

Randomized controlled experiments assess new policy impacts on performance metrics to inform launch decisions. Traditional approaches evaluate metrics independently despite correlations, and mixed results (e.g., positive revenue impact, negative customer experience) require manual judgment, hindering scalability. We propose a Bayesian decision-theoretic framework that systematically incorporates multiple objectives and trade-offs by comparing expected risks across decisions. Our approach combines experimenter-defined loss functions with observed evidence, using hierarchical models to leverage historical experiment learnings for prior information on treatment effects. Through real and simulated Amazon supply chain experiments, we demonstrate that compared to null hypothesis statistical testing, our method increases estimation efficiency via informative hierarchical priors and simplifies decision-making by systematically incorporating business preferences and costs for comprehensive, scalable decisions.

2601.20020 2026-01-29 math.ST math.PR stat.CO stat.ML stat.TH

Matching and mixing: Matchability of graphs under Markovian error

Zhirui Li, Keith D. Levin, Zhiang Zhao, Vince Lyzinski

Comments 48 pages, 12 figures

Details
English abstract

We consider the problem of graph matching for a sequence of graphs generated under a time-dependent Markov chain noise model. Our edgelighter error model, a variant of the classical lamplighter random walk, iteratively corrupts the graph $G_0$ with edge-dependent noise, creating a sequence of noisy graph copies $(G_t)$. Much of the graph matching literature is focused on anonymization thresholds in edge-independent noise settings, and we establish novel anonymization thresholds in this edge-dependent noise setting when matching $G_0$ and $G_t$. Moreover, we also compare this anonymization threshold with the mixing properties of the Markov chain noise model. We show that when $G_0$ is drawn from an Erdős-Rényi model, the graph matching anonymization threshold and the mixing time of the edgelighter walk are both of order $Θ(n^2\log n)$. We further demonstrate that for more structured models for $G_0$ (e.g., the Stochastic Block Model), graph matching anonymization can occur in $O(n^α\log n)$ time for some $α<2$, indicating that anonymization can occur before the Markov chain noise model globally mixes. Through extensive simulations, we verify our theoretical bounds in the settings of Erdős-Rényi random graphs and stochastic block model random graphs, and explore our findings on real-world datasets derived from a Facebook friendship network and a European research institution email communication network.

2601.19992 2026-01-29 cs.LG stat.ML

BayPrAnoMeta: Bayesian Proto-MAML for Few-Shot Industrial Image Anomaly Detection

Soham Sarkar, Tanmay Sen, Sayantan Banerjee

Details
English abstract

Industrial image anomaly detection is a challenging problem owing to extreme class imbalance and the scarcity of labeled defective samples, particularly in few-shot settings. We propose BayPrAnoMeta, a Bayesian generalization of Proto-MAML for few-shot industrial image anomaly detection. Unlike existing Proto-MAML approaches that rely on deterministic class prototypes and distance-based adaptation, BayPrAnoMeta replaces prototypes with task-specific probabilistic normality models and performs inner-loop adaptation via a Bayesian posterior predictive likelihood. We model normal support embeddings with a Normal-Inverse-Wishart (NIW) prior, producing a Student-$t$ predictive distribution that enables uncertainty-aware, heavy-tailed anomaly scoring and is essential for robustness in extreme few-shot settings. We further extend BayPrAnoMeta to a federated meta-learning framework with supervised contrastive regularization for heterogeneous industrial clients and prove convergence to stationary points of the resulting nonconvex objective. Experiments on the MVTec AD benchmark demonstrate consistent and significant AUROC improvements over MAML, Proto-MAML, and PatchCore-based methods in few-shot anomaly detection settings.

2601.19958 2026-01-29 stat.ML cs.LG math.PR

Deep Neural Networks as Iterated Function Systems and a Generalization Bound

Jonathan Vacher

Details
English abstract

Deep neural networks (DNNs) achieve remarkable performance on a wide range of tasks, yet their mathematical analysis remains fragmented: stability and generalization are typically studied in disparate frameworks and on a case-by-case basis. Architecturally, DNNs rely on the recursive application of parametrized functions, a mechanism that can be unstable and difficult to train, making stability a primary concern. Even when training succeeds, there are few rigorous results on how well such models generalize beyond the observed data, especially in the generative setting. In this work, we leverage the theory of stochastic Iterated Function Systems (IFS) and show that two important deep architectures can be viewed as, or canonically associated with, place-dependent IFS. This connection allows us to import results from random dynamical systems to (i) establish the existence and uniqueness of invariant measures under suitable contractivity assumptions, and (ii) derive a Wasserstein generalization bound for generative modeling. The bound naturally leads to a new training objective that directly controls the collage-type approximation error between the data distribution and its image under the learned transfer operator. We illustrate the theory on a controlled 2D example and empirically evaluate the proposed objective on standard image datasets (MNIST, CelebA, CIFAR-10).

2601.19944 2026-01-29 cs.LG stat.AP stat.ML

Classifier Calibration at Scale: An Empirical Study of Model-Agnostic Post-Hoc Methods

Valery Manokhin, Daniel Grønhaug

Comments 61 pages, 23 figures

Details
English abstract

We study model-agnostic post-hoc calibration methods intended to improve probabilistic predictions in supervised binary classification on real i.i.d. tabular data, with particular emphasis on conformal and Venn-based approaches that provide distribution-free validity guarantees under exchangeability. We benchmark 21 widely used classifiers, including linear models, SVMs, tree ensembles (CatBoost, XGBoost, LightGBM), and modern tabular neural and foundation models, on binary tasks from the TabArena-v0.1 suite using randomized, stratified five-fold cross-validation with a held-out test fold. Five calibrators (isotonic regression, Platt scaling, Beta calibration, Venn-Abers predictors, and Pearsonify) are trained on a separate calibration split and applied to test predictions. Calibration is evaluated using proper scoring rules (log-loss and Brier score) and diagnostic measures (Spiegelhalter's Z, ECE, and ECI), alongside discrimination (AUC-ROC) and standard classification metrics. Across tasks and architectures, Venn-Abers predictors achieve the largest average reductions in log-loss, followed closely by Beta calibration, while Platt scaling exhibits weaker and less consistent effects. Beta calibration improves log-loss most frequently across tasks, whereas Venn-Abers displays fewer instances of extreme degradation and slightly more instances of extreme improvement. Importantly, we find that commonly used calibration procedures, most notably Platt scaling and isotonic regression, can systematically degrade proper scoring performance for strong modern tabular models. Overall classification performance is often preserved, but calibration effects vary substantially across datasets and architectures, and no method dominates uniformly. In expectation, all methods except Pearsonify slightly increase accuracy, but the effect is marginal, with the largest expected gain about 0.008%.
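
For readers unfamiliar with the evaluation metrics named in this abstract, here is a minimal sketch of the Brier score, log-loss, and one common equal-width-bin variant of ECE for binary labels. The function names and the toy data are illustrative assumptions and are not taken from the paper, which may define ECE differently.

```python
import math


def brier_score(probs, labels):
    """Mean squared difference between predicted probability and 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def log_loss(probs, labels, eps=1e-15):
    """Mean negative log-likelihood; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width-bin ECE on the positive-class probability (one common variant)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        pos_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(mean_prob - pos_rate)
    return ece


# Toy predictions and labels (illustrative values only).
probs = [0.1, 0.9, 0.8, 0.3]
labels = [0, 1, 1, 0]
brier = brier_score(probs, labels)
nll = log_loss(probs, labels)
ece = expected_calibration_error(probs, labels)
```

Brier score and log-loss are proper scoring rules, whereas binned ECE depends on the binning scheme, which is one reason studies often report both kinds of measure.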

2601.10992 2026-01-29 cs.LG stat.CO

Constant Metric Scaling in Riemannian Computation

Kisung You

Details
English abstract

Constant rescaling of a Riemannian metric appears in many computational settings, often through a global scale parameter that is introduced either explicitly or implicitly. Although this operation is elementary, its consequences are not always made clear in practice and may be confused with changes in curvature, manifold structure, or coordinate representation. In this note we provide a short, self-contained account of constant metric scaling on arbitrary Riemannian manifolds. We distinguish between quantities that change under such a scaling, including norms, distances, volume elements, and gradient magnitudes, and geometric objects that remain invariant, such as the Levi--Civita connection, geodesics, exponential and logarithmic maps, and parallel transport. We also discuss implications for Riemannian optimization, where constant metric scaling can often be interpreted as a global rescaling of step sizes rather than a modification of the underlying geometry. The goal of this note is purely expository and is intended to clarify how a global metric scale parameter can be introduced in Riemannian computation without altering the geometric structures on which these methods rely.
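
The step-size interpretation mentioned in this abstract can be checked numerically with a small sketch. Everything here is an illustrative assumption rather than the note's own code: the unit sphere with its embedded metric, the toy objective f(x) = x[0], and all function names. Scaling the metric by a constant c divides the Riemannian gradient by c and leaves the exponential map unchanged, so descent with step size c·η under the scaled metric should reproduce descent with step size η under the original one.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def tangent_project(x, g):
    """Project an ambient vector g onto the tangent space of the unit sphere at x."""
    s = dot(x, g)
    return [gi - s * xi for gi, xi in zip(g, x)]


def riemannian_grad(x, egrad, c=1.0):
    """Gradient w.r.t. the metric c * <.,.>: the base-metric gradient divided by c."""
    return [gi / c for gi in tangent_project(x, egrad)]


def exp_map(x, v):
    """Sphere exponential map; it does not depend on the constant metric scale."""
    nv = math.sqrt(dot(v, v))
    if nv == 0.0:
        return list(x)
    return [math.cos(nv) * xi + math.sin(nv) * vi / nv for xi, vi in zip(x, v)]


def descend(x, egrad_fn, step, c=1.0, iters=20):
    """Riemannian gradient descent under the metric scaled by c."""
    for _ in range(iters):
        g = riemannian_grad(x, egrad_fn(x), c)
        x = exp_map(x, [-step * gi for gi in g])
    return x


# Toy objective f(x) = x[0]; its Euclidean gradient is constant.
egrad_fn = lambda x: [1.0, 0.0, 0.0]
x_start = [0.0, 1.0, 0.0]
c = 4.0

base = descend(x_start, egrad_fn, step=0.1)               # original metric, step eta
scaled = descend(x_start, egrad_fn, step=0.1 * c, c=c)    # scaled metric, step c * eta
max_gap = max(abs(a - b) for a, b in zip(base, scaled))
```

Here `max_gap` should be at the level of floating-point rounding, which is consistent with reading the scale parameter as a step-size change rather than a change of geometry.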

2601.09673 2026-01-29 physics.soc-ph econ.GN math.OC q-fin.EC stat.AP

A probabilistic match classification model for sports tournaments

László Csató, András Gyimesi

Comments 25 pages, 3 tables, 8 figures

Details
English abstract

Existing match classification models in the tournament design literature have two major limitations: a contestant is considered indifferent only if uncertain future results never affect its prize, and competitive matches are not distinguished with respect to the incentives of the contestants. We propose a probabilistic framework to address both issues. For each match, our approach relies on simulating all other matches played simultaneously or later to compute the qualifying probabilities under the three main outcomes (win, draw, loss), which allows the classification of each match into six different categories. The suggested model is applied to the previous group stage and the new incomplete round-robin league, introduced in the 2024/25 season of UEFA club competitions. An incomplete round-robin tournament is found to contain fewer stakeless matches where both contestants are indifferent, and substantially more matches where both contestants should play offensively. However, the robustly higher proportion of potentially collusive matches poses a threat of serious scandals.

2512.15950 2026-01-29 stat.ME

Modeling Issues with Eye Tracking Data

Gregory Camilli

Comments Total effects are replaced with transition effects to enable better comparisons across methods

Details
English abstract

I describe and compare procedures for binary eye-tracking (ET) data. The basic GLM model is a logistic mixed model combined with random effects for persons and items. Additional models address error correlation in eye-tracking serial observations. In particular, three novel approaches are illustrated that address serial correlation without the use of an observed lag-1 predictor: a first-order autoregressive model and a first-order moving average model obtained with generalized estimating equations, and a recurrent two-state survival model used with run-length encoded data. Altogether, the results of five different analyses point to unresolved issues in the analysis of eye-tracking data and new directions for analytic development. A more traditional model incorporating a lag-1 observed outcome for serial correlation is also included.

2512.04690 2026-01-29 stat.ML cs.LG

Recurrent Neural Networks with Linear Structures for Electricity Price Forecasting

Souhir Ben Amor, Florian Ziel

Details
English abstract

We present a novel recurrent neural network architecture specifically designed for day-ahead electricity price forecasting, aimed at improving short-term decision-making and operational management in energy systems. Our combined forecasting model embeds linear structures, such as expert models and Kalman filters, into recurrent networks, enabling efficient computation and enhanced interpretability. The design leverages the strengths of both linear and non-linear model structures, allowing it to capture all relevant stylized price characteristics in power markets, including calendar and autoregressive effects, as well as influences from load, renewable energy, and related fuel and carbon markets. For empirical testing, we use hourly data from the largest European electricity market spanning 2018 to 2025 in a comprehensive forecasting study, comparing our model against state-of-the-art approaches, particularly high-dimensional linear and neural network models. In terms of RMSE, the proposed model achieves approximately 11% higher accuracy than the best-performing benchmark. We evaluate the contributions of the interpretable model components and conclude on the impact of combining linear and non-linear structures. We further evaluate the temporal robustness of the model by examining the stability of hyperparameters and the economic significance of key features. Additionally, we introduce a probabilistic extension to quantify forecast uncertainty.

2510.22848 2026-01-29 cs.LG nlin.AO stat.ML

Self-induced stochastic resonance: A physics-informed machine learning approach

Divyesh Savaliya, Marius E. Yamakou

Comments 25 pages, 10 figures, 62 references

Details
English abstract

Self-induced stochastic resonance (SISR) is the emergence of coherent oscillations in slow-fast excitable systems driven solely by noise, without external periodic forcing or proximity to a bifurcation. This work presents a physics-informed machine learning framework for modeling and predicting SISR in the stochastic FitzHugh-Nagumo neuron. We embed the governing stochastic differential equations and SISR-asymptotic timescale-matching constraints directly into a Physics-Informed Neural Network (PINN) based on a Noise-Augmented State Predictor architecture. The composite loss integrates data fidelity, dynamical residuals, and barrier-based physical constraints derived from Kramers' escape theory. The trained PINN accurately predicts the dependence of spike-train coherence on noise intensity, excitability, and timescale separation, matching results from direct stochastic simulations with substantial improvements in accuracy and generalization compared with purely data-driven methods, while requiring significantly less computation. The framework provides a data-efficient and interpretable surrogate model for simulating and analyzing noise-induced coherence in multiscale stochastic systems.

2510.11546 2026-01-29 stat.ML cs.LG math.OC math.ST stat.TH

Efficient Group Lasso Regularized Rank Regression with Data-Driven Parameter Determination

Meixia Lin, Meijiao Shi, Yunhai Xiao, Qian Zhang

Comments 36 pages, 4 figures, 8 tables

Details
English abstract

High-dimensional regression often suffers from heavy-tailed noise and outliers, which can severely undermine the reliability of least-squares based methods. To improve robustness, we adopt a non-smooth Wilcoxon score based rank objective and incorporate structured group sparsity regularization, a natural generalization of the lasso, yielding a group lasso regularized rank regression method. By extending the tuning-free parameter selection scheme originally developed for the lasso, we introduce a data-driven, simulation-based tuning rule and further establish a finite-sample error bound for the resulting estimator. On the computational side, we develop a proximal augmented Lagrangian method for solving the associated optimization problem, which eliminates the singularity issues encountered in existing methods, thereby enabling efficient semismooth Newton updates for the subproblems. Extensive numerical experiments demonstrate the robustness and effectiveness of our proposed estimator against alternatives, and showcase the scalability of the algorithm across both simulated and real-data settings.
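
Two ingredients named in this abstract can be sketched briefly. This is not the paper's proximal augmented Lagrangian method. It shows only a pairwise-difference form of the Wilcoxon-type rank loss (assumed here as one common formulation) and the standard block soft-thresholding operator, which is the proximal map of the group lasso penalty. Function names and toy values are hypothetical.

```python
import math


def rank_loss(residuals):
    """Average absolute pairwise difference of residuals (assumed rank-loss form)."""
    n = len(residuals)
    total = sum(
        abs(residuals[i] - residuals[j])
        for i in range(n)
        for j in range(n)
        if i != j
    )
    return total / (n * (n - 1))


def group_soft_threshold(beta, groups, lam):
    """Proximal map of lam * sum_g ||beta_g||_2: shrink each group's norm by lam."""
    out = list(beta)
    for idx in groups:
        norm = math.sqrt(sum(beta[i] ** 2 for i in idx))
        # Groups whose norm is at most lam are set exactly to zero.
        scale = max(0.0, 1.0 - lam / norm) if norm > 0.0 else 0.0
        for i in idx:
            out[i] = scale * beta[i]
    return out


# Toy coefficients in two groups (illustrative values only).
beta = [3.0, 4.0, 0.1, 0.1]
groups = [[0, 1], [2, 3]]
shrunk = group_soft_threshold(beta, groups, lam=1.0)

loss_a = rank_loss([0.0, 1.0, 3.0])
loss_b = rank_loss([5.0, 6.0, 8.0])  # same residuals shifted by a constant
```

The rank loss depends only on differences of residuals, so a constant shift leaves it unchanged, and the thresholding step zeroes out whole groups at once, which is what produces group sparsity.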

2510.06165 2026-01-29 cs.LG eess.SP math.ST stat.ML stat.TH

Higher-Order Feature Attribution: Bridging Statistics, Explainable AI, and Topological Signal Processing

Kurt Butler, Guanchao Feng, Petar Djuric

Comments 5 pages, 3 figures, to be published in the Proceedings of ICASSP 2026

详情
英文摘要

Feature attributions are post-training analysis methods that assess how various input features of a machine learning model contribute to an output prediction. Their interpretation is straightforward when features act independently, but it becomes less clear when the predictive model involves interactions, such as multiplicative relationships or joint feature contributions. In this work, we propose a general theory of higher-order feature attribution, which we develop on the foundation of Integrated Gradients (IG). This work extends existing frameworks in the literature on explainable AI. When using IG as the method of feature attribution, we discover natural connections to statistics and topological signal processing. We provide several theoretical results that establish the theory, and we validate our theory on a few examples.
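
The difficulty with interactions described in this abstract shows up in a small first-order Integrated Gradients calculation. The sketch below is an illustrative assumption, not the paper's higher-order method: it approximates standard IG with a midpoint Riemann sum and central finite differences, applied to the toy multiplicative model f(x) = x[0] * x[1] with a zero baseline. All names are hypothetical.

```python
def numerical_grad(f, x, h=1e-6):
    """Central finite-difference gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        dn = list(x)
        up[i] += h
        dn[i] -= h
        grad.append((f(up) - f(dn)) / (2.0 * h))
    return grad


def integrated_gradients(f, x, baseline, steps=200):
    """First-order IG: (x - baseline) times the path-averaged gradient."""
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint rule on [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = numerical_grad(f, point)
        for i in range(len(x)):
            avg_grad[i] += g[i] / steps
    return [(xi - b) * a for xi, b, a in zip(x, baseline, avg_grad)]


# Toy multiplicative model: the output exists only through the interaction.
f = lambda x: x[0] * x[1]
x = [2.0, 3.0]
baseline = [0.0, 0.0]

attr = integrated_gradients(f, x, baseline)
total = sum(attr)
```

The attributions sum to f(x) - f(baseline), and the purely joint effect is split evenly between the two inputs, so first-order scores alone do not reveal that the contribution is an interaction.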

2509.00923 2026-01-29 cs.AI cs.GT stat.ML

Robust Deep Monte Carlo Counterfactual Regret Minimization: Addressing Theoretical Risks in Neural Fictitious Self-Play

Zakaria El Jaafari

Comments There seem to be some errors related to the encountered problems and the interpretation of numerical results, that do not have a common pattern

Details
English abstract

Monte Carlo Counterfactual Regret Minimization (MCCFR) has emerged as a cornerstone algorithm for solving extensive-form games, but its integration with deep neural networks introduces scale-dependent challenges that manifest differently across game complexities. This paper presents a comprehensive analysis of how neural MCCFR component effectiveness varies with game scale and proposes an adaptive framework for selective component deployment. We identify that theoretical risks such as nonstationary target distribution shifts, action support collapse, variance explosion, and warm-starting bias have scale-dependent manifestation patterns, requiring different mitigation strategies for small versus large games. Our proposed Robust Deep MCCFR framework incorporates target networks with delayed updates, uniform exploration mixing, variance-aware training objectives, and comprehensive diagnostic monitoring. Through systematic ablation studies on Kuhn and Leduc Poker, we demonstrate scale-dependent component effectiveness and identify critical component interactions. The best configuration achieves final exploitability of 0.0628 on Kuhn Poker, representing a 60% improvement over the classical framework (0.156). On the more complex Leduc Poker domain, selective component usage achieves exploitability of 0.2386, a 23.5% improvement over the classical framework (0.3703), highlighting the importance of careful component selection over comprehensive mitigation. Our contributions include: (1) a formal theoretical analysis of risks in neural MCCFR, (2) a principled mitigation framework with convergence guarantees, (3) comprehensive multi-scale experimental validation revealing scale-dependent component interactions, and (4) practical guidelines for deployment in larger games.

2507.20072 2026-01-29 cs.LG stat.ME

Sparse Equation Matching: A Derivative-Free Learning for General-Order Dynamical Systems

Jiaqiang Li, Jianbin Tan, Xueqin Wang

Details
English abstract

Equation discovery is a fundamental learning task for uncovering the underlying dynamics of complex systems, with wide-ranging applications in areas such as brain connectivity analysis, climate modeling, gene regulation, and physical simulation. However, many existing approaches rely on accurate derivative estimation and are limited to first-order dynamical systems, restricting their applicability in real-world scenarios. In this work, we propose Sparse Equation Matching (SEM), a unified framework that encompasses several existing equation discovery methods under a common formulation. SEM introduces an integral-based sparse regression approach using Green's functions, enabling derivative-free estimation of differential operators and their associated driving functions in general-order dynamical systems. The effectiveness of SEM is demonstrated through extensive simulations, benchmarking its performance against derivative-based approaches. We then apply SEM to electroencephalographic (EEG) data recorded during multiple oculomotor tasks, collected from 52 participants in a brain-computer interface experiment. Our method identifies active brain regions across participants and reveals task-specific connectivity patterns. These findings offer valuable insights into brain connectivity and the underlying neural mechanisms.

2507.10601 2026-01-29 q-bio.QM cs.CV cs.LG eess.IV stat.ME

AGFS-Tractometry: A Novel Atlas-Guided Fine-Scale Tractometry Approach for Enhanced Along-Tract Group Statistical Comparison Using Diffusion MRI Tractography

Ruixi Zheng, Wei Zhang, Yijie Li, Xi Zhu, Zhou Lan, Jarrett Rushmore, Yogesh Rathi, Nikos Makris, Lauren J. O'Donnell, Fan Zhang

Comments 31 pages and 7 figures

Details
English abstract

Diffusion MRI (dMRI) tractography is currently the only method for in vivo mapping of the brain's white matter (WM) connections. Tractometry is an advanced tractography analysis technique for along-tract profiling to investigate the morphology and microstructural properties along the fiber tracts. Tractometry has become an essential tool for studying local along-tract differences between different populations (e.g., health vs. disease). In this study, we propose a novel atlas-guided fine-scale tractometry method, namely AGFS-Tractometry, that leverages tract spatial information and permutation testing to enhance the along-tract statistical analysis between populations. There are two major contributions in AGFS-Tractometry. First, we create a novel atlas-guided tract profiling template that enables consistent, fine-scale, along-tract parcellation of subject-specific fiber tracts. Second, we propose a novel nonparametric permutation testing group comparison method to enable simultaneous analysis across all along-tract parcels while correcting for multiple comparisons. We perform experimental evaluations on synthetic datasets with known group differences and in vivo real data. We compare AGFS-Tractometry with two state-of-the-art tractometry methods, Automated Fiber-tract Quantification (AFQ) and BUndle ANalytics (BUAN). Our results show that the proposed AGFS-Tractometry obtains enhanced sensitivity and specificity in detecting local WM differences. In the real data analysis experiments, AGFS-Tractometry can identify more regions with significant differences, which are anatomically consistent with the existing literature. Overall, these results demonstrate the ability of AGFS-Tractometry to detect subtle or spatially localized WM group-level differences. The created tract profiling template and related code are available at: https://github.com/ZhengRuixi/AGFS-Tractometry.git.

2506.11743 2026-01-29 cs.LG stat.ML

Taxonomy of reduction matrices for Graph Coarsening

Antonin Joly, Nicolas Keriven, Aline Roumy

Details
English abstract

Graph coarsening aims to diminish the size of a graph to lighten its memory footprint, and has numerous applications in graph signal processing and machine learning. It is usually defined using a reduction matrix and a lifting matrix, which, respectively, allow one to project a graph signal from the original graph to the coarsened one and back. This results in a loss of information measured by the so-called Restricted Spectral Approximation (RSA). Most coarsening frameworks impose a fixed relationship between the reduction and lifting matrices, generally as pseudo-inverses of each other, and seek to define a coarsening that minimizes the RSA. In this paper, we remark that the roles of these two matrices are not entirely symmetric: indeed, putting constraints on the lifting matrix alone ensures the existence of important objects such as the coarsened graph's adjacency matrix or Laplacian. In light of this, we introduce a more general notion of reduction matrix that is not necessarily the pseudo-inverse of the lifting matrix. We establish a taxonomy of "admissible" families of reduction matrices, discuss the different properties that they must satisfy and whether they admit a closed-form description or not. We show that, for a fixed coarsening represented by a fixed lifting matrix, the RSA can be further reduced simply by modifying the reduction matrix. We explore different examples, including some based on a constrained optimization process of the RSA. Since this criterion has also been linked to the performance of Graph Neural Networks, we also illustrate the impact of these choices on different node classification tasks on coarsened graphs.
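
As a small illustration of the reduction and lifting matrices mentioned in this abstract, the sketch below coarsens a 4-node path graph into two supernodes. Everything here is an assumption chosen for illustration: the graph, the binary lifting matrix, the formula Q^T L Q for the coarsened Laplacian (a common convention), and the requirement that a reduction matrix be a left inverse of the lifting matrix. The paper's actual admissibility conditions may differ.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    return [list(row) for row in zip(*a)]


# Combinatorial Laplacian of the path graph 0 - 1 - 2 - 3 (assumed toy graph).
laplacian = [
    [1, -1, 0, 0],
    [-1, 2, -1, 0],
    [0, -1, 2, -1],
    [0, 0, -1, 1],
]

# Lifting matrix Q (4 x 2): nodes {0, 1} map to supernode 0, {2, 3} to supernode 1.
lifting = [[1, 0], [1, 0], [0, 1], [0, 1]]

# Reduction matrix as the pseudo-inverse of Q: uniform averaging per supernode.
reduction_pinv = [[0.5, 0.5, 0, 0], [0, 0, 0.5, 0.5]]

# A different reduction matrix (hypothetical): non-uniform averaging.
# It is still a left inverse of Q, but it is not the pseudo-inverse.
reduction_alt = [[0.25, 0.75, 0, 0], [0, 0, 0.75, 0.25]]

# The coarsened Laplacian is built from the lifting matrix alone.
coarse_laplacian = matmul(transpose(lifting), matmul(laplacian, lifting))

print(coarse_laplacian)                   # Laplacian of the 2-node coarse graph
print(matmul(reduction_pinv, lifting))    # identity on the coarse graph
print(matmul(reduction_alt, lifting))     # also the identity
```

Both reduction matrices send a lifted coarse signal back to itself, yet they map a generic signal on the original graph to different coarse signals, which is the degree of freedom the abstract refers to.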

2506.11482 2026-01-29 stat.ME

Data-Adaptive Integration With Summary Data

Kosuke Morikawa, Sho Komukai, Satoshi Hattori

Comments 23 pages, 5 figures, 1 table

Details
English abstract

Combining an internal individual-level study with readily available external summary statistics promises major efficiency gains at minimal additional cost, yet heterogeneity between sources can bias estimates for the internal target population. We develop a generalized entropy-balancing integration strategy that calibrates external moments to the internal covariate distribution, explicitly permitting a biased external sample. Our estimator of the internal-population mean is doubly robust: it remains consistent when either the outcome-regression model or the entropy-balancing model is correctly specified. When multiple balancing specifications are plausible, we introduce a data-adaptive selection rule. We also provide easy-to-compute, fully estimable diagnostics, based on the Mahalanobis distance and the Pearson chi-square divergence, that pinpoint when integration is guaranteed to strictly outperform the internal sample mean. The approach is implemented in the R package daisy. Simulations and an application to nationwide public-access defibrillation records in Japan demonstrate meaningful precision gains while maintaining bias control under distributional shift.

2505.01044 2026-01-29 q-fin.RM q-fin.ST stat.AP

Exploring different subtypes of recurrent event Cox-regression models in modelling lifetime default risk: A tutorial

Arno Botha, Tanja Verster, Bernard Scheepers

Comments 9162 words, 23 pages (excluding appendix), 11 figures

Details
English abstract

In the pursuit of modelling a loan's probability of default (PD) over its lifetime, repeat default events are often ignored when using Cox Proportional Hazard (PH) models. Excluding such events may produce biased and inaccurate PD-estimates, which can compromise financial buffers against future losses. Accordingly, we investigate a few subtypes of Cox-models that can incorporate recurrent default events. We explore both the Andersen-Gill (AG) and the Prentice-Williams-Peterson (PWP) spell-time models using real-world data as an illustration. These models are compared against a baseline that deliberately ignores recurrent events, called the time to first default (TFD) model. Our models are evaluated using Harrell's c-statistic, adjusted Cox-Snell residuals, and a novel extension of time-dependent receiver operating characteristic analysis. From these Cox-models, we demonstrate how to derive a portfolio-level term-structure of default risk, which is a series of marginal PD-estimates over the average loan's lifetime. While the TFD- and PWP-models do not differ significantly across all diagnostics, the AG-model underperformed expectations. We believe that our pedagogical tutorial, as accompanied by a codebase, would be of great value to practitioner and regulator alike. Accordingly, our work enhances the current practice of using Cox-modelling in producing timeous and accurate PD-estimates under IFRS 9.

2504.06250 2026-01-29 math.PR cs.LG stat.ML

Fractal and Regular Geometry of Deep Neural Networks

Simmaco Di Lillo, Domenico Marinucci, Michele Salvi, Stefano Vigogna

Details
English abstract

We study the geometric properties of random neural networks by investigating the boundary volumes of their excursion sets for different activation functions, as the depth increases. More specifically, we show that, for activations which are not very regular (e.g., the Heaviside step function), the boundary volumes exhibit fractal behavior, with their Hausdorff dimension monotonically increasing with the depth. On the other hand, for activations which are more regular (e.g., ReLU, logistic and $\tanh$), as the depth increases, the expected boundary volumes can either converge to zero, remain constant or diverge exponentially, depending on a single spectral parameter which can be easily computed. Our theoretical results are confirmed in some numerical experiments based on Monte Carlo simulations.

2502.06545 2026-01-29 cs.LG stat.ML

Universal Sequence Preconditioning

Annie Marsden, Elad Hazan

Comments 35 pages, 3 tables, 5 figures

Details
English abstract

We study the problem of preconditioning in sequential prediction. From the theoretical lens of linear dynamical systems, we show that convolving the target sequence corresponds to applying a polynomial to the hidden transition matrix. Building on this insight, we propose a universal preconditioning method that convolves the target with coefficients from orthogonal polynomials such as Chebyshev or Legendre. We prove that this approach reduces regret for two distinct prediction algorithms and yields the first-ever sublinear and hidden-dimension-independent regret bounds (up to logarithmic factors) that hold for systems with marginally stable and asymmetric transition matrices. Finally, extensive synthetic and real-world experiments show that this simple preconditioning strategy improves the performance of a diverse range of algorithms, including recurrent neural networks, and generalizes to signals beyond linear dynamical systems.
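
The claim that convolving the target sequence applies a polynomial to the hidden transition matrix can be checked by hand in the simplest case. The sketch below uses a noiseless scalar system y_t = a^t (an assumed toy setting, with the scalar a standing in for the transition matrix) and unnormalized Chebyshev coefficients. The function names are hypothetical and the paper may use a different normalization or indexing.

```python
def chebyshev_coeffs(n):
    """Coefficients of the Chebyshev polynomial T_n, highest degree first.

    Uses the recurrence T_{k+1}(x) = 2x T_k(x) - T_{k-1}(x).
    """
    prev, curr = [1], [1, 0]  # T_0 = 1, T_1 = x
    if n == 0:
        return prev
    for _ in range(n - 1):
        shifted = [2 * c for c in curr] + [0]  # 2x * T_k
        padded = [0, 0] + prev                 # T_{k-1}, aligned to the same degree
        prev, curr = curr, [s - p for s, p in zip(shifted, padded)]
    return curr


def evaluate(coeffs, x):
    """Evaluate a polynomial (highest degree first) with Horner's rule."""
    result = 0
    for c in coeffs:
        result = result * x + c
    return result


def precondition(targets, coeffs):
    """Convolve the target sequence with the polynomial coefficients.

    Output k corresponds to time t = n + k and equals
    sum_i coeffs[i] * targets[t - i], where n = len(coeffs) - 1.
    """
    n = len(coeffs) - 1
    return [
        sum(c * targets[t - i] for i, c in enumerate(coeffs))
        for t in range(n, len(targets))
    ]


a = 0.9                                   # scalar "transition matrix" (assumed)
targets = [a ** t for t in range(12)]     # y_t = a^t
coeffs = chebyshev_coeffs(4)
preconditioned = precondition(targets, coeffs)

# For this system, sum_i c_i a^(t-i) = a^(t-n) * p(a), so each output is the
# original signal (delayed by n steps) scaled by the polynomial evaluated at a.
print(preconditioned[0], evaluate(coeffs, a))
```

Since Chebyshev polynomials are bounded by 1 in absolute value on [-1, 1], the scaling factor p(a) stays bounded for any a in that interval, including values near the marginally stable boundary; this toy check says nothing about the regret bounds themselves.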

2501.04903 2026-01-29 stat.ML cs.LG

Analyzing decision tree bias towards the minority class

Nathan Phelps, Daniel J. Lizotte, Douglas G. Woolford

Details
English abstract

There is a widespread and longstanding belief that machine learning models are biased towards the majority class when learning from imbalanced binary response data, leading them to neglect or ignore the minority class. Motivated by a recent simulation study that found that decision trees can be biased towards the minority class, our paper aims to reconcile the conflict between that study and other published works. First, we critically evaluate past literature on this problem, finding that failing to consider the conditional distribution of the outcome given the predictors has led to incorrect conclusions about the bias in decision trees. We then show that, under specific conditions, decision trees fit to purity are biased towards the minority class, debunking the belief that decision trees are always biased towards the majority class. This bias can be reduced by adjusting the tree-fitting process to include regularization methods like pruning and setting a maximum tree depth, and/or by using post-hoc calibration methods. Our findings have implications for the use of popular tree-based models, such as random forests. Although random forests are often composed of decision trees fit to purity, our work adds to recent literature indicating that this may not be the best approach.

2412.19004 2026-01-29 stat.ME

Robust functional PCA for relative data

Jeremy Oguamalam, Peter Filzmoser, Karel Hron, Alessandra Menafoglio, Una Radojičić

Details
English abstract

This paper introduces a robust approach to functional principal component analysis (FPCA) for relative data, particularly density functions. While recent papers have studied density data within the Bayes space framework, there has been limited focus on developing robust methods to effectively handle anomalous observations and large noise. To address this, we extend the Mahalanobis distance concept to Bayes spaces, proposing its regularized version that accounts for the constraints inherent in density data. Based on this extension, we introduce a new method, robust density principal component analysis (RDPCA), for more accurate estimation of functional principal components in the presence of outliers. The method's performance is validated through simulations and real-world applications, showing its ability to improve covariance estimation and principal component analysis compared to traditional methods.

2410.08132 2026-01-29 stat.AP

Dynamic Interconnections between Corruption and Economic Growth

Macavilca Tello Bartolome, Kevin Fernandez, Oscar Cutipa-Luque, Yhon Tiahuallpa, Helder Rojas

Comments We decided to withdraw the current version of the manuscript to improve its structure, clarify some arguments, and incorporate additional results. A revised version will be submitted soon

Details
English abstract

This study explores the dynamic relationship between corruption and economic growth through an approach based on a system of stochastic equations. In the context of globalization and economic interdependencies, corruption not only affects investment and distorts markets, but it can also, under certain conditions, temporarily boost economic activity. Using data from the Gross Domestic Product (GDP) and the Corruption Perception Index (CPI), we implement a time-series-based model to capture the interactions between these two variables. Through a coupled vector autoregressive equations system, our model identifies patterns of interdependence between economic fluctuations and perceptions of corruption at a global level. Employing graph theory and Granger causality, we build a network of interconnections that illustrates how corruption dynamics in one country can influence economic growth and corruption perception in others. The results provide a robust tool for analyzing international political-economic relationships and can serve as a basis for designing policies that promote transparency and sustainable development.

2403.12456 2026-01-29 econ.EM stat.ME

Inflation Target at Risk: A Time-varying Parameter Distributional Regression

Yunyun Wang, Tatsushi Oka, Dan Zhu

Details
English abstract

Inflation exhibits state-dependent, skewed, and fat-tailed dynamics that make risk a central concern for monetary policy. Accordingly, inflation risks are distributional and cannot be fully captured by mean-based models. We propose a flexible time-varying parameter distributional regression model that estimates the full conditional distribution of inflation, allowing macroeconomic drivers to have nonlinear and asymmetric effects across the distribution. Applied to U.S. inflation, the model captures major shifts in tail-risk probabilities. Analysis of risk drivers shows that deflationary pressures arise primarily from demand-side weakness and inflation persistence, whereas upside risks are driven mainly by supply-side shocks, particularly energy price inflation. Examining the impact of key drivers further reveals that the unemployment-inflation relationship weakens in the distributional tails. Energy price shocks, by contrast, have little effect on deflation risk but exhibit strongly time-varying and asymmetric effects on high-inflation risk.

2308.04926 2026-01-29 stat.AP

Bayesian modeling of insurance claims for hail damage

Ophélia Miralles, Anthony C. Davison, Timo Schmid

Details
English abstract

Despite its importance for insurance, there is almost no literature on statistical hail damage modeling. Statistical models for hailstorms exist, though they are generally not open-source, but no study appears to have developed a stochastic hail impact function. In this paper, we use hail-related insurance claim data to build a Gaussian line process with extreme marks to model both the geographical footprint of a hailstorm and the damage to buildings that hailstones can cause. We build a model for the claim counts and claim values, and compare it to the use of a benchmark deterministic hail impact function. Our model proves to be better than the benchmark at capturing hail spatial patterns and allows for localized and extreme damage, which is seen in the insurance data. The evaluation of both the claim counts and value predictions shows that performance is improved compared to the benchmark, especially for extreme damage. Our model appears to be the first to provide realistic estimates for hail damage to individual buildings.