Probabilistic Graphical Models in Astronomy
Comments 11 pages, 3 figures
Abigail Sheerin, Giuseppe Vinci
The field of astronomy is experiencing a data explosion driven by significant advances in observational instrumentation, and classical methods often fall short of addressing the complexity of modern astronomical datasets. Probabilistic graphical models offer powerful tools for uncovering the dependence structures and data-generating processes underlying a wide array of cosmic variables. By representing variables as nodes in a network, these models allow for the visualization and analysis of the intricate relationships that underpin theories of hierarchical structure formation within the universe. We highlight the value that graphical models bring to astronomical research by demonstrating their practical application to the study of exoplanets and host stars.
Adam Chojecki, Piotr Graczyk, Hideyuki Ishi, Bartosz Kołodziejek
We study Bayesian model selection in colored Gaussian graphical models (CGGMs), which combine sparsity of conditional independencies with symmetry constraints encoded by vertex- and edge-colored graphs. A computational bottleneck in Bayesian inference for CGGMs is the evaluation of Diaconis-Ylvisaker normalizing constants, given by gamma-type integrals over cones of precision matrices with prescribed zeros and equality constraints. While explicit formulas are known for standard Gaussian graphical models only in special cases (e.g. decomposable graphs) and for a limited class of RCOP models, no general tractable framework has been available for broader families of CGGMs. We introduce a new subclass of RCON models for which these normalizing constants admit closed-form expressions. On the algebraic side, we identify conditions on spaces of colored precision matrices that guarantee tractability of the associated integrals, leading to Block-Cholesky spaces (BC-spaces) and Diagonally Commutative Block-Cholesky spaces (DCBC-spaces). On the combinatorial side, we characterize the colored graphs inducing such spaces via a color perfect elimination ordering and a 2-path regularity condition, and define the resulting Color Elimination-Regular (CER) graphs and their symmetric variants. This class strictly extends decomposable graphs in the uncolored setting and contains all RCOP models associated with decomposable graphs. In the one-color case, our framework reveals a close connection between DCBC-spaces and Bose-Mesner algebras. For models defined on BC- and DCBC-spaces, we derive explicit closed-form formulas for the normalizing constants in terms of a finite collection of structure constants and propose an efficient method for computing them in the commutative case. Our results broaden the range of CGGMs amenable to principled Bayesian structure learning in high-dimensional applications.
Hyojung Jang, Peter M. Graffy, Benjamin W. Barrett, Daniel E. Horton, Jennifer L. Chan, Abel N. Kho
Extreme heat is an escalating public health concern. Although prior studies have examined heat-health associations, their reliance on restricted diagnoses and diagnostic categories misses or misclassifies heat-related illness. We conducted a heat-wide association study to identify acute-care diagnoses associated with extreme heat in Chicago, Illinois. Using 916,904 acute-care visits -- including emergency department and urgent care encounters -- among 372,140 adults across five healthcare systems from 2011-2023, we applied a two-stage analytic approach: quasi-Poisson regression to screen 1,803 diagnosis codes for heat-related risks, followed by distributed lag non-linear models in a time-stratified case-crossover design to refine the list of heat-related diagnoses and estimate same-day and short-term cumulative odds ratios of acute-care visits during extreme heat versus reference temperature. We observed same-day increases in visits for heat illness, volume depletion, hypotension, edema, acute kidney failure, and multiple injuries. By analyzing the full diagnostic spectrum of acute-care services, this study comprehensively characterizes heat-associated morbidity, reinforcing and advancing existing literature.
Navid Ardeshir, Samuel Deng, Daniel Hsu, Jingwen Liu
The sample complexity of multi-group learning is shown to improve in the group-realizable setting over the agnostic setting, even when the family of groups is infinite so long as it has finite VC dimension. The improved sample complexity is obtained by empirical risk minimization over the class of group-realizable concepts, which itself could have infinite VC dimension. Implementing this approach is also shown to be computationally intractable, and an alternative approach is suggested based on improper learning.
Antesh Upadhyay, Sang Bin Moon, Abolfazl Hashemi
We introduce FedSGM, a unified framework for federated constrained optimization that addresses four major challenges in federated learning (FL): functional constraints, communication bottlenecks, local updates, and partial client participation. Building on the switching gradient method, FedSGM provides projection-free, primal-only updates, avoiding expensive dual-variable tuning or inner solvers. To handle communication limits, FedSGM incorporates bi-directional error feedback, correcting the bias introduced by compression while explicitly understanding the interaction between compression noise and multi-step local updates. We derive convergence guarantees showing that the averaged iterate achieves the canonical $\boldsymbol{\mathcal{O}}(1/\sqrt{T})$ rate, with additional high-probability bounds that decouple optimization progress from sampling noise due to partial participation. Additionally, we introduce a soft switching version of FedSGM to stabilize updates near the feasibility boundary. To our knowledge, FedSGM is the first framework to unify functional constraints, compression, multiple local updates, and partial client participation, establishing a theoretically grounded foundation for constrained federated learning. Finally, we validate the theoretical guarantees of FedSGM via experimentation on Neyman-Pearson classification and constrained Markov decision process (CMDP) tasks.
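The switching rule that FedSGM builds on can be illustrated in a single-machine toy setting. The sketch below is not the FedSGM algorithm: it omits clients, compression, error feedback, and local updates. The function name `switching_gradient`, the one-dimensional problem, the step size, and the tolerance are all hypothetical choices made only for illustration.

```python
def switching_gradient(grad_f, g, grad_g, x0, step, tol, iters):
    """Toy switching gradient method for: minimise f(x) subject to g(x) <= 0.

    When the constraint is (nearly) satisfied, step along the objective
    gradient; otherwise step along the constraint gradient. The output is
    the average of the iterates at which an objective step was taken.
    """
    x = x0
    feasible_sum, feasible_count = 0.0, 0
    for _ in range(iters):
        if g(x) <= tol:
            # Nearly feasible: record the iterate, then decrease the objective.
            feasible_sum += x
            feasible_count += 1
            x -= step * grad_f(x)
        else:
            # Infeasible: decrease the constraint violation instead.
            x -= step * grad_g(x)
    return feasible_sum / feasible_count if feasible_count else x


# Hypothetical problem: minimise (x - 2)^2 subject to x - 1 <= 0 (optimum x = 1).
x_hat = switching_gradient(
    grad_f=lambda x: 2.0 * (x - 2.0),
    g=lambda x: x - 1.0,
    grad_g=lambda x: 1.0,
    x0=0.0, step=0.01, tol=0.01, iters=5000,
)
print(x_hat)
```

No projection or dual variable appears anywhere in the loop, which is the property the abstract describes as projection-free and primal-only.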
Jaume Anguera Peris, Joakim Jaldén
Edge intelligence enables AI inference at the network edge, co-located with or near the radio access network, rather than in centralized clouds or on mobile devices. It targets low-latency, resource-constrained applications with large data volumes, requiring tight integration of wireless access and on-site computing. Yet system performance and cost-efficiency hinge on joint pre-deployment dimensioning of radio and computational resources, especially under spatial and temporal uncertainty. Prior work largely emphasizes run-time allocation or relies on simplified models that decouple radio and computing, missing end-to-end correlations in large-scale deployments. This paper introduces a unified stochastic framework to dimension multi-cell edge-intelligent systems. We model network topology with Poisson point processes, capturing random user and base-station locations, inter-cell interference, distance-based fractional power control, and peak-power constraints. By combining this with queueing theory and empirical AI inference workload profiling, we derive tractable expressions for end-to-end offloading delay. These enable a non-convex joint optimization that minimizes deployment cost under statistical QoS guarantees, expressed through strict tail-latency and inference-accuracy constraints. We prove the problem decomposes into convex subproblems, yielding global optimality. Numerical results in noise- and interference-limited regimes identify cost-efficient design regions and configurations that cause under-utilization or user unfairness. Smaller cells reduce transmission delay but raise per-request computing cost due to weaker server multiplexing, whereas larger cells show the opposite trend. Densification reduces computational costs only when frequency reuse scales with base-station density; otherwise, sparser deployments improve fairness and efficiency in interference-limited settings.
Edoardo Otranto, Luca Scaffidi Domianello
Comments 24 pages, 4 figures
Recent developments in financial time series focus on modeling volatility across multiple assets or indices in a multivariate framework, accounting for potential interactions such as spillover effects. Furthermore, the increasing integration of global financial markets gives rise to similar dynamics across markets (referred to as co-movement). In this context, we introduce a novel model for volatility vectors within the Multiplicative Error Model (MEM) class. This framework accommodates both spillover and co-movement effects through a distinct latent component. By adopting a specific parameterization, the model remains computationally feasible even for high-dimensional volatility vectors. To reduce the number of unknown coefficients, we propose a simple model-based clustering procedure. We illustrate the effectiveness of the proposed approach through an empirical application to 29 assets of the Dow Jones Industrial Average index, providing insight into volatility spillovers and shared market dynamics. Comparative analysis against alternative vector MEMs, including a fully parameterized version of the proposed model, demonstrates its superior or at least comparable performance across multiple evaluation criteria.
Andrew Thompson, Miles McCrory
We give analytical results for propagation of uncertainty through trained multi-layer perceptrons (MLPs) with a single hidden layer and ReLU activation functions. More precisely, we give expressions for the mean and variance of the output when the input is multivariate Gaussian. In contrast to previous results, we obtain exact expressions without resort to a series expansion.
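As a point of orientation, the scalar building block of such results is the well-known closed form for the mean and variance of a rectified Gaussian. The sketch below covers only a single ReLU unit with a univariate Gaussian pre-activation; it is not the paper's expression for a full single-hidden-layer MLP with multivariate input, and the function name is hypothetical.

```python
import math


def relu_gaussian_moments(mu, sigma):
    """Mean and variance of max(0, z) for z ~ N(mu, sigma^2), with sigma > 0."""
    a = mu / sigma
    pdf = math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)  # standard normal density at a
    cdf = 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))         # standard normal CDF at a
    mean = mu * cdf + sigma * pdf
    second_moment = (mu * mu + sigma * sigma) * cdf + mu * sigma * pdf
    return mean, second_moment - mean * mean


mean, var = relu_gaussian_moments(0.0, 1.0)
print(mean, var)
```

For a standard normal input, the mean reduces to 1/sqrt(2*pi) and the variance to 1/2 - 1/(2*pi), which gives a quick sanity check.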
Danna L. Cruz-Reyes, Renato M. Assunção, Reinaldo B. Arellano-Valle, Rosangela H. Loschi
Comments The manuscript is clearly written and well structured. It spans approximately 10 pages and includes several figures that effectively illustrate the proposed methodology and results. Overall, the paper makes a solid contribution to the literature on spatial hierarchical modeling and is suitable for publication after minor revisions
We introduce a skewed, edge-based spatial prior, named RENeGe sk, that extends the Gaussian RENeGe framework by incorporating directional asymmetry through a skew normal distribution. Skewness is defined on the edge graph and propagated to the node space, aligning asymmetric behavior with transitions across neighboring regions rather than with marginal node effects. The model is formulated within the skew normal framework and employs identifiable hierarchical priors together with low-rank parameterizations to ensure scalability. The stochastic representation of the skew normal is used to facilitate computational implementation. Simulation studies show that RENeGe sk recovers compact, edge-aligned directional structure more accurately than symmetric Gaussian priors, while remaining competitive under irregular spatial patterns. An application to cancer incidence data in Southern Brazil illustrates how the proposed approach yields stable area-level estimates while preserving localized, directionally driven spatial variation.
Jun Xiao, Qiong Wang, Yihui Li, Zhexuan Yu, Hao Zhou, Borong Lin
Comments Submitted to Advanced Engineering Informatics, currently Under Review
Artificial intelligence in construction increasingly depends on structured representations such as Building Information Models and knowledge graphs, yet early-stage building designs are predominantly created as flexible boundary-representation (B-rep) models that lack explicit spatial, semantic, and performance structure. This paper presents a robust, fully automated framework that transforms unstructured B-rep geometry into knowledge-graph-based Building Information Models and further into executable Building Energy Models. The framework enables artificial intelligence to explicitly interpret building elements, spatial topology, and their associated thermal and performance attributes. It integrates automated geometry cleansing, multiple auto space-generation strategies, graph-based extraction of space and element topology, ontology-aligned knowledge modeling, and reversible transformation between ontology-based BIM and EnergyPlus energy models. Validation on parametric, sketch-based, and real-world building datasets demonstrates high robustness, consistent topological reconstruction, and reliable performance-model generation. By bridging design models, BIM, and BEM, the framework provides an AI-oriented infrastructure that extends BIM- and graph-based intelligence pipelines to flexible early-stage design geometry, enabling performance-driven design exploration and optimization by learning-based methods.
Joshua Corneck, Edward A. K. Cohen, Francesco Sanna Passino
In many real-world networks, data on the edges evolve in continuous time, naturally motivating representations based on point processes. Heterogeneity in edge types further gives rise to multiplex network point processes. In this work, we propose a model for multiplex network data observed in continuous time. We establish two-to-infinity norm consistency and asymptotic normality for spectral-embedding-based estimation of the model parameters as both network size and time resolution increase. Drawing inspiration from random dot product graph models, each edge intensity is expressed as the inner product of two low-dimensional latent positions: one dynamic and layer-agnostic, the other static and layer-dependent. These latent positions constitute the primary objects of inference, which is conducted via spectral embedding methods. Our theoretical results are established under a histogram estimator of the network intensities and provide justification for applying a doubly unfolded adjacency spectral embedding method for estimation. Simulations and real-data analyses demonstrate the effectiveness of the proposed model and inference procedure.
Eunseong Bae, Wolfgang Polonik
Under the assumption that data lie on a compact (unknown) manifold without boundary, we derive finite sample bounds for kernel smoothing and its (first and second) derivatives, and we establish asymptotic normality through Berry-Esseen type bounds. Special cases include kernel density estimation, kernel regression and the heat kernel signature. Connections to the graph Laplacian are also discussed.
Tianang Deng, Yu Deng, Tianchen Gao, Yonghong Hu, Rui Pan
Rapid financial innovation has been accompanied by a sharp increase in patenting activity, making timely and comprehensive prior-art discovery more difficult. This problem is especially evident in financial technologies, where innovations develop quickly, patent collections grow continuously, and citation recommendation systems must be updated as new applications arrive. Existing patent retrieval and citation recommendation methods typically rely on static indexes or periodic retraining, which limits their ability to operate effectively in such dynamic settings. In this study, we propose a real-time patent citation recommendation framework designed for large and fast-changing financial patent corpora. Using a dataset of 428,843 financial patents granted by the China National Intellectual Property Administration (CNIPA) between 2000 and 2024, we build a three-stage recommendation pipeline. The pipeline uses large language model (LLM) embeddings to represent the semantic content of patent abstracts, applies efficient approximate nearest-neighbor search to construct a manageable candidate set, and ranks candidates by semantic similarity to produce top-k citation recommendations. In addition to improving recommendation accuracy, the proposed framework directly addresses the dynamic nature of patent systems. By using an incremental indexing strategy based on hierarchical navigable small-world (HNSW) graphs, newly issued patents can be added without rebuilding the entire index. A rolling day-by-day update experiment shows that incremental updating improves recall while substantially reducing computational cost compared with rebuild-based indexing. The proposed method also consistently outperforms traditional text-based baselines and alternative nearest-neighbor retrieval approaches.
Ian Carbó Casals
Comments 12 pages, 5 figures, 2 tables. Code available at: https://github.com/Bailduke/bayesian-sentiment-state-space-model
Text-based sentiment indicators are widely used to monitor public and market mood, but weekly sentiment series are noisy by construction. A main reason is that the amount of relevant news changes over time and across categories. As a result, some weekly averages are based on many articles, while others rely on only a few. Existing approaches do not explicitly account for changes in data availability when measuring uncertainty. We present a Bayesian state-space framework that turns aggregated news sentiment into a smoothed time series with uncertainty. The model treats each weekly sentiment value as a noisy measurement of an underlying sentiment process, with observation uncertainty scaled by the effective information weight $n_{tj}$: when coverage is high, latent sentiment is anchored more strongly to the observed aggregate; when coverage is low, inference relies more on the latent dynamics and uncertainty increases. Using news data grouped into multiple categories, we find broadly similar latent dynamics across categories, while larger differences appear in observation noise. The framework is designed for descriptive monitoring and can be extended to other text sources where information availability varies over time.
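The anchoring behaviour described here can be sketched with a local-level Kalman filter. This is a simplified stand-in rather than the paper's Bayesian model: it assumes a random-walk latent state, an observation variance of the form `r / n_t`, and fixed (not inferred) variances `q` and `r`. All names and default values are hypothetical.

```python
def filter_sentiment(y, n, q=0.05, r=1.0, m0=0.0, p0=1.0):
    """Local-level Kalman filter with coverage-scaled observation noise.

    y : weekly aggregated sentiment values
    n : effective information weights (must be > 0), one per week
    q : variance of the latent random-walk innovation (assumed known)
    r : observation variance for a single unit of information (assumed known)
    """
    m, p = m0, p0
    means, variances = [], []
    for y_t, n_t in zip(y, n):
        p = p + q                   # predict: latent state follows a random walk
        obs_var = r / n_t           # more coverage -> smaller observation noise
        gain = p / (p + obs_var)    # weight given to the new observation
        m = m + gain * (y_t - m)    # update the filtered mean
        p = (1.0 - gain) * p        # update the filtered variance
        means.append(m)
        variances.append(p)
    return means, variances


# Same observed value, very different coverage.
high_mean, high_var = filter_sentiment([1.0], [100])
low_mean, low_var = filter_sentiment([1.0], [1])
print(high_mean, high_var, low_mean, low_var)
```

With high coverage the filtered mean moves almost all the way to the observed aggregate and the posterior variance is small; with low coverage the estimate stays closer to the prior and the variance remains large.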
Pedro Picchetti
This paper develops a finite population framework for analyzing causal effects in settings with imperfect compliance where multiple treatments affect the outcome of interest. Two prominent examples are factorial designs and panel experiments with imperfect compliance. I define finite population causal effects that capture the relative effectiveness of alternative treatment sequences. I provide nonparametric estimators for a rich class of factorial and dynamic causal effects and derive their finite population distributions as the sample size increases. Monte Carlo simulations illustrate the desirable properties of the estimators. Finally, I use the estimator for causal effects in factorial designs to revisit a famous voter mobilization experiment that analyzes the effects of voting encouragement through phone calls on turnout.
Claudio Agostinelli
Comments Comment on arXiv:2302.02156
The main aim of robust statistics is the development of methods able to cope with the presence of outliers. A new type of outlier, namely the "cellwise" outlier, has garnered considerable attention. The state of the art for dealing with cellwise contamination in different models is presented in Raymaekers and Rousseeuw (2024). Outliers in time series can be treated as cellwise outliers; a further discussion of this subject is presented.
Jakob Robnik, Uroš Seljak
Comments 21 pages, 9 figures
Despite the enormous success of Hamiltonian Monte Carlo and related Markov Chain Monte Carlo (MCMC) methods, sampling often still represents the computational bottleneck in scientific applications. Availability of parallel resources can significantly speed up MCMC inference by running a large number of chains in parallel, each collecting a single sample. However, the parallel approach converges slowly if the chains are not initialized close to the target distribution (cold start). Theoretically, this can be resolved by initially running MCMC without Metropolis-Hastings adjustment to quickly converge to the vicinity of the target distribution and then turning on adjustment to achieve fine convergence. However, no practical scheme uses this strategy, due to the difficulty of automatically selecting the step size during the unadjusted phase. Here we develop the Late Adjusted Parallel Sampler (LAPS), which is precisely such a scheme and is applicable out of the box: all hyperparameters are selected automatically. LAPS takes advantage of ensemble-based hyperparameter adaptation to estimate the bias at each iteration and converts it to the appropriate step size. We show that LAPS consistently and significantly outperforms ensemble adjusted methods such as MEADS or ChESS and the optimization-based initializer Pathfinder on a variety of standard benchmark problems. LAPS typically achieves two orders of magnitude lower wall-clock time than the corresponding sequential algorithms such as NUTS.
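The "unadjusted first, adjusted later" idea can be sketched with Langevin dynamics on a one-dimensional standard normal target. This is not LAPS: the step size `eps` is fixed by hand (automating it is precisely the paper's contribution), Langevin proposals stand in for the paper's dynamics, and every name below is hypothetical.

```python
import math
import random


def log_p(x):
    return -0.5 * x * x          # standard normal target, up to a constant


def grad_log_p(x):
    return -x


def langevin_step(x, eps, rng, adjust):
    """One Langevin step; with adjust=True a Metropolis-Hastings test is applied."""
    mean_fwd = x + 0.5 * eps * eps * grad_log_p(x)
    y = mean_fwd + eps * rng.gauss(0.0, 1.0)
    if not adjust:
        return y                  # unadjusted phase: always accept
    mean_bwd = y + 0.5 * eps * eps * grad_log_p(y)
    log_q_fwd = -((y - mean_fwd) ** 2) / (2.0 * eps * eps)
    log_q_bwd = -((x - mean_bwd) ** 2) / (2.0 * eps * eps)
    log_alpha = log_p(y) - log_p(x) + log_q_bwd - log_q_fwd
    return y if rng.random() < math.exp(min(0.0, log_alpha)) else x


def unadjusted_then_adjusted(n_chains, n_unadjusted, n_adjusted, eps, start, seed):
    """Run an ensemble from a cold start; each chain contributes its final state."""
    rng = random.Random(seed)
    chains = [start] * n_chains
    for it in range(n_unadjusted + n_adjusted):
        adjust = it >= n_unadjusted   # switch the MH test on late in the run
        chains = [langevin_step(x, eps, rng, adjust) for x in chains]
    return chains


samples = unadjusted_then_adjusted(
    n_chains=1000, n_unadjusted=100, n_adjusted=50, eps=0.5, start=10.0, seed=0
)
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
print(sample_mean, sample_var)
```

The unadjusted phase moves the whole ensemble from the cold start at 10 to the bulk of the target quickly but with a step-size-dependent bias; the adjusted phase then removes that bias.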
Joni Virta, Takeru Matsuda
Comments 21 pages, 2 figures
We propose a new asymptotic test for the separability of a covariance matrix. The null distribution is valid in a wide matrix elliptical model that includes, in particular, both the matrix Gaussian and matrix $t$-distributions. The test is fast to compute and makes no assumptions about the component covariance matrices. An alternative, Wald-type version of the test is also proposed. Our simulations reveal that both versions of the test have good power even for heavier-tailed distributions and can compete with the Gaussian likelihood ratio test in the case of normal data.
A. Hatstatt, X. Zhu, B. Sudret
Polynomial Chaos Expansions (PCEs) are widely recognized for their efficient computational performance in surrogate modeling. Yet, a robust framework to quantify local model errors is still lacking. While the local uncertainty of PCE prediction can be captured using bootstrap resampling, other methods offering more rigorous statistical guarantees are needed, especially in the context of small training datasets. Recently, conformal predictions have demonstrated strong potential in machine learning, providing statistically robust and model-agnostic prediction intervals. Due to its generality and versatility, conformal prediction is especially valuable, as it can be adapted to suit a variety of problems, making it a compelling choice for PCE-based surrogate models. In this contribution, we explore its application to PCE-based surrogate models. More precisely, we present the integration of two conformal prediction methods, namely the full conformal and the Jackknife+ approaches, into both full and sparse PCEs. For full PCEs, we introduce computational shortcuts inspired by the inherent structure of regression methods to optimize the implementation of both conformal methods. For sparse PCEs, we incorporate the two approaches with appropriate modifications to the inference strategy, thereby circumventing the non-symmetrical nature of the regression algorithm and ensuring valid prediction intervals. Our developments yield better-calibrated prediction intervals for both full and sparse PCEs, achieving superior coverage over existing approaches, such as the bootstrap, while maintaining a moderate computational cost.
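For readers unfamiliar with the Jackknife+ construction, a minimal sketch follows. It uses an ordinary least-squares line as a stand-in surrogate rather than a PCE, does not include the computational shortcuts or sparse-PCE modifications the abstract mentions, and uses hypothetical function names and toy data.

```python
import math


def fit_line(xs, ys):
    """Least-squares line; returns a prediction function (stand-in surrogate)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def jackknife_plus(xs, ys, x_new, alpha):
    """Jackknife+ prediction interval at x_new with target coverage 1 - alpha."""
    n = len(xs)
    lows, highs = [], []
    for i in range(n):
        # Refit without point i, then score point i with the leave-one-out model.
        model = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        residual = abs(ys[i] - model(xs[i]))
        pred = model(x_new)
        lows.append(pred - residual)
        highs.append(pred + residual)
    lows.sort()
    highs.sort()
    k_lo = math.floor(alpha * (n + 1))          # rank of the lower endpoint
    k_hi = math.ceil((1.0 - alpha) * (n + 1))   # rank of the upper endpoint
    lower = lows[k_lo - 1] if k_lo >= 1 else -math.inf
    upper = highs[k_hi - 1] if k_hi <= n else math.inf
    return lower, upper


# Toy data: a line with a small alternating perturbation.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + 0.1 * (-1) ** i for i, x in enumerate(xs)]
lower, upper = jackknife_plus(xs, ys, x_new=10.5, alpha=0.1)
print(lower, upper)
```

The loop refits the surrogate once per training point, which is why shortcuts that exploit the regression structure matter for the cost of this approach.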
Ming Kang, Fung Fung Ting, Raphaël C.-W. Phan, Zongyuan Ge, Chee-Ming Ting
Comments 10 pages, 3 figures
Nuclei panoptic segmentation supports cancer diagnostics by integrating both semantic and instance segmentation of different cell types to analyze overall tissue structure and individual nuclei in histopathology images. Major challenges include detecting small objects, handling ambiguous boundaries, and addressing class imbalance. To address these issues, we propose PanopMamba, a novel hybrid encoder-decoder architecture that integrates Mamba and Transformer with additional feature-enhanced fusion via state space modeling. We design a multiscale Mamba backbone and a State Space Model (SSM)-based fusion network to enable efficient long-range perception in pyramid features, thereby extending the pure encoder-decoder framework while facilitating information sharing across multiscale features of nuclei. The proposed SSM-based feature-enhanced fusion integrates pyramid feature networks and dynamic feature enhancement across different spatial scales, enhancing the feature representation of densely overlapping nuclei in both semantic and spatial dimensions. To the best of our knowledge, this is the first Mamba-based approach for panoptic segmentation. Additionally, we introduce alternative evaluation metrics, including image-level Panoptic Quality ($i$PQ), boundary-weighted PQ ($w$PQ), and frequency-weighted PQ ($fw$PQ), which are specifically designed to address the unique challenges of nuclei segmentation and thereby mitigate the potential bias inherent in vanilla PQ. Experimental evaluations on two multiclass nuclei segmentation benchmark datasets, MoNuSAC2020 and NuInsSeg, demonstrate the superiority of PanopMamba for nuclei panoptic segmentation over state-of-the-art methods. Consequently, the robustness of PanopMamba is validated across various metrics, while the distinctiveness of PQ variants is also demonstrated. Code is available at https://github.com/mkang315/PanopMamba.
Dafne Zorzetto, Zizhao Xie, Julian Stamp, Arman Oganisian, Roberta De Vito
Diet plays a crucial role in health, and understanding the causal effects of dietary patterns is essential for informing public health policy and personalized nutrition strategies. However, causal inference in nutritional epidemiology faces several challenges: (i) high-dimensional and correlated food/nutrient intake data induce massive treatment levels; (ii) nutritional studies are interested in latent dietary patterns rather than single food items; and (iii) the goal is to estimate heterogeneous causal effects of these dietary patterns on health outcomes. We address these challenges by introducing a sophisticated exposure mapping framework that reduces the high-dimensional treatment space via factor analysis and enables the identification of dietary patterns. We also extend the Bayesian Causal Forest to accommodate three ordered levels of dietary exposure, better capturing the complex structure of nutritional data and enabling estimation of heterogeneous causal effects. We evaluate the proposed method through extensive simulations and apply it to a multi-center epidemiological study of Hispanic/Latino adults residing in the US. Using high-dimensional dietary data, we identify six dietary patterns and estimate their causal link with two key health risk factors: body mass index and fasting insulin levels. Our findings suggest that higher consumption of plant lipid-antioxidant, plant-based, animal protein, and dairy product patterns is associated with reduced risk.
Weichang Yu, Khue-Dung Dang
We develop a fast and accurate grouped penalized credible region approach for variable selection and prediction in Bayesian high-dimensional linear regression. Most existing Bayesian methods are either subject to high computational costs due to long Markov Chain Monte Carlo runs or yield ambiguous variable selection results due to non-sparse solution output. The penalized credible region framework yields sparse post-processed estimates that facilitate unambiguous grouped variable selection. High estimation accuracy is achieved by shrinking noise from unimportant groups using a grouped global-local shrinkage prior. To ensure computational scalability, we approximate posterior summaries using coordinate ascent variational inference and recast the penalized credible region framework as a convex optimization problem that admits efficient computations. We prove that the resultant post-processed estimators are both parameter-consistent and variable selection consistent in high-dimensional settings. Theory is developed to justify running the coordinate ascent algorithm for at least two cycles. Through extensive simulations, we demonstrate that our proposed method outperforms state-of-the-art methods in grouped variable selection, prediction, and computation time for several common models including ANOVA and nonparametric varying coefficient models.
María F. Gil-Leyva, Antonio Lijoi, Ramsés H. Mena, Igor Prünster
Comments 57 pages, 13 figures, to be published in Annals of Statistics
Stick-breaking has a long history and is one of the most popular procedures for constructing random discrete distributions in Statistics and Machine Learning. In particular, owing to their intuitive construction and computational tractability, stick-breaking processes are ubiquitous in modern Bayesian nonparametric inference. Most widely used models, such as the Dirichlet and the Pitman-Yor processes, rely on iid or independent length variables. Here we pursue a completely unexplored research direction by considering Markov length variables and investigate the corresponding general class of stick-breaking processes, which we term Markov stick-breaking processes. We establish conditions under which the associated species sampling process is proper and the distribution of a Markov stick-breaking process has full topological support, two fundamental desiderata for Bayesian nonparametric models. We also analyze the stochastic ordering of the weights and provide a new characterization of the Pitman-Yor process as the only stick-breaking process invariant under size-biased permutations, under mild conditions. Moreover, we identify two notable subclasses of Markov stick-breaking processes that enjoy appealing properties and include Dirichlet, Pitman-Yor and Geometric priors as special cases. Our findings include distributional results enabling posterior inference algorithms and methodological insights.
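The classical construction with iid length variables can be sketched in a few lines. The example below draws the first `k` weights of a Dirichlet process with iid Beta(1, theta) lengths; it does not implement the Markov length variables studied in the paper, and the function name and parameter values are hypothetical.

```python
import random


def stick_breaking_weights(theta, k, rng):
    """First k stick-breaking weights with iid Beta(1, theta) length variables.

    Returns the weights and the length of the stick that is still unbroken.
    """
    weights, remaining = [], 1.0
    for _ in range(k):
        v = rng.betavariate(1.0, theta)   # fraction of the remaining stick to break off
        weights.append(remaining * v)
        remaining *= 1.0 - v
    return weights, remaining


rng = random.Random(0)
weights, leftover = stick_breaking_weights(theta=2.0, k=50, rng=rng)
print(sum(weights), leftover)
```

A Markov variant would replace the independent draw of `v` with a draw that depends on the previous length variable; the form of that dependence is what the paper specifies and is not guessed at here.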
Ruonan Zheng, Min-Qian Liu, Yongdao Zhou, Xuan Chen
Comments 24 pages, 5 figures
Computer experiments have become an indispensable alternative to complex physical and engineering experiments. The Kriging model is the most widely used surrogate model, with the core goal of minimizing the discrepancy between the surrogate and true models across the entire experimental domain. However, existing sequential design methods have critical limitations: observation-based batch sequential designs are rarely studied, while one-point sequential designs make poor use of available information and resources, since they require numerous repeated observation rounds to accumulate sufficient points, leading to prolonged experimental cycles. To address these gaps, this paper proposes two novel one-point sequential design criteria and a general batch sequential design framework. Moreover, the batch sequential design framework solves the inherent point clustering problem in naive batch selection, enabling efficient extension of any sequential criterion to batch scenarios. Simulations on some test functions demonstrate that the proposed methods outperform existing approaches in terms of fitting accuracy in most cases.
Jiazhen Xu, Han Lin Shang
Spherically embedded spatial data are spatially indexed observations whose values naturally reside on or can be equivalently mapped to the unit sphere. Such data are increasingly ubiquitous in fields ranging from geochemistry to demography. However, analysing such data presents unique difficulties due to the intrinsic non-Euclidean nature of the sphere, and rigorous methodologies for statistical modelling, inference, and uncertainty quantification remain limited. This paper introduces a unified framework to address these three limitations for spherically embedded spatial data. We first propose a novel spherical spatial autoregressive model that leverages optimal transport geometry and then extend it to accommodate exogenous covariates. Second, for either scenario with or without covariates, we establish the asymptotic properties of the estimators and derive a distribution-free Wald test for spatial dependence, complemented by a bootstrap procedure to enhance finite-sample performance. Third, we contribute a novel approach to uncertainty quantification by developing a conformal prediction procedure specifically tailored to spherically embedded spatial data. The practical utility of these methodological advances is illustrated through extensive simulations and applications to Spanish geochemical compositions and Japanese age-at-death mortality distributions.
Simon Mack, Marc Ditzhaus, Merle Munko, Markus Pauly
Comments 25 pages, 9 figures, 2 tables
In competing risks models, cumulative incidence functions are commonly compared to infer differences between groups. Many existing inference methods, however, struggle when these functions cross during the time frame of interest. To address this problem, we investigate a test statistic based on the area between cumulative incidence functions. As the corresponding limiting distribution depends on quantities that are typically unknown, we propose a wild bootstrap approach to obtain a feasible and asymptotically valid two-sample test. The finite sample performance of the proposed method, in comparison with existing methods, is examined in an extensive simulation study.
Jongmin Mun, Jeong Hoon Jang
Comments 39 pages (main 21 pages), 8 figures, 3 tables
Motivated by renal imaging studies that combine renogram curves with pharmacokinetic and demographic covariates, we propose Hybrid partial least squares (Hybrid PLS) for simultaneous supervised dimension reduction and regression in the presence of cross-modality correlations. The proposed approach embeds multiple functional and scalar predictors into a unified hybrid Hilbert space and rigorously extends the nonlinear iterative PLS (NIPALS) algorithm. This theoretical development is complemented by a sample-level algorithm that incorporates roughness penalties to control smoothness. By exploiting the rank-one structure of the resulting optimization problem, the algorithm admits a computationally efficient closed-form solution that requires solving only linear systems at each iteration. We establish fundamental geometric properties of the proposed framework, including orthogonality of the latent scores and PLS directions. Extensive numerical studies on synthetic data, together with an application to a renal imaging study, validate these theoretical results and demonstrate the method's ability to recover predictive structure under intermodal multicollinearity, yielding parsimonious low-dimensional representations.
Erika McPhillips, Hyeongseong Lee, Xiangyu Xie, Kathy Baylis, Chris Funk, Mengyang Gu
Weather conditions can drastically alter the state of crops and rangelands, and in turn, impact the incomes and food security of individuals worldwide. Satellite-based remote sensing offers an effective way to monitor vegetation and climate variables on regional and global scales. The annual peak Normalized Difference Vegetation Index (NDVI), derived from satellite observations, is closely associated with crop development, rangeland biomass, and vegetation growth. Although various machine learning methods have been developed to forecast NDVI over short time ranges, such as one-month-ahead predictions, long-term forecasting approaches, such as one-year-ahead predictions of vegetation conditions, are not yet available. To fill this gap, we develop a two-phase machine learning model to forecast the one-year-ahead peak NDVI over high-resolution grids, using the Four Corners region of the Southwestern United States as a testbed. In phase one, we identify informative climate attributes, including precipitation and maximum vapor pressure deficit, and develop the generalized parallel Gaussian process that captures the relationship between climate attributes and NDVI. In phase two, we forecast these climate attributes using historical data at least one year before the NDVI prediction month, which then serve as inputs to forecast the peak NDVI at each spatial grid. We developed open-source tools that outperform alternative methods for both gross NDVI and grid-based NDVI one-year forecasts, providing information that can help farmers and ranchers make actionable plans a year in advance.
Nesta Midavaine, Christian A. Naesseth, Grigory Bartosh
Language diffusion models aim to improve sampling speed and coherence over autoregressive LLMs. We introduce Neural Flow Diffusion Models for language generation, an extension of NFDM that enables the straightforward application of continuous diffusion models to discrete state spaces. NFDM learns a multivariate forward process from the data, ensuring that the forward process and generative trajectory are a good fit for language modeling. Our model substantially reduces the likelihood gap with autoregressive models of the same size, while achieving sample quality comparable to that of previous latent diffusion models.
Nadezhda Gribkova, Jianxi Su, Mengqi Wang
Comments 38 pages, 7 figures
Tail Value-at-Risk (TVaR) is a widely adopted risk measure playing a critically important role in both academic research and industry practice in insurance. In data applications, TVaR is often estimated using the empirical method, owing to its simplicity and nonparametric nature. The empirical TVaR has been explicitly advocated by regulatory authorities as a standard approach for computing TVaR. However, prior literature has pointed out that the empirical TVaR estimator is negatively biased, which can lead to a systemic underestimation of risk in finite-sample applications. This paper aims to deepen the understanding of the bias of the empirical TVaR estimator in two dimensions: its magnitude as well as the key distributional and structural determinants driving the severity of the bias. To this end, we derive a leading-term approximation for the bias based on its asymptotic expansion. The closed-form expression associated with the leading-term approximation enables us to obtain analytical insights into the structural properties governing the bias of the empirical TVaR estimator. To account for the discrepancy between the leading-term approximation and the true bias, we further derive an explicit upper bound for the bias. We validate the proposed bias analysis framework via simulations and demonstrate its practical relevance using real data.
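As a rough illustration of the estimator discussed above, the following minimal sketch computes an empirical TVaR as the average of the largest order statistics. The function name `empirical_tvar` and the tail-size convention `n - floor(n * p)` are assumptions made here for illustration; the paper may use a different convention.

```python
import math


def empirical_tvar(losses, p):
    """Empirical TVaR at level p: mean of the largest n - floor(n*p) losses.

    The tail-size convention is an assumption for this sketch; other
    conventions (e.g. interpolating at the quantile) exist.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    xs = sorted(losses)
    n = len(xs)
    if n == 0:
        raise ValueError("losses must be non-empty")
    # Number of observations treated as the tail (at least one).
    k = max(1, n - math.floor(n * p))
    tail = xs[n - k:]
    return sum(tail) / k
```

Averaging this estimator over many simulated samples of a fixed size and comparing the result with the known TVaR of the sampling distribution is one simple way to observe the finite-sample bias that the abstract describes.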
Ricardo J. Sandoval, Sivaraman Balakrishnan, Avi Feller, Michael I. Jordan, Ian Waudby-Smith
We study nonasymptotic (finite-sample) confidence intervals for treatment effects in randomized experiments. In the existing literature, the effective sample sizes of nonasymptotic confidence intervals tend to be looser than the corresponding central-limit-theorem-based confidence intervals by a factor depending on the square root of the propensity score. We show that this performance gap can be closed, designing nonasymptotic confidence intervals that have the same effective sample size as their asymptotic counterparts. Our approach involves systematic exploitation of negative dependence or variance adaptivity (or both). We also show that the nonasymptotic rates that we achieve are unimprovable in an information-theoretic sense.
Kai Ming Ting, Ye Zhu, Hang Zhang, Tianrun Liang
This paper investigates two fundamental descriptors of data, i.e., density distribution versus mass distribution, in the context of clustering. Density distribution has been the de facto descriptor of data distribution since the introduction of statistics. We show that density distribution has a fundamental limitation -- high-density bias, irrespective of the algorithms used to perform clustering. Existing density-based clustering algorithms have employed different algorithmic means to counter the effect of the high-density bias with some success, but the fundamental limitation of using density distribution remains an obstacle to discovering clusters of arbitrary shapes, sizes and densities. Using the mass distribution as a better foundation, we propose a new algorithm which maximizes the total mass of all clusters, called mass-maximization clustering (MMC). The algorithm can be easily changed to maximize the total density of all clusters in order to examine the fundamental limitation of using density distribution versus mass distribution. The key advantage of the MMC over the density-maximization clustering is that the maximization is conducted without a bias towards dense clusters.
Dapeng Shi, Haoran Zhang, Tiandong Wang, Junhui Wang
Comments 31 pages, 4 figures
Community detection in multilayer networks, which aims to identify groups of nodes exhibiting similar connectivity patterns across multiple network layers, has attracted considerable attention in recent years. Most existing methods are based on the assumption that different layers are either independent or follow specific dependence structures, and edges within the same layer are independent. In this article, we propose a novel method for community detection in multilayer networks that accounts for a broad range of inter-layer and intra-layer dependence structures. The proposed method integrates the multilayer stochastic block model for community detection with a multivariate probit model to capture the structures of inter-layer dependence, which also allows intra-layer dependence. To facilitate parameter estimation, we develop a constrained pairwise likelihood method coupled with an efficient alternating updating algorithm. The asymptotic properties of the proposed method are also established, with a focus on examining the influence of inter-layer and intra-layer dependences on the accuracy of both parameter estimation and community detection. The theoretical results are supported by extensive numerical experiments on both simulated networks and a real-world multilayer trade network.
Alois Duston, Tan Bui-Thanh
Comments Updated to match journal submission and add ACM & MSC class info
We study variance reduction for score estimation and diffusion-based sampling in settings where the clean (target) score is available or can be approximated. Starting from the Target Score Identity (TSI), which expresses the noisy marginal score as a conditional expectation of the target score under the forward diffusion, we develop: (i) a plug-and-play nonparametric self-normalized importance sampling estimator compatible with standard reverse-time solvers, (ii) a variance-minimizing \emph{state- and time-dependent} blending rule between Tweedie-type and TSI estimators together with an anti-correlation analysis, (iii) a data-only extension based on locally fitted proxy scores, and (iv) a likelihood-tilting extension to Bayesian inverse problems. We also propose a \emph{Critic--Gate} distillation scheme that amortizes the state-dependent blending coefficient into a neural gate. Experiments on synthetic targets and PDE-governed inverse problems demonstrate improved sample quality for a fixed simulation budget.
William Consagra, Eardi Lila
Functional magnetic resonance imaging (fMRI) provides an indirect measurement of neuronal activity via hemodynamic responses that vary across brain regions and individuals. Ignoring this hemodynamic variability can bias downstream connectivity estimates. Furthermore, the hemodynamic parameters themselves may serve as important imaging biomarkers. Estimating spatially varying hemodynamics from resting-state fMRI (rsfMRI) is therefore an important but challenging blind inverse problem, since both the latent neural activity and the hemodynamic coupling are unknown. In this work, we propose a methodology for inferring hemodynamic coupling on the cortical surface from rsfMRI. Our approach avoids the highly unstable joint recovery of neural activity and hemodynamics by marginalizing out the latent neural signal and basing inference on the resulting marginal likelihood. To enable scalable, high-resolution estimation, we employ a deep neural network combined with conditional normalizing flows to accurately approximate this intractable marginal likelihood, while enforcing spatial coherence through priors defined on the cortical surface that admit sparse representations. Uncertainty in the hemodynamic estimates is quantified via a double-bootstrap procedure. The proposed approach is extensively validated using synthetic data and real fMRI datasets, demonstrating clear improvements over current methods for hemodynamic estimation and downstream connectivity analysis.
Claire Donnat, Olga Klopp, Hemant Tyagi
Comments 52 pages, 10 figures, 2 tables, corrected minor typos
We consider the problem of joint estimation of the parameters of $m$ linear dynamical systems, given access to single realizations of their respective trajectories, each of length $T$. The linear systems are assumed to reside on the nodes of an undirected and connected graph $G = ([m], \mathcal{E})$, and the system matrices are assumed to either vary smoothly or exhibit a small number of ``jumps'' across the edges. We consider a total variation penalized least-squares estimator and derive non-asymptotic bounds on the mean squared error (MSE) which hold with high probability. In particular, the bounds imply for certain choices of well-connected $G$ that the MSE goes to zero as $m$ increases, even when $T$ is constant. The theoretical results are supported by extensive experiments on synthetic and real data.
Gabriel Ponte, Marcia Fampa, Jon Lee
We establish strong connections between two fundamental nonlinear 0/1 optimization problems coming from the area of experimental design, namely maximum entropy sampling and 0/1 D-Optimality. The connections are based on maps between instances, and we analyze the behavior of these maps. Using these maps, we transport basic upper-bounding methods between these two problems, and we are able to establish new domination results and other inequalities relating various basic upper bounds. Further, we establish results relating how different branch-and-bound schemes based on these maps compare. Additionally, we observe some surprising numerical results, where bounding methods that did not seem promising in their direct application to real-data MESP instances, are now useful for MESP instances that come from 0/1 D-Optimality.
Samuel Bright-Thonney, Christina Reissel, Gaia Grosso, Nathaniel Woodward, Katya Govorkova, Andrzej Novak, Sang Eon Park, Eric Moreno, Philip Harris
Comments Accepted at NeurIPS 2025; 33 pages, 16 figures
Novelty detection in large scientific datasets faces two key challenges: the noisy and high-dimensional nature of experimental data, and the necessity of making statistically robust statements about any observed outliers. While there is a wealth of literature on anomaly detection via dimensionality reduction, most methods do not produce outputs compatible with quantifiable claims of scientific discovery. In this work we directly address these challenges, presenting the first step towards a unified pipeline for novelty detection adapted for the rigorous statistical demands of science. We introduce AutoSciDACT (Automated Scientific Discovery with Anomalous Contrastive Testing), a general-purpose pipeline for detecting novelty in scientific data. AutoSciDACT begins by creating expressive low-dimensional data representations using a contrastive pre-training, leveraging the abundance of high-quality simulated data in many scientific domains alongside expertise that can guide principled data augmentation strategies. These compact embeddings then enable an extremely sensitive machine learning-based two-sample test using the New Physics Learning Machine (NPLM) framework, which identifies and statistically quantifies deviations in observed data relative to a reference distribution (null hypothesis). We perform experiments across a range of astronomical, physical, biological, image, and synthetic datasets, demonstrating strong sensitivity to small injections of anomalous data across all domains.
Lujia Bai, Holger Dette
Most of the work on checking spherical symmetry assumptions on the distribution of the $p$-dimensional random vector $Y$ has its focus on statistical tests for the null hypothesis of exact spherical symmetry. In this paper, we take a different point of view and propose a measure for the deviation from spherical symmetry, which is based on the minimum distance between the distribution of the vector $\big(\|Y\|, Y/\|Y\|\big)^\top$ and its best approximation by a distribution of a vector $\big(\|Y_s\|, Y_s/\|Y_s\|\big)^\top$ corresponding to a random vector $Y_s$ with a spherical distribution. We develop estimators for the minimum distance with corresponding statistical guarantees (provided by asymptotic theory) and demonstrate the applicability of our approach by means of a simulation study and a real data example.
Rohan Sen
Comments 40 pages, 4 figures
We propose a kernel-based nonparametric framework for mean-variance optimization that enables inference on economically motivated shape constraints in finance, including positivity, monotonicity, and convexity. Many central hypotheses in financial econometrics are naturally expressed as shape relations on latent functions (e.g., term premia, CAPM relations, and the pricing kernel), yet enforcing such constraints during estimation can mask economically meaningful violations; our approach therefore separates learning from validation by first estimating an unconstrained solution and then testing shape properties. We establish statistical properties of the regularized sample estimator and derive rigorous guarantees, including asymptotic consistency, a functional central limit theorem, and a finite-sample deviation bound achieving the Monte Carlo rate up to a regularization term. Building on these results, we construct a joint Wald-type statistic to test shape constraints on finite grids. An efficient algorithm based on a pivoted Cholesky factorization yields scalability to large datasets. Numerical studies, including an options-based asset-pricing application, illustrate the usefulness of the proposed method for evaluating monotonicity and convexity restrictions.
Jiaqi Tong, Fan Li
In many causal inference problems, multiple action variables, such as factors, mediators, or network units, often share a common causal role yet lack a natural ordering. To avoid ambiguity, the scientific interpretation of a vector of estimands should remain invariant under relabeling, an implicit principle we refer to as permutation equivariance. Permutation equivariance can be understood as the property that permuting the variables permutes the estimands in a trackable manner, such that scientific meaning is preserved. We formally characterize this principle and study its combinatorial algebra. We present a class of weighted estimands that project unstructured potential outcome means into a vector of permutation equivariant and interpretable estimands capturing all orders of interaction. To guide practice, we discuss the implications and choices of weights and define residual-free estimands, whose inclusion-exclusion sums capture the maximal effect, which is useful in contexts such as causal mediation and network interference. We present the application of our general theory to three canonical examples and extend our results to ratio effect measures.
Georgia D Tomova, Richard J Silverwood, Peter WG Tennant, Liam Wright
Survey data are self-reported data collected directly from respondents by a questionnaire or an interview and are commonly used in epidemiology. Such data are traditionally collected via a single mode (e.g. face-to-face interview alone), but use of mixed-mode designs (e.g. offering face-to-face interview or online survey) has become more common. This introduces two key challenges. First, individuals may respond differently to the same question depending on the mode; these differences due to measurement are known as 'mode effects'. Second, different individuals may participate via different modes; these differences in sample composition between modes are known as 'mode selection'. Where recognised, mode effects are often handled by straightforward approaches such as conditioning on survey mode. However, while reducing mode effects, this and other equivalent approaches may introduce collider bias in the presence of mode selection. The existence of mode effects and the consequences of naïve conditioning may be underappreciated in epidemiology. This paper offers a simple introduction to these challenges using directed acyclic graphs by exploring a range of possible data structures. We discuss the potential implications of using conditioning- or imputation-based approaches and outline the advantages of quantitative bias analyses for dealing with mode effects.
Adel Daoud, Cindy Conlin, Connor T. Jerzak
Comments Forthcoming in World Development
Debates about whether development projects improve living conditions persist, partly because observational estimates can be biased by incomplete adjustment and because reliable outcome data are scarce at the neighborhood level. We address both issues in a continent-scale, sector-specific evaluation of Chinese and World Bank projects across 9,899 neighborhoods in 36 African countries (2002-2013), representative of ~88% of the population. First, we use a recent dataset that measures living conditions with a machine-learned wealth index derived from contemporaneous satellite imagery, yielding a consistent panel of 6.7 km square mosaics. Second, to strengthen identification, we proxy officials' map-based placement criteria using pre-treatment daytime satellite images and fuse these with tabular covariates to estimate funder- and sector-specific ATEs via inverse-probability weighting. Incorporating imagery often shrinks effects relative to tabular-only models. On average, both donors raise wealth, with larger and more consistent gains for China; sector extremes in our sample include Trade and Tourism (330) for the World Bank (+12.29 IWI points), and Emergency Response (700) for China (+15.15). Assignment-mechanism analyses also show World Bank placement is often more predictable from imagery alone (as well as from tabular covariates). This suggests that Chinese project placements are more driven by non-visible, political, or event-driven factors than World Bank placements. To probe residual concerns about selection on observables, we also estimate within-neighborhood (unit) fixed-effects models at a spatial resolution about 67 times finer than prior fixed-effects analyses, leveraging the computer-vision-imputed IWI panels; these deliver smaller but, for Chinese projects, directionally consistent effects.
Alireza Mousavi-Hosseini, Stephen Y. Zhang, Michal Klein, Marco Cuturi
Comments 38 pages, 23 figures
Flow models parameterized as time-dependent velocity fields can generate data from noise by integrating an ODE. These models are often trained using flow matching, i.e. by sampling random pairs of noise and target points $(\mathbf{x}_0,\mathbf{x}_1)$ and ensuring that the velocity field is aligned, on average, with $\mathbf{x}_1-\mathbf{x}_0$ when evaluated along a segment linking $\mathbf{x}_0$ to $\mathbf{x}_1$. While these pairs are sampled independently by default, they can also be selected more carefully by matching batches of $n$ noise to $n$ target points using an optimal transport (OT) solver. Although promising in theory, the OT flow matching (OT-FM) approach is not widely used in practice. Zhang et al. (2025) pointed out recently that OT-FM truly starts paying off when the batch size $n$ grows significantly, which only a multi-GPU implementation of the Sinkhorn algorithm can handle. Unfortunately, the costs of running Sinkhorn can quickly balloon, requiring $O(n^2/\varepsilon^2)$ operations for every $n$ pairs used to fit the velocity field, where $\varepsilon$ is a regularization parameter that should be typically small to yield better results. To fulfill the theoretical promises of OT-FM, we propose to move away from batch-OT and rely instead on a semidiscrete formulation that leverages the fact that the target dataset distribution is usually of finite size $N$. The SD-OT problem is solved by estimating a dual potential vector using SGD; using that vector, freshly sampled noise vectors at train time can then be matched with data points at the cost of a maximum inner product search (MIPS). Semidiscrete FM (SD-FM) removes the quadratic dependency on $n/\varepsilon$ that bottlenecks OT-FM. SD-FM beats both FM and OT-FM on all training metrics and inference budget constraints, across multiple datasets, on unconditional/conditional generation, or when using mean-flow models.
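The flow matching objective described above can be sketched for a single pair as follows. This is a minimal illustration with independently supplied pairs, not the semidiscrete coupling proposed in the abstract; the names `fm_training_pair` and `fm_loss`, and the use of plain Python lists for vectors, are assumptions made here.

```python
def fm_training_pair(x0, x1, t):
    """Return the point on the segment from x0 to x1 at time t, and the
    regression target x1 - x0 for the velocity field at that point."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    return x_t, target


def fm_loss(velocity_fn, x0, x1, t):
    """Squared error between a candidate velocity field and the target.

    velocity_fn(x_t, t) is a hypothetical callable returning a list of
    the same length as x_t; in practice it would be a neural network.
    """
    x_t, target = fm_training_pair(x0, x1, t)
    pred = velocity_fn(x_t, t)
    return sum((p - u) ** 2 for p, u in zip(pred, target))
```

Averaging `fm_loss` over random draws of `(x0, x1, t)` gives the training objective; OT-based variants change only how the pairs `(x0, x1)` are chosen, not the loss itself.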
Muralikrishnna G. Sethuraman, Faramarz Fekri
Understanding causal relationships between variables is fundamental across scientific disciplines. Most causal discovery algorithms rely on two key assumptions: (i) all variables are observed, and (ii) the underlying causal graph is acyclic. While these assumptions simplify theoretical analysis, they are often violated in real-world systems, such as biological networks. Existing methods that account for confounders either assume linearity or struggle with scalability. To address these limitations, we propose DCCD-CONF, a novel framework for differentiable learning of nonlinear cyclic causal graphs in the presence of unmeasured confounders using interventional data. Our approach alternates between optimizing the graph structure and estimating the confounder distribution by maximizing the log-likelihood of the data. Through experiments on synthetic data and real-world gene perturbation datasets, we show that DCCD-CONF outperforms state-of-the-art methods in both causal graph recovery and confounder identification. We also provide consistency guarantees for our framework, reinforcing its theoretical soundness.
Zijian Guo, Zhenyu Wang, Yifan Hu, Francis Bach
In multi-source learning with discrete labels, distributional heterogeneity across domains poses a central challenge to developing predictive models that transfer reliably to unseen domains. We study multi-source unsupervised domain adaptation, where labeled data are available from multiple source domains and only unlabeled data are observed from the target domain. To address potential distribution shifts, we propose a novel Conditional Group Distributionally Robust Optimization (CG-DRO) framework that learns a classifier by minimizing the worst-case cross-entropy loss over the convex combinations of the conditional outcome distributions from source domains. We develop an efficient Mirror Prox algorithm for solving the minimax problem and employ a double machine learning procedure to estimate the risk function, ensuring that errors in nuisance estimation contribute only at higher-order rates. We establish fast statistical convergence rates for the empirical CG-DRO estimator by constructing two surrogate minimax optimization problems that serve as theoretical bridges. A distinguishing challenge for CG-DRO is the emergence of nonstandard asymptotics: the empirical CG-DRO estimator may fail to converge to a standard limiting distribution due to boundary effects and system instability. To address this, we introduce a perturbation-based inference procedure that enables uniformly valid inference, including confidence interval construction and hypothesis testing.
Agnideep Aich, Ashit Baran Aich, Dipak C. Jain
We propose \textbf{Temporal Conformal Prediction (TCP)}, a distribution-free framework for constructing well-calibrated prediction intervals in nonstationary time series. TCP couples a modern quantile forecaster with a rolling split-conformal calibration layer; its \textbf{TCP-RM} variant adds an online Robbins-Monro offset to steer coverage in real time. We benchmark TCP against GARCH, Historical Simulation, Quantile Regression (QR), linear QR, and Adaptive Conformal Inference (ACI) across S\&P 500, Bitcoin, and Gold. Three results are consistent. First, QR baselines yield the sharpest intervals but are materially under-calibrated; even ACI remains below the 95\% target. Second, TCP achieves near-nominal coverage, yielding intervals slightly wider than Historical Simulation (e.g., S\&P 500: 5.21 vs.\ 5.06). Third, the RM update changes calibration only marginally at default hyperparameters. Crisis-window visualizations (March 2020) show TCP promptly expanding and contracting intervals as volatility spikes. A sensitivity study confirms robustness to hyperparameters. Overall, TCP bridges statistical inference and machine learning, providing a practical solution for calibrated risk forecasting under distribution shift.
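A rolling split-conformal calibration layer on top of a quantile forecaster can be sketched as below. This is only a hedged illustration: it uses the conformalized-quantile-regression style score `max(lo - y, y - hi)` over the most recent `window` observations, which is an assumption and not necessarily the scoring rule used in TCP. The function names are hypothetical, and the Robbins-Monro offset of TCP-RM is not included.

```python
import math


def conformal_offset(scores, alpha):
    """Split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest score.

    Returns +inf when the calibration set is too small for the level.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


def rolling_cqr_interval(lo_hist, hi_hist, y_hist, lo_new, hi_new,
                         alpha=0.05, window=250):
    """Widen (or shrink) a new quantile interval using recent scores.

    lo_hist, hi_hist: past lower/upper quantile forecasts.
    y_hist: the realized values for those forecasts.
    Only the last `window` triples are used for calibration.
    """
    recent = list(zip(lo_hist, hi_hist, y_hist))[-window:]
    # Positive score: y fell outside the forecast band by that amount.
    scores = [max(lo - y, y - hi) for lo, hi, y in recent]
    q = conformal_offset(scores, alpha)
    return lo_new - q, hi_new + q
```

Restricting calibration to a recent window is what lets the offset adapt when volatility changes, at the cost of a noisier quantile estimate for small windows.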
Lasse Leskelä, Maximilien Dreveton
Comments 20 pages
Journal ref Statistica Neerlandica 80(1):e70023, 2026
Markov chains are fundamental models for stochastic dynamics, with applications in a wide range of areas such as population dynamics, queueing systems, reinforcement learning, and Monte Carlo methods. Estimating the transition matrix and stationary distribution from observed sample paths is a core statistical challenge, particularly when multiple independent trajectories are available. While classical theory typically assumes identical chains with known stationary distributions, real-world data often arise from heterogeneous chains whose transition kernels and stationary measures might differ from a common target. We analyse empirical estimators for such parallel Markov processes and establish sharp concentration inequalities that generalise Bernstein-type bounds from standard time averages to ensemble-time averages. Our results provide nonasymptotic error bounds and consistency guarantees in high-dimensional regimes, accommodating sparse or weakly mixing chains, model mismatch, nonstationary initialisations, and partially corrupted data. These findings offer rigorous foundations for statistical inference in heterogeneous Markov chain settings common in modern computational applications.
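For concreteness, one simple empirical estimator for parallel trajectories pools transition counts across all sample paths and normalizes each row. The sketch below assumes integer-coded states `0, ..., n_states - 1` and a uniform fallback for states that are never left; both choices, and the name `pooled_transition_matrix`, are assumptions for illustration rather than the estimator analysed in the paper.

```python
def pooled_transition_matrix(trajectories, n_states):
    """Estimate a transition matrix by pooling counts over trajectories.

    trajectories: iterable of state sequences (lists of ints).
    Rows with no observed outgoing transition fall back to uniform
    (an arbitrary choice made for this sketch).
    """
    counts = [[0] * n_states for _ in range(n_states)]
    for path in trajectories:
        # Count every consecutive pair (s, s_next) along the path.
        for s, s_next in zip(path, path[1:]):
            counts[s][s_next] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        if total:
            matrix.append([c / total for c in row])
        else:
            matrix.append([1.0 / n_states] * n_states)
    return matrix
```

When the chains are heterogeneous, this pooled estimate targets an average of the individual kernels, which is the kind of ensemble-time average the abstract refers to.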
Cecilia Balocchi, Sara Wade
The Bayesian approach to clustering is often appreciated for its ability to provide uncertainty in the partition structure. However, summarizing the posterior distribution over the clustering structure can be challenging, due to the discrete, unordered nature and massive dimension of the space. While recent advancements provide a single clustering estimate to represent the posterior, this ignores uncertainty and may even be unrepresentative in instances where the posterior is multimodal. To enhance our understanding of uncertainty, we propose a WASserstein Approximation for Bayesian clusterIng (WASABI), which summarizes the posterior samples with not one, but multiple clustering estimates, each corresponding to a different part of the partition space that receives substantial posterior mass. Specifically, we find such clustering estimates by approximating the posterior distribution in a Wasserstein distance sense, equipped with a suitable metric on the partition space. An interesting byproduct is that a locally optimal solution can be found using a k-medoids-like algorithm on the partition space to divide the posterior samples into groups, each represented by one of the clustering estimates. Using synthetic and real datasets, we show that WASABI helps to improve the understanding of uncertainty, particularly when clusters are not well separated or when the employed model is misspecified.
Daniel Waxman, Fernando Llorente, Petar M. Djurić
Comments 28 pages, 10 figures; Accepted to Transactions on Machine Learning Research (TMLR)
Journal ref Transactions on Machine Learning Research (TMLR), 2026
We revisit the classical problem of Bayesian ensembles and address the challenge of learning optimal combinations of Bayesian models in an online, continual learning setting. To this end, we reinterpret existing approaches such as Bayesian model averaging (BMA) and Bayesian stacking through a novel empirical Bayes lens, shedding new light on the limitations and pathologies of BMA. Further motivated by insights from online optimization, we propose Online Bayesian Stacking (OBS), a method that optimizes the log-score over predictive distributions to adaptively combine Bayesian models. A key contribution of our work is establishing a novel connection between OBS and portfolio selection, bridging Bayesian ensemble learning with a rich, well-studied theoretical framework that offers efficient algorithms and extensive regret analysis. We further clarify the relationship between OBS and online BMA, showing that they optimize related but distinct cost functions. Through theoretical analysis and empirical evaluation, we identify scenarios where OBS outperforms online BMA and provide principled methods and guidance on when practitioners should prefer one approach over the other.
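The connection to portfolio selection can be illustrated with the exponentiated-gradient update, a standard online portfolio algorithm, applied to the log-score of a mixture of predictive densities. This is a sketch under assumptions: the abstract does not state which portfolio algorithm OBS uses, and the function name `eg_stacking_update` and step size `eta` are hypothetical.

```python
import math


def eg_stacking_update(weights, densities, eta=0.1):
    """One exponentiated-gradient step on the log-score of a mixture.

    weights: current simplex weights over K models.
    densities: each model's predictive density at the new observation.
    The gradient of log(sum_k w_k p_k) with respect to w_k is p_k / mix.
    """
    mix = sum(w * p for w, p in zip(weights, densities))
    if mix <= 0.0:
        raise ValueError("mixture density must be positive")
    unnorm = [w * math.exp(eta * p / mix) for w, p in zip(weights, densities)]
    z = sum(unnorm)
    # Renormalize so the weights stay on the probability simplex.
    return [u / z for u in unnorm]
```

Here each model's predictive density plays the role of an asset's price relative, so the cumulative log-score of the weighted ensemble corresponds to the log-wealth of a portfolio.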
Elliot H. Young, Peter Bühlmann
We develop Clustered Random Forests, a random forests algorithm for clustered data, arising from independent groups that exhibit within-cluster dependence. The leaf-wise predictions for each decision tree making up clustered random forests take the form of a weighted least squares estimator, which leverages correlations between observations for improved prediction accuracy and tighter confidence intervals when performing inference. We show that approximately linear time algorithms exist for fitting classes of clustered random forests, matching the computational complexity of standard random forests. Further, we observe that the optimality of a clustered random forest, with regard to how optimal weights are chosen within this framework (i.e. those that minimise mean squared prediction error), varies under covariate distribution shift. In light of this, we advocate weight estimation to be determined by a user-chosen covariate distribution, or test dataset of covariates, with respect to which optimal prediction or inference is desired. This highlights a key distinction between correlated and independent data with regard to optimality of nonparametric conditional mean estimation under covariate shift. We demonstrate our theoretical findings numerically in a number of simulated and real-world settings.
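As a rough illustration of a weighted least squares leaf prediction, the sketch below computes an intercept-only estimate for the observations in a single leaf. It assumes an exchangeable working correlation `rho` within each cluster; that choice, and the function name `leaf_prediction`, are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def leaf_prediction(y, cluster_ids, rho):
    """Intercept-only weighted least squares estimate for one leaf.

    Assumes (hypothetically) an exchangeable working correlation rho within
    each cluster and independence across clusters. rho must keep each
    working covariance positive definite.
    """
    y = np.asarray(y, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    num, den = 0.0, 0.0
    for c in np.unique(cluster_ids):
        y_c = y[cluster_ids == c]
        m = len(y_c)
        # Working covariance for this cluster: (1 - rho) * I + rho * 11^T
        cov = (1.0 - rho) * np.eye(m) + rho * np.ones((m, m))
        w_ones = np.linalg.solve(cov, np.ones(m))  # W_c 1, with W_c = cov^{-1}
        num += w_ones @ y_c                        # 1^T W_c y_c
        den += w_ones.sum()                        # 1^T W_c 1
    return num / den
```

With `rho = 0` this reduces to the ordinary leaf mean; with positive `rho`, observations from large clusters are down-weighted relative to a plain average.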
Lorenzo Mauri, Niccolò Anceschi, David B. Dunson
This article focuses on covariance estimation for multi-study data. Popular approaches employ factor-analytic terms with shared and study-specific loadings that decompose the variance into (i) a shared low-rank component, (ii) study-specific low-rank components, and (iii) a diagonal term capturing idiosyncratic variability. Our proposed methodology estimates the latent factors via spectral decompositions, with a novel approach for separating shared and specific factors, and infers the factor loadings and residual variances via surrogate Bayesian regressions. The resulting posterior has a simple product form across outcomes, bypassing the need for Markov chain Monte Carlo sampling and facilitating parallelization. The proposed methodology has major advantages over current Bayesian competitors in terms of computational speed, scalability and stability while also having strong frequentist guarantees. The theory and methods also add to the rich literature on frequentist methods for factor models with shared and group-specific components of variation. The approximation error decreases as the sample size and the data dimension diverge, formalizing a blessing of dimensionality. We show favorable asymptotic properties, including central limit theorems for point estimators and posterior contraction, and excellent empirical performance in simulations. The methods are applied to integrate three studies on gene associations among immune cells.
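As a reading aid, the three-part variance decomposition described above can be written as follows. The symbols are illustrative assumptions rather than notation taken from the paper: $\Sigma_s$ is the covariance in study $s$, $\Lambda$ the shared loadings, $\Gamma_s$ the study-specific loadings, and $\Psi_s$ a diagonal matrix of residual variances (which may or may not vary by study).

$$\Sigma_s \;=\; \underbrace{\Lambda\Lambda^\top}_{\text{(i) shared low-rank}} \;+\; \underbrace{\Gamma_s\Gamma_s^\top}_{\text{(ii) study-specific low-rank}} \;+\; \underbrace{\Psi_s}_{\text{(iii) diagonal}}, \qquad s = 1,\dots,S.$$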
Jianyu Xu, Yining Wang, Xi Chen, Yu-Xiang Wang
Comments 28 pages, 1 figure
We study an online dynamic pricing problem where the potential demand at each time period $t=1,2,\ldots, T$ is stochastic and dependent on the price. However, a perishable inventory is imposed at the beginning of each time $t$, censoring the potential demand if it exceeds the inventory level. To address this problem, we introduce a pricing algorithm based on the optimistic estimates of derivatives. We show that our algorithm achieves $\tilde{O}(\sqrt{T})$ optimal regret even with adversarial inventory series. Our findings advance the state-of-the-art in online decision-making problems with censored feedback, offering a theoretically optimal solution against adversarial observations.
Kiana Asgari, Andrea Montanari, Basil Saeed
We consider a general model for high-dimensional empirical risk minimization whereby the data $\mathbf{x}_i$ are $d$-dimensional Gaussian vectors, the model is parametrized by $\boldsymbol{\Theta}\in\mathbb{R}^{d\times k}$, and the loss depends on the data via the projection $\boldsymbol{\Theta}^\mathsf{T}\mathbf{x}_i$. This setting covers as special cases classical statistics methods (e.g. multinomial regression and other generalized linear models), but also two-layer fully connected neural networks with $k$ hidden neurons. We use the Kac-Rice formula from Gaussian process theory to derive a bound on the expected number of local minima of this empirical risk, under the proportional asymptotics in which $n,d\to\infty$, with $n\asymp d$. Via Markov's inequality, this bound allows us to determine the positions of these minimizers (with exponential deviation bounds) and hence derive sharp asymptotics on the estimation and prediction error. As a special case, we apply our characterization to convex losses. We show that our approach is tight and allows us to prove previously conjectured results. In addition, we characterize the spectrum of the Hessian at the minimizer. A companion paper applies our general result to non-convex examples.
Paul-Louis Delacour, Sander Wahls, Jeffrey M. Spraggins, Lukasz Migas, Raf Van de Plas
Journal ref IEEE Transactions on Signal Processing, vol. 73, pp. 3748-3761, 2025
We introduce the spiked mixture model (SMM) to address the problem of estimating a set of signals from many randomly scaled and noisy observations. Subsequently, we design a novel expectation-maximization (EM) algorithm to recover all parameters of the SMM. Numerical experiments show that in low signal-to-noise ratio regimes, and for data types where the SMM is relevant, SMM surpasses the more traditional Gaussian mixture model (GMM) in terms of signal recovery performance. The broad relevance of the SMM and its corresponding EM recovery algorithm is demonstrated by applying the technique to different data types. The first case study is a biomedical research application, utilizing an imaging mass spectrometry dataset to explore the molecular content of a rat brain tissue section at micrometer scale. The second case study demonstrates SMM performance in a computer vision application, segmenting a hyperspectral imaging dataset into underlying patterns. While the measurement modalities differ substantially, in both case studies SMM is shown to recover signals that were missed by traditional methods such as k-means clustering and GMM.
Yiyun He, Ke Wang, Yizhe Zhu
Comments 53 pages, 1 figure
We derive new Hanson-Wright-type inequalities tailored to the quadratic forms of random vectors with sparse independent components. Specifically, we consider cases where the components of the random vector are sparse $\alpha$-subexponential random variables with $\alpha>0$. When $\alpha=\infty$, these inequalities can be seen as quadratic generalizations of the classical Bernstein and Bennett inequalities for sparse bounded random vectors. To establish this quadratic generalization, we also develop new Bernstein-type and Bennett-type inequalities for linear forms of sparse $\alpha$-subexponential random variables that go beyond the bounded case $(\alpha=\infty)$. Our proof relies on a novel combinatorial method for estimating the moments of both random linear forms and quadratic forms. We present two key applications of these new sparse Hanson-Wright inequalities: (1) A local law and complete eigenvector delocalization for sparse $\alpha$-subexponential Hermitian random matrices, generalizing the result of He et al. (2019) beyond sparse Bernoulli random matrices. To the best of our knowledge, this is the first local law and complete delocalization result for sparse $\alpha$-subexponential random matrices down to the near-optimal sparsity $p\geq \frac{\mathrm{polylog}(n)}{n}$ when $\alpha\in (0,2)$ as well as for unbounded sparse sub-gaussian random matrices down to the optimal sparsity $p\gtrsim \frac{\log n}{n}$. (2) Concentration of the Euclidean norm for the linear transformation of a sparse $\alpha$-subexponential random vector, improving on the results of Götze et al. (2021) for sparse sub-exponential random vectors.
Claudio Durastanti
Comments 34 pages, 3 figures, 2 tables
This paper investigates aliasing effects emerging from the reconstruction from discrete samples of spin spherical random fields defined on the two-dimensional sphere. We determine the location in the frequency domain and the intensity of the aliases of the harmonic coefficients in the Fourier decomposition of the spin random field and evaluate the consequences of aliasing errors in the angular power spectrum when the samples of the random field are obtained by using some very popular sampling procedures on the sphere, the equiangular and the Gauss-Jacobi sampling schemes. Finally, we demonstrate that band-limited spin random fields are free from aliases, provided that a sufficiently large number of nodes is used in the selected quadrature rule.
Erik Wallin, Lennart Svensson, Fredrik Kahl, Lars Hammarstrand
Comments ECCV2024
In open-set semi-supervised learning (OSSL), we consider unlabeled datasets that may contain unknown classes. Existing OSSL methods often use the softmax confidence for classifying data as in-distribution (ID) or out-of-distribution (OOD). Additionally, many works for OSSL rely on ad-hoc thresholds for ID/OOD classification, without considering the statistics of the problem. We propose a new score for ID/OOD classification based on angles in feature space between data and an ID subspace. Moreover, we propose an approach to estimate the conditional distributions of scores given ID or OOD data, enabling probabilistic predictions of data being ID or OOD. These components are put together in a framework for OSSL, termed ProSub, that is experimentally shown to reach SOTA performance on several benchmark problems. Our code is available at https://github.com/walline/prosub.
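The idea of scoring a sample by its angle to an ID subspace can be sketched as below. This is a minimal illustration under stated assumptions: the subspace is taken to be the span of the top right singular vectors of a matrix of ID features, and the helper names `subspace_basis` and `angle_to_subspace` are hypothetical. The construction in ProSub itself may differ; the linked repository holds the actual implementation.

```python
import numpy as np

def subspace_basis(id_features, dim):
    """Orthonormal basis (d, dim) for an assumed ID subspace.

    id_features: (n, d) array of feature vectors from in-distribution data.
    The basis is taken from the top `dim` right singular vectors.
    """
    _, _, vt = np.linalg.svd(np.asarray(id_features, dtype=float),
                             full_matrices=False)
    return vt[:dim].T

def angle_to_subspace(x, basis):
    """Angle in radians between a nonzero vector x and the span of `basis`."""
    x = np.asarray(x, dtype=float)
    proj = basis @ (basis.T @ x)                    # orthogonal projection
    cos = np.linalg.norm(proj) / np.linalg.norm(x)
    return np.arccos(np.clip(cos, 0.0, 1.0))        # clip guards rounding
```

A small angle suggests the feature lies close to the ID subspace, while an angle near $\pi/2$ suggests it is nearly orthogonal to it.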
Bo Lin, Pierpaolo Belardinelli
The quasipotential function allows for comprehension and prediction of the escape mechanisms from metastable states in nonlinear dynamical systems. This function acts as a natural extension of the potential function for non-gradient systems and it unveils important properties such as the maximum likelihood transition paths, transition rates and expected exit times of the system. Here, we demonstrate how to discover parsimonious equations for the quasipotential directly from data. Leveraging machine learning, we combine two existing data-driven techniques, namely a neural network and a sparse regression algorithm, specifically designed to symbolically describe multistable energy landscapes. First, we employ a vanilla neural network enhanced with a renormalization and rescaling procedure to achieve an orthogonal decomposition of the vector field. Next, we apply symbolic regression to extract the downhill and circulatory components of the decomposition, ensuring consistency with the underlying dynamics. This symbolic reconstruction involves a simultaneous regression that imposes constraints on both the orthogonality condition and the vector field. We implement and benchmark our approach using an archetypal model with a known exact quasipotential, as well as a nanomechanical resonator system. We further demonstrate its applicability to noisy data and to a four-dimensional system. Our model-unbiased analytical forms of the quasipotential are of interest to a wide range of applications aimed at assessing metastability and energy landscapes, serving to parametrically capture the distinctive fingerprint of the fluctuating dynamics.
Niccolo Anceschi, Federico Ferrari, David B. Dunson, Himel Mallick
Comments To be published in Biometrics
It is increasingly common to collect data of multiple different types on the same set of samples. Our focus is on studying relationships between such multiview features and responses. A motivating application arises in the context of precision medicine where multi-omics data are collected to correlate with clinical outcomes. It is of interest to infer dependence within and across views while combining multimodal information to improve the prediction of outcomes. The signal-to-noise ratio can vary substantially across views, motivating more nuanced statistical tools beyond standard late and early fusion. This challenge comes with the need to preserve interpretability, select features, and obtain accurate uncertainty quantification. To address these challenges, we introduce two complementary factor regression models. A baseline Joint Factor Regression (JFR) captures combined variation across views via a single factor set, and a more nuanced Joint Additive FActor Regression (JAFAR) decomposes variation into shared and view-specific components. For JFR, we use independent cumulative shrinkage process (I-CUSP) priors, while for JAFAR we develop a dependent version (D-CUSP) designed to ensure identifiability of the components. We develop Gibbs samplers that exploit the model structure and accommodate flexible feature and outcome distributions. Prediction of time-to-labor onset from immunome, metabolome, and proteome data illustrates performance gains against state-of-the-art competitors. Our open-source software (R package) is available at https://github.com/niccoloanceschi/jafar.
Masayuki Sawada, Takuya Ishihara, Daisuke Kurisu, Yasumasa Matsuda
We study a multivariate regression discontinuity design in which treatment is assigned by crossing a boundary in the space of multiple running variables. We document that the existing bandwidth selector is suboptimal for a multivariate regression discontinuity design when the distance to a boundary point is used as its running variable, and introduce a multivariate local-linear estimator for multivariate regression discontinuity designs. Our estimator is asymptotically valid and can capture heterogeneous treatment effects over the boundary. We demonstrate that our estimator exhibits smaller root mean squared errors and often shorter confidence intervals in numerical simulations. We illustrate our estimator in empirical applications to multivariate designs, a Colombian scholarship study and a U.S. House of Representatives voting study, and demonstrate that our estimator reveals richer heterogeneous treatment effects with often shorter confidence intervals than the existing estimator.
Nathan Doumèche, Gérard Biau, Claire Boyer
Journal ref Bernoulli, 2025, Vol. 31, pp. 2127-2151
Physics-informed neural networks (PINNs) are a promising approach that combines the power of neural networks with the interpretability of physical modeling. PINNs have shown good practical performance in solving partial differential equations (PDEs) and in hybrid modeling scenarios, where physical models enhance data-driven approaches. However, it is essential to establish their theoretical properties in order to fully understand their capabilities and limitations. In this study, we highlight that classical training of PINNs can suffer from systematic overfitting. This problem can be addressed by adding a ridge regularization to the empirical risk, which ensures that the resulting estimator is risk-consistent for both linear and nonlinear PDE systems. However, the strong convergence of PINNs to a solution satisfying the physical constraints requires a more involved analysis using tools from functional analysis and calculus of variations. In particular, for linear PDE systems, an implementable Sobolev-type regularization allows one to reconstruct a solution that not only achieves statistical accuracy but also maintains consistency with the underlying physics.