arXivDaily arXiv每日学术速递 周一至周五更新
2601.16959 2026-01-26 astro-ph.IM stat.AP

Probabilistic Graphical Models in Astronomy

Abigail Sheerin, Giuseppe Vinci

Comments 11 pages, 3 figures

详情
英文摘要

The field of astronomy is experiencing a data explosion driven by significant advances in observational instrumentation, and classical methods often fall short of addressing the complexity of modern astronomical datasets. Probabilistic graphical models offer powerful tools for uncovering the dependence structures and data-generating processes underlying a wide array of cosmic variables. By representing variables as nodes in a network, these models allow for the visualization and analysis of the intricate relationships that underpin theories of hierarchical structure formation within the universe. We highlight the value that graphical models bring to astronomical research by demonstrating their practical application to the study of exoplanets and host stars.

2601.16945 2026-01-26 math.ST math.PR stat.TH

A new class of colored Gaussian graphical models with explicit normalizing constants

Adam Chojecki, Piotr Graczyk, Hideyuki Ishi, Bartosz Kołodziejek

详情
英文摘要

We study Bayesian model selection in colored Gaussian graphical models (CGGMs), which combine sparsity of conditional independencies with symmetry constraints encoded by vertex- and edge-colored graphs. A computational bottleneck in Bayesian inference for CGGMs is the evaluation of Diaconis-Ylvisaker normalizing constants, given by gamma-type integrals over cones of precision matrices with prescribed zeros and equality constraints. While explicit formulas are known for standard Gaussian graphical models only in special cases (e.g. decomposable graphs) and for a limited class of RCOP models, no general tractable framework has been available for broader families of CGGMs. We introduce a new subclass of RCON models for which these normalizing constants admit closed-form expressions. On the algebraic side, we identify conditions on spaces of colored precision matrices that guarantee tractability of the associated integrals, leading to Block-Cholesky spaces (BC-spaces) and Diagonally Commutative Block-Cholesky spaces (DCBC-spaces). On the combinatorial side, we characterize the colored graphs inducing such spaces via a color perfect elimination ordering and a 2-path regularity condition, and define the resulting Color Elimination-Regular (CER) graphs and their symmetric variants. This class strictly extends decomposable graphs in the uncolored setting and contains all RCOP models associated with decomposable graphs. In the one-color case, our framework reveals a close connection between DCBC-spaces and Bose-Mesner algebras. For models defined on BC- and DCBC-spaces, we derive explicit closed-form formulas for the normalizing constants in terms of a finite collection of structure constants and propose an efficient method for computing them in the commutative case. Our results broaden the range of CGGMs amenable to principled Bayesian structure learning in high-dimensional applications.

2601.16932 2026-01-26 stat.AP

Identifying heat-related diagnoses in emergency department visits among adults in Chicago: a heat-wide association study

Hyojung Jang, Peter M. Graffy, Benjamin W. Barrett, Daniel E. Horton, Jennifer L. Chan, Abel N. Kho

详情
英文摘要

Extreme heat is an escalating public health concern. Although prior studies have examined heat-health associations, their reliance on restricted diagnoses and diagnostic categories misses or misclassifies heat-related illness. We conducted a heat-wide association study to identify acute-care diagnoses associated with extreme heat in Chicago, Illinois. Using 916,904 acute-care visits -- including emergency department and urgent care encounters -- among 372,140 adults across five healthcare systems from 2011-2023, we applied a two-stage analytic approach: quasi-Poisson regression to screen 1,803 diagnosis codes for heat-related risks, followed by distributed lag non-linear models in a time-stratified case-crossover design to refine the list of heat-related diagnoses and estimate same-day and short-term cumulative odds ratios of acute-care visits during extreme heat versus reference temperature. We observed same-day increases in visits for heat illness, volume depletion, hypotension, edema, acute kidney failure, and multiple injuries. By analyzing the full diagnostic spectrum of acute-care services, this study comprehensively characterizes heat-associated morbidity, reinforcing and advancing existing literature.

2601.16922 2026-01-26 cs.LG stat.ML

Group-realizable multi-group learning by minimizing empirical risk

Navid Ardeshir, Samuel Deng, Daniel Hsu, Jingwen Liu

详情
英文摘要

The sample complexity of multi-group learning is shown to improve in the group-realizable setting over the agnostic setting, even when the family of groups is infinite so long as it has finite VC dimension. The improved sample complexity is obtained by empirical risk minimization over the class of group-realizable concepts, which itself could have infinite VC dimension. Implementing this approach is also shown to be computationally intractable, and an alternative approach is suggested based on improper learning.

2601.16897 2026-01-26 cs.LG math.OC stat.ML

FedSGM: A Unified Framework for Constraint Aware, Bidirectionally Compressed, Multi-Step Federated Optimization

Antesh Upadhyay, Sang Bin Moon, Abolfazl Hashemi

详情
英文摘要

We introduce FedSGM, a unified framework for federated constrained optimization that addresses four major challenges in federated learning (FL): functional constraints, communication bottlenecks, local updates, and partial client participation. Building on the switching gradient method, FedSGM provides projection-free, primal-only updates, avoiding expensive dual-variable tuning or inner solvers. To handle communication limits, FedSGM incorporates bi-directional error feedback, correcting the bias introduced by compression while explicitly understanding the interaction between compression noise and multi-step local updates. We derive convergence guarantees showing that the averaged iterate achieves the canonical $\boldsymbol{\mathcal{O}}(1/\sqrt{T})$ rate, with additional high-probability bounds that decouple optimization progress from sampling noise due to partial participation. Additionally, we introduce a soft switching version of FedSGM to stabilize updates near the feasibility boundary. To our knowledge, FedSGM is the first framework to unify functional constraints, compression, multiple local updates, and partial client participation, establishing a theoretically grounded foundation for constrained federated learning. Finally, we validate the theoretical guarantees of FedSGM via experimentation on Neyman-Pearson classification and constrained Markov decision process (CMDP) tasks.

2601.16848 2026-01-26 cs.NI cs.NA math.NA stat.AP

Stochastic Modeling and Resource Dimensioning of Multi-Cellular Edge Intelligent Systems

Jaume Anguera Peris, Joakim Jaldén

详情
英文摘要

Edge intelligence enables AI inference at the network edge, co-located with or near the radio access network, rather than in centralized clouds or on mobile devices. It targets low-latency, resource-constrained applications with large data volumes, requiring tight integration of wireless access and on-site computing. Yet system performance and cost-efficiency hinge on joint pre-deployment dimensioning of radio and computational resources, especially under spatial and temporal uncertainty. Prior work largely emphasizes run-time allocation or relies on simplified models that decouple radio and computing, missing end-to-end correlations in large-scale deployments. This paper introduces a unified stochastic framework to dimension multi-cell edge-intelligent systems. We model network topology with Poisson point processes, capturing random user and base-station locations, inter-cell interference, distance-based fractional power control, and peak-power constraints. By combining this with queueing theory and empirical AI inference workload profiling, we derive tractable expressions for end-to-end offloading delay. These enable a non-convex joint optimization that minimizes deployment cost under statistical QoS guarantees, expressed through strict tail-latency and inference-accuracy constraints. We prove the problem decomposes into convex subproblems, yielding global optimality. Numerical results in noise- and interference-limited regimes identify cost-efficient design regions and configurations that cause under-utilization or user unfairness. Smaller cells reduce transmission delay but raise per-request computing cost due to weaker server multiplexing, whereas larger cells show the opposite trend. Densification reduces computational costs only when frequency reuse scales with base-station density; otherwise, sparser deployments improve fairness and efficiency in interference-limited settings.

2601.16837 2026-01-26 stat.AP

Spillovers and Co-movements in Multivariate Volatility: A Vector Multiplicative Error Model

Edoardo Otranto, Luca Scaffidi Domianello

Comments 24 pages, 4 figures

详情
英文摘要

Recent developments in financial time series focus on modeling volatility across multiple assets or indices in a multivariate framework, accounting for potential interactions such as spillover effects. Furthermore, the increasing integration of global financial markets provides a similar dynamics (referred to as comovement). In this context, we introduce a novel model for volatility vectors within the Multiplicative Error Model (MEM) class. This framework accommodates both spillover and co-movement effects through a distinct latent component. By adopting a specific parameterization, the model remains computationally feasible even for high-dimensional volatility vectors. To reduce the number of unknown coefficients, we propose a simple model-based clustering procedure. We illustrate the effectiveness of the proposed approach through an empirical application to 29 assets of the Dow Jones Industrial Average index, providing insight into volatility spillovers and shared market dynamics. Comparative analysis against alternative vector MEMs, including a fully parameterized version of the proposed model, demonstrates its superior or at least comparable performance across multiple evaluation criteria.

2601.16830 2026-01-26 cs.LG cs.AI cs.NE math.ST stat.TH

Uncertainty propagation through trained multi-layer perceptrons: Exact analytical results

Andrew Thompson, Miles McCrory

详情
英文摘要

We give analytical results for propagation of uncertainty through trained multi-layer perceptrons (MLPs) with a single hidden layer and ReLU activation functions. More precisely, we give expressions for the mean and variance of the output when the input is multivariate Gaussian. In contrast to previous results, we obtain exact expressions without resort to a series expansion.

2601.16829 2026-01-26 stat.ME stat.AP

Directional Asymmetry in Edge BasedSpatial Models via a Skew Normal Prior

Danna L. Cruz-Reyes, Renato M. Assunção, Reinaldo B. Arellano-Valle, Rosangela H. Loschi

Comments The manuscript is clearly written and well structured. It spans approximately 10 pages and includes several figures that effectively illustrate the proposed methodology and results. Overall, the paper makes a solid contribution to the literature on spatial hierarchical modeling and is suitable for publication after minor revisions

详情
英文摘要

We introduce a skewed edge based spatial prior, named RENeGe sk that extends the Gaussian RENeGe framework by incorporating directional asymmetry through a skew normal distribution. Skewness is defined on the edge graph and propagated to the node space, aligning asymmetric behavior with transitions across neighboring regions rather than with marginal node effects. The model is formulated within the skew normal framework and employs identifiable hierarchical priors together with low rank parameterizations to ensure scalability. The skew normal's stochastic representation is considered to facilitate the computational implementation. Simulation studies show that RENeGe sk recovers compact, edge-aligned directional structure more accurately than symmetric Gaussian priors, while remaining competitive under irregular spatial patterns. An application to cancer incidence data in Southern Brazil illustrates how the proposed approach yields stable area-level estimates while preserving localized, directionally driven spatial variation.

2601.16813 2026-01-26 stat.CO

A Fully Automated DM-BIM-BEM Pipeline Enabling Graph-Based Intelligence, Interoperability, and Performance-Driven Early Design

Jun Xiao, Qiong Wang, Yihui Li, Zhexuan Yu, Hao Zhou, Borong Lin

Comments Submitted to Advanced Engineering Informatics, currently Under Review

详情
英文摘要

Artificial intelligence in construction increasingly depends on structured representations such as Building Information Models and knowledge graphs, yet early-stage building designs are predominantly created as flexible boundary-representation (B-rep) models that lack explicit spatial, semantic, and performance structure. This paper presents a robust, fully automated framework that transforms unstructured B-rep geometry into knowledge-graph-based Building Information Models and further into executable Building Energy Models. The framework enables artificial intelligence to explicitly interpret building elements, spatial topology, and their associated thermal and performance attributes. It integrates automated geometry cleansing, multiple auto space-generation strategies, graph-based extraction of space and element topology, ontology-aligned knowledge modeling, and reversible transformation between ontology-based BIM and EnergyPlus energy models. Validation on parametric, sketch-based, and real-world building datasets demonstrates high robustness, consistent topological reconstruction, and reliable performance-model generation. By bridging design models, BIM, and BEM, the framework provides an AI-oriented infrastructure that extends BIM- and graph-based intelligence pipelines to flexible early-stage design geometry, enabling performance-driven design exploration and optimization by learning-based methods.

2601.16784 2026-01-26 stat.ME

Spectral embedding of inhomogeneous Poisson processes on multiplex networks

Joshua Corneck, Edward A. K. Cohen, Francesco Sanna Passino

详情
英文摘要

In many real-world networks, data on the edges evolve in continuous time, naturally motivating representations based on point processes. Heterogeneity in edge types further gives rise to multiplex network point processes. In this work, we propose a model for multiplex network data observed in continuous-time. We establish two-to-infinity norm consistency and asymptotic normality for spectral-embedding-based estimation of the model parameters as both network size and time resolution increase. Drawing inspiration from random dot product graph models, each edge intensity is expressed as the inner product of two low-dimensional latent positions: one dynamic and layer-agnostic, the other static and layer-dependent. These latent positions constitute the primary objects of inference, which is conducted via spectral embedding methods. Our theoretical results are established under a histogram estimator of the network intensities and provide justification for applying a doubly unfolded adjacency spectral embedding method for estimation. Simulations and real-data analyses demonstrate the effectiveness of the proposed model and inference procedure.

2601.16777 2026-01-26 math.ST cs.LG math.DG stat.ML stat.TH

Kernel smoothing on manifolds

Eunseong Bae, Wolfgang Polonik

详情
英文摘要

Under the assumption that data lie on a compact (unknown) manifold without boundary, we derive finite sample bounds for kernel smoothing and its (first and second) derivatives, and we establish asymptotic normality through Berry-Esseen type bounds. Special cases include kernel density estimation, kernel regression and the heat kernel signature. Connections to the graph Laplacian are also discussed.

2601.16775 2026-01-26 cs.IR stat.AP

LLM-powered Real-time Patent Citation Recommendation for Financial Technologies

Tianang Deng, Yu Deng, Tianchen Gao, Yonghong Hu, Rui Pan

详情
英文摘要

Rapid financial innovation has been accompanied by a sharp increase in patenting activity, making timely and comprehensive prior-art discovery more difficult. This problem is especially evident in financial technologies, where innovations develop quickly, patent collections grow continuously, and citation recommendation systems must be updated as new applications arrive. Existing patent retrieval and citation recommendation methods typically rely on static indexes or periodic retraining, which limits their ability to operate effectively in such dynamic settings. In this study, we propose a real-time patent citation recommendation framework designed for large and fast-changing financial patent corpora. Using a dataset of 428,843 financial patents granted by the China National Intellectual Property Administration (CNIPA) between 2000 and 2024, we build a three-stage recommendation pipeline. The pipeline uses large language model (LLM) embeddings to represent the semantic content of patent abstracts, applies efficient approximate nearest-neighbor search to construct a manageable candidate set, and ranks candidates by semantic similarity to produce top-k citation recommendations. In addition to improving recommendation accuracy, the proposed framework directly addresses the dynamic nature of patent systems. By using an incremental indexing strategy based on hierarchical navigable small-world (HNSW) graphs, newly issued patents can be added without rebuilding the entire index. A rolling day-by-day update experiment shows that incremental updating improves recall while substantially reducing computational cost compared with rebuild-based indexing. The proposed method also consistently outperforms traditional text-based baselines and alternative nearest-neighbor retrieval approaches.

2601.16769 2026-01-26 stat.ME stat.AP

From Noisy News Sentiment Scores to Interpretable Temporal Dynamics: A Bayesian State-Space Model

Ian Carbó Casals

Comments 12 pages, 5 figures, 2 tables. Code available at: https://github.com/Bailduke/bayesian-sentiment-state-space-model

详情
英文摘要

Text-based sentiment indicators are widely used to monitor public and market mood, but weekly sentiment series are noisy by construction. A main reason is that the amount of relevant news changes over time and across categories. As a result, some weekly averages are based on many articles, while others rely on only a few. Existing approaches do not explicitly account for changes in data availability when measuring uncertainty. We present a Bayesian state-space framework that turns aggregated news sentiment into a smoothed time series with uncertainty. The model treats each weekly sentiment value as a noisy measurement of an underlying sentiment process, with observation uncertainty scaled by the effective information weight $n_{tj}$: when coverage is high, latent sentiment is anchored more strongly to the observed aggregate; when coverage is low, inference relies more on the latent dynamics and uncertainty increases. Using news data grouped into multiple categories, we find broadly similar latent dynamics across categories, while larger differences appear in observation noise. The framework is designed for descriptive monitoring and can be extended to other text sources where information availability varies over time.

2601.16749 2026-01-26 stat.ME econ.EM

Finite Population Inference for Factorial Designs and Panel Experiments with Imperfect Compliance

Pedro Picchetti

详情
英文摘要

This paper develops a finite population framework for analyzing causal effects in settings with imperfect compliance where multiple treatments affect the outcome of interest. Two prominent examples are factorial designs and panel experiments with imperfect compliance. I define finite population causal effects that capture the relative effectiveness of alternative treatment sequences. I provide nonparametric estimators for a rich class of factorial and dynamic causal effects and derive their finite population distributions as the sample size increases. Monte Carlo simulations illustrate the desirable properties of the estimators. Finally, I use the estimator for causal effects in factorial designs to revisit a famous voter mobilization experiment that analyzes the effects of voting encouragement through phone calls on turnout.

2601.16739 2026-01-26 stat.ME

Comments on "Challenges of cellwise outliers" by Jakob Raymaekers and Peter J. Rousseeuw

Claudio Agostinelli

Comments Comment on arXiv:2302.02156

详情
英文摘要

The main aim of robust statistics is the development of methods able to cope with the presence of outliers. A new type of outliers, namely "cellwise", has garnered considerable attention. The state of the art for dealing with cellwise contamination in different models is presented in Raymaekers and Rousseeuw (2024). Outliers in time series can be treated as cellwise outliers, a further discussion on this subject is presented.

2601.16696 2026-01-26 stat.CO

Faster parallel MCMC: Metropolis adjustment is best served warm

Jakob Robnik, Uroš Seljak

Comments 21 pages, 9 Figures

详情
英文摘要

Despite the enormous success of Hamiltonian Monte Carlo and related Markov Chain Monte Carlo (MCMC) methods, sampling often still represents the computational bottleneck in scientific applications. Availability of parallel resources can significantly speed up MCMC inference by running a large number of chains in parallel, each collecting a single sample. However, the parallel approach converges slowly if the chains are not initialized close to the target distribution (cold start). Theoretically this can be resolved by initially running MCMC without Metropolis-Hastings adjustment to quickly converge to the vicinity of the target distribution and then turn on adjustment to achieve fine convergence. However, no practical scheme uses this strategy, due to the difficulty of automatically selecting the step size during the unadjusted phase. We here develop Late Adjusted Parallel Sampler (LAPS), which is precisely such a scheme and is applicable out of the box, all the hyperparameters are selected automatically. LAPS takes advantage of ensemble-based hyperparameter adaptation to estimate the bias at each iteration and converts it to the appropriate step size. We show that LAPS consistently and significantly outperforms ensemble adjusted methods such as MEADS or ChESS and the optimization-based initializer Pathfinder on a variety of standard benchmark problems. LAPS typically achieves two orders of magnitude lower wall-clock time than the corresponding sequential algorithms such as NUTS.

2601.16684 2026-01-26 math.ST stat.ME stat.TH

Asymptotic testing of covariance separability for matrix elliptical data

Joni Virta, Takeru Matsuda

Comments 21 pages, 2 figures

详情
英文摘要

We propose a new asymptotic test for the separability of a covariance matrix. The null distribution is valid in wide matrix elliptical model that includes, in particular, both matrix Gaussian and matrix $t$-distribution. The test is fast to compute and makes no assumptions about the component covariance matrices. An alternative, Wald-type version of the test is also proposed. Our simulations reveal that both versions of the test have good power even for heavier-tailed distributions and can compete with the Gaussian likelihood ratio test in the case of normal data.

2601.16636 2026-01-26 stat.ME stat.ML

Conformal prediction for full and sparse polynomial chaos expansions

A. Hatstatt, X. Zhu, B. Sudret

详情
英文摘要

Polynomial Chaos Expansions (PCEs) are widely recognized for their efficient computational performance in surrogate modeling. Yet, a robust framework to quantify local model errors is still lacking. While the local uncertainty of PCE prediction can be captured using bootstrap resampling, other methods offering more rigorous statistical guarantees are needed, especially in the context of small training datasets. Recently, conformal predictions have demonstrated strong potential in machine learning, providing statistically robust and model-agnostic prediction intervals. Due to its generality and versatility, conformal prediction is especially valuable, as it can be adapted to suit a variety of problems, making it a compelling choice for PCE-based surrogate models. In this contribution, we explore its application to PCE-based surrogate models. More precisely, we present the integration of two conformal prediction methods, namely the full conformal and the Jackknife+ approaches, into both full and sparse PCEs. For full PCEs, we introduce computational shortcuts inspired by the inherent structure of regression methods to optimize the implementation of both conformal methods. For sparse PCEs, we incorporate the two approaches with appropriate modifications to the inference strategy, thereby circumventing the non-symmetrical nature of the regression algorithm and ensuring valid prediction intervals. Our developments yield better-calibrated prediction intervals for both full and sparse PCEs, achieving superior coverage over existing approaches, such as the bootstrap, while maintaining a moderate computational cost.

2601.16631 2026-01-26 eess.IV cs.CV eess.SP stat.AP

PanopMamba: Vision State Space Modeling for Nuclei Panoptic Segmentation

Ming Kang, Fung Fung Ting, Raphaël C. -W. Phan, Zongyuan Ge, Chee-Ming Ting

Comments 10 pages, 3 figures

详情
英文摘要

Nuclei panoptic segmentation supports cancer diagnostics by integrating both semantic and instance segmentation of different cell types to analyze overall tissue structure and individual nuclei in histopathology images. Major challenges include detecting small objects, handling ambiguous boundaries, and addressing class imbalance. To address these issues, we propose PanopMamba, a novel hybrid encoder-decoder architecture that integrates Mamba and Transformer with additional feature-enhanced fusion via state space modeling. We design a multiscale Mamba backbone and a State Space Model (SSM)-based fusion network to enable efficient long-range perception in pyramid features, thereby extending the pure encoder-decoder framework while facilitating information sharing across multiscale features of nuclei. The proposed SSM-based feature-enhanced fusion integrates pyramid feature networks and dynamic feature enhancement across different spatial scales, enhancing the feature representation of densely overlapping nuclei in both semantic and spatial dimensions. To the best of our knowledge, this is the first Mamba-based approach for panoptic segmentation. Additionally, we introduce alternative evaluation metrics, including image-level Panoptic Quality ($i$PQ), boundary-weighted PQ ($w$PQ), and frequency-weighted PQ ($fw$PQ), which are specifically designed to address the unique challenges of nuclei segmentation and thereby mitigate the potential bias inherent in vanilla PQ. Experimental evaluations on two multiclass nuclei segmentation benchmark datasets, MoNuSAC2020 and NuInsSeg, demonstrate the superiority of PanopMamba for nuclei panoptic segmentation over state-of-the-art methods. Consequently, the robustness of PanopMamba is validated across various metrics, while the distinctiveness of PQ variants is also demonstrated. Code is available at https://github.com/mkang315/PanopMamba.

2601.16595 2026-01-26 stat.ME

Bayesian Nonparametric Causal Inference for High-Dimensional Nutritional Data via Factor-Based Exposure Mapping

Dafne Zorzetto, Zizhao Xie, Julian Stamp, Arman Oganisian, Roberta De Vito

详情
英文摘要

Diet plays a crucial role in health, and understanding the causal effects of dietary patterns is essential for informing public health policy and personalized nutrition strategies. However, causal inference in nutritional epidemiology faces several challenges: (i) high-dimensional and correlated food/nutrient intake data induce massive treatment levels; (ii) nutritional studies are interested in latent dietary patterns rather than single food items; and (iii) the goal is to estimate heterogeneous causal effects of these dietary patterns on health outcomes. We address these challenges by introducing a sophisticated exposure mapping framework that reduces the high-dimensional treatment space via factor analysis and enables the identification of dietary patterns. We also extend the Bayesian Causal Forest to accommodate three ordered levels of dietary exposure, better capturing the complex structure of nutritional data and enabling estimation of heterogeneous causal effects. We evaluate the proposed method through extensive simulations and apply it to a multi-center epidemiological study of Hispanic/Latino adults residing in the US. Using high-dimensional dietary data, we identify six dietary patterns and estimate their causal link with two key health risk factors: body mass index and fasting insulin levels. Our findings suggest that higher consumption of plant lipid-antioxidant, plant-based, animal protein, and dairy product patterns is associated with reduced risk.

2601.16585 2026-01-26 stat.ME stat.CO

Variational approximate penalized credible regions for Bayesian grouped regression

Weichang Yu, Khue-Dung Dang

详情
英文摘要

We develop a fast and accurate grouped penalized credible region approach for variable selection and prediction in Bayesian high-dimensional linear regression. Most existing Bayesian methods either are subject to high computational costs due to long Markov Chain Monte Carlo runs or yield ambiguous variable selection results due to non-sparse solution output. The penalized credible region framework yields sparse post-processed estimates that facilitates unambiguous grouped variable selection. High estimation accuracy is achieved by shrinking noise from unimportant groups using a grouped global-local shrinkage prior. To ensure computational scalability, we approximate posterior summaries using coordinate ascent variational inference and recast the penalized credible region framework as a convex optimization problem that admits efficient computations. We prove that the resultant post-processed estimators are both parameter-consistent and variable selection consistent in high-dimensional settings. Theory is developed to justify running the coordinate ascent algorithm for at least two cycles. Through extensive simulations, we demonstrate that our proposed method outperforms state-of-the-art methods in grouped variable selection, prediction, and computation time for several common models including ANOVA and nonparametric varying coefficient models.

2601.16561 2026-01-26 math.ST stat.TH

Markov Stick-breaking Processes

María F. Gil-Leyva, Antonio Lijoi, Ramsés H. Mena, Igor Prünster

Comments 57 pages, 13 figures, to be published in Annals of Statistics

详情
英文摘要

Stick-breaking has a long history and is one of the most popular procedures for constructing random discrete distributions in Statistics and Machine Learning. In particular, due to their intuitive construction and computational tractability they are ubiquitous in modern Bayesian nonparametric inference. Most widely used models, such as the Dirichlet and the Pitman-Yor processes, rely on iid or independent length variables. Here we pursue a completely unexplored research direction by considering Markov length variables and investigate the corresponding general class of stick-breaking processes, which we term Markov stick-breaking processes. We establish conditions under which the associated species sampling process is proper and the distribution of a Markov stick-breaking process has full topological support, two fundamental desiderata for Bayesian nonparametric models. We also analyze the stochastic ordering of the weights and provide a new characterization of the Pitman-Yor process as the only stick-breaking process invariant under size-biased permutations, under mild conditions. Moreover, we identify two notable subclasses of Markov stick-breaking processes that enjoy appealing properties and include Dirichlet, Pitman-Yor and Geometric priors as special cases. Our findings include distributional results enabling posterior inference algorithms and methodological insights.

2601.16431 2026-01-26 stat.ME

Sequential Experimental Designs for Kriging Model

Ruonan Zheng, Min-Qian Liu, Yongdao Zhou, Xuan Chen

Comments 24pages, 5figures

详情
英文摘要

Computer experiments have become an indispensable alternative to complex physical and engineering experiments. The Kriging model is the most widely used surrogate model, with the core goal of minimizing the discrepancy between the surrogate and true models across the entire experimental domain. However, existing sequential design methods have critical limitations: observation-based batch sequential designs are rarely studied, while one-point sequential designs have insufficient information utilization and suffer from inefficient resource utilization -- they require numerous repeated observation rounds to accumulate sufficient points, leading to prolonged experimental cycles. To address these gaps, this paper proposes two novel one-point sequential design criteria and a general batch sequential design framework. Moreover, the batch sequential design framework solves the inherent point clustering problem in naive batch selection, enabling efficient extension of any sequential criterion to batch scenarios. Simulations on some test functions demonstrate that the proposed methods outperform existing approaches in terms of fitting accuracy in most cases.

2601.16385 2026-01-26 stat.ME math.ST stat.AP stat.TH

Spherical Spatial Autoregressive Model for Spherically Embedded Spatial Data

Jiazhen Xu, Han Lin Shang

详情
英文摘要

Spherically embedded spatial data are spatially indexed observations whose values naturally reside on or can be equivalently mapped to the unit sphere. Such data are increasingly ubiquitous in fields ranging from geochemistry to demography. However, analysing such data presents unique difficulties due to the intrinsic non-Euclidean nature of the sphere, and rigorous methodologies for statistical modelling, inference, and uncertainty quantification remain limited. This paper introduces a unified framework to address these three limitations for spherically embedded spatial data. We first propose a novel spherical spatial autoregressive model that leverages optimal transport geometry and then extend it to accommodate exogenous covariates. Second, for either scenario with or without covariates, we establish the asymptotic properties of the estimators and derive a distribution-free Wald test for spatial dependence, complemented by a bootstrap procedure to enhance finite-sample performance. Third, we contribute a novel approach to uncertainty quantification by developing a conformal prediction procedure specifically tailored to spherically embedded spatial data. The practical utility of these methodological advances is illustrated through extensive simulations and applications to Spanish geochemical compositions and Japanese age-at-death mortality distributions.

2601.16368 2026-01-26 stat.ME

Inference for competing risks based on area between curves statistics

Simon Mack, Marc Ditzhaus, Merle Munko, Markus Pauly

Comments 25 pages, 9 figures, 2 tables

详情
英文摘要

In competing risks models, cumulative incidence functions are commonly compared to infer differences between groups. Many existing inference methods, however, struggle when these functions cross during the time frame of interest. To address this problem, we investigate a test statistic based on the area between cumulative incidence functions. As the corresponding limiting distribution depends on quantities that are typically unknown, we propose a wild bootstrap approach to obtain a feasible and asymptotically valid two-sample test. The finite sample performance of the proposed method, in comparison with existing methods, is examined in an extensive simulation study.

2601.16364 2026-01-26 stat.ME

Hybrid Partial Least Squares Regression with Multiple Functional and Scalar Predictors

Jongmin Mun, Jeong Hoon Jang

Comments 39 pages (main 21 pages), 8 figures, 3 tables

详情
英文摘要

Motivated by renal imaging studies that combine renogram curves with pharmacokinetic and demographic covariates, we propose Hybrid partial least squares (Hybrid PLS) for simultaneous supervised dimension reduction and regression in the presence of cross-modality correlations. The proposed approach embeds multiple functional and scalar predictors into a unified hybrid Hilbert space and rigorously extends the nonlinear iterative PLS (NIPALS) algorithm. This theoretical development is complemented by a sample-level algorithm that incorporates roughness penalties to control smoothness. By exploiting the rank-one structure of the resulting optimization problem, the algorithm admits a computationally efficient closed-form solution that requires solving only linear systems at each iteration. We establish fundamental geometric properties of the proposed framework, including orthogonality of the latent scores and PLS directions. Extensive numerical studies on synthetic data, together with an application to a renal imaging study, validate these theoretical results and demonstrate the method's ability to recover predictive structure under intermodal multicollinearity, yielding parsimonious low-dimensional representations.

2601.16347 2026-01-26 stat.AP cs.LG stat.ML

Long-Term Probabilistic Forecast of Vegetation Conditions Using Climate Attributes in the Four Corners Region

Erika McPhillips, Hyeongseong Lee, Xiangyu Xie, Kathy Baylis, Chris Funk, Mengyang Gu

详情
英文摘要

Weather conditions can drastically alter the state of crops and rangelands, and in turn, impact the incomes and food security of individuals worldwide. Satellite-based remote sensing offers an effective way to monitor vegetation and climate variables on regional and global scales. The annual peak Normalized Difference Vegetation Index (NDVI), derived from satellite observations, is closely associated with crop development, rangeland biomass, and vegetation growth. Although various machine learning methods have been developed to forecast NDVI over short time ranges, such as one-month-ahead predictions, long-term forecasting approaches, such as one-year-ahead predictions of vegetation conditions, are not yet available. To fill this gap, we develop a two-phase machine learning model to forecast the one-year-ahead peak NDVI over high-resolution grids, using the Four Corners region of the Southwestern United States as a testbed. In phase one, we identify informative climate attributes, including precipitation and maximum vapor pressure deficit, and develop the generalized parallel Gaussian process that captures the relationship between climate attributes and NDVI. In phase two, we forecast these climate attributes using historical data at least one year before the NDVI prediction month, which then serve as inputs to forecast the peak NDVI at each spatial grid. We developed open-source tools that outperform alternative methods for both gross NDVI and grid-based NDVI one-year forecasts, providing information that can help farmers and ranchers make actionable plans a year in advance.

2601.16220 2026-01-26 cs.CL cs.LG stat.ML

Towards Latent Diffusion Suitable For Text

Nesta Midavaine, Christian A. Naesseth, Grigory Bartosh

详情
英文摘要

Language diffusion models aim to improve sampling speed and coherence over autoregressive LLMs. We introduce Neural Flow Diffusion Models for language generation, an extension of NFDM that enables the straightforward application of continuous diffusion models to discrete state spaces. NFDM learns a multivariate forward process from the data, ensuring that the forward process and generative trajectory are a good fit for language modeling. Our model substantially reduces the likelihood gap with autoregressive models of the same size, while achieving sample quality comparable to that of previous latent diffusion models.

2601.12064 2026-01-26 math.ST stat.AP stat.ME stat.TH

Asymptotic Expansion and Bounds for the Bias of Empirical Tail Value-at-Risk

Nadezhda Gribkova, Jianxi Su, Mengqi Wang

Comments 38 pages, 7 figures

详情
英文摘要

Tail Value-at-Risk (TVaR) is a widely adopted risk measure playing a critically important role in both academic research and industry practice in insurance. In data applications, TVaR is often estimated using the empirical method, owing to its simplicity and nonparametric nature. The empirical TVaR has been explicitly advocated by regulatory authorities as a standard approach for computing TVaR. However, prior literature has pointed out that the empirical TVaR estimator is negatively biased, which can lead to a systemic underestimation of risk in finite-sample applications. This paper aims to deepen the understanding of the bias of the empirical TVaR estimator in two dimensions: its magnitude as well as the key distributional and structural determinants driving the severity of the bias. To this end, we derive a leading-term approximation for the bias based on its asymptotic expansion. The closed-form expression associated with the leading-term approximation enables us to obtain analytical insights into the structural properties governing the bias of the empirical TVaR estimator. To account for the discrepancy between the leading-term approximation and the true bias, we further derive an explicit upper bound for the bias. We validate the proposed bias analysis framework via simulations and demonstrate its practical relevance using real data.

2601.11744 2026-01-26 stat.ME math.ST stat.AP stat.ML stat.TH

On Nonasymptotic Confidence Intervals for Treatment Effects in Randomized Experiments

Ricardo J. Sandoval, Sivaraman Balakrishnan, Avi Feller, Michael I. Jordan, Ian Waudby-Smith

详情
英文摘要

We study nonasymptotic (finite-sample) confidence intervals for treatment effects in randomized experiments. In the existing literature, the effective sample sizes of nonasymptotic confidence intervals tend to be looser than the corresponding central-limit-theorem-based confidence intervals by a factor depending on the square root of the propensity score. We show that this performance gap can be closed, designing nonasymptotic confidence intervals that have the same effective sample size as their asymptotic counterparts. Our approach involves systematic exploitation of negative dependence or variance adaptivity (or both). We also show that the nonasymptotic rates that we achieve are unimprovable in an information-theoretic sense.

2601.10759 2026-01-26 stat.ML cs.LG

Mass Distribution versus Density Distribution in the Context of Clustering

Kai Ming Ting, Ye Zhu, Hang Zhang, Tianrun Liang

详情
英文摘要

This paper investigates two fundamental descriptors of data, i.e., density distribution versus mass distribution, in the context of clustering. Density distribution has been the de facto descriptor of data distribution since the introduction of statistics. We show that density distribution has its fundamental limitation -- high-density bias, irrespective of the algorithms used to perform clustering. Existing density-based clustering algorithms have employed different algorithmic means to counter the effect of the high-density bias with some success, but the fundamental limitation of using density distribution remains an obstacle to discovering clusters of arbitrary shapes, sizes and densities. Using the mass distribution as a better foundation, we propose a new algorithm which maximizes the total mass of all clusters, called mass-maximization clustering (MMC). The algorithm can be easily changed to maximize the total density of all clusters in order to examine the fundamental limitation of using density distribution versus mass distribution. The key advantage of the MMC over the density-maximization clustering is that the maximization is conducted without a bias towards dense clusters.

2601.09161 2026-01-26 stat.ME

A Multilayer Probit Network Model for Community Detection with Dependent Edges and Layers

Dapeng Shi, Haoran Zhang, Tiandong Wang, Junhui Wang

Comments 31 pages, 4 figures

详情
英文摘要

Community detection in multilayer networks, which aims to identify groups of nodes exhibiting similar connectivity patterns across multiple network layers, has attracted considerable attention in recent years. Most existing methods are based on the assumption that different layers are either independent or follow specific dependence structures, and edges within the same layer are independent. In this article, we propose a novel method for community detection in multilayer networks that accounts for a broad range of inter-layer and intra-layer dependence structures. The proposed method integrates the multilayer stochastic block model for community detection with a multivariate probit model to capture the structures of inter-layer dependence, which also allows intra-layer dependence. To facilitate parameter estimation, we develop a constrained pairwise likelihood method coupled with an efficient alternating updating algorithm. The asymptotic properties of the proposed method are also established, with a focus on examining the influence of inter-layer and intra-layer dependences on the accuracy of both parameter estimation and community detection. The theoretical results are supported by extensive numerical experiments on both simulated networks and a real-world multilayer trade network.

2601.01594 2026-01-26 stat.ML cs.LG

Variance-Reduced Diffusion Sampling via Target Score Identity

Alois Duston, Tan Bui-Thanh

Comments Updated to match journal submission and add ACM & MSC class info

详情
英文摘要

We study variance reduction for score estimation and diffusion-based sampling in settings where the clean (target) score is available or can be approximated. Starting from the Target Score Identity (TSI), which expresses the noisy marginal score as a conditional expectation of the target score under the forward diffusion, we develop: (i) a plug-and-play nonparametric self-normalized importance sampling estimator compatible with standard reverse-time solvers, (ii) a variance-minimizing \emph{state- and time-dependent} blending rule between Tweedie-type and TSI estimators together with an anti-correlation analysis, (iii) a data-only extension based on locally fitted proxy scores, and (iv) a likelihood-tilting extension to Bayesian inverse problems. We also propose a \emph{Critic--Gate} distillation scheme that amortizes the state-dependent blending coefficient into a neural gate. Experiments on synthetic targets and PDE-governed inverse problems demonstrate improved sample quality for a fixed simulation budget.

2601.00973 2026-01-26 eess.IV eess.SP stat.AP

Learned Hemodynamic Coupling Inference in Resting-State Functional MRI

William Consagra, Eardi Lila

详情
英文摘要

Functional magnetic resonance imaging (fMRI) provides an indirect measurement of neuronal activity via hemodynamic responses that vary across brain regions and individuals. Ignoring this hemodynamic variability can bias downstream connectivity estimates. Furthermore, the hemodynamic parameters themselves may serve as important imaging biomarkers. Estimating spatially varying hemodynamics from resting-state fMRI (rsfMRI) is therefore an important but challenging blind inverse problem, since both the latent neural activity and the hemodynamic coupling are unknown. In this work, we propose a methodology for inferring hemodynamic coupling on the cortical surface from rsfMRI. Our approach avoids the highly unstable joint recovery of neural activity and hemodynamics by marginalizing out the latent neural signal and basing inference on the resulting marginal likelihood. To enable scalable, high-resolution estimation, we employ a deep neural network combined with conditional normalizing flows to accurately approximate this intractable marginal likelihood, while enforcing spatial coherence through priors defined on the cortical surface that admit sparse representations. Uncertainty in the hemodynamic estimates is quantified via a double-bootstrap procedure. The proposed approach is extensively validated using synthetic data and real fMRI datasets, demonstrating clear improvements over current methods for hemodynamic estimation and downstream connectivity analysis.

2511.18737 2026-01-26 math.ST math.OC stat.ML stat.TH

Joint learning of a network of linear dynamical systems via total variation penalization

Claire Donnat, Olga Klopp, Hemant Tyagi

Comments 52 pages, 10 figures, 2 tables, corrected minor typos

详情
英文摘要

We consider the problem of joint estimation of the parameters of $m$ linear dynamical systems, given access to single realizations of their respective trajectories, each of length $T$. The linear systems are assumed to reside on the nodes of an undirected and connected graph $G = ([m], \mathcal{E})$, and the system matrices are assumed to either vary smoothly or exhibit small number of ``jumps'' across the edges. We consider a total variation penalized least-squares estimator and derive non-asymptotic bounds on the mean squared error (MSE) which hold with high probability. In particular, the bounds imply for certain choices of well connected $G$ that the MSE goes to zero as $m$ increases, even when $T$ is constant. The theoretical results are supported by extensive experiments on synthetic and real data.

2511.04350 2026-01-26 math.OC cs.CE cs.IT math.IT math.ST stat.TH

On the relationship between MESP and 0/1 D-Opt and their upper bounds

Gabriel Ponte, Marcia Fampa, Jon Lee

详情
英文摘要

We establish strong connections between two fundamental nonlinear 0/1 optimization problems coming from the area of experimental design, namely maximum entropy sampling and 0/1 D-Optimality. The connections are based on maps between instances, and we analyze the behavior of these maps. Using these maps, we transport basic upper-bounding methods between these two problems, and we are able to establish new domination results and other inequalities relating various basic upper bounds. Further, we establish results relating how different branch-and-bound schemes based on these maps compare. Additionally, we observe some surprising numerical results, where bounding methods that did not seem promising in their direct application to real-data MESP instances, are now useful for MESP instances that come from 0/1 D-Optimality.

2510.21935 2026-01-26 cs.LG cs.AI stat.ML

AutoSciDACT: Automated Scientific Discovery through Contrastive Embedding and Hypothesis Testing

Samuel Bright-Thonney, Christina Reissel, Gaia Grosso, Nathaniel Woodward, Katya Govorkova, Andrzej Novak, Sang Eon Park, Eric Moreno, Philip Harris

Comments Accepted at NeurIPS 2025; 33 pages, 16 figures

详情
英文摘要

Novelty detection in large scientific datasets faces two key challenges: the noisy and high-dimensional nature of experimental data, and the necessity of making statistically robust statements about any observed outliers. While there is a wealth of literature on anomaly detection via dimensionality reduction, most methods do not produce outputs compatible with quantifiable claims of scientific discovery. In this work we directly address these challenges, presenting the first step towards a unified pipeline for novelty detection adapted for the rigorous statistical demands of science. We introduce AutoSciDACT (Automated Scientific Discovery with Anomalous Contrastive Testing), a general-purpose pipeline for detecting novelty in scientific data. AutoSciDACT begins by creating expressive low-dimensional data representations using a contrastive pre-training, leveraging the abundance of high-quality simulated data in many scientific domains alongside expertise that can guide principled data augmentation strategies. These compact embeddings then enable an extremely sensitive machine learning-based two-sample test using the New Physics Learning Machine (NPLM) framework, which identifies and statistically quantifies deviations in observed data relative to a reference distribution (null hypothesis). We perform experiments across a range of astronomical, physical, biological, image, and synthetic datasets, demonstrating strong sensitivity to small injections of anomalous data across all domains.

2510.18598 2026-01-26 stat.ME math.ST stat.TH

Measuring deviations from spherical symmetry

Lujia Bai, Holger Dette

详情
英文摘要

Most of the work on checking spherical symmetry assumptions on the distribution of the $p$-dimensional random vector $Y$ has its focus on statistical tests for the null hypothesis of exact spherical symmetry. In this paper, we take a different point of view and propose a measure for the deviation from spherical symmetry, which is based on the minimum distance between the distribution of the vector $\big (\|Y\|, Y/ \|Y\| )^\top $ and its best approximation by a distribution of a vector $\big (\|Y_s\|, Y_s/ \|Y_s \| )^\top $ corresponding to a random vector $Y_s$ with a spherical distribution. We develop estimators for the minimum distance with corresponding statistical guarantees (provided by asymptotic theory) and demonstrate the applicability of our approach by means of a simulation study and a real data example.

2510.16745 2026-01-26 stat.ML cs.LG math.ST stat.ME stat.TH

Kernel-Based Nonparametric Tests For Shape Constraints

Rohan Sen

Comments 40 pages, 4 figures

详情
英文摘要

We propose a kernel-based nonparametric framework for mean-variance optimization that enables inference on economically motivated shape constraints in finance, including positivity, monotonicity, and convexity. Many central hypotheses in financial econometrics are naturally expressed as shape relations on latent functions (e.g., term premia, CAPM relations, and the pricing kernel), yet enforcing such constraints during estimation can mask economically meaningful violations; our approach therefore separates learning from validation by first estimating an unconstrained solution and then testing shape properties. We establish statistical properties of the regularized sample estimator and derive rigorous guarantees, including asymptotic consistency, a functional central limit theorem, and a finite-sample deviation bound achieving the Monte Carlo rate up to a regularization term. Building on these results, we construct a joint Wald-type statistic to test shape constraints on finite grids. An efficient algorithm based on a pivoted Cholesky factorization yields scalability to large datasets. Numerical studies, including an options-based asset-pricing application, illustrate the usefulness of the proposed method for evaluating monotonicity and convexity restrictions.

2510.11863 2026-01-26 stat.ME

On the permutation equivariance principle for causal estimands

Jiaqi Tong, Fan Li

详情
英文摘要

In many causal inference problems, multiple action variables, such as factors, mediators, or network units, often share a common causal role yet lack a natural ordering. To avoid ambiguity, the scientific interpretation of a vector of estimands should remain invariant under relabeling, an implicit principle we refer to as permutation equivariance. Permutation equivariance can be understood as the property that permuting the variables permutes the estimands in a trackable manner, such that scientific meaning is preserved. We formally characterize this principle and study its combinatorial algebra. We present a class of weighted estimands that project unstructured potential outcome means into a vector of permutation equivariant and interpretable estimands capturing all orders of interaction. To guide practice, we discuss the implications and choices of weights and define residual-free estimands, whose inclusion-exclusion sums capture the maximal effect, which is useful in context such as causal mediation and network interference. We present the application of our general theory to three canonical examples and extend our results to ratio effect measures.

2510.00900 2026-01-26 stat.ME stat.OT

How can the use of different modes of survey data collection introduce bias? A simple introduction to mode effects using directed acyclic graphs (DAGs)

Georgia D Tomova, Richard J Silverwood, Peter WG Tennant, Liam Wright

详情
英文摘要

Survey data are self-reported data collected directly from respondents by a questionnaire or an interview and are commonly used in epidemiology. Such data are traditionally collected via a single mode (e.g. face-to-face interview alone), but use of mixed-mode designs (e.g. offering face-to-face interview or online survey) has become more common. This introduces two key challenges. First, individuals may respond differently to the same question depending on the mode; these differences due to measurement are known as 'mode effects'. Second, different individuals may participate via different modes; these differences in sample composition between modes are known as 'mode selection'. Where recognised, mode effects are often handled by straightforward approaches such as conditioning on survey mode. However, while reducing mode effects, this and other equivalent approaches may introduce collider bias in the presence of mode selection. The existence of mode effects and the consequences of naïve conditioning may be underappreciated in epidemiology. This paper offers a simple introduction to these challenges using directed acyclic graphs by exploring a range of possible data structures. We discuss the potential implications of using conditioning- or imputation-based approaches and outline the advantages of quantitative bias analyses for dealing with mode effects.

2509.25648 2026-01-26 stat.AP

Chinese vs. World Bank Development Projects: Insights from Earth Observation and Computer Vision on Wealth Gains in Africa, 2002-2013

Adel Daoud, Cindy Conlin, Connor T. Jerzak

Comments Forthcoming in World Development

详情
英文摘要

Debates about whether development projects improve living conditions persist, partly because observational estimates can be biased by incomplete adjustment and because reliable outcome data are scarce at the neighborhood level. We address both issues in a continent-scale, sector-specific evaluation of Chinese and World Bank projects across 9,899 neighborhoods in 36 African countries (2002-2013), representative of ~88% of the population. First, we use a recent dataset that measures living conditions with a machine-learned wealth index derived from contemporaneous satellite imagery, yielding a consistent panel of 6.7 km square mosaics. Second, to strengthen identification, we proxy officials' map-based placement criteria using pre-treatment daytime satellite images and fuse these with tabular covariates to estimate funder- and sector-specific ATEs via inverse-probability weighting. Incorporating imagery often shrinks effects relative to tabular-only models. On average, both donors raise wealth, with larger and more consistent gains for China; sector extremes in our sample include Trade and Tourism (330) for the World Bank (+12.29 IWI points), and Emergency Response (700) for China (+15.15). Assignment-mechanism analyses also show World Bank placement is often more predictable from imagery alone (as well as from tabular covariates). This suggests that Chinese project placements are more driven by non-visible, political, or event-driven factors than World Bank placements. To probe residual concerns about selection on observables, we also estimate within-neighborhood (unit) fixed-effects models at a spatial resolution about 67 times finer than prior fixed-effects analyses, leveraging the computer-vision-imputed IWI panels; these deliver smaller but, for Chinese projects, directionally consistent effects.

2509.25519 2026-01-26 cs.LG stat.ML

Flow Matching with Semidiscrete Couplings

Alireza Mousavi-Hosseini, Stephen Y. Zhang, Michal Klein, Marco Cuturi

Comments 38 pages, 23 figures

详情
英文摘要

Flow models parameterized as time-dependent velocity fields can generate data from noise by integrating an ODE. These models are often trained using flow matching, i.e. by sampling random pairs of noise and target points $(\mathbf{x}_0,\mathbf{x}_1)$ and ensuring that the velocity field is aligned, on average, with $\mathbf{x}_1-\mathbf{x}_0$ when evaluated along a segment linking $\mathbf{x}_0$ to $\mathbf{x}_1$. While these pairs are sampled independently by default, they can also be selected more carefully by matching batches of $n$ noise to $n$ target points using an optimal transport (OT) solver. Although promising in theory, the OT flow matching (OT-FM) approach is not widely used in practice. Zhang et al. (2025) pointed out recently that OT-FM truly starts paying off when the batch size $n$ grows significantly, which only a multi-GPU implementation of the Sinkhorn algorithm can handle. Unfortunately, the costs of running Sinkhorn can quickly balloon, requiring $O(n^2/\varepsilon^2)$ operations for every $n$ pairs used to fit the velocity field, where $\varepsilon$ is a regularization parameter that should be typically small to yield better results. To fulfill the theoretical promises of OT-FM, we propose to move away from batch-OT and rely instead on a semidiscrete formulation that leverages the fact that the target dataset distribution is usually of finite size $N$. The SD-OT problem is solved by estimating a dual potential vector using SGD; using that vector, freshly sampled noise vectors at train time can then be matched with data points at the cost of a maximum inner product search (MIPS). Semidiscrete FM (SD-FM) removes the quadratic dependency on $n/\varepsilon$ that bottlenecks OT-FM. SD-FM beats both FM and OT-FM on all training metrics and inference budget constraints, across multiple datasets, on unconditional/conditional generation, or when using mean-flow models.

2508.08450 2026-01-26 cs.LG stat.ME stat.ML

Differentiable Cyclic Causal Discovery Under Unmeasured Confounders

Muralikrishnna G. Sethuraman, Faramarz Fekri

详情
英文摘要

Understanding causal relationships between variables is fundamental across scientific disciplines. Most causal discovery algorithms rely on two key assumptions: (i) all variables are observed, and (ii) the underlying causal graph is acyclic. While these assumptions simplify theoretical analysis, they are often violated in real-world systems, such as biological networks. Existing methods that account for confounders either assume linearity or struggle with scalability. To address these limitations, we propose DCCD-CONF, a novel framework for differentiable learning of nonlinear cyclic causal graphs in the presence of unmeasured confounders using interventional data. Our approach alternates between optimizing the graph structure and estimating the confounder distribution by maximizing the log-likelihood of the data. Through experiments on synthetic data and real-world gene perturbation datasets, we show that DCCD-CONF outperforms state-of-the-art methods in both causal graph recovery and confounder identification. Additionally, we also provide consistency guarantees for our framework, reinforcing its theoretical soundness.

2507.09905 2026-01-26 stat.ME math.OC math.ST stat.ML stat.TH

Statistical Analysis of Conditional Group Distributionally Robust Optimization with Cross-Entropy Loss

Zijian Guo, Zhenyu Wang, Yifan Hu, Francis Bach

详情
英文摘要

In multi-source learning with discrete labels, distributional heterogeneity across domains poses a central challenge to developing predictive models that transfer reliably to unseen domains. We study multi-source unsupervised domain adaptation, where labeled data are available from multiple source domains and only unlabeled data are observed from the target domain. To address potential distribution shifts, we propose a novel Conditional Group Distributionally Robust Optimization (CG-DRO) framework that learns a classifier by minimizing the worst-case cross-entropy loss over the convex combinations of the conditional outcome distributions from sources domains. We develop an efficient Mirror Prox algorithm for solving the minimax problem and employ a double machine learning procedure to estimate the risk function, ensuring that errors in nuisance estimation contribute only at higher-order rates. We establish fast statistical convergence rates for the empirical CG-DRO estimator by constructing two surrogate minimax optimization problems that serve as theoretical bridges. A distinguishing challenge for CG-DRO is the emergence of nonstandard asymptotics: the empirical CG-DRO estimator may fail to converge to a standard limiting distribution due to boundary effects and system instability. To address this, we introduce a perturbation-based inference procedure that enables uniformly valid inference, including confidence interval construction and hypothesis testing.

2507.05470 2026-01-26 stat.ML cs.LG

Temporal Conformal Prediction (TCP): A Distribution-Free Statistical and Machine Learning Framework for Adaptive Risk Forecasting

Agnideep Aich, Ashit Baran Aich, Dipak C. Jain

详情
英文摘要

We propose \textbf{Temporal Conformal Prediction (TCP)}, a distribution-free framework for constructing well-calibrated prediction intervals in nonstationary time series. TCP couples a modern quantile forecaster with a rolling split-conformal calibration layer; its \textbf{TCP-RM} variant adds an online Robbins-Monro offset to steer coverage in real time. We benchmark TCP against GARCH, Historical Simulation, Quantile Regression (QR), linear QR, and Adaptive Conformal Inference (ACI) across S\&P 500, Bitcoin, and Gold. Three results are consistent. First, QR baselines yield the sharpest intervals but are materially under-calibrated; even ACI remains below the 95\% target. Second, TCP achieves near-nominal coverage, yielding intervals slightly wider than Historical Simulation (e.g., S\&P 500: 5.21 vs.\ 5.06). Third, the RM update changes calibration only marginally at default hyperparameters. Crisis-window visualizations (March 2020) show TCP promptly expanding and contracting intervals as volatility spikes. A sensitivity study confirms robustness to hyperparameters. Overall, TCP bridges statistical inference and machine learning, providing a practical solution for calibrated risk forecasting under distribution shift.

2506.20325 2026-01-26 math.ST math.PR stat.TH

Robust estimation of a Markov chain transition matrix from multiple sample paths

Lasse Leskelä, Maximilien Dreveton

Comments 20 pages

Journal ref Statistica Neerlandica 80(1):e70023, 2026

详情
英文摘要

Markov chains are fundamental models for stochastic dynamics, with applications in a wide range of areas such as population dynamics, queueing systems, reinforcement learning, and Monte Carlo methods. Estimating the transition matrix and stationary distribution from observed sample paths is a core statistical challenge, particularly when multiple independent trajectories are available. While classical theory typically assumes identical chains with known stationary distributions, real-world data often arise from heterogeneous chains whose transition kernels and stationary measures might differ from a common target. We analyse empirical estimators for such parallel Markov processes and establish sharp concentration inequalities that generalise Bernstein-type bounds from standard time averages to ensemble-time averages. Our results provide nonasymptotic error bounds and consistency guarantees in high-dimensional regimes, accommodating sparse or weakly mixing chains, model mismatch, nonstationary initialisations, and partially corrupted data. These findings offer rigorous foundations for statistical inference in heterogeneous Markov chain settings common in modern computational applications.

2506.16295 2026-01-26 stat.CO

Understanding uncertainty in Bayesian cluster analysis

Cecilia Balocchi, Sara Wade

详情
英文摘要

The Bayesian approach to clustering is often appreciated for its ability to provide uncertainty in the partition structure. However, summarizing the posterior distribution over the clustering structure can be challenging, due the discrete, unordered nature and massive dimension of the space. While recent advancements provide a single clustering estimate to represent the posterior, this ignores uncertainty and may even be unrepresentative in instances where the posterior is multimodal. To enhance our understanding of uncertainty, we propose a WASserstein Approximation for Bayesian clusterIng (WASABI), which summarizes the posterior samples with not one, but multiple clustering estimates, each corresponding to a different part of the partition space that receives substantial posterior mass. Specifically, we find such clustering estimates by approximating the posterior distribution in a Wasserstein distance sense, equipped with a suitable metric on the partition space. An interesting byproduct is that a locally optimal solution can be found using a k-medoids-like algorithm on the partition space to divide the posterior samples into groups, each represented by one of the clustering estimates. Using synthetic and real datasets, we show that WASABI helps to improve the understanding of uncertainty, particularly when clusters are not well separated or when the employed model is misspecified.

2505.15638 2026-01-26 cs.LG stat.CO stat.ME stat.ML

Bayesian Ensembling: Insights from Online Optimization and Empirical Bayes

Daniel Waxman, Fernando Llorente, Petar M. Djurić

Comments 28 pages, 10 figures; Accepted to Transactions on Machine Learning Research (TMLR)

Journal ref Transactions on Machine Learning Research (TMLR), 2026

详情
英文摘要

We revisit the classical problem of Bayesian ensembles and address the challenge of learning optimal combinations of Bayesian models in an online, continual learning setting. To this end, we reinterpret existing approaches such as Bayesian model averaging (BMA) and Bayesian stacking through a novel empirical Bayes lens, shedding new light on the limitations and pathologies of BMA. Further motivated by insights from online optimization, we propose Online Bayesian Stacking (OBS), a method that optimizes the log-score over predictive distributions to adaptively combine Bayesian models. A key contribution of our work is establishing a novel connection between OBS and portfolio selection, bridging Bayesian ensemble learning with a rich, well-studied theoretical framework that offers efficient algorithms and extensive regret analysis. We further clarify the relationship between OBS and online BMA, showing that they optimize related but distinct cost functions. Through theoretical analysis and empirical evaluation, we identify scenarios where OBS outperforms online BMA and provide principled methods and guidance on when practitioners should prefer one approach over the other.

2503.12634 2026-01-26 stat.ME stat.ML

Clustered random forests with correlated data for optimal estimation and inference under potential covariate shift

Elliot H. Young, Peter Bühlmann

详情
英文摘要

We develop Clustered Random Forests, a random forests algorithm for clustered data, arising from independent groups that exhibit within-cluster dependence. The leaf-wise predictions for each decision tree making up clustered random forests takes the form of a weighted least squares estimator, which leverage correlations between observations for improved prediction accuracy and tighter confidence intervals when performing inference. We show that approximately linear time algorithms exist for fitting classes of clustered random forests, matching the computational complexity of standard random forests. Further, we observe that the optimality of a clustered random forest, with regards to how optimal weights are chosen within this framework i.e. those that minimise mean squared prediction error, vary under covariate distribution shift. In light of this, we advocate weight estimation to be determined by a user-chosen covariate distribution, or test dataset of covariates, with respect to which optimal prediction or inference is desired. This highlights a key distinction between correlated and independent data with regards to optimality of nonparametric conditional mean estimation under covariate shift. We demonstrate our theoretical findings numerically in a number of simulated and real-world settings.

2502.14600 2026-01-26 stat.ME stat.CO stat.ML

Spectral decomposition-assisted multi-study factor analysis

Lorenzo Mauri, Niccolò Anceschi, David B. Dunson

详情
英文摘要

This article focuses on covariance estimation for multi-study data. Popular approaches employ factor-analytic terms with shared and study-specific loadings that decompose the variance into (i) a shared low-rank component, (ii) study-specific low-rank components, and (iii) a diagonal term capturing idiosyncratic variability. Our proposed methodology estimates the latent factors via spectral decompositions, with a novel approach for separating shared and specific factors, and infers the factor loadings and residual variances via surrogate Bayesian regressions. The resulting posterior has a simple product form across outcomes, bypassing the need for Markov chain Monte Carlo sampling and facilitating parallelization. The proposed methodology has major advantages over current Bayesian competitors in terms of computational speed, scalability and stability while also having strong frequentist guarantees. The theory and methods also add to the rich literature on frequentist methods for factor models with shared and group-specific components of variation. The approximation error decreases as the sample size and the data dimension diverge, formalizing a blessing of dimensionality. We show favorable asymptotic properties, including central limit theorems for point estimators and posterior contraction, and excellent empirical performance in simulations. The methods are applied to integrate three studies on gene associations among immune cells.

2502.06168 2026-01-26 stat.ML cs.LG econ.EM math.OC

Dynamic Pricing with Adversarially-Censored Demands

Jianyu Xu, Yining Wang, Xi Chen, Yu-Xiang Wang

Comments 28 pages, 1 figure

详情
英文摘要

We study an online dynamic pricing problem where the potential demand at each time period $t=1,2,\ldots, T$ is stochastic and dependent on the price. However, a perishable inventory is imposed at the beginning of each time $t$, censoring the potential demand if it exceeds the inventory level. To address this problem, we introduce a pricing algorithm based on the optimistic estimates of derivatives. We show that our algorithm achieves $\tilde{O}(\sqrt{T})$ optimal regret even with adversarial inventory series. Our findings advance the state-of-the-art in online decision-making problems with censored feedback, offering a theoretically optimal solution against adversarial observations.

2502.01953 2026-01-26 stat.ML cs.LG math.ST stat.TH

Local minima of the empirical risk in high dimension: General theorems and convex examples

Kiana Asgari, Andrea Montanari, Basil Saeed

详情
英文摘要

We consider a general model for high-dimensional empirical risk minimization whereby the data $\mathbf{x}_i$ are $d$-dimensional Gaussian vectors, the model is parametrized by $\mathbfΘ\in\mathbb{R}^{d\times k}$, and the loss depends on the data via the projection $\mathbfΘ^\mathsf{T}\mathbf{x}_i$. This setting covers as special cases classical statistics methods (e.g. multinomial regression and other generalized linear models), but also two-layer fully connected neural networks with $k$ hidden neurons. We use the Kac-Rice formula from Gaussian process theory to derive a bound on the expected number of local minima of this empirical risk, under the proportional asymptotics in which $n,d\to\infty$, with $n\asymp d$. Via Markov's inequality, this bound allows to determine the positions of these minimizers (with exponential deviation bounds) and hence derive sharp asymptotics on the estimation and prediction error. As a special case, we apply our characterization to convex losses. We show that our approach is tight and allows to prove previously conjectured results. In addition, we characterize the spectrum of the Hessian at the minimizer. A companion paper applies our general result to non-convex examples.

2501.01840 2026-01-26 stat.ML cs.LG

Signal Recovery Using a Spiked Mixture Model

Paul-Louis Delacour, Sander Wahls, Jeffrey M. Spraggins, Lukasz Migas, Raf Van de Plas

Journal ref IEEE Transactions on Signal Processing, vol. 73, pp. 3748-3761, 2025

详情
英文摘要

We introduce the spiked mixture model (SMM) to address the problem of estimating a set of signals from many randomly scaled and noisy observations. Subsequently, we design a novel expectation-maximization (EM) algorithm to recover all parameters of the SMM. Numerical experiments show that in low signal-to-noise ratio regimes, and for data types where the SMM is relevant, SMM surpasses the more traditional Gaussian mixture model (GMM) in terms of signal recovery performance. The broad relevance of the SMM and its corresponding EM recovery algorithm is demonstrated by applying the technique to different data types. The first case study is a biomedical research application, utilizing an imaging mass spectrometry dataset to explore the molecular content of a rat brain tissue section at micrometer scale. The second case study demonstrates SMM performance in a computer vision application, segmenting a hyperspectral imaging dataset into underlying patterns. While the measurement modalities differ substantially, in both case studies SMM is shown to recover signals that were missed by traditional methods such as k-means clustering and GMM.

2410.15652 2026-01-26 math.PR math.CO math.ST stat.TH

Sparse Hanson-Wright Inequalities with Applications

Yiyun He, Ke Wang, Yizhe Zhu

Comments 53 pages, 1 figure

详情
英文摘要

We derive new Hanson-Wright-type inequalities tailored to the quadratic forms of random vectors with sparse independent components. Specifically, we consider cases where the components of the random vector are sparse $α$-subexponential random variables with $α>0$. When $α=\infty$, these inequalities can be seen as quadratic generalizations of the classical Bernstein and Bennett inequalities for sparse bounded random vectors. To establish this quadratic generalization, we also develop new Bernstein-type and Bennett-type inequalities for linear forms of sparse $α$-subexponential random variables that go beyond the bounded case $(α=\infty)$. Our proof relies on a novel combinatorial method for estimating the moments of both random linear forms and quadratic forms. We present two key applications of these new sparse Hanson-Wright inequalities: (1) A local law and complete eigenvector delocalization for sparse $α$-subexponential Hermitian random matrices, generalizing the result of He et al. (2019) beyond sparse Bernoulli random matrices. To the best of our knowledge, this is the first local law and complete delocalization result for sparse $α$-subexponential random matrices down to the near-optimal sparsity $p\geq \frac{\mathrm{polylog}(n)}{n}$ when $α\in (0,2)$ as well as for unbounded sparse sub-gaussian random matrices down to the optimal sparsity $p\gtrsim \frac{\log n}{n}.$ (2) Concentration of the Euclidean norm for the linear transformation of a sparse $α$-subexponential random vector, improving on the results of G{ö}tze et al. (2021) for sparse sub-exponential random vectors.

2408.17078 2026-01-26 math.PR math.ST stat.TH

Aliasing Effects for Samples of Spin Random Fields on the Sphere

Claudio Durastanti

Comments 34 pages, 3 figures, 2 tables

详情
英文摘要

This paper investigates aliasing effects emerging from the reconstruction from discrete samples of spin spherical random fields defined on the two-dimensional sphere. We determine the location in the frequency domain and the intensity of the aliases of the harmonic coefficients in the Fourier decomposition of the spin random field and evaluate the consequences of aliasing errors in the angular power spectrum when the samples of the random field are obtained by using some very popular sampling procedures on the sphere, the equiangular and the Gauss-Jacobi sampling schemes. Finally, we demonstrate that band-limited spin random fields are free from aliases, provided that a sufficiently large number of nodes is used in the selected quadrature rule.

2407.11735 2026-01-26 cs.LG cs.CV stat.ML

ProSub: Probabilistic Open-Set Semi-Supervised Learning with Subspace-Based Out-of-Distribution Detection

Erik Wallin, Lennart Svensson, Fredrik Kahl, Lars Hammarstrand

Comments ECCV2024

详情
英文摘要

In open-set semi-supervised learning (OSSL), we consider unlabeled datasets that may contain unknown classes. Existing OSSL methods often use the softmax confidence for classifying data as in-distribution (ID) or out-of-distribution (OOD). Additionally, many works for OSSL rely on ad-hoc thresholds for ID/OOD classification, without considering the statistics of the problem. We propose a new score for ID/OOD classification based on angles in feature space between data and an ID subspace. Moreover, we propose an approach to estimate the conditional distributions of scores given ID or OOD data, enabling probabilistic predictions of data being ID or OOD. These components are put together in a framework for OSSL, termed ProSub, that is experimentally shown to reach SOTA performance on several benchmark problems. Our code is available at https://github.com/walline/prosub.

2407.05050 2026-01-26 math.DS nlin.CD physics.comp-ph stat.ML

Sparse identification of quasipotentials via a combined data-driven method

Bo Lin, Pierpaolo Belardinelli

详情
英文摘要

The quasipotential function allows for comprehension and prediction of the escape mechanisms from metastable states in nonlinear dynamical systems. This function acts as a natural extension of the potential function for non-gradient systems and it unveils important properties such as the maximum likelihood transition paths, transition rates and expected exit times of the system. Here, we demonstrate how to discover parsimonious equations for the quasipotential directly from data. Leveraging machine learning, we combine two existing data-driven techniques, namely a neural network and a sparse regression algorithm, specifically designed to symbolically describe multistable energy landscapes. First, we employ a vanilla neural network enhanced with a renormalization and rescaling procedure to achieve an orthogonal decomposition of the vector field. Next, we apply symbolic regression to extract the downhill and circulatory components of the decomposition, ensuring consistency with the underlying dynamics. This symbolic reconstruction involves a simultaneous regression that imposes constraints on both the orthogonality condition and the vector field. We implement and benchmark our approach using an archetypal model with a known exact quasipotential, as well as a nanomechanical resonator system. We further demonstrate its applicability to noisy data and to a four-dimensional system. Our model-unbiased analytical forms of the quasipotential is of interest to a wide range of applications aimed at assessing metastability and energy landscapes, serving to parametrically capture the distinctive fingerprint of the fluctuating dynamics.

2406.00778 2026-01-26 stat.ML cs.AI cs.LG stat.CO stat.ME

Bayesian Joint Additive Factor Models for Multiview Learning

Niccolo Anceschi, Federico Ferrari, David B. Dunson, Himel Mallick

Comments To be published in Biometrics

详情
英文摘要

It is increasingly common to collect data of multiple different types on the same set of samples. Our focus is on studying relationships between such multiview features and responses. A motivating application arises in the context of precision medicine where multi-omics data are collected to correlate with clinical outcomes. It is of interest to infer dependence within and across views while combining multimodal information to improve the prediction of outcomes. The signal-to-noise ratio can vary substantially across views, motivating more nuanced statistical tools beyond standard late and early fusion. This challenge comes with the need to preserve interpretability, select features, and obtain accurate uncertainty quantification. To address these challenges, we introduce two complementary factor regression models. A baseline Joint Factor Regression (\textsc{jfr}) captures combined variation across views via a single factor set, and a more nuanced Joint Additive FActor Regression (\textsc{jafar}) that decomposes variation into shared and view-specific components. For \textsc{jfr}, we use independent cumulative shrinkage process (\textsc{i-cusp}) priors, while for \textsc{jafar} we develop a dependent version (\textsc{d-cusp}) designed to ensure identifiability of the components. We develop Gibbs samplers that exploit the model structure and accommodate flexible feature and outcome distributions. Prediction of time-to-labor onset from immunome, metabolome, and proteome data illustrates performance gains against state-of-the-art competitors. Our open-source software (\texttt{R} package) is available at https://github.com/niccoloanceschi/jafar.

2402.08941 2026-01-26 econ.EM stat.AP stat.ME

Local-Polynomial Estimation for Multivariate Regression Discontinuity Designs

Masayuki Sawada, Takuya Ishihara, Daisuke Kurisu, Yasumasa Matsuda

详情
英文摘要

We study a multivariate regression discontinuity design in which treatment is assigned by crossing a boundary in the space of multiple running variables. We document that the existing bandwidth selector is suboptimal for a multivariate regression discontinuity design when the distance to a boundary point is used for its running variable, and introduce a multivariate local-linear estimator for multivariate regression discontinuity designs. Our estimator is asymptotically valid and can capture heterogeneous treatment effects over the boundary. We demonstrate that our estimator exhibits smaller root mean squared errors and often shorter confidence intervals in numerical simulations. We illustrate our estimator in our empirical applications of multivariate designs of a Colombian scholarship study and a U.S. House of representative voting study and demonstrate that our estimator reveals richer heterogeneous treatment effects with often shorter confidence intervals than the existing estimator.

2305.01240 2026-01-26 math.ST stat.TH

On the convergence of PINNs

Nathan Doumèche, Gérard Biau, Claire Boyer

Journal ref Bernoulli, 2025, Vol. 31, pp. 2127-2151

详情
英文摘要

Physics-informed neural networks (PINNs) are a promising approach that combines the power of neural networks with the interpretability of physical modeling. PINNs have shown good practical performance in solving partial differential equations (PDEs) and in hybrid modeling scenarios, where physical models enhance data-driven approaches. However, it is essential to establish their theoretical properties in order to fully understand their capabilities and limitations. In this study, we highlight that classical training of PINNs can suffer from systematic overfitting. This problem can be addressed by adding a ridge regularization to the empirical risk, which ensures that the resulting estimator is risk-consistent for both linear and nonlinear PDE systems. However, the strong convergence of PINNs to a solution satisfying the physical constraints requires a more involved analysis using tools from functional analysis and calculus of variations. In particular, for linear PDE systems, an implementable Sobolev-type regularization allows to reconstruct a solution that not only achieves statistical accuracy but also maintains consistency with the underlying physics.