arXivDaily: arXiv daily academic digest, updated Monday to Friday
2601.18788 2026-01-27 cs.CL cs.LG stat.ML

Unsupervised Text Segmentation via Kernel Change-Point Detection on Sentence Embeddings

Mumin Jia, Jairo Diaz-Rodriguez

Comments arXiv admin note: substantial text overlap with arXiv:2510.03437

Abstract

Unsupervised text segmentation is crucial because boundary labels are expensive, subjective, and often fail to transfer across domains and granularity choices. We propose Embed-KCPD, a training-free method that represents sentences as embedding vectors and estimates boundaries by minimizing a penalized KCPD objective. Beyond the algorithmic instantiation, we develop, to our knowledge, the first dependence-aware theory for KCPD under $m$-dependent sequences, a finite-memory abstraction of short-range dependence common in language. We prove an oracle inequality for the population penalized risk and a localization guarantee showing that each true change point is recovered within a window that is small relative to segment length. To connect theory to practice, we introduce an LLM-based simulation framework that generates synthetic documents with controlled finite-memory dependence and known boundaries, validating the predicted scaling behavior. Across standard segmentation benchmarks, Embed-KCPD often outperforms strong unsupervised baselines. A case study on Taylor Swift's tweets illustrates that Embed-KCPD combines strong theoretical guarantees, simulated reliability, and practical effectiveness for text segmentation.
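
Illustrative sketch (not taken from the paper): the abstract describes estimating boundaries by minimizing a penalized kernel change-point objective over sentence embeddings. The snippet below shows one generic way such an objective can be minimized, using a Gaussian kernel, a constant per-segment penalty, and a naive dynamic program. The toy 2-D points stand in for sentence embeddings, and the kernel, bandwidth, and penalty value are all assumptions made for illustration; the paper's actual choices are not stated in the abstract.

```python
import math

def gaussian_kernel(x, y, bandwidth=1.0):
    """RBF kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def segment_cost(gram, start, end):
    """Kernel scatter of points[start:end]:
    sum_i k(x_i, x_i) - (1/n) * sum_{i,j} k(x_i, x_j)."""
    n = end - start
    diag = sum(gram[i][i] for i in range(start, end))
    block = sum(gram[i][j] for i in range(start, end) for j in range(start, end))
    return diag - block / n

def kcpd(points, penalty, bandwidth=1.0):
    """Change points minimizing total segment cost + penalty per segment.
    Naive dynamic program, fine for toy inputs only."""
    n = len(points)
    gram = [[gaussian_kernel(p, q, bandwidth) for q in points] for p in points]
    best = [0.0] + [math.inf] * n   # best[t]: optimal penalized cost of points[:t]
    prev = [0] * (n + 1)            # prev[t]: start index of the last segment
    for t in range(1, n + 1):
        for s in range(t):
            cost = best[s] + segment_cost(gram, s, t) + penalty
            if cost < best[t]:
                best[t], prev[t] = cost, s
    # Walk back through the stored segment starts to recover the boundaries.
    boundaries, t = [], n
    while t > 0:
        t = prev[t]
        if t > 0:
            boundaries.append(t)
    return sorted(boundaries)

# Hypothetical "embeddings": three topics with 5, 5, and 4 sentences.
topic_a = [(0.00, 0.02), (0.03, -0.01), (-0.02, 0.00), (0.01, 0.04), (-0.03, -0.02)]
topic_b = [(3.01, 2.98), (2.97, 3.02), (3.00, 3.03), (3.04, 2.99), (2.98, 3.00)]
topic_c = [(0.02, 3.01), (-0.01, 2.97), (0.00, 3.02), (0.03, 3.00)]
points = topic_a + topic_b + topic_c

boundaries = kcpd(points, penalty=1.0)
print(boundaries)
```

The penalty controls granularity: a larger value merges segments, a smaller one splits more readily. In practice the points would come from a sentence-embedding model and a faster search would replace the naive loops.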

2601.18770 2026-01-27 math.ST stat.TH

Equality between two general ridge estimators and applications in several linear models

Hirai Mukasa

Comments 12 pages

Abstract

General ridge estimators are widely used in the general linear model because they possess desirable properties such as linear sufficiency and linear admissibility. However, when the covariance matrix of the error term is partially unknown, estimation typically requires a two-step procedure. This paper derives conditions under which the general ridge estimator based on the covariance matrix coincides with the one that does not depend on it. In particular, we provide practically verifiable conditions for several linear models, including Rao's mixed-effects model, a seemingly unrelated regression model, first-order spatial autoregressive and spatial moving average models, and serial correlation models. These results enable the use of a covariance-free general ridge estimator, thereby simplifying the two-step estimation procedure.

2601.18689 2026-01-27 math.ST stat.TH

Function estimation in the empirical Bayes setting

Benjamin Kang, Yury Polyanskiy, Anzo Teh

Abstract

We study function estimation in the empirical Bayes setting for Poisson and normal means. Specifically, given observations $Y_i\sim f(\cdot; θ_i)$ with latent parameters $θ_i\sim π$, the goal is to estimate $\mathbb{E}_π[\ell(θ)|X = x]$. This task lies between classical deconvolution (recovering the full prior $π$) and standard empirical Bayes mean estimation. While the minimax risk for estimating $π$ in the Wasserstein distance is known to decay only logarithmically, we show that estimating certain smooth functions admits dramatically faster rates. In particular, for polynomial functions of degree $k$ in the Poisson model, we establish a tight bound of $Θ(\frac{1}{n}(\frac{\log n}{\log \log n})^{k+1})$ and $Θ(\frac{1}{n}(\log n)^{2k+1})$ for bounded and subexponential priors, respectively, attainable by estimators mimicking those that achieve optimal regret for the mean estimation problem (Robbins, minimum distance, ERM). Our analysis identifies the approximation-theoretic origin of this improvement: smooth functions can be well-approximated by low-degree polynomials, whereas Lipschitz functions require dense polynomial approximations, incurring a $\frac{1}{k}$ loss for degree $k$ polynomial approximation. The results reveal a sharp hierarchy in the difficulty of empirical Bayes problems, ranging from slow, logarithmic deconvolution to near-parametric convergence for smooth posterior functionals, and establish new connections between nonparametric empirical Bayes theory, polynomial approximation, and statistical inverse problems. Finally, we complement our analysis with a lower bound of $Ω(\frac 1n (\frac{\log n}{\log \log n})^{k+1})$ (bounded priors) and $Ω(\frac 1n (\log n)^{k + 1})$ (subgaussian priors) for the normal means model.

2601.18683 2026-01-27 stat.ME astro-ph.IM cs.LG

Learned harmonic mean estimation of the marginal likelihood for multimodal posteriors with flow matching

Alicja Polanska, Jason D. McEwen

Comments Submitted to 44th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering

Abstract

The marginal likelihood, or Bayesian evidence, is a crucial quantity for Bayesian model comparison but its computation can be challenging for complex models, even in parameter spaces of moderate dimension. The learned harmonic mean estimator has been shown to provide accurate and robust estimates of the marginal likelihood simply using posterior samples. It is agnostic to the sampling strategy, meaning that the samples can be obtained using any method. This enables marginal likelihood calculation and model comparison with whatever sampling is most suitable for the task. However, the internal density estimators considered previously for the learned harmonic mean can struggle with highly multimodal posteriors. In this work we introduce flow matching-based continuous normalizing flows as a powerful architecture for the internal density estimation of the learned harmonic mean. We demonstrate the ability to handle challenging multimodal posteriors, including an example in 20 parameter dimensions, showcasing the method's ability to handle complex posteriors without the need for fine-tuning or heuristic modifications to the base distribution.
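
Illustrative sketch (not taken from the paper): harmonic-mean-type estimators rest on the identity 1/z = E_posterior[φ(θ) / (L(θ) π(θ))], which holds for any normalized density φ. The toy below checks that identity on a one-dimensional conjugate Gaussian model where the evidence z is known in closed form. A hand-picked Gaussian with tails narrower than the posterior stands in for the learned internal density estimator; no flow or flow matching is used here, and the model, sample size, and seed are assumptions chosen for illustration.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def harmonic_mean_evidence(samples, likelihood, prior, target):
    """Estimate z from posterior samples via
    1/z ~= mean(target(t) / (likelihood(t) * prior(t)))."""
    recip = sum(target(t) / (likelihood(t) * prior(t)) for t in samples) / len(samples)
    return 1.0 / recip

# Toy model: prior theta ~ N(0, 1), one observation y ~ N(theta, 1).
# The posterior is then N(y/2, 1/2) and the evidence is N(y; 0, 2).
y = 1.0
post_mean, post_var = y / 2.0, 0.5

rng = random.Random(0)
samples = [rng.gauss(post_mean, math.sqrt(post_var)) for _ in range(20000)]

z_hat = harmonic_mean_evidence(
    samples,
    likelihood=lambda t: normal_pdf(y, t, 1.0),
    prior=lambda t: normal_pdf(t, 0.0, 1.0),
    # Stand-in for the learned density: same center, narrower than the posterior.
    target=lambda t: normal_pdf(t, post_mean, 0.25),
)
z_true = normal_pdf(y, 0.0, 2.0)
print(z_hat, z_true)
```

Keeping the target narrower than the posterior keeps the ratio bounded, which is what gives the estimator a finite variance. In higher dimensions or with several modes, a fixed Gaussian like this one would typically be a poor fit.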

2601.18677 2026-01-27 stat.ML cs.LG

Out-of-Distribution Radar Detection with Complex VAEs: Theory, Whitening, and ANMF Fusion

Yadang Alexis Rouzoumka, Jean Pinsolle, Eugénie Terreaux, Christèle Morisseau, Jean-Philippe Ovarlez, Chengfang Ren

Comments 13 pages, 12 figures, submitted to IEEE Transactions on Signal Processing

Abstract

We investigate the detection of weak complex-valued signals immersed in non-Gaussian, range-varying interference, with emphasis on maritime radar scenarios. The proposed methodology exploits a Complex-valued Variational AutoEncoder (CVAE) trained exclusively on clutter-plus-noise to perform Out-Of-Distribution detection. By operating directly on in-phase / quadrature samples, the CVAE preserves phase and Doppler structure and is assessed in two configurations: (i) using unprocessed range profiles and (ii) after local whitening, where per-range covariance estimates are obtained from neighboring profiles. Using extensive simulations together with real sea-clutter data from the CSIR maritime dataset, we benchmark performance against classical and adaptive detectors (MF, NMF, AMF-SCM, ANMF-SCM, ANMF-Tyler). In both configurations, the CVAE yields a higher detection probability Pd at matched false-alarm rate Pfa, with the most notable improvements observed under whitening. We further integrate the CVAE with the ANMF through a weighted log-p fusion rule at the decision level, attaining enhanced robustness in strongly non-Gaussian clutter and enabling empirically calibrated Pfa control under H0. Overall, the results demonstrate that statistical normalization combined with complex-valued generative modeling substantively improves detection in realistic sea-clutter conditions, and that the fused CVAE-ANMF scheme constitutes a competitive alternative to established model-based detectors.

2601.18658 2026-01-27 stat.ME stat.ML

Contrasting Global and Patient-Specific Regression Models via a Neural Network Representation

Max Behrens, Daiana Stolz, Eleni Papakonstantinou, Janis M. Nolde, Gabriele Bellerino, Angelika Rohde, Moritz Hess, Harald Binder

Abstract

When developing clinical prediction models, it can be challenging to balance between global models that are valid for all patients and personalized models tailored to individuals or potentially unknown subgroups. To aid such decisions, we propose a diagnostic tool for contrasting global regression models and patient-specific (local) regression models. The core utility of this tool is to identify where and for whom a global model may be inadequate. We focus on regression models and specifically suggest a localized regression approach that identifies regions in the predictor space where patients are not well represented by the global model. As localization becomes challenging when dealing with many predictors, we propose modeling in a dimension-reduced latent representation obtained from an autoencoder. Using such a neural network architecture for dimension reduction enables learning a latent representation simultaneously optimized for both good data reconstruction and for revealing local outcome-related associations suitable for robust localized regression. We illustrate the proposed approach with a clinical study involving patients with chronic obstructive pulmonary disease. Our findings indicate that the global model is adequate for most patients but that indeed specific subgroups benefit from personalized models. We also demonstrate how to map these subgroup models back to the original predictors, providing insight into why the global model falls short for these groups. Thus, the principal application and diagnostic yield of our tool is the identification and characterization of patients or subgroups whose outcome associations deviate from the global model.

2601.18656 2026-01-27 stat.AP stat.ME

A varying-coefficient model for characterizing duration-driven heterogeneity in flood-related health impacts

Sarika Aggarwal, Phillip B. Nicol, Brent A. Coull, Rachel C. Nethery

Abstract

Previous work revealed associations between flood exposure and adverse health outcomes during and in the aftermath of flood events. Floods are highly heterogeneous events, largely owing to vast differences in flood durations, i.e., flash-floods versus slow-moving floods. However, little to no work has incorporated exposure duration into the modeling of flood-related health impacts or has investigated duration-driven effect heterogeneity. To address this gap, we propose an exposure duration varying coefficient modeling (EDVCM) framework for estimating exposure day-specific health effects of consecutive-day environmental exposures that vary in duration. We develop the EDVCM within an area-level self-matched study design to eliminate time-invariant confounding followed by conditional Poisson regression modeling for exposure effect estimation and adjustment of time-varying confounders. Using a Bayesian framework, we introduce duration- and exposure day-specific exposure coefficients within the conditional Poisson model and assign them a two-dimensional Gaussian process prior to allow for sharing of information across both duration and exposure day. This approach enables highly-resolved insights into duration-driven effect heterogeneity while ensuring model stability through information sharing. Through simulations, we demonstrate that the EDVCM out-performs conventional approaches in terms of both effect estimation and uncertainty quantification. We apply the EDVCM to nationwide, multi-decade Medicare claims data linked with high-resolution flood exposure measures to investigate duration-driven heterogeneity in flood effects on musculoskeletal system disease hospitalizations.

2601.18598 2026-01-27 stat.ME stat.AP

Goodness-of-Fit Checks for Joint Models

Dimitris Rizopoulos, Jeremy M. G. Taylor, Isabella Kardys

Abstract

Joint models for longitudinal and time-to-event data are widely used in many disciplines. Nonetheless, existing model comparison criteria do not indicate whether a model adequately fits the data or which components may be misspecified. We introduce a Bayesian posterior predictive checks framework for assessing a joint model's fit to the longitudinal and survival processes and their association. The framework supports multiple settings, including existing subjects, new subjects with only covariates, dynamic prediction at intermediate follow-up times, and cross-validated assessment. For the longitudinal component, goodness-of-fit is assessed through the mean, variance, and correlation structure, while the survival component is evaluated using empirical cumulative distributions and probability integral transforms. The association between processes is examined using time-dependent concordance statistics. We apply these checks to the Bio-SHiFT heart failure study, and a simulation study demonstrates that they can identify model misspecification that standard information criteria fail to detect. The proposed methodology is implemented in the freely available R package JMbayes2.

2601.18593 2026-01-27 stat.CO cond-mat.mtrl-sci

The generalised balanced power diagram: flat sections, affine transformations and an improved rendering algorithm

Felix Ballani

Comments 10 pages, 2 figures

Abstract

The generalised balanced power diagram (GBPD) is regarded in the literature as a suitable geometric model for describing polycrystalline microstructures with curved grain boundaries. This article compiles properties of GBPDs with regard to affine transformations and flat sections. Furthermore, it extends to GBPDs an algorithm for generating digital images that is known for power diagrams and is more efficient than the usual brute-force approach.

2601.18587 2026-01-27 stat.ME stat.AP

Vaccine Efficacy Estimands Implied by Common Estimators Used in Individual Randomized Field Trials

Michael P. Fay, Dean Follmann, Bruce J. Swihart, Lauren E. Dang

Comments 28 pages, 6 figures

Abstract

We review vaccine efficacy (VE) estimands for susceptibility in individual randomized trials with natural (unmeasured) exposure, where individual responses are measured as time from vaccination until an event (e.g., disease from the infectious agent). Common VE estimands are written as $1-θ$, where $θ$ is some ratio effect measure (e.g., ratio of incidence rates, cumulative incidences, hazards, or odds) comparing outcomes under vaccination versus control. Although the ratio effects are approximately equal with low control event rates, we explore the quality of that approximation using a nonparametric formulation. Traditionally, the primary endpoint VE estimands are full immunization (or biological) estimands that represent a subset of the intent-to-treat population, excluding those that have the event before the vaccine has been able to ramp-up to its full effect, requiring care for proper causal interpretation. Besides these primary VE estimands that summarize an effect of the vaccine over the full course of the study, we also consider local VE estimands that measure the effect at particular time points. We discuss interpretational difficulties of local VE estimands (e.g., depletion of susceptibles bias), and using frailty models as sensitivity analyses for the individual-level causal effects over time.
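
Illustrative sketch (not taken from the paper): the abstract notes that VE estimands of the form 1 - θ roughly agree across ratio measures when control event rates are low. The toy below makes that concrete under an assumed constant-hazard (exponential) model in both arms, which is a simplification introduced here and not the paper's nonparametric formulation. All numeric inputs are hypothetical.

```python
import math

def ve_estimands(control_hazard, hazard_ratio, follow_up):
    """VE = 1 - theta for three ratio measures, assuming constant hazards
    in both arms over the follow-up period (an illustrative assumption)."""
    p0 = 1.0 - math.exp(-control_hazard * follow_up)                 # control cumulative incidence
    p1 = 1.0 - math.exp(-control_hazard * hazard_ratio * follow_up)  # vaccine cumulative incidence
    risk_ratio = p1 / p0
    odds_ratio = (p1 / (1.0 - p1)) / (p0 / (1.0 - p0))
    return {
        "hazard": 1.0 - hazard_ratio,
        "cumulative_incidence": 1.0 - risk_ratio,
        "odds": 1.0 - odds_ratio,
    }

def spread(estimands):
    """Largest gap between any two VE values."""
    values = list(estimands.values())
    return max(values) - min(values)

low_rate = ve_estimands(control_hazard=0.01, hazard_ratio=0.4, follow_up=1.0)
high_rate = ve_estimands(control_hazard=1.5, hazard_ratio=0.4, follow_up=1.0)

for label, result in (("low control rate", low_rate), ("high control rate", high_rate)):
    print(label, {k: round(v, 3) for k, v in result.items()}, "spread:", round(spread(result), 3))
```

With a rare outcome the three versions of VE are nearly indistinguishable; with a common outcome the odds-based value sits well above the hazard-based one, and the cumulative-incidence value sits below it.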

2601.18480 2026-01-27 stat.AP

Uncertainty Quantification in Coupled Multiphysics Systems via Gaussian Process Surrogates: Application to Fuel Assembly Bow

Ali Abboud, Josselin Garnier, Bertrand Leturcq, Stanislas de Lambert

Abstract

Predicting fuel assembly bow in pressurized water reactors requires solving tightly coupled fluid-structure interaction problems, whose direct simulations can be computationally prohibitive, making large-scale uncertainty quantification (UQ) very challenging. This work introduces a general mathematical framework for coupling Gaussian process (GP) surrogate models representing distinct physical solvers, aimed at enabling rigorous UQ in coupled multiphysics systems. A theoretical analysis establishes that the predictive variance of the coupled GP system remains bounded under mild regularity and stability assumptions, ensuring that uncertainty does not grow uncontrollably through the iterative coupling process. The methodology is then applied to the coupled hydraulic-structural simulation of fuel assembly bow, enabling global sensitivity analysis and full UQ at a fraction of the computational cost of direct code coupling. The results demonstrate accurate uncertainty propagation and stable predictions, establishing a solid mathematical basis for surrogate-based coupling in large-scale multiphysics simulations.

2601.18371 2026-01-27 math.ST stat.TH

Nonparametric inference for spot volatility in pure-jump semimartingales

Chengxin Yan, Dachuan Chen, Jia Li

Abstract

We provide a comprehensive analysis of spot volatility inference in pure-jump semimartingales under two asymptotic settings: fixed-$k$, where each local window uses a fixed number of observations, and large-$k$, where this number grows with sampling frequency. For both active- and possibly inactive-jump settings, we derive generally nonstandard, typically non-Gaussian limit distributions and establish valid inference, including when the jump-activity index is consistently estimated. Simulations show that fixed-$k$ asymptotics offer markedly better finite-sample accuracy, underscoring their practical advantage for nonparametric spot volatility inference.

2601.18276 2026-01-27 math.PR math.ST stat.TH

Functional Large Deviations for Wide Deep Neural Networks with Gaussian Initialization and Lipschitz Activations

Claudio Macci, Barbara Pacchiarotti, Katerina Papagiannouli, Giovanni Luca Torrisi, Dario Trevisan

Abstract

We establish a functional large deviation principle for fully connected multi-layer perceptrons with i.i.d. Gaussian weights (LeCun initialization) and general Lipschitz activation functions, including therefore the popular case of ReLU. The large deviation principle holds for the entire network output process on any compact input set. The proof combines exponential tightness for recursively defined processes, finite-dimensional large deviations, and the Dawson-Gärtner theorem, extending existing results beyond finite input sets and less general activations.

2601.18145 2026-01-27 stat.ML cs.LG stat.CO

Exact Minimum-Volume Confidence Set Intersection for Multinomial Outcomes

Heguang Lin, Binhao Chen, Mengze Li, Daniel Pimentel-Alarcón, Matthew L. Malloy

Comments 15 pages, 1 figure

Abstract

Computation of confidence sets is central to data science and machine learning, serving as the workhorse of A/B testing and underpinning the operation and analysis of reinforcement learning algorithms. Among all valid confidence sets for the multinomial parameter, minimum-volume confidence sets (MVCs) are optimal in that they minimize average volume, but they are defined as level sets of an exact p-value that is discontinuous and difficult to compute. Rather than attempting to characterize the geometry of MVCs directly, this paper studies a practically motivated decision problem: given two observed multinomial outcomes, can one certify whether their MVCs intersect? We present a certified, tolerance-aware algorithm for this intersection problem. The method exploits the fact that likelihood ordering induces halfspace constraints in log-odds coordinates, enabling adaptive geometric partitioning of parameter space and computable lower and upper bounds on p-values over each cell. For three categories, this yields an efficient and provably sound algorithm that either certifies intersection, certifies disjointness, or returns an indeterminate result when the decision lies within a prescribed margin. We further show how the approach extends to higher dimensions. The results demonstrate that, despite their irregular geometry, MVCs admit reliable certified decision procedures for core tasks in A/B testing.

2601.18128 2026-01-27 stat.ML cs.LG

Nonlinear multi-study factor analysis

Gemma E. Moran, Anandi Krishnan

Abstract

High-dimensional data often exhibit variation that can be captured by lower dimensional factors. For high-dimensional data from multiple studies or environments, one goal is to understand which underlying factors are common to all studies, and which factors are study or environment-specific. As a particular example, we consider platelet gene expression data from patients in different disease groups. In this data, factors correspond to clusters of genes which are co-expressed; we may expect some clusters (or biological pathways) to be active for all diseases, while some clusters are only active for a specific disease. To learn these factors, we consider a nonlinear multi-study factor model, which allows for both shared and specific factors. To fit this model, we propose a multi-study sparse variational autoencoder. The underlying model is sparse in that each observed feature (i.e. each dimension of the data) depends on a small subset of the latent factors. In the genomics example, this means each gene is active in only a few biological processes. Further, the model implicitly induces a penalty on the number of latent factors, which helps separate the shared factors from the group-specific factors. We prove that the latent factors are identified, and demonstrate our method recovers meaningful factors in the platelet gene expression data.

2601.18085 2026-01-27 cs.HC cs.AI stat.AP

"Crash Test Dummies" for AI-Enabled Clinical Assessment: Validating Virtual Patient Scenarios with Virtual Learners

Brian Gin, Ahreum Lim, Flávia Silva e Oliveira, Kuan Xing, Xiaomei Song, Gayana Amiyangoda, Thilanka Seneviratne, Alison F. Doubleday, Ananya Gangopadhyaya, Bob Kiser, Lukas Shum-Tim, Dhruva Patel, Kosala Marambe, Lauren Maggio, Ara Tekian, Yoon Soo Park

Abstract

Background: In medical and health professions education (HPE), AI is increasingly used to assess clinical competencies, including via virtual standardized patients. However, most evaluations rely on AI-human interrater reliability and lack a measurement framework for how cases, learners, and raters jointly shape scores. This leaves robustness uncertain and can expose learners to misguidance from unvalidated systems. We address this by using AI "simulated learners" to stress-test and psychometrically characterize assessment pipelines before human use. Objective: Develop an open-source AI virtual patient platform and measurement model for robust competency evaluation across cases and rating conditions. Methods: We built a platform with virtual patients, virtual learners with tunable ACGME-aligned competency profiles, and multiple independent AI raters scoring encounters with structured Key-Features items. Transcripts were analyzed with a Bayesian HRM-SDT model that treats ratings as decisions under uncertainty and separates learner ability, case performance, and rater behavior; parameters were estimated with MCMC. Results: The model recovered simulated learners' competencies, with significant correlations to the generating competencies across all ACGME domains despite a non-deterministic pipeline. It estimated case difficulty by competency and showed stable rater detection (sensitivity) and criteria (severity/leniency thresholds) across AI raters using identical models/prompts but different seeds. We also propose a staged "safety blueprint" for deploying AI tools with learners, tied to entrustment-based validation milestones. Conclusions: Combining a purpose-built virtual patient platform with a principled psychometric model enables robust, interpretable, generalizable competency estimates and supports validation of AI-assisted assessment prior to use with human learners.

2601.18072 2026-01-27 stat.ME stat.AP stat.CO

The effect of collinearity and sample size on linear regression results: a simulation study

Stephanie CC van der Lubbe, Jose M Valderas, Evangelos Kontopantelis

Comments 17 pages, 5 figures

Abstract

Background: Multicollinearity inflates the variance of OLS coefficients, widening confidence intervals and reducing inferential reliability. Yet fixed variance inflation factor (VIF) cut-offs are often applied uniformly across studies with very different sample sizes, even though collinearity is a finite-sample problem. We quantify how collinearity and sample size jointly affect linear regression performance and provide practical guidance for interpreting VIFs. Methods: We simulated data across sample sizes N=100-100,000 and collinearity levels VIF=1-50. For each scenario we generated 1,000 datasets, fitted OLS models, and assessed coverage, mean absolute error (MAE), bias, traditional power (CI excludes 0), and precision assurance (probability the 95% CI lies within a prespecified margin around the true effect). We also evaluated a biased, misspecified setting by omitting a relevant predictor to study bias amplification. Results: Under correct specification, collinearity did not materially affect nominal coverage and did not introduce systematic bias, but it reduced precision in small samples: at N=100, even mild collinearity (VIF<2) inflated MAE and markedly reduced both power metrics, whereas at N>=50,000 estimates were robust even at VIF=50. Under misspecification, collinearity strongly amplified bias, increasing errors, reducing coverage, and sharply degrading both precision assurance and traditional power even at low VIF. Conclusion: VIF thresholds should not be applied mechanically. Collinearity must be interpreted in relation to sample size and potential sources of bias; removing predictors solely to reduce VIF can worsen inference via omitted-variable bias. The accompanying heatmaps provide a practical reference across study sizes and modelling assumptions.
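
Illustrative sketch (not the paper's simulation design): the interplay between VIF and sample size follows from the textbook OLS result Var(b_j) = σ² · VIF_j / ((N - 1) · s_j²), where VIF_j = 1 / (1 - R_j²) and s_j is the sample standard deviation of predictor j. The snippet tabulates approximate 95% confidence-interval half-widths over a small grid. The residual and predictor standard deviations of 1.0 are assumptions chosen for illustration.

```python
import math

def vif(r_squared):
    """Variance inflation factor from the R^2 of regressing one
    predictor on the remaining predictors."""
    return 1.0 / (1.0 - r_squared)

def coef_standard_error(sigma, predictor_sd, n, vif_value):
    """Standard error of an OLS slope in a model with an intercept:
    sqrt(sigma^2 * VIF / ((n - 1) * s_j^2))."""
    return sigma * math.sqrt(vif_value / ((n - 1) * predictor_sd ** 2))

# Approximate 95% CI half-widths for hypothetical sigma = 1 and s_j = 1.
for n in (100, 1000, 50000):
    row = []
    for v in (1, 2, 10, 50):
        half_width = 1.96 * coef_standard_error(1.0, 1.0, n, v)
        row.append(f"VIF={v}: {half_width:.4f}")
    print(f"N={n:>6}  " + "  ".join(row))
```

Because the standard error depends on the ratio VIF / (N - 1), a VIF of 50 at N = 4951 gives the same precision as a VIF of 1 at N = 100, which is one way to see why a fixed cut-off cannot suit every study size.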

2601.18052 2026-01-27 stat.ME econ.GN q-fin.EC

BASTION: A Bayesian Framework for Trend and Seasonality Decomposition

Jason B. Cho, David S. Matteson

Abstract

We introduce BASTION (Bayesian Adaptive Seasonality and Trend DecompositION), a flexible Bayesian framework for decomposing time series into trend and multiple seasonality components. We cast the decomposition as a penalized nonparametric regression and establish formal conditions under which the trend and seasonal components are uniquely identifiable, an issue only treated informally in the existing literature. BASTION offers three key advantages over existing decomposition methods: (1) accurate estimation of trend and seasonality amidst abrupt changes, (2) enhanced robustness against outliers and time-varying volatility, and (3) robust uncertainty quantification. We evaluate BASTION against established methods, including TBATS, STR, and MSTL, using both simulated and real-world datasets. By effectively capturing complex dynamics while accounting for irregular components such as outliers and heteroskedasticity, BASTION delivers a more nuanced and interpretable decomposition. To support further research and practical applications, BASTION is available as an R package at https://github.com/Jasoncho0914/BASTION

2601.13675 2026-01-27 stat.AP econ.EM

On the Anchoring Effect of Monetary Policy on the Labor Share of Income and the Rationality of Its Setting Mechanism

Li Tuobang

Abstract

Modern macroeconomic monetary theory suggests that the labor share of income has effectively become a core macroeconomic parameter anchored by top policymakers through Open Market Operations (OMO). However, the setting of this parameter remains a subject of intense economic debate. This paper provides a detailed summary of these controversies, analyzes the scope of influence exerted by market agents other than the top policymakers on the labor share, and explores the rationality of its setting mechanism.

2601.11603 2026-01-27 physics.geo-ph physics.ins-det stat.AP

The use of spectral indices in environmental monitoring of smouldering coal-waste dumps

Anna Abramowicz, Michal Laska, Adam Nadudvari, Oimahmad Rahmonov

Comments This is the submitted version (preprint) of the article, which has been published in final form at [https://doi.org/10.1016/j.rsase.2025.101865]. This research was funded in whole by National Science Centre, Poland [Grant number: 2023/07/X/ST10/00540]

Journal ref Remote Sens. Appl.: Soc. Environ., 41 (2026)

Abstract

The study aimed to evaluate the applicability of environmental indices in the monitoring of smouldering coal-waste dumps. A dump located in the Upper Silesian Coal Basin served as the research site for a multi-method analysis combining remote sensing and field-based data. Two UAV survey campaigns were conducted, capturing RGB, infrared, and multispectral imagery. These were supplemented with direct ground measurements of subsurface temperature and detailed vegetation mapping. Additionally, publicly available satellite data from the Landsat and Sentinel missions were analysed. A range of vegetation and fire-related indices (NDVI, SAVI, EVI, BAI, among others) were calculated to identify thermally active zones and assess vegetation conditions within these degraded areas. The results revealed strong seasonal variability in vegetation indices on thermally active sites, with evidence of disrupted vegetation cycles, including winter greening in moderately heated root zones - a pattern indicative of stress and degradation processes. While satellite data proved useful in reconstructing the fire history of the dump, their spatial resolution was insufficient for detailed monitoring of small-scale thermal anomalies. The study highlights the diagnostic potential of UAV-based remote sensing in post-industrial environments undergoing land degradation but emphasises the importance of field validation for accurate environmental assessment.
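
Illustrative sketch (not taken from the paper): three of the indices named in the abstract have widely used standard definitions in terms of band reflectances, shown below. The reflectance values are made up for illustration, and the abstract does not say which sensor bands or coefficient variants the study used, so the constants here (soil factor 0.5 for SAVI, and 2.5, 6, 7.5, 1 for EVI) are the common defaults and are an assumption.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index."""
    return (nir - red) / (nir + red)

def savi(nir, red, soil_factor=0.5):
    """Soil-Adjusted Vegetation Index with the usual soil factor L = 0.5."""
    return (1.0 + soil_factor) * (nir - red) / (nir + red + soil_factor)

def evi(nir, red, blue):
    """Enhanced Vegetation Index with the commonly used coefficients."""
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)

# Hypothetical surface reflectances (unitless, between 0 and 1).
pixels = {
    "healthy vegetation": {"nir": 0.50, "red": 0.08, "blue": 0.04},
    "stressed vegetation": {"nir": 0.25, "red": 0.18, "blue": 0.10},
}

for name, bands in pixels.items():
    print(
        name,
        f"NDVI={ndvi(bands['nir'], bands['red']):.3f}",
        f"SAVI={savi(bands['nir'], bands['red']):.3f}",
        f"EVI={evi(bands['nir'], bands['red'], bands['blue']):.3f}",
    )
```

Applied per pixel and per acquisition date, indices like these give the time series in which seasonal anomalies such as winter greening would show up.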

2512.19855 2026-01-27 cs.RO stat.ML

Gaussian Variational Inference with Non-Gaussian Factors for State Estimation: A UWB Localization Case Study

Andrew Stirling, Mykola Lukashchuk, Dmitry Bagaev, Wouter Kouw, James R. Forbes

Abstract

This letter extends the exactly sparse Gaussian variational inference (ESGVI) algorithm for state estimation in two complementary directions. First, ESGVI is generalized to operate on matrix Lie groups, enabling the estimation of states with orientation components while respecting the underlying group structure. Second, factors are introduced to accommodate heavy-tailed and skewed noise distributions, as commonly encountered in ultra-wideband (UWB) localization due to non-line-of-sight (NLOS) and multipath effects. Both extensions are shown to integrate naturally within the ESGVI framework while preserving its sparse and derivative-free structure. The proposed approach is validated in a UWB localization experiment with NLOS-rich measurements, demonstrating improved accuracy and comparable consistency. Finally, a Python implementation within a factor-graph-based estimation framework is made open-source (https://github.com/decargroup/gvi_ws) to support broader research use.

2512.15765 2026-01-27 cs.LG cs.GT stat.ML

Data Valuation for LLM Fine-Tuning: Efficient Shapley Value Approximation via Language Model Arithmetic

Mélissa Tamine, Otmane Sakhi, Benjamin Heymann

Comments 11 pages, 2 figures

Abstract

Data is a critical asset for training large language models (LLMs), alongside compute resources and skilled workers. While some training data is publicly available, substantial investment is required to generate proprietary datasets, such as human preference annotations, or to curate new ones from existing sources. As larger datasets generally yield better model performance, two natural questions arise. First, how can data owners make informed decisions about curation strategies and investment in data sources? Second, how can multiple data owners collaboratively pool their resources to train superior models while fairly distributing the benefits? This problem, data valuation, which is not specific to large language models, has been addressed by the machine learning community through the lens of cooperative game theory, with the Shapley value being the prevalent solution concept. However, computing Shapley values is notoriously expensive for data valuation, typically requiring numerous model retrainings, which can become prohibitive for large machine learning models. In this work, we demonstrate that this computational challenge is dramatically simplified for LLMs trained with Direct Preference Optimization (DPO). We show how the specific mathematical structure of DPO enables scalable Shapley value computation. We believe this observation unlocks many applications at the intersection of data valuation and large language models.
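
Illustrative sketch (not taken from the paper): the Shapley value of a data source is its marginal contribution to a utility function, averaged over all orderings in which sources could be added. The brute-force version below makes the cost visible, since every utility lookup would correspond to training and evaluating a model on that coalition of sources. The three sources and their utility table are hypothetical, and the DPO-specific shortcut described in the abstract is not reproduced here.

```python
from itertools import permutations

def shapley_values(players, utility):
    """Exact Shapley values by averaging marginal contributions over
    all orderings; cost grows factorially with the number of players."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += utility(with_p) - utility(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}

# Hypothetical model scores for each subset of data sources A, B, C.
# A and B are largely redundant; C adds something different.
scores = {
    frozenset(): 0.0,
    frozenset("A"): 0.5,
    frozenset("B"): 0.5,
    frozenset("C"): 0.2,
    frozenset("AB"): 0.6,
    frozenset("AC"): 0.7,
    frozenset("BC"): 0.7,
    frozenset("ABC"): 0.8,
}

values = shapley_values(["A", "B", "C"], lambda coalition: scores[coalition])
print(values)
```

The values sum to the score of the full coalition, and the two interchangeable sources receive equal credit, which are the fairness properties that motivate using the Shapley value for data valuation.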

2512.14308 2026-01-27 stat.ML cs.LG stat.CO

Improving the Accuracy of Amortized Model Comparison with Self-Consistency

Šimon Kucharský, Aayush Mishra, Daniel Habermann, Stefan T. Radev, Paul-Christian Bürkner

Comments This submission should have rather been posted as a second version of a previous submission: arXiv:2508.20614

Details
English abstract

Amortized Bayesian inference (ABI) offers fast, scalable approximations to posterior densities by training neural surrogates on data simulated from the statistical model. However, ABI methods are highly sensitive to model misspecification: when observed data fall outside the training distribution (generative scope of the statistical models), neural surrogates can behave unpredictably. This poses a challenge in model comparison settings, where multiple statistical models are considered, of which at least some are misspecified. Recent work on self-consistency (SC) provides a promising remedy to this issue, accessible even for empirical data (without ground-truth labels). In this work, we investigate how SC can improve amortized model comparison conceptualized in four different ways. Across two synthetic and two real-world case studies, we find that approaches for model comparison that estimate marginal likelihoods through approximate parameter posteriors consistently outperform methods that directly approximate model evidence or posterior model probabilities. SC training improves robustness when the likelihood is available, even under severe model misspecification. The benefits of SC for methods without access to analytic likelihoods are more limited and inconsistent. Our results suggest practical guidance for reliable amortized Bayesian model comparison: prefer parameter posterior-based methods and augment them with SC training on empirical datasets to mitigate extrapolation bias under model misspecification.

2512.00312 2026-01-27 stat.AP

Kicking for Goal or Touch? An Expected Points Framework for Penalty Decisions in Rugby Union

Kenny Watts, Jonathan Pipping-Gamón

Comments 20 pages; 9 figures, 8 tables. Submitted to the Journal of Quantitative Analysis in Sports (JQAS). Code & replication package: https://github.com/WhartonSABI/rugby-ep (data from a public source; mirrored in the repo with attribution). Preprint licensed CC BY 4.0

Details
English abstract

Following a penalty in rugby union, teams typically choose between attempting a shot at goal or kicking to touch to pursue a try. We develop an Expected Points (EP) framework that quantifies the value of each option as a function of both field location and game context. Using phase-level data from the 2018/19 Premiership Rugby season (35,199 phases across 132 matches) and an angle-distance model of penalty kick success estimated from international records, we construct two surfaces: (i) the expected points of a possession beginning with a lineout, and (ii) the expected points of a kick at goal, taking into account the in-game consequences of made and missed kicks. We then compare these surfaces to produce decision maps that indicate where kicking for goal or kicking to touch maximizes expected return, and we analyze how the boundary shifts with game context and the expected meters gained to touch. Our results provide a unified, data-driven method for evaluating penalty decisions and can be tailored to team-specific kickers and lineout units. This study offers, to our knowledge, the first comprehensive EP-based assessment of penalty strategy in rugby union and outlines extensions to win-probability analysis and richer tracking data.

2512.00203 2026-01-27 stat.AP cs.LG eess.IV

Beyond Expected Goals: A Probabilistic Framework for Shot Occurrences in Soccer

Jonathan Pipping-Gamón, Tianshu Feng, R. Paul Sabin

Comments 18pp main + 3pp appendix; 8 figures, 12 tables. Submitted to the Journal of Quantitative Analysis in Sports (JQAS). Data proprietary to Gradient Sports; we share derived features & scripts (code under MIT/Apache-2.0). Preprint licensed CC BY 4.0

Details
English abstract

Expected goals (xG) models estimate the probability that a shot results in a goal from its context (e.g., location, pressure), but they operate only on observed shots. We propose xG+, a possession-level framework that first estimates the probability that a shot occurs within the next second and its corresponding xG if it were to occur. We also introduce ways to aggregate this joint probability estimate over the course of a possession. By jointly modeling shot-taking behavior and shot quality, xG+ remedies the conditioning-on-shots limitation of standard xG. We show that this improves predictive accuracy at the team level and produces a more persistent player skill signal than standard xG models.

2510.23170 2026-01-27 stat.ME math.ST stat.AP stat.TH

Set-valued data analysis for interlaboratory comparisons

Sébastien Petit, Sébastien Marmin, Nicolas Fischer

Details
English abstract

This article introduces tools to analyze set-valued data statistically. The tools were initially developed to analyze results from an interlaboratory comparison made by the Electromagnetic Compatibility Working Group of Eurolab France, where the goal was to select a consensual set of injection points on an electrical device. Families based on the Hamming distance from a consensus set are introduced and Fisher's noncentral hypergeometric distribution is proposed to model the number of deviations. A Bayesian approach is used and two types of techniques are proposed for the inference. Hierarchical models are also considered to quantify a possible within-laboratory effect.

2510.22345 2026-01-27 cs.LG stat.ML

Uncertainty quantification in model discovery by distilling interpretable material constitutive models from Gaussian process posteriors

David Anton, Henning Wessels, Ulrich Römer, Alexander Henkes, Jorge-Humberto Urrea-Quintero

Details
English abstract

Constitutive model discovery refers to the task of identifying an appropriate model structure, usually from a predefined model library, while simultaneously inferring its material parameters. The data used for model discovery are measured in mechanical tests and are thus inevitably affected by noise which, in turn, induces uncertainties. Previously proposed methods for uncertainty quantification in model discovery either require the selection of a prior for the material parameters, are restricted to linear coefficients of the model library or are limited in the flexibility of the inferred parameter probability distribution. We therefore propose a partially Bayesian framework for uncertainty quantification in model discovery that does not require prior selection for the material parameters and also allows for the discovery of constitutive models with inner-non-linear parameters: First, we augment the available stress-deformation data with a Gaussian process. Second, we approximate the parameter distribution by a normalizing flow, which allows for modeling complex joint distributions. Third, we distill the parameter distribution by matching the distribution of stress-deformation functions induced by the parameters with the Gaussian process posterior. Fourth, we perform a Sobol' sensitivity analysis to obtain a sparse and interpretable model. We demonstrate the capability of our framework for both isotropic and experimental anisotropic data.

2510.03437 2026-01-27 cs.LG cs.CL stat.ML

Consistent Kernel Change-Point Detection under m-Dependence for Text Segmentation

Jairo Diaz-Rodriguez, Mumin Jia

Comments This paper is withdrawn due to an error in the proof of Proposition 3, which is used to support Theorem 1

Details
English abstract

Kernel change-point detection (KCPD) has become a widely used tool for identifying structural changes in complex data. While existing theory establishes consistency under independence assumptions, real-world sequential data such as text exhibits strong dependencies. We establish new guarantees for KCPD under $m$-dependent data: specifically, we prove consistency in the number of detected change points and weak consistency in their locations under mild additional assumptions. We perform an LLM-based simulation that generates synthetic $m$-dependent text to validate the asymptotics. To complement these results, we present the first comprehensive empirical study of KCPD for text segmentation with modern embeddings. Across diverse text datasets, KCPD with text embeddings outperforms baselines in standard text segmentation metrics. We demonstrate through a case study on Taylor Swift's tweets that KCPD not only provides strong theoretical and simulated reliability but also practical effectiveness for text segmentation tasks.

2507.18737 2026-01-27 math.ST stat.TH

Robust Tail Index Estimation under Random Censoring via Minimum Density Power Divergence

Nour Elhouda Guesmia, Abdelhakim Necir, Djamel Meraghni

Details
English abstract

We propose a robust estimator for the tail index of Pareto-type distributions under random right-censoring, constructed within the minimum density power divergence (MDPD) framework and based on the Nelson-Aalen estimator of the cumulative hazard function. To our knowledge, this is the first application of the MDPD methodology to tail index estimation in the presence of random censoring. Under mild regularity conditions and within the weak censoring regime, the estimator is shown to be consistent and asymptotically normal. Its finite-sample performance is assessed through Monte Carlo simulations, revealing improved robustness-efficiency trade-offs compared to standard non-robust tail index estimators. Robustness is investigated under both pre-censoring and post-censoring contamination schemes. While pre-censoring contamination provides a meaningful framework for robustness assessment, post-censoring contamination directly alters the observable data and highlights the sensitivity of reconstruction-based approaches. The practical relevance of the method is illustrated using an insurance claims dataset with light censoring and fully observable extremes. An additional application to AIDS survival data is included for illustrative purposes, emphasizing the challenges encountered under stronger censoring.

2506.02069 2026-01-27 stat.CO

A label-switching algorithm for fast core-periphery identification

Eric Yanchenko, Srijan Sengupta

Details
English abstract

Core-periphery (CP) structure is frequently observed in networks where the nodes form two distinct groups: a small, densely interconnected core and a sparse periphery. Borgatti and Everett (2000) proposed one of the most popular methods to identify and quantify CP structure by comparing the observed network with an "ideal" CP structure. While this metric has been widely used, an improved algorithm is still needed. In this work, we detail a greedy, label-switching algorithm to identify CP structure that is both fast and accurate. By leveraging a mathematical reformulation of the CP metric, our proposed heuristic offers an order-of-magnitude improvement on the number of operations compared to a naive implementation. We prove that the algorithm monotonically ascends to a local maximum while consistently yielding solutions within 90% of the global optimum on small toy networks. On synthetic networks, our algorithm exhibits superior classification accuracies and run-times compared to a popular competing method, and on one real-world network, it is 340 times faster.
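
As a rough illustration of the comparison with an "ideal" CP structure, the sketch below scores a candidate core assignment by the Pearson correlation between the observed adjacency matrix and the ideal pattern in which an edge exists whenever at least one endpoint is in the core. This is a naive evaluation of a Borgatti-Everett style metric for an undirected network, written under that assumption; it is not the reformulated metric or the label-switching algorithm of the paper, and the function name `cp_correlation` is hypothetical.

```python
import numpy as np


def cp_correlation(A, core):
    """Correlation between adjacency matrix A and the ideal CP pattern.

    `core` is a 0/1 vector; the ideal pattern has an edge (i, j) whenever
    i or j is a core node. Only the upper triangle is used (undirected
    network, no self-loops). Returns NaN if all labels are identical.
    """
    A = np.asarray(A, dtype=float)
    c = np.asarray(core, dtype=float)
    # ideal[i, j] = 1 if c[i] = 1 or c[j] = 1, else 0.
    ideal = c[:, None] + c[None, :] - c[:, None] * c[None, :]
    iu = np.triu_indices(len(c), k=1)
    return float(np.corrcoef(A[iu], ideal[iu])[0, 1])


# Toy network: nodes 0 and 1 form the core, nodes 2-4 the periphery.
true_core = np.array([1, 1, 0, 0, 0])
A = true_core[:, None] + true_core[None, :] - np.outer(true_core, true_core)
np.fill_diagonal(A, 0)

print(cp_correlation(A, true_core))        # perfect agreement
print(cp_correlation(A, [0, 0, 0, 1, 1]))  # a worse labeling
```

A greedy label-switching search would repeatedly flip single entries of `core` and keep flips that increase this score; recomputing the full correlation after each flip is the naive cost that a reformulation would aim to avoid.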

2504.11609 2026-01-27 stat.ML cs.AI cs.LG stat.ME

Towards Interpretable Deep Generative Models via Causal Representation Learning

Gemma E. Moran, Bryon Aragam

Comments Accepted in Journal of the American Statistical Association: Special Issue on AI

Details
English abstract

Recent developments in generative artificial intelligence (AI) rely on machine learning techniques such as deep learning and generative modeling to achieve state-of-the-art performance across wide-ranging domains. These methods' surprising performance is due in part to their ability to learn implicit "representations" of complex, multi-modal data. Unfortunately, deep neural networks are notoriously black boxes that obscure these representations, making them difficult to interpret or analyze. To resolve these difficulties, one approach is to build new interpretable neural network models from the ground up. This is the goal of the emerging field of causal representation learning (CRL) that uses causality as a vector for building flexible, interpretable, and transferable generative AI. CRL can be seen as a synthesis of three intrinsically statistical ideas: (i) latent variable models such as factor analysis; (ii) causal graphical models with latent variables; and (iii) nonparametric statistics and deep learning. This paper introduces CRL from a statistical perspective, focusing on connections to classical models as well as statistical and causal identifiability results. We also highlight key application areas, implementation strategies, and open statistical questions.

2504.05695 2026-01-27 cs.LG cs.AI math.AP math.OC stat.ML

Architecture independent generalization bounds for overparametrized deep ReLU networks

Anandatheertha Bapu, Thomas Chen, Chun-Kai Kevin Chien, Patricia Muñoz Ewald, Andrew G. Moore

Comments AMS Latex, 19 pages

Details
English abstract

We prove that overparametrized neural networks are able to generalize with a test error that is independent of the level of overparametrization, and independent of the Vapnik-Chervonenkis (VC) dimension. We prove explicit bounds that only depend on the metric geometry of the test and training sets, on the regularity properties of the activation function, and on the operator norms of the weights and norms of biases. For overparametrized deep ReLU networks with a training sample size bounded by the input space dimension, we explicitly construct zero loss minimizers without use of gradient descent, and prove a uniform generalization bound that is independent of the network architecture. We perform computational experiments testing our theoretical results on MNIST, and obtain agreement with the true test error within a 22% margin on average.

2503.23832 2026-01-27 cs.LG eess.IV math.OC stat.ML

An extrapolated and provably convergent algorithm for nonlinear matrix decomposition with the ReLU function

Nicolas Gillis, Margherita Porcelli, Giovanni Seraghiti

Comments 25 pages. Codes and data available from https://github.com/giovanniseraghiti/ReLU-NMD

Details
English abstract

ReLU matrix decomposition (RMD) is the following problem: given a sparse, nonnegative matrix $X$ and a factorization rank $r$, identify a rank-$r$ matrix $Θ$ such that $X\approx \max(0,Θ)$. RMD is a particular instance of nonlinear matrix decomposition (NMD) that finds application in data compression, matrix completion with entries missing not at random, and manifold learning. The standard RMD model minimizes the least squares error, that is, $\|X - \max(0,Θ)\|_F^2$. The corresponding optimization problem, Least-Squares RMD (LS-RMD), is nondifferentiable and highly nonconvex. This motivated Saul to propose an alternative model, dubbed Latent-RMD, where a latent variable $Z$ is introduced and satisfies $\max(0,Z)=X$ while minimizing $\|Z - Θ\|_F^2$ ("A nonlinear matrix decomposition for mining the zeros of sparse data", SIAM J. Math. Data Sci., 2022). Our first contribution is to show that the two formulations may yield different low-rank solutions $Θ$. We then consider a reparametrization of the Latent-RMD, called 3B-RMD, in which $Θ$ is substituted by a low-rank product $WH$, where $W$ has $r$ columns and $H$ has $r$ rows. Our second contribution is to prove the convergence of a block coordinate descent (BCD) approach applied to 3B-RMD. Our third contribution is a novel extrapolated variant of BCD, dubbed eBCD, which we prove is also convergent under mild assumptions. We illustrate the significant acceleration effect of eBCD compared to BCD, and also show that eBCD performs well against the state of the art on synthetic and real-world data sets.
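
To make the Latent-RMD formulation concrete, the sketch below alternates two exact minimizations of $\|Z - Θ\|_F^2$: over $Z$ subject to $\max(0,Z)=X$, and over $Θ$ subject to rank at most $r$ (via truncated SVD). This is a simple baseline written from the problem statement above, not the 3B-RMD, BCD, or eBCD algorithms of the paper; the function name `latent_rmd_naive` and its parameters are assumptions.

```python
import numpy as np


def latent_rmd_naive(X, r, n_iter=200, seed=0):
    """Naive alternating scheme for Latent-RMD (illustrative sketch).

    X must be nonnegative. Returns (Theta, Z) with rank(Theta) <= r
    and max(0, Z) == X.
    """
    rng = np.random.default_rng(seed)
    Theta = rng.standard_normal(X.shape)
    zero_mask = (X == 0)
    for _ in range(n_iter):
        # Z-step: Z equals X on positive entries; on zero entries of X,
        # Z may be any nonpositive value, and the closest one to Theta
        # is min(0, Theta).
        Z = np.where(zero_mask, np.minimum(0.0, Theta), X)
        # Theta-step: best rank-r approximation of Z (Eckart-Young).
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        Theta = (U[:, :r] * s[:r]) @ Vt[:r]
    return Theta, Z


# Synthetic example: X = max(0, W H) with W, H of inner dimension 2.
rng = np.random.default_rng(1)
X = np.maximum(0.0, rng.standard_normal((30, 2)) @ rng.standard_normal((2, 30)))
Theta, Z = latent_rmd_naive(X, r=2, n_iter=50)
print(np.linalg.norm(Z - Theta))                   # Latent-RMD residual
print(np.linalg.norm(X - np.maximum(0.0, Theta)))  # LS-RMD residual
```

Because each step is an exact minimization over one block, the Latent-RMD residual cannot increase from one iteration to the next. The two printed residuals measure different objectives, which is one way to see why the two formulations need not share solutions.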

2501.10845 2026-01-27 stat.CO stat.ME

A Multi-fidelity Estimator of the Expected Information Gain for Bayesian Optimal Experimental Design

Thomas E. Coons, Xun Huan

Details
English abstract

Optimal experimental design (OED) is a framework that leverages a mathematical model of the experiment to identify optimal conditions for conducting the experiment. Under a Bayesian approach, the design objective function is typically chosen to be the expected information gain (EIG). However, EIG is intractable for nonlinear models and must be estimated numerically. Estimating the EIG generally entails some variant of Monte Carlo sampling, requiring repeated data model and likelihood evaluations (each involving solving the governing equations of the experimental physics) under different sample realizations. This computation becomes impractical for high-fidelity models. We introduce a novel multi-fidelity EIG (MF-EIG) estimator under the approximate control variate (ACV) framework. This estimator is unbiased with respect to the high-fidelity mean, and minimizes variance under a given computational budget. We achieve this by first reparameterizing the EIG so that its expectations are independent of the data models, a requirement for compatibility with ACV. We then provide specific examples under different data model forms, as well as practical enhancements of sample size optimization and sample reuse techniques. We demonstrate the MF-EIG estimator in two numerical examples: a nonlinear benchmark and a turbulent flow problem involving the calibration of shear-stress transport turbulence closure model parameters within the Reynolds-averaged Navier-Stokes model. We validate the estimator's unbiasedness and observe one- to two-orders-of-magnitude variance reduction compared to existing single-fidelity EIG estimators.
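
For context, a standard single-fidelity nested Monte Carlo estimate of the EIG can be sketched on a toy linear-Gaussian model, where the exact value is known in closed form. This is not the MF-EIG estimator of the paper; the model (prior $θ \sim N(0, σ_p^2)$, observation $y = dθ + ε$ with $ε \sim N(0, σ_n^2)$), the function names, and the choice to share one set of inner samples across all outer samples are assumptions made for illustration.

```python
import numpy as np


def log_lik(y, theta, d, sigma_n):
    """Gaussian log-likelihood log p(y | theta, d) for y = d * theta + noise."""
    return (-0.5 * ((y - d * theta) / sigma_n) ** 2
            - np.log(sigma_n * np.sqrt(2.0 * np.pi)))


def nmc_eig(d, n_outer=2000, n_inner=2000, sigma_p=1.0, sigma_n=0.5, seed=0):
    """Nested Monte Carlo estimate of the expected information gain at design d."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(0.0, sigma_p, n_outer)            # outer prior samples
    y = d * theta + rng.normal(0.0, sigma_n, n_outer)    # simulated data
    theta_in = rng.normal(0.0, sigma_p, n_inner)         # inner prior samples

    ll = log_lik(y, theta, d, sigma_n)
    # Inner loop: log evidence log p(y_i | d) via a log-mean-exp over
    # the inner samples, computed stably by subtracting the row maximum.
    inner = log_lik(y[:, None], theta_in[None, :], d, sigma_n)
    m = inner.max(axis=1, keepdims=True)
    log_evidence = m[:, 0] + np.log(np.mean(np.exp(inner - m), axis=1))
    return float(np.mean(ll - log_evidence))


d, sigma_p, sigma_n = 1.0, 1.0, 0.5
exact = 0.5 * np.log(1.0 + (d * sigma_p / sigma_n) ** 2)
print(nmc_eig(d), exact)
```

Every entry of `inner` is one likelihood evaluation, so the cost scales with `n_outer * n_inner`. When each evaluation requires solving the governing equations, this product is what makes the computation impractical for high-fidelity models.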

2501.03950 2026-01-27 stat.ME

Scalable calibration of individual-based epidemic models through categorical approximations

Lorenzo Rimella, Nick Whiteley, Chris Jewell, Paul Fearnhead, Michael Whitehouse

Details
English abstract

Traditional compartmental models capture population-level dynamics but fail to characterize individual-level risk. The computational cost of exact likelihood evaluation for partially observed individual-based models, however, grows exponentially with the population size, necessitating approximate inference. Existing sampling-based methods usually require multiple simulations of the individuals in the population and rely on bespoke proposal distributions or summary statistics. We propose a deterministic approach to approximating the likelihood using categorical distributions. The approximate likelihood is amenable to automatic differentiation so that parameters can be estimated by maximization or posterior sampling using standard software libraries such as Stan or TensorFlow with little user effort. We prove the consistency of the maximum approximate likelihood estimator. We empirically test our approach on several classes of individual-based models for epidemiology: different sets of disease states, individual-specific transition rates, spatial interactions, under-reporting and misreporting. We demonstrate ground truth recovery and comparable marginal log-likelihood values at substantially reduced cost compared to competitor methods. Finally, we show the scalability and effectiveness of our approach with a real-world application on the 2001 UK Foot-and-Mouth outbreak, where the simplicity of the CAL allows us to include 162775 farms.

2501.02963 2026-01-27 stat.AP econ.EM q-fin.TR

A data-driven merit order: Learning a fundamental electricity price model

Paul Ghelasi, Florian Ziel

Journal ref Energy Economics, 154 (2026) 109114

Details
English abstract

Electricity price forecasting approaches generally fall into two categories: data-driven models, which learn from historical patterns, or fundamental models, which simulate market mechanisms. We propose a novel and highly efficient data-driven merit order model that integrates both paradigms. The model embeds the classical expert-based merit order as a nested special case, allowing all key parameters, such as plant efficiencies, bidding behavior, and available capacities, to be estimated directly from historical data, rather than assumed. We further enhance the model with critical embedded extensions such as hydro power, cross-border flows and corrections for underreported capacities, which considerably improve forecasting accuracy. Applied to the German day-ahead market, our model outperforms both classic fundamental and state-of-the-art machine learning models. It retains the interpretability of fundamental models, offering insights into marginal technologies, fuel switches, and dispatch patterns, elements which are typically inaccessible to black-box machine learning approaches. This transparency and high computational efficiency make it a promising new direction for electricity price modeling.

2411.03384 2026-01-27 stat.ML cs.LG cs.NA math.NA math.PR

Solving stochastic partial differential equations using neural networks in the Wiener chaos expansion

Ariel Neufeld, Philipp Schmocker

Details
English abstract

In this paper, we solve stochastic partial differential equations (SPDEs) numerically by using (possibly random) neural networks in the truncated Wiener chaos expansion of their corresponding solution. Moreover, we provide some approximation rates for learning the solution of SPDEs with additive and/or multiplicative noise. Finally, we apply our results in numerical examples to approximate the solution of three SPDEs: the stochastic heat equation, the Heath-Jarrow-Morton equation, and the Zakai equation.

2410.04883 2026-01-27 cs.LG stat.ML

Improving the Weighting Strategy in KernelSHAP

Lars Henry Berge Olsen, Martin Jullum

Comments This is the accepted, post peer-reviewed version of the manuscript, accepted for publication in the proceedings after the Third World Conference on eXplainable Artificial Intelligence, XAI-2025. A link to the version of record will be included here upon publication

Details
English abstract

In Explainable AI (XAI), Shapley values are a popular model-agnostic framework for explaining predictions made by complex machine learning models. The computation of Shapley values requires estimating non-trivial contribution functions representing predictions with only a subset of the features present. As the number of these terms grows exponentially with the number of features, computational costs escalate rapidly, creating a pressing need for efficient and accurate approximation methods. For tabular data, the KernelSHAP framework is considered the state-of-the-art model-agnostic approximation framework. KernelSHAP approximates the Shapley values using a weighted sample of the contribution functions for different feature subsets. We propose a novel modification of KernelSHAP which replaces the stochastic weights with deterministic ones to reduce the variance of the resulting Shapley value approximations. This may also be combined with our simple, yet effective modification to the KernelSHAP variant implemented in the popular Python library SHAP. Additionally, we provide an overview of established methods. Numerical experiments demonstrate that our methods can reduce the required number of contribution function evaluations by $5\%$ to $50\%$ while preserving the same accuracy of the approximated Shapley values -- essentially reducing the running time by up to $50\%$. These computational advancements push the boundaries of the feature dimensionality and number of predictions that can be accurately explained with Shapley values within a feasible runtime.
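
The weighting that KernelSHAP relies on can be written down directly. The sketch below computes the standard Shapley kernel weight for a coalition of a given size, and the probability of each coalition size when coalitions are sampled in proportion to that weight. It illustrates the usual stochastic weighting only, not the deterministic weights proposed in the paper; the function names are hypothetical.

```python
from math import comb


def shapley_kernel_weight(num_features, subset_size):
    """Standard Shapley kernel weight for one coalition of the given size."""
    M, s = num_features, subset_size
    if s == 0 or s == M:
        # The empty and full coalitions carry infinite weight; in practice
        # they are enforced as constraints or given a very large weight.
        return float("inf")
    return (M - 1) / (comb(M, s) * s * (M - s))


def coalition_size_probs(num_features):
    """Probability of each size 1..M-1 when sampling proportional to the kernel."""
    M = num_features
    # Total weight of all coalitions of size s: comb(M, s) * weight.
    totals = [(M - 1) / (s * (M - s)) for s in range(1, M)]
    z = sum(totals)
    return [t / z for t in totals]


print(shapley_kernel_weight(4, 1))  # 0.25
print(shapley_kernel_weight(4, 2))  # 0.125
print(coalition_size_probs(4))
```

The size distribution is symmetric and favors very small and very large coalitions, which are the most informative about individual feature contributions.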

2407.02357 2026-01-27 math.ST math.AG stat.ML stat.TH

Contrastive independent component analysis

Kexin Wang, Aida Maraj, Anna Seigal

Comments 54 pages, 20 figures

Details
English abstract

In recent years, there has been growing interest in jointly analyzing a foreground dataset, representing an experimental group, and a background dataset, representing a control group. The goal of such contrastive investigations is to identify salient features in the experimental group relative to the control. Independent component analysis (ICA) is a powerful tool for learning independent patterns in a dataset. We generalize it to contrastive ICA (cICA). For this purpose, we devise a new linear algebra based tensor decomposition algorithm, which is more expressive but just as efficient and identifiable as other linear algebra based algorithms. We establish the identifiability of cICA and demonstrate its performance in finding patterns and visualizing data, using synthetic, semi-synthetic, and real-world datasets, comparing the approach to existing methods.

2404.14895 2026-01-27 stat.AP

Sequential Federated Analysis of Early Outbreak Data Applied to Incubation Period Estimation

Simon Busch-Moreno, Moritz U. G. Kraemer

Comments Epidemics (2026)

Details
English abstract

Analysis of early outbreak data is critical for assessing an outbreak's potential impact and informing interventions. However, data obtained early in outbreaks are often sensitive and subject to strict privacy restrictions. Thus, federated analysis, a decentralised collaborative form of analysis in which no raw data sharing is required, has emerged as an attractive paradigm to solve issues around data privacy and confidentiality. In the present study, we propose two approaches which require neither data sharing nor direct communication between devices/servers. The first approach approximates the joint posterior distributions via a multivariate normal distribution and uses this information to update prior distributions sequentially. The second approach uses summaries from parameters' posteriors obtained locally at different locations (sites) to perform a meta-analysis via a hierarchical model. We test these models on simulated and on real outbreak data to estimate the incubation period of multiple infectious diseases. Results indicate that both approaches can recover incubation period parameters accurately, but they present different inferential advantages. While the approximation approach permits working with full posterior distributions, thus providing a better quantification of uncertainty, the meta-analysis approach allows for an explicit hierarchical structure, which can make some parameters more interpretable. We provide a framework for federated analysis of early outbreak data where the public health contexts are complex.
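
The idea behind the first approach, passing a normal summary of one site's posterior on as the next site's prior, can be illustrated in the simplest conjugate setting. The sketch below assumes a one-dimensional normal mean with known observation variance, where the posterior is exactly normal and sequential updating reproduces the pooled analysis. The paper's models are multivariate and rely on an approximation; the function name and all numbers here are hypothetical.

```python
import numpy as np


def update_normal_prior(prior_mean, prior_var, data, noise_var):
    """Conjugate update for a normal mean with known observation variance."""
    n = len(data)
    post_precision = 1.0 / prior_var + n / noise_var
    post_mean = (prior_mean / prior_var + np.sum(data) / noise_var) / post_precision
    return post_mean, 1.0 / post_precision


# Three sites with private data; only (mean, variance) summaries move on.
rng = np.random.default_rng(0)
sites = [rng.normal(5.0, 2.0, size=n) for n in (8, 15, 30)]

mean, var = 0.0, 100.0  # initial weakly informative prior
for site_data in sites:
    mean, var = update_normal_prior(mean, var, site_data, noise_var=4.0)

# Reference: the same prior updated once with all data pooled.
pooled_mean, pooled_var = update_normal_prior(
    0.0, 100.0, np.concatenate(sites), noise_var=4.0
)
print(mean, var)
print(pooled_mean, pooled_var)
```

In this toy case the two results agree up to floating-point error. With non-conjugate models the normal summary is only an approximation, so the sequential result may depend on how well each local posterior is captured by it.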

2403.14531 2026-01-27 stat.ME

Green's matching: an efficient approach to parameter estimation in complex dynamic systems

Jianbin Tan, Guoyu Zhang, Xueqin Wang, Hui Huang, Fang Yao

Journal ref Journal of the Royal Statistical Society: Series B, 2024

Details
English abstract

Parameters of differential equations are essential to characterize intrinsic behaviors of dynamic systems. Numerous methods for estimating parameters in dynamic systems are computationally and/or statistically inadequate, especially for complex systems with general-order differential operators, such as motion dynamics. This article presents Green's matching, a computationally tractable and statistically efficient two-step method, which only needs to approximate trajectories in dynamic systems but not their derivatives due to the inverse of differential operators by Green's function. This yields a statistically optimal guarantee for parameter estimation in general-order equations, a feature not shared by existing methods, and provides an efficient framework for broad statistical inferences in complex dynamic systems.

2401.06990 2026-01-27 stat.ME

Graphical Principal Component Analysis of Multivariate Functional Time Series

Jianbin Tan, Decai Liang, Yongtao Guan, Hui Huang

Comments Journal of the American Statistical Association (2024)

Journal ref Journal of the American Statistical Association (2024): 1-24

Details
English abstract

In this paper, we consider multivariate functional time series with a two-way dependence structure: a serial dependence across time points and a graphical interaction among the multiple functions within each time point. We develop the notion of dynamic weak separability, a more general condition than those assumed in literature, and use it to characterize the two-way structure in multivariate functional time series. Based on the proposed weak separability, we develop a unified framework for functional graphical models and dynamic principal component analysis, and further extend it to optimally reconstruct signals from contaminated functional data using graphical-level information. We investigate asymptotic properties of the resulting estimators and illustrate the effectiveness of our proposed approach through extensive simulations. We apply our method to hourly air pollution data that were collected from a monitoring network in China.

2312.05153 2026-01-27 stat.ML cs.LG

Uncertainty Quantification and Propagation in Surrogate-based Bayesian Inference

Philipp Reiser, Javier Enrique Aguilar, Anneli Guthke, Paul-Christian Bürkner

Journal ref Statistics and Computing 35(3):66 (2025)

Details
English abstract

Surrogate models are statistical or conceptual approximations for more complex simulation models. In this context, it is crucial to propagate the uncertainty induced by limited simulation budget and surrogate approximation error to predictions, inference, and subsequent decision-relevant quantities. However, quantifying and then propagating the uncertainty of surrogates is usually limited to special analytic cases or is otherwise computationally very expensive. In this paper, we propose a framework enabling a scalable, Bayesian approach to surrogate modeling with thorough uncertainty quantification, propagation, and validation. Specifically, we present three methods for Bayesian inference with surrogate models given measurement data. This is a task where the propagation of surrogate uncertainty is especially relevant, because failing to account for it may lead to biased and/or overconfident estimates of the parameters of interest. We showcase our approach in three detailed case studies for linear and nonlinear real-world modeling scenarios. Uncertainty propagation in surrogate models enables more reliable and safe approximation of expensive simulators and will therefore be useful in various fields of applications.

2302.01822 2026-01-27 stat.ME

Lord's 'paradox' explained: the 50-year warning on the use of 'change scores' in observational data

Peter W. G. Tennant, Georgia D. Tomova, Eleanor J. Murray, Kellyn F. Arnold, Matthew P. Fox, Mark S. Gilthorpe

Details
English abstract

In 1967, Frederick Lord posed a conundrum that has confused scientists for over half a century. Subsequently named Lord's 'paradox', the puzzle centres on the observation that two different approaches to estimating the effect of an exposure on the 'change' in an outcome can produce radically different results. Approach 1 involves comparing the mean 'change score' between exposure groups and Approach 2 involves comparing the follow-up outcome between exposure groups conditional on the baseline outcome. Resolving this puzzle starts with recognising the three reasons that a variable may change value: (A) 'endogenous change', which represents autocorrelation from baseline, (B) 'random change', which represents change from transient random processes, and (C) 'exogenous change', which represents all non-endogenous, non-random change and contains all change that is potentially modifiable by other baseline variables. In observational data, neither Approach 1 nor Approach 2 can reliably estimate the causal effect of an exposure on 'exogenous change' in an outcome. Approach 1 is susceptible to diluted or opposite-sign estimates whenever the exposure causes, or is caused by, the baseline outcome. Approach 2 is susceptible to inflated estimates due to measurement error in the baseline outcome and time-varying confounding bias when the baseline outcome is a mediator. The measurement error can be reduced with multiple measures of the baseline outcome, and the time-varying confounding can be reduced using g-methods. Lord's 'paradox' offers several enduring lessons for observational data science including the importance of a well-defined research question and the problems with analysing change scores in observational data.

2210.00884 2026-01-27 stat.CO stat.ML

Generating Synthetic Data with Locally Estimated Distributions for Disclosure Control

Ali Furkan Kalay

Abstract

Sensitive datasets are often underutilized in research and industry due to privacy concerns, limiting the potential of valuable data-driven insights. Synthetic data generation presents a promising solution to address this challenge by balancing privacy protection with data utility. This paper introduces a new approach to mitigate privacy risks associated with outlier observations in synthetic datasets: the Local Resampler (LR). The LR leverages the $k$-nearest neighbors algorithm to generate synthetic data while minimizing disclosure risks by underrepresenting outliers, even when they are not detectable in marginal distributions. Theoretical and empirical analyses demonstrate that the LR effectively mitigates outlier-driven disclosure risks, and accurately replicates multimodal, skewed, and non-convex support distributions. The semiparametric nature of the LR ensures a low computational burden and works efficiently even with small samples. By parameterizing the balance between privacy risks and data utility, this approach promotes broader access to sensitive datasets for research.

2104.14204 2026-01-27 q-fin.ST q-fin.MF q-fin.PM q-fin.TR stat.AP

Optimal bidding in hourly and quarter-hourly electricity price auctions: trading large volumes of power with market impact and transaction costs

Michał Narajewski, Florian Ziel

Journal ref Energy Economics, 110 (2022) 105974

Abstract

This paper addresses the question of how much to bid to maximize the profit when trading in two electricity markets: the hourly Day-Ahead Auction and the quarter-hourly Intraday Auction. To bid optimally across both auctions, we examine many price scenarios, estimate the trader's own non-linear market impact from empirical supply and demand curves, and evaluate a number of trading strategies. Additionally, we provide theoretical results for risk-neutral agents. The application study is conducted using German market data, but the presented methods can easily be applied to any other pair of consecutive auctions. This paper contributes to the existing literature by evaluating the costs of electricity trading, i.e. the price impact and the transaction costs. The empirical results for the German EPEX market show that it is far more profitable to minimize the price impact than to maximize the arbitrage.

2005.05837 2026-01-27 cs.LG stat.ML

Energy-Aware DNN Graph Optimization

Yu Wang, Rong Ge, Shuang Qiu

Comments Accepted paper at Resource-Constrained Machine Learning (ReCoML) Workshop of MLSys 2020 Conference, Austin, TX, USA, 2020

Abstract

Unlike existing work on deep neural network (DNN) graph optimization for inference performance, we explore DNN graph optimization for energy awareness and savings on power- and resource-constrained machine learning devices. We present a method that allows users to optimize energy consumption or to balance energy against inference performance for DNN graphs. This method efficiently searches through the space of equivalent graphs, and identifies a graph and the corresponding algorithms that incur the least cost in execution. We implement the method and evaluate it with multiple DNN models on a GPU-based machine. Results show that our method achieves significant energy savings of 24% with negligible performance impact.

1705.07577 2026-01-27 math.ST stat.TH

Semiparametric Efficient Empirical Higher Order Influence Function Estimators

Lin Liu, Rajarshi Mukherjee, Whitney K. Newey, James M. Robins

Comments 50 pages

Abstract

Robins et al. (2008, 2017) applied the theory of higher order influence functions (HOIFs) to derive an estimator of the mean $ψ$ of an outcome Y in a missing data model with Y missing at random conditional on a vector X of continuous covariates; their estimator, in contrast to previous estimators, is semiparametric efficient under the minimal conditions of Robins et al. (2009b), together with an additional (non-minimal) smoothness condition on the density g of X, because the Robins et al. (2008, 2017) estimator depends on a nonparametric estimate of g. In this paper, we introduce a new HOIF estimator that has the same asymptotic properties as the original one, but does not impose any smoothness requirement on g. This is important for two reasons. First, one rarely knows the properties of g. Second, even when g is smooth, if the dimension of X is even moderate, accurate nonparametric estimation of its density is not feasible at the sample sizes often encountered in applications. In fact, to the best of our knowledge, this new HOIF estimator remains the only semiparametric efficient estimator of $ψ$ under minimal conditions, despite the rapidly growing literature on causal effect estimation. We also show that our estimator can be generalized to the entire class of functionals considered by Robins et al. (2008), which includes the average effect of a treatment on a response Y when a vector X suffices to control confounding and the expected conditional variance of a response Y given a vector X. Simulation experiments are also conducted, which demonstrate that our new estimator outperforms those of Robins et al. (2008, 2017) in finite samples when g is not very smooth.

2601.17990 2026-01-27 stat.ML cs.LG cs.SY eess.SY

A Cherry-Picking Approach to Large Load Shaping for More Effective Carbon Reduction

Bokan Chen, Raiden Hasegawa, Adriaan Hilbers, Ross Koningstein, Ana Radovanović, Utkarsh Shah, Gabriela Volpato, Mohamed Ahmed, Tim Cary, Rod Frowd

Abstract

Shaping multi-megawatt loads, such as data centers, impacts generator dispatch on the electric grid, which in turn affects system CO2 emissions and energy cost. Substantiating the effectiveness of prevalent load shaping strategies, such as those based on grid-level average carbon intensity, locational marginal price, or marginal emissions, is challenging due to the lack of detailed counterfactual data required for accurate attribution. This study uses a series of calibrated granular ERCOT day-ahead direct current optimal power flow (DC-OPF) simulations for counterfactual analysis of a broad set of load shaping strategies on grid CO2 emissions and cost of electricity. In terms of annual grid level CO2 emissions reductions, LMP-based shaping outperforms other common strategies, but can be significantly improved upon. Examining the performance of practicable strategies under different grid conditions motivates a more effective load shaping approach: one that "cherry-picks" a daily strategy based on observable grid signals and historical data. The cherry-picking approach to power load shaping is applicable to any large flexible consumer on the electricity grid, such as data centers, distributed energy resources and Virtual Power Plants (VPPs).

2601.17985 2026-01-27 stat.AP

Bayesian Multiple Testing for Suicide Risk in Pharmacoepidemiology: Leveraging Co-Prescription Patterns

Soumya Sahu, Kwan Hur, Dulal K. Bhaumik, Robert Gibbons

Abstract

Suicide is the tenth leading cause of death in the United States, yet evidence on medication-related risk or protection remains limited. Most post-marketing studies examine one drug class at a time or rely on empirical-Bayes shrinkage with conservative multiplicity corrections, sacrificing power to detect clinically meaningful signals. We introduce a unified Bayesian spike-and-slab framework that advances both applied suicide research and statistical methodology. Substantively, we screen 922 prescription drugs across 150 million patients in U.S. commercial claims (2003 to 2014), leveraging real-world co-prescription patterns to inform a covariance prior that adaptively borrows strength across pharmacologically related agents. Statistically, the model couples this structured prior with Bayesian false-discovery-rate control, illustrating how network-guided variable selection can improve rare-event surveillance in high dimensions. Relative to the seminal empirical-Bayes analysis of Gibbons et al. (2019), our approach reconfirms the key harmful (e.g., alprazolam, hydrocodone) and protective (e.g., mirtazapine, folic acid) signals while revealing additional associations, such as a high-risk opioid combination and several folate-linked agents with potential preventive benefit that had been overlooked. A focused re-analysis of 18 antidepressants shows how alternative co-prescription metrics modulate effect estimates, shedding light on competitive versus complementary prescribing. These findings generate actionable hypotheses for clinicians and regulators and showcase the value of structured Bayesian modeling in pharmacovigilance.

2601.17961 2026-01-27 stat.ME

Transportability of Regression Calibration with External Validation Studies for Measurement Error Correction

Zexiang Li, Donna Spiegelman, Molin Wang, Zuoheng Wang, Xin Zhou

Abstract

In nutritional and environmental epidemiology, exposures are impractical to measure accurately, while practical measures for these exposures are often subject to substantial measurement error. Regression calibration is among the most used measurement error correction methods with external validation studies. The use of external studies to assess the measurement error process always carries the risk of introducing estimation bias into the main study analysis. Although the transportability of regression calibration is usually assumed for practical epidemiology studies, it has not been well studied. In this work, under the measurement error process with a mixture of Berkson-like and classical-like errors, we investigate conditions under which the effect estimate from regression calibration with an external validation study is unbiased for the association between exposure and health outcome. We further examine departures from the transportability assumption, under which the regression calibration estimator is itself biased. However, we theoretically prove that, in most cases, it yields lower bias than the naive method. The derived conditions are confirmed through simulation studies and further verified in an example investigating the association between the risk of cardiovascular disease and moderate physical activity in the health professional follow-up study.

2601.17905 2026-01-27 cs.CV cs.AI stat.ML

Feature-Space Generative Models for One-Shot Class-Incremental Learning

Jack Foster, Kirill Paramonov, Mete Ozay, Umberto Michieli

Abstract

Few-shot class-incremental learning (FSCIL) is a paradigm where a model, initially trained on a dataset of base classes, must adapt to an expanding problem space by recognizing novel classes with limited data. We focus on the challenging FSCIL setup where a model receives only a single sample (1-shot) for each novel class and no further training or model alterations are allowed after the base training phase. This makes generalization to novel classes particularly difficult. We propose a novel approach predicated on the hypothesis that base and novel class embeddings have structural similarity. We map the original embedding space into a residual space by subtracting the class prototype (i.e., the average class embedding) of input samples. Then, we leverage generative modeling with VAE or diffusion models to learn the multi-modal distribution of residuals over the base classes, and we use this as a valuable structural prior to improve recognition of novel classes. Our approach, Gen1S, consistently improves novel class recognition over the state of the art across multiple benchmarks and backbone architectures.

2601.17860 2026-01-27 math.ST econ.EM stat.TH

The Hellinger Bounds on the Kullback-Leibler Divergence and the Bernstein Norm

Tetsuya Kaji

Comments 25 pages

Abstract

The Kullback-Leibler divergence, the Kullback-Leibler variation, and the Bernstein "norm" are used to quantify discrepancies among probability distributions in likelihood models such as nonparametric maximum likelihood and nonparametric Bayes. They are closely related to the Hellinger distance, which is often easier to work with. Consequently, it is of interest to characterize conditions under which the Hellinger distance serves as an upper bound for these measures. This article characterizes a necessary and sufficient condition for each of the discrepancy measures to be bounded by the Hellinger distance. It accommodates unbounded likelihood ratios and generalizes all previously known results. We then apply it to relax the regularity condition for the sieve maximum likelihood estimator.

2601.17843 2026-01-27 econ.EM stat.ME

Best Feasible Conditional Critical Values for a More Powerful Subvector Anderson-Rubin Test

Jesse Hoekstra, Frank Windmeijer

Abstract

For subvector inference in the linear instrumental variables model under homoskedasticity but allowing for weak instruments, Guggenberger, Kleibergen, and Mavroeidis (2019) (GKM) propose a conditional subvector Anderson and Rubin (1949) (AR) test that uses data-dependent critical values that adapt to the strength of the parameters not under test. This test has correct size and strictly higher power than the test that uses standard asymptotic chi-square critical values. The subvector AR test statistic is the minimum eigenvalue of a data-dependent matrix. The GKM critical value function conditions on the largest eigenvalue of this matrix. We consider instead the data-dependent critical value function conditioning on the second-smallest eigenvalue, as this eigenvalue is the appropriate indicator for weak identification. We find that the data-dependent critical value function of GKM also applies to this conditioning and show that this test has correct size and power strictly higher than the GKM test when the number of parameters not under test is larger than one. Our proposed procedure further applies to the subvector AR test statistic that is robust to an approximate Kronecker product structure of conditional heteroskedasticity as proposed by Guggenberger, Kleibergen, and Mavroeidis (2024), carrying over its power advantage to this setting as well.

2601.17805 2026-01-27 math.ST cs.NA math.NA stat.TH

On the contraction rate of the posterior distribution for nonlinear PDE parameter identification

Yuxin Fan, Bangti Jin

Comments 26 pages

Abstract

In this work, we investigate the estimation of a parameter $f$ in PDEs using Bayesian procedures, focusing on posterior distributions constructed using Gaussian process priors and on their variational approximations. We establish contraction rates for the posterior distribution and the variational approximation in the regime of low-regularity parameters. The main novelty of the study lies in relaxing the condition that the ground truth parameter must lie in the reproducing kernel Hilbert space of the Gaussian process prior, which is commonly imposed in existing studies on posterior contraction rate analysis [14,40,44]. The analysis relies on a delicate approximation argument that suitably balances various error sources. We illustrate the general theory on three nonlinear inverse problems for PDEs.

2601.17779 2026-01-27 stat.ME

Sensitivity analysis for incremental effects, with application to a study of victimization & offending

Shuying Shen, Valerio Bacak, Edward H. Kennedy

Abstract

Sensitivity analysis for unmeasured confounding under incremental propensity score interventions remains relatively underdeveloped. Incremental interventions define stochastic treatment regimes by multiplying the odds of treatment, offering a flexible framework for causal effect estimation. To study incremental effects when there are unobserved confounders, we adopt Rosenbaum's sensitivity model in single time point settings, and propose a doubly robust estimator for the resulting effect bounds. The bound estimators are asymptotically normal under mild conditions on nuisance function estimation. We show that incremental effect bounds can be narrower or wider than those for mean potential outcomes, and that the bounds must lie between the expected minimum and maximum of the conditional bounds on E(Y^0|X) and E(Y^1|X). For time-varying treatments, we consider the marginal sensitivity model. Although sharp bounds for incremental effects are identifiable from longitudinal data under this model, practical estimators have not yet been established; we discuss this challenge and provide partial results toward implementation. Finally, we apply our methods to study the effect of victimization on subsequent offending using data from the National Longitudinal Study of Adolescent to Adult Health (Add Health), illustrating the robustness of our findings in an empirical setting.

2601.17712 2026-01-27 econ.EM stat.ME

The Proximal Surrogate Index: Long-Term Treatment Effects under Unobserved Confounding

Ting-Chih Hung, Yu-Chang Chen

Abstract

We study the identification and estimation of long-term treatment effects under unobserved confounding by combining an experimental sample, where the long-term outcome is missing, with an observational sample, where the treatment assignment is unobserved. While standard surrogate index methods fail when unobserved confounders exist, we establish novel identification results by leveraging proxy variables for the unobserved confounders. We further develop multiply robust estimation and inference procedures based on these results. Applying our method to the Job Corps program, we demonstrate its ability to recover experimental benchmarks even when unobserved confounders bias standard surrogate index estimates.

2601.17695 2026-01-27 stat.ME

Bidirectional causal inference for binary outcomes in the presence of unmeasured confounding

Yafang Deng, Kang Shuai, Shanshan Luo

Comments 21 pages, 8 figures

Abstract

Bidirectional causal relationships arising from mutual interactions between variables are commonly observed within biomedical, econometric, and social science contexts. When such relationships are further complicated by unobserved factors, identifying causal effects in both directions becomes especially challenging. For continuous variables, methods that utilize two instrumental variables from both directions have been proposed to explore bidirectional causal effects in linear models. However, the existing techniques are not applicable when the key variables of interest are binary. To address this issue, we propose a structural equation modeling approach that links observed binary variables to continuous latent variables through a constrained mapping. We further establish identification results for bidirectional causal effects using a pair of instrumental variables. Additionally, we develop an estimation method for the corresponding causal parameters. We also conduct sensitivity analysis under scenarios where certain identification conditions are violated. Finally, we apply our approach to investigate the bidirectional causal relationship between heart disease and diabetes, demonstrating its practical utility in biomedical research.

2601.17646 2026-01-27 cs.LG math.FA math.OC math.ST stat.TH

A Mosco sufficient condition for intrinsic stability of non-unique convex Empirical Risk Minimization

Karim Bounja, Lahcen Laayouni, Abdeljalil Sakat

Abstract

Empirical risk minimization (ERM) stability is usually studied via single-valued outputs, while convex non-strict losses yield set-valued minimizers. We identify Painlevé-Kuratowski upper semicontinuity (PK-u.s.c.) as the intrinsic stability notion for the ERM solution correspondence (set-level Hadamard well-posedness) and a prerequisite to interpret stability of selections. We then characterize a minimal non-degenerate qualitative regime: Mosco-consistent perturbations and locally bounded minimizers imply PK-u.s.c., minimal-value continuity, and consistency of vanishing-gap near-minimizers. Quadratic growth yields explicit quantitative deviation bounds.

2601.17618 2026-01-27 stat.ME

Two-stage Estimation of Latent Variable Regression Models: A General, Root-N Consistent Solution

Yang Liu, Xiaohui Luo, Jieyuan Dong, Youjin Sung, Yueqin Hu, Hongyun Liu, Daniel J. Bauer

Abstract

Latent variable (LV) models are widely used in psychological research to investigate relationships among unobservable constructs. When one-stage estimation of the overall LV model is challenging, two-stage factor score regression (FSR) serves as a convenient alternative: the measurement model is fitted to obtain factor scores in the first stage, which are then used to fit the structural model in the subsequent stage. However, naive application of FSR is known to yield biased estimates of structural parameters. In this paper, we develop a generic bias-correction framework for two-stage estimation of parametric statistical models and tailor it specifically to FSR. Unlike existing bias-corrected FSR solutions, the proposed method applies to a broader class of LV models and does not require computing specific types of factor scores. We establish the root-n consistency of the proposed bias-corrected two-stage estimator under mild regularity conditions. To ensure broad applicability and minimize reliance on complex analytical derivations, we introduce a stochastic approximation algorithm for point estimation and a Monte Carlo-based procedure for variance estimation. In a sequence of Monte Carlo experiments, we demonstrate that the bias-corrected FSR estimator performs comparably to the "gold standard" one-stage maximum likelihood estimator. These results suggest that our approach offers a straightforward yet effective alternative for estimating LV models.

2601.17605 2026-01-27 math.ST stat.ME stat.TH

Event history analysis with time-dependent covariates via landmarking supermodels and boosted trees

Oliver Lunding Sandqvist

Abstract

We propose a nonparametric method for dynamic prediction in event history analysis with high-dimensional, time-dependent covariates. The approach estimates future conditional hazards by combining landmarking supermodels with gradient boosted trees. Unlike joint modeling or Cox landmarking models, the proposed estimator flexibly captures interactions and nonlinear effects without imposing restrictive parametric assumptions or requiring the covariate process to be Markovian. We formulate the approach as a sieve M-estimator and establish weak consistency. Computationally, the problem reduces to a Poisson regression, allowing implementation via standard gradient boosting software. A key theoretical advantage is that the method avoids the temporal inconsistencies that arise in landmark Cox models. Simulation studies demonstrate that the method performs well in a variety of settings, and its practical value is illustrated through an analysis of primary biliary cirrhosis data.

2601.17591 2026-01-27 math.PR math.ST stat.TH

Exact Recovery in the Geometric Hidden Community Model

Julia Gaudio, Andrew Jin

Abstract

Hidden community problems, such as community detection in the Stochastic Block Model (SBM), submatrix localization, and $\mathbb{Z}_2$ synchronization, have received considerable attention in the probability, statistics, and information-theory literature. Motivated by transitive behavior in social networks, which tend to exhibit high triangle density, recent works have considered hidden community models in spatially-embedded networks. In particular, Baccelli and Sankararaman proposed the Geometric SBM, a spatially-embedded analogue of the standard SBM with dramatically more triangles. In this paper, we consider the problem of exact recovery for the Geometric Hidden Community Model (GHCM) of Gaudio, Guan, Niu, and Wei, which generalizes the Geometric SBM to allow for arbitrary pairwise observation distributions. Under mild technical assumptions, we find the information-theoretic threshold for exact recovery in the "distance-dependent" GHCM, which allows the pairwise distributions to depend on distance as well as community labels, thus completing the picture of exact recovery in spatially-embedded hidden community models.

2601.17578 2026-01-27 cs.DC stat.CO

A Unified Approach to Concurrent, Parallel Map-Reduce in R using Futures

Henrik Bengtsson

Comments 16 pages including 2.5 pages references, 1 figure

Abstract

The R ecosystem offers a rich variety of map-reduce application programming interfaces (APIs) for iterative computations, yet parallelizing code across these diverse frameworks requires learning multiple, often incompatible, parallel APIs. The futurize package addresses this challenge by providing a single function, futurize(), which transpiles sequential map-reduce expressions into their parallel equivalents in the future ecosystem; that ecosystem then performs all the heavy lifting. By leveraging R's native pipe operator, users can parallelize existing code with minimal refactoring, often by simply appending `|> futurize()` to an expression. The package supports classical map-reduce functions from base R, purrr, crossmap, foreach, plyr, and BiocParallel (e.g., lapply(xs, fcn) |> futurize() and map(xs, fcn) |> futurize()), as well as a growing set of domain-specific packages, e.g., boot, caret, glmnet, lme4, mgcv, and tm. By abstracting away the underlying parallel machinery and unifying the handling of future options, the package enables developers to declare what to parallelize via futurize(), and end-users to choose how via plan(). This article describes the philosophy, design, and implementation of futurize, demonstrates its usage across various map-reduce paradigms, and discusses its role in simplifying parallel computing in R.

2601.17565 2026-01-27 math.ST stat.TH

Directional footrule-coefficients

Enrique de Amo, David García-Fernández, Manuel Úbeda-Flores

Abstract

Rank-based dependence measures such as Spearman's footrule are robust and invariant, but they often fail to capture directional or asymmetric dependence in multivariate settings. This paper introduces a new family of directional Spearman's footrule coefficients for multivariate data, defined within the copula framework to clearly separate marginal behavior from dependence structure. We establish their main theoretical properties, showing full consistency with the classical footrule, including behavior under independence and extreme dependence, as well as symmetry and reflection properties. Nonparametric rank-based estimators are proposed and their asymptotic consistency is discussed. Explicit expressions for several known families of copulas illustrate the ability of the proposed coefficients to detect directional dependence patterns undetected by classical measures.

2601.17518 2026-01-27 math.ST stat.AP stat.ME stat.TH

Comparisons of policies based on relevation and replacement by a new one unit in reliability

Belzunce, F., Martínez-Riquelme, C., Mercader, J. A., Ruiz, J. M.

Comments 17 pages. Author accepted manuscript

Journal ref TEST (2021), 30, 211-227

Abstract

The purpose of this paper is to study the role of the relevation transform, where a failed unit is replaced by a used unit with the same age as the failed one, as an alternative to the policy based on the replacement by a new one. In particular, we compare the stochastic processes arising from a policy based on the replacement of a failed unit by a new one and from the one in which the unit is being continuously subjected to a relevation policy. The comparisons depend on the aging properties of the units under repair.

2601.17511 2026-01-27 math.ST stat.TH

A new stochastic dominance criterion for dependent random variables with applications

F. Belzunce, C. Martínez-Riquelme

Comments 12 pages. Author accepted manuscript

Journal ref Insurance: Mathematics and Economics (2023), 108, 165-176

Abstract

In this paper we develop a new tool for the comparison of paired data based on a new criterion of stochastic dominance that takes into account the dependence structure of the random variables under comparison. This new procedure provides a more detailed comparison of dependent random variables and overcomes some difficulties of standard techniques like Student's t and Wilcoxon-Mann-Whitney tests for non-normal data. This tool provides an alternative to the usual stochastic dominance criterion, which only considers the marginal distributions in the comparison. We show how this new tool can be fruitfully used for the comparison of paired asset returns.

2601.17510 2026-01-27 stat.ML cs.AI cs.LG

"Rebuilding" Statistics in the Age of AI: A Town Hall Discussion on Culture, Infrastructure, and Training

David L. Donoho, Jian Kang, Xihong Lin, Bhramar Mukherjee, Dan Nettleton, Rebecca Nugent, Abel Rodriguez, Eric P. Xing, Tian Zheng, Hongtu Zhu

Comments 35 pages, 3 figures

Abstract

This article presents the full, original record of the 2024 Joint Statistical Meetings (JSM) town hall, "Statistics in the Age of AI," which convened leading statisticians to discuss how the field is evolving in response to advances in artificial intelligence, foundation models, large-scale empirical modeling, and data-intensive infrastructures. The town hall was structured around open panel discussion and extensive audience Q&A, with the aim of eliciting candid, experience-driven perspectives rather than formal presentations or prepared statements. This document preserves the extended exchanges among panelists and audience members, with minimal editorial intervention, and organizes the conversation around five recurring questions concerning disciplinary culture and practices, data curation and "data work," engagement with modern empirical modeling, training for large-scale AI applications, and partnerships with key AI stakeholders. By providing an archival record of this discussion, the preprint aims to support transparency, community reflection, and ongoing dialogue about the evolving role of statistics in the data- and AI-centric future.

2601.17501 2026-01-27 math.PR math.ST stat.TH

Sufficient conditions for some transform orders based on the quantile density ratio

A. Arriaza, F. Belzunce, C. Martínez-Riquelme

Comments 29 pages. Author accepted manuscript

Journal ref Methodology and Computing in Applied Probability (2021), 23, 29-52

Abstract

In this paper we focus on providing sufficient conditions for some transform orders in cases where the quantile density ratio is non-monotone and, therefore, the convex transform order does not hold. These results are useful for comparing random variables whose quantile functions have no explicit expression or are computationally complex. In addition, the main results are applied to compare two Tukey generalized distributed random variables and to establish new relationships among non-monotone and positive aging notions.

2601.17485 2026-01-27 physics.geo-ph eess.SP stat.ME

Curvelet-Regularized SPDE Inversion on Piecewise-Planar Fractures with Trace-Graph Coupling

J. J. Segura

Abstract

We formulate a sparse-to-dense reconstruction layer for fractured media in which sparse point measurements are mapped onto piecewise-planar fracture supports inferred from 3D trace polylines. Each plane is discretized in local coordinates and estimated via a convex objective that combines a grid SPDE/GMRF quadratic prior with an $\ell_1$ penalty on undecimated discrete curvelet coefficients, targeting anisotropic, fracture-aligned structure that is poorly represented by isotropic smoothness alone. We further define an along-fracture distance through trace-network geodesics and express connectivity-driven regularization as a quadratic form $z^\top P^\top L_G P z$, where $L_G$ is a graph Laplacian on the trace network and $P$ maps plane grids to graph nodes; plane intersections are handled by linear consistency constraints sampled along intersection lines. The resulting optimization admits efficient splitting: sparse linear solves for the quadratic block and coefficient-wise shrinkage for the curvelet block, with standard ADMM convergence under convexity. We specify reproducible synthetic benchmarks, baselines, ablations, and sensitivity studies that isolate directional sparsity and connectivity effects, and provide reference code to generate the figures and quantitative tables.
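
The "coefficient-wise shrinkage" step can be illustrated in isolation. The abstract does not spell out the operator, so the sketch below assumes the standard proximal map of an $\ell_1$ penalty, soft-thresholding $S_\tau(v) = \mathrm{sign}(v)\max(|v| - \tau, 0)$, applied to a plain list of numbers. The function name `soft_threshold`, the threshold `tau`, and the sample coefficients are hypothetical; no curvelet transform is computed here.

```python
def soft_threshold(coefficients, tau):
    """Shrink each coefficient toward zero by tau, zeroing those with |v| <= tau."""
    shrunk = []
    for v in coefficients:
        magnitude = abs(v) - tau
        if magnitude <= 0.0:
            shrunk.append(0.0)  # small coefficients are set exactly to zero
        else:
            shrunk.append(magnitude if v > 0 else -magnitude)  # keep the sign
    return shrunk


example = soft_threshold([3.0, -0.5, 1.0, -2.5], tau=1.0)
print(example)  # [2.0, 0.0, 0.0, -1.5]
```

Because the operator acts on each coefficient independently, its cost is linear in the number of coefficients, which is what makes the split between a sparse linear solve and a shrinkage step attractive.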

2601.17419 2026-01-27 eess.SP cs.IT math.IT stat.ML

Semantic-Aware Task Clustering for Federated Cooperative Multi-Task Semantic Communication

Ahmad Halimi Razlighi, Pallavi Dhingra, Edgar Beck, Bho Matthiesen, Armin Dekorsy

Comments This work has been submitted to the IEEE for possible publication

Abstract

Task-oriented semantic communication (SemCom) prioritizes task execution over accurate symbol reconstruction and is well-suited to emerging intelligent applications. Cooperative multi-task SemCom (CMT-SemCom) further improves task execution performance. However, [1] demonstrates that cooperative multi-tasking can be either constructive or destructive. Moreover, the existing CMT-SemCom framework is not directly applicable to distributed multi-user scenarios, such as non-terrestrial satellite networks, where each satellite employs an individual semantic encoder. In this paper, we extend our earlier CMT-SemCom framework to distributed settings by proposing a federated learning (FL) based CMT-SemCom that enables cooperative multi-tasking across distributed users. Moreover, to address performance degradation caused by negative information transfer among heterogeneous tasks, we propose a semantic-aware task clustering method integrated in the FL process to ensure constructive cooperation based on an information-theoretic approach. Unlike common clustering methods that rely on high-dimensional data or feature space similarity, our proposed approach operates in the low-dimensional semantic domain to identify meaningful task relationships. Simulation results based on a LEO satellite network setup demonstrate the effectiveness of our approach and performance gain over unclustered FL and individual single-task SemCom.

2601.17319 2026-01-27 stat.ME

Statistical process control via $p$-values

Hien Duy Nguyen, Dan Wang

Abstract

We study statistical process control (SPC) through charting of $p$-values. When in control (IC), any valid sequence $(P_{t})_{t}$ is super-uniform, a requirement that can hold in nonparametric and two-phase designs without parametric modelling of the monitored process. Within this framework, we analyse the Shewhart rule that signals when $P_{t}\le α$. Under super-uniformity alone, and with no assumptions on temporal dependence, we derive universal IC lower bounds for the average run length (ARL) and for the expected time to the $k$th false alarm ($k$-ARL). When conditional super-uniformity holds, these bounds sharpen to the familiar $α^{-1}$ and $kα^{-1}$ rates, giving simple, distribution-free calibration for $p$-value charts. Beyond thresholding, we use merging functions for dependent $p$-values to build EWMA-like schemes that output, at each time $t$, a valid $p$-value for the hypothesis that the process has remained IC up to $t$, enabling smoothing without ad hoc control limits. We also study uniform EWMA processes, giving explicit distribution formulas and left-tail guarantees. Finally, we propose a modular approach to directional and coordinate localisation in multivariate SPC via closed testing, controlling the family-wise error rate at the time of alarm. Numerical examples illustrate the utility and variety of our approach.
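
The $α^{-1}$ rate for the Shewhart rule can be checked numerically in the simplest special case. The sketch below assumes independent, exactly uniform in-control $p$-values, which is stronger than the super-uniformity conditions in the abstract; the names `run_length` and `estimate_arl` are hypothetical.

```python
import random


def run_length(alpha, rng, max_steps=10**6):
    """Number of observations until the first signal P_t <= alpha."""
    for t in range(1, max_steps + 1):
        # Assumption: in-control p-values are i.i.d. Uniform(0, 1).
        if rng.random() <= alpha:
            return t
    return max_steps


def estimate_arl(alpha, n_runs=2000, seed=0):
    """Monte Carlo estimate of the in-control average run length."""
    rng = random.Random(seed)
    total = sum(run_length(alpha, rng) for _ in range(n_runs))
    return total / n_runs


arl = estimate_arl(alpha=0.05)
# Under these assumptions the run length is geometric with mean 1 / alpha = 20.
```

With merely super-uniform or dependent $p$-values the run length is no longer geometric, which is the setting the paper's bounds address.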

2601.17302 2026-01-27 math.ST stat.ME stat.TH

Tighter confidence intervals for quantiles of heterogeneous data

John H. J. Einmahl, Yi He

Abstract

It is well known that the asymptotic variance of sample quantiles can be reduced under heterogeneity relative to the i.i.d. setting. However, asymptotically correct confidence intervals for quantiles are not yet available. We propose a novel, consistent estimator of the reduced asymptotic variance arising when quantiles are computed from groups of observations, leading to asymptotically correct confidence intervals. Simulation studies show that our confidence intervals are substantially shorter than those in the i.i.d. case and attain nearly correct coverage across a wide range of heterogeneous settings.

2601.17265 2026-01-27 stat.ME stat.ML

Covariate-assisted Grade of Membership Models via Shared Latent Geometry

Zhiyu Xu, Yuqi Gu

Abstract

The grade of membership model is a flexible latent variable model for analyzing multivariate categorical data through individual-level mixed membership scores. In many modern applications, auxiliary covariates are collected alongside responses and encode information about the same latent structure. Traditional approaches to incorporating such covariates typically rely on fully specified joint likelihoods, which are computationally intensive and sensitive to misspecification. We introduce a covariate-assisted grade of membership model that integrates response and covariate information by exploiting their shared low-rank simplex geometry, rather than modeling their joint distribution. We propose a likelihood-free spectral estimation procedure that combines heterogeneous data sources through a balance parameter controlling their relative contribution. To accommodate high-dimensional and heteroskedastic noise, we employ heteroskedastic principal component analysis before performing simplex-based geometric recovery. Our theoretical analysis establishes weaker identifiability conditions than those required in the covariate-free model, and further derives finite-sample, entrywise error bounds for both mixed membership scores and item parameters. These results demonstrate that auxiliary covariates can provably improve latent structure recovery, yielding faster convergence rates in high-dimensional regimes. Simulation studies and an application to educational assessment data illustrate the computational efficiency, statistical accuracy, and interpretability gains of the proposed method. The code for reproducing these results is open-source and available at https://github.com/Toby-X/Covariate-Assisted-GoM.

2601.17256 2026-01-27 cs.ET stat.AP

Safety, Mobility, and Environmental Impacts of Driver-Assistance-Enabled Electric Vehicles: An Empirical Study

Gabriel Geffen, Jun Zhao, Mingfeng Shang, Shian Wang, Yao-Jan Wu

Abstract

The advancement of vehicle automation and the growing adoption of electric vehicles (EVs) are reshaping transportation systems. While fully automated vehicles are expected to improve traffic stability, efficiency, and sustainability, recent studies suggest that partially automated vehicles, such as those equipped with adaptive cruise control (ACC), may adversely affect traffic flow. These drawbacks may not extend to ACC-enabled EVs due to their distinct mechanical characteristics, including regenerative braking and smoother torque delivery. As a result, the impacts of EVs operating under ACC remain insufficiently understood. To address this gap, this study develops an empirical framework using the OpenACC dataset to compare ACC-enabled EVs and internal combustion engine vehicles. Dynamic time warping aligns comparable lead-vehicle trajectories. Results show that EVs exhibit smoother speed profiles, lower speed variability, and shorter spacing, leading to higher efficiency. EVs reduce critical safety events by over 85% and lower platoon-level emissions by up to 26.2%.
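
As a rough illustration of how dynamic time warping can align two trajectories, here is a minimal sketch of the classic DTW recursion on one-dimensional speed sequences. It is not the study's pipeline; the function name `dtw_distance`, the absolute-difference local cost, and the sample speeds are all assumptions.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping cost between two 1-D sequences.

    cost[i][j] is the cheapest cumulative cost of aligning a[:i] with b[:j],
    where each step may advance one sequence or both.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local cost
            cost[i][j] = local + min(
                cost[i - 1][j],      # advance a only
                cost[i][j - 1],      # advance b only
                cost[i - 1][j - 1],  # advance both
            )
    return cost[n][m]


# Hypothetical lead-vehicle speeds (m/s); the second holds 2 m/s one step longer.
same_shape = dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
shifted = dtw_distance([0.0, 0.0], [1.0, 1.0])
```

A small DTW cost indicates that two lead-vehicle profiles have the same shape up to timing, which is one way such profiles could be judged comparable.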

2601.17241 2026-01-27 stat.ME

Capturing Cumulative Disease Burden in Chronic Kidney Disease Outcome Trials: Area Under the Curve and Restricted Mean Time in Favor of Treatment Beyond Conventional Time-to-First Analysis

Jiren Sun, Tuo Wang, Yu Du

Abstract

Chronic kidney disease (CKD) affects millions worldwide and progresses irreversibly through stages culminating in end-stage renal disease (ESRD) and death. Outcome trials in CKD traditionally employ time-to-first-event analyses using Cox models. However, this approach has fundamental limitations for progressive diseases. It assigns equal weight to each composite endpoint component despite a clear clinical hierarchy (an eGFR decline threshold receives the same weight as ESRD or death in the analysis), and it captures only the first occurrence while ignoring subsequent progression. Given CKD's gradual evolution over years, comprehensive treatment evaluation requires quantifying cumulative disease burden by integrating both event severity and time spent in each disease state. We propose two complementary approaches to better characterize treatment benefits by incorporating event severity and state occupancy: area under the curve (AUC) and restricted mean time in favor of treatment (RMT-IF). The AUC method assigns ordinal severity scores to disease states and calculates the area under the mean cumulative score curve, quantifying total event-free time lost. Treatment effects are expressed as AUC ratios or differences. The RMT-IF extends restricted mean survival time to multistate processes, measuring the average time patients in the treatment arm spend in more favorable states versus the comparator. These methods better capture CKD's progressive nature, where treatment benefits extend beyond first-event delay to overall disease trajectory modification. By discriminating between events of differing clinical importance and quantifying the complete disease course, these estimands offer alternative assessment frameworks for kidney-protective therapies, potentially improving efficiency and interpretability of future CKD outcome trials.

2601.17233 2026-01-27 stat.ME stat.AP

An Empirical Method for Analyzing Count Data

Jiren Sun, Linda Amoafo, Yongming Qu

Abstract

Count endpoints are common in clinical trials, particularly for recurrent events such as hypoglycemia. When interest centers on comparing overall event rates between treatment groups, negative binomial (NB) regression is widely used because it accommodates overdispersion and requires only event counts and exposure times. However, NB regression can be numerically unstable when events are sparse, and the efficiency gains from baseline covariate adjustment may be sensitive to model misspecification. We propose an empirical method that targets the same marginal estimand as NB regression -- the ratio of marginal event rates -- while avoiding distributional assumptions on the count outcome. Simulation studies show that the empirical method maintains appropriate Type I error control across diverse scenarios, including extreme overdispersion and zero inflation, achieves power comparable to NB regression, and yields consistent efficiency gains from baseline covariate adjustment. We illustrate the approach using severe hypoglycemia data from the QWINT-5 trial comparing insulin efsitora alfa with insulin degludec in adults with type 1 diabetes. In this sparse-event setting, the empirical method produced stable marginal rate estimates and rate ratios closely aligned with observed rates, while NB regression exhibited greater sensitivity and larger deviations from the observed rates in the sparsest intervals. The proposed empirical method provides a robust and numerically stable alternative to NB regression, particularly when the number of events is low or when numerical stability is a concern.

2601.17171 2026-01-27 math.OC stat.ML

A Unified Kantorovich Duality for Multimarginal Optimal Transport

Yehya Cheryala, Mokhtar Z. Alaya, Salim Bouzebda

Abstract

Multimarginal optimal transport (MOT) has gained increasing attention in recent years, notably due to its relevance in machine learning and statistics, where one seeks to jointly compare and align multiple probability distributions. This paper presents a unified and complete Kantorovich duality theory for the MOT problem on general Polish product spaces with a bounded continuous cost function. For compact marginal spaces, the duality identity is derived through a convex-analytic reformulation that identifies the dual problem as a Fenchel-Rockafellar conjugate. We obtain dual attainment and show that optimal potentials may always be chosen in the class of $c$-conjugate families, thereby extending the classical two-marginal conjugacy principle to a genuinely multimarginal setting. In the non-compact setting, where direct compactness arguments are unavailable, we recover duality via a truncation-tightness procedure based on weak compactness of multimarginal transference plans and boundedness of the cost. We prove that the dual value is preserved under restriction to compact subsets and that admissible dual families can be regularized into uniformly bounded $c$-conjugate potentials. The argument relies on a refined use of $c$-splitting sets and their equivalence with multimarginal $c$-cyclical monotonicity. We then obtain dual attainment and exact primal-dual equality for MOT on arbitrary Polish spaces, together with a canonical representation of optimal dual potentials by $c$-conjugacy. These results provide a structural foundation for further developments in probabilistic and statistical analysis of MOT, including stability, differentiability, and asymptotic theory under marginal perturbations.
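
For orientation, the duality identity and the $c$-conjugacy relation referred to above take the following standard form in the MOT literature. The notation is an assumption, not taken from the paper: $N$ marginals $\mu_i$ on Polish spaces $X_i$, couplings $\Pi(\mu_1,\dots,\mu_N)$, and bounded continuous potentials $\varphi_i \in C_b(X_i)$. The paper's exact function classes may differ.

```latex
% Primal-dual equality (Kantorovich duality for N marginals)
\[
  \inf_{\pi \in \Pi(\mu_1,\dots,\mu_N)}
    \int_{X_1 \times \cdots \times X_N} c \,\mathrm{d}\pi
  \;=\;
  \sup\Bigl\{ \sum_{i=1}^{N} \int_{X_i} \varphi_i \,\mathrm{d}\mu_i
    \;:\; \varphi_i \in C_b(X_i),\;
    \sum_{i=1}^{N} \varphi_i(x_i) \le c(x_1,\dots,x_N) \Bigr\}.
\]

% A family (\varphi_1,\dots,\varphi_N) is c-conjugate if, for every i,
\[
  \varphi_i(x_i)
  \;=\;
  \inf_{(x_j)_{j \ne i}}
    \Bigl[ c(x_1,\dots,x_N) - \sum_{j \ne i} \varphi_j(x_j) \Bigr].
\]
```

For $N = 2$ the second relation reduces to the familiar $c$-transform of two-marginal optimal transport.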

2601.17153 2026-01-27 stat.ME stat.AP

Evaluating Aggregated Relational Data Models with Simple Diagnostics

Ian Laga, Benjamin Vogel, Jieyun Wang, Anna Smith, Owen Ward

Abstract

Aggregated Relational Data (ARD) contain summary information about individual social networks and are widely used to estimate social network characteristics and the size of populations of interest. Although a variety of ARD estimators exist, practitioners currently lack guidance on how to evaluate whether a selected model adequately fits the data. We introduce a diagnostic framework for ARD models that provides a systematic, reproducible process for assessing covariate structure, distributional assumptions, and correlation. The diagnostics are based on point estimates, using either maximum likelihood or maximum a posteriori optimization, which allows quick evaluation without requiring repeated Bayesian model fitting. Through simulation studies and applications to large ARD datasets, we show that the proposed workflow identifies common sources of model misfit and helps researchers select an appropriate model that adequately explains the data.

2601.17146 2026-01-27 stat.ME cs.CY cs.LG stat.ML

Falsifying Predictive Algorithm

Amanda Coston

Abstract

Empirical investigations into unintended model behavior often show that the algorithm is predicting a different outcome from the one intended. These exposés highlight the need to identify when algorithms predict unintended quantities, ideally before deploying them in consequential settings. We propose a falsification framework that provides a principled statistical test for discriminant validity: the requirement that an algorithm predict intended outcomes better than impermissible ones. Drawing on falsification practices from causal inference, econometrics, and psychometrics, our framework compares calibrated prediction losses across outcomes to assess whether the algorithm exhibits discriminant validity with respect to a specified impermissible proxy. In settings where the target outcome is difficult to observe, multiple permissible proxy outcomes may be available; our framework accommodates both this setting and the case with a single permissible proxy. Throughout we use nonparametric hypothesis testing methods that make minimal assumptions on the data-generating process. We illustrate the method in an admissions setting, where the framework establishes discriminant validity with respect to gender but fails to establish discriminant validity with respect to race. This demonstrates how falsification can serve as an early validity check, prior to fairness or robustness analyses. We also provide analysis in a criminal justice setting, where we highlight the limitations of our framework and emphasize the need for complementary approaches to assess other aspects of construct validity and external validity.

2601.17141 2026-01-27 stat.ME

Varying coefficient model for longitudinal data with informative observation times

Yu Gu, Yangjianchen Xu, Peijun Sang

Abstract

Varying coefficient models are widely used to characterize dynamic associations between longitudinal outcomes and covariates. Existing work on varying coefficient models, however, assumes that observation times are independent of the longitudinal outcomes, an assumption that is often violated in real-world studies with outcome-driven or otherwise informative visit schedules. Such informative observation times can lead to biased estimation and invalid inference using existing methods. In this article, we develop estimation and inference procedures for varying coefficient models that account for informative observation times. We model the observation time process as a general counting process under a proportional intensity model, with time-varying covariates summarizing the observed history. To address potential bias, we incorporate inverse intensity weighting into a sieve estimation framework, yielding closed-form coefficient function estimators via weighted least squares. We establish consistency, convergence rates, and asymptotic normality of the proposed estimators, and construct pointwise confidence intervals for the coefficient functions. Extensive simulation studies demonstrate that the proposed weighted method substantially outperforms the conventional unweighted method when observation times are informative. Finally, we provide an application of our method to the Alzheimer's Disease Neuroimaging Initiative study.

2601.13561 2026-01-27 cond-mat.stat-mech math.PR math.ST stat.TH

Additive-Functional Approach to Transport in Periodic and Tilted Periodic Potentials

Sang Yang, Zhixin Peng

Comments 4 pages, 1 figure

Abstract

In this Letter, we clarify the physical origin of effective transport in periodic and tilted periodic systems. When Brownian dynamics is examined on the scale of a single period, the particle displacement admits a natural separation into a bounded part associated with recurrent motion within the periodic landscape, and an unbounded stochastic part that grows in time and carries the net transport. We show that effective drift and diffusion are governed entirely by this unbounded component, while local potential-induced fluctuations contribute only bounded corrections. Treating the displacement as an additive functional of the stochastic dynamics provides a rigorous formulation of this separation and leads to a corrector-martingale representation at the trajectory level. Within this framework, classical results, including the Lifson-Jackson formula for unbiased periodic systems and the Stratonovich expressions for tilted periodic potentials, follow as direct consequences of the same underlying structure. The same perspective extends naturally to higher-dimensional periodic environments, recovering the standard homogenized transport tensors.

2601.13259 2026-01-27 math.PR math.FA math.ST stat.TH

Entropy-Wasserstein regularization, defective local concentration and a cutoff criterion beyond non-negative curvature

Francesco Pedrotti

Comments 21 pages, minor changes

Abstract

Notions of positive curvature have been shown to imply many remarkable properties for Markov processes, in terms, e.g., of regularization effects, functional inequalities, mixing time bounds and, more recently, the cutoff phenomenon. In this work, we are interested in a relaxed variant of Ollivier's coarse Ricci curvature, where a Markov kernel $P$ satisfies only a weaker Wasserstein bound $W_p(μP, νP) \leq K W_p(μ,ν)+M$ for constants $M\ge 0, K\in [0,1], p \ge 1$. Under appropriate additional assumptions on the one-step transition measures $δ_x P$, we establish (i) a form of local concentration, given by a defective Talagrand inequality, and (ii) an entropy-transport regularization effect. We consider as illustrative examples the Langevin dynamics and the Proximal Sampler when the target measure is a log-Lipschitz perturbation of a log-concave measure. As an application of the above results, we derive criteria for the occurrence of the cutoff phenomenon in some negatively curved settings.

2601.13102 2026-01-27 stat.ML cs.LG math.ST stat.TH

Approximate full conformal prediction in an RKHS

Davidson Lova Razafindrakoto, Alain Celisse, Jérôme Lacaille

Abstract

Full conformal prediction is a framework that implicitly formulates distribution-free confidence prediction regions for a wide range of estimators. However, a classical limitation of the full conformal framework is the computation of the confidence prediction regions, which is usually impossible since it requires training infinitely many estimators (for real-valued prediction for instance). The main purpose of the present work is to describe a generic strategy for designing a tight approximation to the full conformal prediction region that can be efficiently computed. Along with this approximate confidence region, a theoretical quantification of the tightness of this approximation is developed, depending on the smoothness assumptions on the loss and score functions. The new notion of thickness is introduced for quantifying the discrepancy between the approximate confidence region and the full conformal one.

2601.06788 2026-01-27 cs.LG cs.AI hep-th quant-ph stat.ML

Artificial Entanglement in the Fine-Tuning of Large Language Models

Min Chen, Zihan Wang, Canyu Chen, Zeguan Wu, Manling Li, Junyu Liu

Comments 41 pages, many figures

Abstract

Large language models (LLMs) can be adapted to new tasks using parameter-efficient fine-tuning (PEFT) methods that modify only a small number of trainable parameters, often through low-rank updates. In this work, we adopt a quantum-information-inspired perspective to understand their effectiveness. From this perspective, low-rank parameterizations naturally correspond to low-dimensional Matrix Product States (MPS) representations, which enable entanglement-based characterizations of parameter structure. We accordingly define and measure "Artificial Entanglement", the entanglement entropy of the parameters in artificial neural networks (in particular LLMs). We first study the representative low-rank adaptation (LoRA) PEFT method, alongside full fine-tuning (FFT), using LLaMA models at the 1B and 8B scales trained on the Tulu3 and OpenThoughts3 datasets, and uncover: (i) Internal artificial entanglement in the updates of query and value projection matrices in LoRA follows a volume law with a central suppression (termed the "Entanglement Valley"), which is sensitive to hyper-parameters and is distinct from that in FFT; (ii) External artificial entanglement in attention matrices, corresponding to token-token correlations in representation space, follows an area law with logarithmic corrections and remains robust to LoRA hyper-parameters and training steps. Drawing a parallel to the No-Hair Theorem in black hole physics, we propose that although LoRA and FFT induce distinct internal entanglement signatures, such differences do not manifest in the attention outputs, suggesting a "no-hair" property that results in the effectiveness of low-rank updates. We further provide theoretical support based on random matrix theory, and extend our analysis to an MPS Adaptation PEFT method, which exhibits qualitatively similar behaviors.

2510.12639 2026-01-27 stat.ML cs.LG math.PR

Thermodynamic structure of the Sinkhorn flow

Anand Srinivasan, Jean-Jacques Slotine

Comments 26 pages excluding references

Abstract

Entropy-regularized optimal transport, which has strong links to the Schrödinger bridge problem in statistical mechanics, enjoys a variety of applications from trajectory inference to generative modeling. A major driver of renewed interest in this problem is the recent development of fast matrix-scaling algorithms (known as iterative proportional fitting or the Sinkhorn algorithm) for entropic optimal transport, which have favorable complexity over traditional approaches to the unregularized problem. Here, we take a perspective on this algorithm rooted in the thermodynamic origins of Schrödinger's problem and inspired by the modern geometric theory of diffusion: is the Sinkhorn flow (viewed in continuous-time as a mirror descent by recent results) the gradient flow of entropy in a formal Riemannian geometry? We answer this question affirmatively, finding a nonlocal Wasserstein gradient structure in the dynamics of its free marginal. This offers a physical interpretation of the Sinkhorn flow as the stochastic dynamics of a particle with law evolving by the nonlocal diffusion of a chemical potential. Simultaneously, it brings a standard suite of functional inequalities characterizing Markov diffusion processes to bear upon its geometry and convergence. We prove an entropy-energy (de Bruijn) identity, a Poincaré inequality, and a Bakry-Émery-type condition under which a logarithmic Sobolev inequality (LSI) holds and implies exponential convergence of the Sinkhorn flow in entropy. We lastly discuss computational applications such as stopping heuristics and latent-space design criteria leveraging the LSI and, returning to the physical interpretation, the possibility of natural systems whose relaxation to equilibrium inherently solves entropic optimal transport or Schrödinger bridge problems.
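
For readers unfamiliar with the matrix-scaling algorithm mentioned above, the following is a minimal sketch of the discrete Sinkhorn iteration (iterative proportional fitting) for two finite marginals. It illustrates only the classical discrete algorithm, not the continuous-time Sinkhorn flow studied in the paper; the function name `sinkhorn`, the regularization `eps`, and the toy cost matrix are assumptions.

```python
import math


def sinkhorn(cost, mu, nu, eps=0.5, n_iter=500):
    """Entropic OT plan between discrete marginals mu and nu.

    Alternately rescales rows and columns of the Gibbs kernel
    K = exp(-cost / eps) so that the plan diag(u) K diag(v)
    matches the prescribed marginals.
    """
    n, m = len(mu), len(nu)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iter):
        # Row scaling: enforce the first marginal mu.
        u = [mu[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Column scaling: enforce the second marginal nu.
        v = [nu[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy problem: two points on each side, unit cost for moving mass across.
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn(cost, mu=[0.5, 0.5], nu=[0.3, 0.7])
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(2)) for j in range(2)]
```

Because the loop ends on a column scaling, the column sums match `nu` up to rounding, while the row sums match `mu` only up to the convergence error of the iteration.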

2509.25009 2026-01-27 stat.ME econ.EM math.ST stat.TH

Efficient Difference-in-Differences Estimation when Outcomes are Missing at Random

Lorenzo Testa, Edward H. Kennedy, Matthew Reimherr

Comments 12 pages, 1 figure

Abstract

The Difference-in-Differences (DiD) method is a fundamental tool for causal inference, yet its application is often complicated by missing data. Although recent work has developed robust DiD estimators for complex settings like staggered treatment adoption, these methods typically assume complete data and fail to address the critical challenge of outcomes that are missing at random (MAR) -- a common problem that invalidates standard estimators. We develop a rigorous framework, rooted in semiparametric theory, for identifying and efficiently estimating the Average Treatment Effect on the Treated (ATT) when either pre- or post-treatment (or both) outcomes are missing at random. We first establish nonparametric identification of the ATT under two minimal sets of sufficient conditions. For each, we derive the semiparametric efficiency bound, which provides a formal benchmark for asymptotic optimality. We then propose novel estimators that are asymptotically efficient, achieving this theoretical bound. A key feature of our estimators is their multiple robustness, which ensures consistency even if some nuisance function models are misspecified. We validate the properties of our estimators and showcase their broad applicability through an extensive simulation study.

2508.21649 2026-01-27 stat.ME

NExON-Bayes: A Bayesian approach to network estimation informed by ordinal covariates

Joseph Feest, Hélène Ruffieux, Camilla Lingjærde, Xiaoyue Xi

Abstract

In heterogeneous disease settings, accounting for intrinsic sample variability is crucial for obtaining reliable and interpretable omic network estimates. However, most graphical model analyses of biomedical data assume homogeneous conditional dependence structures, potentially leading to misleading conclusions. To address this, we propose a joint Gaussian graphical model that leverages sample-level ordinal covariates (e.g., disease stage) to account for heterogeneity and improve the estimation of partial correlation structures. Our modelling framework, called NExON-Bayes, extends the graphical spike-and-slab framework to account for ordinal covariates, jointly estimating their relevance to the graph structure and leveraging them to improve the accuracy of network estimation. To scale to high-dimensional omic settings, we develop an efficient variational inference algorithm tailored to our model. Through simulations, we demonstrate that our method outperforms the vanilla graphical spike-and-slab (with no covariate information), as well as other state-of-the-art network approaches which exploit covariate information. Applying our method to reverse phase protein array data from patients diagnosed with stage I, II or III breast carcinoma, we estimate the behaviour of proteomic networks as breast carcinoma progresses. Our model provides insights not only through inspection of the estimated proteomic networks, but also of the estimated ordinal covariate dependencies of key groups of proteins within those networks, offering a comprehensive understanding of how biological pathways shift across disease stages. Availability and Implementation: A user-friendly R package for NExON-Bayes with tutorials is available on GitHub at github.com/jf687/NExON.

2506.18397 2026-01-27 cs.CV math.ST stat.TH

Distributed Poisson multi-Bernoulli filtering via generalised covariance intersection

Ángel F. García-Fernández, Giorgio Battistelli

Comments Matlab code with the distributed GCI-PMB and GCI-PMBM filters is provided at https://github.com/Agarciafernandez/MTT

Journal ref Á. F. García-Fernández and G. Battistelli, "Distributed Poisson Multi-Bernoulli Filtering via Generalized Covariance Intersection," in IEEE Transactions on Signal Processing, vol. 74, pp. 246-257, 2026

Abstract

This paper presents the distributed Poisson multi-Bernoulli (PMB) filter based on the generalised covariance intersection (GCI) fusion rule for distributed multi-object filtering. Since the exact GCI fusion of two PMB densities is intractable, we derive a principled approximation. Specifically, we approximate the power of a PMB density as an unnormalised PMB density, which corresponds to an upper bound of the PMB density. Then, the GCI fusion rule corresponds to the normalised product of two unnormalised PMB densities. We show that the result is a Poisson multi-Bernoulli mixture (PMBM), which can be expressed in closed form. Future prediction and update steps in each filter preserve the PMBM form, which can be projected back to a PMB density before the next fusion step. Experimental results show the benefits of this approach compared to other distributed multi-object filters.
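
For context, the GCI fusion rule is usually stated as a normalised weighted geometric mean of the two multi-object densities. The notation below is an assumption rather than the paper's: $f_1, f_2$ are the local densities, $\omega \in [0,1]$ is the fusion weight, and the integral is a set integral.

```latex
\[
  f_{\omega}(X)
  \;=\;
  \frac{ f_1(X)^{\omega} \, f_2(X)^{1-\omega} }
       { \int f_1(X')^{\omega} \, f_2(X')^{1-\omega} \, \delta X' } .
\]
```

The powers $f_1^{\omega}$ and $f_2^{1-\omega}$ are the terms that the abstract describes approximating by unnormalised PMB densities.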

2506.17764 2026-01-27 stat.ML cs.LG cs.SY eess.SP eess.SY

Derandomizing Simultaneous Confidence Regions for Band-Limited Functions by Improved Norm Bounds and Majority-Voting Schemes

Balázs Csanád Csáji, Bálint Horváth

Journal ref IEEE Control Systems Letters, Volume 9, 2025, pp. 1381-1386

Abstract

Band-limited functions are fundamental objects that are widely used in systems theory and signal processing. In this paper we refine a recent nonparametric, nonasymptotic method for constructing simultaneous confidence regions for band-limited functions from noisy input-output measurements, by working in a Paley-Wiener reproducing kernel Hilbert space. Kernel norm bounds are tightened using a uniformly-randomized Hoeffding's inequality for small samples and an empirical Bernstein bound for larger ones. We derive an approximate threshold, based on the sample size and how informative the inputs are, that governs which bound to deploy. Finally, we apply majority voting to aggregate confidence sets from random subsamples, boosting both stability and region size. We prove that even per-input aggregated intervals retain their simultaneous coverage guarantee. These refinements are also validated through numerical experiments.

2506.12371 2026-01-27 cs.LG stat.ML

Path-specific effects for pulse-oximetry guided decisions in critical care

Kevin Zhang, Yonghan Jung, Divyat Mahajan, Karthikeyan Shanmugam, Shalmali Joshi

Abstract

Identifying and measuring biases associated with sensitive attributes is a crucial consideration in healthcare to prevent treatment disparities. One prominent issue is inaccurate pulse oximeter readings, which tend to overestimate oxygen saturation for dark-skinned patients and misrepresent supplemental oxygen needs. Most existing research has revealed statistical disparities linking device measurement errors to patient outcomes in intensive care units (ICUs) without causal formalization. This study causally investigates how racial discrepancies in oximetry measurements affect invasive ventilation in ICU settings. We employ a causal inference-based approach using path-specific effects to isolate the impact of bias by race on clinical decision-making. To estimate these effects, we leverage a doubly robust estimator, propose its self-normalized variant for improved sample efficiency, and provide novel finite-sample guarantees. Our methodology is validated on semi-synthetic data and applied to two large real-world health datasets: MIMIC-IV and eICU. Contrary to prior work, our analysis reveals minimal impact of racial discrepancies on invasive ventilation rates. However, path-specific effects mediated by oxygen saturation disparity are more pronounced on ventilation duration, and the severity differs across datasets. Our work provides a novel pipeline for investigating potential disparities in clinical decision-making and, more importantly, highlights the necessity of causal methods to robustly assess fairness in healthcare.

2506.02337 2026-01-27 cs.LG cs.NA math-ph math.MP math.NA physics.comp-ph stat.ML

Discovery of Probabilistic Dirichlet-to-Neumann Maps on Graphs

Adrienne M. Propp, Jonas A. Actor, Elise Walker, Houman Owhadi, Nathaniel Trask, Daniel M. Tartakovsky

Details
English abstract

Dirichlet-to-Neumann maps enable the coupling of multiphysics simulations across computational subdomains by ensuring continuity of state variables and fluxes at artificial interfaces. We present a novel method for learning Dirichlet-to-Neumann maps on graphs using Gaussian processes, specifically for problems where the data obey a conservation constraint from an underlying partial differential equation. Our approach combines discrete exterior calculus and nonlinear optimal recovery to infer relationships between vertex and edge values. This framework yields data-driven predictions with uncertainty quantification across the entire graph, even when observations are limited to a subset of vertices and edges. By optimizing over the reproducing kernel Hilbert space norm while applying a maximum likelihood estimation penalty on kernel complexity, our method ensures that the resulting surrogate strictly enforces conservation laws without overfitting. We demonstrate our method on two representative applications: subsurface fracture networks and arterial blood flow. Our results show that the method maintains high accuracy and well-calibrated uncertainty estimates even under severe data scarcity, highlighting its potential for scientific applications where limited data and reliable uncertainty quantification are critical.

2505.18902 2026-01-27 stat.AP cs.CV

Unsupervised cell segmentation by fast Gaussian Processes

Laura Baracaldo, Blythe King, Haoran Yan, Yizi Lin, Nina Miolane, Mengyang Gu

Details
English abstract

Cell boundary information is crucial for analyzing cell behaviors from time-lapse microscopy videos. Existing supervised cell segmentation tools, such as ImageJ, require tuning various parameters and rely on restrictive assumptions about the shape of the objects. While recent supervised segmentation tools based on convolutional neural networks enhance accuracy, they depend on high-quality labeled images, making them unsuitable for segmenting new types of objects not in the database. We developed a novel unsupervised cell segmentation algorithm based on fast Gaussian processes for noisy microscopy images without the need for parameter tuning or restrictive assumptions about the shape of the object. We derived robust thresholding criteria adaptive for heterogeneous images containing distinct brightness at different parts to separate objects from the background, and employed watershed segmentation to distinguish touching cell objects. Both simulated studies and real-data analysis of large microscopy images demonstrate the scalability and accuracy of our approach compared with the alternatives.

2505.18038 2026-01-27 stat.ME stat.CO

Assessing the impact of variance heterogeneity and misspecification in mixed-effects location-scale models

Vincent Jeanselme, Marco Palma, Jessica K Barrett

Comments BMC Medical Research Methodology (2026)

Details
English abstract

The linear mixed model (LMM) is a common statistical approach to model the relation between exposure and outcome while capturing individual variability through random effects. However, this model assumes the homogeneity of the error term's variance. Breaking this assumption, known as homoscedasticity, can bias estimates and, consequently, may change a study's conclusions. If this assumption is unmet, the mixed-effects location-scale model (MELSM) offers a solution to account for within-individual variability. Our work explores how LMMs and MELSMs behave when the homoscedasticity assumption is not met. Further, we study how misspecification affects inference for MELSMs. To this aim, we propose a simulation study with longitudinal data and evaluate the estimates' bias and coverage. Our simulations show that neglecting heteroscedasticity in LMMs leads to loss of coverage for the estimated coefficients and biases the estimates of the standard deviations of the random effects. In MELSMs, scale misspecification does not bias the location model, but location misspecification alters the scale estimates. Our simulation study illustrates the importance of modelling heteroscedasticity, with potential implications beyond mixed-effects models, for generalised linear mixed models for non-normal outcomes and joint models with survival data.

2503.04588 2026-01-27 stat.ME

Fiducial Inference for Random-Effects Calibration Models: Advancing Reliable Quantification in Environmental Analytical Chemistry

Soumya Sahu, Thomas Mathew, Robert Gibbons, Dulal K. Bhaumik

Details
English abstract

This article addresses calibration challenges in analytical chemistry by employing a random-effects calibration curve model and its generalizations to capture variability in analyte concentrations. The model is motivated by specific issues in analytical chemistry, where measurement errors remain constant at low concentrations but increase proportionally as concentrations rise. To account for this, the model permits the parameters of the calibration curve, which relate instrument responses to true concentrations, to vary across different laboratories, thereby reflecting real-world variability in measurement processes. Traditional large-sample interval estimation methods are inadequate for small samples, leading to the use of an alternative approach, namely the fiducial approach. A calibration curve that accurately captures the heteroscedastic nature of the data results in more reliable estimates across diverse laboratory conditions. It turns out that the fiducial approach, when used to construct a confidence interval for an unknown concentration, produces a slightly wider interval while achieving the desired coverage probability. Applications considered include the determination of the presence of an analyte and the interval estimation of an unknown true analyte concentration. The proposed method is demonstrated for both simulated and real interlaboratory data, including examples involving copper and cadmium in distilled water.

2503.02903 2026-01-27 stat.ME math.ST stat.TH

Asymmetric Cross-Correlation in Multivariate Spatial Stochastic Processes: A Primer

Xiaoqing Chen

Comments 18 pages, 4 figures

Journal ref Proceedings of 7th International Conference on Statistics: Theory and Applications (ICSTA 25), Paris, France, Aug 2025, Paper No. 150, pp. 150-1-150-7

Details
English abstract

Multivariate spatial phenomena are ubiquitous, spanning domains such as climate, pandemics, air quality, and social economy. Cross-correlation between different quantities of interest at different locations is asymmetric in general. This paper provides the visualization, structure, and properties of asymmetric cross-correlation as well as symmetric auto-correlation. It reviews mainstream multivariate spatial models and analyzes their capability to accommodate asymmetric cross-correlation. It also illustrates the difference in model accuracy with and without asymmetric accommodation using a 1D simulated example.

2503.02536 2026-01-27 math.AC math.AG math.CO math.ST stat.TH

The Likelihood Correspondence

Thomas Kahle, Hal Schenck, Bernd Sturmfels, Maximilian Wiesmann

Comments 18 pages, v3: final version to appear in JFoCM, Comments are welcome!

Details
English abstract

An arrangement of hypersurfaces in projective space is strict normal crossing (SNC) if and only if its Euler discriminant is nonzero. We study the critical loci of arbitrary Laurent monomials in the equations of the smooth hypersurfaces. The family of these loci forms an irreducible variety in the product of two projective spaces, known in algebraic statistics as the likelihood correspondence and in particle physics as the scattering correspondence. We establish an explicit determinantal representation for the minimal generators of the bihomogeneous prime ideal that defines this variety.

2502.10578 2026-01-27 math.ST stat.TH

Implicit vs. explicit regularization for high-dimensional gradient descent

Thomas Stark, Lukas Steinberger

Details
English abstract

In this paper we investigate the generalization error of gradient descent (GD) applied to an $\ell_2$-regularized OLS objective function in the linear model. Based on our analysis we develop new methodology for computationally tractable and statistically efficient linear prediction in a high-dimensional and massive data scenario (large-$n$, large-$p$). Our results are based on the surprising observation that the generalization error of optimally tuned regularized gradient descent approaches that of an optimal benchmark procedure monotonically in the iteration number $t$. On the other hand, standard GD for OLS (without explicit regularization) can achieve the benchmark only in degenerate cases. This shows that (optimal) explicit regularization can be nearly statistically efficient (for large $t$) whereas implicit regularization by (optimal) early stopping cannot. To complete our methodology, we provide a fully data-driven and computationally tractable choice of the $\ell_2$ regularization parameter $λ$ that is computationally cheaper than cross-validation. In doing so, we follow and extend ideas of Dicker (2014) to the non-Gaussian case, which requires new results on high-dimensional sample covariance matrices that might be of independent interest.
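As a rough illustration of the setting, the following minimal sketch runs gradient descent on a ridge-type objective and compares the iterate with the closed-form minimizer in two dimensions. The objective scaling (1/(2n))·||y − Xβ||² + (λ/2)·||β||², the step size, the simulated data, and all function names are assumptions made for this example, not the paper's exact formulation or tuning rule.

```python
import random


def ridge_gd(X, y, lam, step, iters):
    """Gradient descent on (1/(2n))*||y - X b||^2 + (lam/2)*||b||^2, started at zero."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
        grad = [-sum(X[i][j] * resid[i] for i in range(n)) / n + lam * beta[j]
                for j in range(p)]
        beta = [beta[j] - step * grad[j] for j in range(p)]
    return beta


def ridge_closed_form_2d(X, y, lam):
    """Solve (X^T X / n + lam * I) b = X^T y / n explicitly for p = 2."""
    n = len(X)
    a = sum(r[0] * r[0] for r in X) / n + lam
    b = sum(r[0] * r[1] for r in X) / n
    d = sum(r[1] * r[1] for r in X) / n + lam
    c0 = sum(X[i][0] * y[i] for i in range(n)) / n
    c1 = sum(X[i][1] * y[i] for i in range(n)) / n
    det = a * d - b * b
    return [(d * c0 - b * c1) / det, (a * c1 - b * c0) / det]


# Simulated data (hypothetical): y = x0 - 2*x1 + small Gaussian noise.
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(200)]
y = [row[0] - 2.0 * row[1] + 0.1 * rng.gauss(0.0, 1.0) for row in X]

beta_gd = ridge_gd(X, y, lam=0.1, step=0.1, iters=2000)
beta_exact = ridge_closed_form_2d(X, y, lam=0.1)
```

With explicit regularization the iterates converge to the ridge solution as the iteration count grows, whereas unregularized GD would have to be stopped early to obtain a comparable shrinkage effect.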

2410.11743 2026-01-27 stat.ME stat.AP

Causal Inference Using Augmented Epidemic Models

Heejong Bong, Valérie Ventura, Larry Wasserman

Details
English abstract

Epidemic models describe the evolution of a communicable disease over time. These models are often modified to include the effects of interventions (control measures) such as vaccination, social distancing, school closings etc. Many such models were proposed during the COVID-19 epidemic. Inevitably these models are used to answer the question: What is the effect of the intervention on the epidemic? These models can either be interpreted as data generating models describing observed random variables or as causal models for counterfactual random variables. These two interpretations are often conflated in the literature. We discuss the difference between these two types of models, and then we discuss how to estimate the parameters of the model. Our focus is causal inference for parameters in epidemic models by adjusting for confounders, allowing time varying interventions.

2410.05263 2026-01-27 stat.ML cs.AI cs.LG math.ST stat.ME stat.TH

Bias-Aware Conformal Prediction for Metric-Based Imaging Pipelines

Matt Y. Cheung, Tucker J. Netherton, Laurence E. Court, Ashok Veeraraghavan, Guha Balakrishnan

Comments 7 pages, 1 figure, accepted at ISBI 2026

Details
English abstract

Reliable confidence measures of metrics derived from medical imaging reconstruction pipelines would improve the standard of decision-making in many clinical workflows. Conformal Prediction (CP) provides a robust framework for producing calibrated prediction intervals, but standard CP formulations face a critical challenge in the imaging pipeline: common mismatches between image reconstruction objectives and downstream metrics can introduce systematic prediction deviations from ground truth values, known as bias. These biases in turn compromise the efficiency of prediction intervals, which is a problem that has been unexplored in the CP literature. In this study, we formalize the behavior of symmetric (where bounds expand equally in both directions) and asymmetric (where bounds expand unequally) formulations for common non-conformity scores in CP in the presence of bias, and argue that this measurable bias must inform the choice of CP formulation. We theoretically and empirically demonstrate that symmetric intervals are inflated by a factor of two times the magnitude of bias while asymmetric intervals remain unaffected by bias, and provide conditions under which each formulation produces tighter intervals. We empirically validated our theoretical analyses on sparse-view CT reconstruction for downstream radiotherapy planning. Our work enables users of medical imaging pipelines to proactively select optimal CP formulations, thereby improving interval length efficiency for critical downstream metrics.
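The contrast between symmetric and asymmetric formulations can be sketched with split-conformal quantiles on simulated residuals. This is a minimal illustration under assumed Gaussian noise and a constant additive bias; the score definitions and names below are generic choices for the example, not the paper's exact constructions.

```python
import math
import random


def conformal_quantile(scores, level):
    """Finite-sample corrected quantile: the ceil((n+1)*level)-th smallest score."""
    n = len(scores)
    k = math.ceil((n + 1) * level)
    return sorted(scores)[k - 1] if k <= n else math.inf


def symmetric_length(residuals, alpha):
    """Length of [pred - q, pred + q], with q calibrated on absolute residuals."""
    return 2.0 * conformal_quantile([abs(r) for r in residuals], 1.0 - alpha)


def asymmetric_length(residuals, alpha):
    """Length of [pred - lower, pred + upper], each tail calibrated at alpha/2."""
    upper = conformal_quantile(residuals, 1.0 - alpha / 2.0)
    lower = conformal_quantile([-r for r in residuals], 1.0 - alpha / 2.0)
    return upper + lower


# Calibration residuals (truth minus prediction): pure noise vs. noise plus bias.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]
bias = 5.0
biased = [bias + e for e in noise]
alpha = 0.1

sym_unbiased = symmetric_length(noise, alpha)
sym_biased = symmetric_length(biased, alpha)
asym_unbiased = asymmetric_length(noise, alpha)
asym_biased = asymmetric_length(biased, alpha)
```

In this toy setup, shifting every residual by the bias moves both asymmetric endpoints together and leaves the length unchanged, while the symmetric interval must widen on both sides to cover the shifted residuals.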

2409.20207 2026-01-27 math.SP math.CO math.FA math.PR math.ST stat.TH

New matrix perturbation bounds with relative norm: Perturbation of eigenspaces

Phuc Tran, Van Vu

Details
English abstract

Matrix perturbation bounds (such as Weyl and Davis-Kahan) are used abundantly in many areas of mathematics and data science. Many bounds (such as the above two) involve the spectral norm of the noise matrix and are sharp in worst-case analysis. In order to refine these classical bounds, we introduce a new parameter, which we refer to as the relative norm. This parameter measures the strength of the action of the noise matrix on the relevant eigenvectors of the ground matrix. It turns out that in a number of situations, we can use the relative norm as a replacement for the spectral norm. This has led to a number of notable improvements under certain sets of assumptions, which are frequently met in practice. For instance, our new results apply very well in the case when the noise matrix is random. For the purpose of our study, we introduce a new method of analysis, which combines the classical contour integral argument with new (combinatorial) ideas. This method is robust and of independent interest. In the current paper, we focus on the perturbation of eigenspaces (Davis-Kahan type results). Perturbation bounds for eigenspaces are essential in statistics and theoretical computer science, and thus deserve a special treatment. Furthermore, this will lay the ground for the more technical treatment of general matrix functionals, which will appear in a future paper.

2408.10396 2026-01-27 stat.ME math.ST stat.TH

Highly Multivariate Large-scale Spatial Stochastic Processes -- A Cross-Markov Random Field Approach

Xiaoqing Chen, Peter Diggle, James V. Zidek, Gavin Shaddick

Comments 54 pages; 10 figures

Details
English abstract

Key challenges in the analysis of highly multivariate large-scale spatial stochastic processes, where both the number of components (p) and spatial locations (n) can be large, include achieving maximal sparsity in the joint precision matrix, ensuring efficient computational cost for its generation, accommodating asymmetric cross-covariance in the joint covariance matrix, and delivering scientific interpretability. We propose a cross-MRF model class, consisting of a mixed spatial graphical model framework and cross-MRF theory, to collectively address these challenges in one unified framework across two modelling stages. The first stage exploits scientifically informed conditional independence (CI) among p component fields and allows for a step-wise parallel generation of joint covariance and precision matrix, enabling a simultaneous accommodation of asymmetric cross-covariance in joint covariance matrix and sparsity in joint precision matrix. The second stage extends the first-stage CI to doubly CI among both p and n and unearths the cross-MRF via an extended Hammersley-Clifford theorem for multivariate spatial stochastic processes. This results in the sparsest possible representation of the joint precision matrix and ensures its lowest generation complexity. We demonstrate with 1D simulated comparative studies and 2D real-world data.

2408.09288 2026-01-27 math.ST stat.TH

ARMAr-LASSO: Mitigating the Impact of Predictor Serial Correlation on the LASSO

Simone Tonini, Francesca Chiaromonte, Alessandro Giovannelli

Comments 58 pages, 4 Figures, 3 Tables. arXiv admin note: substantial text overlap with arXiv:2208.00727

Details
English abstract

We explore estimation and forecast accuracy for sparse linear models, focusing on scenarios where both predictors and errors carry serial correlations. We establish a clear link between predictor serial correlation and the performance of the LASSO, showing that even orthogonal or weakly correlated stationary AR processes can lead to significant spurious correlations due to their serial correlations. To address this challenge, we propose a novel approach named ARMAr-LASSO (ARMA residuals LASSO), which applies the LASSO to predictors that have been pre-whitened with ARMA filters and lags of the dependent variable. We derive both asymptotic results and oracle inequalities for the ARMAr-LASSO, demonstrating that it effectively reduces estimation errors while also providing an effective forecasting and feature selection strategy. Our findings are supported by extensive simulations and an application to real-world macroeconomic data, which highlight the superior performance of the ARMAr-LASSO for handling sparse linear models in the context of time series.

2407.13169 2026-01-27 stat.ME

Combining Climate Models using Bayesian Regression Trees and Random Paths

John C. Yannotty, Thomas J. Santner, Bo Li, Matthew T. Pratola

Comments 46 pages, 17 figures

Details
English abstract

General circulation models (GCMs) are essential tools for climate studies. Such climate models may have varying accuracy across the input domain, but no model is uniformly best. One can improve climate model prediction performance by integrating multiple models using input-dependent weights. Weight functions modeled using Bayesian Additive Regression Trees (BART) were recently shown to be useful in nuclear physics applications. However, a restriction of that approach was the piecewise constant weight functions. To smoothly integrate multiple climate models, we propose a new tree-based model, Random Path BART (RPBART), that incorporates random path assignments in BART to produce smooth weight functions and smooth predictions, all in a matrix-free formulation. RPBART requires a more complex prior specification, for which we introduce a semivariogram to guide hyperparameter selection. This approach is easy to interpret, computationally cheap, and avoids expensive cross-validation. Finally, we propose a posterior projection technique to enable detailed analysis of the fitted weight functions. This allows us to identify a sparse set of climate models that recovers the underlying system within a given spatial region as well as quantifying model discrepancy given the available model set. Our method is demonstrated on an ensemble of 8 GCMs modeling average monthly surface temperature.

2407.10089 2026-01-27 stat.ME stat.AP stat.CO

The inverse Kalman filter

Xinyi Fang, Mengyang Gu

Comments 17 pages, 8 figures, 2 tables

Journal ref Biometrika, Volume 112, Issue 4, 2025, asaf054

Details
English abstract

We introduce the inverse Kalman filter, which enables exact matrix-vector multiplication between a covariance matrix from a dynamic linear model and any real-valued vector with linear computational cost. We integrate the inverse Kalman filter with the conjugate gradient algorithm, which substantially accelerates the computation of matrix inversion for a general form of covariance matrix, where other approximation approaches may not be directly applicable. We demonstrate the scalability and efficiency of the proposed approach through applications in nonparametric estimation of particle interaction functions, using both simulations and cell trajectories from microscopy data.
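To show why a fast matrix-vector product is enough to solve linear systems, here is a minimal conjugate gradient sketch that touches the matrix only through a `matvec` callback. The tridiagonal operator below is a hypothetical stand-in with linear-cost multiplication; it is not the inverse Kalman filter, whose recursion the abstract does not spell out.

```python
def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A, given only v -> A v."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x for the zero starting point
    p = list(r)            # first search direction
    rs_old = sum(v * v for v in r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(v * v for v in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


def tridiagonal_matvec(v):
    """Stand-in O(n) product with the SPD matrix having 3 on the diagonal, -1 off it."""
    n = len(v)
    out = []
    for i in range(n):
        s = 3.0 * v[i]
        if i > 0:
            s -= v[i - 1]
        if i < n - 1:
            s -= v[i + 1]
        out.append(s)
    return out


b = [1.0] * 50
x = conjugate_gradient(tridiagonal_matvec, b)
residual = max(abs(bi - ai) for bi, ai in zip(b, tridiagonal_matvec(x)))
```

Each iteration costs one call to `matvec`, so replacing a dense quadratic-cost product with a linear-cost one reduces the cost of every conjugate gradient step accordingly.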

2404.14019 2026-01-27 cs.CV eess.SP stat.AP

A Multimodal Feature Distillation with Mamba-Transformer Network for Brain Tumor Segmentation with Incomplete Modalities

Ming Kang, Fung Fung Ting, Shier Nee Saw, Raphaël C.-W. Phan, Zongyuan Ge, Chee-Ming Ting

Comments 14 pages, 5 figures

Details
English abstract

Existing brain tumor segmentation methods usually utilize multiple Magnetic Resonance Imaging (MRI) modalities in brain tumor images for segmentation, which can achieve better segmentation performance. However, in clinical applications, some modalities are often missing due to resource constraints, resulting in significant performance degradation for methods that rely on complete modality segmentation. In this paper, we propose a Multimodal feature distillation with Mamba-Transformer hybrid network (MMTSeg) for accurate brain tumor segmentation with missing modalities. We first employ a Multimodal Feature Distillation (MFD) module to distill feature-level multimodal knowledge into different unimodalities to extract complete modality information. We further develop an Unimodal Feature Enhancement (UFE) module to model the semantic relationship between global and local information. Finally, we built a Cross-Modal Fusion (CMF) module to explicitly align the global correlations across modalities, even when some modalities are missing. Complementary features within and across modalities are refined by the Mamba-Transformer hybrid architectures in both the UFE and CMF modules, dynamically capturing long-range dependencies and global semantic information for complex spatial contexts. A boundary-wise loss function is employed as the segmentation loss of the proposed MMTSeg to minimize boundary discrepancies for a distance-based metric. Our ablation study demonstrates the importance of the proposed feature enhancement and fusion modules in the proposed network and the Transformer with Mamba block for improving the performance of brain tumor segmentation with missing modalities. Extensive experiments on the BraTS 2018 and BraTS 2020 datasets demonstrate that the proposed MMTSeg framework outperforms state-of-the-art methods when modalities are missing.

2403.01838 2026-01-27 stat.ME

The power of visualizing distributional differences: Formal graphical $n$-sample tests

Konstantinos Konstantinou, Tomáš Mrkvička, Mari Myllymäki

Comments 23 pages, 17 figures

Journal ref Computational Statistics 2025; 40, 2553-2582

Details
English abstract

Classical tests are available for the two-sample comparison of distribution functions. Among these, the Kolmogorov-Smirnov test also provides a graphical interpretation of the test results, in different forms. Here, we propose modifications of the Kolmogorov-Smirnov test with higher power. The proposed tests are based on the so-called global envelope test, which allows for graphical interpretation, similarly to the Kolmogorov-Smirnov test. The tests are based on rank statistics and are also suitable for the comparison of $n$ samples, with $n \geq 2$. We compare the alternatives for the two-sample case through an extensive simulation study and discuss their interpretation. Finally, we apply the tests to real data. Specifically, we compare the height distributions between boys and girls at different ages, the sepal length distributions of different flower species, and distributions of standardized residuals from a time series model for different exchange rates using the proposed methodologies.
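For reference, the classical two-sample Kolmogorov-Smirnov statistic that serves as the baseline here is the largest vertical gap between the two empirical distribution functions. The sketch below computes only that statistic; the global envelope modifications proposed in the paper are not shown, and the function name is a hypothetical choice.

```python
from bisect import bisect_right


def ks_two_sample_statistic(x, y):
    """Largest absolute difference between the empirical CDFs of x and y."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    # Both empirical CDFs are step functions, so the supremum is attained
    # at one of the pooled observations.
    return max(abs(bisect_right(xs, t) / n - bisect_right(ys, t) / m)
               for t in xs + ys)


identical = ks_two_sample_statistic([1, 2, 3], [1, 2, 3])
disjoint = ks_two_sample_statistic([1, 2, 3], [4, 5, 6])
overlapping = ks_two_sample_statistic([1, 2, 3, 4], [3, 4, 5, 6])
```

The graphical reading mentioned in the abstract corresponds to plotting the two empirical distribution functions and marking where this maximal gap occurs.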

2311.16025 2026-01-27 stat.ME

Change Point Inference for Non-Euclidean Data Sequences using Distance Profiles

Paromita Dubey, Minxing Zheng

Comments 34 pages, 10 figures

Details
English abstract

We introduce a powerful scan statistic and the corresponding test for detecting the presence and pinpointing the location of a change point within the distribution of a data sequence with the data elements residing in a separable metric space $(Ω, d)$. These change points mark abrupt shifts in the distribution of the data sequence as characterized using distance profiles, where the distance profile of an element $ω\in Ω$ is the distribution of distances from $ω$ as dictated by the data. This approach is tuning parameter free, fully non-parametric and universally applicable to diverse data types, including distributional and network data, as long as distances between the data objects are available. We obtain an explicit characterization of the asymptotic distribution of the test statistic under the null hypothesis of no change points, rigorous guarantees on the consistency of the test in the presence of change points under fixed and local alternatives and near-optimal convergence of the estimated change point location, all under practicable settings. To compare with state-of-the-art methods we conduct simulations covering multivariate data, bivariate distributional data and sequences of graph Laplacians, and illustrate our method on real data sequences of the U.S. electricity generation compositions and Bluetooth proximity networks.

2310.18047 2026-01-27 stat.ME

Robust Bayesian Inference on Riemannian Submanifold

Rong Tang, Anirban Bhattacharya, Debdeep Pati, Yun Yang

Details
English abstract

Manifold-valued parameters routinely arise in modern statistical applications such as in medical imaging, robotics, and computer vision, to name a few. While traditional Bayesian approaches are applicable to such settings by considering an ambient Euclidean space as the parameter space, we demonstrate the benefits of integrating manifold structure into the Bayesian framework, both theoretically and computationally. Moreover, existing Bayesian approaches which are designed specifically for manifold-valued parameters are primarily model-based, which are typically subject to inaccurate uncertainty quantification under model misspecification. In this article, we propose a robust model-free Bayesian inference for parameters defined on a Riemannian submanifold, which is shown to provide valid uncertainty quantification from a frequentist perspective. Computationally, we propose a Markov chain Monte Carlo algorithm to sample from the posterior on the Riemannian submanifold, where the mixing time, in the large sample regime, is shown to depend only on the intrinsic dimension of the parameter space instead of the potentially much larger ambient dimension. Our numerical results demonstrate the effectiveness of our approach on a variety of problems, such as multiple quantile regression, reduced-rank regression, and Fréchet mean estimation.

2111.09266 2026-01-27 cs.LG cs.AI stat.ML

GFlowNet Foundations

Yoshua Bengio, Salem Lahlou, Tristan Deleu, Edward J. Hu, Mo Tiwari, Emmanuel Bengio

Details
English abstract

Generative Flow Networks (GFlowNets) have been introduced as a method to sample a diverse set of candidates in an active learning context, with a training objective that makes them approximately sample in proportion to a given reward function. In this paper, we show a number of additional theoretical properties of GFlowNets. They can be used to estimate joint probability distributions and the corresponding marginal distributions where some variables are unspecified and, of particular interest, can represent distributions over composite objects like sets and graphs. GFlowNets amortize the work typically done by computationally expensive MCMC methods in a single but trained generative pass. They could also be used to estimate partition functions and free energies, conditional probabilities of supersets (supergraphs) given a subset (subgraph), as well as marginal distributions over all supersets (supergraphs) of a given set (graph). We introduce variations enabling the estimation of entropy and mutual information, sampling from a Pareto frontier, connections to reward-maximizing policies, and extensions to stochastic environments, continuous actions and modular energy functions.

2110.07478 2026-01-27 stat.ML cs.LG

Inferring manifolds using Gaussian processes

David B Dunson, Nan Wu

Comments 51 pages, 20 figures

Details
English abstract

It is often of interest to infer lower-dimensional structure underlying complex data. As a flexible class of non-linear structures, it is common to focus on Riemannian manifolds. Most existing manifold learning algorithms replace the original data with lower-dimensional coordinates without providing an estimate of the manifold or using the manifold to denoise the original data. This article proposes a new methodology to address these problems, allowing interpolation of the estimated manifold between the fitted data points. The proposed approach is motivated by the novel theoretical properties of local covariance matrices constructed from samples near a manifold. Our results enable us to turn a global manifold reconstruction problem into a local regression problem, allowing for the application of Gaussian processes for probabilistic manifold reconstruction. In addition to the theory justifying our methodology, we provide simulated and real data examples to illustrate the performance.

2102.06130 2026-01-27 stat.ME

Continuum centroid classifier for functional data

Zhiyang Zhou, Peijun Sang

Comments 38 pages, 4 figures, 2 tables

Journal ref Can J Statistics, 50: 200-220 (2022)

Details
English abstract

Aiming at the binary classification of functional data, we propose the continuum centroid classifier (CCC) built upon projections of functional data onto one specific direction. This direction is obtained via bridging the regression and classification. Controlling the extent of supervision, our technique is neither unsupervised nor fully supervised. Thanks to the intrinsic infinite dimension of functional data, one of two subtypes of CCC enjoys the (asymptotic) zero misclassification rate. Our proposal includes an effective algorithm that yields a consistent empirical counterpart of CCC. Simulation studies demonstrate the performance of CCC in different scenarios. Finally, we apply CCC to two real examples.

1901.07599 2026-01-27 stat.ME

Functional continuum regression

Zhiyang Zhou

Journal ref Journal of Multivariate Analysis 173: 328-346 (2019)

Details
English abstract

Functional principal component regression (PCR) can fail to provide good prediction if the response is highly correlated with some excluded functional principal component(s). This situation is common since the construction of functional principal components never involves the response. To address this shortcoming, we develop functional continuum regression (CR). The framework of functional CR includes, as special cases, both functional PCR and functional partial least squares (PLS). Functional CR is expected to achieve better accuracy than functional PCR and functional PLS in both estimation and prediction; evidence for this is provided through simulation and numerical case studies. We also demonstrate the consistency of estimators given by functional CR.

2601.17073 2026-01-27 cs.LG cs.CV stat.ML

Attention-Based Variational Framework for Joint and Individual Components Learning with Applications in Brain Network Analysis

Yifei Zhang, Meimei Liu, Zhengwu Zhang

Details
English abstract

Brain organization is increasingly characterized through multiple imaging modalities, most notably structural connectivity (SC) and functional connectivity (FC). Integrating these inherently distinct yet complementary data sources is essential for uncovering the cross-modal patterns that drive behavioral phenotypes. However, effective integration is hindered by the high dimensionality and non-linearity of connectome data, complex non-linear SC-FC coupling, and the challenge of disentangling shared information from modality-specific variations. To address these issues, we propose the Cross-Modal Joint-Individual Variational Network (CM-JIVNet), a unified probabilistic framework designed to learn factorized latent representations from paired SC-FC datasets. Our model utilizes a multi-head attention fusion module to capture non-linear cross-modal dependencies while isolating independent, modality-specific signals. Validated on Human Connectome Project Young Adult (HCP-YA) data, CM-JIVNet demonstrates superior performance in cross-modal reconstruction and behavioral trait prediction. By effectively disentangling joint and individual feature spaces, CM-JIVNet provides a robust, interpretable, and scalable solution for large-scale multimodal brain analysis.

2601.17010 2026-01-27 cs.LG stat.AP

Optimizing the Landscape of LLM Embeddings with Dynamic Exploratory Graph Analysis for Generative Psychometrics: A Monte Carlo Study

Hudson Golino

Comments 18 pages, 6 figures, conference paper

Details
English abstract

Large language model (LLM) embeddings are increasingly used to estimate dimensional structure in psychological item pools prior to data collection, yet current applications treat embeddings as static, cross-sectional representations. This approach implicitly assumes uniform contribution across all embedding coordinates and overlooks the possibility that optimal structural information may be concentrated in specific regions of the embedding space. This study reframes embeddings as searchable landscapes and adapts Dynamic Exploratory Graph Analysis (DynEGA) to systematically traverse embedding coordinates, treating the dimension index as a pseudo-temporal ordering analogous to intensive longitudinal trajectories. A large-scale Monte Carlo simulation embedded items representing five dimensions of grandiose narcissism using OpenAI's text-embedding-3-small model, generating network estimations across systematically varied item pool sizes (3-40 items per dimension) and embedding depths (3-1,298 dimensions). Results reveal that the Total Entropy Fit Index (TEFI) and Normalized Mutual Information (NMI) lead to competing optimization trajectories across the embedding landscape. TEFI achieves minima at deep embedding ranges (900-1,200 dimensions) where entropy-based organization is maximal but structural accuracy degrades, whereas NMI peaks at shallow depths where dimensional recovery is strongest but entropy-based fit remains suboptimal. Single-metric optimization produces structurally incoherent solutions, whereas a weighted composite criterion identifies embedding depth regions that jointly balance accuracy and organization. Optimal embedding depth scales systematically with item pool size. These findings establish embedding landscapes as non-uniform semantic spaces requiring principled optimization rather than default full-vector usage.

2601.16999 2026-01-27 cs.CL cs.LG stat.ML

Uncertainty Quantification for Named Entity Recognition via Full-Sequence and Subsequence Conformal Prediction

Matthew Singer, Srijan Sengupta, Karl Pazdernik

Details
English abstract

Named Entity Recognition (NER) serves as a foundational component in many natural language processing (NLP) pipelines. However, current NER models typically output a single predicted label sequence without any accompanying measure of uncertainty, leaving downstream applications vulnerable to cascading errors. In this paper, we introduce a general framework for adapting sequence-labeling-based NER models to produce uncertainty-aware prediction sets. These prediction sets are collections of full-sentence labelings that are guaranteed to contain the correct labeling with a user-specified confidence level. This approach serves a role analogous to confidence intervals in classical statistics by providing formal guarantees about the reliability of model predictions. Our method builds on conformal prediction, which offers finite-sample coverage guarantees under minimal assumptions. We design efficient nonconformity scoring functions to construct efficient, well-calibrated prediction sets that support both unconditional and class-conditional coverage. This framework accounts for heterogeneity across sentence length, language, entity type, and number of entities within a sentence. Empirical experiments on four NER models across three benchmark datasets demonstrate the broad applicability, validity, and efficiency of the proposed methods.
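A full-sequence prediction set of the kind described above can be sketched with split conformal prediction. The sketch assumes the model assigns a probability to each candidate labeling of a sentence and uses one minus that probability as the nonconformity score; this score, the toy numbers, and the function names are illustrative assumptions, not the scoring functions designed in the paper.

```python
import math


def calibrate_threshold(true_sequence_probs, alpha):
    """Conformal threshold from model probabilities of the correct labelings."""
    scores = sorted(1.0 - p for p in true_sequence_probs)
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    # If k exceeds n, every candidate must be kept; scores never exceed 1.
    return scores[k - 1] if k <= n else 1.0


def prediction_set(candidate_probs, threshold):
    """All candidate labelings whose nonconformity score is within the threshold."""
    return [labels for labels, p in candidate_probs.items()
            if 1.0 - p <= threshold]


# Hypothetical calibration data: probability given to the true labeling per sentence.
calibration = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.35, 0.3, 0.2]
threshold = calibrate_threshold(calibration, alpha=0.25)

# Hypothetical candidate labelings for a new two-token sentence.
candidates = {
    ("B-PER", "O"): 0.50,
    ("O", "O"): 0.35,
    ("B-ORG", "O"): 0.15,
}
kept = prediction_set(candidates, threshold)
```

Under exchangeability of calibration and test sentences, a set built this way contains the correct labeling with probability at least 1 − alpha, which is the marginal (unconditional) guarantee; class-conditional coverage would require calibrating separately within each class.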