arXivDaily: arXiv daily academic digest, updated Monday through Friday
2603.15609 2026-03-17 stat.AP cs.CR cs.CY cs.SI physics.soc-ph

Differential Privacy for Network Connectedness Indices

Tom A. Rutter, Yuxin Liu, M. Amin Rahimian

Comments: Code to replicate all of our analyses is available at: https://github.com/TomRutter42/Privacy-for-Connectedness-Indices

Abstract

Researchers increasingly use data on social and economic networks to study a range of social science questions, but releasing statistics derived from networks can raise significant privacy concerns. We show how to release network connectedness indices that quantify assortative mixing across node attributes under edge-adjacent differential privacy. Standard privacy techniques perform poorly in this setting both because connectedness indices have high global sensitivity and because a single node's attribute can potentially be an input to connectedness in thousands of cells, leading to poor composition. Our method, which is straightforward to apply, first adds noise to node attributes, then analytically debiases downstream statistics, and finally applies a second layer of noise to protect the presence or absence of individual edges. We prove consistency and asymptotic normality of our estimators for both discrete and continuous labels and show our method works well in simulations and on real networks with as few as 200 nodes collected by social scientists.
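
The "add noise to node attributes, then debias" step can be illustrated with binary randomized response, a generic textbook mechanism. This is only a sketch under stated assumptions, not the authors' estimator: the function names, the binary label, and the value of epsilon are all hypothetical choices made for illustration.

```python
import math
import random

def randomized_response(labels, epsilon, rng):
    """Keep each binary label with probability e^eps / (1 + e^eps), else flip it."""
    p_keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return [y if rng.random() < p_keep else 1 - y for y in labels]

def debiased_share(noisy_labels, epsilon):
    """Unbiased estimate of the true share of 1s, computed from noisy labels."""
    p_keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    noisy_share = sum(noisy_labels) / len(noisy_labels)
    # E[noisy_share] = p_keep * s + (1 - p_keep) * (1 - s); solve for s.
    return (noisy_share - (1.0 - p_keep)) / (2.0 * p_keep - 1.0)

rng = random.Random(0)                       # fixed seed for reproducibility
true_labels = [1] * 3000 + [0] * 7000        # hypothetical attribute, true share 0.3
noisy = randomized_response(true_labels, epsilon=1.0, rng=rng)
estimate = debiased_share(noisy, epsilon=1.0)
```

The raw share of 1s in `noisy` is pulled toward 0.5 by the flips; the analytic correction removes that bias at the cost of extra variance, which grows as epsilon shrinks.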

2603.15578 2026-03-17 stat.ME stat.ML

Low-Complexity and Consistent Graphon Estimation from Multiple Networks

Roland Boniface Sogan, Tabea Rebafka

Comments: Accepted at AISTATS 2026

Abstract

Recovering the random graph model from an observed collection of networks is known to present significant challenges in the setting where the networks do not share a common node set and have different sizes. More specifically, the goal is the estimation of the graphon function that parametrizes the nonparametric exchangeable random graph model. Existing methods typically suffer from either limited accuracy or high computational complexity. We introduce a new histogram-based estimator with low algorithmic complexity that achieves high accuracy by jointly aligning the nodes of all graphs, in contrast to most conventional methods that order nodes graph by graph. Consistency results for the proposed graphon estimator are established. A numerical study shows that the proposed estimator outperforms existing methods in terms of accuracy, especially when the dataset comprises only small and variable-size networks. Moreover, the computing time of the new method is considerably shorter than that of other consistent methodologies. Additionally, when applied to a graph neural network classification task, the proposed estimator enables more effective data augmentation, yielding improved performance across diverse real-world datasets.

2603.15576 2026-03-17 cs.LG math.OC stat.ML

Unbiased and Biased Variance-Reduced Forward-Reflected-Backward Splitting Methods for Stochastic Composite Inclusions

Quoc Tran-Dinh, Nghia Nguyen-Trung

Comments: 34 pages and 2 figures

Abstract

This paper develops new variance-reduction techniques for the forward-reflected-backward splitting (FRBS) method to solve a class of possibly nonmonotone stochastic composite inclusions. Unlike unbiased estimators such as mini-batching, stochastic biased variants pose a fundamental technical challenge and have not been used before for inclusions and fixed-point problems. We fill this gap by designing a new framework that can handle both unbiased and biased estimators. Our main idea is to construct stochastic variance-reduced estimators for the forward-reflected direction and use them to perform iterate updates. First, we propose a class of unbiased variance-reduced estimators and show that increasing mini-batch SGD, loopless-SVRG, and SAGA estimators fall within this class. For these unbiased estimators, we establish an $\mathcal{O}(1/k)$ best-iterate convergence rate for the expected squared residual norm, together with almost-sure convergence of the iterate sequence to a solution. Consequently, we prove that the best oracle complexities for the $n$-finite-sum and expectation settings are $\mathcal{O}(n^{2/3}ε^{-2})$ and $\mathcal{O}(ε^{-10/3})$, respectively, when employing loopless-SVRG or SAGA, where $ε$ is a desired accuracy. Second, we introduce a new class of biased variance-reduced estimators for the forward-reflected direction, which includes SARAH, Hybrid SGD, and Hybrid SVRG as special instances. While the convergence rates remain valid for these biased estimators, the resulting oracle complexities are $\mathcal{O}(n^{3/4}ε^{-2})$ and $\mathcal{O}(ε^{-5})$ for the $n$-finite-sum and expectation settings, respectively. Finally, we conduct two numerical experiments on AUC optimization for imbalanced classification and policy evaluation in reinforcement learning.

2603.15568 2026-03-17 stat.ML cs.LG

Estimating Staged Event Tree Models via Hierarchical Clustering on the Simplex

Muhammad Shoaib, Eva Riccomagno, Manuele Leonelli, Gherardo Varando

Abstract

Staged tree models enhance Bayesian networks by incorporating context-specific dependencies through a stage-based structure. In this study, we present a new framework for estimating staged trees using hierarchical clustering on the probability simplex, utilizing simplex-based divergences. We conduct a thorough evaluation of several distance and divergence measures, including Total Variation, Hellinger, Fisher, and Kaniadakis, alongside various linkage methods such as Ward.D2, average, complete, and McQuitty. Our simulation experiments reveal that Total Variation, especially when combined with Ward.D2 linkage, consistently produces staged trees with better model fit, structure recovery, and computational efficiency. We assess performance using the relative Bayesian Information Criterion (BIC) and Hamming distance. Our findings indicate that although Backward Hill Climbing (BHC) delivers competitive outcomes, it incurs a significantly higher computational cost. On the other hand, Total Variation divergence with Ward.D2 linkage achieves similar performance while providing significantly better computational efficiency, making it a more viable option for large-scale or time-sensitive tasks.
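
To make the simplex distances concrete, the sketch below computes Total Variation and Hellinger distances between conditional probability vectors and picks the closest pair, which is what the first merge of an agglomerative clustering would join. The vertex names and probabilities are made up for illustration, and the linkage step (for example Ward.D2) is not reproduced here.

```python
import math
from itertools import combinations

def total_variation(p, q):
    """Total Variation distance: half the L1 distance between two probability vectors."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def hellinger(p, q):
    """Hellinger distance, scaled so that it lies in [0, 1]."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

# Hypothetical conditional distributions attached to three tree vertices.
vertex_probs = {
    "v1": [0.70, 0.30],
    "v2": [0.68, 0.32],
    "v3": [0.20, 0.80],
}

# First agglomerative step: find the pair of vertices with the smallest distance.
closest_pair = min(
    combinations(sorted(vertex_probs), 2),
    key=lambda pair: total_variation(vertex_probs[pair[0]], vertex_probs[pair[1]]),
)
```

Vertices merged into the same cluster would share a stage, meaning they are assigned the same conditional distribution.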

2603.15564 2026-03-17 cs.LG stat.AP stat.ML

Predictive Uncertainty in Short-Term PV Forecasting under Missing Data: A Multiple Imputation Approach

Parastoo Pashmchi, Jérôme Benoit, Motonobu Kanagawa

Comments: 10 pages

Abstract

Missing values are common in photovoltaic (PV) power data, yet the uncertainty they induce is not propagated into predictive distributions. We develop a framework that incorporates missing-data uncertainty into short-term PV forecasting by combining stochastic multiple imputation with Rubin's rule. The approach is model-agnostic and can be integrated with standard machine-learning predictors. Empirical results show that ignoring missing-data uncertainty leads to overly narrow prediction intervals. Accounting for this uncertainty improves interval calibration while maintaining comparable point prediction accuracy. These results demonstrate the importance of propagating imputation uncertainty in data-driven PV forecasting.
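
Rubin's rules for pooling results across imputed datasets can be sketched in a few lines. The function name and the numbers below are hypothetical; the pooling formulas themselves are the standard ones (mean of the estimates, and total variance equal to the within-imputation variance plus an inflated between-imputation variance).

```python
from statistics import mean, variance

def rubin_pool(estimates, within_variances):
    """Pool m point estimates and their variances from m imputed datasets."""
    m = len(estimates)
    pooled_estimate = mean(estimates)
    within = mean(within_variances)      # average within-imputation variance
    between = variance(estimates)        # sample variance across imputations (ddof = 1)
    total_variance = within + (1.0 + 1.0 / m) * between
    return pooled_estimate, total_variance

# Hypothetical forecasts (and their variances) from three imputed copies of the data.
pooled, total_var = rubin_pool([10.0, 12.0, 11.0], [1.0, 1.0, 1.0])
```

Dropping the between-imputation term is what yields the overly narrow intervals the abstract describes: the within-imputation variance alone treats the imputed values as if they had been observed.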

2603.15388 2026-03-17 cs.LG cs.AI cs.RO stat.ML

Efficient Morphology-Control Co-Design via Stackelberg Proximal Policy Optimization

Yanning Dai, Yuhui Wang, Dylan R. Ashley, Jürgen Schmidhuber

Comments: presented at the Fourteenth International Conference on Learning Representations; 11 pages in main text + 3 pages of references + 23 pages of appendices, 5 figures in main text + 11 figures in appendices, 16 tables in appendices; accompanying website available at https://yanningdai.github.io/stackelberg-ppo-co-design/ ; source code available at https://github.com/YanningDai/StackelbergPPO

Abstract

Morphology-control co-design concerns the coupled optimization of an agent's body structure and control policy. This problem exhibits a bi-level structure, where the control dynamically adapts to the morphology to maximize performance. Existing methods typically neglect the control's adaptation dynamics by adopting a single-level formulation that treats the control policy as fixed when optimizing morphology. This can lead to inefficient optimization, as morphology updates may be misaligned with control adaptation. In this paper, we revisit the co-design problem from a game-theoretic perspective, modeling the intrinsic coupling between morphology and control as a novel variant of a Stackelberg game. We propose Stackelberg Proximal Policy Optimization (Stackelberg PPO), which explicitly incorporates the control's adaptation dynamics into morphology optimization. By modeling this intrinsic coupling, our method aligns morphology updates with control adaptation, thereby stabilizing training and improving learning efficiency. Experiments across diverse co-design tasks demonstrate that Stackelberg PPO outperforms standard PPO in both stability and final performance, opening the way for dramatically more efficient robotics designs.

2603.15384 2026-03-17 stat.ML cs.LG math.AT math.ST stat.TH

Persistence Spheres: a Bi-continuous Linear Representation of Measures for Partial Optimal Transport

Matteo Pegoraro

Abstract

We improve and extend persistence spheres, introduced in~\cite{pegoraro2025persistence}. Persistence spheres map an integrable measure $μ$ on the upper half-plane, including persistence diagrams (PDs) as counting measures, to a function $S(μ)\in C(\mathbb{S}^2)$, and the map is stable with respect to 1-Wasserstein partial transport distance $\mathrm{POT}_1$. Moreover, to the best of our knowledge, persistence spheres are the first explicit representation used in topological machine learning for which continuity of the inverse on the image is established at every compactly supported target. Recent bounded-cardinality bi-Lipschitz embedding results in partial transport spaces, despite being powerful, are not given by the kind of explicit summary map considered here. Our construction is rooted in convex geometry: for positive measures, the defining ReLU integral is the support function of the lift zonoid. Building on~\cite{pegoraro2025persistence}, we refine the definition to better match the $\mathrm{POT}_1$ deletion mechanism, encoding partial transport via a signed diagonal augmentation. In particular, for integrable $μ$, the uniform norm between $S(0)$ and $S(μ)$ depends only on the persistence of $μ$, without any need of ad-hoc re-weightings, reflecting optimal transport to the diagonal at persistence cost. This yields a parameter-free representation at the level of measures (up to numerical discretization), while accommodating future extensions where $μ$ is a smoothed measure derived from PDs (e.g., persistence intensity functions~\citep{wu2024estimation}). Across clustering, regression, and classification tasks involving functional data, time series, graphs, meshes, and point clouds, the updated persistence spheres are competitive and often improve upon persistence images, persistence landscapes, persistence splines, and sliced Wasserstein kernel baselines.

2603.15340 2026-03-17 cs.CL stat.ML

DOS: Dependency-Oriented Sampler for Masked Diffusion Language Models

Xueyu Zhou, Yangrong Hu, Jian Huang

Comments: 16 pages, 5 figures

Abstract

Masked diffusion language models (MDLMs) have recently emerged as a new paradigm in language modeling, offering flexible generation dynamics and enabling efficient parallel decoding. However, existing decoding strategies for pre-trained MDLMs predominantly rely on token-level uncertainty criteria, while largely overlooking sequence-level information and inter-token dependencies. To address this limitation, we propose Dependency-Oriented Sampler (DOS), a training-free decoding strategy that leverages inter-token dependencies to inform token updates during generation. Specifically, DOS exploits attention matrices from transformer blocks to approximate inter-token dependencies, emphasizing information from unmasked tokens when updating masked positions. Empirical results demonstrate that DOS consistently achieves superior performance on both code generation and mathematical reasoning tasks. Moreover, DOS can be seamlessly integrated with existing parallel sampling methods, leading to improved generation efficiency without sacrificing generation quality.

2603.15336 2026-03-17 stat.ML cs.LG

Active Seriation: Efficient Ordering Recovery with Statistical Guarantees

James Cheshire, Yann Issartel

Abstract

Active seriation aims at recovering an unknown ordering of $n$ items by adaptively querying pairwise similarities. The observations are noisy measurements of entries of an underlying $n \times n$ permuted Robinson matrix, whose permutation encodes the latent ordering. The framework allows the algorithm to start with partial information on the latent ordering, including seriation from scratch as a special case. We propose an active seriation algorithm that provably recovers the latent ordering with high probability. Under a uniform separation condition on the similarity matrix, optimal performance guarantees are established, both in terms of the probability of error and the number of observations required for successful recovery.
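
For readers unfamiliar with the term, a Robinson similarity matrix is a symmetric matrix whose entries do not increase when moving away from the diagonal along any row or column. The sketch below checks that property on a noiseless toy matrix and shows that permuting the items breaks it. It illustrates the definition only, not the paper's active algorithm, and the helper names are hypothetical.

```python
def is_robinson(a):
    """Check the Robinson property of a symmetric similarity matrix (diagonal ignored)."""
    n = len(a)
    for i in range(n):
        for j in range(i + 2, n):
            # a[i][j] may not exceed its neighbours that lie closer to the diagonal.
            if a[i][j] > a[i][j - 1] or a[i][j] > a[i + 1][j]:
                return False
    return True

def permute(a, order):
    """Reorder rows and columns of a according to the given item order."""
    return [[a[i][j] for j in order] for i in order]

robinson = [
    [3, 2, 1],
    [2, 3, 2],
    [1, 2, 3],
]
shuffled = permute(robinson, [1, 0, 2])   # swap the first two items
```

Seriation amounts to finding an order under which the permuted matrix passes this check; here `permute(shuffled, [1, 0, 2])` restores the original matrix.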

2603.15292 2026-03-17 stat.ML cs.AI cs.LG

Scalable Simulation-Based Model Inference with Test-Time Complexity Control

Manuel Gloeckler, J. P. Manzano-Patrón, Stamatios N. Sotiropoulos, Cornelius Schröder, Jakob H. Macke

Abstract

Simulation plays a central role in scientific discovery. In many applications, the bottleneck is no longer running a simulator; it is choosing among large families of plausible simulators, each corresponding to different forward models/hypotheses consistent with observations. Over large model families, classical Bayesian workflows for model selection are impractical. Furthermore, amortized model selection methods typically hard-code a fixed model prior or complexity penalty at training time, requiring users to commit to a particular parsimony assumption before seeing the data. We introduce PRISM, a simulation-based encoder-decoder that infers a joint posterior over both discrete model structures and associated continuous parameters, while enabling test-time control of model complexity via a tunable model prior that the network is conditioned on. We show that PRISM scales to families with combinatorially many (up to billions of) model instantiations on a synthetic symbolic regression task. As a scientific application, we evaluate PRISM on biophysical modeling for diffusion MRI data, showing the ability to perform model selection across several multi-compartment models, on both synthetic and in vivo neuroimaging data.

2603.15215 2026-03-17 stat.OT

Deepest voting on rankings

Jean-Baptiste Aubin, Antoine Rolland, Ioana Gavra, Irène Gannaz, Jacques Anderson Kouassi

Abstract

This article aims to present a unified framework for ranking-based voting rules based on the use of depth functions on permutations, as a counterpart of deepest voting rules on evaluation introduced in Aubin et al. [2022]. It introduces the notion of depth functions, in continuous sets and in permutation sets, the latter using the notion of Fréchet means. Deepest voting procedures are then formally defined, and some classical voting rules are expressed as deepest voting procedures, using a large variety of distances on the set of permutations. Links are drawn between the mathematical properties of the depth functions and some behaviours of the voting rule, such as Neutrality, Anonymity, Universality, the Condorcet winner/loser property, and so on.
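
One familiar instance of a distance-based rule on permutations is the Kemeny rule, which selects the ranking that minimises the total Kendall tau distance to the ballots. The brute-force sketch below treats "deepest" as "smallest total distance"; that reading, the ballots, and the function names are assumptions for illustration, and the paper's depth functions may be defined differently.

```python
from itertools import combinations, permutations

def kendall_tau(r1, r2):
    """Number of candidate pairs that the two rankings order differently."""
    pos1 = {c: i for i, c in enumerate(r1)}
    pos2 = {c: i for i, c in enumerate(r2)}
    return sum(
        1
        for a, b in combinations(r1, 2)
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0
    )

def deepest_ranking(ballots):
    """Ranking minimising the summed Kendall tau distance to all ballots (brute force)."""
    candidates = sorted(ballots[0])
    return min(
        permutations(candidates),
        key=lambda r: sum(kendall_tau(r, b) for b in ballots),
    )

ballots = [("A", "B", "C"), ("A", "B", "C"), ("B", "A", "C")]
winner_ranking = deepest_ranking(ballots)
```

Enumerating all permutations is only feasible for a handful of candidates; swapping `kendall_tau` for another permutation distance would give a different rule within the same template.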

2603.15189 2026-03-17 stat.ML cs.LG

The Sampling Complexity of Condorcet Winner Identification in Dueling Bandits

El Mehdi Saad, Victor Thuot, Nicolas Verzelen

Abstract

We study best-arm identification in stochastic dueling bandits under the sole assumption that a Condorcet winner exists, i.e., an arm that wins each noisy pairwise comparison with probability at least $1/2$. We introduce a new identification procedure that exploits the full gap matrix $Δ_{i,j}=q_{i,j}-\tfrac12$ (where $q_{i,j}$ is the probability that arm $i$ beats arm $j$), rather than only the gaps between the Condorcet winner and the other arms. We derive high-probability, instance-dependent sample-complexity guarantees that (up to logarithmic factors) improve the best known ones by leveraging informative comparisons beyond those involving the winner. We complement these results with new lower bounds which, to our knowledge, are the first for Condorcet-winner identification in stochastic dueling bandits. Our lower-bound analysis isolates the intrinsic cost of locating informative entries in the gap matrix and estimating them to the required confidence, establishing the optimality of our non-asymptotic bounds. Overall, our results reveal new regimes and trade-offs in the sample complexity that are not captured by asymptotic analyses based only on the expected budget.
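
The gap matrix and the Condorcet-winner condition are easy to state in code when the preference probabilities are known exactly. The sketch below works on a known matrix $q$ with hypothetical values; the identification problem studied in the paper is harder because $q$ must be estimated from noisy duels.

```python
def gap_matrix(q):
    """Delta[i][j] = q[i][j] - 1/2, where q[i][j] is P(arm i beats arm j)."""
    return [[q_ij - 0.5 for q_ij in row] for row in q]

def condorcet_winner(q):
    """Index of the first arm with non-negative gaps against all others, or None."""
    delta = gap_matrix(q)
    n = len(q)
    for i in range(n):
        if all(delta[i][j] >= 0 for j in range(n) if j != i):
            return i
    return None

# Hypothetical preference matrices.
q_with_winner = [
    [0.50, 0.60, 0.70],
    [0.40, 0.50, 0.45],
    [0.30, 0.55, 0.50],
]
q_cyclic = [
    [0.5, 0.6, 0.4],
    [0.4, 0.5, 0.6],
    [0.6, 0.4, 0.5],
]
```

In `q_with_winner`, the entry comparing arms 1 and 2 is not needed to read off the winner from the known matrix, yet such off-winner entries are exactly what the abstract says can be exploited when the winner is unknown.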

2603.15149 2026-03-17 stat.ME econ.GN q-fin.EC stat.AP

Measuring the depth of multidimensional poverty with ordinal data

Fernando Flores Tavares

Abstract

This paper proposes a positional poverty gap measure of multidimensional poverty within the Alkire-Foster counting framework. The measure captures the depth of deprivations even when indicators are ordinal, unlike the standard poverty gap, which requires cardinal variables. The proposed method draws on the fuzzy set literature and introduces a distribution-based measure of deprivation depth using the empirical cumulative distribution of each indicator, with the most deprived group as the benchmark. For each deprived individual, the method assigns a score based on the individual's relative position in the distribution. Depth is thus expressed as a difference in distributional positions, motivating the label positional poverty gap. The paper demonstrates that this measure preserves the identification and aggregation structure of the counting approach and satisfies its axiomatic properties when the reference distribution remains fixed over time. The framework remains flexible because it accommodates different identification rules, deprivation cutoffs, and variable types. Overall, it offers a simple, meaningful, and theoretically grounded way to incorporate depth into multidimensional poverty measurement with ordinal data.

2603.15121 2026-03-17 cs.LG stat.ML

Establishing Construct Validity in LLM Capability Benchmarks Requires Nomological Networks

Timo Freiesleben

Abstract

Recent work in machine learning increasingly attributes human-like capabilities such as reasoning or theory of mind to large language models (LLMs) on the basis of benchmark performance. This paper examines this practice through the lens of construct validity, understood as the problem of linking theoretical capabilities to their empirical measurements. It contrasts three influential frameworks: the nomological account developed by Cronbach and Meehl, the inferential account proposed by Messick and refined by Kane, and Borsboom's causal account. I argue that the nomological account provides the most suitable foundation for current LLM capability research. It avoids the strong ontological commitments of the causal account while offering a more substantive framework for articulating construct meaning than the inferential account. I explore the conceptual implications of adopting the nomological account for LLM research through a concrete case: the assessment of reasoning capabilities in LLMs.

2603.15082 2026-03-17 stat.ME math.ST stat.TH

Identifying Topological Differences in Two Populations of Random Geometric Objects

Satish Kumar, Subhra Sankar Dhar

Abstract

We propose a statistical framework to identify topological differences in two populations of random geometric objects. The proposed framework involves first associating a topological signature with random geometric objects and then performing a two-sample test using the observed topological signatures. We associate persistence barcodes, a topological signature from topological data analysis, with each observed random geometric object. This, in turn, yields a two-sample problem on the space of persistence barcodes. As the space of persistence barcodes is not suitable for standard statistical analysis, we translate the two-sample problem to a suitable subset of a Euclidean space. In the course of this study, we embed the topological signatures in an ordered convex cone in a Euclidean space using functions from tropical geometry. We show that the embedding is a sufficient statistic for the persistence barcodes. This fact leads to the proposal of a two-sample test based on this sufficient statistic, and its equivalence to the two-sample problem on the barcode space is established. Finally, the consistency of the proposed test is studied.

2603.15016 2026-03-17 cs.CV stat.ML

Riemannian Motion Generation: A Unified Framework for Human Motion Representation and Generation via Riemannian Flow Matching

Fangran Miao, Jian Huang, Ting Li

Comments: 18 pages, 6 figures

Abstract

Human motion generation is often learned in Euclidean spaces, although valid motions follow structured non-Euclidean geometry. We present Riemannian Motion Generation (RMG), a unified framework that represents motion on a product manifold and learns dynamics via Riemannian flow matching. RMG factorizes motion into several manifold factors, yielding a scale-free representation with intrinsic normalization, and uses geodesic interpolation, tangent-space supervision, and manifold-preserving ODE integration for training and sampling. On HumanML3D, RMG achieves state-of-the-art FID in the HumanML3D format (0.043) and ranks first on all reported metrics under the MotionStreamer format. On MotionMillion, it also surpasses strong baselines (FID 5.6, R@1 0.86). Ablations show that the compact $\mathscr{T}+\mathscr{R}$ (translation + rotations) representation is the most stable and effective, highlighting geometry-aware modeling as a practical and scalable route to high-fidelity motion generation.

2603.14991 2026-03-17 math.ST math.OC stat.TH

Wasserstein Distributionally Robust Quantile Regression

Chunxu Zhang, Tiantian Mao, Ruodu Wang

Abstract

We study distributionally robust quantile regression using type-$p$ Wasserstein ambiguity sets. We derive a closed-form expression for the worst-case quantile regression loss under general $p$-Wasserstein uncertainty. We further give a uniqueness result showing that for $p>1$, the check loss yields the only class of convex loss functions for which such an additive Wasserstein regularization holds. Our analysis also uncovers qualitative differences between the regimes $p=1$ and $p>1$. When $p>1$, the slope coefficients coincide with those of the regularized formulation, while the intercept undergoes a radius-dependent adjustment; the value $p$ affects only this intercept correction, whereas the choice of transport norm influences both. Finally, we establish finite-sample out-of-sample risk guarantees of order $O(N^{-1/2})$ under mild moment conditions. Numerical experiments illustrate the theoretical findings and the practical implications of the proposed formulation.
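
For reference, the check (pinball) loss at level $τ$ is $ρ_τ(u) = u\,(τ - \mathbf{1}\{u < 0\})$, and minimising its average over a sample yields a $τ$-quantile. The sketch below shows only this unregularised building block on made-up data; it does not reproduce the Wasserstein-robust formulation or the closed-form worst-case loss derived in the paper.

```python
def check_loss(u, tau):
    """Check loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def mean_check_loss(y, prediction, tau):
    """Average check loss of a constant prediction over the sample y."""
    return sum(check_loss(v - prediction, tau) for v in y) / len(y)

y = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical responses

# Search over the sample points for the constant that minimises the loss at tau = 0.5.
best_constant = min(y, key=lambda c: mean_check_loss(y, c, 0.5))
```

At $τ = 0.5$ the loss is half the mean absolute error, so the minimiser is the sample median; other values of $τ$ weight under- and over-prediction asymmetrically.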

2603.14961 2026-03-17 math.ST stat.TH

Performance of Efron and Tibshirani's semiparametric density estimator

Nils Lid Hjort

Comments: 15 pages, no figures; Statistical Research Report, Department of Mathematics, University of Oslo, from December 1995, but arXiv'd March 2026

Abstract

Recently, Efron and Tibshirani (Annals of Statistics, 1996) proposed a semiparametric density estimator, which works by multiplying an initial kernel type estimate with a parametric exponential type correction factor, chosen so as to match certain empirical moments. While Efron and Tibshirani investigate and illustrate many aspects of their method, the basic questions of performance, and comparison with other density estimators, were not directly addressed in their article. The purpose of the present paper is to provide formulae for bias and variance and hence mean squared error for the estimator. This additional insight into the method makes it easy to compare its performance with that of other recently proposed semiparametric constructions. A brief comparison study is carried out here. It indicates that the new method, used with lower order polynomials in the exponential correction term, is often better than the kernel estimator, in a reasonable neighbourhood around the normal distribution, but that its performance as a density estimator is more than equalled by other methods. In particular, the recently developed Hjort and Glad estimator (Annals of Statistics, 1995), using a parametric start times a nonparametric correction, wins in eight out of nine test cases, from the list of such suggested by Wand and Jones (Annals of Statistics, 1992).

2603.14859 2026-03-17 stat.AP

vPET-ABC: Fast Voxelwise Approximate Bayesian Inference for Kinetic Modeling in PET

Qinlin Gu, Gaelle M. Emvalomenos, Evan D. Morris, Clara Grazian, Steven R. Meikle

Comments: Q. Gu and G. M. Emvalomenos contributed equally to this work

Abstract

Dynamic PET kinetic modeling increasingly demands voxelwise uncertainty quantification and robust model selection. Yet total-body PET (TB-PET) data volumes make conventional Bayesian approaches, such as per-voxel MCMC, computationally impractical, while deep models typically require retraining and careful revalidation when tracers, protocols, or kinetic models change, without necessarily improving inference speed. Vectorized voxelwise approximate Bayesian computation (vPET-ABC) is introduced as a likelihood-free, model-agnostic posterior inference framework for dynamic PET kinetic modeling at total-body scale. The method replaces explicit likelihood evaluation with forward simulations and a discrepancy test, then exploits full vectorization to transform voxelwise inference into an embarrassingly parallel workload suited to modern GPUs. In simulation, vPET-ABC produced posterior summaries with small divergence from sequential Monte Carlo baselines, and posterior mean estimates significantly more accurate than non-negative least squares (NNLS). For model selection between the linear parametric neurotransmitter model (lp-ntPET) and the multilinear reference tissue model, vPET-ABC maintained high sensitivity under high noise with moderate loss of specificity, whereas NNLS+Bayesian information criteria exhibited the opposite trade-off with near-zero sensitivity. In a human cigarette smoking dataset, vPET-ABC yielded denser probabilistic activation maps than lp-ntPET with effective number of parameters. On a 50 min total-body [18F]FDG study, vPET-ABC generated high-quality whole-volume $K_i$ parametric images within practical runtimes on a single GPU, while also preserving local spatial correlation better than NNLS. Overall, vPET-ABC delivers fast, training-free, uncertainty-aware inference that scales to TB-PET and remains portable across tracers and kinetic models.

2603.14835 2026-03-17 math.ST stat.AP stat.TH

On the evaluation of time-to-event, survival time and first passage time forecasts

Robert J. Taggart, Nicholas Loveday, Simon Louis

Abstract

Time-to-event forecasts are essential when decisions depend on event timing. This article develops a framework for evaluating such forecasts when the event has not yet occurred or is not predicted within the forecast horizon. We introduce a theory of provisional evaluation, in which each forecast is assessed against its right-censored realization, defined as the minimum of the event time and the evaluation time. For probabilistic forecasts, we show that strictly proper scoring rules induce provisionally strictly proper scoring rules, whose expected score, computed from the right-censored realization, is optimized under truthful forecasting. Threshold-weighted versions of the continuous ranked probability score and the logarithmic score satisfy this property. We also develop a theory for scoring point (single-valued) forecasts under right-censoring. Quantile and interquartile range forecasts are shown to be provisionally elicitable, meaning that scoring functions exist for which these functionals uniquely minimize the expected score, whereas the expectation functional is not provisionally elicitable. A synthetic experiment demonstrates that the proposed scores correctly rank forecasters. Diagnostic tools, including Murphy diagrams and reliability diagrams, extend naturally. Applications to operational time-to-flood and time-to-strong-wind forecasts illustrate the approach.

2603.14815 2026-03-17 stat.ME

On Heterogeneity in Wasserstein Space

Kisung You

Abstract

Data represented by probability measures arise as empirical distributions, posterior distributions, and feature-based representations of complex objects. We study heterogeneity in a population of probability measures through the expected value of a chosen transform of the pairwise Wasserstein distance. The resulting estimator is unbiased and, under simple moment conditions on the population law, is strongly consistent, asymptotically normal, and equipped with a consistent standard error. This also yields a simple comparison of two populations and remains stable under plug-in approximation when the measures are estimated. The associated empirical eccentricities identify the observations that contribute most strongly to heterogeneity within a sample.
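
An estimator of "the expected value of a transform of the pairwise Wasserstein distance" has the shape of a U-statistic: average the transformed distance over all unordered pairs of observed measures. The sketch below assumes one-dimensional empirical measures with equal sample sizes (so the distance reduces to matching sorted samples) and uses the squared distance as the transform; both choices and all names are assumptions, not taken from the paper.

```python
from itertools import combinations

def wasserstein_1d(x, y, p=2):
    """p-Wasserstein distance between two equal-size 1D empirical measures."""
    if len(x) != len(y):
        raise ValueError("this sketch assumes equal sample sizes")
    xs, ys = sorted(x), sorted(y)
    return (sum(abs(a - b) ** p for a, b in zip(xs, ys)) / len(xs)) ** (1.0 / p)

def heterogeneity(measures, transform=lambda d: d ** 2, p=2):
    """Average of transform(W_p) over all unordered pairs of measures."""
    pairs = list(combinations(measures, 2))
    return sum(transform(wasserstein_1d(x, y, p)) for x, y in pairs) / len(pairs)

# Three hypothetical empirical measures, each given by two sample points.
measures = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
h = heterogeneity(measures)
```

Because each unordered pair enters exactly once with equal weight, the average is unbiased for the population expectation whenever the measures are drawn independently from the same law.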

2603.11867 2026-03-17 stat.ME stat.ML

Data Fusion with Distributional Equivalence Test-then-pool

Linying Yang, Xing Liu, Robin J. Evans

Abstract

Randomized controlled trials (RCTs) are the gold standard for causal inference, yet practical constraints often limit the size of the concurrent control arm. Borrowing control data from previous trials offers a potential efficiency gain, but naive borrowing can induce bias when historical and current populations differ. Existing test-then-pool (TTP) procedures address this concern by testing for equality of control outcomes between historical and concurrent trials before borrowing; however, standard implementations may suffer from reduced power or inadequate control of the Type-I error rate. We develop a new TTP framework that fuses control arms while rigorously controlling the Type-I error rate of the final treatment effect test. Our method employs kernel two-sample testing via maximum mean discrepancy (MMD) to capture distributional differences, and equivalence testing to avoid introducing uncontrolled bias, providing a more flexible and informative criterion for pooling. To ensure valid inference, we introduce partial bootstrap and partial permutation procedures for approximating null distributions in the presence of heterogeneous controls. We further establish the overall validity and consistency. We provide empirical studies demonstrating that the proposed approach achieves higher power than standard TTP methods while maintaining nominal error control, highlighting its value as a principled tool for leveraging historical controls in modern clinical trials.
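
The maximum mean discrepancy used to compare control arms can be sketched with the standard unbiased estimator of squared MMD under a Gaussian kernel. The data, bandwidth, and function names below are hypothetical, and the paper's equivalence test, partial bootstrap, and partial permutation procedures are not reproduced.

```python
import math

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel on scalars."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))

def mmd2_unbiased(xs, ys, bandwidth=1.0):
    """Unbiased estimate of squared MMD; it can be slightly negative by construction."""
    m, n = len(xs), len(ys)
    k_xx = sum(
        gaussian_kernel(a, b, bandwidth)
        for i, a in enumerate(xs) for j, b in enumerate(xs) if i != j
    ) / (m * (m - 1))
    k_yy = sum(
        gaussian_kernel(a, b, bandwidth)
        for i, a in enumerate(ys) for j, b in enumerate(ys) if i != j
    ) / (n * (n - 1))
    k_xy = sum(gaussian_kernel(a, b, bandwidth) for a in xs for b in ys) / (m * n)
    return k_xx + k_yy - 2.0 * k_xy

current_controls = [0.0, 0.1, 0.2]        # hypothetical outcomes
similar_history = [0.05, 0.15, 0.25]
shifted_history = [5.0, 5.1, 5.2]

mmd_similar = mmd2_unbiased(current_controls, similar_history)
mmd_shifted = mmd2_unbiased(current_controls, shifted_history)
```

A test-then-pool rule would compare such a statistic against a calibrated threshold before borrowing; in an equivalence test the null hypothesis is that the arms differ, so pooling happens only when the data support similarity.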

2603.08682 2026-03-17 stat.ML cs.LG

Structural Causal Bottleneck Models

Simon Bing, Jonas Wahl, Jakob Runge

Abstract

We introduce structural causal bottleneck models (SCBMs), a novel class of structural causal models. At the core of SCBMs lies the assumption that causal effects between high-dimensional variables only depend on low-dimensional summary statistics, or bottlenecks, of the causes. SCBMs provide a flexible framework for task-specific dimension reduction while being estimable via standard, simple learning algorithms in practice. We analyse identifiability in SCBMs, connect them to information bottlenecks in the sense of Tishby & Zaslavsky (2015), and illustrate how to estimate them experimentally. We also demonstrate the benefit of bottlenecks for effect estimation in low-sample transfer learning settings. We argue that SCBMs provide an alternative to existing causal dimension reduction frameworks like causal representation learning or causal abstraction learning.

2603.06134 2026-03-17 stat.ME stat.AP

Clustering-Based Outcome Models for Clinical Studies: A Scoping Review

Johannes Vilsmeier, Fabian Eibensteiner, Franz König, Francois Mercier, Robin Ristl, Nigel Stallard, Marc Vandemeulebroecke, Sarah Zohar, Martin Posch

Abstract

This review provides a systematic overview of methods that combine covariate-based clustering of observational units (patients) with outcome models for clinical studies. We distinguish between informed-cluster models, where the outcome contributes to cluster formation, and agnostic-cluster models, where clustering is performed solely on covariates in a separate first step. Informed-cluster models include product partition models with covariates (PPMx), finite mixtures of regression models (FMR), and cluster-aware supervised learning (CluSL). Agnostic-cluster models encompass two-step procedures using either model-based or algorithmic clustering followed by cluster-specific regression models. Following a systematic search of Web of Science and PubMed, 55 records were identified that propose or evaluate such models. We describe the key models, summarise study characteristics, and present applications from biomedical and public health research. Clustering-based outcome models are particularly relevant for settings with high-dimensional covariates (e.g., biomarker panels and "omics") and heterogeneous patient populations. These models can support risk stratification and we discuss extensions to estimate subgroup-specific treatment effects. They are most valuable when the population is clustered in distinct regions of the covariate space that correspond to different outcome distributions. We discuss applications to rare disease research, covariate adjustment and borrowing from historical data, and subgroup-specific treatment effect estimation in clinical trials.

2603.05340 2026-03-17 stat.ML cs.LG math.ST stat.TH

On the Statistical Optimality of Optimal Decision Trees

Zineng Xu, Subhro Ghosh, Yan Shuo Tan

Abstract

While globally optimal empirical risk minimization (ERM) decision trees have become computationally feasible and empirically successful, rigorous theoretical guarantees for their statistical performance remain limited. In this work, we develop a comprehensive statistical theory for ERM trees under random design in both high-dimensional regression and classification. We first establish sharp oracle inequalities that bound the excess risk of the ERM estimator relative to the best possible approximation achievable by any tree with at most $L$ leaves, thereby characterizing the interpretability-accuracy trade-off. We derive these results using a novel uniform concentration framework based on empirically localized Rademacher complexity. Furthermore, we derive minimax optimal rates over a novel function class: the piecewise sparse heterogeneous anisotropic Besov (PSHAB) space. This space explicitly captures three key structural features encountered in practice: sparsity, anisotropic smoothness, and spatial heterogeneity. While our main results are established under sub-Gaussianity, we also provide robust guarantees that hold under heavy-tailed noise settings. Together, these findings provide a principled foundation for the optimality of ERM trees and introduce empirical process tools broadly applicable to other highly adaptive, data-driven procedures.

2601.11213 2026-03-17 physics.optics math-ph math.MP stat.AP

Light Propagation through Space-Time Non-Markovian Random Media

Chaoran Wang, Jinquan Qi, Shuang Liu, Chenjin Deng, Shensheng Han

Comments 3 figures

Abstract

Here, we introduce a stochastic partial differential equation (SPDE) formulation driven by temporally correlated noise to describe light propagation beyond the standard Markov approximation. By representing the squared refractive index fluctuations as a random field with explicit long-range temporal correlations, we demonstrate that the propagation dynamics map exactly onto the hyperbolic Anderson model. This rigorous mapping enables the derivation of new quantitative scaling relations that connect the environment's non-Markovian memory effects to the statistical properties of the emergent light field. We experimentally validate these analytical predictions in an outdoor atmospheric environment, confirming the memory-dependent statistical signatures of the propagated light. Our results establish a precise physical foundation for understanding memory-driven wave phenomena, providing crucial insights for free-space optical communication, remote sensing, and coherent imaging.

2601.03220 2026-03-17 cs.LG stat.ML

From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence

Marc Finzi, Shikai Qiu, Yiding Jiang, Pavel Izmailov, J. Zico Kolter, Andrew Gordon Wilson

Comments Code available at https://github.com/shikaiqiu/epiplexity

Abstract

Can we learn more from data than existed in the generating process itself? Can new and useful information be constructed from merely applying deterministic transformations to existing data? Can the learnable content in data be evaluated without considering a downstream task? On these questions, Shannon information and Kolmogorov complexity come up nearly empty-handed, in part because they assume observers with unlimited computational capacity and do not target the useful information content. In this work, we identify and exemplify three seeming paradoxes in information theory: (1) information cannot be increased by deterministic transformations; (2) information is independent of the order of data; (3) likelihood modeling is merely distribution matching. To shed light on the tension between these results and modern practice, and to quantify the value of data, we introduce epiplexity, a formalization of information capturing what computationally bounded observers can learn from data. Epiplexity captures the structural content in data while excluding time-bounded entropy, the random unpredictable content exemplified by pseudorandom number generators and chaotic dynamical systems. With these concepts, we demonstrate how information can be created with computation, how it depends on the ordering of the data, and how likelihood modeling can produce more complex programs than present in the data generating process itself. We also present practical procedures to estimate epiplexity which we show capture differences across data sources, track with downstream performance, and highlight dataset interventions that improve out-of-distribution generalization. In contrast to principles of model selection, epiplexity provides a theoretical foundation for data selection, guiding how to select, generate, or transform data for learning systems.

2511.16568 2026-03-17 math.OC math.ST stat.ML stat.TH

Failure of uniform laws of large numbers for subdifferentials and beyond

Lai Tian, Johannes O. Royset

Comments 17 pages, 2 figures; Section 2.3 now includes new discussion of SAA and subdifferential approximation

Abstract

We provide counterexamples showing that uniform laws of large numbers do not hold for subdifferentials under natural assumptions. Our constructions are univariate random Lipschitz functions and bivariate random convex functions with two smooth pieces. Consequently, they resolve the questions posed by Shapiro and Xu [J. Math. Anal. Appl., 325 (2007), 1390-1399] in the negative. They also demonstrate the failure of certain graphical and pointwise laws for subdifferentials, revealing fundamental barriers to the consistency of sample-average approximation and subdifferential approximation.

2511.02660 2026-03-17 math.ST econ.EM stat.ME stat.TH

Spectral analysis of high-dimensional spot volatility matrix with applications

Qiang Liu, Yiming Liu, Zhi Liu, Wang Zhou

Abstract

In random matrix theory, the spectral distribution of the covariance matrix has been well studied under the large dimensional asymptotic regime when the dimensionality and the sample size tend to infinity at the same rate. However, most existing theories are built upon the assumption of independent and identically distributed samples, which may be violated in practice, for example by observations of continuous-time processes at discrete time points, namely high-frequency data. In this paper, we extend the classical spectral analysis for the covariance matrix in large dimensional random matrix theory to the spot volatility matrix by using high-frequency data. We establish the first-order limiting spectral distribution and obtain a second-order result, that is, the central limit theorem for linear spectral statistics. Moreover, we apply the results to design some feasible tests for the spot volatility matrix, including the identity and sphericity tests. Simulation studies justify the finite sample performance of the test statistics and verify our established theory.

2510.13496 2026-03-17 math.NA cs.NA stat.ML

Data-intrinsic approximation in metric spaces

Jürgen Dölz, Michael Multerer

Abstract

Analysis and processing of data is a vital part of our modern society and requires vast amounts of computational resources. To reduce the computational burden, compressing and approximating data has become a central topic. We consider the approximation of labeled data samples, mathematically described as site-to-value maps between finite metric spaces. Within this setting, we identify the discrete modulus of continuity as an effective data-intrinsic quantity to measure regularity of site-to-value maps without imposing further structural assumptions. We investigate the consistency of the discrete modulus of continuity in the infinite data limit and propose an algorithm for its efficient computation. Building on these results, we present a sample based approximation theory for labeled data. For data subject to statistical uncertainty we consider multilevel approximation spaces and a variant of the multilevel Monte Carlo method to compute statistical quantities of interest. Our considerations connect approximation theory for labeled data in metric spaces to the covering problem for (random) balls on the one hand and the efficient evaluation of the discrete modulus of continuity to combinatorial optimization on the other hand. We provide extensive numerical studies to illustrate the feasibility of the approach and to validate our theoretical results.

2510.01112 2026-03-17 astro-ph.GA astro-ph.CO cs.LG stat.AP stat.ME

The causal structure of galactic astrophysics

Harry Desmond, Joseph Ramsey

Comments 10 pages, 4 figures; published in the Open Journal of Astrophysics

Journal ref Open Journal of Astrophysics, Vol 9 (2026)
Abstract

Data-driven astrophysics currently relies on the detection and characterisation of correlations between objects' properties, which are then used to test physical theories that make predictions for them. This process fails to utilise information in the data that forms a crucial part of the theories' predictions, namely which variables are directly correlated (as opposed to accidentally correlated through others), the directions of these determinations, and the presence or absence of confounders that correlate variables in the dataset but are themselves absent from it. We propose to recover this information through causal discovery, a well-developed methodology for inferring the causal structure of datasets that is, however, almost entirely unknown in astrophysics. We develop a causal discovery algorithm suitable for large astrophysical datasets and illustrate it on $\sim 4.5\times10^5$ nearby galaxies from the NASA-Sloan Atlas, demonstrating its ability to distinguish physical mechanisms that are degenerate on the basis of correlations alone.

2509.24912 2026-03-17 stat.ML cs.LG

When Scores Learn Geometry: Rate Separations under the Manifold Hypothesis

Xiang Li, Zebang Shen, Ya-Ping Hsieh, Niao He

Comments Accepted at ICLR 2026

Abstract

Score-based methods, such as diffusion models and Bayesian inverse problems, are often interpreted as learning the data distribution in the low-noise limit ($σ\to 0$). In this work, we propose an alternative perspective: their success arises from implicitly learning the data manifold rather than the full distribution. Our claim is based on a novel analysis of scores in the small-$σ$ regime that reveals a sharp separation of scales: information about the data manifold is $Θ(σ^{-2})$ stronger than information about the distribution. We argue that this insight suggests a paradigm shift from the less practical goal of distributional learning to the more attainable task of geometric learning, which provably tolerates $O(σ^{-2})$ larger errors in score approximation. We illustrate this perspective through three consequences: i) in diffusion models, concentration on data support can be achieved with a score error of $o(σ^{-2})$, whereas recovering the specific data distribution requires a much stricter $o(1)$ error; ii) more surprisingly, learning the uniform distribution on the manifold, an especially structured and useful object, is also $O(σ^{-2})$ easier; and iii) in Bayesian inverse problems, the maximum entropy prior is $O(σ^{-2})$ more robust to score errors than generic priors. Finally, we validate our theoretical findings with preliminary experiments on large-scale models, including Stable Diffusion.

2509.19040 2026-03-17 stat.ME

Nonparametric efficient estimation of the longitudinal front-door functional

Marie S. Breum, Helene C. W. Rytgaard, Torben Martinussen, Erin E. Gabriel

Abstract

The front-door criterion is an identification strategy for the intervention-specific mean outcome in settings where the standard back-door criterion fails due to unmeasured exposure-outcome confounders, but an intermediate variable exists that completely mediates the effect of exposure on the outcome and is not affected by unmeasured confounding. The front-door criterion has been extended to the longitudinal setting, where exposure and mediator vary over time. However, with the exception of a simple plug-in estimator, no suitable estimation techniques have been proposed. In this work, we derive nonparametric efficient estimators of the longitudinal front-door functional. The estimators accommodate high-dimensional mediators, are multiply robust, and allow for the use of data-adaptive methods for estimating nuisance functions while still providing valid inference. The theoretical properties of the estimators are illustrated in a simulation study, and we apply the estimators to a trial of peanut allergy in infants.

2509.08731 2026-03-17 cs.LG stat.ML

Generating solution paths of Markovian stochastic differential equations using diffusion models

Xuefeng Gao, Jiale Zha, Xun Yu Zhou

Abstract

This paper introduces a new approach to generating sample paths of unknown Markovian stochastic differential equations (SDEs) using diffusion models, a class of generative AI methods commonly employed in image and video applications. Unlike the traditional Monte Carlo methods for simulating SDEs, which require explicit specifications of the drift and diffusion coefficients, ours takes a model-free, data-driven approach. Given a finite set of sample paths from an SDE, we utilize conditional diffusion models to generate new, synthetic paths of the same SDE. Numerical experiments show that our method consistently outperforms two alternative methods in terms of the Kullback--Leibler (KL) divergence between the distributions of the target SDE paths and the generated ones. Moreover, we present a theoretical error analysis deriving an explicit bound on the said KL divergence. Finally, in simulation and empirical studies, we leverage these synthetically generated sample paths to boost the performance of reinforcement learning algorithms for continuous-time mean--variance portfolio selection, hinting at promising applications of our study in financial analysis and decision-making.

2508.06431 2026-03-17 quant-ph stat.OT

Nonparametric Learning Non-Gaussian Quantum States of Continuous Variable Systems

Liubov A. Markovich, Xiaoyu Liu, Jordi Tura

Abstract

Continuous-variable quantum systems are foundational to quantum computation, communication, and sensing. While traditional representations using wave functions or density matrices are often impractical, the tomographic picture of quantum mechanics provides an accessible alternative by associating quantum states with classical probability distribution functions called tomograms. Despite their advantages, including compatibility with classical statistical methods, tomographic methods remain underutilized due to a lack of robust estimation techniques. This work addresses this gap by introducing a non-parametric \emph{kernel quantum state estimation} (KQSE) framework for reconstructing quantum states and their trace characteristics from noisy data, without prior knowledge of the state. In contrast to existing methods, KQSE yields estimates of the density matrix in various bases, as well as trace quantities such as purity, higher moments, overlap, and trace distance, with a near-optimal convergence rate of $\tilde{O}\bigl(T^{-1}\bigr)$, where $T$ is the total number of measurements. KQSE is robust for multimodal, non-Gaussian states, making it particularly well suited for characterizing states essential for quantum science.

2506.19837 2026-03-17 stat.ML cs.LG

Convergence and clustering analysis for Mean Shift with radially symmetric, positive definite kernels

Susovan Pal

Abstract

The mean shift (MS) is a non-parametric, density-based, iterative algorithm with prominent usage in clustering and image segmentation. A rigorous proof for the convergence of its mode estimate sequence in full generality remains unknown. In this paper, we show that for sufficiently large bandwidth, convergence is guaranteed in any dimension with any radially symmetric and strictly positive definite kernel. Although we acknowledge that our result is partially more restrictive than that of [YT] due to the lower limit of the bandwidth, our kernel class is not covered by the kernel class in [YT], and the proof technique is different. Moreover, we show theoretically and experimentally that while accurate clustering at large bandwidths is generally impossible for the Gaussian kernel, it may still be possible for other radially symmetric, strictly positive definite kernels.
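The mean shift update itself is short enough to sketch. The toy example below runs the Gaussian-kernel iteration on one-dimensional points and compares a small and a large bandwidth. The data, bandwidth values, and function name `mean_shift_mode` are hypothetical choices for illustration only and do not reproduce the paper's analysis or experiments.

```python
import math


def mean_shift_mode(points, start, bandwidth, tol=1e-10, max_iter=1000):
    """Iterate the Gaussian-kernel mean shift update from `start` until it stalls."""
    x = start
    for _ in range(max_iter):
        # Kernel weight of every data point relative to the current estimate.
        weights = [math.exp(-((x - p) ** 2) / (2.0 * bandwidth ** 2)) for p in points]
        # The update moves x to the kernel-weighted mean of the data.
        new_x = sum(w * p for w, p in zip(weights, points)) / sum(weights)
        if abs(new_x - x) < tol:
            return new_x
        x = new_x
    return x


# Two well-separated toy clusters around -2 and +2 (made-up data).
points = [-2.1, -2.0, -1.9, 1.9, 2.0, 2.1]

small = [mean_shift_mode(points, p, bandwidth=0.5) for p in points]
large = [mean_shift_mode(points, p, bandwidth=5.0) for p in points]

print("bandwidth 0.5:", [round(m, 3) for m in small])
print("bandwidth 5.0:", [round(m, 3) for m in large])
```

With the small bandwidth each point drifts to the mode of its own cluster, whereas with the large bandwidth the Gaussian kernel density estimate has a single mode and every start collapses onto it, which is the kind of merging the abstract describes for the Gaussian kernel at large bandwidths.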

2504.00919 2026-03-17 math.ST stat.ML stat.TH

Nonparametric spectral density estimation using interactive mechanisms under local differential privacy

Cristina Butucea, Karolina Klockmann, Tatyana Krivobokova

Comments 56 pages, 3 figures

Abstract

We study the problem of estimating the spectral density of a centered stationary Gaussian time series under local differential privacy constraints. Specifically, we propose new interactive privacy mechanisms for three tasks: recovering a single covariance coefficient, recovering the spectral density at a fixed frequency, and global recovery. Our approach achieves faster rates through a two-stage process: we first apply the Laplace mechanism to the truncated value, and then use the resulting privatized sample to learn about the dependence mechanism in the time series. For spectral densities belonging to Hölder and Sobolev smoothness classes, we demonstrate that our algorithms improve upon the non-interactive mechanism of Kroll (2024) for small privacy parameter $α$, since the pointwise rates depend on $nα^2$ instead of $nα^4$. Moreover, we show that the rate $(nα^4)^{-1}$ is optimal for estimating a covariance coefficient with non-interactive mechanisms. However, the $L_2$ rate of our interactive estimator is slower than the pointwise rate. We show how to use these procedures to provide a bona fide locally differentially private estimator of the entire covariance matrix. A simulation study validates our findings.
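The first stage described above, applying the Laplace mechanism to a truncated value, can be sketched as follows. This is only the standard building block and not the authors' interactive mechanism: the truncation level `tau`, the privacy parameter value, the simulated data, and the function name `privatize` are all assumptions made for this example.

```python
import random


def privatize(x, tau, alpha, rng):
    """Truncate x to [-tau, tau], then add Laplace noise calibrated for alpha-LDP."""
    truncated = max(-tau, min(tau, x))
    # Two truncated values differ by at most 2 * tau, so this is the sensitivity.
    scale = 2.0 * tau / alpha
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return truncated + noise


rng = random.Random(0)
# Hypothetical raw observations held by individual data owners.
data = [rng.gauss(0.0, 1.0) for _ in range(20000)]
released = [privatize(x, tau=3.0, alpha=1.0, rng=rng) for x in data]

print("mean of raw data:     ", sum(data) / len(data))
print("mean of released data:", sum(released) / len(released))
```

Each released value is very noisy on its own, but averages over many users remain informative. The second stage, which uses the privatized sample to learn about the dependence structure, is not shown here.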

2501.15926 2026-03-17 math.ST stat.TH

Minimax convergence rates of a binary classification procedure for time-homogeneous SDE paths

Eddy Michel Ella Mintsa

Comments 39 pages

Abstract

In the context of binary classification of trajectories generated by time-homogeneous stochastic differential equations, we consider a mixture of two diffusion processes characterized by a stochastic differential equation (SDE) whose drift coefficient depends on the class and whose diffusion coefficient is independent of the class. We assume that the drift and diffusion coefficients are unknown as well as the law of the discrete random variable that models the class. In this paper, we study the minimax convergence rates for the excess risk of the resulting plug-in classifier under different sets of assumptions on the diffusion model. As the plug-in classifier is based on nonparametric estimators of drift and diffusion coefficients, we establish rates of convergence for projection estimators of drift coefficients on the real line. We propose a new methodology for the study of the lower bound on the excess risk. The theoretical study is completed with a numerical experiment over simulated data.

2501.10229 2026-03-17 stat.ML cs.LG stat.CO

Amortized Bayesian Mixture Models

Šimon Kucharský, Paul Christian Bürkner

Comments 34 pages, 17 figures

Abstract

Finite mixtures are a broad class of models useful in scenarios where observed data is generated by multiple distinct processes but without explicit information about the responsible process for each data point. Estimating Bayesian mixture models is computationally challenging due to issues such as high-dimensional posterior inference and label switching. Furthermore, traditional methods such as MCMC are applicable only if the likelihoods for each mixture component are analytically tractable. Amortized Bayesian Inference (ABI) is a simulation-based framework for estimating Bayesian models using generative neural networks. This allows the fitting of models without explicit likelihoods, and provides fast inference. ABI is therefore an attractive framework for estimating mixture models. This paper introduces a novel extension of ABI tailored to mixture models. We factorize the posterior into a distribution of the parameters and a distribution of (categorical) mixture indicators, which allows us to use a combination of generative neural networks for parameter inference, and classification networks for mixture membership identification. The proposed framework accommodates both independent and dependent mixture models, enabling filtering and smoothing. We validate and demonstrate our approach through synthetic and real-world datasets.

2412.12407 2026-03-17 stat.ME

Inside-out cross-covariance for spatial multivariate data

Michele Peruzzi

Abstract

As the spatial features of multivariate data are increasingly central in researchers' applied problems, there is a growing demand for novel spatially-aware methods that are flexible, easily interpretable, and scalable to large data. We develop inside-out cross-covariance (IOX) models for multivariate spatial likelihood-based inference. IOX leads to valid cross-covariance matrix functions which we interpret as inducing spatial dependence on independent replicates of a correlated random vector. The resulting sample cross-covariance matrices are "inside-out" relative to the ubiquitous linear model of coregionalization (LMC). However, unlike LMCs, our methods offer direct marginal inference, easy prior elicitation of covariance parameters, the ability to model outcomes with unequal smoothness, and flexible dimension reduction. As a covariance model for a q-variate Gaussian process, IOX leads to scalable models for noisy vector data as well as flexible latent models. For large n cases, IOX complements Vecchia approximations and related process-based methods based on sparse graphical models. We demonstrate superior performance of IOX on synthetic datasets as well as on colorectal cancer proteomics data. An R package implementing the proposed methods is available at github.com/mkln/spiox.

2412.01010 2026-03-17 stat.ML cs.LG

A Note on Estimation Error Bound and Grouping Effect of Transfer Elastic Net

Yui Tomo

Journal ref Communications in Statistics - Theory and Methods (2026)
Abstract

The Transfer Elastic Net is an estimation method for linear regression models that combines $\ell_1$ and $\ell_2$ norm penalties to facilitate knowledge transfer. In this study, we derive a non-asymptotic $\ell_2$ norm estimation error bound for the estimator and discuss scenarios where the Transfer Elastic Net effectively works. Furthermore, we examine situations where it exhibits the grouping effect, which states that the estimates corresponding to highly correlated predictors have a small difference.

2404.05006 2026-03-17 math.ST math.PR stat.TH

High-dimensional bootstrap and asymptotic expansion

Yuta Koike

Comments 66 pages, 1 figure, 2 tables. Some typos were corrected. The order of the subsections in Appendix B was rearranged

Abstract

The recent seminal work of Chernozhukov, Chetverikov and Kato has shown that bootstrap approximation for the maximum of a sum of independent random vectors is justified even when the dimension is much larger than the sample size. In this context, numerical experiments suggest that third-moment matching bootstrap approximations would outperform normal approximation even without studentization, but the existing theoretical results cannot explain this phenomenon. In this paper, we develop an asymptotic expansion formula for the bootstrap coverage probability and show that it can give an explanation for the above phenomenon. In particular, we find the following interesting blessing of dimensionality phenomenon: The third-moment matching wild bootstrap is second-order accurate in high dimensions even without studentization if the covariance matrix has identical diagonal entries and bounded eigenvalues. We also show that a double wild bootstrap method is second-order accurate regardless of the covariance structure. The validity of these results is established under the assumption that the underlying distributions admit Stein kernels.

2309.05435 2026-03-17 stat.CO

Parallel Selected Inversion for Space-Time Gaussian Markov Random Fields

Abylay Zhumekenov, Elias T. Krainski, Håvard Rue

Comments Published in Statistics and Computing (2025). Expanded version with additional results, discussion, and references

Journal ref Statistics and Computing 35, 211 (2025)
Abstract

Performing Bayesian inference on large spatio-temporal models requires extracting inverse elements of large sparse precision matrices for marginal variances, as well as estimating model hyperparameters. Although direct matrix factorizations can be used for the inversion, such methods fail to scale well for distributed problems when run on large computing clusters. By contrast, Krylov subspace methods for the selected inversion have been gaining traction. We propose a parallel hybrid approach based on domain decomposition, which extends the Rao-Blackwellized Monte Carlo estimator for distributed precision matrices. Our approach exploits the strength of Krylov subspace methods as global solvers and the efficiency of direct factorizations as base case solvers to compute the marginal variances and the derivatives required for hyperparameter estimation using a divide-and-conquer strategy. By introducing subdomain overlaps, one can achieve greater accuracy at an increased computational effort with little to no additional communication. We demonstrate the speed improvements and efficient hyperparameter inference on both simulated models and a massive US daily temperature dataset.

2307.01111 2026-03-17 stat.CO stat.ME

A Gaussian process and linear-based framework for computing cut distributions in modular Bayesian calibration of two chained computer models

Oumar Baldé, Guillaume Damblin, Amandine Marrel, Antoine Bouloré, Loïc Giraldi

Comments 44 pages, 14 figures

Abstract

Computer models are widely used in science and engineering to simulate complex systems. However, these models are affected by several sources of uncertainty, which may limit their use for decision making in risk management. We present a Bayesian approach for quantifying parameter uncertainty in a chain of two computer models motivated by multiphysics simulations in the nuclear field. Part of the inputs of a downstream model parametrized by $θ\in \mathbb{R}^p$ come from the outputs of an upstream model parametrized by $λ\in \mathbb{R}^q$. Usually, the joint posterior distribution of $(θ, λ)$ would be obtained by applying Bayes' theorem using the experimental observations of both models. However, when the observations of the downstream model are too indirect to provide informative inference on $λ$, it may be preferable to compute a modular posterior distribution of $(θ, λ)$, referred to as the \emph{cut distribution}. Assuming that the posterior distribution of $λ$ has been previously estimated from observations of the upstream model only, we aim to compute the posterior distribution of $θ$ conditional on $λ$ using observations from the downstream model. To this end, we propose a Gaussian-process and linear-based framework to estimate the functional dependence between $θ$ and $λ$, denoted by $θ(λ)$, where each component is modeled as a realization of a Gaussian process. As the downstream model is approximated by a linear function of $θ(λ)$, Bayesian conjugacy allows us to derive a Gaussian posterior predictive distribution of $θ(λ)$ for any realization of $λ$. The effectiveness of the method is illustrated through several synthetic examples, and we highlight how variations in $λ$ impact the predictive distribution of the chained simulation.

2305.02881 2026-03-17 quant-ph cs.LG hep-ex stat.ML

Trainability barriers and opportunities in quantum generative modeling

Manuel S. Rudolph, Sacha Lerch, Supanut Thanasilp, Oriel Kiss, Oxana Shaya, Sofia Vallecorsa, Michele Grossi, Zoë Holmes

Comments 21+44 pages, 10+2 figures

Abstract

Quantum generative models provide inherently efficient sampling strategies and thus show promise for achieving an advantage using quantum hardware. In this work, we investigate the barriers to the trainability of quantum generative models posed by barren plateaus and exponential loss concentration. We explore the interplay between explicit and implicit models and losses, and show that using quantum generative models with explicit losses such as the KL divergence leads to a new flavour of barren plateaus. In contrast, the implicit Maximum Mean Discrepancy loss can be viewed as the expectation value of an observable that is either low-bodied and provably trainable, or global and untrainable depending on the choice of kernel. In parallel, we find that solely low-bodied implicit losses cannot in general distinguish high-order correlations in the target data, while some quantum loss estimation strategies can. We validate our findings by comparing different loss functions for modelling data from High-Energy-Physics.

2210.09502 2026-03-17 stat.ME

Small Area Estimation using EBLUPs under the Nested Error Regression Model

Ziyang Lyu, A. H. Welsh

Comments 35 pages, including 6 tables and 2 figures

Journal ref Statistica Sinica 35 (2025), 1277-1299
Abstract

Estimating characteristics of domains (referred to as small areas) within a population from sample surveys of the population is an important problem in survey statistics. In this paper, we consider model-based small area estimation under the nested error regression model. We discuss the construction of mixed model estimators (empirical best linear unbiased predictors, EBLUPs) of small area means and the conditional linear predictors of small area means. Under the asymptotic framework of increasing numbers of small areas and increasing numbers of units in each area, we establish asymptotic linearity results and central limit theorems for these estimators which allow us to establish asymptotic equivalences between estimators, approximate their sampling distributions, obtain simple expressions for and construct simple estimators of their asymptotic mean squared errors, and justify asymptotic prediction intervals. We present model-based simulations that show that in quite small, finite samples, our mean squared error estimator performs as well as or better than the widely-used Prasad and Rao (1990) estimator and is much simpler, so is easier to interpret. We also carry out a design-based simulation using real data on consumer expenditure on fresh milk products to explore the design-based properties of the mixed model estimators. We explain and interpret some surprising simulation results through analysis of the population and further design-based simulations. The simulations highlight important differences between the model- and design-based properties of mixed model estimators in small area estimation.

1908.01943 2026-03-17 q-fin.RM math.ST stat.TH

Stochastic comparisons of sample mean differences for multivariate random variables

Xuehua Yin, Dan Zhu, Chuancun Yin

Comments 14 pages

Abstract

In this paper, we establish the stochastic ordering of the Gini indexes for multivariate elliptical risks, which generalizes the corresponding results for multivariate normal risks. It is shown that several conditions on dispersion matrices and the components of dispersion matrices of multivariate normal risks for the monotonicity of the Gini index in the usual stochastic order proposed by Samanthi, Wei and Brazauskas (2016) and Kim and Kim (2019) are also suitable for multivariate elliptical risks. We also study the tail probability of the Gini index for multivariate elliptical risks and revise a large deviation result for the Gini indexes of multivariate normal risks in Kim and Kim (2019).

2603.14768 2026-03-17 cs.LG stat.ML

Understanding the geometry of deep learning with decision boundary volume

Matthew Burfitt, Jacek Brodzki, Pawel Dłotko

Abstract

For classification tasks, the performance of a deep neural network is determined by the structure of its decision boundary, whose geometry directly affects essential properties of the model, including accuracy and robustness. Motivated by a classical tube formula due to Weyl, we introduce a method to measure the decision boundary of a neural network through local surface volumes, providing a theoretically justifiable and efficient measure enabling a geometric interpretation of the effectiveness of the model applicable to the high dimensional feature spaces considered in deep learning. A smaller surface volume is expected to correspond to lower model complexity and better generalisation. We verify, on a number of image processing tasks with convolutional architectures, that decision boundary volume is inversely proportional to classification accuracy. Meanwhile, the relationship between local surface volume and generalisation for fully connected architectures is observed to be less stable between tasks. Therefore, for network architectures suited to a particular data structure, we demonstrate that smoother decision boundaries lead to better performance, as our intuition would suggest.

2603.14752 2026-03-17 stat.ME

Prior- and likelihood-free probabilistic inference with finite-sample calibration guarantees

Leonardo Cella, Emily C. Hector

Comments 26 pages, 6 figures

Abstract

Motivated by parametric models for which the likelihood is analytically unavailable, numerically unstable, or prohibitively expensive to compute or optimize, we develop a prior- and likelihood-free framework for fully probabilistic (Bayesian-like) uncertainty quantification with finite-sample calibration guarantees. Our method, a type of inferential model, produces data-dependent degrees of belief about claims concerning the unknown parameter while controlling the frequency with which high belief is assigned to false claims, even in finite-sample settings. Our procedure is general in that it requires only the ability to simulate from the model. We first rank candidate parameter values according to how well data simulated from the model agree with the observed data, and then rescale these rankings in a way that yields the desired finite-sample calibration guarantees. The key idea is to employ a permutation-invariant function, such as a depth function, to rank parameter values. We show that such a choice yields closed-form calibration rescaling calculations, making the procedure computationally simple. We illustrate our method's broad appeal with four examples, including differential privacy and Ising models. An analysis of the spatial configuration of 2025 measles outbreaks in the U.S. showcases our method's practical advantages.

2603.14744 2026-03-17 quant-ph cs.CC math.OC math.ST stat.TH

Towards Exponential Quantum Improvements in Solving Cardinality-Constrained Binary Optimization

Haomu Yuan, Hanqing Wu, Kuan-Cheng Chen, Bin Cheng, Crispin H. W. Barnes

Comments 19 pages

Abstract

Cardinality-constrained binary optimization is a fundamental computational primitive with broad applications in machine learning, finance, and scientific computing. In this work, we introduce a Grover-based quantum algorithm that exploits the structure of the fixed-cardinality feasible subspace under a natural promise on solution existence. For quadratic objectives, our approach achieves ${O}\left(\sqrt{\frac{\binom{n}{k}}{M}}\right)$ Grover rotations for any fixed cardinality $k$ and degeneracy of the optima $M$, yielding an exponential reduction in the number of Grover iterations compared with unstructured search over $\{0,1\}^n$. Building on this result, we develop a hybrid classical--quantum framework based on the alternating direction method of multipliers (ADMM) algorithm. The proposed framework is guaranteed to output an $ε$-approximate solution with a consistency tolerance $ε + δ$ using at most ${O}\left(\sqrt{\binom{n}{k}}\frac{n^{6}k^{3/2}}{\sqrt{M}ε^2 δ}\right)$ queries to a quadratic oracle, together with ${O}\left(\frac{n^{6}k^{3/2}}{ε^2 δ}\right)$ classical overhead. Overall, our method suggests a practical use of quantum resources and demonstrates an exponential improvement over existing Grover-based approaches in certain parameter regimes, thereby paving the way toward quantum advantage in constrained binary optimization.
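
To give a feel for the gap between the two search-space sizes mentioned in this abstract, the following minimal Python sketch compares iteration counts for a search restricted to the $\binom{n}{k}$ fixed-cardinality strings against unstructured search over $2^n$ strings. The abstract only states the big-O scaling; the $(\pi/4)\sqrt{N/M}$ constant used here is the textbook Grover count and is an assumption of this sketch, as are the function names `grover_iterations` and `compare`.

```python
import math


def grover_iterations(search_space_size: int, num_solutions: int) -> int:
    """Textbook Grover iteration count, roughly (pi / 4) * sqrt(N / M).

    The constant pi / 4 is the standard value for plain Grover search and is
    an assumption here; the abstract only gives the O(sqrt(N / M)) scaling.
    """
    return math.floor((math.pi / 4) * math.sqrt(search_space_size / num_solutions))


def compare(n: int, k: int, m: int = 1) -> tuple[int, int]:
    """Return (structured, unstructured) iteration counts for n bits, cardinality k."""
    structured = grover_iterations(math.comb(n, k), m)  # only weight-k strings
    unstructured = grover_iterations(2 ** n, m)         # all n-bit strings
    return structured, unstructured


if __name__ == "__main__":
    for n in (20, 40, 60):
        s, u = compare(n, k=3)
        print(f"n={n:2d} k=3  structured={s}  unstructured={u}")
```

For fixed $k$, $\binom{n}{k}$ grows polynomially in $n$ while $2^n$ grows exponentially, which is where the reduction in iteration count comes from.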

2603.14704 2026-03-17 cs.LG cs.CV stat.ML

Chain-of-Trajectories: Unlocking the Intrinsic Generative Optimality of Diffusion Models via Graph-Theoretic Planning

Ping Chen, Xiang Liu, Xingpeng Zhang, Fei Shen, Xun Gong, Zhaoxiang Liu, Zezhou Chen, Huan Hu, Kai Wang, Shiguo Lian

Comments 12 figures, 5 tables

Abstract

Diffusion models operate in a reflexive System 1 mode, constrained by a fixed, content-agnostic sampling schedule. This rigidity arises from the curse of state dimensionality, where the combinatorial explosion of possible states in the high-dimensional noise manifold renders explicit trajectory planning intractable and leads to systematic computational misallocation. To address this, we introduce Chain-of-Trajectories (CoTj), a train-free framework enabling System 2 deliberative planning. Central to CoTj is Diffusion DNA, a low-dimensional signature that quantifies per-stage denoising difficulty and serves as a proxy for the high-dimensional state space, allowing us to reformulate sampling as graph planning on a directed acyclic graph. Through a Predict-Plan-Execute paradigm, CoTj dynamically allocates computational effort to the most challenging generative phases. Experiments across multiple generative models demonstrate that CoTj discovers context-aware trajectories, improving output quality and stability while reducing redundant computation. This work establishes a new foundation for resource-aware, planning-based diffusion modeling. The code is available at https://github.com/UnicomAI/CoTj.

2603.14676 2026-03-17 stat.ME stat.ML

Scalable Text-Embedding-informed Cognitive Diagnosis of Large Language Models

Jia Liu, Zhiyu Xu, Yuqi Gu

Comments 34 pages of main text, 12 pages of appendix, 7 figures

Abstract

Large language models (LLMs) have achieved remarkable performance on diverse benchmarks, yet existing evaluation practices largely rely on coarse summary metrics that obscure underlying reasoning abilities. In this work, we propose novel methodologies to adapt cognitive diagnosis models (CDMs) in psychometrics to LLM evaluation, enabling fine-grained diagnosis via multidimensional discrete capability profiles and interpretable characterizations of LLM strengths and weaknesses. First, to enable CDM-based evaluation at benchmark scale (more than 1000 items), we propose a scalable method that jointly estimates LLM mastery profiles and the item-attribute Q-matrix, addressing key challenges posed by high-dimensional latent attributes (K > 20), large item pools, and the prohibitive computational cost of existing marginal maximum likelihood-based estimation. Second, we incorporate item-level textual information to construct AI-embedding-informed priors for the Q-matrix, stabilizing high-dimensional estimation while reducing reliance on costly human specification. We develop an efficient stochastic-approximation algorithm to jointly estimate LLM mastery profiles and the Q-matrix that balances data fit with text-embedding-informed priors. Simulation studies demonstrate accurate parameter recovery. An application to the MATH Level 5 benchmark illustrates the practical utility of our method for LLM evaluation and uncovers useful insights into LLMs' fine-grained capabilities.

2603.14608 2026-03-17 cs.LG cs.AI math.OC stat.ML

Delightful Policy Gradient

Ian Osband

Abstract

Standard policy gradients weight each sampled action by advantage alone, regardless of how likely that action was under the current policy. This creates two pathologies: within a single decision context (e.g. one image or prompt), a rare negative-advantage action can disproportionately distort the update direction; across many such contexts in a batch, the expected gradient over-allocates budget to contexts the policy already handles well. We introduce the Delightful Policy Gradient (DG), which gates each term with a sigmoid of delight, the product of advantage and action surprisal (negative log-probability). For $K$-armed bandits, DG provably improves directional accuracy in a single context and, across multiple contexts, shifts the expected gradient strictly closer to the supervised cross-entropy oracle. This second effect is not variance reduction: it persists even with infinite samples. Empirically, DG outperforms REINFORCE, PPO, and advantage-weighted baselines across MNIST, transformer sequence modeling, and continuous control, with larger gains on harder tasks.
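
The gating described in this abstract can be illustrated with a small Python sketch. It computes the scalar that would multiply the score-function term for one sampled action: the advantage times a sigmoid of the "delight" (advantage times surprisal). This is one reading of the abstract's one-sentence description, not the paper's implementation; any temperature, baseline, or normalisation the paper may use is omitted, and the name `gated_weight` is hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_weight(advantage: float, log_prob: float) -> float:
    """Per-action weight: sigmoid(delight) * advantage.

    delight = advantage * surprisal, with surprisal = -log pi(a | context).
    Assumption: the gate multiplies the usual advantage-weighted term.
    """
    surprisal = -log_prob
    delight = advantage * surprisal
    return sigmoid(delight) * advantage


if __name__ == "__main__":
    # Same negative advantage, different action probabilities.
    for prob in (0.9, 0.5, 0.01):
        w = gated_weight(-1.0, math.log(prob))
        print(f"pi(a)={prob:<5} advantage=-1.0  gated weight={w:.4f}")
```

Under this reading, a rare action with negative advantage has large negative delight, so its gate is close to zero and it contributes little to the update, whereas a rare action with positive advantage keeps a gate close to one.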

2603.14578 2026-03-17 stat.ML cs.LG math.PR

Power-Law Spectrum of the Random Feature Model

Elliot Paquette, Ke Liang Xiao, Yizhe Zhu

Abstract

Scaling laws for neural networks, in which the loss decays as a power-law in the number of parameters, data, and compute, depend fundamentally on the spectral structure of the data covariance, with power-law eigenvalue decay appearing ubiquitously in vision and language tasks. A central question is whether this spectral structure is preserved or destroyed when data passes through the basic building block of a neural network: a random linear projection followed by a nonlinear activation. We study this question for the random feature model: given data $x \sim N(0,H) \in \mathbb{R}^v$ where $H$ has $α$-power-law spectrum ($λ_j(H) \asymp j^{-α}$, $α > 1$), a Gaussian sketch matrix $W \in \mathbb{R}^{v\times d}$, and an entrywise monomial $f(y) = y^{p}$, we characterize the eigenvalues of the population random-feature covariance $\mathbb{E}_{x}[\frac{1}{d}f(W^\top x)^{\otimes 2}]$. We prove matching upper and lower bounds: for all $1 \leq j \leq c_1 d \log^{-(p+1)}(d)$, the $j$-th eigenvalue is of order $\left(\log^{p-1}(j+1)/j\right)^α$. For $c_1 d \log^{-(p+1)}(d) \leq j \leq d$, the $j$-th eigenvalue is of order $j^{-α}$ up to a polylog factor. That is, the power-law exponent $α$ is inherited exactly from the input covariance, modified only by a logarithmic correction that depends on the monomial degree $p$. The proof combines a dyadic head-tail decomposition with Wick chaos expansions for higher-order monomials and random matrix concentration inequalities.

2603.14576 2026-03-17 quant-ph cs.LG stat.ML

IQP Born Machines under Data-dependent and Agnostic Initialization Strategies

Sacha Lerch, Joseph Bowles, Ricard Puig, Erik Armengol, Zoë Holmes, Supanut Thanasilp

Comments 16 + 35 pages, 3 + 4 figures

Abstract

Quantum circuit Born machines based on instantaneous quantum polynomial-time (IQP) circuits are natural candidates for quantum generative modeling, both because of their probabilistic structure and because IQP sampling is provably classically hard in certain regimes. Recent proposals focus on training IQP-QCBMs using Maximum Mean Discrepancy (MMD) losses built from low-body Pauli-$Z$ correlators, but the effect of initialization on the resulting optimization landscape remains poorly understood. In this work, we address this by first proving that the MMD loss landscape suffers from barren plateaus for random full-angle-range initializations of IQP circuits. We then establish lower bounds on the loss variance for the identity initialization and for an unbiased data-agnostic initialization. We additionally consider a data-dependent initialization that is better aligned with the target distribution and, under suitable assumptions, yields provable gradients and generally converges more quickly to a good minimum (as indicated by our training of circuits with 150 qubits on genomic data). Finally, as a by-product, the developed variance lower bound framework is applicable to a general class of non-linear losses, offering a broader toolset for analyzing warm-starts in quantum machine learning.

2603.14547 2026-03-17 math.NA cs.IT cs.NA math.IT math.ST stat.TH

Maximum Entropy Least Squares Solutions of Overdetermined Linear Systems

Felice Iavernaro, Monica Lazzo, Lorenzo Pisani

Comments 34 pages, 10 figures

Abstract

We investigate the theoretical foundations of a recently introduced entropy-based formulation of weighted least squares for the approximation of overdetermined linear systems, motivated by robust data fitting in the presence of sparse gross errors. The weight vector is interpreted as a discrete probability distribution and is determined by maximizing Shannon entropy under normalization and a prescribed mean squared error (MSE) constraint. Unlike classical ordinary least squares, where the error level is an output of the minimization process, here the MSE value plays the role of a control parameter, and entropy selects the least biased weight distribution achieving the prescribed accuracy. The resulting optimization problem is nonconvex due to the nonlinear coupling between the weights and the solution induced by the residual constraint. We analyze the associated optimality system and characterize stationary points through first- and second-order conditions. We prove the existence and local uniqueness of a smooth branch of entropy-maximizing configurations emanating from the ordinary least squares solution and establish its global continuation under suitable nondegeneracy conditions. Furthermore, we investigate the asymptotic regime as the prescribed MSE tends to zero and show that, under appropriate assumptions, the limiting configuration concentrates on a largest subset of data consistent with the linear model, thus suppressing the influence of outliers. Two numerical experiments illustrate the theoretical findings and confirm the robustness properties of the method.

2603.14543 2026-03-17 stat.ME stat.ML

Gradient Boosting for Spatial Panel Models with Random and Fixed Effects

Michael Balzer, Adhen Benlahlou

Abstract

Due to the increase in data availability in urban and regional studies, various spatial panel models have emerged to model spatial panel data, which exhibit spatial patterns and spatial dependencies between observations across time. Although estimation is usually based on maximum likelihood or generalized method of moments, these methods may fail to yield unique solutions if researchers are faced with high-dimensional settings. This article proposes a model-based gradient boosting algorithm, which enables estimation with interpretable results that is feasible in low- and high-dimensional settings. Due to its modular nature, the flexible model-based gradient boosting algorithm is suitable for a variety of spatial panel models, which can include random and fixed effects. The general framework also enables data-driven model and variable selection as well as implicit regularization where the bias-variance trade-off is controlled for, thereby enhancing accuracy of prediction on out-of-sample spatial panel data. Monte Carlo experiments concerned with the performance of estimation and variable selection confirm proper functionality in low- and high-dimensional settings while real-world applications including non-life insurance in Italian districts, rice production in Indonesian farms and life expectancy in German districts illustrate the potential application.

2603.14514 2026-03-17 cs.LG cs.SY eess.SY math.OC stat.ML

High-Probability Bounds for SGD under the Polyak-Lojasiewicz Condition with Markovian Noise

Avik Kar, Siddharth Chandak, Rahul Singh, Eric Moulines, Shalabh Bhatnagar, Nicholas Bambos

Comments Submitted to SIAM Journal on Optimization

Abstract

We present the first uniform-in-time high-probability bound for SGD under the PL condition, where the gradient noise contains both Markovian and martingale difference components. This significantly broadens the scope of finite-time guarantees, as the PL condition arises in many machine learning and deep learning models while Markovian noise naturally arises in decentralized optimization and online system identification problems. We further allow the magnitude of noise to grow with the function value, enabling the analysis of many practical sampling strategies. In addition to the high-probability guarantee, we establish a matching $1/k$ decay rate for the expected suboptimality. Our proof technique relies on the Poisson equation to handle the Markovian noise and a probabilistic induction argument to address the lack of almost-sure bounds on the objective. Finally, we demonstrate the applicability of our framework by analyzing three practical optimization problems: token-based decentralized linear regression, supervised learning with subsampling for privacy amplification, and online system identification.

2603.14481 2026-03-17 stat.ML cs.LG math.PR

Convergence of Two Time-Scale Stochastic Approximation: A Martingale Approach

Mathukumalli Vidyasagar

Comments 21 pages

Abstract

In this paper, we analyze the two time-scale stochastic approximation (TTSSA) algorithm introduced in Borkar (1997) using a martingale approach. This approach leads to simple sufficient conditions for the iterations to be bounded almost surely, as well as estimates on the rate of convergence of the mean-squared error of the TTSSA algorithm to zero. Our theory is applicable to nonlinear equations, in contrast to many papers in the TTSSA literature which assume that the equations are linear. The convergence of TTSSA is proved in the "almost sure" sense, in contrast to earlier papers on TTSSA that establish convergence in distribution, convergence in the mean, and the like. Moreover, in this paper we establish different rates of convergence for the fast and the slow subsystems, perhaps for the first time. Finally, all of the above results continue to hold in the case where the two measurement errors have nonzero conditional mean, and/or have conditional variances that grow without bound as the iterations proceed. This is in contrast to previous papers which assumed that the errors form a martingale difference sequence with uniformly bounded conditional variance. It is shown that when the measurement errors have zero conditional mean and the conditional variance remains bounded, the mean-squared error of the iterations converges to zero at a rate of $o(t^{-η})$ for all $η \in (0,1)$. This improves upon the rate of $O(t^{-2/3})$ proved in Doan (2023) (which is the best bound available to date). Our bound is virtually the same as the rate of $O(t^{-1})$ proved in Doan (2024), but for a Polyak-Ruppert averaged version of TTSSA, and not directly. Rates of convergence are also established for the case where the errors have nonzero conditional mean and/or unbounded conditional variance.

2603.14423 2026-03-17 math.ST cs.IT math.IT stat.ME stat.TH

Tighter Confidence Intervals under Without Replacement Sampling via Empirical Rate Functions

Shubhanshu Shekhar, Aaditya Ramdas

Comments 39 pages, 4 figures

Abstract

We consider the problem of constructing confidence intervals (CIs) for the population mean of $N$ values $\{x_1, \ldots, x_N\} \subset Σ^N$ based on a random sample of size $n$, denoted by $X^n \equiv (X_1, \ldots, X_n)$, drawn uniformly without replacement (WoR). We begin by focusing on the finite alphabet ($|Σ| = k <\infty$) and moderate accuracy ($\log(1/α_N) \gg (k+1)\log N$) regime, and derive a fundamental lower bound on the width of any level-$(1-α_N)$ CI in terms of the inverse of the WoR rate functions from the theory of large deviations. Guided by this lower bound, we propose a new level-$(1-α_N)$ CI using an empirical inverse rate function, and show that in certain asymptotic regimes the width of this CI matches the lower bound up to constants. We also derive a dual formulation of the inverse rate function that enables efficient computation of our proposed CI. We then move beyond the finite alphabet case and use a Bernoulli coupling idea to construct an almost sure CI for $Σ= [0,1]$, and a conceptually simple nonasymptotic CI for the case of $Σ$ being a $(2,D)$ smooth Banach space. For both finite and general alphabets, our results employ classical large deviation techniques in novel ways, thus establishing new connections between estimation under WoR sampling and the theory of large deviations.

2603.14387 2026-03-17 stat.ME cs.LG stat.ML

Label Noise Cleaning for Supervised Classification via Bernoulli Random Sampling

Yuxin Liu, Xiong Jin, Yang Han

Abstract

Label noise (incorrect labels assigned to observations) can substantially degrade the performance of supervised classifiers. This paper proposes a label noise cleaning method based on Bernoulli random sampling. We show that the mean label noise levels of subsets generated by Bernoulli random sampling containing a given observation are identically distributed for all clean observations, and identically distributed, with a different distribution, for all noisy observations. Although the mean label noise levels are not independent across observations, by introducing an independent coupling we further prove that they converge to a mixture of two well-separated distributions corresponding to clean and noisy observations. By establishing a linear model between cross-validated classification errors and label noise levels, we are able to approximate this mixture distribution and thereby separate clean and noisy observations without any prior label information. The proposed method is classifier-agnostic, theoretically justified, and demonstrates strong performance on both simulated and real datasets.

2603.14381 2026-03-17 stat.ME

A Bayesian Critique of Rank-Based Methods for Surrogate Marker Evaluation

Pietro Carlotti, Layla Parast

Abstract

Surrogate markers are often employed in clinical trials to replace primary outcomes that may be difficult, expensive, or time-consuming to measure directly. These markers can accelerate the evaluation of new treatments, provided they reliably capture the causal relationship between treatment and true clinical benefit. Parast et al. (2024) recently proposed a rank-based approach for evaluating surrogate markers, characterized by its nonparametric nature and minimal assumptions. While this method is useful in small-sample model-agnostic settings, it has several limitations, including a lack of clear causal interpretation, low statistical power, and insufficient robustness to different data-generating mechanisms. In this paper, we propose a Bayesian approach that addresses these shortcomings by focusing on causal treatment effect estimands and, in doing so, improves power through covariate adjustment. We demonstrate the advantages of our proposed method through a simulation study designed to highlight gains in both accuracy and power.

2603.14325 2026-03-17 cs.IT math.IT stat.ML

Fundamental Limits of CSI Compression in FDD Massive MIMO

Bumsu Park, Youngmok Park, Chanho Park, Namyoon Lee

Comments 14 pages, 5 figures

Abstract

Channel state information (CSI) feedback in frequency-division duplex (FDD) massive multiple-input multiple-output (MIMO) systems is fundamentally limited by the high dimensionality of wideband channels. In this paper, we model the stacked wideband CSI vector as a Gaussian-mixture source with a latent geometry state that represents different propagation environments. Each component corresponds to a locally stationary regime characterized by a correlated proper complex Gaussian distribution with its own covariance matrix. This representation captures the multimodal nature of practical CSI datasets while preserving the analytical tractability of Gaussian models. Motivated by this structure, we propose Gaussian-mixture transform coding (GMTC), a practical CSI feedback architecture that combines state inference with state-adaptive TC. The mixture parameters are learned offline from channel samples and stored as a shared statistical dictionary at both the user equipment (UE) and the base station. For each CSI realization, the UE identifies the most likely geometry state, encodes the corresponding label using a lossless source code, and compresses the CSI using the Karhunen-Loeve transform matched to that state. We further characterize the fundamental limits of CSI compression under this model by deriving analytical converse and achievability bounds on the rate-distortion (RD) function. A key structural result is that the optimal bit allocation across all mixture components is governed by a single global reverse-waterfilling level. Simulations on the COST2100 dataset show that GMTC significantly improves the RD tradeoff relative to neural transform coding approaches while requiring substantially smaller model memory and lower inference complexity. These results indicate that near-optimal CSI compression can be achieved through state-adaptive TC without relying on large neural encoders.

2603.14233 2026-03-17 stat.ME

Conformalized Robust Principal Component Analysis

Liangliang Yuan, Lei Wang, Quan Kong, Liuhua Peng

Abstract

Robust principal component analysis (RPCA) is a widely used technique for recovering low-rank structure from matrices with missing entries and sparse, possibly large-magnitude corruptions. Although numerous algorithms achieve accurate point estimation, they offer little guidance on the uncertainty of recovered entries, limiting their reliability in practice. In this paper, we propose conformal prediction-RPCA (CP-RPCA), a practical and distribution-free framework for uncertainty quantification in robust matrix recovery. Our proposed method supports both split and full conformal implementations and incorporates weighted calibration to handle heterogeneous observation probabilities. We provide theoretical guarantees for finite-sample coverage and demonstrate through extensive simulations that CP-RPCA delivers reliable uncertainty quantification under severe outliers, missing data and model misspecification. Empirical results show that CP-RPCA can produce informative intervals and remain competitive in efficiency when the RPCA model is well specified, making it a scalable and robust tool for uncertainty-aware matrix analysis.

2603.14231 2026-03-17 stat.ME

Rank-based Maxsum test for high dimensional regression coefficient

Ping Zhao, Liangliang Yuan

Comments 1 pages, 1 table, 2 figures

Abstract

We study global inference for regression coefficients in high-dimensional linear models under potentially heavy-tailed errors. While sum-type tests are powerful for dense alternatives and max-type tests excel for sparse alternatives, practical applications rarely reveal the sparsity level, and many existing procedures rely on light-tail assumptions. Motivated by the Wilcoxon-score sum test of Feng et al. (2013) and the two Wilcoxon-score maximum tests of Xu and Zhou (2021), we establish under $H_0$ the asymptotic independence between the rank-based sum statistic and each max statistic. These joint limit results justify principled $p$-value aggregation, and we propose two adaptive rank-based maxsum tests via the Cauchy combination method (Liu and Xie, 2020). The proposed procedures inherit robustness from rank-based construction and adaptivity from combining dense- and sparse-sensitive components. Simulation studies confirm accurate size control and strong power across a wide range of error distributions and sparsity regimes.
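
The $p$-value aggregation step named in this abstract, the Cauchy combination method of Liu and Xie (2020), can be sketched in a few lines of Python. The sketch maps each $p$-value to a standard Cauchy quantile, takes a weighted average, and maps back through the Cauchy tail probability. The equal default weights and the function name `cauchy_combination` are assumptions of this illustration; how the paper computes the individual sum-type and max-type $p$-values is not shown here.

```python
import math


def cauchy_combination(p_values, weights=None):
    """Combine p-values with the Cauchy combination method.

    T = sum_i w_i * tan((0.5 - p_i) * pi), combined p = 0.5 - arctan(T) / pi.
    Weights must be non-negative and sum to one; equal weights are assumed by
    default. Inputs should lie strictly inside (0, 1), since tan diverges at
    p = 0 and p = 1.
    """
    if weights is None:
        weights = [1.0 / len(p_values)] * len(p_values)
    statistic = sum(w * math.tan((0.5 - p) * math.pi)
                    for w, p in zip(weights, p_values))
    return 0.5 - math.atan(statistic) / math.pi


if __name__ == "__main__":
    # Hypothetical p-values from a sum-type test and a max-type test.
    p_sum, p_max = 0.01, 0.80
    print(f"combined p-value: {cauchy_combination([p_sum, p_max]):.4f}")
```

A very small $p$-value from either component dominates the average of Cauchy quantiles, which is why such a combination can stay sensitive to both dense and sparse alternatives.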

2603.14129 2026-03-17 stat.ME

Semiparametric copula-based quantile regression for semicontinuous outcomes with application to healthcare data

Guanjie Lyu, Mohamed Belalia, Abdulkadir Hussein

Comments 25 pages, 2 figures

Abstract

A semiparametric copula-based two-part quantile regression framework is developed for the analysis of semicontinuous outcomes characterized by a point mass at zero and a continuous positive component. The proposed approach models the occurrence and magnitude processes separately and links them through copula-based conditional distributions, allowing for flexible dependence structures and nonlinear covariate effects across quantiles. Large-sample properties of the resulting estimator are established, and extensive simulation studies demonstrate improved finite-sample performance relative to logistic/linear quantile regression, particularly under nonlinear dependence and substantial zero inflation. An application to healthcare data illustrates how the proposed method provides a nuanced characterization of the association between social deprivation and uncompensated and charity care burdens, revealing heterogeneous and nonlinear effects that are not captured by competing approaches.

2603.14103 2026-03-17 cs.MS math.ST stat.ML stat.TH

Scorio.jl: A Julia package for ranking stochastic responses

Mohsen Hariri, Michael Hinczewski, Vipin Chaudhary

Abstract

Scorio.jl is a Julia package for evaluating and ranking systems from repeated responses to shared tasks. It provides a common tensor-based interface for direct score-based, pairwise, psychometric, voting, graph, and listwise methods, so the same benchmark can be analyzed under multiple ranking assumptions. We describe the package design, position it relative to existing Julia tools, and report pilot experiments on synthetic rank recovery, stability under limited trials, and runtime scaling.

2603.14092 2026-03-17 cs.LG stat.ME

Soft Mean Expected Calibration Error (SMECE): A Calibration Metric for Probabilistic Labels

Michael Leznik

Abstract

The Expected Calibration Error (ECE), the dominant calibration metric in machine learning, compares predicted probabilities against empirical frequencies of binary outcomes. This is appropriate when labels are binary events. However, many modern settings produce labels that are themselves probabilities rather than binary outcomes: a radiologist's stated confidence, a teacher model's soft output in knowledge distillation, a class posterior derived from a generative model, or an annotator agreement fraction. In these settings, ECE commits a category error: it discards the probabilistic information in the label by forcing it into a binary comparison. The result is not a noisy approximation that more data will correct. It is a structural misalignment that persists and converges to the wrong answer with increasing precision as sample size grows. We introduce the Soft Mean Expected Calibration Error (SMECE), a calibration metric for settings where labels are of probabilistic nature. The modification to the ECE formula is one line: replace the empirical hard-label fraction in each prediction bin with the mean probability label of the samples in that bin. SMECE reduces exactly to ECE when labels are binary, making it a strict generalisation.
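
The one-line modification described in this abstract can be sketched in Python as follows. The function bins predictions, then compares the mean prediction in each bin with the mean label in that bin, weighted by bin size. With binary labels the mean label is the empirical positive fraction, so the same function gives the usual binned calibration error; with probabilistic labels it gives the soft variant. Equal-width bins, the default of 10 bins, and the name `binned_calibration_error` are assumptions of this sketch rather than details taken from the paper.

```python
def binned_calibration_error(preds, labels, n_bins=10):
    """Binned calibration error that accepts hard (0/1) or soft labels.

    preds  : predicted probabilities in [0, 1]
    labels : binary outcomes, or probabilistic labels in [0, 1]
    Returns sum over bins of (bin size / n) * |mean prediction - mean label|.
    """
    n = len(preds)
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(preds, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p = 1.0 goes in the last bin
        bins[idx].append((p, y))

    error = 0.0
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        mean_label = sum(y for _, y in members) / len(members)  # soft or hard
        error += (len(members) / n) * abs(mean_pred - mean_label)
    return error


if __name__ == "__main__":
    preds = [0.7, 0.7, 0.2, 0.2]
    soft_labels = [0.7, 0.7, 0.2, 0.2]  # labels that are themselves probabilities
    hard_labels = [1, 0, 0, 0]          # binary outcomes
    print("soft labels:", binned_calibration_error(preds, soft_labels))
    print("hard labels:", binned_calibration_error(preds, hard_labels))
```

Because the only change is which per-bin label average is used, passing 0/1 labels recovers the standard binned quantity, matching the reduction property stated in the abstract.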

2603.14070 2026-03-17 stat.ME stat.ML

Structured Credal Learning

Varun Venkatesh, Eyke Hüllermeier, Bernd Bischl, Mina Rezaei

Abstract

Real-world learning tasks often encounter uncertainty due to covariate shift and noisy or inconsistent labels. However, existing robust learning methods merge these effects into a single distributional uncertainty set. In this work, we introduce a novel structured credal learning framework that explicitly separates these two sources. Specifically, we derive geometric bounds on the total variation diameter of structured credal sets and demonstrate how this quantity decomposes into contributions from covariate shift and expected label disagreement. This decomposition reveals a gating effect: covariate shift modulates how much label disagreement contributes to the joint uncertainty, such that seemingly benign covariate shifts can substantially increase the effective uncertainty. We also establish finite-sample concentration bounds in a fixed covariate regime and demonstrate that this quantity can be efficiently estimated. Lastly, we show that robust optimization over these structured credal sets reduces to a tractable discrete min-max problem, avoiding ad-hoc robustness parameters. Overall, our approach provides a principled and practical foundation for robust learning under combined covariate and label mechanism ambiguity.

2603.14037 2026-03-17 math.ST stat.TH

On a Strictly Decreasing Nonparametric Estimator of the Drift Function for Recurrent Diffusion Processes

Nicolas Marie

Comments 17 pages, 2 figures

Abstract

This paper deals with a copies-based continuously differentiable and strictly decreasing estimator of the drift function for stochastic differential equations defining recurrent diffusion processes. The first part of our paper deals with non-asymptotic $\mathbb L^1$-risk bounds and a bandwidths selection procedure for a universal monotone estimator. These results are tailor-made to our framework, and then applied to the estimation of the drift function of recurrent diffusion processes in the second part of the paper.

2603.14034 2026-03-17 cs.SI stat.AP

A Machine Learning Framework for Constructing Heterogeneous Contact Networks: Implications for Epidemic Modelling

Luke Murray Kearney, Emma L Davis, Matt J Keeling

Comments 41 pages, 8 figures

Abstract

Capturing the structured mixing within a population is key to the reliable projection of infectious disease dynamics and hence informed control. Both heterogeneity in the number of contacts and age-structured mixing have been repeatedly demonstrated as fundamental, yet are rarely combined. Networks provide a powerful and intuitive method to realise population structure, and simulate infection dynamics. However, the explicit measurement of contact networks is not scalable to larger populations. Here, using data from social contact surveys, we develop a generalisable and robust algorithm utilizing machine learning to generate a surrogate population-scale network that preserves both age-structured mixing and heterogeneity of contacts. We simulate the spread of infection across different populations, considering how the epidemic size varies over basic reproduction number ($R_0$) scenarios - mirroring the process of determining public health impact from early epidemic growth. Our approach shows that both age structure and degree heterogeneity substantially reduce the epidemic size. We also demonstrate that these simulations more accurately capture the heterogeneity in secondary cases observed for COVID-19 when transmission is scaled by contact duration, dampening the effect of highly connected "super-spreaders". By using survey data collected during 2020-2022, these network models also inform about the impacts of control and targeting of public health interventions: quantifying the non-linear reduction in transmission opportunities that occurred during lockdowns, and the ages and contact types most responsible for onward transmission. Our robust methodology therefore allows the full wealth of data commonly collected by surveys, but frequently overlooked, to be incorporated into more realistic transmission models of infectious diseases.

2603.13953 2026-03-17 math.ST math.PR stat.TH

Random discrete copulas

Damjana Kokol Bukovšek, Blaž Mojškerc, Nik Stopar

Comments 19 pages, 3 figures

Abstract

We introduce the notion of a bivariate random discrete copula on an equidistant mesh and explore its stochastic properties. A random discrete copula is a discrete random field, hence, its value at a given point on the mesh is a random variable. We determine the distribution of this random variable and calculate its expected value and variance. We also consider bilinear extension of a random discrete copula to a random field over the whole unit square.

2603.13935 2026-03-17 math.ST math.PR stat.ME stat.TH

A two-sample test for symmetric positive definite matrix distributions using Wishart kernel density estimators

Frédéric Ouimet

Comments 34 pages, 0 figures

Abstract

We develop a nonparametric two-sample test for distributions supported on the cone of symmetric positive definite matrices. The procedure relies on the Wishart kernel density estimator (KDE) introduced by Belzile et al. (2025), whose support-adaptive kernel alleviates boundary bias by remaining confined to the cone. Our test statistic is the rescaled integrated squared difference between two Wishart KDEs and can be expressed as a two-sample $V$-statistic via an explicit closed-form overlap of Wishart kernels, avoiding numerical integration. Under the null hypothesis of equal densities, we derive the asymptotic distribution in both the common shrinking-bandwidth and fixed-bandwidth regimes. The proposed method provides a kernel-based competitor to the empirical Laplace-transform two-sample test of Lukić (2024). Unlike the orthogonally invariant Hankel-transform test of Lukić and Milošević (2024), our statistic can detect alternatives that differ only through eigenvector structure, for instance, Wishart models with the same shape parameter and the same scale eigenvalues but different orientations.

2603.13930 2026-03-17 stat.ME math.ST stat.TH

Spatially Varying Coefficient Mallows Model Averaging

Zhuang Yong, Lv Jing, Tingting Li

Abstract

Model averaging, as an appealing ensemble technique, strategically integrates all valuable information from candidate models to construct fast and accurate predictions. Despite having been widely practiced in many fields such as cross-sectional data, censored data and longitudinal data, its application to spatial data characterized by inherent spatial heterogeneity remains surprisingly limited. To mitigate the risk of model misspecification and enhance the flexibility of prediction, we propose a combined estimator constructed by computing the weighted average of estimators derived from a set of spatially varying coefficient candidate models. Herein, the model weights are determined via a Mallows-type criterion, which dynamically calibrates the relative importance of individual candidate models in the ensemble. Theoretically, we establish desirable asymptotic properties under two practical scenarios. First, in the case where all candidate models are misspecified, the proposed model averaging estimator attains asymptotic optimality in the sense that it minimizes the squared error loss function asymptotically. Second, when the candidate model set encompasses at least one quasi-correct model, the weights assigned by the Mallows-type criterion asymptotically concentrate on the quasi-correct models, and the resulting model averaging estimator converges in probability to the true conditional mean. Both simulation studies and a real-world empirical example demonstrate that the proposed method generally outperforms alternative comparative approaches in terms of predictive accuracy and robustness.

2603.10886 2026-03-17 stat.ML cs.LG stat.ME

Kernel Tests of Equivalence

Xing Liu, Axel Gandy

Comments 29 pages; 6 figures

Abstract

We propose novel kernel-based tests for assessing the equivalence between distributions. Traditional goodness-of-fit testing is inappropriate for concluding the absence of distributional differences, because failure to reject the null hypothesis may simply be a result of lack of test power, also known as the Type-II error. This motivates equivalence testing, which aims to assess the absence of a statistically meaningful effect under controlled error rates. However, existing equivalence tests are either limited to parametric distributions or focus only on specific moments rather than the full distribution. We address these limitations using two kernel-based statistical discrepancies: the kernel Stein discrepancy and the Maximum Mean Discrepancy. The null hypothesis of our proposed tests assumes the candidate distribution differs from the nominal distribution by at least a pre-defined margin, which is measured by these discrepancies. We propose two approaches for computing the critical values of the tests, one using an asymptotic normality approximation, and another based on bootstrapping. Numerical experiments are conducted to assess the performance of these tests.
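
As a rough illustration of one of the two discrepancies named in this abstract, the sketch below computes the standard unbiased U-statistic estimate of the squared Maximum Mean Discrepancy for one-dimensional samples with a Gaussian kernel. It shows only the statistic: the equivalence margin and the critical values (normal approximation or bootstrap) proposed by the authors are not reproduced here. The bandwidth, sample sizes and simulated data are assumptions made for the example.

```python
import math
import random
from typing import Sequence


def gaussian_kernel(x: float, y: float, bandwidth: float = 1.0) -> float:
    """Gaussian (RBF) kernel on the real line."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mmd2_unbiased(
    xs: Sequence[float], ys: Sequence[float], bandwidth: float = 1.0
) -> float:
    """Unbiased estimate of squared MMD between the samples xs and ys."""
    m, n = len(xs), len(ys)
    # Within-sample terms exclude the diagonal (i == j) to stay unbiased.
    kxx = sum(
        gaussian_kernel(xs[i], xs[j], bandwidth)
        for i in range(m) for j in range(m) if i != j
    ) / (m * (m - 1))
    kyy = sum(
        gaussian_kernel(ys[i], ys[j], bandwidth)
        for i in range(n) for j in range(n) if i != j
    ) / (n * (n - 1))
    kxy = sum(
        gaussian_kernel(x, y, bandwidth) for x in xs for y in ys
    ) / (m * n)
    return kxx + kyy - 2.0 * kxy


# Simulated data (made up): two samples from the same law, one shifted.
rng = random.Random(0)
sample_a = [rng.gauss(0.0, 1.0) for _ in range(200)]
sample_b = [rng.gauss(0.0, 1.0) for _ in range(200)]
sample_c = [rng.gauss(2.0, 1.0) for _ in range(200)]

mmd_same = mmd2_unbiased(sample_a, sample_b)     # near zero
mmd_shifted = mmd2_unbiased(sample_a, sample_c)  # clearly positive
```

An equivalence test in the sense of the abstract would compare such an estimate against a pre-defined margin, rejecting the null of "at least margin apart" only when the estimate is sufficiently small.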

2603.05961 2026-03-17 stat.AP physics.data-an

A Tutorial on Bayesian Analysis of Linear Shock Compression Data

Jason Bernstein, Philip C. Myint, Beth A. Lindquist, Justin Lee Brown

Comments 29 pages, 14 figures

Abstract

Gas gun and other shock compression experiments often produce shock wave velocity measurements that are linearly associated with particle velocity. Traditionally, this empirical relationship is quantified with a single Hugoniot curve that is estimated using least squares regression. However, for downstream modeling and simulation tasks, it is often more useful to have multiple Hugoniot curves in the pressure-volume plane that are consistent with the data. We employ Bayesian uncertainty quantification methods as a framework for propagating measurement uncertainty through to model parameters and predictions. Specifically, this tutorial shows how to sample multiple Hugoniot curves in the pressure-volume plane that are consistent with the shock wave-particle velocity measurements in a two-step Bayesian approach. First, we obtain an analytical expression for the posterior distribution of the linear model parameters using Bayesian linear regression. Second, we propagate samples from the posterior distribution through the Rankine-Hugoniot equations to yield Hugoniot curves in the pressure-volume plane. The procedure is demonstrated with publicly available data on argon, copper, and nickel, and compared against bootstrapping and linear regression. The Bayesian procedure is shown to be interpretable, computationally inexpensive, and less sensitive than an alternative bootstrapping approach to the removal of the point in the copper dataset that has the largest particle velocity. As a tutorial on Bayesian methodology for the shock compression community, we provide several derivations and explanations that make this paper self-contained, and make all code and data available at https://github.com/llnl/BALSCD.
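
The two-step procedure described in this abstract can be sketched as follows. This is not the authors' code (which is at the URL above). It assumes a noninformative prior proportional to 1/sigma^2 for the linear model Us = c0 + s * up, a material initially at rest with negligible initial pressure, and synthetic data generated from made-up round numbers. All function names are hypothetical.

```python
import math
import random


def fit_line(up, us):
    """Least squares fit of us = c0 + s * up; also returns RSS, mean(up), Sxx."""
    n = len(up)
    mean_up = sum(up) / n
    mean_us = sum(us) / n
    sxx = sum((x - mean_up) ** 2 for x in up)
    sxy = sum((x - mean_up) * (y - mean_us) for x, y in zip(up, us))
    s_hat = sxy / sxx
    c0_hat = mean_us - s_hat * mean_up
    rss = sum((y - c0_hat - s_hat * x) ** 2 for x, y in zip(up, us))
    return c0_hat, s_hat, rss, mean_up, sxx


def sample_posterior(up, us, n_draws, rng):
    """Step 1: draw (c0, s) from the posterior under the assumed flat prior."""
    n = len(up)
    c0_hat, s_hat, rss, mean_up, sxx = fit_line(up, us)
    centre_hat = c0_hat + s_hat * mean_up  # fitted value at mean(up)
    draws = []
    for _ in range(n_draws):
        # sigma^2 | data is RSS divided by a chi-square with n - 2 dof.
        sigma2 = rss / rng.gammavariate((n - 2) / 2.0, 2.0)
        # In the centred parametrisation, slope and centre are independent.
        s = rng.gauss(s_hat, math.sqrt(sigma2 / sxx))
        centre = rng.gauss(centre_hat, math.sqrt(sigma2 / n))
        draws.append((centre - s * mean_up, s))
    return draws


def hugoniot_curve(c0, s, rho0, up_grid):
    """Step 2: map (c0, s) to (volume, pressure) via the jump conditions."""
    curve = []
    for up in up_grid:
        us = c0 + s * up
        pressure = rho0 * us * up               # momentum conservation
        volume = (1.0 / rho0) * (1.0 - up / us)  # mass conservation
        curve.append((volume, pressure))
    return curve


# Synthetic data (made up): velocities in km/s, density in g/cm^3,
# so that pressure comes out in GPa.
rng = random.Random(0)
RHO0 = 8.9
up_data = [0.2 * i for i in range(1, 11)]
us_data = [4.0 + 1.5 * x + rng.gauss(0.0, 0.05) for x in up_data]

posterior = sample_posterior(up_data, us_data, n_draws=200, rng=rng)
up_grid = [0.25 * i for i in range(1, 9)]
curves = [hugoniot_curve(c0, s, RHO0, up_grid) for c0, s in posterior]
```

Each element of `curves` is one Hugoniot curve in the pressure-volume plane consistent with the synthetic measurements; the spread across curves reflects the posterior uncertainty in (c0, s).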

2603.04365 2026-03-17 math.PR cs.NA math.NA math.ST stat.TH

Comparison theorems for the extreme eigenvalues of a random symmetric matrix

Joel A. Tropp

Comments 32 pages. v2 with minor corrections

Abstract

This paper establishes a comparison theorem for the maximum eigenvalue of a sum of independent random symmetric matrices. The theorem states that the maximum eigenvalue of the matrix sum is dominated by the maximum eigenvalue of a Gaussian random matrix that inherits its statistics from the sum, and it strengthens previous results of this type. Corollaries address the minimum eigenvalue and the spectral norm. The comparison methodology is powerful because of the vast arsenal of tools for treating Gaussian random matrices. As applications, the paper improves on existing eigenvalue bounds for random matrices arising in spectral graph theory, quantum information theory, high-dimensional statistics, and numerical linear algebra. In particular, these techniques deliver the first complete proof that a sparse random dimension reduction map has the injectivity properties conjectured by Nelson & Nguyen in 2013.

2512.24413 2026-03-17 stat.ME

Demystifying Proximal Causal Inference

Grace V. Ringlein, Trang Quynh Nguyen, Peter P. Zandi, Elizabeth A. Stuart, Harsh Parikh

Comments 33 pages, 5 figures

Abstract

Proximal causal inference (PCI) has emerged as a promising framework for identifying and estimating causal effects in the presence of unobserved confounders. While many traditional causal inference methods rely on the assumption of no unobserved confounding, this assumption is likely often violated. PCI addresses this challenge by relying on an alternative set of assumptions regarding the relationships between treatment, outcome, and auxiliary variables that serve as proxies for unmeasured confounders. We review existing identification results, discuss the assumptions necessary for valid causal effect estimation via PCI, and compare different PCI estimation methods. We offer practical guidance on operationalizing PCI, with a focus on selecting and evaluating proxy variables using domain knowledge, measurement error perspectives, and negative control analogies. Through conceptual examples, we demonstrate tensions in proxy selection and discuss the importance of clearly defining the unobserved confounding mechanism. By bridging formal results with applied considerations, this work aims to demystify PCI, encourage thoughtful use in practice, and identify open directions for methodological development and empirical research.

2512.06522 2026-03-17 stat.ME math.ST stat.ML stat.TH

Hierarchical Clustering With Confidence

Di Wu, Jacob Bien, Snigdha Panigrahi

Comments 57 Pages, 11 Figures, 2 Algorithms

Abstract

Agglomerative hierarchical clustering is one of the most widely used approaches for exploring how observations in a dataset relate to each other. However, its greedy nature makes it highly sensitive to small perturbations in the data, often producing different clustering results and making it difficult to separate genuine structure from spurious patterns. In this paper, we show how randomizing hierarchical clustering can be useful not just for measuring stability but also for designing valid hypothesis testing procedures based on the clustering results. We propose a simple randomization scheme together with a method for constructing a valid p-value at each node of the hierarchical clustering dendrogram that quantifies evidence against performing the greedy merge. Our test controls the Type I error rate, works with any hierarchical linkage without case-specific derivations, and simulations show it is substantially more powerful than existing selective inference approaches. To demonstrate the practical utility of our p-values, we develop an adaptive $α$-spending procedure that estimates the number of clusters, with a probabilistic guarantee on overestimation. Experiments on simulated and real data show that this estimate yields powerful clustering and can be used, for example, to assess clustering stability across multiple runs of the randomized algorithm.

2511.09677 2026-03-17 cs.LG stat.ML

Boosted GFlowNets: Improving Exploration via Sequential Learning

Pedro Dall'Antonia, Tiago da Silva, Daniel Augusto de Souza, César Lincoln C. Mattos, Diego Mesquita

Comments 11 pages, 3 figures (22 pages total including supplementary material)

Abstract

Generative Flow Networks (GFlowNets) are powerful samplers for compositional objects that, by design, sample proportionally to a given non-negative reward. Nonetheless, in practice, they often struggle to explore the reward landscape evenly: trajectories toward easy-to-reach regions dominate training, while hard-to-reach modes receive vanishing or uninformative gradients, leading to poor coverage of high-reward areas. We address this imbalance with Boosted GFlowNets, a method that sequentially trains an ensemble of GFlowNets, each optimizing a residual reward that compensates for the mass already captured by previous models. This residual principle reactivates learning signals in underexplored regions and, under mild assumptions, ensures a monotone non-degradation property: adding boosters cannot worsen the learned distribution and typically improves it. Empirically, Boosted GFlowNets achieve substantially better exploration and sample diversity on multimodal synthetic benchmarks and peptide design tasks, while preserving the stability and simplicity of standard trajectory-balance training.

2511.03596 2026-03-17 stat.AP stat.ME

Accounting for Heavy Censoring in Evaluating the Risk Stratification Abilities of Existing Models for Time to Diagnosis of Huntington Disease

Kyle F. Grosser, Abigail G. Foes, Stellen Li, Vraj Parikh, Tanya P. Garcia, Sarah C. Lotspeich

Comments 16 pages, 4 tables, 2 figures

Abstract

Huntington disease (HD) is a neurodegenerative disease with progressively worsening symptoms. Accurately modeling time to HD diagnosis is essential for clinical trial design. Langbehn's model, the CAG-Age Product (CAP) model, the Prognostic Index Normed (PIN) model, and the Multivariate Risk Score (MRS) model have all been proposed for this task. However, these models may yield conflicting predictions and few studies have systematically compared their performance. Further, those that have could be misleading due to testing the models on the same data used to train them and failing to account for high rates of right censoring (80%+) in performance metrics. We discuss the theoretical foundations of these models, offering intuitive comparisons about their practical feasibility. We externally validate their risk stratification abilities using data from the ENROLL-HD study and two censoring-appropriate performance metrics, guiding model selection for HD clinical trial design. As these models were developed in HD studies that ended more than a decade ago, we compared their predictive performance using published parameters versus updated ones (re-estimated using ENROLL-HD). We show how these models can be used to estimate sample sizes for an HD clinical trial. Based on either metric and using published or updated parameters, the MRS model, which incorporates the most covariates, performed best. However, the simpler PIN model offered similarly good performance while requiring fewer variables, many of which would require patients to undergo additional tests. In illustrating an HD clinical trial design, we defined an optimal threshold based on model performance metrics to determine which patients are more likely to be diagnosed. Sample size calculations using an optimal threshold based on metrics that did not account for censoring, as in previous studies, are shown to lead to underpowered trials.

2510.26204 2026-03-17 math.ST cs.IT eess.SP math.IT stat.TH

Sequential Change Detection Under Markov Setup With Unknown Prechange And Postchange Distributions

Ashish Bhoopesh Gulaguli, Shashwat Singh, Rakesh Kumar Bansal

Comments 6 pages, theoretical paper, Pre-print

Abstract

In this work we extend the results developed in 2022 for a sequential change detection algorithm making use of Page's CUSUM statistic, the empirical distribution as an estimate of the pre-change distribution, and a universal code as a tool for estimating the post-change distribution, from the i.i.d. case to the Markov setup.
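
For orientation, the sketch below shows the classical form of Page's CUSUM statistic that this abstract builds on, in the simplest setting: i.i.d. observations on a finite alphabet with both distributions known. The paper replaces the known pre-change distribution with an empirical estimate and the post-change distribution with a universal code, and works in a Markov setup; none of that is reproduced here. The distributions, threshold and observation sequence are made up.

```python
import math
from typing import Mapping, Optional, Sequence


def cusum_alarm_time(
    observations: Sequence[int],
    pre: Mapping[int, float],
    post: Mapping[int, float],
    threshold: float,
) -> Optional[int]:
    """Return the first time n (1-based) at which the CUSUM statistic
    W_n = max(0, W_{n-1} + log(post(x_n) / pre(x_n))) reaches the
    threshold, or None if it never does."""
    w = 0.0
    for n, x in enumerate(observations, start=1):
        w = max(0.0, w + math.log(post[x] / pre[x]))
        if w >= threshold:
            return n
    return None


# Made-up example: the symbol 1 is rare before the change, common after.
pre_change = {0: 0.9, 1: 0.1}
post_change = {0: 0.1, 1: 0.9}
stream = [0] * 20 + [1] * 20  # change occurs after observation 20

alarm = cusum_alarm_time(stream, pre_change, post_change, threshold=5.0)
# alarm == 23: each post-change symbol adds log(9), about 2.197, so the
# statistic crosses 5.0 on the third post-change observation.
```

A larger threshold lowers the false alarm rate at the cost of a longer detection delay, which is the basic trade-off any such stopping rule has to manage.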

2510.10870 2026-03-17 stat.ML cs.LG math.ST stat.ME stat.TH

Transfer Learning with Distance Covariance for Random Forest: Error Bounds and an EHR Application

Chenze Li, Subhadeep Paul

Abstract

We propose a method for transfer learning in nonparametric regression using a random forest (RF) with distance covariance-based feature weights, assuming the unknown source and target regression functions are sparsely different. Our method obtains residuals from a source domain-trained Centered RF (CRF) in the target domain, then fits another CRF to these residuals with feature splitting probabilities proportional to feature-residual sample distance covariance. We derive an upper bound on the mean square error rate of the procedure as a function of sample sizes and difference dimension, theoretically demonstrating transfer learning benefits in random forests. A major difficulty for transfer learning in random forests is the lack of explicit regularization in the method. Our results explain why shallower trees with preferential selection of features lead to both lower bias and lower variance for fitting a low-dimensional function. We show that in the residual random forest, this implicit regularization is enabled by sample distance covariance. In simulations, we show that the results obtained for the CRFs also hold numerically for the standard RF (SRF) method with data-driven feature split selection. Beyond transfer learning, our results also show the benefit of distance-covariance-based weights on the performance of RF when some features dominate. Our method shows significant gains in predicting the mortality of ICU patients in smaller-bed target hospitals using a large multi-hospital dataset of electronic health records for 200,000 ICU patients.

2510.04647 2026-03-17 math.OC stat.ML

On decomposability and subdifferential of the tensor nuclear norm

Jiewen Guan, Bo Jiang, Zhening Li

Abstract

We study the decomposability and the subdifferential of the tensor nuclear norm. Both concepts are well understood and widely applied in matrices but remain unclear for higher-order tensors. We show that the tensor nuclear norm admits a full decomposability over specific subspaces and determine the largest possible subspaces that allow the full decomposability. We derive novel inclusions of the subdifferential of the tensor nuclear norm and study its subgradients in a variety of subspaces of interest. All the results hold for tensors of an arbitrary order. As an immediate application, we establish the statistical performance of the tensor robust principal component analysis, the first such result for tensors of an arbitrary order.

2510.04582 2026-03-17 stat.CO math.OC math.PR

Constrained Dikin-Langevin diffusion for polyhedra

James Chok, Domenic Petzinna

Abstract

We propose a reflection-free Langevin framework for sampling and optimization on compact polyhedra. The method is based on the inverse Hessian of the logarithmic barrier, which defines a Dikin-Langevin diffusion whose drift and noise adapt to the local interior-point geometry. We show that trajectories started in the interior remain feasible for all finite times almost surely, so the constrained domain is preserved without reflections or projections. For computation, we discretize the diffusion using the Euler-Maruyama scheme and apply a Metropolis-Hastings correction, yielding a sampler that targets the exact constrained distribution. We also propose an annealed interacting variant for nonconvex optimization. Numerically, the Metropolis-adjusted method outperforms both the Dikin random walk and standard MALA on anisotropic box-constrained Gaussians, and the interacting optimizer escapes suboptimal basins more reliably than the non-interacting method.

2509.23711 2026-03-17 cs.LG cs.AI math.OC stat.ML

Deterministic Policy Gradient for Reinforcement Learning with Continuous Time and State

Ziheng Cheng, Xin Guo, Yufei Zhang

Abstract

The theory of continuous-time reinforcement learning (RL) has progressed rapidly in recent years. While the ultimate objective of RL is typically to learn deterministic control policies, most existing continuous-time RL methods rely on stochastic policies. Such approaches often require sampling actions at very high frequencies, and involve computationally expensive expectations over continuous action spaces, resulting in high-variance gradient estimates and slow convergence. In this paper, we introduce and develop deterministic policy gradient (DPG) methods for continuous-time RL. We derive a continuous-time policy gradient formula expressed as the expected gradient of an advantage rate function and establish a martingale characterization for both the value function and the advantage rate. These theoretical results provide tractable estimators for deterministic policy gradients in continuous-time RL. Building on this foundation, we propose a model-free continuous-time Deep Deterministic Policy Gradient (CT-DDPG) algorithm that enables stable learning for general reinforcement learning problems with continuous time-and-state. Numerical experiments show that CT-DDPG achieves superior stability and faster convergence compared to existing stochastic-policy methods, across a wide range of learning tasks with varying time discretizations and noise levels.

2509.02661 2026-03-17 cs.AI astro-ph.IM cond-mat.mtrl-sci cs.LG physics.data-an stat.ML

The Future of Artificial Intelligence and the Mathematical and Physical Sciences (AI+MPS)

Andrew Ferguson, Marisa LaFleur, Lars Ruthotto, Jesse Thaler, Yuan-Sen Ting, Pratyush Tiwary, Soledad Villar, E. Paulo Alves, Jeremy Avigad, Simon Billinge, Camille Bilodeau, Keith Brown, Emmanuel Candes, Arghya Chattopadhyay, Bingqing Cheng, Jonathan Clausen, Connor Coley, Andrew Connolly, Fred Daum, Sijia Dong, Chrisy Xiyu Du, Cora Dvorkin, Cristiano Fanelli, Eric B. Ford, Luis Manuel Frutos, Nicolás García Trillos, Cecilia Garraffo, Robert Ghrist, Rafael Gomez-Bombarelli, Gianluca Guadagni, Sreelekha Guggilam, Sergei Gukov, Juan B. Gutiérrez, Salman Habib, Johannes Hachmann, Boris Hanin, Philip Harris, Murray Holland, Elizabeth Holm, Hsin-Yuan Huang, Shih-Chieh Hsu, Nick Jackson, Olexandr Isayev, Heng Ji, Aggelos Katsaggelos, Jeremy Kepner, Yannis Kevrekidis, Michelle Kuchera, J. Nathan Kutz, Branislava Lalic, Ann Lee, Matt LeBlanc, Josiah Lim, Rebecca Lindsey, Yongmin Liu, Peter Y. Lu, Sudhir Malik, Vuk Mandic, Vidya Manian, Emeka P. Mazi, Pankaj Mehta, Peter Melchior, Brice Ménard, Jennifer Ngadiuba, Stella Offner, Elsa Olivetti, Shyue Ping Ong, Christopher Rackauckas, Philippe Rigollet, Chad Risko, Philip Romero, Grant Rotskoff, Brett Savoie, Uros Seljak, David Shih, Gary Shiu, Dima Shlyakhtenko, Eva Silverstein, Taylor Sparks, Thomas Strohmer, Christopher Stubbs, Stephen Thomas, Suriyanarayanan Vaikuntanathan, Rene Vidal, Francisco Villaescusa-Navarro, Gregory Voth, Benjamin Wandelt, Rachel Ward, Melanie Weber, Risa Wechsler, Stephen Whitelam, Olaf Wiest, Mike Williams, Zhuoran Yang, Yaroslava G. Yingling, Bin Yu, Shuwen Yue, Ann Zabludoff, Huimin Zhao, Tong Zhang

Comments Community Paper from the NSF Future of AI+MPS Workshop, Cambridge, Massachusetts, March 24-26, 2025, supported by NSF Award Number 2512945; v2: minor clarifications; v3: approximate version to appear in MLST

Abstract

This community paper developed out of the NSF Workshop on the Future of Artificial Intelligence (AI) and the Mathematical and Physical Sciences (MPS), which was held in March 2025 with the goal of understanding how the MPS domains (Astronomy, Chemistry, Materials Research, Mathematical Sciences, and Physics) can best capitalize on, and contribute to, the future of AI. We present here a summary and snapshot of the MPS community's perspective, as of Spring/Summer 2025, in a rapidly developing field. The link between AI and MPS is becoming increasingly inextricable; now is a crucial moment to strengthen the link between AI and Science by pursuing a strategy that proactively and thoughtfully leverages the potential of AI for scientific discovery and optimizes opportunities to impact the development of AI by applying concepts from fundamental science. To achieve this, we propose activities and strategic priorities that: (1) enable AI+MPS research in both directions; (2) build up an interdisciplinary community of AI+MPS researchers; and (3) foster education and workforce development in AI for MPS researchers and students. We conclude with a summary of suggested priorities for funding agencies, educational institutions, and individual researchers to help position the MPS community to be a leader in, and take full advantage of, the transformative potential of AI+MPS.

2507.00923 2026-03-17 stat.CO

ForLion: An R Package for Finding Optimal Experimental Designs with Mixed Factors

Siting Lin, Yifei Huang, Jie Yang

Comments 33 pages, 5 figures, 5 tables

Abstract

Optimal design is crucial for experimenters to maximize the information collected from experiments and estimate the model parameters most accurately. ForLion algorithms have been proposed to find D-optimal designs for experiments with mixed types of factors. In this paper, we introduce the ForLion package which implements the ForLion algorithm to construct locally D-optimal designs and the Expected Weighted (EW) ForLion algorithm to generate robust EW D-optimal designs, which maximize the determinant of the expected Fisher information matrix under parameter uncertainty. The package supports experiments under linear models (LM), generalized linear models (GLM), and multinomial logistic models (MLM) with continuous, discrete, or mixed-type factors. It provides both optimal approximate designs and an efficient function converting approximate designs into exact designs with integer-valued allocations of experimental units. Tutorials are included to show the package's usage across different scenarios.

2506.13646 2026-03-17 stat.ME

Parsimonious Compactly Supported Covariance Models in the Gauss Hypergeometric Class: Identifiability, Reparameterizations, and Asymptotic Properties

Moreno Bevilacqua, Christian Caamaño-Carrillo, Tarik Faouzi, Xavier Emery

Comments 25 pages, 8 figures

Abstract

We study covariance functions in the Gauss hypergeometric ($\mathcal{GH}$) class, a flexible family that encompasses the Generalized Wendland ($\mathcal{GW}$) and Matérn ($\mathcal{MT}$) models. We derive sharp validity conditions, providing a complete characterization of the admissible parameter space, and show that the model exhibits structural identifiability issues under both increasing- and fixed-domain asymptotics. To resolve this issue, we introduce a parsimonious compactly supported subclass selected via a maximum integral range criterion. The resulting hypergeometric model can be viewed as a structural refinement of the $\mathcal{GW}$ family and admits compact-support reparameterizations that recover the $\mathcal{MT}$ model as a limit case. We further establish strong consistency and asymptotic normality of the maximum likelihood estimator of the associated microergodic parameter under fixed-domain asymptotics. Simulation experiments and a real-data application to climate data illustrate the finite-sample behavior and practical performance of the proposed model.

2505.23260 2026-03-17 stat.ML cs.LG

Stable Thompson Sampling: Valid Inference via Variance Inflation

Budhaditya Halder, Shubhayan Pan, Koulik Khamaru

Abstract

We consider the problem of statistical inference when the data is collected via a Thompson Sampling-type algorithm. While Thompson Sampling (TS) is known to be both asymptotically optimal and empirically effective, its adaptive sampling scheme poses challenges for constructing confidence intervals for model parameters. We propose and analyze a variant of TS, called Stable Thompson Sampling, in which the posterior variance is inflated by a logarithmic factor. We show that this modification leads to asymptotically normal estimates of the arm means, despite the non-i.i.d. nature of the data. Importantly, this statistical benefit comes at a modest cost: the variance inflation increases regret by only a logarithmic factor compared to standard TS. Our results reveal a principled trade-off: by paying a small price in regret, one can enable valid statistical inference for adaptive decision-making algorithms.
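
The idea of inflating the posterior variance can be sketched for a Gaussian bandit with unit-variance rewards and a flat prior, where the posterior of each arm mean is normal with variance 1 / (number of pulls). The abstract does not state the exact inflation factor, so the choice `log(horizon)` below is an assumption, as are the function name, arm means and horizon.

```python
import math
import random
from typing import List, Sequence, Tuple


def thompson_sampling(
    true_means: Sequence[float],
    horizon: int,
    inflation: float,
    rng: random.Random,
) -> Tuple[List[int], List[float]]:
    """Gaussian Thompson Sampling with posterior variance inflation / n_a.

    inflation = 1.0 gives standard TS for unit-variance rewards; a value
    above 1 widens the posterior draws and forces extra exploration.
    Returns the pull counts and the sample mean of each arm.
    """
    k = len(true_means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(horizon):
        if t < k:
            arm = t  # pull every arm once so each posterior is defined
        else:
            draws = [
                rng.gauss(sums[a] / counts[a], math.sqrt(inflation / counts[a]))
                for a in range(k)
            ]
            arm = max(range(k), key=draws.__getitem__)
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward
    return counts, [sums[a] / counts[a] for a in range(k)]


HORIZON = 2000
MEANS = [0.0, 1.0]  # made-up arm means

std_counts, std_means = thompson_sampling(
    MEANS, HORIZON, inflation=1.0, rng=random.Random(1)
)
stable_counts, stable_means = thompson_sampling(
    MEANS, HORIZON, inflation=math.log(HORIZON), rng=random.Random(1)
)
```

Comparing `std_counts` with `stable_counts` shows how the extra exploration redistributes pulls between arms, which is the trade-off the abstract describes: some additional regret in exchange for better-behaved estimates of the arm means.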

2505.16124 2026-03-17 stat.ME math.ST stat.TH

Controlling the false discovery rate in high-dimensional linear models using model-X knockoffs and $p$-values

Jinyuan Chang, Chenlong Li, Cheng Yong Tang, Zhengtian Zhu

Abstract

We propose a novel multiple testing methodology for controlling the false discovery rate (FDR) in high-dimensional linear models that integrates model-X knockoff techniques with debiased penalized regression estimators. At the foundation of our methodology, we construct and study two sets of naturally paired high-dimensional test statistics and the associated $p$-values for evaluating the same null hypotheses. The first set is shown to be asymptotically mutually independent, justifying the use of the Benjamini-Hochberg procedure. We further exploit the pairing structure through a two-step procedure aimed at improving power. Our theoretical results establish the key properties of the framework with respect to asymptotic FDR control and formally characterize the associated power gains of the two-step procedure. Importantly, our framework accommodates general dependence in the design matrix. Extensive simulations demonstrate that our methods outperform existing approaches, particularly those relying on empirical FDP estimates, in both power and FDR control accuracy, with notable gains in settings involving weaker signals, small sample sizes, or low target FDR levels.
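
The Benjamini-Hochberg procedure mentioned in this abstract is a standard step-up rule and can be sketched in a few lines. The sketch covers only that rule, not the construction of the paired statistics or the two-step procedure; the $p$-values in the example are made up.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], q: float) -> List[int]:
    """Return the sorted indices of hypotheses rejected at target FDR q.

    Step-up rule: with sorted p-values p_(1) <= ... <= p_(m), find the
    largest k such that p_(k) <= k * q / m and reject the k smallest.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            largest_k = rank
    return sorted(order[:largest_k])


# Made-up p-values for six hypotheses.
p_vals = [0.041, 0.001, 0.216, 0.008, 0.039, 0.06]
rejected = benjamini_hochberg(p_vals, q=0.05)
# rejected == [1, 3]: only 0.001 and 0.008 fall under their thresholds
# 0.05 / 6 and 2 * 0.05 / 6.
```

Note that the rule keeps scanning after the first failure, because a later sorted $p$-value can still satisfy its (larger) threshold and pull all smaller ones into the rejection set.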

2504.21068 2026-03-17 math.CO math.ST stat.TH

Polyhedral Aspects of Maxoids

Tobias Boege, Kamillo Ferry, Benjamin Hollering, Francesco Nowell

Comments 29 pages, 7 figures. Submitted to the Kybernetika special edition for WUPES'25

Details
Abstract

The conditional independence (CI) relation of a distribution in a max-linear Bayesian network depends on its weight matrix through the $C^\ast$-separation criterion. These CI models, which we call maxoids, are compositional graphoids which are in general not representable by Gaussian random variables. We prove that every maxoid can be obtained from a transitively closed weighted DAG and show that the stratification of generic weight matrices by their maxoids yields a polyhedral fan. We also use this connection to polyhedral geometry to develop an algorithm for solving the conditional independence implication problem for maxoids.

2504.14372 2026-03-17 cs.LG cs.AI cs.CE cs.CV stat.ML

Learning Enhanced Structural Representations with Block-Based Uncertainties for Ocean Floor Mapping

Jose Marie Antonio Minoza

Details
Journal ref
Tackling Climate Change with Machine Learning Workshop, ICLR 2025
Abstract

Accurate ocean modeling and coastal hazard prediction depend on high-resolution bathymetric data; yet, current worldwide datasets are too coarse for exact numerical simulations. While recent deep learning advances have improved earth observation data resolution, existing methods struggle with the unique challenges of producing detailed ocean floor maps, especially in maintaining physical structure consistency and quantifying uncertainties. This work presents a novel uncertainty-aware mechanism using spatial blocks to efficiently capture local bathymetric complexity based on block-based conformal prediction. Using the Vector Quantized Variational Autoencoder (VQ-VAE) architecture, the integration of this uncertainty quantification framework yields spatially adaptive confidence estimates while preserving topographical features via discrete latent representations. With smaller uncertainty widths in well-characterized areas and appropriately larger bounds in areas of complex seafloor structures, the block-based design adapts uncertainty estimates to local bathymetric complexity. Compared to conventional techniques, experimental results over several ocean regions show notable increases in both reconstruction quality and uncertainty estimation reliability. This framework increases the reliability of bathymetric reconstructions by preserving structural integrity while offering spatially adaptive uncertainty estimates, paving the way for more robust climate modeling and coastal hazard assessment.

2503.19126 2026-03-17 math.OC stat.ML

Tractable downfall of basis pursuit in structured sparse optimization

Maya V. Marmary, Christian Grussler

Details
Abstract

The problem of finding the sparsest solution to a linear underdetermined system of equations, often appearing, e.g., in data analysis, optimal control, system identification, or sensor selection problems, is considered. This non-convex problem is commonly solved by convexification via $\ell_1$-norm minimization, known as basis pursuit (BP). In this work, a class of structured matrices, representing the system of equations, is introduced for which (BP) tractably fails to recover the sparsest solution. In particular, this enables efficient identification of matrix columns corresponding to unrecoverable non-zero entries of the sparsest solution and determination of the uniqueness of such a solution. These deterministic guarantees complement popular probabilistic ones and provide insights into the a priori design of sparse optimization problems. As our matrix structures appear naturally in optimal control problems, we exemplify our findings based on a fuel-optimal control problem for a class of discrete-time linear time-invariant systems. Finally, we draw connections of our results to compressed sensing and common basis functions in geometric modeling.
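
As background, basis pursuit itself can be written as a linear program. The sketch below uses the standard split x = u - v with u, v >= 0 and solves it with `scipy.optimize.linprog`; the function name `basis_pursuit` and the small test system are illustrative and do not come from the paper.

```python
import numpy as np
from scipy.optimize import linprog


def basis_pursuit(A, b):
    """Solve min ||x||_1 subject to A x = b as a linear program.

    Standard reformulation: write x = u - v with u, v >= 0 and
    minimise sum(u) + sum(v) subject to A u - A v = b.
    """
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    n = A.shape[1]
    cost = np.ones(2 * n)
    A_eq = np.hstack([A, -A])
    result = linprog(cost, A_eq=A_eq, b_eq=b, bounds=(0, None), method="highs")
    if not result.success:
        raise ValueError(result.message)
    return result.x[:n] - result.x[n:]


# Solutions of this system are (1 - t, 1 - t, t); the l1 norm is smallest at t = 1.
x_hat = basis_pursuit([[1, 0, 1], [0, 1, 1]], [1, 1])
```

In this toy system the l1 minimiser (0, 0, 1) is also the sparsest solution, so basis pursuit succeeds. The abstract concerns structured matrices for which the two differ, which can be checked by comparing the support of `x_hat` with a known sparsest solution.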

2502.09806 2026-03-17 econ.EM cs.IR cs.SI stat.ME

Two-Sided Prioritized Ranking: A Coherency-Preserving Design for Marketplace Experiments

Mahyar Habibi, Zahra Khanalizadeh, Negar Ziaeian

Comments New version with revisions and updated title

Details
Abstract

Online marketplaces frequently run pricing experiments in environments where users choose from a list of items. In these settings, items compete for users' limited attention and demand, creating interference among items within a list: changing prices for any item can affect the demand for others, biasing estimates from item-level A/B tests. In addition, a key consideration in pricing experiments is preserving platform coherency across prices and item availability. This requirement rules out experimental designs such as user-level A/B tests as they violate platform coherency. We propose Two-Sided Prioritized Ranking (TSPR) to estimate the total average treatment effect of price changes in such settings. TSPR exploits position bias in ranked search results to create variation in treatment exposure without compromising coherency. TSPR randomizes both users and items and reorders ranked lists, prioritizing treated items for one group of users and untreated items for the other. All users see the same items at consistent prices, but differ in exposure to treatment as they pay disproportionate attention across ranks. In semi-synthetic simulations based on Expedia hotel search data, TSPR outperforms baseline coherency-preserving experiment designs by reducing estimation bias and providing sufficient statistical power.
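
The reordering step can be sketched as a stable partition of the ranked list. This is one reading of the description above, assuming that prioritized items move ahead of all others while keeping their relative order; the function name `prioritized_ranking` and its arguments are hypothetical.

```python
def prioritized_ranking(ranked_items, treated, prioritize_treated):
    """Reorder a ranked list so one treatment group comes first.

    ranked_items: items in their original rank order.
    treated: set of items assigned to the price treatment.
    prioritize_treated: True for users who see treated items first,
        False for users who see untreated items first.
    The original relative order is preserved within each group.
    """
    first = [i for i in ranked_items if (i in treated) == prioritize_treated]
    rest = [i for i in ranked_items if (i in treated) != prioritize_treated]
    return first + rest
```

With `ranked_items = ["a", "b", "c", "d"]` and `treated = {"b", "d"}`, one user group sees `["b", "d", "a", "c"]` and the other sees `["a", "c", "b", "d"]`. Both groups see the same items at the same prices; only the exposure differs.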

2501.15338 2026-03-17 cs.GT cs.LG stat.ML

Fairness-aware Contextual Dynamic Pricing with Strategic Buyers

Pangpang Liu, Will Wei Sun

Comments The paper has been accepted by JASA

Details
Abstract

Contextual pricing strategies are prevalent in online retailing, where the seller adjusts prices based on products' attributes and buyers' characteristics. Although such strategies can enhance the seller's profits, they raise concerns about fairness when significant price disparities emerge among specific groups, such as gender or race. These disparities can lead to adverse perceptions of fairness among buyers and may even violate laws and regulations. At the same time, price differences can incentivize disadvantaged buyers to strategically manipulate their group identity to obtain a lower price. In this paper, we investigate contextual dynamic pricing with fairness constraints, taking into account buyers' strategic behaviors when their group status is private and unobservable to the seller. We propose a dynamic pricing policy that simultaneously achieves price fairness and discourages strategic behaviors. Our policy achieves an upper bound of $O(\sqrt{T}+H(T))$ regret over a time horizon $T$, where the term $H(T)$ arises from buyers' assessment of the fairness of the pricing policy based on their learned price difference. When buyers are able to learn the fairness of the price policy, this upper bound reduces to $O(\sqrt{T})$. We also prove an $\Omega(\sqrt{T})$ regret lower bound of any pricing policy under our problem setting. We support our findings with extensive experimental evidence, showcasing our policy's effectiveness. In our real data analysis, we observe the existence of price discrimination based on race in the loan application even after accounting for other contextual information. Our proposed pricing policy demonstrates a significant improvement, achieving a 35.06% reduction in regret compared to the benchmark policy.

2501.13218 2026-03-17 stat.ME

Design of Bayesian Clinical Trials with Clustered Data

Luke Hagar, Shirin Golchi

Details
Abstract

In the design of clinical trials, it is essential to assess the design operating characteristics (e.g., power and the type I error rate). Common practice for the evaluation of operating characteristics in Bayesian clinical trials relies on estimating the sampling distribution of posterior summaries via Monte Carlo simulation. It is computationally intensive to repeat this estimation process for each design configuration considered, particularly for clustered data that are analyzed using complex, high-dimensional models. In this paper, we propose an efficient method to assess operating characteristics and determine sample sizes for Bayesian trials with clustered data. We prove theoretical results that enable posterior probabilities to be modeled as a function of the number of clusters. Using these functions, we assess operating characteristics at a range of sample sizes given simulations conducted at only two cluster counts. These theoretical results are also leveraged to quantify the impact of simulation variability on our sample size recommendations. The applicability of our methodology is illustrated using an example cluster-randomized Bayesian clinical trial.

2412.02945 2026-03-17 stat.ME

Detection of Multiple Influential Observations on Model Selection

Dongliang Zhang, Masoud Asgharian, Martin A. Lindquist

Comments 3 figures

Details
Abstract

Outlying observations are frequently encountered across a wide spectrum of scientific domains, posing notable challenges to the generalizability of statistical models and the reproducibility of downstream analysis. They are identified through influential diagnostics, which aim to capture observations that unduly bias model estimation. To date, methods for identifying observations that influence the selection of a stochastically chosen submodel have been underdeveloped, especially in the high-dimensional setting where the number of predictors $p$ exceeds the sample size $n$. Recently we proposed an improved diagnostic measure to handle this setting. However, its distributional properties and approximations have not yet been explored. To address this shortcoming, we revisit the notion of exchangeability to determine the exact asymptotic distribution of our assessment measure. This foundation enables the introduction of theoretically supported parametric and nonparametric approaches for distributional approximation and derivation of thresholds for outlier identification. The resulting framework is further extended to logistic regression models and evaluated by comprehensive simulation studies comparing the performance of various detection methods. Finally, the framework is applied to data from a task-based fMRI study of thermal pain, with the goal of identifying outliers that distort the formulation of the statistical model using functional brain activity to predict physical pain ratings. Both linear and logistic models are used to demonstrate the benefits of detection and compare the performance of different detection procedures. In particular, we identify two influential observations that were not detected in prior studies.

2407.19236 2026-03-17 stat.CO stat.ML

Approximate learning of parsimonious Bayesian context trees

Daniyar Ghani, Nicholas A. Heard, Francesco Sanna Passino

Details
Journal ref
Statistics and Computing, 36(106) (2026)
Abstract

Models for categorical sequences typically assume exchangeable or first-order dependent sequence elements. These are common assumptions, for example, in models of computer malware traces and protein sequences. Although such simplifying assumptions lead to computational tractability, these models fail to capture long-range, complex dependence structures that may be harnessed for greater predictive power. To this end, a Bayesian modelling framework is proposed to parsimoniously capture rich dependence structures in categorical sequences, with memory efficiency suitable for real-time processing of data streams. Parsimonious Bayesian context trees are introduced as a form of variable-order Markov model with conjugate prior distributions. The novel framework requires fewer parameters than fixed-order Markov models by dropping redundant dependencies and clustering sequential contexts. Approximate inference on the context tree structure is performed via a computationally efficient model-based agglomerative clustering procedure. The proposed framework is tested on synthetic and real-world data examples, and it outperforms existing sequence models when fitted to real protein sequences and honeypot computer terminal sessions.

2407.14778 2026-03-17 math.ST stat.TH

Minimax estimation of functionals in sparse vector model with correlated observations

Yuhao Wang, Pengkun Yang, Alexandre B. Tsybakov

Details
Abstract

We consider the observations of an unknown $s$-sparse vector $\boldsymbol{\theta}$ corrupted by Gaussian noise with zero mean and unknown covariance matrix $\boldsymbol{\Sigma}$. We propose minimax optimal methods of estimating the $\ell_2$ norm of $\boldsymbol{\theta}$ and testing the hypothesis $H_0: \boldsymbol{\theta}=0$ against sparse alternatives when only partial information about $\boldsymbol{\Sigma}$ is available, such as an upper bound on its Frobenius norm and the values of its diagonal entries to within an unknown scaling factor. We show that the minimax rates of the estimation and testing are governed not by the dimension of the problem but by the value of the Frobenius norm of $\boldsymbol{\Sigma}$.

2403.19818 2026-03-17 stat.ME math.ST stat.TH

Testing common structure in high-dimensional factor models: change-point and two-sample procedures

Marie-Christine Düker, Vladas Pipiras

Details
Abstract

This work proposes a novel procedure to test for common structures across two high-dimensional factor models. The test makes it possible to determine whether two factor models are driven by the same loading matrix up to some linear transformation. The test can be used to discover inter-individual relationships between two datasets. In addition, it can be applied to test for structural changes over time in the loading matrix of an individual factor model. The test aims to reduce the set of possible alternatives in a classical change-point setting. The theoretical results establish the asymptotic behavior of the introduced test statistic. The theory is supported by a simulation study showing promising results in empirical test size and power. Two real data applications are considered: the first investigates changes in the loadings of the celebrated US macroeconomic dataset of Stock and Watson, and the second examines similarities of the loadings of macroeconomic indicators for the US and South Korea.

2402.09698 2026-03-17 stat.ME cs.LG math.PR math.ST stat.ML stat.TH

Combining Evidence Across Filtrations

Yo Joong Choe, Aaditya Ramdas

Comments Accepted for publication in the Journal of the Royal Statistical Society: Series B (Statistical Methodology). Code is available at https://github.com/yjchoe/CombiningEvidenceAcrossFiltrations

Details
Abstract

In sequential anytime-valid inference, any admissible procedure must be based on e-processes: generalizations of test martingales that quantify the accumulated evidence against a composite null hypothesis at any stopping time. This paper proposes a method for combining e-processes constructed in different filtrations but for the same null. Although e-processes in the same filtration can be combined effortlessly (by averaging), e-processes in different filtrations cannot because their validity in a coarser filtration does not translate to a finer filtration. This issue arises in sequential tests of randomness and independence, as well as in the evaluation of sequential forecasters. We establish that a class of functions called adjusters can lift arbitrary e-processes across filtrations. The result yields a generally applicable "adjust-then-combine" procedure, which we demonstrate on the problem of testing randomness in real-world financial data. Furthermore, we prove a characterization theorem for adjusters that formalizes a sense in which using adjusters is necessary. There are two major implications. First, if we have a powerful e-process in a coarsened filtration, then we readily have a powerful e-process in the original filtration. Second, when we coarsen the filtration to construct an e-process, there is a logarithmic cost to recovering validity in the original filtration.
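
The easy case mentioned above, combining e-processes that live in the same filtration by averaging, can be sketched as follows. The example uses likelihood-ratio test martingales for a simple Bernoulli null, which is a simplification of the composite-null setting in the abstract; the function names are illustrative. The adjuster construction for different filtrations is not shown.

```python
def likelihood_ratio_process(xs, q, p0=0.5):
    """Running likelihood ratio of Bernoulli(q) against the null Bernoulli(p0).

    Under the null, each factor has expectation 1, so the running product
    is a test martingale (and hence an e-process) for that simple null.
    """
    value, path = 1.0, []
    for x in xs:
        value *= (q / p0) if x == 1 else ((1 - q) / (1 - p0))
        path.append(value)
    return path


def average_e_processes(*paths):
    """Pointwise average of e-process paths defined on the same filtration."""
    return [sum(vals) / len(vals) for vals in zip(*paths)]


xs = [1, 1, 0, 1]
combined = average_e_processes(
    likelihood_ratio_process(xs, q=0.75),
    likelihood_ratio_process(xs, q=0.25),
)
```

Averaging hedges across the two alternatives: the combined process grows whenever either component grows strongly, at the cost of at most a factor of two relative to the better one.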

2401.13208 2026-03-17 stat.ME

Assessing Influential Observations in Pain Prediction using fMRI Data

Dongliang Zhang, Masoud Asgharian, Martin A. Lindquist

Comments 6 figures

Details
Abstract

Neuroimaging data allow researchers to model the relationship between multivariate patterns of brain activity and outcomes related to mental states and behaviors. However, the existence of outlying participants can potentially undermine the generalizability of these models and jeopardize the validity of downstream statistical analysis. To date, the ability to detect and account for participants unduly influencing various model selection approaches has been sorely lacking. Motivated by a task-based functional magnetic resonance imaging (fMRI) study of thermal pain, we propose and establish the asymptotic distribution for a diagnostic measure applicable to a number of different model selectors. A high-dimensional clustering procedure is further combined with this measure to detect multiple influential observations. In a series of simulations, our proposed method demonstrates clear advantages over existing methods in terms of improved detection performance, leading to enhanced predictive and variable selection outcomes. Application of our method to data from the thermal pain study illustrates the influence of outlying participants, in particular with regard to differences in activation between low and intense pain conditions. This allows for the selection of an interpretable model with high prediction power after removal of the detected observations. Though inspired by the fMRI-based thermal pain study, our methods are broadly applicable to other high-dimensional data types.

2312.00590 2026-03-17 econ.EM math.ST stat.TH

Inference on common trends in functional time series

Morten Ørregaard Nielsen, Won-Ki Seo, Dakyung Seong

Details
Abstract

We study statistical inference on unit roots and cointegration for time series in a Hilbert space. We develop statistical inference on the number of common stochastic trends embedded in the time series, i.e., the dimension of the nonstationary subspace. We also consider tests of hypotheses on the nonstationary and stationary subspaces themselves. The Hilbert space can be of an arbitrarily large dimension, and our methods remain asymptotically valid even when the time series of interest takes values in a subspace of possibly unknown dimension. This has wide applicability in practice; for example, to cointegrated vector time series that are either high-dimensional or of finite dimension, to high-dimensional factor models that include a finite number of nonstationary factors, to cointegrated curve-valued (or function-valued) time series, and to nonstationary dynamic functional factor models. To illustrate our methods, we include two empirical examples.

2303.11786 2026-03-17 cs.LG stat.ME stat.ML

Skeleton Regression: A Graph-Based Approach to Estimation with Manifold Structure

Zeyu Wei, Yen-Chi Chen

Details
Abstract

We introduce a new regression framework designed to deal with large-scale, complex data that lies around a low-dimensional manifold with noise. Our approach first constructs a graph representation, referred to as the skeleton, to capture the underlying geometric structure. We then define metrics on the skeleton graph and apply nonparametric regression techniques, along with feature transformations based on the graph, to estimate the regression function. We also discuss the limitations of some nonparametric regressors with respect to the general metric space such as the skeleton graph. The proposed regression framework suggests a novel way to deal with data with underlying geometric structures and provides additional advantages in handling the union of multiple manifolds, additive noise, and noisy observations. We provide statistical guarantees for the proposed method and demonstrate its effectiveness through simulations and real data examples.

2212.05545 2026-03-17 math.PR cs.IT math.IT math.ST stat.TH

Gaussian random projections of convex cones: approximate kinematic formulae and applications

Qiyang Han, Huachen Ren

Details
Abstract

Understanding the stochastic behavior of random projections of geometric sets constitutes a fundamental problem in high-dimensional probability that finds wide applications in diverse fields. This paper provides a kinematic description for the behavior of Gaussian random projections of closed convex cones, in analogy to that of randomly rotated cones studied in [ALMT14]. Formally, let $K$ be a closed convex cone in $\mathbb{R}^n$, and $G\in \mathbb{R}^{m\times n}$ be a Gaussian matrix with i.i.d. $\mathcal{N}(0,1)$ entries. We show that $GK\equiv \{G\mu: \mu\in K\}$ behaves like a randomly rotated cone in $\mathbb{R}^m$ with statistical dimension $\min\{\delta(K),m\}$, in the following kinematic sense: for any fixed closed convex cone $L$ in $\mathbb{R}^m$, \begin{align*} &\delta(L)+\delta(K)\ll m\, \Rightarrow\, L\cap GK = \{0\} \hbox{ with high probability},\\ &\delta(L)+\delta(K)\gg m\, \Rightarrow\, L\cap GK \neq \{0\} \hbox{ with high probability}. \end{align*} A similar kinematic description is obtained for $G^{-1}L\equiv \{\mu\in \mathbb{R}^n: G\mu\in L\}$. The practical utility and broad applicability of the prescribed approximate kinematic formulae are demonstrated in a number of distinct problems arising from statistical learning, mathematical programming and asymptotic geometric analysis. In particular, we prove (i) new phase transitions of the existence of cone constrained maximum likelihood estimators in logistic regression, (ii) new phase transitions of the cost optimum of deterministic conic programs with random constraints, and (iii) a local version of the Gaussian Dvoretzky-Milman theorem that describes almost deterministic, low-dimensional behaviors of subspace sections of randomly projected convex sets.

2211.07092 2026-03-17 stat.ML cs.LG math.ST stat.TH

Offline Estimation of Controlled Markov Chains: Minimaxity and Sample Complexity

Imon Banerjee, Harsha Honnappa, Vinayak Rao

Comments 71 pages, 23 main

Details
Abstract

In this work, we study a natural nonparametric estimator of the transition probability matrices of a finite controlled Markov chain. We consider an offline setting with a fixed dataset, collected using a so-called logging policy. We develop sample complexity bounds for the estimator and establish conditions for minimaxity. Our statistical bounds depend on the logging policy through its mixing properties. We show that achieving a particular statistical risk bound involves a subtle and interesting trade-off between the strength of the mixing properties and the number of samples. We demonstrate the validity of our results under various examples, such as ergodic Markov chains, weakly ergodic inhomogeneous Markov chains, and controlled Markov chains with non-stationary Markov, episodic, and greedy controls. Lastly, we use these sample complexity bounds to establish concomitant ones for offline evaluation of stationary Markov control policies.
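
A count-based estimator of this kind can be sketched as below. The abstract does not spell out the estimator, so the ratio of visit counts used here is an assumption about what "natural nonparametric estimator" means; the function name `estimate_transitions` and the data layout are illustrative.

```python
from collections import defaultdict


def estimate_transitions(transitions):
    """Empirical transition probabilities from logged data.

    transitions: iterable of (state, action, next_state) triples collected
        under some logging policy.
    Returns a dict mapping (state, action, next_state) to
        count(state, action, next_state) / count(state, action).
    State-action pairs that were never visited are simply absent.
    """
    pair_counts = defaultdict(int)
    triple_counts = defaultdict(int)
    for s, a, s_next in transitions:
        pair_counts[(s, a)] += 1
        triple_counts[(s, a, s_next)] += 1
    return {
        (s, a, s_next): n / pair_counts[(s, a)]
        for (s, a, s_next), n in triple_counts.items()
    }
```

The sketch makes the role of the logging policy visible: the accuracy of each estimated row depends on how often the corresponding state-action pair appears in the dataset.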

2110.10801 2026-03-17 stat.CO

Efficient Sampling for Ising and Potts Models using Auxiliary Gaussian Variables

Charles C. Margossian, Chenyang Zhong, Sumit Mukherjee

Details
Abstract

Ising and Potts models are an important class of discrete probability distributions which originated from statistical physics and since then have found applications in several disciplines. Simulation from these models is a well-known and challenging problem. In this paper, we study a class of Markov chain Monte Carlo algorithms, in which we introduce an auxiliary Gaussian variable such that, conditional on this variable, the discrete states are independent. This approach is broadly applicable to Ising and Potts models, including ones in which the coupling matrix admits negative entries, as in spin glass and Hopfield models. We focus on a block Gibbs sampler version of this algorithm, which alternates between sampling the auxiliary Gaussian and the discrete states, and derive mixing time bounds for a wide class of Ising/Potts models at both high and low temperatures, yielding results analogous to those derived for the Heat Bath and Swendsen-Wang algorithms. We present novel choices of auxiliary Gaussian variables which scale well with the number of states in the Potts model, and which can take advantage of the low rank structure of the coupling matrix, if any. Finally, we numerically evaluate the performance of the auxiliary Gaussian Gibbs sampler with several competing algorithms, across a range of examples.
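
One classical way to realise the alternating scheme for an Ising model is sketched below. It targets p(x) proportional to exp(x' J x / 2) on {-1, +1}^n, shifts the diagonal of J to make it positive definite (which changes only the normalising constant, since each x_i squared equals 1), and uses z | x ~ N(A x, A). This particular choice of auxiliary variable is an assumption for illustration and need not match the constructions proposed in the paper; the function name is hypothetical.

```python
import numpy as np


def aux_gaussian_gibbs(coupling, n_sweeps, seed=0):
    """Block Gibbs sampler for p(x) proportional to exp(x' J x / 2).

    coupling: symmetric (n, n) matrix J; x takes values in {-1, +1}^n.
    Returns an (n_sweeps, n) array of sampled spin configurations.
    """
    rng = np.random.default_rng(seed)
    J = np.asarray(coupling, dtype=float)
    n = J.shape[0]
    # Shift the spectrum so that A = J + c*I is positive definite.
    shift = max(0.0, -np.linalg.eigvalsh(J).min()) + 1e-3
    A = J + shift * np.eye(n)
    chol = np.linalg.cholesky(A)
    x = rng.choice([-1.0, 1.0], size=n)
    samples = np.empty((n_sweeps, n))
    for s in range(n_sweeps):
        # Auxiliary step: z | x ~ N(A x, A).
        z = A @ x + chol @ rng.standard_normal(n)
        # Spin step: given z the spins are independent, P(x_i = +1) = sigmoid(2 z_i).
        prob_plus = 1.0 / (1.0 + np.exp(-2.0 * z))
        x = np.where(rng.random(n) < prob_plus, 1.0, -1.0)
        samples[s] = x
    return samples
```

The joint density of (x, z) under this construction is proportional to exp(-z' A^{-1} z / 2 + x' z), so integrating out z recovers the Ising distribution and conditioning on z factorises over the spins.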

2603.13826 2026-03-17 cs.LG stat.ML

Effective Sparsity: A Unified Framework via Normalized Entropy and the Effective Number of Nonzeros

Haoyu He, Hao Wang, Jiashan Wang, Hao Zeng

Details
Abstract

Classical sparsity-promoting methods rely on the $\ell_0$ norm, which treats all nonzero components as equally significant. In practical inverse problems, however, solutions often exhibit many small-amplitude components that have little effect on reconstruction but lead to an overestimation of signal complexity. We address this limitation by shifting the paradigm from discrete cardinality to effective sparsity. Our approach introduces the effective number of nonzeros (ENZ), a unified class of normalized entropy-based regularizers, including Shannon and Rényi forms, that quantifies the concentration of significant coefficients. We show that, unlike the classical $\ell_0$ norm, the ENZ provides a stable and continuous measure of effective sparsity that is insensitive to negligible perturbations. For noisy linear inverse problems, we establish theoretical guarantees under the Restricted Isometry Property (RIP), proving that ENZ-based recovery is unique and stable. We also derive a decomposition showing that the ENZ equals the support cardinality times a distributional efficiency term, thereby linking entropy with $\ell_0$ regularization. Numerical experiments show that this effective sparsity framework outperforms traditional cardinality-based methods in robustness and accuracy.
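
An entropy-based count of this kind can be sketched as the exponential of the Shannon entropy of the normalised magnitudes. The abstract does not give the exact definition, so the normalisation by the l1 norm and the function name `effective_nonzeros` are assumptions made for illustration.

```python
import math


def effective_nonzeros(x):
    """exp(Shannon entropy) of the normalised magnitudes |x_i| / ||x||_1.

    Equals k for a vector with k equal-magnitude nonzeros and stays close
    to 1 when a single entry dominates. Returns 0.0 for the zero vector.
    """
    total = sum(abs(v) for v in x)
    if total == 0:
        return 0.0
    probs = [abs(v) / total for v in x if v != 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)
```

Under this definition, `effective_nonzeros([1, 1, 1, 1])` is approximately 4, while `effective_nonzeros([1, 1e-6, 1e-6])` stays close to 1 even though the vector has three nonzero entries. Dividing the result by the number of nonzero entries gives a ratio in (0, 1], one possible reading of the "distributional efficiency" factor in the decomposition described above.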

2603.13806 2026-03-17 stat.ML cs.LG

An Interpretable and Stable Framework for Sparse Principal Component Analysis

Ying Hu, Hu Yang

Details
Abstract

Sparse principal component analysis (SPCA) addresses the poor interpretability and variable redundancy often encountered by principal component analysis (PCA) in high-dimensional data. However, SPCA typically imposes uniform penalties on variables and does not account for differences in variable importance, which may lead to unstable performance in highly noisy or structurally complex settings. We propose SP-SPCA, a method that introduces a single equilibrium parameter into the regularization framework to adaptively adjust variable penalties. This modification of the L2 penalty provides flexible control over the trade-off between sparsity and explained variance while maintaining computational efficiency. Simulation studies show that the proposed method consistently outperforms standard sparse principal component methods in identifying sparse loading patterns, filtering noise variables, and preserving cumulative variance, especially in high-dimensional and noisy settings. Empirical applications to crime and financial market data further demonstrate its practical utility. In real data analyses, the method selects fewer but more relevant variables, thereby reducing model complexity while maintaining explanatory power. Overall, the proposed approach offers a robust and efficient alternative for sparse modeling in complex high-dimensional data, with clear advantages in stability, feature selection, and interpretability.

2603.13762 2026-03-17 stat.ME

Learning the Optimal Composite Mediator: Closed-Form Solution and Inference

Zihuai He

Details
Abstract

Understanding how an exposure transmits its effect through high-dimensional intermediaries is a central problem in observational research. We study the problem of finding a composite mediator that maximises the indirect effect of an exposure on an outcome in a linear structural equation model. Although the objective is non-convex in the weight vector, a geometric argument yields a closed-form global solution: the optimal weight bisects the angle between two computable path vectors in a weighted inner product space, recovered via two linear solves. The resulting algorithm, MaxIE, runs at the same cost as ordinary least squares -- orders of magnitude lower than numerical optimisation -- with a dual formulation for settings where mediators outnumber observations. The same path vectors yield a test for the global null that no composite mediator exists, with a t(p-1) distribution in the classical regime and a t(n-2) distribution in the dual regime. Power is characterised analytically as a function of the population path angle; simulations confirm size control and the power characterisation. Applied to a UK Biobank proteomics dataset (n=38,383, p=2,916), the method rejects the global null (p-value = 6.4e-9) and identifies the optimal proteomic composite mediating age's effect on dementia.

2603.13706 2026-03-17 stat.AP

When Does Agroforestry Income Reduce Deforestation? Evidence from a Natural Experiment in Madagascar

Camille DeSisto, Ranaivo Rasolofoson, Michelle Foley, Harsh Parikh

Details
Abstract

Tropical deforestation and rural poverty are deeply intertwined, yet isolating the causal effect of income on forest loss remains challenging. We use the 2015 global vanilla price boom, triggered by food-industry shifts toward natural flavoring, as an exogenous income shock affecting Madagascar's primary vanilla-producing region. Using a matching-augmented synthetic control design, we estimate that income gains reduced annual deforestation by 1.7 percentage points in 2017, equivalent to approximately 701 hectares of avoided forest loss. Under a monotonicity assumption linking the price boom to farmers' income, the sign of this reduced-form effect is informative about the causal direction of income on deforestation. However, effects were strongly heterogeneous: higher incomes reduced deforestation in drier, more accessible municipalities but increased clearing in wetter, low-elevation areas with high agricultural potential. These divergent patterns suggest that income simultaneously relaxes subsistence pressures driving forest dependence and raises the opportunity cost of conservation where agricultural returns are high. Our findings indicate that commodity-based agroforestry can align poverty alleviation with forest conservation under conditions of low agricultural opportunity cost. Still, policies must anticipate contexts where rising incomes amplify deforestation in agriculturally suitable land. The strategic targeting of livelihood interventions based on local agricultural potential may help reconcile development and conservation objectives in tropical forest frontiers.

2603.13688 2026-03-17 stat.ML cs.LG stat.AP

When Should Humans Step In? Optimal Human Dispatching in AI-Assisted Decisions

Lezhi Tan, Naomi Sagan, Lihua Lei, Jose Blanchet

Details
Abstract

AI systems increasingly assist human decision making by producing preliminary assessments of complex inputs. However, such AI-generated assessments can often be noisy or systematically biased, raising a central question: how should costly human effort be allocated to correct AI outputs where it matters the most for the final decision? We propose a general decision-theoretic framework for human-AI collaboration in which AI assessments are treated as factor-level signals and human judgments as costly information that can be selectively acquired. We consider cases where the optimal selection problem reduces to maximizing a reward associated with each candidate subset of factors, and turn policy design into reward estimation. We develop estimation procedures under both nonparametric and linear models, covering contextual and non-contextual selection rules. In the linear setting, the optimal rule admits a closed-form expression with a clear interpretation in terms of factor importance and residual variance. We apply our framework to AI-assisted peer review. Our approach substantially outperforms LLM-only predictions and achieves performance comparable to full human review while using only 20-30% of the human information. Across different selection rules, we find that simpler rules derived under linear models can significantly reduce computational cost without harming final prediction performance. Our results highlight both the value of human intervention and the efficiency of principled dispatching.

2603.13677 2026-03-17 stat.AP

Hierarchical Latent Space Item Response Model for Analyzing Mental Health Vulnerability of Elementary School Students in South Korea

Soyeon Park, Seoyoung Shin, Minjeong Jeon, Hyoun Kyoung Kim, Ick Hoon Jin

Abstract

Mental health difficulties among elementary school students represent a growing public health concern in South Korea, yet analytical tools for identifying school-specific vulnerability patterns from item response data remain limited. We propose the hierarchical latent space item response model (HLSIRM), which adds hierarchical respondent effects and an inner-product latent interaction for signed respondent-item associations, yielding a unified interaction map that separates school and individual main effects from school-item and individual-item interactions. We apply HLSIRM to mental health vulnerability data from 2,210 elementary school students across 35 schools in Incheon, South Korea. Clustering item vectors by directional similarity identifies four empirically derived vulnerability domains. School-level analysis reveals that the absence of counseling experience is the primary vulnerability domain aligned with most school vectors, while stress, depression, and smartphone dependency concentrate in specific schools. Within-school analysis demonstrates how individual student positions in the interaction map translate into targeted intervention strategies that address school-specific needs.

2603.13674 2026-03-17 cs.LG cs.AI stat.ML

Locally Linear Continual Learning for Time Series based on VC-Theoretical Generalization Bounds

Yan V. G. Ferreira, Igor B. Lima, Pedro H. G. Mapa S., Felipe V. Campos, Antonio P. Braga

Comments 12 pages. Accepted at IEEE Transactions on Pattern Analysis and Machine Intelligence

Journal ref IEEE Transactions on Pattern Analysis and Machine Intelligence, Early Access, 2026
Abstract

Most machine learning methods assume fixed probability distributions, limiting their applicability in nonstationary real-world scenarios. While continual learning methods address this issue, current approaches often rely on black-box models or require extensive user intervention for interpretability. We propose SyMPLER (Systems Modeling through Piecewise Linear Evolving Regression), an explainable model for time series forecasting in nonstationary environments based on dynamic piecewise-linear approximations. Unlike other locally linear models, SyMPLER uses generalization bounds from Statistical Learning Theory to automatically determine when to add new local models based on prediction errors, eliminating the need for explicit clustering of the data. Experiments show that SyMPLER can achieve comparable performance to both black-box and existing explainable models while maintaining a human-interpretable structure that reveals insights about the system's behavior. In this sense, our approach reconciles accuracy and interpretability, offering a transparent and adaptive solution for forecasting nonstationary time series.

2603.13662 2026-03-17 stat.ME stat.AP stat.ML

Fast Uncertainty Quantification for Kernel-Based Estimators in Large-Scale Causal Inference

Matthew Kosko, Falco J. Bargagli-Stoffi, Lin Wang, Michele Santacatterina

Comments 47 pages

Abstract

Kernel methods are widely used in causal inference for tasks such as treatment effect estimation, policy evaluation, and policy learning. The bootstrap is a standard tool for uncertainty quantification because of its broad applicability. As increasingly large datasets become available, such as the 2023 U.S. Natality data from the National Vital Statistics System (NVSS), which includes 3,596,017 registered births, the computational demands of these methods increase substantially. Kernel methods are known to scale poorly with sample size, and this limitation is further exacerbated by the repeated re-fitting required by the bootstrap. As a result, bootstrap-based inference for kernel-based estimators can become computationally infeasible in large-scale settings. In this paper, we address these challenges by extending the causal Bag of Little Bootstraps (cBLB) algorithm to kernel methods. Our approach achieves computational scalability by combining subsampling and resampling while preserving first-order uncertainty quantification and asymptotically correct coverage. We evaluate the method across three representative implementations: kernelized augmented outcome-weighted learning, kernel-based minimax weighting, and double machine learning with kernel support vector machines. We show in simulations that our method yields confidence intervals with nominal coverage at a fraction of the computational cost. We further demonstrate its utility in a real-world application by estimating the effect of any amount of smoking on birth weight, as well as the optimal treatment regime, using the NVSS dataset, where the standard bootstrap is prohibitively expensive computationally and effectively infeasible at this scale.

2603.13646 2026-03-17 stat.ME

Surrogate-Based Bayesian Inference: Uncertainty Quantification and Active Learning

Andrew Gerard Roberts, Michael C. Dietze, Jonathan H. Huggins

Abstract

Surrogate models, also called emulators, are widely used to facilitate Bayesian inference in settings where computational costs preclude the use of standard posterior inference algorithms. Their deployment is now standard practice across many scientific domains. However, integrating surrogates in statistical analyses introduces unique challenges that complicate established Bayesian workflow principles. While significant progress has been made in addressing these issues, the relevant developments are scattered across several distinct research communities, with different emphases and perspectives. We present a unifying review that synthesizes the literature into a coherent framework, aiming to benefit both practitioners and methods developers. We place particular emphasis on propagating surrogate uncertainty and sequentially refining emulators via active learning, two key components of a robust surrogate-based Bayesian workflow.

2603.13622 2026-03-17 stat.CO math.PR stat.ME

The Continuous Rank Probability Score of a Generalized Beta-Prime Distribution and Some Special Cases

Matthew LeDuc

Comments 9 pages, no figures. Work in progress

Abstract

This working paper describes new results in derivations of the Continuous Ranked Probability Score of a generalized beta-prime distribution and several special cases, such as the Dagum distribution and Singh-Maddala distribution. Comparison with Monte Carlo estimates is also presented.
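The comparison with Monte Carlo estimates mentioned above can be sketched with the standard kernel form of the score, CRPS(F, y) = E|X - y| - 0.5 E|X - X'|, evaluated on draws from F. The sketch below is illustrative only and is not taken from the paper: the function names are hypothetical, and it samples a standard beta-prime(a, b) as a ratio of two gamma variates rather than the generalized family the paper treats.

```python
import random


def crps_monte_carlo(samples, y):
    """Estimate CRPS(F, y) = E|X - y| - 0.5 * E|X - X'| from draws of F."""
    n = len(samples)
    term_obs = sum(abs(x - y) for x in samples) / n
    # Sum of |x_i - x_j| over all ordered pairs, computed from the sorted
    # sample in O(n log n): 2 * sum_k (2k - n + 1) * x_(k), k = 0..n-1.
    xs = sorted(samples)
    pair_sum = 2.0 * sum((2 * k - n + 1) * x for k, x in enumerate(xs))
    term_spread = pair_sum / (n * n)
    return term_obs - 0.5 * term_spread


def sample_beta_prime(a, b, rng):
    """Draw from a standard beta-prime(a, b) as a ratio of gamma variates.

    Assumption for illustration: no extra shape or scale parameters.
    """
    return rng.gammavariate(a, 1.0) / rng.gammavariate(b, 1.0)


rng = random.Random(0)
draws = [sample_beta_prime(2.0, 3.0, rng) for _ in range(5000)]
print(crps_monte_carlo(draws, y=0.8))
```

A closed-form expression for the score, such as the ones the paper derives, could be checked against this estimate for a few parameter settings and observation values.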

2603.13616 2026-03-17 cs.RO stat.AP

Beyond Binary Success: Sample-Efficient and Statistically Rigorous Robot Policy Comparison

David Snyder, Apurva Badithela, Nikolai Matni, George Pappas, Anirudha Majumdar, Masha Itkina, Haruki Nishimura

Comments 12 + 9 pages, 2 + 5 figures

Abstract

Generalist robot manipulation policies are becoming increasingly capable, but are limited in evaluation to a small number of hardware rollouts. This strong resource constraint in real-world testing necessitates both more informative performance measures and reliable and efficient evaluation procedures to properly assess model capabilities and benchmark progress in the field. This work presents a novel framework for robot policy comparison that is sample-efficient, statistically rigorous, and applicable to a broad set of evaluation metrics used in practice. Based on safe, anytime-valid inference (SAVI), our test procedure is sequential, allowing the evaluator to stop early when sufficient statistical evidence has accumulated to reach a decision at a pre-specified level of confidence. Unlike previous work developed for binary success, our unified approach addresses a wide range of informative metrics: from discrete partial credit task progress to continuous measures of episodic reward or trajectory smoothness, spanning both parametric and nonparametric comparison problems. Through extensive validation on simulated and real-world evaluation data, we demonstrate up to 70% reduction in evaluation burden compared to standard batch methods and up to 50% reduction compared to state-of-the-art sequential procedures designed for binary outcomes, with no loss of statistical rigor. Notably, our empirical results show that competing policies can be separated more quickly when using fine-grained task progress than binary success metrics.

2603.13614 2026-03-17 stat.ME math.ST stat.TH

Measuring Extreme Tail Association

Bikramjit Das, Xiangyu Liu

Comments 38 pages, 13 figures, includes appendix

Abstract

Simultaneous occurrences of extreme events need not imply symmetric or reciprocal tail dependence. However, most existing measures of extremal dependence are inherently symmetric and hence often fail to capture directional influence in tail association. We introduce a rank-based measure of Extreme Tail Association (ETA) for bivariate data quantifying such directional influence of one variable on another in extreme tail regions. The proposed estimator is easily computable, consistent with its population counterpart, and asymptotically normal under mild conditions, allowing for statistical inference. We further develop a formal test for asymmetry in tail association based on a multiplier bootstrap procedure. The practical relevance of the methodology is illustrated using data on extreme price movements in major cryptocurrencies. Beyond providing a flexible tool for extremal association, the proposed framework offers a substantive argument for investigating causal relationships in extreme scenarios.

2603.13613 2026-03-17 stat.ML cs.LG math.ST stat.TH

Robust Sequential Tracking via Bounded Information Geometry and Non-Parametric Field Actions

Carlos C. Rodriguez

Comments 10 pages, 3 figures

Abstract

Standard sequential inference architectures are compromised by a normalizability crisis when confronted with extreme, structured outliers. By operating on unbounded parameter spaces, state-of-the-art estimators lack the intrinsic geometry required to appropriately sever anomalies, resulting in unbounded covariance inflation and mean divergence. This paper resolves this structural failure by analyzing the abstraction sequence of inference at the meta-prior level (S_2). We demonstrate that extremizing the action over an infinite-dimensional space requires a non-parametric field anchored by a pre-prior, as a uniform volume element mathematically does not exist. By utilizing strictly invariant Delta (or ν) Information Separations on the statistical manifold, we physically truncate the infinite tails of the spatial distribution. When evaluated as a Radon-Nikodym derivative against the base measure, the active parameter space compresses into a strictly finite, normalizable probability droplet. Empirical benchmarks across three domains--LiDAR maneuvering target tracking, high-frequency cryptocurrency order flow, and quantum state tomography--demonstrate that this bounded information geometry analytically truncates outliers, ensuring robust estimation without relying on infinite-tailed distributional assumptions.

2603.13583 2026-03-17 stat.ME

Confidence intervals for two-stage adaptive designs with subpopulation selection

Enyu Li, Nigel Stallard, Ekkehard Glimm, Dominic Magirr, Peter K. Kimani

Abstract

We consider clinical trials in which an experimental treatment is compared with a control in pre-specified patient subpopulations. In such settings, adaptive enrichment designs allow the enrolled population to be modified at an interim analysis, with subpopulations selected according to preplanned rules. Since these interim decisions are data-dependent, valid statistical inference must account for them. We focus on constructing confidence intervals for the treatment effect in the selected population. Confidence interval methods that ignore the possibility of population modification may fail to achieve the desired coverage probability. We propose a new approach that constructs confidence intervals with exact nominal coverage conditional on the interim decision. Importantly, our method applies to a broad class of adaptive enrichment designs, rather than a single specific design. Our method involves deriving the distribution of the naive estimator of the treatment effect in the selected population conditional on the interim decision and inverting uniformly most accurate unbiased tests to obtain the confidence interval. We provide an efficient computational procedure and show through extensive simulations that the resulting confidence intervals satisfy the theoretical coverage guarantees.

2603.13561 2026-03-17 stat.ME

Addressing both variable selection and misclassified responses with parametric and semiparametric methods

Hui Guo, Grace Y. Yi, Boyu Wang

Abstract

While variable selection has received extensive attention in the literature, it remains underexplored in the presence of response measurement error. In this paper, we investigate this important problem within the context of binary classification with error-prone responses. We present valid variable selection procedures to address the complexities of response errors. Leveraging validation data, we introduce both parametric and semiparametric methodologies to accommodate the mismeasurement effects. By rigorously establishing theoretical results, we offer insights into, and justification for, the validity of the proposed methods. By properly choosing the penalty function and regularization parameter, we demonstrate that the resulting estimators possess the oracle property. To assess the finite sample properties of the proposed methods, we conduct numerical studies that confirm their effectiveness.

2603.13559 2026-03-17 stat.ML cs.LG cs.SY eess.SP eess.SY

Robust Automatic Differentiation of Square-Root Kalman Filters via Gramian Differentials

Adrien Corenflos

Comments 4 pages, documents the mathematics of a bug fix at https://github.com/state-space-models/cuthbert

Abstract

Square-root Kalman filters propagate state covariances in Cholesky-factor form for numerical stability, and are a natural target for gradient-based parameter learning in state-space models. Their core operation, triangularization of a matrix $M \in \mathbb{R}^{n \times m}$, is computed via a QR decomposition in practice, but naively differentiating through it causes two problems: the semi-orthogonal factor is non-unique when $m > n$, yielding undefined gradients; and the standard Jacobian formula involves inverses, which diverges when $M$ is rank-deficient. Both are resolved by the observation that all filter outputs relevant to learning depend on the input matrix only through the Gramian $MM^\top$, so the composite loss is smooth in $M$ even where the triangularization is not. We derive a closed-form chain-rule directly from the differential of this Gramian identity, prove it exact for the Kalman log-marginal likelihood and filtered moments, and extend it to rank-deficient inputs via a two-component decomposition: a column-space term based on the Moore--Penrose pseudoinverse, and a null-space correction for perturbations outside the column space of $M$.
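The key observation above, that the triangular factor depends on $M$ only through the Gramian $MM^\top$, can be illustrated numerically. The sketch below is not the paper's algorithm and does not use QR: as a stand-in it takes the Cholesky factor of the Gramian, which requires $M$ to have full row rank, and checks that two different matrices sharing a Gramian give the same lower-triangular factor. All function names are hypothetical.

```python
import math


def gramian(M):
    """Return M M^T for a matrix stored as a list of rows."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in M] for ri in M]


def cholesky_lower(G):
    """Lower-triangular L with L L^T = G, for symmetric positive definite G."""
    n = len(G)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = G[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


# A 2 x 3 input (n = 2 rows, m = 3 columns, so m > n).
M = [[1, 2, 2],
     [0, 3, 4]]
# The same columns, permuted and with one sign flipped: M times an
# orthogonal matrix, hence a different matrix with the same Gramian.
M_rot = [[2, -1, 2],
         [3, 0, 4]]

L_a = cholesky_lower(gramian(M))
L_b = cholesky_lower(gramian(M_rot))
print(L_a)
print(L_a == L_b)
```

Because any quantity computed from the triangular factor is unchanged when $M$ is multiplied on the right by an orthogonal matrix, a derivative taken with respect to the Gramian sidesteps the non-uniqueness of the semi-orthogonal factor.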

2603.13558 2026-03-17 stat.ML cs.CL cs.IT cs.LG math.IT

Holographic Invariant Storage: Design-Time Safety Contracts via Vector Symbolic Architectures

Arsenios Scrivens

Comments 25 pages, 7 figures, includes appendices with extended proofs and pilot LLM experiment

Abstract

We introduce Holographic Invariant Storage (HIS), a protocol that assembles known properties of bipolar Vector Symbolic Architectures into a design-time safety contract for LLM context-drift mitigation. The contract provides three closed-form guarantees evaluable before deployment: single-signal recovery fidelity converging to $1/\sqrt{2} \approx 0.707$ (regardless of noise depth or content), continuous-noise robustness $2Φ(1/σ) - 1$, and multi-signal capacity degradation $\approx\sqrt{1/(K+1)}$. These bounds, validated by Monte Carlo simulation ($n = 1{,}000$), enable a systems engineer to budget recovery fidelity and codebook capacity at design time -- a property no timer or embedding-distance metric provides. A pilot behavioral experiment (four LLMs, 2B--7B, 720 trials) confirms that safety re-injection improves adherence at the 2B scale; full results are in an appendix.

2603.13542 2026-03-17 stat.ME stat.AP

Robust Inferential Methodology for Multidimensional Diffusion Processes

Sourojyoti Barick

Abstract

We investigate robust parameter estimation and testing procedure for multivariate diffusion processes observed at high frequency via the minimum density power divergence estimator (MDPDE). Within a general diffusion framework and under standard regularity conditions, we establish consistency and asymptotic normality for the estimators of both drift and diffusion parameters. The drift estimator converges at the $\sqrt{n h_n}$ rate, whereas the diffusion estimator attains the standard $\sqrt{n}$ rate, and the two estimators are shown to be asymptotically independent. The proposed methodology constitutes a robust alternative to quasi-likelihood and ordinary least squares based approaches, offering resilience against outliers, local contamination, and mild model misspecification, while remaining asymptotically equivalent to classical methods in the absence of contamination. Simulation studies demonstrate that the MDPDE achieves reliable finite-sample performance and enhanced numerical stability relative to likelihood-based estimators. These results underscore the practical relevance of divergence-based estimation for high-frequency diffusion models and point to natural extensions to more complex continuous-time settings.

2603.13501 2026-03-17 stat.ML cs.LG

Standard Acquisition Is Sufficient for Asynchronous Bayesian Optimization

Ben Riegler, James Odgers, Vincent Fortuin

Abstract

Asynchronous Bayesian optimization is widely used for gradient-free optimization in domains with independent parallel experiments and varying evaluation times. Existing methods posit that standard acquisitions lead to redundant and repeated queries, proposing complex solutions to enforce diversity in queries. Challenging this fundamental premise, we show that standard acquisition functions such as the Upper Confidence Bound can in fact achieve theoretical guarantees essentially equivalent to those of sequential Thompson sampling. A conceptual analysis of asynchronous Bayesian optimization reveals that existing works neglect intermediate posterior updates, which we find to be generally sufficient to avoid redundant queries. Further investigation shows that by penalizing busy locations, diversity-enforcing methods can over-explore in asynchronous settings, reducing their performance. Our extensive experiments demonstrate that simple standard acquisition functions match or outperform purpose-built asynchronous methods across synthetic and real-world tasks.

2603.13499 2026-03-17 math.ST stat.TH

An Empirical Bayes Perspective on Heteroskedastic Mean Estimation

Yanjun Han, Abhishek Shetty, Jacob Shkrob

Abstract

Towards understanding the fundamental limits of estimation from data of varied quality, we study the problem of estimating a mean parameter from heteroskedastic Gaussian observations where the variances are unknown and may vary arbitrarily across observations. While a simple linear estimator with known variances attains the smallest mean squared error, estimation without this knowledge is challenging due to the large number of nuisance parameters. We propose a simple and principled approach based on empirical Bayes: model the observations as if they were i.i.d. from a normal scale mixture and compute the profile maximum likelihood estimator (MLE) for the mean, treating the nonparametric mixing distribution as nuisance. Our result shows that this estimator achieves near-optimal error bounds across various heteroskedastic models in the literature. In particular, for the subset-of-signals problem where an unknown subset of observations has small variance, our estimator adaptively achieves the minimax rate for all signal sizes, including the sharp phase transition, without any tuning parameters. One of our key technical steps is a sharper metric entropy bound for normal scale mixtures, obtained via Chebyshev approximations on a transformed polynomial basis. This approach yields an improved polylogarithmic, rather than polynomial, dependence on the variance ratio, which could be of independent interest.
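For reference, the known-variance benchmark mentioned above is the inverse-variance weighted mean, which has the smallest variance among linear unbiased estimators. A minimal sketch follows; the function name is hypothetical, and this is the textbook baseline, not the paper's empirical Bayes estimator.

```python
def inverse_variance_mean(xs, variances):
    """Weighted mean with weights 1 / sigma_i^2, plus its variance.

    Assumes the observations are independent, share a common mean,
    and have known, strictly positive variances.
    """
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    estimate = sum(w * x for w, x in zip(weights, xs)) / total
    return estimate, 1.0 / total


# Two precise observations and one very noisy one: the noisy point
# receives almost no weight in the estimate.
estimate, variance = inverse_variance_mean([1.0, 1.2, 9.0], [0.1, 0.1, 100.0])
print(estimate, variance)
```

The difficulty the paper addresses is that the weights above are unavailable when the variances are unknown, so they have to be handled as nuisance parameters.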

2603.13478 2026-03-17 cs.NE cs.AI cs.LG q-bio.NC stat.ML

Equivalence of approximation by networks of single- and multi-spike neurons

Dominik Dold, Philipp Christian Petersen

Abstract

In a spiking neural network, is it enough for each neuron to spike at most once? In recent work, approximation bounds for spiking neural networks have been derived, quantifying how well they can fit target functions. However, these results are only valid for neurons that spike at most once, which is commonly thought to be a strong limitation. Here, we show that the opposite is true for a large class of spiking neuron models, including the commonly used leaky integrate-and-fire model with subtractive reset: for every approximation bound that is valid for a set of multi-spike neural networks, there is an equivalent set of single-spike neural networks with only linearly more neurons (in the maximum number of spikes) for which the bound holds. The same is true for the reverse direction too, showing that regarding their approximation capabilities in general machine learning tasks, single-spike and multi-spike neural networks are equivalent. Consequently, many approximation results in the literature for single-spike neural networks also hold for the multi-spike case.

2603.13361 2026-03-17 cs.CV cs.AI stat.ML

BrainCast: A Spatio-Temporal Forecasting Model for Whole-Brain fMRI Time Series Prediction

Yunlong Gao, Jinbo Yang, Li Xiao, Haiye Huo, Yang Ji, Hao Wang, Aiying Zhang, Yu-Ping Wang

Abstract

Functional magnetic resonance imaging (fMRI) enables noninvasive investigation of brain function, while short clinical scan durations, arising from human and non-human factors, usually lead to reduced data quality and limited statistical power for neuroimaging research. In this paper, we propose BrainCast, a novel spatio-temporal forecasting framework specifically tailored for whole-brain fMRI time series forecasting, to extend informative fMRI time series without additional data acquisition. It formulates fMRI time series forecasting as a multivariate time series prediction task and jointly models temporal dynamics within regions of interest (ROIs) and spatial interactions across ROIs. Specifically, BrainCast integrates a Spatial Interaction Awareness module to characterize inter-ROI dependencies via embedding every ROI time series as a token, a Temporal Feature Refinement module to capture intrinsic neural dynamics within each ROI by enhancing both low- and high-energy temporal components of fMRI time series at the ROI level, and a Spatio-temporal Pattern Alignment module to combine spatial and temporal representations for producing informative whole-brain features. Experimental results on resting-state and task fMRI datasets from the Human Connectome Project demonstrate the superiority of BrainCast over state-of-the-art time series forecasting baselines. Moreover, fMRI time series extended by BrainCast improve downstream cognitive ability prediction, highlighting the clinical and neuroscientific impact brought by whole-brain fMRI time series forecasting in scenarios with restricted scan durations.

2603.13284 2026-03-17 cs.LG stat.ML

Do Diffusion Models Dream of Electric Planes? Discrete and Continuous Simulation-Based Inference for Aircraft Design

Aurelien Ghiglino, Daniel Elenius, Anirban Roy, Ramneet Kaur, Manoj Acharya, Colin Samplawski, Brian Matejek, Susmit Jha, Juan Alonso, Adam Cobb

Abstract

In this paper, we generate conceptual engineering designs of electric vertical take-off and landing (eVTOL) aircraft. We follow the paradigm of simulation-based inference (SBI), whereby we look to learn a posterior distribution over the full eVTOL design space. To learn this distribution, we sample over discrete aircraft configurations (topologies) and their corresponding set of continuous parameters. Therefore, we introduce a hierarchical probabilistic model consisting of two diffusion models. The first model leverages recent work on Riemannian Diffusion Language Modeling (RDLM) and Unified World Models (UWMs) to enable us to sample topologies from a discrete and continuous space. For the second model we introduce a masked diffusion approach to sample the corresponding parameters conditioned on the topology. Our approach rediscovers known trends and governing physical laws in aircraft design, while significantly accelerating design generation.

2603.13254 2026-03-17 cs.LG stat.CO

Introducing Feature-Based Trajectory Clustering, a clustering algorithm for longitudinal data

Marie-Pierre Sylvestre, Laurence Boulanger

Abstract

We present a new algorithm for clustering longitudinal data. Data of this type can be conceptualized as consisting of individuals and, for each such individual, observations of a time-dependent variable made at various times. Generically, the specific way in which this variable evolves with time is different from one individual to the next. However, there may also be commonalities: specific characteristic features of the time evolution shared by many individuals. The purpose of the method we put forward is to find clusters of individuals whose underlying time-dependent variables share such characteristic features. This is done in two steps. The first step maps each individual to a point in Euclidean space whose coordinates are determined by specific mathematical formulae meant to capture a variety of characteristic features. The second step finds the clusters by applying the Spectral Clustering algorithm to the resulting point cloud.

2603.13241 2026-03-17 stat.ML cs.AI cs.LG

A Hybrid Tsallis-Polarization Impurity Measure for Decision Trees: Theoretical Foundations and Empirical Evaluation

Edouard Lansiaux, Idriss Jairi, Hayfa Zgaya-Biau

Abstract

We introduce the Integrated Tsallis Combination (ITC), a hybrid impurity measure for decision tree learning that combines normalized Tsallis entropy with an exponential polarization component. While many existing measures sacrifice theoretical soundness for computational efficiency or vice versa, ITC provides a mathematically principled framework that balances both aspects. The core innovation lies in the complementarity between Tsallis entropy's information-theoretic foundations and the polarization component's sensitivity to distributional asymmetry. We establish key theoretical properties (concavity under explicit parameter conditions, proper boundary conditions, and connections to classical measures) and provide a rigorous justification for the hybridization strategy. Through an extensive comparative evaluation on seven benchmark datasets comparing 23 impurity measures with five-fold repetition, we show that simple parametric measures (Tsallis $α=0.5$) achieve the highest average accuracy ($91.17\%$), while ITC variants yield competitive results ($88.38$--$89.16\%$) with strong theoretical guarantees. Statistical analysis (Friedman test: $χ^2=3.89$, $p=0.692$) reveals no significant global differences among top performers, indicating practical equivalence for many applications. ITC's value resides in its solid theoretical grounding (proven concavity under suitable conditions, flexible parameterization ($α$, $β$, $γ$), and $O(K)$ computational efficiency), making it a rigorous, generalizable alternative when theoretical guarantees are paramount. We provide guidelines for measure selection based on application priorities and release an open-source implementation to foster reproducibility and further research.
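As background for the Tsallis component, the Tsallis entropy of order $α$ for class proportions $p_1, \dots, p_K$ is $(1 - \sum_k p_k^α)/(α - 1)$, which reduces to the Gini impurity at $α = 2$ and to Shannon entropy as $α \to 1$. The sketch below covers only this component and costs $O(K)$ per evaluation. Dividing by the value at the uniform distribution is an assumed normalization, the function names are hypothetical, and the polarization term of ITC is not reproduced because the abstract does not define it.

```python
import math


def tsallis_entropy(probs, alpha):
    """Tsallis entropy (1 - sum_k p_k^alpha) / (alpha - 1); Shannon at alpha = 1."""
    if alpha == 1.0:
        return -sum(p * math.log(p) for p in probs if p > 0.0)
    return (1.0 - sum(p ** alpha for p in probs if p > 0.0)) / (alpha - 1.0)


def normalized_tsallis(probs, alpha):
    """Scale by the value at the uniform distribution over K classes.

    Assumed normalization (valid for alpha > 0, where the uniform
    distribution maximizes the entropy); the paper's choice may differ.
    """
    k = len(probs)
    if k < 2:
        return 0.0
    uniform_value = tsallis_entropy([1.0 / k] * k, alpha)
    return tsallis_entropy(probs, alpha) / uniform_value


print(tsallis_entropy([0.5, 0.5], 2.0))          # Gini impurity of a balanced node
print(normalized_tsallis([0.9, 0.1], 0.5))       # nearly pure node, alpha = 0.5
print(normalized_tsallis([0.5, 0.5], 0.5))       # balanced node, alpha = 0.5
```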

2603.13234 2026-03-17 cs.LG stat.ML

RFX-Fuse: Breiman and Cutler's Unified ML Engine + Native Explainable Similarity

Chris Kuchar

Comments 31 pages, 10 figures

Abstract

Breiman and Cutler's original Random Forest was designed as a unified ML engine -- not merely an ensemble predictor. Their implementation included classification, regression, unsupervised learning, proximity-based similarity, outlier detection, missing value imputation, and visualization -- capabilities that modern libraries like scikit-learn never implemented. RFX-Fuse (Random Forests X [X=compression] -- Forest Unified Learning and Similarity Engine) delivers Breiman and Cutler's complete vision with native GPU/CPU support. Modern ML pipelines require 5+ separate tools -- XGBoost for prediction, FAISS for similarity, SHAP for explanations, Isolation Forest for outliers, custom code for importance. RFX-Fuse provides a 1 to 2 model object alternative -- a single set of trees grown once. Novel Contributions: (1) Proximity Importance -- native explainable similarity: proximity measures that samples are similar; proximity importance explains why. (2) Dataset-specific imputation validation for general tabular data -- ranking imputation methods by how real the imputed data looks, without ground truth labels.

2602.02319 2026-03-17 stat.ME

Leave-One-Out Neighborhood Smoothing for Graphons: Berry-Esseen Bounds, Confidence Intervals, and Honest Tuning

Behzad Aalipur, Rachel Kilby

Abstract

Neighborhood smoothing methods achieve minimax-optimal rates for estimating edge probabilities under graphon models, but their use for statistical inference has remained limited. The main obstacle is that classical neighborhood smoothers select data-driven neighborhoods and average edges using the same adjacency matrix, inducing complex dependencies that invalidate standard concentration and normal approximation arguments. We introduce a leave-one-out modification of neighborhood smoothing for undirected simple graphs. When estimating a single entry P_ij, the neighborhood of node i is constructed from an adjacency matrix in which the jth row and column are set to zero, thereby decoupling neighborhood selection from the edges being averaged. We show that this construction restores conditional independence of the centered summands, enabling the use of classical probabilistic tools for inference. Under piecewise Lipschitz graphon assumptions and logarithmic degree growth, we derive variance-adaptive concentration inequalities based on Bousquet's inequality and establish Berry-Esseen bounds with explicit rates for the normalized estimation error. These results yield both finite-sample and asymptotic confidence intervals for individual edge probabilities. The same leave-one-out structure also supports an honest cross-validation scheme for tuning parameter selection, for which we prove an oracle inequality. The proposed estimator retains the optimal row-wise mean-squared error rates of classical neighborhood smoothing while providing valid entrywise uncertainty quantification.

2510.14075 2026-03-17 eess.SY cs.AI cs.SY stat.CO stat.ML

DiffOPF: Diffusion Solver for Optimal Power Flow

Milad Hoseinpour, Vladimir Dvorkin

Comments 8 pages, 4 figures, 2 tables

Abstract

The optimal power flow (OPF) is a multi-valued, non-convex mapping from loads to dispatch setpoints. The variability of system parameters (e.g., admittances, topology) further contributes to the multiplicity of dispatch setpoints for a given load. Existing deep learning OPF solvers are single-valued and thus fail to capture the variability of system parameters unless fully represented in the feature space, which is prohibitive. To solve this problem, we introduce a diffusion-based OPF solver, termed DiffOPF, that treats OPF as a conditional sampling problem. The solver learns the joint distribution of loads and dispatch setpoints from operational history, and returns the marginal dispatch distributions conditioned on loads. Unlike single-valued solvers, DiffOPF enables sampling statistically credible warm starts with favorable cost and constraint satisfaction trade-offs. We explore the sample complexity DiffOPF needs to ensure that the OPF solution lies within a prescribed distance from the optimization-based solution, and verify this experimentally on power system benchmarks.

2510.12271 2026-03-17 stat.AP cs.LG

The Living Forecast: Evolving Day-Ahead Predictions into Intraday Reality

Kutay Bölat, Peter Palensky, Simon Tindemans

English abstract

Accurate intraday forecasts are essential for power system operations, complementing day-ahead forecasts that gradually lose relevance as new information becomes available. This paper introduces a Bayesian updating mechanism that converts fully probabilistic day-ahead forecasts into intraday forecasts without retraining or re-inference. The approach conditions the Gaussian mixture output of a conditional variational autoencoder-based forecaster on observed measurements, yielding an updated distribution for the remaining horizon that preserves its probabilistic structure. This enables consistent point, quantile, and ensemble forecasts while remaining computationally efficient and suitable for real-time applications. Experiments on household electricity consumption and photovoltaic generation datasets demonstrate that the proposed method improves forecast accuracy by up to 25% across likelihood-, sample-, quantile-, and point-based metrics. The largest gains occur in time steps with strong temporal correlation to observed data, and the use of pattern dictionary-based covariance structures further enhances performance. The results highlight a theoretically grounded framework for intraday forecasting in modern power systems.
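Conditioning a Gaussian mixture on observed values has a closed form, which is what makes an update without retraining possible. The sketch below is a toy two-step horizon (one observed time step, one remaining time step) and is not the paper's implementation. The function names and the tuple layout of the components are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def condition_mixture(components, observed):
    """Condition a bivariate Gaussian mixture on its first coordinate.

    components: list of (weight, (m1, m2), (s11, s12, s22)), where index 1 is
    the observed step and index 2 is the remaining step (assumed layout).
    Returns a list of (new_weight, conditional_mean, conditional_variance).
    """
    updated = []
    for weight, (m1, m2), (s11, s12, s22) in components:
        # Reweight each component by how well it explains the observation.
        evidence = weight * normal_pdf(observed, m1, s11)
        # Standard Gaussian conditioning for the remaining coordinate.
        cond_mean = m2 + s12 / s11 * (observed - m1)
        cond_var = s22 - s12 ** 2 / s11
        updated.append((evidence, cond_mean, cond_var))
    total = sum(e for e, _, _ in updated)  # may underflow for extreme observations
    return [(e / total, m, v) for e, m, v in updated]
```

The result is again a Gaussian mixture, so point, quantile, and sample-based forecasts for the remaining step can all be read from the same updated object. A longer horizon would replace the scalar ratios with block-matrix solves.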

2509.01437 2026-03-17 stat.ME cs.LG stat.CO stat.ML

Sampling as Bandits: Evaluation-Efficient Design for Black-Box Densities

Takuo Matsubara, Andrew Duncan, Simon Cotter, Konstantinos Zygalakis

English abstract

We propose bandit importance sampling (BIS), a powerful importance sampling framework tailored for settings in which evaluating the target density is computationally expensive. BIS facilitates accurate sampling while minimizing the required number of target-density evaluations. In contrast to adaptive importance sampling, which optimizes a proposal distribution, BIS directly optimizes the set of samples through a sequential selection process driven by multi-armed bandits. BIS serves as a general framework that accommodates user-defined bandit strategies. Theoretically, the weak convergence of the weighted samples, and thus the consistency of the Monte Carlo estimator, is established regardless of the specific strategy employed. In this paper, we present a practical strategy that leverages Gaussian process surrogates to guide sample selection, adapting the principles of Bayesian optimization for sampling. Comprehensive numerical studies demonstrate the superior performance of BIS across multimodal, heavy-tailed distributions, and real-world Bayesian inference tasks involving Markov random fields.

2505.20235 2026-03-17 cs.LG cs.AI stat.ML

Variational Deep Learning via Implicit Regularization

Jonathan Wenger, Beau Coker, Juraj Marusic, John P. Cunningham

English abstract

Modern deep learning models generalize remarkably well in-distribution, despite being overparametrized and trained with little to no explicit regularization. Instead, current theory credits implicit regularization imposed by the choice of architecture, hyperparameters, and optimization procedure. However, deep neural networks can be surprisingly non-robust, resulting in overconfident predictions and poor out-of-distribution generalization. Bayesian deep learning addresses this via model averaging, but typically requires significant computational resources as well as carefully elicited priors to avoid overriding the benefits of implicit regularization. Instead, in this work, we propose to regularize variational neural networks solely by relying on the implicit bias of (stochastic) gradient descent. We theoretically characterize this inductive bias in overparametrized linear models as generalized variational inference and demonstrate the importance of the choice of parametrization. Empirically, our approach demonstrates strong in- and out-of-distribution performance without additional hyperparameter tuning and with minimal computational overhead.

2503.05861 2026-03-17 cs.LG stat.ML

Interpretable Visualizations of Data Spaces for Classification Problems

Christian Jorgensen, Arthur Y. Lin, Rhushil Vasavada, Rose K. Cersonsky

Comments 15 pages, 8 figures

English abstract

How do classification models "see" our data? Based on their success in delineating behaviors, there must be some lens through which it is easy to see the boundary between classes; however, our current set of visualization techniques makes this prospect difficult. In this work, we propose a hybrid supervised-unsupervised technique distinctly suited to visualizing the decision boundaries determined by classification problems. This method provides a human-interpretable map that can be analyzed qualitatively and quantitatively, which we demonstrate through visualizing and interpreting a decision boundary for chemical neurotoxicity. While we discuss this method in the context of chemistry-driven problems, its application can be generalized across subfields for "unboxing" the operations of machine-learning classification models.

2409.06680 2026-03-17 stat.ME

Sequential stratified inference for the mean

Jacob V. Spertus, Mayuri Sridhar, Philip B. Stark

Comments 22 pages, 5 figures, submitted to Annals of Applied Statistics

English abstract

We develop conservative tests for the mean of a bounded population under stratified sampling and apply them to risk-limiting post-election audits. The tests are ``anytime valid'' under sequential sampling, allowing optional stopping in each stratum. Our core method expresses a global hypothesis about the population mean as a union of intersection hypotheses describing within-stratum means. It tests each intersection hypothesis using independent test supermartingales (TSMs) combined across strata by multiplication. A $P$-value for each intersection hypothesis is the reciprocal of that test statistic, and the largest $P$-value in the union is a $P$-value for the global hypothesis. This approach has two primary moving parts: the rule selecting which stratum to draw from next given the sample so far, and the form of the TSM within each stratum. These rules may vary over intersection hypotheses. We construct the test with the smallest expected stopping time and present a few strategies for approximating that optimum. In instances that arise in auditing and other applications, its expected sample size is substantially smaller than that of previous methods.
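The combination rule in this abstract (multiply the within-stratum TSMs, take the reciprocal as the intersection P-value, then maximize over the union) can be sketched for two strata. The rest of the sketch is assumed and not taken from the paper: a fixed-bet betting supermartingale, sampling with replacement from a population bounded in [0, 1], and a finite grid over boundary intersection hypotheses, which only approximates the maximum. All function names are hypothetical.

```python
def betting_tsm(samples, null_mean, bet=0.5):
    """Test supermartingale prod(1 + bet * (x - null_mean)) for x in [0, 1].

    With 0 <= bet <= 1 every factor is nonnegative, and each factor has
    expectation at most 1 when the true mean is at most null_mean.
    """
    value = 1.0
    for x in samples:
        value *= 1.0 + bet * (x - null_mean)
    return value


def intersection_p_value(strata_samples, null_means, bet=0.5):
    """Multiply the stratum TSMs; the P-value is the reciprocal, capped at 1."""
    product = 1.0
    for samples, null_mean in zip(strata_samples, null_means):
        product *= betting_tsm(samples, null_mean, bet)
    return 1.0 if product <= 1.0 else 1.0 / product


def global_p_value(strata_samples, weights, global_null, grid_steps=100, bet=0.5):
    """Largest intersection P-value over a grid of two-stratum nulls.

    Only nulls on the boundary w1 * eta1 + w2 * eta2 = global_null are tried,
    which assumes each TSM is nonincreasing in its null mean.
    """
    w1, w2 = weights
    p_values = []
    for step in range(grid_steps + 1):
        eta1 = step / grid_steps
        eta2 = (global_null - w1 * eta1) / w2
        if 0.0 <= eta2 <= 1.0:
            p_values.append(intersection_p_value(strata_samples, (eta1, eta2), bet))
    if not p_values:
        raise ValueError("no feasible intersection hypothesis on the grid")
    return max(p_values)
```

In this sketch the bet is fixed and the samples are supplied up front. The stratum selection rule and the adaptive choice of TSM, which the abstract identifies as the two moving parts, are deliberately left out.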

2307.05705 2026-03-17 math.NA cs.NA math.ST stat.ML stat.TH

Measure transfer via stochastic slicing and matching

Shiying Li, Caroline Moosmueller, Yongzhe Wang

English abstract

This paper studies iterative schemes for measure transfer and approximation problems, which are defined through a slicing-and-matching procedure. Similar to the sliced Wasserstein distance, these schemes benefit from the availability of closed-form solutions for the one-dimensional optimal transport problem and the associated computational advantages. While such schemes have already been used successfully in data science applications, few results on their convergence are available. The main contribution of this paper is an almost sure convergence proof for stochastic slicing-and-matching schemes. The proof builds on an interpretation as a stochastic gradient descent scheme on the Wasserstein space. Numerical examples on step-wise image morphing are also presented.
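A slicing-and-matching iteration can be sketched on two equal-sized point clouds in the plane: project both clouds onto a random direction, solve the one-dimensional transport problem by sorting, and move the source points along that direction. This is a simplified variant that uses a single random direction per step and a constant step size, both of which are assumptions. The paper's schemes and step-size conditions may differ, and the function names are hypothetical.

```python
import math
import random


def slice_and_match_step(source, target, theta, step=1.0):
    """Move source points along direction theta toward their sorted matches."""
    dx, dy = math.cos(theta), math.sin(theta)
    proj_source = [x * dx + y * dy for x, y in source]
    proj_target = sorted(x * dx + y * dy for x, y in target)
    # 1D optimal transport between equal-size samples pairs points by rank.
    order = sorted(range(len(source)), key=proj_source.__getitem__)
    moved = list(source)
    for rank, idx in enumerate(order):
        shift = step * (proj_target[rank] - proj_source[idx])
        x, y = source[idx]
        moved[idx] = (x + shift * dx, y + shift * dy)
    return moved


def run_slicing_and_matching(source, target, iterations=200, seed=0):
    """Repeat the step with directions drawn uniformly at random."""
    rng = random.Random(seed)
    points = list(source)
    for _ in range(iterations):
        theta = rng.uniform(0.0, math.pi)
        points = slice_and_match_step(points, target, theta)
    return points
```

With a full step, each iteration makes the projected source sample coincide with the projected target sample along the chosen direction, so repeated random directions pull the source cloud toward the target.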

1307.7624 2026-03-17 math.ST stat.TH

Singularity of Data Analytic Operations

Steven P. Ellis

Comments 495 pages, 11 figures

English abstract

Statistical data by their very nature are indeterminate in the sense that if one repeats the process of collecting the data the new data set will be different from the original. But two data sets generated in the same way should "tell the same story". Therefore, a statistical method, a map $Φ$ taking a data set $x$ to a point in some space $\mathsf{F}$, should be stable at $x$: Small perturbations in $x$ should result in a small change in $Φ(x)$. Otherwise, $Φ$ is useless at $x$ or -- and this is important -- near $x$. So one doesn't want $Φ$ to have "singularities," data sets $x$ such that the limit of $Φ(y)$ as $y$ approaches $x$ doesn't exist. (The same issue arises elsewhere in applied math.) We prove that broad classes of statistical methods have topological obstructions to continuity: They must have singularities. We derive broadly applicable lower bounds on the Hausdorff dimension, even Hausdorff measure, of the set of singularities of data maps. General results concerning severity of singularities are proved. For illustration, we show our results apply to plane fitting, measuring location of data on spheres, and to linear classification. This is not a "final" version, merely another attempt.