arXivDaily: arXiv daily academic digest, updated Monday to Friday
2604.26942 2026-04-30 cs.LG math.ST q-bio.GN stat.ME stat.ML stat.TH

Hyper Input Convex Neural Networks for Shape Constrained Learning and Optimal Transport

Shayan Hundrieser, Insung Kong, Johannes Schmidt-Hieber

Comments 65 pages, 13 figures, the first two authors contributed equally

Abstract

We introduce Hyper Input Convex Neural Networks (HyCNNs), a novel neural network architecture designed for learning convex functions. HyCNNs combine the principles of Maxout networks with input convex neural networks (ICNNs) to create a neural network that is always convex in the input, is theoretically capable of leveraging depth, and performs more reliably than ICNNs when trained at scale. Concretely, we prove that HyCNNs require exponentially fewer parameters than ICNNs to approximate quadratic functions up to a given precision. Across a series of synthetic experiments, we demonstrate that HyCNNs outperform existing ICNNs and MLPs in terms of predictive performance for convex regression and interpolation tasks. We further apply HyCNNs to learn high-dimensional optimal transport maps for synthetic examples and for single-cell RNA sequencing data, where they often outperform ICNN-based neural optimal transport methods and other baselines across a wide range of settings.

2604.26926 2026-04-30 cs.LG math.OC stat.ML

A Note on How to Remove the $\ln\ln T$ Term from the Squint Bound

Francesco Orabona

Abstract

In Orabona and Pál [2016], we introduced the shifted KT potentials, to remove the $\ln \ln T$ factor in the parameter-free learning with expert bound. In this short technical note, I show that this is equivalent to changing the prior in the Krichevsky--Trofimov algorithm. Then, I show how to use the same idea to remove the $\ln \ln T$ factor in the data-independent bound for the Squint algorithm.

2604.26922 2026-04-30 cs.LG cs.DS cs.GT stat.ML

On the Learning Curves of Revenue Maximization

Steve Hanneke, Alkis Kalavasis, Shay Moran, Grigoris Velegkas

Comments To appear in the 58th ACM Symposium on Theory of Computing (STOC 2026)

Abstract

Learning curves are a fundamental primitive in supervised learning, describing how an algorithm's performance improves with more data and providing a quantitative measure of its generalization ability. Formally, a learning curve plots the decay of an algorithm's error for a fixed underlying distribution as a function of the number of training samples. Prior work on revenue-maximizing learning algorithms, starting with the seminal work of Cole and Roughgarden [STOC, 2014], adopts a distribution-free perspective, which parallels the PAC learning framework in learning theory. This approach evaluates performance against the hardest possible sequence of valuation distributions, one for each sample size, effectively defining the upper envelope of learning curves over all possible distributions, thus leading to error bounds that do not capture the shape of the learning curves. In this work we initiate the study of learning curves for revenue maximization and provide a near-complete characterization of their rate of decay in the basic setting of a single item and a single buyer. In the absence of any restriction on the valuation distribution, we show that there exists a Bayes-consistent algorithm, meaning that its learning curve converges to zero for any arbitrary valuation distribution as the number of samples $n \to \infty$. However, this convergence must be arbitrarily slow, even if the optimal revenue is finite. In contrast, if the optimal revenue is achieved by a finite price, then the optimal rate of decay is roughly $1/\sqrt{n}$. Finally, for distributions supported on discrete sets of values, we show that learning curves decay almost exponentially fast, a rate unattainable under the PAC framework.

2604.26898 2026-04-30 math.PR cs.LG stat.ML

Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models

Andrea Agazzi, Giuseppe Bruno, Eloy Mosig García, Samuele Saviozzi, Marco Romito

Comments 55 pages, 6 figures

Abstract

We prove pathwise convergence of the layerwise evolution of tokens in a finite-depth, finite-width transformer model with MultiLayer Perceptron (MLP) blocks to a continuous-time stochastic interacting particle system. We also identify the stochastic partial differential equation describing the evolution of the tokens' distribution in this limit and prove propagation of chaos when the number of such tokens is large. The bounds we establish are quantitative and the limits we consider commute. We further prove that the limiting stochastic model displays synchronization by noise and establish exponential dissipation of the interaction energy on average, provided that the common noise is sufficiently coercive relative to the deterministic self-attention drift. We finally characterize the activation functions satisfying the former condition.

2604.26884 2026-04-30 stat.AP

Improving Bias Correction Methods for Daily Rainfall Using a Markov Chain Approach

Danny Parsons, David Stern, Mouhamadou Bamba Sylla, James Musyoka, John Bagiliko, Lily Clements, John Mupuro, Denis Ndanguza

Comments 42 pages, 19 figures

Abstract

Accurate, localised rainfall information is essential for applications such as agricultural planning, climate risk assessment, and water resources management. Gridded climate products provide rainfall information over large areas but can lack the accuracy needed at local scales, often requiring bias correction before use in local impact studies. Bias correction of daily rainfall is particularly challenging due to its complex characteristics. Local intensity scaling (LOCI) and quantile mapping (QM) are two widely used bias correction methods which adjust both rainfall frequency and intensity, but do not account for the temporal structure of daily rainfall. This can lead to biases in the representation of wet and dry spells. This study proposes integrating a two-state first-order Markov chain directly into existing bias correction methods through state-dependent rain day thresholds and rainfall adjustments, aimed at improving the temporal structure of rainfall. Two implementations of this framework are presented: Markov chain local intensity scaling (MC LOCI) and Markov chain quantile mapping (MC QM). The proposed methods were applied to AgERA5 reanalysis data with rainfall data from five stations in Zimbabwe. Results showed that the Markov chain methods outperformed LOCI and QM by improving the representation of rainfall persistence, onset, and wet and dry spell characteristics, while maintaining improvements in rain day frequency and overall rainfall statistics. These results demonstrate that the proposed methods could be beneficial for applications such as crop simulation, hydrological modelling and other applications which rely on accurate representation of rainfall sequencing.
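The two-state first-order Markov chain mentioned in this abstract can be illustrated with a minimal sketch that estimates wet/dry transition probabilities from a daily rainfall series. This is not the authors' MC LOCI or MC QM implementation; the function name, the 1.0 mm rain-day threshold, and the toy data are assumptions made for illustration only.

```python
import numpy as np

def wet_dry_transition_probs(rain_mm, threshold_mm=1.0):
    """Estimate first-order wet/dry transition probabilities.

    A day counts as wet when rainfall >= threshold_mm (hypothetical default).
    Returns (P(wet today | dry yesterday), P(wet today | wet yesterday)).
    """
    wet = np.asarray(rain_mm, dtype=float) >= threshold_mm
    prev, curr = wet[:-1], wet[1:]
    # Condition on yesterday's state, then average today's wet indicator.
    p_wet_given_dry = curr[~prev].mean() if (~prev).any() else float("nan")
    p_wet_given_wet = curr[prev].mean() if prev.any() else float("nan")
    return p_wet_given_dry, p_wet_given_wet

# Toy daily rainfall series in mm (made-up values).
rain = [0.0, 0.0, 5.2, 3.1, 0.0, 12.4, 8.0, 0.0, 0.0, 1.5]
p01, p11 = wet_dry_transition_probs(rain)
print(p01, p11)
```

A gap between the two probabilities indicates persistence in wet and dry spells, which is the temporal structure that frequency-and-intensity corrections alone do not target.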

2604.26843 2026-04-30 stat.ME

Nonparametric Testing and Variable Selection for ARCH-m(X) Model

Adriano Zanin Zambom, Qing Wang

Abstract

We introduce the ARCH-m(X) model, a semiparametric extension of the ARCH-X framework in which the effect of a multivariate exogenous covariate vector X on the conditional variance is modeled through an unknown nonparametric function m(·), accommodating complex nonlinear relationships between external predictors and financial volatility. Within this model, we develop a novel hypothesis test for the significance of covariates constructed with an artificial one-way ANOVA. Under some regularity conditions, the test statistic is shown to converge in distribution to the standard Normal. Another key contribution of this paper is the construction of a variable selection procedure based on the Benjamini-Yekutieli false discovery rate correction applied to covariate-level p-values. We show that the resulting index set coincides with the true set of relevant covariates with probability tending to one as n goes to infinity. Extensive simulations confirm that the proposed methods outperform existing competitors, and an empirical application to S&P 500 return volatility illustrates the practical utility of the proposed variable selection framework.
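For readers unfamiliar with the Benjamini-Yekutieli correction named above, the following is a minimal sketch of the standard step-up rule applied to a vector of p-values. It shows only the generic correction, not the paper's test statistic or selection procedure; the function name and the example p-values are made up for illustration.

```python
import numpy as np

def benjamini_yekutieli(p_values, alpha=0.05):
    """Return a boolean mask of hypotheses rejected by the BY step-up rule."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    # Harmonic-sum penalty that makes the rule valid under arbitrary dependence.
    c_m = np.sum(1.0 / np.arange(1, m + 1))
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / (m * c_m)
    passed = p[order] <= thresholds
    selected = np.zeros(m, dtype=bool)
    if passed.any():
        # Reject every hypothesis up to the largest rank that passes.
        k = np.max(np.nonzero(passed)[0])
        selected[order[: k + 1]] = True
    return selected

# Hypothetical covariate-level p-values.
p_vals = [0.0001, 0.30, 0.004, 0.75, 0.02]
selected = benjamini_yekutieli(p_vals, alpha=0.05)
print(selected)
```

In a variable-selection setting, the indices where the mask is true would form the selected covariate set.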

2604.26826 2026-04-30 econ.EM stat.ME

Bootstrap Inference in Nonlinear Panel Data Models with Interactive Fixed Effects

Haoyuan Xu, Wei Miao, Geert Dhaene, Jad Beyhum

Abstract

The maximum likelihood estimator in nonlinear panel data models with interactive fixed effects is biased. Several bias correction methods, such as analytical and jackknife approaches, have been proposed to enable valid inference. This paper shows that the parametric bootstrap also enables valid inference in such models. In particular, we show that the parametric bootstrap replicates the asymptotic distribution of the maximum likelihood estimator. Therefore, it yields asymptotically unbiased estimates and confidence sets with asymptotically correct coverage. We also propose a transformation-based bootstrap confidence interval that delivers improved finite-sample performance. Simulation results support the theoretical findings. Finally, we apply the proposed method to examine technological and product market spillover effects on firms' innovation behavior.

2604.26769 2026-04-30 stat.ME stat.CO

Minimum Covariance Determinant Estimator and Outlier Detection for Interval-valued Data

Catarina P. Loureiro, M. Rosário Oliveira, Paula Brito, Lina Oliveira

Abstract

Interval-valued data are one of the most common symbolic data types, which enables the preservation of the underlying variability of the data. The interval mean and covariance matrix can be estimated using the barycenter approach based on the Mallows distance. However, as for conventional data, classical estimates can be significantly affected by anomalous data points, frequently present in real-life datasets. To address this problem, we develop a robust alternative which estimates location and scale by extending the Minimum Covariance Determinant estimator to interval-valued data. The algorithm yields a robust Interval-Mahalanobis distance, which can be used to detect anomalous observations based on adaptive cutoff values. Through extensive simulation studies across various contamination levels, we demonstrate that the interval-valued robust estimator consistently outperforms classical methods in covariance matrix estimation and achieves superior outlier detection accuracy. Finally, the applicability and effectiveness of the proposed method are illustrated through real-world datasets.

2604.26765 2026-04-30 stat.ME

CARhy: Comprehensive Analyses of Circadian Rhythms in Transcriptomic Experiments with Multiple Conditions

Weiyi Huang, Jerome S. Menet, Samiran Sinha

Abstract

Circadian rhythms are endogenous oscillations that regulate various physiological processes and their disruption has been linked to many diseases, making it important to determine how gene-expression rhythms are altered across genotypes, treatments, or environmental exposures. Existing approaches for circadian transcriptomic analysis are often limited to pairwise comparisons or to a single aspect of rhythmic behavior, making them inadequate for comprehensive inference in multi-condition experimental designs. We propose CARhy (Comprehensive Analysis of Rhythmicity), a unified statistical framework for transcriptomic data collected under more than two conditions. Based on first-harmonic Fourier regression, CARhy provides formal tests for the presence of rhythmicity and for differences across conditions in rhythmicity, amplitude, phase, and baseline level. By allowing condition-specific variances and accommodating unbalanced designs, the framework remains reliable under heteroscedastic noise and realistic sampling constraints. Simulations show that CARhy controls type I error and false discovery rates well while achieving higher power than existing approaches in challenging settings. In mouse liver transcriptomic data, CARhy offers an interpretable and practical tool for characterizing how circadian rhythms differ across multiple experimental conditions. CARhy is implemented as an R package and is publicly available at: https://github.com/DrHuang123/Comprehensive-Analyses-of-Circadian-Rhythms-CARhy.git.
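First-harmonic Fourier regression, the building block this abstract refers to, can be sketched as an ordinary least-squares fit of one cosine and one sine term. The code below is an illustrative single-gene, single-condition fit and is not the CARhy R package; the function name, the 24-hour period, and the synthetic data are assumptions.

```python
import numpy as np

def fit_first_harmonic(times_h, expr, period_h=24.0):
    """Fit expr ~ mesor + a*cos(w t) + b*sin(w t) by least squares.

    Returns (mesor, amplitude, peak_time_h), where peak_time_h is the
    time of maximum within one period.
    """
    t = np.asarray(times_h, dtype=float)
    y = np.asarray(expr, dtype=float)
    w = 2.0 * np.pi / period_h
    X = np.column_stack([np.ones_like(t), np.cos(w * t), np.sin(w * t)])
    (mesor, a, b), *_ = np.linalg.lstsq(X, y, rcond=None)
    amplitude = np.hypot(a, b)
    # a*cos(wt) + b*sin(wt) = A*cos(wt - phi) with phi = atan2(b, a).
    peak_time_h = (np.arctan2(b, a) / w) % period_h
    return mesor, amplitude, peak_time_h

# Synthetic noiseless profile: baseline 10, amplitude 3, peak at hour 6.
t = np.arange(0, 48, 4)
y = 10 + 3 * np.cos(2 * np.pi * (t - 6) / 24)
mesor, amplitude, peak = fit_first_harmonic(t, y)
print(mesor, amplitude, peak)
```

Comparing the fitted baseline, amplitude, and phase across conditions is the kind of question the abstract describes; the formal tests themselves are not reproduced here.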

2604.26744 2026-04-30 cs.IT math.IT stat.ML

A Sufficient-Statistic Reduction of the Information Bottleneck to a Low-Dimensional Problem

Joss Armstrong

Abstract

We show that if the conditional distribution p(C | T) factors through a sufficient statistic ϕ(T), then the Information Bottleneck (IB) problem for (T, C) is exactly equivalent to the IB problem for (ϕ(T), C). The reduction is loss-free: it preserves the full IB curve, the Lagrangian optimum at every trade-off parameter β, and the optimal representations up to pullback through ϕ. As a result, the computational complexity of solving the IB problem is governed by the dimension of the sufficient statistic rather than the ambient dimension of the source. This identifies an exact structural condition under which the generic IB problem becomes tractable, and gives a formal bridge between the discrete and linear-Gaussian regimes. We then show that the classical Gaussian IB solution of Chechik, Globerson, Tishby and Weiss is an immediate corollary of this reduction, and we state a nonlinear-Gaussian generalisation. A small numerical example illustrates the practical consequence: when a low-dimensional sufficient statistic is available, the exact IB curve can be computed on the reduced problem at a cost determined by the statistic rather than by the ambient source dimension.

2604.26729 2026-04-30 stat.ME

Flexible semiparametric modeling with application to Causal Inference

Kun Ren, Wen Su, Li Liu, Ian W. McKeague, Xingqiu Zhao

Abstract

This paper proposes a flexible new framework for constructing Neyman-orthogonal scores in semiparametric models involving infinite-dimensional nuisance parameters. While locally estimation is vital for integrating machine learning into econometrics, deriving orthogonal scores for complex models remains a major challenge. We provide explicit construction strategies for broad classes of settings. The proposed framework ensures asymptotic normality of target parameter estimators in a way that does not depend on the method used to construct the nuisance parameter estimators, provided they are $o_p(n^{-1/4})$-consistent. We apply the proposed methodology to causal inference with a binary instrumental variable, developing a novel, robust estimator for treatment effects. Numerical studies demonstrate that our approach significantly outperforms naive alternatives in finite samples. An empirical application to the Oregon Health Insurance Experiment illustrates the framework's utility in providing robust causal evidence.

2604.26706 2026-04-30 math.ST stat.TH

A Leakage Bound for Confidence Sets after Black-Box Selection

Sayantan Banerjee

Abstract

In many analyses the object reported at the end is not fixed in advance, but is chosen after a preliminary search over variables, subgroups, transformations, models or contrasts. Classical selective-inference methods are most effective when this search can be written as an explicit selection event. This note treats the less structured case in which the selection rule is a black box and inference is required for the target indexed by the selected object. We show that, for any fixed-target confidence procedure, selected-target noncoverage is bounded by the nominal fixed-target noncoverage plus the average total variation distance between the marginal law of the inferential data and its conditional law given the selected object. A mutual-information bound follows immediately. The result recovers sample splitting as the zero-leakage case and gives explicit guarantees for noisy screening through a Gaussian information bound. Thus the inferential cost of black-box selection is quantified by the information that the selected object carries about the inferential sample.
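The bound described in this abstract can be written out as follows. The notation is an assumption introduced here, not taken from the note: $D$ is the inferential data, $S$ the selected object, $C_s(D)$ a fixed-target confidence set for $\theta_s$ with nominal noncoverage $\alpha$, and $I(S; D)$ the mutual information. The second inequality is one standard route via Pinsker's and Jensen's inequalities; the constants in the note's own statement may differ.

```latex
\[
\Pr\bigl\{\theta_S \notin C_S(D)\bigr\}
  \;\le\; \alpha + \mathbb{E}_S\bigl[\mathrm{TV}\bigl(P_D,\, P_{D \mid S}\bigr)\bigr]
  \;\le\; \alpha + \sqrt{\tfrac{1}{2}\, I(S; D)} .
\]
```

Under sample splitting, $S$ is computed from data independent of $D$, so $P_{D \mid S} = P_D$, the total variation term vanishes, and the bound reduces to the nominal level $\alpha$.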

2604.26673 2026-04-30 stat.ML cs.LG

Laplace Approximation for Bayesian Tensor Network Kernel Machines

Albert Saiapin, Kim Batselier

Comments 19 pages, 3 figures, 6 tables. Code available at: https://github.com/AlbMLpy/laplace-tnkm

Abstract

Uncertainty estimation is essential for robust decision-making in the presence of ambiguous or out-of-distribution inputs. Gaussian Processes (GPs) are classical kernel-based models that offer principled uncertainty quantification and perform well on small- to medium-scale datasets. Alternatively, formulating the weight space learning problem under tensor network assumptions yields scalable tensor network kernel machines. However, these assumptions break Gaussianity, complicating standard probabilistic inference. This raises a fundamental question: how can tensor network kernel machines provide principled uncertainty estimates? We propose a novel Bayesian Tensor Network Kernel Machine (LA-TNKM) that employs a (linearized) Laplace approximation for Bayesian inference. A comprehensive set of numerical experiments shows that the proposed method consistently matches or surpasses Gaussian Processes and Bayesian Neural Networks (BNNs) across diverse UCI regression benchmarks, highlighting both its effectiveness and practical relevance.

2604.26668 2026-04-30 stat.ME

Nonlinear Probabilistic Forecast Reconciliation

Anubhab Biswas, Lorenzo Zambon, Lorenzo Nespoli, Giorgio Corani

Abstract

Forecast reconciliation adjusts independently generated forecasts so that they satisfy some known constraints. While probabilistic forecast reconciliation is well established for linear constraints, some practical forecasting problems involve nonlinear relationships among variables. In this paper, we address probabilistic forecast reconciliation with nonlinear constraints for the first time. We extend both reconciliation via projection and conditioning to the case of nonlinear constraints. The projection approach reconciles forecast samples by mapping them onto the nonlinear coherent manifold. The conditioning approach adopts a sampling algorithm inspired by the Unscented Kalman Filter (UKF). We evaluate both methods on synthetic and real datasets. Empirically, both reconciliation approaches generally improve forecast accuracy. The UKF-based approach achieves the best overall performance while being substantially faster than the projection one.

2604.24904 2026-04-30 econ.EM math.ST stat.TH

Inference for Linear Systems with Unknown Coefficients

Yuehao Bai, Kirill Ponomarev, Andres Santos, Azeem M. Shaikh, Max Tabord-Meehan, Alexander Torgovitsky

Abstract

This paper considers the problem of testing whether there exists a solution satisfying certain non-negativity constraints to a linear system of equations. Importantly and in contrast to some prior work, we allow all parameters in the system of equations, including the slope coefficients, to be unknown. For this reason, we describe the linear system as having unknown (as opposed to known) coefficients. This hypothesis testing problem arises naturally when constructing confidence sets for possibly partially identified parameters in the analysis of nonparametric instrumental variables models, treatment effect models, and random coefficient models, among other settings. To rule out certain instances in which the testing problem is impossible, in the sense that the power of any test will be bounded by its size, we begin our analysis by characterizing the closure of the null hypothesis with respect to the total variation distance. We then use this characterization to develop novel testing procedures based on sample-splitting. We establish the validity of our testing procedures under weak and interpretable conditions on the linear system. An important feature of these conditions is that they permit the dimensionality of the problem to grow rapidly with the sample size. A further attractive property of our tests is that they do not require simulation to compute suitable critical values. We illustrate the practical relevance of our theoretical results in a simulation study.

2604.23865 2026-04-30 cs.LG cs.AI stat.ML

Inverting Foundation Models of Brain Function with Simulation-Based Inference

Niels Bracher, Xavier Intes, Stefan T. Radev

Abstract

Foundation models of brain activity promise a new frontier for in silico neuroscience by emulating neural responses to complex stimuli across tasks and modalities. A natural next step is to ask whether these models can also be used in reverse. Can we recover a stimulus or its properties from synthetic brain activity? We study this question in a proof-of-concept setting using TRIBEv2. We pair the brain emulator with large language models (LLMs) that generate news headlines from linguistic parameters such as valence, arousal, and dominance. We then use simulation-based inference to learn a probabilistic mapping from brain maps to latent stimulus parameters. Our results show that these parameters can be recovered from predicted brain maps, validating the quality of neural encodings. They also show that LLMs can serve as controllable stimulus generators for simulated experiments. Together, these findings provide a step toward decoding and inverse design with foundation brain models.

2601.15036 2026-04-30 cs.LG stat.ML

Factorizable joint shift revisited

Dirk Tasche

Comments 34 pages

Abstract

Factorizable joint shift (FJS) represents a type of distribution shift (or dataset shift) that comprises both covariate and label shift. Recently, it has been observed that FJS actually arises from consecutive label and covariate (or vice versa) shifts. Research into FJS so far has been confined mostly to the case of categorical labels. We propose a framework for analysing distribution shift in the case of a general label space, thus covering both classification and regression models. Based on the framework, we generalise existing results on FJS to general label spaces and present and analyse a related extension to label distribution estimation of the expectation maximisation (EM) algorithm for class prior probabilities. We also take a fresh look at generalized label shift (GLS) in the case of a general label space.

2510.06735 2026-04-30 cs.LG stat.ME

Incorporating Expert Knowledge into Bayesian Causal Discovery of Mixtures of Directed Acyclic Graphs

Zachris Björkman, Jorge Loría, Sophie Wharrie, Samuel Kaski

Comments 32 pages, 19 figures

Abstract

Bayesian causal discovery benefits from prior information elicited from domain experts, and in heterogeneous domains any prior knowledge would be badly needed. However, so far prior elicitation approaches have assumed a single causal graph and hence are not suited to heterogeneous domains. We propose a causal elicitation strategy for heterogeneous settings, based on Bayesian experimental design (BED) principles, and a variational mixture structure learning (VaMSL) method -- extending the earlier differentiable Bayesian structure learning (DiBS) method -- to iteratively infer mixtures of causal Bayesian networks (CBNs). We construct an informative graph prior incorporating elicited expert feedback in the inference of mixtures of CBNs. Our proposed method successfully produces a set of alternative causal models (mixture components or clusters), and achieves an improved structure learning performance on heterogeneous synthetic data when informed by a simulated expert. Finally, we demonstrate that our approach is capable of capturing complex distributions in a breast cancer database.

2505.14808 2026-04-30 stat.ML cs.LG math.ST stat.TH

Out-of-Distribution Generalization of In-Context Learning: A Low-Dimensional Subspace Perspective

Soo Min Kwon, Alec S. Xu, Can Yaras, Laura Balzano, Qing Qu

Comments AISTATS 2026

Abstract

The transformer's remarkable ability to perform in-context learning (ICL) has sparked a wide range of studies designed to understand its strengths and limitations. However, a theoretical understanding of when ICL can and cannot generalize beyond its pre-training data still remains unclear. This paper puts forth a minimal mathematical model that provably identifies when ICL can generalize out-of-distribution (OOD). By studying linear regression tasks parameterized with low-rank covariance matrices, we model distribution shifts as varying angles between subspaces and derive conditions under which a single-layer linear attention model interpolates across all angles. We show that if pre-training task vectors are drawn from a union of subspaces, transformers can generalize to all angle shifts--enabling ICL even in regions with zero probability mass in the training distribution. On the other hand, if the pre-training tasks are drawn from a single Gaussian, the test risk shows a non-negligible dependence on the angle, implying that ICL cannot generalize OOD. We empirically show that our results also hold for models such as GPT-2, and present experiments on how our results extend to nonlinear function classes.

2503.13569 2026-04-30 physics.soc-ph math.OC stat.AP

Ranking matters: Does the new format select the best teams for the knockout phase in the UEFA Champions League?

László Csató, Karel Devriesere, Dries Goossens, András Gyimesi, Roel Lambers, Frits Spieksma

Comments 10 pages, 3 tables

Journal ref
International Journal of Sports Science & Coaching, 21(2): 1123-1131, 2026
Abstract

Starting in the 2024/25 season, the Union of European Football Associations (UEFA) has fundamentally changed the format of its club competitions: the group stage has been replaced by a league phase played by 36 teams in an incomplete round robin format. This makes ranking the teams based on their results challenging because teams play against different sets of opponents, whose strengths vary. In this research note, we apply several well-known ranking methods for incomplete round robin tournaments to the 2024/25 UEFA Champions League league phase in order to check the robustness of the official ranking, as well as to call the attention of organizers to the non-trivial issue of ranking in these competitions. Our results show that it is doubtful whether the currently used point-based system provides the best ranking of the teams.

2412.12007 2026-04-30 math.ST stat.TH

The entropic optimal (self-)transport problem: Limit distributions for decreasing regularization with application to score function estimation

Gilles Mordant

Abstract

We study the statistical properties of the entropic optimal (self-)transport problem for smooth probability measures. We provide an accurate description of the limit distribution for entropic (self-)potentials and plans as the regularization parameter shrinks with the sample size; this regime is largely unexplored in the prior statistical literature, where $\varepsilon$ is typically held fixed. Additionally, we show that a rescaling of the barycentric projection of the empirical entropic optimal self-transport plans converges to the score function, a central object for diffusion models, and characterize the asymptotic fluctuations both pointwise and in $L^2$. Finally, we describe under what conditions the methods used make it possible to derive (pointwise) limiting distribution results for the empirical entropic optimal transport potentials in the case of two different measures and an appropriately chosen shrinking regularization parameter. This endeavour requires a better understanding of the composition of Sinkhorn operators in the small-$\varepsilon$ limit, a result of independent interest.

2207.06229 2026-04-30 stat.ML cs.LG math.FA math.PR math.ST stat.CO stat.TH

Distribution-Free Stochastic Analysis and Robust Multilevel Vector Field Anomaly Detection

Julio E Castrillon-Candas, Michael Rosenbaum, Mark Kon

Abstract

Massive vector field datasets are common in multi-spectral optical and radar sensors, among many other emerging areas of application. We develop a novel stochastic functional (data) analysis approach for detecting anomalies based on the covariance structure of nominal stochastic behavior across a domain. An optimal vector field Karhunen-Loeve expansion is applied to such random field data. A series of multilevel orthogonal functional subspaces is constructed from the geometry of the domain, adapted from the KL expansion. Detection is achieved by examining the projection of the random field on the multilevel basis. A critical feature of this approach is that reliable hypothesis tests are formed, which do not require prior assumptions on probability distributions of the data. The method is applied to the important problem of degradation in the Amazon forest. Due to the complexity and high dimensionality of satellite imagery, it is not feasible to assume known distributions, nor to estimate them. In addition to providing reliable hypothesis tests, our approach shows the advantage of using multiple bands of data in a vectorized complex, leading to better anomaly detection. Furthermore, using simulated data, our approach is capable of detecting subtle anomalies that are impossible to detect with PCA-based methods.

2604.26558 2026-04-30 stat.ML cs.LG stat.ME

Deep-testing: the case of dependence detection

Gery Geenens, Pierre Lafaye de Micheaux, Ivan Muyun Zou

Abstract

Deep learning methods have proved highly effective for classification and image recognition problems. In this paper, we ask whether this success can be transferred to hypothesis testing: if a neural network can distinguish, for example, an image of a handwritten digit from another, can it also distinguish an "image of a sample" (such as a scatter plot) generated under a given statistical model from one generated outside that model? Motivated by this idea, we propose a novel procedure called deep-testing, which approaches the classical inferential problem of hypothesis testing through deep learning. More specifically, the test statistic is a classification map learned by a deep neural network from simulated data satisfying the null and alternative hypotheses, leveraging its strong discriminating power to construct a highly powerful test. As a proof of concept, we apply deep-testing to the problem of independence testing, arguably one of the most important problems in statistics. In a large-scale simulation study, deep-testing achieves the highest overall power against nineteen competing methods across a broad range of complex dependence structures, confirming the viability of the proposed approach.

2604.26479 2026-04-30 stat.ME cs.LG

Recipes for Calibration Checks in Safety-Critical Applications

Romeo Valentin

Comments 36 pages, 22 figures. Manuscript prepared with Typst

Abstract

Safety-critical prediction systems, such as autonomous vehicles, weather forecasters, and medical monitors, commonly rely on probabilistic forecasters. These forecasters make predictions about possible future outcomes, and their quality and robustness need to be validated and certified. Often, only accuracy -- the mean of the predictions -- is evaluated against true outcomes. However, for safety-critical scenarios and decision making under uncertainty, the full distributional properties of the forecasts should be checked: do the observed prediction errors actually follow the forecasted probability distributions? To this end, we introduce a framework for calibration checks: statistical tests that validate distributional properties of forecasts when measured over many samples. In order to support ease-of-use in real-world operations, these checks produce a single accept/reject decision for data collected from a forecaster. This contrasts with typical calibration calculations, which produce one or multiple continuous calibration scores and require expertise to implement in a validation workflow. We further support operationalization by introducing modifications to calibration testing that (a) reject only overconfident predictions, allowing for pessimistic or cautious predictions in safety-critical settings, and (b) tolerate small, operationally acceptable deviations even for large numbers of validation samples. We organize the calibration checking process into a modular pipeline comprising four steps: (i) the data model, (ii) the chosen metric, (iii) the hypothesis formulation, and (iv) the testing procedure. Each step consists of independently swappable components, thereby supporting a large variety of possible use-cases and trade-offs. We demonstrate the applicability of the framework on two complementary example problems, weather forecasting and robot pose estimation.

2604.26471 2026-04-30 stat.ME stat.AP

A simple strategy for valid inference in target trial emulations

Mats Julius Stensrud

Comments This is a short, non-technical manuscript that outlines how valid inference can be ensured in target trials, using existing ideas for sample splitting

Details
Abstract

Target trial emulation has improved comparative effectiveness research by making the causal question, assumptions, and analysis plan explicit. However, target trial protocols are usually developed iteratively. After examining the data, investigators revise the protocol to reflect which target trials the observational data can realistically support. While this iterative procedure is part of normal scientific practice, it raises concerns about selective choices and invalid statistical inference. A simple procedure can address these concerns. This procedure is based on sample splitting. In the initial split, investigators explore the data to define a target trial protocol. When these choices are made, the target trial protocol is implemented on the second split. Although the investigators made data-informed choices to select the target trial protocol, the inference has the usual coverage guarantees. The procedure is created to mirror how trialists move from pilot studies to a phase 3 trial. First, they use data from pilots and early-phase trials to learn and decide on a final protocol. Then they implement this protocol and analyze a new set of data in a phase 3 trial.
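
The sample-splitting idea can be sketched in a few lines. The helper below is a hypothetical illustration, not code from the manuscript: the function name, the 50/50 default split, and the fixed seed are assumptions. The exploration split is used freely to settle on a protocol, and the confirmation split is touched only once, after the protocol is frozen.

```python
import random


def split_sample(records, explore_fraction=0.5, seed=0):
    """Randomly partition records into an exploration split and a confirmation split."""
    rng = random.Random(seed)      # local generator so the split is reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * explore_fraction)
    return shuffled[:cut], shuffled[cut:]


# Usage: explore on the first split, then run the frozen protocol on the second.
explore, confirm = split_sample(range(100))
```

Because the two splits are disjoint, data-informed choices made on the first split do not contaminate the inference carried out on the second.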

2604.26410 2026-04-30 stat.ME stat.AP

Longitudinal Outcomes Truncated by Death: Causal Estimands and Bayesian Estimators

Juliette Ortholand, Young Lee, Marie-Abele C Bind

Details
Abstract

Defining a causal estimand for a longitudinal outcome truncated by death is challenging, because the outcome may be undefined at the end of follow-up. Although a range of estimands and several estimators have been proposed, guidance on the underlying causal assumptions and on the contexts in which each estimand is most appropriate remains limited. We propose a framework to clarify the challenges of defining causal estimands in a longitudinal setting with censoring due to death. Within this framework, we review existing estimands and make explicit the assumptions required for their identification and estimation. We develop Bayesian estimators for each estimand and compare their behavior in a simulation study. Finally, we illustrate the proposed approach using data from a randomized controlled trial in amyotrophic lateral sclerosis. We show that the main difficulty arises from the lack of a natural notion of ordering and distance for outcomes truncated by death. This leads to an inherently multifactorial problem. In this context, the stratified average causal effect, combined with restricted mean survival time, provides a more complete characterisation of treatment effects.

2604.26396 2026-04-30 math.PR math.ST stat.TH

Limiting spectral distributions of large consistent rank correlation matrices

Zhaorui Dong, Fang Han, Jianfeng Yao

Details
Abstract

We study random matrices whose entries are obtained by applying consistent rank correlations, such as Hoeffding's $D$, pairwise to a high-dimensional random vector with mutually independent components. Prior work has shown that, in the proportional high-dimensional regime, the empirical spectral distributions of large Kendall's tau and Spearman's rho matrices converge weakly almost surely to the Marchenko--Pastur law. By contrast, we prove that for consistent rank correlations such as Hoeffding's $D$, the limiting spectral distribution is given by the semicircle law. Our result thus generalizes a recent work of Dong, Han, and Yao (2025), who considered the special case of Chatterjee's rank correlation and established the first semicircle law for a large correlation matrix in the proportional regime.

2604.26366 2026-04-30 stat.ML cs.LG

Probabilistic data quality assessment for structural monitoring data via outlier-resistant conditional diffusion model

Qi Li, Yong Huang, Hui Li

Comments 43 pages, 15 figures and 2 tables

Details
Journal ref
Expert Systems with Applications, 2026: 132181
Abstract

Data quality assessment is an essential step that ensures the reliability of the subsequent structural health monitoring (SHM) tasks. This study proposes a prediction deviation-based SHM data quality assessment method using a univariate implicit autoregressive model, enabling outlier diagnosis and data cleaning. The proposed conditional diffusion model (CDM) augments the standard diffusion model with a conditional embedding module to incorporate temporal context, quartile normalization to mitigate distribution skew, and a Huber loss to enhance robustness against outliers. Within this univariate implicit autoregressive framework, each data point is assigned an outlier probability, quantifying its degree of "outlier-ness", and a global quality evaluation score is computed to characterize the overall dataset quality. Extensive case studies utilizing operational data from real-world structures demonstrate that the proposed framework significantly improves the accuracy of data quality assessment, outperforming other strong baselines representative of clustering, isolation-based, and deep reconstruction methods. The effectiveness and robustness of the proposed framework are further demonstrated by the findings of ablation experiments and hyperparameter analysis.
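
For readers unfamiliar with the Huber loss mentioned above, a minimal sketch of its standard definition follows. The threshold `delta=1.0` is an assumed default for illustration; the abstract does not state which value the authors use.

```python
def huber_loss(residual, delta=1.0):
    """Standard Huber loss: quadratic for small residuals, linear for large ones."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)
```

The linear tail is what limits the influence of outliers: a residual of 3 costs 2.5 here, compared with 4.5 under the squared-error loss 0.5 * r^2.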

2604.26359 2026-04-30 stat.AP stat.ME

A spatio-temporal statistical framework for heatwave attribution under climate change

Kamal Gasser, Johan Segers, Francesco Ragone

Details
Abstract

We develop a unified statistical framework for attributing heatwaves as spatio-temporal phenomena under climate change. We quantify the impact of anthropogenic forcing on the probability and persistence of heatwaves not captured by standard marginal extreme-value approaches. Our methodology constructs a generative model for daily temperature fields that separates marginal nonstationarity from spatio-temporal dependence. We combine three components: a Bayesian spatial quantile regression model for the bulk of the data; a nonstationary spatial generalized extreme value model for tail behavior; and a copula-based model capturing both asymptotic dependence and independence in the extremes. The framework is applied to the CMIP6 MRI-ESM2 climate model, contrasting factual and counterfactual scenarios for probabilistic attribution. Our results show that the approach captures key heatwave characteristics inaccessible to traditional methods, enabling direct estimation of event-level attribution metrics. Overall, it provides a flexible basis for analyzing and attributing complex climate extremes as space-time objects.

2604.26268 2026-04-30 stat.AP stat.ME

The Difference Between "Replicable" and "Not replicable" is not Itself Scientifically Replicable

Berna Devezer, Erkan O. Buzbas

Details
Abstract

Replication studies estimate the replicability rate of scientific results by aggregating binary verdicts of experiments. Exact replications are rarely attainable, so most replication sequences are non-exact. Experiments differ in ways that matter and do not share a single data-generating process. We formalize two statistical interpretations of non-exactness. In a shared latent rate (benchmark) model, experiments are exchangeable and depend on a common random replicability rate. In a conditionally independent rates (operational) model, each experiment has its own replicability rate drawn from a population distribution. Under the benchmark model, even small variability among replicability rates induces an irreducible variance floor on the estimated mean replicability rate that no amount of replication can eliminate. Under the operational model, the degree of non-exactness is not identifiable from standard replication data, because one binary verdict per experiment carries no information about between-experiment heterogeneity. Researchers cannot tell which precision regime they are in or whether high- and low-replicability sequences can be distinguished in principle. The usual data structure cannot support reliable demarcation between "replicable" and "not replicable" results and systematically understates uncertainty, making high- and low-replicability sequences appear discriminable when they are not. We show how common sources of heterogeneity amplify these problems and demonstrate practical consequences in a reanalysis of Many Labs 4. Aggregating replicability rates across heterogeneous literatures produces averages that conflate incommensurable regimes and lack a stable interpretation. Replicability rate is not a reliable demarcation criterion. The replication crisis, if there is one, cannot be established by the methods used to declare it.

2604.26230 2026-04-30 cs.CL stat.ME

A New Semisupervised Technique for Polarity Analysis using Masked Language Models

Kohei Watanabe

Details
Abstract

I developed a new version of Latent Semantic Scaling (LSS) employing word2vec as a masked language model. Unlike original spatial models, it assigns polarity scores to words and documents as predicted probabilities of seed words to occur in given contexts. These probabilistic polarity scores are more accurate, interpretable and consistent than those spatial polarity models can produce in text analysis. I demonstrate these advantages by applying both probabilistic and spatial models to China Daily's coverage of China and other countries during the coronavirus disease (COVID) pandemic in terms of achievement in health issues. The result suggests that more advanced masked language models would further improve the semisupervised machine learning technique.

2604.26198 2026-04-30 stat.AP

Pricing Global Macroeconomic Risk in Equity Markets: Evidence from Selected G20 Economies

Vivek Mishra

Details
Abstract

This study investigates whether international equity markets systematically price global macroeconomic risks. The empirical analysis is conducted using monthly excess returns for ten G20 countries over the period 2000-2024. A Dynamic Factor Model (DFM) is employed to extract latent global factors from a set of macroeconomic variables capturing global inflation, real activity, monetary policy, term structure, exchange rates, volatility, and oil prices. The model selection criteria of the dynamic factor framework support a parsimonious 3-factor specification. The Fama-MacBeth regressions demonstrate the low explanatory power of the 3-factor model. In contrast, a 4-factor specification results in economically large and statistically significant factor loadings, an obvious rise in explanatory power, and a significant improvement in model performance. The results indicate that a four-factor specification provides the best balance between explanatory power and model stability, significantly improving the ability to explain cross-sectional variation in excess returns, with all factors statistically significant. The Capital Asset Pricing Model, while offering a parsimonious and stable benchmark with consistently significant market betas, exhibits limited explanatory power due to its single-factor structure. Overall, the findings suggest that macro-driven latent factors extracted through the DFM provide a more comprehensive and empirically robust framework for international asset pricing than the CAPM, highlighting the importance of incorporating multiple sources of systematic risk in explaining cross-country equity returns.

2604.26178 2026-04-30 math.ST stat.TH

Asymptotics of ultra-high-dimensional generalized spiked sample covariance matrix

Wonjun Seo

Comments 25 pages

Details
Abstract

This paper investigates the asymptotics of the eigenstructure of the sample covariance matrix under the spiked covariance matrix model in ultra-high-dimensional settings, where the dimensionality can grow much faster than the sample size with $p \asymp n^{\alpha}$, $\alpha > 1$. We establish the first-order convergence limits of eigenvalue locations and eigenvector projections of the properly scaled sample covariance matrix. Our results are extensions of \cite{bloemendal16,ding21}.

2604.26172 2026-04-30 eess.SY cs.AI cs.LG cs.SY math.OC stat.ML

Co-Learning Port-Hamiltonian Systems and Optimal Energy-Shaping Control

Ankur Kamboj, Biswadip Dey, Vaibhav Srivastava

Details
Abstract

We develop a physics-informed learning framework for energy-shaping control of port-Hamiltonian (pH) systems from trajectory data. The proposed approach co-learns a pH system model and an optimal energy-balancing passivity-based controller (EB-PBC) through alternating optimization with policy-aware data collection. At each iteration, the system model is refined using trajectory data collected under the current control policy, and the controller is re-optimized on the updated model. Both components are parameterized by neural networks that embed the pH dynamics and EB-PBC structure, ensuring interpretability in terms of energy interactions. The learned controller renders the closed-loop system inherently passive and provably stable, and exploits passive plant dynamics without canceling the natural potential. A dissipation regularization enforces strict energy decay during training, thereby enhancing robustness to sim-to-real gaps. The proposed framework is validated on state-regulation and swing-up tasks for planar and torsional pendulum systems.

2604.26169 2026-04-30 cs.LG econ.EM stat.ML

Budget-Constrained Causal Bandits: Bridging Uplift Modeling and Sequential Decision-Making

Abhirami Pillai

Comments 12 pages, 2 figures, preprint

Details
Abstract

Treatment allocation under budget constraints is a central challenge in digital advertising: advertisers must decide which users to show ads to while spending a limited budget wisely. The standard approach follows a two-stage offline pipeline: first collect historical data to estimate heterogeneous treatment effects (HTE), then solve a constrained optimization to allocate the budget. This works well with abundant data, but fails in cold-start settings such as new campaigns, new markets, or new customer segments where little historical data exists. We propose Budget-Constrained Causal Bandits (BCCB), an online framework that learns which users respond to ads while simultaneously spending the budget, making treatment decisions one user at a time. BCCB unifies three components into a single sequential process: learning individual-level ad effectiveness, exploring users whose response is uncertain, and pacing the budget over time. We evaluate BCCB on the Criteo Uplift dataset, a large-scale advertising dataset from a real randomized controlled trial. Our key finding is a data-efficiency crossover: offline methods require approximately 10,000 historical observations to produce reliable results, while BCCB operates effectively from the very first user. Furthermore, BCCB exhibits 3-5x lower performance variance between runs, making it more practical for real campaign planning. Among purely online methods, BCCB consistently outperforms standard Thompson Sampling, budgeted Thompson Sampling, and greedy HTE estimation across all budget levels tested.

2604.26160 2026-04-30 stat.ME cs.CE cs.LG cs.MS stat.CO

Fitting Large Nonlinear Mixed Effects Models Using Variational Expectation Maximization

Mohamed Tarek, Pedro Afonso

Details
Abstract

Nonlinear Mixed Effects (NLME) models are widely used in pharmacometrics and related fields to analyze hierarchical and longitudinal data. However, as the number of parameters and random effects increases, traditional methods for maximizing the marginal likelihood become computationally expensive. This paper explores the Variational Expectation Maximization (VEM) algorithm, a scalable alternative for fitting NLME models. Originally introduced in the context of probabilistic graphical models and later popularized through variational autoencoders, VEM has not been extensively applied to NLME modeling. By leveraging flexible variational families and reverse-mode automatic differentiation, VEM can efficiently maximize the marginal likelihood, scaling to NLME models with over 15,000 population parameters. This work provides a detailed description of VEM, compares it to other NLME fitting algorithms, and highlights its scalability through computational experiments. Using the Pumas statistical software, we fit two test models: 1) a standard warfarin model, and 2) a DeepNLME Friberg model with 15,410 population parameters and 16 random effects. The warfarin model was fitted to completion to demonstrate the correctness of VEM, while the DeepNLME Friberg model was fitted for a limited number of iterations to measure the time per iteration and demonstrate VEM's scalability.

2604.26128 2026-04-30 stat.ML cs.LG

Robust Representation Learning through Explicit Environment Modeling

Yuli Slavutsky, David M. Blei

Details
Abstract

We consider learning from labeled data collected across multiple environments, where the data distribution may vary across these environments. This problem is commonly approached from a causal perspective, seeking invariant representations that retain causal factors while discarding spurious ones. However, this framework assumes that the environment has no direct effect on the target. In contrast, we consider settings in which this assumption fails, but still aim to learn representations that support robust prediction on average across previously unseen environments. To this end, we study representations learned by explicitly modeling variation across environments and then marginalizing that variation out. We analyze the resulting representations and characterize when they are preferable to those learned by causal invariant-representation methods. We propose a concrete method based on generalized random-intercept models, a class of predictors in which such marginalization is possible, and study their generalization properties. Empirically, we show that these models outperform invariant-learning methods across a range of challenging settings.

2604.26069 2026-04-30 math.ST stat.TH

Estimating the tail index of Pareto-type distributions from geometric records

Martín Alcalde, Raúl Gouet, Miguel Lafuente, F. Javier López, Gerardo Sanz

Comments 35 pages, 3 figures, 2 tables

Details
Abstract

In this paper we develop a novel inferential approach based on geometric records for estimating the tail index of heavy-tailed distributions. We construct a maximum likelihood estimator for the Pareto model and establish its strong consistency and asymptotic normality, providing also an explicit expression for its asymptotic variance. These results are then extended to a broad class of Pareto-type distributions. The performance of the estimator is assessed via Monte Carlo simulation and compared with classical estimators from the literature. The proposed method is particularly well suited for settings where data arrive sequentially, as it yields smooth estimation trajectories. It is also especially advantageous in applications such as destructive testing, where measuring each observation exactly is costly. In this context, the estimator clearly outperforms Hill's estimator, achieving comparable or better accuracy while requiring a substantially smaller number of measured observations. An application to the analysis of the distribution of fluctuations of the Dow Jones Industrial Average (DJI) is also presented.

2604.26041 2026-04-30 stat.ME math.ST stat.TH

A semiparametric autoregressive spatial prediction model

Rodrigo García Arancibia, Pamela Llop, Mariel Lovatto

Details
Abstract

In this paper we propose a semiparametric spatial autoregressive model that combines a linear covariate component with a nonparametrically estimated spatial term, allowing flexible dependence modeling without a restrictive covariance structure while preserving interpretability. We establish asymptotic properties, including consistency and asymptotic normality, and evaluate performance through simulations and real data. Results show competitive predictive accuracy relative to geostatistical methods and improved interpretability compared to spatial econometric models.

2604.26029 2026-04-30 stat.ME stat.CO stat.ML

Safe, Scalable, and Accurate Bayes Posterior Sampling for Large-Data Generalized Linear Mixed Models

Youngsoo Baek, Samuel I. Berchuck

Comments 19 pages, 5 figures

Details
Abstract

We consider the problem of scalable sampling algorithms to fit Bayesian generalized linear mixed models on large datasets. Stochastic gradient Langevin dynamics, coupled with smooth re-parameterizations of variance parameters, produces divergent Markov chains and cannot be reliably used for sampling covariance parameters of random effects. We advocate the use of a mirror Langevin dynamics algorithm, propose the novel stochastic mirror Langevin dynamics based on data subsampling, and provide concrete guidelines for its use in a Bayesian inference framework. Based on an explicit Wasserstein distance error bound between the posterior and its algorithmic approximation, we propose a post-processing step that yields an asymptotic, order-wise correct estimation of the posterior variance, eliminating the irreducible posterior variance estimation bias due to subsampling. Empirical performance of the method is evaluated through simulated experiments and a longitudinal study of pain trajectories in breast cancer survivors.

2604.25984 2026-04-30 stat.ML cs.LG

Occam's Razor is Only as Sharp as Your ELBO

Ethan Harvey, Michael C. Hughes

Details
Abstract

The marginal likelihood, also known as the evidence, is regarded as a mathematical embodiment of Occam's razor, enabling model selection that avoids overfitting. The evidence lower bound (ELBO) objective from variational inference has also been used for similar purposes. Prior work has shown that restricting the approximate posterior family via a mean-field approximation can lead the ELBO to underfit. In this paper, we show how ELBO-based hyperparameter learning in a simple over-parameterized regression model can also produce overfitting, depending on the assumed rank of the covariance matrix in a Gaussian approximate posterior. Surprisingly, among only the underfit and overfit options, Bayesian model selection via the evidence itself sometimes prefers the overfit version, while the ELBO does not. Bayesian practitioners hoping to scale to large models should be cautious about how reduced-rank assumptions needed for tractability may impact the potential for model selection.

2604.25966 2026-04-30 stat.ME math.ST stat.TH

Principal Component Based Estimation of Finite Population Mean under Multicollinearity

Rajesh Singh, Shobh Nath Tiwari

Comments 10 pages, 9 tables, 5 figures

Details
Abstract

Auxiliary information is frequently utilized in survey sampling to improve the efficiency of estimators of the finite population mean. However, the simultaneous use of multiple auxiliary variables often induces multicollinearity, which adversely affects the stability and performance of conventional estimators. To address this issue, the present study proposes a principal component analysis (PCA) based estimation approach for the finite population mean in the presence of multicollinearity between two auxiliary variables. The proposed methodology transforms the correlated auxiliary variables into a set of orthogonal principal components, thereby removing the effect of multicollinearity while preserving the essential information contained in the auxiliary variables. An efficient estimator is then constructed using these components under simple random sampling without replacement. The bias and mean square error (MSE) of the proposed estimator are derived up to the first order of approximation. The performance of the proposed estimator is evaluated through both empirical and simulation studies under varying correlation structures. Moreover, the presence of multicollinearity is evaluated using variance inflation factors, condition indices, and eigenvalues. The results from empirical and simulation studies demonstrate that the proposed PCA-based estimator outperforms several conventional estimators in terms of MSE and percentage relative efficiency (PRE) when multicollinearity exists, ensuring robust and efficient estimation of the population mean.

2604.25940 2026-04-30 stat.AP

SCARFACE: a harmonized spatio-temporal dataset integrating socio-economic, environmental, and agricultural indicators for the Po Valley (Italy), 2011--2024

Paolo Maranzano, Pietro Colombo, Felicetta Carillo, Riccardo Borgoni, Riccardo Pajno, Matteo Borrotti, Luca Ferrero, Ezio Bolzacchini

Details
Abstract

We present "Sequestering CARbon through Forests, AgriCulture, and land usE (SCARFACE)", a harmonized spatio-temporal dataset that integrates climate, air quality, airborne pollutant emissions, land cover, soil properties, agro-industry dynamics and socio-economic indicators, to jointly investigate interconnected processes linking agricultural systems, atmospheric dynamics, emissions, and socioeconomic conditions in the Po Valley, Northern Italy. The spatial reference unit is the Agrarian Sub Region (ASR), that is, groups of contiguous municipalities that are considered homogeneous with respect to physical geography, agronomic characteristics, and prevailing agricultural production systems. The dataset adopts an annual panel structure from 2011 to 2024 defined over the 256 ASRs partitioning the Po Valley and comprises more than 2,700 indicators sourced from national and international public institutions. Heterogeneous data are harmonized within a processing workflow, tailored to the specific characteristics of each dataset, that guarantees spatial and temporal consistency of the output dataset. The resource supports reuse in applied econometrics, spatio-temporal modeling, clustering, and policy analysis focused on agriculture, air quality, and land use in a major European hotspot.

2604.25235 2026-04-30 cs.LG cs.CL cs.CV stat.ML

VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation

Divake Kumar, Sina Tayebati, Devashri Naik, Ranganath Krishnan, Amit Ranjan Trivedi

Details
Abstract

Vision-language models (VLMs) are increasingly used as automated judges for multimodal systems, yet their scores provide no indication of reliability. We study this problem through conformal prediction, a distribution-free framework that converts a judge's point score into a calibrated prediction interval using only score-token log-probabilities, with no retraining. We present the first systematic analysis of conformal prediction for VLM-as-a-Judge across 3 judges and 14 visual task categories. Our results show that evaluation uncertainty is strongly task-dependent: intervals cover ~40% of the score range for aesthetics and natural images but expand to ~70% for chart and mathematical reasoning, yielding a quantitative reliability map for multimodal evaluation. We further identify a failure mode not captured by standard evaluation metrics, ranking-scoring decoupling, where judges achieve high ranking correlation while producing wide, uninformative intervals, correctly ordering responses but failing to assign reliable absolute scores. Finally, we show that interval width is driven primarily by task difficulty and annotation quality, i.e., the same judge and method yield 4.5x narrower intervals on a clean, multi-annotator captioning benchmark. Code: https://github.com/divake/VLM-Judge-Uncertainty
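
To make the idea of turning a point score into a prediction interval concrete, here is a minimal split conformal sketch. It is not the authors' method: the paper builds its intervals from score-token log-probabilities, whereas this example substitutes the simpler absolute-residual nonconformity score. The function names and the 1 to 10 score range are assumptions for illustration.

```python
import math


def conformal_halfwidth(cal_scores, cal_labels, alpha=0.1):
    """Split conformal quantile of absolute residuals on a held-out calibration set."""
    residuals = sorted(abs(y - s) for s, y in zip(cal_scores, cal_labels))
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))   # finite-sample corrected rank
    if k > n:
        return math.inf                    # too few calibration points for this alpha
    return residuals[k - 1]


def predict_interval(score, halfwidth, lo=1.0, hi=10.0):
    """Interval around the judge's point score, clipped to an assumed score range."""
    return max(lo, score - halfwidth), min(hi, score + halfwidth)
```

Under exchangeability of calibration and test examples, an interval built this way covers the reference score with probability at least 1 - alpha. A wide interval on a given task signals that the judge's absolute scores are unreliable there, even if its rankings are good.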

2604.23514 2026-04-30 stat.ML cs.LG stat.ME

Probabilistic Graphical Model using Graph Neural Networks for Bayesian Inversion of Discrete Structural Component States

Teng Li, Stephen Wu, Yong Huang, James L. Beck, Hui Li

Comments Accepted by Reliability Engineering & System Safety on 23 February 2026

Details
Journal ref
Reliability Engineering & System Safety (2026): 112478
Abstract

The health condition of components in civil infrastructures can be described by various discrete states according to their performance degradation. Inferring these states from measurable responses is typically an ill-posed inverse problem. Although Bayesian methods are well-suited to tackle such problems, computing the posterior probability density function (PDF) presents challenges. The likelihood function cannot be analytically formulated due to the unclear relationship between discrete states and structural responses, and the high-dimensional state parameters resulting from numerous components severely complicate the computation of the marginal likelihood function. To address these challenges, this study proposes a novel Bayesian inversion paradigm for discrete variables based on Probabilistic Graphical Models (PGMs). Markov networks are employed as modeling tools, with model parameters learned from data and a structural topology prior. It is proved that inference on this PGM produces the same probabilistic estimation as the posterior PDF derived from Bayesian inference, which effectively solves the above challenges. The inference is accomplished by Graph Neural Networks (GNNs), and a graph property-based GNN training strategy is developed to enable accurate inference across varying graph scales, thereby significantly reducing the computational overhead in high-dimensional problems. Both synthetic and experimental data are used to validate the proposed framework.

2604.22140 2026-04-30 stat.ML cs.LG math.ST stat.AP stat.TH

Concave Statistical Utility Maximization Bandits via Influence-Function Gradients

Matías Carrasco, Alejandro Cholaquidis

Details
Abstract

We study stochastic multi-armed bandits in which the objective is a statistical functional of the long-run reward distribution, rather than expected reward alone. Under mild continuity assumptions, we show that the infinite-horizon problem reduces to optimizing over stationary mixed policies: each weight vector \(w\) on the simplex induces a mixture law \(P^w\), and performance is measured by the concave utility \(U(w)=\mathfrak U(P^w)\). For differentiable statistical utilities, we use influence-function calculus to derive stochastic gradient estimators from bandit feedback. This leads to an entropic mirror-ascent algorithm on a truncated simplex, implemented through multiplicative-weights updates and plug-in estimates of the influence function. We establish regret bounds that separate the mirror-ascent optimization error from the bias caused by estimating the influence function. The framework is developed for general concave distributional utilities and illustrated through variance and Wasserstein objectives, with numerical experiments comparing exact and plug-in influence-function implementations.
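
A single multiplicative-weights (entropic mirror-ascent) step on a truncated simplex might look like the sketch below. It is an illustration under stated assumptions rather than the paper's algorithm: the gradient estimate is taken as given instead of being built from influence functions, and the truncation is implemented by mixing with the uniform distribution, which is one simple way to keep every weight above a floor. The names `eta` and `floor` are hypothetical.

```python
import math


def mirror_ascent_step(w, grad, eta=0.1, floor=0.01):
    """One entropic mirror-ascent step, then a mix with uniform to enforce w_i >= floor.

    Requires len(w) * floor < 1 so the mixing coefficient stays positive.
    """
    # Multiplicative-weights update: exponentiate the gradient and renormalize.
    scaled = [wi * math.exp(eta * gi) for wi, gi in zip(w, grad)]
    total = sum(scaled)
    w_new = [s / total for s in scaled]
    # Assumed truncation: shrink toward uniform so each weight is at least `floor`.
    k = len(w_new)
    return [(1 - k * floor) * wi + floor for wi in w_new]
```

Arms whose estimated gradient component is larger gain weight, while the floor keeps every arm sampled often enough for its reward distribution to remain estimable.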

2604.18701 2026-04-30 cs.LG cs.AI stat.ML

Curiosity-Critic: Cumulative Prediction Error Improvement as a Tractable Intrinsic Reward for World Model Training

Vin Bhaskara, Haicheng Wang

Comments 18 pages, 7 figures, 1 table, 1 algorithm. Code: https://github.com/vinbhaskara/Curiosity-Critic

Details
Abstract

Local prediction-error-based curiosity rewards focus on the current transition without considering the world model's cumulative prediction error across all visited transitions. We introduce Curiosity-Critic, which grounds its intrinsic reward in the improvement of this cumulative objective, and show that it admits a tractable per-step surrogate: the difference between the current prediction error and the asymptotic error baseline of the current state transition. We estimate this error baseline online with a learned critic co-trained alongside the world model; regressing a single scalar, the critic converges well before the world model saturates, redirecting exploration toward learnable transitions without oracle knowledge of the noise floor. The reward is higher for learnable transitions and collapses toward the error baseline for stochastic ones, effectively separating epistemic (reducible) from aleatoric (irreducible) prediction error online. Prior prediction-error curiosity formulations, from Schmidhuber (1991) to learned-feature-space variants, emerge as special cases corresponding to specific approximations of this error baseline. Experiments on a stochastic grid world show that Curiosity-Critic outperforms prediction-error, visitation-count, and Random Network Distillation methods in training speed and final world model accuracy.

2604.17698 2026-04-30 cs.LG cs.CL stat.ML

The Geometric Canary: Predicting Steerability and Detecting Drift via Representational Stability

Prashant C. Raju

Abstract

Reliable deployment of language models requires two capabilities that appear distinct but share a common geometric foundation: predicting whether a model will accept targeted behavioral control, and detecting when its internal structure degrades. We show that geometric stability, the consistency of a representation's pairwise distance structure, addresses both. Supervised Shesha variants that measure task-aligned geometric stability predict linear steerability with near-perfect accuracy ($ρ= 0.89$-$0.97$) across 35-69 embedding models and three NLP tasks, capturing unique variance beyond class separability (partial $ρ= 0.62$-$0.76$). A critical dissociation emerges: unsupervised stability fails entirely for steering on real-world tasks ($ρ\approx 0.10$), revealing that task alignment is essential for controllability prediction. However, unsupervised stability excels at drift detection, measuring nearly $2\times$ greater geometric change than CKA during post-training alignment (up to $5.23\times$ in Llama) while providing earlier warning in 73\% of models and maintaining a $6\times$ lower false alarm rate than Procrustes. Together, supervised and unsupervised stability form complementary diagnostics for the LLM deployment lifecycle: one for pre-deployment controllability assessment, the other for post-deployment monitoring.

2604.06146 2026-04-30 stat.ME math.ST stat.TH

Sufficient conditions for proper posteriors in fully-Bayesian Functional PCA

Joseph Sartini, Scott Zeger, Ciprian Crainiceanu

Comments 5 pages, no figures

Abstract

In a fully-Bayesian Functional Principal Components Analysis (FPCA) the principal components are treated as unknown infinite-dimensional parameters. By projecting the functional principal components on a rich orthonormal spline basis, we show that orthonormality of the principal components is equivalent to orthonormality of the spline coefficients. A penalty on the integral of the second derivative of the functional principal components can be induced on the spline coefficients, where each function has its own smoothing parameter. Finally, each smoothing parameter is treated as an inverse variance component in the associated mixed effects model. In this work, we demonstrate that no additional conditions are required to ensure that the corresponding smoothing prior, and thus the posterior distribution, is proper. This allows the choice of less informative priors, such that smoothing is driven by the data.

2603.10992 2026-04-30 stat.ML cs.LG physics.chem-ph physics.comp-ph

A Tutorial Review of Bayesian Optimization with Gaussian Processes to Accelerate Stationary Point Searches

Rohit Goswami

Comments 66 pages, 24 figures (main). Accepted article for ACS Physical Chemistry Au

Abstract

Building local surrogates to accelerate stationary point searches on potential energy surfaces spans decades of effort. Done correctly, surrogates can reduce the number of expensive electronic structure evaluations by roughly an order of magnitude while preserving the accuracy of the underlying theory, with the gain depending on oracle cost, search distance, and the availability of analytical forces. We present a unified Bayesian optimization view of minimization, single-point saddle searches, and double-ended path searches: all three share one six-step surrogate loop and differ only in the inner optimization target and the acquisition criterion. The framework uses Gaussian process regression with derivative observations, inverse-distance kernels, and active learning, and we develop optional extensions for production use, including farthest-point sampling with the Earth Mover's Distance, MAP regularization, an adaptive trust radius, and random Fourier features for scaling. Accompanying pedagogical Rust code demonstrates that all three applications use the same Bayesian optimization loop, bridging the gap between theoretical formulation and practical execution.

2603.07955 2026-04-30 math.GT cs.LG stat.ML

RL unknotter, hard unknots and unknotting number

Anne Dranowski, Yura Kabkov, Daniel Tubbenhauer

Comments 19 pages, many figures, comments welcome

Abstract

We develop a reinforcement learning pipeline for simplifying knot diagrams. A trained agent learns move proposals and a value heuristic for navigating Reidemeister moves. The pipeline applies to arbitrary knots and links; we test it on ``very hard'' unknot diagrams and, using diagram inflation, on $4_1\#9_{10}$ where we recover the recently established and surprising upper bound of three for the unknotting number. In addition, we explain a self-improving workbook-driven extension of the pipeline that systematically improves unknotting number upper bounds on the list of prime knots.

2603.06752 2026-04-30 cs.LG cs.NA math.NA stat.ME stat.ML

Latent Autoencoder Ensemble Kalman Filter for Nonlinear Data assimilation

Xin T. Tong, Yanyan Wang, Liang Yan

Abstract

The ensemble Kalman filter (EnKF) is widely used for data assimilation in high-dimensional systems, but its performance often deteriorates for strongly nonlinear dynamics due to the structural mismatch between the Kalman update and the underlying system behavior. In this work, we propose a latent autoencoder ensemble Kalman filter (LAE-EnKF) that addresses this limitation by reformulating the assimilation problem in a learned latent space with linear and stable dynamics. The proposed method learns a nonlinear encoder--decoder together with a stable linear latent evolution operator and a consistent latent observation mapping, yielding a closed linear state-space model in the latent coordinates. This construction restores compatibility with the Kalman filtering framework and allows both forecast and analysis steps to be carried out entirely in the latent space. Compared with existing autoencoder-based and latent assimilation approaches that rely on unconstrained nonlinear latent dynamics, the proposed formulation emphasizes structural consistency, stability, and interpretability. We provide a theoretical analysis of learning linear dynamics on low-dimensional manifolds and establish generalization error bounds for the proposed latent model. Numerical experiments on representative nonlinear and chaotic systems demonstrate that the LAE-EnKF yields more accurate and stable assimilation than the standard EnKF and related latent-space methods, while remaining data-driven and maintaining comparable computational cost.

2602.21876 2026-04-30 stat.AP

Comparative Evaluation of Machine Learning Models for Predicting Donor Kidney Discard

Peer Schliephacke, Hannah Schult, Leon Mizera, Judith Würfel, Gunter Grieser, Axel Rahmel, Carl-Ludwig Fischer-Fröhlich, Antje Jahn-Eimermacher

Abstract

A kidney transplant can improve the life expectancy and quality of life of patients with end-stage renal failure. Even more patients could be helped with a transplant if the rate of kidneys that are discarded and not transplanted could be reduced. Machine learning (ML) can support decision-making in this context by early identification of donor organs at high risk of discard, for instance to enable timely interventions to improve organ utilization such as rescue allocation. Although various ML models have been applied, their results are difficult to compare due to heterogeneous datasets and differences in feature engineering and evaluation strategies. This study aims to provide a systematic and reproducible comparison of ML models for donor kidney discard prediction. We trained five commonly used ML models (Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, and Deep Learning), along with an ensemble model, on data from 4,080 deceased donors (death determined by neurologic criteria) in Germany. A unified benchmarking framework was implemented, including standardized feature engineering and selection, and Bayesian hyperparameter optimization. Model performance was assessed for discrimination (MCC, AUC, F1), calibration (Brier score), and explainability (SHAP). The ensemble achieved the highest discrimination performance (MCC=0.76, AUC=0.87, F1=0.90), while individual models such as Logistic Regression, Random Forest, and Deep Learning performed comparably and better than Decision Trees. Platt scaling improved calibration for tree- and neural-network-based models. SHAP consistently identified donor age and renal markers as dominant predictors across models, reflecting clinical plausibility. This study demonstrates that consistent data preprocessing, feature selection, and evaluation can be more decisive for predictive success than the choice of the ML algorithm.
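Two of the metrics named in the abstract, the Matthews correlation coefficient (MCC) and the Brier score, can be computed from their standard definitions as sketched below. The labels and probabilities are made-up toy values, not data from the study, and the helper names are illustrative rather than taken from the authors' benchmarking framework.

```python
import math
from typing import Sequence


def matthews_corrcoef(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """MCC for binary labels in {0, 1}; returns 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0


def brier_score(y_true: Sequence[int], probs: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, probs)) / len(y_true)


# Toy example (hypothetical values): 1 = discarded, 0 = transplanted.
labels = [1, 1, 0, 0]
probabilities = [0.9, 0.4, 0.2, 0.1]
predictions = [1 if p >= 0.5 else 0 for p in probabilities]  # threshold at 0.5

mcc = matthews_corrcoef(labels, predictions)
brier = brier_score(labels, probabilities)
```

MCC is computed on thresholded predictions and so measures discrimination, while the Brier score uses the raw probabilities and is therefore sensitive to calibration.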

2602.12988 2026-04-30 math.PR math.ST stat.TH

Multidimensional Dickman distribution and operator selfdecomposability

Anastasiia S. Kovtun, Nikolai N. Leonenko, Andrey Pepelyshev

Comments 31 pages

Abstract

The one-dimensional Dickman distribution arises in various stochastic models across number theory, combinatorics, physics, and biology. Recently, a definition of the multidimensional Dickman distribution has appeared in the literature, together with its application to approximating the small jumps of multidimensional Lévy processes. In this paper, we extend this definition to a class of vector-valued random elements, which we characterise as fixed points of a specific affine transformation involving a random matrix obtained from the matrix exponential of a uniformly distributed random variable. We prove that these new distributions possess the key properties of infinite divisibility and operator selfdecomposability. Furthermore, we identify several cases where this new distribution arises as a limiting distribution.

2602.03169 2026-04-30 stat.ML cs.LG

NeuralFLoC: Neural Flow-Based Joint Registration and Clustering of Functional Data

Xinyang Xiong, Siyuan Jiang, Pengcheng Zeng

Abstract

Clustering functional data in the presence of phase variation is challenging, as temporal misalignment can obscure intrinsic shape differences and degrade clustering performance. Most existing approaches treat registration and clustering as separate tasks or rely on restrictive parametric assumptions. We present \textbf{NeuralFLoC}, a fully unsupervised, end-to-end deep learning framework for joint functional registration and clustering based on Neural ODE-driven diffeomorphic flows and spectral clustering. The proposed model learns smooth, invertible warping functions and cluster-specific templates simultaneously, effectively disentangling phase and amplitude variation. We establish universal approximation guarantees and asymptotic consistency for the proposed framework. Experiments on functional benchmarks show state-of-the-art performance in both registration and clustering, with robustness to missing data, irregular sampling, and noise, while maintaining scalability. Code is available at https://anonymous.4open.science/r/NeuralFLoC-FEC8.

2512.17005 2026-04-30 econ.EM stat.ME

Principled Identification of Structural Dynamic Models

Neville Francis, Peter Reinhard Hansen, Chen Tong

Abstract

We take a new perspective on identification in structural dynamic models: rather than imposing restrictions alone, we optimize an objective. While definitive structural identification ultimately requires exogenous economic insight, a weighted correlation-maximizing objective yields an Order- and Scale-Invariant Scheme (OASIS) that selects the orthogonal rotation most aligned with designated target variables. In traditional SVARs, these targets are the reduced-form innovations, making OASIS a natural reference rotation. We show that recursive Cholesky identification is a constrained version of the same objective and that OASIS is systematically closer to perfect correlation, closing roughly twice as much of the gap as recursive orderings, both theoretically and empirically. The same framework also provides a principled estimation strategy for Proxy VARs (IV-SVARs), where the weighted criterion is essential for resolving overdetermination in multi-proxy systems while symmetrically accommodating proxy leakage. Revisiting 22 published SVARs, we find that reduced-form innovations are typically only weakly correlated, helping explain the historical robustness of recursive schemes. Applying OASIS to seminal proxy applications, however, reveals economically important leakage across shocks and shows that accounting for such leakage can materially alter substantive conclusions.

2512.08824 2026-04-30 stat.AP

Commanding the Foul Shot: A New Ensemble of Free Throw Metrics

Jake McGrath, Amanda Glazer, Vanna Bushong, Michelle Nguyen, Kirk Goldsberry

Abstract

With the NBA's adoption of in-game limb tracking in 2023, Sony's Hawk-Eye system now captures high-resolution, 3D poses of players and the ball 60 times per second. Linking these data to key events opens a new era in NBA analytics. Here, we leverage a large dataset of 21,964 shot attempts from 72 NBA players to introduce a novel ensemble of metrics for evaluating free-throw shooting. Inspired by baseball analytics, we introduce command, which quantifies the quality of a free throw by measuring a shooter's accuracy and precision near the basket's bullseye. This metric recognizes that some makes (or misses) are better than others and captures a player's ability to execute quality attempts consistently. We demonstrate that command captures underlying skill more effectively than traditional make-or-miss statistics; early-season command predicts late-season success more reliably than traditional shooting percentage. To identify what drives command, we define launch-based metrics assessing consistency in release velocity, angle, and 3D position. Players with greater touch, i.e., more consistent launch dynamics, exhibit stronger command as they can reliably control their shot trajectory. Finally, we develop a physics model to identify the range of launch conditions that result in a make and to determine which launch conditions are most robust to small perturbations. This framework reveals "safe" launch regions and explains why certain players excel at free throws, providing actionable insights for player development.

2509.17625 2026-04-30 cs.LG cs.CY physics.soc-ph stat.ME

Comparing Data Assimilation and Likelihood-Based Inference on Latent State Estimation in Agent-Based Models

Blas Kolic, Corrado Monti, Gianmarco De Francisci Morales, Marco Pangallo

Abstract

In this paper, we present the first systematic comparison of Data Assimilation (DA) and Likelihood-Based Inference (LBI) in the context of an Agent-Based Model (ABM). These models generate observable time series driven by evolving, partially-latent microstates. Latent states must be estimated to align simulations with real-world data, a task traditionally addressed by DA, particularly in continuous and equation-based models used in weather forecasting. However, the nature of ABMs poses challenges for standard DA methods. Solving such issues requires adapting previous DA techniques or using ad hoc alternatives such as LBI. DA approximates the likelihood in a model-agnostic way, making it broadly applicable but potentially less precise. In contrast, LBI provides more accurate state estimation by directly leveraging the model's likelihood, but at the cost of requiring a hand-crafted, model-specific likelihood function, which may be complex or infeasible to derive. We compare the two methods on the Bounded-Confidence Model, a well-known opinion dynamics ABM, where agents are affected only by others holding sufficiently similar opinions. We find that LBI better recovers latent agent-level opinions, even under model mis-specification, leading to improved individual-level forecasts. At the aggregate level, however, both methods perform comparably, and DA remains competitive across levels of aggregation under certain parameter settings. Our findings suggest that DA is well-suited for aggregate predictions, while LBI is preferable for agent-level inference.

2509.10736 2026-04-30 stat.AP

Adaptive Bayesian computation for efficient biobank-scale genomic inference

Yiran Li, John Whittaker, Sylvia Richardson, Helene Ruffieux

Abstract

Motivation: Modern biobanks, with unprecedented sample sizes and phenotypic diversity, have become foundational resources for genomic studies, enabling powerful cross-phenotype and population-scale analyses. As studies grow in complexity, Bayesian hierarchical models offer a principled framework for jointly modeling multiple units such as cells, traits, and experimental conditions, increasing statistical power through information sharing. However, adoption of Bayesian hierarchical models in biobank-scale studies remains limited due to computational inefficiencies, particularly in posterior inference over high-dimensional parameter spaces. Deterministic approximations such as variational inference provide scalable alternatives to Markov Chain Monte Carlo, yet current implementations do not fully exploit the structure of genome-wide multi-unit modeling, especially when biological effects of interest are concentrated in a few units. Results: We propose an adaptive focus (AF) strategy within a block coordinate ascent variational inference (CAVI) framework that selectively updates subsets of parameters at each iteration, corresponding to units deemed relevant based on current estimates. We illustrate this approach in protein quantitative trait locus (pQTL) mapping using a joint model of hierarchically linked regressions with shared parameters across traits. In both simulated data and real proteomic data from the UK Biobank, AF-CAVI achieves up to a 50\% reduction in runtime while maintaining statistical performance. We also provide a genome-wide pipeline for multi-trait pQTL mapping across thousands of traits, demonstrating AF-CAVI as an efficient scheme for large-scale, multi-unit Bayesian analysis in biobanks.

2507.17544 2026-04-30 stat.ML cs.LG stat.ME

Optimal differentially private kernel learning with random projection

Bonwoo Lee, Cheolwoo Park, Jeongyoun Ahn

Comments 139 pages, 3 figures

Abstract

Differential privacy has become a cornerstone in the development of privacy-preserving learning algorithms. This work addresses optimizing differentially private kernel learning within the empirical risk minimization (ERM) framework. We propose a novel differentially private kernel ERM algorithm based on random projection in the reproducing kernel Hilbert space using Gaussian processes. Our method achieves minimax-optimal excess risk rates for both the squared loss and Lipschitz-smooth convex loss functions under a local strong convexity condition. We further show that existing approaches based on alternative dimension reduction techniques, such as random Fourier feature mappings or $\ell_2$ regularization, yield suboptimal excess risk bounds. Our key theoretical contribution also includes the derivation of dimension-free excess risk bounds for objective perturbation-based private linear ERM, marking the first such result that does not rely on noisy gradient-based mechanisms. Additionally, we obtain sharper excess risk bounds for existing differentially private kernel ERM algorithms. Empirical evaluations support our theoretical claims, demonstrating that random projection enables statistically efficient and optimally private kernel learning. These findings provide new insights into the design of differentially private algorithms and highlight the central role of dimension reduction in balancing privacy and utility.

2507.12549 2026-04-30 cs.LG cs.CC stat.ML

The Serial Scaling Hypothesis

Yuxi Liu, Konpat Preechakul, Kananart Kuwaranancharoen, Yutong Bai

Comments ICLR 2026. Equal contribution by the first two authors. Project page: https://serial-scaling-hypothesis.github.io

Abstract

While machine learning has advanced through massive parallelization, we identify a critical blind spot: some problems are fundamentally sequential. These "inherently serial" problems, from mathematical reasoning to physical simulations to sequential decision-making, require sequentially dependent computational steps that cannot be efficiently parallelized. We formalize this distinction in complexity theory, and demonstrate that current parallel-centric architectures face fundamental limitations on such tasks. Then, we show for the first time that diffusion models, despite their sequential nature, are incapable of solving inherently serial problems. We argue that recognizing the serial nature of computation holds profound implications for machine learning, model design, and hardware development.

2506.23040 2026-04-30 stat.OT cs.AI

Treatment, evidence, imitation, and chat

Samuel J. Weisenthal

Comments 12 pages

Abstract

Large language models are thought to have the potential to aid in medical decision making. This work investigates the degree to which this might be the case. We start with the treatment problem, the patient's core medical decision-making task, which is solved in collaboration with a clinician. We discuss different approaches to solving it, including, within evidence-based medicine, experimental and observational data. We then discuss the chat problem, and how this differs from the treatment problem -- in particular with respect to imitation (and how imitation alone cannot solve the true treatment problem, although this does not mean it is not useful). We then discuss how a large-language-model-based system might be trained to solve the treatment problem, highlighting that the major challenges relate to the ethics of experimentation and the assumptions associated with observation. We finally discuss how these challenges relate to evidence-based medicine and how this might inform the efforts of the medical research community to solve the treatment problem. Throughout, we illustrate our arguments with statins, a class of cholesterol medications.

2505.18441 2026-04-30 cs.LG cs.MS stat.AP

DB-KSVD: Scalable Alternating Optimization for Disentangling High-Dimensional Embedding Spaces

Romeo Valentin, Sydney M. Katz, Vincent Vanhoucke, Mykel J. Kochenderfer

Comments 8 pages + 10 pages appendix. Updated with additional vision transformer experiments

Abstract

Dictionary learning has recently emerged as a promising approach for mechanistic interpretability of large transformer models. Disentangling high-dimensional transformer embeddings requires algorithms that scale to high-dimensional data with large sample sizes. Recent work has explored sparse autoencoders (SAEs) for this problem. However, SAEs use a simple linear encoder to solve the sparse encoding subproblem, which is known to be NP-hard. It is therefore interesting to understand whether this approach is sufficient to find good solutions to the dictionary learning problem or if a more sophisticated algorithm could find better solutions. In this work, we propose Double-Batch KSVD (DB-KSVD), a scalable dictionary learning algorithm that adapts the classic KSVD algorithm. DB-KSVD is informed by the rich theoretical foundations of KSVD but scales to datasets with millions of samples and thousands of dimensions. We demonstrate the efficacy of DB-KSVD by disentangling text embeddings of the Gemma-2-2B and Pythia-160M models and evaluating on six metrics from the SAEBench benchmark, where we achieve competitive results when compared to established approaches based on SAEs. We further show similar results when disentangling image embeddings obtained from the DINOv2-S and DINOv2-B models, solidifying our findings. By matching SAE performance with an entirely different optimization approach, our results suggest that (i) SAEs do find strong solutions to the dictionary learning problem and (ii) traditional optimization approaches can be scaled to the required problem sizes, offering a promising avenue for further research. We make an implementation of DB-KSVD available at https://github.com/romeov/ksvd.jl.

2505.13518 2026-04-30 stat.ML cs.AI cs.LG

Data Balancing Strategies: A Systematic Survey of Resampling and Augmentation Methods

Behnam Yousefimehr, Mehdi Ghatee, Javad Fazli, Shervin Ghaffari, Zahra Rafei, Mohammad Amin Seifi, Sajed Tavakoli, Abolfazl Nikahd, Mahdi Razi Gandomani, Alireza Orouji, Ramtin Mahmoudi Kashani, Sarina Heshmati, Negin Sadat Mousavi

Abstract

Imbalanced datasets, where one class significantly outnumbers others, remain a persistent challenge in machine learning, often biasing predictions toward the majority class and degrading classifier performance. This paper provides a comprehensive, systematic review of data balancing methods, extending beyond foundational oversampling techniques such as the Synthetic Minority Oversampling Technique (SMOTE) and its variants (e.g., Borderline SMOTE, K-Means SMOTE, and Safe-Level SMOTE) to encompass advanced adaptive methods (MWMOTE, AMDO), deep generative models (generative adversarial networks, variational autoencoders, and diffusion models), undersampling techniques (NearMiss, Tomek Links), combination/hybrid methods (SMOTE-ENN, SMOTE-Tomek, and SMOTE+OCSVM), ensemble strategies (SMOTEBoost, RUSBoost, Balanced Random Forest, and One-Sided Selection), and specialized approaches for multi-label and clustered data. Beyond descriptive categorization, this review critically examines each method's underlying assumptions, operational mechanisms, and suitability for diverse data characteristics, including high dimensionality, mixed feature types, class overlap, and noise. Key findings demonstrate that no single method universally outperforms others; optimal selection depends critically on dataset characteristics, classifier choice, and evaluation metrics. The paper concludes by identifying emerging research directions, including self-supervised learning for imbalance, diffusion-based generative oversampling, distribution-preserving resampling, knowledge distillation for imbalanced deployment, and the adaptation of foundation models to skewed distributions, offering practical guidelines for practitioners and a roadmap for future methodological development.

2503.05023 2026-04-30 stat.AP

A Behavioral Scorecard Model Using Survival Analysis

Cheng Lee, Hsi Lee

Abstract

Credit risk assessment is a crucial aspect of financial decision-making, enabling institutions to predict the likelihood of default and make informed lending decisions. Two prominent methodologies in credit risk modeling are logistic regression and survival analysis. Logistic regression is widely used in scorecard development due to its simplicity, interpretability, and effectiveness in estimating the probability of binary outcomes, such as default versus non-default. In contrast, survival analysis -- particularly within the hazard rate framework -- provides insights into the timing of events, such as the time to default. By integrating logistic regression with survival analysis, traditional scorecard models can be enhanced to capture not only the probability of default but also the dynamics of default over time. This combined approach offers a more comprehensive view of credit risk, enabling institutions to manage risk proactively and tailor strategies to individual borrower profiles. This article presents the process of developing a monthly hazard rate model using logistic regression and augmented data with survival analysis techniques to incorporate time-varying risk factors. The process includes data preparation, model construction, and the evaluation of performance metrics. Monthly hazard rates are then converted into default probabilities. Finally, a behavioral scorecard is developed using offset adjustment.
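The step "monthly hazard rates are then converted into default probabilities" can be illustrated with the standard discrete-time survival identity: the probability of defaulting within a horizon is one minus the product of the monthly survival probabilities. The abstract does not spell out the formula, so this sketch assumes each hazard is the conditional probability of default in a month given survival up to that month; the function name and the example hazards are hypothetical.

```python
from typing import Sequence


def cumulative_default_probability(monthly_hazards: Sequence[float]) -> float:
    """Probability of default within the horizon, given monthly hazards.

    Assumes each hazard h_t is P(default in month t | no default before t),
    so survival over the horizon is the product of (1 - h_t).
    """
    survival = 1.0
    for hazard in monthly_hazards:
        if not 0.0 <= hazard <= 1.0:
            raise ValueError("each monthly hazard must lie in [0, 1]")
        survival *= 1.0 - hazard
    return 1.0 - survival


# Example: a constant (hypothetical) 1% monthly hazard over 12 months.
pd_12_months = cumulative_default_probability([0.01] * 12)
```

With a constant 1% monthly hazard, the 12-month default probability is about 11.4%, slightly below the naive 12% obtained by summing the monthly hazards.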

2411.16121 2026-04-30 stat.ML cs.LG

DP-CDA: An Algorithm for Enhanced Privacy Preservation in Dataset Synthesis Through Randomized Mixing

Utsab Saha, Tanvir Muntakim Tonoy, Hafiz Imtiaz

Comments This manuscript has been published in Security and Privacy (Wiley)

Journal ref
Security and Privacy 9 (2026) e70207
Abstract

In recent years, the growth of data across various sectors, including healthcare, security, finance, and education, has created significant opportunities for analysis and informed decision-making. However, these datasets often contain sensitive and personal information, which raises serious privacy concerns. It has been shown in multiple works that a person's identity is intertwined with their data, even if the data is anonymized. Due to this lack of separation between a person's identity and their information, the patterns associated with an individual's information can uniquely identify them. Protecting individual privacy is crucial, yet many existing machine learning and data publishing algorithms struggle with high-dimensional data, facing challenges related to the trade-off between computational efficiency and privacy. To address these challenges, we introduce an effective data publishing algorithm \emph{DP-CDA}. Our proposed algorithm generates synthetic data by randomly mixing the privacy-sensitive data in a class-specific manner and inducing carefully tuned randomness to ensure formal privacy guarantees. Our comprehensive privacy accounting shows that the proposed DP-CDA provides a stronger privacy guarantee compared to existing methods, allowing for better utility while maintaining a stricter level of privacy. To evaluate the effectiveness of DP-CDA, we examine the accuracy of predictive models trained on the synthetic data, which serves as a measure of dataset utility. Importantly, we identify an optimal order of mixing that balances the privacy-utility trade-off. Our results indicate that synthetic datasets produced using DP-CDA can achieve superior utility compared to those generated by conventional data publishing algorithms, even when subject to the same privacy requirements.

2411.05871 2026-04-30 stat.AP math.DS

A Pole-Based Approach to Interpret Electromechanical Impedance Measurements in Structural Health Monitoring

Sourabh Sangle, Sa'ed Alajlouni, Pablo A. Tarazaga

Abstract

Over several decades, electromechanical impedance (EMI) measurements have been employed as a basis for structural health monitoring and damage detection. Traditionally, Root-mean-squared-deviation (RMSD) and Cross-correlation (XCORR) based metrics have been used to interpret EMI measurements for damage assessment. These tools, although helpful and widely used, were not designed to relate changes in EMI to the underlying physical changes caused by damage. The authors propose leveraging vector fitting (VF), a rational function approximation technique, to estimate the poles of the underlying system and, consequently, the modal parameters, which have a physical connection to the underlying model of a system. Shifts in natural frequencies, as an effect of changes in the pole location, can be attributed to changes in a structure undergoing damage. With VF, tracking changes between measurements of damaged and pristine structures is physically more intuitive than with traditional metrics, making it ideal for informed post-processing. Alternative methods to VF exist in the literature (e.g., Least Square Complex Frequency-domain (LSCF) estimation, adaptive Antoulas--Anderson (AAA), Rational Krylov Fitting (RKFIT)). The authors demonstrate that VF is better suited for EMI-based structural health monitoring for the following reasons: (1) VF is more accurate at high frequency, (2) VF estimates complex conjugate stable pole pairs, close to the actual poles of the system, and (3) VF can capture critical information missed by other approaches and present it in a condensed form. Thus, using the selected technique for interpreting high-frequency EMI measurements for structural health monitoring is proposed. A set of representative case studies is presented to show the benefits of VF for damage detection and diagnosis.

2407.20295 2026-04-30 stat.AP stat.CO stat.ME

Warped multifidelity Gaussian processes for data fusion of skewed environmental data

Pietro Colombo, Claire Miller, Xiaochen Yang, Ruth O'Donnell, Paolo Maranzano

Details
English abstract

Understanding the dynamics of climate variables is paramount for numerous sectors, such as energy and environmental monitoring. This study addresses the need for precise mapping of environmental variables in national or regional monitoring networks, a task that is notably challenging when the data are skewed. To address this issue, we propose a novel data fusion approach, the warped multifidelity Gaussian process (WMFGP). The method performs prediction using multiple time series, accommodating varying reliability and resolution and effectively handling skewness. We explore the benefits and limitations of the method in an extended simulation experiment and, as a case study, focus on wind speed monitored by the network of ARPA Lombardia, one of the regional environmental agencies operating in Italy. ARPA grapples with data gaps and, because of the connection between wind speed and air quality, struggles to manage air quality effectively. We illustrate the efficacy of our approach in filling the wind speed data gaps through two extensive simulation experiments. The case study provides more informative wind speed predictions, which are crucial for predicting air pollutant concentrations, enhancing network maintenance, and advancing understanding of relevant meteorological and climatic phenomena.

2405.20191 2026-04-30 stat.AP econ.EM stat.CO

Multidimensional spatiotemporal clustering -- An application to environmental sustainability scores in Europe

Caterina Morelli, Simone Boccaletti, Paolo Maranzano, Philipp Otto

Details
English abstract

The assessment of corporate sustainability performance is highly relevant to facilitating the transition to a green, low-carbon economy. However, companies located in different areas may be subject to different sustainability and environmental risks and policies. Hence, the main objective of this paper is to investigate the spatial and temporal patterns in the sustainability evaluations of European firms. We leverage a large dataset containing information about companies' sustainability performance, measured by MSCI ESG ratings, and the geographical coordinates of firms in Western Europe between 2013 and 2023. Using a modified version of the hierarchical algorithm of Chavent et al. (2018), we conduct a spatial clustering analysis, which combines sustainability and spatial information, and a spatiotemporal clustering analysis, which combines the time dynamics of multiple sustainability features with spatial dissimilarities, to detect groups of firms with homogeneous sustainability performance. We are able to build cross-national and cross-industry clusters with remarkable differences in sustainability scores. Among other results, in the spatiotemporal analysis we observe a high degree of geographical overlap among clusters, indicating that the temporal dynamics of sustainability assessment are relevant within a multidimensional approach. Our findings help to capture the diversity of ESG ratings across Western Europe and may assist practitioners and policymakers in evaluating companies facing different sustainability-linked risks in different areas.

2402.00183 2026-04-30 stat.ME stat.CO stat.OT

A review of regularised estimation methods and cross-validation in spatiotemporal statistics

Philipp Otto, Alessandro Fassò, Paolo Maranzano

Details
English abstract

This review article focuses on regularised estimation procedures applicable to geostatistical and spatial econometric models. These methods are particularly relevant in the case of big geospatial data for dimensionality reduction or model selection. To structure the review, we initially consider the most general case of multivariate spatiotemporal processes (i.e., $g > 1$ dimensions of the spatial domain, a one-dimensional temporal domain, and $q \geq 1$ random variables). Then, the idea of regularised/penalised estimation procedures and different choices of shrinkage targets are discussed. Finally, guided by the elements of a mixed-effects model setup, which allows for a variety of spatiotemporal models, we show different regularisation procedures and how they can be used for the analysis of geo-referenced data, e.g. for selection of relevant regressors, dimensionality reduction of the covariance matrices, detection of conditionally independent locations, or the estimation of a full spatial interaction matrix.

2201.06682 2026-04-30 stat.ME stat.CO

Antimodes and Graphical Anomaly Exploration via Adaptive Depth Quantile Functions

Gabriel Chandler, Wolfgang Polonik

Comments 24 pages, 13 figures

Details
English abstract

This work proposes and investigates a novel method for anomaly detection and shows it to be competitive in a variety of Euclidean and non-Euclidean settings. It is based on an extension of the depth quantile functions (DQF) approach. The DQF approach encodes geometric information about a point cloud via functions of a single variable, where each observation in a data set is associated with a single such function. Plotting these functions provides a useful visualization. The technique can be applied to any data lying in a Hilbert space. The proposed anomaly detection approach is motivated by the geometric insight that the presence of anomalies in data is tied to the existence of antimodes in the data-generating distribution. Coupling this insight with a novel theoretical understanding of the shape of the DQFs gives rise to the proposed adaptive DQF (aDQF) methodology. Applications to various data sets illustrate the strong anomaly detection performance of DQF and aDQF and the benefits of their visualization aspects.

2011.06695 2026-04-30 econ.EM stat.ME

When Should We (Not) Interpret Linear IV Estimands as LATE?

Tymon Słoczyński

Details
English abstract

In this paper I revisit the interpretation of the linear instrumental variables (IV) estimand as a weighted average of conditional local average treatment effects (LATEs). I focus on a situation in which additional covariates are required for identification while the reduced-form and first-stage regressions may be misspecified due to an implicit homogeneity restriction on the effects of the instrument. I show that the weights on some conditional LATEs are negative and the IV estimand is no longer interpretable as a causal effect under a weaker version of monotonicity, i.e. when there are compliers but no defiers at some covariate values and defiers but no compliers elsewhere. The problem of negative weights disappears in the interacted specification of Angrist and Imbens (1995), which avoids misspecification and seems to be underused in applied work. I illustrate my findings in an application to the causal effects of pretrial detention on case outcomes. In this setting, I reject the stronger version of monotonicity, demonstrate that the interacted instruments are sufficiently strong for consistent estimation using the jackknife methodology, and present several estimates that are economically and statistically different, depending on whether the interacted instruments are used.

1907.11752 2026-04-30 cs.AI stat.ME

Choosing with unknown causal information: Action-outcome probabilities for decision making can be grounded in causal models

Mauricio Gonzalez Soto, David Danks, Hugo J. Escalante Balderas, L. Enrique Sucar

Details
English abstract

Decision-making under uncertainty and causal thinking are fundamental aspects of intelligent reasoning. Decision-making has been well studied when the available information is considered at the associative (probabilistic) level. The classical theorems of von Neumann-Morgenstern and Savage provide a formal criterion for rational choice using associative information: maximize expected utility. There is an ongoing debate about the origin of the probabilities involved in such calculations. In this work, we show how the probabilities for decision-making can be grounded in causal models by considering decision problems in which the available actions and consequences are causally connected. In this setting, actions are regarded as interventions on a causal model. We then extend a previous causal decision-making result, which relies on a known causal model, to the case in which the causal mechanism that controls some environment is unknown to a rational decision-maker. In this way, action-outcome probabilities can be grounded in causal models in both the known and unknown cases. Finally, as an application, we extend the well-known concept of Nash equilibrium to the case in which the players of a strategic game consider causal information.
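
To make the idea of grounding action-outcome probabilities in a causal model concrete, here is a minimal sketch for the known-model case only. It assumes a hypothetical toy model with a background variable Z and an outcome that depends on both Z and the chosen action; the interventional probability is obtained by averaging over Z, and the action with the highest expected utility is selected. The variable names, probabilities, and utilities are invented for illustration and do not come from the paper.

```python
# Hypothetical causal model: Z -> Outcome <- Action.
# Under do(Action = a), Z keeps its own distribution, so
#   P(outcome | do(a)) = sum_z P(z) * P(outcome | a, z).

P_Z = {0: 0.75, 1: 0.25}  # assumed distribution of the background variable

# Assumed probability of "success" given the action and the value of Z.
P_SUCCESS = {
    ("treat", 0): 0.8,
    ("treat", 1): 0.4,
    ("wait", 0): 0.6,
    ("wait", 1): 0.2,
}

UTILITY = {"success": 1.0, "failure": 0.0}  # assumed utilities


def outcome_probs_under_do(action):
    """Action-outcome probabilities obtained by intervening on the toy model."""
    p_success = sum(P_Z[z] * P_SUCCESS[(action, z)] for z in P_Z)
    return {"success": p_success, "failure": 1.0 - p_success}


def expected_utility(action):
    """Expected utility of an action under its interventional distribution."""
    probs = outcome_probs_under_do(action)
    return sum(probs[o] * UTILITY[o] for o in probs)


def best_action(actions):
    """Return the action that maximizes expected utility."""
    return max(actions, key=expected_utility)


for a in ("treat", "wait"):
    print(a, expected_utility(a))
print("chosen:", best_action(["treat", "wait"]))
```

The sketch covers only a fully specified model; the setting in which the causal mechanism is unknown to the decision-maker, which is the extension the abstract describes, is not represented here.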