Variance component score test for multivariate change point detection with applications to mobile health
Comments 15 pages, 5 figures
Melissa Lynne Martin, Juliette Brook, Sage Rush, Theodore D. Satterthwaite, Ian J. Barnett
Comments 15 pages, 5 figures
Multivariate change point detection is the process of identifying distributional shifts in time-ordered data across multiple features. This task is particularly challenging when the number of features is large relative to the number of observations. This problem is often present in mobile health, where behavioral changes in at-risk patients must be detected in real time in order to prompt timely interventions. We propose a variance component score test (VC*) for detecting changes in feature means and/or variances using only pre-change point data to estimate distributional parameters. Through simulation studies, we show that VC* has higher power than existing methods. Moreover, we demonstrate that reducing bias by using only pre-change point days to estimate parameters outweighs the increased estimator variances in most scenarios. Lastly, we apply VC* and competing methods to passively collected smartphone data in adolescents and young adults with affective instability.
Greg Kreider
Comments 19 pages, 5 figures
We define interval spacing as the difference in the order statistics of data over a gap of some width. We derive its density, expected value, and variance for uniform, exponential, and logistic variates. We show that interval spacing is equivalent to running a rectangular low-pass filter over the spacing, which simplifies the expressions for the expected values and introduces correlations between overlapping intervals.
Harry T. Bond, Bertrand Gauthier, Kirstin Strokorb
Comments 23 pages, 10 figures
We investigate the properties of a class of regularisation-free approaches for Gaussian graphical inference based on the information-geometry-driven sequential growth of initially edgeless graphs. Relating the growth of a graph to a coordinate descent process, we characterise the fully-corrective descents corresponding to information-optimal growths, and propose numerically efficient strategies for their approximation. We demonstrate the ability of the proposed procedures to reliably extract sparse graphical models while limiting the number of false detections, and illustrate how activation ranks can provide insight into the informational relevance of edge sets. The considered approaches are tuning-parameter-free and have complexities akin to coordinate descents.
Paolo Andrich, Shengjie Lai, Halim Jun, Qianwen Duan, Zhifeng Cheng, Seth R. Flaxman, Andrew J. Tatem
Comments 25 pages, 8 figures
Accurate and timely population data are essential for disaster response and humanitarian planning, but traditional censuses often cannot capture rapid demographic changes. Social media data offer a promising alternative for dynamic population monitoring, but their representativeness remains poorly understood and stringent privacy requirements limit their reliability. Here, we address these limitations in the context of the Philippines by calibrating Facebook user counts with the country's 2020 census figures. First, we find that differential privacy techniques commonly applied to social media-based population datasets disproportionately mask low-population areas. To address this, we propose a Bayesian imputation approach to recover missing values, restoring data coverage for $5.5\%$ of rural areas. Further, using the imputed social media data and leveraging predictors such as urbanisation level, demographic composition, and socio-economic status, we develop a statistical model for the proportion of Facebook users in each municipality, which links observed Facebook user numbers to the true population levels. Out-of-sample validation demonstrates strong result generalisability, with errors as low as ${\approx}18\%$ and ${\approx}24\%$ for urban and rural Facebook user proportions, respectively. We further demonstrate that accounting for overdispersion and spatial correlations in the data is crucial to obtain accurate estimates and appropriate credible intervals. Crucially, as predictors change over time, the models can be used to regularly update the population predictions, providing a dynamic complement to census-based estimates. These results have direct implications for humanitarian response in disaster-prone regions and offer a general framework for using biased social media signals to generate reliable and timely population data.
Greg Kreider
Comments 40 pages, 3 figures
The distribution of the spacing, or the difference between consecutive order statistics, is known only for uniform and exponential random variates. We add here logistic and Gumbel variates, and present an estimator for distributions with a known inverse cumulative density function. We show the estimator is accurate to the limit of numerical simulations for points near the middle of the order statistics, but degrades by up to 20% in the tails.
Jian Yao, Pengtao Li, Xiaohui Chen, Quntao Zhuang
Distance metrics are central to machine learning, yet distances between ensembles of quantum states remain poorly understood due to fundamental quantum measurement constraints. We introduce a hierarchy of integral probability metrics, termed MMD-$k$, which generalizes the maximum mean discrepancy to quantum ensembles and exhibit a strict trade-off between discriminative power and statistical efficiency as the moment order $k$ increases. For pure-state ensembles of size $N$, estimating MMD-$k$ using experimentally feasible SWAP-test-based estimators requires $Θ(N^{2-2/k})$ samples for constant $k$, and $Θ(N^3)$ samples to achieve full discriminative power at $k = N$. In contrast, the quantum Wasserstein distance attains full discriminative power with $Θ(N^2 \log N)$ samples. These results provide principled guidance for the design of loss functions in quantum machine learning, which we illustrate in the training quantum denoising diffusion probabilistic models.
Lev Fedorov, Michaël E. Sander, Romuald Elie, Pierre Marion, Mathieu Laurière
Comments 24 pages
Transformers have revolutionized deep learning across various domains but understanding the precise token dynamics remains a theoretical challenge. Existing theories of deep Transformers with layer normalization typically predict that tokens cluster to a single point; however, these results rely on deterministic weight assumptions, which fail to capture the standard initialization scheme in Transformers. In this work, we show that accounting for the intrinsic stochasticity of random initialization alters this picture. More precisely, we analyze deep Transformers where noise arises from the random initialization of value matrices. Under diffusion scaling and token-wise RMS normalization, we prove that, as the number of Transformer layers goes to infinity, the discrete token dynamics converge to an interacting-particle system on the sphere where tokens are driven by a \emph{common} matrix-valued Brownian noise. In this limit, we show that initialization noise prevents the collapse to a single cluster predicted by deterministic models. For two tokens, we prove a phase transition governed by the interaction strength and the token dimension: unlike deterministic attention flows, antipodal configurations become attracting with positive probability. Numerical experiments confirm the predicted transition, reveal that antipodal formations persist for more than two tokens, and demonstrate that suppressing the intrinsic noise degrades accuracy.
Karel Devriendt, Ignacio Echave-Sustaeta Rodríguez, Frank Röttger
Comments 16 pages, 3 figures
We study extremal conditional independence for Hüsler-Reiss distributions, which is a parametric subclass of multivariate Pareto distributions. As the main contribution, we introduce two set functions, i.e.~functions which assign a value to the distribution and each of its marginals, and show that extremal conditional independence statements can be characterized by modularity relations for these functions. For the first function, we make use of the close connection between Hüsler-Reiss and Gaussian models to introduce a multiinformation-inspired measure $m^{\text{HR}}$ for Hüsler-Reiss distributions. For the second function, we consider an invariant $σ^2$ that is naturally associated to the Hüsler-Reiss parameterization and establish the second modularity criterion under additional positivity constraints. Together, these results provide new tools for describing extremal dependence structures in high-dimensional extreme value statistics. In addition, we study the geometry of a bounded subset of Hüsler-Reiss parameters and its relation with the Gaussian elliptope.
Amir Ali Ahmadi, Georgina Hall
We show that computing even very coarse approximations of critical points is intractable for simple classes of nonconvex functions. More concretely, we prove that if there exists a polynomial-time algorithm that takes as input a polynomial in $n$ variables of constant degree (as low as three) and outputs a point whose gradient has Euclidean norm at most $2^n$ whenever the polynomial has a critical point, then P=NP. The algorithm is permitted to return an arbitrary point when no critical point exists. We also prove hardness results for approximate computation of critical points under additional structural assumptions, including settings in which existence and uniqueness of a critical point are guaranteed, the function is lower bounded, and approximation is measured in terms of distance to a critical point. Overall, our results stand in contrast to the commonly-held belief that, in nonconvex optimization, approximate computation of critical points is a tractable task.
Laura Mack, Norbert Pirk
Land-atmosphere exchange processes are determined by turbulent fluxes, which can be derived from eddy-covariance measurements. This method was established to quantify ecosystem-scale vertical atmosphere-vegetation exchange processes, but is also used to validate atmospheric turbulence theories with the ultimate aim to improve the representation of turbulence in numerical models. While the focus has long been on turbulence over idealized, homogeneous and flat surfaces, recent scientific developments are shifting towards investigating turbulent exchange processes in complex heterogeneous environments under non-idealized conditions, which pose particular challenges, e.g. advective fluxes between different surface types or non-stationarity of nighttime turbulence. This requires to rethink standard post-processing routines for determining turbulent fluxes from the high-frequency sonic and gas analyzer measurements. Here, we introduce the open-source R-package 'Reddy', which provides modular-built functions for post-processing, analysis and visualization of eddy-covariance measurements, including investigating spectra, coherent structures, anisotropy, flux footprints and surface energy balance closure. The 'Reddy' package is accompanied by a detailed documentation and a set of jupyter notebooks introducing new users hands-on to eddy-covariance data analysis. We showcase 'Reddy' based on measurements from three different sites in Norway: A case study during strong stratification over alpine tundra, for determining suitable averaging times during ice-cover transition at a boreal lake, and for fitting flux-variance relations for a permafrost peatland. 'Reddy' serves as extension of previously developed software packages, paving the way towards holistic turbulence data analysis in heterogeneous real-world environments.
Gustav Norén, Anubhab Ghosh, Fredrik Cumlin, Saikat Chatterjee
Comments The article is accepted at ICASSP 2026
We design a variational state estimation (VSE) method that provides a closed-form Gaussian posterior of an underlying complex dynamical process from (noisy) nonlinear measurements. The complex process is model-free. That is, we do not have a suitable physics-based model characterizing the temporal evolution of the process state. The closed-form Gaussian posterior is provided by a recurrent neural network (RNN). The use of RNN is computationally simple in the inference phase. For learning the RNN, an additional RNN is used in the learning phase. Both RNNs help each other learn better based on variational inference principles. The VSE is demonstrated for a tracking application - state estimation of a stochastic Lorenz system (a benchmark process) using a 2-D camera measurement model. The VSE is shown to be competitive against a particle filter that knows the Lorenz system model and a recently proposed data-driven state estimation method that does not know the Lorenz system model.
Jinhang Chai, Xuyuan Liu, Elynn Chen, Yujun Yan
Learning systems often expand their ambient features or latent representations over time, embedding earlier representations into larger spaces with limited new latent structure. We study transfer learning for structured matrix estimation under simultaneous growth of the ambient dimension and the intrinsic representation, where a well-estimated source task is embedded as a subspace of a higher-dimensional target task. We propose a general transfer framework in which the target parameter decomposes into an embedded source component, low-dimensional low-rank innovations, and sparse edits, and develop an anchored alternating projection estimator that preserves transferred subspaces while estimating only low-dimensional innovations and sparse modifications. We establish deterministic error bounds that separate target noise, representation growth, and source estimation error, yielding strictly improved rates when rank and sparsity increments are small. We demonstrate the generality of the framework by applying it to two canonical problems. For Markov transition matrix estimation from a single trajectory, we derive end-to-end theoretical guarantees under dependent noise. For structured covariance estimation under enlarged dimensions, we provide complementary theoretical analysis in the appendix and empirically validate consistent transfer gains.
Shujaat Khan, Syed Muhammad Atif, Jaeyoung Huh, Syed Saad Azhar
Comments 11 pages, 13 figures
Ultrasound (US) interpretation is hampered by multiplicative speckle, acquisition blur from the point-spread function (PSF), and scanner- and operator-dependent artifacts. Supervised enhancement methods assume access to clean targets or known degradations; conditions rarely met in practice. We present a blind, self-supervised enhancement framework that jointly deconvolves and denoises B-mode images using a Swin Convolutional U-Net trained with a \emph{physics-guided} degradation model. From each training frame, we extract rotated/cropped patches and synthesize inputs by (i) convolving with a Gaussian PSF surrogate and (ii) injecting noise via either spatial additive Gaussian noise or complex Fourier-domain perturbations that emulate phase/magnitude distortions. For US scans, clean-like targets are obtained via non-local low-rank (NLLR) denoising, removing the need for ground truth; for natural images, the originals serve as targets. Trained and validated on UDIAT~B, JNU-IFM, and XPIE Set-P, and evaluated additionally on a 700-image PSFHS test set, the method achieves the highest PSNR/SSIM across Gaussian and speckle noise levels, with margins that widen under stronger corruption. Relative to MSANN, Restormer, and DnCNN, it typically preserves an extra $\sim$1--4\,dB PSNR and 0.05--0.15 SSIM in heavy Gaussian noise, and $\sim$2--5\,dB PSNR and 0.05--0.20 SSIM under severe speckle. Controlled PSF studies show reduced FWHM and higher peak gradients, evidence of resolution recovery without edge erosion. Used as a plug-and-play preprocessor, it consistently boosts Dice for fetal head and pubic symphysis segmentation. Overall, the approach offers a practical, assumption-light path to robust US enhancement that generalizes across datasets, scanners, and degradation types.
Cecilie Olesen Recke, Sarah Lumpp, Nataliia Kushnerchuk, Janike Oldekop, Jiayi Li, Jane Ivy Coons, Elina Robeva
Comments 29 pages, 18 pages of appendix
In this paper, we study discrete Lyapunov models, which consist of steady-state distributions of first-order vector autoregressive models. The parameter matrix of such a model encodes a directed graph whose vertices correspond to the components of the random vector. This combinatorial framework naturally allows for cycles in the graph structure. We focus on the fundamental problem of identifying the entries of the parameter matrix. In contrast to the classical setting, we assume non-Gaussian error terms, which allows us to use the higher-order cumulants of the model. In this setup, we show generic identifiability for directed acyclic graphs with self-loops at each vertex and show how to express the parameters as a rational function of the cumulants. Furthermore, we establish local identifiability for all directed graphs containing self loops at each vertex and no isolated vertices. Finally, we provide first results on the defining equations of the models, showing model equivalence for certain graphs and paving the way towards structure learning.
Nina Moutonnet, Joshua Corneck, Felipe Tobar, Danilo Mandic
Reliable seizure detection from electroencephalography (EEG) time series is a high-priority clinical goal, yet the acquisition cost and scarcity of labeled EEG data limit the performance of machine learning methods. This challenge is exacerbated by the long-range, high-dimensional, and non-stationary nature of epileptic EEG recordings, which makes realistic data generation particularly difficult. In this work, we revisit Gaussian processes as a principled and interpretable foundation for modeling EEG dynamics, and propose a novel hierarchical framework, \textit{GP-EEG}, for generating synthetic epileptic EEG recordings. At its core, our approach decomposes EEG signals into temporal segments modeled via Gaussian process regression, and integrates a domain-adaptation variational autoencoder. We validate the proposed method on two real-world, open-source epileptic EEG datasets. The synthetic EEG recordings generated by our model match real-world epileptic EEG both quantitatively and qualitatively, and can be used to augment training sets.
Tatiana Krikella, Joel A. Dubin
Advances in precision medicine increasingly drive methodological innovation in health research. A key development is the use of personalized prediction models (PPMs), which are fit using a similar subpopulation tailored to a specific index patient, and have been shown to outperform one-size-fits-all models, particularly in terms of model discrimination performance. We propose a generalized loss function that enables tuning of the subpopulation size used to fit a PPM. This loss function allows joint optimization of discrimination and calibration, allowing both the performance measures and their relative weights to be specified by the user. To reduce computational burden, we conducted extensive simulation studies to identify practical bounds for the grid of subpopulation sizes. Based on these results, we recommend using a lower bound of 20\% and an upper bound of 70\% of the entire training dataset. We apply the proposed method to both simulated and real-world datasets and demonstrate that previously observed relationships between subpopulation size and model performance are robust. Furthermore, we show that the choice of performance measures in the loss function influences the optimal subpopulation size selected. These findings support the flexible and computationally efficient implementation of PPMs in precision health research.
Jonas Arruda, Niels Bracher, Ullrich Köthe, Jan Hasenauer, Stefan T. Radev
Diffusion models have recently emerged as powerful learners for simulation-based inference (SBI), enabling fast and accurate estimation of latent parameters from simulated and real data. Their score-based formulation offers a flexible way to learn conditional or joint distributions over parameters and observations, thereby providing a versatile solution to various modeling problems. In this tutorial review, we synthesize recent developments on diffusion models for SBI, covering design choices for training, inference, and evaluation. We highlight opportunities created by various concepts such as guidance, score composition, flow matching, consistency models, and joint modeling. Furthermore, we discuss how efficiency and statistical accuracy are affected by noise schedules, parameterizations, and samplers. Finally, we illustrate these concepts with case studies across parameter dimensionalities, simulation budgets, and model types, and outline open questions for future research.
Sara Pérez-Vieites, Sahel Iqbal, Simo Särkkä, Dominik Baumann
Comments 20 pages, 7 figures
Bayesian experimental design (BED) provides a principled framework for optimizing data collection by choosing experiments that are maximally informative about unknown parameters. However, existing methods cannot deal with the joint challenge of (a) partially observable dynamical systems, where only noisy and incomplete observations are available, and (b) fully online inference, which updates posterior distributions and selects designs sequentially in a computationally efficient manner. Under partial observability, dynamical systems are naturally modeled as state-space models (SSMs), where latent states mediate the link between parameters and data, making the likelihood -- and thus information-theoretic objectives like the expected information gain (EIG) -- intractable. We address these challenges by deriving new estimators of the EIG and its gradient that explicitly marginalize latent states, enabling scalable stochastic optimization in nonlinear SSMs. Our approach leverages nested particle filters for efficient online state-parameter inference with convergence guarantees. Applications to realistic models, such as the susceptible-infectious-recovered (SIR) and a moving source location task, show that our framework successfully handles both partial observability and online inference.
Dane Isenberg, Michael O. Harhay, Andrew B. Forbes, Paul J. Young, Fan Li, Nandita Mitra
Comments corrected minor writing errors (including in abstract) and typos
In cluster-randomized crossover (CRXO) trials, groups of individuals are randomly assigned to two or more sequences of alternating treatments. Since clusters serve as their own control, the CRXO design is typically more statistically efficient than the usual parallel-arm design. CRXO trials are increasingly popular in many areas of health research where the number of available clusters is limited. Further, in trials among severely ill patients, researchers often want to assess the effect of treatments on secondary non-terminal outcomes, but frequently in these studies, there are patients who do not survive to have these measurements fully recorded. In this paper, we provide a causal inference framework and treatment effect estimation methods for addressing truncation by death in the setting of CRXO trials. We target the survivor average causal effect (SACE) estimand, a well-defined subgroup treatment effect obtained via principal stratification. We propose novel structural and standard modeling assumptions that enable estimating the SACE within a Bayesian paradigm. We evaluate the small-sample performance of our proposed Bayesian approach for estimation of the SACE in CRXO trial settings via simulation studies. We apply our methods to a previously conducted two-period cross-sectional CRXO study examining the impact of proton pump inhibitors compared to histamine-2 receptor blockers on length of hospitalization among adults requiring invasive mechanical ventilation.
Andrew Millard, Joshua Murphy, Simon Maskell, Zheng Zhao
Partial Bayesian neural networks (pBNNs) have been shown to perform competitively with fully Bayesian neural networks while only having a subset of the parameters be stochastic. Using sequential Monte Carlo (SMC) samplers as the inference method for pBNNs gives a non-parametric probabilistic estimation of the stochastic parameters, and has shown improved performance over parametric methods. In this paper we introduce a new SMC-based training method for pBNNs by utilising a guided proposal and incorporating gradient-based Markov kernels, which gives us better scalability on high dimensional problems. We show that our new method outperforms the state-of-the-art in terms of predictive performance and optimal loss. We also show that pBNNs scale well with larger batch sizes, resulting in significantly reduced training times and often better performance.
Andrew Millard, Joshua Murphy, Daniel Frisch, Simon Maskell
Comments 16 pages, 9 figures
Markov chain Monte Carlo (MCMC) methods are a powerful but computationally expensive way of performing non-parametric Bayesian inference. MCMC proposals which utilise gradients, such as Hamiltonian Monte Carlo (HMC), can better explore the parameter space of interest if the additional hyper-parameters are chosen well. The No-U-Turn Sampler (NUTS) is a variant of HMC which is extremely effective at selecting these hyper-parameters but is slow to run and is not suited to GPU architectures. An alternative to NUTS, Change in the Estimator of the Expected Square HMC (ChEES-HMC) was shown not only to run faster than NUTS on GPU but also sample from posteriors more efficiently. Sequential Monte Carlo (SMC) samplers are another sampling method which instead output weighted samples from the posterior. They are very amenable to parallelisation and therefore being run on GPUs while having additional flexibility in their choice of proposal over MCMC. We incorporate (ChEEs-HMC) as a proposal into SMC samplers and demonstrate competitive but faster performance than NUTS on a number of tasks.
Felix Reichel
Comments 25 pages, 12 figures, 7 tables, 10 references, 3 appendices
Journal ref J. Math. Stat. 2025, 44-49
Bessel's correction adjusts the denominator in the sample variance formula from n to n-1 to ensure an unbiased estimator of the population variance. This paper provides rigorous algebraic derivations geometric interpretations and visualizations to reinforce the necessity of this correction. It further introduces the concept of Bariance an alternative dispersion measure based on pairwise squared differences that avoids reliance on the arithmetic mean. Building on this we address practical concerns raised in Rosenthals article [Rosenthal, 2015] which advocates for n-based estimates from a mean squared error (MSE) perspective particularly in pedagogical contexts and specific applied settings. Finally, the empirical component of this work based on simulation studies demonstrates that estimating the population variance via an algebraically optimized Bariance approach can yield a computational advantage. Specifically the unbiased Bariance estimator can be computed in linear time resulting in shorter runtimes while preserving statistical validity.
Emilie Højbjerre-Frandsen, Mark J. van der Laan, Alejandro Schuler
Comments 41 pages, 7 figures
In randomized clinical trials (RCTs), the accurate estimation of marginal treatment effects is crucial for determining the efficacy of interventions. Enhancing the statistical power of these analyses is a key objective for statisticians. The increasing availability of historical data from registries, prior trials, and health records presents an opportunity to improve trial efficiency. However, many methods for historical borrowing compromise strict type-I error rate control. Building on the work by Schuler et al. [2022] on prognostic score adjustment for linear models, this paper extends the methodology to the plug-in analysis proposed by Rosenblum et al. [2010] using generalized linear models (GLMs) to further enhance the efficiency of RCT analyses without introducing bias. Specifically, we train a prognostic model on historical control data and incorporate the resulting prognostic scores as covariates in the plug-in GLM analysis of the trial data. This approach leverages the predictive power of historical data to improve the precision of marginal treatment effect estimates. We demonstrate that this method achieves local semi-parametric efficiency under the assumption of an additive treatment effect on the link scale. We expand the GLM plug-in method to include negative binomial regression. Additionally, we provide a straightforward formula for conservatively estimating the asymptotic variance, facilitating power calculations that reflect these efficiency gains. Our simulation study supports the theory. Even without an additive treatment effect, we observe increased power or reduced standard error. While population shifts from historical to trial data may dilute benefits, they do not introduce bias.
Laura Hucker, Markus Reiß, Thomas Stark
We consider standard gradient descent, gradient flow and conjugate gradients as iterative algorithms for minimising a penalised ridge criterion in linear regression. While it is well known that conjugate gradients exhibit fast numerical convergence, the statistical properties of their iterates are more difficult to assess due to inherent non-linearities and dependencies. On the other hand, standard gradient flow is a linear method with well-known regularising properties when stopped early. By an explicit non-standard error decomposition we are able to bound the prediction error for conjugate gradient iterates by a corresponding prediction error of gradient flow at transformed iteration indices. This way, the risk along the entire regularisation path of conjugate gradient iterations can be compared to that for regularisation paths of standard linear methods like gradient flow and ridge regression. In particular, the oracle conjugate gradient iterate shares the optimality properties of the gradient flow and ridge regression oracles up to a constant factor. Numerical examples show the similarity of the regularisation paths in practice.
Manav Seksaria, Anil Prabhakar
We present a method for estimating the number of shots required to achieve a desired variance in the results of a quantum circuit. First, we establish a baseline for single-qubit characterisation of individual noise sources. We then move on to multi-qubit circuits, focusing on expectation-value circuits. We decompose the variance of the estimator into a sum of a statistical term and a bias floor. These are independently estimated with one additional run of the circuit. We test our method on a Variational Quantum Eigensolver for $H_2$ and show that we can predict the variance to within known error bounds. We go on to show that for IBM Pittsburgh's noise characteristics, at that instant, 7000 shots for the given circuit would have achieved a $σ^2 \approx 0.01$
Mohammad Zhalechian, Soroush Saghafian, Omar Robles
The United States Food and Drug Administration's (FDA's) 510(k) pathway allows manufacturers to gain medical device approval by demonstrating substantial equivalence to a legally marketed device. However, the inherent ambiguity of this regulatory procedure has been associated with high recall among many devices cleared through this pathway, raising significant safety concerns. In this paper, we develop a combined human-algorithm approach to assist the FDA in improving its 510(k) medical device clearance process by reducing recall risk and regulatory workload. We first develop machine learning methods to estimate the risk of recall of 510(k) medical devices based on the information available at the time of submission. We then propose a data-driven clearance policy that recommends acceptance, rejection, or deferral to FDA's committees for in-depth evaluation. We conduct an empirical study using a unique dataset of over 31,000 submissions that we assembled based on data sources from the FDA and Centers for Medicare and Medicaid Service (CMS). Compared to the FDA's current practice, which has a recall rate of 10.3% and a normalized workload measure of 100%, a conservative evaluation of our policy shows a 32.9% improvement in the recall rate and a 40.5% reduction in the workload. Our analyses further suggest annual cost savings of approximately $1.7 billion for the healthcare system driven by avoided replacement costs, which is equivalent to 1.1% of the entire United States annual medical device expenditure. Our findings highlight the value of a holistic and data-driven approach to improve the FDA's current 510(k) pathway.
Ansgar Steland
Training deep learning neural networks often requires massive amounts of computational ressources. We propose to sequentially monitor network predictions to trigger retraining only if the predictions are no longer valid. This can reduce drastically computational costs and opens a door to green deep learning. Our approach is based on the relationship to projected second moments monitoring, a problem also arising in other areas such as computational finance. Various open-end as well as closed-end monitoring rules are studied under mild assumptions on the training sample and the observations of the monitoring period. The results allow for high-dimensional non-stationary time series data and thus, especially, non-i.i.d. training data. Asymptotics is based on Gaussian approximations of projected partial sums allowing for an estimated projection vector. Estimation of projection vectors is studied both for classical non-$\ell_0$-sparsity as well as under sparsity. For the case that the optimal projection depends on the unknown covariance matrix, hard- and soft-thresholded estimators are studied. The method is analyzed by simulations and supported by synthetic data experiments.
Xiaoyu Hu, Zhenhua Lin
Comments 49 pages, 3 figures
The two-sample homogeneity testing problem is fundamental in statistics and becomes particularly challenging in high dimensions, where classical tests can suffer substantial power loss. We develop a learning-assisted procedure based on the projection 1-Wasserstein distance, which we call the neural Wasserstein test. The method is motivated by the observation that there often exists a low-dimensional projection under which the two high-dimensional distributions differ. In practice, we learn the projection directions via manifold optimization and a witness function using deep neural networks. To adapt to unknown projection dimensions and sparsity levels, we aggregate a collection of candidate statistics through a max-type construction, avoiding explicit tuning while potentially improving power. We establish the validity and consistency of the proposed test and prove a Berry--Esseen type bound for the Gaussian approximation. In particular, under the null hypothesis, the aggregated statistic converges to the absolute maximum of a standard Gaussian vector, yielding an asymptotically pivotal (distribution-free) calibration that bypasses resampling. Simulation studies and a real-data example demonstrate the strong finite-sample performance of the proposed method.
Yaxi Hu, Johanna Düngler, Bernhard Schölkopf, Amartya Sanyal
We introduce the (Wishart) projection mechanism, a randomized map of the form $S \mapsto M f(S)$ with $M \sim W_d(1/r I_d, r)$ and study its differential privacy properties. For vector-valued queries $f$, we prove non-asymptotic DP guarantees without any additive noise, showing that Wishart randomness alone can suffice. For matrix-valued queries, however, we establish a sharp negative result: in the noise-free setting, the mechanism is not DP, and we demonstrate its vulnerability by implementing a near perfect membership inference attack (AUC $> 0.99$). We then analyze a noisy variant and prove privacy amplification due to randomness and low rank projection, in both large- and small-rank regimes, yielding stronger privacy guarantees than additive noise alone. Finally, we show that LoRA-style updates are an instance of the matrix-valued mechanism, implying that LoRA is not inherently private despite its built-in randomness, but that low-rank fine-tuning can be more private than full fine-tuning at the same noise level. Preliminary experiments suggest that tighter accounting enables lower noise and improved accuracy in practice.
Hugo Morvan, Jonas Agholme, Bjorn Eliasson, Katarina Olofsson, Ludger Grote, Fredrik Iredahl, Oleg Sysoev
Missing data is a prevalent issue in many applications, including large medical registries such as the Swedish Healthcare Quality Registries, potentially leading to biased or inefficient analyses if not handled properly. Multiple Imputation by Chained Equations (MICE) is a popular and versatile method for handling multivariate missing data but traditional implementations face significant challenges when applied to big data sets due to computational time and memory limitations. To address this, the bigMICE package was developed, adapting the MICE framework to big data using Apache Spark MLLib and Spark ML. Our implementation allows for controlling the maximum memory usage during the execution, enabling processing of very large data sets on a hardware with a limited memory, such as ordinary laptops. The developed package was tested on a large Swedish medical registry to measure memory usage, runtime and dependence of the imputation quality on sample size and on missingness proportion in the data. In conclusion, our method is generally more memory efficient and faster on large data sets compared to a commonly used MICE implementation. We also demonstrate that working with very large datasets can result in high quality imputations even when a variable has a large proportion of missing data. This paper also provides guidelines and recommendations on how to install and use our open source package.
Edoardo Otranto
We propose a modeling framework for time-varying covariance matrices based on the assumption that the logarithm of a realized covariance matrix follows a matrix-variate oNrmal distribution. By operating in the space of symmetric matrices, the approach guarantees positive definiteness without imposing parameter constraints beyond stationarity. The conditional mean of the logarithmic covariance matrix is specified through a BEKK-type structure that can be rewritten as a diagonal vector representation, yielding a parsimonious specification that mitigates the curse of dimensionality. Estimation is performed by maximum likelihood exploiting properties of matrix-variate Normal distributions and expressing the scale parameter matrix as a function of the location matrix. The covariance matrix is recovered via the matrix exponential. Since this transformation induces an upward bias, an approximate, time-specific bias correction based on a second-order Taylor expansion is proposed. The framework is flexible and applicable to a wide class of problems involving symmetric positive definite matrices.
Edoardo Otranto
This paper proposes a novel data-driven approach for identifying and modelling areas with similar temperature variations throufigureh clustering and Space-Time AutoRegressive (STAR) models. Using annual temperature data from 168 countries (1901-2022), we apply three clustering methods based on (i) warming rates, (ii) annual temperature variations, and (iii) persistence of variation signs, using Euclidean and Hamming distances. These clusters are then employed to construct alternative spatial weight matrices for STAR models. Empirical results show that distance-based STAR models outperform classical contiguity-based ones, both in-sample and out-of-sample, with the Hamming distance-based STAR model achieving the best predictive accuracy. The study demonstrates that using statistical similarity rather than geographical proximity improves the modelling of global temperature dynamics, suggesting broader applicability to other environmental and socioeconomic datasets.
Silvia Dallari, Laura Anderlucci, Nicola Grandi, Angela Montanari
Public debate on the alleged decline of language skills among younger generations often focuses on university students, the most highly educated segment of the population. Rather than addressing the ill posed question of linguistic decline, this paper examines how formal written Italian is currently used by university students and whether systematic patterns of competence and heterogeneity can be identified. The analysis is based on data from the UniversITA project, which collected formal texts written by a large and nationally representative sample of Italian university students. Texts were annotated for linguistically motivated features covering orthography, lexicon, syntax, morphosyntax, coherence, register, and sentence structure, yielding low frequency multivariate count data. To analyse these data, we propose a novel model-based clustering approach based on a Poisson factor mixture model that accounts for dependence among linguistic features and unobserved population heterogeneity. The results identify two correlated dimensions of writing competence, interpretable as communicative competence and linguistic grammatical competence. When educational and socio demographic information is incorporated, distinct student profiles emerge that are associated with field of study and educational background. These findings provide quantitative evidence on contemporary writing and offer insights relevant for language education and higher education policy.
Kang You, Gary Green, Jian Zhang
Comments 40 pages, 13 figures
Pathophysiolpgical modelling of brain systems from microscale to macroscale remains difficult in group comparisons partly because of the infeasibility of modelling the interactions of thousands of neurons at the scales involved. Here, to address the challenge, we present a novel approach to construct differential causal networks directly from electroencephalogram (EEG) data. The proposed network is based on conditionally coupled neuronal circuits which describe the average behaviour of interacting neuron populations that contribute to observed EEG data. In the network, each node represents a parameterised local neural system while directed edges stand for node-wise connections with transmission parameters. The network is hierarchically structured in the sense that node and edge parameters are varying in subjects but follow a mixed-effects model. A novel evolutionary optimisation algorithm for parameter inference in the proposed method is developed using a loss function derived from Chen-Fliess expansions of stochastic differential equations. The method is demonstrated by application to the fitting of coupled Jansen-Rit local models. The performance of the proposed method is evaluated on both synthetic and real EEG data. In the real EEG data analysis, we track changes in the parameters that characterise dynamic causality within brains that demonstrate epileptic activity. We show evidence of network functional disruptions, due to imbalance of excitatory-inhibitory interneurons and altered epileptic brain connectivity, before and during seizure periods.
Ruicheng Ao, Hongyu Chen, Siyang Gao, Hanwei Li, David Simchi-Levi
Comments 22 pages, 3 figures
We study fixed-confidence best-arm identification (BAI) where a cheap but potentially biased proxy (e.g., LLM judge) is available for every sample, while an expensive ground-truth label can only be acquired selectively when using a human for auditing. Unlike classical multi-fidelity BAI, the proxy is biased (arm- and context-dependent) and ground truth is selectively observed. Consequently, standard multi-fidelity methods can mis-select the best arm, and uniform auditing, though accurate, wastes scarce resources and is inefficient. We prove that without bias correction and propensity adjustment, mis-selection probability may not vanish (even with unlimited proxy data). We then develop an estimator for the mean of each arm that combines proxy scores with inverse-propensity-weighted residuals and form anytime-valid confidence sequences for that estimator. Based on the estimator and confidence sequence, we propose an algorithm that adaptively selects and audits arms. The algorithm concentrates audits on unreliable contexts and close arms and we prove that a plug-in Neyman rule achieves near-oracle audit efficiency. Numerical experiments confirm the theoretical guarantees and demonstrate the superior empirical performance of the proposed algorithm.
Ruicheng Ao, Hongyu Chen, Haoyang Liu, David Simchi-Levi, Will Wei Sun
Comments 27 pages, 4 figures
We study semi-supervised stochastic optimization when labeled data is scarce but predictions from pre-trained models are available. PPI and SVRG both reduce variance through control variates -- PPI uses predictions, SVRG uses reference gradients. We show they are mathematically equivalent and develop PPI-SVRG, which combines both. Our convergence bound decomposes into the standard SVRG rate plus an error floor from prediction uncertainty. The rate depends only on loss geometry; predictions affect only the neighborhood size. When predictions are perfect, we recover SVRG exactly. When predictions degrade, convergence remains stable but reaches a larger neighborhood. Experiments confirm the theory: PPI-SVRG reduces MSE by 43--52\% under label scarcity on mean estimation benchmarks and improves test accuracy by 2.7--2.9 percentage points on MNIST with only 10\% labeled data.
Ferdinand Buchner, David Jobst, Annette Möller, Claudia Czado
To quantify the uncertainty in numerical weather prediction (NWP) forecasts, ensemble prediction systems are utilized. Although NWP forecasts continuously improve, they suffer from systematic bias and dispersion errors. To obtain well calibrated and sharp predictive probability distributions, statistical postprocessing methods are applied to NWP output. Recent developments focus on multivariate postprocessing models incorporating dependencies directly into the model. We introduce three novel bivariate postprocessing approaches, and analyze their performance for joint postprocessing of bivariate wind vector components for 60 stations in Germany. Bivariate vine copula based models, a bivariate gradient boosted version of ensemble model output statistics (EMOS), and a bivariate distributional regression network (DRN) are compared to bivariate EMOS. The case study indicates that the novel bivariate methods improve over the bivariate EMOS approaches. The bivariate DRN and the most flexible version of the bivariate vine copula approach exhibit the best performance in terms of verification scores and calibration.
Zijun Chen, Zaiwei Chen, Nian Si, Shengbo Wang
We present the convergence rates of synchronous and asynchronous Q-learning for average-reward Markov decision processes, where the absence of contraction poses a fundamental challenge. Existing non-asymptotic results overcome this challenge by either imposing strong assumptions to enforce seminorm contraction or relying on discounted or episodic Markov decision processes as successive approximations, which either require unknown parameters or result in suboptimal sample complexity. In this work, under a reachability assumption, we establish optimal $\widetilde{O}(\varepsilon^{-2})$ sample complexity guarantees (up to logarithmic factors) for a simple variant of synchronous and asynchronous Q-learning that samples from the lazified dynamics, where the system remains in the current state with some fixed probability. At the core of our analysis is the construction of an instance-dependent seminorm and showing that, after a lazy transformation of the Markov decision process, the Bellman operator becomes one-step contractive under this seminorm.
Dongyue Xie, Wanrong Zhu, Matthew Stephens
We introduce a flexible empirical Bayes approach for fitting Bayesian generalized linear models. Specifically, we adopt a novel mean-field variational inference (VI) method and the prior is estimated within the VI algorithm, making the method tuning-free. Unlike traditional VI methods that optimize the posterior density function, our approach directly optimizes the posterior mean and prior parameters. This formulation reduces the number of parameters to optimize and enables the use of scalable algorithms such as L-BFGS and stochastic gradient descent. Furthermore, our method automatically determines the optimal posterior based on the prior and likelihood, distinguishing it from existing VI methods that often assume a Gaussian variational. Our approach represents a unified framework applicable to a wide range of exponential family distributions, removing the need to develop unique VI methods for each combination of likelihood and prior distributions. We apply the framework to solve sparse logistic regression and demonstrate the superior predictive performance of our method in extensive numerical studies, by comparing it to prevalent sparse logistic regression approaches.
Liang Hong
The extant insurance literature demonstrates a paucity of finite-sample valid prediction intervals of future insurance claims in the regression setting. To address this challenge, this article proposes a new strategy that converts a predictive method in the unsupervised iid (independent identically distributed) setting to a predictive method in the regression setting. In particular, it enables an actuary to obtain infinitely many finite-sample valid prediction intervals in the regression setting.
Qiyang Han
Adaptive sampling schemes are well known to create complex dependence that may invalidate conventional inference methods. A recent line of work shows that this need not be the case for UCB-type algorithms in multi-armed bandits. A central emerging theme is a `stability' property with asymptotically deterministic arm-pull counts in these algorithms, making inference as easy as in the i.i.d. setting. In this paper, we study the precise arm-pull dynamics in another canonical class of Thompson-sampling type algorithms. We show that the phenomenology is qualitatively different: the arm-pull count is asymptotically deterministic if and only if the arm is suboptimal or is the unique optimal arm; otherwise it converges in distribution to the unique invariant law of an SDE. This dichotomy uncovers a unifying principle behind many existing (in)stability results: an arm is stable if and only if its interaction with statistical noise is asymptotically negligible. As an application, we show that normalized arm means obey the same dichotomy, with Gaussian limits for stable arms and a semi-universal, non-Gaussian limit for unstable arms. This not only enables the construction of confidence intervals for the unknown mean rewards despite non-normality, but also reveals the potential of developing tractable inference procedures beyond the stable regime. The proofs rely on two new approaches. For suboptimal arms, we develop an `inverse process' approach that characterizes the inverse of the arm-pull count process via a Stieltjes integral. For optimal arms, we adopt a reparametrization of the arm-pull and noise processes that reduces the singularity in the natural SDE to proving the uniqueness of the invariant law of another SDE. We prove the latter by a set of analytic tools, including the parabolic Hörmander condition and the Stroock-Varadhan support theorem.
Aidan Gleich, Scott C. Schmidler
We address the problem of accurate, training-free guidance for conditional generation in trained diffusion models. Existing methods typically rely on point-estimates to approximate the posterior score, often resulting in biased approximations that fail to capture multimodality inherent to the reverse process of diffusion models. We propose a sequential Monte Carlo (SMC) framework that constructs an unbiased estimator of $p_θ(y|x_t)$ by integrating over the full denoising distribution via Monte Carlo approximation. To ensure computational tractability, we incorporate variance-reduction schemes based on Multi-Level Monte Carlo (MLMC). Our approach achieves new state-of-the-art results for training-free guidance on CIFAR-10 class-conditional generation, achieving $95.6\%$ accuracy with $3\times$ lower cost-per-success than baselines. On ImageNet, our algorithm achieves $1.5\times$ cost-per-success advantage over existing methods.
Jacob Zhu, Donald Estep
We introduce a novel machine learning method called the Penalized Profile Support Vector Machine based on the Gabriel edited set for the computation of the probability of failure for a complex system as determined by a threshold condition on a computer model of system behavior. The method is designed to minimize the number of evaluations of the computer model while preserving the geometry of the decision boundary that determines the probability. It employs an adaptive sampling strategy designed to strategically allocate points near the boundary determining failure and builds a locally linear surrogate boundary that remains consistent with its geometry by strategic clustering of training points. We prove two convergence results and we compare the performance of the method against a number of state of the art classification methods on four test problems. We also apply the method to determine the probability of survival using the Lotka--Volterra model for competing species.
Chonghuan Wang
Matching mechanisms play a central role in operations management across diverse fields including education, healthcare, and online platforms. However, experimentally comparing a new matching algorithm against a status quo presents some fundamental challenges due to matching interference, where assigning a unit in one matching may preclude its assignment in the other. In this work, we take a design-based perspective to study the design of randomized experiments to compare two predetermined matching plans on a finite population, without imposing outcome or behavioral models. We introduce the notation of a disagreement set, which captures the difference between the two matching plans, and show that it admits a unique decomposition into disjoint alternating paths and cycles with useful structural properties. Based on these properties, we propose the Alternating Path Randomized Design, which sequentially randomizes along these paths and cycles to effectively manage interference. Within a minimax framework, we optimize the conditional randomization probability and show that, for long paths, the optimal choice converges to $\sqrt{2}-1$, minimizing worst-case variance. We establish the unbiasedness of the Horvitz-Thompson estimator and derive a finite-population Central Limit Theorem that accommodates complex and unstable path and cycle structures as the population grows. Furthermore, we extend the design to many-to-one matchings, where capacity constraints fundamentally alter the structure of the disagreement set. Using graph-theoretic tools, including finding augmenting paths and Euler-tour decomposition on an auxiliary unbalanced directed graph, we construct feasible alternating path and cycle decompositions that allow the design and inference results to carry over.
Louis Grenioux, Maxence Noble
Sampling configurations at thermodynamic equilibrium is a central challenge in statistical physics. Boltzmann Generators (BGs) tackle it by combining a generative model with a Monte Carlo (MC) correction step to obtain asymptotically unbiased samples from an unnormalized target. Most current BGs use classic MC mechanisms such as importance sampling, which both require tractable likelihoods from the backbone model and scale poorly in high-dimensional, multi-modal targets. We study BGs built on annealed Monte Carlo (aMC), which is designed to overcome these limitations by bridging a simple reference to the target through a sequence of intermediate densities. Diffusion models (DMs) are powerful generative models and have already been incorporated into aMC-based recalibration schemes via the diffusion-induced density path, making them appealing backbones for aMC-BGs. We provide an empirical meta-analysis of DM-based aMC-BGs on controlled multi-modal Gaussian mixtures (varying mode separation, number of modes, and dimension), explicitly disentangling inference effects from learning effects by comparing (i) a perfectly learned DM and (ii) a DM trained from data. Even with a perfect DM, standard integrations using only first-order stochastic denoising kernels fail systematically, whereas second-order denoising kernels can substantially improve performance when covariance information is available. We further propose a deterministic aMC integration based on first-order transport maps derived from DMs, which outperforms the stochastic first-order variant at higher computational cost. Finally, in the learned-DM setting, all DM-aMC variants struggle to produce accurate BGs; we trace the main bottleneck to inaccurate DM log-density estimation.
Markus Holzleitner, Sergiy Pereverzyev, Sergei V. Pereverzyev, Vaibhav Silmana, S. Sivananthan
Comments 38 pages
This paper investigates a general regularization framework for unsupervised domain adaptation in vector-valued regression under the covariate shift assumption, utilizing vector-valued reproducing kernel Hilbert spaces (vRKHS). Covariate shift occurs when the input distributions of the training and test data differ, introducing significant challenges for reliable learning. By restricting the hypothesis space, we develop a practical operator learning algorithm capable of handling functional outputs. We establish optimal convergence rates for the proposed framework under a general source condition, providing a theoretical foundation for regularized learning in this setting. We also propose an aggregation-based approach that forms a linear combination of estimators corresponding to different regularization parameters and different kernels. The proposed approach addresses the challenge of selecting appropriate tuning parameters, which is crucial for constructing a good estimator, and we provide a theoretical justification for its effectiveness. Furthermore, we illustrate the proposed method on a real-world face image dataset, demonstrating robustness and effectiveness in mitigating distributional discrepancies under covariate shift.
Steve Hanneke, Shay Moran
We provide a complete theory of optimal universal rates for binary classification in the agnostic setting. This extends the realizable-case theory of Bousquet, Hanneke, Moran, van Handel, and Yehudayoff (2021) by removing the realizability assumption on the distribution. We identify a fundamental tetrachotomy of optimal rates: for every concept class, the optimal universal rate of convergence of the excess error rate is one of $e^{-n}$, $e^{-o(n)}$, $o(n^{-1/2})$, or arbitrarily slow. We further identify simple combinatorial structures which determine which of these categories any given concept class falls into.
Xiyuan Liu, Christian Hacker, Shengnian Wang, Yuhua Duan
Journal ref International Journal of Hydrogen Energy,Volume 211,2026,153744,ISSN 0360-3199,
Developing new metal hydrides is a critical step toward efficient hydrogen storage in carbon-neutral energy systems. However, existing materials databases, such as the Materials Project, contain a limited number of well-characterized hydrides, which constrains the discovery of optimal candidates. This work presents a framework that integrates causal discovery with a lightweight generative machine learning model to generate novel metal hydride candidates that may not exist in current databases. Using a dataset of 450 samples (270 training, 90 validation, and 90 testing), the model generates 1,000 candidates. After ranking and filtering, six previously unreported chemical formulas and crystal structures are identified, four of which are validated by density functional theory simulations and show strong potential for future experimental investigation. Overall, the proposed framework provides a scalable and time-efficient approach for expanding hydrogen storage datasets and accelerating materials discovery.
Stefano Maria Iacus, Haodong Qi, Devika Jain
Recent advances in Generative Artificial Intelligence (AI), particularly Large Language Models (LLMs), enable scalable extraction of spatial information from unstructured text and offer new methodological opportunities for studying climate geography. This study develops a spatial framework to examine how cumulative climate risk relates to multidimensional human flourishing across U.S. counties. High-resolution climate hazard indicators are integrated with a Human Flourishing Geographic Index (HFGI), an index derived from classification of 2.6 billion geotagged tweets using fine-tuned open-source Large Language Models (LLMs). These indicators are aggregated to the US county-level and mapped to a structural equation model to infer overall climate risk and human flourishing dimensions, including expressed well-being, meaning and purpose, social connectedness, psychological distress, physical condition, economic stability, religiosity, character and virtue, and institutional trust. The results reveal spatially heterogeneous associations between greater cumulative climate risk and lower levels of expressed human flourishing, with coherent spatial patterns corresponding to recurrent exposure to heat, flooding, wind, drought, and wildfire hazards. The study demonstrates how Generative AI can be combined with latent construct modeling for geographical analysis and for spatial knowledge extraction.
Mustafa Cavus, Przemysław Biecek, Julian Tejada, Fernando Marmolejo-Ramos, Andre Faro
Comments 19 pages, 2 figures
This paper introduces a new modeling perspective in the public mental health domain to provide a robust interpretation of the relations between anxiety and depression, and the demographic and temporal factors. This perspective particularly leverages the Rashomon Effect, where multiple models exhibit similar predictive performance but rely on diverse internal structures. Instead of considering these multiple models, choosing a single best model risks masking alternative narratives embedded in the data. To address this, we employed this perspective in the interpretation of a large-scale psychological dataset, specifically focusing on the Patient Health Questionnaire-4. We use a random forest model combined with partial dependence profiles to rigorously assess the robustness and stability of predictive relationships across the resulting Rashomon set, which consists of multiple models that exhibit similar predictive performance. Our findings confirm that demographic variables \texttt{age}, \texttt{sex}, and \texttt{education} lead to consistent structural shifts in anxiety and depression risk. Crucially, we identify significant temporal effects: risk probability demonstrates clear diurnal and circaseptan fluctuations, peaking during early morning hours. This work demonstrates the necessity of moving beyond the best model to analyze the entire Rashomon set. Our results highlight that the observed variability, particularly due to circadian and circaseptan rhythms, must be meticulously considered for robust interpretation in psychological screening. We advocate for a multiplicity-aware approach to enhance the stability and generalizability of ML-based conclusions in mental health research.
Gaku Omiya, Pierre-Louis Poirion, Akiko Takeda
Comments 41 pages
Randomized subspace methods reduce per-iteration cost; however, in nonconvex optimization, most analyses are expectation-based, and high-probability bounds remain scarce even under sub-Gaussian noise. We first prove that randomized subspace SGD (RS-SGD) admits a high-probability convergence bound under sub-Gaussian noise, achieving the same order of oracle complexity as prior in-expectation results. Motivated by the prevalence of heavy-tailed gradients in modern machine learning, we then propose randomized subspace normalized SGD (RS-NSGD), which integrates direction normalization into subspace updates. Assuming the noise has bounded $p$-th moments, we establish both in-expectation and high-probability convergence guarantees, and show that RS-NSGD can achieve better oracle complexity than full-dimensional normalized SGD.
Fabian Bleile, Sarah Lumpp, Mathias Drton
Learning a stationary diffusion amounts to estimating the parameters of a stochastic differential equation whose stationary distribution matches a target distribution. We build on the recently introduced kernel deviation from stationarity (KDS), which enforces stationarity by evaluating expectations of the diffusion's generator in a reproducing kernel Hilbert space. Leveraging the connection between KDS and Stein discrepancies, we introduce the Stein-type KDS (SKDS) as an alternative formulation. We prove that a vanishing SKDS guarantees alignment of the learned diffusion's stationary distribution with the target. Furthermore, under broad parametrizations, SKDS is convex with an empirical version that is $ε$-quasiconvex with high probability. Empirically, learning with SKDS attains comparable accuracy to KDS while substantially reducing computational cost and yields improvements over the majority of competitive baselines.
Zilu Zhao, Fangqing Xiao, Dirk Slock
Expectation Propagation (EP) is a widely used message-passing algorithm that decomposes a global inference problem into multiple local ones. It approximates marginal distributions (beliefs) using intermediate functions (messages). While beliefs must be proper probability distributions that integrate to one, messages may have infinite integral values. In Gaussian-projected EP, such messages take a Gaussian form and appear as if they have "negative" variances. Although allowed within the EP framework, these negative-variance messages can impede algorithmic progress. In this paper, we investigate EP in linear models and analyze the relationship between the corresponding beliefs. Based on the analysis, we propose both non-persistent and persistent approaches that prevent the algorithm from being blocked by messages with infinite integral values. Furthermore, by examining the relationship between the EP messages in linear models, we develop an additional approach that avoids the occurrence of messages with infinite integral values.
José Camacho
The validation of a data-driven model is the process of assessing the model's ability to generalize to new, unseen data in the population of interest. This paper proposes a set of general rules for model validation. These rules are designed to help practitioners create reliable validation plans and report their results transparently. While no validation scheme is flawless, these rules can help practitioners ensure their strategy is sufficient for practical use, openly discuss any limitations of their validation strategy, and report clear, comparable performance metrics.
Sanjeev Manivannan, Chandrashekar Lakshminarayan
Comments 65 pages (Main paper - 10 pages, Appendix - 55 pages)
Cross-subject motor-imagery decoding remains a major challenge in EEG-based brain-computer interfaces. To mitigate strong inter-subject variability, recent work has emphasized manifold-based approaches operating on covariance representations. Yet dispersion scaling and orientation alignment remain largely unaddressed in existing methods. In this paper, we address both issues through congruence transforms and introduce three complementary geometry-aware models: (i) Discriminative Congruence Transform (DCT), (ii) Deep Linear DCT (DLDCT), and (iii) Deep DCT-UNet (DDCT-UNet). These models are evaluated both as pre-alignment modules for downstream classifiers and as end-to-end discriminative systems trained via cross-entropy backpropagation with a custom logistic-regression head. Across challenging motor-imagery benchmarks, the proposed framework improves transductive cross-subject accuracy by 2-3%, demonstrating the value of geometry-aware congruence learning.
George-Rafael Domenikos, Victoria Leong
Comments Preprint. Submitted to PLOS Computational Biology
Complex systems produce high-dimensional signals that lack macroscopic variables analogous to entropy, temperature, or free energy. This work introduces a thermoinformational formulation that derives entropy, internal energy, temperature, and Helmholtz free energy directly from empirical microstate distributions of arbitrary datasets. The approach provides a data-driven description of how a system reorganizes, exchanges information, and moves between stable and unstable states. Applied to dual-EEG recordings from mother-infant dyads performing the A-not-B task, the formulation captures increases in informational heat during switches and errors, and reveals that correct choices arise from more stable, low-temperature states. In an independent optogenetic dam-pup experiment, the same variables separate stimulation conditions and trace coherent trajectories in thermodynamic state space. Across both human and rodent systems, this thermoinformational formulation yields compact and physically interpretable macroscopic variables that generalize across species, modalities, and experimental paradigms.
Edward Berman, Jacob Ginesin, Marco Pacini, Robin Walters
Comments Published in Transactions on Machine Learning Research (TMLR). Code is available at https://github.com/EdwardBerman/EquiUQ . Excited to share this paper, comments welcome :D
Data-sparse settings such as robotic manipulation, molecular physics, and galaxy morphology classification are some of the hardest domains for deep learning. For these problems, equivariant networks can help improve modeling across undersampled parts of the input space, and uncertainty estimation can guard against overconfidence. However, until now, the relationships between equivariance and model confidence, and more generally equivariance and model calibration, has yet to be studied. Since traditional classification and regression error terms show up in the definitions of calibration error, it is natural to suspect that previous work can be used to help understand the relationship between equivariance and calibration error. In this work, we present a theory relating equivariance to uncertainty estimation. By proving lower and upper bounds on uncertainty calibration errors (ECE and ENCE) under various equivariance conditions, we elucidate the generalization limits of equivariant models and illustrate how symmetry mismatch can result in miscalibration in both classification and regression. We complement our theoretical framework with numerical experiments that clarify the relationship between equivariance and uncertainty using a variety of real and simulated datasets, and we comment on trends with symmetry mismatch, group size, and aleatoric and epistemic uncertainties.
Ayush Sawarni, Jikai Jin, Justin Whitehouse, Vasilis Syrgkanis
Comments Accepted at AISTATS 2025
Policy learning algorithms are widely used in areas such as personalized medicine and advertising to develop individualized treatment regimes. However, most methods force a decision even when predictions are uncertain, which is risky in high-stakes settings. We study policy learning with abstention, where a policy may defer to a safe default or an expert. When a policy abstains, it receives a small additive reward on top of the value of a random guess. We propose a two-stage learner that first identifies a set of near-optimal policies and then constructs an abstention rule from their disagreements. We establish fast O(1/n)-type regret guarantees when propensities are known, and extend these guarantees to the unknown-propensity case via a doubly robust (DR) objective. We further show that abstention is a versatile tool with direct applications to other core problems in policy learning: it yields improved guarantees under margin conditions without the common realizability assumption, connects to distributionally robust policy learning by hedging against small data shifts, and supports safe policy improvement by ensuring improvement over a baseline policy with high probability.
Mariona Jaramillo-Civill, Peng Wu, Pau Closas
Comments Accepted at ICASSP 2026; 5 pages, 2 figures
Clustered Federated Learning (CFL) improves performance under non-IID client heterogeneity by clustering clients and training one model per cluster, thereby balancing between a global model and fully personalized models. However, most CFL methods require the number of clusters K to be fixed a priori, which is impractical when the latent structure is unknown. We propose DPMM-CFL, a CFL algorithm that places a Dirichlet Process (DP) prior over the distribution of cluster parameters. This enables nonparametric Bayesian inference to jointly infer both the number of clusters and client assignments, while optimizing per-cluster federated objectives. This results in a method where, at each round, federated updates and cluster inferences are coupled, as presented in this paper. The algorithm is validated on benchmark datasets under Dirichlet and class-split non-IID partitions.
Yevgeny Seldin
Learning, whether natural or artificial, is a process of selection. It starts with a set of candidate options and selects the more successful ones. In the case of machine learning the selection is done based on empirical estimates of prediction accuracy of candidate prediction rules on some data. Due to randomness of data sampling the empirical estimates are inherently noisy, leading to selection under uncertainty. The book provides statistical tools to obtain theoretical guarantees on the outcome of selection under uncertainty. We start with concentration of measure inequalities, which are the main statistical instrument for controlling how much an empirical estimate of expectation of a function deviates from the true expectation. The book covers a broad range of inequalities, including Markov's, Chebyshev's, Hoeffding's, Bernstein's, Empirical Bernstein's, Unexpected Bernstein's, kl, and split-kl. We then study the classical (offline) supervised learning and provide a range of tools for deriving generalization bounds, including Occam's razor, Vapnik-Chervonenkis analysis, and PAC-Bayesian analysis. The latter is further applied to derive generalization guarantees for weighted majority votes. After covering the offline setting, we turn our attention to online learning. We present the space of online learning problems characterized by environmental feedback, environmental resistance, and structural complexity. A common performance measure in online learning is regret, which compares performance of an algorithm to performance of the best prediction rule in hindsight, out of a restricted set of prediction rules. We present tools for deriving regret bounds in stochastic and adversarial environments, and under full information and bandit feedback.
Ya Zhou, Yujie Yang, Xiaohan Fan, Wei Zhao
Comments A Transformer-based model is used as an example; the proposed post-training strategy may also be applicable to CNN-based models. The manuscript is currently under review
ECG foundation models are increasingly popular due to their adaptability across various tasks. However, their clinical applicability is often limited by performance gaps compared to task-specific models, even after pre-training on large ECG datasets and fine-tuning on target data. This limitation is likely due to the lack of an effective post-training strategy. In this paper, we propose a simple yet effective post-training approach to enhance ECG foundation models. We evaluate it on a publicly available Transformer-based foundation model. Experiments across multiple ECG tasks show that our method consistently outperforms baseline fine-tuning. On the PTB-XL benchmarks, it improves macro AUROC by 0.7%-8.9% and macro AUPRC by 23.3%-77.9%, also outperforming several recent state-of-the-art approaches, including task-specific and advanced architectures. Further analyses demonstrate improved training dynamics and data efficiency, with only 30% of the training data outperforming the baseline trained on the full dataset. Ablation studies highlight the importance of stochastic depth and preview linear probing. These findings underscore the potential of post-training strategies to improve ECG foundation models, and we hope this work will contribute to the continued development of foundation models in the ECG domain.
Nicolas Chatzikiriakos, Kevin Jamieson, Andrea Iannelli
Comments Accepted for spotlight presentation at AISTATS 2026
In this work, we consider the problem of identifying an unknown linear dynamical system given a finite hypothesis class. In particular, we analyze the effect of the excitation input on the sample complexity of identifying the true system with high probability. To this end, we present sample complexity lower bounds that capture the choice of the selected excitation input. The sample complexity lower bound gives rise to a system theoretic condition to determine the potential benefit of experiment design. Informed by the analysis of the sample complexity lower bound, we propose a persistent excitation (PE) condition tailored to the considered setting, which we then use to establish sample complexity upper bounds. Notably, the PE condition is weaker than in the case of an infinite hypothesis class and allows analyzing different excitation inputs modularly. Crucially, the lower and upper bounds share the same dependency on key problem parameters. Finally, we leverage these insights to propose an active learning algorithm that sequentially excites the system optimally with respect to the current estimate, and provide sample complexity guarantees for the presented algorithm. Concluding simulations showcase the effectiveness of the proposed algorithm.
Jung-hun Kim, Min-hwan Oh
We introduce a bandit framework for stochastic matching under the multinomial logit (MNL) choice model. In our setting, $N$ agents on one side are assigned to $K$ arms on the other side, where each arm stochastically selects an agent from its assigned pool according to unknown preferences and yields a corresponding reward over a horizon $T$. The objective is to minimize regret by maximizing the cumulative revenue from successful matches. A naive approach requires solving an NP-hard combinatorial optimization problem at every round, resulting in a prohibitive computational cost. To address this challenge, we propose batched algorithms that strategically limit the number of times matching assignments are updated to $Θ(\log\log T)$ over the entire horizon. By invoking expensive combinatorial optimization only on a vanishing fraction of rounds, our algorithms substantially reduce overall computational overhead while still achieving a regret bound of $\widetilde{\mathcal{O}}(\sqrt{T})$.
Marcel Hedman, Desi R. Ivanova, Cong Guan, Tom Rainforth
Comments Accepted at Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025
Journal ref Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:22904-22923, 2025
We develop a semi-amortized, policy-based, approach to Bayesian experimental design (BED) called Stepwise Deep Adaptive Design (Step-DAD). Like existing, fully amortized, policy-based BED approaches, Step-DAD trains a design policy upfront before the experiment. However, rather than keeping this policy fixed, Step-DAD periodically updates it as data is gathered, refining it to the particular experimental instance. This test-time adaptation improves both the flexibility and the robustness of the design strategy compared with existing approaches. Empirically, Step-DAD consistently demonstrates superior decision-making and robustness compared with current state-of-the-art BED methods.
Chengyuan Zhang, Zhengbing He, Cathy Wu, Lijun Sun
Comments Accepted to ISTTT26
Modeling car-following behavior is fundamental to microscopic traffic simulation, yet traditional deterministic models often fail to capture the full extent of variability and unpredictability in human driving. While many modern approaches incorporate context-aware inputs (e.g., spacing, speed, relative speed), they frequently overlook structured stochasticity that arises from latent driver intentions, perception errors, and memory effects -- factors that are not directly observable from context alone. To fill the gap, this study introduces an interpretable stochastic modeling framework that captures not only context-dependent dynamics but also residual variability beyond what context can explain. Leveraging deep neural networks integrated with nonstationary Gaussian processes (GPs), our model employs a scenario-adaptive Gibbs kernel to learn dynamic temporal correlations in acceleration decisions, where the strength and duration of correlations between acceleration decisions evolve with the driving context. This formulation enables a principled, data-driven quantification of uncertainty in acceleration, speed, and spacing, grounded in both observable context and latent behavioral variability. Comprehensive experiments on the naturalistic vehicle trajectory dataset collected from the German highway, i.e., the HighD dataset, demonstrate that the proposed stochastic simulation method within this framework surpasses conventional methods in both predictive performance and interpretable uncertainty quantification. The integration of interpretability and accuracy makes this framework a promising tool for traffic analysis and safety-critical applications.
Luca Bindini, Lorenzo Perini, Stefano Nistri, Jesse Davis, Paolo Frasconi
Comments Published in Transactions on Machine Learning Research (TMLR), January 2026. See https://openreview.net/forum?id=yLoXQDNwwa for the official submission
Contextual anomaly detection (CAD) aims to identify anomalies in a target (behavioral) variable conditioned on a set of contextual variables that influence the normalcy of the target variable but are not themselves indicators of anomaly. In many anomaly detection tasks, there exist contextual variables that influence the normalcy of the target variable but are not themselves indicators of anomaly. In this work, we propose a novel framework for CAD, normalcy score (NS), that explicitly models both the aleatoric and epistemic uncertainties. Built on heteroscedastic Gaussian process regression, our method regards the Z-score as a random variable, providing confidence intervals that reflect the reliability of the anomaly assessment. Through experiments on benchmark datasets and a real-world application in cardiology, we demonstrate that NS outperforms state-of-the-art CAD methods in both detection accuracy and interpretability. Moreover, confidence intervals enable an adaptive, uncertainty-driven decision-making process, which may be very important in domains such as healthcare.
Ruijia Zhang, Xiangyu Zhang, Zhengling Qi, Yue Wu, Yanxun Xu
Dynamic treatment regimes (DTRs) provide a principled framework for optimizing sequential decision-making in domains where decisions must adapt over time in response to individual trajectories, such as healthcare, education, and digital interventions. However, existing statistical methods often rely on strong positivity assumptions and lack robustness under partial data coverage, while offline reinforcement learning approaches typically focus on average training performance, lack statistical guarantees, and require solving complex optimization problems. To address these challenges, we propose POLAR, a novel pessimistic model-based policy learning algorithm for offline DTR optimization. POLAR estimates the transition dynamics from offline data and quantifies uncertainty for each history-action pair. A pessimistic penalty is then incorporated into the reward function to discourage actions with high uncertainty. Unlike many existing methods that focus on average training performance or provide guarantees only for an oracle policy, POLAR directly targets the suboptimality of the final learned policy and offers theoretical guarantees, without relying on computationally intensive minimax or constrained optimization procedures. To the best of our knowledge, POLAR is the first model-based DTR method to provide both statistical and computational guarantees, including finite-sample bounds on policy suboptimality. Empirical results on both synthetic data and the MIMIC-III dataset demonstrate that POLAR outperforms state-of-the-art methods and yields near-optimal, history-aware treatment strategies.
Yueqi Cao, Anthea Monod
Comments 26 pages, 11 figures
We introduce the first graph kernels for metric graphs via tropical algebraic geometry. In contrast to conventional graph kernels based on graph combinatorics such as nodes, edges, and subgraphs, our metric graph kernels are purely based on the geometry and topology of the underlying metric space. A key characterizing property of our construction is its invariance under edge subdivision, making the kernels intrinsically well-suited for comparing graphs representing different underlying metric spaces. We develop efficient algorithms to compute our kernels and analyze their complexity, which depends primarily on the genus of the input graphs rather than their size. Through experiments on synthetic data and selected real-world datasets, we demonstrate that our kernels capture complementary geometric and topological information overseen by standard combinatorial approaches, particularly in label-free settings. We further showcase their practical utility with an urban road network classification task.
Joshua Corneck, Edward A. K. Cohen, James S. Martin, Lekha Patel, Kurtis W. Shuler, Francesco Sanna Passino
Journal ref Journal of Complex Networks, Volume 14, Issue 1, February 2026, cnaf056
Understanding both global and layer-specific group structures is useful for uncovering complex patterns in networks with multiple interaction types. In this work, we introduce a new model, the hierarchical multiplex stochastic blockmodel (HMPSBM), that simultaneously detects communities within individual layers of a multiplex network while inferring a global node clustering across the layers. A stochastic blockmodel is assumed in each layer, with probabilities of layer-level group memberships determined by a node's global group assignment. Our model uses a Bayesian framework, employing a probit stick-breaking process to construct node-specific mixing proportions over a set of shared Griffiths-Engen-McCloseky (GEM) distributions. These proportions determine layer-level community assignment, allowing for an unknown and varying number of groups across layers, while incorporating nodal covariate information to inform the global clustering. We propose a scalable variational inference procedure with parallelisable updates for application to large networks. Extensive simulation studies demonstrate our model's ability to accurately recover both global and layer-level clusters in complicated settings, and applications to real data showcase the model's effectiveness in uncovering interesting latent network structure.
Jiarong Wu, Pavel Perezhogin, David John Gagne, Brandon Reichl, Aneesh C. Subramanian, Elizabeth Thompson, Laure Zanna
Comments add zenodo link
Accurately quantifying air-sea fluxes is important for understanding air-sea interactions and improving coupled weather and climate systems. This study introduces a probabilistic framework to represent the highly variable nature of air-sea fluxes, which is missing in deterministic bulk algorithms. Assuming Gaussian distributions conditioned on the input variables, we use artificial neural networks and eddy-covariance measurement data to estimate the mean and variance by minimizing negative log-likelihood loss. The trained neural networks provide alternative mean flux estimates to existing bulk algorithms, and quantify the uncertainty around the mean estimates. Stochastic parameterization of air-sea turbulent fluxes can be constructed by sampling from the predicted distributions. Tests in a single-column forced upper-ocean model suggest that changes in flux algorithms influence sea surface temperature and mixed layer depth seasonally. The ensemble spread in stochastic runs is most pronounced during spring restratification.
Attila Miklós Ámon, Kristian Fenech, Péter Kovács, Tamás Dózsa
Comments Submitted to IEEE Transactions on Signal Processing, 2024 (under review)
In this paper we consider the continuous wavelet transform using Gaussian wavelets multiplied by an appropriate rational term. The zeros and poles of this rational modifier act as free parameters and their choice highly influences the shape of the mother wavelet. This allows the proposed construction to approximate signals with complex morphology using only a few wavelet coefficients. We show that the proposed rational Gaussian wavelets are admissible and provide numerical approximations of the wavelet coefficients using variable projection operators. In addition, we show how the proposed variable projection based rational Gaussian wavelet transform can be used in neural networks to obtain a highly interpretable feature learning layer. We demonstrate the effectiveness of the proposed scheme through a biomedical application, namely, the detection of ventricular ectopic beats (VEBs) in real ECG measurements.
Fengyi Li, Ricardo Baptista, Youssef Marzouk
Computing expected information gain (EIG) from prior to posterior (equivalently, mutual information between candidate observations and model parameters or other quantities of interest) is a fundamental challenge in Bayesian optimal experimental design. We formulate flexible transport-based schemes for EIG estimation in general nonlinear/non-Gaussian settings, compatible with both standard and implicit Bayesian models. These schemes are representative of two-stage methods for estimating or bounding EIG using marginal and conditional density estimates. In this setting, we analyze the optimal allocation of samples between training (density estimation) and approximation of the outer prior expectation. We show that with this optimal sample allocation, the mean squared error (MSE) of the resulting EIG estimator converges more quickly than that of a standard nested Monte Carlo scheme. We then address the estimation of EIG in high dimensions, by deriving gradient-based upper bounds on the mutual information lost by projecting the parameters and/or observations to lower-dimensional subspaces. Minimizing these upper bounds yields projectors and hence low-dimensional EIG approximations that outperform approximations obtained via other linear dimension reduction schemes. Numerical experiments on a PDE-constrained Bayesian inverse problem also illustrate a favorable trade-off between dimension truncation and the modeling of non-Gaussianity, when estimating EIG from finite samples in high dimensions.
Charles J. Wolock, Susan Jacob, Julia C. Bennett, Anna Elias-Warren, Jessica O'Hanlon, Avi Kenny, Nicholas P. Jewell, Andrea Rotnitzky, Stephen R. Cole, Ana A. Weil, Helen Y. Chu, Marco Carone
Comments The first two authors contributed equally to this work. Main text: 22 pages, 2 figure, 4 tables. Supplement: 23 pages, 14 figures, 0 tables. This update (v4) fixes a typo in the displayed equation on page 7. This typo did not impact the implementation of the method in the accompanying software package
Journal ref Epidemiology 36(5) (2025)
For infectious diseases, characterizing symptom duration is of clinical and public health importance. Symptom duration may be assessed by surveying infected individuals and querying symptom status at the time of survey response. For example, in a SARS-CoV-2 testing program at the University of Washington, participants were surveyed at least $28$ days after testing positive and asked to report current symptom status. This study design yielded current status data: outcome measurements for each respondent consisted only of the time of survey response and a binary indicator of whether symptoms had resolved by that time. Such study design benefits from limited risk of recall bias, but analyzing the resulting data necessitates tailored statistical tools. Here, we review methods for current status data and describe a novel application of modern nonparametric techniques to this setting. The proposed approach is valid under weaker assumptions compared to existing methods, allows use of flexible machine learning tools, and handles potential survey nonresponse. From the university study, under an assumption that the survey response time is conditionally independent of symptom resolution time within strata of measured covariates, we estimate that 19% of participants experienced ongoing symptoms 30 days after testing positive, decreasing to 7% at 90 days. We assess the sensitivity of these results to deviations from conditional independence, finding the estimates to be more sensitive to assumption violations at 30 days compared to 90 days. Female sex, fatigue during acute infection, and higher viral load were associated with slower symptom resolution.
Yuhang Liu, Zhen Zhang, Dong Gong, Erdun Gao, Biwei Huang, Mingming Gong, Anton van den Hengel, Kun Zhang, Javen Qinfeng Shi
Causal representation learning (CRL) offers the promise of uncovering the underlying causal model by which observed data was generated, but the practical applicability of existing methods remains limited by the strong assumptions required for identifiability and by challenges in applying them to real-world settings. Most current approaches are applicable only to relatively restrictive model classes, such as linear or polynomial models, which limits their flexibility and robustness in practice. One promising approach to this problem seeks to address these issues by leveraging changes in causal influences among latent variables. In this vein we propose a more general and relaxed framework than typically applied, formulated by imposing constraints on the function classes applied. Within this framework, we establish partial identifiability results under weaker conditions, including scenarios where only a subset of causal influences change. We then extend our analysis to a broader class of latent post-nonlinear models. Building on these theoretical insights, we develop a flexible method for learning latent causal representations. We demonstrate the effectiveness of our approach on synthetic and semi-synthetic datasets, and further showcase its applicability in a case study on human motion analysis, a complex real-world domain that also highlights the potential to broaden the practical reach of identifiable CRL models.
Berivan Isik, Natalia Ponomareva, Hussein Hazimeh, Dimitris Paparas, Sergei Vassilvitskii, Sanmi Koyejo
Comments Published at the International Conference on Learning Representations (ICLR) 2025, with title: "Scaling Laws for Downstream Task Performance in Machine Translation"
Scaling laws provide important insights that can guide the design of large language models (LLMs). Existing work has primarily focused on studying scaling laws for pretraining (upstream) loss. However, in transfer learning settings, in which LLMs are pretrained on an unsupervised dataset and then finetuned on a downstream task, we often also care about the downstream performance. In this work, we study the scaling behavior in a transfer learning setting, where LLMs are finetuned for machine translation tasks. Specifically, we investigate how the choice of the pretraining data and its size affect downstream performance (translation quality) as judged by: downstream cross-entropy and translation quality metrics such as BLEU and COMET scores. Our experiments indicate that the size of the finetuning dataset and the distribution alignment between the pretraining and downstream data significantly influence the scaling behavior. With sufficient alignment, both downstream cross-entropy and translation quality scores improve monotonically with more pretraining data. In such cases, we show that it is possible to predict the downstream translation quality metrics with good accuracy using a log-law. However, there are cases where moderate misalignment causes the downstream translation scores to fluctuate or get worse with more pretraining, whereas downstream cross-entropy monotonically improves. By analyzing these, we provide new practical insights for choosing appropriate pretraining data.
Siran Li
Comments 9 pages
Journal ref Bull. Lond. Math. Soc. 56 (2024), no. 1, 263-273
We show that the Gaussian kernel $\exp\left\{-λd_g^2(\bullet, \bullet)\right\}$ on any non-simply-connected closed Riemannian manifold $(\mathcal{M},g)$, where $d_g$ is the geodesic distance, is not positive definite for any $λ> 0$, combining analyses in the recent preprint~[9] by Da Costa--Mostajeran--Ortega and classical comparison theorems in Riemannian geometry.
Hsueh-Han Huang, Ching-Kang Ing, Shu-Hui Yu
Comments This work is being withdrawn due to significant revisions
We establish a negative moment bound for the sample autocovariance matrix of a stationary process driven by conditional heteroscedastic errors. This moment bound enables us to asymptotically express the mean squared prediction error (MSPE) of the least squares predictor as the sum of three terms related to model complexity, model misspecification, and conditional heteroscedasticity. A direct application of this expression is the development of a model selection criterion that can asymptotically identify the best (in the sense of MSPE) subset AR model in the presence of misspecification and conditional heteroscedasticity. Finally, numerical simulations are conducted to confirm our theoretical results.
Gwendal Debaussart-Joniec, Harry Sevi, Matthieu Jonckheere, Argyris Kalogeratos
Vertex-level clustering for directed graphs (digraphs) remains challenging as edge directionality breaks the key assumptions underlying popular spectral methods, which also incur the overhead of eigen-decomposition. This paper proposes Parametrized Power-Iteration Clustering (ParPIC), a random-walk-based clustering method for weakly connected digraphs. This builds over the Power-Iteration Clustering paradigm, which uses the rows of the iterated diffusion operator as a data embedding. ParPIC has three important features: the use of parametrized reversible random walk operators, the automatic tuning of the diffusion time, and the efficient truncation of the final embedding, which produces low-dimensional data representations and reduces complexity. Empirical results on synthetic and real-world graphs demonstrate that ParPIC achieves competitive clustering accuracy with improved scalability relative to spectral and teleportation-based methods.
Xiaochun Meng, James W. Taylor, Souhaib Ben Taieb, Siran Li
Journal ref Oper. Res. 73 (2025), no. 1, 344-362
Forecasts of multivariate probability distributions are required for a variety of applications. Scoring rules enable the evaluation of forecast accuracy, and comparison between forecasting methods. We propose a theoretical framework for scoring rules for multivariate distributions, which encompasses the existing quadratic score and multivariate continuous ranked probability score. We demonstrate how this framework can be used to generate new scoring rules. In some multivariate contexts, it is a forecast of a level set that is needed, such as a density level set for anomaly detection or the level set of the cumulative distribution as a measure of risk. This motivates consideration of scoring functions for such level sets. For univariate distributions, it is well-established that the continuous ranked probability score can be expressed as the integral over a quantile score. We show that, in a similar way, scoring rules for multivariate distributions can be decomposed to obtain scoring functions for level sets. Using this, we present scoring functions for different types of level set, including density level sets and level sets for cumulative distributions. To compute the scores, we propose a simple numerical algorithm. We perform a simulation study to support our proposals, and we use real data to illustrate usefulness for forecast combining and CoVaR estimation.
扫码添加微信好友,提出您的宝贵建议 👇
💡 备注请填写:网站反馈