Teacher Forcing as Generalized Bayes: Optimization Geometry Mismatch in Switching Surrogates for Chaotic Dynamics
Comments Presented at the Workshop on Optimization and Post-Bayesian Inference in Machine Learning, AISTATS 2026
Andre Herz, Daniel Durstewitz, Georgia Koppe
Identity teacher forcing (ITF) enables stable training of deterministic recurrent surrogates for chaotic dynamical systems and has been highly effective for dynamical systems reconstruction (DSR) with recurrent neural networks (RNNs), including interpretable almost-linear RNNs (AL-RNNs). However, as an intervention-based prediction loss (and thus a generalized Bayes update), teacher forcing need not match the free-running model's marginal likelihood geometry. We compare the objective-induced curvatures of ITF and marginal likelihood in a probabilistic switching augmentation of AL-RNNs, estimating ambiguity-aware observed information via Louis' identity. In the switching setting studied here, conditioning on a single forced regime path (as ITF does) inflates curvature, while marginal likelihood curvature is reduced by a missing-information correction when multiple switching explanations remain plausible. In Lorenz-63 experiments, windowed evidence fine-tuning improves held-out evidence but can degrade dynamical quantities of interest (QoIs) relative to ITF-pretrained models.
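For orientation, Louis' identity in its standard form can be written as below. The notation is an assumption made here, not taken from the paper: $x$ denotes the observations, $z$ the latent switching path, and $\theta$ the model parameters.

$$
I_{\mathrm{obs}}(\theta) \;=\; \mathbb{E}_{z \mid x, \theta}\!\left[ -\nabla_\theta^2 \log p(x, z \mid \theta) \right] \;-\; \mathrm{Cov}_{z \mid x, \theta}\!\left[ \nabla_\theta \log p(x, z \mid \theta) \right]
$$

The second term is the missing-information correction. It vanishes when the posterior over $z$ collapses onto a single path, which is consistent with the abstract's statement that conditioning on one forced regime path yields larger curvature than the marginal likelihood does when several switching explanations remain plausible.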
Adriano Zanin Zambom, Beck Saunders
In this paper we develop a consistent variable selection procedure for GARCH-X models that identifies the truly relevant exogenous covariates influencing volatility dynamics. The proposed method is based on a multiple hypothesis testing framework with Wald-type test statistics and the Benjamini-Yekutieli False Discovery Rate (FDR) procedure to control the proportion of false discoveries. We establish the consistency of the selection rule, showing that it asymptotically recovers the correct set of covariates as the sample size increases. Monte Carlo simulations across different distributions and dependence structures validate the method's accuracy and robustness. The procedure is applied to modeling the volatility of the S&P 500 using macroeconomic and commodity indicators.
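As a rough illustration of the Benjamini-Yekutieli step-up rule mentioned above, the sketch below applies it to a list of p-values. It is a generic textbook version, not the authors' implementation: the function name, the level `q`, and the example p-values are all hypothetical, and the Wald-type statistics that would produce the p-values in a GARCH-X setting are not shown.

```python
def benjamini_yekutieli(p_values, q=0.05):
    """Return a list of booleans, True where the hypothesis is rejected.

    Generic Benjamini-Yekutieli step-up rule at level q (illustrative only).
    """
    m = len(p_values)
    if m == 0:
        return []
    # Harmonic correction c(m) = 1 + 1/2 + ... + 1/m, which keeps the
    # FDR bound valid under arbitrary dependence between the tests.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    # Indices of the hypotheses, ordered by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k such that p_(k) <= k * q / (m * c(m)).
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / (m * c_m):
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for five candidate covariates.
selected = benjamini_yekutieli([0.001, 0.8, 0.004, 0.03, 0.5], q=0.05)
print(selected)  # [True, False, True, False, False]
```

Covariates flagged `True` would be the ones retained by such a selection rule.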
Shuning Shang, Hubert Strauss, Stanley Wei, Sanjeev Arora, Noam Razin
Comments Code available at https://github.com/princeton-pli/imperfect-rewards
Training language models via reinforcement learning often relies on imperfect proxy rewards, since ground truth rewards that precisely define the intended behavior are rarely available. Standard metrics for assessing the quality of proxy rewards, such as ranking accuracy, treat incorrect rewards as strictly harmful. In this work, however, we highlight that not all deviations from the ground truth are equal. By theoretically analyzing which outputs attract probability during policy gradient optimization, we categorize reward errors according to their effect on the increase in ground truth reward. The analysis establishes that reward errors, though conventionally viewed as harmful, can also be benign or even beneficial by preventing the policy from stalling around outputs with mediocre ground truth reward. We then present two practical implications of our theory. First, for reinforcement learning from human feedback (RLHF), we develop reward model evaluation metrics that account for the harmfulness of reward errors. Compared to standard ranking accuracy, these metrics typically correlate better with the performance of a language model after RLHF, yet gaps remain in robustly evaluating reward models. Second, we provide insights for reward design in settings with verifiable rewards. A key theme underlying our results is that the effectiveness of a proxy reward function depends heavily on its interaction with the initial policy and learning algorithm.
Austin Brown
This article extends weak convergence bounds of Markov transition kernels to convergence bounds on the variance of the Markov kernel applied to Lipschitz functions. In the reversible case, weak convergence rates of the transition kernels imply chi-squared divergence convergence bounds if the density of the initialization measure is Lipschitz. These results provide new tools to establish central limit theorems for Lipschitz functions used in Markov chain Monte Carlo simulations. Applications are explored to the stability of Metropolis-Hastings algorithms in high dimensions, stochastic gradient descent, and solutions to stochastic delay equations.
Zhu Guojun, Zhang Sanguo, Ren Mingyang
Comments 35 pages, 4 figures
Label noise presents a fundamental challenge in modern machine learning, especially when large-scale datasets are generated via automated processes. An increasingly common and important data paradigm, particularly in domains like medical imaging, involves learning from a large dataset with coarse, noisy labels supplemented by a small, expert-verified, clean dataset. This setting constitutes a typical information transfer and fusion problem. However, the significant distribution shift between the noisy and clean data violates the core overall parametric similarity assumptions of existing statistical transfer learning methods, while their reliance on parametric models is ill-suited for complex data like images. To address these limitations, this paper develops a generic model-agnostic nonparametric framework for classification with label noise, which applies to a broad class of classifiers. Our approach leverages the small clean dataset to ``purify'' the large noisy one and carefully manages the remaining ambiguous samples. This framework is underpinned by a rigorous statistical theory. Its empirical performance is demonstrated through simulations and a real-world application to medical image analysis for pneumonia diagnosis.
Ifeanyi Ezuma, Olusiji Medaiyese
Comments 12 pages, 7 figures, 3 tables. Preprint manuscript
Magnification shift is a major obstacle to robust histopathology classification, because models trained on one imaging scale often generalize poorly to another. Here, we evaluated this problem on the BreaKHis dataset using a strict patient-disjoint leave-one-magnification-out protocol, comparing supervised baseline, baseline augmented with DCGAN-generated patches, and a gradient-reversal domain-general model designed to preserve discriminative information while suppressing magnification-specific variation. Across held-out magnifications, the domain-general model achieved the strongest overall discrimination and its clearest gain was observed when 200X was held out. By contrast, GAN augmentation produced inconsistent effects, improving some folds but degrading others, particularly at 400X. The domain-general model also yielded the lowest Brier score at 0.063 vs 0.089 at baseline. Sparse embedding analysis further revealed that domain-general training reduced average signature size more than three-fold (306 versus 1,074 dimensions) while preserving equivalent predictive performance (AUC: 0.967 vs 0.965; F1: 0.930 vs 0.931). It also increased cross-fold signature reproducibility from near-zero Jaccard overlap in the baseline to 0.99 between the 100X and 200X folds. These findings show that calibrated, compact, and transferable representations can be learned without added architectural complexity, with clear implications for the reliable deployment of computational pathology models across heterogeneous acquisition settings.
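The two summary quantities reported above, the Brier score and the Jaccard overlap between feature signatures, follow standard definitions that can be sketched as follows. The function names and toy inputs are hypothetical and are not drawn from the BreaKHis experiments.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probabilities and 0/1 labels."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equally long")
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def jaccard_overlap(signature_a, signature_b):
    """Size of the intersection divided by size of the union of two index sets."""
    a, b = set(signature_a), set(signature_b)
    union = a | b
    if not union:
        # Convention chosen here: two empty signatures have zero overlap.
        return 0.0
    return len(a & b) / len(union)


# Toy predictions for three patches (hypothetical values).
print(brier_score([0.9, 0.2, 0.6], [1, 0, 1]))
# Toy signatures given as sets of embedding dimensions (hypothetical values).
print(jaccard_overlap({1, 2, 3}, {2, 3, 4}))  # 0.5
```

A lower Brier score indicates better calibrated probabilities, and a Jaccard overlap near 1 indicates that two folds select almost the same embedding dimensions.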
Ruqian Zhang, Juan Shen, Yijiao Zhang
The availability of data from multiple heterogeneous environments has motivated methods that remain reliable under distributional shifts. When the joint distribution of response and predictors varies across environments, the response may still depend on a subset of predictors through an invariant mechanism. Existing methods typically assess candidate invariant sets through pooled stability criteria, treating environmental variation as nuisance. In this paper, we propose a Bayesian framework that explicitly separates a shared response mechanism from environment-specific or response-dependent associations, exploiting heterogeneity as evidence for structure learning. A competitive spike-and-slab prior is designed to force each predictor to compete between invariant and non-invariant spurious effects. Under a tractable working model, we establish invariant model selection consistency and posterior contraction for invariant coefficients. We further study the presence of irrelevant predictors, characterize posterior concentration on an equivalent invariant class, and introduce a post-selection refinement that consistently recovers the minimal invariant model. Simulations and a real application illustrate the robustness and finite-sample efficiency of the proposed method.
Alex Stringer, Jeffrey Negrea
We test the hypothesis that simultaneous linear contrasts of multiple variance components equal zero in a Gaussian variance components model via a parametric bootstrap. Applications include but are not limited to nested and crossed designs. The main technical contributions are a computationally efficient decomposition of the normalized residual log-likelihood that does not require the variance components to be non-negative or variance design matrices to be positive semi-definite, a modified Newton method for its minimization, and a method for efficient optimization and sampling under the null hypothesis that certain linear combinations of variance components equal zero. A special case of the proposed procedure is a test for multiple variance components simultaneously equalling zero, for which a likelihood ratio test was not previously available. However, the proposed procedure is significantly more general.
Xianghao Meng, James L. Beck, Yong Huang, Hui Li
In the last few decades, Markov chain Monte Carlo (MCMC) methods have been widely applied to Bayesian updating of structural dynamic models in the field of structural health monitoring. Recently, several MCMC algorithms have been developed that incorporate neural networks to enhance their performance for specific Bayesian model updating problems. However, a common challenge with these approaches lies in the fact that the embedded neural networks often necessitate retraining when faced with new tasks, a process that is time-consuming and significantly undermines the competitiveness of these methods. This paper introduces a newly developed adaptive meta-learning stochastic gradient Hamiltonian Monte Carlo (AM-SGHMC) algorithm. The idea behind AM-SGHMC is to optimize the sampling strategy by training adaptive neural networks, and due to the adaptive design of the network inputs and outputs, the trained sampler can be directly applied to various Bayesian updating problems of the same type of structure without further training, thereby achieving meta-learning. Additionally, practical issues for the feasibility of the AM-SGHMC algorithm for structural dynamic model updating are addressed, and two examples involving Bayesian updating of multi-story building models with different model fidelity are used to demonstrate the effectiveness and generalization ability of the proposed method.
Georgy Sofronov, Joanna Rymaszewska, Krzysztof J. Szajowski
Comments 13 pages. Presented at the MATRIX Research Program: Probabilistic Models in Evolutionary Biology and Game Theory (6-10 January 2025)
Our research is closely related to ontological studies in mathematics. It provides crucial insights into the nature of decisions and strategies characterized by Markov moments. In a stopping game, a holistic decision-maker would evaluate comprehensive information by assessing the probabilities of various outcomes and their associated payoffs. This involves understanding the current state, historical data, and potential future scenarios. Such a decision-maker must also consider strategic interactions by anticipating and accounting for the strategies of other players. They must be flexible in adapting their strategy as the game evolves and able to integrate uncertainty by incorporating risk preferences and tolerances. They would perform scenario analysis to evaluate the impact of different stopping times under varying conditions. The goal of this modeling and its implementation in psychological practice is to introduce a novel method for assessing the state of players, leveraging deviations from rational strategies as diagnostic indicators of their psychological and decision-making profiles. The details of other models will be subject to contributed papers. The article presents the theoretical basis for combining various factors when modeling decision-making processes. The original title is "Rationality, Deviation, and Diagnosis: A Holistic Approach to Stopping Games" and will be used when it is possible to describe and interpret the results of the experiments we write about in the last section of the paper.
Johannes Brutsche, Lukas Riepl
Based on discrete observations, we develop a test to infer if the volatility function $σ(\cdot)$ within the nonparametric Gaussian white noise model $dY_t = σ(t)dW_t$ is constant. The testing procedure is shown to be minimax-optimal and adaptive for infill asymptotics and these results entail that a deviation from the null hypothesis of constancy is best measured in terms of the ratio of $σ(t)$ and its $L^2$-average. The derivation of optimal constants requires the construction of hypotheses with height $h(b)$, where the parameter $b$ solves $F_n(b)=0$ for given functions $F_n$. Proving this equation to be solvable for each $n\in\mathbb{N}$ and establishing quantitative bounds of the solutions is built upon the implicit function theorem.
Sharmin Afroz, Brendan Ames
Sparse Optimal Scoring (SOS) reformulates linear discriminant analysis to enable feature selection through elastic net regularization, making it well-suited for high-dimensional settings where the number of features exceeds observations. Most existing SOS methods use deflation-based strategies that compute discriminant vectors sequentially, which can propagate errors and produce suboptimal solutions. We propose a novel approach that estimates all discriminant vectors simultaneously under an explicit global orthogonality constraint, which we call Deflation-Free Sparse Optimal Scoring (DFSOS). DFSOS combines Bregman iteration with orthogonality-constrained optimization, decomposing the problem into tractable subproblems for scoring vectors, discriminant vectors, and orthogonality enforcement. We establish convergence to stationary points of the augmented Lagrangian under mild conditions. Extensive experiments using synthetic data and real-world time series data demonstrate that DFSOS achieves classification accuracy comparable to or better than existing deflation-based methods. These results indicate that deflation-free approaches offer a robust and effective framework for sparse discriminant analysis in high-dimensional problems.
Yuhe Bai, Chengli Tan, Jiaqi Li, Xiangjun Wang, Zhikun Zhang
Nonlinear dynamical systems with regime transitions are typically described by ordinary differential equations with jumping parameters. Traditional methods often treat change-point detection and parameter estimation as separate tasks, ignoring the inherent coupling between them. To address this, we propose residual-loss anomaly analysis of physics-informed neural networks, a unified framework that leverages dynamical consistency within the physics-informed learning paradigm. This approach jointly infers piecewise parameters and transition points under a single set of constraints. The method follows a two-stage strategy: First, local physical residuals are analyzed through overlapping subinterval decomposition. When a subinterval spans a true transition point, the residual exhibits a distinct structural elevation in noise-free conditions, which has a non-zero lower bound, enabling effective localization of potential transition intervals. Second, within our framework, change-point locations and piecewise parameters are integrated into a unified physical loss function for joint optimization, enabling simultaneous identification. Experiments on benchmark nonlinear dynamical systems, including Malthusian and logistic growth models, Van der Pol oscillator, Lotka-Volterra model and Lorenz system, demonstrate that the proposed method outperforms traditional decoupled approaches in both change-point localization and parameter estimation accuracy. This study provides an efficient, unified solution for structurally coupled inverse problems in nonlinear dynamical systems with regime switching.
Shakeel Gavioli-Akilagun, Yining Chen, Flavio Ziegelmann
Comments 48 pages, 10 figures
We study the problem of estimating locations in time at which the level of technology in an economy changes when given a sequence of time ordered inputs and outputs. We approach the problem through the lens of nonparametric frontier analysis with frontiers that expand sharply and globally over time, and develop an offline change point detection procedure which achieves the minimax localization rates for the problem at hand up to logarithmic factors. We additionally give a simple method for constructing confidence intervals for the unobserved change point locations. Finally, we explain how the procedure can be modified to accommodate local changes in technology, meaning that efficiency gains are only realized for certain combinations of inputs. Simulation studies and real data examples are also presented to illustrate the practical value of our methods.
Łukasz Brzozowski, Marek Gagolewski, Grzegorz Siudem
Generating realistic synthetic citation, patent, or component dependency networks is essential for benchmarking community detection, graph visualisation, and network data mining algorithms. We present the first systematic comparison of generators of directed graphs that are nearly acyclic and have a ground-truth community structure. We evaluate 12 methods across 7 real citation networks and 26 metrics. We propose the practice of reversing directions of edges in static generators to break cycles and induce a citation-like flow, which significantly improves the performance of a degree-corrected Stochastic Block Model. Our novel methodological approach to evaluating community detection benchmarks distinguishes between endogenous and exogenous mesoscopic similarities, with the latter proving more important. This distinction reveals that high-parameter models suffer from overfitting by memorising planted community statistics which lead to their failing to produce realistic networks. Finally, we introduce the Citation Seeder (CS) algorithm, an iterative generator grounded in the Price-Pareto model of citation networks, with interpretable parameters and O(N+E) runtime. CS achieves competitive results against the best-performing baselines while using up to four orders of magnitude fewer parameters and providing a clean framework for explaining and predicting a network's future growth.
Yihao Tan, Marianthi Markatou, Saptarshi Chakraborty
Postmarketing safety surveillance relies on data from spontaneous reporting systems (SRS) such as FAERS, EudraVigilance and VigiBase, and commonly uses SRS data mining methods to assess the associations between drugs and adverse events (AEs). Traditionally, these analyses have focused on signal detection framed as a binary decision problem, whereas more recent work has emphasized more nuanced inference involving signal strength estimation and uncertainty quantification. In this paper, we review contemporary SRS data mining approaches and their statistical underpinnings for safety assessment using data from major pharmacovigilance databases worldwide. In addition to methodological review, we provide practical guidance on data preprocessing for such analysis, including construction of SRS contingency tables using only aggregated AE-drug counts, as are publicly available from databases such as VigiBase and EudraVigilance. We illustrate the guidance via opioid-related datasets obtained from FAERS and VigiBase, coupled with subsequent downstream SRS data analyses.
Lukas Koch
Comments 13 pages, 10 figures, accepted manuscript, replaced references for PCA and parallel coordinates plots
A very common task in data visualization is to plot many data points with some measured y-value as a function of fixed x-values. Uncertainties on the y-values are typically presented as vertical error bars that represent either a Frequentist confidence interval or Bayesian credible interval for each data point. Most of the time, these error bars represent a 68\% confidence/credibility level, which leads to the intuition that a model fits the data reasonably well if its prediction lies within the error bars of roughly two thirds of the data points. Unfortunately, this and other intuitions no longer work when the uncertainties of the data points are correlated. If the error bars only show the square root of diagonal elements of some covariance matrix with non-negligible off-diagonal elements, we simply do not have enough information in the plot to judge whether a drawn model line agrees well with the data or not. In this paper we will demonstrate this problem and discuss ways to add more information to the plots to make it easier to judge the agreement between the data and some model prediction in the plot, as well as glean some insight where the model might be deficient. This is done by explicitly showing the contribution of the first principal component of the uncertainties, and by displaying the conditional uncertainties of all data points.
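The two quantities named at the end of this abstract have standard forms when the data are assumed jointly Gaussian with covariance matrix $\Sigma$. The notation below is chosen here for illustration and the paper's exact construction may differ. The conditional uncertainty of data point $i$, given all other points, is

$$
\sigma_{i \mid \mathrm{rest}} \;=\; \frac{1}{\sqrt{\left(\Sigma^{-1}\right)_{ii}}} \;\le\; \sqrt{\Sigma_{ii}},
$$

and, writing the eigendecomposition $\Sigma = \sum_k \lambda_k v_k v_k^{\top}$ with $\lambda_1 \ge \lambda_2 \ge \dots$, the shift of point $i$ attributable to a one-standard-deviation move along the first principal component is

$$
\delta_i^{(1)} \;=\; \sqrt{\lambda_1}\, v_{1,i}.
$$

For a diagonal $\Sigma$ the conditional uncertainty equals the usual error bar $\sqrt{\Sigma_{ii}}$, so any visible gap between the two signals non-negligible correlations.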
Alexandra Dache, Arnaud Vandaele, Nicolas Gillis
Comments 14 pages, 10 figures, code and data available from https://github.com/Alexia1305/OtrisymNMF_DCBM
Community detection is a fundamental task in data analysis, and block models provide an approach for identifying a wide variety of community structures while offering high interpretability. The degree-corrected block model (DCBM) is an established model that accounts for the heterogeneity of node degrees. However, inference methods are computationally costly and highly sensitive to initialization, while cheaper alternatives, such as spectral or modularity-based approaches, are restricted to detecting specific structures, typically assortative. In this work, we show that DCBM inference can be reformulated as a constrained nonnegative matrix factorization problem. Leveraging this insight, we propose a novel method for community detection and a theoretically well-grounded initialization strategy that provides an initial estimate of communities for inference algorithms. Our approach is agnostic to any specific network structure and applies to graphs with any structure representable by a DCBM. Experiments on synthetic and real benchmark networks show that our method detects communities comparable to those found by DCBM inference while being faster; for instance, it processes a graph with 100,000 nodes and 1,000,000 edges in approximately 4 minutes. Moreover, the proposed initialization strategy significantly improves solution quality and reduces the number of iterations required by all tested inference algorithms. Overall, this work provides a scalable and robust framework for community detection and highlights the benefits of a matrix-factorization perspective for the DCBM.
Simon Donker van Heel, Neil Shephard
We propose using a discounted version of a convex combination of the log-likelihood with the corresponding expected log-likelihood such that when they are maximized they yield a filter, predictor and smoother for time series. This paper then focuses on working out the implications of this in the case of the canonical exponential family. The results are simple exact filters, predictors and smoothers with linear recursions. A theory for these models is developed and the models are illustrated on simulated and real data.
Guo Liu
Comments 20 pages
The least absolute shrinkage and selection operator (Lasso) is a popular method for high-dimensional statistics. However, it is known that the Lasso often has estimation bias and prediction error. To address such disadvantages, many alternatives and refitting strategies have been proposed and studied. This work introduces a novel Lasso--Ridge method. Our analysis indicates that the proposed estimator achieves improved prediction performance in a range of settings, including cases where the Lasso is tuned at its theoretical optimal rate \(\sqrt{\log(p)/n}\). Moreover, the proposed method retains several key advantages of the Lasso, such as prediction consistency and reliable variable selection under mild conditions. Through extensive simulations, we further demonstrate that our estimator outperforms the Lasso in both prediction and estimation accuracy, highlighting its potential as a powerful tool for high-dimensional linear regression.
Łukasz Brzozowski, Marek Gagolewski, Grzegorz Siudem, Barbara Żogała-Siudem
We introduce a new analytical framework for modelling degree sequences in individual communities of real-world networks, e.g., citations to papers in different fields. Our work is inspired by a recent modification of Price's model, which assumes that citations are gained partly accidentally, and to some extent preferentially. Our work addresses the need to represent the heterogeneity of various scientific domains, as standard homogeneous models fail to capture the distinct growth ratios and citing cultures of different fields. Extending the model to networks with a community structure allows us to devise the analytical formulae for, amongst others, citation counts in each cluster and their inequality as described by the Gini index. We also show that a citation count distribution in each community tends to a Pareto type II distribution. Thanks to the derived model parameter estimators, the new model can be fitted to real citation and similar networks.
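For readers unfamiliar with the inequality measure mentioned above, the sketch below computes the plain sample Gini index of a vector of citation counts. It is the generic empirical formula, not the analytical expression derived in the paper, and the paper may use a differently normalised variant. The function name and the toy counts are hypothetical.

```python
def gini_index(counts):
    """Sample Gini index of non-negative counts (0 = perfectly equal)."""
    x = sorted(counts)
    n = len(x)
    total = sum(x)
    if n == 0 or total == 0:
        # Convention chosen here: no items or no citations means no inequality.
        return 0.0
    # G = sum_i (2i - n - 1) * x_(i) / (n * total), i = 1..n over sorted values.
    weighted = sum((2 * i - n - 1) * v for i, v in enumerate(x, start=1))
    return weighted / (n * total)


# Toy citation counts for papers in two hypothetical communities.
print(gini_index([1, 1, 1, 1]))    # 0.0
print(gini_index([0, 0, 0, 10]))   # 0.75
```

With this unnormalised formula the maximum attainable value for $n$ items is $(n-1)/n$, which is why the second toy example gives 0.75 rather than 1.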
Giuseppe Cavaliere, Iliyan Georgiev, Edoardo Zanelli
We consider bootstrap inference in predictive (or Granger-causality) regressions when the parameter of interest may lie on the boundary of the parameter space, here defined by means of a smooth inequality constraint. For instance, this situation occurs when the definition of the parameter space allows for the cases of either no predictability or sign-restricted predictability. We show that in this context constrained estimation gives rise to bootstrap statistics whose limit distribution is, in general, random, and thus distinct from the limit null distribution of the original statistics of interest. This is due to both (i) the possible location of the true parameter vector on the boundary of the parameter space, and (ii) the possible non-stationarity of the posited predicting (resp. Granger-causing) variable. We discuss a modification of the standard fixed-regressor wild bootstrap scheme where the bootstrap parameter space is shifted by a data-dependent function in order to eliminate the portion of limiting bootstrap randomness attributable to the boundary, and prove validity of the associated bootstrap inference under non-stationarity of the predicting variable as the only remaining source of limiting bootstrap randomness. Our approach, which is initially presented in a simple location model, has bearing on inference in parameter-on-the-boundary situations beyond the predictive regression problem.
Daniel Malinsky
We study the data-driven selection of causal graphical models using constraint-based algorithms, which determine the existence or non-existence of edges (causal connections) in a graph based on testing a series of conditional independence hypotheses. In settings where the ultimate scientific goal is to use the selected graph to inform estimation of some causal effect of interest (e.g., by selecting a valid and sufficient set of adjustment variables), we argue that a "cautious" approach to graph selection should control the probability of falsely removing edges and prefer dense, rather than sparse, graphs. We propose a simple inversion of the usual conditional independence testing procedure: to remove an edge, test the null hypothesis of conditional association greater than some user-specified threshold, rather than the null of independence. This equivalence testing formulation to testing independence constraints leads to a procedure with desirable statistical properties and behaviors that better match the inferential goals of certain scientific studies, for example observational epidemiological studies that aim to estimate causal effects in the face of causal model uncertainty. We illustrate our approach on a data example from environmental epidemiology.
Mikihito Nishi
Comments added figures for Section 2; corrected typos
In this study, we propose a test for the coefficient randomness in autoregressive models where the autoregressive coefficient is local to unity, which is empirically relevant given the results of earlier studies. Under this specification, we theoretically analyze the effect of the correlation between the random coefficient and disturbance on tests' properties, which remains largely unexplored in the literature. Our analysis reveals that the correlation crucially affects the power of tests for coefficient randomness and that tests proposed by earlier studies can perform poorly when the degree of the correlation is moderate to large. The test we propose in this paper is designed to have a power function robust to the correlation. Because the asymptotic null distribution of our test statistic depends on the correlation $ψ$ between the disturbance and its square as earlier tests do, we also propose a modified version of the test statistic such that its asymptotic null distribution is free from the nuisance parameter $ψ$. The modified test is shown to have better power properties than existing ones in large and finite samples.
Valentin De Bortoli, Agnès Desolneux
Laplace-type results characterize the limit of a sequence of measures $(π_\varepsilon)_{\varepsilon >0}$ with density w.r.t. the Lebesgue measure $(\mathrm{d} π_\varepsilon / \mathrm{d} \mathrm{Leb})(x) \propto \exp[-U(x)/\varepsilon]$ when the temperature $\varepsilon>0$ converges to $0$. If a limiting distribution $π_0$ exists, it concentrates on the minimizers of the potential $U$. Classical results require the invertibility of the Hessian of $U$ in order to establish such asymptotics. In this work, we study the particular case of norm-like potentials $U$ and establish quantitative bounds between $π_\varepsilon$ and $π_0$ w.r.t. the Wasserstein distance of order $1$ under an invertibility condition of a generalized Jacobian. One key element of our proof is the use of geometric measure theory tools such as the coarea formula. We apply our results to the study of maximum entropy models (microcanonical/macrocanonical distributions) and to the convergence of the iterates of the Stochastic Gradient Langevin Dynamics (SGLD) algorithm at low temperatures for non-convex minimization.
Alessandro Casini, Taosong Deng, Pierre Perron
We establish theoretical results about the low frequency contamination (i.e., long memory effects) induced by general nonstationarity for estimates such as the sample autocovariance and the periodogram, and deduce consequences for heteroskedasticity and autocorrelation robust (HAR) inference. We present explicit expressions for the asymptotic bias of these estimates. We distinguish cases where this contamination only occurs as a small-sample problem and cases where the contamination continues to hold asymptotically. We show theoretically that nonparametric smoothing over time is robust to low frequency contamination. Our results provide new insights on the debate between consistent versus inconsistent long-run variance (LRV) estimation. Existing LRV estimators tend to be inflated when the data are nonstationary. This results in HAR tests that can be undersized and exhibit dramatic power losses. Our theory indicates that long bandwidths or fixed-b HAR tests suffer more from low frequency contamination relative to HAR tests based on HAC estimators, whereas recently introduced double kernel HAC estimators do not suffer from this problem. Finally, we present second-order Edgeworth expansions under nonstationarity about the distribution of HAC and DK-HAC estimators and about the corresponding t-test in the linear regression model.
Die Gan, Siyu Xie, Zhixin Liu, Xuebo Zhang
Comments 13 pages, submitted to IEEE TAC
This paper studies the distributed adaptive estimation problems for stochastic large regression models with an infinite number of parameters. By constructing a recursive local cost function, we propose a novel distributed recursive least squares algorithm to estimate the unknown system parameters, where the growth rate of regressors' dimension is characterized by a non-decreasing positive function. The almost sure convergence of the proposed algorithm is established under a cooperative excitation condition, which incorporates the temporal information and the spatial information to reflect the cooperative effect among multiple agents. Moreover, we analyze the prediction error by establishing the asymptotic upper bound of the accumulated regret without any excitation conditions. The main difficulty of theoretical analysis lies in how to analyze properties of the product of non-independent and non-stationary random matrices, whose dimensions change over time simultaneously. Some techniques, such as stochastic Lyapunov function, double-array martingale theory and algebraic graph theory, are employed to deal with the above issue. Our theoretical results are derived without imposing independence or stationarity assumptions on the regression vectors, thereby not excluding the correlated feedback signals.
Beatrice Franzolini, Francesco Pozza
Comments 21 pages, 3 Figures
Posterior inference for Dirichlet process mixture models is analytically intractable and typically relies on Markov chain Monte Carlo methods, which can become computationally prohibitive at moderate to large sample sizes. In this work, we investigate the performance of Laplace and skew-Laplace posterior approximations for density estimation in this setting. Through an extensive numerical study covering four simulation scenarios with sample sizes ranging from n = 20 to n = 2,000 and four standard real datasets, we compare the standard Laplace approximation, its skew-corrected extension, and a slice sampling benchmark, assessing accuracy through total variation distance and computational efficiency through runtime. Our results show that the Gaussian Laplace approximation is more effective in this setting than might be anticipated, and that the skew-Laplace approximation consistently improves posterior recovery while remaining substantially faster than state-of-the-art Markov chain Monte Carlo samplers across all settings considered. In particular, the use of skew-Laplace in place of the standard Laplace approximation is especially beneficial in more complex density structures, where we observe error reductions typically on the order of 30%.
Nils Lid Hjort
Comments 11 pages, 5 figures. Statistical Research Report, Department of Mathematics, University of Oslo; will be submitted for publication
Sudoku puzzles have a long history, with variations going back more than a hundred years, but their current and perhaps surprising world-wide prominence goes back to certain initiatives and then puzzle-generating computer programmes from just after 2000. To solve a sudoku puzzle, a statistician can put up a probability model on the enormous space of $9\times9$ matrix possibilities, constructed to favour `good attempts', and then engineer a Markov chain to sample a long enough chain of sudoku table realisations from that model, until the solution is found. The methods work also for other types of puzzles, like constructing `magic squares' with wished-for properties (sums of rows, columns, diagonals equal, etc.), as is also illustrated in this article; via magic models and equally magic Markov chains I find impressively magic $8\times8$ and $10\times10$ squares.
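One way to picture a model that favours `good attempts' is an unnormalised weight $\exp(-λ\,\mathrm{penalty})$, where the penalty counts rule violations. The sketch below is only a guess at such a construction: the penalty definition, the parameter `lam` and all names are hypothetical and are not taken from the report.

```python
import math

def penalty(grid):
    """Count missing distinct digits over all rows, columns and 3x3 boxes."""
    units = [list(row) for row in grid]
    units += [[grid[r][c] for r in range(9)] for c in range(9)]
    units += [[grid[br + i][bc + j] for i in range(3) for j in range(3)]
              for br in (0, 3, 6) for bc in (0, 3, 6)]
    return sum(9 - len(set(unit)) for unit in units)

def weight(grid, lam=1.0):
    """Unnormalised probability exp(-lam * penalty); solved grids get weight 1."""
    return math.exp(-lam * penalty(grid))

# A valid grid built from a standard shifting pattern, and a copy with one cell spoiled.
solved = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]
broken = [row[:] for row in solved]
broken[0][0] = broken[0][1]
```

A Markov chain that proposes cell changes and accepts them according to ratios of such weights would spend most of its time near low-penalty tables.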
Siu-Ming Tam
Comments 27 pages, 8 tables
Tam [2026] shows that combining Bethel multivariate allocation with Hierarchical Bayes (HB) small area models can substantially reduce survey sample sizes while maintaining domain-level precision and near-nominal coverage of posterior credible intervals (CrIs). This paper extends that framework to cross-classified statistics derived from HB-calibrated unit record data. Its central contribution is a Post-Hoc Inference Engine (PHIE) that propagates uncertainty from HB domain posterior draws to arbitrary cross-tabulations. PHIE transforms each MCMC draw via chi-square calibration to produce replicate survey weights, from which CrIs are obtained. Three tiers of statistics are identified. Tier 1-E cells reproduce calibration totals and yield exact posterior CrIs. Tier 2 cells involve filtered sums of calibration variables; PHIE alone undercovers, but a Calibrated Bayes interval (CBI), augmenting PHIE with design-based compositional variance, restores near-nominal coverage. Tier 3-NCV cells involve non-calibration variables; a ratio-based CBI linked to a correlated calibration variable achieves reliable coverage even under weak correlation. A key empirical finding is that uncertainty in cross-tabulations is driven primarily by compositional sampling variability rather than HB model uncertainty. Resulting CBI-based coefficients of variation remain within standard publication thresholds.
Johannes Brutsche, Sebastian Hahn, Angelika Rohde
Based on discrete observations $X_0,X_Δ,\dots, X_{nΔ}$ for $Δ=n^{-γ}$ with $γ\in [0,1)$ of the null-recurrent dynamic $dX_t = σ(X_t)dW_t$ with a Brownian motion $W$ and $σ(x)=α\mathbb{1}\{x<ρ\} + β\mathbb{1}\{x\geq ρ\}$, we derive the rate of convergence and limiting distribution of the profile MLE for $ρ$. This includes low-frequency asymptotics ($γ=0$) for which the observations form a null-recurrent Markov chain. The derived non-standard limit is the argsup over a doubly stochastic drifted Poisson process explicitly involving the local time of oscillating Brownian motion. Its dependence on $ρ$ as well as the unknown volatility levels $α$ and $β$ is shown to be continuous w.r.t. the topology of weak convergence, enabling statistical inference. Whereas this limit is independent of the sampling frequency, the profile MLE's rate of convergence equals $n^{-(1+γ)/2}$ and is proven to be minimax optimal. The surprising idea of the proof of the limit theorem is to relate the long-term behavior of the null-recurrent Markov chain to the infill asymptotics on a fixed time interval. Indeed, in the very special case that $(X_t)_{t\geq 0}$ is started in the true parameter $X_0=ρ_0$, the process $(X_t-ρ_0)_{t\geq 0}$ is shown to possess a desirable distributional self-similarity. On the basis of the strong Markov property, the artificial constellation of starting in $ρ_0$ is finally overcome by a coupling argument.
Haowei Yuan
Comments 9 pages. Derives exact closed-form formulae for $P_c(N-1; N, w)$, $P_c(3; N, w)$, and $P(3; N, w)$
The continuous linear $P(k; N, w)$ and circular scan statistics $P_c(k; N, w)$ are fundamental tools in probability and spatial statistics, frequently used to detect clustering in uniform data. Let $X_1, X_2, \dots, X_N$ be independently and uniformly distributed random variables on a unit interval or unit ring. The exact distribution of these scan statistics relies on the minimum window width required to capture exactly $k$ points. Furthermore, the survival function $1 - P_c(k; N, w)$ directly corresponds to the geometric probability that if $N$ arcs of length $1 - w$ are uniformly and randomly placed on a unit circle, every point on the circle is covered at least $N + 1 - k$ times. Historically, evaluating the exact cumulative distribution functions, $P(k; N, w)$ and $P_c(k; N, w)$, relies heavily on complex recursive approximations. In this paper, we bypass these traditional recursive methods to derive direct, generalized closed-form expressions for some linear and circular continuous scan statistics. Specifically, we present the exact analytical solutions for $P_c(N - 1; N, w)$, $P_c(3; N, w)$, and $P(3; N, w)$ for arbitrary values of $N$ and window width $w$. These newly derived closed-form expressions not only provide exact baseline distributions for extreme spacings but also significantly simplify computational complexity compared to existing iterative approaches.
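The link between the scan statistic and the minimum window width can be sketched in a few lines. The code assumes the common convention that $P_c(k; N, w)$ is the probability that some arc of length $w$ contains at least $k$ of the $N$ points, which the abstract does not spell out; the function names are illustrative.

```python
def min_circular_window(points, k):
    """Smallest arc length on the unit circle that captures k of the given points."""
    xs = sorted(p % 1.0 for p in points)
    n = len(xs)
    # Unroll the circle once so that arcs wrapping past 1 are handled.
    extended = xs + [x + 1.0 for x in xs]
    return min(extended[i + k - 1] - extended[i] for i in range(n))

def circular_scan_event(points, k, w):
    """True when some arc of length w contains at least k points."""
    return min_circular_window(points, k) <= w

width = min_circular_window([0.0, 0.1, 0.5, 0.9], 3)  # arc from 0.9 through 0.0 to 0.1
```

Averaging `circular_scan_event` over many uniform samples would give a Monte Carlo estimate against which a closed-form expression could be checked.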
Julián Urbano
Comments 11 pages, 5 tables, 2 figures, ACM SIGIR 2026
In benchmarking of Information Retrieval systems, the Wilcoxon signed-rank test is often treated as a safer alternative to the t-test. This belief is fueled by textbooks and recommendations that portray Wilcoxon as the proper non-parametric alternative because metric scores are not normally distributed. We argue that this narrative is misleading and harmful. A careful review of Statistics textbooks reveals inconsistencies and omissions in how the assumptions underlying these tests are presented, fostering confusion that has propagated into IR research. As a result, Wilcoxon has been routinely misapplied for decades, creating a false sense of safety against a threat that was never there to begin with, while introducing another one so severe that it virtually guarantees the test will break down and mislead researchers. Through a combination of systematic literature review, analysis and empirical demonstrations with TREC data, we show how and why the Wilcoxon test easily loses control of its Type I error rate in IR settings. We conclude that the continued use of Wilcoxon in IR evaluation is unjustified and that abandoning it would improve the methodological soundness of our field.
Riccardo Pajno, Felicetta Carillo, Paolo Maranzano, Timo Schmid, Riccardo Borgoni
Comments 30 pages; 17 figures; submitted to Computational Statistics & Data Analysis
The agricultural sector is undergoing rapid change due to climate pressures, demographic shifts, and uneven economic development, increasing the demand for reliable environmental indicators at fine spatial scales. However, limited data availability often constrains subregional analyses. This study develops a model-based framework for producing reliable small-area estimates for assessing the agricultural carbon footprint in the Po Valley (Northern Italy), a region characterized by intensive livestock farming and high environmental pressure. We integrate survey, census, and satellite-derived emission data into a unified framework and produce estimates at the level of Agrarian Subregions, defined as agriculturally homogeneous municipalities by the Italian National Institute of Statistics. Satellite-based ammonia emission data are incorporated as auxiliary covariates to improve precision and spatial coherence. A key methodological contribution is the treatment of spatial misalignment between gridded satellite data and administrative boundaries. This issue is addressed through a geostatistical upscaling procedure combined with a parametric bootstrap that propagates uncertainty from the covariate construction stage to the final small-area estimates. The results show that satellite-derived information substantially improves the accuracy and stability of carbon footprint estimates while reducing reliance on large, heterogeneous auxiliary datasets, illustrating the potential of Earth observation data in model-based environmental statistics.
Ashwin Ram, Aaditya Ramdas
Comments Preprint
This paper characterizes the best possible rate of growth of wealth in a Kelly betting game when repeatedly betting against a general i.i.d. null hypothesis $\mathscr{P}$, but the data are drawn i.i.d. from an arbitrary alternative $Q$. We prove that it equals $\lim_{n \to \infty}n^{-1}\inf_{P \in (\mathscr{P}^n)^{\circ\circ}} \mathrm{KL}(Q^n,P)$, where ${\mathscr P}^n = \{P^n: P \in \mathscr{P}\}$ and $(\mathscr {P}^n)^{\circ\circ}$ is its bipolar, i.e., this rate is achievable and one cannot do better. This quantity is in general smaller than a more popular quantity in the literature, $\mathrm{KL}_{\inf}(Q,\mathscr{P}) := \inf_{P \in \mathscr P}\mathrm{KL}(Q,P)$. If $\mathrm{KL}_{\mathrm{inf}}(\cdot,\mathscr P)$ is weakly lower semicontinuous (w.l.s.c.) at $Q$, we show that the two quantities are equal; in particular, this happens when $\mathscr P$ is weakly compact. For simple alternatives, we provide the first matching necessary and sufficient condition for when power-one sequential tests exist (without assumptions on $\mathscr P, Q$). We also derive the optimal worst-case growth rate against composite $\mathscr Q$. We emphasize that test supermartingales on reduced filtrations suffice for all i.i.d. testing problems, and more general e-processes are not required. We thus completely generalize the recent results of Larsson et al.~\cite{larsson2025numeraire} to the sequential setting.
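For intuition, consider the simplest special case of a point null $P$ and a known alternative $Q$ on a finite alphabet, where betting the likelihood ratio $Q/P$ gives expected log-wealth growth $\mathrm{KL}(Q,P)$ per round. The sketch below covers only this textbook case; the composite-null bipolar construction from the abstract is not implemented, and the function names are illustrative.

```python
import math

def kl(q, p):
    """Kullback-Leibler divergence KL(q, p) between finite distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def log_wealth(outcomes, q, p):
    """Log-wealth after betting the likelihood ratio q/p on each observed outcome."""
    return sum(math.log(q[x] / p[x]) for x in outcomes)

p_null = [0.5, 0.5]    # fair coin under the null
q_alt = [0.25, 0.75]   # biased coin under the alternative
growth_rate = kl(q_alt, p_null)
wealth_after_three = log_wealth([1, 1, 0], q_alt, p_null)
```

By the law of large numbers, `log_wealth` divided by the number of rounds approaches `growth_rate` when the outcomes are drawn i.i.d. from `q_alt`.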
Tomáš Kocák, Rémi Munos, Branislav Kveton, Shipra Agrawal, Michal Valko
Comments Published in Journal of Machine Learning Research (JMLR 2020). arXiv admin note: text overlap with arXiv:2604.18420
Smooth functions on graphs have wide applications in manifold and semi-supervised learning. In this work, we study a bandit problem where the payoffs of arms are smooth on a graph. This framework is suitable for solving online learning problems that involve graphs, such as content-based recommendation. In this problem, each item we can recommend is a node of an undirected graph and its expected rating is similar to that of its neighbors. The goal is to recommend items that have high expected ratings. We aim for algorithms whose cumulative regret with respect to the optimal policy does not scale poorly with the number of nodes. In particular, we introduce the notion of an effective dimension, which is small in real-world graphs, and propose three algorithms for solving our problem that scale linearly and sublinearly in this dimension. Our experiments on a content recommendation problem show that a good estimator of user preferences for thousands of items can be learned from just tens of node evaluations.
Tomáš Kocák, Gergely Neu, Michal Valko
Comments Published at International Conference on Machine Learning (ICML) 2015. 11 pages
We consider adversarial multi-armed bandit problems where the learner is allowed to observe losses of a number of arms besides the arm that it actually chose. We study the case where all non-chosen arms reveal their loss with a fixed but unknown probability $r$, independently of each other and the action of the learner. We propose two algorithms that work for different ranges of $r$. We show that after $T$ rounds in a bandit problem with $N$ arms, the expected regret of our first algorithm is $O(\sqrt{(T /r) \log N })$ whenever $r\ge(\log T)/(2N)$, while our second algorithm achieves a regret of $O(\sqrt{(T/r) \log (N+T)})$ for smaller values of $r$. We also give a quick estimation procedure that decides the range of $r$. All our bounds are within logarithmic factors of the best achievable performance of any algorithm that is even allowed to know $r$.
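The observation model can be made concrete with a standard importance-weighted loss estimate. The sketch assumes that $r$ is known, whereas the paper treats $r$ as unknown, so this is not the paper's algorithm; the function names are illustrative.

```python
def observation_probability(p_chosen, r):
    """Chance that an arm's loss is seen: it is played, or it is revealed as a side observation."""
    return p_chosen + (1.0 - p_chosen) * r

def loss_estimate(loss, observed, p_chosen, r):
    """Importance-weighted estimate; unbiased when r is known (an assumption here)."""
    if not observed:
        return 0.0
    return loss / observation_probability(p_chosen, r)

prob = observation_probability(0.25, 0.5)
estimate = loss_estimate(0.5, True, 0.25, 0.5)
expected_estimate = prob * estimate  # equals the true loss 0.5
```

Larger $r$ raises the observation probability and therefore lowers the variance of the estimate, which matches the $1/r$ factor inside the regret bounds.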
Gergely Neu, Michal Valko
Comments Published at Neural Information Processing Systems (NeurIPS) 2014
Most work on sequential learning assumes a fixed set of actions that are available all the time. However, in practice, actions can consist of picking subsets of readings from sensors that may break from time to time, road segments that can be blocked or goods that are out of stock. In this paper we study learning algorithms that are able to deal with stochastic availability of such unreliable composite actions. We propose and analyze algorithms based on the Follow-The-Perturbed-Leader prediction method for several learning settings differing in the feedback provided to the learner. Our algorithms rely on a novel loss estimation technique that we call Counting Asleep Times. We deliver regret bounds for our algorithms for the previously studied full information and (semi-)bandit settings, as well as a natural middle point between the two that we call the restricted information setting. A special consequence of our results is a significant improvement of the best known performance guarantees achieved by an efficient algorithm for the sleeping bandit problem with stochastic availability. Finally, we evaluate our algorithms empirically and show their improvement over the known approaches.
Junya Miyake, Akira Okazaki, Shuichi Kawano
Variable fusion in linear regression models is a statistical method that identifies covariates making similar contributions to the response variable and imposes the same coefficient values on them. Many methods for variable fusion also incorporate variable selection for practical reasons. In this paper, within the Bayesian model averaging (BMA) framework, we propose a spike-and-slab-based Bayesian method that performs both variable fusion and selection. This is challenging in the BMA framework because one must construct a discrete model space that accommodates both selection and fusion and assign suitable priors over that space. In the proposed method, we present a way to explore a model space for variable fusion and selection based on Gibbs sampling by devising a prior distribution for latent variables representing the model. Furthermore, among non-local priors with superior model selection properties, we construct a prior tailored for variable fusion and use it as the slab distribution. We examine the effectiveness of the proposed method through theoretical and empirical studies.
Xinru Wang, Meghna Bose, Bibhas Chakraborty, Robert Mahar
Dynamic treatment regimes (DTRs) are sequences of decision rules to guide treatment assignments in response to a patient's evolving, time-varying disease status. Sequential multiple assignment randomized trials (SMARTs) are considered the gold standard experimental design for evaluating DTRs. However, SMARTs often require more time to complete compared with a single stage RCT and new candidate treatments may become available or feasible during the trial. Platform trials are an adaptive trial design that allows new treatments to be added to the ongoing study according to a prespecified master protocol. In this paper, we introduce a novel platform SMART that integrates features from both platform trials and SMARTs, allowing new treatments to be added during the trial. Additionally, we propose the Bayesian integration G-formula (BIG) estimators for platform SMARTs to account for non-concurrent treatment comparisons. Extensive simulations are conducted to evaluate the performance of different BIG estimators against benchmark methods. We demonstrate the proposed BIG estimators based on the S. aureus Network Adaptive Platform (SNAP) trial.
Yaniv Shulman
Local Polynomial Regression (LPR) is a powerful tool for nonparametric smoothing, yet it traditionally suffers from a "Euclidean tautology": the variables used to define the local neighborhood are identical to those used in the polynomial fit. This restricts its ability to handle complex domains where the regression function varies across non-Euclidean structures, such as graphs, manifolds, or discrete categories, while remaining locally smooth in the primary feature space. We propose Generalized Context-Aware LPR (GC-LPR), a framework that decouples the fitting coordinates ($Z$) from the weighting context ($C$). By adopting a modeling convention where the conditional mean depends jointly on $Z$ and $C$ ($Y = m_C(Z) + \varepsilon$), our estimator acts as a "projected smoother": it isolates a slice of the data on the manifold defined by $C$ via a compound product kernel, and performs polynomial fitting in the $Z$-coordinates within that slice. This enables practitioners to model responses that vary across graphs, networks, or categorical strata while retaining the interpretability and bias properties of LPR in a primary Euclidean feature space. Theoretical analysis clarifies the induced context-smoothed target of GC-LPR and shows that the method preserves the Euclidean bias-reduction properties of standard LPR while allowing arbitrary, non-Euclidean contexts to modulate the local estimation. We demonstrate the efficacy of this approach on geospatial and network-structured datasets.
Yao Zhao
Comments 24 pages, 4 figures. Methodological paper on functional time series
Functional autoregressive models of order one (FAR(1)) are predominantly estimated by projecting curves onto leading functional principal components and fitting a vector autoregression in score space, requiring a discrete truncation level $K$ chosen by an ad hoc variance threshold. We demonstrate via Monte Carlo experiments that the truncation choice is both consequential and highly regime dependent: the optimal $K$ can differ by an order of magnitude across data-generating regimes, while commonly used high variance thresholds (95%, 99%) lead to substantial forecast deterioration, inflating error by up to $35\%$ relative to an oracle benchmark. We propose a Tikhonov-regularized estimator $\widehat{Ψ}_α = \widehat{C}_1(\widehat{C}_0 + αI)^{-1}$ that replaces the discrete truncation choice with a continuous regularization parameter, selected in a data-driven manner. We establish the convergence rate $n^{-β/(2(β+1))}$ under a source condition with smoothness parameter $β\in (0, 1]$, achieving the saturation rate $n^{-1/4}$ for smoother targets. Across three contrasting regimes and four sample sizes, the proposed estimator closely tracks the oracle-best FPCA rule and outperforms it in the most challenging wide-spectrum regime, without prior knowledge of the effective operator dimension. An application to 2,735 daily intraday PM10 curves from Vienna confirms a 9.7% reduction in mean forecast error relative to the popular 80% threshold and exhibits more stable parameter adaptation across 16 winter seasons.
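The contrast between discrete truncation and continuous regularization is easiest to see in the eigenbasis of $\widehat{C}_0$. There, $(\widehat{C}_0 + αI)^{-1}$ replaces $1/λ_j$ by $1/(λ_j + α)$, which amounts to a smooth filter factor $λ_j/(λ_j + α)$, whereas a truncated inverse keeps the first $K$ components and discards the rest. The eigenvalues and names below are illustrative assumptions.

```python
def tikhonov_factors(eigenvalues, alpha):
    """Filter factors lambda / (lambda + alpha) implied by (C0 + alpha I)^{-1}."""
    return [lam / (lam + alpha) for lam in eigenvalues]

def truncation_factors(eigenvalues, K):
    """Filter factors of a truncated inverse: keep the first K components, drop the rest."""
    return [1.0 if j < K else 0.0 for j in range(len(eigenvalues))]

eigenvalues = [1.0, 0.1, 0.01]  # hypothetical, sorted in decreasing order
smooth = tikhonov_factors(eigenvalues, 0.1)
hard = truncation_factors(eigenvalues, 2)
```

With `alpha = 0.1` the component whose eigenvalue equals `alpha` is damped by one half instead of being fully kept or fully dropped.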
Tianying Wang
We study split-conformal prediction for regression when the reported prediction set must be a single interval, at target marginal coverage $1-α$, where $α$ is the nominal miscoverage level. Under this reporting constraint, the natural conditional target is the shortest interval with conditional mass at least $1-α$, rather than an equal-tailed interval or a possibly disconnected high-probability set. We parameterize this single-interval oracle by a lower-tail allocation, which determines how the nominal miscoverage $α$ is split between the two endpoints, and propose tail-allocation conformalized quantile regression (TA-CQR). TA-CQR estimates this allocation by searching over quantile-defined cores and then applies nonnegative additive split-conformal calibration, retaining exact finite-sample marginal coverage under exchangeability. The main contribution is theoretical. We characterize the oracle geometry, including its highest-density interpretation under unimodality and the positive connectedness cost induced by disconnected highest-density sets. We prove local recovery of the selected allocation and core, establish that calibration radii are asymptotically negligible under endpoint-density conditions, and give a finite-sample calibrated length oracle inequality with explicit grid, endpoint-quantile estimation, and calibration-sampling terms. Simulations and real-data examples report coverage and length jointly.
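The calibration step can be sketched with the usual conformalized quantile regression score, clipped at zero so that the interval is only ever widened. This is a generic split-conformal sketch under that assumption; the search over tail allocations and quantile-defined cores that defines TA-CQR is not shown, and the function name is illustrative.

```python
import math

def conformal_radius(lower, upper, y, alpha):
    """Nonnegative additive radius from calibration data at miscoverage level alpha."""
    # Score is positive when y falls outside [lower, upper], negative when inside.
    scores = sorted(max(lo - yi, yi - hi) for lo, hi, yi in zip(lower, upper, y))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return max(0.0, scores[k - 1])

lower = [0.0] * 7
upper = [1.0] * 7
y_cal = [0.5, 0.5, 0.5, 0.5, 0.5, 1.25, 2.0]
radius = conformal_radius(lower, upper, y_cal, 0.25)
```

The reported interval for a new point would then be `[lo - radius, hi + radius]`, which stays a single interval by construction.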
Mohammad Jafari Jozani, Bahram Moeinianfar
Comments 41 pages, 4 figures
Support vector machines (SVMs) are a standard tool for binary classification, but their classical formulations are purely data-driven and offer no direct way to encode trusted benchmark models or structured preferences on selected subsets of the data. We propose Elite-Driven Support Vector Machines (EDSVM), a general framework that augments regularized empirical risk minimization by guiding the slack variables for a curated set of elite observations (typically the union of support vectors from one or more reference SVMs). EDSVM combines the usual slack loss with a deviation penalty that shrinks new slacks toward benchmark slack values, defining a localized, margin-aligned notion of proximity to reference models, unlike global function penalties in knowledge distillation or teacher-student methods, and without requiring privileged features as in SVM+/LUPI. Within this framework we develop two concrete models, C-EDSVM and LS-EDSVM, based respectively on hinge-type and squared-slack losses. For both variants we derive dual quadratic programs that can be implemented with modest modifications of standard SVM solvers, and we give simple sufficient conditions under which the induced margin losses are classification calibrated. Simulation studies and experiments on several UCI benchmarks show that EDSVMs closely track the behaviour induced by reference SVMs while achieving predictive performance that is competitive with, and sometimes better than, C-SVM, LINEX-SVM, and LS-SVM.
Zhang Jiang, Marios Andreou, Sebastian Reich, Nan Chen
Comments 33 pages, 11 figures. Corresponding author: Nan Chen (chennan@math.wisc.edu)
Data assimilation (DA) integrates observational information with model predictions to improve state estimation in complex systems. While filtering provides the basis for online forecasts by using only past and present observations, it can exhibit delays and biases when the underlying dynamics evolve rapidly or undergo regime transitions. Smoothing, which additionally incorporates future observations, provides a natural pipeline for hindcasting and reanalysis that yields an uncertainty reduction beyond the filter. This paper introduces an ensemble Kalman-Bucy smoother (EnKBS) for continuous-time DA of nonlinear dynamical systems, where the smoother's conditional distributions are reconstructed using ensemble moments. The result is a derivative-free framework that does not require explicit computation of tangent-linear or adjoint models, which converges to the exact smoother solution at the infinite-ensemble limit for a wide class of complex systems. Incorporating standard regularization techniques for high-dimensional systems, such as covariance localization and inflation, the skill of the EnKBS is demonstrated in various important scientific problems. By integrating future observations, which reveal the underlying causal mechanisms for retrospective state updates, the EnKBS is used for Bayesian-based inference of causal relationships and their temporal influence range in a dyadic trigger-feedback model and the development of a causality-driven iterative learning algorithm that identifies the structure and recovers the hidden parameters of a nonlinear reduced-order model mimicking midlatitude atmospheric circulation. Notably, both tasks remain effective with an ensemble size of $O(10)$ under partial observations, suggesting that EnKBS can support the instantaneous discovery of high-dimensional complex systems over time.
Fan Wang, Haotian Xu, Yi Yu
Dynamic multilayer networks arise in many applications where multiple types of relations among a common set of nodes evolve over time. Existing approaches often assume temporal independence, focus on single-layer networks or impose stationarity, limiting their applicability in practice. In this paper, we introduce a first-order autoregressive multilayer stochastic block model (AR(1)-MSBM), in which edge formation and dissolution probabilities between consecutive time points are determined by latent community memberships and shared across layers. Under stationarity, we propose an online estimation procedure based on recursive updates and tensor-based spectral refinement. We establish non-asymptotic estimation rates, prove their minimax optimality and derive guarantees for community recovery. We further consider a non-stationary setting that allows both abrupt changes and gradual shifts, and develop an adaptive windowed online algorithm that automatically adjusts to unknown structural changes. Under a quasi-stationary segmentation framework, we derive estimation and community recovery guarantees that match the stationary results when applied segmentwise. Our theoretical findings are supported by extensive numerical experiments, with code available online.
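The autoregressive edge mechanism can be illustrated for a single node pair: an absent edge forms with one probability and a present edge dissolves with another, which is a two-state Markov chain. In the model these probabilities depend on latent community memberships; the scalar probabilities and function names below are simplifying assumptions.

```python
import random

def stationary_edge_probability(formation, dissolution):
    """Long-run edge frequency of the two-state chain: formation / (formation + dissolution)."""
    return formation / (formation + dissolution)

def step_edge(edge, formation, dissolution, rng):
    """One transition of a single edge indicator (0 or 1)."""
    if edge:
        return 0 if rng.random() < dissolution else 1
    return 1 if rng.random() < formation else 0

rng = random.Random(1)
density = stationary_edge_probability(0.2, 0.3)
```

Small formation and dissolution probabilities make consecutive snapshots strongly dependent, which is the temporal structure that independence-based methods ignore.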
Mohammad Jafari Jozani, Jingyu Wang
Comments 25 pages, 7 figures
Fractionally supervised classification (FSC) offers a flexible framework for combining labeled and unlabeled data in model-based classification, but existing formulations assume simple random sampling. In many applications, however, the retained observation is an extreme order statistic from a set rather than a randomly selected unit. This is particularly appealing when the target population is rare, since maxima nomination sampling (NS) can enrich the sample with the most informative observations, as in screening, environmental monitoring, repeated testing, and reliability studies. Under such designs, the likelihood function changes fundamentally, and the usual FSC EM construction is no longer valid. We develop FSC for nominated samples by introducing a latent representation that accounts for both the class membership of the observed maximum and the latent composition of the remaining units in the set. The resulting method yields a proper EM algorithm and a coherent weighted-likelihood FSC procedure for NS data. We present the methodology in general form, illustrate it for rare-event contamination normal mixtures, and show through simulation that it substantially improves on the misspecified alternative that ignores the extra rank information of such data. A real-data analysis demonstrates its practical value.
Aditya Basarkar, Emmett B. Kendall, David Randahl, Jonathan P. Williams, Gudmund H. Hermansen
Whether or not a country is at war, or experiencing escalating or deescalating levels of conflict, has massive ramifications on a country's national and foreign policy. Given a country's history of conflict, or lack thereof, future predictions about the war-status of a country are valuable information. In this paper, we present the use of conformal prediction on temporally-dependent data to obtain prediction sets of possible future conflict state-sequences. More specifically, we compare the results of conformal prediction to a likelihood-based prediction strategy when the data are assumed to come from a discrete-state Markov process. A point-prediction may not supply sufficient information because the penalty for a wrong prediction is extreme, and so we consider a machine learning alternative that gives valid uncertainty quantification and is robust to model misspecification. In the data analysis, we present real forecasts of conflict dynamics across multiple countries. Lastly, we comment on the possible limitations of existing approaches for applying conformal prediction to Markovian data, where the exchangeability assumption is violated.
Yasumasa Matsuda, Michel F. C. Haddad
Comments 61 pages, 5 figures, 3 tables
We propose a density-valued vector autoregressive model with latent factors for multivariate time series of density functions. Motivated by weekly regional distributions of SARS-CoV-2 cycle threshold (Ct) values in Brazil, we study their distributional dynamics across regions. The Ct value is the number of amplification cycles required for the viral signal to cross a detection threshold (lower Ct values correspond to higher viral load). We estimate each regional density by a B-spline mixture, mapping the mixture weights to a Euclidean space by a generalized logit transform equipped with an isometric inner product, and model the transformed series by a cross-regional VAR with latent factors. This decomposition allows for the separation between strong common movements and directed idiosyncratic dynamics. Directed edges are identified from the idiosyncratic VAR component using one-sided tests with Benjamini--Yekutieli false discovery rate control. Simulations show that increasing the number of estimated factors does not mechanically eliminate genuine idiosyncratic dependence; rather, it mainly removes spuriously detected edges driven by common factor movements. In the real-world data application, the full sample yields only a weak directed network, whereas a substantial network emerges once the first six months are excluded and the density prior is kept weak. The estimated links suggest directed predictive relations from the northern region toward southeastern metropolitan areas.
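The Benjamini-Yekutieli step used for edge selection can be sketched as follows. The p-values are made up for illustration and the function name is an assumption; the procedure itself is the standard step-up rule with the harmonic correction $c(m) = \sum_{i=1}^m 1/i$.

```python
def benjamini_yekutieli(p_values, q=0.05):
    """Indices rejected by the Benjamini-Yekutieli step-up procedure at FDR level q."""
    m = len(p_values)
    c_m = sum(1.0 / i for i in range(1, m + 1))  # harmonic correction for dependence
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / (m * c_m):
            k_max = rank  # largest rank whose p-value clears its threshold
    return sorted(order[:k_max])

rejected = benjamini_yekutieli([0.5, 0.001, 0.04, 0.01], q=0.05)
```

In the network setting each p-value would correspond to one candidate directed edge, and the returned indices would be the edges kept.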
Shiyu Wan, Yuhan Qian, Yanyao Yi, Nicole Mayer-Hamblett, Patrick J. Heagerty, Ting Ye
A master protocol trial uses a single overarching protocol to test multiple therapies, often across several diseases or subtypes. Although such trials offer considerable flexibility and efficiency, their constrained and non-uniform treatment assignment raises two core challenges: precisely defining treatment effects and conducting robust, efficient inference. These challenges intensify when participants can re-enroll to receive additional eligible therapies over time. To address these issues, we first define a clinically meaningful estimand with a clear population specification for master protocol trials that allow re-enrollment across multiple episodes. Specifically, we define the episode-specific entire concurrently eligible (ECE) population, which preserves the integrity of randomized comparisons and remains invariant to randomization ratios and operational formats. We then introduce a per-episode added-effect estimand that aggregates episode-specific effects into an interpretable overall measure. For inference, we develop weighting and post-stratification estimators under the same minimal assumptions as conventional randomized trials, with model-assisted covariate adjustment to improve efficiency. We establish asymptotic distributions for all estimators and provide cluster-robust variance estimators that properly account for within-participant correlation induced by re-enrollment. We evaluate our methods through extensive simulations and apply them to SIMPLIFY, a master protocol trial comparing continuation versus discontinuation of two common cystic fibrosis therapies. All analyses are conducted using the R package RobinCID.
Eugene Han, Marahi Perez-Tamayo, Hannah D. Holscher, Ruoqing Zhu
Comments 34 pages, 5 figures
This paper introduces a rectified and renormalized Fisher-Bingham model for compositional data with zeros, motivated in part by the presence of zeros in microbiota studies. The approach represents compositions through a square-root transformation that maps data to the positive orthant of the unit sphere, and models them via a latent Fisher-Bingham distribution followed by a deterministic transformation that induces exact zeros. This construction yields a coherent likelihood without requiring zero imputation or separate modeling of zero and nonzero components. Parameter estimation is performed using a Monte Carlo expectation-maximization algorithm that accommodates the latent structure. We further develop a score test for detecting structured differences in composition across groups, providing a parametric alternative to commonly used distance-based methods. Simulation studies demonstrate that the proposed method closely approximates the induced distribution and achieves higher power for detecting structured compositional changes, particularly when observations include many zero-valued components. An application to a dietary intervention study illustrates that the method identifies meaningful microbiota shifts not detected by standard approaches.
Dhruv Gupta
Comments 12 pages. Companion Lean 4 formalization: https://github.com/Zetetic-Dhruv/formal-learning-theory-kernel/tree/v3.3.0-paper
Recent work revisiting measurability in the fundamental theorem of statistical learning imposes Borel measurability of ghost-gap suprema. We show that, at the one-sided ghost-gap interface actually used by the standard symmetrization proof, this requirement is stronger than necessary. For any Borel-parameterized concept class on a Polish domain, the bad event "there exists a hypothesis whose ghost empirical error exceeds its training empirical error by at least ε/2" is analytic. By Choquet capacitability, it is therefore measurable in the completion of every finite Borel measure. We then construct a concept class whose bad event is null-measurable but not Borel, giving a strict separation from the Borel supremum condition. Finally, we prove closure under patching, fixed and countable interpolation, and fiber-product amalgamation, showing that the weaker regularity level is stable under natural concept-class constructors. In the realizable setting, where targets belong to the class and are measurable, these results weaken the measurability hypothesis needed by the symmetrization route from finite VC dimension to PAC learnability. The main results and the descriptive-set-theoretic infrastructure used by them are formalized in Lean 4.
Joseph Lazzaro, Davide Buffelli, Da-shan Shiu, Sattar Vakili
Comments AISTATS 2026
Preference feedback, in the form of pairwise comparisons rather than scalar scores, has seen increasing use in applications such as human-, laboratory-, and expert-in-the-loop design, as well as scientific discovery. We propose a Thompson Sampling (TS) approach to Bayesian optimization with preferential feedback that models comparisons using a monotone link on latent utility differences and leverages the dueling kernel induced by a base kernel. We provide a finite-time analysis showing that the performance of the proposed method matches that of standard TS for conventional Bayesian optimization with scalar feedback. The analysis exploits the anchor invariance of TS for challenger selection and introduces a double-TS pairing variant. We also demonstrate the performance of the method on both synthetic and real-world examples.
Samhita Pal, Dhrubajyoti Ghosh
Estimating causal effects from high-dimensional, structured exposures is a fundamental challenge in modern applications ranging from neuroscience and finance to environmental science. While the literature has addressed high-dimensional instrumental variable (IV) regression, and separately leveraged graph structure in penalized regression, the integration of both, especially for causal support recovery in the presence of latent confounding, remains unexplored. In this work, we propose a novel two-stage regression framework that incorporates instrumental variables and graph-based regularization to uncover sparse causal effects among network-structured exposures. Our method accommodates both valid and partially invalid instruments, and encourages structural similarity among connected predictors through a graph-fused penalty. We establish non-asymptotic guarantees for estimation accuracy and causal variable selection, and demonstrate that our approach yields improved performance over existing methods that ignore network dependencies or invalid IVs. Applied to ADNI brain imaging and genetic data, our method identifies interpretable causal ROIs associated with cognitive outcomes, underscoring the utility of graph-assisted IV regression in neuroscience and beyond.
Dongze Wu, Linglingzhi Zhu, Yao Xie
Learning matrix-valued distributions from high-dimensional and possibly incomplete training data is challenging: ambient-space generative modeling is computationally expensive and statistically fragile when the matrix dimension is large but the sample size is limited. We propose CoreFlow, a geometry-preserving low-rank flow model that learns shared row/column subspaces across the matrix distribution, and then trains a continuous normalizing flow only on the induced low-dimensional core. CoreFlow is designed for settings where shared low-rank matrix geometry is present, especially in high-dimensional limited-sample regimes. This separates shared matrix geometry from sample-specific variation, preserves matrix structure, and substantially improves training efficiency. The same framework also handles incomplete training matrices through masked Riemannian updates and iterative completion. Across real and synthetic benchmarks, CoreFlow substantially improves spectral and moment-level generation quality in few-sample regimes while remaining competitive in data-rich settings, even under compression to 9% of the ambient dimension and with up to 40% missing training entries.
Chandler Squires, Pradeep Ravikumar
Comments AISTATS 2026, 9 pages
Techniques for concept extraction, such as sparse autoencoders and transcoders, aim to extract high-level symbolic concepts from low-level nonsymbolic representations. When these extracted concepts are used for downstream tasks such as model steering and unlearning, it is essential to understand their guarantees, or lack thereof. In this work, we present a unified theoretical framework for unsupervised concept extraction, in which we frame the task of concept extraction as identifying a generative model. We present a general meta-theorem for identifiability, which reduces the problem of establishing identifiability guarantees to the problem of characterizing the intersection of two sets. As we demonstrate on a range of widely-used approaches, this meta-theorem substantially simplifies the task of proving such guarantees, thus paving the way for the development of new, principled approaches for concept extraction.
Kisung You
Hyperbolic space is increasingly used for hierarchical, tree-like, and network-structured data, but likelihood-based density modeling on hyperbolic space remains relatively limited. This paper develops finite mixture modeling with isotropic Riemannian Gaussian distributions on hyperbolic space under the hyperboloid model. We derive the density, radial normalizing constant, and a finite-sum representation involving the complementary error function. We then formulate weighted maximum likelihood estimation, which is the fundamental subproblem in mixture fitting: the location estimator is the weighted Fréchet mean, while the inverse-scale estimator is obtained from a one-dimensional strictly convex profile problem. For finite mixtures, we derive exact EM and generalized EM algorithms. The generalized version replaces exact barycenter solves with truncated hyperbolic majorization-minimization updates. We establish existence and uniqueness of the weighted single-component estimator, singularity of the unrestricted mixture likelihood, existence of a constrained mixture estimator, and monotonicity properties of the EM-type algorithms. Simulations show accurate weighted estimation, reliable mixture recovery, effective model selection, and substantial computational savings from generalized EM. Real network examples based on hyperbolic embeddings illustrate the method as an exploratory likelihood-based clustering tool for non-Euclidean data.
Jerry Yao-Chieh Hu, Mingcheng Lu, Yi-Chen Lee, Han Liu
We provide a systematic recipe for translating ReLU approximation results to the softmax attention mechanism. This recipe covers many common approximation targets. Importantly, it yields target-specific, economical resource bounds that go beyond universal approximation statements. We showcase the recipe on multiplication, reciprocal computation, and min/max primitives. These results provide new tools for analyzing softmax transformer models.
Jordan Awan, Xi Chen, Roberto Molinari
The increased use of differential privacy (DP) has allowed the sharing of large amounts of data while reducing the risk of disclosure of sensitive information at the individual level. However, the noise introduced by DP methods makes performing statistical inference more challenging. While various methods have been proposed to address different inferential tasks, they often require strong parametric assumptions and/or do not scale well with sample sizes (e.g. U.S. Census products). In response to these limitations, we propose an approximate Bayesian method to analyze privatized data products, which uses a two-step approach of imputing the confidential data and then sampling from the non-private posterior, and which is inspired by the method of Guha and Reiter (2025). We prove that this approximate sampler is asymptotically valid under mild assumptions. While this approach is motivated by Bayesian theory, we show through simulations that it provides conservative frequentist properties as well. We demonstrate the utility of our method by applying it in simulated settings as well as for an analysis on the drivers of homeownership via the 2022 American Community Survey.
Dingke Tang, Xuming He, Shu Yang
Mendelian randomization is a powerful tool for causal inference in observational studies. The two-sample summary-data design, which estimates genetic associations with exposures and outcomes in separate cohorts, is the most widely used Mendelian randomization approach in large-scale genomic studies. However, this approach relies on a strong assumption of population homogeneity across the two samples. In practice, available samples often differ in ancestry, demographics, socioeconomic factors, covariate adjustment, and measurement protocols. Violations of the homogeneity assumption can bias causal effect estimates and undermine the credibility of Mendelian randomization findings. We introduce a robust, model-free Mendelian randomization framework that directly addresses population heterogeneity in the two-sample summary-data setting. Our method avoids parametric assumptions about population differences and is designed to address real-world challenges, including measurement error, weak instruments, and pleiotropy. We show that the proposed estimator is consistent and asymptotically normal under heterogeneous designs, and may offer efficiency gains over the classic estimator even in homogeneous settings. Through numerical simulations and a real data analysis for estimating the causal effect of body mass index on high-density lipoprotein cholesterol across ancestrally diverse populations, we demonstrate the practical utility, stability, and robustness of our approach.
Taeyoung Kim
Comments 28 pages, 12 figures
We investigate random feature models in which neural networks sampled from a prescribed initialization ensemble are frozen and used as random features, with only the readout weights optimized. Adopting a statistical-physics viewpoint, we study the training error, test error, and generalization gap beyond the mean kernel approximation. Since the predictor is a nonlinear functional of the induced random kernel, the ensemble-averaged errors depend not only on the mean kernel but also on higher-order fluctuation statistics. Within an effective field-theoretic framework, these finite-width contributions naturally appear as loop corrections. We derive loop corrections to the training error, test error, and generalization gap, obtain their scaling laws, and support the theory with experimental verification.
Yan Li
It is well known that estimating the expectation of any given bounded random variable with values in $[-B, B]$ has a sample complexity of $\mathrm{O}(B^2/ε^2)$ that is independent of the underlying probability measure. We show that this property can no longer hold when evaluating the worst-case expectation of the random variable, where the probability measures defining the expectation belong to a $ϕ$-divergence ball centered at some nominal measure $P$. Specifically, the sample complexity and its dependence on the nominal measure can be completely characterized by the growth of the divergence function. When the divergence function $ϕ$ exhibits superlinear growth, a $P$-independent sample complexity can be obtained for sample average approximation, which depends only on the growth of $ϕ$, the radius of the divergence ball, and the target precision. We also provide sample complexity lower bounds and demonstrate the optimality of the obtained bounds for commonly used $ϕ$-divergences. On the other hand, when superlinear growth does not hold for $ϕ$, we show that for any estimation method, evaluating the worst-case expectation has a $P$-dependent sample complexity lower bound that can be made arbitrarily large by changing $P$.
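The measure-independent $\mathrm{O}(B^2/ε^2)$ rate quoted at the start of the abstract can be recovered from Hoeffding's inequality, as the sketch below illustrates. It is an illustration under assumptions rather than anything taken from the paper: the confidence parameter `delta` and the function name `hoeffding_sample_size` are introduced here only for the example.

```python
import math


def hoeffding_sample_size(B, eps, delta):
    """Smallest n guaranteeing |sample mean - expectation| < eps with
    probability at least 1 - delta, for i.i.d. variables in [-B, B].

    Hoeffding's inequality with range width 2B gives
        P(|mean - E| >= eps) <= 2 * exp(-n * eps**2 / (2 * B**2)),
    so n >= (2 * B**2 / eps**2) * log(2 / delta) suffices.
    """
    if B <= 0 or eps <= 0 or not 0 < delta < 1:
        raise ValueError("require B > 0, eps > 0 and 0 < delta < 1")
    return math.ceil(2.0 * B**2 / eps**2 * math.log(2.0 / delta))


# The bound depends on B, eps and delta only, not on the distribution.
n_unit = hoeffding_sample_size(B=1.0, eps=0.1, delta=0.05)
n_wide = hoeffding_sample_size(B=2.0, eps=0.1, delta=0.05)
```

Doubling `B` roughly quadruples the required sample size, whatever the underlying distribution. This is the property the abstract shows can fail for worst-case expectations over a divergence ball.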
Tomohiro Ohigashi, Shunichiro Orihara, Shonosuke Sugasawa
In Bayesian inference for the Cox proportional hazards model, modeling the baseline hazard function is challenging. Recently, direct Bayesian inference using the partial likelihood has been considered within the framework of general Bayesian inference. In terms of posterior computation, several studies have examined sampling algorithms under the Cox model. In this study, we propose two Gibbs sampling algorithms for Bayesian inference in the Cox proportional hazards model, motivated by a rank-ordered data representation and based on the Plackett--Luce and generalized Plackett--Luce models with Pólya--Gamma data augmentation, referred to as PL-Cox and GPL-Cox, respectively. The two proposed methods offer practical advantages, as they do not require correction of posterior samples, naturally handle tied event times, and are readily extensible to shared frailty models. In a simulation study, we considered multiple survival model settings, including continuous and discrete survival time models, as well as scenarios with varying degrees of ties, and found that the PL-Cox model exhibited relatively stable performance. In analyses of a large real dataset, the proposed methods remained computationally feasible, and the GPL-Cox model showed more favorable computational scalability than the PL-Cox model. In analyses of real data incorporating shared frailty, both methods demonstrated good computational efficiency.
Yuan-Hao Wei
This paper presents StrADiff, a Structured Source-Wise Adaptive Diffusion Framework for unsupervised blind source separation under linear and nonlinear mixing. The framework treats each latent dimension as a source branch and assigns to it an individual adaptive reverse diffusion mechanism, so that latent sources are recovered directly from observed mixtures through a single end-to-end objective, without supervised source labels or separate post-processing. Source-wise generation, structural regularization, and observation-space reconstruction are optimized jointly during training. In this instantiation, a Gaussian process (GP) prior is used as one example of a source-wise structured prior to impose temporal organization on each recovered trajectory; the framework itself is not restricted to GP priors and can in principle incorporate other structured priors. Theoretical components clarify the induced pushforward source law, the sample-level role of the structured prior, the coupling between source recovery and prior adaptation, and a conditional weak recovery statement in an idealized linear low-noise regime. Experiments on linear and nonlinear mixtures show that StrADiff can recover meaningful latent source trajectories in an unsupervised manner, with particularly stable performance in the linear case and moderate degradation under nonlinear mixing. Beyond classical signal separation, a source branch may also be interpreted as an independent, disentangled, or otherwise interpretable explanatory factor under suitable structural assumptions, suggesting a broader route toward structured latent modeling and future identifiable nonlinear representation learning.
Juno Kim, Eshaan Nichani, Denny Wu, Alberto Bietti, Jason D. Lee
Comments 84 pages, 9 figures
Spectral optimizers such as Muon have recently shown strong empirical performance in large-scale language model training, but the source and extent of their advantage remain poorly understood. We study this question through the linear associative memory problem, a tractable model for factual recall in transformer-based models. In particular, we go beyond orthogonal embeddings and consider Gaussian inputs and outputs, which allows the number of stored associations to greatly exceed the embedding dimension. Our main result sharply characterizes the recovery rates of one step of Muon, SGD, and Newton's method on the logistic regression loss under a power law frequency distribution. We show that the storage capacity of Muon significantly exceeds that of SGD, and even matches Newton's method while only using first-order information. Moreover, Muon saturates at a larger critical batch size. We further analyze the multi-step dynamics under a thresholded gradient approximation and show that Muon achieves a substantially faster initial recovery rate than SGD, while both methods eventually converge to the information-theoretic limit at comparable speeds. Experiments on synthetic tasks validate the predicted scaling laws. Our analysis provides a quantitative understanding of the signal amplification of spectral preconditioners and lays the groundwork for establishing scaling laws across more practical language modeling tasks and optimizers.
Kartheek Bondugula, Santiago Mazuelas, Aritz Pérez, Anqi Liu
Loss functions play a central role in supervised classification. Cross-entropy (CE) is widely used, whereas the mean absolute error (MAE) loss can offer robustness but is difficult to optimize. Interpolating between the CE and MAE losses, generalized cross-entropy (GCE) has recently been introduced to provide a trade-off between optimization difficulty and robustness. Existing formulations of GCE result in a non-convex optimization over classification margins that is prone to underfitting, leading to poor performance on complex datasets. In this paper, we propose a minimax formulation of generalized cross-entropy (MGCE) that results in a convex optimization over classification margins. Moreover, we show that MGCEs can provide an upper bound on the classification error. The proposed bilevel convex optimization can be efficiently implemented using stochastic gradients computed via implicit differentiation. Using benchmark datasets, we show that MGCE achieves strong accuracy, faster convergence, and better calibration, especially in the presence of label noise.
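To make the interpolation between CE and MAE concrete, the sketch below evaluates the commonly used existing GCE form, (1 - p^q)/q, on the predicted probability of the true class. It illustrates that existing formulation only; it is not the minimax MGCE proposed in the paper, and the function name `gce_loss` and parameter name `q` are assumptions made for the example.

```python
import math


def gce_loss(p_true, q):
    """Generalized cross-entropy on the true-class probability p_true.

    Uses the Box-Cox style form (1 - p**q) / q for q in (0, 1].
    As q -> 0 it tends to the cross-entropy -log(p); at q = 1 it equals
    1 - p, which is half the MAE between a one-hot label and the
    predicted probability vector.
    """
    if not 0.0 < p_true <= 1.0:
        raise ValueError("p_true must lie in (0, 1]")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    if q == 0.0:
        return -math.log(p_true)  # limiting case: cross-entropy
    return (1.0 - p_true**q) / q


# A confidently wrong prediction is penalized far less for q near 1,
# which is the source of the robustness to label noise.
ce_like = gce_loss(0.01, q=0.0)
mae_like = gce_loss(0.01, q=1.0)
```

Here `ce_like` grows without bound as the true-class probability approaches zero, whereas `mae_like` stays below 1.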
Alexander Dombowsky, Barbara E. Engelhardt, Aaditya Ramdas
Unnormalized probability distributions are frequently used in machine learning for modeling complex data generating processes. Though Markov chain Monte Carlo (MCMC) algorithms can approximately sample from unnormalized distributions, intractability of their normalizing constants renders likelihood ratio testing infeasible. We propose to use the parallel method of Besag and Clifford to generate samples that are exchangeable with the data under the null, to then generate valid e-values for any number of iterations or algorithmic steps. We show that as the number of samples grows, these Besag-Clifford e-values constructed using the unnormalized likelihood ratio are actually log-optimal up to a multiplicative term that diminishes with the mixing time of the Markov chain. Additionally, averaging over the output of multiple chains retains validity while increasing the e-power. We extend Besag-Clifford e-values to the general problem of unnormalized test statistics, which allows application to composite hypotheses, uncertainty quantification, generative model evaluation, and sequential testing. Through simulations and an application to galaxy velocity modeling, we empirically verify our theory, explore the impact of autocorrelation and mixing, and evaluate the performance of Besag-Clifford e-values.
Harrison Katz
Tourism demand forecasting is methodologically mature, but it typically treats accommodation supply as fixed or exogenous. In platform-mediated short-term rentals, supply is elastic, decision-driven, and co-evolves with demand through pricing, information design, and interventions. I reframe the core issue as endogenous stock-out censoring: realized booked nights satisfy $B_{k,t} \le \min(D_{k,t}, S_{k,t})$, so booking models that ignore supply learn a regime-specific ceiling and become fragile under policy changes and supply shocks. This narrative review synthesizes work from tourism forecasting, revenue management, two-sided market economics, and Bayesian time-series methods; develops a three-part coupling framework (behavioral, informational, intervention); and illustrates the identification failure with a toy simulation. I conclude with a focused research agenda for jointly forecasting supply, demand, and their compositions.
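The censoring mechanism can be illustrated with a few lines of simulation. This is a hypothetical sketch, not the toy simulation from the paper: it assumes Gaussian latent demand, a constant supply level, and bookings exactly equal to min(D, S), and all names and numbers are invented for the example.

```python
import random


def simulate_bookings(n_periods, mean_demand, supply, seed=0):
    """Simulate latent demand and supply-censored bookings.

    Assumptions (illustrative only): demand is Gaussian with a standard
    deviation of 20% of its mean, truncated at zero; supply is constant;
    booked nights equal min(demand, supply).
    """
    rng = random.Random(seed)
    demand = [max(0.0, rng.gauss(mean_demand, 0.2 * mean_demand))
              for _ in range(n_periods)]
    booked = [min(d, supply) for d in demand]  # stock-out censoring
    return demand, booked


supply_level = 90.0
demand, booked = simulate_bookings(n_periods=1000, mean_demand=100.0,
                                   supply=supply_level)

mean_latent_demand = sum(demand) / len(demand)
mean_booked = sum(booked) / len(booked)
# Averaging bookings understates latent demand whenever supply binds.
censoring_gap = mean_latent_demand - mean_booked
```

A model fitted to `booked` alone would learn a ceiling near `supply_level`, so its forecasts would not carry over to a regime with a different supply level.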
Kensuke Okada, Yui Furukawa, Kyosuke Bunji
Human self-report questionnaires are increasingly used in NLP to benchmark and audit large language models (LLMs), from persona consistency to safety and bias assessments. Yet these instruments presume honest responding; in evaluative contexts, LLMs can instead gravitate toward socially preferred answers, a form of socially desirable responding (SDR), which biases questionnaire-derived scores and downstream conclusions. We propose a psychometric framework to quantify and mitigate SDR in questionnaire-based evaluation of LLMs. To quantify SDR, the same inventory is administered under HONEST versus FAKE-GOOD instructions, and SDR is computed as a direction-corrected standardized effect size from item response theory (IRT)-estimated latent scores. This enables comparisons across constructs and response formats, as well as against human instructed-faking benchmarks. For mitigation, we construct a graded forced-choice (GFC) Big Five inventory by selecting 30 cross-domain pairs from an item pool via constrained optimization to match desirability. Across nine instruction-following LLMs evaluated on synthetic personas with known target profiles, Likert-style questionnaires show consistently large SDR, whereas desirability-matched GFC substantially attenuates SDR while largely preserving the recovery of the intended persona profiles. These results highlight a model-dependent SDR-recovery trade-off and motivate SDR-aware reporting practices for questionnaire-based benchmarking and auditing of LLMs.
Muhammad Hasan Ferdous, Md Osman Gani
Multivariate time series in domains such as finance, climate science, and healthcare often exhibit long-term trends, seasonal patterns, and short-term fluctuations, complicating causal inference under non-stationarity and autocorrelation. Existing causal discovery methods typically operate on raw observations, making them vulnerable to spurious edges and misattributed temporal dependencies. We introduce a decomposition-based causal discovery framework that separates each time series into trend, seasonal, and residual components and performs component-specific causal analysis. Trend components are assessed using stationarity tests, seasonal components using kernel-based dependence measures, and residual components using constraint-based causal discovery. The resulting component-level graphs are integrated into a unified multi-scale causal structure. This approach isolates long- and short-range causal effects, reduces spurious associations, and improves interpretability. Across extensive synthetic benchmarks and real-world climate data, our framework more accurately recovers ground-truth causal structure than state-of-the-art baselines, particularly under strong non-stationarity and temporal autocorrelation.
Jessica Renz, Frederik Witt, Iain G. Johnston
We present an algebraic approach to evolutionary accumulation modelling (EvAM). EvAM is concerned with learning and predicting the order in which evolutionary features accumulate over time. Our approach is complementary to the more common optimisation-based inference methods used in this field. Namely, we first use the natural underlying polynomial structure of the evolutionary process to define a semi-algebraic set of candidate parameters consistent with a given data set before maximising the likelihood function. We consider explicit examples and show that this approach is compatible with the solutions given by various statistical evolutionary accumulation models. Furthermore, we discuss the additional information of our algebraic model relative to these models.
Haitao Lin, Boxin Zhao, Mladen Kolar, Chong Liu
We study how to accelerate Bayesian optimization (BO) on a target task by transferring historical knowledge from related source tasks. Existing work on BO with knowledge transfer either lacks theoretical guarantees or achieves the same regret as BO in the non-transfer setting, $\widetilde{O}(\sqrt{T γ_f})$, where $T$ is the number of evaluations of the target function and $γ_f$ denotes its information gain. In this paper, we propose the DeltaBO algorithm, which builds a novel uncertainty-quantification approach on the difference function $δ$ between the source and target functions, which are allowed to belong to different Reproducing Kernel Hilbert Spaces (RKHSs). Under mild assumptions, we prove that the regret of DeltaBO is of order $\widetilde{O}(\sqrt{T (T/N + γ_δ)})$, where $N$ denotes the number of evaluations from source tasks and typically $N \gg T$. In many applications, source and target tasks are similar, which implies that $γ_δ$ can be much smaller than $γ_f$. Empirical studies on both real-world hyperparameter-tuning tasks and synthetic functions show that DeltaBO outperforms other baseline methods and also verify our theoretical claims. Our code is available on GitHub.
Jie Xie, Dongming Huang
We study nonasymptotic minimax estimation of the linear functional $L(θ)=η^\top θ$ for a high-dimensional $s$-sparse mean vector with an arbitrary loading vector $η$. For symmetric noise with exponentially decaying tails, we derive the sharp minimax rate, explicit in $s$, $η$, the tail parameter, and the noise level. The proposed estimator combines plug-in estimation for coordinates with large loadings and thresholding for coordinates with small loadings, and the matching lower bound is obtained via a loading-dependent sparse prior. For unknown sparsity, we construct an $η$-dependent Lepski-type procedure and show that, for a broad verifiable class of loading vectors, its risk matches the oracle rate up to the optimal logarithmic factor. Explicit examples illustrate how heterogeneity in $η$ changes both the minimax and adaptive rates. We also extend the analysis to non-symmetric noise, hypothesis testing, and estimation with unknown noise variance, where we show that asymmetry can increase the minimax rate in certain examples of $η$. Among these results, the two main technical novelties are the following. First, we extend the sharp lower-bound theory beyond the Gaussian setting via a new $χ^2$ bound for generalized Gaussian distributions. Second, for possibly non-symmetric noise, we derive new lower bounds through a worst-case asymmetric construction.
Anthony Sisti, Ellen McCreedy, Roee Gutman
In pragmatic cluster randomized controlled trials (PCRCTs), healthcare providers are randomized while both providers and patients may deviate from the assigned intervention. In many PCRCTs, cluster-level implementation is measured using multiple continuous metrics, while individual compliance is recorded as a binary indicator. Standard complier average causal effect (CACE) estimands focus on individual-level compliance and do not account for heterogeneity in implementation across clusters. When intervention uptake is shaped by both provider- and patient-level processes, it is of scientific interest to characterize how effects vary across these sources of compliance. We propose a Bayesian framework for PCRCTs with one-sided binary noncompliance at the individual level and one-sided partial compliance at the cluster level. The method uses a latent mixture model to summarize heterogeneity in cluster-level implementation based on baseline characteristics and observed implementation measures, and links these latent implementation types to individual compliance and outcomes through a joint model. Because compliance is only observed in treated clusters, the model imputes unobserved compliance behavior for clusters and individuals assigned to control. The framework enables estimation of finite- and super-population intent-to-treat (ITT) and CACE estimands, both marginally and within latent implementation types. We apply the method to the METRIcAL trial, a pragmatic cluster randomized study evaluating a personalized music intervention for nursing home residents with dementia. The analysis illustrates how accounting for implementation heterogeneity and individual compliance can provide insights beyond standard ITT analyses.
Keywords: Causal inference; Principal stratification; Complier average causal effect; Cluster randomized trials; Noncompliance; Bayesian methods; Latent variable models; Interference.
Alexander Dombowsky, David B. Dunson
A discrete Bayesian network (DBN) is a directed acyclic graph (DAG) consisting of categorical variables. Two popular approaches to DBN modeling are classification and nonparametric methods. However, both often require a large number of parameters, such as high-order interactions in the former and cell probabilities in the latter. In this article, we propose a hierarchical model for node-parent conditional probabilities, inducing shrinkage to low-dimensional latent parameters a posteriori. We generate samples from the posterior distribution of these latent variables using the Metropolis-adjusted Langevin algorithm within a Gibbs sampler. Moreover, we verify that the full conditional distribution is log-concave under mild conditions, facilitating efficient sampling. We then detail several algorithms for structure learning that incorporate our hierarchical prior and preserve the DAG property. Through simulations, we evaluate the performance of our method for sparse counts, discovering graph structure, and selecting between competing DAGs. We conclude with an application to uncovering prognostic network structure from a breast cancer dataset.
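As a rough illustration of the sampling step mentioned in this abstract, the following minimal sketch shows one Metropolis-adjusted Langevin (MALA) update for a scalar, log-concave target. It is not the authors' sampler: the function names `mala_step` and `run_chain`, the step size, and the standard normal stand-in target are all assumptions chosen for illustration, and the Gibbs sweep over other model blocks is omitted.

```python
import math
import random


def mala_step(x, log_density, grad_log_density, step, rng):
    """One Metropolis-adjusted Langevin step for a scalar target."""

    def log_q(to, frm):
        # Log density (up to an additive constant) of the proposal frm -> to.
        mean = frm + 0.5 * step ** 2 * grad_log_density(frm)
        return -((to - mean) ** 2) / (2.0 * step ** 2)

    # Langevin proposal: drift along the gradient plus Gaussian noise.
    proposal = x + 0.5 * step ** 2 * grad_log_density(x) + step * rng.gauss(0.0, 1.0)

    # Metropolis-Hastings correction so the target stays invariant.
    log_alpha = (log_density(proposal) + log_q(x, proposal)) - (
        log_density(x) + log_q(proposal, x)
    )
    if rng.random() < math.exp(min(0.0, log_alpha)):
        return proposal
    return x


def run_chain(n_steps, step=1.0, seed=0):
    """Run MALA on a stand-in log-concave target (standard normal, assumed)."""
    rng = random.Random(seed)
    log_density = lambda z: -0.5 * z * z   # hypothetical target, not the paper's
    grad_log_density = lambda z: -z
    x, samples = 0.0, []
    for _ in range(n_steps):
        x = mala_step(x, log_density, grad_log_density, step, rng)
        samples.append(x)
    return samples
```

In a MALA-within-Gibbs scheme, a step like `mala_step` would be applied to one block's full conditional while the other blocks are held fixed; log-concavity of that conditional is what makes the gradient drift reliable.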
Jiaxi Wu, Alexander Franks
Comments Attach code
Unmeasured confounding can severely bias causal effect estimates from spatiotemporal observational data, especially when the confounders do not vary smoothly in time and space. In this work, we develop a method for addressing unmeasured confounding in spatiotemporal contexts by building on models from the panel data literature and methods in multivariate causal inference. Our method is based on a factor confounding assumption, which posits that effects of unmeasured confounders on exposures and outcomes can be captured by a shared latent factor model. Factor confounding is sufficient to partially identify causal effects, even when there is interference between units. Additional assumptions that limit the degree of spatiotemporal interference, reasonable in most applications, are sufficient to point identify the effects. Simulation studies demonstrate that the proposed approach can substantially reduce omitted variable bias relative to other spatial smoothing and panel data baselines. We illustrate our method in a case study of the effect of prenatal PM2.5 exposure on birth weight in California.
Chi-Kuang Yeh, Weng Kee Wong, Julie Zhou
Comments 24 pages, 4 figures, 8 tables
Group testing techniques are widely used in resource-constrained settings, such as infectious-disease screening, blood safety, DNA library screening, and industrial inspection, where the efficient use of limited testing resources depends critically on how the initial study is designed. This paper discusses various ways that group testing experiments can be designed more efficiently and flexibly, under a user-specified optimality criterion and cost structure. We construct optimal designs for estimating model parameters that go beyond the \(D\)-optimality criterion to include the \(A\)-, \(c\)-, and \(E\)-optimality criteria, and we extend the framework to find optimal designs with multiple objectives. For large studies, we use a general theory and obtain various types of optimal approximate designs. When sample sizes are small, we propose two algorithms to construct highly efficient exact designs under realistic budget constraints. Additionally, we investigate properties of the proposed designs under various operational uncertainties and create a Shiny app to facilitate implementation of the proposed designs. To fix ideas, we focus on finding highly efficient group testing designs for a Chlamydia screening trial with imperfect assays under budget constraints and show the advantages of our optimal designs over current methods.
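To make the design problem concrete, here is a minimal sketch of the textbook pooled-testing model with an imperfect assay. It assumes independent samples with common prevalence `p`, a pool of size `k`, and sensitivity and specificity that do not depend on `k` (no dilution effect). These assumptions and the function names are illustrative; the abstract does not state the paper's exact model or cost structure.

```python
def pool_positive_prob(p, k, sensitivity=1.0, specificity=1.0):
    """Probability that a pool of k independent samples tests positive."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if k < 1:
        raise ValueError("k must be a positive integer")
    all_negative = (1.0 - p) ** k  # chance that no sample in the pool is positive
    # True positives from contaminated pools plus false positives from clean pools.
    return sensitivity * (1.0 - all_negative) + (1.0 - specificity) * all_negative


def fisher_information(p, k, sensitivity=1.0, specificity=1.0):
    """Fisher information about p from one pooled test of size k.

    Requires the pool-positive probability to lie strictly between 0 and 1.
    """
    pi = pool_positive_prob(p, k, sensitivity, specificity)
    # Derivative of the pool-positive probability with respect to p.
    dpi_dp = (sensitivity + specificity - 1.0) * k * (1.0 - p) ** (k - 1)
    return dpi_dp ** 2 / (pi * (1.0 - pi))
```

For example, `fisher_information(0.01, 10)` exceeds `fisher_information(0.01, 1)`, which is why pooling pays off at low prevalence. In this one-parameter setting, a design that maximizes the information per test is \(D\)-optimal; the other criteria and the budget constraints discussed in the abstract would need additional modeling.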
Jonathan Patsenker, Henry Li, Myeongseob Ko, Ruoxi Jia, Yuval Kluger
Diffusion models have been firmly established as principled zero-shot solvers for linear and nonlinear inverse problems, owing to their powerful image prior and iterative sampling algorithm. These approaches often rely on Tweedie's formula, which relates the diffusion variate $\mathbf{x}_t$ to the posterior mean $\mathbb{E} [\mathbf{x}_0 | \mathbf{x}_t]$, in order to guide the diffusion trajectory with an estimate of the final denoised sample $\mathbf{x}_0$. However, this does not consider information from the measurement $\mathbf{y}$, which must then be integrated downstream. In this work, we propose to estimate the conditional posterior mean $\mathbb{E} [\mathbf{x}_0 | \mathbf{x}_t, \mathbf{y}]$, which can be formulated as the solution to a lightweight, single-parameter maximum likelihood estimation problem. The resulting prediction can be integrated into any standard sampler, resulting in a fast and memory-efficient inverse solver. Our optimizer is amenable to a noise-aware likelihood-based stopping criterion that is robust to measurement noise in $\mathbf{y}$. We demonstrate comparable or improved performance against a wide selection of contemporary inverse solvers across multiple datasets and tasks.
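For reference, Tweedie's formula can be written out under the common variance-preserving parameterization $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. This parameterization and the symbol $\bar{\alpha}_t$ are assumptions here, since the abstract does not fix a noise schedule. The second line is the standard measurement-conditioned analogue, which holds when $\mathbf{y}$ depends on $\mathbf{x}_t$ only through $\mathbf{x}_0$. It is background only, not the single-parameter estimator proposed in the abstract.

```latex
\begin{align}
\mathbb{E}[\mathbf{x}_0 \mid \mathbf{x}_t]
  &= \frac{\mathbf{x}_t + (1-\bar{\alpha}_t)\,\nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t)}{\sqrt{\bar{\alpha}_t}}, \\
\mathbb{E}[\mathbf{x}_0 \mid \mathbf{x}_t, \mathbf{y}]
  &= \frac{\mathbf{x}_t + (1-\bar{\alpha}_t)\,\nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t \mid \mathbf{y})}{\sqrt{\bar{\alpha}_t}}.
\end{align}
```

The two expressions differ only by the likelihood score $\nabla_{\mathbf{x}_t} \log p_t(\mathbf{y} \mid \mathbf{x}_t)$, scaled by $(1-\bar{\alpha}_t)/\sqrt{\bar{\alpha}_t}$. That term is the measurement information the unconditional estimate leaves out.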
Nasa Matsumoto, Quoc Hoan Tran, Koki Chinzei, Yasuhiro Endo, Hirotaka Oshima
Comments Accepted to Advanced Quantum Technologies
Quantum machine learning models that leverage quantum circuits as quantum feature maps (QFMs) are recognized for their enhanced expressive power in learning tasks. Such models have demonstrated rigorous end-to-end quantum speedups for specific families of classification problems. However, deploying deep QFMs on real quantum hardware remains challenging due to circuit noise and hardware constraints. Additionally, variational quantum algorithms often suffer from computational bottlenecks, particularly in accurate gradient estimation, which significantly increases quantum resource demands during training. We propose Iterative Quantum Feature Maps (IQFMs), a hybrid quantum-classical framework that constructs a deep architecture by iteratively connecting shallow QFMs with classically computed augmentation weights. By incorporating contrastive learning and a layer-wise training mechanism, the IQFMs framework effectively reduces quantum runtime and mitigates noise-induced degradation. In tasks involving noisy quantum data, numerical experiments show that the IQFMs framework outperforms quantum convolutional neural networks, without requiring the optimization of variational quantum parameters. Even for a typical classical image classification benchmark, a carefully designed IQFMs framework achieves performance comparable to that of classical neural networks. This framework presents a promising path to address current limitations and harness the full potential of quantum-enhanced machine learning.
Antonio Jesús Banegas-Luna, Horacio Pérez-Sánchez, Carlos Martínez-Cortés
Comments 27 pages, 11 figures, 2 tables, 13 equations
While predictive accuracy is often prioritized in machine learning (ML) models, interpretability remains essential in scientific and high-stakes domains. However, diverse interpretability algorithms frequently yield conflicting explanations, highlighting the need for consensus to harmonize results. In this study, six ML models were trained on six synthetic datasets with known ground truths, utilizing various model-agnostic interpretability techniques. Consensus explanations were generated using established methods and a novel approach: WISCA (Weighted Scaled Consensus Attributions), which integrates class probability and normalized attributions. WISCA consistently aligned with the most reliable individual method, underscoring the value of robust consensus strategies in improving explanation reliability.
Zhenghao Li, Shengbo Wang, Nian Si
Distributionally robust reinforcement learning (DR-RL) has recently gained significant attention as a principled approach that addresses discrepancies between training and testing environments. To balance robustness, conservatism, and computational tractability, the literature has introduced DR-RL models with SA-rectangular and S-rectangular adversaries. While most existing statistical analyses focus on SA-rectangular models, owing to their algorithmic simplicity and the optimality of deterministic policies, S-rectangular models more accurately capture distributional discrepancies in many real-world applications and often yield more effective robust randomized policies. In this paper, we study the empirical value iteration algorithm for divergence-based S-rectangular DR-RL and establish near-optimal sample complexity bounds of $\widetilde{O}(|\mathcal{S}||\mathcal{A}|(1-γ)^{-4}\varepsilon^{-2})$, where $\varepsilon$ is the target accuracy, $|\mathcal{S}|$ and $|\mathcal{A}|$ denote the cardinalities of the state and action spaces, and $γ$ is the discount factor. To the best of our knowledge, these are the first sample complexity results for divergence-based S-rectangular models that achieve optimal dependence on $|\mathcal{S}|$, $|\mathcal{A}|$, and $\varepsilon$ simultaneously. We further validate this theoretical dependence through numerical experiments on a robust inventory control problem and a theoretical worst-case example, demonstrating the fast learning performance of our proposed algorithm.
Tobias Wegel, Gil Kur, Patrick Rebeschini
We study early-stopped mirror descent (ESMD) for high-dimensional Gaussian linear regression over arbitrary convex bodies and design matrices, where the task is to minimize the in-sample mean squared error. Our main result shows that some of the sharpest risk bounds for the least squares estimator (LSE), based on the local Gaussian width, extend to ESMD. We derive sufficient conditions on the potential, expressed via the Minkowski functional, under which our result holds. These conditions allow us to construct new potentials and analyze existing ones. Our results then yield general sufficient conditions for minimax optimality of ESMD, provide a systematic comparison with the LSE, and establish the tightest known risk bound in the $\ell_1$-constrained setting.
Leo Benac, Abhishek Sharma, Sonali Parbhoo, Finale Doshi-Velez
We consider the problem of estimating the transition dynamics $T^*$ from near-optimal expert trajectories in the context of offline model-based reinforcement learning. We develop a novel constraint-based method, Inverse Transition Learning, that treats the limited coverage of the expert trajectories as a \emph{feature}: we use the fact that the expert is near-optimal to inform our estimate of $T^*$. We integrate our constraints into a Bayesian approach. Across both synthetic environments and real healthcare scenarios like Intensive Care Unit (ICU) patient management in hypotension, we demonstrate not only significant improvements in decision-making, but also that our posterior can inform when transfer will be successful.
Vincent Berthet
Comments This paper reported that our predictive tool, FightTracker, generated large profits over an 8-week period against the bookmaker Unibet (90.17% ROI). New analyses over a much longer period show substantially lower performance. Because the original result may be misleading, we request withdrawal of the preprint
Mixed martial arts (MMA) has been one of the fastest-growing sports in recent years and has become a mainstream sport on the global stage. The growth of MMA has been driven by the Ultimate Fighting Championship (UFC), which is currently the largest MMA promotion organization in the world. However, data collection and statistical modeling in MMA are still in their infancy. We developed FightTracker, a data-driven solution that delivers real-time predictions for UFC fights. We first conducted regression analyses on the data provided by the UFC and MMA Decisions and built two predictive models of UFC fight outcomes. One model predicts the judges' majority score by round, while the other predicts whether the red fighter will win in 3-round fights that go beyond the second round (53% of all UFC fights). Both models use in-round fight statistics as explanatory variables and achieve 80% accuracy. We then designed an R Shiny app that delivers these two predictions in real time based on the ESPN live data. This information is valuable for fans, coaches, athletes, and especially bettors. Indeed, a live betting strategy based on FightTracker generated large profits over an 8-week period against the bookmaker Unibet (90.17% ROI).
Petr Philonenko, Vladimir Kokh, Pavel Blinov
Comments Accepted to the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)
Conventional medical cancer screening methods are costly, labor-intensive, and extremely difficult to scale. Although AI can improve cancer detection, most systems rely on complex or specialized medical data, making them impractical for large-scale screening. We introduce Can-SAVE, a lightweight AI system that ranks population-wide cancer risks solely based on medical history events. By integrating survival model outputs into a gradient-boosting framework, our approach detects subtle, long-term patient risk patterns - often well before clinical symptoms manifest. Can-SAVE was rigorously evaluated on a real-world dataset of 2.5 million adults spanning five Russian regions, marking the study as one of the largest and most comprehensive deployments of AI-driven cancer risk assessment. In a retrospective oncologist-supervised study over 1.9M patients, Can-SAVE achieves a 4-10x higher detection rate at identical screening volumes and an Average Precision (AP) of 0.228 vs. 0.193 for the best baseline (LoRA-tuned Qwen3-Embeddings via DeepSeek-R1 summarization). In a year-long prospective pilot (426K patients), our method almost doubled the cancer detection rate (+91%) and increased population coverage by 36% over the national screening protocol. The system demonstrates practical scalability: a city-wide population of 1 million patients can be processed in under three hours using standard hardware, enabling seamless clinical integration. This work proves that Can-SAVE achieves nationally significant cancer detection improvements while adhering to real-world public healthcare constraints, offering immediate clinical utility and a replicable framework for population-wide screening. Code for training and feature engineering is available at https://github.com/sb-ai-lab/Can-SAVE.
Mike Diessner, Kevin J. Wilson, Richard D. Whalley
NUBO, short for Newcastle University Bayesian Optimisation, is a Bayesian optimization framework for the optimization of expensive-to-evaluate black-box functions, such as physical experiments and computer simulators. Bayesian optimization is a cost-efficient optimization strategy that uses surrogate modelling via Gaussian processes to represent an objective function and acquisition functions to guide the selection of candidate points to approximate the global optimum of the objective function. NUBO itself focuses on transparency and user experience to make Bayesian optimization easily accessible to researchers from all disciplines. Clean and understandable code, precise references, and thorough documentation ensure transparency, while user experience is ensured by a modular and flexible design, easy-to-write syntax, and careful selection of Bayesian optimization algorithms. NUBO allows users to tailor Bayesian optimization to their specific problem by writing the optimization loop themselves using the provided building blocks. It supports sequential single-point, parallel multi-point, and asynchronous optimization of bounded, constrained, and/or mixed (discrete and continuous) parameter input spaces. Only algorithms and methods that are extensively tested and validated to perform well are included in NUBO. This ensures that the package remains compact and does not overwhelm the user with an unnecessarily large number of options. The package is written in Python but does not require expert knowledge of Python to optimize your simulators and experiments. NUBO is distributed as open-source software under the BSD 3-Clause license.
Yan Cui, Zhou Zhou
Comments 57 pages
We consider the problem of constructing joint simultaneous confidence bands (JSCBs) for the regression coefficient functions of a time series scalar-on-function linear regression, when the regression model is estimated by a roughness penalization approach with flexible choices of orthonormal basis functions. We propose a simple and unified multiplier bootstrap methodology for JSCB construction and show that it achieves the correct coverage probability asymptotically. Furthermore, the JSCB is asymptotically robust to inconsistently estimated standard deviations of the model. The proposed methodology is applied to an electricity market time series data set to visually investigate and formally test the overall regression relationship, as well as to perform model validation.
James Matuk, Amy H. Herring, David B. Dunson
Functional factor analysis is an important dimension reduction method for functional and longitudinal data. Factor loadings give insight into patterns of variability of the observations, while latent factors provide a low-dimensional representation of the data that is useful for inferential tasks. Constraining the functional factor loadings to be mutually orthogonal is desirable for model parsimony but is computationally challenging. In this work, we introduce nearly mutually orthogonal processes, which can be used to effectively enforce mutual orthogonality of factor loadings while maintaining computational simplicity and efficiency. The joint distribution is governed by a penalty parameter that determines the degree to which the processes are mutually orthogonal and is related to ease of posterior computation. We demonstrate that our approach can be used for flexible and interpretable inference in an application to studying the effects of breastfeeding status, illness, and demographic factors on weight dynamics in early childhood. Code is available on GitHub: https://github.com/jamesmatuk/NeMO-FFA