On the singularity of the Fisher Information matrix in the sine-skewed family on the d-dimensional torus
Comments 8 pages
Emily Schutte, Sophia Loizidou, Vincent Laheurte
Skewed distributions are fundamental in modelling asymmetric data on the d-dimensional torus. In this context, asymmetry is introduced through the sine-skewing mechanism, which is the only skewing mechanism that has been proposed on the hyper-torus in the literature. Some sine-skewed models are known to suffer from a singular Fisher information matrix in the vicinity of symmetry, which poses a significant issue for inferential purposes. It is an open question to determine for which sine-skewed models Fisher information singularity occurs. In this paper, a general characterization of the class of models that exhibit this singularity is given in the general d-dimensional setting.
Sibsankar Singha, Marie Kratz, Sreekar Vadlamani
Comments 24 pages, 2 figures
Geometric (also known as spatial) quantiles, introduced by Chaudhury and representing one of the three principal approaches to defining multivariate quantiles, have been well studied in the literature. In this work, we focus on the extremal behaviour of these quantiles. We establish new extremal properties, namely general lower and upper bounds for the norm of extreme geometric quantiles, free of any moment conditions. We discuss the impact of such results on the characterization of distribution behaviour. Importantly, the lower bound can be directly linked to univariate quantiles and to halfspace (Tukey) depth central regions, highlighting a novel connection between these two fundamental notions of multivariate quantiles.
Kelly L Vomo-Donfack, Adryel Hoszu, Grégory Ginot, Ian Morilla
Comments 22 pages, 6 Figures
Federated learning (FL) faces two structural tensions: gradient sharing enables data-reconstruction attacks, while non-IID client distributions degrade aggregation quality. We introduce PTOPOFL, a framework that addresses both challenges simultaneously by replacing gradient communication with topological descriptors derived from persistent homology (PH). Clients transmit only 48-dimensional PH feature vectors (compact shape summaries whose many-to-one structure makes inversion provably ill-posed) rather than model gradients. The server performs topology-guided personalised aggregation: clients are clustered by Wasserstein similarity between their PH diagrams, intra-cluster models are topology-weighted, and clusters are blended with a global consensus. We prove an information-contraction theorem showing that PH descriptors leak strictly less mutual information per sample than gradients under strongly convex loss functions, and we establish linear convergence of the Wasserstein-weighted aggregation scheme with an error floor strictly smaller than FedAvg. Evaluated against FedAvg, FedProx, SCAFFOLD, and pFedMe on a non-IID healthcare scenario (8 hospitals, 2 adversarial) and a pathological benchmark (10 clients), PTOPOFL achieves AUC 0.841 and 0.910 respectively, the highest in both settings, while reducing reconstruction risk by a factor of 4.5 relative to gradient sharing. Code is publicly available at https://github.com/MorillaLab/TopoFederatedL and data at https://doi.org/10.5281/zenodo.18827595.
Yujia Wu, Xiucai Ding, Jingfei Zhang, Wei Lan, Chih-Ling Tsai
Comments 46 pages. This manuscript presents a significant generalization and resolves several issues in the previous submission, arXiv:2409.05276, which now appears as a special case within the current framework
To characterize the community structure in network data, researchers have developed various block-type models, including the stochastic block model, the degree-corrected stochastic block model, the mixed membership block model, the degree-corrected mixed membership block model, and others. A critical step in applying these models effectively is determining the number of communities in the network. However, to the best of our knowledge, existing methods for estimating the number of network communities either rely on explicit model fitting or fail to simultaneously accommodate network sparsity and a diverging number of communities. In this paper, we propose a model-free spectral inference method based on eigengap ratios that addresses these challenges. The inference procedure is straightforward to compute, requires no parameter tuning, and can be applied to a wide range of block models without the need to estimate network distribution parameters. Furthermore, it is effective for both dense and sparse networks with a divergent number of communities. Technically, we show that the proposed spectral test statistic converges to a function of the type-I Tracy-Widom distribution via the Airy kernel under the null hypothesis, and that the test is asymptotically powerful under weak alternatives. Simulation studies on both dense and sparse networks demonstrate the efficacy of the proposed method. Three real-world examples are presented to illustrate the usefulness of the proposed test.
Yidan Sun, Mayank Kejriwal
Understanding how social networks form, whether through reciprocity, shared attributes, or triadic closure, is central to computational social science. Exponential Random Graph Models (ERGMs) offer a principled framework for testing such formation theories, but translating qualitative social hypotheses into stable statistical specifications remains a significant barrier, requiring expertise in both network theory and model estimation. We present Forge (Formation-Oriented Reasoning with Guarded ERGMs), a framework that uses large language models to automate this translation. Given a network and an informal description of the social context, Forge proposes candidate formation mechanisms, validates them against feasibility and stability constraints, and iteratively refines specifications using goodness-of-fit diagnostics. Evaluation across twelve benchmark networks spanning schools, organizations, and online communication shows that Forge converges in 10 of 12 cases, and conditional on convergence it achieves the best likelihood-based fit in 9 of 10 while meeting adequacy thresholds. By combining LLM-based proposals with statistical guardrails, Forge reduces the manual effort required for ERGM specification.
Sofia Kaisaridi, Juliette Ortholand, Caglayan Tuna, Hugues Chabriat, Sophie Tezenas du Montcel
The progression of chronic diseases often follows highly variable trajectories, and the underlying factors remain poorly understood. Standard mixed-effects models typically represent inter-patient differences as random deviations around a common reference, which may obscure meaningful subgroups. We propose a probabilistic mixture extension of a mixed effects model, the Disease Course Mapping model, to identify distinct disease progression subtypes within a population. The mixture structure is introduced at the latent individual parameters, enabling clustering based on both temporal and spatial variability in disease trajectories. We evaluated the model through simulation studies to assess classification performance and parameter recovery. Classification accuracy exceeded 90% in simpler scenarios and remained above 80% in the most complex case, with particularly high recall and precision for fast-progressing clusters. Compared to a post hoc classification approach, the proposed model yielded more accurate parameter estimates, smaller biases, lower root mean squared errors, and reduced uncertainty. It also correctly recovered the true three-cluster structure in 93% of the simulations. Finally, we applied the model to a longitudinal cohort of CADASIL patients, identifying two clinically meaningful clusters, differentiating patients with early versus late onset and fast versus slow progression, with clear spatial patterns across motor and memory scores. Overall, this probabilistic mixture framework offers a robust, interpretable approach for clustering patients based on spatiotemporal disease dynamics.
M. L. Gámiz, N. Limnios, D. Montoro-Cazorla, M. C. Segovia-García
Comments 36 pages, 5 figures
This paper develops a comprehensive Markov-based framework for modelling reservoir behaviour and assessing key performance measures such as reliability and resilience. We first formulate a stochastic model for a finite-capacity dam, analysing its long-term storage dynamics under both independent and identically distributed inflows, following the Moran model, and correlated inflows represented by an ergodic Markov chain in the Lloyd formulation. For this finite case, we establish stationary water balance relations and derive asymptotic results, including a central limit theorem for storage levels. The analysis is then extended to an infinite-capacity reservoir, for which normal limit distributions and analogous long-term properties are obtained. A continuous-state formulation is also introduced to represent reservoirs with continuous inflow processes, generalizing the discrete-state framework. On this basis, we define and evaluate reliability and resilience metrics within the proposed Markovian context. The applicability of the methodology is demonstrated through a real-world case study of the Quiebrajano dam, illustrating how the developed models can support efficient and sustainable reservoir management under hydrological uncertainty.
Timo Dimitriadis, Marius Puke
We introduce inference methods for score decompositions, which partition scoring functions for predictive assessment into three interpretable components: miscalibration, discrimination, and uncertainty. Our estimation and inference rely on a linear recalibration of the forecasts, which is applicable to general multi-step ahead point forecasts such as means and quantiles due to its validity for both smooth and non-smooth scoring functions. This approach ensures desirable finite-sample properties, enables asymptotic inference, and establishes a direct connection to the classical Mincer-Zarnowitz regression. The resulting inference framework facilitates tests for equal forecast calibration or discrimination, which yield three key advantages. They enhance the information content of predictive ability tests by decomposing scores, deliver higher statistical power in certain scenarios, and formally connect scoring-function-based evaluation to traditional calibration tests, such as financial backtests. Applications demonstrate the method's utility. We find that for survey inflation forecasts, discrimination abilities can differ significantly even when overall predictive ability does not. In an application to financial risk models, our tests provide deeper insights into the calibration and information content of volatility and Value-at-Risk forecasts. By disentangling forecast accuracy from backtest performance, the method exposes critical shortcomings in current banking regulation.
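As a rough illustration of the decomposition described above, the sketch below treats the simplest case: mean forecasts scored by squared error, with the recalibrated forecast obtained from an in-sample Mincer-Zarnowitz regression of outcomes on forecasts. The abstract does not state the formulas, so the form used here (MCB = S(x) - S(x_rc), DSC = S(ref) - S(x_rc), UNC = S(ref), hence S = MCB - DSC + UNC, with the sample mean as reference forecast) is an assumption, and all function names are hypothetical.

```python
def mean_squared_error(forecasts, outcomes):
    """Average squared-error score of the forecasts."""
    return sum((f - y) ** 2 for f, y in zip(forecasts, outcomes)) / len(outcomes)


def mincer_zarnowitz_fit(forecasts, outcomes):
    """OLS intercept and slope from regressing outcomes on forecasts."""
    n = len(outcomes)
    f_bar = sum(forecasts) / n
    y_bar = sum(outcomes) / n
    sxx = sum((f - f_bar) ** 2 for f in forecasts)
    sxy = sum((f - f_bar) * (y - y_bar) for f, y in zip(forecasts, outcomes))
    slope = sxy / sxx
    return y_bar - slope * f_bar, slope


def decompose_score(forecasts, outcomes):
    """Assumed decomposition: score = MCB - DSC + UNC (squared error, mean forecasts)."""
    intercept, slope = mincer_zarnowitz_fit(forecasts, outcomes)
    recalibrated = [intercept + slope * f for f in forecasts]
    y_bar = sum(outcomes) / len(outcomes)
    reference = [y_bar] * len(outcomes)  # constant reference forecast

    score = mean_squared_error(forecasts, outcomes)
    score_rc = mean_squared_error(recalibrated, outcomes)
    score_ref = mean_squared_error(reference, outcomes)
    return {
        "score": score,
        "MCB": score - score_rc,      # loss removed by linear recalibration
        "DSC": score_ref - score_rc,  # gain of recalibrated forecast over the reference
        "UNC": score_ref,             # score of the constant reference
    }


parts = decompose_score([1.0, 2.0, 3.0, 4.0], [1.5, 1.9, 3.6, 3.8])
print(parts)
```

Because the recalibration is fitted by least squares on the same sample, both MCB and DSC are non-negative here; quantile forecasts would need a different recalibration step than this sketch shows.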
Antonio Panico, Andrew Burlinson, Luigi Grossi
Accurate estimation of Marginal Emission Factors (MEFs) is critical for evaluating the decarbonization potential of low-carbon technologies and demand-side management. However, canonical methodologies, predominantly relying on linear regression and differencing techniques, fail to capture the structural non-linearities inherent in the merit order, i.e. the marginal technology setting electricity prices. Utilizing Markov switching autoregressive models with exogenous regressors (MS-ARX) and hourly US data (2019-2025), we identify distinct, mutually exclusive regimes governed by fuel-price dynamics. We find that linear models overestimate abatement potential by masking the dichotomy between a gas-driven and coal-driven marginal system. Furthermore, using robust structural break detection, we link regime instability to a specific structural shift in natural gas pricing in May 2022. Our results indicate that post-2022, the grid has transitioned into a correction phase where the coal-driven regime is less persistent but highly volatile, necessitating state-dependent policy metrics rather than static annual averages.
Nicolás Ferrari-Ortiz, Sebastián Orellana-Montini, Timur Abbiasov, Marie Garkavenko, Rutger Lit
Experimentation is central to modern digital businesses, but many operational decisions cannot be randomized at the user level. In such cases, cluster-level experiments, where clusters are usually geographic, come to the rescue. However, such experiments often suffer from low power due to persistent cluster heterogeneity, strong seasonality, and autocorrelated outcome metrics, as well as common shocks that move many clusters simultaneously. Using airline pricing as an example, where policies are typically applied at the route level and the A/B test unit of analysis is therefore a route, we study switchback designs to remedy these problems. In switchback designs, each cluster (route in our case) alternates between treatment and control on a fixed schedule, creating within-route contrasts that mitigate time-invariant heterogeneity and reduce sensitivity to low-frequency noise. We provide a unified Two-Way Fixed Effects interpretation of switchback experiments that makes the identifying variation explicit after partialling out route and time effects, clarifying how switching cadence interacts with temporal dependence to determine precision. Empirically, we evaluate weekly and daily switchback cadences using calibrated synthetic regimes and operational airline data from ancillary pricing. In our evaluations, switchbacks decrease standard errors by up to 67%, with daily switching yielding the largest gains over short horizons and weekly switching offering a strong and simpler-to-operationalize alternative.
Yunhan Wu, Finn Lindgren, Heidi A. Hanson
Producing reliable estimates of health and demographic indicators at fine areal scales is crucial for examining heterogeneity and supporting localized health policy. However, many surveys release outcomes only at coarser administrative levels, thereby limiting their relevance for decision-making. We propose a fully Bayesian, single-stage spatial modeling framework for area-level disaggregation that generates fine-scale estimates of indicators directly from coarsely aggregated survey data. By defining a latent spatial process at the target resolution and linking it to observed outcomes through an aggregation step, the framework adopts small-area estimation techniques while incorporating covariates and delivering coherent uncertainty quantification. The proposed methods are implemented with inlabru to achieve computational efficiency. We evaluate performance through a simulation study of general fertility rates in Kenya to demonstrate the models' ability to recover fine-scale variation across diverse data-generating scenarios. We further apply the framework to two national surveys to produce district-level fertility estimates from the 2022 Kenya Demographic and Health Survey and, more importantly, district-level indicators for unpaid care and domestic work and mass media usage from the 2021 Kenya Time Use Survey.
Kwong Yu Chong, Long Feng
We introduce Latent Space Distribution Matching (LSDM), a novel framework for semi-supervised generative modeling of conditional distributions. LSDM operates in two stages: (i) learning a low-dimensional latent space from both paired and unpaired data, and (ii) performing joint distribution matching in this space via the 1-Wasserstein distance, using only paired data. This two-step approach minimizes an upper bound on the 1-Wasserstein distance between joint distributions, reducing reliance on scarce paired samples while enabling fast one-step generation. Theoretically, we establish non-asymptotic error bounds and demonstrate a key benefit of unpaired data: enhanced geometric fidelity in generated outputs. Furthermore, by extending the scope of its two core steps, LSDM provides a coherent statistical perspective that connects to a broad class of latent-space approaches. Notably, Latent Diffusion Models (LDMs) can be viewed as a variant of LSDM, in which joint distribution matching is achieved indirectly via score matching. Consequently, our results also provide theoretical insights into the consistency of LDMs. Empirical evaluations on real-world image tasks, including class-conditional generation and image super-resolution, demonstrate the effectiveness of LSDM in leveraging unpaired data to enhance generation quality.
Raphaël Razafindralambo, Rémy Sun, Frédéric Precioso, Damien Garreau, Pierre-Alexandre Mattei
Density aggregation is a central problem in machine learning, for instance when combining predictions from a Deep Ensemble. The choice of aggregation remains an open question with two commonly proposed approaches being linear pooling (probability averaging) and geometric pooling (logit averaging). In this work, we address this question by studying the normalized generalized mean of order $r \in \mathbb{R} \cup \{-\infty,+\infty\}$ through the lens of log-likelihood, the standard evaluation criterion in machine learning. This provides a unifying aggregation formalism and shows different optimal configurations for different situations. We show that the regime $r \in [0,1]$ is the only range ensuring systematic improvements relative to individual distributions, thereby providing a principled justification for the reliability and widespread practical use of linear ($r=1$) and geometric ($r=0$) pooling. In contrast, we show that aggregation rules with $r \notin [0,1]$ may fail to provide consistent gains with explicit counterexamples. Finally, we corroborate our theoretical findings with empirical evaluations using Deep Ensembles on image and text classification benchmarks.
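To make the aggregation rule concrete, here is a minimal sketch of the normalized generalized mean of order $r$ for discrete predictive distributions with equal member weights. Equal weighting, strictly positive probabilities, and the function name are assumptions made for illustration; the limits $r=0$, $r=+\infty$ and $r=-\infty$ are handled as the geometric mean, maximum and minimum respectively.

```python
import math


def generalized_mean_pool(dists, r):
    """Normalized generalized (power) mean of order r across member distributions.

    dists: list of probability vectors over the same classes, entries > 0.
    r: real order, or +/- math.inf; r = 1 is linear pooling, r = 0 geometric pooling.
    """
    k = len(dists)
    pooled = []
    for y in range(len(dists[0])):
        vals = [d[y] for d in dists]
        if r == 0:
            # limit r -> 0: geometric mean (average of log-probabilities)
            m = math.exp(sum(math.log(v) for v in vals) / k)
        elif r == math.inf:
            m = max(vals)
        elif r == -math.inf:
            m = min(vals)
        else:
            m = (sum(v ** r for v in vals) / k) ** (1.0 / r)
        pooled.append(m)
    total = sum(pooled)  # renormalize so the result is a distribution
    return [m / total for m in pooled]


members = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
for order in (-math.inf, 0, 0.5, 1, math.inf):
    print(order, generalized_mean_pool(members, order))
```

For $r=1$ the normalization is a no-op, since an average of probability vectors already sums to one; for every other order it is needed.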
Cameron Bell, Timothy Johnston, Antoine Luciano, Christian P Robert
Theoretical and applied research into privacy encompasses an incredibly broad swathe of differing approaches, emphases and aims. This work introduces a new quantitative notion of privacy that is both contextual and specific. We argue that it provides a more meaningful notion of privacy than the widely utilised framework of differential privacy and a more explicit and rigorous formulation than what is commonly used in statistical disclosure theory. Our definition relies on concepts inherent to standard Bayesian decision theory, while departing from it in several important respects. In particular, the party controlling the release of sensitive information should make disclosure decisions from the prior viewpoint, rather than conditional on the data, even when the data is itself observed. Illuminating toy examples and computational methods are discussed in detail in order to highlight the specificities of the method.
Ikhlas Enaieh, Olivier Fercoq
Deep Neural Networks are powerful tools for solving machine learning problems, but their training often involves dense and costly parameter updates. In this work, we use a novel Max-Plus neural architecture in which classical addition and multiplication are replaced with maximum and summation operations respectively. This is a promising architecture in terms of interpretability, but its training is challenging. A particular feature is that this algebraic structure naturally induces sparsity in the subgradients, as only neurons that contribute to the maximum affect the loss. However, standard backpropagation fails to exploit this sparsity, leading to unnecessary computations. In this work, we focus on the minimization of the worst sample loss which transfers this sparsity to the optimization loss. To address this, we propose a sparse subgradient algorithm that explicitly exploits the algebraic sparsity. By tailoring the optimization procedure to the non-smooth nature of Max-Plus models, our method achieves more efficient updates while retaining theoretical guarantees. This highlights a principled path toward bridging algebraic structure and scalable learning.
Yuhao Deng, Le Kang
The difference-in-differences (DiD) design is a quasi-experimental method for estimating treatment effects. In staggered DiD with multiple treatment groups and periods, estimation based on the two-way fixed effects model yields negative weights when averaging heterogeneous group-period treatment effects into an overall effect. To address this issue, we first define group-period average treatment effects on the treated (ATT), and then define groupwise, periodwise, dynamic, and overall ATTs nonparametrically, so that the estimands are model-free. We propose doubly robust estimators for these types of ATTs in the form of augmented inverse variance weighting (AIVW). The proposed framework allows time-varying covariates that partially explain the time trends in outcomes. Even if part of the working models is misspecified, the proposed estimators still consistently estimate the parameter of interest. The asymptotic variance can be explicitly computed from influence functions. Under a homoskedastic working model, the AIVW estimator is simplified to an augmented inverse probability weighting (AIPW) estimator. We demonstrate the desirable properties of the proposed estimators through simulation and an application that compares the effects of a parallel admission mechanism with immediate admission on the China National College Entrance Examination.
Raunak Mukherjee, Sharayu Moharir
Comments 25 pages, 2 Figures
We study fixed-budget constrained best-arm identification in grouped bandits, where each arm consists of multiple independent attributes with stochastic rewards. An arm is considered feasible only if all its attributes' means are above a given threshold. The aim is to find the feasible arm with the largest overall mean. We first derive a lower bound on the error probability for any algorithm in this setting. We then propose Feasibility Constrained Successive Rejects (FCSR), a novel algorithm that identifies the best arm while ensuring feasibility. We show it attains optimal dependence on problem parameters up to constant factors in the exponent. Empirically, FCSR outperforms natural baselines while preserving feasibility guarantees.
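The target of the identification problem can be written down directly when the attribute means are known. The sketch below is only this oracle definition, not the FCSR algorithm, and it assumes that the "overall mean" of an arm is the average of its attribute means (the abstract does not specify this); the function name is hypothetical.

```python
def best_feasible_arm(attribute_means, threshold):
    """Index of the feasible arm with the largest overall mean, or None.

    attribute_means: one list of attribute means per arm.
    An arm is feasible only if every attribute mean is strictly above threshold.
    Assumption: the overall mean is the average of the attribute means.
    """
    best_arm, best_value = None, float("-inf")
    for arm, means in enumerate(attribute_means):
        if min(means) <= threshold:
            continue  # at least one attribute violates the constraint
        overall = sum(means) / len(means)
        if overall > best_value:
            best_arm, best_value = arm, overall
    return best_arm


means = [
    [0.9, 0.2],   # highest average, but one attribute is below the threshold
    [0.6, 0.6],   # feasible
    [0.55, 0.5],  # feasible, lower average
]
print(best_feasible_arm(means, threshold=0.4))  # 1
```

The first arm shows why the constraint matters: it has the largest average but is excluded because a single attribute falls below the threshold.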
Øystein Sørensen
Dynamic structural equation models (DSEMs) combine time-series modeling of within-person processes with hierarchical modeling of between-person differences and differences between timepoints, and have become very popular for the analysis of intensive longitudinal data in the social sciences. An important computational bottleneck has, however, still not been resolved: whenever the underlying process is assumed to be latent and measured by one or more indicators per timepoint, currently published algorithms rely on inefficient brute-force Markov chain Monte Carlo sampling which scales poorly as the number of timepoints and participants increases and results in highly correlated samples. The main result of this paper shows that the within-level part of any DSEM can be reformulated as a linear Gaussian state space model. Consequently, the latent states can be analytically marginalized using a Kalman filter, allowing for highly efficient estimation via Hamiltonian Monte Carlo. This makes estimation of DSEMs computationally tractable for much larger datasets -- both in terms of timepoints and participants -- than what has been previously possible. We demonstrate the proposed algorithm in several simulation experiments, showing it can be orders of magnitude more efficient than standard Metropolis-within-Gibbs approaches.
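The marginalization step can be illustrated on the smallest possible case: a scalar latent AR(1) process observed with Gaussian noise. This is a toy sketch under stated assumptions (one latent state, one indicator, known initial distribution), not the paper's general DSEM reformulation; the parameter names `phi`, `q`, `r`, `m0`, `p0` are chosen here for illustration. The Kalman filter returns the log-likelihood of the observations with the latent states integrated out, which is the quantity a sampler such as Hamiltonian Monte Carlo would then evaluate for each parameter proposal.

```python
import math


def kalman_loglik(ys, phi, q, r, m0=0.0, p0=1.0):
    """Marginal log-likelihood of ys under
        x_t = phi * x_{t-1} + w_t,  w_t ~ N(0, q)
        y_t = x_t + v_t,            v_t ~ N(0, r)
    with x_0 ~ N(m0, p0). The latent x_t are integrated out analytically.
    """
    m, p = m0, p0
    loglik = 0.0
    for y in ys:
        # predict step: distribution of x_t given y_1..y_{t-1}
        m_pred = phi * m
        p_pred = phi * phi * p + q
        # one-step-ahead predictive distribution of y_t
        innovation = y - m_pred
        s = p_pred + r
        loglik += -0.5 * (math.log(2.0 * math.pi * s) + innovation ** 2 / s)
        # update step: distribution of x_t given y_1..y_t
        gain = p_pred / s
        m = m_pred + gain * innovation
        p = (1.0 - gain) * p_pred
    return loglik


print(kalman_loglik([0.3, -0.1, 0.4], phi=0.8, q=0.5, r=0.2))
```

The cost is linear in the number of timepoints, and no latent state ever appears as a sampled parameter.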
Alexander Lehner
Spatial autocorrelation in regression models can lead to downward biased standard errors and thus incorrect inference. The most common correction in applied economics is the spatial heteroskedasticity and autocorrelation consistent (HAC) standard error estimator introduced by Conley (1999). A critical input is the kernel bandwidth: the distance within which residuals are allowed to be correlated. However, this is still an unresolved problem and there is no formal guidance in the literature. In this paper, I first document that the relationship between the bandwidth and the magnitude of spatial HAC standard errors is inverse-U shaped. This implies that both too narrow and too wide bandwidths lead to underestimated standard errors, contradicting the conventional wisdom that wider bandwidths yield more conservative inference. I then propose a simple, non-parametric, data-driven bandwidth selector based on the empirical covariogram of regression residuals. In extensive Monte Carlo experiments calibrated to empirically relevant spatial correlation structures across the contiguous United States, I show that the proposed method controls the false positive rate at or near the nominal 5% level across a wide range of spatial correlation intensities and sample configurations. I compare six kernel functions and find that the Bartlett and Epanechnikov kernels deliver the best size control. An empirical application using U.S. county-level data illustrates the practical relevance of the method. The R package SpatialInference implements the proposed bandwidth selection method.
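The role of the bandwidth is easiest to see in a stripped-down version of the spatial HAC estimator. The sketch below handles a single regressor without intercept, planar coordinates with Euclidean distance, and a Bartlett kernel applied to the scaled distance; all of these are simplifying assumptions for illustration. It implements neither the `SpatialInference` package nor the paper's covariogram-based bandwidth selector, and the function names are hypothetical.

```python
import math


def bartlett(u):
    """Bartlett (triangular) kernel: weight decays linearly to zero at |u| = 1."""
    return max(0.0, 1.0 - abs(u))


def conley_variance(x, resid, coords, bandwidth):
    """Spatial HAC variance of the slope in y = beta * x + e (no intercept).

    Pairs of observations closer than `bandwidth` contribute cross terms
    x_i e_i x_j e_j, weighted by the kernel; more distant pairs get weight zero.
    """
    n = len(x)
    meat = 0.0
    for i in range(n):
        for j in range(n):
            dist = math.dist(coords[i], coords[j])
            meat += bartlett(dist / bandwidth) * x[i] * resid[i] * x[j] * resid[j]
    bread = sum(v * v for v in x)
    return meat / bread ** 2


x = [1.0, 2.0, 3.0]
resid = [0.5, -0.2, 0.1]
coords = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
for h in (0.5, 2.0, 10.0):
    print(h, conley_variance(x, resid, coords, h))
```

With a bandwidth below the smallest pairwise distance only the diagonal terms survive and the estimate coincides with the heteroskedasticity-robust (HC0) variance. A distance-based Bartlett weight in two dimensions is not guaranteed to give a non-negative estimate, which is why the function returns the variance rather than its square root.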
Francisco F. Queiroz, Johannes Brachem, Paul F. V. Wiemann, Thomas Kneib
Bounded continuous data on the unit interval frequently arise in applied fields and often exhibit a non-negligible proportion of observations at the boundaries. Inflated regression models address this feature by combining a continuous distribution on the unit interval with a discrete component to account for zero- and/or one-inflation. In this paper, we propose a class of Bayesian structured additive quantile regression models for inflated bounded continuous data that accommodates zero- and/or one-inflation. The proposed approach enables direct modeling of both the conditional quantiles of the continuous component and the probabilities of observing zeros and/or ones, with structured additive predictors incorporated in both parts, including nonlinear effects, spatial effects, random effects, and varying-coefficient terms. Posterior inference is carried out using Markov chain Monte Carlo algorithms implemented through the software Liesel, a probabilistic programming framework for semiparametric regression. The practical performance of the proposed models is illustrated through simulation studies and two real-data applications: one analyzing the proportion of traffic-related fatalities across Brazilian municipal districts, and another evaluating speech intelligibility in cochlear implant recipients under different experimental conditions.
Miltiadis Galanis, Michail Louvaris
Comments 10 pages
We consider a sparse i.i.d.\ non-Hermitian random matrix model $X_n$ (with sparsity parameter $K_n$) and a deterministic finite-rank perturbation $E_n$. Assuming biorthogonality for $E_n$ and a growth condition on $K_n$, we outline a finite-rank resolvent reduction leading to asymptotics for the overlap between an outlier eigenvector of $Y_n:=X_n+E_n$ and the corresponding spike eigenspace. In particular, for an outlier spike $\mu$ with $|\mu|>1$, the squared projection of the associated (right) eigenvector onto the spike eigenspace converges in probability to $1-|\mu|^{-2}$. Our result generalizes Theorem 1.6 of [HLN26] to the general finite-rank case, solving Open Problem 5.
Anirban Ghosh, Raju Maiti
Multiple seasonalities have been widely studied in continuous time series using models such as TBATS, for instance in electricity demand forecasting. However, their treatment in categorical time series, such as air quality index (AQI) data, remains limited. Categorical AQI often exhibits distinct seasonal patterns at multiple frequencies, which are not captured by standard models. In this paper, we propose a framework that models multiple seasonalities using Fourier series and indicator functions, inspired by the TBATS methodology. The approach accommodates the ordinal nature of AQI categories while explicitly capturing daily, weekly and yearly seasonal cycles. Simulation studies demonstrate the empirical consistency of parameter estimates under the proposed model. We further illustrate its applicability using real categorical AQI data from Kolkata and compare forecasting performance with Markov models and machine learning methods. Results indicate that our approach effectively captures complex seasonal dynamics and provides improved predictive accuracy. The proposed methodology offers a flexible and interpretable framework for analyzing categorical time series exhibiting multiple seasonal patterns, with potential applications in air quality monitoring, energy consumption and other environmental domains.
Pengyu Zhang, Arnaud Vadeboncoeur, Alex Glyn-Davies, Mark Girolami
Inverse problems are the task of calibrating models to match data. They play a pivotal role in diverse engineering applications by allowing practitioners to align models with reality. In many applications, engineers and scientists do not have a complete picture of i) the detailed properties of a system (such as material properties, geometry, initial conditions, etc.); ii) the complete laws describing all dynamics at play (such as friction laws, complicated damping phenomena, and general nonlinear interactions). In this paper, we develop a principled methodology for leveraging data from collections of distinct yet related physical systems to jointly estimate the individual model parameters of each system, and learn the shared unknown dynamics in the form of an ML-based closure model. To robustly infer the unknown parameters for each system, we employ a hierarchical Bayesian framework, which allows for the joint inference of multiple systems and their population-level statistics. To learn the closures, we use a maximum marginal likelihood estimate of a neural network embedded within the ODE/PDE formulation of the problem. To realize this framework we utilize the ensemble Metropolis-Adjusted Langevin Algorithm (MALA) for stable and efficient sampling. To mitigate the computational bottleneck of repetitive forward evaluations in solving inverse problems, we introduce a bilevel optimization strategy to simultaneously train a surrogate forward model alongside the inference. Within this framework, we evaluate and compare distinct surrogate architectures, specifically Fourier Neural Operators (FNOs) and parametric Physics-Informed Neural Networks (PINNs).
Wolfgang Hoegele
We present a computational framework to investigate steady state distributions and perform stability analysis for random ordinary differential equations driven by parameter uncertainty. Using the nonlinear Rosenzweig-MacArthur predator-prey model as a case study, we characterize the non-trivial equilibrium steady state of the system and investigate its complex distribution when the parameter probability densities are multi-modal mixture models with partially overlapping or separated components. Consequently, this application includes both uncertainties and superpositions of the system parameters. In addition, we present the stability analysis of steady states based on the eigenvalue distribution of the system's Jacobian matrix in this stochastic regime. The steady state posterior density and stability metrics are computed with a recently published Monte Carlo based numerical scheme specifically designed for random equation systems (Hoegele, 2026). In particular, we demonstrate the simplicity of this stochastic extension of dynamic systems combined with a broadly applicable computational approach. Numerical experiments show the emergence of multi-modal steady state distributions of the predator-prey model and we calculate their stability regions, illustrating the method's applicability to uncertainty quantification in dynamical systems.
Margherita Lazzaretto, Jonas Peters, Niklas Pfister
Comments 32 pages, 7 figures
We consider stochastic non-stationary linear bandits where the linear parameter connecting contexts to the reward changes over time. Existing algorithms in this setting localize the policy by gradually discarding or down-weighting past data, effectively shrinking the time horizon over which learning can occur. However, in many settings historical data may still carry partial information about the reward model. We propose to leverage such data while adapting to changes, by assuming the reward model decomposes into stationary and non-stationary components. Based on this assumption, we introduce ISD-linUCB, an algorithm that uses past data to learn invariances in the reward model and subsequently exploits them to improve online performance. We show both theoretically and empirically that leveraging invariance reduces the problem dimensionality, yielding significant regret improvements in fast-changing environments when sufficient historical data is available.
Inge G. Helland, Nils Lid Hjort, Gunnar Taraldsen
Comments 7 pages, no figures; Statistical Research Report, Department of Mathematics, University of Oslo, February 2023, but now arXiv'd March 2026. The article has appeared in International Encyclopedia of Statistical Science 2024, pages 1894-1899, Springer, at this url: https://link.springer.com/content/pdf/10.1007/978-3-662-69359-9_471.pdf
The philosophical foundations of statistics involve issues in theoretical statistics, such as goals and methods to meet these goals, and interpretation of the meaning of inference using statistics. They are related to the philosophy of science and to the philosophy of probability. We review the core and partly interrelated themes and place them in context.
Daisuke Kondo, Shonosuke Sugasawa
Comments 25 pages
Regression discontinuity designs (RDD) are widely used for causal inference. In many empirical applications, treatment effects vary substantially with covariates, and ignoring such heterogeneity can lead to misleading conclusions, which motivates flexible modeling of heterogeneous treatment effects in RDD. To this end, we propose a Bayesian nonparametric approach to estimating heterogeneous treatment effects based on Bayesian Additive Regression Trees (BART). The key feature of our method lies in adopting a general Bayesian framework using a pseudo-model defined through a loss function for fitting local linear models around the cutoff, which gives direct modeling of heterogeneous treatment effects by BART. Optimal selection of the bandwidth parameter for the local model is implemented using the Hyvärinen score. Through numerical experiments, we demonstrate that the proposed approach flexibly captures complicated structures of heterogeneous treatment effects as a function of covariates.
Kanti V. Mardia, Antonio Mauricio F. L. Miranda de Sa'
Comments 33 pages, 14 figures
This paper is motivated by a cutting-edge application in neuroscience: the analysis of electroencephalogram (EEG) signals recorded under flash stimulation. Under commonly used signal-processing assumptions, only the phase angle of the EEG is required for the analysis of such applications. We demonstrate that these assumptions imply that the phase has a projected isotropic normal distribution. We revisit this distribution and derive several new properties, including closed-form expressions for its trigonometric moments. We then examine the distribution of the mean resultant and its square -- a statistic of central importance in phase-based EEG studies. The distribution of the resultant is analytically intricate; to make it practically useful, we develop two approximations based on the well-known resultant distribution for the von Mises distribution. We then study inference problems for this projected isotropic normal distribution. The method is illustrated with an application to EEG data from flash-stimulation experiments.
G. Bimonte, M. Russolillo, Y. Yang, H. L. Shang
Comments 45 pages, 6 figures
A well-established insight in mortality forecasting is that combining predictions from a set of models improves accuracy compared to relying on a single best model. This paper proposes a novel ensemble approach based on Shapley values, a game-theoretic measure of each model's marginal contribution to the forecast. We further compute these SHapley Additive exPlanations (SHAP)-based weights age-by-age, thereby capturing the specific contribution of each model at each age. In addition, we introduce a threshold mechanism that excludes models with negligible contributions, effectively reducing the forecast variance. Using data from 24 OECD countries, we demonstrate that our SHAP ensemble enhances out-of-sample forecasting performance, especially at longer horizons. By leveraging the complementary strengths of different mortality models and filtering out those that add little predictive power, our approach offers a robust and interpretable solution for improving mortality forecasts.
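A minimal sketch of Shapley-based ensemble weights with a threshold is given below. The coalition payoff (negative mean squared error of the equal-weight average forecast), the zero forecast assigned to the empty coalition, and the function names are assumptions for illustration; the paper's payoff function and age-by-age weighting may differ.

```python
from itertools import combinations
from math import factorial
import numpy as np

def coalition_value(preds, y, members):
    """Assumed payoff: negative MSE of the coalition's equal-weight forecast.
    The empty coalition is scored as forecasting zero (an assumption)."""
    if not members:
        forecast = np.zeros(len(y))
    else:
        forecast = preds[list(members)].mean(axis=0)
    return -float(np.mean((y - forecast) ** 2))

def shapley_values(preds, y):
    """Exact Shapley value of each model (rows of preds); cost grows as 2^n."""
    n = preds.shape[0]
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            w = factorial(k) * factorial(n - k - 1) / factorial(n)
            for S in combinations(others, k):
                gain = coalition_value(preds, y, S + (i,)) - coalition_value(preds, y, S)
                phi[i] += w * gain
    return phi

def shap_weights(phi, threshold=0.0):
    """Drop models whose Shapley value is at or below the threshold, then normalize."""
    kept = np.where(phi > threshold, phi, 0.0)
    if kept.sum() == 0.0:
        return np.full(len(phi), 1.0 / len(phi))
    return kept / kept.sum()

y = np.array([1.0, 2.0, 3.0, 4.0])
preds = np.vstack([y, y + 0.1, np.zeros(4)])  # two good models, one poor one
phi = shapley_values(preds, y)
weights = shap_weights(phi)
```

In this toy example the third model has a negative Shapley value, so the threshold removes it and the weights are shared between the two accurate models.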
Erdun Gao, Liang Zhang, Jake Fawkes, Aoqi Zuo, Wenqin Liu, Haoxuan Li, Mingming Gong, Dino Sejdinovic
Randomized Controlled Trials (RCTs) represent the gold standard for causal inference yet remain a scarce resource. While large-scale observational data is often available, it is utilized only for retrospective fusion, and remains discarded in prospective trial design due to bias concerns. We argue this "tabula rasa" data acquisition strategy is fundamentally inefficient. In this work, we propose Active Residual Learning, a new paradigm that leverages the observational model as a foundational prior. This approach shifts the experimental focus from learning target causal quantities from scratch to efficiently estimating the residuals required to correct observational bias. To operationalize this, we introduce the R-Design framework. Theoretically, we establish two key advantages: (1) a structural efficiency gap, proving that estimating smooth residual contrasts admits strictly faster convergence rates than reconstructing full outcomes; and (2) information efficiency, where we quantify the redundancy in standard parameter-based acquisition (e.g., BALD), demonstrating that such baselines waste budget on task-irrelevant nuisance uncertainty. We propose R-EPIG (Residual Expected Predictive Information Gain), a unified criterion that directly targets the causal estimand, minimizing residual uncertainty for estimation or clarifying decision boundaries for policy. Experiments on synthetic and semi-synthetic benchmarks demonstrate that R-Design significantly outperforms baselines, confirming that repairing a biased model is far more efficient than learning one from scratch.
Yuqi Kong, Xiao Zhang, Weiran Shen
We study the Inverse Contextual Bandit (ICB) problem, in which a learner seeks to optimize a policy while an observer, who cannot access the learner's rewards and only observes actions, aims to recover the underlying problem parameters. During the learning process, the learner's behavior naturally transitions from exploration to exploitation, resulting in non-stationary action data that poses significant challenges for the observer. To address this issue, we propose a simple and effective framework called Two-Phase Suffix Imitation. The framework discards data from an initial burn-in phase and performs empirical risk minimization using only data from a subsequent imitation phase. We derive a predictive decision loss bound that explicitly characterizes the bias-variance trade-off induced by the choice of burn-in length. Despite the severe information deficit, we show that a reward-free observer can achieve a convergence rate of $\tilde O(1/\sqrt{N})$, matching the asymptotic efficiency of a fully reward-aware learner. This result demonstrates that a passive observer can effectively uncover the optimal policy from actions alone, attaining performance comparable to that of the learner itself.
Taku Moriyama
Kernel smoothing with large bandwidth values generally causes oversmoothing or underfitting. However, when irrelevant variables are included, large values of the corresponding bandwidths are known to shrink their influence. This study investigates asymptotic properties of the kernel conditional density estimator and the regression estimator with large bandwidth matrix elements for the multi-index model. It is shown that the optimal convergence rate of the estimators depends not on the number of variables but on the effective dimension, without eliminating the irrelevant variables. Thus, the kernel conditional density estimator and the regression estimator are shown to mitigate the curse of dimensionality by nature. Finite-sample performance is investigated in a numerical study, and bandwidth selection is discussed. Finally, a case study on the Boston housing data is provided.
Tao Wang, Qiannan Huang, Jun Zhu, Cheng Meng
Comments 35 pages, 14 figures
Many learning tasks represent responses as multivariate probability measures, requiring repeated computation of weighted barycenters in Wasserstein space. In multivariate settings, transport barycenters are often computationally demanding and, more importantly, are generally not well posed under the affine weight schemes inherent to global and local Fréchet regression, where weights sum to one but may be negative. We propose HiMAP, a Hilbert mass-aligned parameterization that endows multivariate measures with a distribution-invariant notion of quantile level. The construction recursively refines the domain through equiprobable conditional-median splits and follows a Hilbert curve ordering, so that a single scalar index consistently tracks cumulative probability mass across distributions. This yields an embedding into a Hilbert function space and induces a tractable discrepancy for distribution comparison and averaging. Crucially, the representation is closed under affine averaging, leading to a closed-form, well-posed barycenter and an explicit distribution-valued Fréchet regression estimator obtained by averaging HiMAP quantile maps. We establish consistency and a dimension-dependent polynomial convergence rate for HiMAP estimators under mild conditions, matching the classical rates for empirical convergence in multivariate Wasserstein geometry. Numerical experiments and a multivariate climate-indicator study demonstrate that HiMAP delivers barycenters and regression fits comparable to standard optimal-transport surrogates while achieving substantial speedups in schemes dominated by repeated barycenter evaluations.
Sophia Sklaviadis, Thomas Moellenhoff, Andre F. T. Martins, Mario A. T. Figueiredo, Mohammad Emtiyaz Khan
Stein's identity is a fundamental tool in machine learning with applications in generative models, stochastic optimization, and other problems involving gradients of expectations under Gaussian distributions. Less attention has been paid to problems with non-Gaussian expectations. Here, we consider the class of bounded-support $q$-Gaussians and derive a new Stein identity leading to gradient estimators which have nearly identical forms to the Gaussian ones, and which are similarly easy to implement. We do this by extending the previous results of Landsman, Vanduffel, and Yao (2013) to prove new Bonnet- and Price-type theorems for q-Gaussians. We also simplify their forms by using escort distributions. Our experiments show that bounded-support distributions can reduce the variance of gradient estimators, which can potentially be useful for Bayesian deep learning and sharpness-aware minimization. Overall, our work simplifies the application of Stein's identity for an important class of non-Gaussian distributions.
Zhiyuan Zhan, Masashi Sugiyama
Low-dimensional structure in real-world data plays an important role in the success of generative models, which motivates diffusion models defined on intrinsic data manifolds. Such models are driven by stochastic differential equations (SDEs) on manifolds, which raises the need for convergence theory of numerical schemes for manifold-valued SDEs. In Euclidean space, the Euler--Maruyama (EM) scheme achieves strong convergence with order $1/2$, but an analogous result for manifold discretizations is less understood in general settings. In this work, we study a geometric version of the EM scheme for SDEs on Riemannian manifolds and prove strong convergence with order $1/2$ under geometric and regularity conditions. As an application, we obtain a Wasserstein bound for sampling on manifolds via the geometric EM discretization of Riemannian Langevin dynamics.
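To make the discretization concrete, the sketch below shows one common geodesic-step form of a geometric EM scheme on the unit sphere: the drift and noise are projected onto the tangent space and the update follows the exponential map. The choice of manifold, the von Mises-Fisher-type target, and all names are assumptions for illustration; the scheme analysed in the paper may be defined differently.

```python
import numpy as np

def exp_map_sphere(x, v):
    """Exponential map on the unit sphere at x applied to a tangent vector v."""
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return x
    return np.cos(nv) * x + np.sin(nv) * v / nv

def geometric_em_step(x, drift, h, sigma, rng):
    """One step: tangent-space Euler-Maruyama increment, then a geodesic move."""
    P = np.eye(len(x)) - np.outer(x, x)        # projection onto the tangent space at x
    xi = rng.standard_normal(len(x))
    v = h * (P @ drift(x)) + sigma * np.sqrt(h) * (P @ xi)
    return exp_map_sphere(x, v)

def geometric_em_path(x0, drift, h, n_steps, sigma, rng):
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = geometric_em_step(x, drift, h, sigma, rng)
    return x

# Riemannian Langevin dynamics for the assumed potential U(x) = -kappa * <x, mu>:
# the drift is -grad U (projected inside the step) and the noise scale is sqrt(2).
kappa, mu = 5.0, np.array([0.0, 0.0, 1.0])
rng = np.random.default_rng(1)
x_final = geometric_em_path([1.0, 0.0, 0.0], lambda x: kappa * mu, 1e-2, 1000,
                            np.sqrt(2.0), rng)
```

Because every increment is tangent and the update follows a geodesic, the iterate stays on the sphere up to floating-point error, with no renormalization step.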
Blaine Quackenbush, Paul J. Atzberger
Comments related open source software see https://web.atzberger.org/
We develop a rigorous framework for extending neural operators to handle out-of-distribution input functions. We leverage kernel approximation techniques and provide theory for characterizing the input-output function spaces in terms of Reproducing Kernel Hilbert Spaces (RKHSs). We provide theorems on the requirements for reliable extensions and their predicted approximation accuracy. We also establish formal relationships between specific kernel choices and their corresponding Sobolev Native Spaces. This connection further allows the extended neural operators to reliably capture not only function values but also their derivatives. Our methods are empirically validated through the solution of elliptic partial differential equations (PDEs) involving operators on manifolds having point-cloud representations and handling geometric contributions. We report results on key factors impacting the accuracy and computational performance of the extension approaches.
Grzegorz Sroka
The No Free Lunch (NFL) theorem guarantees equal average performance only under uniform sampling of a function space closed under permutation (c.u.p.). We ask when this averaging ceases to reflect what benchmarking actually reports. We study an iterative-search setting with sampling without replacement, where algorithms differ only in evaluation order. Binary objectives allow exhaustive evaluation in the fully enumerable case, and efficiency is defined by the first time the global minimum is reached. We then construct two additional benchmarks by algebraically recombining the same baseline functions through sums and differences. Function-algorithm relations are examined via correlation structure, hierarchical clustering, delta heatmaps, and PCA. A one-way ANOVA with Tukey contrasts confirms that algebraic reformulations induce statistically meaningful shifts in performance patterns. The uniformly sampled baseline remains consistent with the global NFL symmetry. In contrast, the algebraically modified benchmarks yield stable re-rankings and coherent clusters of functions and sampling policies. Composite objectives can also exhibit non-additive search effort despite being built from simpler components. Monte Carlo experiments indicate that order effects persist in larger spaces and depend on function class. Taken together, the results show how objective reformulation and benchmark design can generate structured local departures from NFL intuition. They motivate algorithm choice that is aware of both the problem class and the objective representation. This message applies to evolutionary computation as well as to statistical procedures based on relabeling, resampling, and permutation tests.
Sepideh Mosaferi, Shonosuke Sugasawa
Comments arXiv admin note: text overlap with arXiv:2110.10296
Fine stratification in surveys is useful in many applications because its point estimator is unbiased, but the variance estimator under the design cannot be easily obtained, particularly when the sample size per stratum is as small as one unit. One common practice to overcome this difficulty is to collapse strata in pairs to create pseudo-strata and then estimate the variance. The resulting variance estimator is not design-unbiased, and its positive bias increases as the population means of the paired strata differ more. The resulting confidence intervals can be unnecessarily wide. In this paper, we propose a new Bayesian estimator for variance which does not rely on collapsing strata, unlike the previous methods given in the literature. We employ the penalized spline method for smoothing the mean and variance together in a nonparametric way. Furthermore, we make comparisons with the earlier work of Breidt et al. (2016). Through multiple simulation studies and an illustration using data from the National Survey of Family Growth (NSFG), we demonstrate the favorable performance of our methodology.
Felipe Maia Polo, Aida Nematzadeh, Virginia Aglietti, Adam Fisch, Isabela Albuquerque
Moving beyond evaluations that collapse performance across heterogeneous prompts toward fine-grained evaluation at the prompt level, or within relatively homogeneous subsets, is necessary to diagnose generative models' strengths and weaknesses. Such fine-grained evaluations, however, suffer from a data bottleneck: human gold-standard labels are too costly at this scale, while automated ratings are often misaligned with human judgment. To resolve this challenge, we propose a novel statistical model based on tensor factorization that merges cheap autorater data with a limited set of human gold-standard labels. Specifically, our approach uses autorater scores to pretrain latent representations of prompts and generative models, and then aligns those pretrained representations to human preferences using a small calibration set. This sample-efficient methodology is robust to autorater quality, more accurately predicts human preferences on a per-prompt basis than standard baselines, and provides tight confidence intervals for key statistical parameters of interest. We also showcase the practical utility of our method by constructing granular leaderboards based on prompt qualities and by estimating model performance solely from autorater scores, eliminating the need for additional human annotations.
Elsayed Elamir
Comments 20 pages, 2 figures
Irregular errors such as heteroscedasticity and nonnormality remain major challenges in linear modeling. These issues often lead to biased inference and unreliable measures of uncertainty. Classical remedies, such as robust standard errors and weighted least squares, only partially address the problem and may fail when heteroscedasticity interacts with skewness or nonlinear mean structures. To address this, we propose a two-stage cumulative distribution function-based (CDF-based) beta regression framework that models the full conditional distribution of the response. The approach first transforms the outcome using a smoothed empirical CDF and then fits a flexible beta regression, allowing heteroscedasticity and nonnormality to be handled naturally through the mean-precision structure of the beta distribution. Predictions are mapped back to the original scale via the empirical quantile function, which preserves interpretability. A comprehensive Monte Carlo study shows that the proposed method consistently achieves good distributional accuracy and well-calibrated prediction intervals compared with OLS, WLS, and GLS. Application to the concrete compressive strength dataset demonstrates its stability and practical advantages.
Nabaneet Das, Thorsten Dickhaus
The proportion of edges in a Gaussian graphical model (GGM) characterizes the complexity of its conditional dependence structure. Since edge presence corresponds to a nonzero entry of the precision matrix, estimation of this proportion can be formulated as a large-scale multiple testing problem. We propose an estimator that combines p-values from simultaneous edge-wise tests, conducted under false discovery rate control, with Storey's estimator of the proportion of true null hypotheses. We establish weak dependence conditions on the precision matrix under which the empirical cumulative distribution function of the p-values converges to its population counterpart. These conditions cover high-dimensional regimes, including those arising in genetic association studies. Under such dependence, we characterize the asymptotic bias of the Schweder--Spjøtvoll estimator, showing that it is upward biased and thus slightly underestimates the true edge proportion. Simulation studies across a variety of models confirm accurate recovery of graph complexity.
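The final plug-in step can be sketched as follows, assuming the edge-wise p-values are already available. The fixed tuning value `lam = 0.5` and the function names are choices made for this example, not taken from the paper.

```python
import numpy as np

def storey_pi0(pvalues, lam=0.5):
    """Storey-type estimate of the proportion of true nulls (absent edges)."""
    p = np.asarray(pvalues, dtype=float)
    return min(1.0, float(np.mean(p > lam)) / (1.0 - lam))

def edge_proportion(pvalues, lam=0.5):
    """Estimated proportion of edges: one minus the estimated null proportion."""
    return 1.0 - storey_pi0(pvalues, lam)

# Toy p-values: four strong signals plus four roughly uniform null-like values.
toy = [0.001, 0.002, 0.003, 0.004, 0.2, 0.4, 0.6, 0.8]
estimate = edge_proportion(toy)
```

P-values above `lam` are treated as coming almost entirely from true nulls, so any non-null p-values that land there inflate the null proportion and pull the edge proportion down.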
Beniamino Hadj-Amar, Jack Jewson
The Bayesian approach provides powerful methods for variable selection. The ability to incorporate sparsity through prior beliefs and account for parameter uncertainty allows Bayesian variable selection to consistently identify which of the variables are active and exhibit strong finite-sample performance. However, Bayesian methods require the correct specification of full likelihoods for the data, and there is increasing awareness of the problems that model misspecification causes for variable selection. Current approaches to mitigate misspecification either require complex models, detracting from the interpretability of the variable selection task, or move outside rigorous Bayesian uncertainty quantification and provide no recognised method for variable selection. This paper establishes the model quasi-posterior as a principled tool for variable selection. We prove that the model quasi-posterior shares desirable properties of Bayesian variable selection without requiring full likelihood specification. Instead, the quasi-posterior combines a prior with a quasi-likelihood and requires only specification of mean and variance functions, and is therefore robust to other aspects of the data. Marginalising the quasi-likelihood is analytically possible for linear regression, and Laplace approximations are used beyond this to ensure computational tractability. Extensive simulation studies illustrate improved variable selection accuracy across diverse data-generating scenarios when compared with likelihood-based Bayesian variable selection and lasso-penalized methods. We further demonstrate practical relevance through applications to real datasets from social science and genomics.
Cosme Louart, Sicheng Tan
Comments 1 figure
We present a universal concentration bound for sums of random variables under arbitrary dependence, and we prove that it is asymptotically optimal for broad families of marginals admitting a uniform integrable tail-quantile envelope. The bound follows directly from the subadditivity of expected shortfall, a property well known in the risk-measure literature. Our sharpness result relies on an explicit construction of asymptotically extremal couplings. We furthermore provide practical sufficient conditions -- based on convex transformation order comparisons with exponential and power-law envelopes -- under which the bound admits simple, explicit tail profiles.
Douglas P. Wiens
Designs which are minimax in the presence of model misspecifications have been constructed so as to minimize the maximum, over classes of alternate response models, of the integrated mean squared error of the predicted values. This mean squared error decomposes into a term arising solely from variation, and a bias term arising from the model errors. Here we consider the problem of designing so as to minimize the variance of the predictors, subject to a bound on the maximum (over model misspecifications) bias. We consider as well designing so as to minimize the maximum bias, subject to a bound on the variance. We show that solutions to both problems are given by the minimax designs, with appropriately chosen values of their tuning constants. Conversely, any minimax design solves each problem for an appropriate choice of the bound on the maximum bias or on the variance.
Andrew Chin, Akihiko Nishimura
Combining a continuous "slab" density with discrete "spike" mass at zero, spike-and-slab priors provide important tools for inducing sparsity and carrying out variable selection in Bayesian models. However, the presence of discrete mass makes posterior inference challenging. "Sticky" extensions to piecewise-deterministic Markov process samplers have shown promising performance, where sampling from the spike is achieved by the process sticking there for an exponentially distributed duration. As it turns out, the sampler remains valid when the exponential sticking time is replaced with its expectation. We justify this by mapping the spike to a continuous density over a latent universe, allowing the sampler to be reinterpreted as traversing this universe while being stuck in the original space. This perspective opens up an array of possibilities to carry out posterior computation under spike-and-slab type priors. Notably, it enables us to construct sticky samplers using other dynamics-based paradigms such as Hamiltonian Monte Carlo; in fact, the original sticky process can be established as a partial position-momentum refreshment limit of our Hamiltonian sticky sampler. Our theoretical and empirical findings suggest these alternatives to be at least as efficient as the original sticky approach.
Paul N Zivich
Within the biological, physical, and social sciences, there are two broad quantitative traditions: statistical and mathematical modeling. Both traditions have the common pursuit of advancing our scientific knowledge, but these traditions have developed largely independently using distinct languages and inferential frameworks. This paper uses the notion of identification from causal inference, a field originating from the statistical modeling tradition, to develop a shared language. I first review foundational identification results for statistical models and then extend these ideas to mathematical models. Central to this framework is the use of bounds, ranges of plausible numerical values, to analyze both statistical and mathematical models. I discuss the implications of this perspective for the interpretation, comparison, and integration of different modeling approaches, and illustrate the framework with a simple pharmacodynamic model for hypertension. To conclude, I describe areas where the approach taken here should be extended in the future. By formalizing connections between statistical and mathematical modeling, this work contributes to a shared framework for quantitative science. My hope is that this work will advance interactions between these two traditions.
Beomhan Baek, Minhak Song, Chulhee Yun
Comments Published at ICLR 2026
Adam [Kingma & Ba, 2015] is the de facto optimizer in deep learning, yet its theoretical understanding remains limited. Prior analyses show that Adam favors solutions aligned with $\ell_\infty$-geometry, but these results are restricted to the full-batch regime. In this work, we study the implicit bias of incremental Adam (using one sample per step) for logistic regression on linearly separable data, and show that its bias can deviate from the full-batch behavior. As an extreme example, we construct datasets on which incremental Adam provably converges to the $\ell_2$-max-margin classifier, in contrast to the $\ell_\infty$-max-margin bias of full-batch Adam. For general datasets, we characterize its bias using a proxy algorithm for the $β_2 \to 1$ limit. This proxy maximizes a data-adaptive Mahalanobis-norm margin, whose associated covariance matrix is determined by a data-dependent dual fixed-point formulation. We further present concrete datasets where this bias reduces to the standard $\ell_2$- and $\ell_\infty$-max-margin classifiers. As a counterpoint, we prove that Signum [Bernstein et al., 2018] converges to the $\ell_\infty$-max-margin classifier for any batch size. Overall, our results highlight that the implicit bias of Adam crucially depends on both the batching scheme and the dataset, while Signum remains invariant.
Fuming Lin, Weilin Mou
Comments 35 pages, 7 figures, 2 tables
In this paper, we consider high-dimensional Lp-quantile regression, which requires only a low-order moment of the error and is a natural generalization of the above methods as well as of Lp-regression. The loss function of Lp-quantile regression circumvents both the non-differentiability of the absolute loss function and the finite-variance requirement of the squared loss function, which gives Lp-quantile regression favourable properties. Specifically, we first develop a new method called composite Lp-quantile regression (CLpQR). We study the oracle model selection theory based on CLpQR (calling the estimator CLpQR-oracle) and show that, for some values of p, CLpQR-oracle behaves better than CQR-oracle (based on composite quantile regression) when the error variance is infinite. Moreover, CLpQR has high efficiency and can sometimes be arbitrarily more efficient than both CQR and least squares regression. Second, we propose another new regression method, namely near quantile regression, and prove the asymptotic normality of the estimator when p converges to 1 and the sample size tends to infinity simultaneously. As applications, we provide a new approach to smoothing quantile objective functions and a new estimator of the asymptotic covariance matrix of quantile regression. Third, we develop a unified efficient algorithm for fitting high-dimensional Lp-quantile regression by combining cyclic coordinate descent and an augmented proximal gradient algorithm. Remarkably, the algorithm turns out to be a favourable alternative to the commonly used linear programming and interior point algorithms when fitting quantile regression.
Daniil Dmitriev, Harald Eskelund Franck, Carolin Heinzler, Amartya Sanyal
As machine learning systems increasingly train on self-annotated data, they risk reinforcing errors and becoming echo chambers of their own beliefs. We model this phenomenon by introducing a learning-theoretic framework: Online Learning in the Replay Setting. In round $t$, the learner outputs a hypothesis $\hat{h}_t$; the adversary then reveals either the true label $f^\ast(x_t)$ or a replayed label $\hat{h}_i(x_t)$ from an earlier round $i < t$. A mistake is counted only when the true label is shown, yet classical algorithms such as the SOA or the halving algorithm are easily misled by the replayed errors. We introduce the Extended Threshold dimension, $\mathrm{ExThD}(\mathcal{H})$, and prove matching upper and lower bounds that make $\mathrm{ExThD}(\mathcal{H})$ the exact measure of learnability in this model. A closure-based learner makes at most $\mathrm{ExThD}(\mathcal{H})$ mistakes against any adaptive adversary, and no algorithm can perform better. For stochastic adversaries, we prove a similar bound for every intersection-closed class. The replay setting is provably harder than the classical mistake bound setting: some classes have constant Littlestone dimension but arbitrarily large $\mathrm{ExThD}(\mathcal{H})$. Proper learning exhibits an even sharper separation: a class is properly learnable under replay if and only if it is (almost) intersection-closed. Otherwise, every proper learner suffers $Ω(T)$ errors, whereas our improper algorithm still achieves the $\mathrm{ExThD}(\mathcal{H})$ bound. These results give the first tight analysis of learning against replay adversaries, based on new results for closure-type algorithms.
Junpei Komiyama, Daisuke Oba, Masafumi Oyamada
Comments To appear at ICLR2026. Our code is available at https://github.com/jkomiyama/BoInf-code-publish/. Updated the title
We study best-of-$N$ for large language models (LLMs) where the selection is based on majority voting. In particular, we analyze the limit $N \to \infty$, which we denote as best-of-$\infty$. While this approach achieves impressive performance in the limit, it requires an infinite test-time budget. To address this, we propose an adaptive generation scheme that selects $N$ based on answer agreement, thereby efficiently allocating inference-time computation. Beyond adaptivity, we extend the framework to weighted ensembles of multiple LLMs, showing that such mixtures can outperform any individual model. The optimal ensemble weighting is formulated and efficiently computed as a mixed-integer linear program. Extensive experiments demonstrate the effectiveness of our approach.
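The idea of choosing $N$ from answer agreement can be illustrated with a deliberately simple stopping rule: keep sampling until the leading answer is ahead of the runner-up by a fixed margin, or until a budget cap is reached. This rule, the `generate` callable, and the default values are hypothetical stand-ins; the paper's actual stopping criterion is not reproduced here.

```python
from collections import Counter

def adaptive_majority_vote(generate, margin=3, max_n=64):
    """Sample answers until the leader is `margin` votes ahead or max_n is hit.

    `generate` is a hypothetical zero-argument callable returning one answer
    (for example, a parsed final answer from one LLM sample).
    Returns the majority answer and the number of samples used.
    """
    counts = Counter()
    used = 0
    for used in range(1, max_n + 1):
        counts[generate()] += 1
        ranked = counts.most_common(2)
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        if ranked[0][1] - runner_up >= margin:
            break
    return counts.most_common(1)[0][0], used

# Usage with a canned stream of answers standing in for model samples.
stream = iter(["a", "a", "b", "a", "a"])
answer, n_used = adaptive_majority_vote(lambda: next(stream), margin=3)
```

Easy prompts, where samples agree quickly, stop after a few generations, while contested prompts consume more of the budget.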
Simon Wiegrebe, Johannes Piller, Mathias Gorski, Merle Behr, Helmut Küchenhoff, Iris M. Heid, Andreas Bender
Multi-stage disease histories derived from longitudinal data are becoming increasingly available as registry data and biobanks expand. Multi-state models are suitable to investigate transitions between different disease stages in presence of competing risks. In this context, however, their estimation is complicated by dependent left-truncation, multiple time scales, index event bias, and interval-censoring. In this work, we investigate the extension of piecewise exponential additive models (PAMs) to this setting and their applicability given the above challenges. In simulation studies we show that PAMs can handle dependent left-truncation and accommodate multiple time scales. Compared to a stratified single time scale model, a multiple time scales model is found to be less robust to the data generating process. We also quantify the extent of index event bias in multiple settings, demonstrating its dependence on the completeness of covariate adjustment. In general, PAMs recover baseline and fixed effects well in most settings, except for baseline hazards in interval-censored data. Finally, we apply our framework to estimate multi-state transition hazards and probabilities of chronic kidney disease (CKD) onset and progression in a UK Biobank dataset (n=142,667). We observe CKD progression risk to be highest for individuals with early CKD onset and to further increase over age. In addition, the well-known genetic variant rs77924615 in the UMOD locus is found to be associated with CKD onset hazards, but not with risk of further CKD progression.
Krishnakumar Balasubramanian, Nathan Ross
Comments To appear in Bernoulli Journal
We study the Finite-Dimensional Distributions (FDDs) of deep neural networks with randomly initialized weights that have finite-order moments. Specifically, we establish Gaussian approximation bounds in the Wasserstein-$1$ norm between the FDDs and their Gaussian limit assuming a Lipschitz activation function and allowing the layer widths to grow to infinity at arbitrary relative rates. In the special case where all widths are proportional to a common scale parameter $n$ and there are $L-1$ hidden layers, we obtain convergence rates of order $n^{-({1}/{6})^{L-1} + ε}$, for any $ε> 0$.
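As a worked instance of the stated rate (plain arithmetic on the exponent, with $L-1$ hidden layers as in the abstract), the exponent shrinks geometrically with depth:

$$
L=2:\; n^{-\frac{1}{6}+ε}, \qquad
L=3:\; n^{-\frac{1}{36}+ε}, \qquad
L=4:\; n^{-\frac{1}{216}+ε}.
$$

So each additional hidden layer multiplies the leading exponent by $1/6$.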
Thomas Möllenhoff, Siddharth Swaroop, Finale Doshi-Velez, Mohammad Emtiyaz Khan
Comments First two authors contributed equally. Published at ICLR 2026. Code is at https://github.com/team-approx-bayes/bayes-admm
We propose a new Bayesian approach to generalize the federated Alternating Direction Method of Multipliers (ADMM). We show that the solutions of variational-Bayesian (VB) objectives are associated with a duality structure that not only resembles the structure of ADMM's fixed-points but also generalizes it. For example, ADMM-like updates are recovered when the VB objective is optimized over the isotropic-Gaussian family, and new non-trivial extensions are obtained for other exponential-family distributions. These extensions include a Newton-like variant that converges in one step on quadratic objectives and an Adam-like variant that yields up to 7% accuracy boosts for deep heterogeneous cases. Our work opens a new Bayesian way to generalize ADMM and other primal-dual methods.
Peter Reinhard Hansen, Chen Tong
Comments Please note: The results in this manuscript have been entirely subsumed and extended by the more comprehensive framework in arXiv:2602.17007
We derive an integral expression $G(z)$ for the reciprocal gamma function, $1/Γ(z)=G(z)/π$, that is valid for all $z\in\mathbb{C}$, without the need for analytic continuation. The same integral avoids the singularities of the gamma function and satisfies $G(1-z)=Γ(z)\sin(πz)$ for all $z\in\mathbb{C}$.
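The second identity is consistent with the first via Euler's reflection formula. The following short check is a sketch that assumes only the stated relation $1/Γ(z)=G(z)/π$ and the classical reflection formula; it is not the paper's derivation:

$$
G(1-z) = \frac{π}{Γ(1-z)}, \qquad
Γ(z)\,Γ(1-z) = \frac{π}{\sin(πz)}
\;\Longrightarrow\;
G(1-z) = Γ(z)\sin(πz).
$$

The reflection formula holds for non-integer $z$; at the poles of $Γ(z)$ the product $Γ(z)\sin(πz)$ has removable singularities, which is compatible with the identity being stated for all $z\in\mathbb{C}$.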
Agnideep Aich, Md Monzur Murshed, Sameera Hewage, Amanda Mayeaux
Effective feature selection is critical for robust and interpretable predictive modeling in medicine, especially when risk factors matter most in extreme patient strata. Many standard selectors emphasize average associations and can miss predictors whose relevance is concentrated in the distribution tails. We propose a computationally efficient supervised filter based on a Gumbel-copula implied upper-tail concordance score ($λ_U$), defined as a monotone transformation of Kendall's tau, to rank features by their tendency to be simultaneously extreme with the positive class. We compare against four common baselines (Mutual Information, mRMR, ReliefF, and L1/Elastic-Net) across four classifiers on two diabetes datasets: a large-scale public health survey (CDC, N=253,680) and a clinical benchmark (PIMA, N=768). Analyses include statistical testing, permutation importance, and robustness checks. On CDC, the proposed selector is the fastest and reduces 21 features to 10 (approximately 52%). This yields a small but statistically significant trade-off relative to using all features, while performing better than standard filters (Mutual Information, mRMR) and comparably to the strong ReliefF baseline. On PIMA (8 predictors), the resulting ranking attains the highest ROC-AUC numerically, though paired DeLong tests show no significant differences versus strong baselines; PIMA therefore serves as a ranking-only sanity check in a low-dimensional setting. Across both datasets, the $λ_U$-based selector highlights clinically coherent predictors and provides an efficient, interpretable screening step that can complement standard feature-selection methods in public health and clinical risk prediction.
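A minimal sketch of the kind of score described above uses the standard Gumbel-copula relations $θ = 1/(1-τ)$ and upper-tail dependence $2 - 2^{1/θ}$. How the paper estimates Kendall's tau against a binary class label, and how it treats negative tau, is not stated here; the function names and the restriction to $0 \le τ < 1$ are assumptions for illustration.

```python
def gumbel_theta_from_tau(tau: float) -> float:
    """Gumbel copula parameter implied by Kendall's tau (valid for 0 <= tau < 1)."""
    if not 0.0 <= tau < 1.0:
        raise ValueError("the Gumbel copula only covers 0 <= tau < 1")
    return 1.0 / (1.0 - tau)


def gumbel_upper_tail_from_tau(tau: float) -> float:
    """Upper-tail dependence 2 - 2**(1/theta), a monotone function of tau."""
    theta = gumbel_theta_from_tau(tau)
    return 2.0 - 2.0 ** (1.0 / theta)


def rank_features(tau_by_feature: dict) -> list:
    """Rank feature names by decreasing implied upper-tail score."""
    return sorted(
        tau_by_feature,
        key=lambda name: gumbel_upper_tail_from_tau(tau_by_feature[name]),
        reverse=True,
    )
```

Because the score is monotone in tau, the ranking coincides with ranking by tau itself on this range; the transformation only changes the scale on which features are compared.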
Dmitry Dudukalov, Artem Logachov, Vladimir Lotov, Timofei Prasolov, Evgeny Prokopenko, Anton Tarasenko
Comments The introduction, Subsections 2.1 ("Suitable Time Scaling") and 2.2 ("Sticking to a Critical Point"), as well as a small portion of the proof, have been revised. Subsection 2.3 ("Leaving the Neighborhood of a Sharp Maximum") has undergone minor revisions due to the equality in the doubly exponential case
We study the convergence properties and escape dynamics of Stochastic Gradient Descent (SGD) in one-dimensional landscapes, separately considering infinite- and finite-variance noise. Our main focus is to identify the time scales on which SGD reliably moves from an initial point to the local minimum in the same "basin". Under suitable conditions on the noise distribution, we prove that SGD converges to the basin's minimum unless the initial point lies too close to a local maximum. In that near-maximum scenario, we show that SGD can linger for a long time in its neighborhood. For initial points near a "sharp" maximum, we show that SGD does not remain stuck there, and we provide results to estimate the probability that it will reach each of the two neighboring minima. Overall, our findings present a nuanced view of SGD's transitions between local maxima and minima, influenced by both noise characteristics and the underlying function geometry.
Lan V. Truong
Comments To appear in IEEE Transactions on Information Theory
We study best-arm identification in stochastic multi-armed bandits under the fixed-confidence setting, focusing on instances with multiple optimal arms. Unlike prior work that addresses the unknown-number-of-optimal-arms case, we consider the setting where the number of optimal arms is known in advance. We derive a new information-theoretic lower bound on the expected sample complexity that leverages this structural knowledge and is strictly tighter than previous bounds. Building on the Track-and-Stop algorithm, we propose a modified, tie-aware stopping rule and prove that it achieves asymptotic instance-optimality, matching the new lower bound. Our results provide the first formal guarantee of optimality for Track-and-Stop in multi-optimal settings with known cardinality, offering both theoretical insights and practical guidance for efficiently identifying any optimal arm.
Alberto Caimo, Isabella Gollini
Comments 20 pages, 9 figures, 3 tables
Signed networks capture the polarity of relationships between nodes, providing valuable insights into complex systems where both supportive and antagonistic interactions play a critical role in shaping the network dynamics. We propose a separable temporal generative framework based on multi-layer exponential random graph models, characterised by the assumption of conditional independence between the sign and interaction effects. This structure preserves the flexibility and explanatory power inherent in the binary network specification while adhering to consistent balance theory assumptions. Using a fully probabilistic Bayesian paradigm, we infer the doubly intractable posterior distribution of model parameters via an adaptive Metropolis-Hastings approximate exchange algorithm. We illustrate the interpretability of our model by analysing signed relations among U.S. Senators during Ronald Reagan's second term (1985-1989). Specifically, we aim to understand whether these relations are consistent and balanced or reflect patterns of supportive or antagonistic alliances.
Gianluca Finocchio, Tatyana Krivobokova
Comments 61 pages, 2 figures
A novel framework is introduced to formalize identifiability in well-specified but ill-posed linear regression models. The framework is distribution-free and accommodates highly correlated features that may or may not relate to the response, reflecting typical real-data structures. First, the identifiable parameter is defined as the least-squares solution obtained by regressing the response on the largest subset of relevant features whose condition number does not exceed a specified threshold, and the relative risk incurred by using this predictor instead of the optimal one is quantified. Second, simple, verifiable conditions are provided under which a broad class of linear dimensionality reduction algorithms can estimate identifiable parameters; algorithms satisfying these conditions are termed statistically interpretable. Third, sharp high-probability error bounds are derived for these algorithms, with rates explicitly reflecting the degree of ill-posedness. With heavy-tailed features and sufficiently low effective rank, these algorithms achieve convergence rates that improve upon both the minimax least-squares rate and lower bounds for sparse estimation under sub-Gaussian features. Results are illustrated via simulations and a real-data application, in which effective rank grows logarithmically with dimension. The framework may extend to algorithms modeling nonlinear response-feature dependence.
Henrik Häggström, Sebastian Persson, Marija Cvijovic, Umberto Picchini
Comments 42 pages, 23 figures
The analysis of data from multiple experiments, such as observations of several individuals, is commonly approached using mixed-effects models, which account for variation between individuals through hierarchical representations. This makes mixed-effects models widely applied in fields such as biology, pharmacokinetics, and sociology. In this work, we propose a novel methodology for scalable Bayesian inference in hierarchical mixed-effects models. Our framework first constructs amortized approximations of the likelihood and the posterior distribution, which are then rapidly refined for each individual dataset, to ultimately approximate the parameter posterior across many individuals. The framework is easily trainable, as it uses mixtures of experts but without neural networks, leading to parsimonious yet expressive surrogate models of the likelihood and the posterior. We demonstrate the effectiveness of our methodology using challenging stochastic models, such as mixed-effects stochastic differential equations emerging in systems biology-driven problems. However, the approach is broadly applicable and can accommodate both stochastic and deterministic models. We show that our approach can seamlessly handle inference for many parameters. Additionally, we apply our method to a real-data case study of mRNA transfection. When compared to exact pseudomarginal Bayesian inference, our approach proves to be both fast and competitive in terms of statistical accuracy.
Shaoqian Zhou, Wen You, Ling Guo, Xuhui Meng
Physics-informed deep learning approaches have been developed to solve forward and inverse stochastic differential equation (SDE) problems with high-dimensional stochastic space. However, the existing deep learning models have difficulties solving SDEs with high-dimensional spatial space. In the present study, we propose a scalable physics-informed deep generative model (sPI-GeM), which is capable of solving SDE problems with both high-dimensional stochastic and spatial space. The sPI-GeM consists of two deep learning models, i.e., (1) physics-informed basis networks (PI-BasisNet), which are used to learn the basis functions as well as the coefficients given data on a certain stochastic process or random field, and (2) a physics-informed deep generative model (PI-GeM), which learns the distribution over the coefficients obtained from the PI-BasisNet. New samples of the learned stochastic process can then be obtained using the inner product between the output of the generator and the basis functions from the trained PI-BasisNet. The sPI-GeM addresses scalability in the spatial space in a way similar to the widely used dimensionality reduction technique, principal component analysis (PCA). A series of numerical experiments, including approximation of Gaussian and non-Gaussian stochastic processes and forward and inverse SDE problems, are performed to demonstrate the accuracy of the proposed model. Furthermore, we also show the scalability of the sPI-GeM in both the stochastic and spatial space using an example of a forward SDE problem with 38- and 20-dimensional stochastic and spatial spaces, respectively.
Shreya Mehta, Almut E. D. Veraart
This article introduces Lévy-driven graph supOU processes, a parsimonious parametrisation for high-dimensional time series in which dependence between components is governed by a graph structure. Specifically, the model bridges short- and long-range dependence within a single parametric family while accommodating a wide range of marginal distributions. We further develop a generalised method of moments estimator, establish its consistency and asymptotic normality, and assess its finite-sample performance through a simulation study. Finally, we illustrate the practical relevance of our model and estimation method in an empirical study of wind capacity factors in a European electricity network context.
Jesus Gonzalo, Jean-Yves Pitarakis
We propose a two-step procedure to detect cointegration in high-dimensional settings, focusing on sparse relationships. First, we use the adaptive LASSO to identify the small subset of integrated covariates driving the equilibrium relationship with a target series, ensuring model-selection consistency. Second, we adopt an information-theoretic model choice criterion to distinguish between stationarity and nonstationarity in the resulting residuals, avoiding dependence on asymptotic distributional assumptions. Monte Carlo experiments confirm robust finite-sample performance, even under endogeneity and serial correlation.
Seong Jin Lee, Will Wei Sun, Yufeng Liu
Reinforcement learning from human feedback (RLHF) has become a cornerstone for aligning large language models with human preferences. However, the heterogeneity of human feedback, driven by diverse individual contexts and preferences, poses significant challenges for reward learning. To address this, we propose a Low-rank Contextual RLHF (LoCo-RLHF) framework that integrates contextual information to better model heterogeneous feedback while maintaining computational efficiency. Our approach builds on a contextual preference model, leveraging the intrinsic low-rank structure of the interaction between user contexts and query-answer pairs to mitigate the high dimensionality of feature representations. Furthermore, we address the challenge of distributional shifts in feedback through our Pessimism in Reduced Subspace (PRS) policy, inspired by pessimistic offline reinforcement learning techniques. We theoretically demonstrate that our policy achieves a tighter sub-optimality gap compared to existing methods. Extensive experiments validate the effectiveness of LoCo-RLHF, showcasing its superior performance in personalized RLHF settings and its robustness to distribution shifts.
Augusto Cerqua, Roberta Di Stefano, Raffaele Mattera
When treatments are non-randomly assigned, continuous, and yield heterogeneous effects at the same intensity, causal identification becomes particularly challenging. In such contexts, existing approaches often fail to provide policy-relevant estimates of the relationship between treatment intensity and outcomes, especially in the presence of limited common support. To fill this gap, we introduce the Clustered Dose-Response Function (Cl-DRF), a novel estimator designed to uncover the continuous causal relationship between treatment intensity and the dependent variable across distinct subgroups. Our approach leverages both theoretical and data-driven sources of heterogeneity, relying on relaxed versions of the conditional independence and positivity assumptions that are plausible across various observational settings. We apply the Cl-DRF estimator to estimate subgroup-specific dose-response relationships between European Cohesion Funds and economic growth. In contrast to much of the literature, higher funding increases growth in more developed regions without diminishing returns, while limited absorptive capacity prevents other regions from fully benefiting.
Guanyi Chen, Jian Ding, Shuyang Gong, Zhangsong Li
Comments 80 pages, 2 figures, added further explanations and remarks; to appear in Annals of Statistics
Detection of correlation in a pair of random graphs is a fundamental statistical and computational problem that has been extensively studied in recent years. In this work, we consider a pair of correlated (sparse) stochastic block models $\mathcal{S}(n,\tfrac{λ}{n};k,ε;s)$ that are subsampled from a common parent stochastic block model $\mathcal{S}(n,\tfrac{λ}{n};k,ε)$ with $k=O(1)$ symmetric communities, average degree $λ=O(1)$, divergence parameter $ε$, and subsampling probability $s$. For the detection problem of distinguishing this model from a pair of independent Erdős-Rényi graphs with the same edge density $\mathcal{G}(n,\tfrac{λs}{n})$, we focus on tests based on \emph{low-degree polynomials} of the entries of the adjacency matrices, and we determine the threshold that separates the easy and hard regimes. More precisely, we show that this class of tests can distinguish these two models if and only if $s> \min \{ \sqrt{α}, \frac{1}{λε^2} \}$, where $α\approx 0.338$ is Otter's constant and $\frac{1}{λε^2}$ is the Kesten-Stigum threshold. Combined with a reduction argument in \cite{Li25+}, our hardness result also implies low-degree hardness for partial recovery and detection (to independent block models) when $s< \min \{ \sqrt{α}, \frac{1}{λε^2} \}$. Finally, our proof of low-degree hardness is based on a conditional variant of the low-degree likelihood calculation.
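The threshold in the abstract can be evaluated numerically with a small sketch. It only encodes the stated condition, with subsampling probability above the minimum of the square root of Otter's constant and the Kesten-Stigum threshold, using the approximate value 0.338 quoted above; the function and constant names are hypothetical.

```python
import math

# Approximate value of Otter's constant as quoted in the abstract.
OTTER_ALPHA = 0.338


def low_degree_threshold(lam: float, eps: float) -> float:
    """min{ sqrt(alpha), 1 / (lam * eps**2) } for average degree lam and divergence eps."""
    if lam <= 0 or eps == 0:
        raise ValueError("lam must be positive and eps non-zero")
    return min(math.sqrt(OTTER_ALPHA), 1.0 / (lam * eps ** 2))


def low_degree_detectable(lam: float, eps: float, s: float) -> bool:
    """True when the subsampling probability s exceeds the stated threshold."""
    return s > low_degree_threshold(lam, eps)
```

For example, with `lam=1, eps=0.5` the Kesten-Stigum term equals 4, so the Otter term (about 0.581) is the binding one; with `lam=4, eps=1` the Kesten-Stigum term 0.25 is smaller and determines the threshold.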
Marius Huber, Sara Kalisnik, Patrick Schnider
Comments Code: https://doi.org/10.5281/zenodo.17279740
We present AuToMATo, a novel clustering algorithm based on persistent homology. While AuToMATo is not parameter-free per se, we provide default choices for its parameters that make it into an out-of-the-box clustering algorithm that performs well across the board. AuToMATo combines the existing ToMATo clustering algorithm with a bootstrapping procedure in order to separate significant peaks of an estimated density function from non-significant ones. We perform a thorough comparison of AuToMATo (with its parameters fixed to their defaults) against many other state-of-the-art clustering algorithms. We find not only that AuToMATo compares favorably against parameter-free clustering algorithms, but in many instances also significantly outperforms even the best selection of parameters for other algorithms. AuToMATo is motivated by applications in topological data analysis, in particular the Mapper algorithm, where it is desirable to work with a clustering algorithm that does not need tuning of its parameters. Indeed, we provide evidence that AuToMATo performs well when used with Mapper. Finally, we provide an open-source implementation of AuToMATo in Python that is fully compatible with the standard scikit-learn architecture.
Diego Salmerón, Juan Antonio Cano, Christian P. Robert
Comments Accepted for publication in International Statistical Review. DOI: 10.1111/insr.70028
Noninformative priors constructed for estimation purposes are usually not appropriate for model selection and testing. The methodology of integral priors was developed to get prior distributions for Bayesian model selection when comparing two models, modifying initial improper reference priors. We propose a generalization of this methodology to more than two models. Our approach adds an artificial copy of each model under comparison by compactifying the parametric space and creating an ergodic Markov chain across all models that returns the integral priors as marginals of the stationary distribution. Besides the guarantee of their existence and the lack of paradoxes attached to estimation reference priors, an additional advantage of this methodology is that the simulation of this Markov chain is straightforward as it only requires simulations of imaginary training samples for all models and from the corresponding posterior distributions. We present some examples, including situations where other methodologies need specific adjustments or do not produce a satisfactory answer.
Hédi Hadiji, Sarah Sachs, Cristóbal Guzmán
Tracking the solution of time-varying variational inequalities is an important problem with applications in game theory, optimization, and machine learning. Existing work considers time-varying games or time-varying optimization problems. For strongly convex optimization problems or strongly monotone games, these results provide tracking guarantees under the assumption that the variation of the time-varying problem is restrained, that is, problems with a sublinear solution path. In this work we extend existing results in two ways: In our first result, we provide tracking bounds for (1) variational inequalities with a sublinear solution path but not necessarily monotone functions, and (2) for periodic time-varying variational inequalities that do not necessarily have a sublinear solution path-length. Our second main contribution is an extensive study of the convergence behavior and trajectory of discrete dynamical systems of periodic time-varying VI. We show that these systems can exhibit provably chaotic behavior or can converge to the solution. Finally, we illustrate our theoretical results with experiments.
Daniele Tramontano, Mathias Drton, Jalal Etesami
Linear non-Gaussian causal models postulate that each random variable is a linear function of parent variables and non-Gaussian exogenous error terms. We study identification of the linear coefficients when such models contain latent variables. Our focus is on the commonly studied acyclic setting, where each model corresponds to a directed acyclic graph (DAG). For this case, prior literature has demonstrated that connections to overcomplete independent component analysis yield effective criteria to decide parameter identifiability in latent variable models. However, this connection is based on the assumption that the observed variables linearly depend on the latent variables. Departing from this assumption, we treat models that allow for arbitrary non-linear latent confounding. Our main result is a graphical criterion that is necessary and sufficient for deciding the generic identifiability of direct causal effects. Moreover, we provide an algorithmic implementation of the criterion with a run time that is polynomial in the number of observed variables. Finally, we report on estimation heuristics based on the identification result and explore a generalization to models with feedback loops.
William Acero, Isabel Molina, J. Miguel Marín
Comments 27 pages, 9 figures, 2 tables
When estimating area means, direct estimators based on area-specific data are usually consistent under the sampling design without model assumptions. However, they are inefficient if the area sample size is small. In small area estimation, model assumptions linking the areas are used to "borrow strength" from other areas. The basic area-level model provides design-consistent estimators, but the error variances are assumed to be known. In practice, they are estimated with the (scarce) area-specific data. These estimators are inefficient, and their error is not accounted for in the associated mean squared error estimators. Unit-level models do not require the error variances to be known but do not account for the survey design. Here we describe a unified estimator of an area mean that may be obtained from either an area-level model or a unit-level model and is based on estimators of the model error variances that are consistent as the number of areas increases. We propose bootstrap mean squared error estimators that account for the uncertainty due to the estimation of the error variances. We show a better performance of the new small area estimators and our bootstrap estimators of the mean squared error. We apply the results to education data from Colombia.
Kota Takeda, Takashi Sakajo
Comments 18 pages, 0 figures
Data assimilation is a method of uncertainty quantification that estimates the hidden true state by updating the prediction from the model dynamics with observation data. As a prediction model, we consider a class of nonlinear dynamical systems on Hilbert spaces including the two-dimensional Navier-Stokes equations and the Lorenz '63 and '96 equations. For nonlinear model dynamics, the ensemble Kalman filter (EnKF) is often used to approximate the mean and covariance of the probability distribution with a set of particles called an ensemble. In this paper, we consider a deterministic version of the EnKF known as the ensemble transform Kalman filter (ETKF), which performs well even with limited ensemble sizes in comparison to other stochastic implementations of the EnKF. When the ETKF is applied to large-scale systems, an ad-hoc numerical technique called covariance inflation is often employed to reduce approximation errors. Despite the practical effectiveness of the ETKF, little is known about it theoretically. The present study aims to establish a theoretical analysis of the ETKF. We show that the estimation error of the ETKF with and without covariance inflation is bounded for any finite time. In particular, a uniform-in-time error bound is obtained when the inflation parameter is chosen appropriately, justifying the effectiveness of covariance inflation in the ETKF.
Alireza F. Pour, Hassan Ashtiani, Shahab Asoodeh
We study the problem of hypothesis selection under the constraint of local differential privacy. Given a class $\mathcal{F}$ of $k$ distributions and a set of i.i.d. samples from an unknown distribution $h$, the goal of hypothesis selection is to pick a distribution $\hat{f}$ whose total variation distance to $h$ is comparable with the best distribution in $\mathcal{F}$ (with high probability). We devise an $\varepsilon$-locally-differentially-private ($\varepsilon$-LDP) algorithm that uses $Θ\left(\frac{k}{α^2\min \{\varepsilon^2,1\}}\right)$ samples to guarantee that $d_{TV}(h,\hat{f})\leq α+ 9 \min_{f\in \mathcal{F}}d_{TV}(h,f)$ with high probability. This sample complexity is optimal for $\varepsilon<1$, matching the lower bound of Gopi et al. (2020). All previously known algorithms for this problem required $Ω\left(\frac{k\log k}{α^2\min \{ \varepsilon^2 ,1\}} \right)$ samples to work. Moreover, our result demonstrates the power of interaction for $\varepsilon$-LDP hypothesis selection. Namely, it breaks the known lower bound of $Ω\left(\frac{k\log k}{α^2\min \{ \varepsilon^2 ,1\}} \right)$ for the sample complexity of non-interactive hypothesis selection. Our algorithm breaks this barrier using only $Θ(\log \log k)$ rounds of interaction. To prove our results, we define the notion of \emph{critical queries} for a Statistical Query Algorithm (SQA) which may be of independent interest. Informally, an SQA is said to use a small number of critical queries if its success relies on the accuracy of only a small number of queries it asks. We then design an LDP algorithm that uses a smaller number of critical queries.
Haochen Lei, Yan Li, Hongyuan Cao
Identifying signals that replicate across multiple studies is essential for establishing robust scientific evidence, yet existing methods for high-dimensional replicability analysis either rely on restrictive modeling assumptions, are limited to two-study settings, or lack statistical power. We propose a general empirical Bayes framework for multi-study replicability analysis that jointly models summary-level $p$-values while explicitly accounting for between-study heterogeneity. Within each study, non-null $p$-value densities are estimated nonparametrically under monotonicity constraints, enabling flexible and tuning-free inference. For two studies, we develop a local false discovery rate (Lfdr) statistic for the composite null of non-replicability and establish identifiability, consistency, and a cubic-rate convergence of the nonparametric MLE, along with minimax optimality. Extending replicability analysis to $n$ studies typically requires estimating $2^n$ latent configurations, which is computationally infeasible. To address this challenge, we introduce a scalable pairwise rejection strategy that decomposes the exponentially large composite null into disjoint components, yielding linear complexity in the number of studies. We prove asymptotic FDR control under mild regularity conditions and show that Lfdr-based thresholding is power-optimal. Extensive simulations demonstrate that our method provides substantial power gains while maintaining valid FDR control, outperforming state-of-the-art alternatives across a wide range of scenarios. Applying our framework to East Asian- and European-ancestry genome-wide association studies of type 2 diabetes reveals replicable genetic associations that competing approaches fail to detect, illustrating the method's practical utility in large-scale biomedical research.
Qi Zhang, Harsh Parikh, Ashley Naimi, Razieh Nabi, Christopher Kim, Timothy Lash
Comments 34 pages, 15 figures. Submitted to ICML 2026. Code available at https://github.com/zhangqiecho/causalmix
Method validation and study design in causal inference rely on synthetic data with known counterfactuals. Existing simulators trade off distributional realism, the ability to capture mixed-type and multimodal tabular data, against causal controllability, including explicit control over overlap, unmeasured confounding, and treatment effect heterogeneity. We introduce CausalMix, a variational generative framework that closes this gap by coupling a mixture of Gaussian latent priors with data-type-specific decoders for continuous, binary, and categorical variables. The model incorporates explicit causal controls: an overlap regularizer shaping propensity-score distributions, alongside direct parameterizations of confounding strength and effect heterogeneity. This unified objective preserves fidelity to the observed data while enabling factorial manipulation of causal mechanisms, allowing overlap, confounding strength, and treatment effect heterogeneity to be varied independently at design time. Across benchmarks, CausalMix achieves state-of-the-art distributional metrics on mixed-type tables while providing stable, fine-grained causal control. We demonstrate practical utility in a comparative safety study of metastatic castration-resistant prostate cancer treatments, using CausalMix to compare estimators under calibrated data-generating processes, tune hyperparameters, and conduct simulation-based power analyses under targeted treatment effect heterogeneity scenarios.
Marco Pollanen
Comments 44 pages, 2 figures, submitted to Meta-Psychology (open peer review)
Explanations of the replication crisis often emphasize misconduct, questionable research practices, or incentive misalignment, implying that behavioral reform is sufficient. This paper argues that a substantial component is architectural: within binary significance-based publication systems, even perfectly diligent researchers face structural limits on the reliability they can deliver. The posterior log-odds of a finding equal prior log-odds plus log(Lambda), where Lambda = (1-beta)/alpha is the experimental leverage. Interpreted architecturally, this implies a hard constraint: once evidence is coarsened to a binary significance decision, the decision rule contributes exactly log(Lambda) to posterior log-odds. A target reliability tau is feasible iff pi >= pi_crit, and under fixed alpha this generally cannot be rescued by sample size alone. Two mechanisms can drive effective leverage to 1 without bad faith: persistent unmeasured confounding in observational studies and unbounded specification search under publication pressure. These results concern binary significance-based decision architectures and do not bound inference based on full likelihoods or richer continuous evidence summaries. Two collapse results formalize these mechanisms, while the Replication Pipeline Theorem and Minimum Pipeline Depth Corollary identify a quantitative evidentiary standard for escape. Using independently documented parameters for pre-reform psychology (pi about 0.10, power about 0.35), the framework implies a replication rate of 36%, consistent with the Open Science Collaboration. The framework also provides quantitative bridges to Popper, Kuhn, and Lakatos. In low-prior settings below the single-study feasibility threshold, the natural unit of evidence is the replication pipeline rather than the individual experiment.
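The log-odds identity above can be made concrete with a short sketch. The value alpha = 0.05 is an assumption (the abstract gives pi and power but not alpha), and the closed form for the critical prior is derived here from the stated identity by solving prior odds times Lambda >= tau/(1-tau); it may differ from the paper's exact definition of pi_crit. The output is the posterior probability of a single significant finding, not the paper's replication rate.

```python
import math


def leverage(power: float, alpha: float) -> float:
    """Experimental leverage Lambda = (1 - beta) / alpha, with power = 1 - beta."""
    return power / alpha


def posterior_probability(prior: float, power: float, alpha: float) -> float:
    """Posterior log-odds = prior log-odds + log(Lambda), mapped back to a probability."""
    log_odds = math.log(prior / (1.0 - prior)) + math.log(leverage(power, alpha))
    return 1.0 / (1.0 + math.exp(-log_odds))


def critical_prior(tau: float, power: float, alpha: float) -> float:
    """Smallest prior for which the posterior reaches tau (derived from the identity)."""
    lam = leverage(power, alpha)
    return tau / (tau + (1.0 - tau) * lam)


# pi = 0.10 and power = 0.35 are from the abstract; alpha = 0.05 is assumed.
print(posterior_probability(0.10, 0.35, 0.05))  # approximately 0.4375
print(critical_prior(0.4375, 0.35, 0.05))       # approximately 0.10
```

Under these inputs Lambda = 7, so a prior of 0.10 (odds 1/9) becomes posterior odds 7/9. Raising the target tau above 0.4375 at fixed alpha and power requires a prior above 0.10, which is the feasibility constraint the abstract describes.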
Mingxuan Zhang, Khushi Desai, Sopho Kevlishvili, Elham Azizi
Observational causal discovery is only identifiable up to the Markov equivalence class. While interventions can reduce this ambiguity, in practice interventions are often soft with multiple unknown targets. In many realistic scenarios, only a single intervention regime is observed. We propose a scalable causal discovery model for paired observational and interventional settings with shared underlying causal structure and unknown soft interventions. The model aggregates subset-level PDAGs and applies contrastive cross-regime orientation rules to construct a globally consistent maximal PDAG under Meek closure, enabling generalization to both in-distribution and out-of-distribution settings. Theoretically, we prove that our model is sound with respect to a restricted $Ψ$ equivalence class induced solely by the information available in the subset-restricted setting. We further show that the model asymptotically recovers the corresponding identifiable PDAG and can orient additional edges compared to non-contrastive subset-restricted methods. Experiments on synthetic data demonstrate improved causal structure recovery, generalization to unseen graphs with held-out causal mechanisms, and scalability to larger graphs, with ablations supporting the theoretical results.
Shion Matsumoto, Raul Castillo, Benjamin Prada, Ankur Arjun Mali
The forward and reverse Kullback-Leibler (KL) divergences arise as limiting objectives in learning and inference yet induce markedly different inductive biases that cannot be explained at the level of expectations alone. In this work, we introduce the Surprisal-Rényi Free Energy (SRFE), a log-moment-based functional of the likelihood ratio that lies outside the class of $f$-divergences. We show that SRFE recovers forward and reverse KL divergences as singular endpoint limits and derive local expansions around both limits in which the variance of the log-likelihood ratio appears as a first-order correction. This reveals an explicit mean-variance tradeoff governing departures from KL-dominated regimes. We further establish a Gibbs-type variational characterization of SRFE as the unique minimizer of a weighted sum of KL divergences and prove that SRFE directly controls large deviations of excess code-length via Chernoff-type bounds, yielding a precise Minimum Description Length interpretation. Together, these results identify SRFE as a variance- and tail-sensitive free-energy functional that clarifies the geometric and large-deviation structure underlying forward and reverse KL limits, without unifying or subsuming distinct learning frameworks.
Xiaotong Liu, Yunwen Lei, Xiangyu Chang, Shao-Bo Lin
This paper proposes a novel parameter selection strategy for kernel-based gradient descent (KGD) algorithms, integrating bias-variance analysis with the splitting method. We introduce the concept of empirical effective dimension to quantify iteration increments in KGD, deriving an adaptive parameter selection strategy that is implementable. Theoretical verifications are provided within the framework of learning theory. Utilizing the recently developed integral operator approach, we rigorously demonstrate that KGD, equipped with the proposed adaptive parameter selection strategy, achieves the optimal generalization error bound and adapts effectively to different kernels, target functions, and error metrics. Consequently, this strategy showcases significant advantages over existing parameter selection methods for KGD.
Mingjie Zhao, Sen Feng, Yiqun Zhang, Mengke Li, Yang Lu, Yiu-ming Cheung
Comments Accepted to ECAI2024
Clustering is a fundamental approach to understanding data patterns, wherein the intuitive Euclidean distance space is commonly adopted. However, this is not the case for implicit cluster distributions reflected by qualitative attribute values, e.g., the nominal values of attributes like symptoms, marital status, etc. This paper therefore discovers a tree-like distance structure that flexibly represents the local order relationships among intra-attribute qualitative values. That is, treating a value as the vertex of the tree makes it possible to capture rich order relationships between that vertex value and the others. To obtain the trees in a clustering-friendly form, a joint learning mechanism is proposed to iteratively obtain more appropriate tree structures and clusters. It turns out that the latent distance space of the whole dataset can be well represented by a forest consisting of the learned trees. Extensive experiments demonstrate that the joint learning adapts the forest to the clustering task to yield accurate results. Comparisons with 10 counterparts on 12 real benchmark datasets, with significance tests, verify the superiority of the proposed method.
David Wegmann
Comments This article is derived from my master's thesis
In 2018, McInnes et al. introduced a dimensionality reduction algorithm called UMAP, which enjoys wide popularity among data scientists. Their work introduces a finite variant of a functor called the metric realization, based on an unpublished draft by Spivak. This draft contains many errors, most of which are reproduced by McInnes et al. and subsequent publications. This article aims to repair these errors and provide a self-contained document with the full derivation of Spivak's functors and McInnes et al.'s finite variant. We contribute an explicit description of the metric realization and related functors. At the end, we discuss the UMAP algorithm, as well as claims about properties of the algorithm and the correspondence of McInnes et al.'s finite variant to the UMAP algorithm.
Martin Larsson, Johannes Ruf, Aaditya Ramdas
Comments 28 pages
We revisit a fundamental question in hypothesis testing: given two sets of probability measures $\mathcal{P}$ and $\mathcal{Q}$, when does a nontrivial (i.e. strictly unbiased) test for $\mathcal{P}$ against $\mathcal{Q}$ exist? Le Cam showed that, when $\mathcal{P}$ and $\mathcal{Q}$ have a common dominating measure, a test that has power exceeding its level by more than $\varepsilon$ exists if and only if the convex hulls of $\mathcal{P}$ and $\mathcal{Q}$ are separated in total-variation distance by more than $\varepsilon$. The requirement of a dominating measure is frequently violated in nonparametric statistics. In a passing remark, Le Cam described an approach to address more general scenarios, but he stopped short of stating a formal theorem. This work completes Le Cam's program, by presenting a matching necessary and sufficient condition for testability: for the aforementioned theorem to hold without assumptions, one must take the closures of the convex hulls of $\mathcal{P}$ and $\mathcal{Q}$ in the space of bounded finitely additive measures. We provide simple elucidating examples, and elaborate on various subtle measure theoretic and topological points regarding compactness and achievability.
Peter Halmos, Boris Hanin
Wasserstein gradient flow provides a general framework for minimizing an energy functional $J$ over the space of probability measures on a Riemannian manifold $(M,g)$. Its canonical time-discretization, the Jordan-Kinderlehrer-Otto (JKO) scheme, produces for any step size $η>0$ a sequence of probability distributions $ρ_k^η$ that approximate to first order in $η$ Wasserstein gradient flow on $J$. But the JKO scheme also has many other remarkable properties not shared by other first-order integrators, e.g. it preserves energy dissipation and exhibits unconditional stability for $λ$-geodesically convex functionals $J$. To better understand the JKO scheme we characterize its implicit bias at second order in $η$. We show that $ρ_k^η$ are approximated to order $η^2$ by Wasserstein gradient flow on a modified energy \[ J^η(ρ) = J(ρ) - \frac{η}{4}\int_M \Big\lVert \nabla_g \frac{δJ}{δρ} (ρ) \Big\rVert_{2}^{2} \,ρ(dx), \] obtained by subtracting from $J$ the squared metric curvature of $J$ times $η/4$. The JKO scheme therefore adds at second order in $η$ a deceleration in directions where the metric curvature of $J$ is rapidly changing. This corresponds to canonical implicit biases for common functionals: for entropy the implicit bias is the Fisher information, for KL-divergence it is the Fisher-Hyvärinen divergence, and for Riemannian gradient descent it is the kinetic energy in the metric $g$. To understand the differences between minimizing $J$ and $J^η$ we study JKO-Flow, Wasserstein gradient flow on $J^η$, in several simple numerical examples. These include exactly solvable Langevin dynamics on the Bures-Wasserstein space and Langevin sampling from a quartic potential in 1D.
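As a quick check of the entropy case mentioned in the abstract, the modified energy can be worked out by hand. The derivation below is a sketch that assumes the flat setting $M=\mathbb{R}^d$ and the convention $J(ρ)=\int ρ\log ρ\,dx$ for the (negative) entropy. These sign and normalization conventions are assumptions and are not taken from the paper.

```latex
\begin{align*}
J(\rho) &= \int \rho \log \rho \, dx,
\qquad
\frac{\delta J}{\delta \rho}(\rho) = \log \rho + 1,
\qquad
\nabla \frac{\delta J}{\delta \rho}(\rho) = \nabla \log \rho, \\
J^{\eta}(\rho) &= J(\rho) - \frac{\eta}{4} \int \lVert \nabla \log \rho \rVert_2^2 \, \rho \, dx
= J(\rho) - \frac{\eta}{4}\, I(\rho),
\qquad
I(\rho) := \int \lVert \nabla \log \rho \rVert_2^2 \, \rho \, dx .
\end{align*}
```

Here $I(ρ)$ is the Fisher information of $ρ$, which matches the abstract's statement that the implicit bias for entropy is the Fisher information.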
Korel Gundem, Juncheng Dong, Dennis Zhang, Vahid Tarokh, Zhengling Qi
In-Context Learning (ICL) allows Large Language Models (LLMs) to adapt to new tasks with just a few examples, but their predictions often suffer from systematic biases, leading to unstable performance in classification. While calibration techniques have been proposed to mitigate these biases, we show that, in the logit space, many of these methods are equivalent to merely shifting the LLM's decision boundary without being able to alter its orientation. This proves inadequate when biases cause the LLM to be severely misaligned. To address these limitations and provide a unifying framework, we propose Supervised Calibration (SC), a loss-minimization-based framework that learns an optimal, per-class affine transformation of the LLM's predictive probabilities in the logit space without requiring external data beyond the context. By using a more expressive functional class, SC not only subsumes many existing calibration methods in ICL as special cases but also makes it possible to alter and even completely reverse the orientation of the LLM's decision boundary. Furthermore, SC's loss-based nature facilitates the seamless integration of two purpose-built regularization techniques, context-invariance and directional trust-region regularizers. The former is designed to tackle the instability issue in ICL, while the latter controls the degree of calibration. Finally, SC delivers state-of-the-art performance over calibration baselines in the 4-shot, 8-shot, and 16-shot settings across all nine datasets for Mistral-7B-Instruct-v0.3, Llama-2-7B-chat, and Qwen2-7B-Instruct.
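The difference between shifting a decision boundary and changing its orientation can be seen in a minimal sketch. It assumes a binary task described by a single scalar logit (log-odds of class 1). The function names, weights, and example logits are hypothetical, and this is not the paper's SC procedure, which learns a per-class affine map by loss minimization.

```python
def shift_calibrate(logit: float, bias: float) -> float:
    """Shift-only calibration: moves the decision threshold, keeps its direction."""
    return logit + bias

def affine_calibrate(logit: float, weight: float, bias: float) -> float:
    """Affine calibration: a negative weight reverses the boundary's orientation."""
    return weight * logit + bias

def predict(logit: float) -> int:
    """Predict class 1 when the (calibrated) log-odds are positive."""
    return 1 if logit > 0 else 0

# Hypothetical raw logits produced by a model, in increasing order.
raw_logits = [-2.0, -0.5, 1.0, 3.0]

shifted = [predict(shift_calibrate(z, bias=0.75)) for z in raw_logits]
flipped = [predict(affine_calibrate(z, weight=-1.0, bias=0.0)) for z in raw_logits]

print("shift only :", shifted)   # larger raw logits still favor class 1
print("affine w<0 :", flipped)   # ordering of the predictions is reversed
```

With any bias, the shift-only predictions remain non-decreasing in the raw logit. Only the affine map, through a negative weight, can assign class 1 to the inputs with the lowest raw logits.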
Sibasish Dhibar
White blood cells (WBCs) are an important part of the immune system; they protect the body against infections by eliminating viruses, bacteria, parasites, and fungi. The number of WBC types and the total number of WBCs provide important information about our health status. Convolutional neural networks (CNNs), a deep learning architecture traditionally used for this task, can classify a blood cell from a part of an object and perform object recognition. Various CNN models exhibit potential; however, their development often involves ad-hoc processes that neglect unnecessary layers, leading to issues with unbalanced datasets and insufficient data augmentation. To address these challenges, we propose a novel ensemble approach that integrates three CNN architectures, each uniquely configured with different dropout and max-pooling layer settings to enhance feature learning. This ensemble model, named DCENWCNet, effectively balances the bias-variance trade-off. When evaluated on the widely recognized Rabbin-WBC dataset, our model outperforms existing state-of-the-art networks, achieving the highest mean accuracy. Additionally, it demonstrates superior performance in precision, recall, F1-score, and Area Under the ROC Curve (AUC) across all categories. To delve deeper into the interpretability of the classifiers, we employ reliable post-hoc explanation techniques, including Local Interpretable Model-Agnostic Explanations (LIME). These methods approximate the behavior of a black-box model by elucidating the relationships between feature values and predictions. Interpretable results enable users to comprehend and validate the model's predictions, thereby increasing their confidence in the automated diagnosis.
Ramon de Punder, Timo Dimitriadis, Rutger-Jan Lange
Score-driven (SD) models are a standard tool in statistics and econometrics, with applications in hundreds of published articles in the past decade. We provide an information-theoretic characterization of SD updates based on reductions in the expected Kullback-Leibler (EKL) divergence relative to the true -- but unknown -- data-generating density. EKL reductions occur if and only if the expected update direction aligns with the expected score; i.e., their inner product should be positive. This equivalence condition uniquely identifies SD updates (including scaled or clipped variants) as being EKL reducing, even in non-concave, multivariate, and misspecified settings. We further derive explicit bounds on admissible learning rates in terms of score moments, linking SD methods to adaptive optimization techniques. By contrast, alternative performance measures in the literature impose stronger conditions (e.g., concave logarithmic densities) and do not characterize SD updates: other updating rules may improve these measures, while SD updates need not. Our results provide a rigorous justification for SD models and establish EKL as their natural information-theoretic foundation.
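The link between score alignment and divergence reduction can be made concrete in a toy setting. The sketch below assumes a Gaussian location model N(f, 1) with true density N(mu0, 1) and the update f' = f + rate * (y - f), where y - f is the score with respect to f. The function names and the way the expectation is taken (over the next observation y) are assumptions made for illustration and are not the paper's definitions.

```python
def kl_before(gap: float) -> float:
    """KL(N(mu0, 1) || N(f, 1)) with gap = mu0 - f."""
    return 0.5 * gap ** 2

def expected_kl_after_update(gap: float, rate: float) -> float:
    """Expected KL after f' = f + rate * (y - f), with y ~ N(mu0, 1).

    mu0 - f' = (1 - rate) * gap - rate * eps, eps ~ N(0, 1), so the
    expected squared gap is (1 - rate)^2 * gap^2 + rate^2.
    """
    return 0.5 * ((1.0 - rate) ** 2 * gap ** 2 + rate ** 2)

def max_reducing_rate(gap: float) -> float:
    """Largest rate giving a reduction: 2 * E[score]^2 / E[score^2].

    Here E[score] = gap and E[score^2] = gap^2 + 1.
    """
    return 2.0 * gap ** 2 / (gap ** 2 + 1.0)

gap = 1.0
for rate in (-0.1, 0.5, 1.5):
    after = expected_kl_after_update(gap, rate)
    print(f"rate={rate:+.1f}: KL before={kl_before(gap):.3f}, expected KL after={after:.3f}")
print("largest reducing rate:", max_reducing_rate(gap))
```

In this toy model a negative rate, which points against the expected score, always increases the expected divergence, while positive rates reduce it only below a bound determined by the first two score moments.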