arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
2604.24749 2026-04-28 cs.LG stat.ML

The Optimal Sample Complexity of Multiclass and List Learning

Chirag Pabbaraju

While the optimal sample complexity of binary classification in terms of the VC dimension is well-established, determining the optimal sample complexity of multiclass classification has remained open. The appropriate complexity parameter for multiclass classification is the DS dimension, and despite significant efforts, a gap of $\sqrt{\text{DS}}$ has persisted between the upper and lower bounds on sample complexity. Recent work by Hanneke et al. (2026) establishes a novel algebraic characterization of multiclass hypothesis classes in terms of their DS dimension. Building on this, we show that the maximum hypergraph density of any multiclass hypothesis class is upper-bounded by its DS dimension. This proves a longstanding conjecture of Daniely and Shalev-Shwartz (2014). As a consequence, we determine the optimal dependence of the sample complexity on the DS dimension for multiclass as well as list learning.

2604.24737 2026-04-28 cs.LG cs.AI cs.CC stat.ML

Learning to Think from Multiple Thinkers

Nirmit Joshi, Roey Magen, Nathan Srebro, Nikolaos Tsilivis, Gal Vardi

Comments: Comments are welcome. 78 pages, 5 figures

We study learning with Chain-of-Thought (CoT) supervision from multiple thinkers, all of whom provide correct but possibly systematically different solutions, e.g., step-by-step solutions to math problems written by different thinkers, or step-by-step execution traces of different programs solving the same problem. We consider classes that are computationally easy to learn using CoT supervision from a single thinker, but hard to learn with only end-result supervision, i.e., without CoT (Joshi et al. 2025). We establish that, under cryptographic assumptions, learning can be hard from CoT supervision provided by two or a few different thinkers, in passive data-collection settings. On the other hand, we provide a generic computationally efficient active learning algorithm that learns with a small amount of CoT data per thinker that is completely independent of the target accuracy $\varepsilon$, a moderate number of thinkers that scales as $\log \frac{1}{\varepsilon}\log \log \frac{1}{\varepsilon}$, and sufficient passive end-result data that scales as $\frac{1}{\varepsilon}\cdot \mathrm{polylog}\frac{1}{\varepsilon}$.

2604.24736 2026-04-28 math.ST stat.TH

Parametric Statistical Inference in the Zone of Moderate Deviation Probabilities

Mikhail Ermakov

Comments 18 pages

A parametric theory of statistical inference is developed for the moderate deviation probability zone. The new approach to the proofs is based on the Taylor series expansion of the logarithm of the likelihood ratio based on the Hellinger distance. The Large Deviation Principle in the moderate deviation probability zone is proven for Bayesian estimators and maximum likelihood estimators. A uniform approximation of the logarithm of the likelihood ratio and a theorem on the concentration of the posterior Bayesian measure are also established for the zone of moderate deviation probabilities.

2604.24652 2026-04-28 stat.ME econ.EM

Benefits and Costs of Adaptive Sampling

Yu-Shiou Willy Lin, Dae Woong Ham, Iavor Bojinov

Comments 41 pages, 3 figures

Multi-armed bandits are widely used for sequential experimentation in clinical trials, recommendation systems, and online platforms. While regret minimization and valid inference from adaptively collected data have each been studied extensively, a basic question remains: when does adaptivity improve estimation precision relative to uniform designs, and how should inference be balanced against the online cost of experimentation? We first study arm-level mean estimation under mean-squared-error (MSE) objectives. We characterize when an adaptive Neyman allocation, which allocates samples according to arm variance, yields strict MSE improvements over uniform sampling. When there is variance heterogeneity across arms, these improvements arise at modest sample sizes, clarifying that adaptivity can be preferable for inference not only asymptotically, but also in many practical finite-sample settings. We then study a joint inference-regret objective that accounts for the cost of assigning units to inferior arms during experimentation. We propose the Static-Allocation Rate Policy (SARP) and the Neyman-Adaptive Rate Policy (NARP), which interpolate between inference- and regret-oriented policies by adjusting exploration to the local structure of the instance. We show that SARP and NARP converge to the complete-information benchmark at the optimal rate as the sampling budget grows. Our proposed policies are practically attractive because they linearly interpolate between any standard regret-minimizing algorithm and inference-targeting adaptive policies, yet we show that they still attain the oracle-based asymptotically optimal rate. Simulations support the theory by demonstrating improved precision over uniform allocation while controlling performance loss across a range of instances.
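
To make the Neyman allocation idea concrete, here is a minimal sketch. It rests on assumptions the abstract does not state: the objective is taken to be the sum of per-arm MSEs of the sample means, the arm standard deviations are treated as known (an oracle version, not the adaptive policy of the paper), and the function names `neyman_allocation` and `total_mse` are hypothetical. Under that objective, the classical Neyman rule assigns samples in proportion to each arm's standard deviation.

```python
def neyman_allocation(std_devs, budget):
    """Split `budget` samples across arms in proportion to each arm's
    standard deviation (classical oracle Neyman rule; assumed objective
    is the sum of per-arm MSEs)."""
    total = sum(std_devs)
    if total == 0:
        # No variance anywhere: fall back to a uniform split.
        return [budget / len(std_devs)] * len(std_devs)
    return [budget * s / total for s in std_devs]


def total_mse(std_devs, allocation):
    """Sum over arms of sigma_k^2 / n_k, the MSE of each arm's sample mean."""
    return sum(s ** 2 / n for s, n in zip(std_devs, allocation))


# Illustrative two-arm instance with heterogeneous variances.
std_devs = [1.0, 3.0]
budget = 100
uniform = [budget / len(std_devs)] * len(std_devs)
neyman = neyman_allocation(std_devs, budget)

print("uniform allocation:", uniform, "total MSE:", total_mse(std_devs, uniform))
print("Neyman allocation: ", neyman, "total MSE:", total_mse(std_devs, neyman))
```

In this toy instance the noisier arm receives three quarters of the budget, and the summed MSE is lower than under the uniform split. With equal standard deviations the two allocations coincide.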

2604.24632 2026-04-28 stat.CO cs.NA math.NA math.PR

Theoretical guarantees for stochastic gradient sampling methods via Gaussian convolution inequalities

Daniel Paulin, Peter A. Whalley

Comments 34 pages, 2 figures

We derive first-order (in the stepsize) bounds on the bias in Wasserstein distances of the invariant measure of stochastic gradient kinetic Langevin dynamics with minimal assumptions on the stochastic gradient noise. These bounds sharpen existing non-asymptotic guarantees for stochastic-gradient MCMC methods and provide a quantitative resolution of a previously open problem on invariant measure accuracy. The main technical ingredients are new Gaussian convolution inequalities controlling the Wasserstein-$p$ distance between a Gaussian convolved with a mean-zero perturbation and the Gaussian itself. We anticipate that these inequalities will be of independent interest beyond the present application.

2604.24563 2026-04-28 physics.chem-ph cs.LG stat.ML

Enhancing molecular dynamics with equivariant machine-learned densities

Mihail Bogojeski, Muhammad R. Hasyim, Leslie Vogt-Maranto, Klaus-Robert Müller, Kieron Burke, Mark E. Tuckerman

Comments 30 pages, 7 figures

Machine-learning interatomic potentials (MLIPs) have enabled molecular dynamics at near ab initio accuracy, yet remain limited to energies and forces by construction, leaving electronic observables such as dipole moments and polarizabilities inaccessible. We introduce DenSNet, a density-first approach to machine-learned electronic structure that learns the Hohenberg–Kohn map from nuclear configurations to the ground-state electron density. Our approach employs an SE(3)-equivariant neural network to predict density coefficients of a flexible atom-centered Gaussian basis, combined with a $Δ$-learning strategy that uses superposed atomic densities as a prior to accelerate training. A second equivariant network then maps the predicted density to the total energy, providing a unified framework for molecular dynamics and electronic structure. We validate DenSNet on ethanol, ethanethiol, and resorcinol, where infrared spectra from machine-learned trajectories show excellent agreement with experimental gas-phase measurements. To test scalability, we train on polythiophene oligomers with 1–6 monomers and extrapolate to chains of up to 12 monomers, generating stable long-time trajectories whose infrared spectra agree with reference density functional theory calculations. Here, we show that reinstating the electron density as the central learned quantity opens a practical route to transferable prediction of spectroscopic and electronic observables in large-scale molecular simulations.

2604.24555 2026-04-28 cs.LG stat.ML

Efficient learning by implicit exploration in bandit problems with side observations

Tomas Kocak, Gergely Neu, Michal Valko, Remi Munos

Comments Published at Neural Information Processing Systems (NeurIPS) 2014

We consider online learning problems under a partial observability model capturing situations where the information conveyed to the learner is between full information and bandit feedback. In the simplest variant, we assume that in addition to its own loss, the learner also gets to observe losses of some other actions. The revealed losses depend on the learner's action and a directed observation system chosen by the environment. For this setting, we propose the first algorithm that enjoys near-optimal regret guarantees without having to know the observation system before selecting its actions. Along similar lines, we also define a new partial information setting that models online combinatorial optimization problems where the feedback received by the learner is between semi-bandit and full feedback. As the predictions of our first algorithm cannot be always computed efficiently in this setting, we propose another algorithm with similar properties and with the benefit of always being computationally efficient, at the price of a slightly more complicated tuning mechanism. Both algorithms rely on a novel exploration strategy called implicit exploration, which is shown to be more efficient both computationally and information-theoretically than previously studied exploration strategies for the problem.
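
The abstract does not spell out the implicit exploration estimator. The sketch below shows the form commonly associated with the idea, included here as an assumption: the usual importance-weighted loss estimate is damped by adding a small parameter `gamma` to the observation probability in the denominator. The function names and the numbers are hypothetical.

```python
def iw_loss_estimate(loss, observed, obs_prob):
    """Standard importance-weighted estimate: unbiased, but can be very
    large when the observation probability is small."""
    return loss / obs_prob if observed else 0.0


def ix_loss_estimate(loss, observed, obs_prob, gamma):
    """Implicit-exploration style estimate (assumed form): adding
    gamma > 0 to the denominator caps the estimate at loss / gamma,
    at the price of a downward (optimistic) bias."""
    return loss / (obs_prob + gamma) if observed else 0.0


# Illustrative values: a loss of 1.0 observed with probability 0.1.
loss, obs_prob, gamma = 1.0, 0.1, 0.05

iw = iw_loss_estimate(loss, True, obs_prob)
ix = ix_loss_estimate(loss, True, obs_prob, gamma)

# Expected value of each estimate over the randomness of observation.
expected_iw = obs_prob * iw
expected_ix = obs_prob * ix

print("importance-weighted:", iw, "expectation:", expected_iw)
print("implicit exploration:", ix, "expectation:", expected_ix)
```

The damped estimate never exceeds the importance-weighted one, and its expectation sits below the true loss, which is why no explicit uniform exploration is mixed into the action distribution in this style of algorithm.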

2604.24545 2026-04-28 stat.ML cs.LG

Extreme bandits

Alexandra Carpentier, Michal Valko

Comments Published at Neural Information Processing Systems (NeurIPS) 2014

In many areas of medicine, security, and life sciences, we want to allocate limited resources to different sources in order to detect extreme values. In this paper, we study an efficient way to allocate these resources sequentially under limited feedback. While sequential design of experiments is well studied in bandit theory, the most commonly optimized property is the regret with respect to the maximum mean reward. However, in other problems such as network intrusion detection, we are interested in detecting the most extreme value output by the sources. Therefore, in our work we study extreme regret which measures the efficiency of an algorithm compared to the oracle policy selecting the source with the heaviest tail. We propose the ExtremeHunter algorithm, provide its analysis, and evaluate it empirically on synthetic and real-world experiments.

2604.24537 2026-04-28 cs.LG stat.ML

Stochastic simultaneous optimistic optimization

Michal Valko, Alexandra Carpentier, Rémi Munos

Comments Published in International Conference on Machine Learning (ICML 2013)

We study the problem of global maximization of a function f given a finite number of evaluations perturbed by noise. We consider a very weak assumption on the function, namely that it is locally smooth (in some precise sense) with respect to some semi-metric, around one of its global maxima. Compared to previous works on bandits in general spaces (Kleinberg et al., 2008; Bubeck et al., 2011a) our algorithm does not require the knowledge of this semi-metric. Our algorithm, StoSOO, follows an optimistic strategy to iteratively construct upper confidence bounds over the hierarchical partitions of the function domain to decide which point to sample next. A finite-time analysis of StoSOO shows that it performs almost as well as the best specifically-tuned algorithms even though the local smoothness of the function is not known.

2604.24534 2026-04-28 math.ST stat.TH

Robust linear regression under latent group heterogeneity

Xifeng Li, Shuzhen Yang

Uncertainty is ubiquitous in real-world data, and the assumptions underlying classical linear regression models are often violated in practice. Inspired by the theory of sublinear expectation, we consider a linear regression model where the random intercept term has mean uncertainty and the error term has variance uncertainty. We develop a novel two-step approach, named Expectation-Maximization with Moving Block (EMMB), to estimate the model parameters. The proposed method requires no prior knowledge of group structures or change points. Theoretical properties of the estimators are established under mild regularity conditions. Simulation studies and a real-data application to PM2.5 concentration modeling in Beijing demonstrate the superiority of the proposed method: it captures substantial intercept heterogeneity overlooked by ordinary least squares and yields more accurate and interpretable estimates.

2604.24533 2026-04-28 stat.ME

Hierarchical Causal Uplift Modeling in Overlapping Customer Journeys

Jorge Pellegrini

Digital travel platforms often operate multiple marketing journeys simultaneously, resulting in overlapping user exposures that bias the standard A/B lift estimation. Because traditional lift experiments assume treatment isolation, the observed lifts reflect only marginal effects and may substantially underestimate the total incremental impact of each journey. This work introduces a Hierarchical Causal Lift Model that decomposes pure and global effects under journey overlap. Each journey is modeled as a multiplicative causal factor, and the interaction terms capture potential synergies or cannibalizations. The model is estimated through a Monte Carlo framework that incorporates uncertainty in overlap proportions, observed lifts, and single-journey effects. Regularized non-linear least squares are complemented with Monte Carlo simulation to quantify parameter uncertainty and assess the robustness of the solution. Applied to an active user base of approximately three million users, the model reveals positive but modest synergies between journeys and shows that pure lifts are significantly larger than those observed experimentally. The predicted global lift closely matches the experimentally measured value, demonstrating the ability of the model to recover incremental effects in an interpretable manner.

2604.24499 2026-04-28 cs.IT math.IT q-bio.PE stat.AP

Fisher Information and Dynamical Sampling I

Mattia Carrino, Stefan Hohenegger

Comments 41 pages, 17 figures

Information theory is a powerful framework to capture aspects of dynamical systems with multiple degrees of freedom. Mathematically, the dynamics can be represented as a continuous curve $\mathcal{C}$ on a suitable hyperplane in flat space and the Fisher information provides the norm of an infinitesimal displacement along this curve. In many applications, however, we do not have direct access to $\mathcal{C}$. Instead, we have to reconstruct the latter from a time-series of measurements (obtained as samples of size $n$), which are represented by an ordered set of points $\widehat{\mathcal{C}}$ on the same hyperplane. In this work, we calculate the bias of the Fisher information for large $n$, which provides a quantitative estimate of how accurately the dynamics of a system can be reconstructed from a given set of sampled data. Based on this result, we show that a clustering of the degrees of freedom reduces the bias and thus improves the accuracy with which the new system can be described with the same data. Inspired by a recent proposal for such a clustering, we provide a quantitative assessment of the loss of information, which allows us to estimate how much information about the dynamics of a system can reliably be extracted from a given set of data. We illustrate our findings in the case of a simple compartmental model. Although the latter is inspired by epidemiology, the results of this work are applicable to very general dynamical models with multiple degrees of freedom.

2604.24490 2026-04-28 math.ST stat.TH

Posterior Invariance of Multiplicative Contrasts under Margin Constraints in Contingency Tables

Rafael Bassi Stern, Ruobin Gong, Joseph B. Kadane, Mark J. Schervish, Teddy Seidenfeld

Comments 18 pages, 1 figure

Measures of association in contingency tables, such as odds ratios and their generalizations, are often studied under different sampling schemes that either fix or leave random the margins of the table. While classical results show that certain odds ratios are unaffected by constraining the margins, it is less clear when this invariance holds more generally. This paper studies posterior inference for a broad class of multiplicative contrasts of multinomial cell probabilities, which we refer to as generalized odds ratios, and addresses exactly when fixing a margin alters inference about them. We consider Bayesian inference under multinomial sampling and under models in which partition sums of the table are fixed in advance, and assume that the marginal and conditional parameters are independent a priori. Under additional mild assumptions, we show that the posterior distribution of a generalized odds ratio is invariant to fixing a margin if and only if the coefficients defining the contrast sum to zero within the margin.
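
As an illustration of the sum-to-zero condition, consider the following sketch. The notation ($p_{ij}$, $c_{ij}$, $\theta$) is assumed here and is not taken from the paper. A multiplicative contrast of cell probabilities can be written as a product of powers, and the ordinary odds ratio of a $2\times 2$ table is the special case with coefficients $\pm 1$.

```latex
% Assumed notation: p_{ij} are multinomial cell probabilities,
% c_{ij} are the real coefficients defining the contrast.
\[
  \theta \;=\; \prod_{i,j} p_{ij}^{\,c_{ij}},
  \qquad
  \log\theta \;=\; \sum_{i,j} c_{ij}\,\log p_{ij}.
\]
% The 2x2 odds ratio corresponds to
% c_{11} = c_{22} = +1 and c_{12} = c_{21} = -1:
\[
  \theta_{\mathrm{OR}} \;=\; \frac{p_{11}\,p_{22}}{p_{12}\,p_{21}},
  \qquad
  c_{11} + c_{12} \;=\; 0,
  \quad
  c_{21} + c_{22} \;=\; 0.
\]
```

Here the coefficients sum to zero within each row (and also within each column), which is consistent with the classical observation that the odds ratio is unaffected by fixing the row or column totals.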

2604.24430 2026-04-28 stat.ME

Combined shrinkage of fixed and random effects in linear mixed models using empirical Bayes

Matteo Amestoy, R. Vermeulen, Mark A. van de Wiel, Wessel N. van Wieringen

A novel data-driven methodology is presented for the joint selection of prior parameters for both fixed and random effects in Linear Mixed Models (LMMs). This approach facilitates the estimation of complex random-effects structures, as well as potentially high-dimensional data. Although Bayesian frameworks require the specification of informative prior parameters, such values are often unavailable a priori - especially for random-effect covariances. The proposed method automates this selection through an Empirical Bayes framework, which maximizes the marginal likelihood using an efficient Laplace approximation. Numerical simulations demonstrate that this methodology significantly enhances parameter estimation accuracy and predictive performance. Finally, an application to a real-world air pollution and health dataset illustrates how the method enables the use of more sophisticated and statistically appropriate models to improve predictive outcomes.

2604.24360 2026-04-28 stat.ME

A Milestone-Based Framework for Characterizing Time-Varying Treatment Effects in Immunotherapy Trials

Yi-Cheng Tai, Weijing Wang, Jedd D. Wolchok, Martin T. Wells

Comments 39 pages, 35 figures

Immune checkpoint inhibitor-based therapies often produce heterogeneous survival responses, including early risk, delayed treatment benefit, and durable long-term survival in a subset of patients. In these settings, conventional summary measures such as the hazard ratio may not adequately describe how treatment effects evolve over follow-up. We propose a milestone-based framework that separates long-term survival beyond a clinically meaningful time point from earlier outcomes and provides a practical way to characterize patient heterogeneity in treatment response. The framework summarizes treatment differences through milestone survival probabilities and, among patients who do not reach the milestone, characterizes short-term treatment ordering over time using a tau-based summary that helps identify hazard reversal. We illustrate the approach using reconstructed individual-level data from three landmark phase III trials: CheckMate 067, CheckMate 227, and CLEAR. Across these examples, the framework captures patterns that are difficult to summarize with conventional measures, including settings in which early disadvantage coexists with later durable benefit. It also helps clarify when treatment benefit begins to emerge and how short-term and long-term effects differ within the same trial. This approach provides a clinically interpretable and statistically principled way to evaluate heterogeneous and time-varying treatment effects in oncology trials with nonproportional hazards.
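
A milestone survival probability can be illustrated with a short sketch. The abstract does not say which estimator the authors use. The code below assumes the standard Kaplan-Meier estimator evaluated at the milestone time, and the function name and the toy data are hypothetical.

```python
def kaplan_meier_at(times, events, milestone):
    """Kaplan-Meier estimate of the probability of surviving beyond
    `milestone`. events[i] is 1 for an observed event and 0 for
    right-censoring at times[i]."""
    survival = 1.0
    # Distinct observed event times up to and including the milestone.
    event_times = sorted(
        {t for t, e in zip(times, events) if e == 1 and t <= milestone}
    )
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1.0 - deaths / at_risk
    return survival


# Toy follow-up times (months) for two hypothetical arms.
treatment_times, treatment_events = [2, 3, 5, 8, 10], [1, 0, 1, 1, 0]
control_times, control_events = [1, 2, 4, 6, 9], [1, 1, 1, 0, 1]

milestone = 6
s_treatment = kaplan_meier_at(treatment_times, treatment_events, milestone)
s_control = kaplan_meier_at(control_times, control_events, milestone)

print("treatment S(6):", s_treatment)
print("control S(6):  ", s_control)
print("milestone difference:", s_treatment - s_control)
```

The difference of the two estimates at the milestone is one simple way to summarize long-term benefit. The tau-based summary for patients who do not reach the milestone is a separate component and is not sketched here.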

2604.24116 2026-04-28 cs.SE stat.AP

Closing the Loop: A Software Framework for AI to Support Business Decision Making

Jeffrey Wong, Antoine Creux

Create an idea, prototype it, evaluate whether users like it, then learn. This is the circle of business. If AI can operate in all parts of the circle, it will enable rapid iteration and learning speeds for businesses. Experiment platforms that deploy experiments to evaluate return on investment for businesses are abundant, but systems that help businesses learn personalization, mechanisms, and what to ideate next are rare. The technologies that do exist cannot be well orchestrated in a single software interface that can be safely and efficiently leveraged by an AI agent. These challenges make it difficult to teach an AI agent how to learn within a robust experimentation framework, and difficult for an AI agent to operate and iterate for the business. We offer a two-part solution: one half rooted in mathematical reductions to contain complexity, and one half rooted in software design to optimize for orchestration, software safety, and multiplicity. Our solution, a software framework, moves beyond the simple treatment effect computed as a difference in means. To create a better understanding of a business and its customers, we enrich causal analysis with heterogeneous effects, policy algorithms, mediation analysis, and forecasts of effects. To have an AI complete the iteration cycle faster, we further enrich the analysis with variance reduction and anytime-valid inference. The enrichments are made compatible across different types of experiments, and are presented in a single software interface that is usable by an AI agent. We evaluate the approach on various objectives in experiment analysis, and show that the framework improves code correctness, reduces lines of code, and is more performant than a baseline analysis constructed by a vanilla agent.

2604.00763 2026-04-28 stat.ME q-bio.GN stat.AP

Non-ignorable fuzziness in granular counts: the case of RNA-seq data

Antonio Calcagnì, Arianna Consiglio, Przemyslaw Grzegorzewski, Corrado Mencar

Comments 10 pages, 1 figure, 0 tables. Note: The compressed source folder contains the Supplementary Materials

RNA-seq count data are often affected by read-to-gene alignment ambiguity, especially in high-dimensional transcriptomics. This type of ambiguity can be conveniently expressed through granular counts, namely fuzzy-valued observations of latent discrete quantities. We study a class of fuzzy-reporting mechanisms and show that, when reporting exploits graded membership, ignorability fails generically, leading to a coarsening-not-at-random structure. A hierarchical model is then introduced as a tractable instance of this construction and illustrated using RNA-seq data.

2603.17717 2026-04-28 cs.CR cs.AI stat.AP stat.ML

Machine Learning for Network Attacks Classification and Statistical Evaluation of Adversarial Learning Methodologies for Synthetic Data Generation

Iakovos-Christos Zarkadis, Christos Douligeris

Comments Accepted at IEEE ISCC 2026, Portugal

Supervised detection of network attacks has always been a critical part of network intrusion detection systems (NIDS). Nowadays, in a pivotal time for artificial intelligence (AI), with even more sophisticated attacks that utilize advanced techniques, such as generative artificial intelligence (GenAI) and reinforcement learning, it has become a vital component if we wish to protect our personal data, which are scattered across the web. In this paper, we address two tasks on the first unified multi-modal NIDS dataset, which incorporates flow-level data, packet payload information and temporal contextual features from the reprocessed CIC-IDS-2017, CIC-IoT-2023, UNSW-NB15 and CIC-DDoS-2019 datasets, with the same feature space. In the first task, we use machine learning (ML) algorithms with stratified cross-validation in order to prevent network attacks with stability and reliability. In the second task, we use adversarial learning algorithms to generate synthetic data, compare them with the real data, and evaluate their fidelity, utility and privacy using the SDV framework, f-divergences, distinguishability and non-parametric statistical tests. The findings provide stable ML models for intrusion detection and generative models with high fidelity and utility, by combining the Synthetic Data Vault framework and the TRTS and TSTR tests with non-parametric statistical tests and f-divergence measures.

2602.01338 2026-04-28 cs.LG math.ST stat.ML stat.TH

High-accuracy sampling for diffusion models and log-concave distributions

Fan Chen, Sinho Chewi, Constantinos Daskalakis, Alexander Rakhlin

We present algorithms for diffusion model sampling which obtain $δ$-error in $\mathrm{polylog}(1/δ)$ steps, given access to $\widetilde O(δ)$-accurate score estimates in $L^2$. This is an exponential improvement over all previous results. Specifically, under minimal data assumptions, the complexity is $\widetilde O(d_\star \mathrm{polylog}(1/δ))$ where $d_\star$ is the intrinsic dimension of the data. Further, under a non-uniform $L$-Lipschitz condition, the complexity reduces to $\widetilde O(L \mathrm{polylog}(1/δ))$. Our approach also yields the first $\mathrm{polylog}(1/δ)$ complexity sampler for general log-concave distributions using only gradient evaluations.

2512.10570 2026-04-28 stat.ML cs.LG

Flexible Deep Neural Networks for Partially Linear Survival Data: Estimation and Survival Inference

Asaf Ben Arie, Malka Gorfine

We propose a flexible deep neural network (DNN) framework for modeling survival data within a partially linear regression structure. The approach preserves interpretability through a parametric linear component for covariates of primary interest, while a nonparametric DNN component captures complex time-covariate interactions among nuisance variables. We refer to the method as FLEXI-Haz, a FLEXIble Hazard model with a partially linear structure. In contrast to existing DNN approaches for partially linear Cox models, FLEXI-Haz does not rely on the proportional hazards assumption. We establish theoretical guarantees: the neural network component attains minimax-optimal convergence rates over composite Hölder classes, the linear estimator is $\sqrt{n}$-consistent, asymptotically normal, and semiparametrically efficient, and we develop a cross-fitted one-step estimator of the cumulative hazard and survival function for a new subject, together with pointwise asymptotic confidence intervals. To the best of our knowledge, this is the first frequentist asymptotic pointwise inference result for a survival function in a DNN survival model, with or without a linear component. Simulations and real-data analyses demonstrate the utility of FLEXI-Haz as a principled and interpretable alternative to methods based on proportional hazards.

2511.17902 2026-04-28 cs.LG cs.AI stat.ML

Statistically-Guided Meta-Learning for Cross-Deployment Activity Recognition in Distributed Fiber-Optic Sensing

Yifan He, Haodong Zhang, Qiuheng Song, Lin Lei, Zhenxuan Zeng, Haoyang He, Hongyan Wu

Distributed Fiber Optic Sensing (DFOS) is promising for long-range perimeter security, yet practical deployment faces three key obstacles: severe cross-deployment domain shift, scarce or unavailable labels at new sites, and limited within-class coverage even in source deployments. We propose DUPLE, a prototype-based meta-learning framework tailored for cross-deployment DFOS recognition. The core idea is to jointly exploit complementary time- and frequency-domain cues and adapt class representations to sample-specific statistics: (i) a dual-domain learner constructs multi-prototype class representations to cover intra-class heterogeneity; (ii) a lightweight statistical guidance mechanism estimates the reliability of each domain from raw signal statistics; and (iii) a query-adaptive aggregation strategy selects and combines the most relevant prototypes for each query. Extensive experiments on two real-world cross-deployment benchmarks demonstrate consistent improvements over strong deep learning and meta-learning baselines, achieving more accurate and stable recognition under label-scarce target deployments.

2510.10372 2026-04-28 stat.ME

Sequentially Doubly Robust Estimation of Conditional Survival Probability with Time-Varying Covariates

Hongxiang Qiu, Marco Carone, Alex Luedtke, Peter B. Gilbert

Comments new theoretical, simulation, data analysis results

It is often of interest to study the association between covariates and the cumulative incidence of a right-censored time-to-event outcome. When time-varying covariates are measured on a fixed discrete time scale, it is desirable to account for these more up-to-date covariates when addressing censoring. For example, in vaccine trials, it is of interest to study the association between immune response levels after administering the vaccine and the cumulative incidence of the endpoint, while accounting for loss to follow-up explained by immune response levels measured at multiple post-vaccination visits. Existing methods rely on stringent parametric assumptions, do not account for informative censoring due to time-varying covariates when time is continuous, only estimate a marginal survival probability, or do not fully use the discrete-time structure of post-treatment covariates. We propose a nonparametric estimator of the continuous-time survival probability conditional on covariates, accounting for censoring due to time-varying covariates measured on a fixed discrete time scale. We show that the estimator is sequentially doubly robust: it is consistent if, within each time window between adjacent visits, the censoring distribution is consistently estimated, or both the time-to-event distribution and a conditional mean probability are consistently estimated. We also show that, in the special case of estimating the marginal survival probability, our estimator is asymptotically efficient. We demonstrate the superior performance of our estimator in a simulation experiment, and apply the method to a COVID-19 vaccine efficacy trial.

2508.09079 2026-04-28 econ.GN cs.DL q-fin.EC stat.OT

Exploring the Shape of Economics: A Multilayer Network Analysis of Social Communities and Intellectual Similarity Among Journals Before and After the 2008 Financial Crisis

Alberto Baccini, Lucio Barabesi, Carlo Debernardi

Comments 66 pages, 3 figures, 7 tables

This paper develops a multilayer network approach for exploring the evolution of scientific disciplines, using the case of economics before and after the 2008 global financial crisis as a large-scale empirical testing ground. The units of analysis are journals, linked by social and intellectual relationships. The analysis covers all journals indexed in EconLit across three years (2006, 2012 and 2019). In the most recent year (2019), the dataset includes 909 journals, over 30,000 editorial board members, more than 260,000 authors, 134,000 articles, and nearly 2 million cited references. For each period, we model journals as connected in a four-layer multiplex network: the social relationships are based on shared editors (interlocking editorship) and shared authors (interlocking authorship), while the intellectual ones are based on shared references (bibliographic coupling) and textual similarity between articles. These four layers are integrated using Similarity Network Fusion to produce unified similarity networks from which journal communities are identified. Comparing the field across the three periods reveals a high degree of structural continuity. Although research topics changed after the crisis, the fundamental social and intellectual relationships among journals remained remarkably stable. A major result of the analysis is that editorial networks play the dominant role in shaping hierarchies and legitimizing knowledge production within the discipline. Whether this finding holds in other scientific disciplines remains an open question for future research.

2507.20088 2026-04-28 cs.LG math-ph math.MP math.OC stat.ML

Learning Latent Graph Geometry via Fixed-Point Schrödinger-Type Activation: A Theoretical Study

Dmitry Pasechnyuk-Vilensky, Martin Takáč

Comments 50 pages

Details
English abstract

We study neural architectures in which each hidden layer is defined by the stationary state of a dissipative Schrödinger-type dynamics on a learned latent graph. On stable branches, the local stationary problem defines a differentiable implicit graph layer. To learn the graph itself, we optimize over the stratified moduli space of weighted graphs and equip each stratum with a non-degenerate Kähler-Hessian metric that keeps natural-gradient descent and face crossing well posed. We then show that a multilayer stationary network is equivalent to an exact global stationary problem on a supra-graph, and that it admits a penalized global relaxation whose stationary states converge to the exact one as the penalty parameter tends to infinity. Reverse-mode differentiation is recovered as the adjoint of the exact global system, and the penalized adjoint converges to it in the same limit. Finally, under finite-dimensional strong-monotonicity and admissible-lift assumptions, the corresponding represented hypothesis classes coincide among resolvent feed-forward networks, graph-stationary networks, supra-graph stationary systems, and sheaf-based architectures with unitary connection. The resulting structural identifications yield complexity bounds controlled by sparse graph or supra-graph geometry rather than dense ambient connectivity.

2507.06542 2026-04-28 cs.LG cs.DC cs.MA stat.ML

On the Surprising Effectiveness of a Single Global Merging in Decentralized Learning

Tongtian Zhu, Tianyu Zhang, Mingze Wang, Zhanpeng Zhou, Can Wang

Comments We discover and theoretically explain why and when a single global parameter merging in decentralized learning can recover the performance of federated learning, even in highly heterogeneous and communication-constrained environments

Details
Journal ref
ICLR 2026 (Oral Presentation)
English abstract

Decentralized learning provides a scalable alternative to parameter-server-based training, yet its performance is often hindered by limited peer-to-peer communication. In this paper, we study how communication should be scheduled over time, including determining when and how frequently devices synchronize. Counterintuitive empirical results show that concentrating communication budgets in the later stages of decentralized training remarkably improves global test performance. Surprisingly, we uncover that fully connected communication at the final step, implemented by a single global merging, can significantly improve the performance of decentralized learning under high data heterogeneity. Our theoretical contributions, which explain these phenomena, are the first to establish that the globally merged model of decentralized SGD can match the convergence rate of parallel SGD. Technically, we reinterpret part of the discrepancy among local models, which was previously considered detrimental noise, as constructive components essential for matching this rate. This work provides evidence that decentralized learning is able to generalize under high data heterogeneity and limited communication, while offering broad new avenues for model merging research.

2506.22429 2026-04-28 stat.ML cs.LG

Beyond ReLU: How Activations Affect Neural Kernels and Random Wide Networks

David Holzmüller, Max Schölpple

Comments Published at AISTATS 2026. New in v2: more discussions, plots on empirical eigenvalue decay

Details
English abstract

In recent years, the neural tangent kernel (NTK) and neural network Gaussian process kernel (NNGP) have given theoreticians tractable limiting cases of fully connected neural networks. However, the properties of these kernels are poorly understood for activation functions other than powers of the ReLU. Our main contribution is a characterization of the RKHS of these kernels for activation functions whose only non-smoothness is at zero. This extends existing theory to numerous commonly used activation functions such as SELU, ELU, or LeakyReLU. Additionally, we analyze a broad set of special cases such as missing biases, two-layer networks, or polynomial activations. Our results show that a broad class of not infinitely smooth activations generate equivalent RKHSs at different network depths, depending only on the degree of the non-smoothness up to equivalence. On the other hand, the RKHS generated by polynomial activations depends on the network depth. Finally, we derive results for the smoothness of NNGP sample paths, characterizing the smoothness of infinitely wide neural networks at initialization.

2506.22258 2026-04-28 math.ST math.PR stat.TH

Mixing Time Bounds for the Gibbs Sampler under Isoperimetry

Alexander Goyal, George Deligiannidis, Nikolas Kantas

Details
English abstract

We establish bounds on the conductance for the systematic-scan and random-scan Gibbs samplers when the target distribution satisfies a Poincaré or log-Sobolev inequality and possesses sufficiently regular conditional distributions. These bounds lead to mixing time guarantees that extend beyond the log-concave setting, offering new insights into the convergence behavior of Gibbs sampling in broader regimes. Moreover, we demonstrate that our results remain valid for log-Lipschitz and log-smooth target distributions. Our approach relies on novel isoperimetric inequalities and a sequential coupling argument for the Gibbs sampler.
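
The two scan orders named in this abstract can be illustrated with a minimal sketch. It assumes a toy target that is not taken from the paper: a zero-mean bivariate Gaussian with unit variances and correlation `rho`, whose conditionals are known in closed form. The function names `gibbs_bivariate_normal` and `empirical_correlation` are hypothetical and chosen only for this illustration.

```python
import math
import random


def gibbs_bivariate_normal(n_steps, rho, scan="systematic", seed=0):
    """Gibbs sampler for a zero-mean bivariate normal with unit variances
    and correlation rho (a toy target chosen for illustration only)."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho * rho)  # sd of x_i given the other coordinate
    x = [0.0, 0.0]
    samples = []
    for _ in range(n_steps):
        if scan == "systematic":
            coords = (0, 1)               # update every coordinate in a fixed order
        elif scan == "random":
            coords = (rng.randrange(2),)  # update one uniformly chosen coordinate
        else:
            raise ValueError("scan must be 'systematic' or 'random'")
        for i in coords:
            # Exact conditional: x_i | x_j ~ N(rho * x_j, 1 - rho^2)
            x[i] = rng.gauss(rho * x[1 - i], cond_sd)
        samples.append((x[0], x[1]))
    return samples


def empirical_correlation(samples):
    """Sample correlation of the two coordinates of the chain."""
    n = len(samples)
    mx = sum(s[0] for s in samples) / n
    my = sum(s[1] for s in samples) / n
    cov = sum((s[0] - mx) * (s[1] - my) for s in samples) / n
    vx = sum((s[0] - mx) ** 2 for s in samples) / n
    vy = sum((s[1] - my) ** 2 for s in samples) / n
    return cov / math.sqrt(vx * vy)
```

In this sketch one systematic-scan step refreshes both coordinates, while one random-scan step refreshes a single coordinate, so the two chains are not directly comparable per step.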

2501.04187 2026-04-28 stat.AP

A cautious use of auxiliary outcomes for decision-making in randomized clinical trials

Massimiliano Russo, Steffen Ventz, Lorenzo Trippa

Details
English abstract

Clinical trials often collect data on multiple outcomes, such as overall survival (OS), progression-free survival (PFS), and response to treatment (RT). In most cases, however, study designs only use primary outcome data for interim and final decision-making. In several disease settings, clinically relevant outcomes, for example OS, become available years after patient enrollment. Moreover, the effects of experimental treatments on OS might be less pronounced compared to auxiliary outcomes such as RT. We develop a Bayesian decision-theoretic framework that uses both primary and auxiliary outcomes for interim and final decision-making. The framework allows investigators to control standard frequentist operating characteristics, such as the type I error rate and can be used with auxiliary outcomes from emerging technologies, such as circulating tumor assays. False positive rates and other frequentist operating characteristics are rigorously controlled without any assumption about the concordance between primary and auxiliary outcomes. We discuss algorithms to implement this decision-theoretic approach and show that incorporating auxiliary information into interim and final decision-making can lead to relevant efficiency gains according to established and interpretable metrics.

2410.10704 2026-04-28 math.ST stat.ME stat.TH

Estimation beyond Missing (Completely) at Random

Tianyi Ma, Kabir A. Verchand, Thomas B. Berrett, Tengyao Wang, Richard J. Samworth

Comments 78 pages, 6 figures, Journal version

Details
English abstract

We study the effects of missingness on the estimation of population parameters. Moving beyond restrictive missing completely at random (MCAR) assumptions, we first formulate a missing data analogue of Huber's arbitrary $ε$-contamination model. For mean estimation with respect to squared Euclidean error loss, we show that the minimax quantiles decompose as a sum of the corresponding minimax quantiles under a heterogeneous, MCAR assumption, and a robust error term, depending on $ε$, that reflects the additional error incurred by departure from MCAR. We next introduce natural classes of realisable $ε$-contamination models, where an MCAR version of a base distribution $P$ is contaminated by an arbitrary missing not at random (MNAR) version of $P$. These classes are rich enough to capture various notions of biased sampling and sensitivity conditions, yet we show that they enjoy improved minimax performance relative to our earlier arbitrary contamination classes for both parametric and nonparametric classes of base distributions. For instance, with a univariate Gaussian base distribution, consistent mean estimation over realisable $ε$-contamination classes is possible even when $ε$ and the proportion of missingness converge (slowly) to 1. We extend our results to the setting of departures from missing at random (MAR) in normal linear regression with a realisable missing response, and also demonstrate that our methods can be made adaptive to the case of unknown $ε$.

2405.20642 2026-04-28 cs.LG stat.ML

Learning Under Moral Hazard with Instrumental Regression and Generalized Method of Moments

Shiliang Zuo

Details
English abstract

Machine learning has become increasingly popular in informing data-driven policy-making. Policies influence behavior in individuals or populations, and ideally, through observational signals, policy-makers learn which policies are effective. However, in many settings, individual actions cannot be perfectly observed. This issue, known in economics as moral hazard, poses a significant challenge. In this work, we study the foundational multitasking principal-agent contract design problem and demonstrate how instrumental regression and the generalized method of moments (GMM) estimator can be used to estimate or learn a good contract. As a bonus result, we also give a uniformity characterization of the shape of the optimal contract.

2402.15995 2026-04-28 cs.CC cs.LG math.ST stat.ML stat.TH

Improved Hardness Results for Learning Intersections of Halfspaces

Stefan Tiegel

Details
Journal ref
TheoretiCS, Volume 5 (2026), Article 8, 1-26
English abstract

We show strong (and surprisingly simple) lower bounds for weakly learning intersections of halfspaces in the improper setting. Strikingly little is known about this problem. For instance, it is not even known if there is a polynomial-time algorithm for learning the intersection of only two halfspaces. On the other hand, lower bounds based on well-established assumptions (such as approximating worst-case lattice problems or variants of Feige's 3SAT hypothesis) are only known (or are implied by existing results) for the intersection of super-logarithmically many halfspaces [KS09,KS06,DSS16], with intersections of fewer halfspaces being ruled out only under less standard assumptions [DV21] (such as the existence of local pseudo-random generators with large stretch). We significantly narrow this gap by showing that even learning $ω(\log \log N)$ halfspaces in dimension $N$ takes super-polynomial time under standard assumptions on worst-case lattice problems (namely that SVP and SIVP are hard to approximate within polynomial factors). Further, we give unconditional hardness results in the statistical query framework. Specifically, we show that for any $k$ (even constant), learning $k$ halfspaces in dimension $N$ requires accuracy $N^{-Ω(k)}$, or exponentially many queries -- in particular ruling out SQ algorithms with polynomial accuracy for $ω(1)$ halfspaces. To the best of our knowledge this is the first unconditional hardness result for learning a super-constant number of halfspaces. Our lower bounds are obtained in a unified way via a novel connection we make between intersections of halfspaces and the so-called parallel pancakes distribution [DKS17,BLPR19,BRST21] that has been at the heart of many lower bound constructions in (robust) high-dimensional statistics in the past few years.

2312.08410 2026-04-28 cs.LG math.PR stat.ML

Universal approximation property of Banach space-valued random feature models including random neural networks

Ariel Neufeld, Philipp Schmocker

Comments 52 pages, 4 figures, 4 tables

Details
English abstract

We introduce a Banach space-valued extension of random feature learning, a data-driven supervised machine learning technique for large-scale kernel approximation. By randomly initializing the feature maps, only the linear readout needs to be trained, which reduces the computational complexity substantially. Viewing random feature models as Banach space-valued random variables, we prove a universal approximation result in the corresponding Bochner space. Moreover, we derive approximation rates and an explicit algorithm to learn an element of the given Banach space by such models. The framework of this paper includes random trigonometric/Fourier regression and in particular random neural networks which are single-hidden-layer feedforward neural networks whose weights and biases are randomly initialized, whence only the linear readout needs to be trained. For the latter, we can then lift the universal approximation property of deterministic neural networks to random neural networks, even within function spaces over non-compact domains, e.g., weighted spaces, $L^p$-spaces, and (weighted) Sobolev spaces, where the latter includes the approximation of the (weak) derivatives. In addition, we analyze when the training costs for approximating a given function grow polynomially in both the input/output dimension and the reciprocal of a pre-specified tolerated approximation error. Furthermore, we demonstrate in a numerical example the empirical advantages of random feature models over their deterministic counterparts.

2301.07855 2026-04-28 econ.EM stat.AP

Digital Divide: Evidence from the 2020 Canadian Internet Use Survey

Joann Jasiak, Peter MacKenzie, Purevdorj Tuvaandorj

Comments 47 pages, 8 figures, 15 tables. Substantially revised analysis based on the PUMF of the Statistics Canada 2020 Canadian Internet Use Survey

Details
English abstract

This paper studies inequality in digital participation across socioeconomic and demographic groups using the 2020 Canadian Internet Use Survey (CIUS). We combine survey-weighted logistic Lasso, an exact Shapley decomposition of age--education gaps, a sequential logit, and a bifactor item response theory (IRT) measure of digital literacy to identify who is excluded, why gaps persist, and where along the adoption path they arise. Education is the only determinant that remains significant at every rung of the digital ladder. Income inequality is most pronounced for virtual-wallet adoption; for online banking, employment and education together account for nearly half of the pro-rich concentration, indicating a broad socioeconomic gradient rather than a purely income-based divide. Persons with disabilities face the largest penalty at the digital-payments stage rather than at online banking, pointing to accessibility gaps in retail payment interfaces. Conditioning on digital literacy eliminates the education gradient at internet entry and reduces it by 61% at the online banking rung, but a substantial residual persists, pointing to behavioral and institutional frictions beyond measurable competence. The youngest cohort records the lowest information-seeking score despite high digital engagement, and security deficits are concentrated among landed immigrants and visible minorities.

1912.13213 2026-04-28 cs.LG math.OC stat.ML

A Modern Introduction to Online Learning

Francesco Orabona

Comments Major update: One new chapter (Online Learning to X); massive tightening of all the math; simplification of the betting algorithm that loses a constant fraction of money; exp-concave functions are now for extended-real-valued functions; new layout for publication; added index

Details
English abstract

In this book, I introduce the basic concepts of Online Learning through the modern view of Online Convex Optimization. Here, online learning refers to the framework of regret minimization under worst-case assumptions. I present first-order and second-order algorithms for online learning with convex losses, in Euclidean and non-Euclidean settings. All the algorithms are clearly presented as instantiations of Online Mirror Descent or Follow-The-Regularized-Leader and their variants. Particular attention is given to the issue of tuning the parameters of the algorithms and learning in unbounded domains, through adaptive and parameter-free online learning algorithms. Non-convex losses are addressed through convex surrogate losses and randomization. The bandit setting is also briefly discussed, touching on the problem of adversarial and stochastic multi-armed bandits. Finally, I also cover advanced topics, including black-box reductions, saddle-point optimization, sequential investment, and non-stationary forms of regret analysis. The book concludes with a selection of applications of online learning to domains far from it, such as generalization theory and concentration inequalities. I tried to maintain an informal, but mathematically serious, tone throughout the book. No prior knowledge of convex analysis is required. Moreover, all the included proofs have been carefully chosen to be as simple and as short as possible. This also means that sometimes I have added one or two additional assumptions, just to simplify the proofs.
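
Regret minimization with a first-order algorithm can be sketched in a few lines. The example below is an illustration, not code from the book: it runs projected online gradient descent (the Euclidean case of Online Mirror Descent) on the interval $[-r, r]$ with squared losses $f_t(x) = (x - y_t)^2$, and compares the cumulative loss with that of the best fixed point in hindsight. The losses, the step size $\eta_t = D / (G\sqrt{t})$, and the function names are assumptions made for this sketch; with this step size the standard analysis gives regret of order $G D \sqrt{T}$, where $D$ is the domain diameter and $G$ bounds the gradients.

```python
import math


def projected_ogd(targets, radius=1.0):
    """Cumulative loss of projected online gradient descent on
    [-radius, radius] for f_t(x) = (x - y_t)^2, assuming |y_t| <= radius."""
    grad_bound = 4.0 * radius  # |2 (x - y)| <= 4 * radius when |x|, |y| <= radius
    diameter = 2.0 * radius
    x = 0.0
    total_loss = 0.0
    for t, y in enumerate(targets, start=1):
        total_loss += (x - y) ** 2                # loss is suffered at the current point
        grad = 2.0 * (x - y)
        eta = diameter / (grad_bound * math.sqrt(t))
        x = max(-radius, min(radius, x - eta * grad))  # gradient step, then projection
    return total_loss


def best_fixed_loss(targets, radius=1.0):
    """Cumulative loss of the best fixed point in hindsight (the clipped mean)."""
    mean = sum(targets) / len(targets)
    u = max(-radius, min(radius, mean))
    return sum((u - y) ** 2 for y in targets)
```

The regret is `projected_ogd(targets) - best_fixed_loss(targets)`; on an alternating sequence of targets in $\{-1, +1\}$ it grows sublinearly in the number of rounds.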

2604.24172 2026-04-28 stat.ML cs.LG stat.ME

A Divergence-Based Method for Weighting and Averaging Model Predictions

Olav Benjamin Vassend

Comments Accepted at AISTATS 2026

Details
English abstract

This paper uses a minimum divergence framework to introduce a new way of calculating model weights that can be used to average probabilistic predictions from statistical and machine learning models. The method is general and can be applied regardless of whether the models under consideration are fit to data using frequentist, Bayesian, or some other fitting method. The proposed method is motivated in two different ways and is shown empirically to perform better than or on a par with standard model averaging methods, including model stacking and model averaging that relies on Akaike-style negative exponentiated model weighting, especially when the sample size is small. Our theoretical analysis explains why the method has a small-sample advantage.
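
For orientation, the Akaike-style baseline mentioned in this abstract can be sketched as follows. This is the classical weighting scheme, not the divergence-based method the paper proposes: each model receives weight proportional to $\exp(-\Delta_i / 2)$, where $\Delta_i$ is its AIC minus the smallest AIC among the candidates. The function names and the numbers in the usage note are hypothetical.

```python
import math


def akaike_weights(aic_values):
    """Akaike weights: w_i proportional to exp(-(AIC_i - AIC_min) / 2)."""
    best = min(aic_values)
    raw = [math.exp(-0.5 * (a - best)) for a in aic_values]
    total = sum(raw)
    return [r / total for r in raw]


def average_predictions(weights, predictions):
    """Weighted average of per-model probability vectors over the same outcomes."""
    n_outcomes = len(predictions[0])
    return [
        sum(w * p[k] for w, p in zip(weights, predictions))
        for k in range(n_outcomes)
    ]
```

For example, `akaike_weights([100.0, 102.0, 110.0])` gives the second model a weight that is `exp(-1)` times the first model's weight, and `average_predictions` then mixes the models' predictive distributions with those weights.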

2604.24161 2026-04-28 quant-ph cs.IT math.IT stat.CO

Quantum Prediction of Transport Dynamics in Discretized State Spaces

Felix Govaers

Comments Submitted to IEEE Transactions on Quantum Engineering on April 9, 2026

Details
English abstract

We propose a gate-based quantum algorithm for the prediction step of Bayesian state estimation based on the Fokker-Planck equation on a discretized position-velocity state space. The probability density is encoded in the amplitudes of a quantum state, enabling a compact representation of high-dimensional distributions. Exploiting the circulant structure of finite-difference operators, the evolution is realized in the spectral domain using quantum Fourier transforms and phase rotations. A key result is that the drift component can be implemented exactly in amplitude space, leading to an accurate reproduction of the classical transport dynamics. In contrast, the diffusion term does not admit a linear representation in amplitude space due to the nonlinear relation between probability density and wave function. To enable a quantum implementation, we introduce a unitary surrogate based on a Wick rotation, transforming diffusion into a dispersive phase evolution. This yields a fully unitary propagation that can be implemented efficiently on a gate-based quantum computer. The proposed method is evaluated numerically for different scenarios and shows strong agreement with the exact solution of the Fokker-Planck equation. The approach demonstrates the potential of quantum computing for Bayesian state estimation, as the representable state space grows exponentially with the number of qubits. This allows the efficient representation and propagation of probability densities that would otherwise require complex tensor decompositions on classical hardware, making the method a promising candidate for high-dimensional filtering problems.
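
The circulant structure that this abstract exploits has a simple classical counterpart, sketched below. The example is not the paper's quantum algorithm: it only checks, for a periodic second-difference stencil chosen as an assumption, that applying the circulant operator directly agrees with multiplying by its eigenvalues $2\cos(2\pi k / n) - 2$ in the discrete Fourier domain. The naive $O(n^2)$ transform and all function names are illustrative.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive discrete Fourier transform (O(n^2)), forward or inverse."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = sum(
            x[m] * cmath.exp(sign * 2j * math.pi * m * k / n) for m in range(n)
        )
        out.append(acc / n if inverse else acc)
    return out


def periodic_laplacian(u):
    """Circulant second-difference stencil [1, -2, 1] with periodic wrap-around."""
    n = len(u)
    return [u[(m - 1) % n] - 2.0 * u[m] + u[(m + 1) % n] for m in range(n)]


def periodic_laplacian_spectral(u):
    """Same operator applied as a diagonal multiplication in Fourier space."""
    n = len(u)
    # Eigenvalue of the stencil on Fourier mode k: 2 cos(2 pi k / n) - 2
    eig = [2.0 * math.cos(2.0 * math.pi * k / n) - 2.0 for k in range(n)]
    u_hat = dft(u)
    scaled = [e * c for e, c in zip(eig, u_hat)]
    return [z.real for z in dft(scaled, inverse=True)]
```

Both functions return the same vector up to floating-point error, which is the property that lets a Fourier transform followed by per-mode multiplications stand in for the finite-difference operator.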

2604.24056 2026-04-28 stat.ME

Bi-Gaussian Mirrors for False Discovery Rate Control

Yujia Wu, Panxu Yuan, Binyan Jiang

Details
English abstract

Effectively controlling the false discovery rate (FDR) in high-dimensional variable selection is a fundamental statistical problem that has garnered significant research interest. In this paper, we propose a novel, user-friendly, and computationally efficient method called Bi-Gaussian Mirrors (BGM), which offers a conceptually simple yet powerful approach for FDR control. Our method makes the first attempt to achieve FDR control in high-dimensional data with complex dependencies, while overcoming key limitations of existing approaches, such as prior knowledge of the joint distribution of data, significant power loss, the need for full symmetry in test statistics, and the theoretical restriction to linear regression models. Additionally, we present a self-guiding procedure designed to enhance the practicality and applicability of the BGM method. Theoretical guarantees for FDR control and asymptotic power are rigorously established under regularity conditions. Moreover, extensive numerical simulations and two real-data examples demonstrate that the BGM method outperforms existing approaches in terms of finite-sample performance, achieving a superior balance between FDR control and testing power.

2604.24017 2026-04-28 stat.ME math.ST stat.TH

Neyman Jackknife: Design-Based Variance Estimation for Causal Inference under Interference

Bryan Park, Stefan Wager

Details
English abstract

We propose a framework, the Neyman Jackknife, for conservative variance estimation in finite-population causal inference under interference. Our approach provides a general, flexible blueprint that enables conservative variance estimation whenever we are able to recompute our target estimator with some treatment assignments omitted. In classical settings, our approach recovers estimators closely related to the Neyman estimator under SUTVA and the Newey-West HAC variance estimator for time series. Numerical experiments suggest that our general-purpose framework yields variance estimators that can match or even surpass the performance of baselines that were purpose-built for specific applications.

2604.24010 2026-04-28 stat.ME eess.SP

Efficient Implementations of Extended Object PMBM Filters with Blocked Gibbs Sampling

Yuxuan Xia, Ángel F. García-Fernández, Lennart Svensson

Comments Submitted to IEEE T-AES

Details
English abstract

This paper considers multiple extended object tracking based on Poisson multi-Bernoulli mixture (PMBM) filtering, which gives the closed-form Bayesian solution for standard multiple extended object models with Poisson birth. To efficiently address the challenging extended object data association problem in PMBM filtering, we develop implementations of the extended object PMBM filter using blocked Gibbs sampling. By formulating the PMBM density on an augmented state space with auxiliary variables and leveraging the Poisson object measurement model, we first derive a joint posterior over potential objects, previous global hypotheses, and current measurement association variables, together with its corresponding factorization. This factorized representation leads to blocked Gibbs samplers that efficiently generate high-weight global hypotheses and thereby provide an efficient implementation of the PMBM update step. We further introduce a collapsed Gibbs sampling variant, in which the Bernoulli object existence variables are marginalized out, yielding higher sampling efficiency, especially for the initiation of newly detected objects. The proposed methods, implemented under the gamma Gaussian inverse-Wishart model, are compared with an extended object Poisson multi-Bernoulli filter based on particle belief propagation. Simulation results demonstrate that the proposed approaches achieve comparable tracking performance while requiring substantially less runtime.

2604.24000 2026-04-28 eess.IV cs.CV cs.MM stat.AP

Shared-kernel Wavelet Neural Networks for Poisson Image Reconstruction

Yuanhao Gong, Tan Tang, Qianyan Liu

Details
English abstract

The Laplacian operator transforms the image into its Laplacian field, which usually is sparse and satisfies a stable distribution. On the other hand, an image can be uniquely reconstructed from its Laplacian field via solving a Poisson equation with a proper boundary condition. Such uniqueness is mathematically guaranteed. Thanks to these properties, we propose to use the sparse Laplacian field to represent the image. We first show that the Laplacian field is sparse and satisfies a stable distribution on hundreds of images. Then, we show that the image can be accurately reconstructed from its Laplacian field. For the reconstruction task, we propose a shared-kernel wavelet neural network, which solves the Poisson equation and has three advantages. First, it has fewer than 0.0002M parameters, which is compact enough for most devices. Second, it has linear computational complexity, leading to real-time reconstruction. Third, it achieves higher accuracy than previous methods. Several numerical experiments are conducted to show the effectiveness and efficiency of the sparse Laplacian field and the proposed Poisson solver. The proposed method can be applied in a wide range of applications such as image compression, low light enhancement, object tracking, etc.

2604.23983 2026-04-28 math.ST math.PR q-fin.RM stat.ME stat.TH

A Geometric Witness Framework for Signed Multivariate Tail-Dependence Compatibility: Asymptotic Structure and Finite-Threshold Synthesis

Janusz Milek

Comments 47 pages, 4 figures, 3 tables; includes a Python implementation appendix

Details
English abstract

We study multivariate tail-dependence compatibility for complete and partial signed tail families, treating lower-tail, upper-tail, and mixed configurations in one geometric witness representation indexed by active coordinate sets and sign patterns. For a complete signed tail family, witness generator weights w = (w_{I,sigma}) give a linear incidence parametrization and are recovered by explicit triangular inversion. Excluding the geometric scale p0, the complete case uses 3^d - 1 generator weights, matching the number of complete signed tail coefficients; for partial specifications, only selected target coefficients need be prescribed. At a fixed threshold p0 in (0, 1/2), the inversion identifies the normalized noncentral ternary cell masses of any realizing copula. Hence finite-threshold compatibility is characterized by nonnegative recovered generator weights, singleton normalization, and the residual central-mass constraint. This yields a complete Moebius-type synthesis within the witness framework. If the recovered increments are nonnegative and singleton normalization holds, then S(w) = sum(w) determines the admissible finite-scale range, and every admissible p0 gives an exact witness realization. In the canonical ray geometry, such a realization preserves the same complete signed tail family throughout 0 < p <= p0. Thus the primary object is the complete signed tail family lambda: it is realized at every admissible finite scale and can be carried along families of witness copulas with p0 decreasing to 0. Partial, noisy, or inconsistent specifications are treated through linear-feasibility and weighted-l1 recovery problems in the same parametrization. The representation separates the p0-free incidence/Moebius layer from finite-threshold realization and provides tools for realization, simulation, calibration, completion, repair, and scenario design.

2604.23968 2026-04-28 cs.LG cs.AI stat.ML

DecompKAN: Decomposed Patch-KAN for Long-Term Time Series Forecasting

Naveen Mysore

Comments 15 pages, 6 figures, 8 tables. Preprint; under review

Details
English abstract

Accurate time series forecasting in scientific domains such as climate modeling, physiological monitoring, and energy systems benefits from both competitive predictions and model transparency. This work proposes DecompKAN, a lightweight attention-free architecture that combines trend-residual decomposition, channel-wise patching, learned instance normalization, and B-spline Kolmogorov-Arnold Network (KAN) edge functions. Each KAN edge learns an explicit, inspectable 1D scalar function over learned patch-embedding coordinates that can be directly visualized. On standard benchmarks, DecompKAN achieves best or tied-best MSE on 15 of 32 dataset-horizon combinations among selected published baselines, and achieves best or tied-best MSE on 20 of 36 comparisons under a controlled same-recipe evaluation across 9 datasets including the physiological PPG-DaLiA benchmark. The architecture shows particular strength on datasets with smooth temporal dynamics (Solar -17%, ECL -10% vs. iTransformer, Weather) and physiological time series. Visualization of learned edge functions reveals qualitatively different latent nonlinearities across domains. Ablation analysis shows that the architectural pipeline (decomposition, patching, normalization) drives performance more than the choice of nonlinear layer, while the KAN formulation enables inspection of learned latent transformations.

2604.23961 2026-04-28 stat.AP q-fin.MF q-fin.TR

Extended State-dependent Hawkes Process for Limit Order Books: Mathematical Foundation and the Reproduction of Volatility Signature Plots

Akitoshi Kimura

Comments 20 pages, 8 figures. This work was supported by JSPS KAKENHI Grant Number JP20K14366 and CREST, JST

Details
English abstract

This paper proposes an Extended State-Dependent Hawkes Process (ExsdHawkes) to model the intricate dynamics of Limit Order Books (LOBs). Our theoretical contribution lies in relaxing traditional constraints by allowing for state disappearances -- a phenomenon frequently observed in high-frequency trading. We mathematically prove, using Karush--Kuhn--Tucker (KKT) conditions, that the maximum likelihood estimation remains separable, justifying an efficient two-step procedure. In the empirical section, we apply our model to three months of high-frequency tick data of Mitsubishi UFJ Financial Group (8306). We demonstrate that ExsdHawkes uniquely reproduces the volatility signature plot's characteristic upward slope by capturing the "local super-criticality" triggered during disequilibrium states. Crucially, we identify Marketable Limit Orders (MLO) as the primary catalyst that forces the LOB into these unstable states. Comparative analysis reveals that models lacking physical constraints (e.g., standard SD-Hawkes) suffer from explosive branching ratios and fail to maintain simulation stability. Our findings suggest that physical consistency is not merely a mathematical nicety, but a prerequisite for accurately modeling macro-level volatility. By enforcing the physical geometry to "pause" the residual accumulation during inadmissible periods, ExsdHawkes uniquely maintains statistical integrity where unconstrained models succumb to structural bias.

2604.23930 2026-04-28 stat.ME cs.LG stat.ML

Nearly Optimal Subdata Selection

Min Yang, Wei Zheng, John Stufken, Ming-Chung Chang, Ting Tian, Xueqin Wang

Details
English abstract

When, in terms of the number of data points, the size of a dataset exceeds available computing resources, or when labeling is expensive, an attractive solution consists of selecting only some of the data points (subdata) for further consideration. A central question for selecting subdata of size $n$ from $N$ available data points is which $n$ points to select. While an answer to this question depends on the objective, one approach for a parametric model and a focus on parameter estimation is to select subdata that retains maximal information. Identifying such subdata is a classical NP-hard problem due to its inherent discreteness. Based on optimal approximate design theory, we develop a new methodology for information-based subdata selection, resulting in subdata that approaches the optimal solution. To achieve this, we develop a novel algorithm that applies to a general model, accommodates arbitrary choices of $N$ and $n$, and supports multiple optimality criteria, and we prove its convergence. Moreover, the new methodology facilitates an assessment of the efficiency of subdata selected by any method by obtaining tight lower and upper bounds for the efficiency. We show that the subdata obtained through the new methodology is highly efficient and outperforms all existing methods.

2604.23928 2026-04-28 math.ST math.PR stat.TH

Wasserstein convergence rates for empirical measures of point processes

Dongzhou Huang, Tianyi Jiang, Haonan Wang

Details
English abstract

In this paper, we establish sharp upper and lower bounds on the convergence rate of the empirical measures of point processes under the Wasserstein distance. To this end, we first introduce a new metric on the space of counting measures and, based on this metric, define a Wasserstein distance between point processes. We then employ it to study the convergence rate of the empirical measures of point processes, which serves as a natural tool for identifying the distribution of the underlying point process. Furthermore, we derive concentration results. These theoretical results provide constructive tools for hypothesis testing and statistical inference for point processes. The applicability of our results is demonstrated through several practical examples.

2604.23917 2026-04-28 stat.ME

MR-CCC: Bayesian Mendelian Randomization for Causal Cell--Cell Communication

Bitan Sarkar, Yang Ni

Details
English abstract

Cell--cell communication (CCC) is commonly inferred from ligand--receptor co-expression, an associational paradigm that cannot distinguish causal signaling from shared regulation or confounding. We propose MR-CCC, a Bayesian Mendelian randomization framework that uses cis-eQTLs as instruments for ligand and receptor expression and explicitly models receptor-modulated ligand effects through an interaction term, so the causal effect of a ligand can vary with receptor abundance. A spike--and--slab prior yields posterior inclusion probabilities quantifying evidence for causal signaling, and an efficient Gibbs sampler provides scalable inference. Benchmarked against naive regression, MVMR, and MR-BMA, MR-CCC controls false discoveries under confounding while retaining high power, and uniquely estimates both the ligand main and receptor-modulated interaction effects. Applied to the OneK1K NK cells $\to$ monocytes axis, MR-CCC identifies eight discoveries across GABA, interferon, interleukin, and prostaglandin signaling, including a stoichiometry-dependent dissociation of the two IL-18 receptor chains and co-discovery of both obligate IFN-$γ$ receptor subunits.

2604.23912 2026-04-28 cs.LG stat.ML

Gromov-Wasserstein Methods for Multi-View Relational Embedding and Clustering

Rafael Pereira Eufrazio, Eduardo Fernandes Montesuma, Charles Casimiro Cavalcante

Comments This manuscript is currently under review at the XLIV Simposio Brasileiro de Telecomunicacoes e Processamento de Sinais - SBrT (Brazilian Symposium on Telecommunications and Signal Processing) 2026

Details
English abstract

Learning low-dimensional representations from multi-view relational data is challenging when underlying geometries differ across views. We propose Bary-GWMDS, a Gromov-Wasserstein-based method that operates directly on distance matrices to learn a consensus embedding preserving shared relational structure. By leveraging intrinsic distances, the approach naturally handles nonlinear distortions across views. We also introduce Mean-GWMDS-C, a clustering-oriented formulation that averages distance matrices and learns reduced-support representations via a consensus Gromov-Wasserstein transport. Experiments on synthetic and real-world datasets show that the proposed framework yields stable and geometrically meaningful embeddings.

2604.23866 2026-04-28 stat.ME

A Review of Methods and Practices for Missing Data in Sequential Multiple Assignment Randomized Trials (SMARTs): An Ancillary Study of a Scoping Review

Nikki L. B. Freeman, Chenyao Yu, Margaret Hoch, Sydney Browder, Bradley G. Hammill, Avi Kenny, Kevin J. Anstrom, Michael R. Kosorok

Comments 19 pages, 4 tables, 1 figure

Details
English abstract

Background: Missing data poses an acute threat to sequential multiple assignment randomized trial (SMART) analyses because of the sequential treatment structure and response-dependent re-randomization. Objectives: This study aimed to (1) review the current statistical methods for handling missing data in SMARTs, and (2) characterize how missing data is reported and handled in published SMARTs. Methods: We conducted a narrative review of statistical methods developed for missing data in SMARTs. Additionally, we conducted a pre-specified secondary extraction of a previously published scoping review of SMARTs focused on missing data. Extraction captured attrition rates, methods for handling missingness, and planned versus performed missing data analyses. Results: Seven methodological papers were identified; nearly all assume missing at random (MAR), and only one addresses the full set of SMART-specific missingness types. Across 30 published SMARTs, median overall attrition was 18.1% (range 0.6%-56.5%). Methods used to address missing data were described in 80% of the manuscripts; mixed-model methods were most common (30%). Among 14 studies with paired protocols, sensitivity analyses were pre-specified in 2 (14%). Conclusions: SMART-specific methodology for missing data is limited, and a substantial gap exists between available methodology and current SMART practice.

2604.23851 2026-04-28 stat.ME math.ST stat.TH

Bayesian change-plane regression

Yuki Ohnishi, Fan Li

Details
English abstract

Change-plane regression identifies subpopulations through an interpretable linear threshold rule, but likelihood-based inference for the hard-threshold boundary is nonregular: objectives are non-smooth, the boundary is weakly identified under no heterogeneity, and standard large-sample approximations are fragile. We develop a new Bayesian inferential framework based on a probit-gated working likelihood -- a computationally regular surrogate that is deliberately misspecified for any fixed smoothing scale. For fixed smoothing, posterior summaries are therefore interpreted for a well-defined smoothed pseudo-true target; inference for the hard-threshold target is recovered only in a vanishing-smoothing regime, where approximation bias is governed by a boundary-margin condition on the covariate distribution. The resulting theory adapts misspecified Bernstein--von Mises arguments to Bayesian change-plane regression and makes explicit the triangular-array trade-off created by sending the smoothing scale to zero: sharper gates worsen the derivative bounds needed for Gaussian approximation, while approximation bias decreases according to the local amount of covariate mass near the boundary. Building on the resulting joint posterior, we further propose a decision-theoretic reporting protocol that separates evidence for clinically meaningful heterogeneity from the reporting of a subgroup boundary, with boundary uncertainty propagated to the covariate level through posterior membership probabilities. Simulations show favorable accuracy and uncertainty quantification of our new methods relative to the frequentist counterpart, and an application to a randomized lifestyle-intervention trial further demonstrates the utility of Bayesian change-plane regression in understanding treatment effect heterogeneity.

2604.23847 2026-04-28 stat.ME

Privacy-preserving Meta-analysis through Low-Rank Basis Hunting

Wenqi Shi, Kosuke Imai, Yi Zhang

Details
English abstract

A central challenge of meta-analysis is that the populations underlying existing studies often differ from the target population in unknown ways. We study the problem of predicting function-valued quantities, such as regression and conditional average treatment effect functions, for a new target population using only study-level covariates and estimates. We propose MetaHunt, a new meta-analysis methodology based on a shared low-rank structure, in which the true function from each study lies within the convex hull of a small set of latent basis functions. To recover these basis functions, we extend the Successive Projection Algorithm to the functional setting, incorporating a denoised basis-hunting step. We establish consistency of the recovered basis functions under mild regularity conditions. We then model the relationship between study-level covariates and the corresponding mixing weights using flexible semi-parametric or non-parametric methods. MetaHunt is privacy-preserving and enables meta-analytic prediction based on study-level information alone, even when individual-level data are unavailable to analysts. In addition, for each study, functions of interest can be estimated using possibly different machine learning algorithms. For uncertainty quantification, we construct prediction intervals via conformal prediction. We show that, under exchangeability and mild estimation-error conditions, these intervals achieve asymptotically valid marginal coverage. We demonstrate the effectiveness of MetaHunt through both simulation studies and empirical applications.
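The Successive Projection Algorithm mentioned in the abstract can be illustrated in its plain finite-dimensional form. The sketch below is not the authors' functional extension and has no denoising step. It assumes the basis elements appear among the observed points, so that each convex-hull vertex is itself a data point; the function name `successive_projection` is hypothetical.

```python
import math

def successive_projection(points, r):
    """Return indices of r points picked by the classical SPA.

    Each round picks the point with the largest residual norm, then
    projects every residual onto the orthogonal complement of that point.
    """
    residual = [list(p) for p in points]  # working copy, modified in place
    chosen = []
    for _ in range(r):
        norms = [sum(x * x for x in p) for p in residual]
        j = max(range(len(residual)), key=norms.__getitem__)
        if norms[j] == 0.0:
            break  # nothing left to extract
        chosen.append(j)
        scale = math.sqrt(norms[j])
        u = [x / scale for x in residual[j]]  # unit vector of the picked point
        for p in residual:
            dot = sum(a * b for a, b in zip(p, u))
            for k in range(len(p)):
                p[k] -= dot * u[k]
    return chosen

# Three vertices and two convex combinations of them, in shuffled order.
v0, v1, v2 = (3.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 1.0)
mix_a = tuple(0.5 * a + 0.5 * b for a, b in zip(v0, v1))
mix_b = tuple((a + b + c) / 3.0 for a, b, c in zip(v0, v1, v2))
data = [mix_a, v1, mix_b, v0, v2]
vertex_indices = successive_projection(data, 3)
```

Because the Euclidean norm is convex, its maximum over a convex hull is attained at a vertex, which is why the greedy pick lands on the vertices here.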

2604.23834 2026-04-28 stat.ME stat.AP

Beyond the mean: Sequence analysis methods for clustering ordinal EMA data

Tianyi Wang, Anna L. Smith, Jillian R. Silva-Jones, Wendy Berry Mendes, Lauren N. Whitehurst

Comments 22 pages, 11 figures, 7 tables

Details
English abstract

Ecological momentary assessment (EMA) ratings are widely used in studies of behavioral and psychological phenomena to capture real-time data in subjects' real-world environments. Because the data are collected repeatedly over the study period, they provide rich longitudinal rating profiles for each individual. However, the number of observations per subject is often large, while both sample size and sampling intensity can vary substantially across individuals, which complicates the analysis. In some settings, simplified summaries of individual profiles, such as averages computed across the study period, are used for downstream analyses, including regression-style modeling. Although such summaries can be convenient, they may fail to fully capture dynamic temporal patterns present in the complete longitudinal profiles. To address this, we borrow measures from sequence analysis that capture individual-level patterns over time and then apply principal component analysis (PCA) followed by $K$-means clustering to identify unobserved latent groups of individuals with similar profiles. We test our approach using simulated data from a categorical functional regression model and compare its performance with two commonly used methods for detecting unobserved group structures: latent class analysis (LCA) and latent transition analysis (LTA). Using EMA stress observations from a large sample of U.S. adults (Newman et al., 2024, 2025), we identify distinct latent stress profile groups and show that they improve characterization of the impact on cognitive performance.

2604.23828 2026-04-28 math.PR math.ST stat.CO stat.TH

Kac's walk on rotation matrices mixes in $n^2 \log n$ steps

Natesh S. Pillai, Aaron Smith

Details
English abstract

Kac's walk on the rotation group, introduced by Hastings in 1970, is an important high-dimensional Markov chain with applications in statistical physics, statistics, cryptography, and computational science. Despite its simple transition rules, determining its total-variation mixing time has remained a challenging problem for decades. A key obstacle is that the walk is not conjugation-invariant, placing it beyond the reach of classical Fourier-analytic techniques that apply to many related random walks on compact groups. We prove that Kac's walk mixes in total variation in $O(n^2 \log n)$ steps, matching the conjectured mixing time up to constants. The proof is based on a refined two-stage coupling. Building on earlier work, the first stage contracts two copies of the chain to a small neighborhood via a Wasserstein coupling. Our main contribution is a new framework for analyzing the second-stage coupling. It can be viewed as a discrete analogue of Malliavin calculus for Markov chains. We represent the law of the chain as the pushforward of high-dimensional noise and prove quantitative non-degeneracy of the associated linearization using matrix martingale methods. This yields an approximately Gaussian distribution in the Lie algebra with well-conditioned covariance, allowing small group translations to be absorbed at negligible cost in total variation. Our approach provides a general framework for studying mixing in high-dimensional Markov chains in continuous state spaces with singular transition kernels.
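For readers unfamiliar with the "simple transition rules" of Kac's walk, here is a minimal sketch of one common formulation: pick two distinct coordinates uniformly at random, pick an angle uniformly on $[0, 2\pi)$, and rotate in that coordinate plane. The helper names are hypothetical, and the sketch only simulates the chain. It says nothing about the mixing-time proof.

```python
import math
import random

def identity(n):
    """n x n identity matrix as a list of rows."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def kac_step(X, rng):
    """Apply one step of Kac's walk to the square matrix X, in place."""
    n = len(X)
    i, j = rng.sample(range(n), 2)            # two distinct coordinates
    theta = rng.uniform(0.0, 2.0 * math.pi)   # uniform rotation angle
    c, s = math.cos(theta), math.sin(theta)
    # Left-multiply X by the planar rotation acting on rows i and j.
    for k in range(n):
        xi, xj = X[i][k], X[j][k]
        X[i][k] = c * xi - s * xj
        X[j][k] = s * xi + c * xj

def orthogonality_error(X):
    """Largest absolute entry of X X^T - I."""
    n = len(X)
    worst = 0.0
    for a in range(n):
        for b in range(n):
            dot = sum(X[a][k] * X[b][k] for k in range(n))
            worst = max(worst, abs(dot - (1.0 if a == b else 0.0)))
    return worst

rng = random.Random(0)
X = identity(4)
for _ in range(200):
    kac_step(X, rng)
```

Each step touches only two rows, so the transition kernel is singular with respect to Haar measure, yet every iterate stays on the rotation group up to floating-point error.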

2604.23800 2026-04-28 cs.LG stat.ML

Causal Representation Learning from General Environments under Nonparametric Mixing

Ignavier Ng, Shaoan Xie, Xinshuai Dong, Peter Spirtes, Kun Zhang

Comments Accepted to AISTATS 2025. This is a slightly revised version of the published paper

Details
English abstract

Causal representation learning aims to recover the latent causal variables and their causal relations, typically represented by directed acyclic graphs (DAGs), from low-level observations such as image pixels. A prevailing line of research exploits multiple environments but relies on assumptions about how data distributions change, such as single-node, coupled, or hard interventions, or on parametric constraints on the mixing function or the latent causal model, such as linearity. Despite the novelty and elegance of these results, such assumptions are often violated in real problems. Accordingly, we formalize a set of desiderata for causal representation learning that applies to a broader class of environments, referred to as general environments. Interestingly, we show that one can fully recover the latent DAG and identify the latent variables up to minor indeterminacies under a nonparametric mixing function and nonlinear latent causal models, such as additive (Gaussian) noise models or heteroscedastic noise models, by properly leveraging sufficient change conditions on the causal mechanisms up to third-order derivatives. These represent, to our knowledge, the first results to fully recover the latent DAG from general environments under nonparametric mixing. Notably, our results match or improve upon many existing works, but require less restrictive assumptions about changing environments.

2604.23797 2026-04-28 physics.optics physics.app-ph quant-ph stat.OT

From Random Fringes to Deterministic Response: Statistical Foundations of Time-Reversed Young Interferometry

Jianming Wen

Details
English abstract

Young interference is usually read as the gradual statistical accumulation of random detection events. Here we show that a time-reversed Young (TRY) geometry has a different statistical character: the fringe is not a marginal distribution of detector positions, but a conditional response indexed by a programmed source coordinate. With a fixed detector and a scanned source basis, the observable is an operational hybrid correlator between detector signal and source label. The resulting interference is deterministic at the response-function level, while noise enters only through estimation precision. We formulate this distinction using Fisher information, estimator variance, and noise scaling, clarifying why TRY naturally supports calibration, lock-in readout, null-fringe sensing, and source-plane superresolution.

2604.23790 2026-04-28 cs.LG stat.ML

A General Representation-Based Approach to Multi-Source Domain Adaptation

Ignavier Ng, Yan Li, Zijian Li, Yujia Zheng, Guangyi Chen, Kun Zhang

Comments ICML 2025

Details
English abstract

A central problem in unsupervised domain adaptation is determining what to transfer from labeled source domains to an unlabeled target domain. To handle high-dimensional observations (e.g., images), a line of approaches uses deep learning to learn latent representations of the observations, which facilitate knowledge transfer in the latent space. However, existing approaches often rely on restrictive assumptions to establish identifiability of the joint distribution in the target domain, such as independent latent variables or invariant label distributions, limiting their real-world applicability. In this work, we propose a general domain adaptation framework that learns compact latent representations to capture distribution shifts relative to the prediction task and addresses the fundamental question of what representations should be learned and transferred. Notably, we first demonstrate that learning representations based on all the predictive information, i.e., the label's Markov blanket in terms of the learned representations, is often underspecified in general settings. Instead, we show that, interestingly, general domain adaptation can be achieved by partitioning the representations of the Markov blanket into those of the label's parents, children, and spouses. Moreover, its identifiability guarantee can be established. Building on these theoretical insights, we develop a practical, nonparametric approach for domain adaptation in a general setting, which can handle different types of distribution shifts.

2604.23770 2026-04-28 econ.EM stat.ML

Bootstrapping with AI/ML-generated labels

Timothy Christensen, Silvia Goncalves, Benoit Perron

Details
English abstract

AI/ML methods are increasingly used in economics to generate binary variables (or labels) via classification algorithms. When these generated variables are included as covariates in regressions, even small misclassification errors can induce large biases in OLS estimators and invalidate standard inference. We study whether the bootstrap can correct this bias and deliver valid inference. We first show that a seemingly natural fixed-label bootstrap, which generates data using estimated labels but relies on a corrupted version in estimation, is generally invalid unless a strong independence condition between the latent true labels and other covariates holds. We then propose a coupled-label bootstrap that jointly resamples the true and imputed labels, and show it is valid without this condition. Two finite-sample adjustments further improve coverage: a variance correction for uncertainty in estimated misclassification rates and a Hessian rotation for near-singular designs. We illustrate the methods in simulations and apply them to investigate the relationship between wages and remote work status.
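The claim that small misclassification errors can bias OLS is easy to see in the simplest case of a single binary regressor. The simulation below is only an illustration of that bias under assumptions chosen here: a balanced true label, symmetric label flips at rate 10%, and standard normal noise. It does not implement either bootstrap from the abstract, and the function names are hypothetical.

```python
import random

def slope_on_binary(y, d):
    """OLS slope of y on a binary regressor d (with intercept).

    With one binary regressor this equals the difference in group means.
    """
    ones = [yi for yi, di in zip(y, d) if di == 1]
    zeros = [yi for yi, di in zip(y, d) if di == 0]
    return sum(ones) / len(ones) - sum(zeros) / len(zeros)

def simulate(n=50_000, beta=1.0, flip=0.10, seed=0):
    """Return (slope using true labels, slope using misclassified labels)."""
    rng = random.Random(seed)
    d_true = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
    y = [beta * d + rng.gauss(0.0, 1.0) for d in d_true]
    # Each generated label disagrees with the truth with probability `flip`.
    d_hat = [1 - d if rng.random() < flip else d for d in d_true]
    return slope_on_binary(y, d_true), slope_on_binary(y, d_hat)

clean_slope, noisy_slope = simulate()
```

In this balanced, symmetric setup the slope on the generated labels converges to $\beta(1 - 2 \cdot \text{flip})$, so a 10% error rate shrinks the estimate by about 20%.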

2604.23755 2026-04-28 stat.AP

Sparse Reduced-rank Regression Methods for Spatially Misaligned Data with Application to Spatial Transcriptomics

Zitian Wu, Susmita Datta, Arkaprava Roy

Comments 35 pages, 4 figures, 2 tables

Details
English abstract

Understanding the spatiotemporal dynamics of disease progression in relation to transcriptomic profiles provides key insights into complex conditions such as Alzheimer disease. To enable such investigations, STARmap PLUS technology offers joint profiling of high-resolution spatial transcriptomics and protein detection within the same tissue section. Motivated by data from Zeng et al. (2023), we develop a novel kernel-weighted regression framework that models plaque size as a collective effect of the spatial transcriptomics of neighboring cells, automatically integrating across cell types and tissue samples from different disease states. To further strengthen interpretability and efficiency, we incorporate a sparse low-rank factorization that enables gene selection while borrowing strength across genes, cell types, and time points. The proposed approach is implemented in a fully automated manner with data-driven specification of key model components. Through simulation studies, we demonstrate the robustness of the proposed method and its superiority across a range of specification scenarios. Applied to Alzheimer disease data, the proposed framework uncovers biologically meaningful associations, highlighting its potential for advancing the understanding of disease mechanisms.

2604.23744 2026-04-28 stat.AP stat.OT

How temperature regimes near the equinox synchronize spring biological events

Jonathan Auerbach, Andrew Gelman, E. M. Wolkovich

Details
English abstract

Many biological processes, including plant leafout and flowering, occur once cumulative temperatures reach a threshold (the thermal-sum model). In this way, temperatures are thought to coordinate the timing of biological events. But growing evidence suggests that as climates warm, both the advancement of spring has slowed (declining sensitivity) and the variance in the timing of spring events has increased (declining synchrony), raising questions about the resilience of temperature-based coordination to anthropogenic climate change. To answer these questions, researchers have complicated the thermal-sum model, introducing additional factors and mechanisms. We consider whether such complexity is necessary. Using results from the theory of stopped random walks, we show that sensitivity and synchrony are exactly as predicted by the basic thermal-sum model. The theory suggests a nonlinear relationship between temperatures and both the timing and synchrony of biological events. In particular, it predicts that as temperatures increase and springtime events shift from the equinox toward the solstice, the events themselves become less coordinated and more variable. We verify these predictions using experimental and real-world data, including 10,000 observations of common lilacs (United States, 1956-2025). We conclude that the theory provides a powerful tool for understanding the thermal-sum model, particularly when considering additional complexity.
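The basic thermal-sum rule described in the abstract, where an event occurs once cumulative temperature reaches a threshold, can be written as a first-passage computation. The sketch below is a minimal illustration; the `base` temperature below which days contribute nothing is an assumption added here (as in growing-degree-day accounting), and the function name is hypothetical.

```python
def event_day(daily_temps, threshold, base=0.0):
    """Return the first day (1-indexed) on which the thermal sum reaches
    `threshold`, or None if it never does.

    Only the part of each day's temperature above `base` is accumulated.
    """
    total = 0.0
    for day, temp in enumerate(daily_temps, start=1):
        total += max(temp - base, 0.0)
        if total >= threshold:
            return day
    return None

cool_spring = [2.0, 3.0, 5.0, 4.0]
warm_spring = [4.0, 5.0, 7.0, 6.0]
cool_day = event_day(cool_spring, threshold=9.0)
warm_day = event_day(warm_spring, threshold=9.0)
```

Treating the daily temperatures as random turns `event_day` into the stopping time of a random walk, which is the object the abstract analyzes.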

2604.23681 2026-04-28 cs.LG cs.CL stat.ML

Rank, Head-Channel Non-Identifiability, and Symmetry Breaking: A Precise Analysis of Representational Collapse in Transformers

Giansalvo Cirrincione

Comments 36 pages, 8 figures, 1 table. Submitted to Artificial Intelligence (Elsevier)

Details
English abstract

A widely cited result by Dong et al. (2021) showed that Transformers built from self-attention alone, without skip connections or feed-forward layers, suffer from rapid rank collapse: all token representations converge to a single direction. The proposed remedy was the MLP. We show that this picture, while correct in the regime studied by Dong, is incomplete in ways that matter for architectural understanding. Three results are established. First, layer normalisation is precisely affine-rank-neutral: it preserves the affine rank of the token representation set exactly. The widespread claim that LN "plays no role" is imprecise; the correct statement is sharper. Second, residual connections generically obstruct rank collapse in real Transformers such as BERT-base, in a measure-theoretic sense, without contribution from the MLP. The MLP's irreplaceable function is different: generating feature directions outside the linear span of the original token embeddings, which no stack of attention layers can produce. Third, a phenomenon distinct from rank collapse is identified: head-channel non-identifiability. After multi-head attention sums per-head outputs through the output projection, individual contributions cannot be canonically attributed to a specific head; n(H-1)d_k degrees of freedom per layer remain ambiguous when recovering a single head from the mixed signal. The MLP cannot remedy this because it acts on the post-summation signal. A constructive partial remedy is proposed: a position-gated output projection (PG-OP) at parameter overhead below 1.6% of the standard output projection. The four collapse phenomena identified in the literature -- rank collapse in depth, in width, head-channel non-identifiability, and entropy collapse -- are unified under a symmetry-breaking framework, each corresponding to a distinct symmetry of the Transformer's forward pass.
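A toy case makes the contrast between attention-only layers and residual layers concrete. The sketch below assumes a degenerate attention pattern chosen for illustration, uniform attention (every token attends equally to all tokens), and compares the rank of the token matrix after one attention-only update with the rank after a residual update. It illustrates the phenomenon only and does not reproduce the analysis in the paper; all function names are hypothetical.

```python
from fractions import Fraction

def rank(matrix):
    """Exact rank via Gauss-Jordan elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in matrix]
    rows, cols = len(m), len(m[0])
    rk = 0
    for c in range(cols):
        pivot = next((r for r in range(rk, rows) if m[r][c] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for r in range(rows):
            if r != rk and m[r][c] != 0:
                f = m[r][c] / m[rk][c]
                m[r] = [a - f * b for a, b in zip(m[r], m[rk])]
        rk += 1
        if rk == rows:
            break
    return rk

def uniform_attention(X):
    """Attention-only update with uniform weights: every row becomes the mean row."""
    n = len(X)
    mean = [sum(Fraction(row[k]) for row in X) / n for k in range(len(X[0]))]
    return [list(mean) for _ in range(n)]

def with_residual(X):
    """Residual update: X + Attention(X)."""
    mixed = uniform_attention(X)
    return [[Fraction(a) + b for a, b in zip(row, mrow)]
            for row, mrow in zip(X, mixed)]

tokens = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
rank_attention_only = rank(uniform_attention(tokens))
rank_residual = rank(with_residual(tokens))
```

Here the residual update multiplies the token matrix by $I + \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top$, which is invertible, so the rank cannot drop.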

2604.23619 2026-04-28 stat.ME math.ST stat.TH

Weak Moment Methods for Statistical Inference: with an Application to Robust Estimation

R. Labouriau

Comments 29 pages, one figure and four tables

Details
English abstract

A companion paper develops a framework in which probability measures are represented by distribution-kernel pairs (T,phi) with T a tempered distribution and phi a Schwartz kernel, so that weak moments of all orders exist unconditionally. The present paper turns this into a methodology for statistical inference: estimation via weak moment matching, weak characteristic functions, weak cumulants, and regularised density reconstruction via Tikhonov inversion. A key feature is that parametric inference proceeds directly from weak expectations without reconstructing the underlying density; reconstruction is an additional route, useful when density-level inference is the goal. The central result is that weak moment estimators are automatically locally robust in the sense of Hampel: their score is bounded and redescending, their influence function has a closed form, and their gross error sensitivity is finite in every identifiable parametric model -- all inherited from the kernel's decay, with no ad hoc truncation. The kernel plays the role of Huber's tuning constant, but as a structural component of the model rather than a post-hoc modification. The framework is worked out for the Cauchy location model (where no classical moment estimator exists), a Student t_3 location-scale model, a bivariate Cauchy location model, and a bivariate t_3 location-scale model. Monte Carlo comparisons show that weak moment estimators match or outperform classical robust benchmarks under contamination; in the bivariate t_3 case the MLE scale estimate breaks down while the weak moment estimator converges at the parametric rate. Although the paper focuses on parametric models, the reconstruction route is inherently non-parametric and opens a path to weak density estimation without parametric assumptions.

2604.23573 2026-04-28 stat.ML cs.LG

High-dimensional Semi-supervised Classification via the Fermat Distance

Ruoxu Tan, Yiming Zang

Details
English abstract

Semi-supervised classification, where unlabeled data are massive but labeled data are limited, often arises in machine learning applications. We address this challenge under high-dimensional data by leveraging the manifold and cluster assumptions. Based on the Fermat distance, a density-sensitive metric that naturally encodes the cluster assumption, we propose the weighted $k$-nearest neighbors (NN) classifier and multidimensional scaling (MDS)-induced classifiers. The use of MDS with a large target dimension allows the effective application of linear classifiers to complex manifold data. Theoretically, we derive a sharp lower bound for the expected excess risk within clusters and prove that the weighted $k$-NN classifier utilizing the true Fermat distance is minimax optimal. Furthermore, we explicitly quantify the utility of unlabeled data by showing that the error arising from estimating the Fermat distance decays exponentially with the pooled sample size. Such a rate is much faster than the related rates in the literature. Extensive experiments on synthetic and real datasets demonstrate competitive or superior performance of our approaches compared to state-of-the-art graph-based semi-supervised classifiers.
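To see why the Fermat distance is density-sensitive, it helps to compute the sample version commonly used in the literature: the cheapest path through the sample points, where each hop costs its Euclidean length raised to a power $\alpha \ge 1$. The sketch below assumes that definition and uses Floyd-Warshall for the shortest paths. Both are choices made for illustration and may differ from the estimator analyzed in the paper; `fermat_distances` is a hypothetical name.

```python
import math

def fermat_distances(points, alpha=2.0):
    """All-pairs sample Fermat distances.

    Edge cost between two sample points is (Euclidean distance) ** alpha;
    the Fermat distance is the cheapest path cost over the complete graph.
    """
    n = len(points)
    d = [[math.dist(p, q) ** alpha for q in points] for p in points]
    # Floyd-Warshall relaxation over intermediate points k.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                through_k = d[i][k] + d[k][j]
                if through_k < d[i][j]:
                    d[i][j] = through_k
    return d

sample = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
squared_hops = fermat_distances(sample, alpha=2.0)
plain_hops = fermat_distances(sample, alpha=1.0)
```

With $\alpha = 2$ the direct hop from the first to the last point costs 4, while two short hops through the middle point cost 2. Paths through densely sampled regions are therefore cheaper, which is how the metric encodes the cluster assumption.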

2604.23552 2026-04-28 cs.LG cs.AI stat.ML

On the Memorization of Consistency Distillation for Diffusion Models

Bingqing Jiang, Difan Zou

Comments 34 pages

Details
English abstract

Diffusion models are central to modern generative modeling, and understanding how they balance memorization and generalization is critical for reliable deployment. Recent work has shown that memorization in diffusion models is shaped by training dynamics, with generalization and memorization emerging at different stages of training. However, deployed diffusion models are often further distilled, introducing an additional training phase whose impact on memorization is not well understood. In this work, we analyze how distillation reshapes memorization behavior in diffusion models, taking consistency distillation as a representative framework. Empirically, we show that when applied to a teacher model that has memorized data, consistency distillation significantly reduces transferred memorization in the student while preserving, and sometimes improving, sample quality. To explain this behavior, we provide a theoretical analysis using a random feature neural network model [Bonnaire et al., 2025], showing that consistency distillation suppresses unstable feature directions associated with memorization while preserving stable, generalizable modes. Our findings suggest that distillation can serve not only as an acceleration tool, but also as a mechanism for improving the memorization-generalization trade-off.

2604.20266 2026-04-28 stat.ME

Bayesian Modeling of the Stochastic Block Model for Weighted Network Data with Zero-Inflated Negative Binomial Distribution

Fumiya Iwashige

Comments 19 pages, 1 figure

Details
English abstract

Weighted networks encode not only the presence of interactions but also their strength. Existing methods for weighted network community detection often rely on Poisson models, which can be restrictive for overdispersed data and make efficient posterior computation difficult when covariates are incorporated. We propose Bayesian stochastic block models based on the zero-inflated negative binomial distribution: ZINB-SBM without covariates and CZINB-SBM with pairwise covariates. The proposed models accommodate overdispersion, naturally account for missing interactions through zero inflation, and admit efficient Gibbs sampling. In CZINB-SBM, Pólya-Gamma data augmentation enables posterior inference for regression coefficients with uncertainty quantification. We further employ a dynamic mixture of finite mixtures, which allows the number of communities to be inferred from the data and can lead to more accurate clustering. Simulation studies show that ZINB-SBM is more robust than a zero-inflated Poisson SBM for highly overdispersed networks. Real data analysis demonstrates interpretable block-specific covariate effects and substantially improved missing link prediction compared with a Poisson regression-based Bayesian SBM.
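The edge-weight distribution behind these models can be written down directly. The sketch below evaluates a zero-inflated negative binomial probability mass function under one standard parametrization, assumed here: `r` is the size parameter, `p` the success probability, and `pi` the probability of a structural zero. The paper may parametrize the distribution differently, and the function names are hypothetical.

```python
import math

def nb_pmf(k, r, p):
    """Negative binomial pmf: P(K = k) for k failures before r successes."""
    log_pmf = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
               + r * math.log(p) + k * math.log1p(-p))
    return math.exp(log_pmf)

def zinb_pmf(k, pi, r, p):
    """Zero-inflated negative binomial pmf.

    With probability pi the count is a structural zero; otherwise it is
    drawn from the negative binomial distribution.
    """
    base = (1.0 - pi) * nb_pmf(k, r, p)
    return pi + base if k == 0 else base

prob_zero = zinb_pmf(0, pi=0.3, r=2.0, p=0.5)
total_mass = sum(zinb_pmf(k, 0.3, 2.0, 0.5) for k in range(200))
mean_weight = sum(k * zinb_pmf(k, 0.3, 2.0, 0.5) for k in range(200))
```

The zero class mixes structural zeros with negative binomial zeros, which is how the model separates missing interactions from weak ones.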

2603.03004 2026-04-28 stat.ME stat.AP stat.CO

eTFCE: Exact Threshold-Free Cluster Enhancement via Fast Cluster Retrieval

Xu Chen, Wouter D. Weeda, Thomas E. Nichols, Jelle J. Goeman

Comments Revised manuscript with updated analyses and clarifications

Details
English abstract

Threshold-free cluster enhancement (TFCE) is widely used for cluster-based inference in neuroimaging, but existing implementations typically rely on discretized approximations that may introduce numerical variability. We present eTFCE, an efficient framework that provides a numerically exact evaluation of the TFCE integral using an optimized cluster retrieval algorithm. Across multiple datasets, eTFCE and the standard implementation produce highly consistent inference results. Voxel-wise comparisons reveal a systematic asymmetry: the standard method yields smaller p-values for more voxels, while eTFCE concentrates stronger statistical evidence within a smaller subset. These differences are primarily confined to voxels near the inference boundary and have minimal impact on overall inference. This pattern is consistent with discretization effects in standard implementations, where the TFCE integral is approximated using a finite set of threshold levels, introducing subtle biases in statistical evidence accumulation across thresholds. Furthermore, eTFCE improves computational efficiency (71.3% of runtime on average) and enables unified computation of multiple cluster-based statistics within a single permutation framework. Overall, eTFCE provides an exact, efficient, and extensible approach to nonparametric neuroimaging inference.
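The contrast between a discretized and an exact TFCE integral can be shown on a 1D signal. The sketch assumes the usual form $\mathrm{TFCE}(v) = \int_0^{x_v} e(h)^E h^H \, dh$ with the commonly used defaults $E = 0.5$ and $H = 2$, and defines a cluster as a contiguous run of values at or above the threshold. The exact version uses the fact that the extent $e(h)$ is constant between consecutive distinct signal values, so each piece integrates in closed form. This is not the authors' cluster-retrieval algorithm, and the function names are hypothetical.

```python
def cluster_extent(signal, v, h):
    """Size of the contiguous run around index v whose values are >= h."""
    if signal[v] < h:
        return 0
    lo = hi = v
    while lo > 0 and signal[lo - 1] >= h:
        lo -= 1
    while hi < len(signal) - 1 and signal[hi + 1] >= h:
        hi += 1
    return hi - lo + 1

def tfce_exact(signal, E=0.5, H=2.0):
    """Exact TFCE: integrate piecewise between distinct signal values."""
    levels = sorted({0.0} | {x for x in signal if x > 0})
    out = []
    for v, x in enumerate(signal):
        total = 0.0
        for a, b in zip(levels, levels[1:]):
            if b > x:
                break
            extent = cluster_extent(signal, v, b)  # constant on (a, b]
            total += extent ** E * (b ** (H + 1) - a ** (H + 1)) / (H + 1)
        out.append(total)
    return out

def tfce_discrete(signal, dh=0.1, E=0.5, H=2.0):
    """Discretized TFCE: Riemann sum over thresholds dh, 2*dh, ..."""
    out = []
    for v, x in enumerate(signal):
        total, k = 0.0, 1
        while k * dh <= x:
            h = k * dh
            total += cluster_extent(signal, v, h) ** E * h ** H * dh
            k += 1
        out.append(total)
    return out

signal = [0.0, 1.0, 2.0, 1.0, 0.0]
exact = tfce_exact(signal)
approx = tfce_discrete(signal, dh=0.01)
```

The discretized value depends on the step `dh`, whereas the exact value does not. That step dependence is the kind of numerical variability the abstract refers to.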

2602.00417 2026-04-28 stat.ML cs.LG

Shuffle and Joint Differential Privacy for Generalized Linear Contextual Bandits

Sahasrajit Sarmasarkar

Details
English abstract

We present the first algorithms for generalized linear contextual bandits under shuffle differential privacy and joint differential privacy. While prior work on private contextual bandits has been restricted to linear reward models -- which admit closed-form estimators -- generalized linear models (GLMs) pose fundamental new challenges: no closed-form estimator exists, requiring private convex optimization; privacy must be tracked across multiple evolving design matrices; and optimization error must be explicitly incorporated into regret analysis. We address these challenges under two privacy models and context settings. For stochastic contexts, we design a shuffle-DP algorithm achieving $\tilde{O}(d^{3/2}\sqrt{T \log T}/\sqrt{\varepsilon})$ regret in the dominant term, differing from the non-private rate by a factor of $\sqrt{d/\varepsilon}$. For adversarial contexts, we provide a joint-DP algorithm with regret $\tilde{O}\!\big(d\sqrt{T} \log T + d^{3/4}\sqrt{T/\varepsilon}\,(\log T)\,(d + \log T)^{1/4}\big)$ -- matching the non-private rate $\tilde{O}(d\sqrt{T} \log T)$ in the leading term, with privacy contributing only an additive correction. Unlike prior work on locally private GLM bandits, our methods require no spectral assumptions on the context distribution beyond $\ell_2$ boundedness.
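The stated factor of $\sqrt{d/\varepsilon}$ for stochastic contexts can be checked by dividing the two rates. The abstract does not state the non-private stochastic-context rate explicitly; the line below assumes it is $\tilde{O}(d\sqrt{T \log T})$, which is the rate the stated factor implies.

```latex
\[
\frac{d^{3/2}\sqrt{T\log T}\,/\sqrt{\varepsilon}}{d\sqrt{T\log T}}
  \;=\; \frac{d^{1/2}}{\varepsilon^{1/2}}
  \;=\; \sqrt{d/\varepsilon}.
\]
```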

2512.20562 2026-04-28 stat.ML cs.LG math.OC

Shallow Neural Networks Learn Low-Degree Spherical Polynomials with Feature Learning by Learnable Channel Attention

Yingzhen Yang

Comments Accepted by Algorithmic Learning Theory (ALT) 2026

Details
Abstract

We study the problem of learning a low-degree spherical polynomial of degree $\ell_0 = Θ(1) \ge 1$ defined on the unit sphere in $\mathbb{R}^d$ by training an over-parameterized two-layer neural network (NN) with channel attention in this paper. Our main result is the significantly improved sample complexity for learning such low-degree polynomials. We show that, for any regression risk $\varepsilon \in (0,1)$, a carefully designed two-layer NN with channel attention and finite width trained by the vanilla gradient descent (GD) requires the lowest sample complexity of $n \asymp Θ(d^{\ell_0}/\varepsilon)$ with high probability, in contrast with the representative sample complexity $Θ\left(d^{\ell_0} \max\{\varepsilon^{-2},\log d\}\right)$, where $n$ is the training data size. Moreover, such sample complexity is not improvable since the trained network renders a sharp rate of the nonparametric regression risk of the order $Θ(d^{\ell_0}/{n})$ with high probability. On the other hand, the minimax optimal rate for the regression risk with a kernel of rank $Θ(d^{\ell_0})$ is $Θ(d^{\ell_0}/{n})$, so that the rate of the nonparametric regression risk of the network trained by GD is minimax optimal. Training the two-layer NN with channel attention proceeds in two stages: (1) a provable learnable channel selection algorithm, as a learnable harmonic-degree selection process, identifies the ground truth channel number in the target function, $\ell_0$, from $L \ge \ell_0$ channels in the first-layer activation; (2) the second layer is trained by standard GD using the selected channels. To the best of our knowledge, this is the first time a minimax optimal risk bound is obtained by training an over-parameterized but finite-width neural network with feature learning capability to learn low-degree spherical polynomials.

2512.07868 2026-04-28 cs.LG cs.AI stat.ML

Bayesian Optimization for Function-Valued Responses under Min-Max Criteria

Pouya Ahadi, Reza Marzban, Ali Adibi, Kamran Paynabar

Comments 25 pages, 6 figures

Details
Abstract

Bayesian optimization is widely used for optimizing expensive black box functions, but most existing approaches focus on scalar responses. In many scientific and engineering settings the response is functional, varying smoothly over an index such as time or wavelength, which makes classical formulations inadequate. Existing methods often minimize integrated error, which captures average performance but neglects worst case deviations. To address this limitation we propose min-max Functional Bayesian Optimization (MM-FBO), a framework that directly minimizes the maximum error across the functional domain. Functional responses are represented using functional principal component analysis, and Gaussian process surrogates are constructed for the principal component scores. Building on this representation, MM-FBO introduces an integrated uncertainty acquisition function that balances exploitation of worst case expected error with exploration across the functional domain. We provide two theoretical guarantees: a discretization bound for the worst case objective, and a consistency result showing that as the surrogate becomes accurate and uncertainty vanishes, the acquisition converges to the true min-max objective. We validate the method through experiments on synthetic benchmarks and physics inspired case studies involving electromagnetic scattering by metaphotonic devices and vapor phase infiltration. Results show that MM-FBO consistently outperforms existing baselines and highlights the importance of explicitly modeling functional uncertainty in Bayesian optimization.

2510.15479 2026-04-28 cs.LG stat.ML

Adversary-Free Counterfactual Prediction via Information-Regularized Representations

Shiqin Tang, Rong Feng, Shuxin Zhuang, Youzhi Zhang, Hongzong Li

Details
Abstract

We study counterfactual prediction under assignment bias and propose a mathematically grounded, information-theoretic approach that removes treatment-covariate dependence without adversarial training. Starting from a bound that links the counterfactual-factual risk gap to mutual information, we learn a stochastic representation Z that is predictive of outcomes while minimizing I(Z; T). We derive a tractable variational objective that upper-bounds the information term and couples it with a supervised decoder, yielding a stable, provably motivated training criterion. The framework extends naturally to dynamic settings by applying the information penalty to sequential representations at each decision time. We evaluate the method on controlled numerical simulations and a real-world clinical dataset, comparing against recent state-of-the-art balancing, reweighting, and adversarial baselines. Across metrics of likelihood, counterfactual error, and policy evaluation, our approach performs favorably while avoiding the training instabilities and tuning burden of adversarial schemes.

2510.13087 2026-04-28 cs.LG stat.ME stat.ML

DeepCausalMMM: A Deep Learning Framework for Marketing Mix Modeling with Causal Structure Learning

Aditya Puttaparthi Tirumala

Comments Published in the Journal of Open Source Software. Please cite the JOSS version - doi:10.21105/joss.09914. Please note that Author has no middle name. Last name is 'Puttaparthi Tirumala' (it's a two-part surname)

Details
Journal ref
Journal of Open Source Software, 11(120), 9914 (2026)
Abstract

Marketing Mix Modeling (MMM) estimates the impact of marketing activities on business outcomes such as sales or revenue. Traditional MMM approaches rely on linear regression or Bayesian hierarchical models that assume channel independence and struggle to capture temporal dynamics and non-linear saturation. DeepCausalMMM addresses these limitations by combining deep learning, causal inference, and marketing science. It uses Gated Recurrent Units (GRUs) to learn temporal patterns (adstock, lag) while learning statistical dependencies between channels through Directed Acyclic Graph (DAG) structure with upper triangular constraints. It implements Hill equation saturation curves for diminishing returns and budget optimization. Key features: (1) data-driven hyperparameters learned from data with defaults, (2) linear mean scaling of the dependent variable, (3) configurable attribution priors with dynamic loss scaling, (4) multi-region modeling with shared and region-specific parameters, (5) robust methods including Huber loss, (6) response curve analysis.

2510.00158 2026-04-28 math.ST cs.NA math.NA math.OC math.PR stat.TH

Exact affine conditioning beyond Gaussians: a unique characterization of the ensemble Kalman update

Frederic J. N. Jorgensen, Youssef M. Marzouk

Comments 35 pages, 4 figures

Details
Abstract

The analysis step of the ensemble Kalman filter, called the ensemble Kalman update (EnKU), is widely used for approximating posterior distributions in inverse problems and data assimilation. The EnKU approximates the posterior distribution $π_{X\mid Y=y_\star}$ by pushing forward the joint distribution $(X,Y)\simπ$ through an affine map $L^{\mathrm{EnKU}}_{π,y_\star}(x,y)$ that depends only on the covariance structure of $π$ and the observation $y_\star$. While the EnKU yields the exact posterior for Gaussian $π$ in the mean-field, this property alone does not uniquely determine the EnKU. In fact, there are infinitely many affine maps $L_{π, y_\star}$ that achieve such exact conditioning. In this paper, we offer a novel characterization of the EnKU among all such affine maps. We first exhaustively characterize the set ${E}^{\mathrm{EnKU}}$ of joint distributions for which the EnKU yields exact conditioning, showing that it is much larger than the set of Gaussians. Next, we show that except for a small class of highly symmetric distributions within ${E}^{\mathrm{EnKU}}$, the EnKU is the {unique} exact affine conditioning map. Further, we characterize the largest possible set of distributions ${F}$ for which a distribution-dependent, weakly observation-dependent, affine map exists, a class of transports that naturally includes the EnKU. We show that ${F}={E}^{\mathrm{EnKU}}\cup{S}_{\mathrm{nl-dec}}$ with a small symmetry class ${S}_{\mathrm{nl-dec}}$, meaning that for affine conditioning beyond the Gaussian setting, the EnKU has an exact set that is essentially maximally large.

2508.05901 2026-04-28 math.ST cs.IT math.IT math.PR stat.ML stat.TH

Estimating the size of a set using cascading exclusion

Sourav Chatterjee, Persi Diaconis, Susan Holmes

Comments 52 pages, 10 figures. Minor changes in this revision. To appear in Statistical Science

Details
Abstract

Let $S$ be a finite set, and $X_1,\ldots,X_n$ an i.i.d. uniform sample from $S$. To estimate the size $|S|$, without further structure, one can wait for repeats and use the birthday problem. This requires a sample size of the order $|S|^\frac{1}{2}$. On the other hand, if $S=\{1,2,\ldots,|S|\}$, the maximum of the sample blown up by $n/(n-1)$ gives an efficient estimator based on any growing sample size. This paper gives refinements that interpolate between these extremes. A general non-asymptotic theory is developed. This includes estimating the volume of a compact convex set, the unseen species problem, and a host of testing problems that follow from the question `Is this new observation a typical pick from a large prespecified population?' We also treat regression style predictors. A general theorem gives non-parametric finite $n$ error bounds in all cases.
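
The two extreme estimators mentioned above can be sketched as follows. This is a minimal illustration, not the paper's cascading-exclusion method: the function names are hypothetical, the repeat-based estimator uses the standard fact that the expected number of colliding pairs among $n$ uniform draws is $n(n-1)/(2|S|)$, and the maximum-based estimator uses the $n/(n-1)$ factor exactly as stated in the abstract.

```python
from collections import Counter
from typing import Hashable, Sequence


def count_colliding_pairs(sample: Sequence[Hashable]) -> int:
    """Number of unordered pairs (i, j), i < j, with sample[i] == sample[j]."""
    counts = Counter(sample)
    return sum(c * (c - 1) // 2 for c in counts.values())


def birthday_size_estimate(sample: Sequence[Hashable]) -> float:
    """Estimate |S| from repeats, with no structure assumed on S.

    Solves E[pairs] = n(n-1) / (2|S|) for |S| using the observed pair count.
    """
    n = len(sample)
    pairs = count_colliding_pairs(sample)
    if pairs == 0:
        raise ValueError("no repeats observed; sample too small to estimate |S|")
    return n * (n - 1) / (2 * pairs)


def max_size_estimate(sample: Sequence[int]) -> float:
    """Estimate |S| when S = {1, ..., |S|}: the sample maximum scaled by n/(n-1)."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    return max(sample) * n / (n - 1)
```

The first estimator only becomes usable once repeats appear, which is why it needs a sample of order $|S|^{1/2}$, while the second works for any growing sample size.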

2507.20058 2026-04-28 stat.ML cs.LG stat.AP

Modeling Parkinson's Disease Progression Using Longitudinal Voice Biomarkers: A Comparative Study of Statistical and Neural Mixed-Effects Models

Ran Tong, Lanruo Wang, Tong Wang, Wei Yan

Comments Published version: Computer Methods and Programs in Biomedicine Update, DOI: 10.1016/j.cmpbup.2026.100242. Version note: https://doi.org/10.5281/zenodo.19804672

Details
Journal ref
Computer Methods and Programs in Biomedicine Update, Volume 9, 2026, Article 100242
Abstract

Longitudinal voice biomarkers provide a non-invasive source of information for monitoring Parkinson's disease progression, but their statistical analysis is difficult because repeated measurements from the same subject are correlated, clinical cohorts are often small, and disease trajectories can vary substantially across individuals. This study evaluates statistical and neural mixed-effects approaches for modeling Parkinson's disease progression from telemonitoring voice data. Using the Oxford Parkinson's telemonitoring dataset (N=42), we compare Neural Mixed Effects (NME) models, Generalized Neural Network Mixed Models (GNMMs), and semi-parametric Generalized Additive Mixed Models (GAMMs) under the same longitudinal prediction setting. The results show that neural mixed-effects models provide flexible nonlinear representations but can overfit severely in this small-sample setting, whereas GAMMs achieve stronger predictive performance and retain interpretable smooth effects and subject-level structure. In particular, the GAMM-based approach attains the lowest prediction error (MSE 6.56), while the neural baselines have substantially larger errors (MSE > 90). These findings support the use of interpretable statistical mixed-effects models for small longitudinal telemonitoring studies and suggest that larger and more diverse cohorts are needed before highly flexible neural mixed-effects models can be reliably assessed in this application.

2506.09163 2026-04-28 cs.LG stat.ML

Scalable Spatiotemporal Inference with Biased Scan Attention Transformer Neural Processes

Daniel Jenson, Jhonathan Navott, Piotr Grynfelder, Mengyan Zhang, Makkunda Sharma, Elizaveta Semenova, Seth Flaxman

Details
Abstract

Neural Processes (NPs) are a rapidly evolving class of models designed to directly model the posterior predictive distribution of stochastic processes. While early architectures were developed primarily as a scalable alternative to Gaussian Processes (GPs), modern NPs tackle far more complex and data-hungry applications spanning geology, epidemiology, climate, and robotics. These applications have placed increasing pressure on the scalability of these models, with many architectures compromising accuracy for scalability. In this paper, we demonstrate that this trade-off is often unnecessary, particularly when modeling fully or partially translation-invariant processes. We propose a versatile new architecture, the Biased Scan Attention Transformer Neural Process (BSA-TNP), which introduces Kernel Regression Blocks (KRBlocks), group-invariant attention biases, and memory-efficient Biased Scan Attention (BSA). BSA-TNP is able to: (1) match or exceed the accuracy of the best models while often training in a fraction of the time, (2) exhibit translation invariance, enabling learning at multiple resolutions simultaneously, (3) transparently model processes that evolve in both space and time, (4) support high-dimensional fixed effects, and (5) scale gracefully, running inference on over 1M test points and 100K context points in under a minute on a single 24GB GPU. Code is provided as part of the `dl4bi` package.

2506.04118 2026-04-28 cs.LG stat.ML

Guided Speculative Inference for Efficient Test-Time Alignment of LLMs

Jonathan Geuter, Youssef Mroueh, David Alvarez-Melis

Comments 41 pages, 11 figures. Published at ICLR 2026

Details
Abstract

We propose Guided Speculative Inference (GSI), a novel algorithm for efficient reward-guided decoding in large language models. GSI combines soft best-of-$n$ test-time scaling with a reward model $r(x,y)$ and speculative samples from a small auxiliary model $π_S(y\mid x)$. We provably approximate both the optimal tilted policy $π_{β,B}(y\mid x) \propto π_B(y\mid x)\exp(β\,r(x,y))$ of soft best-of-$n$ under the base model $π_B$, as well as the expected reward under the optimal policy. In experiments on reasoning benchmarks (MATH500, OlympiadBench, Minerva Math, MMLU-STEM, GSM8K) and across different model families, our method achieves higher accuracy than standard soft best-of-$n$ with $π_S$ and reward-guided speculative decoding (Liao et al., 2025), and in certain settings even outperforms soft best-of-$n$ with $π_B$, while reducing end-to-end latency by up to $28\%$. The code is available at https://github.com/j-geuter/GSI .
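
For orientation, the selection step of plain soft best-of-$n$ can be sketched as below. This is not the GSI algorithm and not the code from the linked repository: the function names are hypothetical, and the sketch assumes that the $n$ candidates and their rewards $r(x,y)$ have already been computed, with one candidate then drawn with probability proportional to $\exp(β\,r)$.

```python
import math
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def soft_bon_weights(rewards: Sequence[float], beta: float) -> List[float]:
    """Softmax weights proportional to exp(beta * r), computed stably."""
    if not rewards:
        raise ValueError("need at least one candidate")
    shift = max(beta * r for r in rewards)  # subtract the max to avoid overflow
    exps = [math.exp(beta * r - shift) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


def soft_best_of_n(candidates: Sequence[T], rewards: Sequence[float],
                   beta: float, rng: random.Random) -> T:
    """Draw one candidate with probability proportional to exp(beta * reward)."""
    weights = soft_bon_weights(rewards, beta)
    return rng.choices(list(candidates), weights=weights, k=1)[0]
```

With `beta = 0` the draw is uniform over the candidates, and as `beta` grows the rule approaches hard best-of-$n$, which always returns the highest-reward candidate.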

2505.16329 2026-04-28 stat.ML cs.LG

High-Dimensional Private Linear Regression with Optimal Rates

Simone Bombari, Jialei Luo, Inbar Seroussi, Marco Mondelli

Comments Updated version of "Better Rates for Private Linear Regression in the Proportional Regime via Aggressive Clipping"

Details
Abstract

Differentially private (DP) linear regression has received significant attention in the recent theoretical literature, with several approaches proposed to improve error rates. Our work considers the popular high-dimensional regime with random data, where the number of training samples $n$ and the input dimension $d$ grow at a proportional rate $d / n \to γ$, and it studies a family of one-pass DP gradient descent (DP-GD) algorithms satisfying $ρ^2 / 2$ zero-concentrated DP. In this setting, we establish a deterministic equivalent for the DP-GD trajectory in terms of a system of ordinary differential equations. This allows us to analyze the effect of gradient clipping constants that are smaller than the typical norm of the per-sample gradients, a setup shown to improve performance in practice. For well-conditioned data, we show that DP-GD, upon properly choosing the clipping constant and learning rate, achieves the non-asymptotic risk of $O(γ + γ^2 / ρ^2)$, and we establish that this rate is minimax optimal. Then, we consider the ill-conditioned case where the data covariance spectrum follows a power-law distribution, and we show that the risk displays a power-like scaling law in $γ$, highlighting the change in the exponent as a function of the privacy parameter $ρ$. Overall, our analysis demonstrates the benefits of practical algorithmic design choices, including aggressive gradient clipping and decaying learning rate schedules.

2505.08008 2026-04-28 stat.ME

Separation-based causal discovery for extremes

Junshu Jiang, Jordan Richards, Raphaël Huser, David Bolin

Details
Abstract

Structural causal models (SCMs), with an underlying directed acyclic graph (DAG), provide a powerful analytical framework to describe the interaction mechanisms in large-scale complex systems. However, when the system exhibits extreme events, the governing mechanisms can change dramatically, and SCMs with a focus on rare events are needed. We propose a new class of SCMs, called XSCMs, which leverage transformed-linear algebra to model causal relationships among extreme values. Similar to traditional SCMs, we prove that XSCMs satisfy the causal Markov and causal faithfulness properties with respect to partial tail (un)correlatedness. This enables estimation of the underlying DAG for extremes using separation-based tests, and makes many state-of-the-art constraint-based causal discovery algorithms directly applicable. We further consider the problem of undirected graph estimation for relationships among tail-dependent (and potentially heavy-tailed) data. The effectiveness of our method, compared to alternative approaches, is validated through simulation studies on large-scale systems with up to 50 variables, and in a well-studied application to river discharge data from the Danube basin. Finally, we apply the framework to investigate complex market-wide relationships in China's derivatives market.

2505.04884 2026-04-28 stat.ME math.ST stat.TH

Model Selection for Unit-root Time Series with Many Predictors

Shuo-Chieh Huang, Ching-Kang Ing, Ruey S. Tsay

Details
Abstract

This paper studies model selection for general unit-root time series, including the case with many exogenous predictors. We propose a new model selection algorithm, FHTD, that leverages forward stepwise regression (FSR), a high-dimensional information criterion (HDIC), a backward elimination method based on HDIC, and a data-driven thresholding (DDT) approach. Under some mild assumptions that allow for unknown locations and multiplicities of the characteristic roots on the unit circle of the time series and conditional heteroscedasticity in the predictors and errors, we establish the sure screening property of FSR and the selection consistency of FHTD. Our theoretical analysis relies on two novel technical contributions, namely a functional central limit theorem for multivariate linear processes and a uniform lower bound for the minimum eigenvalue of the sample covariance matrices, both of which are of independent interest. Simulation results corroborate the theoretical properties and show the superior performance of FHTD in model selection. We apply the proposed FHTD to model U.S. monthly housing starts and unemployment data, showcasing its practical utility.

2504.18184 2026-04-28 stat.ML cs.LG math.FA math.ST stat.TH

Learning Operators by Regularized Stochastic Gradient Descent with Operator-valued Kernels

Jia-Qi Yang, Lei Shi

Comments 68 pages, 3 figures

Details
Abstract

We consider a class of statistical inverse problems involving the estimation of a regression operator from a Polish space to a separable Hilbert space, where the target lies in a vector-valued reproducing kernel Hilbert space induced by an operator-valued kernel. To address the associated ill-posedness, we analyze regularized stochastic gradient descent (SGD) algorithms in both online and finite-horizon settings. The former uses polynomially decaying step sizes and regularization parameters, while the latter adopts fixed values. Under suitable structural and distributional assumptions, we establish dimension-independent bounds for prediction and estimation errors. The resulting convergence rates are near-optimal in expectation, and we also derive high-probability estimates that imply almost sure convergence. Our analysis introduces a general technique for obtaining high-probability guarantees in infinite-dimensional settings. We illustrate the practical scope of our framework with applications to structured prediction and parametric PDEs, providing examples that reflect how the approach can be applied in practice.

2502.04122 2026-04-28 stat.ME

Bayesian discovery of species in multiple areas

Alessandro Colombi, Raffaele Argiento, Federico Camerlenghi, Lucia Paci

Details
Abstract

In ecology, the description of species composition and biodiversity calls for statistical methods that involve estimating features of interest in unobserved samples based on an observed one. In the last decade, the Bayesian nonparametrics literature has thoroughly investigated the case where data arise from a homogeneous population. In this work, we propose a novel framework to address heterogeneous populations, specifically dealing with scenarios where data arise from two areas. This setting significantly increases the mathematical complexity of the problem and, as a consequence, it has received limited attention in the literature. While early approaches leverage computational methods, we provide a distributional theory for the in-sample analysis of any observed sample and enable out-of-sample prediction for the number of unseen distinct and shared species in additional samples of arbitrary sizes. The latter also extends the frequentist estimators, which solely deal with one-step-ahead prediction. Furthermore, our results can be applied to address sample size determination in sampling problems aimed at detecting distinct and shared species. Our results are illustrated in a real-world dataset concerning a population of ants in the city of Trieste.

2402.11789 2026-04-28 stat.ML cs.CV cs.LG

Statistical Test for Diffusion-Based Anomaly Localization via Selective Inference

Teruyuki Katsuoka, Tomohiro Shiraishi, Daiki Miwa, Vo Nguyen Le Duy, Ichiro Takeuchi

Comments 35 pages, 6 figures

Details
Abstract

Anomaly localization in images -- identifying regions that deviate from normal patterns -- is vital in applications such as medical diagnosis and industrial inspection. A recent trend is the use of image generation models in anomaly localization, where these models generate normal-looking counterparts of anomalous images, thereby allowing flexible and adaptive anomaly localization. However, these methods inherit the uncertainty and bias implicitly embedded in the employed generative model, raising concerns about the reliability. To address this, we propose a statistical framework based on selective inference to quantify the significance of detected anomalous regions. Our method provides $p$-values to assess the false positive detection rates, providing a principled measure of reliability. As a proof of concept, we consider anomaly localization using a diffusion model and its applications to medical diagnoses and industrial inspections. The results indicate that the proposed method effectively controls the risk of false positive detection, supporting its use in high-stakes decision-making tasks.

2312.12396 2026-04-28 stat.ME

A Bayesian time-varying random partition model for large spatio-temporal datasets

Giulio Beltramin, Andrea Cremaschi, Annalisa Cadonna, Alessandra Guglielmi, Fernando Andrés Quintana

Details
Abstract

Spatio-temporal areal data can be seen as a collection of time series which are spatially correlated, according to a specific neighbouring structure. Motivated by a dataset on mobile phone usage in the Metropolitan area of Milan, Italy, we propose a semi-parametric hierarchical Bayesian model allowing for time-varying as well as spatial model-based clustering. Our approach incorporates the notion of regimes that describe changing patterns over work and night hours as well as weekdays/weekends. Changes across regimes are considered by means of temporal changepoint components that allow for different hierarchical structures specified across time points. The changepoints might occur within fixed time windows over the day. The model features a novel random partition prior that incorporates the desired spatial features and encourages co-clustering based on areal proximity. We explore properties of the model by way of extensive simulation studies from which we collect valuable information. Finally, we discuss the application to the motivating data, where the main goal is to spatially cluster population patterns of mobile phone usage.

2211.09619 2026-04-28 cs.LG cs.RO cs.SY eess.SY math.OC stat.ML

Introduction to Online Control

Elad Hazan, Karan Singh

Comments Draft; comments/suggestions welcome at nonstochastic.control@gmail.com

Details
Abstract

This text presents an introduction to an emerging paradigm in control of dynamical systems and differentiable reinforcement learning called online nonstochastic control. The new approach applies techniques from online convex optimization and convex relaxations to obtain new methods with provable guarantees for classical settings in optimal and robust control. The primary distinction between online nonstochastic control and other frameworks is the objective. In optimal control, robust control, and other control methodologies that assume stochastic noise, the goal is to perform comparably to an offline optimal strategy. In online nonstochastic control, both the cost functions as well as the perturbations from the assumed dynamical model are chosen by an adversary. Thus the optimal policy is not defined a priori. Rather, the target is to attain low regret against the best policy in hindsight from a benchmark class of policies. This objective suggests the use of the decision making framework of online convex optimization as an algorithmic methodology. The resulting methods are based on iterative mathematical optimization algorithms, and are accompanied by finite-time regret and computational complexity guarantees.

2604.23527 2026-04-28 stat.ME physics.comp-ph physics.data-an stat.CO

Using Statistical Mechanics to Improve Real-World Bayesian Inference: A New Method Combining Tempered Posteriors and Wang-Landau Sampling

Alfred C. K. Farris

Details
Abstract

We present a simple method to obtain optimal posterior distributions and improve the quality of Bayesian inference with reduced human and computational effort. Bayes' Theorem is reformulated in the language of statistical mechanics, wherein an improved posterior -- referred to as a tempered posterior -- is defined analogously to a canonical probability distribution at temperature $τ$. Wang-Landau sampling is used to obtain the density of states of the posterior probability, and signals analogous to those of phase transitions are extracted from a single simulation. In addition, the transition temperature is easily identified, providing the tempered posterior with optimal predictive performance. We demonstrate the efficacy of the method on a real-world problem in materials science (equation of state modeling) with messy data, a high-dimensional and correlated input parameter space, and "frustration" among model outputs.

2604.23515 2026-04-28 stat.CO

ragR: Retrieval-Augmented Generation and RAG Assessment in R

Muhammad Aimal Rehman, Zhili Lu, Chi-Kuang Yeh

Comments Preprint. Code available at the GitHub repository listed in the paper

Details
Abstract

Retrieval-augmented generation (RAG) combines document retrieval with large language models to produce responses grounded in external evidence. While several R packages support core components of RAG workflows, integrated evaluation of RAG systems in R remains limited and is often conducted through Python-based tools, most notably the RAG assessment (RAGAS) framework. To address this gap, we introduce ragR, an R package that unifies document ingestion, embedding and vector storage, similarity-based retrieval, grounded generation, structured question-answer logging, and RAGAS-style evaluation within a single R-native workflow. The current implementation provides LLM-based scoring for four core RAGAS metrics: context precision, context recall, faithfulness, and answer relevance. Validation experiments under controlled settings show that ragR captures similar metric behavior to the reference Python RAGAS workflow across multiple use cases. By integrating RAG construction and evaluation within a reproducible workflow in R, ragR provides a practical framework for research, teaching, and moderate-scale experimentation on RAG systems entirely within the R ecosystem.

2604.23498 2026-04-28 math.ST math.OC stat.ML stat.TH

When Does Dynamic Preconditioning Preserve the Polyak-Ruppert CLT? A Stabilization Threshold

Sunyoung An, Xiaoming Huo

Comments 46 pages, 5 figures; includes supplementary material with deferred proofs and additional experiments

Details
Abstract

Polyak-Ruppert averaging yields an asymptotically normal estimator with sandwich covariance $H^{-1}SH^{-1}$, the foundation of online inference. When the gradient step is preconditioned by a data-driven matrix $P_t$, we ask how fast $P_t$ must stabilize for the central limit theorem (CLT) to remain valid. We resolve this via an exact preconditioner-isolating decomposition of the averaged error that confines $P_t$ to a dynamic remainder $R_n$, leaving the martingale and Taylor terms preconditioner-free. Let $M_t = (P_t H)^{-1}$ denote the effective inverse drift matrix, with $\|M_t - M_{t-1}\|_{\mathrm{op}} \lesssim t^{-β}$ and step-size exponent $α\in (1/2, 1)$. We identify a stabilization-rate threshold $β> (α+1)/2$ and prove that, within the class of polynomial rate hypotheses used in our upper bound, it cannot be weakened: the dynamic remainder $\sqrt{n}\,R_n$ vanishes in $L^2$ whenever $β> (α+1)/2$, and we exhibit sequences satisfying those hypotheses for which it does not vanish when $β\le (α+1)/2$. A single stabilization argument certifies three SA variants - SA-AdaGrad, SA-RMSProp, and SA-ONS - with gain $ρ_t = c/t$, each delivering one-step $L^2(\mathrm{op})$ stabilization of order $t^{-1}$, yielding the CLT $\sqrt{n}(\bar{x}_n - x^*) \to N(0, H^{-1}SH^{-1})$; under bounded inputs the pathwise rate $β= 1$ further preserves the $n^{-1/6}$ Wasserstein rate at $α^* = 2/3$. Under standard regularity conditions, Wald-type online inference remains valid for dynamically preconditioned averaged SGD whose stabilization rate exceeds the threshold.

2604.23469 2026-04-28 stat.ME econ.EM

Estimation of MIDAS Regressions with Errors-in-the-Variables

Sukhbir Kaur, Sukhbir Singh, Kanchan Jain, Pooja Soni

Details
Abstract

In this paper, a Mixed Data Sampling (MIDAS) model is studied when both low- and high-frequency variables are contaminated with measurement error. It is shown that the profile likelihood estimator becomes inconsistent in the presence of measurement error. Using the corrected score approach along with the profile likelihood approach, a consistent estimator for the parameters of the MIDAS measurement error model is proposed. Small- and large-sample properties of the estimator are examined through a Monte Carlo simulation study that considers the effect of sample size, number of lags, and profiling parameter.

2604.23463 2026-04-28 stat.ME math.ST stat.TH

A theory of ROC analysis of rule-out and rule-in diagnostics with applications to mammography data

Michelle Mastrianni, Kwok Lung Fan, Yee Lam Elim Thompson, Jessie J. J. Gommers, Ioannis Sechopoulos, Fredrik Strand, Weijie Chen, Gary Levine, Mukul Sherekar, Frank W. Samuelson

Details
Abstract

Multiple diagnostic tests are frequently used to determine the presence of a disease condition in patients. In this paper, we use bivariate copulas to examine the properties of receiver operating characteristic (ROC) curves formed when two correlated diagnostic tests are used together to rule-out ("believe the negative") and rule-in ("believe the positive") patients for disease. We use this theory to analyze three mammography data sets where AI devices are applied to reduce radiologists' workload or improve diagnostic performance. Our analysis shows with generality that increasing the radiologist-AI correlation for diseased cases enhances the area under the ROC curve (AUC) of a radiologist-AI rule-out curve, whereas decreasing correlation for non-diseased cases has a similar effect. The opposite trends hold for rule-in scenarios. Applications to clinical mammography data show that projected empirical radiologist performance under a rule-out or rule-in scenario is consistent with the theory.
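
The two combination rules can be illustrated with a small sketch on binarized test results. It assumes the usual reading of the terms, namely that "believe the negative" calls a case positive only when both tests are positive, and "believe the positive" calls it positive when either test is. The function names are hypothetical, and the sketch works at a single operating point rather than tracing the full copula-based ROC curves studied in the paper.

```python
from typing import List, Sequence, Tuple


def combine_rule_out(a_pos: Sequence[bool], b_pos: Sequence[bool]) -> List[bool]:
    """Believe the negative: positive only if both tests are positive."""
    return [a and b for a, b in zip(a_pos, b_pos)]


def combine_rule_in(a_pos: Sequence[bool], b_pos: Sequence[bool]) -> List[bool]:
    """Believe the positive: positive if either test is positive."""
    return [a or b for a, b in zip(a_pos, b_pos)]


def sensitivity_specificity(pred: Sequence[bool],
                            truth: Sequence[bool]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) of predictions against ground truth."""
    tp = sum(p and t for p, t in zip(pred, truth))
    tn = sum((not p) and (not t) for p, t in zip(pred, truth))
    n_pos = sum(truth)
    n_neg = len(truth) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one diseased and one non-diseased case")
    return tp / n_pos, tn / n_neg
```

Under these definitions the rule-out combination can only lower sensitivity and raise specificity relative to either test alone, and the rule-in combination does the opposite.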

2604.23454 2026-04-28 stat.ME stat.CO stat.ML

Anchored Variational Inference for Personalized Sequential Latent-State Models

Xingche Guo

Sequential latent-variable models with subject-specific random effects provide a flexible framework for modeling temporally structured data with both local latent dynamics and stable between-subject heterogeneity. In such models, conditional inference for the local latent process is often tractable, but integrating over subject-specific random effects can be computationally demanding. We propose an anchored variational inference framework for efficient approximate inference in this setting. The central idea is to replace the full conditional posterior of the local latent process with its evaluation at a representative value of the subject-specific latent effect, called the anchor point, thereby preserving tractable local inference while substantially reducing computational cost. This approximation is especially appealing in sequential settings, where the posterior distribution of the random effect becomes increasingly concentrated as the sequence length grows. Under suitable conditions, we show that the posterior mean is a nearly optimal anchor point and that the resulting anchored variational EM (AVEM) algorithm approximately preserves the local monotonicity behavior of standard variational inference. We instantiate the framework in two representative classes of sequential latent-variable models, namely mixed hidden Markov models and mixed-effects state-space models, derive the corresponding AVEM algorithms, and use simulation studies to indicate that the resulting methods achieve accurate estimation with substantial computational gains. We also discuss a partially anchored variant of the framework, in which only the components of the subject-specific latent effect whose posteriors are well concentrated are anchored.

2604.23438 2026-04-28 stat.AP stat.ME

Estimating Causal Attribution of Anthropogenic Forcing on High-Temperature Extremes Using a Latent Gaussian Spatial Model

Ritik Roshan Giri, Arnab Hazra

Comments 31 pages, 6 figures, 3 tables, 1 algorithm

Climate change has become a significant global concern due to its capacity to cause substantial disruption to daily life by increasing the frequency and intensity of extreme weather events. Given the rising trend of human interventions in the climate system over recent decades, this study aims to quantify the relative contribution of anthropogenic forcing to the increasing likelihood of climate extremes, with a particular emphasis on high-temperature extremes. Our analysis focuses on annual temperature maxima from the IPSL-CM6A model in the CMIP6 experiment. We propose a novel causal inference framework that focuses on differences in return levels derived from annual temperature maxima between the factual and counterfactual worlds. While jointly modeling the annual maxima from the two worlds using a bivariate generalized extreme value distribution, we model the spatially-varying coefficients using a latent Gaussian framework. Specifically, given that the data are available over a $1^\circ \times 1^\circ$ grid, we employ the multivariate intrinsic conditional autoregressive model for the latent layer in the proposed hierarchical model, ensuring proper posterior distributions. We implement a recently developed, highly efficient approximate Bayesian inference technique, "Max-and-smooth", that uses a Laplace approximation of the likelihood and then performs Gibbs sampling based on the approximate posterior. The results include posterior estimates of the causal effect of anthropogenic forcing on high-temperature extremes, along with the trends in this effect, over the factual world. Furthermore, we estimate credible regions for a significant causal effect to facilitate hotspot detection across the mainland United States.

2604.23393 2026-04-28 stat.ME math.ST stat.TH

Asymptotic theory of rerandomization for survival analysis

Xinyuan Chen, Fan Li

Rerandomization systematically reduces chance imbalance and can improve the efficiency of the average treatment effect estimator in randomized experiments. While the asymptotic properties of finite-dimensional M-estimators under rerandomization have been established, existing theory does not directly address survival outcomes under censoring, where the target estimand involves infinite-dimensional functional parameters. This article establishes the uniform weak convergence of treatment-specific survival function estimators under rerandomization and stratified rerandomization. We prove that the Kaplan-Meier and inverse probability of censoring weighted Kaplan-Meier estimators converge to tight limiting processes with reduced pointwise asymptotic variances. Furthermore, we prove that the pointwise asymptotic variance of the debiased machine learning survival function estimator remains invariant under rerandomization, a consequence of the Neyman orthogonality. Simulations and a real data example are used to illustrate the theoretical results. Our results characterize the geometric interplay between restricted randomization designs and analysis-stage covariate adjustment for functional target estimands in survival analysis.

2604.23381 2026-04-28 stat.AP stat.ME stat.ML

MCMC with Adaptive Principal-Component Transformation: Rotation-Invariant Universal Samplers for Bayesian Structural System Identification

Xianghao Meng, Yong Huang, James L. Beck, Kui Jiang, Hui Li

Comments Accepted by Advanced Engineering Informatics on Apr 25, 2026

Over decades, Markov chain Monte Carlo (MCMC) methods have been widely studied, with a typical application being the quantification of posterior uncertainties in Bayesian system identification of structural dynamic models. To address the issue of excessively low sampling efficiency in generic MCMC methods when applied to specific problems, researchers developed several MCMC algorithms that integrate trainable neural networks to replace and enhance their critical components. Later, meta-learning MCMC methods emerged to reduce training time. However, they require considerable similarity between test and training tasks, while their sampling efficiency is constrained by trade-off-simplified network designs. This paper proposes the Adaptive Principal-Component (PC) Meta-learning Stochastic Gradient Hamiltonian Monte Carlo (APM-SGHMC) algorithm. It adaptively rotates coordinate axes in the parameter space to align with the PC directions of the current posterior samples, ensuring rotation-invariance of sampling performance with respect to the posterior distribution. By incorporating translation-invariance, scale-invariance, and rotation-invariance in a unified framework, APM-SGHMC enables universal samplers to acquire generalizable knowledge across diverse Bayesian system identification tasks using minimalistic tasks while eliminating the constraints imposed by network design trade-offs on sampling efficiency. Practical feasibility issues are also addressed. Two Bayesian system identification case studies demonstrate its effectiveness and universality: our method overcomes the case-by-case limitations of traditional data-driven approaches, achieving zero-shot generalization across structurally distinct models without retraining and maintaining consistent superior performance across all scenarios.

2604.23375 2026-04-28 cs.CV stat.ML

Hierarchical Spatio-Channel Clustering for Efficient Model Compression in Medical Image Analysis

Sisipho Hamlomo, Marcellin Atemkeng, Habte Tadesse Likassa, Blaise Ravelo, Thierry Bouwmans, Sébastien Lalléchère, Antoine Vacavant, Ding-Geng Chen

Convolutional neural networks (CNNs) have become increasingly difficult to deploy in resource-constrained environments due to their large memory and computational requirements. Although low-rank compression methods can reduce this burden, most existing approaches compress spatial and channel redundancy independently and therefore do not fully exploit the localised structure within convolutional feature maps. This paper proposes a hierarchical spatio-channel low-rank compression framework for CNNs that exploits redundancy across spatial regions and channel activations. Unlike conventional methods, which apply a uniform decomposition across an entire layer, the proposed approach first partitions feature maps into spatial regions, then groups channels according to their co-activation patterns within each region, and finally applies rank-adaptive SVD to each resulting spatio-channel cluster. The method is evaluated on an AlexNet-based brain tumour MRI classification model and compared with Global SVD and Tucker decomposition under \(3\times\) and \(6\times\) compression budgets. Our method outperforms both baselines, reducing FLOPs from \(8.21\,\mathrm{G}\) to \(1.55\,\mathrm{G}\) (\(81.1\%\) reduction), achieving a \(1.38\times\) inference speed-up, and increasing classification accuracy from \(87.76\%\) to \(89.80\%\). The method also improves the macro \(F_1\)-score and performance on challenging classes such as meningioma. A hyper-parameter trade-off analysis demonstrates that the framework provides Pareto-optimal configurations, enabling control over the balance between compression and predictive performance. Moderate clustering with adaptive rank selection yields strong results. Bootstrap standard errors are reported for all classification metrics.

2604.23370 2026-04-28 math.OC cs.AI cs.LG cs.SY eess.SY stat.ML

Nonlinear Non-Gaussian Density Steering with Input and Noise Channel Mismatch: Sinkhorn with Memory for Solving the Control-affine Schrödinger Bridge Problem

Georgiy A. Bondar, Asmaa Eldesoukey, Yongxin Chen, Abhishek Halder

Solutions to the Schrödinger bridge problem and its generalizations yield feedback control policies for optimal density steering over a controlled diffusion. To numerically compute the same, the dynamic Sinkhorn recursion has become a standard approach. The mathematical engine behind this approach is the Hopf-Cole transform that recasts the conditions for optimality into a system of boundary-coupled linear PDEs. Recent works pointed out that for the control-affine Schrödinger bridge problem, this exact linearity via Hopf-Cole transform, and thus the standard Sinkhorn recursion, apply only if the control and noise channels are proportional. When the channels do not match, the Hopf-Cole-transformed PDEs remain nonlinear, and no algorithm is available to solve the same. We advance the state-of-the-art by designing a Sinkhorn recursion with memory that leverages the structure of these nonlinear PDEs, and demonstrate how it solves the control-affine Schrödinger bridge problem with input and noise channel mismatch. We prove the local stability of the proposed algorithm.

2604.23367 2026-04-28 math.ST stat.TH

Conway–Maxwell multivariate Bernoulli distribution

Hélène Cossette, Etienne Marceau, Alessandro Mutti, Patrizia Semeraro

We investigate the Conway–Maxwell multivariate Bernoulli distributions, a family of multivariate Bernoulli distributions derived from the Conway–Maxwell-binomial distribution. We show that it is possible to set the parametrization such that the Bernoulli marginals remain intact, allowing us to study dependence properties within this family. In particular, we demonstrate that this family spans the full spectrum of dependence. Moreover, for specific ranges of the parameters, these distributions satisfy the strongly Rayleigh property, a negative dependence notion stronger than negative association.

2604.23357 2026-04-28 stat.ME stat.AP

Modelling spatial heterogeneity in the effects of area-level covariates on income distributions using Bayesian nonparametric methods

Ziyou Wang, Jim Griffin, Maria Kalli

Understanding how the distribution of an economic outcome, such as income, changes with respect to space and covariates is a key concern for policy makers. To address this, we develop a Bayesian nonparametric model, the Normalised Latent Measure Factor Model with Covariates (NLMFM-C), which expresses a large collection of related densities as mixtures of latent factor densities and allows for spatial and covariate effects. We propose an adaptive Gibbs sampler to automatically infer the number of latent factor distributions, and a rotation method to make posterior inference on different data sets comparable. We apply the NLMFM-C model to Public Use Microdata Sample (PUMS) data, focusing on income distributions for sub-areas of four U.S. states over two different years, 2016 and 2020. We show that the latent factor distributions can be interpreted by income level (e.g., low, medium, and high) and investigate the spatially- and time-changing impact of three covariates: gender, race and educational attainment.

2604.23308 2026-04-28 cs.LG stat.ML

CODA: Coordination via On-Policy Diffusion for Multi-Agent Offline Reinforcement Learning

Marcel Hedman, Kale-ab Abebe Tessera, Juan Claude Formanek, Anya Sims, Riccardo Zamboni, Trevor McInroe, John Torr, Elliot Fosong

Offline multi-agent reinforcement learning (MARL) enables policy learning from fixed datasets, but is prone to coordination failure: agents trained on static, off-policy data converge to suboptimal joint behaviours because they cannot co-adapt as their policies change. We introduce CODA (Coordination via On-Policy Diffusion for Multi-Agent Reinforcement Learning), a diffusion-based multi-agent trajectory generator for data augmentation that samples conditioned on the current joint policy, producing synthetic experience which reflects the evolving behaviours of the agents, thereby providing a mechanism for co-adaptation. We find that previous diffusion-based augmentation approaches are insufficient for fostering multi-agent coordination because they produce static augmented datasets that do not evolve as the current joint policy changes during training; CODA resolves this by more closely simulating on-policy learning and is a meaningful step toward coordinated behaviours in the offline setting. CODA is algorithm-agnostic and can be layered onto both model-free and model-based offline reinforcement learning pipelines as an augmentation module. Empirically, CODA not only resolves canonical coordination pathologies in continuous polynomial games but also delivers strong results on the more complex MaMuJoCo continuous-control benchmarks.

2604.23229 2026-04-28 math.ST stat.TH

Solidarity of Spectral Gaps for Component-Wise Markov Chains

Youngwoo Kwon, Galin Jones, Qian Qin

Comments 55 pages

Deterministic-scan and random-scan component-wise Markov chain Monte Carlo algorithms, such as Gibbs samplers and conditional Metropolis-Hastings, are popular approaches for sampling from multivariate distributions. A long-standing open question is to determine the conditions under which these algorithms have similar convergence rates. A block-wise contraction condition for the component-wise updates is used to establish a solidarity principle for the $L^2$ spectral gaps of the associated Markov chains. Specifically, under this condition, the spectral gaps of the random-scan and deterministic-scan versions of the Gibbs and component-wise chains are either simultaneously positive or simultaneously zero. Moreover, the spectral gaps differ by at most polynomial factors in the number of blocks. As an application of the general results, a deterministic-scan conditional Metropolis-adjusted Langevin algorithm (MALA) for multivariate Gaussian targets is studied. The block-wise contraction condition is combined with known spectral gap bounds for the random-scan Gibbs sampler to obtain a spectral gap bound that is polynomial in dimension. The result is used to clarify how the convergence rate of the conditional MALA depends on the precision matrix of the Gaussian target and the step sizes of the block-wise MALA updates.

2604.23212 2026-04-28 stat.ML cs.LG math.ST stat.TH

Learning Curves and Benign Overfitting of Spectral Algorithms in Large Dimensions

Weihao Lu, Qian Lin, Yingcun Xia, Dongming Huang

Existing large-dimensional theory for spectral algorithms resolves either the optimally tuned point or the interpolation limit, but leaves the under-regularized regime unexplored. We study the learning curve and benign overfitting of spectral algorithms in the large-dimensional setting where the sample size and dimension are of comparable order, i.e., $n \asymp d^{\gamma}$ for some $\gamma>0$. We first consider inner-product kernels on the sphere $\mathbb{S}^{d-1}$ and establish a sharp asymptotic characterization of the excess risk across the full regularization path under various source conditions $s \geq 0$, where $s$ measures the relative smoothness of the regression function. Our results reveal that the learning curve is not simply U-shaped but instead consists of three distinct regimes: over-regularized, under-regularized, and interpolation regimes. This characterization allows us to fully capture the benign overfitting phenomenon, demonstrating that benign overfitting arises consistently across both the under-regularized and interpolation regimes whenever $s$ is positive but no larger than a critical threshold. We further show that, in the sufficiently regularized regime, the kernel learning curve is recovered by an associated sequence model. Finally, we extend the learning-curve analysis to large-dimensional KRR for a class of kernels on general domains in $\mathbb{R}^d$ whose low-degree eigenspaces satisfy spectral-scaling and hyper-contractivity conditions.

2604.23193 2026-04-28 cs.DS cs.LG cs.NA math.NA math.PR stat.ML

Well-Conditioned Oblivious Perturbations in Linear Space

Shabarish Chenakkod, Michał Dereziński, Xiaoyu Dong, Mark Rudelson

Perturbing a deterministic $n$-dimensional matrix with small Gaussian noise is a cornerstone of smoothed analysis of algorithms [Spielman and Teng, JACM 2004], as it reduces the condition number of the input to $O(n)$, and with it the complexity of many matrix algorithms. However, when deployed algorithmically, these perturbations are expensive due to the cost of generating and storing $n^2$ Gaussian random variables. We propose a perturbation that requires generating and storing $O(n)$ random numbers in $O(\log n)$ bits of precision, and reduces the condition number of any deterministic matrix to $O(n)$, matching Gaussian perturbations. Our result in particular implies a better complexity for the perturbed conjugate gradient algorithm, showing that we can solve an $n\times n$ linear system in linear space to within an arbitrarily small constant backward error using $O(n)$ matrix-vector products. In our construction, we introduce the concept of a pattern matrix, which is a dense deterministic matrix that maps all sparse vectors into dense vectors, and we combine it with a sparse perturbation whose entries are dependent and located in a non-uniform fashion. In order to analyze this construction, we develop new techniques for lower bounding the smallest singular value of a random matrix with dependent entries.

2604.23174 2026-04-28 stat.ME

Weighted Cumulative Residual Mathai-Haubold Entropy

Anija C. R, Smitha S, Sudheesh K. Kattumannil

In this paper, we introduce the weighted cumulative residual Mathai-Haubold entropy and establish its fundamental properties. A dynamic version is developed, and its behavior under linear transformations is studied. Bounds and explicit expressions for some lifetime distributions are derived. Characterization results based on the associated measure are obtained and two new classes of life distributions are formulated. A goodness-of-fit test for the Rayleigh distribution is proposed and its performance is evaluated through a Monte Carlo simulation study. Applications to real data sets demonstrate the practical applicability of the proposed methodology.

2604.23154 2026-04-28 stat.ME

A bivariate cure copula model with zero-inflated gamma frailty: dependence in both cure fractions and survival times

Masaki Hino, Shogo Kato, Takeshi Emura

Comments 47 pages, 2 figures

In biomedical studies, paired survival data arise naturally when two event times are observed within the same subject. Existing statistical models seldom accommodate both cure fractions and complex dependence structures. In this paper, we propose a novel bivariate cure frailty-copula model for paired survival data with a cure fraction. By incorporating a zero-inflated gamma frailty, the proposed framework simultaneously accommodates a cure fraction and continuous unobserved heterogeneity among uncured subjects. Dependence between cure statuses is modeled naturally via an odds-ratio parameter, while dependence between survival times conditional on frailty is captured through a copula. We show that the proposed model includes existing bivariate cure models as special cases. Population-level rank correlation coefficients are derived for the proposed model, namely tie-corrected versions of Kendall's tau and Spearman's rho. For suitable choices of marginal distributions and copula, the joint survival function admits a closed-form expression, enabling maximum likelihood estimation and likelihood ratio testing. Simulation studies and a real data application demonstrate the practical utility of the proposed approach. An R package, curecopula, implementing the proposed methods is publicly available on GitHub.

2604.23127 2026-04-28 physics.geo-ph cs.LG stat.AP

A Dynamic Learning Observatory Reveals the Rapid Salinization of Satkhira, Bangladesh

Showmitra Kumar Sarkar, Sai Ravela

Soil salinity is a major environmental challenge in coastal Bangladesh, threatening agricultural productivity and local livelihoods. This study develops a machine-learning-based framework to predict and map soil salinity in Satkhira district by integrating field observations with Landsat-derived spectral indices. A total of 205 soil samples collected during 2024-2025 were used to train an Extreme Gradient Boosting (XGBoost) model, and predictions were further improved using a Generalized Additive Model (GAM). Spatial cross-validation was applied to reduce autocorrelation bias, and bootstrap resampling was used to quantify prediction uncertainty. The results show strong spatial variability of soil salinity, with higher concentrations in the southern and central coastal regions and lower levels in the northern inland areas. Vegetation indices, particularly NDVI, along with salinity-related spectral indicators, were identified as key predictors. 10-year-window peak-exposure maps generated for 2014-2023 reveal recurrent high-salinity zones and a persistent, expanding footprint of moderate-to-high salinity exposure across the central parts of the district. Uncertainty analysis indicates higher variability in coastal zones and improved prediction stability when multi-year datasets are combined. The proposed framework provides a robust and scalable approach for long-term monitoring of soil salinity. It supports climate-resilient agriculture, land-use planning, and evidence-based decision-making in coastal Bangladesh.

2604.23107 2026-04-28 stat.ML cs.LG stat.ME

MOCA: A Transformer-based Modular Causal Inference Framework with One-way Cross-attention and Cutting Feedback

Lei Wang, Debashis Ghosh

Comments 25 pages, 8 figures, 4 tables. Preprint

Causal effect estimation from observational data requires careful adjustment for confounding. Classical estimators such as inverse probability weighting and augmented inverse probability weighting are effective under favorable model specification, but may become unstable when treatment assignment and outcome mechanisms are complex, non-linear, and high-dimensional. Machine learning and representation learning approaches improve flexibility, yet joint training can allow outcome-related information to influence treatment-side representations, which is undesirable from a causal perspective. We propose MOCA (Modular One-way Causal Attention), a transformer-based framework that separates treatment and outcome modeling through a modular design, and performs confounder adjustment using a one-way attention mechanism. A cutting-feedback strategy, implemented via gradient detachment, prevents the outcome loss from updating the treatment module. This design preserves directional information flow while retaining the representational power of transformer architectures for causal inference. Across multiple simulated scenarios, including linear, nonlinear, heavy-tailed, hidden confounding, and high-dimensional settings, MOCA shows competitive or improved performance relative to IPW, AIPW, X-learner, TARNet, and DragonNet. We further illustrate the method on the Infant Health and Development Program dataset and the Dehejia-Wahba dataset as real-world benchmarks. These results suggest that modular attention with one-way information flow provides a promising and interpretable direction for causal inference with modern deep learning models.

2604.23085 2026-04-28 stat.ME

Using Importance Sampling to Estimate $p$-values in All-Subset Meta-Analysis, with Applications to Single-Cell eQTL Mapping

Samuel Anyaso-Samuel, Thong Luong, Fei Qin, Jiyeon Choi, Kai Yu, Paul S. Albert, Jianxin Shi

Comments 18 pages, 3 Figures

Pooling genome-wide association studies of multiple related traits can substantially increase power for detecting genetic variants with pleiotropic effects. ASSET, which exhaustively searches all subsets of studies for association signals, has been widely used to detect modest effects and improve interpretability. Under a normality assumption, ASSET computes p-values via an analytic approximation that accounts for multiple testing. However, this approximation has been evaluated only in limited scenarios and for p-values no smaller than $10^{-3}$. A systematic assessment in the extreme tail is therefore needed, yet naïve Monte Carlo methods would require prohibitively many simulations. We develop a computationally efficient importance-sampling (IS) algorithm that provides accurate ASSET p-value estimates for both independent and overlapping studies, achieving substantial efficiency gains over naïve Monte Carlo, particularly for very small p-values. Using IS, we show that ASSET's analytic approximation is highly accurate across nearly the entire p-value range when normality holds. In contrast, when normality is violated (due to small sample sizes, low-frequency variants, or non-normal traits), ASSET p-values can be inflated or deflated by orders of magnitude, whereas our IS approach remains accurate. We illustrate the method through applications to single-cell eQTL mapping using peripheral blood mononuclear cells from the OneK1K cohort and lung cells from a Korean population.
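
The gap between naïve Monte Carlo and importance sampling for very small p-values can be illustrated on a much simpler problem than ASSET. The sketch below is not the authors' algorithm: it estimates the one-sided tail probability $P(Z > t)$ of a standard normal by sampling from a proposal shifted to $N(t, 1)$ and reweighting by the density ratio. The function names, the choice of proposal, and the threshold `t = 5` are all assumptions made for illustration.

```python
import math
import random


def normal_tail_exact(t):
    """Exact P(Z > t) for standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def naive_mc_tail(t, n, rng):
    """Naive Monte Carlo: fraction of N(0, 1) draws that exceed t."""
    hits = sum(1 for _ in range(n) if rng.gauss(0.0, 1.0) > t)
    return hits / n


def importance_sampling_tail(t, n, rng):
    """Importance sampling with a shifted proposal N(t, 1) (illustrative choice).

    Each draw x from the proposal is weighted by the density ratio
    phi(x) / phi(x - t) = exp(-t * x + t**2 / 2).
    """
    total = 0.0
    for _ in range(n):
        x = rng.gauss(t, 1.0)
        if x > t:
            total += math.exp(-t * x + 0.5 * t * t)
    return total / n


if __name__ == "__main__":
    rng = random.Random(0)
    t, n = 5.0, 20000
    print("exact              :", normal_tail_exact(t))
    print("naive Monte Carlo  :", naive_mc_tail(t, n, rng))
    print("importance sampling:", importance_sampling_tail(t, n, rng))
```

With a tail probability of roughly $3 \times 10^{-7}$, a naïve sample of 20,000 draws will almost always contain no exceedances, whereas about half of the draws from the shifted proposal land in the tail and contribute a weighted term.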

2604.23083 2026-04-28 stat.ML cs.LG stat.ME

Turtle shell clustering: A mixture approach to discriminative clustering with applications to flow cytometry and other data

Mackenzie R. Neal, Paul D. McNicholas, Arthur White

Generative approaches to clustering provide information on geometric properties of clusters, whereas discriminative approaches provide boundaries between clusters. Ideas from both approaches are incorporated to present a fully unsupervised, probabilistic, and discriminative clustering method via a regularized mutual information objective function, wherein a mixture of mixtures of Gaussian and uniform distributions is used for formulation of the conditional model. Automatic selection of the number of components is established with the introduction of the regularizing term and a merge step, similar to those applied in reversible jump Markov chain Monte Carlo methods used in Bayesian clustering. Consequently, the turtle shell method -- a fully unsupervised clustering method capable of estimating non-linear boundary lines, automatically selecting the number of components, and capturing intuitive clusters in the presence of data abnormalities such as noise and/or irregular cluster shapes -- is introduced. We test this method on various simulated and real datasets commonly explored in clustering research, and extend the analysis to datasets arising from flow cytometry experiments.

2604.23046 2026-04-28 cs.LG cs.IT cs.SI math.IT stat.ML

Shape of Memory: a Geometric Analysis of Machine Unlearning in Second-Order Optimizers

Kennon Stewart

Comments Full experiment data available at secondstreetlabs.io

We argue that current definitions of machine unlearning are underspecified for second-order optimizers. We compare first-order and second-order learners for their ability to handle the data deletion task with varying degrees of eigendecomposition to mimic the loss model memory. While both first- and second-order methods realign with the ideal counterfactual in terms of performance and gradient, the second-order optimizer shows significant volatility in the optimizer state. This indicates residual information, supposedly deleted, that isn't detectable by first-order analysis. Various eigendecay treatments show that stability and information loss are regained only under controlled state perturbation where geometric information (or memory) is erased.

2604.23029 2026-04-28 stat.ME

Sampling distributions for complex design variance estimators in a Fay-Herriot model

Alana McGovern, Geir-Arne Fuglstad, Jon Wakefield

Fay-Herriot (FH) models with variance smoothing typically use chi-squared sampling distributions for the design variance estimators. This choice is only valid under strong assumptions on the population and the sampling design, and the choice of sampling distribution is understudied for complex survey designs such as the stratified two-stage clustering design used by the Demographic and Health Surveys (DHS). DHS conducts surveys in low- and middle-income countries, and these surveys result in low sample sizes for unplanned domains of interest. Thus, accounting for the uncertainty in the estimated design variances is important. We derive two sampling distributions under the DHS design, a simple one and a more complex one, while clearly specifying and discussing the required superpopulation and design assumptions. In a simulation study, we compare the two sampling distributions to the empirical sampling distributions, and the resulting FH models with variance smoothing to the standard FH model. We find that the standard model exhibits undercoverage, while the variance smoothing models produce better credible intervals according to proper scoring rules. Interestingly, the simple sampling distribution, which is easiest to implement, performs as well as the more complex sampling distribution. We illustrate the proposed models by estimating height-for-age z-scores using the 2022 Kenya DHS.

2604.23022 2026-04-28 cs.IR cs.LG stat.ML

CASP: Support-Aware Offline Policy Selection for Two-Stage Recommender Systems

Nilson Chapagain

Comments 10 pages

Two-stage recommender systems first choose a candidate generator and then rank items within the generated set. Because the generator decides which items are available to the ranker, changing the generator changes both the policy value and the data support used to estimate that value. This creates an offline selection problem that standard single-stage objectives do not capture: a policy may look good under a retrieval score or a raw off-policy value estimate, but still be unreliable if it depends on weakly supported generator-item pairs. We propose CASP (Coupled Action-Set Pessimism), a support-aware offline selector for finite libraries of two-stage recommender policies. CASP combines doubly robust value estimation with a support-burden penalty. We show that stagewise rules that ignore downstream continuation value can be arbitrarily suboptimal, and we derive population, finite-class, and reconstructed-propensity guarantees for conservative selection. In simulations and a reconstructed MovieLens 1M application, CASP selects lower-burden policies when estimated value and support credibility are in tension.

2604.23006 2026-04-28 stat.ME

Estimation of Time-Varying Treatment Effects in a Joint Model for Longitudinal and Recurrent Event Outcomes in Mobile Health Data

Madeline R Abbott, Jeremy M G Taylor, Inbal Nahum-Shani, Lindsey N Potter, David W Wetter, Cho Y Lam, Walter Dempsey

Comments Main text is 25 pages with 6 figures and 1 table

Details
English abstract

Not only does mobile health technology enable researchers to track changes in multiple longitudinal outcomes of interest and to record the occurrence of health-related events over time, but it also allows for the delivery of repeated low-cost treatments directly to individuals in real time. We present a model-based approach for estimating the effect of repeatedly delivered treatments in a micro-randomized trial (MRT) via an extension of a joint longitudinal-survival model. We discuss different ways that these repeated treatment effects can be incorporated into the joint model; these different model specifications correspond to different mechanisms by which treatment is assumed to impact the longitudinal and event processes. Taking a Bayesian approach to inference, we model the association between repeated treatments, multiple longitudinally measured outcomes, and recurrent events. We also demonstrate how to calculate information criteria for model selection and present goodness-of-fit plots for assessing survival submodel calibration. We then illustrate the performance of our method via simulations and analysis of data collected in an MRT of substance use.

2604.22998 2026-04-28 stat.OT

Perceptions and Utilization of GenAI Tools among Data Science Students and Faculty

Abeer M. Hasan, Sayed A. Mostafa

Details
English abstract

This study investigates perceptions and use of generative artificial intelligence (GenAI) tools among students and faculty in statistics and data science at a historically Black college or university. Survey data from 119 valid student responses and 14 faculty responses were used to examine familiarity, usage patterns, perceived benefits, awareness of limitations, and instructional support needs. Students reported substantial use of GenAI, with ChatGPT as the dominant tool, primarily for coding assistance and writing support. Although student perceptions of AI in data science workflows and careers were generally positive, confidence in interpreting AI-generated outputs was limited, and concerns about accuracy, reliability, and over-reliance were common. Faculty also viewed GenAI favorably, but self-rated proficiency and the frequency of classroom integration remained limited. Comparisons across student subgroups suggested that familiarity with GenAI and awareness of its limitations varied more by academic level than by gender. These findings highlight a gap between AI adoption and AI literacy and underscore the need for structured training, validation practices, and clearer institutional guidance for responsible AI integration in data science education.

2604.22980 2026-04-28 stat.ME math.ST stat.TH

Testing independence in the presence of missing data: high-dimensional case

Marija Cuparić, Bojana Milošević, Jelena Radojević

Details
English abstract

In this paper, we consider the problem of testing independence in high-dimensional settings with missing data. Building upon a recently proposed Kendall-based statistic, we introduce two new modifications specifically designed to accommodate incomplete observations. The proposed methods are studied from both theoretical and empirical perspectives. A comprehensive simulation study illustrates the robustness and applicability of the new approaches. The findings contribute to the development of nonparametric methods for analyzing high-dimensional and incomplete data structures.

2604.22967 2026-04-28 stat.ML cs.LG

Rethinking Trust Region Bayesian Optimization in High Dimensions

Wei-Ting Tang, Joel A. Paulson

Details
English abstract

Trust Region Bayesian Optimization (TuRBO) is an effective strategy for alleviating the curse of dimensionality in high-dimensional black-box optimization. However, inappropriate lengthscale design can cause the local Gaussian process (GP) model within the trust region to degenerate, leading to suboptimal performance in high dimensions. In this work, we show that TuRBO's local GP may remain either excessively complex or overly simple as the dimension $D$ and trust region side length $L$ vary. To address this issue, we propose a straightforward variant, AdaScale-TuRBO, which scales the GP lengthscale with both the problem dimension and trust region size, thereby preserving kernel geometry and maintaining consistent prior complexity. Empirically, we show that AdaScale-TuRBO can robustly outperform standard TuRBO and other popular high-dimensional BO methods on synthetic benchmarks and real-world trajectory planning tasks.

2604.22965 2026-04-28 stat.ME

Agreement coefficients for continuous variables: A review

Ronny Vallejos

Comments 33 pages, 7 figures, 1 table

Details
English abstract

Agreement coefficients provide a fundamental framework for quantifying the concordance between two or more measurement methods applied to the same continuous variable. Unlike correlation, which measures the strength of a linear relationship, agreement focuses on assessing whether measurements are numerically similar, capturing both precision and accuracy. This review provides a comprehensive overview of the primary statistical approaches for assessing agreement between continuous variables. Such a synthesis is timely, as it has been 15-20 years since the last major review in the field. Beginning with the seminal contributions of Bland and Altman (1986) and Lin (1989), the paper discusses extensions of their methods to robust, multivariate, and repeated-measures settings, as well as recent developments like the probability of agreement and measures based on alternative distance functions. Special attention is given to probabilistic and spatial generalizations, including frameworks designed for geostatistical and areal data, which have become increasingly relevant in modern applications such as image analysis and environmental statistics. Through illustrative examples and comparative discussions, this review highlights the evolution, connections, and limitations of existing agreement measures, identifying open challenges and directions for future research.

2604.22941 2026-04-28 math.AG math.ST stat.TH

Sobolev embedding theorem and subanalytic measures

Guillaume Valette

Details
English abstract

We focus on Borel measures that have a globally subanalytic density function. We prove, given such a measure $μ$ on a set $A$ and a globally subanalytic mapping $Φ:A\to Ω$, with $Ω$ a bounded open subset of $\mathbb{R}^n$, a Sobolev embedding theorem for the Sobolev space $W^{k,p}_{Φ_*μ}(Ω)$ of the push-forward measure $Φ_*μ$. We derive an embedding of $W^{k,p}_{Φ_*μ}(Ω)$ into the space of inner Lipschitz functions and give an application to kernel theory.

2604.22925 2026-04-28 stat.AP cs.SD

Come Together: Analyzing Popular Songs Through Statistical Embeddings

Matthew Esmaili Mallory, Mark Glickman, Jason Brown

Details
English abstract

Statistical modeling of popular music presents a unique challenge due to the complexity of song structures, which cannot be easily analyzed using conventional statistical tools. However, recent advances in data science have shown that converting non-standard data objects into real vector-valued embeddings enables meaningful statistical analysis. In this work, we demonstrate an approach based on logistic principal component analysis to construct embeddings from global song features, allowing for standard multivariate analysis. We apply this method to a corpus of Lennon and McCartney songs from 1962-1966, using embeddings derived from chords, melodic notes, chord and pitch transitions, and melodic contours. Our analysis explores how these song embeddings cluster by Beatles album, how songwriting styles evolved over time, and whether Lennon and McCartney's compositions exhibited convergence or divergence. This embedding-based approach offers a powerful framework for statistically examining musical structure and stylistic development in popular music.

2604.22907 2026-04-28 physics.med-ph stat.ME

Fingertip Micro-Motion as a Source of Respiratory Information During Sleep Using Triaxial Accelerometers

Jeanne Lin, Lily Liu, Hau-Tieng Wu

Details
English abstract

Objective: Triaxial accelerometers (TAAs) are widely used in homecare medicine. This study investigates whether TAA signals recorded at the fingertip encode respiratory information, particularly instantaneous respiratory rate (IRR) and respiratory effort, during sleep. Method: We propose an antiderivative-based nonlinear transformation to convert TAA signals into a respiratory surrogate, termed TAA-resp. To quantify the embedded respiratory-induced motion, a modern time-frequency analysis tool is applied to derive an index, referred to as the respiratory motion index (RMI). The proposed TAA-resp and RMI are validated on a dataset comprising 39 full-night recordings with simultaneous polysomnography (PSG) and fingertip TAA measurements. Criteria for labeling TAA-resp signal quality as good, moderate, or poor are established, and expert annotations are obtained. Result: On average, TAA-resp over 22.2% $\pm$ 15.6% of full-night recordings encodes high-quality respiratory information, reaching up to 58.9% in some cases. TAA-resp shows stronger correlation with thoracic and abdominal motion than with airflow, indicating predominant capture of respiratory effort. High-quality TAA-resp offers an accurate IRR estimate with root mean square error $0.027 \pm 0.022$ Hz. RMI is higher for high-quality segments and lower for poor-quality segments, and its distribution aligns with physiology, with higher values during REM, N2, and N3 sleep and in the absence of apnea or hypopnea events. In leave-one-subject-out cross-validation, RMI predicts quality labels with 0.74 sensitivity and 0.75 specificity. Conclusion: Fingertip-mounted TAAs encode meaningful respiratory information. Leveraging this underutilized signal may enhance home-based sleep monitoring in channel-limited settings.

2604.22813 2026-04-28 stat.AP math.PR

Cyclic fractional Gaussian noise: time and frequency domain properties

Hubert Woszczek, Agnieszka Wylomanska

Details
English abstract

This article introduces cyclic fractional Gaussian noise (cfGn), a stochastic model that integrates second-order cyclostationarity with the long-range dependence property. While classical cyclostationary processes are widely discussed in the literature, they often lack the capacity to account for the persistent, slow-decaying correlations found in complex empirical data. To bridge this gap, we extend the amplitude-modulated stationary framework by utilizing increments of two-dimensional fractional Brownian motion (2d fBm) as the underlying driving process. The proposed cfGn model is constructed by summing two components, which include periodic deterministic functions modulating the univariate coordinates of 2d fGn. We provide a rigorous derivation of the considered model's properties, specifically the autocovariance function (ACVF) and frequency-domain characteristics, including the cyclic spectrum. Through theoretical considerations of asymptotic properties and Monte Carlo simulations, we demonstrate that cfGn preserves the periodic behavior of the ACVF while inheriting long-memory traits, which are manifested in both the time and frequency domains. This framework offers a robust foundation for applications such as signal-based condition monitoring in systems where periodic fault components coexist with long-range dependent background noise.

2604.22812 2026-04-28 cs.CY cs.LG stat.AP

Cross-Course Generalizability of SRL-Aligned Predictive Models Using Digital Learning Traces

Jakob Schwerter, Loreen Sabel, Judith Bose, Matthew L. Bernacki, Di Xu, Marko Schmellenkamp, Thomas Zeume, Philipp Doebler

Details
English abstract

STEM dropout rates remain high at universities, particularly in computer science programs with theory-intensive courses. Digital learning environments now capture rich behavioral data that could help identify struggling students early, yet the generalizability of data-driven prediction models across courses and institutions remains uncertain. Guided by self-regulated learning (SRL) theory, this study analyzed multimodal digital-trace data from three undergraduate theoretical computer science courses (N1 = 137, N2 = 104, N3 = 148) at two universities. Weekly SRL-aligned digital-trace indicators were modeled using Elastic Net, Random Forest, and XGBoost to evaluate predictive performance over time and across settings, and model calibration both within and across courses. Early prediction of at-risk students was feasible, with SRL-related behaviors such as time management, effort regulation, and sustained engagement emerging as key predictors. While Random Forest achieved the highest in-sample accuracy, Elastic Net generalized more robustly across contexts. Out-of-sample accuracy and calibration declined between institutions with different base rates, underscoring the contextual nature of predictive analytics in higher education. These findings suggest that digital learning traces enable early identification of at-risk students within courses, but generalizing predictive models beyond their original context requires caution, particularly if the at-risk rates differ between contexts.

2604.22807 2026-04-28 math.OC cs.LG cs.SY eess.SY stat.ML

Sliced Wasserstein Steering between Gaussian Measures

Kaito Ito, Anqi Dong

Comments Accepted at the European Control Conference 2026

Details
English abstract

Optimal transport with quadratic cost provides a geometric framework for steering an ensemble, modeled by a probability law, with minimal effort. Yet ambient-space formulations become unwieldy in high dimensions, and sensing or actuation in practice often reveals only linear views of the state -- camera silhouettes, LiDAR beams, tomographic slices. We develop a sliced feedback controller for distribution steering: the evolving law is projected onto one-dimensional directions on the sphere, the optimal one-dimensional velocity is synthesized in each projection, and these velocities are averaged to produce a feedback control in the ambient space. The construction reduces to the Benamou--Brenier problem in one dimension. In addition, it is invariant under orthogonal transforms, nonexpansive under projections, and well posed on $\mathcal{P}_2(\mathbb{R}^n)$. Computation proceeds by sampling directions on the sphere and solving independent one-dimensional subproblems, yielding a scalable method aligned with partial observations. In the Gaussian setting, we show that the developed sliced controller steers the law to the prescribed target. Furthermore, we derive an identity relating the energy consumption incurred by the controller to the sliced Wasserstein distance.
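
The slicing idea in this abstract can be illustrated numerically. The sketch below is not the authors' controller; it only estimates a squared sliced 2-Wasserstein distance between two Gaussians by Monte Carlo over directions, using the standard closed form for one-dimensional Gaussians, $W_2^2 = (m_1 - m_2)^2 + (s_1 - s_2)^2$. The function name, the number of directions, and the normalization (a plain average over uniformly sampled unit directions) are assumptions made for illustration.

```python
import numpy as np

def sliced_w2_squared_gaussian(m1, S1, m2, S2, n_dirs=20000, seed=0):
    """Monte Carlo estimate of the squared sliced W2 distance between
    N(m1, S1) and N(m2, S2), averaging over random unit directions.
    Hypothetical helper; the normalization convention is an assumption."""
    m1, m2 = np.asarray(m1, dtype=float), np.asarray(m2, dtype=float)
    S1, S2 = np.asarray(S1, dtype=float), np.asarray(S2, dtype=float)
    rng = np.random.default_rng(seed)

    # Uniform directions on the unit sphere: normalize Gaussian vectors.
    theta = rng.standard_normal((n_dirs, m1.shape[0]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)

    # Each projection of a Gaussian is a 1D Gaussian with
    # mean theta.m and standard deviation sqrt(theta' S theta).
    mean_gap = theta @ (m1 - m2)
    s1 = np.sqrt(np.einsum("ij,jk,ik->i", theta, S1, theta))
    s2 = np.sqrt(np.einsum("ij,jk,ik->i", theta, S2, theta))

    # Closed-form 1D W2^2 between Gaussians, averaged over directions.
    return float(np.mean(mean_gap**2 + (s1 - s2) ** 2))
```

Each direction is an independent one-dimensional subproblem, which is what makes this kind of computation easy to parallelize.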

2604.22802 2026-04-28 cs.DL cs.SI stat.AP

NIH-MPINet: A Large-Scale Feature-Rich Network Dataset for Mapping the Frontiers of Team Science

Cuiran Shi, Shuying Han, Shreya Kusumanchi, Mia Zhou, Didong Li

Details
English abstract

This study presents a large-scale network dataset, NIH-MPINet, curated from NIH RePORTER and PubMed, characterizing collaboration among multiple Principal Investigators (multi-PIs) on NIH R01-equivalent grants from 2006 to 2023. The network characterizes 30,127 PIs as nodes and their collaborations on 86,743 NIH R01-equivalent grants as edges, spanning 888 recipient organizations and supported by 40 NIH Institutes and Centers. We also curated comprehensive metadata, including node-level features such as PI affiliation, alongside edge-level features comprising grant years, titles, and abstracts. Using these data, we constructed a PI collaboration network and identified 19 communities as well as 20 major research topics. Several collaboration communities showed distinct thematic profiles, such as cardiovascular health, cancer immunotherapy, neuroscience, and microbiome research, while genetics and genomics were broadly represented across communities. By incorporating temporal analysis, we observed shifts in research topics and collaboration patterns over time. Topics like healthcare and outcomes research, cognitive health, and Alzheimer's disease have become more prominent in recent years, whereas molecular and cellular biology has seen a relative decline. Overall, this work provides a high-fidelity, feature-rich resource for advancing statistical learning methods and network analysis-based discoveries in the study of long-term biomedical collaboration.

2604.22793 2026-04-28 stat.AP

Research Funding as a Decision Problem Under Heavy-Tailed Uncertainty

Carlos Oscar S. Sorzano, B. Pueche-Granados

Details
English abstract

Heavy-tailed impact distributions, intrinsic uncertainty, and the high costs of proposal-based peer review increasingly challenge research funding decisions. Using large-scale bibliometric data, we show that past scientific performance provides statistically meaningful, though imperfect, information about future productivity and impact across multiple dimensions. An aggregated, percentile-normalised proxy signal captures this predictive structure robustly across research domains. We analyse deterministic and stochastic funding allocation mechanisms under impact-based objectives and find that both converge to highly concentrated allocations that favour a small number of top-performing researchers. To address the limitations of pure exploitation, we introduce a biased lottery framework based on a regularised decision-theoretic objective that explicitly balances exploration and exploitation while accounting for practical funding constraints. Our results suggest that biased lottery mechanisms offer a transparent, efficient, and scalable alternative to conventional peer review in environments characterised by heavy-tailed scientific returns. Additionally, we provide a web application, available at http://scilottery.biocomputingunit.es, that implements the deterministic allocation method presented in this work.

2604.21087 2026-04-28 stat.AP

Model quality in football: Quantifying the quality of an Expected Threat model

Koen van Arem, Jakob Söhl, Mirjam Bruinsma, Geurt Jongbloed

Details
English abstract

The recent growth in data availability in football has increased the risk of incorrect use of data-driven models, making guidelines on their validation and application necessary. The Expected Threat (xT) model is an accessible option for football organizations that start building in-house methods, yet little is known about how to assess its quality. The aim of this study is twofold: to examine how the model error depends on the number of game states and the number of training points, and to translate these results into guidelines for constructing and applying the model. Using the Markov chain underlying the model, we perform theoretical analyses and simulations to study the model error. These show that the model error is approximately log-normally distributed for a specified number of training points and game states. Additionally, we combine the simulations with expert consultation to establish the model error beyond which player evaluations based on the Expected Threat model become unreliable for scouting applications. From this, we derive rules of thumb to ensure the quality of an Expected Threat model before application, and we illustrate through an example how a validated model can be applied in practice. Because the approach generalizes to Expected Possession Value models, this paper illustrates a framework to systematically quantify model quality, despite the ground truth being unobservable in football analytics.
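
For readers unfamiliar with the Markov-chain view of Expected Threat, the following is a minimal sketch of the commonly described formulation, in which the value of a game state is the chance of scoring by shooting from it plus the expected value of the states reached by moving the ball. It is not taken from the paper: the function name, argument names, and fixed iteration count are assumptions for illustration.

```python
import numpy as np

def expected_threat(shoot_prob, goal_prob, move_prob, transition, n_iter=100):
    """Iterate xT = shoot_prob * goal_prob + move_prob * (transition @ xT).

    shoot_prob[s]   : probability of shooting from state s
    goal_prob[s]    : probability that a shot from state s scores
    move_prob[s]    : probability of moving the ball from state s
    transition[s,t] : probability that a move from s ends in state t
    All names are hypothetical; this is a sketch, not the paper's code.
    """
    shoot_prob = np.asarray(shoot_prob, dtype=float)
    goal_prob = np.asarray(goal_prob, dtype=float)
    move_prob = np.asarray(move_prob, dtype=float)
    transition = np.asarray(transition, dtype=float)

    xt = np.zeros_like(shoot_prob)
    for _ in range(n_iter):
        # Immediate scoring value plus value carried over from next states.
        xt = shoot_prob * goal_prob + move_prob * (transition @ xt)
    return xt
```

In practice the four inputs are estimated from event data, so the number of game states and the number of training points both affect how noisy the resulting values are, which is the dependence the abstract studies.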

2604.21066 2026-04-28 cs.CV cs.LG stat.ME

Optimizing Diffusion Priors in Image Reconstruction from a Single Observation

Frederic Wang, Katherine L. Bouman

Details
English abstract

While diffusion priors generate high-quality posterior samples across many inverse problems, they are often trained on limited training sets or purely simulated data, thus inheriting the errors and biases of these underlying sources. Current approaches to finetuning diffusion models rely on a large number of observations with varying forward operators, which can be difficult to collect for many applications, and thus lead to overfitting when the measurement set is small. We propose a method for tuning a prior from only a single observation by combining existing diffusion priors into a single product-of-experts prior and identifying the exponents that maximize the Bayesian evidence. We validate our method on real-world inverse problems, including black hole imaging, where the true prior is unknown a priori, and image deblurring with text-conditioned priors. We find that the evidence is often maximized by priors that extend beyond those trained on a single dataset. By generalizing the prior through exponent weighting, our approach enables posterior sampling from both tempered and combined diffusion models, yielding more flexible priors that improve the trustworthiness of the resulting posterior image distribution.
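
One way to write down the product-of-experts construction described above, under the assumption that the component priors $p_1, \dots, p_K$ share a common domain and that the exponents $\alpha_k$ are the quantities being tuned (the notation here is illustrative and not taken from the paper):

```latex
% Product-of-experts prior with exponents \alpha = (\alpha_1, \dots, \alpha_K)
p_{\alpha}(x) = \frac{\prod_{k=1}^{K} p_k(x)^{\alpha_k}}
                     {\int \prod_{k=1}^{K} p_k(x')^{\alpha_k} \, dx'}

% Bayesian evidence of a single observation y with likelihood p(y \mid x)
Z(\alpha) = \int p(y \mid x) \, p_{\alpha}(x) \, dx

% Exponents chosen by evidence maximization
\alpha^{\star} = \arg\max_{\alpha} Z(\alpha)
```

Setting a single $\alpha_k \neq 1$ with all others zero gives a tempered version of one prior, while several nonzero exponents give a combined prior, matching the two cases the abstract mentions.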

2604.20059 2026-04-28 stat.ME

Investigating Targeting Strategies and Truncation in TMLE for the Average Treatment Effect under Practical Positivity Violations

Yichen Xu, Susan Gruber, Mark J. van der Laan

Details
English abstract

Estimating average treatment effects from observational data is challenging under practical violations of the positivity assumption. Targeted Maximum Likelihood Estimators (TMLEs) are widely used because of their double robustness and efficiency, but they can remain sensitive to such violations. We conduct extensive simulation studies to examine how targeting strategies and truncation levels affect TMLE performance under varying degrees of outcome regression misspecification and practical positivity stress. We show that loss-weighted targeting can induce substantial systematic bias relative to clever-covariate-scaled targeting, while insufficient truncation for clever-covariate-scaled targeting leads to inflated variance and unstable estimation. We further find that fixed truncation rules of the form $c/(\sqrt{n}\,\log n)$, especially with $c = 5$ or $c = 6$, provide robust practical defaults in many settings, although the optimal choice varies with sample size. Motivated by the limitations of standard Lepski selection, we propose a Lepski-type adaptive truncation procedure with a brake mechanism that improves stability in data-adaptive tuning. We also compare variance estimators and find that targeted bootstrap variance estimation provides a stable alternative across truncation levels.
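
To make the fixed truncation rule concrete, here is a small sketch that computes the level c/(sqrt(n) log n) and applies it to estimated propensity scores. Several details are assumptions rather than statements from the abstract: the logarithm is taken to be natural, the bound is applied symmetrically so that scores are clipped to [delta, 1 - delta], and the level is capped at 0.5 so the interval stays non-empty for small n. The function names are hypothetical.

```python
import math
import numpy as np

def truncation_level(n, c=5.0):
    """Truncation level c / (sqrt(n) * log(n)); natural log is assumed."""
    if n < 2:
        raise ValueError("n must be at least 2 so that log(n) > 0")
    return c / (math.sqrt(n) * math.log(n))

def truncate_propensity(g, n, c=5.0):
    """Clip estimated propensity scores to [delta, 1 - delta].

    Symmetric clipping and the 0.5 cap are illustrative assumptions.
    """
    delta = min(truncation_level(n, c), 0.5)
    return np.clip(np.asarray(g, dtype=float), delta, 1.0 - delta)
```

For example, with n = 1000 and c = 5 the level is about 0.023, so any estimated propensity below that value is raised to it before the inverse weights are formed.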

2604.19391 2026-04-28 cs.IT eess.SP math.IT stat.AP

On the Practical Performance of Noise Modulation for Ultra-Low-Power IoT: Limitations, Capacity, and Energy Trade-offs

Felipe A. P. de Figueiredo, Pedro M. R. Pereira, Evandro C. Vilas Boas, Fernando D. A. Garcia, Hadi Zayyani, Rausley A. A. de Souza

Comments 5 pages, 5 figures, conference

Details
English abstract

Ultra-low-power (ULP) Internet of Things (IoT) applications demand communication architectures with minimal energy consumption. Noise Modulation (NoiseMod) addresses this by encoding data through the statistical variance of a noise-like signal, eliminating the need for a coherent carrier. To bridge the gap between theoretical potential and practical deployment, this paper benchmarks NoiseMod against standard modulations like BPSK and NC-FSK. We analytically derive the optimal detection threshold and Bit Error Rate (BER) for AWGN and Rayleigh fading channels. Our results show that non-coherent NoiseMod suffers a catastrophic error floor in fading environments, making architectural additions like channel state information (CSI) estimation and 2-antenna selection diversity desirable. Using an ADC-aware energy model, we reveal that NoiseMod's oversampling severely bottlenecks capacity and imposes an 8 dB SNR penalty compared to NC-FSK for a $10^{-3}$ BER in AWGN. Despite its oscillator-free design drastically reducing baseline circuit power, these limitations establish a critical energy crossover distance, which decreases with frequency. Below this distance, NoiseMod offers superior energy efficiency; beyond it, the radiated power needed to overcome its SNR penalty makes coherent schemes like BPSK vastly superior.

2604.14579 2026-04-28 stat.ME math.ST stat.TH

HASOD: A Hybrid Adaptive Screening-Optimization Design for High-Dimensional Industrial Experiments

Kumarjit Pathak

Details
English abstract

Industrial experimentation requires both factor screening to identify critical variables and response optimization to find optimal operating conditions. Traditional approaches treat these as separate phases, necessitating costly sequential experimentation and full experimental redesign between phases. This paper introduces HASOD (Hybrid Adaptive Screening-Optimization Design), a novel three-phase sequential framework that simultaneously addresses factor identification and response surface optimization within a unified adaptive structure. Phase 1 employs a modified Definitive Screening Design with an enhanced Cumulative Weighted Effect Screening Statistic (CWESS) incorporating interaction detection via ElasticNet regression. Phase 2 adaptively selects augmentation strategies -- from full factorial to Response Surface Methodology designs -- based on critical factors identified in Phase 1. Phase 3 applies Gaussian process-based global optimization with uncertainty-guided refinement near the predicted optimum. We prove that CWESS asymptotically separates active from inactive factors, providing classification consistency guarantees absent from most screening methodologies. Across six test scenarios, HASOD achieves 97.08% factor detection accuracy -- 13.75 percentage points above traditional sequential methods (83.33%) -- and significantly outperforms all eight competitor methods (p < 0.001). HASOD yields improved prediction performance (mean error: 3.61) while maintaining >=90% detection across all scenarios including interaction-heavy systems. The framework requires an average of 41.5 experimental runs -- a 43% increase over traditional approaches -- yet delivers superior detection accuracy with dramatically reduced prediction error. HASOD offers a theoretically grounded, unified framework that eliminates sequential redesign without sacrificing predictive capability.

2604.12263 2026-04-28 stat.ME econ.EM

Partial Identification of Policy-Relevant Treatment Effects with Instrumental Variables via Optimal Transport

Jiyuan Tan, Jose Blanchet, Vasilis Syrgkanis

Comments 105 pages, 5 figures

Details
English abstract

Policy-Relevant Treatment Effects (PRTEs) are generally not point-identified under standard Instrumental Variable (IV) assumptions when the instrument generates limited support in treatment propensity. We show that PRTE partial identification in the generalized Roy model can instead be formulated as a Constrained Conditional Optimal Transport (CCOT) problem over the joint conditional law of the potential outcome and the latent resistance. The resulting multidimensional CCOT problem reduces analytically to separable one-dimensional OT problems with product costs, yielding sharp closed-form bounds and avoiding direct solution of the original high-dimensional CCOT problem. We also develop estimation and inference procedures for these bounds: for discrete instruments, we use a Double Machine Learning (DML) approach based on Neyman-orthogonal scores that accommodates high-dimensional covariates while achieving the parametric $\sqrt{n}$ rate and asymptotic normality; for continuous instruments, we explicitly characterize the corresponding nonparametric convergence rates. The framework accommodates covariates, discrete and continuous instruments, and extensions to general treatment settings. In simulations and a bed-net subsidy application, the resulting bounds are substantially tighter than the moment-relaxation method.

2604.10482 2026-04-28 stat.ME

The Fréchet correlation coefficient for heterogeneous random objects

Shuaida He, Yangzhou Chen, Xin Chen

Details
English abstract

Modern regression analysis often involves responses and predictors taking values in the same or distinct metric spaces. To rank non-Euclidean heterogeneous predictors in regression by explanatory strength, analogous to the classical $R^2$, we introduce the Fréchet correlation coefficient (FCC), defined as the relative reduction in the Fréchet variance of the response after conditioning on a specific predictor. FCC is directional, model-free, and interpretable on a unit scale, attaining one under almost sure functional dependence and zero when the Fréchet mean is invariant to conditioning. We propose a novel partition-based estimator that avoids explicit nonparametric estimation of the conditional Fréchet mean function, thereby improving both computational efficiency and flexibility. A tailored wild bootstrap algorithm is further developed for testing the Fréchet conditional mean dependence. We establish asymptotic theory and evaluate power through extensive simulations.
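
The verbal definition above can be formalized in one natural way, sketched here for a response $Y$ in a metric space $(\Omega, d)$ and a predictor $X$. The notation is an assumption made for illustration, and the paper's exact definition may differ in detail.

```latex
% Marginal Fréchet variance of Y
V_Y = \inf_{\omega \in \Omega} \mathbb{E}\left[ d^2(Y, \omega) \right]

% Expected conditional Fréchet variance of Y given X
V_{Y \mid X} = \mathbb{E}\left[ \inf_{\omega \in \Omega}
               \mathbb{E}\left[ d^2(Y, \omega) \mid X \right] \right]

% Relative reduction in Fréchet variance, analogous to R^2
\mathrm{FCC}(Y \mid X) = \frac{V_Y - V_{Y \mid X}}{V_Y}
                       = 1 - \frac{V_{Y \mid X}}{V_Y}
```

Under this reading, $V_{Y \mid X} = 0$ when $Y$ is almost surely a function of $X$, giving a value of one, and $V_{Y \mid X} = V_Y$ when the conditional Fréchet mean does not depend on $X$, giving zero, which agrees with the two boundary cases stated in the abstract.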

2604.07011 2026-04-28 stat.ME stat.AP

Recovering manifold structure in LLM responses through a joint Euclidean mirror

Maximilian Baum, Aranyak Acharyya, Tianyi Chen, Avanti Athreya, Youngser Park, Francesco Sanna Passino, Carey E. Priebe, Zachary Lubberts

Comments 13 pages, 9 figures

Details
English abstract

Understanding the behavior of black-box large language models and determining effective means of comparing their performance is a key task in modern machine learning. We consider how large language models respond to a specific query by analyzing how the distributions of responses vary over different values of tuning parameters. We frame this problem in a general mathematical setting, treating the mapping from model parameters to response distributions as a structured family of probability measures, endowed with a geometry via a dissimilarity measure. We show how dissimilarities between response distributions can be represented in low-dimensional Euclidean space through a joint Euclidean mirror surface encoding the underlying geometry, which permits both qualitative and quantitative analysis of large language models and provides insight into predicting response distributions for different values of tuning parameters. We propose an estimation procedure for the underlying joint Euclidean mirror based on observed samples from the response distributions, and we prove its asymptotic properties. Additionally, we propose a statistically consistent procedure to infer the value of an unknown model parameter based on samples from the corresponding response distribution and the estimated joint Euclidean mirror. In an experimental setting with large language models, we find that changes in different tuning parameter values correspond to distinct directions in the embedding space, making it possible to estimate the tuning parameters that were used to generate a given response.

2603.18514 2026-04-28 stat.ML cs.LG

On the Peril of (Even a Little) Nonstationarity in Satisficing Regret Minimization

Yixuan Zhang, Ruihao Zhu, Qiaomin Xie

Comments 20 pages

Details
English abstract

Motivated by the principle of satisficing in decision-making, we study satisficing regret guarantees for nonstationary $K$-armed bandits. We show that in the general realizable, piecewise-stationary setting with $L$ stationary segments, the optimal regret is $Θ(L\log T)$ as long as $L\geq 2$. This stands in sharp contrast to the case of $L=1$ (i.e., the stationary setting), where a $T$-independent $Θ(1)$ satisficing regret is achievable under realizability. In other words, the optimal regret has to scale with $T$ even if just a little nonstationarity is present. A key ingredient in our analysis is a novel Fano-based framework tailored to nonstationary bandits via a \emph{post-interaction reference} construction. This framework strictly extends the classical Fano method for passive estimation as well as recent interactive Fano techniques for stationary bandits. As a complement, we also discuss a special regime in which constant satisficing regret is again possible.

2603.17281 2026-04-28 stat.AP

Improving causal inference in interrupted time series analysis: the triple difference design

Ariel Linden

Details
English abstract

Background: Interrupted time series analysis (ITSA) is widely used to evaluate health policy and intervention effects. While multiple-group ITSA (MG-ITSA) improves causal inference by incorporating a control group, residual confounding from unmeasured time-varying factors may remain. The triple-difference interrupted time series (DDD-ITSA) design extends this approach by adding a second control group to further isolate treatment effects, but it remains underutilized and lacks formal guidance. Methods: We formalize the DDD-ITSA framework, specify the regression model, define key parameters for estimating level and trend effects, and clarify interpretation of the triple-difference estimand. We illustrate the approach using a worked example evaluating California's Proposition 99 cigarette tax and its impact on per-capita cigarette sales. Results: In the example, all groups were balanced on pre-intervention level and trend. The triple-difference estimand indicated a statistically significant annual reduction of -1.76 per-capita cigarette packs in California relative to the secondary control (P = 0.020; 95 percent CI: -3.24, -0.28), consistent with results from the primary comparison. Differences between control groups were not significant. Conclusions: DDD-ITSA strengthens causal inference when two-group comparisons may be confounded by leveraging an additional control group to remove remaining biases and assess heterogeneity. Implementation is facilitated by updates to the itsa Stata package. Careful attention to control selection, baseline balance, and autocorrelation remains essential.

2603.12365 2026-04-28 cond-mat.mtrl-sci cs.LG cs.NA math.NA physics.comp-ph stat.CO

Optimal Experimental Design for Reliable Learning of History-Dependent Constitutive Laws

Kaushik Bhattacharya, Lianghao Cao, Andrew Stuart

Details
Journal ref
Computer Methods in Applied Mechanics and Engineering, Volume 457, 2026, 119022, ISSN 0045-7825
English abstract

History-dependent constitutive models serve as macroscopic closures for the aggregated effects of micromechanics. Their parameters are typically learned from experimental data. With a limited experimental budget, eliciting the full range of responses needed to characterize the constitutive relation can be difficult. As a result, the data can be well explained by a range of parameter choices, leading to parameter estimates that are uncertain or unreliable. To address this issue, we propose a Bayesian optimal experimental design framework to quantify, interpret, and maximize the utility of experimental designs for reliable learning of history-dependent constitutive models. In this framework, the design utility is defined as the expected reduction in parametric uncertainty or the expected information gain. This enables in silico design optimization using simulated data and reduces the cost of physical experiments for reliable parameter identification. We introduce two approximations that make this framework practical for advanced material testing with expensive forward models and high-dimensional data: (i) a Gaussian approximation of the expected information gain, and (ii) a surrogate approximation of the Fisher information matrix. The former enables efficient design optimization and interpretation, while the latter extends this approach to batched design optimization by amortizing the cost of repeated utility evaluations. Our numerical studies of uniaxial tests for viscoelastic solids show that optimized specimen geometries and loading paths yield image and force data that significantly improve parameter identifiability relative to random designs, especially for parameters associated with memory effects.

2602.07825 2026-04-28 stat.ME

Estimation Strategies for Causal Decomposition Analysis with Allowability Specifications

John W. Jackson, Ting-Hsuan Chang, Aster Meche, Trang Q. Nguyen

Details
English abstract

Causal decomposition analysis (CDA) is an approach for modeling the impact of hypothetical interventions to reduce disparities. It is useful for identifying targets that future interventions, including multilevel and multimodal interventions, could focus on to reduce disparities. Based within the potential outcomes framework, CDA has a causal interpretation when the identifying assumptions are met. CDA also allows an analyst to consider which covariates are allowable (i.e., fair) for defining the disparity in the outcome and in the point of intervention, so that its interpretation is also meaningful. While the incorporation of causal inference and allowability promotes robustness, transparency, and dialogue in disparities research, it can lead to challenges in estimation such as the need to correctly model densities. Also, how CDA differs from commonly used statistical decomposition estimators from the econometrics literature may not be clear, which may limit its uptake. To address these challenges, we provide a tour of estimation strategies for CDA, reviewing existing proposals and introducing novel estimators that overcome key estimation challenges. Among them we introduce what we call "bridging" estimators that avoid modeling any density, and sequential weighted regression estimators that are multiply robust. Additionally, we provide diagnostics to assess the quality of the nuisance density models and weighting functions they rely on. We formally establish the estimators' robustness to model mis-specification, demonstrate their performance through a simulation study based on real data, and apply them to study disparities in uncontrolled hypertension using electronic health records in a large healthcare system.

2601.18390 2026-04-28 math.PR math.ST stat.TH

Convergence in distribution of the P-P process in $L^1[0,1]$

Brendan K. Beare, Tetsuya Kaji

Comments 7 pages

Details
English abstract

We show that the percentile-percentile (P-P) process constructed from an independent and identically distributed sample of pairs converges in distribution in $L^1[0,1]$ if and only if the associated P-P curve is absolutely continuous. When this condition holds, the limiting distribution is Gaussian and the process admits a valid bootstrap approximation.

2601.09925 2026-04-28 stat.ME

High Dimensional Gaussian and Bootstrap Approximations in Generalized Linear Models

Mayukh Choudhury, Debraj Das

Details
English abstract

The generalized linear model (GLM) extends ordinary linear regression by linking the mean of the response variable to covariates through appropriate link functions. The GLM is widely used in the analysis of datasets arising from diverse fields including medical sciences, clinical trials, population surveys and risk analysis. In this paper, we investigate the Gaussian and Bootstrap approximations of GLM under two separate high dimensional regimes: (I) when the dimension $d$ grows slower than $n$ and (II) when $d$ grows exponentially with $n$. Under regime (I), we essentially show that the Gaussian approximation holds over the collection of Borel convex sets when $d = o\big(n^{2/5}\big)$ and over the collection of Euclidean balls when $d = o\big(n^{1/2}\big)$. We further devise two high dimensional Bootstrap methods which are valid over the collections of Borel convex sets and Euclidean balls under the same dimension growth rates. Then we move to regime (II), where we impose sparsity on the GLM through the Lasso. We show that the high dimensional Gaussian approximation fails under regime (II). However, the Bootstrap approximations over convex sets and Euclidean balls are valid for the relevant part of the GLM estimator provided $\log d = o\big(n^{2\tau/3}\big)$ and the number of non-zero regression parameters is $o\big(n^{1/3 - 4\tau/3}\big)$, when the Lasso penalty $\lambda_n \sim n^{1/2 + \tau}$, for some $\tau\in (0, 1/4)$. Simulation studies confirm the strong finite-sample performance of our proposed Bootstrap methods under both regimes (I) and (II). We also implement our methods on real datasets.

2512.03467 2026-04-28 cs.LG stat.ME

Bayesian Event-Based Model for Disease Subtype and Stage Inference

Hongtao Hao, Joseph L. Austerweil

Comments 32 pages; machine learning for health symposium (2025); Proceedings of the 5th Machine Learning for Health Symposium in PMLR

Details
Journal ref
Proceedings of Machine Learning Research (PMLR), vol. 297: Machine Learning for Health (ML4H), 2025
English abstract

Chronic diseases often progress differently across patients. Rather than varying randomly, disease progression typically follows a small number of subtypes. To capture this structured heterogeneity, the Subtype and Stage Inference Event-Based Model (SuStaIn) estimates the number of subtypes, the order of disease progression for each subtype, and assigns each patient to a subtype from primarily cross-sectional data. It has been widely applied to uncover the subtypes of many diseases and inform our understanding of them. But how robust is its performance? In this paper, we develop a principled Bayesian subtype variant of the event-based model (BEBMS) and compare its performance to SuStaIn in a variety of synthetic data experiments with varied levels of model misspecification. BEBMS substantially outperforms SuStaIn across ordering, staging, and subtype assignment tasks. Further, we apply BEBMS and SuStaIn to a real-world Alzheimer's data set. We find BEBMS has results that are more consistent with the scientific consensus of Alzheimer's disease progression than SuStaIn.

2511.21992 2026-04-28 stat.ME stat.AP

Design-based nested instrumental variable analysis

Zhe Chen, Xinran Li, Michael O. Harhay, Bo Zhang

Details
English abstract

Two binary instrumental variables (IVs) are nested if individuals who comply under one binary IV also comply under the other. This situation often arises when the two IVs represent different intensities of encouragement or discouragement to take the treatment, with one stronger than the other. In a nested IV structure, treatment effects can be identified for two latent subgroups: always-compliers and switchers. Always-compliers are individuals who comply even under the weaker IV, while switchers are those who do not comply under the weaker IV but do under the stronger IV. We introduce a novel pair-of-pairs nested IV design, where each matched stratum consists of four units organized in two pairs. We develop design-based inference for the always-complier sample average treatment effect and switcher sample average treatment effect. In a nested IV analysis, IV assignment is randomized within each IV pair; however, whether a study unit receives the weaker or stronger IV may not be randomized. To address this complication, we then propose a novel partly biased randomization scheme and study design-based inference under this new scheme. Using extensive simulation studies, we demonstrate the validity of the proposed method even in challenging scenarios with small sample sizes and a low proportion of switchers. Applying the nested IV framework, we estimated that 52.2% (95% CI: 50.4%-53.9%) of participants enrolled at the Henry Ford Health System in the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial were always-compliers, while 26.7% (95% CI: 24.5%-28.9%) were switchers. Among always-compliers, flexible sigmoidoscopy was associated with a trend toward a decreased colorectal cancer rate. No effect was detected among switchers. This offers a richer interpretation of why no increase in the intention-to-treat effect was observed after 1997, even though the compliance rate rose.

2511.15161 2026-04-28 math.ST stat.ME stat.TH

Design-based finite-sample analysis for regression adjustment

Dogyoon Song

Comments 9 pages + Appendix; AISTATS 2026

Details
English abstract

In randomized experiments, regression adjustment can improve the precision of average treatment effect (ATE) estimation using covariates without requiring a correctly specified outcome model. Although well studied in low-dimensional settings, its behavior in high-dimensional regimes, where the number of covariates $p$ may exceed the number of observations $n$, remains underexplored. Moreover, existing analyses are largely asymptotic, providing limited guidance for finite-sample inference. We develop a design-based, non-asymptotic framework for analyzing the regression-adjusted ATE estimator under complete randomization. This yields finite-sample-valid confidence intervals with explicit, instance-adaptive widths, even when $p > n$. While these intervals rely on oracle (population-level) quantities, we also outline data-driven envelopes computable from observed data. Our approach hinges on a refined swap sensitivity analysis of an estimator: stochastic fluctuation is controlled via a variance-adaptive Doob martingale and Freedman's inequality, and design bias is bounded by Stein's method of exchangeable pairs. The analysis elucidates how covariate geometry governs concentration and bias of the adjusted estimator, suggesting when and how regression adjustment can be effective.

2510.23976 2026-04-28 stat.AP

Forecasting Arctic Temperatures With Quantile Machine Learning

Richard Berk

Comments 30 pages, 8 figures

Details
English abstract

Using data from the Longyearbyen weather station, quantile gradient boosting ("small AI") is applied to forecast daily temperatures in Svalbard, Norway. Temperatures above 0 degrees Celsius are of special interest because of their impact on ice, snow, and tundra permafrost. To improve forecasting skill for warmer temperatures, the target quantile is 0.60; forecast underestimates are weighted 1.5 times more heavily than forecast overestimates when the quantile loss is computed. Predictors include eight routinely collected indicators of weather conditions, each lagged by 14 days, yielding temperature forecasts with a two-week lead time. Adaptive conformal prediction regions quantify forecasting uncertainty with provably valid coverage. Using a holdout sample, a forecast of 0 degrees Celsius is correct 14 days later at least 80% of the time. Implications for Arctic adaptation policy are discussed.
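The asymmetric weighting described in this abstract follows from the standard quantile (pinball) loss: at a target quantile of 0.60, an underestimate costs 0.60 per unit of error and an overestimate costs 0.40, a ratio of 1.5. The sketch below is only an illustration of that loss; the function name `pinball_loss` and the example values are assumptions, not taken from the paper.

```python
def pinball_loss(y_true, y_pred, q=0.60):
    """Quantile (pinball) loss for a single forecast at target quantile q."""
    error = y_true - y_pred
    # error >= 0 means the forecast was too low (an underestimate): cost q per unit.
    # error < 0 means the forecast was too high (an overestimate): cost (1 - q) per unit.
    return q * error if error >= 0 else (q - 1.0) * error


# Hypothetical one-degree errors in each direction.
under = pinball_loss(y_true=2.0, y_pred=1.0)  # forecast one degree too low
over = pinball_loss(y_true=1.0, y_pred=2.0)   # forecast one degree too high
ratio = under / over                          # approximately 1.5 when q = 0.60
```

Raising `q` above 0.5 pushes the fitted forecasts upward, which is consistent with the stated aim of improving skill for warmer temperatures.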

2510.18067 2026-04-28 stat.CO stat.ME

A Neural-Mean Vecchia Gaussian Process for Unified Argo Modeling

Nian Liu, Jian Cao

Details
English abstract

Argo is an international program that collects temperature and salinity observations in the upper two kilometers of the global ocean. Most existing approaches for modeling Argo temperature rely on localized modeling within moving windows, first estimating a prescribed mean structure and then fitting Gaussian processes (GPs) to the mean-subtracted anomalies. Such strategies introduce challenges in designing suitable mean structures and defining local moving windows, often resulting in case-specific modeling choices. In this work, we propose a one-stop Gaussian process regression framework with a flexible mean structure and a generic spatio-temporal covariance function to jointly model Argo temperature data across broad spatial domains. Our fully data-driven approach achieves predictive performance that compares favorably with the established benchmarks that require moving-window regression and separate parametric mean estimation. To ensure scalability over large spatial regions, we employ the Vecchia approximation, which reduces the computational complexity from cubic to quasi-linear in the number of observations while preserving predictive accuracy. Using Argo data from January to March over the years 2007-2016, the same dataset used in prior benchmark studies, we demonstrate that our approach provides a unified and data-driven alternative for large-scale oceanographic analysis.

2510.08893 2026-04-28 stat.AP

Quantifying Very Extreme Precipitation and Temperature Using Huge Ensembles Generated by Machine Learning-based Climate Model Emulators

Christopher J. Paciorek, Daniel Cooley

Comments 28 pages, 11 figures, 5 appendix figures. Published online in Bulletin of the American Meteorological Society on 2026-03-30

Details
English abstract

Weather extremes produce major impacts on society and ecosystems and are likely to change in likelihood and magnitude with climate change. However, very low probability events are hard to characterize statistically using observations or even climate model output because of short records/runs. For precipitation, consideration of such events arises in quantifying Probable Maximum Precipitation (PMP), namely estimating extreme precipitation magnitudes for designing and assessing critical infrastructure. A recent National Academies report on modernizing PMP estimation proposed using very large climate model-based ensembles to estimate extreme quantiles, possibly through machine learning-based ensemble boosting. Here we assess statistical aspects of such an approach for the contiguous United States using a huge ensemble (10560 years) produced by a state-of-the-art emulator (ACE2) trained on ERA5 reanalysis. The results indicate that one can practically estimate very extreme precipitation and temperature quantiles, provided one uses appropriate statistical extreme value techniques. More specifically, the results provide evidence for (1) the use of threshold-exceedance methods with a sufficiently high threshold (necessary for precipitation) for reliable estimation, (2) the robustness of results to variation in extremes by season and storm type, and (3) the sufficiency of the ensemble for well-constrained statistical uncertainty. Our results also show that the emulator produces extremes outside the range of the ERA5 training data. While encouraging for emulators' potential use for quantifying the climatology of extremes, more investigation is needed to assess whether emulators are fit for this purpose. Our focus is on how to use huge ensembles to estimate very extreme statistics; we expect the results to be relevant for future improved emulators.

2509.09758 2026-04-28 stat.AP

A Path Signature Framework for Detecting Creative Fatigue in Digital Advertising

Charles Shaw

Comments version 3

Details
English abstract

This paper introduces a signature-based framework for detecting advertising creative fatigue using path signatures, a geometric representation from rough path theory. Creative fatigue, the degradation of creative effectiveness under repeated exposure, is operationally important in digital marketing because delayed detection can translate directly into avoidable opportunity cost. We reframe fatigue monitoring as a geometric change detection problem: advertising performance trajectories are embedded as paths and represented by truncated (log-)signatures, enabling detection of changes in trend, volatility, and non-linear dynamics beyond simple mean or variance shifts. We further connect statistical detection to managerial decision-making via an explicit quantification of performance loss relative to a benchmark period. Because proprietary production data cannot be released, we evaluate the proposed framework on a synthetic panel dataset designed to mimic realistic impression volumes and noisy day-to-day CTR dynamics. We define observed CTR as the realised binomial rate $CTR_t := C_t/I_t$ using daily clicks $C_t$ and impressions $I_t$. The accompanying CSV also contains a pre-computed CTR field (e.g., due to rounding or upstream derivation), but all modelling and evaluation in this paper use $C_t/I_t$. Crucially, the dataset does not include injected changepoints; we therefore define an operational ground truth for "fatigue onset" based on a noise-robust CTR estimate and a sustained deterioration relative to a recent-best baseline. We report lead-time (early warning) and alert-burden metrics under this operational definition, and provide a sensitivity analysis over the detector's primary tuning parameters. The methodology scales linearly in time-series length for fixed signature depth and is suitable for monitoring large creative portfolios.

2506.14062 2026-04-28 cs.DS math.PR stat.CO

Exact and Efficient Sampling from Dynamic Discrete Distributions with Finite-Precision Weights

Lilith Orion Hafner, Adriano Meligrana

Comments Submitted to ESA 2026

Details
English abstract

Sampling from a dynamic discrete distribution means drawing an index with probability proportional to a mutable set of weights. Classical constant-time techniques such as the Alias Method are well suited to static distributions, but become expensive in dynamic settings because updates require rebuilding auxiliary tables. Existing dynamic approaches, including Forest of Trees and BUcket Sampling (BUS), achieve reasonable practical performance but require infinite precision real arithmetic to be correct and produce meaningfully incorrect results when implemented on real hardware. We present EBUS (Exact BUcket Sampling), a dynamic sampler for finite-precision weights that is exact by construction: every returned index has probability exactly proportional to its represented weight. Our guarantees are proved in a word RAM model with bounded exponent range. In that model, our method supports $O(1)$ worst-case expected sampling time, $O(1)$ amortized time to update a single weight, $O(n)$ space, and $O(n)$ construction. We also provide an implementation for IEEE 64-bit floating-point weights and show experimentally that it is competitive with, and often faster than, several implementations of previous inexact methods while avoiding their numerical failure modes.
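To make the problem statement concrete, here is a deliberately naive baseline: a linear-scan sampler over non-negative integer weights. It is not EBUS and makes no claim to the $O(1)$ bounds in the abstract; sampling is $O(n)$. It only illustrates what "draw an index with probability proportional to a mutable weight" means, and it stays exact because it uses integer arithmetic. The class name and the optional `r` argument are assumptions introduced for illustration.

```python
import random


class LinearScanSampler:
    """Naive exact sampler over non-negative integer weights (O(n) per sample)."""

    def __init__(self, weights):
        self.weights = list(weights)
        self.total = sum(self.weights)

    def update(self, index, weight):
        """Change one weight in O(1) by adjusting the running integer total."""
        self.total += weight - self.weights[index]
        self.weights[index] = weight

    def sample(self, r=None):
        """Return index i with probability weights[i] / total.

        r is an integer in [0, total); it is drawn uniformly when omitted
        (which requires total > 0) and can be passed explicitly for testing.
        """
        if r is None:
            r = random.randrange(self.total)
        for index, weight in enumerate(self.weights):
            if r < weight:
                return index
            r -= weight
        raise ValueError("r must satisfy 0 <= r < total")


sampler = LinearScanSampler([1, 0, 3])
first = sampler.sample(r=0)    # falls in the slot of index 0
second = sampler.sample(r=1)   # index 1 has weight 0, so this lands on index 2
sampler.update(1, 2)           # weights become [1, 2, 3]
third = sampler.sample(r=1)    # now lands on index 1
```

With floating-point weights, the running total in `update` would accumulate rounding error, which is the kind of inexactness the abstract says affects earlier dynamic samplers.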

2505.11771 2026-04-28 cs.LG cs.AI math.ST stat.ML stat.TH

Residual Feature Integration is Sufficient to Prevent Negative Transfer

Yichen Xu, Ryumei Nakada, Linjun Zhang, Lexin Li

Details
Journal ref
The Fourteenth International Conference on Learning Representations (ICLR 2026)
English abstract

Transfer learning has become a central paradigm in modern machine learning, yet it suffers from the long-standing problem of negative transfer, where leveraging source representations can harm rather than help performance on the target task. Although empirical remedies have been proposed, there remains little theoretical understanding of how to reliably avoid negative transfer. In this paper, we investigate a simple yet remarkably effective strategy: augmenting frozen, pretrained source-side features with a trainable target-side encoder that adapts target features to capture residual signals overlooked by models pretrained on the source data. We show this residual feature integration strategy is sufficient to provably prevent negative transfer, by establishing theoretical guarantees that it has no worse convergence rate than training from scratch under the informative class of target distributions up to logarithmic factors, and that the convergence rate can transition seamlessly from nonparametric to near-parametric when source representations are informative. To our knowledge, this is the first theoretical work that ensures protection against negative transfer. We carry out extensive numerical experiments across image, text and tabular benchmarks, and empirically verify that the method consistently safeguards performance under distribution shift, label noise, semantic perturbation, and class imbalance. We additionally demonstrate that this residual integration mechanism uniquely supports adapt-time multimodality extension, enabling a pretrained single-cell foundation model to incorporate spatial signals for lymph-node anatomical classification despite the source model being trained without them. Our study thus advances the theory of safe transfer learning, and provides a principled approach that is simple, robust, architecture-agnostic, and broadly applicable.

2505.06754 2026-04-28 stat.ME

Post-treatment problems: What can we say about the effect of a treatment among sub-groups who (would) respond in some way?

Chad Hazlett, Nina McMurry, Tanvi Shinkre

Details
English abstract

Investigators are often interested in how a treatment affects an outcome for units responding to treatment in a certain way. We may wish to know the effect among units that, for example, meaningfully implemented an intervention, passed an attention check, or demonstrated some important mechanistic response. Simply conditioning on the observed value of the post-treatment variable introduces problematic biases. Further, the identification assumptions required of several existing strategies are often indefensible. We propose the Treatment Reactive Average Causal Effect (TRACE), which we define as the total effect of treatment in the group that, if treated, would realize a particular value of the relevant post-treatment variable. By reasoning about the effect among the "non-reactive" group, we can identify and estimate the range of plausible values for the TRACE. We demonstrate the use of this approach with three examples: (i) learning the effect of police-perceived race on police violence during traffic stops, a case where point identification may be possible; (ii) estimating effects of a community-policing intervention in Liberia, in communities that meaningfully implemented it; and (iii) studying how in-person canvassing affects support for transgender rights, among participants for whom the intervention would result in more positive feelings towards transgender people.

2504.04267 2026-04-28 cs.DS cs.DM cs.IT math.IT math.PR stat.CO

Efficient Rejection Sampling in the Entropy-Optimal Range

Thomas L. Draper, Feras A. Saad

Details
Journal ref
IEEE Transactions on Information Theory, 72, 5, pp 2801-2822, May 2026
English abstract

We study the problem of generating a random variate $X$ from a finite discrete probability distribution $P$ using an entropy source of independent fair coin flips. A classic result from Knuth and Yao shows that the optimal expected number of input coin flips per output sample lies between $H(P)$ and $H(P)\,{+}\,2$, where $H$ is the Shannon entropy function. However, implementing the Knuth and Yao "entropy-optimal" sampler entails a tradeoff between using either exponential space with low runtime per sample, or linear space with high runtime per sample. We introduce a new sampling algorithm that avoids this tradeoff: it requires linearithmic space, incurs negligible runtime overhead per sample, and uses an expected number of coin flips that lies in the entropy-optimal range $[H(P), H(P)\,{+}\,2)$. No previous sampler for discrete distributions simultaneously achieves these space, time, and entropy characteristics. Numerical experiments demonstrate improvements in runtime and entropy of the proposed method compared to the celebrated alias method.
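The entropy-optimal range $[H(P), H(P)+2)$ quoted above is easy to compute for a concrete distribution. The following sketch only evaluates the Shannon entropy in bits and the resulting interval; it does not implement the Knuth and Yao sampler or the paper's algorithm. The function names and the example distribution are assumptions for illustration.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy H(P) in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def entropy_optimal_range(probs):
    """Return (H(P), H(P) + 2), the bounds on expected coin flips per sample."""
    h = shannon_entropy(probs)
    return h, h + 2.0


# Hypothetical dyadic distribution over four outcomes.
P = [0.5, 0.25, 0.125, 0.125]
low, high = entropy_optimal_range(P)  # low is 1.75 bits for this P
```

For this dyadic example the lower bound is attained exactly, since each outcome with probability $2^{-k}$ can be produced with exactly $k$ flips.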

2502.03669 2026-04-28 cs.LG cs.AI cs.DM math.OC stat.ML

Unrealized Expectations: Comparing AI Methods vs Classical Algorithms for Maximum Independent Set

Yikai Wu, Haoyu Zhao, Sanjeev Arora

Comments Published on TMLR 04/2026. 28 pages, 6 figures, 98 tables

Details
Journal ref
Transactions on Machine Learning Research (2026)
English abstract

AI methods, such as generative models and reinforcement learning, have recently been applied to combinatorial optimization (CO) problems, especially NP-hard ones. This paper compares such GPU-based methods with classical CPU-based methods on the Maximum Independent Set (MIS) problem. Strikingly, even on in-distribution random graphs, leading AI-inspired methods are consistently outperformed by the state-of-the-art classical solver KaMIS running on a single CPU, and some AI-inspired methods frequently fail to surpass even the simplest degree-based greedy heuristic. Even with post-processing techniques like local search, AI-inspired methods still perform worse than CPU-based solvers. To better understand the source of these failures, we introduce a novel analysis, serialization, which reveals that non-backtracking AI-inspired methods, e.g. LTFT (which is based on GFlowNets), end up reasoning similarly to the simplest degree-based greedy, and thus worse than KaMIS. More generally, our findings suggest a need for a rethinking of current approaches in AI for CO, advocating for more rigorous benchmarking and the principled integration of classical heuristics. Additionally, we find that the CPU-based algorithm KaMIS performs strongly on sparse random graphs, which appears to show that the shattering threshold conjecture for large independent sets proposed by Coja-Oghlan & Efthymiou (2015) does not apply for real-life sizes (such as $10^6$ nodes).
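For readers unfamiliar with the baseline, a common form of degree-based greedy for independent sets is sketched below: repeatedly pick a vertex of minimum residual degree, add it to the set, and delete it together with its neighbours. The exact variant used in the paper is not specified in the abstract, so this tie-breaking rule and the adjacency-dict representation are assumptions.

```python
def greedy_mis(adj):
    """Minimum-degree greedy heuristic for an independent set.

    adj maps each node to the set of its neighbours in an undirected graph.
    Returns the chosen nodes in the order they were picked.
    """
    # Copy the graph so the caller's adjacency sets are left untouched.
    remaining = {v: set(nbrs) for v, nbrs in adj.items()}
    independent = []
    while remaining:
        # Choose a node of minimum residual degree; ties are broken by node label.
        v = min(remaining, key=lambda u: (len(remaining[u]), u))
        independent.append(v)
        # Delete v and all of its neighbours from the residual graph.
        removed = remaining[v] | {v}
        for u in removed:
            remaining.pop(u, None)
        for nbrs in remaining.values():
            nbrs -= removed
    return independent


# Path graph 0-1-2-3-4, whose maximum independent set is {0, 2, 4}.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
result = greedy_mis(path)
```

On this small path the heuristic happens to find an optimal set; in general it gives no such guarantee, which is why the abstract treats it as the simplest baseline.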

2410.21922 2026-04-28 stat.CO math.ST stat.TH

Extending Sheldon M. Ross's Method for Efficient Large-Scale Variance Computation

Jiawen Li

Details
English abstract

We introduce Prior Knowledge Acceleration (PKA), a batch-update method for variance that reuses previously computed sufficient statistics to avoid full recomputation. The update identity is algebraically equivalent to the pairwise formula of Chan, Golub, and LeVeque (1983); our contribution is a runtime-cost analysis that derives an explicit acceleration factor $\tau_a$ and identifies the data-size regime where batch updating outperforms both naïve recomputation and Ross's single-sample method. We prove that Ross's approach is preferable only when the new batch contains a single sample ($N_2 = 1$). We further generalise the framework to covariance and other decomposable statistics. Benchmarks against Welford, Chan pairwise, and naïve two-pass baselines on synthetic and real-world streaming data confirm the theoretical predictions, with speedups of up to $454\times$ when the prior dataset is large relative to the new batch.
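Since the abstract states that the update identity is algebraically equivalent to the pairwise formula of Chan, Golub, and LeVeque, that formula can serve as a minimal sketch of a batch variance update. The code below implements the pairwise merge of (count, mean, sum of squared deviations); it is not the authors' PKA code, and the function names and sample data are assumptions.

```python
def summarize(xs):
    """Return (count, mean, M2), where M2 is the sum of squared deviations."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs)
    return n, mean, m2


def merge(a, b):
    """Combine two summaries with the Chan-Golub-LeVeque pairwise formula."""
    n1, mean1, m2_1 = a
    n2, mean2, m2_2 = b
    n = n1 + n2
    delta = mean2 - mean1
    mean = mean1 + delta * n2 / n
    # The cross term corrects for the shift between the two batch means.
    m2 = m2_1 + m2_2 + delta * delta * n1 * n2 / n
    return n, mean, m2


# Hypothetical prior data and a new batch.
prior = [2.0, 4.0, 4.0, 4.0, 5.0]
batch = [5.0, 7.0, 9.0]
n, mean, m2 = merge(summarize(prior), summarize(batch))
variance = m2 / n  # population variance of all eight values
```

Only the summary of `prior` is needed at merge time, so the cost of an update depends on the size of the new batch and not on the size of the data already seen.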

2410.09810 2026-04-28 stat.ME

Doubly unfolded adjacency spectral embedding of dynamic multiplex graphs

Maximilian Baum, Francesco Sanna Passino, Axel Gandy

Comments 39 pages, 4 figures

Details
English abstract

Many real-world networks evolve dynamically over time and present different types of connections between nodes, often called layers. In this work, we propose a latent position model for these objects, called the dynamic multiplex random dot product graph (DMPRDPG), which uses an inner product between layer-specific and time-specific latent representations of the nodes to obtain edge probabilities. We further introduce a computationally efficient spectral embedding method for estimation of DMPRDPG parameters, called doubly unfolded adjacency spectral embedding (DUASE). The DUASE estimates are proved to be both consistent and asymptotically normally distributed. A key strength of our method is the encoding of time-specific node representations and layer-specific effects in separate latent spaces, which allows the model to capture complex behaviors while maintaining relatively low dimensionality. The embedding method we propose can also be efficiently used for subsequent inference tasks. In particular, we highlight the use of the ISOMAP algorithm in conjunction with DUASE as a way to efficiently capture trends and global changepoints within a network, and the use of DUASE for graph clustering. Applications on real-world networks describing geopolitical interactions between countries and financial news reporting demonstrate practical uses of our method.

2405.16730 2026-04-28 cs.LG cs.AI stat.AP

"Noisier" Noise Contrastive Estimation is (Almost) Maximum Likelihood

Peiyu Yu, Dinghuai Zhang, Hengzhi He, Xiaojian Ma, Sirui Xie, Ruiyao Miao, Yifan Lu, Yasi Zhang, Deqian Kong, Ruiqi Gao, Jianwen Xie, Guang Cheng, Ying Nian Wu

Comments ICLR 2026

Details
English abstract

Noise Contrastive Estimation (NCE) has fueled major breakthroughs in representation learning and generative modeling. Yet a long-standing challenge remains: accurately estimating ratios between distributions that differ substantially, which significantly limits the applicability of NCE on modern high-dimensional and multimodal datasets. We revisit this problem from a less explored perspective: the magnitude of the noise distribution. Specifically, we show that with a virtually scaled (i.e., artificially increased) noise magnitude, the gradient of the NCE objective can closely align with that of Maximum Likelihood, enabling a trajectory-wise approximation from NCE to MLE, and faster convergence both theoretically and empirically. Building on this insight, we introduce "Noisier" NCE, a simple drop-in modification to vanilla NCE that incurs little to no extra computational cost, while effectively handling density-ratio estimation in challenging regimes where traditional MLE and NCE struggle. Beyond improving classical density-ratio learning, "Noisier" NCE proves broadly applicable: it achieves strong results across image modeling, anomaly detection, and offline black-box optimization. On CIFAR-10 and ImageNet64x64 datasets, it yields 10-step and even 1-step samplers that match or surpass state-of-the-art methods, while cutting training iterations by up to half.

2309.00578 2026-04-28 cs.LG math.ST stat.TH

Consistency of Lloyd's Algorithm Under Perturbations

Dhruv Patel, Hui Shen, Shankar Bhamidi, Yufeng Liu, Vladas Pipiras

Details
Abstract

In the context of unsupervised learning, Lloyd's algorithm is one of the most widely used clustering algorithms. It has inspired a plethora of work investigating the correctness of the algorithm under various settings with ground truth clusters. In particular, Lu and Zhou (2016) showed that the mis-clustering rate of Lloyd's algorithm on $n$ independent samples from a sub-Gaussian mixture is exponentially bounded after $O(\log(n))$ iterations, assuming proper initialization of the algorithm. However, in many applications, the true samples are unobserved and need to be learned from the data via pre-processing pipelines such as spectral methods on appropriate data matrices. We show that the mis-clustering rate of Lloyd's algorithm on perturbed samples from a sub-Gaussian mixture is also exponentially bounded after $O(\log(n))$ iterations, provided that the algorithm is properly initialized and the perturbation is small relative to the sub-Gaussian noise. In canonical settings with ground truth clusters, we derive bounds for algorithms such as $k$-means++ to find good initializations, which leads to the correctness of clustering via the main result. We show the implications of the results for pipelines that measure the statistical significance of clusters derived from data, such as SigClust. We use these general results to provide theoretical guarantees on the mis-clustering rate of Lloyd's algorithm in a host of applications, including high-dimensional time series, multi-dimensional scaling, and community detection for sparse networks via spectral clustering.
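
The setting described in this abstract can be illustrated with a minimal sketch: plain Lloyd iterations run on perturbed samples from a two-component Gaussian mixture. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function name `lloyd`, the mixture means, the noise and perturbation scales, and the hand-picked initial centers.

```python
import random


def lloyd(points, centers, n_iter=20):
    """Plain Lloyd iterations: assign to nearest center, then recompute means."""
    centers = [tuple(c) for c in centers]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center in squared Euclidean distance.
        labels = [
            min(range(len(centers)),
                key=lambda j: sum((x - c) ** 2 for x, c in zip(p, centers[j])))
            for p in points
        ]
        # Update step: each center becomes the mean of its assigned points.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(
                    sum(m[d] for m in members) / len(members)
                    for d in range(len(old))))
            else:
                new_centers.append(old)  # empty cluster: keep the old center
        if new_centers == centers:
            break  # converged: assignments can no longer change
        centers = new_centers
    return labels, centers


# Hypothetical setup: two well-separated clusters with unit Gaussian noise.
rng = random.Random(0)
true_means = [(-3.0, 0.0), (3.0, 0.0)]
truth = [i % 2 for i in range(200)]
clean = [tuple(m + rng.gauss(0, 1) for m in true_means[z]) for z in truth]

# The algorithm only sees perturbed samples; the perturbation (scale 0.1)
# is small relative to the mixture noise (scale 1).
perturbed = [tuple(x + rng.gauss(0, 0.1) for x in p) for p in clean]

init = [(-1.0, 0.0), (1.0, 0.0)]  # assumed "proper" initialization
labels, centers = lloyd(perturbed, init)
mis_rate = sum(l != z for l, z in zip(labels, truth)) / len(truth)
print("mis-clustering rate:", mis_rate)
```

The initial centers are ordered to match the true labels, so the mis-clustering rate can be computed without searching over label permutations; a general evaluation would need that extra step.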

2105.04332 2026-04-28 cs.LG stat.ML

Bayesian Optimistic Optimisation with Exponentially Decaying Regret

Hung Tran-The, Sunil Gupta, Santu Rana, Svetha Venkatesh

Comments To appear at ICML 2021 (21 pages)

Details
Journal ref
PMLR 139:10390-10400, 2021
Abstract

Bayesian optimisation (BO) is a well-known efficient algorithm for finding the global optimum of expensive, black-box functions. The current practical BO algorithms have regret bounds ranging from $\mathcal{O}(\frac{\log N}{\sqrt{N}})$ to $\mathcal{O}(e^{-\sqrt{N}})$, where $N$ is the number of evaluations. This paper explores the possibility of improving the regret bound in the noiseless setting by intertwining concepts from BO and tree-based optimistic optimisation, which are based on partitioning the search space. We propose the BOO algorithm, a first practical approach which can achieve an exponential regret bound with order $\mathcal{O}(N^{-\sqrt{N}})$ under the assumption that the objective function is sampled from a Gaussian process with a Matérn kernel with smoothness parameter $\nu > 4 + \frac{D}{2}$, where $D$ is the number of dimensions. We perform experiments on optimisation of various synthetic functions and machine learning hyperparameter tuning tasks and show that our algorithm outperforms baselines.
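
To get a feel for how far apart the three rates quoted in this abstract are, the short sketch below evaluates them for a few values of $N$. It compares only the expressions inside the $\mathcal{O}(\cdot)$ and ignores all constants, so the numbers indicate orders of magnitude rather than actual regret values; the function names are hypothetical.

```python
import math


def rate_sqrt(n):
    """log(N) / sqrt(N): the slowest rate quoted in the abstract."""
    return math.log(n) / math.sqrt(n)


def rate_exp(n):
    """exp(-sqrt(N)): the fastest previously known rate quoted."""
    return math.exp(-math.sqrt(n))


def rate_boo(n):
    """N ** (-sqrt(N)): the rate stated for BOO."""
    return n ** (-math.sqrt(n))


for n in (16, 100, 400):
    print(n, rate_sqrt(n), rate_exp(n), rate_boo(n))
```

Since $N^{-\sqrt{N}} = e^{-\sqrt{N}\log N}$, the third rate falls below $e^{-\sqrt{N}}$ as soon as $\log N > 1$.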

2009.02539 2026-04-28 stat.ML cs.IT cs.LG math.IT

Sub-linear Regret Bounds for Bayesian Optimisation in Unknown Search Spaces

Hung Tran-The, Sunil Gupta, Santu Rana, Huong Ha, Svetha Venkatesh

Comments 34th Conference on Neural Information Processing Systems (NeurIPS 2020)

Details
Abstract

Bayesian optimisation (BO) is a popular method for efficient optimisation of expensive black-box functions. Traditionally, BO assumes that the search space is known. However, in many problems, this assumption does not hold. To this end, we propose a novel BO algorithm which expands (and shifts) the search space over iterations, controlling the expansion rate through a hyperharmonic series. Further, we propose another variant of our algorithm that scales to high dimensions. We show theoretically that for both our algorithms, the cumulative regret grows at sub-linear rates. Our experiments with synthetic and real-world optimisation tasks demonstrate the superiority of our algorithms over the current state-of-the-art methods for Bayesian optimisation in unknown search spaces.

1911.11950 2026-04-28 stat.ML cs.LG

Trading Convergence Rate with Computational Budget in High Dimensional Bayesian Optimization

Hung Tran-The, Sunil Gupta, Santu Rana, Svetha Venkatesh

Comments Our accepted paper (with Supplementary Material) at AAAI 2020

Details
Abstract

Scaling Bayesian optimisation (BO) to high-dimensional search spaces is an active and open research problem, particularly when no assumptions are made on function structure. The main reason is that at each iteration, BO requires finding the global maximum of an acquisition function, which is itself a non-convex optimisation problem in the original search space. With growing dimensions, the computational budget for this maximisation becomes increasingly insufficient, leading to inaccurate solutions. This inaccuracy adversely affects both the convergence and the efficiency of BO. We propose a novel approach where the acquisition function only requires maximisation on a discrete set of low-dimensional subspaces embedded in the original high-dimensional search space. Unlike many recent high-dimensional BO methods, our method makes no low-dimensional structure assumption on the function. Optimising the acquisition function in low-dimensional subspaces allows our method to obtain accurate solutions within a limited computational budget. We show that in spite of this convenience, our algorithm remains convergent. In particular, the cumulative regret of our algorithm only grows sub-linearly with the number of iterations. More importantly, as evident from our regret bounds, our algorithm provides a way to trade the convergence rate against the number of subspaces used in the optimisation. Finally, when the number of subspaces is "sufficiently large", our algorithm's cumulative regret is at most $\mathcal{O}^{*}(\sqrt{T\gamma_T})$ as opposed to $\mathcal{O}^{*}(\sqrt{DT\gamma_T})$ for the GP-UCB of Srinivas et al. (2012), removing a crucial factor of $\sqrt{D}$, where $D$ is the dimension of the input space.