arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.01628 2026-05-05 stat.ML cs.LG math.ST stat.TH

Self-Normalized Martingales and Uniform Regret Bounds for Linear Regression

Fan Chen, Jian Qian, Alexander Rakhlin, Nikita Zhivotovskiy

Abstract

Self-normalized martingale inequalities lie at the heart of confidence ellipsoids for online least squares and, more broadly, many bandit and reinforcement-learning results. Yet existing vector and scalar results typically rely on bounded covariates and an explicit regularization matrix, producing bounds that are \emph{not scale-invariant}: although the self-normalized quantity is scale-invariant by definition, its standard upper bounds are not. We characterize when scale-invariant upper bounds on self-normalized martingales are possible. Without further assumptions, we prove that nontrivial scale-invariant bounds exist only in dimension $d=1$; moreover, in $d=1$ we obtain $O(\log T)$ scale-invariant self-normalized bounds without any assumptions on the covariates. In contrast, for $d>1$ we show that no nontrivial scale-invariant bound can hold in full generality. We then connect this dichotomy to \emph{doubly-uniform} regret in online linear regression (i.e., regret bounds that are simultaneously independent of the covariate scale and the comparator norm) and use it to resolve the open question of Gaillard, Gerchinovitz, Huard, and Stoltz, \emph{``Uniform regret bounds over $\mathbb{R}^d$ for the sequential linear regression problem with the square loss''} (ALT 2019): in $d=1$ we give an explicit algorithm with $O(\log T)$ doubly-uniform regret, whereas for $d>1$ sublinear doubly-uniform regret is impossible. Finally, under a natural \emph{smoothness} condition (bounded Radon--Nikodym derivatives of the conditional covariate laws with respect to a fixed base measure), we recover sublinear regret for $d>1$ without bounded covariates and derive a self-normalized concentration inequality free of the usual regularization penalties, yielding arguably a first natural scale-invariant bound for adaptive, non-i.i.d. vector martingales.
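For orientation, the self-normalized quantity discussed above is usually written as follows. The notation ($x_t$ for covariates, $\varepsilon_t$ for martingale-difference noise, $V_T$, and the invertible matrix $A$) is an assumption borrowed from the standard online least-squares setting, not taken from the abstract, and $V_T$ is assumed invertible.

```latex
\[
  S_T = \sum_{t=1}^{T} \varepsilon_t x_t, \qquad
  V_T = \sum_{t=1}^{T} x_t x_t^{\top}, \qquad
  \|S_T\|_{V_T^{-1}}^{2} = S_T^{\top} V_T^{-1} S_T .
\]
% Rescaling every covariate, x_t -> A x_t with A invertible, leaves it unchanged:
\[
  (A S_T)^{\top} \bigl(A V_T A^{\top}\bigr)^{-1} (A S_T)
  = S_T^{\top} V_T^{-1} S_T .
\]
```

The regularized variant that replaces $V_T$ with $V_T + \lambda I$ does not share this invariance, which is the gap between the quantity and its standard bounds that the abstract points to.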

2605.01624 2026-05-05 math.AT stat.AP stat.ML

Persistent Homology of Time Series through Complex Networks

İsmail Güzel

Abstract

We present a unified pipeline for univariate time series classification via complex networks and persistent homology. A time series is mapped to a graph through one of five constructions across three families: visibility (natural and horizontal visibility graphs), transition, and proximity. The graph is then converted to a dissimilarity matrix from which a Vietoris-Rips filtration yields persistence diagrams. These diagrams are vectorized into fixed-length features through persistence landscapes and topological summary statistics. By standardizing the downstream processing, differences in classification performance are attributable to the network construction and distance metric alone. Experiments on twelve UCR benchmarks show that (i) no single construction dominates: the optimal graph type depends on the signal's discriminative structure; (ii) the graph distance metric is a first-order design choice, with diffusion distance uniformly outperforming shortest-path alternatives; and (iii) persistence-based features degrade gracefully under noise, consistent with the classical stability theorem of persistent homology.
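The first two stages of such a pipeline can be sketched in a few lines. This is a minimal illustration, not the author's code: it builds a natural visibility graph and a shortest-path dissimilarity matrix, and the function names are hypothetical. The abstract reports that diffusion distance works better than shortest-path distance; shortest paths are used here only because they need no linear algebra. The resulting matrix would then be passed to a persistent homology library for the Vietoris-Rips step, which is not shown.

```python
from collections import deque


def natural_visibility_graph(series):
    """Return adjacency sets of the natural visibility graph of `series`."""
    n = len(series)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            # i and j see each other if every intermediate point lies strictly
            # below the straight line joining (i, y_i) and (j, y_j).
            visible = all(
                series[k] < series[j] + (series[i] - series[j]) * (j - k) / (j - i)
                for k in range(i + 1, j)
            )
            if visible:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def shortest_path_matrix(adj):
    """All-pairs hop distances by breadth-first search from every node."""
    n = len(adj)
    dist = [[float("inf")] * n for _ in range(n)]
    for source in range(n):
        dist[source][source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[source][v] == float("inf"):
                    dist[source][v] = dist[source][u] + 1
                    queue.append(v)
    return dist


graph = natural_visibility_graph([1.0, 3.0, 2.0, 4.0])
distances = shortest_path_matrix(graph)
```

Consecutive time points are always mutually visible, so the graph is connected and every entry of `distances` is finite.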

2605.01615 2026-05-05 stat.ME stat.AP stat.OT

Threshold Exceedance Estimation in Spatially Correlated Areal Data Using Maxima-Nominated Sampling

Mohammad Jafari Jozani

Comments 26 pages, 4 figures, 6 tables

Abstract

We study estimation of the proportion of areal units in a spatially correlated domain whose success probabilities exceed a prespecified threshold. Such problems arise in health surveillance, environmental monitoring, and social policy, where the goal is to estimate the fraction of high-risk areas. We propose a DUST-MNS design that combines maxima-nominated sampling (MNS) with the probability-proportional-to-size dependent unit sequential technique (pps-DUST), thereby promoting spatial spread while mitigating the effect of spatial autocorrelation. The design forms $n$ candidate sets of size $k$ and obtains final measurements only from the area judged to be at highest risk in each set, yielding $n$ measured areas from $nk$ screened candidates. Ranking may be based on expert judgment, prior surveys, or easily obtained auxiliary covariates. We derive a closed-form estimator of the exceedance probability $θ$ based on data from the DUST-MNS design, establish its bias and variance, and show that, in the rare-to-moderate exceedance regime $θ<θ^\star(k)$, the proposed DUST-MNS estimator outperforms its SRS and DUST-SRS counterparts, where $θ^\star(k)$ depends only on $k$. We also provide guidance on the choice of $k$, derive efficiency bounds under a Beta model, extend the method to imperfect ranking, and develop variance estimation and bootstrap confidence intervals. An application to county-level stroke prevalence data from CDC PLACES, using diabetes prevalence as the ranking concomitant, illustrates the proposed approach.

2605.01608 2026-05-05 eess.SP stat.ME stat.ML

Why Model Selection Fails in Time Series Forecasting: An Empirical Study of Instability Across Data Regimes

Tahir Cetin Akinci, Alfredo A. Martinez-Morales

Abstract

Time series forecasting models often exhibit inconsistent performance across datasets with varying statistical and structural properties. Despite the wide range of available forecasting techniques, it remains unclear whether model selection can be reliably guided by simple data characteristics. This paper investigates why rule-based model selection fails in time series forecasting by analyzing the relationship between data-regime descriptors and model performance. A descriptor-based framework is introduced to characterize time series using measurable properties, including trend strength, seasonality, noise level, and temporal dependence. Based on these descriptors, a rule-based selection mechanism is formulated to map data regimes to candidate forecasting models. The approach is evaluated on multiple real-world datasets across different domains and forecasting horizons. The results show that rule-based model selection achieves low accuracy, with correct model identification occurring in only a small fraction of cases. Significant discrepancies are observed between recommended and empirically optimal models, particularly in noisy and mixed regimes. Further analysis reveals that model performance is highly sensitive to both dataset characteristics and forecasting horizon, resulting in substantial ranking instability across scenarios. These findings explain why simple heuristic rules fail to generalize and demonstrate that forecasting performance cannot be reliably predicted using static, descriptor-based approaches. This study provides empirical evidence that model selection in time series forecasting is inherently context-dependent and highlights the need for more adaptive, data-driven strategies.

2605.01606 2026-05-05 stat.ME math.ST stat.TH

L-Estimation of Population Quantiles Using Ranked Set Sampling

Mohammad Jafari Jozani, Ehsan Zamanzade, Reza Modarre

Comments 33 pages, 5 figures, 1 table

Abstract

Quantile estimation is central when interest lies in thresholds or tail behavior rather than the mean. When exact measurement is costly but units can be ranked cheaply, ranked set sampling (RSS) provides an attractive alternative to simple random sampling (SRS). We develop two families of RSS-based L-estimators for population quantiles that extend Stigler-type and Harrell--Davis estimators to the RSS framework. The first applies weighted-order-statistic estimation directly to the pooled ordered RSS sample and serves primarily as an exact conceptual benchmark, since its computational burden increases rapidly with the set size. The second exploits a decomposition induced by the RSS design that constructs $k$ pooled transformed-scale component estimators indexed by rank stratum and leads to a computationally scalable procedure. We derive large-sample results for these component estimators under regularity conditions; these results provide a principled first-order motivation for the combined estimators employed in practice. Simulation results across several distributions, quantile levels, and ranking qualities show consistent efficiency gains over empirical quantile estimators under both SRS and RSS, with the RSS Harrell--Davis version performing especially well for moderate and upper quantiles. Beyond the simulation study, we demonstrate the practical relevance of the proposed estimators through an application to NHANES transient elastography data, highlighting their usefulness for estimating clinically meaningful quantiles in a biomedical setting.
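As background for the estimators being extended, here is a rough sketch of the classical Harrell--Davis L-estimator for an ordinary simple random sample. It is not the RSS version proposed in the paper. Each order statistic is weighted by an increment of the Beta((n+1)p, (n+1)(1-p)) distribution function. The helper `beta_cdf` is a hypothetical stand-in that integrates the Beta density numerically so the example needs only the standard library; a numerical library's incomplete beta function would normally be used instead.

```python
import math


def beta_cdf(x, a, b, steps=20000):
    """Beta(a, b) distribution function by midpoint-rule integration.

    Adequate for illustration when a, b >= 1; accuracy degrades for
    shape parameters below 1, where the density is unbounded.
    """
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    h = x / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * h
        total += math.exp(
            log_norm + (a - 1) * math.log(u) + (b - 1) * math.log(1 - u)
        )
    return total * h


def harrell_davis(sample, p):
    """Harrell-Davis estimate of the p-th quantile from an SRS sample."""
    xs = sorted(sample)
    n = len(xs)
    a, b = (n + 1) * p, (n + 1) * (1 - p)
    cdf = [beta_cdf(i / n, a, b) for i in range(n + 1)]
    # Weight of the i-th order statistic is the Beta mass on ((i-1)/n, i/n].
    weights = [cdf[i] - cdf[i - 1] for i in range(1, n + 1)]
    return sum(w * x for w, x in zip(weights, xs))


median_estimate = harrell_davis([5.0, 1.0, 4.0, 2.0, 3.0], 0.5)
```

Unlike the empirical quantile, every observation contributes to the estimate, which is the source of the smoothing that L-estimators of this type provide.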

2605.01603 2026-05-05 stat.CO

dirichletprocess: An R Package for Fitting Complex Bayesian Nonparametric Models

Gordon J. Ross, Dean Markwick, Priyanshu Tiwari

Abstract

The dirichletprocess package provides software for creating flexible Dirichlet process objects. Users can perform nonparametric Bayesian analysis using Dirichlet processes without the need to program their own inference algorithms. Instead, the user can utilise our pre-built models or specify their own models whilst allowing the dirichletprocess package to handle the Markov chain Monte Carlo sampling. Our Dirichlet process objects can act as building blocks for a variety of statistical models including: density estimation, clustering and prior distributions in hierarchical models.

2605.01586 2026-05-05 stat.CO math.ST stat.TH

The Pearson IV distribution: Random variate generation and applications

Luc Devroye, Joe R. Hill

Abstract

We develop uniformly fast random variate generators for the Pearson IV distribution that can be used over the entire range of both shape parameters and highlight some applications in a Bayesian setting.

2605.01579 2026-05-05 stat.ME cs.LG

Minimum Specification Perturbation: Robustness as Distance-to-Falsification in Causal Inference

Hoang Dang, Luan Pham, Minh Nguyen

Comments 36 pages, 2 figures

Abstract

Empirical causal claims depend on many analyst decisions, from selecting covariates to choosing estimators. Existing robustness tools summarize how results vary across these choices, but, to the best of our knowledge, do not answer: \textbf{How many analyst decisions must change to reach a specification, which is a set of choices, whose confidence interval (CI) contains zero?} We introduce \emph{Minimum Specification Perturbation (MSP)}, the smallest number of changes. MSP is small under the null, grows with effect strength and captures distance-to-falsification information that dispersion-based summaries cannot report; when making decisions under weak effects, an MSP-based rule yields lower false-positive rates than dispersion-based rules. We show that Fragility Index and MSP measure orthogonal vulnerabilities: fragility to influential observations need not imply fragility to specification choices. On the LaLonde benchmark, MSP = 1 implies that one decision change makes the CI contain zero. We further provide exact permutation calibration under randomization and characterize computation, showing tractable cases under additive structure and NP-hardness in general.
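The definition of MSP can be illustrated by brute force. The sketch below assumes that every specification has already been estimated and is stored with its confidence interval; the specification labels and interval values are invented for illustration and are not results from the paper. Distance between specifications is counted as the number of decisions that differ.

```python
def hamming(spec_a, spec_b):
    """Number of analyst decisions on which two specifications differ."""
    return sum(a != b for a, b in zip(spec_a, spec_b))


def minimum_specification_perturbation(baseline, intervals):
    """Smallest number of changed decisions that reaches a CI containing zero.

    `intervals` maps a specification (tuple of choices) to its (lower, upper)
    confidence interval. Returns None if no specification's CI contains zero.
    """
    distances = [
        hamming(baseline, spec)
        for spec, (lower, upper) in intervals.items()
        if lower <= 0.0 <= upper
    ]
    return min(distances) if distances else None


# Hypothetical decisions: (covariate set, estimator, sample restriction).
baseline = ("full", "ols", "all")
intervals = {
    ("full", "ols", "all"): (0.20, 1.10),
    ("full", "ipw", "all"): (0.10, 1.30),
    ("full", "ols", "trimmed"): (0.05, 1.00),
    ("minimal", "ipw", "all"): (-0.20, 0.90),
    ("minimal", "ols", "trimmed"): (-0.10, 0.80),
}
msp = minimum_specification_perturbation(baseline, intervals)
```

Enumerating all specifications grows exponentially with the number of decisions, which is consistent with the abstract's remark that the problem is NP-hard in general.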

2605.01571 2026-05-05 stat.OT stat.CO

Functional Liu Regression for Scalar-on-Functional Models in High-Dimensional Settings

Shaista Ashraf, Stephen Becker, Farrukh Javed, Ismail Shah

Abstract

This study develops a functional Liu-type shrinkage estimator (fLiu) for scalar-on-function regression in the presence of strong multicollinearity and high-dimensional functional predictors. The approach extends the classical Liu estimator to the functional setting by combining directional shrinkage with smoothness regularization, providing flexible control over the bias-variance trade-off. Theoretical analysis is used to examine the behavior of the estimator and the associated parameter selection problem. In particular, an explicit mean squared error (MSE) decomposition is derived, characterizing the risk of the estimator in terms of variance reduction and shrinkage bias. This further yields an explicit optimal choice of the shrinkage parameter of the fLiu estimator through a one-dimensional convex risk minimization problem, leading to a practical plug-in tuning rule. Moreover, it is shown that in high-dimensional (underdetermined) settings, commonly used criteria such as GCV (and equivalently PRESS/LOO-CV) become constant with respect to the parameter d, and are thus uninformative for tuning. This provides a theoretical explanation for the predominant focus on the overdetermined regime in existing Liu-type methods. Numerical results demonstrate that the estimator achieves competitive predictive accuracy relative to existing methods. Implementation is carried out in R using the fda package, and in Python via the fLiu.py package developed for this study.
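For reference, the classical (non-functional) Liu estimator that the paper extends can be written as below. The notation ($X$, $y$, $\hat\beta_{\mathrm{OLS}}$) is an assumption, full column rank of $X$ is assumed so that the least-squares estimator exists, and the functional version with smoothness regularization is not reproduced here.

```latex
\[
  \hat\beta_d
  = \bigl(X^{\top}X + I\bigr)^{-1}\bigl(X^{\top}y + d\,\hat\beta_{\mathrm{OLS}}\bigr),
  \qquad 0 < d < 1,
  \qquad \hat\beta_{\mathrm{OLS}} = \bigl(X^{\top}X\bigr)^{-1}X^{\top}y .
\]
% Setting d = 1 recovers ordinary least squares, since
% X^T y + \hat\beta_OLS = (X^T X + I) \hat\beta_OLS.
```

The parameter $d$ in this formula is the shrinkage parameter to which the abstract's tuning discussion refers.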

2605.01492 2026-05-05 stat.ML cs.IT cs.LG math.IT

Stabilizing Private LASSO under Heterogeneous Covariates via Anisotropic Objective Perturbation

Haruka Tanzawa, Ayaka Sakata

Comments 6 pages, 5 figures

Abstract

We study high-dimensional LASSO under differential privacy via objective perturbation with heterogeneous covariate scales. In practical scenarios, covariates often exhibit diverse scales; however, standard preprocessing is problematic under privacy constraints, as it consumes additional privacy budget. This heterogeneity induces effective anisotropy in the objective perturbation via the inverse Gram matrix of covariates, which can degrade the stability and accuracy of algorithms. To address this, we propose a Gram-based anisotropic objective perturbation, a ``pre-distortion'' strategy that counteracts the distortion from the covariate structure to restore isotropy in the estimation process. Using an Approximate Message Passing (AMP) framework and state evolution analysis, we demonstrate that our proposed perturbation significantly stabilizes convergence and improves both statistical efficiency and privacy performance compared to standard uniform noise injection. Our results provide theoretical insights into designing stable and efficient private estimators without relying on data-dependent preprocessing.

2605.01484 2026-05-05 cs.LG stat.ML

Evaluating LLMs on Large-Scale Graph Property Estimation via Random Walks

Sunil Kumar Maurya, Xin Liu

Comments Accepted to ACL 2026 Main Conference

Abstract

With the rapidly improving reasoning abilities of Large Language Models (LLMs), there is also a rising demand to use them in a wide variety of domains. This brings about the need to carefully evaluate the limits of the capabilities of these models with various tests and benchmarks. Graph structures are ubiquitous in real-world data, and are often used to represent and analyze relationship patterns within data. Many benchmarks have already been proposed in the graph literature to test the reasoning ability of LLMs to follow and execute graph algorithms. However, due to the limited context length of LLMs, these benchmarks consist of very small graphs. In real-world data, the size of graphs can be significantly larger, and in many cases, not fully accessible. In this paper, we examine a class of problems that arises with very large graphs having limited accessibility. We propose a large graph benchmark dataset, EstGraph, and introduce four distinct tasks designed to estimate large graph properties. We evaluate the reasoning abilities of LLMs on these tasks using a wide variety of graph datasets. In addition, we provide task-specific prompt constructions based on random walk sampling of large graphs (up to millions of nodes) that effectively convey sufficient information to LLMs within the limits of context length.

2605.01452 2026-05-05 stat.ME cs.LG

Stable Localized Conformal Prediction via Transduction

Yinjie Min, Liuhua Peng, Changliang Zou

Abstract

Existing evaluations of conformal prediction, such as prediction efficiency and test-conditional coverage, are defined in expectation over the calibration data. In practice, when only one calibration set of limited size is available, prediction sets often exhibit high variability in size, especially for methods with localization. We formalize this concern as set stability, defined as the variance of the conditional expectation of the set size given the calibration data. To improve stability without requiring additional target-task labels, we propose Stable Conformal Prediction (StCP), a transfer learning approach that utilizes labeled source-task data and unlabeled target data. Theoretically, we characterize the marginal coverage and stability of StCP; empirically, it delivers more stable prediction sets than standard conformal prediction methods, especially for those with localization, when calibration data are limited.
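To make the role of the calibration set concrete, here is a minimal sketch of standard split conformal prediction with absolute-residual scores. It is the baseline the abstract compares against, not StCP, and the calibration scores and point prediction are made-up numbers. The interval half-width is a single order statistic of the calibration scores, which is why a small calibration set can produce sets of quite variable size.

```python
import math


def split_conformal_quantile(scores, alpha):
    """Conformal quantile: the ceil((n + 1)(1 - alpha))-th smallest score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this level: the set is unbounded.
        return float("inf")
    return sorted(scores)[rank - 1]


# Hypothetical absolute residuals |y_i - prediction_i| on a calibration set.
calibration_scores = [0.3, 1.2, 0.7, 0.1, 0.9, 0.5, 1.5, 0.4, 0.8]
q = split_conformal_quantile(calibration_scores, alpha=0.25)

# Prediction interval around a hypothetical point prediction of 2.0.
point_prediction = 2.0
interval = (point_prediction - q, point_prediction + q)
```

Redrawing the nine calibration scores would change `q`, and with it the width of every prediction interval; the set stability notion in the abstract quantifies that variation.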

2605.01379 2026-05-05 stat.ME

Federated generalized linear mixed models based on one-time shared summary statistics

Marie Analiz April Limpoco, Christel Faes, Niel Hens

Abstract

Data privacy has increasingly become a daunting challenge because it limits data availability, which is essential in estimating statistical models such as generalized linear mixed models. Access to personal data often involves considerable time, effort, and paperwork, which can impede research progress and collaboration. Existing approaches that do not use individual-level data for model estimation are either prone to ecological bias, cannot handle heterogeneity, or require iterative communication. In this paper, we propose an approach to estimate generalized linear mixed models based on summary statistics shared only once. We used linear, logistic, and Poisson mixed models as examples to demonstrate the methodology. Our strategy involves generating pseudo-data whose summary statistics match those of the actual but unavailable data. These pseudo-data are then used for model estimation instead of the actual data. The estimates we achieve are identical (up to the third decimal place) to those derived from actual data and have similar bias, coverage, and prediction performance. Communication and resource efficiency distinguish our approach from existing methods.

2605.01363 2026-05-05 hep-ex cs.LG hep-ph stat.ME

Data-Driven, Geometry-Aware Optimal-Transport Calibration of Flavor Tagger

Yeonjoon Kim, Un-ki Yang

Comments 32 pages, 12 figures

Abstract

Flavor-tagging calibrations are often provided either as scale factors measured at a finite set of working points or as binned corrections to a chosen one-dimensional discriminant. However, this approach falls short of providing continuous, event-level calibration across the full multicomponent outputs of modern taggers. This limitation leads to information loss in analyses that demand high-performance flavor tagging, restricting analyses to a limited set of predefined variables. In this work, we propose a geometry-aware framework that formulates flavor-tagger calibration as an optimal transport problem on the probability simplex. The transport maps are parameterized and trained in the isometric log-ratio coordinate system. Because the quadratic Euclidean cost of Brenier transport in this coordinate system is equivalent to the Aitchison distance on the simplex, the learned map induces a minimal deformation under the Aitchison geometry. Furthermore, we extract flavor-conditional target distributions directly from control-region data using an expectation-maximization (EM) technique that simultaneously fits multiple control regions, models each flavor component with a normalizing flow, and estimates the regional mixture fractions. The extracted targets are subsequently used to learn flavor-factorized transport maps. Because the joint estimation of mixture fractions and flexible component densities admits weakly constrained directions, we further introduce a linearized feedback-operator analysis that propagates the fitted composition covariance into the extracted component densities, separating data-constrained modes from those dominated by the composition prior. The simulation-based closure study demonstrates improved closure in dedicated control regions and in independent validation mixtures.

2605.01335 2026-05-05 stat.ML cs.LG math.ST stat.TH

Mean Testing under Truncation beyond Gaussian

Yuhao Wang, Roberto Imbuzeiro Oliveira, Themis Gouleakis

Abstract

We characterize the fundamental limits of high-dimensional mean testing under arbitrary truncation, where samples are drawn from the conditional distribution $P(\cdot \mid S)$ for an unknown truncation set $S$ that may hide up to an $\varepsilon$-fraction of the probability mass. For distributions with $p$-th directional moments of magnitude at most $ν_{P,p}$, truncation induces a bias of order $O(ν_{P,p}\varepsilon^{1-1/p})$. This bias creates a sharp information-theoretic detectability floor: when the signal $α$ falls below this threshold, the null and alternative hypotheses are indistinguishable even with infinite data. Above this floor, we prove that a simple second-order test achieves near-optimal sample complexity $n = O\!\left(\frac{\|Σ_P\|}{(α-4ν_{P,p}\varepsilon^{1-1/p})^2}\sqrt{d}\right)$. We further identify a structural escape from this finite-moment bias barrier. Under a directional median regularity assumption, truncation bias improves to linear order $O(\varepsilon)$. This reveals an intermediate regime in which estimation requires $Θ(d)$ samples for uniform recovery, while testing recovers the classical $Θ(\sqrt d)$ rate once truncation bias is eliminated. Together, our results provide a unified framework for mean testing under truncation, connecting finite-moment, sub-Gaussian, and median-regular structural regimes.

2605.01312 2026-05-05 stat.ME

Exploring Multivariate Data Using Median Absolute Deviation Depth

Elsayed Elamir

Comments 21 pages, 7 figures

Abstract

We propose and analyze the moving median absolute deviation (MMAD) as a robust depth construction based on the median absolute distance functional with particular emphasis on its local geometry and probabilistic structure. In the univariate setting, we derive the derivative of the MMAD scale and interpret it through boundary mass imbalance, thereby establishing a direct connection to a robust skewness measure. This idea extends naturally to a multivariate setting that describes how observations are arranged along the 50% central region using a directional derivative, a gradient representation, and a spherical boundary distribution. From a computational perspective, MMAD can be estimated efficiently using distance calculations without needing complex optimization or projection schemes. Multivariate applications based on depth correlations, contour visualizations, and central region overlap demonstrate that MMAD identifies essentially the same central observations as classical depth notions while delivering additional information and geometric insight about directional structure. These features make MMAD a practical and informative approach for robust multivariate data analysis.

2605.01311 2026-05-05 cs.LG econ.EM stat.AP stat.ML

The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice

Jikai Jin, Vasilis Syrgkanis

Abstract

Offline evaluation of language models from usage logs is biased when model choice is confounded: the same user-side factors that influence which model is used can also influence how its output is judged, so raw comparisons of logged scores mix self-selected populations rather than estimating a common quantity of interest. A small randomized experiment can break this bias by overriding model choice, but in practice such experiments are scarce and costly. We study a three-source design that combines a large confounded observational log (OBS) for scale, a small randomized experiment (EXP) for unconfounded scoring, and an offline simulator (SIM) that replays candidate models on cached contexts. Our main result is an identification theorem showing that the randomized experiment and the simulator are together enough to recover causal model values; the observational log enters only afterward, to reduce estimation error rather than to make the causal comparison valid. Six estimator families are evaluated in a controlled semi-synthetic validation and in two real-task cached benchmarks for summarization and coding. No family dominates every regime; relative performance depends on the amount of unbiased EXP supervision and on how closely the target reward aligns with OBS-derived structure.

2605.01262 2026-05-05 stat.AP

Factor State Space Modelling of the Ornstein-Uhlenbeck Process with Measurement Error and its Application

Shanglun Li, Toby Kenney, Hong Gu

Abstract

Standard Ornstein-Uhlenbeck (OU) models often yield biased parameter estimates when measurement error is ignored. While the Ornstein-Uhlenbeck State Space Model (OUSSM) addresses this in univariate settings, multidimensional extensions remain limited. This paper introduces the factor OUSSM to model multi-dimensional, mean-reverting systems with observational noise. We resolve critical identifiability challenges in parameter estimation by establishing necessary constraints and validating the method through extensive simulations. We demonstrate the model's versatility by analyzing human gut microbiome dynamics and North Atlantic Sea Surface Temperature (SST) data. The results reveal distinct latent temporal structures in both biological and environmental systems, establishing the factor OUSSM as a robust framework for multivariate time series analysis.

2605.01237 2026-05-05 math.ST stat.TH

An Exact Pointwise Characterization for Total Variation Denoising in Quantile Regression

Deep Ghoshal, Sabyasachi Chatterjee

Abstract

Total variation denoising (TVD) is a classical method for denoising and curve fitting, yet an explicit pointwise description of its fitted values has only recently been established in the mean regression setting by arXiv:2410.03041v4. This raises the question of whether a similar representation holds for quantile regression. We answer this question affirmatively by deriving an exact minmax/maxmin representation for the quantile TVD estimator, providing a complete pointwise characterization of its solution set. Given that the quantile TVD estimator is generally non-unique, the existence of such a representation is perhaps surprising. We show that the set of admissible fitted values at any location forms a compact interval, whose endpoints are characterized exactly by minmax/maxmin functionals of local order statistics over nested intervals. We next develop several structural properties of the quantile TVD solution set. First, the solution set is closed under coordinatewise maximum and minimum, guaranteeing the existence of extremal elements -- upper and lower envelope solutions. Second, this reveals that quantile TVD is intrinsically non-crossing across quantile levels when a common tuning parameter is used. We prove this is driven by submodularity of the total variation penalty, and show that any penalized quantile regression estimator with a submodular penalty enjoys this property. From an estimation error perspective, our representation enables a refined pointwise analysis via a transparent local bias-variance decomposition, facilitating new pointwise risk bounds and near-optimal rates for locally Hölder smooth functions. Our results hold under heavy-tailed noise (e.g., Cauchy) and substantially extend existing guarantees beyond locally constant signals. Altogether, these results advance the theory of quantile TV regression via exact pointwise min-max representations.

2605.01198 2026-05-05 stat.CO stat.ME

Modular Markov chain Monte Carlo with application to multimodal sampling

Joonha Park

Abstract

We develop a modular approach to Markov chain Monte Carlo (MCMC) sampling for unnormalized target densities. In this approach, Markov chains are constructed in parallel, each constrained to a subset of the target space. The Monte Carlo estimates from the constrained chains are then combined with appropriate weights, calculated from the transition probabilities between subsets. In addition to the computational advantages arising from its parallelized structure, this modular MCMC approach enables variance reduction for Monte Carlo estimation in settings where sampling from low-density regions is required. We develop a central limit theorem-type result for the resulting Monte Carlo estimates and propose a method for estimating their standard errors. Furthermore, by applying this modular sampling technique to simulated tempering, we propose a method for Monte Carlo estimation of expectations with respect to multimodal target distributions. This approach effectively addresses a well-known challenge of tempering-based methods: sampling efficiency can be greatly reduced when separated modes of the target distribution have different scales. We demonstrate the efficiency of the proposed methods through numerical examples, including one arising from Bayesian sparse regression with a spike-and-slab prior.

2605.01172 2026-05-05 cs.LG stat.ML

A Theory of Generalization in Deep Learning

Elon Litman, Gabe Guo

Abstract

We present a non-asymptotic theory of generalization in deep learning where the empirical neural tangent kernel partitions the output space. In directions corresponding to signal, error dissipates rapidly; in the vast orthogonal dimensions corresponding to noise, the kernel's near-zero eigenvalues trap residual error in a test-invisible reservoir. Within the signal channel, minibatch SGD ensures that coherent population signal accumulates via fast linear drift, while idiosyncratic memorization is suppressed into a slow, diffusive random walk. We prove generalization survives even when the kernel evolves $\mathcal{O}(1)$ in operator norm, the full feature-learning regime. This theory naturally explains disparate phenomena in deep learning theory, such as benign overfitting, double descent, implicit bias, and grokking. Lastly, we derive an exact population-risk objective from a single training run with no validation data, for any architecture, loss, or optimizer, and prove that it measures precisely the noise in the signal channel. This objective reduces in practice to an SNR preconditioner on top of Adam, adding one state vector at no extra cost; it accelerates grokking by $5 \times$, suppresses memorization in PINNs and implicit neural representations, and improves DPO fine-tuning under noisy preferences while staying $3 \times$ closer to the reference policy.

2605.01157 2026-05-05 stat.ME

Coarse-to-fine spatial GLMM for scalable prediction and multiscale analysis

Daisuke Murakami, Alexis Comber, Takahiro Yoshida, Narumasa Tsutsumida, Chris Brunsdon, Tomoki Nakaya

Abstract

Although a recent study suggested that coarse-to-fine learning provides a fast and flexible framework for large-scale spatial process modeling, the method was originally developed for Gaussian responses, limiting its applicability. To address this limitation, we extended the coarse-to-fine spatial modeling (CFSM) framework to accommodate spatial generalized linear mixed models (GLMMs), with a particular focus on count data. The resulting model, referred to as CF-GLMM, efficiently addresses the degeneracy problem often encountered in conventional spatial GLMMs. The performance of the proposed CF-GLMMs was evaluated in terms of spatial prediction and multiscale feature extraction via Monte Carlo experiments. Finally, we applied the proposed method to the analysis of coronavirus disease 2019 (COVID-19). The proposed method is implemented in the R package spCF (https://cran.r-project.org/web/packages/spCF/).

2605.01136 2026-05-05 cs.LG cs.SI math.SP stat.ML

Spectral Graph Sparsification Preserves Representation Geometry in Graph Neural Networks

Sanjukta Krishnagopal

Comments 9 pages, 4 figures

Details
English abstract

Spectral graph sparsification is a classical tool for reducing graph complexity while preserving Laplacian quadratic forms. In graph neural networks (GNNs), sparsification is often used to accelerate computation while maintaining predictive performance. In this work, we study a complementary representation-level question: does sparsification preserve the geometry of learned embeddings? For polynomial-filter GNNs, we prove that any $ε$-spectral sparsifier induces $O(ε)$ perturbations in polynomial graph filters, multilayer hidden representations, and their Gram matrices. These guarantees imply stability of squared pairwise distances, class means, and covariance structure in embedding space. We further establish finite-time training stability: under smoothness and boundedness assumptions, gradient descent on dense and sparsified graphs produces weight trajectories whose separation grows at most proportionally to the sparsification distortion. Empirically, effective-resistance sparsification validates the predicted perturbation chain on synthetic graphs and preserves hidden representation geometry on real datasets. In our experiments, the Gram matrix and training dynamics show low divergence even under substantial sparsification, consistent with the predicted stability under spectral sparsification. Hidden Gram preservation strongly predicts neighborhood preservation and class-centroid stability across FashionMNIST, Cora, and Paul15. Together, these results show that spectral sparsification preserves not only graph operators, but also the representation geometry that supports downstream use of GNN embeddings for interpretability.

2605.01118 2026-05-05 stat.ME

Nonparametric density estimation with a parametric start

Nils Lid Hjort, Ingrid Kristine Glad

Comments 31 pages, no figures. This is the original publication for the Hjort-Glad density estimator, Statistical Research Report, Department of Mathematics, University of Oslo, January 1994, with more material than for the published article Annals of Statistics, 1995, vol. 23, pages 882-904

Details
Journal ref
Annals of Statistics, 1995, vol. 23, pages 882-904
English abstract

The traditional kernel density estimator of an unknown density is by construction completely nonparametric, in the sense that it has no preferences and will work reasonably well for all shapes. The present paper develops a class of semiparametric methods that are designed to work better than the kernel estimator in a broad nonparametric neighbourhood of a given parametric class of densities, for example the normal, while not losing much in precision when the true density is far from the parametric class. The idea is to multiply an initial parametric density estimate with a kernel type estimate of the necessary correction factor. This works well in cases where the correction factor function is less rough than the original density itself. Extensive comparisons with the kernel estimator are carried out, including exact analysis for the class of all normal mixtures. The new method, with a normal start, wins quite often, even in many cases where the true density is far from normal. Procedures for choosing the smoothing parameter of the estimator are also discussed. The new estimator should be particularly useful in higher dimensions, where the usual nonparametric methods have problems. The idea is also spelled out for nonparametric regression.
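The multiplicative idea in this abstract can be sketched in a few lines. The sketch below assumes a fitted normal as the parametric start and a Gaussian kernel; the function name, the bandwidth value, and the simulated data are illustrative choices, not taken from the paper. The estimate at x is the fitted normal density at x times the kernel-weighted average of 1 / (fitted density at each data point).

```python
import numpy as np

def hjort_glad_density(x_eval, data, bandwidth):
    """Normal-start estimate: f0(x) * mean_i K_h(x_i - x) / f0(x_i)."""
    x_eval = np.atleast_1d(np.asarray(x_eval, dtype=float))
    data = np.asarray(data, dtype=float)
    mu, sigma = data.mean(), data.std(ddof=1)  # parametric start: fitted normal

    def normal_pdf(t, loc=0.0, scale=1.0):
        z = (t - loc) / scale
        return np.exp(-0.5 * z**2) / (scale * np.sqrt(2.0 * np.pi))

    start_at_eval = normal_pdf(x_eval, mu, sigma)
    start_at_data = normal_pdf(data, mu, sigma)
    # Gaussian kernel weights K_h(x_i - x), shape (len(x_eval), len(data))
    kernel = normal_pdf(x_eval[:, None] - data[None, :], 0.0, bandwidth)
    # Kernel-type estimate of the correction factor f / f0
    correction = (kernel / start_at_data[None, :]).mean(axis=1)
    return start_at_eval * correction

# Illustrative data (assumption): 500 draws from N(1, 2^2)
rng = np.random.default_rng(0)
sample = rng.normal(loc=1.0, scale=2.0, size=500)
grid = np.linspace(-5.0, 7.0, 121)
estimate = hjort_glad_density(grid, sample, bandwidth=0.8)
```

When the data really are close to normal, the correction factor stays near 1 and the estimate follows the parametric fit; the kernel part only has to capture the departure from it.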

2605.01114 2026-05-05 stat.ME

A formal approach to variable selection in difference-in-differences

Daniela Rodrigues, Laura A. Hatfield

Details
English abstract

Difference-in-differences (DiD) identification relies mainly on a parallel trends assumption about untreated potential outcomes. Researchers often relax this assumption by assuming conditional parallel trends within units with the same covariate values. However, the process of selecting which covariates to include in this assumption is often \emph{ad hoc}. We propose a formal approach to select the variables that support conditional parallel trends based on graphical criteria. We show that the parallel trends assumption is rarely justified without conditioning on covariates, and that unconditional and conditional parallel trends can conflict with one another. We also demonstrate that a time-invariant covariate with a time-invariant effect on the outcome, which might not ordinarily be considered a confounder in DiD, may be a useful conditioning variable. We clarify that adjustment for a post-treatment covariate depends on what causes that covariate to change. Extending our framework to multiple time periods, we distinguish between treatment type and rollout strategy and examine the problem of treatment-confounder feedback. On the estimation side, we argue that the difficulty of incorporating covariates in DiD, often framed as an estimator problem, is more accurately understood as a misalignment between the adjustment set used by the estimator and the adjustment set required for identification. This misalignment affects several popular estimation procedures, and resolving it requires not a change of estimator, but a change in how covariates enter the estimation procedure. We show how to achieve this alignment for all estimators we evaluate.

2605.01110 2026-05-05 cs.LG cs.SI math.AT stat.ML

Topological Neural Tangent Kernel

Sanjukta Krishnagopal

Comments 9 pages, 4 figures

Details
English abstract

Graph neural tangent kernels give a principled infinite-width theory for graph neural networks, but inherit a basic limitation of graph models: they see only pairwise structure. Many relational systems contain higher-order interactions that are more naturally represented by simplicial complexes. We introduce the Topological Neural Tangent Kernel (TopoNTK), an infinite-width kernel for simplicial message passing on edge features. TopoNTK combines lower Hodge interactions, capturing graph-like coupling through shared vertices, with upper Hodge interactions, capturing coupling through filled simplices. This makes the kernel sensitive to topology invisible to graph kernels, allowing complexes with the same graph but different filled simplices to induce different kernels. Beyond expressivity, the Hodge structure gives the kernel an interpretable learning geometry. Edge signals decompose into gradient-like, harmonic, and local circulation components, and the spectrum of the TopoNTK determines how quickly each component is learned. This yields a topological form of spectral bias: components aligned with large-eigenvalue modes are learned quickly, while global harmonic modes, retained through the residual channel, often lie at smaller eigenvalues and are learned more slowly. We prove expressivity, Hodge-alignment, spectral learning, and stability properties, and validate them on synthetic simplicial tasks and DBLP higher-order link prediction. The results show that topology is not merely extra structure; it can provide coordinates that make relational learning more faithful, interpretable, and effective.

2605.01107 2026-05-05 cs.LG cond-mat.dis-nn stat.ML

Diffusion Operator Geometry of Feedforward Representations

Kanishka Reddy

Details
English abstract

Neural networks transform data through learned representations whose geometry affects separation, contraction, and generalization. Recent work studies this geometry using discrete curvature on neighborhood graphs, suggesting Ricci-flow-like behavior across layers. We develop a smooth operator-theoretic alternative for feedforward representation snapshots. Each feature cloud induces a Gaussian-kernel diffusion Markov operator, and transport, spectral, label-boundary, and local-scale observables are derived from this single object via Bakry-Emery $Γ$-calculus. In a balanced Gaussian class-conditional snapshot model with shared covariance, the population operator has closed-form class affinities, leakage, and coarse spectra, all controlled by pairwise regularized Mahalanobis separations $c_\varepsilon^{(a,b)}$. We also prove that the resulting operator observables vary smoothly under feature perturbations, while hard neighborhood-graph diagnostics can change discontinuously. Synthetic experiments validate the closed-form Gaussian bridge, while learned MNIST experiments show that the same operator observables track training, width, and perturbation stability. Together, these results give a stable operator-geometric framework for analyzing feedforward representation geometry.
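As a rough illustration of the central object, a Gaussian-kernel diffusion Markov operator can be built from a feature cloud by row-normalizing the kernel matrix. The sketch below is an assumption-laden toy: the function names, the bandwidth, and in particular the definition of "leakage" as the average one-step transition mass leaving a point's own class are illustrative and may differ from the paper's observables.

```python
import numpy as np

def diffusion_operator(features, bandwidth):
    """Row-stochastic Gaussian-kernel operator P = D^{-1} K on a feature cloud."""
    features = np.asarray(features, dtype=float)
    sq_norms = (features**2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * features @ features.T
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives from round-off
    kernel = np.exp(-sq_dists / (2.0 * bandwidth**2))
    return kernel / kernel.sum(axis=1, keepdims=True)

def class_leakage(operator, labels):
    """Average one-step transition mass that leaves each point's own class
    (a hypothetical definition, used here only for illustration)."""
    labels = np.asarray(labels)
    other_class = labels[:, None] != labels[None, :]
    return float((operator * other_class).sum(axis=1).mean())

# Two Gaussian classes with shared covariance, separated along the first axis
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
base = rng.normal(size=(100, 2))

def shifted(gap):
    points = base.copy()
    points[50:, 0] += gap  # move class 1 away from class 0
    return points

P_far = diffusion_operator(shifted(6.0), bandwidth=1.0)
leak_far = class_leakage(P_far, labels)
leak_near = class_leakage(diffusion_operator(shifted(1.0), bandwidth=1.0), labels)
```

Because every entry of the operator is a smooth function of the features, small feature perturbations move these quantities only slightly, unlike a hard k-nearest-neighbor graph where an edge can appear or vanish.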

2605.01089 2026-05-05 cs.LG math.PR stat.CO

Learning Discriminators for Resampling in the Ensemble Gaussian Mixture Filter through a Normalizing Flow Approach

Zain Jabbar, Andrey A. Popov

Details
English abstract

The ensemble Gaussian mixture filter (EnGMF) is a powerful, convergent particle filter capable of medium-to-high dimensional non-linear filtering. The EnGMF relies on a resampling step that can generate physically unrealistic posterior samples, which would subsequently produce physically meaningless forecasts. This work introduces the discriminator-informed resampling procedure, which augments the posterior resampling step with a discriminator that accepts or rejects candidate particles based on their physical plausibility. In this work these discriminators are learned through a normalizing flow approach. Numerical experiments on both the Ikeda map and the Lorenz '63 system show that the discriminator-informed resampling procedure consistently reduces error relative to the standard EnGMF in low-ensemble regimes.

2605.01062 2026-05-05 stat.ME math.ST stat.TH

Single Change-Point Detection via Energy Distance with Application to Genomic Data

Suthakaran Ratnasingam

Comments 25 pages, 8 figures, 3 tables

Details
English abstract

In this paper, we develop and analyze a nonparametric procedure for detecting a single change point in sequences of independent observations using energy distance. The asymptotic properties of the test statistic are derived under both null and alternative hypotheses. Under the null hypothesis, for any fixed candidate split point, the standardized statistic $\mathcal{Z}_{n,k}$ converges to a standard normal limit. For global detection, we use the scan statistic $T_n=\max_{k\in K_η}|\mathcal{Z}_{n,k}|$ and calibrate critical values using a permutation test, which yields valid type I error control under exchangeability. The simulation study shows that the proposed method demonstrates much better robustness across various error distributions. To handle multiple change points in practical applications, the method is combined with a binary segmentation approach. The breast cancer cell line (MDA157) from cDNA microarray CGH data is used to illustrate the detection and estimation capabilities of the proposed method for genomic sequences.
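A simplified sketch of an energy-distance scan with permutation calibration is given below. It is not the paper's standardized statistic $\mathcal{Z}_{n,k}$: it assumes univariate data and uses the sample energy distance weighted by k(n-k)/n, with hypothetical function names, a hypothetical minimum segment length, and simulated data.

```python
import numpy as np

def energy_scan(x, min_seg=5):
    """Return the split k maximizing the weighted sample energy distance
    between x[:k] and x[k:], together with the maximal value."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    dist = np.abs(x[:, None] - x[None, :])  # pairwise |x_i - x_j|
    best_k, best_stat = min_seg, -np.inf
    for k in range(min_seg, n - min_seg + 1):
        energy = (2.0 * dist[:k, k:].mean()
                  - dist[:k, :k].mean()
                  - dist[k:, k:].mean())
        stat = k * (n - k) / n * energy
        if stat > best_stat:
            best_k, best_stat = k, stat
    return best_k, best_stat

def permutation_pvalue(x, n_perm=99, seed=0, min_seg=5):
    """Permutation p-value for the scan maximum (valid under exchangeability)."""
    rng = np.random.default_rng(seed)
    _, observed = energy_scan(x, min_seg)
    exceed = sum(
        energy_scan(rng.permutation(x), min_seg)[1] >= observed
        for _ in range(n_perm)
    )
    return (1 + exceed) / (1 + n_perm)

# Illustrative series (assumption): mean shift from 0 to 3 after 30 observations
rng = np.random.default_rng(1)
series = np.concatenate([rng.normal(0.0, 1.0, 30), rng.normal(3.0, 1.0, 30)])
k_hat, t_obs = energy_scan(series)
p_value = permutation_pvalue(series)
```

Permuting the whole series and recomputing the maximum over all splits accounts for the multiplicity of the scan, which is why the p-value refers to the maximum rather than to a single split.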

2605.01052 2026-05-05 quant-ph math.ST physics.data-an physics.optics stat.TH

Entropic Reciprocity in Time-Reversed Young Interferometry

Jianming Wen

Comments This work provides an explicit definition of time reversal based on information theory

Details
English abstract

We show that time-reversed Young interferometry reorganizes, rather than reverses, optical entropy. A fixed detector conditions the reciprocal source--detector Green function and produces a source-label probability distribution. Marginal entropies in the standard and time-reversed geometries are generally unequal; the reciprocal invariant is instead the mutual information between source and detector coordinates. Near a destructive response, the conditioned source-label entropy can decrease while Fisher information for small phase, tilt, or defocus perturbations increases. The result identifies time-reversed Young interferometry as a source-space information processor with no analogue in ordinary detector-plane fringe readout.

2605.01003 2026-05-05 stat.ME cs.LG eess.SP

Pi-Change: A Prior-Informed Multiple Change Point Detection Algorithm

Jonathon Jacobs, Shanshan Chen

Details
English abstract

Statistical change point (CP) detection methods typically rely on likelihood-based inference and ignore contextual information about plausible CP locations beyond the observed sequence. Although informative priors provide a natural way to incorporate such information, general and computationally efficient methods for doing so are lacking, especially for multiple CP detection. To address this gap, we propose a prior-informed CP detection algorithm (Pi-Change) that incorporates prior information on CP locations through a time-varying penalty term. We prove that the proposed penalty can be embedded in the Pruned Exact Linear Time framework while preserving the dynamic programming recursion and pruning rule required for efficient multiple CP detection. Across simulation studies and three time-series applications, Pi-Change discourages spurious CPs unsupported by prior information, remains robust to prior misspecification, and improves detection accuracy. More broadly, Pi-Change extends multiple CP detection beyond purely data-driven fitting by incorporating partial prior knowledge in a computationally efficient and interpretable way. It is particularly useful when CPs arise from heterogeneous mechanisms or are associated with known external events, helping quantify the delay between an event and the resulting structural change.
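To make the idea of a time-varying penalty concrete, here is a minimal sketch of penalized optimal partitioning in which the penalty depends on where a change point is placed. It is an illustration under stated assumptions, not the Pi-Change algorithm: the squared-error segment cost, the penalty array, and the function name are hypothetical, and the pruning step of the Pruned Exact Linear Time framework is deliberately omitted.

```python
import numpy as np

def segment_with_location_penalty(y, penalty):
    """Optimal partitioning where penalty[s] is charged for a change point
    between y[s-1] and y[s]; returns the sorted change point positions."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    csum = np.concatenate([[0.0], np.cumsum(y)])
    csum2 = np.concatenate([[0.0], np.cumsum(y**2)])

    def cost(s, t):
        # Squared-error cost of fitting one mean to y[s:t]
        seg_sum = csum[t] - csum[s]
        return (csum2[t] - csum2[s]) - seg_sum**2 / (t - s)

    best = np.full(n + 1, np.inf)
    best[0] = 0.0
    last = np.zeros(n + 1, dtype=int)
    for t in range(1, n + 1):
        for s in range(t):
            pen = penalty[s] if s > 0 else 0.0  # the first segment starts for free
            value = best[s] + cost(s, t) + pen
            if value < best[t]:
                best[t], last[t] = value, s

    # Backtrack through the stored segment starts
    change_points, t = [], n
    while t > 0:
        s = last[t]
        if s > 0:
            change_points.append(s)
        t = s
    return sorted(change_points)

# A small jump that a flat penalty of 8 rejects ...
signal = np.array([0.0] * 10 + [1.0] * 10)
flat = np.full(len(signal) + 1, 8.0)
cps_flat = segment_with_location_penalty(signal, flat)

# ... but that is accepted once prior information lowers the penalty at position 10
informed = flat.copy()
informed[10] = 2.0
cps_informed = segment_with_location_penalty(signal, informed)
```

In this toy example the unsegmented fit costs 5, so a change point is worthwhile only where its penalty falls below 5; lowering the penalty at a location favored by prior information is what lets the small jump be detected.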

2605.00966 2026-05-05 cs.LG cs.NE q-bio.NC stat.ML

Robust volatility updates for Hierarchical Gaussian Filtering

Christoph Mathys, Nicolas Legrand, Peter Thestrup Waade, Nace Mikus, Lilian Aline Weber

Details
English abstract

Hierarchical Gaussian Filtering (HGF) networks allow for efficient updating of posterior distributions (beliefs) about hidden states of an agent's environment. HGF parent nodes can target the mean or variance of their children. New information entering at input nodes leads to a cascade of belief updates across the network according to one-step update equations for each node's mean and precision (inverse variance). However, the original form of the update equations for variance-targeting parents (volatility coupling) can in some regions of parameter space lead to negative posterior precision, a logical impossibility which causes the updating algorithm to terminate with an error. In this report, we introduce a modified quadratic approximation to the variational energy of volatility-coupled nodes that avoids negative posterior precision. The key idea is to interpolate between two quadratic expansions of the variational energy: one at the prior prediction and one at a second mode whose location is obtained in closed form via the Lambert W function. The resulting update equations are robust across the entire parameter space and faithfully track the variational posterior even for large prediction errors.

2605.00855 2026-05-05 math.OC cs.LG stat.ML

An Efficient Spatial Branch-and-Bound Algorithm for Global Optimization of Gaussian Process Posterior Mean Functions

Wei-Ting Tang, Akshay Kudva, Calvin Tsay, Joel A. Paulson

Details
English abstract

We study the deterministic global optimization of trained Gaussian process posterior mean functions over hyperrectangular domains. Although the posterior mean function has a compact closed-form representation, its global optimization is challenging because it remains nonlinear and nonconvex. Existing exact deterministic approaches become increasingly difficult to scale as the number of training data points grows, leading to approximation-based methods that improve tractability by optimizing a modified (inexact) objective. In this work, we propose PALM-Mean, a piecewise-analytic lower-bounding framework embedded in reduced-space spatial branch-and-bound. At each node, kernel terms that are locally important are replaced by a sign-aware piecewise-linear relaxation in an appropriate scalar distance variable, while the remaining terms are bounded analytically in closed form. We show this hybrid approach yields a valid lower bound for the posterior mean, while limiting the size of the branch-and-bound subproblems. We establish validity of the node lower bounds and $\varepsilon$-global convergence of the resulting algorithm. Computational results on synthetic benchmarks and real-world application problems show that PALM-Mean improves scalability relative to representative general-purpose deterministic global solvers, particularly as the number of training data points increases.

2604.22902 2026-05-05 stat.ME stat.ML

Design, Cups, and Blankets. A Free-Energy-Principle-Based Approach to Product Design

Luca M. Possati

Details
English abstract

Classical design theory treats the type of an object as a given: the designer decides in advance that this will be a cup, then optimizes its parameters. This paper argues that object type is not a presupposition but an inference, something that can be determined from physical data and functional requirements jointly. We call this problem requirement-steered interface type inference and show that it is inexpressible within existing design frameworks. This paper makes two contributions that are jointly necessary and individually incomplete. The first is the problem itself, which classical design cannot pose because it presupposes the very thing our problem seeks to determine. The second is C-DMBD, a constrained extension of the Dynamic Markov Blanket Detection algorithm, which makes requirement-steered inference computationally tractable. Drawing on the free-energy principle and active inference, established frameworks in theoretical neuroscience and Bayesian mechanics, we model a product's surface as a Markov blanket: the minimal boundary through which all causal exchange between object and environment must pass. Different blanket structures correspond to different object types; different parameterizations of the same structure correspond to different functional modes of the same type. This paper is a proof of concept and a theoretical proposal. It reframes design as inference rather than optimization, and as a relation between generative models rather than a specification of parameters.

2604.22791 2026-05-05 stat.CO cs.SI stat.OT

R Package iglm: Regression under Interference in Connected Populations

Cornelius Fritz, Michael Schweinberger

Details
English abstract

We introduce R package iglm, which implements a comprehensive framework for studying relationships among predictors and outcomes under interference. The implemented regression framework facilitates the study of spillover and other phenomena in connected populations and has important advantages over existing packages, among them scalability and provable theoretical guarantees. On the computational side, the regression framework relies on scalable methods that can be applied to small and large data sets, by solving a convex optimization program based on pseudo-likelihoods using Minorization-Maximization and Quasi-Newton algorithms. On the statistical side, the regression framework comes with provable theoretical guarantees. To increase the versatility of iglm, users can add custom-built model terms. We showcase iglm using two data sets, including hate speech on the social media platform X and communications among students.

2604.14240 2026-05-05 cs.AI cs.LG stat.ML

Interpretable and Explainable Surrogate Modeling for Simulations: A State-of-the-Art Survey and Perspectives on Explainable AI for Decision-Making

Pramudita Satria Palar, Paul Saves, Muhammad Daffa Robani, Nicolas Verstaevel, Moncef Garouani, Julien Aligon, Koji Shimoyama, Joseph Morlier, Benoit Gaudou

Comments Accepted for publication in Archives of Computational Methods in Engineering, 2026

Details
English abstract

The simulation of complex systems increasingly relies on sophisticated but fundamentally opaque computational black-box simulators. Surrogate models play a central role in reducing the computational cost of complex systems simulations across a wide range of scientific and engineering domains. Nevertheless, they inevitably inherit and often exacerbate this black-box nature, obscuring how input variables drive physical responses. Conversely, Explainable Artificial Intelligence (XAI) offers powerful tools to unpack these models. Yet, XAI methods struggle with engineering-specific constraints, such as highly correlated inputs, dynamical systems, and rigorous reliability requirements. Consequently, surrogate modeling and XAI have largely evolved as distinct fields of research, despite their strong complementarity. To reconnect these approaches, this state-of-the-art survey provides a structured perspective that maps existing XAI techniques onto the various stages of surrogate modeling workflows for design and exploration. To ground this synthesis, we draw upon illustrative applications across both equation-based simulations and agent-based modeling. We survey a broad spectrum of techniques, highlighting their strengths for revealing interactions and supporting human comprehension. Finally, we identify pressing open challenges, including the explainability of dynamical systems and the handling of mixed-variable systems, and propose a research agenda to make explainability a core, embedded element of simulation-driven workflows from model construction through decision-making. By transforming opaque emulators into explainable tools, this agenda empowers practitioners to move beyond accelerating simulations to extracting actionable insights from complex system behaviors.

2604.04249 2026-05-05 stat.ME

The arithmetic-harmonic inequality index: Theory, inference, and finite-sample analysis

Roberto Vila, Helton Saulo

Comments 17 pages, 5 figures

Details
English abstract

We investigate the arithmetic-harmonic inequality (AHI) index, a bounded and scale-invariant measure of dispersion for positive random variables, defined through the interplay between the mean and its reciprocal. We derive analytical expressions for the AHI index within the generalized inverse Gaussian (GIG) family, encompassing the inverse Gaussian and gamma distributions as important special cases. We study the associated estimator, obtain a tractable expression for its expectation, establish its asymptotic properties, and derive explicit first-order bias approximations. A Monte Carlo study is conducted to evaluate the finite-sample performance of the estimator under various scenarios. An application to GDP per capita data for countries in the Americas illustrates the role of the AHI index within the broader Atkinson family across several values of the inequality-aversion parameter. The results show the good performance of the AHI index as a tractable and interpretable measure of economic dispersion.

2603.11907 2026-05-05 cs.LG stat.ME

Causal Representation Learning with Optimal Compression under Complex Treatments

Wanting Liang, Haoang Chi, Zhiheng Zhang

Details
English abstract

Estimating Individual Treatment Effects (ITE) in multi-treatment scenarios faces two critical challenges: the Hyperparameter Selection Dilemma for balancing weights and the Curse of Dimensionality in computational scalability. This paper derives a novel multi-treatment generalization bound and proposes a theoretical estimator for the optimal balancing weight $α$, eliminating expensive heuristic tuning. We investigate three balancing strategies: Pairwise, One-vs-All (OVA), and Treatment Aggregation. While OVA achieves superior precision in low-dimensional settings, our proposed Treatment Aggregation ensures both accuracy and O(1) scalability as the treatment space expands. Furthermore, we extend our framework to a generative architecture, Multi-Treatment CausalEGM, which preserves the Wasserstein geodesic structure of the treatment manifold. Experiments on semi-synthetic and image datasets demonstrate that our approach significantly outperforms traditional models in estimation accuracy and efficiency, particularly in large-scale intervention scenarios.

2602.14861 2026-05-05 math.ST stat.ME stat.TH

Bias analysis of a linear order-statistic inequality index estimator: Unbiasedness under gamma populations

Roberto Vila, Helton Saulo

Comments 18 pages

Details
English abstract

This paper studies a class of rank-based inequality measures built from linear combinations of expected order statistics. The proposed framework unifies several well-known indices, including the classical Gini coefficient, the $m$th Gini index, the extended $m$th Gini index and particular cases of the $S$-Gini index, and also connects to spectral inequality measures through an integral representation. We investigate the finite-sample behavior of a natural U-statistic-type estimator that averages weighted order-statistic contrasts over all subsamples of fixed size and normalizes by the sample mean. A general bias decomposition is derived in terms of components that isolate the effect of random normalization on each rank level, yielding analytical expressions that can be evaluated under broad non-negative distributions via Laplace-transform methods. Under mild moment conditions, the estimator is shown to be asymptotically unbiased. Moreover, we prove exact unbiasedness under gamma populations for any sample size, extending earlier unbiasedness results for Gini-type estimators. A Monte Carlo study is performed to numerically check the theoretical unbiasedness under gamma populations. Finally, a data set on GDP per capita across $34$ countries in the Americas is analyzed to illustrate the proposed methodology.
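The kind of Monte Carlo check mentioned here can be sketched for the simplest member of the class, the classical Gini coefficient. The sketch assumes the pairwise-difference estimator normalized by the sample mean (mean absolute difference over distinct pairs divided by twice the mean); the function name, sample size, shape parameter, and number of replications are illustrative. For a gamma population with shape a, the population Gini coefficient is Γ(a + 1/2) / (√π Γ(a + 1)).

```python
import math
import numpy as np

def gini_u_estimator(samples):
    """Row-wise Gini estimate: mean absolute difference over distinct pairs,
    divided by twice the sample mean. `samples` has shape (replications, n)."""
    x = np.sort(np.asarray(samples, dtype=float), axis=-1)
    n = x.shape[-1]
    ranks = np.arange(n)
    # sum_{i<j} (x_(j) - x_(i)) = sum_i (2i - n + 1) x_(i), with 0-based ranks i
    pair_sum = ((2 * ranks - n + 1) * x).sum(axis=-1)
    mean_abs_diff = 2.0 * pair_sum / (n * (n - 1))
    return mean_abs_diff / (2.0 * x.mean(axis=-1))

shape = 2.0
true_gini = math.gamma(shape + 0.5) / (math.sqrt(math.pi) * math.gamma(shape + 1.0))

# Small samples (n = 5) make any finite-sample bias easy to see
rng = np.random.default_rng(0)
draws = rng.gamma(shape, size=(20000, 5))
mc_mean = gini_u_estimator(draws).mean()
bias_estimate = mc_mean - true_gini
```

With shape 2 the population value is 0.375, and the Monte Carlo average should agree with it up to simulation error even at n = 5, in line with the exact-unbiasedness claim.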

2510.01020 2026-05-05 cs.LG cs.AI math.ST stat.ML stat.TH

The Good, the Bad, and the Sampled: a No-Regret Approach to Safe Online Classification

Tavor Z. Baharav, Spyros Dragazis, Aldo Pacchiano

Comments 38 pages, accepted to AISTATS 2026

Details
English abstract

We study sequential testing for a binary disease outcome when risk follows an unknown logistic model. At each round, the decision maker may either pay for a test revealing the true label or predict the outcome based on patient features and past data. The goal is to minimize costly tests while ensuring the misclassification rate stays below $α$ with probability at least $1-δ$. We propose a method that jointly estimates the logistic parameter $θ^{\star}$ and the feature distribution, using a conservative threshold on the logistic score to decide when to test. We prove our procedure achieves the target error with high probability and requires only $\widetilde O(\sqrt{T})$ more tests than an oracle with full knowledge. This is the first no-regret guarantee for error-constrained logistic testing, with direct applications to medical screening. Simulations corroborate our theoretical results, showing safe classification of patients and efficient estimation of $θ^{\star}$ with few excess tests.

2509.09723 2026-05-05 cs.CL cs.AI cs.LG stat.ME

ALIGNS: Unlocking nomological networks in psychological measurement through a large language model

Kai R. Larsen, Sen Yan, Roland M. Mueller, Lan Sang, Mikko Rönkkö, Ravi Starzl, Donald Edmondson

Comments Error in algorithm explanation

Details
English abstract

Psychological measurement is critical to many disciplines. Despite advances in measurement, building nomological networks, theoretical maps of how concepts and measures relate to establish validity, remains a challenge 70 years after Cronbach and Meehl proposed them as fundamental to validation. This limitation has practical consequences: clinical trials may fail to detect treatment effects, and public policy may target the wrong outcomes. We introduce Analysis of Latent Indicators to Generate Nomological Structures (ALIGNS), a large language model-based system trained with validated questionnaire measures. ALIGNS provides three comprehensive nomological networks containing over 550,000 indicators across psychology, medicine, social policy, and other fields. This represents the first application of large language models to solve a foundational problem in measurement validation. We report classification accuracy tests used to develop the model, as well as three evaluations. In the first evaluation, the widely used NIH PROMIS anxiety and depression instruments are shown to converge into a single dimension of emotional distress. The second evaluation examines child temperament measures and identifies four potential dimensions not captured by current frameworks, and questions one existing dimension. The third evaluation, an applicability check, engages expert psychometricians who assess the system's importance, accessibility, and suitability. ALIGNS is freely available at nomologicalnetwork.org, complementing traditional validation methods with large-scale nomological analysis.

2508.12674 2026-05-05 stat.ML cs.LG cs.SI

Unfolded Laplacian Spectral Embedding: A Theoretically Grounded Approach to Dynamic Network Representation

Haruka Ezoe, Hiroki Matsumoto, Ryohei Hisano

Details
Journal ref
43rd International Conference on Machine Learning (ICML 2026)
English abstract

Dynamic relational data arise in many machine learning applications, yet their evolving structure poses challenges for learning representations that remain consistent and interpretable over time. A common approach is to learn time-varying node embeddings, whose usefulness depends on well-defined stability properties across nodes and across time. We introduce Unfolded Laplacian Spectral Embedding (ULSE), a principled extension of unfolded adjacency spectral embedding to normalized Laplacian operators, a setting where stability guarantees have remained out of reach. We prove that ULSE satisfies both cross-sectional and longitudinal stability under a dynamic stochastic block model. Moreover, the Laplacian formulation yields a dynamic Cheeger-type inequality linking the spectrum of the unfolded normalized Laplacian to worst-case conductance over time, providing structural insight into the embeddings. Empirical results on synthetic and real-world dynamic networks validate the theory.

2411.07874 2026-05-05 stat.ME math.ST stat.TH

Changepoint Detection in Complex Models: Cross-Fitting Is Needed

Chengde Qian, Guanghui Wang, Zhaojun Wang, Changliang Zou

Details
English abstract

Changepoint detection is commonly formulated by minimizing the sum of in-sample losses to quantify the model's overall fit. However, for flexible modeling procedures -- especially those involving high-dimensional parameter spaces or hyperparameter tuning -- this strategy can lead to inaccurate changepoint estimation due to over-adaptivity biases. To mitigate this issue, we propose a novel cross-fitting methodology based on out-of-sample loss evaluations, which decouples model fitting from changepoint search. We establish a general theoretical framework for consistent changepoint estimation under mild conditions, and further extend it to temporally dependent data. A key implication of the theory is that consistency depends primarily on the models' predictive accuracy over nearly homogeneous segments. Numerical experiments show that the proposed method substantially improves the reliability and adaptability of changepoint detection in complex scenarios.

2409.02399 2026-05-05 stat.CO math.OC

Guidance for twisted particle filter: a continuous-time perspective

Jianfeng Lu, Yuliang Wang

Details
English abstract

The particle filter (PF), also known as sequential Monte Carlo (SMC), approximates high-dimensional probability distributions and their normalizing constants in the discrete-time setting. To reduce the variance of the Monte Carlo approximation, various twisted particle filters (TPFs) have been proposed, in which a twisting function is chosen or learned to modify the Markov transition kernel. Guided by existing control-based importance sampling algorithms in the continuous-time setting, we propose a novel algorithm called the ``Twisted-Path Particle Filter'' (TPPF), in which the twisting function is parameterized by a neural network and trained to minimize a specific KL-divergence between path measures. Numerical experiments illustrate the capability of the proposed algorithm.

2307.01150 2026-05-05 stat.ME math.ST stat.TH

Reliever: Relieving the Burden of Costly Model Fits for Changepoint Detection

Chengde Qian, Guanghui Wang, Changliang Zou

Details
English abstract

Changepoint detection typically relies on a grid-search strategy for optimal data segmentation. When model fitting itself is expensive, repeatedly fitting a model on every candidate segment dominates the computation. Existing approaches mitigate this by pruning the grid, thus reducing the number of segments (and model fits). We propose Reliever, which instead cuts the number of model fits directly and nests seamlessly within standard grid-search routines. Reliever fits a small, deterministic collection of proxy models and reuses them wherever they apply, making it compatible with a wide range of existing algorithms. For high-dimensional regression with changepoints, coupling Reliever with an optimal grid-search method yields changepoint and coefficient estimators that are rate-optimal up to a logarithmic factor. Extensive numerical experiments demonstrate that Reliever rapidly and accurately detects changepoints across a wide range of high-dimensional and nonparametric models.

2212.10406 2026-05-05 stat.ME stat.AP

GEEPERs: Principal Stratification using Principal Scores and Stacked Estimating Equations

Adam C. Sales, Kirk P. Vanacore, Erin R. Ottmar

Details
English abstract

Principal stratification is a framework for making sense of causal effects conditioned on variables that may themselves have been affected by the treatment. For instance, in an evaluation of an educational intervention, some subjects in the treatment group may not fully utilize the intervention, and researchers may be interested in how this subgroup is affected. Most principal stratification estimators rely on strong structural or modeling assumptions and often require advanced statistical training to fit and evaluate, making them inaccessible to many applied researchers. In this paper, we introduce a new principal effect estimator for one-way noncompliance based on a binary indicator. Estimates may be computed using conventional regression methods (though the standard errors require a specialized sandwich estimator) and do not rely on distributional assumptions. We present a simulation study that demonstrates the novel method's greater robustness compared to popular alternatives and illustrate the method through a real-data analysis.

2209.14859 2026-05-05 math.ST math.PR stat.ML stat.TH

Exact Recovery of Community Detection in dependent Gaussian Mixture Models

Zhongyang Li, Sichen Yang

Details
English abstract

We study exact recovery for community detection in a Gaussian mixture model with dependent and heterogeneous Gaussian noise. The noise covariance matrix $\Sigma$ may be non-diagonal and, in the general formulation, singular. In the singular case, we write the Gaussian likelihood on the support of the induced measure and show that the maximum likelihood estimator (MLE) is a constrained quadratic optimization problem involving the Moore--Penrose inverse. For general covariance structures, we obtain sufficient conditions for exact recovery of the MLE when the community sizes are unknown and when they are known. These conditions are driven by the $\Sigma$-whitened separation $L_\Sigma(x,y)$ together with local one-step comparison inequalities in the near-truth regime. Under the additional assumption that $\Sigma$ is invertible, we derive converse results showing failure of exact recovery when a large family of local perturbations has sufficiently nondegenerate Gaussian comparison statistics. We then analyze a full-rank non-diagonal block-covariance model, prove a sharp exact-recovery threshold in the unknown-size setting, and identify a general no-gap mechanism under which the sufficient and necessary conditions coincide asymptotically.

2007.02392 2026-05-05 cs.LG cs.DS math.ST stat.CO stat.ML stat.TH

Efficient Parameter Estimation of Truncated Boolean Product Distributions

Dimitris Fotakis, Alkis Kalavasis, Christos Tzamos

Comments 33rd Conference on Learning Theory (COLT 2020)

Details
English abstract

We study the problem of estimating the parameters of a Boolean product distribution in $d$ dimensions, when the samples are truncated by a set $S \subset \{0, 1\}^d$ accessible through a membership oracle. This is the first time that the computational and statistical complexity of learning from truncated samples is considered in a discrete setting. We introduce a natural notion of fatness of the truncation set $S$, under which truncated samples reveal enough information about the true distribution. We show that if the truncation set is sufficiently fat, samples from the true distribution can be generated from truncated samples. A stunning consequence is that virtually any statistical task (e.g., learning in total variation distance, parameter estimation, uniformity or identity testing) that can be performed efficiently for Boolean product distributions, can also be performed from truncated samples, with a small increase in sample complexity. We generalize our approach to ranking distributions over $d$ alternatives, where we show how fatness implies efficient parameter estimation of Mallows models from truncated samples. Exploring the limits of learning discrete models from truncated samples, we identify three natural conditions that are necessary for efficient identifiability: (i) the truncation set $S$ should be rich enough; (ii) $S$ should be accessible through membership queries; and (iii) the truncation by $S$ should leave enough randomness in all directions. By carefully adapting the Stochastic Gradient Descent approach of (Daskalakis et al., FOCS 2018), we show that these conditions are also sufficient for efficient learning of truncated Boolean product distributions.
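
The sampling model in this abstract can be made concrete with a small sketch. It only illustrates how truncated samples arise, not the paper's estimation algorithm. The function `sample_truncated`, the oracle `in_S`, and the parameter vector `p` are hypothetical, and the truncation set (vectors with at least one coordinate equal to 1) is chosen purely for illustration.

```python
import random

def sample_truncated(p, in_S, rng, max_tries=100_000):
    """Draw one sample from the Boolean product distribution with
    parameters p, conditioned on membership in S, by rejection sampling.
    S is accessed only through the membership oracle in_S."""
    for _ in range(max_tries):
        x = tuple(1 if rng.random() < pi else 0 for pi in p)
        if in_S(x):
            return x
    raise RuntimeError("S appears to have too little mass under p")

def in_S(x):
    # Hypothetical truncation set: at least one coordinate equals 1.
    return sum(x) >= 1

rng = random.Random(0)
p = [0.2, 0.5, 0.7]
samples = [sample_truncated(p, in_S, rng) for _ in range(1000)]
# Naive coordinate-wise means of the truncated samples.
truncated_means = [sum(s[j] for s in samples) / len(samples) for j in range(len(p))]
print(truncated_means)
```

For this choice of S the all-zeros vector is never observed, so each naive mean targets p_j / (1 - (1 - p_1)(1 - p_2)(1 - p_3)) rather than p_j. This bias is why estimating the parameters from truncated samples needs more than plain averaging.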

1910.09876 2026-05-05 cs.LG stat.ML

Neural Network Training with Approximate Logarithmic Computations

Arnab Sanyal, Peter A. Beerel, Keith M. Chugg

Details
English abstract

The high computational complexity associated with training deep neural networks limits online and real-time training on edge devices. This paper proposes an end-to-end training and inference scheme that eliminates multiplications by using approximate operations in the log-domain, which has the potential to significantly reduce implementation complexity. We implement the entire training procedure in the log-domain, with fixed-point data representations. This training procedure is inspired by hardware-friendly approximations of log-domain addition, which are based on look-up tables and bit-shifts. We show that our 16-bit log-based training can achieve classification accuracy within approximately 1% of the equivalent floating-point baselines for a number of commonly used datasets.
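
A minimal sketch of log-domain arithmetic shows why multiplications disappear and why addition needs an approximation. It assumes positive values stored as base-2 logarithms and uses the approximation log2(1 + x) ≈ x for 0 <= x <= 1, which is one simple choice and not necessarily the one used in the paper. Sign handling and fixed-point quantization are omitted, and the function names are hypothetical.

```python
import math

def log_mul(a, b):
    """Multiply two positive numbers given as base-2 logs: just add the logs."""
    return a + b

def log_add_exact(a, b):
    """Exact log-domain addition: log2(2**a + 2**b)."""
    hi, lo = max(a, b), min(a, b)
    return hi + math.log2(1.0 + 2.0 ** (lo - hi))

def log_add_approx(a, b):
    """Approximate log-domain addition using log2(1 + x) ~ x for 0 <= x <= 1.

    In fixed-point hardware the correction term 2**(lo - hi) could come
    from a bit-shift or a small look-up table; it is computed in floating
    point here only for clarity.
    """
    hi, lo = max(a, b), min(a, b)
    return hi + 2.0 ** (lo - hi)

a, b = math.log2(3.0), math.log2(5.0)
print(2.0 ** log_mul(a, b))         # product recovered from the log domain
print(2.0 ** log_add_exact(a, b))   # exact sum recovered from the log domain
print(2.0 ** log_add_approx(a, b))  # approximate sum
```

The approximate correction term never exceeds the exact one, so this variant slightly underestimates sums. A look-up table for the correction term trades memory for a smaller error.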

1511.04803 2026-05-05 stat.ME

Additive Logistic Models as Interpretable Likelihood-Ratio Scores for AUC-Based Classification

Yuan-chin Ivan Chang

Comments 42

Details
English abstract

Classification is a common statistical task in many areas. New classification procedures are continually proposed to improve on the performance of existing methods. These procedures, especially those from the machine learning and data-mining literature, are usually complicated, so extra effort is required to understand them and the impact of individual variables within them. However, in some applications, such as pharmaceutical and medical research, future developments and research plans rely on the interpretation of the classification rule, for example the role of individual variables in a diagnostic rule or model. Hence, in this kind of research, a model that balances ease of interpretation with satisfactory performance is preferred, even over complicated models with optimal performance. The complexity of a classification rule can offset its performance advantage and become an obstacle to its use in those applications. In this paper, we study how to improve the classification performance of a conventional logistic model, measured by the area under the receiver operating characteristic (ROC) curve, while retaining its ease of interpretation. The proposed method increases sensitivity over the whole range of specificity and is therefore especially useful when performance in the high-specificity range of the ROC curve is of interest. Theoretical justification is presented, and numerical results using both simulated data and two real data sets are reported.