arXivDaily arXiv每日学术速递 周一至周五更新
2602.15008 2026-02-17 cs.LG cs.IT math.IT math.ST stat.ML stat.TH

Efficient Sampling with Discrete Diffusion Models: Sharp and Adaptive Guarantees

Daniil Dmitriev, Zhihan Huang, Yuting Wei

详情
英文摘要

Diffusion models over discrete spaces have recently shown striking empirical success, yet their theoretical foundations remain incomplete. In this paper, we study the sampling efficiency of score-based discrete diffusion models under a continuous-time Markov chain (CTMC) formulation, with a focus on $τ$-leaping-based samplers. We establish sharp convergence guarantees for attaining $\varepsilon$ accuracy in Kullback-Leibler (KL) divergence for both uniform and masking noising processes. For uniform discrete diffusion, we show that the $τ$-leaping algorithm achieves an iteration complexity of order $\tilde O(d/\varepsilon)$, with $d$ the ambient dimension of the target distribution, eliminating linear dependence on the vocabulary size $S$ and improving existing bounds by a factor of $d$; moreover, we establish a matching algorithmic lower bound showing that linear dependence on the ambient dimension is unavoidable in general. For masking discrete diffusion, we introduce a modified $τ$-leaping sampler whose convergence rate is governed by an intrinsic information-theoretic quantity, termed the effective total correlation, which is bounded by $d \log S$ but can be sublinear or even constant for structured data. As a consequence, the sampler provably adapts to low-dimensional structure without prior knowledge or algorithmic modification, yielding sublinear convergence rates for various practical examples (such as hidden Markov models, image data, and random graphs). Our analysis requires no boundedness or smoothness assumptions on the score estimator beyond control of the score entropy loss.

2602.15007 2026-02-17 stat.AP stat.ME

Hidden Markov Individual-level Models of Infectious Disease Transmission

Dirk Douwes-Schultz, Rob Deardon, Alexandra M. Schmidt

详情
英文摘要

Individual-level epidemic models are increasingly being used to help understand the transmission dynamics of various infectious diseases. However, fitting such models to individual-level epidemic data is challenging, as we often only know when an individual's disease status was detected (e.g., when they showed symptoms) and not when they were infected or removed. We propose an autoregressive coupled hidden Markov model to infer unknown infection and removal times, as well as other model parameters, from a single observed detection time for each detected individual. Unlike more traditional data augmentation methods used in epidemic modelling, we do not assume that this detection time corresponds to infection or removal or that infected individuals must at some point be detected. Bayesian coupled hidden Markov models have been used previously for individual-level epidemic data. However, these approaches assumed each individual was continuously tested and that the tests were independent. In practice, individuals are often only tested until their first positive test, and even if they are continuously tested, only the initial detection times may be reported. In addition, multiple tests on the same individual may not be independent. We accommodate these scenarios by assuming that the probability of detecting the disease can depend on past observations, which allows us to fit a much wider range of practical applications. We illustrate the flexibility of our approach by fitting two examples: an experiment on the spread of tomato spot wilt virus in pepper plants and an outbreak of norovirus among nurses in a hospital.

2602.14998 2026-02-17 math.PR cs.IT cs.SI math.IT math.ST stat.TH

Random geometric graphs with smooth kernels: sharp detection threshold and a spectral conjecture

Cheng Mao, Yihong Wu, Jiaming Xu

详情
英文摘要

A random geometric graph (RGG) with kernel $K$ is constructed by first sampling latent points $x_1,\ldots,x_n$ independently and uniformly from the $d$-dimensional unit sphere, then connecting each pair $(i,j)$ with probability $K(\langle x_i,x_j\rangle)$. We study the sharp detection threshold, namely the highest dimension at which an RGG can be distinguished from its Erdős--Rényi counterpart with the same edge density. For dense graphs, we show that for smooth kernels the critical scaling is $d = n^{3/4}$, substantially lower than the threshold $d = n^3$ known for the hard RGG with step-function kernels \cite{bubeck2016testing}. We further extend our results to kernels whose signal-to-noise ratio scales with $n$, and formulate a unifying conjecture that the critical dimension is determined by $n^3 \mathop{\rm tr}^2(κ^3) = 1$, where $κ$ is the standardized kernel operator on the sphere. Departing from the prevailing approach of bounding the Kullback-Leibler divergence by successively exposing latent points, which breaks down in the sublinear regime of $d=o(n)$, our key technical contribution is a careful analysis of the posterior distribution of the latent points given the observed graph, in particular, the overlap between two independent posterior samples. As a by-product, we establish that $d=\sqrt{n}$ is the critical dimension for non-trivial estimation of the latent vectors up to a global rotation.

2602.14991 2026-02-17 stat.ME

Joint analysis for multivariate longitudinal and event time data with a change point anchored at interval-censored event time

Yue Zhan, Cheng Zheng, Ying Zhang

详情
英文摘要

Huntington's disease (HD) is an autosomal dominant neurodegenerative disorder characterized by motor dysfunction, psychiatric disturbances, and cognitive decline. The onset of HD is marked by severe motor impairment, which may be predicted by prior cognitive decline and, in turn, exacerbate cognitive deficits. Clinical data, however, are often collected at discrete time points, so the timing of disease onset is subject to interval censoring. To address the challenges posed by such data, we develop a joint model for multivariate longitudinal biomarkers with a change point anchored at an interval-censored event time. The model simultaneously assesses the effects of longitudinal biomarkers on the event time and the changes in biomarker trajectories following the event. We conduct a comprehensive simulation study to demonstrate the finite-sample performance of the proposed method for causal inference. Finally, we apply the method to PREDICT-HD, a multisite observational cohort study of prodromal HD individuals, to ascertain how cognitive impairment and motor dysfunction interact during disease progression.

2602.14969 2026-02-17 math.ST stat.TH

Topological trivialization in non-convex empirical risk minimization

Andrea Montanari, Basil Saeed

Comments 33 pages; 16 pdf figures

详情
英文摘要

Given data $\{({\boldsymbol x}_i,y_i): i\le n\}$, with ${\boldsymbol x}_i$ standard $d$-dimensional Gaussian feature vectors, and $y_i\in{\mathbb R}$ response variables, we study the general problem of learning a model parametrized by ${\boldsymbol θ}\in{\mathbb R}^d$, by minimizing a loss function that depends on ${\boldsymbol θ}$ via the one-dimensional projections ${\boldsymbol θ}^{\sf T}{\boldsymbol x}_i$. While previous work mostly dealt with convex losses, our approach assumes general (non-convex) losses hence covering classical, yet poorly understood examples such as the perceptron and non-convex robust regression. We use the Kac-Rice formula to control the asymptotics of the expected number of local minima of the empirical risk, under the proportional asymptotics $n,d\to\infty$, $n/d\toα>1$. Specifically, we prove a finite dimensional variational formula for the exponential growth rate of the expected number of local minima. Further we provide sufficient conditions under which the exponential growth rate vanishes and all empirical risk minimizers have the same asymptotic properties (in fact, we expect the minimizer to be unique in these circumstances). We refer to this phenomenon as `rate trivialization.' If the population risk has a unique minimizer, our sufficient condition for rate trivialization is typically verified when the samples/parameters ratio $α$ is larger than a suitable constant $α_{\star}$. Previous general results of this type required $n\ge Cd \log d$. We illustrate our results in the case of non-convex robust regression. Based on heuristic arguments and numerical simulations, we present a conjecture for the exact location of the trivialization phase transition $α_{\text{tr}}$.

2602.14952 2026-02-17 cs.LG math.OC stat.ME stat.ML

Locally Adaptive Multi-Objective Learning

Jivat Neet Kaur, Isaac Gibbs, Michael I. Jordan

Comments Code is available at https://github.com/jivatneet/adaptive-multiobjective

详情
英文摘要

We consider the general problem of learning a predictor that satisfies multiple objectives of interest simultaneously, a broad framework that captures a range of specific learning goals including calibration, regret, and multiaccuracy. We work in an online setting where the data distribution can change arbitrarily over time. Existing approaches to this problem aim to minimize the set of objectives over the entire time horizon in a worst-case sense, and in practice they do not necessarily adapt to distribution shifts. Earlier work has aimed to alleviate this problem by incorporating additional objectives that target local guarantees over contiguous subintervals. Empirical evaluation of these proposals is, however, scarce. In this article, we consider an alternative procedure that achieves local adaptivity by replacing one part of the multi-objective learning method with an adaptive online algorithm. Empirical evaluations on datasets from energy forecasting and algorithmic fairness show that our proposed method improves upon existing approaches and achieves unbiased predictions over subgroups, while remaining robust under distribution shift.

2602.14942 2026-02-17 stat.ME

Balanced Stochastic Block Model for Community Detection in Signed Networks

Yichao Chen, Weijing Tang, Ji Zhu

详情
英文摘要

Community detection, discovering the underlying communities within a network from observed connections, is a fundamental problem in network analysis, yet it remains underexplored for signed networks. In signed networks, both edge connection patterns and edge signs are informative, and structural balance theory (e.g., triangles aligned with ``the enemy of my enemy is my friend'' and ``the friend of my friend is my friend'' are more prevalent) provides a global higher-order principle that guides community formation. We propose a Balanced Stochastic Block Model (BSBM), which incorporates balance theory into the network generating process such that balanced triangles are more likely to occur. We develop a fast profile pseudo-likelihood estimation algorithm with provable convergence and establish that our estimator achieves strong consistency under weaker signal conditions than methods for the binary SBM that rely solely on edge connectivity. Extensive simulation studies and two real-world signed networks demonstrate strong empirical performance.

2602.14877 2026-02-17 stat.AP

When to repeat a biomarker test? Decomposing sources of variation from conditionally repeated measurements

Supun Manathunga, Mart P. Janssen, Yu Luo, W. Alton Russell, Mart Pothast

Comments 36 pages, 12 figures

详情
英文摘要

Repeating an imperfect biomarker test based on an initial result can introduce bias and influence misclassification risk. For example, in some blood donation settings, blood donors' hemoglobin is remeasured when the initial measurement falls below a minimum threshold for donor eligibility. This paper explores methods that use data resulting from processes with conditionally repeated biomarker measurement to decompose the variation in observed measurements of a continuous biomarker into population variability and variability arising from the measurement procedure. We present two frequentist approaches with analytical solutions, but these approaches perform poorly in a dataset of conditionally repeated blood donor hemoglobin measurements where normality assumptions are not met. We then develop a Bayesian hierarchical framework that allows for different distributional assumptions, which we apply to the blood donor hemoglobin dataset. Using a Bayesian hierarchical model that assumes normally distributed population hemoglobin and heavy tailed $t$-distributed measurement variation, we found that the total measurement variation accounted for 22\% of the total variance among females and 25\% among males, with population standard deviations of $1.07\, \rm g/dL$ for female donors and $1.28\, \rm g/dL$ for male donors. Our Bayesian framework can use data resulting from any clinical process with conditionally repeated biomarker measurements to estimate individuals' misclassification risk after one or more noisy continuous measurements and inform evidence-based conditional retesting decision rules.

2602.14872 2026-02-17 cs.LG cs.AI math.OC stat.ML

On the Learning Dynamics of RLVR at the Edge of Competence

Yu Huang, Zixin Wen, Yuejie Chi, Yuting Wei, Aarti Singh, Yingbin Liang, Yuxin Chen

详情
英文摘要

Reinforcement learning with verifiable rewards (RLVR) has been a main driver of recent breakthroughs in large reasoning models. Yet it remains a mystery how rewards based solely on final outcomes can help overcome the long-horizon barrier to extended reasoning. To understand this, we develop a theory of the training dynamics of RL for transformers on compositional reasoning tasks. Our theory characterizes how the effectiveness of RLVR is governed by the smoothness of the difficulty spectrum. When data contains abrupt discontinuities in difficulty, learning undergoes grokking-type phase transitions, producing prolonged plateaus before progress recurs. In contrast, a smooth difficulty spectrum leads to a relay effect: persistent gradient signals on easier problems elevate the model's capabilities to the point where harder ones become tractable, resulting in steady and continuous improvement. Our theory explains how RLVR can improve performance at the edge of competence, and suggests that appropriately designed data mixtures can yield scalable gains. As a technical contribution, our analysis develops and adapts tools from Fourier analysis on finite groups to our setting. We validate the predicted mechanisms empirically via synthetic experiments.

2602.14869 2026-02-17 cs.AI stat.ML

Concept Influence: Leveraging Interpretability to Improve Performance and Efficiency in Training Data Attribution

Matthew Kowal, Goncalo Paulo, Louis Jaburi, Tom Tseng, Lev E McKinney, Stefan Heimersheim, Aaron David Tucker, Adam Gleave, Kellin Pelrine

详情
英文摘要

As large language models are increasingly trained and fine-tuned, practitioners need methods to identify which training data drive specific behaviors, particularly unintended ones. Training Data Attribution (TDA) methods address this by estimating datapoint influence. Existing approaches like influence functions are both computationally expensive and attribute based on single test examples, which can bias results toward syntactic rather than semantic similarity. To address these issues of scalability and influence to abstract behavior, we leverage interpretable structures within the model during the attribution. First, we introduce Concept Influence which attribute model behavior to semantic directions (such as linear probes or sparse autoencoder features) rather than individual test examples. Second, we show that simple probe-based attribution methods are first-order approximations of Concept Influence that achieve comparable performance while being over an order-of-magnitude faster. We empirically validate Concept Influence and approximations across emergent misalignment benchmarks and real post-training datasets, and demonstrate they achieve comparable performance to classical influence functions while being substantially more scalable. More broadly, we show that incorporating interpretable structure within traditional TDA pipelines can enable more scalable, explainable, and better control of model behavior through data.

2602.14791 2026-02-17 cs.LG stat.ML

Extending Multi-Source Bayesian Optimization With Causality Principles

Luuk Jacobs, Mohammad Ali Javidian

Comments An extended abstract version of this work was accepted for the Proceedings of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

详情
英文摘要

Multi-Source Bayesian Optimization (MSBO) serves as a variant of the traditional Bayesian Optimization (BO) framework applicable to situations involving optimization of an objective black-box function over multiple information sources such as simulations, surrogate models, or real-world experiments. However, traditional MSBO assumes the input variables of the objective function to be independent and identically distributed, limiting its effectiveness in scenarios where causal information is available and interventions can be performed, such as clinical trials or policy-making. In the single-source domain, Causal Bayesian Optimization (CBO) extends standard BO with the principles of causality, enabling better modeling of variable dependencies. This leads to more accurate optimization, improved decision-making, and more efficient use of low-cost information sources. In this article, we propose a principled integration of the MSBO and CBO methodologies in the multi-source domain, leveraging the strengths of both to enhance optimization efficiency and reduce computational complexity in higher-dimensional problems. We present the theoretical foundations of both Causal and Multi-Source Bayesian Optimization, and demonstrate how their synergy informs our Multi-Source Causal Bayesian Optimization (MSCBO) algorithm. We compare the performance of MSCBO against its foundational counterparts for both synthetic and real-world datasets with varying levels of noise, highlighting the robustness and applicability of MSCBO. Based on our findings, we conclude that integrating MSBO with the causality principles of CBO facilitates dimensionality reduction and lowers operational costs, ultimately improving convergence speed, performance, and scalability.

2602.14701 2026-02-17 cs.LG stat.ML

Unbiased Approximate Vector-Jacobian Products for Efficient Backpropagation

Killian Bakong, Laurent Massoulié, Edouard Oyallon, Kevin Scaman

详情
英文摘要

In this work we introduce methods to reduce the computational and memory costs of training deep neural networks. Our approach consists in replacing exact vector-jacobian products by randomized, unbiased approximations thereof during backpropagation. We provide a theoretical analysis of the trade-off between the number of epochs needed to achieve a target precision and the cost reduction for each epoch. We then identify specific unbiased estimates of vector-jacobian products for which we establish desirable optimality properties of minimal variance under sparsity constraints. Finally we provide in-depth experiments on multi-layer perceptrons, BagNets and Visual Transfomers architectures. These validate our theoretical results, and confirm the potential of our proposed unbiased randomized backpropagation approach for reducing the cost of deep learning.

2602.14692 2026-02-17 stat.CO math.PR

Weak Poincaré inequalities for Deterministic-scan Metropolis-within-Gibbs samplers

Mengxi Gao, Gareth O. Roberts, Andi Q. Wang

Comments 51 pages

详情
英文摘要

Using the framework of weak Poincaré inequalities, we analyze the convergence properties of deterministic-scan Metropolis-within-Gibbs samplers, an important class of Markov chain Monte Carlo algorithms. Our analysis applies to nonreversible Markov chains and yields explicit (subgeometric) convergence bounds through novel comparison techniques based on Dirichlet forms. We show that the joint chain inherits the convergence behavior of the marginal chain and conversely. In addition, we establish several fundamental results for weak Poincaré inequalities for discrete-time Markov chains, such as a tensorization property for independent chains. We apply our theoretical results through applications to algorithms for Bayesian inference for a hierarchical regression model and a diffusion model under discretely-observed data.

2602.14678 2026-02-17 quant-ph cond-mat.dis-nn stat.CO

NISQ-compatible quantum cryptography based on Parrondo dynamics in discrete-time quantum walks

Aditi Rath, Dinesh Kumar Panda, Colin Benjamin

Comments 23 pages, 24 figures, 3 tables

详情
英文摘要

Compatibility with noisy intermediate-scale quantum (NISQ) devices is crucial for the realistic implementation of quantum cryptographic protocols. We investigate a cryptographic scheme based on discrete-time quantum walks (DTQWs) on cyclic graphs that exploits Parrondo dynamics, wherein periodic evolution emerges from a deterministic sequence of individually chaotic coin operators. We construct an explicit quantum circuit realization tailored to NISQ architectures and analyze its performance through numerical simulations in Qiskit under both ideal and noisy conditions. Protocol performance is quantified using probability distributions, Hellinger fidelity, and total variation distance. To assess security at the circuit level, we model intercept-resend and man-in-the-middle attacks and evaluate the resulting quantum bit error rate. In the absence of adversarial intervention, the protocol enables reliable message recovery, whereas eavesdropping induces characteristic disturbances that disrupt the periodic reconstruction mechanism. We further examine hardware feasibility on contemporary NISQ processors, specifically $ibm\_torino$, incorporating qubit connectivity and state-transfer constraints into the circuit design. Our analysis demonstrates that communication between spatially separated logical modules increases circuit depth via SWAP operations, leading to cumulative noise effects. By exploring hybrid state-transfer strategies, we show that qubit selection and connectivity play a decisive role in determining fidelity and overall protocol performance, highlighting hardware-dependent trade-offs in NISQ implementations.

2602.14642 2026-02-17 stat.ML cs.LG

GenPANIS: A Latent-Variable Generative Framework for Forward and Inverse PDE Problems in Multiphase Media

Matthaios Chatzopoulos, Phaedon-Stelios Koutsourelakis

详情
英文摘要

Inverse problems and inverse design in multiphase media, i.e., recovering or engineering microstructures to achieve target macroscopic responses, require operating on discrete-valued material fields, rendering the problem non-differentiable and incompatible with gradient-based methods. Existing approaches either relax to continuous approximations, compromising physical fidelity, or employ separate heavyweight models for forward and inverse tasks. We propose GenPANIS, a unified generative framework that preserves exact discrete microstructures while enabling gradient-based inference through continuous latent embeddings. The model learns a joint distribution over microstructures and PDE solutions, supporting bidirectional inference (forward prediction and inverse recovery) within a single architecture. The generative formulation enables training with unlabeled data, physics residuals, and minimal labeled pairs. A physics-aware decoder incorporating a differentiable coarse-grained PDE solver preserves governing equation structure, enabling extrapolation to varying boundary conditions and microstructural statistics. A learnable normalizing flow prior captures complex posterior structure for inverse problems. Demonstrated on Darcy flow and Helmholtz equations, GenPANIS maintains accuracy on challenging extrapolative scenarios - including unseen boundary conditions, volume fractions, and microstructural morphologies, with sparse, noisy observations. It outperforms state-of-the-art methods while using 10 - 100 times fewer parameters and providing principled uncertainty quantification.

2602.06568 2026-02-17 math.ST math.PR stat.CO stat.TH

Ergodicity of an Adaptive MCMC Sampler under a Probability Bound

Alexandre Chotard

详情
英文摘要

This paper provides sufficient conditions over the sequence of samples and parameters of an adaptive Markov Chain Monte Carlo (MCMC) algorithm to ensure ergodicity with respect to a target distribution that can have unbounded support. These conditions aim to make more easily usable the conditions of Containment and Diminishing Adaptation from Roberts and Rosenthal [2007] formulated over the transition kernels, without needing, as was done in other works, an artificial assumption of the compactness over both sample and parameter spaces. The paper shows that the condition of compactness can be relaxed to a more realistic bound in probability over the sequence of both samples and parameters.

2602.06320 2026-02-17 stat.ML cond-mat.dis-nn cs.LG

High-Dimensional Limit of Stochastic Gradient Flow via Dynamical Mean-Field Theory

Sota Nishiyama, Masaaki Imaizumi

详情
英文摘要

Modern machine learning models are typically trained via multi-pass stochastic gradient descent (SGD) with small batch sizes, and understanding their dynamics in high dimensions is of great interest. However, an analytical framework for describing the high-dimensional asymptotic behavior of multi-pass SGD with small batch sizes for nonlinear models is currently missing. In this study, we address this gap by analyzing the high-dimensional dynamics of a stochastic differential equation called a \emph{stochastic gradient flow} (SGF), which approximates multi-pass SGD in this regime. In the limit where the number of data samples $n$ and the dimension $d$ grow proportionally, we derive a closed system of low-dimensional and continuous-time equations and prove that it characterizes the asymptotic distribution of the SGF parameters. Our theory is based on the dynamical mean-field theory (DMFT) and is applicable to a wide range of models encompassing generalized linear models and two-layer neural networks. We further show that the resulting DMFT equations recover several existing high-dimensional descriptions of SGD dynamics as special cases, thereby providing a unifying perspective on prior frameworks such as online SGD and high-dimensional linear regression. Our proof builds on the existing DMFT technique for gradient flow and extends it to handle the stochasticity in SGF using tools from stochastic calculus.

2602.01427 2026-02-17 stat.ML cs.LG stat.AP

Robust Generalization with Adaptive Optimal Transport Priors for Decision-Focused Learning

Haixiang Sun, Andrew L. Liu

详情
Journal ref
The 29th International Conference on Artificial Intelligence and Statistics (AISTATS), 2026
英文摘要

Few-shot learning requires models to generalize under limited supervision while remaining robust to distribution shifts. Existing Sinkhorn Distributionally Robust Optimization (DRO) methods provide theoretical guarantees but rely on a fixed reference distribution, which limits their adaptability. We propose a Prototype-Guided Distributionally Robust Optimization (PG-DRO) framework that learns class-adaptive priors from abundant base data via hierarchical optimal transport and embeds them into the Sinkhorn DRO formulation. This design enables few-shot information to be organically integrated into producing class-specific robust decisions that are both theoretically grounded and efficient, and further aligns the uncertainty set with transferable structural knowledge. Experiments show that PG-DRO achieves stronger robust generalization in few-shot scenarios, outperforming both standard learners and DRO baselines.

2601.21812 2026-02-17 stat.ML cs.AI cs.LG

A Decomposable Forward Process in Diffusion Models for Time-Series Forecasting

Francisco Caldas, Sahil Kumar, Cláudia Soares

Comments submitted to ICML'26

详情
英文摘要

We introduce a model-agnostic forward diffusion process for time-series forecasting that decomposes signals into spectral components, preserving structured temporal patterns such as seasonality more effectively than standard diffusion. Unlike prior work that modifies the network architecture or diffuses directly in the frequency domain, our proposed method alters only the diffusion process itself, making it compatible with existing diffusion backbones (e.g., DiffWave, TimeGrad, CSDI). By staging noise injection according to component energy, it maintains high signal-to-noise ratios for dominant frequencies throughout the diffusion trajectory, thereby improving the recoverability of long-term patterns. This strategy enables the model to maintain the signal structure for a longer period in the forward process, leading to improved forecast quality. Across standard forecasting benchmarks, we show that applying spectral decomposition strategies, such as the Fourier or Wavelet transform, consistently improves upon diffusion models using the baseline forward process, with negligible computational overhead. The code for this paper is available at https://anonymous.4open.science/r/D-FDP-4A29.

2512.19064 2026-02-17 stat.AP

Unraveling time-varying causal effects of multiple exposures: integrating Functional Data Analysis with Multivariable Mendelian Randomization

Nicole Fontana, Francesca Ieva, Luisa Zuccolo, Emanuele Di Angelantonio, Piercesare Secchi

详情
英文摘要

Mendelian Randomization is a widely used instrumental variable method for assessing causal effects of lifelong exposures on health outcomes. Many exposures, however, have causal effects that vary across the life course and often influence outcomes jointly with other exposures or indirectly through mediating pathways. Existing approaches to multivariable Mendelian Randomization assume constant effects over time and therefore fail to capture these dynamic relationships. We introduce Multivariable Functional Mendelian Randomization (MV-FMR), a new framework that extends functional Mendelian Randomization to simultaneously model multiple time-varying exposures. The method combines functional principal component analysis with a data-driven cross-validation strategy for basis selection and accounts for overlapping instruments and mediation effects. Through extensive simulations, we assessed MV-FMR's ability to recover time-varying causal effects under a range of data-generating scenarios and compared the performance of joint versus separate exposure effect estimation strategies. Across scenarios involving nonlinear effects, horizontal pleiotropy, mediation, and sparse data, MV-FMR consistently recovered the true causal functions and outperformed univariable approaches. To demonstrate its practical value, we applied MV-FMR to UK Biobank data to investigate the time-varying causal effects of systolic blood pressure and body mass index on coronary artery disease. MV-FMR provides a flexible and interpretable framework for disentangling complex time-dependent causal processes and offers new opportunities for identifying life-course critical periods and actionable drivers relevant to disease prevention.

2512.17979 2026-02-17 cs.GT cs.AI cs.MA econ.GN q-fin.EC stat.AP

Adaptive Agents in Spatial Double-Auction Markets: Modeling the Emergence of Industrial Symbiosis

Matthieu Mastio, Paul Saves, Benoit Gaudou, Nicolas Verstaevel

Comments AAMAS CC-BY 4.0 licence. Adaptive Agents in Spatial Double-Auction Markets: Modeling the Emergence of Industrial Symbiosis. Full paper. In Proc. of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026), Paphos, Cyprus, May 25 - 29, 2026, IFAAMAS, 10 pages

详情
Journal ref
AAMAS 2026, Paphos, IFAAMAS, 10 pages
英文摘要

Industrial symbiosis fosters circularity by enabling firms to repurpose residual resources, yet its emergence is constrained by socio-spatial frictions that shape costs, matching opportunities, and market efficiency. Existing models often overlook the interaction between spatial structure, market design, and adaptive firm behavior, limiting our understanding of where and how symbiosis arises. We develop an agent-based model where heterogeneous firms trade byproducts through a spatially embedded double-auction market, with prices and quantities emerging endogenously from local interactions. Leveraging reinforcement learning, firms adapt their bidding strategies to maximize profit while accounting for transport costs, disposal penalties, and resource scarcity. Simulation experiments reveal the economic and spatial conditions under which decentralized exchanges converge toward stable and efficient outcomes. Counterfactual regret analysis shows that sellers' strategies approach a near Nash equilibrium, while sensitivity analysis highlights how spatial structures and market parameters jointly govern circularity. Our model provides a basis for exploring policy interventions that seek to align firm incentives with sustainability goals, and more broadly demonstrates how decentralized coordination can emerge from adaptive agents in spatially constrained markets.

2512.13228 2026-02-17 cs.LG stat.ML

ModSSC: A Modular Framework for Semi-Supervised Classification on Heterogeneous Data

Melvin Barbaux, Samia Boukir

Comments Preprint describing the open source ModSSC framework for inductive and transductive semi-supervised classification on heterogeneous data

详情
英文摘要

Semi-supervised classification leverages both labeled and unlabeled data to improve predictive performance, but existing software support remains fragmented across methods, learning settings, and data modalities. We introduce ModSSC, an open source Python framework for inductive and transductive semi-supervised classification designed to support reproducible and controlled experimentation. ModSSC provides a modular and extensible software architecture centered on reusable semi-supervised learning components, stable abstractions, and fully declarative experiment specification. Experiments are defined through configuration files, enabling systematic comparison across heterogeneous datasets and model backbones without modifying algorithmic code. ModSSC 1.0.0 is released under the MIT license with full documentation and automated tests, and is available at https://github.com/ModSSC/ModSSC. The framework is validated through controlled experiments reproducing established semi-supervised learning baselines across multiple data modalities.

2511.15315 2026-02-17 stat.ML cs.LG

Robust Bayesian Optimisation with Unbounded Corruptions

Abdelhamid Ezzerg, Ilija Bogunovic, Jeremias Knoblauch

详情
英文摘要

Bayesian Optimization is critically vulnerable to extreme outliers. Existing provably robust methods typically assume a bounded cumulative corruption budget, which makes them defenseless against even a single corruption of sufficient magnitude. To address this, we introduce a new adversary whose budget is only bounded in the frequency of corruptions, not in their magnitude. We then derive RCGP-UCB, an algorithm coupling the famous upper confidence bound (UCB) approach with a Robust Conjugate Gaussian Process (RCGP). We present stable and adaptive versions of RCGP-UCB, and prove that they achieve sublinear regret in the presence of up to $O(T^{1/4})$ and $O(T^{1/7})$ corruptions with possibly infinite magnitude. This robustness comes at near zero cost: without outliers, RCGP-UCB's regret bounds match those of the standard GP-UCB algorithm.

2508.17622 2026-02-17 stat.ML cs.LG econ.TH math.OC

The Statistical Fairness-Accuracy Frontier

Alireza Fallah, Michael I. Jordan, Annie Ulichney

详情
英文摘要

We study fairness-accuracy tradeoffs when a single predictive model must serve multiple demographic groups. A useful tool for understanding this tradeoff is the fairness-accuracy (FA) Pareto frontier, which characterizes the set of models that cannot be improved in either fairness or accuracy without worsening the other. While characterizing the FA frontier requires full knowledge of the data distribution, we focus on the finite-sample regime, quantifying how well a designer can approximate any point on the frontier from limited data and bounding the worst-case gap. In particular, we derive worst-case-optimal estimators that depend on the designer's knowledge of the covariate distribution. For each estimator, we characterize how finite-sample effects asymmetrically impact each group's welfare and identify optimal sample allocation strategies. Finally, we provide uniform finite-sample bounds for the entire FA frontier, yielding confidence bands that quantify the reliability of welfare comparisons across alternative fairness-accuracy tradeoffs.

2506.13593 2026-02-17 cs.LG stat.AP stat.ML

Calibrated Predictive Lower Bounds on Time-to-Unsafe-Sampling in LLMs

Hen Davidov, Shai Feldman, Gilad Freidkin, Yaniv Romano

详情
英文摘要

We introduce time-to-unsafe-sampling, a novel safety measure for generative models, defined as the number of generations required by a large language model (LLM) to trigger an unsafe (e.g., toxic) response. While providing a new dimension for prompt-adaptive safety evaluation, quantifying time-to-unsafe-sampling is challenging: unsafe outputs are often rare in well-aligned models and thus may not be observed under any feasible sampling budget. To address this challenge, we frame this estimation problem as one of survival analysis. We build on recent developments in conformal prediction and propose a novel calibration technique to construct a lower predictive bound (LPB) on the time-to-unsafe-sampling of a given prompt with rigorous coverage guarantees. Our key technical innovation is an optimized sampling-budget allocation scheme that improves sample efficiency while maintaining distribution-free guarantees. Experiments on both synthetic and real data support our theoretical results and demonstrate the practical utility of our method for safety risk assessment in generative AI models.

2505.16788 2026-02-17 stat.CO stat.AP

Interpretable contour level selection for heat maps for gridded data

Tarn Duong

详情
英文摘要

Gridded data formats, where the observed multivariate data are aggregated into grid cells, ensure confidentiality and reduce storage requirements, with the trade-off that access to the underlying point data is lost. Heat maps are a highly pertinent visualisation for gridded data, and heat maps with a small number of well-selected contour levels offer improved interpretability over continuous contour levels. There are many possible contour level choices. Amongst them, density contour levels are highly suitable in many cases. Current methods for computing density contour levels requires access to the observed point data, so they are not applicable to gridded data. To remedy this, we introduce an approximation of density contour levels for gridded data. We then compare our proposed method to existing contour level selection methods, and conclude that our proposal provides improved interpretability for synthetic and experimental gridded data.

2504.13781 2026-02-17 stat.ME

Addressing outliers in mixed-effects logistic regression: a more robust modeling approach

Divan A. Burger, Sean van der Merwe, Emmanuel Lesaffre

详情
英文摘要

This study introduces an outlier-robust model for analyzing hierarchically structured bounded count data within a Bayesian framework, utilizing a logistic regression approach implemented in JAGS. Our model incorporates a t-distributed latent variable to address overdispersion and outliers, improving robustness compared to conventional models such as the beta-binomial, binomial-logit-normal, and standard binomial models. Notably, our model targets a pseudo-median that differs from the true discrete median by less than one count; this closed-form quantity provides a robust and interpretable measure of central tendency. For comparability between all models, we additionally make predictions based on the mean proportion; however, this involves an integration step for the t-distributed nuisance parameter. While limited literature specifically addresses outliers in mixed models for bounded count data, this research fills that gap. The practical utility of the model is demonstrated using a longitudinal medication adherence dataset, where patient behavior often results in abrupt changes and outliers within individual trajectories. A simulation study demonstrates the binomial-logit-t model's strong performance, with comparison statistics favoring it among the four evaluated models. An additional data contamination simulation confirms its robustness against outliers. Our robust approach maintains the integrity of the dataset, effectively handling outliers to provide more accurate and reliable parameter estimates.

2503.21980 2026-02-17 math.ST stat.ME stat.ML stat.TH

Rolled Gaussian process models for curves on manifolds

Simon Preston, Karthik Bharath, Pablo Lopez-Custodio, Alfred Kume

详情
英文摘要

Given a planar curve, imagine rolling a sphere along that curve without slipping or twisting, and by this means tracing out a curve on the sphere. It is well known that such a rolling operation induces a local isometry between the sphere and the plane so that the two curves uniquely determine each other, and moreover, the operation extends to a general class of manifolds in any dimension. We use rolling to construct an analogue of a Gaussian process on a manifold starting from a Euclidean Gaussian process with mean $m$ and covariance $K$, and refer to it as a rolled Gaussian process parameterized by $m$ and $K$. The resulting model is generative, and is amenable to statistical inference given data as curves on a manifold. We identify conditions on the manifold under which the rolling of $m$ equals the Fréchet mean of the rolled Gaussian process, propose computationally simple estimators of $m$ and $K$, and derive their rates of convergence. We illustrate with examples on the unit sphere, symmetric positive-definite matrices, and with a robotics application involving 3D orientations.

2503.21715 2026-02-17 stat.ME econ.EM

A Powerful Bootstrap Test of Independence in High Dimensions

Mauricio Olivares, Tomasz Olma, Daniel Wilhelm

详情
英文摘要

This paper proposes a nonparametric test of pairwise independence of one random variable from a large pool of other random variables. The test statistic is the maximum of several Chatterjee's rank correlations and critical values are computed via a block multiplier bootstrap. We show in simulations that other popular tests based on distance covariances do not necessarily control size under this null. Our test, on the other hand, is shown to asymptotically control size uniformly over a large class of data-generating processes, even when the number of variables is much larger than sample size. The test is consistent against any fixed alternative. It can be combined with a stepwise procedure for selecting those variables from the pool that violate independence, while controlling the family-wise error rate. All formal results leave the dependence among variables in the pool completely unrestricted. In simulations, we find that our test is typically more powerful than competing methods (in settings where they are valid), particularly in high-dimensional scenarios or when there is dependence among variables in the pool.

2503.09411 2026-02-17 cs.LG math.OC stat.ML

Learning Rate Annealing Improves Tuning Robustness in Stochastic Optimization

Amit Attia, Tomer Koren

Comments 23 pages

详情
英文摘要

The learning rate in stochastic gradient methods is a critical hyperparameter that is notoriously costly to tune via standard grid search, especially for training modern large-scale models with billions of parameters. We identify a theoretical advantage of learning rate annealing schemes that decay the learning rate to zero at a polynomial rate, such as the widely-used cosine schedule, by demonstrating their increased robustness to initial parameter misspecification due to a coarse grid search. We present an analysis in a stochastic convex optimization setup demonstrating that the convergence rate of stochastic gradient descent with annealed schedules depends sublinearly on the multiplicative misspecification factor $ρ$ (i.e., the grid resolution), achieving a rate of $O(ρ^{1/(2p+1)}/\sqrt{T})$ where $p$ is the degree of polynomial decay and $T$ is the number of steps. This is in contrast to the $O(ρ/\sqrt{T})$ rate obtained under the inverse-square-root and fixed stepsize schedules, which depend linearly on $ρ$. Experiments confirm the increased robustness compared to tuning with a fixed stepsize, that has significant implications for the computational overhead of hyperparameter search in practical training scenarios.

2602.14616 2026-02-17 stat.CO q-bio.QM stat.AP

Higher-Order Hit-&-Run Samplers for Linearly Constrained Densities

Richard D. Paul, Anton Stratmann, Johann F. Jadebeck, Martin Beyß, Hanno Scharr, David Rügamer, Katharina Nöh

详情
英文摘要

Markov chain Monte Carlo (MCMC) sampling of densities restricted to linearly constrained domains is an important task arising in Bayesian treatment of inverse problems in the natural sciences. While efficient algorithms for uniform polytope sampling exist, much less work has dealt with more complex constrained densities. In particular, gradient information as used in unconstrained MCMC is not necessarily helpful in the constrained case, where the gradient may push the proposal's density out of the polytope. In this work, we propose a novel constrained sampling algorithm, which combines strengths of higher-order information, like the target's log-density's gradients and curvature, with the Hit-&-Run proposal, a simple mechanism which guarantees the generation of feasible proposals, fulfilling the linear constraints. Our extensive experiments demonstrate improved sampling efficiency on complex constrained densities over various constrained and unconstrained samplers.

2602.14607 2026-02-17 stat.ME cs.LG cs.NA math.NA stat.CO

A Bayesian Approach to Low-Discrepancy Subset Selection

Nathan Kirk

Comments 13 pages, 3 figures, mODa14

详情
英文摘要

Low-discrepancy designs play a central role in quasi-Monte Carlo methods and are increasingly influential in other domains such as machine learning, robotics and computer graphics, to name a few. In recent years, one such low-discrepancy construction method called subset selection has received a lot of attention. Given a large population, one optimally selects a small low-discrepancy subset with respect to a discrepancy-based objective. Versions of this problem are known to be NP-hard. In this text, we establish, for the first time, that the subset selection problem with respect to kernel discrepancies is also NP-hard. Motivated by this intractability, we propose a Bayesian Optimization procedure for the subset selection problem utilizing the recent notion of deep embedding kernels. We demonstrate the performance of the BO algorithm to minimize discrepancy measures and note that the framework is broadly applicable any design criteria.

2602.14587 2026-02-17 cs.LG cs.AI math.OC math.ST stat.TH

Decoupled Continuous-Time Reinforcement Learning via Hamiltonian Flow

Minh Nguyen

详情
英文摘要

Many real-world control problems, ranging from finance to robotics, evolve in continuous time with non-uniform, event-driven decisions. Standard discrete-time reinforcement learning (RL), based on fixed-step Bellman updates, struggles in this setting: as time gaps shrink, the $Q$-function collapses to the value function $V$, eliminating action ranking. Existing continuous-time methods reintroduce action information via an advantage-rate function $q$. However, they enforce optimality through complicated martingale losses or orthogonality constraints, which are sensitive to the choice of test processes. These approaches entangle $V$ and $q$ into a large, complex optimization problem that is difficult to train reliably. To address these limitations, we propose a novel decoupled continuous-time actor-critic algorithm with alternating updates: $q$ is learned from diffusion generators on $V$, and $V$ is updated via a Hamiltonian-based value flow that remains informative under infinitesimal time steps, where standard max/softmax backups fail. Theoretically, we prove rigorous convergence via new probabilistic arguments, sidestepping the challenge that generator-based Hamiltonians lack Bellman-style contraction under the sup-norm. Empirically, our method outperforms prior continuous-time and leading discrete-time baselines across challenging continuous-control benchmarks and a real-world trading task, achieving 21% profit over a single quarter$-$nearly doubling the second-best method.

2602.14580 2026-02-17 cs.LG stat.ML

Replicable Constrained Bandits

Matteo Bollini, Gianmarco Genalti, Francesco Emanuele Stradi, Matteo Castiglioni, Alberto Marchesi

详情
英文摘要

Algorithmic \emph{replicability} has recently been introduced to address the need for reproducible experiments in machine learning. A \emph{replicable online learning} algorithm is one that takes the same sequence of decisions across different executions in the same environment, with high probability. We initiate the study of algorithmic replicability in \emph{constrained} MAB problems, where a learner interacts with an unknown stochastic environment for $T$ rounds, seeking not only to maximize reward but also to satisfy multiple constraints. Our main result is that replicability can be achieved in constrained MABs. Specifically, we design replicable algorithms whose regret and constraint violation match those of non-replicable ones in terms of $T$. As a key step toward these guarantees, we develop the first replicable UCB-like algorithm for \emph{unconstrained} MABs, showing that algorithms that employ the optimism in-the-face-of-uncertainty principle can be replicable, a result that we believe is of independent interest.

2602.14543 2026-02-17 cs.LG stat.ML

Truly Adapting to Adversarial Constraints in Constrained MABs

Francesco Emanuele Stradi, Kalana Kalupahana, Matteo Castiglioni, Alberto Marchesi, Nicola Gatti

详情
英文摘要

We study the constrained variant of the \emph{multi-armed bandit} (MAB) problem, in which the learner aims not only at minimizing the total loss incurred during the learning dynamic, but also at controlling the violation of multiple \emph{unknown} constraints, under both \emph{full} and \emph{bandit feedback}. We consider a non-stationary environment that subsumes both stochastic and adversarial models and where, at each round, both losses and constraints are drawn from distributions that may change arbitrarily over time. In such a setting, it is provably not possible to guarantee both sublinear regret and sublinear violation. Accordingly, prior work has mainly focused either on settings with stochastic constraints or on relaxing the benchmark with fully adversarial constraints (\emph{e.g.}, via competitive ratios with respect to the optimum). We provide the first algorithms that achieve optimal rates of regret and \emph{positive} constraint violation when the constraints are stochastic while the losses may vary arbitrarily, and that simultaneously yield guarantees that degrade smoothly with the degree of adversariality of the constraints. Specifically, under \emph{full feedback} we propose an algorithm attaining $\widetilde{\mathcal{O}}(\sqrt{T}+C)$ regret and $\widetilde{\mathcal{O}}(\sqrt{T}+C)$ {positive} violation, where $C$ quantifies the amount of non-stationarity in the constraints. We then show how to extend these guarantees when only bandit feedback is available for the losses. Finally, when \emph{bandit feedback} is available for the constraints, we design an algorithm achieving $\widetilde{\mathcal{O}}(\sqrt{T}+C)$ {positive} violation and $\widetilde{\mathcal{O}}(\sqrt{T}+C\sqrt{T})$ regret.

2602.14520 2026-02-17 stat.ML astro-ph.HE

Accelerating Posterior Inference from Pulsar Light Curves via Learned Latent Representations and Local Simulator-Guided Optimization

Farhana Taiyebah, Abu Bucker Siddik, Indronil Bhattacharjee, Diane Oyen, Soumi De, Greg Olmschenk, Constantinos Kalapotharakos

详情
英文摘要

Posterior inference from pulsar observations in the form of light curves is commonly performed using Markov chain Monte Carlo methods, which are accurate but computationally expensive. We introduce a framework that accelerates posterior inference while maintaining accuracy by combining learned latent representations with local simulator-guided optimization. A masked U-Net is first pretrained to reconstruct complete light curves from partial observations and to produce informative latent embeddings. Given a query light curve, we identify similar simulated light curves from the simulation bank by measuring similarity in the learned embedding space produced by pretrained U-Net encoder, yielding an initial empirical approximation to the posterior over parameters. This initialization is then refined using a local optimization procedure using hill-climbing updates, guided by a forward simulator, progressively shifting the empirical posterior toward higher-likelihood parameter regions. Experiments on the observed light curve of PSR J0030+0451, captured by NASA's Neutron Star Interior Composition Explorer (NICER), show that our method closely matches posterior estimates obtained using traditional MCMC methods while achieving 120 times reduction in inference time (from 24 hours to 12 minutes), demonstrating the effectiveness of learned representations and simulator-guided optimization for accelerated posterior inference.

2602.14478 2026-02-17 stat.ML cs.DS cs.LG math.OC

Constrained and Composite Sampling via Proximal Sampler

Thanh Dang, Jiaming Liang

Comments The main paper is 13 pages; the rest are appendices

详情
英文摘要

We study two log-concave sampling problems: constrained sampling and composite sampling. First, we consider sampling from a target distribution with density proportional to $\exp(-f(x))$ supported on a convex set $K \subset \mathbb{R}^d$, where $f$ is convex. The main challenge is enforcing feasibility without degrading mixing. Using an epigraph transformation, we reduce this task to sampling from a nearly uniform distribution over a lifted convex set in $\mathbb{R}^{d+1}$. We then solve the lifted problem using a proximal sampler. Assuming only a separation oracle for $K$ and a subgradient oracle for $f$, we develop an implementation of the proximal sampler based on the cutting-plane method and rejection sampling. Unlike existing constrained samplers that rely on projection, reflection, barrier functions, or mirror maps, our approach enforces feasibility using only minimal oracle access, resulting in a practical and unbiased sampler without knowing the geometry of the constraint set. Second, we study composite sampling, where the target is proportional to $\exp(-f(x)-h(x))$ with closed and convex $f$ and $h$. This composite structure is standard in Bayesian inference with $f$ modeling data fidelity and $h$ encoding prior information. We reduce composite sampling via an epigraph lifting of $h$ to constrained sampling in $\mathbb{R}^{d+1}$, which allows direct application of the constrained sampling algorithm developed in the first part. This reduction results in a double epigraph lifting formulation in $\mathbb{R}^{d+2}$, on which we apply a proximal sampler. By keeping $f$ and $h$ separate, we further demonstrate how different combinations of oracle access (such as subgradient and proximal) can be leveraged to construct separation oracles for the lifted problem. For both sampling problems, we establish mixing time bounds measured in Rényi and $χ^2$ divergences.

2602.14472 2026-02-17 math.ST cs.LG math.OC stat.ML stat.TH

Frequentist Regret Analysis of Gaussian Process Thompson Sampling via Fractional Posteriors

Somjit Roy, Prateek Jaiswal, Anirban Bhattacharya, Debdeep Pati, Bani K. Mallick

Comments 34 pages, Submitted

详情
英文摘要

We study Gaussian Process Thompson Sampling (GP-TS) for sequential decision-making over compact, continuous action spaces and provide a frequentist regret analysis based on fractional Gaussian process posteriors, without relying on domain discretization as in prior work. We show that the variance inflation commonly assumed in existing analyses of GP-TS can be interpreted as Thompson Sampling with respect to a fractional posterior with tempering parameter $α\in (0,1)$. We derive a kernel-agnostic regret bound expressed in terms of the information gain parameter $γ_t$ and the posterior contraction rate $ε_t$, and identify conditions on the Gaussian process prior under which $ε_t$ can be controlled. As special cases of our general bound, we recover regret of order $\tilde{\mathcal{O}}(T^{\frac{1}{2}})$ for the squared exponential kernel, $\tilde{\mathcal{O}}(T^{\frac{2ν+3d}{2(2ν+d)}} )$ for the Matérn-$ν$ kernel, and a bound of order $\tilde{\mathcal{O}}(T^{\frac{2ν+3d}{2(2ν+d)}})$ for the rational quadratic kernel. Overall, our analysis provides a unified and discretization-free regret framework for GP-TS that applies broadly across kernel classes.

2602.14432 2026-02-17 cs.LG cs.AI stat.ML

S2D: Selective Spectral Decay for Quantization-Friendly Conditioning of Neural Activations

Arnav Chavan, Nahush Lele, Udbhav Bamba, Sankalp Dayal, Aditi Raghunathan, Deepak Gupta

详情
英文摘要

Activation outliers in large-scale transformer models pose a fundamental challenge to model quantization, creating excessively large ranges that cause severe accuracy drops during quantization. We empirically observe that outlier severity intensifies with pre-training scale (e.g., progressing from CLIP to the more extensively trained SigLIP and SigLIP2). Through theoretical analysis as well as empirical correlation studies, we establish the direct link between these activation outliers and dominant singular values of the weights. Building on this insight, we propose Selective Spectral Decay ($S^2D$), a geometrically-principled conditioning method that surgically regularizes only the weight components corresponding to the largest singular values during fine-tuning. Through extensive experiments, we demonstrate that $S^2D$ significantly reduces activation outliers and produces well-conditioned representations that are inherently quantization-friendly. Models trained with $S^2D$ achieve up to 7% improved PTQ accuracy on ImageNet under W4A4 quantization and 4% gains when combined with QAT. These improvements also generalize across downstream tasks and vision-language models, enabling the scaling of increasingly large and rigorously trained models without sacrificing deployment efficiency.

2602.14423 2026-02-17 cs.LG cs.AI stat.ML

The geometry of invariant learning: an information-theoretic analysis of data augmentation and generalization

Abdelali Bouyahia, Frédéric LeBlanc, Mario Marchand

详情
英文摘要

Data augmentation is one of the most widely used techniques to improve generalization in modern machine learning, often justified by its ability to promote invariance to label-irrelevant transformations. However, its theoretical role remains only partially understood. In this work, we propose an information-theoretic framework that systematically accounts for the effect of augmentation on generalization and invariance learning. Our approach builds upon mutual information-based bounds, which relate the generalization gap to the amount of information a learning algorithm retains about its training data. We extend this framework by modeling the augmented distribution as a composition of the original data distribution with a distribution over transformations, which naturally induces an orbit-averaged loss function. Under mild sub-Gaussian assumptions on the loss function and the augmentation process, we derive a new generalization bound that decompose the expected generalization gap into three interpretable terms: (1) a distributional divergence between the original and augmented data, (2) a stability term measuring the algorithm dependence on training data, and (3) a sensitivity term capturing the effect of augmentation variability. To connect our bounds to the geometry of the augmentation group, we introduce the notion of group diameter, defined as the maximal perturbation that augmentations can induce in the input space. The group diameter provides a unified control parameter that bounds all three terms and highlights an intrinsic trade-off: small diameters preserve data fidelity but offer limited regularization, while large diameters enhance stability at the cost of increased bias and sensitivity. We validate our theoretical bounds with numerical experiments, demonstrating that it reliably tracks and predicts the behavior of the true generalization gap.

2602.14414 2026-02-17 stat.ME econ.EM stat.AP

The Role of Measured Covariates in Assessing Sensitivity to Unmeasured Confounding

Abhinandan Dalal, Iris Horng, Yang Feng, Dylan S. Small

详情
英文摘要

Sensitivity analysis is widely used to assess the robustness of causal conclusions in observational studies, yet its interaction with the structure of measured covariates is often overlooked. When latent confounders cannot be directly adjusted for and are instead controlled using proxy variables, strong associations between exposure and measured proxies can amplify sensitivity to residual confounding. We formalize this phenomenon in linear regression settings by showing that a simple ratio involving the exposure model coefficient and residual exposure variance provides an observable measure of this increased sensitivity. Applying our framework to smoking and lung cancer, we document how growing socioeconomic stratification in smoking behavior over time leads to heightened sensitivity to unmeasured confounding in more recent data. These results highlight the importance of multicollinearity when interpreting sensitivity analyses based on proxy adjustment.

2602.14387 2026-02-17 stat.ME

Automatic Variance Adjustment for Small Area Estimation

Jon Wakefield, Jitong Jiang, Yunhan Wu

详情
英文摘要

Small area estimation (SAE) is a common endeavor and is used in a variety of disciplines. In low- and middle-income countries (LMICs), in which household surveys provide the most reliable and timely source of data, SAE is vital for highlighting disparities in health and demographic indicators. Weighted estimators are ideal for inference, but for fine geographical partitions in which there are insufficient data, SAE models are required. The most common approach is Fay-Herriot area-level modeling in which the data requirements are a weighted estimate and an associated variance estimate. The latter can be undefined or unstable when data are sparse and so we propose a principled modification which is based on augmenting the available data with a prior sample from a hypothetical survey. This adjustment is generally available, respects the design and is simple to implement. We examine the empirical properties of the adjustment through simulation and illustrate its use with wasting data from a 2018 Zambian Demographic and Health Survey. The modification is implemented as an automatic remedy in the R package surveyPrev, which provides a comprehensive suite of tools for conducing SAE in LMICs.

2602.14349 2026-02-17 stat.AP

Same Prompt, Different Outcomes: Evaluating the Reproducibility of Data Analysis by LLMs

Jiaxin Cui, Rohan Alexander

详情
英文摘要

We systematically evaluate the reproducibility of data analysis conducted by Large Language Models (LLMs). We evaluate two prompting strategies, six models, and four temperature settings, with ten independent executions per configuration, yielding 480 total attempts. We assess the completion, concordance, validity, and consistency of each attempt and find considerable variation in the analytical results even for consistent configurations. This suggests, as with human data analysis, the data analysis conducted by LLMs can vary, even given the same task, data, and settings. Our results mean that if an LLM is being used to conduct data analysis, then it should be run multiple times independently and the distribution of results considered.

2602.14284 2026-02-17 stat.OT cs.CY

Benchmarking AI Performance on End-to-End Data Science Projects

Evelyn Hughes, Rohan Alexander

详情
英文摘要

Data science is an integrated workflow of technical, analytical, communication, and ethical skills, but current AI benchmarks focus mostly on constituent parts. We test whether AI models can generate end-to-end data science projects. To do this we create a benchmark of 40 end-to-end data science projects with associated rubric evaluations. We use these to build an automated grading pipeline that systematically evaluates the data science projects produced by generative AI models. We find the extent to which generative AI models can complete end-to-end data science projects varies considerably by model. Most recent models did well on structured tasks, but there were considerable differences on tasks that needed judgment. These findings suggest that while AI models could approximate entry-level data scientists on routine tasks, they require verification.

2602.14280 2026-02-17 stat.CO cs.LG

Fast Compute for ML Optimization

Nick Polson, Vadim Sokolov

详情
英文摘要

We study optimization for losses that admit a variance-mean scale-mixture representation. Under this representation, each EM iteration is a weighted least squares update in which latent variables determine observation and parameter weights; these play roles analogous to Adam's second-moment scaling and AdamW's weight decay, but are derived from the model. The resulting Scale Mixture EM (SM-EM) algorithm removes user-specified learning-rate and momentum schedules. On synthetic ill-conditioned logistic regression benchmarks with $p \in \{20, \ldots, 500\}$, SM-EM with Nesterov acceleration attains up to $13\times$ lower final loss than Adam tuned by learning-rate grid search. For a 40-point regularization path, sharing sufficient statistics across penalty values yields a $10\times$ runtime reduction relative to the same tuned-Adam protocol. For the base (non-accelerated) algorithm, EM monotonicity guarantees nonincreasing objective values; adding Nesterov extrapolation trades this guarantee for faster empirical convergence.

2602.14244 2026-02-17 stat.ML cs.LG

Federated Ensemble Learning with Progressive Model Personalization

Ala Emrani, Amir Najafi, Abolfazl Motahari

Comments 42 pages

详情
英文摘要

Federated Learning provides a privacy-preserving paradigm for distributed learning, but suffers from statistical heterogeneity across clients. Personalized Federated Learning (PFL) mitigates this issue by considering client-specific models. A widely adopted approach in PFL decomposes neural networks into a shared feature extractor and client-specific heads. While effective, this design induces a fundamental tradeoff: deep or expressive shared components hinder personalization, whereas large local heads exacerbate overfitting under limited per-client data. Most existing methods rely on rigid, shallow heads, and therefore fail to navigate this tradeoff in a principled manner. In this work, we propose a boosting-inspired framework that enables a smooth control of this tradeoff. Instead of training a single personalized model, we construct an ensemble of $T$ models for each client. Across boosting iterations, the depth of the personalized component are progressively increased, while its effective complexity is systematically controlled via low-rank factorization or width shrinkage. This design simultaneously limits overfitting and substantially reduces per-client bias by allowing increasingly expressive personalization. We provide theoretical analysis that establishes generalization bounds with favorable dependence on the average local sample size and the total number of clients. Specifically, we prove that the complexity of the shared layers is effectively suppressed, while the dependence on the boosting horizon $T$ is controlled through parameter reduction. Notably, we provide a novel nonlinear generalization guarantee for decoupled PFL models. Extensive experiments on benchmark and real-world datasets (e.g., EMNIST, CIFAR-10/100, and Sent140) demonstrate that the proposed framework consistently outperforms state-of-the-art PFL methods under heterogeneous data distributions.

2602.14206 2026-02-17 math.ST stat.TH

Kernel Estimation Of Chatterjee's Dependence Coefficient

Mona Azadkia, Holger Dette

详情
英文摘要

Dette, Siburg, and Stoimenov (2013) introduced a copula-based measure of dependence, which implies independence if it vanishes and is equal to 1 if one variable is a measurable function of the other. For continuous distributions, the dependence measure also appears as stochastic limit of Chatterjee's rank correlation (Chatterjee, 2021). They proved asymptotic normality of a corresponding kernel estimator with a parametric rate of convergence. In recent work Shi, Drton, and Han (2022) revealed empirically and theoretically that under independence the asymptotic variance degenerates. In this note, we derive the correct asymptotic distribution of the kernel estimator under the null hypothesis of independence. We show that after a suitable centering and rescaling at a rate larger than $\sqrt{n}$ (where $n$ is the sample size), the estimator is asymptotically normal. The analysis relies on a refined central limit theorem for double-indexed linear permutation statistics and accounts for boundary effects that are asymptotically non-negligible. As a consequence, we obtain a valid basis for independence testing without relying on permutations and argue that tests based on the kernel estimator detect local alternatives converging to the null at a faster rate than those detectable by Chatterjee's rank correlation.

2602.14203 2026-02-17 stat.AP

Evaluating the Impact of COVID-19 on Transportation Infrastructure Funding

Lu Gao, Pan Lu, Fengxiang Qiao, Joshua Qiang Li, Yunpeng Zhang, Yihao Ren

详情
Journal ref
International Conference on Transportation and Development 2022
英文摘要

The coronavirus disease 2019 (COVID-19) pandemic has caused a reduction in business and routine activity and resulted in less motor fuel consumption. Thus, the gas tax revenue is reduced which is the major funding resource supporting the rehabilitation and maintenance of transportation infrastructure systems. The focus of this study is to evaluate the impact of the COVID-19 pandemic on transportation infrastructure funds in the United States through analyzing the motor fuel consumption data. Machine learning models were developed by integrating COVID-19 scenarios, fuel consumptions, and demographic data. The best model achieves an R2-score of more than 95% and captures the fluctuations of fuel consumption during the pandemic. Using the developed model, we project future motor gas consumption for each state. For some states, the gas tax revenues are going to be 10%-15% lower than the pre-pandemic level for at least one or two years.

2602.14198 2026-02-17 stat.AP

Zipf-Mandelbrot Scaling in Korean Court Music: Universal Patterns in Music

Byeongchan Choi, Junwon You, Myung Ock Kim, Jae-Hun Jung

Comments 20 pages, 5 figures, 4 tables

详情
英文摘要

Zipf's law, originally discovered in natural language and later generalized to the Zipf-Mandelbrot law, describes a power-law relationship between the frequency of a Zipfian element and its rank. Due to the semantic characteristics of this law, it has also been observed in musical data. However, most such studies have focused on Western music, and its applicability to non-Western music remains not well investigated. We analyzed 43 Korean court music pieces called Jeong-ak, spanning several centuries and written in the traditional Korean musical notation Jeongganbo. These pieces were transcribed into Western staff notation, and musical data such as pitch and duration were extracted. Using pitch, duration, and their paired combinations as Zipfian units, we found that Korean music also fits the Zipf-Mandelbrot law to a high degree, particularly for the paired pitch-duration unit. Korean music has evolved collectively over long periods, smoothing idiosyncratic variations and producing forms that are widely understandable among people. This collective evolution appears to have played a significant role in shaping the characteristics that lead to the satisfaction of Zipf-Mandelbrot law. Our findings provide additional evidence that Zipf-Mandelbrot scaling in musical data is universal across cultures. We further show that the joint distribution of two independent Zipfian data sets follows the Zipf-Mandelbrot law; in this sense, our result does not merely extend Zipf's law but deepens our understanding of how scaling laws behave under composition and interaction, offering a more unified perspective on rank-based statistical regularities.

2602.14061 2026-02-17 stat.CO cs.NA math.NA

MPL-HMC: A Tunable Parameterized Leapfrog Framework for Robust Hamiltonian Monte Carlo

Sourabh Bhattacharya

Comments Feedback welcome

详情
英文摘要

This article introduces the Modified Parameterized Leapfrog Hamiltonian Monte Carlo (MPL-HMC) method, a novel extension of HMC addressing key limitations through tunable integration parameters $α(δt)$ and $β(δt)$, enabling controlled perturbations to Hamiltonian dynamics. Theoretical analysis demonstrates MPL-HMC maintains approximate detailed balance. Extensive empirical evaluation reveals systematic performance improvements. The damping variant ($α_2=-0.1$, $β_2=-0.05$) achieves a 14-fold increase in effective sample size for Neal's funnel and 27\% better efficiency for pharmacokinetic models. The anti-damping variant ($α_2=0.1$, $β_2=0.05$) achieves $\hat{R}=1.026$ for Bayesian neural networks versus $\hat{R}=1.981$ for standard HMC. We introduce aggressive MPL-HMC for multimodal distributions, employing extreme parameters ($α_2=8.0$--$15.0$, $β_2=5.0$--$8.0$) with enhanced sampling to achieve full mode exploration where standard methods fail. All variants maintain computational efficiency identical to standard HMC while providing systematic control over damping, exploration, stability, and accuracy. The article provides rigorous mathematical foundations, implementation specifications, parameter tuning strategies, and comprehensive performance comparisons, extending HMC's applicability to previously challenging domains.

2602.14053 2026-02-17 math.ST cs.NA math.NA stat.TH

Mean-Square Convergence of a New Parameterized Leapfrog Scheme for Hamiltonian Systems Driven by Gaussian Process Potentials

Sourabh Bhattacharya

Comments Feedback welcome

详情
英文摘要

This paper establishes the mean-square convergence of a new stochastic, parameterized leapfrog scheme introduced in our companion paper Mazumder et al. (2026) for Hamiltonian systems with Gaussian process potentials. We consider a one-step numerical integrator and provide a complete, rigorous analysis under minimal regularity assumptions on the Gaussian potential. The key technical contribution is identifying and exploiting the symplectic structure ingrained in our stochastic, parameterized leapfrog method. Combined with local truncation error analysis, this leads to a global error bound of O(δt) in mean-square sense. Our results establish that although the spatio-temporal model of Mazumder et al. (2026) arises as the anticipated new stochastic leapfrog solution of a system of modified (parameterized) stochastic Hamiltonian equations, the new stochastic leapfrog actually solves the traditional stochastic Hamiltonian equations, driven by Gaussian process potential.

2602.14029 2026-02-17 stat.ML cs.LG math.ST stat.TH

Why Self-Training Helps and Hurts: Denoising vs. Signal Forgetting

Mingqi Wu, Archer Y. Yang, Qiang Sun

Comments 8 pages main, 29 pages in total

详情
英文摘要

Iterative self-training (self-distillation) repeatedly refits a model on pseudo-labels generated by its own predictions. We study this procedure in overparameterized linear regression: an initial estimator is trained on noisy labels, and each subsequent iterate is trained on fresh covariates with noiseless pseudo-labels from the previous model. In the high-dimensional regime, we derive deterministic-equivalent recursions for the prediction risk and effective noise across iterations, and prove that the empirical quantities concentrate sharply around these limits. The recursion separates two competing forces: a systematic component that grows with iteration due to progressive signal forgetting, and a stochastic component that decays due to denoising via repeated data-dependent projections. Their interaction yields a $U$-shaped test-risk curve and an optimal early-stopping time. In spiked covariance models, iteration further acts as an iteration-dependent spectral filter that preserves strong eigendirections while suppressing weaker ones, inducing an implicit form of soft feature selection distinct from ridge regression. Finally, we propose an iterated generalized cross-validation criterion and prove its uniform consistency for estimating the risk along the self-training trajectory, enabling fully data-driven selection of the stopping time and regularization. Experiments on synthetic covariances validate the theory and illustrate the predicted denoising-forgetting trade-off.

2602.14020 2026-02-17 stat.ML cs.LG

Computable Bernstein Certificates for Cross-Fitted Clipped Covariance Estimation

Even He, Zaizai Yan

详情
英文摘要

We study operator-norm covariance estimation from heavy-tailed samples that may include a small fraction of arbitrary outliers. A simple and widely used safeguard is \emph{Euclidean norm clipping}, but its accuracy depends critically on an unknown clipping level. We propose a cross-fitted clipped covariance estimator equipped with \emph{fully computable} Bernstein-type deviation certificates, enabling principled data-driven tuning via a selector (\emph{MinUpper}) that balances certified stochastic error and a robust hold-out proxy for clipping bias. The resulting procedure adapts to intrinsic complexity measures such as effective rank under mild tail regularity and retains meaningful guarantees under only finite fourth moments. Experiments on contaminated spiked-covariance benchmarks illustrate stable performance and competitive accuracy across regimes.

2602.13942 2026-02-17 stat.ML cs.LG

A Theoretical Framework for LLM Fine-tuning Using Early Stopping for Non-random Initialization

Zexuan Sun, Garvesh Raskutti

详情
英文摘要

In the era of large language models (LLMs), fine-tuning pretrained models has become ubiquitous. Yet the theoretical underpinning remains an open question. A central question is why only a few epochs of fine-tuning are typically sufficient to achieve strong performance on many different tasks. In this work, we approach this question by developing a statistical framework, combining rigorous early stopping theory with the attention-based Neural Tangent Kernel (NTK) for LLMs, offering new theoretical insights on fine-tuning practices. Specifically, we formally extend classical NTK theory [Jacot et al., 2018] to non-random (i.e., pretrained) initializations and provide a convergence guarantee for attention-based fine-tuning. One key insight provided by the theory is that the convergence rate with respect to sample size is closely linked to the eigenvalue decay rate of the empirical kernel matrix induced by the NTK. We also demonstrate how the framework can be used to explain task vectors for multiple tasks in LLMs. Finally, experiments with modern language models on real-world datasets provide empirical evidence supporting our theoretical insights.

2602.13935 2026-02-17 cs.AI cs.LG stat.ML

Statistical Early Stopping for Reasoning Models

Yangxinyu Xie, Tao Wang, Soham Mallick, Yan Sun, Georgy Noarov, Mengxin Yu, Tanwi Mallick, Weijie J. Su, Edgar Dobriban

详情
英文摘要

While LLMs have seen substantial improvement in reasoning capabilities, they also sometimes overthink, generating unnecessary reasoning steps, particularly under uncertainty, given ill-posed or ambiguous queries. We introduce statistically principled early stopping methods that monitor uncertainty signals during generation to mitigate this issue. Our first approach is parametric: it models inter-arrival times of uncertainty keywords as a renewal process and applies sequential testing for stopping. Our second approach is nonparametric and provides finite-sample guarantees on the probability of halting too early on well-posed queries. We conduct empirical evaluations on reasoning tasks across several domains and models. Our results indicate that uncertainty-aware early stopping can improve both efficiency and reliability in LLM reasoning, and we observe especially significant gains for math reasoning.

2602.13888 2026-02-17 stat.ME stat.AP stat.CO stat.ML

Mixture-of-experts Wishart model for covariance matrices with an application to Cancer drug screening

The Tien Mai, Zhi Zhao

详情
英文摘要

Covariance matrices arise naturally in different scientific fields, including finance, genomics, and neuroscience, where they encode dependence structures and reveal essential features of complex multivariate systems. In this work, we introduce a comprehensive Bayesian framework for analyzing heterogeneous covariance data through both classical mixture models and a novel mixture-of-experts Wishart (MoE-Wishart) model. The proposed MoE-Wishart model extends standard Wishart mixtures by allowing mixture weights to depend on predictors through a multinomial logistic gating network. This formulation enables the model to capture complex, nonlinear heterogeneity in covariance structures and to adapt subpopulation membership probabilities to covariate-dependent patterns. To perform inference, we develop an efficient Gibbs-within-Metropolis-Hastings sampling algorithm tailored to the geometry of the Wishart likelihood and the gating network. We additionally derive an Expectation-Maximization algorithm for maximum likelihood estimation in the mixture-of-experts setting. Extensive simulation studies demonstrate that the proposed Bayesian and maximum likelihood estimators achieve accurate subpopulation recovery and estimation under a range of heterogeneous covariance scenarios. Finally, we present an innovative application of our methodology to a challenging dataset: cancer drug sensitivity profiles, illustrating the ability of the MoE-Wishart model to leverage covariance across drug dosages and replicate measurements. Our methods are implemented in the \texttt{R} package \texttt{moewishart} available at https://github.com/zhizuio/moewishart .

2602.13872 2026-02-17 stat.ME stat.ML

Predicting fixed-sample test decisions enables anytime-valid inference

Chris Holmes, Stephen Walker

详情
英文摘要

Statistical hypothesis tests typically use prespecified sample sizes, yet data often arrive sequentially. Interim analyses invalidate classical error guarantees, while existing sequential methods require rigid testing preschedules or incur substantial losses in statistical power. We introduce a simple procedure that transforms any fixed-sample hypothesis test into an anytime-valid test while ensuring Type-I error control and near-optimal power with substantial sample savings when the null hypothesis is false. At each step, the procedure predicts the probability that a classical test would reject the null hypothesis at its fixed-sample size, treating future observations as missing data under the null hypothesis. Thresholding this probability yields an anytime-valid stopping rule. In areas such as clinical trials, stopping early and safely can ensure that subjects receive the best treatments and accelerate the development of effective therapies.

2602.13871 2026-02-17 math.ST cs.IT cs.LG math.IT math.OC stat.AP stat.ML stat.TH

Ensemble-Conditional Gaussian Processes (Ens-CGP): Representation, Geometry, and Inference

Sai Ravela, Jae Deok Kim, Kenneth Gee, Xingjian Yan, Samson Mercier, Lubna Albarghouty, Anamitra Saha

Comments 20 pages. Technical manuscrupt on representational equivalence between conditional Gaussian inference, quadratic optimization, and RKHS geometry in finite dimensions

详情
英文摘要

We formulate Ensemble-Conditional Gaussian Processes (Ens-CGP), a finite-dimensional synthesis that centers ensemble-based inference on the conditional Gaussian law. Conditional Gaussian processes (CGP) arise directly from Gaussian processes under conditioning and, in linear-Gaussian settings, define the full posterior distribution for a Gaussian prior and linear observations. Classical Kalman filtering is a recursive algorithm that computes this same conditional law under dynamical assumptions; the conditional Gaussian law itself is therefore the underlying representational object, while the filter is one computational realization. In this sense, CGP provides the probabilistic foundation for Kalman-type methods as well as equivalent formulations as a strictly convex quadratic program (MAP estimation), RKHS-regularized regression, and classical regularization. Ens-CGP is the ensemble instantiation of this object, obtained by treating empirical ensemble moments as a (possibly low-rank) Gaussian prior and performing exact conditioning. By separating representation (GP -> CGP -> Ens-CGP) from computation (Kalman filters, EnKF variants, and iterative ensemble schemes), the framework links an earlier-established representational foundation for inference to ensemble-derived priors and clarifies the relationships among probabilistic, variational, and ensemble perspectives.

2602.13852 2026-02-17 cs.AI stat.AP

Experimentation Accelerator: Interpretable Insights and Creative Recommendations for A/B Testing with Content-Aware ranking

Zhengmian Hu, Lei Shi, Ritwik Sinha, Justin Grover, David Arbour

详情
英文摘要

Modern online experimentation faces two bottlenecks: scarce traffic forces tough choices on which variants to test, and post-hoc insight extraction is manual, inconsistent, and often content-agnostic. Meanwhile, organizations underuse historical A/B results and rich content embeddings that could guide prioritization and creative iteration. We present a unified framework to (i) prioritize which variants to test, (ii) explain why winners win, and (iii) surface targeted opportunities for new, higher-potential variants. Leveraging treatment embeddings and historical outcomes, we train a CTR ranking model with fixed effects for contextual shifts that scores candidates while balancing value and content diversity. For better interpretability and understanding, we project treatments onto curated semantic marketing attributes and re-express the ranker in this space via a sign-consistent, sparse constrained Lasso, yielding per-attribute coefficients and signed contributions for visual explanations, top-k drivers, and natural-language insights. We then compute an opportunity index combining attribute importance (from the ranker) with under-expression in the current experiment to flag missing, high-impact attributes. Finally, LLMs translate ranked opportunities into concrete creative suggestions and estimate both learning and conversion potential, enabling faster, more informative, and more efficient test cycles. These components have been built into a real Adobe product, called \textit{Experimentation Accelerator}, to provide AI-based insights and opportunities to scale experimentation for customers. We provide an evaluation of the performance of the proposed framework on some real-world experiments by Adobe business customers that validate the high quality of the generation pipeline.

2602.13804 2026-02-17 cs.AI cs.LG stat.ML

Attention in Constant Time: Vashista Sparse Attention for Long-Context Decoding with Exponential Guarantees

Vashista Nobaub

Comments 22 pages

详情
英文摘要

Large language models spend most of their inference cost on attention over long contexts, yet empirical behavior suggests that only a small subset of tokens meaningfully contributes to each query. We formalize this phenomenon by modeling attention as a projection onto the convex hull of key vectors and analyzing its entropic (softmax-like) relaxation. Our main theoretical contribution is a face-stability theorem showing that, under a strict complementarity margin (a support gap (Δ) certified by KKT multipliers), entropic attention concentrates on a constant-size active face: the total mass assigned to inactive tokens decays exponentially as (\exp(-Ω(Δ/\varepsilon))), while the error on the active face scales linearly in the temperature/regularization parameter (\varepsilon). This yields a practical criterion for when sparse long-context decoding is safe and provides a principled knob to trade accuracy for compute. Building on these guarantees, we introduce Vashista Sparse Attention, a drop-in mechanism that maintains a small candidate set per query through a paging-style context selection strategy compatible with modern inference stacks. Across long-context evaluations, we observe stable constant-size effective support, strong wall-clock speedups, and minimal quality degradation in the regimes predicted by the support-gap diagnostics. Finally, we discuss deployment implications for privacy-sensitive and air-gapped settings, where interchangeable attention modules enable predictable latency and cost without external retrieval dependencies.

2602.13729 2026-02-17 math.ST stat.ME stat.TH

Semi-supervised linear regression with missing covariates

Benedict M. Risebrow, Thomas B. Berrett

详情
英文摘要

Missing values in datasets are common in applied statistics. For regression problems, theoretical work thus far has largely considered the issue of missing covariates as distinct from missing responses. However, in practice, many datasets have both forms of missingness. Motivated by this gap, we study linear regression with a labelled dataset containing missing covariates, potentially alongside an unlabelled dataset. We consider both structured (blockwise-missing) and unstructured missingness patterns, along with sparse and non-sparse regression parameters. For the non-sparse case, we provide an estimator based on imputing the missing data combined with a reweighting step. For the high-dimensional sparse case, we use a modified version of the Dantzig selector. We provide non-asymptotic upper bounds on the risk of both procedures. These are matched by several new minimax lower bounds, demonstrating the rate optimality of our estimators. Notably, even when the linear model is well-specified, our results characterise substantial differences in the minimax rates when unlabelled data is present relative to the fully supervised setting. Particular consequences of our sparse and non-sparse results include the first matching upper and lower bounds on the minimax rate for the supervised setting when either unstructured or structured missingness is present. Our theory is coupled with extensive simulations and a semi-synthetic application to the California housing dataset.

2602.13722 2026-02-17 econ.EM stat.ME

The Accuracy Smoothness Dilemma in Prediction: a Novel Multivariate M-SSA Forecast Approach

Marc Wildi

详情
英文摘要

Forecasting presents a complex estimation challenge, as it involves balancing multiple, often conflicting, priorities and objectives. Conventional forecast optimization methods typically emphasize a single metric--such as minimizing the mean squared error (MSE)--which may neglect other crucial aspects of predictive performance. To address this limitation, the recently developed Smooth Sign Accuracy (SSA) framework extends the traditional MSE approach by simultaneously accounting for sign accuracy, MSE, and the frequency of sign changes in the predictor. This addresses a fundamental trade-off--the so-called accuracy-smoothness (AS) dilemma--in prediction. We extend this approach to the multivariate M-SSA, leveraging the original criterion to incorporate cross-sectional information across multiple time series. As a result, the M-SSA criterion enables the integration of various design objectives related to AS forecasting performance, effectively generalizing conventional MSE-based metrics. To demonstrate its practical applicability and versatility, we explore the application of the M-SSA in three primary domains: forecasting, real-time signal extraction (nowcasting), and smoothing. These case studies illustrate the framework's capacity to adapt to different contexts while effectively managing inherent trade-offs in predictive modelling.

2602.12027 2026-02-17 math.ST math.PR stat.TH

General-purpose post-sampling reweighting method for multimodal target measures

Pierre Monmarché

详情
英文摘要

When sampling multi-modal probability distributions, correctly estimating the relative probability of each mode, even when the modes have been discovered and locally sampled, remains challenging. We test a simple reweighting scheme designed for this situation, which consists in minimizing (in terms of weights) the Kullback-Leibler divergence of a weighted (regularized) empirical distribution of the samples with respect to the target measure.

2602.11712 2026-02-17 cs.LG cs.CE nlin.CD physics.data-an stat.ME

Potential-energy gating for robust state estimation in bistable stochastic systems

Luigi Simeone

Comments 20 pages, 8 figures

详情
英文摘要

We introduce potential-energy gating, a method for robust state estimation in systems governed by double-well stochastic dynamics. The observation noise covariance of a Bayesian filter is modulated by the local value of a known or assumed potential energy function: observations are trusted when the state is near a potential minimum and progressively discounted as it approaches the barrier separating metastable wells. This physics-based mechanism differs from statistical robust filters, which treat all state-space regions identically, and from constrained filters, which bound states rather than modulating observation trust. The approach is especially relevant in non-ergodic or data-scarce settings where only a single realization is available and statistical methods alone cannot learn the noise structure. We implement gating within Extended, Unscented, Ensemble, and Adaptive Kalman filters and particle filters, requiring only two additional hyperparameters. Monte Carlo benchmarks (100 replications) on a Ginzburg-Landau double-well with 10% outlier contamination show 57-80% RMSE improvement over the standard Extended Kalman Filter, all statistically significant (p < 10^{-15}, Wilcoxon test). A naive topological baseline using only well positions achieves 57%, confirming that the continuous energy landscape adds ~21 percentage points. The method is robust to misspecification: even with 50% parameter errors, improvement never falls below 47%. Comparing externally forced and spontaneous Kramers-type transitions, gating retains 68% improvement under noise-induced transitions whereas the naive baseline degrades to 30%. As an empirical illustration, we apply the framework to Dansgaard-Oeschger events in the NGRIP delta-18O ice-core record, estimating asymmetry gamma = -0.109 (bootstrap 95% CI: [-0.220, -0.011]) and showing that outlier fraction explains 91% of the variance in filter improvement.

2602.11290 2026-02-17 math.ST math.OC stat.TH

Entropic vector quantile regression: Duality and Gaussian case

Kengo Kato, Boyu Wang

Comments 30 pages

详情
英文摘要

Vector quantile regression (VQR) is an optimal transport (OT) problem subject to a mean-independence constraint that extends classical linear quantile regression to vector response variables. Motivated by computational considerations, prior work has considered entropic relaxation of VQR, but its fundamental structural and approximation properties are still much less understood than entropic OT. The goal of this paper is to address some of these gaps. First, we study duality theory for entropic VQR and establish strong duality and dual attainment for marginals with possibly unbounded supports. In addition, when all marginals are compactly supported, we show that dual potentials are real analytic. Second, building on our duality theory, when all marginals are Gaussian, we show that entropic VQR has a closed-form optimal solution, which is again Gaussian, and establish the precise approximation rate toward unregularized VQR.

2602.07132 2026-02-17 stat.ML cs.LG

Discrete Adjoint Matching

Oswin So, Brian Karrer, Chuchu Fan, Ricky T. Q. Chen, Guan-Horng Liu

Comments ICLR 2026

详情
英文摘要

Computation methods for solving entropy-regularized reward optimization -- a class of problems widely used for fine-tuning generative models -- have advanced rapidly. Among those, Adjoint Matching (AM, Domingo-Enrich et al., 2025) has proven highly effective in continuous state spaces with differentiable rewards. Transferring these practical successes to discrete generative modeling, however, remains particularly challenging and largely unexplored, mainly due to the drastic shift in generative model classes to discrete state spaces, which are nowhere differentiable. In this work, we propose Discrete Adjoint Matching (DAM) -- a discrete variant of AM for fine-tuning discrete generative models characterized by Continuous-Time Markov Chains, such as diffusion-based large language models. The core of DAM is the introduction of discrete adjoint-an estimator of the optimal solution to the original problem but formulated on discrete domains-from which standard matching frameworks can be applied. This is derived via a purely statistical standpoint, in contrast to the control-theoretic viewpoint in AM, thereby opening up new algorithmic opportunities for general adjoint-based estimators. We showcase DAM's effectiveness on synthetic and mathematical reasoning tasks.

2602.06797 2026-02-17 stat.ML cs.LG

Optimal Learning-Rate Schedules under Functional Scaling Laws: Power Decay and Warmup-Stable-Decay

Binghui Li, Zilin Wang, Fengling Chen, Shiyang Zhao, Ruiheng Zheng, Lei Wu

详情
英文摘要

We study optimal learning-rate schedules (LRSs) under the functional scaling law (FSL) framework introduced in Li et al. (2025), which accurately models the loss dynamics of both linear regression and large language model (LLM) pre-training. Within FSL, loss dynamics are governed by two exponents: a source exponent $s>0$ controlling the rate of signal learning, and a capacity exponent $β>1$ determining the rate of noise forgetting. Focusing on a fixed training horizon $N$, we derive the optimal LRSs and reveal a sharp phase transition. In the easy-task regime $s \ge 1 - 1/β$, the optimal schedule follows a power decay to zero, $η^*(z) = η_{\mathrm{peak}}(1 - z/N)^{2β- 1}$, where the peak learning rate scales as $η_{\mathrm{peak}} \eqsim N^{-ν}$ for an explicit exponent $ν= ν(s,β)$. In contrast, in the hard-task regime $s < 1 - 1/β$, the optimal LRS exhibits a warmup-stable-decay (WSD) (Hu et al. (2024)) structure: it maintains the largest admissible learning rate for most of training and decays only near the end, with the decay phase occupying a vanishing fraction of the horizon. We further analyze optimal shape-fixed schedules, where only the peak learning rate is tuned -- a strategy widely adopted in practiceand characterize their strengths and intrinsic limitations. This yields a principled evaluation of commonly used schedules such as cosine and linear decay. Finally, we apply the power-decay LRS to one-pass stochastic gradient descent (SGD) for kernel regression and show the last iterate attains the exact minimax-optimal rate, eliminating the logarithmic suboptimality present in prior analyses. Numerical experiments corroborate our theoretical predictions.

2512.18898 2026-02-17 math.ST stat.ME stat.TH

Model-Agnostic Bounds for Augmented Inverse Probability Weighted Estimators' Wald-Confidence Interval Coverage in Randomized Controlled Trials

Hongxiang Qiu

Comments substantially weakened sub-Gaussian assumptions with discussions; highlighted non-asymptotic results in the supplement for better readability

详情
英文摘要

Nonparametric estimators, such as the augmented inverse probability weighted (AIPW) estimator, have become increasingly popular in causal inference. Numerous nonparametric estimators have been proposed, but they are all asymptotically normal with the same asymptotic variance under similar conditions, leaving little guidance for practitioners to choose an estimator. In this paper, I focus on another important perspective of their asymptotic behaviors beyond asymptotic normality, the convergence of the Wald-confidence interval (CI) coverage to the nominal coverage. Such results have been established for simpler estimators (e.g., the Berry-Esseen Theorem), but are lacking for nonparametric estimators. I consider a simple but practical setting where the AIPW estimator based on a black-box nuisance estimator, with or without cross-fitting, is used to estimate the average treatment effect in randomized controlled trials. I derive non-asymptotic Berry-Esseen-type bounds on the difference between Wald-CI coverage and the nominal coverage. I also analyze the bias of variance estimators, showing that the cross-fit variance estimator might overestimate while the non-cross-fit variance estimator might underestimate, which might explain why cross-fitting has been empirically observed to improve Wald-CI coverage.

2512.00499 2026-02-17 cs.LG cs.AI stat.ML

ESPO: Entropy Importance Sampling Policy Optimization

Yuepeng Sheng, Yuwei Huang, Shuman Liu, Anxiang Zeng, Haibo Zhang

详情
英文摘要

Reinforcement learning (RL) has become a central component of post-training for large language models (LLMs), particularly for complex reasoning tasks that require stable optimization over long generation horizons. However, achieving performance at scale often introduces a fundamental trade-off between training stability and training efficiency. Token-level optimization applies fine-grained updates at the individual units, but is prone to high variance in gradient estimation, which can result in unstable training dynamics. In contrast, Sequence-level optimization often relies on aggressive clipping mechanisms to ensure stable updates. However, such design may discard a large fraction of valid training samples, leading to inefficient gradient utilization and reduced training efficiency. We refer to this phenomenon as gradient underutilization. In this work, we propose Entropy Importance Sampling Policy Optimization (ESPO), a novel framework that aims to combine fine-grained updates with stable training. ESPO decomposes sequences into groups based on predictive entropy, enabling (1) Entropy Grouping Importance Sampling to capture intra-sequence heterogeneity, and (2) Entropy Adaptive Clipping to dynamically allocate trust regions based on model uncertainty. Extensive experiments on mathematical reasoning benchmarks demonstrate that ESPO not only accelerates convergence but also achieves state-of-the-art performance, notably improving accuracy on the challenging mathematical benchmarks.

2511.19628 2026-02-17 stat.ML cs.LG stat.CO

Optimization and Regularization Under Arbitrary Objectives

Jared N. Lakhani, Etienne Pienaar

Comments 74 pages, 29 figures, 16 tables

详情
英文摘要

This study investigates the limitations of applying Markov Chain Monte Carlo (MCMC) methods to arbitrary objective functions, focusing on a two-block MCMC framework which alternates between Metropolis-Hastings and Gibbs sampling. While such approaches are often considered advantageous for enabling data-driven regularization, we show that their performance critically depends on the sharpness of the employed likelihood form. By introducing a sharpness parameter and exploring alternative likelihood formulations proportional to the target objective function, we demonstrate how likelihood curvature governs both in-sample performance and the degree of regularization inferred by the training data. Empirical applications are conducted on reinforcement learning tasks: including a navigation problem and the game of tic-tac-toe. The study concludes with a separate analysis examining the implications of extreme likelihood sharpness on arbitrary objective functions stemming from the classic game of blackjack, where the first block of the two-block MCMC framework is replaced with an iterative optimization step. The resulting hybrid approach achieves performance nearly identical to the original MCMC framework, indicating that excessive likelihood sharpness effectively collapses posterior mass onto a single dominant mode.

2511.05128 2026-02-17 econ.EM stat.AP

Do Test Scores Help Teachers Give Better Track Advice to Students? A Principal Stratification Analysis

Andrea Ichino, Fabrizia Mealli, Javier Viviens

详情
英文摘要

Every year, over one million EU students choose a secondary school track based on teacher recommendations, yet little evidence shows this yields optimal assignments. Using Dutch data, we examine whether access to standardized test scores improves recommendation quality. We develop a Principal-Stratification metric in a quasi-randomized setting, conduct a welfare analysis that flexibly weights short- and long-term losses, and assess principal fairness by examining whether test-score access affects equity across protected attributes. Results are robust to replacing the Exclusion Restriction assumption underlying our main identification strategy with alternative assumptions. Allowing recommendation upgrades when test scores exceed expectations increases successful placement in more demanding tracks by at least 6%, while misplacing 7% of weaker students. Only unrealistically high weights on short-term losses would justify banning such upgrades. Test-score access also yields fairer recommendations for immigrant and low-SES students. Our methodology and findings contribute to the literature on algorithm-assisted human decisions.

2511.00769 2026-02-17 math.PR math.OC stat.CO

Information-theoretic minimax and submodular optimization algorithms for multivariate Markov chains

Zheyuan Lai, Michael C. H. Choi

Comments 34 pages, 6 figures

详情
英文摘要

We study an information-theoretic minimax problem for finite multivariate Markov chains on $d$-dimensional product state spaces. Given a family $\mathcal B=\{P_1,\ldots,P_n\}$ of $π$-stationary transition matrices and a class $\mathcal F = \mathcal{F}(\mathbf{S})$ of factorizable models induced by a partition $\mathbf S$ of the coordinate set $[d]$, we seek to minimize the worst-case information loss by analyzing $$\min_{Q\in\mathcal F}\max_{P\in\mathcal B} D_{\mathrm{KL}}^π(P\|Q),$$ where $D_{\mathrm{KL}}^π(P\|Q)$ is the $π$-weighted KL divergence from $Q$ to $P$. We recast the above minimax problem into concave maximization over the $n$-probability-simplex via strong duality and Pythagorean identities that we derive. This leads us to formulate an information-theoretic game and show that a mixed strategy Nash equilibrium always exists; and propose a projected subgradient algorithm to approximately solve the minimax problem with provable guarantee. By transforming the minimax problem into an orthant submodular function in $\mathbf{S}$, this motivates us to consider a max-min-max submodular optimization problem and investigate a two-layer subgradient-greedy procedure to approximately solve this generalization. Numerical experiments for Markov chains on the Curie-Weiss and Bernoulli-Laplace models illustrate the practicality of these proposed algorithms and reveals sparse optimal structures in these examples.

2510.26046 2026-02-17 stat.ML cs.LG stat.ME

Bias-Corrected Data Synthesis for Imbalanced Learning

Pengfei Lyu, Zhengchi Ma, Linjun Zhang, Anru R. Zhang

Comments 41 pages, 4 figures, includes proofs and appendix

详情
英文摘要

Imbalanced data, where the positive samples represent only a small proportion compared to the negative samples, makes it challenging for classification problems to balance the false positive and false negative rates. A common approach to addressing the challenge involves generating synthetic data for the minority group and then training classification models with both observed and synthetic data. However, since the synthetic data depends on the observed data and fails to replicate the original data distribution accurately, prediction accuracy is reduced when the synthetic data is naïvely treated as the true data. In this paper, we address the bias introduced by synthetic data and provide consistent estimators for this bias by borrowing information from the majority group. We propose a bias correction procedure to mitigate the adverse effects of synthetic data, enhancing prediction accuracy while avoiding overfitting. This procedure is extended to broader scenarios with imbalanced data, such as imbalanced multi-task learning and causal inference. Theoretical properties, including bounds on bias estimation errors and improvements in prediction accuracy, are provided. Simulation results and data analysis on handwritten digit datasets demonstrate the effectiveness of our method.

2510.18259 2026-02-17 stat.ML cs.AI cs.LG

Learning under Quantization for High-Dimensional Linear Regression

Dechen Zhang, Junwei Su, Difan Zou

详情
英文摘要

The use of low-bit quantization has emerged as an indispensable technique for enabling the efficient training of large-scale models. Despite its widespread empirical success, a rigorous theoretical understanding of its impact on learning performance remains notably absent, even in the simplest linear regression setting. We present the first systematic theoretical study of this fundamental question, analyzing finite-step stochastic gradient descent (SGD) for high-dimensional linear regression under a comprehensive range of quantization targets: data, label, parameter, activation, and gradient. Our novel analytical framework establishes precise algorithm-dependent and data-dependent excess risk bounds that characterize how different quantization affects learning: parameter, activation, and gradient quantization amplify noise during training; data quantization distorts the data spectrum and introduces additional approximation error. Crucially, we distinguish the effects of two quantization schemes: we prove that for additive quantization (with constant quantization steps), the noise amplification benefits from a suppression effect scaled by the batch size, while multiplicative quantization (with input-dependent quantization steps) largely preserves the spectral structure, thereby reducing the spectral distortion. Furthermore, under common polynomial-decay data spectra, we quantitatively compare the risks of multiplicative and additive quantization, drawing a parallel to the comparison between FP and integer quantization methods. Our theory provides a powerful lens to characterize how quantization shapes the learning dynamics of optimization algorithms, paving the way to further explore learning theory under practical hardware constraints.

2510.04455 2026-02-17 math.OC cs.AI cs.LG math.ST stat.ML stat.TH

Inverse Mixed-Integer Programming: Learning Constraints then Objective Functions

Akira Kitaoka

Comments 40 pages

详情
英文摘要

Data-driven inverse optimization for mixed-integer linear programs (MILPs), which seeks to learn an objective function and constraints consistent with observed decisions, is important for building accurate mathematical models in a variety of domains, including power systems and scheduling. However, to the best of our knowledge, existing data-driven inverse optimization methods primarily focus on learning objective functions under known constraints, and learning both objective functions and constraints from data remains largely unexplored. In this paper, we propose a two-stage approach for a class of inverse optimization problems in which the objective is a linear combination of given feature functions and the constraints are parameterized by unknown functions and thresholds. Our method first learns the constraints and then, conditioned on the learned constraints, estimates the objective-function weights. On the theoretical side, we provide finite-sample guarantees for solving the proposed inverse optimization problem. To this end, we develop statistical learning tools for pseudo-metric spaces under sub-Gaussian assumptions and use them to derive a learning-theoretic framework for inverse optimization with both unknown objectives and constraints. On the experimental side, we demonstrate that our method successfully solves inverse optimization problems on scheduling instances formulated as ILPs with up to 100 decision variables.

2510.02983 2026-02-17 cs.DS cs.LG math.OC stat.ML

Oracle-based Uniform Sampling from Convex Bodies

Thanh Dang, Jiaming Liang

Comments 32 pages

详情
英文摘要

We propose new Markov chain Monte Carlo algorithms to sample a uniform distribution on a convex body $K$. Our algorithms are based on the proximal sampler, which uses Gibbs sampling on an augmented distribution and assumes access to the so-called restricted Gaussian oracle (RGO). The key contribution of this work is an efficient implementation of the RGO for uniform sampling on convex $K$ that goes beyond the membership-oracle model used in many classical and modern uniform samplers, and instead leverages richer oracle access commonly assumed in convex optimization. We implement the RGO via rejection sampling and access to either a projection oracle or a separation oracle on $K$. In both oracle models, we provide non-asymptotic complexity guarantees for obtaining unbiased samples, with accuracy quantified in Rényi divergence and $χ^2$-divergence, and we support these theoretical guarantees with numerical experiments.

2509.23437 2026-02-17 cs.LG stat.ML

Better Hessians Matter: Studying the Impact of Curvature Approximations in Influence Functions

Steve Hong, Runa Eschenhagen, Bruno Mlodozeniec, Richard Turner

详情
英文摘要

Influence functions offer a principled way to trace model predictions back to training data, but their use in deep learning is hampered by the need to invert a large, ill-conditioned Hessian matrix. Approximations such as Generalised Gauss-Newton (GGN) and Kronecker-Factored Approximate Curvature (K-FAC) have been proposed to make influence computation tractable, yet it remains unclear how the departure from exactness impacts data attribution performance. Critically, given the restricted regime in which influence functions are derived, it is not necessarily clear better Hessian approximations should even lead to better data attribution performance. In this paper, we investigate the effect of Hessian approximation quality on influence-function attributions in a controlled classification setting. Our experiments show that better Hessian approximations consistently yield better influence score quality, offering justification for recent research efforts towards that end. We further decompose the approximation steps for recent Hessian approximation methods and evaluate each step's influence on attribution accuracy. Notably, the mismatch between K-FAC eigenvalues and GGN/EK-FAC eigenvalues accounts for the majority of the error and influence loss. These findings highlight which approximations are most critical, guiding future efforts to balance computational tractability and attribution accuracy.

2509.22794 2026-02-17 stat.ML cs.AI cs.LG econ.EM math.ST stat.TH

Differentially Private Two-Stage Gradient Descent for Instrumental Variable Regression

Haodong Liang, Yanhao Jin, Krishnakumar Balasubramanian, Lifeng Lai

Comments 37 pages, 12 figures

详情
英文摘要

We study instrumental variable regression (IVaR) under differential privacy constraints. Classical IVaR methods (like two-stage least squares regression) rely on solving moment equations that directly use sensitive covariates and instruments, creating significant risks of privacy leakage and posing challenges in designing algorithms that are both statistically efficient and differentially private. We propose a noisy two-stage gradient descent algorithm that ensures $ρ$-zero-concentrated differential privacy by injecting carefully calibrated noise into the gradient updates. Our analysis establishes finite-sample convergence rates for the proposed method, showing that the algorithm achieves consistency while preserving privacy. In particular, we derive precise bounds quantifying the trade-off among optimization, privacy, and sampling error. To the best of our knowledge, this is the first work to provide both privacy guarantees and provable convergence rates for instrumental variable regression in linear models. We further validate our theoretical findings with experiments on both synthetic and real datasets, demonstrating that our method offers practical accuracy-privacy trade-offs.

2509.19189 2026-02-17 cs.LG stat.ML

Functional Scaling Laws in Kernel Regression: Loss Dynamics and Learning Rate Schedules

Binghui Li, Fengling Chen, Zixun Huang, Lean Wang, Lei Wu

Comments 60 pages, accepted by NeurIPS 2025 as a spotlight paper

详情
英文摘要

Scaling laws have emerged as a unifying lens for understanding and guiding the training of large language models (LLMs). However, existing studies predominantly focus on the final-step loss, leaving open whether the entire loss dynamics obey similar laws and, crucially, how the learning rate schedule (LRS) shapes them. We address these gaps in a controlled theoretical setting by analyzing stochastic gradient descent (SGD) on a power-law kernel regression model. The key insight is a novel intrinsic-time viewpoint, which captures the training progress more faithfully than iteration count. We then establish a Functional Scaling Law (FSL) that captures the full loss trajectory under arbitrary LRSs, with the schedule's influence entering through a simple convolutional functional. We further instantiate the theory for three representative LRSs -- constant, exponential decay, and warmup-stable-decay (WSD) -- and derive explicit scaling relations in both data- and compute-limited regimes. These comparisons explain key empirical phenomena: (i) higher-capacity models are more data- and compute-efficient; (ii) learning-rate decay improves training efficiency; and (iii) WSD-type schedules outperform pure decay. Finally, experiments on LLMs ranging from 0.1B to 1B parameters demonstrate the practical relevance of FSL as a surrogate model for fitting and predicting loss trajectories in large-scale pre-training.

2509.02784 2026-02-17 stat.AP stat.ML

A Composite-Loss Graph Neural Network for the Multivariate Post-Processing of Ensemble Weather Forecasts

Mária Lakatos

Comments 30 pages, 16 figures, 3 tables

详情
Journal ref
Q. J. R. Meteorol. Soc., Early View (2026)
英文摘要

Ensemble forecasting systems have advanced meteorology by providing probabilistic estimates of future states. Nonetheless, systematic biases often persist, making statistical post-processing essential. Traditional parametric post-processing techniques and machine learning-based methods can produce calibrated predictive distributions at specific locations and lead times, yet often struggle to capture dependencies across forecast dimensions. To address this, multivariate post-processing methods-such as ensemble copula coupling and the Schaake shuffle-are widely applied in a second step to restore realistic inter-variable or spatio-temporal dependencies. The aim of this study is the multivariate post-processing of ensemble forecasts using a graph neural network (dualGNN) trained with a composite loss function that combines the energy score (ES) and the variogram score (VS). The method is evaluated on two datasets: WRF-based solar irradiance forecasts over northern Chile and ECMWF visibility forecasts for Central Europe. The dualGNN consistently outperforms all empirical copula-based post-processed forecasts and shows significant improvements compared to graph neural networks trained solely on either the continuous ranked probability score or the ES, according to the evaluated multivariate verification metrics. Furthermore, for the WRF forecasts, the rank-order structure of the dualGNN forecasts captures valuable dependency information, enabling a more effective restoration of spatial relationships than either the raw numerical weather prediction ensemble or historical observational rank structures. Notably, incorporating VS into the loss function improved the univariate performance for both target variables compared to training on ES alone. Moreover, for the visibility forecasts, the ES-VS combination even outperformed the strongest calibrated reference in terms of univariate performance.

2509.00955 2026-02-17 cs.LG cs.AI stat.ML

ART: Adaptive Resampling-based Training for Imbalanced Classification

Arjun Basandrai, Shourya Jain, K. Ilanthenral

Comments Submitted to MLWA

详情
英文摘要

Traditional resampling methods for handling class imbalance typically uses fixed distributions, undersampling the majority or oversampling the minority. These static strategies ignore changes in class-wise learning difficulty, which can limit the overall performance of the model. This paper proposes an Adaptive Resampling-based Training (ART) method that periodically updates the distribution of the training data based on the class-wise performance of the model. Specifically, ART uses class-wise macro F1 scores, computed at fixed intervals, to determine the degree of resampling to be performed. Unlike instance-level difficulty modeling, which is noisy and outlier-sensitive, ART adapts at the class level. This allows the model to incrementally shift its attention towards underperforming classes in a way that better aligns with the optimization objective. Results on diverse benchmarks, including Pima Indians Diabetes and Yeast dataset demonstrate that ART consistently outperforms both resampling-based and algorithm-level methods, including Synthetic Minority Oversampling Technique (SMOTE), NearMiss Undersampling, and Cost-sensitive Learning on binary as well as multi-class classification tasks with varying degrees of imbalance. In most settings, these improvements are statistically significant. On tabular datasets, gains are significant under paired t-tests and Wilcoxon tests (p < 0.05), while results on text and image tasks remain favorable. Compared to training on the original imbalanced data, ART improves macro F1 by an average of 2.64 percentage points across all tested tabular datasets. Unlike existing methods, whose performance varies by task, ART consistently delivers the strongest macro F1, making it a reliable choice for imbalanced classification.

2508.10342 2026-02-17 stat.ME

Identifying Unmeasured Confounders in Panel Causal Models: A Two-Stage LM-Wald Approach

Bang Quan Zheng

详情
英文摘要

Panel data are widely used in political science to draw causal inferences. However, these models often rely on the strong and untested assumption of sequential ignorability--that no unmeasured variables influence both the independent and outcome variables across time. Grounded in psychometric literature on latent variable modeling, this paper introduces the Two-Stage LM-Wald (2SLW) approach, a diagnostic tool that extends the Lagrange Multiplier (LM) and Wald tests to detect violations of this assumption in panel causal models. Using Monte Carlo simulations within the Random Intercept Cross-Lagged Panel Model (RI-CLPM), which separates within and between person effects, I demonstrate the 2SLW's ability to detect unmeasured confounding across three key scenarios: biased corrections, distorted direct effects, and altered mediation pathways. I also illustrate the approach with an empirical application to real-world panel data. By providing a practical and theoretically grounded diagnostic, the 2SLW approach enhances the robustness of causal inferences in the presence of potential time-varying confounders. Moreover, it can be readily implemented using the R package lavaan.

2507.15471 2026-02-17 stat.ME math.ST stat.TH

Multiple Hypothesis Testing To Estimate The Number Of Communities in Stochastic Block Models

Chetkar Jha, Mingyao Li, Ian Barnett

Comments This article is significantly improved version of the previous arXiv submission arXiv:2201.04722

详情
英文摘要

Clustering of single-cell RNA sequencing (scRNA-seq) datasets can give key insights into the biological functions of cells. Therefore, it is not surprising that network-based community detection methods (one of the better clustering methods) are increasingly being used for the clustering of scRNA-seq datasets. The main challenge in implementing network-based community detection methods for scRNA-seq datasets is that these methods \emph{apriori} require the true number of communities or blocks for estimating the community memberships. Although there are existing methods for estimating the number of communities, they are not suitable for noisy scRNA-seq datasets. Moreover, we require an appropriate method for extracting suitable networks from scRNA-seq datasets. For addressing these issues, we present a two-fold solution: i) a simple likelihood-based approach for extracting stochastic block models (SBMs) out of scRNA-seq datasets, ii) a new sequential multiple testing (SMT) method for estimating the number of communities in SBMs. We study the theoretical properties of SMT and establish its consistency under moderate sparsity conditions. In addition, we compare the numerical performance of the SMT with several existing methods. We also show that our approach performs competitively well against existing methods for estimating the number of communities on benchmark scRNA-seq datasets. Finally, we use our approach for estimating subgroups of a human retina bipolar single cell dataset.

2507.08838 2026-02-17 cs.LG cs.AI stat.ML

wd1: Weighted Policy Optimization for Reasoning in Diffusion Language Models

Xiaohang Tang, Rares Dolga, Sangwoong Yoon, Ilija Bogunovic

Comments Accepted to ICLR 2026

详情
英文摘要

Improving the reasoning capabilities of diffusion-based large language models (dLLMs) through reinforcement learning (RL) remains an open problem. The intractability of dLLMs likelihood function necessitates approximating the current, old, and reference policy likelihoods at each policy optimization step. This reliance introduces additional computational overhead, and can lead to large variance and estimation error in RL objective -- particularly in computing the policy ratio for importance sampling. To mitigate these issues, we introduce wd1, a novel ratio-free policy optimization approach that reformulates the RL objective as a weighted log-likelihood, requiring only a single approximation for the current parametrized policy likelihood. We formally show that our proposed method can be interpreted as energy-guided discrete diffusion training combined with negative sample unlearning, thereby confirming its theoretical soundness. In experiments on LLaDA-8B model, wd1 outperforms diffusion-based GRPO (d1) while requiring lower computational cost, achieving up to a $+59\%$ improvement in accuracy. Furthermore, we extend wd1 to denoising-stepwise weighted policy optimization (wd1++), achieving state-of-the-art math performance of $44.2\%$ on MATH500 and $84.5\%$ on GSM8K with only 20 RL training steps.

2506.04749 2026-02-17 stat.CO stat.ME stat.ML

Variational Transdimensional Inference

Laurence Davies, Dan Mackinlay, Rafael Oliveira, Scott A. Sisson

Comments 35 pages, 11 figures

详情
英文摘要

The expressiveness of flow-based models combined with stochastic variational inference (SVI) has expanded the application of optimization-based Bayesian inference to highly complex problems. However, despite the importance of multi-model Bayesian inference for problems defined on a transdimensional joint model and parameter space, such as Bayesian structure learning and model selection, flow-based SVI has been limited to problems defined on a fixed-dimensional parameter space. We introduce CoSMIC, normalizing flows (COntextually-Specified Masking for Identity-mapped Components), an extension to neural autoregressive conditional normalizing flow architectures that enables use of a single flow-based variational density for inference over a transdimensional (multi-model) conditional target distribution. We propose a combined stochastic variational transdimensional inference (VTI) approach to training CoSMIC, flows using ideas from Bayesian optimization and Monte Carlo gradient estimation. Numerical experiments show the performance of VTI on challenging problems that scale to high-cardinality model spaces.

2505.20478 2026-02-17 stat.ME

Iterative Exploration-Driven Sparse SDP Clustering via Thompson Sampling

Jongmin Mun, Paromita Dubey, Yingying Fan

Comments 58 pages, 2 figures, 2 tables, 4 algorithms;

详情
英文摘要

This paper studies high-dimensional sparse clustering, a combinatorial NP-hard problem arising from the bilinear coupling between cluster assignment and feature selection. We analyze semidefinite programming (SDP) relaxations of $K$-means and establish minimax separation bounds, demonstrating that these relaxations are theoretically robust to feature over-selection: exact recovery is preserved even in the presence of non-informative features. Leveraging this robustness, we propose a block-coordinate ascent framework that alternates between SDP-based clustering and non-conservative feature selection. To address the tendency of deterministic greedy methods to become trapped in local optima, we formulate the feature selection step as a Thompson sampling bandit problem. This approach introduces adaptive memory by aggregating historical variable-selection outcomes into posterior distributions, and selects features via posterior sampling, enabling stochastic exploration that promotes the inclusion of under-explored features and facilitates escape from local maxima. We establish conditions for consistent variable selection and exact clustering recovery, and extend the method to settings with unknown covariance through a scalable, inverse-free estimation procedure. Numerical experiments demonstrate that the proposed memory-driven approach consistently outperforms state-of-the-art sparse clustering methods.

2505.19712 2026-02-17 cs.LG math.PR stat.ML

On the Relation between Rectified Flows and Optimal Transport

Johannes Hertrich, Antonin Chambolle, Julie Delon

Comments Accepted for NeurIPS 2025

详情
英文摘要

This paper investigates the connections between rectified flows, flow matching, and optimal transport. Flow matching is a recent approach to learning generative models by estimating velocity fields that guide transformations from a source to a target distribution. Rectified flow matching aims to straighten the learned transport paths, yielding more direct flows between distributions. Our first contribution is a set of invariance properties of rectified flows and explicit velocity fields. In addition, we also provide explicit constructions and analysis in the Gaussian (not necessarily independent) and Gaussian mixture settings and study the relation to optimal transport. Our second contribution addresses recent claims suggesting that rectified flows, when constrained such that the learned velocity field is a gradient, can yield (asymptotically) solutions to optimal transport problems. We study the existence of solutions for this problem and demonstrate that they only relate to optimal transport under assumptions that are significantly stronger than those previously acknowledged. In particular, we present several counterexamples that invalidate earlier equivalence results in the literature, and we argue that enforcing a gradient constraint on rectified flows is, in general, not a reliable method for computing optimal transport maps.

2503.09912 2026-02-17 stat.AP

Beta-Generalized Lindley Distribution: A Novel Probability Model for Wind Speed

Tiantian Yang, Dongwei Chen

Comments 18 pages, 7 figures, 5 tables

详情
Journal ref
Renewable Energy, 256 (2026), 124216
英文摘要

Wind speed distribution has many applications, such as the assessment of wind energy and building design. Applying an appropriate statistical distribution to fit the wind speed data, especially on its heavy right tail, is of great interest. In this study, we introduce a novel four-parameter class of generalized Lindley distribution, called the beta-generalized Lindley (BGL) distribution, to fit the wind speed data, which are derived from the annual and long-term measurements of the Flatirons M2 meteorological tower from the years 2010 to 2020 at heights of 10, 20, 50, and 80 meters. In terms of the density fit and various goodness-of-fit metrics, the BGL model outperforms its submodels (beta-Lindley, generalized Lindley, and Lindley) and other reference distributions, such as gamma, beta-Weibull, Weibull, beta-exponential, and Log-Normal. Furthermore, the BGL distribution is more accurate at modeling the long right tail of wind speed, including the $95^{th}$ and $99^{th}$ percentiles and Anderson-Darling statistics at different heights. Therefore, we conclude that the BGL distribution is a strong alternative model for the wind speed distribution.

2410.18784 2026-02-17 cs.LG cs.NA eess.SP math.NA math.ST stat.ML stat.TH

Denoising diffusion probabilistic models are optimally adaptive to unknown low dimensionality

Zhihan Huang, Yuting Wei, Yuxin Chen

Comments Accepted to Mathematics of Operations Research

详情
英文摘要

The denoising diffusion probabilistic model (DDPM) has emerged as a mainstream generative model in generative AI. While sharp convergence guarantees have been established for the DDPM, the iteration complexity is, in general, proportional to the ambient data dimension, resulting in overly conservative theory that fails to explain its practical efficiency. This has motivated the recent work Li and Yan (2024a) to investigate how the DDPM can achieve sampling speed-ups through automatic exploitation of intrinsic low dimensionality of data. We strengthen this line of work by demonstrating, in some sense, optimal adaptivity to unknown low dimensionality. For a broad class of data distributions with intrinsic dimension $k$, we prove that the iteration complexity of the DDPM scales nearly linearly with $k$, which is optimal when using KL divergence to measure distributional discrepancy. Notably, our work is closely aligned with the independent concurrent work Potaptchik et al. (2024) -- posted two weeks prior to ours -- in establishing nearly linear-$k$ convergence guarantees for the DDPM.

2410.13420 2026-02-17 stat.ME math.ST stat.TH

Spatial Proportional Hazards Model with Differential Regularization

Lorenzo Tedesco, Francesco Finazzi

详情
英文摘要

The Proportional Hazards (PH) model is one of the most widely used models in survival analysis, typically assuming a log-linear relationship between covariates and the hazard function. However, in the context of spatial survival data, where the time-to-event variable is associated with a spatial location within a given domain, this assumption is often unrealistic in capturing spatial effects. Thus, this paper proposes modeling the location effect through a nonparametric function of spatial location. The function is approximated using finite element methods on a triangulated mesh to accommodate irregular domains. Estimation is carried out within the classical partial likelihood framework, with smoothness of the spatial effect enforced through differential penalization. Using sieve methods, we establish the consistency and asymptotic normality of the parametric component. Simulations and two empirical applications demonstrate superior performance compared to existing approaches.

2410.03919 2026-02-17 cs.LG stat.ML

Online Posterior Sampling with a Diffusion Prior

Branislav Kveton, Boris Oreshkin, Youngsuk Park, Aniket Deshmukh, Rui Song

Comments Advances in Neural Information Processing Systems 37

详情
英文摘要

Posterior sampling in contextual bandits with a Gaussian prior can be implemented exactly or approximately using the Laplace approximation. The Gaussian prior is computationally efficient but it cannot describe complex distributions. In this work, we propose approximate posterior sampling algorithms for contextual bandits with a diffusion model prior. The key idea is to sample from a chain of approximate conditional posteriors, one for each stage of the reverse diffusion process, which are obtained by the Laplace approximation. Our approximations are motivated by posterior sampling with a Gaussian prior, and inherit its simplicity and efficiency. They are asymptotically consistent and perform well empirically on a variety of contextual bandit problems.

2409.15307 2026-02-17 stat.CO physics.comp-ph

An ILUES-based adaptive Gaussian process method for multimodal Bayesian inverse problems

Zhihang Xu, Xiaoyu Zhu, Daoji Li, Qifeng Liao

详情
英文摘要

Inverse problems are prevalent in both scientific research and engineering applications. In the context of Bayesian inverse problems, sampling from the posterior distribution can be particularly challenging when the forward models are computationally expensive. This challenge is further compounded when the posterior distribution is multimodal. To address this issue, we propose a Gaussian process (GP)-based method to indirectly build surrogates for the forward model. Specifically, the unnormalized posterior density is expressed as a product of an auxiliary density and an exponential GP surrogate. Iteratively, the auxiliary density converges to the posterior distribution, starting from an arbitrary initial density. However, the efficiency of GP regression is highly influenced by the quality of the training data. Therefore, we utilize the iterative local updating ensemble smoother (ILUES) to generate high-quality samples that are concentrated in regions with high posterior probability. Subsequently, based on the surrogate model and mode information extracted using a clustering method, Markov chain Monte Carlo (MCMC) with a Gaussian mixed (GM) proposal is used to draw samples from the auxiliary density. Through numerical examples, we demonstrate that the proposed method can accurately and efficiently represent the posterior with a limited number of forward simulations.

2406.00866 2026-02-17 stat.ME math.ST stat.AP stat.TH

Planning for gold: Hypothesis screening with split samples for valid powerful testing in matched observational studies

William Bekerman, Abhinandan Dalal, Carlo del Ninno, Dylan S. Small

Comments To be published in Biometrika

详情
英文摘要

Observational studies are valuable tools for inferring causal effects in the absence of controlled experiments. However, these studies may be biased due to the presence of some relevant, unmeasured set of covariates. One approach to mitigate this concern is to identify hypotheses likely to be more resilient to hidden biases by splitting the data into a planning sample for designing the study and an analysis sample for making inferences. We devise a powerful and flexible method for selecting hypotheses in the planning sample when an unknown number of outcomes are affected by the treatment, allowing researchers to gain the benefits of exploratory analysis and still conduct powerful inference under concerns of unmeasured confounding. We investigate the theoretical properties of our method and conduct extensive simulations that demonstrate pronounced benefits, especially at higher levels of allowance for unmeasured confounding. Finally, we demonstrate our method in an observational study of the multi-dimensional impacts of a devastating flood in Bangladesh.

2402.11219 2026-02-17 math.ST stat.ME stat.TH

Estimators for multivariate allometric regression model

Koji Tsukuda, Shun Matsuura

Comments 20 pages

详情
Journal ref
Journal of Multivariate Analysis, Volume 210, Article 105482 (2025)
英文摘要

In a regression model with multiple response variables and multiple explanatory variables, if the difference of the mean vectors of the response variables for different values of explanatory variables is always in the direction of the first principal eigenvector of the covariance matrix of the response variables, then it is called a multivariate allometric regression model. This paper studies the estimation of the first principal eigenvector in the multivariate allometric regression model. A class of estimators that includes conventional estimators is proposed based on weighted sum-of-squares matrices of regression sum-of-squares matrix and residual sum-of-squares matrix. We establish an upper bound of the mean squared error of the estimators contained in this class, and the weight value minimizing the upper bound is derived. Sufficient conditions for the consistency of the estimators are discussed in weak identifiability regimes under which the difference of the largest and second largest eigenvalues of the covariance matrix decays asymptotically and in ``large $p$, large $n$" regimes, where $p$ is the number of response variables and $n$ is the sample size. Several numerical results are also presented.

2402.02644 2026-02-17 cs.LG stat.ML

Permutation-based Inference for Variational Learning of Directed Acyclic Graphs

Edwin V. Bonilla, Pantelis Elinas, He Zhao, Maurizio Filippone, Vassili Kitsios, Terry O'Kane

详情
英文摘要

Estimating the structure of Bayesian networks as directed acyclic graphs (DAGs) from observational data is a fundamental challenge, particularly in causal discovery. Bayesian approaches excel by quantifying uncertainty and addressing identifiability, but key obstacles remain: (i) representing distributions over DAGs and (ii) estimating a posterior in the underlying combinatorial space. We introduce PIVID, a method that jointly infers a distribution over permutations and DAGs using variational inference and continuous relaxations of discrete distributions. Through experiments on synthetic and real-world datasets, we show that PIVID can outperform deterministic and Bayesian approaches, achieving superior accuracy-uncertainty trade-offs while scaling efficiently with the number of variables.

2310.11736 2026-02-17 math.ST math.OC stat.ML stat.TH

A Theory of Feature Learning in Kernel Models

Yunlu Chen, Yang Li, Keli Liu, Feng Ruan

详情
英文摘要

We study feature learning in a compositional variant of kernel ridge regression in which the predictor is applied to a learnable linear transformation of the input. When the response depends on the input only through a low-dimensional predictive subspace, we show that all global minimizers of the population objective for the linear transformation annihilate directions orthogonal to this subspace, and in certain regimes, exactly identify the subspace. Moreover, we show that global minimizers of the finite-sample objective inherit the exact same low-dimensional structure with high probability, even without any explicit penalization on the linear transformation.

2308.13036 2026-02-17 math.ST stat.TH

Robust Signal Detection with Quadratically Convex Orthosymmetric Constraints

Yikun Li, Matey Neykov

Comments 80 pages, 7 figures

详情
英文摘要

This paper studies the problem of robust signal detection in Gaussian noise under quadratically convex orthosymmetric (QCO) constraints. We consider a minimax testing framework where the signal belongs to a QCO set and is separated from zero in Euclidean norm, while an adversary is allowed to arbitrarily corrupt a fraction $ε$ of the samples. We establish the minimax separation radius between the null and alternative purely in terms of the constraint geometry, sample size, corruption rate, and noise scale. Our analysis argues that the Kolmogorov widths of the constraint set play a central role in determining the detection limits, paralleling to classic results in estimation problem. The derived lower bounds exhibit phase transitions with respect to the corruption rate and confirm that robust testing is statistically easier than robust estimation. While the information-theoretic upper bound is achieved by a computationally intractable test, we develop a polynomial-time algorithm that achieves the minimax lower bound up to logarithmic factors. Unlike prior work, our algorithm handles signals of arbitrary Euclidean length while respecting the QCO constraints. Finally, we extend these results to the robust $\ell _{p}$ norm testing for $1 \le p < 2$.

2306.14297 2026-02-17 stat.ME cs.LG

Inference for relative sparsity

Samuel J. Weisenthal, Sally W. Thurston, Ashkan Ertefaie

Comments 66 pages, 3 figures

详情
英文摘要

In healthcare, there is much interest in estimating policies, or mappings from covariates to treatment decisions. Recently, there is also interest in constraining these estimated policies to the standard of care, which generated the observed data. A relative sparsity penalty was proposed to derive policies that have sparse, explainable differences from the standard of care, facilitating justification of the new policy. However, the developers of this penalty only considered estimation, not inference. Here, we develop inference for the relative sparsity objective function, because characterizing uncertainty is crucial to applications in medicine. Further, in the relative sparsity work, the authors only considered the single-stage decision case; here, we consider the more general, multi-stage case. Inference is difficult, because the relative sparsity objective depends on the unpenalized value function, which is unstable and has infinite estimands in the binary action case. Further, one must deal with a non-differentiable penalty. To tackle these issues, we nest a weighted Trust Region Policy Optimization function within a relative sparsity objective, implement an adaptive relative sparsity penalty, and propose a sample-splitting framework for post-selection inference. We study the asymptotic behavior of our proposed approaches, perform extensive simulations, and analyze a real, electronic health record dataset.

2305.13842 2026-02-17 math.ST stat.TH

Asymptotic Properties of Multi-Treatment Covariate Adaptive Randomization Procedures for Balancing Observed and Unobserved Covariates

Li-Xin Zhang

Comments 102 pages

详情
英文摘要

Applications of CAR for balancing continuous covariates remain comparatively rare, especially in multi-treatment clinical trials, and the theoretical properties of multi-treatment CAR have remained largely elusive for decades. In this paper, we consider a general framework of CAR procedures for multi-treatment clinal trials which can balance general covariate features, such as quadratic and interaction terms which can be discrete, continuous, and mixing. We show that under widely satisfied conditions the proposed procedures have superior balancing properties; in particular, the convergence rate of imbalance vectors can attain the best rate $O_P(1)$ for discrete covariates, continuous covariates, or combinations of both discrete and continuous covariates, and at the same time, the convergence rate of the imbalance of unobserved covariates is $O_P(\sqrt n)$, where $n$ is the sample size. The general framework unifies many existing methods and related theories, introduces a much broader class of new and useful CAR procedures, and provides new insights and a complete picture of the properties of CAR procedures. The favorable balancing properties lead to the precision of the treatment effect test in the presence of a heteroscedastic linear model with dependent covariate features. As an application, the properties of the test of treatment effect with unobserved covariates are studied under the CAR procedures, and consistent tests are proposed so that the test has an asymptotic precise type I error even if the working model is wrong and covariates are unobserved in the analysis.

2303.08218 2026-02-17 stat.ME

Spatial causal inference in the presence of unmeasured confounding and interference

Georgia Papadogeorgou, Srijata Samanta

详情
英文摘要

This manuscript unites causal inference and spatial statistics, presenting novel insights for causal inference in spatial data analysis, and drawing from tools in spatial statistics to estimate causal effects. We introduce spatial causal graphs to highlight that spatial confounding and interference can be entangled, in that investigating the presence of one can lead to wrongful conclusions in the presence of the other. Moreover, we show that spatial dependence in the exposure variable can render standard analyses invalid. To remedy these issues, we propose a Bayesian parametric approach based on tools commonly-used in spatial statistics. This approach simultaneously accounts for interference and mitigates bias from local and neighborhood unmeasured spatial confounding. From a Bayesian perspective, we show that incorporating an exposure model is necessary. Under a specific model formulation, we prove that all parameters are identifiable including the causal effects, even in the presence of unmeasured confounding. We illustrate the approach with a simulation study. We evaluate the effect of local and neighboring sulfur dioxide emissions from power plants on county-level cardiovascular mortality from observational spatial data in the United States, where unmeasured spatial confounding and interference might be present simultaneously.

2302.12093 2026-02-17 eess.SY cs.SY math.OC stat.ME

Experimenting under Stochastic Congestion

Shuangning Li, Ramesh Johari, Xu Kuang, Stefan Wager

详情
英文摘要

We study randomized experiments in a service system when stochastic congestion can arise from temporarily limited supply or excess demand. Such congestion gives rise to cross-unit interference between the waiting customers, and analytic strategies that do not account for this interference may be biased. In current practice, one of the most widely used ways to address stochastic congestion is to use switchback experiments that alternatively turn a target intervention on and off for the whole system. We find, however, that under a queueing model for stochastic congestion, the standard way of analyzing switchbacks is inefficient, and that estimators that leverage the queueing model can be materially more accurate. Additionally, we show how the queueing model enables estimation of total policy gradients from unit-level randomized experiments, thus giving practitioners an alternative experimental approach they can use without needing to pre-commit to a fixed switchback length before data collection.

2211.13478 2026-02-17 stat.ME

A New Spatio-Temporal Model Exploiting Hamiltonian Equations

Satyaki Mazumder, Sayantan Banerjee, Sourabh Bhattacharya

Comments Another updated version, more streamlined and establishing deep theoretical connections with our two new companion papers

详情
英文摘要

The solutions of Hamiltonian equations are known to describe the underlying phase space of a mechanical system. In this article, we propose a novel spatio-temporal model using a strategic modification of the Hamiltonian equations, incorporating appropriate stochasticity via Gaussian processes. The resultant spatio-temporal process, continuously varying with time, turns out to be nonparametric, non-stationary, non-separable, and non-Gaussian. Additionally, the lagged correlations converge to zero as the spatio-temporal lag goes to infinity. We investigate the theoretical properties of the new spatio-temporal process, including its continuity and smoothness properties. We derive methods for complete Bayesian inference using MCMC techniques in the Bayesian paradigm. The performance of our method has been compared with that of a non-stationary Gaussian process (GP) using two simulation studies, where our method shows a significant improvement over the non-stationary GP. Further, applying our new model to two real data sets revealed encouraging performance.

2206.03166 2026-02-17 math.ST cs.DM cs.MS math.PR stat.ME stat.TH

A novel statistical approach for two-sample testing based on the overlap coefficient

Atsushi Komaba, Hisashi Johno, Kazunori Nakamoto

Comments 36 pages, 5 figures. Accepted for publication in Journal of Mathematical Sciences, the University of Tokyo

详情
Journal ref
Journal of Mathematical Sciences, The University of Tokyo 30 (2023), no. 2, 205-240
英文摘要

Here we propose a new nonparametric framework for two-sample testing, named as the OVL-$q$ ($q = 1, 2, \ldots$). This can be regarded as a natural extension of the Smirnov test, which is equivalent to the OVL-1. We specifically focus on the OVL-2, implement its fast algorithm, and show its superiority over other statistical tests in some experiments.

2112.06251 2026-02-17 cs.LG stat.ML

Learning with Subset Stacking

Ş. İlker Birbil, Sinan Yıldırım, Samet Çopur, M. Hakan Akyüz

Comments 26 pages, 8 figures, 4 tables. Code available

详情
英文摘要

We propose a new regression algorithm that learns from a set of input-output pairs. Our algorithm is designed for populations where the relation between the input variables and the output variable exhibits a heterogeneous behavior across the predictor space. The algorithm starts with generating subsets that are concentrated around random points in the input space. This is followed by training a local predictor for each subset. Those predictors are then combined in a novel way to yield an overall predictor. We call this algorithm "LEarning with Subset Stacking" or LESS, due to its resemblance to the method of stacking regressors. We offer bagging and boosting variants of LESS and test against the state-of-the-art methods on several datasets. Our comparison shows that LESS is highly competitive.

2106.04096 2026-02-17 cs.LG math.OC stat.ML

Linear Convergence of Entropy-Regularized Natural Policy Gradient with Linear Function Approximation

Semih Cayci, Niao He, R. Srikant

详情
Journal ref
SIAM J. Optim. 34 (2024) 2729-2755
英文摘要

Natural policy gradient (NPG) methods with entropy regularization achieve impressive empirical success in reinforcement learning problems with large state-action spaces. However, their convergence properties and the impact of entropy regularization remain elusive in the function approximation regime. In this paper, we establish finite-time convergence analyses of entropy-regularized NPG with linear function approximation under softmax parameterization. In particular, we prove that entropy-regularized NPG with averaging satisfies the \emph{persistence of excitation} condition, and achieves a fast convergence rate of $\tilde{O}(1/T)$ up to a function approximation error in regularized Markov decision processes. This convergence result does not require any a priori assumptions on the policies. Furthermore, under mild regularity conditions on the concentrability coefficient and basis vectors, we prove that entropy-regularized NPG exhibits \emph{linear convergence} up to a function approximation error.

2103.14203 2026-02-17 stat.ML cs.LG

Deep Two-Way Matrix Reordering for Relational Data Analysis

Chihiro Watanabe, Taiji Suzuki

详情
英文摘要

Matrix reordering is a task to permute the rows and columns of a given observed matrix such that the resulting reordered matrix shows meaningful or interpretable structural patterns. Most existing matrix reordering techniques share the common processes of extracting some feature representations from an observed matrix in a predefined manner, and applying matrix reordering based on it. However, in some practical cases, we do not always have prior knowledge about the structural pattern of an observed matrix. To address this problem, we propose a new matrix reordering method, called deep two-way matrix reordering (DeepTMR), using a neural network model. The trained network can automatically extract nonlinear row/column features from an observed matrix, which can then be used for matrix reordering. Moreover, the proposed DeepTMR provides the denoised mean matrix of a given observed matrix as an output of the trained network. This denoised mean matrix can be used to visualize the global structure of the reordered observed matrix. We demonstrate the effectiveness of the proposed DeepTMR by applying it to both synthetic and practical datasets.

2602.13635 2026-02-17 stat.ME

Backward Smoothing versus Fixed-Lag Smoothing in Particle Filters

Genshiro Kitagawa

Comments 17 pages, 5 tables, 7 figures

详情
英文摘要

Particle smoothing enables state estimation in nonlinear and non-Gaussian state-space models, but its practical use is often limited by high computational cost. Backward smoothing methods such as the Forward Filter Backward Smoother (FFBS) and its marginal form (FFBSm) can achieve high accuracy, yet typically require quadratic computational complexity in the number of particles. This paper examines the accuracy--computational cost trade-offs of particle smoothing methods through a trend-estimation example. Fixed-lag smoothing, FFBS, and FFBSm are compared under Gaussian and heavy-tailed (Cauchy-type) system noise, with particular attention to O(m) approximations of FFBSm based on subsampling and local neighborhood restrictions. The results show that FFBS and FFBSm outperform fixed-lag smoothing at a fixed particle number, while fixed-lag smoothing often achieves higher accuracy under equal computational time. Moreover, efficient FFBSm approximations are effective for Gaussian transitions but become less advantageous for heavy-tailed dynamics.

2602.13619 2026-02-17 stat.ML cs.IT cs.LG math.IT stat.ME

Locally Private Parametric Methods for Change-Point Detection

Anuj Kumar Yadav, Cemre Cadir, Yanina Shkel, Michael Gastpar

Comments 43 pages, 20 figures

详情
英文摘要

We study parametric change-point detection, where the goal is to identify distributional changes in time series, under local differential privacy. In the non-private setting, we derive improved finite-sample accuracy guarantees for a change-point detection algorithm based on the generalized log-likelihood ratio test, via martingale methods. In the private setting, we propose two locally differentially private algorithms based on randomized response and binary mechanisms, and analyze their theoretical performance. We derive bounds on detection accuracy and validate our results through empirical evaluation. Our results characterize the statistical cost of local differential privacy in change-point detection and show how privacy degrades performance relative to a non-private benchmark. As part of this analysis, we establish a structural result for strong data processing inequalities (SDPI), proving that SDPI coefficients for Rényi divergences and their symmetric variants (Jeffreys-Rényi divergences) are achieved by binary input distributions. These results on SDPI coefficients are also of independent interest, with applications to statistical estimation, data compression, and Markov chain mixing.

2602.13565 2026-02-17 math.ST cs.NA math.NA math.PR stat.OT stat.TH

An Improved Milstein Method for the Numerical Solution of Multidimensional Stochastic Differential Equations

Paromita Banerjee, Anirban Mondal

详情
英文摘要

Stochastic differential equations (SDEs) offer powerful and accessible mathematical models for capturing both deterministic and probabilistic aspects of dynamic behavior across a wide range of physical, financial, and social systems. However, analytical solutions for many SDEs are often unavailable, necessitating the use of numerical approximation methods. The rate of convergence of such numerical methods is of great importance, as it directly influences both computational efficiency and accuracy. This paper presents a proposed theorem, along with its proof, that facilitates the numerical evaluation of the strong (and weak) order of convergence of a numerical scheme for an SDE when the analytical solution is unavailable. Additionally, we address the challenge of numerically computing the multiple stochastic integrals required by the Milstein method to achieve improved convergence rates for multidimensional SDEs. In this context, two newly proposed numerical techniques for computing these multiple stochastic integrals are introduced and compared with existing approaches in terms of efficiency and effectiveness. The methodologies are further illustrated through simulation studies and applications to widely used financial models.

2602.13538 2026-02-17 stat.ME

Empirical Bayes data integreation for multi-response regression

Antik Chakraborty, Fei Xue

Comments To appear in Statistica Sinica

详情
英文摘要

Motivated by applications in tissue-wide association studies (TWAS), we develop a flexible and theoretically grounded empirical Bayes approach for integrating %vector-valued outcomes data obtained from different sources. We propose a linear shrinkage estimator that effectively shrinks singular values of a data matrix. This problem is closely connected to estimating covariance matrices under a specific loss, for which we develop asymptotically optimal estimators. The basic linear shrinkage estimator is then extended to a local linear shrinkage estimator, offering greater flexibility. Crucially, the proposed method works under sparse/dense or low-rank/non low-rank parameter settings unlike well-known sparse or reduced rank estimators in the literature. Furthermore, the empirical Bayes approach offers greater scalability in computation compared to intensive full Bayes procedures. The method is evaluated through an extensive set of numerical experiments, and applied to a real TWAS data obtained from the Genotype-Tissue Expression (GTEx) project.

2602.13533 2026-02-17 stat.ME math.ST stat.AP stat.TH

Estimation and Inference of the Win Ratio for Two Hierarchical Endpoints Subject to Censoring and Missing Data

Yi Liu, Huiman Barnhart, Sean O'Brien, Yuliya Lokhnygina, Roland A. Matsouaka

详情
英文摘要

The win ratio (WR) is a widely used metric to compare treatments in randomized clinical trials with hierarchically ordered endpoints. Counting-based approaches, such as Pocock's algorithm, are the standard for WR estimation. However, this algorithm treats participants with censored or missing data inadequately, which may lead to biased and inefficient estimates, particularly in the presence of heterogeneous censoring or missing data between treatment groups. Although recent extensions have addressed some of these limitations for hierarchical time-to-event endpoints, no existing methods -- aside from the computationally intensive multiple imputation approach -- can accommodate settings that include non-survival endpoints that are subject to missing data. In this paper, we propose a simple nonparametric maximum likelihood estimator (NPMLE) of WR for two hierarchical endpoints that are subject to censoring and missing data. Our method uses all observed data, avoids strong parametric assumptions, and comes with a closed-form asymptotic variance estimator. We demonstrate its performance using simulation studies and two data examples, based on the HEART-FID and ISCHEMIA trials. The proposed method provides a consistent estimator, improves estimation efficiency, and is robust under non-informative censoring and missing at random (MAR) assumptions, offering a flexible alternative to existing WR estimation methods. A user-friendly R package, WinRS, is available to facilitate implementation.

2602.13518 2026-02-17 stat.ME

Towards Semiparametric Bandwidth Selectors for Kernel Density Estimators

Nils Lid Hjort

Comments 26 pages, no figures; technical report from 1999, needing additional numerical work to become a full paper

详情
英文摘要

There is an intense and partly recent literature focussing on the problem of selecting the bandwidth parameter for kernel density estimators. Available methods are largely `very nonparametric', in the sense of not requiring any knowledge about the underlying density, or `very parametric', like the normality-based reference rule. This report aims at widening the scope towards the inclusion of many semiparametric bandwidth selectors, via Hermite type expansions aroundthe normal distribution. The resulting bandwidths may be seen as carrying out suitable corrections on the normal reference rule, requiring a low number of extra coefficients to be estimated from data. The present report introduces and discusses some basic ideas and develops the necessary initial theory, but modestly chooses to stop short of giving precise recommendations for specific procedures among the many possible constructions. This will require some further analysis, numerical work, and some simulation-based exploration.

2602.13475 2026-02-17 stat.ME stat.AP stat.ML

Efficient and Debiased Learning of Average Hazard Under Non-Proportional Hazards

Xiang Meng, Lu Tian, Kenneth Kehl, Hajime Uno

Comments Main paper: 24 pages and 2 figures; Reference and Supplement: 22 pages and 8 Figures

详情
英文摘要

The hazard ratio from the Cox proportional hazards model is a ubiquitous summary of treatment effect. However, when hazards are non-proportional, the hazard ratio can lose a stable causal interpretation and become study-dependent because it effectively averages time-varying effects with weights determined by follow-up and censoring. We consider the average hazard (AH) as an alternative causal estimand: a population-level person-time event rate that remains well-defined and interpretable without assuming proportional hazards. Although AH can be estimated nonparametrically and regression-style adjustments have been proposed, existing approaches do not provide a general framework for flexible, high-dimensional nuisance estimation with valid sqrt{n} inference. We address this gap by developing a semiparametric, doubly robust framework for covariate-adjusted AH. We establish pathwise differentiability of AH in the nonparametric model, derive its efficient influence function, and construct cross-fitted, debiased estimators that leverage machine learning for nuisance estimation while retaining asymptotically normal, sqrt{n}-consistent inference under mild product-rate conditions. Simulations demonstrate that the proposed estimator achieves small bias and near-nominal confidence-interval coverage across proportional and non-proportional hazards settings, including crossing-hazards regimes where Cox-based summaries can be unstable. We illustrate practical utility in comparative effectiveness research by comparing immunotherapy regimens for advanced melanoma using SEER-Medicare linked data.

2602.13465 2026-02-17 math.ST stat.TH

Intrinsic dimension concentration inequalities for self-adjoint operators

Diego Martinez-Taboada, Aaditya Ramdas

详情
英文摘要

We derive novel concentration inequalities for the operator norm of the sum of self-adjoint operators that do not explicitly depend on the underlying dimension of the operator, but rather an intrinsic notion of it. Our analysis leads to tighter results (in terms of constants) and simplified proofs. Our results unify the current intrinsic-dimension and ambient-dimension inequalities under independence, strictly improving both categories of bounds (such as by Tropp and Minsker). We present a general master theorem that we instantiate to obtain specific sub-Gaussian, Hoeffding, Bernstein, Bennett, and sub-exponential type inequalities. We also establish widely applicable concentration bounds under martingale dependence that provide tighter control than existing results.

2602.13450 2026-02-17 econ.EM stat.AP stat.ME

Inference From Random Restarts

Moeen Nehzati, Diego Cussen

详情
英文摘要

Algorithms for computing equilibria, optima, and fixed points in nonconvex problems often depend sensitively on practitioner-chosen initial conditions. When uniqueness of a solution is of interest, a common heuristic is to run such algorithms from many randomly selected initial conditions and to interpret repeated convergence to the same output as evidence of a unique solution or a dominant basin of attraction. Despite its widespread use, this practice lacks a formal inferential foundation. We provide a simple probabilistic framework for interpreting such numerical evidence. First, we give sufficient conditions under which an algorithm's terminal output is a measurable function of its initial condition, allowing probabilistic reasoning over outcomes. Second, we provide sufficient conditions ensuring that an algorithm admits only finitely many possible terminal outcomes. While these conditions may be difficult to verify on a case-by-case basis, we give simple sufficient conditions for broad classes of problems under which almost all instances admit only finitely many outcomes (in the sense of prevalence). Standard algorithms such as gradient descent and damped fixed-point iteration applied to sufficiently smooth functions satisfy these conditions. Within this framework, repeated solver runs correspond to independent samples from the induced distribution over outcomes. We adopt a Bayesian approach to infer basin sizes and the probability of solution uniqueness from repeated identical outputs, and we establish convergence rates for the resulting posterior beliefs. Finally, we apply our framework to settings in the existing industrial organization literature, where random-restart heuristics are used. Our results formalize and qualify these arguments, clarifying when repeated convergence provides meaningful evidence for uniqueness and when it does not.

2602.13442 2026-02-17 stat.ME stat.ML

Measuring Neural Network Complexity via Effective Degrees of Freedom

Jia Zhou, Douglas Landsittel

Comments 20 pages, 3 figures, 6 tables

详情
英文摘要

Quantifying the complexity of feed-forward neural networks (FFNNs) remains challenging due to their nonlinear, hierarchical structure and numerous parameters. We apply generalized degrees of freedom (GDF) to measure model complexity in FFNNs with binary outcomes, adapting the algorithm for discrete responses. We compare GDF with both the effective number of parameters derived via log-likelihood cross-validation and the null degrees of freedom of Landsittel et al. Through simulation studies and a real data analysis, we demonstrate that GDF provides a robust assessment of model complexity for neural network models, as it depends only on the sensitivity of fitted values to perturbations in the observed responses rather than on assumptions about the likelihood. In contrast, cross-validation-based estimates of model complexity and the null degrees of freedom rely on the correctness of the assumed likelihood and may exhibit substantial variability. We find that GDF, cross-validation-based measures, and null degrees of freedom yield similar assessments of model complexity only when the fitted model adequately represents the data-generating mechanism. These findings highlight GDF as a stable and broadly applicable measure of model complexity for neural networks in statistical modeling.

2602.13413 2026-02-17 cs.LG cs.NA math.NA math.OC stat.ML

Why is Normalization Preferred? A Worst-Case Complexity Theory for Stochastically Preconditioned SGD under Heavy-Tailed Noise

Yuchen Fang, James Demmel, Javad Lavaei

详情
英文摘要

We develop a worst-case complexity theory for stochastically preconditioned stochastic gradient descent (SPSGD) and its accelerated variants under heavy-tailed noise, a setting that encompasses widely used adaptive methods such as Adam, RMSProp, and Shampoo. We assume the stochastic gradient noise has a finite $p$-th moment for some $p \in (1,2]$, and measure convergence after $T$ iterations. While clipping and normalization are parallel tools for stabilizing training of SGD under heavy-tailed noise, there is a fundamental separation in their worst-case properties in stochastically preconditioned settings. We demonstrate that normalization guarantees convergence to a first-order stationary point at rate $\mathcal{O}(T^{-\frac{p-1}{3p-2}})$ when problem parameters are known, and $\mathcal{O}(T^{-\frac{p-1}{2p}})$ when problem parameters are unknown, matching the optimal rates for normalized SGD, respectively. In contrast, we prove that clipping may fail to converge in the worst case due to the statistical dependence between the stochastic preconditioner and the gradient estimates. To enable the analysis, we develop a novel vector-valued Burkholder-type inequality that may be of independent interest. These results provide a theoretical explanation for the empirical preference for normalization over clipping in large-scale model training.

2602.13362 2026-02-17 stat.ML cs.AI cs.LG

Nonparametric Distribution Regression Re-calibration

Ádám Jung, Domokos M. Kelen, András A. Benczúr

详情
英文摘要

A key challenge in probabilistic regression is ensuring that predictive distributions accurately reflect true empirical uncertainty. Minimizing overall prediction error often encourages models to prioritize informativeness over calibration, producing narrow but overconfident predictions. However, in safety-critical settings, trustworthy uncertainty estimates are often more valuable than narrow intervals. Realizing the problem, several recent works have focused on post-hoc corrections; however, existing methods either rely on weak notions of calibration (such as PIT uniformity) or impose restrictive parametric assumptions on the nature of the error. To address these limitations, we propose a novel nonparametric re-calibration algorithm based on conditional kernel mean embeddings, capable of correcting calibration error without restrictive modeling assumptions. For efficient inference with real-valued targets, we introduce a novel characteristic kernel over distributions that can be evaluated in $\mathcal{O}(n \log n)$ time for empirical distributions of size $n$. We demonstrate that our method consistently outperforms prior re-calibration approaches across a diverse set of regression benchmarks and model classes.

2602.07046 2026-02-17 q-fin.ST q-fin.CP stat.AP

Same Returns, Different Risks: How Cryptocurrency Markets Process Infrastructure vs Regulatory Shocks

Murad Farzulla

Comments 24 pages, 7 tables. JEL: C22, C58, G12, G14. Code at https://github.com/studiofarzulla/sentiment-without-structure

详情
英文摘要

We investigate whether cryptocurrency markets differentiate between infrastructure failures and regulatory enforcement at the return level, complementing a companion conditional variance analysis that finds 5.7 times larger volatility impacts from infrastructure events (p = 0.0008). Using event-level block bootstrap inference on 31 events across Bitcoin, Ethereum, Solana, and Cardano (2019-2025), we find no statistically significant difference in cumulative abnormal returns between infrastructure failures (-7.6%) and regulatory enforcement (-11.1%): the difference of +3.6 pp has p = 0.81 with 95% CI [-25.3%, +30.9%]. This null acquires substantive meaning alongside the companion's highly significant variance result: the same events that produce indistinguishable return responses generate dramatically different volatility signatures. Markets differentiate shock types through the risk channel -- the second moment -- rather than expected returns. The block bootstrap methodology, which resamples entire events to preserve cross-sectional correlation, reveals that prior parametric approaches systematically understate uncertainty by inflating degrees of freedom. Results are robust across eight specifications including permutation tests, leave-one-out analysis, and the Ibragimov-Mueller few-cluster test.

2510.26090 2026-02-17 stat.ME

Poisson process factorization for mutational signature analysis with genomic covariates

Alessandro Zito, Giovanni Parmigiani, Jeffrey W. Miller

详情
英文摘要

Mutational signatures are powerful summaries of the mutational processes altering the DNA of cancer cells. The usual approach to mutational signature analysis consists of decomposing the matrix of mutation counts from a sample of patients using non-negative matrix factorization (NMF). However, this ignores the heterogeneous patterns of mutation rates along the genome. In this paper, we introduce Poisson process factorization (PPF), which addresses this limitation by employing an inhomogeneous Poisson point process model to infer mutational signatures and their activities as they vary across the genome. PPF generalizes the baseline NMF model by representing a patient's exposure to each signature as a locus-specific function that depends on genomic covariates and patient-specific copy numbers via a log-linear model. This quantifies the relationships between genomic features and mutational signatures, and enables attribution of individual mutations to signatures. We develop tractable algorithms for maximum a posteriori estimation and posterior inference via Markov chain Monte Carlo. We demonstrate the method on simulated data and real data from breast cancer, using genomic covariates representing histone modifications, cell replication timing, nucleosome positioning, and DNA methylation.

2510.10854 2026-02-17 cs.LG cs.AI stat.ML

Discrete State Diffusion Models: A Sample Complexity Perspective

Aadithya Srikanth, Mudit Gaur, Vaneet Aggarwal

详情
英文摘要

Diffusion models have demonstrated remarkable performance in generating high-dimensional samples across domains such as vision, language, and the sciences. Although continuous-state diffusion models have been extensively studied both empirically and theoretically, discrete-state diffusion models, essential for applications involving text, sequences, and combinatorial structures, remain significantly less understood from a theoretical standpoint. In particular, all existing analyses of discrete-state models assume score estimation error bounds without studying sample complexity results. In this work, we present a principled theoretical framework for discrete-state diffusion, providing the first sample complexity bound of $\widetilde{\mathcal{O}}(ε^{-2})$. Our structured decomposition of the score estimation error into statistical, approximation, optimization, and clipping components offers critical insights into how discrete-state models can be trained efficiently. This analysis addresses a fundamental gap in the literature and establishes the theoretical tractability and practical relevance of discrete-state diffusion models.

2509.20993 2026-02-17 cs.LG cs.DS stat.ML

Learning the Inverse Temperature of Ising Models under Hard Constraints using One Sample

Rohan Chauhan, Ioannis Panageas

Comments Accepted to Appear in ICLR '26

详情
英文摘要

We consider the problem of estimating inverse temperature parameter $β$ of an $n$-dimensional truncated Ising model using a single sample. Given a graph $G = (V,E)$ with $n$ vertices, a truncated Ising model is a probability distribution over the $n$-dimensional hypercube $\{-1,1\}^n$ where each configuration $\mathbfσ$ is constrained to lie in a truncation set $S \subseteq \{-1,1\}^n$ and has probability $\Pr(\mathbfσ) \propto \exp(β\mathbfσ^\top A\mathbfσ)$ with $A$ being the adjacency matrix of $G$. We adopt the recent setting of [Galanis et al. SODA'24], where the truncation set $S$ can be expressed as the set of satisfying assignments of a $k$-SAT formula. Given a single sample $\mathbfσ$ from a truncated Ising model, with inverse parameter $β^*$, underlying graph $G$ of bounded degree $Δ$ and $S$ being expressed as the set of satisfying assignments of a $k$-SAT formula, we design in nearly $O(n)$ time an estimator $\hatβ$ that is $O(Δ^3/\sqrt{n})$-consistent with the true parameter $β^*$ for $k \gtrsim \log(d^2k)Δ^3.$ Our estimator is based on the maximization of the pseudolikelihood, a notion that has received extensive analysis for various probabilistic models without [Chatterjee, Annals of Statistics '07] or with truncation [Galanis et al. SODA '24]. Our approach generalizes recent techniques from [Daskalakis et al. STOC '19, Galanis et al. SODA '24], to confront the more challenging setting of the truncated Ising model.

2502.12581 2026-02-17 stat.ML cs.AI cs.LG

The Majority Vote Paradigm Shift: When Popular Meets Optimal

Antonio Purificato, Maria Sofia Bucarelli, Anil Kumar Nelakanti, Andrea Bacciu, Fabrizio Silvestri, Amin Mantrach

Comments 33 pages, 7 figures

详情
英文摘要

Reliably labelling data typically requires annotations from multiple human workers. However, humans are far from being perfect. Hence, it is a common practice to aggregate labels gathered from multiple annotators to make a more confident estimate of the true label. Among many aggregation methods, the simple and well known Majority Vote (MV) selects the class label polling the highest number of votes. However, despite its importance, the optimality of MV's label aggregation has not been extensively studied. We address this gap in our work by characterising the conditions under which MV achieves the theoretically optimal lower bound on label estimation error. Our results capture the tolerable limits on annotation noise under which MV can optimally recover labels for a given class distribution. This certificate of optimality provides a more principled approach to model selection for label aggregation as an alternative to otherwise inefficient practices that sometimes include higher experts, gold labels, etc., that are all marred by the same human uncertainty despite huge time and monetary costs. Experiments on both synthetic and real world data corroborate our theoretical findings.

2502.01010 2026-02-17 stat.ME

Sequential Change Detection in Correlation Structures with Window-Limited Statistics

Jie Gao, Liyan Xie, Zhaoyuan Li

详情
英文摘要

We consider detecting change points in the correlation structure of streaming data with minimum assumptions posed on the underlying data distribution. Detection statistics are constructed for dense and sparse change settings, based on $\ell_1$ and $\ell_{\infty}$ norms of the squared difference of vectorized pre- and post-change correlation matrices, respectively. We also propose a novel threshold determination algorithm based on sign-flip permutations that enhances the efficiency of our procedure, particularly when the data dimension is large compared to the window size. Theoretical guarantees of the proposed methods are provided in terms of average run length in the no-change regime and expected detection delay in the post-change regime. We evaluate the performance of the proposed methods across a wide range of simulated datasets and demonstrate their effectiveness, with small detection delays that are comparable to the exact optimal CUSUM test. Finally, we demonstrate the effectiveness of our methods on real-world datasets, including El Ni{ñ}o event forecasting, where we achieve a state-of-the-art hit rate exceeding 0.86 with near-zero false alarms, as well as seismic event detection.

2501.16178 2026-02-17 cs.LG stat.ML

SWIFT: Mapping Sub-series with Wavelet Decomposition Improves Time Series Forecasting

Wenxuan Xie, Fanpu Cao

详情
英文摘要

In recent work on time-series prediction, Transformers and even large language models have garnered significant attention due to their strong capabilities in sequence modeling. However, in practical deployments, time-series prediction often requires operation in resource-constrained environments, such as edge devices, which are unable to handle the computational overhead of large models. To address such scenarios, some lightweight models have been proposed, but they exhibit poor performance on non-stationary sequences. In this paper, we propose $\textit{SWIFT}$, a lightweight model that is not only powerful, but also efficient in deployment and inference for Long-term Time Series Forecasting (LTSF). Our model is based on three key points: (i) Utilizing wavelet transform to perform lossless downsampling of time series. (ii) Achieving cross-band information fusion with a learnable filter. (iii) Using only one shared linear layer or one shallow MLP for sub-series' mapping. We conduct comprehensive experiments, and the results show that $\textit{SWIFT}$ achieves state-of-the-art (SOTA) performance on multiple datasets, offering a promising method for edge computing and deployment in this task. Moreover, it is noteworthy that the number of parameters in $\textit{SWIFT-Linear}$ is only 25\% of what it would be with a single-layer linear model for time-domain prediction. Our code is available at https://github.com/LancelotXWX/SWIFT.

2501.01696 2026-02-17 stat.ML cs.IT cs.LG math.IT

Guaranteed Nonconvex Low-Rank Tensor Estimation via Scaled Gradient Descent

Tong Wu

Comments This paper has been accepted for publication in the Journal of Machine Learning Research

详情
英文摘要

Tensors, which give a faithful and effective representation to deliver the intrinsic structure of multi-dimensional data, play a crucial role in an increasing number of signal processing and machine learning problems. However, tensor data are often accompanied by arbitrary signal corruptions, including missing entries and sparse noise. A fundamental challenge is to reliably extract the meaningful information from corrupted tensor data in a statistically and computationally efficient manner. This paper develops a scaled gradient descent (ScaledGD) algorithm to directly estimate the tensor factors with tailored spectral initializations under the tensor-tensor product (t-product) and tensor singular value decomposition (t-SVD) framework. With tailored variants for tensor robust principal component analysis, (robust) tensor completion and tensor regression, we theoretically show that ScaledGD achieves linear convergence at a constant rate that is independent of the condition number of the ground truth low-rank tensor, while maintaining the low per-iteration cost of gradient descent. To the best of our knowledge, ScaledGD is the first algorithm that provably has such properties for low-rank tensor estimation with the t-SVD. Finally, numerical examples are provided to demonstrate the efficacy of ScaledGD in accelerating the convergence rate of ill-conditioned low-rank tensor estimation in a number of applications.

2411.01629 2026-02-17 stat.ML cs.LG math.OC math.ST stat.TH

Denoising Diffusions with Optimal Transport: Localization, Curvature, and Multi-Scale Complexity

Tengyuan Liang, Kulunu Dharmakeerthi, Takuya Koriyama

Comments 30 pages, 11 figures

详情
Journal ref
Transactions on Machine Learning Research, 2026
英文摘要

Adding noise is easy; what about denoising? Diffusion is easy; what about reverting a diffusion? Diffusion-based generative models aim to denoise a Langevin diffusion chain, moving from a log-concave equilibrium measure $ν$, say an isotropic Gaussian, back to a complex, possibly non-log-concave initial measure $μ$. The score function performs denoising, moving backward in time, and predicting the conditional mean of the past location given the current one. We show that score denoising is the optimal backward map in transportation cost. What is its localization uncertainty? We show that the curvature function determines this localization uncertainty, measured as the conditional variance of the past location given the current. We study in this paper the effectiveness of the diffuse-then-denoise process: the contraction of the forward diffusion chain, offset by the possible expansion of the backward denoising chain, governs the denoising difficulty. For any initial measure $μ$, we prove that this offset net contraction at time $t$ is characterized by the curvature complexity of a smoothed $μ$ at a specific signal-to-noise ratio (SNR) scale $r(t)$. We discover that the multi-scale curvature complexity collectively determines the difficulty of the denoising chain. Our multi-scale complexity quantifies a fine-grained notion of average-case curvature instead of the worst-case. Curiously, it depends on an integrated tail function, measuring the relative mass of locations with positive curvature versus those with negative curvature; denoising at a specific SNR scale is easy if such an integrated tail is light. We conclude with several non-log-concave examples to demonstrate how the multi-scale complexity probes the bottleneck SNR for the diffuse-then-denoise process.

2407.13010 2026-02-17 cs.LG cs.CE physics.comp-ph stat.ML

A Resolution Independent Neural Operator

Bahador Bahmani, Somdatta Goswami, Ioannis G. Kevrekidis, Michael D. Shields

详情
Journal ref
Journal of Computational Physics, 539, 2025, 114233
英文摘要

The Deep Operator Network (DeepONet) is a powerful neural operator architecture that uses two neural networks to map between infinite-dimensional function spaces. This architecture allows for the evaluation of the solution field at any location within the domain but requires input functions to be discretized at identical locations, limiting practical applications. We introduce a general framework for operator learning from input-output data with arbitrary sensor locations and counts. This begins by introducing a resolution-independent DeepONet (RI-DeepONet), which handles input functions discretized arbitrarily but sufficiently finely. To achieve this, we propose two dictionary learning algorithms that adaptively learn continuous basis functions, parameterized as implicit neural representations (INRs), from correlated signals on arbitrary point clouds. These basis functions project input function data onto a finite-dimensional embedding space, making it compatible with DeepONet without architectural changes. We specifically use sinusoidal representation networks (SIRENs) as trainable INR basis functions. Similarly, the dictionary learning algorithms identify basis functions for output data, defining a new neural operator architecture: the Resolution Independent Neural Operator (RINO). In RINO, the operator learning task reduces to mapping coefficients of input basis functions to output basis functions. We demonstrate RINO's robustness and applicability in handling arbitrarily sampled input and output functions during both training and inference through several numerical examples.

2209.05894 2026-02-17 math.ST stat.ME stat.TH

Nonparametric estimation of trawl processes: Theory and applications

Orimar Sauri, Almut E. D. Veraart

详情
英文摘要

Trawl processes belong to the class of continuous-time, strictly stationary, infinitely divisible processes; they are defined as Levy bases evaluated over deterministic trawl sets. This article presents the first nonparametric estimator of the trawl function characterising the trawl set and the serial correlation of the process. Moreover, it establishes a detailed asymptotic theory for the proposed estimator, including a law of large numbers and a central limit theorem for various asymptotic relations between an in-fill and a long-span asymptotic regime. In addition, it develops consistent estimators for both the asymptotic bias and variance, which are subsequently used for establishing feasible central limit theorems which can be applied to data. A simulation study shows the good finite sample performance of the proposed estimators. The new methodology is applied to model misspecification testing, forecasting high-frequency financial spread data from a limit order book and to estimating the busy-time distribution of a stochastic queue.