arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
2602.05997 2026-02-06 stat.ML cs.LG stat.ME

Causal Inference on Stopped Random Walks in Online Advertising

Jia Yuan Yu

Abstract

We consider a causal inference problem frequently encountered in online advertising systems, where a publisher (e.g., Instagram, TikTok) interacts repeatedly with human users and advertisers by sporadically displaying to each user an advertisement selected through an auction. Each treatment corresponds to a parameter value of the advertising mechanism (e.g., auction reserve-price), and we want to estimate through experiments the corresponding long-term treatment effect (e.g., annual advertising revenue). In our setting, the treatment affects not only the instantaneous revenue from showing an ad, but also changes each user's interaction-trajectory, and each advertiser's bidding policy -- as the latter is constrained by a finite budget. In particular, each treatment may even affect the size of the population, since users interact longer with a tolerable advertising mechanism. We drop the classical i.i.d. assumption and model the experiment measurements (e.g., advertising revenue) as a stopped random walk, and use a budget-splitting experimental design, the Anscombe Theorem, a Wald-like equation, and a Central Limit Theorem to construct confidence intervals for the long-term treatment effect.
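
As a rough illustration of the stopped-random-walk view, the sketch below checks the classical Wald identity E[S_N] = E[N] E[X] by simulation. The uniform increments, the budget threshold, and the function names `simulate_stopped_walk` and `wald_check` are illustrative assumptions and do not come from the paper, whose Wald-like equation may take a different form.

```python
import random


def simulate_stopped_walk(budget, rng):
    """Add Uniform(0, 1) increments until the running total exceeds `budget`.

    Returns (total, steps): the stopped sum S_N and the stopping time N.
    """
    total, steps = 0.0, 0
    while total <= budget:
        total += rng.random()
        steps += 1
    return total, steps


def wald_check(budget=5.0, trials=20000, seed=0):
    """Monte Carlo estimates of E[S_N] and E[N] * E[X], with E[X] = 0.5 here."""
    rng = random.Random(seed)
    sum_total, sum_steps = 0.0, 0
    for _ in range(trials):
        total, steps = simulate_stopped_walk(budget, rng)
        sum_total += total
        sum_steps += steps
    mean_total = sum_total / trials
    mean_steps = sum_steps / trials
    return mean_total, mean_steps * 0.5


if __name__ == "__main__":
    lhs, rhs = wald_check()
    print(f"E[S_N] ~ {lhs:.3f}, E[N] * E[X] ~ {rhs:.3f}")
```

The two estimates should agree up to Monte Carlo error, even though the number of steps N is random and depends on the increments themselves.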

2602.05996 2026-02-06 cs.LG stat.ML

Orthogonal Self-Attention

Leo Zhang, James Martens

Comments Preprint

Abstract

Softmax Self-Attention (SSA) is a key component of Transformer architectures. However, when utilised within skipless architectures, which aim to improve representation learning, recent work has highlighted the inherent instability of SSA due to inducing rank collapse and poorly-conditioned Jacobians. In this work, we design a novel attention mechanism: Orthogonal Self-Attention (OSA), which aims to bypass these issues with SSA, in order to allow for (non-causal) Transformers without skip connections and normalisation layers to be more easily trained. In particular, OSA parametrises the attention matrix to be orthogonal via mapping a skew-symmetric matrix, formed from query-key values, through the matrix exponential. We show that this can be practically implemented, by exploiting the low-rank structure of our query-key values, resulting in the computational complexity and memory cost of OSA scaling linearly with sequence length. Furthermore, we derive an initialisation scheme that provably ensures that the Jacobian of OSA is well-conditioned.
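
The orthogonality claim rests on a standard fact about the matrix exponential, sketched here. The particular construction $S = QK^\top - KQ^\top$ with $Q, K \in \mathbb{R}^{n \times d}$ is an assumption made for illustration; the abstract only states that the skew-symmetric matrix is formed from query-key values.

```latex
S^\top = -S
\quad\Longrightarrow\quad
\exp(S)^\top \exp(S) = \exp(S^\top)\exp(S) = \exp(-S)\exp(S) = I,
\qquad
\operatorname{rank}\bigl(QK^\top - KQ^\top\bigr) \le 2d .
```

The rank bound shows why a low-rank structure is available when the head dimension $d$ is much smaller than the sequence length $n$.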

2602.05927 2026-02-06 stat.ML cs.LG

Transformers Are Born Biased: Structural Inductive Biases at Random Initialization and Their Practical Consequences

Siquan Li, Yao Tong, Haonan Wang, Tianyang Hu

Abstract

Transformers underpin modern large language models (LLMs) and are commonly assumed to be behaviorally unstructured at random initialization, with all meaningful preferences emerging only through large-scale training. We challenge this assumption by showing that randomly initialized transformers already exhibit strong and systematic structural biases. In particular, untrained models display extreme token preferences: across random input sequences, certain tokens are predicted with probabilities orders of magnitude larger. We provide a mechanistic explanation for this phenomenon by dissecting the transformer architecture at initialization. We show that extreme token preference arises from a contraction of token representations along a random seed-dependent direction. This contraction is driven by two interacting forces: (i) asymmetric nonlinear activations in MLP sublayers induce global (inter-sequence) representation concentration, and (ii) self-attention further amplifies this effect through local (intra-sequence) aggregation. Together, these mechanisms align hidden representations along a direction determined solely by the random initialization, producing highly non-uniform next-token predictions. Beyond mechanistic insight, we demonstrate that these initialization-induced biases persist throughout training, forming a stable and intrinsic model identity. Leveraging this property, we introduce SeedPrint, a fingerprinting method that can reliably distinguish models that differ only in their random initialization, even after extensive training and under substantial distribution shift. Finally, we identify a fundamental positional discrepancy inherent to the attention mechanism's intra-sequence contraction that is causally linked to the attention-sink phenomenon. This discovery provides a principled explanation for the emergence of sinks and offers a pathway for their control.

2602.05861 2026-02-06 cs.LG stat.ML

CFRecs: Counterfactual Recommendations on Real Estate User Listing Interaction Graphs

Seyedmasoud Mousavi, Ruomeng Xu, Xiaojing Zhu

Abstract

Graph-structured data is ubiquitous and powerful in representing complex relationships in many online platforms. While graph neural networks (GNNs) are widely used to learn from such data, counterfactual graph learning has emerged as a promising approach to improve model interpretability. Counterfactual explanation research focuses on identifying a counterfactual graph that is similar to the original but leads to different predictions. These explanations optimize two objectives simultaneously: the sparsity of changes in the counterfactual graph and the validity of its predictions. Building on these qualitative optimization goals, this paper introduces CFRecs, a novel framework that transforms counterfactual explanations into actionable insights. CFRecs employs a two-stage architecture consisting of a graph neural network (GNN) and a graph variational auto-encoder (Graph-VAE) to strategically propose minimal yet high-impact changes in graph structure and node attributes to drive desirable outcomes in recommender systems. We apply CFRecs to Zillow's graph-structured data to deliver actionable recommendations for both home buyers and sellers with the goal of helping them navigate the competitive housing market and achieve their homeownership goals. Experimental results on Zillow's user-listing interaction data demonstrate the effectiveness of CFRecs, which also provides a fresh perspective on recommendations using counterfactual reasoning in graphs.

2602.05852 2026-02-06 cs.LG cs.IT math.IT stat.ML

Exact Recovery in the Data Block Model

Amir R. Asadi, Akbar Davoodi, Ramin Javadi, Farzad Parvaresh

Comments 35 pages

Abstract

Community detection in networks is a fundamental problem in machine learning and statistical inference, with applications in social networks, biological systems, and communication networks. The stochastic block model (SBM) serves as a canonical framework for studying community structure, and exact recovery, identifying the true communities with high probability, is a central theoretical question. While classical results characterize the phase transition for exact recovery based solely on graph connectivity, many real-world networks contain additional data, such as node attributes or labels. In this work, we study exact recovery in the Data Block Model (DBM), an SBM augmented with node-associated data, as formalized by Asadi, Abbe, and Verdú (2017). We introduce the Chernoff--TV divergence and use it to characterize a sharp exact recovery threshold for the DBM. We further provide an efficient algorithm that achieves this threshold, along with a matching converse result showing impossibility below the threshold. Finally, simulations validate our findings and demonstrate the benefits of incorporating vertex data as side information in community detection.

2602.05846 2026-02-06 stat.ML cs.LG

Optimal scaling laws in learning hierarchical multi-index models

Leonardo Defilippis, Florent Krzakala, Bruno Loureiro, Antoine Maillard

Abstract

In this work, we provide a sharp theory of scaling laws for two-layer neural networks trained on a class of hierarchical multi-index targets, in a genuinely representation-limited regime. We derive exact information-theoretic scaling laws for subspace recovery and prediction error, revealing how the hierarchical features of the target are sequentially learned through a cascade of phase transitions. We further show that these optimal rates are achieved by a simple, target-agnostic spectral estimator, which can be interpreted as the small learning-rate limit of gradient descent on the first-layer weights. Once an adapted representation is identified, the readout can be learned statistically optimally, using an efficient procedure. As a consequence, we provide a unified and rigorous explanation of scaling laws, plateau phenomena, and spectral structure in shallow neural networks trained on such hierarchical targets.

2602.05812 2026-02-06 cs.LG stat.ML

Principled Confidence Estimation for Deep Computed Tomography

Matteo Gätzner, Johannes Kirschner

Abstract

We present a principled framework for confidence estimation in computed tomography (CT) reconstruction. Based on the sequential likelihood mixing framework (Kirschner et al., 2025), we establish confidence regions with theoretical coverage guarantees for deep-learning-based CT reconstructions. We consider a realistic forward model following the Beer-Lambert law, i.e., a log-linear forward model with Poisson noise, closely reflecting clinical and scientific imaging conditions. The framework is general and applies to both classical algorithms and deep learning reconstruction methods, including U-Nets, U-Net ensembles, and generative diffusion models. Empirically, we demonstrate that deep reconstruction methods yield substantially tighter confidence regions than classical reconstructions, without sacrificing theoretical coverage guarantees. Our approach allows the detection of hallucinations in reconstructed images and provides interpretable visualizations of confidence regions. This establishes deep models not only as powerful estimators, but also as reliable tools for uncertainty-aware medical imaging.
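
For readers unfamiliar with the forward model, a common way to write a Beer-Lambert measurement model with Poisson noise is sketched below. The symbols are assumptions for illustration: $x$ is the attenuation image, $A$ the projection (system) matrix, $I_0$ the incident photon count, and $y_i$ the photon count at detector bin $i$. The paper's exact parametrisation may differ.

```latex
y_i \sim \operatorname{Poisson}\!\bigl(I_0 \exp(-[Ax]_i)\bigr),
\qquad
\log \mathbb{E}[y_i] = \log I_0 - [Ax]_i .
```

The second identity is what makes the model log-linear: the logarithm of the expected count is affine in the unknown image $x$.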

2602.05799 2026-02-06 math.OC cs.LG stat.ML

Non-Stationary Inventory Control with Lead Times

Nele H. Amiri, Sean R. Sinclair, Maximiliano Udenio

Abstract

We study non-stationary single-item, periodic-review inventory control problems in which the demand distribution is unknown and may change over time. We analyze how demand non-stationarity affects learning performance across inventory models, including systems with demand backlogging or lost-sales, both with and without lead times. For each setting, we propose an adaptive online algorithm that optimizes over the class of base-stock policies and establish performance guarantees in terms of dynamic regret relative to the optimal base-stock policy at each time step. Our results reveal a sharp separation across inventory models. In backlogging systems and lost-sales models with zero lead time, we show that it is possible to adapt to demand changes without incurring additional performance loss in stationary environments, even without prior knowledge of the demand distributions or the number of demand shifts. In contrast, for lost-sales systems with positive lead times, we establish weaker guarantees that reflect fundamental limitations imposed by delayed replenishment in combination with censored feedback. Our algorithms leverage the convexity and one-sided feedback structure of inventory costs to enable counterfactual policy evaluation despite demand censoring. We complement the theoretical analysis with simulation results showing that our methods significantly outperform existing benchmarks.
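
To make the policy class concrete, here is a minimal sketch of a base-stock policy in the simplest setting mentioned above: zero lead time with backlogging. The cost parameters and the function names `order_quantity` and `simulate_backlog` are hypothetical, and the sketch does not include the adaptive learning of the base-stock level that the paper studies.

```python
def order_quantity(base_stock_level, inventory_position):
    """Order up to the base-stock level; never order a negative amount."""
    return max(base_stock_level - inventory_position, 0.0)


def simulate_backlog(base_stock_level, demands, holding_cost=1.0, backlog_cost=4.0):
    """Total cost of a base-stock policy with zero lead time and backlogging."""
    inventory = 0.0  # net inventory; negative values are backlogged demand
    total_cost = 0.0
    for demand in demands:
        # With zero lead time the order arrives before demand is realised.
        inventory += order_quantity(base_stock_level, inventory)
        inventory -= demand
        total_cost += holding_cost * max(inventory, 0.0)
        total_cost += backlog_cost * max(-inventory, 0.0)
    return total_cost


if __name__ == "__main__":
    print(simulate_backlog(4.0, [3.0, 5.0, 2.0]))
```

In this setting the per-period cost is a convex piecewise-linear function of the base-stock level, which is the kind of structure the abstract refers to.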

2602.05798 2026-02-06 stat.ME cs.LG eess.SP stat.ML

Learning False Discovery Rate Control via Model-Based Neural Networks

Arnau Vilella, Jasin Machkour, Michael Muma, Daniel P. Palomar

Comments Accepted to IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2026

Abstract

Controlling the false discovery rate (FDR) in high-dimensional variable selection requires balancing rigorous error control with statistical power. Existing methods with provable guarantees are often overly conservative, creating a persistent gap between the realized false discovery proportion (FDP) and the target FDR level. We introduce a learning-augmented enhancement of the T-Rex Selector framework that narrows this gap. Our approach replaces the analytical FDP estimator with a neural network trained solely on diverse synthetic datasets, enabling a substantially tighter and more accurate approximation of the FDP. This refinement allows the procedure to operate much closer to the desired FDR level, thereby increasing discovery power while maintaining effective approximate control. Through extensive simulations and a challenging synthetic genome-wide association study (GWAS), we demonstrate that our method achieves superior detection of true variables compared to existing approaches.
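
The two quantities being traded off can be computed directly when the true support is known, as in a simulation. The sketch below uses the standard definitions of the false discovery proportion (the FDR is its expectation) and the true positive proportion; the function names are hypothetical, and nothing here reproduces the T-Rex Selector or the neural FDP estimator.

```python
def false_discovery_proportion(selected, true_support):
    """Fraction of selected variables that are not in the true support."""
    selected, true_support = set(selected), set(true_support)
    if not selected:
        return 0.0  # convention: no discoveries means no false discoveries
    return len(selected - true_support) / len(selected)


def true_positive_proportion(selected, true_support):
    """Fraction of truly active variables that were selected (power proxy)."""
    selected, true_support = set(selected), set(true_support)
    if not true_support:
        return 0.0
    return len(selected & true_support) / len(true_support)


if __name__ == "__main__":
    chosen, truth = {1, 2, 3, 7}, {1, 2, 3, 4, 5}
    print(false_discovery_proportion(chosen, truth))
    print(true_positive_proportion(chosen, truth))
```

A conservative selector keeps the first number well below the target level at the cost of a lower second number; the abstract's goal is to narrow that gap.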

2602.05784 2026-02-06 stat.ME

Correcting Measurement Error and Zero Inflation in Functional Covariates for Scalar-on-Function Quantile Regression

Caihong Qin, Lan Xue, Ufuk Beyaztas, Roger S. Zoh, Mark Benden, Jeff Goldsmith, Carmen D. Tekwe

Abstract

Wearable devices collect time-varying biobehavioral data, offering opportunities to investigate how behaviors influence health outcomes. However, these data often contain measurement error and excess zeros (due to nonwear, sedentary behavior, or connectivity issues), each characterized by subject-specific distributions. Current statistical methods fail to address these issues simultaneously. We introduce a novel modeling framework for zero-inflated and error-prone functional data by incorporating a subject-specific time-varying validity indicator that explicitly distinguishes structural zeros from intrinsic values. We iteratively estimate the latent functional covariates and zero-inflation probabilities via maximum likelihood, using basis expansions and linear mixed models to adjust for measurement error. To assess the effects of the recovered latent covariates, we apply joint quantile regression across multiple quantile levels. Through extensive simulations, we demonstrate that our approach significantly improves estimation accuracy over methods that only address measurement error, and joint estimation yields substantial improvements compared with fitting separate quantile regressions. Applied to a childhood obesity study, our approach effectively corrects for zero inflation and measurement error in step counts, yielding results that closely align with energy expenditure and supporting their use as a proxy for physical activity.
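
Quantile regression is built on the check (pinball) loss, which the following sketch illustrates for a single quantile level. This is only the standard building block: the function names are hypothetical, and the measurement-error correction, zero-inflation model, and joint estimation across quantile levels described above are not shown.

```python
def check_loss(residual, tau):
    """Pinball loss rho_tau(u) = u * (tau - 1{u < 0}) for a quantile level tau."""
    indicator = 1.0 if residual < 0 else 0.0
    return residual * (tau - indicator)


def sample_quantile_by_check_loss(values, tau):
    """Return the sample value that minimises the total check loss."""
    return min(values, key=lambda q: sum(check_loss(v - q, tau) for v in values))


if __name__ == "__main__":
    print(check_loss(2.0, 0.25), check_loss(-2.0, 0.25))
    print(sample_quantile_by_check_loss([1.0, 2.0, 3.0, 4.0, 5.0], 0.5))
```

The asymmetry of the loss is what targets a quantile rather than the mean: with tau = 0.25, underestimates are penalised three times less than overestimates.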

2602.05778 2026-02-06 stat.ME

Copula-based models for spatially dependent cylindrical data

Francesca Labanca, Anna Gottard, Nadja Klein

Abstract

Cylindrical data frequently arise across various scientific disciplines, including meteorology (e.g., wind direction and speed), oceanography (e.g., marine current direction and speed or wave heights), ecology (e.g., telemetry), and medicine (e.g., seasonality and intensity in disease onset). Such data often occur as spatially correlated series of intensities and angles, thereby representing dependent bivariate response vectors of linear and circular components. To accommodate both the circular-linear dependence and spatial autocorrelation, while remaining flexible in marginal specifications, copula-based models for cylindrical data have been developed in the literature. However, existing approaches typically treat the copula parameters as constants unrelated to covariates, and regression specifications for marginal distributions are frequently restricted to linear predictors, thereby ignoring spatial correlation. In this work, we propose a structured additive conditional copula regression model for cylindrical data. The circular component is modeled using a wrapped Gaussian process, and the linear component follows a distributional regression model. Both components allow for the inclusion of linear covariate effects. Furthermore, by leveraging the empirical equivalence between Gaussian random fields (GRFs) and Gaussian Markov random fields, our approach avoids the computational burden typically associated with GRFs, while simultaneously allowing for non-stationarity in the covariance structure. Posterior estimation is performed via Markov chain Monte Carlo simulation. We evaluate the proposed model in a simulation study and subsequently in an analysis of wind directions and speed in Germany.

2602.05704 2026-02-06 cs.LG stat.ML

Limitations of SGD for Multi-Index Models Beyond Statistical Queries

Daniel Barzilai, Ohad Shamir

Abstract

Understanding the limitations of gradient methods, and stochastic gradient descent (SGD) in particular, is a central challenge in learning theory. To that end, a commonly used tool is the Statistical Queries (SQ) framework, which studies performance limits of algorithms based on noisy interaction with the data. However, it is known that the formal connection between the SQ framework and SGD is tenuous: Existing results typically rely on adversarial or specially-structured gradient noise that does not reflect the noise in standard SGD, and (as we point out here) can sometimes lead to incorrect predictions. Moreover, many analyses of SGD for challenging problems rely on non-trivial algorithmic modifications, such as restricting the SGD trajectory to the sphere or using very small learning rates. To address these shortcomings, we develop a new, non-SQ framework to study the limitations of standard vanilla SGD, for single-index and multi-index models (namely, when the target function depends on a low-dimensional projection of the inputs). Our results apply to a broad class of settings and architectures, including (potentially deep) neural networks.

2602.05611 2026-02-06 stat.ME

The stochastic view used in climate sciences: (some) perspectives from (some of) mathematical statistics

Nils Lid Hjort

Comments 17 pages, 18 figures

Abstract

Climate statistics is of course a very broad field, along with the many connections and impacts for yet other areas, with a history as long as mankind has been recording temperatures, describing drastic weather events, etc. The important work of Klaus Hasselmann, with crucial contributions to the field, along with various other connected strands of work, is being reviewed and discussed in other chapters. The aim of the present chapter is to point to a few statistical methodology themes of relevance for and joint interest with climate statistics. These themes, presented from a statistical methods perspective, include (i) more careful modelling and model selection strategies for meteorological type time series; (ii) methods for prediction, not only for future values of a time series, but for assessing when a trend might be crossing a barrier, along with relevant measures of uncertainty for these; (iii) climatic influence on marine biology; (iv) monitoring processes to assess whether and then to what extent models and their parameters have stayed reasonably constant over time; (v) combination of outputs from different information sources; and (vi) analysing probabilities and their uncertainties related to extreme events.

2602.05592 2026-02-06 math.ST econ.EM stat.TH

An invariant modification of the bilinear form test

Angelo Garate, Felipe Osorio, Federico Crudu

Comments 7 pages

Abstract

The invariance properties of certain likelihood-based asymptotic tests as well as their extensions for M-estimation, estimating functions and the generalized method of moments have been well studied. The simulation study reported in Crudu and Osorio [Econ. Lett. 187: 108885, 2020] shows that the bilinear form test is not invariant to one-to-one transformations of the parameter space. This paper provides a set of suitable conditions to establish the invariance property under reparametrization of the bilinear form test for linear or nonlinear hypotheses that arise in extremum estimation which leads to a simple modification of the test statistic. Evidence from a Monte Carlo simulation experiment suggests good performance of the proposed methodology.

2602.05559 2026-02-06 stat.CO math.ST stat.AP stat.ML stat.TH

Piecewise Deterministic Markov Processes for Bayesian Inference of PDE Coefficients

Leon Riccius, Iuri B. C. M. Rocha, Joris Bierkens, Hanne Kekkonen, Frans P. van der Meer

Comments 38 pages, 17 figures

Abstract

We develop a general framework for piecewise deterministic Markov process (PDMP) samplers that enables efficient Bayesian inference in non-linear inverse problems with expensive likelihoods. The key ingredient is a surrogate-assisted thinning scheme in which a surrogate model provides a proposal event rate and a robust correction mechanism enforces an upper bound on the true rate by dynamically adjusting an additive offset whenever violations are detected. This construction is agnostic to the choice of surrogate and PDMP, and we demonstrate it for the Zig-Zag sampler and the Bouncy particle sampler with constant, Laplace, and Gaussian process (GP) surrogates, including gradient-informed and adaptively refined GP variants. As a representative application, we consider Bayesian inference of a spatially varying Young's modulus in a one-dimensional linear elasticity problem. Across dimensions, PDMP samplers equipped with GP-based surrogates achieve substantially higher accuracy and effective sample size per forward model evaluation than the Random Walk Metropolis algorithm and the No-U-Turn sampler. The Bouncy particle sampler exhibits the most favorable overall efficiency and scaling, illustrating the potential of the proposed PDMP framework beyond this particular setting.

2602.05531 2026-02-06 math.OC cs.LG stat.ML

Solving Stochastic Variational Inequalities without the Bounded Variance Assumption

Ahmet Alacaoglu, Jun-Hyun Kim

Abstract

We analyze algorithms for solving stochastic variational inequalities (VI) without the bounded variance or bounded domain assumptions, where our main focus is min-max optimization with possibly unbounded constraint sets. We focus on two classes of problems: monotone VIs; and structured nonmonotone VIs that admit a solution to the weak Minty VI. The latter assumption allows us to solve structured nonconvex-nonconcave min-max problems. For both classes of VIs, to make the expected residual norm less than $\varepsilon$, we show an oracle complexity of $\widetilde{O}(\varepsilon^{-4})$, which is the best-known for constrained VIs. In our setting, this complexity had been obtained with the bounded variance assumption in the literature, which is not even satisfied for bilinear min-max problems with an unbounded domain. We obtain this complexity for stochastic oracles whose variance can grow as fast as the squared norm of the optimization variable.

2602.05489 2026-02-06 math.OC cs.LG stat.ML

Convergence Rate of the Last Iterate of Stochastic Proximal Algorithms

Kevin Kurian Thomas Vaidyan, Michael P. Friedlander, Ahmet Alacaoglu

Abstract

We analyze two classical algorithms for solving additively composite convex optimization problems where the objective is the sum of a smooth term and a nonsmooth regularizer: proximal stochastic gradient method for a single regularizer; and the randomized incremental proximal method, which uses the proximal operator of a randomly selected function when the regularizer is given as the sum of many nonsmooth functions. We focus on relaxing the bounded variance assumption that is common, yet stringent, for getting last iterate convergence rates. We prove the $\widetilde{O}(1/\sqrt{T})$ rate of convergence for the last iterate of both algorithms under componentwise convexity and smoothness, which is optimal up to log terms. Our results apply directly to graph-guided regularizers that arise in multi-task and federated learning, where the regularizer decomposes as a sum over edges of a collaboration graph.
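
A single iteration of the proximal stochastic gradient method can be sketched as follows. The choice of an L1 regularizer, whose proximal operator is soft-thresholding, is an assumption made for illustration, as are the function names and the plain-list representation of vectors; the abstract covers general nonsmooth regularizers.

```python
import math


def soft_threshold(vector, threshold):
    """Proximal operator of threshold * ||.||_1, applied coordinate-wise."""
    return [math.copysign(max(abs(v) - threshold, 0.0), v) for v in vector]


def prox_sgd_step(x, stochastic_grad, step_size, l1_weight):
    """One step: gradient move on the smooth term, then prox of the regularizer."""
    shifted = [xi - step_size * gi for xi, gi in zip(x, stochastic_grad)]
    return soft_threshold(shifted, step_size * l1_weight)


if __name__ == "__main__":
    print(soft_threshold([3.0, -0.5, -2.0], 1.0))
    print(prox_sgd_step([1.0, 1.0], [0.5, -2.0], step_size=0.5, l1_weight=1.0))
```

The last-iterate results in the abstract concern the final `x` produced by repeating such steps, rather than an average of the iterates.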

2602.05460 2026-02-06 math.ST stat.TH

Complexity reduction in online stochastic Newton methods with potential $O(Nd)$ total cost

Antoine Godichon-Baggioni, Bruno Portier, Guillaume Sallé

Abstract

Optimizing smooth convex functions in stochastic settings, where only noisy estimates of gradients and Hessians are available, is a fundamental problem in optimization. While first-order methods possess a low per-iteration cost, their convergence is slow for ill-conditioned problems. Stochastic Newton methods utilize second-order information to correct for local curvature, but the $O(d^3)$ per-iteration cost of computing and inverting a full Hessian, where $d$ is the problem dimension, is prohibitive in high dimensions. This paper introduces an online mini-batch stochastic Newton algorithm. The method employs a random masking strategy that selects a subset of Hessian columns at each iteration, substantially reducing the per-step computational cost. This approach allows the algorithm, in the mini-batch setting, to achieve a total computational cost for a single pass over $N$ data points of $O(Nd)$, which is comparable to first-order methods while retaining the advantages of second-order information. We establish the almost sure convergence and asymptotic efficiency of the resulting estimator. This property is obtained without requiring iterate averaging, which distinguishes this work from prior analyses.

2602.05379 2026-02-06 stat.ML cs.LG

Variance Reduction Based Experience Replay for Policy Optimization

Hua Zheng, Wei Xie, M. Ben Feng, Keilung Choy

Comments 24 pages, 4 figures. arXiv admin note: text overlap with arXiv:2208.12341

Abstract

Effective reinforcement learning (RL) for complex stochastic systems requires leveraging historical data collected in previous iterations to accelerate policy optimization. Classical experience replay treats all past observations uniformly and fails to account for their varying contributions to learning. To overcome this limitation, we propose Variance Reduction Experience Replay (VRER), a principled framework that selectively reuses informative samples to reduce variance in policy gradient estimation. VRER is algorithm-agnostic and integrates seamlessly with existing policy optimization methods, forming the basis of our sample-efficient off-policy algorithm, Policy Gradient with VRER (PG-VRER). Motivated by the lack of rigorous theoretical analysis of experience replay, we develop a novel framework that explicitly captures dependencies introduced by Markovian dynamics and behavior-policy interactions. Using this framework, we establish finite-time convergence guarantees for PG-VRER and reveal a fundamental bias-variance trade-off: reusing older experience increases bias but simultaneously reduces gradient variance. Extensive empirical experiments demonstrate that VRER consistently accelerates policy learning and improves performance over state-of-the-art policy optimization algorithms.

2602.05377 2026-02-06 stat.CO stat.AP

Optimal Accelerated Life Testing Sampling Plan Design with Piecewise Linear Function based Modeling of Lifetime Characteristics

Sandip Barui, Shovan Chowdhury

Abstract

Researchers have widely used accelerated life tests to determine an optimal inspection plan for lot acceptance. All such plans are proposed by assuming a known relationship between the lifetime characteristic(s) and the accelerating stress factor(s) under a parametric framework of the product lifetime distribution. As the true relationship is rarely known in practical scenarios, the assumption itself may produce biased estimates that may lead to an inefficient sampling plan. To this end, an optimal accelerated life test plan is designed under a Type-I censoring scheme with a generalized link structure similar to a spline regression, to capture the nonlinear relationship between the lifetime characteristics and the stress levels. Product lifetime is assumed to follow a Weibull distribution with non-identical scale and shape parameters linked with the stress factor through a piecewise linear function. The elements of the Fisher information matrix are computed in detail to formulate the acceptability criterion for the conforming lots. The decision variables of the sampling plan, including sample size, stress factors, and others, are determined using a constrained aggregated cost minimization approach and a variance minimization approach. A simulated case study demonstrates that the nonlinear link-based piecewise linear approximation model outperforms the linear link-based model.

2602.05351 2026-02-06 stat.ME math.ST stat.TH

A Flexible Modeling of Extremes in the Presence of Inliers

Shivshankar Nila, Ishapathik Das, N. Balakrishna

Abstract

Many random phenomena, including life-testing and environmental data, show positive values and excess zeros, which pose modeling challenges. In life testing, immediate failures result in zero lifetimes, often due to defects or poor quality, especially in electronics and clinical trials. These failures, called inliers at zero, are difficult to model using standard approaches. The presence and proportion of inliers may influence the accuracy of extreme value analysis, bias parameter estimates, or even lead to severe events or extreme effects, such as drought or crop failure. In such scenarios, a key issue in extreme value analysis is determining a suitable threshold to capture tail behaviour accurately. Although some extreme value mixture models address threshold and tail estimation, they often inadequately handle inliers, resulting in suboptimal results. Bulk model misspecification can affect the threshold, extreme value estimates, and, in particular, the tail proportion. There is no unified framework for defining extreme value mixture models, especially the tail proportion. This paper proposes a flexible model that handles extremes, inliers, and the tail proportion. Parameters are estimated using maximum likelihood estimation. The proposed model estimates are compared with the classical mean excess plot, parameter stability plot, and Pickands plot estimates. Theoretical results are established, and the proposed model outperforms traditional methods in both simulation studies and real data analysis.

2602.05340 2026-02-06 stat.ML cs.LG

Decision-Focused Sequential Experimental Design: A Directional Uncertainty-Guided Approach

Beichen Wan, Mo Liu, Paul Grigas, Zuo-Jun Max Shen

Abstract

We consider the sequential experimental design problem in the predict-then-optimize paradigm. In this paradigm, the outputs of the prediction model are used as coefficient vectors in a downstream linear optimization problem. Traditional sequential experimental design aims to control the input variables (features) so that the improvement in prediction accuracy from each experimental outcome (label) is maximized. However, in the predict-then-optimize setting, performance is ultimately evaluated based on the decision loss induced by the downstream optimization, rather than by prediction error. This mismatch between prediction accuracy and decision loss renders traditional decision-blind designs inefficient. To address this issue, we propose a directional-based metric to quantify predictive uncertainty. This metric does not require solving an optimization oracle and is therefore computationally tractable. We show that the resulting sequential design criterion enjoys strong consistency and convergence guarantees. Under a broad class of distributions, we demonstrate that our directional uncertainty-based design attains an earlier stopping time than decision-blind designs. This advantage is further supported by real-world experiments on an LLM job allocation problem.

2602.05335 2026-02-06 stat.ME cs.GR stat.AP

Boxplots and quartile plots for grouped and periodic angular data

Joshua D. Berlinski, Fan Dai, Ranjan Maitra

Comments 7 pages, 8 figures

Abstract

Angular observations, or observations lying on the unit circle, arise in many disciplines and require special care in their description, analysis, interpretation and visualization. We provide methods to construct concentric circular boxplot displays of distributions of groups of angular data. The use of concentric boxplots brings challenges of visual perception, so we set the boxwidths to be inversely proportional to the square root of their distance from the centre. A perception survey supports this scaled boxwidth choice. For a large number of groups, we propose circular quartile plots. A three-dimensional toroidal display is also implemented for periodic angular distributions. We illustrate our methods on datasets in (1) psychology, to display motor resonance under different conditions, (2) genomics, to understand the distribution of peak phases for ancillary clock genes, and (3) meteorology and wind turbine power generation, to study the changing and periodic distribution of wind direction over the course of a year.
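
The boxwidth rule stated above can be written as a one-line function. The name `box_width` and the proportionality constant `scale` are hypothetical; the abstract gives only the inverse-square-root relationship and no particular constant.

```python
import math


def box_width(radius, scale=1.0):
    """Width of a concentric circular boxplot drawn at distance `radius`.

    The width is inversely proportional to the square root of the distance
    from the centre, as described in the abstract; `scale` is an assumed
    proportionality constant.
    """
    if radius <= 0:
        raise ValueError("radius must be positive")
    return scale / math.sqrt(radius)


if __name__ == "__main__":
    for r in (1.0, 4.0, 9.0):
        print(r, box_width(r))
```

Under this rule a box four times farther from the centre is drawn half as wide.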

2602.05225 2026-02-06 math.ST stat.ML stat.TH

Metric space valued Fréchet regression

László Györfi, Pierre Humbert, Batiste Le Bars

Details
English abstract

We consider the problem of estimating the Fréchet and conditional Fréchet mean from data taking values in separable metric spaces. Unlike Euclidean spaces, where well-established methods are available, there is no practical estimator that works universally for all metric spaces. Therefore, we introduce a computable estimator for the Fréchet mean based on random quantization techniques and establish its universal consistency for any separable metric space. Additionally, we propose another estimator for the conditional Fréchet mean, leveraging data-driven partitioning and quantization, and demonstrate its universal consistency when the output space is any Banach space.
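
As a rough illustration of why a finite candidate set makes the Fréchet mean computable in a general metric space, the following Python sketch minimizes the empirical sum of squared distances over a given list of candidates. This is a simplification and not the paper's quantization-based estimator; the function names and the Hamming-distance example are assumptions.

```python
def frechet_mean_over_candidates(points, candidates, dist):
    """Candidate minimizing the sum of squared distances to the sample points."""
    return min(candidates, key=lambda c: sum(dist(c, x) ** 2 for x in points))

def hamming(a, b):
    """Hamming distance between two equal-length strings (a non-Euclidean metric)."""
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))

sample = ["aaaa", "aaab", "aabb", "bbbb"]
# Here the candidates are the sample points themselves.
center = frechet_mean_over_candidates(sample, sample, hamming)
```

Only the metric `dist` is needed, so the same function works for strings, graphs, or any other objects with a distance.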

2602.04795 2026-02-06 cs.LG cs.NA eess.SP math.NA stat.ML

Maximum-Volume Nonnegative Matrix Factorization

Olivier Vu Thanh, Nicolas Gillis

Comments arXiv admin note: substantial text overlap with arXiv:2412.06380 (this paper is an updated version of Chapter 7 of the thesis of the first author, available from arXiv:2412.06380). The code is available from https://gitlab.com/vuthanho/maxvolmf.jl

Details
English abstract

Nonnegative matrix factorization (NMF) is a popular data embedding technique. Given a nonnegative data matrix $X$, it aims at finding two lower dimensional matrices, $W$ and $H$, such that $X\approx WH$, where the factors $W$ and $H$ are constrained to be element-wise nonnegative. The factor $W$ serves as a basis for the columns of $X$. In order to obtain more interpretable and unique solutions, minimum-volume NMF (MinVol NMF) minimizes the volume of $W$. In this paper, we consider the dual approach, where the volume of $H$ is maximized instead; this is referred to as maximum-volume NMF (MaxVol NMF). MaxVol NMF is identifiable under the same conditions as MinVol NMF in the noiseless case, but it behaves rather differently in the presence of noise. In practice, MaxVol NMF is much more effective at extracting a sparse decomposition and does not generate rank-deficient solutions. In fact, we prove that the solutions of MaxVol NMF with the largest volume correspond to clustering the columns of $X$ in disjoint clusters, while the solutions of MinVol NMF with the smallest volume are rank deficient. We propose two algorithms to solve MaxVol NMF. We also present a normalized variant of MaxVol NMF that exhibits better performance than MinVol NMF and MaxVol NMF, and can be interpreted as a continuum between standard NMF and orthogonal NMF. We illustrate our results in the context of hyperspectral unmixing.
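
For readers unfamiliar with the baseline problem $X\approx WH$, here is a minimal Python sketch of standard NMF using the classical multiplicative updates for the Frobenius objective. It is not MaxVol or MinVol NMF and is unrelated to the authors' Julia code; the function name, iteration count, and test matrix are assumptions.

```python
import numpy as np

def nmf(X, rank, iters=200, eps=1e-9, seed=0):
    """Plain NMF via multiplicative updates; W and H stay element-wise nonnegative."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, rank)) + eps
    H = rng.random((rank, n)) + eps
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # update H with W fixed
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # update W with H fixed
    return W, H

# Synthetic nonnegative matrix with an exact rank-2 nonnegative factorization.
rng = np.random.default_rng(1)
X = rng.random((6, 2)) @ rng.random((2, 8))
W, H = nmf(X, rank=2)
rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```

Volume-regularized variants add a penalty or constraint on $W$ or $H$ on top of this data-fitting term.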

2602.04408 2026-02-06 cs.LG stat.ML

Separation-Utility Pareto Frontier: An Information-Theoretic Characterization

Shizhou Xu

Details
English abstract

We study the Pareto frontier (optimal trade-off) between utility and separation, a fairness criterion requiring predictive independence from sensitive attributes conditional on the true outcome. Through an information-theoretic lens, we prove a characterization of the utility-separation Pareto frontier, establish its concavity, and thereby prove the increasing marginal cost of separation in terms of utility. In addition, we characterize the conditions under which this trade-off becomes strict, providing a guide for trade-off selection in practice. Based on the theoretical characterization, we develop an empirical regularizer based on conditional mutual information (CMI) between predictions and sensitive attributes given the true outcome. The CMI regularizer is compatible with any deep model trained via gradient-based optimization and serves as a scalar monitor of residual separation violations, offering tractable guarantees during training. Finally, numerical experiments support our theoretical findings: across COMPAS, UCI Adult, UCI Bank, and CelebA, the proposed method substantially reduces separation violations while matching or exceeding the utility of established baseline methods. This study thus offers a provable, stable, and flexible approach to enforcing separation in deep learning.
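
The quantity behind the regularizer, the conditional mutual information between predictions and sensitive attributes given the true outcome, can be sketched for discrete data as follows. This is a plug-in count-based estimate for illustration only; the paper's regularizer is designed for gradient-based training and may be computed differently. All names are assumptions.

```python
import math
from collections import Counter

def conditional_mutual_information(pred, attr, label):
    """Plug-in estimate of I(pred; attr | label) in nats from discrete samples."""
    n = len(label)
    n_pay = Counter(zip(pred, attr, label))
    n_py = Counter(zip(pred, label))
    n_ay = Counter(zip(attr, label))
    n_y = Counter(label)
    cmi = 0.0
    for (p, a, y), count in n_pay.items():
        # p(p,a,y) * log[ p(p,a,y) p(y) / (p(p,y) p(a,y)) ] written with raw counts
        cmi += (count / n) * math.log(count * n_y[y] / (n_py[(p, y)] * n_ay[(a, y)]))
    return cmi

label = [0, 0, 0, 0, 1, 1, 1, 1]
attr = [0, 0, 1, 1, 0, 0, 1, 1]
pred_separated = [0, 1, 0, 1, 0, 1, 0, 1]  # independent of attr within each label
pred_violating = list(attr)                # copies the sensitive attribute

cmi_separated = conditional_mutual_information(pred_separated, attr, label)
cmi_violating = conditional_mutual_information(pred_violating, attr, label)
```

A value of zero corresponds to separation holding exactly on the sample; larger values indicate stronger violations.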

2602.03539 2026-02-06 math.ST stat.TH

Optimal neural network approximation of smooth compositional functions on sets with low intrinsic dimension

Thomas Nagler, Sophie Langer

Details
English abstract

We study approximation and statistical learning properties of deep ReLU networks under structural assumptions that mitigate the curse of dimensionality. We prove minimax-optimal uniform approximation rates for $s$-Hölder smooth functions defined on sets with low Minkowski dimension using fully connected networks with flexible width and depth, improving existing results by logarithmic factors even in classical full-dimensional settings. A key technical ingredient is a new memorization result for deep ReLU networks that enables efficient point fitting with dense architectures. We further introduce a class of compositional models in which each component function is smooth and acts on a domain of low intrinsic dimension. This framework unifies two common assumptions in the statistical learning literature, structural constraints on the target function and low dimensionality of the covariates, within a single model. We show that deep networks can approximate such functions at rates determined by the most difficult function in the composition. As an application, we derive improved convergence rates for empirical risk minimization in nonparametric regression that adapt to smoothness, compositional structure, and intrinsic dimensionality.

2601.21200 2026-02-06 stat.ML cs.LG

Provably Reliable Classifier Guidance via Cross-Entropy Control

Sharan Sahu, Arisina Banerjee, Yuchen Wu

Comments 31 pages, 3 figures

Details
English abstract

Classifier-guided diffusion models generate conditional samples by augmenting the reverse-time score with the gradient of the log-probability predicted by a probabilistic classifier. In practice, this classifier is usually obtained by minimizing an empirical loss function. While existing statistical theory guarantees good generalization performance when the sample size is sufficiently large, it remains unclear whether such training yields an effective guidance mechanism. We study this question in the context of cross-entropy loss, which is widely used for classifier training. Under mild smoothness assumptions on the classifier, we show that controlling the cross-entropy at each diffusion model step is sufficient to control the corresponding guidance error. In particular, probabilistic classifiers achieving conditional KL divergence $\varepsilon^2$ induce guidance vectors with mean squared error $\widetilde O(d \varepsilon )$, up to constant and logarithmic factors. Our result yields an upper bound on the sampling error of classifier-guided diffusion models and bears resemblance to a reverse log-Sobolev--type inequality. To the best of our knowledge, this is the first result that quantitatively links classifier training to guidance alignment in diffusion models, providing both a theoretical explanation for the empirical success of classifier guidance, and principled guidelines for selecting classifiers that induce effective guidance.

2510.25753 2026-02-06 stat.ML cs.LG

How Data Mixing Shapes In-Context Learning: Asymptotic Equivalence for Transformers with MLPs

Samet Demir, Zafer Dogan

Comments NeurIPS 2025, 24 pages, 6 figures

Details
English abstract

Pretrained Transformers demonstrate remarkable in-context learning (ICL) capabilities, enabling them to adapt to new tasks from demonstrations without parameter updates. However, theoretical studies often rely on simplified architectures (e.g., omitting MLPs), plain data models (e.g., linear regression with isotropic inputs), and single-source training, limiting their relevance to realistic settings. In this work, we study ICL in pretrained Transformers with nonlinear MLP heads on nonlinear tasks drawn from multiple data sources with heterogeneous input, task, and noise distributions. We analyze a model where the MLP comprises two layers, with the first layer trained via a single gradient step and the second layer fully optimized. Under high-dimensional asymptotics, we prove that such models are equivalent in ICL error to structured polynomial predictors, leveraging results from the theory of Gaussian universality and orthogonal polynomials. This equivalence reveals that nonlinear MLPs meaningfully enhance ICL performance, particularly on nonlinear tasks, compared to linear baselines. It also enables a precise analysis of data mixing effects: we identify key properties of high-quality data sources (low noise, structured covariances) and show that feature learning emerges only when the task covariance exhibits sufficient structure. These results are validated empirically across various activation functions, model sizes, and data distributions. Finally, we experiment with a real-world scenario involving multilingual sentiment analysis where each language is treated as a different source. Our experimental results for this case exemplify how our findings extend to real-world cases. Overall, our work advances the theoretical foundations of ICL in Transformers and provides actionable insight into the role of architecture and data in ICL.

2510.25502 2026-02-06 cs.LG cs.AI stat.ML

TempoPFN: Synthetic Pre-training of Linear RNNs for Zero-shot Time Series Forecasting

Vladyslav Moroshan, Julien Siems, Arber Zela, Timur Carstensen, Frank Hutter

Comments 38 pages, 22 figures, 17 tables

Details
English abstract

Foundation models for zero-shot time series forecasting face challenges in efficient long-horizon prediction and reproducibility, with existing synthetic-only approaches underperforming on challenging benchmarks. This paper presents TempoPFN, a univariate time series foundation model based on linear Recurrent Neural Networks (RNNs) pre-trained exclusively on synthetic data. The model uses a GatedDeltaProduct architecture with state-weaving for fully parallelizable training across sequence lengths, eliminating the need for windowing or summarization techniques while maintaining robust temporal state-tracking. Our comprehensive synthetic data pipeline unifies diverse generators, including stochastic differential equations, Gaussian processes, and audio synthesis, with novel augmentations. In zero-shot evaluations on the Gift-Eval, fev-bench and Chronos-ZS benchmarks, TempoPFN achieves top-tier competitive performance, outperforming all existing synthetic-only approaches and surpassing the majority of models trained on real-world data, while being more efficient than existing baselines by leveraging fully parallelizable training and inference. We open-source our complete data generation pipeline and training code, providing a reproducible foundation for future research.

2510.22031 2026-02-06 cs.LG cs.AI stat.ML

Differentiable Constraint-Based Causal Discovery

Jincheng Zhou, Mengbo Wang, Anqi He, Yumeng Zhou, Hessam Olya, Murat Kocaoglu, Bruno Ribeiro

Details
English abstract

Causal discovery from observational data is a fundamental task in artificial intelligence, with far-reaching implications for decision-making, predictions, and interventions. Despite significant advances, existing methods can be broadly categorized as constraint-based or score-based approaches. Constraint-based methods offer rigorous causal discovery but are often hindered by small sample sizes, while score-based methods provide flexible optimization but typically forgo explicit conditional independence testing. This work explores a third avenue: developing differentiable $d$-separation scores, obtained through a percolation theory using soft logic. This enables the implementation of a new type of causal discovery method: gradient-based optimization of conditional independence constraints. Empirical evaluations demonstrate the robust performance of our approach in low-sample regimes, surpassing traditional constraint-based and score-based baselines on a real-world dataset. Code and data of the proposed method are publicly available at https://github.com/PurdueMINDS/DAGPA.

2510.18713 2026-02-06 cs.LG cs.AI stat.ML

Preference-based Reinforcement Learning beyond Pairwise Comparisons: Benefits of Multiple Options

Joongkyu Lee, Seouh-won Yi, Min-hwan Oh

Comments Accepted at NeurIPS 2025

Details
English abstract

We study online preference-based reinforcement learning (PbRL) with the goal of improving sample efficiency. While a growing body of theoretical work has emerged, motivated by PbRL's recent empirical success, particularly in aligning large language models (LLMs), most existing studies focus only on pairwise comparisons. A few recent works (Zhu et al., 2023, Mukherjee et al., 2024, Thekumparampil et al., 2024) have explored using multiple comparisons and ranking feedback, but their performance guarantees fail to improve, and can even deteriorate, as the feedback length increases, despite the richer information available. To address this gap, we adopt the Plackett-Luce (PL) model for ranking feedback over action subsets and propose M-AUPO, an algorithm that selects multiple actions by maximizing the average uncertainty within the offered subset. We prove that M-AUPO achieves a suboptimality gap of $\tilde{O}\left( \frac{d}{T} \sqrt{ \sum_{t=1}^T \frac{1}{|S_t|}} \right)$, where $T$ is the total number of rounds, $d$ is the feature dimension, and $|S_t|$ is the size of the subset at round $t$. This result shows that larger subsets directly lead to improved performance and, notably, the bound avoids the exponential dependence on the unknown parameter's norm, which was a fundamental limitation in most previous works. Moreover, we establish a near-matching lower bound of $\Omega\left( \frac{d}{K \sqrt{T}} \right)$, where $K$ is the maximum subset size. To the best of our knowledge, this is the first theoretical result in PbRL with ranking feedback that explicitly shows improved sample efficiency as a function of the subset size.
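
The Plackett-Luce model used for ranking feedback can be sketched in Python as follows: items are picked one at a time, each with probability proportional to the exponential of its score among the items still remaining. The function name and the example scores are assumptions; M-AUPO itself is not implemented here.

```python
import math
from itertools import permutations

def plackett_luce_prob(ranking, scores):
    """Probability of `ranking` (best first) under the PL model with the given scores."""
    remaining = list(ranking)
    prob = 1.0
    for item in ranking:
        denom = sum(math.exp(scores[j]) for j in remaining)
        prob *= math.exp(scores[item]) / denom  # chance `item` is picked next
        remaining.remove(item)
    return prob

scores = {"a": 1.0, "b": 0.0, "c": -1.0}  # hypothetical utilities of three actions
p_best_first = plackett_luce_prob(("a", "b", "c"), scores)
p_worst_first = plackett_luce_prob(("c", "b", "a"), scores)
total = sum(plackett_luce_prob(r, scores) for r in permutations(scores))
```

With two items the model reduces to the usual pairwise logistic comparison.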

2510.08916 2026-02-06 stat.ML cs.LG

A Representer Theorem for Hawkes Processes via Penalized Least Squares Minimization

Hideaki Kim, Tomoharu Iwata

Comments Accepted to ICLR 2026

Details
English abstract

The representer theorem is a cornerstone of kernel methods, which aim to estimate latent functions in reproducing kernel Hilbert spaces (RKHSs) in a nonparametric manner. Its significance lies in converting inherently infinite-dimensional optimization problems into finite-dimensional ones over dual coefficients, thereby enabling practical and computationally tractable algorithms. In this paper, we address the problem of estimating the latent triggering kernels--functions that encode the interaction structure between events--for linear multivariate Hawkes processes based on observed event sequences within an RKHS framework. We show that, under the principle of penalized least squares minimization, a novel form of representer theorem emerges: a family of transformed kernels can be defined via a system of simultaneous integral equations, and the optimal estimator of each triggering kernel is expressed as a linear combination of these transformed kernels evaluated at the data points. Remarkably, the dual coefficients are all analytically fixed to unity, obviating the need to solve a costly optimization problem to obtain the dual coefficients. This leads to a highly efficient estimator capable of handling large-scale data more effectively than conventional nonparametric approaches. Empirical evaluations on synthetic datasets reveal that the proposed method attains competitive predictive accuracy while substantially improving computational efficiency over existing state-of-the-art kernel method-based estimators.
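
To make the role of a triggering kernel concrete, here is a minimal Python sketch of the conditional intensity of a univariate linear Hawkes process: a baseline rate plus the kernel evaluated at the lags to all past events. The exponential kernel and its parameters are assumptions chosen for illustration; the paper estimates the kernels nonparametrically in an RKHS.

```python
import math

def hawkes_intensity(t, events, mu, kernel):
    """Conditional intensity at time t: baseline plus kernel summed over past events."""
    return mu + sum(kernel(t - s) for s in events if s < t)

def exp_kernel(lag, alpha=0.5, beta=1.0):
    """Hypothetical triggering kernel: excitation alpha decaying at rate beta."""
    return alpha * math.exp(-beta * lag)

events = [1.0, 2.0]
lam_before = hawkes_intensity(0.5, events, mu=0.2, kernel=exp_kernel)  # no past events yet
lam_after = hawkes_intensity(2.5, events, mu=0.2, kernel=exp_kernel)   # excited by both events
```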

2509.06505 2026-02-06 cs.LG cs.IT math.IT stat.ML

On optimal solutions of classical and sliced Wasserstein GANs with non-Gaussian data

Yu-Jui Huang, Hsin-Hua Shen, Yu-Chih Huang, Wan-Yi Lin, Shih-Chun Lin

Details
English abstract

The generative adversarial network (GAN) aims to approximate an unknown distribution via a parameterized neural network (NN). While GANs have been widely applied in reinforcement and semi-supervised learning as well as computer vision tasks, selecting their parameters often needs an exhaustive search, and only a few selection methods have been proven to be theoretically optimal. One of the most promising GAN variants is the Wasserstein GAN (WGAN). Prior work on optimal parameters for population WGAN is limited to the linear-quadratic-Gaussian (LQG) setting, where the generator NN is linear, and the data is Gaussian. In this paper, we focus on the characterization of optimal solutions of population WGAN beyond the LQG setting. As a basic result, closed-form optimal parameters for one-dimensional WGAN are derived when the NN has non-linear activation functions, and the data is non-Gaussian. For high-dimensional data, we adopt the sliced Wasserstein framework and show that the linear generator can be asymptotically optimal. Moreover, the original sliced WGAN only constrains the projected data marginals rather than the whole distribution as in classical WGAN, and thus, we propose a new unprojected sliced WGAN and identify its asymptotic optimality. Empirical studies show that compared to the celebrated r-principal component analysis (r-PCA) solution, which has cubic complexity in the data dimension, our generator for sliced WGAN can achieve better performance with only linear complexity.
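
The sliced Wasserstein idea rests on the fact that in one dimension the Wasserstein distance between two equal-size samples is obtained by sorting. The Python sketch below shows that computation and a Monte Carlo average over random projection directions. It is a generic illustration, not the paper's generator or training procedure; function names and the number of projections are assumptions.

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """p-Wasserstein distance between two equal-size 1-D samples via sorting."""
    xs, ys = np.sort(x), np.sort(y)
    return float(np.mean(np.abs(xs - ys) ** p) ** (1.0 / p))

def sliced_wasserstein(X, Y, n_proj=64, p=2, seed=0):
    """Monte Carlo sliced p-Wasserstein distance using random unit directions."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_proj, X.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    vals = [wasserstein_1d(X @ d, Y @ d, p) ** p for d in dirs]
    return float(np.mean(vals) ** (1.0 / p))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
shift = np.array([1.0, 2.0, 2.0])          # translation of Euclidean norm 3
sw_same = sliced_wasserstein(X, X)
sw_shift = sliced_wasserstein(X, X + shift)
w1d_shift = wasserstein_1d(X[:, 0], X[:, 0] + 3.0)
```

Each projection costs a sort, which is why sliced distances scale well with the data dimension.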

2508.12735 2026-02-06 cs.DL stat.AP

Citation accuracy, citation noise, and citation bias: A foundation of citation analysis

Lutz Bornmann, Christian Leibel

Details
English abstract

Citation analysis is widely used in research evaluation to assess the impact of scientific papers. These analyses rest on the assumption that citation decisions by authors are accurate, representing the flow of knowledge from cited to citing papers. However, in practice, researchers often cite for reasons that are not related to the fact that there has been (intellectual) input from previous papers. Citations made for rhetorical reasons or without reading the cited work compromise the value of citations as an instrument for research evaluation. Past research on threats to the accuracy of citations has mainly focused on citation bias as the primary concern. In this paper, we argue that citation noise - the undesirable variance in citation decisions - represents an equally critical but underexplored challenge in citation analysis. We define and differentiate two types of citation noise: citation level noise and citation pattern noise. Each type of noise is described in terms of how it arises and the specific ways it can undermine the validity of citation-based research assessments. By conceptually differentiating citation noise from citation accuracy and citation bias, we propose a framework for the foundation of citation analysis. We discuss strategies and interventions to minimize citation noise, aiming to improve the reliability and validity of citation analysis in research evaluation. We recommend that the current professional reform movement in research evaluation, such as the Coalition for Advancing Research Assessment (CoARA), pick up these strategies and interventions as an additional building block for careful, responsible use of bibliometric indicators in research evaluation.

2508.08336 2026-02-06 stat.ME

Empirical Bayes for Data Integration

Paul Rognon-Vael, David Rossell

Details
English abstract

We discuss the use of empirical Bayes for data integration, in the sense of transfer learning. Our main interest is in settings where one wishes to learn structure (e.g. feature selection) and one only has access to incomplete data from previous studies, such as summaries, estimates or lists of relevant features. We discuss differences between full Bayes and empirical Bayes, and develop a computational framework for the latter. We discuss how empirical Bayes attains consistent variable selection under weaker conditions (sparsity and betamin assumptions) than full Bayes and other standard criteria do, and how it attains faster convergence rates. Our high-dimensional regression examples show that fully Bayesian inference enjoys excellent properties, and that data integration with empirical Bayes can offer moderate yet meaningful improvements in practice.

2505.20295 2026-02-06 cs.CL cs.AI cs.LG stat.ML

SelfReflect: Can LLMs Communicate Their Internal Answer Distribution?

Michael Kirchhof, Luca Füger, Adam Goliński, Eeshan Gunesh Dhekane, Arno Blaas, Seong Joon Oh, Sinead Williamson

Comments Accepted at ICLR 2026

Details
English abstract

The common approach to communicating a large language model's (LLM) uncertainty is to add a percentage number or a hedging word to its response. But is this all we can do? Instead of generating a single answer and then hedging it, an LLM that is fully transparent to the user needs to be able to reflect on its internal belief distribution and output a summary of all options it deems possible, and how likely they are. To test whether LLMs possess this capability, we develop the SelfReflect metric, an information-theoretic distance between a given summary and a distribution over answers. In interventional and human studies, we find that SelfReflect indicates even slight deviations, yielding a fine measure of faithfulness between a summary string and an LLM's actual internal distribution over answers. With SelfReflect, we make a resounding negative observation: modern LLMs are, across the board, incapable of revealing what they are uncertain about, whether through reasoning, chains of thought, or explicit finetuning. However, we do find that LLMs are able to generate faithful summaries of their uncertainties if we help them by sampling multiple outputs and feeding them back into the context. This simple approach shines a light on the universal way of communicating LLM uncertainties whose future development the SelfReflect score enables. To support the development of this universal form of LLM uncertainties, we publish the code that implements our metric for arbitrary LLMs under https://github.com/apple/ml-selfreflect.

2505.15423 2026-02-06 cs.LG econ.EM stat.AP stat.ME stat.ML

SplitWise Regression: Stepwise Modeling with Adaptive Dummy Encoding

Marcell T. Kurbucz, Nikolaos Tzivanakis, Nilufer Sari Aslam, Adam M. Sykulski

Comments 15 pages, 1 figure, 3 tables

Journal ref Scientific Reports 15, 42454 (2025)

Details
English abstract

Capturing nonlinear relationships without sacrificing interpretability remains a persistent challenge in regression modeling. We introduce SplitWise, a novel framework that enhances stepwise regression. It adaptively transforms numeric predictors into threshold-based binary features using shallow decision trees, but only when such transformations improve model fit, as assessed by the Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC). This approach preserves the transparency of linear models while flexibly capturing nonlinear effects. Implemented as a user-friendly R package, SplitWise is evaluated on both synthetic and real-world datasets. The results show that it consistently produces more parsimonious and generalizable models than traditional stepwise and penalized regression techniques.
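
The core decision described above, keeping a numeric predictor linear or replacing it with a threshold dummy depending on AIC, can be sketched in Python for a single predictor. This is a simplified stand-in: it searches thresholds exhaustively instead of fitting shallow decision trees, omits the stepwise selection, and does not reflect the SplitWise R package's interface. All names and the synthetic data are assumptions.

```python
import numpy as np

def aic(y, design):
    """Gaussian AIC (up to an additive constant) of an ordinary least squares fit."""
    n, k = design.shape
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    return n * np.log(rss / n) + 2 * k

def choose_encoding(x, y):
    """Return (kind, threshold, aic): linear term unless some dummy 1[x > t] has lower AIC."""
    ones = np.ones_like(x)
    best = ("linear", None, aic(y, np.column_stack([ones, x])))
    for t in np.unique(x)[:-1]:  # skip the maximum so the dummy is never all zeros
        score = aic(y, np.column_stack([ones, (x > t).astype(float)]))
        if score < best[2]:
            best = ("dummy", float(t), score)
    return best

# Synthetic step-shaped relationship with a jump at x = 0.5.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * (x > 0.5) + rng.normal(scale=0.1, size=x.size)
kind, threshold, score = choose_encoding(x, y)
```

Passing a BIC-style penalty instead of `2 * k` would give the BIC variant mentioned in the abstract.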

2504.12841 2026-02-06 cs.LG cs.AI cs.CV cs.MS stat.ML

ALT: A Python Package for Lightweight Feature Representation in Time Series Classification

Balázs P. Halmos, Balázs Hajós, Vince Á. Molnár, Marcell T. Kurbucz, Antal Jakovác

Comments 16 pages, 4 figures

Journal ref Machine Learning: Science and Technology (2026)

Details
English abstract

We introduce ALT, an open-source Python package created for efficient and accurate time series classification (TSC). The package implements the adaptive law-based transformation (ALT) algorithm, which transforms raw time series data into a linearly separable feature space using variable-length shifted time windows. This adaptive approach enhances its predecessor, the linear law-based transformation (LLT), by effectively capturing patterns of varying temporal scales. The software is implemented for scalability, interpretability, and ease of use, achieving state-of-the-art performance with minimal computational overhead. Extensive benchmarking on real-world datasets demonstrates the utility of ALT for diverse TSC tasks in physics and related domains.

2504.02974 2026-02-06 math.ST stat.TH

Testing hypotheses generated by constraints

Martin Larsson, Aaditya Ramdas, Johannes Ruf

Details
English abstract

E-variables are nonnegative random variables with expected value at most one under any distribution from a given null hypothesis. Every nonasymptotically valid test can be obtained by thresholding some e-variable. As such, e-variables arise naturally in applications in statistics and operations research, and a key open problem is to characterize their form. We provide a complete solution to this problem for hypotheses generated by constraints -- a broad and natural framework that encompasses many hypothesis classes occurring in practice. Our main result is an abstract representation theorem that describes all e-variables for any hypothesis defined by an arbitrary collection of measurable constraints. We instantiate this general theory for three important classes: hypotheses generated by finitely many constraints, one-sided sub-$\psi$ distributions (including sub-Gaussian distributions), and distributions constrained by group symmetries. In each case, we explicitly characterize all e-variables as well as all admissible e-variables. Numerous examples are treated, including constraints on moments, quantiles, and conditional value-at-risk (CVaR). Building on these, we prove existence and uniqueness of optimal e-variables under a large class of expected utility-based objective functions used for optimal decision making, in particular covering all criteria studied in the e-variable literature to date.
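
A small Python sketch of the definition: for a point null, a likelihood ratio against any fixed alternative is an e-variable, and rejecting when it reaches $1/\alpha$ gives a level-$\alpha$ test by Markov's inequality. The Bernoulli null and alternative below are assumptions chosen for illustration and are far simpler than the constraint-generated hypotheses treated in the paper.

```python
import math

def likelihood_ratio_e_variable(x, p_null=0.5, q_alt=0.7):
    """E-variable for the point null Bernoulli(p_null): likelihood ratio q_alt / p_null."""
    return (q_alt / p_null) if x == 1 else ((1 - q_alt) / (1 - p_null))

# Expected value under the null is exactly one, so this is a valid e-variable.
expected_under_null = (0.5 * likelihood_ratio_e_variable(1)
                       + 0.5 * likelihood_ratio_e_variable(0))

def reject(e_value, alpha=0.05):
    """Threshold test: under the null, P(E >= 1/alpha) <= alpha by Markov's inequality."""
    return e_value >= 1 / alpha

# E-variables for independent observations multiply; ten successes in a row.
e_product = math.prod(likelihood_ratio_e_variable(1) for _ in range(10))
rejected = reject(e_product)
```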

2503.22548 2026-02-06 stat.AP

Comparing methods to assess treatment effect heterogeneity in general parametric regression models

Yao Chen, Sophie Sun, Konstantinos Sechidis, Cong Zhang, Torsten Hothorn, Björn Bornkamp

Details
English abstract

This paper reviews and compares methods to assess treatment effect heterogeneity in the context of parametric regression models. These methods include the standard likelihood ratio tests, bootstrap likelihood ratio tests, and Goeman's global test motivated by testing whether the random effect variance is zero. We place particular emphasis on tests based on the score-residual of the treatment effect and explore different variants of tests in this class. All approaches are compared in a simulation study, and the approach based on residual scores is illustrated in a clinical trial with time-to-event outcome comparing treatment versus placebo. Our findings demonstrate that score-residual based methods provide practical, flexible and reliable tools for exploring treatment effect heterogeneity and treatment effect modifiers, and can provide useful guidance for decision making around treatment effect heterogeneity.

2503.03557 2026-02-06 stat.AP stat.ME

Causal language jumps in clinical practice guidelines for diabetes management

Keling Wang, Chang Wei, Jeremy A. Labrecque

Comments 10 pages, 4 figures, 3 tables, 4 supplementary files

Journal ref BMJ Open 2026;16:e109205

Details
English abstract

Clinical practice guidelines are designed to guide clinical practice and involve causal language. Sometimes guidelines make or require stronger causal claims than those in the references they rely on, a phenomenon we refer to as 'causal language jump'. We evaluated the strength of expressed causation in diabetes guidelines and the evidence they reference to assess the pattern of jumps. We randomly sampled 300 guideline statements from four diabetes guidelines. We rated the causation strength in the statements and the dependence on causation in recommendations supported by these statements using existing scales. Among the causal statements, the cited original studies were similarly assessed. We also assessed how well they report target trial emulation (TTE) components as a proxy for reliability. Of the sampled statements, 114 (38.0%) were causal, and 76 (66.7%) expressed strong causation. 27.2% (31/114) of causal guideline statements demonstrated a "causal language jump", and 34.9% (29/83) of guideline recommendations cannot be effectively supported. Of the 53 eligible studies for TTE rating, most did not report treatment assignment and causal contrast in detail. Our findings suggest causal language jumps were common among diabetes guidelines. While these jumps are sometimes inevitable, they should always be supported by good causal inference practices.
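
The reported percentages can be recomputed from the counts given in the abstract. The short Python check below uses only those counts; the helper name `pct` is an assumption.

```python
def pct(numerator, denominator):
    """Percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)

causal_share = pct(114, 300)        # causal statements among sampled statements
strong_share = pct(76, 114)         # strong causation among causal statements
jump_share = pct(31, 114)           # causal statements showing a causal language jump
unsupported_share = pct(29, 83)     # recommendations that cannot be effectively supported
```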

2503.03530 2026-02-06 stat.ME

Inference for Heterogeneous Treatment Effects with Efficient Instruments and Machine Learning

Cyrill Scheidegger, Zijian Guo, Peter Bühlmann

Details
English abstract

We introduce a new instrumental variable (IV) estimator for heterogeneous treatment effects in the presence of endogeneity. Our estimator is based on double/debiased machine learning (DML) and uses efficient machine learning instruments (MLIV) and kernel smoothing. We prove consistency and asymptotic normality of our estimator and also construct confidence sets that are more robust towards weak IV. Along the way, we also provide an accessible discussion of the corresponding estimator for the homogeneous treatment effect with efficient machine learning instruments. The methods are evaluated on synthetic and real datasets and an implementation is made available in the R package IVDML.

2502.09986 2026-02-06 stat.ME stat.ML

Statistical description and dimension reduction of continuous time categorical trajectories with multivariate functional principal components

Hervé Cardot, Caroline Peltier

Details
English abstract

Tools that allow simple representations and comparisons of a set of categorical trajectories are of major interest to statisticians. Without losing any information, we associate with each state a binary random indicator function, taking values in $\{0,1\}$, and turn the problem of statistical description of the categorical trajectories into a multivariate functional principal components analysis. This viewpoint encompasses experimental frameworks where two or more states can be observed simultaneously. The sample paths being piecewise constant, with a finite number of jumps, this is a rare case in functional data analysis in which the trajectories are not assumed to be continuous and can be observed exhaustively. Under the weak hypothesis of continuity in probability of the $0$-$1$ trajectories, the means and the (multivariate) covariance function are continuous and have interpretations in terms of departure from independence of the joint probabilities. Taking a functional data point of view, we show that the binary trajectories, which are right-continuous functions with left-hand limits, can be seen as random elements in the Hilbert space of square integrable functions. The multivariate functional principal components are simple to interpret and we show that we can get consistent estimators of the mean trajectories and the covariance functions under weak regularity assumptions. The ability of the approach to represent categorical trajectories in a low-dimensional space is illustrated on a data set of sensory perceptions, considering different gustometer-controlled stimuli experiments.
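
The indicator encoding can be sketched in Python: a categorical trajectory given by its jump times and visited states is turned into one 0/1 function per state, evaluated on a time grid. The sketch assumes a single state at a time and a grid that starts no earlier than the first jump time, whereas the paper also covers simultaneous states; the function name and the example states are assumptions.

```python
import numpy as np

def state_indicators(jump_times, states, grid, all_states):
    """Evaluate each state's 0/1 indicator on `grid`.

    The trajectory is right-continuous: states[k] holds on
    [jump_times[k], jump_times[k + 1]).
    """
    idx = np.searchsorted(jump_times, grid, side="right") - 1
    current = np.asarray(states)[idx]
    return {s: (current == s).astype(int) for s in all_states}

jump_times = [0.0, 1.0, 2.5]
states = ["sweet", "bitter", "sweet"]
grid = np.array([0.0, 0.5, 1.0, 2.0, 2.5, 3.0])
indicators = state_indicators(jump_times, states, grid, ["sweet", "bitter"])
```

Stacking these indicator curves across subjects gives the multivariate functional data to which principal components analysis is applied.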

2502.00713 2026-02-06 stat.AP stat.ME

Using Individualized Treatment Effects to Assess Treatment Effect Heterogeneity

Konstantinos Sechidis, Cong Zhang, Sophie Sun, Yao Chen, Asher Spector, Björn Bornkamp

Journal ref Statistics in Medicine 2025

Details
English abstract

Assessing treatment effect heterogeneity (TEH) in clinical trials is crucial, as it provides insights into the variability of treatment responses among patients, influencing important decisions related to drug development. Furthermore, it can lead to personalized medicine by tailoring treatments to individual patient characteristics. This paper introduces novel methodologies for assessing treatment effects using the individual treatment effect as a basis. To estimate this effect, we use a Double Robust (DR) learner to infer a pseudo-outcome that reflects the causal contrast. This pseudo-outcome is then used to perform three objectives: (1) a global test for heterogeneity, (2) ranking covariates based on their influence on effect modification, and (3) providing estimates of the individualized treatment effect. We compare our DR-learner with various alternatives and competing methods in a simulation study, and also use it to assess heterogeneity in a pooled analysis of five Phase III trials in psoriatic arthritis. By integrating these methods with the recently proposed WATCH workflow (Workflow to Assess Treatment Effect Heterogeneity in Drug Development for Clinical Trial Sponsors), we provide a robust framework for analyzing TEH, offering insights that enable more informed decision-making in this challenging area.

2501.09217 2026-02-06 cs.LG cs.AI cs.CV stat.ML

Adaptive Law-Based Transformation (ALT): A Lightweight Feature Representation for Time Series Classification

Marcell T. Kurbucz, Balázs Hajós, Balázs P. Halmos, Vince Á. Molnár, Antal Jakovác

Comments 8 pages, 1 figure, 5 tables

Journal ref Scientific Reports 15, 41775 (2025)

Details
English abstract

Time series classification (TSC) is fundamental in numerous domains, including finance, healthcare, and environmental monitoring. However, traditional TSC methods often struggle with the inherent complexity and variability of time series data. Building on our previous work with the linear law-based transformation (LLT) - which improved classification accuracy by transforming the feature space based on key data patterns - we introduce adaptive law-based transformation (ALT). ALT enhances LLT by incorporating variable-length shifted time windows, enabling it to capture distinguishing patterns of various lengths and thereby handle complex time series more effectively. By mapping features into a linearly separable space, ALT provides a fast, robust, and transparent solution that achieves state-of-the-art performance with only a few hyperparameters.

2412.00160 2026-02-06 q-bio.QM stat.AP

How reproducible are data-driven subtypes of Alzheimer's disease atrophy?

Emma Prevot, Cameron Shand, Neil Oxtoby, for Alzheimer's Disease Neuroimaging Initiative

Journal ref Journal of Alzheimer's Disease (2026)

Details
English abstract

Alzheimer's disease (AD) exhibits substantial clinical and biological heterogeneity, complicating efforts in treatment and intervention development. While new computational methods offer insights into AD progression, the reproducibility of these subtypes across datasets remains understudied, particularly concerning the robustness of subtype definitions when validated on diverse databases. This study evaluates the consistency of AD progression subtypes identified by the Subtype and Stage Inference (SuStaIn) algorithm using T1-weighted MRI data across 5,444 subjects from ANMerge, OASIS, and ADNI datasets, forming four independent cohorts. Each cohort was analyzed under two conditions: one using the full cohort, including cognitively normal controls, and another excluding controls to test subtype robustness. Results confirm the three primary atrophy subtypes identified in earlier studies: Typical, Cortical, and Subcortical, as well as the emergence of rare and atypical AD variants such as posterior cortical atrophy (PCA). Notably, each subtype displayed varying robustness to the inclusion of controls, with certain subtypes, like Subcortical, more influenced by cohort composition. This investigation underscores SuStaIn's reliability for defining stable AD subtypes and suggests its utility in clinical stratification for trials and diagnosis. However, our findings also highlight the need for improved dataset diversity, particularly in terms of ethnic representation, to enhance generalizability and support broader clinical application.

2410.04560 2026-02-06 cs.LG stat.ML

GAMformer: Bridging Tabular Foundation Models and Interpretable Machine Learning

Andreas Mueller, Julien Siems, Harsha Nori, David Salinas, Arber Zela, Rich Caruana, Frank Hutter

Comments 22 pages, 15 figures

Details
English abstract

While interpretability is crucial for machine learning applications in safety-critical domains and for regulatory compliance, existing tabular foundation models like TabPFN lack transparency. Generalized Additive Models (GAMs) provide the needed interpretability through their additive structure, but traditional GAM methods rely on iterative learning algorithms (such as splines, boosted trees, or neural networks) that are fundamentally incompatible with the in-context learning paradigm of foundation models. In this paper, we introduce GAMformer, the first tabular foundation model for GAMs that bridges the gap between the power of foundation models and the interpretability requirements of critical real-world applications. GAMformer estimates GAM shape functions in a single forward pass using in-context learning, representing a significant departure from conventional iterative approaches. Building on previous research on tabular foundation models, we train GAMformer exclusively on synthetically generated tables to prevent data leakage. Our experiments demonstrate that GAMformer performs comparably to other leading GAMs across various classification benchmarks.

2409.01978 2026-02-06 quant-ph cs.LG stat.ML

Application of Langevin Dynamics to Advance the Quantum Natural Gradient Optimization Algorithm

Oleksandr Borysenko, Mykhailo Bratchenko, Ilya Lukin, Mykola Luhanko, Ihor Omelchenko, Andrii Sotnikov, Alessandro Lomi

Comments 11 pages, 3 figures

Journal ref Physica A 682 (2026) 131158

Details
English abstract

A Quantum Natural Gradient (QNG) algorithm for optimization of variational quantum circuits has been proposed recently. In this study, we employ the Langevin equation with a QNG stochastic force to demonstrate that its discrete-time solution gives a generalized form of the above-specified algorithm, which we call Momentum-QNG. Similar to other optimization algorithms with a momentum term, such as Stochastic Gradient Descent with momentum, RMSProp with momentum and Adam, Momentum-QNG is more effective at escaping local minima and plateaus in the variational parameter space and, therefore, demonstrates improved performance compared to the basic QNG. In this paper we benchmark Momentum-QNG together with the basic QNG, Adam and Momentum optimizers and explore its convergence behaviour. Among the benchmarking problems studied, the best result is obtained for the quantum Sherrington-Kirkpatrick model in the strong spin glass regime. Our open-source code is available at https://github.com/borbysh/Momentum-QNG

2407.02085 2026-02-06 stat.ME stat.CO

Regularized estimation of Monge-Kantorovich quantiles for spherical data

Bernard Bercu, Jérémie Bigot, Gauthier Thurin

Details
English abstract

Tools from optimal transport (OT) theory have recently been used to define a notion of quantile function for directional data. In practice, regularization is mandatory for applications that require out-of-sample estimates. To this end, we introduce a regularized estimator built from entropic optimal transport, by extending the definition of the entropic map to the spherical setting. We propose a stochastic algorithm to directly solve a continuous OT problem between the uniform distribution and a target distribution, by expanding Kantorovich potentials in the basis of spherical harmonics. In addition, we define the directional Monge-Kantorovich depth, a companion concept for OT-based quantiles. We show that it benefits from desirable properties related to Liu-Zuo-Serfling axioms for the statistical analysis of directional data. Building on our regularized estimators, we illustrate the benefits of our methodology for data analysis.

2405.14982 2026-02-06 cs.LG cs.AI cs.CL stat.ML

In-context Time Series Predictor

Jiecheng Lu, Yan Sun, Shihao Yang

Comments Camera-ready version. Accepted at ICLR 2025

Journal ref Proceedings of the Thirteenth International Conference on Learning Representations (ICLR 2025)

Details
English abstract

Recent Transformer-based large language models (LLMs) demonstrate in-context learning ability to perform various functions based solely on the provided context, without updating model parameters. To fully utilize the in-context capabilities in time series forecasting (TSF) problems, unlike previous Transformer-based or LLM-based time series forecasting methods, we reformulate "time series forecasting tasks" as input tokens by constructing a series of (lookback, future) pairs within the tokens. This method aligns more closely with the inherent in-context mechanisms, and is more parameter-efficient without the need of using pre-trained LLM parameters. Furthermore, it addresses issues such as overfitting in existing Transformer-based TSF models, consistently achieving better performance across full-data, few-shot, and zero-shot settings compared to previous architectures.

2403.01673 2026-02-06 stat.ML cs.AI cs.LG

CATS: Enhancing Multivariate Time Series Forecasting by Constructing Auxiliary Time Series as Exogenous Variables

Jiecheng Lu, Xu Han, Yan Sun, Shihao Yang

Comments Camera-ready version. Accepted at ICML 2024

Journal ref Proceedings of the Forty-first International Conference on Machine Learning (ICML 2024)

Details
English abstract

For Multivariate Time Series Forecasting (MTSF), recent deep learning applications show that univariate models frequently outperform multivariate ones. To address the deficiency in multivariate models, we introduce a method to Construct Auxiliary Time Series (CATS) that functions like a 2D temporal-contextual attention mechanism, which generates Auxiliary Time Series (ATS) from Original Time Series (OTS) to effectively represent and incorporate inter-series relationships for forecasting. Key principles of ATS - continuity, sparsity, and variability - are identified and implemented through different modules. Even with a basic 2-layer MLP as core predictor, CATS achieves state-of-the-art performance, significantly reducing complexity and parameters compared to previous multivariate models, making it an efficient and transferable MTSF solution.

2402.10506 2026-02-06 math.ST math.PR stat.TH

Optimistic Estimation of Convergence in Markov Chains with the Average-Mixing Time

Geoffrey Wolfer, Pierre Alquier

Details
English abstract

The convergence rate of a Markov chain to its stationary distribution is typically assessed using the concept of total variation mixing time. However, this worst-case measure often yields pessimistic estimates and is challenging to infer from observations. In this paper, we advocate for the use of the average-mixing time as a more optimistic and demonstrably easier-to-estimate alternative. We further illustrate its applicability across a range of settings, from two-point to countable spaces, and discuss some practical implications.

2310.09488 2026-02-06 stat.ML cs.LG

ARM: Refining Multivariate Forecasting with Adaptive Temporal-Contextual Learning

Jiecheng Lu, Xu Han, Shihao Yang

Comments Camera-ready version. Accepted at ICLR 2024

Journal ref Proceedings of the Twelfth International Conference on Learning Representations (ICLR 2024)

Details
English abstract

Long-term time series forecasting (LTSF) is important for various domains but is confronted by challenges in handling the complex temporal-contextual relationships. As multivariate input models underperform some recent univariate counterparts, we posit that the issue lies in the inefficiency of existing multivariate LTSF Transformers to model series-wise relationships: the characteristic differences between series are often captured incorrectly. To address this, we introduce ARM: a multivariate temporal-contextual adaptive learning method, which is an enhanced architecture specifically designed for multivariate LTSF modelling. ARM employs Adaptive Univariate Effect Learning (AUEL), Random Dropping (RD) training strategy, and Multi-kernel Local Smoothing (MKLS), to better handle individual series temporal patterns and correctly learn inter-series dependencies. ARM demonstrates superior performance on multiple benchmarks without significantly increasing computational costs compared to vanilla Transformer, thereby advancing the state-of-the-art in LTSF. ARM is also generally applicable to other LTSF architectures beyond vanilla Transformer.

2309.16858 2026-02-06 stat.ML cs.LG

Improved Generalization Bounds for Transductive Learning by Transductive Local Complexity and Its Applications

Yingzhen Yang

Comments The ICML 2025 conference version (https://openreview.net/pdf?id=NRVdvg7VMn) is a special case of this paper where the chain length is fixed at 2 (i.e., $Q=2$, see Def. 5.1), and its main results follow directly from the results here. This paper further provides a nearly optimal excess risk bound for realizable transductive learning and a stronger bound for transductive kernel learning

Details
English abstract

We introduce Transductive Local Complexity (TLC) to extend the classical Local Rademacher Complexity (LRC) to the transductive setting, incorporating substantial and novel components. Although LRC has been used to obtain sharp generalization bounds and minimax rates for inductive tasks such as classification and nonparametric regression, it has remained an open problem whether a localized Rademacher complexity framework can be effectively adapted to transductive learning to achieve sharp or nearly sharp bounds consistent with inductive results. We provide an affirmative answer via TLC. TLC is constructed by first deriving a new concentration inequality in Theorem 4.1 for the supremum of empirical processes capturing the gap between test and training losses, termed the test-train process, under uniform sampling without replacement, which leverages a novel combinatorial property of the test-train process and a new proof strategy applying the exponential Efron-Stein inequality twice. A subsequent peeling strategy applied to a new decomposition of the expectation of the test-train process and a new surrogate variance operator then yield excess risk bounds in the transductive setting that are nearly consistent with classical LRC-based inductive bounds up to a logarithmic gap. We further advance transductive learning through two applications: (1) for realizable transductive learning over binary-valued classes with finite VC dimension of $d_{\mathrm{VC}}$ and $u \ge m \ge d_{\mathrm{VC}}$, where $u$ and $m$ are the number of test features and training features, our Theorem 6.1 gives a nearly optimal bound $\Theta(d_{\mathrm{VC}} \log(me/d_{\mathrm{VC}})/m)$ matching the minimax rate $\Theta(d_{\mathrm{VC}}/m)$ up to $\log m$, resolving a decade-old open question; and (2) Theorem 6.2 presents a sharper excess risk bound for transductive kernel learning compared to the current state-of-the-art.

2307.01930 2026-02-06 cs.LG cs.AI cs.CV stat.AP stat.ML

Learning ECG Signal Features Without Backpropagation Using Linear Laws

Péter Pósfay, Marcell T. Kurbucz, Péter Kovács, Antal Jakovác

Comments 35 pages, 3 figures, 3 tables

Journal ref Machine Learning: Science and Technology 6, 035001 (2025)

Details
English abstract

This paper introduces LLT-ECG, a novel method for electrocardiogram (ECG) signal classification that leverages concepts from theoretical physics to automatically generate features from time series data. Unlike traditional deep learning approaches, LLT-ECG operates in a forward manner, eliminating the need for backpropagation and hyperparameter tuning. By identifying linear laws that capture shared patterns within specific classes, the proposed method constructs a compact and verifiable representation, enhancing the effectiveness of downstream classifiers. We demonstrate LLT-ECG's state-of-the-art performance on real-world ECG datasets from PhysioNet, underscoring its potential for medical applications where speed and verifiability are crucial.

2305.15793 2026-02-06 cs.LG cs.AI cs.CE stat.CO

Feature space reduction method for ultrahigh-dimensional, multiclass data: Random forest-based multiround screening (RFMS)

Gergely Hanczár, Marcell Stippinger, Dávid Hanák, Marcell T. Kurbucz, Olivér M. Törteli, Ágnes Chripkó, Zoltán Somogyvári

Comments 9 pages, 2 figures, 2 tables

Journal ref Machine Learning: Science and Technology 4, 045012 (2023)

Details
English abstract

In recent years, numerous screening methods have been published for ultrahigh-dimensional data that contain hundreds of thousands of features; however, most of these methods cannot handle data with thousands of classes. Prediction models built to authenticate users based on multichannel biometric data result in this type of problem. In this study, we present a novel method known as random forest-based multiround screening (RFMS) that can be effectively applied under such circumstances. The proposed algorithm divides the feature space into small subsets and executes a series of partial model builds. These partial models are used to implement tournament-based sorting and the selection of features based on their importance. To benchmark RFMS, a synthetic biometric feature space generator known as BiometricBlender is employed. Based on the results, the RFMS is on par with industry-standard feature screening methods while simultaneously possessing many advantages over these methods.

2304.14211 2026-02-06 cs.LG cs.AI cs.CV cs.MS stat.ML

LLT: An R package for Linear Law-based Feature Space Transformation

Marcell T. Kurbucz, Péter Pósfay, Antal Jakovác

Comments 15 pages, 5 figures, 1 table

Journal ref SoftwareX 25, 101623 (2024)

Details
English abstract

The goal of the linear law-based feature space transformation (LLT) algorithm is to assist with the classification of univariate and multivariate time series. The presented R package, called LLT, implements this algorithm in a flexible yet user-friendly way. This package first splits the instances into training and test sets. It then utilizes time-delay embedding and spectral decomposition techniques to identify the governing patterns (called linear laws) of each input sequence (initial feature) within the training set. Finally, it applies the linear laws of the training set to transform the initial features of the test set. These steps are performed by three separate functions called trainTest, trainLaw, and testTrans. Their application requires a predefined data structure; however, for fast calculation, they use only built-in functions. The LLT R package and a sample dataset with the appropriate data structure are publicly available on GitHub.
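One plausible reading of "time-delay embedding and spectral decomposition" is sketched below in Python rather than R. It is not the package's implementation: the helper names `delay_embed` and `linear_law`, the embedding dimension, and the choice of the eigenvector with the smallest eigenvalue of the Gram matrix are all assumptions made for illustration.

```python
import numpy as np

def delay_embed(x, dim):
    """Stack overlapping windows of length `dim` as rows (time-delay embedding)."""
    x = np.asarray(x, dtype=float)
    n = x.size - dim + 1
    return np.stack([x[i:i + dim] for i in range(n)])

def linear_law(x, dim):
    """Return a unit vector v such that (embedding @ v) is as close to zero as possible."""
    S = delay_embed(x, dim)
    # eigh returns eigenvalues in ascending order, so column 0 minimizes ||S v||.
    eigvals, eigvecs = np.linalg.eigh(S.T @ S)
    return eigvecs[:, 0]

# A sampled sinusoid satisfies x[t] - 2*cos(w)*x[t+1] + x[t+2] = 0 exactly.
w = 0.3
x = np.sin(w * np.arange(200))
v = linear_law(x, 3)
residual = np.linalg.norm(delay_embed(x, 3) @ v)
```

In this toy case the recovered vector is proportional to (1, -2cos(w), 1), and the residual is at rounding-error level. A law estimated on training series could then be applied to the embedding of a test series, with the size of the resulting product serving as a transformed feature.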

2301.05936 2026-02-06 math.PR math.OC math.ST stat.TH

Arcade Processes for Informed Martingale Interpolation

Georges Kassis, Andrea Macrina

Comments On 4 February 2026, accepted for publication in Stochastic Processes and Their Applications

Details
English abstract

Arcade processes are a class of continuous stochastic processes that interpolate in a strong sense, i.e., omega by omega, between zeros at fixed pre-specified times. Their additive randomisation allows one to match any finite sequence of target random variables, indexed by the given fixed dates, on the whole probability space. The randomised arcade processes (RAPs) can thus be interpreted as a generalisation of anticipative stochastic bridges. The filtrations generated by these processes are utilised to construct a class of martingales that interpolate between the given target random variables. These so-called filtered arcade martingales (FAMs) are almost-sure solutions to the martingale interpolation problem and reveal an underlying stochastic filtering structure. In the special case of conditionally Markov randomised arcade processes, the dynamics of FAMs are informed by Bayesian updating. The same ideas are applied to filtered arcade reverse-martingales, which are constructed in a similar fashion, using reverse-filtrations of RAPs, instead. Several explicit examples for RAPs and FAMs are provided and simulated. This paper concludes with an outlook on potential connections between FAMs and martingale optimal transport, and related applications.

2210.00200 2026-02-06 stat.ME math.ST stat.TH

Semiparametric Efficient Fusion of Individual Data and Summary Statistics

Wenjie Hu, Ruoyu Wang, Wei Li, Wang Miao

Comments 69 pages, 5 figures

Details
English abstract

Suppose we have individual data from an internal study and various summary statistics from relevant external studies. External summary statistics have the potential to improve statistical inference for the internal population; however, they may lead to efficiency loss or bias if not used properly. We study the fusion of individual data and summary statistics in a semiparametric framework to investigate the efficient use of external summary statistics. Under a weak transportability assumption, we establish the semiparametric efficiency bound for estimating a general functional of the internal data distribution, which is no larger than that using only internal data and underpins the potential efficiency gain of integrating individual data and summary statistics. We propose a data-fused efficient estimator that achieves this efficiency bound. In addition, an adaptive fusion estimator is proposed to eliminate the bias of the original data-fused estimator when the transportability assumption fails. We establish the asymptotic oracle property of the adaptive fusion estimator. Simulations and application to a Helicobacter pylori infection dataset demonstrate the promising numerical performance of the proposed method.

2602.05272 2026-02-06 math.ST math.PR stat.ML stat.TH

Asymptotically optimal sequential change detection for bounded means

Ashwin Ram, Aaditya Ramdas

Comments Preprint

Details
English abstract

We consider the problem of quickest changepoint detection under the Average Run Length (ARL) constraint where the pre-change and post-change laws lie in composite families $\mathscr{P}$ and $\mathscr{Q}$ respectively. In such a problem, a major challenge is characterizing the best possible detection delay when the "hardest" pre-change law in $\mathscr{P}$ depends on the unknown post-change law $Q\in\mathscr{Q}$. Moreover, typical simple-hypothesis likelihood-ratio arguments for Page-CUSUM and Shiryaev-Roberts do not apply here at all. To that end, we derive a universal sharp lower bound in full generality for any ARL-calibrated changepoint detector in the low type-I error regime ($\gamma\to\infty$) of the order $\log(\gamma)/\mathrm{KL}_{\mathrm{inf}}(Q,\mathscr{P})$. We show achievability of this universal lower bound by proving a tight matching upper bound (with the same sharp $\log\gamma$ constant) in the important bounded mean detection setting. In addition, for separated mean shifts, we also derive a uniform minimax guarantee of this achievability over the alternatives.

2602.05259 2026-02-06 math.ST stat.ML stat.TH

An Asymptotic Law of the Iterated Logarithm for $\mathrm{KL}_{\inf}$

Ashwin Ram, Aaditya Ramdas

Comments Preprint

Details
English abstract

The population $\mathrm{KL}_{\inf}$ is a fundamental quantity that appears in lower bounds for (asymptotically) optimal regret of pure-exploration stochastic bandit algorithms, and optimal stopping time of sequential tests. Motivated by this, an empirical $\mathrm{KL}_{\inf}$ statistic is frequently used in the design of (asymptotically) optimal bandit algorithms and sequential tests. While nonasymptotic concentration bounds for the empirical $\mathrm{KL}_{\inf}$ have been developed, their optimality in terms of constants and rates is questionable, and their generality is limited (usually to bounded observations). The fundamental limits of nonasymptotic concentration are often described by the asymptotic fluctuations of the statistics. With that motivation, this paper presents a tight (upper and lower) law of the iterated logarithm for empirical $\mathrm{KL}_{\inf}$ applying to extremely general (unbounded) data.

2602.05246 2026-02-06 stat.AP

Active Simulation-Based Inference for Scalable Car-Following Model Calibration

Menglin Kong, Chengyuan Zhang, Lijun Sun

Details
English abstract

Credible microscopic traffic simulation requires car-following models that capture both the average response and the substantial variability observed across drivers and situations. However, most data-driven calibrations remain deterministic, producing a single best-fit parameter vector and offering limited guidance for uncertainty-aware prediction, risk-sensitive evaluation, and population-level simulation. Bayesian calibration addresses this gap by inferring a posterior distribution over parameters, but per-trajectory sampling methods such as Markov chain Monte Carlo (MCMC) are computationally infeasible for modern large-scale naturalistic driving datasets. This paper proposes an active simulation-based inference framework for scalable car-following model calibration. The approach combines (i) a residual-augmented car-following simulator with two alternatives for the residual process and (ii) an amortized conditional density estimator that maps an observed leader--follower trajectory directly to a driver-specific posterior over model parameters with a single forward pass at test time. To reduce simulation cost during training, we introduce a joint active design strategy that selects informative parameter proposals together with representative driving contexts, focusing simulations where the current inference model is most uncertain while maintaining realism. Experiments on the HighD dataset show improved predictive accuracy and closer agreement between simulated and observed trajectory distributions relative to Bayesian calibration baselines, with convergence and ablation studies supporting the robustness of the proposed design choices. The framework enables scalable, uncertainty-aware driver population modeling for traffic flow simulation and risk-sensitive transportation analysis.

2602.05239 2026-02-06 stat.ME

Impact Range Assessment (IRA): An Interpretable Sensitivity Measure for Regression Modelling

Jihao You, Dan Tulpan, Jiaojiao Diao, Jennifer L. Ellis

Comments 17 pages, 3 figures. This manuscript was first submitted to MethodsX on February 4, 2026

Details
English abstract

While regression models capture the relationship between predictors and the response variable, they often lack intuitive accompanying methods to understand the influence of predictors on the outcome. To address this, we introduce an interpretability method called Impact Range Assessment (IRA), which quantifies the maximal influence of each predictor by measuring the total potential change in the response variable, across the predictor range. Validation using synthetic linear and nonlinear datasets demonstrates that relevant predictors produced higher IRA values than irrelevant ones. Moreover, repeated evaluations produced results closely aligned with those from the single-execution analysis, confirming the robustness of the method. A case study using a model that predicts pellet quality demonstrated that the IRA provides a simple and intuitive approach to interpret and rank predictor influence, thereby improving model transparency and reliability.

2602.05230 2026-02-06 cs.LG cs.AI stat.ML

ZeroS: Zero-Sum Linear Attention for Efficient Transformers

Jiecheng Lu, Xu Han, Yan Sun, Viresh Pati, Yubin Kim, Siddhartha Somani, Shihao Yang

Comments Camera-ready version. Accepted at NeurIPS 2025

Journal ref Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025)

Details
English abstract

Linear attention methods offer Transformers $O(N)$ complexity but typically underperform standard softmax attention. We identify two fundamental limitations affecting these approaches: the restriction to convex combinations that only permits additive information blending, and uniform accumulated weight bias that dilutes attention in long contexts. We propose Zero-Sum Linear Attention (ZeroS), which addresses these limitations by removing the constant zero-order term $1/t$ and reweighting the remaining zero-sum softmax residuals. This modification creates mathematically stable weights, enabling both positive and negative values and allowing a single attention layer to perform contrastive operations. While maintaining $O(N)$ complexity, ZeroS theoretically expands the set of representable functions compared to convex combinations. Empirically, it matches or exceeds standard softmax attention across various sequence modeling benchmarks.
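The "zero-sum" idea can be illustrated with a few lines of NumPy. The sketch below only shows the elementary fact that subtracting the constant $1/t$ from softmax weights over $t$ positions leaves residuals that sum to zero and take both signs. It is not the ZeroS mechanism itself: the reweighting step and the linear-time formulation are not specified in the abstract and are omitted, and the score values are made up.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

# Hypothetical attention scores for a query over t = 4 positions.
scores = np.array([2.0, 0.5, -1.0, 0.0])
t = scores.size

weights = softmax(scores)     # convex combination: non-negative, sums to 1
residual = weights - 1.0 / t  # remove the constant zero-order term 1/t

# The residuals sum to zero, so some entries must be negative unless all are zero.
total = residual.sum()
```

Because the residual weights are signed, a weighted sum of values using them can subtract one position's contribution from another's, which is the kind of contrastive operation a purely convex combination cannot express.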

2602.05226 2026-02-06 stat.AP econ.EM stat.ME

Predictive Synthesis under Sporadic Participation: Evidence from Inflation Density Surveys

Matthew C. Johnson, Matteo Luciani, Minzhengxiong Zhang, Kenichiro McAlinn

Details
English abstract

Central banks rely on density forecasts from professional surveys to assess inflation risks and communicate uncertainty. A central challenge in using these surveys is irregular participation: forecasters enter and exit, skip rounds, and reappear after long gaps. In the European Central Bank's Survey of Professional Forecasters, turnover and missingness vary substantially over time, causing the set of submitted predictions to change from quarter to quarter. Standard aggregation rules -- such as equal-weight pooling, renormalization after dropping missing forecasters, or ad hoc imputation -- can generate artificial jumps in combined predictions driven by panel composition rather than economic information, complicating real-time interpretation and obscuring forecaster performance. We develop coherent Bayesian updating rules for forecast combination under sporadic participation that maintain a well-defined latent predictive state for each forecaster even when their forecast is unobserved. Rather than relying on renormalization or imputation, the combined predictive distribution is updated through the implied conditional structure of the panel. This approach isolates genuine performance differences from mechanical participation effects and yields interpretable dynamics in forecaster influence. In the ECB survey, it improves predictive accuracy relative to equal-weight benchmarks and delivers smoother and better-calibrated inflation density forecasts, particularly during periods of high turnover.

2602.05174 2026-02-06 stat.ML cs.AI cs.LG math.ST stat.TH

Total Variation Rates for Riemannian Flow Matching

Yunrui Guan, Krishnakumar Balasubramanian, Shiqian Ma

详情
英文摘要

Riemannian flow matching (RFM) extends flow-based generative modeling to data supported on manifolds by learning a time-dependent tangent vector field whose flow-ODE transports a simple base distribution to the data law. We develop a nonasymptotic Total Variation (TV) convergence analysis for RFM samplers that use a learned vector field together with Euler discretization on manifolds. Our key technical ingredient is a differential inequality governing the evolution of TV between two manifold ODE flows, which expresses the time-derivative of TV through the divergence of the vector-field mismatch and the score of the reference flow; controlling these terms requires establishing new bounds that explicitly account for parallel transport and curvature. Under smoothness assumptions on the population flow-matching field and either uniform (compact manifolds) or mean-square (Hadamard manifolds) approximation guarantees for the learned field, we obtain explicit bounds of the form $\mathrm{TV}\le C_{\mathrm{Lip}}\,h + C_{\varepsilon}\,\varepsilon$ (with an additional higher-order $\varepsilon^2$ term on compact manifolds), cleanly separating numerical discretization and learning errors. Here, $h$ is the step-size and $\varepsilon$ is the target accuracy. Instantiations yield \emph{explicit} polynomial iteration complexities on the hypersphere $S^d$, and on the SPD$(n)$ manifolds under mild moment conditions.

2602.05106 2026-02-06 cs.CL cs.LG stat.ML

Data Kernel Perspective Space Performance Guarantees for Synthetic Data from Transformer Models

Michael Browder, Kevin Duh, J. David Harris, Vince Lyzinski, Paul McNamee, Youngser Park, Carey E. Priebe, Peter Viechnicki

Details
English abstract

Scarcity of labeled training data remains the long pole in the tent for building performant language technology and generative AI models. Transformer models -- particularly LLMs -- are increasingly being used to mitigate the data scarcity problem via synthetic data generation. However, because the models are black boxes, the properties of the synthetic data are difficult to predict. In practice it is common for language technology engineers to 'fiddle' with the LLM temperature setting and hope that what comes out the other end improves the downstream model. Faced with this uncertainty, here we propose Data Kernel Perspective Space (DKPS) to provide the foundation for mathematical analysis yielding concrete statistical guarantees for the quality of the outputs of transformer models. We first show the mathematical derivation of DKPS and how it provides performance guarantees. Next we show how DKPS performance guarantees can elucidate performance of a downstream task, such as neural machine translation models or LLMs trained using Contrastive Preference Optimization (CPO). Limitations of the current work and future research are also discussed.

2602.05082 2026-02-06 cs.LG cs.AI stat.ML

Reliable Explanations or Random Noise? A Reliability Metric for XAI

Poushali Sengupta, Sabita Maharjan, Frank Eliassen, Shashi Raj Pandey, Yan Zhang

Details
English abstract

In recent years, explaining decisions made by complex machine learning models has become essential in high-stakes domains such as energy systems, healthcare, finance, and autonomous systems. However, the reliability of these explanations, namely, whether they remain stable and consistent under realistic, non-adversarial changes, remains largely unmeasured. Widely used methods such as SHAP and Integrated Gradients (IG) are well-motivated by axiomatic notions of attribution, yet their explanations can vary substantially even under system-level conditions, including small input perturbations, correlated representations, and minor model updates. Such variability undermines explanation reliability, as reliable explanations should remain consistent across equivalent input representations and small, performance-preserving model changes. We introduce the Explanation Reliability Index (ERI), a family of metrics that quantifies explanation stability under four reliability axioms: robustness to small input perturbations, consistency under feature redundancy, smoothness across model evolution, and resilience to mild distributional shifts. For each axiom, we derive formal guarantees, including Lipschitz-type bounds and temporal stability results. We further propose ERI-T, a dedicated measure of temporal reliability for sequential models, and introduce ERI-Bench, a benchmark designed to systematically stress-test explanation reliability across synthetic and real-world datasets. Experimental results reveal widespread reliability failures in popular explanation methods, showing that explanations can be unstable under realistic deployment conditions. By exposing and quantifying these instabilities, ERI enables principled assessment of explanation reliability and supports more trustworthy explainable AI (XAI) systems.

2602.05065 2026-02-06 cs.LG math.OC stat.ML

Does SGD Seek Flatness or Sharpness? An Exactly Solvable Model

Yizhou Xu, Pierfrancesco Beneventano, Isaac Chuang, Liu Ziyin

Details
English abstract

A large body of theory and empirical work hypothesizes a connection between the flatness of a neural network's loss landscape during training and its performance. However, there have been conceptually opposite pieces of evidence regarding when SGD prefers flatter or sharper solutions during training. In this work, we partially but causally clarify the flatness-seeking behavior of SGD by identifying and exactly solving an analytically solvable model that exhibits both flattening and sharpening behavior during training. In this model, the SGD training has no \textit{a priori} preference for flatness, but only a preference for minimal gradient fluctuations. This leads to the insight that, at least within this model, it is the data distribution that uniquely determines the sharpness at convergence, and that a flat minimum is preferred if and only if the noise in the labels is isotropic across all output dimensions. When the noise in the labels is anisotropic, the model instead prefers sharpness and can converge to an arbitrarily sharp solution, depending on the imbalance in the spectrum of the label noise. We reproduce this key insight in controlled settings with different model architectures such as MLP, RNN, and transformers.

2602.05041 2026-02-06 stat.ME

A Weighting Framework for Clusters as Confounders in Observational Studies

Eli Ben-Michael, Avi Feller, Luke Keele

Details
English abstract

When units in observational studies are clustered in groups, such as students in schools or patients in hospitals, researchers often address confounding by adjusting for cluster-level covariates or cluster membership. In this paper, we develop a unified weighting framework that clarifies how different estimation methods control two distinct sources of imbalance: global balance (differences between treated and control units across clusters) and local balance (differences within clusters). We show that inverse propensity score weighting (IPW) with a random effects propensity score model -- the current standard in the literature -- targets only global balance and constant level shifts across clusters, but imposes no constraints on local balance. We then present two approaches that target both forms of balance. First, hierarchical balancing weights directly control global and local balance through a constrained optimization problem. Second, building on the recently proposed Generalized Mundlak approach, we develop a novel Mundlak balancing weights estimator that adjusts for cluster-level sufficient statistics rather than cluster indicators; this approach can accommodate small clusters where all units are treated or untreated. Critically, these approaches rest on different assumptions: hierarchical balancing weights require only that treatment is ignorable given covariates and cluster membership, while Mundlak methods additionally require an exponential family structure. We then compare these methods in a simulation study and in two applications in education and health services research that exhibit very different cluster structures.

2602.05032 2026-02-06 stat.CO

Fast Compute via MC Boosting

Sarah Polson, Vadim Sokolov

Details
English abstract

Modern training and inference pipelines in statistical learning and deep learning repeatedly invoke linear-system solves as inner loops, yet high-accuracy deterministic solvers can be prohibitively expensive when solves must be repeated many times or when only partial information (selected components or linear functionals) is required. We position \emph{Monte Carlo boosting} as a practical alternative in this regime, surveying random-walk estimators and sequential residual correction in a unified notation (Neumann-series representation, forward/adjoint estimators, and Halton-style sequential correction), with extensions to overdetermined/least-squares problems and connections to IRLS-style updates in data augmentation and EM/ECM algorithms. Empirically, we compare Jacobi and Gauss--Seidel iterations with plain Monte Carlo, exact sequential Monte Carlo, and a subsampled sequential variant, illustrating scaling regimes that motivate when Monte Carlo boosting can be an enabling compute primitive for modern statistical learning workflows.
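
As a rough illustration of the Neumann-series idea mentioned in this abstract, the sketch below solves a small system written in fixed-point form $x = Hx + b$ in two ways: a deterministic iteration whose iterates are partial sums of $\sum_k H^k b$, and a plain random-walk estimator for a single component $x_i$. It is not the authors' code. The function names, the uniform transition rule, the stopping probability `p_stop`, and the example matrix are all assumptions made for illustration, and the estimator is only meaningful when the series converges.

```python
import random


def fixed_point_solve(H, b, iters=200):
    """Deterministic baseline: iterate x <- H x + b.

    Starting from zero, the k-th iterate is the partial sum
    b + H b + ... + H^(k-1) b of the Neumann series.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        x = [sum(H[i][j] * x[j] for j in range(n)) + b[i] for i in range(n)]
    return x


def mc_component(H, b, i, n_walks=20000, p_stop=0.5, seed=0):
    """Random-walk estimate of the single component x_i of x = H x + b.

    Each walk starts at state i, stops with probability p_stop at every
    step, and otherwise jumps to a uniformly chosen state. The importance
    weight is multiplied by H[state][next] / p_move so that the expected
    score of a walk equals sum_k (H^k b)_i.
    """
    rng = random.Random(seed)
    n = len(b)
    p_move = (1.0 - p_stop) / n  # probability of jumping to one given state
    total = 0.0
    for _ in range(n_walks):
        state, weight, score = i, 1.0, b[i]
        while rng.random() >= p_stop:
            nxt = rng.randrange(n)
            weight *= H[state][nxt] / p_move
            state = nxt
            score += weight * b[state]
        total += score
    return total / n_walks


# Hypothetical 2x2 example with a contractive H.
H = [[0.1, 0.2], [0.3, 0.1]]
b = [1.0, 2.0]
x_det = fixed_point_solve(H, b)
x0_mc = mc_component(H, b, 0)
```

The random-walk version only touches the component that is asked for, which is the regime the abstract describes (selected components or linear functionals rather than the full solution).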

2602.05028 2026-02-06 stat.AP

Physics-Informed Diffusion Models for Vehicle Speed Trajectory Generation

Vadim Sokolov, Farnaz Behnia, Dominik Karbowski

Details
English abstract

Synthetic vehicle speed trajectory generation is essential for evaluating vehicle control algorithms and connected vehicle technologies. Traditional Markov chain approaches suffer from discretization artifacts and limited expressiveness. This paper proposes a physics-informed diffusion framework for conditional micro-trip synthesis, combining a dual-channel speed-acceleration representation with soft physics constraints that resolve optimization conflicts inherent to hard-constraint formulations. We compare a 1D U-Net architecture against a transformer-based Conditional Score-based Diffusion Imputation (CSDI) model using 6,367 GPS-derived micro-trips. CSDI achieves superior distribution matching (Wasserstein distance 0.30 for speed, 0.026 for acceleration), strong indistinguishability from real data (discriminative score 0.49), and validated utility for downstream energy assessment tasks. The methodology enables scalable generation of realistic driving profiles for intelligent transportation systems (ITS) applications without costly field data collection.

2602.05022 2026-02-06 stat.ME

Double Variable Importance Matching to Estimate Distinct Causal Effects on Event Probability and Timing

Yuqi Li, Quinn Lanners, Matthew M. Engelhard

Details
English abstract

In many clinical contexts, estimating effects of treatment in time-to-event data is complicated not only by confounding, censoring, and heterogeneity, but also by the presence of a cured subpopulation in which the event of interest never occurs. In such settings, treatment may have distinct effects on (1) the probability of being cured and (2) the event timing among non-cured individuals. Standard survival analysis and causal inference methods typically do not separate cured from non-cured individuals, obscuring distinct treatment mechanisms on cure probability and event timing. To address these challenges, we propose a matching-based framework that constructs distinct match groups to estimate heterogeneous treatment effects (HTE) on cure probability and event timing, respectively. We use mixture cure models to identify feature importance for both estimands, which in turn informs weighted distance metrics for matching in high-dimensional spaces. Within matched groups, Kaplan-Meier estimators provide estimates of cure probability and expected time to event, from which individual-level treatment effects are derived. We provide theoretical guarantees for estimator consistency and distance metric optimality under an equal-scale constraint. We further decompose estimation error into contributions from censoring, model fitting, and irreducible noise. Simulations and real-world data analyses demonstrate that our approach delivers interpretable and robust HTE estimates in time-to-event settings.
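
For readers unfamiliar with the Kaplan-Meier step named in this abstract, here is a minimal product-limit estimator. It is a generic textbook sketch, not the authors' matching pipeline. The function name and the toy data are assumptions, and reading the final plateau of the curve as a rough cure-fraction estimate is a common convention rather than something this abstract spells out.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    events[i] is 1 if the event was observed at times[i], 0 if censored.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv, curve, i = 1.0, [], 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Group all subjects sharing the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        # Both events and censored subjects leave the risk set after t.
        n_at_risk -= removed
    return curve


# Toy matched group (hypothetical data): two events, three censored.
curve = kaplan_meier([1, 2, 3, 4, 5], [1, 0, 1, 0, 0])
plateau = curve[-1][1]  # height of the curve after the last observed event
```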

2602.04895 2026-02-06 cs.CR cs.DS cs.LG stat.ML

Privacy Amplification Persists under Unlimited Synthetic Data Release

Clément Pierquin, Aurélien Bellet, Marc Tommasi, Matthieu Boussard

Details
English abstract

We study privacy amplification by synthetic data release, a phenomenon in which differential privacy guarantees are improved by releasing only synthetic data rather than the private generative model itself. Recent work by Pierquin et al. (2025) established the first formal amplification guarantees for a linear generator, but they apply only in asymptotic regimes where the model dimension far exceeds the number of released synthetic records, limiting their practical relevance. In this work, we show a surprising result: under a bounded-parameter assumption, privacy amplification persists even when releasing an unbounded number of synthetic records, thereby improving upon the bounds of Pierquin et al. (2025). Our analysis provides structural insights that may guide the development of tighter privacy guarantees for more complex release mechanisms.

2602.04891 2026-02-06 stat.ME physics.comp-ph stat.AP

Penalized Likelihood Parameter Estimation for Differential Equation Models: A Computational Tutorial

Matthew J Simpson, James S Bennett, Alexander Johnston, Ruth E Baker

Comments 28 pages, 6 figures

Details
English abstract

Parameter estimation connects mathematical models to real-world data and decision making across many scientific and industrial applications. Standard approaches such as maximum likelihood estimation and Markov chain Monte Carlo estimate parameters by repeatedly solving the model, which often requires numerical solutions of differential equation models. In contrast, generalized profiling (also called parameter cascading) focuses directly on the governing differential equation(s), linking the model and data through a penalized likelihood that explicitly measures both the data fit and model fit. Despite several advantages, generalized profiling is relatively rarely used in practice. This tutorial-style article outlines a set of self-directed computational exercises that facilitate skills development in applying generalized profiling to a range of ordinary differential equation models. All calculations can be repeated using reproducible open-source Jupyter notebooks that are available on GitHub.

2602.04886 2026-02-06 cs.LG cs.AI cs.CE stat.ML

Denoising diffusion networks for normative modeling in neuroimaging

Luke Whitbread, Lyle J. Palmer, Mark Jenkinson

Comments 55 pages, 20 figures

Details
English abstract

Normative modeling estimates reference distributions of biological measures conditional on covariates, enabling centiles and clinically interpretable deviation scores to be derived. Most neuroimaging pipelines fit one model per imaging-derived phenotype (IDP), which scales well but discards multivariate dependence that may encode coordinated patterns. We propose denoising diffusion probabilistic models (DDPMs) as a unified conditional density estimator for tabular IDPs, from which univariate centiles and deviation scores are derived by sampling. We utilise two denoiser backbones: (i) a feature-wise linear modulation (FiLM) conditioned multilayer perceptron (MLP) and (ii) a tabular transformer with feature self-attention and intersample attention (SAINT), conditioning covariates through learned embeddings. We evaluate on a synthetic benchmark with heteroscedastic and multimodal age effects and on UK Biobank FreeSurfer phenotypes, scaling from dimension of 2 to 200. Our evaluation suite includes centile calibration (absolute centile error, empirical coverage, and the probability integral transform), distributional fidelity (Kolmogorov-Smirnov tests), multivariate dependence diagnostics, and nearest-neighbour memorisation analysis. For low dimensions, diffusion models deliver well-calibrated per-IDP outputs comparable to traditional baselines while jointly modeling realistic dependence structure. At higher dimensions, the transformer backbone remains substantially better calibrated than the MLP and better preserves higher-order dependence, enabling scalable joint normative models that remain compatible with standard per-IDP pipelines. These results support diffusion-based normative modeling as a practical route to calibrated multivariate deviation profiles in neuroimaging.

2511.15120 2026-02-06 stat.ML cs.AI cs.IT cs.LG math.IT math.ST stat.TH

Neural Networks Learn Generic Multi-Index Models Near Information-Theoretic Limit

Bohan Zhang, Zihao Wang, Hengyu Fu, Jason D. Lee

Comments 85 pages, 2 figures. The order of the first two authors was determined by a coin flip. Accepted by ICLR 2026

Details
English abstract

In deep learning, a central issue is to understand how neural networks efficiently learn high-dimensional features. To this end, we explore the gradient descent learning of a general Gaussian Multi-index model $f(\boldsymbol{x})=g(\boldsymbol{U}\boldsymbol{x})$ with hidden subspace $\boldsymbol{U}\in \mathbb{R}^{r\times d}$, which is the canonical setup to study representation learning. We prove that under generic non-degenerate assumptions on the link function, a standard two-layer neural network trained via layer-wise gradient descent can agnostically learn the target with $o_d(1)$ test error using $\widetilde{\mathcal{O}}(d)$ samples and $\widetilde{\mathcal{O}}(d^2)$ time. The sample and time complexity both align with the information-theoretic limit up to leading order and are therefore optimal. During the first stage of gradient descent learning, the proof proceeds via showing that the inner weights can perform a power-iteration process. This process implicitly mimics a spectral start for the whole span of the hidden subspace and eventually eliminates finite-sample noise and recovers this span. It surprisingly indicates that optimal results can only be achieved if the first layer is trained for more than $\mathcal{O}(1)$ steps. This work demonstrates the ability of neural networks to effectively learn hierarchical functions with respect to both sample and time efficiency.

2511.08667 2026-02-06 cs.LG stat.ML

TabPFN-2.5: Advancing the State of the Art in Tabular Foundation Models

Léo Grinsztajn, Klemens Flöge, Oscar Key, Felix Birkel, Philipp Jund, Brendan Roof, Benjamin Jäger, Dominik Safaric, Simone Alessi, Adrian Hayler, Mihir Manium, Rosen Yu, Felix Jablonski, Shi Bin Hoo, Anurag Garg, Jake Robertson, Magnus Bühler, Vladyslav Moroshan, Lennart Purucker, Clara Cornu, Lilly Charlotte Wehrhahn, Alessandro Bonetto, Bernhard Schölkopf, Sauraj Gambhir, Noah Hollmann, Frank Hutter

Details
English abstract

The first tabular foundation model, TabPFN, and its successor TabPFNv2 have impacted tabular AI substantially, with dozens of methods building on it and hundreds of applications across different use cases. This report introduces TabPFN-2.5, the next generation of our tabular foundation model, built for datasets with up to 50,000 data points and 2,000 features, a 20x increase in data cells compared to TabPFNv2. TabPFN-2.5 is now the leading method for the industry standard benchmark TabArena (which contains datasets with up to 100,000 training data points), substantially outperforming tuned tree-based models and matching the accuracy of AutoGluon 1.4, a complex four-hour tuned ensemble that even includes the previous TabPFNv2. Remarkably, default TabPFN-2.5 has a 100% win rate against default XGBoost on small to medium-sized classification datasets (<=10,000 data points, 500 features) and an 87% win rate on larger datasets up to 100K samples and 2K features (85% for regression). For production use cases, we introduce a new distillation engine that converts TabPFN-2.5 into a compact MLP or tree ensemble, preserving most of its accuracy while delivering orders-of-magnitude lower latency and plug-and-play deployment. This new release will immediately strengthen the performance of the many applications and methods already built on the TabPFN ecosystem.

2510.24710 2026-02-06 math.OC cs.IT cs.LG math.IT stat.ML

A Single-Loop First-Order Algorithm for Linearly Constrained Bilevel Optimization

Wei Shen, Jiawei Zhang, Minhui Huang, Cong Shen

Comments NeurIPS 2025

Details
English abstract

We study bilevel optimization problems where the lower-level problems are strongly convex and have coupled linear constraints. To overcome the potential non-smoothness of the hyper-objective and the computational challenges associated with the Hessian matrix, we utilize penalty and augmented Lagrangian methods to reformulate the original problem as a single-level one. In particular, we establish a strong theoretical connection between the reformulated function and the original hyper-objective by characterizing the closeness of their values and derivatives. Based on this reformulation, we propose a single-loop, first-order algorithm for linearly constrained bilevel optimization (SFLCB). We provide rigorous analyses of its non-asymptotic convergence rates, showing an improvement over prior double-loop algorithms -- from $O(ε^{-3}\log(ε^{-1}))$ to $O(ε^{-3})$. The experiments corroborate our theoretical findings and demonstrate the practical efficiency of the proposed SFLCB algorithm. Simulation code is provided at https://github.com/ShenGroup/SFLCB.

2508.06483 2026-02-06 math.PR math.ST stat.ML stat.TH

A variational approach to dimension-free self-normalized concentration

Ben Chugg, Aaditya Ramdas

Comments 42 pages

Details
English abstract

We study the self-normalized concentration of vector-valued stochastic processes. We focus on bounds for "sub-$ψ$" processes, a well-known and quite general class of processes that encompasses a wide variety of well-known tail conditions (including sub-exponential, sub-Gaussian, sub-gamma, sub-Poisson, and several heavy-tailed settings without a moment generating function such as symmetric or bounded 2nd or 3rd moments). Our results recover and generalize the influential bound of de la Peña et al. [20] (proved again in Abbasi-Yadkori et al. [2]) in the sub-Gaussian case. Further, we fill a gap in the literature between determinant-based bounds and more recent bounds based on condition numbers. As applications we prove a Bernstein inequality for random vectors satisfying a moment condition (a more general condition than boundedness), and also provide the first dimension-free self-normalized empirical Bernstein inequality. Our techniques are based on the variational (PAC-Bayes) approach to concentration.

2508.00120 2026-02-06 stat.ME stat.ML

AdapDISCOM: An Adaptive Sparse Regression Method for High-Dimensional Multimodal Data With Block-Wise Missingness and Measurement Errors

Maimouna Baldé, Abdoul O. Diakité, Claudia Moreau, Gleb Bezgin, Nikhil Bhagwat, Pedro Rosa-Neto, Jean-Baptiste Poline, Simon Girard, Amadou Barry

Comments 49 pages, 4 figures

Details
English abstract

Multimodal high-dimensional data are increasingly prevalent in biomedical research, yet they are often compromised by block-wise missingness and measurement errors, posing significant challenges for statistical inference and prediction. We propose AdapDISCOM, a novel adaptive direct sparse regression method that simultaneously addresses these two pervasive issues. Building on the DISCOM framework, AdapDISCOM introduces modality-specific weighting schemes to account for heterogeneity in data structures and error magnitudes across modalities. We establish the theoretical properties of AdapDISCOM, including model selection consistency and convergence rates under sub-Gaussian and heavy-tailed settings, and develop robust and computationally efficient variants (AdapDISCOM-Huber and Fast-AdapDISCOM). Extensive simulations demonstrate that AdapDISCOM consistently outperforms existing methods such as DISCOM, SCOM, and CoCoLasso, particularly under heterogeneous contamination and heavy-tailed distributions. Finally, we apply AdapDISCOM to Alzheimer's Disease Neuroimaging Initiative (ADNI) data, demonstrating improved prediction of cognitive scores and reliable selection of established biomarkers, even with substantial missingness and measurement errors. AdapDISCOM provides a flexible, robust, and scalable framework for high-dimensional multimodal data analysis under realistic data imperfections.

2505.21799 2026-02-06 math.OC cs.LG stat.ML

PolarGrad: A Class of Matrix-Gradient Optimizers from a Unifying Preconditioning Perspective

Tim Tsz-Kit Lau, Qi Long, Weijie Su

Comments Minor typos corrected

Details
English abstract

The ever-growing scale of deep learning models and training data underscores the critical importance of efficient optimization methods. While preconditioned gradient methods such as Adam and AdamW are the de facto optimizers for training neural networks and large language models, structure-aware preconditioned optimizers like Shampoo and Muon, which utilize the matrix structure of gradients, have demonstrated promising evidence of faster convergence. In this paper, we introduce a unifying framework for analyzing "matrix-aware" preconditioned methods, which not only sheds light on the effectiveness of Muon and related optimizers but also leads to a class of new structure-aware preconditioned methods. A key contribution of this framework is its precise distinction between preconditioning strategies that treat neural network weights as vectors (addressing curvature anisotropy) versus those that consider their matrix structure (addressing gradient anisotropy). This perspective provides new insights into several empirical phenomena in language model pre-training, including Adam's training instabilities, Muon's accelerated convergence, and the necessity of learning rate warmup for Adam. Building upon this framework, we introduce PolarGrad, a new class of preconditioned optimization methods based on the polar decomposition of matrix-valued gradients. As a special instance, PolarGrad includes Muon with updates scaled by the nuclear norm of the gradients. We provide numerical implementations of these methods, leveraging efficient numerical polar decomposition algorithms for enhanced convergence. Our extensive evaluations across diverse matrix optimization problems and language model pre-training tasks demonstrate that PolarGrad outperforms both Adam and Muon.
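
To make the polar-decomposition idea in this abstract concrete, the sketch below approximates the orthogonal polar factor $U$ of a gradient matrix $G = UP$ with a Newton-Schulz iteration and scales it by the nuclear norm of $G$, computed as $\mathrm{tr}(U^\top G)$. This is an illustration under assumptions, not the PolarGrad implementation. The helper names are made up, Newton-Schulz is just one standard way to approximate the polar factor, $G$ is assumed nonzero, and the exact update rule used in the paper may differ from "nuclear norm times polar factor".

```python
def matmul(A, B):
    """Plain-Python matrix product of A (m x k) and B (k x n)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(row) for row in zip(*A)]


def polar_factor(G, iters=30):
    """Approximate the orthogonal polar factor U of G = U P.

    Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X, started from
    G / ||G||_F so that all singular values lie in (0, 1].
    """
    fro = sum(g * g for row in G for g in row) ** 0.5
    X = [[g / fro for g in row] for row in G]
    for _ in range(iters):
        XXtX = matmul(X, matmul(transpose(X), X))
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)]
             for rx, ry in zip(X, XXtX)]
    return X


def nuclear_norm_scaled_direction(G, iters=30):
    """Return (nuclear norm of G, nuclear norm * polar factor)."""
    U = polar_factor(G, iters)
    # For G = U P with P symmetric PSD, trace(U^T G) = trace(P),
    # which is the sum of the singular values of G.
    UtG = matmul(transpose(U), G)
    nuc = sum(UtG[i][i] for i in range(len(UtG)))
    return nuc, [[nuc * u for u in row] for row in U]


# Hypothetical 2x2 "gradient" with singular values 2 and 1.
G = [[0.0, 2.0], [-1.0, 0.0]]
nuc, direction = nuclear_norm_scaled_direction(G)
```

Dropping the `nuc` factor leaves only the orthogonalized direction. By the abstract's own description, the nuclear-norm scaling is what distinguishes the Muon-like special case of PolarGrad.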

2505.17004 2026-02-06 cs.LG cs.AI cs.NA math.NA stat.ML

Guided Diffusion Sampling on Function Spaces with Applications to PDEs

Jiachen Yao, Abbas Mammadov, Julius Berner, Gavin Kerrigan, Jong Chul Ye, Kamyar Azizzadenesheli, Anima Anandkumar

Comments Accepted to NeurIPS 2025

Details
English abstract

We propose a general framework for conditional sampling in PDE-based inverse problems, targeting the recovery of whole solutions from extremely sparse or noisy measurements. This is accomplished by a function-space diffusion model and plug-and-play guidance for conditioning. Our method first trains an unconditional, discretization-agnostic denoising model using neural operator architectures. At inference, we refine the samples to satisfy sparse observation data via a gradient-based guidance mechanism. Through rigorous mathematical analysis, we extend Tweedie's formula to infinite-dimensional Banach spaces, providing the theoretical foundation for our posterior sampling approach. Our method (FunDPS) accurately captures posterior distributions in function spaces under minimal supervision and severe data scarcity. Across five PDE tasks with only 3% observation, our method achieves an average 32% accuracy improvement over state-of-the-art fixed-resolution diffusion baselines while reducing sampling steps by 4x. Furthermore, multi-resolution fine-tuning ensures strong cross-resolution generalizability. To the best of our knowledge, this is the first diffusion-based framework to operate independently of discretization, offering a practical and flexible solution for forward and inverse problems in the context of PDEs. Code is available at https://github.com/neuraloperator/FunDPS

2505.05961 2026-02-06 math.DG stat.CO

GEORCE: A Fast New Control Algorithm for Computing Geodesics

Frederik Möbius Rygaard, Søren Hauberg

Details
English abstract

Computing geodesics for Riemannian manifolds is a difficult task that often relies on numerical approximations. However, these approximations tend to be either numerically unstable, have slow convergence, or scale poorly with manifold dimension and number of grid points. We introduce a new algorithm called GEORCE that computes geodesics in a local chart via a transformation into a discrete control problem. We show that GEORCE has global convergence and quadratic local convergence. In addition, we show that it extends to Finsler manifolds. For both Finslerian and Riemannian manifolds, we thoroughly benchmark GEORCE against several alternative optimization algorithms and show empirically that it has a much faster and more accurate performance for a variety of manifolds, including key manifolds from information theory and manifolds that are learned using generative models.

2502.07244 2026-02-06 cs.LG cs.AI stat.ML

Linear Transformers as VAR Models: Aligning Autoregressive Attention Mechanisms with Autoregressive Forecasting

Jiecheng Lu, Shihao Yang

Comments Camera-ready version. Accepted at ICML 2025

Journal ref Proceedings of the Forty-second International Conference on Machine Learning (ICML 2025)

Details
English abstract

Autoregressive attention-based time series forecasting (TSF) has drawn increasing interest, with mechanisms like linear attention sometimes outperforming vanilla attention. However, deeper Transformer architectures frequently misalign with autoregressive objectives, obscuring the underlying VAR structure embedded within linear attention and hindering their ability to capture the data generative processes in TSF. In this work, we first show that a single linear attention layer can be interpreted as a dynamic vector autoregressive (VAR) structure. We then explain that existing multi-layer Transformers have structural mismatches with the autoregressive forecasting objective, which impair interpretability and generalization ability. To address this, we show that by rearranging the MLP, attention, and input-output flow, multi-layer linear attention can also be aligned as a VAR model. Then, we propose Structural Aligned Mixture of VAR (SAMoVAR), a linear Transformer variant that integrates interpretable dynamic VAR weights for multivariate TSF. By aligning the Transformer architecture with autoregressive objectives, SAMoVAR delivers improved performance, interpretability, and computational efficiency, compared to SOTA TSF models.
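
The claim that a single linear attention layer behaves like a dynamic VAR can be seen in a few lines. In unnormalized causal linear attention, the output at step $t$ is $o_t = \sum_{s \le t} (q_t \cdot k_s)\, v_s$, a linear combination of past values whose lag weights depend on the data. The sketch below computes this once as an explicit weighted sum and once through the usual running state $S_t = S_{t-1} + v_t k_t^\top$. It is an illustrative toy and not the SAMoVAR architecture. The function names, the lack of a feature map or normalization, and the example inputs are assumptions.

```python
def causal_linear_attention(Q, K, V):
    """o_t = sum_{s<=t} (q_t . k_s) v_s as an explicit sum over past steps."""
    out = []
    for t, q in enumerate(Q):
        o = [0.0] * len(V[0])
        for s in range(t + 1):
            # Data-dependent scalar weight placed on the past value v_s.
            w = sum(qi * ki for qi, ki in zip(q, K[s]))
            o = [oi + w * vi for oi, vi in zip(o, V[s])]
        out.append(o)
    return out


def recurrent_linear_attention(Q, K, V):
    """Same outputs via the state S_t = S_{t-1} + v_t k_t^T and o_t = S_t q_t."""
    d_v, d_k = len(V[0]), len(K[0])
    S = [[0.0] * d_k for _ in range(d_v)]
    out = []
    for q, k, v in zip(Q, K, V):
        for i in range(d_v):
            for j in range(d_k):
                S[i][j] += v[i] * k[j]
        out.append([sum(S[i][j] * q[j] for j in range(d_k)) for i in range(d_v)])
    return out


# Hypothetical three-step sequence with 2-dimensional queries, keys, values.
Q = [[1, 0], [0, 1], [1, 1]]
K = [[1, 2], [3, 4], [5, 6]]
V = [[1, 0], [0, 1], [1, 1]]
explicit = causal_linear_attention(Q, K, V)
recurrent = recurrent_linear_attention(Q, K, V)
```

The explicit form makes the autoregressive reading visible (one weight per lag), while the recurrent form is what keeps the cost linear in sequence length.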

2501.16156 2026-02-06 stat.ME

Moving toward best practice when using propensity score weighting in survey observational studies

Yukang Zeng, Fan Li, Guangyu Tong

Details
English abstract

Propensity score weighting is a common method for estimating treatment effects with survey data. The method is applied to minimize confounding using measured covariates that are often different between individuals in treatment and control. However, existing literature does not reach a consensus on the optimal use of survey weights for population-level inference in the propensity score weighting analysis. Under the balancing weights framework, we provided a unified solution for incorporating survey weights in both the propensity score estimation and the outcome regression model. We derived estimators for different target populations, including the combined, treated, control, and overlap populations. We provide a unified expression of the sandwich variance estimator and demonstrate that the survey-weighted estimator is asymptotically normal, as established through the theory of M-estimators. Through an extensive series of simulation studies, we examined the performance of our derived estimators and compared the results to those of alternative methods. We further carried out two case studies to illustrate the application of the different methods of propensity score analysis with complex survey data. We concluded with a discussion of our findings and provided practical guidelines for propensity score weighting analysis of observational data from complex surveys.

2411.09686 2026-02-06 stat.ML cs.LG

Conditional regression for the Nonlinear Single-Variable Model

Yantao Wu, Mauro Maggioni

Comments 63 pages, 11 figures

Details
English abstract

Regressing a function $F$ on $\mathbb{R}^d$ without the statistical and computational curse of dimensionality requires special statistical models, for example that impose geometric assumptions on the distribution of the data (e.g., that its support is low-dimensional), or strong smoothness assumptions on $F$, or a special structure of $F$. Among the latter, compositional models $F=f\circ g$ with $g$ mapping to $\mathbb{R}^r$ with $r\ll d$ include classical single- and multi-index models, as well as neural networks. While the case where $g$ is linear is well-understood, less is known when $g$ is nonlinear, and in particular for which $g$'s the curse of dimensionality in estimating $F$, or both $f$ and $g$, may be circumvented. Here we consider a model $F(X):=f(Π_γX)$ where $Π_γ:\mathbb{R}^d\to[0,\textrm{len}_γ]$ is the closest-point projection onto the parameter of a regular curve $γ:[0, \textrm{len}_γ]\to\mathbb{R}^d$, and $f:[0,\textrm{len}_γ]\to \mathbb{R}^1$. The input data $X$ is not low-dimensional: it can be as far from $γ$ as the condition that $Π_γ(X)$ is well-defined allows. The distribution of $X$, the curve $γ$ and the function $f$ are all unknown. This model is a natural nonlinear generalization of the single-index model, corresponding to $γ$ being a line. We propose a nonparametric estimator, based on conditional regression, that under suitable assumptions, the strongest of which is that $f$ is coarsely monotone, achieves, up to log factors, the $\textit{one-dimensional}$ optimal min-max rate for non-parametric regression, up to the level of noise in the observations, and can be constructed in time $\mathcal{O}(d^2 n\log n)$. All the constants in the learning bounds, in the minimal number of samples required for our bounds to hold, and in the computational complexity are at most low-order polynomials in $d$.

2410.03159 2026-02-06 cs.LG cs.AI stat.ML

WAVE: Weighted Autoregressive Varying Gate for Time Series Forecasting

Jiecheng Lu, Xu Han, Yan Sun, Shihao Yang

Comments Camera-ready version. Accepted at ICML 2025

Journal ref Proceedings of the Forty-second International Conference on Machine Learning (ICML 2025)

Details
English abstract

We propose a Weighted Autoregressive Varying gatE (WAVE) attention mechanism equipped with both Autoregressive (AR) and Moving-average (MA) components. It can adapt to various attention mechanisms, enhancing and decoupling their ability to capture long-range and local temporal patterns in time series data. In this paper, we first demonstrate that, for the time series forecasting (TSF) task, the previously overlooked decoder-only autoregressive Transformer model can achieve results comparable to the best baselines when appropriate tokenization and training methods are applied. Moreover, inspired by the ARMA model from statistics and recent advances in linear attention, we introduce the full ARMA structure into existing autoregressive attention mechanisms. By using an indirect MA weight generation method, we incorporate the MA term while maintaining the time complexity and parameter size of the underlying efficient attention models. We further explore how indirect parameter generation can produce implicit MA weights that align with the modeling requirements for local temporal impacts. Experimental results show that WAVE attention that incorporates the ARMA structure consistently improves the performance of various AR attentions on TSF tasks, achieving state-of-the-art results.

2406.05014 2026-02-06 stat.ML cs.LG

Root Cause Analysis of Outliers with Missing Structural Knowledge

William Roy Orchard, Nastaran Okati, Sergio Hernan Garrido Mejia, Patrick Blöbaum, Dominik Janzing

Comments Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)

Details
English abstract

The goal of Root Cause Analysis (RCA) is to explain why an anomaly occurred by identifying where the fault originated. Several recent works model the anomalous event as resulting from a change in the causal mechanism at the root cause, i.e., as a soft intervention. RCA is then the task of identifying which causal mechanism changed. In real-world applications, one often has either few or only a single sample from the post-intervention distribution: a severe limitation for most methods, which assume one knows or can estimate the distribution. However, even those that do not are statistically ill-posed due to the need to probe regression models in regions of low probability density. In this paper, we propose simple, efficient methods to overcome both difficulties in the case where there is a single root cause and the causal graph is a polytree. When one knows the causal graph, we give guarantees for a traversal algorithm that requires only marginal anomaly scores and does not depend on specifying an arbitrary anomaly score cut-off. When one does not know the causal graph, we show that the heuristic of identifying root causes as the variables with the highest marginal anomaly scores is causally justified. To this end, we prove that anomalies with small scores are unlikely to cause those with larger scores in polytrees and give upper bounds for the likelihood of causal pathways with non-monotonic anomaly scores.
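
The heuristic mentioned in this abstract, ranking variables by their marginal anomaly scores, can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the score used here is an empirical tail probability of the absolute deviation from the median, with add-one smoothing, and the variable names and data are hypothetical.

```python
import math
import statistics

def marginal_anomaly_score(history, observed):
    """Return -log of the smoothed empirical probability that a normal
    observation deviates from the median at least as much as `observed`."""
    center = statistics.median(history)
    deviation = abs(observed - center)
    at_least_as_extreme = sum(1 for v in history if abs(v - center) >= deviation)
    return -math.log((at_least_as_extreme + 1) / (len(history) + 1))

def rank_root_cause_candidates(normal_data, anomalous_sample):
    """Sort variables from highest to lowest marginal anomaly score."""
    scores = {
        name: marginal_anomaly_score(normal_data[name], anomalous_sample[name])
        for name in normal_data
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical data from normal operation (one list per variable).
normal_data = {
    "db_latency": [10, 11, 9, 10, 12, 10, 11, 9, 10, 11],
    "api_latency": [50, 52, 48, 51, 49, 53, 50, 47, 52, 51],
    "page_load": [100, 104, 98, 101, 99, 103, 100, 97, 102, 101],
}
# A single anomalous observation, matching the single-sample setting.
anomalous_sample = {"db_latency": 30, "api_latency": 53, "page_load": 104}

ranking = rank_root_cause_candidates(normal_data, anomalous_sample)
top_candidate = ranking[0][0]
```

Only marginal distributions are used, so no causal graph or regression model is needed; the abstract's guarantee for this heuristic is stated for a single root cause in a polytree.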

2405.07109 2026-02-06 stat.ME

Bridging Binarization: Causal Inference with Dichotomized Continuous Exposures

Kaitlyn J. Lee, Alan Hubbard, Alejandro Schuler

Details
English abstract

The average treatment effect (ATE) is a common parameter estimated in the causal inference literature, but it is only defined for binary exposures. Thus, despite concerns raised by some researchers, many studies seeking to estimate the causal effect of a continuous exposure create a new binary exposure variable by dichotomizing the continuous values into two categories. In this paper, we affirm binarization as a statistically valid method for answering causal questions about continuous exposures by showing the equivalence between the binarized ATE and the difference in the average outcomes of two specific modified treatment policies. These policies impose cut-offs corresponding to the binarized exposure variable and assume preservation of relative self-selection. Relative self-selection is the ratio of the probability density of an individual having an exposure equal to one value of the continuous exposure variable versus another. The policies assume that, for any two values of the exposure variable with non-zero probability density after the cut-off, this ratio will remain unchanged. Through this equivalence, we clarify the assumptions underlying binarization and discuss how to properly interpret the resulting estimator. Additionally, we introduce a new target parameter, computable after binarization, that takes the observed world as a benchmark. We argue that this parameter addresses more relevant causal questions than the traditional binarized ATE parameter. We present a simulation study to illustrate the implications of these assumptions when analyzing data and to demonstrate how to correctly implement estimators of the parameters discussed. Finally, to further illustrate the underlying assumptions, we apply this method to evaluate the effect on birth outcomes of a California law that seeks to limit exposure to oil and gas wells.

2405.07102 2026-02-06 stat.ME stat.AP stat.OT

Nested Instrumental Variables Analysis: Switcher Average Treatment Effect, Identification, Efficient Estimation and Generalizability

Rui Wang, Ying-Qi Zhao, Oliver Dukes, Bo Zhang

Details
English abstract

Instrumental variables (IVs) are widely used to estimate causal effects from non-randomized data. A canonical example is a randomized trial with noncompliance, in which the randomized treatment assignment serves as an IV for the non-ignorable treatment received. Under a monotonicity assumption, a valid IV nonparametrically identifies the average treatment effect among a latent complier subgroup, whose generalizability is often under debate. In many studies, there exist multiple versions of an IV, for instance, different nudges to take the same treatment in different study sites in a multicenter clinical trial. These different versions of an IV may result in different compliance rates and offer a unique opportunity to study IV estimates' generalizability. In this article, we introduce a novel nested IV assumption and study identification of the average treatment effect among two latent subgroups: always-compliers and switchers, who are defined based on the joint potential treatment received under two versions of a binary IV. We derive the efficient influence function for the SWitcher Average Treatment Effect (SWATE) under a nonparametric model and propose efficient estimators. We then propose formal statistical tests of the generalizability of IV estimates under the nested IV framework. The proposed tests are flexible nonparametric generalizations of classical overidentification tests that allow estimating nuisance parameters using machine learning tools. We apply the proposed method to the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial and study the causal effect of colorectal cancer screening and its generalizability.
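
As background for the complier effect mentioned in this abstract, the sketch below computes the classical Wald ratio for a single binary IV: the effect of assignment on the outcome divided by the effect of assignment on treatment received. It is not the nested IV or SWATE estimator proposed in the paper, and the function names and toy data are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def wald_estimate(z, d, y):
    """Classical Wald ratio for a binary instrument z, binary treatment d, outcome y."""
    y1 = mean([yi for zi, yi in zip(z, y) if zi == 1])
    y0 = mean([yi for zi, yi in zip(z, y) if zi == 0])
    d1 = mean([di for zi, di in zip(z, d) if zi == 1])
    d0 = mean([di for zi, di in zip(z, d) if zi == 0])
    compliance = d1 - d0  # estimated share of compliers under monotonicity
    if compliance == 0:
        raise ValueError("instrument has no effect on treatment received")
    return (y1 - y0) / compliance

# Toy data: assignment z, treatment received d, outcome y.
z = [1, 1, 1, 1, 0, 0, 0, 0]
d = [1, 1, 1, 0, 0, 0, 0, 1]
y = [3, 3, 3, 1, 1, 1, 1, 3]

estimate = wald_estimate(z, d, y)
```

Here the assignment raises the treatment rate from 0.25 to 0.75 and the mean outcome from 1.5 to 2.5, so the ratio is 1.0 / 0.5. With two versions of an IV, as in the abstract, each version would yield its own ratio over a possibly different complier subgroup, which is what makes generalizability testable.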

2404.15617 2026-02-06 cs.LG cs.AI math.OC math.ST stat.TH

A Differential and Pointwise Control Approach to Reinforcement Learning

Minh Nguyen, Chandrajit Bajaj

Comments NeurIPS 2025

Details
English abstract

Reinforcement learning (RL) in continuous state-action spaces remains challenging in scientific computing due to poor sample efficiency and lack of pathwise physical consistency. We introduce Differential Reinforcement Learning (Differential RL), a novel framework that reformulates RL from a continuous-time control perspective via a differential dual formulation. This induces a Hamiltonian structure that embeds physics priors and ensures consistent trajectories without requiring explicit constraints. To implement Differential RL, we develop Differential Policy Optimization (dfPO), a pointwise, stage-wise algorithm that refines local movement operators along the trajectory for improved sample efficiency and dynamic alignment. We establish pointwise convergence guarantees, a property not available in standard RL, and derive a competitive theoretical regret bound of $\mathcal{O}(K^{5/6})$. Empirically, dfPO outperforms standard RL baselines on representative scientific computing tasks, including surface modeling, grid control, and molecular dynamics, under low-data and physics-constrained conditions.

2311.01147 2026-02-06 stat.ME stat.CO

Variational Inference for Sparse Poisson Regression

Mitra Kharabati, Morteza Amini, Mohammad Arashi

Comments A part of the PhD thesis of Miss Mitra Kharabati

Details
English abstract

We apply the non-conjugate Variational Bayesian (VB) method to the sparse Poisson regression model. To provide approximate conjugacy in the model, the likelihood is approximated by a quadratic function, yielding conjugacy between the approximation component and the Gaussian prior on the regression coefficients. Three sparsity-enforcing priors (Laplace, Continuous Spike and Slab, and Bernoulli) are used for this problem. The proposed models are compared with each other, the associated MCMC models, and two frequentist sparse Poisson methods (LASSO and SCAD) to evaluate their estimation, prediction, and sparsity performance. In a simulation study, the proposed VB methods closely approximate the posterior parameter distribution while achieving significantly faster computation than benchmark MCMC methods. Using several benchmark count response data sets, the prediction performance of the proposed methods is evaluated in real-world applications.