arXivDaily: arXiv daily academic digest, updated Monday through Friday
math.ST Statistics Theory (30)
2606.12333 2026-06-11 math.ST math.PR New submission

Integrated expectile-based measures of inequality

Ignacio Cascos, Marco Tarsia

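AI summary: Building on the consistency of expectiles with the convex stochastic order, the paper proposes a family of integrated expectile functionals for measuring risk, dispersion, and inequality, derives their analytical representations and geometric interpretation, and constructs new expectile-based inequality indices that are monotone and consistent and extend naturally to the multivariate case.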

English abstract

Expectiles provide a class of asymmetric location functionals that incorporate the magnitude of deviations and admit a natural geometric interpretation. Building on their structural consistency with the convex stochastic order, this paper introduces a family of integrated expectile functionals for measuring risk, dispersion, and inequality. The proposed functionals admit analytical representations as integrals of expectiles across asymmetry levels. For a distinguished subclass of these constructions, a geometric representation is available: the resulting quantities can be expressed as weighted areas of star-shaped sets encoding the distributional asymmetry of a random variable. This approach yields a new class of expectile-based inequality indices, constituting a natural counterpart to classical Gini-type measures while preserving desirable monotonicity and consistency properties. Empirical counterparts are derived in closed form and admit explicit decompositions over finite samples. The framework extends naturally to multivariate settings through directional expectile constructions, leading to measures capable of capturing genuinely joint forms of multivariate dispersion and inequality.
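
To make the notion of an expectile concrete, the following minimal Python sketch computes a sample $\tau$-expectile from its first-order condition, $\tau\,E[(X-e)_+] = (1-\tau)\,E[(e-X)_+]$, by iteratively reweighted averaging. The function name `expectile`, the tolerance, and the iteration cap are illustrative assumptions; the integrated functionals and inequality indices of the paper are not reproduced here.

```python
def expectile(xs, tau, tol=1e-12, max_iter=1000):
    """Sample tau-expectile of xs via iteratively reweighted averaging.

    Hypothetical helper for illustration only, not the paper's estimator.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    e = sum(xs) / len(xs)  # the 0.5-expectile is the mean; use it as a start
    for _ in range(max_iter):
        # Weight tau on observations above e and 1 - tau on the rest.
        w = [tau if x > e else 1.0 - tau for x in xs]
        new_e = sum(wi * x for wi, x in zip(w, xs)) / sum(w)
        if abs(new_e - e) <= tol:
            return new_e
        e = new_e
    return e
```

At $\tau = 1/2$ the weights are equal and the function returns the sample mean; larger $\tau$ moves the expectile toward the upper tail.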

2606.12185 2026-06-11 econ.EM math.ST New submission

Pivotal and identification-robust nonparametric inference in linear IV models

Bertille Antoine, Pascal Lavergne

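AI summary: For linear instrumental-variable models, the paper proposes new inference methods that are robust to identification strength and heteroskedasticity and nonparametric in the first stage, including an asymptotically pivotal statistic, subvector inference, and a specification test.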

English abstract

We develop new inference procedures for a linear IV model that are robust to identification strength and heteroskedasticity of unknown form, and nonparametric with respect to the first-stage equation. Our first test is tailored for inference on parameters of endogenous explanatory variables. Our new statistic modifies that of Antoine and Lavergne (2003) to directly account for heteroskedasticity of unknown form. As a result, it is asymptotically pivotal, so that inference is greatly facilitated in practice. We also develop (i) an identification-robust subvector inference procedure that does not rely on the knowledge of identification strength for the remaining parameters, and (ii) a pure specification test. In both cases, the tests are conservative but powerful. We show that our procedures are computationally friendly and competitive with existing ones in simulations and an application.

2606.12131 2026-06-11 math.ST math.OC New submission

A Discrete Cumulative Distribution Transform via Optimal Transport

Harbir Antil, Gustavo Rohde, Aryan Saxena

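AI summary: For atomic probability measures on the real line, the paper proposes a discrete cumulative distribution transform based on monotone quantile maps, establishes a cumulative-mass compatibility criterion for exact finite-resolution recovery, and proves weak convergence under reference refinement.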

English abstract

This paper develops a fully discrete cumulative distribution transform (CDT) for atomic probability measures on the real line. The transform is defined through monotone quantile maps and admits explicit linear-time algorithms for both forward transformation and inverse reconstruction based solely on cumulative mass matching. Unlike the classical continuous setting, deterministic transport between atomic measures cannot generally split masses, so exact reconstruction may fail at finite resolution. We establish a precise cumulative-mass compatibility criterion for exact finite-resolution recovery and prove weak convergence of reconstructed measures under reference refinement. Several structural properties of the discrete CDT are derived, including translation, composition, and scaling laws, and the framework is extended to a discrete signed cumulative distribution transform with thresholded stabilization near zero crossings. By avoiding continuous interpolation, the proposed framework provides a simple fixed-reference transport representation for discrete data. Numerical examples illustrate translation linearization, compatibility-controlled reconstruction, refinement consistency, and stabilization of the signed transform.
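
The idea of a monotone quantile map built by cumulative mass matching can be sketched in a few lines of Python. This is one plausible reading of the abstract, not the authors' algorithm: the names `quantile_map` and `push_forward`, the use of a generalized inverse at each reference cumulative mass, and the tolerance are all assumptions made for illustration. Atoms are assumed sorted in increasing order, and both weight vectors are assumed positive and summing to one.

```python
from collections import defaultdict

def quantile_map(ref_weights, atoms, masses, tol=1e-12):
    """Send each reference atom to the target atom at which the target
    cumulative mass first reaches the reference cumulative mass.

    Two-pointer scan, linear in the total number of atoms (hypothetical sketch).
    """
    mapped = []
    i = 0
    cum_target = masses[0]
    cum_ref = 0.0
    for w in ref_weights:
        cum_ref += w
        # Advance the target pointer until its cumulative mass covers cum_ref.
        while cum_target < cum_ref - tol and i < len(atoms) - 1:
            i += 1
            cum_target += masses[i]
        mapped.append(atoms[i])
    return mapped

def push_forward(ref_weights, mapped):
    """Reconstruct a measure by moving each reference weight to its image."""
    out = defaultdict(float)
    for w, x in zip(ref_weights, mapped):
        out[x] += w
    return dict(out)
```

With four equal reference weights, a target with masses 0.5 and 0.5 is recovered exactly, whereas masses 0.3 and 0.7 come back as 0.25 and 0.75 because a reference atom cannot be split. This matches the abstract's remark that exact reconstruction may fail at finite resolution.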

2606.11933 2026-06-11 math.ST stat.ME New submission

Testing axial symmetry in multivariate location-scale linear regression

Šárka Hudecová, Miroslav Šiman

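AI summary: Proposes a test based on integrated rank scores for conditional axial symmetry in multivariate linear heteroscedastic regression, derives its asymptotic distribution, and illustrates it with simulations and real data.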

English abstract

The article deals with the problem of testing conditional axial symmetry within a multivariate linear heteroscedastic regression framework. A new test based on integrated rank scores is introduced and its asymptotic distribution is derived. The proposed method extends a similar procedure developed for multivariate data to the regression setting. The test may also be employed to assess specific hypotheses concerning distributional properties of the error term. Its performance and application are illustrated in a small simulation study and with real economic data. The article also contains a few theoretical results regarding axial symmetry that may be of independent interest.

2606.11726 2026-06-11 math.ST New submission

Notes on the Theory of Statistical Symbol Recognition

Nils Lid Hjort

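AI summary: This is Nils Lid Hjort's 207-page 1986 monograph on the theory of statistical symbol recognition, covering statistical pattern recognition methods for symbol recognition and classification analysis in noisy images, developed to fit the technological constraints of the time (such as document scanning and vector conversion).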

Comments
Monograph, 207 pages, a limited circulation report from Norwegian Computing Centre, 1986, documenting statistical methods developed to serve industrial needs for various pattern recognition tasks
English abstract

This document is a pdf generated from old plain-TeX files of 1986, of Nils Lid Hjort's "Notes on the Theory of Statistical Symbol Recognition", a limited circulation 207-page monograph published at the Norwegian Computing Centre, as Report no. 778/1986. It gives the basics of the statistical pattern recognition theory developed to suit the needs of several industrial projects, related to contracts with the Norwegian-German firm SysScan, the Royal Norwegian Research Council, and yet others, involving symbol recognition and classification analysis from noisy images, related to maps, documents, satellite imaging, etc. The methods and algorithms developed also needed to fit the technology of that time, anno c. 1986, with machines scanning documents, converting these to vector representation, within computational and machine system boundaries. There is an accompanying and also limited circulation booklet, "Statistical Symbol Recognition: Development of a System", by Knut Bråten, Erik Holbæk-Hanssen, and Torfinn Taxt (Report No. 777/1986, Norwegian Computing Centre, Oslo), detailing the system developments. Thus developments took form and shape on two frontiers, in close collaboration: Hjort's statistical methods and getting the technology to work, with its multiple components, hardware and software.

2606.11487 2026-06-11 math.ST math.PR stat.ML New submission

Unbiased Derivative Estimation for Stationary Mean of Parameterized Markov chains

Jeffrey Wang, Chang-han Rhee

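AI summary: Proposes an unbiased estimator of the gradient of the stationary mean of parameterized Markov chains that is efficient under slow mixing, requires no prior knowledge of the density function, and suits neural-network parameterizations.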

Comments
Preliminary draft. Full version in preparation
English abstract

We propose a new approach to unbiased estimation of the gradients of the stationary means associated with parametrized families of Markov chains. Our estimators are particularly efficient when the Markov chains have a slow mixing rate. Our approach does not require a specific parametrization except for an oracle to evaluate the transition density and its gradient at a given data point, without any additional knowledge about the density function itself. This makes our estimator suitable for parametrizations associated with neural networks. The estimator can potentially achieve a large improvement in efficiency. Numerical experiments confirm the good performance predicted by the theory.

2606.11469 2026-06-11 cs.DS cs.LG math.ST New submission

Density estimation for Hellinger via minimum-distance estimators: mixtures of Gaussians, log-concave, and more

Spencer Compton, Jerry Li

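AI summary: Extends the minimum-distance estimator approach from total variation distance to Hellinger distance via reverse data processing inequalities, yielding near-linear time learning of log-concave mixtures and Gaussian mixtures (with arbitrary variances) with near-optimal sample complexity.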

English abstract

We study the task of density estimation, where we hope to accurately estimate a probability density from $n$ samples. A textbook method for density estimation in total variation distance is the minimum-distance estimator approach, where we obtain both the algorithm and the analysis merely from bounding the VC dimension of a particular concept class (the so-called Yatracos class). While this technique originally yielded sharp guarantees primarily for total variation distance, in this work we extend the minimum-distance estimator approach to learning in Hellinger distance. Our main observation is that we may produce an analogous recipe for Hellinger (where we only require bounding the VC dimension of a related concept class) by drawing connections to recent results yielding reverse data processing inequalities. This recipe is flexible enough to accommodate fast algorithms originally designed for total variation distance; by modifying the approach of Acharya et al. (2017) we obtain the first near-linear time algorithm for learning classes including univariate mixtures of log-concave densities and mixtures of Gaussians (with arbitrary variances), with near-optimal sample complexity.

2606.11421 2026-06-11 stat.ME math.ST stat.CO New submission

Second-Order Least Squares as a Special Case of the Polynomial Maximization Method

Serhii Zabolotnii

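AI summary: Proves that, under conditionally homoskedastic non-Gaussian errors, optimally weighted second-order least squares is equivalent to the degree-two generalized polynomial maximization method, and reveals an efficiency reserve at higher degrees.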

Comments
26 pages, 3 figures, 7 tables. Includes Lean 4 formal verification and Monte Carlo simulation
English abstract

We prove that optimally weighted second-order least squares (SLS) and the degree-two generalized polynomial maximization method (PMM) are the same population estimating equation for linear regression with conditionally homoskedastic non-Gaussian errors: they choose the same optimal linear combination of the first two centered residual moments, solve one population normal system, share one influence function, and attain the common asymptotic variance $c_2g_2/N$ -- the ordinary-least-squares slope-variance factor $c_2$ scaled by the PMM variance-reduction coefficient $g_2=1-\gamma_3^2/(2+\gamma_4)$ (with $\gamma_3,\gamma_4$ the error skewness and excess kurtosis). Feasible plug-in implementations are therefore first-order equivalent, with only higher-order finite-sample differences. The identity is sharp: under heteroskedasticity the unconditional PMM body and the conditional SLS weighting separate, costing efficiency for symmetric errors and consistency for asymmetric errors. Beyond degree two, PMM holds an efficiency reserve that SLS cannot reach within its second-moment span. For symmetric platykurtic errors SLS collapses to ordinary least squares for the slope, while degree-three PMM exploits kurtosis information outside the SLS moment span through a closed-form coefficient $g_3$; for canonical asymmetric laws this reserve is $30$--$50\%$ within the degree-three polynomial moment class. The Lean 4 development machine-checks the degree-specific algebraic core -- the closed forms for $g_2$ and $g_3$, the $g_2\le1$ result, the design cancellations, and the symmetric collapse -- while the general monotonicity $g_{S+1}\le g_S\le1$ is proved analytically by nesting. A Monte Carlo study illustrates the equivalence, the reserve, and the heteroskedastic boundary at finite samples.
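
The variance-reduction coefficient $g_2=1-\gamma_3^2/(2+\gamma_4)$ quoted in the abstract is easy to evaluate for a given error law. The short Python sketch below does only that; the function name `pmm_g2` and the example distributions are illustrative assumptions, and nothing here reproduces the estimators themselves.

```python
def pmm_g2(skewness, excess_kurtosis):
    """Evaluate g2 = 1 - gamma_3^2 / (2 + gamma_4) from the abstract's formula.

    skewness is gamma_3 and excess_kurtosis is gamma_4 of the error law.
    """
    denom = 2.0 + excess_kurtosis
    if denom <= 0.0:
        raise ValueError("2 + excess kurtosis must be positive")
    return 1.0 - skewness ** 2 / denom

# Gaussian errors: gamma_3 = 0, gamma_4 = 0, so there is no gain over OLS.
gaussian = pmm_g2(0.0, 0.0)
# Centered exponential errors: gamma_3 = 2, gamma_4 = 6.
exponential = pmm_g2(2.0, 6.0)
```

Under the abstract's formula, Gaussian errors give $g_2 = 1$, while centered exponential errors give $g_2 = 1 - 4/8 = 0.5$, that is, half the ordinary-least-squares asymptotic slope variance.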

2606.11406 2026-06-11 math.ST New submission

Posterior consistency of Pólya trees for deconvolution under the linear model

Nakul Shenoy, Asaf Weinstein

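AI summary: Studies deconvolution under the linear model with a Bayesian nonparametric method based on a Pólya tree prior, and proves that, under a condition on the minimum eigenvalue of $X^\top X$, the posterior concentrates around the true density $g_0$ in sup-norm.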

English abstract

Several recent works have addressed the problem of deconvolution under a linear model, where the goal is to estimate a completely unknown $G_0$ from a vector of noisy observations $\boldsymbol{Y} = X\boldsymbol{\beta} + \boldsymbol{\epsilon}$, assuming the coefficients $\beta_j$ are i.i.d. unobserved realizations from $G_0$. Assuming $G_0$ has a density $g_0$, we study theoretically a Bayesian nonparametric method proposed in Weinstein et al. (2025) that postulates a Pólya tree prior $\Pi$ on $g_0$ and bases a deconvolution estimate on the posterior distribution $\Pi(\cdot|\boldsymbol{Y})$. Our main result asserts that under the true model (fixed and unknown $g_0$), and under a suitable condition on the minimum eigenvalue of $X^\top X$, the posterior $\Pi(\cdot|\boldsymbol{Y})$ concentrates around $g_0$ in sup-norm. The analysis presented builds on and extends results from Castillo (2017), where posterior consistency of Pólya trees was proved for density estimation, the simpler problem of estimating $g_0$ when observing the coefficients $\beta_j$ directly.

2606.11282 2026-06-11 stat.AP math.PR math.ST New submission

The Statistical Compass

Eliuvish Han Cui

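AI summary: Develops probability and stochastic-process ideas as a translation language for statistics, from designed observations to data objects, targets, stability, inference, and use, with examples linking abstract objects to records, mechanisms, and decisions.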

Comments
669 pages, 23 figures; textbook/monograph working manuscript
English abstract

This monograph develops probability and stochastic-process ideas as a translation language for statistics: from designed observations and data objects to targets, stability statements, inference, and use. The chapters move from motivating examples and randomization through probability measures, kernels, likelihoods, data objects, weak convergence, empirical fields, functional data, M- and Z-estimation, testing, local approximations, event-time processes, and prediction. Historical and biomedical examples are used to keep abstract objects tied to records, mechanisms, and decisions. The aim is to give readers a common grammar for classical probability, modern data structures, and statistical practice.

2606.11263 2026-06-11 math.ST cs.LG math.NA math.PR New submission

Geometric bias in eigenspace perturbation under random heterogeneous noise

Fengkai Liu, Ke Wang, Wanjie Wang

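AI summary: For signal-plus-noise matrices under sparse noise with heterogeneous variances, the paper finds that empirical eigenvectors carry a systematic geometric bias that classical perturbation bounds cannot capture, and derives near-optimal non-asymptotic perturbation bounds via the quadratic vector equation and fine-grained isotropic local laws.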

Comments
104 pages, 1 figure
English abstract

Spectral methods rely fundamentally on the stability of principal eigenspaces under random perturbations. Classically, this stability is quantified by the Davis-Kahan and Wedin theorems, which bound the eigenspace error using the operator norm of the noise and the relevant spectral gaps. While these worst-case bounds are sharp for arbitrary deterministic perturbations, they can be wasteful in the low-rank signal-plus-random-noise setting, as they fail to capture the fine-grained interaction between the signal geometry and the noise distribution. In this paper, we study the spectral perturbation of signal-plus-noise matrices corrupted by sparse, random noise with an arbitrary, inhomogeneous variance profile. We demonstrate that under heterogeneous noise variances, the empirical eigenvectors suffer a systematic, deterministic geometric bias that is entirely invisible to classical perturbation bounds. By leveraging the Quadratic Vector Equation (QVE) and establishing fine-grained isotropic local laws, we derive near-optimal, non-asymptotic perturbation bounds for the leading eigenspaces in the operator and $2\to\infty$ norms. The bounds separate the usual signal-to-noise contribution, stochastic fluctuations, and structured geometric bias terms determined by the alignment between the signal eigenspaces and the row-wise variance profile.

2606.10212 2026-06-11 math.ST stat.ML Version update

Intrinsic Riemannian Cross-covariance for Manifold-valued Random Objects

Intrinsic footpoint-invariant Riemannian cross-covariance

Carlos Soto, Cheng Wang, Yujing Huang, Xiaoyu Chen

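AI summary: Proposes a Riemannian cross-covariance that maps local variations to a common tangent space via parallel transport, enabling estimation of second-order statistics for random objects on manifolds, proves its asymptotic properties, and validates it on spheres, SPD manifolds, and heart valve shape data.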

Comments
31 pages, 16 figures
English abstract

Covariance estimation yields a fundamental second-order statistic underlying representation learning, dimension reduction, and dependence modeling. While covariance has been well understood in Euclidean spaces, it is ill-defined for random objects residing on nonlinear Riemannian manifolds, which increasingly arise in modern machine learning applications involving shapes, symmetric positive definite (SPD) matrices, etc. This paper introduces an intrinsic Riemannian cross-covariance for manifold-valued random objects. Our approach defines covariance and correlation by transporting local variations to a common tangent space via parallel transport, yielding a second-order descriptor that is independent of arbitrary coordinate choices. We establish that the proposed covariance inherits desirable properties of its Euclidean counterparts and characterize its asymptotic behavior. Numerical studies on spheres and SPD manifolds, together with real-data experiments on heart valve shapes in Kendall's shape space, demonstrate the effectiveness of our estimators and verify the stated properties. Our results position the Riemannian covariance as a fundamental tool for second-order learning and analysis in non-Euclidean representation spaces.

2605.31416 2026-06-11 math.ST math.PR Version update

Second-order PACF asymptotics and discrimination between fractional Gaussian noise and $\mathrm{FARIMA}(0,d,0)$

Chunhao Cai

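AI summary: By deriving a second-order asymptotic expansion of the partial autocorrelation function (PACF) of fractional Gaussian noise (fGn), the paper shows that it differs from $\mathrm{FARIMA}(0,d,0)$ at the first non-universal (second) order, which explains differences in short-memory order selection.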

English abstract

Fractional Gaussian noise and $\mathrm{FARIMA}(0,d,0)$ have the same long-memory pole $|\theta|^{-2d}$ and hence the same leading PACF law $\alpha(n)\sim d/n$. We show that this agreement breaks at the first non-universal order. For $0<d<1/2$, the pure fGn PACF satisfies $$ \alpha_{\mathrm{fGn}}(n)=\frac d n+\frac{C_{\mathrm{fGn}}(d)}{n^2}+o(n^{-2}), \qquad C_{\mathrm{fGn}}(d)<d^2. $$ The proof uses the Bingham--Inoue--Kasahara representation, a phase-coefficient expansion for fGn, and a Hankel-operator perturbation argument. Thus the fGn spectral envelope is invisible at first order but visible in second-order finite prediction, explaining why short-memory order selection can differ when fGn data are fitted by FARIMA-type models.
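
For context on why the comparison constant is $d^2$: the PACF of $\mathrm{FARIMA}(0,d,0)$ has a standard closed form (usually attributed to Hosking, 1981), and expanding it in powers of $1/n$ gives the benchmark below. This worked expansion is added for illustration and is not taken from the paper.

```latex
\alpha_{\mathrm{FARIMA}}(n) = \frac{d}{n-d}
  = \frac{d}{n}\cdot\frac{1}{1-d/n}
  = \frac{d}{n} + \frac{d^{2}}{n^{2}} + O(n^{-3}),
\qquad n \to \infty .
```

Both processes therefore share the leading term $d/n$, while the $n^{-2}$ coefficient equals $d^2$ for FARIMA and, by the abstract's inequality $C_{\mathrm{fGn}}(d)<d^2$, is strictly smaller for fGn.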

2605.01923 2026-06-11 econ.EM math.ST Version update

Estimation and Inference for the $\tau$-Quantile of Individual Heterogeneous Coefficient

Antonio F. Galvao, Ulrich Hounyo, Jiahao Lin

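AI summary: For quantiles of individual heterogeneous slope coefficients in panel data, proposes a two-step quantile estimation framework and establishes asymptotic theory and bootstrap inference.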

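AI abstract (translated from Chinese)

This paper proposes estimation and inference procedures for quantiles of the individual heterogeneous slope coefficients in panel data. We develop a two-step quantile estimation framework for analyzing heterogeneity in individual coefficients. Unlike conventional panel quantile regression, which focuses on outcome heterogeneity, our approach targets the $\tau$-quantile of the cross-sectional distribution of individual-specific slopes. We establish the asymptotic theory under both stochastic and deterministic designs, with convergence rates $\sqrt{N}$ and $\sqrt{N\sqrt{T}}$, respectively. We also develop two corresponding bootstrap procedures for practical inference and formally establish their validity. The suggested methods are of practical interest because they require weaker sample size growth conditions than standard fixed-effects quantile regression and accommodate large-$N$ settings. Numerical simulations and an application to mutual fund performance illustrate the proposed methods and the heterogeneity patterns they reveal across quantiles.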

English abstract

This paper proposes estimation and inference procedures for quantiles of the heterogeneous individual-specific coefficients in panel data. Unlike conventional panel quantile regression, which focuses on outcome heterogeneity, our approach targets the $\tau$-quantile of the cross-sectional distribution of individual-specific slopes. We establish the asymptotic theory under both stochastic and deterministic designs, with convergence rates $\sqrt{N}$ and $\sqrt{N\sqrt{T}}$, respectively. We also develop two corresponding bootstrap procedures for practical inference, and formally establish their validity. The suggested methods are of practical interest since they require weaker sample size growth conditions than standard fixed-effect quantile regression, and accommodate large $N$ settings. Numerical simulations and an empirical application illustrate the empirical effectiveness of the methods under both designs.

2411.10959 2026-06-11 econ.EM cs.LG math.ST stat.AP stat.ME stat.ML Version update

Program Evaluation with Remotely Sensed Outcomes

Ashesh Rambachan, Rahul Singh, Davide Viviano

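AI summary: Studies causal inference in experiments and quasi-experiments where a remotely sensed variable imperfectly measures the economic outcome, and proposes nonparametric identification of the causal parameter by combining experimental and observational data, with $n^{-1/2}$ inference.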

English abstract

We study causal inference in experiments and quasi-experiments, where the economic outcome is imperfectly measured by a remotely sensed variable. The remotely sensed variable is low-cost, scalable, and predictive of the economic outcome in observational data; examples include satellite imagery and mobile phone activity. We model the remotely sensed variable as post-outcome: variation in the economic outcome causes variation in the remotely sensed variable. For example, changes in environmental quality cause changes in satellite imagery, not vice versa. Under this assumption, we propose a formula to nonparametrically identify the causal parameter by combining experimental and observational data. We develop a method for $n^{-1/2}$ inference that is robust to misspecification and that does not restrict the algorithms used to process remotely sensed variables.

2605.00198 2026-06-11 math.ST Version update

Fast Convergence for Weighted Least Squares Estimates

Andrey Sarantsev

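AI summary: Studies weighted least squares estimates that converge faster than the classic square root rate when the Fisher information is infinite; by constructing a family of bivariate absolutely continuous distributions, shows that the estimation error is asymptotically of smaller order than the classic rate.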

Comments
8 pages. Keywords: stable subordinator, Fisher information, maximum likelihood estimate, weighted least squares, super-efficient estimate
English abstract

It is well known that maximum likelihood estimates converge faster than the classic square root rate if the Fisher information is infinite. This is often the case when the effective region depends on the estimated parameters, or when the density has a singularity inside the effective region at a point dependent on the estimated parameters. We present a one-parameter family of bivariate absolutely continuous distributions on the half-space with smooth densities. The effective domain is always the same half-space and does not depend on this parameter. The order of magnitude for the weighted least squares estimate is asymptotically smaller than the classic square root rate. For the Gaussian variance mixture case, the maximum likelihood estimate coincides with this weighted least squares estimate.

2604.27442 2026-06-11 math.ST stat.ML Version update

Bayesian online learning in the one-pass regime: Frequentist validity and uncertainty quantification

Jeyong Lee, Junhyeok Choi, Dongguen Kim, Minwoo Chae

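AI summary: Proposes a Bayesian online learning algorithm for the one-pass regime with a warm-start phase that ensures stable updates, proves that the posterior attains the optimal convergence rate, and establishes an online Bernstein-von Mises theorem, giving uncertainty quantification without diverging mini-batch sample sizes.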

Comments
52 pages
English abstract

Bayesian online learning provides a coherent framework for sequential inference. However, its theoretical understanding remains limited, particularly in the one-pass setting. Existing theoretical guarantees typically require the mini-batch sample size to diverge, a condition that fails in the one-pass regime. In this paper, we propose a new Bayesian online learning algorithm tailored to the one-pass setting, which incorporates a warm-start phase to ensure stable sequential updates. For this algorithm, we show that the sequentially updated posterior attains the optimal convergence rate. Building on this, we establish an online analogue of the Bernstein-von Mises theorem, which guarantees valid uncertainty quantification without diverging mini-batch sample sizes. Our analysis is based on a novel theoretical framework that differs fundamentally from existing approaches in the online learning literature. Numerical experiments on generalized linear models show that the proposed method matches the performance of the batch estimator while outperforming existing online procedures.

2604.00585 2026-06-11 math.ST Version update

Empirical tail dependence functions in high dimensions: uniform linearizations and inference

Axel Bücher, Yeonjoon Choi, Katharina Effertz, Stanislav Volgushev

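AI summary: For empirical tail dependence functions in high-dimensional extreme-value statistics, establishes finite-sample probability bounds, high-dimensional central limit theorems, and validity of the multiplier bootstrap, allowing the dimension to grow exponentially with the effective sample size, with applications to M-estimation and inference for spatial isotropy.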

Comments
71 pages (24 for the main paper)
English abstract

The analysis of extremal dependence in high dimensions is a key challenge in modern extreme-value statistics. Existing methodology primarily focuses on modeling and estimation of extremal dependence structures, often supported by concentration bounds for empirical tail quantities. However, comparatively little is known about general inferential procedures in high-dimensional extremes. In this paper, we develop foundational results that enable inference for rank-based empirical tail dependence coefficients, stable tail dependence functions, and functionals derived from them. We start by establishing finite-sample probability bounds that quantify the linearization error for such estimators uniformly over collections of coordinates. Moreover, we derive high-dimensional central limit theorems and establish the validity of multiplier bootstrap procedures for collections of empirical tail dependence statistics. Within an asymptotic framework, our results allow the dimension to grow exponentially with the effective sample size. We illustrate the usefulness of the results through two applications: uniform expansions and normal approximations for M-estimators of tail dependence parameters and inference for spatial isotropy based on collections of tail dependence functions.

2603.27843 2026-06-11 math.ST stat.ME Version update

Empirical Bayes Estimation and Inference via Smooth Nonparametric Maximum Likelihood

Taehyun Kim, Bodhisattva Sen

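AI summary: To address the discreteness and slow logarithmic deconvolution rate of the nonparametric maximum likelihood estimator, introduces a Gaussian smoothing layer and proposes a smooth NPMLE that achieves a polynomial deconvolution rate, nearly parametric denoising performance, and consistent posterior estimation, and constructs optimal marginal coverage sets.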

English abstract

The empirical Bayes $g$-modeling approach based on the nonparametric maximum likelihood estimator (NPMLE) has been central to large-scale estimation and inference in the normal means problem. However, theoretical guarantees for uncertainty quantification remain scarce. A key obstacle is that the NPMLE is necessarily discrete, which yields discrete posterior credible sets and a slow logarithmic deconvolution rate. We address both limitations by introducing a hierarchical Gaussian smoothing layer that restricts the mixing distribution to a Gaussian location mixture. Our smooth NPMLE inherits the favorable properties of the classical NPMLE: it is computable via convex optimization and achieves nearly parametric denoising performance. Moreover, it achieves a polynomial deconvolution rate that is asymptotically minimax over the corresponding class. Our procedure also leads to estimated smooth posteriors that converge to the true posteriors at a polynomial rate. Further, we characterize marginal coverage sets that are optimal in expected length, construct plug-in estimators of these sets, and establish theoretical guarantees for the estimated sets in terms of both coverage probability and expected length. We also extend the theory to settings with model misspecification and heteroscedastic Gaussian observations, and study identifiability of the proposed hierarchical model.

2603.22668 2026-06-11 math.ST stat.ME 版本更新

Fixed-level calibration of the Cauchy combination test

柯西组合检验的固定水平校准

Hirofumi Ota

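AI summary: Studies the asymptotic exactness of the Cauchy combination test at a fixed significance level, shows that the raw CCT is not exact at fixed levels, and proposes the boundary-layer calibrated CCT (BL-CCT), which attains asymptotic exactness by correcting the reference distribution rather than the statistic while retaining power under several classes of alternatives.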

详情
Comments
Added several related references, conducted power analyses and polished the proofs and the simulation section
AI中文摘要

柯西组合检验(CCT)被广泛使用,因为它能产生闭式组合$p$值,并且在宽依赖结构下当名义水平$\alpha\downarrow0$时渐近有效。我们研究了一个不同的渐近问题:当组合$p$值的数量$K$在依赖下增长时,通常的柯西截断值在普通固定水平下是否仍然准确。在典型单因子等相关高斯copula模型下,我们证明原始CCT在固定$\alpha$下通常不是渐近精确的。在固定正相关下,统计量收敛到随机潜在因子极限,因此不存在通用的固定水平参考分布。当公共相关$\rho_K$随$K$减弱时,固定水平行为由边界层尺度$s_K=\sqrt{\rho_K}(\log K)^{3/2}$控制,且原始CCT渐近精确当且仅当$\rho_K(\log K)^3\to0$。由于大小失真完全来自参考分布而非统计量,因此可以在不修改检验统计量本身的情况下进行校正。我们提出了边界层校准CCT(BL-CCT),它用单参数高斯平滑柯西族替代标准柯西参考分布。与最近修改检验统计量的变体不同,BL-CCT保持统计量不变,仅校正参考分布。BL-CCT在更弱的条件$\rho_K\log K\to0$下渐近精确,并在有界边界层上提供有用的有限$K$近似。我们还进行了若干功效分析:尽管BL-CCT仅提高了截断值,但在局部密集、稀疏和密集高斯备择假设下,它在精确度尺度上相对于原始CCT没有一阶功效损失。数值实验支持校准理论。

英文摘要

The Cauchy combination test (CCT) is widely used because it yields a closed-form combined $p$-value and is known to be asymptotically valid as the nominal level $\alpha\downarrow0$ under broad dependence structures. We study a different asymptotic question: whether the usual Cauchy cutoff remains accurate at an ordinary fixed level when the number $K$ of combined $p$-values grows under dependence. Under a canonical one-factor equicorrelated Gaussian copula model, we show that the raw CCT is generally not asymptotically exact at fixed $\alpha$. With fixed positive correlation, the statistic converges to a random latent-factor limit, so there is no universal fixed-level reference law. When the common correlation $\rho_K$ weakens with $K$, fixed-level behaviour is governed by the boundary-layer scale $s_K=\sqrt{\rho_K}(\log K)^{3/2}$, and the raw CCT is asymptotically exact if and only if $\rho_K(\log K)^3\to0$. Because the size distortion arises entirely from the reference law and not from the statistic, it can be corrected without modifying the test statistic itself. We propose the boundary-layer calibrated CCT (BL-CCT), which replaces the standard Cauchy reference by a one-parameter Gaussian-smoothed Cauchy family. Unlike recent variants that modify the test statistic, BL-CCT leaves the statistic unchanged and corrects only the reference law. BL-CCT is asymptotically exact under the weaker condition $\rho_K\log K\to0$ and provides a useful finite-$K$ approximation on bounded boundary layers. We also conduct several power analyses: although BL-CCT only raises the cutoff, it incurs no first-order power loss relative to the raw CCT on the exactness scale, under local dense, sparse, and dense Gaussian alternatives. Numerical experiments support the calibration theory.
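
For orientation, the raw CCT discussed in this abstract can be sketched in a few lines. The statistic is $T=\sum_i w_i \tan\{(1/2-p_i)\pi\}$ and the closed-form combined $p$-value is the standard Cauchy tail probability $1/2-\arctan(T)/\pi$. The function name and the default equal weights are assumptions made for illustration; the BL-CCT reference law proposed in the paper is not reproduced here.

```python
import math


def cauchy_combination_pvalue(pvalues, weights=None):
    """Raw Cauchy combination test: combined p-value under the standard Cauchy reference.

    Weights are assumed non-negative and summing to one (equal weights by default).
    """
    k = len(pvalues)
    if weights is None:
        weights = [1.0 / k] * k
    # Each p-value is mapped to a standard Cauchy variate under the null.
    statistic = sum(w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, pvalues))
    # Upper tail probability of the standard Cauchy distribution at the statistic.
    return 0.5 - math.atan(statistic) / math.pi


combined = cauchy_combination_pvalue([0.01, 0.20, 0.65, 0.03])
```

A quick sanity check on the formula: when all inputs equal the same value $p$, the combined $p$-value is again $p$.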

2602.20266 2026-06-11 math.PR math.ST q-bio.PE 版本更新

Multiple Poisson-Dirichlet diffusions on generalized Kingman simplices

广义Kingman单纯形上的多重Poisson-Dirichlet扩散

Cristina Costantini, Matteo Ruggiero

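AI summary: Constructs infinite-dimensional diffusions on generalized Kingman simplices with finitely many marks, using a blockwise skew-product decomposition and a limiting procedure, and obtains the multiple Poisson-Dirichlet distribution as a stationary distribution.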

详情
Comments
Revised version; dedicated to the memory of T.G. Kurtz
AI中文摘要

我们在带有有限个标记的广义Kingman单纯形上构造了一类新的无穷维扩散过程。该模型描述了由有限个$H$标记标记的无穷多种类型的相对频率的时间演化,但在每个标记内类型是无标记的。我们首先建立了有限类型Wright-Fisher扩散的分块斜积表示,扩展了Dirichlet律的聚合-重归一化自相似性质。该分解将控制演化中的随机标记质量的$H$维Wright-Fisher扩散与$H$个Wright-Fisher扩散(每个在其自己的随机时钟上运行)分开,后者描述了每个标记内相对频率的演化。在对标记内频率进行降序排序后,我们确定了当每个标记的类型数趋于无穷大时的分布极限,并在适当定义域上推导出其无穷小生成元的显式形式。极限扩散以多重Poisson-Dirichlet分布作为平稳分布;当所有类型共享相同标记时,它恢复为无穷多中性等位基因扩散,而当有两个标记时,它产生Thoma单纯形上的扩散。

英文摘要

We construct a new class of infinite-dimensional diffusions with values in a generalized Kingman simplex with finitely many marks. The model describes the temporal evolution of the relative frequencies of infinitely many types that are labeled by a finite number $H$ of marks, but unlabeled within each mark. We first establish a blockwise skew-product representation for a finite-type Wright-Fisher diffusion, extending the aggregation-renormalization self-similarity property of Dirichlet laws. The decomposition separates an $H$-dimensional Wright-Fisher diffusion governing the evolving random mark masses, from $H$ Wright-Fisher diffusions, each run on its own random clock, which describe the evolution of the relative frequencies within each mark. After ranking the within-mark frequencies in decreasing order, we identify the distributional limit as the number of types per mark tends to infinity and we derive an explicit form of its infinitesimal generator on a suitable domain. The limiting diffusion admits the multiple Poisson-Dirichlet distribution as a stationary distribution; it recovers the infinitely-many-neutral-alleles diffusion when all types share the same mark and yields a diffusion on the Thoma simplex when there are two marks.

2602.02285 2026-06-11 cs.LG cs.CL math.ST 版本更新

AI4SLT: Empirical Processes in Lean 4 for Formal Statistical Learning Theory

AI4SLT: 基于 Lean 4 的形式化统计学习理论实证过程

Yuanhe Zhang, Jason D. Lee, Fanghui Liu

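AI summary: Presents the first comprehensive Lean 4 formalization of statistical learning theory, grounded in empirical process theory; a human-AI collaborative workflow yields a verified theorem-proving toolbox and exposes implicit assumptions in textbooks.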

详情
Comments
Accepted by ICML 2026
AI中文摘要

我们提出了首个基于实证过程理论的统计学习理论(SLT)在 Lean 4 中的全面形式化。我们的端到端形式化基础设施填补了最新 Lean 库中缺失的内容,包括高斯 Lipschitz 集中的完整推导、次高斯过程的 Dudley 熵积分定理,以及具有尖锐速率的(稀疏)最小二乘回归应用。该项目采用人机协作工作流,其中人类设计证明策略,AI 代理执行战术性证明构建,从而产生了经过人工验证的 SLT 的 Lean 4 工具箱。除了实现之外,形式化过程暴露并解决了标准 SLT 教材中的隐含假设和缺失细节,强制对理论进行逐行细粒度理解。这项工作建立了一个可重用的形式化基础,并为机器学习理论的未来发展打开了大门。代码可在以下网址获取:this https URL。

英文摘要

We present the first comprehensive Lean 4 formalization of statistical learning theory (SLT) grounded in empirical process theory. Our end-to-end formal infrastructure implements content missing from the latest Lean library, including a complete development of Gaussian Lipschitz concentration, Dudley's entropy integral theorem for sub-Gaussian processes, and an application to least-squares (sparse) regression with a sharp rate. The project was carried out using a human-AI collaborative workflow, in which humans design proof strategies and AI agents execute tactical proof construction, leading to a human-verified Lean 4 toolbox for SLT. Beyond implementation, the formalization process exposes and resolves implicit assumptions and missing details in standard SLT textbooks, enforcing a granular, line-by-line understanding of the theory. This work establishes a reusable formal foundation and opens the door for future developments in machine learning theory. The code is provided in this https URL.

2511.11862 2026-06-11 econ.EM math.ST stat.ME 版本更新

Compound Selection Decisions: An Almost SURE Approach

复合选择决策:一种几乎无偏的SURE方法

Jiafeng Chen, Lihua Lei, Timothy Sudijono, Liyang Sun, Tian Xie

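AI summary: For compound selection problems in the Gaussian sequence model, proposes ASSURE, an almost unbiased SURE-based estimator, selects the optimal decision rule by optimizing the estimated expected utility, and proves its asymptotic optimality.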

详情
Comments
V2: Additional Results and Simulations. 110 pages. Comments welcome
AI中文摘要

本文提出了在高斯序列模型中生成复合选择决策的方法。给定未知的固定参数 $\mu_ {1:n}$ 和已知的 $\sigma_{1:n}$,观测值 $Y_i \sim \textsf{N}(\mu_i, \sigma_i^2)$,决策者希望选择一个子集 $S$ 以最大化效用 $\frac{1}{n}\sum_{i\in S} (\mu_i - K_i)$,其中 $K_i$ 为已知成本。受Stein无偏风险估计(SURE)启发,我们引入了一种几乎无偏的估计量,称为ASSURE,用于估计给定决策规则的期望效用。ASSURE允许用户通过优化估计福利,从预先指定的类别中选择福利最大化的规则,从而产生能够跨噪声估计借用强度的选择决策。我们证明,ASSURE产生的决策规则在渐近意义上不劣于预指定类别中最优但不可行的决策规则。我们将ASSURE应用于经济机会的人口普查区选择、歧视性企业的识别以及A/B测试中 $p$ 值决策程序的分析。

英文摘要

This paper proposes methods for producing compound selection decisions in a Gaussian sequence model. Given unknown, fixed parameters $\mu_ {1:n}$ and known $\sigma_{1:n}$ with observations $Y_i \sim \textsf{N}(\mu_i, \sigma_i^2)$, the decision maker would like to select a subset of indices $S$ so as to maximize utility $\frac{1}{n}\sum_{i\in S} (\mu_i - K_i)$, for known costs $K_i$. Inspired by Stein's unbiased risk estimate (SURE), we introduce an almost unbiased estimator, called ASSURE, for the expected utility of a proposed decision rule. ASSURE allows a user to choose a welfare-maximizing rule from a pre-specified class by optimizing the estimated welfare, thereby producing selection decisions that borrow strength across noisy estimates. We show that ASSURE produces decision rules that are asymptotically no worse than the optimal but infeasible decision rule in the pre-specified class. We apply ASSURE to the selection of Census tracts for economic opportunity, the identification of discriminating firms, and the analysis of $p$-value decision procedures in A/B testing.

2502.04046 2026-06-11 stat.ME math.ST stat.TH

A method for sparse and robust independent component analysis

Lauri Heinonen, Joni Virta

详情
Journal ref
Journal of Multivariate Analysis, 213, 105587 (2026)
Comments
27 pages, 9 figures
英文摘要

This work presents sparse invariant coordinate selection, SICS, a new method for sparse and robust independent component analysis. SICS is based on classical invariant coordinate selection, which is presented in such a form that a LASSO-type penalty can be applied to promote sparsity. Robustness is achieved by using robust scatter matrices. In the first part of the paper, the background and building blocks: scatter matrices, measures of robustness, ICS and independent component analysis, are carefully introduced. Then the proposed new method and its algorithm are derived and presented. This part also includes consistency and breakdown point results for a general case of sparse ICS-like methods. The performance of SICS in identifying sparse independent component loadings is investigated with multiple simulations. The method is illustrated with an example in constructing sparse causal graphs and we also propose a graphical tool for selecting the appropriate sparsity level in SICS.

2509.10817 2026-06-11 math.ST stat.ME 版本更新

Conditional Independence Testing Using Exchangeable Pairs

使用可交换对的条件独立性检验

Bilol Banerjee

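AI summary: Proposes a conditional independence test based on exchangeable pairs that recasts the problem as two-sample testing, measures departures with an energy-distance type discrepancy, and is shown to be consistent with an optimal detection rate.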

详情
AI中文摘要

本文考虑在给定混杂随机向量 \(\bm Z\) 的情况下,检验两个随机向量 \(\bm X\) 和 \(\bm Y\) 之间的条件独立性问题。引入了一个可交换对框架,通过该框架将条件独立性检验问题重新表述为两样本检验问题。该框架受模型X文献思想的启发,基于在原假设条件独立性下成立的基本可交换性性质。采用能量距离/最大均值差异类型的度量来衡量可交换对与条件独立性的偏离。构建了所提出的差异度量的一致估计量,并在一般假设下建立了其理论性质。然后,使用该估计量作为检验统计量开发了条件独立性检验,并通过适当的重采样程序进行校准。结果表明,所提出的检验对固定备择假设是一致的,对局部邻接备择假设具有非平凡的渐近功效,达到了检测由所提出的差异度量表征的备择假设的极小化最优分离率,并且在数据维度随样本量发散时仍然一致。还研究了用于生成可交换对的条件分布估计的影响,并建立了保持有效性和功效性质的条件。广泛的模拟研究表明,所提出的方法与一些最先进的方法相比具有竞争力。

英文摘要

This article considers the problem of testing conditional independence between two random vectors \(\bm X\) and \(\bm Y\) given a confounding random vector \(\bm Z\). An exchangeable-pairs framework is introduced through which the conditional independence testing problem is reformulated as a two-sample testing problem. The framework is motivated by ideas from the model-X literature and is based on a fundamental exchangeability property that holds under the null hypothesis of conditional independence. An energy-distance/maximum mean discrepancy type measure is employed on the resulting exchangeable pairs to quantify departures from conditional independence. A consistent estimator of the proposed discrepancy measure is constructed and its theoretical properties are established under general assumptions. A conditional independence test is then developed using this estimator as a test statistic and is calibrated through a suitable resampling procedure. It is shown that the proposed test is consistent against fixed alternatives, possesses nontrivial asymptotic power against local contiguous alternatives, attains the minimax separation rate for detecting alternatives characterized by the proposed discrepancy measure, and remains consistent when the data dimension diverges with the sample size. The effect of estimating the conditional distribution used to generate the exchangeable pairs is also investigated, and conditions under which validity and power properties are preserved are established. Extensive simulation studies demonstrate that the proposed procedure performs competitively with some state-of-the-art methods.
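
As background for the energy-distance type measure mentioned in this abstract, the classical two-sample energy distance is $2\,\mathbb{E}\|U-V\| - \mathbb{E}\|U-U'\| - \mathbb{E}\|V-V'\|$. The sketch below computes its plug-in (V-statistic) estimate for two samples of points given as tuples. It is only the textbook two-sample quantity, not the paper's statistic on exchangeable pairs, and the function names are hypothetical.

```python
import math


def _mean_pairwise_distance(a, b):
    """Average Euclidean distance over all pairs (u, v) with u in a and v in b."""
    total = 0.0
    for u in a:
        for v in b:
            total += math.dist(u, v)
    return total / (len(a) * len(b))


def energy_distance(x, y):
    """Plug-in (V-statistic) estimate of the squared energy distance between two samples."""
    cross = _mean_pairwise_distance(x, y)
    within_x = _mean_pairwise_distance(x, x)
    within_y = _mean_pairwise_distance(y, y)
    return 2.0 * cross - within_x - within_y


sample_x = [(0.0,), (1.0,)]
sample_y = [(2.0,), (3.0,)]
value = energy_distance(sample_x, sample_y)  # 2 * 2.0 - 0.5 - 0.5 = 3.0
```

The estimate is zero when the two samples coincide and grows as the samples separate, which is what makes it usable as a two-sample test statistic once a resampling calibration is added.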

2507.23490 2026-06-11 math.ST 版本更新

Optimal Transport-Based Multivariate Goodness-of-Fit Tests

基于最优传输的多元拟合优度检验

Zdeněk Hlávka, Šárka Hudecová, Simos G. Meintanis

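AI summary: Proposes characteristic function-based goodness-of-fit tests for multivariate distributions, with multivariate ranks constructed via optimal transport; the test statistics are simple to compute and distribution-free under a simple null hypothesis, and simulations and real data support their effectiveness.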

详情
AI中文摘要

针对多元分布,提出了基于特征函数的拟合优度检验。检验统计量计算简单,定义为原始观测的多元秩与从待检验参考分布生成的伪样本的秩之间经验特征函数差异的两样本准则。利用最优测度传输理论构造多元秩,使得简单原假设的检验无分布限制,而复合原假设仍需自助法近似。发展了渐近理论,并通过模拟研究(重点与先前提出的多元正态性检验进行比较)表明该方法在有限样本中表现良好。通过实际数据集的应用展示了所提方法的广泛适用性。

英文摘要

Characteristic function-based goodness-of-fit tests are suggested for multivariate distributions. The test statistics, which are straightforward to compute, are defined as two-sample criteria measuring discrepancy of empirical characteristic functions between multivariate ranks of the original observations and the ranks obtained from an artificial sample generated from the reference distribution under test. Multivariate ranks are constructed using the theory of the optimal measure transport, thus rendering the tests of a simple null hypothesis distribution-free, while bootstrap approximations are still necessary for testing composite null hypotheses. Asymptotic theory is developed and a simulation study, concentrating on comparisons with previously proposed tests of multivariate normality, demonstrates that the method performs well in finite samples. The broad applicability of the proposed methods is illustrated through an application to a real dataset.

2505.08784 2026-06-11 stat.ML cs.LG math.ST stat.ME 版本更新

PCS-UQ: Uncertainty Quantification via the Predictability-Computability-Stability Framework

PCS-UQ:基于可预测性-可计算性-稳定性框架的不确定性量化

Abhineet Agarwal, Fange Xiao, Rebecca Barter, Omer Ronen, Boyu Fan, Bin Yu

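AI summary: Proposes the PCS-UQ framework, which quantifies uncertainty through prediction checks, bootstrap sampling, and multiplicative calibration; it outperforms or matches conformal prediction methods on regression and classification tasks and comes with a theoretical guarantee.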

详情
AI中文摘要

随着机器学习进入高风险领域,可信的不确定性量化对于安全性至关重要。本文基于真实数据科学的可预测性、可计算性和稳定性原则,提出了PCS-UQ框架。从候选模型或算法集开始,PCS-UQ集成了严格的预测检查以筛选出集合中不合适的模型,并利用bootstrap样本来捕获预测检查算法的样本间变异性和算法不稳定性。然后,我们引入了一种新颖的乘法校准方案来增强局部自适应性,这基本上对应于共形预测中的新分数。此外,我们编制了17个真实世界回归数据集,并手动构建了子组。在该基准测试中,PCS-UQ在保持目标覆盖率的同时,在区间宽度上优于或匹配配备有oracle选择算法的共形方法。PCS-UQ实现了一致的子组覆盖率,优于这些oracle选择的共形方法。值得注意的是,PCS-UQ在实现竞争性区间宽度和一致子组覆盖率方面表现出色。在6个分类数据集上,PCS-UQ将预测集大小减少了20%。为了将框架扩展到深度学习,我们提出了计算高效的变体,避免了昂贵的重新训练。在三个计算机视觉基准测试中,这些变体将预测集大小比共形基线减少了20%。最后,我们提供了理论证明,即修改后的PCS-UQ算法在可交换性下作为分割共形推断的一种形式保持了有效的覆盖率。

英文摘要

As machine learning (ML) enters high-stakes domains, trustworthy uncertainty quantification (UQ) is essential for safety. In this paper we introduce PCS-UQ, a framework based on the Predictability, Computability, and Stability (PCS) principles for veridical data science. Starting with a candidate set of models or algorithms, PCS-UQ integrates a rigorous prediction-check to screen out unsuitable models in the set and utilizes bootstrap samples, in order to capture both inter-sample variability and algorithmic instability for the prediction-checked algorithms. We then introduce a novel multiplicative calibration scheme to enhance local adaptivity, which basically corresponds to a new score in conformal prediction. Moreover, we produce a compilation of 17 real-world regression datasets with manually-constructed subgroups. On this benchmark, PCS-UQ maintains the target coverage while outperforming or matching conformal methods equipped with oracle-selected algorithms in interval width. PCS-UQ achieves consistent subgroup coverage, outperforming these oracle-selected conformal methods. Notably, PCS-UQ stands out in achieving both competitive interval widths and consistent subgroup coverage. On 6 classification datasets, PCS-UQ reduces prediction set sizes by 20\%. To scale the framework for deep learning, we propose computationally efficient variants that bypass expensive retraining. On three computer vision benchmarks, these variants reduce prediction set sizes by 20\% over conformal baselines. Finally, we provide theoretical proof that a modified PCS-UQ algorithm preserves valid coverage under exchangeability as a form of split conformal inference.
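
The split conformal baseline that this abstract compares against can be sketched as follows. This is the standard procedure with the absolute-residual score, assuming a point predictor has already been fitted on separate data. It is not the PCS-UQ algorithm, and it does not include the multiplicative calibration scheme. Function names are hypothetical.

```python
import math


def split_conformal_halfwidth(calibration_residuals, alpha):
    """Half-width of a split conformal interval from absolute calibration residuals.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest residual; if that rank
    exceeds n, the interval is unbounded.
    """
    n = len(calibration_residuals)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return math.inf
    return sorted(calibration_residuals)[rank - 1]


def prediction_interval(point_prediction, halfwidth):
    """Symmetric interval around a point prediction."""
    return (point_prediction - halfwidth, point_prediction + halfwidth)


residuals = [3, 9, 1, 7, 5, 2, 8, 4, 6]  # |y - prediction| on a held-out calibration set
halfwidth = split_conformal_halfwidth(residuals, alpha=0.5)
interval = prediction_interval(10.0, halfwidth)
```

Under exchangeability of calibration and test points, an interval built this way covers the new response with probability at least $1-\alpha$; the half-width is the same for every test point, which is the lack of local adaptivity that alternative scores aim to address.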

2505.02653 2026-06-11 math.ST math.PR stat.ME 版本更新

Hierarchical Random Measures without Tables

无表格的层次随机测度

Marta Catalano, Claudio Del Sole

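AI summary: Proposes a new prior for the hierarchical Dirichlet process that eliminates the latent table variables, yields a quasi-conjugate posterior and efficient sampling algorithms, and extends to a framework of normalized hierarchical random measures.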

详情
AI中文摘要

层次狄利克雷过程是贝叶斯非参数多层次模型的基石。其生成模型可通过一组潜在变量描述,在流行的餐馆特许经营隐喻中通常称为表格。潜在表格简化了后验的表达,并允许实现吉布斯采样算法以近似抽取后验样本。然而,管理它们的分配可能变得计算昂贵,特别是随着数据集大小和层次数量的增加。在这项工作中,我们为层次狄利克雷过程的浓度参数确定了一个先验,该先验(i)诱导准共轭后验分布,并且(ii)消除了对表格的需求,导致后验更可解释的表达,同时具有可扩展和精确的算法来从中采样。值得注意的是,这种构造超越了狄利克雷过程,导致了一个定义归一化层次随机测度的新框架和一类从其后验采样的新算法。关键的分析工具是多元增量的独立性,即它们作为完全随机向量的表示。

英文摘要

The hierarchical Dirichlet process is the cornerstone of Bayesian nonparametric multilevel models. Its generative model can be described through a set of latent variables, commonly referred to as tables within the popular restaurant franchise metaphor. The latent tables simplify the expression of the posterior and allow for the implementation of Gibbs sampling algorithms to approximately draw posterior samples. However, managing their assignments can become computationally expensive, especially as the size of the dataset and the number of levels increase. In this work, we identify a prior for the concentration parameter of the hierarchical Dirichlet process that (i) induces a quasi-conjugate posterior distribution, and (ii) removes the need for tables, leading to more interpretable expressions for the posterior, with both a scalable and an exact algorithm to sample from it. Remarkably, this construction extends beyond the Dirichlet process, leading to a new framework for defining normalized hierarchical random measures and a new class of algorithms to sample from their posteriors. The key analytical tool is the independence of multivariate increments, that is, their representation as completely random vectors.

2504.01562 2026-06-11 math.ST math.PR 版本更新

Asymptotic analysis of the finite predictor for fractional Gaussian noise

分数高斯噪声有限预测器的渐近分析

P. Chigansky, M. Kleptsyna

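AI summary: Proposes an analytic approach that yields the exact asymptotics of the relative prediction error and partial correlation coefficients for processes driven by fractional Gaussian noise, resolving a previously intractable case in the prediction analysis of long-memory processes.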

详情
AI中文摘要

本文提出了一种新的方法,用于平稳序列有限预测器的渐近分析。我们的方法给出了相对预测误差和偏相关系数的精确渐近形式。底层假设具有解析性质,使得该方法适用于长记忆过程。以分数高斯噪声驱动的ARMA型过程作为案例研究,该过程此前一直难以处理。

英文摘要

This paper proposes a new approach to the asymptotic analysis of the finite predictor for stationary sequences. Our method yields the exact asymptotics of both the relative prediction error and the partial correlation coefficients. The underlying assumptions are analytic in nature, making the approach applicable to processes with long-range dependence. The ARMA-type process driven by fractional Gaussian noise (fGn), which had previously remained elusive, is used as a case study.
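
The two quantities studied in this abstract, the finite-predictor error and the partial correlation coefficients, can be computed numerically for plain fGn with the classical Durbin-Levinson recursion. The sketch below uses the standard fGn autocovariance $\gamma(k)=\tfrac12\left(|k+1|^{2H}-2|k|^{2H}+|k-1|^{2H}\right)$. It only evaluates finite-sample values and does not reproduce the paper's asymptotic results or its ARMA-type setting. Function names are hypothetical.

```python
def fgn_autocovariance(lag, hurst):
    """Autocovariance of unit-variance fractional Gaussian noise at the given lag."""
    k = abs(lag)
    h2 = 2.0 * hurst
    return 0.5 * ((k + 1) ** h2 - 2.0 * k ** h2 + abs(k - 1) ** h2)


def durbin_levinson(gamma, order):
    """Durbin-Levinson recursion for a stationary sequence with autocovariance gamma.

    Returns (pacf, errors): pacf[m - 1] is the partial correlation at lag m, and
    errors[m] is the one-step prediction error variance based on m past values.
    """
    coeffs = []           # predictor coefficients of the previous order
    variance = gamma(0)   # error variance with no past observations
    pacf, errors = [], [variance]
    for m in range(1, order + 1):
        numerator = gamma(m) - sum(coeffs[j] * gamma(m - 1 - j) for j in range(m - 1))
        kappa = numerator / variance
        # Update the predictor coefficients from order m - 1 to order m.
        coeffs = [coeffs[j] - kappa * coeffs[m - 2 - j] for j in range(m - 1)] + [kappa]
        variance *= 1.0 - kappa * kappa
        pacf.append(kappa)
        errors.append(variance)
    return pacf, errors


pacf, errors = durbin_levinson(lambda k: fgn_autocovariance(k, hurst=0.8), order=50)
```

For $H=1/2$ the sequence is white noise, so all partial correlations vanish and the error variance stays at one; for $H\neq 1/2$ the error variance decreases with the number of past observations used.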

2310.14983 2026-06-11 econ.EM math.ST stat.ME 版本更新

Causal clustering: design of cluster experiments under network interference

因果聚类:网络干扰下的聚类实验设计

Davide Viviano, Lihua Lei, Guido Imbens, Brian Karrer, Okke Schrijvers, Liang Shi

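AI summary: Studies the design of cluster experiments for estimating global treatment effects under network interference, proposes choosing clusters via a penalized min-cut optimization that minimizes worst-case mean-squared error, and gives simple conditions for choosing between cluster designs.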

详情
AI中文摘要

本文研究在存在网络溢出效应的情况下,用于估计全局处理效应的聚类实验设计。我们提供了一个框架来选择聚类,以最小化估计全局效应的最坏情况均方误差。我们证明最优聚类解决了一个新颖的惩罚最小割优化问题,可通过现成的半定规划算法计算。我们的分析还刻画了在任何两个聚类设计之间进行选择的简单条件,包括在聚类或个体水平随机化之间进行选择。我们使用来自Facebook用户宇宙的独特网络数据和现有现场实验数据来说明该方法的性质。

英文摘要

This paper studies the design of cluster experiments to estimate the global treatment effect in the presence of network spillovers. We provide a framework to choose the clustering that minimizes the worst-case mean-squared error of the estimated global effect. We show that optimal clustering solves a novel penalized min-cut optimization problem computed via off-the-shelf semi-definite programming algorithms. Our analysis also characterizes simple conditions to choose between any two cluster designs, including choosing between a cluster or individual-level randomization. We illustrate the method's properties using unique network data from the universe of Facebook's users and existing data from a field experiment.