Causal Inference with High-dimensional Discrete Covariates
Zhenghao Zeng, Sivaraman Balakrishnan, Yanjun Han, Edward H. Kennedy
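AI summary: Studies the estimation of causal effects with high-dimensional discrete covariates, shows that the mean squared error of commonly used estimators is bounded by d²/n² + 1/n, gives a minimax lower bound, and proposes new estimators that exploit effect homogeneity and prior knowledge to speed up convergence.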
Comments: 74 pages, 9 figures
When estimating causal effects from observational studies, researchers often need to adjust for many covariates, many of which are discrete, to deconfound the relationship between exposure and outcome. The behavior of commonly used estimators in the presence of many discrete covariates is not well understood, since their properties are often analyzed under structural assumptions such as sparsity and smoothness, which do not apply in discrete settings. In this work, we study the estimation of causal effects in a model where the covariates required for confounding adjustment are discrete but high-dimensional, meaning the number of categories $d$ is comparable with or even larger than the sample size $n$. Specifically, we show that the mean squared error of commonly used regression, weighting, and doubly robust estimators is bounded by $\frac{d^2}{n^2}+\frac{1}{n}$. We then prove that the minimax lower bound for the average treatment effect is of order $\frac{d^2}{n^2 \log^2 n}+\frac{1}{n}$, which characterizes the fundamental difficulty of causal effect estimation in the high-dimensional discrete setting and shows that the estimators mentioned above are rate-optimal up to logarithmic factors. We further consider additional structures that can be exploited, namely effect homogeneity and prior knowledge of the covariate distribution, and propose new estimators that enjoy faster convergence rates of order $\frac{d}{n^2} + \frac{1}{n}$ and thus achieve consistency in a broader regime. The results are illustrated empirically via simulation studies.
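To make the setting concrete, the following is a minimal simulation sketch of a regression (plug-in) estimator of the average treatment effect when the only covariate is a single discrete variable with $d$ categories. It is not the authors' code. The data-generating process, the function names `plugin_ate` and `simulate`, and the convention of setting the arm mean to zero in cells with no observations are all assumptions made for illustration.

```python
import numpy as np


def plugin_ate(x, a, y, d):
    """Plug-in ATE estimate for a discrete covariate x taking values in {0, ..., d-1}.

    x: integer category labels, a: binary treatment (0/1), y: outcomes.
    """
    # Per-category counts and outcome sums in each treatment arm.
    n1 = np.bincount(x, weights=a, minlength=d)
    n0 = np.bincount(x, weights=1 - a, minlength=d)
    s1 = np.bincount(x, weights=a * y, minlength=d)
    s0 = np.bincount(x, weights=(1 - a) * y, minlength=d)

    # Empirical arm means per category. Cells with no units in an arm
    # get mean 0; this is a convention of this sketch, not of the paper.
    mu1 = np.divide(s1, n1, out=np.zeros(d), where=n1 > 0)
    mu0 = np.divide(s0, n0, out=np.zeros(d), where=n0 > 0)

    # Average the estimated contrasts over the empirical covariate distribution.
    p_hat = np.bincount(x, minlength=d) / len(x)
    return float(np.sum(p_hat * (mu1 - mu0)))


def simulate(n, d, rng, tau=1.0):
    """Hypothetical data-generating process with a constant treatment effect tau."""
    x = rng.integers(0, d, size=n)            # uniform covariate distribution
    propensity = 0.3 + 0.4 * (x % 2)          # treatment probability depends on x
    a = rng.binomial(1, propensity)
    y = tau * a + (x % 3) + rng.normal(size=n)  # baseline outcome also depends on x
    return x, a, y


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, reps, tau = 2000, 200, 1.0
    for d in (10, 1000, 5000):
        errors = []
        for _ in range(reps):
            x, a, y = simulate(n, d, rng, tau)
            errors.append((plugin_ate(x, a, y, d) - tau) ** 2)
        print(f"d={d:5d}  n={n}  Monte Carlo MSE={np.mean(errors):.4f}")
```

With $n$ held fixed, one would expect the Monte Carlo error of this sketch to grow as $d$ approaches and exceeds $n$, because more categories contain no treated or no control units. That is the regime the abstract's $\frac{d^2}{n^2}$ term describes, although this toy setup is not meant to reproduce the paper's rates exactly.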