arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.20435 2026-06-19 econ.EM New submission

Choosing A Headline Estimand from Matching, DID, and Hybrid Designs: A Minimax-Regret Approach

Yechan Park, Yuya Sasaki

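AI summary: For causal effect estimation with panel data, this paper shows that the estimand of the hybrid design (DIDM) lies between those of matching (M) and difference-in-differences (DID) and is the minimax-regret choice under a broad class of loss functions. It recommends reporting DIDM as the headline estimate, with matching and DID as bounds.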

English abstract

Researchers using panel data to estimate causal effects routinely choose among three approaches to using past outcomes: difference-in-differences (DID), conditioning on lagged outcomes (matching, M), and a hybrid that does both (DIDM). The corresponding identifying assumptions are non-nested, leaving little guidance on which to report. We give conditions under which the corresponding estimands are ordered, with DIDM bracketed between matching and DID. This makes DIDM the minimax-regret choice among the three under a broad class of loss functions. We recommend reporting DIDM as the headline estimate, with matching and DID as bounds. We illustrate in applications.
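
A minimal sketch of the bracketing argument, under simplifying assumptions that are not from the abstract: the truth is taken to coincide with one of the three estimands, loss is absolute error, and the numeric values are hypothetical. With M ≤ DIDM ≤ DID, reporting the middle value caps the worst-case error at the larger of the two gaps, whereas reporting either endpoint risks the full distance between M and DID. The function names are illustrative only.

```python
def max_regret(report, candidates):
    """Worst-case absolute error of `report` if the truth is any one of `candidates`."""
    return max(abs(report - truth) for truth in candidates)


def minimax_regret_choice(estimands):
    """Return the estimand name with the smallest worst-case regret, plus all regrets."""
    values = list(estimands.values())
    regrets = {name: max_regret(value, values) for name, value in estimands.items()}
    best = min(regrets, key=regrets.get)
    return best, regrets


# Hypothetical estimates, ordered M < DIDM < DID (assumption for illustration).
estimands = {"M": 1.0, "DIDM": 2.5, "DID": 5.0}
best, regrets = minimax_regret_choice(estimands)
print(best, regrets)
```

The paper's result covers a broader class of loss functions; this sketch only illustrates why a bracketed estimand limits worst-case error in the simplest case.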

2606.20286 2026-06-19 econ.EM New submission

Institutions, Inputs, and Agricultural Growth in China: Revisiting Several Controversies, 1949–1986

Jiyuan Lyu

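AI summary: Using a unified dataset and econometric methods, this paper revisits four controversies about China's agricultural growth: the price scissors, heavy industrial investment, the 1978 reforms, and the effect of decollectivization on irrigation.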

English abstract

Scholarly debates on China's agricultural growth between 1949 and 1986 continue to differ over the extent of the price scissors, the effect of heavy industrial investment, the role of the 1978 reforms, and the impact of decollectivization on irrigation. Using a single dataset and complementary econometric methods, this paper addresses each of these controversies. The results show that 1952–1957 was the only period of net extraction across all three channels, after which the state channelled a net inflow of about 168.6 billion yuan into agriculture via fiscal and credit instruments. Heavy industrial investment exerted a significant positive lagged effect on agriculture, while the contemporaneous negative correlation stemmed from the zero-sum nature of the investment share indicator. The input-output elasticity shifted abruptly in 1970, and collective agricultural loans showed a break in 1971, both pointing to the rectification effects of the North China Agricultural Conference. Disaster prevention capacity fell from 0.70 in the collective era to 0.53 after household contracting, mainly because the collective maintenance system collapsed rather than because state investment declined. After 1979 the price elasticity of agricultural supply approached zero, suggesting that the 1979 procurement price increase acted more like a one-off recalibration than a sustained marginal incentive.

2606.19972 2026-06-19 econ.EM New submission

Biodiversity Media Narratives and Stock Market Performance: Evidence from Europe

Andres Azqueta-Gavaldon, Ben Jabeur Sami, Leila Hedhili

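AI summary: Using the GDELT Global Knowledge Graph, the paper constructs biodiversity media risk indicators for France, Germany, Italy, and Spain over 2015-2025. Panel Granger causality tests and an augmented inverse probability weighting event study show that biodiversity risk significantly lowers stock prices, and that the positive effects of low-risk episodes exceed the negative effects of high-risk episodes.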

English abstract

This study constructs novel biodiversity-related media risk indicators for France, Germany, Italy, and Spain over 2015-2025, capturing media attention to biodiversity threats using the GDELT Global Knowledge Graph. Using panel Granger causality tests and an augmented inverse probability weighting (AIPW) event-study design, we find highly significant evidence that biodiversity risk reduces stock prices, with effects peaking between 3 and 10 months after a shock. Moreover, we uncover a marked asymmetry whereby the positive effects of low biodiversity risk episodes outweigh the negative effects of high-risk episodes. Results are robust across quantiles of the return distribution and hold when controlling for European equity market volatility and economic policy uncertainty. Our findings provide the first evidence that biodiversity media narratives drive stock market valuations in Europe.
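
For readers unfamiliar with AIPW, the following is a minimal cross-sectional sketch of the standard doubly robust average-treatment-effect formula. It is not the paper's event-study implementation: the function name, the toy data, and the assumption that propensity scores and outcome-model predictions are supplied by the caller are all illustrative.

```python
def aipw_ate(y, d, e, mu1, mu0):
    """AIPW estimate of the average treatment effect.

    y   : observed outcomes
    d   : treatment indicators (1 = treated, 0 = control)
    e   : estimated propensity scores P(d = 1 | x), strictly between 0 and 1
    mu1 : outcome-model predictions under treatment
    mu0 : outcome-model predictions under control
    """
    total = 0.0
    for yi, di, ei, m1, m0 in zip(y, d, e, mu1, mu0):
        # Outcome-model contrast plus inverse-probability-weighted residual corrections.
        total += (m1 - m0) + di * (yi - m1) / ei - (1 - di) * (yi - m0) / (1 - ei)
    return total / len(y)


# Toy data (hypothetical): two treated units, two controls, known propensity 0.5.
y = [2.0, 4.0, 1.0, 3.0]
d = [1, 1, 0, 0]
e = [0.5, 0.5, 0.5, 0.5]

# With outcome models set to zero, the formula reduces to plain IPW.
ate_ipw_only = aipw_ate(y, d, e, [0.0] * 4, [0.0] * 4)
# With nonzero outcome-model predictions, the residual terms correct the model contrast.
ate_with_models = aipw_ate(y, d, e, [3.0] * 4, [2.0] * 4)
print(ate_ipw_only, ate_with_models)
```

The estimator is consistent if either the propensity model or the outcome model is correctly specified, which is the usual motivation for using it.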

2606.20240 2026-06-19 econ.EM stat.AP New submission

Two-Sample IV: Efficient Two-Step Estimation and Tests for Overidentification and Weak-Instruments

Fatima Kasenally, Ruoxi Guan, Frank Windmeijer

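AI summary: For two-sample IV estimation, the paper proposes a two-step efficient estimator and an overidentification test that are robust to heteroskedasticity and sample heterogeneity and require only summary statistics from linear regressions. It also extends weak-instrument tests.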

English abstract

Two-sample IV is a popular estimation method when the outcome and treatment variables are available in different samples, whereas instruments are available in both samples. The standard estimator is the two-sample two-stage least squares estimator, which is efficient under homoskedasticity and homogeneity of the samples. We develop a robust two-step procedure for efficient estimation under general heteroskedasticity and heterogeneity of the samples, and propose a related two-sample Hansen overidentification test. A key feature of our approach is that only summary statistics from the linear regressions of the reduced form and first stage in the two samples are needed. These are six objects: the estimated coefficient vectors, and the homoskedastic and heteroskedasticity-robust estimated variance matrices. We further show that the first-stage F-statistic in the treatment sample can be used as a test for weak instruments in the standard way under homoskedasticity and homogeneity, where the relative bias is a proportional bias. We propose an extension of the effective F-statistic of Montiel-Olea and Pflueger (2013) for the heteroskedastic case, following the generalization in Windmeijer (2025). We illustrate the estimators and tests in an application studying the effect of education on voting behavior from Marshall (2019), with cluster-robust inference.
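
To make the "summary statistics only" idea concrete, here is a simplified two-step minimum-distance sketch for a single treatment. It is not the authors' procedure: it assumes independent samples and diagonal variance matrices (one variance per instrument), and the function name and inputs are hypothetical. The first step weights by the reduced-form variances alone; the second step reweights using the delta-method variance of gamma_j - pi_j * beta and returns a Hansen-type J statistic.

```python
def two_step_md(gamma, pi, var_gamma, var_pi):
    """Two-step minimum-distance estimate of beta in gamma_j = pi_j * beta.

    gamma     : reduced-form coefficients (outcome sample), one per instrument
    pi        : first-stage coefficients (treatment sample), one per instrument
    var_gamma : variances of the reduced-form coefficients (diagonal assumption)
    var_pi    : variances of the first-stage coefficients (diagonal assumption)
    Returns (beta, J statistic, degrees of freedom).
    """
    def weighted_beta(weights):
        num = sum(w * p * g for w, p, g in zip(weights, pi, gamma))
        den = sum(w * p * p for w, p in zip(weights, pi))
        return num / den

    # Step 1: ignore first-stage sampling error when weighting.
    w1 = [1.0 / vg for vg in var_gamma]
    beta1 = weighted_beta(w1)

    # Step 2: weight by the inverse variance of gamma_j - pi_j * beta1.
    w2 = [1.0 / (vg + beta1 ** 2 * vp) for vg, vp in zip(var_gamma, var_pi)]
    beta2 = weighted_beta(w2)

    # Overidentification statistic: weighted sum of squared residuals.
    j_stat = sum(w * (g - p * beta2) ** 2 for w, p, g in zip(w2, pi, gamma))
    return beta2, j_stat, len(gamma) - 1


# Hypothetical summary statistics for three instruments.
beta, j_stat, dof = two_step_md(
    gamma=[0.52, 0.95, 2.10],
    pi=[1.0, 2.0, 4.0],
    var_gamma=[0.01, 0.02, 0.04],
    var_pi=[0.01, 0.01, 0.02],
)
print(beta, j_stat, dof)
```

Under the null of valid instruments, a statistic of this kind would typically be compared with a chi-squared distribution with `dof` degrees of freedom; the paper's test handles general (non-diagonal, heteroskedasticity-robust) variance matrices.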

2606.19846 2026-06-19 econ.GN q-fin.EC New submission

What Capital After Labor? Forecasting the Talent ROI Transition in the Human-AI Era

Kwan Soo Shin, In Seok Kang

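AI summary: Because AI augmentation breaks the accounting link between labor time and contribution, the paper builds a forecasting framework for the shift from time-based to output-based talent ROI, with ROI inversion as its core theorem. It uses Korea's 52-hour workweek as a case to validate an early overhead-pressure signal and forecasts that output-based firms will lead in TFP growth by 1.5-2.0 percentage points by 2032.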

Comments 90 pages, 6 figures

English abstract

AI augmentation breaks the accounting link between labor time and productive contribution, yet firms continue to evaluate talent through time-based overhead bundles. This paper develops a forecasting framework for the transition from time-based talent accounting to output-based talent ROI in the human-AI era. The framework centres on Theorem 3 (ROI Inversion at τ*) as the empirical spine, with four mechanism theorems: overhead non-additivity, augmentation-saved-time pathways, innovation-premium amplification, and human-AI dyad attribution uncertainty. Korea's staged 52-hour workweek mandate provides an empirical early-warning case. In a DART panel of 365 listed firms (2,281 firm-year observations), the SG&A-to-revenue ratio rose from 18.26 percent in 2018 to 20.06 percent in 2020, corrected mildly in 2021-2022, and peaked at 20.10 percent in 2024. Under the revenue-percentile cohort proxy, two-way fixed effects (+1.56 pp, p = 0.049), pooled event-study estimates (+4.21 pp at t = +3, p = 0.001), and Callaway-Sant'Anna doubly-robust staggered DiD estimates (+4.51 pp at t = +4) converge on a positive overhead-pressure signature. A 2015-2017 backward extension (224 firms, 601 observations) supplies pre-treatment data, providing evidence against pre-existing upward-trend confounds. We read the Korean evidence not as a direct τ* estimate or a point causal magnitude, but as, to our knowledge, the first empirically documented signature of the pre-τ* overhead-pressure regime, where time-based accounting still dominates while AI augmentation and labor-time compression jointly raise overhead. Output-based firms are forecast to outperform time-based peers by 1.5-2.0 percentage points in firm-level TFP growth by 2032. The contribution is a forecasting model and managerial planning tool for the shift to AI-augmented talent ROI accounting.

2606.20041 2026-06-19 econ.GN cs.AI cs.LG q-fin.EC q-fin.GN New submission

AI Economist Agent: An Agentic Framework for Model-Grounded Economic Analysis with RAG, Knowledge Graphs, and Large Language Models

Masahiro Kato

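AI summary: Proposes a RAG-based AI economist agent framework that uses knowledge graphs and large language models for economic scenario analysis. Agents plan the analysis, retrieve evidence, select models, and generate reports, improving the coherence and traceability of economic narratives.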

English abstract

We propose a model-grounded RAG-based AI economist with an agentic framework for economic scenario analysis using large language models (LLMs) and knowledge graphs. While LLMs can generate fluent economic narratives, economists are often required to make economic claims grounded in economic theory and real-world data. Based on this motivation, this study proposes a RAG-based AI economist, which utilizes knowledge graphs including economic data and theory and LLM-based agents to plan the analysis, retrieve relevant evidence, select appropriate models, and generate reports. In our framework, we do not produce quantitative claims directly with the language model alone; instead, we generate narratives grounded in explicit model-based computations and linked to the retrieved evidence via AI agents. We refer to our framework as an AI economist agent. We evaluate the AI economist agent in two applications: economist report generation for U.S. inflation persistence and Federal Reserve policy, and bank stress-test narrative generation for U.S. commercial real estate refinancing stress. The results illustrate how grounding the generated reports improves their economic coherence and traceability.

2606.19794 2026-06-19 econ.GN cs.CY q-fin.EC New submission

Forecasting AI-Era Productivity: The Intellectually Converged Human Framework and a Missing Cognitive Mediator in Production Function Theory

Kwan Soo Shin, In Seok Kang

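AI summary: The paper proposes the Intellectually Converged Human (ICH) framework, which introduces a four-dimensional cognitive construct, convergence capacity (C), as the cognitive mediator between AI and productivity. It uses this to explain the paradox that AI investment has not delivered commensurate productivity gains, and draws on data for 20 OECD countries to validate the explanatory power of the AI-C interaction for variation in total factor productivity.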

Comments 78 pages, 3 figures

English abstract

Why does massive AI investment fail to generate commensurate productivity gains? We argue the paradox is theoretically generated: prevailing production function frameworks encounter a structural boundary by treating AI as a separable factor of production without modeling the cognitive mediation through which AI generates productive value. This directs investment toward deployment when productivity requires prior development of what we term convergence capacity (C). We propose the Intellectually Converged Human (ICH) framework, a fifth-stage framework for production function theory: H-hat = H[1 + phi(A,C)], where effective productive capacity equals human capital (H) scaled by an augmentation factor [1 + phi], with phi jointly determined by AI utilization intensity (A) and convergence capacity (C), a four-dimensional cognitive construct encompassing embodied understanding, metacognition, temporal integration, and integrative thinking. The production function Y = F(K, H-hat) provides a human-centered mechanism for Solow's TFP residual: A_Solow = [1 + phi(A,C)]^(1-alpha). The framework predicts three augmentation regimes with distinct policy implications. Descriptive cross-national analysis of 20 OECD economies shows the AI×C interaction is associated with 86% of TFP variance versus 31% for AI alone, a pattern-consistent finding in the small-n theoretical tradition. South Korea exemplifies national-scale under-augmentation: high H, substantial A, and low C produce phi = 0. We distinguish convergence capacity from adjacent constructs (absorptive capacity, dynamic capability, and human capital) and demonstrate that C constitutes the specific cognitive mediator that prior frameworks have left implicit. We derive C-first policy prescriptions and offer three empirically testable propositions with a falsifiable 10-year forecast.
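
As a worked illustration of the Solow-residual expression quoted in the abstract, suppose that F is Cobb-Douglas with capital share alpha. This functional form is an assumption made here for the derivation; the abstract does not state it. Substituting the augmented human capital term then gives:

```latex
\begin{align*}
Y &= K^{\alpha}\,\hat{H}^{1-\alpha}, \qquad \hat{H} = H\,[1+\phi(A,C)] \\
  &= K^{\alpha}\,\bigl(H\,[1+\phi(A,C)]\bigr)^{1-\alpha} \\
  &= \underbrace{[1+\phi(A,C)]^{1-\alpha}}_{A_{\mathrm{Solow}}}\; K^{\alpha} H^{1-\alpha}.
\end{align*}
```

Under this assumption, phi = 0 gives A_Solow = 1, so measured TFP shows no augmentation gain however large A is, which is consistent with the South Korea reading in the abstract.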

2606.19599 2026-06-19 eess.SY cs.SY econ.EM New submission

Ramping Procurement and Bid-Cost Recovery in Real-Time Market

Cong Chen, Valentina Norambuena, Lang Tong

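AI summary: Studies ramping procurement co-optimized with economic dispatch under net-demand uncertainty, analyzes single-interval and multi-interval co-optimization designs, proposes an analytical framework for evaluating generator profits, consumer payments, bid-cost recovery, and operational efficiency, and compares three pricing mechanisms.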

Comments 4 figures

English abstract

We study ramping procurement co-optimized with economic dispatch under net-demand uncertainty. We examine two flexible ramp product designs implemented by grid operators: single-interval and multi-interval co-optimization. Both rely on rolling-window stochastic optimization with binding and advisory interval decisions. We develop analytical frameworks to evaluate generator profits, consumer payments, bid cost recovery (BCR), and operational efficiency. In particular, net-demand uncertainty may lead to generator under-compensation, requiring discriminatory BCR. While operational efficiency is invariant to energy and ramp prices, producer profits and consumer payments depend critically on pricing. We examine locational marginal pricing (LMP) and two uniform pricing schemes: maximum dispatch cost pricing (MDCP) and maximum temporal locational marginal pricing (MTLMP). With out-of-market BCR, LMP yields discriminatory energy prices, whereas MDCP eliminates BCR and MTLMP does so in most cases. This property enables us to establish truthful bidding incentives for price-taking generators under MDCP. Our analysis highlights trade-offs between single- and multi-interval co-optimization and pricing designs: single-interval energy-ramp co-optimization is advantageous under high forecast uncertainty and moderate ramping requirements, whereas multi-interval co-optimization is superior when net-demand forecasts are relatively accurate and ramp needs are challenging. Empirical results on CAISO and ERCOT data show that MDCP and MTLMP increase producer profits with negligible BCR, albeit at the expense of higher consumer payments relative to LMP.

2606.19777 2026-06-19 physics.soc-ph econ.GN q-fin.EC New submission

Have Data Centers Raised Your Electric Bill? Causal Evidence from the United States

Asa Watten, John Bistline, Geoffrey Blanford

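AI summary: Using an instrumental variables approach, the paper finds that data centers modestly lowered average retail electricity rates in the United States over 2015-2024, which it attributes to economies of scale in the power system.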

English abstract

Using an instrumental variables approach, we estimate that data centers caused average retail electricity rates in the United States to fall modestly from 2015 to 2024. Despite prevailing sentiment, the finding is consistent with economic reasoning: existing large power system fixed costs, economies of scale in transmission and distribution, and declining unit costs for generation imply that durable demand growth lowers average prices. We find patterns of economies of scale for transmission, distribution, and generation costs as well as within and across retail customer classes. We caution that future supply constraints could reverse the effect.

2606.17165 2026-06-19 stat.ME cs.AI econ.EM math.ST stat.TH New submission

Statistical Foundations of LLM-based A/B Testing: A Surrogacy Framework for Human Causal Inference

Joel Persson, Mårten Schultzberg, Sebastian Ankargren

Affiliation: Spotify USA, Inc.

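AI summary: Proposes a surrogacy framework and shows that, under conditions weaker than distributional equivalence, calibrating LLM outputs identifies the average treatment effect. It also analyzes the bias and variance introduced by stochasticity.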

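AI abstract (translated from the Chinese)

Organizations and researchers are increasingly interested in using large language models (LLMs) in place of human participants in A/B tests, hoping to experiment faster and at lower cost. We study when a treatment effect estimated on LLM outcomes can recover the effect measured on the human population of interest. Distributional equivalence between LLM and human outcomes would make any standard estimator valid, but it is unrealistic. We therefore develop a statistical framework that adapts surrogate endpoint theory to LLMs. The framework shows that calibrating LLM outcomes to human outcomes identifies the average treatment effect under surrogacy and comparability conditions that are jointly weaker than distributional equivalence. When these conditions fail, the effect of interest is only partially identified; we provide diagnostics that can falsify surrogacy on historical experiments and give bounds on the worst-case bias under limited overlap. We further show that the stochasticity inherent to LLMs introduces bias and variance, but that using the average of multiple draws as the surrogate mitigates both. We demonstrate the methods and theory in simulations and in an application to A/B tests of Upworthy headlines. A central conclusion is that the validity of LLM outcomes as surrogates can only be falsified for past treatments and cannot be verified for new ones, so human experiments remain indispensable for novel interventions. We discuss the role of LLM choice, prompting, and temperature as design variables, and how to size human experiments for validation.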

English abstract

Organizations and researchers show increasing interest in using large language models (LLMs) in place of human participants in A/B tests, in the hope of experimenting faster and at lower cost. We study when a treatment effect estimated on LLM outcomes can recover the effect that would have been measured on the human population of interest. Distributional equivalence between LLM and human outcomes would make any standard estimator valid but is unrealistic. We therefore develop a statistical framework that adapts surrogate endpoint theory to LLMs, showing that calibrating LLM outcomes to human outcomes identifies the average treatment effect under surrogacy and comparability conditions that are jointly weaker than distributional equivalence. We present a falsification test for surrogacy and a bound on the worst-case bias from limited overlap between the LLM and human samples. We further show that the stochasticity inherent to LLMs can weaken surrogacy for identification while also introducing bias and variance during estimation, but that using an average over multiple LLM draws per unit as the surrogate mitigates these issues. Simulations validate the results, and an empirical application to A/B tests on Upworthy headlines shows that raw LLM predictions recover only 39% of the human treatment effect while nonparametric calibration closes the gap. A central takeaway is that A/B testing on LLMs yields correct results only by assumption, whereas A/B testing on humans is correct by design, and that the required assumptions are hardest to justify precisely where A/B testing on LLMs promises the greatest benefit. We discuss the role of LLM choice, prompting, and temperature as design variables, the compounded challenge posed by long-term outcomes, and how to size human pilot studies for validation.
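
The point about averaging multiple LLM draws can be illustrated with a small simulation. This is a sketch under strong assumptions that are not from the paper: each draw is modeled as a fixed underlying signal plus independent Gaussian noise, and the function name, noise level, and number of draws are hypothetical. It shows only the variance reduction from averaging, not the paper's identification or bias results.

```python
import random
import statistics


def surrogate(true_signal, k, noise_sd, rng):
    """Average k noisy draws around a unit's underlying signal."""
    draws = [true_signal + rng.gauss(0.0, noise_sd) for _ in range(k)]
    return sum(draws) / k


rng = random.Random(0)  # fixed seed so the run is reproducible
signal = 1.0

# Surrogate built from a single draw versus the mean of 16 draws, for 2000 units each.
single = [surrogate(signal, 1, 1.0, rng) for _ in range(2000)]
averaged = [surrogate(signal, 16, 1.0, rng) for _ in range(2000)]

var_single = statistics.pvariance(single)
var_averaged = statistics.pvariance(averaged)
print(var_single, var_averaged)
```

With independent draws the noise variance of the averaged surrogate shrinks roughly in proportion to 1/k, so the cost of extra LLM calls buys a less noisy input to calibration.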