Can Large Language Models Revolutionize Survey Research? Experiments with Disaster Preparedness Responses
Yan Wang, Ziyi Guo, Christopher McCarty
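AI summary: This paper examines the use of large language models in survey research and experimentally validates their effectiveness on disaster preparedness responses. It proposes a five-stage framework covering questionnaire design, sample selection, pilot testing, missing-data imputation, and post-collection analysis, and introduces a co-occurrence knowledge graph based on Protection Motivation Theory along with seven LLM configurations.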
Survey research faces mounting structural challenges: declining response rates, sample bias, block-wise missingness among at-risk respondents, and AI-assisted fraudulent completions in online panels. Large language models (LLMs) have been proposed as a remedy, yet rigorous evaluations across the full survey workflow remain scarce, particularly in disaster contexts where data quality matters most. We present and evaluate a five-stage framework for LLM integration covering questionnaire design, sample selection, pilot testing, missing-data imputation, and post-collection analysis, using the 2024 Hurricane Milton preparedness survey of Florida residents (n=946) as a shared empirical testbed. We introduce a Protection Motivation Theory (PMT)-constrained co-occurrence knowledge graph and develop seven LLM configurations spanning zero-shot inference, retrieval-augmented baselines, and novel theory-informed variants. Our proposed Anchored Marginal Theory-Informed LLM (A-TLM) outperforms all three classical imputation baselines (IPW/MI, MICE+PMM, missForest) on RMSE under disaster-relevant block-wise MNAR conditions (S4 RMSE 1.439 vs. 1.496 for the next-best), while achieving near-zero signed bias (-0.121) where the random-forest imputer produces the largest absolute bias (-0.631). Organizing retrieval around PMT causal structure and integrating all evidence in a single model call outperforms unstructured retrieval and staged sequential inference (MAE 0.993 vs. 1.097 for standard RAG). We document that near-zero aggregate bias can mask opposing subgroup errors and propose subgroup-stratified bias auditing as a reporting standard. A retrieval-constrained knowledge-graph chatbot demonstrates that hallucination is architecturally manageable through grounded refusal.
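The claim that near-zero aggregate bias can mask opposing subgroup errors can be illustrated with a minimal sketch. The data, the subgroup labels ("owner", "renter"), and the function names below are hypothetical and are not taken from the Hurricane Milton survey or the authors' code. The sign convention (imputed minus observed) is also an assumption, since the abstract does not define it.

```python
import math
from collections import defaultdict


def signed_bias(observed, imputed):
    """Mean of (imputed - observed); the sign convention is an assumption."""
    return sum(p - t for t, p in zip(observed, imputed)) / len(observed)


def mae(observed, imputed):
    """Mean absolute error."""
    return sum(abs(p - t) for t, p in zip(observed, imputed)) / len(observed)


def rmse(observed, imputed):
    """Root mean squared error."""
    return math.sqrt(
        sum((p - t) ** 2 for t, p in zip(observed, imputed)) / len(observed)
    )


def subgroup_bias(observed, imputed, groups):
    """Signed bias computed separately within each subgroup label."""
    buckets = defaultdict(lambda: ([], []))
    for t, p, g in zip(observed, imputed, groups):
        buckets[g][0].append(t)
        buckets[g][1].append(p)
    return {g: signed_bias(ts, ps) for g, (ts, ps) in buckets.items()}


# Hypothetical Likert-style responses: held-out observed values vs. imputed values.
observed = [2, 3, 4, 2, 3, 4]
imputed = [3, 4, 5, 1, 2, 3]
groups = ["owner", "owner", "owner", "renter", "renter", "renter"]

print("aggregate bias:", signed_bias(observed, imputed))
print("MAE:", mae(observed, imputed))
print("RMSE:", rmse(observed, imputed))
print("bias by subgroup:", subgroup_bias(observed, imputed, groups))
```

In this toy case the aggregate signed bias is exactly zero, yet one subgroup is over-imputed by one scale point and the other under-imputed by one. That pattern is what a subgroup-stratified audit is meant to surface, and it is why MAE and RMSE are reported alongside signed bias.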