Flaws in the LLM Automation Narrative
George Perrett, Javae Elliott, Jennifer Hill, Marc Scott
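AI summary: A new benchmark that requires writing code to complete a data analysis task finds that a frontier LLM falls short of human experts in average performance, variance, and error magnitude, challenging the claim that LLMs perform at the level of human experts.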
Large Language Models (LLMs) are increasingly described as performing at the level of human experts on knowledge economy tasks. These claims rest primarily on how LLMs perform on benchmarking tasks that measure average performance across standardized datasets. Many benchmarking tasks share two primary limitations: they often measure performance on content directly included in LLM training data, and they frequently do not assess the reliability of LLM performance or the magnitude of LLM errors. In high-stakes contexts, however, these qualities are critically important. Through a novel LLM benchmarking task that requires writing computer code to complete a data analysis task, we compare the performance of a frontier LLM against submissions from human experts and explicitly measure the variance of responses and the magnitude of errors. Our study reveals that the human experts perform better on average on a range of metrics and demonstrate less variability in performance. Our results provide evidence that LLMs do not consistently perform at the level of human experts and demonstrate the importance of measuring variance and assessing error magnitude in LLM benchmark evaluations.
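To illustrate why average performance alone can mislead, the following minimal sketch compares two sets of absolute errors that share the same mean but differ in variance and in worst-case error. The numbers, the `summarize` helper, and the group names are hypothetical and invented purely for illustration; they are not data, metrics, or code from the study.

```python
from statistics import mean, pvariance


def summarize(errors):
    """Summarize absolute errors from repeated attempts at one task."""
    return {
        "mean_abs_error": mean(errors),   # what an average-only benchmark reports
        "variance": pvariance(errors),    # how consistent the attempts are
        "max_abs_error": max(errors),     # how bad the worst attempt is
    }


# Hypothetical absolute errors, made up for illustration (NOT from the study).
group_a = [1.0, 1.0, 1.0, 1.0]  # consistently small errors
group_b = [0.0, 0.0, 0.0, 4.0]  # usually exact, occasionally far off

summary_a = summarize(group_a)
summary_b = summarize(group_b)

for name, summary in (("group_a", summary_a), ("group_b", summary_b)):
    print(name, summary)
```

Under these assumed numbers, both groups have the same mean absolute error, so an average-only comparison would treat them as equivalent, while the variance and the maximum error separate them clearly.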