Text Analytics Evaluation Framework: A Case Study on LLMs and Social Media
Yuefeng Shi, Nedjma Ousidhoum, Jose Camacho-Collados
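AI summary: This paper proposes a question-based evaluation framework that uses 470 manually curated questions to assess the semantic understanding and reasoning abilities of LLMs over aggregated text data, and it reveals performance bottlenecks when LLMs process large-scale text data.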
Details
LLMs have demonstrated exceptional proficiency in a wide range of NLP tasks. However, a notable gap remains in practical data analysis scenarios, particularly when LLMs must process long sequences of unstructured documents, such as news feeds or, as specifically addressed in this paper, social media posts. To empirically assess the effectiveness of LLMs in this setting, we introduce a question-based evaluation framework comprising 470 manually curated questions designed to evaluate LLMs' semantic understanding and reasoning abilities over aggregated text data. We apply our benchmark to diverse Twitter datasets covering various NLP tasks, including sentiment analysis, hate speech detection, and emotion recognition. Our results reveal that performance depends heavily on input scale and on the complexity of the data sources, declining noticeably in multi-label or target-dependent scenarios. In addition, as task complexity increases, performance drops progressively from basic semantic existence identification to more demanding operations such as comparison, counting, and calculation. Furthermore, once the input size grows beyond 500 instances, we identify a limitation common across LLMs, and particularly open-weight models: performance degrades substantially, especially on numerical tasks. These findings highlight critical architectural bottlenecks in current LLMs for performing rigorous quantitative analysis over large text collections.
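The four question types named in the abstract (existence, comparison, counting, calculation) can be illustrated with a minimal sketch of how reference answers might be derived from a labeled collection of posts. This is not the authors' implementation: the toy posts, the label set, and the function names below are all hypothetical, and the percentage used for the calculation question is only one plausible reading of that category.

```python
from collections import Counter

# Hypothetical toy data: (post text, sentiment label). Not taken from the paper.
posts = [
    ("Loving the new update!", "positive"),
    ("This app keeps crashing.", "negative"),
    ("Release notes are out.", "neutral"),
    ("Best feature so far.", "positive"),
    ("Support never replied.", "negative"),
    ("Great job, team!", "positive"),
]


def label_counts(posts):
    """Count how many posts carry each label."""
    return Counter(label for _, label in posts)


def existence(posts, label):
    """Existence: does at least one post carry this label?"""
    return label_counts(posts)[label] > 0


def comparison(posts, label_a, label_b):
    """Comparison: are there more posts labeled label_a than label_b?"""
    counts = label_counts(posts)
    return counts[label_a] > counts[label_b]


def counting(posts, label):
    """Counting: how many posts carry this label?"""
    return label_counts(posts)[label]


def calculation(posts, label):
    """Calculation (assumed form): percentage of posts carrying this label."""
    return round(100 * counting(posts, label) / len(posts), 1)


print(existence(posts, "neutral"))                  # True
print(comparison(posts, "positive", "negative"))    # True
print(counting(posts, "positive"))                  # 3
print(calculation(posts, "positive"))               # 50.0
```

Reference answers of this kind are exact and cheap to compute from gold labels, which is what would make a model's answer over the raw, unlabeled posts checkable; the model itself has to infer each label from the text before it can aggregate.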