QUIET: A Multi-Blank Cascaded Story Cloze Benchmark for LLM Creative Generation Capability
Bo Zou, Chao Xu
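AI summary: Proposes the QUIET benchmark, which objectively evaluates the creative generation capability of large language models through multi-blank cascaded story cloze and an information-theoretic automated scoring protocol.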
Details
Evaluating the creative capability of large language models (LLMs) faces a dual challenge. First, existing benchmarks (e.g., Story Cloze Test, HellaSwag) use multiple-choice recognition paradigms, so they measure a model's ability to discriminate among narrative continuations rather than its ability to generate creative text. Second, rubric-based scoring and LLM-as-Judge methods rely on subjective dimension ratings or on natural-language outputs from a model, and therefore cannot provide an objective, automated scoring mechanism. This paper proposes QUIET (Quality Understanding via Interlocked Evaluation Testing), a diagnostic benchmark for LLM creative capability based on multi-blank cascaded story cloze. QUIET places N blanks (10-20) in a structurally complete story. Each blank carries an explicit content constraint, and the blanks are linked by cascade dependencies: the content filled into earlier blanks constrains the feasible solution space of later blanks. The evaluated model (or human participant) fills all blanks in open-ended generation mode, and the results are scored by an information-theoretic automated protocol without human grading. The scoring protocol directly operationalizes the "calibrated surprise" theoretical framework (Zou & Xu, 2026a). For each blank k, a composite score is computed: score = satisfy * (1 + lambda * surprise), where lambda = 1.0. Here, "satisfy" measures how well the filled-in text satisfies the content constraint (an objective logical-reasoning judgment, not a subjective aesthetic rating), and "surprise" measures how surprising the text is given that the constraint is satisfied. A creative answer that fails the constraint scores zero; an answer that satisfies the constraint but is mediocre scores low; an answer that both satisfies the constraint and is surprising scores high.
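The per-blank formula can be illustrated with a minimal Python sketch. Only the formula and lambda = 1.0 come from the abstract. The names `BlankResult`, `blank_score`, and `story_score` are hypothetical, the [0, 1] ranges for "satisfy" and "surprise" are assumptions, and averaging over blanks is an assumed aggregation, since the abstract does not say how per-blank scores are combined or how the two components are estimated.

```python
from dataclasses import dataclass
from typing import Sequence

LAMBDA = 1.0  # value stated in the abstract


@dataclass(frozen=True)
class BlankResult:
    """Hypothetical container for the two per-blank components."""
    satisfy: float   # assumed to lie in [0, 1]; 0 means the constraint is violated
    surprise: float  # assumed to lie in [0, 1]; 0 means a fully predictable filling


def blank_score(satisfy: float, surprise: float, lam: float = LAMBDA) -> float:
    """Composite score from the abstract: satisfy * (1 + lambda * surprise)."""
    if not 0.0 <= satisfy <= 1.0:
        raise ValueError("satisfy is assumed to lie in [0, 1]")
    if not 0.0 <= surprise <= 1.0:
        raise ValueError("surprise is assumed to lie in [0, 1]")
    return satisfy * (1.0 + lam * surprise)


def story_score(results: Sequence[BlankResult], lam: float = LAMBDA) -> float:
    """Assumed aggregation: the mean of the per-blank scores."""
    if not results:
        raise ValueError("at least one blank is required")
    total = sum(blank_score(r.satisfy, r.surprise, lam) for r in results)
    return total / len(results)


if __name__ == "__main__":
    examples = {
        "surprising but violates the constraint": BlankResult(0.0, 0.9),
        "satisfies the constraint but mediocre": BlankResult(1.0, 0.0),
        "satisfies the constraint and surprising": BlankResult(1.0, 1.0),
    }
    for label, r in examples.items():
        print(f"{label}: {blank_score(r.satisfy, r.surprise):.2f}")
    print(f"story mean: {story_score(list(examples.values())):.2f}")
```

Because "satisfy" multiplies the whole expression, it acts as a gate: no amount of surprise can rescue a filling that violates its constraint, which matches the three cases described in the abstract. Under the assumed ranges and lambda = 1.0, surprise can at most double the score of a fully satisfying answer.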