The Constraint Tax: Measuring Validity-Correctness Tradeoffs in Structured Outputs for Small Language Models
Jaideep Ray
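AI summary: This paper proposes the "constraint tax" measurement protocol, shows experimentally that hard output constraints significantly reduce the answer accuracy and executable accuracy of small language models, and recommends that production systems report schema validity, answer accuracy, executable accuracy, and wrong-valid-schema rate separately.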
Production LLM systems increasingly require machine-readable outputs: JSON objects, typed traces, regex-constrained fields, and tool-call schemas. This paper targets on-device and low-cost small language model (SLM) deployments, where sub-3B models are attractive for privacy, latency, and commodity hardware but have limited capacity to satisfy schemas while solving tasks. The usual engineering assumption is that hard output constraints improve reliability without changing the underlying answer. We show that this assumption is unsafe for small models. We introduce constraint tax, a measurement protocol for isolating the answer and executable-accuracy loss caused by structured-output constraints at fixed model, fixed task distribution, and fixed problem instances. Across 15,000 commodity-GPU generations with Qwen2.5-0.5B, Qwen2.5-1.5B, and SmolLM2-1.7B, hard answer-only schema decoding raises schema validity from 61.5% to 100.0%, but lowers answer accuracy from 19.7% to 11.0% and increases wrong-valid-schema outputs from 49.5% to 88.9%. The strongest industry analogue is a deterministic calendar tool-call task: Qwen2.5-1.5B achieves 91.5% executable accuracy with prompt-only JSON but only 48.0% under the same hard tool-call schema, while both modes are 100.0% schema-valid. The error is semantic, not structural. We also show that the 3B boundary still pays a direct-schema tax and that delayed packaging supports a constructive design pattern: reason free, constrain late. The practical conclusion is direct: production systems should report schema validity, answer accuracy, executable accuracy, and wrong-valid-schema rate separately.
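To make the reporting recommendation concrete, the following is a minimal sketch of how the four metrics and a paired constraint-tax figure might be computed from already-graded generations. The `Outcome` record, the `report` and `constraint_tax` helpers, and the metric definitions are assumptions made for illustration; in particular, wrong-valid-schema rate is assumed here to be the share of all outputs that are schema-valid but have a wrong answer, and the tax is assumed to be a simple difference in a metric between the unconstrained and constrained runs on the same problem instances. The abstract does not give the paper's exact formulas, and the toy data below is invented, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Outcome:
    """One generation, already scored by a hypothetical external grader."""
    schema_valid: bool                         # output parsed and matched the schema
    answer_correct: bool                       # extracted answer matched the reference
    executable_correct: Optional[bool] = None  # None when the task has no executable check


def rate(flags):
    """Fraction of True values; NaN when there is nothing to score."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else float("nan")


def report(outcomes):
    """Report the four metrics separately (definitions are assumptions)."""
    executable = [o.executable_correct for o in outcomes
                  if o.executable_correct is not None]
    return {
        "schema_validity": rate(o.schema_valid for o in outcomes),
        "answer_accuracy": rate(o.answer_correct for o in outcomes),
        "executable_accuracy": rate(executable),
        # Assumed definition: valid schema AND wrong answer, over all outputs.
        "wrong_valid_schema_rate": rate(
            o.schema_valid and not o.answer_correct for o in outcomes
        ),
    }


def constraint_tax(free, constrained, metric="answer_accuracy"):
    """Metric lost when moving from unconstrained to constrained decoding.

    Both lists are assumed to cover the same model and the same problem
    instances, so the difference isolates the effect of the constraint.
    """
    return report(free)[metric] - report(constrained)[metric]


# Toy, invented outcomes for four shared problem instances.
free = [
    Outcome(schema_valid=True, answer_correct=True),
    Outcome(schema_valid=False, answer_correct=True),
    Outcome(schema_valid=True, answer_correct=False),
    Outcome(schema_valid=False, answer_correct=False),
]
constrained = [
    Outcome(schema_valid=True, answer_correct=True),
    Outcome(schema_valid=True, answer_correct=False),
    Outcome(schema_valid=True, answer_correct=False),
    Outcome(schema_valid=True, answer_correct=False),
]

print(report(free))
print(report(constrained))
print("answer-accuracy tax:", constraint_tax(free, constrained))
```

Read under the same assumed difference definition, the aggregate figures in the abstract correspond to an answer-accuracy gap of 19.7 - 11.0 = 8.7 percentage points, and the calendar tool-call task to an executable-accuracy gap of 91.5 - 48.0 = 43.5 percentage points, in both cases with schema validity at or rising to 100.0%.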