Can I Take Another Dose? Evaluating LLM Decision-Making Under Temporal Uncertainty in OTC Dosing QA
Maroof Kousar, Yibo Hu
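AI summary: Introduces DOSEBENCH, a benchmark that evaluates how well large language models handle temporal reasoning, constraint following, and uncertainty in over-the-counter dosing QA.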
Comments: 16 pages, 7 figures
Large language models (LLMs) are increasingly used for everyday health questions, including whether a user can safely take another dose of an over-the-counter (OTC) medication. Answering correctly requires tracking dose timing, computing rolling 24-hour intake, following product-label constraints, and handling incomplete medication histories, yet this common safety-relevant setting remains underexplored in existing medical QA evaluations. We introduce DOSEBENCH, a focused benchmark of 81 curated OTC dosing scenarios covering adult acetaminophen and ibuprofen use, with manually annotated gold references. We evaluate four LLMs across repeated runs using metrics for decision correctness, consistency, explanation verifiability, failure types, and confidence-related signals, yielding 1,620 model responses. Our results show that models frequently struggle with rolling-window reasoning and ambiguity-sensitive cases, and that stable or confident-looking responses can still violate dosing constraints. These findings suggest that OTC dosing QA provides a narrow yet practical testbed for evaluating temporal reasoning, constraint following, and safety-relevant uncertainty handling in medical QA.
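To make the rolling-window reasoning named in the abstract concrete, the following is a minimal Python sketch of how a reference decision might be computed. It is not the authors' annotation procedure. The function name `can_take_dose`, the three-way answer labels, and the parameters `max_daily_mg` and `min_interval_hours` are all hypothetical, and the numbers in the usage note are placeholders rather than label guidance.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

# A dose is (time taken, amount in mg). The time is None when the user
# cannot say when the dose was taken (an incomplete medication history).
Dose = Tuple[Optional[datetime], float]


def can_take_dose(
    history: List[Dose],
    now: datetime,
    next_mg: float,
    max_daily_mg: float,        # hypothetical label limit per rolling 24 hours
    min_interval_hours: float,  # hypothetical minimum spacing between doses
) -> str:
    """Return "yes", "no", or "uncertain" for taking next_mg at time now."""
    # Incomplete history: any dose with an unknown time blocks a firm answer.
    if any(taken_at is None for taken_at, _ in history):
        return "uncertain"

    # Spacing constraint: compare against the most recent recorded dose.
    if history:
        last_taken = max(taken_at for taken_at, _ in history)
        if now - last_taken < timedelta(hours=min_interval_hours):
            return "no"

    # Rolling window: count doses in the 24 hours ending at `now`,
    # not doses taken since midnight.
    window_start = now - timedelta(hours=24)
    total_mg = sum(mg for taken_at, mg in history if window_start < taken_at <= now)

    # Cumulative constraint: the new dose must not push the total over the limit.
    if total_mg + next_mg > max_daily_mg:
        return "no"
    return "yes"
```

With placeholder limits of 3000 mg per 24 hours and 6 hours between doses, three earlier 1000 mg doses taken 23, 16, and 6 hours ago lead this sketch to answer "no" to a fourth 1000 mg dose, even though only one of those doses falls on the current calendar day. That gap between calendar-day counting and rolling-window counting is the kind of case the abstract describes.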