DisasterBench: Benchmarking LLM Planning under Typed Tool Interface Constraints
Zhitong Chen, Kai Yin, Weifeng Zhang, Zhiyuan Wang, Xiangjue Dong, Chengkai Liu, Zhewei Liu, Yiming Xiao, Ali Mostafavi, James Caverlee
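AI summary: Proposes DisasterBench, a benchmark that uses typed tool interfaces to evaluate the structured multi-agent planning ability of LLMs in disaster response, and introduces First-Point-of-Failure (FPoF) for step-level failure attribution, revealing a gap between semantic reasoning and execution constraints.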
Details
Disasters cause severe societal impacts, demanding rapid coordination of heterogeneous AI tools, from satellite analysis to flood prediction and damage assessment, into coherent multi-step workflows. As LLMs increasingly serve as orchestrators of such pipelines, effective coordination requires more than selecting semantically plausible tools: LLMs must generate executable workflows with correct parameter binding and dependency propagation. We introduce DisasterBench, a benchmark for evaluating structured multi-agent planning over semantically similar but operationally distinct disaster-response tools. To enable step-level failure attribution, we further propose First-Point-of-Failure (FPoF), which localizes the earliest root cause in a predicted workflow, separating primary errors from downstream cascading effects. Our evaluation reveals three findings: planning method effectiveness depends strongly on model capacity; tool mismatch and parameter-binding errors dominate first failures, revealing semantic grounding and execution consistency as distinct bottlenecks; and verbose intermediate reasoning can create instruction clash with structured output requirements, disrupting plan generation. Together, these findings highlight a fundamental gap between semantic reasoning and execution-grounded coordination, underscoring the need for planning frameworks that jointly model semantic intent, execution constraints, and workflow consistency. Code, data, and evaluation resources are available at: https://github.com/TamuChen18/DisasterBench_Open
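The idea behind First-Point-of-Failure can be illustrated with a minimal sketch. The abstract does not give the formal definition, so everything below is an assumption made for illustration: the `Step` structure, the tool names, the `$0.output` reference syntax, the error labels, and the rule of comparing a predicted workflow against a reference workflow position by position. It is not the benchmark's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    """One workflow step: a tool name plus its parameter bindings (hypothetical schema)."""
    tool: str
    params: dict = field(default_factory=dict)


def first_point_of_failure(
    predicted: list[Step], reference: list[Step]
) -> Optional[tuple[int, str]]:
    """Return (step index, error label) for the earliest divergence, or None if the plans match.

    Later divergences are deliberately ignored: under this sketch's assumption,
    they are treated as downstream effects of the first error.
    """
    for i, ref in enumerate(reference):
        if i >= len(predicted):
            return (i, "missing_step")
        pred = predicted[i]
        if pred.tool != ref.tool:
            return (i, "tool_mismatch")
        if pred.params != ref.params:
            return (i, "parameter_binding")
    if len(predicted) > len(reference):
        return (len(reference), "extra_step")
    return None


# Hypothetical two-step workflow: predict flooding, then assess damage from that output.
reference = [
    Step("flood_prediction", {"region": "region_a"}),
    Step("damage_assessment", {"flood_map": "$0.output"}),
]
predicted = [
    Step("flood_mapping", {"region": "region_a"}),          # wrong tool, semantically close
    Step("damage_assessment", {"flood_map": "$1.output"}),  # also wrong, but downstream
]

print(first_point_of_failure(predicted, reference))  # (0, 'tool_mismatch')
```

Only the first error is reported here, even though step 1 also has a bad parameter binding. That mirrors the stated goal of separating primary errors from downstream cascading effects; a real attribution procedure would likely also need to handle reordered steps and dependency graphs rather than a strictly linear comparison.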