Abstract
Power system benchmarks usually evaluate numerical solvers, prediction models, or sequential controllers. These benchmarks are necessary, but they do not directly test whether a large language model (LLM) agent can execute an engineering workflow: inspect a grid case, select tools, call simulators, screen contingencies, propose admissible mitigations, validate results, and produce an auditable evidence trail. This paper introduces PowerAgentBench-SS, a steady-state benchmark framework for evaluating tool-using agents in power system operation and planning studies. The benchmark exposes public case data, action constraints, a tool API, and a validation budget to an agent, while a hidden evaluator recomputes physical validity and scores the submitted report. We define the agent interface, tool contract, evidence log, and risk-sensitive metrics, including submitted recall, evidence-backed recall, found recall, false-safe penalties, severity regret, residual violation score, action cost, tool-use efficiency, and workflow diagnostics. To make the framework concrete, we instantiate the protocol in a reproducible DC thermal N-2 contingency-search pilot on deterministic IEEE 39-bus operating-point variants, with scripted baselines, an LLM JSON-command adapter, three locally hosted Ollama LLM agents, and one OpenAI API agent. The results show why solver-only or answer-only evaluation is insufficient: agents are distinguished not only by top-contingency discovery, but also by validation-budget use, explicit submission, type coercions, duplicate validations, evidence-backed reporting, and mitigation behavior.