MAVEN: Improving Generalization in Agentic Tool Calling
Omkar Ghugarkar, Vishvesh Bhat, Muhammad Ahmed Mohsin, Asad Aali
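AI summary: Proposes the MAVEN framework, a lightweight symbolic reasoning scaffold that provides structured decomposition, adaptive tool orchestration, and intermediate verification. It significantly improves model performance on multiple benchmarks at roughly 1/10 the cost of frontier proprietary models.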
Generalization across agentic tool-calling environments remains a central challenge for reliable agentic reasoning systems. Although large language models achieve strong results on individual benchmarks, their ability to compose reasoning strategies, preserve intermediate states, and coordinate tools across domains remains underexplored. We present MAVEN (Modular Agentic Verification and Execution Network), a lightweight symbolic reasoning scaffold for structured decomposition, adaptive tool orchestration, and intermediate verification. We evaluate MAVEN on established tool-calling benchmarks, including BFCL v3, TauBench, Tau2Bench, and AceBench, and introduce MAVEN-Bench, a stress-test benchmark for multi-step mathematical and physical reasoning with explicit verification and adversarial task composition. MAVEN-Bench exposes a substantial gap between partial reasoning quality and end-to-end task success; in direct MAVEN-Bench runs, MAVEN improves its GPT-OSS-120b base model from 48% to 71% accuracy without additional training. It also remains competitive with frontier proprietary baselines while using an open-weight backbone at an estimated cost ratio of roughly 1/10, suggesting that lightweight verification-centered scaffolds can strengthen compositional reasoning and motivate more process-aware evaluation of agents in the wild.
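The abstract does not describe MAVEN's implementation, so the following is only a minimal sketch of the general pattern it names: a task decomposed into steps, a choice among candidate tools per step, and a verification check on each intermediate result before it is passed on. Every name here (`Step`, `run_plan`, the toy tools, and the checks) is hypothetical and chosen for illustration; none of it is taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Tool = Callable[[float], float]


@dataclass
class Step:
    """One unit of a decomposed task (hypothetical structure, not MAVEN's)."""
    name: str
    candidates: List[str]            # tool names to try, in order
    verify: Callable[[float], bool]  # check applied to the intermediate result


def run_plan(x: float, steps: List[Step], tools: Dict[str, Tool]) -> List[float]:
    """Run steps in order, passing each verified result to the next step."""
    trace = [x]  # preserved intermediate state, starting with the input
    for step in steps:
        for tool_name in step.candidates:
            result = tools[tool_name](trace[-1])
            if step.verify(result):
                trace.append(result)
                break  # verified: move on to the next step
        else:
            # No candidate tool produced a result that passed verification.
            raise RuntimeError(f"step {step.name!r} failed verification")
    return trace


# Toy task: kinetic energy 0.5 * m * v**2 with m = 4 and v = 3.
tools: Dict[str, Tool] = {
    "bad_square": lambda v: v * 2,         # deliberately wrong tool
    "square": lambda v: v * v,
    "half_mass": lambda v: 0.5 * 4.0 * v,
}
steps = [
    # Inverse check: the square root of the result must recover v = 3.
    Step("square velocity", ["bad_square", "square"],
         lambda r: abs(r ** 0.5 - 3.0) < 1e-9),
    # Sanity check: kinetic energy cannot be negative.
    Step("apply 0.5 * m", ["half_mass"], lambda r: r >= 0),
]

trace = run_plan(3.0, steps, tools)
print(trace)  # [3.0, 9.0, 18.0]
```

In this toy run the first candidate tool returns 6.0, fails the inverse check, and is replaced by the second candidate, so the error is caught at the step where it occurs instead of propagating to the final answer. That is the kind of gap between partial reasoning quality and end-to-end success that the abstract says MAVEN-Bench is meant to expose.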