Adaptive Multimodal Agents-Based Framework for Automatic Workflow Execution
Susanna Cifani, Mario Luca Bernardi, Marta Cimitile
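AI summary: Proposes a multimodal multi-agent framework that executes workflows automatically by building a topological knowledge base offline and then, online, combining adaptive retrieval-augmented generation with closed-loop collaborative verification.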
Details
- Comments: Copyright 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses. Accepted for publication at the 2026 IEEE International Conference on Evolving and Adaptive Intelligent Systems (EAIS 2026).
Modern information systems require autonomous agents capable of navigating complex workflows, yet current methodologies often struggle with the transition from structured metadata parsing to general environmental perception. While the integration of multimodal large language models (MLLMs) has enabled agents to interact directly with graphical user interfaces (GUIs), existing approaches typically treat task sequences as discrete, linear episodes. This fragmentation prevents agents from capturing the underlying transition topology, limiting their effectiveness in novel or non-stationary scenarios. To address this, we propose a novel multimodal multi-agent framework that achieves automatic workflow execution through a two-phase pipeline. First, during an offline discovery phase, the architecture adaptively constructs a topological knowledge base from fragmented execution logs. Then, during inference, agents apply Adaptive Retrieval-Augmented Generation (RAG) over this fixed, pre-established graph, coupled with a closed-loop collaborative verification protocol that lets them self-correct dynamically while navigating. This graph-based approach improves task decomposition and adaptive navigation. We validate the framework in a real-world context, demonstrating that it maintains high reliability and semantic awareness even with limited training data.
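The two-phase idea can be illustrated with a minimal sketch. It is not the authors' implementation, and every name in it (`build_transition_graph`, `plan_path`, `execute_with_verification`, `make_flaky_env`, the example states and actions) is hypothetical. It assumes that screens can be reduced to string labels, it replaces the adaptive RAG step with a plain breadth-first search over the graph, and it replaces MLLM-based verification with a simple comparison between the expected and the observed state.

```python
from collections import defaultdict, deque

# Hypothetical log fragments: each is a list of (state, action, next_state).
EXAMPLE_LOGS = [
    [("login", "submit_credentials", "dashboard"),
     ("dashboard", "open_reports", "reports")],
    [("dashboard", "open_settings", "settings"),
     ("settings", "back", "dashboard")],
    [("reports", "export", "export_dialog")],
]


def build_transition_graph(logs):
    """Offline phase (sketch): merge log fragments into one directed graph."""
    graph = defaultdict(dict)
    for log in logs:
        for state, action, next_state in log:
            graph[state][action] = next_state
    return dict(graph)


def plan_path(graph, start, goal):
    """Stand-in for retrieval: shortest (action, state) path via BFS."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for action, nxt in graph.get(state, {}).items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(action, nxt)]))
    return None  # goal unreachable from start


def execute_with_verification(graph, start, goal, act, max_steps=20):
    """Online phase (sketch): plan, act, observe, and replan from whatever
    state was actually observed, so deviations are corrected in the loop."""
    state, trace = start, []
    for _ in range(max_steps):
        if state == goal:
            return trace
        path = plan_path(graph, state, goal)
        if not path:
            return None  # no known route from the current state
        action, _expected = path[0]
        observed = act(state, action)
        trace.append((state, action, observed))
        state = observed  # a mismatch with _expected triggers a replan
    return trace if state == goal else None


def make_flaky_env(graph, glitch):
    """Toy environment: follows the graph, except that the first attempt at
    each (state, action) key in `glitch` lands on the given wrong state."""
    pending = dict(glitch)

    def act(state, action):
        if (state, action) in pending:
            return pending.pop((state, action))
        return graph[state][action]

    return act
```

With `EXAMPLE_LOGS`, no single fragment covers the route from `login` to `export_dialog`, but the merged graph does. If the toy environment misroutes the first `open_reports` to `settings`, the loop replans from `settings` and still reaches the goal, at the cost of two extra steps.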