MailoHLS: Multi-Adapter Structure-Aware Learning for Pareto-Driven HLS Pragma Optimization
Elena Vouvali, Dimosthenis Masouros, Aggelos Ferikoglou, Dimitrios Soudris, Sotirios Xydis
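AI summary: Proposes MailoHLS, a hybrid framework that combines LLM semantic reasoning with GNN structural modeling and uses cross-attention, objective-conditioned LoRA adapters, and Pareto optimization to jointly optimize HLS pragmas. It achieves up to 12.42x speedup for latency optimization and consistently produces near-Pareto-optimal designs.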
High-Level Synthesis (HLS) enables rapid development of FPGA accelerators, yet achieving high quality of results (QoR) remains challenging due to the large and irregular design space induced by compiler directives (a.k.a. pragmas). Selecting effective configurations requires reasoning over complex interactions between program structure, memory behavior, and often conflicting objectives such as latency and resource utilization. Prior model-driven approaches generalize poorly across kernels and fail to capture higher-level optimization intent. Recent Large Language Models (LLMs) capture code semantics and high-level intent, but their sequential representations hinder modeling of structural dependencies and global trade-offs, leading to suboptimal HLS designs. We present MailoHLS, a hybrid framework that combines LLM-based semantic reasoning with GNN-based structural modeling for objective-aware directive optimization. By integrating structural embeddings via cross-attention and leveraging parameter-efficient fine-tuning (PEFT) with objective-conditioned LoRA adapters and Pareto-driven optimization, MailoHLS enables joint reasoning over code semantics, structure, and design trade-offs. On seen and unseen kernels, MailoHLS achieves up to 12.42x and 8.4x speedup, respectively (9.48x and 4.97x geometric mean), for latency optimization, consistently producing near-Pareto-optimal designs. On fully unseen applications, it reaches up to 10.2x speedup (6.58x geometric mean), outperforming high-end LLMs and prior approaches while narrowing the gap to the Pareto frontier.
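The abstract reports results in terms of Pareto optimality and geometric-mean speedup. As a minimal sketch of what those two quantities mean, the snippet below filters a set of candidate designs down to the non-dominated ones and computes a geometric mean. It is not MailoHLS's implementation: the function names `pareto_front` and `geomean`, the two-objective (latency, resource utilization) encoding, and all numbers are hypothetical and chosen only for illustration.

```python
import math


def pareto_front(designs):
    """Return the non-dominated designs, assuming lower is better on both axes.

    Each design is a (latency, resource_utilization) tuple. Design j dominates
    design i if it is no worse on both objectives and strictly better on one.
    """
    front = []
    for i, (lat_i, res_i) in enumerate(designs):
        dominated = any(
            lat_j <= lat_i and res_j <= res_i and (lat_j < lat_i or res_j < res_i)
            for j, (lat_j, res_j) in enumerate(designs)
            if j != i
        )
        if not dominated:
            front.append((lat_i, res_i))
    return sorted(front)


def geomean(values):
    """Geometric mean of positive values, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical (latency in cycles, resource utilization fraction) pairs.
designs = [(100, 0.9), (120, 0.5), (200, 0.3), (150, 0.6), (200, 0.8)]
front = pareto_front(designs)

# Hypothetical per-kernel speedups over a baseline design.
speedups = [2.0, 8.0]
gm = geomean(speedups)
```

Here `(150, 0.6)` and `(200, 0.8)` are dropped because `(120, 0.5)` is better on both objectives, while the three remaining designs each trade latency against resources. The geometric mean is the usual choice for averaging speedup ratios because it is less sensitive to a single large outlier than the arithmetic mean.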