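arXivDaily: arXiv daily research digest, updated Monday through Friday

AI Large Models

Code LLMs / AI Programming

Code generation, software engineering agents, program repair, test generation, and developer tools.

Indexed for today/current date: 8. Sources: cs.SE, cs.CL, cs.AI, cs.LG, cs.PL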
2606.06133 2026-06-18 cs.SE cs.AI cs.LG cs.LO Version update 90%

TLA-Prover: Verifiable TLA+ Specification Synthesis via Preference-Optimized Low-Rank Adaptation


Eric Spencer, Arslan Bisharat, Brian Ortiz, Khushboo Bhadauria, TaiNing Wang, George K. Thiruvathukal, Konstantin Laufer, Mohammed Abuhamad

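Affiliations: Department of Computer Science, Loyola University Chicago

Topic match (code generation): TLA+ formal specification synthesis; preference optimization raises the pass rate.

AI summary: Proposes TLA-Prover, a model that combines supervised fine-tuning with repair-based group-relative policy optimization to synthesize TLA+ specifications checked by the TLC model checker. It reaches a 30% pass rate at the Gold and Diamond tiers, roughly 3.5x the untuned baseline.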

Comments 12 pages, 5 tables, 3 figures. Accepted at the 21st International Conference on Software Technologies (ICSOFT 2026)


Abstract

TLA+ is a formal specification language for verifying distributed systems and safety-critical protocols. Large language models (LLMs) frequently produce TLA+ specifications that fail the TLC model checker for semantic reasons. Across 25 LLMs, the best public baseline reaches a 26.6% syntactic parse rate and an 8.6% semantic model-check pass rate. We present TLA-Prover, a 20-billion-parameter model for TLA+ specification synthesis. Training combines supervised fine-tuning (SFT) on verified examples with repair-based group-relative policy optimization (GRPO). In the GRPO stage, the model learns to fix its own rejected specifications. We also train a direct preference optimization (DPO) variant from the same SFT checkpoint as an ablation. TLC provides the reward signal directly, with no learned reward model. Four tiers grade each output: Bronze (parses), Silver (no warnings), Gold (passes TLC), and Diamond. To reach Diamond, the model's correctness property is automatically altered in a small way; TLC must then detect a violation. If TLC still passes, the property was always true and contributes nothing, so the output fails Diamond. TLA-Prover reaches 9/30 (i.e., pass@1 = 30%) at both Gold and Diamond on a held-out 30-problem benchmark. This is roughly 3.5x the 8.6% untuned baseline. The DPO variant reaches 20% at Diamond. Gold and Diamond coincide at every checkpoint, which rules out the trivial-property failure mode.
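The four-tier grading described in the abstract can be illustrated with a minimal sketch. It assumes the tiers are cumulative (each tier requires the previous one), which the abstract suggests but does not state outright. The `CheckResult` record and its field names are hypothetical; the sketch does not call TLC and only models the decision logic over checker outcomes that would be collected elsewhere.

```python
from dataclasses import dataclass

# Assumed tier order, lowest to highest; "none" means the spec did not parse.
TIERS = ["none", "bronze", "silver", "gold", "diamond"]


@dataclass
class CheckResult:
    """Hypothetical summary of checker outcomes for one generated spec."""
    parses: bool            # the spec parses
    warning_free: bool      # no warnings were reported
    tlc_passes: bool        # TLC model-checks the spec successfully
    mutant_violation: bool  # TLC reports a violation once the property is mutated


def grade(result: CheckResult) -> str:
    """Map checker outcomes to a tier, assuming tiers are cumulative."""
    if not result.parses:
        return "none"
    if not result.warning_free:
        return "bronze"
    if not result.tlc_passes:
        return "silver"
    if not result.mutant_violation:
        # TLC still passes on the mutated property: the property was trivial.
        return "gold"
    return "diamond"


def pass_rate(tiers: list[str], at_least: str) -> float:
    """Fraction of outputs that reach `at_least` or a higher tier."""
    threshold = TIERS.index(at_least)
    return sum(TIERS.index(t) >= threshold for t in tiers) / len(tiers)
```

With 9 of 30 outputs at Diamond, `pass_rate(tiers, "gold")` and `pass_rate(tiers, "diamond")` both give 0.30, and 0.30 / 0.086 is about 3.5, matching the ratio quoted in the abstract.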

2606.18286 2026-06-18 cs.LG New submission 85%

CODEBLOCK: Learning to Supervise Code at the Right Granularity


Zhijie Deng, Ling Li, Jinlong Pang, Kaiqin Hu, Qi Xuan, Zhaowei Zhu, Jiaheng Wei

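Affiliations: Hong Kong University of Science and Technology (Guangzhou); UC Santa Cruz; Ant Group; BAIA, ZJUT; D5Data.ai

Topic match (code generation): Proposes the CodeBlock framework, whose structure-aware sparse supervision improves fine-tuning for code generation.

AI summary: Proposes the CodeBlock framework, which applies sparse supervision to structure-complete code blocks rather than isolated tokens. Using only 1.9% of the supervised tokens, it outperforms full-token fine-tuning on six code-generation benchmarks.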


Abstract

Supervised fine-tuning of code LLMs typically applies uniform cross-entropy loss to all response tokens, implicitly assuming that every token provides equally useful learning signal. Recent token-level selection methods challenge this assumption in natural-language SFT by supervising only high-value tokens. However, directly transferring token-level masking to code can break syntactically and semantically coherent program units, because code depends on structural completeness and definition-use relations. We therefore propose CodeBlock, a structure-aware sparse supervision framework that selects structure-complete code evidence rather than isolated tokens. CodeBlock first selects high-quality instruction-response pairs, then partitions code responses into syntactically coherent coding items, estimates their utility by aggregating generalized cross-entropy over core logic tokens, and reranks them with data-flow reach and bridge signals to prioritize blocks that propagate or connect important program dependencies. During training, the full response remains available as context, while loss is applied only to selected code items and informative natural-language tokens. Experiments on six code-generation benchmarks show that CodeBlock achieves stronger average pass@1 than full-token SFT and competitive selection baselines, while using only 1.9% of supervised response tokens.
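The training-time behavior described above, where the full response stays in context but only selected items contribute to the loss, amounts to a masked average over per-token losses. The following is a minimal sketch of that masking step only. The function name and the span format are assumptions, and the block selection itself (utility estimation, data-flow reranking) is not modeled.

```python
def masked_mean_loss(token_losses, selected_spans):
    """Average per-token losses over the selected spans only.

    token_losses: per-token cross-entropy values for the whole response.
    selected_spans: (start, end) half-open index ranges marking the code
        blocks chosen for supervision (hypothetical format).
    """
    mask = [False] * len(token_losses)
    for start, end in selected_spans:
        for i in range(start, end):
            mask[i] = True
    kept = [loss for loss, keep in zip(token_losses, mask) if keep]
    if not kept:
        return 0.0  # nothing selected: no gradient signal from this response
    return sum(kept) / len(kept)
```

For example, with losses `[0.5, 2.0, 1.0, 0.1]` and a single selected span `(1, 3)`, only the two middle tokens are supervised and the result is 1.5. Masking whole spans rather than single tokens is what keeps each supervised unit syntactically intact.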

2511.00802 2026-06-18 cs.SE cs.CL cs.LG Version update 85%

GrowthHacker: Automated Off-Policy Evaluation Optimization Using Code-Modifying LLM Agents


Jie JW Wu, Ayanda Patrick Herlihy, Ahmad Saleem Mirza, Ali Afoud, Fatemeh Fard

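Affiliations: Michigan Technological University, Houghton; Birmingham City University; University of British Columbia, Kelowna

Topic match (code generation): LLM agents automatically modify code to optimize off-policy evaluation.

AI summary: Proposes the GrowthHacker benchmark, in which LLM agents automatically and iteratively modify code to optimize off-policy evaluation (OPE) implementations. Several frameworks are evaluated on Open Bandit Pipeline and Scope-RL, showing that LLM-based agents can act as automated growth hackers that continuously improve OPE systems.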

Comments Accepted for publication in ACM Transactions on Software Engineering and Methodology (TOSEM), 2026


Abstract

With data-driven development now widely adopted, online A/B testing is an established method for measuring the effects of new technologies. However, deploying online experiments demands resources for design, implementation, and deployment, and may negatively impact users (e.g., unsafe or unethical outcomes) while requiring weeks of data collection. To address this, the growing research area of off-policy evaluation (OPE), or offline A/B testing, assesses new technologies offline using previously collected logged data. OPE is also a fundamental problem in reinforcement learning and is important where online testing is expensive or risky, such as healthcare, recommender systems, education, and robotics. Despite advances in code-generation large language models (LLMs) and agentic workflows, little is known about whether and how LLMs and LLM-based agents can automatically optimize OPE implementations. We propose GrowthHacker, a benchmark that evaluates baseline LLMs and LLM-based agents on large-scale public datasets. GrowthHacker autonomously and iteratively modifies code, runs OPE, and uses the metrics to guide subsequent optimization. We evaluate methods on Open Bandit Pipeline (OBP) and Scope-RL, and develop a two_agent framework that addresses limitations of existing frameworks while reducing complexity. Across both libraries, two_agent shows the highest reliability (98.1%-100% success rate) and positive-outcome rate (78%), with a median improvement of 4.4% among positive outcomes; CrewAI achieves the highest average improvement (37.9%) and is the only framework with zero extreme-value failures. AutoGen and Default each reach 65% positive-outcome rates. These results establish the feasibility of using LLM-based agents as automated "growth hackers" to continuously improve OPE systems, with implications for scaling data-driven decision-making where manual optimization is expensive.
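The modify, run, and measure cycle described above can be sketched as a simple keep-the-best loop. This is an illustration under stated assumptions, not the paper's framework: `propose` stands in for an LLM agent that edits code, `evaluate` stands in for running an OPE pipeline and returning an error metric where lower is better, and both are hypothetical callables supplied by the caller.

```python
from typing import Callable, Tuple


def optimize(
    initial_code: str,
    propose: Callable[[str, float], str],
    evaluate: Callable[[str], float],
    rounds: int = 5,
) -> Tuple[str, float]:
    """Iteratively propose code edits and keep the best-scoring variant.

    Lower scores are assumed to be better (for example, an estimator error).
    """
    best_code, best_score = initial_code, evaluate(initial_code)
    for _ in range(rounds):
        # The previous metric is passed back so it can guide the next edit.
        candidate = propose(best_code, best_score)
        try:
            score = evaluate(candidate)
        except Exception:
            continue  # a run that crashes counts as a failed attempt
        if score < best_score:
            best_code, best_score = candidate, score
    return best_code, best_score
```

Counting crashed runs as failed attempts rather than aborting mirrors the distinction the abstract draws between success rate and positive-outcome rate.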

2606.19315 2026-06-18 cs.LG New submission 80%

Diffusion-Proof: Recipe for Formal Theorem Proving Beyond Auto-Regressive Generation


Ruida Wang, Rui Pan, Pengcheng Wang, Shizhe Diao, Tong Zhang

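Affiliations: University of Illinois Urbana-Champaign; NVIDIA

Topic match (code generation): Diffusion language models for formal theorem proving.

AI summary: Proposes the Diffusion-Proof framework, described by the authors as the first to apply diffusion language models to formal theorem proving. Using whole-proof generation and local correction, it improves over the baseline by 1.61% on ProofNet and 6.14% on MiniF2F, and solves one IMO problem that DeepSeek-Prover-V2-7B could not.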


Abstract

Enhancing the formal math reasoning capabilities of Large Language Models (LLMs) has become a key focus in both the mathematics and computer science communities in recent years. While significant progress has been made in using state-of-the-art Auto-Regressive (AR) LLMs for formal theorem proving, these models suffer from inherent limitations. Their next-token prediction may yield suboptimal performance due to the challenges of long-range coherence and the compounding of errors over long sequences. Recent diffusion LLMs (dLLMs), which generate text through iterative denoising of a multi-token block, offer a promising alternative. However, the application of dLLMs to formal mathematics, where maintaining long-range coherence is critical, remains largely understudied. To address these challenges, we propose Diffusion-Proof, to the best of our knowledge the first framework to train and apply dLLMs for formal theorem proving. Our framework contains training and inference methods for two models. The first is dLLM-Prover-7B, which performs whole-proof writing with long-range coherent tactic usage. The second is dLLM-Corrector-7B, a novel large-block diffusion-based correction model. It leverages the in-filling capabilities of dLLMs to perform local proof correction using bi-directional information. Extensive experiments demonstrate that Diffusion-Proof significantly outperforms the AR LLM baseline trained on the same dataset. Compared to the baseline, Diffusion-Proof achieves an absolute improvement of 1.61% on ProofNet-Test and 6.14% on MiniF2F-Test. Notably, Diffusion-Proof successfully resolves one IMO problem that the more advanced thinking model DeepSeek-Prover-V2-7B could not solve, showcasing the unique advantage of dLLMs in formal theorem proving.
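Local correction by in-filling, as described above, keeps the text on both sides of a faulty span and regenerates only the span. The sketch below shows one way the input for such a step could be prepared. The mask token string and the function name are assumptions, and the abstract does not specify how dLLM-Corrector-7B actually formats its input.

```python
MASK = "<mask>"  # hypothetical mask token; the real model's token may differ


def build_infill_input(tokens, start, end, mask=MASK):
    """Mask tokens[start:end] while keeping the prefix and suffix as context.

    The untouched prefix and suffix supply the bi-directional information
    that an in-filling model can condition on when rewriting the span.
    """
    if not 0 <= start <= end <= len(tokens):
        raise ValueError("span out of range")
    return tokens[:start] + [mask] * (end - start) + tokens[end:]
```

For a proof tokenized as `["intro", "x", "simp", "exact", "h"]` with the faulty tactic at index 2, `build_infill_input(tokens, 2, 3)` leaves every other token in place. An autoregressive model, by contrast, would only see the prefix when regenerating that position.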

2606.19042 2026-06-18 cs.SE cs.AI New submission 80%

Where Did the Variability Go? From Vibe Coding to Product Lines by Regeneration


Xhevahire Tërnava

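Affiliations: LTCI, Télécom Paris, Institut Polytechnique de Paris, Palaiseau, France

Topic match (code generation): AI-driven programming; variability by regeneration.

AI summary: Studies the loss of variability in AI-driven programming (vibe coding) and proposes Variability by Regeneration (VbR), in which the LLM acts as the derivation engine and generates a binary free of dead code for each variant.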

Comments VARIABILITY 2026


Abstract

In vibe coding, an emerging AI-driven paradigm, an LLM generates an entire program from a natural language prompt, but what happens to the variability that traditional software engineering carefully builds into code? To answer this question, we conducted an exploratory analysis on 10 vibe-coded C/C++ projects, which suggests that there is near-zero in-artifact variability, i.e., variability at compile time and runtime. All variability decisions are resolved at a single new binding time, generation time, the moment the LLM produces the source code. Rather than treating this as a defect to fix, we propose Variability by Regeneration (VbR), to our knowledge the first product-line approach in which the LLM acts as the derivation engine, generating a purpose-built binary, free of dead code, for each variant from a declarative specification, while a variant dispatcher transparently routes user requests to the matching binary. We formalise VbR, contrast it with classical SPL derivation, and demonstrate its full pipeline on a wc product family. For SPL engineering, variability in AI-generated software belongs in the specification, not in the code.
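The variant dispatcher mentioned above maps a requested feature selection to the binary generated for exactly that variant. A minimal sketch follows, assuming variants are identified by their feature sets. The class, the feature names, and the binary paths are all hypothetical, and the paper's actual dispatcher may work differently.

```python
from typing import Dict, FrozenSet, Iterable


class VariantDispatcher:
    """Route a requested feature set to the binary built for that variant."""

    def __init__(self) -> None:
        self._binaries: Dict[FrozenSet[str], str] = {}

    def register(self, features: Iterable[str], binary_path: str) -> None:
        """Record the binary generated for one variant."""
        self._binaries[frozenset(features)] = binary_path

    def route(self, requested: Iterable[str]) -> str:
        """Return the binary for the requested features, ignoring their order."""
        key = frozenset(requested)
        try:
            return self._binaries[key]
        except KeyError:
            # No matching variant exists yet; under VbR it would be regenerated.
            raise LookupError(f"no binary for variant {sorted(key)}") from None
```

For a wc-like family, one might register `{"lines"}` and `{"lines", "words"}` as separate variants, each with its own binary. Because each binary contains only its own features, no compile-time or runtime switches are needed inside the code.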

2606.18293 2026-06-18 cs.SE cs.AI New submission 80%

Vibe Coding Ate My Homework: An evaluation of AI approaches to greenfield software engineering and programming


Callum Barbour

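Affiliations: OpenAI

Topic match (code generation): Evaluates the viability of AI programming (vibe coding) in software engineering.

AI summary: Evaluates the viability of "vibe coding" (programming through natural language prompts) for greenfield software engineering tasks and analyzes existing benchmarks, offering insight through an evaluation suite of simple, isolated Python programming tasks.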

Comments 10 pages, 2 figures


Abstract

Thanks to rapid developments in generative AI, we are in the midst of a paradigm shift that may change how we interact with computers forever. We have observed a growth in the use of natural language prompts to build applications and coding infrastructures without underlying knowledge of the field, and this practice has been dubbed "vibe coding." It arguably represents what the field of programming has been building towards since the beginning, with every higher level of abstraction that is conceived. Vibe coding promises to be the endpoint for the meta of high-level programming as far as method of input is concerned: eliminating a human's use of code syntax entirely in favour of programming in their mother tongue. This paper aims to evaluate the viability of vibe coding for greenfield software engineering tasks, as well as analyse the benchmarks that have been used to measure its software engineering prowess. To this end, we have developed an evaluation suite for analysing an LLM's proficiency in carrying out simple, isolated greenfield programming tasks in Python to provide scoped insight on the matter.

2606.19257 2026-06-18 cs.CL New submission 70%

DreamReasoner-8B: Block-Size Curriculum Learning for Diffusion Reasoning Models


Zirui Wu, Lin Zheng, Jiacheng Ye, Shansan Gong, Xueliang Zhao, Yansong Feng, Wei Bi, Lingpeng Kong

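Affiliations: The University of Hong Kong; Peking University

Topic match (code generation): Evaluated on code reasoning benchmarks.

AI summary: Proposes block-size curriculum learning, which trains progressively from fine-grained to coarse-grained block sizes to close the performance gap of block diffusion language models on long chain-of-thought reasoning. DreamReasoner-8B reaches a level comparable to Qwen3-8B on math and code reasoning.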


Abstract

Block diffusion language models accelerate decoding through parallel block-wise denoising, yet whether they can be reliably scaled for long chain-of-thought (CoT) reasoning remains unresolved. To this end, we develop DreamReasoner-8B, an open-source block diffusion reasoning model, and conduct a systematic study of how training and inference block sizes affect long-CoT reasoning. Our analysis reveals a stark performance disparity: training with large block sizes yields remarkably poor reasoning, whereas small block sizes preserve effective reasoning. To bridge this granularity gap, we propose block-size curriculum learning, which gradually transitions training from fine-grained to coarse-grained block sizes, thereby overcoming this limitation and enabling strong reasoning performance that generalizes across diverse inference block sizes. On mathematical and code reasoning benchmarks, DreamReasoner-8B achieves results competitive with leading open autoregressive models such as Qwen3-8B. This work establishes a practical foundation for efficient, reasoning-capable diffusion language models. We release our model at https://github.com/DreamLM/DreamReasoner.
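A block-size curriculum that moves from fine-grained to coarse-grained blocks can be expressed as a schedule from training step to block size. The sketch below uses equal-length stages and the sizes 4, 8, 16, and 32. Both choices are illustrative assumptions, since the abstract does not give the actual schedule or sizes.

```python
def block_size_at(step, total_steps, sizes=(4, 8, 16, 32)):
    """Return the training block size for a step, growing from small to large.

    Training is split into len(sizes) equal-length stages (an assumption);
    `sizes` must be ordered from fine-grained to coarse-grained.
    """
    if total_steps <= 0 or not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    stage = step * len(sizes) // total_steps
    return sizes[stage]
```

With `total_steps=100`, steps 0 to 24 use block size 4, steps 25 to 49 use 8, and the final quarter uses 32. Other shapes, such as spending more steps on the small sizes, fit the same interface.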

2606.18425 2026-06-18 cs.SE cs.AI cs.DC New submission 70%

From Specification to Execution: AI Assisted Scientific Workflow Management


Komal Thareja, Hamza Safri, Rajiv Mayani, Anirban Mandal, Ewa Deelman

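Affiliations: RENCI, University of North Carolina at Chapel Hill, NC, USA; Information Sciences Institute, University of Southern California, Marina del Rey, CA, USA

Topic match (code generation): Uses LLMs to generate workflow code.

AI summary: Proposes an AI-assisted approach that combines specification-driven workflow generation, automated debugging, and distributed execution, integrating Pegasus with an MCP layer to manage large-scale scientific workflows end to end, starting from natural language.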


Abstract

Scientific workflow management systems (WMS) support scalable and reproducible execution of complex pipelines, but workflow design, implementation, and debugging remain largely manual and require significant expertise. Recent approaches using large language models (LLMs) show promise for workflow generation from natural language, but often rely on direct code synthesis, which limits transparency, reproducibility, and integration with workflow systems. We present an AI-assisted approach to scientific workflow management that combines specification-driven workflow generation, automated debugging, and distributed execution. The method introduces a structured specification phase that separates workflow intent, design, and implementation, allowing validation prior to code generation. We also develop an LLM-based debugging agent that diagnoses and resolves failures across multiple system layers. To support distributed execution and user interaction, we integrate Pegasus, a widely used WMS, with a Model Context Protocol (MCP) layer, providing a unified interface for workflow submission, monitoring, and control. We evaluate the approach using a federated learning workflow for medical imaging, chosen for its parallel, iterative, and dependency-intensive structure. The system generated and executed large-scale workflows with thousands of jobs, reduced debugging effort, and allowed non-expert users to construct workflows with expert-level design patterns. These results indicate that end-to-end AI-assisted workflow generation and execution is feasible, and point toward AI-driven platforms for managing the scientific workflow lifecycle.
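Validating a workflow specification before any code is generated could include a structural check of the job dependencies. The sketch below is one such check, assuming the specification has been reduced to a mapping from job names to the jobs they depend on. That format and the job names are hypothetical, and the sketch does not use the Pegasus API. It relies on `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import CycleError, TopologicalSorter


def validate_spec(jobs):
    """Check a job-dependency spec and return one valid execution order.

    jobs: dict mapping each job name to the list of job names it depends on.
    Raises ValueError for undefined dependencies or dependency cycles.
    """
    for name, deps in jobs.items():
        for dep in deps:
            if dep not in jobs:
                raise ValueError(f"job {name!r} depends on undefined job {dep!r}")
    try:
        # static_order() yields every job after all of its dependencies.
        return list(TopologicalSorter(jobs).static_order())
    except CycleError as exc:
        raise ValueError(f"dependency cycle: {exc.args[1]}") from exc
```

For a small federated-learning-style spec such as `{"split": [], "train_a": ["split"], "train_b": ["split"], "aggregate": ["train_a", "train_b"]}`, the returned order places `split` first and `aggregate` last. Catching undefined jobs and cycles at this stage is cheaper than discovering them after submission.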