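arXivDaily: arXiv daily academic digest, updated Monday through Friday

AI Large Models

Large Language Models / LLM

Large language models, pretraining, instruction tuning, post-training, and language model applications.

Indexed today (current date): 148. Sources: cs.CL, cs.AI, cs.LG

1. Post-training (9 papers)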

2410.15595 2026-06-18 cs.AI cs.CL cs.LG Updated version 95%

A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications

Wenyi Xiao, Zechuan Wang, Leilei Gan, Shuai Zhao, Zongrui Li, Ruirui Lei, Wanggui He, Luu Anh Tuan, Long Chen, Hao Jiang, Zhou Zhao, Fei Wu

Institutions: Zhejiang University, Nanyang Technological University, Alibaba Group

Topic match: Post-training. Surveys DPO, a post-training alignment method for large models.

AI summary: Surveys progress on Direct Preference Optimization (DPO) across theory, variants, datasets, and applications, notes its potential and limitations as an RL-free alternative, and proposes future research directions.

Comments Accepted by TPAMI 2026. Project page: https://github.com/Mr-Loevan/DPO-Survey

Abstract

With the rapid advancement of large language models (LLMs), aligning policy models with human preferences has become increasingly critical. Direct Preference Optimization (DPO) has emerged as a promising approach for alignment, acting as an RL-free alternative to Reinforcement Learning from Human Feedback (RLHF). Despite DPO's various advancements and inherent limitations, an in-depth review of these aspects is currently lacking in the literature. In this work, we present a comprehensive review of the challenges and opportunities in DPO, covering theoretical analyses, variants, relevant preference datasets, and applications. Specifically, we categorize recent studies on DPO based on key research questions to provide a thorough understanding of DPO's current landscape. Additionally, we propose several future research directions to offer insights on model alignment for the research community. An updated collection of relevant papers can be found on https://github.com/Mr-Loevan/DPO-Survey.
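
As a rough illustration of the objective this survey is about, the standard DPO loss for a single preference pair can be sketched in plain Python. The function name `dpo_loss`, the default `beta=0.1`, and all log-probability values are illustrative assumptions and are not taken from the survey.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probabilities."""
    # How much more the policy favors each response than the frozen reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written as a numerically stable softplus of -logits.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# Policy identical to the reference: the loss equals log(2).
neutral = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy raises the chosen response and lowers the rejected one
# relative to the reference: the loss falls below log(2).
improved = dpo_loss(-10.0, -16.0, -12.0, -15.0)
```

No reward model or sampling loop appears here, which is the sense in which the abstract calls DPO an RL-free alternative to RLHF: the loss depends only on log-probabilities of fixed preference pairs under the policy and a reference model.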

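2606.18831 2026-06-18 cs.CL cs.AI New submission 85%

Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

Xiaoyue Xu, Sikui Zhang, Xiaorong Wang, Xu Han, Chaojun Xiao

Institutions: OpenBMB, Tsinghua University

Topic match: Post-training. Improves LLM long-context reasoning through a data recipe and GRPO reinforcement learning.

AI summary: Proposes a simple, effective data recipe that, paired with a minimal outcome-based GRPO setup, substantially improves the long-context reasoning of large language models, with average gains of +3.2 to +7.2 points across seven long-context benchmarks, plus +4.8 on GAIA and +7.0 on BrowseComp.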

Comments 15 pages, 6 figures, 12 tables

Abstract

Long-context reasoning is an essential capability for large language models, particularly when they are deployed as autonomous agents that must reason over lengthy trajectories. Reinforcement learning (RL) has recently emerged as a dominant paradigm for improving this ability, yet existing work largely focuses on reward engineering while diverse training data remains scarce. We revisit this problem from a data-centric perspective and show that a simple yet effective data recipe alone, paired with a minimal outcome-based GRPO setup, suffices to substantially improve long-context reasoning. Our recipe targets three complementary task families -- retrieval, multi-evidence synthesis, and reasoning -- for which we construct and curate eight datasets totaling ~14K examples. Experiments on three models (Qwen3-4B/8B/30B-A3B) yield average gains of +7.2/+3.2/+6.4 points across seven long-context benchmarks, surpassing prior RL training sets. We further demonstrate that these gains transfer to agentic tasks, where continuing RL training on an agent-tuned model with our data recipe improves GAIA by +4.8 and BrowseComp by +7.0 points. We will release our datasets to facilitate future research.

2606.18810 2026-06-18 cs.LG cs.AI New submission 85%

Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards

Yingyu Shan, Yuhang Guo, Zihao Cheng, Zeming Liu, Xiangrong Zhu, Xinyi Wang, Jiashu Yao, Wei Lin, Hongru Wang, Heyan Huang

Institutions: Beijing Institute of Technology, Beihang University, Independent Researcher

Topic match: Post-training. SC-GRPO, a method for RLVR that improves LLM reasoning.

AI summary: Proposes SC-GRPO, which uses the KL divergence between self-conditioned distributions as a multiplicative weight on GRPO gradients for fine-grained credit assignment, outperforming GRPO by 8.1% across math, code, and agentic tasks.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has driven substantial progress in training LLMs for reasoning tasks, but representative methods such as GRPO assign uniform credit across all tokens, wasting gradient on routine tokens while under-crediting pivotal reasoning steps. Existing token-level credit assignment methods require resources beyond the model's own rollouts: GRPO variants rely on process reward models or ground-truth answers, while knowledge distillation assigns credit through per-token divergence but requires external teachers (On-Policy Distillation, OPD) or privileged information (On-Policy Self-Distillation). These dependencies limit applicability in the pure RLVR setting. We observe that conditioning the model on its own verified trajectories induces a measurable per-token KL divergence between the original and conditioned distributions, and prove that distilling from a self-teacher constructed from verified trajectories leads to infeasible weighted-average solutions when multiple verified trajectories exist. We propose SC-GRPO (Self-Conditioned GRPO), which uses this per-token KL divergence as a multiplicative weight on GRPO gradients. Across five benchmarks spanning math, code, and agentic tasks, SC-GRPO consistently outperforms GRPO by 8.1% and DAPO by 5.9%, with stronger out-of-distribution (OOD) performance. Moreover, SC-GRPO achieves higher performance than OPD.
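
The weighting idea in this abstract can be sketched on toy data: compute a group-relative advantage per rollout, then scale it token by token with a KL divergence between the model's original and self-conditioned next-token distributions. This is a minimal sketch under stated assumptions, not the authors' implementation: the KL direction, the mean-1 normalization, and every function name and number below are hypothetical.

```python
import math

def group_advantages(rewards):
    """GRPO-style advantage: standardize each rollout's reward within its group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + 1e-8) for r in rewards]

def kl_divergence(p, q):
    """KL(p || q) for two categorical distributions (assumes q > 0 wherever p > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def token_weights(original_dists, conditioned_dists):
    """Per-token KL weights, rescaled to mean 1 (the rescaling is an assumption)."""
    kls = [kl_divergence(p, q) for p, q in zip(original_dists, conditioned_dists)]
    mean_kl = sum(kls) / len(kls)
    if mean_kl == 0.0:
        return [1.0] * len(kls)  # fall back to uniform credit
    return [k / mean_kl for k in kls]

# Toy rollout group with binary verifiable rewards.
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])

# Toy next-token distributions for a 3-token rollout over a 2-word vocabulary.
original = [[0.5, 0.5], [0.9, 0.1], [0.5, 0.5]]
conditioned = [[0.5, 0.5], [0.9, 0.1], [0.1, 0.9]]  # only the last token shifts
weights = token_weights(original, conditioned)

# Token-level credit for the first rollout: sequence advantage times token weight.
token_credit = [advantages[0] * w for w in weights]
```

In this toy case the first two tokens receive zero weight because conditioning does not change their distributions, so all of the rollout's credit lands on the third token. That is the contrast with plain GRPO, which would give each token the same advantage.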

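2606.18388 2026-06-18 cs.LG cs.AI cs.CL cs.MA New submission 85%

LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents

Haoyang Fang, Wei Zhu, Boran Han, Alex Zhang, Zhenyu Pan, Shuo Yang, Shuai Zhang, Jiading Gai, Peng Tang, Cuixiong Hu, Xuan Zhu, Huzefa Rangwala, George Karypis, Bernie Wang

Institutions: Amazon

Topic match: Post-training. LLM agents search over RL post-training strategies.

AI summary: Proposes LLMZero, a system in which LLM agents use tree search to discover adaptive strategies for multi-stage RL post-training. It finds that capacity parameters accumulate monotonically while regularization parameters oscillate, and improves over the base model by 9% to 140% (relative) on 4 GRPO tasks.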

Abstract

RL post-training strategies are dataset-dependent and reveal a recurring empirical pattern: capacity parameters accumulate monotonically across stages, while regularization parameters predominantly oscillate in response to shifting training dynamics. This distinction matters because fixed schedules commit all parameters to fixed trajectories and therefore cannot express the non-stationary exploration-exploitation tradeoffs that regularization must track; the principle provides actionable design rules for multi-stage training. We discover this through LLMZero, a system where LLM agents search over training trajectories via tree search, diagnosing pathologies at each checkpoint and proposing coordinated multi-parameter transitions. Across 4 diverse GRPO tasks, LLMZero discovers strategies that improve over the base model by 9% to 140% relative and over grid search by 6% to 15% relative, consistently outperforming random search and the skill-based agent. The structural principle transfers across tasks, providing an explanation for why discovered strategies take qualitatively different forms yet share similar parameter dynamics.

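2606.01249 2026-06-18 cs.LG cs.CL Updated version 85%

Trust Region On-Policy Distillation

Xingrun Xing, Haoqing Wang, Boyan Gao, Ziheng Li, Yehui Tang

Institutions: Samsung Research, University of Oxford, Peking University

Topic match: Post-training. Trust-region on-policy distillation for LLM post-training.

AI summary: Proposes Trust Region On-Policy Distillation (TrOPD), which uses credit assignment strategies and trust-region learning to address the training instability caused by teacher-student distribution mismatch, and outperforms existing methods on mathematical reasoning, code generation, and general-domain benchmarks.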

Abstract

On-Policy Distillation (OPD) is a fundamental technique for efficient post-training of large language models (LLMs), with broad applications in agent learning, multi-task enhancement, and model compression. However, OPD training becomes unstable when the teacher and student distributions differ substantially, as teacher supervision on student-generated tokens may yield unreliable policy gradients and even cause optimization failure. This work addresses reliable on-policy token-level supervision through credit assignment strategies, and proposes Trust Region On-Policy Distillation (TrOPD), which has three components: 1) Trust-Region On-Policy Learning: TrOPD performs OPD only in regions where the teacher provides reliable supervision, mitigating the optimization difficulty of the K1 reverse-KL estimator under distribution mismatch. 2) Outlier Estimation: For outlier regions, we explore gradient clipping, masking, and forward-KL estimation to reduce the adverse effects of unreliable supervision. 3) Off-Policy Guidance: The student continues generation from teacher prefixes and uses forward KL to imitate off-policy guidance, encouraging on-policy exploration toward reliable regions. Experiments show that TrOPD consistently outperforms SoTA OPD baselines, including OPD, EOPD, and REOPOLD, across mathematical reasoning, code generation, and general-domain benchmarks.
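
To make the K1 reverse-KL estimator and the masking option concrete, here is a small sketch on made-up per-token log-probabilities. The K1 estimate of KL(student || teacher) on student-sampled tokens is the log-ratio `log p_student - log p_teacher`. The reliability test used below (a threshold `max_gap` on the absolute log-ratio) is purely an assumption for illustration; the abstract does not say how TrOPD defines its trust region.

```python
def k1_reverse_kl(student_logps, teacher_logps):
    """Per-token K1 estimate of KL(student || teacher) on student-sampled tokens."""
    return [s - t for s, t in zip(student_logps, teacher_logps)]

def trust_region_mask(student_logps, teacher_logps, max_gap=2.0):
    """1.0 where |log-ratio| <= max_gap, else 0.0 (hypothetical reliability test)."""
    return [1.0 if abs(s - t) <= max_gap else 0.0
            for s, t in zip(student_logps, teacher_logps)]

def masked_mean(values, mask):
    """Mean of the values whose mask entry is 1.0; 0.0 if everything is masked."""
    kept = sum(mask)
    return sum(v * m for v, m in zip(values, mask)) / kept if kept else 0.0

# Log-probabilities of three student-sampled tokens under each model (made up).
student = [-0.2, -1.0, -0.5]
teacher = [-0.3, -1.5, -6.0]  # the teacher finds the third token very unlikely

k1 = k1_reverse_kl(student, teacher)
mask = trust_region_mask(student, teacher)
loss = masked_mean(k1, mask)
```

Without the mask, the third token's log-ratio of 5.5 would dominate the average. With it, the loss is computed only over the two tokens where the student and teacher roughly agree.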

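2601.17226 2026-06-18 cs.CL cs.AI Updated version 85%

Retell, Reward, Repeat: Reinforcement Learning for Narrative Theory-Informed Story Retelling

David Y. Liu, Xanthe Muston, Dipankar Srirag, Aditya Joshi, Sebastian Sequoiah-Grayson

Institutions: University of New South Wales

Topic match: Post-training. Uses reinforcement learning to improve LLM story retelling.

AI summary: Proposes RRR, a reinforcement learning framework that combines Structuralist Narratology with scalar narrativity and uses d-RLAIF to derive training signals from textual features without reference outputs, improving the logic, rationality, and completeness of LLM story retelling.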

Comments 8 Pages, 7 figures

Abstract

Counterfactual story retelling exposes LLM shortcomings in constrained narrative solution spaces where they can no longer rely on recalling memorised training data. Ground-truth-based post-training, such as SFT, fails to teach LLMs how to generate logical and rational narrative events. In this paper, we introduce Retell, Reward, Repeat (RRR), an RL-based pipeline synthesising Structuralist Narratology with scalar narrativity to teach storytelling structure. We extend the TimeTravel dataset with human-annotated stages of narrative equilibrium to evaluate reward models. By using d-RLAIF, RRR derives training signals from the narrativity of textual features without the need for reference outputs. Evaluations demonstrate that RRR-trained LLMs outperform few-shot and SFT baselines in logic, rationality, and completeness, with output quality additionally validated by blind human preference. Relying on a small, query-only dataset, RRR provides a linguistically grounded, cost-effective post-training mechanism for storytelling--a domain currently lacking effective post-training methods. RRR highlights the continued relevance of integrating established linguistic theories into contemporary NLP.

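2506.14126 2026-06-18 cs.LG cs.AI Updated version 85%

From Memorization to Parameter Interference: How Overtraining Experts Harms Model Merging

Stefan Horoi, Guy Wolf, Eugene Belilovsky, Gintare Karolina Dziugaite

Institutions: Concordia University, Mila -- Québec AI Institute, Google DeepMind

Topic match: Post-training. Studies how fine-tuning expert models affects merging.

AI summary: Studies how over-training expert models affects model merging. Prolonged fine-tuning leads to memorization of difficult examples, which causes parameter interference and degrades merging performance; the paper proposes task-dependent early stopping to improve merging.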

Comments Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026

Abstract

Modern deep learning is increasingly characterized by the use of open-weight foundation models that can be fine-tuned on specialized datasets. This has led to a proliferation of expert models and adapters, often shared via platforms like HuggingFace and AdapterHub. Model merging has recently emerged as an effective way to leverage these existing resources, enabling the composition of capabilities from different model checkpoints. A natural pipeline has thus formed to harness the benefits of transfer learning and amortize sunk training costs: models are pre-trained on general data, fine-tuned on specific tasks, and then multiple checkpoints are merged to obtain a more capable model. A prevailing assumption is that improvements at one stage of this pipeline propagate downstream, leading to gains at subsequent steps. In this work, we challenge that assumption by examining how expert fine-tuning affects model merging. We show that long fine-tuning of experts that optimizes for their individual performance leads to degraded merging performance across vision and language modalities, multiple model scales, and both fully fine-tuned and LoRA-adapted models. We trace this degradation to the memorization of a small set of difficult examples that dominate late fine-tuning steps. This causes negative parameter interference and encodes knowledge that is forgotten during merging. Finally, we demonstrate that task-dependent aggressive early stopping strategies can significantly improve model merging performance.

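2606.19336 2026-06-18 cs.CL New submission 80%

Learning User Simulators with Turing Rewards

Yingshan Susan Wang, Cedegao E. Zhang, Linlu Qiu, Zexue He, Pengyuan Li, Alex Pentland, Roger P. Levy, Yoon Kim

Institutions: Massachusetts Institute of Technology, Stanford University, MIT-IBM Watson AI Lab

Topic match: Post-training. Trains user simulators with Turing rewards.

AI summary: Proposes Turing-RL, a Turing-Test-based reinforcement learning method for training user simulators. A discriminative Turing reward pushes generated responses to be indistinguishable from those of real users, and the method outperforms baselines on conversational chat and forum discussion.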

Abstract

Learning to simulate human users in interactive settings could advance the training of agent assistants, evaluation of personalization systems, research in the social sciences, and more. Existing approaches generally do so by training a large language model (LLM) to match a single ground truth response, either by maximizing the log probability or by using a similarity reward. We instead propose Turing-RL: a Turing-Test-based reinforcement learning approach for training user simulator models. Turing-RL uses a discriminative Turing reward with an LLM judge to score how indistinguishable a generated response is from the real user's given the user's history, and the user simulator LLM learns to produce responses indistinguishable from what the user could have said with such rewards. Across two different domains--conversational chat and Reddit forum discussion--we find that Turing-RL consistently outperforms baseline methods on both LLM and human evaluation metrics. Our study suggests that optimizing for indistinguishability, rather than response matching, is effective for learning user simulators.

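2606.19327 2026-06-18 cs.AI cs.CL New submission 80%

Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation

Siyi Gu, Jialin Chen, Sophia Zhou, Arman Cohan, Rex Ying

Institutions: Yale University

Topic match: Post-training. Rubric-conditioned self-distillation for optimizing reasoning models.

AI summary: Proposes a rubric-conditioned self-distillation framework that guides reasoning models with structured, fine-grained feedback, surpassing GRPO by 1.0 points and OPSD by 0.9 points on average on science reasoning benchmarks.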

Abstract

Post-training of reasoning language models is commonly driven by supervised distillation and reinforcement learning with verifiable rewards. Distillation often relies on chain-of-thought annotations that are expensive to obtain and may themselves be noisy, incomplete, or partially incorrect; even when the final solution is correct, an imperfect rationale can interfere with learning. Reinforcement learning with verifiable rewards, on the other hand, typically compresses evaluative feedback into a scalar signal, obscuring which aspects of a response should be improved. We propose Rubric-Conditioned Self-Distillation, a framework that incorporates rubrics as structured, fine-grained feedback for on-policy self-distillation. Our method conditions the teacher model on criterion-level rubrics and uses it to provide token-level guidance on the student's own sampled trajectories. This design avoids treating a single reference rationale as the sole supervision target. Instead, rubrics specify what a strong response should satisfy, enabling more fine-grained credit assignment over the reasoning process than scalar reward optimization. We instantiate this framework with a two-stage pipeline that first learns to generate task-specific rubrics and then trains a rubric-guided reasoner. We evaluate on a diverse suite of science reasoning benchmarks; the results show that rubric-conditioned self-distillation effectively converts rubric-level criteria into token-level guidance over the reasoning process, surpassing GRPO by 1.0 points and OPSD by 0.9 points on average.

2. Domain-specific large models (3 papers)

2606.19266 2026-06-18 cs.CL cs.AI New submission 90%

Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA

Ikram Belmadani, Oumaima El Khettari, Carlos Ramisch, Frederic Bechet, Richard Dufour, Benoit Favre

Institutions: Aix-Marseille Univ., CNRS, LIS UMR 7020; Nantes Univ., École Centrale Nantes, CNRS, LS2N UMR 6004; Grenoble Alpes Univ., CNRS, INRIA, Grenoble INP, LIG UMR 5217

Topic match: Domain-specific large models. Compares domain adaptation strategies for French medical LLMs.

AI summary: Using French medical question answering, empirically compares continual pretraining (CPT) and supervised fine-tuning (SFT) across multiple model families and sizes. CPT+SFT scores best on multiple-choice QA but with small gains, SFT is a strong and economical default, and CPT improves overlap-based metrics on open-ended QA.

Abstract

The development of large language models (LLMs) has led to an increased focus on their adaptation to specialized domains and languages, yet the effectiveness of domain adaptation strategies remains unclear. We present a study of medical domain adaptation using French medical question-answering (QA) as a case study. We compare continual pretraining (CPT), supervised fine-tuning (SFT), and their combination across three model families, multiple sizes, and three initialization types, explicitly disentangling adaptation effects from base model choice. We evaluate both multiple-choice (MCQA) and open-ended QA (OEQA) under greedy and constrained decoding using automatic metrics and LLM-as-a-Judge evaluation. For MCQA, CPT+SFT most often achieves the best scores, but gains over SFT are small and frequently not statistically significant, making SFT a strong and cost-effective default. For OEQA, CPT consistently improves overlap-based metrics, while SFT often degrades generation quality; instruction tuning and CPT+SFT are preferred by LLM-based evaluation. Cross-lingual experiments further show effective transfer from French adaptation to English benchmarks. Overall, we provide practical guidelines for selecting adaptation strategies under computational constraints.

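2606.18699 2026-06-18 cs.CL cs.AI cs.IR New submission 90%

TW-LegalBench: Measuring Taiwanese Legal Understanding

Fei-Yueh Chen, Chun Huang Lin, Chan Wei Hsu, Kuan Hsuan Yeh, Zih-Ching Chen, Kuan-Ming Chen, Patrick Chung-Chia Huang

Institutions: University of Rochester, National Taiwan University, NVIDIA

Topic match: Domain-specific large models. A benchmark of Taiwanese legal understanding that evaluates LLM legal reasoning.

AI summary: Proposes TW-LegalBench, a benchmark comprising multiple-choice, open-ended question, and legal judgment prediction tasks, and evaluates 13 LLMs on Taiwanese law. Top models pass the lawyer examination threshold but fall short of the standard for judges and prosecutors, and struggle to cite legal articles.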

Comments 10 pages, 2 figures, To appear in ICAIL 2026

Abstract

Large language models (LLMs) have shown impressive capabilities across diverse tasks, yet their performance on jurisdiction-specific legal reasoning remains underexplored. We present TW-LegalBench, which draws on the Taiwanese legal system's rich, publicly available official corpus to fill a gap in evaluating LLMs on Taiwanese law: existing common-law benchmarks focus on English sources, and existing civil-law benchmarks focus on Simplified Chinese sources. TW-LegalBench comprises three task types: (1) over 16,000 multiple-choice questions (MCQs) across five years of official examinations in 18 professional domains; (2) 117 open-ended essay questions (OEQs) from examinations for legal professionals with official scoring rubrics; and (3) more than 14,000 legal judgment prediction (LJP) instances covering hundreds of crime categories. We evaluate 13 LLMs using accuracy for MCQs, a decomposed LLM-as-Judge framework based on the scoring rubric points for OEQs, and metrics for sentencing accuracy and statute citation for LJP. Our results reveal that top-performing models exceed the passing threshold for qualified lawyers (passing rate: 11%) but fall short of that for judges and prosecutors (passing rate: 1-2%). For LJP, while models demonstrate reasonable verdict type accuracy and sentence prediction capability, they struggle to cite exact legal articles. These findings highlight that reliable legal text generation remains challenging for LLMs, even though their performance on qualification examinations approaches human level.

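2606.18600 2026-06-18 cs.DC New submission 85%

ShuntServe: Cost-Efficient LLM Serving on Heterogeneous Spot GPU Clusters

Seungwoo Jeong, Moohyun Song, Juhyun Park, Kyungyong Lee

Topic match: Domain-specific large models. Proposes ShuntServe, a system that optimizes LLM serving on heterogeneous GPUs.

AI summary: Proposes ShuntServe, which estimates performance with a roofline model and optimizes model placement with dynamic programming to maximize throughput on heterogeneous spot GPU clusters, and combines output-preserving migration with a shared tensor store for fault tolerance. It achieves up to 1.42x higher throughput than baselines and cost-efficiency improvements of 31.9% (offline) and 31.2% (online) over on-demand instances.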

Comments 18 pages, 16 figures, 5 tables

Abstract

As large language model (LLM) services become widely adopted, the cost of GPU resources for serving these models in cloud environments has emerged as a critical concern. Spot instances offer up to 90% cost savings over on-demand instances, but their frequent interruptions and limited availability pose significant challenges for continuous LLM serving. GPU spot instances, in particular, exhibit lower and more volatile availability than CPU-based instances, making homogeneous clusters that depend on a single GPU type vulnerable to correlated failures. Heterogeneous clusters spanning multiple GPU types can address this by leveraging complementary availability patterns across diverse spot pools, yet existing LLM serving systems are designed for homogeneous environments and suffer from load imbalance when deployed on heterogeneous GPUs. This paper presents ShuntServe, a cost-efficient LLM serving system for heterogeneous spot GPU clusters. ShuntServe employs a roofline model-based analytical serving performance estimator and a dynamic programming-based model placement optimizer that jointly determines node configuration, parallelization strategy, and layer assignment to maximize throughput across heterogeneous GPUs. To enhance fault tolerance when using spot instances, ShuntServe combines output-preserving request migration with concurrent initialization via a shared tensor store, minimizing migration downtime by overlapping replacement node preparation with ongoing serving. Evaluation on Llama-3.1-70B and Qwen3-32B with a heterogeneous AWS cluster of L4, A10G, and L40S GPUs shows that ShuntServe achieves 1.42x and 1.35x higher throughput than state-of-the-art baselines and attains 31.9% and 31.2% cost efficiency improvements over on-demand instances for offline and online serving, respectively.
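
For readers unfamiliar with the roofline model that the performance estimator builds on, the basic rule is that attainable throughput is the smaller of the compute peak and memory bandwidth times arithmetic intensity. The sketch below shows only that generic rule, not ShuntServe's estimator. The device names, peak numbers, and intensity values are hypothetical and do not correspond to real GPU specifications.

```python
def roofline_flops(peak_flops, mem_bandwidth, arithmetic_intensity):
    """Attainable FLOP/s: min(compute roof, memory bandwidth * FLOPs per byte)."""
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)

def bound_type(peak_flops, mem_bandwidth, arithmetic_intensity):
    """Classify a workload by comparing its intensity with the ridge point."""
    ridge = peak_flops / mem_bandwidth
    return "compute-bound" if arithmetic_intensity >= ridge else "memory-bound"

# Hypothetical devices: FLOP/s and bytes/s, for illustration only.
devices = {
    "gpu_small": {"peak_flops": 100e12, "mem_bandwidth": 300e9},
    "gpu_large": {"peak_flops": 300e12, "mem_bandwidth": 900e9},
}

# Made-up arithmetic intensities (FLOPs per byte moved) for two serving phases.
decode_intensity = 2.0
prefill_intensity = 500.0

attainable = {
    name: {
        "decode": roofline_flops(d["peak_flops"], d["mem_bandwidth"], decode_intensity),
        "prefill": roofline_flops(d["peak_flops"], d["mem_bandwidth"], prefill_intensity),
    }
    for name, d in devices.items()
}
```

With these numbers both devices are memory-bound in the low-intensity phase and compute-bound in the high-intensity one. Per-device estimates of this kind are the sort of input a placement optimizer could use to balance layers across unequal GPUs.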

3. Pretraining (3 papers)

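2606.18663 2026-06-18 cs.CL New submission 90%

RegMix-D: Dynamic Data Mixing via Proxy Training Trajectories

Kaiyan Zhao, Zhongtao Miao, Akiko Aizawa, Yoshimasa Tsuruoka

Institutions: The University of Tokyo, National Institute of Informatics

Topic match: Pretraining. A dynamic data mixing method for LLM pretraining.

AI summary: Proposes RegMix-D, which predicts optimal mixture ratios at multiple stages from proxy training trajectories to enable dynamic data mixing. It outperforms RegMix and DoReMi on 13 downstream tasks, even with only 25% of RegMix's proxy compute budget.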

Comments Work in progress

Abstract

Data mixture selection is critical for Large Language Model pretraining. Existing methods such as RegMix select a single static mixture by fitting a regression model on small-scale proxy runs. We propose RegMix-D, a simple extension of RegMix to dynamic mixing. Our key observation is that proxy runs produce not only endpoint losses, but also full loss trajectories, which can be used to further improve the data mixture. By training a regression model on these trajectories, we can predict optimal mixtures at multiple training stages. RegMix-D supports two deployment modes: an offline variant that generates a complete mixture schedule before target training, and an online variant that adapts the mixture during training using observed losses. Experiments on 25B tokens of the Pile dataset with a 1B parameter target model show that RegMix-D consistently improves over RegMix and DoReMi across 13 downstream tasks while remaining proxy-efficient: it surpasses RegMix even with only 128 proxy models (25% of RegMix's proxy compute budget).

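2606.19036 2026-06-18 cs.LG New submission 85%

Geometric and Stochastic Analysis of Discontinuities in Sparse Mixture-of-Experts

Tho Tran Huu, Huu-Tuan Nguyen, Thien-Hai Nguyen, Nhat-Tri Ho, Viet-Hoang Tran, Tho Quan, Tan Minh Nguyen

Institutions: Department of Mathematics, National University of Singapore, Singapore; Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), VNU-HCM, Ho Chi Minh City, Vietnam

Topic match: Pretraining. Analyzes discontinuities in sparse MoE and proposes a smoothing mechanism; at its core an LLM architecture improvement.

AI summary: Gives a geometric and stochastic analysis of discontinuities in sparse Mixture-of-Experts models: classifies discontinuities by order, establishes asymptotic volume estimates, proves that random paths almost surely first hit an order-1 discontinuity, and proposes a low-overhead smoothing mechanism that improves performance.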

Comments ICML 2026 Spotlight. arXiv admin note: text overlap with arXiv:2510.17794 by other authors

Abstract

Sparse Mixture-of-Experts (SMoE) architectures are now widely deployed in state-of-the-art language and vision models, where conditional routing allows scaling to very large networks. However, the very Top-$k$ expert selection that enables conditional routing also renders the SMoE map inherently discontinuous. In the vicinity of these discontinuity surfaces, even inputs that are arbitrarily close may activate substantially different sets of experts, resulting in significantly different outputs. In this work we give a rigorous geometric and stochastic analysis of these discontinuities. We first classify them by order, determined by the number of tied experts at a switching event. Using measure-theoretic slicing arguments, we establish asymptotic volume estimates for the thickened discontinuity surfaces, showing that lower-order discontinuity sets dominate, whereas higher-order ones occupy a vanishingly small relative volume. Next, modeling random perturbations in the input space via a diffusion process, we prove that the path eventually encounters a discontinuity, and moreover that the first hit almost surely occurs on an order-1 discontinuity, with explicit finite-time probability bounds. We further derive occupation-time bounds that quantify the duration the random path spends in the neighborhoods of each discontinuity order. These theoretical results imply that inputs are more likely to lie near lower-order discontinuities. Motivated by this insight, we propose a simple smoothing mechanism that can be directly applied to existing SMoEs, softly incorporating experts near discontinuities; our analysis guarantees that the added computational overhead remains small while providing localized smoothing near discontinuities, and experiments across language and vision tasks show that smoothing not only enforces continuity of the SMoE map but also enhances empirical performance.
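
The discontinuity described in this abstract is easy to reproduce on a toy router. The sketch below is an assumption-laden illustration, not the paper's setup: it uses a 1-D input, three constant-valued experts, made-up router weights, and a plain average over the selected experts instead of gate-weighted mixing.

```python
def top_k_experts(logits, k):
    """Indices of the k largest router logits (ties broken by lower index)."""
    order = sorted(range(len(logits)), key=lambda i: (-logits[i], i))
    return sorted(order[:k])

def router_logits(x):
    """Toy 1-D router over three experts; the weights are made up."""
    return [1.0 * x, -1.0 * x, -0.5]

def smoe_output(x, k=1):
    """Average the outputs of the selected experts; each toy expert is a constant."""
    expert_values = [10.0, -10.0, 0.0]
    chosen = top_k_experts(router_logits(x), k)
    return sum(expert_values[i] for i in chosen) / len(chosen)

eps = 1e-9
left = smoe_output(-eps)   # expert 1 has the largest logit just left of zero
right = smoe_output(eps)   # expert 0 has the largest logit just right of zero
jump = right - left
```

Two inputs that differ by only `2 * eps` are routed to different experts and produce outputs 20 apart, because experts 0 and 1 tie exactly at `x = 0`. A smoothing mechanism of the kind the abstract proposes would blend the tied experts near that point instead of switching abruptly.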

2606.19005 2026-06-18 cs.CL cs.LG 新提交 85%

Sumi: Open Uniform Diffusion Language Model from Scratch

Sumi: 从头训练的开放均匀扩散语言模型

Mengyu Ye, Keito Kudo, Wataru Ikeda, Ryosuke Matsuda, Keisuke Sakaguchi, Jun Suzuki

发表机构 * Tohoku University(东北大学)

专题命中 预训练 :从头预训练7B均匀扩散语言模型,性能与自回归模型相当。

AI总结 本文提出Sumi,一个从零开始预训练的70亿参数均匀扩散语言模型,在1.5T tokens上训练,性能与同规模自回归模型相当,并开源所有资源。

详情
AI中文摘要

扩散模型已成为自回归模型的有前途的替代方案。其中,均匀扩散语言模型(UDLM)允许在任何步骤更新任何token,原则上能够实现更灵活的生成。然而,目前还没有从零开始预训练的大参数规模和大token预算的UDLM。自回归建模和掩码扩散建模已经拥有大规模的可供社区研究和构建的模型;而均匀扩散模型则没有。大规模从头预训练的UDLM将为研究缩放行为、生成动态、可控性以及与现有自回归和掩码扩散模型的权衡提供一个干净的参考点。为此,我们引入了Sumi(日语中“墨水”的意思),一个完全开放的70亿参数均匀扩散语言模型,从零开始在1.5T tokens上预训练。Sumi在知识、推理和编码基准测试中与在可比token预算下训练的自回归模型表现相当,但在常识基准测试中表现较差,其中我们以教育为主的数据混合可能是原因之一。我们发布了模型权重、检查点和完整的训练方案,包括在公开可用的语料库上的数据混合的完整规范。我们希望这次发布能使社区研究大规模原生均匀扩散,并促进对其尚未很好理解的方面的研究。

英文摘要

Diffusion models have become a promising alternative to autoregressive models. Among these, uniform diffusion language models (UDLMs) permit any token to be updated at any step, in principle enabling more flexible generation. However, no UDLM has yet been pretrained from scratch at both large parameter scale and large token budget. Both autoregressive modeling and masked diffusion modeling already have capable models at scale that the community can study and build on; uniform diffusion has none. A scratch-pretrained UDLM at scale would provide a clean reference point for studying scaling behavior, generation dynamics, controllability, and trade-offs against established autoregressive and masked diffusion models. To this end, we introduce Sumi ("ink" in Japanese), a fully open 7B uniform diffusion language model pretrained from scratch on 1.5T tokens. Sumi performs competitively with autoregressive models trained at comparable token budgets on knowledge, reasoning, and coding benchmarks, while under-performing on commonsense benchmarks, where our education-heavy data mixture is a likely contributor. We release our model weights, checkpoints, and full training recipe, including a complete specification of the data mixture over publicly available corpora. We hope this release enables the community to study native uniform diffusion at scale and catalyzes work on its as-yet poorly understood aspects.

4. 指令微调 2 篇

2606.18875 2026-06-18 cs.CL 新提交 85%

Efficient Financial Language Understanding via Distillation with Synthetic Data

通过合成数据蒸馏实现高效金融语言理解

Wen-Fong Huang, Edwin Simpson

发表机构 * School of Engineering Mathematics and Technology(工程数学与技术学院) University of Bristol(布里斯托大学)

专题命中 指令微调 :用大教师模型蒸馏到小模型,金融情感分析。

AI总结 提出一种在低资源条件下通过合成数据蒸馏进行金融情感分析的框架,利用聚类种子选择生成代表性合成数据,使紧凑模型在少量标注下达到强性能,甚至在某些任务上超越教师模型。

Journal ref Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), European Language Resources Association (ELRA), 2026, pp. 10242-10254

详情
AI中文摘要

大型指令跟随模型功能强大但部署成本高昂,尤其在金融领域,标注数据因保密性和专家标注成本而受限。我们提出一种通过合成数据蒸馏进行金融情感分析的高效框架,将知识从大型指令调优教师模型迁移到紧凑的学生模型。该框架专为低资源条件设计,其中收集并手工标注少量真实样本。框架随后对样本进行聚类,并利用聚类结果选择种子,通过结构化少样本提示生成合成样本。实验表明,基于聚类的种子选择比随机采样能生成更具代表性的合成数据,使紧凑模型在极少量监督下实现强性能。值得注意的是,在更复杂且噪声更多的文本领域,基于完整合成种子语料库训练的紧凑模型甚至优于教师模型,同时在正式文本上保持竞争力。该框架为金融NLP中资源高效的领域自适应提供了一条实用途径,且只需最少的人工标注工作。

英文摘要

Large instruction-following models are powerful but costly to deploy, particularly in finance, where labelled data are limited by confidentiality and expert annotation cost. We present an efficient framework for financial sentiment analysis through distillation with synthetic data, transferring knowledge from a large instruction-tuned teacher to compact student models. The framework is designed for low-resource conditions, where a small set of real examples are collected and labelled by hand. The framework then clusters the examples and uses the clusters to select seeds for generating synthetic examples via structured few-shot prompting. Experiments show that clustering-based seed selection yields more representative synthetic data than random sampling, enabling compact models to achieve strong performance with minimal supervision. Notably, on a more complex and noisy text domain, the compact model trained on the complete synthetic-seed corpus even outperforms the teacher model, while remaining competitive on formal text. The framework provides a practical route toward resource-efficient domain adaptation in financial NLP with minimal human labelling effort.

2606.18307 2026-06-18 cs.LG cs.AI 新提交 85%

DRIFT: Refining Instruction Data via On-Policy Data Attribution

DRIFT: 通过在线策略数据归因优化指令数据

Zefan Wang, Lincheng Li, Tianyu Yu, Yuan Yao

发表机构 * Tsinghua University(清华大学)

专题命中 指令微调 :提出DRIFT方法优化指令微调数据分布,提升LLM性能上限。

AI总结 提出DRIFT方法,利用在线策略影响函数解决标准影响函数在指令微调数据归因中的近邻偏差和梯度范数偏差问题,通过模型自身生成作为验证目标,提升7B模型性能上限。

详情
AI中文摘要

优化监督微调(SFT)的训练数据分布决定了大型语言模型(LLMs)的能力。虽然现有的数据筛选方法在有限预算下加速训练方面表现出色,但它们不太适合提升能力上限。这里的挑战不再是识别一个保持性能的较小子集,而是将数据分布优化为最能提升最终模型的实例。为了解决这个问题,我们探索了使用影响函数(IF)进行实例级数据归因。我们发现标准IF公式在此设置中存在两个结构限制:由离策略验证目标引起的近邻偏差,以及对梯度范数的严重偏向。我们提出了DRIFT(通过在线策略影响函数进行数据优化用于监督微调)。DRIFT不依赖外部参考数据,而是利用模型的在线策略生成作为验证目标,这在经验上最小化了参数近邻偏差,并更好地符合IF的局部邻域假设。它进一步基于轨迹正确性应用符号加权,并针对梯度操纵问题对影响分数进行去偏,使得少量验证查询能够作为可靠锚点来归因整个数据集。在7B参数指令和推理模型上的实验表明,DRIFT持续提升了两者的性能上限,优于现有的数据筛选基线。

英文摘要

Optimizing the training data distribution for Supervised Fine-Tuning (SFT) dictates the capability of Large Language Models (LLMs). While existing data curation methods excel at accelerating training under constrained budgets, they are less suited to elevating the capability upper bound. The challenge here is no longer to identify a smaller subset that preserves performance, but to refine the data distribution toward instances most capable of improving the final model. To address this problem, we explore instance-level data attribution using Influence Functions (IF). We identify that standard IF formulations struggle in this setting due to two structural limitations: a proximity gap caused by off-policy validation targets, and a severe bias towards gradient norm. We propose DRIFT (Data Refinement via On-Policy Influence Functions for Supervised Fine-Tuning). Instead of relying on external reference data, DRIFT utilizes the model's on-policy rollouts as validation targets, which empirically minimizes the parameter proximity gap and better aligns with the local neighborhood assumption of IF. It further applies signed weighting based on trajectory correctness and debiases influence scores against the gradient hacking issue, allowing a small set of validation queries to act as reliable anchors for attributing the full dataset. Experiments on 7B-parameter instruction and reasoning models show that DRIFT consistently raises the performance ceiling on both, outperforming existing data curation baselines.

5. 其他LLM 13 篇

2606.18431 2026-06-18 cs.LG cs.DC 新提交 85%

Beyond Prediction: Tail-Aware Scheduling for LLM Inference

超越预测:面向LLM推理的尾延迟感知调度

Yueying Li, Yuanfan Chen, Jiayang Chen, Esha Choukse, Haoran Qiu, G. Edward Suh, Rodrigo Fonseca, Ziv Scully, Udit Gupta

发表机构 * Cornell University, Computer Science Department(康奈尔大学计算机科学系) Cornell University, Electrical and Computer Engineering Department(康奈尔大学电气与计算机工程系) Cornell University, Operations Research and Information Engineering Department(康奈尔大学运筹学与信息工程系) Microsoft Azure System Research(微软Azure系统研究) NVIDIA Corporation(英伟达公司)

专题命中 其他LLM :提出LLM推理调度框架,优化尾延迟

AI总结 针对LLM推理中长度预测调度在分布偏移和尾延迟控制上的脆弱性,提出无预测的分布感知调度框架,通过轻量统计信号实现软优先级提升,结合缓存感知抢占,在多种工作负载下将P99 TTLT降低35-50%,TTFT降低34-47%。

Journal ref Forty-Third International Conference on Machine Learning (2026)

详情
AI中文摘要

LLM服务表现出极端的长度可变性,使得基于大小的调度在实践中变得困难。最近的LLM调度器使用预测的解码长度或排名来近似SJF/SRPT,并主要报告均值中心指标如TTFT和TBT。我们表明,这些预测驱动的策略在分布偏移、突发到达和GPU内存压力下可能脆弱,同时对主导用户体验的尾延迟(P90-P99)控制有限,即使拥有完美的解码长度知识。我们引入了一个分布感知、无预测的调度框架,用由轻量统计信号驱动的软优先级提升取代显式长度预测。我们的设计协同优化调度和缓存感知抢占,以考虑跨工作负载混合的内存耦合解码动态。在生产环境和开源轨迹上的评估表明,相对于具有完美长度知识的SRPT,我们的方法将P99 TTLT降低了高达35-50%,并在各种工作负载(包括推理密集型和聊天密集型任务)上将TTFT降低了34-47%。这些结果证明了在在线LLM服务中优化尾延迟的稳健替代方案。

英文摘要

LLM serving exhibits extreme length variability, making size-based scheduling difficult in practice. Recent LLM schedulers approximate SJF/SRPT using predicted decode lengths or ranks and primarily report mean-centric metrics such as TTFT and TBT. We show that these prediction-driven policies can be fragile under distribution shifts, bursty arrivals, and GPU memory pressure, while offering limited control over the tail latency (P90-P99) that dominates user experience, even with perfect decode-length knowledge. We introduce a distribution-aware, prediction-free scheduling framework that replaces explicit length prediction with soft priority boosting driven by lightweight statistical signals. Our design co-optimizes scheduling and cache-aware preemption to account for memory-coupled decode dynamics across workload mixes. Evaluated on production and open-source traces, our method reduces P99 TTLT by up to 35-50% relative to SRPT with perfect length knowledge and reduces TTFT by 34-47% across workloads, including reasoning-heavy and chat-heavy tasks. These results demonstrate a robust alternative for optimizing tail latency in online LLM serving.

2606.18394 2026-06-18 cs.CL 新提交 85%

JetFlow: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

JetFlow: 通过并行树草稿突破推测解码的缩放上限

Lanxiang Hu, Zhaoxiang Feng, Yulun Wu, Haoran Yuan, Yujie Zhao, Yu-Yang Qian, Bojun Wang, Daxin Jiang, Yibo Zhu, Tajana Rosing, Hao Zhang

发表机构 * UC San Diego(加州大学圣地亚哥分校) Zhejiang University(浙江大学) UIUC(伊利诺伊大学厄巴纳-香槟分校) Nanjing University(南京大学) StepFun(阶跃星辰)

专题命中 其他LLM :提出并行树草稿加速LLM推测解码

AI总结 提出JetFlow框架,通过因果并行草稿头结合树推测解码,将更大草稿预算转化为更长接受前缀和更高端到端加速,在Qwen3模型上实现最高9.64倍加速。

详情
AI中文摘要

推测解码(SD)通过草拟多个令牌并并行验证来加速自回归大语言模型(LLM),但面临缩放限制:仅当接受率保持较高且草拟开销较低时,增加草稿预算才能提高速度。这一上限难以突破,因为先前基于头的SD方法面临因果-效率困境。自回归草稿器生成路径条件候选,适用于树推测解码且接受长度更高,但其草拟成本随树深度增长。双向块扩散草稿器一次性生成所有位置,但其分支无关的边缘分布可能形成个体合理但相互不一致的树,浪费预算并降低接受率。我们提出JetFlow,一种基于头的SD框架,结合单次前向草拟效率与分支级因果条件。JetFlow在冻结目标模型的融合隐藏状态上训练因果并行草稿头,生成与目标模型自回归分解对齐的候选树。这使得JetFlow能够将更大的草稿预算转换为更长的接受前缀和更高的端到端加速。在密集和MoE Qwen3模型上的数学、编码和聊天基准测试中,JetFlow始终优于双向头和基于树的SD基线。在H100 GPU上,JetFlow在MATH-500上实现高达9.64倍加速,在开放式对话工作负载上实现4.58倍加速,并通过vLLM集成在实际服务负载下进一步降低延迟。我们的代码和模型可在 https://github.com/hao-ai-lab/JetFlow 获取。

英文摘要

Speculative decoding (SD) accelerates autoregressive Large Language Models (LLMs) by drafting multiple tokens and verifying them in parallel, but it faces a scaling limitation: increasing the draft budget improves speed only when acceptance remains high and drafting overhead stays low. This ceiling has been difficult to break because prior head-based SD methods face a causality-efficiency dilemma. Autoregressive drafters produce path-conditioned candidates that are effective for tree speculative decoding with higher acceptance length, but their drafting cost grows with tree depth. Bidirectional block-diffusion drafters generate all positions in one pass, but their branch-agnostic marginals can form individually plausible yet mutually inconsistent trees, wasting budget and reducing acceptance. We propose JetFlow, a head-based SD framework that combines one-forward drafting efficiency with branch-wise causal conditioning. JetFlow trains a causal parallel draft head over fused hidden states from the frozen target model, producing candidate trees whose scores align with the target model's autoregressive factorization. This enables JetFlow to convert larger draft budgets into longer accepted prefixes and higher end-to-end speedup. Across math, coding, and chat benchmarks on dense and MoE Qwen3 models, JetFlow consistently outperforms bidirectional-head and tree-based SD baselines. On H100 GPUs, JetFlow achieves up to 9.64x speedup on MATH-500 and 4.58x on open-ended conversational workloads, with further latency gains demonstrated through vLLM integration under realistic serving loads. Our code and models are available at https://github.com/hao-ai-lab/JetFlow.
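
The draft-then-verify loop that the abstract builds on can be sketched as follows. This is the basic single-chain variant with greedy acceptance, not JetFlow's parallel tree drafting, and the two "models" (`target_next`, `draft_next`) are hypothetical toy functions over integer tokens chosen only so the example runs without any library.

```python
def target_next(prefix):
    # Toy "target model": the next token is the prefix sum modulo 7.
    return sum(prefix) % 7

def draft_next(prefix):
    # Toy "draft model": agrees with the target unless the prefix length
    # is a multiple of 4, where it deliberately guesses wrong.
    guess = sum(prefix) % 7
    return guess if len(prefix) % 4 else (guess + 1) % 7

def speculative_step(prefix, budget):
    # 1) Draft `budget` tokens with the cheap model.
    draft = []
    for _ in range(budget):
        draft.append(draft_next(prefix + draft))
    # 2) Verify. A real system would score every drafted position in one
    #    target forward pass; this loop is a sequential stand-in.
    accepted = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
    accepted.append(target_next(prefix + accepted))  # bonus token if all match
    return accepted

def generate(prefix, n_tokens, budget):
    out, steps = list(prefix), 0
    while len(out) - len(prefix) < n_tokens:
        out.extend(speculative_step(out, budget))
        steps += 1
    return out[:len(prefix) + n_tokens], steps

def greedy(prefix, n_tokens):
    # Reference: plain one-token-at-a-time decoding with the target model.
    out = list(prefix)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out

tokens, steps = generate([1, 2, 3], n_tokens=12, budget=4)
print(tokens == greedy([1, 2, 3], 12), steps)
```

The output always matches plain greedy decoding with the target, while the number of verification steps falls below the number of generated tokens whenever drafts are accepted. The gain therefore depends on how long the accepted prefix is, which is the quantity the abstract says JetFlow aims to increase.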

2602.05992 2026-06-18 cs.CL 版本更新 85%

DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs

DSB: 扩散语言模型的动态滑动块调度

Lizhuo Luo, Shenggui Li, Yonggang Wen, Tianwei Zhang

发表机构 * Nanyang Technological University(南洋理工大学)

专题命中 其他LLM :改进扩散语言模型的推理调度

AI总结 针对扩散语言模型固定块调度忽视语义难度的问题,提出无训练的动态滑动块方法DSB及配套KV缓存机制DSB Cache,显著提升生成质量和推理效率。

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

扩散大语言模型(dLLMs)已成为文本生成的一种有前景的替代方案,其特点在于原生支持并行解码。在实践中,块推理对于避免全局双向解码中的顺序错乱以及提高输出质量至关重要。然而,广泛使用的固定、预定义块(朴素)调度忽略了语义难度,使其在质量和效率上均非最优策略:它可能迫使模型对不确定的位置过早做出承诺,同时延迟块边界附近的简单位置。在这项工作中,我们分析了朴素块调度的局限性,并揭示了根据语义难度动态调整调度对于可靠高效推理的重要性。受此启发,我们提出了动态滑动块(DSB),一种无训练的块调度方法,它使用动态大小的滑动块来克服朴素块的刚性。为了进一步提高效率,我们引入了DSB Cache,一种针对DSB量身定制的无训练KV缓存机制。跨多个模型和基准的大量实验表明,DSB与DSB Cache一起,持续提升了dLLMs的生成质量和推理效率。代码已发布在 https://github.com/lizhuo-luo/DSB。

英文摘要

Diffusion large language models (dLLMs) have emerged as a promising alternative for text generation, distinguished by their native support for parallel decoding. In practice, block inference is crucial for avoiding order misalignment in global bidirectional decoding and improving output quality. However, the widely-used fixed, predefined block (naive) schedule is agnostic to semantic difficulty, making it a suboptimal strategy for both quality and efficiency: it can force premature commitments to uncertain positions while delaying easy positions near block boundaries. In this work, we analyze the limitations of naive block scheduling and disclose the importance of dynamically adapting the schedule to semantic difficulty for reliable and efficient inference. Motivated by this, we propose Dynamic Sliding Block (DSB), a training-free block scheduling method that uses a sliding block with a dynamic size to overcome the rigidity of the naive block. To further improve efficiency, we introduce DSB Cache, a training-free KV-cache mechanism tailored to DSB. Extensive experiments across multiple models and benchmarks demonstrate that DSB, together with DSB Cache, consistently improves both generation quality and inference efficiency for dLLMs. Code is released at https://github.com/lizhuo-luo/DSB.

2602.23092 2026-06-18 cs.AI 版本更新 85%

Enhancing CVRP Solver through LLM-driven Automatic Heuristic Design

通过LLM驱动的自动启发式设计增强CVRP求解器

Zhuoliang Xie, Fei Liu, Zhenkun Wang, Qingfu Zhang

发表机构 * Southern University of Science and Technology(南方科技大学) City University of Hong Kong(香港城市大学)

专题命中 其他LLM :利用LLM自动设计启发式求解CVRP,属于LLM应用

AI总结 提出AILS-AHD方法,结合进化搜索框架与大语言模型动态生成和优化破坏启发式,并引入加速机制,在中等和大规模CVRP实例上优于现有求解器,在CVRPLib大规模基准中10个实例上取得8个新最优解。

详情
AI中文摘要

容量受限车辆路径问题(CVRP)是一个基本的组合优化挑战,专注于在车辆容量约束下优化车队运营。尽管在运筹学中得到了广泛研究,CVRP的NP-hard性质仍然带来显著的计算挑战,特别是对于大规模实例。本研究提出了AILS-AHD(自适应迭代局部搜索与自动启发式设计),一种利用大语言模型(LLMs)革新CVRP求解的新方法。我们的方法将进化搜索框架与LLMs集成,在AILS方法中动态生成和优化破坏启发式。此外,我们引入了一种基于LLM的加速机制以提高计算效率。针对最先进的求解器(包括AILS-II和HGS)的综合实验评估表明,AILS-AHD在中等和大规模实例上均表现出优越性能。值得注意的是,我们的方法在CVRPLib大规模基准的10个实例中为8个建立了新的最佳已知解,突显了LLM驱动的启发式设计在推进车辆路径优化领域的潜力。

英文摘要

The Capacitated Vehicle Routing Problem (CVRP), a fundamental combinatorial optimization challenge, focuses on optimizing fleet operations under vehicle capacity constraints. While extensively studied in operational research, the NP-hard nature of CVRP continues to pose significant computational challenges, particularly for large-scale instances. This study presents AILS-AHD (Adaptive Iterated Local Search with Automatic Heuristic Design), a novel approach that leverages Large Language Models (LLMs) to revolutionize CVRP solving. Our methodology integrates an evolutionary search framework with LLMs to dynamically generate and optimize ruin heuristics within the AILS method. Additionally, we introduce an LLM-based acceleration mechanism to enhance computational efficiency. Comprehensive experimental evaluations against state-of-the-art solvers, including AILS-II and HGS, demonstrate the superior performance of AILS-AHD across both moderate and large-scale instances. Notably, our approach establishes new best-known solutions for 8 out of 10 instances in the CVRPLib large-scale benchmark, underscoring the potential of LLM-driven heuristic design in advancing the field of vehicle routing optimization.

2602.15851 2026-06-18 cs.CL cs.AI 版本更新 85%

Narrative Theory-Driven LLM Methods for Automatic Story Generation and Understanding: A Survey

叙事理论驱动的LLM方法在自动故事生成与理解中的应用:综述

David Y. Liu, Aditya Joshi, Paul Dawson

发表机构 * School of Computer Science and Engineering(计算机科学与工程学院) School of Arts and Media(艺术与媒体学院) University of New South Wales (UNSW)(新南威尔士大学)

专题命中 其他LLM :综述叙事理论驱动的LLM故事生成与理解

AI总结 综述叙事理论驱动的大语言模型方法在自动故事生成与理解中的应用,分析现状并指出生成任务在理论应用、后训练方法、非虚构叙事及叙事层次等方面落后于理解任务,提出未来方向。

Comments 31 pages

详情
AI中文摘要

使用大语言模型(LLM)的叙事理论应用在自动故事生成和理解任务中提供了有前景的方法。本综述考察了自然语言处理(NLP)研究如何利用LLM方法处理叙事研究中的不同概念。我们使用叙事学中的既定区分来分类当前工作,并发现以下内容:(a) 叙事文本来源多样,不仅限于文学;(b) 理论综合与验证是潜在成果;(c) 生成任务在多个方面落后于理解任务:理论应用、后训练方法、探索非虚构叙事以及处理超出故事与话语层面的叙事层次。对于未来方向,我们相信,与其追求单一的、通用的“叙事质量”基准,进步可以受益于以下方面的努力:定义和改进针对单个叙事属性的基于理论的度量;继续开展大规模、理论驱动的文学/社会/文化分析;在情境化上下文中生成叙事;以及继续进行实验,其输出可用于验证或完善叙事理论。本文通过概述当前研究工作和更广泛的叙事研究领域,为NLP中更系统、更具理论依据的叙事研究提供了背景基础。

英文摘要

Applications of narrative theories using large language models (LLMs) deliver promising methods in automatic story generation and understanding tasks. Our survey examines how natural language processing (NLP) research uses LLM methods to engage with diverse concepts from narrative studies. We use established distinctions from narratology to categorise ongoing efforts and discover the following: (a) narrative texts come from diverse sources beyond just literature, (b) theoretical synthesis and validation are potential outcomes, (c) generation tasks lag behind understanding in several ways: theoretical application, post-training methods, exploring non-fiction narratives and addressing narrative levels beyond fabula and discourse. For future directions, instead of the pursuit of a single, generalised benchmark for 'narrative quality', we believe that progress can benefit from efforts that focus on the following: defining and improving theory-based metrics for individual narrative attributes; continuing to conduct large-scale, theory-driven literary/social/cultural analysis; generating narratives in situated contexts; and continuing experiments where outputs can be used to validate or refine narrative theories. This work provides a contextual foundation for more systematic and theoretically informed narrative research in NLP by providing an overview of ongoing research efforts and the broader narrative studies landscape.

2510.15551 2026-06-18 cs.CL cs.AI cs.LG 版本更新 85%

Rethinking Cross-lingual Gaps from a Statistical Viewpoint

从统计视角重新思考跨语言差距

Vihari Piratla, Purvam Jain, Darshan Singh, Trevor Cohn, Preethi Jyothi, Partha Talukdar

发表机构 * Google DeepMind(谷歌DeepMind)

专题命中 其他LLM :研究LLM跨语言差距,属于LLM应用

AI总结 提出跨语言差距源于目标语言响应方差,通过形式化偏差和无偏误差,并采用推理时集成方法降低方差,使跨语言迁移得分提升8%-50%以上。

Comments 30 pages

详情
AI中文摘要

任何知识片段通常以一种或少数几种自然语言表达在网页或大型语料库中。大型语言模型(LLMs)通过从源语言获取知识,并在使用目标语言查询时使其可访问,从而充当桥梁。跨语言差距是指使用目标语言而非源语言查询知识时准确率的下降。现有研究侧重于导致跨语言差距的建模或训练失败。在这项工作中,我们采取另一种视角来表征跨语言错误的性质,并假设目标语言中响应的方差是造成这一差距的关键原因。我们首次将跨语言差距形式化为有偏误差和无偏误差。通过多种控制方差并减少跨语言差距的推理时干预,我们实证验证了我们的假设。我们展示了几种测试时集成方法,这些方法降低了响应方差,从而将源-目标迁移得分提高了多达12个绝对百分点,在各种LLMs上实现了8%到超过50%的相对提升。

英文摘要

Any piece of knowledge is usually expressed in one or a handful of natural languages on the web or in any large corpus. Large Language Models (LLMs) act as a bridge by acquiring knowledge from a source language and making it accessible when queried using target languages. A cross-lingual gap is a drop in accuracy incurred when querying knowledge in a target language rather than the source language. Existing research has focused on modeling or training failures leading to cross-lingual gaps. In this work, we take an alternative view to characterize the nature of cross-lingual error, and hypothesize that the variance of responses in the target language is a key cause of this gap. For the first time, we formalize the cross-lingual gap in terms of biased and unbiased errors. We empirically validate our hypothesis through multiple inference-time interventions that control variance and reduce the cross-lingual gap. We demonstrate a few test-time ensemble methods that reduce response variance, and thereby improve source-target transfer scores by up to 12 absolute points, yielding relative gains of 8% to over 50% across various LLMs.
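
A small worked example helps show why ensembling at test time can help when errors are driven by variance rather than bias. The sketch below uses plain majority voting under two simplifying assumptions that are not from the paper: each sampled response is independently correct with probability `p`, and the answer is treated as simply correct or incorrect. The function names are illustrative, and the paper's own ensemble methods may differ.

```python
from collections import Counter
from math import comb

def majority_vote(answers):
    # Return the most frequent answer among the sampled responses.
    return Counter(answers).most_common(1)[0][0]

def majority_correct_prob(p, n):
    # Probability that more than half of n independent samples are correct,
    # when each sample is correct with probability p (binomial tail).
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(need, n + 1))

print(majority_vote(["Paris", "Paris", "Lyon"]))
for n in (1, 5, 15):
    print(n, round(majority_correct_prob(0.6, n), 5))
```

Under these assumptions a single sample is right 60% of the time, while a five-sample majority is right about 68.3% of the time (0.3456 + 0.2592 + 0.07776 = 0.68256). Voting cannot help when `p` is below one half, which corresponds to a systematic, biased error rather than a variance-driven one.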

2510.04120 2026-06-18 cs.CL cs.AI 版本更新 85%

Probing Semantic Alignment, Lexical Invariance, and Syntactic Influence in LLM Metaphor Processing

探究大语言模型隐喻处理中的语义对齐、词汇不变性和句法影响

Fengying Ye, Shanshan Wang, Lidia S. Chao, Derek F. Wong

发表机构 * NLP2CT Lab, Department of Computer and Information Science, University of Macau(澳门大学计算机与信息科学系NLP2CT实验室)

专题命中 其他LLM :LLM隐喻处理机制分析

AI总结 通过几何探测、上下文替换和句法扰动三种方法,分析LLM在隐喻处理中的语义漂移、词汇稳定性及句法敏感性,揭示强行为表现可能源于异质信号。

Comments Accepted to ACL 2026

详情
AI中文摘要

大语言模型(LLM)在隐喻检测和解释任务上表现出色,但尚不清楚这种行为成功揭示了隐喻处理的哪些方面。我们通过探测三个互补维度:语义属性对齐、词汇不变性和句法敏感性,对行为证据的局限性进行诊断分析。使用几何探测,我们评估模型生成的解释是否与参考语义属性对齐;通过上下文变化替换,分析隐喻和字面表达之间词汇关联的稳定性;通过受控句法扰动,检查隐喻检测的敏感性。我们的分析表明,LLM生成的解释可能相对于参考属性出现语义漂移;稳定的词汇锚点在不同上下文条件下持续存在,可能支持常规隐喻,同时使需要上下文整合的新奇隐喻产生偏差;检测性能对句法不规则性敏感。这些发现表明,强行为表现可能反映了异质的潜在信号,强调在将隐喻基准解释为稳健、集成语义理解的证据时需要谨慎。

英文摘要

Large language models (LLMs) achieve strong performance on metaphor detection and interpretation tasks, yet it remains unclear what such behavioral success reveals about metaphor processing. We present a diagnostic analysis that examines the limits of behavioral evidence by probing three complementary dimensions: semantic attribute alignment, lexical invariance, and syntactic sensitivity. Using geometric probing, we assess whether model-generated interpretations align with reference semantic attributes; through context-varying substitution, we analyze the stability of lexical associations between metaphorical and literal expressions; and via controlled syntactic perturbations, we examine sensitivity in metaphor detection. Our analysis reveals that LLM-generated interpretations can exhibit semantic drift relative to reference attributes; stable lexical anchors persist across contextual conditions, potentially supporting conventional metaphors while biasing novel metaphors requiring contextual integration; and detection performance is sensitive to syntactic irregularities. These findings suggest that strong behavioral performance may reflect heterogeneous underlying signals, highlighting the need for caution when interpreting metaphor benchmarks as evidence of robust, integrated semantic understanding.

2508.09191 2026-06-18 cs.LG cs.AI 版本更新 85%

From Values to Tokens: An LLM-Driven Framework for Context-aware Time Series Forecasting via Symbolic Discretization

从数值到标记:一种基于符号离散化的LLM驱动上下文感知时间序列预测框架

Xiaoyu Tao, Shilong Zhang, Mingyue Cheng, Daoyu Wang, Tingyue Pan, Bokai Pan, Changqing Zhang, Shijin Wang

发表机构 * State Key Laboratory of Cognitive Intelligence(认知智能国家重点实验室) University of Science and Technology of China(中国科学技术大学) College of Intelligence and Computing(智能科学与计算学院) iFLYTEK Research(科大讯飞研究院)

专题命中 其他LLM :提出TokenCast框架,利用LLM进行时间序列预测。

AI总结 提出TokenCast框架,利用大语言模型通过符号离散化将连续时间序列转化为标记,与上下文文本对齐,实现上下文感知的预测,实验证明有效。

详情
AI中文摘要

时间序列预测在能源、医疗和金融等关键应用领域支持决策中起着重要作用。尽管近期取得了进展,但由于将历史数值序列与通常包含非结构化文本数据的上下文特征整合的挑战,预测精度仍然有限。为了解决这一挑战,我们提出了TokenCast,一个由大语言模型(LLM)驱动的框架,利用基于语言的符号表示作为上下文感知时间序列预测的统一中介。具体来说,TokenCast采用离散分词器将连续数值序列转化为时间标记,实现与基于语言输入的结构对齐。为了有效弥合模态之间的语义差距,时间和上下文标记通过预训练的LLM嵌入到共享表示空间中,并通过生成目标进一步优化。基于这一统一语义空间,对齐的LLM随后以监督方式进行微调,以预测未来的时间标记,然后解码回原始数值空间。在真实世界数据集上的大量实验证明了我们框架的有效性,并突显了其作为上下文感知时间序列预测生成框架的潜力。代码可从此https URL获取。

英文摘要

Time series forecasting plays a vital role in supporting decision-making across a wide range of critical applications, including energy, healthcare, and finance. Despite recent advances, forecasting accuracy remains limited due to the challenge of integrating historical numerical sequences with contextual features, which often comprise unstructured textual data. To address this challenge, we propose TokenCast, a large language model (LLM) driven framework that leverages language-based symbolic representations as a unified intermediary for context-aware time series forecasting. Specifically, TokenCast employs a discrete tokenizer to transform continuous numerical sequences into temporal tokens, enabling structural alignment with language-based inputs. To effectively bridge the semantic gap between modalities, both temporal and contextual tokens are embedded into a shared representation space via a pre-trained LLM, further optimized with generative objectives. Building upon this unified semantic space, the aligned LLM is subsequently fine-tuned in a supervised manner to predict future temporal tokens, which are then decoded back into the original numerical space. Extensive experiments on real-world datasets demonstrate the effectiveness of our framework and highlight its potential as a generative framework for context-aware time series forecasting. The code is available at https://github.com/Xiaoyu-Tao/TokenCast.

2506.15066 2026-06-18 cs.AR cs.MA 版本更新 85%

ChatModel: Automating Reference Model Design and Verification with LLMs

ChatModel: 利用LLMs自动化参考模型设计与验证

Jianmin Ye, Tianyang Liu, Qi Tian, Shengchu Su, Zhe Jiang, Xi Wang

专题命中 其他LLM :利用LLM自动化参考模型设计与验证,提升效率。

AI总结 提出ChatModel平台,通过设计标准化和分层敏捷建模,利用LLM自动生成参考模型,在300个设计上验证,效率提升最高58.99%,验证周期加速7.11倍。

详情
AI中文摘要

随着集成电路设计复杂性的不断升级,功能验证变得越来越具有挑战性。参考模型对于加速验证过程至关重要,但其自身也变得越来越复杂且耗时。尽管大型语言模型(LLM)在代码编程方面显示出潜力,但有效生成复杂参考模型仍然是一个重大障碍。因此,我们引入了ChatModel,一个LLM辅助的敏捷参考模型生成与验证平台。ChatModel通过集成设计标准化和分层敏捷建模,简化了从设计规范到功能完备参考模型的过渡。采用构建块生成策略,不仅增强了LLM对参考模型的设计能力,还显著提高了验证效率。我们在300个不同复杂度的设计上评估了ChatModel,证明了参考模型生成在效率和质量上的显著提升。与替代方法相比,ChatModel实现了最高58.99%的性能提升,生成稳定性显著增强,并且其生成参考模型设计的能力提高了9.18倍。此外,ChatModel将参考模型设计与验证周期平均加速了7.11倍,相比传统手动方法。这些结果突显了ChatModel在推动参考模型生成与验证自动化方面的巨大潜力。

英文摘要

As the complexity of integrated circuit designs continues to escalate, functional verification becomes increasingly challenging. Reference models, critical for accelerating the verification process, are themselves becoming more intricate and time-consuming to develop. Despite the promise shown by large language models (LLMs) in code programming, effectively generating complex reference models remains a significant hurdle. Therefore, we introduce ChatModel, an LLM-aided agile reference model generation and verification platform. ChatModel streamlines the transition from design specifications to fully functional reference models by integrating design standardization and hierarchical agile modeling. Employing a building-block generation strategy, it not only enhances the design capabilities of LLMs for reference models but also significantly boosts verification efficiency. We evaluated ChatModel on 300 designs of varying complexity, demonstrating substantial improvements in both efficiency and quality of reference model generation. ChatModel achieved a peak performance improvement of 58.99% compared to alternative methods, with notable enhancements in generation stability, and delivered a 9.18x increase in its capacity to produce reference model designs. Moreover, ChatModel accelerates the reference model design and validation cycles by an average of 7.11x over traditional manual approaches. These results highlight the potential of ChatModel to significantly advance the automation of reference model generation and validation.

2503.01163 2026-06-18 cs.AI cs.CL cs.HC cs.LG cs.NE 85%

Bandit-Based Prompt Design Strategy Selection Improves Prompt Optimizers

基于Bandit的提示设计策略选择改进提示优化器

Rin Ashizawa, Yoichi Hirose, Nozomu Yoshinari, Kento Uchida, Shinichi Shirakawa

发表机构 * Yokohama National University(横滨国立大学)

专题命中 其他LLM :提出OPTS方法优化LLM提示策略

AI总结 本文提出OPTS方法,通过显式选择提示设计策略提升EvoPrompt性能,采用Thompson采样机制在BIG-Bench Hard上验证效果,实现最优结果。

Comments Accepted to ACL 2025 Findings

详情
AI中文摘要

提示优化旨在寻找能提升大语言模型性能的有效提示。尽管现有方法已发现有效提示,但往往与人类专家精心设计的复杂提示不同。提示设计策略作为提升提示性能的最佳实践,对优化提示至关重要。最近,Autonomous Prompt Engineering Toolbox (APET) 将多种提示设计策略整合到提示优化过程中。在APET中,需要LLM隐式选择和应用合适的策略,因为提示设计策略可能产生负面影响。这种隐式选择可能因LLM的有限优化能力而表现不佳。本文引入Optimizing Prompts with sTrategy Selection (OPTS),实现提示设计的显式选择机制。我们提出三种机制,包括基于Thompson采样的方法,并将其整合到EvoPrompt中。在使用BIG-Bench Hard对Llama-3-8B-Instruct和GPT-4o mini进行提示优化的实验中,结果表明提示设计策略的选择提升了EvoPrompt的性能,Thompson采样机制实现了最佳整体结果。我们的实验代码可在https://github.com/shiralab/OPTS获取。

英文摘要

Prompt optimization aims to search for effective prompts that enhance the performance of large language models (LLMs). Although existing prompt optimization methods have discovered effective prompts, they often differ from sophisticated prompts carefully designed by human experts. Prompt design strategies, representing best practices for improving prompt performance, can be key to improving prompt optimization. Recently, a method termed the Autonomous Prompt Engineering Toolbox (APET) has incorporated various prompt design strategies into the prompt optimization process. In APET, the LLM must implicitly select and apply the appropriate strategies because prompt design strategies can have negative effects. This implicit selection may be suboptimal due to the limited optimization capabilities of LLMs. This paper introduces Optimizing Prompts with sTrategy Selection (OPTS), which implements explicit selection mechanisms for prompt design. We propose three mechanisms, including a Thompson sampling-based approach, and integrate them into EvoPrompt, a well-known prompt optimizer. Experiments optimizing prompts for two LLMs, Llama-3-8B-Instruct and GPT-4o mini, were conducted using BIG-Bench Hard. Our results show that the selection of prompt design strategies improves the performance of EvoPrompt, and the Thompson sampling-based mechanism achieves the best overall results. Our experimental code is provided at https://github.com/shiralab/OPTS.
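
Explicit strategy selection with Thompson sampling can be sketched as a standard Beta-Bernoulli bandit. This is a generic textbook formulation, not necessarily the exact mechanism used in OPTS. The strategy names, the simulated improvement rates, and the reward definition (1 if applying the strategy improved the prompt, else 0) are all hypothetical.

```python
import random

class ThompsonSelector:
    """Beta-Bernoulli Thompson sampling over a fixed set of arms."""

    def __init__(self, arms, seed=0):
        self.rng = random.Random(seed)
        # Beta(1, 1) prior: no initial preference among strategies.
        self.alpha = {a: 1.0 for a in arms}
        self.beta = {a: 1.0 for a in arms}

    def select(self):
        # Sample a plausible success rate per arm and pick the largest.
        samples = {a: self.rng.betavariate(self.alpha[a], self.beta[a])
                   for a in self.alpha}
        return max(samples, key=samples.get)

    def update(self, arm, improved):
        # Posterior update: count successes in alpha, failures in beta.
        if improved:
            self.alpha[arm] += 1.0
        else:
            self.beta[arm] += 1.0

def simulate(true_rates, rounds=500, seed=0):
    # Stand-in for the optimizer loop: the "environment" decides at random
    # whether the chosen strategy improved the prompt in this round.
    selector = ThompsonSelector(list(true_rates), seed=seed)
    env = random.Random(seed + 1)
    pulls = {a: 0 for a in true_rates}
    for _ in range(rounds):
        arm = selector.select()
        selector.update(arm, env.random() < true_rates[arm])
        pulls[arm] += 1
    return pulls

pulls = simulate({"strategy_a": 0.9, "strategy_b": 0.2, "strategy_c": 0.1})
print(pulls)
```

In a real prompt optimizer the simulated reward would be replaced by a measured score change on a development set. The appeal of this design is that strategies which hurt performance are tried less and less often without being ruled out in advance.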

2412.15557 2026-06-18 cs.SE cs.CL 版本更新 85%

MORTAR: Multi-turn Metamorphic Testing for LLM-based Dialogue Systems

MORTAR:基于LLM的对话系统的多轮蜕变测试

Aaron Guoxiang Guo, Aldeida Aleti, Neelofar Neelofar, Chakkrit Tantithamthavorn, Yuanyuan Qi, Tsong Yueh Chen

发表机构 * Faculty of Information Technology, Monash University(莫纳什大学信息技术学院) School of Computing Technologies, RMIT University(皇家墨尔本理工大学计算技术学院) School of Science, Computing and Emerging Technologies, Swinburne University of Technology(斯威本理工大学科学、计算与新兴技术学院)

专题命中 其他LLM :LLM对话系统多轮测试方法

AI总结 提出MORTAR方法,通过多轮蜕变关系自动化生成测试用例,解决LLM对话系统多轮测试中的预言问题,相比单轮测试每个用例发现更多且更高质量的缺陷。

Comments Accepted for publication in IEEE Transactions on Software Engineering (TSE)

详情
AI中文摘要

随着基于LLM的对话系统在日常生活中的广泛应用,质量保证变得比以往更加重要。最近的研究成功引入了在单轮测试场景中识别意外行为的方法。然而,多轮交互是对话系统常见的实际使用方式,但针对此类交互的测试方法仍未得到充分探索。这主要是由于多轮测试中的预言问题,它仍然是对话系统开发人员和研究人员面临的重大挑战。在本文中,我们提出了MORTAR,一种蜕变式多轮对话测试方法,它缓解了测试基于LLM的对话系统时的测试预言问题。MORTAR形式化了对话系统的多轮测试,并自动生成问答对话测试用例,其中包含多种对话级扰动和蜕变关系(MRs)。自动化的MR匹配机制使MORTAR在蜕变测试中具有更高的灵活性和效率。所提出的方法完全自动化,无需依赖LLM评判。在测试六个流行的基于LLM的对话系统时,与单轮蜕变测试基线相比,MORTAR每个测试用例发现的错误数量增加了150%以上,效果显著更好。在错误质量方面,MORTAR在多样性、精确性和唯一性方面揭示了更高质量的错误。MORTAR有望激发更多的多轮测试方法,并帮助开发人员在有限的测试资源和预算下更全面地评估对话系统性能。

英文摘要

With the widespread application of LLM-based dialogue systems in daily life, quality assurance has become more important than ever. Recent research has successfully introduced methods to identify unexpected behaviour in single-turn testing scenarios. However, multi-turn interaction is the common real-world usage of dialogue systems, yet testing methods for such interactions remain underexplored. This is largely due to the oracle problem in multi-turn testing, which continues to pose a significant challenge for dialogue system developers and researchers. In this paper, we propose MORTAR, a metamorphic multi-turn dialogue testing approach, which mitigates the test oracle problem in testing LLM-based dialogue systems. MORTAR formalises the multi-turn testing for dialogue systems, and automates the generation of question-answer dialogue test cases with multiple dialogue-level perturbations and metamorphic relations (MRs). The automated MR matching mechanism allows MORTAR more flexibility and efficiency in metamorphic testing. The proposed approach is fully automated without reliance on LLM judges. In testing six popular LLM-based dialogue systems, MORTAR reaches significantly better effectiveness with over 150% more bugs revealed per test case when compared to the single-turn metamorphic testing baseline. Regarding the quality of bugs, MORTAR reveals higher-quality bugs in terms of diversity, precision and uniqueness. MORTAR is expected to inspire more multi-turn testing approaches, and assist developers in evaluating the dialogue system performance more comprehensively with constrained test resources and budget.
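
The idea of sidestepping the oracle problem with a metamorphic relation can be illustrated on a toy dialogue system. Everything below is a hypothetical sketch: the two rule-based "bots", the distractor perturbation, and the relation itself are illustrative and are not MORTAR's actual perturbations or MRs. The point is that the check compares two outputs of the same system and never needs a ground-truth answer.

```python
PREFIX = "my name is "

def correct_bot(turns):
    # Answers the final question from the latest name statement in the history.
    name = None
    for turn in turns[:-1]:
        if turn.lower().startswith(PREFIX):
            name = turn[len(PREFIX):]
    return name or "unknown"

def buggy_bot(turns):
    # Bug: only inspects the turn immediately before the question.
    prev = turns[-2] if len(turns) >= 2 else ""
    return prev[len(PREFIX):] if prev.lower().startswith(PREFIX) else "unknown"

def insert_distractor(turns, distractor="The weather is nice today."):
    # Dialogue-level perturbation: add an unrelated turn before the question.
    return turns[:-1] + [distractor, turns[-1]]

def violates_mr(bot, turns):
    # Metamorphic relation: an unrelated inserted turn must not change
    # the answer to the final question.
    return bot(turns) != bot(insert_distractor(turns))

source = ["My name is Ada", "What is my name?"]
print(violates_mr(correct_bot, source))
print(violates_mr(buggy_bot, source))
```

The buggy bot answers the source dialogue correctly, so a single-turn style check on that dialogue alone would not expose the defect. Only the comparison between the source and the perturbed follow-up dialogue reveals it.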

2506.09822 2026-06-18 cs.CE cs.AI 85%

Superstudent intelligence in thermodynamics

热力学中的超级学生智能

Rebecca Loubet, Pascal Zittlau, Marco Hoffmann, Luisa Vollmer, Sophie Fellenz, Heike Leitte, Fabian Jirasek, Johannes Lenhard, Hans Hasse

Affiliations * Laboratory of Engineering Thermodynamics (LTD); Visual Information Analysis Research Group (VIA); Machine Learning Research Group (ML)

Topic match Other LLM: evaluates the o3 model's performance on a thermodynamics exam

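AI summary The study shows that OpenAI's o3 model outperformed all students on a thermodynamics exam, demonstrating machine capability on complex tasks, with implications for engineering education and practice.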

Comments This document is the unedited Author's version of a yet to be Submitted Work to Physical Review Physics Education Research. 15 pages, 2 figures, Graphical Abstract, Highlights and SI available (12 pages)

Details
AI Chinese abstract

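In this paper, we report and analyze a striking event: OpenAI's large language model o3 beat all students in a thermodynamics exam. The thermodynamics exam is a hurdle for most students, who must demonstrate mastery of the fundamentals of this important subject. As a result, failure rates are high, and A grades are rare and regarded as proof of a student's exceptional intellectual ability. This is because pattern learning does not help in the exam: the problems can only be solved by creatively combining principles of thermodynamics. We gave our latest thermodynamics exam not only to the students but also to o3, OpenAI's most powerful reasoning model, and assessed its answers in the same way. In zero-shot mode, o3 solved all problems correctly, better than every student who took the exam; its overall score was within the range of the best scores among more than 10,000 similar exams since 1985. This marks a turning point: machines now excel at complex tasks that are usually taken as proof of human intellectual capability. We discuss what this means for the work of engineers and the education of future engineers.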

English abstract

In this short note, we report and analyze a striking event: OpenAI's large language model o3 has outwitted all students in a university exam on thermodynamics. The thermodynamics exam is a difficult hurdle for most students, where they must show that they have mastered the fundamentals of this important topic. Consequently, the failure rates are very high, A-grades are rare - and they are considered proof of the students' exceptional intellectual abilities. This is because pattern learning does not help in the exam. The problems can only be solved by knowledgeably and creatively combining principles of thermodynamics. We have given our latest thermodynamics exam not only to the students but also to OpenAI's most powerful reasoning model, o3, and have assessed the answers of o3 exactly the same way as those of the students. In zero-shot mode, the model o3 solved all problems correctly, better than all students who took the exam; its overall score was in the range of the best scores we have seen in more than 10,000 similar exams since 1985. This is a turning point: machines now excel in complex tasks, usually taken as proof of human intellectual capabilities. We discuss the consequences this has for the work of engineers and the education of future engineers.

2504.12347 2026-06-18 cs.CL cs.AI cs.CY 85%

Assessment of Evolving Large Language Models in Upper Secondary Mathematics

对上中学数学中演进式大语言模型的评估

Mika Setälä, Pieta Sikström, Ville Heilala, Tommi Kärkkäinen

Affiliations * Faculty of Information Technology; University of Jyväskylä; Faculty of Humanities and Social Sciences

Topic match Other LLM: evaluates LLM capabilities on upper secondary mathematics exams

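AI summary This paper evaluates the mathematical ability of different large language models on the Finnish matriculation examination and finds that performance improved substantially as the models evolved, with some models reaching near-perfect scores, showing the rapid progress of LLMs in mathematics and their potential in education.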

Details
AI Chinese abstract

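Large language models (LLMs) show growing promise in educational settings, but their mathematical reasoning is considered to be still evolving. This study evaluates the mathematical capabilities of various LLMs using the Finnish matriculation examination, a high-stakes digital test for upper secondary education. Initial tests showed moderate performance corresponding to mid-range grades, but later evaluations showed substantial improvement as the language models evolved. Remarkably, some models achieved near-perfect or perfect scores, matching top student performance and meeting university admission requirements. Our findings highlight the rapid progress in LLM mathematical proficiency and show their potential as tools to support learning and teaching.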

English abstract

Large language models (LLMs) have shown increasing promise in educational settings, yet their mathematical reasoning has been considered evolving. This study evaluates the mathematical capabilities of various LLMs using the Finnish matriculation examination, a high-stakes digital test for upper secondary education. Initial tests yielded moderate performance corresponding to mid-range grades, but later evaluations demonstrated substantial improvements as the language models evolved. Remarkably, some models achieved near-perfect or perfect scores, matching top student performance and qualifying for university admission. Our findings highlight the rapid advances in the mathematical proficiency of LLMs and illustrate their potential as underlying tools to support learning and teaching in a variety of ways.