Large Language Models / LLM - arXivDaily Topic

2606.18875 2026-06-18 cs.CL New submission 85%

Efficient Financial Language Understanding via Distillation with Synthetic Data

Wen-Fong Huang, Edwin Simpson

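Affiliations: School of Engineering Mathematics and Technology; University of Bristol

Topic match (instruction tuning): distills a large teacher model into a small model for financial sentiment analysis.

AI summary: Proposes a framework for financial sentiment analysis under low-resource conditions through distillation with synthetic data. Clustering-based seed selection produces representative synthetic data, so compact models reach strong performance with few labels and even outperform the teacher model on some tasks.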

Journal ref Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), European Language Resources Association (ELRA), 2026, pp. 10242-10254

Details

DOI: 10.63317/3b3zxy5qrw8s

Abstract

Large instruction-following models are powerful but costly to deploy, particularly in finance, where labelled data are limited by confidentiality and expert annotation cost. We present an efficient framework for financial sentiment analysis through distillation with synthetic data, transferring knowledge from a large instruction-tuned teacher to compact student models. The framework is designed for low-resource conditions, where a small set of real examples are collected and labelled by hand. The framework then clusters the examples and uses the clusters to select seeds for generating synthetic examples via structured few-shot prompting. Experiments show that clustering-based seed selection yields more representative synthetic data than random sampling, enabling compact models to achieve strong performance with minimal supervision. Notably, on a more complex and noisy text domain, the compact model trained on the complete synthetic-seed corpus even outperforms the teacher model, while remaining competitive on formal text. The framework provides a practical route toward resource-efficient domain adaptation in financial NLP with minimal human labelling effort.
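
The clustering-based seed selection described in this abstract can be sketched roughly as follows. The abstract does not name the clustering algorithm, the embedding model, or the selection rule, so everything here is an assumption for illustration: plain k-means with a deterministic farthest-point initialisation, toy 2-D vectors standing in for sentence embeddings, and "closest to the centroid" as the seed criterion. The function names `init_centroids`, `kmeans`, and `select_seeds` are hypothetical.

```python
import math


def init_centroids(points, k):
    """Deterministic farthest-point initialisation (an assumption of this sketch)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Pick the point whose nearest existing centroid is farthest away.
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(far))
    return centroids


def kmeans(points, k, iters=20):
    """Plain k-means on a list of equal-length float vectors."""
    centroids = init_centroids(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        assign = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign


def select_seeds(points, k, per_cluster=1):
    """Return indices of the examples closest to each cluster centroid."""
    centroids, assign = kmeans(points, k)
    seeds = []
    for c in range(k):
        members = [i for i, a in enumerate(assign) if a == c]
        members.sort(key=lambda i: math.dist(points[i], centroids[c]))
        seeds.extend(members[:per_cluster])
    return seeds


# Toy 2-D "embeddings": two well-separated groups of hand-labelled examples.
embeddings = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
              [5.0, 5.1], [5.1, 5.0], [5.05, 5.05]]
seed_ids = select_seeds(embeddings, k=2)
print(seed_ids)
```

Under these assumptions, the selected indices would point at the labelled examples that go into the few-shot prompt for the teacher model. Random sampling, the baseline the abstract compares against, could by chance draw every seed from a single group.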

2606.18307 2026-06-18 cs.LG cs.AI New submission 85%

DRIFT: Refining Instruction Data via On-Policy Data Attribution

Zefan Wang, Lincheng Li, Tianyu Yu, Yuan Yao

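Affiliations: Tsinghua University

Topic match (instruction tuning): proposes the DRIFT method for refining the instruction-tuning data distribution and raising the LLM performance ceiling.

AI summary: Proposes DRIFT, which uses on-policy influence functions to address the proximity gap and gradient-norm bias that standard influence functions show when attributing instruction-tuning data. It uses the model's own generations as validation targets and raises the performance ceiling of 7B models.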

Details

Abstract

Optimizing the training data distribution for Supervised Fine-Tuning (SFT) dictates the capability of Large Language Models (LLMs). While existing data curation methods excel at accelerating training under constrained budgets, they are less suited to elevating the capability upper bound. The challenge here is no longer to identify a smaller subset that preserves performance, but to refine the data distribution toward instances most capable of improving the final model. To address this problem, we explore instance-level data attribution using Influence Functions (IF). We identify that standard IF formulations struggle in this setting due to two structural limitations: a proximity gap caused by off-policy validation targets, and a severe bias towards gradient norm. We propose DRIFT (Data Refinement via On-Policy Influence Functions for Supervised Fine-Tuning). Instead of relying on external reference data, DRIFT utilizes the model's on-policy rollouts as validation targets, which empirically minimizes the parameter proximity gap and better aligns with the local neighborhood assumption of IF. It further applies signed weighting based on trajectory correctness and debiases influence scores against the gradient hacking issue, allowing a small set of validation queries to act as reliable anchors for attributing the full dataset. Experiments on 7B-parameter instruction and reasoning models show that DRIFT consistently raises the performance ceiling on both, outperforming existing data curation baselines.
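
The two ideas named in this abstract, signed weighting by rollout correctness and debiasing against gradient norm, can be illustrated with a small sketch. The abstract does not give DRIFT's actual scoring formula, so this is not the authors' method: it swaps in a first-order gradient-similarity proxy (no inverse Hessian) and uses cosine similarity as one simple way to remove the dependence on gradient magnitude. The function `attribution_scores` and the toy gradient vectors are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu and nv else 0.0


def attribution_scores(train_grads, rollout_grads, rollout_correct):
    """Score each training example against on-policy rollouts.

    Assumption of this sketch: alignment with a correct rollout counts
    positively, alignment with an incorrect rollout counts negatively,
    and cosine similarity stands in for a norm-debiased influence score.
    """
    scores = []
    for g in train_grads:
        total = 0.0
        for v, ok in zip(rollout_grads, rollout_correct):
            sign = 1.0 if ok else -1.0
            total += sign * cosine(g, v)
        scores.append(total / len(rollout_grads))
    return scores


# Toy gradients: the second training example is the first scaled by 10.
train_grads = [[1.0, 0.0], [10.0, 0.0], [0.0, 1.0]]
rollout_grads = [[1.0, 0.0], [0.0, 1.0]]   # gradients of two model rollouts
rollout_correct = [True, False]            # first rollout right, second wrong

scores = attribution_scores(train_grads, rollout_grads, rollout_correct)
print(scores)
```

In this toy setup the first two training examples receive the same score even though one gradient is ten times larger, whereas a raw dot product would rank the large-norm example far higher. The third example aligns with an incorrect rollout and scores negative.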

2606.18850 2026-06-18 cs.CL cs.IR New submission 75%

ScholarSum: Student-Teacher Abstractive Summarization via Knowledge Graph Reasoning and Reflective Refinement

Bohou Zhang, Xiaoyu Tao, Mingyue Cheng, Huijie Liu, Qi Liu

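Affiliations: State Key Laboratory of Cognitive Intelligence

Topic match (instruction tuning): scientific literature summarization with a student-teacher framework and a knowledge graph.

AI summary: Proposes the ScholarSum framework, which builds a hierarchical knowledge graph to guide a student in generating an initial draft, then has a teacher-like reviewer iteratively check and correct it, yielding scientific summaries that are both fluent and factually consistent.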

Details

Abstract

Abstractive summarization plays a crucial role in enabling efficient understanding of scientific literature, yet it inherently demands both linguistic fluency and factual faithfulness. Existing approaches often fail to reconcile these two requirements. Extractive methods rely on rigid sentence splicing that disrupts macro-level logical coherence, while large language model (LLM)-based generative approaches, despite mastering linguistic fluency, exhibit limited factual consistency. In this work, we propose ScholarSum, a hierarchical reflective graph-based framework that emulates a student-teacher writing process for fluent and faithful scientific summarization. ScholarSum first organizes the document into a hierarchical knowledge graph by segmenting it into semantically coherent units, whose multi-layered community structure captures global logic and macro-level themes. Guided by this global structure, the student generates an initial draft, which is subsequently refined through fine-grained evidence retrieval. To ensure factual consistency, a teacher-like reviewer then iteratively examines the draft, identifies unsupported content, and prompts targeted re-retrieval and rewriting until the summary meets rigorous quality standards. Extensive experiments demonstrate that ScholarSum significantly outperforms previous baselines in terms of both completeness and faithfulness. Our code is available at https://github.com/Xiaoyu-Tao/ScholarSum.
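
The review-and-rewrite loop in this abstract (examine the draft, flag unsupported content, re-retrieve, rewrite, repeat) can be sketched in a few lines. This is only a toy stand-in, not the ScholarSum implementation: the reviewer is a word-overlap check rather than an LLM, the "rewrite" simply substitutes the best-matching evidence sentence, and the knowledge graph is omitted entirely. The names `is_supported`, `retrieve`, `review_and_refine`, and the 0.6 threshold are all hypothetical.

```python
def tokens(text):
    """Lowercased word set with full stops removed."""
    return set(text.lower().replace(".", "").split())


def is_supported(sentence, evidence, threshold=0.6):
    """Toy reviewer: enough of the sentence's words appear in one evidence unit."""
    words = tokens(sentence)
    if not words:
        return True
    return any(len(words & tokens(e)) / len(words) >= threshold for e in evidence)


def retrieve(sentence, evidence):
    """Toy re-retrieval: the evidence unit sharing the most words with the sentence."""
    return max(evidence, key=lambda e: len(tokens(sentence) & tokens(e)))


def review_and_refine(draft, evidence, max_rounds=3):
    """Replace unsupported draft sentences with retrieved evidence until all pass."""
    for _ in range(max_rounds):
        unsupported = [s for s in draft if not is_supported(s, evidence)]
        if not unsupported:
            break  # the reviewer accepts the draft
        # Stand-in for an LLM rewrite: fall back to the retrieved evidence itself.
        draft = [s if is_supported(s, evidence) else retrieve(s, evidence)
                 for s in draft]
    return draft


evidence = ["The model uses a knowledge graph.",
            "Experiments cover three datasets."]
draft = ["The model uses a knowledge graph.",
         "Experiments cover ten datasets and two languages."]

refined = review_and_refine(draft, evidence)
print(refined)
```

The design point the sketch tries to capture is that rewriting is targeted: only sentences the reviewer rejects are revisited, and the loop stops as soon as every sentence passes or the round budget runs out.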

2606.18902 2026-06-18 cs.CL New submission 70%

SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration

Ziyi Zhu, Luka Smyth, Saki Shinoda, Jinghong Chen

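Affiliations: Slingshot AI; Department of Engineering, University of Cambridge

Topic match (instruction tuning): automatic prompt optimization is an LLM application.

AI summary: Proposes SPO, a stochastic prompt optimization framework, within which the SAGE method performs black-box search through a multi-agent pipeline with diagnostic code execution. Across three benchmarks, effectiveness depends on error type. Deployed on a mental-health chatbot under continuous optimization, SAGE yields a statistically robust gain in next-day retention.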

Details

Abstract

Context engineering has emerged as a primary lever for improving AI systems without parameter updates. Recent work showing that textual gradients do not function as real gradients motivates treating automatic prompt optimization (APO) as black-box search. We introduce SPO (Stochastic Prompt Optimization), a framework for stochastic search over prompt space, and compare three strategies of increasing sophistication: error-informed random search, a genetic algorithm with evolutionary operators, and SAGE (SPO via Agent-Guided Exploration), a multi-agent pipeline with diagnostic code execution. Across three benchmarks, no single strategy dominates; effectiveness depends on the interaction of landscape structure with error type. We further deploy SAGE on a mental-health chatbot under a continuous optimization paradigm, where it compounds eight cycles of individually-noisy A/B tests into a statistically robust gain in next-day retention. We argue that coupling qualitative diagnosis with quantitative validation is what makes agentic optimization effective for open-ended task-oriented dialogue.
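
The simplest of the three strategies in this abstract, error-informed random search, treats the prompt as a black box: evaluate, look at the observed errors, propose an edit aimed at one of them, and keep the edit only if the score improves. The sketch below is an assumption-laden toy, not the paper's implementation. The edit pool `EDITS`, the dev set `DEV_SET`, and the scorer `evaluate` are hypothetical, and a real system would call an LLM where this code does a lookup.

```python
import random

# Hypothetical pool of prompt edits, keyed by the error type they target.
EDITS = {
    "format": "Answer with a single word.",
    "negation": "Pay attention to negations.",
    "units": "Always state the units.",
}

# Toy dev set: each case fails with its error type unless the matching
# edit is present in the prompt.
DEV_SET = ["format", "format", "negation", "units"]


def evaluate(prompt_lines):
    """Black-box scorer: returns accuracy and the list of observed error types."""
    errors = [e for e in DEV_SET if EDITS[e] not in prompt_lines]
    return 1 - len(errors) / len(DEV_SET), errors


def error_informed_search(base_prompt, steps=20, seed=0):
    """Random search over prompt edits, guided by the errors seen so far."""
    rng = random.Random(seed)
    best = list(base_prompt)
    best_score, errors = evaluate(best)
    for _ in range(steps):
        if not errors:
            break
        target = rng.choice(errors)          # sample an observed error to address
        candidate = best + [EDITS[target]]   # propose an edit aimed at it
        score, cand_errors = evaluate(candidate)
        if score > best_score:               # keep only validated improvements
            best, best_score, errors = candidate, score, cand_errors
    return best, best_score


prompt, score = error_informed_search(["You are a helpful assistant."])
print(score)
print("\n".join(prompt))
```

No gradient is computed anywhere: the search only needs a scorer and a way to propose edits, which is what makes the black-box framing applicable. In this toy every proposed edit helps, so the accept-if-better check never rejects. With a noisy real-world scorer that check is where quantitative validation would come in.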

2606.18257 2026-06-18 cs.HC cs.AI New submission 70%

From Memorization to Creation: Evaluating the Cognitive Depth of LLM-Generated Educational Questions

Xiaolong Wang, Zhe Zhao, Song Lai, Chaoli Zhang, Zijie Geng, Yu Tong, Ye Wei, Qingsong Wen

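Affiliations: City University of Hong Kong; Zhejiang Normal University; Squirrel Ai Learning; University of Science and Technology of China; Wuhan University

Topic match (instruction tuning): evaluates the cognitive level of LLM-generated questions and involves prompting strategies.

AI summary: Evaluates the cognitive level of questions generated by six LLMs through Bloom's Taxonomy, proposes a fine-grained prompting strategy that reduces repetitiveness and raises the share of higher-order questions, introduces metrics for cognitive shift intensity and category drift, and analyzes the interpretability of Chain-of-Thought prompting.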

Comments Accepted by KDD 2026

Journal ref KDD 2026

Details

DOI: 10.1145/3770854.3785686

Abstract

While LLMs show promise in automating educational content creation, their ability to generate questions that stimulate higher-order thinking remains understudied. This work evaluates six widely-used LLMs through a Bloom's Taxonomy lens, focusing on their capacity to transcend rote memorization and achieve cognitive leaps. Using a hybrid human-AI evaluation protocol, we generate and analyze 20,700 questions across computer science, K-12 math, and social-science domains. Key contributions include: (1) a fine-grained prompting strategy that reduces question repetitiveness by 24.45% for Qwen2.5-7B-Instruct, and increases the proportion of higher-order cognitive level outputs by 11.53% for InternLM3-8B-Instruct; (2) quantitative metrics for cognitive shift intensity (CogShift) and category drift, revealing InternLM3's superior performance in multi-level transitions; (3) an interpretability analysis revealing metric-level correlations that enhance the transparency of Chain-of-Thought prompting. Our findings highlight the importance of cognitive-aware prompt design and provide benchmarks for deploying LLMs in personalized learning systems.

2606.18624 2026-06-18 cs.CL New submission 60%

PragReST: Self-Reinforcing Counterfactual Reasoning for Pragmatic Language Understanding

Jihyung Park, Minchao Huang, Leqi Liu, Elias Stengel-Eskin

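Affiliations: The University of Texas at Austin

Topic match (instruction tuning): trains models with supervised fine-tuning and reinforcement learning.

AI summary: Proposes the PragReST framework, which builds pragmatic QA data in a self-supervised way, generates counterfactual reasoning traces, and combines supervised fine-tuning with reinforcement learning to improve the pragmatic reasoning of LLMs, improving over baseline models on four benchmarks.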

Comments First two authors contributed equally. Code and models: https://github.com/jihyung803/PragReST

Details

Abstract

Natural language understanding often depends on meanings that are implied rather than explicitly stated, requiring pragmatic reasoning. Despite strong performance on math and logical reasoning, large language models (LLMs) still struggle with making pragmatic inferences, often choosing literal interpretations. To improve LLM pragmatic reasoning, we introduce PragReST, a self-supervised framework that constructs pragmatic QA data, generates counterfactual reasoning traces, and trains models to internalize them through supervised fine-tuning and reinforcement learning, without human-labeled training data or distillation from a stronger teacher. Across four pragmatic benchmarks (PragMega, Ludwig, MetoQA, and AltPrag), PragReST improves over backbone models, task-specific pragmatic tuning baselines, and non-counterfactual variants of the same pipeline. On accuracy-based benchmarks, PragReST improves over the instruct backbone by 5.37 and 5.50% (absolute) for Qwen3-8B and Qwen3-14B, respectively. Our error analysis and ablations underscore the importance of counterfactual reasoning: PragReST primarily reduces errors caused by failures to contrast observed utterances with plausible alternatives, and removing counterfactual reasoning substantially reduces performance. Moreover, our training preserves out-of-domain performance on general-knowledge and mathematical reasoning benchmarks.
