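arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.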

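2606.18394 2026-06-18 cs.CL New submission

Dual Dimensionality of Local and Global Attention

Zhiyuan Wang, Xuan Luo, Sirui Zeng, Xifeng Yan

Affiliation: UC Santa Barbara

AI summary: Proposes Distance-Adaptive Representation (DAR), which keeps full-dimensional representations for the local context and low-dimensional representations for distant tokens, shrinking the KV cache while preserving performance.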

Abstract

Decoder-only Transformers compute attention over the KV cache of preceding tokens. Keys (and Values) are typically represented with the same dimensionality, regardless of their distance from the prediction target. In natural language, however, the next word is most strongly influenced by the immediately preceding tokens. We hypothesize that local and distant tokens impose asymmetric demands on representational capacity: local tokens are more critical for predicting immediate outputs and thus require richer representations, whereas distant tokens primarily serve as long-range memory, for which lower-dimensional representations may suffice. We formalize this idea as Distance-Adaptive Representation (DAR), implemented in a controlled setting that preserves full-dimensional representations within a local context window while assigning reduced-dimensional representations (e.g. 1/4 of the original dimensionality) to tokens beyond that window. Across multiple pretraining scales (70M to 410M parameters), as well as continued supervised fine-tuning on a 1B-scale model, this approach closely matches the performance of full-dimensional baselines. In contrast, uniformly reducing dimensionality across all token positions leads to worse performance. These results challenge the common assumption that key and value dimensionality should be uniform across token positions. Our findings suggest a new direction for designing attention architectures that adaptively allocate representational capacity across sequences, enabling further reductions in KV cache during inference.
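
The distance-dependent scoring described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: truncating a key to its first `reduced_dim` coordinates is an assumed stand-in for a learned low-dimensional projection, and the function names, the per-branch scaling, and the toy sizes are all hypothetical.

```python
import math

def dar_attention_weights(query, keys, window, reduced_dim):
    """Attention weights of one query over cached keys, DAR-style.

    Keys within `window` positions of the query are scored with all
    dimensions; older keys are scored with only the first `reduced_dim`
    dimensions (an assumed stand-in for a learned low-dimensional projection).
    """
    full_dim = len(query)
    scores = []
    for pos, key in enumerate(keys):
        distance = len(keys) - 1 - pos  # 0 for the most recent cached token
        dim = full_dim if distance < window else reduced_dim
        dot = sum(q * k for q, k in zip(query[:dim], key[:dim]))
        scores.append(dot / math.sqrt(dim))
    peak = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def key_cache_floats(seq_len, full_dim, window, reduced_dim):
    """Floats needed to cache keys for one head under this scheme."""
    local = min(seq_len, window)
    return local * full_dim + (seq_len - local) * reduced_dim

# Toy inputs: six cached keys of dimension 8, local window of 2 tokens.
keys = [[0.1 * (i + j) for j in range(8)] for i in range(6)]
weights = dar_attention_weights([0.5] * 8, keys, window=2, reduced_dim=2)
saving = key_cache_floats(1024, 64, 128, 16) / key_cache_floats(1024, 64, 1024, 64)
```

In this toy count, a 128-token window with 1/4 dimensionality stores about 34% of the floats of a full key cache at 1,024 tokens; that figure is arithmetic on the sketch's assumed sizes, not a result from the paper.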

2606.18663 2026-06-18 cs.CL New submission

RegMix-D: Dynamic Data Mixing via Proxy Training Trajectories

Kaiyan Zhao, Zhongtao Miao, Akiko Aizawa, Yoshimasa Tsuruoka

Affiliations: The University of Tokyo; National Institute of Informatics

AI summary: Proposes RegMix-D, which uses proxy training trajectories to predict the best mixture at multiple training stages, enabling dynamic data mixing. It outperforms RegMix and DoReMi on 13 downstream tasks while using only 25% of RegMix's proxy compute budget.

Comments Work in progress

Abstract

Data mixture selection is critical for Large Language Model pretraining. Existing methods such as RegMix select a single static mixture by fitting a regression model on small-scale proxy runs. We propose RegMix-D, a simple extension of RegMix to dynamic mixing. Our key observation is that proxy runs produce not only endpoint losses, but also full loss trajectories, which can be used to further improve the data mixture. By training a regression model on these trajectories, we can predict optimal mixtures at multiple training stages. RegMix-D supports two deployment modes: an offline variant that generates a complete mixture schedule before target training, and an online variant that adapts the mixture during training using observed loss. Experiments on 25B tokens of the Pile dataset with a 1B parameter target model show that RegMix-D consistently improves over RegMix and DoReMi across 13 downstream tasks while remaining proxy-efficient: it surpasses RegMix even with only 128 proxy models (25% of RegMix's proxy compute budget).
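
The idea of fitting one regression per training stage and reading off a mixture schedule can be sketched as follows. Everything here is an assumption made for illustration: two data domains (so a mixture is a single weight `w`), a quadratic regressor, a grid search, and synthetic loss values. The abstract does not specify the regressor or features RegMix-D actually uses.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_quadratic(ws, losses):
    """Least-squares fit of loss ~ a + b*w + c*w^2 via the normal equations."""
    feats = [[1.0, w, w * w] for w in ws]
    ata = [[sum(f[i] * f[j] for f in feats) for j in range(3)] for i in range(3)]
    atb = [sum(f[i] * y for f, y in zip(feats, losses)) for i in range(3)]
    return solve_linear(ata, atb)

def best_mixture(coeffs, grid_steps=10):
    """Grid-search the mixture weight with the lowest predicted loss."""
    a, b, c = coeffs
    grid = [i / grid_steps for i in range(grid_steps + 1)]
    return min(grid, key=lambda w: a + b * w + c * w * w)

# Synthetic proxy trajectories: loss of each proxy run at two training stages.
proxy_ws = [0.0, 0.25, 0.5, 0.75, 1.0]
trajectories = {
    "stage_1": [(w - 0.3) ** 2 + 2.0 for w in proxy_ws],
    "stage_2": [(w - 0.7) ** 2 + 1.5 for w in proxy_ws],
}
schedule = {
    stage: best_mixture(fit_quadratic(proxy_ws, losses))
    for stage, losses in trajectories.items()
}
```

In this synthetic setup the predicted optimum moves from `w = 0.3` to `w = 0.7` between stages, which is the kind of stage-dependent schedule an offline variant would hand to the target run.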

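2606.18831 2026-06-18 cs.CL cs.AI New submission

Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

Xiaoyue Xu, Sikui Zhang, Xiaorong Wang, Xu Han, Chaojun Xiao

Affiliations: OpenBMB; Tsinghua University

AI summary: Proposes a simple yet effective data recipe that, paired with a minimal outcome-based GRPO setup, substantially improves the long-context reasoning of large language models, with average gains of +3.2 to +7.2 points on long-context benchmarks and further gains on agentic tasks.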

Comments 15 pages, 6 figures, 12 tables

Abstract

Long-context reasoning is an essential capability for large language models, particularly when they are deployed as autonomous agents that must reason over lengthy trajectories. Reinforcement learning (RL) has recently emerged as a dominant paradigm for improving this ability, yet existing work largely focuses on reward engineering while diverse training data remains scarce. We revisit this problem from a data-centric perspective and show that a simple yet effective data recipe alone, paired with a minimal outcome-based GRPO setup, suffices to substantially improve long-context reasoning. Our recipe targets three complementary task families -- retrieval, multi-evidence synthesis, and reasoning -- for which we construct and curate eight datasets totaling ~14K examples. Experiments on three models (Qwen3-4B/8B/30B-A3B) yield average gains of +7.2/+3.2/+6.4 points across seven long-context benchmarks, surpassing prior RL training sets. We further demonstrate that these gains transfer to agentic tasks, where continuing RL training on an agent-tuned model with our data recipe improves GAIA by +4.8 and BrowseComp by +7.0 points. We will release our datasets to facilitate future research.
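
For readers unfamiliar with outcome-based GRPO, the group-relative advantage at its core can be sketched in a few lines. This is a generic illustration, not the paper's training code: the use of the population standard deviation, the `eps` constant, and the binary reward are assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize outcome rewards within one prompt's group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]

# One prompt, four sampled answers, reward 1.0 if the final answer is correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers receive a positive advantage and incorrect ones a negative advantage; when every response in a group earns the same reward, all advantages are zero and the prompt contributes no gradient signal, which is one reason the difficulty mix of the training data matters.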

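2606.18902 2026-06-18 cs.CL New submission

SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration

Ziyi Zhu, Luka Smyth, Saki Shinoda, Jinghong Chen

Affiliations: Slingshot AI; Department of Engineering, University of Cambridge

AI summary: Proposes the stochastic prompt optimization framework SPO; its SAGE method performs black-box search through a multi-agent pipeline with diagnostic code execution. Across several benchmarks, effectiveness depends on error type, and continuous optimization on a mental-health chatbot significantly improved next-day retention.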

Abstract

Context engineering has emerged as a primary lever for improving AI systems without parameter updates. Recent work showing that textual gradients do not function as real gradients motivates treating automatic prompt optimization (APO) as black-box search. We introduce SPO (Stochastic Prompt Optimization), a framework for stochastic search over prompt space, and compare three strategies of increasing sophistication: error-informed random search, a genetic algorithm with evolutionary operators, and SAGE (SPO via Agent-Guided Exploration), a multi-agent pipeline with diagnostic code execution. Across three benchmarks, no single strategy dominates; effectiveness depends on the interaction of landscape structure with error type. We further deploy SAGE on a mental-health chatbot under a continuous optimization paradigm, where it compounds eight cycles of individually-noisy A/B tests into a statistically robust gain in next-day retention. We argue that coupling qualitative diagnosis with quantitative validation is what makes agentic optimization effective for open-ended task-oriented dialogue.
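
The simplest of the three strategies named above, error-informed random search, can be sketched as a black-box loop over prompts. This is a toy under stated assumptions, not SAGE or the authors' code: `HINTS`, `toy_score`, and `toy_mutate` are hypothetical stand-ins for a real evaluator and a real error analysis.

```python
import random

def random_search(initial_prompt, mutate, score, steps, seed=0):
    """Keep the best prompt seen so far; accept a candidate only if it scores higher."""
    rng = random.Random(seed)
    best_prompt, best_score = initial_prompt, score(initial_prompt)
    for _ in range(steps):
        candidate = mutate(best_prompt, rng)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best_prompt, best_score = candidate, candidate_score
    return best_prompt, best_score

# Hypothetical instructions that each address one observed error type.
HINTS = [
    "Show your reasoning.",
    "Answer with a single number.",
    "Cite the source passage.",
]

def toy_score(prompt):
    # Hypothetical evaluator: fraction of hints present in the prompt.
    return sum(h in prompt for h in HINTS) / len(HINTS)

def toy_mutate(prompt, rng):
    # Error-informed step: only propose hints the prompt is still missing.
    missing = [h for h in HINTS if h not in prompt]
    return prompt + " " + rng.choice(missing) if missing else prompt

best, best_score = random_search("Solve the task.", toy_mutate, toy_score, steps=5)
```

The loop treats the scorer purely as a black box, which matches the framing in the abstract; a genetic algorithm or an agent pipeline would replace `toy_mutate` with richer proposal mechanisms.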

2606.18954 2026-06-18 cs.CL New submission

GraphPO: Graph-based Policy Optimization for Reasoning Models

Yuliang Zhan, Xinyu Tang, Jian Li, Dandan Zheng, Weilong Chai, Jingdong Chen, Jun Zhou, Ge Wu, Wenyue Tang, Hao Sun

Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Ant Group

AI summary: Proposes GraphPO, a framework that models reasoning rollouts as a directed acyclic graph, merges semantically equivalent paths to cut redundant exploration, and uses edge-level advantages to improve reasoning efficiency, outperforming chain- and tree-based methods on multiple benchmarks.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for enhancing the capability of large reasoning models. RLVR typically samples responses independently and optimizes the policy using rewards from final answers. This paradigm has two limitations. First, independently sampled responses often contain similar intermediate reasoning steps, causing redundant exploration and wasted computation. Second, sparse final-answer rewards make it hard to identify useful steps. Tree-based methods partly address this problem by sharing prefixes and comparing branches from the same prefix to provide fine-grained signals. However, tree branches are still expanded independently. When different branches reach similar reasoning states, they cannot share information and repeat similar exploration. Moreover, tree-based methods ignore such dispersion and only perform local comparisons within separate branches, which can lead to higher variance in advantage estimation. To address this challenge, we propose GraphPO (Graph-based Policy Optimization), a novel RL framework that represents rollouts as a directed acyclic graph, with reasoning steps as edges and semantic states summarized from the reasoning paths as nodes. GraphPO merges semantically equivalent reasoning paths into equivalence classes, allowing them to share suffixes and reallocating budget away from redundant expansions to diverse exploration. Furthermore, we assign efficiency advantages to incoming edges and correctness advantages to outgoing edges, thereby improving inference efficiency while deriving process supervision from outcomes. Theory shows that GraphPO reduces advantage-estimation variance and enhances reasoning efficiency. Experiments on three LLMs across reasoning and agentic search benchmarks show that GraphPO consistently outperforms chain- and tree-based baselines with the same token budgets or response budgets.
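
The difference between sharing prefixes (a tree) and merging equivalent states (a DAG) can be made concrete with a toy sketch. The equivalence test here is an assumption made purely for illustration: two paths are treated as equivalent when they have derived the same set of facts. The paper instead summarizes semantic states from reasoning paths, which this sketch does not attempt.

```python
def dag_edges(rollouts):
    """Edges of a DAG whose nodes are the sets of facts derived so far."""
    edges = set()
    for steps in rollouts:
        state = frozenset()
        for fact in steps:
            nxt = state | {fact}
            if nxt != state:  # a repeated fact adds nothing, so no self-loop
                edges.add((state, nxt))
            state = nxt
    return edges

def tree_edge_count(rollouts):
    """Edges of a prefix tree: one per distinct non-empty prefix."""
    prefixes = {
        tuple(steps[:k]) for steps in rollouts for k in range(1, len(steps) + 1)
    }
    return len(prefixes)

# Two rollouts that derive the same facts in a different order.
rollouts = [
    ["a=1", "b=2", "a+b=3"],
    ["b=2", "a=1", "a+b=3"],
]
graph = dag_edges(rollouts)
```

The prefix tree needs six edges for these two rollouts, while the merged graph needs five because both paths reach the state `{a=1, b=2}` and share the final step from there.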

2606.19002 2026-06-18 cs.CL New submission

Enhancing Multilingual Reasoning via Steerable Model Merging

Zhuoran Li, Rui Xu, Jian Yang, Junnan Liu, Zhijun Chen, Qianren Mao, Hongcheng Guo, Jiaheng Liu, Likang Xiao, Ming Li, Xiaojie Wang

Affiliations: Beijing University of Posts and Telecommunications; Fudan University; Beihang University; Monash University; Zhongguancun Laboratory; Nanjing University; Tsinghua University

AI summary: Proposes the Steerable Model Merging (ST-Merge) framework, which adaptively modulates each source model's contribution through a gated cross-attention mechanism and outperforms strong baselines on multilingual reasoning tasks.

Comments 12 pages, 7 figures, 8 tables. Accepted by ACL2026 Findings

Abstract

Model merging is an effective technique for composing the capabilities of a multilingual model and a reasoning model. It has achieved promising generalization in multilingual reasoning tasks by aligning feature spaces of different models. However, the merged single model often fails to address the conflicts between source models, leading to suboptimal performance. In other words, the one-size-fits-all merging strategy may not align with the characteristics of different inputs which may require prioritizing certain models over others. To this end, we propose a Steerable Model Merging (ST-Merge) framework to modulate the contribution of each source model. To realize this idea, we introduce a gated cross-attention mechanism to weight or filter the two attended source models in an adaptive manner. Extensive experiments demonstrate that ST-Merge consistently outperforms multiple strong baselines on four multilingual reasoning benchmarks across 21 different languages.
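
The general notion of input-dependent weighting of two source models can be shown with a deliberately simplified sketch. It uses a single scalar sigmoid gate rather than the gated cross-attention mechanism the paper describes, and every name and parameter below is hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_merge(h_multilingual, h_reasoning, gate_weights, gate_bias, features):
    """Blend two source representations with an input-dependent scalar gate."""
    gate = sigmoid(sum(w * f for w, f in zip(gate_weights, features)) + gate_bias)
    merged = [gate * a + (1.0 - gate) * b for a, b in zip(h_multilingual, h_reasoning)]
    return gate, merged

# With zero gate parameters the gate is 0.5 and the merge is a plain average.
gate, merged = gated_merge(
    [1.0, 3.0], [3.0, 1.0], gate_weights=[0.0, 0.0], gate_bias=0.0, features=[0.2, -0.4]
)
```

A fixed, one-size-fits-all merge corresponds to a constant gate; letting the gate depend on `features` is what allows some inputs to lean on one source model more than the other.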

2606.19005 2026-06-18 cs.CL cs.LG New submission

Sumi: Open Uniform Diffusion Language Model from Scratch

Mengyu Ye, Keito Kudo, Wataru Ikeda, Ryosuke Matsuda, Keisuke Sakaguchi, Jun Suzuki

Affiliation: Tohoku University

AI summary: Presents Sumi, a 7B-parameter uniform diffusion language model pretrained from scratch on 1.5T tokens. It is competitive with autoregressive models trained at comparable token budgets, and all resources are openly released.

Abstract

Diffusion models have become a promising alternative to autoregressive models. Among these, uniform diffusion language models (UDLMs) permit any token to be updated at any step, in principle enabling more flexible generation. However, no UDLM has yet been pretrained from scratch at both large parameter scale and large token budget. Both autoregressive modeling and masked diffusion modeling already have capable models at scale that the community can study and build on; uniform diffusion has none. A scratch-pretrained UDLM at scale would provide a clean reference point for studying scaling behavior, generation dynamics, controllability, and trade-offs against established autoregressive and masked diffusion models. To this end, we introduce Sumi ("ink" in Japanese), a fully open 7B uniform diffusion language model pretrained from scratch on 1.5T tokens. Sumi performs competitively with autoregressive models trained at comparable token budgets on knowledge, reasoning, and coding benchmarks, while under-performing on commonsense benchmarks, where our education-heavy data mixture is a likely contributor. We release our model weights, checkpoints, and full training recipe, including a complete specification of the data mixture over publicly available corpora. We hope this release enables the community to study native uniform diffusion at scale and catalyzes work on its as-yet poorly understood aspects.
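
The forward corruption that distinguishes uniform diffusion from masked diffusion can be sketched briefly: instead of replacing a token with a mask symbol, a corrupted position is resampled uniformly from the vocabulary. This is a generic illustration of that noising step, assuming a single scalar `noise_level`; it says nothing about Sumi's actual noise schedule or training objective.

```python
import random

def uniform_corrupt(tokens, vocab_size, noise_level, rng):
    """With probability `noise_level`, resample a token uniformly from the vocabulary."""
    return [
        rng.randrange(vocab_size) if rng.random() < noise_level else tok
        for tok in tokens
    ]

rng = random.Random(0)
clean = [5, 17, 42, 8, 99]
untouched = uniform_corrupt(clean, vocab_size=100, noise_level=0.0, rng=rng)
pure_noise = uniform_corrupt(clean, vocab_size=100, noise_level=1.0, rng=rng)
```

Because a noised position looks like any other token, the model cannot tell from the input which positions were corrupted, which is why any token can in principle be revised at any denoising step.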

2606.19170 2026-06-18 cs.CL New submission

Dango: A Strictly L1-Only Large Language Model for Studying Second Language Acquisition

Shiho Matta, Yin Jou Huang, Fei Cheng, Takashi Kodama, Hirokazu Kiyomaru, Yugo Murawaki

Affiliations: Kyoto University; NII-LLMC

AI summary: Presents Dango, a 1.8B-parameter model that filters L2 contamination from its L1 pretraining corpus and is fine-tuned on L2-learning lessons to simulate human-like L2 production patterns, outperforming unfiltered and multilingual baselines.

Comments 8 pages main text, 20 pages total including references and appendices

Abstract

We introduce Dango, a 1.8B-parameter large language model designed for controlled studies of L1-to-L2 (Japanese-to-English) transfer in second language acquisition (SLA). While previous studies have explored SLA in language models, they have predominantly relied on smaller or non-decoder models, limiting their ability to generate open-ended text and reducing their suitability as practical L2 simulators. We identify a key challenge when scaling models to this size: L2 contamination within the "monolingual" pretraining corpus used for L1 acquisition. To address this, we propose a filtering method to reduce premature exposure to English while preserving realistic, minimal exposure. We then fine-tune the model on LLM-generated L2-learning lessons to simulate the L2 acquisition process. Our evaluations confirm that Dango develops human-like L2 production patterns, outperforming both unfiltered and standard multilingual baselines. We release the model, data, and code to facilitate reproducible computational SLA research and learner-facing applications.

2606.19257 2026-06-18 cs.CL New submission

DreamReasoner-8B: Block-Size Curriculum Learning for Diffusion Reasoning Models

Zirui Wu, Lin Zheng, Jiacheng Ye, Shansan Gong, Xueliang Zhao, Yansong Feng, Wei Bi, Lingpeng Kong

Affiliations: The University of Hong Kong; Peking University

AI summary: Proposes block-size curriculum learning, which trains progressively from fine-grained to coarse-grained block sizes to close the performance gap of block diffusion language models on long chain-of-thought reasoning. DreamReasoner-8B is competitive with Qwen3-8B on math and code reasoning.

Abstract

Block diffusion language models accelerate decoding through parallel block-wise denoising, yet whether they can be reliably scaled for long chain-of-thought (CoT) reasoning remains unresolved. To this end, we develop DreamReasoner-8B, an open-source block diffusion reasoning model, and conduct a systematic study of how training and inference block sizes affect long-CoT reasoning. Our analysis reveals a stark performance disparity: training with large block sizes yields remarkably poor reasoning, whereas small block sizes preserve effective reasoning. To bridge this granularity gap, we propose block-size curriculum learning, which gradually transitions training from fine-grained to coarse-grained block sizes, thereby overcoming this limitation and enabling strong reasoning performance that generalizes across diverse inference block sizes. On mathematical and code reasoning benchmarks, DreamReasoner-8B achieves results competitive with leading open autoregressive models such as Qwen3-8B. This work establishes a practical foundation for efficient, reasoning-capable diffusion language models. We release our model at https://github.com/DreamLM/DreamReasoner.
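
A fine-to-coarse block-size curriculum can be expressed as a schedule that maps the training step to a block size. The stage sizes and the equal-length stages below are hypothetical; the abstract does not state the actual schedule.

```python
def block_size_at(step, total_steps, block_sizes=(4, 8, 16, 32)):
    """Block size for a training step, moving from fine to coarse in equal stages."""
    stage = min(step * len(block_sizes) // total_steps, len(block_sizes) - 1)
    return block_sizes[stage]

# Block size at the start of each quarter of a 1,000-step run.
sizes = [block_size_at(step, 1000) for step in (0, 250, 500, 750)]
```

A smoother variant could interpolate or sample block sizes around the current stage, but the staged form is enough to show the direction of the curriculum.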

2606.18381 2026-06-18 cs.CL cs.IR New submission

SproutRAG: Attention-Guided Tree Search with Progressive Embeddings for Long-Document RAG

Amirhossein Abaskohi, Issam H. Laradji, Peter West, Giuseppe Carenini

Affiliations: University of British Columbia; ServiceNow Research

AI summary: Proposes SproutRAG, which builds an attention-guided chunking tree over sentence-level chunks to enable multi-granularity retrieval without additional LLM calls, improving information efficiency by 6.1% on average.

Abstract

Retrieval-augmented generation (RAG) systems must balance retrieval granularity with contextual coherence, a challenge that existing methods address through LLM-guided chunking, single-level context expansion, or hierarchical summarization. These approaches variously depend on costly LLM calls during indexing or retrieval, limit context aggregation to a single granularity level, or introduce information loss through summarization. We present SproutRAG, an attention-guided hierarchical RAG framework that addresses this trade-off by organizing sentence-level chunks into progressively larger but semantically coherent units, using learned inter-sentence attention to construct a binary chunking tree. Unlike prior approaches that rely on external LLMs, fixed context expansion, or lossy summarization, SproutRAG learns which attention heads and layers best capture semantic document structure, enabling multi-granularity retrieval without additional LLM calls or compressed summaries. At retrieval time, SproutRAG uses hierarchical beam search to retrieve candidates at multiple granularities, capturing multi-sentence relevance beyond flat retrieval. The framework is trained end-to-end with a joint objective that improves both embeddings and tree structure. Experiments across four benchmarks spanning scientific, legal, and open-domain settings demonstrate that SproutRAG improves information efficiency (IE) by 6.1% on average over the strongest baseline. Code is available at https://github.com/AmirAbaskohi/SproutRAG.
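
One way to picture a binary chunking tree over sentences is greedy bottom-up merging of adjacent units. The sketch below is an assumed simplification: a hand-written `affinity` matrix stands in for learned inter-sentence attention, the merge rule (highest mean affinity between neighbors) is hypothetical, and the hierarchical beam search used at retrieval time is not shown.

```python
def build_chunk_tree(affinity):
    """Greedily merge the adjacent pair of units with the highest mean affinity."""
    # Each node is (subtree, sentence indices covered by that subtree).
    nodes = [(i, [i]) for i in range(len(affinity))]

    def link(k):
        left, right = nodes[k][1], nodes[k + 1][1]
        total = sum(affinity[a][b] for a in left for b in right)
        return total / (len(left) * len(right))

    while len(nodes) > 1:
        best = max(range(len(nodes) - 1), key=link)
        (left_tree, left_ids), (right_tree, right_ids) = nodes[best], nodes[best + 1]
        nodes[best:best + 2] = [((left_tree, right_tree), left_ids + right_ids)]
    return nodes[0][0]

# Toy affinities: sentences 0-1 and 2-3 are tightly linked, 1-2 are not.
affinity = [
    [0.0, 0.9, 0.1, 0.1],
    [0.9, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 0.8],
    [0.1, 0.1, 0.8, 0.0],
]
tree = build_chunk_tree(affinity)
```

Each internal node of the resulting tree is a candidate chunk, so a retriever can return a single sentence, a pair, or the whole passage depending on where relevance concentrates.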

2606.18508 2026-06-18 cs.CL cs.IR New submission

MCompassRAG: Topic Metadata as a Semantic Compass for Paragraph-Level Retrieval

Amirhossein Abaskohi, Raymond Li, Gaetano Cimino, Peter West, Giuseppe Carenini, Issam H. Laradji

Affiliations: University of British Columbia; University of Salerno; ServiceNow Research

AI summary: Proposes the MCompassRAG framework, which enriches paragraph representations with topic metadata and trains a lightweight retriever via LLM distillation for topic-aware retrieval, improving average information efficiency by 8.24% across six benchmarks while cutting latency by more than 5x.

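Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering

Yafeng Wu, Huu Hiep Nguyen, Thin Nguyen, Hung Le

Affiliation: Deakin University

AI summary: Proposes the CADE framework, which embeds every timestep directly with a point-wise linear encoder to avoid the tokenization bottleneck and aligns time series with text anchors via a one-directional supervised contrastive loss, improving performance on six TSQA tasks on the Time-MQA benchmark.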

Abstract

Recent advances in large language models (LLMs) have given rise to time-series question answering (TSQA), which formulates time-series analysis as natural-language question answering. However, directly feeding raw numerical series into LLMs suffers from a tokenization bottleneck: Byte Pair Encoding fragments continuous values into unstable tokens whose embeddings lack meaningful metric structure, resulting in the loss of magnitude, scale, and trend information. Prior methods use patch-based encoders that split the series into fixed windows, locking in one granularity that breaks patterns and hides exact timesteps, through a separate module that rarely transfers across datasets with different lengths or sampling rates. To address this challenge, we propose CADE (Contrastive Alignment with Direct Embedding), a novel framework for TSQA built upon two key components: direct timestep embedding and semantic alignment. The proposed framework maps each timestep directly into the LLM embedding space through a point-wise linear encoder and MLP projector, preserving exact index-level access while eliminating the need for patching and padding. To further bridge the semantic gap between time-series and language representations, we introduce a novel one-directional supervised contrastive loss that aligns time-series embeddings with frozen class-name text anchors. Experimental results on the public Time-MQA benchmark demonstrate that our framework consistently improves performance across six TSQA tasks, outperforming both open-source and proprietary LLM baselines.
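
The two components named in this abstract can be sketched in plain Python. This is a simplified reading, not the authors' code: the MLP projector is omitted, cosine similarity and the `temperature` value are assumptions, and "one-directional" is modeled only by treating the anchors as fixed constants that the loss never updates.

```python
import math

def embed_timesteps(series, weight, bias):
    """Point-wise linear encoder: one embedding vector per timestep."""
    return [[w * x + b for w, b in zip(weight, bias)] for x in series]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def anchor_contrastive_loss(embedding, anchors, label, temperature=0.1):
    """Cross-entropy over similarities to frozen class-name anchors."""
    logits = [cosine(embedding, a) / temperature for a in anchors]
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_norm - logits[label]

# Two timesteps mapped into a 2-dimensional embedding space.
embedded = embed_timesteps([1.0, 2.0], weight=[1.0, 0.5], bias=[0.0, 1.0])

# Frozen anchors for two hypothetical classes.
anchors = [[1.0, 0.0], [0.0, 1.0]]
loss_match = anchor_contrastive_loss([1.0, 0.0], anchors, label=0)
loss_mismatch = anchor_contrastive_loss([1.0, 0.0], anchors, label=1)
```

Because every timestep gets its own vector, index-level questions (for example, about the value at a specific step) do not have to be answered through a fixed-size patch.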

2606.19183 2026-06-18 cs.CL cs.AI New submission

Language Models as Interfaces, Not Oracles: A Hybrid LLM-ML System for Pediatric Appendicitis

Soheyl Bateni, Maryam Abdolali

Affiliation: K. N. Toosi University of Technology

AI summary: Proposes ClaMPAPP, a hybrid system in which an LLM extracts structured features from free text and an XGBoost classifier makes the diagnosis. It outperforms end-to-end LLMs on two independent cohorts and improves diagnostic stability and auditability.

Abstract

Large language models (LLMs) can make clinical decision support more accessible by interpreting free-text documentation, but their direct use as diagnostic engines is limited by sensitivity to prompts, information order, and plausible but incorrect outputs. Structured machine-learning models offer more stable risk prediction, yet they require tabular inputs that are difficult to integrate with narrative clinical workflows. We present ClaMPAPP (Clinical Language-assisted Machine-learning Pipeline for Appendicitis), a hybrid system that uses an LLM as an interface rather than as the final decision-maker. ClaMPAPP extracts schema-constrained clinical features from note-like narratives, applies deterministic plausibility checks, and passes validated features to an XGBoost classifier trained on clinical, laboratory, and ultrasound variables. We evaluated ClaMPAPP on two independent pediatric appendicitis cohorts from German hospitals and compared it with end-to-end LLM baselines, including open-source and proprietary models. To preserve ground truth while testing free-text input, narratives were generated from structured electronic health records through template rendering and constrained LLM rewriting, with additional sentence-order permutation to assess positional robustness. ClaMPAPP achieved the strongest overall diagnostic performance in both internal and external validation while minimizing missed appendicitis cases, the key safety concern in acute triage. End-to-end LLMs showed unstable sensitivity-specificity trade-offs and greater degradation under narrative reordering. These results support an LLM-as-interface, ML-as-predictor design that separates natural-language usability from predictive inference and provides a more auditable pathway for clinical decision support.
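
The step between LLM extraction and the classifier, schema constraints plus deterministic plausibility checks, might look roughly like the sketch below. The feature names and numeric ranges in `SCHEMA` are invented for illustration only and are not clinical guidance or the paper's schema; the downstream classifier is not shown.

```python
# Hypothetical schema: feature name -> (type, lowest plausible, highest plausible).
SCHEMA = {
    "age_years": (float, 0.0, 18.0),
    "wbc_count": (float, 0.0, 100.0),
    "appendix_diameter_mm": (float, 0.0, 30.0),
}

def validate(extracted):
    """Keep only schema fields; mark absent or implausible values as missing."""
    clean = {}
    for name, (kind, low, high) in SCHEMA.items():
        value = extracted.get(name)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_number and low <= value <= high:
            clean[name] = kind(value)
        else:
            clean[name] = None  # treated as missing by the downstream model
    return clean

# Hypothetical LLM output: one implausible value and one field outside the schema.
features = validate({"age_years": 9, "wbc_count": 500, "free_text_guess": "likely"})
```

Keeping this layer deterministic means that a surprising prediction can be traced back to the exact validated inputs, which is the auditability argument the abstract makes for separating the interface from the predictor.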

2606.18406 2026-06-18 cs.CL New submission

CoreMem: Riemannian Retrieval and Fisher-Guided Distillation for Long-Term Memory in Dialogue Agents

Jiaqi Chen, Yongqin Zeng, Shaoshen Chen, Yijian Zhang, Hai-Tao Zheng, Chunxia Ma, XiuTeng Zhou

Affiliations: Shenzhen International Graduate School, Tsinghua University; Peng Cheng Laboratory; Shandong Analysis and Test Center, Qilu University of Technology; State Key Laboratory for Quality Ensurance and Sustainable Use of Dao-di Herbs

AI summary: Proposes the CoreMem architecture, which replaces cosine similarity with Riemannian retrieval to address the hubness problem in high-dimensional retrieval and uses Fisher-guided discrete token distillation for principled compression, enabling long-term-memory dialogue agents on edge devices with 8 GB of VRAM.

Comments 15 pages, 5 figures

Abstract

Personalized dialogue agents require continuous long-term memory to maintain coherent interactions across multiple sessions. However, deploying these capabilities on consumer-grade hardware (e.g., 8 GB VRAM edge devices) introduces severe memory and compute bottlenecks. Existing systems typically rely on isotropic cosine similarity for retrieval and heuristic rules for context compression. These approaches lack a unified theoretical foundation, frequently suffering from the hubness problem in high-dimensional retrieval and syntactic fragmentation during compression. To overcome these limitations, we propose CoreMem, a resource-efficient edge-cloud memory architecture fundamentally unified by information geometry. First, Riemannian retrieval replaces cosine matching with a locally adaptive Fisher-Rao metric, effectively penalizing hub memories via Mahalanobis distance with O(Ndr) Woodbury acceleration for real-time search. Second, Fisher-guided discrete token distillation (FDTD) introduces a hierarchical sentence-to-token compression mechanism. It derives sensitivity scores from Fisher information traces, providing a principled compression-KL tradeoff augmented with explicit structural syntax protection. Evaluated on the LOCOMO and LongMemEval-S benchmarks, CoreMem achieves strong accuracy improvements, yielding substantial gains in Open-domain (+4.51 pp) and Temporal (+4.17 pp) reasoning. Extensive profiling confirms that CoreMem operates seamlessly within a strict 8 GB VRAM budget, successfully bridging the gap between resource-constrained edge devices and the demand for theoretically grounded, lifelong memory agents.
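
Why a Woodbury-style update keeps Mahalanobis retrieval cheap can be seen in the rank-one special case (the Sherman-Morrison formula). The sketch assumes a metric of the form `lam * I + u u^T`, which is an illustrative choice and not necessarily the structure CoreMem uses; with rank `r` instead of one, the same idea costs O(dr) per memory, consistent with the O(Ndr) figure quoted above.

```python
def mahalanobis_sq_rank1(query, memory, u, lam):
    """Squared Mahalanobis distance under the metric lam * I + u u^T.

    Uses the Sherman-Morrison inverse, so no d x d matrix is ever formed:
    inv(lam*I + u u^T) = I/lam - u u^T / (lam * (lam + u.u)).
    """
    delta = [q - m for q, m in zip(query, memory)]
    norm_sq = sum(d * d for d in delta)
    proj = sum(a * d for a, d in zip(u, delta))
    u_sq = sum(a * a for a in u)
    return norm_sq / lam - proj * proj / (lam * (lam + u_sq))

# Check against the explicit inverse: the metric is diag(2, 1), so the
# squared distance of (1, 1) from the origin is 1/2 + 1 = 1.5.
distance_sq = mahalanobis_sq_rank1([1.0, 1.0], [0.0, 0.0], u=[1.0, 0.0], lam=1.0)
```

Directions along `u` are down-weighted relative to plain Euclidean distance, which is how a metric of this kind can penalize memories that sit close to everything along a dominant direction.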

2606.18448 2026-06-18 cs.CL New submission

VISUALSKILL: Multimodal Skills for Computer-Use Agents

Ziyan Jiang, Li An, Yujian Liu, Jiabao Ji, Qiucheng Wu, Jacob Andreas, Yang Zhang, Shiyu Chang

Affiliations: UC Santa Barbara; MIT CSAIL; MIT-IBM Watson AI Lab

AI summary: Proposes VISUALSKILL, a hierarchical multimodal skill library built by combining documentation with UI exploration. It raises the agent's average score on CUA benchmarks by 15.3 points, and multimodal skills outperform text-only ones.

Abstract

Computer-use agents (CUAs) approach human-level performance on standardised benchmarks but still struggle on long-horizon tasks and unseen software. Existing skill libraries address this with reusable skills, but represent the skill artifact as text only, despite the visual nature of GUI interaction. We propose VISUALSKILL: a hierarchical multimodal skill, tailored to each target application and organised as a central index over per-topic files, which the agent consumes through a load_topic MCP tool that fetches the relevant topic's text and figures on demand. We construct each skill with a two-stage pipeline that combines authored documentation with live-application UI exploration. On two CUA benchmarks, CUA-World and OSExpert-Eval, a Claude Code CLI agent backed by Claude Opus 4.6 reaches an average score of 0.456 with VISUALSKILL, a +15.3 point absolute lift over the no-skill baseline (0.303). Against a matched text-only skill that is generated from the same source content and differs from VISUALSKILL only in modality, VISUALSKILL yields a further +8.3 point absolute gain (0.373 vs. 0.456), providing direct evidence that retaining visual figures in the skill artifact, rather than verbalizing them away, helps the agent both identify UI elements and verify workflow state after each action. Our code is available at https://github.com/XMHZZ2018/VisualSkills.


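2606.18728 2026-06-18 cs.CL New submission

Learning User Simulators with Turing Rewards

Yingshan Susan Wang, Cedegao E. Zhang, Linlu Qiu, Zexue He, Pengyuan Li, Alex Pentland, Roger P. Levy, Yoon Kim

Affiliations * Massachusetts Institute of Technology; Stanford University; MIT-IBM Watson AI Lab

AI summary: Proposes Turing-RL, which trains user simulators with Turing-test-based reinforcement learning; a discriminative Turing reward pushes generated responses to be indistinguishable from those of real users, and the method outperforms baselines on conversational chat and forum discussion.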

Abstract

Learning to simulate human users in interactive settings could advance the training of agent assistants, evaluation of personalization systems, research in the social sciences, and more. Existing approaches generally do so by training a large language model (LLM) to match a single ground truth response, either by maximizing the log probability or by using a similarity reward. We instead propose Turing-RL: a Turing-Test-based reinforcement learning approach for training user simulator models. Turing-RL uses a discriminative Turing reward with an LLM judge to score how indistinguishable a generated response is from the real user's given the user's history, and the user simulator LLM learns to produce responses indistinguishable from what the user could have said with such rewards. Across two different domains, conversational chat and Reddit forum discussion, we find that Turing-RL consistently outperforms baseline methods on both LLM and human evaluation metrics. Our study suggests that optimizing for indistinguishability, rather than response matching, is effective for learning user simulators.


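2606.18850 2026-06-18 cs.CL cs.IR New submission

Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish

Tolga Şakar

Affiliations * Independent Researcher

AI summary: Targeting the agglutinative nature of Turkish, proposes Morpheus, a neural morpheme-boundary model that provides lossless, reversible tokenization and structured word embeddings; among reversible tokenizers it attains the lowest bits-per-character (1.425), raises morpheme-alignment F1 to 0.61, and uses about 19% less GPU memory.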

Abstract

Turkish is agglutinative: meaning is carried by morphemes, yet the subword tokenizers that drive modern language models split words by corpus statistics, fragmenting semantically loaded suffixes and (in the case of WordPiece and rule-based analyzers) failing to decode their output back to the original text. This paper presents Morpheus, a neural morpheme-boundary model for Turkish that is at once a lossless, morphology-aware tokenizer and a word-embedding producer. A differentiable Poisson-binomial dynamic program turns per-character boundary probabilities into soft morpheme memberships during training and exact segments at inference, with no string normalization, so $\mathrm{decode}(\mathrm{encode}(w)) = w$ holds by construction. Because the model is neural, the same forward pass that tokenizes also emits a structured word embedding. Among reversible tokenizers (the only ones valid for generation), Morpheus attains the lowest bits-per-character (1.425), roughly doubles the gold morphological alignment of the subword family (MorphScore macro-F1 0.61 vs. ~0.32), and uses ~19% less GPU memory than 64K-vocabulary subword tokenizers. As an embedder, frozen Morpheus vectors lead on lexical retrieval (root-family MAP 0.85) and same-root verification (ROC-AUC 1.00), surpassing the multilingual retriever BGE-M3 and BERTurk; on context- and inflection-dependent tasks (NER, case/number probing) the heavier contextual encoders remain ahead, a trade-off we attribute to Morpheus's root-centric geometry. Code: https://github.com/lonewolf-rd/TurkishMorpheus; model: https://huggingface.co/lonewolflab/Morpheus-TR-50K; interactive demo: https://huggingface.co/spaces/lonewolflab/morpheus-tr-demo.
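
The Poisson-binomial step mentioned in the abstract can be illustrated with a small sketch. It assumes each gap between adjacent characters has an independent boundary probability, so the morpheme index of a character is the number of boundaries before it, and that count follows a Poisson-binomial distribution computable by dynamic programming. The function names, the 0.5 threshold, and the example probabilities are illustrative assumptions, not the paper's implementation.

```python
def soft_memberships(gap_probs):
    """Return rows[i][k] = P(character i belongs to segment k).

    gap_probs[j] is the (assumed independent) probability of a morpheme
    boundary between character j and character j + 1.
    """
    dist = [1.0]                 # character 0 is always in segment 0
    rows = [dist]
    for p in gap_probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)   # no boundary: stay in segment k
            nxt[k + 1] += mass * p       # boundary: advance to segment k + 1
        dist = nxt
        rows.append(dist)
    return rows


def hard_segments(word, gap_probs, threshold=0.5):
    """Cut the word wherever the boundary probability exceeds the threshold.

    Segments are plain slices of the input, so joining them always
    reproduces the word exactly (no normalization step).
    """
    segments, start = [], 0
    for j, p in enumerate(gap_probs):
        if p > threshold:
            segments.append(word[start:j + 1])
            start = j + 1
    segments.append(word[start:])
    return segments


word = "evlerde"                              # ev + ler + de
gap_probs = [0.05, 0.9, 0.1, 0.05, 0.85, 0.1]  # made-up probabilities
segments = hard_segments(word, gap_probs)
rows = soft_memberships(gap_probs)
```

Because the hard segments are slices of the original string, `"".join(segments) == word` holds for any probabilities, which is the sense in which reversibility is "by construction" in this toy version.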


2606.18856 2026-06-18 cs.CL cs.LG New submission

Approximate Structured Diffusion for Sequence Labelling

Nicolas Floquet, Joseph Le Roux, Nadi Tomeh

Affiliations * Université Sorbonne Paris Nord, CNRS, Laboratoire d’Informatique de Paris Nord, LIPN

AI summary: Proposes a diffusion-based training method for conditional random fields (CRFs) that conditions on noised labels to capture long-range dependencies; combined with approximate inference, it achieves a 16.5% error-rate reduction on part-of-speech tagging.

2606.18922 2026-06-18 cs.CL cs.AI New submission

As Easy as Rocket Science: Assessing the Ability of Large Language Models to Interpret Negation in Figurative Language

Jasmine Owers, Edwin Simpson, Martha Lewis

Affiliations * Intelligent Systems Lab, University of Bristol; ILLC, University of Amsterdam

AI summary: Develops a new annotated dataset to test how well a range of large language models understand negation in figurative language; the combination of negation and figurative language proves challenging for the models, and performance depends heavily on prompt style.

Comments 16 pages, 16 figures; for associated code and data see https://github.com/jrdowers/Negation-and-Fig-Lang; To be published in Transactions of the Association for Computational Linguistics

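2606.18273 2026-06-18 cs.CL cs.AI cs.SD eess.AS New submission

Continuous Audio Thinking for Large Audio Language Models

Gyojin Han, Dong-Jae Lee, Changho Choi, Jongsuk Kim, Junmo Kim

Affiliations * KAIST

AI summary: Proposes the Continuous Audio Thinking (CoAT) framework, which uses expert distillation to organize acoustic information in a continuous latent space so that audio language models can draw on rich acoustic features before generating a response, at no additional autoregressive decoding cost; it improves performance on multiple audio tasks.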

Comments Preprint

Abstract

Large audio language models (LALMs) have shown impressive capabilities on diverse audio understanding tasks, ranging from speech transcription to music analysis. However, because LALMs are typically trained to produce text-aligned responses, their hidden states are progressively shaped for text generation rather than for preserving acoustic information. As a result, the diverse acoustic content that audio carries, such as phonetic detail, prosody, sound events, affect, and pitch, is lost along the way and difficult to leverage in the response. We introduce Continuous Audio Thinking (CoAT), a framework that equips audio language models with a continuous latent workspace for organizing acoustic information prior to response generation, grounded by distillation from audio experts. Within the thinking space, the model can utilize the rich acoustic information provided by expert distillation when generating its response. Furthermore, the proposed continuous thinking block can be processed in a single prefill, so CoAT does not require additional autoregressive decoding cost over the baseline. Across three LALMs, Qwen2-Audio, Qwen2.5-Omni-7B, and Audio Flamingo 3, performance gains on a broad benchmark suite spanning audio reasoning, audio understanding, music classification, speech emotion, and speech transcription demonstrate the effectiveness of CoAT. Further analysis confirms that the auxiliary supervision propagates from the thinking positions to the model's textual responses.


2606.18466 2026-06-18 cs.CL New submission

Montreal Forced Aligner and the state of speech-to-text alignment in 2026

Michael McAuliffe, Kaylynn Gunter, Michael Wagner, Morgan Sonderegger

Affiliations * University of Wisconsin-Madison; McGill University; Centre for Brain, Language, and Music; University of Oregon

AI summary: Describes the development of MFA 3.0 since version 1.0 and evaluates it on English, Japanese, and Korean, achieving best or near-best performance on four benchmark datasets with mean boundary error below 15 ms.

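2606.18584 2026-06-18 cs.CL New submission

Speech-Driven End-to-End Language Discrimination towards Chinese Dialects

Fan Xu, Jian Luo, MingWen Wang, GuoDong Zhou

Affiliations * Jiangxi Normal University; Soochow University

AI summary: To address the difficulty of discriminating similar languages and dialects, proposes a speech-driven approach based on MFCC features and an HMM-DNN end-to-end model, using an attention mechanism and a CNN to fuse word embeddings with MFCC features; it outperforms existing methods on benchmark corpora.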

Comments Published in ACM TALLIP

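2606.18471 2026-06-18 cs.CL New submission

LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment

Han Chen, Ming Li, Chenguang Wang, Yijun Liang, Dawei Zhou, Hong Jiao, Tianyi Zhou

Affiliations * MBZUAI; University of Maryland; Virginia Tech

AI summary: Evaluates 42 LLMs on predicting item discrimination in zero-shot settings; direct prediction correlates weakly with human-calibrated discrimination (best Spearman 0.152) and response-based CTT calibration correlates only modestly (0.241), indicating that LLMs cannot yet reliably capture item discrimination.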

Abstract

Item discrimination is a fundamental psychometric property of educational assessment, which measures whether an item meaningfully distinguishes students with higher proficiency from students with lower proficiency. While various existing works have explored whether large language models (LLMs) can estimate item difficulty, it remains unclear whether they can capture item discrimination. In this work, we evaluate 42 proprietary and open-weight LLMs in zero-shot settings using two complementary approaches: direct discrimination prediction, where models explicitly estimate an item's discrimination value from its content, and response-based Classical Test Theory (CTT) calibration, where LLM answers are treated as synthetic student responses to compute discrimination scores. Our results show that direct prediction yields weak alignment with human-calibrated discrimination: the best-performing model reaches only a Spearman correlation of 0.152. Response-based CTT calibration provides a stronger but still limited signal, with the all-persona synthetic respondent pool reaching a Spearman correlation of 0.241. These findings highlight item discrimination as an open challenge for LLM-based psychometric evaluation: current LLMs contain non-random discrimination-relevant signal, but they do not yet reliably capture how assessment items distinguish human students.
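
As a rough illustration of response-based CTT calibration, the sketch below scores each item by the correlation between getting that item right and the respondent's score on the remaining items (an item-rest point-biserial correlation). The abstract does not say which CTT discrimination statistic the authors use, so this choice, the function names, and the toy response matrix are all assumptions.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def item_discrimination(responses):
    """responses[s][i] is 1 if respondent s answered item i correctly, else 0.

    For each item, correlate the item score with the rest score (total score
    minus that item) so the item does not inflate its own correlation.
    """
    n_items = len(responses[0])
    scores = []
    for i in range(n_items):
        item = [row[i] for row in responses]
        rest = [sum(row) - row[i] for row in responses]
        scores.append(pearson(item, rest))
    return scores


# Toy matrix: four synthetic respondents (rows) by three items (columns).
responses = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 0],
    [0, 0, 1],
]
discrimination = item_discrimination(responses)
```

In the paper's setting the rows would be LLM answers treated as synthetic students, and the resulting per-item scores would then be compared (for example by Spearman correlation) with discrimination values calibrated on human responses.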


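RECOM: A Validity Discrimination Tradeoff in Automatic Metrics for Open Ended Reddit Question Answering

Pushwitha Krishnappa, Amit Das, Vinija Jain, Aman Chadha, Tathagata Mukherjee

Affiliations * University of Alabama Huntsville; University of North Alabama; Stanford University; Meta AI; Amazon GenAI

AI summary: Introduces the RECOM dataset and finds that no automatic metric achieves both validity and discriminative power on open-ended question answering: cosine similarity has high validity but poor discriminative power, while BERTScore's discriminative power is confounded by response length and its validity is weak.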

Abstract

Automatic metrics are the default for evaluating LLM-generated text, yet a metric is quietly asked to do two jobs: tell genuine content alignment from surface coincidence (validity), and tell a better system from a worse one (discriminative power). On open-ended, opinion-driven question answering, the two are in tension. We introduce RECOM (Reddit Evaluation for Correspondence of Models), a contamination-free evaluation dataset of 15,000 r/AskReddit questions (September 2025), each paired with its authentic community replies, which postdate every evaluated model's training cutoff. Scoring five open-source LLMs (7-10B) against every reply, with each metric paired with a random-derangement noise floor, we find that no metric does both jobs well. Cosine similarity separates real from random answers (Cohen's $d \approx 2$) but cannot rank the five models ($|d| < 0.1$); BERTScore precision appears to rank the models (raw $|d|$ up to 0.63), but once response length is controlled this collapses to $|d| = 0.09$ and its validity is weak ($d \approx 0.8$, versus cosine's $\approx 2$). Because every metric scores the same outputs, this validity-discrimination tradeoff is a property of the metrics, not the models, and we argue it stems from representation design. Three independent LLM judges reproduce the validity gap and likewise separate the five models only weakly. We recommend reporting metrics on both axes, with an explicit random-baseline floor. RECOM is publicly available at https://anonymous.4open.science/r/recom-D4B0.
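
The validity measurement described above can be sketched as follows: score each model output against its own reference reply, score it again against a reply drawn through a random derangement (a shuffle in which no output keeps its own reply), and report Cohen's d between the two score sets. The Jaccard token overlap below is a stand-in metric chosen only to keep the example dependency-free; the function names and toy data are assumptions, not the RECOM pipeline.

```python
import random
from statistics import mean, variance


def random_derangement(n, rng):
    """Random permutation of range(n) with no fixed points (rejection sampling)."""
    if n < 2:
        raise ValueError("a derangement needs at least two items")
    while True:
        perm = list(range(n))
        rng.shuffle(perm)
        if all(i != p for i, p in enumerate(perm)):
            return perm


def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation (assumes nonzero variance)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled ** 0.5


def jaccard(text_a, text_b):
    """Stand-in metric: token-set overlap between two strings."""
    sa, sb = set(text_a.lower().split()), set(text_b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def validity_effect(metric, outputs, references, rng):
    """Effect size separating true pairs from deranged (noise-floor) pairs."""
    perm = random_derangement(len(outputs), rng)
    real = [metric(o, r) for o, r in zip(outputs, references)]
    noise = [metric(o, references[p]) for o, p in zip(outputs, perm)]
    return cohens_d(real, noise)


outputs = ["tea in the morning", "long walks by the river",
           "old jazz records", "fresh bread and butter"]
references = ["a cup of tea every morning", "a walk along the river bank",
              "listening to jazz on vinyl records", "warm bread with salted butter"]
d_validity = validity_effect(jaccard, outputs, references, random.Random(0))
```

Discriminative power would use the same `cohens_d` helper, but comparing the real-pair scores of two different models rather than real versus deranged pairs.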


2606.19334 2026-06-18 cs.CL cs.CY cs.LG New submission

Freeing the Law with LOCUS: A Local Ordinance Corpus for the United States

Denis Peskoff, Joe Barrow, Christopher Vu, Diag Davenport

Affiliations * UC Berkeley; School of Information; Independent

AI summary: To address the lack of a machine-readable corpus of U.S. local ordinances, builds the LOCUS corpus covering ordinance codes from 9,239 cities and counties and trains ModernBERT classifiers to analyze dimensions such as legal transparency.

Comments 14 pages, 6 figures

Abstract

Progress in legal AI increasingly depends on access to authoritative legal text at scale. Yet one of the most consequential layers of American law remains largely absent from existing machine-readable corpora: local ordinances. Local codes govern zoning, housing, business licensing, public health, noise, animal control, and many other domains of everyday regulation, but they are fragmented across vendor platforms designed for human browsing rather than bulk research access. We introduce LOCUS - the Local Ordinance Corpus for the United States - a comprehensive corpus and county-harmonized access layer for U.S. municipal and county ordinance codes. The raw corpus, available for release to researchers, represents nearly all publicly available municipal and county ordinance codes. The resulting raw corpus contains codes from 9,239 cities and counties. A smaller county-harmonized LOCUS access layer provides coverage for the largest 2,309 of 3,144 U.S. counties, accounting for a majority of the population. We use OCR to handle the myriad of document formats that have kept the law from being a public resource. We release the corpus with coverage metadata to support reproducibility, downstream legal AI research, and the incremental expansion of machine-readable access to local law. We train a collection of ModernBERT-based classifiers and scorers to facilitate analyzing U.S. local law among several dimensions, such as opacity and paternalism, that have not previously been studied at this scale. LOCUS-v1 and its derivative models are available at: https://huggingface.co/datasets/LocalLaws/LOCUS-v1


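2606.18372 2026-06-18 cs.CL cs.AI New submission

The Wrong Kind of Right: Quantifying and Localizing Misfired Alignment in LLMs

Naihao Deng, Yiming Feng, Chimaobi Okite, Kaijian Zou, Lu Wang, Rada Mihalcea, Yulong Chen

Affiliations * University of Michigan; University of Cambridge; University of Aberdeen

AI summary: Proposes the VETO benchmark and the Misfired Alignment Rate (MAR) metric; finds that all LLMs show non-trivial misfired alignment on stereotype-related questions while humans score 0%, and mechanistic analysis indicates that alignment-induced cues amplify the phenomenon.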

Abstract

Warning: This paper studies stereotypes and biases, and contains potentially disturbing examples, used for illustration purposes only. Our findings should not be interpreted as an argument against alignment. Instead, this paper highlights the need for principled approaches to more advanced alignment. Alignment aims to ensure that large language models (LLMs) behave safely and reliably, including by avoiding unsafe inferences. However, we show that such safety-oriented behaviors can misfire: models may reject warranted conclusions even when they are explicitly supported by context. We call this failure mode misfired alignment, where alignment-induced changes cause LLMs to override explicit evidence. To quantify this phenomenon, specifically on stereotype-related alignment, we introduce VETO, a benchmark consisting of 2,032 BBQ-derived contrastive pairs, and define a new metric, Misfired Alignment Rate (MAR), which measures on a 0 to 100 scale how often a model fails on a stereotype-related question but succeeds on its contrastive counterpart. We benchmark 25 LLMs on VETO, and show that all LLMs, including the most recent ones, exhibit non-trivial (4.7 to 18.9%) MARs while all human participants achieve 0.0% MAR. Controlled priming experiments further show that alignment-induced cues can substantially amplify MAR across LLMs, indicating that these failures are not merely artifacts of individual examples but can be induced by safety-related framing. Mechanistic analyses on open-weight LLMs reveal late-layer suppression of evidence-supported answers, and comparisons between instruct and base LLMs suggest that this suppression emerges after instruction training. These findings show that current alignment methods can overgeneralize surface-level safety cues, to the point of overriding objective evidence, motivating more work on alignment objectives that better preserve contextual grounding.
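
The MAR definition in the abstract (how often a model fails the stereotype-related question but succeeds on its contrastive counterpart, on a 0 to 100 scale) can be written as a few lines of code. The sketch assumes the denominator is the total number of contrastive pairs, which the abstract does not state explicitly; the function name and toy results are also hypothetical.

```python
def misfired_alignment_rate(pair_results):
    """Percentage of contrastive pairs that count as a misfire.

    pair_results is a list of (stereotype_correct, contrast_correct) booleans,
    one tuple per contrastive pair. A misfire is a pair where the model is
    wrong on the stereotype-related question but right on its counterpart.
    Assumption: the rate is taken over all pairs.
    """
    if not pair_results:
        raise ValueError("need at least one contrastive pair")
    misfires = sum(
        1 for stereo_ok, contrast_ok in pair_results
        if not stereo_ok and contrast_ok
    )
    return 100.0 * misfires / len(pair_results)


# Toy results for four pairs; only the first is a misfire.
toy_results = [(False, True), (True, True), (False, False), (True, False)]
mar = misfired_alignment_rate(toy_results)
```

Requiring success on the contrastive counterpart is what separates a misfire from an ordinary comprehension error: a pair where the model fails both questions does not count.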


2606.18767 2026-06-18 cs.CL New submission

Output Vector Editing for Memorization Mitigation in Large Language Models

Ahmad Dawar Hakimi, Kaiwei Lei, Isabelle Augenstein, Hinrich Schütze

Affiliations * Center for Information and Language Processing, LMU Munich; Department of Computer Science, University of Copenhagen; Munich Center for Machine Learning; Pioneer Centre for AI

AI summary: Proposes output vector editing, which uses constrained optimization to modify the output vectors of MLP neurons and introduce a distractor, suppressing memorized sequences without changing activations; it achieves 87.9% suppression on OLMo-7B and reveals a mechanistic boundary of MLP-only editing.

Abstract

Large language models memorize and reproduce sequences from their training data, creating privacy, copyright, and security risks. Existing neuron-level mitigation methods equate editing with zeroing out neuron activations, but the activation only controls whether a neuron engages; the output vector is what writes to the residual stream and, through superposition, encodes multiple features. We propose output vector editing, a constrained-optimization weight edit that locates a small set of MLP neurons responsible for a memorized continuation and minimally modifies their output vectors to introduce a distractor in vocabulary space, redirecting their residual-stream contributions while leaving activations unchanged. Evaluating on four models from 360M to 7B parameters (SmolLM-360M, OLMo-1B, OLMo-7B, Llama2-7B), we center on OLMo-7B (whose open weights and pretraining corpus enable systematic mining) and mine 6831 memorized sequences, achieving up to 87.9% suppression. The 2.7$\times$ gap over zero ablation on the same located neurons shows the suppression comes from the output-vector edit, not localization alone. Four edit modes span a spectrum from aggressive suppression to minimal redirection; in ensemble they cover 96.5% of memorized sequences, while our recommended single-mode configuration reaches 81.5% with no catastrophic locality failures. We further identify a mechanistic boundary at ~14% of sequences unreachable by MLP-only editing; while these failures are not attention-driven overall, ablating the top contributing attention heads recovers 60-64% of them, with stronger recovery on continuations that copy tokens from the prefix, positioning attention as a complementary fallback rather than a primary mechanism. Edit mode ordering and the success-locality trade-off transfer across all four models, with success rates scaling with model size rather than family.
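
To make the contrast between zeroing an activation and editing an output vector concrete, here is a toy sketch for a single neuron. It is not the paper's optimization: it solves a much simpler problem, the minimum-norm change to one output vector that makes the neuron favor a distractor token over the memorized token by a chosen margin, with the activation left untouched. The unembedding rows `u_mem` and `u_dis`, the margin, and all numbers are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def neuron_logit_gap(activation, v_out, u_mem, u_dis):
    """Distractor-minus-memorized logit contributed by one neuron.

    The neuron writes activation * v_out to the residual stream; u_mem and
    u_dis are the (assumed) unembedding rows of the two tokens.
    """
    direction = [d - m for d, m in zip(u_dis, u_mem)]
    return activation * dot(direction, v_out)


def min_norm_edit(activation, v_out, u_mem, u_dis, margin):
    """Smallest-norm edit of v_out so the logit gap reaches `margin`.

    Closed-form solution of: minimize ||delta|| subject to
    activation * (u_dis - u_mem) . (v_out + delta) = margin.
    Requires a nonzero activation and u_dis != u_mem.
    """
    direction = [d - m for d, m in zip(u_dis, u_mem)]
    need = margin - neuron_logit_gap(activation, v_out, u_mem, u_dis)
    if need <= 0:
        return list(v_out)          # already favors the distractor enough
    scale = need / (activation * dot(direction, direction))
    return [v + scale * g for v, g in zip(v_out, direction)]


activation = 2.0
v_out = [1.0, 0.0]                  # originally pushes toward the memorized token
u_mem, u_dis = [1.0, 0.0], [0.0, 1.0]

gap_before = neuron_logit_gap(activation, v_out, u_mem, u_dis)
gap_zero_ablation = neuron_logit_gap(0.0, v_out, u_mem, u_dis)
v_edited = min_norm_edit(activation, v_out, u_mem, u_dis, margin=1.0)
gap_after = neuron_logit_gap(activation, v_edited, u_mem, u_dis)
```

In this toy setting zero ablation only removes the neuron's push toward the memorized token (the gap goes to zero), whereas the edited vector actively redirects the contribution toward the distractor while the activation stays at 2.0.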


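2606.18852 2026-06-18 cs.CL cs.AI New submission

Efficient Financial Language Understanding via Distillation with Synthetic Data

Wen-Fong Huang, Edwin Simpson

Affiliations * School of Engineering Mathematics and Technology, University of Bristol

AI summary: Proposes a framework for financial sentiment analysis under low-resource conditions via distillation with synthetic data; clustering-based seed selection yields representative synthetic data, enabling compact models to reach strong performance with few labels and even to outperform the teacher model on some tasks.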

DOI: 10.63317/3b3zxy5qrw8s
Journal ref: Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), European Language Resources Association (ELRA), 2026, pp. 10242-10254

Abstract

Large instruction-following models are powerful but costly to deploy, particularly in finance, where labelled data are limited by confidentiality and expert annotation cost. We present an efficient framework for financial sentiment analysis through distillation with synthetic data, transferring knowledge from a large instruction-tuned teacher to compact student models. The framework is designed for low-resource conditions, where a small set of real examples are collected and labelled by hand. The framework then clusters the examples and uses the clusters to select seeds for generating synthetic examples via structured few-shot prompting. Experiments show that clustering-based seed selection yields more representative synthetic data than random sampling, enabling compact models to achieve strong performance with minimal supervision. Notably, on a more complex and noisy text domain, the compact model trained on the complete synthetic-seed corpus even outperforms the teacher model, while remaining competitive on formal text. The framework provides a practical route toward resource-efficient domain adaptation in financial NLP with minimal human labelling effort.
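
Clustering-based seed selection can be sketched as: embed the labelled examples, cluster them, and take the example nearest each cluster centre as a seed for few-shot prompting. The abstract does not name the clustering algorithm or the embedding model, so the plain k-means below, its naive initialization from the first k points, and the 2-D toy "embeddings" are all assumptions made for illustration.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain k-means with a deterministic (naive) init from the first k points."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        assign = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign


def select_seeds(points, k):
    """Return one index per non-empty cluster: the point nearest its centroid."""
    centroids, assign = kmeans(points, k)
    seeds = []
    for c in range(k):
        members = [i for i, a in enumerate(assign) if a == c]
        if members:
            seeds.append(min(members, key=lambda i: sq_dist(points[i], centroids[c])))
    return seeds


# Toy 2-D "embeddings" forming two well-separated groups.
embeddings = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
seed_indices = select_seeds(embeddings, k=2)
```

Compared with random sampling, picking one seed per cluster guarantees that each region of the embedding space is represented among the few-shot examples, which is the intuition behind the more representative synthetic data reported in the abstract.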


2606.19266 2026-06-18 cs.CL cs.AI New submission

Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA

Ikram Belmadani, Oumaima El Khettari, Carlos Ramisch, Frederic Bechet, Richard Dufour, Benoit Favre

Affiliations * Aix-Marseille Univ., CNRS, LIS UMR 7020; Nantes Univ., École Centrale Nantes, CNRS, LS2N UMR 6004; Grenoble Alpes Univ., CNRS, INRIA, Grenoble INP, LIG UMR 5217

AI summary: Using French medical question answering, empirically compares continual pretraining (CPT) and supervised fine-tuning (SFT) across multiple model families and sizes; CPT+SFT is best on multiple-choice QA but with small gains, SFT is a strong and economical default, and CPT improves overlap metrics on open-ended QA.

Abstract

The development of large language models (LLMs) has led to an increased focus on their adaptation to specialized domains and languages, yet the effectiveness of domain adaptation strategies remains unclear. We present a study of medical domain adaptation using French medical question-answering (QA) as a case study. We compare continual pretraining (CPT), supervised fine-tuning (SFT), and their combination across three model families, multiple sizes, and three initialization types, explicitly disentangling adaptation effects from base model choice. We evaluate both multiple-choice (MCQA) and open-ended QA (OEQA) under greedy and constrained decoding using automatic metrics and LLM-as-a-Judge evaluation. For MCQA, CPT+SFT most often achieves the best scores, but gains over SFT are small and frequently not statistically significant, making SFT a strong and cost-effective default. For OEQA, CPT consistently improves overlap-based metrics, while SFT often degrades generation quality; instruction tuning and CPT+SFT are preferred by LLM-based evaluation. Cross-lingual experiments further show effective transfer from French adaptation to English benchmarks. Overall, we provide practical guidelines for selecting adaptation strategies under computational constraints.


1. Large Language Models and Foundation Models (12 papers)

JetFlow: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

LLM Parameters for Math Across Languages: Shared or Separate?

Towards Scalable Customization and Deployment of Multi-Agent Systems for Enterprise Applications

Dual Dimensionality for Local and Global Attention

RegMix-D: Dynamic Data Mixing via Proxy Training Trajectories

Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration

GraphPO: Graph-based Policy Optimization for Reasoning Models

Enhancing Multilingual Reasoning via Steerable Model Merging

Sumi: Open Uniform Diffusion Language Model from Scratch

Dango: A Strictly L1-Only Large Language Model for Studying Second Language Acquisition

DreamReasoner-8B: Block-Size Curriculum Learning for Diffusion Reasoning Models

2. Information Extraction, Retrieval, and Question Answering (7 papers)

SproutRAG: Attention-Guided Tree Search with Progressive Embeddings for Long-Document RAG

MCompassRAG: Topic Metadata as a Semantic Compass for Paragraph-Level Retrieval

BCL: Bayesian In-Context Learning Framework for Information Extraction

Lost in a Single Vector: Improving Long-Document Retrieval with Chunk Evidence Aggregation

Learning Robust Pair Confidence for Multimodal Emotion-Cause Pair Extraction

Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering

Language Models as Interfaces, Not Oracles: A Hybrid LLM-ML System for Pediatric Appendicitis

3. Dialogue Systems and Agents (6 papers)

CoreMem: Riemannian Retrieval and Fisher-Guided Distillation for Long-Term Memory in Dialogue Agents

VISUALSKILL: Multimodal Skills for Computer-Use Agents

LegalWorld: A Life-Cycle Interactive Environment for Legal Agents

Leadership as Coordination Control: Behavioral Signatures and the Recovery-Advantage Boundary in Multi-Agent LLM Teams

Enhancing Decision-Making with Large Language Models through Multi-Agent Fictitious Play

Learning User Simulators with Turing Rewards

4. Text Generation, Summarization, and Editing (2 papers)

ScholarSum: Student-Teacher Abstractive Summarization via Knowledge Graph Reasoning and Reflective Refinement

Improving Medical Communication using Rubric-Guided Counterfactual Recommendations

5. Semantics, Syntax, and Linguistic Analysis (4 papers)

PragReST: Self-Reinforcing Counterfactual Reasoning for Pragmatic Language Understanding

Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish

Approximate Structured Diffusion for Sequence Labelling

As Easy as Rocket Science: Assessing the Ability of Large Language Models to Interpret Negation in Figurative Language

6. Multimodal Language Processing (1 paper)

Continuous Audio Thinking for Large Audio Language Models

7. Joint Speech-Language and Audio-Text (2 papers)

Montreal Forced Aligner and the state of speech-to-text alignment in 2026

Speech-Driven End-to-End Language Discrimination towards Chinese Dialects

8. Evaluation, Datasets, and Benchmarks (12 papers)

Possible or Definite? A Benchmark for Evaluating Diagnostic Uncertainty Preservation in Clinical Text

Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance

PEC-Home: Interpretation of Progressively Elliptical Commands in Smart Homes

TW-LegalBench: Measuring Taiwanese Legal Understanding

LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment

RedactionBench

Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports

SenFlow: Inter-Sentence Flow Modeling for AI-Generated Text Detection in Hybrid Documents

G-IdiomAlign: A Gloss-Pivoted Benchmark for Cross-Lingual Idiom Alignment

Which Sections of a Research Paper Best Reveal Its Research Methods? Evidence from Library and Information Science

RECOM: A Validity Discrimination Tradeoff in Automatic Metrics for Open Ended Reddit Question Answering

Freeing the Law with LOCUS: A Local Ordinance Corpus for the United States

9. Safety, Privacy, Fairness, and Interpretable NLP (6 papers)

Redact or Keep? A Fully Local AI Cascade for Educational Dialogue De-Identification

PreUnlearn: Auditing Collateral Knowledge Damage Before Large Language Model Unlearning

Steerable Cultural Preference Optimization of Reward Models

The Wrong Kind of Right: Quantifying and Localizing Misfired Alignment in LLMs

Output Vector Editing for Memorization Mitigation in Large Language Models

Aligning Implied Statements for Implicit Hate Speech Generalizability with Context-Bounded Semi-hard Negative Mining

10. Low-Resource, Domain Adaptation, and Efficient Training (4 papers)

Want Better Synthetic Data? Steer It: Activation Steering for Low-Resource Language Generation

Low-resource Language Discrimination Towards Chinese Dialects with Transfer learning and Data Augmentation

Efficient Financial Language Understanding via Distillation with Synthetic Data

Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA