arXivDaily arXiv daily academic digest, updated Monday through Friday

1. Large Language Models and Foundation Models (12 papers)

2606.18394 2026-06-18 cs.CL new submission

JetFlow: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

Lanxiang Hu, Zhaoxiang Feng, Yulun Wu, Haoran Yuan, Yujie Zhao, Yu-Yang Qian, Bojun Wang, Daxin Jiang, Yibo Zhu, Tajana Rosing, Hao Zhang

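Affiliations UC San Diego; Zhejiang University; UIUC; Nanjing University; StepFun

AI summary Proposes JetFlow, a framework that combines a causal parallel draft head with tree speculative decoding to turn larger draft budgets into longer accepted prefixes and higher end-to-end speedup, reaching up to 9.64x speedup on Qwen3 models.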

Abstract

Speculative decoding (SD) accelerates autoregressive Large Language Models (LLMs) by drafting multiple tokens and verifying them in parallel, but it faces a scaling limitation: increasing the draft budget improves speed only when acceptance remains high and drafting overhead stays low. This ceiling has been difficult to break because prior head-based SD methods face a causality-efficiency dilemma. Autoregressive drafters produce path-conditioned candidates that are effective for tree speculative decoding with higher acceptance length, but their drafting cost grows with tree depth. Bidirectional block-diffusion drafters generate all positions in one pass, but their branch-agnostic marginals can form individually plausible yet mutually inconsistent trees, wasting budget and reducing acceptance. We propose JetFlow, a head-based SD framework that combines one-forward drafting efficiency with branch-wise causal conditioning. JetFlow trains a causal parallel draft head over fused hidden states from the frozen target model, producing candidate trees whose scores align with the target model's autoregressive factorization. This enables JetFlow to convert larger draft budgets into longer accepted prefixes and higher end-to-end speedup. Across math, coding, and chat benchmarks on dense and MoE Qwen3 models, JetFlow consistently outperforms bidirectional-head and tree-based SD baselines. On H100 GPUs, JetFlow achieves up to 9.64x speedup on MATH-500 and 4.58x on open-ended conversational workloads, with further latency gains demonstrated through vLLM integration under realistic serving loads. Our code and models are available at https://github.com/hao-ai-lab/JetFlow.
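
The acceptance rule behind the "longer accepted prefixes" mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes greedy verification, where a drafted token is kept only while it matches the token the target model would emit. This is the generic speculative-decoding check, not JetFlow's draft head or its released code; the function names and token IDs are hypothetical.

```python
def accepted_prefix_len(draft, target):
    """Count leading draft tokens that match the target model's greedy tokens."""
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        n += 1
    return n


def best_branch_len(branches, target):
    """Among the root-to-leaf paths of a draft tree, return the longest accepted prefix."""
    return max((accepted_prefix_len(b, target) for b in branches), default=0)


# Hypothetical token IDs: three candidate paths from one draft tree.
target_tokens = [5, 7, 1, 2]
tree_paths = [[5, 3], [5, 7, 9], [5, 7, 1, 4]]
print(best_branch_len(tree_paths, target_tokens))  # 3
```

A wider tree raises the chance that some path matches the target for longer, which is why inconsistent branches waste draft budget.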

2606.18453 2026-06-18 cs.CL new submission

LLM Parameters for Math Across Languages: Shared or Separate?

Behzad Shomali, Luisa Victor, Tim Selbach, Ali Hamza Bashir, David Berghaus, Joachim Koehler, Mehdi Ali, Markus Frey

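Affiliations Lamarr Institute; University of Bonn; Fraunhofer IAIS

AI summary A cross-lingual mechanistic analysis finds that math-associated parameters in multilingual LLMs overlap partially across languages, with the overlap concentrated in intermediate layers; English yields the largest parameter set and lower-resource languages yield smaller ones.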

Comments 5 pages. Accepted at ACL Student Research Workshop (SRW) 2026. Code: https://github.com/luisavictor/math-across-languages Translated Datasets: https://huggingface.co/math-across-languages Webpage: https://math-across-languages.github.io

Abstract

Large language models (LLMs) exhibit substantial cross-lingual variation in mathematical reasoning performance, but it remains unclear whether these differences reflect language-specific parameters or a shared mechanism that manifests differently by language. We present a cross-lingual mechanistic analysis of mathematical reasoning in LLMs, enabling us to localize and compare model parameters that support mathematical reasoning across languages. We find that the extracted math-associated parameters exhibit partial cross-lingual overlap, with the strongest overlap concentrated in intermediate model layers. We further observe that English consistently produces the largest set of math-relevant parameters, whereas lower-resource languages reveal smaller sets of relevant parameters. These results suggest that math-related behavior in multilingual LLMs is neither fully language-invariant nor fully language-specific, but instead exhibits partial cross-lingual parameter overlap with systematic language-dependent differences.

2606.18502 2026-06-18 cs.CL new submission

Towards Scalable Customization and Deployment of Multi-Agent Systems for Enterprise Applications

Paresh Dashore, Shreyas Kulkarni, Uttam Gurram, Nadia Bathaee, Kartik Balasubramaniam, Genta Indra Winata, Sambit Sahu, Shi-Xiong Zhang

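Affiliations Capital One

AI summary Proposes a unified framework that combines agentic model customization (continual pretraining, fine-tuning, preference optimization) with inference optimization (speculative decoding, FP8 quantization), achieving domain adaptation and a 4.48x throughput speedup while maintaining performance and improving robustness on long-tail scenarios.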

Comments Preprint

Abstract

Large language model (LLM)-based multi-agent systems demonstrate strong performance on complex reasoning and task execution, enabling broad enterprise applications. However, production deployment remains challenging due to domain-specific customization requirements and high latency and inference costs in agentic workflows. We propose a unified framework for customization and efficient deployment of multi-agent systems in real-world settings. The first stage, Agentic Model Customization, combines continual pretraining, supervised fine-tuning, and preference optimization to adapt a compact model to specialized domains while retaining strong agentic capabilities. The second stage, Inference Optimization, integrates speculative decoding and FP8 quantization with targeted calibration to enable cost-efficient serving with minimal quality loss. Across enterprise workloads, our framework enables rapid domain adaptation and achieves a 4.48x speedup in throughput while maintaining performance and improving robustness on long-tail scenarios.

2606.18587 2026-06-18 cs.CL cs.AI new submission

Dual Dimensionality for Local and Global Attention

Zhiyuan Wang, Xuan Luo, Sirui Zeng, Xifeng Yan

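Affiliations UC Santa Barbara

AI summary Proposes Distance-Adaptive Representation (DAR), which keeps full-dimensional representations for local context and uses lower-dimensional representations for distant tokens, reducing the KV cache while preserving performance.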

Abstract

Decoder-only Transformers compute attention over the KV cache of preceding tokens. Keys (and Values) are typically represented with the same dimensionality, regardless of their distance from the prediction target. In natural language, however, the next word is most strongly influenced by the immediately preceding tokens. We hypothesize that local and distant tokens impose asymmetric demands on representational capacity: local tokens are more critical for predicting immediate outputs and thus require richer representations, whereas distant tokens primarily serve as long-range memory, for which lower-dimensional representations may suffice. We formalize this idea as Distance-Adaptive Representation (DAR), implemented in a controlled setting that preserves full-dimensional representations within a local context window while assigning reduced-dimensional representations (e.g., 1/4 of the original dimensionality) to tokens beyond that window. Across multiple pretraining scales (70M to 410M parameters), as well as continued supervised fine-tuning on a 1B-scale model, this approach closely matches the performance of full-dimensional baselines. In contrast, uniformly reducing dimensionality across all token positions leads to worse performance. These results challenge the common assumption that key and value dimensionality should be uniform across token positions. Our findings suggest a new direction for designing attention architectures that adaptively allocate representational capacity across sequences, enabling further reductions in KV cache during inference.
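
A back-of-the-envelope Python sketch of the KV-cache saving described above. It counts the stored key (or value) scalars for one head in one layer, assuming a hypothetical local window of 256 tokens, a head dimension of 128, and a 4096-token sequence. Only the 1/4 ratio comes from the abstract, and the function name is illustrative.

```python
def kv_cache_scalars(seq_len, head_dim, window, far_ratio=0.25):
    """Scalars stored for the keys (or values) of one head in one layer."""
    local = min(seq_len, window)             # tokens kept at full dimensionality
    distant = seq_len - local                # tokens stored at reduced dimensionality
    reduced_dim = int(head_dim * far_ratio)  # e.g. 128 -> 32
    return local * head_dim + distant * reduced_dim


# A window covering the whole sequence reproduces the uniform full-dimensional baseline.
full = kv_cache_scalars(4096, 128, window=4096)
dar = kv_cache_scalars(4096, 128, window=256)
print(full, dar, dar / full)  # 524288 155648 0.296875
```

Under these assumed sizes the cache shrinks to roughly 30% of the baseline, and the ratio approaches 1/4 as the sequence grows relative to the window.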

2606.18663 2026-06-18 cs.CL new submission

RegMix-D: Dynamic Data Mixing via Proxy Training Trajectories

Kaiyan Zhao, Zhongtao Miao, Akiko Aizawa, Yoshimasa Tsuruoka

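Affiliations The University of Tokyo; National Institute of Informatics

AI summary Proposes RegMix-D, which uses proxy training trajectories to predict optimal mixture ratios at multiple stages for dynamic data mixing; it outperforms RegMix and DoReMi on 13 downstream tasks with only 25% of RegMix's proxy compute budget.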

Comments Work in progress

Abstract

Data mixture selection is critical for Large Language Model pretraining. Existing methods such as RegMix select a single static mixture by fitting a regression model on small-scale proxy runs. We propose RegMix-D, a simple extension of RegMix to dynamic mixing. Our key observation is that proxy runs produce not only endpoint losses, but also full loss trajectories, which can be used to further improve the data mixture. By training a regression model on these trajectories, we can predict optimal mixtures at multiple training stages. RegMix-D supports two deployment modes: an offline variant that generates a complete mixture schedule before target training, and an online variant that adapts the mixture during training using the observed loss. Experiments on 25B tokens of the Pile dataset with a 1B parameter target model show that RegMix-D consistently improves over RegMix and DoReMi across 13 downstream tasks while remaining proxy-efficient: it surpasses RegMix even with only 128 proxy models (25% of RegMix's proxy compute budget).

2606.18831 2026-06-18 cs.CL cs.AI new submission

Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

Xiaoyue Xu, Sikui Zhang, Xiaorong Wang, Xu Han, Chaojun Xiao

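Affiliations OpenBMB; Tsinghua University

AI summary Proposes a simple yet effective data recipe that, paired with a minimal outcome-based GRPO setup, substantially improves long-context reasoning in LLMs, with average gains of +3.2 to +7.2 points across long-context benchmarks and further gains on agentic tasks.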

Comments 15 pages, 6 figures, 12 tables

Abstract

Long-context reasoning is an essential capability for large language models, particularly when they are deployed as autonomous agents that must reason over lengthy trajectories. Reinforcement learning (RL) has recently emerged as a dominant paradigm for improving this ability, yet existing work largely focuses on reward engineering while diverse training data remains scarce. We revisit this problem from a data-centric perspective and show that a simple yet effective data recipe alone, paired with a minimal outcome-based GRPO setup, suffices to substantially improve long-context reasoning. Our recipe targets three complementary task families -- retrieval, multi-evidence synthesis, and reasoning -- for which we construct and curate eight datasets totaling ~14K examples. Experiments on three models (Qwen3-4B/8B/30B-A3B) yield average gains of +7.2/+3.2/+6.4 points across seven long-context benchmarks, surpassing prior RL training sets. We further demonstrate that these gains transfer to agentic tasks, where continuing RL training on an agent-tuned model with our data recipe improves GAIA by +4.8 and BrowseComp by +7.0 points. We will release our datasets to facilitate future research.
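
The "minimal outcome-based GRPO setup" relies on group-relative advantages: each sampled response is scored against the other responses to the same prompt. The Python sketch below shows the commonly used normalization, assuming binary outcome rewards and the population standard deviation. The paper's exact configuration is not given in the abstract, and `group_relative_advantages` is a hypothetical name.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four responses to one prompt; 1.0 means the final answer was judged correct.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When all responses in a group receive the same reward, every advantage is zero and the group contributes no learning signal, which is one reason the difficulty and diversity of training data matter.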

2606.18902 2026-06-18 cs.CL new submission

SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration

Ziyi Zhu, Luka Smyth, Saki Shinoda, Jinghong Chen

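Affiliations Slingshot AI; Department of Engineering, University of Cambridge

AI summary Proposes SPO, a stochastic prompt optimization framework, in which the SAGE method performs black-box search through a multi-agent pipeline with diagnostic code execution; effectiveness across benchmarks depends on error type, and continuous optimization of a mental-health chatbot yields a significant gain in next-day retention.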

Abstract

Context engineering has emerged as a primary lever for improving AI systems without parameter updates. Recent work showing that textual gradients do not function as real gradients motivates treating automatic prompt optimization (APO) as black-box search. We introduce SPO (Stochastic Prompt Optimization), a framework for stochastic search over prompt space, and compare three strategies of increasing sophistication: error-informed random search, a genetic algorithm with evolutionary operators, and SAGE (SPO via Agent-Guided Exploration), a multi-agent pipeline with diagnostic code execution. Across three benchmarks, no single strategy dominates; effectiveness depends on the interaction of landscape structure with error type. We further deploy SAGE on a mental-health chatbot under a continuous optimization paradigm, where it compounds eight cycles of individually-noisy A/B tests into a statistically robust gain in next-day retention. We argue that coupling qualitative diagnosis with quantitative validation is what makes agentic optimization effective for open-ended task-oriented dialogue.
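
Treating prompt optimization as black-box search can be sketched in a few lines of Python. This is a plain accept-if-better random search, loosely in the spirit of the simplest strategy the abstract lists. The `mutate` and `score` callables, the hint strings, and the toy scoring rule are all hypothetical stand-ins; a real system would score prompts on held-out task examples and use observed errors to guide the mutations.

```python
import random


def random_search(initial, mutate, score, steps=20, seed=0):
    """Keep the best prompt seen so far; accept a candidate only if it scores higher."""
    rng = random.Random(seed)
    best, best_score = initial, score(initial)
    for _ in range(steps):
        candidate = mutate(best, rng)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


HINTS = ["Show your work.", "Answer with a single number.", "Be polite."]


def toy_mutate(prompt, rng):
    # Hypothetical edit operator: append one randomly chosen hint.
    return prompt + " " + rng.choice(HINTS)


def toy_score(prompt):
    # Stand-in for validation accuracy: reward the two hints assumed to be useful.
    return sum(h in prompt for h in HINTS[:2])


best_prompt, best_score = random_search("Solve the problem.", toy_mutate, toy_score)
print(best_score, best_prompt)
```

Because the search only ever sees scores, nothing in the loop depends on gradients, which matches the black-box framing in the abstract.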

2606.18954 2026-06-18 cs.CL new submission

GraphPO: Graph-based Policy Optimization for Reasoning Models

Yuliang Zhan, Xinyu Tang, Jian Li, Dandan Zheng, Weilong Chai, Jingdong Chen, Jun Zhou, Ge Wu, Wenyue Tang, Hao Sun

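Affiliations Gaoling School of Artificial Intelligence, Renmin University of China; Ant Group

AI summary Proposes GraphPO, a framework that models reasoning rollouts as a directed acyclic graph, merges semantically equivalent paths to reduce redundant exploration, and uses edge-level advantages to improve reasoning efficiency, outperforming chain- and tree-based methods on multiple benchmarks.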

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for enhancing the capability of large reasoning models. RLVR typically samples responses independently and optimizes the policy using rewards from final answers. This paradigm has two limitations. First, independently sampled responses often contain similar intermediate reasoning steps, causing redundant exploration and wasted computation. Second, sparse final-answer rewards make it hard to identify useful steps. Tree-based methods partly address this problem by sharing prefixes and comparing branches from the same prefix to provide fine-grained signals. However, tree branches are still expanded independently. When different branches reach similar reasoning states, they cannot share information and repeat similar exploration. Moreover, tree-based methods ignore such dispersion and only perform local comparisons within separate branches, which can lead to higher variance in advantage estimation. To address this challenge, we propose GraphPO (Graph-based Policy Optimization), a novel RL framework that represents rollouts as a directed acyclic graph, with reasoning steps as edges and semantic states summarized from the reasoning paths as nodes. GraphPO merges semantically equivalent reasoning paths into equivalence classes, allowing them to share suffixes and reallocating budget away from redundant expansions to diverse exploration. Furthermore, we assign efficiency advantages to incoming edges and correctness advantages to outgoing edges, thereby improving inference efficiency while deriving process supervision from outcome. Theory shows that GraphPO reduces advantage-estimation variance and enhances reasoning efficiency. Experiments on three LLMs across reasoning and agentic search benchmarks show that GraphPO consistently outperforms chain- and tree-based baselines with the same token budgets or response budgets.

2606.19002 2026-06-18 cs.CL new submission

Enhancing Multilingual Reasoning via Steerable Model Merging

Zhuoran Li, Rui Xu, Jian Yang, Junnan Liu, Zhijun Chen, Qianren Mao, Hongcheng Guo, Jiaheng Liu, Likang Xiao, Ming Li, Xiaojie Wang

Affiliations Beijing University of Posts and Telecommunications; Fudan University; Beihang University; Monash University; Zhongguancun Laboratory; Nanjing University; Tsinghua University

AI summary Proposes the Steerable Model Merging (ST-Merge) framework, which uses a gated cross-attention mechanism to adaptively modulate the contribution of each source model and outperforms strong baselines on multilingual reasoning tasks.

Comments 12 pages, 7 figures, 8 tables. Accepted by ACL2026 Findings

Abstract

Model merging is an effective technique for composing the capabilities of a multilingual model and a reasoning model. It has achieved promising generalization in multilingual reasoning tasks by aligning feature spaces of different models. However, the merged single model often fails to address the conflicts between source models, leading to suboptimal performance. In other words, the one-size-fits-all merging strategy may not align with the characteristics of different inputs which may require prioritizing certain models over others. To this end, we propose a Steerable Model Merging (ST-Merge) framework to modulate the contribution of each source model. To realize this idea, we introduce a gated cross-attention mechanism to weight or filter the two attended source models in an adaptive manner. Extensive experiments demonstrate that ST-Merge consistently outperforms multiple strong baselines on four multilingual reasoning benchmarks across 21 different languages.

2606.19005 2026-06-18 cs.CL cs.LG new submission

Sumi: Open Uniform Diffusion Language Model from Scratch

Mengyu Ye, Keito Kudo, Wataru Ikeda, Ryosuke Matsuda, Keisuke Sakaguchi, Jun Suzuki

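Affiliations Tohoku University

AI summary Introduces Sumi, a 7B-parameter uniform diffusion language model pretrained from scratch on 1.5T tokens, which performs competitively with autoregressive models trained at comparable token budgets and is released with all resources open.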

Abstract

Diffusion models have become a promising alternative to autoregressive models. Among these, uniform diffusion language models (UDLMs) permit any token to be updated at any step, in principle enabling more flexible generation. However, no UDLM has yet been pretrained from scratch at both large parameter scale and large token budget. Both autoregressive modeling and masked diffusion modeling already have capable models at scale that the community can study and build on; uniform diffusion has none. A scratch-pretrained UDLM at scale would provide a clean reference point for studying scaling behavior, generation dynamics, controllability, and trade-offs against established autoregressive and masked diffusion models. To this end, we introduce Sumi ("ink" in Japanese), a fully open 7B uniform diffusion language model pretrained from scratch on 1.5T tokens. Sumi performs competitively with autoregressive models trained at comparable token budgets on knowledge, reasoning, and coding benchmarks, while under-performing on commonsense benchmarks, where our education-heavy data mixture is a likely contributor. We release our model weights, checkpoints, and full training recipe, including a complete specification of the data mixture over publicly available corpora. We hope this release enables the community to study native uniform diffusion at scale and catalyzes work on its as-yet poorly understood aspects.

2606.19170 2026-06-18 cs.CL new submission

Dango: A Strictly L1-Only Large Language Model for Studying Second Language Acquisition

Shiho Matta, Yin Jou Huang, Fei Cheng, Takashi Kodama, Hirokazu Kiyomaru, Yugo Murawaki

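Affiliations Kyoto University; NII-LLMC

AI summary Introduces Dango, a 1.8B-parameter model that filters L2 contamination from its pretraining data and is fine-tuned on L2-learning lessons to simulate human L2 production patterns, outperforming unfiltered and multilingual baselines.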

Comments 8 pages main text, 20 pages total including references and appendices

Abstract

We introduce Dango, a 1.8B-parameter large language model designed for controlled studies of L1-to-L2 (Japanese-to-English) transfer in second language acquisition (SLA). While previous studies have explored SLA in language models, they have predominantly relied on smaller or non-decoder models, limiting their ability to generate open-ended text and reducing their suitability as practical L2 simulators. We identify a key challenge when scaling models to this size: L2 contamination within the "monolingual" pretraining corpus used for L1 acquisition. To address this, we propose a filtering method to reduce premature exposure to English while preserving realistic, minimal exposure. We then fine-tune the model on LLM-generated L2-learning lessons to simulate the L2 acquisition process. Our evaluations confirm that Dango develops human-like L2 production patterns, outperforming both unfiltered and standard multilingual baselines. We release the model, data, and code to facilitate reproducible computational SLA research and learner-facing applications.

2606.19257 2026-06-18 cs.CL new submission

DreamReasoner-8B: Block-Size Curriculum Learning for Diffusion Reasoning Models

Zirui Wu, Lin Zheng, Jiacheng Ye, Shansan Gong, Xueliang Zhao, Yansong Feng, Wei Bi, Lingpeng Kong

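Affiliations The University of Hong Kong; Peking University

AI summary Proposes block-size curriculum learning, which trains progressively from fine-grained to coarse-grained block sizes to close the performance gap of block diffusion language models on long chain-of-thought reasoning; DreamReasoner-8B reaches results competitive with Qwen3-8B on math and code reasoning.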

Abstract

Block diffusion language models accelerate decoding through parallel block-wise denoising, yet whether they can be reliably scaled for long chain-of-thought (CoT) reasoning remains unresolved. To this end, we develop DreamReasoner-8B, an open-source block diffusion reasoning model, and conduct a systematic study of how training and inference block sizes affect long-CoT reasoning. Our analysis reveals a stark performance disparity: training with large block sizes yields remarkably poor reasoning, whereas small block sizes preserve effective reasoning. To bridge this granularity gap, we propose block-size curriculum learning, which gradually transitions training from fine-grained to coarse-grained block sizes, thereby overcoming this limitation and enabling strong reasoning performance that generalizes across diverse inference block sizes. On mathematical and code reasoning benchmarks, DreamReasoner-8B achieves results competitive with leading open autoregressive models such as Qwen3-8B. This work establishes a practical foundation for efficient, reasoning-capable diffusion language models. We release our model at https://github.com/DreamLM/DreamReasoner.
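
A block-size curriculum of the kind described above can be pictured as a step-indexed schedule. The Python sketch below is one possible form, assuming equal-length stages and hypothetical block sizes of 4, 8, 16, and 32; the abstract does not state the actual sizes or stage lengths, and `block_size_at` is an illustrative name.

```python
def block_size_at(step, total_steps, sizes=(4, 8, 16, 32)):
    """Return the training block size for a step, moving from fine to coarse."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Split training into len(sizes) equal stages and clamp the final step.
    stage = min(step * len(sizes) // total_steps, len(sizes) - 1)
    return sizes[stage]


print([block_size_at(s, 1000) for s in (0, 250, 500, 999)])  # [4, 8, 16, 32]
```

Starting with small blocks keeps early training close to token-by-token generation, and the later stages expose the model to the coarser blocks used for faster decoding.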

2. Information Extraction, Retrieval, and Question Answering (7 papers)

2606.18381 2026-06-18 cs.CL cs.IR new submission

SproutRAG: Attention-Guided Tree Search with Progressive Embeddings for Long-Document RAG

Amirhossein Abaskohi, Issam H. Laradji, Peter West, Giuseppe Carenini

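Affiliations University of British Columbia; ServiceNow Research

AI summary Proposes SproutRAG, which uses attention guidance to build a tree over sentence-level chunks for multi-granularity retrieval without additional LLM calls, improving information efficiency by 6.1% on average.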

Abstract

Retrieval-augmented generation (RAG) systems must balance retrieval granularity with contextual coherence, a challenge that existing methods address through LLM-guided chunking, single-level context expansion, or hierarchical summarization. These approaches variously depend on costly LLM calls during indexing or retrieval, limit context aggregation to a single granularity level, or introduce information loss through summarization. We present SproutRAG, an attention-guided hierarchical RAG framework that addresses this trade-off by organizing sentence-level chunks into progressively larger but semantically coherent units, using learned inter-sentence attention to construct a binary chunking tree. Unlike prior approaches that rely on external LLMs, fixed context expansion, or lossy summarization, SproutRAG learns which attention heads and layers best capture semantic document structure, enabling multi-granularity retrieval without additional LLM calls or compressed summaries. At retrieval time, SproutRAG uses hierarchical beam search to retrieve candidates at multiple granularities, capturing multi-sentence relevance beyond flat retrieval. The framework is trained end-to-end with a joint objective that improves both embeddings and tree structure. Experiments across four benchmarks spanning scientific, legal, and open-domain settings demonstrate that SproutRAG improves information efficiency (IE) by 6.1% on average over the strongest baseline. Code is available on https://github.com/AmirAbaskohi/SproutRAG.
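
One way to picture the binary chunking tree is greedy bottom-up merging: repeatedly join the two adjacent units with the highest affinity. The Python sketch below does this with boundary scores supplied by the caller. In the paper those scores come from learned inter-sentence attention, whereas here they are hypothetical numbers, and reusing the original boundary scores after each merge is a simplifying assumption rather than the authors' procedure.

```python
def build_chunk_tree(sentences, adjacent_affinity):
    """Merge adjacent units greedily; leaves are strings, internal nodes are (left, right)."""
    if not sentences or len(adjacent_affinity) != len(sentences) - 1:
        raise ValueError("need one affinity score per adjacent sentence pair")
    nodes = list(sentences)
    affinity = list(adjacent_affinity)
    while len(nodes) > 1:
        # Pick the boundary with the strongest affinity and merge across it.
        i = max(range(len(affinity)), key=affinity.__getitem__)
        nodes[i:i + 2] = [(nodes[i], nodes[i + 1])]
        del affinity[i]
    return nodes[0]


tree = build_chunk_tree(["s1", "s2", "s3", "s4"], [0.9, 0.1, 0.8])
print(tree)  # (('s1', 's2'), ('s3', 's4'))
```

Every internal node is a candidate retrieval unit, so a search over the tree can return a single sentence or a larger coherent span.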

2606.18508 2026-06-18 cs.CL cs.IR new submission

MCompassRAG: Topic Metadata as a Semantic Compass for Paragraph-Level Retrieval

Amirhossein Abaskohi, Raymond Li, Gaetano Cimino, Peter West, Giuseppe Carenini, Issam H. Laradji

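Affiliations University of British Columbia; University of Salerno; ServiceNow Research

AI summary Proposes MCompassRAG, a framework that enriches chunk representations with topic metadata and trains a lightweight retriever through LLM distillation for topic-aware retrieval, improving information efficiency by 8.24% on average across six benchmarks with over 5 times lower latency.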

Abstract

Retrieval-augmented generation (RAG) systems depend critically on how documents are chunked and searched. Fine-grained chunks can improve retrieval precision but expand the search space, increasing latency and cost; larger chunks reduce the number of candidates but make dense similarity less reliable, as the representation for each chunk mixes multiple topics and introduces more semantic noise. This trade-off becomes especially limiting in deep research tasks, where retrieval must be both fast and precise across large, heterogeneous corpora. We introduce MCompassRAG, a metadata-guided retrieval framework that uses topic-level signals as a semantic compass for selecting relevant evidence. Instead of relying only on cosine similarity between queries and noisy chunk embeddings, MCompassRAG enriches chunk representations with topic metadata in the same embedding space and trains a lightweight retriever through LLM-teacher distillation. At inference time, MCompassRAG performs topic-aware retrieval without additional LLM calls, improving both efficiency and evidence quality. Across six complex retrieval benchmarks, MCompassRAG improves information efficiency (IE) by 8.24% on average with over 5 times lower latency than the strongest efficient RAG baselines. Code is available on https://github.com/AmirAbaskohi/MCompassRAG.

2606.18620 2026-06-18 cs.CL cs.AI new submission

BCL: Bayesian In-Context Learning Framework for Information Extraction

Haoliang Liu, Chengkun Cai, Xu Zhao, Han Zhu, Shizhou Huang, Xinglin Zhang, Tao Chen, Jenq-Neng Hwang, Zhang Huaping, Lei Li

Affiliations HiThink Research; University College London; University of Edinburgh; The Hong Kong University of Science and Technology; East China Normal University; Shanghai Medical Image Insights; University of Waterloo; University of Washington; Beijing Institute of Technology

AI summary Proposes BCL, a framework that uses Bayesian updates and particle filtering to optimize in-context learning for information extraction, with substantial gains on sequence labeling and relation classification tasks.

Comments ACL 2026 Findings

Abstract

Existing information extraction (IE) tasks increasingly adopt in-context learning (ICL) with large language models. However, current approaches either show inconsistent performance across model scales or lack systematic optimization and generalizability. To address this, we propose BCL (Bayesian In-Context Learning Framework for Information Extraction), the first optimization framework that uses particle filtering with Bayesian updates to systematically refine label representations across IE tasks. Through four steps (initialization, observation, weight update, and resampling), BCL generalizes to both sequence labeling and relation classification paradigms. Extensive experiments demonstrate substantial and consistent improvements over existing approaches.
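
The four steps named in the abstract follow the shape of a standard particle filter, which the Python sketch below walks through once. It is a generic illustration, not BCL's implementation: the candidate label descriptions, the toy likelihood, and the function names are hypothetical, and a real likelihood would presumably come from how well each candidate performs on labeled examples.

```python
import random


def init_weights(particles):
    """Step 1, initialization: start from a uniform prior over candidates."""
    return [1.0 / len(particles)] * len(particles)


def filter_step(particles, weights, likelihood, rng):
    # Step 2, observation: score each candidate label representation.
    observed = [likelihood(p) for p in particles]
    # Step 3, weight update: the posterior is proportional to prior times likelihood.
    posterior = [w * l for w, l in zip(weights, observed)]
    total = sum(posterior)
    if total == 0:
        posterior = init_weights(particles)  # no evidence: fall back to uniform
    else:
        posterior = [p / total for p in posterior]
    # Step 4, resampling: draw a new population in proportion to the posterior.
    resampled = rng.choices(particles, weights=posterior, k=len(particles))
    return resampled, posterior


candidates = ["person", "human name", "individual"]  # hypothetical label descriptions


def toy_likelihood(candidate):
    # Stand-in for a score measured on labeled examples.
    return 1.0 if candidate == "person" else 0.0


new_particles, posterior = filter_step(
    candidates, init_weights(candidates), toy_likelihood, random.Random(0)
)
print(posterior, new_particles)
```

Repeating the step concentrates the population on candidates that keep explaining the observations, while resampling prevents weight from collapsing onto stale particles.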

2606.18781 2026-06-18 cs.CL new submission

Lost in a Single Vector: Improving Long-Document Retrieval with Chunk Evidence Aggregation

Shanshan Lyu, Yiwei Wang, Yujun Cai, Jiafeng Guo, Shenghua Liu

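Affiliations Chongqing University; State Key Laboratory of AI Safety; Institute of Computing Technology, Chinese Academy of Sciences; University of California, Merced; University of Queensland; University of Chinese Academy of Sciences

AI summary To address single-vector encoding weakening the evidence from key spans in long-document retrieval, proposes DICE, a training-free chunk evidence aggregation strategy that encodes chunks independently and aggregates them into a single vector, substantially improving retrieval while keeping the standard interface.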

Comments Code is available at https://github.com/PunchlineAAAA/DICE

详情
AI中文摘要

稠密检索用一个查询向量对一个文档向量进行排序。对于长文档,如果一个简短但起决定作用的片段在排序之前的文档编码阶段就被削弱,这种接口就可能失效。我们将这种失效模式作为文档侧早期压缩来研究,并引入证据稀释指数(EDI),用来衡量文档级表示比同一黄金文档中最强的分块级证据低多少。基于这一视角,我们提出了DICE(Document Inference via Chunk Evidence,基于分块证据的文档推断),一种无需训练的文档侧策略:它将文档切分成块,用冻结模型对各块独立编码,再将其聚合回单个向量,同时保持标准的单查询-单文档接口。在LongEmbed上,DICE在四个骨干网络上都提高了检索性能,在超过4k词元的切片上提升最大:对于Dream,Passkey >4k从30.0提升到90.0,Needle >4k从23.3提升到74.0。在12,779个过滤后的样本中,DICE在92.8%的情况下取得了比单向量基线更低的EDI。这些结果表明,文档级编码是长文档检索中一个实用且尚未被充分探索的杠杆。

英文摘要

Dense retrieval ranks one query vector against one document vector. On long documents, this interface can fail when a short but decisive span is weakened during document encoding before ranking. We study this failure mode as document-side early compression and introduce the Evidence Dilution Index (EDI) to measure how far a document-level representation falls below the strongest chunk-level evidence within the same gold document. Guided by this view, we propose DICE (Document Inference via Chunk Evidence), a training-free document-side strategy that splits documents into chunks, encodes them independently with a frozen model, and aggregates them back into a single vector while preserving the standard one-query-one-document interface. On LongEmbed, DICE improves retrieval across four backbones, with the largest gains on slices beyond 4k tokens: for Dream, Passkey >4k rises from 30.0 to 90.0 and Needle >4k from 23.3 to 74.0. Across 12,779 filtered samples, DICE yields lower EDI than the single-vector baseline in 92.8% of cases. These results establish document-level encoding as a practical and underexplored lever for long-document retrieval.
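The split, encode, and aggregate pipeline can be sketched in a few lines. The abstract does not say how DICE aggregates chunk vectors or how EDI is computed, so both choices here are assumptions made for illustration: aggregation is the normalized mean of unit-length chunk vectors, and the dilution score is the best chunk cosine minus the document cosine. All function names are hypothetical, and `encode` stands in for whatever frozen embedding model is used.

```python
import math


def split_into_chunks(tokens, chunk_size):
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    return [list(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def aggregate_chunks(chunk_vectors):
    """Assumed aggregation: mean of unit-normalized chunk vectors, re-normalized."""
    unit = [l2_normalize(v) for v in chunk_vectors]
    dim = len(unit[0])
    mean = [sum(v[d] for v in unit) / len(unit) for d in range(dim)]
    return l2_normalize(mean)


def encode_document(tokens, encode, chunk_size):
    """Chunk the document, encode each chunk independently, return ONE vector."""
    chunks = split_into_chunks(tokens, chunk_size)
    return aggregate_chunks([encode(chunk) for chunk in chunks])


def evidence_dilution(query_vec, doc_vec, chunk_vectors):
    """Assumed EDI-style score: strongest chunk evidence minus document evidence."""
    best_chunk = max(cosine(query_vec, c) for c in chunk_vectors)
    return best_chunk - cosine(query_vec, doc_vec)


# Toy example with hand-made 2-d "embeddings": one chunk matches the query exactly.
query = [1.0, 0.0]
chunk_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_vec = aggregate_chunks(chunk_vecs)
dilution = evidence_dilution(query, doc_vec, chunk_vecs)
```

Because the output is still a single vector per document, the index and the ranking step stay unchanged; only the document-side encoding differs.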

2606.18893 2026-06-18 cs.CL 新提交

Learning Robust Pair Confidence for Multimodal Emotion-Cause Pair Extraction

学习鲁棒的成对置信度用于多模态情感-原因对提取

Zhuangzhuang Pan, Ning Dong, Yingna Su, Yan Xia

发表机构 * Institute for Advanced Studies(先进研究院) Universiti Malaya(马来亚大学) School of Information Engineering(信息工程学院) Suqian University(宿迁学院) Digitization Department(数字化部门)

AI总结 提出RPCL框架,通过置信度差异边界约束和对抗性扰动,增强多模态情感-原因对提取中成对置信度的判别性和稳定性,在三个数据集上提升Pair F1约2.6-2.8个百分点。

Comments 11 pages, 3 figures, 5 tables

详情
AI中文摘要

多模态情感-原因对提取(MECPE)需要候选对上的可靠成对置信度。现有的成对评分器通常对有效候选使用成对级别的交叉熵,这大多独立地处理链接。这使得竞争原因之间的相对置信度几何结构约束不足,允许黄金对接近硬负例或依赖偶然的非黄金上下文。我们将这种脆弱性研究为成对置信度脆弱性,并提出RPCL(鲁棒成对置信度学习),一种仅用于训练的成对置信度学习框架。RPCL鼓励成对置信度既具有判别性又具有稳定性:通过置信度差异边界约束将黄金对与行方向硬负例分离,并将干净成对预测与来自损坏视图的预测对齐,其中非黄金上下文话语表示被部分损坏。在推理时,原始的干净成对评分器和解码流水线保持不变。在ECF、MECAD和MEC4上,RPCL在全文本-音频-视频设置下将三种子平均Pair F1相对于匹配基线模型提高了2.58到2.83个百分点,并在所有三个数据集上提高了平均Pair AUPRC。诊断分析进一步显示更大的黄金-负例置信度差距和更低的边界违反严重性。这些结果表明,显式塑造成对置信度是MECPE的一种有效训练策略。

英文摘要

Multimodal emotion-cause pair extraction (MECPE) requires reliable pair confidence over candidate pairs. Existing pair scorers commonly use pair-level cross entropy over valid candidates, which treats links mostly independently. This leaves the relative confidence geometry among competing causes under-constrained, allowing gold pairs to stay close to hard negatives or rely on incidental non-gold context. We study this vulnerability as pair-confidence brittleness and propose RPCL (Robust Pair Confidence Learning), a training-only framework for pair-confidence learning. RPCL encourages pair confidence to be both discriminative and stable: gold pairs are separated from row-wise hard negatives through a confidence-difference margin constraint, and clean pair predictions are aligned with predictions from a corrupted view where non-gold contextual utterance representations are partially corrupted. The original clean pair scorer and decoding pipeline are used unchanged at inference time. On ECF, MECAD, and MEC4, RPCL improves the three-seed mean Pair F1 over a matched base model by 2.58 to 2.83 percentage points in the full text-audio-video setting, and improves mean Pair AUPRC on all three datasets. Diagnostic analysis further shows larger gold-negative confidence gaps and lower margin-violation severity. These results suggest that explicitly shaping pair confidence is an effective training strategy for MECPE.
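A confidence-difference margin constraint of the kind described above is commonly written as a hinge penalty. The sketch below is one such reading, not the RPCL loss itself: the exact form, the margin value, and the reduction over rows are assumptions, and the function names are hypothetical. Each row holds the pair confidences of all candidate causes for one emotion utterance.

```python
def row_margin_penalty(confidences, gold_index, margin=0.2):
    """Hinge penalty: zero once the gold pair beats the hardest negative by `margin`."""
    gold = confidences[gold_index]
    negatives = [c for j, c in enumerate(confidences) if j != gold_index]
    if not negatives:
        return 0.0
    hardest = max(negatives)  # row-wise hard negative
    return max(0.0, margin - (gold - hardest))


def mean_margin_penalty(rows, gold_indices, margin=0.2):
    """Average the row-wise penalty over all emotion rows (assumed reduction)."""
    penalties = [row_margin_penalty(r, g, margin) for r, g in zip(rows, gold_indices)]
    return sum(penalties) / len(penalties)


# A well-separated row incurs no penalty; a row whose gold pair sits
# close to a hard negative is penalized.
separated = row_margin_penalty([0.9, 0.3, 0.1], gold_index=0)
crowded = row_margin_penalty([0.5, 0.45, 0.1], gold_index=0)
```

Unlike per-pair cross entropy, this term depends on the gap between competing candidates in the same row, which is the relative geometry the abstract says is otherwise under-constrained.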

2606.18986 2026-06-18 cs.CL cs.AI 新提交

Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering

超越分词:面向时间序列问答的直接时间步嵌入与对比对齐

Yafeng Wu, Huu Hiep Nguyen, Thin Nguyen, Hung Le

发表机构 * Deakin University(迪肯大学)

AI总结 提出CADE框架,通过逐点线性编码器直接嵌入每个时间步,避免分词瓶颈,并利用单向监督对比损失对齐时间序列与文本锚点,在Time-MQA基准上提升六项TSQA任务性能。

详情
AI中文摘要

大型语言模型的最新进展催生了时间序列问答(TSQA),它将时间序列分析表述为自然语言问答。然而,直接将原始数值序列输入LLM会遇到分词瓶颈:字节对编码将连续值分割成不稳定的词元,其嵌入缺乏有意义的度量结构,导致幅度、尺度和趋势信息的丢失。先前的方法使用基于分块的编码器将序列分割成固定窗口,锁定单一粒度,这会破坏模式并隐藏确切的时间步,且通过一个在不同长度或采样率的数据集上很少迁移的独立模块实现。为了解决这一挑战,我们提出了CADE(对比对齐与直接嵌入),一个基于两个关键组件构建的TSQA新框架:直接时间步嵌入和语义对齐。该框架通过逐点线性编码器和MLP投影器将每个时间步直接映射到LLM嵌入空间,保留了精确的索引级访问,同时消除了分块和填充的需要。为了进一步弥合时间序列与语言表示之间的语义差距,我们引入了一种新颖的单向监督对比损失,将时间序列嵌入与冻结的类名文本锚点对齐。在公开的Time-MQA基准上的实验结果表明,我们的框架在六项TSQA任务上持续提升了性能,优于开源和专有的LLM基线。

英文摘要

Recent advances in large language models (LLMs) have given rise to time-series question answering (TSQA), which formulates time-series analysis as natural-language question answering. However, directly feeding raw numerical series into LLMs suffers from a tokenization bottleneck: Byte Pair Encoding fragments continuous values into unstable tokens whose embeddings lack meaningful metric structure, resulting in the loss of magnitude, scale, and trend information. Prior methods use patch-based encoders that split the series into fixed windows, locking in one granularity that breaks patterns and hides exact timesteps, through a separate module that rarely transfers across datasets with different lengths or sampling rates. To address this challenge, we propose CADE (Contrastive Alignment with Direct Embedding), a novel framework for TSQA built upon two key components: direct timestep embedding and semantic alignment. The proposed framework maps each timestep directly into the LLM embedding space through a point-wise linear encoder and MLP projector, preserving exact index-level access while eliminating the need for patching and padding. To further bridge the semantic gap between time-series and language representations, we introduce a novel one-directional supervised contrastive loss that aligns time-series embeddings with frozen class-name text anchors. Experimental results on the public Time-MQA benchmark demonstrate that our framework consistently improves performance across six TSQA tasks, outperforming both open-source and proprietary LLM baselines.

2606.19183 2026-06-18 cs.CL cs.AI 新提交

Language Models as Interfaces, Not Oracles: A Hybrid LLM-ML System for Pediatric Appendicitis

语言模型作为接口而非预言机:用于小儿阑尾炎的混合LLM-ML系统

Soheyl Bateni, Maryam Abdolali

发表机构 * K. N. Toosi University of Technology(K. N. 图西理工大学)

AI总结 提出ClaMPAPP混合系统,利用LLM从自由文本中提取结构化特征,再由XGBoost分类器进行诊断,在两个独立队列中优于端到端LLM,提高了诊断稳定性和可审计性。

详情
AI中文摘要

大型语言模型(LLM)通过解释自由文本记录可使临床决策支持更易获取,但直接作为诊断引擎使用时,受提示敏感性、信息顺序以及看似合理但错误的输出限制。结构化机器学习模型提供更稳定的风险预测,但需要难以与叙事性临床工作流集成的表格输入。我们提出ClaMPAPP(临床语言辅助机器学习阑尾炎诊断流程),这是一个混合系统,将LLM用作接口而非最终决策者。ClaMPAPP从类似笔记的叙述中提取模式约束的临床特征,应用确定性合理性检查,并将验证后的特征传递给基于临床、实验室和超声变量训练的XGBoost分类器。我们在来自德国医院的两个独立小儿阑尾炎队列上评估了ClaMPAPP,并将其与端到端LLM基线(包括开源和专有模型)进行比较。为在测试自由文本输入时保留真实标签,通过模板渲染和约束LLM重写从结构化电子健康记录生成叙述,并附加句子顺序排列以评估位置鲁棒性。ClaMPAPP在内部和外部验证中均达到最强的整体诊断性能,同时最小化漏诊阑尾炎病例(急性分诊中的关键安全问题)。端到端LLM表现出不稳定的灵敏度-特异性权衡,且在叙述重排下性能下降更严重。这些结果支持LLM作为接口、ML作为预测器的设计,将自然语言可用性与预测推理分离,并为临床决策支持提供更可审计的路径。

英文摘要

Large language models (LLMs) can make clinical decision support more accessible by interpreting free-text documentation, but their direct use as diagnostic engines is limited by sensitivity to prompts, information order, and plausible but incorrect outputs. Structured machine-learning models offer more stable risk prediction, yet they require tabular inputs that are difficult to integrate with narrative clinical workflows. We present ClaMPAPP (Clinical Language-assisted Machine-learning Pipeline for Appendicitis), a hybrid system that uses an LLM as an interface rather than as the final decision-maker. ClaMPAPP extracts schema-constrained clinical features from note-like narratives, applies deterministic plausibility checks, and passes validated features to an XGBoost classifier trained on clinical, laboratory, and ultrasound variables. We evaluated ClaMPAPP on two independent pediatric appendicitis cohorts from German hospitals and compared it with end-to-end LLM baselines, including open-source and proprietary models. To preserve ground truth while testing free-text input, narratives were generated from structured electronic health records through template rendering and constrained LLM rewriting, with additional sentence-order permutation to assess positional robustness. ClaMPAPP achieved the strongest overall diagnostic performance in both internal and external validation while minimizing missed appendicitis cases, the key safety concern in acute triage. End-to-end LLMs showed unstable sensitivity-specificity trade-offs and greater degradation under narrative reordering. These results support an LLM-as-interface, ML-as-predictor design that separates natural-language usability from predictive inference and provides a more auditable pathway for clinical decision support.

3. 对话系统与智能体 6 篇

2606.18406 2026-06-18 cs.CL 新提交

CoreMem: Riemannian Retrieval and Fisher-Guided Distillation for Long-Term Memory in Dialogue Agents

CoreMem: 对话代理中长期记忆的黎曼检索与Fisher引导蒸馏

Jiaqi Chen, Yongqin Zeng, Shaoshen Chen, Yijian Zhang, Hai-Tao Zheng, Chunxia Ma, XiuTeng Zhou

发表机构 * Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院) Peng Cheng Laboratory(鹏城实验室) Shandong Analysis and Test Center, Qilu University of Technology(齐鲁工业大学山东省分析测试中心) State Key Laboratory for Quality Ensurance and Sustainable Use of Dao-di Herbs(道地药材品质保障与可持续利用国家重点实验室)

AI总结 提出CoreMem架构,用黎曼检索替代余弦相似度解决高维检索枢纽问题,通过Fisher引导离散令牌蒸馏实现原则性压缩,在8GB显存边缘设备上实现长期记忆对话代理。

Comments 15 pages, 5 figures

详情
AI中文摘要

个性化对话代理需要持续的长期记忆以在多次会话中维持连贯交互。然而,在消费级硬件(例如8 GB VRAM边缘设备)上部署这些能力会引入严重的内存和计算瓶颈。现有系统通常依赖各向同性余弦相似度进行检索,以及启发式规则进行上下文压缩。这些方法缺乏统一的理论基础,经常在高维检索中遭受枢纽问题,并在压缩过程中出现句法碎片化。为克服这些限制,我们提出CoreMem,一种资源高效的边缘-云记忆架构,从根本上由信息几何统一。首先,黎曼检索用局部自适应Fisher-Rao度量替代余弦匹配,通过马氏距离有效惩罚枢纽记忆,并采用O(Ndr) Woodbury加速实现实时搜索。其次,Fisher引导离散令牌蒸馏(FDTD)引入分层句子到令牌压缩机制。它从Fisher信息迹中推导敏感度分数,提供原则性的压缩-KL权衡,并辅以显式结构句法保护。在LOCOMO和LongMemEval-S基准上评估,CoreMem实现了显著的准确率提升,在开放域(+4.51个百分点)和时间(+4.17个百分点)推理上取得实质性增益。广泛性能分析证实,CoreMem在严格的8 GB VRAM预算内无缝运行,成功弥合了资源受限边缘设备与对理论基础的终身记忆代理需求之间的差距。

英文摘要

Personalized dialogue agents require continuous long-term memory to maintain coherent interactions across multiple sessions. However, deploying these capabilities on consumer-grade hardware (e.g., 8 GB VRAM edge devices) introduces severe memory and compute bottlenecks. Existing systems typically rely on isotropic cosine similarity for retrieval and heuristic rules for context compression. These approaches lack a unified theoretical foundation, frequently suffering from the hubness problem in high-dimensional retrieval and syntactic fragmentation during compression. To overcome these limitations, we propose CoreMem, a resource-efficient edge-cloud memory architecture fundamentally unified by information geometry. First, Riemannian retrieval replaces cosine matching with a locally adaptive Fisher-Rao metric, effectively penalizing hub memories via Mahalanobis distance with O(Ndr) Woodbury acceleration for real-time search. Second, Fisher-guided discrete token distillation (FDTD) introduces a hierarchical sentence-to-token compression mechanism. It derives sensitivity scores from Fisher information traces, providing a principled compression-KL tradeoff augmented with explicit structural syntax protection. Evaluated on the LOCOMO and LongMemEval-S benchmarks, CoreMem achieves strong accuracy improvements, yielding substantial gains in Open-domain (+4.51 pp) and Temporal (+4.17 pp) reasoning. Extensive profiling confirms that CoreMem operates seamlessly within a strict 8 GB VRAM budget, successfully bridging the gap between resource-constrained edge devices and the demand for theoretically grounded, lifelong memory agents.

2606.18448 2026-06-18 cs.CL 新提交

VISUALSKILL: Multimodal Skills for Computer-Use Agents

VISUALSKILL:面向计算机使用智能体的多模态技能

Ziyan Jiang, Li An, Yujian Liu, Jiabao Ji, Qiucheng Wu, Jacob Andreas, Yang Zhang, Shiyu Chang

发表机构 * UC Santa Barbara(加州大学圣塔芭芭拉分校) MIT CSAIL(麻省理工学院计算机科学与人工智能实验室) MIT-IBM Watson AI Lab(麻省理工学院-IBM沃森人工智能实验室)

AI总结 提出VISUALSKILL分层多模态技能库,通过结合文档与UI探索构建,使智能体在CUA基准上平均得分提升15.3点,且多模态优于纯文本技能。

详情
AI中文摘要

计算机使用智能体(CUA)在标准化基准上接近人类水平,但在长周期任务和未见软件上仍存在困难。现有技能库通过可复用技能解决此问题,但仅以文本形式表示技能工件,忽略了GUI交互的视觉特性。我们提出VISUALSKILL:一种分层多模态技能,针对每个目标应用定制,并组织为一个覆盖各主题文件的中央索引,智能体通过load_topic MCP工具按需获取相关主题的文本和图形。我们通过结合编写文档与实时应用UI探索的两阶段流水线构建每个技能。在两个CUA基准CUA-World和OSExpert-Eval上,由Claude Opus 4.6支持的Claude Code CLI智能体使用VISUALSKILL达到平均得分0.456,比无技能基线(0.303)绝对提升15.3点。与从相同源内容生成且仅在模态上与VISUALSKILL不同的匹配纯文本技能相比,VISUALSKILL进一步绝对提升8.3点(0.373 vs. 0.456),直接证明在技能工件中保留视觉图形而非将其语言化,有助于智能体识别UI元素并在每次操作后验证工作流状态。我们的代码见 https://github.com/XMHZZ2018/VisualSkills。

英文摘要

Computer-use agents (CUAs) approach human-level performance on standardised benchmarks but still struggle on long-horizon tasks and unseen software. Existing skill libraries address this with reusable skills, but represent the skill artifact as text only, despite the visual nature of GUI interaction. We propose VISUALSKILL: a hierarchical multimodal skill, tailored to each target application and organised as a central index over per-topic files, which the agent consumes through a load_topic MCP tool that fetches the relevant topic's text and figures on demand. We construct each skill with a two-stage pipeline that combines authored documentation with live-application UI exploration. On two CUA benchmarks, CUA-World and OSExpert-Eval, a Claude Code CLI agent backed by Claude Opus 4.6 reaches an average score of 0.456 with VISUALSKILL, a +15.3 point absolute lift over the no-skill baseline (0.303). Against a matched text-only skill that is generated from the same source content and differs from VISUALSKILL only in modality, VISUALSKILL yields a further +8.3 point absolute gain (0.373 vs. 0.456), providing direct evidence that retaining visual figures in the skill artifact, rather than verbalizing them away, helps the agent both identify UI elements and verify workflow state after each action. Our code is available at https://github.com/XMHZZ2018/VisualSkills.

2606.18728 2026-06-18 cs.CL 新提交

LegalWorld: A Life-Cycle Interactive Environment for Legal Agents

LegalWorld: 法律智能体的生命周期交互环境

Songhan Zuo, Shengbin Yue, Tao Chiang, Guanying Li, Yun Song, Xuanjing Huang, Zhongyu Wei

发表机构 * Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院) Northwest University of Political Science and Law(西北政法大学)

AI总结 提出LegalWorld,一个将中国民事诉讼建模为五阶段因果链的生命周期交互环境,基于75309对判决书构建,并评估多智能体在连续诉讼中的能力差异。

详情
AI中文摘要

民事诉讼本质上是一个生命周期过程:律师第一天起草的内容会约束数月后庭审的走向。然而,现有的法律基准评估的是孤立的子任务,而先前的法律智能体模拟器每次从共享的真实情况重新初始化场景,忽略了跨阶段的因果依赖关系。我们提出LegalWorld,一个生命周期交互环境,将中国民事诉讼建模为五个阶段(七个子场景)的因果连接状态链,基于75,309对中国民事判决书构建。我们为其配备了可重用的基础设施(本地记忆、全局案件记忆、技能/工具库),确保每个争议在其整个生命周期中保持一致。在此环境基础上,我们构建了LongJud-Bench,用于评估智能体在所有五个连接阶段的能力。来自217名法律背景评估者的18,992个评分证实,LegalWorld的轨迹在程序上忠实且角色一致;跨模型的能力级评估揭示了聚合分数无法暴露的显著分歧,没有单一骨干模型在咨询、起草和庭审辩护中均领先。详细资源将公开发布。

英文摘要

Civil litigation is inherently a life-cycle process: what a lawyer drafts on day one constrains what unfolds at trial months later. Yet existing legal benchmarks evaluate isolated subtasks, and prior legal-agent simulators reinitialize each scenario from shared ground truth, leaving cross-stage causal dependencies unmodeled. We present LegalWorld, a life-cycle interactive environment that models Chinese civil litigation as a causally connected state chain of five stages (seven sub-scenarios), grounded in 75,309 paired Chinese civil judgments. We pair it with reusable infrastructure (local memory, global case memory, a Skill/Tool library) that keeps each dispute consistent across its full life cycle. Building on this environment, we construct LongJud-Bench to evaluate agent capability across all five connected stages. 18,992 ratings from 217 legal-background evaluators confirm that LegalWorld trajectories are procedurally faithful and role-consistent; and a capability-level cross-model evaluation reveals sharp divergences that aggregate scores cannot expose, with no single backbone leading across consultation, drafting, and courtroom advocacy. Detailed resources will be released publicly.

2606.19111 2026-06-18 cs.CL cs.AI cs.MA 新提交

Leadership as Coordination Control: Behavioral Signatures and the Recovery-Advantage Boundary in Multi-Agent LLM Teams

领导力作为协调控制:多智能体LLM团队中的行为特征与恢复优势边界

Haewoon Kwak

发表机构 * Indiana University Bloomington(印第安纳大学布卢明顿分校)

AI总结 研究多智能体LLM团队中过程级协调控制何时增加价值,通过行为特征和消融实验发现,控制器的优势仅在初始多数投票不可靠、任务可恢复且无指导交互无法修复时出现,验证了权变理论。

Comments 33 pages

详情
AI中文摘要

团队科学认为领导力是权变的:它仅在特定条件下有帮助,而能力强的自主团队可能根本不需要领导。我们对多智能体LLM团队提出类似问题:在什么可测量的条件下,过程级协调控制会增加价值,这些条件是否与团队科学的预测一致?我们使用行为特征(多数锁定、探索、从错误的第0轮共识中恢复)和每动作消融实验,因为每个控制器是一个显式动作集,而不是一个整体提示。我们将三种经典领导风格(交易型、变革型、情境型)操作化为对共享动作词汇(探索、修订、接受、综合)的控制器。一个具有相同动作但使用任意规则的匹配控制器恢复效果不优于多数投票,因此是理论推导的规则(而非词汇)起作用。在四个任务体系和三个开放权重模型系列中,没有控制器在准确率上占主导地位,正如权变观点所预测的:交易型控制在所有12个(模型、体系)组合上与共享的第0轮投票匹配,差异在1.3个百分点以内,仅在初始多数不可靠的一个组合上出现增益(llama-4-scout社会性;情境型比扁平型高8个百分点)。通过四个边界探针测试的恢复优势解释表明,控制器仅在初始多数投票不可靠、任务可恢复且无指导交互无法修复时优于纯交互。这些区域映射到权变理论(领导替代、路径-目标冗余、情境准备差距),因此基本为零的准确率结果正是理论所预测的,而非控制器的失败。我们将过程级协调控制视为一种需要测量和理论映射的权变因素,而不是需要超越的排行榜。

英文摘要

Team science holds that leadership is contingent: it helps only under specific conditions, and capable, autonomous teams may need none at all. We ask the analogous question for multi-agent LLM teams: under what measurable conditions does process-level coordination control add value, and do those conditions match what team science predicts? We use behavioral signatures (majority lock-in, exploration, recovery from an incorrect round-0 consensus) and per-action ablations, clean because each controller is an explicit action set, not a monolithic prompt. We operationalize three classical leadership styles (transactional, transformational, situational) as controllers over a shared action vocabulary (explore, revise, accept, synthesize). A matched controller with the same actions but an arbitrary rule recovers no better than majority voting, so the theory-derived rule, not the vocabulary, does the work. Across four task regimes and three open-weight model families, no controller dominates by accuracy, as the contingency view predicts: transactional control matches a shared round-0 vote on all 12 (model, regime) combinations to within 1.3pp, and gains appear only on the one combination where the round-0 majority is unreliable (llama-4-scout social; situational +8pp over flat). A recovery-advantage account, tested with four boundary probes, says a controller beats plain interaction only where the round-0 majority is unreliable, the task is recoverable, and undirected interaction does not already repair it. These regions map onto contingency theory (leadership substitutes, path-goal redundancy, the situational readiness gap), so a largely null accuracy result is what the theory predicts, not a failure of the controllers. We read process-level coordination control as a contingency to be measured and theory-mapped, not a leaderboard to be topped.

2606.19308 2026-06-18 cs.CL cs.MA 新提交

Enhancing Decision-Making with Large Language Models through Multi-Agent Fictitious Play

通过多智能体虚拟博弈增强大语言模型的决策能力

Leyang Shen, Yang Zhang, Xiaoyan Zhao, Chun Kai Ling, Tat-Seng Chua

发表机构 * National University of Singapore(新加坡国立大学)

AI总结 针对多智能体系统中决策任务因立场纠缠而难以分解的问题,提出基于虚拟博弈的多智能体虚拟博弈(MAFP)范式,通过迭代最佳响应实现均衡求解,提升决策质量和鲁棒性。

Comments 18 pages, 8 figures

详情
AI中文摘要

基于大语言模型(LLM)的多智能体系统(MAS)通过将子任务分配给协作智能体,在解决具有执行复杂性的任务方面展现出巨大潜力。然而,这种分而治之的范式在现实世界中同样普遍的决策任务上表现不足。这些任务要求所有相关利益方同时推理,其决策相互依赖,因此无法孤立解决。我们将这一挑战定性为立场纠缠,这是一种区别于执行复杂性的决策复杂性。为了解决这一问题,我们提出了多智能体虚拟博弈(MAFP),一种新颖的MAS范式,将利益方立场表示为智能体,并将决策制定形式化为一个均衡寻求过程。基于博弈论中的虚拟博弈原理,MAFP通过每个智能体对其他智能体过去决策的经验混合做出最佳响应,迭代更新其决策。这使得智能体能够暴露并解决彼此的弱点,逐步提高决策质量和鲁棒性。我们在具有挑战性的决策任务上评估MAFP,这些任务测试在行动前为竞争场景制定策略的能力。MAFP在两个互补指标——锦标赛强度和鲁棒性上,均优于单轮和多轮基线,证明了其在解决立场纠缠方面的有效性。

英文摘要

Large language model (LLM)-based multi-agent systems (MAS) have demonstrated great potential in solving tasks with execution complexity, by distributing subtasks across cooperative agents. However, this divide-and-conquer paradigm falls short on decision-making tasks that are also prevalent in the real world. These tasks require simultaneous reasoning from the stances of all involved stakeholders whose decisions are mutually dependent and thus cannot be solved in isolation. We characterize this challenge as stance entanglement, a form of decision complexity distinct from execution complexity. To address it, we propose Multi-Agent Fictitious Play (MAFP), a novel MAS paradigm that represents stakeholder stances as agents and formulates decision-making as an equilibrium-seeking process. Built on the game-theoretic principle of fictitious play, MAFP iteratively updates each agent's decision by best responding to the empirical mixture of other agents' past decisions. This enables agents to expose and address one another's weaknesses, progressively improving decision quality and robustness. We evaluate MAFP on challenging decision-making tasks that test the capability of deciding strategies for competitive scenarios prior to acting. MAFP outperforms both single-round and multi-round baselines on two complementary metrics, tournament strength and robustness, demonstrating its effectiveness in addressing stance entanglement.
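The update rule MAFP builds on, best responding to the empirical mixture of the other side's past decisions, is classical fictitious play. The sketch below shows that classical rule on a two-player zero-sum matrix game (rock-paper-scissors); it is not the MAFP system, where the best response would be produced by an LLM agent rather than by an argmax over a payoff table. Function names are hypothetical.

```python
def best_response(payoff_rows, opponent_counts):
    """Action with the highest expected payoff against the opponent's empirical mixture."""
    total = sum(opponent_counts)
    expected = [
        sum(row[j] * opponent_counts[j] for j in range(len(opponent_counts))) / total
        for row in payoff_rows
    ]
    # Ties are broken in favour of the lowest action index.
    return max(range(len(expected)), key=expected.__getitem__)


def fictitious_play(payoff, rounds=5000):
    """Run simultaneous fictitious play; return both players' empirical frequencies."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    # Zero-sum game: the column player's payoff table is the negated transpose.
    col_payoff = [[-payoff[i][j] for i in range(n_rows)] for j in range(n_cols)]
    row_counts = [0] * n_rows
    col_counts = [0] * n_cols
    row_counts[0] = 1  # arbitrary initial decisions
    col_counts[0] = 1
    for _ in range(rounds):
        r = best_response(payoff, col_counts)      # row player reacts to column history
        c = best_response(col_payoff, row_counts)  # column player reacts to row history
        row_counts[r] += 1
        col_counts[c] += 1
    row_total, col_total = sum(row_counts), sum(col_counts)
    return ([x / row_total for x in row_counts], [x / col_total for x in col_counts])


# Rock-paper-scissors payoffs for the row player (rows and columns: rock, paper, scissors).
RPS = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
row_freq, col_freq = fictitious_play(RPS)
```

In this game the individual decisions keep cycling, but the empirical frequencies drift toward the uniform mixed equilibrium; it is the history-averaged mixture, not the latest move, that stabilizes.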

2606.19336 2026-06-18 cs.CL 新提交

Learning User Simulators with Turing Rewards

利用图灵奖励学习用户模拟器

Yingshan Susan Wang, Cedegao E. Zhang, Linlu Qiu, Zexue He, Pengyuan Li, Alex Pentland, Roger P. Levy, Yoon Kim

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Stanford University(斯坦福大学) MIT-IBM Watson AI Lab(MIT-IBM沃森人工智能实验室)

AI总结 提出Turing-RL方法,利用基于图灵测试的强化学习训练用户模拟器,通过判别性图灵奖励使生成响应与真实用户不可区分,在对话和论坛讨论中优于基线方法。

详情
AI中文摘要

在交互式环境中学习模拟人类用户可以推动代理助手的训练、个性化系统的评估、社会科学研究等。现有方法通常通过训练大型语言模型(LLM)来匹配单一真实响应,要么通过最大化对数概率,要么使用相似性奖励。我们提出Turing-RL:一种基于图灵测试的强化学习方法,用于训练用户模拟器模型。Turing-RL使用带有LLM评判器的判别性图灵奖励,根据用户历史记录对生成的响应与真实用户的不可区分程度进行评分,用户模拟器LLM学习在这种奖励下产生与用户可能说的内容不可区分的响应。在两个不同领域(对话聊天和Reddit论坛讨论)中,我们发现Turing-RL在LLM和人工评估指标上均持续优于基线方法。我们的研究表明,优化不可区分性而非响应匹配对于学习用户模拟器是有效的。

英文摘要

Learning to simulate human users in interactive settings could advance the training of agent assistants, evaluation of personalization systems, research in the social sciences, and more. Existing approaches generally do so by training a large language model (LLM) to match a single ground truth response, either by maximizing the log probability or by using a similarity reward. We instead propose Turing-RL: a Turing-Test-based reinforcement learning approach for training user simulator models. Turing-RL uses a discriminative Turing reward with an LLM judge to score how indistinguishable a generated response is from the real user's given the user's history, and the user simulator LLM learns to produce responses indistinguishable from what the user could have said with such rewards. Across two different domains (conversational chat and Reddit forum discussion), we find that Turing-RL consistently outperforms baseline methods on both LLM and human evaluation metrics. Our study suggests that optimizing for indistinguishability, rather than response matching, is effective for learning user simulators.

4. 文本生成、摘要与编辑 2 篇

2606.18850 2026-06-18 cs.CL cs.IR 新提交

ScholarSum: Student-Teacher Abstractive Summarization via Knowledge Graph Reasoning and Reflective Refinement

ScholarSum:基于知识图谱推理与反思性精炼的师生式抽象摘要生成

Bohou Zhang, Xiaoyu Tao, Mingyue Cheng, Huijie Liu, Qi Liu

发表机构 * State Key Laboratory of Cognitive Intelligence(认知智能国家重点实验室)

AI总结 提出ScholarSum框架,通过构建层次知识图谱引导学生生成初稿,并利用教师式审阅者迭代检查与修正,实现科学文献摘要的流畅性与事实一致性。

详情
AI中文摘要

抽象摘要生成在实现科学文献高效理解中起着关键作用,但它本质上要求同时具备语言流畅性和事实忠实性。现有方法往往难以协调这两个要求。抽取式方法依赖僵硬的句子拼接,破坏了宏观层面的逻辑连贯性;而基于大语言模型的生成式方法尽管掌握了语言流畅性,但事实一致性有限。在这项工作中,我们提出了ScholarSum,一个层次化反思性图框架,模拟师生写作过程以实现流畅且忠实的科学摘要生成。ScholarSum首先通过将文档分割成语义连贯的单元,组织成层次知识图谱,其多层社区结构捕获全局逻辑和宏观主题。在该全局结构引导下,学生生成初稿,随后通过细粒度证据检索进行精炼。为确保事实一致性,教师式审阅者迭代检查初稿,识别不支持的内容,并触发有针对性的重新检索和重写,直到摘要达到严格的质量标准。大量实验表明,ScholarSum在完整性和忠实性方面显著优于之前的基线方法。我们的代码可在 https://github.com/Xiaoyu-Tao/ScholarSum 获取。

英文摘要

Abstractive summarization plays a crucial role in enabling efficient understanding of scientific literature, yet it inherently demands both linguistic fluency and factual faithfulness. Existing approaches often fail to reconcile these two requirements. Extractive methods rely on rigid sentence splicing that disrupts macro-level logical coherence, while large language model (LLM)-based generative approaches, despite mastering linguistic fluency, exhibit limited factual consistency. In this work, we propose ScholarSum, a hierarchical reflective graph-based framework that emulates a student-teacher writing process for fluent and faithful scientific summarization. ScholarSum first organizes the document into a hierarchical knowledge graph by segmenting it into semantically coherent units, whose multi-layered community structure captures global logic and macro-level themes. Guided by this global structure, the student generates an initial draft, which is subsequently refined through fine-grained evidence retrieval. To ensure factual consistency, a teacher-like reviewer then iteratively examines the draft, identifies unsupported content, and prompts targeted re-retrieval and rewriting until the summary meets rigorous quality standards. Extensive experiments demonstrate that ScholarSum significantly outperforms previous baselines in terms of both completeness and faithfulness. Our code is available at https://github.com/Xiaoyu-Tao/ScholarSum.

2606.18889 2026-06-18 cs.CL 新提交

Improving Medical Communication using Rubric-Guided Counterfactual Recommendations

利用评分标准引导的反事实推荐改善医疗沟通

Adrian Cosma, Nicoleta-Nina Basoc, Andrei Niculae, Cosmin Dumitrache, Emilian Radoi

发表机构 * IDSIA, Dalle Molle Institute for Artificial Intelligence(IDSIA,达勒莫利人工智能研究所) National University of Science and Technology POLITEHNICA Bucharest(科学与技术国家大学POLITEHNICA布加勒斯特)

AI总结 提出一种语言模型引导的反事实推荐流程,通过调整语气、个性化等可解释沟通特征,在不影响医学内容的前提下提升患者积极反馈概率,平均提升6.41%。

Comments 4 Tables, 8 Figures

详情
AI中文摘要

基于文本的远程医疗越来越依赖轻量级的患者反馈,然而,此类反馈主要反映感知的沟通质量而非医学准确性。我们引入了一种语言模型引导的反事实推荐流程,该流程发现并优化可解释的沟通特征,如语气、个性化、可操作性和完整性,以解决患者关切,同时不干扰医学内容。这些特征与患者-医生互动元数据一起用于估计积极反馈。在推理时,系统搜索低成本的序数特征变化,并推荐最小的沟通变化,这些变化预计会增加积极反馈的概率,而独立的审计模型测试这些增益是否超出选择模型的泛化能力。在互动中,推荐在独立审计下平均带来+6.41%的预测积极反馈概率增益,且93.31%的推荐为非负。这些结果表明,小的、可解释的沟通变化可以捕获大部分预测增益,同时保留医生对医学推理和最终措辞的控制。

英文摘要

Text-based telemedicine increasingly relies on lightweight patient feedback; however, such feedback primarily reflects perceived communication quality rather than medical accuracy. We introduce an LM-guided counterfactual recommendation pipeline that discovers and refines interpretable communication features such as tone, personalization, actionability and completeness in addressing patient concerns, without interfering with the medical content. These features are used together with patient-doctor interaction metadata to estimate positive feedback. At inference time, the system searches over low-cost ordinal feature changes and recommends minimal communication changes predicted to increase the probability of positive feedback, while independent auditor models test whether these gains generalize beyond the selection model. Across interactions, recommendations yield a mean +6.41% gain in predicted positive feedback probability under independent auditors, and are non-negative for 93.31% of recommendations. These results suggest that small, interpretable communication changes can capture most predicted gains while preserving the doctor's control over medical reasoning and final wording.

5. 语义、语法与语言学分析 4 篇

2606.18624 2026-06-18 cs.CL 新提交

PragReST: Self-Reinforcing Counterfactual Reasoning for Pragmatic Language Understanding

PragReST:用于语用语言理解的自我强化反事实推理

Jihyung Park, Minchao Huang, Leqi Liu, Elias Stengel-Eskin

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 提出PragReST框架,通过自监督构建语用问答数据、生成反事实推理轨迹,结合监督微调和强化学习提升大语言模型的语用推理能力,在四个基准上显著优于基线模型。

Comments First two authors contributed equally. Code and models: https://github.com/jihyung803/PragReST

详情
AI中文摘要

自然语言理解通常依赖于隐含而非明确陈述的含义,需要语用推理。尽管大语言模型(LLMs)在数学和逻辑推理上表现强劲,但在进行语用推理时仍存在困难,往往选择字面解释。为了提升LLM的语用推理能力,我们提出了PragReST,一个自监督框架,它构建语用问答数据,生成反事实推理轨迹,并通过监督微调和强化学习训练模型内化这些轨迹,无需人工标注训练数据或从更强的教师模型蒸馏。在四个语用基准(PragMega、Ludwig、MetoQA和AltPrag)上,PragReST相比骨干模型、任务特定的语用微调基线以及同一流水线的非反事实变体均有提升。在基于准确率的基准上,PragReST在Qwen3-8B和Qwen3-14B上分别比指令骨干模型提升了5.37%和5.50%(绝对值)。我们的错误分析和消融实验强调了反事实推理的重要性:PragReST主要减少了因未能将观察到的话语与合理的替代方案进行对比而导致的错误,而去除反事实推理会显著降低性能。此外,我们的训练保留了对通用知识和数学推理基准的域外性能。

英文摘要

Natural language understanding often depends on meanings that are implied rather than explicitly stated, requiring pragmatic reasoning. Despite strong performance on math and logical reasoning, large language models (LLMs) still struggle with making pragmatic inferences, often choosing literal interpretations. To improve LLM pragmatic reasoning, we introduce PragReST, a self-supervised framework that constructs pragmatic QA data, generates counterfactual reasoning traces, and trains models to internalize them through supervised fine-tuning and reinforcement learning, without human-labeled training data or distillation from a stronger teacher. Across four pragmatic benchmarks (PragMega, Ludwig, MetoQA, and AltPrag), PragReST improves over backbone models, task-specific pragmatic tuning baselines, and non-counterfactual variants of the same pipeline. On accuracy-based benchmarks, PragReST improves over the instruct backbone by 5.37 and 5.50% (absolute) for Qwen3-8B and Qwen3-14B, respectively. Our error analysis and ablations underscore the importance of counterfactual reasoning: PragReST primarily reduces errors caused by failures to contrast observed utterances with plausible alternatives, and removing counterfactual reasoning substantially reduces performance. Moreover, our training preserves out-of-domain performance on general-knowledge and mathematical reasoning benchmarks.

2606.18717 2026-06-18 cs.CL cs.AI 新提交

Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish

Morpheus: 一种面向土耳其语的形态感知神经分词器和词嵌入器

Tolga Şakar

发表机构 * Independent Researcher(独立研究者)

AI总结 针对土耳其语粘着特性,提出Morpheus神经词素边界模型,实现无损可逆分词与结构化词嵌入,在可逆分词器中达到最低比特每字符(1.425),词素对齐F1提升至0.61,GPU内存节省约19%。

详情
AI中文摘要

土耳其语是粘着语:意义由词素承载,然而驱动现代语言模型的子词分词器根据语料库统计分割单词,切碎了承载语义的后缀,并且在WordPiece和基于规则的分析器的情况下,无法将其输出解码回原始文本。本文提出Morpheus,一个面向土耳其语的神经词素边界模型,它同时是一个无损的、形态感知的分词器和一个词嵌入生成器。一个可微的泊松-二项式动态规划程序在训练期间将每个字符的边界概率转化为软词素隶属度,在推理时转化为精确的片段,无需字符串归一化,因此$\mathrm{decode}(\mathrm{encode}(w)) = w$由构造保证。由于该模型是神经模型,相同的正向传播在分词的同时也输出结构化的词嵌入。在可逆分词器(唯一适用于生成的分词器)中,Morpheus达到了最低的比特每字符(1.425),将子词家族的金标准词素对齐大致翻倍(MorphScore宏F1从约0.32提升至0.61),并且相比64K词汇量的子词分词器节省了约19%的GPU内存。作为嵌入器,冻结的Morpheus向量在词汇检索(根家族MAP 0.85)和同根验证(ROC-AUC 1.00)上领先,超越了多语言检索器BGE-M3和BERTurk;在上下文和屈折依赖的任务(NER、格/数探测)上,更重的上下文编码器仍然领先,我们将这一权衡归因于Morpheus以词根为中心的几何结构。代码:https://github.com/lonewolf-rd/TurkishMorpheus;模型:https://huggingface.co/lonewolflab/Morpheus-TR-50K;交互演示:https://huggingface.co/spaces/lonewolflab/morpheus-tr-demo。

英文摘要

Turkish is agglutinative: meaning is carried by morphemes, yet the subword tokenizers that drive modern language models split words by corpus statistics, fragmenting semantically loaded suffixes and (in the case of WordPiece and rule-based analyzers) failing to decode their output back to the original text. This paper presents Morpheus, a neural morpheme-boundary model for Turkish that is at once a lossless, morphology-aware tokenizer and a word-embedding producer. A differentiable Poisson-binomial dynamic program turns per-character boundary probabilities into soft morpheme memberships during training and exact segments at inference, with no string normalization, so $\mathrm{decode}(\mathrm{encode}(w)) = w$ holds by construction. Because the model is neural, the same forward pass that tokenizes also emits a structured word embedding. Among reversible tokenizers (the only ones valid for generation), Morpheus attains the lowest bits-per-character (1.425), roughly doubles the gold morphological alignment of the subword family (MorphScore macro-F1 0.61 vs. ~0.32), and uses ~19% less GPU memory than 64K-vocabulary subword tokenizers. As an embedder, frozen Morpheus vectors lead on lexical retrieval (root-family MAP 0.85) and same-root verification (ROC-AUC 1.00), surpassing the multilingual retriever BGE-M3 and BERTurk; on context- and inflection-dependent tasks (NER, case/number probing) the heavier contextual encoders remain ahead, a trade-off we attribute to Morpheus's root-centric geometry. Code: https://github.com/lonewolf-rd/TurkishMorpheus; model: https://huggingface.co/lonewolflab/Morpheus-TR-50K; interactive demo: https://huggingface.co/spaces/lonewolflab/morpheus-tr-demo.
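One way to read the Poisson-binomial step: if each gap between characters carries an independent boundary probability, the number of boundaries to the left of a character follows a Poisson-binomial distribution, and that count is the index of the morpheme the character belongs to. The sketch below implements the textbook recurrence for that distribution plus a thresholded hard segmentation. It is an illustration under those assumptions, not the Morpheus model: the 0.5 threshold, the function names, and the example probabilities are all hypothetical.

```python
def soft_morpheme_memberships(boundary_probs):
    """memberships[i][k] = P(character i lies in segment k).

    boundary_probs[i] is the probability of a boundary between character i
    and character i + 1, so a word of n characters has n - 1 entries.
    """
    dist = [1.0]            # the first character is always in segment 0
    memberships = [dist]
    for p in boundary_probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)  # no boundary: stay in segment k
            nxt[k + 1] += mass * p      # boundary: move on to segment k + 1
        dist = nxt
        memberships.append(dist)
    return memberships


def hard_segments(word, boundary_probs, threshold=0.5):
    """Cut the word wherever the boundary probability exceeds the threshold."""
    segments, start = [], 0
    for i, p in enumerate(boundary_probs):
        if p > threshold:
            segments.append(word[start:i + 1])
            start = i + 1
    segments.append(word[start:])
    return segments


# "evlerde" ("in the houses") = ev + ler + de; the probabilities are made up.
word = "evlerde"
probs = [0.1, 0.9, 0.1, 0.1, 0.9, 0.1]
segments = hard_segments(word, probs)
memberships = soft_morpheme_memberships(probs)
```

Because the segments are slices of the untouched input string, concatenating them returns the original word exactly, which is the round-trip property the abstract states holds by construction.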

2606.18856 2026-06-18 cs.CL cs.LG 新提交

Approximate Structured Diffusion for Sequence Labelling

近似结构化扩散用于序列标注

Nicolas Floquet, Joseph Le Roux, Nadi Tomeh

发表机构 * Université Sorbonne Paris Nord, CNRS, Laboratoire d’Informatique de Paris Nord, LIPN(索邦巴黎北大学、法国国家科学研究中心、巴黎北信息学实验室 LIPN)

AI总结 提出一种基于扩散的条件随机场(CRF)训练方法,通过引入标签噪声条件来捕捉长距离依赖,结合近似推理在词性标注任务上实现16.5%的错误率降低。

详情
AI中文摘要

序列标注是自然语言处理(NLP)的核心任务,涉及为输入句子的每个标记分配一个标签。从机器学习的角度来看,序列标注通常被建模为由神经网络参数化的线性链条件随机场(CRF)。虽然这种方法在经验上取得了良好结果,但CRF假设有限的决策跨度(例如标签二元组),这可能会限制其表达能力,并在需要长距离依赖时损害性能。我们证明可以利用扩散来训练一个以整个标签序列为条件的CRF,但条件是标签的噪声版本。实验表明,该方法结合近似CRF推理,在词性标注任务上实现了16.5%的错误率降低,提高了标签准确性。

英文摘要

Sequence labelling, a core task of Natural Language Processing (NLP), consists of assigning a label to each token of an input sentence. From a Machine Learning point of view, sequence labelling is often cast as a Linear-Chain Conditional Random Field (CRF) parametrised by a neural network. While this approach gives good empirical results, CRFs assume a finite decision span (e.g., label bigrams), which can limit their expressivity and hurt performance when long-range dependencies are required. We show we can leverage diffusion to train a CRF conditioned on an entire label sequence, with the caveat that the condition is on a noisy version of labels. We show experimentally that this method, in conjunction with approximate CRF inference, improves label accuracy with a 16.5% error reduction for POS-tagging.
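For readers unfamiliar with the "label bigram" decision span mentioned above, the following is a minimal sketch of standard Viterbi decoding for a linear-chain CRF. It illustrates only the conventional baseline; it is not the diffusion-based training or the approximate inference proposed in the paper, and the score matrices are made-up inputs.

```python
from typing import List


def viterbi(emissions: List[List[float]], transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring label sequence.

    emissions[t][y]      : score of label y at position t
    transitions[y_prev][y]: score of the label bigram (y_prev, y)
    """
    n_labels = len(transitions)
    score = list(emissions[0])  # best score of any path ending in each label
    backptrs = []
    for emit in emissions[1:]:
        new_score, ptr = [], []
        for cur in range(n_labels):
            best_prev = max(
                range(n_labels),
                key=lambda prev: score[prev] + transitions[prev][cur],
            )
            new_score.append(score[best_prev] + transitions[best_prev][cur] + emit[cur])
            ptr.append(best_prev)
        score = new_score
        backptrs.append(ptr)
    # Follow back-pointers from the best final label.
    best = max(range(n_labels), key=lambda lab: score[lab])
    path = [best]
    for ptr in reversed(backptrs):
        best = ptr[best]
        path.append(best)
    path.reverse()
    return path
```

With a strong penalty on label changes, the decoder prefers a consistent sequence even where per-token scores disagree; with zero transition scores it reduces to a per-token argmax. Anything beyond adjacent labels is invisible to this model, which is the limitation the abstract targets.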

2606.18922 2026-06-18 cs.CL cs.AI 新提交

As Easy as Rocket Science: Assessing the Ability of Large Language Models to Interpret Negation in Figurative Language

像火箭科学一样简单:评估大型语言模型解释比喻语言中否定能力的研究

Jasmine Owers, Edwin Simpson, Martha Lewis

发表机构 * Intelligent Systems Lab University of Bristol(智能系统实验室 英国布里斯托尔大学) ILLC University of Amsterdam(阿姆斯特丹大学逻辑、语言与计算研究所)

AI总结 本研究通过开发新的注释数据集,测试多种大型语言模型在比喻语言中理解否定的能力,发现否定与比喻的组合对模型构成挑战,且性能高度依赖提示风格。

Comments 16 pages, 16 figures; for associated code and data see https://github.com/jrdowers/Negation-and-Fig-Lang; To be published in Transactions of the Association for Computational Linguistics

详情
AI中文摘要

比喻语言和否定是当前语言模型面临挑战的两个领域,然而,两者在书面和口语中广泛使用。大型语言模型(LLMs)也广泛应用于日常场景,在这些场景中它们不一定能针对特定数据集进行调整。因此,理解LLMs正确解释包含否定和比喻语言的文本的能力至关重要。为了研究这一点,我们为现有的比喻语言数据集开发了一套新的注释,并在该数据集上测试了一系列语言模型。我们发现,否定和比喻性的结合可能带来特殊挑战,并且整体性能以及不同否定类型上的性能特别依赖于所使用的提示风格。

英文摘要

Figurative language and negation are two areas that challenge current language models; however, both are widely used throughout written and spoken language. Large language models (LLMs) are also widely used in everyday contexts where they cannot necessarily be tuned for a specific dataset. It is therefore essential to understand the ability of LLMs to correctly interpret text that includes both negation and figurative language. To investigate this, we develop a set of new annotations to an existing dataset of figurative language, and test a range of language models on the dataset. We find that the combination of negation and figurativeness can present a particular challenge, and that performance overall and across different negation types is particularly dependent on the prompt style used.

6. 多模态语言处理 1 篇

2606.18273 2026-06-18 cs.CL cs.AI cs.SD eess.AS 新提交

Continuous Audio Thinking for Large Audio Language Models

面向大型音频语言模型的连续音频思考

Gyojin Han, Dong-Jae Lee, Changho Choi, Jongsuk Kim, Junmo Kim

发表机构 * KAIST(韩国科学技术院)

AI总结 提出连续音频思考(CoAT)框架,通过专家蒸馏在连续潜在空间中组织声学信息,使音频语言模型在生成响应前利用丰富声学特征,无需额外自回归解码成本,在多个音频任务上提升性能。

Comments Preprint

详情
AI中文摘要

大型音频语言模型(LALMs)在从语音转录到音乐分析等多种音频理解任务中展现了令人印象深刻的能力。然而,由于LALMs通常被训练生成与文本对齐的响应,其隐藏状态逐渐为文本生成而塑造,而非保留声学信息。因此,音频携带的多样化声学内容,如语音细节、韵律、声音事件、情感和音高,在过程中丢失,难以在响应中利用。我们引入了连续音频思考(CoAT),这是一个框架,为音频语言模型配备一个连续的潜在工作空间,用于在响应生成之前组织声学信息,并以来自音频专家的蒸馏作为依托。在思考空间内,模型可以在生成响应时利用专家蒸馏提供的丰富声学信息。此外,所提出的连续思考块可以在单个预填充中处理,因此CoAT不需要比基线额外的自回归解码成本。在三个LALM上,Qwen2-Audio、Qwen2.5-Omni-7B和Audio Flamingo 3,在涵盖音频推理、音频理解、音乐分类、语音情感和语音转录的广泛基准套件上的性能提升证明了CoAT的有效性。进一步分析证实,辅助监督从思考位置传播到模型的文本响应。

英文摘要

Large audio language models (LALMs) have shown impressive capabilities on diverse audio understanding tasks, ranging from speech transcription to music analysis. However, because LALMs are typically trained to produce text-aligned responses, their hidden states are progressively shaped for text generation rather than for preserving acoustic information. As a result, the diverse acoustic content that audio carries, such as phonetic detail, prosody, sound events, affect, and pitch, is lost along the way and difficult to leverage in the response. We introduce Continuous Audio Thinking (CoAT), a framework that equips audio language models with a continuous latent workspace for organizing acoustic information prior to response generation, grounded by distillation from audio experts. Within the thinking space, the model can utilize the rich acoustic information provided by expert distillation when generating its response. Furthermore, the proposed continuous thinking block can be processed in a single prefill, so CoAT does not require additional autoregressive decoding cost over the baseline. Across three LALMs, Qwen2-Audio, Qwen2.5-Omni-7B, and Audio Flamingo 3, performance gains on a broad benchmark suite spanning audio reasoning, audio understanding, music classification, speech emotion, and speech transcription demonstrate the effectiveness of CoAT. Further analysis confirms that the auxiliary supervision propagates from the thinking positions to the model's textual responses.

7. 语音语言联合与音频文本 2 篇

2606.18466 2026-06-18 cs.CL 新提交

Montreal Forced Aligner and the state of speech-to-text alignment in 2026

Montreal Forced Aligner 与 2026 年语音到文本对齐的现状

Michael McAuliffe, Kaylynn Gunter, Michael Wagner, Morgan Sonderegger

发表机构 * University of Wisconsin-Madison(威斯康星大学麦迪逊分校) McGill University(麦吉尔大学) Centre for Brain, Language, and Music(大脑、语言与音乐中心) University of Oregon(俄勒冈大学)

AI总结 本文介绍 MFA 3.0 自 1.0 版本以来的发展,并在英语、日语和韩语上评估其性能,在四个基准数据集上达到平均边界误差低于 15 ms 的最优或接近最优性能。

详情
AI中文摘要

Montreal Forced Aligner (MFA) 于 2016 年发布,此后成为研究和工业中最广泛使用的强制对齐工具。在过去的十年中,MFA 经历了实质性发展,包括使用更大的开源数据集扩展到更多语言和方言、统一的 IPA 词典、模型自适应、跨语言音素映射以及支持工具。本文记录了 MFA 3.0 自 1.0 版本以来的发展,并在英语、日语和韩语上评估 MFA 的性能,与经典和神经强制对齐器进行基准测试。MFA 3.0 在所有四个基准数据集上实现了最优或接近最优的性能,平均边界误差低于 15 ms。自适应和跨语言映射对于 MFA 训练分布之外的语言有效,并且发音概率建模和音系规则在特定条件下提供了增益。

英文摘要

The Montreal Forced Aligner (MFA) was released in 2016 and has since become the most widely used tool for forced alignment in research and industry. In the decade since, MFA has undergone substantial development, including expanded coverage across more languages and dialects using larger open-source datasets, harmonized IPA dictionaries, model adaptation, cross-language phone remapping, and support utilities. This paper documents MFA 3.0's developments since version 1.0 and evaluates MFA's performance across English, Japanese, and Korean, benchmarked against classic and neural forced aligners. MFA 3.0 achieves state-of-the-art or near state-of-the-art performance across all four benchmark datasets with mean boundary errors below 15 ms. Adaptation and cross-language remapping are effective for languages outside MFA's training distribution, and pronunciation probability modeling and phonological rules provide gains in specific conditions.
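The headline number above, a mean boundary error below 15 ms, is easy to reproduce in spirit. The sketch below assumes the metric is the mean absolute difference between matched reference and predicted boundary times; the abstract does not spell out the exact protocol (for example, how boundaries are matched), so treat the function and its inputs as illustrative.

```python
from typing import Sequence


def mean_boundary_error_ms(reference: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute boundary displacement in milliseconds.

    Both inputs are boundary times in seconds, already matched one-to-one
    (an assumption: real evaluations must first align the two boundary lists).
    """
    if len(reference) != len(predicted) or not reference:
        raise ValueError("need two non-empty, equally long boundary lists")
    total = sum(abs(r - p) for r, p in zip(reference, predicted))
    return 1000.0 * total / len(reference)
```

For instance, boundaries off by 10 ms, 10 ms and 20 ms average to roughly 13.3 ms, which would fall under the 15 ms figure quoted in the abstract.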

2606.18584 2026-06-18 cs.CL 新提交

Speech-Driven End-to-End Language Discrimination towards Chinese Dialects

语音驱动的端到端汉语方言语言鉴别

Fan Xu, Jian Luo, MingWen Wang, GuoDong Zhou

发表机构 * Jiangxi normal university(江西师范大学) Soochow university(苏州大学)

AI总结 针对相似语言和方言鉴别难题,提出基于MFCC特征和HMM-DNN端到端模型的语音驱动方法,结合注意力机制和CNN融合词嵌入与MFCC特征,在基准语料上优于现有方法。

Comments Published in ACM TALLIP

详情
AI中文摘要

在相似语言、变体和方言之间进行语言鉴别是一项具有挑战性的自然语言处理任务。传统的文本驱动方法效果不佳。本文探讨了语音驱动特征在汉语方言鉴别中的有效性。首先,我们系统地研究了语音驱动的MFCC特征对于基于CNN的语言鉴别的适用性。然后,我们设计了一个基于HMM-DNN的端到端语音识别模型来预测汉语方言词汇。我们采用注意力机制提取与不同汉语方言相关的鉴别性词汇。最后,通过CNN,我们将词级嵌入与基于MFCC的特征相结合。在两个基准汉语方言语料库上的评估表明,与最先进的方法相比,所提出的语音驱动方法在细粒度汉语方言鉴别中具有适用性和有效性。

英文摘要

Language discrimination among similar languages, varieties, and dialects is a challenging natural language processing task. The traditional text-driven focus leads to poor results. In this paper, we explore the effectiveness of speech-driven features for language discrimination among Chinese dialects. First, we systematically explore the appropriateness of speech-driven MFCC features for CNN-based language discrimination. Then, we design an end-to-end speech recognition model based on HMM-DNN to predict Chinese dialect words. We adopt attention to extract the discriminative words related to different Chinese dialects. Finally, through a CNN, we combine the word-level embedding and the MFCC-based features. Evaluation on two benchmark Chinese dialect corpora shows the appropriateness and effectiveness of the proposed speech-driven approach to fine-grained Chinese dialect discrimination compared to the state-of-the-art methods.

8. 评测、数据集与基准 12 篇

2606.18471 2026-06-18 cs.CL 新提交

Possible or Definite? A Benchmark for Evaluating Diagnostic Uncertainty Preservation in Clinical Text

可能还是确定?评估临床文本中诊断不确定性保留的基准

Hongbo Du, Zixin Lu, Jiaming Qu

发表机构 * Trine University(特里尼大学) University of Michigan(密歇根大学) Amazon(亚马逊)

AI总结 构建包含9184个不确定性标注的基准,评估LLM在临床文本中保留诊断不确定性的能力,发现LLM保留原始不确定性线索不足一半,且难以区分相邻级别。

详情
AI中文摘要

大型语言模型(LLMs)越来越多地用于临床文本任务,如总结和修订。虽然大多数研究评估LLM生成文本的流畅性和连贯性,但LLM是否正确保留诊断不确定性仍未得到充分探索。在临床实践中,诸如“可能肺炎”之类的短语传达了现有证据的强度,并直接指导后续检测和治疗决策。改变这些不确定性表达可能会完全改变临床含义。在本文中,我们通过两个步骤系统地评估了这个问题。首先,我们构建了一个包含1200份临床文档的基准,其中包含跨五个级别的9184个不确定性标注。其次,我们在此基准上评估了三个LLM。我们的结果表明:(1)LLM保留原始不确定性线索的能力很差,通常不到一半的时间;(2)LLM难以区分相邻级别之间的细微差别。这项工作揭示了标准评估指标无法捕捉的失败模式,并为LLM在临床工作流程中的安全部署提供了启示。

英文摘要

Large language models (LLMs) are increasingly used for clinical text tasks such as summarization and revision. While most studies evaluate the fluency and coherence of LLM-generated text, whether LLMs correctly preserve diagnostic uncertainty remains underexplored. In clinical practice, phrases such as "possible pneumonia" communicate the strength of available evidence and directly guide decisions about follow-up testing and treatment. Altering these uncertainty expressions can change the clinical meaning entirely. In this paper, we systematically evaluated this problem in two steps. First, we constructed a benchmark of 1,200 clinical documents with 9,184 uncertainty annotations across five levels. Second, we evaluated three LLMs on this benchmark. Our results show that (1) LLMs preserve the original uncertainty cues poorly, often less than half the time; (2) LLMs struggle with nuanced distinctions between adjacent levels. This work reveals a failure mode not captured by standard evaluation metrics and provides implications for the safe deployment of LLMs in clinical workflows.
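The two findings above (cues preserved less than half the time, and confusion between adjacent levels) suggest two simple statistics. The sketch below is a hypothetical scoring routine, not the benchmark's actual protocol: it assumes each annotated cue carries an integer level from 1 to 5 and that the level expressed in the model output has already been determined, with `None` marking a cue that was dropped.

```python
from typing import Dict, Optional, Sequence


def preservation_stats(
    source_levels: Sequence[int],
    output_levels: Sequence[Optional[int]],
) -> Dict[str, float]:
    """Fraction of cues kept at the same level, and fraction shifted by one level."""
    if len(source_levels) != len(output_levels) or not source_levels:
        raise ValueError("need two non-empty, equally long level lists")
    n = len(source_levels)
    preserved = sum(1 for s, o in zip(source_levels, output_levels) if s == o)
    adjacent = sum(
        1
        for s, o in zip(source_levels, output_levels)
        if o is not None and abs(s - o) == 1  # e.g. "possible" rewritten as "probable"
    )
    return {"preserved": preserved / n, "adjacent_shift": adjacent / n}
```

Reporting the adjacent-shift rate separately from the overall preservation rate distinguishes small drifts in certainty from cues that vanish entirely.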

2606.18613 2026-06-18 cs.CL cs.AI 新提交

Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance

LLMs 是否已准备好辅助医生?PhysAssistBench:交互式医患-电子病历辅助基准

Tianming Du, Peijie Yu, Sihan Shang, Danli Shi, My Linh Nguyen, Shengbo Gao, Guangyuan Li, Yinghong Yu, Yan Jiang, Qianlong Zhao, Behzad Bozorgtabar, Shaoxiong Ji, Jiazhen Pan, Daniel Rueckert, Jiancheng Yang

发表机构 * Aalto University(阿尔托大学) Tencent(腾讯) Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) Hong Kong Polytechnic University(香港理工大学) Aarhus University(奥胡斯大学) Technical University of Munich(慕尼黑工业大学)

AI总结 提出PhysAssistBench基准,通过构建交互式患者代理评估LLM在医患-EHR交互中的协调能力,发现当前模型不可靠,瓶颈在于多维度协调而非单一能力。

Comments 34 pages with 8 figures

详情
AI中文摘要

医疗LLM最合理的近期角色是辅助而非替代医生,但当前的评估通常测试孤立能力:临床知识、EHR系统交互或患者沟通。而医生辅助需要在同一交互中协调这些能力,其中医生提出不明确的请求,患者模糊描述症状,EHR系统要求精确的工具使用。我们引入PhysAssistBench,一个用于交互式医患-EHR辅助的基准。基于真实的MIMIC-IV病例,PhysAssistBench使用可扩展的流水线构建交互式、记录驱动的患者代理,将静态EHR记录转化为多轮临床场景,同时保持临床事实准确性。PhysAssistBench提供了一个精选的双语评估集,包含1,296个经过人工审查和医生验证的轮次。与领先LLM的实验表明,当前模型在此设置下仍不可靠,这暴露了临床LLM的关键瓶颈:可靠的辅助需要知识、沟通和系统之间的协调,而非任何单一能力的孤立提升。

英文摘要

The most plausible near-term role of medical LLMs is to assist rather than replace physicians, yet current evaluations often test isolated capabilities: clinical knowledge, EHR system interaction, or patient communication. Physician assistance instead requires coordinating these capabilities within the same interaction, where physicians issue underspecified requests, patients describe symptoms ambiguously, and EHR systems demand precise tool use. We introduce PhysAssistBench, a benchmark for interactive doctor-patient-EHR assistance. Built from real MIMIC-IV cases, PhysAssistBench uses a scalable pipeline to construct agentic patients: interactive, record-grounded agents that turn static EHR records into multi-turn clinical scenarios while preserving clinical factuality. PhysAssistBench provides a curated bilingual evaluation set of 1,296 manually reviewed and physician-validated turns. Experiments with leading LLMs show that current models remain unreliable in this setting, which exposes a key bottleneck for clinical LLMs: reliable assistance requires coordination across knowledge, communication, and systems, not isolated gains in any of them.

2606.18636 2026-06-18 cs.CL cs.AI 新提交

PEC-Home: Interpretation of Progressively Elliptical Commands in Smart Homes

PEC-Home:智能家居中渐进式省略命令的解释

Yingyu Shan, Zeming Liu, Silin Li, Boao Qian, Jiashu Yao, Yuhang Guo, Haifeng Wang

发表机构 * Beijing Institute of Technology(北京理工大学) Beihang University(北京航空航天大学) Baidu Inc.(百度公司)

AI总结 针对智能家居中用户因共享上下文而使用渐进式省略命令导致的指代和意图歧义问题,提出首个模拟家庭数据集PEC-Home,实验表明现有LLM助手难以准确执行省略命令。

Comments Accepted by ACL 2026 Findings

详情
AI中文摘要

近年来,大型语言模型(LLM)的进步使家庭助手具备了自然语言交互能力。然而,当前的助手忽略了人类对话中随着共享上下文积累而发生的渐进式省略,即为了高效沟通而使用更简洁的表达。因此,当前助手仍难以准确解释此类省略表达,限制了其在现实应用中的有效性。在实际智能家居场景中,助手面临由省略命令引起的两大挑战:(1)多个用户对环境期望不同导致的指代歧义;(2)用户偏好随时间或环境变化导致的意图歧义。为应对这些挑战,我们引入了PEC-Home,这是首个专门为解释智能家居中渐进式省略命令而设计的模拟家庭数据集。在包括GPT-4o在内的多种LLM上的广泛实验表明,现有的家庭助手难以仅基于省略命令执行用户意图的操作。即使配备存储和检索用户对话历史的工具,其执行准确率仍低于使用完整命令时的水平。

英文摘要

Recent advancements in Large Language Models (LLMs) have empowered home assistants with natural language interaction capabilities. However, current assistants overlook the progressive omission that occurs in human dialogue as shared context accumulates, leading to more elliptical expressions for efficient communication. Thus, current assistants still struggle to interpret such elliptical expressions accurately, which limits their effectiveness in real-world applications. In practical smart home scenarios, assistants face two major challenges caused by elliptical commands: (1) referential ambiguity caused by different environmental expectations among multiple users; and (2) intention ambiguity resulting from user preferences that evolve over time or change with the environment. To address these challenges, we introduce PEC-Home, the first simulated home dataset specifically designed for interpreting progressively elliptical commands in smart homes. Extensive experiments on various LLMs, including GPT-4o, show that existing home assistants struggle to execute user-intended operations based solely on elliptical commands. Even when equipped with tools for storing and retrieving user dialogue history, execution accuracy remains below that achieved with complete commands.

2606.18699 2026-06-18 cs.CL cs.AI cs.IR 新提交

TW-LegalBench: Measuring Taiwanese Legal Understanding

TW-LegalBench: 衡量台湾法律理解

Fei-Yueh Chen, Chun Huang Lin, Chan Wei Hsu, Kuan Hsuan Yeh, Zih-Ching Chen, Kuan-Ming Chen, Patrick Chung-Chia Huang

发表机构 * University of Rochester(罗切斯特大学) National Taiwan University(国立台湾大学) NVIDIA(英伟达)

AI总结 提出TW-LegalBench基准,包含多项选择、开放式问答和法律判决预测任务,评估13个LLM在台湾法律上的表现,发现顶尖模型通过律师考试但未达到法官检察官标准,且法律条文引用困难。

Comments 10 pages, 2 figures, To appear in ICAIL 2026

详情
AI中文摘要

大型语言模型(LLM)在多种任务上展现出令人印象深刻的能力,但其在特定司法管辖区法律推理上的表现仍未充分探索。我们提出TW-LegalBench,利用台湾法律系统丰富的官方公开语料库,填补了在普通法基准(侧重英文来源)和大陆法基准(侧重简体中文来源)之外评估LLM在台湾法律上的空白。TW-LegalBench包含三种任务类型:(1)涵盖18个专业领域五年官方考试的超过16,000道多项选择题(MCQ);(2)来自法律专业人员考试的117道开放式问答题(OEQ),附有官方评分标准;(3)超过14,000个法律判决预测(LJP)实例,涵盖数百种犯罪类别。我们使用MCQ的准确率、基于评分标准点的分解式LLM作为裁判框架评估OEQ,以及LJP的判决准确性和法条引用指标,评估了13个LLM。我们的结果显示,表现最佳的模型超过了合格律师的通过门槛(通过率:11%),但未达到法官和检察官的通过标准(通过率:1-2%)。对于LJP,虽然模型展示了合理的判决类型准确性和刑期预测能力,但它们难以准确引用具体法律条文。这些发现表明,即使LLM在资格考试上的表现接近人类水平,可靠的法律文本生成仍然具有挑战性。

英文摘要

Large language models (LLMs) have shown impressive capabilities across diverse tasks, yet their performance on jurisdiction-specific legal reasoning remains underexplored. We present TW-LegalBench, which draws on the Taiwanese legal system's rich, publicly available official corpus to fill the gap in evaluating LLMs on Taiwanese law, a gap left between common-law benchmarks that focus on English sources and civil-law benchmarks that focus on Simplified Chinese sources. TW-LegalBench comprises three task types: (1) over 16,000 multiple-choice questions (MCQs) across five years of official examinations in 18 professional domains; (2) 117 open-ended essay questions (OEQs) from examinations for legal professionals with official scoring rubrics; and (3) more than 14,000 legal judgment prediction (LJP) instances covering hundreds of crime categories. We evaluate 13 LLMs using accuracy for MCQs, a decomposed LLM-as-Judge framework based on the scoring rubric points for OEQs, and metrics for sentencing accuracy and statute citation for LJP. Our results reveal that top-performing models exceed the passing threshold for qualified lawyers (passing rate: 11%) but fall short of that for judges and prosecutors (passing rate: 1-2%). For LJP, while models demonstrate reasonable verdict type accuracy and sentence prediction capability, they struggle to cite exact legal articles. These findings highlight that reliable legal text generation remains challenging for LLMs, even though their performance on qualification examinations approaches human level.

2606.18709 2026-06-18 cs.CL 新提交

LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment

LLMs难以衡量区分不同水平学生的题目:阅读理解评估中题目区分度研究

Han Chen, Ming Li, Chenguang Wang, Yijun Liang, Dawei Zhou, Hong Jiao, Tianyi Zhou

发表机构 * MBZUAI(穆罕默德·本·扎耶德人工智能大学) University of Maryland(马里兰大学) Virginia Tech(弗吉尼亚理工大学)

AI总结 本研究评估42个LLM在零样本设置下预测题目区分度的能力,发现直接预测与人类校准的区分度相关性弱(最高Spearman 0.152),基于CTT的响应校准相关性有限(0.241),表明LLM尚不能可靠捕捉题目区分度。

详情
AI中文摘要

题目区分度是教育评估的一个基本心理测量属性,它衡量一个题目是否能有效区分高水平和低水平学生。虽然已有研究探讨了大语言模型(LLM)能否估计题目难度,但尚不清楚它们能否捕捉题目区分度。在本工作中,我们使用两种互补方法评估了42个专有和开源LLM在零样本设置下的表现:直接区分度预测,即模型从其内容中显式估计题目的区分度值;以及基于响应的经典测试理论(CTT)校准,其中LLM的答案被视为合成学生响应以计算区分度分数。我们的结果表明,直接预测与人类校准的区分度一致性较弱:表现最好的模型仅达到0.152的Spearman相关性。基于响应的CTT校准提供了更强但仍然有限的信号,全人格合成受访者池达到0.241的Spearman相关性。这些发现突显了题目区分度作为基于LLM的心理测量评估的一个开放挑战:当前的LLM包含非随机的区分度相关信号,但它们尚不能可靠地捕捉评估题目如何区分人类学生。

英文摘要

Item discrimination is a fundamental psychometric property of educational assessment, which measures whether an item meaningfully distinguishes students with higher proficiency from students with lower proficiency. While various existing works have explored whether large language models (LLMs) can estimate item difficulty, it remains unclear whether they can capture item discrimination. In this work, we evaluate 42 proprietary and open-weight LLMs in zero-shot settings using two complementary approaches: direct discrimination prediction, where models explicitly estimate an item's discrimination value from its content, and response-based Classical Test Theory (CTT) calibration, where LLM answers are treated as synthetic student responses to compute discrimination scores. Our results show that direct prediction yields weak alignment with human-calibrated discrimination: the best-performing model reaches only a Spearman correlation of 0.152. Response-based CTT calibration provides a stronger but still limited signal, with the all-persona synthetic respondent pool reaching a Spearman correlation of 0.241. These findings highlight item discrimination as an open challenge for LLM-based psychometric evaluation: current LLMs contain non-random discrimination-relevant signal, but they do not yet reliably capture how assessment items distinguish human students.
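To make the response-based Classical Test Theory calibration concrete, here is a minimal sketch of one common CTT discrimination statistic, the corrected point-biserial (the correlation between an item's correctness and the score on the remaining items), together with a Spearman correlation for comparing predicted and human-calibrated values. The abstract does not say which CTT index the authors use, so this choice is an assumption, and the response matrix in the usage note is invented.

```python
from math import sqrt
from statistics import mean
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # convention chosen here: a constant column carries no signal
    return cov / sqrt(vx * vy)


def item_discrimination(responses: List[List[int]]) -> List[float]:
    """responses[s][i] is 1 if respondent s answered item i correctly, else 0."""
    n_items = len(responses[0])
    result = []
    for i in range(n_items):
        item = [row[i] for row in responses]
        rest = [sum(row) - row[i] for row in responses]  # score without item i
        result.append(pearson(item, rest))
    return result


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            out[order[k]] = avg
        pos = end + 1
    return out


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    return pearson(ranks(x), ranks(y))
```

In the paper's setting, the rows of `responses` would be synthetic LLM respondents rather than students, and `spearman` would compare the resulting discrimination scores with the human-calibrated ones; values such as 0.152 or 0.241 indicate only weak rank agreement.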

2606.18782 2026-06-18 cs.CL cs.AI 新提交

RedactionBench

Sean Brynjólfsson, Shashvat Jayakrishnan, Esha Sali, Diptanshu Purwar, Madhav Aggarwal

发表机构 * A10 Networks, Inc.(A10网络公司)

AI总结 针对大语言模型在敏感领域中的PII编辑需求,基于上下文完整性提出RedactionBench基准和R-Score指标,评估多种模型发现上下文编辑仍具挑战,人类评估显示隐私感知存在分歧。

详情
AI中文摘要

大语言模型越来越多地应用于需要编辑个人身份信息(PII)的敏感领域。虽然编辑PII是数据清洗的前提,但现有基准将提取机制与隐私语义混为一谈。公开电话号码与医疗记录中的电话号码不等同。信息是否构成违规在很大程度上取决于谁持有它、为什么持有以及持有背景,这从根本上区分了编辑与简单的实体识别。基于上下文完整性,我们引入了RedactionBench,这是一个手动标注的基准,包含11个领域的200份多样化文档,大多来自真实来源。我们还引入了R-Score,一种新颖的字符级指标,将语义相似的编辑视为等同,并消除浅层格式选择(如电话号码的不同掩码样式)的影响。对命名实体识别模型、实体提取小语言模型以及配备代理工具的前沿模型的评估表明,上下文编辑仍然是一个未解决的问题。在RedactionBench上对80多名用户进行的人类评估揭示了隐私感知的明显分歧。标注者对强制性编辑(89.4%)和安全文本保留(94.1%)的目标标签达成共识,但在上下文编辑(47.7%)上未能达成一致。这种差异证明了上下文隐私的主观性,并推动了R-Score的提出,它将上下文模糊性与严格精确性解耦。我们比较了不同系列的35个模型,并报告了它们在编辑PII方面的性能。最后,我们发布RedactionBench,为未来的隐私保护系统建立基线,希望能激发高效的模型设计和标准化评估。

英文摘要

Large Language Models are increasingly applied to sensitive domains that require redaction of personally identifiable information (PII). While redacting PII is a data cleaning prerequisite, existing benchmarks conflate extraction mechanics with privacy semantics. A public phone number is not equivalent to a phone number in a medical record. Whether information constitutes a violation depends heavily on who holds it, why, and in what context, fundamentally differentiating redaction from simple entity recognition. Grounded in contextual integrity, we introduce RedactionBench, a manually annotated benchmark comprising 200 diverse documents across 11 domains, mostly seeded from real-world sources. We also introduce R-Score, a novel character-level metric that treats semantically similar redactions equally and nullifies shallow formatting choices, such as varying masking styles for phone numbers. Evaluations across Named Entity Recognition models, entity extraction Small Language Models, and frontier models equipped with agentic tools demonstrate that contextual redaction remains an unsolved problem. A human evaluation with over 80 users on RedactionBench reveals a stark dichotomy in privacy perceptions. Annotators show consensus with target labels for mandatory redactions (89.4 percent) and safe text preservations (94.1 percent), but fail to agree on contextual redactions (47.7 percent). This variance demonstrates the subjective nature of contextual privacy and motivates R-Score, which decouples contextual ambiguity from strict precision. We compare 35 models across families and report their performance in redacting PII. Finally, we release RedactionBench to establish a baseline for future privacy-preserving systems, hoping to inspire efficient model design and standardized evaluations.
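The abstract does not define R-Score, so the sketch below is explicitly not R-Score. It only illustrates the general idea of character-level scoring, in which a prediction is judged by which characters it covers rather than by the replacement string it emits, so that differing mask styles do not affect the result. The span format (half-open `(start, end)` character offsets) is an assumption.

```python
from typing import Dict, Iterable, Set, Tuple

Span = Tuple[int, int]  # half-open character offsets [start, end)


def covered_chars(spans: Iterable[Span]) -> Set[int]:
    chars: Set[int] = set()
    for start, end in spans:
        chars.update(range(start, end))
    return chars


def char_level_prf(gold_spans: Iterable[Span], pred_spans: Iterable[Span]) -> Dict[str, float]:
    """Character-level precision, recall and F1 between two sets of redaction spans."""
    gold, pred = covered_chars(gold_spans), covered_chars(pred_spans)
    overlap = len(gold & pred)
    precision = overlap / len(pred) if pred else 1.0  # nothing predicted: no false positives
    recall = overlap / len(gold) if gold else 1.0     # nothing to find: nothing missed
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

A system that masks only half of a phone number gets partial recall under this scheme instead of an all-or-nothing entity miss, which is one reason character-level metrics are attractive for redaction.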

2606.18797 2026-06-18 cs.CL 新提交

Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports

超越标量分数:探索基于LLM的放射学报告临床意义评估指标

Qingyu Lu, Ruochen Li, Liang Ding, Yufei Xia, Youxiang Zhu, Dacheng Tao

发表机构 * Nanyang Technological University(南洋理工大学) Technical University of Munich(慕尼黑工业大学) Alibaba(阿里巴巴) University of Glasgow(格拉斯哥大学) University of Massachusetts Boston(马萨诸塞大学波士顿分校)

AI总结 针对放射学报告评估中临床准确性要求,研究基于LLM的指标区分临床错误与无害变体的能力,发现判别偏差,并通过合成数据训练轻量级指标,在成本敏感部署中优于大型模型。

Comments Under Review

详情
AI中文摘要

对生成的放射学报告进行可靠评估需要严格的临床准确性,因为遗漏关键发现或误判影像学观察结果会直接影响患者护理。现有指标通过将报告质量简化为一个医学上无依据的标量而模糊了这一要求。尽管大型语言模型(LLM)拥有丰富的医学知识,但它们同样难以在临床显著错误和无害变异之间划定可靠边界。我们以ReEvalMed基准为测试平台研究这一边界,并从检测真实临床错误(“判别力”)和容忍无关变异(“鲁棒性”)两方面评估指标的临床意义。在单次和两次设置下对8个LLM评估器进行实验,我们发现了一个普遍的判别偏差:模型能有效检测错误,但也过度惩罚无害的改写。为缓解这一问题,我们合成了4000对报告,并在Qwen3-8B和MedGemma-4B上训练了轻量级可解释指标。我们训练的指标明确了临床意义边界,超越了32B规模的医学LLM,并与专有模型保持竞争力。关键的是,成本更高的两次设置未能持续提升整体性能,主要是在用判别力换取鲁棒性。这些发现表明,单次训练指标是成本敏感部署的实用选择,而两次推理则保留给判别-鲁棒平衡至关重要的场景。我们将发布数据集和指标。

英文摘要

Reliable evaluation of generated radiology reports requires strict clinical accuracy, as omitted critical findings or mischaracterized radiographic observations can directly affect patient care. Existing metrics obscure this requirement by reducing report quality to a medically ungrounded scalar. Although Large Language Models (LLMs) possess rich medical knowledge, they likewise struggle to draw a reliable boundary between clinically significant errors and harmless variation. We study this boundary using the ReEvalMed benchmark as a testbed and evaluate metric-level clinical significance in terms of detecting true clinical errors ("Discrimination") and tolerating insignificant variations ("Robustness"). Across 8 LLM evaluators under one-pass and two-pass settings, we identify a widespread discrimination bias: models effectively detect errors but also over-penalize harmless rephrasings. To mitigate this, we synthesize 4k report pairs and train lightweight interpretable metrics on Qwen3-8B and MedGemma-4B. Our trained metric sharpens the clinical significance boundary, surpassing 32B-scale medical LLMs and remaining competitive with proprietary models. Crucially, the more costly two-pass setting fails to consistently improve overall performance and mainly trades discrimination for robustness. These findings suggest one-pass trained metrics as the practical choice for cost-sensitive deployment, with two-pass inference reserved for settings where the discrimination-robustness balance is critical. We will release the dataset and metric.

2606.18946 2026-06-18 cs.CL 新提交

SenFlow: Inter-Sentence Flow Modeling for AI-Generated Text Detection in Hybrid Documents

SenFlow: 面向混合文档中AI生成文本检测的句间流建模

Jingkun Luo, Yifan Sun, Da-Tian Peng, Guanxiong Pei

发表机构 * Northwestern Polytechnical University(西北工业大学) Zhejiang Lab(浙江实验室)

AI总结 针对人机混合文档的句子级AI文本检测,提出SenFlow模型,通过图传播和CRF解码建模句间依赖,在MOSAIC基准上跨域F1提升4.15个百分点。

Comments 16 pages, 4 figures, 9 tables

详情
AI中文摘要

针对混合文档(人类与LLM共同撰写同一文本)的句子级AI生成文本检测(S-AGTD)面临两个空白:现有方法孤立地对每个句子进行分类,忽略了句间依赖;现有基准遗漏了最新一代生成器。我们构建了MOSAIC基准,包含来自PubMed和XSum的16,000个混合文档,由DeepSeek-V3.2和Kimi K2生成,并经过严格质量控制,包括先前基准中缺失的困惑度一致性过滤器。我们将S-AGTD重新定义为文档句子序列上的结构化预测,并实例化为SenFlow,在句子图的单次文档级传递中,将基于图的句间传播与线性链CRF解码相结合。SenFlow在MOSAIC上达到了最先进的性能,在跨域迁移(三种难度递增协议中最难的一种)上平均Macro-F1提高了4.15个百分点。我们进一步发现,即使困惑度过滤器平衡了显式线索,AI插入仍然保留了一个依赖于生成器的句子长度差距,句子级检测器仍可利用这一点。代码和数据:https://github.com/luojingkun22/SenFlow

英文摘要

Sentence-level AI-generated text detection (S-AGTD) for hybrid documents, where humans and LLMs co-author one text, faces two gaps: existing methods classify each sentence in isolation, discarding inter-sentence dependencies, and existing benchmarks omit the newest generation of generators. We construct MOSAIC, a benchmark of 16,000 hybrid documents over PubMed and XSum, generated by DeepSeek-V3.2 and Kimi K2 under stringent quality controls including a perplexity-consistency filter absent from prior benchmarks. We recast S-AGTD as structured prediction over the document sentence sequence and instantiate it as SenFlow, integrating graph-based inter-sentence propagation with linear-chain CRF decoding in a single document-level pass over a sentence graph. SenFlow reaches state-of-the-art performance on MOSAIC, with a +4.15 pp average Macro-F1 margin on cross-domain transfer, the hardest of three protocols of increasing difficulty. We further find that even after the perplexity filter equalizes overt cues, AI insertions retain a generator-dependent sentence-length gap that sentence-level detectors still exploit. Code and data: https://github.com/luojingkun22/SenFlow

2606.18989 2026-06-18 cs.CL cs.AI 新提交

G-IdiomAlign: A Gloss-Pivoted Benchmark for Cross-Lingual Idiom Alignment

G-IdiomAlign:基于释义的跨语言习语对齐基准

Fengying Ye, Yanming Sun, Runzhe Zhan, Zheqi Zhang, Lidia S. Chao, Derek F. Wong

发表机构 * NLP2CT Lab, Department of Computer and Information Science, University of Macau(NLP2CT实验室,计算机与信息科学系,澳门大学) Faculty of Arts and Humanities, University of Macau(人文学院,澳门大学)

AI总结 提出G-IdiomAlign基准,通过维基词典释义锚定习语,构建高置信度对齐集,并设计多项选择等价测试和释义对比生成协议,揭示大语言模型在习语翻译中的字面翻译偏差。

Comments Accepted to ACL 2026

详情
AI中文摘要

习语由于其非组合性和弱表层形式基础,难以跨语言转换,使得字面映射不可靠。我们提出G-IdiomAlign,一个基于释义的基准,其中每个习语通过维基词典的英语释义进行锚定。我们进一步构建了一个高置信度的参考对齐集,用于可重复评估。G-IdiomAlign支持两种协议:(1)受控的多项选择习语等价测试,带有类型化干扰项用于错误归因;(2)释义对比生成,对比无释义和有释义输入,以隔离显式语义枢轴的影响。在不同的大语言模型中,字面翻译偏差是主要的失败模式,尤其是当目标语言是低资源语言时。在基于嵌入的语义代理下,释义一致地改善了释义对比生成,但性能仍然有限,表明在开放输出空间中存在显著提升空间。随后对Qwen3-8B的分析进一步表明,跨条件差异更多集中在注意力头而非层中,而有释义生成更好的情况与更强的释义锚定相关。

英文摘要

Idioms are difficult to transfer across languages due to their non-compositionality and weak surface-form grounding, making literal mappings unreliable. We present G-IdiomAlign, a gloss-pivoted benchmark where each idiom is anchored by an English gloss from Wiktionary. We further construct a high-confidence reference alignment set for reproducible evaluation. G-IdiomAlign supports two protocols: (1) a controlled Multiple-Choice Idiom Equivalence with typed distractors for error attribution; and (2) a Gloss-Contrastive Generation contrasting No-gloss and With-gloss inputs to isolate the effect of an explicit semantic pivot. Across diverse LLMs, a bias to literal translation is a dominant failure mode, especially when the target is a low-resource language. Glosses consistently improve Gloss-Contrastive Generation under an embedding-based semantic proxy, but performance remains modest, indicating substantial headroom in the open output space. Subsequent analysis on Qwen3-8B further suggests that cross-condition differences are concentrated more in attention heads than in layers, while better With-gloss generations coincide with stronger gloss anchoring.

2606.19051 2026-06-18 cs.CL cs.DL cs.IR 新提交

Which Sections of a Research Paper Best Reveal Its Research Methods? Evidence from Library and Information Science

研究论文的哪些部分最能揭示其研究方法?来自图书馆与信息科学的证据

Qiuyu Fang, Jiayi Hao, Chengzhi Zhang

发表机构 * Department of Information Management, Nanjing University of Science and Technology, China(南京理工大学信息管理学院)

AI总结 提出基于全文分段的组合策略,通过评估不同段落及其组合的分类性能,发现中后段和末尾段对研究方法识别更具区分力,且结合书目元数据可提升分类效果。

Comments ASIST 2026

详情
AI中文摘要

研究方法是学术论文中知识贡献的重要载体。研究方法的自动多标签分类可以支持方法检索、综述生成和研究情报分析等知识服务。现有研究主要依赖标题和摘要,但摘要通常只提供有限的方法信息,而利用全文内容则面临篇幅过长和信息冗余的挑战。因此,本文提出一种根据物理位置划分全文内容的段落组合策略。利用来自图书馆与信息科学领域三种代表性期刊(JASIST、LISR 和 JDoc)的 1,954 篇全文文章的标注语料,我们评估了多种模型下不同段落及其组合的分类性能。实验结果表明,方法信息在全文内容中分布不均匀,中后段和末尾段表现出更强的区分能力。此外,将书目元数据与跨段组合策略相结合,有效提升了分类性能。

英文摘要

Research methods are essential carriers of knowledge contribution in academic papers. Automatic multi-label classification of research methods can support knowledge services such as method retrieval, review generation, and research intelligence analysis. While existing studies primarily rely on titles and abstracts, abstracts often provide only limited methodological information, whereas utilizing full-text content faces challenges related to excessive length and information redundancy. Therefore, this paper proposes a segment combination strategy by partitioning the full-text content according to its physical position. Using an annotated corpus of 1,954 full-text articles from three representative journals in Library and Information Science (JASIST, LISR, and JDoc), we evaluate the classification performance of various segments and their combinations across multiple models. Experimental results indicate that methodological information is distributed unevenly within the full-text content, with the middle-to-late and final segments exhibiting greater discriminative power. Furthermore, integrating bibliographic metadata with cross-segment combination strategies effectively enhances classification performance.
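Partitioning by physical position can be sketched in a few lines. The abstract does not state how many segments the authors use or how they tokenize, so the five-way split and whitespace-free token list in the usage note are assumptions for illustration only.

```python
from typing import List, Sequence


def split_by_position(tokens: Sequence[str], n_segments: int) -> List[List[str]]:
    """Cut a token sequence into n_segments contiguous, near-equal parts."""
    if n_segments <= 0:
        raise ValueError("n_segments must be positive")
    n = len(tokens)
    # Integer arithmetic keeps the boundaries monotone and covers every token once.
    bounds = [i * n // n_segments for i in range(n_segments + 1)]
    return [list(tokens[bounds[i]:bounds[i + 1]]) for i in range(n_segments)]


def combine_segments(segments: List[List[str]], picks: Sequence[int]) -> List[str]:
    """Concatenate the chosen segments, in the order given, as classifier input."""
    return [tok for i in picks for tok in segments[i]]
```

With five segments, `combine_segments(segments, [3, 4])` would build an input from the middle-to-late and final parts of a paper, the regions the abstract reports as most informative about research methods.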

2606.19218 2026-06-18 cs.CL 新提交

RECOM: A Validity Discrimination Tradeoff in Automatic Metrics for Open Ended Reddit Question Answering

RECOM:开放式 Reddit 问答中自动评估指标的有效性与区分性权衡

Pushwitha Krishnappa, Amit Das, Vinija Jain, Aman Chadha, Tathagata Mukherjee

发表机构 * University of Alabama Huntsville(阿拉巴马大学亨茨维尔分校) University of North Alabama(北阿拉巴马大学) Stanford University(斯坦福大学) Meta AI Amazon GenAI(亚马逊GenAI)

AI总结 提出 RECOM 数据集,发现自动评估指标在开放式问答中无法同时兼顾有效性和区分性,余弦相似度有效性高但区分性差,BERTScore 区分性受长度影响且有效性弱。

详情
AI中文摘要

自动评估指标是评估 LLM 生成文本的默认方法,但一个指标被默默要求完成两项任务:区分真实内容对齐与表面巧合(有效性),以及区分更好的系统与更差的系统(区分性)。在开放式、观点驱动的问答中,这两者存在矛盾。我们引入了 RECOM(Reddit Evaluation for Correspondence of Models),一个无污染评估数据集,包含 15,000 个 r/AskReddit 问题(2025 年 9 月),每个问题都配有真实的社区回复,这些回复的发布时间晚于所有被评估模型的训练截止日期。我们用每个指标将五个开源 LLM(7-10B)的输出与每条回复逐一评分,并为每个指标配上随机错排噪声基线,结果发现没有指标能同时做好这两项工作。余弦相似度能很好地区分真实回答与随机回答(Cohen's $d \approx 2$),但无法对五个模型进行排序($|d| < 0.1$);BERTScore 精确度看似能对模型排序(原始 $|d|$ 高达 0.63),但一旦控制回复长度,这一数值骤降至 $|d| = 0.09$,且其有效性较弱($d \approx 0.8$,而余弦相似度约为 2)。由于每个指标对相同的输出进行评分,这种有效性与区分性的权衡是指标的属性,而非模型的属性,我们认为这源于表示设计。三个独立的 LLM 评判员再现了有效性差距,同样只能微弱地区分五个模型。我们建议在两个轴上报告指标,并明确给出随机基线。RECOM 在 https://anonymous.4open.science/r/recom-D4B0 公开提供。

英文摘要

Automatic metrics are the default for evaluating LLM-generated text, yet a metric is quietly asked to do two jobs: tell genuine content alignment from surface coincidence (validity), and tell a better system from a worse one (discriminative power). On open-ended, opinion-driven question answering, the two are in tension. We introduce RECOM (Reddit Evaluation for Correspondence of Models), a contamination-free evaluation dataset of 15,000 r/AskReddit questions (September 2025), each paired with its authentic community replies, which postdate every evaluated model's training cutoff. Scoring five open-source LLMs (7--10B) against every reply, with each metric paired with a random-derangement noise floor, we find that no metric does both jobs well. Cosine similarity separates real from random answers (Cohen's $d \approx 2$) but cannot rank the five models ($|d| < 0.1$); BERTScore precision appears to rank the models (raw $|d|$ up to 0.63), but once response length is controlled this collapses to $|d| = 0.09$ and its validity is weak ($d \approx 0.8$, versus cosine's $\approx 2$). Because every metric scores the same outputs, this validity--discrimination tradeoff is a property of the metrics, not the models, and we argue it stems from representation design. Three independent LLM judges reproduce the validity gap and likewise separate the five models only weakly. We recommend reporting metrics on both axes, with an explicit random-baseline floor. RECOM is publicly available at https://anonymous.4open.science/r/recom-D4B0
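
Editor's sketch (not from the paper): the validity measurement described in this abstract can be pictured as an effect size between scores on true (output, reply) pairs and scores on pairs shuffled by a random derangement. The minimal Python below assumes a pooled-standard-deviation Cohen's d and uses a toy Jaccard token overlap as a stand-in metric; the function names `cohens_d`, `derangement`, `validity_effect`, and `jaccard` are hypothetical, and the authors' actual pipeline may differ.

```python
import random
import statistics


def cohens_d(a, b):
    """Cohen's d between two samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled


def derangement(n, rng):
    """Permutation of range(n) with no fixed points, via rejection sampling."""
    if n < 2:
        raise ValueError("a derangement needs at least two items")
    while True:
        perm = list(range(n))
        rng.shuffle(perm)
        if all(i != p for i, p in enumerate(perm)):
            return perm


def jaccard(x, y):
    """Toy stand-in metric (assumption): token-set overlap between two strings."""
    a, b = set(x.split()), set(y.split())
    return len(a & b) / len(a | b) if (a | b) else 0.0


def validity_effect(score, outputs, references, seed=0):
    """Effect size separating true pairs from deranged (mismatched) pairs."""
    perm = derangement(len(outputs), random.Random(seed))
    real = [score(o, r) for o, r in zip(outputs, references)]
    noise = [score(o, references[j]) for o, j in zip(outputs, perm)]
    return cohens_d(real, noise)
```

A derangement, unlike a plain shuffle, guarantees that no output is accidentally scored against its own reply, so the noise floor contains only mismatched pairs.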

2606.19334 2026-06-18 cs.CL cs.CY cs.LG 新提交

Freeing the Law with LOCUS: A Local Ordinance Corpus for the United States

用LOCUS解放法律:美国地方条例语料库

Denis Peskoff, Joe Barrow, Christopher Vu, Diag Davenport

发表机构 * UC Berkeley(加州大学伯克利分校) School of Information(信息学院) Independent(独立研究者)

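AI summary To address the lack of a machine-readable corpus of U.S. local ordinances, builds the LOCUS corpus with ordinance codes from 9,239 cities and counties, and trains ModernBERT-based classifiers to analyze dimensions of local law such as opacity.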

Comments 14 pages, 6 figures

详情
AI中文摘要

法律人工智能的进展越来越依赖于大规模获取权威法律文本。然而,美国法律中最具影响力的层级之一——地方条例——在很大程度上仍然缺失于现有的机器可读语料库中。地方法规管辖着分区、住房、商业许可、公共卫生、噪音、动物控制以及许多其他日常监管领域,但它们分散在专为人类浏览而非批量研究访问设计的供应商平台上。我们引入了LOCUS——美国地方条例语料库——一个全面的语料库和县级统一访问层,用于美国市和县条例。原始语料库可供研究人员发布,几乎涵盖了所有公开可用的市和县条例。由此产生的原始语料库包含来自9239个城市和县的法规。一个较小的县级统一LOCUS访问层覆盖了美国3144个县中最大的2309个,覆盖了大部分人口。我们使用OCR来处理使法律无法成为公共资源的各种文档格式。我们发布了带有覆盖元数据的语料库,以支持可重复性、下游法律AI研究以及逐步扩展对地方法律的机器可读访问。我们训练了一系列基于ModernBERT的分类器和评分器,以便从多个维度分析美国地方法律,例如不透明性和家长式作风,这些维度以前从未在此规模上研究过。LOCUS-v1及其衍生模型可在以下网址获取:this https URL

英文摘要

Progress in legal AI increasingly depends on access to authoritative legal text at scale. Yet one of the most consequential layers of American law remains largely absent from existing machine-readable corpora: local ordinances. Local codes govern zoning, housing, business licensing, public health, noise, animal control, and many other domains of everyday regulation, but they are fragmented across vendor platforms designed for human browsing rather than bulk research access. We introduce LOCUS - the Local Ordinance Corpus for the United States - a comprehensive corpus and county-harmonized access layer for U.S. municipal and county ordinance codes. The raw corpus, available for release to researchers, represents nearly all publicly available municipal and county ordinance codes. The resulting raw corpus contains codes from 9,239 cities and counties. A smaller county-harmonized LOCUS access layer provides coverage for the largest 2,309 of 3,144 U.S. counties, accounting for a majority of the population. We use OCR to handle the myriad of document formats that have kept the law from being a public resource. We release the corpus with coverage metadata to support reproducibility, downstream legal AI research, and the incremental expansion of machine-readable access to local law. We train a collection of ModernBERT-based classifiers and scorers to facilitate analyzing U.S. local law among several dimensions, such as opacity and paternalism, that have not previously been studied at this scale. LOCUS-v1 and its derivative models are available at: https://huggingface.co/datasets/LocalLaws/LOCUS-v1

9. Safety, Privacy, Fairness, and Interpretable NLP 6 papers

2606.18372 2026-06-18 cs.CL cs.AI 新提交

Redact or Keep? A Fully Local AI Cascade for Educational Dialogue De-Identification

保留还是删除?用于教育对话去标识的完全本地AI级联框架

Haocheng Zhang, Zhuqian Zhou, Kirk Vanacore, Bakhtawar Ahtisham, René F. Kizilcec

发表机构 * Cornell University(康奈尔大学)

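AI summary To address the confusion between curricular terms and personally identifiable information in educational dialogue, proposes a fully local cascade framework in which a recall-first union proposer and a context-aware reviewer perform constrained privacy triage; it reaches 0.958 macro F1 on math tutoring dialogue, outperforming a commercial API and LLM-only baselines.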

详情
AI中文摘要

教育对话是研究中有价值但敏感的资源:捕捉真实学习的同一份转录往往也包含与课程内容纠缠的个人身份信息(PII),其中“Riemann”可能指真实学生或数学概念。现有方法在治理和准确性之间强制权衡。商业大型语言模型(LLM)可以处理这种歧义,但需要将学生数据发送给第三方,而本地命名实体识别(NER)系统保留治理但过度删除课程术语。我们提出一个完全本地的级联框架,将去标识从开放式实体识别重新定义为约束性隐私分类。一个召回优先的联合提议器结合两个轻量级编码器和确定性规则,过度生成候选跨度;然后一个上下文感知审查器利用周围对话和说话者角色对每个候选做出二元的保留/删除决策。我们在两个大型平台的数学辅导转录上评估了三种审查器配置,与同系列纯LLM基线和商业API进行比较。最强的本地配置达到0.958宏F1,而同系列纯LLM基线为0.767,商业API为0.706,同时完全在单个笔记本电脑上运行。在针对课程-人名歧义的挑战集上,相同配置仅下降0.03 F1,而较小审查器下降0.19至0.25。这些结果表明,对于教育去标识,问题表述比模型规模更重要。

英文摘要

Educational dialogue is a valuable but sensitive resource for research: the same transcripts that capture authentic learning often capture personally identifiable information (PII) entangled with curricular content, where "Riemann" may refer to a real student or to a mathematical concept. Existing approaches force a tradeoff between governance and accuracy. Commercial Large Language Models (LLMs) can handle this ambiguity but require sending student data to third parties, while local named entity recognition (NER) systems preserve governance but over-redact curricular terms. We propose a fully local cascade framework that reframes de-identification from open-ended entity recognition to constrained privacy triage. A recall-first union proposer combines two lightweight encoders with deterministic rules to over-generate candidate spans; a context-aware reviewer then makes a binary Redact/Keep decision for each candidate using surrounding dialogue and speaker role. We evaluate three reviewer configurations against same-family LLM-only baselines and a commercial API on math tutoring transcripts from two large platforms. The strongest local configuration reaches 0.958 macro F1, compared with 0.767 for a same-family LLM-only baseline and 0.706 for the commercial API, while running entirely on a single laptop. On a targeted challenge set of curricular-personal name ambiguity, the same configuration degrades by only 0.03 F1 versus 0.19 to 0.25 for smaller reviewers. These results suggest that for educational de-identification, problem formulation matters more than model scale.

2606.18473 2026-06-18 cs.CL 新提交

PreUnlearn: Auditing Collateral Knowledge Damage Before Large Language Model Unlearning

PreUnlearn: 在大语言模型遗忘之前审计附带知识损害

Bo Su, Ankit Shah, Thai Le

发表机构 * Indiana University Bloomington(印第安纳大学布卢明顿分校)

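AI summary Proposes PreUnlearn, which uses data features to predict the collateral damage an unlearning run will cause to same-domain and distant-domain knowledge, enabling risk auditing before unlearning is executed.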

Comments 12 pages, 6 figures

详情
AI中文摘要

大语言模型(LLMs)的机器遗忘旨在移除特定知识,同时保留模型其余能力。然而,遗忘与保留知识之间的界限往往不明确,因为相关甚至遥远的信息可能在模型中纠缠。在本文中,我们从数据中心的视角研究LLM遗忘,并衡量遗忘效应如何从遗忘集传播到同领域和远距离知识。我们发现一致的衰减模式:附带损害在遗忘集附近最强,随语义距离减弱,但不会在领域边界消失。我们进一步询问这种损害是否可以在执行遗忘之前被审计。我们将遗忘集审计制定为遗忘前预测任务,并分析哪些数据特征最能预测下游损害。我们的结果表明,遗忘集与评估集之间的交互特征提供了最强的信号,表明附带损害部分反映在模型更新前的数据几何中。这些发现将遗忘集审计定位为识别风险遗忘运行和设计更可靠遗忘程序的早期预警工具。

英文摘要

Machine unlearning for large language models (LLMs) aims to remove specified knowledge while preserving the rest of the model's capabilities. However, the boundary between knowledge to forget and knowledge to retain is often unclear, since related and even distant information may be entangled in the model. In this paper, we study LLM unlearning from a data-centric perspective and measure how unlearning effects propagate from the forget set to same-domain and distant-domain knowledge. We find a consistent decay pattern: collateral damage is strongest near the forget set, weakens with semantic distance, but does not disappear at domain boundaries. We further ask whether such damage can be audited before unlearning is executed. We formulate forget-set auditing as a pre-unlearning prediction task and analyze which data features are most predictive of downstream damage. Our results show that interaction features between the forget set and evaluation set provide the strongest signals, suggesting that collateral damage is partly reflected in data geometry before model updates occur. These findings position forget-set auditing as an early warning tool for identifying risky unlearning runs and designing more reliable unlearning procedures.

2606.18606 2026-06-18 cs.CL cs.AI 新提交

Steerable Cultural Preference Optimization of Reward Models

可引导的文化偏好优化奖励模型

Minsik Oh, Advit Deepak, Sophie Wu, Douwe Kiela, Ekaterina Shutova

发表机构 * Stanford University(斯坦福大学) University of Amsterdam(阿姆斯特丹大学)

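AI summary Proposes the SCPO algorithm, which trains reward models by balancing multiple cultural preferences; on the PRISM and GlobalOpinionQA datasets it improves minority reward model performance by up to 7 points and is up to 280% more training data-efficient than full-data finetuning.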

Comments Accepted to Pluralistic Alignment @ ICML 2026

详情
AI中文摘要

大型语言模型(LLM)技术以每个文化子社区可接受的方式服务于众多不同文化子社区至关重要。然而,迄今为止,关于LLM对齐的研究主要集中于预测来自特定地区的标注者的统一响应偏好。本文旨在以更全球化的视角推进对齐模型的发展,使其能够准确代表子社区的偏好,并且不对任何子社区表现出过度偏见。我们专注于为此目的开发奖励模型,并提出一种新颖的奖励模型训练算法(SCPO),该算法能够以平衡的方式融入多样化的文化偏好。我们的方法使得少数群体奖励模型在两个数据集(PRISM和GlobalOpinionQA)以及7个国家上的性能比基线模型提升最多7点。SCPO在训练数据效率上比奖励模型的完整数据微调高出最多280%。此外,我们通过分别评估子社区的偏好来进行偏见分析,并表明我们的加权方法减轻了过度偏见。我们的代码可在以下网址获取:this https URL

英文摘要

It is essential for large language model (LLM) technology to serve many different cultural sub-communities in a manner that is acceptable to each community. However, research on LLM alignment has so far predominantly focused on predicting a unified response preference of annotators from certain regions. This paper aims to advance the development of alignment models with a more global outlook, that are able to accurately represent the preferences of subcommunities and do not exhibit excessive bias towards any of them. We focus on the development of reward models for this purpose and present a novel reward model training algorithm (SCPO) that can incorporate diverse cultural preferences in a balanced manner. Our method results in performance increases of the minority reward model of up to 7 points over the baseline model across two datasets, PRISM and GlobalOpinionQA, and across 7 countries. SCPO is up to 280% more training data-efficient than full-data finetuning of reward models. In addition, we perform analysis of bias by separately evaluating on the preference of subcommunities and show that excessive bias is mitigated via our weighting method. Our code is available at https://github.com/minsik-ai/Steerable-Cultural-Preference

2606.18656 2026-06-18 cs.CL 新提交

The Wrong Kind of Right: Quantifying and Localizing Misfired Alignment in LLMs

错误的正确:量化和定位大语言模型中的失调对齐

Naihao Deng, Yiming Feng, Chimaobi Okite, Kaijian Zou, Lu Wang, Rada Mihalcea, Yulong Chen

发表机构 * University of Michigan(密歇根大学) University of Cambridge(剑桥大学) University of Aberdeen(阿伯丁大学)

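AI summary Introduces the VETO benchmark and the Misfired Alignment Rate (MAR) metric, finds that all LLMs show non-trivial misfired alignment on stereotype-related questions while human participants score 0%, and shows in controlled priming experiments that alignment-induced cues amplify the effect.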

详情
AI中文摘要

警告:本文研究刻板印象和偏见,包含可能令人不适的例子,仅用于说明目的。我们的发现不应被解释为反对对齐的论据。相反,本文强调了需要更先进对齐的原则性方法。对齐旨在确保大语言模型(LLMs)安全可靠地行为,包括避免不安全的推理。然而,我们表明这种安全导向的行为可能误触发:模型可能拒绝有根据的结论,即使上下文明确支持它们。我们将这种失败模式称为失调对齐,其中对齐引起的改变导致LLMs覆盖显式证据。为了量化这一现象,特别是针对刻板印象相关的对齐,我们引入了VETO,一个由2,032个BBQ派生对比对组成的基准,并定义了一个新指标,失调对齐率(MAR),它衡量在0到100的尺度上,模型在刻板印象相关问题上失败但在其对比对应问题上成功的频率。我们在VETO上对25个LLMs进行了基准测试,并表明所有LLMs,包括最新的,都表现出非平凡的(4.7%至18.9%)MAR,而所有人类参与者达到0.0%的MAR。受控启动实验进一步表明,对齐诱导的线索可以显著放大LLMs的MAR,表明这些失败不仅仅是单个例子的伪影,而是可以由安全相关的框架诱导。对开放权重LLMs的机制分析揭示了后期层对证据支持答案的抑制,并且指令模型与基础模型之间的比较表明这种抑制在指令训练后出现。这些发现表明,当前的对齐方法可能过度泛化表面安全线索,以至于覆盖客观证据,这激励了更多关于更好保持上下文基础的对齐目标的工作。

英文摘要

Warning: This paper studies stereotypes and biases, and contains potentially disturbing examples, used for illustration purposes only. Our findings should not be interpreted as an argument against alignment. Instead, this paper highlights the need for principled approaches to more advanced alignment. Alignment aims to ensure that large language models (LLMs) behave safely and reliably, including by avoiding unsafe inferences. However, we show that such safety-oriented behaviors can misfire: models may reject warranted conclusions even when they are explicitly supported by context. We call this failure mode misfired alignment, where alignment-induced changes cause LLMs to override explicit evidence. To quantify this phenomenon, specifically on stereotype-related alignment, we introduce VETO, a benchmark consisting of 2,032 BBQ-derived contrastive pairs, and define a new metric, Misfired Alignment Rate (MAR), which measures on a 0 to 100 scale how often a model fails on a stereotype-related question but succeeds on its contrastive counterpart. We benchmark 25 LLMs on VETO, and show that all LLMs, including the most recent ones, exhibit non-trivial (4.7 to 18.9%) MARs while all human participants achieve 0.0% MAR. Controlled priming experiments further show that alignment-induced cues can substantially amplify MAR across LLMs, indicating that these failures are not merely artifacts of individual examples but can be induced by safety-related framing. Mechanistic analyses on open-weight LLMs reveal late-layer suppression of evidence-supported answers, and comparisons between instruct and base LLMs suggest that this suppression emerges after instruction training. These findings show that current alignment methods can overgeneralize surface-level safety cues, to the point of overriding objective evidence, motivating more work on alignment objectives that better preserve contextual grounding.
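
Editor's sketch (not from the paper): the MAR definition in this abstract can be read as a simple rate over contrastive pairs. The Python below assumes each pair is recorded as two booleans and that the denominator is the total number of pairs; the abstract does not spell out the normalization, so that choice and the name `misfired_alignment_rate` are assumptions.

```python
def misfired_alignment_rate(pairs):
    """Percentage (0 to 100) of contrastive pairs that 'misfire'.

    pairs: iterable of (stereotype_correct, contrast_correct) booleans.
    A pair misfires when the model fails the stereotype-related question
    but answers its contrastive counterpart correctly.
    Assumption: the rate is normalized by the total number of pairs.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one contrastive pair")
    misfired = sum(1 for s_ok, c_ok in pairs if (not s_ok) and c_ok)
    return 100.0 * misfired / len(pairs)


# Usage: one of four pairs fails only on the stereotype-related question.
results = [(False, True), (True, True), (False, False), (True, False)]
rate = misfired_alignment_rate(results)
```

Requiring success on the contrastive counterpart is what separates a misfire from ordinary inability to answer: a pair the model fails on both sides does not count.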

2606.18767 2026-06-18 cs.CL 新提交

Output Vector Editing for Memorization Mitigation in Large Language Models

输出向量编辑:缓解大型语言模型中的记忆化问题

Ahmad Dawar Hakimi, Kaiwei Lei, Isabelle Augenstein, Hinrich Schütze

发表机构 * Center for Information and Language Processing, LMU Munich(慕尼黑大学语言与信息处理中心) Department of Computer Science, University of Copenhagen(哥本哈根大学计算机科学系) Munich Center for Machine Learning(慕尼黑机器学习中心) Pioneer Centre for AI(人工智能先锋中心)

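AI summary Proposes output vector editing, a constrained-optimization method that modifies the output vectors of MLP neurons to introduce a distractor, suppressing memorized sequences without changing activations; it achieves up to 87.9% suppression on OLMo-7B and identifies a mechanistic boundary of MLP-only editing.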

详情
AI中文摘要

大型语言模型会记忆并复现训练数据中的序列,从而带来隐私、版权和安全风险。现有的神经元级缓解方法将编辑等同于将神经元激活归零,但激活仅控制神经元是否参与;输出向量才是写入残差流的内容,并通过叠加编码多个特征。我们提出输出向量编辑,这是一种约束优化的权重编辑方法,定位负责记忆化延续的一小组MLP神经元,并最小程度地修改其输出向量,以在词汇空间中引入干扰项,从而重定向它们在残差流中的贡献,同时保持激活不变。在四个模型(SmolLM-360M、OLMo-1B、OLMo-7B、Llama2-7B,参数规模从360M到7B)上进行评估,我们重点研究OLMo-7B(其开放权重和预训练语料库支持系统化挖掘),挖掘了6831个记忆化序列,实现了高达87.9%的抑制率。在相同定位的神经元上,与零消融相比,2.7倍的差距表明抑制来自输出向量编辑,而非仅定位。四种编辑模式涵盖了从激进抑制到最小重定向的谱系;集成使用时覆盖了96.5%的记忆化序列,而我们推荐的单一模式配置达到了81.5%,且没有灾难性的局部性失败。我们进一步识别了一个机制边界:约14%的序列无法通过仅MLP编辑达到;虽然这些失败总体上并非由注意力驱动,但消融贡献最大的注意力头可恢复其中60-64%,对于从前缀复制token的延续,恢复更强,这表明注意力是互补的后备机制而非主要机制。编辑模式排序和成功-局部性权衡在所有四个模型上迁移,成功率随模型规模而非家族增长。

英文摘要

Large language models memorize and reproduce sequences from their training data, creating privacy, copyright, and security risks. Existing neuron-level mitigation methods equate editing with zeroing out neuron activations, but the activation only controls whether a neuron engages; the output vector is what writes to the residual stream and, through superposition, encodes multiple features. We propose output vector editing, a constrained-optimization weight edit that locates a small set of MLP neurons responsible for a memorized continuation and minimally modifies their output vectors to introduce a distractor in vocabulary space, redirecting their residual-stream contributions while leaving activations unchanged. Evaluating on four models from 360M to 7B parameters (SmolLM-360M, OLMo-1B, OLMo-7B, Llama2-7B), we center on OLMo-7B (whose open weights and pretraining corpus enable systematic mining) and mine 6831 memorized sequences, achieving up to 87.9% suppression. The 2.7$\times$ gap over zero ablation on the same located neurons shows the suppression comes from the output-vector edit, not localization alone. Four edit modes span a spectrum from aggressive suppression to minimal redirection; in ensemble they cover 96.5% of memorized sequences, while our recommended single-mode configuration reaches 81.5% with no catastrophic locality failures. We further identify a mechanistic boundary at ${\sim}14\%$ of sequences unreachable by MLP-only editing; while these failures are not attention-driven overall, ablating the top contributing attention heads recovers 60--64% of them, with stronger recovery on continuations that copy tokens from the prefix, positioning attention as a complementary fallback rather than a primary mechanism. Edit mode ordering and the success-locality trade-off transfer across all four models, with success rates scaling with model size rather than family.

2606.18852 2026-06-18 cs.CL cs.AI 新提交

Aligning Implied Statements for Implicit Hate Speech Generalizability with Context-Bounded Semi-hard Negative Mining

对齐隐含陈述:通过上下文边界半硬负挖掘实现隐式仇恨言论的泛化性

Wicaksono Leksono Muhamad, Yunita Sari

发表机构 * Mantera Studio(Mantera工作室) Universitas Gadjah Mada(加雅玛大学)

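AI summary Proposes ImpSH, a triplet-based framework that aligns posts with implied statements and uses context-bounded semi-hard negatives to focus learning, improving cross-domain generalization for implicit hate speech; it often outperforms supervised contrastive baselines across several datasets.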

详情
AI中文摘要

隐式仇恨言论分类仍然是一个挑战,因为意图通常通过暗示和上下文而非明确辱骂来掩盖。先前的监督对比方法改进了域内检测,但可能过拟合表面线索,且难以跨数据集迁移。我们提出ImpSH,一个基于三元组的框架,当隐含陈述可用时将其与帖子对齐,并使用上下文边界半硬负样本将学习聚焦于近混淆项。我们还研究了AugSH,它通过数据增强形成正样本。在使用BERT和HateBERT对IHC、SBIC和DynaHate进行的受控评估中,ImpSH是标准监督对比基线的可行替代方案,并且在匹配的预处理和调优预算下通常能提高跨域性能。使用对齐性和均匀性进行的表示分析表明,正样本对更紧密且全局分布平衡,定性最近邻案例研究展示了域转移下的典型假负例。这些结果表明,通过上下文边界挖掘将帖子与其隐含陈述对齐,提供了到相关暗示的更稳定、类似双射的映射,克服了传统基于聚类的表示学习固有的波动性。

英文摘要

Classifying implicit hate speech remains a challenge, as intent is often masked through insinuation and context rather than explicit slurs. Prior supervised contrastive approaches improve in-domain detection but can overfit surface cues and struggle to transfer across datasets. We propose ImpSH, a triplet-based framework that aligns posts with implied statements when available and uses context-bounded semi-hard negatives to focus learning on near confusions. We also examine AugSH, which forms positives via data augmentation. In controlled evaluations on IHC, SBIC, and DynaHate with BERT and HateBERT, ImpSH is a viable alternative to standard supervised contrastive baselines and often improves cross-domain performance under matched preprocessing and tuning budgets. Representation analysis using alignment and uniformity indicates tighter positive pairs with balanced global spread, and qualitative nearest-neighbor case studies illustrate typical false negatives under domain shift. These results demonstrate that aligning posts with their implied statements via context-bounded mining provides a more stable, bijective-like mapping to related insinuations, overcoming the volatility inherent in traditional clustering-based representation learning.
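
Editor's sketch (not from the paper): the abstract does not specify how the "context bound" is applied, so the Python below shows only the standard semi-hard rule from triplet-loss training, in which a negative is kept when it is farther from the anchor than the positive but still inside the margin. The Euclidean distance, the margin value, and the function names are illustrative assumptions and not the ImpSH implementation.

```python
import math


def semi_hard_negatives(anchor, positive, negatives, margin=0.2):
    """Indices of negatives n with d(a, p) < d(a, n) < d(a, p) + margin."""
    d_ap = math.dist(anchor, positive)
    return [
        i for i, neg in enumerate(negatives)
        if d_ap < math.dist(anchor, neg) < d_ap + margin
    ]


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, math.dist(anchor, positive) - math.dist(anchor, negative) + margin)


# Usage with 2-D toy embeddings: a hard, a semi-hard, and an easy negative.
anchor, positive = (0.0, 0.0), (1.0, 0.0)
candidates = [(0.5, 0.0), (1.1, 0.0), (2.0, 0.0)]
picked = semi_hard_negatives(anchor, positive, candidates)
```

Semi-hard negatives still produce a positive loss, so they carry gradient signal, while avoiding the hardest negatives, which in noisy data are often mislabeled.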

10. Low-Resource, Domain Adaptation, and Efficient Training 4 papers

2606.18389 2026-06-18 cs.CL 新提交

Want Better Synthetic Data? Steer It: Activation Steering for Low-Resource Language Generation

想要更好的合成数据?引导它:面向低资源语言生成的激活引导

Jan Cegin, Daniil Gurgurov, Yusser Al Ghussin, Simon Ostermann

发表机构 * Kempelen Institute of Intelligent Technologies(肯佩伦智能技术研究所) German Research Institute for Artificial Intelligence (DFKI)(德国人工智能研究中心(DFKI))

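AI summary Proposes activation steering, in the form of Language Steering and Quality Steering, as an alternative for low-resource synthetic data generation; experiments show that steering on early layers improves data diversity and often downstream model performance.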

Comments 25 pages

详情
AI中文摘要

大型语言模型(LLMs)已成为合成数据生成的有效工具,包括低资源语言,生成的数据可以提升下游任务性能。当前最佳方法通常依赖于目标语言示例的少样本提示,这增加了推理成本,并可能通过词汇锚定降低多样性。在这项工作中,我们研究激活引导作为低资源合成数据生成的替代方案。我们研究了两种引导策略:语言引导,针对语言的 linguistic identity;以及质量引导,通过对比人类撰写和反向翻译的文本表示来捕捉良好形式性。我们在四个开源LLM、多个层和11种类型多样的语言上评估这些方法,通过生成情感和主题分类数据并微调较小的分类器。引导在零样本和少样本提示设置中应用,并与非引导对应方法进行比较。我们的结果表明,早期层的引导一致地提高了生成数据的多样性,同时通常产生更强的下游模型性能,特别是对于低资源语言。

英文摘要

Large language models (LLMs) have become an effective tool for synthetic data generation, including for low-resource languages, where generated data can improve downstream task performance. Current best-performing approaches typically rely on few-shot prompting with target-language examples, which increases inference costs and may reduce diversity through lexical anchoring. In this work, we investigate activation steering as an alternative for low-resource synthetic data generation. We study two steering strategies: Language Steering, which targets the linguistic identity of a language, and Quality Steering, which captures well-formedness by contrasting human-written and backtranslated text representations. We evaluate these methods across four open-source LLMs, multiple layers, and 11 typologically diverse languages by generating sentiment and topic classification data and finetuning smaller classifiers. Steering is applied in both zero-shot and few-shot prompting settings and compared against non-steered counterparts. Our results show that steering on early layers consistently improves the diversity of generated data while often yielding stronger downstream model performance, particularly for low-resource languages.
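
Editor's sketch (not from the paper): one common way to build a steering vector from two contrasting sets of representations, such as the human-written versus backtranslated text mentioned in this abstract, is the difference of their mean hidden states, added to a layer's activations at generation time. The Python below uses plain lists in place of real model activations; the difference-of-means construction, the scaling factor `alpha`, and all function names are assumptions, since the abstract does not state how the vectors are computed.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(target_reps, contrast_reps):
    """Difference of means: points from the contrast set toward the target set."""
    target_mean = mean_vector(target_reps)
    contrast_mean = mean_vector(contrast_reps)
    return [t - c for t, c in zip(target_mean, contrast_mean)]


def steer(hidden, vector, alpha=1.0):
    """Shift one hidden state along the steering direction, scaled by alpha."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Usage with toy 2-D "hidden states" standing in for layer activations.
human_written = [[1.0, 2.0], [3.0, 4.0]]
backtranslated = [[0.0, 0.0], [2.0, 2.0]]
direction = steering_vector(human_written, backtranslated)
shifted = steer([0.0, 0.0], direction, alpha=0.5)
```

In a real model the same addition would be applied inside a forward hook at the chosen layer, with `alpha` tuned per layer and language.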

2606.18597 2026-06-18 cs.CL 新提交

Low-resource Language Discrimination Towards Chinese Dialects with Transfer learning and Data Augmentation

低资源中文方言辨识:基于迁移学习与数据增强

Fan Xu, Yangjie Dan, Keyu Yan, Yong Ma, Mingwen Wang

发表机构 * Jiangxi Normal University(江西师范大学)

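AI summary To address the scarcity of annotated resources for Chinese dialects, proposes the CDDTLDA framework, which combines transfer learning and data augmentation: a source-side ASR model is fine-tuned on augmented target-side data, self-attention captures common semantic features, and the model significantly outperforms existing methods.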

Comments Published in ACM TALLIP

详情
AI中文摘要

中文方言辨识是一项具有挑战性的自然语言处理任务,由于标注资源稀缺。本文中,我们开发了一种新颖的中文方言辨识框架,结合迁移学习与数据增强(CDDTLDA),以克服资源短缺问题。具体来说,我们首先使用一个较大的中文方言语料库训练一个源端自动语音识别(ASR)模型。然后,我们采用一种简单但有效的数据增强方法(即速度、音高和噪声干扰)来增强目标端低资源中文方言,并基于之前的源端ASR模型微调另一个目标ASR模型。同时,通过使用自注意力机制,可以捕获源端和目标端ASR模型之间的潜在共性语义特征。最后,我们提取目标ASR模型中的隐藏语义表示来进行中文方言辨识。我们广泛的实验结果表明,我们的模型在两个基准中文方言语料库上显著优于最先进的方法。

英文摘要

Chinese dialect discrimination is a challenging natural language processing task because annotated resources are scarce. In this article, we develop a novel Chinese dialect discrimination framework with transfer learning and data augmentation (CDDTLDA) to overcome this shortage of resources. Specifically, we first use a relatively large Chinese dialect corpus to train a source-side automatic speech recognition (ASR) model. Then, we adopt a simple but effective data augmentation method (i.e., speed, pitch, and noise perturbation) to augment the target-side low-resource Chinese dialects, and fine-tune a target ASR model from the source-side ASR model. Meanwhile, the potential common semantic features between the source-side and target-side ASR models are captured with a self-attention mechanism. Finally, we extract the hidden semantic representation from the target ASR model to perform Chinese dialect discrimination. Our extensive experimental results demonstrate that our model significantly outperforms state-of-the-art methods on two benchmark Chinese dialect corpora.

2606.18875 2026-06-18 cs.CL 新提交

Efficient Financial Language Understanding via Distillation with Synthetic Data

通过合成数据蒸馏实现高效金融语言理解

Wen-Fong Huang, Edwin Simpson

发表机构 * School of Engineering Mathematics and Technology(工程数学与技术学院) University of Bristol(布里斯托大学)

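AI summary Proposes a framework for financial sentiment analysis under low-resource conditions via distillation with synthetic data; clustering-based seed selection yields representative synthetic data, letting compact models reach strong performance with few labels and even outperform the teacher on a noisier text domain.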

详情
Journal ref
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), European Language Resources Association (ELRA), 2026, pp. 10242-10254
AI中文摘要

大型指令跟随模型功能强大但部署成本高昂,尤其在金融领域,标注数据因保密性和专家标注成本而受限。我们提出一种通过合成数据蒸馏进行金融情感分析的高效框架,将知识从大型指令调优教师模型迁移到紧凑的学生模型。该框架专为低资源条件设计,其中收集并手工标注少量真实样本。框架随后对样本进行聚类,并利用聚类结果选择种子,通过结构化少样本提示生成合成样本。实验表明,基于聚类的种子选择比随机采样能生成更具代表性的合成数据,使紧凑模型在极少量监督下实现强性能。值得注意的是,在更复杂且噪声更多的文本领域,基于完整合成种子语料库训练的紧凑模型甚至优于教师模型,同时在正式文本上保持竞争力。该框架为金融NLP中资源高效的领域自适应提供了一条实用途径,且只需最少的人工标注工作。

英文摘要

Large instruction-following models are powerful but costly to deploy, particularly in finance, where labelled data are limited by confidentiality and expert annotation cost. We present an efficient framework for financial sentiment analysis through distillation with synthetic data, transferring knowledge from a large instruction-tuned teacher to compact student models. The framework is designed for low-resource conditions, where a small set of real examples are collected and labelled by hand. The framework then clusters the examples and uses the clusters to select seeds for generating synthetic examples via structured few-shot prompting. Experiments show that clustering-based seed selection yields more representative synthetic data than random sampling, enabling compact models to achieve strong performance with minimal supervision. Notably, on a more complex and noisy text domain, the compact model trained on the complete synthetic-seed corpus even outperforms the teacher model, while remaining competitive on formal text. The framework provides a practical route toward resource-efficient domain adaptation in financial NLP with minimal human labelling effort.

2606.19266 2026-06-18 cs.CL cs.AI 新提交

Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA

医学LLM适应中的权衡:法语问答的实证研究

Ikram Belmadani, Oumaima El Khettari, Carlos Ramisch, Frederic Bechet, Richard Dufour, Benoit Favre

发表机构 * Aix-Marseille Univ., CNRS, LIS UMR 7020(艾克斯-马赛大学,法国国家科学研究中心,计算机与系统实验室) Nantes Univ., École Centrale Nantes, CNRS, LS2N UMR 6004(南特大学,南特中央理工学院,法国国家科学研究中心,数字科学实验室) Grenoble Alpes Univ., CNRS, INRIA, Grenoble INP, LIG UMR 5217(格勒诺布尔-阿尔卑斯大学,法国国家科学研究中心,法国国家信息与自动化研究所,格勒诺布尔理工学院,信息学实验室)

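AI summary Using French medical question answering, empirically compares continual pretraining (CPT) and supervised fine-tuning (SFT) across multiple model families and sizes; CPT+SFT scores best on multiple-choice QA but with small gains, making SFT a strong, cost-effective default, while CPT improves overlap-based metrics on open-ended QA.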

详情
AI中文摘要

大型语言模型(LLMs)的发展导致了对它们适应专业领域和语言的关注增加,但领域适应策略的有效性仍不明确。我们以法语医学问答(QA)为案例,进行了医学领域适应的研究。我们比较了持续预训练(CPT)、监督微调(SFT)及其组合,跨越三个模型家族、多个规模和三种初始化类型,明确区分了适应效果与基础模型选择。我们在贪婪和约束解码下,使用自动指标和LLM-as-a-Judge评估,评估了多项选择问答(MCQA)和开放式问答(OEQA)。对于MCQA,CPT+SFT通常取得最佳分数,但相比SFT的增益很小且通常不显著,使得SFT成为强大且成本效益高的默认选择。对于OEQA,CPT持续改善基于重叠的指标,而SFT常降低生成质量;指令调优和CPT+SFT在基于LLM的评估中更受青睐。跨语言实验进一步显示,法语适应能有效迁移到英语基准。总体而言,我们为在计算约束下选择适应策略提供了实用指南。

英文摘要

The development of large language models (LLMs) has led to an increased focus on their adaptation to specialized domains and languages, yet the effectiveness of domain adaptation strategies remains unclear. We present a study of medical domain adaptation using French medical question-answering (QA) as a case study. We compare continual pretraining (CPT), supervised fine-tuning (SFT), and their combination across three model families, multiple sizes, and three initialization types, explicitly disentangling adaptation effects from base model choice. We evaluate both multiple-choice (MCQA) and open-ended QA (OEQA) under greedy and constrained decoding using automatic metrics and LLM-as-a-Judge evaluation. For MCQA, CPT+SFT most often achieves the best scores, but gains over SFT are small and frequently not statistically significant, making SFT a strong and cost-effective default. For OEQA, CPT consistently improves overlap-based metrics, while SFT often degrades generation quality; instruction tuning and CPT+SFT are preferred by LLM-based evaluation. Cross-lingual experiments further show effective transfer from French adaptation to English benchmarks. Overall, we provide practical guidelines for selecting adaptation strategies under computational constraints.