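arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering & systems.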

2606.19348 2026-06-19 cs.CL cs.AI New submission

Granularity-Regulated Adaptive Computational Efficiency: Optimal Verification in Test-Time Scaling

Ardit Krasniqi, Luan Vejsiu, Elira Dervishi

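Affiliations: European University of Tirana

AI summary: Proposes the GRACE theoretical framework, which models verification granularity as a function of problem difficulty, verifier accuracy, and compute budget, and proves a phase transition: fine-grained verification dominates when the compute budget is large or the problem is hard, coarse-grained verification is better for easy problems at low budget, and an adaptive strategy can reach the compute-performance Pareto frontier.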

Abstract

Test-time scaling (TTS) has emerged as a powerful paradigm for improving the reasoning performance of large language models (LLMs) by investing additional compute at inference time. A central component of TTS is the verifier, which selects or scores candidate solutions to guide the search process. While prior work has explored the benefit of verification, a fundamental question remains underexplored: what is the optimal granularity of verification under a given compute budget? Coarse-grained outcome reward models (ORMs) and fine-grained process reward models (PRMs) represent two extremes, yet neither alone achieves compute-optimality across all regimes. In this paper, we establish a unified theoretical framework, called GRACE (Granularity-Regulated Adaptive Computational Efficiency), that characterizes the optimal verification granularity as an explicit function of problem difficulty, verifier accuracy, and compute budget. We prove that there exists a phase transition: fine-grained verification dominates when either the compute budget is large or the problem is hard, whereas coarse-grained verification is preferred in the low-budget, easy-problem regime. Our theory unifies Best-of-N, beam search, and step-level MCTS within a single Pareto-optimality framework, and motivates an adaptive granularity strategy that provably achieves the compute-performance Pareto frontier. Empirical results on MATH-500, GSM8K, and AIME benchmarks corroborate all four theoretical claims, with our adaptive strategy outperforming fixed-granularity baselines by up to 3.1% accuracy at matched compute.
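
As a rough illustration of the coarse-grained end of the spectrum described in this abstract, the sketch below implements Best-of-N selection with an outcome-level scorer. The names `best_of_n` and `toy_verifier` and the toy candidates are assumptions made for illustration; they are not taken from the paper, and the GRACE adaptive strategy itself is not reproduced here.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], verifier: Callable[[str], float]) -> str:
    """Return the candidate with the highest outcome-level verifier score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=verifier)


def toy_verifier(solution: str) -> float:
    """Hypothetical outcome scorer: rewards solutions whose final answer is 42."""
    return 1.0 if solution.strip().endswith("42") else 0.0


# Three sampled solutions; only the complete solution is scored, not its steps.
samples = ["6 * 7 = 41", "6 * 7 = 42", "6 * 7 = 43"]
chosen = best_of_n(samples, toy_verifier)
```

A process reward model would instead score each intermediate step, which costs more verifier calls per candidate; that trade-off is what the abstract's granularity question is about.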

2606.19625 2026-06-19 cs.CL cs.LG New submission

Where Does Social Reasoning Come From? Capability Provenance in Language Models

Glenn Matlin, Chandreyi Chakraborty, Saehee Eom, Mika Okamoto, Rayan Castilla, Louis Jaburi, Alvin Deng, Taywon Min, Lucia Quirke, Stella Biderman, Mark Riedl

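Affiliations: Georgia Institute of Technology, College of Computing; MATS Program; EleutherAI; KAIST AI; Georgia Tech AI Safety Initiative

AI summary: Using training-data attribution, finds that social reasoning and STEM reasoning in OLMo3-7B rely on different regions of the pretraining corpus, and that the difference is more pronounced at the reasoning level than at the knowledge level.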

Comments Under review at COLM 2026 (Conference)

Abstract

We use training-data attribution as an interpretable tool for capability discovery, mapping which regions of the pretraining corpus support social-reasoning versus STEM-reasoning in OLMo3-7B. Training-data attribution measures how strongly each training document influences a model's predictions on a benchmark, but document-level scores are too noisy to identify which corpus regions support which capabilities, and prior work has emphasized factual knowledge rather than reasoning. We compute gradient-based attribution (TrackStar via Bergson) over a working set drawn from the de-duplicated Dolma3 mix, aggregate influence across WebOrganizer's 24-format x 24-topic taxonomy (576 bins), and contrast benchmark pairs in a 2x2 design that varies domain (social vs. STEM) and capability type (reasoning vs. knowledge): SocialIQA and MMLU Social Sciences against ARC-Challenge and MMLU STEM. Social and STEM reasoning draw on qualitatively distinct corpus regions, and the contrast is sharper at the reasoning level than at the knowledge level. Targeted machine unlearning provides partial causal validation: forgetting high-attribution topic bins (e.g., Literature for SocialIQA) degrades the aligned benchmark more than within-bin random baselines, and we open-source all code, sampling manifests, the bin-level influence matrix, and unlearning checkpoints.
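
The bin-level aggregation step mentioned in this abstract can be sketched as follows. This is a minimal illustration that assumes mean aggregation within each (format, topic) bin; the abstract does not state the aggregation rule, and the function names, bin labels, and scores below are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Bin = Tuple[str, str]  # (format, topic)


def aggregate_by_bin(docs: Iterable[Tuple[str, str, float]]) -> Dict[Bin, float]:
    """Average document-level influence scores within each (format, topic) bin."""
    totals: Dict[Bin, float] = defaultdict(float)
    counts: Dict[Bin, int] = defaultdict(int)
    for fmt, topic, score in docs:
        totals[(fmt, topic)] += score
        counts[(fmt, topic)] += 1
    return {b: totals[b] / counts[b] for b in totals}


def contrast(a: Dict[Bin, float], b: Dict[Bin, float]) -> Dict[Bin, float]:
    """Per-bin difference in mean influence between two benchmarks."""
    return {k: a.get(k, 0.0) - b.get(k, 0.0) for k in set(a) | set(b)}


# Made-up influence scores for a social benchmark and a STEM benchmark.
social = aggregate_by_bin([
    ("fiction", "Literature", 0.5),
    ("fiction", "Literature", 0.25),
    ("tutorial", "Science", 0.125),
])
stem = aggregate_by_bin([
    ("fiction", "Literature", 0.125),
    ("tutorial", "Science", 0.5),
])
diff = contrast(social, stem)  # positive values lean social, negative lean STEM
```

Averaging over a bin is one way to damp the document-level noise the abstract mentions; a sum or a rank-based statistic would be equally plausible choices.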

2606.19668 2026-06-19 cs.CL New submission

Code-Switching Reveals Language Anchoring in Multilingual LLMs

Jeonghyun Park, Seunghyun Yoon, Yonghyun Jun, Hwanhee Lee

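Affiliations: Chung-Ang University; Adobe Research

AI summary: Uses grammar-forced code-switching to diagnose language anchoring in multilingual LLMs, proposes the Anchor Bias measure, and designs the CANVAS intervention, which effectively mitigates the question-answering degradation caused by code-switching.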

Comments 36 pages, 13 figures, 27 tables

Abstract

Multilingual Large Language Models (MLLMs) are increasingly expected to handle Code-Switched (CS) inputs, yet mixing languages frequently degrades performance relative to source- or target-language monolingual counterparts. To understand this degradation, we use grammar-forced CS as a controlled diagnostic setting for locating CS representations relative to their source and target counterparts. We introduce Anchor Bias, a geometric measure that quantifies language anchoring, that is, whether a CS hidden state lies closer to its source-language or target-language counterpart. Across diverse MLLMs, Anchor Bias reveals a consistent grammar-frame effect: source-framed CS stays source-anchored, whereas target-framed CS shifts target-ward and shows larger Question Answering (QA) degradation. Motivated by this representational pattern, we propose CANVAS (Contextual Anchor-based Neural Vector Alignment Steering), an inference-time intervention that extracts a source-side canvas from the input and softly steers target-language hidden states toward the source anchor during prefill. CANVAS consistently recovers QA F1 across MLLMs and CS conditions, showing that internal anchoring signals provide an actionable target for mitigating CS inference failures.
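
To make the idea of a geometric anchoring measure concrete, here is a minimal sketch that compares a code-switched hidden state with its source and target counterparts by cosine similarity. The abstract does not give the exact definition of Anchor Bias, so the `anchor_score` function below is an assumed stand-in, not the paper's formula.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def anchor_score(cs: Sequence[float], source: Sequence[float],
                 target: Sequence[float]) -> float:
    """Hypothetical score: positive means closer to the source-language state,
    negative means closer to the target-language state."""
    return cosine(cs, source) - cosine(cs, target)


source_state = [1.0, 0.0]
target_state = [0.0, 1.0]
source_anchored = anchor_score([1.0, 0.0], source_state, target_state)
target_anchored = anchor_score([0.0, 1.0], source_state, target_state)
```

In practice the vectors would be hidden states taken from a chosen layer of the model for parallel source, target, and code-switched inputs.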

2606.19744 2026-06-19 cs.CL cs.AI cs.HC New submission

Beyond Uniform Forgetting: A Study of Sequential Direct Preference Optimization Across Preference Settings

Pranav Bhandari, Nicolas Fay, Amitava Datta, Usman Naseem, Mehwish Nasim

Affiliations: Network Analysis and Social Influence Modelling (NASIM) Lab; School of Physics Maths and Computing, The University of Western Australia; School of Psychological Science, The University of Western Australia; School of Computing, Macquarie University

AI summary: Studies the effect of sequential DPO across different preference settings, finding that forgetting is not uniform but depends on the relationship between objectives, signal strength, and training order, and suggests that future alignment pipelines should account for objective compatibility.

Comments Submitted to EMNLP 2026

Abstract

Aligning language models with human preferences often requires optimising multiple behavioural objectives. A practical approach is to apply these objectives sequentially using preference optimisation methods such as Direct Preference Optimisation (DPO), but it remains unclear whether later training uniformly degrades preferences learned earlier or whether the effect depends on the relationship between objectives. We study sequential DPO across four preference settings covering distributional conflict, multi-attribute interaction, strong safety signal, and compatible response-quality objectives. Using Llama-3.1-8B-Instruct with LoRA adapters, we evaluate all objectives after every stage with a fixed base-model reference. We find that sequential DPO does not produce a single forgetting pattern; preference change ranges from partial degradation to stability, pair-level redistribution, or positive transfer depending on objective relationship, signal strength, and training order. Pair-level analysis using length-normalised policy margins shows that aggregate metrics can mask heterogeneous changes across preference pairs, whereas quartile decomposition reveals that high-confidence pairs can either degrade or improve depending on the setting. Mechanistic diagnostics show that Stage 2 gradients and adapter updates are near-orthogonal to the previous objective across all settings, providing little evidence that direct gradient opposition is the primary driver. These findings suggest that future sequential alignment pipelines should account for objective compatibility and signal strength, rather than assuming that later objectives affect earlier preferences uniformly.
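
For readers unfamiliar with the quantities involved, the sketch below computes the standard DPO loss for one preference pair, plus one plausible form of a length-normalised policy margin. The margin definition is an assumption (the abstract does not spell it out), and all argument names are illustrative.

```python
import math


def dpo_loss(pol_chosen: float, pol_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """Standard DPO loss for one pair, from summed sequence log-probabilities
    under the policy and under the fixed reference model."""
    logits = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    return math.log1p(math.exp(-logits))  # equals -log(sigmoid(logits))


def length_normalised_margin(logp_chosen: float, len_chosen: int,
                             logp_rejected: float, len_rejected: int) -> float:
    """Assumed definition: per-token policy log-probability of the chosen
    response minus that of the rejected response."""
    return logp_chosen / len_chosen - logp_rejected / len_rejected


margin = length_normalised_margin(-10.0, 5, -12.0, 4)  # -2.0 - (-3.0)
```

Tracking such a margin per pair before and after each training stage is one way to see the pair-level heterogeneity that aggregate accuracy can hide.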

2606.19815 2026-06-19 cs.CL New submission

Clusters are All You Need: Pre-Training the Tsetlin Machine with Semantic Clusters from Language Models for Interpretability

Jiechao Gao, Rohan Kumar Yadav, Yuangang Li, Yuandong Pan, Jie Wang, Ying Liu, Michael Lepech

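Affiliations: Independent Researcher; University of California, Irvine; University of the Chinese Academy of Sciences

AI summary: Proposes a semantic pre-training framework that clusters text with K-means or Top2Vec and pre-trains a Tsetlin Machine on cluster-sample pairs so that it learns interpretable semantic keywords; across five datasets it outperforms conventional approaches and is competitive with BERT.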

Abstract

Pre-trained language models such as BERT achieve strong text classification performance but lack transparency, limiting their use in high-stakes settings. The Tsetlin Machine (TM) offers fully interpretable, clause-based reasoning but captures little semantic information, and prior attempts to bridge the two rely on static word embeddings that miss contextual meaning. We propose a semantic pre-training framework that transfers knowledge from a pre-trained language model into a TM without using embeddings. Text samples are grouped into semantically coherent clusters with K-means or Top2Vec, and the resulting cluster-sample pairs pre-train a non-negated TM with enhanced Type I feedback. The TM thereby learns interpretable semantic keywords that are fine-tuned on downstream tasks. Across five datasets, our method substantially outperforms vanilla and embedding-based TMs and reaches performance competitive with BERT while remaining interpretable.
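
The clause-based reasoning mentioned in this abstract can be illustrated with a small inference-only sketch. It assumes a bag-of-words input and clauses made only of non-negated keywords; the learning procedure (Type I feedback and the cluster-based pre-training) is not shown, and the clauses and function names below are made up.

```python
from typing import FrozenSet, Iterable, List

Clause = FrozenSet[str]


def clause_fires(clause: Clause, words: Iterable[str]) -> bool:
    """A non-negated clause is an AND over its keywords: all must be present."""
    return clause <= set(words)


def class_score(words: Iterable[str], positive: List[Clause],
                negative: List[Clause]) -> int:
    """Votes from clauses supporting the class minus votes from clauses against it."""
    present = set(words)
    votes_for = sum(clause_fires(c, present) for c in positive)
    votes_against = sum(clause_fires(c, present) for c in negative)
    return votes_for - votes_against


doc = "the striker scored a late goal".split()
sports_for = [frozenset({"goal", "scored"}), frozenset({"match"})]
sports_against = [frozenset({"election"})]
score = class_score(doc, sports_for, sports_against)
```

Because each clause is a readable keyword conjunction, the clauses that fired for a prediction can be shown directly as its explanation.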

2606.19831 2026-06-19 cs.CL cs.LG New submission

Leverage Is Not Reach: A Control-Window Law for Single-Neuron Steering in Language Models

Hongliang Liu

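Affiliations: Palo Alto Networks

AI summary: Proposes a budget-normalized control-window framework that uses a coherence budget, defined as the ratio of residual norm to write norm, to predict when a single-neuron intervention yields coherent behavioral control, and validates the predictions on 15 neurons.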

Abstract

Aligned language models gate behaviors such as refusal and language routing through sparse feed-forward neurons, yet no theory predicts when a single-neuron intervention controls a behavior coherently rather than collapsing the output. We develop a budget-normalized control-window framework for single-neuron steering. A dose along one write direction reduces to one control coordinate: the alignment between the residual stream and the write, driven along a universal saturation curve in units of a coherence budget set by the residual norm divided by the write norm. Coherent control exists when a behavior trigger lies below the collapse ceiling. The same coordinate governs benign mode switches and refusal; the ceiling follows from weights and one generic forward pass, while triggers are measured at rollout. On fifteen held-out neurons, the predicted ceiling has mean absolute error 0.14, about 0.07 in bulk layers, and the committed open-or-closed verdict holds on eleven against a ten-of-fifteen majority baseline. Closed cases expose three failure modes rather than violations: collapse before trigger, too little depth to propagate, or a normalization that caps how far one neuron can push. The law explains why local gradient attribution anti-predicts control: true controllers write off the readout axis and carry a near-zero first-order gradient. A forward-only contrastive screen made precise by the window recovers controllers that attribution misses. On refusal, the hardest case, intervention success is typed, not scalar: coherent bypass and strict actionable reach separate, so a neuron can flip refusal in fluent, on-task text with no actionable content, and genuine actionable reach appears only for three of six audited Llama pivots and only at later rollout horizons. Single-neuron steering is therefore a budgeted, typed audit of controllability rather than a fixed-dose anecdote.
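
Two quantities from this abstract are simple enough to sketch directly: the coherence budget (residual norm divided by write norm) and the open-or-closed verdict (a window is open when the trigger lies below the collapse ceiling). The function names and the numbers are illustrative assumptions; how the paper predicts the ceiling and measures the trigger is not reproduced here.

```python
import math
from typing import Sequence


def l2_norm(v: Sequence[float]) -> float:
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def coherence_budget(residual: Sequence[float], write: Sequence[float]) -> float:
    """Residual-stream norm divided by the norm of the neuron's write vector."""
    return l2_norm(residual) / l2_norm(write)


def window_is_open(trigger: float, ceiling: float) -> bool:
    """Coherent control is possible only if the behavior triggers before collapse.
    Both arguments are doses expressed in units of the coherence budget."""
    return trigger < ceiling


budget = coherence_budget([3.0, 4.0], [0.0, 2.0])  # 5.0 / 2.0
```

Expressing doses in budget units is what lets one trigger-versus-ceiling comparison be reused across neurons with very different write norms.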

2606.19857 2026-06-19 cs.CL cs.AI New submission

Large Language Models Do Not Always Need Readable Language

Jiayi Zhu, Haoxuan Peng, Junxi Wang, Liang Ke, Chen Zhang, Linfeng Zhang

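AI summary: Proposes the BabelTele representation, which encodes semantics as compact, non-standard text that sacrifices human readability while remaining recoverable by LLMs; experiments show that text can be compressed to 27.9% of its original length while retaining 99.5% semantic fidelity, reducing context overhead.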

Comments 23 pages, 10 figures. Preprint

Abstract

Large language models (LLMs) are commonly prompted and interfaced with human-readable natural language, even when the intended reader is another model. This paper investigates whether semantic information can be encoded in compact, non-standard textual forms that sacrifice human readability while remaining recoverable by LLMs. We refer to this class of model-centric textual representations as BabelTele, approached here not as a fixed protocol but as an empirical probe into LLMs' capacity to generate and interpret such representations. Through readability diagnostics, model likelihood measures, human questionnaires, and downstream task evaluations, we find that BabelTele can substantially depart from ordinary natural language while preserving core semantics for instruction-tuned LLMs. As a task-agnostic representational paradigm, BabelTele demonstrates high information density, maintaining 99.5% semantic fidelity even when the text volume is condensed to 27.9% of its original length. We further evaluate its semantic robustness in cross-model transfer, agent memory, and multi-agent communication. Results suggest that BabelTele can reduce context overhead while generally maintaining reliable downstream performance, although its effectiveness depends on the compressor-reader pair and task setting. These findings indicate that human readability, natural-language typicality, and model-side semantic recoverability can be partially decoupled, opening a path toward model-native representations in future exploration of LLM systems.

2606.19946 2026-06-19 cs.CL cs.LG New submission

GEMS: Geometric Constraints Enable Multi-Semantic Superposition in LLMs

Yu Deng

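AI summary: Proposes GEMS, which applies two kinds of geometric constraints (norm-preserving weighted superposition with targeted attention-pathway injection, and real-time orthogonalization) to address distributional deviation and directional interference in training-free multi-directional activation interventions, keeping accuracy at 98% on GSM8K.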

Comments 30 pages, 5 figures, 20 tables. Code and logs are available at: https://github.com/LuLu663939/gems-multi-semantic-steering

Abstract

Activation steering controls model behavior by modifying intermediate hidden states at inference time without retraining. Existing methods handle only single-direction injection; when multiple semantic directions are superposed without constraints, the model collapses. We show that this collapse decomposes into two independently acting sources: distributional deviation, where additive perturbations accumulate in norm across layers and drive activations outside the training distribution, and directional interference, where non-orthogonal semantic vectors mutually dampen when superposed. These two sources define the design constraints that any training-free multi-directional intervention must address. As one instantiation of these principles, we propose GEMS, a training-free method that maps each source to a corresponding geometric constraint: norm-preserving weighted superposition and targeted attention-pathway injection for distributional deviation, and real-time orthogonalization for directional interference. On GSM8K, injecting three concurrent non-mathematical directions preserves accuracy at 98% (baseline 92%), while unconstrained addition collapses to 4%; on Wikitext-2, the same injection incurs only 2.2% PPL increase. Component ablation isolates the causal role of each constraint, and layer-level probes confirm that orthogonalized signals survive the FFN pathway and reach the output distribution with semantic specificity. Qualitative steering effects transfer across architectures from 3B to 31B.
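
The two geometric constraints named in this abstract can be illustrated with a small sketch: steering directions are orthogonalized so they do not dampen each other, and the steered hidden state is rescaled to its original norm. Gram-Schmidt is used here as a stand-in for the paper's real-time orthogonalization, and all names and numbers are assumptions rather than the GEMS implementation (see the repository linked in the Comments field for the authors' code).

```python
import math
from typing import List, Sequence

Vector = List[float]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def norm(u: Sequence[float]) -> float:
    return math.sqrt(dot(u, u))


def orthogonalize(directions: Sequence[Sequence[float]]) -> List[Vector]:
    """Gram-Schmidt: strip from each direction its projection on earlier ones.
    Assumes the directions are linearly independent."""
    basis: List[Vector] = []
    for d in directions:
        r = list(d)
        for b in basis:
            coeff = dot(r, b)
            r = [x - coeff * y for x, y in zip(r, b)]
        n = norm(r)
        basis.append([x / n for x in r])
    return basis


def steer_norm_preserving(hidden: Sequence[float],
                          directions: Sequence[Sequence[float]],
                          weights: Sequence[float]) -> Vector:
    """Add weighted orthonormal directions, then rescale to the original norm."""
    steered = list(hidden)
    for w, b in zip(weights, orthogonalize(directions)):
        steered = [h + w * x for h, x in zip(steered, b)]
    scale = norm(hidden) / norm(steered)
    return [s * scale for s in steered]


hidden = [3.0, 4.0, 0.0]
basis = orthogonalize([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
out = steer_norm_preserving(hidden, [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]], [1.0, 1.0])
```

Rescaling keeps the activation norm fixed no matter how many directions are injected, which is the property the abstract associates with avoiding distributional deviation.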

2606.20089 2026-06-19 cs.CL cs.AI New submission

Disentangling Language Relatedness and Task Alignment in Cross-Lingual Transfer

Ahmed Haj Ahmed, Ruochen Zhang, Alvin Grissom

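Affiliations: Haverford College; Brown University

AI summary: By fine-tuning large language models and evaluating zero-shot reading comprehension on Semitic and non-Semitic languages, finds that cross-lingual transfer mainly improves task-format alignment rather than language-specific knowledge.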

2606.19640 2026-06-19 cs.CL cs.AI cs.HC New submission

Creating Multilingual Mental Health Dialogue Datasets: Limits of Persona-Based Localization via Nationality and Language

Yunkai Xu, Saeed Abdullah

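Affiliations: Pennsylvania State University

AI summary: Generates clinical dialogues in Chinese, Bengali, and Hindi by modifying the nationality and language parameters in personas, and finds that adding only these parameters leads to clinical inconsistency across languages and that LLMs are inaccurate when assessing depression severity in non-English text.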

Comments 15 pages, 4 figures. Accepted to the 2026 Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2026), co-located with ACL 2026

Abstract

AI and large language models (LLMs) have emerged as promising tools to address global mental health challenges. Despite the global nature of these challenges, there remains a critical shortage of high-quality datasets for training and evaluating such systems. To mitigate this gap, researchers increasingly generate synthetic clinical personas to simulate user data and test digital mental health support systems. However, most validated personas rely on English-centric contexts. This paper investigates whether similar persona-based methods can be used to generate multilingual mental health datasets. We modified nationality and language parameters in personas to generate clinical dialogues in Mandarin, Bengali, and Hindi. We then examined how different LLMs perform when evaluating the depression severity of these generated multilingual datasets against the baseline in English. Our findings indicate that just adding nationality and language parameters in personas might not be adequate, as it can introduce clinical inconsistency across languages. LLM judge models often exhibit inaccuracies in assessing depression severity in non-English texts, with performance varying across different models. This exposes the systemic limitations of applying English-centric personas to multilingual contexts. Ultimately, our work highlights the urgent need for culturally responsive data generation to ensure equitable mental health systems globally.

2606.19345 2026-06-19 cs.CL cs.AI New submission

Ensembles of Large Language Models for Identifying EQ-5D Studies in PubMed Based on Their Abstracts

Zhyar Rzgar K. Rostam, Márta Péntek, János Tibor Czere, Zsombor Zrubka, László Gulácsi, Gábor Kertész

Affiliations: Doctoral School of Applied Informatics and Applied Mathematics, Obuda University; John von Neumann Faculty of Informatics, Obuda University; Doctoral School of Innovation Management, Obuda University

AI summary: Proposes a multi-phase framework that ensembles LLMs such as Gemini and Gemma, using few-shot prompting, weighted ensembling, and a soft-stacking meta-classifier to automatically detect EQ-5D studies in PubMed; the weighted ensemble reaches an F1 of 0.74.

Comments 6 pages, 7 tables, 8 equations

Abstract

The rapid growth in scientific publications has made manual study screening in systematic literature reviews (SLRs) increasingly resource-intensive, inefficient, and inconsistent. Classifying studies that clearly report health-related quality-of-life results, such as EQ-5D data, requires a high level of clinical interpretation and poses challenges for human reviewers. This study investigates the use of Google's Gemini and Gemma large language models (LLMs) to automate EQ-5D detection in the PubMed biomedical database based only on published abstracts. A multi-phase framework is proposed that integrates few-shot prompting, weighted ensemble aggregation, and a soft stacking meta-classifier. Nine LLMs are evaluated on a dataset of PubMed studies manually labeled by two experts regarding EQ-5D reporting. The weighted ensemble of gemini-2.5-pro, gemma-3-12b, and gemma-3-27b obtained a weighted F1-score of 0.74 and an accuracy of 0.74, exceeding the results attained by the individual models. The ensembling of top-performing models improved the balance between precision and recall compared to individual models, while the soft stacking approach provided greater reliability and interpretability. Feature analysis shows that the probability results from the models are important in guiding the final predictions. The findings suggest that an ensemble-based LLM setup is a reliable and scalable approach for automating screening in biomedical research.
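
A weighted ensemble of per-model probabilities, as described in this abstract, can be sketched as a weighted soft vote. The probabilities, weights, and threshold below are made-up values, and the function name is hypothetical; the paper's actual weighting scheme is not given in the abstract.

```python
from typing import Sequence, Tuple


def weighted_soft_vote(probs: Sequence[float], weights: Sequence[float],
                       threshold: float = 0.5) -> Tuple[float, bool]:
    """Combine per-model probabilities that an abstract reports EQ-5D data.
    Returns the weighted mean probability and the resulting binary label."""
    if len(probs) != len(weights):
        raise ValueError("one weight per model is required")
    total = sum(weights)
    combined = sum(w * p for w, p in zip(weights, probs)) / total
    return combined, combined >= threshold


# Three hypothetical models; the first is trusted twice as much as the others.
probability, is_eq5d = weighted_soft_vote([0.75, 0.5, 0.25], [2.0, 1.0, 1.0])
```

A soft-stacking meta-classifier goes one step further and learns the combination from the same per-model probabilities instead of fixing the weights by hand.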

2606.19351 2026-06-19 cs.CL cs.AI New submission

Detecting Hallucinations for Large Language Model-based Knowledge Graph Reasoning

Xinyan Zhu, Yaoqi Liu, Yue Gao, Huadong Ma, Cheng Yang, Chuan Shi

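Affiliations: Beijing University of Posts and Telecommunications; Tsinghua University

AI summary: Proposes LUCID, which combines LLM attention scores, knowledge graph semantics, and structural information, and uses a graph neural network to detect hallucinations in LLM-based knowledge graph reasoning, achieving the best performance on nine datasets.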

Abstract

Knowledge graph (KG) reasoning infers new knowledge from existing facts and is widely applied in question answering, recommendation, and decision support. With the rapid development of large language models (LLMs), LLM-based KG reasoning frameworks have become increasingly popular by leveraging retrieved KG information. However, hallucinations in LLMs remain a critical issue. Even when relevant KG knowledge is incorporated, models may still generate incorrect outputs, leading to misinformation and unreliable decisions. Existing hallucination detection methods either focus on LLM internal states or verify consistency with retrieved contexts, but both overlook the structural information in KGs, resulting in suboptimal performance. To address this gap, we propose LUCID, the first halLUcination deteCtIon method for LLM-based knowleDge graph reasoning frameworks. LUCID jointly leverages LLM attention scores, KG semantics, and structural information. Specifically, it extracts node and edge features from attention scores and semantic similarities, and integrates them with KG structure using a graph neural network. We also construct manually annotated benchmark datasets for evaluation. Experiments on nine datasets show that LUCID achieves state-of-the-art performance compared to 15 baselines.

2606.19667 2026-06-19 cs.CL New submission

Prompt, Plan, Extract: A Zero-Shot Agentic LLM Workflow for Extracting Lung Pathology from Clinical Narratives

Aman Pathak, Cheng Peng, Mengxian Lyu, Ziyi Chen, Reema Solan, Sankalp Talankar, Yasir Khan, Hiren Mehta, Aokun Chen, Yi Guo, Yonghui Wu

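AI summary: Proposes a zero-shot agentic workflow that uses open-source large language models to extract 13 CAP fields from lung resection pathology reports, reaching 0.893 Micro-F1 without training, close to the supervised approach.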

Comments 7 pages, 2 figures, 3 tables. Affiliations: (1) Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA; (2) Division of Pulmonary, Critical Care and Sleep Medicine, Department of Medicine, College of Medicine, University of Florida, Gainesville, FL, USA; (3) College of Nursing, Florida State University, Tallahassee, FL, USA

Abstract

Information extraction from pathology reports is essential for cancer staging and tumor registry population. Yet key data remains embedded in narrative reports, making manual extraction labor-intensive and error-prone. Traditional supervised Natural Language Processing pipelines address this through fully supervised Named Entity Recognition and Relation Extraction, but require expensive manual annotation and suffer cascading failures when upstream entities are missed. In this study, we developed a zero-shot agentic workflow and evaluated five open-source generative Large Language Models (LLMs) to populate 13 College of American Pathologists synoptic fields from lung resection pathology reports. We compared them against a state-of-the-art supervised GatorTron NER-RE baseline using a novel, registry-aligned evaluation framework. The baseline achieved a Micro-F1 of 0.960, while the best zero-shot model (GPT-OSS-20B) achieved a Micro-F1 of 0.893 (recall: 0.949), accurately extracting complex relations like Pathologic Stage without task-specific training. These results suggest that open-source, zero-shot agentic LLMs are a low-cost solution for extracting lung pathology information.
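
Since the results in this abstract are reported as Micro-F1, a short sketch of that metric may help: true positives, false positives, and false negatives are pooled over all fields before precision and recall are computed. The per-field counts below are invented for illustration and do not come from the paper.

```python
from typing import Iterable, Tuple


def micro_f1(counts: Iterable[Tuple[int, int, int]]) -> float:
    """Micro-averaged F1 from per-field (true positive, false positive,
    false negative) counts, pooled before computing precision and recall."""
    tp = fp = fn = 0
    for field_tp, field_fp, field_fn in counts:
        tp += field_tp
        fp += field_fp
        fn += field_fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Two hypothetical synoptic fields: one frequent and easy, one rare and hard.
score = micro_f1([(8, 2, 0), (1, 0, 1)])
```

Pooling means frequent fields dominate the score, so a high Micro-F1 can coexist with weak performance on rare fields.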

2606.20072 2026-06-19 cs.CL New submission

Source-Grounded Data Generation for Text-to-JSON Learning

Sunghee Ahn, Guijin Son, Youngjae Yu

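Affiliations: Seoul National University

AI summary: Proposes STAGE, which uses spreadsheets as the source data, generates reports and JSON schemas with LLMs, and validates ground-truth values, substantially improving the quality of training data for text-to-JSON tasks.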

Comments Preprint

Abstract

From financial filings to clinical records, legacy industries rely heavily on long, unstructured documents to store high-value information. Reliably extracting this information into structured, machine-readable representations is a key prerequisite to making the contents accessible to automated systems. JSON is a natural target for such structured extraction, yet constructing reliable and scalable text-to-JSON training data remains challenging. To address this gap, we propose STAGE (Spreadsheet-grounded Text-to-JSON Artifact GEneration), a source-grounded data generation pipeline that constructs reports and JSON schema by using LLMs for scalable synthesis while validating ground-truth values against the underlying spreadsheet. Evaluations on STAGE-Eval, our source-grounded benchmark with an 851-example test set, show that STAGE produces stronger training data than existing approaches. This improves Qwen3-4B exact match from 31.37% to 74.27% and value accuracy from 45.46% to 90.69%.
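
The two metrics quoted in this abstract, exact match and value accuracy, are not defined there. The sketch below shows one plausible reading, assuming exact match compares whole parsed JSON objects and value accuracy is the fraction of gold leaf values reproduced at the same path. Both definitions and all names are assumptions, not the STAGE-Eval specification.

```python
import json
from typing import Any, Dict


def flatten(obj: Any, prefix: str = "") -> Dict[str, Any]:
    """Map each leaf value of a parsed JSON document to its slash-separated path."""
    if isinstance(obj, dict):
        leaves: Dict[str, Any] = {}
        for key, value in obj.items():
            leaves.update(flatten(value, f"{prefix}/{key}"))
        return leaves
    if isinstance(obj, list):
        leaves = {}
        for index, value in enumerate(obj):
            leaves.update(flatten(value, f"{prefix}/{index}"))
        return leaves
    return {prefix: obj}


def exact_match(pred: str, gold: str) -> bool:
    """True when the two JSON strings parse to equal objects."""
    return json.loads(pred) == json.loads(gold)


def value_accuracy(pred: str, gold: str) -> float:
    """Fraction of gold leaf values that the prediction reproduces at the same path."""
    gold_leaves = flatten(json.loads(gold))
    pred_leaves = flatten(json.loads(pred))
    if not gold_leaves:
        return 1.0
    hits = sum(1 for path, value in gold_leaves.items()
               if path in pred_leaves and pred_leaves[path] == value)
    return hits / len(gold_leaves)


gold_json = '{"name": "ACME", "revenue": {"q1": 10, "q2": 12}}'
pred_json = '{"name": "ACME", "revenue": {"q1": 10, "q2": 13}}'
accuracy = value_accuracy(pred_json, gold_json)
```

Under these definitions a single wrong value fails exact match but only lowers value accuracy slightly, which is consistent with the gap between the two numbers reported in the abstract.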

2606.20113 2026-06-19 cs.CL cs.IR New submission

Beyond Global Replanning: Hierarchical Recovery for Cross-Device Agent Systems

Shu Yao, Yuhua Luo, Qian Long, Jingru Fan, Zhuoyuan Yu, Yuheng Wang, Lin Wu, Yufan Dang, Huatao Li, Chen Qian

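Affiliations: School of Artificial Intelligence, Shanghai Jiao Tong University; Shanghai Innovation Institute; Southeast University; Tsinghua University

AI summary: Proposes the hierarchical replanning framework H-RePlan, which uses unified API-CLI-GUI execution and a cross-layer failure abstraction to separate device-local strategy recovery from global replanning, substantially improving cross-device task completion and instruction adherence on the HeraBench benchmark.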

Abstract

Real-world computer-use tasks often span multiple applications and devices, requiring agents to coordinate heterogeneous environments under dynamic runtime failures. Existing multi-device agent systems support task decomposition and cross-device assignment, but recovery remains largely coarse-grained: when execution fails, they typically retry the same strategy, reassign the subtask, or revise the global plan, without systematically modeling the device-local strategy space. This limits their ability to distinguish failures that can be repaired within the current device from those that require cross-device replanning. We propose H-RePlan, a hierarchical replanning framework for multi-device agents with unified API-CLI-GUI execution. H-RePlan equips each device with interchangeable execution strategies and separates device-local strategy recovery from orchestrator-level global replanning through a compact cross-layer failure abstraction. To evaluate this capability, we introduce HeraBench, a fault-injected benchmark that constructs cross-device workflows over Linux and Android devices and injects strategy- and device-level failures. Experiments show that H-RePlan substantially outperforms single-strategy and coarse-grained multi-device baselines, achieving higher completion, instruction adherence, and perfect-pass rates while reducing the token cost required for reliable end-to-end success. These results demonstrate that scope-aware hierarchical recovery is essential for robust multi-device agent execution.
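
The separation between device-local recovery and global replanning described in this abstract can be sketched as a two-level loop: exhaust the interchangeable strategies on the current device first, and move to another device only when all of them fail. This is a simplified illustration with hypothetical names; it does not model H-RePlan's cross-layer failure abstraction or its orchestrator.

```python
from typing import Callable, Dict, Optional, Tuple

Strategy = Callable[[str], bool]  # returns True when the task succeeded


def run_with_local_recovery(task: str,
                            strategies: Dict[str, Strategy]) -> Optional[str]:
    """Try each strategy on one device in order; return the name that worked."""
    for name, strategy in strategies.items():
        try:
            if strategy(task):
                return name
        except Exception:
            continue  # treat an exception as a strategy-level failure
    return None


def run_hierarchical(task: str,
                     devices: Dict[str, Dict[str, Strategy]]
                     ) -> Optional[Tuple[str, str]]:
    """Escalate to the next device only after local strategies are exhausted."""
    for device, strategies in devices.items():
        winner = run_with_local_recovery(task, strategies)
        if winner is not None:
            return device, winner
    return None


def flaky_api(task: str) -> bool:
    raise RuntimeError("API endpoint unavailable")


def always_ok(task: str) -> bool:
    return True


devices = {
    "linux": {"api": flaky_api, "cli": always_ok},
    "android": {"gui": always_ok},
}
result = run_hierarchical("export report", devices)
```

In this toy run the API failure is repaired locally by falling back to the CLI, so the task never leaves the Linux device.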

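2606.19591 2026-06-19 cs.CL cs.AI New submission

A BART-based approach with hierarchical strategy for Vietnamese abstractive multi-document summarization

Vu Nguyen Nguyen Xuan, Huy Ngo Quang

Affiliations * Aimesoft JSC

AI summary: Proposes a novel, simple hierarchical strategy that shortens documents based on the gold summary and combines it with a BART model for Vietnamese abstractive multi-document summarization. The approach reaches a ROUGE2-F1 of 0.2468 on the VLSP 2022 test set and uses external data to augment training.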

Comments originally written in 2022

2606.20287 2026-06-19 cs.CL New submission

PsyScore: A Psychometrically-Aware Framework for Trait-Adaptive Essay Scoring and ZPD-Scaffolded Feedback

Wei Xia, Jin Wu, Haoran Shi, Xiangyu Wang, Chanjin Zheng

Affiliations * Department of Educational Psychology, East China Normal University; Shanghai Institute of Artificial Intelligence for Education, East China Normal University; School of Computer Science and Technology, East China Normal University

AI summary: Proposes PsyScore, a framework that integrates diagnostic assessment with instructional scaffolding through a shared latent ability representation. It comprises a trait-adaptive neural IRT scorer, a ZPD-scaffolded feedback generator, and a multi-perspective feedback evaluation strategy, and it achieves competitive scoring performance on the ASAP++ dataset while providing more pedagogically aligned feedback.

English abstract

Effective Automated Essay Scoring (AES) systems are expected to support both reliable assessment and actionable instructional feedback. However, existing approaches often treat scoring and feedback as separate components: neural scoring models provide limited interpretability, while Large Language Model (LLM)-based feedback is typically insensitive to learners' proficiency levels. To address this fragmentation, this work proposes PsyScore, a psychometrically-aware framework that integrates diagnostic assessment with instructional scaffolding through a shared latent ability representation. PsyScore comprises three key modules: a Trait-Adaptive Neural IRT Scorer that incorporates the Graded Partial Credit Model (GPCM) into a neural architecture, enabling the precise estimation of student ability while maintaining psychometric interpretability; a ZPD-Scaffolded Feedback Generator, which conditions multi-agent feedback strategies on the diagnosed ability parameter to adapt instructional focus across different proficiency levels; and a Multi-Perspective Feedback Evaluation Strategy that assesses feedback quality via pairwise preference judgements and student revision simulations. Experiments on the ASAP++ dataset demonstrate that PsyScore achieves competitive scoring performance while providing more pedagogically aligned feedback.
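To make the IRT component more concrete, the following is a minimal sketch of partial-credit category probabilities of the kind a GPCM produces. The abstract does not give the parameterization, so the discrimination `a`, the step difficulties `steps`, and the function names are illustrative assumptions rather than PsyScore's actual scorer.

```python
import math

def gpcm_probs(theta, a, steps):
    """Category probabilities P(X = k | theta) for k = 0..len(steps).

    Assumed form: z_0 = 0 and z_k = sum_{j<=k} a * (theta - b_j),
    with P(X = k) proportional to exp(z_k).
    """
    logits = [0.0]
    for b in steps:
        logits.append(logits[-1] + a * (theta - b))
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(z - peak) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]

def expected_score(theta, a, steps):
    """Expected essay score under the category distribution."""
    return sum(k * p for k, p in enumerate(gpcm_probs(theta, a, steps)))

# Hypothetical trait with scores 0..3 and made-up step difficulties.
steps = [-1.0, 0.0, 1.0]
low = expected_score(-1.0, 1.2, steps)
high = expected_score(1.0, 1.2, steps)
```

Under these assumptions a higher ability `theta` shifts probability mass toward higher score categories, which is what keeps the estimated ability interpretable as a position on the score scale.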

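2606.19638 2026-06-19 cs.CL New submission

MiqraBERT: Regression-Based Sentence-BERT Finetuning for Biblical Hebrew Parallel Detection

David M. Smiley

AI summary: Proposes MiqraBERT, which finetunes Sentence-BERT with cosine-similarity regression to detect textual parallels in Biblical Hebrew, improving distributional separation 2.7-fold and shrinking the overlap region from roughly 24% to about 6%.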

English abstract

Textual reuse pervades the Hebrew Bible, yet the computational methods used to detect it still rest largely on lexical overlap, and they falter once a parallel involves paraphrase, lexical substitution, or syntactic reworking. This paper introduces MiqraBERT, a Sentence-BERT model finetuned from AlephBERT (a Modern Hebrew encoder) for verse-level semantic similarity in Biblical Hebrew. The training set comprises 1,650 labeled verse and half-verse pairs: 825 true parallels drawn from the Chronicles synoptic material and from foundational studies of poetic parallelism, balanced against 825 randomly sampled negatives. Through cosine-similarity regression, the model learns an embedding space in which parallel verses cluster together and unrelated verses move apart. We evaluate separation with distribution-based metrics, Wasserstein distance and the overlap coefficient, across ten random seeds. MiqraBERT improves distributional separation 2.7-fold over the pre-trained baseline and reduces the ambiguous overlap region from roughly 24% to about 6%. Narrative synoptic parallels reach a recall@10 of 87.1%; poetic parallels remain difficult, below 9%. This genre-dependent asymmetry confines the model's reliable scope to narrative textual reuse. MiqraBERT is publicly available at https://huggingface.co/davidmsmiley/MiqraBERT
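The two separation metrics named in the abstract can be illustrated with a small pure-Python sketch. The similarity scores below are made up, the equal-sample-size shortcut for the Wasserstein distance and the histogram-based overlap estimate are simplifying assumptions, and none of this is the paper's evaluation code.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-sized 1-D samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    # For equal-sized samples, W1 is the mean gap between sorted values.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def overlap_coefficient(xs, ys, bins=10, lo=-1.0, hi=1.0):
    """Shared area of two normalized histograms over [lo, hi]."""
    def hist(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / (hi - lo) * bins)
            counts[max(0, min(idx, bins - 1))] += 1  # clamp to valid bins
        return [c / len(values) for c in counts]
    return sum(min(p, q) for p, q in zip(hist(xs), hist(ys)))

# Hypothetical cosine similarities for parallel and unrelated verse pairs.
parallels = [0.91, 0.85, 0.75, 0.45]
negatives = [0.05, 0.15, 0.30, 0.50]
separation = wasserstein_1d(parallels, negatives)
overlap = overlap_coefficient(parallels, negatives)
```

A larger Wasserstein distance and a smaller overlap both indicate that parallel and unrelated pairs are easier to tell apart by a similarity threshold.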

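2606.19552 2026-06-19 cs.CL New submission

ReNikud: Audio-Supervised Hebrew Grapheme-to-Phoneme Conversion

Maxim Melichov, Yakov Kolani, Morris Alper

AI summary: Proposes ReNikud, which combines audio supervision with a pseudo-vocalization architecture, using ASR pseudo-labels from unlabeled audio and character-level alignment to address missing vowels and pronunciation ambiguity in Hebrew G2P conversion. It achieves state-of-the-art results on multiple benchmarks.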

English abstract

Grapheme-to-phoneme (G2P) conversion for Modern Hebrew is needed for applications like text-to-speech (TTS), but is challenging due to the language's abjad writing system, which leaves vowels largely unwritten, creating substantial ambiguity. Standard approaches first predict vowel diacritics (nikud) to produce International Phonetic Alphabet (IPA) transcriptions, but this is limited: vocalization data is scarce and laborious to produce, it does not specify features such as lexical stress, and it reflects formal grammatical rules rather than everyday spoken pronunciation. Direct sequence-to-sequence IPA prediction, meanwhile, struggles on limited data and fails to exploit the character-level alignment characteristic of abjads. Our method, ReNikud, overcomes these limitations with two key insights: (1) Weak audio supervision via a phoneme-based automatic speech recognition (ASR) pseudo-labeling pipeline on thousands of hours of unlabeled Hebrew audio, yielding phonemic transcriptions that reflect natural spoken norms without manual annotation. (2) A pseudo-vocalization architecture that predicts IPA phonemes at each character position, enforcing character-level alignment as an inductive bias. Results on existing Hebrew G2P benchmarks and the new targeted MILIM benchmark for spoken Hebrew show that ReNikud surpasses previous state-of-the-art methods. We will release our code and trained models to support further work on Hebrew TTS and speech technologies.
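The idea of predicting phonemes at each character position can be pictured with a toy example. The per-letter labels below are hand-written for illustration and are not model output; the function name and label format are assumptions, since the abstract does not specify the output encoding.

```python
def transcribe(chars, labels):
    """Concatenate per-character phoneme labels into one IPA string.

    Each letter carries exactly one label, which may hold zero or more
    IPA symbols, so the alignment between letters and sounds is explicit.
    """
    if len(chars) != len(labels):
        raise ValueError("one label is required per character")
    return "".join(labels)

# The unvocalized word "shalom", one entry per Hebrew letter.
chars = ["ש", "ל", "ו", "ם"]
labels = ["ʃa", "l", "o", "m"]  # illustrative labels, not model predictions
aligned = list(zip(chars, labels))
ipa = transcribe(chars, labels)
```

Because the vowel after the first letter is unwritten, its label carries both the consonant and the vowel, which is how a character-aligned output can recover sounds that the abjad omits.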

2606.19352 2026-06-19 cs.CL cs.AI New submission

Sign-Language Datasets at Scale: A Comprehensive Survey on Resources, Benchmarks, and Annotation Standards

Yiming Ni, Zhi-Qi Cheng, Jiayu Li, Wei Cheng

Affiliations * Tacoma School of Engineering & Technology, University of Washington

AI summary: Surveys 120 datasets across 35 sign languages, analyzes challenges such as modality imbalance, annotation granularity, and signer bias, and proposes a 24-field Sign-Language Datasheet to support standardized documentation and reproducible evaluation.

Comments Accepted to ACL 2026 Main. 27 pages, 5 figures

English abstract

Sign languages are expressive visual languages used by Deaf and Hard-of-Hearing (DHH) communities. Despite substantial progress in sign-language recognition, translation, and production, advances remain constrained by fragmented datasets, inconsistent annotations, and limited linguistic coverage. Existing benchmarks often fail to reflect real-world communication needs, and systematic analyses of these limitations remain limited. In this survey, we present a comprehensive index of sign-language datasets, covering 120 resources across 35 sign languages. We analyze key challenges such as modality imbalance, annotation granularity, and signer bias, and outline considerations for future dataset design. We also introduce a 24-field Sign-Language Datasheet and release a public GitHub repository (https://github.com/Ginqwerty/Open-Sign-Language) to support standardized documentation and reproducible evaluation. Overall, our work provides a unified and practical foundation for developing inclusive, robust, and scalable sign-language technologies in real-world applications.

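2606.19468 2026-06-19 cs.CL New submission

Characterizing Narrative Content in Web-scale LLM Pretraining Data

Teagan Johnson, Elliott Ash, Andrew Piper, Maria Antoniak

AI summary: Presents the first fine-grained study of narrative features in the LLM pretraining corpus Dolma. The work proposes a framework covering three core narrative elements (agency, setting, and events), builds the NarraBERT model, and releases the NarraDolma dataset, showing that narrative structure is measurable across heterogeneous data and unevenly distributed.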

Comments 8 pages of main content, 28 total pages. 30 figures

English abstract

The narrative composition of web-scale LLM pretraining corpora remains largely unexplored even though narrative is a fundamental mode of human communication. We present the first fine-grained study of narrative features in Dolma, a 3-trillion-token open pretraining corpus. Drawing on narrative theory, we design a framework spanning three core narrative elements (agency, setting, and events) operationalized as 11 interpretable dimensions. After sampling and annotating a diverse set of 400 passages, we finetune and validate NarraBERT, a RoBERTa-based model for fine-grained narrative prediction. We apply NarraBERT to 3M passages, resulting in a new dataset, NarraDolma. We find that (i) narrative structure is measurable at scale across extremely heterogeneous data, (ii) a continuous, multidimensional narrative structure underlies web text, and (iii) narrative qualities are unequally distributed across pretraining sources and topics in ways that current curation practices neither measure nor account for. Our framework, dataset, and analyses provide a foundation for understanding how narrative qualities are distributed in LLM pretraining data and for studying how data composition affects narrative reasoning tasks. We publicly release NarraDolma and NarraBERT.

2606.19544 2026-06-19 cs.CL New submission

Reliability without Validity: A Systematic, Large-Scale Evaluation of LLM-as-a-Judge Models Across Agreement, Consistency, and Bias

Justin D. Norman, Michael U. Rivera, D. Alex Hughes

Affiliations * UC Berkeley School of Information

AI summary: Through a large-scale systematic evaluation (21 judge models, 118 runs, about 541,000 judgments), this study finds pervasive problems with LLM-as-a-Judge across agreement, consistency, and bias, including kappa deflation, ranking shifts, and high test-retest reliability coexisting with severe position bias. It also proposes a Minimum Viable Validation Protocol.

English abstract

LLM-as-a-Judge has become the dominant evaluation paradigm for language models, but judge validation in practice relies on exact-match agreement, a metric that does not correct for chance and systematically overstates discriminative ability. We present the largest systematic evaluation of LLM-as-a-Judge to date: 21 judges from nine providers across MT-Bench, JudgeBench, and RewardBench, evaluated under three protocols (agreement, consistency, bias audit) over 118 runs and approximately 541,000 individual judgments. Four findings emerge, consistent across the full cohort, including the April 2026 frontier: kappa deflation between exact match and Cohen's kappa is universal (33--41 pp on MT-Bench), judge rankings shift by up to 14 positions across benchmarks, high test--retest reliability (>0.95) coexists with severe position bias (>0.10) in two production-deployed judges (instantiating a consistency--bias paradox), and verbosity bias is small (<0.011) across our cohort under a single pairwise rubric. We distill these into a Minimum Viable Validation Protocol.
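The gap between exact-match agreement and Cohen's kappa can be reproduced on a toy example. The verdicts below are invented, and reading "kappa deflation" as exact-match agreement minus kappa is an assumption about the paper's definition; the sketch only shows why a chance-corrected metric comes out lower.

```python
from collections import Counter

def exact_match(judge, gold):
    """Fraction of items where the judge's label equals the reference label."""
    return sum(j == g for j, g in zip(judge, gold)) / len(gold)

def cohens_kappa(judge, gold):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(gold)
    p_o = exact_match(judge, gold)
    judge_freq, gold_freq = Counter(judge), Counter(gold)
    # Chance agreement: both raters pick the same label independently.
    labels = set(judge) | set(gold)
    p_e = sum(judge_freq[c] * gold_freq[c] for c in labels) / (n * n)
    if p_e == 1:
        return float("nan")  # undefined when both raters use a single label
    return (p_o - p_e) / (1 - p_e)

# Hypothetical pairwise verdicts ("A" or "B" wins); this judge favours "A".
gold = ["A"] * 6 + ["B"] * 4
judge = ["A"] * 6 + ["A", "A", "B", "B"]
agreement = exact_match(judge, gold)
kappa = cohens_kappa(judge, gold)
deflation_pp = 100 * (agreement - kappa)
```

Because the toy judge prefers one label, much of its raw agreement is what two skewed raters would reach by chance, so kappa lands well below exact match.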

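2606.19637 2026-06-19 cs.CL cs.AI New submission

Before the Labels: How Dataset Construction Shapes Suicidality Detection in Clinical Text

Priyanshi Garg, Ishita Rao, Jieqiong Ding, Amandalynne Paullada

Affiliations * University of Washington

AI summary: Through a case study of the ScAN dataset, shows that EHR-based suicidality datasets encode a particular operationalization of suicidality, shaped by who authors the data, how episodes are bounded, and how ambiguity is resolved, and demonstrates that identical labels subsume heterogeneous clinical framings.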

Comments To appear in the Proceedings of the 11th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2026)

English abstract

Clinical NLP increasingly relies on electronic health record (EHR) data to detect suicidal behaviors, treating clinical documentation as more reliable ground truth than social media. We argue that this framing obscures how EHR-based suicidality datasets encode a particular operationalization of suicidality, shaped by who authors the data, how episodes are bounded, and how ambiguity is resolved. We ground this argument in a case study of the ScAN dataset, built over MIMIC-III clinical notes. We show how governance constraints, ICD-based cohort selection, single-annotator labeling, and hospital-stay-level aggregation produce labels that reflect clinician-documented judgments, treat suicidality as a bounded episode, and assume that intent can be reliably inferred from documentation. A linguistic analysis demonstrates that identical labels subsume heterogeneous clinical framings differing in temporality, negation, and uncertainty. We argue that clinical NLP should examine the assumptions embedded in suicidality datasets before interpreting their labels as ground truth.

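2606.19698 2026-06-19 cs.CL New submission

From Texts to Scores: Tracing the Emergence of Essay Quality Representations in Large Language Models

Jiaxu Zuo, Mu You, Kaixin Lan, Tao Fang, Yujia Huo, Henghua Shen, Lidia S. Chao, Derek F. Wong

AI summary: Analyzes the hidden representations of eight LLMs on three datasets using linear probing and related methods, finding that essay quality information is linearly decodable and identifying neurons correlated with scores, which sheds light on the internal mechanisms of LLM-based scoring.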

Comments This is a preprint of a manuscript currently under peer review

English abstract

Recent advances in Large Language Models (LLMs) have substantially transformed Automated Essay Scoring (AES), yet the internal mechanisms underlying LLM-based scoring remain poorly understood. In this work, we systematically analyze the hidden representations of eight LLMs across two English essay datasets (ASAP++, CSEE) and one Portuguese dataset (ENEM). Using linear probing, cross-prompt generalization, dimensionality reduction, and neuron-level analyses, we find consistent evidence that essay quality information is encoded in a linearly accessible form within LLM representations. These representations emerge progressively across layers, remain robust across prompting strategies, and partially transfer across essay prompts despite differences in scoring rubrics. In addition, nonlinear probes provide only marginal and inconsistent improvements over linear probes, suggesting that most essay quality information is already linearly decodable. We further identify individual ``essay scoring neurons'' whose activations strongly correlate with essay scores and whose behavior is sensitive to targeted intervention. Moreover, the layer-wise distribution of these neurons systematically shifts with essay length, with longer essays relying more heavily on deeper layers. Overall, our findings provide evidence that LLMs encode structured representations related to essay quality and offer new insights into the interpretability of LLM-based AES systems.

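2606.20212 2026-06-19 cs.CL New submission

CzechDocs: A Multiway Parallel Dataset of Formatted Documents for Minority Languages in Czechia

Josef Jon, Ondřej Bojar

Affiliations * Charles University, Faculty of Mathematics and Physics, Institute of Formal

AI summary: Presents CzechDocs, a multiway parallel dataset of formatted documents covering Czech and minority languages in Czechia. It supports the evaluation of machine translation systems that preserve formatting, and a validation subset and evaluation tools are publicly released.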

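2606.20255 2026-06-19 cs.CL cs.AI New submission

The Register Gap: A Meaning Intelligence Framework for Nigerian Public Discourse

Celestine Achi

AI summary: Proposes the nine-dimension Meaning Intelligence Framework (MIF), which separates surface sentiment from true communicative intent through dimensions such as register and true intent. On a Nigerian public discourse dataset, schema-informed prompting raises register classification accuracy by 40 percentage points and the composite Meaning Intelligence Score by 5.4 points.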

Comments Preprint. 12 pages, 2 tables. Supplementary materials: MIF Master Specification v2.0, Annotation Guidelines v1.0, and 30-item public calibration set with gold labels available from the author

English abstract

We introduce the Meaning Intelligence Framework (MIF), a nine-dimension annotation and evaluation schema for Nigerian public discourse that separates surface sentiment from true communicative intent. Existing benchmarks for Nigerian languages, including NaijaSenti and AfriSenti, treat sentiment classification as a three-way polarity task (positive, negative, neutral). We argue that the dominant failure mode of AI systems on Nigerian discourse is not translation failure but context failure: the same utterance carries opposite pragmatic force depending on speaker, audience, and situation. The MIF operationalises this insight across nine scored dimensions: register, surface sentiment, true intent, irony, coded subtext, risk tier, annotator confidence, speaker emotion, and recommended communications action. We construct a 30-item calibration dataset spanning Standard English, Nigerian English, Nigerian Pidgin, and code-mixed registers, and evaluate a frontier language model (Gemini 2.5 Flash) under zero-shot and schema-informed prompting conditions. The headline finding is the Register Gap: zero-shot register classification accuracy is 33.3%, rising to 73.3% (+40 points) when the model receives the MIF schema in-context. The composite Meaning Intelligence Score increases by 5.4 points (73.2 to 78.6) under schema-informed prompting, with the largest practical gains in register identification, coded-subtext detection (+10 points), and strategic action recommendation (+10.3 points). We release the framework specification, annotation guidelines, and the 30-item public calibration set to support reproducibility, while retaining a private holdout corpus for contamination-protected evaluation.

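2606.20369 2026-06-19 cs.CL New submission

The Almost Intelligent Revolution: Options for Scaling Up Deliberation and Empowering People with AI

Serge Sharoff

AI summary: Explores, from the perspective of Systemic-Functional Linguistics, how large language models can scale up democratic deliberation, enhance inclusivity, and empower marginalised groups, while cautioning against both overclaiming and underclaiming.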

Comments Published in /Handbook of Democracy in the Era of Artificial Intelligence/ edited by Evangelos Pournaras, Srijoni Majumdar, Carina Ines Hausladen, and Dirk Helbing. 2026

English abstract

The increasing prominence of Large Language Models (LLMs) in public discourse presents both opportunities and challenges for democratic deliberation. While red teaming strategies help mitigate specific risks, broader concerns persist regarding linguistic constraints, biases, and the sycophantic tendencies of LLMs. This chapter explores how LLMs can be used to significantly scale up and democratise deliberation, particularly in fostering inclusivity and empowering traditionally marginalised groups. Drawing on concepts from Systemic-Functional Linguistics, the chapter examines how variations across language users (for example, with respect to socio-demographic groups) and across language use (for example, with respect to communicative functions) shape participation in AI-supported deliberation. The chapter presents AI-driven deliberation studies and assesses their potential to scaffold argumentation, enhance access, and reduce the influence of exclusionary linguistic norms and biases which are embedded in prestigious registers. At the same time, the chapter cautions against both overclaiming, which leads to unrealistic expectations, and underclaiming, which risks missed opportunities for AI-assisted engagement. The chapter concludes by identifying future research directions to maximise the democratic potential of AI-assisted participation while embedding ethical safeguards to counteract the reproduction of linguistic inequalities.

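2606.20198 2026-06-19 cs.CL New submission

Pitch Spelling Jazz Lead Sheets, Solo Transcriptions, Classical Piano and Monophonic Scores

Augustin Bouquillard, Florent Jacquemard

Affiliations * École polytechnique; INRIA

AI summary: Proposes an algorithm for pitch spelling and key estimation that jointly estimates note names, a global key signature, and a local scale for each bar through two optimisation stages (modal and tonal), and evaluates it on a variety of digital score datasets.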

English abstract

We present an algorithm for pitch spelling and key estimation. Given an input in MIDI-like format, containing information on note pitches (expressed in semitones relative to the lowest reference note) and bar boundaries, it estimates the appropriate note names, a global Key Signature, and a local scale for each bar. These related information elements are evaluated jointly during two stages of optimisation. During an initial 'modal' stage, a probable scale is proposed for each bar, using a shortest-path search to minimise the number of accidentals in the printed score. Then, during a second stage called 'tonal', these local scales are used to estimate the Key Signature and note names that would result in the best musical notation for the entire piece. We present evaluations conducted on datasets comprising a variety of digital musical scores: jazz lead sheets taken from the Real Book, transcriptions of recordings of jazz soli and bass lines, traditional tunes, as well as classical scores for piano and monophonic instruments. Our procedure was originally designed for use in music transcription, specifically for building digital collections of jazz solos transcribed from audio recordings, for the purposes of music analysis, teaching and the preservation of cultural heritage. This method should also prove useful for other tasks related to the processing of musical notation. Furthermore, to this end, we have defined new distances between various common jazz scales, which may be of some interest to musicological studies.
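The 'modal' stage can be pictured as a shortest path through a lattice with one candidate scale per bar. The sketch below is a heavily simplified stand-in: it considers only the twelve major scales, counts out-of-scale notes as a proxy for accidentals, and charges a flat `switch_cost` for changing scale, whereas the paper uses its own scale inventory and distances. All names here are hypothetical.

```python
MAJOR_STEPS = (0, 2, 4, 5, 7, 9, 11)

def accidentals(bar, tonic):
    """Count notes (MIDI pitches) outside the major scale on `tonic`."""
    scale = {(tonic + s) % 12 for s in MAJOR_STEPS}
    return sum(1 for pitch in bar if pitch % 12 not in scale)

def best_scales(bars, switch_cost=1):
    """Cheapest tonic per bar via dynamic programming over a bars x 12 lattice."""
    cost = [accidentals(bars[0], t) for t in range(12)]
    back = []
    for bar in bars[1:]:
        new_cost, pointers = [], []
        for t in range(12):
            # Staying on the same tonic is free; switching costs switch_cost.
            step = lambda p: cost[p] + (0 if p == t else switch_cost)
            prev = min(range(12), key=step)
            pointers.append(prev)
            new_cost.append(step(prev) + accidentals(bar, t))
        cost = new_cost
        back.append(pointers)
    # Trace back from the cheapest final tonic.
    t = min(range(12), key=lambda p: cost[p])
    path = [t]
    for pointers in reversed(back):
        t = pointers[t]
        path.append(t)
    return path[::-1]

# Hypothetical three-bar melody: two bars in C major, then a bar in D major.
bars = [[60, 62, 64, 65], [67, 69, 71, 72], [62, 66, 69, 73, 67]]
tonics = best_scales(bars)  # tonic pitch classes, 0 = C, 2 = D
```

The switch cost is what makes this a path problem rather than a per-bar choice: without it, each bar would pick its locally best scale with no regard for its neighbours.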

1. Large Language Models and Foundation Models (15 papers)

DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

Where to Place the Query? Unveiling and Mitigating Positional Bias in In-Context Learning for Diffusion LLMs via Decoding Dynamics

Pruning via Causal Attribution Preserves Reasoning Performance in Large Language Models

Quantifying Aleatoric Uncertainty of In-Context Learning for Robust Measure of LLM Prediction Confidence

Granularity-Regulated Adaptive Computational Efficiency for Optimal Verification in Test-Time Scaling

Where Does Social Reasoning Come From? Capability Provenance in Language Models

Code-Switching Reveals Language Anchoring in Multilingual LLMs

Beyond Uniform Forgetting: A Study of Sequential Direct Preference Optimization Across Preference Settings

Clusters are All You Need: Pre-Training the Tsetlin Machine with Semantic Clusters from Language Models for Interpretability

Leverage Is Not Reach: A Control-Window Law for Single-Neuron Steering in Language Models

Large Language Models Do Not Always Need Readable Language

GEMS: Geometric Constraints Enable Multi-Semantic Superposition in LLMs

IHUBERT: Vector-Based Semantic Deduplication and Domain-Balanced Pretraining for Persian Resources

HydraHead: From Head-Level Functional Heterogeneity to Specialized Attention Hybridization

Your Mouse and Eyes Secretly Leak Your Preference: LLM Alignment using Implicit Feedback from Users

2. Machine Translation and Cross-Lingual Processing (2 papers)

Disentangling Linguistic Relatedness from Task Alignment in Cross-Lingual Transfer

Creating Multilingual Mental Health Dialogue Datasets: Limits of Persona-Based Localization via Nationality and Language

3. Information Extraction, Retrieval, and Question Answering (8 papers)

Ensembles of Large Language Models for Identifying EQ-5D Studies in PubMed Based on Their Abstracts

Detecting Hallucinations for Large Language Model-based Knowledge Graph Reasoning

CacheWeaver: Cache-Aware Evidence Ordering for Efficient Grounded RAG Inference

TerraMARS: A Domain-Adapted Small-Language-Model Pipeline for Mars Terraforming Literature

FineREX: Fine-Tuned NER-RE for Human Smuggling Knowledge Graphs

Prompt, Plan, Extract: Zero-Shot Agentic LLMs Workflows for Lung Pathology Extraction from Clinical Narratives

Source-Grounded Data Generation for Text-to-JSON Learning

When Does Streaming Tool Use Help? Characterizing Tool-Intent Stabilization in Streaming Retrieval-Augmented Generation

4. Dialogue Systems and Agents (4 papers)

Trustworthy Multi-Agent Systems: Mitigating Semantic Drift with the Argent Signaling Protocol

SAGE-OPD: Selective Agent-Guided Intervention for Multi-Turn On-Policy Distillation

AtomMem: Building Simple and Effective Memory System for LLM Agents via Atomic Facts

Beyond Global Replanning: Hierarchical Recovery for Cross-Device Agent Systems

5. Text Generation, Summarization, and Editing (2 papers)

A BART-based approach with hierarchical strategy for Vietnamese abstractive multi-document summarization

PsyScore: A Psychometrically-Aware Framework for Trait-Adaptive Essay Scoring and ZPD-Scaffolded Feedback

6. Semantics, Syntax, and Linguistic Analysis (1 paper)

MiqraBERT: Regression-Based Sentence-BERT Finetuning for Biblical Hebrew Parallel Detection

7. Multimodal Language Processing (2 papers)

LaViSA: A Language and Vision Structural Ambiguity Benchmark

MedRLM: Recursive Multimodal Health Intelligence for Long-Context Clinical Reasoning, Sensor-Guided Screening, Evidence-Grounded Decision Support, and Community-to-Tertiary Referral Optimization

8. Joint Speech-Language and Audio-Text Processing (2 papers)

Light-weight Pronunciation Assessment via Discrete Speech Token Surprisal

ReNikud: Audio-Supervised Hebrew Grapheme-to-Phoneme Conversion

9. Evaluation, Datasets, and Benchmarks (12 papers)

Sign-Language Datasets at Scale: A Comprehensive Survey on Resources, Benchmarks, and Annotation Standards

Characterizing Narrative Content in Web-scale LLM Pretraining Data

Reliability without Validity: A Systematic, Large-Scale Evaluation of LLM-as-a-Judge Models Across Agreement, Consistency, and Bias

Before the Labels: How Dataset Construction Shapes Suicidality Detection in Clinical Text

What sentiment analysis can't see: Measuring whether customers were helped, and what went wrong, across 70,000 support conversations

NRITYAM: Language Models Meet Art and Heritage of Dance

CREDENCE: Claim Reduction for Decomposition & Enhanced Credibility -- Semantic Metrics and Convergence Analysis

REDACT: A Systematically Controlled Multilingual Benchmark for Personal Information Detection

From Texts to Scores: Tracing the Emergence of Essay Quality Representations in Large Language Models

CzechDocs: A Multiway Parallel Dataset of Formatted Documents for Minority Languages in Czechia

The Register Gap: A Meaning Intelligence Framework for Nigerian Public Discourse

CATCH-ME if you RAG: a dataset of Contextually Annotated multi-Turn Counterspeech against Hate and Misinformation Exchanges

10. Safety, Privacy, Fairness, and Interpretable NLP (4 papers)

Exposing the Unsaid: Visualizing Hidden LLM Bias through Stochastic Path Aggregation

Self-Preference Is Weak or Absent in Verifiable Instruction-Following Revision: A Four-Model Test Under Genuine Authorship

Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families

StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs

11. Other / General NLP (4 papers)

How LLMs Fail and Generalize in RTL Coding for Hardware Design?

From 50K to 8.2 Million in 24 Hours: Vozinha's Algorithmic Consecration and the Multilingual Making of World Cup Visibility

The Almost Intelligent Revolution: Options for Scaling Up Deliberation and Empowering People with AI

Pitch Spelling Jazz Lead Sheets, Solo Transcriptions, Classical Piano and Monophonic Scores