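2605.23857 2026-05-25 cs.LG cs.CL version update

Strong Teacher Not Needed? On Distillation in LLM Pretraining

Taiming Lu, Zhuang Liu

Affiliations: Princeton University

AI summary: This paper challenges the common assumption in knowledge distillation that a strong teacher beats a weak one, and studies how different teacher-student relationships affect distillation in LLM pretraining. By varying model size and training-data volume, the authors construct strong-to-weak, same-level, and weak-to-strong teacher-student pairs, and find that even a small or undertrained teacher can improve the student when the language modeling and distillation losses are mixed properly. A stronger teacher is also not always better: scaling the teacher's size or training data too far can weaken the distillation gains, and distillation improves generalization more readily than in-domain fitting.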

Abstract

Knowledge distillation generally assumes a strong-to-weak relationship where stronger teachers yield better students. In this work, we examine this assumption about distillation in large language model pretraining. By varying architecture sizes and training token budgets, we create strong-to-weak, same-level, and weak-to-strong teacher-student relationships, and study distillation's effectiveness under each. We find that the teacher need not be strong: with proper mixing of the language modeling and knowledge distillation losses, even small and undertrained teachers improve larger students. At the same time, a stronger teacher is not always better: pushing the teacher further, through more parameters or more training tokens, can saturate or even reverse the distillation gains. We further observe that distillation improves generalization (out-of-distribution and downstream performance) more readily than in-domain fitting. Together, these results challenge the common belief that distillation pretraining always requires a strong teacher.
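
The abstract does not state the exact training objective, so the following is only a minimal sketch under a common assumption: the loss is a convex combination of the language modeling cross-entropy and a KL-divergence distillation term. The mixing weight `kd_weight`, the function names, and the toy logits are all illustrative and are not taken from the paper.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax of one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def mixed_loss(student_logits, teacher_logits, targets, kd_weight):
    """Average over positions of
    (1 - kd_weight) * cross-entropy + kd_weight * KL(teacher || student).

    kd_weight is a hypothetical mixing coefficient in [0, 1].
    """
    total = 0.0
    for s_row, t_row, target in zip(student_logits, teacher_logits, targets):
        s_logp = log_softmax(s_row)
        t_logp = log_softmax(t_row)
        # Language modeling term: negative log-likelihood of the true next token.
        lm = -s_logp[target]
        # Distillation term: KL divergence from the teacher to the student.
        kd = sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))
        total += (1.0 - kd_weight) * lm + kd_weight * kd
    return total / len(targets)

# Toy logits: 2 positions, vocabulary of 3 tokens.
student = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
teacher = [[1.0, 1.0, -2.0], [0.0, 0.5, 0.1]]
targets = [0, 2]

lm_only = mixed_loss(student, teacher, targets, kd_weight=0.0)
mixed = mixed_loss(student, teacher, targets, kd_weight=0.5)
kd_only = mixed_loss(student, teacher, targets, kd_weight=1.0)
```

Setting `kd_weight=0.0` recovers plain language modeling and `kd_weight=1.0` pure distillation; the abstract's point about "proper mixing" concerns the choice between those extremes.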

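2605.23826 2026-05-25 cs.CV cs.CL version update

Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval

Michal Shlapentokh-Rothman, Prachi Garg, Yu-Xiong Wang, Derek Hoiem

Affiliations: University of Illinois at Urbana-Champaign

AI summary: This paper studies how to retrieve keyframes from long videos to support question answering, and proposes ToolMerge, a keyframe retrieval method based on decomposing a query into tool calls and merging their results. An LLM decomposes the query into several tool calls, and the per-tool rankings are merged with boolean operators to locate relevant frames more precisely. Experiments use the authors' own M2M benchmark; ToolMerge is competitive across tasks and outperforms other methods by 5% on caption retrieval.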

Abstract

Keyframe selection is a direct way to provide verifiable visual evidence for long-video question answering (QA). Queries differ in what they require, and finding the right frames depends on knowing what to look for. Existing keyframe selectors either score every frame against a single query, or decompose the query into a fixed schema evaluated by a single visual tool. We propose ToolMerge, a keyframe retrieval method based on decomposition and merging: a Large Language Model (LLM)-based planner decomposes the query into tool calls and specifies how their per-tool rankings are merged using boolean operators. To evaluate retrieval directly, we construct Molmo-2 Moments (M2M), a benchmark in which every question is anchored to a specific time interval by construction. Across QA, question retrieval, and caption retrieval, ToolMerge is competitive with prior keyframe selectors and, most notably, outperforms other methods by 5% on caption retrieval. Code and data can be found at https://github.com/michalsr/ToolMerge.
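
The abstract does not say how boolean operators act on per-tool rankings. As a rough sketch, one could assume each tool returns a per-frame score in [0, 1] and read AND, OR, and NOT as min, max, and complement (a fuzzy-logic convention chosen here for illustration, not confirmed by the source). The tool names and scores below are hypothetical.

```python
def merge(plan, tool_scores):
    """Evaluate a boolean plan over per-frame scores in [0, 1].

    plan is either a tool name or a tuple ("AND" | "OR" | "NOT", *sub_plans).
    Treating AND/OR/NOT as min/max/complement is an assumption of this sketch.
    """
    if isinstance(plan, str):
        return tool_scores[plan]
    op, *args = plan
    parts = [merge(arg, tool_scores) for arg in args]
    if op == "AND":
        return [min(vals) for vals in zip(*parts)]
    if op == "OR":
        return [max(vals) for vals in zip(*parts)]
    if op == "NOT":
        return [1.0 - v for v in parts[0]]
    raise ValueError(f"unknown operator: {op}")

# Hypothetical per-frame scores from two tools on a five-frame video.
tool_scores = {
    "object:dog": [0.9, 0.8, 0.1, 0.7, 0.2],
    "ocr:exit": [0.1, 0.95, 0.8, 0.6, 0.1],
}
# "Frames with a dog but without an exit sign."
plan = ("AND", "object:dog", ("NOT", "ocr:exit"))
merged = merge(plan, tool_scores)
# Rank frame indices by merged score, best first.
ranking = sorted(range(len(merged)), key=lambda i: merged[i], reverse=True)
```

Frame 0 ranks first here because it is the only frame that scores high on the dog detector and low on the exit-sign reader.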

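2605.23821 2026-05-25 cs.CL cs.LG version update

Hierarchical Concept Geometry in Language Models Emerges from Word Co-occurrence

Andres Nava, Matthieu Wyart

Affiliations: Johns Hopkins University; EPFL

AI summary: This paper studies how hypernymy (the "is-a" relation) comes to be encoded geometrically in language models through word co-occurrence. Starting from the empirical observation that co-occurrence frequency tracks hierarchical distance in WordNet, the authors analyze the spectrum of the Gram matrix of word embeddings and prove that the leading eigenvectors separate concept branches from coarse to fine, yielding a hierarchical splitting geometry that mirrors the tree. Experiments confirm the effect in word2vec and show that it also appears clearly in Gemma 2B, indicating that hierarchical concept geometry can arise from the spectral structure of pairwise word statistics without a hierarchy-specific functional mechanism.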

Comments 34 pages, 12 figures, including appendices

Abstract

We propose a distributional theory of how hypernymy -- the ``is-a'' relation between general and specific concepts -- is encoded geometrically in language representations. Starting from the empirically verified assumption that words closer on the WordNet hypernym graph co-occur more often, we characterize theoretically the spectrum of the resulting embedding Gram matrix of word2vec embeddings. Under mild positivity and decay conditions on the co-occurrence kernel, we prove that the leading eigenvectors first separate broad taxonomic branches and then progressively finer sub-branches, producing a \emph{hierarchical splitting geometry} with a coarse-to-fine spectral organization that mirrors the tree. We confirm these predictions in word2vec embeddings across many sampled WordNet subtrees, and show that the same signature extends strikingly well to Gemma 2B unembeddings. Our results indicate that hierarchical concept geometry in LLMs need not reflect a hierarchy-specific functional mechanism, but emerges from the spectral structure of pairwise word statistics.
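
The coarse-to-fine ordering can be checked by hand on a toy case. The sketch below assumes a four-leaf binary tree and a kernel that decays exponentially with tree distance, and it treats that kernel directly as the Gram matrix; both are simplifications chosen for illustration, not the paper's setup. Because the eigenvectors of this small matrix are known in closed form, the code only verifies them and compares their eigenvalues.

```python
import math

# Leaves 0,1 share one parent and leaves 2,3 share another; the parents share a root.
# Tree distance: 0 to itself, 2 between siblings, 4 across the two branches.
def tree_distance(i, j):
    if i == j:
        return 0
    return 2 if i // 2 == j // 2 else 4

# Assumed kernel: co-occurrence decays exponentially with tree distance.
K = [[math.exp(-tree_distance(i, j)) for j in range(4)] for i in range(4)]

def eigenvalue_if_eigenvector(matrix, v):
    """Return lambda if matrix @ v == lambda * v (up to rounding), else None."""
    mv = [sum(row[j] * v[j] for j in range(len(v))) for row in matrix]
    lam = sum(a * b for a, b in zip(mv, v)) / sum(b * b for b in v)
    ok = all(abs(a - lam * b) < 1e-12 for a, b in zip(mv, v))
    return lam if ok else None

uniform = eigenvalue_if_eigenvector(K, [1, 1, 1, 1])    # no split
coarse = eigenvalue_if_eigenvector(K, [1, 1, -1, -1])   # splits the two branches
fine_a = eigenvalue_if_eigenvector(K, [1, -1, 0, 0])    # splits siblings 0 and 1
fine_b = eigenvalue_if_eigenvector(K, [0, 0, 1, -1])    # splits siblings 2 and 3
```

In this toy case the branch-splitting vector has a larger eigenvalue than the two sibling-splitting vectors, which is the coarse-before-fine ordering the abstract describes.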

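2605.23721 2026-05-25 cs.CL version update

Is a Document Educational or Just Wikipedia-Style? -- Pitfalls of Classifier-Based Quality Filtering

Mateusz Klimaszewski, Piotr Andruszkiewicz

Affiliations: Warsaw University of Technology; IDEAS Research Institute

AI summary: This paper examines pitfalls of classifier-based quality filtering (CQF) for building pretraining corpora, showing that simple Wikipedia-style formatting can substantially change a classifier's judgment of content quality and let low-quality content pass the filter. Taking the FineWeb-Edu CQF model as a case study, the authors find that about 7% of documents are misclassified and therefore wrongly admitted to the pretraining corpus, exposing a key weakness of the approach in practice.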

Comments Accepted to ACL 2026

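2605.23715 2026-05-25 cs.CL version update

NLG Evaluation: Past, Present, Future

Ehud Reiter

Affiliations: Dept of Computing Science, University of Aberdeen

AI summary: This paper reviews the history of natural language generation (NLG) evaluation, from the 1990s, when the field was closely tied to linguistics, to its present close integration with machine learning, and looks ahead to how evaluation may change. As NLG technology is widely deployed, evaluation is moving from traditional formal experiments toward a focus on impact, quality, and safety, with LLM-based evaluation (such as LLM-as-Judge) the latest trend. The paper traces this development and discusses the key challenges and directions for future evaluation.

Comments Will appear in Proceedings of RetroEval 2026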

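2605.23710 2026-05-25 cs.CL version update

A graph-based analysis of semantic types and coercion in contextualized word embeddings

Long Chen, Deniz Ekin Yavas

Affiliations: Heinrich Heine University Düsseldorf

AI summary: This paper studies how semantic type mismatch (coercion) shows up in contextualized word embeddings, and proposes a graph-based method for analyzing the semantic type information linking a word to its context. The authors build graphs over BERT embeddings and semantically enriched embeddings and introduce two metrics, neighborhood type probability and entropy. They find that the semantically enriched embeddings reflect semantic type information more effectively and can distinguish matching from mismatching sentences.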

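2605.23701 2026-05-25 cs.CL version update

Metadata Predictability Is Not Evidence Dependence: An Intervention-Based Audit for Weak-Label Benchmarks

Kan Shao

Affiliations: Jinglue Technology Development (Nanjing) Co., Ltd.

AI summary: This paper studies a protocol-level test for weak-label benchmarks: whether benchmark outputs change when the provided evidence is intervened on. The author notes that metadata-only shortcut checks ask whether outputs are predictable from metadata, not whether they depend on the evidence. The paper therefore pairs the metadata statistic MPDS with the evidence-intervention statistic ΔEvi to separate metadata predictability from evidence sensitivity, shows how datasets behave differently under different models, and argues that benchmark audits should report metadata screening, evidence intervention, and reader-strength calibration together.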

Comments 5 pages, 1 figure, 1 table. Accepted at ICML 2026 Workshop on Hypothesis Testing

Abstract

We study a protocol-level test for weak-label benchmarks: whether benchmark outputs change when the provided evidence is intervened on. Metadata-only shortcut checks answer a different question, namely whether outputs are predictable from metadata priors. We therefore combine a metadata statistic, the Metadata Prior Dominance Score (MPDS), with an evidence-intervention statistic, ΔEvi, measuring sensitivity to evidence identity under cross-item shuffling. Synthetic HotpotQA gives a constructed counterexample to metadata-only screening: MPDS is only moderate (0.643), yet ΔEvi is zero. Stronger-reader reruns show why calibration belongs in the test procedure: SNLI shows a calibration reversal, reconstructed HotpotQA occupies a question-dominant warning region, and FEVER is a strongly evidence-sensitive positive control across four transformers. The practical lesson is simple: benchmark audits should report metadata-only screening, evidence intervention, and reader-strength calibration together.
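
The abstract names ΔEvi but does not give its formula. As an illustrative stand-in only, the sketch below assumes ΔEvi is the fraction of items whose output changes when each item receives another item's evidence, with a cyclic shift playing the role of the cross-item shuffle. The readers and items are toy constructions, not the paper's models or data.

```python
def delta_evi(reader, items):
    """Fraction of items whose output changes under cross-item evidence shuffling.

    Each item is a (question, evidence) pair. A cyclic shift by one is used as the
    shuffle so that every item receives another item's evidence.
    """
    questions = [q for q, _ in items]
    evidence = [e for _, e in items]
    shuffled = evidence[1:] + evidence[:1]
    original_out = [reader(q, e) for q, e in zip(questions, evidence)]
    shuffled_out = [reader(q, e) for q, e in zip(questions, shuffled)]
    changed = sum(a != b for a, b in zip(original_out, shuffled_out))
    return changed / len(items)

items = [
    ("Is the sky blue?", "supports"),
    ("Is grass red?", "refutes"),
    ("Is snow white?", "supports"),
    ("Is coal white?", "refutes"),
]

# Toy reader that ignores the evidence and answers from the question alone.
def question_only_reader(question, evidence):
    return "yes" if "blue" in question or "snow" in question else "no"

# Toy reader whose answer is determined by the evidence.
def evidence_reader(question, evidence):
    return "yes" if evidence == "supports" else "no"

shortcut_score = delta_evi(question_only_reader, items)
sensitive_score = delta_evi(evidence_reader, items)
```

Both toy readers answer every original item identically, yet only the second depends on the evidence; a zero score for the first mirrors the abstract's synthetic HotpotQA case, where ΔEvi is zero.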

2605.22643 2026-05-25 cs.CL version update

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Federico Sartore, Enrico Panai, Laura Caroli, Yue Zhu, Adam Leon Smith, Luca Nannini, Marcello Galisai, Susanna Cifani, Francesco Giarrusso, Marcantonio Bracale Syrnikov, Daniele Nardi

Affiliations: Icaro Foundation; Sapienza University of Rome; Sant’Anna School of Advanced Studies; Tongji University School of Law; AIQI Consortium; Università Cattolica del Sacro Cuore; Piccadilly Labs; VU Amsterdam; Independent

AI summary: This paper introduces Boiling the Frog, a multi-turn benchmark for evaluating the safety of language models deployed as agents in corporate and office settings, with a focus on incremental attacks. The benchmark simulates multi-turn interaction in a working environment and tests whether a model's actions become unsafe as risk-bearing requests are introduced step by step. Attack success rates vary widely across models and reach 92.9% for some, underscoring how far current AI systems remain from adequate safeguards.

Abstract

Background. Traditional safety benchmarks for language models evaluate generated text: whether a model outputs toxic language, reproduces bias, or follows harmful instructions. When models are deployed as agents, the safety-relevant object shifts from what the system says to what it does within an environment, and evaluating model responses under prompting is no longer sufficient to address the safety challenges posed by artificial intelligence. Recent developments have seen the rise of benchmarks that evaluate large language models as agents. We contribute to this strand of research. Approach. We introduce Boiling the Frog, a benchmark that evaluates whether tool-using AI models deployed in corporate and office settings are susceptible to incremental attacks. Each scenario begins with benign workspace edits and later introduces a risk-bearing request. The benchmark focuses on stateful multi-turn evaluation: chains expose a persistent workspace, place the risk-bearing payload at controlled positions in the turn sequence, and score whether the resulting artifact state becomes unsafe. Scenarios are organized through a three-level operational risk taxonomy grounded in the Boiling the Frog risks, the AI Act Annex I and Annex III high-risk contexts, and the EU AI Act's Code of Practice on General-Purpose AI (GPAI). Results. Across a nine-model panel, aggregate strict attack success rate (ASR) is 44.4%. Model-level ASR ranges from 20.5% for Claude Haiku 4.5 to 92.9% for Gemini 3.1 Flash Lite, with Seed 2.0 Lite also above 80%. Average chain category-level ASR reaches 93.3% for Code of Practice loss-of-control scenarios.

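2605.19069 2026-05-25 cs.CL cs.AI version update

Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German

Sajjad Abdoli, Ghassan Al-Sumaidaee, Clayton W. Taylor, Ahmad ElShiekh, Ahmed Rashad

Affiliations: Perle AI

AI summary: This paper evaluates automatic speech recognition (ASR) systems on code-switching speech, covering four language pairs that combine Arabic, Persian, or German with English. A two-stage filtering pipeline selects 300 samples per pair, which are scored with BERTScore and word error rate (WER); the two metrics agree on system rankings for the Arabic and Persian pairs but differ in the size of the quality gaps they report. The study also exposes performance gaps among commercial ASR systems on code-switched speech and releases the dataset for further research.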

Abstract

Code-switching -- the natural alternation between two languages within a single utterance -- remains one of the most challenging and under-studied conditions for automatic speech recognition (ASR). We present a benchmark evaluating five commercial ASR providers across four language pairs: Egyptian Arabic--English, Saudi Arabic (Najdi/Hijazi)--English, Persian (Farsi)--English, and German--English, comprising 300 samples per pair selected by a two-stage pipeline combining heuristic filtering with a GPT-4o and Gemini 1.5 Pro ensemble scorer, reducing LLM costs by $\approx$91\%. We evaluate on both WER and BERTScore, showing that while both metrics agree on the ordinal ranking of systems for all Arabic and Persian pairs ($\tau = 1.0$), WER inflates the magnitude of quality gaps by approximately 3$\times$ by penalising semantically correct transliteration choices. ElevenLabs Scribe v2 achieves the lowest WER (13.2\% overall) and leads on BERTScore (0.936 overall). Difficulty-stratified analysis reveals performance gaps masked by aggregate averages, and BERT embedding projections confirm semantic proximity between reference and hypothesis despite surface-level script differences. The dataset is publicly available at https://huggingface.co/datasets/Perle-ai/ASR_Code_Switch.
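
To make the transliteration penalty concrete, here is a minimal sketch of the standard WER definition (word-level edit distance divided by reference length). The two strings are made-up examples in which one token is spelled in two different romanizations; they are not drawn from the dataset.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(substitution, prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1] / len(ref)

# Hypothetical pair: the fourth token differs only in how it is romanized.
reference = "send me the taqrir today"
hypothesis = "send me the takreer today"
wer = word_error_rate(reference, hypothesis)
```

WER counts the respelled token as a full substitution, so one of five words is scored as wrong even if the meaning is intact; an embedding-based metric such as BERTScore would be expected to penalize this less.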

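2605.02443 2026-05-25 cs.CL version update

HalluScan: A Systematic Benchmark for Detecting and Mitigating Hallucinations in Instruction-Following LLMs

Ahmed Cherif

Affiliations: Sofrecom Tunisia; Orange Innovation

AI summary: HalluScan is a systematic benchmark framework for detecting and mitigating hallucinations in instruction-following large language models. The work proposes HalluScore as a new composite evaluation metric, introduces the Adaptive Detection Routing algorithm to make detection more efficient, and uses an error-type decomposition to reveal substantial differences in how hallucinations manifest across domains. Experiments show that NLI Verification is the most effective detection method.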

Comments 38 pages, 13 figures, 10 tables. Submitted to Neural Computing and Applications

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks, yet they remain susceptible to hallucinations -- generating content that is factually incorrect, unfaithful to provided context, or misaligned with user instructions. We present HalluScan, a comprehensive benchmark framework that systematically evaluates hallucination detection and mitigation across 72 configurations spanning 6 detection methods, 4 open-weight model families, and 3 diverse domains. We introduce three key contributions: (1) HalluScore, a novel composite metric that achieves a Pearson correlation of r = 0.41 with human expert judgments; (2) Adaptive Detection Routing (ADR), an intelligent routing algorithm achieving 2.0x cost reduction with only 0.1% AUROC degradation; and (3) systematic error cascade decomposition revealing substantial variation in hallucination error types across domains. Our experiments reveal that NLI Verification achieves the highest overall AUROC of 0.88, while RAV achieves the second-highest AUROC of 0.66.

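2604.17134 2026-05-25 cs.CL version update

RoIt-XMASA: Multi-Domain Multilingual Sentiment Analysis Dataset for Romanian and Italian

Andrei-Marius Avram, Aureliu Valentin Antonie, Cosmin-Mircea Croitoru, Vlad Andrei Muntean, Dumitru-Clementin Cercel

Affiliations: National University of Science and Technology POLITEHNICA Bucharest

AI summary: This paper introduces RoIt-XMASA, a multilingual dataset that extends the cross-lingual multi-domain Amazon sentiment analysis dataset to Italian and Romanian, with 36,000 labeled reviews and 202,141 unlabeled samples. To handle the cross-lingual and cross-domain setting, the authors propose a multi-objective adversarial training framework that uses loss reversal and meta-learned coefficients to dynamically balance sentiment discrimination against domain and language invariance. With XLM-R the method reaches an F1 score of 66.23%, 4.64% above the baseline, and few-shot experiments show a performance trade-off between task fine-tuning and prompting.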

Comments Accepted at the International AAAI Conference on Web and Social Media (ICWSM 2026)

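2511.10404 2026-05-25 cs.CL version update

DELICATE: Diachronic Entity LInking using Classes And Temporal Evidence

Cristian Santini, Sebastian Barzaghi, Paolo Sernani, Emanuele Frontoni, Mehwish Alam

Affiliations: University of Macerata; University of Bologna; Télécom Paris, Institut Polytechnique de Paris

AI summary: This paper proposes DELICATE, a neuro-symbolic method for entity linking in historical Italian texts. It combines a BERT-based encoder with contextual information from Wikidata and selects knowledge-base entities using temporal plausibility and entity-type consistency. The authors also build ENEIDE, a multi-domain entity linking corpus of literary and political texts from the 19th and 20th centuries. Experiments show that DELICATE outperforms other entity linking models on historical text, and that its confidence scores and feature sensitivity make the results more interpretable.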

Abstract

In spite of the remarkable advancements in the field of Natural Language Processing, the task of Entity Linking (EL) remains challenging in the field of humanities due to complex document typologies, lack of domain-specific datasets and models, and long-tail entities, i.e., entities under-represented in Knowledge Bases (KBs). The goal of this paper is to address these issues with two main contributions. The first contribution is DELICATE, a novel neuro-symbolic method for EL on historical Italian which combines a BERT-based encoder with contextual information from Wikidata to select appropriate KB entities using temporal plausibility and entity type consistency. The second contribution is ENEIDE, a multi-domain EL corpus in historical Italian semi-automatically extracted from two annotated editions spanning from the 19th to the 20th century and including literary and political texts. Results show that DELICATE outperforms other EL models in historical Italian even when compared with larger architectures with billions of parameters. Moreover, further analyses reveal how DELICATE's confidence scores and feature sensitivity provide results which are more explainable and interpretable than purely neural methods.

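2509.19858 2026-05-25 cs.CL version update

Benchmarking Gaslighting Attacks Against Speech Large Language Models

Jinyang Wu, Bin Zhu, Xiandong Zou, Qiquan Zhang, Xu Fang, Pan Zhou

Affiliations: Singapore Management University; Tongyi Speech Lab; Dalian University of Technology

AI summary: As Speech Large Language Models (Speech LLMs) are widely adopted in voice applications, ensuring their robustness to manipulative or adversarial input becomes critical. This paper introduces gaslighting attacks, carefully crafted prompts that mislead model reasoning, to assess the vulnerability of Speech LLMs, and proposes five manipulation strategies for testing robustness across tasks. Across the five strategies, model accuracy drops by 24.3% on average, exposing significant behavioral vulnerability in current speech AI systems.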

Comments 5 pages, 2 figures, 3 tables

Abstract

As Speech Large Language Models (Speech LLMs) become increasingly integrated into voice-based applications, ensuring their robustness against manipulative or adversarial input becomes critical. Although prior work has studied adversarial attacks in text-based LLMs and vision-language models, the unique cognitive and perceptual challenges of speech-based interaction remain underexplored. In contrast, speech presents inherent ambiguity, continuity, and perceptual diversity, which make adversarial attacks more difficult to detect. In this paper, we introduce gaslighting attacks, strategically crafted prompts designed to mislead, override, or distort model reasoning as a means to evaluate the vulnerability of Speech LLMs. Specifically, we construct five manipulation strategies: Anger, Cognitive Disruption, Sarcasm, Implicit, and Professional Negation, designed to test model robustness across varied tasks. It is worth noting that our framework captures both performance degradation and behavioral responses, including unsolicited apologies and refusals, to diagnose different dimensions of susceptibility. Moreover, acoustic perturbation experiments are conducted to assess multi-modal robustness. To quantify model vulnerability, comprehensive evaluation across 5 Speech and multi-modal LLMs on over 10,000 test samples from 5 diverse datasets reveals an average accuracy drop of 24.3% under the five gaslighting attacks, indicating significant behavioral vulnerability. These findings highlight the need for more resilient and trustworthy speech-based AI systems.

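2412.14642 2026-05-25 cs.CL version update

Speak-to-Structure: Evaluating LLMs in Open-domain Natural Language-Driven Molecule Generation

Jiatong Li, Junxian Li, Weida Wang, Yunqing Liu, Changmeng Zheng, Yatao Bian, Dongzhan Zhou, Xiao-yong Wei, Qing Li

Affiliations: Hong Kong Polytechnic University; Shanghai Jiao Tong University; Shanghai AI Lab; National University of Singapore

AI summary: Large language models (LLMs) have recently shown great potential for natural language-driven molecule discovery, but existing datasets and benchmarks are mostly built on one-to-one mappings and only measure whether a model can retrieve a single predefined answer, ignoring its ability to generate diverse, valid molecular candidates. The authors propose S^2-Bench, the first benchmark for evaluating LLMs on open-domain natural language-driven molecule generation, designed around one-to-many relationships that require genuine molecular understanding and open-ended generation. They also introduce OpenMolIns, a large-scale instruction tuning dataset with which Llama3.1-8B surpasses strong LLMs such as GPT-4o and Claude-3.5 on the benchmark, and they evaluate 31 LLMs to shift the focus from simple pattern recall to realistic molecular design.

Comments Accepted by KDD 2026. Our code and datasets are fully accessible at https://github.com/phenixace/S2-TOMG-Bench and https://huggingface.co/datasets/phenixace/S2-TOMG-Bench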

DOI: 10.1145/3770855.3817473

Abstract

Recently, Large Language Models (LLMs) have demonstrated great potential in natural language-driven molecule discovery. However, existing datasets and benchmarks for molecule-text alignment are predominantly built on one-to-one mappings, measuring LLMs' ability to retrieve a single, pre-defined answer, rather than their creative potential to generate diverse, yet equally valid, molecular candidates. To address this critical gap, we propose Speak-to-Structure (S^2-Bench), the first benchmark to evaluate LLMs in open-domain natural language-driven molecule generation. S^2-Bench is specifically designed for one-to-many relationships, challenging LLMs to exhibit genuine molecular understanding and open-ended generation capabilities. Our benchmark includes three key tasks: molecule editing (MolEdit), molecule optimization (MolOpt), and customized molecule generation (MolCustom), each probing a different aspect of molecule discovery. We also introduce OpenMolIns, a large-scale instruction tuning dataset that enables Llama3.1-8B to surpass the most powerful LLMs like GPT-4o and Claude-3.5 on S^2-Bench. Our comprehensive evaluation of 31 LLMs shifts the focus from simple pattern recall to realistic molecular design, paving the way for more capable LLMs in natural language-driven molecule discovery. Our code and datasets are fully accessible through the GitHub repository: https://github.com/phenixace/S2-TOMG-Bench and Hugging Face Datasets: https://huggingface.co/datasets/phenixace/S2-TOMG-Bench.

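2605.23668 2026-05-25 cs.CL cs.AI version update

OnePred: Next-Query Prediction via Recursive Intent Memory in Multi-Turn Conversations

Jiangwang Chen, Bowen Zhang, Zixin Song, Jiazheng Kang, Xiao Yang, Da Zhu, Guanjun Jiang

Affiliations: Tsinghua University; Qwen Applications Business Group of Alibaba

AI summary: This work proposes OnePred, a model that predicts the user's next query in multi-turn conversations, with the aim of making conversational systems more proactive. Its core idea is a recursively updated intent memory that tracks how user intent evolves, which allows efficient and accurate prediction without the full dialogue history. The model is trained with two-stage reinforcement learning, first learning what to predict and then what to compress. The authors also release the NQP-Bench benchmark; experiments show that OnePred cuts per-turn token consumption by up to 22× relative to full-history input while maintaining prediction quality.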

Abstract

Although large language model (LLM) conversational systems process millions of multi-turn dialogues daily, they remain fundamentally reactive: they respond only after the user types a query. A key step toward proactive interaction is next-query prediction, which anticipates the user's subsequent query based solely on the preceding dialogue. Progress on this task is hindered by the lack of dedicated benchmarks and a fundamental efficiency--quality trade-off: naively concatenating full dialogue history incurs linearly growing token consumption, while truncating to the latest turn discards crucial cross-turn context. Our key insight is that accurate prediction does not require re-reading raw history; it suffices to track the user's evolving intent trajectory across topics, unresolved needs, and interest shifts. We propose OnePred, which maintains a recursively updated memory as its sole cross-turn context, bounding the per-turn cost independently of conversation length. We train the model via a two-stage reinforcement learning pipeline that first teaches what to predict, then what to compress, shaping the memory into a prediction-oriented intent chain. To establish a rigorous testbed, we introduce NQP-Bench, spanning three diverse subsets. Experiments demonstrate that OnePred reduces per-turn token consumption by up to 22$\times$ compared to full-history inputs while consistently exceeding all baselines in prediction quality, with larger gains on longer conversations. Our code is publicly available at https://github.com/ZBWpro/OnePred.
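
The cost argument can be illustrated with a small word-count simulation. In the sketch below, `update_memory` is a hypothetical stand-in that simply keeps the newest words within a fixed budget; the paper instead learns the memory update with reinforcement learning, so only the bounded-size property carries over.

```python
def update_memory(memory, turn, budget):
    """Hypothetical stand-in for the learned memory update: keep the newest words."""
    words = (memory + " " + turn).split()
    return " ".join(words[-budget:])

def per_turn_input_sizes(turns, budget):
    """Input length in words at each turn for full history vs. recursive memory."""
    full_sizes, memory_sizes = [], []
    history, memory = [], ""
    for turn in turns:
        history.append(turn)
        # Full-history input: every turn so far is re-read.
        full_sizes.append(sum(len(t.split()) for t in history))
        # Memory-based input: the bounded memory plus the current turn only.
        memory_sizes.append(len(memory.split()) + len(turn.split()))
        memory = update_memory(memory, turn, budget)
    return full_sizes, memory_sizes

turns = ["user asks about flights to rome in may"] * 6   # 8 words per turn
full_sizes, memory_sizes = per_turn_input_sizes(turns, budget=10)
```

With a fixed budget, the memory-based input stops growing after the first few turns, while the full-history input grows by one turn's length every turn.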

2605.23618 2026-05-25 cs.CL version update

Benchmarking Google Embeddings 2 against Open-Source Models for Multilingual Dense Retrieval and RAG Systems

Stefano Cirillo, Domenico Desiato, Giuseppe Polese, Giandomenico Solimando

Affiliations: University of Salerno; University of Bari

AI summary: This paper compares Google Embeddings 2 (GE2) with five open-source models on multilingual dense retrieval and RAG systems. GE2 performs best across tasks but has high latency; mE5-L keeps retrieval quality close to GE2's at much lower latency, making it a better fit for latency-sensitive applications. The experiments also show that all models saturate at 32-token chunks and that semantic chunking brings a clear gain only at 16-token chunks.

Comments 9 pages, 2 figures, 5 tables. Text and evaluation code available at https://github.com/cciro94/GoogleEmbeddings2-benchmark

Abstract

We benchmark Google Embeddings (GE2), a Vertex-AI-hosted bi-encoder with 2,048-token context and explicit task-type conditioning, against five open-source alternatives: BGE-M3, E5-large, Multilingual-E5-large (mE5-L), LaBSE, and Paraphrase-Multilingual-MPNet (mMPNet). Evaluation covers four BEIR subsets, a synthetic Italian RAG corpus, a chunking ablation over five chunk sizes (in tokens) and three strategies, and per-query latency on commodity CPU hardware. GE2 ranks first on every task, achieving BEIR avg. nDCG@10 = 0.638 and IT-RAG-Bench nDCG@10 = 0.282, but at 231.6 ms median latency, it is roughly 14x slower than the fastest local models. mE5-L reaches within 0.003 nDCG of GE2 on Italian at 31 ms, making it the preferred option when sub-100 ms SLAs matter. A more striking finding concerns LaBSE, which, despite widespread multilingual deployment, scores 0.188 average nDCG@10 on BEIR, below every dedicated retrieval model including mMPNet. Chunking experiments show that all six models saturate at 32-token chunks on our corpus, with semantic chunking providing measurable gains only at 16 tokens.
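
For readers unfamiliar with the headline metric, the following is a minimal sketch of nDCG@k using one common definition (linear gain, log2 rank discount). It is not the paper's evaluation code, and the relevance labels are a toy example.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of a ranked list of graded relevances."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Toy query: two relevant documents exist; the system returns them at ranks 1 and 3.
system_ranking = [1, 0, 1, 0, 0]
score = ndcg_at_k(system_ranking, all_relevances=[1, 1, 0, 0, 0], k=10)
```

A score of 1.0 means the system's top-k ordering matches the ideal ordering; placing a relevant document lower in the list reduces the score through the logarithmic discount.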

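2605.23605 2026-05-25 cs.LG cs.AI cs.CL version update

DiLaDiff: Distilled Latent-Augmented Diffusion for Language Modeling

Jean-Marie Lemercier, Tomas Geffner, Karsten Kreis, Morteza Mardani, Arash Vahdat, Ante Jukić

Affiliations: NVIDIA

AI summary: DiLaDiff is an improved diffusion language model designed to resolve the tension between sample quality and generation speed in conventional diffusion models. It introduces a continuous semantic latent space and uses an autoencoder together with consistency distillation to improve generation efficiency and quality. Experiments show that DiLaDiff already outperforms baseline models without distillation and that distillation speeds up inference substantially.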

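2605.23597 2026-05-25 cs.CL cs.LG version update

Structure-Guided Entity Resolution: Fine-Tuning LLMs for Robust Name Matching in Complex Linguistic Contexts

Shivam Chourasia, Hitesh Kapoor, Nilesh Patil

Affiliations: Dream Sports

AI summary: This paper addresses person-name matching for entity resolution in linguistically and culturally complex settings and proposes Structure-Guided Entity Resolution (SGER), a framework that strengthens an LLM's grasp of name structure and semantics through two-phase curriculum fine-tuning, improving match accuracy. On Indian identity data, a real-world setting with high linguistic diversity and noise, the method reaches 99.02% accuracy, and it has been deployed in production, supporting its effectiveness and robustness in large-scale multilingual systems.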

Comments Accepted to ACL 2026. 8 pages, 1 figure, 2 tables

Abstract

Matching person names across heterogeneous records is a core challenge in entity resolution, especially within linguistically and culturally complex environments. Variations in naming conventions, inconsistent transliteration across scripts, and frequent data entry errors make it difficult to unify user identities, an essential requirement for Know Your Customer (KYC) compliance. While Large Language Models have shown promise in understanding natural language, they often struggle with the structured ambiguity present in such domain-specific settings. This paper introduces Structure-Guided Entity Resolution (SGER), a novel framework that fine-tunes an LLM through a two-phase curriculum. The model is first trained to parse the grammatical and semantic structure of personal names, then optimized for the downstream task of binary entity matching. We evaluate SGER in the challenging context of Indian identity data, one of the most linguistically diverse and noisy environments globally. SGER achieves 99.02% accuracy and an F1 of 0.994 on a held-out set of 50,000 real-world pairs, outperforming GPT-4o few-shot prompting and single-stage fine-tuning baselines. The system is fully deployed in production at Dream11, the world's largest fantasy sports platform, serving 250M+ users. Our results demonstrate that curriculum-guided training enables robust, high-precision entity resolution in real-world multilingual systems at scale.

2605.23497 2026-05-25 cs.CL 版本更新

Asking For An Old Friend: Diagnosing and Mitigating Temporal Failure Modes in LLM-based Statutory Question Answering

询问老朋友：诊断和缓解基于LLM的法定问答中的时间故障模式

Max Prior, Andreas Schultz, Matthias Grabmair

发表机构 * Technical University of Munich（慕尼黑工业大学）

AI总结该研究探讨了基于大语言模型（LLM）的法律问答系统在处理时效性法律条文时的两种失效模式：法规更新后的过时问题和对较新法规的偏好偏差。为此，研究构建了一个包含312个专家验证的德语法律问答对的基准数据集，并在不同推理设置下评估了多个LLM的表现。结果表明，引入基于检索的增强方法能显著提升模型在时间有效性方面的性能，而单纯依赖网络搜索则存在不稳定性和近期偏好问题，研究强调了在法律问答中必须将时间有效性作为硬性约束。

详情

AI中文摘要

大型语言模型越来越多地用于法律研究，但其固定的训练截止日期和对静态参数知识的依赖与成文法的演变性质相矛盾。我们研究了两种时间故障模式：截止后过时（模型在立法修正后应用被取代的规则）和近因偏差（即使历史版本支配事实模式，模型也偏好较新的规定）。为此，我们提出了一个包含312个专家验证、时间敏感的德国法定问答对的基准，涵盖三个类别：截止后修正问题、修正前问题和多条款修正前问题。我们评估了来自OpenAI、Anthropic和DeepSeek的五个LLM，在四种推理设置下：普通、网络搜索和两种检索增强变体（通过事实日期提取和版本过滤强制执行时间有效性）。使用经过人类专家评分验证的LLM作为评判，我们发现普通设置在截止后设置中性能严重下降。两种RAG方法在所有问题类型上均显著提高了性能，而网络搜索则产生不稳定的收益，并在历史锚定任务上表现出明显的近因偏差。我们的结果表明，可靠的法律问答需要将时间有效性视为硬约束。

英文摘要

Large language models are increasingly used for legal research, yet their fixed training cutoffs and reliance on static parametric knowledge are at odds with the evolving nature of statutory law. We study two temporal failure modes: post-cutoff staleness, where models apply superseded rules after legislative amendments, and recency bias, where models prefer newer provisions even when a historical version governs the fact pattern. To this end, we present a benchmark of 312 expert-validated, time-sensitive German statutory QA pairs spanning three categories: Post-Cutoff Amendment Questions, Pre-Amendment Questions, and Multi-Provision Pre-Amendment Questions. We evaluate five LLMs by OpenAI, Anthropic and DeepSeek under four inference settings: Vanilla, Web-search, and two retrieval-augmented variants that enforce temporal validity via a fact date extraction and version filtering. Using an LLM-as-a-judge validated against human expert ratings, we find severe degradation in the Vanilla post-cutoff setting. Both RAG approaches substantially improve performance across all question types, while web search yields unstable gains and exhibits a marked recency bias on historically anchored tasks. Our results indicate that reliable legal QA requires treating temporal validity as a hard constraint.
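
The version-filtering idea described above (extract the date of the facts, then retrieve only the statute version in force on that date) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `ProvisionVersion` schema, the `version_in_force` helper, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class ProvisionVersion:
    """One historical version of a statutory provision (hypothetical schema)."""
    provision_id: str
    text: str
    valid_from: date
    valid_to: Optional[date]  # None means the version is still in force


def version_in_force(
    versions: List[ProvisionVersion], provision_id: str, fact_date: date
) -> Optional[ProvisionVersion]:
    """Return the version of `provision_id` that governed on `fact_date`, if any."""
    for v in versions:
        if v.provision_id != provision_id:
            continue
        started = v.valid_from <= fact_date
        not_ended = v.valid_to is None or fact_date <= v.valid_to
        if started and not_ended:
            return v
    return None


# Illustrative data only: two versions of the same (made-up) provision.
versions = [
    ProvisionVersion("§ 1", "old wording", date(2015, 1, 1), date(2022, 12, 31)),
    ProvisionVersion("§ 1", "amended wording", date(2023, 1, 1), None),
]

# A fact pattern dated 2021 must be answered with the pre-amendment text.
hit = version_in_force(versions, "§ 1", date(2021, 6, 15))
```

Filtering before generation, rather than asking the model to pick among versions, is what makes temporal validity a hard constraint: a superseded or not-yet-enacted text never reaches the prompt.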

2605.23420 2026-05-25 cs.CL 版本更新

Naturalistic measure of social norms alignment

社会规范一致性的自然主义度量

Yevhen Kostiuk, Kenneth Enevoldsen, Peter Bjerregaard Vahlstrup, Márton Kardos, Kristoffer Nielbo

发表机构 * Aarhus University（奥胡斯大学）

AI总结该研究旨在解决社会规范对齐的自然化测量问题，传统方法多依赖人工设定的封闭式评估，而本文提出了一种基于自由形式解决方案匹配的框架，用于衡量不同主体（如人类或大型语言模型）在社会困境中的回应一致性。研究引入了两个评估指标，并构建了一个包含3000个非平凡社会困境的丹麦语数据集，每个困境均附有三位文化背景评审提供的参考解决方案。实验表明，该方法能够有效区分模型在不同社会议题上的对齐程度，尤其在邻里冲突和共同居住等话题上表现出更高的共识水平。

详情

AI中文摘要

社会规范反映了对可接受行为的共同期望。测量社会规范一致性仍然具有挑战性，现有方法通常依赖于人为的封闭式评估，如多项选择问卷或衡量与预定义陈述的一致性。在本工作中，社会规范一致性指的是衡量解决方案在应对社会问题或困境时的一致性。我们提出了一个框架，通过解决方案匹配在自然主义、自由形式的设置中测量社会规范一致性。该框架使我们能够衡量任意两个困境响应之间的一致性，例如LLM与人类、LLM与LLM或人类与人类之间。我们引入了两个指标：陈述一致性和显式一致性准确性，并构建了一个包含3000个非平凡社会困境的丹麦语数据集。所有困境都分配了来自三位小组成员的参考解决方案，他们作为文化背景法官。我们在类似于自然用户模型对话的交互设置中评估了几个LLM和人类响应的一致性。我们的结果表明，所提出的指标产生一致的模型排名，并揭示了不同类型困境之间一致性的变化，其中在邻里冲突和共享生活情境等主题上观察到更高的一致性。总体而言，我们的工作引入了一个数据集和评估框架，用于研究自然主义开放式对话中基于文化背景的社会推理。

英文摘要

Social norms reflect shared expectations on acceptable behavior. Measuring social norms alignment remains challenging, with existing approaches typically relying on artificial closed-form evaluations such as multiple-choice questionnaires or measuring agreement with predefined statements. In the context of this work, social norms alignment refers to measuring an agreement between solutions with respect to the social problem or dilemma. We propose a framework for measuring social norm alignment in naturalistic, free-form settings through solution matching. The framework enables us to measure alignment between any two dilemma responses e.g., LLMs to a human, LLMs to LLMs, or human to human. We introduce two metrics: stated and explicit agreement accuracy, and construct a dataset of 3k non-trivial social dilemmas in Danish. All dilemmas are assigned reference solutions derived from three panelists, who serve as culturally grounded judges. We evaluate the agreement of several LLMs and human responses in an interaction setup that resembles natural user-model conversations. Our results show that the proposed metrics produce consistent model rankings and reveal variation in agreement across different types of dilemmas, with higher agreement observed for topics such as neighbor conflicts and shared living situations. Overall, our work introduces a dataset and evaluation framework for studying culturally grounded social reasoning in naturalistic open-ended conversations.

2605.23416 2026-05-25 cs.CL cs.SD 版本更新

Articulatory strategy as a source of variation in acoustic vowel dynamics

发音策略作为声学元音动态变异的一个来源

Patrycja Strycharczuk, Justin J. H. Lo, Sam Kirkham

发表机构 * Linguistics and English Language, University of Manchester, United Kingdom（曼彻斯特大学语言学与英语语言系，英国）； Linguistics and English Language, Lancaster University, United Kingdom（兰卡斯特大学语言学与英语语言系，英国）

AI总结本研究探讨了发音策略如何影响元音的声学动态变化，揭示了个体发音习惯与音素形式过渡之间的关系。通过分析36位北英格兰英语说话者的舌部超声影像数据，研究发现元音/i/的舌形是影响带有腭化滑音的双元音中共振峰动态变化的重要因素。研究结果表明，舌根和舌背的更大运动幅度会导致共振峰过渡更早且更陡峭，并为理解语音个体差异提供了新的视角。

详情

DOI: 10.1121/10.0043814
Journal ref: Journal of the Acoustical Society of America (2026) 159(5): 4068-4078

AI中文摘要

声学元音动态具有一些说话者识别特征，这些特征被归因于发音策略的个体特性：共振峰过渡具有特定形状，因为说话者使用特定且熟练的动作移动发音器官。然而，现有证据很少表明不同的发音策略会系统性地影响共振峰动态。本研究证实了二者之间的联系。使用来自36位北盎格鲁英语说话者的超声舌成像数据，识别出腭元音/i/产生的不同发音策略。发现/i/中的舌形是腭滑音双元音中共振峰动态的重要预测因子。观察到的关系可以通过声道形状调节的发音运动特征来解释。舌根和/或舌背的更大发音位移会产生腭元音中与平均舌形的更大偏差，并且还需要更高的发音速度，导致相对更早且更陡的共振峰过渡。结果通过阐明发音补偿的规律性和个体性方面，有助于对言语个体性的概念理解。

英文摘要

Acoustic vowel dynamics have some speaker-identifying characteristics, which have been ascribed to individual properties of articulatory strategies: formant transitions have a particular shape because speakers move their articulators, using specific and practised movements. However, there is little existing evidence that different articulatory strategies systematically affect formant dynamics. The present study corroborates the link between the two. Ultrasound tongue imaging data from 36 speakers of Northern-Anglo English are used to identify distinct articulatory strategies for the production of palatal vowel /i/. Tongue shape in /i/ is found to be a significant predictor of formant dynamics in diphthongs with a palatal offglide. The observed relationships can be explained by the characteristics of articulatory movement conditioned by vocal tract shape. Greater articulatory displacement of tongue root and/or dorsum produces greater distortion from the mean tongue shape in palatal vowels, and it also requires higher articulatory velocities, resulting in relatively earlier and steeper formant transitions. The results contribute to the conceptual understanding of individuality in speech, by illuminating the regularising and individual aspects of articulatory compensation.

2605.23412 2026-05-25 cs.CL 版本更新

EquiSumm: A Gender Bias-Aware Framework for Inclusive Tweet Summarization

EquiSumm：面向包容性推文摘要的性别偏见感知框架

Chaitanya Wanjari, Jessica Kamal, Riddhi Jain, Samruddhi Kurhe, Roshni Chakraborty

发表机构 * ABV IIITM Gwalior（ABV IIITM 格瓦利尔）

AI总结在新闻事件中，社交媒体平台如Twitter提供了大规模意见分享的渠道，但人工处理海量内容以提炼关键观点是不现实的。为此，已有自动摘要技术被提出，但大多未考虑人口统计学公平性，尤其是性别偏差问题。本文提出EquiSumm，一种关注性别偏见的包容性推文摘要框架，通过考虑意见中的性别因素生成更加公平的摘要，实验结果表明其在两个主要数据集上的性能优于现有方法。

Comments Accepted at AI for Social Good Workshop, Pattern Recognition and Machine Intelligence (PReMI 2025), IIT Delhi. 6 pages, 2 figures

详情

AI中文摘要

虽然Twitter等社交媒体平台为新闻事件期间的大规模意见分享提供了媒介，但个人或媒体机构手动处理大量内容以识别关键观点是不可能的。为了解决这个问题，已经提出了几种自动摘要技术，将大量推文压缩成简洁且信息丰富的摘要。然而，这些算法没有明确考虑人口统计公平性。现有的若干研究工作已经开发了自动摘要方法，可以提供社交媒体平台上与新闻事件相关的关键方面和主要意见的整体概述。然而，这些方法没有明确考虑不同形式的人口统计代表性，例如性别，这可能导致有偏见的摘要表示。在本文中，我们提出了EquiSumm，它考虑了共享意见的性别方面来生成摘要，我们在两个主要数据集上的实验分析表明了相对于现有研究工作的性能有效性。

英文摘要

While social media platforms, such as Twitter, provide a medium for large-scale opinion sharing during news events, it is manually impossible for individuals or media agencies to process the vast volume of content to identify key viewpoints. In order to resolve this, several automatic summarization techniques have been proposed to condense large collections of tweets into concise and informative summaries. However, these algorithms do not explicitly consider demographic fairness. Several existing research works have developed automated summarization approaches that can provide a holistic overview of the key aspects and major opinions shared on social media platforms related to a news event. However, these approaches do not explicitly consider different forms of demographic representation, such as gender, which can lead to biased summary representation. In this paper, we propose EquiSumm, which considers the gender aspect of the shared opinion to generate a summary, and our experimental analysis on two major datasets indicates the performance effectiveness with respect to existing research works.

2605.23384 2026-05-25 cs.CL cs.AI 版本更新

Metacognition as Reward: Reinforcing LLM Reasoning via Knowledge and Regulation Signals

元认知作为奖励：通过知识和调节信号强化LLM推理

Sirui Chen, Lei Xu, Yuying Zhao, Yutian Chen, Yu Wang, Beier Zhu, Hanwang Zhang, Shengjie Zhao, Chaochao Lu

发表机构 * Tongji University（同济大学）； Shanghai AI Laboratory（上海人工智能实验室）； Nanyang Technological University（南洋理工大学）； University of Science and Technology of China（中国科学技术大学）； EPFL（洛桑联邦理工学院）； Wuhan University（武汉大学）

AI总结该论文提出了一种基于元认知的强化学习框架 MaR，旨在提升大语言模型的推理能力。MaR 通过元认知知识和元认知调节两个维度提供奖励信号，前者用于识别任务相关的信息，后者用于规划和调整推理过程，从而超越仅依赖最终答案的奖励设计。实验表明，MaR 在多个基准测试中显著提升了模型性能，并在部分任务上超越了更强大的模型。

详情

AI中文摘要

最近的强化学习方法显著提高了LLM的推理能力。现有的奖励设计主要遵循两种范式：(1) 基于可验证奖励的强化学习（RLVR）从可执行检查或真实答案中获取结果信号，但对中间推理行为的指导有限。(2) 基于评分标准的奖励（RaR）通过使用自然语言评分标准来评估推理质量和任务合规性，超越了最终答案检查，但通常需要实例特定的评分标准和大量设计工作。为解决这些问题，我们引入了元认知奖励（MaR），一种受元认知启发的RL框架，通过两个通用过程维度指导LLM推理：i) 元认知知识，无需手工制作的实例特定评分标准即可识别任务相关信息；ii) 元认知调节，规划和调整推理过程，以提供超越最终答案结果的奖励指导。MaR将模型轨迹分解为显式的元认知组件，并通过任务知识覆盖度、调节保真度和最终答案正确性的轨迹级奖励进行优化。通过这种方式，MaR将奖励反馈扩展到推理轨迹，同时将奖励信号锚定在通用的元认知维度上。在22个基准上的实验表明，MaR持续提升模型性能，相比基础模型最高提升7.7%，相比原始DAPO最高提升11.0%。值得注意的是，Qwen3.5-9B + MaR缩小了与前沿模型的差距，在整体平均上超越GPT-OSS-120B，并在多个单独基准上超越更强模型。过程级分析进一步显示推理过程质量显著提升。MaR还能泛化到域外数据集，MaR训练的模型在平均性能上优于对应的基础模型。

英文摘要

Recent RL methods have substantially improved the reasoning abilities of LLMs. Existing reward designs mainly follow two paradigms: (1) Reinforcement learning with verifiable rewards (RLVR) derives outcome signals from executable checks or ground-truth answers, but provides limited guidance for intermediate reasoning behaviors. (2) Rubrics-as-reward (RaR) goes beyond final-answer checking by using natural-language rubrics to assess reasoning quality and task compliance, but often requires instance-specific rubrics and substantial design effort. To address these issues, we introduce Metacognition-as-Reward (MaR), a metacognition-inspired RL framework that guides LLM reasoning through two general process dimensions: i) metacognitive knowledge, which identifies task-relevant information without hand-crafted instance-specific rubrics, and ii) metacognitive regulation, which plans and adjusts the reasoning process to provide reward guidance beyond final-answer outcomes. MaR scaffolds model rollouts into explicit metacognitive components and optimizes them with a trajectory-level reward over task knowledge coverage, regulation fidelity, and final-answer correctness. In this way, MaR extends reward feedback to reasoning trajectories while grounding the reward signals in general metacognitive dimensions. Experiments on 22 benchmarks show that MaR consistently improves model performance, achieving up to a 7.7% gain over the base model and up to an 11.0% gain over vanilla DAPO. Notably, Qwen3.5-9B + MaR narrows the gap to frontier models, surpassing GPT-OSS-120B on overall average and outperforming stronger models on several individual benchmarks. Process-level analysis further shows substantial improvements in reasoning process quality. MaR also generalizes to out-of-domain datasets, where MaR-trained models improve over their corresponding base models on average.
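
The abstract describes a trajectory-level reward over three signals: task knowledge coverage, regulation fidelity, and final-answer correctness. It does not say how they are combined, so the sketch below simply assumes a weighted sum. The `RolloutScores` type, the `trajectory_reward` function, and the weights are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutScores:
    """Per-rollout scores (hypothetical container, not the paper's API)."""
    knowledge_coverage: float   # share of task-relevant facts identified, in [0, 1]
    regulation_fidelity: float  # how closely the reasoning followed its plan, in [0, 1]
    answer_correct: bool        # outcome signal from a verifier


def trajectory_reward(
    scores: RolloutScores,
    w_knowledge: float = 0.25,   # assumed weight
    w_regulation: float = 0.25,  # assumed weight
    w_answer: float = 0.5,       # assumed weight
) -> float:
    """Combine process-level and outcome-level signals into one scalar reward."""
    for name in ("knowledge_coverage", "regulation_fidelity"):
        value = getattr(scores, name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return (
        w_knowledge * scores.knowledge_coverage
        + w_regulation * scores.regulation_fidelity
        + w_answer * float(scores.answer_correct)
    )


# A rollout with a wrong final answer still earns partial process credit.
partial = trajectory_reward(RolloutScores(0.5, 0.0, False))
```

The point of the illustration is only that a rollout receives graded feedback on its reasoning process even when the final answer alone would give a reward of zero.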

2605.23382 2026-05-25 cs.CL 版本更新

From Correctness to Preference: A Framework for Personalized Agentic Reinforcement Learning

从正确性到偏好：个性化智能体强化学习框架

Ranxu Zhang, Zeyang Li, Jiacheng Huang, Rui Zhang, Xiaozhou Xu, Sun Zhe, Yanyong Zhang, Chao Wang

发表机构 * University of Science and Technology of China（中国科学技术大学）； Alibaba Group（阿里巴巴集团）

AI总结该论文提出了一种面向个性化智能体的强化学习框架，旨在解决现实场景中用户需求差异带来的行为规划与工具使用策略多样化问题。核心方法是通过解耦通用任务奖励与个性化偏好奖励，并引入用户特定锚点以稳定学习过程，同时结合分阶段偏好分离奖励模型和个性化技能演化图记忆结构，实现偏好识别、策略优化与技能积累的闭环。实验表明，该框架在多个基准任务中优于现有强化学习与记忆方法。

Comments 34 pages, 7 figures, Under Review

详情

AI中文摘要

智能体强化学习（Agentic RL）在具有明确成功信号的任务中取得了显著进展。然而，许多现实世界的智能体应用需要用户条件行为：同一查询可能在不同用户之间需要不同的规划策略和工具使用决策。这种设置带来了关键挑战：通用奖励无法捕捉异构用户偏好，观察到的行为与从众效应纠缠在一起，扁平记忆无法支持个性化技能检索。为此，我们提出了一个统一的个性化智能体RL框架，将个性化嵌入训练时优化。其核心是个性化锚点奖励解耦策略优化（PARPO），它将通用任务质量奖励与个性化偏好奖励解耦，并使用用户特定锚点在异构奖励尺度下稳定学习。我们进一步引入了两阶段偏好解耦奖励模型和偏好对齐技能演化图记忆（PSGM），用于个性化监督和偏好对齐技能检索。它们共同构成了偏好识别、策略优化和结构化技能积累的闭环。在ETAPP、ETAPP-Hard和SJAgent上的实验表明，我们的框架始终优于强记忆和RL基线。代码和数据包含在补充材料中。

英文摘要

Agentic reinforcement learning (Agentic RL) has achieved strong progress in tasks with clear success signals. However, many real-world agent applications require user-conditioned behavior: the same query may call for different planning strategies and tool-use decisions across users. This setting raises key challenges: generic rewards cannot capture heterogeneous user preferences, observed behaviors are entangled with conformity effects, and flat memories cannot support personalized skill retrieval. To this end, we propose a unified personalized Agentic RL framework that embeds personalization into training-time optimization. At its core is Personalized Anchor Reward-Decoupled Policy Optimization (PARPO), which decouples generic task-quality rewards from personalized preference rewards and uses user-specific anchors to stabilize learning under heterogeneous reward scales. We further introduce a two-stage preference-disentangled reward model and Preference-Aligned Skill Evolution Graph Memory (PSGM) for personalized supervision and preference-aligned skill retrieval. Together, they form a closed loop of preference identification, policy optimization, and structured skill accumulation. Experiments on ETAPP, ETAPP-Hard, and SJAgent show that our framework consistently outperforms strong memory and RL baselines. Code and data are included in the supplementary materials.

2605.23332 2026-05-25 cs.CL 版本更新

Cultural Adaptation in Large Language Models for Political Discourse

大型语言模型在政治话语中的文化适应

Wajdi Zaghouani

发表机构 * Northwestern University in Qatar（卡塔尔西北大学）

AI总结本文探讨了在政治话语分析中部署大型语言模型时所需的文化适应问题，指出当前模型因依赖英语主导的数据和有限的多语言覆盖，难以适应不同文化和制度背景，导致系统性偏差。研究从翻译、话语和本体三个层面形式化文化适应，提出基于文化忠实度、校准和民主安全的评估矩阵，并结合参与式数据集构建、文化感知的迁移学习等方法，为文化适应提供可衡量的实现路径，以增强政治自然语言处理的民主合法性。

详情

AI中文摘要

大型语言模型融入政治话语分析为比较研究、政策分析和公民技术创造了新机遇，同时也为民主问责带来了实质性风险。本文认为，文化适应是在不同语言和制度背景下将大型语言模型可靠地部署于政治传播的先决条件。当前系统仍受英语主导数据、不均衡的多语言覆盖以及基于狭窄范围的政治制度和话语惯例的假设所影响，导致跨文化应用时产生系统性错误。我们在翻译、话语和本体层面形式化文化适应，识别政治自然语言处理中反复出现的文化失败模式，并提出一个基于文化保真度、校准度和民主安全性的操作性评估矩阵。基于政治文本分析、社会技术审计和跨文化语用学，我们概述了方法论路径，包括参与式数据集开发、文化感知迁移学习以及使文化适应可经验测量的基准设计。最后，我们阐明了适应性政治自然语言处理能够支持民主合法性的治理约束和适用范围条件。

英文摘要

The integration of large language models into political discourse analysis creates new opportunities for comparative research, policy analysis, and civic technology, while introducing material risks for democratic accountability. This paper argues that cultural adaptation is a prerequisite for trustworthy deployment of large language models in political communication across diverse linguistic and institutional contexts. Current systems remain shaped by English dominant data, uneven multilingual coverage, and assumptions grounded in a narrow range of political institutions and discourse conventions, producing systematic errors when applied across cultures. We formalize cultural adaptation across translation, discourse, and ontology levels, identify recurring cultural failure modes in political NLP, and propose an operational evaluation matrix grounded in cultural fidelity, calibration, and democratic safety. Building on political text analysis, sociotechnical auditing, and cross cultural pragmatics, we outline methodological pathways including participatory dataset development, culturally aware transfer learning, and benchmark design that makes cultural adaptation empirically measurable. We conclude by clarifying governance constraints and scope conditions under which culturally adaptive political NLP can support democratic legitimacy.

2605.23328 2026-05-25 cs.CL 版本更新

Emotion Recognition in Sign Language Conversation

手语对话中的情感识别

Yusong Wang, Keyu Mao, Takao Obi, Minghao Shao, Kotaro Funakoshi

发表机构 * Institute of Science Tokyo（东京科学大学）； New York University（纽约大学）

AI总结本文研究手语对话中的情绪识别问题，针对现有手语情绪数据集主要关注孤立句子而缺乏对话上下文的局限性，提出了一个新的任务——手语对话情绪识别（ERC），并构建了包含1920个视频样本的eJSL Dialog数据集。研究通过多种模型在该数据集上的系统评估，揭示了通用多模态对话情绪识别模型在手语场景中的性能差距，突显了开发针对手语的上下文感知视觉提取器以及扩展对话数据集规模以支持大规模预训练的必要性。

详情

AI中文摘要

对话中的情感识别是情感计算的核心组成部分，而当前手语情感数据集资源主要关注孤立句子，缺乏对话上下文。仅在这些孤立话语上训练的模型在现实场景中表现下降，因为它们无法利用历史对话流。为了解决这一结构性限制，我们将ERC任务引入手语视频分析，并提出了eJSL Dialog数据集。该数据集使用STUDIES语料库的脚本构建，包含1,920个视频样本，组织成480个独特的对话。我们使用从孤立视觉网络到多模态对话架构的模型对该数据集进行了系统基准测试。结果揭示了将通用多模态对话情感识别模型应用于手语时存在的领域差距。这些发现表明，明确需要针对手语的上下文感知视觉提取器，并指出扩大对话数据集规模以支持大规模预训练是未来研究的必要下一步。

英文摘要

Emotion Recognition in Conversation is a core component of affective computing, while current resources of sign language emotion datasets primarily focus on isolated sentences and lack conversational context. Models trained exclusively on these isolated utterances demonstrate degraded performance in real world scenarios because they cannot utilize historical dialogue flow. To address this structural limitation, we introduce the ERC task to sign language video analysis and propose the eJSL Dialog dataset. Constructed using the scripts from the STUDIES corpus, the dataset contains 1,920 video samples organized into 480 unique dialogues. We conduct systematic benchmarking on this dataset using models ranging from isolated visual networks to multimodal conversational architectures. The results reveal a domain gap when applying generic multimodal conversational emotion recognition models to sign language. These findings demonstrate the explicit need for context aware visual extractors specific to sign language and indicate that expanding the scale of conversational datasets to support large scale pre-training is a necessary next step for future research.

2605.23326 2026-05-25 cs.CL 版本更新

ClimateChat-300K: A Multi-Modal Facebook Dataset for Understanding Diverse Perspectives in Climate Communication

ClimateChat-300K：一个用于理解气候传播中不同观点的多模态Facebook数据集

Wajdi Zaghouani, Md. Rafiul Biswas, Mabrouka Bessghaier, Shimaa Ibrahim, George Mikros

发表机构 * Northwestern University in Qatar（卡塔尔西北大学）； Hamad Bin Khalifa University（哈马德·本·哈利法大学）

AI总结本文介绍了ClimateChat-300K，一个包含299,329条公共Facebook帖子的多模态数据集，涵盖2020年5月至2024年5月期间关于气候变化的讨论。该数据集包含41个元数据特征，如内容、互动指标和页面属性，支持对气候沟通中公众话语的全面分析。研究通过主题建模和情感分析识别出五个领域下的十个主要主题，并揭示了情感基调、内容形式和页面身份对受众参与度的影响，同时展示了在线讨论如何随国际气候峰会和新冠疫情等重大事件演变。该数据集为气候议题的偏见、虚假信息及数字话语动态等跨学科研究提供了开放资源。

详情

AI中文摘要

我们提出了ClimateChat-300K，这是一个大规模数据集，包含通过CrowdTangle平台收集的2020年5月至2024年5月间关于气候变化的299,329条Facebook公开帖子。该数据集包含41个元数据特征，包括帖子内容、参与度指标和页面属性，覆盖来自超过26,000个全球页面的材料。每条帖子包含丰富的上下文信息，如语言、时间戳、页面类别和互动次数，从而能够对气候传播中的公共话语进行全面分析。通过主题建模和情感分析，我们识别出十个主要主题，这些主题被归为五个领域：政策、行动主义、合作、科学和保护。结果显示，情感基调、帖子格式和页面身份强烈影响受众参与度，视觉丰富且情感强烈的内容获得最高水平的互动。该数据集还展示了在线讨论如何响应重大事件（如国际气候峰会和COVID-19疫情时期）而演变。ClimateChat-300K为关于极化、错误信息和数字气候话语动态的可重复跨学科研究提供了开放资源。通过发布该数据集，我们旨在支持透明、数据驱动的研究，并促进对公众如何随时间、地理和制度背景参与气候问题的更深入理解。

英文摘要

We present ClimateChat-300K, a large-scale dataset of 299,329 public Facebook posts about climate change collected between May 2020 and May 2024 through the CrowdTangle platform. The dataset contains 41 metadata features including post content, engagement metrics, and page attributes, covering material from more than 26,000 global pages. Each post includes rich contextual information such as language, timestamp, page category, and interaction counts, enabling comprehensive analyses of public discourse around climate communication. Using topic modeling and sentiment analysis, we identify ten main themes grouped into five domains: policy, activism, cooperation, science, and conservation. The results reveal that emotional tone, post format, and page identity strongly influence audience engagement, with visually rich and emotionally charged content receiving the highest levels of interaction. The dataset also demonstrates how online discussions evolved in response to major events such as international climate summits and the COVID-19 pandemic period. ClimateChat-300K provides an open resource for reproducible and interdisciplinary research on polarization, misinformation, and the dynamics of digital climate discourse. By releasing this dataset, we aim to support transparent, data-driven research and contribute to a deeper understanding of how public engagement with climate issues develops across time, geography, and institutional contexts.

2605.23325 2026-05-25 cs.CL 版本更新

AraHopeCorpus: Annotation Guidelines and Dataset for Hope Speech in Arabic Social Media Crisis Discourse

AraHopeCorpus：阿拉伯语社交媒体危机话语中的希望言语标注指南与数据集

Esra'a Sharqawi, Wajdi Zaghouani

发表机构 * Hamad Bin Khalifa University（哈马德·本·哈利法大学）； Northwestern University in Qatar（卡塔尔西北大学）

AI总结本文介绍了 AraHopeCorpus，这是首个用于阿拉伯语社交媒体危机语境中希望言论研究的标注数据集，包含2023年至2024年间加沙战争相关一万条YouTube评论。研究通过详细的标注框架将评论分为希望言论、无希望言论和中性或模糊内容三类，发现超过六成的评论表达了希望，主要体现为宗教鼓励、集体团结和对正义与忍耐的乐观。该数据集为研究阿拉伯语社交媒体中的建设性话语、危机沟通与韧性提供了重要资源。

详情

AI中文摘要

社交媒体已成为武装冲突期间塑造公众叙事的关键舞台，为有害和建设性交流提供了空间。虽然仇恨言论和虚假信息已被广泛研究，但促进韧性、团结和乐观的表达仍未被充分探索，尤其是在阿拉伯语语境中。本文介绍了AraHopeCorpus，这是首个从2023年至2024年与加沙战争相关的一万条YouTube评论中收集的阿拉伯语希望言语标注数据集。使用详细的标注框架，评论被分为三类：希望言语、无希望言语和中性或模糊话语。数据集显示，希望语言占主导地位，占所有评论的64%以上。这些希望表达主要表现为宗教鼓励、集体团结以及对忍耐和正义的乐观。无希望言语约占13%，反映了绝望和幻灭，而其余评论包含中性或混合内容。标注者间一致性达到显著水平（Cohen's Kappa = 0.71），尽管方言差异、讽刺和隐含意义带来了标注挑战。人类标注者与ChatGPT之间的比较分析表明，大型语言模型可以支持标注，但在处理方言和文化嵌入表达方面仍存在局限性。AraHopeCorpus将在开放和非商业许可下发布用于研究目的。它为研究建设性数字话语提供了宝贵资源，有助于进一步研究阿拉伯语社交媒体中的希望言语检测、危机沟通和韧性。

英文摘要

Social media has become a crucial arena for shaping public narratives during armed conflicts, providing space for both harmful and constructive communication. While hate speech and misinformation have been widely studied, expressions that promote resilience, solidarity, and optimism remain underexplored, particularly in Arabic contexts. This paper introduces AraHopeCorpus, the first annotated dataset of Arabic hope speech collected from ten thousand YouTube comments related to the war on Gaza between 2023 and 2024. Using a detailed annotation framework, comments were classified into three categories: hope speech, no hope speech, and neutral or unclear discourse. The dataset shows that hopeful language dominates, accounting for more than sixty four percent of all comments. These expressions of hope appear mainly as religious encouragement, collective solidarity, and optimism for endurance and justice. No hope speech, representing about thirteen percent, reflects despair and disillusionment, while the rest of the comments contain neutral or mixed content. Inter-Annotator Agreement reached substantial levels (Cohen's Kappa equals 0.71), though dialectal variation, sarcasm, and implicit meaning posed annotation challenges. A comparative analysis between human annotators and ChatGPT revealed that large language models can support annotation but remain limited in handling dialectal and culturally embedded expressions. AraHopeCorpus will be released for research purposes under an open and non commercial license. It provides a valuable resource for studying constructive digital discourse, enabling further research on hope speech detection, crisis communication, and resilience in Arabic social media.
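
The inter-annotator agreement figure quoted above is Cohen's kappa, which corrects raw agreement between two annotators for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). A minimal sketch of that computation follows. The label names and the six toy annotations are invented for illustration and are not drawn from AraHopeCorpus.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(a)
    # Observed agreement: share of items on which both labels coincide.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy annotations over six comments (illustrative only).
annotator_1 = ["hope", "hope", "no_hope", "neutral", "hope", "no_hope"]
annotator_2 = ["hope", "hope", "no_hope", "hope", "hope", "neutral"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on 4 of 6 items, but kappa is lower than the raw agreement of 0.67 because both use the "hope" label often, which makes chance agreement likely.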

2605.23315 2026-05-25 cs.CL cs.AI 版本更新

Convergence Without Understanding: When Language Models Agree on Representations but Disagree on Reasoning

没有理解的趋同：当语言模型在表示上一致但在推理上分歧时

Muhammad Usama, Dong Eui Chang

AI总结本研究探讨了大型语言模型在不同目标和架构下训练后，其内部表征是否趋于一致，并进一步验证了这种表征一致性是否也体现在推理过程上。通过对8个模型家族共16个语言模型在800个推理问题上的分析，研究发现模型在表征层面趋于一致，但在推理策略上却存在显著分歧，表明表征收敛更多源于输入处理的共性，而非推理方法的统一。这一发现对模型集成、可解释性迁移及模型相似性评估具有重要意义。

详情

AI中文摘要

在不同目标和架构下训练的大型语言模型已被证明会发展出越来越相似的内部表示，这一观察被形式化为柏拉图式表示假说。这种表示趋同是否延伸到对共享表示进行操作的推理过程仍未得到检验。我们在800个涵盖数学、科学、常识和真实性的推理问题上，评估了来自8个家族（1.5B到72B参数）的16个语言模型的表示相似性，并按问题难度、计算阶段和因果相关性进行分层。我们的分析揭示了三种分离：难度反转，模型在它们共同失败的问题上趋同更多（中心核对齐[CKA] = 0.897），而在它们解决的问题上趋同较少（CKA = 0.830）；生成差距，决策前表示对齐（CKA = 0.875），而决策后表示分歧（CKA = 0.274）；以及附带正确性，共享信息可在模型间解码（66%的迁移准确率），但对预测的因果影响极小（在不同消融协议下翻转率为1.5%到5.5%）。这些结果表明，语言模型中的表示趋同反映了共享的输入处理约束而非共享的推理策略，对集成设计、可解释性迁移和模型相似性评估有直接影响。代码可在 https://github.com/Usama1002/convergence-without-understanding 获取。

英文摘要

Large language models trained under diverse objectives and architectures have been shown to develop increasingly similar internal representations, an observation formalized as the Platonic Representation Hypothesis. Whether this representational convergence extends to the reasoning processes that operate over shared representations remains untested. We evaluate representational similarity across 16 language models from 8 families (1.5B to 72B parameters) on 800 reasoning problems spanning mathematics, science, commonsense, and truthfulness, stratifying by problem difficulty, computational stage, and causal relevance. Our analysis reveals three dissociations: a difficulty inversion, where models converge more on problems they collectively fail (Centered Kernel Alignment [CKA] = 0.897) than on those they solve (CKA = 0.830); a generation gap, where pre-decision representations align (CKA = 0.875) while post-decision representations diverge (CKA = 0.274); and epiphenomenal correctness, where shared information is decodable across models (66% transfer accuracy) but exerts minimal causal influence on predictions (1.5% to 5.5% flip rate across ablation protocols). These results indicate that representational convergence in language models reflects shared input processing constraints rather than shared reasoning strategies, with direct implications for ensemble design, interpretability transfer, and evaluations of model similarity. Code is available at https://github.com/Usama1002/convergence-without-understanding.
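
The CKA values quoted above measure how similar two sets of representations are when both are computed on the same inputs. Below is a minimal pure-Python sketch of the linear variant, CKA(X, Y) = ||X^T Y||_F^2 / (||X^T X||_F ||Y^T Y||_F) on column-centered matrices. The abstract does not state which kernel the authors use, so the linear form is an assumption, and the function names are illustrative.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows are inputs, columns are feature dimensions


def center_columns(m: Matrix) -> Matrix:
    """Subtract each column's mean so every feature has zero mean."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_gram_sq_norm(a: Matrix, b: Matrix) -> float:
    """Squared Frobenius norm of a^T b, computed column pair by column pair."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x: Matrix, y: Matrix) -> float:
    """Linear CKA between two representations of the same n inputs."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must describe the same non-empty set of inputs")
    xc, yc = center_columns(x), center_columns(y)
    denom = math.sqrt(cross_gram_sq_norm(xc, xc) * cross_gram_sq_norm(yc, yc))
    if denom == 0.0:
        raise ValueError("a constant representation has undefined CKA")
    return cross_gram_sq_norm(xc, yc) / denom


# Y is X rotated by 90 degrees and scaled by 2, which CKA should ignore.
X = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 1.0]]
Y = [[0.0, 2.0], [-2.0, 0.0], [-4.0, 4.0], [-2.0, 6.0]]
same_geometry = linear_cka(X, Y)
```

Because linear CKA is invariant to rotations and uniform rescaling of the feature space, it can compare models whose hidden dimensions differ in size and orientation, which is what makes it usable across model families.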

2605.23259 2026-05-25 cs.LG cs.AI cs.CL 版本更新

Multi-Gate Residuals

多门残差

Zhizhan Zheng, Feiyun Zhang, Shuchun Liu, Tian Xia, Xi Liu, Dasheng Hu, Hongquan Zhou

发表机构 * Shanghai Yichuang Information Technology Co.,Ltd.（上海亿创信息技术有限公司）； Fudan University（复旦大学）

AI总结本文提出了一种名为Multi-Gate Residuals（MGR）的新方法，旨在解决深度残差网络中激活值无界增长的问题，同时避免引入额外的通信开销。该方法通过简单的评分与门控机制维护多流上下文，并结合注意力池化技术提取隐藏状态，从而在保持激活规模稳定的同时提升模型性能。实验表明，MGR在大规模训练与部署中具有实用性，并优于现有架构。

2605.23215 2026-05-25 cs.LG cs.AI cs.CL 版本更新

FastKernels: Benchmarking GPU Kernel Generation in Production

FastKernels：生产中GPU内核生成的基准测试

Gabriele Oliaro, Yichao Fu, May Jiang, Owen Lu, Junli Wang, Zhihao Jia, Hao Zhang, Samyam Rajbhandari

发表机构 * Snowflake AI Research（Snowflake AI研究院）； CMU（卡内基梅隆大学）； UCSD（加州大学圣地亚哥分校）； Independent Researcher（独立研究者）

AI总结当前基于大语言模型的GPU内核生成代理在性能评估方面面临基准与实际生产环境不匹配的问题。为此，研究提出了FastKernels，一个基于46个代表性架构构建的基准测试集，覆盖了8个类别，几乎涵盖了96.2%的HuggingFace Transformers架构，并同时提供了一个生产级推理框架。实验表明，现有最先进的内核生成代理在FastKernels上的加速效果有限，突显了基准与实际应用之间存在的关键瓶颈。

详情

AI中文摘要

基于LLM的GPU内核生成代理正在快速发展，但其进展从根本上受到所优化基准的限制。现有基准与生产推理框架严重脱节：它们在单GPU上使用合成输入评估内核，忽略周围的编译栈，并奖励复制已知优化而非发现新优化。由此产生的奖励信号具有误导性：代理学会生成在沙箱中得分高但在集成到实际系统时引入接口不兼容、编译栈冲突和静默正确性下降的内核。我们引入FastKernels，一个基于最小化46个代表性架构（涵盖8个类别）的内核基准，这些内核共同涵盖了96.2%（409/425）的HuggingFace Transformers架构。FastKernels同时作为一个简约的生产级推理框架，在主流LLM服务上与vLLM和SGLang等成熟系统运行性能相当，并在服务不足的架构上显著超过上游参考；每个任务的接口镜像其架构家族中最先进库的相应模块，使得优化后的内核能够直接部署到生产代码库中。在FastKernels上评估最先进的内核代理，我们发现即使最强的代理也仅实现0.94倍于生产基线的总加速，而较弱的代理分别为0.78倍和0.53倍——证实基准-生产错位是该领域的关键瓶颈。我们发布FastKernels，作为迈向基准收益直接转化为生产吞吐量改进的内核代理的垫脚石。代码可在https://github.com/Snowflake-AI-Research/fastkernels获取。

英文摘要

LLM-based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against. Existing benchmarks are poorly aligned with production inference frameworks: they evaluate kernels on a single GPU with synthetic inputs, ignore the surrounding compilation stack, and reward replicating known optimizations rather than discovering new ones. The resulting reward signals are misleading: agents learn to generate kernels that score well in sandboxes but introduce interface incompatibilities, compilation-stack conflicts, and silent correctness degradation when integrated into real systems. We introduce FastKernels, a kernel benchmark built around a minimal set of 46 representative architectures spanning 8 categories, whose kernels collectively subsume those of 96.2% (409/425) of HuggingFace Transformers architectures. FastKernels doubles as a minimalistic, production-grade inference framework that runs at parity with hardened systems such as vLLM and SGLang on mainstream LLM serving and substantially exceeds upstream references on under-served architectures; each task's interface mirrors the corresponding module in the state-of-the-art library for its architecture family, enabling direct deployment of optimized kernels into production codebases. Evaluating state-of-the-art kernel agents on FastKernels, we find that even the strongest agent achieves only 0.94× aggregate speedup over production baselines, with weaker agents at 0.78× and 0.53×, confirming that benchmark-production misalignment is a critical bottleneck for the field. We release FastKernels as a stepping stone toward kernel agents whose benchmark gains translate directly into production throughput improvements. Code is available at https://github.com/Snowflake-AI-Research/fastkernels.

2605.23193 2026-05-25 cs.HC cs.CL cs.CY cs.MA 版本更新

CultivAgents: Cultivating Relationship-Centered Multi-Agent Systems for Personalized Gardening

CultivAgents：培育以关系为中心的多智能体系统以实现个性化园艺

Yiyang Wang, Moeiini Reilly, Britney Johnson, Kefei Yan, Alex Cabral, Josiah Hester

发表机构 * Georgia Institute of Technology（佐治亚理工学院）； Massachusetts Institute of Technology（麻省理工学院）

AI总结 CultivAgents 是一种以关系为中心的多智能体系统，旨在为个性化园艺提供具有社会文化背景的支持。该系统通过协调多个专业智能体，如经验智能体、环境智能体和民族植物学智能体，提供适应用户技能水平、本地生态条件和文化背景的园艺建议。研究通过多阶段混合方法评估表明，CultivAgents 有效提升了园艺者的信心、动机和对AI建议的信任，同时揭示了在文化特异性、生态适配和智能体协作方面仍有改进空间。

Comments Preprint, 9 pages. Website: https://hello-diana.github.io/CultivAgents/

详情

AI中文摘要

园艺对于支持福祉、文化连续性和食物自主至关重要，然而现有的数字工具通常提供忽视园丁技能、当地生态、季节和文化背景的通用建议。我们介绍了 CultivAgents，一个以关系为中心的多智能体系统，用于个性化、社会文化背景化的园艺支持。基于关怀伦理，CultivAgents 协调多个专业智能体：经验智能体根据用户技能水平调整指导，环境智能体基于当地和季节条件提供建议，民族植物学智能体将植物与文化知识和历史联系起来。我们通过一项三阶段混合方法研究评估了 CultivAgents，参与者包括领域专家（n=3）、人机交互研究人员（n=7）和社区园丁（n=5），分析了专家反馈、前后调查和参与式设计活动。结果表明，CultivAgents 帮助园丁将兴趣转化为情境行动：社区园丁报告信心（从3.00到3.60）、动机（从4.00到4.40）和信任AI建议（从3.20到4.00）均有提升。参与者重视超本地生态指导和互补的智能体视角，同时也指出了文化特异性、生态基础以及智能体协调方面的局限性。这项工作推进了以关系为中心的人工智能，为支持食物主权、社区韧性和文化保护的多智能体系统提供了设计启示。

英文摘要

Gardening is critical to support well-being, cultural continuity, and food autonomy, yet existing digital tools often provide generic advice that overlooks gardeners' skills, local ecologies, seasons, and cultural contexts. We introduce CultivAgents, a relationship-centered multi-agent system for personalized, socio-culturally grounded gardening support. Grounded in ethics of care, CultivAgents coordinates multiple specialized agents: an Experience Agent that adapts guidance to users' skill levels, an Environmental Agent that grounds advice in local and seasonal conditions, and an Ethnobotanical Agent that connects plants to cultural knowledge and histories. We evaluated CultivAgents through a three-phase mixed-methods study with domain experts (n=3), HCI researchers (n=7), and community gardeners (n=5), analyzing expert feedback, pre/post surveys, and participatory design activities. Results suggest that CultivAgents helped gardeners translate interest into situated action: community gardeners reported increased confidence (3.00 to 3.60), motivation (4.00 to 4.40), and trust in acting on AI advice (3.20 to 4.00). Participants valued hyperlocal ecological guidance and complementary agent perspectives, while also identifying limits in cultural specificity, ecological grounding, and agent coordination. The work advances relationship-centered AI, offering design implications for multi-agent systems that support food sovereignty, community resilience, and cultural preservation.

2605.23190 2026-05-25 cs.CL 版本更新

Hidden Human-Like Nature of Machine-Generated Texts: Theory and Detection Enhancement

机器生成文本的隐藏类人本质：理论与检测增强

Chenwang Wu, Yiu-ming Cheung, Bo Han, Defu Lian

发表机构 * Department of Computer Science, Hong Kong Baptist University（香港浸会大学计算机科学系）； School of Computer Science and Technology, University of Science and Technology of China（中国科学技术大学计算机科学与技术学院）； State Key Laboratory of Cognitive Intelligence（认知智能国家重点实验室）

AI总结随着大语言模型生成的文本在各类应用中日益普及，其潜在的滥用问题引发了对检测方法的迫切需求。现有方法通常将生成文本视为完全机械化的产物，忽视了其中可能存在的类人类写作风格的片段。本文揭示了这类隐藏的类人类片段的存在，并分析其对检测任务的负面影响，进而提出一种无需模型依赖的堆叠增强框架，通过迭代过滤和优化提升检测性能，实验表明该方法在多种场景下均有效，并支持无训练部署。

详情

AI中文摘要

由大型语言模型（LLMs）生成的机器生成文本（MGTs）在各种应用中越来越普遍，但其在虚假新闻传播和网络钓鱼中的潜在滥用引发了严重担忧，凸显了MGT检测的必要性。现有的段落级检测方法通常将MGT视为完全机器化的，忽略了机器生成文本的隐藏类人本质：即使是完全机器生成的文本也可能包含与人类写作高度一致的片段。为此，我们首先揭示了这种隐藏类人片段的存在，然后从理论上分析了它们对检测的影响。我们的分析表明，这些片段增加了检测的句子复杂度，从而使MGT检测本质上更加困难。基于这一发现，我们提出了一种模型无关的堆叠增强框架，通过减少隐藏类人片段的影响来改进现有检测器。具体来说，我们将片段级别的保留决策建模为潜在变量问题，并使用硬EM启发式过程实例化优化，其中检测器迭代地过滤置信度高的类人子序列，并在剩余文本上自我优化。在各种LLM和实际场景中的大量实验表明，所提出的框架能够持续增强现有检测器。值得注意的是，该框架还可以以无需训练的方式工作，为实际部署提供了灵活性和可扩展性。

英文摘要

Machine-generated texts (MGTs) produced by large language models (LLMs) are increasingly prevalent across various applications, while their potential misuse in fake news propagation and phishing has raised serious concerns, highlighting the need for MGT detection. Existing paragraph-level detection methods commonly treat MGTs as entirely machine-like, overlooking the hidden human-like nature of machine-generated texts: even fully machine-generated texts may contain spans that are highly consistent with human writing. To this end, we first reveal the existence of such hidden human-like spans, and then theoretically analyze their impact on detection. Our analysis shows that these spans increase the sentence complexity for detection, thereby making MGT detection intrinsically harder. Based on this finding, we propose a model-agnostic stacked enhancement framework that improves existing detectors by reducing the influence of hidden human-like spans. Specifically, we model span-level retention decisions as a latent-variable problem and instantiate the optimization with a hard-EM-inspired procedure, where the detector iteratively filters confidently human-like subsequences and refines itself on the remaining text. Extensive experiments across various LLMs and practical scenarios demonstrate that the proposed framework consistently enhances existing detectors. Notably, the framework can also work in a training-free manner, offering flexibility and scalability for practical deployment.

2605.23180 2026-05-25 cs.CL cs.LG 版本更新

Self-Improving In-Context Learning

自我改进的上下文学习

Baturay Saglam, Dionysis Kalogerias

发表机构 * Department of Electrical and Computer Engineering（电气与计算机工程系）

AI总结本文提出了一种改进上下文学习（ICL）的方法，通过在测试时优化固定少样本提示的连续嵌入来提升模型性能。研究发现，模型对示例输出的对数概率可以作为衡量其任务理解程度的有效信号，并据此构建了一个无需额外数据的自监督置信度代理，通过零阶优化对提示嵌入进行校准。该方法无需微调、无需生成token、无需预定义标签集，适用于分类和自由生成任务，在多个ICL任务中表现出色，验证了其优化信号的有效性。

详情

AI中文摘要

我们提出通过优化测试时固定少样本提示的连续嵌入来改进上下文学习（ICL）。关键观察是，模型对其演示输出分配的对数概率——可在单次前向传播中获得，无需生成任何令牌——为模型从演示中推断任务提供了有意义的信号。我们将此信号形式化为一个有界的、自监督的置信度代理，并通过在提示嵌入上进行零阶优化来最大化它，从而得到一种测试时校准程序。该方法不需要微调、令牌生成、预定义标签集或外部数据，因此同样适用于分类和自由生成任务。在一系列全面的ICL任务中，所提出的校准方法始终匹配或改进基础模型，并在大多数任务上优于特定于分类的基线。代理改进与下游准确率提升之间的统计显著相关性证实了所提出的代理编码了用于上下文学习的可靠优化信号。

英文摘要

We propose to improve in-context learning (ICL) by optimizing the continuous embeddings of a fixed few-shot prompt at test time. The key observation is that the log-probabilities a model assigns to its demonstrated outputs (available from a single forward pass without generating any tokens) provide a meaningful signal for how well the model has inferred the task from its demonstrations. We formalize this signal as a bounded, self-supervised confidence proxy and maximize it via zeroth-order optimization over the prompt embeddings, yielding a test-time calibration procedure. The approach requires no finetuning, no token generation, no predefined label set, and no external data, making it equally applicable to both classification and free-form generation tasks. Across a comprehensive suite of ICL tasks, the proposed calibration consistently matches or improves upon the base model and outperforms classification-specific baselines on most tasks. The statistically significant correlation between proxy improvement and downstream accuracy gain confirms that the proposed proxy encodes a reliable optimization signal for in-context learning.
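The zeroth-order step described in the abstract can be illustrated with a minimal sketch. It assumes a two-point random-direction gradient estimate, which is one common zeroth-order scheme and not necessarily the authors' choice. `toy_confidence` is a hypothetical stand-in for the confidence proxy (in the paper's setting it would come from the model's log-probabilities on the demonstrated outputs), and `zeroth_order_ascent`, the step sizes and the three-dimensional "embedding" are illustrative assumptions.

```python
import random


def toy_confidence(embedding):
    # Hypothetical stand-in for the confidence proxy: a concave function
    # that reaches its maximum (0.0) at a fixed target vector.
    target = [0.5, -1.0, 2.0]
    return -sum((e - t) ** 2 for e, t in zip(embedding, target))


def zeroth_order_ascent(objective, x, steps=500, mu=1e-2, lr=0.05, seed=0):
    """Maximize `objective` using only function evaluations (no gradients)."""
    rng = random.Random(seed)
    x = list(x)
    for _ in range(steps):
        # Sample a random probing direction.
        u = [rng.gauss(0.0, 1.0) for _ in x]
        plus = objective([xi + mu * ui for xi, ui in zip(x, u)])
        minus = objective([xi - mu * ui for xi, ui in zip(x, u)])
        # Two-point estimate of the directional derivative along u.
        g = (plus - minus) / (2 * mu)
        # Move along u in proportion to the estimated slope.
        x = [xi + lr * g * ui for xi, ui in zip(x, u)]
    return x


start = [0.0, 0.0, 0.0]
tuned = zeroth_order_ascent(toy_confidence, start)
print(toy_confidence(start), toy_confidence(tuned))
```

Only forward evaluations of the objective are needed, which is consistent with the abstract's claim that the procedure requires no finetuning and no token generation.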

2605.23175 2026-05-25 cs.CR cs.CL 版本更新

Robust LLM Watermarking with Minimal Semantic Distortion for IP Protection

用于知识产权保护的具有最小语义失真的鲁棒LLM水印

Kieu Dang, Phung Lai, NhatHai Phan, Yelong Shen, Ruoming Jin

发表机构 * State University of New York at Albany（纽约州立大学奥尔巴尼分校）； New Jersey Institute of Technology（新泽西理工学院）； Microsoft（微软）； Kent State University（肯特州立大学）

AI总结本文研究如何在保护大型语言模型知识产权的同时，最小化语义失真。为此，作者提出了SAFESEAL，一种基于密钥条件的水印框架，通过上下文感知的同义词替换机制，在保持语义一致性和事实准确性的同时实现高可检测性。该方法在检测端引入密钥条件对比检测器，支持跨提供方和多用户的水印验证，并在实验中表现出优于现有方法的实用性、可检测性和鲁棒性。

详情

AI中文摘要

专有大语言模型面临知识产权侵犯的风险，因为对手可以通过收集输入-输出对来训练替代模型，从而复制LLM，造成财务损失。水印提供了一种有前景的防御手段来验证所有权，但现有方法常常面临语义失真、事实不一致和对抗攻击的问题。此外，用于特定提供商检测的密钥条件水印，特别是在跨提供商和多用户场景中，仍然在很大程度上未被探索。为了解决这些挑战，我们提出了SAFESEAL，一种新颖的密钥条件水印框架，在最小化对模型实用性的影响下实现强可检测性，有效平衡可检测性、实用性和鲁棒性。SAFESEAL通过密钥条件锦标赛采样机制，在替换语言术语为上下文感知同义词的同时保留命名实体，保持语义保真度和事实一致性。在检测方面，我们引入了一种密钥条件对比检测器，该检测器联合编码文本和密钥，实现特定提供商和鲁棒的水印验证。我们推导了实用性-可检测性权衡的理论界限，并通过轻量级模型、批处理和并行化显著降低了延迟。大量实验表明，SAFESEAL在实用性、可检测性和鲁棒性方面优于基线，实现了0.983的BERTScore、0.963的实体相似度、98.2%的检测率，以及文本质量和内容保留的最高人类评分，延迟与最快的基线相当。为了促进透明度和社区驱动的进展，我们发布了第一个公共水印排行榜和一个交互式演示。

英文摘要

Proprietary large language models (LLMs) face risks of intellectual property (IP) violation, as adversaries can replicate an LLM by collecting input-output pairs to train a surrogate model, causing financial setbacks. Watermarks offer a promising defense to verify ownership, but existing methods often struggle with semantic distortion, factual inconsistency, and adversarial attacks. In addition, key-conditioned watermarks for provider-specific detection, especially in cross-provider and multi-user scenarios, remain largely underexplored. To address these challenges, we propose SAFESEAL, a novel key-conditioned watermarking framework that achieves strong detectability with minimal impact on model utility, effectively balancing detectability, utility, and robustness. SAFESEAL preserves named entities while substituting linguistic terms with context-aware synonyms through a key-conditioned Tournament sampling mechanism, maintaining semantic fidelity and factual consistency. For detection, we introduce a key-conditioned contrastive detector that jointly encodes the text and key, enabling provider-specific and robust watermark verification. We derive theoretical bounds on the utility-detectability trade-off and significantly reduce latency through lightweight models, batching, and parallelism. Extensive experiments show that SAFESEAL outperforms baselines in utility, detectability, and robustness, achieving a BERTScore of 0.983, entity similarity of 0.963, a 98.2% detection rate, and the highest human ratings for text quality and content preservation, with latency comparable to the fastest baseline. To promote transparency and community-driven progress, we release the first public watermark leaderboard and an interactive demo.

2605.23170 2026-05-25 cs.CL cs.AI cs.LG 版本更新

Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks

长上下文LLM中的位置失败：推理基准测试中的盲点

Chuyifei Zhang, Hongyu Cui, Xiaowen Huang, Jitao Sang

发表机构 * Beijing Jiaotong University（北京交通大学）； Central South University of Forestry and Technology（中南林业科技大学）

AI总结该研究指出当前主流的长上下文大语言模型推理基准在任务位置控制方面存在不足，导致无法准确评估模型在不同位置上的表现。为此，作者提出了Context Rot Evaluation（CRE）框架，系统地控制任务位置、填充内容和上下文长度三个因素，并通过实验发现，当目标任务从上下文末尾移至中间位置时，模型性能会显著下降，且随着上下文长度增加，这一问题更加严重。研究还表明，通过在末尾添加任务副本，可以有效缓解位置带来的性能下降，揭示了当前基准设计中存在结构性的评估盲区。

Comments 20 pages, 1 figure, 23 tables

详情

AI中文摘要

位置控制评估是检索任务（如Needle-in-a-Haystack和RULER）的标准做法，但主流推理基准测试并未控制目标任务在长上下文中的位置。我们审计了11个长上下文基准测试，发现没有一个在推理任务上同时控制任务位置、填充内容和上下文长度。对四个旗舰长上下文模型发布的审计发现，其主要结果表中均没有NIAH、RULER或LongBench系列基准测试的条目，而智能体和编码基准测试在所有四个发布的主要结果表中均有出现。我们提出了上下文腐化评估（Context Rot Evaluation，CRE），一个同时控制上述三个因素的框架，并分两轮评估了九个LLM在GSM8K和ARC-Challenge上的表现：初始的五个模型，以及四个较新的供应商发布。当目标任务从末尾移动到中间时，模型性能可能急剧下降，且对于易受影响的模型，这种下降随着上下文长度增加而恶化。MiMo-v2-Flash在64K下使用with_solutions填充时下降88个百分点（中间准确率8%）。较新的发布显示出较小的下降：在64K下，四个模型中有三个与末尾位置准确率的差距保持在+/-6个百分点以内；MiMo-V2.5-Pro将MiMo-v2-Flash的88个百分点下降缩小到32个百分点。在questions_only_v2填充下，所有四个模型在中间位置的下降仍然存在（在8K、32K、64K下范围为-16到-56个百分点）。在8K下，一个诊断探针在末尾添加目标任务副本，使所有九个模型的中间准确率与末尾基线相差在+/-4个百分点内，这与位置解释一致。在初始五个模型中，76%的中间位置错误与周围填充文本匹配，而末尾位置仅为22%，这与填充-答案干扰作为主要错误模式一致。这些结果暴露了当前推理基准测试设计和供应商评估实践中的结构性评估差距：当任务位置不受控制时，无法测量随上下文长度增长而恶化的位置脆弱性。

英文摘要

Position-controlled evaluation is standard for retrieval tasks such as Needle-in-a-Haystack and RULER, but mainstream reasoning benchmarks do not control positional placement of target tasks in long contexts. We audit 11 long-context benchmarks and find none jointly controls task position, filler content, and context length for reasoning. An audit of four flagship long-context releases finds no main result-table entry for NIAH, RULER, or LongBench-family benchmarks, while agentic and coding benchmarks appear in main result-tables across all four. We propose Context Rot Evaluation (CRE), a controlled framework varying all three factors, and evaluate nine LLMs on GSM8K and ARC-Challenge across two rounds: an initial five-model set and four newer vendor releases. Models can drop sharply when the target task moves from end to middle, and the drop grows worse with context length for vulnerable models. MiMo-v2-Flash drops 88pp at 64K under with_solutions filler (middle accuracy 8%). Newer releases show smaller drops: at 64K, three of four stay within +/-6pp of end-position accuracy; MiMo-V2.5-Pro narrows the MiMo-v2-Flash 88pp drop to 32pp. Under questions_only_v2 filler, middle-position drops persist across all four (range -16pp to -56pp across 8K, 32K, 64K). At 8K, a diagnostic probe adding a target-task copy at the end brings middle accuracy within +/-4pp of end baseline across all nine models, consistent with a positional explanation. In the initial five-model set, 76% of middle-position errors match surrounding filler text versus 22% at the end position, consistent with filler-answer interference as a dominant error mode. These results expose a structural evaluation gap in current reasoning benchmark design and vendor evaluation practice: positional vulnerabilities that grow with context length cannot be measured when task position is not controlled.
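The position-controlled setup described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the CRE implementation: `build_context` and `drop_pp` are hypothetical helper names, contexts are modeled as lists of text items, and only the "end" and "middle" placements named in the abstract are supported.

```python
def build_context(filler, task, position):
    """Insert the target task into a list of filler items at a controlled position."""
    if position == "end":
        idx = len(filler)
    elif position == "middle":
        idx = len(filler) // 2
    else:
        raise ValueError(f"unsupported position: {position!r}")
    return filler[:idx] + [task] + filler[idx:]


def drop_pp(end_accuracy, middle_accuracy):
    """Accuracy drop from end to middle placement, in percentage points."""
    return round((end_accuracy - middle_accuracy) * 100)


filler = [f"filler problem {i}" for i in range(6)]
task = "TARGET: solve this problem"

end_ctx = build_context(filler, task, "end")
mid_ctx = build_context(filler, task, "middle")
print(end_ctx.index(task), mid_ctx.index(task))

# Consistency check on the reported figures: an 88pp drop with 8% middle
# accuracy corresponds to 96% end-position accuracy.
print(drop_pp(0.96, 0.08))
```

Holding the filler and context length fixed while varying only `position` is what lets a drop be attributed to placement and not to content or length.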

2605.23165 2026-05-25 cs.RO cs.AI cs.CL 版本更新

Autonomous Frontier-Based Exploration with VLM Guidance

VLM引导的基于前沿的自主探索

Aarush Aitha, Avideh Zakhor

发表机构 * EECS Department, University of California（加州大学EECS系）

AI总结本文提出了一种基于视觉语言模型（VLM）引导的自主前沿探索方法，用于提升机器人在未知和危险环境中的探索能力。该方法通过VLM进行高层战略决策，指导传统的底层机器人控制系统，利用当前地图和潜在路径的视觉信息生成多模态提示，从而选择最具前景的探索方向。实验表明，该方法在六个室内环境的仿真中提升了地图覆盖率，且具有轻量、无需训练和易于迁移的特点。

Comments 8 pages, 10 figures, CVPR 2026: 2nd Workshop on 3D-LLM/VLA: Bridging Language, Vision and Action in 3D Environments

2605.23158 2026-05-25 cs.CR cs.CL cs.LG 版本更新

What Does the Server See? Understanding Privacy Leakage from Large Language Models in Split Inference

服务器看到了什么？理解大语言模型在分割推理中的隐私泄露

Mingyuan Fan, Yu Liu, Fuyi Wang, Cen Chen

发表机构 * East China Normal University（华东师范大学）； RMIT University（皇家墨尔本理工大学）

AI总结本文研究了在分割推理（split inference）框架下，大型语言模型（LLM）可能泄露用户隐私的问题。作者提出了一种名为ActInv的方法，通过匹配中间激活值来重建客户端输入，揭示了分割推理中的隐私漏洞。研究还引入了“扰动放大因子”（PAF）来量化各层对重建的抵抗能力，并设计了PriPert防御方案，有效提升了隐私保护效果，同时保持了模型的实用性和计算效率。

Comments Accepted to ACM CCS'26

详情

AI中文摘要

在资源受限设备上部署大语言模型（LLM）仍然具有挑战性，这激发了人们对分割推理的兴趣，即模型在客户端和服务器之间进行划分，通过仅传输中间激活来减少计算负担并增强隐私。然而，分割推理的隐私保护能力，特别是在LLM背景下，尚未得到彻底研究。为填补这一空白，我们引入了ActInv，它解决了一个中间激活匹配问题以重建客户端的输入。大量评估表明，即使在存在常见基于扰动的防御（如高斯噪声注入和激活稀疏化）的情况下，ActInv也能实现高保真重建。为了系统地理解这一漏洞，我们开发了扰动放大因子（PAF），一个用于量化层对重建固有抵抗力的指标。我们的分析揭示了隐私脆弱性在层间并不均匀，一些层高度易受泄露，而另一些层则提供自然抵抗力。此外，我们证明了通过校准扰动方向以在反向传播期间最大化重建误差，可以显著提高防御有效性。基于这些见解，我们设计了PriPert，并进行了全面评估，涵盖隐私、效用和计算开销，以证明其有效性。

英文摘要

The deployment of large language models (LLMs) on resource-constrained devices remains challenging, spurring interest in split inference, where models are partitioned between client and server to reduce computational burden and enhance privacy by transmitting only intermediate activations. However, the privacy-preserving capabilities of split inference, particularly in the context of LLMs, have not been exhaustively investigated. To fill this gap, we introduce ActInv, which solves an intermediate activation matching problem to reconstruct the client's input. Extensive evaluations demonstrate that ActInv achieves high-fidelity reconstructions, even in the presence of common perturbation-based defenses such as Gaussian noise injection and activation sparsification. To systematically understand this vulnerability, we develop Perturbation Amplification Factor (PAF), a metric for quantifying a layer's inherent resistance to reconstruction. Our analysis reveals that privacy vulnerability is not uniform across layers, with some layers being highly susceptible to leakage while others offer natural resistance. Furthermore, we demonstrate that defense effectiveness can be significantly improved by calibrating perturbation directions to maximize reconstruction error during backpropagation. Building on these insights, we design PriPert and conduct comprehensive evaluations, covering privacy, utility, and computational overhead, to demonstrate its effectiveness.

2605.23157 2026-05-25 cs.CL 版本更新

Same Model, Different Weakness: How Language and Modality Reshape the Jailbreak Attack Surface in Frontier MLLMs

相同模型，不同弱点：语言与模态如何重塑前沿多模态大语言模型的越狱攻击面

Casey Ford, Madison Van Doren, Sicheng Jin, Emily Dix

发表机构 * Appen（澳鹏）

AI总结该研究探讨了多模态大语言模型（MLLM）在不同语言和模态下的越狱攻击表面差异，揭示了语言对模型安全性的非均匀影响。通过对比四种前沿模型在英语和西班牙语下的攻击表现，研究发现语言框架攻击在西班牙语中效果减弱，而视觉化多模态攻击则更有效，表明语言与模态对齐失败的机制存在差异。研究指出，当前将语言和模态视为独立维度的安全评估框架无法准确反映实际攻击风险，需进行重新设计。

详情

AI中文摘要

多模态大语言模型（MLLM）的攻击面具有语言依赖性，揭示了对齐失败的机制结构。我们首次进行系统的跨语言、多模态红队研究，比较了四种前沿MLLM（Claude Sonnet 4.5、GPT-5、Pixtral Large和Qwen Omni）在美国英语（en-US）和墨西哥西班牙语（es-MX）下的越狱漏洞。使用包含363个多样化提示场景的固定对抗基准，在纯文本和多模态条件下进行测试，从每组语言的九名母语标注员匹配小组收集了52,272个危害评级和二元攻击成功判断。我们的核心发现是，语言不会均匀地放大漏洞。贝叶斯混合效应分析显示，语言框架攻击（如角色扮演）在西班牙语提示下效果显著降低，而视觉显式多模态攻击效果增强，这直接指向提示-语言界面而非全局标注员宽松度。这种分离表明，语言和视觉对齐失败通过不同机制运作，切换语言足以暴露这种分离。实际后果是安全排名不跨语言保持。Qwen Omni在es-MX参与者中超越Pixtral Large成为最易受攻击的模型，这种排名反转是英语条件下分数的标量校正无法恢复的，并且绝对攻击成功率在模型代际间下降，但模型间差距未缩小。这些发现表明，将语言和模态视为独立维度的安全评估框架从根本上错误地指定了全球部署MLLM的攻击面，必须相应重新设计。

英文摘要

The attack surface of a multimodal large language model (MLLM) is language-dependent in ways that reveal the mechanistic structure of alignment failures. We present the first systematic cross-lingual, multimodal red-teaming study comparing jailbreak vulnerability in US English (en-US) and Mexican Spanish (es-MX) across four frontier MLLMs: Claude Sonnet 4.5, GPT-5, Pixtral Large, and Qwen Omni. Using a fixed adversarial benchmark of 363 diverse prompt scenarios administered in text-only and multimodal conditions, we collected 52,272 harm ratings and binary attack success judgements from matched panels of nine native-speaker annotators per language group. Our central finding is that language does not scale vulnerability uniformly. Bayesian mixed-effects analyses reveal that linguistic framing attacks such as role-play become substantially less effective under Spanish prompting, while visually explicit multimodal attacks become more effective, which directly implicates the prompt-language interface rather than global annotator leniency. This dissociation indicates that linguistic and visual alignment failures operate through distinct mechanisms, and that switching language is sufficient to expose that separation. The practical consequence is that safety rankings are not preserved across languages. Qwen Omni overtakes Pixtral Large as the most vulnerable model among es-MX participants, a rank reversal no scalar correction of English-condition scores could recover, and absolute attack success rates have declined across model generations without closing the gaps between them. These findings demonstrate that safety evaluation frameworks treating language and modality as independent dimensions fundamentally misspecify the attack surface of globally deployed MLLMs, and must be redesigned accordingly.

2605.23147 2026-05-25 cs.CL cs.AI 版本更新

As X, Do Y: How Persona and Task Combine in Instruction-Tuned LLMs

作为X，做Y：角色和任务如何在指令微调LLM中结合

Eric Xu

发表机构 * Independent Researcher（独立研究者）

AI总结该研究探讨了在指令微调的大语言模型中，角色提示（如“As X, do Y”）如何将“人物”和“任务”信息结合，并发现这种结合在残差流中的某个特定位置可以通过线性分解清晰地体现。研究指出，人物和任务分别通过部分正交的加法方向影响模型输出，并展示了通过残差流局部加法结构可以实现对角色和任务贡献的可解释控制。然而，研究也表明，尽管存在局部加法结构，角色提示无法被压缩为单一的残差向量，因为其行为依赖于整个提示中的分布式机制。

Comments 12 pages, 1 figure. Code: https://github.com/xuy/localized-additive-composition

详情

AI中文摘要

形式为“作为X，做Y”的角色提示在残差流的一个特定位置——提示到答案的过渡（最后一个提示标记与前两个生成标记）——在早期/中层波段表现出清晰的线性分解。在那里，角色和任务通过部分正交的加性方向贡献。形成纯角色效应Δ_X、纯任务效应Δ_Y，并将h_BB + Δ_X + Δ_Y替换干净残差，在Gemma-2-2B-IT和Qwen-2.5-{1.5B, 3B}-Instruct上，跨越12个单元格的短网格和48个单元格的长角色网格，下游输出与干净输出的KL散度很小，并保留了角色特定的行为标记。从这种加性结构自然推断，角色提示可以压缩为单个缓存的残差向量。我们证明它不能。将缓存的加性预测——甚至oracle干净残差h_XY——注入到移除了角色文本的基线宿主提示中，无论是在一个位置还是在多个层，都无法接近干净的长角色目标。角色条件化的多标记生成通过注意力流回整个提示中的角色文本位置，这是任何单个位置的残差无法复现的。残差流中的局部加性性并不意味着提示可压缩。提示到答案过渡处的加性结构支持可解释性和对角色或任务贡献的细粒度控制；整个延续中的角色条件化行为依赖于分布式的提示/KV机制，局部激活算术无法取代。

英文摘要

Role prompts of the form "As X, do Y" admit a clean linear decomposition at one specific site in the residual stream: the prompt-to-answer transition -- the last prompt token together with the first two generated tokens -- in an early/mid layer band. There, persona and task contribute through partially orthogonal additive directions. Forming a pure persona effect $\Delta_X$, a pure task effect $\Delta_Y$, and substituting $h_{BB} + \Delta_X + \Delta_Y$ for the clean residual yields downstream output within a small KL of clean on Gemma-2-2B-IT and Qwen-2.5-{1.5B, 3B}-Instruct, across a 12-cell short grid and a 48-cell long-persona grid, with persona-specific behavioral markers preserved. The natural inference from this additive structure is that the role prompt can be compressed into a single cached residual vector. We show it cannot. Injecting the cached additive prediction -- or even the oracle clean residual $h_{XY}$ -- into a baseline host prompt with the persona text removed does not approach the clean long-persona target, at one site or at many layers. Persona-conditioned multi-token generation flows through attention back to the persona-text positions throughout the prompt, which no residual at one site reproduces. Local additivity in the residual stream does not imply prompt compressibility. The additive structure at the prompt-to-answer transition supports interpretability and fine-grained steering of persona or task contributions; persona-conditioned behavior across the full continuation depends on a distributed prompt/KV mechanism that local activation arithmetic does not displace.
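The additive substitution in the abstract can be made concrete with a toy sketch. The abstract does not spell out how the pure effects are formed, so the definitions below are assumptions: the persona effect is taken as the difference between a persona-only residual `h_XB` and the baseline `h_BB`, and the task effect as the difference between a task-only residual `h_BY` and `h_BB`. The vectors are made-up three-dimensional values, not measurements from any model.

```python
def sub(a, b):
    """Element-wise difference of two vectors."""
    return [x - y for x, y in zip(a, b)]


def add(*vectors):
    """Element-wise sum of any number of vectors."""
    return [sum(xs) for xs in zip(*vectors)]


def dot(a, b):
    """Inner product of two vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy residuals at the prompt-to-answer site (assumed values).
h_BB = [0.1, 0.2, -0.3]   # baseline persona, baseline task
h_XB = [0.6, 0.2, -0.3]   # persona X, baseline task
h_BY = [0.1, -0.5, 0.4]   # baseline persona, task Y

delta_X = sub(h_XB, h_BB)  # assumed definition of the pure persona effect
delta_Y = sub(h_BY, h_BB)  # assumed definition of the pure task effect

# Additive prediction for the combined prompt "As X, do Y".
h_pred = add(h_BB, delta_X, delta_Y)
print(h_pred, dot(delta_X, delta_Y))
```

In this toy case the two effect directions are exactly orthogonal. The abstract reports only partial orthogonality, and its main negative result is that a correct local prediction of this kind still does not replace the persona text in the prompt.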

2605.23103 2026-05-25 cs.CL cs.AI cs.CY cs.DB 版本更新

A Fine-Tuned BERT Classifier for Personal-Letter Titles in Late-Ming and Early-Qing Collected Works

用于明清之际文集中个人书信标题的微调BERT分类器

Queenie Luo

发表机构 * Harvard University（哈佛大学）

AI总结本文提出了一种基于微调BERT的分类器Lepton，用于识别晚明至清初文集目录中的标题是否为个人书信，特别是与可混淆的序言（如告别序）进行区分。该模型在33位文人手标注的5438个文集标题上进行微调，并已部署于Hugging Face平台，应用于中国传记资料库（CBDB），成功识别出约五万五千封书信，为明信平台的数据建设提供了支持。

2605.23093 2026-05-25 cs.CL cs.CY 版本更新

A Comparative Evaluation of Structural Topic Models and BERTopic for Short, Open-Ended Survey Responses

结构主题模型与BERTopic在简短开放式调查回答中的比较评估

Yan Jiang, Sihong Liu, Philip A. Fisher

发表机构 * Stanford Center on Early Childhood, Stanford University（斯坦福大学早期儿童研究中心）

AI总结本文比较了结构主题模型（STM）和基于嵌入的BERTopic模型在分析短文本开放性调查回复中的表现。研究通过多种参数设置对两种方法进行了评估，发现BERTopic在主题一致性方面优于STM，而STM在协变量分析方面更具优势。研究结果表明，两种方法各有优劣，适用于不同研究需求，为应用社会科学研究中的主题建模方法选择提供了实用指导。

详情

AI中文摘要

应用心理学中的主题建模日益跨越两种方法论传统：概率词袋模型和较新的基于嵌入的方法。然而，对这些方法的许多评估依赖于较长且更干净的基准语料库，对简短、开放式调查回答的指导较少。本文比较了结构主题模型（STM）（一种概率主题模型）和BERTopic（一种基于嵌入的模型）用于分析开放式调查回答。我们评估了三种STM条件和五种BERTopic条件，变化包括拼写纠正、词干提取、嵌入选择以及上下文增强（我们引入的一种为极短回答提供额外语义上下文的策略）。结果表明，BERTopic始终比STM产生更高的主题连贯性，其中上下文增强带来了最强的性能提升。相比之下，仅使用更高维度的嵌入并未改善连贯性，反而与更大的数据损失相关。定性评估显示，BERTopic生成了更可解释和稳定的主题，而STM主题通常更广泛且更混杂。然而，STM为推断性协变量分析提供了更强的支持，而BERTopic的协变量比较主要是描述性的。这些发现表明STM和BERTopic具有互补优势。我们最后为应用社会科学研究中选择和结合主题建模方法提供了实用指导。

英文摘要

Topic modeling in applied psychology increasingly spans two methodological traditions: probabilistic bag-of-words models and newer embedding-based approaches. Yet many evaluations of these methods rely on longer and cleaner benchmark corpora, leaving less guidance for short, open-ended survey responses. This paper compares Structural Topic Models (STM), a probabilistic topic model, and BERTopic, an embedding-based model, for analyzing open-ended survey responses. We evaluated three STM conditions and five BERTopic conditions, varying typographical correction, stemming, embedding choice, and contextual augmentation, a strategy we introduced to provide additional semantic context for very short responses. Results indicate that BERTopic consistently produced higher topic coherence than STM, with contextual augmentation yielding the strongest performance gains. In contrast, higher-dimensional embeddings alone did not improve coherence and were associated with greater data loss. Qualitative evaluation showed that BERTopic generated more interpretable and stable topics, while STM topics were often broader and more mixed. However, STM provides stronger support for inferential covariate analysis, whereas BERTopic covariate comparisons are primarily descriptive. These findings suggest that STM and BERTopic offer complementary strengths. We conclude with practical guidance for selecting and combining topic modeling approaches in applied social science research.

2605.23078 2026-05-25 cs.LG cs.CL 版本更新

GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs

GEMQ：MoE大语言模型的全局专家级混合精度量化

Jianing Deng, Song Wang, Dongwei Wang, Zijie Liu, Tianlong Chen, Huanrui Yang, Jingtong Hu

发表机构 * University of Pittsburgh（匹兹堡大学）； University of Central Florida（中佛罗里达大学）； University of Arizona（亚利桑那大学）； University of North Carolina at Chapel Hill（北卡罗来纳大学教堂山分校）

AI总结混合专家大型语言模型（MoE-LLMs）在性能上表现优异，但因大量专家参数导致内存开销较大。为解决这一问题，本文提出了一种全局专家级混合精度量化方法GEMQ，通过全局线性规划形式捕捉模型整体的专家重要性，并结合高效的路由微调以适应量化后的专家，从而实现更优的精度与内存权衡。实验表明，GEMQ在保持精度的同时显著降低了内存占用并加速了推理。

Comments ICML 2026

详情

AI中文摘要

混合专家大语言模型（MoE-LLMs）性能强大，但由于大量专家参数导致显著的内存开销。混合精度量化根据专家重要性分配不同的位宽，接近精度-内存帕累托前沿，并实现极低比特量化。然而，现有方法依赖于逐层重要性估计，忽视了量化引起的路由器偏移，导致次优的分配和路由。本文提出全局专家级混合精度量化（GEMQ），通过（1）基于量化误差分析的全局线性规划公式来捕获模型范围内的专家重要性，以及（2）高效的路由器微调以适应量化后的专家，从而克服这些限制。这些组件被集成到一个渐进式量化框架中，该框架迭代地优化重要性估计和分配。实验表明，GEMQ在最小化精度损失的情况下显著减少内存并加速推理。源代码可在 https://github.com/jndeng/GEMQ 获取。

英文摘要

Mixture-of-Experts Large Language Models (MoE-LLMs) achieve strong performance but incur substantial memory overhead due to massive expert parameters. Mixed-precision quantization mitigates this cost by allocating expert-wise bit-widths based on their importance, approaching the accuracy-memory Pareto frontier and enabling extreme low-bit quantization. However, existing methods rely on layer-wise importance estimation and overlook router shifts induced by quantization, resulting in suboptimal allocation and routing. In this work, we propose Global Expert-level Mixed-precision Quantization (GEMQ) to overcome these limitations via (1) a global linear-programming formulation that captures model-wide expert importance based on quantization error analysis, and (2) efficient router fine-tuning to adapt routing to quantized experts. These components are integrated into a progressive quantization framework that iteratively refines importance estimation and allocation. Experiments demonstrate that GEMQ significantly reduces memory and accelerates inference with minimal accuracy degradation. Source code is available at https://github.com/jndeng/GEMQ .
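The idea of a global, model-wide bit allocation can be illustrated on a toy problem. The sketch below is not GEMQ's linear program: it swaps in exhaustive search, which is only feasible for a handful of experts, and it assumes a hypothetical error model in which an expert's quantization error is its importance weight times `4 ** (-bits)`. The function name `allocate_bits`, the importance values and the average-bit budget are all illustrative assumptions.

```python
from itertools import product


def allocate_bits(importance, choices=(2, 3, 4), avg_budget=3.0):
    """Pick one bit-width per expert to minimize total weighted error
    subject to a global average-bit budget (brute force, toy scale only)."""
    n = len(importance)
    best, best_cost = None, float("inf")
    for bits in product(choices, repeat=n):
        # Enforce the global memory budget across all experts at once.
        if sum(bits) > avg_budget * n:
            continue
        # Assumed error model: error shrinks by 4x per extra bit.
        cost = sum(w * 4.0 ** (-b) for w, b in zip(importance, bits))
        if cost < best_cost:
            best, best_cost = bits, cost
    return best


importance = [5.0, 0.1, 1.0, 0.2]  # hypothetical per-expert importance
bits = allocate_bits(importance)
print(bits)
```

Because the budget is shared across every expert instead of being fixed per layer, important experts can take extra bits from unimportant ones anywhere in the model, which is the contrast with layer-wise allocation drawn in the abstract.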

2605.23071 2026-05-25 cs.CL 版本更新

确定性视界：作为可信AI系统设计规范的不可行性结果

Dongxin Guo

AI总结本文探讨了计算理论的根本限制为可信人工智能系统设计划定的边界，提出将不可行性定理转化为系统设计规则的新方法。其核心结果证明，大型语言模型的准确率存在一个仅由架构决定的上限：超过临界推理深度（即“确定性视界”）后，该上限不受训练数据量、适配器秩或损失函数的影响，并可在部署前根据模型层数和嵌入宽度计算得出。研究还将同一论证应用于多个AI子领域，形成一套包含十六项设计规范的目录，为构建更可靠的人工智能系统提供了理论依据和设计指导。

Comments PhD thesis, Department of Computer Science, The University of Hong Kong, 2026. 271 pages, 18 figures, 15 tables, 5 algorithms

详情

AI中文摘要

大型语言模型现在编写软件、起草法律文件并生成临床笔记，但从图灵、阿罗到没有免费午餐定理的基本极限，塑造了计算的能力。本文将这些不可行性结果从奇闻转化为设计规则。其旗舰结果证明了仅由架构设定的准确率上限：超过关键推理深度后，无论适配器秩、样本大小或损失函数如何，训练都无法改变它。该确定性视界在部署前可从层数和嵌入宽度计算，在十二种Transformer架构中测量值介于19到31之间，而在最优长度轨迹上微调可恢复不到4个百分点。其机制是残差流的容量不变性，信息论转换得出超过视界后准确率超指数衰减。一个针对模幂的无条件电路复杂度下界（对抗常数深度素数模电路）补充了这一结果。同样的论证重新应用于多个子领域：任何错误指定模型下的偏好学习在样本复杂度上出现不连续跳跃；多阶段检索流水线至少需要与阶段数一样多的独立指标；标准诚实拍卖对于具有提示相关估值的智能体失效；神经推理的零知识验证为每个非线性激活支付110到190倍的测量开销。这些共同构成了一个包含16条规范的目录，每条规范配对一个可计算边界、一个量化违反成本和一个建设性设计规则：两个组合已被证明，一个配对是诚实障碍，四个保持开放。本文为可信AI可能需要的生成式研究计划提供了不可行性规范方法论。AI的每一个基本极限也是一个设计规则。

英文摘要

Large language models now write software, draft legal documents, and produce clinical notes, yet fundamental limits, from Turing and Arrow to the No Free Lunch theorems, shape what computation can do. This thesis turns such impossibility results from curiosities into design rules. Its flagship result proves an accuracy ceiling set by architecture alone: past a critical reasoning depth, no amount of training moves it, at any adapter rank, sample size, or loss function. Computable before deployment from layer count and embedding width, this Deterministic Horizon is measured between nineteen and thirty-one across twelve transformer architectures, and fine-tuning on optimal-length traces recovers under four percentage points. The mechanism is a capacity invariant of the residual stream, and an information-theoretic conversion yields super-exponential accuracy decay past the horizon. An unconditional circuit-complexity lower bound for modular exponentiation against constant-depth prime-modulus circuits complements this result. The same argument recasts across subfields: preference learning under any misspecified model jumps discontinuously in sample complexity; multi-stage retrieval pipelines require at least as many independent metrics as stages; standard truthful auctions fail for agents with prompt-dependent valuations; and zero-knowledge verification of neural inference pays a measured overhead of one hundred ten to one hundred ninety times per non-linear activation. Together these form a catalogue of sixteen specifications, each pairing a computable boundary, a quantified violation cost, and a constructive design rule: two compositions are proved, one pairing is an honest obstruction, and four remain open. The impossibility-specification methodology is offered for the generative research programme that trustworthy AI may need. Every fundamental limit of AI is also a design rule.

2605.22993 2026-05-25 cs.CL cs.AI 版本更新

A Proactive Multi-Agent Dialogue Framework for Assessing Social Language Disorder Traits in Autism

一种主动式多智能体对话框架用于评估自闭症中的社交语言障碍特征

Chuanbo Hu, Minglei Yin, Bin Liu, Wenqi Li, Lynn K. Paul, Shuo Wang, Xin Li

发表机构 * Department of Computer Science（计算机科学系）； University at Albany（奥尔巴尼大学）； Department of Management Information System（管理信息系统系）； West Virginia University（西弗吉尼亚大学）； Department of Radiology（放射学系）； Washington University in St. Louis（圣路易斯华盛顿大学）； Humanities and Social Sciences（人文与社会科学）

AI总结该研究提出了一种名为TPA的主动多智能体对话框架，用于评估自闭症谱系障碍中的社会语言障碍（SLD）特征。该框架通过医生智能体主动选择针对性的问题策略，以系统性地揭示患者对话中潜在的语言障碍特征，从而提高诊断效率。实验表明，TPA在多个关键指标上优于现有基线方法，显著提升了SLD特征的覆盖率和诊断效率，为AI辅助临床筛查提供了重要支持。

详情

AI中文摘要

与自闭症谱系障碍中社交语言障碍（SLD）相关的特征性语言行为，包括回声性重复、代词位移和刻板媒体引用，在自发对话中基本不存在，仅在特定对话条件下出现。在结构化临床评估中，这种潜伏性意味着提问策略选择是决定对话产生多少诊断信息的关键但未被充分重视的因素。大型语言模型（LLMs）能否被引导主动选择系统地揭示这些潜在特征的提问策略，在很大程度上仍未探索。本文提出TPA（思考、计划、询问），一种应用于自闭症诊断观察量表模块4（ADOS-2）语言评估部分的主动式多智能体对话框架，其中医生智能体在选择有临床依据的策略并生成针对性问题之前，明确推理哪些特征尚未观察到。基于真实ADOS-2临床数据的患者智能体使得无需真实患者参与即可进行可重复评估，并通过三个独立实验验证，确认其对真实患者语言具有足够的保真度。在来自35名患者的484个片段上评估，TPA在所有主要指标上优于六个竞争性对话规划基线，实现了82.1%的SLD特征覆盖率，比训练有素的临床医生进行的真实临床对话自动回放（65.5%）高16.6个百分点，并且每轮诊断效率显著更高（AUCC：0.628 vs. 0.458，绝对增益+0.170）。这些结果表明，主动提问策略选择显著提高了自动化SLD特征评估的效率，对可扩展的AI辅助临床筛查具有直接意义。

英文摘要

Characteristic linguistic behaviors associated with Social Language Disorder (SLD) in autism spectrum disorder, including echoic repetition, pronoun displacement, and stereotyped media quoting, are largely absent from spontaneous conversation and only emerge under specific conversational conditions. In structured clinical assessments, this latency means that questioning strategy selection is a critical yet underappreciated determinant of how much diagnostic information a conversation yields. Whether large language models (LLMs) can be guided to proactively select questioning strategies that systematically surface these latent traits remains largely unexplored. Here we present TPA (Think, Plan, Ask), a proactive multi-agent dialogue framework applied to the language assessment component of the Autism Diagnostic Observation Schedule Module 4 (ADOS-2), in which a doctor agent explicitly reasons about which traits remain unobserved before selecting a clinically grounded strategy and generating a targeted question. A patient agent grounded in real ADOS-2 clinical data enables reproducible evaluation without real patient participation, validated across three independent experiments confirming adequate fidelity to real patient language. Evaluated on 484 episodes from 35 patients, TPA outperforms six competitive dialogue planning baselines across all primary metrics, achieving 82.1% SLD trait coverage, 16.6% higher than automated replay of real clinical dialogues conducted by trained clinicians (65.5%), with substantially greater per-turn diagnostic efficiency (AUCC: 0.628 vs. 0.458, absolute gain +0.170). These results demonstrate that proactive questioning strategy selection substantially improves the efficiency of automated SLD trait assessment, with direct implications for scalable AI-assisted clinical screening.

2605.22981 2026-05-25 cs.CL cs.AI cs.LG 版本更新

Memorization Dynamics of Fill-in-the-Middle Pretraining

Fill-in-the-Middle 预训练的记忆动态

Tobias von Arx, Tanguy Dieudonné

发表机构 * Department of Computer Science, ETH Zurich（苏黎世联邦理工学院计算机科学系）

AI总结本文研究了“填中”（FIM）预训练目标对语言模型逐字记忆能力的影响。通过在包含重复内容的语料库上训练匹配的Llama 3.2模型，发现FIM更倾向于恢复短或部分匹配的文本片段，而传统的从左到右（LTR）方法则更常对长段精确续写赋予高置信度。实验还表明，FIM训练下的逐字记忆能力随重复次数近似线性增长，并且后缀上下文不足以支持准确回忆，前缀上下文在其中起关键作用。研究强调了单一评估方式可能忽略记忆行为的复杂性。

Comments MemFM @ ICML 2026

详情

AI中文摘要

Fill-in-the-Middle (FIM) 是一种广泛用于赋予因果语言模型填充能力的预训练目标，但其对逐字记忆的影响尚未充分探索。我们在受控设置中研究 FIM 的记忆动态，通过在包含重复 Gutenberg 摘录的 FineWeb-Gutenberg 语料库上，使用 FIM 和标准从左到右 (LTR) 目标预训练匹配的 Llama 3.2 模型。基于前缀的探测表明，FIM 更常恢复短片段或部分匹配的跨度，而 LTR 更常对长精确延续赋予高置信度。我们观察到，在测试范围内，FIM 训练下的逐字提取随重复次数近似线性增长。评估原生 FIM 格式的探测显示，后缀上下文并不足够：FIM 训练下的逐字回忆仍然强烈锚定于前缀上下文。我们的结果还表明，仅评估一种跨度长度或探测格式可能会遗漏记忆行为中的重要细微差别。

英文摘要

Fill-in-the-middle (FIM) is a pretraining objective widely used to equip causal language models with infilling ability, yet its effect on verbatim memorization remains underexplored. We study the memorization dynamics of FIM in a controlled setting by pretraining matched Llama 3.2 models with FIM and standard left-to-right (LTR) objectives on a FineWeb-Gutenberg corpus containing repeated Gutenberg excerpts. With prefix-based probes, FIM more often recovers short or partially matching spans, while LTR more often assigns high confidence to long exact continuations. We observe that verbatim extraction under FIM-training grows approximately linearly with repetitions over the tested range. Evaluating native FIM-format probes reveals that suffix context is not sufficient: verbatim recall under FIM-training remains strongly anchored in prefix context. Our results also show that evaluating only one span length or probing format can miss important nuances in memorization behavior.
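For readers unfamiliar with the objective, a FIM training example is typically built by cutting a document into prefix, middle and suffix and moving the middle to the end. The sketch below assumes the common prefix-suffix-middle ordering and uses hypothetical sentinel strings. The abstract does not state which sentinels, ordering or span-sampling scheme the authors used, so `to_fim` and the constants are illustrative only.

```python
import random

# Hypothetical sentinel strings; real tokenizers define their own.
PRE, SUF, MID = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"


def to_fim(text, rng):
    """Rearrange `text` so the model sees prefix and suffix before the middle."""
    # Two distinct cut points split the text into three pieces.
    i, j = sorted(rng.sample(range(len(text) + 1), 2))
    prefix, middle, suffix = text[:i], text[i:j], text[j:]
    example = f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"
    return example, (prefix, middle, suffix)


rng = random.Random(0)
text = "It was the best of times, it was the worst of times."
example, (prefix, middle, suffix) = to_fim(text, rng)
print(example)
```

A left-to-right model trained on such strings still predicts tokens causally, but the middle span is generated with both sides in context. This is why the abstract's finding that recall stays anchored in the prefix, even with the suffix available, is notable.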


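2605.22971 2026-05-25 cs.CL cs.HC version update

Can AI Guess What You Know? Performance Comparison of Large Language Models for Human Domain Knowledge Estimation From Communication Logs

Ko Watanabe, Shoya Ishimaru

Affiliations: Osaka Metropolitan University

AI summary: This paper investigates whether large language models (LLMs) can estimate employees' domain knowledge levels by analyzing their Slack communication logs. Analyzing 27,188 messages from 43 users, it compares the zero-shot estimation performance of seven models from the Gemini, Claude, and GPT families and finds that Gemini 2.5 Flash performs best, with the lowest error. The study also shows that estimation accuracy correlates only weakly with message count, highlights both the feasibility and the current limitations of automated expertise mapping, and points to the importance of privacy protection and richer knowledge representations.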

2605.22963 2026-05-25 cs.CL cs.AI version update

Graph Alignment Topology as an Inductive Bias for Grounding Detection

Paul Landes, Pranav Herur, Adam Cross, Jimeng Sun

Affiliations: Department of Pediatrics, University of Illinois College of Medicine Peoria; University of Illinois Urbana-Champaign; Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign; Carle Illinois College of Medicine, University of Illinois Urbana-Champaign

AI summary: This paper studies how graph alignment topology can serve as an inductive bias for improving the factual accuracy of content generated by large language models (LLMs). The authors build bipartite graphs between reference information and model outputs and model the alignment structure with a graph neural network, learning alignment-topology features directly. The method achieves state-of-the-art results on several hallucination-detection and question-answering datasets, outperforming existing methods and foundation LLMs such as GPT-4o, and offers a new approach to improving the interpretability and factual reliability of model outputs.

Abstract

Large Language Models (LLMs) are optimized to produce distributionally plausible continuations rather than to explicitly verify whether generated propositions are entailed by source documents. This inductive bias enables generalization, but it does not encode whether responses are grounded with respect to a reference. These issues limit the use of LLMs in domains where strict factual correctness is crucial, such as clinical decision support. Existing hallucination detection approaches improve factuality through retrieval augmentation, self-consistency, or claim verification, but generally do not learn directly over alignment topology. To leverage alignment topology as an inductive bias, we construct aligned bipartite graphs between reference information and LLM outputs and train a graph neural network (GNN) to model alignment structure using message passing. The method achieves state-of-the-art results on four diverse hallucination and question-answering datasets, outperforming all compared methods, including foundational LLMs such as GPT-4o.


2605.22635 2026-05-25 cs.LG cs.CL cs.CV version update

The Double Dilemma in Multi-Task Radiology Report Generation: A Gradient Dynamics Analysis and Solution

Erjian Zhang, Yatong Hao, Liejun Wang, Zhiqing Guo

Affiliations: School of Computer Science and Technology; Xinjiang University; Information Security Engineering Technology Research Center

AI summary: In multi-task radiology report generation, existing linear scalarization strategies struggle to balance the hard constraints of clinical supervision against the smoothness requirements of report generation. This paper analyzes the problem from the perspective of gradient dynamics, characterizes it as a "double dilemma" of drift-term deviation and diffusion-term decay, and proposes CAME-Grad, a model-agnostic optimizer that uses conflict-averse direction rectification and magnitude-enhanced energy injection to ensure geometric validity and avoid local optima. Experiments show that the method significantly improves clinical efficacy across multiple tasks.

Comments: Accepted by ICML 2026

Abstract

While multi-task learning based automatic radiology report generation (RRG) is widely adopted to ensure clinical consistency, most studies focus on architectural designs yet remain limited to coarse linear scalarization strategies. These strategies cannot effectively balance the hard constraints of discriminative clinical supervision with the smoothness requirements of report generation. To address these problems, we analyze the failure mechanism of linear scalarization from the perspective of gradient dynamics, utilizing the stochastic differential equation (SDE) framework to characterize it as a "Double Dilemma" of drift term deviation and diffusion term decay. Based on this, we propose a backbone-agnostic optimizer named Conflict-Averse Magnitude-Enhanced Gradient Descent (CAME-Grad). Through conflict-averse direction rectification and magnitude-enhanced energy injection, the algorithm not only ensures geometric validity, but also avoids local optima. Then, the adaptive gradient fusion mechanism is used to establish a dynamic balance between the theoretical optimal direction and the task-specific inductive bias. Experiments show that as a universal plug-and-play optimizer, CAME-Grad brings substantial and consistent improvements across eight diverse RRG methods, elevating overall clinical efficacy performance by an average of 2.3% on MIMIC-CXR and 1.9% on IU X-Ray. Our code is available at https://github.com/vpsg-research/CAME-Grad.
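The contrast between linear scalarization and conflict-aware gradient handling can be illustrated with a small sketch. Note that this is not CAME-Grad: the abstract does not specify its update rule, so the code below swaps in the well-known gradient-projection idea (as in PCGrad), and the function names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def scalarize(g1, g2, w=0.5):
    """Linear scalarization: a fixed weighted sum of two task gradients."""
    return [w * x + (1 - w) * y for x, y in zip(g1, g2)]

def project_conflict(g1, g2):
    """If g1 conflicts with g2 (negative dot product), drop g1's component along g2.

    This is a PCGrad-style projection used for illustration, not the paper's method.
    """
    d = dot(g1, g2)
    if d >= 0:
        return list(g1)  # no conflict (or g2 is zero), leave g1 unchanged
    scale = d / dot(g2, g2)
    return [x - scale * y for x, y in zip(g1, g2)]
```

With conflicting gradients such as `[1.0, 0.0]` and `[-1.0, 1.0]`, the plain weighted sum partly cancels the first task's direction, whereas the projected gradient is orthogonal to the second task's gradient and so no longer works against it to first order.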


2605.22373 2026-05-25 cs.LG cs.CL version update

Boundary-targeted Membership Inference Attacks on Safety Classifiers

Anthony Hughes, Alexander Goldberg, Prince Jha, Adam Perer, Nikolaos Aletras, Niloofar Mireshghallah

Affiliations: University of Sheffield; Carnegie Mellon University; MBZUAI

AI summary: This study examines boundary-targeted membership inference attacks on safety classifiers, which are commonly used in generative AI systems to filter harmful content or identify at-risk users. It proposes a new attack that identifies the examples on which the classifier is least confident, exposing the model's memorization of training data and thereby inferring whether an example belongs to the training set. Experiments on a classifier that detects users who may need emotional support show that the method recovers more conversations flagged as at-risk at a low false-positive rate, substantially outperforming existing membership inference attacks. The study further characterizes boundary examples and finds that content-based filtering does not defend effectively against such attacks.

Abstract

Safety classifiers are essential safeguards within generative AI systems, filtering harmful content or identifying at-risk users when interacting with large language models. Despite their necessity, these models are trained on sensitive datasets including discussions of self-harm and mental health, raising important, yet poorly understood, privacy concerns. Membership inference attacks (MIAs) allow adversaries to infer membership of examples used to train models. In this work, we hypothesize that identifying the examples on which the classifier is least confident is informative for an adversary to infer membership. This reflects a localized failure of generalization, where the model relies on memorization to resolve ambiguity in the training set. To investigate this, we introduce a new boundary-targeted selection strategy that identifies low-confidence examples that amplify the signal of an example's membership within a training set. Our experimental results show that an adversary can recover 19% of the conversations a safety classifier flagged as indicating user distress, at a 5% false-positive rate, on a classifier fine-tuned for detecting a user who may require emotional support. This is 3.5 times more than attacking using state-of-the-art MIA methods alone. Finally, we characterize the boundary-lying examples and show that content-based filtering is ineffective for protection, while existing noise strategies can effectively mitigate susceptibility of these examples.
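The two quantities in this abstract, low-confidence (boundary) examples and recovery at a fixed false-positive rate, can be sketched as follows. This is a minimal illustration under assumptions: it treats the classifier as binary and measures "least confident" as distance of the predicted probability from 0.5, and the function names are hypothetical rather than taken from the paper.

```python
def boundary_examples(probs, k):
    """Return indices of the k examples whose predicted probability is closest to 0.5."""
    order = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return order[:k]

def tpr_at_fpr(member_scores, nonmember_scores, max_fpr=0.05):
    """True-positive rate at the most permissive threshold whose FPR stays within max_fpr.

    Higher attack scores mean the attacker believes the example was in the training set.
    """
    thresholds = sorted(set(member_scores) | set(nonmember_scores), reverse=True)
    best = 0.0
    for t in thresholds:
        fpr = sum(s >= t for s in nonmember_scores) / len(nonmember_scores)
        if fpr > max_fpr:
            break  # lowering the threshold further only raises the FPR
        best = sum(s >= t for s in member_scores) / len(member_scores)
    return best
```

Read this way, the reported result corresponds to a true-positive rate of 0.19 at `max_fpr=0.05` when the attack is restricted to examples selected near the decision boundary.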


2605.21071 2026-05-25 cs.CL cs.AI version update

Fine-grained Claim-level RAG Benchmark for Law

Souvick Das, Sallam Abualhaija, Domenico Bianculli

Affiliations: University of Luxembourg

AI summary: This paper presents ClaimRAG-LAW, a fine-grained benchmark dataset for legal retrieval-augmented generation (RAG) that supports English and French, targets both legal experts and non-expert users, and covers question types drawn from a range of realistic scenarios. Using a fine-grained evaluation framework, the study analyzes the retrieval, generation, and claim-level performance of state-of-the-art legal RAG systems, reveals their limitations in the legal domain, and provides a reference point for improving the reliability of legal AI systems.

Abstract

The rapid progress of large language models (LLMs) is shifting semantic search toward a question-answering paradigm, where users ask questions and LLMs generate responses. In high-stakes domains such as law, retrieval-augmented generation (RAG) is commonly used to mitigate hallucinations in generated responses. Nonetheless, prior work shows that RAG systems, whether general-purpose or legal-specific, still hallucinate at varying rates, making fine-grained evaluation essential. Despite the need, existing evaluation frameworks for legal RAG systems lack the granularity required to provide detailed analysis of retrieval and generation performance separately. Moreover, current benchmarks are largely English-only and centered on legal expert queries, overlooking non-expert needs. We introduce ClaimRAG-LAW, a comprehensive dataset for legal RAG that supports French and English, targets both experts and non-experts, and includes diverse question types reflecting realistic scenarios. We further apply a fine-grained evaluation framework to state-of-the-art legal RAG systems, revealing limitations in retrieval, generation, and claim-level analysis in the legal domain.


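2605.20558 2026-05-25 cs.CL version update

When Irregularity Helps: A Subclass Analysis of Inductive Bias in Neural Morphology

Wen Zhang

AI summary: This study analyzes how neural morphological generation systems handle Japanese past-tense verb inflection and finds that model errors concentrate on one structurally unusual irregular subclass with very little data. Controlled experiments show that removing this small class improves generalization more than removing all irregular verbs, indicating that different irregular patterns affect model stability differently. The study attributes the error concentration to the interaction between extremely low-frequency morphological patterns and specific phonological processes (such as accent), and argues that morphological evaluation should include finer-grained subclass analysis to expose model weaknesses.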

2605.20201 2026-05-25 cs.CL cs.AI cs.LG version update

Long-Context Reasoning Through Proxy-Based Chain-of-Thought Tuning

Miao Li, Irina Saparina, Alexander Gurung, Mirella Lapata

Affiliations: School of Informatics, University of Edinburgh

AI summary: This work addresses the poor performance of large language models on long-context tasks that require complex reasoning and proposes a new training framework called ProxyCoT. The method obtains high-quality reasoning traces on short proxy contexts and transfers them to the full long contexts, improving the model's long-context reasoning. Experiments show that ProxyCoT outperforms existing methods on multiple datasets at lower computational cost and generalizes well across domains.

Comments: Long paper, ACL 2026 (Main conference)

Abstract

Recent large language models support inputs of up to 10 million tokens, yet they perform poorly on long-context tasks that require complex reasoning. Such tasks can be solved using only a subset of the input -- a proxy context -- rather than the full sequence. Despite sharing the same underlying reasoning process, models exhibit a significant performance disparity between proxy and full contexts. To improve long-context reasoning, we propose ProxyCoT, a novel training framework that transfers reasoning capabilities from short proxy contexts to full long contexts. Specifically, we first obtain high-quality chain-of-thought reasoning traces on proxy contexts through reinforcement learning or distillation from a larger teacher model, and then ground the generated traces in full long contexts with supervised fine-tuning. Experiments across different datasets demonstrate that ProxyCoT consistently outperforms strong baselines with reduced computational overhead. Furthermore, models trained with ProxyCoT generalize their long-context reasoning capabilities to out-of-domain tasks.


2605.20192 2026-05-25 cs.CL cs.CE cs.CR cs.CY q-fin.CP version update

Leveraging Large Language Models for Sentiment Analysis: Multi-Modal Analysis of Decentraland's MANA Token

Xintong Wu, Peiting Tsai, Jing Yuan, Michael Yu, Greg Sun, Luyao Zhang

Affiliations: University of Pennsylvania; Microsoft; Duke Kunshan University

AI summary: This paper studies how large language models can be used to analyze sentiment in the Discord community of the Decentraland virtual platform and combines it with multi-modal financial data to improve price prediction for the MANA token. It uses a BERT-based model for sentiment analysis and builds two LSTM architectures, one based on historical prices and one fusing multi-modal features: sentiment scores, trading volume, and market capitalization. Experiments show that the multi-modal model predicts significantly more accurately than the price-only baseline, highlighting the value of community sentiment signals for virtual-economy forecasting.

Abstract

Decentraland, a decentralized virtual reality platform operating within the expanding Metaverse ecosystem, utilizes its native MANA token to facilitate virtual asset transactions and governance. This study investigates the integration of Discord community sentiment with multi-modal financial data to enhance cryptocurrency price prediction within virtual world economies. We address: (1) identifying sentiment patterns within Decentraland's Discord community, and (2) evaluating the impact of multi-modal features on token return forecasting. Using a BERT-based large language model for sentiment analysis, we develop two LSTM architectures: a baseline incorporating historical prices and a multi-modal variant integrating sentiment scores, trading volume, and market capitalization. Results indicate predominantly neutral community sentiment with a positive skew. The multi-modal model significantly outperforms the price-only baseline in prediction accuracy. These findings demonstrate the predictive value of community-derived signals for virtual economy forecasting and establish a foundation for future research at the intersection of immersive virtual environments, natural language processing, and cryptocurrency market analysis.


2605.20087 2026-05-25 cs.CL cs.AI version update

ThoughtTrace: Understanding User Thoughts in Real-World LLM Interactions

Chuanyang Jin, Binze Li, Haopeng Xie, Cathy Mengying Fang, Tianjian Li, Shayne Longpre, Hongxiang Gu, Maximillian Chen, Tianmin Shu

Affiliations: Johns Hopkins University; Massachusetts Institute of Technology; Google Research

AI summary: ThoughtTrace is the first large-scale dataset that records real-world multi-turn user-AI conversations together with users' self-reported thoughts, revealing why users send prompts and how they react to AI responses. The dataset contains 1,058 users, 2,155 conversations, and 10,174 thought annotations. Analysis shows that thoughts are semantically distinct from conversation messages and are difficult for current frontier large models to infer accurately. The study further demonstrates the value of thoughts for behavior prediction and for training personalized assistants, offering a new data modality for understanding users' latent goals and needs.

Comments: 53 pages, 23 figures, 4 tables. Project website: https://thoughttrace-project.github.io/

Abstract

Conversational AI has now reached billions of users, yet existing datasets capture only what people say, not what they think. We introduce ThoughtTrace, the first large-scale dataset that pairs real-world multi-turn human--AI conversations with users' self-reported thoughts: their reasons for sending prompts and reactions to assistant responses. ThoughtTrace comprises 1,058 users, 2,155 conversations, 17,058 turns, and 10,174 thought annotations collected across 20 language models. Our analysis shows that ThoughtTrace captures long-horizon, topically diverse interactions, and that thoughts are semantically distinct from messages, difficult for frontier LLMs to infer from context, diverse in content, and tied to conversation stages. We further demonstrate the utility of thoughts for downstream modeling. First, thoughts improve user-behavior prediction as inference-time context. Second, thought-guided rewrites provide fine-grained alignment signals for training personalized assistants. Together, ThoughtTrace establishes user thoughts as a new data modality for studying the cognitive dynamics behind human--AI interaction and provides a foundation for building assistants that better understand and adapt to users' latent goals, preferences, and needs.


2605.20043 2026-05-25 cs.CL version update

Mind Your Moras: Orthography-Aware Error Analysis of Neural Japanese Morphological Generation

Wen Zhang

AI summary: This paper proposes an orthography-aware error analysis for Japanese past-tense morphological generation, treating hiragana as a representational system that encodes morphophonological distinctions rather than as a mere transcription medium. It evaluates two character-level sequence-to-sequence models, which despite high overall accuracy make systematic errors tied to orthographic properties of hiragana, most visibly on verbs involving gemination. The study proposes a taxonomy of seven primary error patterns, reveals a close interaction among orthographic representation, morphological structure, and data frequency in model generalization, and stresses the importance of orthography-aware evaluation for morphologically complex languages.

Abstract

We present an orthography-aware error analysis of Japanese past-tense morphological inflection, treating hiragana not merely as a transcriptional medium, but as a representational system encoding morphophonological distinctions that may influence model generalization. We evaluate two character-level sequence-to-sequence architectures on past-tense formation using datasets formatted according to the SIGMORPHON 2020 and 2023 shared task conventions. Despite high aggregate accuracy, models exhibit systematic, linguistically interpretable errors that cluster around specific orthographic properties of hiragana. We introduce a concise error taxonomy capturing seven primary failure modes and provide both quantitative and qualitative analyses. Gemination-related errors dominate residual failures, accounting for 75-80% of errors, particularly in verbs whose stems end in the vowel e and require gemination before the past-tense suffix. Error patterns remain highly consistent across architectures and random seeds, suggesting a robust interaction between orthographic representation, morphological structure, and data frequency effects in shaping model generalization. These results underscore the necessity of orthography-aware evaluation for understanding neural generalization in morphologically complex languages.


2605.15482 2026-05-25 cs.CL version update

FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models

Dmitry Stanishevskii, Nini Kamkia, Alexey Khoroshilov, Dmitry Zmitrovich, Denis Kokosinskii, Zhirayr Hayrapetyan, Andrei Kalmykov

Affiliations: Lime FinTech

AI summary: FINESSE-Bench is a hierarchical benchmark suite for evaluating large language models' financial domain knowledge and technical analysis ability, comprising 3,993 questions across difficulty tiers from foundational to expert. It combines professional certification exams, applied trading tasks, and finance competition problems to assess the breadth of a model's financial knowledge, its computational ability, and its performance on complex problems. FINESSE-Bench also provides a unified evaluation protocol and an automated scoring mechanism, supporting deeper and more specialized assessment of model capabilities.

Comments: 21 pages, 10 tables, 2 figures

Abstract

Large language models (LLMs) are increasingly being applied to financial analysis, reporting, investment decision support, risk management, compliance, and professional training. However, robust evaluation of their domain competence in finance remains incomplete. Widely used open benchmarks such as FinQA, ConvFinQA, and TAT-QA have played an important role in advancing financial question answering and numerical reasoning, but they focus primarily on question answering over financial reports and do not provide an explicit hierarchy of professional difficulty. Broader resources, including FinanceBench, PIXIU, FinBen, and FLaME, expand the coverage of financial tasks, yet the problem of evaluating the transition from foundational knowledge to expert-level financial reasoning remains open. In this work, we present FINESSE-Bench, a suite of eight specialized benchmarks comprising 3,993 questions for hierarchical evaluation of financial competencies in LLMs. FINESSE-Bench combines exam-oriented datasets inspired by professional certifications (CFA-like Levels 1-3, CMT-like Level 2, and CFTe-like Level 1), applied trading task collections, and a Russian-language olympiad benchmark. This design enables evaluation of domain breadth, performance degradation as difficulty increases, the ability to solve computational tasks, and model behavior in specialized financial domains. We also describe a unified evaluation protocol covering multiple-choice questions, numerical answers, and short open-ended responses, together with an automated scoring scheme for freeform answers based on the LLM-as-judge paradigm. FINESSE-Bench is intended both as a complement to existing open financial benchmarks and as a tool for more substantive evaluation of professionally relevant financial competencies in large language models.


2605.10347 2026-05-25 cs.AI cs.CL version update

How Mobile World Model Guides GUI Agents?

Weikai Xu, Kun Huang, Yunren Feng, Jiaxing Li, Yuhan Chen, Yuxuan Liu, Zhizheng Jiang, Heng Qu, Pengzhi Gao, Wei Liu, Jian Luan, Xiaolin Hu, Bo An

Affiliations: Nanyang Technological University; MiLM Plus, Xiaomi Inc.; Independent Researchers; Gaoling School of Artificial Intelligence, Renmin University of China; Wuhan University; Xiamen University

AI summary: This paper studies how mobile world models can guide GUI agents toward effective interaction. To address existing models' weaknesses in predicting action consequences, it trains world models over four representations: delta text, full text, diffusion images, and renderable code. Experiments show state-of-the-art performance on multiple benchmarks and reveal the advantages of code reconstruction for in-distribution fidelity and multimodal supervision, the robustness of text feedback for out-of-distribution execution, and the role of world models as an aid during training rather than as a universal post-hoc verifier.

Abstract

Recent advances in vision-language models have enabled mobile GUI agents to perceive visual interfaces and execute user instructions, but reliable prediction of action consequences remains critical for long-horizon and high-risk interactions. Existing mobile world models provide either text-based or image-based future states, yet it remains unclear which representation is useful, whether generated rollouts can replace real environments, and how test-time guidance helps agents of different strengths. To answer the above questions, we filter and annotate mobile world-model data, then train world models across four modalities: delta text, full text, diffusion-based images, and renderable code. These models achieve SoTA performance on both MobileWorldBench and Code2WorldBench. Furthermore, by evaluating their downstream utility on AITZ, AndroidControl, and AndroidWorld, we obtain three findings. First, renderable code reconstruction achieves high in-distribution fidelity and provides effective multimodal supervision for data construction, while text-based feedback is more robust for online out-of-distribution (OOD) execution. Second, world-model-generated trajectories can provide transferable interaction experience in the training process and improve agents' end-to-end task performance, although these data do not preserve the original distribution. Last, for overconfident mobile agents with low action entropy, posterior self-reflection provides limited gains, suggesting that world models are more effective as prior perception or training supervision than as universal post-hoc verifiers.


2604.28048 2026-05-25 cs.CL cs.SI version update

Stable Behavior, Limited Variation: Persona Validity in LLM Agents for Urban Sentiment Perception

Neemias B da Silva, Rodrigo Minetto, Daniel Silver, Thiago H Silva

Affiliations: University of Toronto

AI summary: This study examines how different persona settings affect the behavioral consistency and variation of multimodal large language models (LLMs) on urban sentiment perception tasks. Using persona variables spanning gender, economic status, political orientation, and personality, it finds that agents sharing a persona behave highly consistently but that differences between personas are limited: only economic status and personality produce detectable yet practically small variation. The study also finds that the models perform poorly on fine-grained sentiment judgments and that the model without personas sometimes performs even better, suggesting that simple label-based persona prompting may add limited annotation value for perceptual judgments.

Comments: 8 pages, 8 figures. IEEE DCOSS - UrbCom

Journal ref: IEEE DCOSS 2026

Abstract

Large Language Models (LLMs) are increasingly used as proxies for human perception in urban analysis, yet it remains unclear whether persona prompting produces meaningful and reproducible behavioral diversity. We investigate whether distinct personas influence urban sentiment judgments generated by multimodal LLMs. Using a factorial set of personas spanning gender, economic status, political orientation, and personality, we instantiate multiple agents per persona to evaluate urban scene images from the PerceptSent dataset and assess both within-persona consistency and cross-persona variation. Results show strong convergence among agents sharing a persona, indicating stable and reproducible behavior. However, cross-persona differentiation is limited: economic status and personality induce statistically detectable but practically modest variation, while gender shows no measurable effect and political orientation only negligible impact. Agents also exhibit an extremity bias, collapsing intermediate sentiment categories common in human annotations. As a result, performance remains strong on coarse-grained polarity tasks but degrades as sentiment resolution increases, suggesting that simple label-based persona prompting does not capture fine-grained perceptual judgments. To isolate the contribution of persona conditioning, we additionally evaluate the same model without personas. Surprisingly, the no-persona model sometimes matches or exceeds persona-conditioned agreement with human labels across all task variants, suggesting that simple label-based persona prompting may add limited annotation value in this setting.


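2604.27468 2026-05-25 cs.CL version update

Pooling and Semantic Shift: Fundamental Challenges in Long-Text Embedding and Retrieval

Hang Gao, Wujiang Xu, Kai Mei, Dimitris N. Metaxas

Affiliations: Rutgers University

AI summary: This paper studies two fundamental challenges that Transformer-based embedding models face in long-text representation and retrieval: embedding collapse caused by pooling, and semantic drift. The authors argue that the degradation in embedding quality is not caused by text length or the attention mechanism alone but arises from the joint effect of pooling and internal semantic variation, and they establish a unified theoretical framework to prove it. Experiments confirm that semantic drift is the main cause of highly concentrated embeddings and show that anisotropy significantly affects retrieval performance only under strong semantic drift, providing a theoretical basis for understanding the difficulty of long-text embedding.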

BURMESE-SAN: A Burmese NLP Benchmark for Evaluating Large Language Models

Thura Aung, Jann Railey Montalan, Jian Gang Ngui, Peerat Limkonchotiwat

AI summary: This paper introduces BURMESE-SAN, the first comprehensive benchmark that systematically evaluates large language models' natural language understanding, reasoning, and generation abilities in Burmese. The benchmark comprises seven subtasks, covering areas such as question answering, sentiment analysis, and causal reasoning, and is built through a rigorous process involving native speakers to ensure linguistic naturalness and cultural authenticity. The study finds that Burmese performance depends more on architectural design, language representation, and instruction tuning than on model scale, and that regional fine-tuning and newer model generations yield significant gains.

Abstract

We introduce BURMESE-SAN, the first holistic benchmark that systematically evaluates large language models (LLMs) for Burmese across three core NLP competencies: understanding (NLU), reasoning (NLR), and generation (NLG). BURMESE-SAN consolidates seven subtasks spanning these competencies, including Question Answering, Sentiment Analysis, Toxicity Detection, Causal Reasoning, Natural Language Inference, Abstractive Summarization, and Machine Translation, several of which were previously unavailable for Burmese. The benchmark is constructed through a rigorous native-speaker-driven process to ensure linguistic naturalness, fluency, and cultural authenticity while minimizing translation-induced artifacts. We conduct a large-scale evaluation of both open-weight and commercial LLMs to examine challenges in Burmese modeling arising from limited pretraining coverage, rich morphology, and syntactic variation. Our results show that Burmese performance depends more on architectural design, language representation, and instruction tuning than on model scale alone. In particular, Southeast Asia regional fine-tuning and newer model generations yield substantial gains. Finally, we release BURMESE-SAN as a public leaderboard to support systematic evaluation and sustained progress in Burmese and other low-resource languages. https://leaderboard.sea-lion.ai/detailed/MY


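2602.18176 2026-05-25 cs.CL version update

Improving Sampling for Masked Diffusion Models via Information Gain

Kaisen Yang, Jayden Teoh, Kaicheng Yang, Yitong Zhang, Alex Lamb

Affiliations: Department of Computer Science and Technology, Tsinghua University; Massachusetts Institute of Technology; Shanghai Jiao Tong University; Beihang University; College of AI, Tsinghua University

AI summary: This paper studies how to improve sampling in masked diffusion models (MDMs). It argues that existing samplers are too greedy: they focus only on local certainty and ignore downstream effects, which increases uncertainty in the generated output. The authors therefore propose a training-free decoding method, the Info-Gain Sampler, which exploits the bidirectional structure of MDMs to balance current uncertainty against the information gain over the remaining positions. Experiments show that the method outperforms existing approaches on reasoning, coding, creative writing, and image generation tasks, significantly improving generation quality.

Comments: https://github.com/yks23/Information-Gain-Sampler Accepted by ICML 2026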

2602.17653 2026-05-25 cs.CL version update

Differences in Typological Alignment in Language Models' Treatment of Differential Argument Marking

Iskar Deng, Nathalia Xu, Shane Steinert-Threlkeld

Affiliations: University of Washington

AI summary: This paper studies differences in typological alignment in how language models handle differential argument marking (DAM). Training GPT-2 models on 18 synthetic corpora that implement different DAM systems and evaluating them with minimal pairs, the study finds that models share the human-language preference for natural markedness direction, favoring overt marking of semantically atypical arguments, but fail to reproduce the object preference common in human languages. The result suggests that different typological tendencies may arise from different underlying mechanisms.

Comments: 16 pages, 8 figures, 7 tables. To appear at CoNLL 2026

Abstract

Recent work has shown that language models (LMs) trained on synthetic corpora can exhibit typological preferences that resemble cross-linguistic regularities in human languages, particularly for syntactic phenomena such as word order. In this paper, we extend this paradigm to differential argument marking (DAM), a semantic licensing system in which morphological marking depends on semantic prominence. Using a controlled synthetic learning method, we train GPT-2 models on 18 corpora implementing distinct DAM systems and evaluate their generalization using minimal pairs. Our results reveal a dissociation between two typological dimensions of DAM. Models reliably exhibit human-like preferences for natural markedness direction, favoring systems in which overt marking targets semantically atypical arguments. In contrast, models do not reproduce the strong object preference in human languages, in which overt marking in DAM more often targets objects rather than subjects. These findings suggest that different typological tendencies may arise from distinct underlying sources.
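Minimal-pair evaluation, as mentioned above, is commonly scored by checking whether a model assigns higher probability to the acceptable member of each pair. The sketch below assumes that convention and takes precomputed sentence log-probabilities as input; the function name and input layout are hypothetical, and the paper's exact scoring procedure is not given in the abstract.

```python
def minimal_pair_accuracy(pairs):
    """Fraction of pairs where the acceptable sentence gets the higher log-probability.

    pairs: list of (logprob_acceptable, logprob_unacceptable) tuples.
    Ties count as incorrect, since the model shows no preference.
    """
    if not pairs:
        raise ValueError("no pairs to score")
    correct = sum(1 for good, bad in pairs if good > bad)
    return correct / len(pairs)
```

Under this scoring, an accuracy near 0.5 indicates no systematic preference, so a typological bias shows up as accuracy clearly above chance on pairs that differ only in the marking pattern.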

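2602.12316 2026-05-25 cs.AI cs.CL cs.CY cs.GT cs.MA (updated version)

GradingAttack: Revealing Security Vulnerabilities in LLM-Based Educational Grading Agents

Xueyi Li, Zhuoneng Zhou, Zitao Liu, Yongdong Wu

Affiliations: Guangdong Institute of Smart Education; Jinan University

AI summary: As large language models (LLMs) are widely used for automatic short answer grading, their security is drawing growing attention. This paper proposes GradingAttack, a fine-grained adversarial attack framework for systematically evaluating security vulnerabilities in LLM-based educational grading agents. With token-level and prompt-level attack strategies, the method manipulates grading outcomes while remaining highly stealthy, exposing the weak defenses of current systems against adversarial attacks and underscoring the need for secure, trustworthy educational agent systems.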

Abstract

Large language models (LLMs) are increasingly deployed as educational agents for automatic short answer grading (ASAG) in real-world educational environments, significantly boosting assessment efficiency and scalability. However, when these grading agents operate "in the wild", their vulnerability to adversarial manipulation raises critical concerns about agent security and trustworthiness. In this paper, we introduce GradingAttack, a fine-grained adversarial attack framework that systematically evaluates the security vulnerabilities of LLM-based educational grading agents. Specifically, we design token-level and prompt-level attack strategies that manipulate agent grading outcomes while maintaining high stealth, exposing fundamental weaknesses in current agent deployments. Experiments on multiple datasets demonstrate that both attack strategies effectively compromise grading agents, with prompt-level attacks achieving higher success rates and token-level attacks exhibiting superior stealth. Our findings reveal that current LLM-based educational agents lack robust defenses against adversarial attacks, underscoring the urgent need to develop secure and trustworthy agent systems for critical educational applications.

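2601.21766 2026-05-25 cs.CL cs.AI (updated version)

CoFrGeNet: Continued Fraction Architectures for Language Generation

Amit Dhurandhar, Vijil Chenthamarakshan, Dennis Wei, Tejaswini Pedapati, Karthikeyan Natesan Ramamurthy, Rahul Nair

Affiliations: IBM Research

AI summary: This paper proposes CoFrGeNet, a family of generative architectures based on continued fractions whose components replace the multi-head attention and feed-forward blocks of a Transformer with far fewer parameters. Custom gradient formulations make training more accurate and efficient. Experiments on GPT2-xl and Llama3 show that downstream task performance is maintained or even improved with two-thirds to one-half of the original parameters and shorter pre-training time.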

Comments Earlier version accepted to ICML 2026

Abstract

Transformers are arguably the preferred architecture for language generation. In this paper, inspired by continued fractions, we introduce a new function class for generative modeling. The architecture family implementing this function class is named CoFrGeNets (Continued Fraction Generative Networks). We design novel architectural components based on this function class that can replace Multi-head Attention and Feed-Forward Networks in Transformer blocks while requiring far fewer parameters. We derive custom gradient formulations to optimize the proposed components more accurately and efficiently than using standard PyTorch-based gradients. Our components are a plug-in replacement requiring little change in the training or inference procedures already in place for Transformer-based models, thus making our approach easy to incorporate in large industrial workflows. We experiment on two very different Transformer architectures, GPT2-xl (1.5B) and Llama3 (3.2B): we pre-train the former on OpenWebText and GneissWeb, and the latter on the docling data mix, which consists of nine different datasets. Results show that the performance of our models on downstream classification, Q&A, reasoning and text understanding tasks is competitive with, and sometimes even superior to, that of the original models, with $\frac{2}{3}$ to $\frac{1}{2}$ of the parameters and shorter pre-training time. We believe that future implementations customized to hardware will further bring out the true potential of our architectures.
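
For readers unfamiliar with the underlying mathematical object, the sketch below evaluates a plain finite continued fraction $a_0 + 1/(a_1 + 1/(a_2 + \dots))$. It is an assumption-laden illustration only: the abstract does not specify the exact form of the CoFrGeNet function class, and the helper name `eval_continued_fraction` is hypothetical.

```python
from fractions import Fraction
from typing import Sequence


def eval_continued_fraction(coeffs: Sequence[int]) -> Fraction:
    """Evaluate a0 + 1/(a1 + 1/(a2 + ...)) exactly, from the innermost term outward."""
    if not coeffs:
        raise ValueError("need at least one coefficient")
    value = Fraction(coeffs[-1])
    for a in reversed(coeffs[:-1]):
        # A zero partial denominator raises ZeroDivisionError here.
        value = a + 1 / value
    return value


# [1; 2, 2, 2] is a truncation of the continued fraction of sqrt(2).
print(eval_continued_fraction([1, 2, 2, 2]))  # 17/12
```

The nested reciprocal is the reason such functions need care with gradients and with near-zero denominators, which may be part of the motivation for the custom gradient formulations the abstract mentions.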

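2601.15224 2026-05-25 cs.CV cs.CL (updated version)

PROGRESSLM: Towards Progress Reasoning in Vision-Language Models

Jianshu Zhang, Chengxuan Qian, Haosen Sun, Haoran Lu, Dingcheng Wang, Letian Xue, Han Liu

Affiliations: Northwestern University; University of California, Santa Barbara

AI summary: This paper proposes ProgressLM to address the weak task-progress reasoning of vision-language models. It introduces Progress-Bench, a benchmark for systematically evaluating progress reasoning, and explores a human-inspired two-stage reasoning paradigm through training-free prompting and through training on the ProgressLM-45K dataset. Experiments show that most existing models perform poorly at task progress estimation, whereas the trained ProgressLM-3B achieves consistent gains even at small scale and generalizes to tasks disjoint from its training set.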

Comments ACL 2026 Camera Ready Version

Abstract

Estimating task progress requires reasoning over long-horizon dynamics rather than recognizing static visual content. While modern Vision-Language Models (VLMs) excel at describing what is visible, it remains unclear whether they can infer how far a task has progressed from partial observations. To this end, we introduce Progress-Bench, a benchmark for systematically evaluating progress reasoning in VLMs. Beyond benchmarking, we further explore a human-inspired two-stage progress reasoning paradigm through both training-free prompting and a training-based approach built on the curated dataset ProgressLM-45K. Experiments on 14 VLMs show that most models are not yet ready for task progress estimation, exhibiting sensitivity to demonstration modality and viewpoint changes, as well as poor handling of unanswerable cases. While training-free prompting that enforces structured progress reasoning yields limited and model-dependent gains, the training-based ProgressLM-3B achieves consistent improvements even at a small model scale, despite being trained on a task set fully disjoint from the evaluation tasks. Further analyses reveal characteristic error patterns and clarify when and why progress reasoning succeeds or fails. Website: https://progresslm.github.io/ProgressLM/

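2601.03260 2026-05-25 cs.CE cs.CL (updated version)

SciNet: Evaluating AI Agents in Relation-Aware Scientific Literature Retrieval

Chenyang Shao, Fengli Xu, Yong Li

Affiliations: Department of Electronic Engineering, BNRist, Tsinghua University, Beijing, China; Zhongguancun Academy

AI summary: This study proposes SciNet, the first relation-aware dataset for scientific literature retrieval, to address the difficulty existing AI retrieval agents have in understanding complex relations among scientific papers. Built on a meta-database of 269 million papers, SciNet contains 8,940 carefully designed tasks spanning several levels of relational understanding, from retrieving individual papers to reconstructing paths of scientific evolution. Experiments show that existing retrieval agents often score below 20% accuracy on relation-aware tasks, while agents equipped with SciNet improve literature review quality by 25.3%, which highlights the value of relation-aware retrieval for deeper scientific insight.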

Abstract

AI agents have seen widespread adoption in information retrieval for scientific research, giving rise to tools such as Deep Research. However, existing retrieval agents mainly rely on keyword- or embedding-based methods. While effective at capturing content-level similarities, they struggle to understand complex relational networks among scientific papers, such as identifying corroborating or conflicting studies and tracing technological lineages. This fundamental limitation often results in fragmented knowledge structures, misinterpreted research sentiment, and ineffective modeling of collective scientific progress. To address this limitation, we introduce SciNet, the first Scientific Network relation-aware dataset for information retrieval agents. Built on a meta-database of 269 million papers across 7 disciplines and containing 8,940 carefully designed tasks, SciNet systematically captures three levels of relational understanding: ego-centric retrieval of papers with novel knowledge structures, pairwise identification of scholarly relationships, and path-wise reconstruction of scientific evolution. Extensive evaluation of three categories of retrieval agents shows that their accuracy on relation-aware tasks often falls below 20%, highlighting a fundamental shortcoming of current retrieval paradigms. Importantly, in a downstream literature review application, agents empowered with SciNet achieve a 25.3% improvement in review quality, highlighting the critical value of relation-aware retrieval for deepening scientific insights. We publicly release SciNet at https://github.com/tsinghua-fib-lab/SciNet to support future research.

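2512.20298 2026-05-25 cs.CL cs.AI cs.CY cs.HC (updated version)

Efficient and Transferable Agentic Knowledge Graph Retrieval-Augmented Generation via Reinforcement Learning

Junhong Lin, Shicheng Liu, Jinyeop Song, Song Wang, Julian Shun, Yada Zhu

Affiliations: MIT CSAIL; University of Virginia; IBM Research

AI summary: This study proposes KG-R1, an efficient and transferable agentic framework for knowledge-graph retrieval-augmented generation (KG-RAG) trained with reinforcement learning. It targets the high inference cost and graph-schema dependence caused by the fixed pipelines of existing KG-RAG systems. In KG-R1, a single agent interacts with the knowledge graph as its environment and learns retrieval, reasoning, and generation as one unified process, improving answer accuracy while generating fewer tokens. Experiments on several KGQA benchmarks show strong efficiency and cross-graph transfer: the trained agent keeps its accuracy on unseen knowledge graphs without retraining.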

Abstract

Knowledge-graph retrieval-augmented generation (KG-RAG) couples large language models (LLMs) with structured, verifiable knowledge graphs (KGs) to reduce hallucination and provide reasoning traces. However, current KG-RAG systems often rely on fixed pipelines of multiple LLM modules (e.g., planning, reasoning, and responding), which inflate inference costs and tie performance to specific graph schemas. To address this, we introduce KG-R1, an agentic framework that optimizes KG-RAG through reinforcement learning (RL). Unlike modular workflows, KG-R1 uses a single agent that interacts with KGs as its environment, learning to retrieve information at each step and incorporating it into its reasoning and generation in a unified process. Across Knowledge-Graph Question Answering (KGQA) benchmarks, KG-R1 demonstrates both efficiency and transferability: using Qwen 2.5-3B, KG-R1 improves answer accuracy with fewer generation tokens than prior multi-module workflow methods that use much larger foundation or fine-tuned models. Furthermore, KG-R1 exhibits strong plug-and-play capability: after training, it maintains accuracy on unseen KGs without retraining. These properties make KG-R1 a promising KG-RAG framework for real-world deployment. Our code is publicly available at github.com/junhongmit/KG-R1/.

2508.10016 2026-05-25 cs.CL (updated version)

Training-Free Multimodal Large Language Model Orchestration

Tianyu Xie, Yuexiao Ma, Yuhang Wu, Wang Chen, Jiayi Ji, Tat-Seng Chua, Xiawu Zheng, Rongrong Ji

Affiliations: Media Analytics and Computing Lab, Xiamen University, Xiamen, China; Institute of Artificial Intelligence, Xiamen University, Xiamen, China; School of Informatics, Xiamen University, Xiamen, China; Peng Cheng Laboratory, Shenzhen, China; Department of Computer Science, National University of Singapore, Singapore

AI summary: This paper proposes LLM Orchestration, a training-free orchestration framework for multimodal large language models that avoids the heavy data and compute costs of end-to-end alignment when building interactive multimodal assistants. It integrates off-the-shelf modality experts into a unified multimodal input-output system without additional training and has three core components: an intent-inferring controller, a cross-modal memory module, and a unified interaction layer. Experiments show strong results on multimodal benchmarks with low orchestration overhead and modular upgradeability.

Journal ref: ICML 2026

Abstract

Building interactive omni-modal assistants often relies on end-to-end multimodal alignment to fuse heterogeneous modalities, which incurs substantial data and compute costs and limits extensibility. We present Training-Free Large Language Model Orchestration (LLM Orchestration), a training-free orchestration framework that integrates off-the-shelf modality experts into a unified multimodal input--output system without additional gradient-based training for integration. LLM Orchestration comprises three components: (1) an LLM controller that infers user intent and emits explicit control tokens for expert selection and sequencing, enabling protocol-constrained and auditable routing; (2) a text-centric cross-modal memory that compresses multimodal evidence into structured records for lightweight retrieval and reuse, reducing redundant expert invocations across turns; and (3) a unified interaction layer that executes routing and memory decisions to support consistent modality transitions, full-duplex streaming, and interruption-aware dialogue. Across diverse multimodal benchmarks, LLM Orchestration achieves strong performance under standard evaluation constraints while maintaining low orchestration overhead and modular upgradeability, providing a practical alternative to costly joint training for omni-modal systems.

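2508.07849 2026-05-25 cs.CL (updated version)

Evaluating Customized vs. Generalist Transformer-based Models for Legal Contract Classification

Amrita Singh, H. Suhan Karaca, Aditya Joshi, Hye-young Paik, Jiaojiao Jiang

Affiliations: School of Computer Science and Engineering, University of New South Wales (UNSW), Sydney

AI summary: Despite progress in legal NLP, there has been no comprehensive evaluation of Transformer models customized for legal tasks on contract classification. This paper evaluates 13 legal-specific Transformer models on three English-language contract classification tasks and compares them with 9 generalist models. Legal-specific models perform better on tasks requiring nuanced legal understanding and reduce misclassification of rare classes in imbalanced data. Legal-BERT and Contracts-BERT set new state-of-the-art results on two of the tasks despite having 69% fewer parameters than the best generalist models, underscoring the value of domain customization for legal applications.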

Comments Accepted to Customizable NLP at ACL 2026

Abstract

Despite advances in legal NLP, no comprehensive evaluation of Transformer-based models customized for legal tasks (referred to as "legal-specific" models in this paper) exists for contract classification tasks. To address this gap, we present an evaluation of 13 legal-specific Transformer-based models on 3 English-language contract classification tasks and compare them with 9 generalist models. The results show that legal-specific models consistently outperform generalist models, especially on tasks requiring nuanced legal understanding. They also help reduce misclassification of rare classes in imbalanced datasets. Legal-BERT and Contracts-BERT establish new SOTAs on two of the three tasks, despite having 69% fewer parameters than the best-performing generalist models. We also identify CaseLaw-BERT and LexLM as strong additional baselines for contract classification. Our results highlight the shortcomings of generalist models, emphasizing the need for domain-specific customization, particularly in the context of legal applications.

2506.03530 2026-05-25 cs.MM cs.CL cs.CV (updated version)

How Far Are We from Generating Missing Modalities with Foundation Models?

Guanzhou Ke, Bo Wang, Guoqing Chao, Weiming Hu, Shengfeng He

Affiliations: Institute of Data Science and Intelligent Decision Support, Beijing Jiaotong University; School of Computing and Information Systems, Singapore Management University; State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Computer Science and Technology, Harbin Institute of Technology

AI summary: This study examines the potential and limits of foundation models for generating missing modalities. It formalizes three paradigms for missing-modality reconstruction and systematically evaluates 42 model variants. Current foundation models fall short in fine-grained semantic extraction and in robust validation of generated modalities, which leads to suboptimal generations. The authors propose an agentic framework with dynamic modality-aware mining strategies and a self-refinement mechanism; experiments show FID reduced by at least 14% for missing-image reconstruction and MER reduced by at least 10% for missing-text reconstruction.

Comments T-PAMI

Abstract

Multimodal foundation models have demonstrated impressive capabilities across diverse tasks. However, their potential as plug-and-play solutions for missing modality reconstruction remains underexplored. To bridge this gap, we identify and formalize three potential paradigms for missing modality reconstruction, and perform a comprehensive evaluation across these paradigms, covering 42 model variants in terms of reconstruction accuracy and adaptability to downstream tasks. Our analysis reveals that current foundation models often fall short in two critical aspects: (i) fine-grained semantic extraction from the available modalities, and (ii) robust validation of generated modalities. These limitations lead to suboptimal and, at times, misaligned generations. To address these challenges, we propose an agentic framework tailored for missing modality reconstruction. This framework dynamically formulates modality-aware mining strategies based on the input context, facilitating the extraction of richer and more discriminative semantic features. In addition, we introduce a self-refinement mechanism, which iteratively verifies and enhances the quality of generated modalities through internal feedback. Experimental results show that our method reduces FID for missing image reconstruction by at least 14% and MER for missing text reconstruction by at least 10% compared to baselines. Code is released at: https://github.com/Guanzhou-Ke/AFM2.

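2505.17015 2026-05-25 cs.CV cs.CL (updated version)

Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

Runsen Xu, Weiyao Wang, Hao Tang, Xingyu Chen, Xiaodong Wang, Fu-Jen Chu, Matt Feiszli, Kevin J. Liang

Affiliations: FAIR, Meta; The Chinese University of Hong Kong

AI summary: This paper proposes Multi-SpatialMLLM, a framework that strengthens multi-frame spatial understanding in multi-modal large language models. It integrates basic spatial skills such as depth perception, visual correspondence, and dynamic perception, and builds the MultiSPA dataset of more than 27 million samples. Experiments show that Multi-SpatialMLLM outperforms baselines and proprietary systems on a range of spatial tasks, shows multi-task benefits and generalization in challenging scenarios, and can serve as a multi-frame reward annotator for robotics.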

Comments CVPR 2026 Camera Ready. 27 pages. Project page: https://runsenxu.com/projects/Multi-SpatialMLLM

Abstract

Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for physical-world applications that require multi-frame reasoning. In this paper, we propose a framework to equip MLLMs with multi-frame spatial understanding by integrating fundamental spatial skills, including depth perception, visual correspondence, and dynamic perception. We design a novel data pipeline and collect the MultiSPA dataset of more than 27 million samples spanning diverse 3D and 4D scenes to enable training. Alongside MultiSPA, we introduce a comprehensive benchmark that tests a wide spectrum of spatial tasks under uniform metrics. Our resulting model, Multi-SpatialMLLM, achieves significant gains over baselines and proprietary systems, demonstrating scalable and generalizable multi-frame perception. We further observe multi-task benefits and emergent spatial capabilities in challenging scenarios, and showcase how our model can serve as a multi-frame reward annotator for robotics.

2505.13893 2026-05-25 cs.CL (updated version)

InfiGFusion: Graph-on-Logits Distillation via Efficient Gromov-Wasserstein for Model Fusion

Yuanyi Wang, Zhaoyi Yan, Yiming Zhang, Qi Zhou, Yanggan Gu, Fei Wu, Hongxia Yang

Affiliations: The Hong Kong Polytechnic University; Zhejiang University; Daya Bay Technology and Innovation Research Institute

AI summary: This paper proposes InfiGFusion, a model fusion framework that combines the strengths of heterogeneous open-source models in a unified system. Its core is the Graph-on-Logits Distillation (GLD) loss, which explicitly models semantic dependencies across vocabulary dimensions, paired with an efficient approximation of the Gromov-Wasserstein distance. Experiments show that InfiGFusion outperforms existing methods on multiple benchmarks, especially on complex reasoning tasks.

Abstract

Recent advances in large language models (LLMs) have intensified efforts to fuse heterogeneous open-source models into a unified system that inherits their complementary strengths. Existing logit-based fusion methods maintain inference efficiency but treat vocabulary dimensions independently, overlooking semantic dependencies encoded by cross-dimension interactions. These dependencies reflect how token types interact under a model's internal reasoning and are essential for aligning models with diverse generation behaviors. To explicitly model these dependencies, we propose InfiGFusion, the first structure-aware fusion framework with a novel Graph-on-Logits Distillation (GLD) loss. Specifically, we retain the top-$k$ logits per output and aggregate their outer products across sequence positions to form a global co-activation graph, where nodes represent vocabulary channels and edges quantify their joint activations. To ensure scalability and efficiency, we design a sorting-based closed-form approximation that reduces the original $O(n^4)$ cost of Gromov-Wasserstein distance to $O(n \log n)$, with provable approximation guarantees. Experiments across multiple fusion settings show that GLD consistently improves fusion quality and stability. InfiGFusion outperforms SOTA models and fusion baselines across 11 benchmarks spanning reasoning, coding, and mathematics. It shows particular strength in complex reasoning tasks, with +35.6 improvement on Multistep Arithmetic and +37.06 on Causal Judgement over SFT, demonstrating superior multi-step and relational inference.
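
The graph construction described in the abstract (keep the top-$k$ logits at each position, then sum their outer products over positions) might be sketched roughly as below. This is an assumption-based toy, not the InfiGFusion implementation: the name `coactivation_graph`, the use of raw rather than normalized logits, and the sparse dictionary representation are all illustrative choices, and the Gromov-Wasserstein step is omitted.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def coactivation_graph(logits: List[List[float]], k: int) -> Dict[Tuple[int, int], float]:
    """Sum, over sequence positions, the outer product of each position's top-k logits.

    Returns a sparse symmetric edge map {(i, j): weight} over vocabulary indices.
    """
    graph: Dict[Tuple[int, int], float] = defaultdict(float)
    for row in logits:
        # Indices of the k largest logits at this position; the rest count as zero.
        top = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        for i in top:
            for j in top:
                graph[(i, j)] += row[i] * row[j]
    return dict(graph)


# Two sequence positions over a toy 4-token vocabulary (made-up numbers).
toy_logits = [[2.0, 1.0, 0.0, -1.0], [0.5, 3.0, 1.0, 0.0]]
g = coactivation_graph(toy_logits, k=2)
print(g[(0, 1)], g[(1, 1)])  # 2.0 10.0
```

Token 1 is in the top-2 at both positions, so its self-edge accumulates 1.0 + 9.0, while the edge between tokens 0 and 1 only receives the first position's contribution.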

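2504.01542 2026-05-25 cs.CL (updated version)

Register Always Matters: Analysis of LLM Pretraining Data Through the Lens of Language Variation

Amanda Myntti, Erik Henriksson, Veronika Laippala, Sampo Pyysalo

Affiliations: University of Turku

AI summary: This paper studies how linguistic variation, in the form of register or genre, affects the quality of LLM pretraining data and model performance. The authors are the first to apply register classification, a standard from corpus linguistics, to pretraining data curation, and find that the register of the training text substantially affects model performance. Pretraining on News text degrades performance, while Opinion text (such as reviews and blogs) helps, and combining several well-performing registers improves results further. Register is thus an important explanatory factor for model variation and can support more deliberate data selection.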

Abstract

Pretraining data curation is a cornerstone in Large Language Model (LLM) development, leading to growing research on quality filtering of large web corpora. From statistical quality flags to LLM-based labelling systems, datasets are divided into categories, frequently reducing to a binary: those passing the filters are deemed as valuable examples, others are discarded as useless or detrimental. However, a more detailed understanding of the contribution of different kinds of texts to model performance is still largely lacking. In this article, we present the first study utilising registers or genres - a widely used standard in corpus linguistics to model linguistic variation - to curate pretraining datasets and investigate the effect of register on the performance of LLMs. We train small generative models with register classified data and evaluate them using standard benchmarks, and show that the register of pretraining data substantially affects model performance. We uncover surprising relationships between the pretraining material and the resulting models: using the News register results in subpar performance, and on the contrary, including the Opinion class, covering texts such as reviews and opinion blogs, is highly beneficial. While a model trained on the entire unfiltered dataset outperforms those trained on datasets limited to a single register, combining well-performing registers like How-to-Instructions, Informational Description, and Opinion leads to major improvements. Furthermore, analysis of individual benchmark results reveals key differences in the strengths and drawbacks of specific register classes as pretraining data. These findings show that register is an important explainer of model variation and can facilitate more deliberate future data selection practices.

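2407.04573 2026-05-25 cs.IR cs.CL (updated version)

Vector Retrieval with Similarity and Diversity: How Hard Is It?

Hang Gao, Dong Deng, Yongfeng Zhang

Affiliations: Rutgers University

AI summary: This paper studies how to balance similarity and diversity in dense vector retrieval. It formulates a new optimization problem, VRSD, that maximizes the similarity between the query vector and the sum of the selected candidate vectors, and proves the problem NP-complete. The authors then propose a parameter-free heuristic algorithm and validate it on multiple datasets, where it outperforms established methods such as MMR and k-DPP.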

Abstract

Dense vector retrieval is an important building block of modern machine learning systems, underlying applications ranging from semantic search to retrieval-augmented generation and knowledge-intensive reasoning. Beyond retrieving items that are individually similar to a query, many applications require a set of results that is also diverse, complementary, and collectively informative. Balancing similarity and diversity is therefore central to effective retrieval, but remains challenging to optimize in a stable and theoretically grounded way. Maximal Marginal Relevance (MMR) is a widely adopted heuristic for this problem, yet its reliance on a manually tuned parameter leads to optimization fluctuations and unpredictable retrieval results. More broadly, existing methods provide limited theoretical insight into how similarity and diversity interact in dense vector spaces, leaving the joint optimization problem insufficiently understood. To address these challenges, this paper introduces a novel approach that characterizes both constraints simultaneously by maximizing the similarity between the query vector and the sum of the selected candidate vectors. We formally define this optimization problem, Vector Retrieval with Similarity and Diversity (VRSD), and prove that it is NP-complete, establishing a rigorous theoretical bound on the inherent difficulty of this dual-objective retrieval. Subsequently, we present a parameter-free heuristic algorithm to solve VRSD. Extensive evaluations on multiple datasets, incorporating both objective geometric metrics and LLM-simulated subjective assessments, demonstrate that our VRSD heuristic consistently outperforms established baselines, including MMR and Determinantal Point Processes (k-DPP).
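
To make the VRSD objective concrete, the sketch below greedily selects candidates so that the cosine similarity between the query and the running sum of selected vectors stays as high as possible. This is a simple greedy baseline written for illustration under stated assumptions; it is not claimed to be the paper's parameter-free heuristic, and the names `cosine` and `greedy_vrsd` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def greedy_vrsd(query: Vector, candidates: List[Vector], k: int) -> List[int]:
    """Pick up to k candidate indices, each step maximizing cosine(query, sum of picks)."""
    chosen: List[int] = []
    total = [0.0] * len(query)
    for _ in range(min(k, len(candidates))):
        best_idx, best_score = -1, -math.inf
        for idx, cand in enumerate(candidates):
            if idx in chosen:
                continue
            score = cosine(query, [t + c for t, c in zip(total, cand)])
            if score > best_score:
                best_idx, best_score = idx, score
        chosen.append(best_idx)
        total = [t + c for t, c in zip(total, candidates[best_idx])]
    return chosen


query = [1.0, 1.0]
candidates = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]
print(greedy_vrsd(query, candidates, k=2))  # [1, 2]
```

In this toy case the second pick is the complementary vector `[0.1, 1.0]` rather than the near-duplicate `[1.0, 0.1]`, because adding it pulls the sum closer to the query direction. Diversity emerges from the sum-based objective with no trade-off parameter of the kind MMR requires.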

2605.22939 2026-05-25 cs.CL cs.LG (updated version)

Learnability-Informed Fine-Tuning of Diffusion Language Models

Shubham Parashar, Atharv Chagi, Jacob Helwig, Lakshmi Jotsna, Sushil Vemuri, James Caverlee, Dileep Kalathil, Shuiwang Ji

Affiliations: Department of Computer Science and Engineering, Texas A&M University, College Station, TX, USA; Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, USA

AI summary: This paper aims to improve the reasoning ability of diffusion language models (DLMs). It finds that standard supervised fine-tuning (SFT) is limited in DLMs because it ignores what is learned and when, which can hurt performance. The authors propose LIFT, a fine-tuning method that learns easy tokens when most of the input is masked and hard tokens when more context is available, matching training to the information available at each diffusion time step. LIFT outperforms existing SFT baselines on six reasoning benchmarks, with up to a 3x relative gain.

Abstract

We aim to improve the reasoning capabilities of diffusion language models (DLMs). While SFT is a popular post-training recipe for autoregressive models, its use in DLMs faces challenges and can even hurt performance, though the underlying causes remain understudied. Our analysis reveals that vanilla SFT overlooks learnability, namely what and when tokens are learned. Specifically, rare tokens are difficult to learn when most of the input is masked, whereas it is straightforward and thus of little value to learn common tokens when most of the input is unmasked. Motivated by our analysis, we propose LIFT, an efficient SFT-based post-training algorithm for DLMs. LIFT learns easy tokens when most of the input is masked and hard tokens when more context is available, thus aligning the training with the information available at different diffusion time steps. Our results show that LIFT outperforms existing SFT baselines across six reasoning benchmarks, achieving up to a 3x relative gain on AIME'24 and AIME'25. Our code is publicly available at https://github.com/divelab/LIFT.

2605.22937 2026-05-25 cs.CL Updated version

Transcoders Trace Visual Grounding and Hallucination in Vision-Language Models

Dimitrios Damianos, Leon Voukoutis, Georgios Skyrianos, Vassilis Katsouros, Georgios Paraskevopoulos

Affiliations: Institute of Language and Speech Processing; Athena Research Center

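AI summary: This study examines how visual inputs are transformed into text in generative vision-language models (VLMs). It proposes a function-centric interpretability framework based on Transcoders that decomposes the model's internal computational pathways and reveals the links between image patches and text generation. Compared with conventional sparse autoencoders (SAEs), the method yields stronger and more stable attribution effects in patch-ablation experiments and aligns more accurately with semantically relevant image regions. The study also uses structural analysis to examine how the model produces hallucinations, and builds a classifier on graph features to predict them.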

Abstract

Generative Vision-Language Models (VLMs) perform well on multimodal reasoning, but how visual inputs are transformed into text remains poorly understood. Existing interpretability work on VLMs uses Sparse Autoencoders (SAEs), which decompose static residual representations and miss the functional updates that drive cross-modal interaction. We adopt a function-centric framework based on Transcoders, sparse approximations of MLP sublayers that act as a causal proxy for layer-wise computation. Applied to Gemma 3-4B-IT, the framework decomposes the model into interpretable computational pathways linking image patches to directions in token generation. Transcoder attributions produce stronger and more stable effects on visually grounded tokens under patch ablation than SAE attributions, and align better with semantically relevant image regions. A False Visual Grounding counterfactual analysis confirms that the recovered pathways are specific to vision-language interaction. Finally, we perform a structural analysis of hallucinated generations by extracting graph-based indicators from circuit traces produced by the transcoders. A logistic classifier over these mechanistic graph features predicts hallucinations with an AUC of 0.68. These results show that function-centric circuit decomposition yields interpretable and predictive accounts of multimodal computation in VLMs.

2605.22885 2026-05-25 cs.AI cs.CL cs.LG cs.LO Updated version

Freeze Deep Layers, Train Shallow Layers: Interpretable Layer Allocation for Continued Pre-training

Yu-Hang Wu, Qin-Yuan Liu, Qiu-Yang Zhao, Bo Jiang, Jiang-Feng Yang, Qing-Wei Cong

Affiliations: Nanhu Research Institute of China Electronic Science and Technology; School of Electronic and Electrical Engineering, Shanghai University of Engineering Science

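AI summary: This paper studies low-cost continued pre-training of large language models and proposes an interpretable layer allocation strategy. Using LayerTracer, an architecture-agnostic diagnostic framework, it reveals how layer-wise representations and stability evolve, and finds that deep layers play a key role in task execution and are more robust. Experiments show that keeping deep layers frozen while training shallow layers outperforms full-parameter fine-tuning and other allocations during continued pre-training, with better results on multiple benchmarks, giving resource-constrained teams low-cost, interpretable guidance on parameter allocation.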

Abstract

Selective layer-wise updates are essential for low-cost continued pre-training of Large Language Models (LLMs), yet determining which layers to freeze or train remains an empirical black-box problem due to the lack of interpretable guidance. To address this issue, we propose LayerTracer, an architecture-agnostic diagnostic framework that reveals the evolution patterns of layer-wise representations and stability by locating task execution positions and quantifying layer sensitivity. Analysis results reveal that deep layers act as critical regions for task execution and maintain high stability against disruptive updates. Guided by this finding, we conduct three controlled continued pre-training trials to compare diverse freeze-train strategies, demonstrating that training shallow layers while freezing deep layers consistently outperforms full-parameter fine-tuning and the opposite allocation on both C-Eval and CMMLU benchmarks. We further present a hybrid model case study, which validates that placing high-quality pre-trained modules in deep layers effectively preserves inherent knowledge of the model. This work delivers a low-cost and interpretable solution for resource-constrained teams, offering actionable guidance for layer-wise parameter allocation in continued pre-training and hybrid model construction.
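
As a rough illustration of the freeze-train allocation described in the abstract, the Python sketch below marks shallow (low-index) layers as trainable and deep layers as frozen. The function name `allocate_layers`, the layer count, and the split point are hypothetical choices for illustration; the abstract does not say how many layers are trained, and this is not the authors' LayerTracer code.

```python
from typing import Dict


def allocate_layers(num_layers: int, num_trainable_shallow: int) -> Dict[int, bool]:
    """Return a map from layer index to a trainable flag.

    Layers 0 .. num_trainable_shallow-1 (the shallow ones) are trainable;
    all deeper layers are frozen. Both arguments are illustrative assumptions.
    """
    if not 0 <= num_trainable_shallow <= num_layers:
        raise ValueError("num_trainable_shallow must be between 0 and num_layers")
    return {i: i < num_trainable_shallow for i in range(num_layers)}


# Hypothetical 8-layer model in which the 3 shallowest layers are trained.
plan = allocate_layers(num_layers=8, num_trainable_shallow=3)
trainable = [i for i, flag in plan.items() if flag]
frozen = [i for i, flag in plan.items() if not flag]
print("trainable:", trainable)
print("frozen:", frozen)
```

In a real training setup, such a plan would typically be applied by disabling gradient updates for the parameters of the frozen layers; how that is done depends on the framework in use.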

2601.14652 2026-05-25 cs.AI cs.CL cs.MA Updated version

MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks

Zixuan Ke, Yifei Ming, Austin Xu, Ryan Chin, Xuan-Phi Nguyen, Prathyusha Jwalapuram, Jiayu Wang, Semih Yavuz, Caiming Xiong, Shafiq Joty

Affiliations: Salesforce Research; University of Wisconsin-Madison; Massachusetts Institute of Technology

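AI summary: This paper proposes MAS-Orchestra, a training framework for understanding and improving the reasoning of multi-agent systems (MAS) through holistic orchestration and controlled benchmarks. The method models MAS orchestration as a function-calling reinforcement learning problem, generates a complete MAS in one pass, and abstracts subagents as callable functions to enable global reasoning over system structure. The study also introduces the MASBENCH benchmark, which characterizes tasks along five dimensions and shows that the benefits of MAS depend on task structure, verification mechanisms, and agent capabilities rather than holding universally. Experiments show that MAS-Orchestra achieves significant improvements on multiple benchmarks, with efficiency more than 10x that of existing methods.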

Comments ICML 2026

Abstract

While multi-agent systems (MAS) promise elevated intelligence through coordination of agents, current approaches to automatic MAS design under-deliver. Such shortcomings stem from two key factors: (1) methodological complexity - agent orchestration is performed using sequential, code-level execution that limits global system-level holistic reasoning and scales poorly with agent complexity - and (2) efficacy uncertainty - MAS are deployed without understanding whether there are tangible benefits compared to single-agent systems (SAS). We propose MAS-Orchestra, a training-time framework that formulates MAS orchestration as a function-calling reinforcement learning problem with holistic orchestration, generating an entire MAS at once. In MAS-Orchestra, complex, goal-oriented subagents are abstracted as callable functions, enabling global reasoning over system structure while hiding internal execution details. To rigorously study when and why MAS are beneficial, we introduce MASBENCH, a controlled benchmark that characterizes tasks along five axes: Depth, Horizon, Breadth, Parallel, and Robustness. Our analysis reveals that MAS gains depend critically on task structure, verification protocols, and the capabilities of both orchestrator and subagents, rather than holding universally. Guided by these insights, MAS-Orchestra achieves consistent improvements on public benchmarks including mathematical reasoning, multi-hop QA, and search-based QA, while achieving more than 10x efficiency over strong baselines. Together, MAS-Orchestra and MASBENCH enable better training and understanding of MAS in the pursuit of multi-agent intelligence.

2311.01468 2026-05-25 cs.CL cs.LG Updated version

Remember what you did so you know what to do next

Manuel R. Ciosici, Alex Hedges, Yash Kankanampati, Justin Martin, Marjorie Freedman, Ralph Weischedel

Affiliations: Information Sciences Institute, University of Southern California

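AI summary: This paper studies using a moderately sized large language model (GPT-J, 6 billion parameters) to plan for a simulated robot on the ScienceWorld platform in order to achieve 30 classes of science-experiment goals. Experiments show that, when given more prior steps as history, the model substantially outperforms a reinforcement learning-based approach, by up to 3.5x. The study also notes that performance varies widely across task classes, so averages can hide specific problems, and shows a 2.2x improvement even when using only 6.5% of the training data.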

Comments Identical to EMNLP 2023 Findings

DOI: 10.18653/v1/2023.findings-emnlp.104

Abstract

We explore using a moderately sized large language model (GPT-J, 6B parameters) to create a plan for a simulated robot to achieve 30 classes of goals in ScienceWorld, a text game simulator for elementary science experiments. Previously published empirical work claimed that large language models (LLMs) are a poor fit (Wang et al., 2022) compared to reinforcement learning. Using the Markov assumption (a single previous step), the LLM outperforms the reinforcement learning-based approach by a factor of 1.4. When we fill the LLM's input buffer with as many prior steps as possible, improvement rises to 3.5x. Even when training on only 6.5% of the training data, we observe a 2.2x improvement over the reinforcement learning-based approach. Our experiments show that performance varies widely across the 30 classes of actions, indicating that averaging over tasks can hide significant performance issues. In work contemporaneous with ours, Lin et al. (2023) demonstrated a two-part approach (SwiftSage) that uses a small LLM (T5-large) complemented by OpenAI's massive LLMs to achieve outstanding results in ScienceWorld. Our 6B-parameter, single-stage GPT-J matches the performance of SwiftSage's two-stage architecture when it incorporates GPT-3.5 turbo, which has 29 times more parameters than GPT-J.
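
The difference between the Markov setting (one previous step) and filling the input buffer with as many prior steps as possible can be sketched in a few lines of Python. This is only an illustrative sketch: the function `build_prompt`, the prompt layout, and the whitespace-based token count are assumptions made here, not the paper's actual prompt format or tokenizer.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def build_prompt(goal: str, history: List[str], observation: str, max_tokens: int) -> str:
    """Pack the most recent prior steps into the prompt until the budget is used up."""
    head = f"Goal: {goal}"
    tail = [f"Observation: {observation}", "Action:"]
    budget = max_tokens - count_tokens(head) - sum(count_tokens(t) for t in tail)

    kept: List[str] = []
    for step in reversed(history):  # walk backwards from the newest step
        cost = count_tokens(step)
        if cost > budget:
            break  # the next-older step no longer fits
        kept.append(step)
        budget -= cost
    kept.reverse()  # restore chronological order
    return "\n".join([head, *kept, *tail])


history = ["open door to kitchen", "go to kitchen", "pick up pot"]

# Full-buffer variant: keep as many recent steps as the (hypothetical) budget allows.
prompt = build_prompt("boil water", history, "You are in the kitchen.", max_tokens=16)

# Markov variant: pass only the single most recent step.
markov_prompt = build_prompt("boil water", history[-1:], "You are in the kitchen.", max_tokens=16)
print(prompt)
```

With this toy budget the oldest step is dropped while the two most recent steps are kept; a real system would count tokens with the model's own tokenizer.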

2110.01552 2026-05-25 cs.CL cs.AI cs.LG Updated version

Perhaps PTLMs Should Go to School -- A Task to Assess Open Book and Closed Book QA

Manuel R. Ciosici, Joe Cecil, Alex Hedges, Dong-Ho Lee, Marjorie Freedman, Ralph Weischedel

Affiliations: Information Sciences Institute, University of Southern California

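AI summary: This paper proposes a new task for evaluating the question-answering ability of pre-trained language models (PTLMs) in open-book and closed-book settings, using college textbooks in the social sciences and humanities as instructional material. Using true/false statements based on the textbook content and several rounds of testing, the study finds that PTLMs perform poorly in the closed-book setting, suggesting that they may not have truly understood the textbooks; in the open-book setting, where the model may retrieve relevant paragraphs to answer, performance is better (about 60%). The task provides a new benchmark for assessing how well PTLMs understand complex text.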

Comments Identical to the EMNLP 2021 version

DOI: 10.18653/v1/2021.emnlp-main.493

Abstract

Our goal is to deliver a new task and leaderboard to stimulate research on question answering and pre-trained language models (PTLMs) to understand a significant instructional document, e.g., an introductory college textbook or a manual. PTLMs have shown great success in many question-answering tasks, given significant supervised training, but much less so in zero-shot settings. We propose a new task that includes two college-level introductory texts in the social sciences (American Government 2e) and humanities (U.S. History), hundreds of true/false statements based on review questions written by the textbook authors, validation/development tests based on the first eight chapters of the textbooks, blind tests based on the remaining textbook chapters, and baseline results given state-of-the-art PTLMs. Since the questions are balanced, random performance should be ~50%. T5 fine-tuned with BoolQ achieves the same performance, suggesting that the textbook's content is not pre-represented in the PTLM. Taking the exam closed-book, but having read the textbook (i.e., adding the textbook to T5's pre-training), yields at best minor improvement (56%), suggesting that the PTLM may not have "understood" the textbook (or perhaps misunderstood the questions). Performance is better (~60%) when the exam is taken open-book (i.e., allowing the machine to automatically retrieve a paragraph and use it to answer the question).

2101.05400 2026-05-25 cs.CL cs.AI cs.LG Updated version

Machine-Assisted Script Curation

Manuel R. Ciosici, Joseph Cummings, Mitchell DeHaven, Alex Hedges, Yash Kankanampati, Dong-Ho Lee, Ralph Weischedel, Marjorie Freedman

Affiliations: Information Sciences Institute, University of Southern California

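AI summary: This paper presents MASC, a system for human-machine collaborative script authoring. The system automatically generates event types, links them to Wikidata, suggests sub-events that may have been overlooked, and records the entities that participate in multiple sub-events along with their temporal order, helping users write structurally complex event scripts efficiently. The study demonstrates MASC on real-world cases and confirms its practical value for script authoring.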

Comments Identical to the NAACL 2021 Demo version