arXivDaily arXiv每日学术速递 周一至周五更新

AI 大模型

语言大模型 / LLM

大语言模型、预训练、指令微调、后训练和语言模型应用。

今日/当前日期收录 64 信号源:cs.CL, cs.AI, cs.LG
2606.20493 2026-06-19 cs.LG cs.AI cs.MA 新提交 85%

Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems

传染网络:多智能体LLM系统中的评估者偏见传播

Zewen Liu

发表机构 * Qilu Institute of Technology, School of Software Engineering(齐鲁理工学院软件工程学院)

专题命中 其他LLM :研究多智能体LLM系统中评估者偏见传播

AI总结 提出传染网络框架,量化评估者偏见在多智能体LLM系统中的传播,发现同模型智能体间偏见传播系数为0.157-0.352,且增大评估委员会规模可减少72.4%的传播效应。

Comments 20 pages, 4 figures, 4 tables

详情
AI中文摘要

当大型语言模型在多智能体系统中担任评估者时,其系统性评估偏见会通过智能体网络传播。我们引入传染网络,这是一个用于衡量评估者偏见如何在交互的LLM智能体间传播的正式框架。在使用DeepSeek-chat进行的受控3智能体实验中,我们采用了三种不同的评估者偏见配置文件(结构化、平衡、基于证据),测量了跨智能体传染矩阵Gamma_3,并发现评估者偏见始终在智能体间传播(gamma在[0.157, 0.352]范围内),即使是在相同底层模型内也是如此。我们识别出由谱半径rho(Gamma_N)控制的三种传播机制,并证明同质模型智能体产生的传染系数比先前工作中观察到的跨模型系数弱3-5倍(MM-EPC: gamma约0.85-1.3),使其处于抑制机制中。我们表明,将评估委员会规模从k=1增加到k=3可将有效传染减少72.4%,提供了一种可行的缓解策略。我们发布了开源的传染网络实验框架。

英文摘要

When large language models serve as evaluators in multi-agent systems, their systematic evaluation biases propagate through the agent network. We introduce Contagion Networks, a formal framework for measuring how evaluator biases spread across interacting LLM agents. In a controlled 3-agent experiment using DeepSeek-chat with three distinct evaluator bias profiles (structured, balanced, evidence-based), we measure the Cross-Agent Contagion Matrix Gamma_3 and find that evaluator biases consistently propagate between agents (gamma in [0.157, 0.352]), even within the same underlying model. We identify three propagation regimes governed by the spectral radius rho(Gamma_N), and demonstrate that homogeneous-model agents produce contagion coefficients 3-5x weaker than cross-model coefficients observed in prior work (MM-EPC: gamma approx 0.85-1.3), placing them in the suppression regime. We show that increasing evaluator committee size from k=1 to k=3 reduces effective contagion by 72.4%, providing an actionable mitigation strategy. We release the open-source Contagion Network experimental framework.

2606.19746 2026-06-19 cs.DC 新提交 85%

SAC: Disaggregated KV Cache System for Sparse Attention LLMs with CXL

SAC: 面向稀疏注意力LLM的基于CXL的解耦KV缓存系统

Ruiyang Ma, Teng Ma, Junru Li, Hantian Zha, Xuchun Shang, Qingda Hu, Zheng Liu, Xinjun Yang, Tao Ma, Guojie Luo

专题命中 其他LLM :提出面向稀疏注意力LLM的解耦KV缓存系统SAC,优化长上下文推理性能。

AI总结 针对稀疏注意力模型在长上下文推理中全量KV缓存传输导致的瓶颈,提出基于CXL按需获取top-k KV条目的解耦缓存系统SAC,相比RDMA方案吞吐提升2.1倍、TTFT降低9.7倍。

详情
AI中文摘要

LLM向长上下文推理的扩展将主要服务系统瓶颈从计算转移到内存容量。传统针对密集注意力模型的解决方案依赖基于RDMA的解耦内存池,在解码前从远程存储粗粒度地获取整个前缀KV缓存到本地内存。然而,这种方法对于新兴的稀疏注意力模型本质上是低效的。尽管解码过程中只有一小部分KV条目是活跃的,这些系统仍然将完整的KV缓存获取到本地,导致严重的传输瓶颈和本地内存浪费。为了解决这个问题,我们提出了SAC,第一个针对稀疏注意力模型优化的高效解耦KV缓存系统。通过利用Compute Express Link (CXL)的低延迟、缓存行粒度的加载/存储语义,SAC在推理过程中按需仅获取所需的top-k KV条目。在使用SGLang对DeepSeek-V3.2的评估中,与基于RDMA的基线相比,SAC实现了2.1倍的吞吐量提升、9.7倍的TTFT降低和1.8倍的TBT降低,确立了基于CXL的解耦作为新兴稀疏注意力模型的优越基础设施。

英文摘要

The scaling of LLMs toward long-context inference has shifted the primary serving system bottleneck from computation to memory capacity. Traditional solutions for dense attention models rely on RDMA-based disaggregated memory pools, which perform coarse-grained fetching of the entire prefix KV cache from remote storage to local memory before decoding. However, this approach is fundamentally inefficient for emerging sparse attention models. While only a small fraction of KV entries are active during decoding, these systems still fetch the full KV cache locally, leading to severe transmission bottlenecks and local memory wastage. To address this, we propose SAC, the first efficient disaggregated KV cache system optimized for sparse attention models. By leveraging the low-latency, cache-line granularity load/store semantics of Compute Express Link (CXL), SAC fetches only the required top-k KV entries on demand during inference. Evaluations on DeepSeek-V3.2 using SGLang show that SAC achieves 2.1x higher throughput, 9.7x lower TTFT, and 1.8x lower TBT compared to RDMA-based baselines, establishing CXL-based disaggregation as the superior infrastructure for emerging sparse attention models.

2606.19605 2026-06-19 cs.SE cs.AI 新提交 85%

FAPO: Fully Autonomous Prompt Optimization of Multi-Step LLM Pipelines

FAPO:多步骤LLM流水线的全自动提示优化

Paul Kassianik, Baturay Saglam, Huaibo Zhao, Blaine Nelson, Supriti Vijay, Aman Priyanshu, Amin Karbasi

发表机构 * Foundation AI–Cisco Systems Inc.(基础AI–思科系统公司) Yale University(耶鲁大学)

专题命中 其他LLM :提出FAPO框架自动优化多步LLM流水线的提示和链结构

AI总结 提出FAPO框架,通过自动诊断流水线瓶颈并迭代优化提示或链结构,在18个模型-基准比较中15次优于基线GEPA,平均提升14.1个百分点。

详情
AI中文摘要

多步骤LLM流水线因检索、推理和格式化步骤间的交互而失败,因此仅提示优化可能遗漏链中的瓶颈。我们提出FAPO(全自动提示优化),一个让Claude Code在标准化代码库内优化LLM流水线的框架。FAPO评估流水线、检查中间步骤、诊断失败、提出范围变更,并重复验证变体以针对评分函数进行优化。它首先尝试提示编辑,仅当提示优化似乎不足时,在归因识别出结构瓶颈的情况下,在允许范围内更改链结构。在六个基准和三个任务模型上,FAPO在18个模型-基准比较中的15个中击败了基线GEPA。在11个模型-基准比较中,FAPO以不重叠的均值±试验标准差范围获胜,平均FAPO-GEPA增益为+14.1个百分点。在六个HoVer和IFBench比较中,当提示优先搜索升级为结构变更时,FAPO在所有六个中获胜,平均增益为+33.8个百分点。FAPO还提高了安全任务的性能:在CTIBench-RCM(一个安全CVE到CWE任务)上,仅提示的FAPO在GPT-5上提升了+4.0个百分点的测试准确率,在Foundation-Sec-8B-Instruct上提升了+7.1个百分点,在Foundation-Sec-8B-Reasoning上提升了+2.0个百分点。这些结果使FAPO成为通用和安全任务的最先进流水线优化技术。

英文摘要

Multi-step LLM pipelines fail through interactions among retrieval, reasoning, and formatting steps, so prompt-only optimization can miss bottlenecks in the chain. We present FAPO (Fully Autonomous Prompt Optimization), a framework that lets Claude Code optimize an LLM pipeline inside a standardized codebase. FAPO evaluates a pipeline, inspects intermediate steps, diagnoses failures, proposes scoped changes, and validates variants repeatedly to optimize against a score function. It first tries prompt edits and, only when prompt optimization appears insufficient, changes chain structure within the permitted scope when attribution identifies a structural bottleneck. Across six benchmarks and three task models, FAPO beats the baseline GEPA in 15 of 18 model-benchmark comparisons. In 11 model-benchmark comparisons, FAPO wins with non-overlapping mean $\pm$ trial-standard-deviation ranges, and the mean FAPO-GEPA gain is +14.1 pp. In the six HoVer and IFBench comparisons where prompt-first search escalated to structural changes, FAPO wins all six with a mean gain of +33.8 pp. FAPO also improves performance on security tasks: on CTIBench-RCM, a security CVE-to-CWE task, prompt-only FAPO lifts test accuracy by +4.0 pp on GPT-5, +7.1 pp on Foundation-Sec-8B-Instruct, and +2.0 pp on Foundation-Sec-8B-Reasoning. These results position FAPO as a state-of-the-art pipeline optimization technique for both general-purpose and security-focused tasks.

2606.19475 2026-06-19 cs.AI cs.CL 新提交 85%

Diffusion Language Models: An Experimental Analysis

扩散语言模型:一项实验分析

Thomas Bertolani, Davide Bucciarelli, Leonardo Zini, Marcella Cornia, Lorenzo Baraldi

发表机构 * University of Modena and Reggio Emilia(摩德纳和雷焦艾米利亚大学) University of Pisa(比萨大学)

专题命中 其他LLM :系统比较扩散语言模型在多种任务上的表现。

AI总结 本文系统比较了八种扩散语言模型在推理、编码、翻译等任务上的表现,分析了去噪步数、上下文长度等推理因素对性能与效率的影响,揭示了扩散语言模型在不同任务和预算下的权衡。

详情
AI中文摘要

大型语言模型(LLMs)通过自回归生成彻底改变了语言建模,使其在广泛的任务中表现出色。最近,扩散语言模型(DLMs)作为一种替代范式出现,它通过迭代去噪而非下一个词预测来生成文本,从而允许对整个序列进行并行精炼。尽管已经提出了许多基于扩散的架构,但评估协议、数据集、推理预算和生成超参数的差异使得比较它们的能力和理解它们提供的权衡变得困难。在这项工作中,我们对现代DLMs进行了系统的实验分析。具体来说,我们评估了八种最先进的DLMs在八个基准上的表现,这些基准涵盖推理、编码、翻译、知识和结构化问题解决,同时明确考虑了生成质量和计算效率。除了下游评估,我们还分析了关键推理时间因素的影响,包括去噪步数、上下文长度、块大小和并行解掩策略,并通过在相同条件下训练的较小模型的受控比较来补充大规模实验。我们的分析突出了基于扩散的语言建模在不同任务、架构和推理预算下的优势和局限性。我们表明,DLMs的行为受到生成时间设计选择的强烈影响,导致性能和计算效率之间的不同权衡。总体而言,我们的研究为当代DLMs的能力和部署特性提供了实用见解。

英文摘要

Large Language Models (LLMs) have revolutionized language modeling through autoregressive generation, enabling strong performance across a wide range of tasks. Recently, Diffusion Language Models (DLMs) have emerged as an alternative paradigm that generates text through iterative denoising rather than next-token prediction, allowing parallel refinement of entire sequences. While numerous diffusion-based architectures have been proposed, differences in evaluation protocols, datasets, inference budgets, and generation hyperparameters make it difficult to compare their capabilities and understand the trade-offs they offer. In this work, we present a systematic experimental analysis of modern DLMs. Specifically, we evaluate eight state-of-the-art DLMs across eight benchmarks spanning reasoning, coding, translation, knowledge, and structured problem solving, while explicitly considering both generation quality and computational efficiency. Beyond downstream evaluation, we analyze the impact of key inference-time factors, including denoising steps, context length, block size, and parallel unmasking strategies, and complement large-scale experiments with controlled comparisons of smaller models trained under identical conditions. Our analysis highlights the strengths and limitations of diffusion-based language modeling across different tasks, architectures, and inference budgets. We show that the behavior of DLMs is strongly influenced by generation-time design choices, leading to distinct trade-offs between performance and computational efficiency. Overall, our study provides practical insights into the capabilities and deployment characteristics of contemporary DLMs.

2606.19351 2026-06-19 cs.CL cs.AI 新提交 85%

Detecting Hallucinations for Large Language Model-based Knowledge Graph Reasoning

基于大语言模型的知识图谱推理中的幻觉检测

Xinyan Zhu, Yaoqi Liu, Yue Gao, Huadong Ma, Cheng Yang, Chuan Shi

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Tsinghua University(清华大学)

专题命中 其他LLM :检测LLM在知识图谱推理中的幻觉

AI总结 提出LUCID方法,结合LLM注意力分数、知识图谱语义和结构信息,利用图神经网络检测LLM在知识图谱推理中的幻觉,在九个数据集上达到最优性能。

详情
AI中文摘要

知识图谱推理从现有事实中推断新知识,广泛应用于问答、推荐和决策支持。随着大语言模型(LLM)的快速发展,基于LLM的知识图谱推理框架通过利用检索到的知识图谱信息变得越来越流行。然而,LLM中的幻觉仍然是一个关键问题。即使融入了相关的知识图谱知识,模型仍可能生成错误输出,导致错误信息和不可靠的决策。现有的幻觉检测方法要么关注LLM内部状态,要么验证与检索上下文的一致性,但两者都忽略了知识图谱中的结构信息,导致性能次优。为了解决这一差距,我们提出了LUCID,这是首个针对基于LLM的知识图谱推理框架的幻觉检测方法。LUCID联合利用LLM注意力分数、知识图谱语义和结构信息。具体来说,它从注意力分数和语义相似度中提取节点和边特征,并使用图神经网络将其与知识图谱结构集成。我们还构建了人工标注的基准数据集用于评估。在九个数据集上的实验表明,与15个基线相比,LUCID达到了最先进的性能。

英文摘要

Knowledge graph (KG) reasoning infers new knowledge from existing facts and is widely applied in question answering, recommendation, and decision support. With the rapid development of large language models (LLMs), LLM-based KG reasoning frameworks have become increasingly popular by leveraging retrieved KG information. However, hallucinations in LLMs remain a critical issue. Even when relevant KG knowledge is incorporated, models may still generate incorrect outputs, leading to misinformation and unreliable decisions. Existing hallucination detection methods either focus on LLM internal states or verify consistency with retrieved contexts, but both overlook the structural information in KGs, resulting in suboptimal performance. To address this gap, we propose LUCID, the first halLUcination deteCtIon method for LLM-based knowleDge graph reasoning frameworks. LUCID jointly leverages LLM attention scores, KG semantics, and structural information. Specifically, it extracts node and edge features from attention scores and semantic similarities, and integrates them with KG structure using a graph neural network. We also construct manually annotated benchmark datasets for evaluation. Experiments on nine datasets show that LUCID achieves state of the art performance compared to 15 baselines.

2606.17165 2026-06-19 stat.ME cs.AI econ.EM math.ST stat.TH 新提交 85%

Statistical Foundations of LLM-based A/B Testing: A Surrogacy Framework for Human Causal Inference

基于LLM的A/B测试的统计基础:用于人类因果推断的替代指标框架

Joel Persson, Mårten Schultzberg, Sebastian Ankargren

发表机构 * Spotify USA, Inc.(Spotify美国公司)

专题命中 其他LLM :提出LLM替代人类进行A/B测试的统计框架

AI总结 提出替代指标理论框架,证明在弱于分布等价条件下,校准LLM输出可识别平均处理效应,并分析随机性带来的偏差与方差。

详情
AI中文摘要

组织和研究者越来越有兴趣在A/B测试中使用大型语言模型(LLM)代替人类参与者,以期更快、更低成本地进行实验。我们研究当在LLM结果上估计的处理效应何时能够恢复在感兴趣的人类群体上测量的效应。LLM与人类结果之间的分布等价性会使任何标准估计量有效,但这不现实。因此,我们开发了一个统计框架,将替代终点理论适配到LLM。该框架表明,将LLM结果校准到人类结果,在替代性和可比性条件(联合弱于分布等价性)下,可以识别平均处理效应。当这些条件不成立时,感兴趣的效应仅部分可识别,我们提供了诊断方法,可以在历史实验上证伪替代性,并给出有限重叠下最坏情况偏差的界限。我们进一步证明,LLM固有的随机性会引入偏差和方差,但使用多次抽取的平均值作为替代指标可以同时缓解两者。我们在模拟和Upworthy标题的A/B测试应用中展示了方法和理论。我们工作的一个核心结论是,LLM结果作为替代指标的有效性只能对过去的处理被证伪,而无法对新处理被验证,因此对于新颖干预,人类实验仍然不可或缺。我们讨论了LLM选择、提示和温度作为设计变量的作用,以及如何确定人类实验的规模以进行验证。

英文摘要

Organizations and researchers show increasing interest in using large language models (LLMs) in place of human participants in A/B tests, in the hope of experimenting faster and at lower cost. We study when a treatment effect estimated on LLM outcomes can recover the effect that would have been measured on the human population of interest. Distributional equivalence between LLM and human outcomes would make any standard estimator valid but is unrealistic. We therefore develop a statistical framework that adapts surrogate endpoint theory to LLMs, showing that calibrating LLM outcomes to human outcomes identifies the average treatment effect under surrogacy and comparability conditions that are jointly weaker than distributional equivalence. We present a falsification test for surrogacy and a bound on the worst-case bias from limited overlap between the LLM and human samples. We further show that the stochasticity inherent to LLMs can weaken surrogacy for identification while also introducing bias and variance during estimation, but that using an average over multiple LLM draws per unit as the surrogate mitigates these issues. Simulations validate the results, and an empirical application to A/B tests on Upworthy headlines shows that raw LLM predictions recover only 39\% of the human treatment effect while nonparametric calibration closes the gap. A central takeaway is that A/B testing on LLMs yields correct results only by assumption, whereas A/B testing on humans is correct by design, and that the required assumptions are hardest to justify precisely where A/B testing on LLMs promises the greatest benefit. We discuss the role of LLM choice, prompting, and temperature as design variables, the compounded challenge posed by long-term outcomes, and how to size human pilot studies for validation.

2606.07822 2026-06-19 cs.CL cs.AI cs.LG 新提交 85%

The ACUTE Protocol: Operationalizing Language Model Activations for Better Calibration, Utility, and Trust

ACUTE协议:操作语言模型激活以实现更好的校准、效用和信任

Nishant Subramani, Palash Goyal, Yiwen Song, Mani Malek, Yuan Xue, Tomas Pfister, Hamid Palangi

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Google(谷歌) Scale AI

专题命中 其他LLM :提出激活置信度估计协议,提升校准与信任

AI总结 提出ACUTE协议,通过操作语言模型激活来估计置信度,平衡校准与信息性,在多项选择问答、工具调用和科学文档摘要等任务上优于强基线,提升校准、效用和可信度。

Comments ICML 2026

详情
AI中文摘要

随着语言模型的改进并越来越多地部署以解决各种任务,可信度变得至关重要。校准是信任的良好代理:良好校准的置信度估计有助于在信任特定模型输出时告知风险与回报的权衡。不幸的是,即使模型改进,它们仍然校准不良,往往偏向过度自信。此外,校准可能被操纵:总是预测基率的策略是完美校准的,但完全没有信息性。为了解决这个问题,我们开发了一个新指标,即通过预言机重新归一化的期望效用(EURO),它平衡了校准和信息性。我们还提出了一种通用的基于激活的置信度、效用和信任估计协议(ACUTE),以适当裁决不确定性。ACUTE协议为4个模型家族的6个模型上的3个任务(包括多项选择问答、工具调用和科学文档摘要)提供了灵活、样本高效和计算高效的置信度估计器。ACUTE在EURO上优于强基线,同时保持较低的校准误差。综合来看,我们的工作表明,为LLM配备ACUTE协议可以在多种设置中提高校准、效用和可信度。

英文摘要

As language models improve and become increasingly deployed to solve a variety of tasks, trustworthiness becomes essential. Calibration is a good proxy for trust: well-calibrated confidence estimates help inform the risk versus reward tradeoff when trusting a specific model output. Unfortunately, even as models improve, they remain poorly calibrated, often biasing towards overconfidence. Additionally, calibration can be gamed: a policy that always predicts the base rate is perfectly calibrated, but completely uninformative. To resolve this, we develop a new metric, expected utility renormalized by the oracle (EURO), that balances calibration and informativeness. We also propose a general-purpose activation-based confidence, utility, and trust estimation protocol (ACUTE) to appropriately adjudicate uncertainty. The ACUTE protocol provides flexible, sample-efficient, and compute-efficient confidence estimators for 3 tasks including multiple choice question answering, tool-calling, and scientific document summarization across 6 models from 4 model families. ACUTE outperforms strong baselines on EURO, while maintaining low calibration error. Taken together, our work shows that equipping LLMs with the ACUTE protocol can improve calibration, utility, and trustworthiness in numerous settings.

2606.20517 2026-06-19 cs.AI cs.PL 新提交 80%

Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages

Multi-LCB: 将 LiveCodeBench 扩展到多种编程语言

Maria Ivanova, Pavel Zadorozhny, Rodion Levichev, Ivan Petrov, Adamenko Pavel, Ivan Lopatin, Alexey Kutalev, Dmitrii Babaev

发表机构 * GigaCode Yandex School of Data Analysis, Applied AI Institute(Yandex数据分析学院,应用人工智能研究所)

专题命中 其他LLM :评估LLM跨语言代码生成能力,涉及预训练模型

AI总结 提出 Multi-LCB 基准,将 LiveCodeBench 的 Python 任务扩展到 12 种编程语言,评估 LLM 跨语言代码生成能力,发现 Python 过拟合和语言特定污染等问题。

Comments ICLR 2026

详情
AI中文摘要

LiveCodeBench (LCB) 最近已成为评估大型语言模型 (LLM) 在代码生成任务上的广泛采用的基准。通过策划竞争性编程问题、不断向集合中添加新问题并根据发布日期进行过滤,LCB 提供了污染感知的评估,并提供了编码能力的整体视图。然而,LCB 仍然局限于 Python,留下了 LLM 是否能够泛化到现实软件工程所需的各种编程语言的问题。我们引入了 Multi-LCB,这是一个跨十二种编程语言(包括 Python)评估 LLM 的基准。Multi-LCB 将 LCB 数据集中的 Python 任务转换为其他语言中的等效任务,同时保留 LCB 的污染控制和评估协议。由于它与原始 LCB 格式完全兼容,Multi-LCB 将自动跟踪未来的 LCB 更新,从而能够系统地评估跨语言代码生成能力,并要求模型在 Python 之外保持良好的性能。我们在 Multi-LCB 上评估了 24 个 LLM 的指令和推理能力,发现了 Python 过拟合、语言特定污染以及多语言性能显著差异的证据。我们的结果将 Multi-LCB 确立为多编程语言代码评估的严格新基准,直接解决了 LCB 的主要局限性,并揭示了当前 LLM 能力的关键差距。

英文摘要

LiveCodeBench (LCB) has recently become a widely adopted benchmark for evaluating large language models (LLMs) on code-generation tasks. By curating competitive programming problems, constantly adding fresh problems to the set, and filtering them by release dates, LCB provides contamination-aware evaluation and offers a holistic view of coding capability. However, LCB remains restricted to Python, leaving open the question of whether LLMs can generalize across the diverse programming languages required in real-world software engineering. We introduce Multi-LCB, a benchmark for evaluating LLMs across twelve programming languages, including Python. Multi-LCB transforms Python tasks from the LCB dataset into equivalent tasks in other languages while preserving LCB's contamination controls and evaluation protocol. Because it is fully compatible with the original LCB format, Multi-LCB will automatically track future LCB updates, enabling systematic assessment of cross-language code generation competence and requiring models to sustain performance well beyond Python. We evaluated 24 LLMs for instruction and reasoning on Multi-LCB, uncovering evidence of Python overfitting, language-specific contamination, and substantial disparities in multilingual performance. Our results establish Multi-LCB as a rigorous new benchmark for multi-programming-language code evaluation, directly addressing LCB's primary limitation and exposing critical gaps in current LLM capabilities.

2606.20245 2026-06-19 cs.AI 新提交 80%

Navigating Unreliable Parametric and Contextual Knowledge: Explicit Knowledge Conflict Resolution for LLM Inference

导航不可靠的参数化与上下文知识:面向LLM推理的显式知识冲突解决

Huang Peng, Jiuyang Tang, Weixin Zeng, Hao Xu, Xiang Zhao

发表机构 * National Key Laboratory of Big Data and Decision, National University of Defense Technology(国防科技大学大数据与决策国家重点实验室)

专题命中 其他LLM :解决LLM参数知识与上下文冲突

AI总结 提出MACR框架,通过自适应知识评估与多智能体推理,显式解决大语言模型内部参数知识与外部上下文之间的冲突,超越传统二元选择范式。

Comments 12 pages, 3 figures

详情
AI中文摘要

大型语言模型(LLM)通过利用广泛的参数化知识和上下文学习能力,在多种基于语言的任务中取得了强劲性能,使其能够整合输入提示中提供的外部信息。然而,外部知识的整合可能引入冲突,不仅存在于模型内部参数知识与外部信息之间,也存在于多个外部上下文之间。现有方法通常假设模型或提供的上下文是可靠的,忽视了两种来源都可能包含错误的情况,并通过优先考虑某一来源而非另一来源来避免冲突,而非主动解决不一致性。为解决这些局限,我们提出了一种新颖的LLM知识冲突解决框架MACR,该框架超越了传统的二元选择范式,并基于多智能体推理方法引入了显式的冲突解决机制。具体而言,我们首先提出一种自适应知识评估与检索方法,采用改进的语义熵度量来量化LLM对给定查询答案的置信度。基于此置信度估计,MACR要么将模型的内部知识外化为文本表示,要么在内部知识不足时检索相关外部知识,为后续推理生成基本上下文。然后,我们引入一个归纳式多智能体推理框架,包含三个专门智能体,分别用于归纳显式规则、分析潜在冲突以及解决所有可用上下文中的不一致性。实验结果表明,MACR在多个基准测试中显著优于最先进的基线方法,同时提供了可解释的显式冲突解决方案。

英文摘要

Large language models (LLMs) have achieved strong performance across a wide range of language-based tasks by leveraging both extensive parametric knowledge and in-context learning ability, enabling them to incorporate external information provided in the input prompt. However, the integration of external knowledge can introduce conflicts, not only between the model's internal parametric knowledge and the external information, but also among multiple pieces of external contexts. Existing approaches typically assume that either the model or the provided context is reliable, overlooking the possibility that both sources may contain errors, and avoid conflicts by privileging one source over the other, rather than actively resolving inconsistencies. To address these limitations, we propose a novel framework MACR for LLM knowledge conflict resolution that moves beyond the conventional binary choice paradigm and incorporates an explicit conflict-resolution mechanism based on a multi-agent reasoning approach. Specifically, we first propose an adaptive knowledge assessment and retrieval approach that employs a modified semantic entropy measure to quantify an LLM's confidence in its answer to a given query. Based on this confidence estimation, MACR either externalizes the model's internal knowledge as textual representations or retrieves relevant external knowledge when internal knowledge is insufficient, generating basic contexts for subsequent reasoning. Then we introduce an inductive multi-agent reasoning framework with three specialized agents that, respectively, induce explicit rules, analyze potential conflicts, and resolve inconsistencies across all available contexts. Empirical results demonstrate that MACR significantly outperforms state-of-the-art baselines across benchmarks, while also providing interpretable resolutions of explicit conflicts.

2606.20152 2026-06-19 cs.CL cs.AI 新提交 80%

From Texts to Scores: Tracing the Emergence of Essay Quality Representations in Large Language Models

从文本到分数:追踪大型语言模型中作文质量表征的出现

Jiaxu Zuo, Mu You, Kaixin Lan, Tao Fang, Yujia Huo, Henghua Shen, Lidia S. Chao, Derek F. Wong

发表机构 * NLP2 CT Lab, Department of Computer and Information Science, University of Macau(澳门大学计算机与信息科学系NLP2 CT实验室) Institute of International Language Services Studies, Macau Millennium College(澳门 millennium 学院国际语言服务研究学院) School of Data Science and Information Engineering, Guizhou Minzu University(贵州民族大学数据科学与信息工程学院)

专题命中 其他LLM :分析LLM内部表征用于自动作文评分。

AI总结 通过线性探测等方法分析8个LLM在三个数据集上的隐藏表征,发现作文质量信息以线性可解码形式存在,并识别出与分数相关的神经元,揭示了LLM评分的内在机制。

Comments This is a preprint of a manuscript currently under peer review

详情
AI中文摘要

近年来,大型语言模型(LLMs)的进展极大地改变了自动作文评分(AES),但基于LLM的评分内部机制仍知之甚少。在本工作中,我们系统分析了八个LLMs在两个英文作文数据集(ASAP++、CSEE)和一个葡萄牙语数据集(ENEM)上的隐藏表征。通过线性探测、跨提示泛化、降维和神经元级分析,我们发现一致证据表明作文质量信息以线性可访问的形式编码在LLM表征中。这些表征在层间逐步出现,在不同提示策略下保持稳健,并且尽管评分标准不同,仍能在作文提示间部分迁移。此外,非线性探测相对于线性探测仅提供边际且不一致的改进,表明大多数作文质量信息已经是线性可解码的。我们进一步识别出单个“作文评分神经元”,其激活与作文分数强相关,且其行为对目标干预敏感。此外,这些神经元的逐层分布随作文长度系统性地变化,较长的作文更依赖深层。总体而言,我们的发现提供了LLM编码与作文质量相关的结构化表征的证据,并为基于LLM的AES系统的可解释性提供了新见解。

英文摘要

Recent advances in Large Language Models (LLMs) have substantially transformed Automated Essay Scoring (AES), yet the internal mechanisms underlying LLM-based scoring remain poorly understood. In this work, we systematically analyze the hidden representations of eight LLMs across two English essay datasets (ASAP++, CSEE) and one Portuguese dataset (ENEM). Using linear probing, cross-prompt generalization, dimensionality reduction, and neuron-level analyses, we find consistent evidence that essay quality information is encoded in a linearly accessible form within LLM representations. These representations emerge progressively across layers, remain robust across prompting strategies, and partially transfer across essay prompts despite differences in scoring rubrics. In addition, nonlinear probes provide only marginal and inconsistent improvements over linear probes, suggesting that most essay quality information is already linearly decodable. We further identify individual ``essay scoring neurons'' whose activations strongly correlate with essay scores and whose behavior is sensitive to targeted intervention. Moreover, the layer-wise distribution of these neurons systematically shifts with essay length, with longer essays relying more heavily on deeper layers. Overall, our findings provide evidence that LLMs encode structured representations related to essay quality and offer new insights into the interpretability of LLM-based AES systems.

2606.19868 2026-06-19 cs.AI 新提交 80%

A Systematic Evaluation of Black-Box Uncertainty Estimation Methods for Large Language Models

大型语言模型黑盒不确定性估计方法的系统评估

Jiayi Wang, Xu-Yao Zhang

发表机构 * State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所多模态人工智能系统国家重点实验室)

专题命中 其他LLM :系统评估LLM黑盒不确定性估计方法。

AI总结 系统评估了24种黑盒不确定性估计方法在4个模型和4个数据集上的表现,发现无单一方法普遍最优,但基于答案空间推理和比较的方法通常有效,混合方法在多数条件下表现良好。

详情
AI中文摘要

尽管大型语言模型(LLMs)在广泛的任务中展现出强大的能力,但其输出通常仍不可靠,可能包含幻觉,因此不确定性估计(UE)对于构建可信赖的LLMs至关重要。在实践中,许多主流LLMs仅通过受限API访问,此时logits和隐藏状态等内部信号不可用,使得黑盒UE尤为重要。然而,现有关于LLMs黑盒UE的研究在方法论上仍然零散,缺乏统一的实证比较。为填补这一空白,我们系统回顾了黑盒UE方法,并将其分为五类:基于口头化、基于采样、基于解释、多智能体和混合方法。我们进一步构建了统一的评估框架,并在4个模型和4个数据集设置下对24种代表性方法进行了基准测试。结果表明,没有单一方法在所有设置中一致占优。然而,在答案空间中进行推理和比较候选的方法通常有效,而结合多种不确定性信号的混合方法在大多数条件下表现良好。通过发布基准数据和统一评估框架,我们旨在促进可重复比较并支持未来研究,同时我们的实证发现为开发未来LLMs的黑盒UE方法提供了实践指导。

英文摘要

Although large language models (LLMs) have shown strong capabilities across a wide range of tasks, their outputs often remain unreliable and may contain hallucinations, making uncertainty estimation (UE) essential for building trustworthy LLMs. In practice, many mainstream LLMs are only accessible through restricted APIs, where internal signals such as logits and hidden states are unavailable, making black-box UE especially important. However, existing work on black-box UE for LLMs remains fragmented in methodology and lacks a unified empirical comparison. To address this gap, we present a systematic review of black-box UE methods and organize them into five categories: verbalization-based, sampling-based, explanation-based, multi-agent, and hybrid methods. We further build a unified evaluation framework and benchmark 24 representative methods across 4 models and 4 dataset settings. Our results show that no single method consistently dominates across all settings. Nevertheless, methods that reason over and compare candidates in the answer space are generally effective, and hybrid methods that combine multiple uncertainty signals perform well under most conditions. By releasing the benchmark data and a unified evaluation framework, we aim to facilitate reproducible comparisons and support future research, while our empirical findings provide practical guidance for developing future black-box UE methods for LLMs.

2606.19735 2026-06-19 cs.AI cs.CV 新提交 80%

GLARE: A Natural Language Interface for Querying Global Explanations

GLARE: 用于查询全局解释的自然语言接口

Bhavan Vasu, Rajesh Mangannavar

发表机构 * Oregon State University(俄勒冈州立大学)

专题命中 其他LLM :基于LLM的接口将自然语言转换为SQL查询。

AI总结 提出基于LLM的交互接口GLARE,将自然语言问题转换为SQL查询以聚合局部解释数据,提升全局解释的可访问性和可用性。

Comments 16 pages, 2 figures

详情
AI中文摘要

虽然全局解释对于理解跨数据集、类别和决策上下文的视觉模型至关重要,但其复杂和单一的性质常常阻碍实际探索。由于用户通常寻求针对特定问题的目标答案,而不是静态产物,我们提出了一种基于LLM的交互接口,提供对黑盒图像分类器全局解释的自然语言访问。系统的核心LLM充当调解者,将自然语言问题转换为对局部解释数据的结构化SQL查询。这使得灵活聚合成为可能,而无需向用户暴露低级表示。对于每个查询,接口输出统计增强的自然语言响应,支持局部解释和意图对齐的可视化。我们在意图解释、查询映射准确性、对新查询和数据集的泛化能力以及对语言错误的鲁棒性方面评估了该系统。我们的结果表明,LLM中介的查询显著提高了以人为中心的XAI中全局解释的可访问性和可用性。

英文摘要

While global explanations are crucial for understanding vision models across datasets, classes, and decision contexts, their complex and monolithic nature often hinders practical exploration. Because users typically seek targeted answers to specific questions rather than static artifacts, we present an LLM-based interactive interface that provides natural language access to global explanations for black-box image classifiers. The system's core LLM acts as a mediator, translating natural language questions into structured SQL queries over local explanation data. This enables flexible aggregation without exposing users to low-level representations. For each query, the interface outputs statistics-augmented natural language responses, supporting local explanations, and intent-aligned visualizations. We evaluate the system on intent interpretation, query mapping accuracy, generalization to novel queries and datasets, and robustness to linguistic errors. Our results demonstrate that LLM-mediated querying substantially improves the accessibility and usability of global explanations for human-centered XAI.

2606.19727 2026-06-19 cs.CL cs.AI 新提交 80%

NRITYAM: Language Models Meet Art and Heritage of Dance

NRITYAM:语言模型遇见舞蹈的艺术与遗产

Punit Kumar Singh, Niladri Ghosh, Advait Joshiınst, Shailee Choudhary, Michael Färber, Haiqin Yang

发表机构 * Shenzhen Technology University(深圳技术大学) New Delhi Institute of Management(新德里管理学院) Technische Universität Dresden(德累斯顿工业大学) Ramakrishna Mission Vivekananda Educational and Research Institute(罗摩克里希纳传道会维韦卡南达教育与研究学院) Indian Institute of Technology(印度理工学院) Swami Vivekananda Institute of Technology(斯瓦米·维韦卡南达技术学院) GuangDong Engineering Technology Research Center of Edge Intelligence(广东省边缘智能工程技术研究中心)

专题命中 其他LLM :评估语言模型对全球舞蹈文化的理解能力。

AI总结 提出NRITYAM基准,包含9,260个跨12语言的文化问答对,评估语言模型对全球舞蹈传统的文化理解能力,涵盖多种模型类型。

Comments 18 pages, 12 figures, in ECML_PKDD'26

详情
AI中文摘要

语言模型已成为塑造现代工作流程的重要工具。然而,其全球有效性取决于对当地社会文化背景的细致理解。为弥补这一差距,我们提出NRITYAM,一个用于评估语言模型在全球舞蹈传统背景下文化理解能力的综合基准。NRITYAM包含9,260个精心策划的问答对,涵盖12种语言,是专门用于评估舞蹈文化知识的最大数据集。该数据集通过与本地舞蹈艺术家和母语者的密切合作从头开发,他们创作并验证了特定地区的文化相关问题。我们评估了一系列模型,包括大型语言模型、小型语言模型、多模态大型语言模型和小型多模态语言模型。作为一个多语言和多文化基准,NRITYAM为评估AI系统理解和推理传统表演艺术的能力设定了新标准。详细数据集样本可在\url{this https URL}获取。

英文摘要

Language models have become essential tools in shaping modern workflows. However, their global effectiveness hinges on a nuanced understanding of local socio-cultural contexts. To address this gap, we present NRITYAM, a comprehensive benchmark for evaluating the cultural comprehension capabilities of language models in the context of global dance traditions. NRITYAM comprises 9,260 carefully curated question-answer pairs spanning 12 languages, making it the largest dataset dedicated to evaluating cultural knowledge in dance. The dataset has been developed from the ground up through close collaboration with native dance artists and native speakers of the languages, who authored and validated culturally relevant questions specific to their regions. We evaluate a broad set of models, including large language models, small language models, multimodal large language models, and small multimodal language models. As a multilingual and multicultural benchmark, NRITYAM sets a new standard for evaluating the ability of AI systems to understand and reason about traditional performing arts. Detailed dataset samples are available at~\url{https://github.com/niladrighosh03/NRITYAM}.

2606.19698 2026-06-19 cs.CL 新提交 80%

What sentiment analysis can't see: Measuring whether customers were helped, and what went wrong, across 70,000 support conversations

情感分析看不到的:衡量客户是否得到帮助以及出了什么问题——基于70,000次客服对话

Jason Potteiger

发表机构 * Dimension Labs(Dimension实验室)

专题命中 其他LLM :使用GPT-5.4估计客户满意度并标记问题。

AI总结 本研究使用GPT-5.4从70,450次客服对话中估计客户满意度并标记具体问题,发现满意度估计比情感分析更准确,且能揭示情感分析无法捕捉的客户状态和问题原因。

Comments 25 pages, 6 figures

详情
AI中文摘要

大多数公司通过情感分析大规模阅读客户支持数据,这种方法衡量的是客户听起来如何,而不是他们对结果是否满意。我们在一个领先的在线筹款平台的70,450次客服对话中测试了一种更丰富的替代方案:除了语气,我们使用GPT-5.4估计每位客户的满意度,并标记他们是否报告了具体问题,然后将这三个读数与客户对对话的1到5星评分进行验证。满意度估计跟踪这些评分的效果远好于情感分析,相关性分别为0.47和0.36,并且标记不满客户时的误报率低得多。结构化阅读还能看到情感分析看不到的东西:语气和满意度在44%的对话中不一致,一个单一的“中性”标签掩盖了从安静满意的客户到安静放弃的客户的一切,而最大的群体是“容忍摩擦”——那些满意但仍然报告可修复问题的客户,这是一个情感分析仪表板无法揭示的长期问题。更广泛的发现是,基于LLM的标注可以捕捉到比客户语言语气多得多的信息,为基于客户状态(他们是否满意)和直接从交互与反馈的原始文本数据中提取的问题原因的新业务指标提供了强大潜力。

英文摘要

Most companies read their customer support data at scale using sentiment analysis, which measures how customers sound rather than whether they were satisfied with the result. We tested a richer alternative on 70,450 support conversations from a leading online fundraising platform: alongside tone, we used GPT-5.4 to estimate each customer's satisfaction and to flag whether they reported a concrete problem, then validated all three readings against the 1-to-5 ratings customers left on the conversations they rated. The satisfaction estimate tracked those ratings far better than sentiment did, correlating at 0.47 against 0.36 and flagging unhappy customers with far fewer false alarms. The structured read also sees what sentiment cannot: tone and satisfaction disagree in 44% of conversations, a single "Neutral" label hides everything from quietly satisfied customers to ones who quietly gave up, and the largest group of all is "tolerated friction," customers who are satisfied but still reporting a fixable problem, a standing issue that no sentiment-based dashboard can surface. The broader finding is that LLM-based annotation can capture far more than the tonality of a customer's language, offering strong potential for new business metrics grounded instead in the customer's state (whether they were satisfied) and the cause of their problem extracted directly from the raw textual data of interactions and feedback.

2606.19668 2026-06-19 cs.CL 新提交 80%

Code-Switching Reveals Language Anchoring in Multilingual LLMs

代码切换揭示多语言大模型中的语言锚定

Jeonghyun Park, Seunghyun Yoon, Yonghyun Jun, Hwanhee Lee

发表机构 * Chung-Ang University(中央大学) Adobe Research(Adobe研究院)

专题命中 其他LLM :研究多语言大模型中的代码切换和语言锚定现象

AI总结 通过语法强制代码切换诊断多语言大模型中的语言锚定现象,提出锚定偏差度量并设计CANVAS干预方法,有效缓解代码切换导致的问答性能下降。

Comments 36 pages, 13 figures, 27 tables

详情
AI中文摘要

多语言大模型(MLLMs)越来越需要处理代码切换(CS)输入,然而混合语言通常会导致性能相对于源语言或目标语言单语版本下降。为了理解这种退化,我们使用语法强制CS作为受控诊断设置,将CS表示相对于其源和目标对应物进行定位。我们引入锚定偏差(Anchor Bias),一种几何度量,用于量化语言锚定,即CS隐藏状态是否更接近其源语言或目标语言对应物。在不同的MLLMs中,锚定偏差揭示了一致的语法框架效应:源框架CS保持源锚定,而目标框架CS向目标方向移动,并显示出更大的问答(QA)退化。受这种表示模式的启发,我们提出了CANVAS(基于上下文锚定的神经向量对齐引导),一种推理时干预方法,从输入中提取源侧画布,并在预填充期间将目标语言隐藏状态软引导向源锚定。CANVAS在MLLMs和CS条件下一致地恢复了QA F1分数,表明内部锚定信号为缓解CS推理失败提供了可行的目标。

英文摘要

Multilingual Large Language Models (MLLMs) are increasingly expected to handle Code-Switched (CS) inputs, yet mixing languages frequently degrades performance relative to source- or target-language monolingual counterparts. To understand this degradation, we use grammar-forced CS as a controlled diagnostic setting for locating CS representations relative to their source and target counterparts. We introduce Anchor Bias, a geometric measure that quantifies language anchoring, whether a CS hidden state aligns closer to its source or target language counterpart. Across diverse MLLMs, Anchor Bias reveals a consistent grammar-frame effect: source-framed CS stays source-anchored, whereas target-framed CS shifts target-ward and shows larger Question Answering (QA) degradation. Motivated by this representational pattern, we propose CANVAS (Contextual Anchor-based Neural Vector Alignment Steering), an inference-time intervention that extracts a source-side canvas from the input and softly steers target-language hidden states toward the source anchor during prefill. CANVAS consistently recovers QA F1 across MLLMs and CS conditions, showing that internal anchoring signals provide an actionable target for mitigating CS inference failures.

2606.19353 2026-06-19 cs.CL cs.LG 新提交 80%

Quantifying Aleatoric Uncertainty of In-Context Learning for Robust Measure of LLM Prediction Confidence

量化上下文学习中的偶然不确定性以稳健衡量LLM预测置信度

Jinseok Chung, Minkyoung Song, Hyunji Jung, Namhoon Lee

发表机构 * POSTECH(浦项科技大学)

专题命中 其他LLM :量化上下文学习中的不确定性,提升置信度

AI总结 针对上下文学习(ICL)中预测对提示设计敏感的问题,提出基于贝叶斯观点和机制可解释性的自函数向量,直接估计偶然不确定性,并设计严格评估协议,在合成和真实数据集上验证了方法的可靠性及在幻觉检测等应用中的实用性。

Comments Accepted to ACL 2026

详情
AI中文摘要

上下文学习(ICL)使LLM能够从少量示例中适应新任务,但其可靠性仍存疑虑:预测对提示设计和模型理解上下文的能力高度敏感,使得失败源于数据特性还是模型限制难以区分。不确定性分解——将偶然不确定性从认知不确定性中分离——在此场景中尤为关键,然而现有方法针对标准生成任务设计,未能捕捉ICL的独特动态。为解决此问题,我们引入基于贝叶斯观点和ICL机制可解释性的自函数向量概念。这些向量利用模型内部表示来建模上下文提示中学习的潜在概念,从而在贝叶斯框架内直接估计偶然不确定性,并规避了对脆弱的输入或解码操作的依赖。鉴于缺乏既定基准和合适的评估协议,我们还提出了首个严格的评估协议,其中数据以受控方式被操纵,以便精确量化偶然不确定性并将其与认知不确定性分离。借助这一新的评估框架(最初基于合成任务进行概念开发,随后扩展到真实世界数据集),我们展示了所提出的方法比现有替代方法更可靠地衡量LLM在ICL下做出的预测的不确定性。此外,我们展示了它可作为可信相关应用(如幻觉检测)的实用工具。我们的发现为将不确定性的量化观点与模型行为的机制理解联系起来开辟了新方向。

英文摘要

In-Context Learning (ICL) allows LLMs to adapt to new tasks from a few demonstrations, but its reliability remains a concern: predictions are highly sensitive to both prompt design and the model's ability to understand the context, obscuring whether failures arise from data properties or model limitations. Uncertainty decomposition-separating aleatoric from epistemic sources-is particularly crucial in this setting, yet existing methods, designed for standard generation tasks, fail to capture the unique dynamics of ICL. To address this, we introduce a concept of self-function vectors, built upon Bayesian views and the mechanistic interpretability of ICL. These vectors leverage internal model representations to model the latent concept learned during in-context prompting, thereby enabling a direct estimation of aleatoric uncertainty within a Bayesian framework and circumventing the reliance on brittle input or decoding manipulations. Given the lack of established benchmarks and suitable evaluation protocols, we also propose the first and rigorous evaluation protocol, in which data is manipulated in controlled ways so as to quantify aleatoric uncertainty precisely and separately from epistemic uncertainty. With this new evaluation framework, initially grounded in synthetic tasks for conceptual development and subsequently extended to real-world datasets, we show that our proposed methodology can measure uncertainty of LLM predictions made under ICL more reliably than existing alternative methods. Moreover, we show it can be used as a practical tool for trustworthy-related applications, such as hallucination detection. Our findings pave a new direction for connecting the quantitative view of uncertainty with the mechanistic understanding of model behavior.

2606.19349 2026-06-19 cs.CL cs.AI 新提交 80%

Where to Place the Query? Unveiling and Mitigating Positional Bias in In-Context Learning for Diffusion LLMs via Decoding Dynamics

查询应置于何处?通过解码动力学揭示并缓解扩散大语言模型中上下文学习的位置偏差

Zhengheng Li, Panrui Li, Xuyang Liu, Puzhi Xia

发表机构 * Southeast University(东南大学)

专题命中 其他LLM :研究扩散LLM中上下文学习的位置偏差

AI总结 本文系统分析了扩散大语言模型中查询位置对生成质量的影响,发现其与示例语义质量同等重要,并提出基于平均置信度的无训练自适应路由策略Auto-ICL以优化查询放置。

Comments 9 figures, 4 tables

详情
AI中文摘要

尽管上下文学习(ICL)在自回归(AR)大语言模型(LLMs)中已被广泛研究,但其在扩散大语言模型(dLLMs)中的机制仍基本未被探索。与受单向因果掩码限制的AR模型不同,dLLMs本质上利用双向注意力,为查询放置提供了广泛的空间灵活性。不幸的是,当前实践通常继承AR风格的尾随查询模板,往往忽略了结构范式转变。本文通过全面分析揭示了查询位置实际上是dLLMs中的一阶变量。通过经验解耦,我们证明了位置方差对生成质量的影响与示例语义质量相当。在内部,这种位置敏感性源于注意力流中的空间“近因效应”以及解码轨迹中依赖于任务的偏移。为了在没有真实标签的情况下缓解这种不稳定性,我们揭示了传统的单步置信度($C_{decoded}$)在dLLMs中失效。相反,我们提出了平均置信度($\overline{C}$),一种跟踪迭代解码过程的新指标。通过建立基础的空间ICL基线,我们引入了Auto-ICL,一种无需训练的自适应路由策略,动态优化查询放置,在异构推理和感知任务中稳健地接近最优性能。

英文摘要

While In-Context Learning (ICL) is extensively studied in Autoregressive (AR) LLMs, its mechanism within Diffusion Large Language Models (dLLMs) remains largely unexplored. Unlike AR models restricted by unidirectional causal masking, dLLMs intrinsically utilize bidirectional attention, offering extensive spatial flexibility for query placement. Unfortunately, current practices conventionally inherit AR-style trailing-query templates, often overlooking the structural paradigm shift. This paper presents a comprehensive analysis unveiling that query position is actually a first-order variable in dLLMs. Through empirical decoupling, we demonstrate that positional variance impacts generation quality on par with example semantic quality. Internally, this positional sensitivity stems from a spatial ``Recency Effect'' in attention flow and task-dependent shifts in decoding trajectories. To mitigate this instability without ground-truth labels, we reveal that traditional single-step confidence ($C_{decoded}$) fails in dLLMs. Instead, we propose Average Confidence ($\overline{C}$), a novel metric tracking the iterative decoding process. By establishing the foundational spatial ICL baselines, we introduce Auto-ICL, a training-free adaptive routing strategy that dynamically optimizes query placement, robustly approaching oracle performance across heterogeneous reasoning and perception tasks.

2606.19346 2026-06-19 cs.CL cs.AI 新提交 80%

Disentangling Linguistic Relatedness from Task Alignment in Cross-Lingual Transfer

跨语言迁移中语言相关性与任务对齐的解耦

Ahmed Haj Ahmed, Ruochen Zhang, Alvin Grissom

发表机构 * Haverford College(哈弗福德学院) Brown University(布朗大学)

专题命中 其他LLM :跨语言迁移中任务对齐与语言相关性解耦

AI总结 通过微调大语言模型并在闪语族与非闪语族语言上评估零样本阅读理解,发现跨语言迁移主要提升任务格式对齐而非语言特定知识。

详情
AI中文摘要

我们通过微调七个大语言模型(4B--671B参数)在阿拉伯语上,并在闪语族语言和非闪语族对照语言上评估零样本阅读理解,研究跨语言迁移。在密集架构和混合专家架构中,我们没有发现闪语族特定迁移的证据:基线较弱的模型在所有语言上都有显著提升,而基线较强的模型无论语言族系如何,只有边际提升。思维链消融实验强化了这一发现——从微调中获益最多的模型同样从推理时推理中获益,这表明两种机制都解决了任务格式对齐问题,而非跨语言知识迁移。

英文摘要

We study cross-lingual transfer by fine-tuning seven large language models (4B--671B parameters) on Arabic and evaluating zero-shot reading comprehension on Semitic languages and non-Semitic controls. Across dense and Mixture-of-Experts architectures, we find no evidence of Semitic-specific transfer: models with weak baselines improve dramatically across all languages, while strong-baseline models show only marginal gains regardless of language family. A chain-of-thought ablation reinforces this finding -- the same models that benefit most from fine-tuning benefit equally from inference-time reasoning, suggesting both mechanisms address task-format alignment rather than cross-lingual knowledge transfer.

2606.20560 2026-06-19 cs.LG cs.AI 新提交 75%

How Transparent is DiffusionGemma?

DiffusionGemma 的透明度如何?

Joshua Engels, Callum McDougall, Bilal Chughtai, Janos Kramar, Senthoran Rajamanoharan, Cindy Wu, Arthur Conmy, Asic Q Chen, Jean Tarbouriech, Min Ma, Brendan O'Donoghue, João Gabriel Lopes de Oliveira, Rohin Shah, Neel Nanda

发表机构 * Google(谷歌)

专题命中 其他LLM :研究DiffusionGemma推理透明度

AI总结 研究DiffusionGemma在连续潜空间中的推理透明度,通过变量透明度和算法透明度分解,发现可解释的令牌瓶颈将不透明串行深度降至Gemma 4的1.1倍,并揭示扩散特有现象。

Comments 20 main text pages and 6 pages of references and appendices

详情
AI中文摘要

LLM推理透明度是理解模型决策、减少误用和错位以及调试意外模型行为的关键能力。然而,DiffusionGemma在连续潜空间中执行了更大比例的计算;这是否使其推理透明度降低?我们通过将透明度分解为两个组成部分来研究这个问题:变量透明度,即我们是否理解模型计算状态的中间快照;以及算法透明度,即我们是否能够利用这些快照重建模型得出其输出的过程。直观上,DiffusionGemma的变量透明度较差:其不透明串行深度,即在可解释模型状态之间发生的串行计算量,最初似乎是相应自回归Gemma 4模型的28.6倍。然而,我们表明,我们可以通过一个可解释的令牌瓶颈映射去噪步骤之间流动的信息,且下游性能没有下降。将这些中间状态视为可解释的,将不透明串行深度降至仅为Gemma 4的1.1倍。对于扩散模型来说,算法透明度比自回归模型更难,因为画布中的所有令牌预测在每个去噪步骤中都可能发生变化,这使模型有能力在去噪过程中实现复杂的分布式算法。为了开始弥合这一差距,我们进行了一系列可解释性案例研究,发现了扩散特有现象(如非时序推理、令牌和序列涂抹以及中间上下文推理)的初步证据。最后,我们测试了可监控性,这是透明度的一个关键应用,衡量模型输出是否对下游任务有用。我们发现DiffusionGemma的可监控性与Gemma 4相似。

英文摘要

LLM reasoning transparency is a critical affordance for understanding model decisions, mitigating misuse and misalignment, and debugging surprising model behaviors. However, DiffusionGemma performs a larger fraction of its computation in a continuous latent space; does this make its reasoning less transparent? We study this question by decomposing transparency into two components: variable transparency, whether we understand intermediate snapshots of a model's computational state; and algorithmic transparency, whether we can use these snapshots to reconstruct the process by which the model arrived at its outputs. Naively, DiffusionGemma has poor variable transparency: its opaque serial depth, the amount of serial computation that occurs in between interpretable model states, seems at first 28.6X higher than the corresponding autoregressive Gemma 4 model. However, we show that we can map the information flowing between denoising steps through an interpretable token bottleneck with no decrease in downstream performance. Treating these intermediate states as interpretable reduces the opaque serial depth to just 1.1X that of Gemma 4. Algorithmic transparency is harder for diffusion models than for autoregressive models because all token predictions in the canvas can change at every denoising step, giving the model the power to implement complicated distributed algorithms during the denoising process. To begin bridging this gap, we conduct a suite of interpretability case studies, uncovering initial evidence of novel diffusion-specific phenomena such as non-chronological reasoning, token and sequence smearing, and intermediate-context reasoning. Finally, we test monitorability, a key application of transparency that measures whether model outputs are useful for downstream tasks. We find that DiffusionGemma is similarly monitorable to Gemma 4.

2606.20400 2026-06-19 cs.LG 新提交 75%

The Significance of Style Diversity in Annotation-Free Synthetic Data Generation

无标注合成数据生成中风格多样性的重要性

Zahra Abbasiantaeb, Zeno Belligoli, Omar Essam, Mohammad Aliannejadi

发表机构 * University of Amsterdam(阿姆斯特丹大学)

专题命中 其他LLM :利用LLM生成合成对话数据,提升意图分类性能

AI总结 提出无需人工标注的对话生成框架,利用主题和风格属性增强多样性,并设计两种后处理风格化模型,实验表明风格多样性比主题多样性更关键,性能可达人工标注数据的93.3%。

详情
AI中文摘要

为意图分类生成高实用性的合成数据通常需要人工标注的种子数据,这在快节奏的工业环境中往往不可用。在本文中,我们提出了一个完全无需人工标注数据、仅依赖意图定义的合成对话生成框架。我们提出的对话生成框架利用两种不同类型的主题和风格属性来提高数据多样性。此外,我们提出了两种新颖的后处理风格化模型,称为Univ和Exam,以将合成的LLM生成的语句转换为更多样化、更接近人类的语言风格。为了提升数据质量,我们利用LLM作为评判的过滤过程。在工业数据集和公开数据集上的实验结果表明,所提出的方法达到了使用人工标注训练数据所获得性能的93.3%。至关重要的是,研究结果揭示,对于合成数据的实用性,风格多样性比主题多样性更为关键,因为它能防止模型学习虚假的风格相关性。此外,研究表明,在生成过程中融入风格属性比后处理风格适应更有效。

英文摘要

Generating high-utility synthetic data for intent classification typically requires human-annotated seed data, which is often unavailable in fast-paced industrial settings. In this paper, we propose a framework for synthetic dialogue generation that works entirely without human-annotated data, relying solely on intent definitions. Our proposed dialogue generation framework utilizes two different types of topic and style attributes to improve data diversity. Also, we propose two novel post-hoc stylization models called Univ and Exam to transform synthetic LLM-generated utterances into more varied, human-like linguistic styles. To enhance data quality, we utilize an LLM-as-a-judge filtering process. Experimental results on both industrial and public datasets demonstrate that the proposed approach achieves up to 93.3% of the performance obtained using human-annotated training data. Crucially, the findings reveal that style diversity is more critical than topic diversity for synthetic data utility, as it prevents models from learning spurious stylistic correlations. Furthermore, the study shows that incorporating style attributes during the generation process is more effective than post-hoc style adaptation.

2606.19831 2026-06-19 cs.CL cs.LG 新提交 75%

Leverage Is Not Reach: A Control-Window Law for Single-Neuron Steering in Language Models

杠杆不等于可达性:语言模型中单神经元操控的控制窗口定律

Hongliang Liu

发表机构 * Palo Alto Networks

专题命中 其他LLM :研究语言模型中单神经元干预的控制窗口理论。

AI总结 提出预算归一化控制窗口框架,通过残差范数与写入范数之比定义的相干预算,预测单神经元干预何时产生连贯行为控制,并在15个神经元上验证了预测精度。

详情
AI中文摘要

对齐语言模型通过稀疏前馈神经元门控拒绝和语言路由等行为,但尚无理论预测单神经元干预何时连贯地控制行为而非导致输出崩溃。我们开发了一个预算归一化的控制窗口框架用于单神经元操控。沿一个写入方向的剂量简化为一个控制坐标:残差流与写入之间的对齐,该对齐沿着一条通用饱和曲线驱动,以残差范数除以写入范数设定的相干预算为单位。当行为触发点低于崩溃上限时,存在连贯控制。同一坐标控制良性模式切换和拒绝;上限由权重和一次通用前向传播得出,而触发点在 rollout 时测量。在15个保留神经元上,预测上限的平均绝对误差为0.14,在批量层中约为0.07,并且承诺的开启或关闭判定在11个神经元上成立,而多数基线为10/15。关闭情况揭示了三种失败模式而非违反:触发前崩溃、深度不足以传播、或归一化限制了单个神经元能推动的距离。该定律解释了为什么局部梯度归因反直觉地预测控制:真正的控制器偏离读出轴写入,并携带接近零的一阶梯度。由窗口精确化的仅前向对比筛选恢复了归因遗漏的控制器。在拒绝这一最难案例中,干预成功是类型化的而非标量:连贯旁路和严格可操作可达性分离,因此一个神经元可以在流畅、任务相关且无操作内容的文本中翻转拒绝,而真正的可操作可达性仅出现在六个审计的 Llama 枢轴中的三个,且仅在较晚的 rollout 时间范围内。因此,单神经元操控是对可控性的预算化、类型化审计,而非固定剂量的轶事。

英文摘要

Aligned language models gate behaviors such as refusal and language routing through sparse feed forward neurons, yet no theory predicts when a single neuron intervention controls a behavior coherently rather than collapsing the output. We develop a budget normalized control window framework for single neuron steering. A dose along one write direction reduces to one control coordinate: the alignment between the residual stream and the write, driven along a universal saturation curve in units of a coherence budget set by the residual norm divided by the write norm. Coherent control exists when a behavior trigger lies below the collapse ceiling. The same coordinate governs benign mode switches and refusal; the ceiling follows from weights and one generic forward pass, while triggers are measured at rollout. On fifteen held out neurons, the predicted ceiling has mean absolute error 0.14, about 0.07 in bulk layers, and the committed open or closed verdict holds on eleven against a ten of fifteen majority baseline. Closed cases expose three failure modes rather than violations: collapse before trigger, too little depth to propagate, or a normalization that caps how far one neuron can push. The law explains why local gradient attribution anti predicts control: true controllers write off the readout axis and carry a near zero first order gradient. A forward only contrastive screen made precise by the window recovers controllers that attribution misses. On refusal, the hardest case, intervention success is typed, not scalar: coherent bypass and strict actionable reach separate, so a neuron can flip refusal in fluent, on task text with no actionable content, and genuine actionable reach appears only for three of six audited Llama pivots and only at later rollout horizons. Single neuron steering is therefore a budgeted, typed audit of controllability rather than a fixed dose anecdote.

2606.18933 2026-06-19 cs.LG cs.IR stat.ME 新提交 75%

Zero-Shot Active Feature Acquisition via LLM-Elicitation

基于LLM启发式的零样本主动特征获取

Binyamin Perets, Natalie Mendelson, Shiran Vainberg, Yehuda Chowers, Shai Shen-Orr, Shie Mannor

发表机构 * Faculty of EE, Technion(技术学院电子工程系) Faculty of Medicine, Technion(技术学院医学院) CytoReason NVIDIA

专题命中 其他LLM :LLM启发式用于主动特征获取

AI总结 提出通过LLM启发式获取马尔可夫随机场充分统计量的零样本主动特征获取框架,解决数据标注不足问题,在IBD患者诊断中优于现有方法。

详情
AI中文摘要

主动特征获取(AFA)顺序选择要观察的特征以达成分类或排序决策。其主要局限性在于依赖大量标注数据来拟合指导获取的概率模型。大型语言模型(LLM)提供无监督的领域知识,但作为序列规划者表现不佳。要求其同时知晓和决策会混淆最好分开的能力。这里,我们通过严格的启发式方法开发了一个零样本AFA框架:仅要求LLM返回其可被信任返回的内容,即马尔可夫随机场(MRF)的充分统计量——一元偏差和成对协变。我们将该框架应用于两个场景:二分类和top-$k$识别。实践中,LLM可靠地仅返回判别性统计量,即区分类别而非孤立每个类别的统计量,这阻碍了经典AFA。我们应用最大熵闭包来解决这种规范模糊性。我们在炎症性肠病(IBD)患者队列上进行评估,这是一个活跃的临床环境,其中诊断模糊性和患者异质性阻碍了稳定的治疗策略。我们的框架在真实标签和其自身提取的信念上均优于LLM。在最关键的地方,即最困难的患者上,我们的top-$k$获取策略显著优于所有现有方法。

英文摘要

Active feature acquisition (AFA) sequentially selects which features to observe to reach a classification or ranking decision. Its central limitation is reliance on large amount of labeled data to fit probabilistic models guiding acquisition. Large language models (LLMs) supply unsupervised domain knowledge, but are poor sequential planners. Asking one to both know and decide conflates capabilities best kept separate. Here, we develop a framework for zero-shot AFA through disciplined elicitation: asking the LLM only for what it can be trusted to return, the unary deviations and pairwise co-variations that are the sufficient statistics of a Markov random field (MRF). We apply our framework to two settings: binary classification and top-$k$ identification. In practice, the LLM reliably returns only discriminative statistics, what distinguishes the classes rather than each class in isolation, which precludes classical AFA. We apply a maximum-entropy closure that resolves this gauge ambiguity. We evaluate on a cohort of Inflammatory Bowel Disease (IBD) patients, an active clinical setting where diagnostic ambiguity and patient heterogeneity obstruct stable treatment strategies. Our framework outperforms the LLM both on real labels and on its own extracted beliefs. Where it matters most, on the hardest patients, our top-$k$ acquisition policy markedly outperforms all existing methods.

2606.17832 2026-06-19 cs.LG 新提交 75%

From Drift to Coherence: Stabilizing Beliefs in LLMs

从漂移到一致:稳定LLM中的信念

SongEun Kim, Seungyoo Lee, Edwin Fong, Hyungi Lee, Juho Lee

发表机构 * Department of Statistics, Seoul National University Korea Advanced Institute of Science \& Technology Department of AI, Kookmin University University of Hong Kong

专题命中 其他LLM :研究LLM信念稳定性,提出预测重采样方法

AI总结 研究LLM在多项选择问答中的信念漂移问题,提出提示式预测重采样(PPR)方法,发现信念过程会自稳定并收敛,进而提出种子答案提示策略和自一致性损失以加速稳定并提高预测一致性。

详情
AI中文摘要

大型语言模型(LLM)常被假设执行隐式贝叶斯推理,然而一个关键的一致性条件——预测信念的鞅性质——已被证明在受控的合成上下文学习设置中失效。我们在更典型的使用场景中重新审视这个问题:通用多项选择问答。利用离散答案空间,我们计算精确的预测分布,并研究由自回归答案重采样引起的信念动态。我们引入了提示式预测重采样(PPR),其中LLM对同一问题生成一系列答案。实验表明,PPR揭示了早期阶段的信念漂移,表明鞅性质被违反。然而,在足够的重采样步骤后,信念过程自稳定并收敛到一个一致的预测分布。基于这一观察,我们进一步提出了(i)种子答案提示策略以加速稳定,以及(ii)自一致性损失,通过微调将早期漂移摊销到模型中。在多项选择问答基准上的实验表明,我们的方法在不牺牲准确性的情况下显著减少了信念漂移并提高了预测一致性。

英文摘要

Large language models (LLMs) are often hypothesized to perform implicit Bayesian inference, yet a key coherence condition, the martingale property of predictive beliefs, has been shown to fail in controlled synthetic in-context learning settings. We revisit this question in a more typical usage regime: generic multiple-choice question answering. Exploiting the discrete answer space, we compute exact predictive distributions and study belief dynamics induced by autoregressive answer resampling. We introduce prompted predictive resampling (PPR), where an LLM generates a sequence of answers to the same question. Empirically, PPR reveals early-stage belief drift, indicating martingale violations. However, after sufficient resampling steps, the belief process self-stabilizes and converges to a coherent predictive distribution. Based on this observation, we further propose (i) a seed-answer prompting strategy to accelerate stabilization, and (ii) a self-consistency loss that amortizes early-stage drift into the model via fine-tuning. Experiments on multiple-choice QA benchmarks show that our methods substantially reduce belief drift and improve predictive coherence without sacrificing accuracy.

2606.20544 2026-06-19 cs.AI cs.LG 新提交 70%

Toward Calibrated Mixture-of-Experts Under Distribution Shift

面向分布偏移下的校准混合专家模型

Gina Wong, Drew Prinster, Suchi Saria, Rama Chellappa, Anqi Liu

发表机构 * Johns Hopkins University(约翰霍普金斯大学)

专题命中 其他LLM :研究混合专家模型的校准问题,属于大语言模型应用

AI总结 研究混合专家模型在分布偏移下的校准问题,提出对抗性重加权方法以改善路由聚合的校准误差,提升准确率-校准权衡。

Journal ref ICML 2026

详情
AI中文摘要

校准将模型的预测不确定性与其经验结果的频率对齐,对于理解和信任报告的概率很重要。最近的研究表明,在单个预测器级别强制执行校准可以提高集成准确性和校准,特别是混合专家(MoE)模型显示出强烈的经验改进;然而,校准有助于MoE的条件尚不清楚。在这项工作中,我们研究了MoE模型在分布偏移下的行为,重点关注路由机制如何与专家级校准相互作用。我们表明,在硬路由模型中,专家校准足以确保整体模型在一大类分布偏移下的校准,但不足以校准软路由模型。为了解决这个问题,我们提出了一种对抗性重加权方法,惩罚分布偏移下路由聚合的校准误差,并证明它在平均情况下以及在数据的困难子集上,跨模型类别、预测任务和分布偏移,改善了准确率-校准权衡。

英文摘要

Calibration aligns a model's predictive uncertainty with the frequencies of its empirical outcomes and is important for understanding and trusting reported probabilities. Recent work shows that enforcing calibration at the level of individual predictors can improve ensemble accuracy and calibration, with mixture-of-experts (MoE) models showing strong empirical improvements in particular; however, the conditions under which calibration helps MoE are not well understood. In this work, we study how MoE models behave under distribution shift, focusing on how routing mechanisms interact with expert-level calibration. We show that expert calibration is sufficient to ensure calibration of the overall model under a broad class of distribution shifts in hard-routed models, but is insufficient for calibrating soft-routed models. To address this, we propose an adversarial reweighting that penalizes calibration errors of the routed aggregate under distribution shift, and we demonstrate that it improves the accuracy-calibration tradeoff both on average and on difficult subsets of the data, across model classes, prediction tasks, and distribution shifts.

2606.20538 2026-06-19 cs.LG 新提交 70%

Multi-Task Bayesian In-Context Learning

多任务贝叶斯上下文学习

Qingyang Zhu, Eric Karl Oermann, Kyunghyun Cho

发表机构 * New York University(纽约大学)

专题命中 其他LLM :提出多任务上下文学习框架,属于LLM方法

AI总结 提出多任务上下文学习框架,通过将先验信息表示为上下文数据集前缀,训练Transformer实现分层贝叶斯预测推理,在多种分布偏移下匹配最优贝叶斯性能且速度提升数个数量级。

Comments ICML 2026

详情
AI中文摘要

贝叶斯预测推断为不确定性量化、数据效率和鲁棒泛化提供了原则性框架。然而,精确推断通常难以处理,可扩展近似可能仍计算昂贵或需要限制性建模假设,从而降低预测性能。先验数据拟合和上下文模型最近作为一种摊销替代方案出现,通过学习直接将数据集映射到预测分布,但现有方法与训练先验的支持紧密耦合,缺乏在测试时适应新先验的显式机制,导致在分布偏移下鲁棒性有限。我们引入了一个多任务上下文学习框架,用于摊销分层贝叶斯预测推断,该框架将先验信息显式表示为上下文数据集的前缀。一个在先验和目标任务序列上训练的Transformer学习跨先验族调整其预测。在一系列难度递增的评估中,包括元分布外先验和具有高维潜在结构的先验,我们的方法匹配了最优贝叶斯预测器,同时速度快了几个数量级。我们进一步在真实世界的时空温度预测基准上展示了其实用性。代码可在https://this URL获取。

英文摘要

Bayesian predictive inference provides a principled framework for uncertainty quantification, data efficiency, and robust generalization. However, exact inference is often intractable, and scalable approximations may remain computationally expensive or require restrictive modeling assumptions that degrade predictive performance. Prior-Data Fitted and in-context models have recently emerged as an amortized alternative by learning to map datasets directly to predictive distributions, but existing approaches are tightly coupled to the support of the training prior and lack explicit mechanisms for adapting to new priors at test time, resulting in limited robustness under distribution shift. We introduce a multi-task in-context learning framework for amortized hierarchical Bayesian predictive inference that explicitly represents prior information as a prefix of in-context datasets. A transformer trained on sequences of prior and target tasks learns to adapt its predictions across families of priors. On a suite of evaluations with increasing difficulty, including out-of-meta-distribution priors and priors with high-dimensional latent structures, our method matches oracle Bayesian predictors while being orders of magnitude faster. We further demonstrate its practical relevance on a real-world spatiotemporal temperature prediction benchmark. Code is available at https://github.com/martianmartina/multi-task-bayesian-icl/.

2606.20502 2026-06-19 cs.CR cs.AI cs.SE 新提交 70%

Calibration Without Comprehension: Diagnosing the Limits of Fine-Tuning LLMs for Vulnerability Detection in Systems Software

无理解的校准:诊断微调大语言模型在系统软件漏洞检测中的局限性

Arastoo Zibaeirad, Marco Vieira

发表机构 * University of North Carolina at Charlotte(北卡罗来纳大学夏洛特分校)

专题命中 其他LLM :诊断微调LLM在漏洞检测中的局限性

AI总结 提出CWE-Trace框架,通过834个Linux内核样本和两个诊断指标(DFI和HDD)评估LLM漏洞检测能力,发现数据污染无实质帮助,微调仅改变输出阈值而非决策策略,模型缺乏真正的安全推理能力。

详情
AI中文摘要

大语言模型在漏洞基准测试中得分高,但究竟是真正推理安全还是仅对污染数据进行模式匹配,这一问题仍未解决。我们提出CWE-Trace,一个基于834个手动整理的Linux内核样本(涵盖74个CWE)构建的LLM漏洞检测框架。该框架强制执行严格的时间分割(2025年前的历史集/截止后的无泄漏集),保留上下文感知的易受攻击-修补对,并引入两个诊断指标:方向性失败指数(DFI)和层次距离与方向(HDD)。我们评估了8个原始LLM和15个LoRA微调变体,涵盖非目标检测、目标检测和CWE分类。分析得出两个关键结果。首先,数据污染未提供可衡量的优势。函数级分析显示,84%的名义污染样本不携带可用的记忆信号:易受攻击的函数缺失或跨数据集交叉映射,约31%的污染样本存在CWE误分类。其次,骨干方向性先验主导微调。模型表现出稳定、系统性的失败模式(DFI范围从-85.5到+94.8个百分点),这些模式从历史数据持续到截止后数据,且难以纠正。微调改变了输出阈值,但未改变决策策略。这是无理解的校准:输出分布适应训练数据,而底层安全推理仍然缺失。在二元检测中最弱的骨干(DeepSeek-R1)在粗粒度CWE分类中提升最大,表明检测和理解是解耦的能力。最佳检测得分仅达到52.1%(比随机高2.1个百分点);精确CWE排名Top-1准确率仍低于1.3%,证实当前LLM无论采用何种微调策略,都缺乏对系统软件的可靠安全推理能力。

英文摘要

Whether LLMs scoring well on vulnerability benchmarks genuinely reason about security or merely pattern-match on contaminated data remains unresolved. We present CWE-Trace, a framework for LLM vulnerability detection built from 834 manually curated Linux kernel samples spanning 74 CWEs. The framework enforces a strict temporal split (pre-2025 historical set / post-cutoff leakage-free set), preserves context-aware vulnerable--patched pairs, and introduces two diagnostic metrics: the Directional Failure Index (DFI) and Hierarchical Distance and Direction (HDD). We evaluate eight vanilla LLMs and 15 LoRA fine-tuned variants across non-targeted detection, targeted detection, and CWE classification. Our analysis yields two key results. First, data contamination provides no measurable advantage. Function-level analysis shows that 84% of nominally contaminated samples carry no usable memorization signal: vulnerable functions are absent or cross-mapped across datasets, and ~31% of contaminated samples carry CWE misclassification. Second, backbone directional priors dominate fine-tuning. Models exhibit stable, systematic failure modes (DFI ranging from -85.5 to +94.8 pp) that persist from historical to post-cutoff data and resist correction. Fine-tuning shifts the output threshold without changing the decision policy. This is calibration without comprehension: output distributions adapt to training data while the underlying security reasoning remains absent. The weakest backbone at binary detection (DeepSeek-R1) gains the most in coarse CWE classification, revealing that detection and understanding are decoupled capabilities. The best detection score reaches only 52.1% (+2.1 pp above chance); exact CWE ranking remains below 1.3% Top-1 accuracy, confirming that current LLMs lack reliable security reasoning for systems software, regardless of fine-tuning strategy.

2606.20436 2026-06-19 cs.CR cs.AI 新提交 70%

Multi-View Decompilation for LLM-Based Malware Classification

基于LLM的恶意软件分类的多视角反编译

Bercan Turkmen, Vyas Raina

发表机构 * Independent Researcher(独立研究员) SPARK

专题命中 其他LLM :利用LLM分类反编译代码,涉及LLM应用

AI总结 提出多反编译器视角提升LLM恶意软件分类性能,通过Ghidra和RetDec的互补伪C代码提高召回率和F1分数。

详情
AI中文摘要

恶意软件分析师通常在源代码不可用时,通过反编译的伪C代码检查编译后的二进制文件。最近的研究表明,大型语言模型(LLMs)可以通过将反编译代码分类为良性或恶意来辅助这一过程,但现有的流程通常依赖于单一的反编译器视角。我们认为这一假设是脆弱的:反编译器是有损的启发式工具,不同的反编译器可能暴露同一二进制文件的不同特征。我们整理了一个包含良性工具和恶意程序的基准测试,涵盖一系列威胁行为。每个样本都使用Ghidra和RetDec进行编译和反编译,生成匹配的伪C视图。在来自主要模型系列的一系列LLMs中,我们发现提供两种反编译器视图可以提高恶意类别的F1分数,主要是通过提高恶意样本的召回率。一致性分析进一步表明,Ghidra和RetDec会犯部分不同的错误,支持反编译器输出提供互补证据的观点。我们的结果表明,多反编译器提示是一种简单、无需训练的方法,可以在实际环境中改进基于LLM的恶意软件分类。

英文摘要

Malware analysts often inspect compiled binaries through decompiled pseudo-C, when source code is unavailable. Recent work suggests that large language models (LLMs) can assist this process by classifying decompiled code as benign or malicious, but existing pipelines typically rely on a single decompiler view. We argue that this assumption is fragile: decompilers are lossy heuristic tools, and different decompilers can expose different artefacts of the same binary. We curate a benchmark of benign utilities and malicious programs spanning a range of threat behaviors. Each sample is compiled and decompiled with both Ghidra and RetDec, yielding matched pseudo-C views. Across a range of LLMs from major model families, we find that providing both decompiler views improves malicious-class F1, mainly by increasing recall on malicious samples. Agreement analyses further show that Ghidra and RetDec make partially different errors, supporting the view that decompiler outputs provide complementary evidence. Our results suggest that multi-decompiler prompting is a simple, training-free way to improve LLM-based malware triage in practical settings.

2606.20295 2026-06-19 cs.SE cs.CL 新提交 70%

Token-Operations-Oriented Inference Optimization Techniques for Large Models

面向令牌操作的大模型推理优化技术

Shiguo Lian, Kai Wang, Zhaoxiang Liu, Wen Liu, Minjie Hua, Yutong Liu, Jiangze Yan, Xin Wang, Cong Wang, Yilin Zhang, Yi Shen, Jieyun Huang, Fang Zhao, Huanlin Gao, Ping Chen, Xinyu Yang, Kaikai Zhao, Yao Zhao, Xinggang Wang, Huishuai Zhang, Dongyan Zhao, Junping Du, Tao Chen, Xiang Gao, Qinghuai Ma

发表机构 * China’s National Data Administration(中国国家数据管理局)

专题命中 其他LLM :综述大模型推理优化技术

AI总结 本文提出多模型融合、模型优化、计算-模型融合、计算-网络-模型融合四层技术架构,系统综述各层关键技术及产业现状,旨在降低令牌成本、提升服务效率、保障供应稳定性,推动大模型服务从可调用到可运营的转变。

Comments 62 pages, 36 figures

详情
AI中文摘要

大模型推理优化是支撑大模型服务可扩展、低成本、高稳定运行的关键基础。本文以面向令牌的推理优化技术为核心,首次提出由多模型融合、模型优化、计算-模型融合、计算-网络-模型融合组成的四层技术架构,系统梳理了这四层的关键技术和产业现状,并分析了相关技术在实际业务场景中的应用价值。本文为降低令牌生产成本、提高令牌服务效率、保障令牌供应稳定性、推动大模型服务从可调用到可运营的转变提供了实用的技术路径。

英文摘要

Large model inference optimization serves as a key foundation for supporting the scalable, low-cost, and highly stable operation of large model services. Centered on token-oriented inference optimization technology, this paper proposes for the first time a four-layer technical architecture consisting of Multi-model Fusion, Model Optimization, Compute-Model Fusion, and Compute-Network-Model Fusion. It systematically reviews the key technologies and current industry status across these four levels and analyzes the application value of related technologies in real-world business scenarios. This paper provides a practical technical path for reducing token production costs, improving token service efficiency, ensuring the stability of token supply, and driving the transition of large model services from being merely callable to being operable.

2606.20287 2026-06-19 cs.CL 新提交 70%

PsyScore: A Psychometrically-Aware Framework for Trait-Adaptive Essay Scoring and ZPD-Scaffolded Feedback

PsyScore: 一种心理测量感知的特质自适应作文评分与最近发展区支架反馈框架

Wei Xia, Jin Wu, Haoran Shi, Xiangyu Wang, Chanjin Zheng

发表机构 * Department of Educational Psychology, East China Normal University(华东师范大学教育心理学系) Shanghai Institute of Artificial Intelligence for Education, East China Normal University(华东师范大学上海智能教育研究院) School of Computer Science and Technology, East China Normal University(华东师范大学计算机科学与技术学院)

专题命中 其他LLM :使用LLM生成自适应反馈

AI总结 提出PsyScore框架,通过共享潜在能力表示整合诊断评估与教学支架,包括特质自适应神经IRT评分器、ZPD支架反馈生成器和多视角反馈评估策略,在ASAP++数据集上实现竞争性评分性能并提供更符合教学法的反馈。

详情
AI中文摘要

有效的自动作文评分(AES)应支持可靠评估和可操作的教学反馈。然而,现有方法通常将评分和反馈视为独立组件:神经评分模型可解释性有限,而基于大语言模型(LLM)的反馈通常对学习者熟练度不敏感。为解决这一碎片化问题,本工作提出PsyScore,一个心理测量感知的框架,通过共享潜在能力表示整合诊断评估与教学支架。PsyScore包含三个关键模块:特质自适应神经IRT评分器,将分级部分信用模型(GPCM)融入神经架构,能够在保持心理测量可解释性的同时精确估计学生能力;ZPD支架反馈生成器,根据诊断出的能力参数调节多智能体反馈策略,以适应不同熟练水平的教学重点;以及多视角反馈评估策略,通过成对偏好判断和学生修订模拟评估反馈质量。在ASAP++数据集上的实验表明,PsyScore在提供更具教学一致性的反馈的同时,实现了有竞争力的评分性能。

英文摘要

Effective Automated Essay Scoring (AES) are expected to support both reliable assessment and actionable instructional feedback. However, existing approaches often treat scoring and feedback as separate components: neural scoring models provide limited interpretability, while Large Language Model (LLM)-based feedback is typically insensitive to learners proficiency levels. To address this fragmentation, this work proposes PsyScore, a psychometrically-aware framework that integrates diagnostic assessment with instructional scaffolding through a shared latent ability representation. PsyScore comprises three key modules: a Trait-Adaptive Neural IRT Scorer that incorporates the Graded Partial Credit Model (GPCM) into a neural architecture, enabling the precise estimation of student ability while maintaining psychometric interpretability, a ZPD-Scaffolded Feedback Generator, which conditions multi-agent feedback strategies on the diagnosed ability parameter to adapt instructional focus across different proficiency levels, and a Multi-Perspective Feedback Evaluation Strategy that assesses feedback quality via pairwise preference judgements and student revision simulations. Experiments on the ASAP++ dataset demonstrate that PsyScore achieves competitive scoring performance while providing more pedagogically aligned feedback.

2606.19941 2026-06-19 cs.LG 新提交 70%

Compositionality Emerges in a Narrow Depth-Connectivity Regime: Architecture Constraints and Solution Manifolds

组合性在窄深度-连接性区域中涌现:架构约束与解流形

Dat H. Do, Rushi Shah, Duc V. Le, Dianbo Liu

发表机构 * National University of Singapore(新加坡国立大学) University of Twente(特温特大学)

专题命中 其他LLM :研究组合性在稀疏网络中的涌现机制

AI总结 研究发现组合性仅在特定稀疏网络和特定深度区间涌现,提出基于相似性的剪枝和深度预测方法,并用理论框架解释原因。

详情
AI中文摘要

组合性被认为是泛化的基础,使模型能够在新颖组合中重用有意义的原语。然而,使用标准梯度优化训练的模型很少且通常仅微弱地表现出组合内部结构,并且尚不清楚这种组合性如何或为何形成。在这项工作中,我们表明组合性在一个狭窄的连接性-深度最佳点涌现。沿着连接性轴,组合性仅出现在某些特定稀疏网络中,严重依赖于保留哪些连接而非仅权重的稀疏性。沿着深度轴,组合性在一个狭窄的、目标依赖的区域内涌现,在特定深度达到峰值,而更浅和更深的网络都失败。当深度或连接性条件被违反时,梯度下降会静默地收敛到破碎解而非组合解。为了发现并利用这种涌现,我们引入了(i)基于相似性的剪枝(SP)以恢复组合连接性,以及(ii)一个启发式深度预测器以估计组合性最可能出现的深度。最后,我们通过基于组合稀疏性、体积比论证和特征干扰界限的理论框架支持这些实证发现,解释了为什么组合解仅在狭窄的深度-连接性区域内可达。

英文摘要

Compositionality is believed to be the foundation for generalization, enabling models to reuse meaningful primitives in novel combinations. Yet, models trained with standard gradient-based optimization rarely, and often only weakly, exhibit compositional internal structure, and it remains unclear how or why such compositionality forms. In this work, we show that compositionality emerges in a narrow connectivity-depth sweet spot. Along the connectivity axis, compositionality only appears in some specifically sparse networks, heavily depends on which connections remain rather than on weights' sparsity alone. Along the depth axis, compositionality emerges within a narrow, target-dependent regime, peaking at specific depths, while both shallower and deeper networks fail. When either the depth or connectivity condition is violated, gradient descent silently converges to fractured solutions rather than compositional ones. To discover and exploit this emergence, we introduce (i) similarity-based pruning (SP) to recover compositional connectivity and (ii) a heuristic depth predictor to estimate where compositionality is most likely to appear. Finally, we support these empirical findings with a theoretical framework based on compositional sparsity, volume-ratio arguments, and feature-interference bounds, explaining why compositional solutions are reachable only in a narrow depth-connectivity regime.