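arXivDaily: arXiv daily academic digest, updated Monday to Friday

AI Large Models

Large Language Models / LLM

Large language models, pre-training, instruction tuning, post-training, and language model applications.

Indexed today (current date): 97. Signal sources: cs.CL, cs.AI, cs.LG

1. Other LLM: 24 papers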

2606.19815 2026-06-19 cs.CL New submission 70%

Clusters are All You Need: Pre-Training the Tsetlin Machine with Semantic Clusters from Language Models for Interpretability

Jiechao Gao, Rohan Kumar Yadav, Yuangang Li, Yuandong Pan, Jie Wang, Ying Liu, Michael Lepech

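Affiliations * Independent Researcher; University of California, Irvine; University of the Chinese Academy of Sciences

Topic match Other LLM: Pre-trains the Tsetlin Machine with semantic clusters from language models to improve interpretability.

AI summary Proposes a semantic pre-training framework that clusters text with K-means or Top2Vec and pre-trains a Tsetlin Machine on the resulting cluster-sample pairs so that it learns interpretable semantic keywords; across five datasets it outperforms conventional approaches and is competitive with BERT.

Details
AI abstract (translated from Chinese)

Pre-trained language models such as BERT perform strongly on text classification but lack transparency, which limits their use in high-stakes settings. The Tsetlin Machine (TM) offers fully interpretable clause-based reasoning but captures limited semantic information, and earlier attempts to bridge the two relied on static word embeddings that ignore contextual meaning. We propose a semantic pre-training framework that transfers knowledge from a pre-trained language model into a TM without using embeddings. Text samples are grouped into semantically coherent clusters with K-means or Top2Vec, and the resulting cluster-sample pairs pre-train a non-negated TM through enhanced Type I feedback. The TM thus learns interpretable semantic keywords and is then fine-tuned on downstream tasks. On five datasets, our method significantly outperforms conventional and embedding-based TMs and is competitive with BERT while remaining interpretable.

English abstract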

Pre-trained language models such as BERT achieve strong text classification performance but lack transparency, limiting their use in high-stakes settings. The Tsetlin Machine (TM) offers fully interpretable, clause-based reasoning but captures little semantic information, and prior attempts to bridge the two rely on static word embeddings that miss contextual meaning. We propose a semantic pre-training framework that transfers knowledge from a pre-trained language model into a TM without using embeddings. Text samples are grouped into semantically coherent clusters with K-means or Top2Vec, and the resulting cluster-sample pairs pre-train a non-negated TM with enhanced Type I feedback. The TM thereby learns interpretable semantic keywords that are fine-tuned on downstream tasks. Across five datasets, our method substantially outperforms vanilla and embedding-based TMs and reaches performance competitive with BERT while remaining interpretable.

2606.19753 2026-06-19 cs.AI cs.SE New submission 70%

Grounded Inference: Principles for Deterministically Encapsulated Generative Models

Marty O'Neill

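Affiliations * Odenton, MD, USA

Topic match Other LLM: Discusses principles for deterministically encapsulating generative models in AI blended architectures.

AI summary Proposes four primitives of AI blended architecture that enable deterministic encapsulation of probabilistic models, and identifies two industry anti-patterns, providing a foundational framework for integrating AI with traditional systems.

Comments 12 pages, 3 figures

Details
AI abstract (translated from Chinese)

Integrating generative models into traditional computing systems brings both enormous opportunity and enormous risk. Although many early adopters have learned about these risks at great cost, the field still needs foundational frameworks to reduce the risk of bringing AI into traditional systems. This paper lays that foundation by defining four concrete primitives of AI blended architecture, intended to achieve deterministic encapsulation of probabilistic models. It also identifies two overarching anti-patterns that are widespread in industry, as a warning to engineers in this field. The framework is intended to enable successful integration of AI with traditional systems while giving generative model providers a basis for building the next generation of generative model interfaces.

English abstract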

The incorporation of generative models into traditional computational systems presents both enormous opportunity and tremendous peril. Although many early adopters have realized these perils at great expense, the field still requires foundational frameworks to de-risk incorporation of AI into traditional systems. This manuscript establishes this foundation through the definition of four specific primitives of AI blended architecture, designed to enable deterministic encapsulation of probabilistic models. It further establishes two overarching anti-patterns broadly represented across industry to serve as warnings for engineers in this field. This framework was designed to enable successful integration of AI into traditional systems while providing a foundation upon which generative model providers could build the next generation of generative model interfaces.

2606.19741 2026-06-19 cs.AI cs.LG New submission 70%

Interpreting Neural Combinatorial Optimization via Evolving Programmatic Bottlenecks

Haocheng Duan, Yuxin Guo, Jieyi Bi, Anqi Xie, Sirui Li, Yining Ma, Cathy Wu

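Affiliations * Carnegie Mellon University; Nanyang Technological University; Microsoft Research; Massachusetts Institute of Technology

Topic match Other LLM: Uses an LLM to evolve programs that interpret neural combinatorial optimization models.

AI summary Proposes the Evolving Programmatic Bottlenecks (EPB) framework, which distills black-box neural combinatorial optimization models into readable program portfolios, achieves interpretability using an LLM and hybrid gradient descent, and reveals how model behavior relates to variants of classic heuristics.

Comments Under Review

Details
AI abstract (translated from Chinese)

Neural Combinatorial Optimization (NCO) achieves strong performance, but its black-box nature remains a key obstacle to deployment and scientific diagnosis. Standard interpretability tools such as Concept Bottleneck Models do not suit NCO, because its decisions are dynamic and state-dependent and lack a properly defined concept vocabulary. To bridge this gap, we introduce Evolving Programmatic Bottlenecks (EPB), to our knowledge the first framework that interprets NCO policies by distilling black-box NCO models into human-readable program portfolios. EPB uses an LLM to autonomously evolve a bank of programs, in which each program's per-step action distribution serves as the bottleneck. EPB works iteratively: Block I fixes the capacity of the program bank and introduces a hybrid textual-numerical gradient descent scheme that combines numerical gradients for student router updates with textual gradients for LLM-based program revision; Block II dynamically adjusts bank capacity through fault-targeted expansion and redundancy pruning. Extensive experiments demonstrate EPB's effectiveness and broad applicability, with the distilled program portfolios largely preserving the original performance. EPB also reveals that NCO behavior changes across optimization stages and can be approximated as a composition of classic heuristic variants. Our work advances interpretable NCO and establishes EPB as a promising tool for interpreting sequential decision-making models.

English abstract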

Neural Combinatorial Optimization (NCO) achieves strong performance, yet its black-box nature remains a key roadblock to deployment and scientific diagnosis. Standard interpretability tools, such as Concept Bottleneck Models (CBMs), are ill-equipped for NCO, whose decisions are dynamic, state-dependent, and lack proper concept vocabulary definition. To close this gap, we introduce Evolving Programmatic Bottlenecks (EPB), to our knowledge, the first framework for interpreting NCO policies by distilling black-box NCO models into human-readable program portfolios. EPB employs an LLM to autonomously evolve a bank of programs, where each program's per-step action distribution serves as the bottleneck. EPB works through an iterative framework: Block I fixes program bank capacity and introduces a hybrid textual-numerical gradient descent scheme that couples numerical gradients for student router updates and textual gradients for LLM-based program revision; Block II dynamically adapts bank capacity via fault-targeted expansion and redundancy pruning. Extensive experiments demonstrate EPB's effectiveness and broad applicability, where the distilled program portfolios largely match original performance. EPB also reveals that NCO behavior shifts across optimization stages and can be approximated as a composition of classic heuristic variants. Our work advances interpretable NCO and establishes EPB as a promising tool for interpreting sequential decision-making models.

2606.19719 2026-06-19 cs.IR cs.CL cs.LG New submission 70%

Closing the Calibration Gap in Semantic Caching

Aditeya Baral, Radoslav Ralev, Iliya Sotirov Zhechev, Srijith Rajamohan, Jen Agarwal

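Affiliations * New York University; Redis

Topic match Other LLM: Semantic caching for reducing LLM inference cost.

AI summary Addresses the gap between offline metrics and deployed performance in semantic caching systems by proposing the P-CHR AUC and CRR metrics; finds that the calibration gap is dominated by the training objective and that model selection is fundamentally a calibration problem.

Comments 23 pages, 2 figures. Source code: https://github.com/aditeyabaral/calibration-gap-semantic-caching ; Models and Datasets: https://huggingface.co/redis

Details
AI abstract (translated from Chinese)

Semantic caching reduces LLM inference cost by serving cached responses to semantically similar queries. Standard practice evaluates these systems with PR-AUC, a metric that measures only how well scores rank and ignores whether they are usable at a fixed threshold. We show that this mismatch leads to systematically poor deployment choices, because the models with the highest PR-AUC are often the worst in operation. We introduce Precision-Cache Hit Ratio (P-CHR) AUC, a cache-aware metric that measures precision across cache utilization levels, and Calibration Retention Rate (CRR), which captures how much offline ranking quality is retained at deployment. We decompose the operational gap between offline and deployed quality into a recoverable calibration component and an irreducible structural component fixed by the dataset's positive rate. Our experiments show that the calibration gap is dominated by the training objective rather than by data scale, and that post-hoc calibration closes it only partially. Ultimately, model selection for semantic caching is a calibration problem rather than a ranking problem, and measuring it is the first step toward closing the gap.

English abstract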

Semantic caching cuts LLM inference costs by serving a cached response to semantically similar queries. Standard practice evaluates these systems using PR-AUC, a metric that only measures how well scores rank and ignores whether they are usable at a fixed threshold. We show this mismatch leads to systematically poor deployment choices, as models with the highest PR-AUC are often the worst in operation. We introduce Precision-Cache Hit Ratio (P-CHR) AUC, a cache-aware metric that measures precision across cache utilization levels, and Calibration Retention Rate (CRR), which captures how much offline ranking quality survives at deployment. We decompose the operational gap between offline and deployed quality into a recoverable calibration component and an irreducible structural component fixed by the dataset's positive rate. Our experiments show that the calibration gap is governed by the training objective rather than data scale, and post-hoc calibration only partially closes it. Ultimately, model selection for semantic caching is a calibration problem, not a ranking one, and measuring it is the first step to closing the gap.
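As a rough illustration of the fixed-threshold issue this abstract describes, the sketch below computes precision and cache hit ratio at a single similarity threshold. It is not the authors' P-CHR AUC or CRR, whose exact definitions the abstract does not give; the function name, the threshold of 0.8, and the scores are all hypothetical.

```python
def precision_and_hit_ratio(scored_pairs, threshold):
    """Return (precision, cache_hit_ratio) at a fixed similarity threshold.

    scored_pairs: list of (similarity, is_correct_match) tuples, one per query,
    where is_correct_match says whether the nearest cached answer is valid.
    """
    if not scored_pairs:
        return 1.0, 0.0
    # A query is served from the cache only if its score clears the threshold.
    served = [correct for score, correct in scored_pairs if score >= threshold]
    hit_ratio = len(served) / len(scored_pairs)
    # With no cache hits there are no wrong answers, so precision is 1.0 by convention.
    precision = sum(served) / len(served) if served else 1.0
    return precision, hit_ratio


# Hypothetical scores from two models that rank the four queries identically.
model_a = [(0.95, True), (0.90, True), (0.85, False), (0.60, False)]
model_b = [(0.70, True), (0.65, True), (0.60, False), (0.30, False)]

result_a = precision_and_hit_ratio(model_a, threshold=0.8)
result_b = precision_and_hit_ratio(model_b, threshold=0.8)
```

Because the two toy models order the queries identically, any purely rank-based metric scores them the same, yet at the shared threshold `model_b` serves nothing from the cache while `model_a` serves three of four queries.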

2606.19704 2026-06-19 cs.AI New submission 70%

Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents

Dhaval C. Patel, Kaoutar El Maghraoui, Shuxin Lin, Yusheng Li, Tianjun Feng, Chun-Yi Tsai, Yihan Sun, Wei Alexander Xin, Akshat Bhandari, Tanisha Rathod, Aaron Fan, Sanskruti Vijay Shejwal, Tomas Pasiecznik, Sagar Chethan Kumar, Tanmay Agarwal, Rohith Kanathur, Sam Colman, Amaan Sheikh, Dev Bahl, Ann Li, Krish Veera, Alimurtaza Mustafa Merchant, Shambhawi Baswaraj Bhure, Sajal Kumar Goyla, Chengrui Li, Kirthana Natarajan, Rui Li, Thomas Ajai, Rujing Li, Vivek G. Iyer, Sanjaii Vijayakumar, Yitong Bai, Ayal Yakobe, Darief Maes, Yassine Jebbouri, Tianyang Xu, Thai Quoc On, Vera Mazeeva, Winston Li, Yuval Shemla, Yeshitha Bhuvanesh, Rushin Bhatt, Siddharth Chethan Gowda, Alisha Vinod, Caroline Cahill, Shriya Aishani Rachakonda, Yunfeng Chen, Aryaman Agrawal, Aman Upganlawar, Mao Le Jonathan Ang, Yubin Sally Go, Madhav Rajkondawar, Yang-Jung Chen, Trisha Maturi, Ananya Kapoor, Andrew Li, Shrey Arora, Mana Abbaszadeh, Shen Li, Charles Xu, Byeolah Kwon

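Affiliations * IBM

Topic match Other LLM: Studies the generalization problem in evaluating LLM agents.

AI summary Drawing on 14 parallel studies, argues that aggregate-score leaderboards do not generalize to out-of-distribution settings, proposes ranking configurations by predictive validity, and designs falsifiable out-of-distribution evaluation criteria.

Comments 17 pages, 2 tables, 5 figures

Details
AI abstract (translated from Chinese)

Agent benchmarks are developing quickly, but no single benchmark covers the many dimensions involved in deployment. This paper aggregates the largest coordinated deep-dive to date of an MCP-based industrial-agent benchmark: 14 parallel implementation studies covering new asset classes (including a multi-modal visual extension), alternative orchestrations, retrieval strategies, reasoning modes, infrastructure optimizations, and evaluation-methodology probes. Combining these studies with seven prior agent benchmarks, we argue that aggregate-score leaderboards systematically underspecify the evaluation of deployed agents. Rankings based on aggregate scores do not generalize to out-of-distribution settings; recent retrospectives of public-to-private competitions provide direct empirical evidence of this rank instability. We propose ranking configurations by predictive validity (the correlation between in-sample and out-of-sample rank) rather than by in-sample mean, and report a twelve-tier measurement apparatus that exposes deployment-relevant dimensions overlooked by HELM and its agent-era successors. The position is operationalized through three falsifiable out-of-distribution criteria with explicit thresholds; existing evidence partly supports it but is too thin to confirm it. Finally, we present a pre-registered pilot design and a field-level vision of what next-generation agent benchmarks should report.

English abstract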

Agent benchmarks are growing fast, but no single benchmark touches more than four or five of the dimensions that deployment exposes. This paper aggregates the largest coordinated deep-dive of one MCP-based industrial-agent benchmark to date: fourteen parallel implementation studies covering new asset classes (including a multi-modal visual extension), alternative orchestrations, retrieval strategies, reasoning modes, infrastructure optimizations, and evaluation-methodology probes. Consolidating those studies with seven prior agent benchmarks, we argue that aggregate-score leaderboards systematically underspecify deployed-agent evaluation. Rankings derived from aggregate scores do not transfer to out-of-distribution settings; recent public-to-hidden competition retrospectives provide direct empirical evidence of this rank instability. We propose ranking configurations by predictive validity, the correlation between in-sample and out-of-sample rank, rather than in-sample mean, and report a twelve-tier measurement apparatus that exposes the deployment-relevant dimensions HELM and its agent-era successors collapse. The position is operationalized through three falsifiable out-of-distribution criteria with explicit thresholds; existing evidence partly supports it but is too thin to confirm. We close with a pre-registered pilot design and a field-level vision for what the next generation of agentic benchmarks should report.
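The abstract defines predictive validity as the correlation between in-sample and out-of-sample rank. One plausible reading is a Spearman rank correlation, sketched below in plain Python. The choice of Spearman, the function names, and the scores are assumptions for illustration; the abstract does not say which correlation the authors use.

```python
import math


def ranks(scores):
    """Rank scores in descending order (1 = best); tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-empty sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined without rank variation; report 0.0 here
    return cov / math.sqrt(var_x * var_y)


def predictive_validity(in_sample_scores, out_of_sample_scores):
    """Spearman correlation: Pearson correlation computed on the two rank vectors."""
    return pearson(ranks(in_sample_scores), ranks(out_of_sample_scores))


# Hypothetical scores for four agent configurations.
in_sample = [0.91, 0.88, 0.80, 0.75]
stable_ood = [0.70, 0.66, 0.61, 0.50]    # same ordering out of distribution
reversed_ood = [0.40, 0.45, 0.52, 0.60]  # ordering flips out of distribution
```

A leaderboard whose in-sample ordering survives out of distribution scores near 1, while one whose ordering flips scores near -1, regardless of how high the in-sample means are.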

2606.19591 2026-06-19 cs.CL cs.AI New submission 70%

A BART-based approach with hierarchical strategy for Vietnamese abstractive multi-document summarization

Vu Nguyen Nguyen Xuan, Huy Ngo Quang

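Affiliations * Aimesoft JSC

Topic match Other LLM: BART-based Vietnamese multi-document summarization, an application of language models.

AI summary Proposes a novel yet simple hierarchical strategy that shortens documents under the guidance of the golden summary and combines it with a BART model for Vietnamese abstractive multi-document summarization, reaching a ROUGE2-F1 of 0.2468 on the VLSP 2022 test set and using external data to augment training.

Comments originally written in 2022

Details
AI abstract (translated from Chinese)

In this technical report, we focus on the challenge of Vietnamese abstractive multi-document summarization, a task introduced at the International Workshop on Vietnamese Language and Speech Processing (VLSP) 2022. We follow the popular hierarchical approach, which first condenses each document and then aggregates and summarizes. We propose a novel yet simple strategy for shortening documents that is driven by the golden summary, which ensures high correlation between the stages of the hierarchical approach. Our method achieves a ROUGE2-F1 score of 0.2468 on the VLSP public test set and produces fluent, concise summaries. In addition, we draw on external sources for extra data, which greatly increases the amount of data available for Vietnamese multi-document summarization. The additional data has been released to the community.

English abstract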

In this technical report, we focus on solving the challenge of Vietnamese multi-document abstractive summarization, introduced in the International Workshop on Vietnamese Language and Speech Processing (VLSP) 2022. We choose to follow the popular hierarchical approach, i.e. condensing each document followed by aggregation and summarization. We propose a novel yet simple strategy to shorten documents that is driven by the golden summary, thus ensuring high correlation between stages of the hierarchical approach. Our method achieves a ROUGE2-F1 score of 0.2468 on the VLSP's public test set, and can produce fluent and concise summaries. Additionally, we utilize external sources for extra data, which greatly enhances the quantity of data for Vietnamese multi-document summarization. The additional data is made available for the community.

2606.19588 2026-06-19 cs.AI cs.CR cs.LO New submission 70%

Analyzing the Narration Gap in LLM-Solver Loops

Zunchen Huang, Songgaojun Deng

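Affiliations * Eindhoven University of Technology

Topic match Other LLM: Analyzes the narration gap and security vulnerabilities in LLM-solver loops.

AI summary Studies the security weaknesses of the narration step, which turns solver output into the user-facing answer, in hybrid reasoning that combines LLMs with SAT/SMT solvers; formal modeling and experimental evaluation show that certificate gating keeps the solver verdict sound, but adversarial attacks can still invert the conclusion.

Details
AI abstract (translated from Chinese)

Formal tools such as SAT and SMT solvers are increasingly embedded in language model reasoning pipelines when a safety- or security-critical question can be formulated in logic. Unlike chain of thought, whose steps are sampled from the model distribution with no formal guarantee, a solver produces sound and independently verifiable answers. However, this soundness guarantee can be lost in the interaction between the solver and the model. The hybrid pipeline has three components: formalizing the question, deciding it, and narrating the result. Prior work has studied formalization and decision but not narration, the step that turns a formal tool's output into the user's answer. To fill this narration gap, we first model the LLM-solver loop as a verified decision procedure. We then evaluate five open-source models under prompt injection and find that certificate gating makes the solver verdict sound, whereas an attacker can invert a verified conclusion across different phrasings and channels. We study mitigation through hardened prompts, which reduces injection substantially but cannot eliminate it and still breaks down under adaptive attacks. Combining formal analysis with empirical study, we show that in the LLM-solver loop, robustness does not extend to the answer that the user ultimately reads.

English abstract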

Formal tools such as SAT and SMT solvers are increasingly embedded in language model reasoning pipelines when a safety or security critical question can be formulated in logic. Unlike chain of thought whose steps are sampled from the model distribution without formal guarantee, a solver produces a sound and independently verifiable answer. However, the soundness guarantee can be lost in the interaction between the solver and the model. The hybrid pipeline has three components: formalizing the question, deciding it, and narrating the result. Prior work has studied the formalization and decision, but not narration, which is the step that turns a formal tool's output into the user answer. To fill the narration gap, we first model the LLM-solver loop as a verified decision procedure. We further evaluate five open-sourced models under prompt injection, and we find certificate gating makes the solver verdict sound, while an adversary can invert a verified conclusion across phrasings and channels. We study the mitigation through hardened prompt that reduces injection significantly but cannot eliminate it and still suffers under adaptive attack. Combining the formal analysis and empirical studies, we show in the LLM-solver loop, robustness does not reach to the answer that the user finally reads.
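To make "certificate gating" concrete, here is a minimal sketch under the assumption that it means independently checking a solver's satisfying assignment before a SAT verdict is passed on. The abstract does not describe the authors' mechanism, so the function names, the DIMACS-style clause encoding, and the "UNVERIFIED" fallback are all hypothetical. The gate covers only the solver verdict; the narration step that the paper shows to be attackable comes after it.

```python
def literal_holds(literal, assignment):
    """A DIMACS-style literal k means variable k is True; -k means it is False."""
    variable = abs(literal)
    return variable in assignment and assignment[variable] == (literal > 0)


def satisfies(cnf, assignment):
    """Check that every clause has at least one literal made true by the assignment."""
    return all(any(literal_holds(lit, assignment) for lit in clause) for clause in cnf)


def gated_verdict(cnf, claimed_status, model=None):
    """Pass a SAT verdict on only if its certificate (the model) checks out."""
    if claimed_status == "SAT" and model is not None and satisfies(cnf, model):
        return "SAT"
    # UNSAT claims need a proof checker (not sketched here), so they are not accepted.
    return "UNVERIFIED"


# (x1 or not x2) and (x2 or x3)
cnf = [[1, -2], [2, 3]]
good_model = {1: True, 2: False, 3: True}
bad_model = {1: False, 2: True, 3: False}

accepted = gated_verdict(cnf, "SAT", good_model)
rejected = gated_verdict(cnf, "SAT", bad_model)
```

The check costs time linear in the formula size, which is why a gate like this can run on every solver call.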

2606.19559 2026-06-19 cs.AI cs.CL New submission 70%

Uncertainty Decomposition for Clarification Seeking in LLM Agents

Gregory Matsnev

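Affiliations * AI Talent Hub, ITMO University

Topic match Other LLM: A prompt-based uncertainty decomposition method for LLM agents.

AI summary Proposes a prompt-based uncertainty decomposition that separates action confidence from request uncertainty so that an agent can proactively seek clarification when the task specification is ambiguous; averaged over five LLM backbones, clarification F1 on ALFWorld-Clarification improves by 73% over ReAct+UE and by 36% over UAM.

Comments 26 pages, 8 figures. Source code: https://github.com/PE51K/udcs-in-llm-agents

Details
AI abstract (translated from Chinese)

Recent position papers argue that the classical aleatoric/epistemic uncertainty framework is insufficient for interactive large language model (LLM) agents, and call for uncertainty representations that are underspecification-aware, decomposable, and communicable in order to unlock new agent capabilities such as proactive clarification seeking and shared mental-model building. Practical deployment constraints (black-box APIs, interactive latency budgets, and the absence of labeled trajectories) rule out logprob-based, multi-sampling, and training-based methods, leaving prompt-based estimation as the most viable option for surfacing such signals at deployment time. We answer this call with a simple prompt-based decomposition that separates action confidence from request uncertainty (u), allowing the agent to request clarification when the task specification is ambiguous. To evaluate it, we introduce two clarification-augmented benchmarks (WebShop-Clarification and ALFWorld-Clarification) in which 50% of tasks are deliberately underspecified, and systematically compare the proposed decomposition with ReAct+UE and Uncertainty-Aware Memory (UAM) across five LLM backbones (GPT-5.1, DeepSeek-v3.2-exp, GLM-4.7, Qwen3.5-35B, GPT-OSS-120B) on these variants as well as on the standard WebShop, ALFWorld, and REAL benchmarks for fault detection. Averaged over the five backbones, the proposed decomposition improves clarification F1 on ALFWorld-Clarification by 73% over ReAct+UE and by 36% over UAM, and it leads on clarification F1 for every backbone on WebShop-Clarification and for four of the five backbones on ALFWorld-Clarification, indicating that the gains extend beyond a single LLM.

English abstract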

Recent position papers argue that the classical aleatoric/epistemic uncertainty framework is insufficient for interactive large language model (LLM) agents and call for underspecification-aware, decomposed, and communicable uncertainty representations that can unlock new agent capabilities such as proactive clarification seeking and shared mental-model building. Practical deployment constraints -- black-box APIs, interactive latency budgets, and the absence of labeled trajectories -- rule out logprob-based, multi-sampling, and training-based methods, leaving prompt-based estimation as the most viable family for surfacing such signals at deployment time. We answer this call with a simple prompt-based decomposition that separates action confidence from request uncertainty (u), enabling the agent to ask for clarification when the task specification is ambiguous. To evaluate it, we introduce two clarification-augmented benchmarks (WebShop-Clarification and ALFWorld-Clarification) in which 50% of tasks are deliberately underspecified, and systematically compare the proposed decomposition against ReAct+UE and Uncertainty-Aware Memory (UAM) across five LLM backbones (GPT-5.1, DeepSeek-v3.2-exp, GLM-4.7, Qwen3.5-35B, GPT-OSS-120B) on these variants together with the standard WebShop, ALFWorld, and REAL benchmarks for fault detection. Averaged across the five backbones, the proposed decomposition improves clarification F1 on ALFWorld-Clarification by 73% over ReAct+UE and by 36% over UAM, and leads clarification F1 on every backbone on WebShop-Clarification and on four of five backbones on ALFWorld-Clarification, indicating that the gains generalize beyond a single LLM.

2606.19558 2026-06-19 cs.LG cs.CL New submission 70%

Displacement Is Not Direction: Evaluating Fidelity Metrics for Quantized LLM Deployment

Miloš Nikolić, Ali Hadi Zadeh, Enrique Torres Sanchez, Andreas Moshovos

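Affiliations * ByteShape; University of Toronto; Vector Institute for Artificial Intelligence

Topic match Other LLM: Evaluates fidelity metrics for quantized LLM deployment, an LLM application topic.

AI summary Studies how well fidelity metrics such as KL divergence correlate with downstream benchmark scores when deploying quantized language models; finds a strong correlation overall that breaks down in the near-baseline region, and attributes this to KL divergence mainly measuring the volume of disagreement rather than its direction.

Details
AI abstract (translated from Chinese)

Fidelity metrics, such as per-token KL divergence (KLD) against a high-precision reference model, are often used as low-cost proxies for benchmark quality. We test this practice on 28 quantized models of Qwen3.6-35B-A3B and 41 quantized models of Devstral-Small-2-24B across a suite of downstream benchmarks. We find that, over the full quantization cohort, KLD is strongly correlated with benchmark score (ρ=-0.72 on Qwen and ρ=-0.86 on Devstral, p<0.001). However, in the near-baseline silent zone the relationship becomes non-significant (ρ=+0.00 on Qwen and ρ=-0.24, p=0.36, on Devstral). This collapse persists across 14 measurement variants, including different KLD aggregations, perplexity formulations, top-1 agreement, calibration corpora, and context lengths. At the per-prompt level, KLD has only weak failure-prediction power on code tasks, with failed-to-passed geometric-mean ratios in [1.08, 1.22] across five models on LiveCodeBench, and it fails as a cross-model router, reaching only 42.3%-49.4% accuracy on disagreement prompts. We trace the collapse to a structural decomposition: KLD primarily measures the volume of disagreement with the reference model, with a silent-zone composite ρ of +0.94 (p<0.001) on Qwen and +0.55 (p=0.03) on Devstral, while its relationship to the direction of disagreement is weak and task-dependent.

English abstract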

Fidelity metrics, such as per-token KL divergence (KLD) against a high-precision reference, are often used in practice as low-cost proxies for benchmark quality. We test this practice on a 28-quant cohort of Qwen3.6-35B-A3B and a 41-quant cohort of Devstral-Small-2-24B, evaluated across a suite of downstream benchmarks. We find that KLD is strongly correlated with benchmark score over the full cohort ($ρ=-0.72$ on Qwen and $ρ=-0.86$ on Devstral, both with $p<0.001$). However, this relationship collapses to non-significance in the near-baseline silent zone ($ρ=+0.00$ on Qwen and $ρ=-0.24$, $p=0.36$, on Devstral). This collapse persists across 14 measurement variants, including different KLD aggregations, perplexity formulations, top-1 agreement, calibration corpora, and context lengths. At the per-prompt level, KLD has only weak failure-prediction power on code, with failed-vs-passed geometric-mean ratios in $[1.08,1.22]$ across five models on LiveCodeBench, and fails as a cross-model router, achieving only $42.3\%-49.4\%$ accuracy on disagreement prompts. We trace the collapse to a structural decomposition: KLD primarily measures the volume of disagreement with the reference, with silent-zone composite $ρ=+0.94$ ($p<0.001$) on Qwen and $+0.55$ ($p=0.03$) on Devstral, while its relationship to the direction of those disagreements is weak and task-conditional.
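The following toy sketch shows how per-token KL divergence and top-1 agreement against a reference can be computed, and why a larger KLD need not mean a changed prediction. The distributions are made up and the function names are hypothetical; this illustrates the general point and does not reproduce the paper's measurement setup.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def mean_token_kld(reference, candidate):
    """Average per-token KL divergence of the candidate against the reference."""
    return sum(kl_divergence(p, q) for p, q in zip(reference, candidate)) / len(reference)


def top1_agreement(reference, candidate):
    """Fraction of positions where both models pick the same most likely token."""
    same = sum(p.index(max(p)) == q.index(max(q)) for p, q in zip(reference, candidate))
    return same / len(reference)


# One token position over a three-token vocabulary (made-up numbers).
reference = [[0.5, 0.4, 0.1]]
quant_small_shift = [[0.4, 0.5, 0.1]]  # small displacement, but the top token flips
quant_large_shift = [[0.7, 0.2, 0.1]]  # larger displacement, top token unchanged

kld_small = mean_token_kld(reference, quant_small_shift)
kld_large = mean_token_kld(reference, quant_large_shift)
```

Here the candidate with the smaller KLD is the one whose greedy output changes, while the candidate with the larger KLD only becomes more confident in the reference's choice.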

2606.19476 2026-06-19 cs.LG cs.AI New submission 70%

Can In-Context Learning Support Intrinsic Curiosity?

Eric Elmoznino, Sangnie Bhardwaj, Johannes von Oswald, Rajai Nasser, Blaise Agüera y Arcas, João Sacramento, Rif A. Saurous, Guillaume Lajoie

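Affiliations * Google – Paradigms of Intelligence Team; Google DeepMind

Topic match Other LLM: Uses the in-context learning of sequence models as a world model to explore curiosity.

AI summary Investigates using the in-context learning ability of sequence models as an immediate, update-free world model to remove the gradient-descent computational bottleneck of traditional intrinsic curiosity methods, and proves that in non-temporal settings the resulting reward asymptotically converges to the true learning progress.

Details
AI abstract (translated from Chinese)

Effective machine learning depends not only on how we model data but also on which data we choose to collect. Although large sequence models have revolutionized data modeling, the problem of automated data selection, or "intrinsic curiosity", remains a major challenge. Classic approaches incentivize exploration by rewarding an agent according to its "learning progress", which measures how much a newly acquired observation improves a world model's predictive ability. Evaluating these rewards, however, has traditionally required expensive inner loops of gradient descent updates within each trajectory, which makes them computationally impractical at scale. In this work, we investigate whether the emergent in-context learning (ICL) capabilities of sequence models can remove this bottleneck by serving as immediate, update-free world models. Specifically, we evaluate whether an exploration policy can be trained to maximize learning progress using only the prediction errors of an in-context learner and counterfactual context manipulations. We first prove that in general Markov decision processes this is in fact impossible to do in an unbiased way: the resulting intrinsic rewards either contain nuisance terms that bias their estimate of true learning progress, or cannot be implemented from an in-context learner's prediction errors. By contrast, we prove a positive result for a broad subclass of non-temporal settings, which includes active learning and Bayesian experimental design: here, ICL-derived rewards successfully bound the true learning progress and converge to it asymptotically. We corroborate our theory with controlled experiments in continuous and symbolic environments, showing that our ICL-driven framework successfully trains curious data-collection policies that explore optimally.

English abstract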

Effective machine learning depends not only on how we model data, but also on what data we choose to collect. While large sequence models have revolutionized data modeling, the problem of automated data selection, or "intrinsic curiosity", remains a significant challenge. Classic approaches incentivize exploration by rewarding an agent based on its "learning progress", which measures how much a newly acquired observation improves a world model's predictive ability. However, evaluating these rewards traditionally requires expensive inner loops of gradient descent updates within each trajectory, rendering them computationally impractical at scale. In this work, we investigate whether the emergent in-context learning (ICL) capabilities of sequence models can eliminate this bottleneck by serving as immediate, update-free world models. Specifically, we evaluate whether an exploration policy can be trained to maximize learning progress, using solely the prediction errors and counterfactual context manipulations of an in-context learner. We first prove that in general Markov decision processes, this is in fact impossible in an unbiased way: the resulting intrinsic rewards either suffer from nuisance terms that bias their estimation of true learning progress, or they cannot be implemented using an in-context learner's prediction errors. Conversely, we prove a positive result for a broad subclass of non-temporal settings, encompassing active learning and Bayesian Experimental Design: here, ICL-derived rewards successfully bound and asymptotically converge to the true learning progress. We corroborate our theory with controlled experiments across continuous and symbolic environments, demonstrating that our ICL-driven framework successfully trains curious data-collection policies that explore optimally.

2606.19404 2026-06-19 cs.LG cs.CL New submission 70%

Thermodynamic Signatures of Reasoning: Free-Energy and Spectral-Form-Factor Diagnostics for Hallucination Detection in Large Language Models

Salim Khazem

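Affiliations * Talan Research & Innovation Center

Topic match Other LLM: Thermodynamic signatures for LLM hallucination detection.

AI summary Proposes Free-Energy Signatures (Fes), a spectral descriptor that treats the attention Laplacian as a Hamiltonian and extracts thermodynamic potentials and the random-matrix-theory spectral form factor to detect LLM hallucinations, achieving the strongest aggregate AUROC among attention-spectral baselines without updating the underlying LLM.

Details
AI abstract (translated from Chinese)

Hallucination detection in large language models (LLMs) is critical for deployment, and recent work shows that the spectrum of attention-derived graph Laplacians carries a strong signal about reasoning quality. Prior spectral diagnostics, however, summarize the Laplacian spectrum with only a few eigenvalues or hand-picked scalars and ignore most of its structure. We propose Free-Energy Signatures (Fes), a spectral descriptor that treats each layer's attention Laplacian as a Hamiltonian and extracts its thermodynamic potentials (partition function, free energy, spectral entropy, heat capacity) together with the random-matrix-theory (RMT) spectral form factor. We prove three results: (i) Lipschitz stability of Fes under attention perturbation; (ii) an expressiveness result showing that Fes enriches finite spectral summaries and approximates moment-derived spectral functionals under explicit regularity and grid-resolution assumptions; and (iii) a finite-sample PAC bound on the AUROC of a training-free detector built from Fes. Empirically, across six open-weight LLMs and six benchmarks, a lightweight probe on Fes descriptors achieves the strongest average AUROC among attention-spectral baselines, improving on LapEig by +6.5 AUROC points and on GoR-4 by +2.4 points on average, without updating the underlying LLM. In the fully unsupervised setting, an RMT-deviation score reaches a mean AUROC of 0.71, providing a label-free but weaker detector. A complementary RMT analysis shows that correct generations exhibit spectral statistics closer to Wigner-Dyson, whereas hallucinations exhibit statistics closer to Poisson. Anonymized code and configuration are provided in the supplementary material.

English abstract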

Hallucination detection in large language models (LLMs) is deployment-critical, and recent work shows that the spectrum of attention-derived graph Laplacians carries strong signal about reasoning quality. Prior spectral diagnostics, however, summarize the Laplacian spectrum by a handful of eigenvalues or hand-picked scalars, leaving most of its structure unused. We propose Free-Energy Signatures (Fes), a spectral descriptor that treats each layer's attention Laplacian as a Hamiltonian and extracts its thermodynamic potentials (partition function, free energy, spectral entropy, heat capacity) together with the random-matrix-theory (RMT) spectral form factor. We prove three results: (i) Lipschitz stability of Fes under attention perturbation; (ii) an expressiveness result showing that Fes enriches finite spectral summaries and approximates moment-derived spectral functionals under explicit regularity and grid-resolution assumptions; and (iii) a finite-sample PAC bound on the AUROC of a training-free detector built from Fes. Empirically, across six open-weight LLMs and six benchmarks, a lightweight probe on Fes descriptors achieves the strongest aggregate AUROC among attention-spectral baselines, improving over LapEig by $+6.5$ AUROC points and over GoR-4 by $+2.4$ points on average, while requiring no update to the underlying LLM. In the fully unsupervised setting, an RMT-deviation score achieves mean AUROC $0.71$, providing a label-free but weaker detector. A complementary RMT analysis shows that correct generations exhibit more Wigner-Dyson-like spectral statistics, whereas hallucinations exhibit more Poisson-like statistics. The anonymized code and config are provided in the supplementary material.
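For readers unfamiliar with the thermodynamic quantities named in this abstract, the sketch below computes the textbook canonical-ensemble versions from a list of eigenvalues treated as energy levels: Z = Σ exp(-βλ), F = -(1/β) ln Z, the Gibbs entropy, and the heat capacity β²·Var(λ). These are standard statistical-mechanics definitions assumed for illustration; the abstract does not give the paper's exact formulas or choice of β, and the spectral form factor is not sketched. The function name and inputs are hypothetical.

```python
import math


def thermodynamic_potentials(eigenvalues, beta=1.0):
    """Canonical-ensemble quantities for a spectrum treated as energy levels."""
    shift = min(eigenvalues)  # subtract the minimum for numerical stability
    weights = [math.exp(-beta * (lam - shift)) for lam in eigenvalues]
    z_shifted = sum(weights)
    probs = [w / z_shifted for w in weights]  # Boltzmann probabilities

    log_z = math.log(z_shifted) - beta * shift  # log of Z = sum_i exp(-beta * lambda_i)
    energy = sum(p * lam for p, lam in zip(probs, eigenvalues))
    second_moment = sum(p * lam * lam for p, lam in zip(probs, eigenvalues))
    return {
        "log_partition": log_z,
        "free_energy": -log_z / beta,
        "entropy": -sum(p * math.log(p) for p in probs if p > 0),
        "heat_capacity": beta ** 2 * (second_moment - energy ** 2),
        "energy": energy,
    }


# Toy spectra; a real use would pass the eigenvalues of an attention Laplacian.
degenerate = thermodynamic_potentials([0.0, 0.0], beta=1.0)
spread = thermodynamic_potentials([0.0, 1.0, 2.0], beta=0.5)
```

A useful sanity check is the identity S = β(U − F), which holds for any spectrum and any β > 0.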

2606.18812 2026-06-19 cs.LG cs.AI New submission 70%

Reinforcement Learning Foundation Models Should Already Be A Thing

Abdelrahman Zighem, Jill-Jênn Vie

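Affiliations * École normale supérieure de Paris, PSL University, Paris, France; Soda team, Inria Saclay, Palaiseau, France

Topic match Other LLM: Proposes the concept of a reinforcement learning foundation model.

AI summary Proposes building reinforcement learning foundation models from synthetic MDPs, using a fixed-size sufficient statistic that makes attention-based architectures applicable; online, the model needs far fewer episodes than UCB-VI and tabular Q-learning, and offline it is competitive with VI-LCB.

Details
AI abstract (translated from Chinese)

Foundation models for language and vision are powered by internet-scale data, whereas structured domains (tabular prediction, time-series forecasting, graph learning, reinforcement learning) are not. The substitute is synthetic data, which shifts the burden from collection to prior design. Such priors already exist for many structured tasks: TabPFN and its successors solve tabular classification with a Transformer pretrained on a synthetic Bayesian prior. We make two points. First, reinforcement learning is the conspicuous gap: sampling a synthetic MDP is as feasible as sampling a synthetic tabular dataset, yet no in-context RL work treats prior design as a primary objective. Second, MDPs admit a fixed-size sufficient statistic that is independent of the episodes observed and tabular in shape, which makes them directly amenable to the attention-based architectures used for tabular foundation models, with a policy head replacing the supervised target. Together these define the agenda for an RL foundation model. As a proof of concept, we train a model entirely on synthetic MDPs and show that, with no task-specific tuning, it solves held-out tabular benchmarks in context, both online and offline: online, in far fewer episodes than UCB-VI and tabular Q-learning; offline, competitively with VI-LCB.

English abstract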

Foundation models for language and vision are powered by internet-scale data, while structured domains such as tabular prediction are powered by synthetic data. This substitute shifts the challenge from collection to prior design. Such priors already exist for many structured tasks: TabPFN and its successors solve tabular classification with a transformer pretrained on a synthetic Bayesian prior. We make two points. First, reinforcement learning is the conspicuous gap: sampling a synthetic MDP is as feasible as sampling a synthetic tabular dataset, yet no in-context RL work treats prior design as a primary objective. Second, MDPs admit a fixed-size sufficient statistic, independent of the episodes observed and tabular in shape, which makes them directly amenable to the attention-based architectures used for tabular foundation models, with a policy head replacing the supervised target. Together these define the agenda for an RL foundation model. As a proof of concept, we train a Graph Attention Network entirely on synthetic MDPs and show that, with no task-specific tuning, it solves held-out tabular benchmarks in context, both online and offline: online, in far fewer episodes than UCB-VI and tabular Q-learning, and offline, competitively with VI-LCB.
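The abstract does not say which statistic the authors use. A common choice for a tabular MDP, assumed here purely for illustration, is the table of transition counts N(s, a, s') together with per-pair reward sums, whose shape depends only on the number of states and actions and not on how many episodes were observed. The class and method names below are hypothetical.

```python
class TabularMDPStatistic:
    """Transition counts and reward sums for an MDP with known, finite state and action sets."""

    def __init__(self, n_states, n_actions):
        self.n_states = n_states
        self.counts = [[[0] * n_states for _ in range(n_actions)] for _ in range(n_states)]
        self.reward_sums = [[0.0] * n_actions for _ in range(n_states)]

    def update(self, state, action, reward, next_state):
        """Fold one observed transition into the statistic; its shape never changes."""
        self.counts[state][action][next_state] += 1
        self.reward_sums[state][action] += reward

    def transition_estimate(self, state, action):
        """Empirical next-state distribution (uniform if the pair is unvisited)."""
        row = self.counts[state][action]
        total = sum(row)
        if total == 0:
            return [1.0 / self.n_states] * self.n_states
        return [c / total for c in row]

    def mean_reward(self, state, action):
        """Average observed reward for the pair (0.0 if unvisited)."""
        visits = sum(self.counts[state][action])
        return self.reward_sums[state][action] / visits if visits else 0.0


# Four made-up transitions: (state, action, reward, next_state).
stats = TabularMDPStatistic(n_states=2, n_actions=2)
for transition in [(0, 1, 1.0, 1), (0, 1, 0.0, 0), (0, 1, 1.0, 1), (1, 0, 0.5, 1)]:
    stats.update(*transition)
```

Because each (state, action) pair contributes one fixed-width row, a table like this can be fed to a model that attends over rows, which is the kind of input that tabular foundation models already consume.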

2606.14784 2026-06-19 cs.SD cs.LG eess.AS New submission 70%

LLM-Based Synthetic Ground Truth Generation for Audio-Based Emotion Classification via In-Context Learning

Qing Huang, Pooja Pol, Jianing Zhang

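Affiliations * School of Business, Technical University of Applied Sciences Augsburg; Data Science und Autonome Systeme Technologietransferzentrum (TTZ)

Topic match Other LLM: LLM-generated emotion labels for audio.

AI summary Proposes using large language models (LLMs) and in-context learning (ICL) to automatically generate emotion-related synthetic ground truth labels from streaming speech data in multi-user VR environments, addressing the difficulty of annotating team collaboration states.

Comments https://icaiit.org/paper.php?paper=14th_ICAIIT_2/3_9

Details
AI abstract (translated from Chinese)

Understanding human states and interaction dynamics is a core goal of human-computer interaction (HCI). As interaction paradigms become more immersive, virtual reality (VR) has become a powerful platform for studying collaborative work. In such environments, evaluating team collaboration states, including team performance and team resilience, requires continuous and reliable inference of latent team-level cognitive and affective states from multi-modal sensor data such as speech signals. However, generating ground truth labels for these latent states remains challenging because of sensor noise, contextual variability, and sparse expert annotations. Traditional self-report methods provide only static, delayed measurements and are therefore insufficient for capturing the dynamic team processes reflected in continuous speech data. In this work, we propose a large language model (LLM)-driven, agentic inference workflow that automatically generates emotion-related synthetic ground truth labels from streaming speech data in multi-user VR environments. Leveraging the generalization ability of LLMs, we use in-context learning (ICL) with few-shot demonstrations consisting of paired audio samples and their corresponding transcriptions. ICL tends to achieve task adaptation comparable to model fine-tuning while avoiding the computational overhead of parameter updates. To build informative and robust in-context prompts, we adopt a retrieval-based selection strategy that dynamically identifies relevant audio demonstrations according to their similarity in the acoustic feature space.

English abstract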

Understanding human states and interaction dynamics is a core goal of human-computer interaction (HCI). As interaction paradigms become more immersive, virtual reality (VR) has emerged as a powerful platform for studying collaborative work. In such settings, evaluating team collaboration states, including team performance and team resilience, requires continuous and reliable inference of latent team-level cognitive and affective states from multi-modal sensor data, such as speech signals. However, generating ground truth labels for these latent states remains challenging due to sensor-induced noise, contextual variability, and sparse expert annotations. Traditional self-reporting approaches provide only static and delayed measurements and are therefore insufficient for capturing dynamic team processes reflected in continuous speech data. In this work, we propose a large language model (LLM)-driven, agentic inference workflow for automated emotion-related synthetic ground truth generation from streaming speech data in multi-user VR environments. Leveraging the generalization capabilities of LLMs, we use In-Context Learning (ICL) with few-shot demonstrations of paired audio-based samples and their corresponding transcriptions. ICL tends to achieve task adaptation comparable to model fine-tuning while circumventing the computational overhead of parameter updates. To construct informative and robust in-context prompts, we adopt a retrieval-based selection strategy that dynamically identifies relevant audio demonstrations based on similarity in the acoustic feature space.

2606.10616 2026-06-19 cs.AI New submission 70%

Learning What to Remember: Observability-Safe Memory Retention via Constrained Optimization for Long-Horizon Language Agents

Qingcan Kang, Liu Mingyang, Shixiong Kai, Kaichao Liang, Tao Zhong, Mingxuan Yuan

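Affiliations * Huawei Noah's Ark Lab; Department of Computer Science, City University of Hong Kong

Topic match Other LLM: Proposes a method to address the context window limits of language agents.

AI summary For long-horizon language agents with limited context windows, proposes the OSL-MR framework, which formulates memory retention as a constrained stochastic optimization problem and learns query-conditioned evidence value under a strict separation between online-observable features and offline supervision; experiments show that it outperforms existing methods under tight budgets.

Details
AI abstract (translated from Chinese)

The observations, reasoning traces, and retrieved facts accumulated by long-horizon language agents exceed their limited context windows, which makes memory retention a fundamental resource-allocation problem. Existing memory systems improve management through heuristic scoring, retrieval optimization, or learned compression, but most treat retention as a local decision problem and do not explicitly model its long-term consequences under realistic observability constraints. To fill this gap, we formulate memory retention as a constrained stochastic optimization problem with explicit budget feasibility, evidence utility, and delayed costs (including miss penalties, reacquisition latency, and the risk of stale information). We then propose OSL-MR (Observability-Safe Learning for Memory Retention), a novel framework that enforces a strict separation between online-observable features and offline-available supervision (OAS). OSL-MR combines an evidence learner trained on realized-evidence supervision with a Mixed-Score heuristic that serves both as a deployable online-safe baseline and as a structured inductive prior for learning. The resulting policy learns query-conditioned evidence value directly from interaction data while remaining deployable under the same observability constraints. Experiments on LoCoMo and LongMemEval show that OSL-MR consistently outperforms recency-based methods, Generative Agents-style scoring, and other heuristic baselines under tight memory budgets. The Mixed-Score prior further improves precision while preserving recall, and sensitivity analysis shows that the approach is robust across a wide range of cost configurations.

English abstract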

Long-horizon language agents accumulate observations, reasoning traces, and retrieved facts exceeding context windows, making memory retention a fundamental resource-allocation problem. Existing systems treat retention as local and do not model long-term consequences under observability constraints. To fill this gap, we formulate memory retention as a constrained stochastic optimization with budget feasibility, evidence utility, and delayed costs including miss, reacquisition, and stale penalties. We show this multi-step problem is NP-hard, making exact solution intractable. Moreover, deployment decisions must be made under partial observability. To address these challenges, we propose OSL-MR (Observability-Safe Learning for Memory Retention), a learning-augmented framework that enforces a strict separation between online-observable features and offline-available supervision. OSL-MR combines an evidence learner trained from realized evidence with a Mixed-Score heuristic that serves as a deployable online-safe baseline and an inductive prior. The policy learns query-conditioned evidence from interaction data and remains deployable under the same constraints. Experiments on LoCoMo and LongMemEval show OSL-MR outperforms recency-based, Generative Agents-style, and other heuristic baselines, especially under tight budgets. The Mixed-Score prior improves precision and recall, and sensitivity analysis shows robustness across cost settings. On small solvable instances, single-step optimization is insufficient to anticipate future demand shifts, while OSL-MR stays significantly closer to the dynamic-programming optimum, confirming the necessity of the sequential formulation and reinforcing our learning-guided approximation. These results establish constrained stochastic optimization and optimization-guided learning as a principled foundation for memory management in long-horizon agents.

2606.20537 2026-06-19 cs.LG cs.DC 新提交 65%

Execution-State Capsules: Graph-Bound Execution-State Checkpoint and Restore for Low-Latency, Small-Batch, On-Device Physical-AI Serving

执行状态胶囊:面向低延迟、小批量、设备端物理AI服务的图绑定执行状态检查点与恢复

Liang Su

发表机构 * GitHub

专题命中 其他LLM :面向LLM服务的执行状态检查点与恢复机制

AI总结 针对低延迟、小批量、设备端物理AI服务场景,提出执行状态胶囊机制,通过图绑定检查点与恢复完整可恢复状态,在RTX 5090上实现亚毫秒级恢复,TTFT加速比达3.9倍至27倍。

Comments 27 pages, 9 figures

详情
AI中文摘要

主流LLM服务系统主要通过分页或基数树(radix)键值(KV)缓存重用前缀计算。这对于高吞吐量、高并发服务非常有效,但它只管理执行状态中按位置划分的一个片段:KV缓存。我们研究相反的场景:低延迟、小批量、设备端物理AI服务,其中交互式LLM代理、语音系统和机器人策略在严格的响应预算下频繁分支、重置、中断和重新进入。我们引入执行状态胶囊,一种图绑定的检查点和恢复机制,用于在提交边界处保存完整的可恢复状态。FlashRT是一个白盒、面向后端的内核运行时,其评估的NVIDIA CUDA后端在连续的静态缓冲区上运行捕获的图计划,无需块表间接寻址。由于活动状态是一个由命名缓冲区构成的封闭集合,胶囊可以快照、恢复、分叉或回滚整个执行边界,包括KV、循环状态、卷积状态、MTP状态和元数据。这将重用的粒度从按令牌寻址的KV片段转移到图绑定的执行状态边界。在RTX 5090上,胶囊恢复在存储状态级别是字节精确的,在贪婪解码下是令牌一致的。仅KV的消融实验出现分歧,表明循环状态不可或缺。GPU驻留的快照和恢复是亚毫秒级的,TTFT相对于冷预填充的加速比从2k令牌时的3.9倍增长到16k令牌时的27倍。在Jetson AGX Thor和DGX Spark上,相同的正确性和结构属性成立。胶囊不是高吞吐量KV缓存服务的替代品;它们定义了一个互补的以延迟为先的服务点,用于显式执行状态重用。

英文摘要

Mainstream LLM serving systems reuse prefix work mainly through paged or radix key-value (KV) caches. This is highly effective for high-throughput, high-concurrency serving, but it manages only one positional fragment of execution state: the KV cache. We study the opposite regime: low-latency, small-batch, on-device physical-AI serving, where interactive LLM agents, speech systems, and robot policies repeatedly branch, reset, interrupt, and re-enter under tight responsiveness budgets. We introduce execution-state capsules, a graph-bound checkpoint and restore mechanism for the complete restorable state at a committed boundary. FlashRT is a white-box, backend-facing kernel runtime whose evaluated NVIDIA CUDA backend runs captured graph plans over contiguous static buffers with no block-table indirection. Because the live state is a closed set of named buffers, a capsule can snapshot, restore, fork, or roll back the whole execution boundary, including KV, recurrent state, convolution state, MTP state, and metadata. This moves reuse from token-addressed KV fragments to graph-bound execution-state boundaries. On an RTX 5090, capsule restore is byte-exact at the stored-state level and token-identical under greedy decode. A KV-only ablation diverges, showing that recurrent state is load-bearing. GPU-resident snapshot and restore are sub-millisecond, and TTFT speedup over cold prefill grows from 3.9x at 2k tokens to 27x at 16k tokens. On Jetson AGX Thor and DGX Spark, the same correctness and structural properties hold. Capsules are not a replacement for high-throughput KV-cache serving; they define a complementary latency-first serving point for explicit execution-state reuse.

2606.19850 2026-06-19 cs.LG cs.AI 新提交 65%

Neural Additive and Basis Models with Feature Selection and Interactions

具有特征选择和交互的神经加性模型与神经基模型

Yasutoshi Kishimoto, Kota Yamanishi, Takuya Matsuda, Shinichi Shirakawa

发表机构 * Yokohama National University(横滨国立大学)

专题命中 其他LLM :提出在神经加性模型中引入特征选择机制,属于可解释机器学习方法,与LLM无直接关系。

AI总结 提出在神经加性模型和神经基模型中引入特征选择机制,通过特征选择层减少计算开销,并支持高维数据中的特征交互学习,性能优于或持平于现有GAM方法。

Comments Accepted at PAKDD 2024. Code is available at https://github.com/shiralab/NAM-FS

详情
AI中文摘要

深度神经网络(DNN)在各个领域表现出色,但通常可解释性较低。神经加性模型(NAM)及其变体神经基模型(NBM)在广义加性模型(GAM)中使用神经网络(NN)作为非线性形状函数。这两种模型具有高度可解释性,并且在NN训练中表现出良好的性能和灵活性。NAM和NBM基于GAM架构,可以提供并可视化每个特征对预测的贡献。然而,当使用双输入NN来考虑特征交互或将其应用于高维数据集时,由于所需计算资源的增加,NAM和NBM的训练变得难以进行。本文提出将特征选择机制融入NAM和NBM以解决计算瓶颈。我们在两种模型中引入特征选择层,并在训练过程中更新选择权重。我们的方法简单,与原始NAM和NBM相比,可以降低计算成本和模型大小。此外,它使我们即使在数据维度很高的情况下也能使用双输入NN并捕获特征交互。我们证明,所提出的模型与原始NAM和NBM相比计算效率更高,并且与最先进的GAM相比表现出更好或相当的性能。

英文摘要

Deep neural networks (DNNs) exhibit attractive performance in various fields but often suffer from low interpretability. The neural additive model (NAM) and its variant called the neural basis model (NBM) use neural networks (NNs) as nonlinear shape functions in generalized additive models (GAMs). Both models are highly interpretable and exhibit good performance and flexibility for NN training. NAM and NBM can provide and visualize the contribution of each feature to the prediction owing to GAM-based architectures. However, when using two-input NNs to consider feature interactions or when applying them to high-dimensional datasets, training NAM and NBM becomes intractable due to the increase in the computational resources required. This paper proposes incorporating the feature selection mechanism into NAM and NBM to resolve computational bottlenecks. We introduce the feature selection layer in both models and update the selection weights during training. Our method is simple and can reduce computational costs and model sizes compared to vanilla NAM and NBM. In addition, it enables us to use two-input NNs even in high-dimensional datasets and capture feature interactions. We demonstrate that the proposed models are computationally efficient compared to vanilla NAM and NBM, and they exhibit better or comparable performance with state-of-the-art GAMs.

2606.19819 2026-06-19 cs.CL cs.AI 新提交 65%

CREDENCE: Claim Reduction for Decomposition & Enhanced Credibility -- Semantic Metrics and Convergence Analysis

CREDENCE: 面向分解与增强可信度的声明缩减——语义度量与收敛性分析

Phuong Huu Vu Tran, Thuan Duc Mai, Bach Xuan Le

发表机构 * Vietnamese-German University(越南德国大学) Ho Chi Minh University of Technology(胡志明市理工大学)

专题命中 其他LLM :声明分解和修复框架,用于事实核查,与LLM应用相关。

AI总结 提出CREDENCE框架,通过语义F1度量解决Jaccard度量对释义声明的低估问题,并形式化分析修复管道的收敛性,实验表明语义F1比Jaccard F1提升15-32个百分点,规则修复将原子性违反率降低47-100%。

Comments 40 pages, 6 figures, 19 tables. Submitted to Language Resources and Evaluation

详情
AI中文摘要

将复合句分解为原子化的、可验证的声明是可靠自动化事实核查的前提。先前工作依赖基于词重叠(Jaccard)的度量,系统性地低估了释义声明的分解质量,并且缺乏对修复循环的形式化终止分析。我们提出CREDENCE,一个改进的声明分解与评估框架,解决了这两个缺陷。我们的贡献包括:(1) 语义F1:我们使用BGE-large余弦相似度保真度度量,解决了Jaccard的惩罚问题,并提高了下游事实核查的准确性;(2) 收敛定理:我们形式化地表征了修复管道的四个性质,确立了在预言解析器假设下基于规则的修复是单调且有限终止的;基于LLM的自修复被证明是非单调的,需要早期退出保护;(3) 三个评估基准,涵盖社交媒体、百科全书和新闻领域,用于跨领域泛化度量;(4) 跨四个分解器模型(3.8B-12B)和一个封闭API模型的多模型基准测试。在SocialClaimSplit、WikiSplitBench和ClaimDecompBench上的实验表明,语义F1比Jaccard F1提升15-32个百分点。在SocialClaimSplit和WikiSplitBench上,EPR范围为0.94至1.00,而ClaimDecompBench由于更难的新闻领域构造,包含较低的基线EPR情况(低至0.824),规则修复相对于基线模型将原子性违反率(AVR)降低了47-100%,且不降低保真度。

英文摘要

Decomposing compound sentences into atomic, verifiable claims is a prerequisite for reliable automated fact-checking. Prior work has relied on token-overlap (Jaccard) metrics that systematically underestimate decomposition quality for paraphrastic claims, and has lacked formal termination analysis for the repair loop. We present Credence, a revised claim decomposition and evaluation framework addressing both shortcomings. Our contributions are: (1) Semantic-F1: we use BGE-large cosine similarity fidelity metric that resolves Jaccard's penalisation and improves downstream fact-checking accuracy; (2) Convergence theorems: we formally characterise four properties of the repair pipeline, establishing that rule-based repair is monotone and finitely terminating under an oracle parser assumption; LLM-based self-repair is provably non-monotone and requires an early-exit guard; (3) Three evaluation benchmarks spanning social-media, encyclopaedic, and news domains for cross-domain generalisation measurement; (4) Multi-model benchmarking across four decomposer models (3.8B-12B) and a closed API model. Experiments on SocialClaimSplit, WikiSplitBench, and ClaimDecompBench show that Semantic-F1 outperforms Jaccard-F1 by +15-32pp. EPR ranges from 0.94 to 1.00 on SocialClaimSplit and WikiSplitBench, while ClaimDecompBench includes lower base EPR cases (down to 0.824) due to harder news-domain constructions, and rule-repair reduces the Atomicity Violation Rate (AVR) by 47-100% relative to the base model without degrading fidelity.

2606.19721 2026-06-19 cs.LG cs.AI 新提交 65%

OnDeFog: Online Decision Transformer under Frame Dropping

OnDeFog:丢帧条件下的在线Decision Transformer

Daiki Yotsufuji, Kenta Nishihara, Shoma Shimizu, Kento Uchida, Shinichi Shirakawa

发表机构 * Yokohama National University(横滨国立大学)

专题命中 其他LLM :提出在线Decision Transformer方法以处理丢帧问题。

AI总结 针对帧丢失导致性能下降的问题,提出OnDeFog,将DeFog机制与在线Decision Transformer(ODT)结合,通过直接环境交互学习策略,在高丢帧率环境下优于ODT,在低奖励数据集上优于DeFog。

Comments Accepted to PRICAI 2025

详情
AI中文摘要

在具有挑战性的现实世界强化学习应用中,通信延迟或传感器故障经常导致帧丢失,此时智能体无法接收被丢弃的状态及相关奖励。为了解决帧丢失导致的性能下降问题,已有工作在Decision Transformer中引入处理帧丢失的额外机制,提出了随机帧丢失下的Decision Transformer(DeFog)。尽管DeFog可以缓解帧丢失环境中的性能下降,但由于它是一种离线学习方法,难以有效泛化到训练数据集中未充分覆盖的新状态。在本研究中,我们提出OnDeFog,它将DeFog中的机制与在线Decision Transformer(ODT)相结合,ODT是一种通过直接环境交互学习策略的在线强化学习方法。全面的实验评估表明,我们提出的OnDeFog在高丢帧率环境下相比ODT取得了更优的性能,并且在包含大量低奖励数据的数据集上优于DeFog。

英文摘要

In challenging real-world reinforcement learning applications, communication delays or sensor failures often cause frame dropping, in which the agent cannot receive the dropped states and associated rewards. To address the performance degradation caused by frame dropping, the Decision Transformer under Random Frame Dropping (DeFog) was developed by incorporating additional mechanisms into the decision transformer to tackle frame dropping. Although DeFog can mitigate performance degradation in frame-dropping environments, since DeFog is an offline learning method, it struggles to effectively generalize to novel states not adequately represented in the training dataset. In this study, we propose OnDeFog, which integrates the mechanisms in DeFog with the online decision transformer (ODT), an online reinforcement learning method that learns policies through direct environmental interaction. Comprehensive experimental evaluation demonstrates that our proposed OnDeFog achieves superior performance compared to ODT in environments characterized by high dropping frame rate and outperforms DeFog on datasets containing a large amount of low-reward data.

2606.19587 2026-06-19 stat.ML cs.LG 新提交 60%

A Solver-Free Training Method for Predict-then-Optimize

一种无求解器的预测后优化训练方法

Beichen Wan, Mo Liu

发表机构 * Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, NC, USA(统计与运筹学系,北卡罗来纳大学教堂山分校)

专题命中 其他LLM :提出无求解器的训练方法以优化预测模型,属于决策聚焦学习,与LLM无直接关系。

AI总结 提出一种基于测度变换的决策聚焦学习管道,通过无求解器代理损失实现预测后优化中预测模型的高效训练,理论保证Fisher一致性,训练时间降低数个数量级。

Comments Accepted by ICML 2026

详情
AI中文摘要

我们提出了一种可扩展的方法,用于在预测后优化范式中训练预测(机器学习)模型,其中模型输出作为后续线性优化任务的系数。直接最小化经验决策遗憾对于线性规划和组合优化是不可行的,因为决策映射是分段常数,且梯度几乎处处为零。虽然现有方法通过平滑微分过程来解决这一问题,但它们存在可扩展性问题,因为每次梯度评估都需要调用计算昂贵的求解器。为了解决这个问题,我们提出了一种基于测度变换原理的决策聚焦学习管道,该管道在训练期间产生一个完全无优化求解器的新代理损失。我们建立了理论保证,包括Fisher一致性和超额风险界。实验上,我们的方法在实现与最先进方法相当的决策质量的同时,将训练时间减少了数个数量级。

英文摘要

We propose a scalable method for training prediction (machine learning) models in the predict-then-optimize paradigm, where model outputs serve as coefficients for a subsequent linear optimization task. Directly minimizing the empirical decision regret is intractable for linear programming and combinatorial optimization since the decision mapping is piecewise constant, and the gradients are zero almost everywhere. While existing methods address this by smoothing the differentiation process, they suffer from scalability issues, since a computationally expensive solver call is required for every gradient evaluation. To address this, we propose a decision-focused learning pipeline based on a measure transformation principle, which yields a new surrogate loss that is completely optimization-solver-free during training. We establish theoretical guarantees, including Fisher consistency and excess risk bounds. Empirically, our method achieves decision quality competitive with state-of-the-art methods while reducing training time by orders of magnitude.
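The zero-gradient problem the abstract describes can be seen in a toy case. The sketch below is not the paper's surrogate loss; it assumes a hypothetical three-option selection problem (pick the cheapest option) and shows that the decision regret does not change under small perturbations of the predicted costs, so a finite-difference gradient is zero. The names `decision`, `regret`, `cost_true`, and `cost_hat` are illustrative.

```python
import numpy as np

def decision(cost_hat):
    """Solve the toy optimization: pick the option with the lowest predicted cost."""
    return int(np.argmin(cost_hat))

def regret(cost_hat, cost_true):
    """True cost of the chosen option minus the best achievable true cost."""
    return float(cost_true[decision(cost_hat)] - cost_true.min())

cost_true = np.array([3.0, 1.0, 2.0])   # option 1 is truly cheapest
cost_hat = np.array([1.5, 2.0, 2.5])    # the prediction wrongly favors option 0

base_regret = regret(cost_hat, cost_true)

# Central finite differences of the regret with respect to each predicted cost.
eps = 1e-3
finite_diff_grad = np.array([
    (regret(cost_hat + eps * e, cost_true) - regret(cost_hat - eps * e, cost_true)) / (2 * eps)
    for e in np.eye(3)
])
```

Because the chosen option only changes when two predicted costs cross, the regret surface is flat between crossings, which is why smoothed or surrogate losses are needed for gradient-based training.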

2606.19410 2026-06-19 stat.ML cs.LG 新提交 60%

The Representational Limit of Scalar Interactions: An Interventional Decomposition

标量交互的表征限制:一种干预分解

Potito Aghilar, Sabino Roccotelli, Stanislao Fidanza, Vito Walter Anelli, Sebastiano Stramaglia, Tommaso Di Noia

发表机构 * Polytechnic University of Bari(巴里理工学院) University of Bari Aldo Moro(巴里大学Aldo Moro)

专题命中 其他LLM :提出特征交互分解方法,可用于模型解释

AI总结 本文证明标量交互指标混淆了唯一性、冗余性和协同性,并提出Stochastic Hi-Fi方法,通过干预掩码推理分解每个特征的U/R/S轮廓,在表格和图像任务中恢复被标量基线遗漏的结构。

详情
AI中文摘要

有符号的成对交互指标从根本上混淆了唯一性(U)、冗余性(R)和协同性(S)。我们在一个最小的3路XOR结构因果模型上证明了这一点:忠实的指标如Shapley-Taylor对每对返回零,而投影指标如Shapley Interaction将三阶效应扩散到混淆三种机制的成对标量中。我们引入了Stochastic Hi-Fi,一种事后、无需重新训练的可预测性分解方法,通过干预掩码推理估计每个特征的U/R/S轮廓。该估计器提供精确的干预语义、有限样本蒙特卡洛界限、耦合菱形采样带来的严格方差减少以及均匀的有限词汇收敛。在表格SCM上,Stochastic Hi-Fi恢复了被标量基线遗漏的结构(交互幅度恢复比高达411倍)。它还在GPT-2 IOI电路中分离了冗余和协同头。在NIH ChestX-ray14上,Stochastic Hi-Fi在Pointing Game中匹配GradCAM,并在Deletion AUC上显著改进。

英文摘要

Signed pairwise interaction scores fundamentally conflate uniqueness (U), redundancy (R), and synergy (S). We prove this on a minimal 3-way XOR structural causal model: faithful indices such as Shapley-Taylor return zero per pair, whereas projective indices such as Shapley Interaction spread the third-order effect into pair scalars that conflate the three mechanisms. We introduce Stochastic Hi-Fi, a post-hoc, retraining-free predictability decomposition that estimates per-feature U/R/S profiles by interventional masked inference. The estimator provides exact interventional semantics, finite-sample Monte Carlo bounds, strict variance reduction from coupled diamond sampling, and uniform finite-vocabulary convergence. Across tabular SCMs, Stochastic Hi-Fi recovers structure missed by scalar baselines (up to 411x larger interaction-magnitude recovery ratios). It also separates redundant and synergistic heads in the GPT-2 IOI circuit. On NIH ChestX-ray14, Stochastic Hi-Fi matches GradCAM on Pointing Game and improves substantially on Deletion AUC.
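The reason pairwise scores carry no signal on a 3-way XOR can be checked by enumeration. The sketch below does not compute Shapley-Taylor or Shapley Interaction indices; as a simpler stand-in, it assumes uniform inputs in a ±1 encoding (where XOR becomes a product) and evaluates the correlation of the target with every single feature, every pair product, and the triple product.

```python
from itertools import combinations, product

import numpy as np

# All 8 inputs in ±1 encoding; under this encoding 3-way XOR is the product of the signs.
X = np.array(list(product([-1, 1], repeat=3)))
y = X.prod(axis=1)

# Correlation of the target with each single feature.
single = [float(np.mean(y * X[:, i])) for i in range(3)]

# Correlation of the target with each pairwise product.
pairs = {
    (i, j): float(np.mean(y * X[:, i] * X[:, j]))
    for i, j in combinations(range(3), 2)
}

# Correlation of the target with the third-order product.
triple = float(np.mean(y * X[:, 0] * X[:, 1] * X[:, 2]))
```

Every first- and second-order term is exactly zero while the third-order term is one, so all of the structure lives at order three.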

2606.20518 2026-06-19 cs.AI 新提交 60%

FlowEdit: Associative Memory for Lifelong Pronunciation Adaptation in Flow-Matching TTS

FlowEdit: 流匹配TTS中终身发音适应的联想记忆

Harshit Singh, Ayush Pratap Singh, Nityanand Mathur

发表机构 * University Of Maryland(马里兰大学) TU Darmstadt(达姆施塔特工业大学) Smallest AI

专题命中 其他LLM :流匹配TTS的终身发音适应

AI总结 针对流匹配TTS部署后无法纠正专有名词发音错误的问题,提出FlowEdit框架,通过潜在条件编辑而非权重更新学习发音修正,并利用现代Hopfield网络存储和检索修正,在312个多语言专有名词基准上将音素错误率降低92.7%。

详情
AI中文摘要

流匹配文本到语音系统在零样本场景下表现出色,但部署后保持静态:除非重新训练模型,否则对词汇表外的专有名词的发音错误会持续存在。我们提出FlowEdit,一个用于冻结的流匹配TTS的终身适应框架,它将发音修正学习为潜在条件编辑而非权重更新。当提供纠正性反馈时,FlowEdit优化文本嵌入空间中的令牌级扰动,然后将修正存储在作为内容可寻址情景记忆的现代Hopfield网络中。在推理时,通过具有相似性门控的软注意力检索修正,实现模糊形态匹配。在我们整理的涵盖18个语系的312个多语言专有名词基准上,FlowEdit相对于零样本基线将目标词音素错误率降低了92.7%,同时保持相同的通用语音质量。修正过程在单个GPU上大约15秒完成。

英文摘要

Flow-matching text-to-speech systems achieve remarkable zero-shot quality but remain static after deployment: pronunciation errors on out-of-vocabulary proper nouns persist unless the model is retrained. We introduce FlowEdit, a life-long adaptation framework for frozen flow-matching TTS that learns pronunciation corrections as latent conditioning edits rather than weight updates. When corrective feedback is provided, FlowEdit optimizes a token-level perturbation in the text embedding space, then stores the correction in a Modern Hopfield Network serving as content-addressable episodic memory. At inference, corrections are retrieved via soft attention with a similarity gate, enabling fuzzy morphological matching. On our curated benchmark of 312 multilingual proper nouns across 18 language families, FlowEdit reduces target-word Phoneme Error Rate by 92.7% relative to the zero-shot baseline while maintaining identical general-speech quality. Corrections complete in approximately 15 seconds on a single GPU.

2606.20431 2026-06-19 cs.LG 新提交 60%

Sparsity, Superposition, and Forgetting: A Mechanistic Study of Representation Retention in Continual Learning

稀疏性、叠加与遗忘:持续学习中表示保持的机制研究

Jan Wasilewski, Jędrzej Kozal, Michał Woźniak, Bartosz Krawczyk

发表机构 * Rochester Institute of Technology(罗切斯特理工学院) Wrocław University of Science and Technology(弗罗茨瓦夫科技大学)

专题命中 其他LLM :研究持续学习中的遗忘机制,与LLM相关

AI总结 通过可控玩具框架研究持续学习中的遗忘机制,发现叠加随时间增加但任务边界处有瞬降,高稀疏性增加叠加但不必然导致遗忘,任务级有效秩随稀疏性增长。

详情
AI中文摘要

持续学习(CL)系统常常遗忘先前获得的知识,但由于真实数据集纠缠了许多因素,遗忘的机制在实践中难以分离。我们提出了一个可控的玩具世界框架,使这些机制可观察和可测试。使用合成生成器-分离器流水线,我们定义了真实潜在特征,构建了具有可调稀疏性和重叠的任务,并引入了表示强度和叠加(特征间的方向重叠)的可测量量。然后,我们通过SINDy拟合保留、叠加和暴露历史之间的稀疏动态关系,研究保留动态(即表示强度随时间的变化)。基于有效秩的互补任务级分析刻画了表示容量如何在任务间分配。我们的受控实验得出三个要点。(1)叠加总体上随时间增加,但在任务边界处出现瞬时下降,表明这是边界特定的干扰而非稳定漂移。(2)更高的特征稀疏性导致更多叠加,但不必然引起遗忘;当表示保持较强时,尽管存在重叠,遗忘仍可减少。(3)任务级有效秩随稀疏性增长,表明在稀疏机制下容量的使用更为广泛。这些结果共同修正了“叠加越多、遗忘越多”的常见直觉,表明重叠会与表示强度和容量分配相互作用。我们的玩具分析为CL提供了可证伪的假设和诊断工具。

英文摘要

Continual learning (CL) systems often forget previously acquired knowledge, yet the mechanisms driving forgetting remain hard to isolate in practice because real datasets entangle many factors. We present a controlled, toy-world framework that makes these mechanisms observable and testable. Using a synthetic generator-separator pipeline, we define ground-truth latent features, build tasks with tunable sparsity and overlap, and introduce measurable quantities for representation strength and superposition (directional overlap among features). We then study retention dynamics (the temporal change of representation strength) by fitting sparse dynamical relations (via SINDy) between retention, superposition, and exposure history. A complementary task-level analysis based on effective rank characterizes how representational capacity is allocated across tasks. Our controlled experiments yield three takeaways. (1) Superposition tends to increase over time with transient dips at task boundaries, suggesting boundary-specific interference rather than steady drift. (2) Higher feature sparsity induces more superposition yet does not inevitably cause forgetting; when representations remain strong, forgetting can be reduced despite overlap. (3) Task-level effective rank grows with sparsity, indicating broader capacity usage under sparse regimes. Together, these results nuance the common intuition that more superposition leads to more forgetting by showing that overlap interacts with representation strength and capacity allocation. Our toy analysis provides falsifiable hypotheses and diagnostic tools for CL.

2606.20254 2026-06-19 cs.CR 新提交 60%

Quantization as a Malicious Task: Removing Quantization-Conditioned Backdoors via Task Arithmetic

量化作为恶意任务:通过任务算术移除量化条件后门

Kaihsun Yang, Min-Yan Tsai, Chia-Mu Yu

专题命中 其他LLM :防御量化后门,涉及模型安全

AI总结 提出QVec方法,通过将量化引起的权重变化视为恶意任务向量,在部署前进行参数校正,无需重训练或触发样本即可防御量化条件后门。

详情
AI中文摘要

模型量化被广泛采用,以在资源受限设备上部署深度神经网络时减少内存使用和推理成本。然而,最近的研究揭示了一种新的安全威胁,称为量化条件后门(QCBs),其中模型在全精度下行为正常,但仅在量化后激活恶意行为。现有的防御通常修改量化过程或校正激活统计,往往引入额外的计算开销或依赖特定的量化设置。在这里,我们提出QVec,一种从参数空间角度防御QCBs的方法。我们观察到,全精度模型与其量化版本之间的权重差异编码了一种结构化的行为偏移,可以解释为恶意任务向量,而非随机量化噪声。基于这一见解,QVec通过在部署前进行受控的参数校正来抵消这一恶意方向。QVec无需重新训练,无需触发样本,仅需一次量化传递来估计参数偏移,以及轻量级的超参数搜索。在图像分类基准和多个大型语言模型(LLM)攻击场景中的大量实验表明,QVec在保持干净性能的同时,持续抑制后门激活。

英文摘要

Model quantization is widely adopted to reduce memory usage and inference cost when deploying deep neural networks on resource-constrained devices. However, recent studies have revealed a new security threat known as Quantization-Conditioned Backdoors (QCBs), where a model behaves normally in full precision but activates malicious behavior only after quantization. Existing defenses typically modify quantization procedures or correct activation statistics, often introducing additional computational overhead or relying on specific quantization settings. Here, we present QVec, a parameter-space perspective for defending against QCBs. We observe that the weight difference between a full-precision model and its quantized counterpart encodes a structured behavioral shift, which can be interpreted as a malicious task vector rather than random quantization noise. Based on this insight, QVec counteracts this malicious direction through controlled parameter correction prior to deployment. QVec requires no retraining, no trigger samples, and only a single quantization pass to estimate the parameter shift, together with a lightweight hyperparameter search. Extensive experiments across image classification benchmarks and multiple Large Language Model (LLM) attack scenarios demonstrate that QVec consistently suppresses backdoor activation while preserving clean performance.

2606.19910 2026-06-19 cs.CL cs.SD eess.AS 新提交 60%

Light-weight Pronunciation Assessment via Discrete Speech Token Surprisal

轻量级发音评估:基于离散语音标记的意外度

Syeda Faiza Ahmed Sara, Shammur Absar Chowdhury

发表机构 * Qatar Computing Research Institute, Doha, Qatar(卡塔尔计算研究所,多哈,卡塔尔)

专题命中 其他LLM :使用语言模型计算语音标记意外度进行发音评估。

AI总结 提出仅使用母语语音资源训练的轻量级发音评估框架,通过离散化语音标记和语言模型计算意外度,结合文本引导对齐特征,在无监督或少量校准下达到接近监督方法的性能。

Comments Accepted to Interspeech 2026

详情
AI中文摘要

训练自动发音评估通常依赖于标记的学习者错误或非母语语料库,这些语料库收集成本高昂。我们提出一个轻量级框架,仅使用母语语音资源训练,以无监督或通过少量评分话语进行轻量校准的方式运行。在推理时,学习者语音通过SSL编码器和K-means码本进行离散化。一个在母语序列上训练的标记语言模型计算意外度,其中较高的意外度表示音位偏差。我们添加了一个转录引导的Text2DUnit-DTW模块,该模块从参考文本预测母语标记序列,并将其与声学标记对齐以推导出错误敏感特征。意外度和对齐特征通过简单回归融合。在SpeechOcean762上,PCC从0.60提升到0.66(带转录引导),接近监督基线。在L2-ARCTIC上的跨数据集评估显示了一致的提升。

英文摘要

Training automated pronunciation assessment often relies on labeled learner errors or non-native corpora that are costly to collect. We propose a lightweight framework trained only on native speech resources, operating unsupervised or lightly calibrated with a small set of scored utterances. At inference, learner speech is discretized with an SSL encoder and a K-means codebook. A token language model trained on native sequences computes surprisal, where higher surprisal indicates phonotactic deviation. We add a transcript-guided Text2DUnit-DTW module that predicts native token sequences from reference text and aligns them to acoustic tokens to derive error-sensitive features. Surprisal and alignment features are fused via simple regression. On SpeechOcean762, PCC improves from 0.60 to 0.66 with transcript guidance, near supervised baselines. Cross-dataset evaluation on L2-ARCTIC shows consistent gains.
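The surprisal idea can be sketched with a much simpler model than the one in the paper. The example below assumes the speech has already been discretized into integer token ids, and it swaps the paper's token language model for a hypothetical add-one-smoothed bigram model; `train_bigram`, `mean_surprisal`, and the toy sequences are illustrative only.

```python
import math
from collections import Counter

def train_bigram(sequences, vocab_size):
    """Fit an add-one-smoothed bigram model on 'native' token sequences."""
    pair_counts = Counter()
    context_counts = Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1

    def prob(prev, cur):
        return (pair_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size)

    return prob

def mean_surprisal(seq, prob):
    """Average surprisal in bits over the transitions of one token sequence."""
    bits = [-math.log2(prob(prev, cur)) for prev, cur in zip(seq, seq[1:])]
    return sum(bits) / len(bits)

# Toy 'native' corpus: the same cyclic token pattern repeated many times.
native_corpus = [[0, 1, 2, 3] * 5 for _ in range(20)]
prob = train_bigram(native_corpus, vocab_size=4)

native_like = [0, 1, 2, 3, 0, 1, 2, 3]
deviant = [0, 2, 1, 3, 0, 3, 2, 1]   # contains only transitions unseen in the corpus

native_score = mean_surprisal(native_like, prob)
deviant_score = mean_surprisal(deviant, prob)
```

In this toy setting the deviant sequence scores far higher than the native-like one, which is the signal a downstream regressor would use as a pronunciation feature.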

2. 领域大模型 2 篇

2606.19640 2026-06-19 cs.CL cs.AI cs.HC 新提交 70%

Creating Multilingual Mental Health Dialogue Datasets: Limits of Persona-Based Localization via Nationality and Language

创建多语言心理健康对话数据集:基于国籍和语言的人物角色本地化方法的局限性

Yunkai Xu, Saeed Abdullah

发表机构 * Pennsylvania State University(宾夕法尼亚州立大学)

专题命中 领域大模型 :研究LLM生成多语言心理健康对话数据集及评估。

AI总结 研究通过修改人物角色中的国籍和语言参数生成中文、孟加拉语和印地语临床对话,发现仅添加这些参数会导致跨语言临床不一致,且LLM评估非英语文本的抑郁严重度时存在不准确性。

Comments 15 pages, 4 figures. Accepted to the 2026 Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2026), co-located with ACL 2026

详情
AI中文摘要

人工智能和大语言模型(LLMs)已成为应对全球心理健康挑战的有前景的工具。尽管这些挑战具有全球性,但用于训练和评估此类系统的高质量数据集仍然严重短缺。为弥补这一差距,研究人员越来越多地生成合成临床人物角色来模拟用户数据并测试数字心理健康支持系统。然而,大多数经过验证的人物角色依赖于以英语为中心的语境。本文研究了是否可以使用类似的人物角色方法生成多语言心理健康数据集。我们修改了人物角色中的国籍和语言参数,以生成普通话、孟加拉语和印地语的临床对话。然后,我们考察了不同LLM在评估这些生成的多语言数据集的抑郁严重程度(与英语基线相比)时的表现。我们的研究结果表明,仅在人物角色中添加国籍和语言参数可能不够,因为它可能引入跨语言的临床不一致性。LLM评判模型在评估非英语文本中的抑郁严重程度时常常表现出不准确性,且不同模型的性能存在差异。这暴露了将以英语为中心的人物角色应用于多语言语境的系统性局限性。最终,我们的工作强调了迫切需要文化响应式数据生成,以确保全球心理健康系统的公平性。

英文摘要

AI and large language models (LLMs) have emerged as promising tools to address global mental health challenges. Despite the global nature of these challenges, there remains a critical shortage of high-quality datasets for training and evaluating such systems. To mitigate this gap, researchers increasingly generate synthetic clinical personas to simulate user data and test digital mental health support systems. However, most validated personas rely on English-centric contexts. This paper investigates whether similar persona-based methods can be used to generate multilingual mental health datasets. We modified nationality and language parameters in personas to generate clinical dialogues in Mandarin, Bengali, and Hindi. We then examined how different LLMs perform when evaluating the depression severity of these generated multilingual datasets against the baseline in English. Our findings indicate that just adding nationality and language parameters in personas might not be adequate, as it can introduce clinical inconsistency across languages. LLM judge models often exhibit inaccuracies in assessing depression severity in non-English texts, with performance varying across different models. This exposes the systemic limitations of applying English-centric personas to multilingual contexts. Ultimately, our work highlights the urgent need for culturally responsive data generation to ensure equitable mental health systems globally.

2606.20554 2026-06-19 cs.IR cs.AI 新提交 60%

Structuring and Tokenizing Distributed User Interest Context for Generative Recommendation

结构化与分词化分布式用户兴趣上下文以支持生成式推荐

Ruizhong Qiu, Yinglong Xia, Dongqi Fu, Hanqing Zeng, Ren Chen, Xiangjun Fan, Hong Li, Hong Yan, Hanghang Tong

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Meta MRS

专题命中 领域大模型 :生成式推荐,涉及语言模型与用户兴趣建模。

AI总结 提出G2Rec框架,通过统一图建模与语义分词,实现工业级生成式推荐中用户兴趣上下文的全面准确建模。

详情
AI中文摘要

生成式推荐是一种新兴范式,在工业推荐系统中展现出前景,旨在从用户历史行为中预测其下一次交互。生成式推荐的核心是物品分词,它连接了物品语义与推荐模型。然而,现有方法往往难以同时有效地组织和注入复杂的用户行为与物品语义上下文。一方面,现有的基于图的集成方法,如图序列化和图神经网络,要么存在可扩展性问题,要么仅利用局部图信息。另一方面,现有的语义分词方法通常依赖启发式规则且缺乏明确的监督信号,可能导致不准确或次优的语义表示。为解决用户兴趣上下文建模中的这些局限性,我们提出G2Rec,一个可扩展的框架,将基于图的整体用户共同参与建模与语义分词统一起来,用于工业级生成式推荐。总体而言,G2Rec使推荐模型能够捕捉整体且基于语义的用户兴趣原型,而无需真实用户兴趣,从而在工业序列推荐中提供更全面、更准确的用户行为上下文建模。跨产品表面的在线部署和在公开数据集上的大量实验证明了G2Rec相对于现有方法的优越性。

英文摘要

Generative recommendation is an emerging paradigm that has shown promise in industrial recommendation systems, aiming to predict users' next interactions from their historical behaviors. At the core of generative recommendation lies item tokenization, which bridges item semantics and recommendation models. However, existing methods often struggle to effectively organize and inject complex user-behavioral and item-semantic contexts into recommendation models simultaneously. On the one hand, existing graph-based integration methods, such as graph serialization and graph neural networks, either suffer from scalability issues or exploit only local graph information. On the other hand, existing semantic tokenization methods typically rely on heuristics and lack explicit supervision signals, which may lead to inaccurate or suboptimal semantic representations. To address these limitations in user interest context modeling, we propose G2Rec, a scalable framework that unifies holistic graph-based user co-engagement modeling with semantic tokenization for industrial-scale generative recommendation. Overall, G2Rec enables recommendation models to capture holistic and semantically grounded user interest prototypes without requiring ground-truth user interests, thereby providing more comprehensive and accurate modeling of user behavior contexts in industrial sequential recommendation. Online deployment across product surfaces and extensive experiments on public datasets demonstrate the superiority of G2Rec over existing methods.

3. 预训练 2 篇

2606.19379 2026-06-19 cs.LG cs.AI cs.CL 新提交 70%

How Linear Is a Transformer Feed-Forward Block? Per-Block Linear Recoverability Is Learned, Not Architectural

Transformer 前馈块有多线性?逐块线性可恢复性是学习得到的,而非架构决定的

Stuart Whipp

发表机构 * Independent Research(独立研究)

专题命中 预训练 :分析Transformer前馈块的线性度,与模型架构相关。

AI总结 通过精确最小二乘线性近似,测量训练后 Transformer 各前馈块的线性可恢复性,发现其高度异质且非单调,是学习得到的属性而非架构决定,并可用于压缩和诊断。

Comments 14 pages, 5 figures

详情
AI中文摘要

Transformer 前馈网络(FFN)通常被视为非线性的计算存储单元,但训练后的 FFN 块实际非线性程度很少被测量。我们将每个 FFN 视为位置级的输入-输出映射,并将其分解为精确的最小二乘线性近似加上残差。闭式线性映射解释的留出方差定义了一个块的线性可恢复性(R^2_lin),这是一种无需优化器的线性度量。在 GPT-2、Pythia-160m 和 llama-160m 的所有十二个块中,R^2_lin 高度异质且随深度非单调变化,相邻块之间范围从近线性(>0.99)到强非线性(<0.3),且并非由激活函数决定:相同宽度的 GELU 模型 GPT-2 和 Pythia-160m 具有截然不同的轮廓,因此可恢复性是单个训练块的学习属性,而非架构属性。残差的低秩双线性探针仅恢复少量 R^2 点,且增益与残差非线性不相关:未恢复的计算不是单个位置级乘积,而是高阶或分布式结构。该测量还作为有针对性的压缩信号:可恢复块允许大的单层替换(GPT-2 的早期 FFN 参数减少 8 倍,困惑度增加 +0.77),而低可恢复性块标记了这不安全的情况。它还暴露了一个方法论陷阱:训练后的线性基线可能在病态条件的 Transformer 激活上严重欠收敛,因此我们报告了整个过程中精确的闭式最小二乘上限。

英文摘要

Transformer feed-forward networks (FFNs) are often treated as nonlinear stores of computation, yet how nonlinear a trained FFN block actually is has rarely been measured. We treat each FFN as a position-wise input-to-output map and split it into the exact least-squares linear approximation plus a residual. The held-out variance the closed-form linear map explains defines a block's linear recoverability (R^2_lin), an optimiser-free measure of its linearity. Across all twelve blocks of GPT-2, Pythia-160m, and llama-160m, R^2_lin is highly heterogeneous and non-monotone with depth, ranging from near-linear (>0.99) to strongly nonlinear (<0.3) between adjacent blocks, and is not set by the activation function: same-width GELU models GPT-2 and Pythia-160m have sharply different profiles, so recoverability is a learned property of individual trained blocks, not an architectural one. A low-rank bilinear probe of the residual recovers only a few points of R^2, with gain uncorrelated with residual nonlinearity: the unrecovered computation is not a single position-wise product but higher-order or distributed structure. The measurement also serves as a targeted compression signal: recoverable blocks admit large single-layer replacements (GPT-2's early FFN at 8x fewer parameters for +0.77 perplexity), while low-recoverability blocks flag where this is unsafe. It further exposes a methodological pitfall: trained linear baselines can badly under-converge on ill-conditioned transformer activations, so we report the exact closed-form least-squares ceiling throughout.
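A measurement in the spirit of R^2_lin can be sketched in a few lines of NumPy. The code below is an assumption-laden illustration, not the paper's code: it fits a closed-form least-squares affine map on a training split and reports the variance explained on a held-out split, using random toy "blocks" instead of real transformer activations. Whether the paper includes a bias term or normalises variance in exactly this way is not stated in the abstract.

```python
import numpy as np

def linear_recoverability(X_train, Y_train, X_test, Y_test):
    """Held-out R^2 of the closed-form least-squares affine map from X to Y."""
    # Append a constant column so the fitted map has a bias term (an assumption).
    A_train = np.hstack([X_train, np.ones((X_train.shape[0], 1))])
    A_test = np.hstack([X_test, np.ones((X_test.shape[0], 1))])
    # Closed-form least-squares solution; no iterative optimiser is involved.
    W, *_ = np.linalg.lstsq(A_train, Y_train, rcond=None)
    residual = Y_test - A_test @ W
    ss_res = np.sum(residual ** 2)
    ss_tot = np.sum((Y_test - Y_test.mean(axis=0)) ** 2)
    return float(1.0 - ss_res / ss_tot)

rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 16))
W1 = rng.normal(size=(16, 64))
W2 = rng.normal(size=(64, 16))

# Two toy position-wise blocks: one purely linear, one with a ReLU in the middle.
Y_linear = X @ W1 @ W2
Y_relu = np.maximum(X @ W1, 0.0) @ W2

split = 3000
r2_linear = linear_recoverability(X[:split], Y_linear[:split], X[split:], Y_linear[split:])
r2_relu = linear_recoverability(X[:split], Y_relu[:split], X[split:], Y_relu[split:])
```

The linear toy block is recovered almost perfectly, while the ReLU block leaves a residual that no affine map can explain; the paper's finding is that trained blocks fall anywhere between these extremes.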

2606.19367 2026-06-19 cs.LG 新提交 70%

Weibull Weight-Scale Parameter Evolution under AdamW Training Dynamics

Weibull 权重尺度参数在 AdamW 训练动态下的演化

Tiexin Ding

发表机构 * Independent Researcher(独立研究员)

专题命中 预训练 :研究AdamW训练动态,以Pythia模型为例。

AI总结 研究 AdamW 训练中 Weibull 权重尺度参数 λ 增长、过冲和松弛的原因,推导出三种力(对齐、注入、衰减)的分解,并在 Pythia-70M 模型上验证对齐力主导上升阶段,贡献 88-94%。

Comments 21 pages, 14 figures

详情
AI中文摘要

基于用于诊断Transformer权重分布的双参数 Weibull 框架,我们研究了为什么在 AdamW 训练期间 Weibull 权重尺度参数 λ 会增长、过冲然后松弛。我们从 AdamW 更新中推导出平方权重范数的领先阶三力分解:一个对齐力,测量权重与自适应更新方向之间的相关性;一个注入力,来自自适应步长幅度;以及一个衰减力,来自解耦的权重衰减。在具有真实优化器矩的自训练 Pythia-70M 模型上,对齐力主导上升阶段,在四个随机种子中贡献了绝对力预算的 88-94%,并且对超权重移除具有鲁棒性。接近饱和时,对齐力和衰减力趋于平衡,解释了从权重尺度增长到松弛的转变。这些力动态直接控制 λ(t) 背后的平方范数分量;剩余的 RMS 到 Weibull 重建偏移是可测量的,并分解为桥接分量和积分分量,在密集采样区域总计约 5-6%。为了将分析扩展到无法获得优化器矩的真实模型,我们引入了一种样条位移方法,该方法从稀疏检查点以约 92-94% 的准确率恢复对齐力,大约是朴素两点基线的两倍。我们进一步观察到,在我们的实验中,λ(t) 的峰值随训练数据一致性而变化,这表明权重尺度增长存在数据依赖成分,我们将其留待后续对照研究。代码和数据可在 https://github.com/tiexinding/NPM-Weibull-public 获取。

英文摘要

Building on a two-parameter Weibull framework for diagnosing transformer weight distributions, we study why the Weibull weight-scale parameter $λ$ grows, overshoots, and then relaxes during AdamW training. We derive a leading-order three-force decomposition of the squared weight norm from the AdamW update: an alignment force measuring the correlation between weights and the adaptive update direction, an injection force from adaptive step magnitude, and a decay force from decoupled weight decay. On self-trained Pythia-70M models with ground-truth optimizer moments, alignment dominates the rise phase, contributing 88-94% of the absolute force budget across four random seeds and remaining robust to super-weight removal. Near saturation, alignment and decay approach balance, explaining the transition from weight-scale growth to relaxation. These force dynamics directly govern the squared-norm component underlying $λ(t)$; the remaining RMS-to-Weibull reconstruction offset is measurable and decomposes into bridge and integration components, totaling approximately 5-6% in densely sampled regions. To extend the analysis to real models where optimizer moments are unavailable, we introduce a spline displacement method that recovers the alignment force from sparse checkpoints with approximately 92-94% accuracy, about twice the naive two-point baseline. We further observe that the peak value of $λ(t)$ varies with training-data coherence in our experiments, suggesting a data-dependent component of weight-scale growth that we leave to a controlled follow-up study. Code and data are available at https://github.com/tiexinding/NPM-Weibull-public.
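As a hedged illustration of where a three-term split of this kind can come from, consider a simplified decoupled update. This is an assumed form and a generic expansion, not the paper's exact derivation. Here $\eta$ is the learning rate, $u_t$ is the adaptive update direction, and the weight-decay coefficient is written $\gamma$ to avoid clashing with the Weibull scale $λ$.

```latex
\begin{align}
w_{t+1} &= w_t - \eta\,u_t - \eta\gamma\,w_t, \\
\|w_{t+1}\|^2 - \|w_t\|^2
  &= \underbrace{-2\eta\,\langle w_t, u_t\rangle}_{\text{alignment}}
   + \underbrace{\eta^2\,\|u_t\|^2}_{\text{injection}}
   + \underbrace{\bigl(-2\eta\gamma\,\|w_t\|^2\bigr)}_{\text{decay}}
   + \underbrace{2\eta^2\gamma\,\langle w_t, u_t\rangle + \eta^2\gamma^2\,\|w_t\|^2}_{\text{higher-order remainder}}.
\end{align}
```

Under this simplified form, the norm grows when the update direction is anti-aligned with the weights strongly enough to outweigh decay, and growth stops when the alignment and decay terms balance, which is consistent with the rise-then-relax picture in the abstract.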

4. 后训练 2 篇

2606.20475 2026-06-19 cs.LG 新提交 65%

Marginal Advantage Accumulation for Memory-Driven Agent Self-Evolution

边际优势累积用于记忆驱动智能体自我进化

Mingyu Yang, Keye Zheng, Congchao Cheng, Yujie Liu, Xingkang Lu, Fan Jiang, Yefei Zheng

发表机构 * Alibaba International Digital Commerce Group(阿里巴巴国际数字商业集团)

专题命中 后训练 :涉及语言模型轨迹蒸馏,但非核心贡献。

AI总结 针对批量式轨迹蒸馏中跨批次证据缺失问题,提出边际优势累积(MAA)方法,通过差分信号构造、指数移动平均累积和语义身份合并,在16个设置中14个取得最佳结果,优化阶段token消耗减少约75%。

Comments 26 pages, 4 figures, 10 tables, 42 references

详情
AI中文摘要

在批量式轨迹蒸馏中,同一记忆操作可能在不同批次间收到矛盾的反馈。现有方法缺乏跨批次、操作级别的证据累积机制,无法区分稳定有效的操作与偶然命中。本文将需求形式化为两个结构条件:可对齐性和可比性,并提出边际优势累积(MAA)。MAA构造差分信号使其跨批次可比,通过指数移动平均(EMA)累积每个操作的有符号证据,并通过语义身份合并确保跨批次可追溯性。作为一种后处理架构,MAA在4个基准和4个目标模型的16个设置中14个取得最佳结果,持续优于现有批量级蒸馏基线,并在大多数设置中匹配或超越在线替代方法,同时将优化阶段的token消耗减少约75%。

英文摘要

In batch-style trace distillation, the same memory operation may receive contradictory feedback across different batches. Existing methods lack a cross-batch, operation-level mechanism for accumulating evidence, so they cannot distinguish reliably effective operations from accidental hits. This paper formalizes the requirement as two structural conditions, alignability and comparability, and proposes Marginal Advantage Accumulation (MAA). MAA constructs differential signals that are comparable across batches, accumulates signed evidence per operation via an exponential moving average (EMA), and ensures cross-batch traceability through semantic identity merging. As a post-processing architecture, MAA achieves the best results in 14 out of 16 settings across 4 benchmarks and 4 target models, consistently outperforming existing batch-level distillation baselines and matching or surpassing online alternatives in most settings, while reducing optimization-phase token consumption by approximately 75%.
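
The accumulation step described in the abstract above can be sketched as follows. This is a minimal illustration under assumptions, not the MAA implementation: the class `EvidenceAccumulator`, the operation ids, the reward pairs, and the `decay` value are all hypothetical, the differential signal is assumed to be "reward with the operation minus reward without it", and semantic identity merging is skipped by assuming operation ids are already canonical across batches.

```python
from collections import defaultdict

class EvidenceAccumulator:
    """Keeps one signed EMA score per memory operation across batches."""

    def __init__(self, decay=0.8):
        self.decay = decay
        self.score = defaultdict(float)  # operation id -> accumulated evidence

    def update(self, batch):
        """batch maps operation id -> (reward_with_op, reward_without_op)."""
        for op_id, (with_op, without_op) in batch.items():
            marginal = with_op - without_op  # differential signal for this batch
            self.score[op_id] = (
                self.decay * self.score[op_id] + (1 - self.decay) * marginal
            )

    def stable_ops(self, threshold=0.0):
        """Operations whose accumulated evidence stays above the threshold."""
        return sorted(op for op, s in self.score.items() if s > threshold)

acc = EvidenceAccumulator(decay=0.5)
# Batch 1: both operations look helpful.
acc.update({"add_rule_A": (0.75, 0.5), "add_rule_B": (0.75, 0.5)})
# Batch 2: A helps again, B receives contradictory feedback.
acc.update({"add_rule_A": (0.75, 0.5), "add_rule_B": (0.25, 0.75)})
print(acc.stable_ops())
```

In this toy run, the operation with consistent positive marginals keeps a positive score, while the one that only helped in the first batch drops below zero after the second batch.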

2606.20553 2026-06-19 cs.CR New submission 60%

From Efficiency to Leakage -- Privacy Backdoor in Federated Language Model Fine-Tuning

从效率到泄露——联邦语言模型微调中的隐私后门

Shanghao Shi, Chaoyu Zhang, Heng Jin, Yang Xiao, Yevgeniy Vorobeychik, William Yeoh, Ning Zhang, Y. Thomas Hou, Wenjing Lou

Topic match Post-training: concerns privacy leakage in language model fine-tuning.

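AI summary Proposes the NeuroImprint attack, in which a malicious parameter server plants a privacy backdoor during parameter-efficient fine-tuning. By assigning each sample its own neuron and limiting each neuron to a single update, the attack enables high-fidelity reconstruction of the training text.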

详情
AI中文摘要

联邦学习(FL)使多方能够协作微调语言模型以完成特定领域任务,而无需共享原始数据。由于完整模型微调对FL客户端而言通常过于昂贵,参数高效微调(PEFT)已成为实践中的事实标准,它冻结基础模型,仅训练少量适配器。在本文中,我们表明恶意参数服务器可以隐秘地将PEFT适配器破坏为隐私后门,该后门隐式记忆客户端的训练样本,作为存储在独立神经元中的隔离的每样本参数更新,而不降低模型效用。具体来说,我们的攻击NeuroImprint为每个训练样本分配一个专用的记忆神经元,并约束每个神经元在局部微调轨迹中最多更新一次。这种设计减轻了语言模型微调中由大批量和状态优化器(如Adam/AdamW)引入的跨样本碰撞和跨步混合。微调后,得到的隔离的每样本更新可以通过闭式解析逆变换恢复文本嵌入,然后确定性地映射回令牌序列。为了理解我们方法的通用性,我们在多个语言模型(BERT、GPT-2、Qwen2和Llama3.2)上实现了NeuroImprint,并在涵盖不同领域的四个微调数据集上进行了评估。结果表明,我们的攻击能够以高语义保真度重建59%至79%的所有微调样本。

英文摘要

Federated learning (FL) enables multiple parties to collaboratively fine-tune language models for domain-specific tasks without sharing raw data. Since full model fine-tuning is often prohibitively expensive for FL clients, parameter-efficient fine-tuning (PEFT) has become the de facto approach in practice, freezing the base model and training only a small set of adapters. In this paper, we show that a malicious parameter server can stealthily corrupt a PEFT adapter into a privacy backdoor that implicitly memorizes the client's training samples as isolated per-sample parameter updates stored in separate neurons, without degrading model utility. Concretely, our attack, NeuroImprint, assigns a dedicated memorization neuron to each training sample and constrains each neuron to be updated at most once along the local fine-tuning trajectory. This design mitigates both cross-sample collisions and cross-step mixing introduced by large local batches and stateful optimizers (e.g., Adam/AdamW) in language-model fine-tuning. After fine-tuning, the resulting isolated per-sample updates can be analytically inverted in closed form to recover text embeddings, which are then deterministically mapped back to token sequences. To understand the generality of our method, we implemented NeuroImprint on multiple language models (BERT, GPT-2, Qwen2, and Llama3.2) and evaluated it across four fine-tuning datasets spanning diverse domains. The results demonstrate that our attack can reconstruct 59% to 79% of all fine-tuning samples with high semantic fidelity.
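
Why an isolated per-sample update can be inverted in closed form is easiest to see on a single linear neuron. The sketch below illustrates only the generic textbook identity (weight gradient equals bias gradient times the input), not NeuroImprint itself. It assumes one neuron with a bias, a squared-error loss, and a single plain SGD step; the abstract's setting involves Adam/AdamW, where updates are not simply proportional to gradients, and that case is not modeled here. All variable names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=8)   # toy stand-in for a private input embedding
w = rng.normal(size=8)   # neuron weights before the update
b = 0.3                  # neuron bias before the update
target = 1.0

# Squared-error loss L = 0.5 * (w.x + b - target)^2
residual = np.dot(w, x) + b - target
grad_w = residual * x    # dL/dw is the input scaled by the residual
grad_b = residual        # dL/db is the residual itself

# One plain SGD step touching this neuron exactly once (assumption).
lr = 0.1
delta_w = -lr * grad_w
delta_b = -lr * grad_b

# The scale factor cancels, leaving the input.
recovered = delta_w / delta_b
```

The ratio recovers `x` only because a single sample contributed to the update; if several samples shared the neuron, their inputs would be mixed in `delta_w`, which is the collision problem the abstract says the per-sample neuron assignment is meant to avoid.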