Large Language Models / LLM

2606.19256 2026-06-18 cs.AI New submission 80%

X+Slides: Benchmarking Audience-Conditioned Slide Generation

Haodong Chen, Xuanhe Zhou, Wei Zhou, Xinyue Shao, Yanbing Zhu, Bo Wang, Jiawei Hong, Anya Jia, Fan Wu

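Affiliations: Shanghai Jiao Tong University; Harbin Institute of Technology; SenseTime

Topic match (Other LLM): benchmark for LLM slide generation

AI summary: Proposes the X+Slides benchmark, which uses a dynamic evaluation framework and audience-specific weights to measure slide generation systems on audience coverage, domain coverage, efficiency, and correctness, and shows that existing systems fall short in recovering audience-essential information.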

English abstract

Automatically generating slide decks from source documents is an important application of large language models (LLMs). Existing benchmarks primarily assess slide completeness and technical depth, while overlooking the target audience as a critical real-world factor. For instance, specialists demand rigorous proofs, whereas decision-makers prioritize actionable conclusions. To bridge this gap, we introduce X+Slides, a benchmark specifically designed for audience-conditioned slide generation. Built on a diverse corpus spanning 113 topics and seven presentation scenes, X+Slides employs a dynamic evaluation framework constructed from 8,133 deduplicated, source-grounded probes. By assigning audience-specific utility weights to the same source-grounded probes, X+Slides reports four complementary metrics: Audience Coverage measures how much audience-essential information is conveyed, Domain-wise Coverage shows which information types are covered, Efficiency measures delivered utility per unit of attention cost, and Correctness verifies whether slide claims are supported by the source. Experiments on DeepPresenter, SlideTailor, and NotebookLM show that current systems can recover a substantial but still incomplete part of audience-essential information: at $τ_A=0.7$, DeepPresenter reaches a best Audience Coverage of 0.714, SlideTailor reaches 0.594, and the NotebookLM ablation reaches 0.853 while showing clear grounding differences. These results indicate that visual quality and broad topic coverage should not be treated as evidence support without source-grounded evaluation.
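
The idea of scoring one shared probe set under different audience weights can be sketched as below. This is only an illustration under stated assumptions, not the benchmark's scoring code: the abstract does not define $τ_A$, so treating it as a cutoff on probe weights is an assumption, and the function name `audience_coverage`, the probe names, and all weights are hypothetical.

```python
from typing import Dict, Set


def audience_coverage(weights: Dict[str, float], covered: Set[str], tau_a: float = 0.7) -> float:
    """Utility-weighted share of audience-essential probes that a deck answers.

    Assumption: a probe counts as audience-essential when its weight >= tau_a.
    """
    essential = {p: w for p, w in weights.items() if w >= tau_a}
    total = sum(essential.values())
    if total == 0:
        return 0.0
    return sum(w for p, w in essential.items() if p in covered) / total


# The same probes, weighted for two hypothetical audiences.
expert = {"proof_sketch": 0.9, "complexity_bound": 0.8, "cost_savings": 0.2}
executive = {"proof_sketch": 0.1, "complexity_bound": 0.3, "cost_savings": 1.0}
deck_covers = {"proof_sketch", "cost_savings"}  # probes the deck answers

expert_score = audience_coverage(expert, deck_covers)
executive_score = audience_coverage(executive, deck_covers)
```

In this toy case the same deck scores lower for the expert audience than for the executive one, because it skips a probe that only the expert weights heavily.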

2606.18946 2026-06-18 cs.CL New submission 80%

SenFlow: Inter-Sentence Flow Modeling for AI-Generated Text Detection in Hybrid Documents

Jingkun Luo, Yifan Sun, Da-Tian Peng, Guanxiong Pei

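Affiliations: Northwestern Polytechnical University; Zhejiang Lab

Topic match (Other LLM): detecting LLM-generated text by modeling inter-sentence dependencies

AI summary: For sentence-level AI text detection in human-AI hybrid documents, proposes SenFlow, which models inter-sentence dependencies through graph propagation and CRF decoding and improves cross-domain F1 on the MOSAIC benchmark by 4.15 percentage points.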

Comments 16 pages, 4 figures, 9 tables

English abstract

Sentence-level AI-generated text detection (S-AGTD) for hybrid documents, where humans and LLMs co-author one text, faces two gaps: existing methods classify each sentence in isolation, discarding inter-sentence dependencies, and existing benchmarks omit the newest generation of generators. We construct MOSAIC, a benchmark of 16,000 hybrid documents over PubMed and XSum, generated by DeepSeek-V3.2 and Kimi K2 under stringent quality controls including a perplexity-consistency filter absent from prior benchmarks. We recast S-AGTD as structured prediction over the document sentence sequence and instantiate it as SenFlow, integrating graph-based inter-sentence propagation with linear-chain CRF decoding in a single document-level pass over a sentence graph. SenFlow reaches state-of-the-art performance on MOSAIC, with a +4.15 pp average Macro-F1 margin on cross-domain transfer, the hardest of three protocols of increasing difficulty. We further find that even after the perplexity filter equalizes overt cues, AI insertions retain a generator-dependent sentence-length gap that sentence-level detectors still exploit. Code and data: https://github.com/luojingkun22/SenFlow
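
To make "linear-chain CRF decoding" over a sentence sequence concrete, here is a minimal Viterbi sketch. It is a generic textbook decoder, not SenFlow's implementation; the emission and transition scores are made-up numbers, and the label convention (0 = human, 1 = AI) is an assumption for illustration.

```python
from typing import List, Sequence


def viterbi(emissions: Sequence[Sequence[float]],
            transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring label path for a linear-chain CRF.

    emissions[t][y]   : score of label y for sentence t
    transitions[a][b] : score of moving from label a to label b
    """
    n_labels = len(transitions)
    score = list(emissions[0])
    backptr: List[List[int]] = []
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for b in range(n_labels):
            # Best previous label for ending sentence t in label b.
            best_a = max(range(n_labels), key=lambda a: score[a] + transitions[a][b])
            ptr.append(best_a)
            new_score.append(score[best_a] + transitions[best_a][b] + emissions[t][b])
        score = new_score
        backptr.append(ptr)
    # Trace the best path backwards from the best final label.
    path = [max(range(n_labels), key=lambda y: score[y])]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    return path[::-1]


# Hypothetical scores: the middle sentence looks weakly "AI" in isolation.
emissions = [[2.0, 0.0], [0.4, 0.6], [2.0, 0.0]]
transitions = [[1.0, -1.0], [-1.0, 1.0]]  # staying in the same label is rewarded

isolated = [max(range(2), key=lambda y: e[y]) for e in emissions]
decoded = viterbi(emissions, transitions)
```

Classifying each sentence in isolation flips the middle label, while joint decoding keeps the sequence consistent because a single switch is penalized by the transition scores.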

2606.18922 2026-06-18 cs.CL cs.AI New submission 80%

As Easy as Rocket Science: Assessing the Ability of Large Language Models to Interpret Negation in Figurative Language

Jasmine Owers, Edwin Simpson, Martha Lewis

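Affiliations: Intelligent Systems Lab, University of Bristol; ILLC, University of Amsterdam

Topic match (Other LLM): evaluating LLM understanding of negation and figurative language

AI summary: Develops a new annotated dataset to test how well several large language models understand negation in figurative language, finding that the combination of negation and figurative language challenges the models and that performance depends heavily on prompt style.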

Comments 16 pages, 16 figures; for associated code and data see https://github.com/jrdowers/Negation-and-Fig-Lang; To be published in Transactions of the Association for Computational Linguistics

2606.18797 2026-06-18 cs.CL New submission 80%

Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports

Qingyu Lu, Ruochen Li, Liang Ding, Yufei Xia, Youxiang Zhu, Dacheng Tao

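Affiliations: Nanyang Technological University; Technical University of Munich; Alibaba; University of Glasgow; University of Massachusetts Boston

Topic match (Other LLM): LLM-based evaluation metrics for radiology reports

AI summary: Addressing the clinical accuracy that radiology report evaluation requires, studies how well LLM-based metrics distinguish clinical errors from harmless variants, identifies a discrimination bias, and trains lightweight metrics on synthetic data that outperform large models in cost-sensitive deployment.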

Comments Under Review

English abstract

Reliable evaluation of generated radiology reports requires strict clinical accuracy, as omitted critical findings or mischaracterized radiographic observations can directly affect patient care. Existing metrics obscure this requirement by reducing report quality to a medically ungrounded scalar. Although Large Language Models (LLMs) possess rich medical knowledge, they likewise struggle to draw a reliable boundary between clinically significant errors and harmless variation. We study this boundary using the ReEvalMed benchmark as a testbed and evaluate metric-level clinical significance along two axes: detecting true clinical errors ("Discrimination") and tolerating insignificant variations ("Robustness"). Across 8 LLM evaluators under one-pass and two-pass settings, we identify a widespread discrimination bias: models effectively detect errors but also over-penalize harmless rephrasings. To mitigate this, we synthesize 4k report pairs and train lightweight interpretable metrics on Qwen3-8B and MedGemma-4B. Our trained metric sharpens the clinical significance boundary, surpassing 32B-scale medical LLMs and remaining competitive with proprietary models. Crucially, the more costly two-pass setting fails to consistently improve overall performance and mainly trades discrimination for robustness. These findings suggest one-pass trained metrics as the practical choice for cost-sensitive deployment, with two-pass inference reserved for settings where the Discrimination-Robustness (D-R) balance is critical. We will release the dataset and metric.

2606.18741 2026-06-18 cs.DC New submission 80%

ReMP: Low-Downtime Runtime Model-Parallelism Reconfiguration for LLM Serving

Haipeng Yuan, Kaining Zheng, Yongshu Bai, Yuchen Zhang, Yunquan Zhang, Baodong Wu, Xiang Gao, Daning Cheng

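Topic match (Other LLM): low-downtime model-parallelism reconfiguration for LLM inference serving

AI summary: Proposes the ReMP framework, which uses techniques such as decoupling topology from runtime state and two-dimensional KV cache migration to adjust the model-parallelism topology of an LLM inference service online, cutting reconfiguration downtime from minutes to 1-7 seconds.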

English abstract

Current large language model (LLM) inference systems universally deploy ultra-large-scale models using a combination of Tensor Parallelism (TP) and Pipeline Parallelism (PP). However, existing systems treat the model parallelism topology as a static configuration that cannot be flexibly adjusted at runtime. This rigid design creates a fundamental contradiction with the dynamically changing inference workloads in real-world scenarios. State-of-the-art systems lack online reconfiguration capabilities and can only switch configurations by restarting the service, resulting in several minutes of service interruption, KV cache loss, and prohibitive recomputation overhead. To address this problem, this paper presents ReMP, a runtime model parallelism reconfiguration framework that supports low downtime. ReMP achieves dynamic adjustment through three key techniques: (1) decoupling the model parallelism topology from runtime state to avoid full service reconstruction; (2) designing a two-dimensional KV cache migration mechanism to preserve reusable cache states after TP/PP changes; and (3) implementing end-to-end online reconfiguration. Experiments demonstrate that ReMP can complete most topology switches within 1-7 seconds on models ranging from 7B to 70B parameters, achieving speedups of tens to over a hundred times compared to the restart approach. Moreover, ReMP significantly outperforms fixed configurations under dynamic workloads, delivering superior performance in terms of TTFT, TPOT, and output throughput.

2606.18677 2026-06-18 cs.LG cs.AI New submission 80%

Bounded Context Management for Tabular Foundation Models on Stream Learning

Jinmo Lee, Doyun Choi, Moongi Choi, Jaemin Yoo

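Affiliations: Seoul National University; KAIST

Topic match (Other LLM): context management for tabular foundation models in stream learning

AI summary: To handle distribution shift in tabular stream learning, proposes the context management policy CURE, which manages the context through uncertainty-gated admission and redundancy-aware eviction and achieves up to 27.0% relative improvement across seven streams.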

Comments Accepted as a spotlight oral (top 5%) at the 2nd ICML Workshop on Foundation Models for Structured Data (FMSD@ICML2026)

English abstract

Tabular stream learning requires predictions on sequentially arriving examples under distribution shift. While standard methods adapt by updating model states, tabular foundation models (TFMs) make predictions conditioned on a labeled context in an in-context manner, making them a natural alternative for stream learning. This shifts the challenge from how to update the model to how to manage the context. We propose a future information view that yields three practical requirements for context management: preserve recent examples, retain uncertain examples, and remove redundant examples. We instantiate these requirements as CURE (Context management via Uncertainty-aware admission and Redundancy aware Eviction), a context-managing policy with entropy-gated admission and redundancy-aware eviction. Across seven streams, CURE shows up to 27.0% relative improvement over classical stream learners, remains robust across multiple TFM backbones, and ranks first among other policy variants. Code and datasets are available at https://github.com/morcellinus/CURE-ICML-FMSD.
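
A bounded context with entropy-gated admission and redundancy-aware eviction could look roughly like the sketch below. It is not the released CURE code: the function `update_context`, the squared-distance redundancy measure, the rule of never evicting the newest example, and the threshold values are all assumptions chosen to illustrate the three stated requirements (keep recent, keep uncertain, drop redundant).

```python
import math
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (features, label)


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (nats) of a predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def update_context(context: List[Example], x: Sequence[float], y: int,
                   probs: Sequence[float], budget: int, min_entropy: float) -> List[Example]:
    # Admission gate: skip examples the model already predicted confidently.
    if entropy(probs) < min_entropy:
        return context
    context = context + [(x, y)]
    if len(context) <= budget:
        return context

    # Eviction: among all but the newest example, drop the one that sits
    # closest to another example with the same label (most redundant).
    def nearest_same_label(i: int) -> float:
        xi, yi = context[i]
        dists = [sq_dist(xi, xj) for j, (xj, yj) in enumerate(context)
                 if j != i and yj == yi]
        return min(dists) if dists else math.inf

    victim = min(range(len(context) - 1), key=nearest_same_label)
    return context[:victim] + context[victim + 1:]


ctx = [((0.0, 0.0), 0), ((0.1, 0.0), 0), ((5.0, 5.0), 1)]
# Confident prediction: not admitted, context unchanged.
same = update_context(ctx, (9.0, 9.0), 1, [0.99, 0.01], budget=3, min_entropy=0.3)
# Uncertain prediction: admitted, and one near-duplicate is evicted.
new = update_context(ctx, (2.0, 2.0), 1, [0.5, 0.5], budget=3, min_entropy=0.3)
```

Ties in redundancy are broken toward the oldest example, which is one simple way to favor recency in this toy version.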

2606.18557 2026-06-18 cs.AI cs.LG cs.LO New submission 80%

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

Patrick Cooper, Alvaro Velasquez

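Affiliations: University of Colorado Boulder

Topic match (Other LLM): evaluating defeasible abductive reasoning in foundation models

AI summary: Proposes the DeFAb benchmark, which converts knowledge bases into verifiable abduction instances to evaluate the creativity and theoretical reasoning of foundation models in defeasible reasoning, and finds that frontier models are far less accurate than a symbolic solver.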

Comments 33 pages, 14 figures, 23 tables. Dataset: https://huggingface.co/datasets/PatrickAllenCooper/DeFAb ; code and evaluation harness: https://github.com/PatrickAllenCooper/blanc

English abstract

A rule-based logic solver resolves every instance in our benchmark in under 50 microseconds with 100% accuracy; the best frontier language model reaches 65% at best and drops to 23.5% under rendering-robust evaluation (worst case over four surface renderings). We introduce DeFAb (Defeasible Abduction Benchmark), a dataset and generation pipeline that converts four decades of publicly funded knowledge bases into formally grounded instances for defeasible abduction: constructing hypotheses that explain anomalies by overriding defaults while preserving unrelated expectations. Because every hypothesis must pass polynomial-time checks for valid derivation, conservativity, and minimality, DeFAb makes logical rigor the instrument for measuring creativity and theoretical reasoning, scoring the disciplined construction of theory revisions rather than fluent but theory-destroying prose. The pipeline pairs taxonomic hierarchies (OpenCyc, YAGO, Wikidata) with behavioral property graphs (ConceptNet, UMLS) to produce 372,648+ instances across 33.75M materialized rules from 18 sources, in three levels with polynomial-time verifiable gold standards. Four frontier models do not reliably internalize defeasible reasoning: rendering-robust Level 2 accuracy is 7.8-23.5%; chain-of-thought variance (~36 pp) exceeds any inter-model gap; and a matched contamination control isolates a +19.4 pp Level 3 gap. We further release DeFAb-Hard (a 235-instance Level 3 difficulty variant; best model 53.3% vs 100% symbolic) and CONJURE (a kernel-verified transformative-creativity variant of 560 Lean 4/Mathlib instances whose gold answers are definitions the proof kernel did not previously contain, judge-free verifier; a pilot finds zero novel concepts). The same verifier doubles as an exact reward for preference optimization (DPO, RLVR/GRPO). Released under MIT at https://huggingface.co/datasets/PatrickAllenCooper/DeFAb.

2606.18383 2026-06-18 cs.LG cs.CL New submission 80%

From Sparse Features to Trustworthy Proxies: Certifying SAE-Based Interpretability

Dibyanayan Bandyopadhyay, Asif Ekbal

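Affiliations: Department of Computer Science and Engineering, Indian Institute of Technology Patna

Topic match (Other LLM): certifying SAE-based interpretability of language models

AI summary: Proposes a post-hoc generalization framework that certifies a language model via a sparse proxy (SAE reconstruction), derives an upper bound on expected risk, and verifies non-vacuous bounds on models including GPT-2 Small, showing that deeper layers are easier to certify and that the feature decomposition distinguishes semantic alignment from statistical sparsity.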

English abstract

Sparse autoencoders (SAEs) are increasingly used to extract interpretable features from language models (LMs), yet a central question remains: when can an SAE-based explanation be treated as a faithful view of an underlying frozen LM? We study this through a post-hoc generalization framework that certifies the LM via a sparse proxy, obtained by replacing a native hidden activation with its pretrained SAE reconstruction. Our framework derives an upper bound on the base model's expected risk using four measurable quantities: proxy risk, SAE reconstruction gap, concept-pool mismatch, and sparse complexity. We interpret this certificate as an operational criterion for explanatory faithfulness. In particular, a non-vacuous bound indicates that the extracted sparse features retain meaningful predictive information, while small reconstruction and mismatch errors indicate that the proxy remains behaviorally close to the original model. Empirically, we show that the bound becomes non-vacuous on GPT-2 Small, Gemma-2B, and Llama-3-8B at practical sample sizes. A detailed layerwise analysis of Llama-3-8B reveals a strong depth dependence, with later layers becoming much easier to certify, associated with both stronger local fidelity and weaker downstream error amplification. Finally, through feature-shuffling ablations, we show that the decomposition distinguishes genuine semantic alignment from mere statistical sparsity, providing a useful diagnostic for when SAE-based explanations become less reliable.

2606.18042 2026-06-18 cs.DC New submission 80%

Latency Prediction for LLM Inference on NPU Systems

Juhyun Park, Seungwoo Jeong, Jingyu Lee, Kyungyong Lee

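Topic match (Other LLM): predicting LLM inference latency on NPUs

AI summary: Latency prediction for LLM inference on NPUs faces undisclosed microarchitectures, unpredictable compiler optimizations, and bucketing-induced non-linear latency. Proposes the LENS latency estimator, which composes two end-to-end measurements per bucket to predict latency for arbitrary input-output length combinations, with a mean prediction error of 2.15%.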

Comments 12 pages, 9 figures

English abstract

Deploying Large Language Models (LLMs) requires exploring a large configuration space spanning parallelization strategies, batching techniques, and scheduling policies. Exhaustive measurement across this space is impractical, making latency prediction essential for system optimization. While NPUs have emerged as accelerators designed for LLM inference, no prediction methodology has been established for them. Specifically, applying prior work to LLM inference latency prediction on NPUs faces three challenges: undisclosed microarchitecture of commercial NPUs, unpredictable compiler optimizations, and latency non-linearity induced by bucketing. We present LENS, a latency estimator that predicts NPU inference latency without information on the microarchitecture or compiler, and captures the non-linear latency induced by bucketing. LENS profiles each bucket with two end-to-end (E2E) measurements and composes the results to predict latency for arbitrary input-output length combinations. We validate LENS across NPUs from multiple vendors, several LLMs, and diverse workloads, achieving a mean prediction error of 2.15%. We further compare LENS against two methodologically related baselines, confirming the validity of its approach.

2606.12629 2026-06-18 cs.LG cs.AI New submission 80%

Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns

Varun Reddy Nalagatla

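Affiliations: Amazon Web Services

Topic match (Other LLM): a training-free mechanistic interpretability method for transformers

AI summary: Proposes the Bag of Dims framework, showing that the standard basis of transformer hidden states can serve as a training-free feature basis that encodes semantics through per-dimension sign patterns, and validates it on three models.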

Comments 22 pages, 5 figures, 27 tables

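AI abstract (translated from Chinese)

We show that the standard basis of transformer hidden states already provides a training-free, architecture-general feature basis. Individual dimensions encode semantic content via their signs and confidence via their magnitudes, acting as independent binary registers. We validate this Bag of Dims framework on three model families (Qwen 3.5-4B, Gemma 3-4B, Mistral 7B) through four progressive experiments. Sign patterns alone carry predictive content: replacing all magnitudes with 1 yields 72-93% top-5 next-token accuracy through the LM head, and pure Hamming scoring without any decoder reaches 80-90% top-4096 accuracy. These sign patterns organize into semantic features: using a single-token type cache (one forward pass per vocabulary token, no context), we discover 175 categories from 50 anchors via per-dimension sign consistency (mean AUC 0.80), without any training. A trained probe adds only +0.018 AUC and converges to axis-aligned weights, confirming negligible cross-dimension structure. This structure extends to attention: all 175 categories remain discoverable in the K and V projections. On the write side, static FFN weight inspection links 20% of features to a single writing neuron (consistency >0.70; random control: 0%), and with majority voting, top-200 neuron coalitions reach >0.70 consistency on 99.9% of prototypes. Fully unsupervised discovery (random seeds, no labels) scales to 1,500 features on all three models, with 100% yield, 99% sparsity, and pairwise mutual information of 0.0014 bits, confirming low inter-dimension coupling. These results establish that the standard basis already suffices for feature reading along the entire transformer computation path, with no training, no optimization, and only one forward pass per vocabulary token, with no GPU-days.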

English abstract

We show the standard basis of transformer hidden states already provides a training-free, architecture-general feature basis. Individual dimensions encode semantic content via their signs (+/-1) and confidence via their magnitudes, acting as independent binary registers; a feature is a subset of dimensions with a consistent sign pattern, read by counting sign agreements with no learned rotation. We validate this Bag of Dims framework across seven models spanning language (Qwen 3.5-4B, Gemma 3-4B, Mistral 7B, Qwen3-32B), vision (DINOv2, ViT-Base), and audio (AST). Signs alone carry predictive content: unit-magnitude sign patterns preserve 60-93% top-5 next-token accuracy through the LM head, and decoder-free Hamming scoring reaches 80-90% top-4096. From a single-token cache (one forward pass per token, no context, no labels), we detect 175 categories at AUC 0.97-0.99 by sign agreement; a trained probe adds only +0.018 AUC and converges to axis-aligned weights. These features are causally operative: they survive the K/V attention projections, trace to the FFN neuron coalitions that write them (random-weight controls never reproduce this), and flipping a feature's signs during the live forward pass suppresses its concept across four language models, magnitude-matched and concept-specific. Dimensions stay independent throughout (pairwise mutual information below 0.006 bits). The structure is not specific to language: the same per-dimension signs appear in self-supervised vision (DINOv2, 9/12 ImageNet superclasses), supervised vision (ViT-Base, 11/12), and audio (AST, 50/50 ESC-50 categories), so it reflects transformer training in general, not the language-modeling objective. The standard basis already suffices for feature reading at one forward pass, no optimization, no GPU-days. The open problem shifts from finding the right rotation to cataloging what each dimension encodes.
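
Reading a feature "by counting sign agreements" can be illustrated with a few lines of Python. This is a toy sketch, not the paper's code: the feature definition (which dimensions, which signs), the hidden vectors, and the choice to treat zero as positive are all hypothetical.

```python
from typing import Dict, Sequence


def sign(x: float) -> int:
    """Sign of an activation; zero is treated as +1 (an arbitrary convention)."""
    return 1 if x >= 0 else -1


def sign_agreement(hidden: Sequence[float], pattern: Dict[int, int]) -> float:
    """Fraction of a feature's dimensions whose sign matches the expected pattern.

    pattern maps dimension index -> expected sign (+1 or -1); magnitudes are ignored.
    """
    hits = sum(1 for dim, expected in pattern.items() if sign(hidden[dim]) == expected)
    return hits / len(pattern)


# Hypothetical feature over dimensions 0, 2, and 3 of a 4-dimensional state.
feature = {0: +1, 2: -1, 3: +1}
h_match = [0.7, -1.2, -0.1, 2.4]   # signs on dims 0, 2, 3: +, -, +
h_other = [-0.3, 0.5, 0.9, 1.1]    # signs on dims 0, 2, 3: -, +, +

score_match = sign_agreement(h_match, feature)
score_other = sign_agreement(h_other, feature)
```

No rotation or learned weights are involved: the score depends only on signs in the standard basis, which is the property the abstract emphasizes.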

2606.08532 2026-06-18 cs.AI New submission 80%

DN-Hypo-Pipeline: An AI-Driven Workflow for Hypothesis Generation via Large Language Models and Scientific Explanations

Lei Lin, Ronghao Wang, Chunbao Zhou, Jue Wang, Yangang Wang

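Affiliations: Computer Network Information Center, Chinese Academy of Sciences, China

Topic match (Other LLM): an LLM-driven hypothesis generation workflow

AI summary: Proposes DN-Hypo-Pipeline, which uses large language models with scientific explanations as prior knowledge to derive new hypotheses from existing literature. In data science modeling, statistical inference and expert evaluation show it outperforms direct generation, and the algorithms built from the generated hypotheses are validated empirically.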

English abstract

A scientific hypothesis is the first step in research and undergoes experimental validation, yet it also reflects a deep understanding of and reasoning about scientific phenomena. We introduce DN-Hypo-Pipeline, an AI-powered workflow based on large language models, designed to support structured scientific thinking and hypothesis generation by leveraging scientific explanations as prior knowledge. This pipeline assists researchers in deriving novel hypotheses from existing literature. Given the explanandum (i.e., the conclusion) of a research paper, it identifies underlying laws, theories, and principles, and reconstructs a new, yet-to-be-verified explanation for the observed phenomenon. We evaluated DN-Hypo-Pipeline in the field of data science modeling using three highly cited papers. Statistical inference, supported by both LLM-as-judge assessment and human expert evaluation, demonstrates that our pipeline is more effective than direct generation methods. Additionally, we validated the two highest-scoring generated hypotheses by developing corresponding novel algorithms, which outperformed the baseline models presented in the original papers. Beyond application in data science, DN-Hypo-Pipeline provides a theoretical framework that not only encompasses theory-guided data science modeling methods but also reveals a more fundamental structure of the modeling process. Moreover, this approach is essentially a generalization of theory-guided modeling, offering potential for extension to other domains and across a broader range of scientific disciplines.

2602.06470 2026-06-18 cs.CL cs.AI Version update 80%

Improve Large Language Model Systems with User Logs

Changyue Wang, Weihang Su, Qingyao Ai, Xingzhao Yue, Rui Zhang, Xiaojia Chang, Yiqun Liu

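Affiliations: Department of Computer Science and Technology, Tsinghua University

Topic match (Other LLM): proposes the UNO framework, which uses user logs to improve LLM systems

AI summary: Proposes the UNO framework, which distills user logs into rules and preference pairs, handles data heterogeneity with query-and-feedback-driven clustering, and quantifies the cognitive gap between model knowledge and log data to improve LLM system performance.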

English abstract

Scaling training data and model parameters has long driven progress in large language models (LLMs), but this paradigm is increasingly constrained by the scarcity of high-quality data and diminishing returns from rising computational costs. As a result, recent work increasingly focuses on continual learning from real-world deployment, where user interaction logs provide a rich source of authentic human feedback and procedural knowledge. However, learning from user logs is challenging due to their unstructured and noisy nature. Vanilla LLM systems often struggle to distinguish useful feedback signals from noisy user behavior, and the disparity between user log collection and model optimization (e.g., the off-policy optimization problem) further exacerbates the problem. To this end, we propose UNO (User log-driveN Optimization), a unified framework for improving LLM systems (LLMsys) with user logs. UNO first distills logs into semi-structured rules and preference pairs, then employs query-and-feedback-driven clustering to manage data heterogeneity, and finally quantifies the cognitive gap between the model's prior knowledge and the log data. This assessment guides the LLMsys to adaptively filter out noisy feedback and construct different modules for primary and reflective experiences extracted from user logs, thereby improving future responses. Extensive experiments show that UNO achieves state-of-the-art effectiveness and efficiency, significantly outperforming Retrieval Augmented Generation (RAG) and memory-based baselines. We have open-sourced our code at https://github.com/bebr2/UNO.

2602.00161 2026-06-18 cs.LG cs.AI cs.CL quant-ph Version update 80%

LLM Compression by Block Removal with Constrained Binary Optimization

David Jansen, Roman Rausch, Ali Hashemi, David Montero, Román Orús

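Affiliations: Multiverse Computing; Donostia International Physics Center; Ikerbasque Foundation for Science

Topic match (Other LLM): proposes an LLM compression method based on optimized block removal

AI summary: Formulates block-removal compression of large language models as constrained binary optimization mapped to an Ising glass system, enabling efficient ranking and high-quality removal of non-consecutive blocks. At 50% compression it gains nearly 23 percentage points on MMLU, and it is computationally efficient and broadly applicable.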

Comments 16 pages, 3 figures

English abstract

In this paper, we formulate the compression of large language models (LLMs) by optimally deleting transformer blocks ("block removal") as a constrained binary optimization (CBO) problem that can be mapped to a physical system (Ising glass), whose energies are a strong proxy for downstream model performance. This formulation enables an efficient ranking of a large number of candidate block-removal configurations, yielding many high-quality, non-trivial solutions beyond those that only remove consecutive regions. Our method performs strongly in the deep compression regime, such as for 50% compression of Llama-3.3-70B-Instruct, where we achieve an almost 23 percentage point increase on the MMLU benchmark compared to other state-of-the-art (SOTA) block-removal methods. For lighter compression, it performs on par with those methods across several benchmarks for Llama-3.1-8B-Instruct, Qwen3-14B (both before and after retraining), as well as Llama-3.3-70B-Instruct. The approach is computationally efficient and requires only forward and backward passes on a calibration dataset for a few active parameters. Additionally, we demonstrate that using good heuristic solvers for the CBO problem provides solutions that perform well on downstream tasks in negligible runtime when it is infeasible to solve the problem exactly. The method can be readily applied to any architecture. We illustrate this generality on the recent NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 model, which exhibits a highly inhomogeneous and challenging block structure, and where we outperform SOTA for AIME25 and GPQA when removing either 2 attention layers or 3 mixture-of-experts layers.
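
For readers unfamiliar with how a constrained binary problem maps to an Ising system, the generic textbook form is sketched below. The abstract does not give the paper's actual objective, so the coefficients $h_i$ and $J_{ij}$, the quadratic form itself, and the cardinality constraint (remove exactly $k$ of $N$ blocks) are assumptions used only to show the change of variables.

```latex
% Generic constrained binary optimization: x_i = 1 means block i is removed.
\min_{x \in \{0,1\}^N} \; E(x) = \sum_{i} h_i x_i + \sum_{i<j} J_{ij}\, x_i x_j
\quad \text{s.t.} \quad \sum_{i} x_i = k .

% Substituting x_i = (1 + s_i)/2 with spins s_i \in \{-1,+1\} and J_{ij} = J_{ji}:
E(s) = \text{const}
     + \sum_{i} \Big( \tfrac{h_i}{2} + \tfrac{1}{4} \sum_{j \neq i} J_{ij} \Big) s_i
     + \tfrac{1}{4} \sum_{i<j} J_{ij}\, s_i s_j ,
\qquad \sum_{i} s_i = 2k - N .
```

The second line has the form of an Ising Hamiltonian with local fields and pairwise couplings, and the cardinality constraint becomes a fixed total magnetization.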

2601.14968 2026-06-18 cs.LG cs.AI Version update 80%

InstructTime++: Time Series Classification with Multimodal Language Modeling via Implicit Feature Enhancement

Mingyue Cheng, Xiaoyu Tao, Huajian Zhang, Qi Liu, Zhiding Liu, Yucong Luo, Yiheng Chen, Enhong Chen

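Affiliations: State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China

Topic match (Other LLM): recasting time series classification as a multimodal generation task

AI summary: Recasts time series classification as a multimodal generation task, bridges the modality gap with a discretization module and an alignment projection layer, and uses implicit feature modeling to improve language model performance.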

English abstract

Most existing time series classification methods adopt a discriminative paradigm that maps input sequences directly to one-hot encoded class labels. While effective, this paradigm struggles to incorporate contextual features and fails to capture semantic relationships among classes. To address these limitations, we propose InstructTime, a novel framework that reformulates time series classification as a multimodal generative task. Specifically, continuous numerical sequences, contextual textual features, and task instructions are treated as multimodal inputs, while class labels are generated as textual outputs by tuned language models. To bridge the modality gap, InstructTime introduces a time series discretization module that converts continuous sequences into discrete temporal tokens, together with an alignment projection layer and a generative self-supervised pre-training strategy to enhance cross-modal representation alignment. Building upon this framework, we further propose InstructTime++, which extends InstructTime by incorporating implicit feature modeling to compensate for the limited inductive bias of language models. InstructTime++ leverages specialized toolkits to mine informative implicit patterns from raw time series and contextual inputs, including statistical feature extraction and vision-language-based image captioning, and translates them into textual descriptions for seamless integration. Extensive experiments on multiple benchmark datasets demonstrate the superior performance of InstructTime++.

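2508.07375 2026-06-18 cs.CL cs.SD eess.AS updated version 80%

TurnGuide: Enhancing Meaningful Full Duplex Spoken Interactions via Dynamic Turn-Level Text-Speech Interleaving

Wenqian Cui, Lei Zhu, Xiaohui Li, Zhihan Guo, Haoli Bai, Lu Hou, Irwin King

Affiliations * The Chinese University of Hong Kong; Huawei Technologies

Topic match (Other LLM): interleaved text-speech generation in full-duplex speech language models

AI summary: Proposes TurnGuide, which dynamically segments assistant speech into dialogue turns and interleaves turn-level text and speech generation. This addresses the time-alignment problem caused by integrating discrete text tokens into continuous double-channel audio in full-duplex speech language models, and significantly improves semantic coherence and turn-taking performance.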

Comments Interspeech 2026 Long Paper Track

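Details

AI Chinese abstract (translated)

Full-Duplex Speech Language Models (FD-SLMs) are specialized foundation models that aim to enable natural, real-time spoken interaction by modeling complex conversational turn-taking, such as interruptions, backchannels, and overlapping speech. End-to-end (e2e) FD-SLMs use real-world double-channel conversational data to capture subtle two-speaker dialogue patterns for human-like interaction, but because speech sequences are long and high-quality spoken dialogue data is limited, their conversational ability is often worse than that of pure-text dialogue. Although interleaved text-speech generation can mitigate this degradation, integrating discrete text tokens into continuous double-channel audio streams may break the time alignment that fluid interaction requires. To address this, we propose TurnGuide, a new interleaved text-speech generation method for e2e FD-SLMs that dynamically segments assistant speech into dialogue turns and interleaves the generation of turn-level text and speech. This lets FD-SLMs draw on the semantic intelligence of LLMs without harming natural acoustic fluency. Extensive experiments show that TurnGuide not only significantly improves the ability of e2e FD-SLMs to generate semantically meaningful and coherent speech, but also reaches state-of-the-art performance on a variety of turn-taking events. Demos are available at this https URL. Code is available at this https URL.

English abstract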

Full-Duplex Speech Language Models (FD-SLMs) are specialized foundation models designed to enable natural, real-time spoken interactions by modeling complex conversational turn-taking such as interruptions, backchannels, and overlapping speech. End-to-end (e2e) FD-SLMs leverage real-world double-channel conversational data to capture nuanced two-speaker dialogue patterns for human-like interactions, but their conversational abilities often degrade compared to pure-text conversation due to prolonged speech sequences and limited high-quality spoken dialogue data. Although interleaved text-speech generation could mitigate this degradation, integrating discrete text tokens into continuous double-channel audio streams could disrupt the precise time alignment required for fluid interaction. To address this, we propose TurnGuide, a novel text-speech interleaved generation approach for e2e FD-SLMs that dynamically segments assistant speech into dialogue turns and interleaves turn-level text and speech generation. This approach allows FD-SLMs to integrate the semantic intelligence of LLMs without compromising the natural acoustic flow. Extensive experiments show that TurnGuide not only significantly improves e2e FD-SLMs to produce semantically meaningful, coherent speech but also achieves state-of-the-art performance on various turn-taking events. Demos are available at https://dreamtheater123.github.io/TurnGuide-Demo/. Code is available at https://github.com/dreamtheater123/TurnGuide.

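2512.04144 2026-06-18 cs.AI updated version 80%

RippleBench: Capturing Ripple Effects Using Existing Knowledge Repositories

Roy Rinberg, Usha Bhalla, Igor Shilov, Flavio P. Calmon, Rohit Gandikota

Affiliations * Harvard University; Imperial College London; Northeastern University

Topic match (Other LLM): evaluating the ripple effects of language model unlearning

AI summary: Proposes RippleBench-Maker, an automatic pipeline that retrieves semantic neighbors from a knowledge repository to generate multiple-choice questions, and evaluates the ripple effects of eight unlearning methods on Llama3-8B-Instruct. Accuracy drops decay with semantic distance and are consistent across models.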

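Details

AI Chinese abstract (translated)

Targeted interventions on language models, such as unlearning or model editing, aim to modify specific information, but their effects often spread to related, unintended areas (for example, removing virology content may reduce performance on allergy tasks); these side effects are commonly called ripple effects. We introduce RippleBench-Maker, an automatic pipeline that retrieves the semantic neighbors of any source concept from a knowledge repository and generates multiple-choice questions at different semantic distances. We instantiate the framework with WikiRAG, an open-source RAG system built on English Wikipedia, to construct RippleBench-WMDP-Bio (584 seed topics, 352,961 questions), and evaluate eight unlearning methods on Llama3-8B-Instruct. For all eight methods, the accuracy drop is largest near the unlearning target and decays with semantic distance, and each method has a different propagation curve. We replicate these findings on Mistral-7B, Zephyr-7B, and Yi-34B; the delta curves are nearly identical across models, which suggests that ripple effects are a property of the unlearning method rather than the base model. We validate all major pipeline stages with a Mechanical Turk study comprising four experiments (5,200+ responses, 61 workers). We release all code, data, and infrastructure.

English abstract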

Targeted interventions on language models, such as unlearning or model editing, aim to modify specific information, but their effects often propagate to related, unintended areas (e.g., removing virology content may degrade performance on allergies); these side-effects are commonly referred to as the ripple effect. We introduce RippleBench-Maker, an automatic pipeline that retrieves semantic neighbors of any source concept from a knowledge repository and generates multiple-choice questions at varying semantic distances. We instantiate this framework using WikiRAG, an open-source RAG system over English Wikipedia, to construct RippleBench-WMDP-Bio (584 seed topics, 352,961 questions), and evaluate eight unlearning methods on Llama3-8B-Instruct. All eight exhibit accuracy drops that are largest near the unlearned target and decay with semantic distance, each with a distinct propagation profile. We replicate these findings across Mistral-7B, Zephyr-7B, and Yi-34B; cross-model delta curves are nearly identical, suggesting ripple effects are a property of the unlearning method rather than the base model. We validate all major pipeline stages using a four-experiment Mechanical Turk study (5,200+ responses, 61 workers). We release all code, data, and infrastructure.
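
The "delta curve" idea, accuracy drop as a function of semantic distance, can be sketched as follows. This is only an illustration under stated assumptions: the record format, the function names `accuracy_by_distance` and `delta_curve`, and the toy data are hypothetical and are not taken from the RippleBench code.

```python
from collections import defaultdict


def accuracy_by_distance(records):
    """records: iterable of (semantic_distance, is_correct) pairs."""
    totals = defaultdict(lambda: [0, 0])  # distance -> [n_correct, n_total]
    for distance, is_correct in records:
        totals[distance][0] += int(is_correct)
        totals[distance][1] += 1
    return {d: correct / total for d, (correct, total) in totals.items()}


def delta_curve(base_records, unlearned_records):
    """Accuracy drop (base minus unlearned) at each shared semantic distance."""
    base = accuracy_by_distance(base_records)
    unlearned = accuracy_by_distance(unlearned_records)
    return {d: base[d] - unlearned[d] for d in sorted(base) if d in unlearned}


# Made-up toy data: distance 0 is the unlearned topic itself.
base = [(0, True), (0, True), (1, True), (1, True)]
unlearned = [(0, False), (0, False), (1, True), (1, False)]
curve = delta_curve(base, unlearned)
print(curve)
```

A curve that is large at distance 0 and shrinks at larger distances matches the decay pattern the abstract reports.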

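2510.09905 2026-06-18 cs.AI cs.CL updated version 80%

The Personalization Trap: How User Memory Alters Emotional Reasoning in LLMs

Xi Fang, Weijie Xu, Yuchong Zhang, Stephanie Eckman, Scott Nickleach, Chandan K. Reddy

Affiliations * Amazon

Topic match (Other LLM): study of bias in LLM emotional reasoning

AI summary: Studies how user memory leads to systematic bias in the emotional reasoning of large language models. Finds that high-performing models interpret emotions more accurately for users with advantaged backgrounds, and that personalization mechanisms may embed social hierarchies.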

Comments 19 pages 5 figures

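Details

AI Chinese abstract (translated)

When an AI assistant remembers that Sarah is a single mother working two jobs, does it interpret her stress differently than it would if she were a wealthy executive? As personalized AI systems increasingly incorporate long-term user memory, it is essential to understand how that memory shapes emotional reasoning. We study how user memory affects the emotional intelligence of large language models (LLMs) by evaluating 15 models on human-validated emotional intelligence tests. We find that the same scenario paired with different user profiles yields systematically different emotional interpretations. Across validated user-independent emotional scenarios and diverse user profiles, several high-performing LLMs showed systematic bias, with advantaged user profiles receiving more accurate emotional interpretations. In addition, LLMs show significant disparities across demographic factors in emotional reasoning and supportive recommendation tasks, indicating that personalization mechanisms can embed social hierarchies into a model's emotional reasoning. These results highlight a key challenge for memory-augmented AI: systems designed for personalization may reinforce social inequality. To mitigate these disparities, we curate a general-purpose preference dataset designed to reduce the influence of demographic profiles on emotional understanding.

English abstract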

When an AI assistant remembers that Sarah is a single mother working two jobs, does it interpret her stress differently than if she were a wealthy executive? As personalized AI systems increasingly incorporate long-term user memory, understanding how this memory shapes emotional reasoning is critical. We investigate how user memory affects emotional intelligence in large language models (LLMs) by evaluating 15 models on human-validated emotional intelligence tests. We find that identical scenarios paired with different user profiles produce systematically divergent emotional interpretations. Across validated user-independent emotional scenarios and diverse user profiles, systematic biases emerged in several high-performing LLMs where advantaged profiles received more accurate emotional interpretations. Moreover, LLMs demonstrate significant disparities across demographic factors in emotion reasoning and supportive recommendations tasks, indicating that personalization mechanisms can embed social hierarchies into models' emotional reasoning. These results highlight a key challenge for memory-enhanced AI: systems designed for personalization may reinforce social inequalities. To mitigate these disparities, we curate a general-purpose preference dataset designed to reduce demographic profiles' influence on emotional understanding.

2506.09046 2026-06-18 cs.LG cs.AI cs.MA updated version 80%

Self-Evolving Multi-Agent Systems via Textual Backpropagation

Xiaowen Ma, Yunpu Ma, Chenyang Lin, Sikuan Yan, Jinhe Bi, Zixuan Cao, Yijun Tian, Volker Tresp, Hinrich Schuetze

Affiliations * Ludwig Maximilian University of Munich; Technical University of Munich; Munich Center for Machine Learning; University of Notre Dame

Topic match (Other LLM): building a multi-agent neural network framework from multiple LLMs

AI summary: Proposes the Agentic Neural Network framework, which models multi-agent collaboration as a layered neural network. Tasks are decomposed in a forward phase and feedback is propagated in a backward phase, so that agent roles, prompts, and coordination self-evolve; the framework outperforms existing methods on seven benchmark datasets.

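Details

AI Chinese abstract (translated)

Using multiple large language models (LLMs) has proven effective for complex, high-dimensional tasks, but current methods usually rely on static, manually designed multi-agent configurations. To overcome these limitations, we propose the Agentic Neural Network (ANN) framework, which conceptualizes multi-agent collaboration as a layered neural network architecture. In this design, each agent operates as a node, and each layer forms a collaborative team focused on a specific subtask. Our framework follows a two-phase optimization strategy: (1) Forward phase: inspired by the forward pass of a neural network, tasks are dynamically decomposed into subtasks, and collaborative agent teams with suitable aggregation methods are built layer by layer. (2) Backward phase: mimicking backpropagation, we optimize global and local collaboration through iterative feedback, so that agents can self-evolve their roles, prompts, and coordination. This neuro-symbolic approach allows our framework to create new or specialized agent teams after training, yielding notable gains in accuracy and adaptability. On seven benchmark datasets, our work outperforms leading multi-agent baselines under the same configurations and shows consistent performance improvements.

English abstract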

Leveraging multiple Large Language Models (LLMs) has proven effective for addressing complex, high-dimensional tasks, but current approaches often rely on static, manually engineered multi-agent configurations. To overcome these constraints, we present the Agentic Neural Network (ANN), a framework that conceptualizes multi-agent collaboration as a layered neural network architecture. In this design, each agent operates as a node, and each layer forms a cooperative team focused on a specific subtask. Our framework follows a two-phase optimization strategy: (1) Forward Phase - Drawing inspiration from neural network forward passes, tasks are dynamically decomposed into subtasks, and cooperative agent teams with suitable aggregation methods are constructed layer by layer. (2) Backward Phase - Mirroring backpropagation, we refine both global and local collaboration through iterative feedback, allowing agents to self-evolve their roles, prompts, and coordination. This neuro-symbolic approach enables our framework to create new or specialized agent teams post-training, delivering notable gains in accuracy and adaptability. Across seven benchmark datasets, our work surpasses leading multi-agent baselines under the same configurations, showing consistent performance improvements.

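2507.01414 2026-06-18 cs.LG updated version 80%

Decomposing Prediction Mechanisms for In-Context Recall

Sultan Daniels, Dylan Davis, Dhruv Gautam, Wentinn Liao, Gireeja Ranade, Anant Sahai

Affiliations * University of California, Berkeley; University of Pennsylvania

Topic match (Other LLM): analysis of in-context learning mechanisms in Transformers

AI summary: By designing a new toy problem that combines continuous in-context learning with discrete associative recall, the work finds that Transformer models use two separate mechanisms with different learning dynamics for in-context recall: one relies on discrete symbolic labels for associative recall, and the other makes Bayesian-style predictions from the previous token and the context.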

Comments 45 pages, 47 figures, 2 tables

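Details

AI Chinese abstract (translated)

We introduce a new class of toy problems that combines features of linear-regression-style continuous in-context learning (ICL) with discrete associative recall. We pretrain Transformer models on sample traces from this toy problem, specifically symbolically labeled, interleaved state observations drawn from randomly sampled linear deterministic dynamical systems. We study whether the model can recall the state of a sequence it has previously seen in its context when prompted with the corresponding in-context label. A closer look at the task makes clear that the model must perform two functions: (1) identify which system's state should be recalled and apply that system to its last seen state; (2) continue applying the correct system to predict subsequent states. Training dynamics show that the first capability appears only in the middle-to-late stages of training. Surprisingly, the second capability (continuing to predict the resumed sequence) develops earlier. Through out-of-distribution experiments and a mechanistic analysis of the model weights via edge pruning, we find that next-token prediction for this toy problem involves at least two separate mechanisms. One mechanism uses the discrete symbolic labels for associative recall to predict the start of the resumption of a previously seen sequence. The second mechanism is largely independent of the discrete symbolic labels and makes "Bayesian-style" predictions based on the previous token and the context. The two mechanisms have different learning dynamics. To confirm that this multi-mechanism phenomenon (which shows up as distinct phase transitions) is not just an artifact of the toy setting, we observe a similar phenomenon using OLMo training checkpoints on an ICL translation task: a decisive gap between performance on the first task token and performance on the second task token.

English abstract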

We introduce a new family of toy problems that combine features of linear-regression-style continuous in-context learning (ICL) with discrete associative recall. We pretrain transformer models on sample traces from this toy, specifically symbolically-labeled interleaved state observations from randomly drawn linear deterministic dynamical systems. We study whether the transformer models can recall the state of a sequence previously seen in their context when prompted to do so with the corresponding in-context label. Taking a closer look at this task, it becomes clear that the model must perform two functions: (1) identify which system's state should be recalled and apply that system to its last seen state, and (2) continue to apply the correct system to predict the subsequent states. Training dynamics reveal that the first capability emerges well into a model's training. Surprisingly, the second capability, of continuing the prediction of a resumed sequence, develops much earlier. Via out-of-distribution experiments, and a mechanistic analysis on model weights via edge pruning, we find that next-token prediction for this toy problem involves at least two separate mechanisms. One mechanism uses the discrete symbolic labels to do the associative recall required to predict the start of a resumption of a previously seen sequence. The second mechanism, which is largely agnostic to the discrete symbolic labels, performs a "Bayesian-style" prediction based on the previous token and the context. These two mechanisms have different learning dynamics. To confirm that this multi-mechanism phenomenon (manifesting as separate phase transitions) is not just an artifact of our toy setting, we used OLMo training checkpoints on an ICL translation task to see a similar phenomenon: a decisive gap in the emergence of first-task-token performance vs second-task-token performance.

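2606.19025 2026-06-18 cs.LG cs.AI cs.DC cs.SY eess.SY new submission 80%

FoMoE: Breaking the Full-Replica Barrier with a Federation of MoEs

Lorenzo Sani, Zeyu Cao, Meghdad Kurmanji, Alex Iacob, Andrej Jovanovic, Yan Gao, Wanru Zhao, Nicholas D. Lane

Affiliations * DeepSeek-AI

Topic match (Pre-training): proposes a cross-datacenter MoE training system that reduces communication overhead.

AI summary: Proposes FoMoE, a system that breaks the full-replica paradigm by partitioning expert layers across workers. Combined with partial expert replication and a skip-token mechanism, it significantly reduces communication overhead and improves throughput.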

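Details

AI Chinese abstract (translated)

Pre-training large language models (LLMs) usually requires large-scale infrastructure with tightly coupled hardware accelerators. While increasing model and dataset size remains the main driver of performance, Mixture-of-Experts (MoE) architectures have recently achieved state-of-the-art results by decoupling parameter count from compute cost. This efficiency makes it possible to train large-scale models under a constrained compute budget, but it usually requires the high-speed interconnect of a single datacenter. To overcome these physical limits, recent methods such as DiLoCo and Photon use low-communication data-parallel approaches that allow scaling across geographically distributed, weakly connected datacenters. However, these methods have a fundamental inefficiency: they require a full model replica at every site, which imposes heavy memory constraints and communication overhead. In this work, we introduce FoMoE, a system that breaks the full-replica paradigm by partitioning expert layers across workers. We show that FoMoE: (I) through partial expert replication, reduces communication cost by up to 1.42x relative to efficient baselines and by 45.44x relative to DDP in the studied regimes; (II) achieves an empirical throughput speedup of up to 1.4x through a novel skip-token mechanism; (III) exhibits stable routing in the trained proxy regimes and, through system modelling, extends the communication and memory benefits to 100B-scale configurations.

English abstract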

Pre-training Large Language Models (LLMs) typically demands large-scale infrastructure with tightly coupled hardware accelerators. While increasing model and dataset scale remains the dominant driver of performance, Mixture-of-Experts (MoEs) architectures have recently achieved state-of-the-art results by decoupling parameter count from computational cost. This efficiency enables training massive models on constrained compute budgets, yet it typically requires the high-speed interconnects of a single datacenter. To overcome these physical limits, recent approaches such as DiLoCo and Photon use low-communication data-parallel methods to enable scaling across geographically distributed, weakly connected data centers. However, these methods suffer from a fundamental inefficiency: they require full model replicas at every site, which imposes prohibitive memory constraints and communication overheads. In this work, we introduce FoMoE, a system that breaks the full-replica paradigm by partitioning expert layers across workers. We demonstrate that FoMoE: (I) reduces communication costs by up to 1.42x over efficient baselines and 45.44x over DDP via partial expert replication in the studied regimes; (II) achieves empirical throughput speedups of up to 1.4x through a novel skip-token mechanism; and (III) shows stable routing in the trained proxy regimes and projects the communication/memory benefits to 100B-scale configurations through system modelling.

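2606.18650 2026-06-18 cs.LG new submission 80%

BLADE: Scalable Bi-level Adaptive Data Selection for LLM Training

Jiaxing Wang, Deping Xiang, Jin Xu, Zirui Liu, Zicheng Zhang, Guoqiang Gong, Jun Fang, Chao Liu, Pengzhang Liu, Tongxuan Liu, Ke Zhang, Qixia Jiang

Affiliations * University of Oxford; Renmin University of China; University of Chinese Academy of Sciences

Topic match (Pre-training): scalable bi-level adaptive data selection for LLM training

AI summary: Proposes the BLADE framework, which uses Lagrange multipliers to turn bi-level optimization into a penalized single-level objective. This avoids inverse-Hessian computation and yields a dynamic reference model; the method has a first-order convergence guarantee and outperforms existing methods in experiments.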

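Details

AI Chinese abstract (translated)

As datasets for large language models (LLMs) grow to trillions of tokens, data selection has become a key frontier for filtering out uninformative noise and building adaptive learning trajectories. Beyond static heuristic filtering, advanced data selection methods for LLM training mainly follow two paradigms, each with fundamental limitations. Influence-based methods offer a principled bi-level objective but require intractable inverse-Hessian computation, while excess-loss methods are computationally efficient but depend on a static reference model that becomes misaligned with the evolving proxy model during training. We propose BLADE (Bi-Level Adaptive Data sElection), a Hessian-free data selection framework. BLADE uses Lagrange multipliers to reformulate the bi-level optimization problem behind influence-based methods as a penalized single-level objective, which avoids inverse-Hessian computation and reveals a principled connection to excess-loss-based data selection. The resulting objective recovers the excess-loss form but replaces the static reference model with a dynamic reference model that stays synchronized with training. In theory, we prove that this penalized formulation guarantees first-order convergence. For efficient online batch selection, we instantiate BLADE as a memoryless randomized block-coordinate Frank-Wolfe algorithm. Extensive experiments show that BLADE consistently outperforms state-of-the-art data selection baselines and offers a practical recipe for LLM training.

English abstract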

As Large Language Model (LLM) datasets scale to trillions of tokens, data selection has emerged as a critical frontier to filter out uninformative noise and construct adaptive learning trajectories. Beyond static heuristic filtering, advanced data selection methods for LLM training largely follow two paradigms, each with fundamental limitations. Influence-based methods provide principled bi-level objectives but require intractable inverse-Hessian computations, while excess-loss methods are computationally efficient but rely on a static reference model that becomes misaligned with the evolving proxy model during training. We propose BLADE (Bi-Level Adaptive Data sElection), a Hessian-free framework for data selection. BLADE reformulates the bi-level optimization problem underlying influence-based methods as a penalized single-level objective via Lagrange multipliers, avoiding inverse-Hessian computation while revealing a principled connection to excess-loss based data selection. The resulting objective recovers an excess-loss form but replaces the static reference model with a dynamic one that stays synchronized with training. Theoretically, we prove that this penalized formulation guarantees first-order convergence. For efficient online batch selection, we instantiate BLADE as a memoryless randomized block-coordinate Frank-Wolfe algorithm. Extensive experiments show that BLADE consistently outperforms state-of-the-art data selection baselines, providing a practical recipe for LLM training.
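
The excess-loss form that the abstract refers to can be illustrated with a small sketch: each example is scored by the proxy model's loss minus the reference model's loss, and the highest-scoring examples are kept. This is a generic top-k excess-loss selector, not BLADE's block-coordinate Frank-Wolfe algorithm; the function name and the loss values are hypothetical.

```python
def select_by_excess_loss(proxy_losses, reference_losses, k):
    """Return indices of the k examples with the largest excess loss.

    Excess loss = proxy loss - reference loss. In a dynamic-reference setting,
    reference_losses would be recomputed as training progresses; here they are
    plain inputs.
    """
    excess = [p - r for p, r in zip(proxy_losses, reference_losses)]
    ranked = sorted(range(len(excess)), key=lambda i: excess[i], reverse=True)
    return ranked[:k]


# Made-up per-example losses for a batch of four examples.
chosen = select_by_excess_loss(
    proxy_losses=[2.0, 1.0, 3.0, 0.5],
    reference_losses=[1.5, 0.2, 1.0, 0.4],
    k=2,
)
print(chosen)
```

With a static reference, the second argument never changes during training, which is the misalignment the abstract describes.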

2606.18192 2026-06-18 cs.AI new submission 80%

The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data

Nick Bettencourt, Xiaowei Ding, Kay Giesecke

Affiliations * University of California, Los Angeles; Nanjing University; Stanford University

Topic match (Pre-training): building a long-context pre-training dataset for LLMs

AI summary: To address the scarcity of long-context documents, proposes the SEFD dataset, which reconstructs SEC filings into layout-faithful MultiMarkdown for financial language modeling and evaluation. The corpus is token-efficient and overlaps with Common Crawl by less than 0.1%.

Comments Preprint. Includes appendix, tables, and figures

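Details

AI Chinese abstract (translated)

As high-quality public web corpora are increasingly exhausted, clean long-context documents have become a scarce and expensive source of training data for large language models (LLMs). Existing long-context corpora are usually proprietary and costly to obtain, synthetically generated, or concentrated in narrow domains such as programming. We present the Stanford EDGAR Filings Dataset (SEFD), an open dataset that reconstructs SEC filings into layout-faithful MultiMarkdown for financial language modeling and evaluation. SEFD makes audited financial statements, risk disclosures, ownership reports, accounting notes, and market-moving event filings usable as long-context pre-training data and as a foundation for financial reasoning, forecasting, compliance, and document understanding. The resulting corpus is token-efficient, ready for direct use by models, and overlaps with Common Crawl-derived corpora by less than 0.1%. We release SEFD-v1, an initial public snapshot of 152B tokens, and provide corpus-level analysis of a larger archive of 18.5 million filings (estimated at 550B tokens). We further introduce two SEFD-based benchmarks: EDGAR-Forecast, which evaluates filing-based numerical forecasting after a model's knowledge cutoff, and EDGAR-OCR, which evaluates transcription of complex financial tables.

English abstract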

As high-quality public web corpora become increasingly exhausted, clean long-context documents have become a scarce and expensive source of training data for large language models (LLMs). Existing long-context corpora are often proprietary and costly to acquire, synthetically generated, or concentrated in narrow domains such as programming. We introduce the Stanford EDGAR Filings Dataset (SEFD), an open reconstruction of SEC filings into layout-faithful MultiMarkdown for financial language modeling and evaluation. SEFD makes audited financial statements, risk disclosures, ownership reports, accounting notes, and market-moving event filings usable as long-context pretraining data and as a basis for financial reasoning, forecasting, compliance, and document understanding. The resulting corpus is token-efficient, model-ready, and has less than 0.1% overlap with Common Crawl-derived corpora. We release SEFD-v1, a 152B-token initial public snapshot, and provide corpus-level analyses of a larger 18.5M-filing archive estimated at 550B tokens. We further introduce two SEFD-derived benchmarks: EDGAR-Forecast, which evaluates filing-grounded numerical forecasting after model knowledge cutoffs, and EDGAR-OCR, which evaluates transcription of complex financial tables.

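2606.10466 2026-06-18 cs.LG cs.AI new submission 80%

UPLOTS: A Unified Pretrained Language Model for Constrained Time-series Generation

Du Yin, Hao Xue, Jinliang Deng, Yang Yang, Shuang Ao, Arian Prabowo, Flora Salim

Affiliations * University of New South Wales; HKUST(GZ) (Hong Kong University of Science and Technology, Guangzhou); BUAA (Beihang University)

Topic match (Pre-training): time-series generation with a unified pre-trained language model

AI summary: Proposes UPLOTS, a prompt-guided framework built on a unified pre-trained language model. Dynamic multi-dataset loss re-weighting and prompt-to-pattern mapping enable constrained time-series generation across domains; generalization and data-augmentation benefits are validated on four benchmarks.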

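Details

AI Chinese abstract (translated)

In time-series generation, existing methods usually handcraft or train a separate model for each dataset, which limits scalability and fails to exploit temporal structure shared across domains. To address this fragmentation, we propose UPLOTS, a unified, prompt-guided language model framework for constrained time-series generation across domains. Rather than building task-specific models, UPLOTS uses a single pre-trained transformer backbone guided by learned constraint prompts, which enables on-demand generation with precise control over patterns. A key innovation is our dynamic multi-dataset loss re-weighting and prompt-to-pattern mapping, which lets UPLOTS internalize diverse temporal structures during training and generate them conditionally at inference time. We evaluate UPLOTS on four real-world benchmarks and several constraint settings, including peak-period, calendar, load-level, and volatility patterns. Additional held-out constraint-combination and downstream forecasting experiments further show that UPLOTS generalizes beyond the original peak-pattern setting and improves data augmentation when real data is scarce. Our code and baselines are available in an anonymous GitHub repository: this https URL.

English abstract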

In time-series generation, existing approaches typically handcraft or train a separate model for each dataset, which hinders their scalability and fails to leverage shared temporal structures across domains. To address this fragmentation, we propose UPLOTS, a Unified, Prompt-guided Language model framework fOr constrained Time-Series Generation across diverse domains. Instead of building task-specific models, UPLOTS leverages a single pre-trained transformer backbone guided by learned constraint prompts, enabling on-demand generation with precise pattern control. One key innovation is our dynamic multi-dataset loss re-weighting and prompt-to-pattern mapping, which allows UPLOTS to internalize diverse temporal structures during training and conditionally generate them at inference. We evaluate UPLOTS on four real-world benchmarks and multiple constraint settings, including peak-period, calendar, load-level, and volatility patterns. Additional held-out constraint-combination and downstream forecasting experiments further demonstrate that UPLOTS generalizes beyond the original peak-pattern setting and improves data augmentation under scarce real-data regimes. Our code and baselines are available at an anonymous GitHub repo: https://anonymous.4open.science/r/UPLOTS-6C36.

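2606.19004 2026-06-18 cs.DC cs.AI cs.LG new submission 80%

Spotlight: Synergizing Seed Exploration and Spot GPUs for DiT RL Post-Training

Ruiqi Lai, Dakai An, Wei Gao, Ju Huang, Siran Yang, Jiamang Wang, Lin Qu, Dmitrii Ustiugov, Wei Wang

Affiliations * NTU Singapore (Nanyang Technological University); Hong Kong University of Science and Technology; Alibaba Group

Topic match (Post-training): proposes the Spotlight system, which uses spot GPUs to accelerate DiT reinforcement learning post-training.

AI summary: To address the high cost of DiT reinforcement learning post-training, proposes the Spotlight system. It exploits the tolerance of exploration to stale weights and fast reconfiguration of SP groups to train efficiently on spot GPUs, giving a 4x speedup and a 1.4-6.4x cost reduction.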

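Details

AI Chinese abstract (translated)

Reinforcement learning (RL) post-training of Diffusion Transformers (DiTs) is extremely expensive and requires thousands of high-end GPUs. Existing work explores two directions for reducing cost: seed exploration improves training convergence by selecting high-contrast samples but adds computation to the critical path; spot GPUs offer a 69-77% cost reduction but sit idle during training, because DiT rollouts finish almost simultaneously, which prevents LLM-style pipelining of rollout and training. Spot GPU preemptions further break Sequence Parallelism (SP) groups and fragment the GPU topology. We present Spotlight, the first system to use spot GPUs for DiT RL post-training. Spotlight builds on two key insights: (1) we show that exploration can tolerate stale model weights, because exploration with the previous iteration's weights preserves the relative ranking of random seeds, which allows exploration to run on idle spot GPUs during training. (2) SP reconfiguration can reuse on-node state, cutting group recovery time from minutes to sub-second launches. Based on these insights, Spotlight introduces three techniques: a bandit-based exploration planner that maximizes reward variance within the training time budget; elastic sequence parallelism, which dynamically reconfigures SP groups through persistent schedulers and intra-node weight copying; and a preemption-aware pull-based request scheduler that balances load and commits in-flight state on preemption. We implement Spotlight on the open-source RL platform ROLL and evaluate it on Qwen-Image post-training. Spotlight reaches the same target validation score 4x faster than baselines and reduces total cost by 1.4-6.4x, while achieving better image quality on the DeepSeek-OCR and Geneval datasets at resolutions of 512×512 and 1280×1280.

English abstract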

Reinforcement learning (RL) post-training of Diffusion Transformers (DiTs) is prohibitively expensive, requiring thousands of high-end GPUs. Existing works explore two directions to reduce cost: seed exploration improves training convergence by selecting high-contrast samples, yet adds compute to the critical path; spot GPUs offer 69-77% lower cost, yet sit idle during training because DiT rollouts finish nearly simultaneously, which prevents LLM-style pipelining of rollout with training. Spot preemptions further break Sequence Parallelism (SP) groups, fragmenting GPU topology. We present Spotlight, the first system that harvests spot GPUs for DiT RL post-training. Spotlight rests on two key insights we devise: (1) we show that exploration can tolerate stale model weights because exploration that uses the model weights from the previous iteration preserves the relative ranking of random seeds, allowing exploration to run on idle spot GPUs during training. (2) SP reconfiguration can reuse on-node state, reducing group recovery from minutes to sub-second launches. Built on these insights, Spotlight introduces three techniques: a bandit-based exploration planner that maximizes reward variance within the training time budget, elastic sequence parallelism that reconfigures SP groups on the fly via persistent schedulers and intra-node weight copying, and a preemption-aware pull-based request scheduler that balances load and commits in-flight state upon preemption. We implement Spotlight on the open-source RL platform ROLL and evaluate it on Qwen-Image post-training. Spotlight reaches the same target validation score 4× faster than baselines, reducing total cost by 1.4-6.4× while achieving superior image quality on DeepSeek-OCR and Geneval datasets with resolution 512×512 and 1280×1280.

2606.19002 2026-06-18 cs.CL new submission 80%

Enhancing Multilingual Reasoning via Steerable Model Merging

Zhuoran Li, Rui Xu, Jian Yang, Junnan Liu, Zhijun Chen, Qianren Mao, Hongcheng Guo, Jiaheng Liu, Likang Xiao, Ming Li, Xiaojie Wang

Affiliations * Beijing University of Posts and Telecommunications; Fudan University; Beihang University; Monash University; Zhongguancun Laboratory; Nanjing University; Tsinghua University

Topic match (Post-training): proposes a steerable model merging framework to strengthen multilingual reasoning.

AI summary: Proposes the Steerable Model Merging (ST-Merge) framework, which uses a gated cross-attention mechanism to adaptively modulate the contribution of each source model and outperforms strong baselines on multilingual reasoning tasks.

Comments 12 pages, 7 figures, 8 tables. Accepted by ACL2026 Findings

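Details

AI Chinese abstract (translated)

Model merging is an effective technique for combining the capabilities of a multilingual model and a reasoning model. By aligning the feature spaces of different models, it has achieved promising generalization on multilingual reasoning tasks. However, the single merged model often fails to resolve conflicts between the source models, which leads to suboptimal performance. In other words, a one-size-fits-all merging strategy may not suit the characteristics of different inputs, which may call for prioritizing certain models. To this end, we propose a Steerable Model Merging (ST-Merge) framework that modulates the contribution of each source model. To realize this idea, we introduce a gated cross-attention mechanism that adaptively weights or filters the two attended source models. Extensive experiments show that ST-Merge consistently outperforms several strong baselines on four multilingual reasoning benchmarks covering 21 languages.

English abstract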

Model merging is an effective technique for composing the capabilities of a multilingual model and a reasoning model. It has achieved promising generalization in multilingual reasoning tasks by aligning feature spaces of different models. However, the merged single model often fails to address the conflicts between source models, leading to suboptimal performance. In other words, the one-size-fits-all merging strategy may not align with the characteristics of different inputs which may require prioritizing certain models over others. To this end, we propose a Steerable Model Merging (ST-Merge) framework to modulate the contribution of each source model. To realize this idea, we introduce a gated cross-attention mechanism to weight or filter the two attended source models in an adaptive manner. Extensive experiments demonstrate that ST-Merge consistently outperforms multiple strong baselines on four multilingual reasoning benchmarks across 21 different languages.

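2606.18967 2026-06-18 cs.LG new submission 80%

EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts

Minseo Kim, Minjae Lee, Seunghyuk Oh, Kevin Galim, Donghoon Kim, Coleman Hooper, Harman Singh, Amir Gholami, Hyung Il Koo, Wonjun Kang

Affiliations * FuriosaAI; University of California, Berkeley

Topic match (Post-training): proposes self-speculative decoding to accelerate reinforcement learning rollouts.

AI summary: To address the latency bottleneck of autoregressive decoding in reinforcement learning rollouts, proposes a system-aware self-speculative decoding framework. A quantized self-speculative drafter and a system-aware speculation toggle policy reduce rollout and end-to-end latency while preserving model quality.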

Comments Project Page: https://github.com/furiosa-ai/EfficientRollout

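Details

AI Chinese abstract (translated)

Reinforcement learning (RL) has become a representative post-training paradigm for LLMs, giving them strong reasoning and agentic capabilities. However, rollout generation remains the main latency bottleneck, because autoregressive sampling decodes responses sequentially and a small number of long-tail generations often determine completion time. Speculative decoding (SD) offers a natural way to ease this bottleneck: it is a mature technique for serving fixed LLMs that reduces latency by drafting tokens quickly and accepting them through parallel verification while preserving the target model's distribution. But its practical speedup does not transfer directly to RL rollouts: (i) the changing target policy makes any fixed drafter increasingly mismatched with the policy's output distribution; (ii) the active batch size shrinks during rollout decoding, so decoding shifts from compute-bound to memory-bound, where parallel verification can use underutilized compute. Accelerating RL rollouts therefore requires a drafter that stays effective for the evolving policy under long, high-temperature generation, together with system-aware use of SD that avoids compute-bound regimes. We propose EfficientRollout, a system-aware self-speculative SD framework that addresses this gap in RL rollouts. EfficientRollout derives a quantized drafter from the target model (that is, self-speculative decoding), which keeps the drafter coupled to the evolving policy without separate drafter pre-training or online adaptation. It further coordinates a system-aware SD toggle policy with acceptance-aware draft-length adaptation, speculating only in beneficial regimes while matching the drafting budget to the evolving drafter quality. Relative to an accelerated autoregressive rollout baseline, EfficientRollout reduces rollout and end-to-end latency by up to 19.6% and 12.7%, respectively, while preserving final model quality.

English abstract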

Reinforcement learning (RL) has become a representative post-training paradigm for LLMs, enabling strong reasoning and agentic capabilities. However, rollout generation remains a dominant latency bottleneck because autoregressive sampling decodes responses sequentially and a small number of long-tailed generations often determine completion time. Speculative decoding (SD) offers a natural way to address this bottleneck, as it is a well-established technique for serving fixed LLMs that reduces latency by rapidly drafting tokens and accepting them through parallel verification while preserving the target-model distribution. However, its practical speedups do not directly carry over to RL rollouts: (i) the evolving target policy makes any fixed drafter increasingly mismatched with the policy's output distribution; and (ii) active batch sizes shrink throughout rollout decoding, shifting decoding from compute-bound to memory-bound regimes where parallel verification can exploit underutilized compute. Therefore, accelerating RL rollouts requires both a drafter that remains effective under long, high-temperature generations from an evolving policy and system-aware use of SD that avoids compute-bound regimes. We present EfficientRollout, a system-aware self-SD framework designed to address this gap for RL rollouts. EfficientRollout induces a quantized drafter from the target model (i.e. self-speculative decoding), keeping it coupled to the evolving policy without separate drafter pretraining or online adaptation. It further coordinates a system-aware SD toggle policy with acceptance-aware draft-length adaptation, enabling speculation only in beneficial regimes while matching the drafting budget to evolving drafter quality. EfficientRollout reduces rollout and end-to-end latency by up to 19.6% and 12.7%, respectively, over an accelerated AR rollout baseline, while preserving final model quality.
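
The claim that speculative decoding preserves the target-model distribution rests on its acceptance rule, which can be sketched for a single drafted token. This is the standard rule from the speculative sampling literature, not EfficientRollout's toggle policy or quantized drafter; `verify_draft_token` and the toy distributions are hypothetical names and values used for illustration.

```python
import random


def verify_draft_token(token, target_probs, draft_probs, rng):
    """Accept or replace one drafted token.

    Accept with probability min(1, p/q), where p and q are the target and
    draft probabilities of the drafted token. On rejection, resample from the
    normalized residual max(0, p - q), which keeps the overall output
    distributed according to the target model.
    """
    p, q = target_probs[token], draft_probs[token]
    if rng.random() < min(1.0, p / q):
        return token, True
    residual = [max(0.0, pt - qt) for pt, qt in zip(target_probs, draft_probs)]
    total = sum(residual)
    weights = [r / total for r in residual]
    replacement = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return replacement, False


rng = random.Random(0)
# Identical distributions: the drafted token is always accepted.
same = verify_draft_token(1, [0.2, 0.8], [0.2, 0.8], rng)
# Fully mismatched drafter: the token is rejected and resampled from the target.
mismatch = verify_draft_token(1, [1.0, 0.0], [0.0, 1.0], rng)
print(same, mismatch)
```

The second call shows why a drifting drafter hurts: the more the draft distribution departs from the evolving policy, the more tokens are rejected and the less latency is saved.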

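2606.18844 2026-06-18 cs.LG new submission 80%

Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation

Zhilin Huang, Hang Gao, Ziqiang Dong, Yuan Chen, Yifeng Luo, Chujun Qin, Jingyi Wang, Yang Yang, Guanjun Jiang

Affiliations * Qwen Business Unit of Alibaba; Tsinghua University; Peking University

Topic match (Post-training): a policy optimization method that uses the model's own trajectories.

AI summary: Proposes TAPO, which builds micro-reflective corrections by contrasting correct and incorrect trajectories, moving self-distillation from implicit distributional alignment to explicit trajectory construction. It outperforms GRPO on several mathematical reasoning benchmarks.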

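Details

AI Chinese abstract (translated)

Self-distillation improves the reasoning of large language models by using the model's own generations as the training signal, usually through implicit logit-level alignment that minimizes the KL divergence to a privileged target distribution. However, because this supervision is produced by uncontrolled sampling, it offers no diagnostic insight into the model's specific errors and no corrective guidance for its individual failure patterns. The model therefore learns to imitate the privileged distribution rather than receiving fine-grained corrections that pinpoint where and why its reasoning fails. In this paper, we propose Trajectory-Augmented Policy Optimization (TAPO), which advances self-distillation from implicit distributional alignment to explicit trajectory construction. During reinforcement learning training, the model produces both correct and incorrect trajectories for the same query, and TAPO uses this contrastive structure to build micro-reflective corrections: new training trajectories that keep the model's erroneous reasoning up to the point of failure, then insert a natural-language diagnosis and corrected reasoning guided by a correct reference from the same sampling group. Because each trajectory is anchored in the learner's own prefix and solutions, the corrective signal preserves the model's on-policy distribution to a greater degree than the position-wise alignment imposed by KL-based methods. To integrate these trajectories, TAPO introduces difficulty-aware candidate selection at the boundary of the model's capability and uses decoupled advantage estimation to prevent gradient contamination. Experiments on AIME 2024, AIME 2025, and HMMT 2025 show that TAPO achieves consistent improvements over GRPO for the same number of training steps. Further analysis shows that TAPO strengthens both first-pass reasoning and the effectiveness of error correction.

English abstract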

Self-distillation improves reasoning in large language models by using the model's own rollouts as training signal, typically through implicit logit-level alignment that minimizes KL divergence toward a privileged target distribution. However, because this supervision is generated via uncontrolled sampling, it provides no diagnostic insight into the model's specific errors or corrective guidance for its individual failure patterns. Consequently, the model learns to imitate a privileged distribution rather than receiving fine-grained corrections that pinpoint where and why its reasoning fails. In this paper, we propose Trajectory-Augmented Policy Optimization (TAPO), which advances self-distillation from implicit distributional alignment to explicit trajectory construction. During RL training, the model produces both correct and incorrect rollouts to the same query, and TAPO leverages this contrastive structure to construct micro-reflective corrections, new training trajectories that retain the model's erroneous reasoning up to the point of failure, then insert a natural-language diagnosis and corrected reasoning guided by a correct reference from the same sampling group. Since each trajectory is anchored in the learner's own prefix and solutions, the corrective signal preserves the model's on-policy distribution to a greater extent than the position-wise alignment imposed by KL-based methods. To integrate these trajectories, TAPO introduces difficulty-aware candidate selection at the model's capability boundary and decoupled advantage estimation to prevent gradient contamination. Experiments on AIME 2024, AIME 2025, and HMMT 2025 show that TAPO achieves consistent improvements over GRPO under the same number of training steps. Further analysis demonstrates that TAPO strengthens both first-pass reasoning and error-correction effectiveness.
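
The structure of a micro-reflective correction, as the abstract describes it, can be sketched as a simple list operation: keep the erroneous prefix, insert a diagnosis, then append corrected reasoning. Everything here is an assumption made for illustration, including the step-list representation, the choice to keep the failing step inside the prefix, and the name `build_micro_reflective_trajectory`; the paper's actual construction may differ.

```python
def build_micro_reflective_trajectory(wrong_steps, failure_index, diagnosis, corrected_steps):
    """Assemble one training trajectory from an incorrect rollout.

    Keeps the model's own steps up to and including the failing step
    (an assumption), then adds a natural-language diagnosis followed by
    corrected reasoning taken from a correct reference.
    """
    prefix = wrong_steps[: failure_index + 1]
    return prefix + [f"Diagnosis: {diagnosis}"] + corrected_steps


trajectory = build_micro_reflective_trajectory(
    wrong_steps=["x = 2 + 3", "x = 6", "answer = 12"],
    failure_index=1,
    diagnosis="2 + 3 is 5, not 6.",
    corrected_steps=["x = 5", "answer = 10"],
)
print(trajectory)
```

Because the prefix comes from the model's own rollout, the resulting trajectory stays close to what the model would generate on its own, which is the on-policy property the abstract emphasizes.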

2606.18774 2026-06-18 cs.LG new submission 80%

RouteJudge: An Open Platform for Reproducible and Preference-Aware LLM Routing

Guannan Lai, Haoran Hu, Han-Jia Ye

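Affiliations*: School of Artificial Intelligence, Nanjing University; National Key Laboratory for Novel Software Technology, Nanjing University; SinapisAI

Topic match (Post-training): evaluates LLM routing strategies; preference-aware platform.

AI summary: Proposes the RouteJudge platform, which evaluates the decision quality of LLM routing strategies through anonymous pairwise comparisons, and releases the ORBIT toolbox to standardize routing workflows, supporting reproducible and preference-aware routing evaluation.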

Comments Accepted by Pluralistic Alignment Workshop at ICML 2026

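AI abstract (translated from Chinese)

We present RouteJudge, an online pairwise preference evaluation framework for LLM routing systems, together with a public platform (https://...). Unlike model-level response evaluation, RouteJudge focuses on router-level decision quality. For each user query, multiple routing strategies independently recommend candidate models under the same model pool and budget constraints. The responses of the selected models are then shown to users through anonymous pairwise comparisons, and the resulting user preferences are attributed to the routing strategies behind the compared responses. Each evaluation record stores the query, routing decisions, model responses, preference label, cost, latency, and task metadata, which supports preference-aware, cost-aware, and task-conditioned analysis of LLM routers. To support the continued expansion of routing methods in RouteJudge, we further release ORBIT (Optimal Routing and Budgeted Inference Toolbox), a modular and extensible toolbox that standardizes the end-to-end workflow of LLM routing. ORBIT provides unified interfaces for benchmark loading, query representation, router implementation, budget-aware evaluation, and method comparison, so researchers can develop and evaluate routing algorithms under a consistent protocol. It also serves as the submission and integration layer for RouteJudge: researchers can implement routing methods in ORBIT, validate them on existing routing benchmarks, and submit compatible routers for online preference evaluation. The ORBIT code is available at https://...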

English abstract

We present RouteJudge, an online pairwise preference evaluation framework for LLM routing systems, with a public platform available at https://routejudge.cn. Different from model-level response evaluation, RouteJudge focuses on router-level decision quality. For each user query, multiple routing strategies independently recommend candidate models under the same model pool and budget constraints. The selected model responses are then presented to users through anonymous pairwise comparisons, and the resulting user preferences are attributed back to the routing strategies behind the compared responses. Each evaluation record stores the query, routing decisions, model responses, preference labels, cost, latency, and task metadata, enabling preference-aware, cost-aware, and task-conditioned analysis of LLM routers. To support the continuous expansion of routing methods in RouteJudge, we further release ORBIT (Optimal Routing and Budgeted Inference Toolbox), a modular and extensible toolbox that standardizes the end-to-end workflow of LLM routing. ORBIT provides unified interfaces for benchmark loading, query representation, router implementation, budget-aware evaluation, and method comparison, allowing researchers to develop and evaluate routing algorithms under consistent protocols. It also serves as the submission and integration layer for RouteJudge: researchers can implement routing methods within ORBIT, validate them on existing routing benchmarks, and submit compatible routers for online preference-based evaluation. The code of ORBIT is available at https://github.com/AIGNLAI/LAMDA-ORBIT.
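
As a rough illustration of how pairwise preferences might be attributed back to routing strategies, here is a minimal Python sketch. It is not the RouteJudge or ORBIT API. The `PairwiseRecord` fields are a hypothetical subset of the record contents listed in the abstract, the router names in the usage example are invented, and scoring a tie as half a win for each side is an assumption made only for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PairwiseRecord:
    query: str
    router_a: str    # routing strategy behind response A (hypothetical field)
    router_b: str    # routing strategy behind response B (hypothetical field)
    preference: str  # "a", "b", or "tie", as chosen by the user
    cost_a: float    # cost of the model chosen by router A
    cost_b: float    # cost of the model chosen by router B


def router_win_rates(records: List[PairwiseRecord]) -> Dict[str, float]:
    """Credit each user preference to the router behind the preferred response.
    A tie counts as half a win for both routers (an assumption)."""
    wins: Dict[str, float] = defaultdict(float)
    games: Dict[str, int] = defaultdict(int)
    for rec in records:
        games[rec.router_a] += 1
        games[rec.router_b] += 1
        if rec.preference == "a":
            wins[rec.router_a] += 1.0
        elif rec.preference == "b":
            wins[rec.router_b] += 1.0
        elif rec.preference == "tie":
            wins[rec.router_a] += 0.5
            wins[rec.router_b] += 0.5
        else:
            raise ValueError(f"unknown preference label: {rec.preference!r}")
    return {name: wins[name] / games[name] for name in games}


if __name__ == "__main__":
    records = [
        PairwiseRecord("q1", "cheapest-first", "quality-first", "b", 0.1, 0.9),
        PairwiseRecord("q2", "cheapest-first", "quality-first", "tie", 0.1, 0.8),
    ]
    print(router_win_rates(records))
```

Because each record also carries cost, the same records could be regrouped by budget or task metadata to produce the cost-aware and task-conditioned views the abstract mentions.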

2606.13795 2026-06-18 cs.LG new submission 80%

DiPOD: Diffusion Policy Optimization without Drifting Apart

Haozhe Jiang, Haiwen Feng, Pieter Abbeel, Jiantao Jiao, Angjoo Kanazawa, Nika Haghtalab

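Affiliations*: University of California, Berkeley; Simons Institute for the Theory of Computing; Department of Electrical Engineering and Computer Sciences, University of California, Berkeley

Topic match (Post-training): diffusion policy optimization for language model post-training.

AI summary: To address the instability of diffusion policy gradient methods, proposes the DiPOD framework, which alternates self-distillation with policy-improvement gradient updates to maintain tight-bound behavior, achieving stable and efficient policy optimization.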

Comments Project page: astro-eric.github.io/blogs/dipod/ Code: https://github.com/Astro-Eric/DiPOD-release

2606.18596 2026-06-18 cs.HC cs.AI new submission 80%

Better Adherence, Richer Context: A Field Evaluation of LLM-Powered Conversational Voice Diaries for Sleep

Amama Mahmood, Bokyung Kim, Honghao Zhao, Molly E. Atwood, Luis F. Buenaver, Michael T. Smith, Chien-Ming Huang

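Affiliations*: The Johns Hopkins University; Department of Psychiatry and Behavioral Sciences, The Johns Hopkins University School of Medicine

Topic match (Domain-specific LLMs): field evaluation of LLM-powered conversational voice sleep diaries.

AI summary: A field experiment evaluating an LLM-based conversational voice sleep diary finds that, compared with a text diary, the voice diary improved adherence and collected more detailed contextual information, but with lower completeness of structured fields.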

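AI abstract (translated from Chinese)

Sleep diaries are central to behavioral sleep medicine and cognitive behavioral therapy for insomnia, but daily completion is hard to sustain, and static forms usually provide limited context for explaining night-to-night sleep variation. We designed an LLM-based conversational voice diary that delivers clinically grounded morning and evening sleep diary questions through proactive smart-speaker prompts, structured conversational intake, and adaptive follow-up dialogue. We evaluated the system in a four-week between-subjects field study with 30 university students, comparing it with a text-based mobile diary using matched diary items, reporting windows, and reminder intervals. Compared with the text diary, the conversational voice diary showed higher adherence and elicited more detailed contextual self-reports about daily routines, stressors, environmental conditions, and other sleep-related factors. Participants also described the voice diary as easier to fit into their daily routines, even though they perceived completion as taking longer. However, voice-based conversational intake led to lower completeness for some structured diary fields, revealing a trade-off between expressive richness and structured precision. These findings show both the promise and the challenges of using LLM-based conversational voice assistants for longitudinal health self-report.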

English abstract

Sleep diaries are central to behavioral sleep medicine and cognitive behavioral therapy for insomnia, yet daily completion is difficult to sustain, and static forms often provide limited context for interpreting night-to-night sleep variation. We designed an LLM-powered conversational voice diary that delivers clinically grounded morning and evening sleep diary questions through proactive smart-speaker prompts, structured conversational intake, and adaptive follow-up dialogue. We evaluated the system in a four-week between-subjects field study with 30 university students, comparing it with a text-based mobile diary using matched diary items, reporting windows, and reminder intervals. Compared with the text-based diary, the conversational voice diary showed higher adherence and elicited more detailed contextual self-report about routines, stressors, environmental conditions, and other sleep-related factors. Participants also described the voice diary as easier to integrate into daily routines, despite longer perceived completion time. However, voice-based conversational intake produced lower completeness for some structured diary fields, revealing a trade-off between expressive richness and structured precision. These findings show both the promise and the challenge of using LLM-powered conversational voice assistants for longitudinal health self-report.

1. Other LLM (19 papers)

X+Slides: Benchmarking Audience-Conditioned Slide Generation

SenFlow: Inter-Sentence Flow Modeling for AI-Generated Text Detection in Hybrid Documents

As Easy as Rocket Science: Assessing the Ability of Large Language Models to Interpret Negation in Figurative Language

Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports

ReMP: Low-Downtime Runtime Model-Parallelism Reconfiguration for LLM Serving

Bounded Context Management for Tabular Foundation Models on Stream Learning

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

From Sparse Features to Trustworthy Proxies: Certifying SAE-Based Interpretability

Latency Prediction for LLM Inference on NPU Systems

Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns

DN-Hypo-Pipeline: An AI-Driven Workflow for Hypothesis Generation via Large Language Models and Scientific Explanations

Improve Large Language Model Systems with User Logs

LLM Compression by Block Removal with Constrained Binary Optimization

InstructTime++: Time Series Classification with Multimodal Language Modeling via Implicit Feature Enhancement

TurnGuide: Enhancing Meaningful Full Duplex Spoken Interactions via Dynamic Turn-Level Text-Speech Interleaving

RippleBench: Capturing Ripple Effects Using Existing Knowledge Repositories

The Personalization Trap: How User Memory Alters Emotional Reasoning in LLMs

Self-Evolving Multi-Agent Systems via Textual Backpropagation

Decomposing Prediction Mechanisms for In-Context Recall

2. Pre-training (4 papers)

FoMoE: Breaking the Full-Replica Barrier with a Federation of MoEs

BLADE: Scalable Bi-level Adaptive Data Selection for LLM Training

The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data

UPLOTS: A Unified Pretrained Language Model for Constrained Time-series Generation

3. Post-training (6 papers)

Spotlight: Synergizing Seed Exploration and Spot GPUs for DiT RL Post-Training

Enhancing Multilingual Reasoning via Steerable Model Merging

EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts

Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation

RouteJudge: An Open Platform for Reproducible and Preference-Aware LLM Routing

DiPOD: Diffusion Policy Optimization without Drifting Apart

4. Domain-specific LLMs (1 paper)

Better Adherence, Richer Context: A Field Evaluation of LLM-Powered Conversational Voice Diaries for Sleep