Large Language Models / LLM

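2606.18650 2026-06-18 cs.LG New submission 80%

BLADE: Scalable Bi-level Adaptive Data Selection for LLM Training

Jiaxing Wang, Deping Xiang, Jin Xu, Zirui Liu, Zicheng Zhang, Guoqiang Gong, Jun Fang, Chao Liu, Pengzhang Liu, Tongxuan Liu, Ke Zhang, Qixia Jiang

Institutions: University of Oxford; Renmin University of China; University of Chinese Academy of Sciences

Topic match (Pretraining): scalable bi-level adaptive data selection for LLM training

AI summary: Proposes the BLADE framework, which uses Lagrange multipliers to turn bi-level optimization into a single-level penalized objective, avoiding inverse-Hessian computation and enabling a dynamic reference model; it comes with a first-order convergence guarantee and outperforms existing methods in experiments.

AI abstract (translated from Chinese)

As large language model (LLM) datasets scale to trillions of tokens, data selection has become a key frontier for filtering out uninformative noise and building adaptive learning trajectories. Beyond static heuristic filtering, advanced data selection methods for LLM training mainly follow two paradigms, each with fundamental limitations. Influence-based methods offer a principled bi-level objective but require intractable inverse-Hessian computation, whereas excess-loss methods are computationally efficient but depend on a static reference model that drifts out of alignment with the evolving proxy model during training. We propose BLADE (Bi-Level Adaptive Data sElection), a Hessian-free data selection framework. Using Lagrange multipliers, BLADE reformulates the bi-level optimization problem behind influence-based methods as a penalized single-level objective, avoiding inverse-Hessian computation while revealing a principled connection to excess-loss data selection. The resulting objective recovers the excess-loss form but replaces the static reference model with a dynamic one that stays synchronized with training. Theoretically, we prove that the penalized formulation guarantees first-order convergence. For efficient online batch selection, we instantiate BLADE as a memoryless randomized block-coordinate Frank-Wolfe algorithm. Extensive experiments show that BLADE consistently outperforms state-of-the-art data selection baselines, offering a practical recipe for LLM training.

English abstract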

As Large Language Model (LLM) datasets scale to trillions of tokens, data selection has emerged as a critical frontier to filter out uninformative noise and construct adaptive learning trajectories. Beyond static heuristic filtering, advanced data selection methods for LLM training largely follow two paradigms, each with fundamental limitations. Influence-based methods provide principled bi-level objectives but require intractable inverse-Hessian computations, while excess-loss methods are computationally efficient but rely on a static reference model that becomes misaligned with the evolving proxy model during training. We propose BLADE (Bi-Level Adaptive Data sElection), a Hessian-free framework for data selection. BLADE reformulates the bi-level optimization problem underlying influence-based methods as a penalized single-level objective via Lagrange multipliers, avoiding inverse-Hessian computation while revealing a principled connection to excess-loss based data selection. The resulting objective recovers an excess-loss form but replaces the static reference model with a dynamic one that stays synchronized with training. Theoretically, we prove that this penalized formulation guarantees first-order convergence. For efficient online batch selection, we instantiate BLADE as a memoryless randomized block-coordinate Frank-Wolfe algorithm. Extensive experiments show that BLADE consistently outperforms state-of-the-art data selection baselines, providing a practical recipe for LLM training.
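The excess-loss form mentioned in the abstract can be illustrated with a minimal sketch. It assumes the usual excess-loss score (per-example proxy loss minus reference loss) and a plain top-k rule. The function names, the sample losses, and the top-k rule are illustrative assumptions; this is not BLADE's block-coordinate Frank-Wolfe procedure. In a BLADE-like setup the reference losses would come from a model that is updated alongside training rather than from a fixed checkpoint.

```python
def excess_loss_scores(proxy_losses, reference_losses):
    """Per-example excess loss: proxy loss minus reference loss."""
    if len(proxy_losses) != len(reference_losses):
        raise ValueError("loss lists must have the same length")
    return [p - r for p, r in zip(proxy_losses, reference_losses)]


def select_top_k(scores, k):
    """Return the indices of the k highest-scoring examples, in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical per-example losses for a candidate batch of four examples.
proxy = [2.0, 1.0, 3.0, 0.5]
reference = [1.5, 1.2, 1.0, 0.4]  # assumed to be refreshed as training proceeds
selected = select_top_k(excess_loss_scores(proxy, reference), k=2)
```

Examples whose proxy loss sits well above the reference loss are kept; examples the reference model also finds hard (likely noise) or that the proxy already fits are dropped.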

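2606.18192 2026-06-18 cs.AI New submission 80%

The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data

Nick Bettencourt, Xiaowei Ding, Kay Giesecke

Institutions: University of California, Los Angeles; Nanjing University; Stanford University

Topic match (Pretraining): building a long-context pretraining dataset for LLMs

AI summary: To address the scarcity of long-context documents, proposes the SEFD dataset, which reconstructs SEC filings into layout-faithful MultiMarkdown for financial language modeling and evaluation; the corpus is token-efficient and overlaps with Common Crawl by less than 0.1%.

Comments: Preprint. Includes appendix, tables, and figures

AI abstract (translated from Chinese)

As high-quality public web corpora are increasingly exhausted, clean long-context documents have become a scarce and expensive source of training data for large language models (LLMs). Existing long-context corpora are typically proprietary and costly to acquire, synthetically generated, or concentrated in narrow domains such as programming. We introduce the Stanford EDGAR Filings Dataset (SEFD), an open dataset that reconstructs SEC filings into layout-faithful MultiMarkdown for financial language modeling and evaluation. SEFD makes audited financial statements, risk disclosures, ownership reports, accounting notes, and market-moving event filings usable as long-context pretraining data and as a foundation for financial reasoning, forecasting, compliance, and document understanding. The resulting corpus is token-efficient, model-ready, and overlaps with Common Crawl-derived corpora by less than 0.1%. We release SEFD-v1, an initial public snapshot of 152B tokens, and provide corpus-level analyses of a larger archive of 18.5 million filings (estimated at 550B tokens). We further introduce two SEFD-based benchmarks: EDGAR-Forecast, which evaluates filing-grounded numerical forecasting after model knowledge cutoffs, and EDGAR-OCR, which evaluates transcription of complex financial tables.

English abstract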

As high-quality public web corpora become increasingly exhausted, clean long-context documents have become a scarce and expensive source of training data for large language models (LLMs). Existing long-context corpora are often proprietary and costly to acquire, synthetically generated, or concentrated in narrow domains such as programming. We introduce the Stanford EDGAR Filings Dataset (SEFD), an open reconstruction of SEC filings into layout-faithful MultiMarkdown for financial language modeling and evaluation. SEFD makes audited financial statements, risk disclosures, ownership reports, accounting notes, and market-moving event filings usable as long-context pretraining data and as a basis for financial reasoning, forecasting, compliance, and document understanding. The resulting corpus is token-efficient, model-ready, and has less than 0.1% overlap with Common Crawl-derived corpora. We release SEFD-v1, a 152B-token initial public snapshot, and provide corpus-level analyses of a larger 18.5M-filing archive estimated at 550B tokens. We further introduce two SEFD-derived benchmarks: EDGAR-Forecast, which evaluates filing-grounded numerical forecasting after model knowledge cutoffs, and EDGAR-OCR, which evaluates transcription of complex financial tables.

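2606.10466 2026-06-18 cs.LG cs.AI New submission 80%

UPLOTS: A Unified Pretrained Language Model for Constrained Time-series Generation

Du Yin, Hao Xue, Jinliang Deng, Yang Yang, Shuang Ao, Arian Prabowo, Flora Salim

Institutions: University of New South Wales; HKUST(GZ); BUAA

Topic match (Pretraining): a unified pretrained language model that generates time series

AI summary: Proposes UPLOTS, a prompt-guided framework built on a unified pretrained language model that uses dynamic multi-dataset loss re-weighting and prompt-to-pattern mapping for cross-domain constrained time-series generation; generalization and data-augmentation benefits are validated on four benchmarks.

AI abstract (translated from Chinese)

In time-series generation, existing methods usually handcraft or train a separate model for each dataset, which limits their scalability and fails to exploit temporal structure shared across domains. To address this fragmentation, we propose UPLOTS, a unified, prompt-guided language model framework for constrained time-series generation across diverse domains. Rather than building task-specific models, UPLOTS uses a single pretrained transformer backbone guided by learned constraint prompts, enabling on-demand generation with precise pattern control. A key innovation is our dynamic multi-dataset loss re-weighting and prompt-to-pattern mapping, which lets UPLOTS internalize diverse temporal structures during training and generate them conditionally at inference. We evaluate UPLOTS on four real-world benchmarks and multiple constraint settings, including peak-period, calendar, load-level, and volatility patterns. Additional held-out constraint-combination and downstream forecasting experiments further show that UPLOTS generalizes beyond the original peak-pattern setting and improves data augmentation when real data is scarce. Our code and baselines are available in an anonymous GitHub repository: this https URL.

English abstract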

In time-series generation, existing approaches typically handcraft or train a separate model for each dataset, which hinders their scalability and fails to leverage shared temporal structures across domains. To address this fragmentation, we propose UPLOTS, a Unified, Prompt-guided Language model framework fOr constrained Time-Series Generation across diverse domains. Instead of building task-specific models, UPLOTS leverages a single pre-trained transformer backbone guided by learned constraint prompts, enabling on-demand generation with precise pattern control. One key innovation is our dynamic multi-dataset loss re-weighting and prompt-to-pattern mapping, which allows UPLOTS to internalize diverse temporal structures during training and conditionally generate them at inference. We evaluate UPLOTS on four real-world benchmarks and multiple constraint settings, including peak-period, calendar, load-level, and volatility patterns. Additional held-out constraint-combination and downstream forecasting experiments further demonstrate that UPLOTS generalizes beyond the original peak-pattern setting and improves data augmentation under scarce real-data regimes. Our code and baselines are available at anonymous github repo: https://anonymous.4open.science/r/UPLOTS-6C36.

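2606.18587 2026-06-18 cs.CL cs.AI New submission 75%

Dual Dimensionality for Local and Global Attention

Zhiyuan Wang, Xuan Luo, Sirui Zeng, Xifeng Yan

Institutions: UC Santa Barbara

Topic match (Pretraining): proposes distance-adaptive representations to optimize Transformer attention

AI summary: Proposes Distance-Adaptive Representation (DAR), which keeps full-dimensional representations for local context and uses low-dimensional representations for distant tokens, reducing the KV cache while maintaining performance.

AI abstract (translated from Chinese)

Decoder-only Transformers compute attention over the KV cache of preceding tokens. Keys (and values) are typically represented with the same dimensionality regardless of their distance from the prediction target. In natural language, however, the next word is influenced most strongly by the immediately preceding words. We hypothesize that local and distant tokens place asymmetric demands on representational capacity: local tokens are more critical for predicting immediate outputs and therefore need richer representations, whereas distant tokens mainly serve as long-term memory, for which lower-dimensional representations may suffice. We formalize this idea as Distance-Adaptive Representation (DAR), implemented in a controlled setting that keeps full-dimensional representations within a local context window and assigns reduced-dimensional representations (for example, 1/4 of the original dimensionality) to tokens beyond it. Across multiple pretraining scales (70M to 410M parameters) and continued supervised fine-tuning of a 1B-scale model, the method closely matches full-dimensional baselines. In contrast, uniformly reducing dimensionality at all token positions degrades performance. These results challenge the common assumption that key and value dimensionality should be uniform across token positions. Our findings point to a new direction for attention architectures that allocate representational capacity adaptively across the sequence, further reducing the KV cache during inference.

English abstract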

Decoder-only Transformers compute attention over the KV cache of preceding tokens. Keys (and Values) are typically represented with the same dimensionality, regardless of their distance from the prediction target. In natural language, however, the next word is most strongly influenced by the immediately preceding tokens. We hypothesize that local and distant tokens impose asymmetric demands on representational capacity: local tokens are more critical for predicting immediate outputs and thus require richer representations, whereas distant tokens primarily serve as long-range memory, for which lower-dimensional representations may suffice. We formalize this idea as Distance-Adaptive Representation (DAR), implemented in a controlled setting that preserves full-dimensional representations within a local context window while assigning reduced-dimensional representations (e.g. 1/4 of the original dimensionality) to tokens beyond that window. Across multiple pretraining scales (70M to 410M parameters), as well as continued supervised fine-tuning on a 1B-scale model, this approach closely matches the performance of full-dimensional baselines. In contrast, uniformly reducing dimensionality across all token positions leads to worse performance. These results challenge the common assumption that key and value dimensionality should be uniform across token positions. Our findings suggest a new direction for designing attention architectures that adaptively allocate representational capacity across sequences, enabling further reductions in KV cache during inference.
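To make the potential KV cache saving concrete, here is a back-of-the-envelope sketch. It counts cached elements for one attention head in one layer, assuming that tokens inside the local window store keys and values at full dimension and that all earlier tokens store both at a reduced dimension. The function name, the window size, and the storage layout are assumptions for illustration; the abstract does not describe the actual cache layout.

```python
def kv_cache_elements(seq_len, head_dim, window, reduced_fraction=0.25):
    """Count cached key+value elements for one head in one layer.

    Tokens in the most recent `window` positions keep `head_dim` dimensions;
    older tokens keep `head_dim * reduced_fraction` dimensions (assumed layout).
    """
    local = min(seq_len, window)
    distant = seq_len - local
    reduced_dim = int(head_dim * reduced_fraction)
    # The factor 2 accounts for storing both keys and values.
    return 2 * (local * head_dim + distant * reduced_dim)


# Hypothetical setting: 4096 cached tokens, 128-dim heads, 512-token local window.
full = kv_cache_elements(4096, 128, window=4096)   # every token at full dimension
mixed = kv_cache_elements(4096, 128, window=512)   # distant tokens at 1/4 dimension
ratio = mixed / full
```

Under these assumed numbers the mixed layout stores about 34% of the elements of the uniform full-dimension cache; the saving grows as the sequence gets longer relative to the window.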

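2606.19170 2026-06-18 cs.CL New submission 70%

Dango: A Strictly L1-Only Large Language Model for Studying Second Language Acquisition

Shiho Matta, Yin Jou Huang, Fei Cheng, Takashi Kodama, Hirokazu Kiyomaru, Yugo Murawaki

Institutions: Kyoto University; NII-LLMC

Topic match (Pretraining): an LLM that simulates second language acquisition; involves pretraining

AI summary: Proposes Dango, a 1.8B-parameter model that filters out L2 contamination and is fine-tuned on L2-learning lessons to simulate human L2 production patterns, outperforming unfiltered and multilingual baselines.

Comments: 8 pages main text, 20 pages total including references and appendices

AI abstract (translated from Chinese)

We introduce Dango, a 1.8B-parameter large language model designed for controlled studies of L1-to-L2 (Japanese-to-English) transfer in second language acquisition (SLA). Although prior work has explored SLA in language models, it has relied mainly on smaller or non-decoder models, which limits the ability to generate open-ended text and reduces suitability as a practical L2 simulator. We identify a key challenge in scaling models to this size: L2 contamination in the "monolingual" pretraining corpus used for L1 acquisition. To address it, we propose a filtering method that reduces premature exposure to English while preserving realistic, minimal exposure. We then fine-tune the model on LLM-generated L2-learning lessons to simulate the L2 acquisition process. Our evaluation confirms that Dango develops human-like L2 production patterns and outperforms both unfiltered and standard multilingual baselines. We release the model, data, and code to support reproducible computational SLA research and learner-facing applications.

English abstract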

We introduce Dango, a 1.8B-parameter large language model designed for controlled studies of L1-to-L2 (Japanese-to-English) transfer in second language acquisition (SLA). While previous studies have explored SLA in language models, they have predominantly relied on smaller or non-decoder models, limiting their ability to generate open-ended text and reducing their suitability as practical L2 simulators. We identify a key challenge when scaling models to this size: L2 contamination within the "monolingual" pretraining corpus used for L1 acquisition. To address this, we propose a filtering method to reduce premature exposure to English while preserving realistic, minimal exposure. We then fine-tune the model on LLM-generated L2-learning lessons to simulate the L2 acquisition process. Our evaluations confirm that Dango develops human-like L2 production patterns, outperforming both unfiltered and standard multilingual baselines. We release the model, data, and code to facilitate reproducible computational SLA research and learner-facing applications.

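2606.18596 2026-06-18 cs.HC cs.AI New submission 80%

Better Adherence, Richer Context: A Field Evaluation of LLM-Powered Conversational Voice Diaries for Sleep

Amama Mahmood, Bokyung Kim, Honghao Zhao, Molly E. Atwood, Luis F. Buenaver, Michael T. Smith, Chien-Ming Huang

Institutions: The Johns Hopkins University; Department of Psychiatry and Behavioral Sciences, The Johns Hopkins University School of Medicine

Topic match (Domain LLMs): field evaluation of LLM-powered conversational voice sleep diaries

AI summary: A field experiment evaluating LLM-powered conversational voice sleep diaries finds that, compared with text diaries, voice diaries improve adherence and collect more detailed contextual information, but yield lower completeness for structured fields.

AI abstract (translated from Chinese)

Sleep diaries are central to behavioral sleep medicine and cognitive behavioral therapy for insomnia, but daily completion is hard to sustain, and static forms usually provide limited context for interpreting night-to-night sleep variation. We designed an LLM-powered conversational voice diary that delivers clinically grounded morning and evening sleep diary questions through proactive smart-speaker prompts, structured conversational intake, and adaptive follow-up dialogue. We evaluated the system in a four-week between-subjects field study with 30 university students, comparing it with a text-based mobile diary that used matched diary items, reporting windows, and reminder intervals. Compared with the text diary, the conversational voice diary showed higher adherence and elicited more detailed contextual self-reports about routines, stressors, environmental conditions, and other sleep-related factors. Participants also described the voice diary as easier to fit into daily routines, despite a longer perceived completion time. However, voice-based conversational intake led to lower completeness for some structured diary fields, revealing a trade-off between expressive richness and structured precision. These findings show both the promise and the challenges of using LLM-powered conversational voice assistants for longitudinal health self-report.

English abstract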

Sleep diaries are central to behavioral sleep medicine and cognitive behavioral therapy for insomnia, yet daily completion is difficult to sustain, and static forms often provide limited context for interpreting night-to-night sleep variation. We designed an LLM-powered conversational voice diary that delivers clinically grounded morning and evening sleep diary questions through proactive smart-speaker prompts, structured conversational intake, and adaptive follow-up dialogue. We evaluated the system in a four-week between-subjects field study with 30 university students, comparing it with a text-based mobile diary using matched diary items, reporting windows, and reminder intervals. Compared with the text-based diary, the conversational voice diary showed higher adherence and elicited more detailed contextual self-report about routines, stressors, environmental conditions, and other sleep-related factors. Participants also described the voice diary as easier to integrate into daily routines, despite longer perceived completion time. However, voice-based conversational intake produced lower completeness for some structured diary fields, revealing a trade-off between expressive richness and structured precision. These findings show both the promise and the challenge of using LLM-powered conversational voice assistants for longitudinal health self-report.

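2606.18989 2026-06-18 cs.CL cs.AI New submission 75%

G-IdiomAlign: A Gloss-Pivoted Benchmark for Cross-Lingual Idiom Alignment

Fengying Ye, Yanming Sun, Runzhe Zhan, Zheqi Zhang, Lidia S. Chao, Derek F. Wong

Institutions: NLP2CT Lab, Department of Computer and Information Science, University of Macau; Faculty of Arts and Humanities, University of Macau

Topic match (Domain LLMs): builds a cross-lingual idiom alignment benchmark to evaluate LLM translation ability

AI summary: Proposes the G-IdiomAlign benchmark, which anchors idioms with Wiktionary glosses, builds a high-confidence alignment set, and defines a multiple-choice equivalence test and a gloss-contrastive generation protocol, revealing a literal-translation bias in LLM idiom translation.

Comments: Accepted to ACL 2026

AI abstract (translated from Chinese)

Idioms are difficult to transfer across languages because of their non-compositionality and weak surface-form grounding, which makes literal mappings unreliable. We present G-IdiomAlign, a gloss-pivoted benchmark in which each idiom is anchored by an English gloss from Wiktionary. We further construct a high-confidence reference alignment set for reproducible evaluation. G-IdiomAlign supports two protocols: (1) a controlled multiple-choice idiom equivalence test with typed distractors for error attribution; and (2) gloss-contrastive generation, which contrasts no-gloss and with-gloss inputs to isolate the effect of an explicit semantic pivot. Across different large language models, a bias toward literal translation is the dominant failure mode, especially when the target language is low-resource. Under an embedding-based semantic proxy, glosses consistently improve gloss-contrastive generation, but performance remains modest, indicating substantial headroom in the open output space. Subsequent analysis of Qwen3-8B further suggests that cross-condition differences are concentrated more in attention heads than in layers, and that better with-gloss generations coincide with stronger gloss anchoring.

English abstract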

Idioms are difficult to transfer across languages due to their non-compositionality and weak surface-form grounding, making literal mappings unreliable. We present G-IdiomAlign, a gloss-pivoted benchmark where each idiom is anchored by an English gloss from Wiktionary. We further construct a high-confidence reference alignment set for reproducible evaluation. G-IdiomAlign supports two protocols: (1) a controlled Multiple-Choice Idiom Equivalence with typed distractors for error attribution; and (2) a Gloss-Contrastive Generation contrasting No-gloss and With-gloss inputs to isolate the effect of an explicit semantic pivot. Across diverse LLMs, a bias to literal translation is a dominant failure mode, especially when the target is a low-resource language. Glosses consistently improve Gloss-Contrastive Generation under an embedding-based semantic proxy, but performance remains modest, indicating substantial headroom in the open output space. Subsequent analysis on Qwen3-8B further suggests that cross-condition differences are concentrated more in attention heads than in layers, while better With-gloss generations coincide with stronger gloss anchoring.

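2606.18986 2026-06-18 cs.CL cs.AI New submission 75%

Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering

Yafeng Wu, Huu Hiep Nguyen, Thin Nguyen, Hung Le

Institutions: Deakin University

Topic match (Domain LLMs): proposes a time-series question answering framework that embeds timesteps directly to avoid the tokenization bottleneck

AI summary: Proposes the CADE framework, which embeds each timestep directly through a point-wise linear encoder to avoid the tokenization bottleneck and aligns time series with text anchors using a one-directional supervised contrastive loss, improving performance on six TSQA tasks on the Time-MQA benchmark.

AI abstract (translated from Chinese)

Recent advances in large language models have given rise to time-series question answering (TSQA), which formulates time-series analysis as natural-language question answering. However, feeding raw numerical series directly into an LLM runs into a tokenization bottleneck: Byte Pair Encoding splits continuous values into unstable tokens whose embeddings lack meaningful metric structure, losing magnitude, scale, and trend information. Prior methods use patch-based encoders that split the series into fixed windows, locking in a single granularity that breaks patterns and hides exact timesteps, and they do so through a separate module that rarely transfers across datasets with different lengths or sampling rates. To address this challenge, we propose CADE (Contrastive Alignment with Direct Embedding), a new TSQA framework built on two key components: direct timestep embedding and semantic alignment. The framework maps each timestep directly into the LLM embedding space through a point-wise linear encoder and an MLP projector, preserving exact index-level access while removing the need for patching and padding. To further bridge the semantic gap between time-series and language representations, we introduce a novel one-directional supervised contrastive loss that aligns time-series embeddings with frozen class-name text anchors. Experiments on the public Time-MQA benchmark show that our framework consistently improves performance on six TSQA tasks, outperforming both open-source and proprietary LLM baselines.

English abstract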

Recent advances in large language models (LLMs) have given rise to time-series question answering (TSQA), which formulates time-series analysis as natural-language question answering. However, directly feeding raw numerical series into LLMs suffers from a tokenization bottleneck: Byte Pair Encoding fragments continuous values into unstable tokens whose embeddings lack meaningful metric structure, resulting in the loss of magnitude, scale, and trend information. Prior methods use patch-based encoders that split the series into fixed windows, locking in one granularity that breaks patterns and hides exact timesteps, through a separate module that rarely transfers across datasets with different lengths or sampling rates. To address this challenge, we propose CADE (Contrastive Alignment with Direct Embedding), a novel framework for TSQA built upon two key components: direct timestep embedding and semantic alignment. The proposed framework maps each timestep directly into the LLM embedding space through a point-wise linear encoder and MLP projector, preserving exact index-level access while eliminating the need for patching and padding. To further bridge the semantic gap between time-series and language representations, we introduce a novel one-directional supervised contrastive loss that aligns time-series embeddings with frozen class-name text anchors. Experimental results on the public Time-MQA benchmark demonstrate that our framework consistently improves performance across six TSQA tasks, outperforming both open-source and proprietary LLM baselines.
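The two components named in the abstract can be sketched in a few lines of plain Python. The sketch assumes a scalar series, a point-wise linear map per timestep (the MLP projector is omitted), mean pooling to get one series embedding, and a cross-entropy over cosine similarities to fixed anchors as the alignment loss. All function names, the pooling choice, and the temperature are illustrative assumptions, not CADE's actual implementation.

```python
import math


def pointwise_encode(series, weight, bias):
    """Map each scalar timestep x to the vector weight * x + bias (one vector per timestep)."""
    return [[w * x + b for w, b in zip(weight, bias)] for x in series]


def mean_pool(vectors):
    """Average the per-timestep vectors into a single series embedding."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def anchor_alignment_loss(embedding, anchors, label, temperature=0.1):
    """Cross-entropy of the series embedding against fixed class anchors.

    Only the series side would receive gradients; the anchors stay frozen,
    which is the sense of "one-directional" assumed here.
    """
    logits = [cosine(embedding, a) / temperature for a in anchors]
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_norm - logits[label]


# Hypothetical 2-dimensional embedding space with two class anchors.
tokens = pointwise_encode([1.0, 2.0, 3.0], weight=[0.5, -1.0], bias=[0.0, 1.0])
series_embedding = mean_pool(tokens)
anchors = [[1.0, 0.0], [0.0, 1.0]]
loss_if_class_0 = anchor_alignment_loss([1.0, 0.0], anchors, label=0)
loss_if_class_1 = anchor_alignment_loss([1.0, 0.0], anchors, label=1)
```

Because every timestep gets its own embedding, the sequence length seen by the LLM equals the series length, which is what preserves exact index-level access.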

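2606.18803 2026-06-18 cs.AI cs.CY New submission 75%

ProfiLLM: Utility-Aligned Agentic User Profiling for Industrial Ride-Hailing Dispatch

Tengfei Lyu, Zirui Yuan, Xu Liu, Kai Wan, Zihao Lu, Li Ma, Hao Liu

Institutions: Didichuxing Co. Ltd

Topic match (Domain LLMs): applies LLMs to industrial dispatch, which falls under domain LLMs

AI summary: Proposes ProfiLLM, an agentic LLM data pipeline that combines tool-augmented global knowledge mining with utility-aligned profile exploration to build user profiles from large-scale behavioral logs in industrial ride-hailing dispatch; on DiDi's production system it reports up to a 6.14% AUC improvement and up to a 4.35% GMV gain.

AI abstract (translated from Chinese)

Bringing large language models (LLMs) into industrial ride-hailing dispatch as semantic feature extractors over platform-scale behavioral logs is a compelling but under-explored data systems problem. Production matching pipelines are still dominated by structured numerical features, yet decisive behavioral signals (for example, a driver's habitual aversion to certain regions) are inherently contextual and can be expressed naturally as LLM-generated user profiles. Scaling such profiling to a live, millisecond-latency dispatcher, however, faces three intertwined constraints that are rarely addressed together: on a platform with millions of daily orders, logs exceed any LLM's context window by orders of magnitude; most users are long-tail users with too few interactions for per-user profiling; and superficially fluent profiles do not necessarily improve downstream prediction utility. We propose ProfiLLM, an agentic LLM data pipeline that delivers utility-aligned user profiling for production matching systems through two modules. (1) Tool-Augmented Global Knowledge Mining equips an LLM agent with 27 analytical tools to mine platform-scale data, producing reusable global knowledge, adaptive user clustering rules, and region-level supply-demand priors. (2) Utility-Aligned Profile Exploration generates multiple candidate profiles per cluster, evaluates them with a lightweight downstream utility proxy, iteratively refines the best candidates, and constructs preference pairs for DPO fine-tuning. Deployed on DiDi's production dispatcher, ProfiLLM achieves up to +6.14% relative AUC improvement in outcome prediction, up to +4.35% GMV gain in dispatch simulation, and consistent improvements in a 14-day online A/B test, including +0.47% GMV, +0.33% completion rate, and -0.82% cancel-before-accept rate.

English abstract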

Bringing Large Language Models (LLMs) into industrial ride-hailing dispatch as semantic feature extractors over platform-scale behavioral logs is a compelling but under-explored data systems problem. Production matching pipelines remain dominated by structured numerical features, yet decisive behavioral signals (e.g., a driver's habitual aversion to certain regions) are inherently contextual and naturally expressible as LLM-generated user profiles. However, scaling such profiling to a live, millisecond-latency dispatcher faces three intertwined constraints rarely addressed together: on a platform with millions of daily orders, logs exceed any LLM's context window by orders of magnitude; most users are long-tail, with too few interactions for per-user profiling; and surface-fluent profiles do not necessarily improve downstream prediction utility. We present ProfiLLM, an agentic LLM data pipeline that operationalizes utility-aligned user profiling for production matching systems through two modules. (1) Tool-Augmented Global Knowledge Mining equips an LLM agent with 27 analytical tools to mine platform-scale data, producing reusable global knowledge, adaptive user clustering rules, and region-level supply-demand priors. (2) Utility-Aligned Profile Exploration generates multiple candidate profiles per cluster, evaluates them via a lightweight downstream utility proxy, iteratively refines the best candidates and constructs preference pairs for DPO fine-tuning. Deployed on DiDi's production dispatcher, ProfiLLM achieves up to +6.14% relative AUC improvement in outcome prediction, up to +4.35% GMV gain in dispatching simulation, and consistent improvements in a 14-day online A/B test including +0.47% GMV, +0.33% Completion Rate, and -0.82% Cancel-Before-Accept rate.
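The preference-pair step in module (2) can be sketched as follows. The abstract does not say how pairs are formed, so this sketch assumes the simplest rule: the candidate with the highest proxy utility becomes the "chosen" side and every strictly worse candidate becomes a "rejected" side. The function name, the candidate texts, and the scores are hypothetical.

```python
def build_preference_pairs(candidates, utility):
    """Pair the highest-utility candidate (chosen) with each strictly worse one (rejected)."""
    scored = sorted(((utility(c), c) for c in candidates), key=lambda t: t[0], reverse=True)
    best_score, best = scored[0]
    return [(best, other) for score, other in scored[1:] if score < best_score]


# Hypothetical candidate profiles for one user cluster, with made-up proxy scores.
proxy_scores = {
    "avoids airport runs": 0.71,
    "prefers short trips": 0.64,
    "no clear pattern": 0.50,
}
pairs = build_preference_pairs(list(proxy_scores), proxy_scores.get)
```

Each `(chosen, rejected)` tuple is in the shape DPO training data usually takes; ties with the best candidate are skipped because they carry no preference signal.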

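2606.18597 2026-06-18 cs.CL New submission 75%

Low-resource Language Discrimination Towards Chinese Dialects with Transfer learning and Data Augmentation

Fan Xu, Yangjie Dan, Keyu Yan, Yong Ma, Mingwen Wang

Institutions: Jiangxi Normal University

Topic match (Domain LLMs): transfer learning and data augmentation for Chinese dialect discrimination

AI summary: To address the scarcity of annotated resources for Chinese dialects, proposes the CDDTLDA framework, which combines transfer learning with data augmentation: it starts from a source-domain ASR model, augments and fine-tunes on target-domain data, and captures common semantic features with self-attention, significantly outperforming existing methods.

Comments: Published in ACM TALLIP

AI abstract (translated from Chinese)

Chinese dialect discrimination is a challenging natural language processing task because annotated resources are scarce. In this article, we develop a novel Chinese dialect discrimination framework that combines transfer learning and data augmentation (CDDTLDA) to overcome the resource shortage. Specifically, we first train a source-side automatic speech recognition (ASR) model on a relatively large Chinese dialect corpus. We then apply a simple but effective data augmentation method (speed, pitch, and noise perturbation) to augment the target-side low-resource Chinese dialects, and fine-tune a separate target ASR model starting from the source-side ASR model. Meanwhile, a self-attention mechanism captures the latent common semantic features shared by the source-side and target-side ASR models. Finally, we extract the hidden semantic representations of the target ASR model to perform Chinese dialect discrimination. Extensive experiments show that our model significantly outperforms state-of-the-art methods on two benchmark Chinese dialect corpora.

English abstract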

Chinese dialects discrimination is a challenging natural language processing task due to scarce annotation resource. In this article, we develop a novel Chinese dialects discrimination framework with transfer learning and data augmentation (CDDTLDA) in order to overcome the shortage of resources. To be more specific, we first use a relatively larger Chinese dialects corpus to train a source-side automatic speech recognition (ASR) model. Then, we adopt a simple but effective data augmentation method (i.e., speed, pitch, and noise disturbance) to augment the target-side low-resource Chinese dialects, and fine-tune another target ASR model based on the previous source-side ASR model. Meanwhile, the potential common semantic features between source-side and target-side ASR models can be captured by using self-attention mechanism. Finally, we extract the hidden semantic representation in the target ASR model to conduct Chinese dialects discrimination. Our extensive experimental results demonstrate that our model significantly outperforms state-of-the-art methods on two benchmark Chinese dialects corpora.
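Two of the three augmentations named in the abstract (speed and noise) are simple enough to sketch on a raw waveform held as a list of floats. The sketch assumes linear-interpolation resampling for speed change and additive Gaussian noise; pitch shifting is left out because it needs a proper signal-processing routine. The function names and parameters are illustrative assumptions, not the paper's pipeline.

```python
import random


def speed_perturb(samples, factor):
    """Resample by linear interpolation; factor > 1 shortens (speeds up) the signal."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_out = max(1, int(len(samples) / factor))
    out = []
    for i in range(n_out):
        pos = i * factor
        lo = min(int(pos), len(samples) - 1)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


def add_noise(samples, scale, seed=0):
    """Add zero-mean Gaussian noise with standard deviation `scale`."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, scale) for x in samples]


# Hypothetical 4-sample waveform.
wave = [0.0, 1.0, 2.0, 3.0]
faster = speed_perturb(wave, 2.0)     # half as many samples
noisy = add_noise(wave, scale=0.01)   # same length, slightly perturbed
```

Applying each perturbation with a few different factors multiplies the number of training utterances available for the low-resource target dialects.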

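2606.19167 2026-06-18 cs.SE New submission 70%

Teaching Software Engineering with LLM and MCP Integration: From Classroom to Industry Practice

Kehui Chen, Jacky Keung, Weining Li, Xiangbing Shao, Yishu Li, Xiaoxue Ma

Topic match (Domain LLMs): uses LLMs to support software engineering teaching, but is not a core model innovation

AI summary: This study integrates LLMs and MCP into a collaborative teaching model for software engineering, embedding the tools they drive into teaching, code assistance, and engineering simulation to bridge the gap between traditional instruction and industrial workflows and to improve students' programming, problem-solving, and intelligent-tool skills.

Comments: Accepted by the International Symposium on Educational Technology (ISET) 2026

AI abstract (translated from Chinese)

The rapid integration of large language models (LLMs) and the Model Context Protocol (MCP) into industrial software engineering creates an urgent need to update software engineering education to keep pace with emerging technologies and changing industry demands. This study explores an innovative approach that integrates LLMs and MCP into a collaborative teaching model for software engineering education, aiming to build a practical learning framework closely tied to real engineering practice. By embedding LLM- and MCP-driven tools into daily teaching, code assistance, and engineering simulations, the model effectively bridges the gap between traditional instruction and industrial workflows. This integration strengthens students' programming ability, practical problem-solving skills, and proficiency with intelligent engineering tools. In addition, through partnerships with industry internships, students can apply these technologies in real settings, further strengthening the link between academic preparation and professional practice. Overall, the study offers a practical path for reforming and innovating software engineering education in the era of artificial intelligence.

English abstract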

The rapid integration of Large Language Models (LLMs) and the Model Context Protocol (MCP) into industrial software engineering has created a pressing need to update software engineering education to align with emerging technologies and evolving industry demands. This study investigates an innovative approach that integrates LLMs and MCP into a collaborative teaching model for software engineering education, aiming to build a practical learning framework closely connected to real-world engineering practices. By embedding LLM and MCP driven tools into daily teaching, code assistance, and engineering simulations, the model effectively bridges the gap between traditional instruction and industrial workflows. This integration enhances students' programming competence, practical problem-solving abilities, and proficiency in using intelligent engineering tools. Furthermore, through partnerships with industry internships, students can apply these technologies in real-world settings, further strengthening the connection between academic preparation and professional practice. Overall, this research offers a practical pathway for reforming and innovating software engineering education in the era of artificial intelligence.

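2606.18557 2026-06-18 cs.AI cs.LG cs.LO New submission 80%

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

Patrick Cooper, Alvaro Velasquez

Institutions: University of Colorado Boulder

Topic match (Other LLM): evaluates defeasible abductive reasoning in foundation models

AI summary: Proposes the DeFAb benchmark, which converts knowledge bases into verifiable abduction instances to evaluate the creativity and theoretical reasoning of foundation models in defeasible reasoning, and finds that frontier models are far less accurate than a symbolic solver.

Comments: 33 pages, 14 figures, 23 tables. Dataset: https://huggingface.co/datasets/PatrickAllenCooper/DeFAb ; code and evaluation harness: https://github.com/PatrickAllenCooper/blanc

AI abstract (translated from Chinese)

A rule-based logic solver resolves every instance in our benchmark in under 50 microseconds with 100% accuracy, whereas the best frontier language model reaches at most 65% and drops to 23.5% under rendering-robust evaluation (the worst case over four surface renderings). We introduce DeFAb (Defeasible Abduction Benchmark), a dataset and generation pipeline that converts four decades of publicly funded knowledge bases into formal defeasible abduction instances: constructing hypotheses that explain anomalies by overriding defaults while preserving unrelated expectations. Because every hypothesis must pass polynomial-time checks (valid derivation, conservativity, and minimality), DeFAb uses logical rigor as the instrument for measuring creativity and theoretical reasoning, scoring the disciplined construction of theory revisions rather than fluent but theory-destroying prose. The pipeline pairs taxonomic hierarchies (OpenCyc, YAGO, Wikidata) with behavioral property graphs (ConceptNet, UMLS) to generate 372,648+ instances from 18 sources, covering 33.75M materialized rules, in three levels with polynomial-time verifiable gold standards. Four frontier models fail to reliably internalize defeasible reasoning: rendering-robust Level 2 accuracy is 7.8-23.5%; chain-of-thought variance (about 36 percentage points) exceeds any inter-model gap; and a matched contamination control isolates a +19.4 percentage point Level 3 gap. We further release DeFAb-Hard (a 235-instance Level 3 difficulty variant; best model 53.3% vs. 100% symbolic) and CONJURE (a kernel-verified transformative-creativity variant of 560 Lean 4/Mathlib instances whose gold answers are definitions the proof kernel did not previously contain, with a judge-free verifier; a pilot finds zero novel concepts). The same verifier can also serve as an exact reward for preference optimization (DPO, RLVR/GRPO). Released under the MIT license at this https URL.

English abstract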

A rule-based logic solver resolves every instance in our benchmark in under 50 microseconds with 100% accuracy; the best frontier language model reaches 65% at best and drops to 23.5% under rendering-robust evaluation (worst case over four surface renderings). We introduce DeFAb (Defeasible Abduction Benchmark), a dataset and generation pipeline that converts four decades of publicly funded knowledge bases into formally grounded instances for defeasible abduction: constructing hypotheses that explain anomalies by overriding defaults while preserving unrelated expectations. Because every hypothesis must pass polynomial-time checks for valid derivation, conservativity, and minimality, DeFAb makes logical rigor the instrument for measuring creativity and theoretical reasoning, scoring the disciplined construction of theory revisions rather than fluent but theory-destroying prose. The pipeline pairs taxonomic hierarchies (OpenCyc, YAGO, Wikidata) with behavioral property graphs (ConceptNet, UMLS) to produce 372,648+ instances across 33.75M materialized rules from 18 sources, in three levels with polynomial-time verifiable gold standards. Four frontier models do not reliably internalize defeasible reasoning: rendering-robust Level 2 accuracy is 7.8-23.5%; chain-of-thought variance (~36 pp) exceeds any inter-model gap; and a matched contamination control isolates a +19.4 pp Level 3 gap. We further release DeFAb-Hard (a 235-instance Level 3 difficulty variant; best model 53.3% vs 100% symbolic) and CONJURE (a kernel-verified transformative-creativity variant of 560 Lean 4/Mathlib instances whose gold answers are definitions the proof kernel did not previously contain, judge-free verifier; a pilot finds zero novel concepts). The same verifier doubles as an exact reward for preference optimization (DPO, RLVR/GRPO). Released under MIT at https://huggingface.co/datasets/PatrickAllenCooper/DeFAb.

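2606.18383 2026-06-18 cs.LG cs.CL New submission 80%

From Sparse Features to Trustworthy Proxies: Certifying SAE-Based Interpretability

Dibyanayan Bandyopadhyay, Asif Ekbal

Institutions: Department of Computer Science and Engineering, Indian Institute of Technology Patna

Topic match (Other LLM): certifies SAE-based interpretability of language models

AI summary: Proposes a post-hoc generalization framework that certifies a language model through a sparse proxy (SAE reconstruction), derives an upper bound on expected risk, verifies non-vacuous bounds on GPT-2 Small and other models, and shows that later layers are easier to certify and that the decomposition separates semantic alignment from statistical sparsity.

AI abstract (translated from Chinese)

Sparse autoencoders (SAEs) are increasingly used to extract interpretable features from language models (LMs), but a central question remains: when can an SAE-based explanation be treated as a faithful view of the underlying frozen LM? We study this question through a post-hoc generalization framework that certifies the LM via a sparse proxy, obtained by replacing a native hidden activation with its pretrained SAE reconstruction. Our framework derives an upper bound on the base model's expected risk from four measurable quantities: proxy risk, SAE reconstruction gap, concept-pool mismatch, and sparse complexity. We interpret this certificate as an operational criterion for explanatory faithfulness. In particular, a non-vacuous bound indicates that the extracted sparse features retain meaningful predictive information, while small reconstruction and mismatch errors indicate that the proxy stays behaviorally close to the original model. Empirically, we show that the bound becomes non-vacuous on GPT-2 Small, Gemma-2B, and Llama-3-8B at practical sample sizes. A detailed layerwise analysis of Llama-3-8B reveals a strong depth dependence: later layers are easier to certify, which is associated with stronger local fidelity and weaker downstream error amplification. Finally, through feature-shuffling ablations, we show that the decomposition distinguishes genuine semantic alignment from mere statistical sparsity, providing a useful diagnostic for when SAE-based explanations become less reliable.

English abstract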

Sparse autoencoders (SAEs) are increasingly used to extract interpretable features from language models (LMs), yet a central question remains: when can an SAE-based explanation be treated as a faithful view of an underlying frozen LM? We study this through a post-hoc generalization framework that certifies the LM via a sparse proxy, obtained by replacing a native hidden activation with its pretrained SAE reconstruction. Our framework derives an upper bound on the base model's expected risk using four measurable quantities: proxy risk, SAE reconstruction gap, concept-pool mismatch, and sparse complexity. We interpret this certificate as an operational criterion for explanatory faithfulness. In particular, a non-vacuous bound indicates that the extracted sparse features retain meaningful predictive information, while small reconstruction and mismatch errors indicate that the proxy remains behaviorally close to the original model. Empirically, we show that the bound becomes non-vacuous on GPT-2 Small, Gemma-2B, and Llama-3-8B at practical sample sizes. A detailed layerwise analysis of Llama-3-8B reveals a strong depth dependence, with later layers becoming much easier to certify, associated with both stronger local fidelity and weaker downstream error amplification. Finally, through feature-shuffling ablations, we show that the decomposition distinguishes genuine semantic alignment from mere statistical sparsity, providing a useful diagnostic for when SAE-based explanations become less reliable.

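2606.18042 2026-06-18 cs.DC New submission 80%

Latency Prediction for LLM Inference on NPU Systems

Juhyun Park, Seungwoo Jeong, Jingyu Lee, Kyungyong Lee

Topic match (Other LLM): predicts LLM inference latency on NPUs

AI summary: LLM inference latency prediction on NPUs is complicated by undisclosed microarchitectures, unpredictable compiler optimizations, and non-linear latency caused by bucketing. Proposes the LENS latency estimator, which combines two end-to-end measurements per bucket to predict latency for arbitrary input-output length combinations, with a mean prediction error of 2.15%.

Comments: 12 pages, 9 figures

AI abstract (translated from Chinese)

Deploying large language models (LLMs) requires exploring a large configuration space spanning parallelization strategies, batching techniques, and scheduling policies. Exhaustive measurement over this space is impractical, so latency prediction is essential for system optimization. Although NPUs have emerged as accelerators designed for LLM inference, no prediction methodology has yet been established for them. Specifically, applying prior work to LLM inference latency prediction on NPUs faces three challenges: the undisclosed microarchitecture of commercial NPUs, unpredictable compiler optimizations, and latency non-linearity caused by bucketing. We propose LENS, a latency estimator that predicts NPU inference latency without microarchitecture or compiler information and captures the non-linear latency caused by bucketing. LENS profiles each bucket with two end-to-end (E2E) measurements and composes the results to predict latency for arbitrary input-output length combinations. We validate LENS on NPUs from multiple vendors, several LLMs, and diverse workloads, with a mean prediction error of 2.15%. We further compare LENS against two methodologically related baselines, confirming the validity of its approach.

English abstract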

Deploying Large Language Models (LLMs) requires exploring a large configuration space spanning parallelization strategies, batching techniques, and scheduling policies. Exhaustive measurement across this space is impractical, making latency prediction essential for system optimization. While NPUs have emerged as accelerators designed for LLM inference, no prediction methodology has been established for them. Specifically, applying prior work to LLM inference latency prediction on NPUs faces three challenges: undisclosed microarchitecture of commercial NPUs, unpredictable compiler optimizations, and latency non-linearity induced by bucketing. We present LENS, a latency estimator that predicts NPU inference latency without information on the microarchitecture or compiler, and captures the non-linear latency induced by bucketing. LENS profiles each bucket with two end-to-end (E2E) measurements and composes the results to predict latency for arbitrary input-output length combinations. We validate LENS across NPUs from multiple vendors, several LLMs, and diverse workloads, achieving a mean prediction error of 2.15%. We further compare LENS against two methodologically related baselines, confirming the validity of its approach.
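One plausible reading of "two measurements per bucket" is a per-bucket linear fit, sketched below. It assumes that input lengths are padded up to the next bucket boundary, that latency within a bucket is a fixed cost plus a per-output-token cost, and that the two end-to-end runs differ only in output length. These modelling choices, the function names, and the timings are assumptions for illustration; the abstract does not specify how LENS composes its measurements.

```python
import bisect


def bucket_for(length, boundaries):
    """Return the smallest bucket boundary that can hold `length` (boundaries sorted ascending)."""
    idx = bisect.bisect_left(boundaries, length)
    if idx == len(boundaries):
        raise ValueError("length exceeds the largest bucket")
    return boundaries[idx]


def fit_bucket(t_short, n_short, t_long, n_long):
    """Solve latency = base + per_token * output_len from two end-to-end measurements."""
    per_token = (t_long - t_short) / (n_long - n_short)
    base = t_short - per_token * n_short
    return base, per_token


def predict_latency(input_len, output_len, boundaries, profiles):
    """Predict latency using the profile of the bucket the input is padded into."""
    base, per_token = profiles[bucket_for(input_len, boundaries)]
    return base + per_token * output_len


# Hypothetical buckets and made-up timings in milliseconds.
boundaries = [128, 256, 512]
profiles = {
    128: fit_bucket(12.0, 1, 111.0, 100),
    256: fit_bucket(20.0, 1, 218.0, 100),
}
estimate = predict_latency(input_len=200, output_len=10, boundaries=boundaries, profiles=profiles)
```

The step function in `bucket_for` is what produces the non-linearity: an input of 128 tokens and one of 129 tokens land in different buckets and can have noticeably different latency.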

2606.12629 2026-06-18 cs.LG cs.AI 新提交 80%

Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns

Bag of Dims：通过维度级符号模式实现无需训练的机制可解释性

Varun Reddy Nalagatla

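Affiliations: Amazon Web Services

Topic match (Other LLM): a training-free mechanistic interpretability method for transformers

AI summary: Proposes the Bag of Dims framework, showing that the standard basis of transformer hidden states already serves as a training-free feature basis in which per-dimension sign patterns encode semantics, and validates it across seven models spanning language, vision, and audio.

Comments 22 pages, 5 figures, 27 tables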

AI中文摘要

我们表明，Transformer隐藏状态的标准基本身就提供了一个无需训练、架构通用的特征基。单个维度通过其符号（+/-1）编码语义内容，通过其幅度编码置信度，充当独立的二进制寄存器；一个特征就是具有一致符号模式的一组维度，只需统计符号一致的个数即可读取，无需任何学习得到的旋转。我们在七个模型上验证了这一Bag of Dims框架，涵盖语言（Qwen 3.5-4B、Gemma 3-4B、Mistral 7B、Qwen3-32B）、视觉（DINOv2、ViT-Base）和音频（AST）。仅符号就携带预测性内容：单位幅度的符号模式通过LM头可保留60-93%的top-5下一个token准确率，无需解码器的汉明评分达到80-90%的top-4096准确率。基于单token缓存（每个token一次前向传播，无上下文、无标签），我们通过符号一致性检测出175个类别，AUC为0.97-0.99；训练过的探针仅增加+0.018 AUC，并收敛到轴对齐的权重。这些特征具有因果作用：它们在K/V注意力投影后依然保留，可追溯到写入它们的FFN神经元联盟（随机权重对照从未复现这一点），并且在实际前向传播中翻转某个特征的符号会在四个语言模型上抑制其对应概念，这种效应经过幅度匹配且具有概念特异性。各维度始终保持独立（成对互信息低于0.006比特）。这种结构并非语言所特有：同样的逐维度符号也出现在自监督视觉（DINOv2，12个ImageNet超类中的9个）、监督视觉（ViT-Base，11/12）和音频（AST，ESC-50的50个类别中的50个）中，因此它反映的是Transformer训练的一般性质，而非语言建模目标。标准基已足以进行特征读取，只需一次前向传播，无需优化，无需GPU天数。开放问题由此从寻找正确的旋转转变为梳理每个维度编码了什么。

英文摘要

We show the standard basis of transformer hidden states already provides a training-free, architecture-general feature basis. Individual dimensions encode semantic content via their signs (+/-1) and confidence via their magnitudes, acting as independent binary registers; a feature is a subset of dimensions with a consistent sign pattern, read by counting sign agreements with no learned rotation. We validate this Bag of Dims framework across seven models spanning language (Qwen 3.5-4B, Gemma 3-4B, Mistral 7B, Qwen3-32B), vision (DINOv2, ViT-Base), and audio (AST). Signs alone carry predictive content: unit-magnitude sign patterns preserve 60-93% top-5 next-token accuracy through the LM head, and decoder-free Hamming scoring reaches 80-90% top-4096. From a single-token cache (one forward pass per token, no context, no labels), we detect 175 categories at AUC 0.97-0.99 by sign agreement; a trained probe adds only +0.018 AUC and converges to axis-aligned weights. These features are causally operative: they survive the K/V attention projections, trace to the FFN neuron coalitions that write them (random-weight controls never reproduce this), and flipping a feature's signs during the live forward pass suppresses its concept across four language models, magnitude-matched and concept-specific. Dimensions stay independent throughout (pairwise mutual information below 0.006 bits). The structure is not specific to language: the same per-dimension signs appear in self-supervised vision (DINOv2, 9/12 ImageNet superclasses), supervised vision (ViT-Base, 11/12), and audio (AST, 50/50 ESC-50 categories), so it reflects transformer training in general, not the language-modeling objective. The standard basis already suffices for feature reading at one forward pass, no optimization, no GPU-days. The open problem shifts from finding the right rotation to cataloging what each dimension encodes.
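
The abstract describes reading a feature by counting sign agreements over a subset of dimensions. The sketch below shows that readout in plain Python. The helper names (`to_signs`, `sign_agreement`, `hamming_distance`), the toy vectors, and the choice to treat zero as positive are assumptions for illustration; the paper's actual feature definitions and thresholds are not given in the abstract.

```python
def to_signs(hidden):
    """Replace every magnitude with 1, keeping only the sign (+1 or -1).

    Assumption: an exact zero is treated as positive.
    """
    return [1 if x >= 0 else -1 for x in hidden]


def sign_agreement(hidden, feature):
    """Fraction of a feature's dimensions whose sign matches the pattern.

    `feature` maps a dimension index to its expected sign (+1 or -1).
    No rotation or learned weights are involved, only counting.
    """
    signs = to_signs(hidden)
    hits = sum(1 for dim, expected in feature.items() if signs[dim] == expected)
    return hits / len(feature)


def hamming_distance(a, b):
    """Number of dimensions where two hidden states disagree in sign."""
    return sum(1 for x, y in zip(to_signs(a), to_signs(b)) if x != y)


# Toy hidden state and a hypothetical three-dimension feature.
hidden = [0.3, -1.2, 0.05, -0.4]
feature = {0: +1, 1: -1, 3: +1}
score = sign_agreement(hidden, feature)
```

Here dimensions 0 and 1 match the pattern and dimension 3 does not, so the agreement score is two out of three.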

2606.08532 2026-06-18 cs.AI 新提交 80%

DN-Hypo-Pipeline: An AI-Driven Workflow for Hypothesis Generation via Large Language Models and Scientific Explanations

DN-Hypo-Pipeline：一种基于大语言模型和科学解释的AI驱动假设生成工作流

Lei Lin, Ronghao Wang, Chunbao Zhou, Jue Wang, Yangang Wang

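Affiliations: Computer Network Information Center, Chinese Academy of Sciences, China

Topic match (Other LLM): an LLM-driven hypothesis generation workflow

AI summary: Proposes DN-Hypo-Pipeline, which uses large language models with scientific explanations as prior knowledge to derive new hypotheses from existing literature. In data science modeling, statistical inference backed by LLM-as-judge and human expert evaluation shows it outperforms direct generation, and algorithms built from the two highest-scoring hypotheses outperform the baselines in the original papers.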

AI中文摘要

科学假设是研究的第一步并经过实验验证，但它也反映了对科学现象的深刻理解和推理。我们引入了DN-Hypo-Pipeline，一种基于大语言模型的AI驱动工作流，旨在通过利用科学解释作为先验知识来支持结构化科学思维和假设生成。该流水线帮助研究人员从现有文献中推导出新假设。给定研究论文的解释项（即结论），它识别潜在的定律、理论和原理，并为观察到的现象重构一个新的、尚未验证的解释。我们在数据科学建模领域使用三篇高被引论文评估了DN-Hypo-Pipeline。由LLM作为评判者和人类专家评估支持的统计推断表明，我们的流水线比直接生成方法更有效。此外，我们通过开发相应新颖算法验证了得分最高的两个生成假设，这些算法优于原始论文中提出的基线模型。除了在数据科学中的应用，DN-Hypo-Pipeline还提供了一个理论框架，不仅包含了理论指导的数据科学建模方法，还揭示了建模过程更基础的结构。此外，这种方法本质上是理论指导建模的推广，具有扩展到其他领域和更广泛科学学科的潜力。

英文摘要

A scientific hypothesis is the first step in research and undergoes experimental validation, yet it also reflects a deep understanding of and reasoning about scientific phenomena. We introduce DN-Hypo-Pipeline, an AI-powered workflow based on large language models, designed to support structured scientific thinking and hypothesis generation by leveraging scientific explanations as prior knowledge. This pipeline assists researchers in deriving novel hypotheses from existing literature. Given the explanandum (i.e., the conclusion) of a research paper, it identifies underlying laws, theories, and principles, and reconstructs a new, yet-to-be-verified explanation for the observed phenomenon. We evaluated DN-Hypo-Pipeline in the field of data science modeling using three highly cited papers. Statistical inference, supported by both LLM-as-judge assessment and human expert evaluation, demonstrates that our pipeline is more effective than direct generation methods. Additionally, we validated the two highest-scoring generated hypotheses by developing corresponding novel algorithms, which outperformed the baseline models presented in the original papers. Beyond application in data science, DN-Hypo-Pipeline provides a theoretical framework that not only encompasses theory-guided data science modeling methods but also reveals a more fundamental structure of the modeling process. Moreover, this approach is essentially a generalization of theory-guided modeling, offering potential for extension to other domains and across a broader range of scientific disciplines.

2606.18829 2026-06-18 cs.LG cs.CL 新提交 75%

GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

GateMem：多主体共享内存代理中的内存治理基准

Zhe Ren, Yibo Yang, Yimeng Chen, Zijun Zhao, Benshuo Fu, Zhihao Shu, Bingjie Zhang, Yangyang Xu, Dandan Guo, Shuicheng Yan

Affiliations: School of Artificial Intelligence, Jilin University; Shanghai Jiao Tong University; King Abdullah University of Science and Technology (KAUST); Tsinghua University; National University of Singapore

Topic match (Other LLM): evaluating memory governance for multi-principal shared-memory LLM agents

AI summary: Proposes the GateMem benchmark, which evaluates multi-principal shared-memory agents on utility, access control, and forgetting, and finds that no existing method satisfies all three at once.

Comments 24 pages, 8 figures. Code and dataset are available at https://github.com/rzhub/GateMem and https://huggingface.co/datasets/Ray368/GateMem

AI中文摘要

LLM代理的内存基准主要假设单用户设置，而医院、工作场所、校园和家庭中的共享助手研究不足。在这些部署中，多个主体写入公共内存池并根据不同角色、范围和关系进行查询，因此内存质量需要治理和召回。我们引入GateMem，一个多主体共享内存代理的基准。GateMem联合评估合法长期请求的效用（含状态更新）、跨上下文授权边界的访问控制，以及显式删除请求后的主动遗忘。它涵盖医疗、办公、教育和家庭领域，包含长形式多方情节、增量内存注入、隐藏检查点、结构化评判和泄漏目标注释。在多种基线和骨干模型上，没有方法能同时实现强效用、鲁棒访问控制和可靠遗忘。长上下文提示通常以高令牌成本获得最佳治理分数，而基于检索和外部内存的方法降低成本但仍泄漏未授权或已删除信息。这些结果表明，当前内存代理远未达到可靠的共享机构部署水平。

英文摘要

Memory benchmarks for LLM agents largely assume single-user settings, leaving shared assistants for hospitals, workplaces, campuses, and households understudied. In these deployments, multiple principals write to a common memory pool and query it under different roles, scopes, and relationships, so memory quality requires governance as well as recall. We introduce GateMem, a benchmark for multi-principal shared-memory agents. GateMem jointly evaluates utility for legitimate long-horizon requests with state updates, access control across contextual authorization boundaries, and agent-facing active forgetting after explicit deletion requests. It spans medical, office, education, and household domains, with long-form multi-party episodes, incremental memory injection, hidden checkpoints, structured judging, and leak-target annotations. Across diverse baselines and backbone models, no method simultaneously achieves strong utility, robust access control, and reliable forgetting. Long-context prompting often yields the best governance score at high token cost, while retrieval-based and external-memory methods reduce cost yet still leak unauthorized or deleted information. These results show current memory agents remain far from reliable shared institutional deployment.

2606.18389 2026-06-18 cs.CL 新提交 75%

Want Better Synthetic Data? Steer It: Activation Steering for Low-Resource Language Generation

想要更好的合成数据？引导它：面向低资源语言生成的激活引导

Jan Cegin, Daniil Gurgurov, Yusser Al Ghussin, Simon Ostermann

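Affiliations: Kempelen Institute of Intelligent Technologies; German Research Institute for Artificial Intelligence (DFKI)

Topic match (Other LLM): activation steering for synthetic data generation in low-resource languages

AI summary: Proposes activation steering as an alternative for low-resource synthetic data generation, with Language Steering and Quality Steering variants. Experiments show that steering early layers improves the diversity of generated data and often improves downstream model performance.

Comments 25 pages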

AI中文摘要

大型语言模型（LLMs）已成为合成数据生成的有效工具，包括低资源语言，生成的数据可以提升下游任务性能。当前最佳方法通常依赖于目标语言示例的少样本提示，这增加了推理成本，并可能通过词汇锚定降低多样性。在这项工作中，我们研究激活引导作为低资源合成数据生成的替代方案。我们研究了两种引导策略：语言引导，针对某种语言的语言身份（linguistic identity）；以及质量引导，通过对比人类撰写和反向翻译的文本表示来捕捉良好形式性。我们在四个开源LLM、多个层和11种类型多样的语言上评估这些方法，通过生成情感和主题分类数据并微调较小的分类器。引导在零样本和少样本提示设置中应用，并与非引导对应方法进行比较。我们的结果表明，早期层的引导一致地提高了生成数据的多样性，同时通常产生更强的下游模型性能，特别是对于低资源语言。

英文摘要

Large language models (LLMs) have become an effective tool for synthetic data generation, including for low-resource languages, where generated data can improve downstream task performance. Current best-performing approaches typically rely on few-shot prompting with target-language examples, which increases inference costs and may reduce diversity through lexical anchoring. In this work, we investigate activation steering as an alternative for low-resource synthetic data generation. We study two steering strategies: Language Steering, which targets the linguistic identity of a language, and Quality Steering, which captures well-formedness by contrasting human-written and backtranslated text representations. We evaluate these methods across four open-source LLMs, multiple layers, and 11 typologically diverse languages by generating sentiment and topic classification data and finetuning smaller classifiers. Steering is applied in both zero-shot and few-shot prompting settings and compared against non-steered counterparts. Our results show that steering on early layers consistently improves the diversity of generated data while often yielding stronger downstream model performance, particularly for low-resource languages.

2606.18304 2026-06-18 cs.LG cs.AI 新提交 75%

Attribution-Guided and Coverage-Maximized Pruning for Structural MoE Compression

基于归因引导和覆盖最大化的结构MoE剪枝

Yifu Ding, Jiacheng Wang, Ge Yang, Yongcheng Jing, Jinyang Guo, Xianglong Liu, Dacheng Tao

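Affiliations: School of Computer Science and Engineering, Beihang University; School of Artificial Intelligence, Beihang University; Nanyang Technological University

Topic match (Other LLM): structural pruning for MoE models, part of LLM compression and deployment

AI summary: Expert-level MoE pruning is too coarse and misses fine-grained redundancy. The paper proposes an attribution-guided, coverage-maximizing structural pruning framework that recasts prune-ratio allocation as a channel-score coverage optimization problem. It preserves accuracy at 50% pruning combined with 4-bit quantization and reduces memory footprint by 5.27×.

Comments 9 pages, 5 figures. Submitted to ICML 2026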

AI中文摘要

混合专家（MoE）模型在计算上高效扩展，但由于其巨大的内存占用和推理开销，部署成本仍然很高。先前的压缩方法主要在专家级别操作，要么移除整个专家，要么通过粗粒度的重要性分数对专家进行排序。然而，这种专家级别的决策通常过于粗糙，无法捕捉细粒度的冗余，导致剪枝预算分配不当和压缩效果有限。为了解决这个问题，我们观察到MoE专家内的信息高度集中在一小部分通道中，即使在被认为重要的专家中也存在大量冗余。基于这一观察，我们提出了一种针对MoE模型量身定制的结构剪枝框架。我们的方法将剪枝比例分配重新表述为通道分数覆盖最大化问题，并使用基于归因的近似方法高效求解。在DeepSeek和Qwen MoE模型上的实验表明，我们的方法在结合4位量化时，在50%或25%的结构化剪枝下仍能保持模型精度。在Qwen3-30B-A3B上，我们的方法将内存占用减少了5.27倍，并在各种基准测试中持续优于最先进的基线方法。

英文摘要

Mixture-of-Experts (MoE) models scale compute efficiently, yet remain expensive to deploy due to their substantial memory footprint and inference overhead. Prior compression methods mainly operate at the expert level, either removing entire experts or ranking experts by coarse-grained importance scores. However, such expert-wise decisions are often too coarse to capture fine-grained redundancy, leading to misallocated pruning budgets and limited compression. To address this problem, we observe that information within MoE experts is highly concentrated in a small subset of channels, leaving substantial redundancy even in experts deemed important. Based on this observation, we propose a structural pruning framework tailored for MoE models. Our method reformulates prune-ratio allocation as a channel-score coverage maximization problem and solves it efficiently using an attribution-based approximation. Experiments on DeepSeek and Qwen MoE models show that our method preserves model accuracy under 50% or 25% structured pruning when combined with 4-bit quantization. On Qwen3-30B-A3B, our approach reduces memory footprint by 5.27× and consistently outperforms state-of-the-art baselines across diverse benchmarks.
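
A toy version of prune-ratio allocation as coverage maximization is sketched below. It assumes each channel already has a nonnegative importance score (for example from an attribution method) and that coverage is the share of total score retained. Under those assumptions, keeping the globally top-scoring channels maximizes coverage, and the per-expert keep counts follow from that. The function names and the scores are hypothetical; the paper's actual objective and attribution-based approximation are not specified in the abstract.

```python
import heapq


def allocate_keep_counts(channel_scores, total_keep):
    """Pick how many channels each expert keeps under a global budget.

    `channel_scores` maps an expert name to a list of nonnegative channel
    scores. Keeping the `total_keep` highest-scoring channels overall
    maximizes the retained score for this simplified objective.
    """
    flat = [(score, name)
            for name, scores in channel_scores.items()
            for score in scores]
    if not 0 <= total_keep <= len(flat):
        raise ValueError("total_keep must be between 0 and the channel count")
    kept = heapq.nlargest(total_keep, flat, key=lambda item: item[0])
    counts = {name: 0 for name in channel_scores}
    for _, name in kept:
        counts[name] += 1
    return counts


def coverage(channel_scores, counts):
    """Share of the total score retained when each expert keeps its top channels."""
    covered = sum(sum(sorted(scores, reverse=True)[:counts[name]])
                  for name, scores in channel_scores.items())
    total = sum(sum(scores) for scores in channel_scores.values())
    return covered / total


# Hypothetical scores: expert e1 concentrates almost all its score in one channel.
scores = {"e0": [0.5, 0.3, 0.1, 0.1], "e1": [0.9, 0.05, 0.03, 0.02]}
counts = allocate_keep_counts(scores, total_keep=3)
retained = coverage(scores, counts)
```

With a budget of three channels, the allocation gives two to `e0` and one to `e1`. A uniform per-expert ratio could not produce that split, which is the point of allocating at the channel level.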

2606.18105 2026-06-18 cs.NI cs.LG 新提交 75%

OmniPlan: An Adaptive Framework for Timely and Near-Optimal Network Planning Optimization

OmniPlan：一种用于及时且近乎最优的网络规划优化的自适应框架

Longlong Zhu, Jiashuo Yu, Zedi Chen, Yuhan Wu, Zhifan Jiang, Yuchen Xian, Yimeng Liu, Jiajie Su, Shaopeng Zhou, Xingyuan Li, Hongyan Liu, Xuan Liu, Dong Zhang, Chunming Wu, Xiang Chen

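Affiliations: Zhejiang University; Fuzhou University; Yangzhou University; The State Key Laboratory of Blockchain and Data Security; College of Computer Science and Technology

Topic match (Other LLM): using an LLM to parse user intent for network planning

AI summary: Proposes OmniPlan, an adaptive framework that uses a large language model to interpret user intent and a mixture-of-experts architecture to select dynamically among MIP solvers, heuristics, and deep reinforcement learning models, achieving both timeliness and near-optimality in network planning optimization. On distributed machine learning inference offloading, it reduces latency by up to 97.8% and resource consumption by up to 11.5%.

Comments Accepted by ACM KDD 2026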

AI中文摘要

网络规划优化是跨多个领域（包括交通系统、通信网络和电网）的基本问题。它需要在复杂约束下同时优化多个相互竞争的目标。现有的网络规划优化框架依赖混合整数规划（MIP）求解器、启发式算法和深度强化学习（DRL）模型来计算规划决策。然而，它们缺乏对多样化和动态用户意图的有效适应性，从而导致执行时间与最优性之间的权衡。在本文中，我们提出OmniPlan，一种自适应框架，在网络规划优化中同时实现及时性和近乎最优性。为了实现现有解决方案所缺乏的适应性，OmniPlan采用基于大语言模型（LLM）的解释器，将异构的自然语言意图转换为统一且可量化的用户偏好向量。然后，它采用混合专家架构，集成MIP求解器、启发式算法和DRL模型作为专门专家，OmniPlan通过动态选择及时且近乎最优的专家来适应多样化的意图。最后，它包含一个基于DRL的专家配置模块，该模块微调优化目标权重，使规划决策与用户特定偏好对齐。我们使用代表性的真实工作负载（即分布式机器学习（ML））评估OmniPlan，其中我们利用OmniPlan将广泛的ML推理任务（例如决策树、SVM、朴素贝叶斯、XGBoost和随机森林）卸载到硬件设备网络。我们在真实测试平台上的实验表明，OmniPlan为真实ML推理任务实现了近乎最优且低执行时间的卸载，延迟降低高达97.8%，网络设备资源消耗降低高达11.5%。

英文摘要

Network planning optimization is a fundamental problem across diverse domains, including transportation systems, communication networks, and power grids. It requires simultaneous optimization of multiple competing objectives under complex constraints. Existing network planning optimization frameworks rely on mixed integer programming (MIP) solvers, heuristics, and deep reinforcement learning (DRL) models to compute planning decisions. However, they lack effective adaptability to diverse and dynamic user intents, thus leading to the trade-off between execution time and optimality. In this paper, we propose OmniPlan, an adaptive framework that achieves both timeliness and near-optimality in network planning optimization. To achieve the adaptability lacking in existing solutions, OmniPlan employs a large language model (LLM)-based interpreter to convert heterogeneous natural-language intents into a unified and quantifiable user-preference vector. Then it employs a mixture-of-experts architecture that integrates MIP solvers, heuristics, and DRL models as specialized experts, where OmniPlan adapts to diverse intents by dynamically selecting timely and near-optimal experts. Finally, it incorporates a DRL-based expert configuration module that fine-tunes optimization objective weights to align planning decisions with user-specific preferences. We evaluate OmniPlan with a representative real-world workload, i.e., distributed machine learning (ML), where we leverage OmniPlan to offload a wide spectrum of ML inference tasks, e.g., decision trees, SVM, naive Bayes, XGBoost, and random forests, onto a network of hardware devices. Our experiments on a real-world testbed indicate that OmniPlan achieves near-optimal and low-execution-time offloading for real-world ML inference tasks, reducing latency by up to 97.8% and network device resource consumption by up to 11.5%.

2606.17276 2026-06-18 cs.IR cs.LG 新提交 75%

On the Memorization Behavior of LLMs in Generative Recommendation: Observations, Implications, and Training Strategies

LLM在生成式推荐中的记忆行为：观察、启示与训练策略

Sunwoo Kim, Sunkyung Lee, Clark Mingxuan Ju, Donald Loveland, Bhuvesh Kumar, Kijung Shin, Neil Shah, Liam Collins

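Affiliations: KAIST; Sungkyunkwan University; Snap Inc.

Topic match (Other LLM): the memorization behavior of LLMs in generative recommendation

AI summary: Studies the memorization tendency of LLMs in generative recommendation, finds that they over-rely on one-hop memorization, and proposes the IIRG training strategy to learn multi-hop collaborative relations and semantic relations, markedly improving recommendations for users whose targets are not covered by one-hop memorization.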

AI中文摘要

生成式推荐（GR）已成为推荐系统的一个有前景的方向。最近，大型语言模型（LLM）越来越多地被用于GR，因为其丰富的预训练知识有望帮助它们泛化到传统以记忆为导向的基线所能捕捉的常见用户行为模式之外。然而，现有的基于LLM的GR工作很大程度上忽略了LLM众所周知的记忆倾向，如果这种倾向存在于为GR微调的LLM中，将限制它们对预训练知识的利用。在这项工作中，我们通过检查一跳记忆（即模型推荐训练数据中项目的直接后继项目）来研究这一担忧。我们表明，LLM比非LLM的GR模型更频繁地这样做——事实上，它们相对于GR基线的大部分增益实际上来自那些目标项目可以通过一跳记忆预测的用户。我们直觉认为，提高剩余用户的性能需要LLM学习更丰富的项目-项目关系，超越一跳转换。为此，我们提出了IIRG，一种新颖的训练策略，教导LLM捕获：（1）从用户序列中跨多跳的项目共现导出的协同关系，以及（2）具有相似主题的项目之间的语义关系，这两者都可以作为有用的推荐信号。我们表明，IIRG显著优于仅使用标准下一项目预测训练的LLM，尤其是对于那些测试项目在训练时的一跳转换中未覆盖的用户，增益尤为显著。

英文摘要

Generative recommendation (GR) has emerged as a promising direction for recommender systems. Recently, large language models (LLMs) have been increasingly adopted for GR, as their rich pretrained knowledge is expected to help them generalize beyond common user behavior patterns that traditional memorization-oriented baselines can capture. However, existing LLM-based GR works largely ignore LLMs' well-known tendency to memorize, which, if present in LLMs fine-tuned for GR, would restrict their utilization of pretrained knowledge. In this work, we investigate this concern by examining one-hop memorization, where a model recommends items that are direct successors of items in the training data. We show that LLMs do this more than non-LLM-based GR models; in fact, the vast majority of their gains over GR baselines are actually on users whose target items can be predicted through one-hop memorization. We intuit that improving performance on the remaining users requires LLMs to learn richer item-item relations beyond one-hop transitions. To achieve this, we propose IIRG, a novel training strategy that teaches LLMs to capture: (1) collaborative relations derived from item co-occurrences across multiple hops in user sequences, and (2) semantic relations among items with similar themes, both of which can serve as useful recommendation signals. We show that IIRG significantly improves over LLMs trained solely with standard next-item prediction, with especially large gains for users whose test items are not covered by train-time one-hop transitions.
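
The notions of one-hop transition and multi-hop co-occurrence used in the abstract can be made concrete with a small sketch. It assumes a target counts as one-hop covered when it directly followed the user's most recent item somewhere in the training sequences; the paper may define coverage differently. The function names and the item IDs are hypothetical.

```python
from collections import defaultdict


def one_hop_successors(train_sequences):
    """Map each item to the set of items that directly follow it in training."""
    successors = defaultdict(set)
    for seq in train_sequences:
        for current, following in zip(seq, seq[1:]):
            successors[current].add(following)
    return successors


def is_one_hop_covered(successors, history, target):
    """True if `target` directly followed the user's last item in training.

    Assumption: coverage is judged from the most recent item only.
    """
    return bool(history) and target in successors.get(history[-1], set())


def multi_hop_pairs(train_sequences, max_hops):
    """Ordered item pairs that co-occur within `max_hops` steps of each other."""
    pairs = set()
    for seq in train_sequences:
        for i, first in enumerate(seq):
            for later in seq[i + 1:i + 1 + max_hops]:
                pairs.add((first, later))
    return pairs


train = [["a", "b", "c"], ["b", "d"]]
succ = one_hop_successors(train)
covered = is_one_hop_covered(succ, ["x", "a"], "b")
uncovered = is_one_hop_covered(succ, ["x", "a"], "c")
```

The pair `("a", "c")` is missing from the one-hop transitions but appears once two hops are allowed. That is the kind of relation a purely one-hop memorizer would miss.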

2606.19317 2026-06-18 cs.LG cs.AI 新提交 70%

Explaining Attention with Program Synthesis

用程序合成解释注意力机制

Amiri Hayes, Belinda Li, Jacob Andreas

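Affiliations: NJIT; MIT; MIT CSAIL

Topic match (Other LLM): explaining attention heads with program synthesis

AI summary: Proposes approximating the behavior of deep network components with executable programs. For transformer attention heads, it generates Python programs that reproduce the heads' attention patterns, yielding interpretable descriptions.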

AI中文摘要

可解释深度学习研究的一个长期目标是，用人类可理解的符号描述取代不透明的神经计算。本文提出了一种用可执行程序近似深度网络组件行为的方法。我们专注于Transformer语言模型中的注意力头。对于给定的注意力头，我们首先在一组随机选择的训练样本上计算其关联的注意力矩阵。接着，我们向预训练语言模型提供这些矩阵的摘要，并指示它生成一组Python程序，这些程序仅根据输入句子中的文本即可再现相关的注意力模式。最后，我们根据最终程序集在保留输入上预测行为的效果对程序进行重新排序。我们证明，少于1000个这样的生成程序即可再现GPT-2、TinyLlama-1.1B和Llama-3B中注意力头的注意力模式，在TinyStories上平均交并比相似度超过75%。此外，最佳匹配程序可以替代神经注意力头而不会显著影响模型行为：在三个模型中用程序替代25%的注意力头仅导致平均困惑度增加16%，同时在各种下游问答基准上保持性能。这项工作为使用人类可读、可执行的代码逆向工程Transformer模型中的注意力头提供了一个可扩展的流程，推动了神经模型向符号透明性的发展。

英文摘要

A longstanding goal of research on interpretable deep learning is to replace opaque neural computations with human-meaningful symbolic descriptions. In this paper, we propose an approach for approximating the behavior of components of deep networks with executable programs. We focus on attention heads in transformer language models. For a given head, we first compute its associated attention matrices on a collection of randomly selected training examples. Next, we prompt a pre-trained language model with a summary of these matrices, and instruct it to generate a set of Python programs that can reproduce the associated attention patterns given only text from the input sentence. Finally, we re-rank programs according to how well our final set of programs predicts behavior on held-out inputs. We demonstrate that a set of fewer than 1,000 such generated programs can reproduce the attention patterns of heads in GPT-2, TinyLlama-1.1B, and Llama-3B, achieving an average Intersection-over-Union similarity above 75% on TinyStories. Moreover, the best-fit programs can replace neural attention heads without substantially affecting model behavior: replacing 25% of attention heads with programmatic surrogates across the three models incurs only a 16% average perplexity increase, while maintaining performance on a variety of downstream question answering benchmarks. This work contributes a scalable pipeline for reverse-engineering attention heads in transformer models using human-readable, executable code, advancing a path toward symbolic transparency in neural models.
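
To illustrate how a program might be scored against a head, the sketch below compares a hand-written surrogate with a toy attention matrix using Intersection-over-Union. It assumes attention is binarized with a fixed threshold before comparison, which the abstract does not specify. The surrogate `previous_token_program` is a hypothetical example, not a program from the paper.

```python
def previous_token_program(tokens):
    """Hypothetical surrogate: each position attends to the token before it.

    Position 0 has no predecessor, so it attends to itself.
    Returns a set of (query_index, key_index) pairs.
    """
    return {(i, max(i - 1, 0)) for i in range(len(tokens))}


def binarize(attention, threshold=0.5):
    """Keep the (query, key) pairs whose attention weight reaches the threshold."""
    return {(i, j)
            for i, row in enumerate(attention)
            for j, weight in enumerate(row)
            if weight >= threshold}


def iou(predicted, observed):
    """Intersection-over-Union between two sets of attended pairs."""
    union = predicted | observed
    return len(predicted & observed) / len(union) if union else 1.0


tokens = ["the", "cat", "sat"]
# Toy causal attention matrix for one head (rows are queries).
attention = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.2, 0.2, 0.6],
]
similarity = iou(previous_token_program(tokens), binarize(attention))
```

The surrogate matches the head on the first two positions and misses the third, where the head attends to the current token instead, so the score lands at one half.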

2606.19264 2026-06-18 cs.LG cs.CL 新提交 70%

Structured Inference with Large Language Gibbs

大语言吉布斯结构化推理

Sanghyeok Choi, Henry Gouk, Esmeralda S. Whitammer

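Affiliations: University of Edinburgh, School of Informatics

Topic match (Other LLM): structured probabilistic inference using LLM conditional distributions

AI summary: Proposes Large Language Gibbs, which uses an LLM's conditional distributions as transition operators for structured probabilistic inference and iteratively resamples variables to avoid order-dependent bias. It is validated on synthetic distributions, consistent reasoning tasks, and Bayesian structure learning.

Comments Code: https://github.com/hyeok9855/large-language-gibbs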

AI中文摘要

大型语言模型（LLMs）中编码的知识可以作为描述复杂世界变量的结构化推理的基础，但以概率一致的方式访问这些知识构成了一个困难的推理问题。我们提出了大语言吉布斯，一种结构化概率推理方案，它使用LLM的条件分布作为转移算子。不是通过单次自回归生成来采样结构化对象，而是利用LLM的下一个标记条件分布，在给定其他变量的条件下迭代地重采样单个变量。这种方法避免了顺序依赖偏差，并产生一个反映所有局部条件分布之间折衷的平稳分布。我们将这种方法应用于从合成分布中采样、一致性推理任务和贝叶斯结构学习。结果表明，在通过噪声LLM条件分布可访问的世界先验下，MCMC中使用LLM条件分布是用于结构化概率推理的一次性生成的实际替代方案。

英文摘要

The knowledge encoded in large language models (LLMs) can serve as a substrate for structured reasoning over variables describing a complex world, but accessing this knowledge in a probabilistically coherent manner poses a difficult inference problem. We propose Large Language Gibbs, a scheme for structured probabilistic inference that uses conditional distributions of an LLM as transition operators. Rather than sampling structured objects through single-pass autoregressive generation, we iteratively resample individual variables conditioned on others using an LLM's next-token conditionals. This approach avoids order-dependent biases and produces a stationary distribution that reflects a compromise between all local conditionals. We apply this approach to sampling from synthetic distributions, consistent reasoning tasks, and Bayesian structure learning. The results suggest that the use of LLM conditionals in MCMC is a practical alternative to one-pass generation for structured probabilistic inference under a world prior accessible through noisy LLM conditionals.
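
The resampling loop described above is ordinary Gibbs sampling with the conditional supplied by a model. The sketch below uses binary variables and a hand-written stand-in, `majority_conditional`, where the paper would query an LLM's next-token conditionals. The stand-in, the function names, and the binary setup are assumptions for illustration.

```python
import random


def gibbs(conditional, init, sweeps, rng):
    """Resample each variable in turn from its conditional given the rest.

    `conditional(i, state)` returns P(x_i = 1 | all other variables).
    Returns one recorded state per sweep.
    """
    state = list(init)
    samples = []
    for _ in range(sweeps):
        for i in range(len(state)):
            p_one = conditional(i, state)
            state[i] = 1 if rng.random() < p_one else 0
        samples.append(tuple(state))
    return samples


def majority_conditional(i, state, strength=0.9):
    """Stand-in for a model conditional: favor the majority value of the others."""
    others = [value for j, value in enumerate(state) if j != i]
    ones = sum(others)
    if 2 * ones > len(others):
        return strength
    if 2 * ones < len(others):
        return 1.0 - strength
    return 0.5


rng = random.Random(0)
chain = gibbs(majority_conditional, init=[0, 1, 1, 0], sweeps=200, rng=rng)
```

Because every variable is revisited in each sweep, no variable is fixed just because it was generated first. This is the order-dependence that single-pass autoregressive generation has and that the abstract says the method avoids.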

2606.19218 2026-06-18 cs.CL 新提交 70%

RECOM: A Validity Discrimination Tradeoff in Automatic Metrics for Open Ended Reddit Question Answering

RECOM：开放式 Reddit 问答中自动评估指标的有效性与区分性权衡

Pushwitha Krishnappa, Amit Das, Vinija Jain, Aman Chadha, Tathagata Mukherjee

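Affiliations: University of Alabama Huntsville; University of North Alabama; Stanford University; Meta AI; Amazon GenAI

Topic match (Other LLM): automatic metrics for evaluating LLM-generated text, an LLM application

AI summary: Introduces the RECOM dataset and finds that no automatic metric achieves both validity and discriminative power on open-ended question answering. Cosine similarity has high validity but poor discrimination, while BERTScore's apparent discrimination is driven by response length and its validity is weak.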

AI中文摘要

自动评估指标是评估 LLM 生成文本的默认方法，但一个指标被默默要求完成两项任务：区分真实内容对齐与表面巧合（有效性），以及区分更好的系统与更差的系统（区分性）。在开放式、观点驱动的问答中，这两者存在矛盾。我们引入了 RECOM（Reddit Evaluation for Correspondence of Models），一个无污染评估数据集，包含 15,000 个 r/AskReddit 问题（2025 年 9 月），每个问题都配有真实的社区回复，这些回复的发布时间晚于所有被评估模型的训练截止日期。我们用每条回复为五个开源 LLM（7-10B）的输出评分，并为每个指标配上随机错位排列（random derangement）的噪声基线，结果发现没有指标能同时做好这两项工作。余弦相似度能很好地区分真实回答与随机回答（Cohen's $d \approx 2$），但无法对五个模型进行排序（$|d| < 0.1$）；BERTScore 精确度看似能对模型排序（原始 $|d|$ 高达 0.63），但一旦控制回复长度，这一数值骤降至 $|d| = 0.09$，且其有效性较弱（$d \approx 0.8$，而余弦相似度约为 2）。由于每个指标对相同的输出进行评分，这种有效性与区分性的权衡是指标的属性，而非模型的属性，我们认为这源于表示设计。三个独立的 LLM 评判员再现了有效性差距，同样只能微弱地区分五个模型。我们建议在两个轴上报告指标，并明确给出随机基线。RECOM 已在 https://anonymous.4open.science/r/recom-D4B0 公开提供。

英文摘要

Automatic metrics are the default for evaluating LLM-generated text, yet a metric is quietly asked to do two jobs: tell genuine content alignment from surface coincidence (validity), and tell a better system from a worse one (discriminative power). On open-ended, opinion-driven question answering, the two are in tension. We introduce RECOM (Reddit Evaluation for Correspondence of Models), a contamination-free evaluation dataset of 15,000 r/AskReddit questions (September 2025), each paired with its authentic community replies, which postdate every evaluated model's training cutoff. Scoring five open-source LLMs (7-10B) against every reply, with each metric paired with a random-derangement noise floor, we find that no metric does both jobs well. Cosine similarity separates real from random answers (Cohen's $d \approx 2$) but cannot rank the five models ($|d| < 0.1$); BERTScore precision appears to rank the models (raw $|d|$ up to 0.63), but once response length is controlled this collapses to $|d| = 0.09$ and its validity is weak ($d \approx 0.8$, versus cosine's $\approx 2$). Because every metric scores the same outputs, this validity-discrimination tradeoff is a property of the metrics, not the models, and we argue it stems from representation design. Three independent LLM judges reproduce the validity gap and likewise separate the five models only weakly. We recommend reporting metrics on both axes, with an explicit random-baseline floor. RECOM is publicly available at https://anonymous.4open.science/r/recom-D4B0
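
Two ingredients named in the abstract are easy to state in code: Cohen's $d$ as the effect size and a random derangement as the noise floor. The sketch below uses the pooled-standard-deviation form of Cohen's $d$ and rejection sampling for the derangement. Both are standard choices, assumed here because the abstract does not say which variants the paper uses. The scores are made up.

```python
import random
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Effect size between two samples using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b))
                  / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled


def random_derangement(n, rng):
    """Random permutation with no fixed points, via rejection sampling.

    Pairing question i with the reply at index perm[i] guarantees every
    question is scored against a reply written for a different question.
    """
    if n < 2:
        raise ValueError("a derangement needs at least two items")
    while True:
        perm = list(range(n))
        rng.shuffle(perm)
        if all(i != p for i, p in enumerate(perm)):
            return perm


# Made-up metric scores: matched replies versus deranged (mismatched) replies.
matched = [2.0, 4.0, 6.0]
mismatched = [1.0, 3.0, 5.0]
validity_effect = cohens_d(matched, mismatched)
pairing = random_derangement(5, random.Random(0))
```

A large effect size between matched and deranged pairings indicates validity. The same statistic computed between two models' scores indicates discriminative power, so both axes the abstract recommends reporting can use one function.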

2606.19172 2026-06-18 cs.AI 新提交 70%

User as Engram: Internalizing Per-User Memory as Local Parametric Edits

用户作为印迹：将每用户记忆内化为局部参数编辑

Bojie Li

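Affiliations: Pine AI

Topic match (Other LLM): internalizing user memory as parametric edits, a form of LLM personalization

AI summary: Proposes User as Engram, which stores user facts as local edits to the hash-keyed memory table of an Engram model while a single shared adapter carries the reasoning skill, achieving high indirect-reasoning accuracy with a very small memory footprint.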

AI中文摘要

语言模型中的个人记忆涉及两个问题：内容和推理技能。大脑将两者分开（每个情节在海马体中有一个稀疏的局部印迹，解释它的共享技能在缓慢的新皮层中），因此新事实不必覆盖其他一切。如今大多数个性化方法将用户事实保存在权重之外，存储在自然语言记忆文件或检索索引中。当事实被写入模型时，标准方法是每用户的LoRA适配器，这与大脑相反，将内容和技能折叠成一个全局权重增量。将用户事实写为LoRA会污染与它们无关的文本；将相同事实写为局部Engram行则数学上保持不变，导致内存占用大约减少33,000倍。因此，我们提出User as Engram：将用户内容存储为对Engram模型的哈希键控记忆表的手术式编辑，并将推理技能携带在一个共享适配器中。这种分层设计匹配了每用户LoRA的直接召回，同时平均提供5.6倍更高的间接推理准确性，并且从未使单个用户在推理方面比未触及的基座更差。编辑是一个玻璃盒：写入一个事实会在精确触发时打开其查找，添加答案所需的值，保持其他每个位置不变到最后一位，如果写入错误层则失败。由于不同用户的事实落在不相交的哈希槽中，它们的编辑可组合：许多用户同时共享一个表，可加性且无损地堆叠，而每用户LoRA（一个全局权重增量）只允许一个。在检索时，每用户Engram表不会随着检索器必须搜索的群体增长，因此在大约100个事实后，它超越了在2.5倍更大模型上的检索流水线。

英文摘要

Personal memory in a language model is two problems: content and reasoning skill. The brain keeps the two apart (a sparse, local engram in the hippocampus for each episode, a slow neocortex for the shared skills that interpret it), so a new fact need not overwrite everything else. Most personalization today keeps a user's facts outside the weights, in a natural-language memory file or a retrieval index. When facts are written into the model instead, the standard recipe is the per-user LoRA adapter, which does the opposite of the brain, folding content and skill into one global weight delta. Writing a user's facts as a LoRA contaminates text unrelated to them; writing the same facts as local Engram rows leaves it mathematically untouched, resulting in a roughly 33,000x smaller memory footprint. We therefore propose User as Engram: store a user's content as surgical edits to the hash-keyed memory table of an Engram model, and carry the reasoning skill in one shared adapter. This layered design matches per-user LoRA's direct recall while delivering 5.6x higher indirect-reasoning accuracy on average, and never makes a single user worse at reasoning than the untouched base. The edit is a glass box: writing a fact switches on its lookup at exactly the trigger, adds the value the answer needs, leaves every other position unchanged to the last bit, and fails if written into the wrong layer. Because different users' facts land in disjoint hash slots, their edits compose: many users live in one shared table at once, stacking additively and losslessly, where a per-user LoRA, a single global weight delta, admits only one. Upon retrieval, a per-user Engram table does not grow with the population the retriever must search, so past ~100 facts it overtakes a retrieval pipeline on a 2.5x larger model.
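
The composition argument, that edits from different users land in disjoint slots and so stack without interfering, can be seen in a toy hash-keyed table. This is only an analogy. The Engram model's real table layout, hashing scheme, and layer placement are not described in the abstract, so `SharedMemoryTable`, `slot_key`, and the two-dimensional rows are all hypothetical.

```python
import hashlib


def slot_key(user_id, trigger):
    """Derive a table slot from the user and the trigger text (toy hashing)."""
    payload = f"{user_id}\x00{trigger}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class SharedMemoryTable:
    """One sparse table shared by all users; absent rows read as zeros."""

    def __init__(self, width):
        self.width = width
        self.rows = {}

    def write(self, user_id, trigger, delta):
        """Add `delta` to exactly one row; every other row is untouched."""
        if len(delta) != self.width:
            raise ValueError("delta must match the table width")
        key = slot_key(user_id, trigger)
        row = self.rows.get(key, [0.0] * self.width)
        self.rows[key] = [r + d for r, d in zip(row, delta)]

    def lookup(self, user_id, trigger):
        """Return a copy of the row for this user and trigger."""
        return list(self.rows.get(slot_key(user_id, trigger),
                                  [0.0] * self.width))


table = SharedMemoryTable(width=2)
table.write("alice", "pet", [1.0, 0.0])
table.write("bob", "pet", [0.0, 2.0])
alice_row = table.lookup("alice", "pet")
bob_row = table.lookup("bob", "pet")
```

Writing Bob's fact leaves Alice's row bit-for-bit unchanged. A single global weight delta, which the abstract contrasts this with, cannot offer that property.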

2606.18851 2026-06-18 eess.SY cs.SY 新提交 70%

From Tokens to Energy Flexibility: Quantization-Enabled Demand Response for Data Centers with LLM Inference Workloads

从令牌到能量灵活性：面向LLM推理工作负载的数据中心量化使能需求响应

Bojun Du, Xiaoyi Fan, Ershun Du, Long Chen, Jianpei Han, Qingchun Hou, Ning Zhang, Chongqing Kang

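Topic match (Other LLM): demand response for LLM inference data centers through quantization management

AI summary: Proposes a quantization-enabled energy management framework that builds a quantization-to-power model and a two-stage demand response model and co-optimizes across multiple campuses, reducing data-center operating cost by 34.3%.

Comments 10 pages, 7 figures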

AI中文摘要

大型语言模型（LLM）推理的快速增长正在造成显著的数据中心负载，在日益紧张的电网条件和需求响应（DR）要求下，这些负载面临着越来越多的能量管理挑战。传统的数据中心能量管理主要依赖于时间和空间上的工作负载转移以及园区级能量资产调度，但通常将LLM推理需求视为聚合负载。因此，这些方法未能利用LLM服务的内部特性，从而忽视了模型量化等LLM特定技术所提供的灵活性。为了释放这种灵活性，本文提出了一种面向电网响应型LLM推理数据中心的量化使能能量管理框架。首先，建立了一个量化-功率模型，将每个模型-量化配置映射到一个紧凑的可调度参数集。其次，开发了一个两阶段量化使能的需求响应模型，以考虑模型实例切换、请求路由和精度选择。第三，引入了一种多园区协同优化方法，通过将电网侧电力和碳信号与量化使能的需求响应模型相结合，参与需求响应。案例研究表明，所提出的框架在不减少服务令牌量的情况下，将数据中心总运营成本降低了34.3%，验证了模型量化作为电网响应型LLM数据中心能量管理的有效灵活性杠杆。

英文摘要

The rapid growth of large language model (LLM) inference is creating significant data-center loads that face increasing energy-management challenges under tightening grid conditions and demand response (DR) requirements. Conventional data-center energy management mainly relies on temporal and spatial workload shifting and campus-level energy asset scheduling, but it usually treats LLM inference demand as an aggregate load. As a result, these approaches fail to exploit the internal characteristics of LLM serving and therefore overlook the flexibility offered by LLM-specific techniques such as model quantization. To unlock this flexibility, this paper proposes a quantization-enabled energy management framework for grid-responsive LLM inference data centers. First, a quantization-to-power model is established to map each model-quantization configuration to a compact set of dispatchable parameters. Second, a two-stage quantization-enabled DR model is developed to account for model instance switching, request routing, and precision selection. Third, a multi-campus co-optimization method is introduced for DR participation by integrating grid-side electricity and carbon signals with the quantization-enabled DR model. Case studies show that the proposed framework reduces total data-center operating cost by 34.3% without curtailing served token volume, validating model quantization as an effective flexibility lever for grid-responsive LLM data-center energy management.

2606.13795 2026-06-18 cs.LG 新提交 80%

DiPOD: Diffusion Policy Optimization without Drifting Apart

无漂移扩散策略优化

Haozhe Jiang, Haiwen Feng, Pieter Abbeel, Jiantao Jiao, Angjoo Kanazawa, Nika Haghtalab

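Affiliations: University of California, Berkeley; Simons Institute for the Theory of Computing; Department of Electrical Engineering and Computer Sciences, University of California, Berkeley

Topic match (Post-training): diffusion policy optimization for language model post-training

AI summary: To address the instability of diffusion policy gradient methods, proposes the DiPOD framework, which alternates self-distillation with policy-improvement gradient updates to maintain tightly bounded behavior, giving stable and efficient policy optimization.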

Comments Project page: astro-eric.github.io/blogs/dipod/ Code: https://github.com/Astro-Eric/DiPOD-release

2606.18910 2026-06-18 cs.LG cs.CL 新提交 75%

REVES: REvision and VErification--Augmented Training for Test-Time Scaling

REVES：通过修订与验证增强的测试时扩展训练

Yuanxin Liu, Ruida Zhou, Xinyan Zhao, Amr Sharaf, Hongzhou Lin, Arijit Biswas, Mohammad Ghavamzadeh, Zhaoran Wang, Mingyi Hong

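Affiliations: Northwestern University; Amazon AGI; Qualcomm AI Research; University of Minnesota

Topic match (Post-training): a two-stage training framework for optimizing reasoning

AI summary: Proposes the REVES framework, which converts intermediate "near-miss" answers into decoupled revision and verification prompts, enabling efficient off-policy data generation and improving the multi-step reasoning of large language models. On LiveCodeBench it scores 6.5 points above the RL baseline.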

AI中文摘要

通过顺序修订进行测试时扩展已成为增强大语言模型（LLM）推理能力的强大范式。然而，标准的后训练方法主要优化单次目标，与多步推理动态存在根本性不匹配。虽然最近的工作将其视为多轮强化学习（RL），但传统方法直接优化多步轨迹，未能进一步利用模型可以从纠正中学习的中间步骤中的高质量错误。我们提出了一个两阶段迭代框架，交替进行在线数据/提示增强和策略优化。通过将成功恢复轨迹中的中间步骤（“接近正确”答案）转化为解耦的修订和验证提示，我们的方法将训练集中在有效的答案转换和错误识别上。与标准的多轮RL相比，这种方法实现了高效的离策略数据生成，并减少了长程采样的计算开销。在LiveCodeBench上，使用公开可用的测试用例作为反馈，我们观察到比RL基线高6.5分，比标准多轮训练高4.0分。除了编码，我们的方法在圆填充问题上达到了先前报告的SOTA结果，同时使用了最小的基础模型（4B）和远少于更大进化搜索系统的采样次数。在真实验证下的数学结果进一步证实了改进的纠正能力。该方法还泛化到分布外的约束满足谜题，如n皇后和迷你数独，其中正确性完全由问题约束定义。代码可在 https://github.com/yxliu02/REVES.git 获取。

英文摘要

Test-time scaling via sequential revision has emerged as a powerful paradigm for enhancing Large Language Model (LLM) reasoning. However, standard post-training methods primarily optimize single-shot objectives, creating a fundamental misalignment with multi-step inference dynamics. While recent work treats this as multi-turn reinforcement learning (RL), conventional approaches optimize over the multi-step trajectories directly, failing to exploit the high-quality mistakes in intermediate steps that the model could learn from by correcting them. We propose a two-stage iterative framework that alternates between online data/prompt augmentation and policy optimization. By converting the intermediate steps ("near-miss" answers) in the successful recovery trajectories into decoupled revision and verification prompts, our approach concentrates training on both effective answer transformation and error identification. This approach enables efficient off-policy data generation and reduces the computational overhead of long-horizon sampling compared to standard multi-turn RL. On LiveCodeBench, using publicly available test cases as feedback, we observe gains of +6.5 points over the RL baseline and +4.0 points over standard multi-turn training. Beyond coding, our approach matches the previously reported SOTA result on circle packing while using the smallest base model (4B) and far fewer rollouts than the much larger evolutionary search systems. Math results under ground-truth verification further confirm improved correction ability. It also generalizes to out-of-distribution constraint-satisfaction puzzles such as n_queens and mini_sudoku, where correctness is defined entirely by problem constraints. Code is available at https://github.com/yxliu02/REVES.git.

2606.18850 2026-06-18 cs.CL cs.IR New submission 75%

ScholarSum: Student-Teacher Abstractive Summarization via Knowledge Graph Reasoning and Reflective Refinement

Bohou Zhang, Xiaoyu Tao, Mingyue Cheng, Huijie Liu, Qi Liu

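Affiliations: State Key Laboratory of Cognitive Intelligence

Topic match (instruction tuning): scientific literature summarization with a student-teacher framework and knowledge graphs

AI summary: Proposes ScholarSum, a framework that builds a hierarchical knowledge graph to guide a student in generating an initial draft, then uses a teacher-like reviewer to iteratively check and revise it, yielding scientific summaries that are both fluent and factually consistent.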

Abstract

Abstractive summarization plays a crucial role in enabling efficient understanding of scientific literature, yet it inherently demands both linguistic fluency and factual faithfulness. Existing approaches often fail to reconcile these two requirements. Extractive methods rely on rigid sentence splicing that disrupts macro-level logical coherence, while large language model (LLM)-based generative approaches, despite mastering linguistic fluency, exhibit limited factual consistency. In this work, we propose ScholarSum, a hierarchical reflective graph-based framework that emulates a student-teacher writing process for fluent and faithful scientific summarization. ScholarSum first organizes the document into a hierarchical knowledge graph by segmenting it into semantically coherent units, whose multi-layered community structure captures global logic and macro-level themes. Guided by this global structure, the student generates an initial draft, which is subsequently refined through fine-grained evidence retrieval. To ensure factual consistency, a teacher-like reviewer then iteratively examines the draft, identifies unsupported content, and prompts targeted re-retrieval and rewriting until the summary meets rigorous quality standards. Extensive experiments demonstrate that ScholarSum significantly outperforms previous baselines in terms of both completeness and faithfulness. Our code is available at https://github.com/Xiaoyu-Tao/ScholarSum.

2606.18902 2026-06-18 cs.CL New submission 70%

SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration

Ziyi Zhu, Luka Smyth, Saki Shinoda, Jinghong Chen

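Affiliations: Slingshot AI; Department of Engineering, University of Cambridge

Topic match (instruction tuning): automatic prompt optimization is an LLM application

AI summary: Proposes SPO, a stochastic prompt optimization framework, in which the SAGE method performs black-box search through a multi-agent pipeline with diagnostic code execution; performance across several benchmarks depends on error type, and continuous optimization on a mental-health chatbot yields a statistically significant gain in next-day retention.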

Abstract

Context engineering has emerged as a primary lever for improving AI systems without parameter updates. Recent work showing that textual gradients do not function as real gradients motivates treating automatic prompt optimization (APO) as black-box search. We introduce SPO (Stochastic Prompt Optimization), a framework for stochastic search over prompt space, and compare three strategies of increasing sophistication: error-informed random search, a genetic algorithm with evolutionary operators, and SAGE (SPO via Agent-Guided Exploration), a multi-agent pipeline with diagnostic code execution. Across three benchmarks, no single strategy dominates; effectiveness depends on the interaction of landscape structure with error type. We further deploy SAGE on a mental-health chatbot under a continuous optimization paradigm, where it compounds eight cycles of individually-noisy A/B tests into a statistically robust gain in next-day retention. We argue that coupling qualitative diagnosis with quantitative validation is what makes agentic optimization effective for open-ended task-oriented dialogue.

1. Pre-training (5 papers)

BLADE: Scalable Bi-level Adaptive Data Selection for LLM Training

The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data

UPLOTS: A Unified Pretrained Language Model for Constrained Time-series Generation

Dual Dimensionality for Local and Global Attention

Dango: A Strictly L1-Only Large Language Model for Studying Second Language Acquisition

2. Domain-specific LLMs (6 papers)

Better Adherence, Richer Context: A Field Evaluation of LLM-Powered Conversational Voice Diaries for Sleep

G-IdiomAlign: A Gloss-Pivoted Benchmark for Cross-Lingual Idiom Alignment

Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering

ProfiLLM: Utility-Aligned Agentic User Profiling for Industrial Ride-Hailing Dispatch

Low-resource Language Discrimination Towards Chinese Dialects with Transfer learning and Data Augmentation

Teaching Software Engineering with LLM and MCP Integration: From Classroom to Industry Practice

3. Other LLM (15 papers)

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

From Sparse Features to Trustworthy Proxies: Certifying SAE-Based Interpretability

Latency Prediction for LLM Inference on NPU Systems

Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns

DN-Hypo-Pipeline: An AI-Driven Workflow for Hypothesis Generation via Large Language Models and Scientific Explanations

GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

Want Better Synthetic Data? Steer It: Activation Steering for Low-Resource Language Generation

Attribution-Guided and Coverage-Maximized Pruning for Structural MoE Compression

OmniPlan: An Adaptive Framework for Timely and Near-Optimal Network Planning Optimization

On the Memorization Behavior of LLMs in Generative Recommendation: Observations, Implications, and Training Strategies

Explaining Attention with Program Synthesis

Structured Inference with Large Language Gibbs

RECOM: A Validity Discrimination Tradeoff in Automatic Metrics for Open Ended Reddit Question Answering

User as Engram: Internalizing Per-User Memory as Local Parametric Edits

From Tokens to Energy Flexibility: Quantization-Enabled Demand Response for Data Centers with LLM Inference Workloads

4. Post-training (2 papers)

DiPOD: Diffusion Policy Optimization without Drifting Apart

REVES: REvision and VErification--Augmented Training for Test-Time Scaling

5. Instruction tuning (2 papers)

ScholarSum: Student-Teacher Abstractive Summarization via Knowledge Graph Reasoning and Reflective Refinement

SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration