Large Language Models / LLM - arXivDaily Topic

2506.09046 2026-06-18 cs.LG cs.AI cs.MA Updated version 80%

Self-Evolving Multi-Agent Systems via Textual Backpropagation

Xiaowen Ma, Yunpu Ma, Chenyang Lin, Sikuan Yan, Jinhe Bi, Zixuan Cao, Yijun Tian, Volker Tresp, Hinrich Schuetze

Institutions: Ludwig Maximilian University of Munich; Technical University of Munich; Munich Center for Machine Learning; University of Notre Dame

Topic match (Other LLM): builds a multi-agent neural-network framework from multiple LLMs.

AI summary: Proposes the Agentic Neural Network framework, which models multi-agent collaboration as a layered neural network. Tasks are decomposed in a forward phase and feedback is propagated in a backward phase so that agents self-evolve their roles, prompts, and coordination; the framework outperforms existing methods on seven benchmark datasets.

Abstract

Leveraging multiple Large Language Models (LLMs) has proven effective for addressing complex, high-dimensional tasks, but current approaches often rely on static, manually engineered multi-agent configurations. To overcome these constraints, we present the Agentic Neural Network (ANN), a framework that conceptualizes multi-agent collaboration as a layered neural network architecture. In this design, each agent operates as a node, and each layer forms a cooperative team focused on a specific subtask. Our framework follows a two-phase optimization strategy: (1) Forward Phase - Drawing inspiration from neural network forward passes, tasks are dynamically decomposed into subtasks, and cooperative agent teams with suitable aggregation methods are constructed layer by layer. (2) Backward Phase - Mirroring backpropagation, we refine both global and local collaboration through iterative feedback, allowing agents to self-evolve their roles, prompts, and coordination. This neuro-symbolic approach enables our framework to create new or specialized agent teams post-training, delivering notable gains in accuracy and adaptability. Across seven benchmark datasets, our work surpasses leading multi-agent baselines under the same configurations, showing consistent performance improvements.

2507.01414 2026-06-18 cs.LG Updated version 80%

Decomposing Prediction Mechanisms for In-Context Recall

Sultan Daniels, Dylan Davis, Dhruv Gautam, Wentinn Liao, Gireeja Ranade, Anant Sahai

Institutions: University of California, Berkeley; University of Pennsylvania

Topic match (Other LLM): analyzes the in-context learning mechanisms of transformers.

AI summary: By designing a new toy problem that combines continuous in-context learning with discrete associative recall, the paper finds that transformers handle in-context recall with two separate mechanisms that have different learning dynamics: one uses discrete symbolic labels for associative recall, and the other makes Bayesian-style predictions from the previous token and the context.

Comments: 45 pages, 47 figures, 2 tables

Abstract

We introduce a new family of toy problems that combine features of linear-regression-style continuous in-context learning (ICL) with discrete associative recall. We pretrain transformer models on sample traces from this toy, specifically symbolically-labeled interleaved state observations from randomly drawn linear deterministic dynamical systems. We study whether a transformer model can recall the state of a sequence previously seen in its context when prompted to do so with the corresponding in-context label. Taking a closer look at this task, it becomes clear that the model must perform two functions: (1) identify which system's state should be recalled and apply that system to its last seen state, and (2) continue to apply the correct system to predict the subsequent states. Training dynamics reveal that the first capability emerges well into a model's training. Surprisingly, the second capability, of continuing the prediction of a resumed sequence, develops much earlier. Via out-of-distribution experiments, and a mechanistic analysis on model weights via edge pruning, we find that next-token prediction for this toy problem involves at least two separate mechanisms. One mechanism uses the discrete symbolic labels to do the associative recall required to predict the start of a resumption of a previously seen sequence. The second mechanism, which is largely agnostic to the discrete symbolic labels, performs a "Bayesian-style" prediction based on the previous token and the context. These two mechanisms have different learning dynamics. To confirm that this multi-mechanism phenomenon (manifesting as separate phase transitions) is not just an artifact of our toy setting, we used OLMo training checkpoints on an ICL translation task to see a similar phenomenon: a decisive gap in the emergence of first-task-token performance vs second-task-token performance.
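
The kind of trace this abstract describes can be sketched as follows. This is only an illustration under stated assumptions: the 2-D rotation systems, the fixed segment length, the string labels, and the names `make_system`, `step`, and `interleaved_trace` are hypothetical choices, not details taken from the paper.

```python
import math
import random

def make_system(rng):
    """Draw a random 2-D rotation: a simple linear deterministic system x_{t+1} = A x_t."""
    theta = rng.uniform(0.0, 2.0 * math.pi)
    return [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]

def step(A, x):
    """Apply the 2x2 matrix A to the state x."""
    return [A[0][0] * x[0] + A[0][1] * x[1],
            A[1][0] * x[0] + A[1][1] * x[1]]

def interleaved_trace(num_systems=2, segments=4, seg_len=3, seed=0):
    """Interleave labeled segments; a resumed system continues from its last state."""
    rng = random.Random(seed)
    systems = {f"sys{i}": make_system(rng) for i in range(num_systems)}
    states = {name: [rng.gauss(0, 1), rng.gauss(0, 1)] for name in systems}
    trace = []
    for _ in range(segments):
        name = rng.choice(sorted(systems))
        trace.append(("label", name))           # discrete symbolic label
        for _ in range(seg_len):
            trace.append(("state", tuple(states[name])))
            states[name] = step(systems[name], states[name])
    return systems, trace
```

When a label reappears, the first state after it is the system matrix applied to that system's last emitted state, which is the recall step (1); every later state in the segment is the continuation step (2).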

2606.18829 2026-06-18 cs.LG cs.CL New submission 75%

GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

Zhe Ren, Yibo Yang, Yimeng Chen, Zijun Zhao, Benshuo Fu, Zhihao Shu, Bingjie Zhang, Yangyang Xu, Dandan Guo, Shuicheng Yan

Institutions: School of Artificial Intelligence, Jilin University; Shanghai Jiao Tong University; King Abdullah University of Science and Technology (KAUST); Tsinghua University; National University of Singapore

Topic match (Other LLM): evaluates memory governance for multi-principal shared-memory agents; involves LLM agents.

AI summary: Proposes the GateMem benchmark, which evaluates multi-principal shared-memory agents on three governance dimensions (utility, access control, and forgetting) and finds that no existing method satisfies all three at once.

Comments: 24 pages, 8 figures. Code and dataset are available at https://github.com/rzhub/GateMem and https://huggingface.co/datasets/Ray368/GateMem

Abstract

Memory benchmarks for LLM agents largely assume single-user settings, leaving shared assistants for hospitals, workplaces, campuses, and households understudied. In these deployments, multiple principals write to a common memory pool and query it under different roles, scopes, and relationships, so memory quality requires governance as well as recall. We introduce GateMem, a benchmark for multi-principal shared-memory agents. GateMem jointly evaluates utility for legitimate long-horizon requests with state updates, access control across contextual authorization boundaries, and agent-facing active forgetting after explicit deletion requests. It spans medical, office, education, and household domains, with long-form multi-party episodes, incremental memory injection, hidden checkpoints, structured judging, and leak-target annotations. Across diverse baselines and backbone models, no method simultaneously achieves strong utility, robust access control, and reliable forgetting. Long-context prompting often yields the best governance score at high token cost, while retrieval-based and external-memory methods reduce cost yet still leak unauthorized or deleted information. These results show current memory agents remain far from reliable shared institutional deployment.
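
The three properties named in this abstract (utility, access control, active forgetting) can be made concrete with a toy shared memory. This is a minimal sketch, not the GateMem benchmark or its API: the `MemoryItem` and `SharedMemory` classes, the role names, and keyword matching as the retrieval rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryItem:
    owner: str
    text: str
    allowed_roles: frozenset
    deleted: bool = False

@dataclass
class SharedMemory:
    items: list = field(default_factory=list)

    def write(self, owner, text, allowed_roles):
        """Any principal may write to the common pool, tagging who may read."""
        self.items.append(MemoryItem(owner, text, frozenset(allowed_roles)))

    def forget(self, owner, text):
        """Honor an explicit deletion request from the item's owner."""
        for item in self.items:
            if item.owner == owner and item.text == text:
                item.deleted = True

    def query(self, role, keyword):
        """Return only items that are authorized for `role` and not deleted."""
        return [item.text for item in self.items
                if not item.deleted
                and role in item.allowed_roles
                and keyword in item.text]
```

A legitimate query returning the right item corresponds to utility, an empty result for an unauthorized role to access control, and an empty result after `forget` to active forgetting.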

2606.18389 2026-06-18 cs.CL New submission 75%

Want Better Synthetic Data? Steer It: Activation Steering for Low-Resource Language Generation

Jan Cegin, Daniil Gurgurov, Yusser Al Ghussin, Simon Ostermann

Institutions: Kempelen Institute of Intelligent Technologies; German Research Center for Artificial Intelligence (DFKI)

Topic match (Other LLM): activation steering for synthetic data generation in low-resource languages.

AI summary: Proposes activation steering as an alternative approach to synthetic data generation for low-resource languages, covering Language Steering and Quality Steering; experiments show that steering early layers improves data diversity and often improves downstream model performance.

Comments: 25 pages

Abstract

Large language models (LLMs) have become an effective tool for synthetic data generation, including for low-resource languages, where generated data can improve downstream task performance. Current best-performing approaches typically rely on few-shot prompting with target-language examples, which increases inference costs and may reduce diversity through lexical anchoring. In this work, we investigate activation steering as an alternative for low-resource synthetic data generation. We study two steering strategies: Language Steering, which targets the linguistic identity of a language, and Quality Steering, which captures well-formedness by contrasting human-written and backtranslated text representations. We evaluate these methods across four open-source LLMs, multiple layers, and 11 typologically diverse languages by generating sentiment and topic classification data and finetuning smaller classifiers. Steering is applied in both zero-shot and few-shot prompting settings and compared against non-steered counterparts. Our results show that steering on early layers consistently improves the diversity of generated data while often yielding stronger downstream model performance, particularly for low-resource languages.
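
A common way to build a steering vector is to take the difference between mean activations of two contrasting sets of texts and add a scaled copy to a hidden state. The sketch below shows that generic recipe on plain lists; the abstract does not state how the paper constructs its vectors, so the difference-of-means rule, the scale `alpha`, and the function names are assumptions.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def steering_vector(positive_acts, negative_acts):
    """Difference of means, e.g. human-written minus backtranslated activations."""
    pos = mean_vector(positive_acts)
    neg = mean_vector(negative_acts)
    return [p - q for p, q in zip(pos, neg)]

def steer(hidden, vector, alpha=1.0):
    """Shift one hidden state along the steering direction by a factor alpha."""
    return [h + alpha * v for h, v in zip(hidden, vector)]
```

For Quality Steering as described above, the positive set would hold representations of human-written text and the negative set representations of backtranslated text, both taken from the same layer.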

2606.18304 2026-06-18 cs.LG cs.AI New submission 75%

Attribution-Guided and Coverage-Maximized Pruning for Structural MoE Compression

Yifu Ding, Jiacheng Wang, Ge Yang, Yongcheng Jing, Jinyang Guo, Xianglong Liu, Dacheng Tao

Institutions: School of Computer Science and Engineering, Beihang University; School of Artificial Intelligence, Beihang University; Nanyang Technological University

Topic match (Other LLM): structural pruning for MoE models, part of LLM compression and deployment.

AI summary: To address the coarse granularity of expert-level pruning in MoE models and its failure to identify fine-grained redundancy, proposes an attribution-guided, coverage-maximized structural pruning framework that recasts prune-ratio allocation as a channel-score coverage optimization problem. Combined with 4-bit quantization, it preserves accuracy at a 50% pruning ratio and reduces memory footprint by 5.27x.

Comments: 9 pages, 5 figures. Submitted to ICML 2026

Abstract

Mixture-of-Experts (MoE) models scale compute efficiently, yet remain expensive to deploy due to their substantial memory footprint and inference overhead. Prior compression methods mainly operate at the expert level, either removing entire experts or ranking experts by coarse-grained importance scores. However, such expert-wise decisions are often too coarse to capture fine-grained redundancy, leading to misallocated pruning budgets and limited compression. To address this problem, we observe that information within MoE experts is highly concentrated in a small subset of channels, leaving substantial redundancy even in experts deemed important. Based on this observation, we propose a structural pruning framework tailored for MoE models. Our method reformulates prune-ratio allocation as a channel-score coverage maximization problem and solves it efficiently using an attribution-based approximation. Experiments on DeepSeek and Qwen MoE models show that our method preserves model accuracy under 50% or 25% structured pruning when combined with 4-bit quantization. On Qwen3-30B-A3B, our approach reduces memory footprint by 5.27$\times$ and consistently outperforms state-of-the-art baselines across diverse benchmarks.
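
One simplified reading of "channel-score coverage maximization" is: given a score per channel and a global budget of channels to keep, choose the channels that cover the largest share of total score, and let the per-expert pruning ratio fall out of that choice. The sketch below implements only this simplified reading; the scores, the global top-k rule, and the names `allocate_keep_plan` and `coverage` are assumptions, and the paper's attribution-based approximation is not reproduced.

```python
import heapq

def allocate_keep_plan(channel_scores, keep_budget):
    """channel_scores maps expert name -> list of nonnegative channel scores.
    Keeps the `keep_budget` highest-scoring channels across all experts."""
    scored = [(score, expert, idx)
              for expert, scores in channel_scores.items()
              for idx, score in enumerate(scores)]
    kept = heapq.nlargest(keep_budget, scored)
    plan = {expert: [] for expert in channel_scores}
    for _, expert, idx in kept:
        plan[expert].append(idx)
    return {expert: sorted(indices) for expert, indices in plan.items()}

def coverage(channel_scores, plan):
    """Fraction of the total channel score retained by the plan."""
    total = sum(sum(scores) for scores in channel_scores.values())
    covered = sum(channel_scores[e][i] for e, idxs in plan.items() for i in idxs)
    return covered / total
```

With concentrated scores, an expert whose information sits in one channel keeps fewer channels than an expert whose scores are spread out, so the pruning ratio differs per expert instead of being uniform.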

2606.18105 2026-06-18 cs.NI cs.LG New submission 75%

OmniPlan: An Adaptive Framework for Timely and Near-Optimal Network Planning Optimization

Longlong Zhu, Jiashuo Yu, Zedi Chen, Yuhan Wu, Zhifan Jiang, Yuchen Xian, Yimeng Liu, Jiajie Su, Shaopeng Zhou, Xingyuan Li, Hongyan Liu, Xuan Liu, Dong Zhang, Chunming Wu, Xiang Chen

Institutions: Zhejiang University; Fuzhou University; Yangzhou University; The State Key Laboratory of Blockchain and Data Security; College of Computer Science and Technology

Topic match (Other LLM): uses an LLM to parse user intent for network planning.

AI summary: Proposes OmniPlan, an adaptive framework that uses a large language model to parse user intent and a mixture-of-experts architecture to select dynamically among MIP solvers, heuristics, and deep reinforcement learning models, achieving both timeliness and near-optimality in network planning optimization. On distributed machine learning inference offloading, it reduces latency by up to 97.8% and resource consumption by up to 11.5%.

Comments: Accepted by ACM KDD 2026

Abstract

Network planning optimization is a fundamental problem across diverse domains, including transportation systems, communication networks, and power grids. It requires simultaneous optimization of multiple competing objectives under complex constraints. Existing network planning optimization frameworks rely on mixed integer programming (MIP) solvers, heuristics, and deep reinforcement learning (DRL) models to compute planning decisions. However, they lack effective adaptability to diverse and dynamic user intents, thus leading to the trade-off between execution time and optimality. In this paper, we propose OmniPlan, an adaptive framework that achieves both timeliness and near-optimality in network planning optimization. To achieve the adaptability lacking in existing solutions, OmniPlan employs a large language model (LLM)-based interpreter to convert heterogeneous natural-language intents into a unified and quantifiable user-preference vector. Then it employs a mixture-of-experts architecture that integrates MIP solvers, heuristics, and DRL models as specialized experts, where OmniPlan adapts to diverse intents by dynamically selecting timely and near-optimal experts. Finally, it incorporates a DRL-based expert configuration module that fine-tunes optimization objective weights to align planning decisions with user-specific preferences. We evaluate OmniPlan with a representative real-world workload, i.e., distributed machine learning (ML), where we leverage OmniPlan to offload a wide spectrum of ML inference tasks, e.g., decision trees, SVM, naive Bayes, XGBoost, and random forests, onto a network of hardware devices. Our experiments on a real-world testbed indicate that OmniPlan achieves near-optimal and low-execution-time offloading for real-world ML inference tasks, reducing latency by up to 97.8\% and network device resource consumption by up to 11.5\%.

2606.17276 2026-06-18 cs.IR cs.LG New submission 75%

On the Memorization Behavior of LLMs in Generative Recommendation: Observations, Implications, and Training Strategies

Sunwoo Kim, Sunkyung Lee, Clark Mingxuan Ju, Donald Loveland, Bhuvesh Kumar, Kijung Shin, Neil Shah, Liam Collins

Institutions: KAIST (Korea Advanced Institute of Science and Technology); Sungkyunkwan University; Snap Inc.

Topic match (Other LLM): studies the memorization behavior of LLMs in generative recommendation.

AI summary: Studies the memorization tendency of LLMs in generative recommendation, finds that they over-rely on one-hop memorization, and proposes the IIRG training strategy, which teaches multi-hop collaborative relations and semantic relations and significantly improves recommendations for users not covered by one-hop memorization.

Abstract

Generative recommendation (GR) has emerged as a promising direction for recommender systems. Recently, large language models (LLMs) have been increasingly adopted for GR, as their rich pretrained knowledge is expected to help them generalize beyond common user behavior patterns that traditional memorization-oriented baselines can capture. However, existing LLM-based GR works largely ignore LLMs' well-known tendency to memorize, which, if present in LLMs fine-tuned for GR, would restrict their utilization of pretrained knowledge. In this work, we investigate this concern by examining one-hop memorization, where a model recommends items that are direct successors of items in the training data. We show that LLMs do this more than non-LLM-based GR models-in fact, the vast majority of their gains over GR baselines are actually on users whose target items can be predicted through one-hop memorization. We intuit that improving performance on the remaining users requires LLMs to learn richer item-item relations beyond one-hop transitions. To achieve this, we propose IIRG, a novel training strategy that teaches LLMs to capture: (1) collaborative relations derived from item co-occurrences across multiple hops in user sequences, and (2) semantic relations among items with similar themes, both of which can serve as useful recommendation signals. We show that IIRG significantly improves over LLMs trained solely with standard next-item prediction, with especially large gains for users whose test items are not covered by train-time one-hop transitions.
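
The notion of one-hop memorization used in this abstract can be checked mechanically. The sketch below assumes that a target counts as one-hop predictable when it directly followed the user's most recent item somewhere in the training sequences; the exact criterion in the paper may differ, and both function names are hypothetical.

```python
def one_hop_transitions(train_sequences):
    """Collect every (item, next_item) pair observed in the training sequences."""
    pairs = set()
    for seq in train_sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs

def is_one_hop_predictable(history, target, transitions):
    """True if `target` directly followed the user's last item in training data."""
    if not history:
        return False
    return (history[-1], target) in transitions
```

Splitting test users by this flag separates the group where, according to the abstract, most LLM gains occur from the remaining users that IIRG targets.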

2601.21626 2026-06-18 cs.LG cs.AI Updated version 75%

HeRo-Q: A General Framework for Stable Low Bit Quantization via Hessian Conditioning

Jinhao Zhang, Yunquan Zhang, Zicheng Yan, Boyang Zhang, Jun Sun, Daning Cheng

Institutions: Beijing University of Posts and Telecommunications; Institute of Computing Technology, Chinese Academy of Sciences; University of Science and Technology of China; Zhejiang Lab; Peng Cheng Laboratory

Topic match (Other LLM): proposes the HeRo-Q algorithm for low-bit quantization of LLMs.

AI summary: To address the paradoxical "low error, high loss" phenomenon in post-training quantization, proposes the HeRo-Q algorithm, which reshapes the loss landscape with a lightweight, learnable rotation-compression matrix, lowers the largest Hessian eigenvalue, and improves robustness to quantization noise; it outperforms existing methods on Llama and Qwen models.

Abstract

Post-Training Quantization (PTQ), a mainstream model compression technique, often leads to the paradoxical 'low error, high loss' phenomenon because it focuses solely on minimizing quantization error. The root cause lies in the Hessian matrix of the LLM loss landscape: a few high-curvature directions are extremely sensitive to perturbations. To address this, we propose the Hessian Robust Quantization (HeRo-Q) algorithm, which applies a lightweight, learnable rotation-compression matrix to the weight space prior to quantization. This joint framework reshapes the loss landscape by reducing the largest Hessian eigenvalue, thereby significantly enhancing robustness to quantization noise. HeRo-Q requires no architectural modifications, incurs negligible computational overhead, and integrates seamlessly into existing PTQ pipelines. Experiments on Llama and Qwen models show that HeRo-Q consistently outperforms state-of-the-art methods including GPTQ, AWQ, and SpinQuant, not only achieving superior performance under standard W4A8 settings but also excelling in the highly challenging W3A16 ultra-low-bit regime, where it boosts GSM8K accuracy on Llama3 8B to 70.15% and effectively avoids the logical collapse commonly seen in aggressive quantization.
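
The "low error, high loss" effect follows from the standard second-order Taylor estimate of the loss change near a minimum, 0.5 * delta^T H delta, where delta is the weight perturbation and H the Hessian. The toy below uses a made-up 2x2 Hessian with one high-curvature direction; the numbers and function names are illustrative assumptions, not values from the paper.

```python
def quadratic_loss_increase(hessian, delta):
    """Second-order estimate 0.5 * delta^T H delta of the loss change."""
    n = len(delta)
    return 0.5 * sum(delta[i] * hessian[i][j] * delta[j]
                     for i in range(n) for j in range(n))

def l2_norm(vector):
    """Euclidean norm, standing in for the size of the quantization error."""
    return sum(x * x for x in vector) ** 0.5

# One sharp direction (curvature 100) and one flat direction (curvature 0.01).
H = [[100.0, 0.0], [0.0, 0.01]]
small_error = [0.1, 0.0]   # smaller perturbation, but along the sharp direction
large_error = [0.0, 0.2]   # larger perturbation, but along the flat direction
```

Here `small_error` has half the norm of `large_error` yet a far larger estimated loss increase, which is why lowering the largest eigenvalue, as the abstract describes, makes the loss less sensitive to quantization noise.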

2606.19317 2026-06-18 cs.LG cs.AI New submission 70%

Explaining Attention with Program Synthesis

Amiri Hayes, Belinda Li, Jacob Andreas

Institutions: NJIT (New Jersey Institute of Technology); MIT (Massachusetts Institute of Technology); MIT CSAIL (MIT Computer Science and Artificial Intelligence Laboratory)

Topic match (Other LLM): explains attention heads with program synthesis.

AI summary: Proposes a method for approximating the behavior of deep-network components with executable programs. For transformer attention heads, it generates Python programs that reproduce attention patterns, making the heads interpretable.

Abstract

A longstanding goal of research on interpretable deep learning is to replace opaque neural computations with human-meaningful symbolic descriptions. In this paper, we propose an approach for approximating the behavior of components of deep networks with executable programs. We focus on attention heads in transformer language models. For a given head, we first compute its associated attention matrices on a collection of randomly selected training examples. Next, we prompt a pre-trained language model with a summary of these matrices, and instruct it to generate a set of Python programs that can reproduce the associated attention patterns given only text from the input sentence. Finally, we re-rank programs according to how well our final set of programs predict behavior on held-out inputs. We demonstrate that a set of fewer than 1,000 such generated programs can reproduce the attention patterns of heads in GPT-2, TinyLlama-1.1B, and Llama-3B, achieving an average Intersection-over-Union similarity above 75% on TinyStories. Moreover, the best-fit programs can replace neural attention heads without substantially affecting model behavior: replacing 25% of attention heads with programmatic surrogates across the three models incurs only a 16% average perplexity increase, while maintaining performance on a variety of downstream question answering benchmarks. This work contributes a scalable pipeline for reverse-engineering attention heads in transformer models using human-readable, executable code, advancing a path toward symbolic transparency in neural models.
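
To make the idea of a program that reproduces an attention pattern concrete, the sketch below pairs a hand-written "previous token" program with an Intersection-over-Union score. The program, the 0.5 threshold used to binarize attention weights, and the set-of-pairs representation are assumptions for illustration; the abstract does not specify how the paper computes IoU.

```python
def previous_token_program(tokens):
    """Hypothetical generated program: each token attends to the one before it."""
    return {(i, i - 1) for i in range(1, len(tokens))}

def binarize(attention, threshold=0.5):
    """Turn an attention matrix into a set of (query, key) index pairs."""
    return {(i, j)
            for i, row in enumerate(attention)
            for j, weight in enumerate(row)
            if weight >= threshold}

def iou(predicted, observed):
    """Intersection-over-Union of two sets of (query, key) pairs."""
    union = predicted | observed
    return len(predicted & observed) / len(union) if union else 1.0
```

A program like this sees only the input tokens, never the model's activations, which matches the constraint stated in the abstract that programs work from the text of the input sentence alone.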

2606.19264 2026-06-18 cs.LG cs.CL New submission 70%

Structured Inference with Large Language Gibbs

Sanghyeok Choi, Henry Gouk, Esmeralda S. Whitammer

Institutions: University of Edinburgh, School of Informatics

Topic match (Other LLM): uses LLM conditional distributions for structured probabilistic inference.

AI summary: Proposes Large Language Gibbs, which uses an LLM's conditional distributions as transition operators for structured probabilistic inference. Iteratively resampling variables avoids order-dependent bias; the method is validated on synthetic distributions, consistent reasoning, and Bayesian structure learning.

Comments: Code: https://github.com/hyeok9855/large-language-gibbs

Abstract

The knowledge encoded in large language models (LLMs) can serve as a substrate for structured reasoning over variables describing a complex world, but accessing this knowledge in a probabilistically coherent manner poses a difficult inference problem. We propose Large Language Gibbs, a scheme for structured probabilistic inference that uses conditional distributions of an LLM as transition operators. Rather than sampling structured objects through single-pass autoregressive generation, we iteratively resample individual variables conditioned on others using an LLM's next-token conditionals. This approach avoids order-dependent biases and produces a stationary distribution that reflects a compromise between all local conditionals. We apply this approach to sampling from synthetic distributions, consistent reasoning tasks, and Bayesian structure learning. The results suggest that the use of LLM conditionals in MCMC is a practical alternative to one-pass generation for structured probabilistic inference under a world prior accessible through noisy LLM conditionals.
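
The resampling loop described in this abstract has the shape of an ordinary Gibbs sampler in which the conditional comes from a model. In the sketch below a hand-written table, `toy_conditional`, stands in for the LLM's next-token conditional; that stub, the two boolean variables, and the systematic sweep order are assumptions and do not reflect the released code.

```python
import random

def gibbs_sample(conditional, variables, init, sweeps, seed=0):
    """conditional(name, state) returns {value: probability} for one variable
    given the current values of the others; here it stands in for an LLM query."""
    rng = random.Random(seed)
    state = dict(init)
    samples = []
    for _ in range(sweeps):
        for name in variables:
            probs = conditional(name, state)
            values = list(probs)
            weights = [probs[v] for v in values]
            state[name] = rng.choices(values, weights=weights)[0]
        samples.append(dict(state))
    return samples

def toy_conditional(name, state):
    """Hypothetical stand-in conditional: 'rain' and 'wet' tend to agree."""
    other = state["wet"] if name == "rain" else state["rain"]
    return {other: 0.9, (not other): 0.1}
```

Because each variable is resampled given all the others, no single generation order is privileged, which is the property the abstract contrasts with single-pass autoregressive generation.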

2606.19218 2026-06-18 cs.CL New submission 70%

RECOM: A Validity Discrimination Tradeoff in Automatic Metrics for Open Ended Reddit Question Answering

Pushwitha Krishnappa, Amit Das, Vinija Jain, Aman Chadha, Tathagata Mukherjee

Institutions: University of Alabama Huntsville; University of North Alabama; Stanford University; Meta AI; Amazon GenAI

Topic match (Other LLM): automatic metrics for evaluating LLM-generated text, an LLM application.

AI summary: Introduces the RECOM dataset and finds that automatic metrics for open-ended question answering cannot deliver validity and discriminative power at the same time: cosine similarity has high validity but poor discrimination, while BERTScore's apparent discrimination is driven by response length and its validity is weak.

Abstract

Automatic metrics are the default for evaluating LLM-generated text, yet a metric is quietly asked to do two jobs: tell genuine content alignment from surface coincidence (validity), and tell a better system from a worse one (discriminative power). On open-ended, opinion-driven question answering, the two are in tension. We introduce RECOM (Reddit Evaluation for Correspondence of Models), a contamination-free evaluation dataset of 15,000 r/AskReddit questions (September 2025), each paired with its authentic community replies, which postdate every evaluated model's training cutoff. Scoring five open-source LLMs (7-10B) against every reply, with each metric paired with a random-derangement noise floor, we find that no metric does both jobs well. Cosine similarity separates real from random answers (Cohen's $d \approx 2$) but cannot rank the five models ($|d| < 0.1$); BERTScore precision appears to rank the models (raw $|d|$ up to 0.63), but once response length is controlled this collapses to $|d| = 0.09$ and its validity is weak ($d \approx 0.8$, versus cosine's $\approx 2$). Because every metric scores the same outputs, this validity-discrimination tradeoff is a property of the metrics, not the models, and we argue it stems from representation design. Three independent LLM judges reproduce the validity gap and likewise separate the five models only weakly. We recommend reporting metrics on both axes, with an explicit random-baseline floor. RECOM is publicly available at https://anonymous.4open.science/r/recom-D4B0
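
The two statistical tools named in this abstract, Cohen's d and a random derangement used as a noise floor, can be sketched with the standard library. The pooled-standard-deviation form of Cohen's d and rejection sampling for the derangement are common choices assumed here; the abstract does not say which variants the paper uses.

```python
import random
import statistics

def cohens_d(a, b):
    """Cohen's d between two score samples, using a pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5

def derangement(n, rng):
    """Random permutation of range(n) with no fixed points (requires n >= 2)."""
    if n < 2:
        raise ValueError("a derangement needs at least two elements")
    while True:
        perm = list(range(n))
        rng.shuffle(perm)
        if all(i != p for i, p in enumerate(perm)):
            return perm
```

Validity would then be the d between metric scores on true question-reply pairs and on pairs re-matched through `derangement`, while discriminative power would be the d between the scores of two different models.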

2606.19172 2026-06-18 cs.AI New submission 70%

User as Engram: Internalizing Per-User Memory as Local Parametric Edits

Bojie Li

Institutions: Pine AI

Topic match (Other LLM): internalizes user memory as parametric edits, a form of LLM personalization.

AI summary: Proposes User as Engram, which stores user facts as local edits to the hash-keyed memory table of an Engram model while the reasoning skill is carried in one shared adapter, achieving accurate indirect reasoning with a very small memory footprint.

Abstract

Personal memory in a language model is two problems: content and reasoning skill. The brain keeps the two apart (a sparse, local engram in the hippocampus for each episode, a slow neocortex for the shared skills that interpret it), so a new fact need not overwrite everything else. Most personalization today keeps a user's facts outside the weights, in a natural-language memory file or a retrieval index. When facts are written into the model instead, the standard recipe is the per-user LoRA adapter, which does the opposite of the brain, folding content and skill into one global weight delta. Writing a user's facts as a LoRA contaminates text unrelated to them; writing the same facts as local Engram rows leaves it mathematically untouched, resulting in a roughly 33,000x smaller memory footprint. We therefore propose User as Engram: store a user's content as surgical edits to the hash-keyed memory table of an Engram model, and carry the reasoning skill in one shared adapter. This layered design matches per-user LoRA's direct recall while delivering 5.6x higher indirect-reasoning accuracy on average, and never makes a single user worse at reasoning than the untouched base. The edit is a glass box: writing a fact switches on its lookup at exactly the trigger, adds the value the answer needs, leaves every other position unchanged to the last bit, and fails if written into the wrong layer. Because different users' facts land in disjoint hash slots, their edits compose: many users live in one shared table at once, stacking additively and losslessly, where a per-user LoRA, a single global weight delta, admits only one. Upon retrieval, a per-user Engram table does not grow with the population the retriever must search, so past ~100 facts it overtakes a retrieval pipeline on a 2.5x larger model.

2606.18851 2026-06-18 eess.SY cs.SY New submission 70%

From Tokens to Energy Flexibility: Quantization-Enabled Demand Response for Data Centers with LLM Inference Workloads

Bojun Du, Xiaoyi Fan, Ershun Du, Long Chen, Jianpei Han, Qingchun Hou, Ning Zhang, Chongqing Kang

Topic match (Other LLM): demand response for LLM inference data centers, managed through quantization.

AI summary: Proposes a quantization-enabled energy management framework that builds a quantization-to-power model and a two-stage demand response model and co-optimizes across multiple campuses, reducing total data-center operating cost by 34.3%.

Comments: 10 pages, 7 figures

Abstract

The rapid growth of large language model (LLM) inference is creating significant data-center loads that face increasing energy-management challenges under tightening grid conditions and demand response (DR) requirements. Conventional data-center energy management mainly relies on temporal and spatial workload shifting and campus-level energy asset scheduling, but it usually treats LLM inference demand as an aggregate load. As a result, these approaches fail to exploit the internal characteristics of LLM serving and therefore overlook the flexibility offered by LLM-specific techniques such as model quantization. To unlock this flexibility, this paper proposes a quantization-enabled energy management framework for grid-responsive LLM inference data centers. First, a quantization-to-power model is established to map each model--quantization configuration to a compact set of dispatchable parameters. Second, a two-stage quantization-enabled DR model is developed to account for model instance switching, request routing, and precision selection. Third, a multi-campus co-optimization method is introduced for DR participation by integrating grid-side electricity and carbon signals with the quantization-enabled DR model. Case studies show that the proposed framework reduces total data-center operating cost by 34.3\% without curtailing served token volume, validating model quantization as an effective flexibility lever for grid-responsive LLM data-center energy management.


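2606.18832 2026-06-18 cs.LG cs.AI New submission 70%

Target-confidence Recourse Using tSeTlin machines: TRUST

K. Darshana Abeyrathna, Sara El Mekkaoui, Nils Enric Canut Taugbøl, Anuja Vats

Affiliations: Group Research and Development Det Norske Veritas (DNV)

Topic match (Other LLM): Proposes the TRUST framework, which uses a Probabilistic Tsetlin Machine to generate counterfactual explanations; tagged as an LLM application.

AI summary: Proposes the TRUST framework, which uses a Probabilistic Tsetlin Machine and Bayesian optimization to search directly for the minimal input change that meets a user-specified confidence target, yielding more robust and interpretable counterfactual explanations.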


Abstract

Counterfactual explanations are widely used to provide algorithmic recourse in high-stakes decision-making systems. Most existing methods seek the smallest change to an input that flips a model's decision. However, decision-makers often rely not only on predicted labels but also on confidence thresholds and risk margins. Counterfactuals that barely cross a decision boundary can be fragile and unstable under noise or model variation. In this paper, we propose Target-confidence Recourse Using tSeTlin machines (TRUST), a framework in which users explicitly specify the desired prediction confidence for recourse. Rather than generating counterfactuals and evaluating confidence afterward, TRUST directly searches for minimal changes that satisfy a user-defined confidence target, enabling comparison of recourse options in terms of cost, confidence, and robustness. We instantiate TRUST using a Probabilistic Tsetlin Machine (PTM) combined with Bayesian optimization. The probabilistic clause-based structure of PTM links prediction confidence to the stability of decision rules. We show that counterfactuals satisfying the same rules can still differ substantially in reliability depending on how securely they satisfy those rules, revealing whether decisions are supported by robust or fragile clause activations. Experiments on synthetic and real-world datasets demonstrate that target-confidence counterfactuals produce more robust and interpretable recourse than conventional boundary-based approaches. Across multiple benchmarks, TRUST achieves perfect robustness while maintaining low recourse cost, including an L2 distance of 0.10 on the Haberman dataset at 0.92 confidence. By explicitly controlling confidence and exposing rule-level stability, TRUST provides actionable recourse for high-stakes decision support.


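2606.18795 2026-06-18 cs.SI New submission 70%

Opinion Polarization in LLM-Based Social Networks: Manipulation and Mitigation

Ali Safarpoor Dehkordi, Mohammad Shirzadi, Ahad N. Zehmakan

Topic match (Other LLM): Studies opinion polarization in LLM-based social networks.

AI summary: Studies how an adversary with a limited budget can manipulate opinion polarization in a social network simulated with large language models, and evaluates two defense mechanisms (reactive and proactive); neither fully restores the baseline polarization state.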

Comments 14 pages, 7 figures


Abstract

How vulnerable are online social networks to adversaries who seek to amplify opinion polarization by manipulating opinions, and how difficult is it to mitigate such manipulation? Existing studies have examined this question using mathematical models of opinion dynamics. While these models offer valuable theoretical insights, they rely on simplified assumptions about interactions, message content, and opinion updates, limiting the adversarial strategies they can capture and the applicability of their findings to real-world settings. Large language model (LLM)-based simulations provide a richer alternative: agents can be assigned diverse personas, communicate through natural language, and respond to persuasive or adversarial content in a context-dependent way. This enables the study of manipulation strategies that are difficult to represent using classical mathematical models. To the best of our knowledge, this study provides the first systematic analysis of polarization amplification and mitigation in an LLM-based simulated social network framework. In our framework, LLM agents with diverse personas interact over a social network by exchanging natural language posts and updating their opinions accordingly. We show that even an adversary with a limited manipulation budget can considerably increase polarization. We then study two classes of defense mechanisms: reactive mitigations, which assign specific users to actively counter manipulation, and proactive interventions, which increase resistance through general mechanisms not tied to particular users. Our results show that although these mechanisms reduce the impact of adversarial attacks, they generally do not restore the network to its baseline polarization state. These findings suggest that neither approach fully overcomes the vulnerability of the network, highlighting the potential risk of such attacks.


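2606.18726 2026-06-18 cs.LG cs.AI New submission 70%

Graph Grounded Cross Attention Transformer Neural Network for Structurally Constrained Full Event Sequence Generation in Predictive Process Monitoring

Fang Wang, Ernesto Damiani

Affiliations: Department of Computer Science, University of Milan

Topic match (Other LLM): Predictive process monitoring; graph grounded cross attention Transformer.

AI summary: Proposes the Graph Grounded Cross Attention Transformer (GGATN), which uses a global process graph as structured memory, encodes sequence positions with Transformer self-attention, injects process topology through graph grounded cross attention, and applies Viterbi-style graph-constrained decoding to generate complete event sequences in a single pass, outperforming LLM baselines on six benchmark logs.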

Comments 40 pages


Abstract

Structurally constrained event sequence generation remains challenging because generated paths must preserve transition feasibility, temporal order, termination, and attribute consistency. In predictive process monitoring (PPM), this challenge appears as full event sequence generation, whereas existing work mainly addresses component tasks such as next activity, remaining time, outcome, and attribute prediction. This paper proposes the Graph Grounded Cross Attention Transformer Neural Network (GGATN) for this unified PPM task. GGATN uses a global process graph as structured activity memory, contextualizes sequence positions through Transformer self attention, and injects process topology through graph grounded cross attention. Unlike autoregressive decoding, GGATN generates activities, timestamps, length, and event level and sequence level attributes in a single pass, followed by Viterbi style graph constrained decoding for feasible paths and explicit termination. Experiments on six benchmark event logs show more reliable generation quality than local instruction prompted LLM baselines. GGATN achieves strong performance on sequence similarity, Damerau Levenshtein similarity, bigram based control flow similarity, and duration distribution, while maintaining zero hallucinated activities and zero sequence level attribute inconsistency. Ablation analyses confirm the global graph encoder as a stable structural prior. Interpretability analyses show how graph structure, sequence context, feedback refinement, and constrained decoding shape generation.
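The abstract mentions Viterbi style graph constrained decoding but gives no details, so the following is only a minimal sketch of the general idea, not the GGATN implementation. It assumes per-position log-scores for each activity (the hypothetical `scores` argument), a set of allowed transitions taken from a process graph (`allowed`), and a single designated end activity (`end`); all names are illustrative.

```python
import math

def constrained_viterbi(scores, allowed, end):
    """Return the highest-scoring activity path that only uses allowed
    transitions and finishes on `end`, or None if no such path exists.

    scores:  list of dicts, scores[t][a] = log-score of activity a at step t
    allowed: set of (a, b) pairs, the feasible transitions a -> b
    end:     activity that must occupy the last position (explicit termination)
    """
    acts = list(scores[0])
    best = {a: scores[0][a] for a in acts}
    back = []  # back[t-1][b] = best predecessor of b at step t
    for t in range(1, len(scores)):
        new_best, ptr = {}, {}
        for b in acts:
            # Only predecessors that are reachable and linked to b by an edge.
            cands = [(best[a], a) for a in acts
                     if (a, b) in allowed and best[a] > -math.inf]
            if cands:
                s, a = max(cands)
                new_best[b], ptr[b] = s + scores[t][b], a
            else:
                new_best[b], ptr[b] = -math.inf, None
        best = new_best
        back.append(ptr)
    if best[end] == -math.inf:
        return None
    path = [end]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Toy example: the unconstrained argmax would pick A, C, C, but the
# transition A -> C is not in the graph, so decoding returns A, B, C.
toy_scores = [
    {"A": 0.0, "B": -5.0, "C": -5.0},
    {"A": -1.0, "B": -2.0, "C": -0.5},
    {"A": -5.0, "B": -5.0, "C": 0.0},
]
toy_allowed = {("A", "B"), ("B", "C")}
print(constrained_viterbi(toy_scores, toy_allowed, "C"))
```

Masking infeasible transitions inside the dynamic program, rather than filtering sampled sequences afterwards, is what guarantees that every returned path is feasible in the graph.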


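2606.18717 2026-06-18 cs.CL cs.AI New submission 70%

Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish

Tolga Şakar

Affiliations: Independent Researcher

Topic match (Other LLM): Morphology-aware tokenizer and word embeddings for Turkish.

AI summary: Targeting the agglutinative nature of Turkish, proposes Morpheus, a neural morpheme-boundary model that provides lossless, reversible tokenization and structured word embeddings; among reversible tokenizers it attains the lowest bits-per-character (1.425), raises morpheme-alignment F1 to 0.61, and uses about 19% less GPU memory.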


Abstract

Turkish is agglutinative: meaning is carried by morphemes, yet the subword tokenizers that drive modern language models split words by corpus statistics, fragmenting semantically loaded suffixes and -- in the case of WordPiece and rule-based analyzers -- failing to decode their output back to the original text. This paper presents Morpheus, a neural morpheme-boundary model for Turkish that is at once a lossless, morphology-aware tokenizer and a word-embedding producer. A differentiable Poisson-binomial dynamic program turns per-character boundary probabilities into soft morpheme memberships during training and exact segments at inference, with no string normalization, so decode(encode(w)) = w holds by construction. Because the model is neural, the same forward pass that tokenizes also emits a structured word embedding. Among reversible tokenizers -- the only ones valid for generation -- Morpheus attains the lowest bits-per-character (1.425), roughly doubles the gold morphological alignment of the subword family (MorphScore macro-F1 0.61 vs. ~0.32), and uses ~19% less GPU memory than 64K-vocabulary subword tokenizers. As an embedder, frozen Morpheus vectors lead on lexical retrieval (root-family MAP 0.85) and same-root verification (ROC-AUC 1.00), surpassing the multilingual retriever BGE-M3 and BERTurk; on context- and inflection-dependent tasks (NER, case/number probing) the heavier contextual encoders remain ahead -- a trade-off we attribute to Morpheus's root-centric geometry. Code: https://github.com/lonewolf-rd/TurkishMorpheus; model: https://huggingface.co/lonewolflab/Morpheus-TR-50K; interactive demo: https://huggingface.co/spaces/lonewolflab/morpheus-tr-demo.
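To make the Poisson-binomial idea concrete, here is a small sketch under stated assumptions; it is one plausible reading of the abstract, not the Morpheus code. It assumes `boundary_probs[i]` is the probability of a morpheme boundary between character `i` and character `i + 1`, treats boundaries as independent, and defines the soft membership of a character in segment `k` as the probability that exactly `k` boundaries precede it. The hard segmentation by a 0.5 threshold is likewise an assumption made for illustration, and the word is assumed to be non-empty.

```python
def soft_memberships(boundary_probs):
    """rows[i][k] = P(character i lies in segment k), computed with the
    standard Poisson-binomial recurrence over independent boundary events."""
    n = len(boundary_probs) + 1          # number of characters
    dist = [1.0] + [0.0] * (n - 1)       # before any boundary: segment 0
    rows = [dist[:]]
    for p in boundary_probs:
        nxt = [0.0] * n
        for k in range(n):
            nxt[k] = dist[k] * (1.0 - p)     # no boundary here
            if k > 0:
                nxt[k] += dist[k - 1] * p    # boundary here: segment index + 1
        dist = nxt
        rows.append(dist[:])
    return rows

def hard_segments(word, boundary_probs, threshold=0.5):
    """Split a non-empty word at positions whose probability exceeds the
    threshold. Characters are never altered, so joining the segments
    returns the original word."""
    segments, current = [], word[0]
    for ch, p in zip(word[1:], boundary_probs):
        if p > threshold:
            segments.append(current)
            current = ch
        else:
            current += ch
    segments.append(current)
    return segments

# Toy example with made-up probabilities for "evler" (ev + ler).
probs = [0.1, 0.9, 0.1, 0.1]
segments = hard_segments("evler", probs)
print(segments, "".join(segments) == "evler")
```

The recurrence uses only products and sums of the probabilities, so the same computation written in an autodiff framework would be differentiable with respect to them, which is presumably what makes it usable during training.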


2606.18709 2026-06-18 cs.CL New submission 70%

LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment

Han Chen, Ming Li, Chenguang Wang, Yijun Liang, Dawei Zhou, Hong Jiao, Tianyi Zhou

Affiliations: MBZUAI; University of Maryland; Virginia Tech

Topic match (Other LLM): Evaluates the ability of LLMs to predict item discrimination.

AI summary: Evaluates 42 LLMs in zero-shot settings on predicting item discrimination; direct prediction correlates weakly with human-calibrated discrimination (best Spearman 0.152), and response-based CTT calibration reaches only a limited correlation (0.241), indicating that LLMs cannot yet reliably capture item discrimination.


Abstract

Item discrimination is a fundamental psychometric property of educational assessment, which measures whether an item meaningfully distinguishes students with higher proficiency from students with lower proficiency. While various existing works have explored whether large language models (LLMs) can estimate item difficulty, it remains unclear whether they can capture item discrimination. In this work, we evaluate 42 proprietary and open-weight LLMs in zero-shot settings using two complementary approaches: direct discrimination prediction, where models explicitly estimate an item's discrimination value from its content, and response-based Classical Test Theory (CTT) calibration, where LLM answers are treated as synthetic student responses to compute discrimination scores. Our results show that direct prediction yields weak alignment with human-calibrated discrimination: the best-performing model reaches only a Spearman correlation of 0.152. Response-based CTT calibration provides a stronger but still limited signal, with the all-persona synthetic respondent pool reaching a Spearman correlation of 0.241. These findings highlight item discrimination as an open challenge for LLM-based psychometric evaluation: current LLMs contain non-random discrimination-relevant signal, but they do not yet reliably capture how assessment items distinguish human students.
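For readers unfamiliar with Classical Test Theory, the sketch below shows one common way to compute a discrimination score from a response matrix and to compare two score lists with Spearman correlation. The abstract does not say which CTT statistic the authors use, so the corrected item-total correlation here is an assumption, as are the function names and the toy data. Each item and each rest score is assumed to have non-zero variance.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def item_discrimination(responses):
    """responses[s][i] is 1 if respondent s answered item i correctly, else 0.
    Returns, per item, the correlation between the item score and the total
    score on all other items (corrected item-total correlation)."""
    n_items = len(responses[0])
    result = []
    for i in range(n_items):
        item = [r[i] for r in responses]
        rest = [sum(r) - r[i] for r in responses]
        result.append(pearson(item, rest))
    return result

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda j: values[j])
    out = [0.0] * len(values)
    j = 0
    while j < len(order):
        k = j
        while k + 1 < len(order) and values[order[k + 1]] == values[order[j]]:
            k += 1
        for m in range(j, k + 1):
            out[order[m]] = (j + k) / 2 + 1
        j = k + 1
    return out

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy response matrix: 4 synthetic respondents, 3 items.
toy = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(item_discrimination(toy))
```

In the response-based setting described in the abstract, the rows would be LLM answers treated as synthetic students, and `spearman` would compare the resulting scores against human-calibrated discrimination values.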


2606.18620 2026-06-18 cs.CL cs.AI New submission 70%

BCL: Bayesian In-Context Learning Framework for Information Extraction

Haoliang Liu, Chengkun Cai, Xu Zhao, Han Zhu, Shizhou Huang, Xinglin Zhang, Tao Chen, Jenq-Neng Hwang, Zhang Huaping, Lei Li

Affiliations: HiThink Research; University College London; University of Edinburgh; The Hong Kong University of Science and Technology; East China Normal University; Shanghai Medical Image Insights; University of Waterloo; University of Washington; Beijing Institute of Technology

Topic match (Other LLM): Bayesian in-context learning framework for information extraction.

AI summary: Proposes the BCL framework, which uses Bayesian updating and particle filtering to optimize in-context learning for information extraction, with significant gains on sequence labeling and relation classification tasks.

Comments ACL 2026 Findings

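2606.18263 2026-06-18 cs.HC cs.AI New submission 70%

How Well Do Large Language Models Capture Human Personality?

Aanisha Bhattacharyya, Yaman Kumar Singla, Rajiv Ratn Shah, Changyou Chen, Jitendra Ajmera

Affiliations: Adobe Media and Data Science Research (MDSR)

Topic match (Other LLM): Evaluates how faithfully LLMs simulate human personality through persona prompting.

AI summary: By formalizing common assumptions and evaluating them systematically, finds that increasing persona description complexity contracts representational and behavioral diversity (persona manifold collapse), and that simple Age-Gender personas are more accurate than richly specified descriptions.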


Abstract

Large language models (LLMs) are increasingly used to simulate human populations via persona prompting, often under the assumptions that richer persona descriptions improve behavioral fidelity, similarly sized attribute combinations are equally simulatable, and persona definitions generalize across tasks. In this work, we formalize these assumptions and systematically evaluate them across multiple architectures, scales, and simulation settings. We identify a fundamental limitation we term persona manifold collapse, where increasingly expressive persona specifications lead to systematic contraction of representational and behavioral diversity. Across models, increasing persona complexity consistently reduces inter-persona separation in latent space and weakens behavioral differentiation in downstream simulation tasks. These effects persist across multiple analyses as richer personas fail to preserve human subgroup disagreement, performance varies across attribute combinations of similar size, and adding descriptive detail often degrades rather than improves simulation fidelity. Surprisingly, simple Age-Gender personas consistently outperform richly specified Ideal Customer Profiles (ICPs) across industries, achieving substantially higher downstream prediction accuracy. We find that collapse is not uniform across attributes. Certain combinations remain behaviorally stable and preserve stronger alignment with human responses, forming localized regions we term alignment bridges. Together, our results provide empirical and conceptual foundations for understanding the limits of persona-conditioned simulation, highlighting the need for representation-aware persona construction rather than increasing persona expressivity alone.


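2606.18258 2026-06-18 cs.HC cs.AI New submission 70%

Examining Human-Like Behaviors in LLMs: A Multi-Dimensional Analysis of Model Behaviors, User Factors, and System Prompts

Sunnie S. Y. Kim, Margit Bowler, Leon A Gatys

Affiliations: Apple

Topic match (Other LLM): Multi-dimensional analysis of human-like behaviors in LLMs and their control through system prompts.

AI summary: A multi-dimensional analysis of 21,000 conversations finds that human-like behaviors are pervasive in LLMs but vary across models and user factors; human evaluators judged self-referential and relationship-building behaviors less appropriate from LLMs than from humans, but boundary-maintaining behaviors more appropriate; system prompts can control these behaviors but require careful evaluation.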


Abstract

Large language models (LLMs) exhibit a wide range of human-like behaviors, from expressing thoughts and emotions, to engaging in relationship-building with users, to refusing requests and maintaining boundaries. Despite their prevalence, researchers and practitioners lack methods and empirical insights to make informed decisions about when and what types of human-like behaviors LLMs should exhibit. To fill this gap, we present a multi-dimensional analysis of the prevalence, potential effects, and controllability of these behaviors using LLM-as-a-judge and human evaluation. Across 21,000 multi-turn conversations from four widely used models (gpt-4o, gpt-4.1-mini, claude-sonnet-4.6, gemini-2.5-flash), we find that human-like behaviors are pervasive but vary across models and user factors (conversation goals and user profiles). In terms of perceived appropriateness, human evaluators judged self-referential and relationship-building behaviors as less appropriate from LLMs than from humans, but boundary-maintaining behaviors more appropriate from LLMs than from humans. Finally, we show that system prompting can control these behaviors, though it requires careful evaluation to avoid unintended effects. We discuss the implications of our findings and provide recommendations for responsible LLM design and evaluation.


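2606.18422 2026-06-18 quant-ph New submission 70%

Gatekeepers and Hallucinations: A Layered Evaluation Framework for LLM-Driven Quantum Circuit Generation

Christopher Coleman, Sharon Marfatia

Topic match (Other LLM): Evaluation framework for LLM-generated quantum circuits.

AI summary: Proposes a layered evaluation framework that uses gatekeeper screening, circuit fidelity analysis, and a design entropy metric to identify five failure modes of LLMs in quantum circuit generation, and shows that the evaluation infrastructure itself can introduce errors.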

Comments 7 pages, 4 figures


Abstract

As large language models (LLMs) become embedded in quantum simulation workflows (IDE copilots, notebook assistants, agentic pipelines), evaluation must move beyond functional correctness to anticipate and catch structured failures before they propagate through expensive pipelines. We present a layered evaluation framework for materials-informed Variational Quantum Eigensolver (VQE) circuit generation: (i) a gatekeeper screening rubric across seven physical and framework criteria; (ii) a circuit fidelity analysis comparing model outputs against analytical and reference-implementation values for H2/STO-3G/Jordan-Wigner/UCCSD, with ansatz classification and gate-composition breakdown; and (iii) design entropy, a run-to-run behavioral consistency metric. We surface a taxonomy of five distinct LLM failure modes (geometry hallucination, nonexistent API usage, runtime integration failures, constraint violations, and plausible-but-unverifiable output), each with distinct detectability profiles and structural to the task rather than to any one model. A forensic audit of the evaluation platform's own source code further establishes that two apparent model failures originated in the harness through silent fallback-template substitution, demonstrating that evaluation infrastructure belongs inside the same trust boundary as the models it tests. Applied across multiple foundation models on a Materials Project integrated pipeline, the framework shows that gatekeeper-style validation is necessary, not optional, for reliable deployment.


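2606.18276 2026-06-18 cs.MA cs.SI physics.soc-ph New submission 70%

Characterizing Opinion Evolution of Networked LLMs

Caleb Probine, Yigit Ege Bayiz, Filippos Fotiadis, Samuel Li, Yunhao Yang, Ufuk Topcu

Topic match (Other LLM): Uses LLMs to simulate opinion propagation; an LLM application study.

AI summary: Studies whether classical opinion dynamics models can describe opinion propagation among large language models (LLMs) in multi-agent systems, and finds that adding a bias term markedly improves modeling accuracy, reducing mean opinion error by up to 88%.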

Comments 19 pages, 2 figures

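2606.15633 2026-06-18 cs.LG New submission 70%

Formalizing and Mitigating Structural Distortion in LLM Attention for Graph Reasoning

Donald Loveland, Puja Trivedi, Ari Weinstein, Edward W Huang, Danai Koutra

Affiliations: University of Michigan; Amazon

Topic match (Other LLM): Improves LLM performance on graph reasoning tasks.

AI summary: Formalizes the structural distortion that graph linearization causes when large language models process text-attributed graphs, and proposes GaLA, a lightweight inference-time modification that corrects the attention bias to improve zero-shot graph reasoning performance.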

Comments Accepted to KDD 2026


Abstract

Large Language Models (LLMs) have shown promise for reasoning over Text-Attributed Graphs (TAGs). However, applying LLMs to graphs requires linearizing their structure into sequences, introducing distortion rooted in the graph bandwidth problem. While this distortion has been shown to degrade performance, it is often attributed to prompt design or model scale, leaving the underlying mechanism unclear. In this work, we show how rotary positional embeddings turn graph linearization into bandwidth-dependent attention decay, suppressing attention between graph-adjacent nodes that are forced far apart in the serialized sequence. This shifts the focus of LLM-based graph reasoning from prompt engineering and scaling toward correcting attention misalignment. Motivated by this analysis, we propose Graph-aligned Language Attention (GaLA), a lightweight, inference-time modification for LLMs. GaLA biases attention toward graph-adjacent nodes while preserving the LLM's sequential inductive biases. Across TAG benchmarks, GaLA improves performance with negligible overhead, demonstrating that distortion is a correctable bottleneck in LLM-based graph reasoning.
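The graph bandwidth notion behind this argument can be illustrated with a few lines of code. The sketch below is not GaLA itself: it works at node level rather than token level, and the additive bias matrix with a hypothetical `bonus` value is only an assumed stand-in for "biasing attention toward graph-adjacent nodes".

```python
def bandwidth(edges, order):
    """Largest distance, in sequence positions, between the two endpoints
    of any edge under a given linearization `order` of the nodes."""
    pos = {node: i for i, node in enumerate(order)}
    return max(abs(pos[u] - pos[v]) for u, v in edges)

def adjacency_bias(edges, order, bonus=1.0):
    """Matrix B with B[i][j] = bonus when the nodes at positions i and j
    are adjacent in the graph, else 0. In this sketch it would be added
    to the attention logits before the softmax."""
    pos = {node: i for i, node in enumerate(order)}
    n = len(order)
    bias = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        i, j = pos[u], pos[v]
        bias[i][j] = bonus
        bias[j][i] = bonus
    return bias

# A path graph a-b-c-d: the natural order keeps neighbors next to each
# other, while a shuffled order pushes some neighbors two positions apart.
path_edges = [("a", "b"), ("b", "c"), ("c", "d")]
print(bandwidth(path_edges, ["a", "b", "c", "d"]))
print(bandwidth(path_edges, ["a", "c", "b", "d"]))
```

For graphs denser than a path, no ordering keeps every pair of neighbors close, which is why a position-dependent attention decay would penalize some true edges regardless of how the prompt is serialized.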


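2606.14202 2026-06-18 cs.NE cs.AI New submission 70%

MeEvo: Metacognitive Evolution Combined with Natural Evolution for Automatic Heuristic Design

Zishang Qiu, Xinan Chen, Rong Qu, Ruibin Bai

Affiliations: School of Computer Science, University of Nottingham Ningbo China; School of Computer Science, University of Nottingham

Topic match (Other LLM): Uses LLMs to generate heuristic code.

AI summary: Proposes the MeEvo framework, which cyclically couples Natural Evolution (exploring heuristic code) with Metacognitive Evolution (reflecting on the history to generate improved heuristics) to address the weak knowledge inheritance and limited exploration of existing methods, and performs better on five optimization problems.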


Abstract

Large Language Models (LLMs) have advanced Automatic Heuristic Design (AHD) by enabling heuristic generation through reasoning and code synthesis. Existing LLM-based AHD architectures mainly follow two paradigms: Natural Evolution, which uses crossover and mutation to explore heuristic programs, and Metacognitive Evolution, which refines reasoning through reflection. However, Natural Evolution discards reasoning traces, weakening knowledge inheritance and exploitation, while Metacognitive Evolution lacks population-level recombination, limiting exploration and increasing the risk of premature convergence. These limitations reduce search efficiency, stability, and solution quality on complex problems. To address this gap, we propose MeEvo, a dual-layer AHD framework that cyclically couples Natural Evolution and Metacognitive Evolution. Natural Evolution explores heuristic code while recording reasoning traces, fitness values, and errors into a shared history; Metacognitive Evolution then reflects on this history to generate improved heuristics that re-enter the parent pool for the next cycle. This design enables population-driven exploration and reflection-driven refinement to reinforce each other. Experiments on five optimization problems with two LLM backbones show that MeEvo achieves stronger and more stable performance than existing LLM-based AHD architectures, especially on complex constrained tasks.


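2606.07622 2026-06-18 cs.LG stat.AP New submission 70%

Airport Terminal Passenger Queue Forecasting for Departure Gates and Security Checkpoints

Juhwan Lee, Seokbin Yoon, Keumjin Lee, Hojong Baik, Seyeon Jung

Affiliations: Korea Aerospace University; Korea Airports Corporation

Topic match (Other LLM): Transformer-based airport queue forecasting.

AI summary: Proposes a Transformer-based framework that uses historical queue length, waiting time, and passenger throughput data to forecast queue length and waiting time at departure gates and security checkpoints up to two hours ahead, supporting proactive queue management.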

Comments 10 pages, 6 figures, accepted at DASC 2026


Abstract

Accurate passenger queue forecasting in airport terminals is essential for efficient departure operations, as it enables proactive congestion management. However, time-varying passenger demand and heterogeneous facility usage across multiple departure facilities make forecasting challenging. In this work, we propose a passenger queue forecasting framework that learns historical passenger flow patterns from operational data. The proposed model employs a Transformer-based architecture to capture temporal dependencies and inter-facility correlations using past queue length and waiting time at departure gates and security checkpoints, together with passenger throughput at check-in islands. The learned representations are mapped to two facility-specific prediction heads to predict queue length and waiting time at departure gates and security checkpoints. Experimental results demonstrate accurate forecasts up to two hours ahead. The proposed approach offers practical real-time decision support for proactive queue management and staff reallocation in airport terminal operations.


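2604.13082 2026-06-18 cs.LG cs.AI Updated version 70%

The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior

Laura Gomezjurado Gonzalez

Affiliations: Stanford University

Topic match (Other LLM): Studies the generalization mechanism of Transformers; related to LLMs.

AI summary: Studies why generalization is delayed in Transformers on arithmetic tasks: the encoder learns the structure early, but a decoder bottleneck causes the delay; transplanting or freezing the encoder accelerates generalization, and the choice of numeral base affects learning difficulty.

Comments 19 pages, 10 figures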


Abstract

Grokking in transformers trained on algorithmic tasks is characterized by a long delay between training-set fit and abrupt generalization, but the source of that delay remains poorly understood. In encoder-decoder arithmetic models, we argue that this delay reflects limited access to already learned structure rather than failure to acquire that structure in the first place. We study one-step Collatz prediction and find that the encoder organizes parity and residue structure within the first few thousand training steps, while output accuracy remains near chance for tens of thousands more. Causal interventions support the decoder bottleneck hypothesis. Transplanting a trained encoder into a fresh model accelerates grokking by 2.75 times, while transplanting a trained decoder actively hurts. Freezing a converged encoder and retraining only the decoder eliminates the plateau entirely and yields 97.6% accuracy, compared to 86.1% for joint training. What makes the decoder's job harder or easier depends on numeral representation. Across 15 bases, those whose factorization aligns with the Collatz map's arithmetic (e.g., base 24) reach 99.8% accuracy, while binary fails completely because its representations collapse and never recover. The choice of base acts as an inductive bias that controls how much local digit structure the decoder can exploit, producing large differences in learnability from the same underlying task.
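As a small illustration of the task, the sketch below builds input/target digit sequences for one-step Collatz prediction in an arbitrary base. It assumes the standard Collatz map (n/2 for even n, 3n+1 for odd n) and a most-significant-digit-first encoding; the abstract does not specify either detail, and the helper names are hypothetical.

```python
def collatz_step(n):
    """One application of the standard Collatz map (assumed variant)."""
    return n // 2 if n % 2 == 0 else 3 * n + 1

def to_digits(n, base):
    """Digits of a non-negative integer in the given base, most significant first."""
    digits = []
    while n > 0:
        digits.append(n % base)
        n //= base
    return digits[::-1] or [0]

def make_example(n, base):
    """Return (input digits, target digits) for one training example."""
    return to_digits(n, base), to_digits(collatz_step(n), base)

# The same arithmetic fact, 27 -> 82, looks very different across bases.
for base in (2, 10, 24):
    print(base, make_example(27, base))
```

Since the underlying function is identical in every base, any difference in learnability across bases has to come from how the digit sequences expose its structure, which is the sense in which the abstract calls the base an inductive bias.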


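2601.18511 2026-06-18 cs.CR Updated version 70%

Scaling up FHE-based Privacy-Preserving ML: Higher Throughput, Longer Inputs for LLama-3-8B

Jaiyoung Park, Sejin Park, Jai Hyun Park, Jung Ho Ahn, Jung Hee Cheon, Guillaume Hanrot, Jung Woo Kim, Minje Park, Damien Stehlé

Topic match (Other LLM): Proposes methods to accelerate FHE-based privacy-preserving LLM inference.

AI summary: Addresses the poor input-length scalability of FHE-based LLM inference and the impact of outliers on non-linear layer evaluation by using token prepending, orthogonal rotations, and a polynomial evaluation method for sparsely-packed ciphertexts, combined with fast homomorphic linear algebra techniques; accelerates inference on 128 encrypted tokens, extends to heterogeneous inputs of thousands of tokens, and achieves significant performance gains on Llama-3-8B.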


Abstract

As large language models (LLMs) become ubiquitous, privacy concerns pertaining to inference keep growing. Fully homomorphic encryption (FHE) has emerged as a primary cryptographic solution for non-interactive confidential LLM inference. However, existing solutions scale poorly with input token length, focusing on small models or input sizes. They also suffer from large outlier values, which strongly impact the evaluation of non-linear layers, leading to heavy polynomial approximation costs. We scale up FHE-based LLM inference in two directions. First, we accelerate FHE-based inference for 128 encrypted tokens. We adopt ML techniques (token prepending and orthogonal rotations) to mitigate outlier impacts on the FHE evaluation of non-linear layers. Separately, we devise a novel polynomial evaluation method for sparsely-packed ciphertexts to speed up our homomorphic SoftMax implementation. We combine these with recent fast homomorphic linear algebra techniques, achieving significantly improved efficiency. Second, we expand the prompt size up to thousands of tokens for contexts where only the final part of the input is sensitive and encrypted. Processing this requires handling standard plaintext-plaintext and ciphertext-ciphertext components, alongside a wide homomorphic computation for a novel plaintext-ciphertext component. To address this, we devise a dedicated homomorphic linear algebra algorithm, building a shallow homomorphic attention circuit that minimizes bootstrapping costs. Based on these ingredients, we present a CKKS-based end-to-end implementation of Llama-3-8B private inference. On 8 NVIDIA RTX PRO 6000 GPUs, 128 encrypted tokens take 20s for summarization and 18s/token for generation (vastly outperforming the SOTA 295s on costlier H100 GPUs). For a heterogeneous 4096-token input (last 128 encrypted), it takes 64s for summarization and 22s/token for generation.

2510.27353 2026-06-18 cs.AI Version update 70%

An In-depth Study of LLM Contributions to the Bin Packing Problem

Julien Herrmann, Guillaume Pallez

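Affiliations: CNRS-IRIT; Inria

Topic match (Other LLM): Studies the contribution of LLMs to the bin packing problem and analyzes LLM-generated heuristics.

AI summary: By analyzing LLM-generated heuristics, the paper finds that they are readable yet hard to interpret, proposes new algorithms that are simpler and more efficient, and questions the actual contribution of LLMs to the bin packing problem.

Comments: Accepted for publication in ACM Transactions on Evolutionary Learning and Optimization

Details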

DOI: 10.1145/3821574

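AI-generated abstract (translated from the Chinese)

Recent studies suggest that large language models (LLMs) may offer interesting ideas for mathematical discovery. This claim rests on reports that LLM-based genetic algorithms produced heuristics with new insights for the online bin packing problem under uniform and Weibull distributions. This paper reassesses the claim through a detailed analysis of the LLM-produced heuristics, examining their behavior and interpretability. Although these heuristics are human-readable, they remain largely opaque even to domain experts. Based on this analysis, we propose a new class of algorithms for these specific bin packing instances. The derived algorithms are significantly simpler, more efficient, more interpretable, and more generalizable, which indicates that the instances considered are themselves relatively simple. We then discuss the limitations of the claim about LLMs' contribution to this problem, which appears to rest on the mistaken assumption that these instances had been studied before. Our findings instead underline the need for rigorous validation and contextualization when assessing the scientific value of LLM-generated outputs.

English abstract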

Recent studies have suggested that Large Language Models (LLMs) could provide interesting ideas contributing to mathematical discovery. This claim was motivated by reports that LLM-based genetic algorithms produced heuristics offering new insights into the online bin packing problem under uniform and Weibull distributions. In this work, we reassess this claim through a detailed analysis of the heuristics produced by LLMs, examining both their behavior and interpretability. Despite being human-readable, these heuristics remain largely opaque even to domain experts. Building on this analysis, we propose a new class of algorithms tailored to these specific bin packing instances. The derived algorithms are significantly simpler, more efficient, more interpretable, and more generalizable, suggesting that the considered instances are themselves relatively simple. We then discuss the limitations of the claim regarding LLMs' contribution to this problem, which appears to rest on the mistaken assumption that the instances had previously been studied. Our findings instead emphasize the need for rigorous validation and contextualization when assessing the scientific value of LLM-generated outputs.
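
For readers unfamiliar with the setting, the sketch below shows what an online bin packing heuristic looks like, using the classical Best Fit rule. It is neither one of the LLM-generated heuristics nor the new class of algorithms proposed in the paper. The item range, bin capacity, and random seed are arbitrary assumptions chosen for illustration.

```python
import random


def best_fit(items, capacity):
    """Online Best Fit: put each arriving item into the fullest bin that still has room.

    Returns the number of bins opened. Items are processed strictly in arrival order.
    """
    remaining = []  # free space left in each open bin
    for size in items:
        best = None
        for i, free in enumerate(remaining):
            # Among bins that can hold the item, prefer the one with the least free space.
            if free >= size and (best is None or free < remaining[best]):
                best = i
        if best is None:
            remaining.append(capacity - size)  # no bin fits: open a new one
        else:
            remaining[best] -= size
    return len(remaining)


def volume_lower_bound(items, capacity):
    """No packing can use fewer than ceil(total size / capacity) bins."""
    return -(-sum(items) // capacity)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, arbitrary
    capacity = 150          # assumed bin capacity
    items = [rng.randint(20, 100) for _ in range(1000)]  # assumed uniform integer sizes
    print("Best Fit bins:", best_fit(items, capacity))
    print("Volume lower bound:", volume_lower_bound(items, capacity))
```

Comparing a heuristic's bin count with the volume lower bound on the same instance is a simple way to see how much room for improvement a given instance distribution leaves.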

2506.12311 2026-06-18 cs.CL cs.SD eess.AS Version update 70%

Phonikud: Overcoming Phonetic Underspecification for Hebrew Text-To-Speech

Yakov Kolani, Maxim Melichov, Cobi Calev, Morris Alper

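Affiliations: Independent Researcher; Reichman University; Tel Aviv University; Carnegie Mellon University

Topic match (Other LLM): Hebrew TTS, involving language models.

AI summary: Proposes the Phonikud framework, which addresses the underspecification of phonetic features such as stress in Hebrew TTS through an open-source G2P system, a corpus, a benchmark, and evaluation models, achieving more accurate phoneme prediction.

Comments: Accepted to Interspeech 2026. Project page: https://phonikud.github.io

Details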

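AI-generated abstract (translated from the Chinese)

Text-to-speech (TTS) for Modern Hebrew is challenged by the complexity of the language's orthography, and existing solutions ignore underspecified phonetic features such as stress. We propose a more accurate Hebrew TTS framework with four contributions: (1) Phonikud, an open-source Hebrew grapheme-to-phoneme (G2P) system that outputs fully specified International Phonetic Alphabet (IPA) transcriptions, built by augmenting a base diacritizer. (2) The ILSpeech corpus, containing paired Hebrew audio, text, and expert IPA annotations. (3) A benchmark for the previously unmeasured task of Hebrew G2P conversion. (4) Hebrew audio-to-IPA models that capture previously ignored phonetic details, for automatic TTS evaluation. Our results show that Phonikud predicts Hebrew phonemes more accurately than prior methods, and that small local TTS models using phonetic input from Phonikud approach large proprietary systems. We release code, data, and models at https://phonikud.github.io.

English abstract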

Text-to-speech (TTS) for Modern Hebrew is challenged by the language's orthographic complexity, with existing solutions ignoring underspecified phonetic features such as stress. We present a framework for more phonetically accurate Hebrew TTS with four contributions: (1) Phonikud, an open-source Hebrew grapheme-to-phoneme (G2P) system that outputs fully-specified International Phonetic Alphabet (IPA) transcriptions, designed by augmenting a base diacritizer. (2) The ILSpeech corpus of paired Hebrew audio, text, and expert IPA annotations. (3) A benchmark for the previously unmeasured task of Hebrew G2P conversion. (4) Hebrew audio-to-IPA models capturing previously disregarded phonetic details for automatic TTS evaluation. Our results show that Phonikud more accurately predicts Hebrew phonemes than prior methods, and that small, local TTS models with phonetic input from Phonikud approach large proprietary systems. We release our code, data, and models at https://phonikud.github.io.
