arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 2094
专题追踪
2606.19379 2026-06-19 cs.LG cs.AI cs.CL 新提交

How Linear Is a Transformer Feed-Forward Block? Per-Block Linear Recoverability Is Learned, Not Architectural

Transformer 前馈块有多线性?逐块线性可恢复性是学习得到的,而非架构决定的

Stuart Whipp

发表机构 * Independent Research(独立研究)

AI总结 通过精确最小二乘线性近似,测量训练后 Transformer 各前馈块的线性可恢复性,发现其高度异质且非单调,是学习得到的属性而非架构决定,并可用于压缩和诊断。

Comments 14 pages, 5 figures

详情
AI中文摘要

Transformer 前馈网络(FFN)通常被视为非线性的计算存储单元,但训练后的 FFN 块实际非线性程度很少被测量。我们将每个 FFN 视为位置级的输入-输出映射,并将其分解为精确的最小二乘线性近似加上残差。闭式线性映射解释的留出方差定义了一个块的线性可恢复性(R^2_lin),这是一种无需优化器的线性度量。在 GPT-2、Pythia-160m 和 llama-160m 的所有十二个块中,R^2_lin 高度异质且随深度非单调变化,相邻块之间范围从近线性(>0.99)到强非线性(<0.3),且并非由激活函数决定:相同宽度的 GELU 模型 GPT-2 和 Pythia-160m 具有截然不同的轮廓,因此可恢复性是单个训练块的学习属性,而非架构属性。残差的低秩双线性探针仅恢复少量 R^2 点,且增益与残差非线性不相关:未恢复的计算不是单个位置级乘积,而是高阶或分布式结构。该测量还作为有针对性的压缩信号:可恢复块允许大的单层替换(GPT-2 的早期 FFN 参数减少 8 倍,困惑度增加 +0.77),而低可恢复性块标记了这不安全的情况。它还暴露了一个方法论陷阱:训练后的线性基线可能在病态条件的 Transformer 激活上严重欠收敛,因此我们报告了整个过程中精确的闭式最小二乘上限。

英文摘要

Transformer feed-forward networks (FFNs) are often treated as nonlinear stores of computation, yet how nonlinear a trained FFN block actually is has rarely been measured. We treat each FFN as a position-wise input-to-output map and split it into the exact least-squares linear approximation plus a residual. The held-out variance the closed-form linear map explains defines a block's linear recoverability (R^2_lin), an optimiser-free measure of its linearity. Across all twelve blocks of GPT-2, Pythia-160m, and llama-160m, R^2_lin is highly heterogeneous and non-monotone with depth, ranging from near-linear (>0.99) to strongly nonlinear (<0.3) between adjacent blocks, and is not set by the activation function: same-width GELU models GPT-2 and Pythia-160m have sharply different profiles, so recoverability is a learned property of individual trained blocks, not an architectural one. A low-rank bilinear probe of the residual recovers only a few points of R^2, with gain uncorrelated with residual nonlinearity: the unrecovered computation is not a single position-wise product but higher-order or distributed structure. The measurement also serves as a targeted compression signal: recoverable blocks admit large single-layer replacements (GPT-2's early FFN at 8x fewer parameters for +0.77 perplexity), while low-recoverability blocks flag where this is unsafe. It further exposes a methodological pitfall: trained linear baselines can badly under-converge on ill-conditioned transformer activations, so we report the exact closed-form least-squares ceiling throughout.

2606.19377 2026-06-19 cs.LG cs.AI 新提交

Emyx: Fast and efficient all-atom protein generation

Emyx: 快速高效的全原子蛋白质生成

Nicholas J. Williams, Ward Haddadin, Matteo P. Ferla, Constantin Schneider, Nicholas B. Woodall, Ruby Sedgwick, Christian D. Madsen, Andrew L. Hopkins, Edward O. Pyzer-Knapp

发表机构 * Xyme

AI总结 提出Emyx,一种140M参数的流匹配模型,通过轻量条件表示和稀疏连接降低复杂度,在酶设计基准上超越现有方法,训练仅需682 GPU小时。

详情
AI中文摘要

计算酶设计需要生成能够支撑催化残基和配体的蛋白质,这要求生成模型同时具备几何准确性和结构多样性。当前的全原子生成模型继承了结构预测中的昂贵架构,导致训练成本高、样本多样性有限。我们认为,对于生成模型而言,这种复杂性大多是不必要的,因为生成模型依赖于稀疏的几何约束而非丰富的共进化信号。Emyx是一个140M参数的条件流匹配模型,将能力集中在标准Transformer块中,用轻量条件表示和稀疏连接替代了厚重的嵌入堆叠。此外,我们推导了流匹配插值到EDM噪声水平框架的精确重参数化,将流匹配训练效率与为扩散模型设计的最先进采样方法桥接起来,无需重新训练。尽管是最小的模型,Emyx在AME酶设计基准上,在要求全局折叠恢复和催化几何准确性的严格评估下,在成功率、结构新颖性、骨架多样性和几何有效性方面均优于Proteína-Complexa和RFdiffusion3,而训练仅需682 GPU小时,约为RFdiffusion3的1/4。

英文摘要

Computational enzyme design requires generating proteins that scaffold catalytic residues and ligands, a task that demands both geometric accuracy and structural diversity from the underlying generative model. Current all-atom generators inherit expensive architectures from structure prediction, leading to high training costs and limited sample diversity. We argue that much of this complexity is unnecessary for generators, which condition on sparse geometric constraints rather than rich co-evolutionary signals. Emyx is a 140M-parameter conditional flow matching model that concentrates capacity within standard transformer blocks, replacing heavy embedding stacks with lightweight conditional representations and sparse connectivity. We additionally derive an exact reparametrisation of the flow matching interpolant into the EDM noise-level framework, bridging flow matching training efficiency with state-of-the-art sampling methods designed for diffusion models without retraining. Despite being the smallest model, Emyx outperforms both Proteína-Complexa and RFdiffusion3 against the AME enzyme design benchmark across success rate under strict evaluation requiring both global fold recovery and catalytic geometry accuracy, structural novelty, scaffold diversity, and geometric validity, while training in just $682$ GPU-hours, roughly $4\times$ less than RFdiffusion3.

2606.19376 2026-06-19 cs.LG cs.AI cs.IR 新提交

Cost-Optimal LLM Routing with Limited User Feedback under User Satisfaction Guarantees

在用户满意度保证下基于有限用户反馈的成本最优LLM路由

Herbert Woisetschläger, Arastun Mammadli, Ryan Zhang, Shiqiang Wang

发表机构 * Technical University of Munich(慕尼黑工业大学) University of Exeter(埃克塞特大学) Horace Greeley High School(霍勒斯格里利高中)

AI总结 针对LLM推理成本与服务质量之间的矛盾,提出SLARouter在线路由算法,利用稀疏单侧用户反馈学习成本最优策略,理论保证成本最优和SLA合规,实验显示成本降低高达2.2倍。

Comments Preprint. Under review

详情
AI中文摘要

大型语言模型(LLM)应用的推理成本正在快速增长,这是由于需求激增和基础设施成本上升所驱动的。用户期望高质量的响应,在商业环境中,这被正式编码在服务级别协议(SLA)中,从而在成本和质量之间形成了根本性的矛盾。最近在成本感知的LLM请求路由方面的进展显示出解决这一矛盾的潜力,但现有方法依赖于完整的反馈信号、离线训练、大量的每工作负载调优,并且大多数缺乏SLA保证或推理时适应性。我们引入了SLARouter,一种在线路由算法,它从生产系统中可用的稀疏、单侧用户反馈中学习成本最优策略。SLARouter为成本最优性和严格的SLA合规性提供了理论保证。在广泛的LLM基准测试上的实验表明,SLARouter无需每基准调优即可满足SLA约束,将运营成本降低至现有基线的2.2倍。

英文摘要

Inference costs for large language model (LLM) applications are rapidly growing, driven by surging demand and rising infrastructure cost. Users expect high-quality responses, and in commercial settings this is formally codified in Service Level Agreements (SLAs), creating a fundamental tension between cost and quality. Recent progress on cost-aware LLM request routing has shown potential to resolve this tension, but existing approaches rely on complete feedback signals, offline training, extensive per-workload tuning, and most lack SLA guarantees or inference-time adaptivity. We introduce SLARouter, an online routing algorithm that learns a cost-optimal policy from the sparse, one-sided user feedback available in production systems. SLARouter provides theoretical guarantees for both cost optimality and strict SLA compliance. Experiments across a wide range of LLM benchmarks show that SLARouter satisfies SLA constraints without the need for per-benchmark tuning, reducing operating cost by up to 2.2x over existing baselines.

2606.19374 2026-06-19 cs.LG cs.AI 新提交

Protein Representation Learning with Secondary-Structure and Energy-Filtered Hydrogen-Bond Graphs

基于二级结构和能量过滤氢键图的蛋白质表示学习

Mohamed Mouhajir, Limei Wang, El Houcine Bergou, Hajar El Hammouti, Lamiae Azizi, Dongqi Fu

发表机构 * College of Computing, UM6P(穆罕默德六世理工大学计算机学院)

AI总结 提出一种二级结构感知的图神经网络,通过增强残基节点表示并基于能量过滤的氢键构建边,以捕获局部结构上下文和长程耦合,在蛋白质基准上取得一致改进并增强生物学可解释性。

Journal ref The 25th International Workshop on Data Mining in Bioinformatics (BIOKDD 2026)

详情
AI中文摘要

基于图的表示被广泛用于蛋白质建模,然而许多现有方法主要依赖序列邻接或几何邻近,这仅部分反映了控制蛋白质折叠的原理。蛋白质实际上采用围绕二级结构元素(如α-螺旋和β-折叠)组织的复杂三维构象,这些元素编码了重复的局部基序和稳定的氢键相互作用。在这项工作中,我们引入了一种二级结构感知的图神经网络用于蛋白质表示学习。残基级别的节点表示通过二级结构分配得到增强,图边由经过能量强度过滤的氢键相互作用构建。这种设计使模型能够捕获对蛋白质稳定性和功能至关重要的局部结构上下文和长程耦合。我们在常用的蛋白质基准上评估了所提出的方法,并观察到相对于现有基于图的方法的一致改进。此外,生成的图表示提供了增强的生物学可解释性,因为学习到的连接性与已建立的结构基序一致。这些发现表明,融入二级结构和能量过滤的氢键拓扑为蛋白质表示学习提供了有效的归纳偏置。代码发布在 https://this URL。

英文摘要

Graph-based representations are widely used in protein modeling, yet many existing approaches rely primarily on sequence adjacency or geometric proximity, which only partially reflect the principles governing protein folding. Proteins instead adopt complex three-dimensional conformations organized around secondary structure elements, such as $α$-helices and $β$-sheets, which encode recurring local motifs and stabilizing hydrogen-bond interactions. In this work, we introduce a secondary-structure-aware graph neural network for protein representation learning. Residue-level node representations are augmented with secondary structure assignments, and graph edges are constructed from hydrogen-bond interactions filtered by their energetic strength. This design enables the model to capture both local structural context and long-range couplings that are central to protein stability and function. We evaluate the proposed approach on commonly used protein benchmarks and observe consistent improvements over existing graph-based methods. In addition, the resulting graph representations offer enhanced biological interpretability, as the learned connectivity aligns with established structural motifs. These findings suggest that incorporating secondary structure and energy-filtered hydrogen-bond topology provides an effective inductive bias for protein representation learning. The code is released at https://github.com/mohamedmohamed2021/SSProNet

2606.19373 2026-06-19 cs.LG cs.AI 新提交

cAPM: Continual AI-Assisted Pace-Mapping with Active Learning

cAPM:具有主动学习的持续AI辅助起搏标测

Dylan O'Hara, Pradeep Bajracharya, Casey Meisenzahl, Karli Gillette, Anton J. Prassl, Gernot Plank, Saman Nazarian, Roderick Tung, John L Sapp, Linwei Wang

发表机构 * Rochester Institute of Technology(罗切斯特理工学院) University of Utah(犹他大学) Scientific Computing and Imaging Institute, University of Utah(犹他大学科学计算与成像研究所) Medical University of Graz(格拉茨医科大学) University of Pennsylvania Perelman School of Medicine(宾夕法尼亚大学佩雷尔曼医学院) The University of Arizona College of Medicine(亚利桑那大学医学院) Dalhousie University(达尔豪斯大学)

AI总结 提出cAPM框架,通过任务无关的代理神经网络、主动学习和持续学习策略,在减少起搏标测数据量的同时,实现跨室性心动过速的知识迁移,将定位精度提升至81%。

详情
AI中文摘要

室性心动过速是一种危及生命的心律失常,是心源性猝死的主要原因。起搏标测是一种临床程序,用于在导管消融室性心动过速期间识别干预靶点。它要求临床医生在心室的不同部位起搏,并快速解释由此产生的心电图,以确定下一步起搏位置或是否已识别出靶点。已提出主动学习AI模型来指导临床医生选择下一个起搏点,显示出在减少起搏点数量和改善起搏标测效率方面的潜力。现有方法需要对每个靶点重新训练,无法在同一患者或不同患者的多个室性心动过速之间迁移知识。我们引入cAPM用于持续AI辅助起搏标测,以捕获和迁移从过去起搏标测数据中积累的知识,从而减少未来靶点室性心动过速所需的起搏标测数据量。这是通过一个任务无关的代理神经网络实现的,该网络学习从起搏点到12导联心电图形态的映射;一种主动学习策略,通过为每个靶点选择信息量最大的起搏点来优化该代理模型;以及一种持续学习策略,以顺序方式执行此操作,同时保留先前靶点的知识。在由不同生理条件和心室几何形状下顺序呈现的定位任务组成的计算机模拟测试平台上评估,cAPM(无论是否重放过去数据样本)在使用4.5个起搏标测点时,在临床耐受范围内(5毫米精度)定位的概率达到81%,而最先进的主动学习方法使用13.7个起搏点达到38%的概率。这些结果为cAPM准备用于体内临床前和临床研究提供了坚实基础,在这些研究中,cAPM可用于指导起搏标测。

英文摘要

Ventricular tachycardia is a life-threatening rhythm disorder and a major cause of sudden cardiac death. Pace-mapping is a clinical procedure for identifying the intervention target during catheter ablation of VT. It requires clinicians to pace different sites in the ventricles and rapidly interpret the resulting electrocardiograms to determine where to pace next or whether a target site has been identified. Active learning AI models have been proposed to guide clinicians to the next pacing site, showing promise in reducing the number of pacing sites and improving the efficiency of pace-mapping. Existing methods require retraining each target without the ability to transfer knowledge across multiple VTs within the same patient or across patients. We introduce cAPM for continuous AI-assisted pace-mapping to capture and transfer knowledge accumulated from past pace-mapping data to reduce the number of pace-mapping data needed for future target VTs. This is made possible by a task-agnostic surrogate neural network that learns the mapping from pacing sites to 12-lead ECG morphology, an active-learning strategy that refines this surrogate model by selecting the most informative pacing site for each target, and a continual learning strategy to do so sequentially while retaining knowledge from prior targets. Evaluated on an in-silico testbed consisting of sequentially-presented localization tasks across different physiological conditions and ventricular geometries, cAPM with and without replay of past data samples achieved an 81% probability of localizing within clinical tolerance (5 mm accuracy) using 4.5 pace-mapping sites, compared to the state-of-the-art active-learning method achieving 38% probability using 13.7 pacing sites. These results provide a strong basis for preparing cAPM towards in-vivo preclinical and clinical studies where it can be used to guide pace-mapping.

2606.19371 2026-06-19 cs.LG cs.AI cs.CV 新提交

ProMUSE: Progressive Multi-modal Uncertainty-guided Staged Evidential Alzheimer Disease Classification

ProMUSE: 渐进式多模态不确定性引导的分阶段证据阿尔茨海默病分类

Long Doan, Branden Chen, Ethan Litton, Huan Huang, Jiajing Huang, Yixin Xie, Weihua Zhou, Nandakumar Narayanan, Chen Zhao

发表机构 * Kennesaw State University(肯尼索州立大学) Michigan Technological University(密歇根理工大学) University of Iowa(爱荷华大学)

AI总结 提出ProMUSE,一种渐进式多模态不确定性引导的分阶段证据网络,通过自适应决定何时需要额外模态,在保持准确性的同时降低数据采集成本。

详情
AI中文摘要

阿尔茨海默病(AD)是一种致命性疾病,会破坏老年人的记忆和认知能力。大多数AD治疗在早期阶段有效,导致对早期AD诊断的需求日益增加。AD诊断越来越依赖多模态数据,如临床评估、结构磁共振成像(MRI)和正电子发射断层扫描(PET)成像。然而,MRI和PET采集仍然昂贵且不易普及,使得全模态推理在现实临床工作流程中不切实际。我们提出ProMUSE,一种渐进式多模态不确定性引导的分阶段证据网络,该网络自适应地确定何时需要额外模态,有助于在保持准确性的同时降低数据采集的总体成本。ProMUSE首先使用低成本临床数据进行证据分类,并通过基于Dirichlet的主观逻辑模型量化不确定性。当不确定性超过学习阈值时,ProMUSE逐步引入MRI或PET特征,通过Dempster-Shafer理论融合模态层面的信念和不确定性,获得校准的多模态预测。这种分阶段采集策略能够在最小化对昂贵成像依赖的同时实现准确诊断。在ADNI、AIBL和OASIS数据集上针对CN-AD、CN-MCI和MCI-AD任务的实验表明,ProMUSE在减少50-90%的MRI/PET使用量的同时,实现了与全模态基线相当或更优的准确性,从而大幅节省成本。这些结果突显了ProMUSE作为现实世界AD筛查中一种实用、不确定性感知且资源高效的解决方案。

英文摘要

Alzheimer's disease (AD) is a fatal disorder that destroys memory and cognitive skills in the elderly population. Most treatments for AD are effective in the early stage, leading to an increasing demand for early AD diagnosis. AD diagnosis increasingly relies on multimodal data such as clinical assessments, structural Magnetic Resonance Imaging (MRI), and Positron Emission Tomography (PET) imaging. However, MRI and PET acquisition remain costly and not universally accessible, making full-modality inference impractical in real-world clinical workflows. We propose ProMUSE, a Progressive Multi-modal Uncertainty Guided Staged Evidential Network that adaptively determines when additional modalities are necessary, helping reduce the overall cost of data acquisition while maintaining accuracy. ProMUSE first performs evidential classification using low-cost clinical data and quantifies uncertainty via a Dirichlet-based subjective logic model. When uncertainty exceeds a learned threshold, ProMUSE progressively incorporates MRI or PET features, fusing modality-wise belief and uncertainty through Dempster-Shafer theory to obtain a calibrated multimodal prediction. This staged acquisition strategy enables accurate diagnosis while minimizing reliance on expensive imaging. Experiments on ADNI, AIBL, and OASIS across CN-AD, CN-MCI, and MCI-AD tasks demonstrate that ProMUSE achieves competitive or superior accuracy compared to full-modality baselines while reducing MRI/PET usage by 50-90%, yielding substantial cost savings. These results highlight ProMUSE as a practical, uncertainty-aware, and resource-efficient solution for real-world AD screening.

2606.19370 2026-06-19 cs.LG cs.AI cs.MA 新提交

Human-like autonomy emerges from self-play and a pinch of human data

类人自主性从自我对弈和少量人类数据中涌现

Daphne Cornelisse, Julian Hunt, Zixu Zhang, Waël Doulazmi, Kevin Joseph, Jaime Fernández Fisac, Eugene Vinitsky

发表机构 * NYU Tandon School of Engineering(纽约大学坦登工程学院) NYU Courant(纽约大学库朗数学科学研究所) Princeton University(普林斯顿大学) Centre for Robotics, Mines Paris(巴黎矿业大学机器人中心) Valeo(法雷奥)

AI总结 提出一种结合自我对弈强化学习与少量人类演示的正则化方法,仅用30分钟人类数据即可训练出与人类协调的驾驶策略,训练时间仅15小时。

Comments 10 pages

详情
AI中文摘要

自我对弈强化学习最近成为一种无需任何人类数据即可训练驾驶策略的方法。它利用廉价的大规模模拟来替代昂贵的大规模人类驾驶演示。这种方法的一个关键局限性是,通过纯自我对弈训练的策略可以学习有效但不符合人类习惯的驾驶惯例。先前的工作试图通过广泛的奖励工程和领域随机化来缓解这种行为偏差,但这些方法脆弱且劳动密集。我们的方法没有完全抛弃人类演示,而是将其作为最小安全目标达到奖励之上的正则化目标。就像好炖菜中的香料一样,我们发现少量人类数据大有裨益:我们的方法仅使用30分钟的人类演示,比同类模仿学习方法少2500倍。由此产生的策略与保留的人类轨迹协调,并在单个消费级GPU上15小时内完成训练。视频和完整源代码见https://this URL。

英文摘要

Self-play reinforcement learning has recently emerged as a way to train driving policies without any human data. It uses cheap, large-scale simulations to substitute expensive, large-scale human driving demonstrations. A key limitation of this approach is that policies trained through pure self-play can learn effective but alien driving conventions incompatible with people. Previous works attempt to mitigate such behavioral misalignments through extensive reward engineering and domain randomization, which are brittle and labor-intensive. Instead of completely discarding human demonstrations, our method treats them as a regularization objective on top of a minimal safe goal-reaching reward. Like the spice in a good stew, we find that a little human data goes a long way: our method uses only 30 minutes of human demonstrations, 2500x fewer than comparable imitation learning approaches. Resulting policies coordinate with held-out human trajectories and complete training in 15 hours on a single consumer-grade GPU. Videos and full source code are available at https://spiced-self-play.com/.

2606.19369 2026-06-19 cs.LG cs.AI 新提交

Zero-Inflated Gaussian Distributions Enable Parameter-Space Sparsity in Estimation-of-Distribution Algorithms

零膨胀高斯分布使估计分布算法中的参数空间稀疏化

Andreas Faust, Sven Nitzsche, Juergen Becker

发表机构 * University of Freiburg(弗莱堡大学) FZI Research Center for Information Technology(FZI信息技术研究中心) Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院)

AI总结 提出多元零膨胀高斯分布作为估计分布算法的采样分布,联合优化稀疏模式和活跃参数,无需手工设计稀疏算子,在Lunar Lander基准上收敛更快且最终回报更高。

详情
AI中文摘要

估计分布算法(EDA)是一类强大的黑箱优化进化方法,尤其当目标函数结构未知时。经典进化算法依赖于手工设计的变异和交叉算子,这些算子难以针对未知问题结构设计,且是偏差的来源,而EDA完全绕过了算子设计:它们将概率分布拟合到最佳个体,并从中采样下一代。EDA在连续参数空间上已得到充分确立,但此前尚未推广到稀疏空间——其中良好解的大多数系数恰好为零。现有的稀疏黑箱优化器因此重新引入了EDA旨在避免的东西:手工制作的稀疏算子、支持集与活跃值交替的双层方案、零阈值以及其他内置假设。我们通过提出多元零膨胀高斯(ZIG)分布作为EDA采样法则来填补这一空白。一个具有独立指示维度和值维度的潜在高斯模型表示稀疏模式、活跃参数之间的相关性以及两者之间的相互作用,因此稀疏模式和活跃值被联合优化,无需层次结构。我们证明该模型的潜在参数可以从观测样本中识别,不同于相关构造起源的缺失数据设置,并引入了实用的基于摊销反演的估计器。这些估计器准确恢复潜在相关结构,在Lunar Lander基准上,由此产生的ZIG-EDA比稠密高斯EDA、手工制作的稀疏进化算法和特设稀疏EDA收敛更快且最终回报更高,同时找到的控制器只有一小部分参数活跃。

英文摘要

Estimation-of-distribution algorithms (EDAs) are a powerful class of evolutionary methods for black-box optimization, especially when little is known about the structure of the objective. Whereas classical evolutionary algorithms rely on hand-designed mutation and crossover operators, hard to devise for unknown problem structures, and a source of bias, EDAs sidestep operator design entirely: they fit a probability distribution to the best individuals and sample the next generation from it. EDAs are well established on continuous parameter spaces, but they have not previously been generalized to sparse ones, in which most coefficients of a good solution are exactly zero. Existing sparse black-box optimizers therefore reintroduce exactly what EDAs were designed to avoid: hand-crafted sparsity operators, bi-level schemes alternating between support set and active values, zeroing thresholds, and other baked-in assumptions. We close this gap by proposing multivariate zero-inflated Gaussian (ZIG) distributions as EDA sampling laws. A latent Gaussian model with separate indicator and value dimensions represents sparsity patterns, correlations among active parameters, and the interactions between the two, so sparsity patterns and active values are optimized jointly, hierarchy-free. We show that the latent parameters of this model are identifiable from observed samples, unlike in the missing-data settings where related constructions originate, and introduce practical amortized inversion-based estimators for them. The estimators accurately recover latent correlation structures, and on the Lunar Lander benchmark the resulting ZIG-EDA converges faster and reaches higher final returns than a dense Gaussian EDA, a hand-crafted sparse evolutionary algorithm, and an ad-hoc sparse EDA, while finding controllers with only a small fraction of parameters active.

2606.19367 2026-06-19 cs.LG 新提交

Weibull Weight-Scale Parameter Evolution under AdamW Training Dynamics

Weibull 权重尺度参数在 AdamW 训练动态下的演化

Tiexin Ding

发表机构 * Independent Researcher(独立研究员)

AI总结 研究 AdamW 训练中 Weibull 权重尺度参数 λ 增长、过冲和松弛的原因,推导出三种力(对齐、注入、衰减)的分解,并在 Pythia-70M 模型上验证对齐力主导上升阶段,贡献 88-94%。

Comments 21 pages, 14 figures

详情
AI中文摘要

基于用于诊断变压器权重分布的双参数 Weibull 框架,我们研究了为什么在 AdamW 训练期间 Weibull 权重尺度参数 λ 会增长、过冲然后松弛。我们从 AdamW 更新中推导出平方权重范数的领先阶三力分解:一个对齐力,测量权重与自适应更新方向之间的相关性;一个注入力,来自自适应步长幅度;以及一个衰减力,来自解耦的权重衰减。在具有真实优化器矩的自训练 Pythia-70M 模型上,对齐力主导上升阶段,在四个随机种子中贡献了绝对力预算的 88-94%,并且对超权重移除具有鲁棒性。接近饱和时,对齐力和衰减力趋于平衡,解释了从权重尺度增长到松弛的转变。这些力动态直接控制 λ(t) 背后的平方范数分量;剩余的 RMS 到 Weibull 重建偏移是可测量的,并分解为桥接分量和积分分量,在密集采样区域总计约 5-6%。为了将分析扩展到无法获得优化器矩的真实模型,我们引入了一种样条位移方法,该方法从稀疏检查点以约 92-94% 的准确率恢复对齐力,大约是朴素两点基线的两倍。我们进一步观察到,在我们的实验中,λ(t) 的峰值随训练数据一致性而变化,这表明权重尺度增长存在数据依赖成分,我们将其留待后续对照研究。代码和数据可在 https://this URL 获取。

英文摘要

Building on a two-parameter Weibull framework for diagnosing transformer weight distributions, we study why the Weibull weight-scale parameter $λ$ grows, overshoots, and then relaxes during AdamW training. We derive a leading-order three-force decomposition of the squared weight norm from the AdamW update: an alignment force measuring the correlation between weights and the adaptive update direction, an injection force from adaptive step magnitude, and a decay force from decoupled weight decay. On self-trained Pythia-70M models with ground-truth optimizer moments, alignment dominates the rise phase, contributing 88-94% of the absolute force budget across four random seeds and remaining robust to super-weight removal. Near saturation, alignment and decay approach balance, explaining the transition from weight-scale growth to relaxation. These force dynamics directly govern the squared-norm component underlying $λ(t)$; the remaining RMS-to-Weibull reconstruction offset is measurable and decomposes into bridge and integration components, totaling approximately 5-6% in densely sampled regions. To extend the analysis to real models where optimizer moments are unavailable, we introduce a spline displacement method that recovers the alignment force from sparse checkpoints with approximately 92-94% accuracy, about twice the naive two-point baseline. We further observe that the peak value of $λ(t)$ varies with training-data coherence in our experiments, suggesting a data-dependent component of weight-scale growth that we leave to a controlled follow-up study. Code and data are available at https://github.com/tiexinding/NPM-Weibull-public.

2606.19366 2026-06-19 cs.LG cs.AI eess.SP 新提交

Information Lattice Learning as Probabilistic Graphical Model Structure Learning

信息格学习作为概率图模型结构学习

Haizi Yu, Lav R. Varshney

发表机构 * Kocree, Inc.(Kocree公司) AI Innovation Institute, Stony Brook University(石溪大学人工智能创新研究所)

AI总结 将信息格学习(ILL)解释为概率图模型结构学习,通过投影到分区格上学习可解释规则,并建立与最大熵和因子图的联系。

详情
AI中文摘要

信息格学习(ILL)通过将信号交替投影到编码抽象层次结构的分区格上,并将选定的规则提升回信号域,来学习信号的可解释规则。当信号是概率质量函数时,我们证明ILL学习的概率规则具有自然的概率图模型(PGM)解释,并详细发展了这一解释。ILL中的分区诱导出一个确定性的商变量,规则是该商变量的边际分布。因此,规则集是可解释抽象上的边际约束集合。一般提升是满足这些约束的所有联合分布的可行族,而特殊提升则选择最大无知重建,在ILL中通过L2均匀性原理实现,该原理与最大熵密切相关。在香农熵提升下,相同的约束产生一个对数线性因子图,其因子由学习的抽象索引。然而,信息格本身不是贝叶斯网络:其边编码抽象的细化与粗化,而非条件依赖。因此,ILL最好被视为商变量上可解释的基于约束的因子图的结构学习。这一观点阐明了ILL如何与图模型和最大熵模型相关,同时为推理、可识别性和混合符号-概率学习提出了新方向。

英文摘要

Information lattice learning (ILL) learns interpretable rules of a signal by alternately projecting the signal onto a partition lattice that encodes a hierarchy of abstractions and lifting selected rules back to the signal domain. When the signal is a probability mass function, we show the probabilistic rules learned by ILL admit a natural probabilistic graphical model (PGM) interpretation and develop this interpretation in detail. A partition in ILL induces a deterministic quotient variable, and a rule is the marginal law of that quotient variable. A rule set is therefore a collection of marginal constraints over interpretable abstractions. General lifting is the feasible family of all joint distributions satisfying those constraints, while special lifting chooses a maximum-ignorance reconstruction, implemented in ILL by an L2 uniformity principle closely related to maximum entropy. Under a Shannon-entropy lifting, the same constraints yield a log-linear factor graph whose factors are indexed by learned abstractions. The information lattice itself, however, is not a Bayesian network: its edges encode refinement and coarsening of abstractions, not conditional dependence. Thus ILL is best viewed as structure learning for interpretable constraint-based factor graphs over quotient variables. This view clarifies how ILL relates to graphical models and maximum entropy models, while suggesting new directions for inference, identifiability, and hybrid symbolic-probabilistic learning.

2606.19365 2026-06-19 cs.LG 新提交

Performance Analysis and Optimization of 3D Generative Diffusion Models across GPU Architectures

跨GPU架构的3D生成扩散模型性能分析与优化

Jeeho Ryoo, Yongchan Jung, Muhammad Ali Khaliq, Weidong Zhang, Jiatong Han, Byeong Kil Lee

发表机构 * Fairleigh Dickinson University(费尔利·迪金森大学) The University of Colorado at Colorado Springs(科罗拉多大学科罗拉多斯普林斯分校) Northeastern University(东北大学)

AI总结 针对3D MRI扩散模型Med-DDPM,分析其在三代NVIDIA架构上的内核级性能瓶颈,提出TF32 Tensor Core激活和3D channels-last布局优化,实现SM周期和动态指令减少100倍,Tensor Core利用率提升至9.98倍,IPC提升7%。

详情
AI中文摘要

扩散模型已成为高保真3D MRI合成的关键,但由于每个样本需要数百次U-Net评估以及高度异构的内核行为,其部署仍受到大量GPU资源需求的限制。本文对最先进的医学扩散模型Med-DDPM在三代NVIDIA架构上进行了全面的性能分析,研究了内核级运行时分解、指令混合特征、内存系统利用率、线程束级活动以及分析器优先级得分估计。我们发现训练主要由cuDNN卷积和隐式GEMM内核主导,效率低下源于内存访问模式、张量布局转换和有限的Tensor Core利用率。基于这些洞察,我们评估了两种架构感知优化——TF32 Tensor Core激活和3D channels-last布局,并证明它们将SM周期减少多达100倍,动态指令减少100倍,Tensor Core利用率从1.45倍提高到9.98倍,并在A100上将IPC提高7%,且不降低合成质量。

英文摘要

Diffusion models have become essential for high-fidelity 3D MRI synthesis, yet their deployment remains constrained by substantial GPU resource demands arising from hundreds of U-Net evaluations per sample and a highly heterogeneous kernel behavior. This paper performs a comprehensive performance analysis of the state-of-the-art medical diffusion model, Med-DDPM, across three generations of NVIDIA architectures to study kernel-level runtime breakdowns, instruction-mix characteristics, memory system utilization, warp-level activities, and profiler priority-score estimates. We show that training is overwhelmingly dominated by cuDNN convolution and implicit-GEMM kernels, with inefficiencies arising from memory-access patterns, tensor-layout conversions, and limited Tensor Core utilization. Guided by these insights, we evaluate two architecture-aware optimizations TF32 Tensor Core activation and a 3D channels-last layout and demonstrate that they reduce SM cycles by up to 100x, cut dynamic instructions by 100x, raise Tensor Core utilization from 1.45 to 9.98x, and increase IPC by 7% on A100, all without degrading synthesis quality.

2606.19364 2026-06-19 cs.LG 新提交

Closing the Social-Semantic Gap: SPSD for Edge-Based Prompt Compression in Cloud LLM Inference

缩小社会-语义差距:SPSD用于云LLM推理中的边缘端提示压缩

Abhinit Sen, Ajeet Kumar, Manaranjan Pradhan

发表机构 * Indian School of Business(印度管理学院)

AI总结 针对云LLM推理中提示词预填充阶段能耗高的问题,提出SPSD边缘端管道,利用4比特量化小语言模型压缩用户提示,在保持响应质量非劣效的前提下,平均节省99.9个输入token,每调用净节能70-270 uWh。

Comments 19 pages, 7 tables, 1 figure, includes appendix

详情
AI中文摘要

大语言模型(LLM)推理的预填充阶段正成为云规模能耗的日益增长的贡献者。许多面向消费者的支持和对话提示包含社会性支架:礼貌标记、道歉性开场白、重复以及建立融洽关系的语言,这些对人类交流很重要,但对机器推理而言边际信息量较低。我们将这种差异称为社会-语义差距。我们提出SPSD(情感保留语义蒸馏),一种边缘端管道,在传输到云端部署的LLM之前,使用4比特量化的小语言模型压缩用户提示。在248个提示的语料库上,使用Gemma-2-2B-Instruct(Q4_K_M)作为SLM、Llama-3.1-8B-Instruct作为云端评估模型进行评估,每次蒸馏调用平均输入token节省99.9个,所有146次蒸馏调用均产生正向节省。通过盲法LLM-as-judge评分对121对进行评估,响应质量在15分制中预先指定的1分非劣效范围内不劣于原始路径;评审员给出43%平局、28%蒸馏胜出和29%原始胜出。余弦相似度结果不一:均值0.682,中位数0.712,54.1%的配对高于0.70参考阈值。安全关键领域通过基于规则的网关保守地路由至直通模式。在所述假设下,每次调用净节能估计为70-270 uWh。SPSD表明,设备端提示蒸馏可以在保持响应质量在实际非劣效范围内的同时,降低云LLM的输入token成本。

英文摘要

The prefill stage of Large Language Model (LLM) inference is a growing contributor to cloud-scale energy cost. Many consumer-support and conversational prompts contain social scaffolding: politeness markers, apologetic preamble, repetition, and rapport-building language that is important for human communication but carries low marginal information for machine reasoning. We call this discrepancy the Social-Semantic Gap. We present SPSD (Sentiment Preserving Semantic Distillation), an edge-based pipeline that compresses user prompts using a 4-bit quantised Small Language Model before transmission to a cloud-deployed LLM. Evaluation on a 248-prompt corpus using Gemma-2-2B-Instruct (Q4_K_M) as the SLM and Llama-3.1-8B-Instruct as the cloud evaluation model yields a mean input token saving of 99.9 tokens per distilled call, with all 146 distilled calls yielding positive savings. Response quality, assessed by blind LLM-as-judge scoring across 121 pairs, is non-inferior to the raw path within a pre-specified 1-point margin on a 15-point rubric; the judge awarded 43 percent ties, 28 percent distilled wins, and 29 percent raw wins. Cosine similarity is mixed: mean 0.682, median 0.712, with 54.1 percent of pairs above the 0.70 reference threshold. Safety-critical domains are conservatively routed to passthrough via rule-based gates. Per-call net energy saving is estimated at 70-270 uWh under stated assumptions. SPSD shows that on-device prompt distillation can reduce cloud LLM input-token cost while preserving response quality within a practical non-inferiority margin.

2606.19363 2026-06-19 cs.LG 新提交

When to Trust, How to Distill: Multi-Foundation Model Guidance for Lightweight, Robust Scientific Time Series Forecasting

何时信任,如何蒸馏:面向轻量级鲁棒科学时间序列预测的多基础模型指导

Rupasree Dey, Abdul Matin, Nathan Orwick, Yao Zhang, Shrideep Pallickara, Sangmi Lee Pallickara

发表机构 * Colorado State University(科罗拉多州立大学)

AI总结 提出Guard框架,通过上下文路由器和不确定性门控温度机制,从多个分布偏移的基础模型中蒸馏知识,训练轻量级预测器,在气象、碳通量等四个领域降低RMSE。

Comments KDD 2026, paper decision: Accepted, track: AI for Science. total 12 pages including references and appendix

详情
AI中文摘要

时间序列基础模型(TSFMs)在物理科学中的部署受到一个关键权衡的阻碍:虽然这些模型编码了丰富、通用的时间动态,但当零样本应用于特定科学领域时,它们会遭受严重的分布错位,并且其计算成本阻碍了在边缘计算传感器网络中的部署。我们解决了一个基本挑战:如何从错位的基础模型(FM)中提取潜在的结构知识,以训练轻量级、专门的预测器?我们提出了用于蒸馏的门控不确定性感知路由(Guard),这是一个新颖的框架,将多教师蒸馏重新定义为实例级决策过程,具有两种自适应机制:(1)上下文路由器,根据局部输入统计动态选择最相关的教师,利用不同基础模型之间的互补性;(2)不确定性门控温度机制,充当“断路器”,当教师置信度与领域现实偏离时自动减弱蒸馏强度。我们在四个气候关键领域评估了我们提出的轻量级框架:气象学、生态系统碳通量、土壤湿度和能源电网。我们的方法相对于固定权重的多教师蒸馏基线显著降低了RMSE,成功地从预训练的FM(教师)中蒸馏知识,即使由于原始和目标数据域之间的分布偏移,它们表现出次优的零样本准确性。我们证明,这些领域错位的教师仍然可以作为关键的纠正者,在28.5%的最难实例上优于全局优越的FM。最终,这使得适用于资源受限边缘部署的高精度科学预测成为可能。代码可在https://this URL获取。

英文摘要

The deployment of Time-Series Foundation Models (TSFMs) in physical sciences is hindered by a critical trade-off: while these models encode rich, universal temporal dynamics, they suffer from severe distributional misalignment when applied zero-shot to specific scientific domains, and their computational cost prohibits deployment in edge-computing sensor networks. We address a fundamental challenge: How can we extract latent structural knowledge from misaligned foundation models (FM) to train lightweight, specialized forecasters? We propose Gated Uncertainty-Aware Routing for Distillation (Guard), a novel framework that reframes multiteacher distillation as an instance-wise decision process with two adaptive mechanisms: (1) a Contextual Router that dynamically selects the most relevant teacher based on local input statistics, exploiting complementarity across diverse foundation models; and (2) an Uncertainty-Gated Temperature mechanism that acts as a "circuit-breaker," automatically attenuating distillation strength when teacher confidence diverges from domain reality. We evaluate our proposed lightweight framework on four climate-critical domains: meteorology, ecosystem carbon flux, soil moisture, and energy grids. Our method significantly reduces RMSE relative to a fixed-weight multi-teacher distillation baseline, successfully distilling knowledge from pretrained FMs (teachers) even when they exhibit suboptimal zero-shot accuracy due to distribution shift between the original and target data domains. We demonstrate that these domain-misaligned teachers can still serve as critical correctives, outperforming the globally superior FMs on 28.5% of the hardest instances. Ultimately, this enables high-precision scientific forecasting suitable for resource-constrained edge deployment. Code is available at https://github.com/RupasreeDey/GUARD-KDD2026.

2606.19358 2026-06-19 cs.RO 新提交

WorkBenchMark: A LEGO-Based Assembly Benchmark with an Assembly-by-Disassembly Baseline for the Smart Manufacturing League

WorkBenchMark:面向智能制造联盟的基于乐高积木的装配基准与通过拆卸进行装配的基线方法

Wenbo Ma, Daniel Swoboda, Matteo Tschesche, Till Hofmann

发表机构 * Chair of Machine Learning and Reasoning (i6), RWTH Aachen University(亚琛工业大学机器学习与推理教席(i6)) MASCOR Institute, FH Aachen University of Applied Science(亚琛应用技术大学MASCOR研究所)

AI总结 提出一个基于乐高Duplo的机器人装配基准,包含400个任务和四个复杂度层级,并提供一个基于规划的基线方法,在所有层级上优于现代视觉-语言-动作方法。

Comments RoboCup Symposium 2026 accepted paper

详情
AI中文摘要

我们介绍了WorkBenchMark,一个受RoboCup智能制造联盟启发的基于乐高Duplo的机器人装配基准。机器人装配将低层操作与物理约束下的任务级符号推理相结合,当前端到端学习方法尚未可靠解决这一组合。该基准提供跨四个复杂度层级的400个任务。我们提供了一个开放词汇的感知、通过拆卸进行装配的基线解决方案。我们的基于规划的流水线在所有层级上优于现代视觉-语言-动作方法。该基准、仿真环境和基线实现将公开发布,以支持更广泛的机器人装配社区。

英文摘要

We introduceWorkBenchMark, a LEGO Duplo-based robotic assembly benchmark motivated by the RoboCup Smart Manufacturing League. Robotic assembly couples low-level manipulation with task-level symbolic reasoning under physical constraints, a combination that current end-to-end learning methods do not yet solve reliably. The benchmark provides 400 tasks across four complexity tiers. We provide an open-vocabulary perception, Assembly-by-Disassembly baseline solution. Our planning-based pipeline outperforms a modern vision-language-action approach across all tiers. The benchmark, simulation environment, and baseline implementation will be released openly to support the broader robotic assembly community.

2606.19357 2026-06-19 cs.RO cs.AI 新提交

Physical Atari: A Robust and Accessible Platform for Real-time Reinforcement Learning on Robots

Physical Atari: 一个用于机器人实时强化学习的鲁棒且可访问的平台

Khurram Javed, Joseph Modayil, Gloria Kennickell, Richard S. Sutton, John Carmack

发表机构 * Keen Technologies University of Alberta, Canada(阿尔伯塔大学,加拿大) Openmind Research Institute(Openmind研究机构)

AI总结 提出Physical Atari平台,通过机器人操作Atari控制器和实时渲染游戏帧,实现物理世界中的强化学习研究,验证了算法可直接在机器人上学习,并指出分布偏移会显著降低策略性能。

Comments To appear at RLC 2026

详情
AI中文摘要

我们构建了一个名为Robotroller的机器人,它能够操作Atari CX40+控制器,以及一个名为Atari Devbox的设备,该设备在屏幕上渲染来自Arcade Learning Environment的游戏帧和奖励信号。Robotroller和Atari Devbox,连同现成的摄像头和台式计算机,构成一个可用于研究物理世界中强化学习算法的系统。我们将整个系统称为Physical Atari。在本文中,我们详细介绍了使Physical Atari成为一个鲁棒且可访问平台的关键决策。为了使系统鲁棒,我们设计了Robotroller,使得所有运动都通过轴承完成,从而减少磨损。此外,我们编写了软件,以高频监控伺服电机的状态并进行干预以限制应力。为了使系统可访问,我们使用了价格合理的现成组件和可通过消费级3D打印机制造的零件。Physical Atari的建造成本低于1000美元,并且已用于数周不间断的强化学习实验,未出现任何机械故障。我们用它验证了强化学习算法可以直接在机器人上学习,并表明即使学习和部署之间的微小分布偏移也会显著降低策略的性能。我们的结果强调了设备端适应对于在机器人上获得强性能的重要性。

英文摘要

We built a robot called the Robotroller that actuates an Atari CX40+ controller and a device called the Atari Devbox that renders the game frame and the reward signal from the Arcade Learning Environment on a screen. The Robotroller and the Atari Devbox, together with an off-the-shelf camera and a desktop computer, constitute a system that can be used to study reinforcement learning algorithms in the physical world. We call the full system Physical Atari. In this paper, we detail the key decisions that make Physical Atari a robust and accessible platform. To make the system robust, we designed the Robotroller so that all movement is done through bearings, which reduces wear. Additionally, we wrote software that monitors the state of the servos at a high frequency and intervenes to limit stress. To make the system accessible, we used affordable off-the-shelf components and parts that can be manufactured using consumer 3D printers. Physical Atari can be built for under $1,000 and has been used for weeks of non-stop reinforcement learning experiments without any mechanical failures. We used it to validate that reinforcement learning algorithms can learn directly on robots and show that even small distribution shifts between learning and deployment can significantly degrade the performance of policies. Our results underscore the importance of on-device adaptation for strong performance on robots.

2606.19356 2026-06-19 cs.CL cs.AI 新提交

Trustworthy Multi-Agent Systems: Mitigating Semantic Drift with the Argent Signaling Protocol

可信多智能体系统:使用Argent信令协议缓解语义漂移

Anantha Sharma

发表机构 * Synechron Inc(Synechron公司)

AI总结 提出Argent信令协议(ASP),通过结构化质量信号区分可修复与不可修复的失败,在文档问答和多智能体系统中分别提升通过率和阻断无依据传播。

Comments 17 pages

详情
AI中文摘要

当多智能体LLM系统产生错误答案时,并非所有失败都相同:有些答案基于正确材料但不完整,而另一些则完全无依据且应被阻止。当前的重新尝试策略对两种情况一视同仁(重试并希望最好),使得人类监督者无法判断重试是否合理或系统是否应停止。我们引入Argent信令协议(ASP),这是一种紧凑的机器可读头部,为每个AI生成的响应附带结构化质量信号:确定性(@C)、依据性(@G)、随机性(@S)以及一个假设索引,用于分类每个声明的证据基础。这些信号使控制器能够区分可修复失败与遏制失败,并对每种情况进行不同路由。我们在两种模式下评估ASP。在独立模式下,基于Array BioPharma/Ono许可协议的27个问题的文档问答基准,比较基线提示与ASP仪器化控制器动作在三个本地GGUF模型上的表现。在Qwen~(0.8B)上,ASP将通过率从11.1%提升至33.3%,平均术语覆盖率从36.7%提升至65.4%;在Dobby~(8B)上,ASP产生4次失败到通过的恢复,通过率从33.3%提升至44.4%;在SmolLM3~(3B)上,ASP在每次问题中交替进行修复和遏制。总体改进显著(从12/81通过到21/81通过)。在多智能体模式下,ASP侧车位于检索智能体和下游决策智能体之间;侧车100%阻止无依据的上游输出到达下游智能体(24/27被阻止,0次无依据传播)。

英文摘要

When multi-agent LLM systems produce bad answers, not all failures are equal: some answers are grounded in the right material but incomplete, while others are simply ungrounded and should be stopped. Current retry strategies treat both cases identically (try again and hope for the best), leaving human supervisors unable to tell whether a retry was warranted or whether the system should have halted instead. We introduce the Argent Signaling Protocol (ASP), a compact machine-readable header that accompanies every AI-generated response with structured quality signals: certainty (@C), grounding (@G), stochasticity (@S), and an assumption index that classifies the evidentiary basis of each claim. These signals enable a controller to distinguish repairable failures from containment failures and route each case differently. We evaluate ASP in two modes. In standalone mode, a 27-question document-grounded QA benchmark over the Array BioPharma/Ono license agreement compares baseline prompts against ASP-instrumented controller actions across three local GGUF models. On Qwen~(0.8B), ASP improves pass rate from 11.1% to 33.3% and mean term coverage from 36.7% to 65.4%; on Dobby~(8B), ASP produces 4 fail-to-pass recoveries, raising pass rate from 33.3% to 44.4%; on SmolLM3~(3B), ASP alternates between repair and containment per question. Aggregate improvement is meaningful (12/81 to 21/81 passes). In multi-agent mode, an ASP sidecar sits between a retrieval agent and a downstream decision agent; the sidecar blocks 100% of ungrounded upstream outputs from reaching the downstream agent (24/27 blocked, 0 ungrounded propagations).

2606.19354 2026-06-19 cs.CL cs.LG 新提交

Granularity-Regulated Adaptive Computational Efficiency for Optimal Verification in Test-Time Scaling

粒度调控的自适应计算效率:测试时扩展中的最优验证

Ardit Krasniqi, Luan Vejsiu, Elira Dervishi

发表机构 * European University of Tirana(欧洲地拉那大学)

AI总结 提出GRACE理论框架,将验证粒度建模为问题难度、验证器准确率和计算预算的函数,证明存在相变:细粒度验证在计算预算大或问题难时占优,粗粒度验证在低预算简单问题时更优,自适应策略可达到计算-性能帕累托前沿。

详情
AI中文摘要

测试时扩展(TTS)已成为一种强大的范式,通过在推理时投入额外计算来提升大语言模型(LLMs)的推理性能。TTS的核心组件是验证器,它选择或评分候选解以引导搜索过程。虽然先前工作已探索验证的益处,但一个基本问题仍未充分探索:在给定计算预算下,最优验证粒度是什么?粗粒度的结果奖励模型(ORMs)和细粒度的过程奖励模型(PRMs)代表两个极端,但两者单独均无法在所有场景下实现计算最优性。本文建立了一个统一的理论框架,称为GRACE(粒度调控的自适应计算效率),该框架将最优验证粒度刻画为问题难度、验证器准确率和计算预算的显式函数。我们证明存在一个相变:当计算预算大或问题难时,细粒度验证占优;而在低预算、简单问题场景下,粗粒度验证更受青睐。我们的理论将Best-of-N、束搜索和步骤级MCTS统一在一个帕累托最优框架内,并激发了一种自适应粒度策略,该策略可证明达到计算-性能帕累托前沿。在MATH-500、GSM8K和AIME基准上的实验结果证实了所有四个理论主张,在匹配计算量下,我们的自适应策略相比固定粒度基线准确率提升高达3.1%。

英文摘要

Test-time scaling (TTS) has emerged as a powerful paradigm for improving the reasoning performance of large language models (LLMs) by investing additional compute at inference time. A central component of TTS is the \emph{verifier}, which selects or scores candidate solutions to guide the search process. While prior work has explored the benefit of verification, a fundamental question remains underexplored: \emph{what is the optimal granularity of verification under a given compute budget?} Coarse-grained outcome reward models (ORMs) and fine-grained process reward models (PRMs) represent two extremes, yet neither alone achieves compute-optimality across all regimes. In this paper, we establish a unified theoretical framework, called \textbf{GRACE} (\underline{G}ranularity-\underline{R}egulated \underline{A}daptive \underline{C}omputational \underline{E}fficiency), that characterizes the optimal verification granularity as an explicit function of problem difficulty, verifier accuracy, and compute budget. We prove that there exists a phase transition: fine-grained verification dominates when either the compute budget is large or the problem is hard, whereas coarse-grained verification is preferred in the low-budget, easy-problem regime. Our theory unifies Best-of-$N$, beam search, and step-level MCTS within a single Pareto-optimality framework, and motivates an adaptive granularity strategy that provably achieves the compute-performance Pareto frontier. Empirical results on MATH-500, GSM8K, and AIME benchmarks corroborate all four theoretical claims, with our adaptive strategy outperforming fixed-granularity baselines by up to 3.1\% accuracy at matched compute.

2606.19353 2026-06-19 cs.CL cs.LG 新提交

Quantifying Aleatoric Uncertainty of In-Context Learning for Robust Measure of LLM Prediction Confidence

量化上下文学习中的偶然不确定性以稳健衡量LLM预测置信度

Jinseok Chung, Minkyoung Song, Hyunji Jung, Namhoon Lee

发表机构 * POSTECH(浦项科技大学)

AI总结 针对上下文学习(ICL)中预测对提示设计敏感的问题,提出基于贝叶斯观点和机制可解释性的自函数向量,直接估计偶然不确定性,并设计严格评估协议,在合成和真实数据集上验证了方法的可靠性及在幻觉检测等应用中的实用性。

Comments Accepted to ACL 2026

详情
AI中文摘要

上下文学习(ICL)使LLM能够从少量示例中适应新任务,但其可靠性仍存疑虑:预测对提示设计和模型理解上下文的能力高度敏感,使得失败源于数据特性还是模型限制难以区分。不确定性分解——将偶然不确定性从认知不确定性中分离——在此场景中尤为关键,然而现有方法针对标准生成任务设计,未能捕捉ICL的独特动态。为解决此问题,我们引入基于贝叶斯观点和ICL机制可解释性的自函数向量概念。这些向量利用模型内部表示来建模上下文提示中学习的潜在概念,从而在贝叶斯框架内直接估计偶然不确定性,并规避了对脆弱的输入或解码操作的依赖。鉴于缺乏既定基准和合适的评估协议,我们还提出了首个严格的评估协议,其中数据以受控方式被操纵,以便精确量化偶然不确定性并将其与认知不确定性分离。借助这一新的评估框架(最初基于合成任务进行概念开发,随后扩展到真实世界数据集),我们展示了所提出的方法比现有替代方法更可靠地衡量LLM在ICL下做出的预测的不确定性。此外,我们展示了它可作为可信相关应用(如幻觉检测)的实用工具。我们的发现为将不确定性的量化观点与模型行为的机制理解联系起来开辟了新方向。

英文摘要

In-Context Learning (ICL) allows LLMs to adapt to new tasks from a few demonstrations, but its reliability remains a concern: predictions are highly sensitive to both prompt design and the model's ability to understand the context, obscuring whether failures arise from data properties or model limitations. Uncertainty decomposition-separating aleatoric from epistemic sources-is particularly crucial in this setting, yet existing methods, designed for standard generation tasks, fail to capture the unique dynamics of ICL. To address this, we introduce a concept of self-function vectors, built upon Bayesian views and the mechanistic interpretability of ICL. These vectors leverage internal model representations to model the latent concept learned during in-context prompting, thereby enabling a direct estimation of aleatoric uncertainty within a Bayesian framework and circumventing the reliance on brittle input or decoding manipulations. Given the lack of established benchmarks and suitable evaluation protocols, we also propose the first and rigorous evaluation protocol, in which data is manipulated in controlled ways so as to quantify aleatoric uncertainty precisely and separately from epistemic uncertainty. With this new evaluation framework, initially grounded in synthetic tasks for conceptual development and subsequently extended to real-world datasets, we show that our proposed methodology can measure uncertainty of LLM predictions made under ICL more reliably than existing alternative methods. Moreover, we show it can be used as a practical tool for trustworthy-related applications, such as hallucination detection. Our findings pave a new direction for connecting the quantitative view of uncertainty with the mechanistic understanding of model behavior.

2606.19352 2026-06-19 cs.CL cs.AI 新提交

Sign-Language Datasets at Scale: A Comprehensive Survey on Resources, Benchmarks, and Annotation Standards

大规模手语数据集:资源、基准和标注标准的综合调查

Yiming Ni, Zhi-Qi Cheng, Jiayu Li, Wei Cheng

发表机构 * Tacoma School of Engineering & Technology, University of Washington(华盛顿大学塔科马工程与技术学院)

AI总结 本文调查了35种手语的120个数据集,分析了模态不平衡、标注粒度和手语者偏差等挑战,并提出了24字段手语数据表以支持标准化文档和可复现评估。

Comments Accepted to ACL 2026 Main. 27 pages, 5 figures

详情
AI中文摘要

手语是聋人和听障社区使用的表达性视觉语言。尽管在手语识别、翻译和生成方面取得了显著进展,但由于数据集碎片化、标注不一致以及语言覆盖有限,进展仍然受到制约。现有的基准往往无法反映现实世界的通信需求,对这些局限性的系统分析仍然有限。在本调查中,我们提出了一个全面的手语数据集索引,涵盖了35种手语的120个资源。我们分析了关键挑战,如模态不平衡、标注粒度和手语者偏差,并概述了未来数据集设计的考虑因素。我们还引入了一个24字段的手语数据表,并发布了一个公共GitHub仓库(此 https URL ),以支持标准化文档和可复现评估。总体而言,我们的工作为在现实应用中开发包容、稳健和可扩展的手语技术提供了统一且实用的基础。

英文摘要

Sign languages are expressive visual languages used by Deaf and Hard-of-Hearing (DHH) communities. Despite substantial progress in sign-language recognition, translation, and production, advances remain constrained by fragmented datasets, inconsistent annotations, and limited linguistic coverage. Existing benchmarks often fail to reflect real-world communication needs, and systematic analyses of these limitations remain limited. In this survey, we present a comprehensive index of sign-language datasets, covering 120 resources across 35 sign languages. We analyze key challenges such as modality imbalance, annotation granularity, and signer bias, and outline considerations for future dataset design. We also introduce a 24-field Sign-Language Datasheet and release a public GitHub repository (https://github.com/Ginqwerty/Open-Sign-Language) to support standardized documentation and reproducible evaluation. Overall, our work provides a unified and practical foundation for developing inclusive, robust, and scalable sign-language technologies in real-world applications.

2606.19351 2026-06-19 cs.CL cs.AI 新提交

Detecting Hallucinations for Large Language Model-based Knowledge Graph Reasoning

基于大语言模型的知识图谱推理中的幻觉检测

Xinyan Zhu, Yaoqi Liu, Yue Gao, Huadong Ma, Cheng Yang, Chuan Shi

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Tsinghua University(清华大学)

AI总结 提出LUCID方法,结合LLM注意力分数、知识图谱语义和结构信息,利用图神经网络检测LLM在知识图谱推理中的幻觉,在九个数据集上达到最优性能。

详情
AI中文摘要

知识图谱推理从现有事实中推断新知识,广泛应用于问答、推荐和决策支持。随着大语言模型(LLM)的快速发展,基于LLM的知识图谱推理框架通过利用检索到的知识图谱信息变得越来越流行。然而,LLM中的幻觉仍然是一个关键问题。即使融入了相关的知识图谱知识,模型仍可能生成错误输出,导致错误信息和不可靠的决策。现有的幻觉检测方法要么关注LLM内部状态,要么验证与检索上下文的一致性,但两者都忽略了知识图谱中的结构信息,导致性能次优。为了解决这一差距,我们提出了LUCID,这是首个针对基于LLM的知识图谱推理框架的幻觉检测方法。LUCID联合利用LLM注意力分数、知识图谱语义和结构信息。具体来说,它从注意力分数和语义相似度中提取节点和边特征,并使用图神经网络将其与知识图谱结构集成。我们还构建了人工标注的基准数据集用于评估。在九个数据集上的实验表明,与15个基线相比,LUCID达到了最先进的性能。

英文摘要

Knowledge graph (KG) reasoning infers new knowledge from existing facts and is widely applied in question answering, recommendation, and decision support. With the rapid development of large language models (LLMs), LLM-based KG reasoning frameworks have become increasingly popular by leveraging retrieved KG information. However, hallucinations in LLMs remain a critical issue. Even when relevant KG knowledge is incorporated, models may still generate incorrect outputs, leading to misinformation and unreliable decisions. Existing hallucination detection methods either focus on LLM internal states or verify consistency with retrieved contexts, but both overlook the structural information in KGs, resulting in suboptimal performance. To address this gap, we propose LUCID, the first halLUcination deteCtIon method for LLM-based knowleDge graph reasoning frameworks. LUCID jointly leverages LLM attention scores, KG semantics, and structural information. Specifically, it extracts node and edge features from attention scores and semantic similarities, and integrates them with KG structure using a graph neural network. We also construct manually annotated benchmark datasets for evaluation. Experiments on nine datasets show that LUCID achieves state of the art performance compared to 15 baselines.

2606.19350 2026-06-19 cs.CL 新提交

Pruning via Causal Attribution Preserves Reasoning Performance in Large Language Models

基于因果归因的剪枝保留大型语言模型的推理性能

Amogh Sheth, Biruk Assefa, Yi Wen Huang, Andrew Lin, Yuhao Ge

发表机构 * Edison Academy Magnet School(爱迪生学院磁石学校) Massachusetts Institute of Technology(麻省理工学院) State University of New York College at Plattsburgh(纽约州立大学普拉茨堡学院) The University of Texas at Austin(德克萨斯大学奥斯汀分校) Independent Researcher(独立研究员)

AI总结 提出无需训练的因果归因剪枝(CAP)方法,通过测量注意力头对推理任务的因果影响进行细粒度剪枝,在20%稀疏度下相比Wanda在ARC-Challenge上准确率提升高达61%。

Comments Accepted at the ICLR 2026 Workshop on LLM Reasoning. 13 pages, 2 figures

详情
AI中文摘要

大型语言模型(LLMs)在多步推理方面表现出色,但推理成本高昂。我们引入了因果归因剪枝(CAP),一种无需训练的方法,通过测量注意力头对推理任务的因果影响来识别关键注意力头,并利用这些头级分数指导细粒度的权重剪枝。对于每个注意力头,CAP估计在推理问题的小型校准集上前向传播时掩码该头所导致的预期性能下降。这些因果分数随后被转换为对应投影矩阵的权重级重要性值。与仅基于幅度或激活的标准不同,CAP的干预测量直接捕捉每个头的功能贡献,在20%稀疏度下,相比Wanda在ARC-Challenge上获得高达61%的相对准确率提升。我们在GSM8K、StrategyQA和ARC-Challenge上使用Llama-3-8B-Instruct和Mistral-7B-Instruct在10%、20%和50%稀疏度下评估CAP。在中等稀疏度(10-20%)下,CAP在大多数模型-基准配置中优于Wanda,尤其在Llama-3的ARC-Challenge上提升显著。我们的结果表明,在相同稀疏度下,注意力头级因果归因比相关性剪枝标准能更好地保留下游基准的推理性能,但在50%稀疏度下仍受限于粗粒度的MLP归因。

英文摘要

Large language models (LLMs) excel at multi-step reasoning but incur substantial inference cost. We introduce Causal Attribution Pruning (CAP), a training-free method that identifies critical attention heads by measuring their causal impact on reasoning tasks and uses these head-level scores to guide fine-grained weight pruning. For each attention head, CAP estimates the expected performance degradation when the head is masked during forward passes on a small calibration set of reasoning problems. These causal scores are then converted into weight-level importance values for the corresponding projection matrices. Unlike magnitude-only or activation-based criteria, CAP's interventional measurement directly captures each head's functional contribution, yielding relative accuracy gains of up to 61% over Wanda on ARC-Challenge at 20% sparsity. We evaluate CAP on GSM8K, StrategyQA, and ARC-Challenge using Llama-3-8B-Instruct and Mistral-7B-Instruct at 10%, 20%, and 50% sparsity. At moderate sparsity (10-20%), CAP improves over Wanda in most model-benchmark configurations. with especially large gains on ARC-Challenge for Llama-3. Our results suggest that attention-head-level causal attribution can better preserve reasoning performance on downstream benchmarks than correlational pruning criteria at equivalent sparsity, while remaining limited by coarse MLP attribution at 50% sparsity.

2606.19349 2026-06-19 cs.CL cs.AI 新提交

Where to Place the Query? Unveiling and Mitigating Positional Bias in In-Context Learning for Diffusion LLMs via Decoding Dynamics

查询应置于何处?通过解码动力学揭示并缓解扩散大语言模型中上下文学习的位置偏差

Zhengheng Li, Panrui Li, Xuyang Liu, Puzhi Xia

发表机构 * Southeast University(东南大学)

AI总结 本文系统分析了扩散大语言模型中查询位置对生成质量的影响,发现其与示例语义质量同等重要,并提出基于平均置信度的无训练自适应路由策略Auto-ICL以优化查询放置。

Comments 9 figures, 4 tables

详情
AI中文摘要

尽管上下文学习(ICL)在自回归(AR)大语言模型(LLMs)中已被广泛研究,但其在扩散大语言模型(dLLMs)中的机制仍基本未被探索。与受单向因果掩码限制的AR模型不同,dLLMs本质上利用双向注意力,为查询放置提供了广泛的空间灵活性。不幸的是,当前实践通常继承AR风格的尾随查询模板,往往忽略了结构范式转变。本文通过全面分析揭示了查询位置实际上是dLLMs中的一阶变量。通过经验解耦,我们证明了位置方差对生成质量的影响与示例语义质量相当。在内部,这种位置敏感性源于注意力流中的空间“近因效应”以及解码轨迹中依赖于任务的偏移。为了在没有真实标签的情况下缓解这种不稳定性,我们揭示了传统的单步置信度($C_{decoded}$)在dLLMs中失效。相反,我们提出了平均置信度($\overline{C}$),一种跟踪迭代解码过程的新指标。通过建立基础的空间ICL基线,我们引入了Auto-ICL,一种无需训练的自适应路由策略,动态优化查询放置,在异构推理和感知任务中稳健地接近最优性能。

英文摘要

While In-Context Learning (ICL) is extensively studied in Autoregressive (AR) LLMs, its mechanism within Diffusion Large Language Models (dLLMs) remains largely unexplored. Unlike AR models restricted by unidirectional causal masking, dLLMs intrinsically utilize bidirectional attention, offering extensive spatial flexibility for query placement. Unfortunately, current practices conventionally inherit AR-style trailing-query templates, often overlooking the structural paradigm shift. This paper presents a comprehensive analysis unveiling that query position is actually a first-order variable in dLLMs. Through empirical decoupling, we demonstrate that positional variance impacts generation quality on par with example semantic quality. Internally, this positional sensitivity stems from a spatial ``Recency Effect'' in attention flow and task-dependent shifts in decoding trajectories. To mitigate this instability without ground-truth labels, we reveal that traditional single-step confidence ($C_{decoded}$) fails in dLLMs. Instead, we propose Average Confidence ($\overline{C}$), a novel metric tracking the iterative decoding process. By establishing the foundational spatial ICL baselines, we introduce Auto-ICL, a training-free adaptive routing strategy that dynamically optimizes query placement, robustly approaching oracle performance across heterogeneous reasoning and perception tasks.

2606.19348 2026-06-19 cs.CL cs.AI 新提交

DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

DeepSeek-V4: 迈向高效百万令牌上下文智能

DeepSeek-AI, Anyi Xu, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chenchen Ling, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chengyu Hou, Chenhao Xu, Chenze Shao, Chong Ruan, Conner Sun, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Donghao Li, Dongjie Ji, Erhang Li, Fang Wei, Fangyun Lin, Fangzhou Yuan, Feiyu Xia, Fucong Dai, Guangbo Hao, Guanting Chen, Guoai Cao, Guolai Meng, Guowei Li, Han Yu, Han Zhang, Hanwei Xu, Hao Li, Haofen Liang, Haoling Zhang, Haoming Luo, Haoran Wei, Haotian Yuan, Haowei Zhang, Haowen Luo, Haoyu Chen, Haozhe Ji, Hengqing Zhang, Honghui Ding, Hongxuan Tang, Huanqi Cao, Huazuo Gao, Hui Qu, Hui Zeng, J Yang, JQ Zhu, Jia Luo, Jia Song, Jia Yu, Jialiang Huang, Jialu Cai, Jian Liang, Jiangting Zhou, Jiasheng Ye, Jiashi Li, Jiaxin Xu, Jiewen Hu, Jieyu Yang, Jin Chen, Jin Yan, Jingchang Chen, Jingli Zhou, Jingting Xiang, Jingyang Yuan, Jingyuan Cheng, Jingzi Zhou, Jinhua Zhu, Jiping Yu, Joseph Sun, Jun Ran, Junguang Jiang, Junjie Qiu, Junlong Li, Junmin Zheng, Junxiao Song, Kai Dong, Kaige Gao, Kang Guan, Kexing Zhou, Kezhao Huang, Kuai Yu, Lean Wang, Lecong Zhang, Lei Wang, Leyi Xia, Li Zhang, Liang Zhao, Lihua Guo, Lingxiao Luo, Linwang Ma, Linyan Zhu, Litong Wang, Liyu Cai, Liyue Zhang, Longhao Chen, MS Di, MY Xu, Max Mei, Miaojun Wang, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingming Li, Mingxu Zhou, Minmin Han, Ning Wang, Panpan Huang, Panpan Wang, Peixin Cong, Peiyi Wang, Peng Zhang, Qiancheng Wang, Qihao Zhu, Qingyang Li, Qinyu Chen, Qiushi Du, Qiwei Jiang, Rui Tian, Ruifan Xu, Ruijie Lu, Ruiling Xu, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, Runqian Chen, Runqiu Yin, Runxin Xu, Ruomeng Shen, Ruoyu Zhang, Ruyi Chen, SH Liu, Shanghao Lu, Shangmian Sun, Shangyan Zhou, Shanhuang Chen, Shaofei Cai, Shaoheng Nie, Shaoqing Wu, Shaoyuan Chen, Shengding Hu, Shengyu Liu, Shiqiang Hu, Shirong Ma, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, Shuying Yu, Songyang Zhou, Tao Ni, Tao Yun, Tian Jin, Tian Pei, Tian Ye, Tianle Lin, Tianran Ji, Tianyi Cui, Tianyuan Yue, Tingting Yu, Tun Wang, W Zhang, WL Xiao, Wangding Zeng, Wei An, Weilin Zhao, Wen Liu, Wenfeng Liang, Wenjie Pang, Wenjing Luo, Wenjing Yao, Wenjun Gao, Wenkai Yang, Wenlve Huang, Wenqing Hou, Wentao Zhang, Wenting Ma, Xi Gao, Xiang He, Xiangwen Wang, Xianzu Wang, Xiao Bi, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaokang Zhang, Xiaotao Nie, Xiaowen Sun, Xiaoxiang Wang, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xingchen Liu, Xingkai Yu, Xingyou Li, Xinyu Yang, Xinyu Zhang, Xu Chen, Xuanyu Wang, Xuecheng Su, Xueyin Chen, Xuheng Lin, Xuwei Fu, YC Yan, YQ Wang, YW Ma, Yanfeng Luo, Yang Zhang, Yanhong Xu, Yanru Ma, Yanwen Huang, Yao Li, Yao Li, Yao Xu, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Qian, Yi Shao, Yi Yu, Yichao Zhang, Yifan Ding, Yifan Shi, Yijia Wu, Yiliang Xiong, Yiling Ma, Ying He, Ying Tang, Ying Zhou, Yingjia Luo, Yinmin Zhong, Yishi Piao, Yisong Wang, Yixiang Zhang, Yixiao Chen, Yixuan Tan, Yixuan Wei, Yiyang Ma, Yiyuan Liu, Yonglun Yang, Yongqiang Guo, Yongtong Wu, Yu Wu, YuKun Li, Yuan Cheng, Yuan Ou, Yuanfan Xu, Yuanhao Li, Yuduan Wang, Yuehan Yang, Yuer Xu, Yuhan Wu, Yuhao Meng, Yuheng Zou, Yukun Zha, Yunfan Xiong, Yupeng Chen, Yuping Lin, Yuqian Cao, Yuqian Wang, Yushun Zhang, Yuting Yan, Yutong Lin, Yuxian Gu, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuxuan Zhou, Yuyang Zhou, Yuzhen Huang, ZF Wu, Zehao Wang, Zehua Zhao, Zehui Ren, Zekai Zhang, Zhangli Sha, Zhe Fu, Zhe Ju, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zheren Gao, Zhewen Hao, Zhibin Gou, Zhicheng Ma, Zhigang Yan, Zhihong Shao, Zhixian Huang, Zhixuan Chen, Zhiyu Wu, Zhizhou Ren, Zhongyu Wu, Zhuoshu Li, Zhuping Zhang, Zian Xu, Zihao Wang, Zihua Qu, Zihui Gu, Zijia Zhu, Zilin Li, Zipeng Zhang, Ziwei Xie, Ziyi Gao, Ziyi Wan, Zizheng Pan, Zongqing Yao

发表机构 * DeepSeek-AI(深度求索人工智能)

AI总结 提出DeepSeek-V4系列MoE模型,通过混合注意力架构、流形约束超连接和Muon优化器,实现百万令牌上下文的高效推理,在核心任务上超越前代。

详情
AI中文摘要

我们展示了DeepSeek-V4系列的预览版本,包括两个强大的混合专家(MoE)语言模型——DeepSeek-V4-Pro(1.6T参数,49B激活)和DeepSeek-V4-Flash(284B参数,13B激活),两者均支持一百万个令牌的上下文长度。DeepSeek-V4系列在架构和优化方面引入了多项关键升级:(1)混合注意力架构,结合压缩稀疏注意力(CSA)和重度压缩注意力(HCA),以提高长上下文效率;(2)流形约束超连接(mHC),增强传统残差连接;(3)Muon优化器,实现更快的收敛和更高的训练稳定性。我们在超过32T多样且高质量的令牌上预训练了两个模型,随后通过全面的后训练流程解锁并进一步增强其能力。DeepSeek-V4-Pro-Max是DeepSeek-V4-Pro的最大推理努力模式,重新定义了开放模型的最先进水平,在核心任务上超越了其前代。同时,DeepSeek-V4系列在长上下文场景中非常高效。在百万令牌上下文设置下,与DeepSeek-V3.2相比,DeepSeek-V4-Pro仅需27%的单令牌推理FLOPs和10%的KV缓存。这使得我们能够常规支持百万令牌上下文,从而使长时任务和进一步的测试时扩展更加可行。模型检查点可从此https URL获取。

英文摘要

We present a preview version of DeepSeek-V4 series, including two strong Mixture-of-Experts (MoE) language models -- DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated) -- both supporting a context length of one million tokens. DeepSeek-V4 series incorporate several key upgrades in architecture and optimization: (1) a hybrid attention architecture that combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to improve long-context efficiency; (2) Manifold-Constrained Hyper-Connections (mHC) that enhance conventional residual connections; (3) and the Muon optimizer for faster convergence and greater training stability. We pre-train both models on more than 32T diverse and high-quality tokens, followed by a comprehensive post-training pipeline that unlocks and further enhances their capabilities. DeepSeek-V4-Pro-Max, the maximum reasoning effort mode of DeepSeek-V4-Pro, redefines the state-of-the-art for open models, outperforming its predecessors in core tasks. Meanwhile, DeepSeek-V4 series are highly efficient in long-context scenarios. In the one-million-token context setting, DeepSeek-V4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared with DeepSeek-V3.2. This enables us to routinely support one-million-token contexts, thereby making long-horizon tasks and further test-time scaling more feasible. The model checkpoints are available at https://huggingface.co/collections/deepseek-ai/deepseek-v4.

2606.19347 2026-06-19 cs.CL cs.AI cs.PL 新提交

How LLMs Fail and Generalize in RTL Coding for Hardware Design?

LLM在硬件设计的RTL编码中如何失败与泛化?

Guan-Ting Liu, Chao-Han Huck Yang, Chenhui Deng, Zhongzhi Yu, Brucek Khailany, Yu-Chiang Frank Wang

发表机构 * NVIDIA Research(英伟达研究院)

AI总结 提出基于问题可解性的错误分类法,揭示LLM在RTL编码中受限于预训练知识,对齐技术仅教会编译,而推理能力才是关键瓶颈。

Comments Preview, under submission for EMNLP 2026

详情
AI中文摘要

将顺序编程先验转换为硬件设计的并行时序逻辑仍然是大型语言模型(LLM)的关键瓶颈。为了研究这一点,我们引入了一种新的错误分类法,该分类法基于问题可解性,受认知理论启发。我们的分类法将失败分为语法、语义、可解功能和不可解功能类型。评估揭示了VerilogEval基准上的严格经验上限,前沿模型初始通过率稳定在90.8%。这些平台期由不可解的功能错误定义,暴露出对测试时计算扩展免疫的持续知识差距。此外,我们揭示了一个显著的表面收敛差距:优化容易消除语法错误,但同时加剧了更深层次的功能失败。我们的发现表明,对齐技术仅仅教会模型编译。虽然重复采样策略可以修补可解错误,但寄存器传输级(RTL)编码能力仍然严格受限于预训练知识。解决当前基于LLM的硬件生成流水线中的挑战需要更多关于模型推理的研究,而不是对齐干预。

英文摘要

Translating sequential programming priors into the parallel temporal logic of hardware design remains a crucial bottleneck for large language models(LLM). To investigate this, we introduce a new error taxonomy grounded in problem solvability, inspired by cognitive theory. Our taxonomy categorizes failures into syntactic, semantic, solvable functional, and unsolvable functional types. Evaluations reveal a strict empirical ceiling on the VerilogEval benchmark, as frontier models plateau at a 90.8% initial pass rate. These plateaus are defined by unsolvable functional errors, exposing persistent knowledge gaps immune to test time compute scaling. Furthermore, we expose a striking surface convergence gap: optimization readily eliminates syntax errors but concurrently exacerbates deeper functional failures. Our findings demonstrate that alignment techniques merely teach models to compile. While repeated sampling strategies can patch solvable errors, register-transfer level(RTL) coding capacity remains strictly bounded by pretraining knowledge. Addressing challenges in the current LLM based hardware generation pipeline requires more studies in model reasoning rather than alignment interventions.

2606.19346 2026-06-19 cs.CL cs.AI 新提交

Disentangling Linguistic Relatedness from Task Alignment in Cross-Lingual Transfer

跨语言迁移中语言相关性与任务对齐的解耦

Ahmed Haj Ahmed, Ruochen Zhang, Alvin Grissom

发表机构 * Haverford College(哈弗福德学院) Brown University(布朗大学)

AI总结 通过微调大语言模型并在闪语族与非闪语族语言上评估零样本阅读理解,发现跨语言迁移主要提升任务格式对齐而非语言特定知识。

详情
AI中文摘要

我们通过微调七个大语言模型(4B--671B参数)在阿拉伯语上,并在闪语族语言和非闪语族对照语言上评估零样本阅读理解,研究跨语言迁移。在密集架构和混合专家架构中,我们没有发现闪语族特定迁移的证据:基线较弱的模型在所有语言上都有显著提升,而基线较强的模型无论语言族系如何,只有边际提升。思维链消融实验强化了这一发现——从微调中获益最多的模型同样从推理时推理中获益,这表明两种机制都解决了任务格式对齐问题,而非跨语言知识迁移。

英文摘要

We study cross-lingual transfer by fine-tuning seven large language models (4B--671B parameters) on Arabic and evaluating zero-shot reading comprehension on Semitic languages and non-Semitic controls. Across dense and Mixture-of-Experts architectures, we find no evidence of Semitic-specific transfer: models with weak baselines improve dramatically across all languages, while strong-baseline models show only marginal gains regardless of language family. A chain-of-thought ablation reinforces this finding -- the same models that benefit most from fine-tuning benefit equally from inference-time reasoning, suggesting both mechanisms address task-format alignment rather than cross-lingual knowledge transfer.

2606.19345 2026-06-19 cs.CL cs.AI 新提交

Ensembles of Large Language Models for Identifying EQ-5D Studies in PubMed Based on Their Abstracts

基于摘要识别PubMed中EQ-5D研究的大型语言模型集成

Zhyar Rzgar K. Rostam, Márta Péntek, János Tibor Czere, Zsombor Zrubka, László Gulácsi, Gábor Kertész

发表机构 * Doctoral School of Applied Informatics and Applied Mathematics, Obuda University(欧布达大学应用信息学与应用数学博士学院) John von Neumann Faculty of Informatics, Obuda University(欧布达大学约翰·冯·诺伊曼信息学学院) Doctoral School of Innovation Management, Obuda University(欧布达大学创新管理博士学院)

AI总结 提出多阶段框架集成Gemini和Gemma等LLM,通过少样本提示、权重集成和软堆叠元分类器,自动检测PubMed中EQ-5D研究,加权集成F1达0.74。

Comments 6 pages, 7 tables, 8 equations

详情
AI中文摘要

科学出版物的快速增长导致系统文献综述(SLR)中的人工研究筛选越来越耗费资源、效率低下且不一致。分类明确报告健康相关生活质量结果(如EQ-5D数据)的研究需要高水平的临床解释,并给人类评审者带来挑战。本研究探讨了使用Google的Gemini和Gemma大型语言模型(LLM)仅基于已发表摘要自动检测PubMed生物医学数据库中的EQ-5D。提出了一个多阶段框架,集成了少样本提示、权重集成聚合和软堆叠元分类器。在由两位专家手动标记的PubMed研究数据集上评估了九个LLM的EQ-5D报告情况。gemini-2.5-pro、gemma-3-12b和gemma-3-27b的加权集成获得了0.74的加权F1分数和0.74的准确率,超过了单独获得的结果。与单个模型相比,表现最佳模型的集成改善了精确率和召回率之间的平衡,而软堆叠方法提供了更高的可靠性和可解释性。特征分析表明,模型的概率结果在指导最终预测中很重要。研究结果表明,基于集成的LLM设置是自动化生物医学研究筛选的可靠且可扩展的方法。

英文摘要

The rapid increase in scientific publications leads to the fact that manual study screening in systematic literature reviews (SLRs) is increasingly resource consuming, inefficient, and inconsistent. Classifying studies that clearly report health-related quality-of-life results, such as EQ-5D data, requires a high level of clinical interpretation and poses challenges for human reviewers. This study investigates the use of Google's Gemini and Gemma large language models (LLMs) in automating EQ-5D detection in the PubMed biomedical database based only on published abstracts. A multi-phase framework is proposed that integrates few-shot prompting, weight ensembling aggregation, and a soft stacking meta-classifier. Nine LLMs are evaluated on a dataset of PubMed studies manually labeled by two experts regarding EQ-5D reporting. The weighted ensemble of gemini-2.5-pro, gemma-3-12b, and gemma-3-27b obtained a 0.74 weighted F1-score and 0.74 accuracy, exceeding individually attained results. The ensembling of top-performing models improved the balance between precision and recall compared to individual models, while the soft stacking approach provided greater reliability and interpretability. Feature analysis shows that the probability results from the models are important in guiding the final predictions. The findings suggest that an ensemble-based LLM setup is a reliable and scalable approach for automating screening in biomedical research.

2606.19344 2026-06-19 cs.CL cs.AI 新提交

Exposing the Unsaid: Visualizing Hidden LLM Bias through Stochastic Path Aggregation

揭示未言明之事:通过随机路径聚合可视化隐藏的LLM偏见

Matteo Pelossi, Rita Sevastjanova, Thilo Spinner, Mennatallah El-Assady

发表机构 * ETH Zurich(苏黎世联邦理工学院)

AI总结 提出TreeTracer工具,通过系统扰动分析、语法对齐聚合和分类感知节点合并,利用桑基图对比不同语义上下文,揭示LLM中隐藏的代表性和句法偏见。

Comments 14 pages

详情
AI中文摘要

大型语言模型(LLM)表现出表征性和句法性偏见,由于文本生成的随机性,这些偏见难以评估。标准审计方法依赖于单一输出检查或静态自动化指标,这些方法掩盖了底层概率分布,未能捕捉隐藏在低概率生成分支中的偏见。本文介绍了TreeTracer,一种通过聚合比较评估LLM偏见的可视化分析工具。该工具使用系统扰动分析流程,替换每个输入提示中由本体定义的术语,将数百次随机生成聚合成语法对齐的层次结构,然后使用辅助语言模型进行分类感知节点合并。生成的结构通过自定义桑基图可视化。通过并置两个本体驱动的树,工作空间能够直接比较语义上下文,并支持系统性偏见检测。由于任何可视化仅反映模型学习行为的一个子集,系统进一步应用对比推理来计算并直接显示跨上下文的反事实标记概率,从而降低误解偏见存在的风险。我们通过案例研究验证了该工作空间,比较了未对齐的基线模型GPT-2 XL与宪法对齐的Apertus模型。视觉聚合成功揭示了隐藏的代表性伤害,例如反事实代词抑制和对话中对个体的边缘化。初步用户研究证实,聚合比较界面降低了认知负荷,并有效支持分析人员检测系统性偏见。

英文摘要

Large Language Models (LLMs) exhibit representational and syntactic biases that are difficult to evaluate due to the stochastic nature of text generation. Standard auditing methods rely on a single output inspection or static automated metrics. These approaches obscure the underlying probability distributions and fail to capture biases hidden in lower-probability generation branches. This paper introduces TreeTracer, a visual analytics tool designed to evaluate LLM bias through aggregated comparison. Using a systematic perturbation analysis pipeline, the tool replaces ontology-defined terms in each input prompt, aggregates hundreds of stochastic generations into a syntax-aligned hierarchical structure, and then performs classification-aware node merging with an auxiliary language model. The resulting structure is visualized through a custom Sankey diagram. By juxtaposing two ontology-driven trees, the workspace enables direct comparison between semantic contexts and supports systematic bias detection. Because any visualization reflects only a subset of the model's learned behavior, the system further applies contrastive inference to compute and directly display counterfactual token probabilities across contexts, reducing the risk of misinterpreting the presence of bias. We validate the workspace through case studies comparing an unaligned baseline model GPT-2 XL against the constitutionally aligned Apertus models. The visual aggregation successfully exposes hidden representational harms, such as counterfactual pronoun suppression and conversational marginalization of individuals. A preliminary user study confirms that the aggregated comparative interface reduces cognitive load and effectively supports analysts in detecting systemic biases.

2606.20557 2026-06-19 cs.LG math.ST stat.ML stat.TH 新提交

Optimal Deterministic Multicalibration and Omniprediction

最优确定性多校准与全预测

Georgy Noarov, Aaron Roth

发表机构 * University of Pennsylvania(宾夕法尼亚大学)

AI总结 本文提出一种确定性算法,实现多校准的极小化最优样本复杂度,并推广到结果不可区分性,解决确定性预测器是否必要的问题。

详情
AI中文摘要

一个模型在一组群体权重 $G$ 上是多校准的,如果它是校准的——即即使以其预测为条件也是无偏的——不仅整体上,而且在通过每个 $g \in G$ 对上下文重新加权后也是如此。这对于许多下游应用是一个有用的性质,也是可信机器学习的基本要求。在这项工作之前,所有已知达到 $\varepsilon$-多校准的极小化最优 $\widetilde O(\varepsilon^{-3})$ 样本复杂度的预测器都是随机化的,而确定性预测器仅以更差的样本复杂度已知。多校准中随机化对于最优样本复杂度是否必要的问题由 [CLNR26] 明确提出,并在之前的几项工作中隐含提出。我们通过给出一个输出确定性预测器的极小化最优多校准算法解决了这个开放问题。然后我们将该算法推广到产生满足关于有限或有限覆盖测试集合的结果不可区分性(OI)的最优确定性预测器。作为一个应用,这也给出了具有最优样本复杂度的确定性全预测器和泛预测器,解决了 [OKK25] 和 [BHHLZ25] 提出的开放问题。

英文摘要

A model is multicalibrated on a collection of group weights $G$ if it is calibrated -- i.e. unbiased even conditional on its prediction -- not just overall, but also after reweighting contexts by each $g \in G$. It is a useful property for many downstream applications and is a basic desideratum of trustworthy machine learning. Before this work, all predictors known to attain the minimax-optimal $\widetilde O(\varepsilon^{-3})$ sample complexity rate for $\varepsilon$-multicalibration were randomized, while deterministic predictors were known only with substantially worse sample complexity. Whether randomization is necessary for optimal sample complexity in multicalibration was explicitly asked by [CLNR26] and implicitly in several prior works. We resolve this open problem by giving a minimax-optimal multicalibration algorithm that outputs a deterministic predictor. We then generalize the algorithm to produce optimal deterministic predictors that satisfy outcome indistinguishability (OI) with respect to finite or finitely covered collections of tests. As an application, this also gives deterministic omnipredictors and panpredictors with optimal sample complexity, resolving open problems posed by [OKK25] and [BHHLZ25].

2606.20547 2026-06-19 cs.LG cs.CV cs.GR cs.RO math.DG 新提交

The Token Is a Group Element: On Lie-Algebra Attention over Matrix Lie Groups

Token 是群元素:关于矩阵李群上的李代数注意力

Przemyslaw Musialski

发表机构 * New Jersey Institute of Technology(新泽西理工学院)

AI总结 提出李代数注意力机制,将token定义为矩阵李群元素,利用相对位姿的李代数范数作为注意力分数,无需学习核函数或表示论工具,适用于仿射全帧群等非紧致非阿贝尔群。

Comments preprint, 19 pages, 3 figures

详情
AI中文摘要

我们将注意力token置于群上:一个token是矩阵李群$G$的一个元素$g_i$——一个纯粹的变换,没有特征负载,也没有外部作用$\rho(g)$承载它。据我们所知,这是第一个token为裸矩阵李群元素的注意力构造:它们的分数是相对位姿的闭式代数范数,而非学习核,并且它达到了每个基于不可约表示或满射指数的方法必须排除的仿射全帧群。我们称之为李代数注意力。一旦token是群元素,其余部分无需通常的表示论机制。一对的相对几何是规范的,即$g_i^{-1} g_j$,因此成对不变量$w_{ij} = \log(g_i^{-1} g_j)$是内在的而非设计的;在$G$对角作用下的等变性是重言式的,且余循环条件自动成立。注意力分数是负平方代数范数$s_{ij} = -\|\log(g_i^{-1} g_j)\|_\lambda^2/\tau$:在块加权Frobenius内积下的规范邻近核,无需不可约表示、球谐函数、Clebsch-Gordan积或学习核。该构造适用于任何矩阵李群,在包含相对位姿的选定对数图上,包括具有尺度和剪切的非紧致非阿贝尔仿射群,这些是向量token注意力方法无法达到的:既不是不可约表示传统,也不是满射指数方法。在SE(2)、SO(3)和Aff(2)上的三个序列补全实验证实了这一点:闭式分数匹配了相同不变量上的学习MLP核,并在SE(2)上优于它,使用的分数参数少50到80倍,而向量token基线破坏了不变量,误差达五到十二个数量级。

英文摘要

We place the attention token on the group: a token is an element $g_i$ of a matrix Lie group $G$ -- a bare transformation, with no feature payload and no external action $ρ(g)$ carrying it. To our knowledge this is the first attention construction whose tokens are bare matrix Lie group elements: their score is the closed-form algebra norm of the relative pose rather than a learned kernel, and it reaches the affine full-frame groups that every irrep- or surjective-exp-based method must exclude. We call it Lie-Algebra Attention. Once tokens are group elements, the rest follows with none of the usual representation-theoretic machinery. The relative geometry of a pair is canonical, $g_i^{-1} g_j$, so the pairwise invariant $w_{ij} = \log(g_i^{-1} g_j)$ is intrinsic rather than designed; equivariance under the diagonal $G$-action is tautological, and the cocycle condition holds automatically. The attention score is the negative squared algebra norm, $s_{ij} = -\|\log(g_i^{-1} g_j)\|_λ^2/τ$: the canonical proximity kernel under a block-weighted Frobenius inner product, with no irreducible representations, spherical harmonics, Clebsch-Gordan products, or learned kernel. The construction applies to any matrix Lie group on a chosen logarithm chart containing the relative poses, including the non-compact non-abelian affine groups with scale and shear that no vector-token attention method reaches: neither the irrep tradition nor surjective-exp methods. Three sequence-completion experiments, on SE(2), SO(3), and Aff(2), bear this out: the closed-form score matches a learned MLP kernel on the same invariant and outperforms it on SE(2), using 50 to 80x fewer score parameters, while a vector-token baseline breaks invariance by five to twelve orders of magnitude.

2606.20442 2026-06-19 cs.LG cs.NA cs.NE math.NA 新提交

Evolutionary Two-Stage Hyperparameter Optimization Strategies for Physics-Informed Neural Networks

物理信息神经网络的进化两阶段超参数优化策略

Fedor Buzaev, Dmitry Efremenko, Egor Bugaev, Andrei Ermakov, Denis Derkach, Daria Pugacheva, Fedor Ratnikov

发表机构 * HSE University(高等经济大学) AXXX

AI总结 针对物理信息神经网络训练不稳定、超参数敏感的问题,提出基于进化算法的两阶段优化策略,先低保真筛选再全训练,在三个PDE问题上显著降低误差。

Comments Equal advising: Daria Pugacheva and Fedor Ratnikov. Accepted to the ICLR 2026 Workshop on AI and PDEs

详情
AI中文摘要

物理信息神经网络(PINNs)通过将物理定律嵌入神经网络训练来求解偏微分方程(PDE)。然而,由于物理信息损失的高度非凸和多项结构,其性能受到不稳定收敛、训练平台期以及对架构和优化超参数的强敏感性的影响。在这种情况下,外循环超参数搜索是一个在异构参数上的噪声黑盒优化问题,经典的局部或基于梯度的策略容易陷入次优区域。进化算法凭借其基于种群的探索能力和处理混合、不可微搜索空间的能力,为发现有前景的配置提供了更稳健的机制。我们提出并研究了一种基于进化算法的两阶段方法,该方法结合了PINNs训练的探索和利用部分,以在固定计算预算下提高解的精度和鲁棒性。在第一阶段,我们执行具有截断轮次的低保真训练运行,以快速筛选候选配置,将超参数选择视为黑盒外循环问题。在第二阶段,只有最有希望的候选者使用标准基于梯度的优化器进行完全训练以细化解。在三个流行问题(即平流方程、Klein-Gordon方程和Helmholtz方程)上评估,我们的方法一致优于标准训练,并在受限计算资源内实现了显著更低的平均误差。

英文摘要

Physics-Informed Neural Networks (PINNs) solve Partial Differential Equations (PDEs) by embedding physical laws into neural network training. However, their performance suffers from unstable convergence, training plateaus, and strong sensitivity to architectural and optimization hyperparameters due to the highly non-convex and multi-term structure of the physics-informed loss. In this setting, the outer-loop hyperparameter search is a noisy and black-box optimization problem over heterogeneous parameters, where classical local or gradient-based strategies are easily trapped in suboptimal regions. Evolutionary algorithms, with their population-based exploration and ability to handle mixed, non-differentiable search spaces, provide a more robust mechanism for discovering promising configurations. We propose and investigate a two-stage approach based on evolutionary algorithms that combines exploration and exploitation parts of PINNs training to improve solution accuracy and robustness under fixed computational budgets. In the first stage, we perform low-fidelity training runs with truncated epochs to rapidly screen candidate configurations, treating hyperparameter selection as a black-box outer-loop problem. In the second stage, only the most promising candidates are fully trained with standard gradient-based optimizers to refine the solution. Evaluated on three popular problems, namely Advection, Klein-Gordon and Helmholtz equations, our method consistently outperforms standard training and achieves significantly lower mean error within constrained computational resources.