arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.19496 2026-06-19 cs.LG new submission

Calibrating Generative Models to Feature Distributions with MMD Finetuning

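Nathaniel L. Diamant, Brian L. Trippe

Affiliations Stanford University

AI summary Proposes kCGM, which calibrates the feature distribution of a generative model by minimizing the maximum mean discrepancy (MMD) between generated and target feature distributions with KL regularization, without sacrificing validity; it applies to several types of generative models.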

Abstract

Generative models can produce individually plausible samples while deviating substantially from a target set in the distribution of key features. For example, a model pretrained on broad drug-like chemical space may generate molecules whose molecular features differ from those of a therapeutic class of interest, such as known antibiotics. Correcting such distributional miscalibration is challenging: direct finetuning on the target set can overfit and does not control which features are matched. To fill this gap, we introduce kernel Calibrating Generative Models (kCGM). kCGM minimizes a maximum mean discrepancy (MMD) between generated and target feature distributions using an unbiased score-function estimator, with KL regularization to remain close to the pretrained model. On a target set of 174 antibiotics, direct finetuning sacrifices chemical validity for feature-distribution matching, whereas kCGM improves target feature matching while increasing validity. We further demonstrate kCGM in protein and DNA generation tasks, showing it can adapt autoregressive, continuous-space diffusion, and discrete diffusion models using only feature-level supervision. Code is available at https://github.com/smithhenryd/cgm.
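The MMD objective named in the abstract can be illustrated with a minimal sketch. It computes the standard unbiased estimate of squared MMD between two feature samples. The Gaussian RBF kernel, the bandwidth, and the synthetic features are assumptions made for illustration; the sketch covers only the discrepancy itself, not the score-function gradient estimator or the KL regularization, and it is not taken from the authors' code.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian RBF kernel matrix between the rows of a and the rows of b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def mmd2_unbiased(x, y, bandwidth=1.0):
    """Unbiased estimate of squared MMD between samples x (n, d) and y (m, d)."""
    n, m = len(x), len(y)
    k_xx = rbf_kernel(x, x, bandwidth)
    k_yy = rbf_kernel(y, y, bandwidth)
    k_xy = rbf_kernel(x, y, bandwidth)
    # Dropping the diagonal terms makes each within-sample average unbiased.
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (n * (n - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (m * (m - 1))
    return term_xx + term_yy - 2.0 * k_xy.mean()

# Synthetic stand-ins for feature vectors (hypothetical data).
rng = np.random.default_rng(0)
target = rng.normal(0.0, 1.0, size=(200, 3))
calibrated = rng.normal(0.0, 1.0, size=(200, 3))
miscalibrated = rng.normal(1.5, 1.0, size=(200, 3))

mmd_calibrated = mmd2_unbiased(calibrated, target)
mmd_miscalibrated = mmd2_unbiased(miscalibrated, target)
```

In this toy setting the estimate stays near zero when both samples come from the same distribution and grows when the generated features are shifted, which is the signal a calibration objective of this kind would reduce.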

2606.19495 2026-06-19 cs.CV new submission

LooseControlVideo: Directorial Video Control using Spatial Blocking

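Shariq Farooq Bhat, Niloy J. Mitra, Kalyan Sunkavalli

Affiliations Adobe Research

AI summary Proposes LooseControlVideo, a framework that uses sparse, oriented 3D boxes as a "blocking" proxy to give intuitive layout and trajectory control over multi-object scenes in text-to-video generation, significantly outperforming existing 2D-box and flow-based methods.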

Comments Project page at https://shariqfarooq123.github.io/LooseControlVideo/

Abstract

Precise 3D spatial orchestration in text-to-video generation remains a significant challenge, particularly for multi-object scenes where semantic layout and temporal dynamics are often entangled. While existing depth-conditioned models achieve good structural fidelity, they necessitate dense, frame-accurate guidance that is labor-intensive to author for dynamic events involving deformable objects. We present LooseControlVideo, a framework that enables intuitive and expressive control by using sparse, oriented 3D boxes as a "blocking" proxy. This allows users to author high-level layout and trajectory while leveraging a video generative model to generate realistic occlusions, dynamics and interactions. We achieve this by fine-tuning a Wan 2.2 backbone on a video dataset annotated with DNOCS, a novel encoding for 3D size, orientation and depth-ordered occlusions. Furthermore, our method allows for localized refinement, such as adjusting a jump trajectory or adding an interaction, with minimal disruption to the global scene context. Extensive evaluations on the nuScenes, HO-3D, and BEHAVE benchmarks demonstrate that LooseControlVideo significantly outperforms existing 2D-box and flow-based baselines. Our findings indicate a 1.2x to 3x improvement in Trajectory Error; a 2x improvement in Rigid Motion Consistency; and a 1.5x to 2x increase in Occlusion Accuracy over current state-of-the-art layout-conditioned models, demonstrating that oriented 3D primitives provide a good geometric prior for complex, multi-agent video authoring.
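To make the idea of sparse, oriented 3D boxes concrete, here is a minimal sketch of a generic box parameterization with keyframe interpolation. The `OrientedBox` class, its fields, and `interpolate_boxes` are hypothetical names introduced for illustration; this is not the DNOCS encoding, whose details the abstract does not give.

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class OrientedBox:
    center: tuple  # (x, y, z) position of the box centre
    size: tuple    # (length, width, height)
    yaw: float     # rotation about the vertical z axis, in radians

    def corners(self):
        """Return the 8 corner points as an (8, 3) array in world coordinates."""
        signs = np.array(
            [[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
            dtype=float,
        )
        local = signs * np.array(self.size, dtype=float) / 2.0
        c, s = np.cos(self.yaw), np.sin(self.yaw)
        rotation = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
        return local @ rotation.T + np.array(self.center, dtype=float)

def interpolate_boxes(start, end, num_frames):
    """Linearly interpolate centre and yaw between two keyframe boxes.

    Yaw is interpolated without angle wrap-around handling, which is
    adequate only for small rotations such as the one in this example.
    """
    boxes = []
    for t in np.linspace(0.0, 1.0, num_frames):
        center = (1.0 - t) * np.array(start.center) + t * np.array(end.center)
        yaw = (1.0 - t) * start.yaw + t * end.yaw
        boxes.append(OrientedBox(tuple(center), start.size, float(yaw)))
    return boxes

# Two user-authored keyframes (hypothetical values) expanded to a 5-frame track.
start = OrientedBox((0.0, 0.0, 0.5), (2.0, 1.0, 1.0), 0.0)
end = OrientedBox((4.0, 2.0, 0.5), (2.0, 1.0, 1.0), np.pi / 2)
track = interpolate_boxes(start, end, num_frames=5)
```

A user authoring in this style specifies only a few keyframes per object; the per-frame boxes would then serve as coarse conditioning for the video model.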

2606.19494 2026-06-19 cs.AI new submission

Hidden Anchors in Multi-Agent LLM Deliberation

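Apurba Pokharel, Ram Dantu

Affiliations University of North Texas

AI summary Models multi-agent LLM deliberation as a closed-loop dynamical system in which each agent has a hidden internal belief (its anchor); this explains how deliberation can escape the convex hull of the initial beliefs, and the recovered anchors are used to predict model behaviour.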

Comments 13 pages, 6 figures, 7 tables

Abstract

Multi-agent LLM deliberation, where agents exchange and revise answers over several rounds, is increasingly used to improve reasoning and accuracy, yet how and why it works is rarely modelled. Such deliberation mirrors how humans reach decisions. As social animals we are pulled both by the group, the herd effect that classical opinion-dynamics models such as DeGroot and Friedkin-Johnsen capture, and by our own internal belief, which they do not. We model multi-agent deliberation as a closed-loop dynamical system in which each agent carries a hidden internal belief, its anchor, that continually pulls its opinion regardless of its neighbours. We show this anchor can be recovered from the deliberation alone, and that it explains a behaviour classical consensus rules forbid: an agent's confidence in the correct answer can climb past where any agent started, escaping the space (convex hull) formed by the initial beliefs. Checking whether the recovered anchor also predicts held-out runs (generalizes) gives a simple test for when a model is truly driven by such an anchor. Across three open-weight model families this is a spectrum, not all-or-nothing. The anchor's influence is about equally strong in all families, but they differ in where the anchor sits, and only when it sits far from the initial opinions does deliberation escape the hull and need the full closed-loop model.
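The contrast between pure averaging and anchored dynamics can be sketched with a simple linear update. The update rule, the uniform weights, the anchor values, and the `anchor_strength` parameter are all assumptions chosen for illustration; the abstract does not state the authors' exact model, so this is not their fitted system.

```python
import numpy as np

def deliberate(x0, weights, anchors, anchor_strength, rounds):
    """Iterate x <- (1 - s) * W x + s * a and return the opinions of every round.

    weights is a row-stochastic matrix W (each row sums to 1), anchors is the
    vector a of hidden beliefs, and s is anchor_strength in [0, 1].
    """
    x = np.asarray(x0, dtype=float)
    history = [x.copy()]
    for _ in range(rounds):
        x = (1.0 - anchor_strength) * weights @ x + anchor_strength * anchors
        history.append(x.copy())
    return np.array(history)

# Confidence in the correct answer for three agents (hypothetical values).
x0 = np.array([0.4, 0.5, 0.6])
weights = np.full((3, 3), 1.0 / 3.0)
anchors = np.array([0.9, 0.9, 0.9])  # hidden beliefs outside the initial range

averaging_only = deliberate(x0, weights, anchors, anchor_strength=0.0, rounds=20)
anchored = deliberate(x0, weights, anchors, anchor_strength=0.3, rounds=20)
```

With `anchor_strength=0.0` every opinion stays inside the initial range [0.4, 0.6], as averaging rules require. With a nonzero strength and anchors outside that range, all three agents end above 0.6, which is the kind of hull escape the abstract describes.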

2606.19491 2026-06-19 cs.LG stat.ML new submission

Algebraic Dead Directions in LayerNorm Transformers: A Forward-Pass-Only Diagnostic at LLM Scale

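Tejas Pradeep Shirodkar, P. J. Narayanan

Affiliations IIIT, Hyderabad

AI summary Shows that the inverse-scale direction of LayerNorm is an exact algebraic kernel of the post-final-norm centred activation covariance, so a dead direction can be read from the parameters alone with no forward or backward pass; the result is validated on 14 pretrained models.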

Comments 34 pages, 7 figures, 6 tables. Empirical companion to arXiv:2606.05957

Abstract

Pretrained transformers sit near singular minima of the loss, where the Fisher information metric degenerates along dead directions: directions in parameter space along which the directional Fisher vanishes. Locating such a direction normally needs a forward pass and an eigendecomposition of activations, or a sampling-based complexity estimate; none returns a direction computable from the network's parameters alone. We give one, for LayerNorm transformers. The inverse-scale direction $\gamma^{-1}/\|\gamma^{-1}\|$ of the LayerNorm affine is an exact algebraic kernel of the post-final-norm centred activation covariance, for any input distribution, and induces a corresponding dead direction in parameter space. It is read from the LN scale parameter alone, with no forward or backward pass and no eigensolve: the cheapest dead-direction read, specific to LayerNorm. We test it on 14 pretrained transformers (9 LayerNorm, 5 RMSNorm; 160M-35B; language and vision objectives). At random initialisation the predicted direction matches the measured bottom singular direction (one forward pass, direct SVD) to four decimal places on 9/9 LayerNorm models, and is correctly absent on 5/5 RMSNorm models, which lack the mean-subtraction projector that creates it. On the trained checkpoint the covariance eigenvalue along this direction deepens by $\sim 10^3\times$ and further dead directions open; the random-init-to-trained gap is a one-forward-pass, per-checkpoint readout of singular structure along the predicted coordinate. Two consequences follow in closed form: the residual stream's smallest singular value is preserved block-to-block on 13/14 transformers measured on their own input distribution, the one exception (Gemma4-31B) a genuine dead direction the same read pinpoints; and the kernel direction's presence classifies a transformer's normalisation from the parameters alone.
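The kernel claim follows from a short argument: LayerNorm outputs $y = \gamma \odot z + \beta$ where $z$ sums to zero for every input, so $(\gamma^{-1})^\top (y - \beta) = \mathbf{1}^\top z = 0$. A minimal numerical check of that identity is sketched below. The dimensions, random parameters, and synthetic inputs are assumptions; no pretrained model is involved, and the RMSNorm comparison is included only to show the direction disappearing when mean subtraction is removed.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 500
gamma = rng.uniform(0.5, 2.0, size=d)  # LayerNorm scale (hypothetical values)
beta = rng.normal(size=d)              # LayerNorm shift
# Arbitrary anisotropic inputs; the identity should hold for any distribution.
x = rng.normal(size=(n, d)) * rng.uniform(0.1, 5.0, size=d)

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    return gamma * x / np.sqrt((x**2).mean(axis=1, keepdims=True) + eps)

def centred_covariance(y):
    centred = y - y.mean(axis=0, keepdims=True)
    return centred.T @ centred / len(y)

# Predicted kernel direction, read from the scale parameter alone.
v = (1.0 / gamma) / np.linalg.norm(1.0 / gamma)

residual_layernorm = np.linalg.norm(centred_covariance(layer_norm(x, gamma, beta)) @ v)
residual_rmsnorm = np.linalg.norm(centred_covariance(rms_norm(x, gamma)) @ v)
```

Here `residual_layernorm` is zero up to floating-point error, while `residual_rmsnorm` is clearly nonzero, matching the abstract's statement that RMSNorm lacks the mean-subtraction projector that creates the direction.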

2606.19489 2026-06-19 cs.LG cs.AI new submission

Concept Flow Models: Anchoring Concept-Based Reasoning with Hierarchical Bottlenecks

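Ya Wang, Adrian Paschke

Affiliations Fraunhofer Institute for Open Communication Systems; Freie Universität Berlin

AI summary Proposes Concept Flow Models (CFMs), which replace the flat bottleneck with a hierarchical concept decision tree that progressively narrows the prediction scope, reducing information leakage and improving interpretability while maintaining predictive performance.

Journal ref Transactions on Machine Learning Research, 2/2026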

Abstract

Concept Bottleneck Models (CBMs) enhance interpretability by projecting learned features into a human-understandable concept space. Recent approaches leverage vision-language models to generate concept embeddings, reducing the need for manual concept annotations. However, these models suffer from a critical limitation: as the number of concepts approaches the embedding dimension, information leakage increases, enabling the model to exploit spurious or semantically irrelevant correlations and undermining interpretability. In this work, we propose Concept Flow Models (CFMs), which replace the flat bottleneck with a hierarchical, concept-driven decision tree. Each internal node in the hierarchy focuses on a localized subset of discriminative concepts, progressively narrowing the prediction scope. Our framework constructs decision hierarchies from visual embeddings, distributes semantic concepts at each hierarchy level, and trains differentiable concept weights through probabilistic tree traversal. Extensive experiments on diverse benchmarks demonstrate that CFMs match the predictive performance of flat CBMs, while substantially mitigating information leakage by reducing effective concept usage. Furthermore, CFMs yield stepwise decision flows that enable transparent and auditable model reasoning with hierarchical class structures.
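Probabilistic tree traversal with node-local concept subsets can be sketched as follows. The two-level binary hierarchy, the concept index sets, the sigmoid gating, and the random weights are hypothetical choices for illustration; the abstract does not specify how the authors build the hierarchy or parameterize the nodes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Each internal node sees only its own small subset of concept scores.
node_concepts = {"root": [0, 1], "left": [2, 3], "right": [4, 5]}
rng = np.random.default_rng(0)
node_weights = {name: rng.normal(size=len(idx)) for name, idx in node_concepts.items()}

def leaf_probabilities(concept_scores):
    """Class probabilities for 4 leaves as products of branch probabilities."""
    def go_right(node):
        idx = node_concepts[node]
        return sigmoid(node_weights[node] @ concept_scores[idx])

    p_root, p_left, p_right = go_right("root"), go_right("left"), go_right("right")
    return np.array([
        (1.0 - p_root) * (1.0 - p_left),   # root -> left  -> class 0
        (1.0 - p_root) * p_left,           # root -> left  -> class 1
        p_root * (1.0 - p_right),          # root -> right -> class 2
        p_root * p_right,                  # root -> right -> class 3
    ])

concept_scores = rng.uniform(size=6)  # stand-in for concept activations
probs = leaf_probabilities(concept_scores)
```

Because every leaf probability is a product of differentiable branch probabilities, the node weights could be trained with an ordinary classification loss, and each prediction can be read back as a sequence of node-level decisions.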

2606.19483 2026-06-19 cs.CV new submission

LEAP: Layer-skipping Efficiency via Adaptive Progression for Vision Transformer Distillation

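Jiaqi Zhang, Ashton Lee, Anthony Wong, John Zou, Sami BuGhanem, Randall Balestriero

Affiliations Brown University; Rice University

AI summary Proposes LEAP, a training curriculum that adaptively selects the teacher's intermediate feature maps as progressive targets to accelerate knowledge distillation into a student ViT, improving accuracy on ImageNet-100 by 12.24% and saving 25.1% of training FLOPs.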

Abstract

Vision Foundation Models (VFMs) with Vision Transformer (ViT) backbones, such as DINOv2, have become essential for downstream tasks like object recognition and semantic segmentation. The immense computational requirements of these backbones often necessitate distillation into smaller architectures for edge deployment. Feature-based knowledge distillation (KD) often suffers from the teacher-student gap; the student struggles to imitate the teacher's complex feature maps due to its limited capacity. To mitigate this bottleneck, we propose LEAP: Layer-skipping Efficiency via Adaptive Progression, a training curriculum for ViT feature-based knowledge distillation. By utilizing the teacher's intermediate feature maps as a sequence of progressively more difficult targets, our curriculum allows the student to build a foundational representation before tackling higher-level abstractions. Our results demonstrate that this paradigm significantly accelerates convergence through adaptive difficulty selection across various student model sizes and dataset scales. With our curriculum, the LEAP-distilled ViT-S achieves 90.1% accuracy on ImageNet-100, a +12.24% improvement over the baseline. On ImageNet-1K, LEAP achieves +3.84% and +7.75% improvement for the instance retrieval task on the Oxford and Paris datasets, respectively. Furthermore, the curriculum enables 25.1% savings in training FLOPs and 21% savings in training time on ImageNet-100 by implementing early-stopping for teacher inference during the initial stages of training. Code is available at https://github.com/KevinZ0217/LEAP.
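The link between a layer-wise curriculum and teacher compute savings can be sketched with a toy schedule. The threshold rule, the starting layer, and the loss values below are hypothetical; the abstract does not state the actual adaptive selection criterion, so this only illustrates why stopping teacher inference at the current target layer saves work.

```python
def run_curriculum(losses_per_step, num_teacher_layers, start_layer, threshold):
    """Return the teacher layer used as the distillation target at each step.

    Assumed rule: move to the next, deeper teacher layer whenever the
    student's loss on the current target falls below the threshold.
    """
    layer = start_layer
    schedule = []
    for loss in losses_per_step:
        schedule.append(layer)
        if loss < threshold and layer < num_teacher_layers:
            layer += 1
    return schedule

def teacher_layer_savings(schedule, num_teacher_layers):
    """Fraction of teacher layer evaluations skipped by stopping at the target."""
    used = sum(schedule)
    return 1.0 - used / (num_teacher_layers * len(schedule))

# Synthetic student losses for 8 steps against a 12-layer teacher.
losses = [0.9, 0.4, 0.8, 0.3, 0.7, 0.2, 0.6, 0.6]
schedule = run_curriculum(losses, num_teacher_layers=12, start_layer=9, threshold=0.5)
savings = teacher_layer_savings(schedule, num_teacher_layers=12)
```

In this toy run the target climbs from layer 9 to layer 12, and the teacher evaluates 84 of a possible 96 layers, so 12.5% of teacher layer evaluations are skipped.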

2606.19481 2026-06-19 cs.LG new submission

Insulin4RL: Real-Time Insulin Management in the Intensive Care Unit for Offline Reinforcement Learning

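Thomas Frost, Steve Harris

Affiliations Institute of Health Informatics; University College London

AI summary To address the poor generalisability caused by discretising electronic health records, introduces Insulin4RL, an offline reinforcement learning dataset built from real clinical trajectories, with over 375,000 decisions across 12,209 patients, for evaluating model performance under realistic sampling assumptions.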

Comments Under submission

Abstract

Offline reinforcement learning (ORL) offers the potential to improve the quality of clinical decision-making using historical electronic health record (EHR) data. Current training and evaluative practices in this field rely heavily on EHR datasets that have been temporally discretised into fixed, regular time intervals. Discretisation creates fictional representations of complex clinical scenarios and compromises the generalisability of retrospective model evaluations. In this paper, we introduce Insulin4RL, a healthcare ORL dataset featuring naturally irregular inputs and actions from real clinical trajectories. Derived from MIMIC-IV, Insulin4RL comprises over 375,000 labelled decisions across 12,209 patients requiring insulin infusion titration in the Intensive Care Unit. The dataset can thus be used for research into ORL model performance under realistic clinical sampling assumptions. We provide a description of the dataset's structure and characteristics, baseline performance metrics using model-free offline reinforcement learning, and a standardised evaluation protocol using fitted Q-evaluation. We conclude with suggested areas for future research that could be addressed using this resource.
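The effect of temporal discretisation can be seen in a small sketch that processes the same decision sequence in two ways. The event values, the one-hour bin width, and the last-value-wins rule are hypothetical; they do not reflect the Insulin4RL schema or any particular preprocessing pipeline.

```python
# (hours since admission, insulin infusion rate in units/h): synthetic events.
events = [(0.2, 2.0), (0.7, 3.0), (0.9, 4.0), (3.5, 2.5)]

def discretise(events, bin_hours=1.0):
    """Fixed-interval view: keep only the last action recorded in each bin."""
    bins = {}
    for t, rate in events:
        bins[int(t // bin_hours)] = rate
    return bins

def irregular_transitions(events):
    """Irregular view: keep every decision with the true time to the next one."""
    return [
        (t, rate, t_next - t)
        for (t, rate), (t_next, _) in zip(events, events[1:])
    ]

binned = discretise(events)
transitions = irregular_transitions(events)
```

The binned view keeps two of the four decisions and leaves hours 1 and 2 empty, so those steps would have to be imputed; the irregular view keeps each titration together with its actual elapsed time.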

2606.19476 2026-06-19 cs.LG cs.AI new submission

Can In-Context Learning Support Intrinsic Curiosity?

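Eric Elmoznino, Sangnie Bhardwaj, Johannes von Oswald, Rajai Nasser, Blaise Agüera y Arcas, João Sacramento, Rif A. Saurous, Guillaume Lajoie

Affiliations Google – Paradigms of Intelligence Team; Google DeepMind

AI summary Investigates using the in-context learning ability of sequence models as an immediate, update-free world model to remove the gradient-descent bottleneck of classic intrinsic-curiosity methods, and proves that in non-temporal settings the resulting rewards asymptotically converge to the true learning progress.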

Abstract

Effective machine learning depends not only on how we model data, but also on what data we choose to collect. While large sequence models have revolutionized data modeling, the problem of automated data selection, or "intrinsic curiosity", remains a significant challenge. Classic approaches incentivize exploration by rewarding an agent based on its "learning progress", which measures how much a newly acquired observation improves a world model's predictive ability. However, evaluating these rewards traditionally requires expensive inner loops of gradient descent updates within each trajectory, rendering them computationally impractical at scale. In this work, we investigate whether the emergent in-context learning (ICL) capabilities of sequence models can eliminate this bottleneck by serving as immediate, update-free world models. Specifically, we evaluate whether an exploration policy can be trained to maximize learning progress, using solely the prediction errors and counterfactual context manipulations of an in-context learner. We first prove that in general Markov decision processes, this is in fact impossible in an unbiased way: the resulting intrinsic rewards either suffer from nuisance terms that bias their estimation of true learning progress, or they cannot be implemented using an in-context learner's prediction errors. Conversely, we prove a positive result for a broad subclass of non-temporal settings, encompassing active learning and Bayesian Experimental Design: here, ICL-derived rewards successfully bound and asymptotically converge to the true learning progress. We corroborate our theory with controlled experiments across continuous and symbolic environments, demonstrating that our ICL-driven framework successfully trains curious data-collection policies that explore optimally.
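A learning-progress reward computed purely from in-context predictions can be sketched in a non-temporal setting. The stand-in learner below is a one-parameter ridge regression refit from the context, and the probe-based reward is one simple reading of "how much a new observation improves predictions"; both are assumptions and not the estimator analysed in the paper.

```python
import numpy as np

def in_context_predict(context, query_x, ridge=1e-3):
    """Stand-in in-context learner: predicts from the context alone.

    It fits y = w * x by ridge regression on the (x, y) pairs in the context,
    so conditioning on more context changes predictions without any stored
    parameter being updated.
    """
    xs = np.array([x for x, _ in context], dtype=float)
    ys = np.array([y for _, y in context], dtype=float)
    w = float(xs @ ys) / (float(xs @ xs) + ridge)
    return w * query_x

def learning_progress(context, new_obs, probe):
    """Drop in squared error on a probe point after adding new_obs to the context."""
    probe_x, probe_y = probe
    err_before = (in_context_predict(context, probe_x) - probe_y) ** 2
    err_after = (in_context_predict(context + [new_obs], probe_x) - probe_y) ** 2
    return err_before - err_after

# True relation is y = 2x; the initial context holds one noisy observation.
context = [(0.1, 0.5)]
probe = (1.0, 2.0)
reward_informative = learning_progress(context, (2.0, 4.0), probe)
reward_uninformative = learning_progress(context, (0.0, 0.0), probe)
```

Querying at x = 2.0 sharply reduces the probe error and earns a large reward, whereas querying at x = 0.0 carries no information about the slope and earns none, so a policy maximizing this reward would prefer the informative query.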

2606.19475 2026-06-19 cs.AI cs.CL new submission

Diffusion Language Models: An Experimental Analysis

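Thomas Bertolani, Davide Bucciarelli, Leonardo Zini, Marcella Cornia, Lorenzo Baraldi

Affiliations University of Modena and Reggio Emilia; University of Pisa

AI summary Systematically compares eight diffusion language models on reasoning, coding, translation, and other tasks, analyzes how inference-time factors such as the number of denoising steps and context length affect performance and efficiency, and characterizes the trade-offs of diffusion language models across tasks and budgets.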

Abstract

Large Language Models (LLMs) have revolutionized language modeling through autoregressive generation, enabling strong performance across a wide range of tasks. Recently, Diffusion Language Models (DLMs) have emerged as an alternative paradigm that generates text through iterative denoising rather than next-token prediction, allowing parallel refinement of entire sequences. While numerous diffusion-based architectures have been proposed, differences in evaluation protocols, datasets, inference budgets, and generation hyperparameters make it difficult to compare their capabilities and understand the trade-offs they offer. In this work, we present a systematic experimental analysis of modern DLMs. Specifically, we evaluate eight state-of-the-art DLMs across eight benchmarks spanning reasoning, coding, translation, knowledge, and structured problem solving, while explicitly considering both generation quality and computational efficiency. Beyond downstream evaluation, we analyze the impact of key inference-time factors, including denoising steps, context length, block size, and parallel unmasking strategies, and complement large-scale experiments with controlled comparisons of smaller models trained under identical conditions. Our analysis highlights the strengths and limitations of diffusion-based language modeling across different tasks, architectures, and inference budgets. We show that the behavior of DLMs is strongly influenced by generation-time design choices, leading to distinct trade-offs between performance and computational efficiency. Overall, our study provides practical insights into the capabilities and deployment characteristics of contemporary DLMs.
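The role of the denoising-step budget in masked-diffusion decoding can be sketched with a toy loop. The `toy_denoiser` returns random probabilities in place of a trained model, and the confidence-based unmasking rule with an equal share per step is one common strategy assumed here for illustration; it is not the procedure of any specific model evaluated in the paper.

```python
import numpy as np

MASK = -1  # sentinel id for a masked position

def toy_denoiser(tokens, vocab_size, rng):
    """Stand-in for a trained model: random per-position token probabilities."""
    logits = rng.normal(size=(len(tokens), vocab_size))
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    return probs / probs.sum(axis=1, keepdims=True)

def generate(length, vocab_size, num_steps, seed=0):
    """Start fully masked and unmask the most confident positions each step."""
    rng = np.random.default_rng(seed)
    tokens = np.full(length, MASK)
    for step in range(num_steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        probs = toy_denoiser(tokens, vocab_size, rng)
        confidence = probs.max(axis=1)[masked]
        # Unmask an equal share of the remaining positions at every step,
        # so the final step always fills whatever is left.
        num_to_unmask = int(np.ceil(masked.size / (num_steps - step)))
        chosen = masked[np.argsort(-confidence)[:num_to_unmask]]
        tokens[chosen] = probs[chosen].argmax(axis=1)
    return tokens

few_steps = generate(length=16, vocab_size=50, num_steps=4)    # 4 tokens per step
many_steps = generate(length=16, vocab_size=50, num_steps=16)  # 1 token per step
```

Fewer steps mean more tokens committed in parallel per model call, which is the quality versus compute trade-off that the number of denoising steps controls.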

2606.19474 2026-06-19 cs.CR cs.AI cs.SE new submission

Secure Coding Drift in LLM-Assisted Post-Quantum Cryptography Development: A Gamified Fix

R. D. N. Shakya, C. P. Wijesiriwardana, S. M. Vidanagamachchi, Nalin A. G. Arachchilage

Affiliations University of Moratuwa; University of Ruhuna; RMIT University

AI summary Introduces a model of secure coding drift in LLM-assisted PQC development and proposes a gamified framework that turns LLMs into active security co-pilots to mitigate the security degradation caused by long-term reliance on LLMs.

Comments Accepted for 2026 SIGIR Workshop on Vulnerabilities in Generative Systems for Information Retrieval track

Abstract

The transition to Post-Quantum Cryptography (PQC) introduces considerable implementation complexity, requiring strict adherence to constant-time execution, side-channel resistance, and precise parametrisation. Simultaneously, large language models (LLMs) are heavily embedded in software development workflows, including cryptographic engineering. While LLMs improve productivity, evidence shows that they frequently generate insecure or suboptimal code, particularly in security-critical domains. This paper introduces Secure Coding Drift in PQC, a novel socio-technical vulnerability model capturing the gradual degradation of secure coding practices due to sustained reliance on LLM-generated code. Unlike prior work that focuses on static vulnerabilities, we conceptualise security risk as a longitudinal behavioural phenomenon arising from human-AI interaction. To mitigate this, we propose a gamified, LLM-augmented secure coding framework that embeds adversarial evaluation, behavioural feedback, and security scoring into development workflows. Our approach reframes LLMs from passive assistants into active security co-pilots, contributing toward safer PQC implementation in AI-mediated environments.
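The constant-time requirement mentioned above is easiest to see in a comparison routine. The sketch below contrasts an early-exit byte comparison with `hmac.compare_digest` from the Python standard library. The function names and tag values are illustrative, the example is not drawn from the paper, and production PQC code would normally rely on a vetted native library rather than Python.

```python
import hmac

def leaky_equal(a: bytes, b: bytes) -> bool:
    """Early-exit comparison: running time depends on the first mismatching byte,
    which can leak how much of a secret value an attacker has guessed."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False
    return True

def constant_time_equal(a: bytes, b: bytes) -> bool:
    """Comparison designed so timing does not reveal where the inputs differ."""
    return hmac.compare_digest(a, b)

expected_tag = bytes.fromhex("a1b2c3d4e5f60718")
good_tag = bytes.fromhex("a1b2c3d4e5f60718")
bad_tag = bytes.fromhex("a1b2c3d4e5f60719")

results = (
    leaky_equal(expected_tag, good_tag),
    constant_time_equal(expected_tag, good_tag),
    leaky_equal(expected_tag, bad_tag),
    constant_time_equal(expected_tag, bad_tag),
)
```

Both functions return the same answers, which is why a purely functional test would accept either; the difference lies only in timing behaviour, the kind of property that can erode unnoticed when generated code is accepted without review.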

2606.19469 2026-06-19 cs.AI cs.SE new submission

Measuring Curriculum Alignment across Topical Coverage, Competency, and Cognitive Depth: A Longitudinal Framework Applied to CS2013 and CS2023

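Sherzod Turaev, Mary John, Saja Aldabet, Mamoun Awad, Nazar Zaki, Khaled Shuaib

Affiliations United Arab Emirates University; Abu Dhabi Polytechnic

AI summary Proposes a human-in-the-loop pipeline that combines semantic retrieval with human confirmation to measure, longitudinally, how well a computer science program covers the CS2013 and CS2023 guidelines; coverage is stable, but the newer guideline demands greater cognitive depth.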

Comments 24 pages, 5 figures, 8 tables

Abstract

Undergraduate computer science is governed by international curricular guidelines revised about once a decade, yet programs lack a reliable, reproducible way to measure how completely they cover the current guidelines and how that coverage shifts when the guidelines are restructured. We address this with a human-in-the-loop pipeline that measures a program's coverage of an external body of knowledge, applied longitudinally to one accredited BSc in Computer Science against Computer Science Curricula 2013 (CS2013) and 2023 (CS2023). The pipeline represents the program and each guideline as structured corpora, generates candidate course-to-knowledge-unit matches by semantic retrieval, and confirms them through human judgment under an explicit coverage definition. Of seven benchmarked retrievers, a reciprocal-rank-fusion ensemble was strongest, and a reputed long-context model underperformed a small sentence model, so retriever choice must be measured. Both maps were validated by an independent second rater (Cohen's kappa 0.64 for CS2023, 0.69 for CS2013). The program covers 49.7% of CS2023 and 50.9% of CS2013 knowledge units, near-constant across a decade. Extending the same retrieve-then-confirm design to competency articulation and cognitive depth shows that the program articulates the competency for ~88% of covered units under each guideline, yet delivers it at the recommended depth for 76% of present units under CS2023 against 95% under CS2013, a gap reflecting the newer guideline's raised expectations, not the program. The longitudinal comparison separates persistent structural gaps (parallel and distributed computing, foundations of programming languages, systems fundamentals), uncovered against both guidelines and ABET, from differences that reflect the standard's evolution. The instrument is reusable and available from the authors on request.
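Two of the quantities named in the abstract, reciprocal rank fusion and Cohen's kappa, have standard textbook definitions that can be sketched directly. The constant `k=60` is the conventional default for reciprocal rank fusion and is an assumption here, as are the knowledge-unit labels and rater judgments, which are synthetic.

```python
from collections import Counter, defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each item scores the sum of 1 / (k + rank) over lists."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))

def cohens_kappa(labels_a, labels_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1.0 - expected)

# Candidate knowledge units returned by three hypothetical retrievers.
rankings = [
    ["KU-A", "KU-B", "KU-C"],
    ["KU-B", "KU-A", "KU-D"],
    ["KU-B", "KU-C", "KU-A"],
]
fused = reciprocal_rank_fusion(rankings)

# Covered (1) or not covered (0) judgments from two raters on 10 matches.
rater_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
rater_2 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
kappa = cohens_kappa(rater_1, rater_2)
```

In this toy data the raters agree on 8 of 10 items against a chance agreement of 0.5, giving a kappa of 0.6, which is in the same range as the values reported in the abstract.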

2606.19468 2026-06-19 cs.CL new submission

Characterizing Narrative Content in Web-scale LLM Pretraining Data

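Teagan Johnson, Elliott Ash, Andrew Piper, Maria Antoniak

Affiliations University of Colorado Boulder; ETH Zürich; McGill University

AI summary Presents the first fine-grained study of narrative features in the LLM pretraining corpus Dolma, with a framework covering three core narrative elements (agency, setting, and events), the NarraBERT model, and the released NarraDolma dataset; narrative structure is shown to be measurable in heterogeneous data and unevenly distributed.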

Comments 8 pages of main content, 28 total pages. 30 figures

Abstract

The narrative composition of web-scale LLM pretraining corpora remains largely unexplored even though narrative is a fundamental mode of human communication. We present the first fine-grained study of narrative features in Dolma, a 3-trillion-token open pretraining corpus. Drawing on narrative theory, we design a framework spanning three core narrative elements (agency, setting, and events) operationalized as 11 interpretable dimensions. After sampling and annotating a diverse set of 400 passages, we finetune and validate NarraBERT, a RoBERTa-based model for fine-grained narrative prediction. We apply NarraBERT to 3M passages, resulting in a new dataset, NarraDolma. We find (i) narrative structure is measurable at scale across extremely heterogeneous data, (ii) we uncover a continuous, multidimensional narrative structure underlying web text, and (iii) narrative qualities are unequally distributed across pretraining sources and topics in ways that current curation practices neither measure nor account for. Our framework, dataset, and analyses provide a foundation for understanding how narrative qualities are distributed in LLM pretraining data and for studying how data composition affects narrative reasoning tasks. We publicly release NarraDolma and NarraBERT.

2606.19464 2026-06-19 cs.AI cs.MA new submission

Deontic Policies for Runtime Governance of Agentic AI Systems

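Anupam Joshi, Tim Finin, Karuna Pande Joshi, Lalana Kagal

Affiliations CSEE Department, UMBC, Baltimore, MD, USA; Center for AI, UMBC, Baltimore, MD, USA; Information Systems Department, UMBC, Baltimore, MD, USA; CSAIL, MIT, Cambridge, MA, USA

AI summary To address the security, privacy, and compliance governance challenges of LLM-driven agentic AI systems, proposes AgenticRei, which uses a Rei-based deontic policy language expressed in OWL and a logic engine to enforce governance constraints such as obligations, dispensations, and conflict resolution at runtime, and is compatible with standards such as A2AS.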

Comments 10 pages, 1 figure. To be published in the 2026 IEEE Symposium on Agentic Services which is part of the IEEE Conference on Web Services

Abstract

Autonomous agentic AI systems driven by Large Language Models (LLMs) introduce a new class of security, privacy, and compliance challenges: an agent that can invoke tools, manipulate data, install software, and coordinate with peer agents across organizational boundaries must be constrained not just by authentication and access control, but by the full structure of enterprise governance. This includes specifying what agents are permitted and prohibited from doing, what they are obliged to do after certain actions (e.g., notify the CISO), under what conditions a standing obligation may be waived, and which rules take precedence when policies conflict. This governance problem exceeds what current policy engines provide. Systems such as XACML, Rego, and Cedar address only the permit/prohibit subset of this governance structure. They do not provide obligation lifecycle management, meta-policy conflict resolution, dispensations that waive obligations in specific circumstances, or ontological reasoning over domain class hierarchies commonly found in applications such as healthcare, cybersecurity, or data privacy. We propose AgenticRei, which realizes key governance requirements such as obligations, dispensations, policy conflict resolutions, and reasoning over policies, as well as the basic permit/prohibit constraints. We use a deontic policy language built on the Rei framework, expressed as OWL (Web Ontology Language) and evaluated at runtime by a high-performance logic engine entirely outside the LLM. The same pipeline governs both tool invocations by the agent and agent-to-agent messages. We show through examples that deontic policies capture governance constraints around security and privacy that mostly cannot be expressed in current production engines. Our approach composes naturally with industry-standard frameworks like A2AS.
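The four deontic notions in the abstract (permission, prohibition, obligation, dispensation) and a precedence-based conflict rule can be sketched in plain Python. The `Rule` class, the priority scheme, the default-deny choice, and the action names are hypothetical; this is neither Rei nor OWL syntax and does not reproduce AgenticRei's engine.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    modality: str      # "permit", "prohibit", "oblige", or "dispense"
    action: str
    priority: int = 0  # assumed meta-policy: higher priority takes precedence

def decide(action, rules):
    """Resolve permit/prohibit conflicts for one action.

    The highest-priority rule wins; ties and missing rules fall back to
    "prohibit" (a default-deny assumption made for this sketch).
    """
    relevant = [
        r for r in rules
        if r.action == action and r.modality in ("permit", "prohibit")
    ]
    if not relevant:
        return "prohibit"
    top = max(r.priority for r in relevant)
    winners = {r.modality for r in relevant if r.priority == top}
    return "prohibit" if "prohibit" in winners else "permit"

def pending_obligations(performed_action, rules, triggers):
    """Obligations triggered by an action, minus those waived by a dispensation."""
    waived = {r.action for r in rules if r.modality == "dispense"}
    obliged = {r.action for r in rules if r.modality == "oblige"}
    return [
        o for o in triggers.get(performed_action, [])
        if o in obliged and o not in waived
    ]

rules = [
    Rule("permit", "install_software", priority=1),
    Rule("prohibit", "install_software", priority=2),
    Rule("permit", "read_logs", priority=1),
    Rule("oblige", "notify_ciso"),
]
triggers = {"read_logs": ["notify_ciso"]}  # reading logs triggers a notification duty

install_decision = decide("install_software", rules)
read_decision = decide("read_logs", rules)
duties = pending_obligations("read_logs", rules, triggers)
duties_after_waiver = pending_obligations(
    "read_logs", rules + [Rule("dispense", "notify_ciso")], triggers
)
```

A permit/prohibit-only engine could express the first two decisions but not the last two results, which depend on tracking an obligation and then waiving it.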

2606.19460 2026-06-19 cs.CV cs.AI cs.LG new submission

Scaling Generative Foundation Models for Chest Radiography with Rectified Flow Transformers

使用整流流变换器扩展胸部X光片的生成式基础模型

Fabio De Sousa Ribeiro, Emma A. M. Stanley, Charles Jones, Tian Xia, Dominic C. Marshall, Laurent Renard Triché, Christopher V. Cosgriff, Panagiotis Dimitrakopoulos, Sotirios A. Tsaftaris, Ben Glocker

发表机构 * Imperial College London(帝国理工学院) Causality in Healthcare AI Hub(医疗AI因果关系中心) University of Edinburgh(爱丁堡大学) Cleveland Clinic London(克利夫兰诊所伦敦) Department of Perioperative Medicine, CHU Clermont-Ferrand(克莱蒙费朗大学医院围手术期医学科) Department of Medicine, Massachusetts General Hospital(麻省总医院医学部) Broad Institute of MIT and Harvard(麻省理工学院与哈佛大学博德研究所)

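AI summary Presents the first billion-parameter-scale generative foundation model for chest radiographs, using rectified flow transformers for high-fidelity, controllable synthesis and markedly improving how indistinguishable synthetic images are from real ones.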

Comments Project page: https://RadiT-project.github.io

详情
AI中文摘要

我们引入了首个从零开始在十亿参数规模上训练的胸部X光片合成生成基础模型。现有的放射学AI模型通常在不同患者亚群、机构和采集设置下泛化能力差,导致实际临床效用有限。可控、高保真的胸部X光片合成是多样化临床数据集和评估诊断模型鲁棒性的有前景途径。因此,我们提出了迄今为止最大的胸部X光片专用生成基础模型,拥有超过13亿参数,在包含120万张X光片和临床专家指导元数据的精选异质数据集上训练了1.6万亿个token。我们的模型支持跨多个人口统计亚组、采集视图和十多种病理的可控X光片生成和编辑。此外,我们显著推进了X光片合成保真度的最新技术,生成的图像对临床专家而言与真实X光片无法区分。

英文摘要

We introduce the first generative foundation model for chest radiograph synthesis trained from scratch at the billion-parameter scale. Existing radiographic AI models often suffer from poor generalisation across patient subpopulations, institutions, and acquisition settings, resulting in limited real-world clinical utility. Controlled, high-fidelity synthesis of chest radiographs is a promising path toward diversifying clinical datasets and evaluating the robustness of diagnostic models. Therefore, we present the largest specialist generative foundation model for chest radiographs to date, with over 1.3B parameters, trained for 1.6T tokens on a curated, heterogeneous dataset comprising 1.2M radiographs and clinical expert-guided metadata. Our model supports controllable radiograph generation and editing across multiple demographic subgroups, acquisition views, and a dozen pathologies. Moreover, we significantly advance the state of the art in radiograph synthesis fidelity, producing images that are indistinguishable from real radiographs to clinical experts.

2606.19458 2026-06-19 cs.IR 新提交

MonaVec: A Training-Free Embedded Vector Search Kernel for Edge and Offline AI Systems

MonaVec: 一种面向边缘和离线AI系统的免训练嵌入式向量搜索内核

Oğuzhan Yenen

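AI summary Presents MonaVec, a training-free, data-oblivious embedded vector-search kernel that achieves 4-bit compression via a randomized Hadamard transform and precomputed Lloyd-Max quantization, delivers deterministic results in edge and offline settings, and supports single-file deployment.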

Comments 27 pages, 11 figures. Code and artifacts: https://github.com/mona-hq/monavec (PyPI: monavec; crates.io: monavec-core). Zenodo: doi:10.5281/zenodo.20559587

详情
AI中文摘要

我们提出MonaVec,一种确定性的嵌入式向量搜索内核,适用于边缘和离线AI场景——即服务器基础设施、网络连接和训练数据均不可用的环境。现有的向量搜索系统假设存在持久化服务器、千兆字节RAM或对语料库进行训练;而MonaVec则针对SQLite的部署模式:一个文件、一次函数调用、随处运行。其量化核心默认免训练且数据无关:随机哈达玛变换(RHDH)将任意输入分布调整至N(0,1),因此预计算的Lloyd-Max表可将数据量化至4位(缩小8倍),无需学习码本或数据遍历。索引持久化为单个.mvec文件,其中嵌入的ChaCha20旋转种子使得结果在不同架构间可重现,并在同一构建内字节一致——这是并行构建图库无法提供的确定性保证。在语义嵌入(AG News,45K x 1024维BGE-M3,余弦相似度)上,MonaVec 4位BruteForce在27 MB内达到0.960 Recall@10,在召回率上领先float32 FAISS-IVF和8位usearch,同时以峰值吞吐量换取字节一致的确定性。单次全局标准化(fit())将相同的数据无关流程扩展到对幅度敏感的L2数据,可选的IvfFlat和HNSW后端将其扩展到百万向量语料库。MonaVec使用纯Rust实现,并带有Python绑定和运行时SIMD调度(AVX-512/AVX2/NEON/scalar)。它面向设备端RAG、离线代理和嵌入式检索——即SQLite在关系数据领域占据的细分市场:一个文件、一次调用、随处运行。

英文摘要

We present MonaVec, a deterministic, embedded vector-search kernel for edge and offline AI -- settings where server infrastructure, network connectivity, and training data are all unavailable. Existing vector-search systems assume a persistent server, gigabytes of RAM, or a training pass over the corpus; MonaVec instead targets the deployment profile of SQLite: one file, one function call, runs anywhere. Its quantization core is training-free by default and data-oblivious: a Randomized Hadamard Transform (RHDH) conditions any input distribution toward N(0,1), so precomputed Lloyd-Max tables quantize to 4 bits (8x smaller) with no learned codebook and no data pass. The index persists as a single .mvec file whose embedded ChaCha20 rotation seed makes results reproducible across architectures and byte-identical within a build -- a determinism guarantee that parallel-build graph libraries cannot offer. On semantic embeddings (AG News, 45K x 1024-dim BGE-M3, cosine), MonaVec 4-bit BruteForce reaches 0.960 Recall@10 in 27 MB -- leading float32 FAISS-IVF and 8-bit usearch on recall -- while trading peak throughput for byte-identical determinism. A single-pass global standardization (fit()) extends the same data-oblivious pipeline to magnitude-sensitive L2 data, and optional IvfFlat and HNSW backends carry it to million-vector corpora. MonaVec is implemented in pure Rust with Python bindings and runtime SIMD dispatch (AVX-512/AVX2/NEON/scalar). It targets on-device RAG, offline agents, and embedded retrieval -- the niche SQLite occupies for relational data: one file, one call, runs anywhere.
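The quantization core described in the abstract above (random rotation, then a precomputed Lloyd-Max table) can be illustrated with a minimal pure-Python sketch. This is not MonaVec's code: the function names are hypothetical, `random.Random(seed)` stands in for the ChaCha20-seeded rotation, and the 16-level table is derived here by Lloyd iteration on the N(0,1) density rather than loaded from a shipped table. The input length is assumed to be a power of two.

```python
import math
import random


def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(values)
    half = 1
    while half < len(out):
        for start in range(0, len(out), 2 * half):
            for j in range(start, start + half):
                a, b = out[j], out[j + half]
                out[j], out[j + half] = a + b, a - b
        half *= 2
    return out


def randomized_hadamard(vector, seed):
    """Random sign flips followed by an orthonormal Hadamard rotation."""
    rng = random.Random(seed)  # assumption: stands in for the ChaCha20-seeded rotation
    signs = [rng.choice((-1.0, 1.0)) for _ in vector]
    scale = 1.0 / math.sqrt(len(vector))
    return [scale * v for v in fwht([s * x for s, x in zip(signs, vector)])]


def _pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def lloyd_max_table(levels=16, iterations=200):
    """Lloyd-Max reconstruction levels for N(0,1), derived from the density, not from data."""
    centers = [-3.0 + 6.0 * (k + 0.5) / levels for k in range(levels)]
    for _ in range(iterations):
        mids = [(a + b) / 2.0 for a, b in zip(centers, centers[1:])]
        edges = [-math.inf] + mids + [math.inf]
        # Centroid of N(0,1) restricted to each interval [lo, hi].
        centers = [(_pdf(lo) - _pdf(hi)) / (_cdf(hi) - _cdf(lo))
                   for lo, hi in zip(edges, edges[1:])]
    return centers


def quantize(vector, seed, table):
    """Return one 4-bit code per coordinate plus the norm needed to rescale."""
    rotated = randomized_hadamard(vector, seed)
    norm = math.sqrt(sum(v * v for v in rotated)) or 1.0
    unit = math.sqrt(len(vector)) / norm  # rescale coordinates to roughly unit variance
    codes = [min(range(len(table)), key=lambda k: abs(table[k] - v * unit))
             for v in rotated]
    return codes, norm


table = lloyd_max_table()
example = [float(i % 7) - 3.0 for i in range(64)]
codes, norm = quantize(example, seed=7, table=table)
```

Because the table depends only on the target density and the rotation depends only on the seed, the same input and seed always yield the same codes, which is the property the abstract emphasizes. Storing 4 bits instead of a 32-bit float per coordinate gives the stated 8x reduction.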

2606.19451 2026-06-19 cs.LG cs.CV cs.RO 新提交

3D-DLP: Self-Supervised 3D Object-Centric Scene Representation Learning

3D-DLP:自监督3D物体中心场景表示学习

Ellina Zhang, Madhaven Iyengar, Amir Zadeh, Chuan Li, Deepak Pathak, David Held, Tal Daniel

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

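AI summary Presents 3D-DLP, a self-supervised model that decomposes scene-level RGB-D or voxel observations into 3D latent particles, each encoding disentangled attributes, yielding interpretable per-particle segmentation maps and supporting scene manipulation and downstream robotic manipulation.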

Comments ICML 2026. Project webpage: https://eubooks3003.github.io/3d-dlp

详情
AI中文摘要

我们引入了3D-DLP,一种自监督的物体中心表示学习模型,它将场景级RGB-D或体素观测分解为一组3D潜在粒子。基于深度潜在粒子(DLP)框架,每个粒子编码解耦的属性,包括3D关键点位置、边界框尺寸和外观特征,并代表场景中的一个独特实体。该模型通过端到端的自监督重建目标学习可解释的逐粒子分割图。我们在模拟和真实数据集上证明,学习到的潜在空间是可解释和可控的:通过操纵粒子位置并解码,我们可以生成新颖的场景配置。此外,我们展示了将这些紧凑的3D潜在粒子用于下游机器人操作,相比缺乏显式3D信息或依赖无物体中心结构的密集3D输入的基线方法,性能有所提升。代码和视频可在以下网址获取:https://eubooks3003.github.io/3d-dlp。

英文摘要

We introduce 3D-DLP, a self-supervised object-centric representation learning model that decomposes scene-level RGB-D or voxel observations into a set of 3D latent particles. Building on the Deep Latent Particles (DLP) framework, each particle encodes disentangled attributes, including 3D keypoint position, bounding box dimensions, and appearance features, and represents a distinct entity in the scene. The model learns interpretable per-particle segmentation maps through an end-to-end self-supervised reconstruction objective. We demonstrate on both simulated and real-world datasets that the learned latent space is interpretable and controllable: by manipulating particle positions and decoding, we can generate novel scene configurations. Furthermore, we show that leveraging these compact 3D latent particles for downstream robotic manipulation improves performance over baselines that either lack explicit 3D information or rely on memory-intensive dense 3D inputs without object-centric structure. Code and videos are available at https://eubooks3003.github.io/3d-dlp.

2606.19419 2026-06-19 cs.RO cs.AI 新提交

Playful Agentic Robot Learning

趣味性具身机器人学习

Junyi Zhang, Jiaxin Ge, Hanjun Yoo, Letian Fu, Zihan Yang, Yaowei Liu, Raj Saravanan, Shaofeng Yin, Justin Yu, Dantong Niu, Zirui Wang, Roei Herzig, Ken Goldberg, Yutong Bai, David M. Chan, Ion Stoica, Angjoo Kanazawa, Jiahui Lei, Haiwen Feng, Trevor Darrell

发表机构 * University of California, Berkeley(加州大学伯克利分校) Impossible Research

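AI summary Presents the RATs framework, in which robots learn reusable skills through self-directed play, with gains of 20.6 and 17.0 percentage points on LIBERO-PRO and MolmoSpaces, respectively.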

Comments Project page: https://playful-rats.github.io/

详情
AI中文摘要

当前的具身机器人系统可以编写可执行的代码即策略程序、观察反馈并在多次尝试中修正行为,但它们仍然主要是任务驱动的:可复用技能仅在明确指令后获得。我们研究趣味性具身机器人学习,其中具身编码代理在下游任务到来之前,将自主导向的趣味性作为持续技能学习阶段。我们引入RATs,即专为趣味性技能获取设计的机器人代理团队。在趣味性阶段,RATs提出新颖且可学习的探索性任务,规划并执行机器人代码策略,验证中间进展,诊断失败,通过密集的步骤级反馈进行重试,并将成功执行提炼到持久代码技能库中。在测试时,代理从该冻结库中重用相关技能以帮助解决新任务。在LIBERO-PRO和MolmoSpaces上的实验表明,与无趣味性和随机趣味性基线相比,趣味性学习技能在保留的下游任务上分别提升了20.6和17.0个百分点(相对于CaP-Agent0)。此外,学习到的技能可以通过简单地检索到上下文中插入到其他推理时代码即策略代理中,无需微调基础模型,即可在RoboSuite和真实世界迁移中分别提升8.9和8.8个百分点。

英文摘要

Current agentic robot systems can write executable Code-as-Policy programs, observe feedback, and revise behavior across multiple attempts, but they remain largely task-driven: reusable skills are acquired only after explicit instructions. We study Playful Agentic Robot Learning, where an embodied coding agent uses self-directed play as a continual skill-learning stage before downstream tasks arrive. We introduce RATs, Robotics Agent Teams designed for play-time skill acquisition. During play, RATs proposes novel yet learnable exploratory tasks, plans and executes robot-code policies, verifies intermediate progress, diagnoses failures, retries with dense, step-level feedback, and distills successful executions into a persistent code skill library. At test time, the agent reuses relevant skills from this frozen library to help solve new tasks. Experiments in LIBERO-PRO and MolmoSpaces show that play-learned skills improve held-out downstream tasks over no-play and random-play baselines, with 20.6 and 17.0 percentage-point gains over CaP-Agent0 on LIBERO-PRO and MolmoSpaces, respectively. Moreover, the learned skills can be plugged into other inference-time Code-as-Policy agents by simply retrieving them into the context, improving RoboSuite and real-world transfer by 8.9 and 8.8 points, respectively, without finetuning the underlying model.

2606.19416 2026-06-19 cs.LG 新提交

MortarBench: Evaluating Mortgage Loan Origination Agents

MortarBench: 评估抵押贷款发起代理

Matthew Toles, Yunan Lu, Manav Munjal, Bojun Liu, Yuanhao Deng, Stephanie Selig, Derek Rindner, Cheng Li, Zhou Yu

发表机构 * Columbia University(哥伦比亚大学) Tidalwave

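AI summary Presents the MortarBench benchmark, which uses a financial data synthesis and mutation pipeline to generate examples covering edge cases and evaluates large language models on loan origination tasks; it finds that models have low accuracy and exhibit bias, and introduces the CRIT calibration framework, which raises accuracy to 80.5%.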

详情
AI中文摘要

贷款发起是贷方创建新贷款的过程,从申请和承保到批准和融资。该过程在评估申请人的资格和风险水平方面起着关键作用。最近,尽管缺乏任何公开基准,公司已开始使用抵押贷款代理来增强人类贷款官员。为填补这一空白,我们提出了MortarBench,一个贷款发起代理基准。MortarBench使用金融数据合成和变异管道生成具有广泛边缘案例覆盖的示例,这些示例匹配真实世界的分布和问题。我们发现最先进的大语言模型(LLM)表现不佳,闭源模型最多达到77.1%的精确匹配准确率。我们还发现LLM对与非英语名字相关的外国性存在系统性偏见。注意到这些弱点,我们引入了CRIT,一个置信度校准框架。我们的方法将准确率提高到80.5%,同时改善了风险管理导向并减少了偏见。

英文摘要

Loan origination is the process by which a lender creates a new loan, from application and underwriting through approval and funding. This process serves a critical role in evaluating the eligibility and level of risk posed by an applicant. Recently, firms have begun using mortgage loan agents to augment human loan officers, despite a lack of any public benchmark. To fill this gap, we present MortarBench, a loan origination agent benchmark. MortarBench uses a financial data synthesis and mutation pipeline to generate examples with broad edge case coverage that match real-world distributions and questions. We find that state-of-the-art large language models (LLMs) perform poorly, with closed-source models achieving at most 77.1% exact match accuracy. We also discover systematic biases in LLM perception of foreignness related to non-English names. Noting these weaknesses, we introduce CRIT, a confidence calibration framework. Our method increases accuracy to 80.5% while improving risk management steering and reducing bias.

2606.19413 2026-06-19 cs.LG 新提交

Does Text Actually Help? Uncovering and Resolving Text Collapse in Multimodal Time Series Forecasting

文本真的有用吗?揭示并解决多模态时间序列预测中的文本坍缩问题

Huu Hiep Nguyen, Minh Hoang Nguyen, Dung Nguyen, Hung Le

发表机构 * Applied Artificial Intelligence Initiative(应用人工智能计划)

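AI summary Addresses "text collapse" in multimodal time series forecasting, where the text branch is ignored, by proposing REST-TS, which has the text branch predict only the residual that the numerical backbone cannot explain, forcing it to extract genuine content and achieving state-of-the-art performance.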

详情
AI中文摘要

多模态时间序列预测将数值序列与领域相关的文本报告配对,有望将世界知识注入预测流程。然而,我们揭示了现有框架中的一个关键失败模式,称为文本坍缩:文本分支收敛到与内容无关的变换,无论输入描述如何,都贡献可忽略的判别信号。我们认为文本坍缩是时间序列预测中基本不对称性的结果:数值输入与输出强自相关,使得数值主干天生占主导地位,而文本分支尽管携带互补且通常关键的信息,却未被充分利用,导致其系统性欠利用。为解决此问题,我们提出REST-TS(时间序列中文本的残差独占监督),将不对称性转化为设计原则:数值主干产生其独立的数值预测,而文本分支被独占监督以预测残差的结构化组成部分,即数值无法解释的预测差距。由于没有数值路径可以减少这些损失,文本分支必须从输入描述中提取真实内容。在多样化的现实领域和主干架构上的评估表明,REST-TS实现了最先进的性能,并一致地显示出比现有框架更高的文本分支利用率,提供了强有力的经验证据,表明对文本分支进行残差监督迫使其从输入中提取真实内容。

英文摘要

Multimodal time series forecasting, which pairs numerical sequences with domain-relevant textual reports, promises to inject world knowledge into forecasting pipelines. However, we uncover a critical failure mode in existing frameworks that we term text collapse: the text branch converges to a content-independent transformation, contributing negligible discriminative signal regardless of the input description. We argue that text collapse is a consequence of a fundamental asymmetry in time series forecasting: the numerical input is strongly autocorrelated with the output, making the numerical backbone inherently dominant, while the text branch, despite carrying complementary and often critical information, is systematically underexploited. To address this, we propose REST-TS (Residual-Exclusive Supervision for Text in Time Series), which turns the asymmetry into a design principle: the numerical backbone produces its own independent numerical forecast, and the text branch is exclusively supervised to predict the structured components of the residual, the prediction gap that numbers cannot explain. Because no numerical pathway can reduce these losses, the text branch must extract genuine content from the input description. Evaluated across diverse real-world domains and backbone architectures, REST-TS achieves state-of-the-art performance and consistently demonstrates greater text-branch utilization than existing frameworks, providing strong empirical evidence that supervising the text branch on the residual compels it to extract genuine content from the input.
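The residual-exclusive supervision idea in the abstract above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the function names are hypothetical, the raw residual stands in for the "structured components of the residual" (which the abstract does not specify), and mean squared error is an assumed loss. In a real training loop the numerical forecast would be treated as a constant when computing the text-branch loss.

```python
def residual_targets(y_true, numeric_forecast):
    """The gap left by the numerical backbone; treated as a fixed target."""
    return [t - n for t, n in zip(y_true, numeric_forecast)]


def text_branch_loss(text_prediction, y_true, numeric_forecast):
    """Mean squared error between the text-branch output and the residual only."""
    targets = residual_targets(y_true, numeric_forecast)
    return sum((p - r) ** 2 for p, r in zip(text_prediction, targets)) / len(targets)


def combined_forecast(numeric_forecast, text_prediction):
    """Final forecast: numerical forecast plus the text-predicted correction."""
    return [n + p for n, p in zip(numeric_forecast, text_prediction)]


y_true = [10.0, 12.0, 15.0]
numeric = [10.0, 11.0, 12.0]      # backbone extrapolates the recent trend
content_blind = [1.0, 1.0, 1.0]   # same output whatever the text says
content_aware = [0.0, 1.0, 3.0]   # tracks the event described in the text

loss_blind = text_branch_loss(content_blind, y_true, numeric)
loss_aware = text_branch_loss(content_aware, y_true, numeric)
```

In this toy setup a content-independent text output cannot drive the residual loss to zero, while an output that follows the text-described event can, which mirrors the argument that no numerical pathway can reduce these losses.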

2606.19412 2026-06-19 cs.LG 新提交

Spectral Retrieval-Augmented Time-Series Forecasting

频谱检索增强的时间序列预测

Huu Hiep Nguyen, Minh Hoang Nguyen, Dung Nguyen, Hung Le

发表机构 * Applied Artificial Intelligence Initiative(应用人工智能倡议) Deakin University(迪肯大学)

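AI summary Presents SpecReTF, which converts time series into windowed frequency representations, measures similarity with a metric combining amplitude and phase, and applies exponential-moving-average weighting, addressing the spectral-blindness and temporal-recency limitations of existing retrieval methods and improving forecasting accuracy on non-stationary time series.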

详情
AI中文摘要

时间序列预测利用历史模式来预测未来值,但传统方法在处理复杂、非平稳模式时面临挑战,这些模式在训练期间难以记忆。检索增强方法通过检索相似历史模式来增强预测,已成为有前景的解决方案。然而,现有检索方法存在两个基本局限性:频谱盲区,即忽略了捕捉潜在周期结构的关键频域特征;以及时间近因,即对所有历史数据一视同仁,而不强调最近、更相关的模式。在本文中,我们提出SpecReTF,一种新颖的检索方法,通过将时间序列转换为窗口化频率表示,并使用结合幅度和相位信息的组合度量来衡量相似性,从而解决这些问题。为了平衡近因和历史上下文,我们应用指数移动平均加权方案,强调最近的窗口。在基准数据集上的大量实验表明,SpecReTF优于时域检索方法,在多样化的非平稳时间序列上实现了卓越的预测准确性。

英文摘要

Time series forecasting leverages historical patterns to predict future values, but traditional methods face challenges when dealing with complex, non-stationary patterns that are difficult to memorize during training. Retrieval-augmented approaches have emerged as promising solutions by retrieving similar historical patterns to enhance predictions. However, existing retrieval methods suffer from two fundamental limitations: spectral blindness, which overlooks critical frequency-domain characteristics that capture underlying periodic structures, and temporal recency, which treats all historical data equally without emphasizing recent, more relevant patterns. In this paper, we propose SpecReTF, a novel retrieval method that addresses these issues by converting time series into windowed frequency representations and measuring similarity with a combined metric that captures both amplitude and phase information. To balance recency and historical context, we apply an exponential moving average weighting scheme that emphasizes recent windows. Extensive experiments on benchmark datasets demonstrate that SpecReTF outperforms time-domain retrieval methods, achieving superior forecasting accuracy across diverse, non-stationary time series.
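A minimal sketch of spectral retrieval with recency weighting, in the spirit of the abstract above. The exact similarity metric and weighting used by SpecReTF are not given in the abstract, so everything here is an assumption: the function names are hypothetical, the distance is a squared amplitude difference plus a magnitude-weighted phase term, and the recency weight is a geometric decay `(1 - alpha) ** age` applied by dividing the distance.

```python
import cmath
import math


def dft(window):
    """One-sided discrete Fourier transform of a real-valued window."""
    n = len(window)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(window))
            for k in range(n // 2 + 1)]


def spectral_distance(a, b, phase_weight=0.5):
    """Amplitude distance plus a magnitude-weighted phase mismatch (assumed metric)."""
    fa, fb = dft(a), dft(b)
    amplitude = sum((abs(u) - abs(v)) ** 2 for u, v in zip(fa, fb))
    # Weight phase mismatch by magnitude so near-empty bins do not dominate.
    phase = sum(min(abs(u), abs(v)) * (1.0 - math.cos(cmath.phase(u) - cmath.phase(v)))
                for u, v in zip(fa, fb))
    return amplitude + phase_weight * phase


def retrieve(history, query, window, alpha=0.1, top_k=1):
    """Return start indices of the best-matching windows, favoring recent ones."""
    starts = list(range(0, len(history) - window + 1, window))
    scored = []
    for i, start in enumerate(starts):
        age = len(starts) - 1 - i
        recency = (1.0 - alpha) ** age  # newest window has weight 1
        distance = spectral_distance(history[start:start + window], query)
        scored.append((distance / recency, start))
    scored.sort()
    return [start for _, start in scored[:top_k]]


fast_sin = [math.sin(2 * math.pi * t / 4) for t in range(16)]
fast_cos = [math.cos(2 * math.pi * t / 4) for t in range(16)]
slow_sin = [math.sin(2 * math.pi * t / 8) for t in range(16)]
best = retrieve(slow_sin + fast_sin + slow_sin, fast_sin, window=16)
```

The sine and cosine windows share the same amplitude spectrum and differ only in phase, so an amplitude-only metric cannot tell them apart; the phase term is what separates them here.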

2606.19411 2026-06-19 cs.LG 新提交

Spectral DPPs via NEPv: A Scalable Continuous Relaxation of Determinantal MAP for Diversity-Aware Data Selection

通过NEPv的谱DPP:用于多样性感知数据选择的确定性点过程MAP的可扩展连续松弛

Richard Yi Da Xu

发表机构 * Hong Kong Baptist University(香港浸会大学) TadReamk Limited(TadReamk有限公司)

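AI summary Recasts the NP-hard DPP-MAP selection problem as continuous optimization over the Stiefel manifold and solves it in near-linear time via self-consistent field iteration on a nonlinear eigenvalue problem (NEPv), making it suitable for large-scale data selection.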

详情
AI中文摘要

从海量候选池中选择一个小的、多样化的、高质量的子集是现代机器学习中的一个常见原语——用于训练和微调大型模型的数据整理和核心集选择、主动学习批次获取、上下文学习的提示和示例选择、检索多样化以及实验设计。确定性点过程(DPP)为此任务提供了原则性的、良好校准的多样性概念,但其MAP目标——选择大小为$k$的子集$S$最大化$\log\det(L_S)$——是NP难的,并且标准的贪心和采样算法在候选集大小$n$上具有超线性复杂度。这种成本在多样性最重要的数据为中心的场景中尤其高昂,其中$n$范围从数百万到数十亿的候选示例、特征或嵌入。我们将DPP-MAP重新表述为Stiefel流形上的连续优化问题,并证明其最优性条件构成一个先前未研究形式的具有特征向量依赖性的非线性特征值问题(NEPv)。该NEPv允许自洽场(SCF)迭代,具有基于谱间隙的局部收缩保证,从而提供了一个原则性的迭代求解器,其中多样性目标驱动一个特征向量依赖的算子。由此产生的算法OurMethod仅需要与核的矩阵-向量乘积,运行时间为$O\!\big((ndk+nk^2)\,t\big)$,其中迭代次数$t$很小,在$n$上接近线性,并直接与机器学习中常见的低秩和特征映射核集成。本文重点介绍松弛、求解器和扩展分析;完整的真实数据基准测试留给计划中的实证研究。

英文摘要

Selecting a small, diverse, high-quality subset from a massive pool of candidates is a recurring primitive in modern machine learning -- data curation and coreset selection for training and fine-tuning large models, active-learning batch acquisition, prompt and exemplar selection for in-context learning, retrieval diversification, and experimental design. Determinantal Point Processes (DPPs) give a principled, well-calibrated notion of diversity for this task, but their MAP objective -- pick a size-$k$ subset $S$ maximizing $\log\det(L_S)$ -- is NP-hard, and the standard greedy and sampling algorithms scale superlinearly in the ground-set size $n$. This cost is prohibitive precisely in the data-centric regime where diversity matters most, where $n$ ranges over millions to billions of candidate examples, features, or embeddings. We recast DPP-MAP as a continuous optimization problem over the Stiefel manifold, and show that its first-order optimality conditions form a Nonlinear Eigenvalue Problem with eigenvector dependency (NEPv) of a previously unstudied form. This NEPv admits a self-consistent field (SCF) iteration with a spectral-gap-based local contraction guarantee, giving a principled iterative solver where the diversity objective drives an eigenvector-dependent operator. The resulting algorithm, OurMethod, requires only matrix-vector products with the kernel and runs in time $O\!\big((ndk+nk^2)\,t\big)$ for a small number of iterations $t$, scaling near-linearly in $n$ and integrating directly with low-rank and feature-map kernels common in ML. This paper focuses on the relaxation, solver, and scaling analysis; full real-data benchmarking is left to a planned empirical study.
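To make the discrete objective in the abstract above concrete, the sketch below evaluates $\log\det(L_S)$ on a tiny kernel and compares exhaustive search with the standard greedy baseline. It does not implement the paper's Stiefel-manifold relaxation or its SCF solver; the function names and the three-item feature matrix are illustrative assumptions.

```python
import itertools
import math


def logdet(matrix):
    """log det of a symmetric positive-definite matrix via Cholesky factorization."""
    n = len(matrix)
    chol = [[0.0] * n for _ in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                chol[i][i] = math.sqrt(s)
                total += 2.0 * math.log(chol[i][i])
            else:
                chol[i][j] = s / chol[j][j]
    return total


def submatrix(kernel, subset):
    return [[kernel[i][j] for j in subset] for i in subset]


def dpp_map_bruteforce(kernel, k):
    """Exact MAP by enumerating all size-k subsets (exponential; tiny inputs only)."""
    return max(itertools.combinations(range(len(kernel)), k),
               key=lambda s: logdet(submatrix(kernel, s)))


def dpp_map_greedy(kernel, k):
    """Greedy baseline: repeatedly add the item that most increases log det."""
    chosen = []
    for _ in range(k):
        best = max((i for i in range(len(kernel)) if i not in chosen),
                   key=lambda i: logdet(submatrix(kernel, chosen + [i])))
        chosen.append(best)
    return sorted(chosen)


# Items 0 and 1 are near-duplicates; item 2 is orthogonal to item 0.
features = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
kernel = [[sum(a * b for a, b in zip(u, v)) for v in features] for u in features]
exact = dpp_map_bruteforce(kernel, 2)
greedy = dpp_map_greedy(kernel, 2)
```

The determinant of the near-duplicate pair is about 0.01, while the orthogonal pair has determinant 1, so both searches prefer the diverse pair. Exhaustive search is only feasible for toy sizes, which is the scaling problem the abstract targets.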

2606.19409 2026-06-19 cs.SE cs.PL 新提交

OpenRath: Session-Centered Runtime State for Agent Systems

OpenRath: 面向会话的代理系统运行时状态

Fukang Wen, Zhijie Wang, Ruilin Xu

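AI summary Addresses fragmented runtime state in agent systems by proposing Session as the central first-class runtime abstraction, which is branchable, inspectable, replayable, backend-aware, and composable, making fork, merge, and replay explicit runtime operations.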

详情
AI中文摘要

现代代理系统常常遭受碎片化的运行时状态:对话记录、工具效果、内存事件、工作区放置、分支来源和重放证据被分别记录,难以检查或重现。OpenRath通过一个类似PyTorch的编程模型来解决这个问题,适用于多代理、多会话系统。这里的类比涉及中心一等运行时抽象的角色,而非张量计算。其核心抽象是Session,即在代理和工作流之间传递的运行时值。Session是可分支、可检查、可重放、后端感知且可组合的。它记录对话片段、沙箱放置、谱系元数据、令牌使用、待处理工作和工具证据,同时定义内存交互进入运行时记录的位置。由于此状态由程序执行中使用的同一值携带,fork、merge和replay成为显式的运行时操作,而非从外部痕迹重建的状态。OpenRath进一步定义了Sandbox、Tool、Agent、Memory、Workflow和Selector,其中Selector将控制流转化为运行时路由的决策。本报告介绍了编程模型、架构、审计里程碑和证据协议。其主张仅限于受控的运行时属性,而广泛的定量比较、实时提供者质量、可选后端可用性和内存质量留待后续评估。核心论点是Session为代理系统提供了一个一等运行时值,用于可审计的组合。

英文摘要

Modern agent systems often suffer from fragmented runtime state: transcripts, tool effects, memory events, workspace placement, branch provenance, and replay evidence are recorded separately and become difficult to inspect or reproduce. OpenRath addresses this issue with a PyTorch-like programming model for multi-agent, multi-session systems. The analogy concerns the role of a central first-class runtime abstraction, not tensor computation. Its core abstraction is Session, the runtime value passed between agents and workflows. A Session is branchable, inspectable, replayable, backend-aware, and composable. It records conversation chunks, sandbox placement, lineage metadata, token usage, pending work, and tool evidence, while defining where memory interactions enter the runtime record. Since this state is carried by the same value used in program execution, fork, merge, and replay become explicit runtime operations rather than states reconstructed from external traces. OpenRath further defines Sandbox, Tool, Agent, Memory, Workflow, and Selector, with Selector turning control flow into runtime-routed decisions. This report presents the programming model, architecture, audited milestones, and evidence protocol. Its claims are limited to controlled runtime properties, while broad quantitative comparisons, live-provider quality, optional-backend availability, and memory quality are left for follow-on evaluation. The central thesis is that Session provides agent systems with a first-class runtime value for auditable composition.

2606.19408 2026-06-19 cs.LG cs.RO 新提交

FlexLAM: Resolving the Bottleneck Trade-off in Latent Action Learning

FlexLAM: 解决潜在动作学习中的瓶颈权衡

Takanori Yoshimoto, Yang Hu, Naruya Kondo, Tatsuya Matsushima

发表机构 * University of Tsukuba(筑波大学) The University of Tokyo(东京大学)

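AI summary Addresses the trade-off caused by fixed-capacity bottlenecks in latent action models by proposing FlexLAM, which uses nested dropout to obtain variable-length latent actions; without new architectures or losses, it matches or surpasses fixed-capacity models under scarce labels and on low-return tasks, and supports adjusting the token budget at inference time.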

详情
AI中文摘要

潜在动作为无动作视频与下游决策提供了紧凑接口,但现有潜在动作模型(LAM)强制每个转换通过固定容量瓶颈。我们识别出一个瓶颈权衡:过于紧凑的编码可能丢弃动作对齐所需的转换线索,而过于松散的编码则保留了额外的转换变化,当对齐标签稀缺或分布狭窄时必须解决这些变化。FlexLAM用通过嵌套dropout训练的变长潜在动作取代固定容量,产生前缀有效编码,首先捕获紧凑的转换结构,仅在需要时添加细节,无需新架构或损失。在标准稀缺标签监督下和低回报单任务对齐压力测试中,单个FlexLAM在每个评估的令牌预算下匹配或超越单独训练的固定容量LAM,表明FlexLAM不仅在推理时可调整,而且在相同令牌预算下学习了更好的潜在动作接口。同一模型支持推理时令牌预算调整而无需重新训练,并且FlexLAM改善了Ego4D转换重建。这些结果表明,变长潜在动作是对潜在动作模型、潜在动作世界模型和视频预训练动作接口中固定容量瓶颈的无架构、即插即用升级。

英文摘要

Latent actions provide a compact interface between action-free video and downstream decision-making, yet existing Latent Action Models (LAMs) force every transition through a fixed-capacity bottleneck. We identify a bottleneck trade-off: overly tight codes can discard transition cues needed for action alignment, while overly loose codes preserve additional transition variation that must be resolved when alignment labels are scarce or narrowly distributed. FlexLAM replaces this fixed capacity with variable-length latent actions trained by nested dropout, yielding prefix-valid codes that capture compact transition structure first and add detail only when needed, without new architectures or losses. A single FlexLAM matches or surpasses separately trained fixed-capacity LAMs at every evaluated token budget under standard scarce-label supervision and under a low-return single-task alignment stress test, indicating that FlexLAM is not merely adjustable at inference time but learns a better latent-action interface at the same token budgets. The same model supports inference-time token-budget adjustment without retraining, and FlexLAM improves Ego4D transition reconstruction. These results suggest that variable-length latent actions are an architecture-free, drop-in upgrade to the fixed-capacity bottleneck in latent action models, latent-action world models, and video-pretrained action interfaces.
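Nested dropout, the training mechanism named in the abstract above, can be sketched as random prefix truncation of the latent-action tokens. This is a minimal illustration, not FlexLAM's code: the function names and the `MASK` placeholder are hypothetical, and the uniform choice of prefix length is an assumption, since the abstract does not state the sampling distribution.

```python
import random

MASK = None  # placeholder for a dropped latent token


def nested_dropout(tokens, rng):
    """Keep a random-length prefix and mask every later token."""
    keep = rng.randint(1, len(tokens))  # assumption: uniform prefix length
    return tokens[:keep] + [MASK] * (len(tokens) - keep), keep


def truncate_to_budget(tokens, budget):
    """Inference-time budget: use only the first `budget` tokens."""
    budget = max(1, min(budget, len(tokens)))
    return tokens[:budget] + [MASK] * (len(tokens) - budget)


latent = ["z0", "z1", "z2", "z3"]
masked, kept = nested_dropout(latent, random.Random(0))
short_code = truncate_to_budget(latent, 2)
```

Because later tokens are always dropped before earlier ones, a decoder trained this way must make every prefix usable on its own, which is what allows the token budget to be changed at inference without retraining.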

2606.19407 2026-06-19 cs.SE cs.AI 新提交

JustDiag!: A Diagnostic Justification Engine for Accountable Root Cause Analysis

JustDiag!:用于可问责根本原因分析的诊断论证引擎

Tingzhu Bi, Xinrui Jiang, Xun Zhang, Pengcheng Su, Congjie He, Jinglin Li, Ping Wang, Meng Ma

发表机构 * Peking University(北京大学) University of Edinburgh(爱丁堡大学) Beijing University of Posts and Telecommunications(北京邮电大学)

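AI summary Presents JustDiag, a diagnostic justification engine that supports accountable root cause analysis by maintaining explicit process state (evidence, findings, competing hypotheses, conflicts, and next checks); evaluated on 66 real-world incidents, it scores higher on outcome and process quality than a matched control without diagnostic justification.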

详情
AI中文摘要

大型语言模型可以生成流畅的根本原因分析,但仅凭流畅的最终答案不足以证明高风险操作中的可问责性。在实际事件响应中,工程师需要知道哪些证据支持诊断,考虑了哪些替代方案,哪里存在矛盾,以及系统是解决了问题还是保留了不确定性。我们通过JustDiag填补了这一空白,这是一个用于RCA的诊断论证引擎,它维护了关于证据、发现、竞争假设、冲突和下一步检查的显式过程状态。我们使用两层协议在66个真实事件上评估了该系统,该协议分别对最终答案质量和过程质量进行评分。与没有诊断论证的匹配对照组相比,JustDiag获得了更强的结果和过程分数,同时由于更校准的非闭合性而接受了略低的终端完成率。这些结果表明,可问责的RCA需要显式的诊断论证工件和过程感知评估,而不仅仅是流畅的最终答案。

英文摘要

Large language models can produce fluent root cause analyses, but fluent final answers alone are insufficient evidence for accountability in high-stakes operations. In real incident response, engineers need to know what evidence supported a diagnosis, which alternatives were considered, where contradictions remained, and whether the system resolved the case or preserved uncertainty. We address this gap with JustDiag, a diagnostic justification engine for RCA that maintains an explicit process state over evidence, findings, competing hypotheses, conflicts, and next checks. We evaluated the system on 66 real-world incidents using a two-layer protocol that separately scores final-answer quality and process quality. Relative to a matched control without diagnostic justification, JustDiag achieved stronger outcome and process scores, while accepting slightly lower terminal completion due to more calibrated non-closure. These results suggest that accountable RCA requires explicit diagnostic justification artifacts and process-aware evaluation, not only fluent final answers.

2606.19404 2026-06-19 cs.LG cs.CL 新提交

Thermodynamic Signatures of Reasoning: Free-Energy and Spectral-Form-Factor Diagnostics for Hallucination Detection in Large Language Models

推理的热力学特征:用于大型语言模型幻觉检测的自由能和谱形因子诊断

Salim Khazem

发表机构 * Talan Research & Innovation Center(Talan研究与创新中心)

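AI summary Presents Free-Energy Signatures (Fes), a spectral descriptor that treats the attention Laplacian as a Hamiltonian and extracts thermodynamic potentials and the random-matrix-theory spectral form factor to detect LLM hallucinations, achieving high AUROC without updating the underlying LLM.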

详情
AI中文摘要

大型语言模型(LLM)中的幻觉检测对部署至关重要,近期研究表明注意力导出的图拉普拉斯谱携带关于推理质量的强信号。然而,先前的谱诊断仅通过少数特征值或手工选取的标量来总结拉普拉斯谱,忽略了其大部分结构。我们提出自由能签名(Fes),一种谱描述符,将每层的注意力拉普拉斯视为哈密顿量,并提取其热力学势(配分函数、自由能、谱熵、热容)以及随机矩阵理论(RMT)谱形因子。我们证明了三个结果:(i)Fes在注意力扰动下的Lipschitz稳定性;(ii)一个表达性结果,表明Fes丰富了有限谱摘要,并在明确的规则性和网格分辨率假设下逼近矩导出的谱泛函;(iii)基于Fes构建的无训练检测器AUROC的有限样本PAC界。实验上,在六个开源LLM和六个基准测试中,基于Fes描述符的轻量级探测在注意力谱基线中实现了最强的平均AUROC,相比LapEig平均提高+6.5 AUROC点,相比GoR-4平均提高+2.4点,且无需更新底层LLM。在完全无监督设置下,RMT偏差得分达到平均AUROC 0.71,提供了一个无标签但较弱的检测器。互补的RMT分析表明,正确生成表现出更接近Wigner-Dyson的谱统计,而幻觉表现出更接近Poisson的统计。匿名代码和配置在补充材料中提供。

英文摘要

Hallucination detection in large language models (LLMs) is deployment-critical, and recent work shows that the spectrum of attention-derived graph Laplacians carries strong signal about reasoning quality. Prior spectral diagnostics, however, summarize the Laplacian spectrum by a handful of eigenvalues or hand-picked scalars, leaving most of its structure unused. We propose Free-Energy Signatures (Fes), a spectral descriptor that treats each layer's attention Laplacian as a Hamiltonian and extracts its thermodynamic potentials (partition function, free energy, spectral entropy, heat capacity) together with the random-matrix-theory (RMT) spectral form factor. We prove three results: (i) Lipschitz stability of Fes under attention perturbation; (ii) an expressiveness result showing that Fes enriches finite spectral summaries and approximates moment-derived spectral functionals under explicit regularity and grid-resolution assumptions; and (iii) a finite-sample PAC bound on the AUROC of a training-free detector built from Fes. Empirically, across six open-weight LLMs and six benchmarks, a lightweight probe on Fes descriptors achieves the strongest aggregate AUROC among attention-spectral baselines, improving over LapEig by +6.5 AUROC points and over GoR-4 by +2.4 points on average, while requiring no update to the underlying LLM. In the fully unsupervised setting, an RMT-deviation score achieves mean AUROC 0.71, providing a label-free but weaker detector. A complementary RMT analysis shows that correct generations exhibit more Wigner-Dyson-like spectral statistics, whereas hallucinations exhibit more Poisson-like statistics. The anonymized code and config are provided in the supplementary material.
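The thermodynamic quantities listed in the abstract above follow from standard statistical-mechanics formulas once the Laplacian eigenvalues are treated as energy levels. The sketch below uses those textbook definitions; how the paper builds the Laplacian, chooses the inverse temperature `beta`, or normalizes the spectral form factor is not stated in the abstract, so the function names, the `1/N` normalization, and the example eigenvalues are assumptions.

```python
import cmath
import math


def thermodynamic_signature(eigenvalues, beta):
    """Partition function, free energy, entropy, and heat capacity at inverse temperature beta."""
    weights = [math.exp(-beta * lam) for lam in eigenvalues]
    partition = sum(weights)
    probs = [w / partition for w in weights]
    mean_energy = sum(p * lam for p, lam in zip(probs, eigenvalues))
    mean_energy_sq = sum(p * lam * lam for p, lam in zip(probs, eigenvalues))
    return {
        "partition": partition,
        "free_energy": -math.log(partition) / beta,
        "mean_energy": mean_energy,
        "entropy": -sum(p * math.log(p) for p in probs if p > 0.0),
        "heat_capacity": beta ** 2 * (mean_energy_sq - mean_energy ** 2),
    }


def spectral_form_factor(eigenvalues, t):
    """K(t) = |sum_j exp(-i * lambda_j * t)|^2 / N (assumed normalization)."""
    total = sum(cmath.exp(-1j * lam * t) for lam in eigenvalues)
    return abs(total) ** 2 / len(eigenvalues)


eigenvalues = [0.0, 0.5, 1.0, 1.5]  # illustrative Laplacian spectrum
signature = thermodynamic_signature(eigenvalues, beta=2.0)
```

A quick consistency check is the identity F = E - S / beta, which these definitions satisfy exactly; sweeping `beta` and `t` over a grid would turn the scalars into a descriptor vector.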

2606.19401 2026-06-19 cs.CC 新提交

The Complexity of Auditing Disclosure-Robust Defeasible Explanations

审计披露鲁棒的可废止解释的复杂性

Haoyang Li

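AI summary Studies the complexity of minimal reason cores that remain robust under incremental disclosure: verifying a robust core is coNP-complete, deciding whether a robust core of size at most θ exists is Σ₂ᵖ-complete, and the paper charts the complexity landscape of exact auditing.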

Comments 11 pages, 4 figures; full proofs in appendix

详情
AI中文摘要

一个形式化解释用子集最小充分理由来认证一个预测。然而,在增量披露下,证据逐字段到达,通常充分的理由可能被后续信息推翻。我们研究在所有允许的后续披露下仍然充分的最小理由核心;其大小为鲁棒半径。我们将一个可废止分类器编译成一个显式的边界图谱,包含入口锚点和出口击败者,并描绘了审计它的复杂性(所有陈述均以图谱大小衡量)。预测和常驻锚点通过对图谱的多项式时间扫描读取,无需迭代不动点计算;一个理由的击败者前沿通过扫描并子集最小化其上的击败者获得。但验证一个理由核心是鲁棒的是coNP完全的,而判断是否存在大小不超过θ的鲁棒核心是Σ₂ᵖ完全的——一个四格P/coNP完全/NP完全/Σ₂ᵖ完全的景观,其中接受情况(A(t)=1)达到多项式层次第二层。最小认证披露的判定版本是NP完全的;其优化版本在排除无击败者世界的数量上具有固定参数可解性,而一般击败者情况未解决。在标准表格数据集上的深度受限决策树的精确审计中,采用故意小的布尔抽象,控制参数处于小参数范围(鲁棒核心在低个位数),因此在这些审计立方体中精确鲁棒审计是可处理的;在从我们的归约构建的对抗实例上,困难性显现,鲁棒核心大小为Θ(n)。据我们所知,这是针对披露鲁棒形式化解释的第一个Σ₂ᵖ完全审计查询。

英文摘要

A formal explanation certifies a prediction with a subset-minimal sufficient reason. Under incremental disclosure, however, evidence arrives field by field, and a normally sufficient reason can be overturned by later information. We study the smallest reason core that remains sufficient under all admissible later disclosures; its size is the robustness radius. We compile a defeasible classifier into an explicit boundary atlas of entry anchors and exit defeaters, and chart the complexity of auditing it (all complexity statements are measured in the atlas size). Prediction and standing anchors are read by polynomial-time scans of the atlas, without iterative fixpoint computation; a reason's defeater frontier is obtained by scanning and subset-minimizing the defeaters above it. But verifying that a reason core is robust is coNP-complete, and deciding whether a robust core of size at most $\theta$ exists is $\Sigma_2^p$-complete -- a four-cell P / coNP-complete / NP-complete / $\Sigma_2^p$-complete landscape, with the accepted ($A(t)=1$) case reaching the second level of the polynomial hierarchy. The decision version of minimal certified disclosure is NP-complete; its optimization version is fixed-parameter tractable in the number of excluded worlds without defeaters, with the general-defeater case open. On exact audits of depth-limited decision trees over standard tabular datasets under a deliberately small Boolean abstraction, the governing parameters fall in a small-parameter regime (robust cores in the low single digits), so exact robust auditing is tractable in these audited cubes; on adversarial instances built from our reductions the hardness bites, with robust cores of size $\Theta(n)$. To our knowledge this is the first $\Sigma_2^p$-complete audit query for disclosure-robust formal explanations.

2606.19399 2026-06-19 cs.LG cs.AI cs.LO cs.PL 新提交

VERITAS: Verifier-Guided Proof Search for Zero-Shot Formal Theorem Proving

VERITAS:验证器引导的零样本形式定理证明搜索

Manish Acharya, Zhenyu Liao, Yueke Zhang, Kevin Leach, Yu Huang, Yifan Zhang

发表机构 * Department of Computer Science, Vanderbilt University(范德堡大学计算机科学系) Amazon(亚马逊)

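AI summary Presents VERITAS, a framework that feeds verifier feedback into zero-shot theorem proving through a two-phase protocol (Best-of-N sampling followed by critic-guided MCTS), reaching 40.6% on miniF2F, and releases the combinatorics benchmark VERITAS-CombiBench.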

详情
AI中文摘要

基于LLM的形式化证明器通常将丰富的验证器信号(语法错误、类型不匹配、部分目标进展)压缩为二进制的通过/失败位。我们提出VERITAS,一个零样本框架,通过两阶段协议将每个验证器信号路由回证明搜索:首先进行Best-of-N采样,然后进行批评引导的MCTS遍历,该遍历将第一阶段失败作为显式负例吸收。该协议保留其第一阶段扫描解决的每个定理,因此第二阶段额外的解决可归因于反馈驱动的探索。VERITAS在miniF2F上达到40.6%(相比之下,独立运行的Best-of-5为36.9%,Portfolio为26.2%),在VERITAS-CombiBench上达到7.3%,这是一个我们发布的55个定理的组合学基准,在该基准上Best-of-5(1.8%)低于Portfolio(3.6%),暴露了当必须从验证器反馈中迭代恢复正确的引理名称时,无指导的采样会带来损害。工件可在GitHub上获取。

英文摘要

LLM-based formal provers often collapse rich verifier signals (syntax errors, type mismatches, partial goal progress) into a binary pass/fail bit. We present VERITAS, a zero-shot framework that routes every verifier signal back into proof search through a two-phase protocol: Best-of-N sampling first, then a critic-guided MCTS pass that ingests Phase 1 failures as explicit negative examples. The protocol preserves every theorem solved by its own Phase 1 sweep, so Phase 2's additional solves are attributable to feedback-driven exploration. VERITAS reaches 40.6% on miniF2F (vs. an independently run Best-of-5 at 36.9%, Portfolio 26.2%) and 7.3% on VERITAS-CombiBench, a 55-theorem combinatorics benchmark we release on which Best-of-5 (1.8%) falls below Portfolio (3.6%), exposing that unguided sampling hurts when correct lemma names must be recovered iteratively from verifier feedback. Artifacts are available on GitHub.

2606.19398 2026-06-19 cs.SD eess.AS eess.SP 新提交

S-JEPA: Soft Clustering Anchors for Self-Supervised Speech Representation Learning

S-JEPA:用于自监督语音表示学习的软聚类锚点

Georgios Ioannides, Adrian Kieback, Judah Goldfeder, Linsey Pang, Aman Chadha, Aaron Elkins, Yann LeCun, Ravid Shwartz-Ziv

发表机构 * Carnegie Mellon University(卡内基梅隆大学) New York University(纽约大学) James Silberrad Brown Center for AI(詹姆斯·西尔伯拉德·布朗人工智能中心) Columbia University(哥伦比亚大学) Northeastern University(东北大学) Stanford University(斯坦福大学) Amazon GenAI(亚马逊生成式人工智能)

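AI summary Presents S-JEPA, which trains an encoder-predictor pair to match the soft posteriors of a Gaussian Mixture Model via KL divergence, without offline re-clustering or teacher distillation; under the SUPERB protocol it achieves the lowest WER among methods below 90M parameters and establishes a new Pareto frontier.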

Details
AI Chinese abstract

自监督语音编码器主要通过预测掩蔽位置处的离散硬聚类ID进行训练,这种方法会坍缩类别边界处的声学模糊性,并需要在迭代之间中断训练以对整个语料库进行重聚类。我们提出S-JEPA,一种JEPA风格的编码器-预测器对,通过KL散度训练以匹配掩蔽位置处高斯混合模型的软后验概率。训练作为连续优化轨迹分两个阶段进行:首先在MFCC特征上使用固定GMM,然后在编码器特征上使用在线GMM,输入层从无标签信号中自适应选择,从而消除了离线重聚类步骤以及手动选择聚类所在Transformer层的问题。在SUPERB协议下,S-JEPA在评估的低于90M参数的自监督方法中实现了最低的词错误率(WER),并在大约一半参数量的情况下在情感识别任务上与HuBERT-Base相当,无需离线重聚类或教师蒸馏即建立了新的帕累托前沿。对预测器在保留语音上的每帧熵的分析揭示了双峰分布,其中相当一部分帧的熵接近完美两聚类平局的熵,这直接经验性地证明了软目标训练目标保留了硬目标会坍缩的声学模糊性。代码可在以下网址获取:https://github.com/gioannides/s-jepa。

English abstract

Self-supervised speech encoders are predominantly trained by predicting discrete hard cluster IDs at masked positions, a recipe that collapses acoustic ambiguity at category boundaries and requires interrupting training to re-cluster the entire corpus between iterations. We introduce S-JEPA, a JEPA-style encoder-predictor pair trained to match the soft posteriors of a Gaussian Mixture Model at masked positions via KL divergence. Training runs as one continuous optimization trajectory in two phases: a fixed GMM over MFCC features, then an online GMM over encoder features, with the input layer selected adaptively from a label-free signal, removing both the offline re-cluster step and the hand-tuned choice of which transformer layer to cluster on. Under the SUPERB protocol, S-JEPA achieves the lowest WER among evaluated SSL methods below 90M parameters and matches HuBERT-Base on emotion recognition at roughly half its parameter count, establishing a new Pareto frontier without offline re-clustering or teacher distillation. An analysis of the predictor's per-frame entropy on held-out speech reveals a bimodal distribution with a substantial minority of frames near the entropy of a perfect two-cluster tie, providing direct empirical evidence that the soft-target objective preserves the acoustic ambiguity that hard targets would collapse. Code is available at https://github.com/gioannides/s-jepa.
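The soft-target idea in the abstract can be illustrated with a small stdlib-only sketch. It is not the S-JEPA code: it uses a one-dimensional mixture instead of MFCC or encoder features, the function names are hypothetical, and the KL direction (target posterior against predicted distribution) is an assumption, since the abstract only says "via KL divergence". It also shows the reference value mentioned in the entropy analysis: a perfect two-cluster tie has entropy ln 2 (about 0.693 nats).

```python
import math
from typing import List, Sequence


def gmm_posteriors(x: float, means: Sequence[float],
                   variances: Sequence[float],
                   weights: Sequence[float]) -> List[float]:
    """Soft posteriors p(k | x) of a 1-D Gaussian mixture (illustrative only)."""
    log_joint = [
        math.log(w) - 0.5 * math.log(2.0 * math.pi * v) - (x - m) ** 2 / (2.0 * v)
        for m, v, w in zip(means, variances, weights)
    ]
    peak = max(log_joint)  # subtract the max for numerical stability
    unnorm = [math.exp(lj - peak) for lj in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def kl_divergence(target: Sequence[float], predicted: Sequence[float],
                  eps: float = 1e-12) -> float:
    """KL(target || predicted) in nats; the direction is an assumption."""
    return sum(
        t * (math.log(t + eps) - math.log(q + eps))
        for t, q in zip(target, predicted)
        if t > 0.0
    )


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def hard_target(p: Sequence[float]) -> List[float]:
    """One-hot target from the argmax, as used with hard cluster IDs."""
    best = max(range(len(p)), key=lambda k: p[k])
    return [1.0 if k == best else 0.0 for k in range(len(p))]
```

For a frame halfway between two equally weighted components, `gmm_posteriors(0.0, [-1.0, 1.0], [1.0, 1.0], [0.5, 0.5])` gives equal posteriors whose entropy is ln 2, whereas `hard_target` assigns all mass to one cluster and the ambiguity is lost.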

2606.19397 2026-06-19 cs.RO new submission

DiffusionVS: A Generative Framework for Robust Visual Servoing Based on Diffusion Policy

DiffusionVS:基于扩散策略的鲁棒视觉伺服生成框架

Hongkang Cui, Rui He, Haoyao Chen

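AI summary Proposes a Diffusion Policy-based visual servoing method that generates camera velocity through conditional denoising and uses online training to improve generalization, with success rates of nearly 100% in simulation and 93% in physical experiments.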

Comments 8 pages, 4 figures, 7 tables

Details
AI Chinese abstract

视觉伺服是机器人操作和导航中的基础技术。基于回归的视觉伺服常因噪声敏感的单步映射和分布偏移时的误差累积而出现轨迹抖动。相比之下,扩散策略通过预测动作序列保持时间一致性,并通过隐式数据增强提高鲁棒性。本文提出一种新颖的基于扩散的伺服方法。基于扩散策略,该方法使用观测标签角点的归一化图像坐标作为输入,通过条件去噪生成相机速度。为了克服在静态数据集上训练的模型的泛化限制,采用了在线训练范式,通过交互经验收集持续扩展训练数据的多样性。该策略显著提升了模型的性能和泛化能力。全面的仿真和实际实验证明了该方法的有效性,在仿真中实现了近100%的成功率,在物理实验中达到93%。除了具体的流程,我们进一步验证了扩散机制的通用性。实验表明,现有的视觉伺服网络在与我们的扩散模块集成时,性能持续提升。这些结果表明,所提出的策略具有广泛的适用性,能够增强除本文具体架构之外的各种视觉伺服系统。

English abstract

Visual servoing is a fundamental technique in robotic manipulation and navigation. Regression-based visual servoing frequently experiences trajectory jitter as a result of noise-sensitive single-step mappings and the accumulation of errors during distribution shifts. In contrast, Diffusion Policy maintains temporal consistency by predicting action sequences and improves robustness through implicit data augmentation. This paper presents a novel diffusion-based servoing method. Based on Diffusion Policy, the proposed approach uses normalized image coordinates of observed tag corners as input and generates camera velocity through conditional denoising. To overcome the generalization limitations of models trained on static datasets, an online training paradigm is adopted, continuously expanding the diversity of training data through interactive experience collection. This strategy substantially enhances both the performance and generalization capability of the model. Comprehensive simulations and real-world experiments demonstrate the effectiveness of the proposed method, achieving success rates of nearly 100% in simulation and 93% in physical experiments. Beyond the specific pipeline, we further validate the generality of the diffusion mechanism. Experiments show that existing visual servoing networks consistently achieve improved performance when integrated with our diffusion-based module. These results indicate that the proposed strategy possesses broad applicability and can enhance various visual servoing systems beyond the specific architecture presented here.
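The input representation mentioned in the abstract, normalized image coordinates of tag corners, can be sketched with the standard pinhole camera model. This is an assumption about the preprocessing, not a description of the authors' pipeline: the function names are hypothetical, the intrinsics `fx`, `fy`, `cx`, `cy` are placeholder values, and lens distortion is ignored.

```python
from typing import List, Sequence, Tuple


def normalize_corners(corners_px: Sequence[Tuple[float, float]],
                      fx: float, fy: float,
                      cx: float, cy: float) -> List[Tuple[float, float]]:
    """Map pixel coordinates (u, v) to normalized image coordinates (x, y).

    Pinhole model without distortion: x = (u - cx) / fx, y = (v - cy) / fy.
    """
    return [((u - cx) / fx, (v - cy) / fy) for u, v in corners_px]


def to_feature_vector(points: Sequence[Tuple[float, float]]) -> List[float]:
    """Flatten corner coordinates into one conditioning vector.

    With four tag corners this yields an 8-dimensional input; using it to
    condition the denoiser is an assumption made for illustration.
    """
    return [coord for point in points for coord in point]
```

For example, with placeholder intrinsics `fx = fy = 500`, `cx = 320`, `cy = 240`, a corner at the principal point `(320, 240)` maps to `(0.0, 0.0)` and a corner at `(820, 240)` maps to `(1.0, 0.0)`.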

2606.19395 2026-06-19 cs.SE new submission

DevOps and General Developers: Insights from Stack Overflow's 2023 Survey

DevOps 与普通开发者:来自 Stack Overflow 2023 年调查的见解

Hasan Abdulla, Fatema AlJazeeri, Fawzi AlBalooshi, Jaflah Al-Ammary

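AI summary Analyzes Stack Overflow 2023 survey data to compare DevOps specialists and general developers in tools, technologies, methodologies, and demographics, finding that the two roles are complementary and that tool preferences do not differ significantly.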

Comments 17 pages, 11 tables, research paper based on the 2023 Stack Overflow Developer Survey data analysis

Details
AI Chinese abstract

目的:调查 DevOps 专家和普通软件开发者在当前软件开发环境中不同的角色,考察他们在工具、技术、方法论和人口统计方面的不同使用情况。此外,区分这两个专业群体在该领域的独特贡献和挑战。设计/方法论/方法:研究采用定量方法分析 Stack Overflow 2023 年开发者调查数据。重点比较 DevOps 专家和普通开发者在技术偏好、人口统计信息和专业经验方面的差异,突出关键趋势和差异。数据分析使用 Python 的 Pandas 库进行。发现:研究表明,DevOps 专家和普通开发者在工具和技术偏好上没有显著差异,突出了他们的互补角色。DevOps 专家和普通开发者都使用 Docker 和 Kubernetes 等工具,强调效率和自动化。而普通开发者根据不同的角色需求使用多样化的工具,人口统计趋势显示普通开发者更年轻,DevOps 专业人员处于职业生涯中期。这一年龄范围反映了 DevOps 经验的增长,两个群体都在适应技术行业不断发展的远程和混合工作模式。实际意义:这项研究提供了对软件开发中动态角色的视角,强调了 DevOps 日益增长的重要性。它是学术和行业专业人士了解软件开发角色不断演变的宝贵资源。原创性/价值:这项研究填补了现有文献中关于软件开发角色动态演变的重要空白。

English abstract

Purpose: To investigate the distinct roles of DevOps specialists and general software developers, examining their varying use of tools, technologies, methodologies, and demographics in the current software development environment. In addition, to differentiate these two professional groups regarding their unique contributions and challenges in the field. Design/Methodology/Approach: The research uses a quantitative approach to analyze data from the Stack Overflow 2023 Developer Survey. It focuses on a comparative analysis of technological preferences, demographic information, and professional experiences between DevOps specialists and general developers, highlighting key trends and differences. The data analysis was conducted using Python's Pandas library. Findings: The research indicates no significant difference in tool and technology preferences between DevOps specialists and general software developers, highlighting their complementary roles. Both groups use tools like Docker and Kubernetes, emphasizing efficiency and automation. While general developers employ diverse tools for various role demands, demographic trends show younger general developers and mid-career DevOps professionals. This age range reflects growing experience in DevOps, and both groups are adapting to remote and hybrid work models in the evolving tech industry. Practical Implications: This research offers perspectives on the dynamic roles within software development, emphasizing the growing importance of DevOps. It is a valuable resource for academic and industry professionals to understand the evolving dynamics in software development roles. Originality/Value: This research fills a significant gap in the existing literature regarding the evolving dynamics of software development roles.