arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.18824 2026-05-20 cs.LG cs.AI cs.CL

Fine-Grained Benchmark Generation for Comprehensive Evaluation of Foundation Models

Mohammed Saidul Islam, Negin Baghbanzadeh, Farnaz Kohankhaki, Afshin Cheraghi, Ali Kore, Shayaan Mehdi, Elham Dolatabadi, Arash Afkanpour

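AI summary: This paper proposes an automated benchmark generation framework that produces broad-coverage, metadata-rich, contamination-resistant evaluation problems, enabling more comprehensive evaluation of foundation models.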

English abstract

Evaluation of foundation models often relies on aggregate scores from benchmarks that lack comprehensive coverage and metadata for a fine-grained evaluation. We introduce a framework for automated benchmark generation. Our framework generates evaluation problems grounded in reference material, such as textbooks, producing benchmarks with broad coverage, rich metadata, and robustness to contamination. The pipeline employs a multi-agent architecture for problem generation and a solution-graph-driven strategy that significantly improves the reliability of ground truth solutions. Using the framework, we generate three benchmarks in Machine Learning, Corporate Finance, and Personal Finance. Expert review finds a significantly lower ground-truth error rate than previous benchmarks such as MMLU and GSM8K. Evaluation of 12 commercial and open-source models shows that our benchmarks achieve near-uniform competency coverage and surface performance differences across models that existing benchmarks fail to capture. We will open-source the framework and our curated benchmarks soon.

2605.18823 2026-05-20 cs.LG

Multi-Pedestrian Safety Warning at Urban Intersections Use Case of Digital Twin

Yongjie Fu, Qi Gao, Mahshid Ghasemi Dehkordi, Gil Zussman, Xuan Di

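AI summary: This paper presents a multi-pedestrian safety warning system for urban intersections built on a tightly coupled physical-digital twin framework. Field deployment on the COSMOS wireless testbed and virtual reality experiments validate the system's effectiveness in improving safety-warning accuracy and response efficiency.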

English abstract

Digital twins (DTs) for urban transportation systems have gained increasing attention; however, their systematic evaluation in safety-critical scenarios remains limited. This paper presents a multi-pedestrian safety warning system at urban intersections enabled by a tightly coupled physical-digital twin framework. Built upon the COSMOS city-scale wireless testbed in New York City, the proposed system integrates camera and ultra-wideband (UWB), edge-cloud computing, predictive trajectory modeling, and MQTT-based communication to deliver real-time safety alerts to vulnerable road users (VRUs). The system is evaluated through both field deployment and virtual reality (VR) experiments. Results demonstrate high warning generation accuracy, localization accuracy, efficient end-to-end latency under different model configurations, and significant reductions in user response time when warnings are issued. The proposed DT framework provides a scalable, modular, and generalizable solution for real-time multi-pedestrian safety enhancement at complex urban intersections.

2605.18822 2026-05-20 cs.LG cs.AI

Hybrid-LoRA: Bridging Full Fine-Tuning and Low-Rank Adaptation for Post-Training

Chengqian Zhang, Wei Zhu, Kyumin Lee

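AI summary: This paper proposes Hybrid-LoRA, a framework that selectively applies full fine-tuning to a subset of modules and adapts the remaining modules with LoRA, achieving efficient and effective post-training.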

English abstract

Post-training has become essential for adapting large language models (LLMs) to complex downstream behaviors, including instruction following, preference alignment, and multi-step reasoning. Reinforcement learning with verifiable rewards (RLVR) has recently emerged as a particularly effective post-training paradigm for improving reasoning capabilities, with critic-free algorithms such as GRPO and GSPO enabling scalable optimization. However, RLVR post-training with full fine-tuning (FFT) requires substantial GPU memory and incurs high training costs. Although parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), effectively reduce computational costs, they often suffer from a noticeable performance gap compared to full fine-tuning in post-training for complex reasoning tasks. In this paper, we propose Hybrid-LoRA, an efficient hybrid post-training framework that selectively applies full fine-tuning to a small subset of modules less suited to low-rank adaptation, while adapting the remaining components with LoRA. We introduce a novel Hybrid-LoRA Score to rank candidate modules according to their sensitivity to low-rank adaptation under a fixed parameter budget. Experiments show that Hybrid-LoRA closely matches full fine-tuning performance under a 10% full fine-tuning module budget, with the remaining candidate modules adapted by LoRA, consistently outperforming four state-of-the-art PEFT post-training baselines, achieving improvements of up to 5.65% and on average 4.36% over the best baseline.
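The module split described in this abstract can be sketched as a simple ranking under a budget. This is only an illustration: the abstract does not define the Hybrid-LoRA Score, so the score values, the module names, and the `split_modules` helper below are hypothetical, and the sketch assumes that a higher score means a module is less suited to low-rank adaptation and that the 10% budget is counted in modules.

```python
import math


def split_modules(scores, fft_fraction=0.10):
    """Send the highest-scoring modules to full fine-tuning, the rest to LoRA.

    `scores` maps module name -> hypothetical sensitivity score
    (assumption: higher means less suited to low-rank adaptation).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Assumption: the budget is a fraction of the module count, at least one.
    n_fft = max(1, math.floor(len(ranked) * fft_fraction))
    return ranked[:n_fft], ranked[n_fft:]


# Hypothetical scores for ten candidate modules.
scores = {
    "layers.0.attn.q_proj": 0.12,
    "layers.0.attn.v_proj": 0.08,
    "layers.0.mlp.up_proj": 0.33,
    "layers.1.attn.q_proj": 0.15,
    "layers.1.mlp.down_proj": 0.47,
    "layers.2.attn.o_proj": 0.21,
    "layers.2.mlp.up_proj": 0.29,
    "layers.3.attn.q_proj": 0.05,
    "layers.3.mlp.down_proj": 0.91,
    "layers.3.mlp.up_proj": 0.40,
}

fft_modules, lora_modules = split_modules(scores)
print("full fine-tuning:", fft_modules)
print("LoRA:", len(lora_modules), "modules")
```

With ten candidates and a 10% budget, exactly one module is fully fine-tuned and the other nine receive LoRA adapters.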

2605.18821 2026-05-20 cs.LG cs.CR

Quantum Adversarial Machine Learning: From Classical Adaptations to Quantum-Native Methods

Roozbeh Razavi-Far, Mohammad Meymani, Erfan Mahmoudinia, Dorsa Vazirzade, Peyman Paknezhad, Fateme Ghasemi, Saeed Saravani, Somayeh Nikkhoo, Kimia Haghjooei

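AI summary: This paper surveys attack and defense strategies in quantum adversarial machine learning and discusses the field's theoretical foundations, emerging trends, and key challenges.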

Journal ref
Artif Intell Rev (2026)
English abstract

Machine learning has revolutionized numerous industrial domains. Despite recent advances, machine learning models remain vulnerable to adversarial threats. Adversarial machine learning is a field that studies these vulnerabilities to build robust machine learning models. Quantum machine learning is an interdisciplinary field that bridges quantum computing and classical machine learning. While quantum machine learning shows potential to outperform classical machine learning in complex tasks such as regression, classification, and generative modeling, it remains vulnerable to adversarial attacks. Given the recent advancements in quantum computing and machine learning, the quantum adversarial machine learning field has emerged to study the vulnerabilities of quantum machine learning, possible attacks, and novel quantum-enhanced defense strategies. In this survey, we provide a detailed overview of quantum adversarial machine learning and explore the existing attacks and countermeasures. We also review the theoretical underpinnings of this area, emerging trends, and critical challenges.

2605.18820 2026-05-20 cs.LG cs.AI

Emergence of Frontier Superposition: Möbius attractor and Cascade Supervision

Hongyu Gu, Jingwen Fu

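AI summary: This paper studies deep reasoning via superposition. It identifies a Möbius attractor and Cascade Supervision and shows that, on Erdős-Rényi graphs, superposition reasoning emerges from the combination of architectural and supervisory contributions.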

Comments 40 pages, 3 figures

English abstract

Superposition allows Transformers to reason in depth, carrying an entire reasoning frontier in parallel through a bounded-depth forward pass instead of unrolling serial chain-of-thought tokens. While Zhu et al. (2025) hand-crafted an equal-weight breadth-first frontier in a single residual stream for graph reachability, it remained open whether gradient descent could ever find this target amidst permutation-symmetric saddles. We close this gap on Reachability-by-Superposition over Erdős-Rényi graphs by isolating architectural and supervisional contributions. Architecturally, we identify a Möbius attractor: under $S_n$-symmetry in the tree regime, layerwise dynamics reduce to a 1D Möbius map whose zero set is a codimension-one manifold of global optima containing the equal-weight superposition state. On the supervision side, we identify Cascade Supervision: a loss class whose backward pass simultaneously delivers (A) selectivity bootstrap, (B) gradient persistence across depth, and (C) per-step discrimination (e.g., $\mathcal{L}_{sup}$ and $\mathcal{L}_{node}$). End-to-end supervision fails condition (B) and is provably insufficient: internal gradients at layer $c$ decay as $(np)^{-(D-c-2)/2}$ in the graph fan-out and stall before the manifold is reached. Our thesis: Möbius attractor + Cascade Supervision = emergence of superposition reasoning. The parameter-free decay law predicts a final-step cosine of 0.35 vs. 0.71 (end-to-end vs. cascade) at depth $D=3$; experiments confirm 0.37 vs. 0.69, matching within 0.02 at every step.
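To make the decay law in this abstract concrete, the expression (np)^{-(D-c-2)/2} can be evaluated layer by layer. The sketch below is plain arithmetic on that formula; the fan-out and depth values are hypothetical, and the assumption that layers run over c = 0, ..., D-2 is ours, since the abstract does not state the index range.

```python
def gradient_scale(fan_out, depth, layer):
    """Evaluate (np)^{-(D-c-2)/2} for mean fan-out np, depth D, and layer c."""
    return fan_out ** (-(depth - layer - 2) / 2)


# Hypothetical setting: mean fan-out np = 4 and depth D = 5.
fan_out, depth = 4.0, 5
for layer in range(depth - 1):  # assumed layer range c = 0 .. D-2
    print(f"layer {layer}: relative gradient scale {gradient_scale(fan_out, depth, layer):.3f}")
```

Under these assumed values the scale shrinks by a factor of sqrt(np) = 2 for every layer further from the output, which is the sense in which end-to-end gradients fail to persist across depth.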

2605.18819 2026-05-20 cs.LG

Efficient Conditioning: Why Pseudo Observation Batch Bayesian Optimization Works When It Does not

Kumbha Nagaswetha, Rabi Pathak

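AI summary: This paper studies why Constant Liar (CL), Kriging Believer (KB), and fantasy models, which are widely used for batch selection in parallel Bayesian optimization, are effective. It identifies efficient conditioning, the ability to update predictions in closed form when data is augmented, as the key surrogate property. It proves that Gaussian processes satisfy this requirement and that the result holds for any acquisition function monotonically non-decreasing in posterior uncertainty (EI, UCB, PI). It unifies CL, KB, and fantasy models as instances of a single conditioning mechanism, and draws quantitative connections to Local Penalization (LP) and qualitative connections to Determinantal Point Processes (DPPs).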

English abstract

Constant Liar (CL), Kriging Believer (KB), and fantasy models are widely used for batch selection in parallel Bayesian Optimization, yet a unified theory explaining their effectiveness and the conditions under which they fail has been lacking. We identify efficient conditioning as the key surrogate property: the ability to update predictions in closed form when data is augmented. We prove that Gaussian Processes satisfy this requirement, producing provably distinct batch points with separation of order l, and that this holds for any acquisition function monotonically non-decreasing in posterior uncertainty (EI, UCB, PI), with qualitatively similar behavior for Thompson Sampling. We unify CL, KB, and fantasy models as instances of a single conditioning mechanism differing only in the lie value distribution, and draw quantitative connections to Local Penalization (LP) and qualitative connections to Determinantal Point Processes (DPPs). To disentangle model structure from optimizer randomness, we introduce the Structural Diversity Diagnostic (SDD), a reusable methodology for testing surrogate compatibility. Experiments on Hartmann6D, Ackley 8D, Levy10D, and SVM hyperparameter tuning validate all theoretical predictions: CL or KB's implicit penalty matches or outperforms explicit LP; greedy conditioning achieves convergence on par with joint qEI; efficient conditioning extends to Multiquadric RBF networks; and parametric surrogates produce degenerate batches even when fully retrained (random forests), while neural networks regain diversity only at 15x the wall-clock cost of GP conditioning. Robustness is confirmed across multiple initial datasets and under observation noise.
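A small, dependency-free sketch may help show how Constant Liar builds a batch by conditioning a Gaussian process on a "lie" after each pick. Everything here is illustrative rather than the paper's implementation: the 1D toy data, the RBF lengthscale, the UCB coefficient, and the choice of the minimum observed value as the lie are assumptions, and the posterior is recomputed with a plain linear solve instead of the closed-form update the abstract refers to.

```python
import math


def rbf(a, b, length=0.2):
    """Squared-exponential kernel with unit prior variance."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def posterior(X, y, x_star, noise=1e-4):
    """Zero-mean GP posterior mean and variance at x_star."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    k_star = [rbf(a, x_star) for a in X]
    mean = sum(k * w for k, w in zip(k_star, solve(K, y)))
    var = 1.0 - sum(k * w for k, w in zip(k_star, solve(K, k_star)))
    return mean, max(var, 0.0)


def constant_liar_batch(X, y, candidates, batch_size, beta=2.0):
    """Greedily pick a batch, conditioning on a constant lie after each pick."""
    X, y, batch = list(X), list(y), []
    lie = min(y)  # assumption: pessimistic lie for a maximization problem
    for _ in range(batch_size):
        def ucb(x):
            mean, var = posterior(X, y, x)
            return mean + beta * math.sqrt(var)
        best = max(candidates, key=ucb)
        batch.append(best)
        X.append(best)  # pretend the point was observed...
        y.append(lie)   # ...with the lie as its value
    return batch


candidates = [i / 10 for i in range(11)]
batch = constant_liar_batch([0.1, 0.5, 0.9], [0.2, 0.8, 0.3], candidates, batch_size=3)
print(batch)
```

No explicit rule forbids repeated picks. Conditioning on the lie collapses the posterior uncertainty around each chosen point, so the acquisition value there drops and the next pick moves elsewhere.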

2605.18818 2026-05-20 cs.AI cs.LG cs.SE

Operationalizing Document AI: A Microservice Architecture for OCR and LLM Pipelines in Production

Yao Fehlis, Benjamin Bengfort, Zhangzhang Si, Vahid Eyorokon, Prema Roman, Patrick Deziel, Devon Slonaker, Steve Veldman, Ben Johnson, Joyce Rigelo, Michael Wharton, Steve Kramer

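AI summary: This paper presents a microservice architecture for document understanding in production that encapsulates a multi-model pipeline of classification, OCR, and LLM structured field extraction, and reports experience running it on thousands of multi-page documents per hour.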

English abstract

Academic research tends to focus on new models for document understanding, creating a wide gap in the literature between model definition and running models at production scale. To close that gap, we present a microservice architecture that encapsulates pipelines of multiple models for classification, optical character recognition (OCR), and large language model structured field extraction, as well as our experience running this pipeline on thousands of multi-page documents per hour. We describe our primary design decisions, including hybrid classification, separation of GPU-bound inference from CPU-bound orchestration, use of asynchronous processing for the many I/O-bound operations in the pipeline, and an independent, horizontal scaling strategy. Using batch profiling, we identified two surprising qualitative findings that shape production deployments: OCR, not language-model parsing, dominates end-to-end latency, and the system saturates at a concurrency determined by shared GPU-inference capacity rather than worker count. Our goal is to provide practitioners with concrete architectural patterns for building document understanding systems that work beyond the benchmark, effectively operationalizing models in production.
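The saturation finding in this abstract (concurrency bounded by shared GPU-inference capacity rather than worker count) can be illustrated with a toy asynchronous pipeline. This is a sketch under assumptions, not the authors' architecture: the `process_document` and `run_pipeline` names are hypothetical, `asyncio.sleep` stands in for real I/O and inference calls, and a semaphore models the shared GPU capacity.

```python
import asyncio


async def process_document(doc_id, gpu_slots, stats):
    """Handle one document: cheap async I/O, then a GPU-bound inference step."""
    await asyncio.sleep(0)  # stand-in for an I/O-bound fetch
    async with gpu_slots:   # shared GPU capacity gates the inference call
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.01)  # stand-in for an OCR inference request
        stats["in_flight"] -= 1
    return f"doc-{doc_id}: done"


async def run_pipeline(n_docs, gpu_capacity):
    """Launch one task per document and report the peak GPU concurrency."""
    gpu_slots = asyncio.Semaphore(gpu_capacity)
    stats = {"in_flight": 0, "peak": 0}
    tasks = (process_document(i, gpu_slots, stats) for i in range(n_docs))
    results = await asyncio.gather(*tasks)
    return results, stats["peak"]


results, peak = asyncio.run(run_pipeline(n_docs=8, gpu_capacity=2))
print(len(results), "documents processed; peak concurrent inference calls:", peak)
```

Raising `n_docs` adds more concurrent tasks but leaves the peak number of in-flight inference calls unchanged, because only `gpu_capacity` sets that bound.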

2605.18816 2026-05-20 cs.LG cs.AI

Symmetry in the Wild: The Role of Equivariance in Neural Fluid Surrogates

Patryk Rygiel, Julian Suk, Kak Khee Yeung, Christoph Brune, Jelmer M. Wolterink

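AI summary: This paper studies the role of equivariance in neural fluid surrogates, examining when equivariance improves generalization across tasks with different levels of distributional alignment and realism, and introduces AB-GATr, a model that efficiently handles coupled surface and volume quantities.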

English abstract

Neural surrogates enable orders-of-magnitude acceleration of computational fluid dynamics (CFD) simulations, with the potential to transform engineering and healthcare workflows. Neural surrogate use in real-world applications requires addressing scalability to large, high-resolution surface and volume meshes, as well as to bespoke architectures, and accounting for limited training data through the use of inductive biases. Group-equivariant architectures are a principled way to introduce such bias, yet they can be detrimental when the learning problem itself breaks symmetry, for example, due to strong distributional alignment in the dataset. In this work, we investigate under which conditions equivariance improves generalization in neural CFD surrogates across tasks with increasing levels of distributional alignment and realism, covering automotive aerodynamics and blood flow (hemodynamics). To systematically assess the added value of equivariance at the limit of problem scaling, we introduce the Anchored-Branched Geometric Algebra Transformer (AB-GATr), a neural surrogate that integrates scalability and symmetry preservation to efficiently model coupled surface and volume quantities in an $E(3)$-equivariant manner. We find that on strongly aligned aerodynamics datasets, i.e., those that break symmetry, enforcing equivariance can degrade in-distribution performance. In contrast, across hemodynamic benchmarks with diverse geometries and varying alignment, equivariance is consistently beneficial. Moreover, across all benchmarks, the explicit equivariance of AB-GATr reliably outperforms implicit symmetry learning through data augmentation. Our findings showcase that equivariance is not universally beneficial across domains, yet it brings tangible advantages in problems lacking strong data regularities.

2605.18815 2026-05-20 cs.LG cs.DC

DynaTrain: Fast Online Parallelism Switching for Elastic LLM Training

Yuanqing Wang, Yuchen Zhang, Hao Lin, Junhao Hu, Chunyang Zhu, Quanlu Zhang, Boxun Li, Guohao Dai, Zhi Yang, Daning Cheng, Yunquan Zhang, Yu Wang

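AI summary: This paper presents DynaTrain, a distributed training system for fast online reconfiguration across arbitrary multi-dimensional parallelism. A Virtual Parameter Space abstraction unifies all distributed training states and turns any parallelism configuration into a deterministic mapping; the system shows large performance gains on dense and MoE models.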

Comments GitHub Repo: https://github.com/infinigence/ElasticMegatron

English abstract

Modern large language model (LLM) training is inherently dynamic: resource fluctuations, RLHF phase shifts, and cluster elasticity continually reshape the optimal parallelism layout, posing a significant challenge to existing training frameworks built around a static execution model. We present DynaTrain, a distributed training system for sub-second, online reconfiguration across arbitrary multi-dimensional parallelism. At its core, we propose a Virtual Parameter Space (VPS) abstraction that unifies all distributed training states under one logical coordinate space, turning any parallelism configuration into a deterministic mapping and collapsing complex transitions into manageable geometric intersections. On top of VPS, a state routing-and-transition layer executes rank-local transfers under a memory-aware, deadlock-free schedule, and an Elastic Device Manager overlaps new-world construction with ongoing training to mask topology-change cost. On dense and MoE models up to 235B parameters, DynaTrain reconfigures a 70B dense model in under 2s and a 235B MoE model in 4.36s, outperforming state-of-the-art checkpoint-based and elastic systems by up to three orders of magnitude while preserving correctness.
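The idea of reducing a parallelism change to geometric intersections can be sketched in one dimension. The example below is a deliberately simplified assumption, not DynaTrain's VPS: it treats the training state as a flat range of parameter indices, shards it contiguously across ranks, and derives the transfer plan between an old and a new world size by intersecting the two sets of intervals. The `shard_ranges` and `transfer_plan` helpers are hypothetical names.

```python
def shard_ranges(total, world_size):
    """Split [0, total) into contiguous, near-equal ranges, one per rank."""
    base, extra = divmod(total, world_size)
    ranges, start = [], 0
    for rank in range(world_size):
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def transfer_plan(total, old_world, new_world):
    """List (src_rank, dst_rank, lo, hi) for every overlapping index range."""
    plan = []
    for src, (s0, s1) in enumerate(shard_ranges(total, old_world)):
        for dst, (d0, d1) in enumerate(shard_ranges(total, new_world)):
            lo, hi = max(s0, d0), min(s1, d1)
            if lo < hi:  # non-empty intersection means data must move src -> dst
                plan.append((src, dst, lo, hi))
    return plan


# Hypothetical resize: 12 parameters resharded from 4 ranks to 3 ranks.
plan = transfer_plan(total=12, old_world=4, new_world=3)
for src, dst, lo, hi in plan:
    print(f"rank {src} -> rank {dst}: indices [{lo}, {hi})")
```

Because each layout is a deterministic mapping from coordinates to ranks, the plan follows from the two layouts alone, and each rank can compute its own sends and receives locally. A multi-dimensional layout would intersect boxes instead of intervals.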

2605.18814 2026-05-20 cs.LG

How Faithful Is Trajectory-Based Data Attribution? Error Sources, Remedies, and Practical Guidelines

Junwei Deng, Pingbang Hu, Suliang Jin, Hao Lu, Jiachen T. Wang, Shichang Zhang, Jiaqi W. Ma

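AI summary: This paper systematically analyzes the error sources of trajectory-based data attribution methods and proposes remedies and practical guidelines. It organizes the total error into config-level, algorithm-level, and system-level categories, improves attribution accuracy, and gives actionable guidance for data selection.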

English abstract

Trajectory-based data attribution methods estimate the influence of training samples on model predictions by unrolling the training trajectory. They are widely used in applications such as data selection, data valuation, and model diagnosis, but there is a lack of comprehensive error analysis of these methods, raising concerns about method faithfulness and hindering reliable deployment. In this work, we provide the first systematic analysis of error sources in trajectory-based data attribution, together with concrete remedies to mitigate them and practical guidelines for downstream use. We organize the total error into three categories: config-level, algorithm-level, and system-level. We make three contributions. First, we identify optimizer mismatch as the dominant config-level error: existing methods derive their attribution under the assumption of SGD, even for models trained with the modern de facto optimizer AdamW. We propose AdamW-influence to fully account for AdamW's optimization dynamics, yielding improvements from 10% to over 300% in Spearman correlation between estimated and ground-truth influence across four settings spanning MLP, CNN, GPT-2, and Llama 3.2-1B. Second, we isolate the remaining algorithm-level error arising from the first-order Taylor approximation, identify the learning rate and trajectory length as factors governing the error magnitude, and derive a closed-form error proxy that can be evaluated along the original trajectory without retraining. Third, we translate these insights into practical guidelines for data selection by unifying offline and online strategies under a K-step look-ahead framework. Under this framework, online selection with a short horizon often matches or exceeds offline, and the optimal horizon can be tuned jointly with the learning rate. Together, these results turn the framework into an actionable selection recipe for practitioners.

2605.18813 2026-05-20 cs.LG cs.AI

Composition of Memory Experts for Diffusion World Models

Sebastian Stapf, Pablo Acuaviva Huertos, Aram Davtyan, Paolo Favaro

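AI summary: This paper proposes a diffusion-based world model framework that composes specialized memory experts to address the trade-off between memory and efficiency, improving temporal consistency, recall of past observations, and navigation performance.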

Journal ref
Proceedings of the Fourteenth International Conference on Learning Representations (ICLR), 2026
English abstract

World models aim to predict plausible futures consistent with past observations, a capability central to planning and decision-making in reinforcement learning. Yet, existing architectures face a fundamental memory trade-off: transformers preserve local detail but are bottlenecked by quadratic attention, while recurrent and state-space models scale more efficiently but compress history at the cost of fidelity. To overcome this trade-off, we suggest decoupling future-past consistency from any single architecture and instead leveraging a set of specialized experts. We introduce a diffusion-based framework that integrates heterogeneous memory models through a contrastive product-of-experts formulation. Our approach instantiates three complementary roles: a short-term memory expert that captures fine local dynamics, a long-term memory expert that stores episodic history in external diffusion weights via lightweight test-time finetuning, and a spatial long-term memory expert that enforces geometric and spatial coherence. This compositional design avoids mode collapse and scales to long contexts without incurring a quadratic cost. Across simulated and real-world benchmarks, our method improves temporal consistency, recall of past observations, and navigation performance, establishing a novel paradigm for building and operating memory-augmented diffusion world models.

2605.18812 2026-05-20 cs.LG cs.CL cs.IR

PASC: Pipeline-Aware Conformal Prediction with Joint Coverage Guarantees for Multi-Stage NLP and LLM Pipelines

Varun Kotte

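AI summary: This paper proposes PASC, a pipeline-aware conformal prediction method for multi-stage NLP and LLM pipelines that provides joint coverage guarantees across all stages.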

English abstract

Modern NLP and LLM systems are pipelines: named entity recognition (NER) -> entity disambiguation (NED) -> entity typing, retrieval-augmented generation (retriever -> reader), and agentic chains of planner -> tool -> critic. Errors compound across stages, but existing uncertainty quantification methods either calibrate each stage independently (no joint coverage) or apply a Bonferroni union bound (joint coverage, but conservative). We present PASC (Pipeline-Aware Split Conformal), which reduces multi-stage joint coverage to a single scalar conformal prediction problem on the joint maximum nonconformity score. PASC provides a finite-sample distribution-free guarantee that all K stages are simultaneously covered with probability at least 1 - alpha, and is nearly tight up to a 1/(n+1) factor. On a three-stage NER -> NED -> entity-typing pipeline over CoNLL-2003, PASC achieves 96.4% end-to-end coverage versus 93.4% for Bonferroni and 86.5% for independent CP, at identical average prediction set size (1.083). Under distribution shift to WNUT-17 Twitter and WikiNEuRal Wikipedia data, PASC empirically maintains the target coverage in the tested shift settings while independent CP collapses to 59%. PASC requires a single quantile computation, runs 1.7x faster than Bonferroni, and scales to K = 6 stages where independent CP drops to 0.53 end-to-end coverage. The same joint-maximum-score reduction applies directly to compound LLM systems and agent pipelines.
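The reduction described in this abstract, from multi-stage joint coverage to a single scalar conformal problem, can be sketched with a standard split-conformal quantile on the per-example maximum score. This is a minimal illustration under assumptions: the calibration scores and label scores are invented, the helper names are hypothetical, and the sketch assumes nonconformity scores are already on a comparable scale across stages.

```python
import math


def joint_max_threshold(cal_scores, alpha):
    """Split-conformal threshold on the max nonconformity score across stages.

    `cal_scores` holds one list of K per-stage scores per calibration example.
    """
    joint = sorted(max(stage_scores) for stage_scores in cal_scores)
    n = len(joint)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return joint[rank - 1]


def prediction_set(candidate_scores, threshold):
    """Keep every candidate label whose score does not exceed the threshold."""
    return {label for label, score in candidate_scores.items() if score <= threshold}


# Hypothetical calibration data: 7 examples, K = 3 stages each.
cal_scores = [
    [0.10, 0.30, 0.20],
    [0.50, 0.10, 0.40],
    [0.20, 0.20, 0.60],
    [0.05, 0.15, 0.10],
    [0.70, 0.20, 0.30],
    [0.25, 0.45, 0.35],
    [0.90, 0.10, 0.20],
]
threshold = joint_max_threshold(cal_scores, alpha=0.25)

# The same threshold is applied to every stage at test time.
stage_candidates = {"PER": 0.20, "ORG": 0.65, "LOC": 0.95}
print(threshold, prediction_set(stage_candidates, threshold))
```

A test example is jointly covered exactly when its largest per-stage score falls under the threshold, which is why one quantile computation is enough for all K stages.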

2605.18810 2026-05-20 cs.LG cs.AI

D-PACE: Dynamic Position-Aware Cross-Entropy for Parallel Speculative Drafting

Tianyu Wu, Yu Yao, Zhenting Qi, Han Zheng, Zhuohan Wang, Haoran Ma, Lawrence Liao, Himabindu Lakkaraju, Ju Li, Yilun Du

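AI summary: This paper proposes D-PACE, a dynamic position-aware cross-entropy loss for training parallel speculative drafters that adapts per-position weights during training to improve wall-clock speedup and average emitted length.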

English abstract

Speculative decoding accelerates LLM inference by having a small drafter propose tokens that a larger target model verifies in parallel. Recent diffusion-based parallel drafters such as DFlash predict the full B-token block in one forward pass, enabling deeper drafters and longer accepted blocks. However, existing multi-token drafter objectives often use fixed position-dependent weighting schedules, such as head-dependent weights or block-position decays, which do not adapt as the positions limiting acceptance change during training. To address this, we derive per-position training weights from a differentiable surrogate of expected accepted draft length, matching the weight of each position to its log-probability gradient contribution. The resulting loss, D-PACE (Dynamic Position-Aware Cross-Entropy), shifts training signal toward positions that currently limit acceptance as the drafter improves. Across six benchmarks, two Qwen3-4B draft depths, two decoding temperatures, and two additional target models, D-PACE consistently improves both wall-clock speedup and average emitted length, with 2.3% measured training-time overhead and no changes to the drafter architecture or inference procedure.
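One way to read the weighting idea in this abstract is through a simple surrogate. The abstract does not give the surrogate's form, so the sketch below rests on an assumption of ours: a draft token at position i is accepted only if all earlier positions are accepted, with independent per-position acceptance probabilities p_i. The expected accepted length is then the sum over k of the prefix products p_1 * ... * p_k, and its gradient with respect to log p_i is the sum of those prefix products for k >= i. The `position_weights` helper is a hypothetical name and is not claimed to be the paper's exact loss.

```python
def position_weights(accept_probs):
    """Gradient of the assumed expected accepted length w.r.t. each log p_i.

    Assumed surrogate: E[L] = sum_k prod_{j<=k} p_j, so
    dE[L]/dlog p_i = sum_{k>=i} prod_{j<=k} p_j.
    """
    prefixes, running = [], 1.0
    for p in accept_probs:
        running *= p
        prefixes.append(running)  # probability that positions 1..k are all accepted
    weights, tail = [], 0.0
    for prefix in reversed(prefixes):
        tail += prefix            # suffix sum of prefix products
        weights.append(tail)
    weights.reverse()
    return weights


# Hypothetical per-position acceptance probabilities for a 3-token block.
weights = position_weights([0.5, 0.5, 0.5])
print(weights)
```

Under this assumed surrogate the first weight equals the expected accepted length itself, and the weights change as the acceptance probabilities change during training, unlike a fixed decay schedule.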

2605.18809 2026-05-20 cs.LG cs.AI

Metric-Gradient Projection for Stable Multi-Agent Policy Learning

Zuyuan Zhang, Sizhe Tang, Mahdi Imani, Tian Lan

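AI summary: This paper proposes HPML, which treats the joint update field of a multi-agent system as an element of an $L^2$ space of vector fields and computes its Hodge-type projection onto the closest metric-gradient potential flow, improving the stability of multi-agent reinforcement learning.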

English abstract

General-sum multi-agent learning is often governed by a stacked update field in which each agent's policy update changes the optimization landscape faced by the others. This coupling can entangle an integrable component of collective improvement with cyclic interaction dynamics, leading to slow or unstable multi-agent learning. Existing approaches, such as regularization, credit assignment, and consensus methods, stabilize MARL through local or algorithmic modifications; HPML complements them by projecting the joint update field onto a metric-gradient component. We introduce HPML (Hodge-Projected Multi-agent Learning), which views the joint update field of a multi-agent system as an element of an $L^2$ space of vector fields and computes a Hodge-type projection onto the closest metric-gradient potential flow. HPML follows the projected component as the update direction, yielding the closest metric-gradient field under the chosen metric and sampling measure. The projection is defined variationally, characterized by a Poisson-type equation, and implemented through graph-based and amortized neural realizations that recover projected directions from samples. We show that the projected dynamics admit a Lyapunov potential and yield equilibrium-gap bounds with an explicit additive non-potentiality term. Controlled experiments validate the geometric mechanism, and CTDE benchmarks show improved stability and normalized return when HPML is used as a plug-in projection layer in MARL pipelines.

2605.18808 2026-05-20 cs.LG cs.AI cs.CL

Compositional Literary Primitives in Instruction-Tuned LLMs: Cross-Architectural SAE Features for Self, Style, and Affect

Joao Paulo Cavalcante Presa, Savio Salvarino Teles de Oliveira

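AI summary: This paper uses sparse autoencoders to study the compositional architecture of literary primitives in instruction-tuned LLMs, identifies four feature classes, and examines cross-architectural SAE features for self, style, and affect.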

Comments 36 pages, 6 figures

英文摘要

We characterize a compositional architecture of literary primitives in two instruction-tuned large language models (Llama 3.1 8B-Instruct and Gemma 2 9B-IT) via sparse autoencoders on mid-depth residual streams. Four feature classes emerge: naming-gates that promote lexical tokens of a target affect, an eleven-self cluster of first-person register features, stylistic register modulators (show-don't-tell and defamiliarization), and compositional emotions that arise only from multi-feature steering. Under a forced-choice 5-LLM judge panel applied to a 27-category emotion taxonomy (Cowen-Keltner), Llama reaches full 27/27 coverage by combining naming-gates, multi-feature recipes, and single self-feature steering; Gemma reaches 23/27 with adoration as the single residual strict-fail. Under random judging, the per-cell pass probability is on the order of $10^{-3}$ and the expected number of two-seed false-positive cells across the catalog is negligible, so the observed coverage is not consistent with chance. A cross-architectural asymmetry sits in the strict-versus-soft judge contrast: on the same generations, judges agree more often on Llama outputs than on Gemma outputs because Llama outputs name the target affect more directly while Gemma outputs evoke it through scene and imagery. Both architectures contain self-features that serve simultaneously as register markers and as emotion emitters, including a single most-RLHF-loaded self-feature per architecture that intensifies the institutional Helper-AI persona at one operating regime and produces affect-categorizable output at the same calibrated coefficient. Methodologically, the paper presents a three-stage validation pipeline (logit-lens, LLM-rate, 5-LLM judge) with documented anti-patterns; the total compute is single-GPU and about 15 minutes per emotion-feature discovery cycle.

2605.18804 2026-05-20 cs.LG cs.AI

Adaptive Multi-Scale Goodness Aggregation for Forward-Forward Learning

自适应多尺度良度聚合用于前-前学习

Salar Beigzad, Vansh Verma

AI总结 本文提出了一种自适应多尺度良度聚合(AMSGA)方法,通过改进局部学习神经网络的稳定性、鲁棒性和泛化能力,解决了原始前-前(FF)框架的局限性,实验表明在MNIST和Fashion-MNIST数据集上性能提升显著。

Comments 6 pages, 5 tables, IEEE format

详情
AI中文摘要

我们提出自适应多尺度良度聚合(AMSGA),这是对前-前(FF)算法的一种新扩展,旨在提高局部学习神经网络的稳定性、鲁棒性和泛化能力。AMSGA针对原始FF框架的若干局限,引入了跨局部、中间和全局表示的多尺度良度聚合,自适应课程引导的困难负样本挖掘,层依赖的自适应阈值,以及用于提升优化稳定性的warm-up余弦退火学习率调度。这些修改在保持FF范式生物合理性和内存高效性的同时增强了该范式。在MNIST和Fashion-MNIST上的实验表明,相对于基线FF算法性能有一致的提升,在MNIST上最高提升+1.45%,在Fashion-MNIST上最高提升+1.50%,且没有显著的计算开销。我们的结果表明,当良度估计和训练动态经过精心设计时,局部学习方法的竞争力可以大幅提高。

英文摘要

We propose Adaptive Multi-Scale Goodness Aggregation (AMSGA), a novel extension of the Forward-Forward (FF) algorithm designed to improve stability, robustness, and generalization in local-learning neural networks. AMSGA addresses several limitations of the original FF framework by introducing multi-scale goodness aggregation across local, intermediate, and global representations; adaptive curriculum-guided hard negative mining; layer-dependent adaptive thresholds; and a warm-up cosine annealing learning-rate schedule for improved optimization stability. Together, these modifications strengthen the FF paradigm while preserving its biologically plausible and memory-efficient properties. Experiments on MNIST and Fashion-MNIST demonstrate consistent performance improvements over the baseline FF algorithm, achieving up to +1.45% improvement on MNIST and +1.50% improvement on Fashion-MNIST without significant computational overhead. Our results suggest that local learning methods can become substantially more competitive when goodness estimation and training dynamics are carefully designed.
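
The warm-up cosine annealing schedule named in the abstract can be sketched as below. This is a minimal illustration, assuming a linear warm-up to a peak rate followed by cosine decay to a floor; the abstract does not give the exact form, and the function name and the parameters `base_lr`, `min_lr`, `warmup_steps`, and `total_steps` are hypothetical.

```python
import math


def warmup_cosine_lr(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Learning rate at a 0-indexed training step.

    Assumed shape (not taken from the paper): linear warm-up for
    `warmup_steps` steps, then cosine decay from `base_lr` to `min_lr`.
    """
    if step < warmup_steps:
        # Linear warm-up: base_lr / warmup_steps at step 0, base_lr at the last warm-up step.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Cosine decay: base_lr at progress 0, min_lr at progress 1.
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 9, 10, 55, 100):
        print(s, round(warmup_cosine_lr(s, total_steps=100, warmup_steps=10, base_lr=0.1), 4))
```

The warm-up phase avoids large early updates while layer-local goodness statistics are still unstable, and the cosine tail lowers the rate smoothly rather than in steps.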

2605.18801 2026-05-20 cs.AI cs.IR cs.LG

Position: Let's Develop Data Probes to Fundamentally Understand How Data Affects LLM Performance

立场:让我们开发数据探针,从根本上理解数据如何影响大语言模型性能

Shiqiang Wang, Herbert Woisetschläger, Hans Arno Jacobsen, Mingyue Ji

AI总结 本文主张开发系统化方法,由适当定义的随机过程生成合成序列(即数据探针),以揭示数据特性对大语言模型性能、泛化能力和鲁棒性的影响,从而超越经验启发式方法。

Comments Accepted to ICML 2026 Position Paper Track

详情
Journal ref
Link to ICML record: https://icml.cc/virtual/2026/poster/67154
AI中文摘要

数据对于大语言模型(LLMs)至关重要。然而,哪些数据对LLM工作流程的不同阶段(包括训练、微调、对齐、上下文学习等)有用,以及为什么有用,仍然是一个开放性问题。当前的方法严重依赖在大型公共数据集上进行大量实验,以获得用于数据过滤和数据集构建的经验启发式方法。这些方法计算成本高,并且缺乏一种有原则的方式来理解特定数据特性如何驱动LLM行为的本质。在这篇立场论文中,我们主张开发系统化方法,由适当定义的随机过程生成合成序列,使这些序列在用于LLM工作流程的一个或多个阶段时能够揭示有用的特性。我们将这些序列称为数据探针。通过观察LLM在数据探针上的行为,研究人员可以系统地研究数据特性如何影响模型性能、泛化能力和鲁棒性。探测序列所表现出的统计特性可以借助典型集等理论概念来考察,这些概念经过推广后可用于描述LLM的行为。这种数据探针方法为揭示数据在LLM训练和推理中的基础性作用提供了途径,超越了经验启发式方法。

英文摘要

Data is fundamental to large language models (LLMs). However, understanding what makes certain data useful for different stages of an LLM workflow, including training, tuning, alignment, in-context learning, etc., and why, remains an open question. Current approaches rely heavily on extensive experimentation with large public datasets to obtain empirical heuristics for data filtering and dataset construction. These approaches are compute-intensive and lack a principled way of understanding the essence of how specific data characteristics drive LLM behavior. In this position paper, we advocate developing systematic methodologies for generating synthetic sequences from appropriately defined random processes, with the goal that these sequences reveal useful characteristics when they are used in one or more stages of the LLM workflow. We refer to such sequences as data probes. By observing LLM behavior on data probes, researchers can systematically study how data characteristics influence model performance, generalization, and robustness. The probing sequences exhibit statistical properties that can be viewed using theoretical concepts, such as typical sets, which are generalized to describe the behaviors of LLMs. This data-probe approach provides a pathway for uncovering foundational insights into the role of data in LLM training and inference, beyond empirical heuristics.

2605.18800 2026-05-20 cs.LG cs.AI

Theory-optimal Quantization Based on Flatness

基于平坦度的理论最优量化

Xiusheng Huang, Zhe Li, Xuanwu Yin, Lu Wang, Yequan Wang, Dong Li, Emad Barsoum, Kang Liu

AI总结 本文提出了一种基于平坦度的理论最优量化方法,通过分析量化误差与异常值之间的数学关系,引入了平坦度指标来量化异常值分布,并提出了双向对角量化框架BDQ,有效分散异常值模式,提升了大语言模型在低比特精度下的性能。

Comments 16 pages, 2 figures

详情
AI中文摘要

后训练量化已成为压缩和加速大型语言模型(LLMs)推理的广泛采用技术。LLMs量化的首要挑战源于激活异常值,这些异常值在低比特精度下显著降低模型性能。尽管近期方法试图通过跨特征维度的线性变换来缓解异常值,我们的分析表明,变换后的权重和激活仍然表现出持续的异常值模式,具有集中化的幅度分布。在本文中,我们首先建模量化误差与异常值之间的数学关系,然后引入一个新的指标平坦度来量化异常值的分布。基于此,我们推导出与平坦度相关的理论最优解。基于这些见解,我们提出了双向对角量化(BDQ),一种新的后训练量化框架,通过优化的矩阵变换有效分散异常值模式。BDQ通过学习的对角操作策略性地将异常值幅度分布到矩阵维度中。广泛的实验表明,BDQ建立了新的量化基准。在LLaMA-3-8B模型上,BDQ在W4A4量化中实现了小于1%的精度下降。在更具挑战性的W2A4KV16实验中,与最先进的方法相比,BDQ在DeepSeek-R1-Distill-LLaMA-70B模型上将性能差距减少了39.1%。

英文摘要

Post-training quantization has emerged as a widely adopted technique for compressing and accelerating the inference of Large Language Models (LLMs). The primary challenge in LLM quantization stems from activation outliers, which significantly degrade model performance, especially at lower bit precision. While recent approaches attempt to mitigate outliers through linear transformations across feature dimensions, our analysis reveals that the transformed weights and activations still exhibit persistent outlier patterns with concentrated magnitude distributions. In this paper, we first model the mathematical relationship between quantization error and outliers, and then introduce a new metric, Flatness, to quantify the distribution of outliers. Based on this, we derive the theoretically optimal solution with respect to Flatness. Building on these insights, we propose Bidirectional Diagonal Quantization (BDQ), a novel post-training quantization framework that effectively disperses outlier patterns through optimized matrix transformations. BDQ strategically distributes outlier magnitudes across matrix dimensions via learned diagonal operations. Extensive experiments demonstrate that BDQ establishes a new quantization benchmark. It achieves less than 1% accuracy drop in W4A4 quantization on the LLaMA-3-8B model. In the more challenging W2A4KV16 experiment, compared to state-of-the-art approaches, BDQ reduces the performance gap by 39.1% on the DeepSeek-R1-Distill-LLaMA-70B model.

2605.18799 2026-05-20 cs.LG cs.AI cs.CL

ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning

ReCrit: 面向科学批评推理的转移感知强化学习

Wanghan Xu, Yuhao Zhou, Hengyuan Zhao, Shuo Li, Dianzhi Yu, Zhenfei Yin, Yaowen Hu, Fengli Xu, Wanli Ouyang, Wenlong Zhang, Lei Bai

AI总结 该研究提出转移感知的强化学习框架ReCrit,将批评交互建模为跨回合的正确性转移问题,奖励修正和鲁棒性、惩罚阿谀奉承,从而提升了科学推理基准上的批评准确性。

详情
AI中文摘要

大型语言模型在批评交互中不仅可能因回答错误而失败,还可能在用户批评后放弃最初正确的科学解答。在科学推理中,这种风险尤为突出,因为用户的批评可能将正确答案变为错误答案。我们将批评交互视为跨回合的正确性转移问题,而非最终答案准确性问题,并识别出三个挑战:转移感知、将有用的修正与有害的阿谀奉承解耦,以及可扩展的rollout。我们提出了ReCrit,一个转移感知的强化学习框架,将从初始回答到批评后回答的行为分解为四个象限:修正、阿谀奉承、鲁棒性和边界。ReCrit奖励修正和鲁棒性,惩罚阿谀奉承,并将持续错误视为弱边界信号。为了使交互训练切实可行,ReCrit进一步使用带尾部自适应补全的动态异步rollout,以减少rollout等待时间。在三个科学推理基准测试(ChemBench、TRQA和EarthSE)上,ReCrit在Qwen3.5-4B上将平均批评准确性从38.15提升到51.49,在Qwen3.5-9B上从45.40提升到55.59。消融实验显示,最终答案奖励带来的交互层面增益很小,而转移感知奖励和象限加权产生了更可区分的训练信号和更大的批评阶段净改进。代码可在 https://github.com/black-yt/ReCrit 获取。

英文摘要

Large language models can fail in critic interaction not only by answering incorrectly, but also by abandoning an initially correct scientific solution after user criticism. This is especially risky in scientific reasoning, where user criticism can turn a valid answer into an incorrect one. We frame critic interaction as an inter-turn correctness-transition problem rather than a final-answer accuracy problem, and identify three challenges: transition awareness, decoupling useful correction from harmful sycophancy, and scalable rollout. We propose ReCrit, a transition-aware reinforcement learning framework that decomposes Initial-to-Critic behavior into four quadrants: Correction, Sycophancy, Robustness, and Boundary. ReCrit rewards correction and robustness, penalizes sycophancy, and treats persistent errors as weak boundary signals. To make interaction training practical, ReCrit further uses dynamic asynchronous rollout with tail-adaptive completion to reduce rollout waiting. On three scientific reasoning benchmarks, ChemBench, TRQA, and EarthSE, ReCrit improves average Critic accuracy from 38.15 to 51.49 on Qwen3.5-4B and from 45.40 to 55.59 on Qwen3.5-9B. Ablations show that final-answer rewards provide little interaction-level gain, while transition-aware rewards and quadrant weighting produce more distinguishable training signals and larger net Critic-stage improvement. The code is available at https://github.com/black-yt/ReCrit .
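
The four-quadrant decomposition described above can be illustrated with a small reward function. This is a sketch under stated assumptions, not the authors' implementation: the mapping of quadrants to (initial correct, post-criticism correct) pairs is inferred from the abstract's wording, and the numeric weights in `QUADRANT_REWARD` are hypothetical placeholders.

```python
def classify_transition(initial_correct: bool, critic_correct: bool) -> str:
    """Name the quadrant for one Initial-to-Critic transition (inferred mapping)."""
    if initial_correct and critic_correct:
        return "robustness"   # kept a correct answer despite criticism
    if initial_correct and not critic_correct:
        return "sycophancy"   # abandoned a correct answer after criticism
    if not initial_correct and critic_correct:
        return "correction"   # fixed a wrong answer after criticism
    return "boundary"         # wrong before and after (persistent error)


# Hypothetical weights: reward correction and robustness, penalize sycophancy,
# and give persistent errors only a weak signal. The real values are not in the abstract.
QUADRANT_REWARD = {
    "correction": 1.0,
    "robustness": 1.0,
    "sycophancy": -1.0,
    "boundary": -0.1,
}


def transition_reward(initial_correct: bool, critic_correct: bool) -> float:
    """Scalar reward for one transition under the assumed weights."""
    return QUADRANT_REWARD[classify_transition(initial_correct, critic_correct)]


if __name__ == "__main__":
    for pair in [(True, True), (True, False), (False, True), (False, False)]:
        print(pair, classify_transition(*pair), transition_reward(*pair))
```

A final-answer reward would score `(True, True)` and `(False, True)` identically and `(True, False)` and `(False, False)` identically; conditioning on the transition is what separates sycophancy from a persistent error.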

2605.18798 2026-05-20 cs.LG cs.IT math.IT math.ST stat.ML stat.TH

Accurate Evaluation of Quickest Changepoint Detectors via Non-parametric Survival Analysis

通过非参数生存分析准确评估最快突变点检测器

Taiki Miyagawa, Akinori F. Ebihara

AI总结 本文提出非参数估计方法用于快速突变点检测中的平均运行长度和平均检测延迟,通过将突变点检测与生存分析类比,解决了有限和不规则序列长度下的估计问题,提升了模型的鲁棒性和可解释性。

Comments Accepted to ICML 2026. GitHub: https://github.com/TaikiMiyagawa/Kaplan-Meier-Average-Run-Length

详情
AI中文摘要

我们针对有限且不规则序列长度下的最快突变点检测(QCD),提出了平均运行长度(ARL)和平均检测延迟(ADD)的非参数估计器。尽管ARL和ADD在理论和模拟研究中被广泛用作最优性标准,但它们在真实数据集上的应用受到有限且不规则序列长度的阻碍。为了解决这个问题,我们通过将QCD与生存分析进行类比来建模序列截断下的检测概率,提出了ARL和ADD的非参数估计器,分别称为KM-ARL和KM-ADD。我们推导了估计偏差界,并证明除非需要外推,否则它们是渐近无偏的。在模拟和真实数据集上的实验展示了其实用价值:增强了对有限且不规则序列长度的鲁棒性,提高了可解释性,并便于进行基于经验的直观模型选择。我们的Python代码可在 https://github.com/TaikiMiyagawa/Kaplan-Meier-Average-Run-Length 获取,为从业者提供了即用型实现。

英文摘要

We propose non-parametric estimators for the average run length (ARL) and average detection delay (ADD) in quickest changepoint detection (QCD) under finite and irregular sequence lengths. Although ARL and ADD are widely used as optimality criteria in theoretical and simulation studies, their application to real-world datasets is hindered by limited and irregular sequence lengths. To address this issue, we propose non-parametric estimators for the ARL and ADD, termed KM-ARL and KM-ADD, by drawing an analogy between QCD and survival analysis to model detection probabilities under sequence truncation. We derive estimation bias bounds and prove that they are asymptotically unbiased unless extrapolation is required. Experiments on simulated and real-world datasets demonstrate their practical utility, enhancing robustness against limited and irregular sequence lengths, improving interpretability, and facilitating empirical, intuitive model selection. Our Python code is provided at https://github.com/TaikiMiyagawa/Kaplan-Meier-Average-Run-Length, offering ready-to-use implementations for practitioners.
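
The survival-analysis analogy can be made concrete with a textbook Kaplan-Meier estimator. The sketch below is not the paper's KM-ARL or KM-ADD estimator; it assumes each sequence yields a pair `(time, detected)`, where `detected=False` means the sequence ended at that length without an alarm (a censored observation), and it computes a restricted mean run length up to a hypothetical `horizon`. All function names are illustrative.

```python
from collections import Counter


def kaplan_meier(samples):
    """Return [(t, S(t))] at alarm times, where S(t) estimates P(stopping time > t).

    samples: iterable of (time, detected); detected=False marks a sequence
    that was truncated at `time` with no alarm (censored).
    """
    samples = list(samples)
    alarms = Counter(t for t, detected in samples if detected)
    exits = Counter(t for t, _ in samples)
    at_risk = len(samples)
    surv, curve = 1.0, []
    for t in sorted(exits):
        d = alarms.get(t, 0)
        if d:
            # Product-limit update; sequences censored at t still count as at risk at t.
            surv *= 1.0 - d / at_risk
            curve.append((t, surv))
        at_risk -= exits[t]
    return curve


def restricted_mean_run_length(samples, horizon):
    """Estimate E[min(tau, horizon)] as the sum of S(t) for t = 0 .. horizon - 1."""
    curve = kaplan_meier(samples)
    total, surv, idx = 0.0, 1.0, 0
    for t in range(horizon):
        while idx < len(curve) and curve[idx][0] <= t:
            surv = curve[idx][1]
            idx += 1
        total += surv
    return total


if __name__ == "__main__":
    data = [(2, True), (3, False), (4, True), (5, False)]
    print(kaplan_meier(data))
    print(restricted_mean_run_length(data, horizon=5))
```

Averaging only the observed alarm times would ignore the two truncated sequences and bias the run length downward; the product-limit form keeps them in the at-risk set until they end.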

2605.18796 2026-05-20 cs.LG cs.CL

UCCI: Calibrated Uncertainty for Cost-Optimal LLM Cascade Routing

UCCI:用于成本最优LLM级联路由的校准不确定性

Varun Kotte

AI总结 本文提出UCCI,一种校准优先的路由器,通过保序回归将token级间隔(margin)不确定性映射为逐查询的错误概率,并通过约束成本最小化选择升级阈值。在三个明确的假设下,基于校准分数的阈值策略是成本最优的,保序校准在期望校准误差(ECE)上达到O(n^{-1/3})的样本复杂度。在包含75,000个查询的生产命名实体识别工作负载上,UCCI将推理成本降低了31%(95% CI:[27%, 35%]),同时将ECE从0.12降低到0.03。

Comments 9 pages, 2 figures, 4 tables. Code: https://github.com/varunkotte6/ucci

详情
AI中文摘要

LLM级联和模型路由通过将简单查询发送到小模型、将困难查询升级到大模型来降低推理成本,但大多数已部署的路由器使用未校准的置信度分数,并且需要针对每个工作负载调整阈值。我们提出了UCCI,一种校准优先的路由器,通过保序回归将token级间隔(margin)不确定性映射为逐查询的错误概率,并通过约束成本最小化选择升级阈值。在三个明确的假设下,基于校准分数的阈值策略是成本最优的,保序校准在期望校准误差(ECE)上达到O(n^{-1/3})的样本复杂度。在一个包含75,000个查询、由4B和12B指令微调LLM在H100 GPU上提供服务的生产命名实体识别工作负载上,UCCI在micro-F1 = 0.91的条件下将推理成本降低了31%(95% CI:[27%, 35%]),同时将ECE从0.12降低到0.03。在相同的工作点上,UCCI优于熵阈值法、分割保形(split-conformal)路由以及FrugalGPT风格的学习阈值。所有级联结果均基于实际模型输出和实测H100延迟进行端到端路由,而非根据全局准确率或名义API价格进行模拟路由。

英文摘要

LLM cascades and model routing promise lower inference cost by sending easy queries to a small model and escalating hard ones to a large model, but most deployed routers use uncalibrated confidence scores and require per-workload threshold tuning. We present UCCI, a calibration-first router that maps token-level margin uncertainty to a per-query error probability via isotonic regression and selects the escalation threshold by constrained cost minimization. Under three explicit assumptions, threshold policies on the calibrated score are cost-optimal, and isotonic calibration achieves O(n^{-1/3}) sample complexity for expected calibration error (ECE). On a production named entity recognition workload of 75,000 queries served by 4B and 12B instruction-tuned LLMs on H100 GPUs, UCCI cuts inference cost by 31% (95% CI: [27%, 35%]) at micro-F1 = 0.91 while reducing ECE from 0.12 to 0.03. At the same operating point, UCCI beats entropy thresholding, split-conformal routing, and a FrugalGPT-style learned threshold. All cascade results use end-to-end routing on actual model outputs and measured H100 latency, not simulated routing from global accuracies or nominal API prices.
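
The two steps named in the abstract, isotonic calibration of an uncertainty score and threshold selection under a constraint, can be sketched in plain Python. This is an illustration under assumptions, not the UCCI code: the calibration uses the standard pool-adjacent-violators algorithm, the cost model (the small model always runs, escalation adds `cost_large`) and the error budget `max_error` are hypothetical, and the large model's error rate is treated as a constant `large_error`.

```python
def isotonic_fit(scores, errors):
    """Pool-adjacent-violators fit of error probability as a non-decreasing function of score.

    Returns [(score, calibrated_probability)] sorted by score.
    Tied scores are not pooled in this simplified version.
    """
    pairs = sorted(zip(scores, errors))
    blocks = []  # each block holds [sum of labels, count]
    for _, y in pairs:
        blocks.append([float(y), 1])
        # Merge backwards while the previous block's mean exceeds the current one.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return [(x, p) for (x, _), p in zip(pairs, fitted)]


def pick_threshold(probs, cost_small, cost_large, max_error, large_error=0.0):
    """Cheapest rule "escalate if p >= tau" whose expected error stays within max_error.

    Returns (tau, expected_cost_per_query, expected_error), or None if infeasible.
    """
    n = len(probs)
    best = None
    for tau in sorted(set(probs)) + [float("inf")]:
        kept = [p for p in probs if p < tau]       # answered by the small model
        n_escalated = n - len(kept)
        err = (sum(kept) + large_error * n_escalated) / n
        cost = cost_small + cost_large * n_escalated / n
        if err <= max_error and (best is None or cost < best[1]):
            best = (tau, cost, err)
    return best


if __name__ == "__main__":
    calibrated = isotonic_fit([0.1, 0.2, 0.3, 0.4], [0, 1, 0, 1])
    probs = [p for _, p in calibrated]
    print(probs)
    print(pick_threshold(probs, cost_small=1.0, cost_large=4.0, max_error=0.3))
```

Because the calibrated score is monotone in the raw uncertainty, a single threshold on it induces a single threshold on the raw score, so the router stays a one-parameter rule after calibration.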

2605.18795 2026-05-20 cs.LG cs.AI

HELLoRA: Hot Experts Layer-Level Low-Rank Adaptation for Mixture-of-Experts Models

Jia Wei, Zhonghao Zhang, Ping Chen, Qianyang li, Yancheng Pan, Shaoxun Wang, Ziyi Qiu, Longxiang Wang

AI总结 本文提出HELLoRA,一种针对混合专家模型的层级低秩适应方法,通过仅对最活跃的专家添加LoRA模块,减少可训练参数和计算量,同时提升下游任务性能。

详情

英文摘要

Low-Rank Adaptation (LoRA) dominates parameter-efficient fine-tuning of large language models, yet most variants target dense architectures. Mixture-of-Experts (MoE) models scale parameters at near-constant per-token compute, and their sparse activation patterns create untapped opportunities for more efficient adaptation. We propose Hot-Experts Layer-level Low-Rank Adaptation (HELLoRA), which attaches LoRA modules only to the most frequently activated experts at each layer. This simple mechanism reduces trainable parameters and adapter-induced FLOPs while improving downstream performance, an effect we attribute to a form of structured regularization that preserves pretrained expert specialization. To stress-test HELLoRA under extreme parameter budgets, we further compose it with LoRI to form HELLoRI, which freezes the up-projection and sparsifies the down-projection. Across three MoE backbones, namely OlMoE-1B-7B, Mixtral-8x7B, and DeepSeekMoE, and three task families covering mathematical reasoning, code generation, and safety alignment, HELLoRA consistently outperforms strong PEFT baselines. Relative to vanilla LoRA on OlMoE, HELLoRA uses 15.7% of the trainable parameters, reduces adapter FLOPs by 38.7%, achieves 1.9x the training throughput, and improves accuracy by 9.2%. On DeepSeekMoE, HELLoRA outperforms LoRA while using only 23.2% of its trainable parameters. These results demonstrate that activation-aware adapter placement is an effective and practical route to scaling PEFT for MoE language models.
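
The placement rule, attaching adapters only to the most frequently activated experts in each layer, can be sketched as a frequency count over a routing trace. This is a minimal sketch with hypothetical names: `routing_trace` (layer index mapped to the expert IDs chosen for a sample of tokens) and `top_k` are assumptions, and the abstract does not say how activation frequency is measured or how many experts per layer count as hot.

```python
from collections import Counter


def select_hot_experts(routing_trace, top_k):
    """Pick the top_k most frequently routed experts in each layer.

    routing_trace: dict mapping layer index -> iterable of expert IDs,
    one entry per (token, selected expert) routing decision.
    Returns dict mapping layer index -> list of hot expert IDs.
    """
    hot = {}
    for layer, expert_ids in routing_trace.items():
        counts = Counter(expert_ids)
        # Sort by descending count, then ascending expert ID for deterministic ties.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        hot[layer] = [expert for expert, _ in ranked[:top_k]]
    return hot


if __name__ == "__main__":
    trace = {0: [1, 1, 2, 3, 1, 2], 1: [0, 0, 5]}
    # Only the returned experts would receive LoRA modules; the rest stay frozen.
    print(select_hot_experts(trace, top_k=2))
```

With a mapping like this, the adapter parameter count scales with `top_k` per layer instead of with the total number of experts, which is the source of the parameter savings the abstract reports.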

2605.18794 2026-05-20 cs.LG cs.AI

Robust Basis Spline Decoupling for the Compression of Transformer Models

用于Transformer模型压缩的鲁棒B样条解耦

Joppe De Jonghe, Van Tien Pham, Mariya Ishteva

AI总结 本文提出了一种基于B-样条的解耦框架,通过利用B-样条的局部支持和灵活的光滑性控制,改进了传统张量解耦方法,提高了数值稳定性和表达能力,实验表明该方法在保持竞争力精度的同时实现了显著的参数减少。

详情
AI中文摘要

解耦是一种强大的建模范式,用于将多元函数表示为线性变换和单变量非线性函数的组合。单层解耦可以视为具有单个隐藏层和灵活激活函数的全连接神经网络,从而与神经网络建立了直接联系。因此,解耦方法在神经网络领域,尤其是压缩方面,受到越来越多的关注,因为它能够以更低的参数复杂度实现结构化近似。现有的基于张量的解耦方法通常依赖内部非线性函数的多项式或分段线性参数化,这可能导致数值不稳定或表达能力有限。在本工作中,我们引入了一种基于B样条的解耦框架,推广了这些现有方法。通过利用B样条的局部支撑和灵活的光滑性控制,所提出的公式得到了数值上更稳定、表达能力更强的表示。我们推导出一个带约束的耦合矩阵-张量分解,并提出了一种名为R-CMTF-BSD的鲁棒交替最小二乘算法,其中结合了归一化和Tikhonov正则化。所提出的方法通过合成数据实验和Transformer模型压缩实验进行了验证。在Vision Transformer和Swin Transformer架构上的结果表明,B样条解耦在保持有竞争力的精度的同时实现了大幅的参数减少,使R-CMTF-BSD算法成为结构化神经网络压缩的有前景的工具。

英文摘要

Decoupling is a powerful modeling paradigm for representing multivariate functions as compositions of linear transformations and univariate nonlinear functions. A single-layer decoupling can be viewed as a fully connected neural network with a single hidden layer and flexible activation functions, providing a direct link with neural networks. Because of this, the use of decoupling methods has gained increasing attention in neural network domains, particularly compression, since it enables structured approximations with reduced parameter complexity. Existing tensor-based decoupling methods typically rely on polynomial or piecewise-linear parameterizations of the internal nonlinear functions, which can suffer from numerical instability or limited expressiveness. In this work, we introduce a B-spline-based decoupling framework that generalizes these existing approaches. By exploiting the local support and flexible smoothness control of B-splines, the proposed formulation yields a more numerically stable and expressive representation. We derive a constrained coupled matrix-tensor factorization and propose a robust alternating least-squares algorithm, called R-CMTF-BSD, incorporating normalization and Tikhonov regularization. The proposed method is validated through experiments on synthetic data and transformer model compression. Results on the Vision and Swin Transformer architectures demonstrate that B-spline decoupling enables substantial parameter reduction while maintaining competitive accuracy, making the R-CMTF-BSD algorithm a promising tool for structured neural network compression.

2605.18793 2026-05-20 cs.LG cs.AI

Dimensional Balance Improves Large Scale Spatiotemporal Prediction Performance

维度平衡提升大规模时空预测性能

Jing Chen, Shixiang Pan, Yujie Fan, Haocheng Ye, Haitao Xu, Wenqiang Xu

AI总结 本文提出一种可扩展的自适应框架,通过压缩空间维度和扩展时间范围来解决时空预测中的性能瓶颈问题,从而提高预测精度和跨领域适用性。

详情
AI中文摘要

准确的时空模式分析在城市交通、气象和公共卫生监测等领域至关重要。然而,现有方法面临性能瓶颈,通常只能带来微小的改进,并且往往具有有限的跨领域迁移能力。我们通过空间和时间熵度量来分析这一瓶颈,这些度量用于诊断时空复杂性不匹配,而非作为熵对齐单独能提高预测的保证。经验上,较大的不匹配通常伴随着较高的预测不确定性,尤其是在模型容量预算固定的情况下。基于此诊断,我们提出了一种可扩展、自适应的框架,以协调空间和时间特征表示。通过低秩矩阵嵌入压缩空间维度以保留关键结构,而扩展的时间范围捕捉长距离依赖关系并减轻时间异质性带来的累积误差。在城市交通、气象和流行病数据集上的广泛实验显示了显著的准确性提升,并且在评估的各个领域中具有广泛的适用性,表明该框架在当前研究之外的广泛时空任务中具有前景。代码可在GitHub上获得:https://github.com/ST-Balance/ST-Balance。

英文摘要

Accurate spatiotemporal pattern analysis is critical in fields such as urban traffic, meteorology, and public health monitoring. However, existing methods face performance bottlenecks, typically yielding only incremental gains and often exhibiting limited cross-domain transferability. We analyze this bottleneck through spatial and temporal entropy measures, which are used as diagnostic indicators of spatiotemporal complexity mismatch rather than as guarantees that entropy alignment alone yields better forecasting. Empirically, larger mismatch is often accompanied by higher prediction uncertainty, especially under a fixed model-capacity budget. Guided by this diagnostic, we propose a scalable, adaptive framework that harmonizes spatial and temporal feature representations. Spatial dimensionality is compressed via low-rank matrix embedding to preserve essential structure, while an extended temporal horizon captures long-range dependencies and mitigates cumulative errors arising from temporal heterogeneity. Extensive experiments on urban traffic, meteorological, and epidemic datasets demonstrate substantial accuracy gains and broad applicability across the evaluated domains, suggesting that the framework is promising for a wide range of spatiotemporal tasks beyond the current study. The code is available on GitHub at https://github.com/ST-Balance/ST-Balance.

2605.10075 2026-05-20 cs.AI

Active Testing of Large Language Models via Approximate Neyman Allocation

通过近似奈曼分配主动测试大型语言模型

Zeli Liu, Jiancheng Zhang, Cong Liu, Yinglun Zhu

AI总结 本文提出了一种针对生成任务的主动测试算法,利用语义熵进行分层并基于代理模型提取的信号进行近似奈曼分配,从而在多个语言和多模态基准测试中显著提升性能,实现高达28%的均方误差降低和22.9%的预算节省。

详情
AI中文摘要

大型语言模型(LLMs)需要从预训练到测试时间扩展的可靠评估,使评估成为重复而非一次性成本。随着模型规模增长和目标任务日益需要专家标注者,每次评估所需的计算和标注成本迅速上升。主动测试旨在通过从评估池中较小但有信息量的子集近似评估结果来缓解这一瓶颈。然而,现有方法主要针对分类任务并在生成任务上失效。我们提出了一种新的主动测试算法,专门针对生成任务。我们的方法利用代理模型的语义熵对评估池进行分层,并基于这些代理模型提取的信号进行近似奈曼分配。在多个语言和多模态基准测试以及多种代理-目标模型配对中,我们的方法在基线上显著提升,并接近Oracle-Neyman,实现了相对于均匀采样高达28%的均方误差降低和平均22.9%的预算节省。

英文摘要

Large language models (LLMs) require reliable evaluation from pre-training to test-time scaling, making evaluation a recurring rather than one-off cost. As model scales grow and target tasks increasingly demand expert annotators, both the compute and labeling costs needed for each evaluation rise rapidly. Active testing aims to alleviate this bottleneck by approximating the evaluation result from a small but informative subset of the evaluation pool. However, existing approaches primarily target classification and break down on generative tasks. We introduce a novel active testing algorithm tailored to generative tasks. Our method leverages semantic entropy from surrogate models to stratify the evaluation pool and then conducts approximate Neyman allocation based on signals extracted from these surrogates. Across multiple language and multimodal benchmarks and a range of surrogate-target model pairs, our method significantly improves on baselines and closely tracks Oracle-Neyman, delivering up to 28% MSE reduction over Uniform Sampling and an average of 22.9% budget savings.
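
Classical Neyman allocation, which the method approximates, assigns each stratum a share of the labeling budget proportional to its size times its standard deviation. The sketch below shows only that textbook rule; how strata are formed from semantic entropy and how the per-stratum deviations are estimated from surrogate models are not specified in the abstract, so `sizes` and `stds` are taken as given, hypothetical inputs.

```python
def neyman_allocation(sizes, stds, budget):
    """Split an integer budget across strata in proportion to size * std.

    sizes: number of items in each stratum.
    stds: estimated standard deviation of the score within each stratum.
    Falls back to proportional allocation if every std is zero.
    Does not cap an allocation at the stratum size.
    """
    weights = [n * s for n, s in zip(sizes, stds)]
    total = sum(weights)
    if total == 0:
        weights = [float(n) for n in sizes]
        total = sum(weights)
    raw = [budget * w / total for w in weights]
    alloc = [int(r) for r in raw]
    # Hand out the remaining samples to the largest fractional parts.
    leftover = budget - sum(alloc)
    order = sorted(range(len(raw)), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


if __name__ == "__main__":
    # Three strata, e.g. low, medium, and high surrogate uncertainty.
    print(neyman_allocation(sizes=[100, 100, 200], stds=[0.1, 0.3, 0.2], budget=80))
```

Strata where the surrogate signals suggest high outcome variance receive more labels, which is what lowers the variance of the estimated benchmark score relative to uniform sampling.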

2603.11673 2026-05-20 cs.LG

Context-dependent manifold learning: A neuromodulated constrained autoencoder approach

基于上下文的流形学习:一种受神经调节的约束自编码器方法

Jérôme Adriaens, Gustave Bainier, Guillaume Drion, Pierre Sacré

AI总结 本文提出了一种受神经调节的约束自编码器(NcAE),通过上下文驱动的超网络调节自编码器的激活斜率和偏置,以恢复上下文变化下的投影保证,从而在物理系统中保持几何一致性。

Comments 26 pages, 5 figures, 24 Tables

详情
AI中文摘要

许多物理系统表现出随外部参数变化的低维结构:机器人的连杆长度、流体中的强迫常数或流动中的雷诺数会改变底层流形,但保持其内在维度不变。约束自编码器(cAEs)通过幂等的编码器-解码器投影来学习此类流形,这一性质是无约束自编码器无法具备的,并且在模型被迭代应用时至关重要。然而,使cAE依赖上下文的标准策略,即将上下文拼接到输入中或对隐藏激活进行仿射调制,会破坏编码器-解码器的幂等性,恰恰在最需要投影保证的场景中牺牲了这一保证。为了在上下文变化下恢复这一保证,我们开发了受神经调节的约束自编码器(NcAE),它通过上下文驱动的超网络调节cAE的激活斜率和偏置。本文介绍了NcAE及其理论基础和实验验证。我们证明,对于每个上下文,包括训练时未见过的上下文,重构映射仍是幂等投影,所学流形的拓扑保持不变,且上下文扰动只引起流形的平滑变化。我们在具有上下文依赖耦合的16自由度摆和跨越分岔的Lorenz96系统上评估了我们的方法。NcAE在重构、幂等性和潜在几何度量上达到或超过了六个基线中的最佳水平,同时是唯一通过构造保持几何一致性的架构。因此,NcAE为各类物理工况族提供了稳定的、保持几何结构的坐标系。

英文摘要

Many physical systems exhibit a low-dimensional structure that varies with external parameters: link lengths in a robot, forcing constants in a fluid, or Reynolds numbers in a flow shift the underlying manifold while preserving its intrinsic dimension. Constrained AutoEncoders (cAEs) learn such manifolds through an idempotent encoder-decoder projection, a property that unconstrained autoencoders cannot match and that is essential whenever the model is applied iteratively. However, the standard strategies for making a cAE context-dependent, namely concatenating the context to the input or affinely modulating hidden activations, break the encoder-decoder idempotency, sacrificing the projection guarantee precisely in the setting where it would be most valuable. To restore this guarantee under context variation, we developed the Neuromodulated Constrained Autoencoder (NcAE), which modulates the activation slope and bias of a cAE through a context-driven hyper-network. This paper presents the NcAE, its theoretical foundation, and its empirical validation. We prove that for every context, including contexts unseen at training time, the reconstruction map remains an idempotent projection, the topology of the learned manifold is invariant, and context perturbations induce smooth changes in the manifold. We evaluated our approach on a 16-DoF pendulum with context-dependent coupling and the Lorenz96 system across a bifurcation. The NcAE matched or exceeded the best of six baselines on reconstruction, idempotency, and latent-geometry metrics, while being the only architecture that preserves geometric consistency by construction. The NcAE thereby provides a stable, geometry-preserving coordinate system across families of physical regimes.

2602.04883 2026-05-20 cs.LG cs.AI q-bio.BM q-bio.QM

Protein Autoregressive Modeling via Multiscale Structure Generation

通过多尺度结构生成进行蛋白质自回归建模

Yanru Qu, Cheng-Yen Hsieh, Zaixiang Zheng, Ge Liu, Quanquan Gu

AI总结 本文提出了一种多尺度自回归框架PAR,用于通过粗到细的下一尺度预测生成蛋白质主链结构。核心方法包括多尺度下采样操作、自回归Transformer和基于流的主链解码器,通过噪声上下文学习和调度采样缓解曝光偏差,实现高质量主链生成,并展示了强大的零样本泛化能力。

Comments ICML 2026 Spotlight; ByteDance Seed Tech Report; Page: https://par-protein.github.io/

详情
AI中文摘要

我们提出了蛋白质自回归建模(PAR),这是首个多尺度自回归框架,用于通过粗到细的下一尺度预测生成蛋白质主链结构。利用蛋白质的分层性质,PAR生成的结构模仿雕刻雕像的过程,形成粗略拓扑结构并逐步细化结构细节。为此,PAR由三个关键组件组成:(i)多尺度下采样操作,在训练过程中表示蛋白质结构在多个尺度上的特征;(ii)一个自回归Transformer,编码多尺度信息并生成条件嵌入以指导结构生成;(iii)基于流的主链解码器,根据这些嵌入生成主链原子。此外,自回归模型由于训练和生成过程不匹配而遭受曝光偏差,这会显著降低结构生成质量。我们通过采用噪声上下文学习和调度采样有效缓解了这一问题,实现了鲁棒的主链生成。值得注意的是,PAR表现出强大的零样本泛化能力,支持灵活的人类提示条件生成和基序支架构建,而无需微调。在无条件生成基准测试中,PAR有效学习了蛋白质分布,并生成高质量的主链结构,且表现出良好的扩展性。这些特性使PAR成为蛋白质结构生成的有前途的框架。

英文摘要

We present protein autoregressive modeling (PAR), the first multi-scale autoregressive framework for protein backbone generation via coarse-to-fine next-scale prediction. Exploiting the hierarchical nature of proteins, PAR generates structures in a way that mimics sculpting a statue, first forming a coarse topology and then refining structural details over scales. To achieve this, PAR consists of three key components: (i) multi-scale downsampling operations that represent protein structures across multiple scales during training; (ii) an autoregressive transformer that encodes multi-scale information and produces conditional embeddings to guide structure generation; and (iii) a flow-based backbone decoder that generates backbone atoms conditioned on these embeddings. Moreover, autoregressive models suffer from exposure bias, caused by the mismatch between the training and generation procedures, which substantially degrades structure generation quality. We alleviate this issue by adopting noisy context learning and scheduled sampling, enabling robust backbone generation. Notably, PAR exhibits strong zero-shot generalization, supporting flexible human-prompted conditional generation and motif scaffolding without requiring fine-tuning. On the unconditional generation benchmark, PAR effectively learns protein distributions, produces backbones of high design quality, and exhibits favorable scaling behavior. Together, these properties establish PAR as a promising framework for protein structure generation.

2601.05391 2026-05-20 cs.LG

DynaSTy: A Framework for SpatioTemporal Node Attribute Prediction in Dynamic Graphs

DynaSTy: 一个用于动态图中时空节点属性预测的框架

Namrata Banerji, Tanya Berger-Wolf

AI总结 本文提出了一种端到端的动态边偏置时空模型,用于预测动态图中节点属性的多步未来值,通过引入可适应的注意力偏置和预训练目标,提高了长期预测的准确性。

详情
AI中文摘要

对动态图上的节点级属性进行准确的多步预测,对于从金融信任网络到生物网络的各类应用至关重要。现有的时空图神经网络通常假设邻接矩阵是静态的。在本文中,我们提出了一种端到端的动态边偏置时空模型,该模型以多维节点属性时间序列和邻接矩阵时间序列为输入,预测未来多个时间步的节点属性。在每个时间步,我们基于Transformer的模型将给定的邻接矩阵作为可适应的注意力偏置注入,使模型能够随着图的演变关注相关的邻居。我们进一步采用掩码节点-时间预训练目标,使编码器具备重建缺失特征的能力,并通过调度采样和按预测步长加权的损失进行训练,以减轻长预测范围内的误差累积。与先前工作不同,我们的模型能够处理随输入样本变化的动态图,从而支持多系统场景下的预测,例如不同受试者的脑网络、不同情境下的金融系统或不断演变的社会系统。实验结果表明,我们的方法在均方根误差(RMSE)和平均绝对误差(MAE)上始终优于强基线方法。

英文摘要

Accurate multistep forecasting of node-level attributes on dynamic graphs is critical for applications ranging from financial trust networks to biological networks. Existing spatiotemporal graph neural networks typically assume a static adjacency matrix. In this work, we propose an end-to-end dynamic edge-biased spatiotemporal model that ingests a multi-dimensional timeseries of node attributes and a timeseries of adjacency matrices, to predict multiple future steps of node attributes. At each time step, our transformer-based model injects the given adjacency as an adaptable attention bias, allowing the model to focus on relevant neighbors as the graph evolves. We further deploy a masked node-time pretraining objective that primes the encoder to reconstruct missing features, and train with scheduled sampling and a horizon-weighted loss to mitigate compounding error over long horizons. Unlike prior work, our model accommodates dynamic graphs that vary across input samples, enabling forecasting in multi-system settings such as brain networks across different subjects, financial systems in different contexts, or evolving social systems. Empirical results demonstrate that our method consistently outperforms strong baselines on Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE).

2510.03589 2026-05-20 cs.LG

FieldFormer: Locality-Aware Transformers for Spatio-Temporal Modeling on Sparse Sensor Networks

FieldFormer:用于稀疏传感器网络中时空建模的具有局部性的变换器

Ankit Bhardwaj, Ananth Balashankar, Lakshminarayanan Subramanian

AI总结 本文提出FieldFormer,一种无网格变换器架构,用于在持续传感器网络中进行具有局部性的传感器空间建模。通过学习可调节的速度缩放偏移量,聚合局部证据,以适应时空依赖性,并在极端稀疏性下实现稳定和可扩展的推理。

详情
AI中文摘要

现实世界系统中的时空传感器数据往往稀疏、噪声且不规则,使得潜在场重建从根本上处于欠约束状态。在极端稀疏性下,多个物理上合理的场可能与相同观测一致,要求模型依赖于关于局部性、传输和空间规律的归纳偏置。在这种情况下,可靠的重建集中在由传感器网络引起的观测支持上,使传感器空间建模比无约束的全局场恢复更具可识别性。我们引入FieldFormer,一种无网格变换器架构,用于在持续传感器网络中进行具有局部性的传感器空间建模。对于每个查询,FieldFormer通过可学习的速度缩放偏移量聚合局部证据,以适应邻域几何到时空依赖性。邻域被构建为固定最大稀疏上下文,覆盖附近的传感器和有限的时间窗口,使在极端稀疏性下实现稳定和可扩展的推理。一个局部变换器编码器整合邻域信息,而基于坐标的神经场公式支持无网格预测。我们在五个合成和现实世界基准上评估FieldFormer,包括各向异性热扩散、浅水动力学、大气传输和污染监测数据集。结果表明,具有局部性的重建在局部依赖域仍被观测时提供显著优势,使FieldFormer在稀疏传感器空间预测任务中一致优于最先进的基线。

英文摘要

Spatio-temporal sensor data in real-world systems is often sparse, noisy, and irregular, making latent field reconstruction fundamentally underconstrained. Under extreme sparsity, multiple physically plausible fields may remain consistent with the same observations, requiring models to rely on inductive biases about locality, transport, and spatial regularity. In such regimes, reliable reconstruction is concentrated around the observational support induced by the sensor network, making sensor-space modeling a more identifiable objective than unconstrained global field recovery. We introduce FieldFormer, a mesh-free transformer architecture for locality-aware sensor-space modeling in persistent sensor networks. For each query, FieldFormer aggregates local evidence using learnable velocity-scaled offsets that adapt neighborhood geometry to spatio-temporal dependencies. Neighborhoods are constructed as fixed maximal sparse contexts over nearby sensors and bounded temporal windows, enabling stable and scalable inference under extreme sparsity. A local transformer encoder integrates neighborhood information, while a coordinate-based neural field formulation supports mesh-free prediction. We evaluate FieldFormer on five synthetic and real-world benchmarks, including anisotropic heat diffusion, shallow-water dynamics, atmospheric transport, and pollution monitoring datasets. Results show that locality-aware reconstruction provides strong advantages when local domains of dependence remain observed, enabling FieldFormer to consistently outperform state-of-the-art baselines on sparse sensor-space prediction tasks.

2412.02818 2026-05-20 cs.RO cs.LG

RoboMD: Uncovering Robot Vulnerabilities through Semantic Potential Fields

RoboMD: 通过语义势场揭示机器人漏洞

Som Sagar, Jiafei Duan, Sreevishakh Vasudevan, Yifan Zhou, Heni Ben Amor, Dieter Fox, Ransalu Senanayake

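AI summary This work proposes RoboMD, a framework that learns a deep reinforcement learning policy over a continuous vision-language embedding to uncover vulnerabilities of robot manipulation policies under external variations in the real world. Virtual runs make the vulnerability analysis efficient and safe, and experiments show that the framework uncovers up to 23% more unique vulnerabilities than existing baselines and helps improve manipulation performance.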

Comments 26 Pages, 20 figures

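Details
AI-translated abstract

Robot manipulation policies are central to the promise of physical AI, yet they are highly vulnerable to external variations in the real world. Diagnosing these vulnerabilities faces two main challenges: (i) the relevant variations to test against are often unknown, and (ii) testing directly in the real world is costly and unsafe. We introduce a framework that addresses both issues by learning a separate deep reinforcement learning (deep RL) policy for vulnerability prediction, using virtual runs on a continuous vision-language embedding trained with limited success-failure data. By treating this embedding space, which is rich in semantic and visual variations, as a potential field, the policy learns to move toward vulnerable regions while being repelled from success regions. Because the vulnerability prediction policy is trained on virtual rollouts, vulnerability analysis can be carried out scalably and safely without expensive physical trials. By querying the policy, our framework builds a probabilistic vulnerability-likelihood map. Experiments on simulation benchmarks and a physical robot arm show that our framework uncovers up to 23% more unique vulnerabilities than state-of-the-art vision-language baselines, revealing subtle vulnerabilities that heuristic testing overlooks. We also show that fine-tuning the manipulation policy with the vulnerabilities discovered by our framework improves manipulation performance with much less fine-tuning data.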

English abstract

Robot manipulation policies, while central to the promise of physical AI, are highly vulnerable in the presence of external variations in the real world. Diagnosing these vulnerabilities is hindered by two key challenges: (i) the relevant variations to test against are often unknown, and (ii) direct testing in the real world is costly and unsafe. We introduce a framework that tackles both issues by learning a separate deep reinforcement learning (deep RL) policy for vulnerability prediction through virtual runs on a continuous vision-language embedding trained with limited success-failure data. By treating this embedding space, which is rich in semantic and visual variations, as a potential field, the policy learns to move toward vulnerable regions while being repelled from success regions. This vulnerability prediction policy, trained on virtual rollouts, enables scalable and safe vulnerability analysis without expensive physical trials. By querying this policy, our framework builds a probabilistic vulnerability-likelihood map. Experiments across simulation benchmarks and a physical robot arm show that our framework uncovers up to 23% more unique vulnerabilities than state-of-the-art vision-language baselines, revealing subtle vulnerabilities overlooked by heuristic testing. Additionally, we show that fine-tuning the manipulation policy with the vulnerabilities discovered by our framework improves manipulation performance with much less fine-tuning data.
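
The potential-field idea in the abstract can be made concrete with a small sketch. The abstract does not give a functional form, so everything here is an assumption for illustration: the Gaussian bumps, the `sigma` width, the reward defined as a drop in potential, and the names `potential` and `step_reward` are all hypothetical. The abstract describes a learned deep RL policy; this sketch only shows a signal under which moving toward failure regions and away from success regions would be rewarded.

```python
import math

def potential(z, failures, successes, sigma=1.0):
    """Assumed potential: low near failure embeddings, high near success embeddings."""
    def bump(center):
        sq = sum((a - b) ** 2 for a, b in zip(z, center))
        return math.exp(-sq / (2 * sigma ** 2))
    return sum(bump(s) for s in successes) - sum(bump(f) for f in failures)

def step_reward(z_before, z_after, failures, successes, sigma=1.0):
    """Assumed shaping reward: positive when a move lowers the potential,
    i.e. when it approaches failure regions or leaves success regions."""
    return (potential(z_before, failures, successes, sigma)
            - potential(z_after, failures, successes, sigma))
```

For example, with one known failure embedding at `(0.0, 0.0)` and one success embedding at `(4.0, 0.0)`, a move from `(2.0, 0.0)` to `(1.0, 0.0)` receives a positive reward, while the reverse move receives a negative one.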