arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.12497 2026-06-12 cs.LG cs.RO New submission

$μ$VLA: On Recurrent Memory for Partially Observable Manipulation in VLA Models

Egor Cherepanov, Nikita Kachaev, Daniil Zelezetsky, Aydar Bulatov, Artem Pshenitsyn, Yuri Kuratov, Alexey Skrynnik, Aleksandr I. Panov, Alexey K. Kovalev

Affiliations: CogAI Lab, Moscow, Russia; MIRAI, Moscow, Russia

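AI summary: To address the lack of memory in VLA models under partial observability, the paper adds minimal recurrent memory using only learnable memory tokens and truncated backpropagation through time. On MIKASA-Robo, average success on the training tasks rises from 0.42 to 0.84, and performance under full observability is preserved on LIBERO.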

Comments
34 pages, 20 figures, 9 tables
English abstract

Vision-language-action (VLA) models predict chunks of future actions from the current observation, an assumption that fails under partial observability, where decisions depend on information no longer visible. Existing memory-augmented VLAs simultaneously introduce recurrence, retrieval, compression modules, auxiliary objectives, hierarchical memory, or task-specific architectural changes, so the contribution of recurrence itself remains entangled with surrounding machinery. We present a controlled isolation study of recurrence in a strong pretrained VLA backbone. Our formulation augments the transformer with a small set of learnable memory tokens carried across timesteps and updated through self-attention, trained end to end with truncated backpropagation through time, with no auxiliary losses and no architectural changes. We instantiate this as $\mu$VLA, a family of OpenVLA-OFT variants parameterized by memory width m, TBPTT length K, and the memory update rule (cross-step gradients or a detached EMA), so that recurrence is the only varying factor. On MIKASA-Robo, $\mu$VLA improves average success rate on five training tasks from 0.42 to 0.84 at the strongest setting and reaches 0.23 on held-out tasks with the same memory structure versus 0.07 for the memoryless baseline. On tasks requiring different memory structure, performance remains near baseline. On LIBERO, the strongest recurrent variant achieves 96.2% average success, indicating no regression under full observability. We interpret these results as a calibration of the capability envelope of minimal in-backbone recurrence, identifying the regime in which it is sufficient and the regime where additional memory structure is required. Demos and videos can be found in this https URL.
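As a rough illustration of the "detached EMA" update rule named in the abstract, the sketch below carries a small memory vector across timesteps as an exponential moving average. The function names, the decay `alpha`, and the plain-list memory are assumptions made for illustration; the abstract does not say how $\mu$VLA implements this rule.

```python
def ema_update(memory, new_tokens, alpha=0.9):
    """Blend the previous memory with freshly written tokens (hypothetical rule)."""
    if len(memory) != len(new_tokens):
        raise ValueError("memory and new_tokens must share the same width m")
    # Plain floats carry no gradient, which only loosely mirrors "detached".
    return [alpha * m + (1.0 - alpha) * n for m, n in zip(memory, new_tokens)]


def run_episode(observations, width=4, alpha=0.9):
    """Carry the memory across timesteps; each observation is a single float here."""
    memory = [0.0] * width
    for obs in observations:
        proposed = [obs] * width  # stand-in for what self-attention would write
        memory = ema_update(memory, proposed, alpha)
    return memory
```

With `alpha` close to 1 the memory changes slowly and retains older observations longer, which is the property a partially observable task would rely on.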

2606.12495 2026-06-12 cs.SD New submission

Missing-Token Prompted Reliability-Aware Fusion for Robust Polyglot Speaker Identification

Peng Jia, Li Dai, Jia Li, Zhenzhen Hu, Ye Zhao, Richang Hong

Affiliations: Hefei University of Technology; Intelligent Interconnected Systems Laboratory of Anhui Province

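AI summary: Proposes the MRAF framework, which uses a learnable missing token and reliability-aware cross-attention fusion to address cross-lingual generalization and robustness to missing faces in polyglot scenarios, and reports high accuracy on the POLY-SIM 2026 test set.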

Comments
8 pages, 3 figures, 4 tables
English abstract

Accurate and robust multimodal speaker identification is essential for multimedia understanding and biometric authentication. However, real-world polyglot scenarios pose two key challenges: speaker-discriminative representations should generalize across languages, and the model should remain reliable when face information is unavailable. To address these challenges, we propose MRAF, a Missing-Token Prompted Reliability-Aware Fusion framework for polyglot speaker identification across complete-modality, missing-face, and cross-lingual scenarios. MRAF represents unavailable face inputs with a learnable missing token instead of fixed zero-valued features, providing a trainable representation of the missing visual state. This design reduces the distribution gap caused by missing inputs and allows subsequent reliability estimation and cross-modal fusion to operate within a unified token space. To adaptively integrate modalities with different reliability, MRAF further introduces a reliability-aware cross-attention fusion module, which estimates face and audio reliability scores, normalizes them into modality weights, and applies these weights to token representations before bidirectional cross-attention. In this way, the model can emphasize reliable modality cues while suppressing unreliable ones. During training, MRAF jointly optimizes multi-branch classification losses, audio-only knowledge distillation, and center loss to improve speaker discrimination and missing-modality robustness. Experiments on the official POLY-SIM 2026 test set demonstrate the effectiveness of the proposed framework. In the final evaluation, MRAF achieves 100% accuracy on P3 and P5, and obtains competitive results on the more challenging missing-face settings P4 and P6. The source code will be released at this https URL.
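The weighting step described in the abstract (estimate reliability scores, normalize them into modality weights, scale the tokens before cross-attention) might look roughly like the sketch below. Using a softmax for the normalization is an assumption, as are the function names; the abstract does not state which normalization MRAF uses.

```python
import math


def modality_weights(face_score, audio_score):
    """Turn two reliability scores into weights that sum to 1 (softmax is assumed)."""
    peak = max(face_score, audio_score)  # subtract the max for numerical stability
    exp_face = math.exp(face_score - peak)
    exp_audio = math.exp(audio_score - peak)
    total = exp_face + exp_audio
    return exp_face / total, exp_audio / total


def weight_tokens(tokens, weight):
    """Scale every token vector of one modality by that modality's weight."""
    return [[weight * value for value in token] for token in tokens]
```

A low face score, for instance when the face input is the learnable missing token, would shrink the face tokens so that the audio tokens dominate the subsequent fusion.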

2606.12494 2026-06-12 cs.LG New submission

Net-Ev$^2$: A Generative Simulator for Network Event Evolution

Guangyu Wang, Zhaonan Wang

Affiliations: NYU Shanghai

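AI summary: Proposes Net-Ev$^2$, a generative simulator that combines event cues with network topology and simulates network event evolution through structure-guided masked pre-training and a topology-aware diffusion process, reaching state-of-the-art performance on several road-network datasets.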

Comments
Accepted by KDD 2026 Research Track
English abstract

Reducing real-world trial and error has long been a central goal of decision making, and generative simulators advance this goal by modeling the evolution of future states. An even more challenging yet meaningful task is simulating how disturbance events (e.g., accidents) propagate their impacts across real-world networks. Existing approaches to simulating network event evolution fall short of modeling both the structured attributes and the unstructured semantics of events, and of capturing topological structure. We therefore propose Net-Ev$^2$ (Network Event Evolution), a novel generative simulator that jointly leverages event cues while preserving network topology in simulations. Specifically, the framework consists of two stages: structure-guided masked pre-training and a topology-aware diffusion process, the latter achieved by U-Net-like graph downsampling and upsampling during denoising. At inference time, Net-Ev$^2$ can generate simulations using natural-language event input only, with greater flexibility for practical usage. Furthermore, we introduce Net-Ev$^2$-6.5M, a multimodal benchmark of aligned event and network traffic data across four large-scale road networks, as well as a new topology-aware metric, JL-MMD, to evaluate topological fidelity in generated network dynamics. Extensive experiments demonstrate the state-of-the-art performance and strong generalization ability of Net-Ev$^2$. Code is made available at this https URL.

2606.12490 2026-06-12 cs.LG New submission

Robustness Verification of Recurrent Neural Networks with Abstraction Refinement

Li-Jen Lin, Chih-Duo Hong

Affiliations: National Science and Technology Council (NSTC), Taiwan

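AI summary: Proposes an abstraction-refinement framework that splits pre-activation intervals to remove nonlinear relaxation error and uses a SHAP-guided timestep selection strategy to limit combinatorial cost, substantially improving the verification success rate for RNN robustness.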

English abstract

Certified local robustness verification for recurrent neural networks (RNNs) is challenging because approximation errors introduced by nonlinear relaxations can propagate through recurrent connections and accumulate over time. As a result, scalable linear bound propagation methods often become overly conservative and fail to certify inputs that are in fact robust, especially when many pre-activation intervals cross zero. We propose an abstraction-refinement framework for RNN verification that partitions such intervals to remove the dominant relaxation error: on each refined branch, ReLU becomes exact, and smooth activations such as tanh and sigmoid admit substantially tighter linear envelopes. To control the combinatorial cost of splitting in long sequences, we introduce a SHAP-guided timestep selection strategy that ranks hidden states by their contribution to the verification objective and refines only the most critical timesteps in temporal order. Experiments on CIFAR10 and MNIST stroke benchmarks demonstrate consistent improvements in verification success and robustness-margin tightness over abstraction-only baselines, while exposing clear runtime trade-offs between ReLU and tanh models.
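The claim that ReLU "becomes exact" on each refined branch can be checked with a small calculation. The sketch below assumes the common triangle relaxation, whose upper bound on an interval [l, u] crossing zero is the chord from (l, 0) to (u, u); the abstract only says "linear bound propagation", so this specific relaxation and the function names are assumptions.

```python
def relu_upper_gap(lower, upper):
    """Largest gap between ReLU and its chord upper bound on [lower, upper]."""
    if lower >= 0 or upper <= 0:
        return 0.0  # ReLU is linear on the interval, so the chord is exact
    # The gap between the chord and ReLU peaks at x = 0.
    return -lower * upper / (upper - lower)


def split_at_zero(lower, upper):
    """Partition an interval that crosses zero into its two sign-stable halves."""
    if lower < 0 < upper:
        return [(lower, 0.0), (0.0, upper)]
    return [(lower, upper)]
```

For the interval [-1, 1] the relaxation gap is 0.5 before splitting and 0 on both halves afterwards, at the cost of doubling the number of branches to verify.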

2606.12488 2026-06-12 cs.LG New submission

A Stationary (and Therefore Compatible) Representation is All You Need

Niccolò Biondi, Federico Pernici, Simone Ricci, Alberto Del Bimbo

Affiliations: Media Integration and Communication Center (MICC), Dipartimento di Ingegneria dell’Informazione, Università degli Studi di Firenze

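AI summary: Proves that stationary representations learned with a d-Simplex fixed classifier satisfy the formal definition of compatibility, captures higher-order dependencies through a convex combination of cross-entropy and contrastive losses, and enables retrieval services that need no gallery reprocessing when the model is updated.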

Comments
Accepted to TPAMI 2026. Extension of the CVPR 2024 version (arXiv:2405.02581)
English abstract

Learning compatible representations aims to learn feature representations that can be used interchangeably over time whenever a model undergoes updates. In this paper, we demonstrate that stationary representations learned by $d$-Simplex fixed classifiers imply compatibility as in its formal definition. This result establishes a foundation for future work and can be directly exploited in practical learning scenarios. We address the challenge of learning compatibility using $d$-Simplex fixed classifiers when the model is sequentially fine-tuned. Learning according to a $d$-Simplex fixed classifier with the cross-entropy loss aligns feature distributions at the first-order statistics. Consequently, it may not fully capture higher-order dependencies in the representation between model updates. To address this issue, we demonstrate that training the model using a $d$-Simplex fixed classifier through a convex combination of the cross-entropy loss and a contrastive loss not only captures higher-order dependencies, but is also equivalent to learning with the cross-entropy under the compatibility constraints. We confirm our findings with extensive experiments, also considering a new scenario where a pre-trained model is sequentially fine-tuned and occasionally replaced with an improved model. We show that stationary representations enable uninterrupted retrieval services (without reprocessing gallery images) while improving performance during model updates and replacements, achieving state-of-the-art results. Code at this https URL.

2606.12487 2026-06-12 cs.LG New submission

DynamicPTQ: Mitigating Activation Quantization Collapse via Residual-Stream Dynamics

Zimo Zhao, Maolin Wang, Bowen Yu, Bowen Liu, Xiao Han, Xiangyu Zhao

Affiliations: City University of Hong Kong; Zhejiang University of Technology

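AI summary: Proposes DynamicPTQ, which analyzes the phase-wise dynamics of activations in the residual stream to identify quantization-sensitive layers and assigns them 8-bit precision. Under W4A4KV4 quantization it improves perplexity and zero-shot QA performance on LLaMA-2/3, with a 1.05 to 1.07 times throughput improvement.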

English abstract

Post-training quantization (PTQ) is essential for efficient large language model inference, but reliably quantizing activations remains challenging when weights, activations, and KV caches are all quantized to 4-bit precision. A key difficulty lies in massive activations, whose extreme values dominate the activation range and amplify quantization errors. State-of-the-art methods mainly mitigate massive activations through transformation-based smoothing, such as orthogonal rotations and affine scaling, but overlook the cross-layer dynamics of the residual stream. In this paper, we show that massive activations emerge and disappear in a phase-wise pattern across network depth, triggering large residual changes. These changes cause newly injected layer-wise updates to dominate the 4-bit quantization scale and weaken historical residual information. To characterize this behavior, we introduce Jump Ratio and Historical Feature SNR. This suggests that static transformation-based smoothing cannot fully resolve dynamic quantization instability caused by cross-layer residual changes. Based on this analysis, we propose DynamicPTQ, a Dynamic Post-Training Quantization policy for phase-aware mixed-precision activation quantization. DynamicPTQ identifies quantization-sensitive layers from residual-stream dynamics and assigns 8-bit activation precision only to these layers, while keeping weights, KV caches, and other activations in 4-bit precision. It can be directly integrated with strong PTQ baselines such as QuaRot, SpinQuant, and FlatQuant. Experiments on LLaMA-2 and LLaMA-3 show that DynamicPTQ consistently improves perplexity and zero-shot QA performance under W4A4KV4 quantization, while achieving 1.05 to 1.07 times throughput improvement with modest memory overhead. These results demonstrate a practical path toward robust low-bit LLM inference.
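To make concrete why a massive activation "dominates the activation range and amplifies quantization errors", the sketch below applies plain symmetric per-tensor quantization to a small vector with and without one extreme value. This is a generic textbook quantizer chosen for illustration, not the DynamicPTQ policy; the function names and example values are assumptions.

```python
def quantize_dequantize(values, bits):
    """Symmetric per-tensor quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return list(values)
    scale = peak / qmax  # the largest magnitude sets the step size
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_abs_error(values, bits):
    """Average absolute reconstruction error after a quantization round trip."""
    restored = quantize_dequantize(values, bits)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)
```

With the values `[0.1, -0.2, 0.3]` the 4-bit step is small enough to keep all three entries distinct, but appending a single outlier such as `50.0` stretches the step so far that the small entries all round to zero. Moving that tensor to 8 bits reduces the error, which is the kind of trade-off a mixed-precision policy exploits.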

2606.12486 2026-06-12 cs.LG New submission

An Empirical Study on Predictive Maintenance for Component X in Heavy-Duty Scania Trucks

Valeriu Dimidov, Sasan Jafarnejad, Raphaël Frank

Affiliations: SnT, University of Luxembourg; Scania CV AB

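AI summary: For truck fleets, proposes a condition-based predictive maintenance method that models wear state as a monotonically non-decreasing time series, selects the most recent observations and converts them to tabular data, and simplifies modeling with AutoML, reducing costs on the Scania Component X dataset.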

English abstract

Condition-based Predictive Maintenance (PdM) for truck fleets has gained momentum in recent years. This maintenance strategy aims to minimize unplanned downtimes and reduce costs by monitoring the health status of vehicles and taking proactive action based on their condition. However, the implementation of condition-based PdM systems is challenging due to the large volume of data generated by the trucks, the inherent complexity of detecting failures through sensor data and the difficulties in finding cost-effective trade-offs in the solution's implementation. In this paper, we define and validate a condition-based PdM methodology built on the assumption that the wear-and-tear state of the monitored component can be represented as a monotonically non-decreasing time series. It involves selecting only the most recent observations from the time series and transforming them into a tabular format for classification using machine learning (ML) models designed for tabular data. Our results indicate that the proposed methodology reduces costs on the Scania Component X dataset compared to current state-of-the-art (SOTA) approaches, while also simplifying the modeling process through AutoML.
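The data preparation step in the abstract (keep only the most recent observations and lay them out as one tabular row) could be sketched as follows. The window size `k`, the left-padding rule for short histories, and the function names are assumptions; the abstract does not give these details.

```python
def is_non_decreasing(series):
    """Check the monotonicity assumption placed on the wear signal."""
    return all(a <= b for a, b in zip(series, series[1:]))


def latest_window_row(series, k):
    """Flatten the last k readings into one feature row for a tabular model."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not series:
        raise ValueError("series must contain at least one reading")
    window = list(series[-k:])
    # Short histories are left-padded with their earliest value (an assumption).
    padding = [window[0]] * (k - len(window))
    return padding + window
```

Each vehicle then contributes one fixed-width row, which is what lets off-the-shelf tabular classifiers and AutoML tools be applied without a sequence model.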

2606.12485 2026-06-12 cs.LG cs.AI New submission

Speculative Rollback Correction for Quality-Diverse Web Agent Imitation

Longkun Hao, Hongyu Lin, Hao Li, Zhichao Yang, Haojie Hao, Dongshuo Huang, Haitao Yang, Hongyu Ge, Ming jie Xie, Yanjun Wu, Zi Hao Yin, Yan Bai, Yihang Lou

Affiliations: Beihang University; Institute of Software, Chinese Academy of Sciences; The Hong Kong University of Science and Technology; Northwestern Polytechnical University; Tsinghua University; The Hong Kong University of Science and Technology (Guangzhou); Peking University

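AI summary: Proposes the Speculative Rollback Correction (SRC) framework, which uses fixed-horizon branch review and a rollback mechanism to reduce teacher queries while preserving trajectory diversity, collecting 977 verifier-passing trajectories and 9,183 next-action examples on WebArena-Infinity.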

English abstract

Training interactive web agents through imitation learning from expert trajectories has emerged as a highly effective approach. However, determining the optimal timing for expert intervention presents a critical challenge in this context. Delayed intervention often leads to the accumulation of early-stage errors, pushing the page state into an irrecoverable regime. Conversely, premature or excessive intervention causes the agent to become overly reliant on expert policies, trapping the model in local optima characterized by a single, rigid trajectory. We propose Speculative Rollback Correction (SRC), a branch-level imitation framework for resettable agent environments. Instead of requesting teacher labels at every visited state or correcting only after a completed trajectory, SRC uses fixed-horizon branch review: the student executes a short speculative segment before teacher review, and the teacher localizes the first harmful deviation only when local progress breaks. Rollback preserves useful prefixes, while successful rollouts are filtered by a hard verifier and retained in a lightweight quality-diversity archive. The resulting data supports next-action supervised fine-tuning on both localized corrections and verifier-passing trajectories. On WebArena-Infinity, SRC collects 977 verifier-passing trajectories and 9,183 next-action examples; fixed-horizon review improves the recovery-versus-query tradeoff over step-level review while retaining verifier-passing solution variants. Code is available at this https URL.

2606.12481 2026-06-12 cs.LG cs.AI New submission

Representing Time Series as Structured Programs for LLM Reasoning

Jaeho Kim, Changhun Oh, Seokhyun Lee, Irina Rish, Changhee Lee

Affiliations: Korea University; Mila, University of Montreal

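AI summary: Proposes T2SP, which decomposes a time series into trends, periods, and salient events and represents it as a structured symbolic program, letting LLMs reason over it efficiently without fine-tuning and outperforming raw-sequence representations on editing, captioning, and question-answering tasks.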

Comments
Preprint
English abstract

Large language models (LLMs) have demonstrated strong reasoning and instruction-following capabilities, making them potentially powerful tools for time-series analysis. However, time series lie outside their native textual modality, raising a fundamental question: how should time series be represented so that LLMs can reason about them effectively? Existing work typically serializes raw numerical sequences or fine-tunes pre-trained LLMs on time-series data. These approaches place the burden of extracting temporal structure directly on the LLM, creating a modality mismatch that often degrades performance on long sequences and introduces substantial computational overhead. In this work, we introduce Time-Series-to-Structured-Program representation (T2SP), a deterministic, training-free method that represents a time series as a structured symbolic program. T2SP decomposes time series into trends, periods, and salient events, expressing them in a program-friendly format aligned with the textual and code-like modalities on which LLMs are natively trained. By shifting temporal-structure extraction from the model to the representation itself, T2SP enables off-the-shelf LLMs to leverage their existing reasoning capabilities for time-series understanding. We evaluate T2SP on three reasoning tasks -- editing, captioning, and question answering -- where it consistently improves performance, reduces reasoning time, and lowers failure rates compared with raw-string representations. Our results demonstrate that T2SP provides an effective interface between time series and LLMs.

2606.12479 2026-06-12 cs.LG cs.AI New submission

ReCal: Reward Calibration for RL-based LLM Routing

Qihang Yu, Hanwen Tong, Zhengqi Zhang, Bo Zheng, Feng Wei, Shengyu Zhang, Zemin Liu, Fei Wu

Affiliations: Zhejiang University; Ant Group; Shanghai AI Laboratory

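AI summary: Proposes the ReCal framework, which calibrates reward signals through hierarchical reward decomposition and distribution-aware optimization, addressing multi-objective conflicts and optimization bias across heterogeneous tasks and improving LLM routing performance and stability.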

English abstract

Large language model (LLM) routing has emerged as an effective paradigm for leveraging the complementary strengths of multiple LLMs through dynamic model and reasoning-strategy selection. Recent reinforcement learning (RL)-based routing methods further improve routing quality by optimizing routing policies from interaction feedback. However, they still struggle to provide informative and comparable learning signals under heterogeneous tasks with varying difficulty. In practice, multiple objectives (e.g., correctness, format behavior) are aggregated into a single scalar reward, leading to ambiguous credit assignment and conflicting optimization signals. Moreover, reward signals exhibit significant variability across instances, where some instances produce higher or more variable rewards, introducing optimization bias that favors trivial samples over informative ones. To address these issues, we propose ReCal, a Reward Calibration framework for RL-based LLM routing. We first introduce a hierarchical reward decomposition mechanism with component-wise advantage estimation. We further propose a distribution-aware optimization strategy that calibrates optimization variability through variance-aware reweighting and per-dataset normalization. Experiments on seven datasets demonstrate that ReCal consistently improves routing performance and training stability over baselines. Code is available at this https URL.
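One ingredient named in the abstract, per-dataset normalization, can be illustrated with a simple z-score computed within each dataset. This is a generic sketch of the idea rather than ReCal's actual procedure; the function name, the `(dataset, reward)` input format, and the `eps` stabilizer are assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_per_dataset(samples, eps=1e-8):
    """Z-score each reward against the other rewards from the same dataset.

    samples is a list of (dataset_name, reward) pairs; output order matches input.
    """
    groups = defaultdict(list)
    for name, reward in samples:
        groups[name].append(reward)
    stats = {name: (mean(r), pstdev(r)) for name, r in groups.items()}
    return [
        (reward - stats[name][0]) / (stats[name][1] + eps)
        for name, reward in samples
    ]
```

After normalization, a below-average answer on an easy dataset and a below-average answer on a hard dataset receive comparable signals, even though their raw rewards differ widely.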

2606.12476 2026-06-12 cs.LG cs.AI cs.CL New submission

Quickest Detection of Hallucination Onset: Delay Bounds and Learned CUSUM Statistics

Igor Itkin

Affiliations: Independent Researcher

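AI summary: Frames hallucination onset detection as a quickest change detection problem. Based on a first-order Markov model validated on RAGTruth, a learned CUSUM achieves a detection delay of 11-13 tokens at a matched false-alarm rate, outperforming a linear baseline, and reveals delay structure that classification metrics conceal.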

Comments
14 pages, 1 figure
English abstract

Token-level hallucination detectors are evaluated as classifiers, by AUC over all tokens, yet a streaming monitor is judged by its reaction time: the number of tokens that pass between the onset of a hallucination and the alarm. We formulate hallucination onset detection as a quickest change detection problem. A first-order Markov model of the latent faithful/hallucinated state, validated on RAGTruth, places the task inside classical change-point theory and yields Lorden's lower bound on detection delay: about 1.3 tokens at a false-alarm rate of 0.01. We then show that a causal recurrent labeler acts as a CUSUM with a learned increment; at a matched false-alarm rate it detects in 11-13 tokens, against 31 for a linear per-token baseline, and a controlled decomposition attributes most of this advantage to a better per-token score rather than to temporal accumulation. An information-rate optimality theorem of Donsker-Varadhan type explains the remaining order-of-magnitude gap: the learned score realizes only 1/4.5 of the divergence the features carry, a deficit that recalibration cannot remove, with the remainder a finite-horizon effect. Classification metrics conceal this delay structure; sequential analysis makes it measurable.
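For readers unfamiliar with CUSUM, the classical recursion accumulates per-token increments, resets at zero, and raises an alarm when the statistic crosses a threshold. The sketch below shows that recursion and a delay measurement; the increments here are hand-picked numbers rather than learned scores, and counting the onset token itself in the delay is a convention assumed for this example.

```python
def cusum_alarm_time(increments, threshold):
    """Return the 0-based index of the first alarm, or None if none is raised."""
    statistic = 0.0
    for t, increment in enumerate(increments):
        statistic = max(0.0, statistic + increment)  # reset at zero
        if statistic >= threshold:
            return t
    return None


def detection_delay(increments, threshold, onset):
    """Tokens from onset to alarm, inclusive; None for a miss or an early false alarm."""
    alarm = cusum_alarm_time(increments, threshold)
    if alarm is None or alarm < onset:
        return None
    return alarm - onset + 1
```

With increments of `-0.5` before the change and `1.0` after it, a threshold of `3.0` is reached three tokens after onset. Raising the threshold lowers the false-alarm rate at the cost of a longer delay, which is the trade-off that delay bounds of this kind quantify.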

2606.12475 2026-06-12 cs.RO New submission

Learning to Assist: Collaborative VLAs for Implicit Human-Robot Collaboration

Leo Xu, Letian Li, Alex Cuellar, Michael Hagenow

Affiliations: University of Wisconsin–Madison; Massachusetts Institute of Technology

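AI summary: Studies human-robot collaboration with vision-language-action (VLA) models trained by imitation learning, finds that action-chunking policies suffer from demonstration action leakage in implicit collaboration, proposes an inference-time steering method to mitigate premature assistive behavior, and validates it with a user study.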

English abstract

Human-robot collaboration (HRC) combines the complementary strengths of humans and robots to improve task efficiency. However, many existing collaborative systems rely on hand-engineered pipelines, limiting their scalability and flexibility for new tasks. In this work, we show that models trained end-to-end with imitation learning, specifically vision-language-action (VLA) models, can support collaborative manipulation, and characterize the key factors affecting their real-world performance. We evaluate two state-of-the-art models and identify a failure mode of action-chunking policies in implicit HRC, where demonstration action leakage (i.e., action chunks crossing latent task transitions) can cause premature assistive behavior. We find that this issue increases with longer execution horizons and occurs in real-world collaborative VLA systems, such as when a robot attempts to hand over a tool before the person is ready. We propose an inference-time steering method to mitigate these erroneous assistive actions while preserving policy performance. Finally, through a 16-participant user study on a long-horizon collaborative assembly task, we show that steering enables a longer execution horizon while mitigating premature assistance, leading to faster collaboration and fewer failures compared to a shorter-horizon policy.

2606.12473 2026-06-12 cs.CV New submission

Stereo Vision-Based Fall Prediction and Detection using Human Pose Estimation on the AMD Kria K26 SOM

Shreyas Narasimhiah Ramesh, P. D. Rathika, Mahasweta Sarkar, Kristen Wells, Michel Audette, Christopher Paolini

Affiliations: San Diego State University; PSG College of Technology; Old Dominion University

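AI summary: Proposes a low-power, portable stereo-vision fall prediction and detection system on the AMD Kria K26 SOM, using a three-stage pipeline of quantized YOLOX, A2J, and a CNN for real-time, privacy-preserving fall detection; the multi-threaded version reaches 4.5 FPS.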

Comments
19 pages; 31 figures
English abstract

Background and Objective: Falls among elderly people can cause serious injury and reduce quality of life. Timely prediction and detection are essential to prevent harm and support well-being. We propose a portable, low-power, battery-operated, vision-based fall prediction and detection system using human pose estimation (HPE) on an AMD Kria K26 System-on-Module (SOM). The objective is a non-intrusive, privacy-preserving system for real-time fall detection. Methods: The system uses an Intel RealSense D455 range-sensing camera connected to the K26 SOM by USB. It captures synchronized RGB and depth frames, 640 x 480 x 3 and 640 x 480 pixels, at 60 FPS. The SOM runs a three-stage pipeline with quantized YOLOX, Anchor-to-Joint (A2J), and fall-detection models. YOLOX identifies human bounding boxes from RGB frames, then discards the RGB frames to preserve privacy. A2J uses depth frames to estimate 15 joint keypoints per person. A CNN uses selected joint coordinates (x, y, z) to classify fall activity. YOLOX was trained on CrowdHuman; A2J on ITOP, MP-3DHP, UR Fall Detection, and a custom SDSU PSG dataset; and the CNN on UR Fall Detection and SDSU PSG. The design used a single-core DPU with a serial pipeline and a dual-core DPU running YOLOX and A2J with multiple threads. Results: Quantized accuracy was evaluated using IoU >= 50% for YOLOX, mAP with a 10-cm rule for A2J, and classification accuracy, (TP + TN)/(TP + TN + FP + FN), for the CNN. Accuracies were 74%, 84.13%, and 75.85%. Throughput improved from 2.5 FPS for the single-threaded pipeline to 4.5 FPS for the multi-threaded version. Conclusion: Results demonstrate the feasibility of privacy-preserving fall detection on an AMD Kria K26 edge device. On-device HPE and fall classification run without cloud dependency, supporting elderly monitoring and assistive healthcare. Future work will improve model accuracy and speed.
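Two of the evaluation quantities quoted in the abstract, the IoU used for the bounding-box criterion and the classification accuracy (TP + TN)/(TP + TN + FP + FN), are easy to state in code. The sketch below assumes boxes given as `(x1, y1, x2, y2)` corner coordinates; that box format and the function names are assumptions, not details from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes with x1 < x2, y1 < y2."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def accuracy(tp, tn, fp, fn):
    """Classification accuracy as (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    return (tp + tn) / total
```

Under the IoU >= 50% criterion, a predicted box would count as a correct detection when `iou(predicted, ground_truth) >= 0.5`.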

2606.12451 2026-06-12 cs.AI cs.IR cs.LG 新提交

ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs

ToolSense: 审计LLM中参数化工具知识的诊断框架

Ashutosh Hathidara, Sai Shruthi Sistla, Sebastian Schreiber, Sahil Bansal

发表机构 * SAP Labs(SAP实验室)

AI总结 提出ToolSense诊断框架,自动生成三类基准测试,揭示参数化工具检索中知识-检索分离现象,发现模型在模糊查询下性能显著下降。

详情
AI中文摘要

作为大型工具目录上的代理部署的大型语言模型面临关键的工具检索瓶颈。由于基于嵌入的检索方法依赖于可能无法充分捕获专用工具语义的紧凑编码器,参数化工具检索通过将每个工具编码为附加到LLM词汇表的虚拟令牌来解决这一问题,经过两个阶段(记忆然后检索SFT)的微调,将LLM用作检索器,在标准ToolBench检索基准上取得了强劲性能。然而,这些基准使用冗长、完全指定的查询,并且其评估应用了将输出限制为有效令牌路径的约束解码,这两者都不能揭示模型是否真正理解其工具。我们引入了ToolSense,一个开源LLM驱动的诊断框架,它将任何工具目录作为输入,并自动生成三个基准:具有三个模糊级别查询的现实检索基准(RRB)、MCQ探测基准和QA探测基准。将ToolSense应用于ToolBench(约47k个工具)并评估五个参数化模型训练配置,揭示了知识-检索分离:在RRB查询上,与完全指定的ToolBench基准相比,几个配置下降了约50-64个百分点,低于嵌入模型基线。此外,尽管检索性能强劲,一些模型在事实探测上得分接近随机,表明存在知识-检索分离。我们在此 https URL 开源了ToolSense框架和ToolBench诊断基准。

英文摘要

Large language models deployed as agents over large tool catalogs face a critical tool-retrieval bottleneck. As embedding-based retrieval approaches rely on compact encoders that may under-capture specialized tool semantics, parametric tool retrieval addresses this by encoding each tool as a virtual token appended to the LLM vocabulary, fine-tuned in two stages (memorization then retrieval SFT) to use the LLM as a retriever, achieving strong performance on standard ToolBench retrieval benchmarks. Yet these benchmarks use verbose, fully-specified queries, and their evaluation applies constrained decoding that restricts outputs to valid token paths; neither reveals whether the model actually understands its tools. We introduce ToolSense, an open-source LLM-powered diagnostic framework that takes any tool catalog as input and automatically generates three benchmarks: a Realistic Retrieval Benchmark (RRB) with queries at three ambiguity tiers, an MCQ probing benchmark, and a QA probing benchmark. Applying ToolSense to ToolBench (~47k tools) and evaluating five parametric model training configurations reveals a knowledge-retrieval dissociation: on RRB queries, several configurations collapse by ~50-64 percentage points compared to fully-specified ToolBench benchmarks, falling below the embedding-model baseline. Additionally, despite strong retrieval performance, some models score near-random on factual probes, suggesting a knowledge-retrieval dissociation. We open-source the ToolSense framework and the ToolBench diagnostic benchmarks at this https URL.

2606.13287 2026-06-12 cs.LG cs.DC math.OC 新提交

Clipping Makes Distributed and Federated Asynchronous SGD Robust to Stragglers

裁剪使分布式和联邦异步SGD对掉队者具有鲁棒性

Samuel Erickson, Mikael Johansson

发表机构 * KTH Royal Institute of Technology(瑞典皇家理工学院)

AI总结 本文理论证明梯度裁剪能消除异步SGD中最大延迟对复杂度的影响,基于次Weibull梯度噪声模型,首次实现异步优化的高概率收敛。

详情
AI中文摘要

在现代机器学习中,训练的并行化是扩大规模的重要策略。异步随机梯度下降(ASGD)通过避免等待慢速工作节点来最大化可用硬件的利用率。然而,在恒定步长下,由于更新中的大延迟,慢速工作节点仍然会对ASGD的收敛产生负面影响。同时,在深度学习模型的异步训练中,经验观察到梯度裁剪能“稳定”训练。在这项工作中,我们为这一行为提供了理论依据,证明裁剪消除了最大延迟对预言复杂度的依赖。我们采用次Weibull梯度噪声模型,该模型将次高斯和次指数分布推广到更重尾的分布,受深度学习中的经验观察启发。我们证明了期望收敛,并且首次在异步优化中证明了高概率收敛。

英文摘要

In modern machine learning, parallelization of training is an important strategy for increasing scale. Asynchronous stochastic gradient descent (ASGD) maximizes the utilization of available hardware by avoiding waiting for slow workers. However, with constant step sizes, the convergence of ASGD is nonetheless negatively affected by slow workers due to large delays in updates. At the same time, it has been empirically observed in asynchronous training of deep learning models that gradient clipping "stabilizes" training. In this work, we provide a theoretical justification for this behavior by showing that clipping removes the dependence of the oracle complexity on the maximum delay. We employ a sub-Weibull model of gradient noise, which generalizes sub-Gaussian and sub-exponential distributions to heavier-tailed distributions and is motivated by empirical observations in deep learning. We show convergence in expectation and, for the first time in asynchronous optimization, convergence with high probability.
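To make the role of clipping concrete, the sketch below applies standard norm clipping to a stale gradient before a constant-step update. It is only an illustration under stated assumptions: the abstract does not give the paper's exact clipping rule, threshold, or step-size choice, and the names `clip_by_norm` and `async_sgd_step` are hypothetical.

```python
import math


def clip_by_norm(grad, threshold):
    """Rescale grad so its Euclidean norm is at most `threshold`."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold:
        return list(grad)
    scale = threshold / norm
    return [g * scale for g in grad]


def async_sgd_step(params, stale_grad, step_size, threshold):
    """One server-side update using a gradient computed at an older iterate.

    `stale_grad` stands in for a gradient returned late by a slow worker.
    Clipping bounds how far a single delayed update can move the parameters.
    """
    clipped = clip_by_norm(stale_grad, threshold)
    return [p - step_size * g for p, g in zip(params, clipped)]
```

With a threshold of 1.0, a delayed gradient of norm 5 moves the parameters by at most `step_size`, the same bound that applies to any other update, which is the intuition behind bounding the influence of stragglers.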

2606.12876 2026-06-12 cs.LG cs.CL cs.IT 新提交

Multi-Bitwidth Quantization for LLMs Using Additive Codebooks

使用加性码本的大语言模型多比特宽度量化

Liza Babaoglu, Shuangyi Chen, Ashish Khisti

发表机构 * University of Toronto(多伦多大学)

AI总结 提出Drop-by-Drop框架,基于信息论和逐次细化理论,利用加性码本和Matryoshka监督实现单个模型在推理时支持多精度权重控制,降低存储开销并保持性能。

详情
Comments
37 pages, 12 figures
AI中文摘要

随着大语言模型(LLM)在具有不同资源约束的异构硬件上部署越来越广泛,无需重新训练即可自适应管理性能与效率之间权衡的能力变得至关重要。我们提出Drop-by-Drop,一种新颖的多比特宽度训练后量化框架,能够从单个训练模型实现对LLM权重的推理时精度控制。我们的方法在理论上基于信息论和逐次细化。我们证明,通常服从高斯分布的LLM权重,在由LLM损失函数驱动的加权均方误差失真下,随着额外比特的加入可以以递增的保真度最优重建。为了在实践中实现这一点,Drop-by-Drop将Matryoshka风格的监督纳入损失函数,利用了加性码本的结构。Drop-by-Drop生成单个模型,其中有序的码本子集在每个精度级别产生精确的部分重建。这种方法通过允许单个检查点服务于多个比特宽度,显著减少了存储和内存开销,同时在主要架构(如Qwen、LLaMA、Gemma和Mistral)上保持了有竞争力的困惑度和准确度。

英文摘要

As large language models (LLMs) are increasingly deployed across heterogeneous hardware with varying resource constraints, the ability to adaptively manage the trade-off between performance and efficiency without retraining is critical. We propose Drop-by-Drop, a novel multi-bitwidth post-training quantization framework that enables inference-time precision control over LLM weights from a single trained model. Our method is theoretically grounded in information theory and successive refinement. We establish that LLM weights, which commonly follow a Gaussian distribution, can be optimally reconstructed with increasing fidelity as additional bits are incorporated, under a weighted mean squared error distortion motivated by LLM loss functions. To realize this in practice, Drop-by-Drop incorporates Matryoshka-style supervision into the loss function, exploiting the structure of additive codebooks. Drop-by-Drop produces a single model where ordered subsets of codebooks yield accurate partial reconstructions at each precision level. This approach significantly reduces storage and memory overhead by allowing a single checkpoint to serve multiple bitwidths, while maintaining competitive perplexity and accuracy across major architectures, such as Qwen, LLaMA, Gemma, and Mistral.
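The idea that ordered subsets of additive codebooks give usable reconstructions at each precision level can be illustrated with a toy scalar example. The sketch below uses greedy residual quantization with hand-picked codebooks, which is a stand-in technique chosen for brevity: it is not the Drop-by-Drop training procedure (which the abstract describes as using Matryoshka-style supervision), and the codebook values and function names are assumptions.

```python
def encode(weight, codebooks):
    """Greedy residual encoding: pick one codeword per codebook, coarse to fine."""
    indices = []
    residual = weight
    for book in codebooks:
        # Choose the codeword closest to what is still unexplained.
        idx = min(range(len(book)), key=lambda i: abs(residual - book[i]))
        indices.append(idx)
        residual -= book[idx]
    return indices


def decode(indices, codebooks, num_books):
    """Reconstruct using only the first `num_books` codebooks (lower precision)."""
    return sum(codebooks[k][indices[k]] for k in range(num_books))


# Three hypothetical 2-bit codebooks, each refining the previous one.
CODEBOOKS = [
    [-1.5, -0.5, 0.5, 1.5],
    [-0.375, -0.125, 0.125, 0.375],
    [-0.09375, -0.03125, 0.03125, 0.09375],
]
```

A single stored index list then serves every bitwidth: calling `decode(indices, CODEBOOKS, k)` for k = 1, 2, 3 reads 2, 4, or 6 bits per weight from the same encoding, so no separate checkpoint is needed per precision level.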

2606.12710 2026-06-12 cs.LG math.OC 新提交

A Stabilized Path-Space Approach to Diffusion-Based Posterior Sampling

一种稳定的路径空间方法用于基于扩散的后验采样

Evan Scope Crafts, Umberto Villa, Saviz Mowlavi, Yanting Ma, Hassan Mansour, Wael H. Ali

发表机构 * Oden Institute for Computational Engineering and Sciences, The University of Texas at Austin(德克萨斯大学奥斯汀分校奥登计算工程与科学研究所) Mitsubishi Electric Research Laboratories (MERL)(三菱电机研究实验室) Department of Biomedical Engineering, The University of Texas at Austin(德克萨斯大学奥斯汀分校生物医学工程系)

AI总结 提出一种稳定的路径空间框架,通过随机最优控制与信任域优化,实现非线性逆问题中准确且鲁棒的后验采样。

详情
AI中文摘要

扩散模型为贝叶斯逆问题提供了表达性数据驱动先验,但许多扩散后验采样器依赖启发式引导近似,可能对非线性算子和多模态后验失效。本文开发了一种稳定的路径空间框架用于基于扩散的后验采样。从终端边际代表先验的基础扩散过程出发,我们定义了轨迹上的似然加权目标测度,并将后验采样转化为学习一个路径测度匹配该目标的受控随机过程。该公式将扩散后验采样与随机最优控制联系起来,同时保留了不确定性量化所需的贝叶斯结构。我们引入了一种时间重参数化,通过消除未知初始值函数引起的偏差,使路径空间控制问题适定,无需辅助训练。然后通过具有对数方差目标的信任域路径空间优化方法学习控制。路径空间视角还统一了我们的学习控制方法与现有的基于引导的采样器,量化了近似控制引起的采样误差,并产生了用于渐近精确后验期望的重要性采样校正。我们在具有解析表征或高质量参考后验的基准逆问题套件上评估了所提出的框架,从而实现了对采样精度和不确定性量化的原则性评估。这些实验深入揭示了基于扩散的后验采样器的行为,并证明了相比领先方法更高的准确性和鲁棒性。

英文摘要

Diffusion models provide expressive data-driven priors for Bayesian inverse problems, but many diffusion posterior samplers rely on heuristic guidance approximations that can fail for nonlinear operators and multimodal posteriors. In this work, we develop a stabilized path-space framework for diffusion-based posterior sampling. Starting from a base diffusion process whose terminal marginal represents the prior, we define a likelihood-weighted target measure on trajectories and cast posterior sampling as learning a controlled stochastic process whose path measure matches this target. This formulation connects diffusion posterior sampling to stochastic optimal control while preserving the Bayesian structure needed for uncertainty quantification. We introduce a time reparameterization that makes the path-space control problem well posed by removing the bias induced by the unknown initial value function, without auxiliary training. We then learn the control via a trust-region path-space optimization method with log-variance objectives. The path-space perspective also unifies our learned control approach with existing guidance-based samplers, quantifies the sampling error induced by approximate controls, and yields importance sampling corrections for asymptotically exact posterior expectations. We evaluate the proposed framework on a suite of benchmark inverse problems with analytically characterized or high-quality reference posteriors, enabling principled assessment of sampling accuracy and uncertainty quantification. These experiments provide insight into the behavior of diffusion-based posterior samplers and demonstrate improved accuracy and robustness over leading approaches.

2606.12611 2026-06-12 cs.LG cs.IT 新提交

Evaluation of AutoML Frameworks for IDS under Imbalanced Data Conditions of the NSL-KDD Dataset

NSL-KDD数据集不平衡数据条件下IDS的AutoML框架评估

Wiliane Carolina Silva, Evandro César Vilas Boas, Felipe A. P. de Figueiredo

发表机构 * Cybersecurity and Artificial Intelligence Laboratory (CS&I Lab), National Institute of Telecommunications (Inatel)(网络安全与人工智能实验室(CS&I Lab),国家电信研究所(Inatel)) Wireless and Artificial Intelligence Laboratory (WAI Lab), National Institute of Telecommunications (Inatel)(无线与人工智能实验室(WAI Lab),国家电信研究所(Inatel))

AI总结 研究NSL-KDD数据集上严重类别不平衡对多分类入侵检测中AutoML框架性能的影响,发现集成学习和不平衡感知优化可提升少数类检测能力,PyCaret表现最佳(macro-F1 66%)。

详情
AI中文摘要

本研究探讨了严重类别不平衡对使用NSL-KDD数据集进行多分类网络入侵检测的自动化机器学习(AutoML)框架性能的影响。与以往通过二分类或移除少数类来简化问题的研究不同,我们保留了原始的五类分布,包括高度欠表示的R2L和U2R攻击,从而能够对不平衡敏感的学习行为进行现实评估。在统一且可重复的实验协议下,分析了九个开源AutoML框架,考虑了架构设计、集成策略、验证程序、超参数优化和不平衡处理机制的差异。结果表明,采用集成学习和不平衡感知优化的框架在少数类判别上表现更好。PyCaret获得了最佳整体性能,macro-F1达到66%,其次是AutoGluon(55%),而缺乏原生平衡支持的框架在少数类检测能力上显著下降。进一步分析表明,仅以准确率为导向的优化不足以应对高度不平衡的入侵检测场景,因为高加权指标可能与对罕见攻击类别的泛化能力差共存。作为贡献,本研究为严重多类不平衡下的AutoML入侵检测建立了标准化基准,指出了当前架构的局限性,以及将不平衡感知优化、重采样和分层评估策略原生集成到自动化学习流水线中的必要性。源代码已公开。

英文摘要

This work investigates the impact of severe class imbalance on the performance of automated machine learning (AutoML) frameworks for multiclass network intrusion detection using the NSL-KDD dataset. Unlike previous studies that simplify the problem through binary classification or minority-class removal, we preserve the original five-class distribution, including highly underrepresented attacks such as R2L and U2R, enabling a realistic evaluation of imbalance-sensitive learning behavior. Nine open-source AutoML frameworks were analyzed under a unified and reproducible experimental protocol, considering differences in architectural design, ensemble strategies, validation procedures, hyperparameter optimization, and imbalance-handling mechanisms. The results demonstrate that frameworks incorporating ensemble learning and imbalance-aware optimization achieve better minority-class discrimination. PyCaret obtained the best overall performance, reaching 66% macro-F1, followed by AutoGluon with 55%, whereas frameworks lacking native balancing support exhibited significant degradation in minority-class detection capability. The analysis further shows that accuracy-oriented optimization alone is insufficient for highly imbalanced IDS scenarios, since high weighted metrics may coexist with poor generalization on rare attack categories. As a contribution, this work establishes a standardized benchmark for AutoML-based intrusion detection under severe multiclass imbalance, highlighting current architectural limitations and the need for native integration of imbalance-aware optimization, resampling, and stratified evaluation strategies into automated learning pipelines. The source code is publicly available.
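The gap between accuracy and macro-F1 under class imbalance is easy to reproduce on toy labels. The sketch below is illustrative only: the label lists are made up for the example and are not NSL-KDD data or results from the study.

```python
def accuracy(y_true, y_pred):
    """Share of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: 95 "normal" samples and 5 rare "u2r" attacks,
# scored against a classifier that always predicts "normal".
y_true = ["normal"] * 95 + ["u2r"] * 5
y_pred = ["normal"] * 100
```

On this toy data the always-"normal" classifier reaches 95% accuracy while its macro-F1 stays below 0.5, because the rare class contributes an F1 of zero to the average.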

2606.12478 2026-06-12 cs.LG cond-mat.stat-mech quant-ph 新提交

Boltzmann Attention: Learnable Ising Couplings for Cooperative Attention

玻尔兹曼注意力:用于协同注意力的可学习伊辛耦合

Gilhan Kim, Daniel K. Park

发表机构 * Yonsei University(延世大学)

AI总结 提出玻尔兹曼注意力,通过可学习的伊辛耦合增强注意力机制中的位置间交互,在字符级语言建模和括号匹配任务中优于标准softmax注意力,并展示了量子退火训练的有效性。

详情
Comments
19 pages, 5 figures
AI中文摘要

注意力机制是现代序列模型的核心,但标准注意力主要通过单个查询-键相似度计算相关性。尽管softmax归一化引入了位置间的竞争,但标准注意力层并未显式参数化注意力决策之间的可学习交互。这限制了其直接在注意力机制内建模协同或对抗性共注意力结构的能力。我们提出玻尔兹曼注意力,一种基于能量的泛化,其中注意力模式由相互作用的伊辛模型控制。该方法用可学习的成对耦合增强通常的数据依赖局部场,使模型能够表示超出softmax或sigmoid注意力所捕获的位置间相关性。在字符级语言建模和合成括号匹配实验上,玻尔兹曼注意力在标准Transformer架构中持续优于标准softmax注意力,且优势随序列长度增加而更加明显。四路消融实验证实改进来自可学习的成对耦合。这些结果表明,显式位置间交互为基于注意力的序列建模提供了原则性增强。此外,伊辛公式为基于量子计算的采样策略开辟了自然路径:我们证明非绝热量子退火提供了实用的训练方法,同时保持了与精确玻尔兹曼计算相当的性能。

英文摘要

Attention mechanisms are central to modern sequence models, yet standard attention computes relevance primarily through individual query-key similarities. Although softmax normalization introduces competition among positions, a standard attention layer does not explicitly parameterize learnable interactions between attention decisions. This limits its ability to directly model cooperative or antagonistic co-attention structure within the attention mechanism itself. We propose Boltzmann attention, an energy-based generalization in which attention patterns are governed by an interacting Ising model. The method augments the usual data-dependent local fields with learnable pairwise couplings, allowing the model to represent inter-position correlations beyond those captured by softmax or sigmoid attention. Experiments on character-level language modeling and synthetic bracket matching show that Boltzmann attention consistently improves over standard softmax attention within a standard Transformer architecture, with the advantage becoming more pronounced as sequence length increases. A four-way ablation confirms that the improvement arises from the learnable pairwise couplings. These results suggest that explicit inter-position interactions provide a principled enhancement for attention-based sequence modeling. Moreover, the Ising formulation opens a natural path toward quantum-computing-based sampling strategies: we demonstrate that diabatic quantum annealing provides a practical training method while maintaining competitive performance with exact Boltzmann computation.
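How pairwise couplings change attention can be seen in a tiny exact computation. The sketch below enumerates all binary attention patterns for a handful of positions and returns each position's marginal probability of being attended. Several choices are assumptions made for illustration and are not taken from the paper: spins take values in {0, 1}, the temperature is 1, the energy is E(s) = -sum_i h_i s_i - sum_{i<j} J_ij s_i s_j, and the function name is hypothetical.

```python
import itertools
import math


def boltzmann_attention_marginals(fields, couplings):
    """Exact marginals P(s_i = 1) of an Ising-style model over attention patterns.

    fields[i] plays the role of a data-dependent local field (for example a
    query-key score); couplings[i][j] for i < j is a pairwise interaction.
    Cost is exponential in the number of positions, so this only suits tiny inputs.
    """
    n = len(fields)
    attended_weight = [0.0] * n
    partition = 0.0
    for state in itertools.product((0, 1), repeat=n):
        energy = -sum(h * s for h, s in zip(fields, state))
        for i in range(n):
            for j in range(i + 1, n):
                energy -= couplings[i][j] * state[i] * state[j]
        weight = math.exp(-energy)  # unnormalized Boltzmann weight
        partition += weight
        for i, s in enumerate(state):
            if s:
                attended_weight[i] += weight
    return [w / partition for w in attended_weight]
```

With all couplings set to zero the positions decouple and each marginal reduces to a sigmoid of its local field; a positive coupling between two positions raises the probability that both are attended together, which is the cooperative effect the abstract describes.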

2606.12263 2026-06-12 cs.CV 新提交

VOID: Defeating Unauthorized Mimicry in Latent Diffusion Models

VOID: 击败潜在扩散模型中的未授权模仿

Chunlin Qiu, Ang Li, Tianxiao Huang, Ruilin Gan, Yunjie Ge, Shenyi Zhang, Huayi Duan, Lingchen Zhao, Chao Shen, Qian Wang

发表机构 * School of Cyber Science and Engineering, Wuhan University(武汉大学网络空间安全学院) School of Computer Science, Wuhan University(武汉大学计算机学院) Institute for Math&AI, Wuhan University(武汉大学数学与人工智能研究所) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) School of Cyber Science and Engineering, Xi’an Jiaotong University(西安交通大学网络空间安全学院)

AI总结 针对潜在扩散模型被用于未授权模仿的问题,提出VOID防御框架,通过操纵模型内在随机性,放大潜在编码误差并抵消目标引导信号,实现语义破坏,阻止未授权模仿,同时将扰动限制在人眼不可感知区域。

详情
Comments
Extended full version with more comprehensive experimental results. To appear in the 35th USENIX Security Symposium (USENIX Security 2026)
AI中文摘要

虽然潜在扩散模型(LDM)彻底改变了视觉合成,但它们越来越多地被用于对个人的未授权模仿。现有防御通过注入欺骗性扰动,将生成图像引导至无关目标。然而,这种方法基于一个无根据的假设:微小的扰动能在LDM的整个生成过程中保持其欺骗效果。实际上,模型固有的恢复机制会移除这些扰动,导致个体身份在生成的图像中重新出现。我们提出VOID,一种通过操纵LDM内在随机性克服这一难题的防御框架。VOID以两种新颖方式扰动扩散管道:1)放大潜在编码误差以破坏图像的语义结构,以及2)抵消目标引导信号以抑制模型的恢复能力。这导致语义破坏,阻止任何未授权模仿。值得注意的是,安全增益不以视觉效用为代价,因为VOID同时设法将扰动限制在受保护图像的人眼不可感知区域。我们在5个数据集上对10种模仿攻击的24种最先进防御进行了全面评估,证明了VOID前所未有的保护能力:它将平均Frechet Inception Distance(FID)从113提高到365,比迄今为止最强的防御提升了223%。

英文摘要

While Latent Diffusion Models (LDMs) have revolutionized visual synthesis, they are increasingly exploited for unauthorized mimicry of individuals. Existing defenses inject deceptive perturbations to steer the generated images toward irrelevant targets. However, this approach hinges on an ungrounded assumption: subtle perturbations can maintain their deceptive efficacy throughout an LDM's extensive generation process. In reality, the model's innate restoration mechanism will remove such perturbations and cause individual identities to re-emerge in the images generated. We propose VOID, a defense framework that overcomes this conundrum by manipulating an LDM's intrinsic stochasticity. VOID perturbs the diffusion pipeline in two novel ways: 1) amplifying the latent encoding errors to shatter an image's semantic structure, and 2) counteracting the target guidance signals to suppress the model's restoration capabilities. This results in a semantic corruption that thwarts any unauthorized mimicry. Notably, the security gain does not come at the price of visual utility, as VOID simultaneously manages to confine perturbations to human-imperceptible regions of protected images. Our comprehensive evaluation of 24 state-of-the-art defenses against 10 mimicry attacks on 5 datasets demonstrates VOID's unprecedented protection power: it increases the average Frechet Inception Distance (FID) from 113 to 365, a 223% improvement over the strongest defense to date.

2606.12236 2026-06-12 cs.RO cs.CV 新提交

DrivingAgent: Design and Scheduling Agents for Autonomous Driving Systems

DrivingAgent: 自动驾驶系统的设计与调度智能体

Zhongyu Xia, Wenhao Chen, Yongtao Wang, Ming-Hsuan Yang

发表机构 * Wangxuan Institute of Computer Technology, Peking University(北京大学王选计算机技术研究所) University of California, Merced(加州大学默塞德分校)

AI总结 提出DrivingAgent框架,通过自动化模块开发(设计阶段)和强化学习训练的轻量级LLM实时调度(调度阶段),解决自动驾驶系统集成新模型和满足实时约束的挑战,在nuScenes和Bench2Drive上取得更优速度-精度权衡。

详情
AI中文摘要

许多自动驾驶系统越来越多地整合基础模型以提高泛化能力并处理长尾场景。然而,这一趋势带来了两个关键挑战:(i)设计和集成新模型的手动且劳动密集型过程,以及(ii)缺乏智能、动态的调度机制以满足严格的实时约束。虽然基于大语言模型(LLM)的智能体为自动化提供了有前景的途径,但现有框架并不适合自动驾驶。具体来说,它们未能区分系统设计和实时调度的根本不同需求,将模块视为不透明的黑盒,并且并非为持续运行而设计。为了解决这些局限性,我们提出了DrivingAgent,这是一个针对自动驾驶系统设计和调度双重挑战的新型智能体框架。在设计阶段,DrivingAgent通过解释系统架构、生成代码以及通过超网络训练验证模块来自动化模块开发。在调度阶段,它采用一个通过强化学习训练的轻量级LLM来实时动态编排系统模块,并由一个集成长期存储与带时间戳短期上下文的结构化记忆支持。实验结果表明,DrivingAgent在nuScenes和Bench2Drive基准测试上实现了更优的速度-精度权衡。

英文摘要

Autonomous driving systems increasingly incorporate foundation models to improve generalization and handle long-tail scenarios. However, this trend introduces two key challenges: (i) the manual and labor-intensive process of designing and integrating new models, and (ii) the lack of intelligent, dynamic scheduling mechanisms to meet strict real-time constraints. While Large Language Model (LLM)-based agents offer a promising avenue for automation, existing frameworks are ill-suited for autonomous driving. Specifically, they fail to distinguish between the fundamentally different requirements of system design and real-time scheduling, treat modules as opaque black boxes, and are not designed for continuous operation. To address these limitations, we propose DrivingAgent, a novel agent framework tailored to the dual challenges of autonomous driving system design and scheduling. In the design phase, DrivingAgent automates module development by interpreting system architecture, generating code, and validating modules via super-network training. In the scheduling phase, it employs a lightweight LLM trained with reinforcement learning to dynamically orchestrate system modules in real time, supported by a structured memory that integrates long-term storage with timestamped short-term context. Experimental results demonstrate that DrivingAgent achieves a superior speed-accuracy trade-off on both the nuScenes and Bench2Drive benchmarks.

2606.12160 2026-06-12 cs.CL 新提交

A Controlled Study of Decoding-Time Truthfulness Methods on Instruction-Tuned LLMs

指令调优大语言模型解码时真实性方法的受控研究

Ao Sun

发表机构 * Independent Researcher(独立研究员)

AI总结 构建包含六项控制的评估框架,在5个模型(1B-70B)、3个基准和15种方法上检验解码时真实性方法,发现此前报告的增益在严格控制下大幅缩小:在完整TruthfulQA(N=817)上没有令牌级方法取得统计显著的提升,而思维链等审慎式提示方法更为稳健。

详情
AI中文摘要

解码时真实性方法(层对比解码、推理时干预和可学习的logit适配器)在应用于基础语言模型时,已在TruthfulQA上展示了10-30分的提升。然而,现代指令调优LLM已经达到了高得多的基线(61-76%),这引出了这些方法在实践中是否仍然有效的问题。我们设计了一个包含六项控制的评估框架(分布外训练、多评判者验证、简单解码基线、混杂因素控制、自助法置信区间和种子方差),并将其应用于5个模型(1B-70B)、3个基准和15种方法。我们发现,此前报告的增益在严格控制下大幅缩小:在完整的TruthfulQA基准(N=817)上,没有任何令牌级方法取得统计显著的提升,最佳的可学习适配器比贪心解码低2.0分(p=.23)。我们识别出五种评估敏感性(数据污染、评判者选择、基线缺失、混杂因素和统计噪声),它们单独或共同解释了这些差异。在HaluEval QA和TriviaQA上的跨基准验证证实,这些模式不限于TruthfulQA。审慎式提示方法(思维链、自我批评)在所评估的设定下似乎更为稳健,其中CoT作为一种无需训练的单次推理方法,在各基准上取得了+5.6-19pp的提升。我们发布了一个七点评估清单,并讨论了对未来真实性研究的启示。

英文摘要

Decoding-time truthfulness methods -- layer-contrast decoding, inference-time intervention, and learned logit adapters -- have demonstrated 10-30 point gains on TruthfulQA when applied to base language models. However, modern instruction-tuned LLMs already achieve substantially higher baselines (61-76%), raising the question of whether these methods remain effective in practice. We design a six-control evaluation framework -- out-of-distribution training, multi-judge validation, simple decoding baselines, confound controls, bootstrap confidence intervals, and seed variance -- and apply it across 5 models (1B-70B), 3 benchmarks, and 15 methods. We find that previously reported gains shrink substantially under strict controls: on the full TruthfulQA benchmark (N=817), no token-level method achieves statistically significant improvement, and the best learned adapter scores 2.0 points below greedy (p=.23). We identify five evaluation sensitivities -- contamination, judge choice, missing baselines, confounds, and statistical noise -- that individually or jointly account for these discrepancies. Cross-benchmark validation on HaluEval QA and TriviaQA confirms that these patterns extend beyond TruthfulQA. Deliberative prompting methods (chain-of-thought, self-critique) appear more robust in the evaluated regime, with CoT achieving +5.6-19pp across benchmarks as a training-free, single-pass method. We release a seven-point evaluation checklist and discuss implications for future truthfulness research.
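One of the six controls, bootstrap confidence intervals, can be sketched with a percentile bootstrap over per-question score differences. This is a generic illustration rather than the paper's procedure: the resample count, the percentile method, the paired-difference encoding, and the name `bootstrap_ci` are all assumptions.

```python
import random


def bootstrap_ci(diffs, num_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-question differences.

    Each entry of `diffs` is (method score - greedy score) on one question,
    for example +1, 0, or -1 for a correct/incorrect judge. A gain is treated
    as significant at level `alpha` only if the interval excludes zero.
    """
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(diffs)
    means = []
    for _ in range(num_resamples):
        # Resample questions with replacement and record the mean difference.
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * num_resamples)]
    upper = means[int((1 - alpha / 2) * num_resamples) - 1]
    return lower, upper
```

A method that helps on as many questions as it hurts yields an interval straddling zero, which is the situation the abstract describes when it reports that no token-level method achieves a statistically significant improvement.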

2606.11898 2026-06-12 cs.CL cs.LG 新提交

GraspLLM: Towards Zero-Shot Generalization on Text-Attributed Graphs with LLMs

GraspLLM: 面向文本属性图与LLM的零样本泛化

Hengyi Feng, Zeang Sheng, Meiyi Qiang, Li Yang, Wentao Zhang

发表机构 * Peking University(北京大学) National University of Singapore(新加坡国立大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 提出GraspLLM框架,通过融合图结构理解与LLM语义能力,利用基序感知对比学习和最优上下文子图对齐,实现跨数据集和跨任务的零样本泛化。

详情
AI中文摘要

近年来,对文本属性图(TAGs)的研究因其在引文网络、电子商务平台、社交媒体和网页等各类真实数据场景中的广泛应用而备受关注。受大语言模型(LLMs)卓越语义理解能力的启发,已有许多尝试将LLMs集成到TAGs中。然而,现有方法仍难以在不同图和任务间泛化,且其捕获可迁移图结构模式的能力有限。为此,我们提出了GraspLLM框架,该框架将图结构理解与LLM的语义理解能力相结合,以增强跨数据集和跨任务的泛化能力。具体而言,我们使用冻结的通用嵌入模型将不同图的节点文本表示在统一语义空间中,在此基础上,我们在多个基序诱导的邻接矩阵上进行基序感知对比学习,以提取与数据集无关的结构信息。然后,通过我们提出的最优上下文子图,为每个目标节点提取最相关的上下文子图,并通过对齐投影仪将这些子图对齐到LLM的令牌空间。在涵盖不同领域的TAG基准数据集上的大量实验表明,GraspLLM在零样本场景下始终优于先前基于LLM的TAG方法,突显了其在不同数据集和任务上的强泛化能力。我们的代码可在以下网址获取:此 https URL。

英文摘要

Research on Text-Attributed Graphs (TAGs) has gained significant attention recently due to its broad applications across various real-world data scenarios, such as citation networks, e-commerce platforms, social media, and web pages. Inspired by the remarkable semantic understanding ability of Large Language Models (LLMs), there have been numerous attempts to integrate LLMs into TAGs. However, existing methods still struggle to generalize across diverse graphs and tasks, and their ability to capture transferable graph structural patterns remains limited. To address this, we introduce GraspLLM, a framework that combines graph structural comprehension with the semantic understanding prowess of LLMs to enhance cross-dataset and cross-task generalizability. Specifically, we represent node texts from different graphs in a unified semantic space with a frozen general embedding model, on top of which we perform motif-aware contrastive learning across multiple motif-induced adjacency matrices to extract dataset-agnostic structural information. Then, with our proposed optimal contextual subgraph, we extract the most contextually relevant subgraph for each target node and align these subgraphs to the token space of the LLM via an alignment projector. Extensive experiments on TAG benchmark datasets spanning diverse domains reveal that GraspLLM consistently outperforms previous LLM-based methods for TAGs, especially in zero-shot scenarios, highlighting its strong generalizability across different datasets and tasks. Our code is available at this https URL.

2606.11894 2026-06-12 cs.CV 新提交

Wild3R: Feed-Forward 3D Gaussian Splatting from Unconstrained Sparse Photo Collection

Wild3R: 从无约束稀疏照片集合进行前馈式3D高斯泼溅

Yuto Furutani, Takashi Otonari, Kaede Shiohara, Toshihiko Yamasaki

发表机构 * The University of Tokyo(东京大学)

AI总结 提出Wild3R,一种针对无约束稀疏照片集合的前馈式3D高斯泼溅方法,通过引入包含多样光照和瞬态物体的WildCity数据集,学习跨视角外观一致性并移除瞬态内容,性能优于现有前馈方法,与基于逐场景优化的方法相当。

详情
Comments
Project page: this https URL
AI中文摘要

前馈式3D高斯泼溅(3DGS)消除了传统3DGS所需的耗时逐场景优化。然而,现有的前馈方法难以处理包含多样光照条件和瞬态物体的真实世界照片集合。在本文中,我们提出了Wild3R,一种针对无约束稀疏照片集合的前馈方法。主要瓶颈在于缺乏提供多视角、多种光照和瞬态变化的训练数据,而这些是学习鲁棒场景表示所必需的。为解决这一问题,我们引入了WildCity数据集,该数据集包含200个场景、170种光照条件和瞬态物体,总计337,500张图像。通过利用该数据集,我们的模型在参考视图条件下学习跨视角的外观一致性,同时移除瞬态内容。大量实验表明,我们的方法优于现有的前馈方法,并取得了与先前基于逐场景优化的方法相竞争的结果。

英文摘要

Feed-forward 3D Gaussian Splatting (3DGS) removes the need for time-consuming per-scene optimization required by traditional 3DGS. However, existing feed-forward approaches struggle with real-world photo collections that include diverse lighting conditions and transient objects. In this paper, we present Wild3R, a feed-forward approach for unconstrained sparse photo collections. The main bottleneck is the lack of training data that provides multiple viewpoints, a variety of illuminations, and transient variations necessary for learning robust scene representations. To address this, we introduce the WildCity dataset, which comprises 200 scenes, 170 lighting conditions, and transient objects, resulting in 337,500 images in total. By leveraging the dataset, our model learns appearance consistency across viewpoints conditioned on reference views, while removing transient content. Extensive experiments demonstrate that our method outperforms existing feed-forward approaches and achieves results competitive with prior per-scene optimization-based methods.

2606.11836 2026-06-12 cs.SD cs.AI eess.AS 新提交

Towards Data-free and Training-free Compression for Speech Foundation Models Using Parameter Clustering

面向语音基础模型的无数据无训练压缩:基于参数聚类的方法

Haoning Xu, Zhaoqing Li, Huimeng Wang, Youjun Chen, Chengxi Deng, Mengzhe Geng, Xunying Liu

发表机构 * The Chinese University of Hong Kong(香港中文大学) National Research Council Canada(加拿大国家研究委员会)

AI总结 提出一种基于k-means通道聚类的无数据无训练压缩方法,通过层间不同参数簇数实现细粒度混合稀疏剪枝,在HuBERT-large和Whisper-large-v3上显著降低WER。

详情
Comments
Accepted by Interspeech 2026
AI中文摘要

本文提出了一种新颖的无数据无训练压缩方法,用于语音基础模型,该方法通过k-means进行通道级聚类。还探索了更细粒度的混合稀疏剪枝,通过层间不同数量的参数簇实现。在LibriSpeech数据集上进行的实验表明,当对HuBERT-large进行50%的剪枝稀疏度操作时,在微调前,测试干净和测试其他子集上,相对于基于幅度的剪枝,获得了27.73%/18.61%绝对(34.37%/21.91%相对)的一致WER降低;在仅3个epoch的微调后,获得了0.19%/0.79%绝对(3.36%/4.62%相对)的降低。在Whisper-large-v3上,在10%稀疏度下,相对于基于幅度的剪枝,观察到2.86%/5.02%绝对(59.21%/55.29%相对)的类似WER降低,所有这些相对于未压缩基线均没有显著的WER增加。

英文摘要

This paper presents a novel data-free and training-free compression approach for speech foundation models using channel-wise clustering via k-means. Finer-grained, mixed-sparsity pruning, obtained by varying the number of parameter clusters across layers, is also explored. Experiments conducted on the LibriSpeech dataset suggest that, at a pruning sparsity of 50% on HuBERT-large, consistent WER reductions of 27.73%/18.61% absolute (34.37%/21.91% relative) over magnitude-based pruning were obtained on the test-clean and test-other subsets before fine-tuning, and 0.19%/0.79% absolute (3.36%/4.62% relative) after fine-tuning for only 3 epochs. Similar WER reductions of 2.86%/5.02% absolute (59.21%/55.29% relative) were observed against magnitude-based pruning on Whisper-large-v3 at 10% sparsity, all with no significant WER increase relative to the uncompressed baseline.

2606.11793 2026-06-12 cs.LG cs.AI physics.ao-ph 新提交

Scalable Deep Learning Framework for Global High-Resolution Land Use Reconstruction

AI4Land: 面向全球高分辨率土地利用重建的可扩展深度学习

Amirpasha Mozaffari, Marina Castaño, Stefano Materia, Etienne Tourigny, Oscar Molina-Sedano, Jordi Varela-Agrelo, Dario Garcia-Gasulla, Miguel Castrillo Melguizo, Mario Acosta, Amanda Duarte

发表机构 * Barcelona Supercomputing Center(巴塞罗那超级计算中心)

AI总结 提出AI4Land框架,采用U-Net两阶段方法,结合粗分辨率情景数据与静态地理特征,重建高分辨率年度土地利用与覆盖,减少陆地碳循环不确定性,支持气候模拟。

详情
AI中文摘要

陆地碳循环的不确定性仍是气候预测的主要制约因素,部分源于地球系统模型中陆面表征和变率的不确定性。为解决此问题,我们提出了数据驱动框架AI4Land,用于生成关键陆面变量的高分辨率历史重建和未来预测。该框架采用U-Net架构的两阶段方法。在第一阶段(本文重点),它通过整合粗分辨率情景数据与静态地理特征,重建年度土地利用与土地覆盖。在计划的第二阶段,生成的高分辨率地图将用于在更细时间尺度上预测动态生物物理变量,特别是叶面积指数。模型基于地球观测数据训练,学习再现空间明确且物理一致的陆面模式,并将时间覆盖扩展到缺乏直接观测的时期。AI4Land在MareNostrum5上开发和训练,展示了GPU加速的高性能计算基础设施如何支持全球尺度的气候AI流水线。最终产品是一套开源模拟器,旨在与数字孪生平台(如Destination Earth计划下开发的平台)实时耦合。通过按需提供逼真且演变的陆面条件,本工作旨在减少关键不确定性,提高下一代气候模拟的预测能力。

英文摘要

Uncertainty in the terrestrial carbon cycle remains a major constraint in climate projections, partly driven by the uncertainties affecting the land surface representation and variability in Earth system models. To address this limitation, we present AI4Land, a data-driven framework for generating high-resolution historical reconstructions and future projections of key land surface variables. The framework follows a two-phase approach using a U-Net architecture. In the first phase, which is the focus of this work, it reconstructs annual land use and land cover by integrating coarse-resolution scenario data with static geophysical features. In a planned second phase, the resulting high-resolution maps will be used to predict dynamic biophysical variables, particularly leaf area index, at finer temporal scales. Trained on Earth observation data, the models learn to reproduce spatially explicit and physically consistent land surface patterns, extending temporal coverage to periods lacking direct observations. AI4Land was developed and trained on MareNostrum5, demonstrating how GPU-accelerated HPC infrastructure enables global-scale climate AI pipelines. The final product is a suite of open-source emulators designed for real-time coupling with digital twin platforms, such as those developed under the Destination Earth initiative. By delivering realistic and evolving land surface conditions on demand, this work aims to reduce critical uncertainties and improve the predictive power of next-generation climate simulations.

2606.11792 2026-06-12 cs.CV cs.AI cs.CL 新提交

MultiToP: Learning to Patch Visual Tokens to Mitigate Hallucinations in Video Large Multimodal Models

MultiToP:学习修补视觉令牌以减轻视频大型多模态模型中的幻觉

Yuansheng Gao, Wenbin Xing, Jiahao Yuan, Kaiwen Zhou, Han Bao, Zonghui Wang, Wenzhi Chen

发表机构 * Zhejiang University(浙江大学) Sun Yat-sen University(中山大学) East China Normal University(华东师范大学)

AI总结 提出MultiToP框架,通过轻量级视觉令牌修补器动态替换不可靠视觉令牌,结合信息引导排名校准和稀疏正则化,在不修改原模型情况下减少视频多模态模型幻觉,显著提升F1分数和问答准确率。

详情
Comments
Preprint
AI中文摘要

视频大型多模态模型在视频理解方面取得了显著进展,但仍容易产生幻觉,即生成的响应未能忠实于输入视频。在本文中,我们提出MultiToP,一种多模态上下文感知的视觉令牌修补框架,通过在语言生成之前优化不可靠的视觉令牌来减轻幻觉。MultiToP引入了一个轻量级的视觉令牌修补器,用于预测令牌级替换分布,并选择性地用动态全局修补令牌替换不可靠的视觉令牌。为了有效训练修补器,我们进一步提出了信息引导的排名校准,利用从主干网络派生的答案条件帧级信息线索来指导令牌替换。结合真实答案监督和稀疏正则化,MultiToP实现了局部视觉证据优化,而无需修改原始模型。大量实验表明,MultiToP在Vript-HAL上有效减少了幻觉,且推理开销可忽略不计,将Qwen3-VL-4B-Instruct的F1分数相比原始模型提高了50.60%。同时,MultiToP保持了通用的视频理解能力,在ActivityNet-QA上为Video-LLaVA-7B带来了18.58%的相对准确率提升。

英文摘要

Video Large Multimodal Models have achieved remarkable progress in video understanding, yet they remain prone to hallucinations, where generated responses are not faithfully supported by the input video. In this paper, we propose MultiToP, a multimodal-context-aware visual token patching framework that mitigates hallucinations by refining unreliable visual tokens before language generation. MultiToP introduces a lightweight Visual Token Patcher to predict token-level replacement distributions and selectively substitute unreliable visual tokens with a dynamic global patch token. To train the patcher effectively, we further propose information-guided rank calibration, which uses answer-conditioned frame-level information cues derived from the backbone to guide token replacement. Combined with ground-truth answer supervision and sparsity regularization, MultiToP enables localized visual evidence refinement without modifying the original model. Extensive experiments demonstrate that MultiToP effectively reduces hallucinations on Vript-HAL with negligible inference overhead, improving the F1 scores of Qwen3-VL-4B-Instruct by 50.60% over the vanilla model. Meanwhile, MultiToP preserves general video understanding ability, yielding an 18.58% relative accuracy gain on ActivityNet-QA for Video-LLaVA-7B.

2606.11767 2026-06-12 cs.RO cs.AI New submission

Blind Dexterous Grasping via Real2Sim2Real Tactile Policy Learning

基于Real2Sim2Real触觉策略学习的盲灵巧抓取

Shengcheng Luo, Xiyan Huang, Zhe Xu, Wanlin Li, Ziyuan Jiao, Chenxi Xiao

Affiliations: ShanghaiTech University; Beijing Institute for General Artificial Intelligence

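AI summary: Proposes a framework that combines Real2Sim tactile calibration, a layout-aware tactile encoder, and a tactile-conditioned diffusion policy for tactile-only blind grasping with a dexterous hand. On a real robot it reaches a 27% grasp success rate across 20 objects (10 seen, 10 unseen).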

Details
Comments
23 pages, 6 figures
AI Chinese abstract

使用灵巧手进行盲抓取是一项关键的操作能力。然而,由于触觉的仿真到真实差距以及稀疏触觉信号的有限表达能力,为真实机器人学习这种仅依赖触觉的策略仍然具有挑战性。为了弥合这一差距,我们提出了一个仅依赖触觉的盲抓取框架,该框架可部署在物理多指机器人手上。我们的方法结合了三个关键组成部分。首先,我们引入了一个Real2Sim触觉校准流程,构建了一个接触校准的数字孪生模拟器,能够复现真实的触觉信号。其次,我们使用布局感知触觉编码器改进了稀疏触觉观测的表达能力,该编码器通过自监督预训练融入了传感器几何先验。第三,为了提高对未见物体的泛化能力,我们在校准后的模拟器中训练了特定物体的强化学习专家,并将其成功的抓取轨迹聚合为触觉条件扩散策略。我们在配备分布式触觉传感的物理LEAP手上评估了我们的方法,涉及10个见过和10个未见过的物体。部署的策略在所有20个物体上实现了27%的真实世界抓取成功率,无需真实世界的抓取演示或视觉输入。仿真消融实验表明,布局感知触觉预训练提高了抓取性能,而传感级评估确认Real2Sim校准增加了仿真与硬件之间触觉接触事件的一致性。这些结果表明,接触事件校准、几何感知触觉表示学习和基于扩散的策略聚合为真实灵巧机器人手上的仅触觉盲抓取提供了一条有效路径。项目页面:此HTTP URL。

English abstract

Blind grasping with a dexterous hand is a crucial manipulation capability. Nevertheless, learning such tactile-only policies for real robots remains challenging due to the tactile sim-to-real gap and the limited expressiveness of sparse tactile signals. To bridge this gap, we propose a framework for tactile-only blind grasping that is deployable on a physical multi-fingered robotic hand. Our approach combines three key components. First, we introduce a Real2Sim tactile calibration pipeline that constructs a contact-calibrated digital-twin simulator capable of reproducing real tactile signals. Second, we improve the expressiveness of sparse tactile observations using a layout-aware tactile encoder, which incorporates sensor-geometry priors through self-supervised pretraining. Third, to improve generalization to unseen objects, we train object-specific reinforcement-learning experts in the calibrated simulator and aggregate their successful grasp trajectories into a tactile-conditioned Diffusion Policy. We evaluate our method on a physical LEAP Hand equipped with distributed tactile sensing across 10 seen and 10 unseen objects. The deployed policy achieves a 27% real-world grasp success rate across all 20 objects, without real-world grasping demonstrations or visual input. Simulation ablations show that layout-aware tactile pretraining improves grasping performance, while sensing-level evaluations confirm that Real2Sim calibration increases the consistency of tactile contact events between simulation and hardware. Together, these results suggest that contact-event calibration, geometry-aware tactile representation learning, and diffusion-based policy aggregation provide an effective path toward tactile-only blind grasping on real dexterous robotic hands. Project page: this http URL.

2606.11681 2026-06-12 cs.CL cs.SD New submission

UR-BERT: Scaling Text Encoders for Massively Multilingual TTS Through Universal Romanization and Speech Token Prediction

UR-BERT:通过通用罗马化和语音标记预测扩展大规模多语言TTS的文本编码器

Sangmin Lee, Eekgyun Ahn, Woongjib Choi, Hong-Goo Kang

Affiliations: Dept. of Electronics and Electrical Engineering, Yonsei University

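AI summary: Proposes UR-BERT, a TTS text encoder based on Romanized transcription. By unifying writing systems into a shared Romanization representation and adding a speech token prediction objective, it scales multilingual TTS to 495 languages, outperforms recent text encoder baselines, and generalizes to unseen languages.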

Details
Comments
Accepted to Interspeech 2026, Github: this https URL
AI Chinese abstract

我们提出UR-BERT,一种基于罗马化转录的文本到语音(TTS)编码器,用于大规模多语言TTS系统。传统的基于字素到音素(G2P)的方法受限于可靠G2P资源的可得性,仅能覆盖约100种语言。相比之下,UR-BERT通过将多样化的书写系统统一为共享的罗马化表示,扩展到495种语言。为了进一步增强音素保真度和文本-语音对齐,我们在训练过程中引入了一个语音标记预测目标,促使编码器以数据高效的方式学习语音感知的音素表示。实验表明,基于UR-BERT构建的TTS系统在广泛的语言和资源条件下,始终优于最近的文本编码器基线,并展现出对未见语言的强大泛化能力。

English abstract

We propose UR-BERT, a Romanized transcription-based text-to-speech (TTS) encoder for massively multilingual TTS systems. Conventional grapheme-to-phoneme (G2P)-based approaches are limited to around 100 languages due to the availability of reliable G2P resources. In contrast, UR-BERT scales to 495 languages by unifying diverse writing systems into a shared Romanization representation. To further enhance phonetic fidelity and text-speech alignment, we introduce a speech token prediction objective during training, which encourages the encoder to learn speech-aware phonetic representations in a data-efficient manner. Experiments show that TTS systems built on UR-BERT consistently outperform recent text encoder baselines across a wide range of languages and resource conditions, and demonstrate strong generalization to unseen languages.

2606.11255 2026-06-12 cs.LG New submission

Bernstein-Schur Kernels: Random Features by Sketched Modulation and Radial Randomization

Bernstein-Schur核:通过草图调制和径向随机化的随机特征

Taha Bouhsine

Affiliations: Azetta AI

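AI summary: Proposes a random-feature construction for the class of Bernstein-Schur kernels that sketches the finite modulation and randomizes the completely monotone radial factor, with unbiasedness and operator-norm guarantees, and applies it to the yat-kernel family.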

Details
AI Chinese abstract

Bernstein-Schur核是有限特征核(具有显式有限维特征映射的核)与完全单调平移不变核的乘积:非平稳核介于平移不变和点积模板之间,随机特征通常利用后者,因此一般Bochner采样或多项式草图都不能直接应用于完整核。我们为整个类给出一种随机特征构造,它随机化两个因子:草图化有限调制并随机化完全单调径向因子,对后者的单变量Bernstein-Widder尺度进行采样,然后应用高斯随机傅里叶特征(其频率仍是d维的)。特征维度为Dm,由草图大小m和径向抽取次数D设定,与精确调制特征的O(d^2)大小无关。保持调制精确是可分析极限(m→∞):在那里我们证明无偏性、推荐平坦估计量的精确方差、期望矩阵-Bernstein算子范数界(具有匹配的高概率尾部),该界由核和调制Gram矩阵的最大特征值以及固有维度控制,而非粗糙的N max_{ij}逐元素路径,以及确定性相对谱核岭稳定性结果。通过条件化于草图,双随机化估计量继承了相同的固有维度算子范数保证,加上一个可调加性草图项,该草图项由m独立于D调节。激励实例是有偏yat核k_{yat,b}(w,x)=(w^⊤x+b)^2/(‖w-x‖^2+ε),b≥0,其族通过b的有限差分包含逆多二次核;对于它,径向混合是IMQ谱采样器,每个尺度一个频率在固定径向特征预算下是方差最优的。

English abstract

Bernstein-Schur kernels are products of a finite-feature kernel and a completely monotone shift-invariant kernel: nonstationary kernels falling between the shift-invariant and dot-product templates random features exploit, so neither Bochner sampling nor polynomial sketching applies to the full kernel directly. We give one random-feature construction for the whole class that randomizes both factors: it sketches the finite modulation and samples the radial factor's one-dimensional Bernstein-Widder scale before applying Gaussian random Fourier features, giving feature dimension $Dm$, free of the $O(d^2)$ size of the exact modulation feature. With the modulation kept exact (the $m\to\infty$ limit), we prove unbiasedness, an exact variance, and a matrix-Bernstein operator-norm bound controlled by the top kernel and modulation eigenvalues and an intrinsic dimension rather than the crude $N\max_{ij}$ route. Whitening this argument at the ridge makes the effective dimension $d_{\mathrm{eff}}(\lambda)$ the exact intrinsic dimension of the matrix variance, so $O((1+\|P\|_{\mathrm{op}}/\lambda)\log(d_{\mathrm{eff}}/\delta))$ radial draws preserve the kernel-ridge solution; tilting the draw by a closed-form whitened leverage improves this to the effective-dimension count $O((1+d_{\mathrm{eff}})\log(d_{\mathrm{eff}}/\delta))$. Conditioning on the sketch carries every guarantee to the deployed doubly-randomized estimator up to one additive sketch term, and all hold for the whole class with the modulation Gram in place of the polynomial one. The flagship instance is the biased $yat$-kernel $k_{yat,b}(w,x)=(w^\top x+b)^2/(\|w-x\|^2+\varepsilon)$, whose family span contains the inverse-multiquadric kernel by finite differences in $b$.
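
As a rough illustration of the radial randomization in this abstract, the sketch below estimates the biased yat-kernel with the modulation $(w^\top x+b)^2$ kept exact (the $m\to\infty$ case) and only the radial factor randomized. It relies on the identity $1/(r^2+\varepsilon)=\int_0^\infty e^{-s r^2}e^{-s\varepsilon}\,ds$: drawing a scale $s$ from an exponential distribution with rate $\varepsilon$ and then one Gaussian random Fourier frequency $\omega\sim\mathcal{N}(0,2sI)$ gives an unbiased estimate after dividing by $\varepsilon$. The function names, the test vectors, and the number of draws are assumptions. The sketched modulation and the leverage-tilted sampling mentioned in the abstract are not shown.

```python
import math
import random
from typing import List


def yat_exact(w: List[float], x: List[float], b: float, eps: float) -> float:
    """Exact biased yat-kernel (w.x + b)^2 / (||w - x||^2 + eps)."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    r2 = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    return (dot + b) ** 2 / (r2 + eps)


def yat_random_features(
    w: List[float],
    x: List[float],
    b: float,
    eps: float,
    num_draws: int,
    rng: random.Random,
) -> float:
    """Monte Carlo estimate: exact modulation times a randomized radial factor."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    acc = 0.0
    for _ in range(num_draws):
        s = rng.expovariate(eps)   # scale drawn from Exp(rate = eps)
        std = math.sqrt(2.0 * s)   # frequency covariance is 2 * s * I
        omega = [rng.gauss(0.0, std) for _ in w]  # one frequency per scale
        pw = sum(oi * wi for oi, wi in zip(omega, w))
        px = sum(oi * xi for oi, xi in zip(omega, x))
        # Inner product of the features [cos, sin](omega.w) and [cos, sin](omega.x),
        # which equals cos(omega.(w - x)).
        acc += math.cos(pw) * math.cos(px) + math.sin(pw) * math.sin(px)
    radial = acc / (eps * num_draws)  # estimates 1 / (||w - x||^2 + eps)
    return (dot + b) ** 2 * radial


w = [0.5, -0.2, 0.1]
x = [0.3, 0.4, -0.1]
exact = yat_exact(w, x, b=1.0, eps=1.0)
estimate = yat_random_features(w, x, b=1.0, eps=1.0, num_draws=20000, rng=random.Random(0))
```

With enough draws, `estimate` should approach `exact`. The error shrinks at the usual Monte Carlo rate in the number of radial draws, and the $1/\varepsilon$ factor means that a small $\varepsilon$ inflates the variance.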