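arXivDaily daily academic digest: synchronized with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.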

2605.24980 2026-05-26 cs.RO

Loosely Coupled Factor Graph Optimization for Pseudolite-Augmented Navigation

Chih-Chun Chen, Lipeng Tan, Shiyu Bai, Heike Vallery

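AI summary: Proposes a loosely coupled factor graph optimization framework that fuses GNSS/pseudolite least-squares solutions with IMU data, reducing mean 3D error by 22.8% to 41.3% relative to standard least-squares methods in low-visibility environments.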

Abstract

In Global Navigation Satellite System (GNSS)-degraded environments, pseudolites (PLs) provide additional signal sources to enhance positioning performance, but their integration in optimization-based frameworks remains limited. This paper presents a loosely coupled factor graph optimization (FGO) framework that fuses the GNSS/PL least-squares (LS) solutions with inertial measurement unit (IMU) data. The evaluation considers low GNSS visibility scenarios with four high-elevation GNSS satellites and up to two PL transmitters over an 80 s window. FGO achieves a 22.8% to 41.3% reduction in mean 3D error compared to standard LS methods. Compared to a GNSS-IMU baseline, incorporating PL transmitters further improves positioning accuracy, with performance depending on geometry.

2605.24977 2026-05-26 cs.CV cs.CL

Universal Boosts, Specific Suppressors: Sparse Autoencoder Steering of Medical Vision-Language Models

Farhad Nooralahzadeh, Benjamin Gundersen, Nicolas Deperrois, Hidetoshi Matsuom, Mizuho Nishio, Thomas Frauenfelder, Ahmed Allam, Christian Blüthgen, Michael Moor, Michael Krauthammer

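AI summary: Proposes a decoding-time residual steering method that requires no weight updates, intervening on medical vision-language models through per-token sparse autoencoders (SAEs) to suppress hallucinations and improve report quality, with significant improvements on several models.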

Abstract

Medical vision-language models (VLMs) often hallucinate findings when generating chest X-ray reports: they fabricate findings that are not present in the image, miss important ones, or locate them incorrectly. We mitigate this without weight updates by decoding-time residual steering on a per-token sparse autoencoder (SAE) basis: Top-K SAEs on late layers, causal steering against clinical errors, then combined suppress/boost intervention at inference time. On the MIMIC-CXR test split, our inference-only method improves the quality of generated reports for three radiology VLMs (RadVLM, LLaVA-Rad, and CheXOne), with relative improvements of +5.4%, +7.2%, and +17.0% in the clinical composite metric, and statistically significant GREEN gains on all backbones. A cross-model feature alignment shows that the quality-promoting (boost) directions overlap strongly across architectures, whereas hallucination-linked (suppress) directions are model-specific. Therefore, transferable steering must treat suppression per-backbone, rather than sharing a universal suppress list. The same recipe transfers zero-shot to IU-Xray (GREEN +7.7% relative) without retraining, confirming that the identified features are properties of the model, not of the training corpus. We release causal feature sets and an interactive feature dashboard: https://cxr-sparse-feature-dashboard.netlify.app/.
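To make the suppress/boost idea concrete, the following is a minimal sketch of residual steering through a Top-K sparse autoencoder, written in plain Python. It is not the authors' implementation: the function names, the `boost_scale` parameter, the choice to add a decoder-direction delta to the residual (rather than replace the residual with the SAE reconstruction), and the toy identity dictionaries are all assumptions made for illustration.

```python
from typing import List, Set

Vector = List[float]
Matrix = List[List[float]]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def top_k_encode(residual: Vector, encoder: Matrix, k: int) -> Vector:
    """Top-K SAE encoder: keep the k largest pre-activations, zero the rest."""
    pre = [dot(row, residual) for row in encoder]
    keep = set(sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k])
    return [max(pre[i], 0.0) if i in keep else 0.0 for i in range(len(pre))]


def steer(
    residual: Vector,
    encoder: Matrix,   # shape: n_features x d_model
    decoder: Matrix,   # shape: n_features x d_model (row i = direction of feature i)
    k: int,
    suppress: Set[int],
    boost: Set[int],
    boost_scale: float = 2.0,  # hypothetical strength; not taken from the paper
) -> Vector:
    """Remove suppressed features and amplify boosted ones in the residual."""
    acts = top_k_encode(residual, encoder, k)
    steered = list(residual)
    for i, act in enumerate(acts):
        if i in suppress:
            delta = -act                        # cancel this feature's contribution
        elif i in boost:
            delta = (boost_scale - 1.0) * act   # scale the feature up
        else:
            continue
        for j, weight in enumerate(decoder[i]):
            steered[j] += delta * weight
    return steered


# Toy dictionary: each feature is one basis direction of a 3-dim residual.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
steered = steer([3.0, 2.0, 1.0], identity, identity, k=2, suppress={0}, boost={1})
print(steered)
```

Adding only the delta along the chosen decoder directions leaves the part of the residual that the SAE does not reconstruct untouched, which is one common reason to prefer it over substituting the full reconstruction. Per the abstract, the suppress set would be chosen separately for each backbone, while the boost set may overlap across models.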

2605.24975 2026-05-26 cs.RO cs.AI cs.LG

Bridging the Gap: Enabling Soft Actor Critic for High Performance Legged Locomotion

Gianluca Sabatini, Chenhao Li, Marco Hutter

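AI summary: Identifies the root causes of Soft Actor-Critic (SAC) underperforming in massively parallel training and proposes improvements including policy initialization, timeout-aware critic targets, and multi-step return estimation, bringing SAC to performance comparable with Proximal Policy Optimization (PPO) on legged locomotion tasks.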

Abstract

Proximal Policy Optimization (PPO) has become the de facto standard for training legged robots, thanks to its robustness and scalability in massively parallel simulation environments like IsaacLab. However, its on-policy nature makes it inherently sample-inefficient, preventing its use for continuous adaptation and fine-tuning on real hardware. Soft Actor-Critic (SAC), by contrast, is an off-policy algorithm that can reuse past experience, making it a natural candidate for sim-to-real transfer workflows where the same algorithm can be used both in simulation and for online learning on the real robot. Despite these advantages, SAC has consistently failed to match PPO's empirical performance in massively parallel training settings. This work identifies the root causes of this gap and introduces targeted modifications, covering policy initialization, timeout-aware critic targets, and multi-step return estimation, that enable SAC to train stably at scale. Evaluated across multiple legged robot platforms and diverse locomotion tasks, our approach closes the performance gap with PPO entirely.
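Two of the modifications named above, timeout-aware critic targets and multi-step returns, can be illustrated together. The sketch below shows one common convention, not necessarily the paper's exact formulation: a true termination ends the return with no bootstrap, while a timeout (episode cut by a time limit) still bootstraps from the critic. The `terminated`/`truncated` flag names follow the Gymnasium convention, and `next_values[i]` is assumed to hold the critic's (soft) value estimate for the state reached after step `i`.

```python
from typing import Sequence


def n_step_target(
    rewards: Sequence[float],
    terminated: Sequence[bool],   # True if the episode really ended at this step
    truncated: Sequence[bool],    # True if the episode was cut by a time limit
    next_values: Sequence[float], # critic estimate for the state after each step
    gamma: float = 0.99,
) -> float:
    """n-step critic target for the first transition of a short segment."""
    target = 0.0
    discount = 1.0
    for reward, term, trunc, value_next in zip(
        rewards, terminated, truncated, next_values
    ):
        target += discount * reward
        discount *= gamma
        if term:
            # Real terminal state: the future return is zero, so do not bootstrap.
            return target
        if trunc:
            # Timeout: the task would have continued, so bootstrap from the critic.
            return target + discount * value_next
    # Segment ended without a boundary: bootstrap from the last next-state value.
    return target + discount * next_values[-1]


# Three-step segment with gamma = 0.5 so the arithmetic is easy to follow.
full = n_step_target([1.0, 1.0, 1.0], [False] * 3, [False] * 3, [0.0, 0.0, 10.0], 0.5)
died = n_step_target([1.0, 1.0, 1.0], [False, True, False], [False] * 3, [0.0, 4.0, 10.0], 0.5)
timed_out = n_step_target([1.0, 1.0, 1.0], [False] * 3, [False, True, False], [0.0, 4.0, 10.0], 0.5)
print(full, died, timed_out)
```

Treating a timeout as a termination would teach the critic that value drops to zero near the time limit, which is the kind of bias a timeout-aware target is meant to avoid.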

2605.24973 2026-05-26 cs.CV cs.AI cs.CL

MinerU-Popo: Universal Post-Processing Model for Structured Document Parsing

Bangrui Xu, Ziyang Miao, Xuanhe Zhou, Yiming Lin, Zirui Tang, Xiaomeng Zhao, Fan Wu, Cheng Tan, Fan Wu, Bin Wang, Conghui He

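AI summary: Proposes MinerU-Popo, a lightweight, universal post-processing framework that decomposes the problem into four subtasks (text truncation recovery, table truncation recovery, title hierarchy reconstruction, and image-text association) and uses dynamic chunking with overlap-based synchronization to rebuild page-level OCR results into document-level logical structure, substantially improving title-hierarchy TEDS and RAG accuracy.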

Comments The code is available at https://github.com/opendatalab/MinerU-Popo

Abstract

VLM-based OCR models have become the de facto choice for document parsing, as they can accurately extract page-level elements (e.g., paragraphs within individual pages) together with their bounding boxes and textual content. However, downstream applications such as RAG require coherent document-level information, whereas these models often break cross-page continuity and fail to recover disrupted structures, such as paragraphs and tables truncated by page boundaries. Such relationships are not confined to a single page; instead, they require joint analysis of titles, paragraphs, tables, and images spanning multiple pages. A natural solution is therefore to reuse existing OCR outputs and reconstruct document-level logical structures through post-processing. To this end, we propose MinerU-Popo, a lightweight and universal framework for POst-Processing OCR outputs, which converts page-level results from diverse parsers into coherent document-level structures. MinerU-Popo decomposes the problem into four focused subtasks: text truncation recovery, table truncation recovery, title hierarchy reconstruction, and image-text association. To address these effectively, we build a task-oriented data engine with task-specific input filtering, and use the generated data (30K) to fine-tune a lightweight post-processing model (Qwen3-VL-4B). To support long documents, we introduce dynamic chunking with overlap-based synchronization, which aligns chunk-level outputs from the fine-tuned model and preserves global consistency. Finally, we assemble the aligned outputs into a tree-structured document representation, further enriched with node chunking and summaries for downstream retrieval and analysis. Empirical results show MinerU-Popo improves title-hierarchy TEDS by at least 20% across all five tested OCR models, improves RAG accuracy and reduces per-query latency.

2605.24971 2026-05-26 cs.LG cs.AI

TGFormer: Towards Temporal Graph Transformer with Auto-Correlation Mechanism

Hongjiang Chen, Pengfei Jiao, Ming Du, Xuan Guo, Zhidong Zhao, Di Jin, Xiao Liu

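AI summary: To address the shortcomings of temporal graph neural networks in capturing long-term dependencies and periodic patterns, proposes TGFormer, which uses a trajectory framework and an auto-correlation mechanism to perform dependency discovery and representation aggregation at the sub-interaction level, improving precision by up to 9.35% on six benchmarks.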

DOI: 10.1016/j.patcog.2025.112053
Journal ref: Pattern Recognition 170 (2026): 112053

Abstract

The growing interest in Temporal Graph Neural Networks (TGNNs) stems from their ability to model complex dynamics and deliver superior performance. However, TGNNs encounter fundamental challenges in capturing long-term dependencies and identifying periodic patterns. To address these limitations, we propose TGFormer, a novel Transformer architecture specifically designed for temporal graphs. Our model redefines temporal graph learning by establishing a trajectory framework that aligns with time series analysis principles. This approach allows TGFormer to derive node representations through systematic analysis of historical interactions, enabling granular examination of node relationships across sequential timestamps. Building upon stochastic process theory, we develop an auto-correlation mechanism that systematically uncovers periodic dependencies in node interactions. This innovation empowers TGFormer to perform dependency discovery and representation aggregation at sub-interaction levels, demonstrating superior efficiency and accuracy compared to conventional attention mechanisms. Experimental validation across six public benchmarks confirms the effectiveness of our approach, with TGFormer achieving up to a 9.35% precision improvement over state-of-the-art approaches.
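The abstract does not spell out the auto-correlation mechanism, so the sketch below only illustrates the underlying time-series idea: score each time lag by how strongly a sequence correlates with a shifted copy of itself, then keep the highest-scoring lags as candidate periods. The function names are hypothetical, the series is treated as circular for simplicity, and the direct O(n^2) sum is used for readability (the same quantity can be computed in O(n log n) with an FFT via the Wiener-Khinchin relation).

```python
from typing import List


def autocorrelation(series: List[float]) -> List[float]:
    """Circular autocovariance of a mean-centred series for every lag."""
    n = len(series)
    mean = sum(series) / n
    centred = [x - mean for x in series]
    return [
        sum(centred[t] * centred[(t + lag) % n] for t in range(n)) / n
        for lag in range(n)
    ]


def top_lags(series: List[float], k: int) -> List[int]:
    """Return the k non-zero lags with the highest autocorrelation."""
    scores = autocorrelation(series)
    lags = sorted(range(1, len(series)), key=lambda lag: scores[lag], reverse=True)
    return lags[:k]


# A toy interaction-count sequence that repeats every 3 steps.
activity = [1.0, 0.0, 0.0] * 4
print(top_lags(activity, 3))
```

In an attention-like layer, the selected lags would determine which shifted copies of the history get aggregated, replacing point-wise query-key scores with series-level similarity.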

2605.24969 2026-05-26 cs.LG cs.AI

OSDTW: Optimal Shared Depth and Task Weighting for Long-Tailed Recognition

Chang Chu, Qingyue Zhang, Shao-Lun Huang, Junxiong Zheng

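AI summary: Proposes the OSDTW framework, which decomposes recognition into tasks served by a shared encoder and task-specific decoders and derives a Fisher-information-based bias-variance decomposition of the generalization error to optimize the shared depth and task weights, addressing the head-tail performance trade-off in long-tailed recognition.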

Comments ICIC 2026 Oral

Abstract

Long-tailed recognition suffers from a persistent head-tail trade-off: improving tail performance often degrades head accuracy and can increase training instability. Despite strong empirical results from re-weighting, decoupled training, and multi-expert methods, key design choices about representation sharing between head and tail classes and supervision weighting across class groups remain largely heuristic. In this work, we propose OSDTW, a principled task-decomposition framework that partitions the original single-label recognition problem into a head task and a tail task, implemented with a shared encoder and task-specific decoders. To handle the mutual exclusivity and statistical dependence between the two label groups, we introduce a factorized model and show that the resulting Kullback-Leibler divergence-based generalization error can be written as the sum of task-wise terms up to an additive constant, yielding a well-defined task-wise objective. We further develop a three-stage training pipeline: independent task training to estimate task-wise optima and the Fisher information matrix, weighted joint training to learn a shared encoder, and branch assembly to construct the final decoupled model. Under a block-diagonal Fisher approximation, we derive a computable second-order expansion of the expected generalization error, decomposing it into encoder variance, encoder bias, and decoder variance. This bias-variance decomposition provides a computable proxy to select the shared depth and task weights, enabling efficient hyper-parameter search. Experiments on standard long-tailed benchmarks demonstrate the effectiveness of the proposed approach over strong baselines.

2605.24965 2026-05-26 cs.CV cs.AI cs.LG

Cross-Domain Generalization Limits of Vision Foundation Models in Facial Deepfake Detection

Ibrahim Delibasoglu

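AI summary: Systematically evaluates the linear-probe performance of three vision foundation models (RoPE-ViT, DINOv3, NVIDIA C-RADIOv4-H) on the DF40 benchmark, revealing their cross-domain generalization limits in facial deepfake detection: the models remain highly discriminative for entire-face synthesis but hit fundamental limits on localized editing techniques.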

Abstract

The rapid evolution of generative models has enabled the creation of hyper-realistic facial deepfakes, exposing a critical vulnerability in modern digital forensics: the inability of detectors to generalize to unseen manipulation techniques. Traditional networks suffer from representation collapse, overfitting to localized artifact fingerprints of specific training generators. This work investigates whether modern Vision Foundation Models can serve as generalizable, out-of-the-box feature extractors capable of tracking forensic anomalies across entirely unseen generative manifolds. We conduct a systematic cross-domain evaluation comparing three foundational learning paradigms: fully supervised macro-semantic features (RoPE-ViT), pure self-supervised geometric features (DINOv3), and multi-teacher agglomerative representations (NVIDIA C-RADIOv4-H). By deploying frozen backbones subjected to downstream linear probing, we map the performance limitations of these architectures on the challenging DF40 benchmark. Our empirical findings expose the intrinsic trade-offs between pre-training paradigms and parameter scale, proving that while foundation models retain high discriminative capabilities for entire face synthesis, localized face editing techniques expose fundamental boundaries in linear probe evaluation structures. Source code and model weights are available in http://github.com/mribrahim/deepfake

2605.24964 2026-05-26 cs.CV

ConFi-GS: Confidence-Guided High-Frequency Injection for 3D Gaussian Splatting Super-Resolution

Jiaxiang Li, Zongtan Zhou, Zhen Tan, Yadong Liu, Dewen Hu

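AI summary: Proposes a reliability-aware frequency modeling framework that uses a geometry-guided detail-demand prior and a frequency-aware reliability map to guide the injection of high-frequency detail in low-resolution 3DGS reconstruction, improving fidelity and perceptual quality.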

Abstract

Reconstructing high-quality 3D scenes from low-resolution multi-view images remains challenging for 3D Gaussian Splatting (3DGS), because insufficient high-frequency observations often lead to blurred textures, weak boundaries, and view-inconsistent details. Existing approaches either apply super-resolution guidance uniformly or localize enhancement regions based mainly on geometric sampling. However, they typically do not distinguish between two fundamentally different questions: where additional detail is needed, and whether the corresponding candidate high-frequency content is reliable enough to be internalized into a multi-view consistent 3D representation. In this paper, we propose a reliability-aware frequency modeling framework for low-resolution 3DGS reconstruction. The framework first estimates a geometry-guided detail-demand prior to locate regions that are likely under-detailed under low-resolution supervision. It then computes a frequency-aware reliability map to determine whether candidate high-frequency details are structurally supported, spectrally unresolved, and cross-view stable. Combining these signals yields a detail-injection map that guides where super-resolved details should be introduced during optimization. Based on this map, we design a unified optimization scheme comprising spatially selective supervision, coarse-to-fine frequency regularization, and reliability-aware Gaussian densification. This scheme controls where reliable details are injected, when high-frequency supervision is activated, and how unresolved yet reliable details are internalized into the Gaussian representation. Experiments on multiple benchmarks show improved fidelity and perceptual quality while suppressing unstable or view-inconsistent details.

2605.24962 2026-05-26 cs.CV

Tempered Self-Similarity Alignment for Physically Plausible Video Generation

Manjin Kim, Suha Kwak, Minsu Cho

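AI summary: Proposes the Tempered Self-Similarity Alignment (TSA) loss, which transfers the relational knowledge of spatio-temporal self-similarity in visual foundation models into video generative models to improve the physical plausibility of generated videos.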

Comments Accepted to the CVPR 2026 Workshop on Video Generative Models: Benchmarks and Evaluation (VGBE)

Abstract

Despite remarkable advances in video generative models, they still struggle to generate physically realistic videos, frequently exhibiting appearance drift, implausible motion, and temporal inconsistencies. In this work, we address this limitation by transferring relational knowledge encoded in spatio-temporal self-similarity (STSS) from visual foundation models into video generative models. STSS represents pairwise similarities among features across space and time, revealing the relational structure of how objects interact with other entities throughout a video, effectively capturing real-world dynamics, including object motion and semantic transformations. To transfer this relational knowledge, we propose Tempered Self-similarity Alignment (TSA) loss, which transforms STSS into probabilistic correspondence distributions and trains the video generative model to align its correspondence distributions with those of the visual foundation model on dynamically changing regions. Evaluated on VideoPhy and VideoPhy2 benchmarks, our method demonstrates substantial improvements in physical plausibility across diverse interaction scenarios, validating the effectiveness of transferring relational knowledge for physically realistic video generation.
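The step of turning self-similarities into "probabilistic correspondence distributions" can be sketched as a temperature-scaled softmax over pairwise feature similarities, followed by a divergence between the teacher's and the student's distributions. This is an illustrative reading, not the paper's definition: the use of cosine similarity, the direction of the KL divergence, the temperature values, and all function names are assumptions.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def correspondence(query: Vector, keys: List[Vector], temperature: float) -> List[float]:
    """Softmax over similarities between one feature and all other locations."""
    logits = [cosine(query, key) / temperature for key in keys]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q); both inputs are strictly positive softmax outputs here."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Toy features: one query location and three candidate locations across frames.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
teacher = correspondence(query, keys, temperature=0.1)  # sharper distribution
student = correspondence(query, keys, temperature=1.0)  # flatter distribution
loss = kl_divergence(teacher, student)
print(teacher, student, loss)
```

A lower temperature concentrates probability mass on the best-matching location, so the temperature controls how strictly the student is asked to reproduce the teacher's correspondences.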

2605.24961 2026-05-26 cs.LG

MedMamba: Multi-View State Space Models with Adaptive Graph Learning for Medical Time Series Classification

Da Zhang, Bingyu Li, Zhiyuan Zhao, Hongyuan Zhang, Junyu Gao, Xuelong Li

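AI summary: Proposes MedMamba, an end-to-end architecture that integrates state space models with domain-specific inductive biases, using multi-scale convolutional embeddings, a tri-branch differential state space encoder, and a spatial graph Mamba module to handle local-global dynamics, nonstationarity, and latent channel interactions respectively, achieving state-of-the-art performance on five real-world datasets.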

Comments Accepted to 2026 ICML

Abstract

Medical time series are central to healthcare, enabling continuous monitoring and supporting timely clinical decisions. Despite recent progress, existing methods struggle to jointly model local-global dynamics and handle nonstationarities like baseline drift, while often failing to capture latent channel interactions. To address these challenges, we propose MedMamba, an end-to-end architecture that integrates state space models with domain-specific inductive biases. Specifically, MedMamba first employs multi-scale convolutional embeddings to capture discriminative local morphology. Second, to mitigate nonstationarity, we introduce a tri-branch differential state space encoder that processes raw, temporal-difference, and frequency-domain views, fusing them to emphasize informative patterns while suppressing drift. Furthermore, to uncover latent channel correlations, we design a spatial graph Mamba module that learns a directed dependency structure regularized toward sparsity and acyclicity, which obviates the need for predefined graphs. Extensive experiments on five real-world datasets demonstrate that MedMamba achieves state-of-the-art performance while maintaining linear computational complexity, and ablation studies validate each component's contribution. Code is available at https://github.com/zhangda1018/MedMamba.

2605.24960 2026-05-26 cs.CL cs.AI cs.LG

Investigating the Interplay between Contextual and Parametric Chain-of-Thought Faithfulness under Optimization

Jingyi Sun, Qianli Wang, Pepa Atanasova, Nils Feldhus, Isabelle Augenstein

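AI summary: Proposes FaithMate, a unified preference-alignment interface, to study how the contextual and parametric paradigms of chain-of-thought faithfulness interact under optimization, finding that the two are positively coupled but asymmetric and that contextual faithfulness metrics trade off against one another.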

Comments The first two authors contributed equally and share first-authorship

Abstract

Chain-of-Thought (CoT) faithfulness, i.e., whether CoTs genuinely reflect large language models' (LLM) underlying behavior, is typically evaluated under two disjoint paradigms: contextual faithfulness, measured by perturbing the input or CoT trace, and parametric faithfulness, assessed by intervening on a model's parametric knowledge. Yet prior work compares them only descriptively. We fill this gap by proposing FaithMate, a unified preference-alignment interface for optimizing models towards either faithfulness paradigm. It enables us to investigate the interplay between the two paradigms, examining whether and to what extent faithfulness gains generalize within and across paradigms. Across three models, two datasets, and six faithfulness metrics, we find that the two paradigms are positively coupled, yet asymmetric: optimizing towards parametric faithfulness yields consistent gains across both paradigms, whereas the contextual counterpart delivers more variable gains. Within the contextual paradigm, faithfulness gains on one metric do not consistently transfer to others, implying that existing contextual metrics capture disjoint facets of faithfulness and exposing inherent trade-offs. These findings imply that CoT faithfulness is not a monolithic objective and therefore requires multifaceted optimization and evaluation.

2605.24959 2026-05-26 cs.CV

Three-Step Conditional Diffusion 3D Reconstruction for Light-Field Microscopy

Qihong Zhao, Shaokang Yan, Zhimin Qiao, Jinjia Wang, Bo Xiong

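AI summary: To address the low resolution, heavy artifacts, and high computational cost of traditional algorithms in light-field microscopy, as well as the limited reconstruction accuracy and generalization of existing learning-based methods, proposes a high-fidelity 3D reconstruction method based on three-step conditional diffusion that achieves fast, accurate reconstruction through deterministic three-step sampling and a lightweight conditional U-Net, and adds an Inter-Class Detection module to improve stability.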

Comments 10 pages, 6 figures. Accepted to CVPR 2026 Findings

Abstract

Light-field microscopy (LFM) enables single-shot capture of multi-angular information from biological samples, supporting real-time volumetric imaging. However, traditional physics-based algorithms often suffer from limited spatial resolution, severe artifacts, and high computational costs. Existing learning-based methods improve inference efficiency but still face limitations in reconstruction accuracy and generalization capability. To address these challenges, this paper proposes a high-fidelity Three-Step Conditional Diffusion (TCD) 3D reconstruction method for LFM. Although conventional diffusion models have achieved remarkable success in generative modeling, their slow sampling process and the inherent trade-off between quality and efficiency hinder their application in real-time 3D imaging. We redesign the diffusion process through a deterministic three-step sampling strategy coupled with a lightweight conditional U-Net, establishing a new paradigm for fast and accurate volumetric reconstruction. Furthermore, an Inter-Class Detection (ICD) module is incorporated to identify out-of-distribution or anomalous inputs during inference, thereby enhancing model stability and reliability. Extensive experiments and cross-dataset evaluations demonstrate that TCD significantly outperforms state-of-the-art methods in both reconstruction fidelity and generalization, providing an efficient and practical 3D reconstruction solution for light-field microscopy.

2605.24958 2026-05-26 cs.CL cs.AI

SEP-Attack: A Simple and Effective Paradigm for Transfer-Based Textual Adversarial Attack

Han Liu, Zhi Xu, Xiaotong Zhang, Feng Zhang, Xiaoming Xu, Wei Wang, Fenglong Ma, Hong Yu

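AI summary: Proposes SEP-Attack, which uses a Determinantal Point Process to generate diverse surrogate ensemble weights and a new metric to evaluate prediction confidence, from which word importance scores are computed and adversarial examples generated, significantly outperforming existing methods on several datasets and APIs.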

Abstract

Despite the strong performance of deep neural networks in modern Web and language applications, they remain vulnerable to adversarial attacks, especially transferable attacks that generate adversarial examples using surrogate models without accessing the victim model. Transferable attacks in the text domain are still under-explored, with only a few studies addressing this challenging issue, often with suboptimal results due to equal treatment of submodels or inaccurate estimation of importance scores. To address these challenges, we propose a simple yet effective paradigm for transfer-based textual adversarial attack, named SEP-Attack. Specifically, we employ the Determinantal Point Process (DPP) to generate diverse surrogate ensemble weights, representing the transferability of submodels. Using these weights, we introduce a new metric to evaluate prediction confidence scores, which in turn are used to calculate word importance scores and generate adversarial candidates. Finally, we quantify the transferability score for each candidate and select the top ones as the final transferable adversarial examples. Experiments conducted on four datasets and two real-world APIs validate the efficacy of SEP-Attack, significantly outperforming state-of-the-art baselines.

2605.24957 2026-05-26 cs.AI cs.CV cs.LG

Mitigating Object Hallucinations in Vision-Language Models through Region-Aware Attention Recalibration

Yuanzhi Xu, Qian Gao, Jun Fan, Guohui Ding, Zhenyu Yang, Sixue Lin, Yuteng Xiao

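AI summary: Proposes a training-free, region-aware adaptive weighting mechanism that computes a robust statistical midpoint across attention heads and uses inter-head disagreement to set intervention budgets dynamically, suppressing hallucination paths through continuous penalty modulation and correcting visual-semantic misalignment while preserving generative fluency.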

Abstract

The generation of factually incorrect objects, commonly known as object hallucination, remains a persistent challenge in Large Vision-Language Models (LVLMs). Current approaches to address this issue - ranging from expensive data-driven fine-tuning and high-latency contrastive decoding to rigid attention head truncation - frequently compromise either computational efficiency or the continuity of the model's feature space. To overcome these limitations, we introduce a novel, training-free inference strategy that operates as a region-aware adaptive weighting mechanism to dynamically correct semantic drift without relying on abrupt heuristic truncations. By computing an outlier-resistant statistical midpoint across various attention heads, we establish a stable anchor for reliable visual representations. We then utilize the inter-head disagreement mapped across regions to dynamically determine intervention budgets, gently suppressing hallucination-inducing attention paths through a continuous penalty modulation. This recalibration process effectively rectifies visual-semantic misalignments while fully preserving generative fluency and language priors. Comprehensive evaluations on standard multimodal benchmarks, including CHAIR, POPE, and MME, reveal that our strategy substantially curtails both instance- and sentence-level hallucinations. The results demonstrate state-of-the-art performance against contemporary baselines, confirming our method's efficiency and algorithmic robustness. Our code will be public.

2605.24956 2026-05-26 cs.CL

NITP: Next Implicit Token Prediction for LLM Pre-training

Xiangdong Zhang, Debing Zhang, Shaofeng Zhang, Xiaohan Qin, Yu Cheng, Junchi Yan

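AI summary: Proposes NITP, which augments discrete token prediction with dense continuous supervision in the representation space to address the under-constrained latent representation space of standard next-token prediction, yielding consistent gains on models from 0.5B to 9B parameters.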

Comments Accepted at ICML 2026

Abstract

Standard next-token prediction (NTP) supervises language models solely through discrete labels in the output logit space. We argue that this sparse one-hot supervision leaves the latent representation space under-constrained, allowing hidden states to drift into degenerate and anisotropic configurations that can limit generalization. To address this issue, we propose Next Implicit Token Prediction (NITP), which augments discrete prediction with dense continuous supervision directly in the representation space. NITP trains the model to predict the implicit semantic content of the next token, using shallow-layer representations from the same model as stable self-supervised targets. We provide theoretical analysis showing that NITP regularizes the optimization landscape by mitigating under-constrained degrees of freedom and encouraging a compact, structured representation geometry. Empirically, across dense and MoE models ranging from 0.5B to 9B parameters, NITP consistently improves downstream performance with negligible computational overhead. On a 9B MoE model, NITP achieves a 5.7% absolute improvement on MMLU-Pro, along with gains of 6.4% on C3 and 4.3% on CommonsenseQA, with approximately 2% additional training FLOPs and no additional inference cost. Our implementation is available at https://github.com/aHapBean/NITP.

2605.24953 2026-05-26 cs.AI

Towards Multi-Turn Dialog Systems for Industrial Asset Operations and Maintenance

面向工业资产运维的多轮对话系统

Chengrui Li, Rujing Li, Yitong Bai, Rui Li

AI总结针对工业资产运维中的多轮、迭代问答问题，提出基于监督者-专家多智能体架构的多轮对话系统，通过结构化工件复用、动态重规划和并行工具执行，显著提升规划效果和任务完成率。

AI中文摘要

工业资产运维问答本质上是多轮、迭代且高度依赖外部工具调用的。然而，传统的计划-执行单智能体架构在维护跨轮上下文和复用中间结果方面存在明显局限性。本文提出了一种基于监督者-专家多智能体架构的工业场景多轮对话系统。为缓解工具调用瓶颈，该系统集成了结构化工件复用、动态重规划和并行工具执行。评估结果表明，与基线相比，我们的系统实现了更好的响应质量，规划效果提升54.5%，任务完成率提升37.8%。系统性能分析进一步显示，跨轮工件复用有效减少了冗余工具调用，工具时间占比从47.3%降至26.3%，且第2-5轮的响应速度比第一轮快约4.2倍。

英文摘要

Industrial asset operations and maintenance question answering is inherently multi-turn, iterative, and highly dependent on external tool invocation. However, the conventional plan-execute single-agent architecture exhibits clear limitations in maintaining cross-turn context and reusing intermediate results. In this paper, we present a multi-turn dialog system designed for industrial scenarios based on a supervisor-specialist multi-agent architecture. To alleviate tool invocation bottlenecks, the system incorporates structured artifact reuse, dynamic replanning, and parallel tool execution. Evaluation results show that our system achieves better response quality compared with the baseline, with planning effectiveness increasing by 54.5% and task completion improving by 37.8%. System profiling further shows that cross-turn artifact reuse effectively reduces redundant tool invocation, decreasing the tool-time share from 47.3% to 26.3% and making turns 2-5 approximately 4.2x faster than the first turn.

2605.24950 2026-05-26 cs.RO cs.LG

ARCANE-PedSynth: Synthetic Multi-Pedestrian Datasets with Behavioural Crossing Annotations

ARCANE-PedSynth：具有行为穿越注释的合成多行人数据集

Muhammad Naveed Riaz, Maciej Wielgosz, Antonio M. López Peña

AI总结提出基于CARLA的开源框架ARCANE-PedSynth，通过混合AI-手动控制架构和12状态行为有限状态机生成高穿越率的多行人合成数据，支持RGB、LiDAR和DVS模态及行为标注，用于自动驾驶中的行人穿越预测。

AI中文摘要

我们提出ARCANE-PedSynth，一个基于CARLA的开源软件框架，用于生成具有密集行为注释的合成多行人数据集，以支持自动驾驶中的行人穿越预测。该框架通过混合AI-手动行人控制架构克服了CARLA原生9%的穿越率，可实现高达75%的可配置目标穿越率。一个包含五种角色原型的12状态行为有限状态机产生了多样化的穿越行为。该框架生成同步的RGB、LiDAR和DVS数据，并带有每帧穿越标签、行为状态和估计的2D姿态关键点。我们通过PedSynth++（一个使用该框架生成的示例数据集）展示了ARCANE-PedSynth，该数据集包含533个多行人片段，覆盖12种天气条件，并带有RGB、LiDAR和DVS流。ARCANE-PedSynth通过CLI参数化和Docker容器化实现完全可重复性。

英文摘要

We present ARCANE-PedSynth, an open-source CARLA-based software framework for generating synthetic multi-pedestrian datasets with dense behavioural annotations for pedestrian crossing prediction in autonomous driving. The framework overcomes CARLA's native 9% crossing rate through a hybrid AI-manual pedestrian control architecture, enabling configurable target rates up to 75%. A 12-state behavioural finite state machine with five character archetypes produces diverse crossing behaviours. The framework generates synchronised RGB, LiDAR, and DVS data with per-frame crossing labels, behavioural states, and estimated 2D pose keypoints. We demonstrate ARCANE-PedSynth through PedSynth++, an example dataset generated with the framework, comprising 533 multi-pedestrian clips across 12 weather conditions with RGB, LiDAR, and DVS streams. ARCANE-PedSynth is fully reproducible via CLI parameterisation and Docker containerisation.

2605.24946 2026-05-26 cs.CV

Interpretability Transfer from Language to Vision via Sparse Autoencoders

通过稀疏自编码器实现从语言到视觉的可解释性迁移

Alexey Kravets, Da Li, Chuan Li, Da Chen, Vinay P. Namboodiri

AI总结提出VISTA框架，通过约束视觉投影器将视觉token映射到LLM的文本SAE空间，实现无需专用视觉SAE的视觉可解释性，并在对象移除和替换任务上分别提升35%和47%。

Journal ref: ICML 2026

AI中文摘要

最近使用稀疏自编码器（SAE）在语言模型可解释性方面取得的进展尚未有效迁移到视觉领域，主要原因是标记视觉概念的困难和模糊性。在本文中，我们引入了通过SAE迁移对齐的视觉可解释性（VISTA），这是一个在LLaVA风格的视觉-语言模型中通过约束视觉投影器将视觉token映射到LLM预先存在的、已标记的文本SAE空间，从而将可解释性从语言迁移到视觉的框架。该方法无需训练专用的视觉SAE即可实现视觉可解释性。通过使用LLM的SAE重建损失对投影器进行正则化，VISTA将匹配率（衡量SAE空间中激活最强的文本概念与图像中语义元素对应准确度的指标）提高了三倍。利用该框架，我们进一步分析了不同视觉编码器的空间定位特性，并表明DINOv2特征比其他编码器具有更强的定位能力。利用这种精确性，我们通过细粒度的局部概念干预验证了VISTA的跨模态对齐，其中特定对象在模型感知中被移除或替换，同时保留周围场景。与纯视觉基线相比，对象移除任务提升了35%，对象替换任务提升了47%，为视觉token存在于文本SAE流形中提供了因果证据。这些贡献在多种LLM架构上得到了验证。

英文摘要

Recent advances in language model interpretability using sparse autoencoders (SAEs) have yet to effectively translate to the visual domain, mainly due to the difficulty and ambiguity of labeling visual concepts. In this paper, we introduce Visual Interpretability via SAE Transfer Alignment (VISTA), a framework that transfers interpretability from language to vision in a LLaVA-style vision-language model by constraining a visual projector to map visual tokens into an LLM's pre-existing, labeled textual SAE space. This approach enables visual interpretability without training dedicated vision SAEs. By regularizing the projector using the LLM's SAE reconstruction loss, VISTA achieves a threefold increase in the matching rate, which measures how accurately the most activating textual concepts in the SAE space correspond to semantic elements in the image. Using this framework, we further analyze spatial localization properties of different vision encoders and show that DINOv2 features have stronger localization abilities than other encoders. Leveraging this precision, we validate VISTA's cross-modal alignment through fine-grained, localized concept interventions, where specific objects are removed or replaced in the model's perception while preserving the surrounding scene. This results in improvements of 35% in object removal and 47% in object replacement tasks over vision-only baselines, providing causal evidence that visual tokens inhabit the text SAE manifold. These contributions are validated across multiple LLM architectures.

2605.24945 2026-05-26 cs.LG cs.AI physics.ao-ph

RealBench: Benchmarking Data-Driven Numerical Weather Forecasting Under Operational Conditions and Extreme Event Challenges

RealBench: 在操作条件和极端事件挑战下对数据驱动数值天气预报的基准测试

Ruize Li, Zhibin Wen, Tao Han, Hao Chen, Fenghua Ling, Wei Zhang, Song Guo, Lei Bai

AI总结提出RealBench基准，通过使用低延迟操作分析和全球10,000+站点观测数据，在严格分布外测试集上评估AI天气预报模型，揭示再分析指标与实际性能的显著差异，特别是极端事件方面。

Comments 35 pages, 22 figures

AI中文摘要

准确评估天气预报模型对于其在现实世界应用中的可靠部署至关重要。然而，现有基准主要依赖再分析产品（如ERA5），这些产品通过延迟数据同化生成，不能反映实时操作预报的约束，导致基准性能与现实预报之间存在系统性不匹配。在这项工作中，我们引入了RealBench，这是一个用于AI天气预报的下一代基准，强调在操作条件下的现实评估。RealBench具有严格分布外测试集，覆盖2025年，以消除数据泄露并捕捉近期大气状况。它整合了多个数据源，包括低延迟操作分析和包含超过10,000个站点的全球原位观测数据集，从而能够直接针对真实大气测量进行评估。除了标准全球指标外，RealBench还为高影响极端事件（包括热浪、寒潮和热带气旋）提供了全面的评估框架，使用事件特定指标更好地反映现实预报优先级。评估结果揭示了基于再分析的指标与现实性能之间的显著差异，特别是关于极端事件。通过突出现有基准的局限性，这项工作建立了一个更忠实且与操作相关的评估范式，为推进下一代AI天气预报系统提供了严格的基础。基准实现可在以下网址获取：https://github.com/lixruize-del/NWP-Benchmark。

英文摘要

Accurate evaluation of weather forecasting models is critical for their reliable deployment in real-world applications. However, existing benchmarks predominantly rely on reanalysis products such as ERA5, which are generated through delayed data assimilation and do not reflect the constraints of real-time operational forecasting, thereby resulting in a systematic mismatch between benchmark performance and real-world forecasting. In this work, we introduce RealBench, a next-generation benchmark for AI weather forecasting that emphasizes realistic evaluation under operational conditions. RealBench features a strictly out-of-distribution test set spanning 2025 to eliminate data leakage and capture recent atmospheric regimes. It integrates multiple data sources, including low-latency operational analysis and a large-scale global in-situ observation dataset comprising over 10,000 stations, enabling direct evaluation against real atmospheric measurements. Beyond standard global metrics, RealBench provides a comprehensive evaluation framework for high-impact extreme events, including heatwaves, cold surges, and tropical cyclones, using event-specific metrics that better reflect real-world forecasting priorities. The evaluation results reveal substantial discrepancies between reanalysis-based metrics and real-world performance, particularly concerning extreme events. By highlighting the limitations of existing benchmarks, this work establishes a more faithful and operationally relevant evaluation paradigm, providing a rigorous foundation for advancing next-generation AI weather forecasting systems. The benchmark implementation is available at: https://github.com/lixruize-del/NWP-Benchmark.

2605.24939 2026-05-26 cs.LG math.OC

Global linear convergence of entropy-regularized softmax policy gradient beyond tabular MDPs

熵正则化softmax策略梯度的全局线性收敛性：超越表格MDP

Ziyue Chen, David Šiška, Lukasz Szpruch

AI总结本文研究连续状态和动作空间的无限时域熵正则化马尔可夫决策过程中策略梯度的全局收敛性，通过线性函数逼近的log-linear softmax策略，在$Q^{\pi}_{\tau}$可实现性假设下建立非均匀Polyak--Łojasiewicz不等式，并识别两种特征机制下非均匀常数的有界性，证明正则化目标沿梯度流的全局线性收敛。

详情

AI中文摘要

我们研究了具有连续状态和动作空间的无限时域熵正则化马尔可夫决策过程（MDP）中策略梯度的全局收敛性。我们考虑带有线性函数逼近的log-linear softmax策略，它扩展了表格softmax参数化，同时保留了易处理的策略类。在正则化状态-动作值函数的$Q^{\pi}_{\tau}$可实现性下，我们首先建立了一个非均匀的Polyak--Łojasiewicz（PŁ）不等式。非均匀性源于与策略几何相关的常数的退化性，即Fisher信息矩阵或非中心特征协方差矩阵。然后，我们确定了两种特征机制，在这些机制下，该非均匀常数可以沿梯度流有界。对于全仿射跨度特征，我们证明了KL正则化子的径向无界性，并表明Fisher信息矩阵的最小特征值以一个依赖于初始化的正常数为下界。对于单纯形值特征，我们在与全1向量正交的子空间中证明了类似的径向无界性结果，并获得了非中心协方差矩阵最小特征值的一致下界。这些结果意味着正则化目标沿梯度流的全局线性收敛，即次优性以$\mathcal{O}(e^{-Ct})$衰减，其中$C>0$。我们的分析将熵正则化softmax策略梯度的全局收敛理论扩展到Agarwal等人（2020）；Bhandari和Russo（2024）；Mei等人（2020）的表格设置之外。

英文摘要

We study the global convergence of policy gradient for infinite-horizon entropy-regularized Markov decision processes (MDPs) with continuous state and action spaces. We consider log-linear softmax policies with linear function approximation, which extend the tabular softmax parameterization while retaining a tractable policy class. Under $Q^{\pi}_{\tau}$-realizability for the regularized state-action value function, we first establish a non-uniform Polyak--Łojasiewicz (PŁ) inequality. The non-uniformity arises through degeneracy of constants associated with the policy geometry, namely the Fisher information matrix or an uncentered feature covariance matrix. We then identify two feature regimes under which this non-uniform constant can be bounded along the gradient flow. For full-affine-span features, we prove radial unboundedness of the KL regularizer and show that the smallest eigenvalue of the Fisher information matrix remains bounded below by an initialization-dependent positive constant. For simplex-valued features, we prove an analogous radial unboundedness result in the subspace orthogonal to the all-ones vector and obtain a uniform lower bound for the smallest eigenvalue of the uncentered covariance matrix. These results imply global linear convergence of the regularized objective along the gradient flow, i.e. suboptimality decaying as $\mathcal{O}(e^{-Ct})$ for some $C>0$. Our analysis extends the global convergence theory of entropy-regularized softmax policy gradient beyond the tabular setting of Agarwal et al. (2020); Bhandari and Russo (2024); Mei et al. (2020).
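The last step, from a PŁ-type inequality to exponential decay, follows the standard Grönwall argument. The sketch below assumes a maximized objective $J$ with optimal value $J^\star$, the gradient flow $\dot\theta_t = \nabla J(\theta_t)$, and a constant $\mu > 0$ that lower-bounds the non-uniform PŁ constant along the trajectory. The symbols $J$, $\theta_t$, and $\mu$ are introduced here for illustration, and the exact form of the paper's inequality may differ.

```latex
\begin{align*}
\|\nabla J(\theta_t)\|^2 &\ge 2\mu\,\bigl(J^\star - J(\theta_t)\bigr), \\
\frac{d}{dt}\bigl(J^\star - J(\theta_t)\bigr)
  &= -\bigl\langle \nabla J(\theta_t), \dot\theta_t \bigr\rangle
   = -\|\nabla J(\theta_t)\|^2
   \le -2\mu\,\bigl(J^\star - J(\theta_t)\bigr), \\
J^\star - J(\theta_t) &\le e^{-2\mu t}\,\bigl(J^\star - J(\theta_0)\bigr).
\end{align*}
```

Under these assumptions the rate in $\mathcal{O}(e^{-Ct})$ would be $C = 2\mu$, which is why bounding the relevant smallest eigenvalue away from zero along the flow is the crux of the analysis.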

2605.24932 2026-05-26 cs.CV

X-Edit: Exact, Explicit, and Explainable Null-Space Editing for Medical Vision Transformers

X-Edit: 面向医学视觉Transformer的精确、显式且可解释的零空间编辑

Yuanye Liu, Siyuan Zhou, Ke Zhang, Lei Li, Wei Chen, Xiahai Zhuang

AI总结提出X-Edit框架，通过因果定位和零空间投影实现医学图像分类中ViT模型的精确错误修正，避免灾难性遗忘。

Comments Early accepted by MICCAI 2026

AI中文摘要

预训练的视觉Transformer（ViT）越来越多地用于医学图像分类。然而，在动态临床场景中纠正其不可避免的失败案例是一个关键挑战。传统的微调方法固有地遭受灾难性遗忘，严重降低先前获得的诊断能力。这种不稳定性从根本上危及临床安全。解决这一脆弱性需要一种主动、可控且可靠的干预机制，该机制既有理论依据又具有内在可解释性。为此，我们提出X-Edit（精确、显式且可解释的编辑），一种高效的零空间模型编辑框架。X-Edit将编辑过程从基于梯度的迭代优化转变为有理论依据的闭式解。具体来说，我们首先通过因果追踪显式定位导致错误预测的影响层。然后，从精心挑选的锚点集中构建正交零空间投影矩阵。通过将精确的参数更新几何约束在该零空间内，我们提供了数学保证，即干预能够纠正目标错误而不干扰已建立的诊断表示。在六个医学影像基准上的广泛评估表明，X-Edit全面抑制了灾难性遗忘，同时实现了卓越的编辑成功率。我们的代码可在https://github.com/HenryLau7/X-Edit获取。

英文摘要

Pre-trained Vision Transformers (ViTs) are increasingly deployed for medical image classification. However, correcting their inevitable failure cases in dynamic clinical scenarios poses a critical challenge. Conventional fine-tuning approaches inherently suffer from catastrophic forgetting, severely degrading previously acquired diagnostic capabilities. Such instability fundamentally compromises clinical safety. Addressing this vulnerability requires an active, controllable, and reliable intervention mechanism that is both theoretically grounded and inherently interpretable. To this end, we propose X-Edit (eXact, eXplicit, and eXplainable Editing), an efficient null-space model editing framework. X-Edit transitions the editing process from iterative gradient-based optimization to a theoretically grounded, closed-form solution. Specifically, we first explicitly localize the influential layers via causal tracing governing the erroneous prediction. Subsequently, we construct an orthogonal null-space projection matrix from a curated anchor set. By geometrically constraining the exact parameter update strictly within this null space, we provide mathematical guarantees that the intervention rectifies targeted errors without perturbing established diagnostic representations. Extensive evaluations on six medical imaging benchmarks demonstrate that X-Edit comprehensively suppresses catastrophic forgetting while achieving superior edit success rates. Our code is available at https://github.com/HenryLau7/X-Edit.
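The idea of a closed-form update confined to the null space of an anchor set can be illustrated on a single linear layer. This is a generic sketch under stated assumptions, not X-Edit's actual formula: the rank-one update, the Gram-Schmidt projector, and the names `null_space_edit`, `anchor_keys`, `key`, and `target` are all hypothetical. The update maps the new `key` to `target` while leaving the layer's output on every anchor key unchanged.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, x):
    return [dot(row, x) for row in W]

def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of the anchor keys."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis

def project_to_null_space(x, basis):
    """Remove from x every component that lies in the anchor span."""
    p = list(x)
    for q in basis:
        c = dot(x, q)
        p = [pi - c * qi for pi, qi in zip(p, q)]
    return p

def null_space_edit(W, anchor_keys, key, target):
    """Return W + dW with (W + dW) @ key == target and dW @ anchor == 0."""
    basis = orthonormal_basis(anchor_keys)
    p = project_to_null_space(key, basis)
    denom = dot(p, key)  # equals |p|^2 for an orthogonal projector
    if denom < 1e-12:
        raise ValueError("key lies in the anchor span; no null-space edit exists")
    residual = [t - y for t, y in zip(target, matvec(W, key))]
    # Rank-one update: residual (outer) p / denom, added row by row.
    return [[w + r * pj / denom for w, pj in zip(row, p)]
            for row, r in zip(W, residual)]
```

Because `p` is orthogonal to every anchor key, the added term contributes nothing on the anchors, which is the mechanism that protects previously correct behaviour in this toy setting.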

2605.24931 2026-05-26 cs.RO

Learning High-Frequency Continuous Action Chunks in Latent Space

在潜在空间中学习高频连续动作块

Kunyun Wang, Yuhang Zheng, Yupeng Zheng, Jieru Zhao, Wenchao Ding

AI总结本文提出通过变分自编码器将高频动作学习从动作空间转移到潜在空间，并引入Reuse-then-Refine块级精炼策略，以提升高频控制的时间与空间一致性，实现复杂接触任务的平滑执行。

Comments 17 pages, 10 figures

AI中文摘要

现代机器人策略越来越依赖动作块来在物理世界中执行复杂任务。虽然动作块在中等动作频率下提高了时间一致性，但当动作频率进一步增加（例如到60 Hz）时，它变得不足。在这样的高频下，策略常常无法生成既时间平滑又空间一致的动作。我们通过使用变分自编码器（VAE）将高频动作学习从动作空间转移到潜在空间来解决这一挑战。这种表述显著提高了高频控制的时间与空间一致性。为了实现平滑的实时执行，我们进一步引入了Reuse-then-Refine，一种块级精炼策略，在异步推理下改善相邻动作块之间的连续性。因此，由我们的策略控制的机器人可以连续执行复杂的接触丰富任务，减少停顿和抖动。在三个真实世界的接触丰富机器人任务上的实验表明，我们的方法能够以平滑的动作一致地完成任务。我们的代码和数据可在 https://github.com/tars-robotics/RTR 获取。

英文摘要

Modern robotic policies increasingly rely on action chunking to execute complex tasks in the physical world. While action chunking improves temporal consistency at moderate action frequencies, it becomes insufficient when the action frequency is further increased (e.g., to 60~Hz). At such high frequencies, policies often fail to generate actions that are both temporally smooth and spatially consistent. We address this challenge by shifting high-frequency action learning from the action space to a latent space with a variational autoencoder (VAE). This formulation significantly improves both the temporal and spatial consistency of high-frequency control. To enable smooth real-time execution, we further introduce Reuse-then-Refine, a chunk-level refinement strategy that improves continuity between adjacent action chunks under asynchronous inference. As a result, robots controlled by our policy can execute complex contact-rich tasks continuously, with fewer pauses and jerky motions. Experiments on three real-world contact-rich robotic tasks show that our approach consistently completes tasks with smooth motions. Our code and data are available at https://github.com/tars-robotics/RTR.

2605.24930 2026-05-26 cs.CL

H$^{2}$MT: Semantic Hierarchy-Aware Hierarchical Memory Transformer

H$^{2}$MT: 语义层次感知的层次记忆Transformer

Maryam Haghifam, Zifan He, Jason Cong, Yizhou Sun

AI总结提出H$^{2}$MT模型，通过离线构建语义层次结构并利用自底向上的后序聚合计算记忆嵌入，在推理时实现从粗到细的查询路由，从而在长上下文推理中实现质量与效率的权衡。

AI中文摘要

基于Transformer的LLM在许多语言任务上取得了强劲的结果；然而，长输入仍然具有挑战性，因为上下文窗口是有限的，并且预填充延迟和内存随提示长度快速增长。因此，平坦的令牌流处理和基于块的检索可能会在与查询无关的文本上花费大量计算和上下文预算。离线索引的RAG额外引入了外部存储和索引管理开销，并且通常将检索到的证据作为原始文本附加，增加了预填充成本和延迟。H$^{2}$MT使长上下文推理具有结构感知性：它离线构建语义层次结构，通过自底向上的后序聚合为每个节点计算记忆嵌入，并在推理时从粗到细地路由查询，以早期修剪不相关的分支。在LongBench QA（NarrativeQA、HotpotQA、QASPER）和两个结构化技术文档设置上，H$^{2}$MT实现了有利的质量-效率权衡，与提示压缩、记忆令牌方法和检索增强生成基线相比，在更低的峰值GPU内存和首令牌时间（TTFT）下取得了具有竞争力的ROUGE-L和F1（在适用情况下）。

英文摘要

Transformer-based LLMs achieve strong results on many language tasks; however, long inputs remain challenging because context windows are finite, and prefill latency and memory grow rapidly with prompt length. Flat token-stream processing and chunk-based retrieval can therefore spend substantial computation and context budget on text unrelated to the query. Offline-indexed RAG additionally introduces external storage and index management overhead, and typically appends retrieved evidence as raw text, increasing prefill cost and latency. H$^{2}$MT makes long-context inference structure-aware: it builds a semantic hierarchy offline, computes a memory embedding for each node via bottom-up post-order aggregation, and routes queries coarse-to-fine at inference to prune irrelevant branches early. On LongBench QA (NarrativeQA, HotpotQA, QASPER) and two structured technical-document settings, H$^{2}$MT achieves favorable quality-efficiency trade-offs, delivering competitive ROUGE-L and F1 (where applicable) with lower peak GPU memory and time-to-first-token (TTFT) than prompt compression, memory-token methods, and retrieval-augmented generation baselines.
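Bottom-up post-order aggregation and coarse-to-fine routing can be pictured with a toy tree. The sketch below is illustrative only and makes several assumptions not stated in the abstract: parents are embedded as the mean of their children, relevance is cosine similarity, and a fixed number of children (`keep`) survives at each level. The `Node` class and the `aggregate` and `route` functions are hypothetical names.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    name: str
    embedding: Optional[List[float]] = None  # preset for leaves only
    children: List["Node"] = field(default_factory=list)

def aggregate(node):
    """Post-order: embed all children first, then average them into the parent."""
    if node.children:
        child_embs = [aggregate(c) for c in node.children]
        dim = len(child_embs[0])
        node.embedding = [sum(e[i] for e in child_embs) / len(child_embs)
                          for i in range(dim)]
    return node.embedding

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def route(root, query, keep=1):
    """Coarse-to-fine: per level, keep the best `keep` children and prune the rest."""
    frontier, leaves = [root], []
    while frontier:
        next_frontier = []
        for node in frontier:
            if not node.children:
                leaves.append(node.name)
                continue
            ranked = sorted(node.children,
                            key=lambda c: cosine(c.embedding, query),
                            reverse=True)
            next_frontier.extend(ranked[:keep])
        frontier = next_frontier
    return leaves
```

With `keep=1` only one root-to-leaf path is scored, so the number of similarity computations grows with tree depth and branching factor rather than with the total number of leaves.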

2605.24928 2026-05-26 cs.CV

MambaDSF: Multi-Scale SSM with Dilated Feature Fusion for Sonar Small Target Detection

MambaDSF：基于膨胀特征融合的多尺度SSM用于声纳小目标检测

Hui Lin, Jiayi Li, Jing Wang, Shenghui Rong

AI总结针对声纳小目标检测中像素覆盖不足、噪声干扰和尺度模糊问题，提出MambaDSF混合框架，通过Mamba增强特征金字塔、膨胀融合编码器和尺度自适应损失函数，在UATD数据集上达到91.5% mAP50，参数28.7M。

Comments 8 pages, 4 figures, under review at IEEE Geoscience and Remote Sensing Letters (GRSL)

AI中文摘要

声纳成像是水下目标检测的主要方式，但由于像素覆盖不足、声学对比度低以及不同成像距离下的尺度模糊，小目标仍然难以检测。基于CNN的检测器能高效提取局部特征，但缺乏全局声学上下文，无法抑制噪声引起的虚警。基于Transformer的方法以二次计算代价捕捉长距离依赖。现有的基于Mamba的视觉模型提供高效的线性代价扫描，但缺乏跨金字塔层级的多尺度语义对齐、多感受野融合以及可靠声纳检测所需的小目标感知训练监督。本文提出Mamba膨胀尺度融合（MambaDSF），一个混合框架，通过三个贡献解决这些局限：Mamba增强特征金字塔（MambaEFP）骨干网络，以线性复杂度联合捕捉局部回波线索和全局声学上下文；膨胀融合Mamba（DFMamba）编码器，强制跨金字塔层级的多尺度特征对齐；以及尺度自适应加权IoU（SA-WIoU）和跨尺度一致性（CSC）损失，稳定小目标训练。MambaDSF在UATD前视声纳基准上达到91.5% mAP50，参数为2870万，超越所有对比检测器。在小目标子集上，增益达到+2.2个百分点，在FLS和MD-FLS上的跨域评估证实了所提出架构的泛化能力。代码公开于https://github.com/IDontKnowAAA/MambaDSF。

英文摘要

Sonar imaging is the primary modality for underwater target detection, yet small targets remain difficult to detect due to insufficient pixel coverage, low acoustic contrast, and scale ambiguity across imaging ranges. CNN-based detectors extract local features efficiently but cannot suppress noise-induced false alarms without global acoustic context. Transformer-based methods capture long-range dependencies at quadratic computational cost. Existing Mamba-based vision models offer efficient linear-cost scanning but lack multi-scale semantic alignment across pyramid levels, multi-receptive-field fusion, and small-target-aware training supervision needed for reliable sonar detection. This letter proposes Mamba Dilated-Scale Fusion (MambaDSF), a hybrid framework addressing these limitations through three contributions: a Mamba Enhanced Feature Pyramid (MambaEFP) backbone that jointly captures local echo cues and global acoustic context at linear complexity; a Dilate Fusion Mamba (DFMamba) encoder that enforces multi-scale feature alignment across pyramid levels; and Scale-Adaptive Weighted IoU (SA-WIoU) and Cross-Scale Coherence (CSC) losses that stabilize small-target training. MambaDSF achieves 91.5% mAP50 on the UATD forward-looking sonar benchmark with 28.7 million parameters, surpassing all compared detectors. On a small-target subset, the gain reaches +2.2 percentage points, and cross-domain evaluation on FLS and MD-FLS confirms the generalization of the proposed architecture. The code is publicly available at https://github.com/IDontKnowAAA/MambaDSF.

2605.24926 2026-05-26 cs.AI

Energy Shields for Fairness

公平性能量护盾

Filip Cano, Thomas A. Henzinger, Konstantin Kueffner

AI总结提出一种受物理学启发的轻量级自适应控制器——能量护盾，通过概率性干预平滑地保证运行时公平性，并首次同时提供短期安全性和长期活性保证。

DOI: 10.1145/3805689.3806807

AI中文摘要

运行时公平性不是一个一次性约束，而是一个在决策序列上评估的动态属性。为了确保运行时公平性，必须考虑过去的决策，这是传统静态分类器所忽略的信息。传统的公平性护盾通过确定性干预来强制执行运行时公平性，每当决策序列违反运行公平性度量的目标时，就会突然干预。这激发了我们主要的概念贡献：能量护盾。能量护盾是一种新颖的、轻量级的自适应控制器，它监控决策序列并概率性地干预，通过利用受物理学启发的能量函数将序列推向公平性，从而平滑地确保运行时公平性：决策越不公平，推动力就越强。这使得能量护盾成为第一个同时提供短期安全性和长期活性保证的公平性护盾。安全性确保运行公平性度量以高概率保持在运行目标区间内，而活性确保公平性度量的极限位于极限目标区间内。直观地说，短期指定了容忍的公平性值，长期指定了期望的公平性值。我们还提供了一种合成程序，用于为给定的目标规范构建最小侵入性的能量护盾，并通过实验证明其效率。我们通过短期和长期公平性的视角，将我们的能量护盾与现有的公平性护盾进行了评估。

英文摘要

Runtime fairness is not a one-time constraint but a dynamic property evaluated over a sequence of decisions. To ensure fairness at runtime, it is necessary to account for past decisions, information neglected by conventional, static classifiers. Traditional fairness shields enforce runtime fairness abruptly, by intervening \emph{deterministically} whenever a sequence of decisions violates the target for a running fairness measure. This motivates our \emph{main conceptual contribution: \textbf{energy shields}.} An energy shield is a novel, lightweight, adaptive controller that monitors a sequence of decisions and intervenes \emph{probabilistically} to ensure runtime fairness smoothly, by utilizing physics-inspired energy functions to nudge the sequence toward fairness: the more unfair the decisions, the stronger the nudging force becomes. This makes energy shields the \emph{\textbf{first}} fairness shields to provide both \emph{short-term safety and long-term liveness guarantees}. Safety ensures that the running fairness measure stays within a running target interval with high probability, and liveness ensures that the limit of the fairness measure lies within the limit target interval. Intuitively, the short-term specifies the tolerated fairness values and the long-term specifies the desired fairness values. We also provide a synthesis procedure for constructing the least intrusive energy shield for a given target specification, and demonstrate its efficiency experimentally. We evaluate our energy shields against existing fairness shields through the lens of short- and long-term fairness.

2605.24924 2026-05-26 cs.RO

Dynamic Neural Koopman Distillation for Real-Time Robot Control Using Diffusion Models

动态神经Koopman蒸馏：基于扩散模型的实时机器人控制

Lei Zheng, Peiqi Yu, Zengqi Peng, Changliu Liu, Armin Lederer

AI总结提出动态神经Koopman蒸馏框架，将多步扩散推理蒸馏为单步前向传递，通过因子化动态Koopman层保留多模态表达能力，在D4RL MuJoCo和物理机器人上实现毫秒级延迟的闭环控制。

Comments 8 pages, 5 figures

AI中文摘要

扩散模型在生成多样化和多模态轨迹用于机器人规划方面表现出色，但其迭代去噪过程引入了与高频闭环控制不兼容的延迟。为了解决这个问题，我们提出了动态神经Koopman蒸馏，这是一个将多步扩散推理蒸馏为单步前向传递的框架，同时保留了教师模型的多模态表达能力。具体来说，我们引入了一个因子化动态Koopman层，通过具有状态依赖模态增益的因子化潜在转移来建模去噪过程。我们在标准D4RL MuJoCo运动基准测试和一个物理Kinova机械臂上评估了所提出的方法，并与单步基线进行了比较。结果表明，我们的方法在报告的运动任务上显著优于现有的单步蒸馏方法，并且与教师策略相比，将推理延迟降低到毫秒级别。硬件实验进一步证明，我们的方法能够在保持任务成功和相当准确性的同时，实现平滑且快速的闭环执行。项目页面可在 https://fdkoopman.github.io/ 获取。

英文摘要

Diffusion models excel at generating diverse and multimodal trajectories for robotic planning, yet their iterative denoising process introduces latency that is incompatible with high-frequency closed-loop control. To address this problem, we propose Dynamic Neural Koopman Distillation, a framework that distills multistep diffusion inference into a single forward pass while retaining the multimodal expressivity of the teacher model. Specifically, we introduce a Factorized Dynamic Koopman layer that models the denoising process through a factorized latent transition with state-dependent modal gains. We evaluate the proposed method on standard D4RL MuJoCo locomotion benchmarks and a physical Kinova manipulator, comparing against one-step baselines. The results show that our method significantly outperforms existing one-step distillation approaches on the reported locomotion tasks, and reduces the inference latency to the millisecond regime compared with the teacher policy. Hardware experiments further demonstrate that our method enables smooth and fast closed-loop execution while maintaining task success and comparable accuracy. A project page is available at https://fdkoopman.github.io/.

2605.24922 2026-05-26 cs.RO

MuJoCoUni: Persistent Batched Runtime Primitives for MuJoCo

MuJoCoUni：MuJoCo的持久化批处理运行时原语

Yufei Jia, Junzhe Wu

AI总结提出MuJoCoUni，一个用于在线机器人学习和批处理物理评估的MuJoCo下游发行版，通过BatchEnvPool提供有状态环境执行的运行时原语，支持高吞吐并行执行并保持上游语义。

Comments Technical report

AI中文摘要

我们提出MuJoCoUni，一个用于在线机器人学习和批处理物理评估的MuJoCo下游发行版。除了上游mujoco.rollout已经提供的开环批处理轨迹生成外，MuJoCoUni还提供了用于有状态环境执行的运行时原语。目标工作负载需要高吞吐并行执行，同时保留上游CPU MuJoCo在模型、传感器、接触和约束方面的语义。其核心对象BatchEnvPool是一个C++/pybind11执行器，拥有每个环境的mjModel副本、每个线程的mjData工作线程以及一个内部线程池。它提供仅最终状态的短步进、稀疏重置、重置生命周期域随机化、不推进动力学的批处理传感器前向评估，以及批处理雅可比矩阵和高度场查询。该实现仅限于Python绑定层；MuJoCo的求解器、接触模型、积分器和核心源代码树保留上游语义。本报告描述了BatchEnvPool API、实现边界、与rollout的关系，以及随开源mujoco-uni包一起提供的验证和基准测试脚本，该包可通过\texttt{pip install mujoco-uni}安装。

英文摘要

We present MuJoCoUni, a downstream MuJoCo distribution for online robot learning and batched physics evaluation. Alongside the open-loop batched trajectory generation already provided by upstream mujoco.rollout, MuJoCoUni supplies runtime primitives for stateful environment execution. The target workloads need high-throughput parallel execution while retaining upstream CPU MuJoCo semantics for models, sensors, contact, and constraints. Its core object, BatchEnvPool, is a C++/pybind11 executor that owns per-environment mjModel copies, per-thread mjData workers, and an internal thread pool. It provides final-state-only short stepping, sparse reset, reset-lifecycle domain randomization, batched sensor forward evaluation without advancing dynamics, and batched Jacobian and height-field queries. The implementation is confined to the Python binding layer; MuJoCo's solver, contact model, integrator, and core source tree retain upstream semantics. This report describes the BatchEnvPool API, implementation boundary, relationship to rollout, and the validation and benchmark scripts shipped with the open-source mujoco-uni package, which is installed with \texttt{pip install mujoco-uni}.

2605.24921 2026-05-26 cs.LG

BandVQ: Band-Wise Vector-Quantized EEG Foundation Model

BandVQ: 分带向量量化的脑电图基础模型

Jamiyan Sukhbaatar, Satoshi Imamura, Toshihisa Tanaka

AI总结针对脑电图基础模型中频率特异性活动表征不足的问题，提出BandVQ模型，通过分带VQ-VAE分词器和共享Transformer编码器，在71个公共数据集上预训练，并在六个分类任务上取得领先性能。

Comments 15 pages, 1 figure

详情

AI中文摘要

脑电图（EEG）基础建模的一个核心挑战是学习跨不同任务、导联、参考和频谱特征的记录的可迁移表示。现有的掩码建模方法通常依赖于宽带连续块或单一离散表示，这可能无法充分表征频率特异性活动。本文提出BandVQ，一种分带向量量化的EEG基础模型，它将EEG分解为delta、theta、alpha、beta和gamma频带，为每个频带训练独立的VQ-VAE分词器，并在生成的离散VQ码索引上预训练一个共享的Transformer编码器。编码器使用掩码码元、量化绝对对数功率元、通道和时间嵌入，以及表示参考、频带、任务族和阶段的元数据前缀元。还引入了基于区域的掩码，以减少空间相邻电极的平凡重建。该模型在71个公共EEG语料库上进行预训练，涵盖超过9200名受试者和357,000单通道小时，并在六个独立于受试者的分类数据集上进行评估。在当前评估设置下，所提模型实现了强大的迁移性能，在三个认知任务上取得了最高报告结果，在三个运动想象任务上取得了有竞争力的性能。

英文摘要

A central challenge in electroencephalography (EEG) foundation modeling is learning transferable representations across recordings with diverse tasks, montages, references, and spectral characteristics. Existing masked modeling approaches often rely on broadband continuous patches or a single discrete representation, which may underrepresent frequency-specific activity. This paper proposes BandVQ, a band-wise vector-quantized EEG foundation model that decomposes EEG into delta, theta, alpha, beta, and gamma bands, trains an independent VQ-VAE tokenizer for each band, and pretrains a shared Transformer encoder on the resulting discrete VQ code indices. The encoder uses masked code tokens, quantized absolute log-power tokens, channel and temporal embeddings, and metadata prefix tokens representing reference, band, task family, and phase. Region-based masking is also introduced to reduce the trivial reconstruction of spatially adjacent electrodes. The model is pretrained on 71 public EEG corpora comprising over 9,200 subjects and 357,000 single-channel hours and evaluated on six subject-independent classification datasets. Under the current evaluation setting, the proposed model achieves strong transfer performance, with the highest reported results on three cognitive tasks and competitive performance on three motor imagery tasks.
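The first stage, splitting a signal into the five classical EEG bands before each band gets its own tokenizer, can be sketched with a plain discrete Fourier transform. This is only an illustration: the band edges in `BANDS` are conventional values assumed here (the abstract does not give them), the DFT masking stands in for whatever filters the authors use, and `split_bands` is a hypothetical name. A practical pipeline would more likely use proper band-pass filters and an FFT.

```python
import cmath
import math

# Assumed band edges in Hz; the abstract does not specify them.
BANDS = {
    "delta": (0.5, 4.0),
    "theta": (4.0, 8.0),
    "alpha": (8.0, 13.0),
    "beta": (13.0, 30.0),
    "gamma": (30.0, 45.0),
}

def dft(x):
    """Naive O(n^2) discrete Fourier transform, adequate for short windows."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def band_component(x, fs, lo, hi):
    """Keep only DFT bins whose frequency lies in [lo, hi) Hz, then invert."""
    n = len(x)
    spectrum = dft(x)
    kept = []
    for k, coeff in enumerate(spectrum):
        freq = min(k, n - k) * fs / n  # fold negative-frequency bins
        kept.append(coeff if lo <= freq < hi else 0j)
    return [sum(kept[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def split_bands(x, fs):
    """Return one time-domain signal per band for a single channel."""
    return {name: band_component(x, fs, lo, hi) for name, (lo, hi) in BANDS.items()}
```

Each returned band signal has the same length as the input, so a separate tokenizer per band would see an aligned sequence for every channel.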

2605.24920 2026-05-26 cs.LG cs.AI stat.ML

Quaternion Self-Attention with Shared Scores

共享分数的四元数自注意力

Shogo Yamauchi, Tohru Nitta, Hideaki Tamori

AI总结提出一种共享分数四元数自注意力机制，通过四元数内积计算单一实值分数并共享注意力分布，在保持性能的同时大幅降低计算成本。

Comments 26 pages, 6 figures and 15 tables. Accepted at ICML2026

AI中文摘要

四元数神经网络通过将四个相关特征表示为一个单一实体，实现了参数高效并建模多维依赖关系。然而，现有的四元数自注意力计算每个分量的分数并对每个分量应用独立的softmax操作，这增加了计算成本并允许注意力分布在分量间发散。我们提出了一种共享分数的四元数自注意力机制，该机制使用四元数内积计算单一实值分数，并在所有分量上应用共享的注意力分布。这将分数计算乘法减少了75%，并将softmax操作次数从四次减少到一次。我们证明，当查询和键由诱导分量预混合的四元数线性投影产生时，分量级分数和共享分数位于相同的交互子空间中，表明独立的分量级注意力主要重新参数化相同的交互，而不是扩展特征交互空间。在语音增强中，我们的方法在GPU上将推理时间减少了高达44.3%，在CPU上减少了58.1%，同时保持了质量，并且在视觉和自然语言处理中呈现一致的趋势。

英文摘要

Quaternion neural networks are parameter-efficient and model multidimensional dependencies by representing four related features as a single entity. However, existing quaternion self-attention computes component-wise scores and applies independent softmax operations to each component, which increases the computational cost and allows attention distributions to diverge across components. We propose a shared-score quaternion self-attention mechanism that computes a single real-valued score using the quaternion inner product and applies a shared attention distribution across all components. This reduces score-computation multiplications by 75% and the number of softmax operations from four to one. We prove that, when queries and keys are produced by quaternion linear projections that induce component pre-mixing, the component-wise and shared scores lie in the same interaction subspace, indicating that independent component-wise attention primarily re-parameterizes the same interactions rather than expanding the feature interaction space. In speech enhancement, our method reduces inference time by up to 44.3% on a GPU and 58.1% on a CPU while maintaining quality, with consistent trends across vision and natural language processing.
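A shared-score attention step can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: quaternions are plain `(r, i, j, k)` tuples, the `1 / sqrt(4 * channels)` scaling is an assumption, and `shared_score_attention` is a hypothetical name. One real-valued score per query-key pair feeds a single softmax, and the resulting weights are applied to all four components of the values.

```python
import math

def quat_inner(p, q):
    """Real-valued inner product of two quaternions given as (r, i, j, k)."""
    return sum(a * b for a, b in zip(p, q))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def shared_score_attention(queries, keys, values):
    """queries, keys, values: [tokens][channels] lists of quaternion 4-tuples."""
    n_channels = len(queries[0])
    scale = 1.0 / math.sqrt(4 * n_channels)  # assumed scaling over real dimensions
    outputs = []
    for q_tok in queries:
        scores = [scale * sum(quat_inner(q, k) for q, k in zip(q_tok, k_tok))
                  for k_tok in keys]
        weights = softmax(scores)  # one softmax shared by all four components
        out_tok = []
        for c in range(len(values[0])):
            out_tok.append(tuple(
                sum(w * v_tok[c][comp] for w, v_tok in zip(weights, values))
                for comp in range(4)))
        outputs.append(out_tok)
    return outputs
```

The inner product costs 4 real multiplications per quaternion pair. If the component-wise baseline forms a full Hamilton product (16 real multiplications, an assumption about the baseline), that difference would be consistent with the 75% reduction quoted in the abstract.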

2605.24919 2026-05-26 cs.CL

MultiHaluDet: Multilingual Hallucination Detection via LLM Hidden State Probing

MultiHaluDet: 通过LLM隐藏状态探测实现多语言幻觉检测

Riasad Alvi, Nurul Labib Sayeedi, Md. Faiyaz Abdullah Sayeedi

AI总结提出MultiHaluDet框架，通过探测冻结LLM的全隐藏状态轨迹，结合多尺度注意力和自注意力池化的混合架构，以及校准的经典分类器集成，实现跨语言的高精度幻觉检测，在英语基准上达到98.55% AUROC，并展现出对高、中、低资源语言的强泛化能力。

Comments MeLLM @ ACL 2026

AI中文摘要

大型语言模型（LLM）中的幻觉是其可靠部署的关键障碍，这一漏洞在非英语和资源受限的环境中尤为严重。现有的依赖输出置信度启发式或单层内部表示的检测方法，往往无法捕捉跨语言的深层、复杂事实不一致性。为此，我们引入了MultiHaluDet，一种新颖的三阶段堆叠框架，通过探测冻结LLM的全隐藏状态轨迹来检测多语言幻觉，无需特定语言的微调。我们的方法提取跨多个层的序列特征，并通过使用多尺度注意力和自注意力池化的混合架构进行处理。通过生成折叠外嵌入并输入到校准的经典分类器集成中，MultiHaluDet捕捉了事实不一致性的细粒度和粗粒度模式。大量实验表明，我们的框架在Mistral-7B和LLaMA2-7B架构上，在英语HaluEval和TriviaQA基准测试中达到了高达98.55% AUROC的最先进检测性能。关键的是，我们严格评估了框架在高资源（法语）、中资源（孟加拉语）和低资源（阿姆哈拉语）语言上的跨语言泛化能力。MultiHaluDet展现出卓越的表示鲁棒性，始终优于基线，并成功地将幻觉检测能力迁移到类型多样的语言层级中。

英文摘要

Hallucinations in Large Language Models (LLMs) represent a critical barrier to their reliable deployment, a vulnerability heavily exacerbated in non-English and resource-constrained contexts. Existing detection approaches that rely on output confidence heuristics or single-layer internal representations frequently fail to capture deep, complex factual inconsistencies across diverse languages. To address this, we introduce MultiHaluDet, a novel three-stage stacking framework that detects multilingual hallucinations by probing the full hidden state trajectories of frozen LLMs without requiring language-specific fine-tuning. Our method extracts sequential features across multiple layers and processes them via a hybrid architecture using multi-scale attention and self-attention pooling. By generating out-of-fold embeddings that feed into a calibrated classical classifier ensemble, MultiHaluDet captures both fine-grained and coarse-grained patterns of factual inconsistency. Extensive experiments demonstrate that our framework achieves state-of-the-art detection performance, reaching up to 98.55% AUROC on the English HaluEval and TriviaQA benchmarks using Mistral-7B and LLaMA2-7B architectures. Crucially, we rigorously evaluate our framework's cross-lingual generalization across high (French), medium (Bangla), and low-resource (Amharic) languages. MultiHaluDet demonstrates exceptional representational robustness, consistently outperforming baselines and successfully transferring hallucination detection capabilities across typologically diverse linguistic tiers.
