arXivDaily: arXiv daily academic digest, updated Monday through Friday
All subject categories: 3813
2605.18346 2026-05-19 cs.CV cs.AI

Focused Forcing: Content-Aware Per-Frame KV Selection for Efficient Autoregressive Video Diffusion

Peiliang Cai, Evelyn Zhang, Jiacheng Liu, Hao Lin, Ruiqi Zhang, Weile Mo, Yue Ma, Shikang Zheng, Jiehang Huang, Dongrui Liu, Linfeng Zhang

Affiliations: SJTU (Shanghai Jiao Tong University), SDU (Shandong University), HUST (Huazhong University of Science and Technology), UTokyo (University of Tokyo), HKUST (Hong Kong University of Science and Technology), SCUT, Shanghai AI Lab (Shanghai Artificial Intelligence Laboratory)

AI summary: This paper proposes a training-free KV selection method that combines attention scores with diversity scores of historical frames to retain the most relevant and distinctive history, improving the efficiency of autoregressive video diffusion without sacrificing quality.

Abstract

Recent advances in autoregressive video diffusion have enabled sequential and streaming video generation. However, long-horizon generation requires increasingly large KV caches, making efficient compression without sacrificing quality challenging. Existing methods mostly select historical frames based on attention scores, but their context decisions remain coarse. When multiple frames are generated in the same chunk, these methods often apply a shared history selection to the whole chunk, score historical frames solely by attention, and assign head-wise budgets either uniformly or by attention-pattern heuristics rather than explicit head-importance estimation. We show that frames within the same generated chunk can depend on distinct historical frames, that the same historical frame can receive different attention scores as its relative temporal distance to the current frames changes, and that masking different heads induces unequal generation degradation. Motivated by these findings, we propose Focused Forcing, a training-free KV selection method that focuses cached history along both generated-frame and head dimensions. For each generated frame, Focused Forcing preserves the most relevant and distinctive historical frames by combining attention scores with diversity scores of historical frames, while assigning larger budgets to heads with higher estimated importance. Across multiple autoregressive generation paradigms, Focused Forcing achieves up to 1.48× end-to-end acceleration without training, while improving visual quality and text alignment. Our code will be released on GitHub.
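
Illustrative sketch (not the authors' code): the abstract says each generated frame keeps historical frames chosen by combining attention scores with diversity scores, but it does not give the formula. The Python below assumes one plausible combination, a greedy rule that rewards attention mass and penalizes similarity to frames already kept. The names `select_history`, `feats`, and `lam` are hypothetical, and the per-head budget allocation mentioned in the abstract is not modeled.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_history(attn, feats, budget, lam=0.5):
    """Greedily pick up to `budget` historical frames for ONE generated frame.

    attn[i]  : attention mass the generated frame puts on historical frame i
    feats[i] : feature vector summarizing historical frame i (hypothetical)
    lam      : relevance/diversity trade-off in [0, 1] (hypothetical)
    """
    chosen = []
    candidates = set(range(len(attn)))
    while candidates and len(chosen) < budget:

        def score(i):
            # Redundancy: highest similarity to any frame already kept.
            redundancy = max(
                (cosine(feats[i], feats[j]) for j in chosen), default=0.0
            )
            return lam * attn[i] - (1.0 - lam) * redundancy

        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(candidates), key=score)
        chosen.append(best)
        candidates.remove(best)
    return chosen
```

With `attn = [0.5, 0.45, 0.05]` and the first two frames sharing identical features, a budget of 2 at `lam=0.5` keeps frames 0 and 2 rather than the near-duplicate pair 0 and 1, which is the behavior a diversity term is meant to produce. Running the selection once per generated frame, instead of once per chunk, is what makes it per-frame.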

2605.18337 2026-05-19 cs.CL

Infini-News: Efficiently Queryable Access to 1.3 Billion Processed Common Crawl News Articles

Ruggero Marino Lazzaroni, Jana Lasser, Kirill Solovev

Affiliations: Common Crawl Foundation

AI summary: This paper presents Infini-News, which builds full-text indexes and retrieval tooling to give researchers efficient access to 1.3 billion Common Crawl news articles, including text cleaning, multilingual language detection, geographic attribution, and fast retrieval, lowering the barrier to longitudinal, cross-national media research.

Abstract

Large-scale news corpora support a wide range of research in Computational Social Science and NLP, yet access remains constrained: commercial archives impose prohibitive costs and licensing restrictions, while open alternatives like Common Crawl's CC-News require terabyte-scale storage and computationally intensive processing. We present Infini-News, a retrieval toolkit and index for the entire CC-News archive from August 2016 to the latest available snapshot. Our contributions are threefold. First, we extract and clean the text and parse the structured metadata of over 1.35B articles. Second, we enrich the corpus with language detection using three frontier language classifiers (GlotLID, lingua, and CommonLingua), and with multi-source geographic attribution that resolves a country of origin for 83.4% of articles across 222 countries. Third, we construct Infini-gram indexes: suffix-array structures that let researchers search the full archive for arbitrary text patterns in sub-second time. Together, these resources lower the barrier to longitudinal, cross-national media research.
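
Illustrative sketch (a toy, not the Infini-gram implementation): the abstract describes the indexes as suffix-array structures that answer arbitrary pattern queries quickly. The Python below shows the underlying idea on a short string. The construction is deliberately naive, and the function names are made up for this example.

```python
def build_suffix_array(text):
    """Return suffix start positions sorted by the suffix they begin.

    Naive construction that sorts full suffixes; fine for a toy string,
    far too slow for a real corpus.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, sa, pattern):
    """Count occurrences of `pattern` with two binary searches over `sa`."""
    m = len(pattern)

    def bound(upper):
        # Lower bound (upper=False): first suffix whose m-char prefix is
        # >= pattern. Upper bound (upper=True): first whose prefix is > it.
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + m]
            if prefix < pattern or (upper and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    # All matching suffixes sit in one contiguous block of the array.
    return bound(True) - bound(False)
```

For `text = "banana"`, the array is `[5, 3, 1, 0, 4, 2]` and `count_occurrences(text, sa, "ana")` returns 2. Because every match lies in one contiguous block of the sorted suffixes, a query needs only a logarithmic number of comparisons in the corpus size, which is what makes sub-second search over a very large archive plausible.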

2605.18334 2026-05-19 cs.CV cs.GR

3D Skew Gaussian Splatting with Any Camera Trajectory Visualization Engine

Beizhen Zhao, Yifan Zhou, Gaochao Song, Ziran Yin, Hao Wang

Affiliations: The Hong Kong University of Science and Technology (Guangzhou), Zhejiang University, The University of Hong Kong

AI summary: This paper proposes 3D Skew Gaussian Splatting (3DSGS), which introduces skew Gaussian distributions to improve the structural fidelity and compactness of 3D Gaussian Splatting. It addresses the difficulty symmetric Gaussians have in capturing shape and color discontinuities, improving visualization quality and the accuracy of spatial data exploration.

Comments: 16 pages

Abstract

While 3D Gaussian Splatting (3DGS) has revolutionized real-time photorealistic view synthesis, its fundamental reliance on symmetric Gaussian distributions introduces visual artifacts that hinder accurate spatial data exploration. Specifically, symmetric kernels struggle to capture shape and color discontinuities, which cause blurriness and primitive redundancy that mislead human perception during visual analysis. To address these visualization barriers, we introduce 3D Skew Gaussian Splatting (3DSGS), a novel framework that significantly enhances the structural fidelity and compactness of explicit scene representations. Our key insight lies in extending the standard primitive to a general Skew Gaussian counterpart. This generalized primitive inherits the highly efficient rasterization properties of standard Gaussians while gaining intrinsic asymmetric modeling capabilities. We couple this with an enhanced opacity representation to better handle complex transparency, alongside a depth-aware densification strategy that intelligently manages primitive allocation. Furthermore, to make these advancements actionable for real-world visual analytics, we re-derive the CUDA rasterization pipeline to universally support both symmetric and skew Gaussians, integrating it into a decoupled, free-camera interactive visualization engine. Extensive experiments demonstrate that 3DSGS achieves superior rendering quality and structural compactness, particularly in regions with intricate details, while maintaining the real-time frame rates necessary for fluid interactive exploration. Supplementary derivations and visual results are available at https://3d-skew-gs.github.io/.

2605.18331 2026-05-19 cs.LG

Prune, Update and Trim: Robust Structured Pruning for Large Language Models

Diego Coello de Portugal Mecke, Tom Hanika, Lars Schmidt-Thieme

Affiliations: ISMLL & DARC VWFS, University of Hildesheim; ISMLL, University of Hildesheim

AI summary: This paper proposes Putri, which improves post-training pruning of large language models by updating the unpruned weights, pruning FFN layers sequentially, and removing individual attention heads, enabling efficient pruning at extreme sparsity ratios.

Abstract

Large Language Models (LLMs) have experienced significant growth and development in recent years. However, performing inference on LLMs remains costly, especially for long-context inference or on resource-constrained devices. This motivates the development of new post-training pruning (PTP) methods. These methods reduce LLMs' requirements by removing a substantial part of the model's parameters. The discarded weights are selected according to their impact on the model's performance. Current PTP methods prune models by removing the less informative hidden nodes from the FFN layers and the least important attention layers. We propose Putri, a PTP method that introduces three changes to the state of the art. First, we update the unpruned weights of the FFN to compensate for the introduced pruning error. Second, the FFN layers are pruned sequentially, taking into account the updates made to the previous layers. Third, instead of removing full attention layers, we remove individual attention heads. We extend this method so that it can also handle Grouped-Query Attention. In summary, Putri is a structured pruning method that remains simple while showing SOTA performance. Pruning experiments on multiple models, across a wide range of sparsity ratios and on different datasets, validate the generality of Putri. Notably, we demonstrate that, unlike previous methods, Putri can prune LLMs at extreme sparsity ratios. The code is available at: https://github.com/Coello-dev/Putri.

2605.18328 2026-05-19 cs.CV

CineMatte: Background Matting for Virtual Production and Beyond

Yuanjian He, Chen Zhang, Fasheng Chen, Jiangbo Cao

Affiliations: Online Video Business Unit, Tencent PCG, Shenzhen, China

AI summary: This paper proposes CineMatte, a robust background matting framework for virtual production and beyond. It uses a cross-attention-conditioned design in which a frozen, weight-shared DINOv3 Vision Transformer encodes the input frame and the captured background, and a cross-attention module predicts the foreground, preserving pretrained semantics and improving robustness to background shifts. It also introduces CineMatte-4K, a dataset of 4K HDR images and video described as the first non-synthetic dataset for virtual production matting.

Abstract

LED Virtual Production (VP) uses large LED volumes to render backgrounds in real time, enabling in-camera visual effects but making post-shot changes labor-intensive. We address this with CineMatte, a robust background matting framework for VP and beyond. CineMatte employs a cross-attention-conditioned design. Instead of concatenating the background with the input, CineMatte employs a Siamese, frozen DINOv3 Vision Transformer with shared weights to encode the input frame and the captured background separately. A cross-attention module compares the two streams to predict the foreground, preserving pretrained semantics and improving robustness to background shifts. Previous ViT-based matting models use a parallel convolutional "detail branch" to recover fine details, which can cause boundary artifacts in real-world samples due to semantic misalignment with the backbone. We instead replace it with a pretrained, image-guided feature upsampler, which largely mitigates the problem. We also introduce CineMatte-4K, a 4K HDR image-video dataset captured on a professional LED VP stage. To the best of our knowledge, the image subset is the first dataset for VP matting and is non-synthetic, obtained via green-screen insertion; the video subset includes camera motion with tracked trajectories so that arbitrary backgrounds can be rendered later with correct parallax. Across CineMatte-4K and public benchmarks (VideoMatte240K, YouTubeMatte), CineMatte not only excels in VP but also generalizes robustly to real-world footage.

2605.18327 2026-05-19 cs.AI

Causely: A Causal Intelligence Layer for Enterprise AI A Benchmark Study on SRE and Reliability Workflows

Dhairya Dalal, Endre Sara, Ben Yemini, Christine Miller, Shmuel Kliger

Affiliations: Causely

AI summary: This paper proposes Causely, a causal intelligence layer for enterprise AI that maintains a structured representation of environment topology, attribute dependencies, and causal relationships, giving AI agents the semantic and causal grounding they need to diagnose issues, evaluate impact, and act safely in production. Its value proposition is evaluated in a benchmark study with injected faults in a 24-microservice OpenTelemetry demo application under controlled conditions.

Abstract

AI agents deployed into SRE workflows currently derive their understanding of environment state from raw observability telemetry at query time, paying a semantic-interpretation tax in tokens, latency, and inferential reliability. We propose Causely, a causal intelligence layer that maintains a structured representation of environment topology, attribute dependencies, and causal relationships that are anchored to an ontological representation of the managed environment. Causely transforms raw telemetry into a live, queryable model providing the semantic and causal foundation AI agents require to diagnose, evaluate impact, and act safely in production. We evaluate this value proposition through a benchmark study conducted in a controlled setting with injected faults in a 24-microservice OpenTelemetry demo application. Our experiments compare four agent configurations (Claude Code, OpenAI Codex, and HolmesGPT with Sonnet and Gemini backends). Experiments are run with and without access to Causely under two scenarios: an active incident and a healthy baseline. On the active-fault scenario, causal grounding reduces mean time-to-diagnosis by 63%, mean token consumption by 60%, and mean tool-call count by 78%, compressing the investigation footprint by 4.8× and lowering direct API cost per run by 57%; root-cause-diagnosis accuracy rises from 75% to 100%.

2605.18320 2026-05-19 cs.LG cs.AI

ISEP: Implicit Support Expansion for Offline Reinforcement Learning via Stochastic Policy Optimization

Yifei Chen, Shaoqin Zhu, Xiaoqiang Ji

Affiliations: The Chinese University of Hong Kong, Shenzhen Longgang

AI summary: This paper proposes ISEP, which implicitly expands the action support in offline reinforcement learning through stochastic policy optimization, addressing the difficulty conventional methods have in discovering optimal behaviors under safety constraints. Its core contribution is to make policy improvement more navigable through value-function interpolation and a stochastic action selection strategy.

Abstract

Offline reinforcement learning methods typically enforce strict constraints to ensure safety; yet this rigidity often prevents the discovery of optimal behaviors outside the immediate support of the behavior policy. To address this, we propose Implicit Support Expansion via stochastic Policy optimization (ISEP), which leverages a value function interpolated between in-distribution data and policy samples to implicitly expand the feasible action support. This mechanism "densifies" high-reward regions, creating a navigable path for policy improvement while theoretically guaranteeing bounded value error. However, optimizing against this expanded support creates a multimodal landscape where standard deterministic averaging leads to mode collapse and invalid actions. ISEP mitigates this via a stochastic action selection strategy, optimizing the policy by stochastically alternating between conservative cloning and optimistic expansion signals. We instantiate this framework as ISEP-FM, using Conditional Flow Matching with classifier-free guidance to effectively capture the interpolated value signal.

2605.18319 2026-05-19 cs.LG cs.DM math.AG math.CO

The Symmetries of Three-Layer ReLU Networks

Johanna Marie Gegenfurtner, Moritz Grillo, Guido Montúfar

Affiliations: Technical University of Denmark; Max Planck Institute for Mathematics in the Sciences; UCLA and MPI MiS

AI summary: This paper develops a framework for analyzing parameter symmetries of three-layer ReLU networks, gives a complete characterization of the generic parameter fibers for three-layer bottleneck architectures, and presents a polynomial-time algorithm for deciding whether two parameters are functionally equivalent.

Abstract

We develop a framework for analyzing parameter symmetries in deep ReLU networks and obtain a complete characterization of the generic parameter fibers for three-layer bottleneck architectures. Our approach provides explicit semi-algebraic descriptions of these fibers and yields a polynomial time algorithm for deciding functional equivalence of two parameters. The symmetries include discrete and continuous transformations arising from layer composition, and depend on whether deeper layers hide or preserve geometric structure from preceding layers. Finally, we show that some of these symmetries induce local conservation laws along gradient flow, while others do not.
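
Illustrative sketch (standard background, not the paper's construction): two classical parameter symmetries of ReLU networks are permuting hidden units, which is discrete, and positively rescaling a unit's incoming and outgoing weights, which is continuous. The Python below checks both on a small one-hidden-layer network, chosen for brevity. The paper's three-layer characterization covers more than these two, and all function names here are made up for the example.

```python
def relu(z):
    return z if z > 0 else 0.0


def forward(W1, b1, w2, x):
    """Scalar output of a one-hidden-layer ReLU net: sum_j w2[j] * relu(W1[j].x + b1[j])."""
    return sum(
        w2[j] * relu(sum(w * xi for w, xi in zip(W1[j], x)) + b1[j])
        for j in range(len(w2))
    )


def rescale_unit(W1, b1, w2, j, c):
    """Scale unit j's incoming weights and bias by c > 0, its outgoing weight by 1/c.

    Works because relu(c * z) == c * relu(z) for every c > 0.
    """
    if c <= 0:
        raise ValueError("c must be positive")
    W1 = [row[:] for row in W1]  # copy so the caller's parameters are untouched
    b1, w2 = b1[:], w2[:]
    W1[j] = [c * w for w in W1[j]]
    b1[j] *= c
    w2[j] /= c
    return W1, b1, w2


def permute_units(W1, b1, w2, perm):
    """Reorder hidden units; the sum over units is unchanged."""
    return [W1[k] for k in perm], [b1[k] for k in perm], [w2[k] for k in perm]
```

Both transformations change the parameter vector but not the function it computes, so the transformed and original parameters lie in the same fiber. Deciding whether two arbitrary parameter vectors are related in this way is the functional-equivalence question the abstract refers to.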

2605.18316 2026-05-19 cs.LG cs.GR

Dynamic Elliptical Graph Factor Models via Riemannian Optimization with Geodesic Temporal Regularization

Chuansen Peng, Xiaojing Shen

Affiliations: School of Mathematics, Sichuan University

AI summary: This paper proposes Degfm, a dynamic estimation method on a Riemannian manifold that combines a low-rank-plus-diagonal structure with an elliptical graph factor model. It addresses temporal coherence and preservation of Riemannian geometry when inferring time-varying graph structures, and is validated on synthetic and real-world datasets.

Abstract

Inferring time-varying graph structures from high-dimensional nodal observations is a fundamental problem arising in neuroscience, finance, climatology, and beyond. Two intrinsic challenges govern this problem: maintaining the temporal coherence of the latent graph across successive observation windows, and respecting the intrinsic Riemannian geometry of the symmetric positive definite manifold on which precision matrices naturally reside, a curved space whose geodesic structure departs fundamentally from that of the ambient Euclidean space. In this paper we propose dynamic estimation on the Grassmann manifold with a factor model (Degfm), a novel algorithm that jointly addresses both challenges. We model the time-varying precision matrix sequence as a low-rank-plus-diagonal (LRaD) structure governed by a latent elliptical graph factor model, which drastically reduces the effective parameter count and enables reliable estimation in the challenging small-sample regime. Temporal coherence is enforced through a Riemannian geodesic penalty defined on the Grassmann manifold, ensuring that the estimated graph trajectory is smooth with respect to the intrinsic geometry rather than the ambient Euclidean space. To solve the resulting non-convex optimization problem over Grassmann-manifold-valued sequences subject to the LRaD constraint, we derive an efficient Riemannian gradient descent algorithm that respects the manifold structure at every iterate and rigorously establish its convergence to a stationary point. Extensive experiments on both synthetic benchmarks and real-world datasets demonstrate that Degfm consistently outperforms state-of-the-art baselines across all evaluation metrics, confirming the practical effectiveness of the proposed framework.

2605.18309 2026-05-19 cs.LG cs.AI

Alignment Dynamics in LLM Fine-Tuning

Yuhan Huang, Huanran Chen, Yinpeng Dong

Affiliations: Shanghai Qi Zhi Institute & University of Tokyo; College of AI, Tsinghua University

AI summary: This paper studies the dynamics of alignment during LLM fine-tuning. It introduces a tractable alignment score and derives its closed-form update during fine-tuning, yielding a unified framework for alignment dynamics. By decomposing the alignment update into two competing components, a Rebound Force and a Driving Force, it explains why prior alignment can be reversed by later fine-tuning and why narrower posterior structure strengthens that reversal. The framework also predicts a "Rehearsal Priming Effect": prior alignment leaves a latent posterior imprint that amplifies the Driving Force upon re-exposure, leading to faster re-alignment.

Abstract

Although Large Language Models (LLMs) achieve strong alignment through supervised fine-tuning and reinforcement learning from human feedback, the alignment is often fragile under subsequent fine-tuning. Existing explanations either attribute alignment fragility to gradient geometry or characterize it as a distributional shift in model outputs, yet few provide a unified account that bridges parameter-space learning dynamics with function-space alignment behavior during fine-tuning. In this work, we introduce a tractable alignment score and derive its closed-form update during fine-tuning, yielding a unified framework for alignment dynamics. Our analysis decomposes alignment updates into two competing components: a Rebound Force, governed jointly by the current alignment state and the narrowness of model distribution, and a Driving Force, determined by how the training distribution aligns with outcome-conditioned posteriors over aligned and non-aligned completions. This decomposition explains why prior alignment can be reversed by later fine-tuning and why narrower posterior structure strengthens such reversal. Moreover, our framework predicts a Rehearsal Priming Effect: prior alignment leaves a latent posterior imprint that amplifies the effective Driving Force upon re-exposure, leading to faster re-alignment. We validate these predictions across safety alignment, emergent misalignment, and sentiment settings, demonstrating consistent alignment reversal and accelerated re-alignment under re-exposure. In addition, controlled experiments in safety alignment confirm the predicted dependence of rebound strength on posterior narrowness. Together, these results provide a unified dynamical perspective on how alignment is disrupted and reactivated during LLM fine-tuning.

2605.18299 2026-05-19 cs.AI cs.CL cs.IR

SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning

Yufei Ma, Zihan Liang, Ben Chen, Zhipeng Qian, Huangyu Dai, Lingtao Mao, Xuxin Zhang, Chenyi Lei, Wenwu Ou

Affiliations: Kuaishou Technology

AI summary: This paper proposes SD-Search, a search-augmented reasoning method based on on-policy hindsight self-distillation, in which the policy itself generates fine-grained supervision signals without an external teacher model or additional annotations.

Abstract

Search-augmented reasoning agents interleave internal reasoning with calls to an external retriever, and their performance relies on the quality of each issued query. However, under outcome-reward reinforcement learning, every search decision in a rollout shares the same trajectory-level reward, leaving individual queries without step-specific credit. Recent process-supervision approaches address this gap by drawing step-level signals from outside the policy, relying either on a much larger teacher model or on sub-question annotations produced by a stronger external system. In contrast, we propose SD-Search, which derives step-level supervision from the policy itself through on-policy hindsight self-distillation, requiring neither an external teacher nor additional annotations. In SD-Search, a single model plays two roles that differ only in conditioning: a student that sees only the context available at inference time, and a teacher that additionally conditions on a compact hindsight block summarizing the search queries and final outcomes of a group of rollouts sampled from the same question. Since the teacher knows how each rollout unfolded and which ones succeeded, its query distribution implicitly marks which decisions were worth making, and the student is trained to recover this behavior by minimizing the token-level Jensen-Shannon divergence to the teacher at search-query positions. This layers a dense, step-level signal on top of GRPO's coarse trajectory reward. Crucially, this signal is produced by the policy itself within the standard RL training loop, without external model inference, an auxiliary annotation pipeline, or an additional training stage.
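
Illustrative sketch (not the authors' training code): the abstract's step-level signal is a token-level Jensen-Shannon divergence between student and teacher, applied only at search-query positions. The Python below computes that quantity for explicit probability vectors. Representing each position as a plain list of probabilities, and the names `query_token_loss` and `is_query_token`, are assumptions made for this example.

```python
import math


def kl(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded above by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def query_token_loss(student, teacher, is_query_token):
    """Mean JS divergence over positions flagged as search-query tokens.

    student[t], teacher[t] : next-token distributions at position t
    is_query_token[t]      : True where position t belongs to a search query
    """
    vals = [
        js_divergence(s, t)
        for s, t, flag in zip(student, teacher, is_query_token)
        if flag
    ]
    return sum(vals) / len(vals) if vals else 0.0
```

The mask is what restricts the loss to search decisions: positions outside a query contribute nothing, so ordinary reasoning tokens are left to the trajectory-level reward. In a real training loop the teacher distribution would normally be treated as a fixed target, which is an assumption here because the abstract does not say.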

2605.18298 2026-05-19 cs.AI cs.HC cs.LG

DARE-EEG: A Foundation Model for Mining Dual-Aligned Representation of EEG

Yang Shao, Peiliang Gong, Qun Dai, Daoqiang Zhang

Affiliations: College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics

AI summary: This paper proposes DARE-EEG, a self-supervised foundation model pre-trained with dual-aligned representation learning so that EEG encoders learn representations invariant to incomplete observations. It uses contrastive learning across masked views together with alignment to momentum-updated complete features for semantic stability, and adapts to heterogeneous electrode configurations and sampling rates through a conv-linear-probing strategy. Experiments show strong performance on EEG benchmarks.

Comments: 22 pages, 10 pages of main text + 12 pages of appendices

Abstract

Foundation models pre-trained through masked reconstruction on large-scale EEG data have emerged as a promising paradigm for learning generalizable neural representations across diverse brain-computer interface applications. However, a critical yet overlooked challenge is that EEG encoders must learn representations invariant to incomplete observations: when different masked views of the same signal have minimal overlap, existing methods fail to constrain them to a consistent latent subspace, leading to degraded transferability. To address this, we propose DARE-EEG, a self-supervised foundation model that explicitly enforces the mask-invariance property through dual-aligned representation learning during pre-training. Specifically, we introduce mask alignment that constrains representations from multiple masked views of the same EEG sample via contrastive learning, complementing anchor alignment that aligns masked representations to momentum-updated complete features for semantic stability. Additionally, we propose conv-linear-probing, a parameter-efficient strategy that adapts pre-trained representations to heterogeneous electrode configurations and sampling rates through decoupled spectro-spatial projections. Extensive experiments across diverse EEG benchmarks demonstrate that DARE-EEG consistently achieves state-of-the-art accuracy while maintaining relatively low parameter complexity and superior cross-dataset portability compared to existing methods. Furthermore, DARE-EEG contributes to effectively discovering and utilizing the rich potential representations in EEG.

2605.18295 2026-05-19 cs.RO

Assessing Localization Technologies for Pedestrian Collision Avoidance

Joshua Varughese, Joseba Gorospe, Novel Certad, Cristina Olaverri-Monreal

Affiliations: Dept. Intelligent Transport Systems, Johannes Kepler University Linz

AI summary: This paper evaluates the localization accuracy of Ultra-Wideband technology and Bluetooth 6.0 for pedestrian collision warning and benchmarks them against Global Navigation Satellite Systems, finding that these technologies can serve as alternatives or complements in certain scenarios and improve situational awareness.

Abstract

Robust pedestrian safety is crucial to the next generation of intelligent transportation systems. Such systems rely on active pedestrian localization and predictive collision alerts. Pedestrian localization can be supported by Ultra-Wideband technology and Bluetooth 6.0, which offer high-precision ranging and low-latency communication, making them promising candidates for vehicular collision warning systems. This paper assesses the localization accuracy of these technologies for pedestrian alerting and benchmarks their performance against Global Navigation Satellite Systems. The experimental evaluation focuses on key performance metrics, including localization accuracy and robustness to environmental conditions. Preliminary results suggest that Ultra-Wideband and Bluetooth 6.0 can serve as viable alternatives or complements to Global Navigation Satellite Systems in certain scenarios, improving situational awareness and enabling timely pedestrian alerts.

2605.18288 2026-05-19 cs.CV

Collision-Resistant Single-Pass Method for Unsupervised Fine-Grained Image Hashing

Anh-Kiet Duong, Petra Gomez-Krämer, Jean-Michel Carozza

Affiliations: GitHub

AI summary: This paper proposes Collision-Resistant Single-Pass Self-Supervised Semantic Hashing (CS3H), which directly optimizes Hamming-space similarity through a single-pass normalized Hamming distance loss to produce well-separated binary representations. It also introduces a collision-sensitive attention module that emphasizes rare, discriminative local patterns to reduce hash collisions and improve fine-grained discrimination.

Comments: 17 pages, accepted to ICIP 2026

Abstract

Unsupervised fine-grained image hashing aims to learn compact binary codes that preserve subtle visual differences among highly similar instances without manual annotations. However, most existing methods neglect collision resistance, leading to identical hash codes for slightly semantically different samples. In this paper, we propose Collision-Resistant Single-Pass Self-Supervised Semantic Hashing (CS3H), a collision-resistant framework that directly optimizes Hamming-space similarity via a single-pass normalized Hamming distance loss to produce well-separated binary representations. We further introduce a collision-sensitive attention module to emphasize rare and discriminative local patterns, reducing hash collisions and improving fine-grained discrimination. Experiments on multiple benchmarks show that CS3H consistently outperforms state-of-the-art methods in retrieval accuracy while achieving superior collision resistance with minimal computational overhead.
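
Illustrative sketch (definitions only, not the CS3H loss): the abstract relies on two quantities, the normalized Hamming distance between binary codes and the notion of a hash collision. The Python below spells both out. The sign-threshold `binarize` step and the pairwise `collision_rate` measure are assumptions chosen for this example, since the abstract does not define them.

```python
def binarize(values):
    """Sign-threshold real-valued outputs into a +1/-1 code (assumed convention)."""
    return [1 if v >= 0 else -1 for v in values]


def normalized_hamming(a, b):
    """Fraction of bit positions where two equal-length codes differ, in [0, 1]."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def collision_rate(codes):
    """Fraction of distinct pairs of items that share exactly the same code."""
    n = len(codes)
    pairs = n * (n - 1) // 2
    if pairs == 0:
        return 0.0
    collisions = sum(
        1 for i in range(n) for j in range(i + 1, n) if codes[i] == codes[j]
    )
    return collisions / pairs
```

For codes with entries in {+1, -1} and length K, the normalized Hamming distance equals (1 - a·b / K) / 2, so it is an affine function of the inner product. That identity is the usual route to optimizing a Hamming-space objective with gradients on relaxed, real-valued codes. Whether CS3H takes this route is not stated in the abstract.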

2605.18287 2026-05-19 cs.CV cs.RO

StableVLA: Towards Robust Vision-Language-Action Models without Extra Data

Yiyang Fu, Chubin Zhang, Shukai Gong, Yufan Deng, Kaiwei Sun, Qiyang Min, Qibin Hou, Yansong Tang, Jianan Wang, Daquan Zhou

Affiliations: Peking University, Tsinghua University, Nanjing University, Nankai University

AI summary: This paper studies the robustness of Vision-Language-Action (VLA) models under unseen real-world visual disturbances and proposes IB-Adapter, a lightweight adapter module grounded in information theory that improves model performance while remaining efficient.

Comments: Accepted by ICML 2026. Code: https://github.com/DAGroup-PKU/HumanNet. Project website: https://dagroup-pku.github.io/StableVLA/

Abstract

It is infeasible to encompass all possible disturbances within the training dataset. This raises a critical question regarding the robustness of Vision-Language-Action (VLA) models when encountering unseen real-world visual disturbances, particularly under imperfect visual conditions. In this work, we conduct a systematic study based on recent state-of-the-art VLA models and reveal a significant performance drop when visual disturbances absent from the training data are introduced. To mitigate this issue, we propose a lightweight adapter module grounded in information theory, termed the Information Bottleneck Adapter (IB-Adapter), which selectively filters potential noise from visual inputs. Without requiring any extra data or augmentation strategies, IB-Adapter consistently improves over the baseline by an average of 30%, while adding fewer than 10M parameters, demonstrating notable efficiency and effectiveness. Furthermore, even with a 14x smaller backbone (0.5B parameters) and no pre-training on the Open X-Embodiment dataset, our model StableVLA achieves robustness competitive with 7B-scale state-of-the-art VLAs. With negligible parameter overhead (<10M), our approach maintains accuracy on long-horizon tasks and surpasses OpenPi under both synthetic and physical visual corruptions.

2605.18281 2026-05-19 cs.LG

Temporal Task Diversity: Inductive Biases Under Non-Stationarity in Synthetic Sequence Modelling

时间任务多样性:非平稳性下的归纳偏置

Afiq Abdillah Effiezal Aswadi, Oliver Britton, Ross Baker, Matthew Farrugia-Roberts

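Affiliations: University of Oxford

AI summary: This study examines how a task distribution that changes over training time affects the inductive biases of deep learning models in synthetic sequence modelling, and finds that temporal task diversity strengthens the bias towards generalisation over memorisation.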

Comments Presented at Technical AI Safety Conference (TAIS), Oxford, May 2026. Code available at https://github.com/matomatical/temporal-task-diversity

详情
AI中文摘要

现代深度学习科学常常假设神经网络从固定的数据分布中学习。然而,许多实际重要的学习问题涉及在训练过程中数据分布发生变化的情况。这种非平稳性如何影响深度学习对具有不同结构、泛化性和安全性属性的模型的归纳偏置?一个研究归纳偏置的有成效的测试平台是在上下文线性回归序列建模中,其中小型变压器根据训练任务分布的多样性表现出显著不同的泛化模式。在本文中,我们探讨了在训练时间多样化任务分布的影响,发现这种时间多样性导致对泛化而非记忆的偏置增加。

英文摘要

Modern deep learning science often assumes that neural networks learn from a fixed data distribution. However, many practically important learning problems involve data distributions that change throughout training. How does such non-stationarity impact the inductive biases of deep learning towards models with different structural, generalisation, and safety properties? A fruitful testbed for studying inductive bias is in-context linear regression sequence modelling, where small transformers display strikingly different generalisation patterns depending on the diversity of the (fixed) training task distribution. In this paper, we explore the effect of diversifying the task distribution across training time, finding that such temporal diversity leads to an increased bias towards generalisation over memorisation.

2605.18263 2026-05-19 cs.CV

RT-Splatting: Joint Reflection-Transmission Modeling with Gaussian Splatting

RT-Splatting:基于高斯点散布的联合反射-透射建模

Ji Shi, Xianghua Ying, Bowei Xing, Ruohao Guo, Wenzhen Yue

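Affiliations: State Key Laboratory of General Artificial Intelligence; School of Intelligence Science and Technology

AI summary: This paper proposes RT-Splatting, which decouples each Gaussian's geometric occupancy from its optical opacity to jointly model the complex reflections and clear transmission of semi-transparent surfaces, achieving high-quality reflection and transmission in real-time rendering.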

Comments CVPR 2026 Highlight, Project Page: https://sjj118.github.io/RT-Splatting/

详情
AI中文摘要

3D高斯点散布(3DGS)能够实现实时新型视角合成,具有高质量的视觉效果。然而,现有方法在处理半透明镜面表面时存在困难,这些表面同时表现出复杂的反射和清晰的透射,常常产生模糊的反射或过度遮挡的透射。为了解决这个问题,我们提出了RT-Splatting框架,该框架将每个高斯点的几何占用与其光学不透明度解耦。这种分解产生了一个统一的表面-体积场景表示,使用单组高斯基元。我们的混合渲染器将这种表示同时解释为表面以捕获高频反射,以及体积以保持清晰的透射。为了减轻联合优化反射和透射时的模糊性,我们引入了Specular-Aware Gradient Gating,该方法抑制了高镜面区域的误导梯度进入透射分支,从而有效减少 distracting floaters。在具有挑战性的半透明场景上的实验表明,RT-Splatting实现了最先进的性能,能够实时渲染高质量的反射和清晰的透射。此外,我们的分解自然地支持灵活的场景编辑。项目页面可在https://sjj118.github.io/RT-Splatting上找到。

英文摘要

3D Gaussian Splatting (3DGS) enables real-time novel view synthesis with high visual quality. However, existing methods struggle with semi-transparent specular surfaces that exhibit both complex reflections and clear transmission, often producing blurry reflections or overly occluded transmission. To address this, we present RT-Splatting, a framework that disentangles each Gaussian's geometric occupancy from its optical opacity. This factorization yields a unified surface-volume scene representation with a single set of Gaussian primitives. Our hybrid renderer interprets this representation both as a surface to capture high-frequency reflections and as a volume to preserve clear transmission. To mitigate the ambiguity in jointly optimizing reflection and transmission, we introduce Specular-Aware Gradient Gating, which suppresses misleading gradients from highly specular regions into the transmission branch, effectively reducing distracting floaters. Experiments on challenging semi-transparent scenes show that RT-Splatting achieves state-of-the-art performance, delivering high-fidelity reflections and clear transmission with real-time rendering. Moreover, our factorization naturally enables flexible scene editing. The project page is available at https://sjj118.github.io/RT-Splatting.

2605.18262 2026-05-19 cs.RO

On Improving Multimodal Pedestrian Trajectory Prediction with CVAE: A Study on Benchmark and Robot Data

基于CVAE的多模态行人轨迹预测改进:对基准数据和机器人数据的研究

Yuzhou Liu, Cristina Olaverri-Monreal

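Affiliations: Dept. Intelligent Transport Systems, Johannes Kepler University Linz

AI summary: This paper proposes a CVAE-based probabilistic model built on Social-STGCNN to improve multimodal pedestrian trajectory prediction. Evaluations on benchmark datasets and a real-world robot-collected dataset show more consistent endpoint accuracy and improved trajectory diversity across crowd configurations.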

详情
AI中文摘要

准确的行人轨迹预测对于在复杂环境中运行的自主系统至关重要,例如郊区或半结构化区域中的模块化巴士和送货机器人。Social Spatio-Temporal Graph Convolutional Neural Networks (Social-STGCNN) 通过建模社会互动展示了强大的性能;然而,生成多样且校准良好的未来轨迹仍然具有挑战性。在本文中,我们基于Social-STGCNN骨架,引入基于条件变分自动编码器(CVAE)的概率公式,以显式建模多模态未来轨迹。我们评估了该方法在ETH和UCY行人轨迹数据集以及由移动机器人收集的真实世界行人数据集上的性能。结果表明,在公共基准上取得了适度的提升,但在不同人群配置下表现出更一致的端点准确性和改进的轨迹多样性。在机器人收集的数据上的评估进一步证明了该方法在非定制基准之外的有效性,并支持其在实际部署中的适用性。

英文摘要

Accurate pedestrian trajectory prediction is crucial for autonomous systems operating in complex environments, such as modular buses and delivery robots in suburban or semi-structured areas. Social Spatio-Temporal Graph Convolutional Neural Networks (Social-STGCNN) have shown strong performance by modeling social interactions; however, producing diverse and well-calibrated future trajectories remains challenging. In this work, we build on a Social-STGCNN backbone and introduce a Conditional Variational Autoencoder (CVAE)-based probabilistic formulation to explicitly model multimodal future trajectories. We evaluate the method on the ETH and UCY pedestrian trajectory datasets as well as on a real-world pedestrian dataset collected by a mobile robot. Results show moderate gains on public benchmarks, but more consistent endpoint accuracy and improved trajectory diversity across different crowd configurations. Evaluation on robot-collected data further demonstrates the approach's effectiveness beyond curated benchmarks and supports its applicability in practical deployments.

2605.18261 2026-05-19 cs.CL

Knowledge-to-Verification: Exploring RLVR for LLMs in Knowledge-Intensive Domains

知识到验证:探索RLVR在知识密集型领域的LLM应用

Zhonghang Yuan, Zhefan Wang, Fang Hu, Zihong Chen, Jinzhe Li, Gang Li, Jie Ying, Huanjun Kong, Songyang Zhang, Nanqing Dong

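Affiliations: Shanghai Artificial Intelligence Laboratory; Shanghai Innovation Institute

AI summary: This paper proposes the K2V framework, which extends RLVR to knowledge-intensive domains through automated verifiable data synthesis, improving LLM reasoning in these domains without significantly compromising the model's general capabilities.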

详情
AI中文摘要

可验证奖励的强化学习(RLVR)已在数学和编程等领域展示了增强大型语言模型(LLM)推理能力的潜力。然而,由于高质量可验证数据的稀缺,其在知识密集型领域的应用尚未得到有效探索。此外,当前RLVR仅关注最终答案的正确性,导致推理错误和稀疏奖励信号的限制。在本文中,我们提出了知识到验证(K2V),一个通过自动化可验证数据合成扩展RLVR到知识密集型领域的框架,同时使LLM的推理过程得以验证。广泛的实验表明,K2V在知识密集型领域增强了LLM的推理能力,而不会显著损害模型的通用能力。本研究还表明,将自动化数据合成与推理验证相结合是增强这些更广泛领域模型能力的有前途的方向。代码可在https://github.com/SeedScientist/K2V上获得。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has demonstrated promising potential to enhance the reasoning capabilities of large language models (LLMs) in domains such as mathematics and coding. However, its applications on knowledge-intensive domains have not been effectively explored due to the scarcity of high-quality verifiable data. Furthermore, current RLVR focuses solely on the correctness of final answers, leading to the limitations of flawed reasoning and sparse reward signals. In this work, we propose Knowledge-to-Verification (K2V), a framework that extends RLVR to knowledge-intensive domains through automated verifiable data synthesis, while enabling verification of the LLM's reasoning process. Extensive experiments demonstrate that K2V enhances the reasoning of LLM in knowledge-intensive domains without significantly compromising the model's general capabilities. This study also suggests that integrating automated data synthesis with reasoning verification is a promising direction to enhance model capabilities in these broader domains. Code is available at https://github.com/SeedScientist/K2V.

2605.18257 2026-05-19 cs.CV cs.AI cs.CL

CodeBind: Decoupled Representation Learning for Multimodal Alignment with Unified Compositional Codebook

CodeBind: 一种用于多模态对齐的解耦表示学习框架

Zeyu Chen, Jie Li, Kai Han

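Affiliations: Visual AI Lab, The University of Hong Kong

AI summary: CodeBind optimizes the multimodal representation space with a unified compositional codebook design, addressing the suboptimal alignment spaces that cross-modal information discrepancies and data scarcity cause in traditional methods, and achieves state-of-the-art performance on multimodal classification and retrieval tasks.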

Comments ACL 2026 Findings; Project page: https://visual-ai.github.io/codebind

详情
AI中文摘要

多模态表示对齐对于大语言模型和机器人至关重要。传统方法常受到跨模态信息差异和数据稀缺的限制,导致对齐空间不优,忽略了模态特有的特征。我们提出了CodeBind,一种通过模态共享-特定代码本设计优化多模态表示空间的框架。通过逐步对齐目标和连接模态,CodeBind避免了需要完全配对数据的需要。不同于传统硬对齐,CodeBind将特征分解为共享组件以实现语义一致性,以及特定组件以捕捉模态特有的细节。这种设计利用了组合向量量化方案,其中共享代码本弥合模态差距,而模态特定代码本通过防止主导模态压制其他模态来缓解表示偏差。在九种模态(文本、图像、视频、音频、深度、热成像、触觉、3D点云、EEG)上验证,CodeBind在多模态分类和检索任务中实现了最先进的性能。

英文摘要

Multimodal representation alignment is pivotal for large language models and robotics. Traditional methods are often hindered by cross-modal information discrepancies and data scarcity, leading to suboptimal alignment spaces that overlook modality-unique features. We propose CodeBind, a framework that optimizes multimodal representation spaces through a modality-shared-specific codebook design. By incrementally aligning target and bridging modalities, CodeBind bypasses the need for fully paired data. Unlike traditional hard alignment, CodeBind decomposes features into shared components for semantic consistency and specific components for modality-unique details. This design utilizes a compositional vector quantization scheme, where a shared codebook bridges modality gaps and modality-specific codebooks mitigate representation bias by preventing dominant modalities from overshadowing others. Validated across nine modalities (text, image, video, audio, depth, thermal, tactile, 3D point cloud, EEG), CodeBind achieves state-of-the-art performance in multimodal classification and retrieval tasks.
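The abstract describes a compositional vector quantization scheme with a shared codebook and modality-specific codebooks. One possible reading, sketched below, is a residual-style two-codebook lookup: pick the nearest shared code, then quantize what is left with a modality-specific code. This is an assumption for illustration only; the names `nearest` and `compositional_quantize`, the residual composition, and the toy codebooks are hypothetical and are not taken from CodeBind.

```python
def nearest(codebook, vec):
    """Index of the codebook entry with the smallest squared Euclidean distance to vec."""
    def sqdist(code):
        return sum((c - v) ** 2 for c, v in zip(code, vec))
    return min(range(len(codebook)), key=lambda i: sqdist(codebook[i]))


def compositional_quantize(vec, shared_codebook, specific_codebook):
    """Quantize vec as shared code + modality-specific code (assumed residual composition)."""
    s_idx = nearest(shared_codebook, vec)
    shared = shared_codebook[s_idx]
    # Whatever the shared code does not explain is treated as the modality-specific part.
    residual = [v - s for v, s in zip(vec, shared)]
    p_idx = nearest(specific_codebook, residual)
    specific = specific_codebook[p_idx]
    recon = [s + p for s, p in zip(shared, specific)]
    return s_idx, p_idx, recon


# Hypothetical 2-D codebooks for a single modality.
shared_codebook = [[0.0, 0.0], [1.0, 1.0]]
specific_codebook = [[0.0, 0.0], [0.5, -0.5]]
print(compositional_quantize([1.4, 0.6], shared_codebook, specific_codebook))
```

In this reading, the shared index carries the cross-modal semantics and the specific index keeps modality-unique detail that a single shared codebook would discard.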

2605.18253 2026-05-19 cs.CL cs.AI

Machine Unlearning for Masked Diffusion Language Models

针对掩码扩散语言模型的机器去学习

Georu Lee, Seungwon Jeong, Hoki Kim, Jinseong Park, Woojin Lee

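Affiliations: Dongguk University-Seoul; Chung-Ang University; Korea Institute for Advanced Study

AI summary: This paper proposes MDU, an unlearning framework for masked diffusion language models that revisits how knowledge is learned in terms of the diffusion process and achieves high unlearning performance.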

Comments 20 pages, 8 figures, appendix included

详情
AI中文摘要

最近的掩码扩散语言模型(MDLMs),如LLaDA和Dream,已经达到了与自回归大语言模型相当的性能。与自回归模型不同,MDLMs通过并行迭代去噪掩码位置来生成文本。在微调过程中,MDLMs学习从掩码响应状态中恢复响应,从而将预测从提示-掩码无条件分布转向提示-条件分布。尽管生成和微调机制不同,针对MDLMs的机器去学习仍鲜有研究。在本文中,我们提出Masked Diffusion Unlearning(MDU),通过重新审视扩散过程中的知识学习,首次提出了针对MDLMs的去学习框架。具体而言,MDU在每个掩码响应位置上最小化从提示-条件预测到提示-掩码无条件锚点的正向KL散度,并通过温度缩放参数控制隐私-效用权衡。在标准基准和MDLM骨干网络上的实验证明,MDU在与现有LLM去学习方法相比时实现了较高的去学习性能。代码可在https://github.com/leegeoru/MDU上获得。

英文摘要

Recent masked diffusion language models (MDLMs), such as LLaDA and Dream, have achieved performance comparable to autoregressive large language models. Unlike autoregressive models, which generate text sequentially, MDLMs generate text by iteratively denoising masked positions in parallel. During fine-tuning, MDLMs learn to recover responses from masked response states conditioned on a prompt, thereby shifting their predictions from a prompt-masked unconditional distribution toward a prompt-conditional distribution. Despite this distinct generative and fine-tuning mechanism, machine unlearning for MDLMs remains largely unexplored. In this paper, we propose Masked Diffusion Unlearning (MDU), the first unlearning framework for MDLMs, by revisiting the process of learning specific knowledge in terms of diffusion. Specifically, MDU minimizes a forward KL divergence from the prompt-conditional prediction to a prompt-masked unconditional anchor at every masked response position, with a temperature scaling parameter to control the privacy-utility trade-off. Our empirical results on standard benchmarks and MDLM backbones show that MDU achieves high unlearning performance compared to existing LLM unlearning methods. Code is available at https://github.com/leegeoru/MDU.
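The objective described above (a KL divergence between the prompt-conditional prediction and a prompt-masked unconditional anchor, averaged over masked response positions, with a temperature parameter) can be sketched as follows. This is a rough illustration under stated assumptions: the KL direction is taken as KL(anchor || conditional), the temperature is applied to the anchor logits, and all function names are hypothetical. The paper's exact formulation may differ.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one position's vocabulary logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def unlearning_loss(cond_logits, anchor_logits, masked_positions, temperature=1.0):
    """Mean KL(anchor || conditional) over masked response positions (assumed direction)."""
    if not masked_positions:
        raise ValueError("at least one masked position is required")
    total = 0.0
    for pos in masked_positions:
        anchor = softmax(anchor_logits[pos], temperature)  # temperature placement is an assumption
        cond = softmax(cond_logits[pos])
        total += kl_divergence(anchor, cond)
    return total / len(masked_positions)


# Toy logits for two positions and a three-token vocabulary (hypothetical values).
cond = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
anchor = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(unlearning_loss(cond, anchor, masked_positions=[0, 1]))
```

The loss is zero when the conditional prediction already matches the anchor, and grows as the prompt pulls the prediction away from it, which is the behaviour an unlearning objective of this kind would penalize.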

2605.18252 2026-05-19 cs.CV

GaussianZoom: Progressive Zoom-in Generative 3D Gaussian Splatting with Geometric and Semantic Guidance

GaussianZoom: 一种结合几何和语义引导的渐进式缩放生成3D高斯点云

Jiale Shi, Jiarui Hu, Zesong Yang, Kaixuan Luan, Hujun Bao, Zhaopeng Cui

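Affiliations: State Key Lab of CAD & CG

AI summary: This paper proposes GaussianZoom, a generative zoom-in 3D reconstruction system that combines geometry-consistent scene modeling with multi-scale semantic reasoning in an iterative progressive framework to produce high-fidelity extreme zoom-in renderings from low-resolution inputs.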

Comments 10 pages, 7 figures

详情
AI中文摘要

我们介绍了GaussianZoom,一种具有迭代渐进框架的生成式缩放3D重建系统,该框架结合几何一致的场景建模和多尺度语义推理,以实现从低分辨率输入生成高保真的极端缩放渲染。为此,我们开发了一种新的多视角一致超分辨率模块,结合基于深度的特征扭曲和VLM驱动的细节合成,确保准确的多视角对应关系,同时在观察分辨率之外丰富细粒度外观。为了支持大范围的缩放,我们进一步引入了一种可扩展的连续细节层次结构,该结构动态调节高斯可见性,以实现平滑且无混叠的跨尺度渲染。在Mip-NeRF360和Tanks\&Temples上的实验表明,GaussianZoom在极端缩放下实现了优越的感知质量、多视角一致性和鲁棒性,为生成式缩放3D场景重建建立了强有力的基准。

英文摘要

We introduce GaussianZoom, a generative zoom-in 3D reconstruction system with an iterative progressive framework that combines geometry-consistent scene modeling and multi-scale semantic reasoning to enable high-fidelity extreme zoom-in rendering from low-resolution inputs. To achieve this, we develop a novel multi-view consistent super-resolution module with depth-based feature warping and VLM-driven detail synthesis, ensuring accurate multi-view correspondence while enriching fine-scale appearance beyond the observed resolution. To support zooming across large magnification ranges, we further introduce a new expandable continuous Level-of-Detail hierarchy that dynamically modulates Gaussian visibility for smooth, alias-free cross-scale rendering. Experiments on Mip-NeRF360 and Tanks&Temples demonstrate that GaussianZoom achieves superior perceptual quality, multi-view consistency, and robustness under extreme magnification, establishing a strong baseline for generative zoom-in 3D scene reconstruction.

2605.18246 2026-05-19 cs.LG cs.AI

Privacy Preserving Reinforcement Learning with One-Sided Feedback

具有单侧反馈的隐私保护强化学习

Lin William Cong, Guangyan Gan, Hanzhang Qin, Zhenzhen Yan

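Affiliations: Nanyang Technological University; National University of Singapore; Cornell SC Johnson College of Business

AI summary: This paper studies reinforcement learning in multi-dimensional continuous state and action spaces where the agent receives only partial observations of the state and, at each time step, reward information for only a subset of the state-action space. It proposes POOL, a new privacy-preserving RL algorithm, and shows through theoretical analysis that its sample complexity matches the known lower bounds for non-private RL, indicating that strong privacy guarantees are achievable while maintaining high learning efficiency.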

Comments Accepted at IJCAI-ECAI 2026

详情
AI中文摘要

我们研究了在多维连续状态和动作空间中具有单侧反馈的强化学习(RL)。在此设置中,智能体仅接收状态的部分观测,并在每个时间步仅获得状态-动作空间子集的奖励信息。这种设置在学习效率和隐私保护方面带来了重大挑战。为了解决这些挑战,我们提出了POOL,一种新颖的隐私保护RL算法。我们对POOL进行了全面的理论分析,推导出一个样本复杂度界,该界与已知的非隐私RL下界相匹配。其中,E_rho表示隐私参数,H是时间范围,alpha是最优性差距参数。我们的研究结果表明,可以在保持高学习效率的同时实现强隐私保障,这标志着在具有单侧反馈的多维环境中实现实用的隐私感知RL迈出重要一步。

英文摘要

We study reinforcement learning (RL) in multi-dimensional continuous state and action spaces with one-sided feedback, where the agent receives partial observations of the state and obtains reward information for only a subset of the state-action space at each time step. This setting introduces substantial challenges in both learning efficiency and privacy preservation. To address these challenges, we propose POOL, a novel privacy-preserving RL algorithm. We conduct a comprehensive theoretical analysis of POOL, deriving a sample complexity bound that matches the known lower bounds for non-private RL. Here, E_rho denotes the privacy parameter, H is the time horizon, and alpha is the optimality-gap parameter. Our findings show that it is possible to enforce strong privacy guarantees while maintaining high learning efficiency, marking a significant step toward practical, privacy-aware RL in multi-dimensional environments with one-sided feedback.

2605.18239 2026-05-19 cs.CL cs.AI

Multilingual jailbreaking of LLMs using low-resource languages

使用低资源语言对LLM进行多语言劫持

Dylan Marx, Marcel Dunaiski

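Affiliations: Computer Science Division, Mathematical Sciences Department

AI summary: This study tests the safety mechanisms of large language models using multi-turn conversations in low-resource African languages and finds that translation quality is the key factor in jailbreak success for low-resource languages.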

Comments 12 pages, 5 figures

详情
AI中文摘要

大型语言模型(LLMs)仍然容易受到绕过安全防护措施的劫持攻击。我们研究了使用低资源非洲语言(阿弗里卡语、基索瓦希利、isiXhosa和isiZulu)的多轮对话是否能绕过商业LLM的安全机制。我们翻译了现有数据集中的提示,并通过自动化测试和本地母语者的人工红队测试评估了ChatGPT、Claude、DeepSeek、Gemini和Grok。单轮翻译攻击效果不佳,而多轮对话在英语有害响应率方面从52.7%(Claude 3.5 Haiku)到83.6%(GPT-4o-mini),阿弗里卡语从60.0%(Claude 3.5 Haiku)到78.2%(GPT-4o-mini),基索瓦希利从41.8%(Claude 3.5 Haiku)到70.9%(DeepSeek)。人工红队测试比自动化方法提高了劫持率。所有评估语言的平均劫持率从59.8%增加到75.8%,其中阿弗里卡语提高了20.0%,isiZulu提高了12.7%,isiXhosa提高了12.3%,基索瓦希利提高了1%,这表明翻译质量限制了劫持的成功率。这些发现表明,LLM中的漏洞在多语言环境中仍然存在,翻译质量是决定低资源语言劫持成功率的关键因素。

英文摘要

Large Language Models (LLMs) remain vulnerable to jailbreak attempts that circumvent safety guardrails. We investigate whether multi-turn conversations using low-resource African languages (Afrikaans, Kiswahili, isiXhosa, and isiZulu) can bypass safety mechanisms across commercial LLMs. We translated prompts from existing datasets and evaluated ChatGPT, Claude, DeepSeek, Gemini, and Grok through automated testing and human red-teaming with native speakers. Single-turn translation attacks proved ineffective, while multi-turn conversations achieved English harmful response rates from 52.7% (Claude 3.5 Haiku) to 83.6% (GPT-4o-mini), Afrikaans from 60.0% (Claude 3.5 Haiku) to 78.2% (GPT-4o-mini), and Kiswahili from 41.8% (Claude 3.5 Haiku) to 70.9% (DeepSeek). Human red-teaming increased jailbreak rates compared to automated methods. Over all evaluated languages, the average jailbreak rate increased from 59.8% to 75.8%, with improvements of +20.0% (Afrikaans), +12.7% (isiZulu), +12.3% (isiXhosa), and +1% (Kiswahili), demonstrating that poor translation quality limits jailbreak success. These findings suggest that vulnerabilities in LLMs persist in multilingual contexts and that translation quality is the critical factor determining jailbreak success in low-resource languages.

2605.18238 2026-05-19 cs.CV

Non-Colliding Biometric Identities for Digital Entities: Geometry, Capacity, and Million-Scale Virtual Identity Provisioning

非碰撞生物识别身份用于数字实体:几何、容量与百万级虚拟身份提供

Yuyang Ji, Yixuan Shen, Anil Jain, Xiaoming Liu, Feng Liu

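Affiliations: Department of Computer Science, Drexel University; Department of Computer Science and Engineering, Michigan State University; Department of Computer Science, University of North Carolina at Chapel Hill

AI summary: This study proposes the Biometric Identity Provisioning (BIP) framework for provisioning virtual identities that do not collide with a gallery of real human identities. It uses a geometric approach to allocate unclaimed gaps within the real face manifold, generates high-fidelity face images, and demonstrates 10M non-colliding virtual identity embeddings.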

Comments 25 pages, 11 figures

详情
AI中文摘要

数字实体如AI代理和人形机器人日益与真实人类共同操作,但其身份基础设施仍基于凭证而非生物识别身份。我们引入Biometric Identity Provisioning(BIP),一种新的问题和解决方案框架,旨在:给定一个真实人类身份的注册画廊,提供虚拟身份,这些身份与每个注册身份不碰撞,保持足够的类间分离性,并能作为高保真面部图像实现。关键的几何洞察是真实面部身份占据嵌入超球面的低维子空间,留下的残余子空间无法供虚拟身份使用。因此,虚拟身份必须在真实面部流形本身中分配未被占用的间隙。BIP因此是一个受限的填充问题:可用的间隙远超任何可预见的注册规模,并且即使后续注册了新的真实身份,已提供的身份仍保持不碰撞。基于此几何,我们的排斥式分配不受任何固定提供数量的限制;我们展示了针对360,000个真实身份画廊的1000万非碰撞虚拟身份嵌入。将这些嵌入转化为面部图像需要一个在真实面部图像训练分布外运行的生成器;我们引入GapGen,一种具有间隙意识的生成器,通过渐进扩展合成到非碰撞区域的课程进行训练,验证了100万张逼真虚拟面部图像。我们进一步构建了v-LFW,一个LFW面部数据集的虚拟对应物,包含虚拟面部验证、跨现实匹配、真实与虚拟检测以及统一识别和检测的协议。

英文摘要

Digital entities such as AI agents and humanoid robots increasingly operate alongside real humans, yet their identity infrastructure is based on credentials rather than embodied biometric identity. We introduce Biometric Identity Provisioning (BIP), a new problem and solution framework that addresses: given an enrollment gallery of real human identities, provision virtual identities that are non-colliding with every enrolled identity, maintain sufficient inter-class separability, and are realizable as high-fidelity face images. The key geometric insight is that real face identities occupy a low-dimensional subspace of the embedding hypersphere, leaving no residual subspace for virtual identities. Hence, virtual identities must instead be allocated as unclaimed gaps within the real face manifold itself. BIP is therefore a constrained packing problem: available gaps vastly exceed any foreseeable enrollment scale, and provisioned identities remain non-colliding even as new real identities are subsequently enrolled. Grounded in this geometry, our repulsion-based allocation is not bounded by any fixed provisioning count; we demonstrate 10M non-colliding virtual identity embeddings against a gallery of 360K real identities. Realizing these embeddings as face images requires a generator that operates outside the training distribution of real face images; we introduce GapGen, a gap-aware generator trained with a curriculum that progressively extends synthesis into non-colliding regions, validated at 1M photorealistic virtual face images. We further construct v-LFW, a virtual counterpart to LFW face dataset, with protocols for virtual face verification, cross-reality matching, real-vs-virtual detection, and unified recognition and detection.

2605.18233 2026-05-19 cs.CV

Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos

增强无列车无限帧生成以实现一致的长视频

X. Feng, J. Zhu, M. Wu, C. Chen, F. Mao, H. Guo, J. Wu, X. Chu, K. Huang

Affiliations: School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; AMAP, Alibaba Group, Beijing, China

AI summary: This paper proposes MIGA, which uses a two-stage alignment mechanism and a dual consistency enhancement mechanism to address the training-inference mismatch and the difficulty of maintaining long-term consistency, improving long video generation.

Comments Accepted by ICML 2026

详情
英文摘要

Without incurring significant computational overhead, train-free long video generation aims to enable foundation video generation models to produce longer videos. Frame-level autoregressive frameworks, e.g., FIFO-diffusion, offer the advantage of generating infinitely long videos with constant memory consumption. However, the mismatch between training and inference, coupled with the challenge of maintaining long-term consistency, limits the effective utilization of foundation models. To mitigate these concerns, we propose MIGA, a novel infinite-frame long video generation method. Firstly, we propose an effective two-stage alignment mechanism that mitigates the training-inference gap by reducing the excessive noise span fed to the model. We then introduce an innovative dual consistency enhancement mechanism, where the self-reflection approach corrects early high-noise frames and the long-range frame guidance approach leverages later low-noise frames with broad coverage to steer generation, jointly improving temporal consistency. Extensive experiments on VBench and NarrLV demonstrate the state-of-the-art performance of MIGA. Our project page is available at https://xiaokunfeng.github.io/miga_homepage/.

2605.18232 2026-05-19 cs.CL cs.AI cs.IR

SomaliWeb v1: A Quality-Filtered Somali Web Corpus with a Matched Tokenizer and a Public Language-Identification Benchmark

SomaliWeb v1: 一个经过质量过滤的索马里网页语料库,配有匹配的分词器和公开的语言识别基准

Khalid Yusuf Dahir

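Affiliations: Independent researcher

AI summary: This paper presents SomaliWeb v1, a quality-filtered Somali corpus with a matched BPE-16K tokenizer and the first public Somali language-identification benchmark, and documents quality defects in existing distributions.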

Comments 16 pages, 6 figures, 6 tables. Code: https://github.com/khaledyusuf44/somali-corpus Dataset: https://huggingface.co/datasets/khaledyusuf44/somaliweb-v1

详情
AI中文摘要

索马里是一种非洲之角的库希特语,有约2500万使用者,但目前没有公开的专门索马里预训练语料库及其配套的分词器和语言识别基准。现有的索马里文本要么出现在多语言分布中(如HPLT v2、CC100、MADLAD-400、OSCAR、mC4),要么出现在Hugging Face上的小规模、未记录的索马里-only上传中。我们介绍了SomaliWeb v1,一个经过质量过滤的索马里语料库,包含819,322个文档(约303亿个标记),由三个上游来源(HPLT v2、CC100、索马里维基百科)通过六阶段可重复的流程构建。我们发布了(i)语料库,(ii)匹配的BPE-16K分词器,以及(iii)首个公开的索马里语言识别基准。我们的测量揭示了现有分布中的具体质量问题:HPLT v2的“清理”索马里发布保留了17.3%的字节精确重复项,其56.1%的文档包含可修复的mojibake,且其10.7%的字节唯一文档在Jaccard tau=0.80时为近重复项。我们的BPE-16K分词器在FLORES-200索马里开发测试上比GPT-4的cl100k_base少发出40.2%的标记;下游语言模型困惑度比较将推迟到后续发布。

英文摘要

Somali is a Cushitic language of the Horn of Africa with ~25 million speakers, yet no documented dedicated Somali pretraining corpus with a companion tokenizer and language-identification benchmark has been publicly released. Existing Somali text appears either inside multilingual distributions (HPLT v2, CC100, MADLAD-400, OSCAR, mC4) or in small, undocumented Somali-only uploads on Hugging Face. We introduce SomaliWeb v1, a quality-filtered Somali corpus of 819,322 documents (~303M tokens) built from three upstream sources (HPLT v2, CC100, Somali Wikipedia) through a six-stage reproducible pipeline. We release (i) the corpus, (ii) a matched BPE-16K tokenizer, and (iii) the first public side-by-side Somali benchmark of three production language identifiers. Our measurements reveal concrete quality defects in existing distributions: HPLT v2's "cleaned" Somali release retains 17.3% byte-exact duplicates, 56.1% of its documents contain fixable mojibake, and 10.7% of its byte-unique documents are near-duplicates at Jaccard tau=0.80. Our BPE-16K tokenizer emits 40.2% fewer tokens than GPT-4's cl100k_base on FLORES-200 Somali devtest as a tokenizer-level measurement; downstream language-model perplexity comparisons are deferred to a follow-up release.
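The abstract reports near-duplicates at a Jaccard threshold of tau=0.80. A minimal sketch of such a check is shown below, assuming word-level 3-gram shingles and an exact pairwise comparison. The shingle size, the tokenization, and the function names are illustrative assumptions; the SomaliWeb pipeline may use a different shingling scheme or an approximate method.

```python
def shingles(text, n=3):
    """Set of word n-grams for a document (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    if len(tokens) < n:
        # Short documents fall back to a single shingle holding all tokens.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(doc_a, doc_b, tau=0.80, n=3):
    """True when the shingle sets overlap at or above the threshold tau."""
    return jaccard(shingles(doc_a, n), shingles(doc_b, n)) >= tau


# Two toy documents that differ only by one trailing token.
base = "a b c d e f g h i j k l"
print(is_near_duplicate(base, base + " m"))   # True: 10 shared shingles out of 11
print(is_near_duplicate("a b c d", "w x y z"))  # False: no shared shingles
```

Exact pairwise comparison is quadratic in the number of documents, so a corpus of several hundred thousand documents would normally need an approximate scheme on top of the same similarity definition.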

2605.18229 2026-05-19 cs.LG cs.AI

Are Sparse Autoencoder Benchmarks Reliable?

稀疏自编码基准测试是否可靠?

David Chanin

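Affiliations: Decode Research, MATS, UCL

AI summary: This study evaluates the reliability of sparse autoencoder (SAE) benchmarks and finds that two metrics fail under multiple lenses while the remaining metrics are noisier and less discriminative than the field assumes, indicating that better SAE benchmarks are needed.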

详情
AI中文摘要

稀疏自编码(SAEs)是大型语言模型的核心可解释性工具,其进展依赖于能够可靠区分更好和更差SAE的基准测试。我们通过三种互补的视角审计了SAEBench中SAE质量指标:固定SAE上的重新播种噪声、合成SAE上的真实相关性以及训练轨迹的可区分性。我们发现,两个指标,即目标探测扰动(TPP)和虚假相关性消除(SCR),在它们的典型设置下未能通过多个视角,不应用于评估SAE。其他指标显示出更高的重新播种噪声和更低的可区分性,比领域假设的要差。sae-probes变体的k-稀疏探测是我们在测试中发现最可靠的指标,但即使sae-probes也难以区分同一体系结构的不同变体。我们的结果表明,领域需要更好的SAE基准测试。

英文摘要

Sparse autoencoders (SAEs) are a core interpretability tool for large language models, and progress on SAE architectures depends on benchmarks that reliably distinguish better SAEs from worse ones. We audit the SAE quality metrics in SAEBench, the de-facto standard SAE evaluation suite, through three complementary lenses: reseed noise on a fixed SAE, ground-truth correlation on synthetic SAEs, and discriminability across training trajectories. We find that two of these metrics, Targeted Probe Perturbation (TPP) and Spurious Correlation Removal (SCR), fail multiple lenses at their canonical settings and should not be used to evaluate SAEs. The other metrics show higher reseed noise and lower discriminability than the field assumes. The sae-probes variant of $k$-sparse probing is the most reliable metric we tested, but even sae-probes struggles to separate variants of the same SAE architecture. Our results show the field needs better SAE benchmarks.

2605.18226 2026-05-19 cs.CL cs.AI

Context Memorization for Efficient Long Context Generation

上下文记忆用于高效长上下文生成

Yasuyuki Okoshi, Hao Mark Chen, Guanxi Lu, Hongxiang Fan, Masato Motomura, Daichi Fujiki

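Affiliations: Institute of Science Tokyo, Japan; Imperial College London, UK

AI summary: This paper proposes a training-free context memorization method that externalizes the prefix into a lightweight lookup table of precomputed attention states, improving the accuracy and efficiency of long-context generation while reducing attention latency.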

详情
AI中文摘要

现代大型语言模型(LLM)应用越来越多地依赖长前缀来在推理时控制模型行为。尽管增强前缀的推理是有效的,但存在两个结构限制:i)随着生成过程的进行,前缀的影响逐渐减弱;ii)对前缀的注意力计算与长度成线性关系。现有方法要么在注意力中保留前缀同时压缩它,要么通过梯度训练将它内部化到模型参数中。前者在推理时仍然会关注到前缀,而后者训练成本高且不适合前缀更新。为了解决这些问题,我们提出了注意力状态记忆,这是一种无需训练的方法,将前缀外部化为一个轻量级的预计算注意力状态的查找表。在ManyICLBench上使用LLaMA-3.1-8B,我们的方法在1K-8K内存预算下比上下文学习提高了准确性,同时在8K时将注意力延迟减少了1.36倍,并在NBA基准测试中仅使用其内存足迹的20%就超过了全注意力RAG性能。

英文摘要

Modern large language model (LLM) applications increasingly rely on long conditioning prefixes to control model behavior at inference time. While prefix-augmented inference is effective, it incurs two structural limitations: i) the prefix's influence fades as generation proceeds, and ii) attention computation over the prefix scales linearly with its length. Existing approaches either keep the prefix in attention while compressing it, or internalize it into model parameters through gradient-based training. The former still attends to the prefix at inference, while the latter is training-intensive and ill-suited to prefix updates. To address these issues, we propose attention-state memory, a training-free approach that externalizes the prefix into a lightweight, lookup-based memory of precomputed attention states between prefix and query tokens. On ManyICLBench with LLaMA-3.1-8B, our method improves accuracy over in-context learning at 1K-8K memory budgets while reducing attention latency by 1.36x at 8K, and surpasses full-attention RAG performance on NBA benchmark using only 20% of its memory footprint.

2605.18221 2026-05-19 cs.SD cs.CL cs.CV cs.LG physics.med-ph

SIREM: Speech-Informed MRI Reconstruction with Learned Sampling

SIREM: 语音引导的MRI重建与学习采样

Md Hasan, Nyvenn Castro, Daiqi Liu, Lukas Mulzer, Jana Hutter, Jonghye Woo, Moritz Zaiss, Andreas Maier, Paula A. Perez-Toro

Affiliations: Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg; Institute of Radiology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg; Institut für Informationsverarbeitung, Leibniz Universität Hannover; Department of Radiology, Harvard Medical School and Massachusetts General Hospital

AI summary: This paper proposes SIREM, a speech-informed MRI reconstruction framework that uses synchronized speech as a cross-modal prior. It exploits the correlation between vocal-tract configurations and the produced acoustics to predict part of the image content, and reconstructs anatomically plausible structure at higher throughput than iterative methods.

详情
AI中文摘要

实时磁共振成像(rtMRI)在语音生产中的应用能够非侵入性地可视化动态声带运动,对语音科学和临床评估具有价值。然而,rtMRI本质上受到空间分辨率、时间分辨率和获取速度之间的权衡限制,常常导致k空间测量不足和重建质量下降。我们提出SIREM,一种利用同步语音作为跨模态先验的MRI重建框架。核心思想是语音期间的声带配置与产生的声音学相关,使图像部分内容可从音频预测。SIREM将每帧建模为音频驱动组件和MRI驱动组件的融合,通过空间加权图。音频分支从语音预测发音器相关结构,而MRI分支从测量的k空间数据重建互补内容。我们进一步引入了可学习的软加权轮廓,使螺旋臂的使用与语音引导融合的交互研究可微分。这产生了一个统一的多模态公式,结合了音频驱动预测、MRI重建和采样适应。我们在USC语音rtMRI基准上评估了SIREM,与标准基线(包括栅格、基于小波的压缩感知和总变分)进行比较。SIREM引入了一种语音引导的重建范式,在比迭代方法高得多的吞吐量下运行,同时保持解剖上合理的声带结构。这些结果为多模态语音引导的rtMRI重建建立了初步基准,并突显了同步语音作为快速重建辅助先验的潜力。源代码可在https://github.com/mdhasanai/SIREM获取。

英文摘要

Real-time magnetic resonance imaging (rtMRI) of speech production enables non-invasive visualization of dynamic vocal-tract motion and is valuable for speech science and clinical assessment. However, rtMRI is fundamentally constrained by trade-offs among spatial resolution, temporal resolution, and acquisition speed, often leading to undersampled k-space measurements and degraded reconstructions. We propose SIREM, a speech-informed MRI reconstruction framework that uses synchronized speech as a cross-modal prior. The central idea is that vocal-tract configurations during speech are correlated with the produced acoustics, making part of the image content predictable from audio. SIREM models each frame as a fusion of an audio-driven component and an MRI-driven component through a spatial weighting map. The audio branch predicts articulator-related structure from speech, while the MRI branch reconstructs complementary content from measured k-space data. We further introduce a learnable soft weighting profile over spiral arms, enabling a differentiable study of how k-space arm usage interacts with speech-informed fusion. This yields a unified multimodal formulation that combines audio-driven prediction, MRI reconstruction, and sampling adaptation. We evaluate SIREM on the USC speech rtMRI benchmark against standard baselines, including gridding, wavelet-based compressed sensing, and total variation. SIREM introduces a speech-informed reconstruction paradigm that operates in a substantially higher-throughput regime than iterative methods while preserving anatomically plausible vocal-tract structure. These results establish an initial benchmark for multimodal speech-informed rtMRI reconstruction and highlight the potential of synchronized speech as an auxiliary prior for fast reconstruction. The source code is available at https://github.com/mdhasanai/SIREM
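The abstract states that each frame is modeled as a fusion of an audio-driven component and an MRI-driven component through a spatial weighting map. A minimal sketch of one way to do this is a per-pixel convex combination, shown below. The convex form, the requirement that weights lie in [0, 1], and the name `fuse_frame` are assumptions for illustration; the actual SIREM fusion may be defined differently.

```python
def fuse_frame(audio_pred, mri_recon, weight_map):
    """Per-pixel blend: w * audio + (1 - w) * mri, with w taken from the weighting map."""
    if not (len(audio_pred) == len(mri_recon) == len(weight_map)):
        raise ValueError("all inputs must have the same number of rows")
    fused = []
    for audio_row, mri_row, weight_row in zip(audio_pred, mri_recon, weight_map):
        if not (len(audio_row) == len(mri_row) == len(weight_row)):
            raise ValueError("all inputs must have the same number of columns")
        row = []
        for a, m, w in zip(audio_row, mri_row, weight_row):
            if not 0.0 <= w <= 1.0:
                raise ValueError("weights are assumed to lie in [0, 1]")
            row.append(w * a + (1.0 - w) * m)
        fused.append(row)
    return fused


# Toy 1x2 frame: the first pixel trusts audio fully, the second mostly trusts MRI.
audio_pred = [[1.0, 1.0]]
mri_recon = [[0.0, 0.0]]
weight_map = [[1.0, 0.25]]
print(fuse_frame(audio_pred, mri_recon, weight_map))  # [[1.0, 0.25]]
```

Under this reading, regions near the articulators could carry higher weights for the audio branch, while the MRI branch supplies content that speech cannot predict.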