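arXivDaily daily academic digest: synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and related fields.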

2605.17651 2026-05-19 cs.LG

Counterfactual Explanations Under Concept Drift

Marcin Kostrzewa, Jerzy Stefanowski, Maciej Zięba

Affiliations: Wrocław University of Science and Technology; Poznań University of Technology

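AI summary: This paper studies how to keep counterfactual explanations valid as the data distribution changes, and proposes a lightweight update scheme that repairs existing explanations while keeping them close to the original instances.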

2605.17648 2026-05-19 cs.AI

SAPO: Step-Aligned Policy Optimization for Reasoning-Based Generative Recommendation

Zaiyi Zheng, Guanghui Min, Yaochen Zhu, Liang Wu, Liangjie Hong, Chen Chen, Jundong Li

Affiliations: University of Virginia; Nokia

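AI summary: This paper proposes SAPO, which uses step-aligned policy optimization to address the training instability that sparse exact-match feedback causes in generative recommendation, improving the training of reasoning-based generative recommenders.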

Abstract

Generative recommendation treats next-item prediction as autoregressive item-identifier generation. Specifically, items are encoded as semantic identifiers (SIDs), which are short coarse-to-fine token sequences whose early tokens capture broad semantics and later tokens refine them. Recent work augments this paradigm with reasoning traces and optimizes them via reinforcement learning with verifiable rewards, typically using outcome-reward algorithms with exact-match feedback on the generated SID. However, in large-catalog recommendation, exact-match feedback on the generated SID only reports whether the final item is correct; when a generated SID mismatches, an outcome reward cannot identify which SID-token prediction caused the mismatch and may penalize matched SID-token positions together with the mismatched position. We identify that the natural unit of credit assignment in this setting is a single reasoning step (one thinking block paired with one SID token). We instantiate this idea in SAPO (Step-Aligned Policy Optimization): rather than broadcasting one advantage to the whole response, SAPO computes a separate group-relative advantage for each reasoning step and applies it only to the corresponding thinking block and SID token. Across three real-world recommendation datasets, SAPO stabilizes reinforcement-learning training and consistently improves over existing generative recommendation baselines, with the largest gains where sparse exact-match feedback makes reasoning-step credit assignment important. Our results suggest that reinforcement-learning objectives for structured generation should mirror the decoder's own decomposition of the output.
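The step-level credit assignment described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the per-step reward (1 where the SID token at that step matches the target, 0 otherwise) and the function names are assumptions, since the abstract does not specify how step rewards are defined.

```python
from statistics import mean, pstdev

def step_rewards(generated_sid, target_sid):
    # Assumed per-step reward: 1.0 where the SID token matches the target, else 0.0.
    return [1.0 if g == t else 0.0 for g, t in zip(generated_sid, target_sid)]

def step_aligned_advantages(group_sids, target_sid, eps=1e-6):
    """Return one advantage per (response, step), normalized within the group per step."""
    rewards = [step_rewards(sid, target_sid) for sid in group_sids]
    n_steps = len(target_sid)
    advantages = [[0.0] * n_steps for _ in group_sids]
    for k in range(n_steps):
        column = [r[k] for r in rewards]          # rewards of all responses at step k
        mu, sigma = mean(column), pstdev(column)
        for i, r in enumerate(column):
            advantages[i][k] = (r - mu) / (sigma + eps)
    return advantages

# A group of four sampled responses, each emitting a 3-token SID.
group = [("a", "b", "c"), ("a", "b", "x"), ("a", "y", "c"), ("a", "b", "c")]
adv = step_aligned_advantages(group, target_sid=("a", "b", "c"))
```

In this toy group every response gets the first token right, so step 0 carries zero advantage for all of them, while a response that errs only at the last step is penalized only there. A single response-level advantage would instead penalize its correct earlier steps as well.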

2605.17642 2026-05-19 cs.LG

TabKDE: Simple and Scalable Tabular Data Generation with Kernel Density Estimates

Meysam Alishahi, Yan Zheng, Junpeng Wang, Chin-Chia Michael Yeh, Jeff M. Phillips

Affiliations: University of Utah; Visa Research

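AI summary: This paper proposes a tabular data generation method based on kernel density estimates that achieves accuracy and leakage avoidance comparable to existing methods without lengthy training, and scales efficiently to large datasets.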

Abstract

Tabular data generation considers a large table with multiple columns -- each column comprising numerical, categorical, or sometimes ordinal values. The goal is to produce new rows for the table that replicate the distribution of rows from the original data -- without just copying those initial rows. The last 4 years have seen enormous progress on this problem, mostly using computationally expensive methods that employ one-hot encoding, VAEs, and diffusion. This paper describes a new approach to the problem of tabular data generation. By employing copula transformations and modeling the distribution as a kernel density estimate we can nearly match the accuracy and leakage-avoidance achievements of the previous methods, but with almost no training time. Our method is very scalable, and can be run on data sets orders of magnitude larger than prior state-of-the-art on a simple laptop. Moreover, because we employ kernel density estimates, we can store the model as a coreset of the original data -- which we believe is a first for generative modeling -- and as a result, require significantly less space as well. Our code is available here: https://github.com/tabkde/tabkde-main
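The combination of a copula transformation and kernel density sampling can be sketched roughly as below. This is an illustrative sketch under stated assumptions, not the TabKDE code: it handles numerical columns only, uses a plain empirical-CDF (rank) transform, and the bandwidth value and all function names are hypothetical.

```python
import random
from bisect import bisect_left

def to_copula_space(column):
    """Map each value to its normalized rank in (0, 1) (an empirical-CDF transform)."""
    ordered = sorted(column)
    n = len(column)
    return [(bisect_left(ordered, v) + 0.5) / n for v in column]

def from_copula_space(u, ordered):
    """Invert the transform by reading off the empirical quantile."""
    idx = min(int(u * len(ordered)), len(ordered) - 1)
    return ordered[idx]

def sample_rows(table, n_samples, bandwidth=0.05, seed=0):
    """Draw rows from a Gaussian KDE fitted in copula space (numerical columns only)."""
    rng = random.Random(seed)
    columns = list(zip(*table))
    ordered = [sorted(col) for col in columns]
    copula_rows = list(zip(*(to_copula_space(col) for col in columns)))
    samples = []
    for _ in range(n_samples):
        center = rng.choice(copula_rows)            # pick a kernel center
        row = []
        for u, col_sorted in zip(center, ordered):
            noisy = u + rng.gauss(0.0, bandwidth)   # perturb with the Gaussian kernel
            noisy = min(max(noisy, 0.0), 1.0 - 1e-9)
            row.append(from_copula_space(noisy, col_sorted))
        samples.append(tuple(row))
    return samples

table = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0), (4.0, 40.0)]
synthetic = sample_rows(table, n_samples=5)
```

Because sampling only needs the kernel centers, "training" here amounts to sorting each column, which is consistent with the near-zero training time the abstract reports. This sketch maps back to observed per-column values, so it illustrates the pipeline shape rather than the leakage-avoidance behavior.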

2605.17641 2026-05-19 cs.AI cs.CL

Causal Intervention-Based Memory Selection for Long-Horizon LLM Agents

Saksham Sahai Srivastava

Affiliations: School of Computing, University of Georgia, Athens, Georgia, USA

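AI summary: This paper proposes Causal Memory Intervention (CMI), which uses causal reasoning to select long-term memories for large language models, improving answer quality and robustness, and introduces the Causal-LoCoMo benchmark for evaluation.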

Comments 12 pages, 3 figures, 3 tables

Abstract

Long-horizon LLM agents rely on persistent memory to support interactions across sessions, yet existing memory systems often retrieve context using semantic similarity or broad history inclusion, treating retrieved memories as uniformly useful. This assumption is fragile because memories may be topically related while remaining irrelevant, stale, or misleading. We propose Causal Memory Intervention (CMI), a causal memory-selection technique that estimates how candidate memories affect the model's answer under controlled interventions, selecting memories that improve task performance while suppressing unstable, irrelevant, or harmful ones. To evaluate this setting, we introduce Causal-LoCoMo, a causally annotated benchmark derived from long conversational data, where each example contains a user request, a structured memory bank, useful memories, irrelevant distractors, and synthetic harmful memories. We compare CMI against vector, graph, reflection, summary, full-history, and no-memory baselines. Results show that CMI achieves a stronger balance between answer quality and robustness to misleading memory, suggesting that reliable long-term memory requires selecting context based on causal usefulness rather than relevance alone. The full framework, benchmark construction code, and experimental pipeline are available at https://github.com/Saksham4796/causal-memory-intervention.
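One simple way to picture intervention-based memory selection is a leave-one-out ablation: remove a memory, re-score the answer, and keep only memories whose removal hurts. The sketch below assumes that form of intervention and a hypothetical `answer_quality` scorer; the abstract does not state which interventions or scoring CMI actually uses.

```python
def causal_memory_scores(memories, answer_quality):
    """Score each memory by the drop in answer quality when it is removed."""
    baseline = answer_quality(memories)
    scores = {}
    for m in memories:
        ablated = [x for x in memories if x != m]   # intervention: delete one memory
        scores[m] = baseline - answer_quality(ablated)
    return scores

def select_memories(memories, answer_quality, threshold=0.0):
    """Keep memories with a positive estimated effect; drop neutral and harmful ones."""
    scores = causal_memory_scores(memories, answer_quality)
    return [m for m in memories if scores[m] > threshold]

def toy_quality(context):
    # Hypothetical stand-in for running the model and grading its answer.
    effect = {
        "user is vegetarian": 1.0,      # useful for a meal-planning request
        "user likes jazz": 0.0,         # topically unrelated distractor
        "user eats steak daily": -1.0,  # stale or misleading memory
    }
    return sum(effect[m] for m in context)

bank = ["user is vegetarian", "user likes jazz", "user eats steak daily"]
scores = causal_memory_scores(bank, toy_quality)
selected = select_memories(bank, toy_quality)
```

A similarity-based retriever could rank all three memories highly for a food-related query. The ablation score separates them by their effect on the answer instead.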

2605.17639 2026-05-19 cs.CL cs.IR

Temporal Decay of Co-Citation Predictability: A 20-Year Statute Retrieval Benchmark from 396M Ukrainian Court Citations

Volodymyr Ovcharov

Affiliations: LEX AI LLC

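AI summary: This paper tests the assumption that co-citation structure is a stable signal in legal information systems. It builds the UA-StatuteRetrieval benchmark and measures co-citation predictability over 20 years and 396 million citations, finding that Adamic-Adar MRR declines 33% on a fixed set of articles and 47% under a train/test temporal split, which confirms genuine temporal decay rather than compositional shift or an evaluation artifact.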

Comments 12 pages, 8 figures, 4 tables. Dataset: https://huggingface.co/datasets/overthelex/ua-statute-retrieval

Abstract

Co-citation structure is widely assumed to provide a stable retrieval signal in legal information systems. We test this assumption longitudinally by constructing UA-StatuteRetrieval, a benchmark that measures co-citation predictability across 20 annual snapshots (2007-2026) of 396 million codex citations from 101 million Ukrainian court decisions. Using a leave-one-out protocol over the full bipartite citation graph, we find that Adamic-Adar MRR declines 33% on a fixed set of articles (from 0.43 to 0.29) and 47% under a train/test temporal split (from 0.51 to 0.27), confirming genuine temporal decay rather than compositional shift or evaluation artifact. The decay is non-uniform: criminal procedure maintains stable co-citation patterns (MRR ~0.40), while civil law degrades from 0.35 to 0.15, coinciding with the 2017 judicial reform. Hub articles (>100K citations) resist decay, but mid-frequency articles (1K-10K) -- the practical retrieval frontier -- lose half their predictability. A BM25 text baseline decays even faster (31%), and embedding drift analysis with E5-large reveals a 4.3% semantic shift in how articles are cited, providing a mechanistic explanation for the observed decay. The benchmark is released at https://huggingface.co/datasets/overthelex/ua-statute-retrieval.
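For readers unfamiliar with the metric, the sketch below shows one way an Adamic-Adar score and a leave-one-out MRR can be computed on a bipartite decision-to-article graph. The protocol details are assumptions made for illustration (the whole held-out decision is removed from the graph, and candidates are scored against the remaining articles it cites); the benchmark's exact procedure may differ.

```python
import math
from collections import defaultdict

def adamic_adar(a, b, cited_by, cites):
    """Sum 1/log(degree) over decisions that cite both articles a and b."""
    shared = cited_by[a] & cited_by[b]
    return sum(1.0 / math.log(len(cites[d])) for d in shared)

def leave_one_out_mrr(decisions):
    """decisions maps a decision id to the set of article ids it cites."""
    articles = set().union(*decisions.values())
    reciprocal_ranks = []
    for held_out, cited in decisions.items():
        # Rebuild the bipartite graph without the held-out decision.
        cites = {d: arts for d, arts in decisions.items() if d != held_out}
        cited_by = defaultdict(set)
        for d, arts in cites.items():
            for a in arts:
                cited_by[a].add(d)
        for target in sorted(cited):
            context = cited - {target}          # articles still visible to the ranker
            if not context:
                continue
            candidates = sorted(articles - context)
            scores = {c: sum(adamic_adar(c, r, cited_by, cites) for r in context)
                      for c in candidates}
            ranked = sorted(candidates, key=lambda c: -scores[c])
            reciprocal_ranks.append(1.0 / (ranked.index(target) + 1))
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0

decisions = {
    "d1": {"A", "B"},
    "d2": {"A", "B"},
    "d3": {"A", "B", "C"},
    "d4": {"C", "D"},
}
mrr = leave_one_out_mrr(decisions)
```

Running the same computation on each annual snapshot and comparing the resulting MRR values over time is the kind of longitudinal measurement the abstract describes.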

2605.17638 2026-05-19 cs.CV

TouchMap-OR: Multi-View 3D Mapping of Hand-Surface Contacts

Sophokles Ktistakis, Rui Wang, Bastian Grande, Hugo Sax

Affiliations: ETH Zurich; Institute for Anesthesiology and Perioperative Medicine, University Hospital Zurich; Department of Public and Global Health, University of Zurich

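AI summary: This paper presents TouchMap-OR, a multi-view RGB-D vision system that reconstructs identity-resolved hand-surface contacts in operating rooms. It uses the semantic structure of the clinical environment to infer when and where contacts occur, fuses multi-view hand reconstructions with tracked clinicians to obtain consistent hand trajectories, and builds a semantic 3D model of the operating room to map those trajectories to specific surfaces.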

Abstract

Hand-surface interactions between clinicians, patients, and medical equipment play a central role in pathogen transmission during medical procedures. However, these interactions remain largely unobserved, as current infection-prevention practices rely on manual observation and cannot reconstruct detailed contact histories. In this work we formulate the problem of identity-resolved hand-surface interaction reconstruction in operating rooms and introduce TouchMap-OR, a multi-view RGB-D vision system that models clinicians, articulated hand geometry, and the semantic structure of the clinical environment to infer when and where contacts occur. The system reconstructs globally consistent multi-person 3D skeleton tracks across cameras while estimating articulated MANO hand meshes from RGB observations aligned to depth data. Multi-view hand reconstructions are fused and associated with tracked clinicians to obtain consistent left and right hand trajectories. A semantic 3D model of the operating room is built from multi-view segmentation and depth fusion, enabling reconstructed hand trajectories to be mapped to specific surfaces, including medical equipment, movable objects, and patient body sites. Temporal hand-surface proximity is used to infer contact episodes describing which clinician touched which surface and when. We evaluate TouchMap-OR on recordings from three real anesthesia inductions with manually annotated contact events. TouchMap-OR achieves 0.75 binary contact F1, outperforming tracking-based baselines while maintaining comparable multi-person tracking accuracy and achieving 0.96 identity attribution accuracy.
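The last stage, inferring contact episodes from temporal hand-surface proximity, can be illustrated with a simple threshold-and-duration rule. The distance threshold, minimum duration, and function name below are assumptions chosen for the example; the abstract does not give the system's actual criteria.

```python
def contact_episodes(distances, fps, max_dist=0.02, min_duration=0.2):
    """Group consecutive frames whose hand-surface distance is <= max_dist (metres).

    Returns (start_time, end_time) pairs in seconds for runs lasting at least
    min_duration seconds; shorter runs are treated as noise and dropped.
    """
    episodes, start = [], None
    # A trailing sentinel distance closes a run that lasts until the final frame.
    for i, d in enumerate(list(distances) + [float("inf")]):
        if d <= max_dist and start is None:
            start = i                                   # a contact run begins
        elif d > max_dist and start is not None:
            if (i - start) / fps >= min_duration:
                episodes.append((start / fps, i / fps))
            start = None                                # the run has ended
    return episodes

# Per-frame distance between one clinician's hand and one surface, sampled at 10 fps.
dists = [0.30, 0.10, 0.01, 0.01, 0.015, 0.20, 0.01, 0.25]
episodes = contact_episodes(dists, fps=10)
```

Here the three-frame run is kept as one episode and the isolated single-frame touch is discarded. Applying this per (clinician, hand, surface) triple would give the "who touched what, and when" records the abstract describes.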

2605.17633 2026-05-19 cs.CV cs.AI

SparseSAM: Structured Sparsification of Activations in Segment Anything Models

Hoai-Chau Tran, Chi H. Nguyen, Duy M. H. Nguyen, Mathias Niepert, Fan Lai, Khoa D. Doan

Affiliations: University of Illinois at Urbana-Champaign; College of Engineering & Computer Science, VinUniversity; VinUni-Illinois Smart Health Center, VinUniversity; DFKI; Max Planck Research School for Intelligent Systems (IMPRS-IS); University of Stuttgart

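AI summary: This paper proposes SparseSAM, a training-free structured sparsification framework that jointly accelerates attention and MLP layers while preserving token identity, speeding up inference and reducing memory use while maintaining quality.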

Abstract

The Segment Anything Model (SAM) achieves strong open-vocabulary segmentation, but its ViT-based image encoders dominate inference latency and memory. Existing activation compression methods, such as token merging, reduce the number of tokens to process, yet introduce non-trivial runtime overhead and suffer a catastrophic quality drop under high compression. Other methods that apply sparse attention focus on attention alone, leaving the MLP fully dense and capping the achievable speedup. We propose SparseSAM, (i) a training-free structured sparsification framework that jointly accelerates attention and MLP layers while preserving token identity. SparseSAM introduces (ii) Stripe-Sort Attention, which uses a deterministic Z-order permutation to transform dense attention into static hardware-friendly sparse patterns, eliminating dynamic masking overhead. SparseSAM further introduces (iii) a Residual-Consistency MLP that routes only informative tokens through the MLP while propagating the remaining tokens through the residual pathway. Across four segmentation benchmarks, SparseSAM loses only 0.004 mIoU at a density of 0.4 and 0.021 mIoU at 0.3, a 2.10x reduction in accuracy loss versus recent token-merging methods, while achieving 2x faster inference and 2.8x memory reduction.
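The Z-order permutation mentioned in this abstract is a standard construction, sketched below for a grid of image tokens. Only the permutation itself is shown; how Stripe-Sort Attention builds its sparse pattern on top of it is not described in the abstract, and the function names here are illustrative.

```python
def morton_code(row, col, bits=16):
    """Interleave the bits of (row, col) into a single Z-order (Morton) index."""
    code = 0
    for b in range(bits):
        code |= ((col >> b) & 1) << (2 * b)        # column bits go to even positions
        code |= ((row >> b) & 1) << (2 * b + 1)    # row bits go to odd positions
    return code

def z_order_permutation(height, width):
    """Return row-major token indices reordered along the Z-order curve."""
    tokens = [(r, c) for r in range(height) for c in range(width)]
    tokens.sort(key=lambda rc: morton_code(*rc))
    return [r * width + c for r, c in tokens]

perm = z_order_permutation(4, 4)
# perm == [0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15]
```

Tokens that are neighbors in the 2D grid tend to land close together in the reordered sequence (each run of four indices above is a 2x2 patch), and the ordering is fixed ahead of time, which is what makes a static sparsity pattern possible.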

2605.17626 2026-05-19 cs.LG cs.SE

Verifier-Guided Code Translation via Meta-Step Decoding

Tianyang Zhou, Somesh Jha, Mihai Christodorescu, Kirill Levchenko, Varun Chandrasekaran

Affiliations: University of Illinois Urbana-Champaign; University of Wisconsin–Madison; Google

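AI summary: This work proposes DTV, a meta-step decoding framework that integrates verifier calls into generation, raising code-translation pass rates while using fewer tokens.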

Comments 31 pages, 8 figures

Abstract

Test-time scaling is an important mechanism for improving large language models, especially on tasks with deterministic verifiers. Code translation is a canonical example: the source program constrains valid outputs, while compilers, type checkers, and behavioral checks provide exact pass/fail feedback. Existing approaches typically apply these verifiers only after generation, which is inefficient because early errors corrupt the autoregressive context and are rarely corrected later. We introduce Decoding Time Verification (DTV), a framework that treats structural boundaries as meta steps for verifier-guided decoding. DTV interleaves generation with verifier calls under a state-machine controller that enforces valid prefixes, using structural-boundary checks and structure-aware rollback to prevent error propagation while reducing wasted tokens. We evaluate DTV on C-to-Rust and JavaScript-to-TypeScript translation. Using Qwen3-4B as the primary generator under matched token budgets, DTV improves pass rates from 72.3% to 82.0% on C-to-Rust and from 33.3% to 46.0% on JavaScript-to-TypeScript relative to matched self-refinement baselines, while using fewer tokens per case; the same trend largely transfers to Gemma-4-E4B. In the evaluated cost-matched grid, DTV achieves a more favorable pass-rate-cost tradeoff than post-hoc verification or sampling-based scaling. These results show that verifier-guided decoding is an effective use of inference-time compute for code translation.
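The control loop implied by this abstract (generate up to a structural boundary, verify the prefix, commit or roll back) can be sketched as below. Everything here is a simplified assumption for illustration: `propose_chunk` and `verify_prefix` are hypothetical callbacks standing in for the language model and the compiler or type checker, and the retry policy is invented.

```python
def verified_decode(propose_chunk, verify_prefix, max_chunks, max_retries=3):
    """Interleave generation with verifier calls; commit only chunks whose prefix verifies."""
    accepted = []                        # chunks ending at verified structural boundaries
    while len(accepted) < max_chunks:
        for attempt in range(max_retries):
            chunk = propose_chunk(accepted, attempt)
            if chunk is None:            # the generator signals the end of the program
                return accepted
            if verify_prefix(accepted + [chunk]):
                accepted.append(chunk)   # commit: the prefix is valid up to this boundary
                break
            # Otherwise roll back to the last verified boundary by discarding the chunk.
        else:
            raise RuntimeError("retry budget exhausted at a structural boundary")
    return accepted

# Toy generator: a scripted list of proposals per boundary.
candidates = [
    ["fn add(a: i32, b: i32) -> i32 {"],
    ["return a + b", "a + b"],           # the first proposal fails the toy verifier
    ["}"],
]

def propose(accepted, attempt):
    step = len(accepted)
    if step >= len(candidates):
        return None
    options = candidates[step]
    return options[min(attempt, len(options) - 1)]

def verify(chunks):
    # Arbitrary stand-in rule for a real compiler check: reject chunks starting with `return`.
    return not any(c.startswith("return") for c in chunks)

program = verified_decode(propose, verify, max_chunks=10)
```

The rejected chunk never enters the accepted prefix, so later chunks are generated from a context that has passed verification. Post-hoc verification would only find the problem after the whole program had been emitted.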

2605.17625 2026-05-19 cs.AI

Episodic-Semantic Memory Architecture for Long-Horizon Scientific Agents

Nikola Milosevic

Affiliations: Serbian Institute for Artificial Intelligence Research and Development; Bayer A.G.

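AI summary: This paper presents a dual-process memory architecture that addresses context-window saturation for scientific agents on long-horizon tasks. By separating immediate episodic needs from long-term knowledge consolidation, it improves performance and scalability in large-scale scientific workflows.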

Abstract

As Large Language Models (LLMs) evolve into persistent scientific collaborators, context window saturation has emerged as a critical bottleneck. Scientific workflows involving iterative data analysis and hypothesis refinement rapidly saturate even extended contexts with dense technical content, while monolithic approaches suffer from quadratic cost scaling and cognitive degradation. We evaluate a Dual Process Memory Architecture that decouples immediate episodic needs (constant 10-message window) from long-term consolidated knowledge (growing at approximately 3 tokens/message). Unlike prior social agent memory systems, our domain-specific consolidation addresses contradictory parameter evolution, multi-hop reasoning across experimental phases, and precise technical fact retention. Through large-scale evaluation spanning 15,000 messages with cross-model validation across six LLMs from three families (OpenAI, Anthropic, Google), totaling 1,440 queries, we establish three key findings. First, while full-context models fail at 10,000 messages due to context overflow, our system maintains 70-85% accuracy with 1-2 second latency using 62% fewer tokens (45,434 vs 120,000+ limit). Second, cross-model validation reveals architecture-level trade-offs independent of specific LLMs: Dual Process excels at numeric/temporal queries (65-90% accuracy) while RAG excels at historical retrieval (60-85%), suggesting complementary deployment strategies. Third, we identify a "Sim-to-Real" gap where synthetic tests maintain constant memory but realistic workflows exhibit linear growth (about 3 tokens/message), with consolidation quality emerging as the primary scalability bottleneck. The architecture successfully manages profiles with 14,000+ scientific facts (125k tokens), demonstrating that domain-specific memory consolidation enables sustained operation beyond full-context limits.
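The split between a constant-size episodic window and a slowly growing consolidated store can be sketched as follows. This is a toy illustration, not the evaluated system: the class, the rule-based `extract` function, and the "latest value wins" handling of contradictory parameters are all assumptions.

```python
from collections import deque

class DualProcessMemory:
    def __init__(self, window=10):
        self.episodic = deque(maxlen=window)   # constant-size window of recent messages
        self.semantic = {}                     # consolidated facts: name -> latest value

    def add_message(self, message, extract_facts):
        if len(self.episodic) == self.episodic.maxlen:
            evicted = self.episodic[0]         # this message is about to leave the window
            # Consolidate before eviction; newer values overwrite older ones.
            self.semantic.update(extract_facts(evicted))
        self.episodic.append(message)

    def context(self):
        """Prompt context: consolidated facts first, then the recent window."""
        facts = [f"{k} = {v}" for k, v in self.semantic.items()]
        return facts + list(self.episodic)

def extract(msg):
    # Hypothetical extractor: "set <name> to <value>" messages become facts.
    parts = msg.split()
    if len(parts) == 4 and parts[0] == "set" and parts[2] == "to":
        return {parts[1]: parts[3]}
    return {}

memory = DualProcessMemory(window=2)
for msg in ["set lr to 0.1", "run experiment", "set lr to 0.01", "plot loss", "summarize"]:
    memory.add_message(msg, extract)
```

After these five messages the context holds one consolidated fact (`lr = 0.01`, the later of two conflicting values) plus the two most recent messages, so prompt size grows with the number of distinct facts rather than with the message count.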

2605.17624 2026-05-19 cs.CV cs.AI cs.LG

Multi-task learning on partially labeled datasets via invariant/equivariant semi-supervised learning

Miquel Martí i Rabadán, Alessandro Pieropan, Hossein Azizpour, Atsuto Maki

Affiliations: KTH Royal Institute of Technology; Univrses AB

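AI summary: This paper investigates invariant and equivariant semi-supervised learning for training multi-task models on partially labeled datasets. It evaluates FixMatch and its equivariant extension Dense FixMatch on the Cityscapes and BDD100K datasets for object detection and semantic segmentation, and finds that both outperform supervised baselines in most situations, especially when few labeled samples are available.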

Comments https://github.com/miquelmarti/DenseFixMatch

Abstract

We investigate the potential of invariant and equivariant semi-supervised learning for addressing the challenges of training multi-task models on partially labeled datasets with differently structured output tasks. Specifically, we use the popular FixMatch method for invariant semi-supervised learning and its equivariant extension Dense FixMatch. We evaluate their performance on the Cityscapes and BDD100K datasets in the context of the prevalent object detection and semantic segmentation tasks in computer vision. We consider varying sizes of the subsets annotated for each task and different overlaps among them. Our results for both invariant and equivariant semi-supervised learning outperform supervised baselines in most situations, with the most significant improvements observed when fewer labeled samples are available for a task and generally better results for the latter approach. Our study suggests that invariant/equivariant learning is a promising general direction for multi-task learning from limited labeled data.
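As background, the unlabeled-data term of FixMatch can be written in a few lines. The sketch covers only the invariant, image-level case with class probabilities given as plain lists; the 0.95 threshold is the commonly used FixMatch default, and the function name is illustrative rather than taken from the linked repository.

```python
import math

def fixmatch_unlabeled_loss(weak_probs, strong_probs, threshold=0.95):
    """Mean cross-entropy on strongly augmented views against confident pseudo-labels.

    Pseudo-labels come from the weakly augmented views; samples whose maximum
    probability is below the threshold contribute zero to the loss.
    """
    total = 0.0
    for weak, strong in zip(weak_probs, strong_probs):
        confidence = max(weak)
        if confidence >= threshold:
            pseudo_label = weak.index(confidence)
            total += -math.log(strong[pseudo_label])
    return total / len(weak_probs)    # averaged over the whole batch, masked or not

# Two unlabeled samples: predicted class probabilities under weak and strong augmentation.
weak = [[0.97, 0.02, 0.01], [0.50, 0.30, 0.20]]
strong = [[0.60, 0.30, 0.10], [0.10, 0.80, 0.10]]
loss = fixmatch_unlabeled_loss(weak, strong)
```

Only the first sample passes the confidence threshold, so the loss is `-log(0.6) / 2`. In a multi-task setting such a term can be computed per task on the images that lack labels for that task.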

2605.17620 2026-05-19 cs.CV cs.AI cs.LG

SynVA: A Modular Toolkit for Vessel Generation and Aneurysm Editing

Marten J. Finck, Niklas C. Koser, Sarker M. Mahfuz, Tameem Jahangir, Jon E. Wilhelm, Daniel Behme, Naomi Larsen, Wojtek Palubicki, Sylvia Saalfeld, Sören Pirk

Affiliations: Visual Computing and Artificial Intelligence, Kiel University, Germany; Institute for Medical Informatics and Statistics, Kiel University, Germany; Clinic for Neuroradiology, Medical Faculty, Magdeburg University, Germany; Department of Radiology and Neuroradiology, University Hospital Schleswig-Holstein, Germany; Faculty of Mathematics and Computer Science, Adam Mickiewicz University, Poland

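AI summary: This paper presents SynVA, a modular toolkit for vascular mesh generation and anatomically consistent aneurysm synthesis. It combines new flow-matching methods with learning-based approaches to generate realistic vessel geometries and anatomically plausible aneurysms, and provides a large labeled dataset to support medical image analysis.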

Abstract

Intracranial aneurysms (IAs), characterized by unpredictable growth and risk of rupture, are a major cause of stroke and can lead to life-threatening hemorrhages with high mortality and long-term disability. With aging populations, the incidence and overall burden of cerebrovascular diseases are expected to increase, highlighting the need for scalable approaches to analyze complex medical data and improve population-level understanding of these conditions. While digital twins and deep learning offer promising avenues for improving diagnosis, prognosis, and treatment, their effectiveness is limited by the scarcity of large-scale, high-quality medical data and corresponding labels. We present Synthetic VAsculature (SynVA), a modular toolkit for vascular mesh generation and anatomically consistent aneurysm synthesis. SynVA combines novel flow-matching-based methods for generating healthy vessel meshes with learning-based approaches for anatomy-conditioned aneurysm mesh generation - aneurysms are computed from pre-existing vascular geometries rather than being generated in isolation. In addition, we introduce the SynVA procedural model for vascular and aneurysm synthesis based solely on physiological principles and statistical priors, which enables the generation of large-scale datasets (e.g., for the training of mesh-based generative models). To this end, we release a dataset of 50,000 fully labeled mesh samples for a variety of downstream vision tasks, such as semantic segmentation. Extensive quantitative and qualitative evaluations demonstrate that SynVA generates realistic vessel geometries and anatomically plausible aneurysms. Specifically, our experiments indicate that some methods produce aneurysm shapes more aligned with expert human perception while others perform better on quantitative similarity metrics with reconstructions of real aneurysms.

2605.17610 2026-05-19 cs.CV cs.CL

SafeLens: Deliberate and Efficient Video Guardrails with Fast-and-Slow Screening

Shahriar Kabir Nahin, Hadi Askari, Muhao Chen, Anshuman Chhabra

Affiliations: University of South Florida; University of California, Davis

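AI summary: This work proposes SafeLens, a video guardrail framework that uses a fast-and-slow inference architecture for efficient video content moderation. It builds a high-quality dataset and adds structured Chain-of-Thought traces to address the limits of training-time scaling, achieving state-of-the-art performance on real-world and AI-generated video benchmarks while significantly reducing inference cost.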

Abstract

The rapid growth of online video platforms and AI-generated content has made reliable video guardrails a key challenge for safety and real-world deployment. While most videos can be screened through fast pattern recognition, a small subset requires deeper reasoning over temporally complex content and nuanced policy constraints. Existing approaches typically rely on large vision-language models applied uniformly across all inputs, resulting in high inference costs and inefficient allocation of computation. We propose SafeLens, a video guardrail framework that introduces a fast-and-slow inference architecture for efficient and accurate content moderation with variable computational cost across inputs. Additionally, we construct a high-quality dataset by applying influence-guided filtering to the SafeWatch Dataset, retaining only 2.4% of the original data. To further address limitations of training-time scaling, we enable test-time reasoning by augmenting the filtered data with structured Chain-of-Thought traces. Across real-world and AI-generated video benchmarks, SafeLens achieves state-of-the-art performance, outperforming strong open-source video guardrails (e.g., SafeWatch-8B, OmniGuard-7B) and closed-source models (e.g., GPT-5.4, Gemini-3.1-pro) while significantly reducing inference cost, demonstrating that efficient design can be more effective than scaling data or model size alone.
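A fast-and-slow screening loop of the kind described here can be pictured as a confidence-gated cascade. The sketch below is an assumption about the general shape only: the two thresholds, the probability-returning fast model, and the function names are hypothetical, and the abstract does not say how SafeLens decides when to escalate.

```python
def moderate(video, fast_model, slow_model, low=0.1, high=0.9):
    """Return (label, path): the fast model settles confident cases, the rest escalate."""
    p_unsafe = fast_model(video)
    if p_unsafe <= low:
        return "safe", "fast"
    if p_unsafe >= high:
        return "unsafe", "fast"
    return slow_model(video), "slow"     # ambiguous input: pay for deeper reasoning

def toy_fast_model(video):
    # Hypothetical cheap classifier returning the probability that a clip is unsafe.
    return {"clip_a": 0.02, "clip_b": 0.97, "clip_c": 0.55}[video]

def toy_slow_model(video):
    # Hypothetical expensive reasoning model returning a final label.
    return "unsafe"

results = {clip: moderate(clip, toy_fast_model, toy_slow_model)
           for clip in ["clip_a", "clip_b", "clip_c"]}
```

Only the ambiguous clip reaches the slow path, so average inference cost depends on how often the fast model is uncertain rather than on the size of the slow model alone.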

2605.17605 2026-05-19 cs.LG

Venom: A PyTorch Generative Modeling Toolkit

Liang Yan

Affiliations: Paul G. Allen School of Computer Science & Engineering

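AI summary: This paper presents Venom, a PyTorch-based generative modeling toolkit that implements multiple generative modeling families under a unified interface, with readable, reproducible entry points and consistent training and sampling APIs for teaching, prototyping, and lightweight benchmarking.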

Comments Preprints

Abstract

Modern generative modeling has grown into a broad collection of related but often separately implemented paradigms, including denoising diffusion models, score-based stochastic differential equations, flow matching, variational autoencoders, normalizing flows, adversarial models, and energy-based models. For newcomers, this fragmentation makes it difficult to compare training objectives, inference procedures, sampling algorithms, and conditioning mechanisms within a single coherent codebase. We introduce Venom, an educational PyTorch toolkit that implements representative generative modeling families under a unified, MNIST-first interface. Venom emphasizes breadth, readability, reproducible entry points, and consistent training and sampling APIs rather than large-scale performance engineering. The package currently includes diffusion and score-based models, flow matching and one-step generators, variational autoencoders, normalizing flows, generative adversarial networks, and energy-based models. It provides separate training and sampling scripts, classifier and classifier-free guidance examples, bilingual tutorial notebooks, and a model-family organization that supports teaching, prototyping, and lightweight benchmarking.

2605.17601 2026-05-19 cs.RO

From a Single Demonstration to a General Policy for Contact-Rich Manipulation

Xing Li, Oliver Brock

Affiliations: Robotics and Biology Laboratory, Technische Universität Berlin; Science of Intelligence, Research Cluster of Excellence, Berlin; Robotics Institute Germany

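AI summary: This paper presents a Learning from Demonstration (LfD) framework that uses environmental constraints as an inductive bias to achieve one-shot generalization in multi-stage, contact-rich tasks. The method represents a demonstration as a sequence of behaviors that exploit environmental constraints, separating task-general structure (constraint types and their transitions) from instance-specific details (exact trajectories, poses, and local geometries). A four-stage pipeline builds the policy: the robot abstracts a single demonstration into environmental-constraint primitives, disambiguates them through self-guided exploration, incorporates targeted human corrections for out-of-distribution variations, and recovers the abstracted-away details online through compliant interaction. Because the policy follows constraints rather than imitating trajectories, it generalizes across object poses, local geometries, and unmodeled contact dynamics. The method achieves over 90% success on seven real-world multi-stage contact-rich manipulation tasks, establishing environmental constraints as fundamental building blocks for efficient generalization in learning from demonstration.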

Comments 21 pages, 22 figures, 7 tables

Abstract

We present a Learning from Demonstration (LfD) framework that achieves one-shot generalization in multi-stage, contact-rich manipulation tasks. Central to our approach is the utilization of environmental constraints as the inductive bias. By representing a demonstration as a sequence of behaviors that exploit environmental constraints, the robot separates task-general structure -- the constraint types and their transitions -- from instance-specific details such as exact demonstration trajectories, poses, and local geometries. Our four-stage pipeline builds a complete policy on this representation: the robot first abstracts a single demonstration into environmental-constraint primitives, then disambiguates them through self-guided exploration, next assimilates targeted human corrections that handle out-of-distribution variations, and finally recovers the abstracted-away details online through compliant interaction. Because the resulting policy follows constraints rather than mimics trajectories, it generalizes across object poses, local geometries, and unmodeled contact dynamics. We validate our approach on seven real-world multi-stage contact-rich manipulation tasks and achieve over 90% success. These extensive experimental results establish environmental constraints as fundamental building blocks for efficient generalization in learning from demonstration.

2605.17598 2026-05-19 cs.CL

Mixture of Experts for Low-Resource LLMs

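Ori Bar Joseph, Smadar Arvatz, Noam Kayzer, Dan Revital, Sarel Weinberger

Affiliations: National Institute for Testing and Evaluation

AI summary: This paper studies the expert routing behavior of Mixture-of-Experts architectures on low-resource languages. It finds that continual pre-training effectively mitigates routing imbalance and that routing improvements correlate with downstream task gains.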

Abstract

Mixture-of-Experts (MoE) architectures enable efficient model scaling, yet expert routing behavior across underrepresented languages remains poorly understood. We analyze routing dynamics in two architecturally distinct MoE models -- a pure Transformer (Qwen3-30B-A3B) and a hybrid Mamba-Transformer (Nemotron-3-Nano-30B-A3B) -- using Hebrew as a morphologically rich, low-resource testbed. Both pre-trained models exhibit deep-layer routing collapse: usage entropy drops sharply in final layers and tokens concentrate on a narrow expert subset, a pattern largely absent for English. Continual pre-training (CPT) on balanced bilingual data substantially corrects this imbalance, increasing entropy and shifting routing toward shared, language-agnostic experts; supervised fine-tuning (SFT) alone achieves less complete correction. Extending the analysis to Japanese reveals quantitatively consistent collapse signatures, providing cross-linguistic evidence that the phenomenon is a systematic consequence of pre-training underrepresentation rather than any language-intrinsic property. Routing improvements correlate with consistent downstream benchmark gains, positioning routing entropy and expert specialization as principled diagnostics for multilingual capacity in MoE systems.
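
The usage-entropy diagnostic mentioned in this abstract can be illustrated with a small sketch. This is not the authors' code: the function name `usage_entropy`, the normalization by log(number of experts), and the toy routing decisions are all assumptions made for illustration.

```python
import math
from collections import Counter

def usage_entropy(expert_ids, num_experts):
    """Normalized Shannon entropy of expert usage in one layer (1.0 = uniform)."""
    counts = Counter(expert_ids)
    total = len(expert_ids)
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log(p)
    # Dividing by log(num_experts) maps the result to [0, 1] (assumed convention).
    return entropy / math.log(num_experts)

# Hypothetical routing decisions: one chosen expert id per token.
balanced_layer = [0, 1, 2, 3, 0, 1, 2, 3]
collapsed_layer = [2, 2, 2, 2, 2, 2, 2, 3]

balanced = usage_entropy(balanced_layer, num_experts=4)
collapsed = usage_entropy(collapsed_layer, num_experts=4)
print(f"balanced={balanced:.3f} collapsed={collapsed:.3f}")
```

A layer whose tokens concentrate on a narrow expert subset scores well below 1.0, which is the kind of signature the abstract calls routing collapse.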

2605.17593 2026-05-19 cs.RO

Motion-Uncertainty-Aware Next-Best-View Planning for Moving Object Reconstruction

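Karen Li, Mattia Mantovani, Robert J. Wood, Lorenzo Sabattini, Stephanie Gil

Affiliations: Harvard University; University of Modena and Reggio Emilia

AI summary: This paper proposes a motion-uncertainty-aware next-best-view planning framework for reconstructing an unknown rigid object. It uses noisy planar position measurements and depth observations from a mobile robot, estimates and predicts the object state with a fixed-lag Gaussian Process smoother, and uses the resulting belief to generate candidate viewpoints and improve reconstruction completeness.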

Comments This paper is accepted for publication for Robotics: Science and Systems (RSS) 2026

Abstract

Active 3D reconstruction of moving objects requires selecting informative viewpoints while accounting for object motion uncertainty during the decision-to-execution delay. Existing methods address only parts of this problem: next-best-view (NBV) planners for object reconstruction typically optimize surface coverage but assume static objects, while motion-aware active perception for moving targets accounts for target motion but prioritizes tracking or visibility over reconstruction coverage. This work presents a motion-uncertainty-aware NBV framework for reconstructing an unknown rigid object undergoing planar motion, using noisy planar position measurements of the object and depth observations from a mobile robot. The key idea is to evaluate each candidate viewpoint by its expected observation quality over plausible future object states induced by motion and measurement uncertainty, rather than at a single predicted object pose. To obtain this predictive belief, a fixed-lag Gaussian Process smoother estimates and predicts the object state from noisy position measurements. The resulting belief is used to generate candidate viewpoints around the predicted object location, filter them by reachability, and estimate their expected coverage-driven scores. Simulation and real-world experiments demonstrate improved reconstruction completeness over non-predictive NBV and prediction-only tracking methods, bridging coverage-driven active reconstruction and prediction-driven tracking.

2605.17591 2026-05-19 cs.CV

Error-Decomposed Class-Conditional Fusion for Statistically Guaranteed Hard-Category Robust Perception

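Guowei Luo, Ziqi Shi, Zhao Xie

Affiliations: Hefei University of Technology, Hefei, China; Lishui University, Lishui, China

AI summary: This paper proposes Error-Decomposed Class-Conditional Fusion (ED-CCF). By addressing the hard-category reliability problem, it improves performance on critical categories while preserving overall stability, aiming at statistically guaranteed robust perception.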

Comments 14 pages, 8 figures. Preprint

Abstract

Aggregate object detection metrics inherently mask catastrophic and repeatable failures in operationally critical, long-tail minority classes. This paper formally defines this pervasive vulnerability as the Hard-Category Reliability Problem (HCRP): the fundamental architectural challenge of strictly rectifying vulnerable categories without compromising the performance boundaries of stable classes under stringent protocols. To systematically dismantle this limitation, we propose Error-Decomposed Class-Conditional Fusion (ED-CCF), an elegant decision-layer inference framework. Diverging from heuristic global post-processing, ED-CCF projects predictions into a sophisticated quad-state error taxonomy, dynamically activating calibration pathways exclusively upon rigorous empirical justification. On a highly constrained 600-image validation benchmark, isolating cz as the critical vulnerability (HCEC=0.86, BSR=0.14), our framework achieves a targeted breakthrough: it elevates cz mAP50 from 0.089343 to 0.109353 (a massive +22.4% relative surge) while flawlessly preserving the Pareto optimality of global stability (raising all mAP50 from 0.581925 to 0.584864). Backed by exhaustive validation across 50 paired subset trials demonstrating an overwhelming 96% win rate and strict Bonferroni-corrected Wilcoxon significance (p<0.05), this work fundamentally redefines output-level fusion as an auditable, statistically guaranteed paradigm for safety-critical visual perception.
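
The headline figures in this abstract can be checked with a few lines of arithmetic. The sketch below only recomputes relative changes from the mAP50 values quoted above; the helper name `relative_gain` is hypothetical.

```python
def relative_gain(before, after):
    """Relative change (after - before) / before."""
    return (after - before) / before

# mAP50 values as quoted in the abstract.
cz_gain = relative_gain(0.089343, 0.109353)
overall_gain = relative_gain(0.581925, 0.584864)

# A 96% win rate over 50 paired subset trials corresponds to this many wins.
wins = round(0.96 * 50)

print(f"cz: {cz_gain:+.1%}, overall: {overall_gain:+.2%}, wins: {wins}/50")
```

The recomputation matches the stated +22.4% relative gain for cz. The absolute gain behind it is about 0.02 mAP50, and the overall mAP50 changes by roughly half a percent in relative terms.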

2605.17590 2026-05-19 cs.LG math.OC

Form and Function: Machine Unlearning as a Problem of Misaligned States

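Kennon Stewart

Affiliations: Second Street Labs, Detroit, MI, USA; Department of Statistics, University of Michigan, Ann Arbor, MI, USA

AI summary: This paper formulates machine unlearning for online L-BFGS as a counterfactual state-alignment problem. Using state-aware metrics and a counterfactual oracle model, it shows that unlearning is not merely a parameter-correction problem but also requires alignment with a realizable counterfactual optimizer state.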

Abstract

We formulate machine unlearning for online L-BFGS as a counterfactual state-alignment problem. Given an actual event stream and a deletion-edited counterfactual stream, the target of unlearning is the optimizer state that would have arisen had the deleted samples never been processed. We introduce state-aware metrics that separately measure parameter error, memory-operator error, combined state error, and update-direction error. The memory metric compares the inverse-Hessian actions induced by the o-L-BFGS memory, rather than treating curvature pairs as of finite influence. Under convexity assumptions, we derive a recursive bound on counterfactual state deviation. We then evaluate a state-aware benchmark of deletion interventions, including memory-only and parameter-only corrections, against a counterfactual oracle model. These results show that unlearning for online L-BFGS is not merely a parameter-correction problem: it requires alignment with a realizable counterfactual optimizer state.

2605.17588 2026-05-19 cs.CV

MSIQ: Moment-based Scale-Invariant Quality Measure for Single Image Super-Resolution

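Leonid Bedratyuk

Affiliations: Khmelnytsky National University, Faculty of Information Technology, Ukraine

AI summary: This paper proposes MSIQ, a moment-based scale-invariant quality measure for evaluating single image super-resolution (SISR) results. By comparing the normalized central geometric moments of two images, it compares images of different spatial resolutions directly, is mathematically deterministic, and has an analytical form. It targets two weaknesses of traditional metrics: they do not assess preservation of geometric structure, and forced resizing introduces extra error.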

Comments 23 pages

Abstract

Assessing the quality of single image super-resolution (SISR) results remains an open methodological problem. Common full-reference metrics (PSNR, SSIM, LPIPS) do not explicitly evaluate the preservation of the geometric structure of images, which is critical for the correctness of scale-based reconstruction. In addition, they require the forced alignment of images to the same size (forced resizing), which introduces an external interpolation error into the evaluation process. This paper proposes a diagnostic scale-invariant quality measure, MSIQ (Moment-based Scale-Invariant Quality), based on the comparison of normalized central geometric moments of two images. MSIQ enables direct comparison of images with different spatial resolutions without resizing, is mathematically deterministic (model-free), and has an analytical form. To provide a theoretical basis for the approach, we introduce a conceptual distinction between the ability of metrics to monotonically track degradation (tracking ability) and their geometric selectivity (geometric specificity). The experimental validation confirmed the stability of MSIQ under uniform scaling and, at the same time, revealed the high sensitivity of traditional metrics to the choice of interpolation method. The results show that MSIQ has pronounced geometric selectivity: the proposed measure effectively separates geometric deformations from non-geometric artifacts, in particular JPEG compression, unlike pixel-based and perceptual metrics. It is also shown that the response of MSIQ to structural perturbations remains stable across different classes of SR algorithms, including DNN models with different architectures. The proposed measure is a complementary diagnostic tool for domains where geometric fidelity has priority, in particular medical imaging and remote sensing.
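
To make the idea of comparing normalized central geometric moments concrete, here is a minimal sketch. The normalization eta_pq = mu_pq / mu_00^(1 + (p + q)/2) is the standard textbook definition. The distance `moment_gap`, the choice of second-order moments, and the toy images are assumptions for illustration; the abstract does not give the actual MSIQ formula.

```python
def normalized_central_moment(image, p, q):
    """eta_pq = mu_pq / mu_00 ** (1 + (p + q) / 2) for a 2D list of intensities."""
    m00 = sum(sum(row) for row in image)
    x_bar = sum(x * v for row in image for x, v in enumerate(row)) / m00
    y_bar = sum(y * v for y, row in enumerate(image) for v in row) / m00
    mu_pq = sum(
        (x - x_bar) ** p * (y - y_bar) ** q * v
        for y, row in enumerate(image)
        for x, v in enumerate(row)
    )
    return mu_pq / m00 ** (1 + (p + q) / 2)

def upscale(image, factor):
    """Nearest-neighbour upscaling by pixel replication."""
    return [
        [v for v in row for _ in range(factor)]
        for row in image
        for _ in range(factor)
    ]

def moment_gap(image_a, image_b, orders=((2, 0), (0, 2), (1, 1))):
    """Hypothetical distance: largest absolute gap between normalized moments."""
    return max(
        abs(normalized_central_moment(image_a, p, q)
            - normalized_central_moment(image_b, p, q))
        for p, q in orders
    )

# Toy 16x16 image: a bright 8x4 rectangle on a dark background.
low_res = [[1.0 if 4 <= x < 12 and 6 <= y < 10 else 0.0 for x in range(16)]
           for y in range(16)]
high_res = upscale(low_res, 4)                                       # uniform 4x scaling
stretched = [[v for v in row for _ in range(2)] for row in low_res]  # x-only stretch

same_shape_gap = moment_gap(low_res, high_res)
deformed_gap = moment_gap(low_res, stretched)
print(f"uniform scaling: {same_shape_gap:.4f}, x-stretch: {deformed_gap:.4f}")
```

The two images of different resolution are compared without resizing either one. Uniform scaling leaves the normalized moments nearly unchanged (up to discretization), while the anisotropic stretch produces a much larger gap.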

2605.17584 2026-05-19 cs.CV

VVitCutLER: Towards Unsupervised Object Detection and Segmentation in Videos

Zhijing Lu, Khurram Azeem Hashmi, Didier Stricker, Muhammad Zeshan Afzal

Affiliations: RPTU University of Kaiserslautern-Landau; German Research Center for Artificial Intelligence

AI summary: This work proposes the VVitCutLER framework, which uses temporal consistency to improve the quality of pseudo-labels in videos, improving object detection and instance segmentation performance and reducing temporal instability.

Comments 11 figures, CVPR workshop

Abstract

Unsupervised pixel-level video understanding remains challenging in real-world scenarios, where motion blur, occlusion, and fast object dynamics often cause temporal drift and flickering pseudo-labels. We propose VVitCutLER, an unsupervised framework for video object detection and instance segmentation, which improves the quality of pseudo-labels through temporal consistency. Our core contribution is VitCut, a temporally stable pseudo-label generator that reduces error accumulation during field degradation through cross-frame region consistency. Meanwhile, VitCut uses a distillation decoder to achieve effective instance mask prediction. Then, based on VitCut, VVitCutLER further integrates cross-frame feature aggregation to enhance video-level robustness. Extensive experiments on standard video benchmarks demonstrate that VVitCutLER significantly improves detection and segmentation performance while reducing temporal instability. These results highlight the importance of temporally consistent supervision for robust pixel-level video understanding.

2605.17583 2026-05-19 cs.CV

AgentSteerTTS: A Multi-Agent Closed-Loop Framework for Composite-Instruction Text-to-Speech

Bin Kang, Shaoguo Wen, Yang Fan, Shunlong Wu, Junjie Wang, Yulin Li, Junzhi Zhao, Junle Wang, Zhuotao Tian

Affiliations: University of Chinese Academy of Sciences; Shenzhen Loop Area Institute; Tencent Turinglab; Tsinghua University; Southwest Jiaotong University

AI summary: This paper proposes AgentSteerTTS, a multi-agent closed-loop framework that combines an adversarial disentanglement agent, a Dual-Stream Anchoring Controller, and a Fast-Slow Feedback Agent for intent-faithful expressive control of composite instructions. Experiments on a composite-instruction benchmark and public test sets show significant improvements.

Comments Project page: https://kane2kang.github.io/AgentSteerTTS/

Abstract

While existing text-to-speech (TTS) models exhibit high expressiveness, fine-grained control over composite instructions remains challenging due to the structural mismatch between discrete textual intents and continuous acoustic realizations. Inspired by human cognitive decoupling, we introduce AgentSteerTTS, a multi-agent closed-loop framework designed for intent-faithful expressive control of composite instructions. First, in our framework, an adversarial disentanglement agent mitigates speaker-emotion leakage by learning separable identity and emotion-prosody subspaces with leakage-suppressing regularization. Next, a Dual-Stream Anchoring Controller grounds abstract intents using a large-scale acoustic prototype library: a Retrieval Agent selects expressive anchors, while a Synthesis Agent fuses them into continuous control vectors via gated attention. Finally, a Fast-Slow Feedback Agent refines output intensity through latent gradient correction and resolves semantic-acoustic mismatches using high-level perceptual critique. Experiments on a composite-instruction benchmark and public test sets show that AgentSteerTTS yields consistent and significant improvements over the baselines, demonstrating the effectiveness of the proposed method.

2605.17582 2026-05-19 cs.LG cs.CE

Scale-Equivariant Generative Forecasting: Weight-Tied Dilated Convolutions, Wavelet Scattering Inputs, and Spectral-Consistency Training for Self-Similar Time Series

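Andrea Morandi

Affiliations: Cisco Systems, Inc.

AI summary: This paper proposes a scale-equivariant generative forecasting method for self-similar time series that combines weight-tied dilated convolutions, wavelet scattering inputs, and spectral-consistency training, and reports strong results on S&P 500 daily returns.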

Abstract

Many natural and engineered time series -- equity returns, climate anomalies, turbulent velocities, neural recordings, packet-level network traffic -- are approximately self-similar: their horizon-$T$ distribution is tied to the horizon-$1$ distribution by one scaling exponent $H$. Standard deep generative sequence models (transformers, dilated TCNs, the WaveNet family) ignore this. Their receptive fields are wide, but kernel parameters live independently at every dilation level, yielding a multi-scale architecture, not a scale-equivariant one. We make three contributions. First, we give a precise definition of discrete scale equivariance for 1D causal networks and prove that dyadic dilation commutes (up to boundary effects) with any dilated-convolution stack whose kernel weights are shared across levels. Tying the kernel shrinks the convolutional parameter budget by an $L$-fold factor (where $L$ is depth) and hard-wires self-similarity in as an inductive bias. Second, we wrap this Scale-Equivariant WaveNet (SE-WaveNet) backbone in three components that carry the same prior: a one-level Daubechies-4 wavelet input, a Hurst-FiLM block exposing the local scaling exponent, and a spectral-consistency training term targeting the $|f|^{-(2H+1)}$ power-law spectrum. The head is a conditional normalising flow, chosen to preserve equivariance. Third, on 30 years of S&P 500 daily log-returns, SE-WaveNet samples reproduce the empirical scaling-collapse diagnostic on the Allan-Variance top-25 universe (median $\mathcal{C}^\star = 0.020$), while a vanilla WaveNet at matched capacity does not ($\geq 0.06$). NLL, KS-calibration, and tail energy distance tie or beat the baseline, with $L\times$ fewer convolutional parameters.
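
The claim that dyadic dilation commutes with a weight-shared dilated-convolution stack can be illustrated on a toy case. The sketch below is a simplified reading, not the paper's construction: it uses a purely linear stack (no nonlinearities or gating), models dilation of the time axis by zero-stuffing, and all function names and kernel values are hypothetical.

```python
def causal_dilated_conv(signal, kernel, dilation):
    """y[t] = sum_k kernel[k] * signal[t - k * dilation], zero before t = 0."""
    output = []
    for t in range(len(signal)):
        total = 0.0
        for k, weight in enumerate(kernel):
            index = t - k * dilation
            if index >= 0:
                total += weight * signal[index]
        output.append(total)
    return output

def tied_stack(signal, kernel, dilations):
    """Linear stack of dilated convolutions sharing ONE kernel across levels."""
    for dilation in dilations:
        signal = causal_dilated_conv(signal, kernel, dilation)
    return signal

def zero_stuff(signal):
    """Dyadic dilation of the time axis: x'[2n] = x[n], x'[2n + 1] = 0."""
    stretched = []
    for value in signal:
        stretched.extend([value, 0.0])
    return stretched

shared_kernel = [0.5, -0.25]
x = [1.0, 2.0, -1.0, 0.5, 3.0, -2.0]

# Dilating the input and moving every level up by one octave ...
lhs = tied_stack(zero_stuff(x), shared_kernel, dilations=[2, 4])
# ... matches dilating the output of the original stack.
rhs = zero_stuff(tied_stack(x, shared_kernel, dilations=[1, 2]))

tied_params = len(shared_kernel)        # one kernel for all levels
untied_params = 2 * len(shared_kernel)  # one kernel per level (L = 2 here)
print(lhs == rhs, tied_params, untied_params)
```

Because the kernel is shared, the octave-shifted stack reuses the same weights. With per-level kernels the shifted stack would need different parameters, and the parameter count would grow by the depth factor L mentioned in the abstract.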

2605.17580 2026-05-19 cs.AI

ECG-WM: A Physiology-Informed ECG World Model for Clinical Intervention Simulation

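Zhikang Chen, Yue Wang, Sen Cui, Yu Zhang, Changshui Zhang, Tianling Ren, Tingting Zhu

Affiliations: University of Oxford; Tsinghua University; Southern University of Science and Technology

AI summary: This paper proposes an ECG world model for action-conditioned prediction of cardiac electrophysiology. It integrates physiological ordinary differential equation priors to improve the physiological plausibility of post-intervention ECG trajectories, and introduces an uncertainty-aware evaluation strategy for more reliable assessment of candidate interventions.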

Abstract

Electrocardiogram (ECG)-based models have achieved strong performance in diagnostic tasks, yet they remain limited in modeling how cardiac dynamics evolve under external interventions. In particular, existing approaches focus primarily on static prediction and lack mechanisms to capture ECG variations under different pharmacological conditions. In this work, we propose an ECG World Model for action-conditioned predictive simulation of cardiac electrophysiology. Moving beyond disjoint pipelines, our framework features a principled integration of physiological ordinary differential equation (ODE) priors into latent diffusion dynamics via energy regularization. This structural constraint enables the synthesis of physiologically plausible post-intervention ECG trajectories while effectively mitigating generative hallucinations. Building on this simulation process, we introduce an uncertainty-aware evaluation strategy that leverages the stochasticity of diffusion sampling to characterize both the expected clinical risk and its variability, allowing a more reliable comparative assessment of candidate interventions. We evaluate our method across diverse settings, including controlled drug-response scenarios and real-world clinical records. Beyond standard waveform metrics, experimental results demonstrate improved risk calibration and strong alignment with expert-informed treatment preferences. These results establish our approach as a robust foundation for safe and intervention-aware clinical decision support.

2605.17577 2026-05-19 cs.CV

TAME: Test-Time Adversarial Prompt Tuning via Mixture-of-Experts for Vision-Language Models

Xin Wang, Yixu Wang, Jiaming Zhang, Ruofan Wang, Jiaqi Yu, Kai Chen, Jingjing Chen, Xingjun Ma, Yu-Gang Jiang

Affiliations: Fellow, IEEE

AI summary: This paper proposes TAME, a Mixture-of-Experts-based test-time defense that improves the robustness of vision-language models to adversarial perturbations while preserving generalization on clean samples.

Abstract

Large-scale pre-trained Vision-Language models (VLMs), such as CLIP, exhibit strong zero-shot generalization, yet remain highly vulnerable to imperceptible adversarial perturbations, raising serious safety concerns for open-world deployment. To enhance robustness without requiring downstream task-specific retraining, we propose TAME, a novel test-time defense. Building upon our prior Test-Time Adversarial Prompt Tuning (TAPT), TAME introduces an architectural reformulation by replacing TAPT's single adaptive prompt with an input-conditioned Mixture-of-Experts (MoE) framework, enabling more expressive and adaptive defense. Specifically, TAME maintains a bank of learnable expert prompts and employs an input-dependent routing mechanism to aggregate a customized prompt mixture for each unlabeled test sample at inference time. This test-time defense mechanism is driven by three unsupervised objectives: (1) multi-view prediction entropy minimization, (2) layer-wise alignment of visual token statistics to precomputed clean and adversarial reference distributions, and (3) MoE regularization for balanced expert utilization and prompt diversity. We evaluated TAME on 11 benchmark datasets, including ImageNet and 10 additional zero-shot datasets. The results show that TAME improves the zero-shot adversarial robustness of the original CLIP by at least 49.1% under AutoAttack while largely preserving generalization on clean samples. TAME also consistently outperforms existing adversarial prompt tuning methods across multiple prompt designs, yielding an average robustness gain of at least 30.2%.
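
Two ingredients named in this abstract, an input-dependent mixture over a bank of expert prompts and a multi-view prediction-entropy objective, can be sketched as follows. This is a rough illustration under assumptions: softmax routing weights, a weighted sum of prompt vectors, and the entropy of the view-averaged prediction are plausible readings, not details confirmed by the abstract, and all names and numbers are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mix_prompts(expert_prompts, routing_scores):
    """Weighted sum of expert prompt vectors under softmax routing weights."""
    weights = softmax(routing_scores)
    dim = len(expert_prompts[0])
    return [sum(w * prompt[i] for w, prompt in zip(weights, expert_prompts))
            for i in range(dim)]

def multiview_entropy(view_probs):
    """Entropy of the class distribution averaged over augmented views."""
    num_views = len(view_probs)
    mean = [sum(p[c] for p in view_probs) / num_views
            for c in range(len(view_probs[0]))]
    return -sum(p * math.log(p) for p in mean if p > 0)

# Hypothetical bank of three 2-dimensional expert prompts.
experts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prompt = mix_prompts(experts, routing_scores=[0.0, 0.0, 0.0])

# Views that agree give a lower objective than views that disagree.
agree = multiview_entropy([[0.9, 0.1], [0.9, 0.1]])
disagree = multiview_entropy([[0.9, 0.1], [0.1, 0.9]])
print(prompt, round(agree, 3), round(disagree, 3))
```

Minimizing such an entropy at test time pushes the mixed prompt toward predictions that are both confident and consistent across views, without needing labels.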

2605.17575 2026-05-19 cs.LG cs.AI

UniAlign: A Model-Agnostic Framework for Robust Network Traffic Classification under Distribution Shifts

Tongze Wang, Xiaohui Xie, Wenduo Wang, Chuyi Wang, Yong Cui

Affiliations: Institute for Network Sciences and Cyberspace, Tsinghua University; Department of Computer Science and Technology, Tsinghua University

AI summary: This paper proposes UniAlign, a model-agnostic framework that improves the robustness of deep-learning network traffic classification models under distribution shifts through domain alignment fine-tuning and stable model ensembling. Experiments show that it outperforms existing baselines in both accuracy and F1 score.

Abstract

Network traffic classification (NTC) models often suffer severe performance degradation when deployed in real-world environments due to distribution shifts caused by changing network conditions. Existing robustness-enhancing approaches are commonly coupled to specific model architectures or data settings, fail to generalize to state-of-the-art raw-byte-based NTC models, or incur significant training overhead. In this paper, we propose UniAlign, a novel model-agnostic framework that improves the robustness of deep learning-based NTC models under distribution shifts. UniAlign combines domain alignment fine-tuning, which encourages the learning of domain-invariant traffic representations across heterogeneous network conditions, with stable model ensembling, which enhances inference robustness by aggregating checkpoints within a flat loss region. The framework can be seamlessly integrated into existing supervised NTC models without requiring specific feature modalities or introducing non-constant additional training costs. We evaluate UniAlign on three public datasets covering diverse distribution shifts, including encryption schemes, data collection devices, and attack behaviors. Experimental results on two representative NTC models demonstrate that, compared with standard training, UniAlign improves average classification accuracy by 2.51% and average F1 score by 2.71%, outperforming the strongest baseline by 1.45% in accuracy and 1.69% in F1 score, while requiring only 12.4%-53.9% of the training time of all NTC-specific baselines.

2605.17573 2026-05-19 cs.CV cs.CR

Deepfake Detection in Social Media: A Temporal Artifact Analysis Using 3D Convolutional Neural Networks

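Mohammadreza Rashidi, Raja Hashim Ali, Sami Ur Rahman

Affiliations: Department of Computer Science AI; Media Analysis Lab Berlin, Germany

AI summary: This paper proposes a 3D convolutional neural network detector based on R3D-18 that combines a binary cross-entropy loss with a temporal-consistency regularizer to improve deepfake detection accuracy on high-quality fakes and in cross-dataset settings. It argues that temporal artifacts are more robust than spatial ones to social-media re-encoding.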

Comments 13 pages, 6 figures

Abstract

Synthetic facial videos have proliferated across social media faster than platform moderation can respond, raising the cost of disinformation and identity-based attacks. Frame-level deepfake detectors degrade sharply as generator quality increases; high-quality 128x128 GAN output cuts spatial-only accuracy by five percentage points while leaving temporal inconsistencies largely intact. We address this gap with a 3D Convolutional Neural Network detector based on R3D-18, trained with a composite loss that combines binary cross-entropy with a temporal-consistency regularizer. The model processes 16-frame clips from the DeepfakeTIMIT dataset and is initialized from Kinetics-400 action-recognition weights. We report 92.8% accuracy on intra-dataset evaluation at 128x128 resolution; cross-dataset transfer to FaceForensics++ without fine-tuning reaches 76.4%, rising after minimal fine-tuning. Ablation studies show that transfer learning contributes 7.2 percentage points and face tracking adds 3.5 points, while temporal consistency regularization provides additional gains on high-quality fakes. The results establish that temporal artifacts generalize more broadly than spatial ones, providing a detection signal that survives social-media re-encoding.
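
The composite loss described in this abstract, binary cross-entropy plus a temporal-consistency regularizer, might look roughly like the sketch below. The abstract does not specify the regularizer, so the penalty used here (mean squared change between consecutive per-frame feature vectors), the weight of 0.1, and all names are assumptions.

```python
import math

def binary_cross_entropy(prob_fake, label):
    """Standard BCE for one clip; label is 1 for fake and 0 for real."""
    eps = 1e-7
    p = min(max(prob_fake, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def temporal_consistency_penalty(frame_features):
    """Assumed regularizer: mean squared change between consecutive frames."""
    gaps = [
        sum((a - b) ** 2 for a, b in zip(prev, nxt)) / len(prev)
        for prev, nxt in zip(frame_features, frame_features[1:])
    ]
    return sum(gaps) / len(gaps)

def composite_loss(prob_fake, label, frame_features, weight=0.1):
    """BCE plus a weighted temporal term (the weight is a hypothetical value)."""
    return (binary_cross_entropy(prob_fake, label)
            + weight * temporal_consistency_penalty(frame_features))

# Hypothetical 2-dimensional features for three frames of a clip.
smooth_clip = [[0.0, 1.0], [0.1, 1.0], [0.2, 1.0]]
flickering_clip = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

smooth_loss = composite_loss(0.8, 1, smooth_clip)
flicker_loss = composite_loss(0.8, 1, flickering_clip)
print(round(smooth_loss, 4), round(flicker_loss, 4))
```

With the classification term held fixed, the regularizer alone separates the two clips, which shows how the temporal term adds a signal that does not depend on per-frame appearance.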

2605.17571 2026-05-19 cs.CV cs.LG

Stable Routing for Mixture-of-Experts in Class-Incremental Learning

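Zirui Guo, Quan Cheng, Da-Wei Zhou, Lijun Zhang

Affiliations: State Key Laboratory of Novel Software Technology, Nanjing University; School of Artificial Intelligence, Nanjing University

AI summary: This paper studies stable routing for Mixture-of-Experts models in class-incremental learning and proposes StaR-MoE, a stable-routing framework that uses sensitivity-aware routing alignment and asymmetric capacity regularization to improve adaptation to new classes and retention of old-class knowledge.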

Abstract

Class-incremental learning (CIL) requires models to learn new classes sequentially while preserving prior knowledge. Recently, approaches that combine pre-trained models with mixture-of-experts (MoE) have received increasing attention in CIL: they typically expand experts during learning and employ a router to assign weights across experts. However, existing MoE methods often overlook routing drift induced by expert expansion. Once new experts are introduced, the router may reassign samples from earlier classes to newly added experts, thereby perturbing previously established expert compositions and causing interference even when old experts remain frozen. We argue that expandable MoE in CIL requires two complementary properties: stable old-class routing for knowledge preservation and sufficient capacity utilization for new-class adaptation. To this end, we propose Stable Routing for MoE (StaR-MoE), a routing-level framework for expandable MoE in CIL. By incorporating sensitivity-aware routing alignment, StaR-MoE aligns current old-class routing behavior with historical routing distributions through sensitivity-guided constraints. Complementarily, StaR-MoE introduces asymmetric capacity regularization to encourage effective utilization of the expanded expert pool without compromising class-specific routing specialization. Extensive experiments across four standard CIL benchmarks demonstrate that StaR-MoE consistently improves both average and last accuracy over state-of-the-art methods, highlighting the importance of stable routing.
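
Routing drift and the idea of aligning current old-class routing with historical routing distributions can be sketched as a weighted divergence. The abstract does not define the alignment loss, so the KL divergence, the zero-padding of stored distributions to the expanded expert pool, and the per-sample sensitivity weights below are all assumptions for illustration.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over expert distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def routing_alignment_loss(historical, current, sensitivities):
    """Sensitivity-weighted KL between stored and current old-class routing."""
    total = 0.0
    for old_route, new_route, weight in zip(historical, current, sensitivities):
        # Stored distributions predate the new experts, so pad them with zeros.
        padded = old_route + [0.0] * (len(new_route) - len(old_route))
        total += weight * kl_divergence(padded, new_route)
    return total / sum(sensitivities)

# Two old-class samples routed over two experts before expansion.
historical = [[0.7, 0.3], [0.2, 0.8]]
# After adding a third expert: unchanged routing vs. drift to the new expert.
stable = [[0.7, 0.3, 0.0], [0.2, 0.8, 0.0]]
drifted = [[0.2, 0.1, 0.7], [0.1, 0.2, 0.7]]
sensitivities = [1.0, 2.0]  # hypothetical per-sample weights

stable_loss = routing_alignment_loss(historical, stable, sensitivities)
drift_loss = routing_alignment_loss(historical, drifted, sensitivities)
print(round(stable_loss, 4), round(drift_loss, 4))
```

The loss is zero while old-class samples keep their original expert composition and grows once the router reassigns them to the newly added expert, even though the old experts themselves are unchanged.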

2605.17570 2026-05-19 cs.LG cs.CL

How Off-Policy Can GRPO Be? Mu-GRPO for Efficient LLM Reinforcement Learning

Minghao Tian, Yunfei Xie, Chen Wei

Affiliations: Rice University

AI summary: This paper examines how off-policy GRPO can be and proposes Mu-GRPO, which reduces rollout-optimization switching overhead for efficient LLM reinforcement learning while performing well on multiple benchmarks.

Abstract

Group Relative Policy Optimization (GRPO) has been a key driver of recent progress in reinforcement learning with verifiable rewards (RLVR) for large language models, but it is typically trained in a low-staleness, near-on-policy regime that incurs substantial system overhead. We ask a simple question: How off-policy can GRPO be? We show that GRPO-style algorithms can tolerate substantially larger rollout staleness than previously assumed, and propose Mu-GRPO, an RL training framework that organizes training into a small number (e.g., four) of large sequential generation-optimization stages. This design induces high rollout staleness while greatly reducing rollout-optimization switching overhead. To stabilize learning under stale data, Mu-GRPO combines relaxed clipping, which preserves useful stale-rollout gradients, with negative-advantage veto, which removes destabilizing post-trigger suffix updates in negative-advantage responses. Across five language models and multiple math reasoning benchmarks, Mu-GRPO matches or exceeds the performance of standard GRPO while achieving around 2x speedup in wall-clock training time, establishing a substantially improved performance-efficiency trade-off for LLM reinforcement learning.
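
To show why the clipping range matters for stale rollouts, the sketch below computes group-relative advantages and a PPO/GRPO-style clipped surrogate for a single token. It is a generic illustration: reading "relaxed clipping" as a wider upper clip bound is an assumption, the specific bounds are hypothetical, and the negative-advantage veto is not modeled.

```python
import math

def group_relative_advantages(rewards):
    """Standardize rewards within one group of responses to the same prompt."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + 1e-8) for r in rewards]

def clipped_surrogate(ratio, advantage, clip_low, clip_high):
    """min(r * A, clip(r, 1 - clip_low, 1 + clip_high) * A) for one token."""
    clipped = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
    return min(ratio * advantage, clipped * advantage)

# Binary verifiable rewards for a group of four sampled responses.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Hypothetical importance ratio of a token from a stale rollout.
stale_ratio = 1.5
tight = clipped_surrogate(stale_ratio, advantages[0], 0.2, 0.2)
relaxed = clipped_surrogate(stale_ratio, advantages[0], 0.2, 0.6)
print(round(tight, 3), round(relaxed, 3))
```

Under the tight range the objective is pinned at the clip boundary, so this token contributes no gradient. Under the wider range the ratio stays inside the bounds and the stale sample still contributes to the update.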


2605.17566 2026-05-19 cs.CV

Rethinking Point Clouds as Sequences: A Causal Next-Token Predictive Learning Framework


Yumeng Yao, Jingzhi Dong, Haowen Gu, Tao Chen, Zonghan Wu, Xiaoshui Huang, Yazhou Yao

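Affiliations: Nanjing University of Science and Technology; Shanghai Jiao Tong University; Hangzhou City University; East China Normal University

AI summary: This paper proposes PointNTP, which recasts point cloud pre-training as a fully causal, decoder-free latent next-token prediction problem. Point clouds are split into local patches and serialized into a structured 3D token sequence, so structural dependencies are modeled directly without a reconstruction decoder or explicit geometric recovery. Experiments show strong results on multiple downstream tasks.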

Comments: 10 pages, 2 figures. Code will be released upon acceptance



Abstract

With the rapid progress of multimodal foundation models and predictive pre-training, an important open question is how to equip 3D point clouds with a pre-training paradigm that is better aligned with next-token and next-embedding learning. Existing point-cloud self-supervised methods are largely built on masked reconstruction or explicit geometric generation, and thus remain tied to input recovery rather than predictive dependency modeling. In this paper, we introduce PointNTP, which reformulates point cloud pre-training as a fully causal, decoder-free latent Next-Token Prediction problem. Specifically, each point cloud is first partitioned into local patches and serialized into a structured 3D token sequence according to patch-center geometry. The resulting sequence is then modeled by a causal Transformer under prefix-only conditioning, and trained with a shift-based prediction objective stabilized by stop-gradient targets. This design enables the model to learn structural dependencies directly in latent space, without reconstruction decoders or explicit geometric recovery. Extensive experiments demonstrate that the proposed PointNTP is highly competitive across multiple downstream tasks: it achieves 93.8%(+0.5%), 92.6%(+0.3%), and 89.3%(+1.1%) on OBJ_BG, OBJ_ONLY, and PB_T50_RS of ScanObjectNN, respectively; obtains 85.0%(+0.1%) in Cls.mIoU on ShapeNetPart; and reaches 71.1% mAcc on S3DIS Area 5. Overall, decoder-free causal latent prediction provides a simple, scalable, and potentially modality-agnostic paradigm for point-cloud self-supervised learning, offering a new 3D perspective on foundation-style predictive learning for 3D data.
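The serialization and shift-based objective described in this abstract can be sketched as follows. This is an illustration under assumptions, not the PointNTP code: ordering patches by the Morton (Z-order) code of their centers is one possible reading of "according to patch-center geometry", the cosine loss is a stand-in for the unspecified latent objective, and all function names are hypothetical.

```python
import math

def morton_code(ix, iy, iz, bits=10):
    """Interleave the bits of three non-negative integers (Z-order curve)."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b + 2)
    return code

def serialize_patches(centers, grid=1024):
    """Order patch indices by the Morton code of their quantized centers.
    Centers are assumed to lie in the unit cube [0, 1)^3."""
    def key(i):
        x, y, z = centers[i]
        return morton_code(int(x * grid), int(y * grid), int(z * grid))
    return sorted(range(len(centers)), key=key)

def shifted_pairs(sequence):
    """Pair each prefix with the element that follows it, so position t is
    predicted only from positions 0..t-1 (prefix-only conditioning)."""
    return [(sequence[: t + 1], sequence[t + 1]) for t in range(len(sequence) - 1)]

def cosine_loss(pred, target):
    """1 - cosine similarity. In an autograd framework the target embedding
    would be detached (stop-gradient) so only the prediction is trained."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / norm

# Four hypothetical patch centers in the unit cube.
centers = [(0.9, 0.9, 0.9), (0.1, 0.1, 0.1), (0.6, 0.1, 0.1), (0.1, 0.6, 0.1)]
order = serialize_patches(centers)
pairs = shifted_pairs(order)

print("serialized order:", order)
for prefix, target in pairs:
    print("prefix", prefix, "-> predict patch", target)
```

In a real model the prefix would be encoded by a causal Transformer and `cosine_loss` would compare its output with the latent embedding of the next patch.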


2605.17565 2026-05-19 cs.AI cs.CL

Generalization or Memorization? Brittleness Testing for Chess-Trained Language Models


Ethan Tang

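Affiliations: School of Computing and Augmented Intelligence

AI summary: This paper examines whether chess-trained language models generalize or memorize. Its tests indicate that their high benchmark performance stems mainly from pattern matching, and it shows that the LLM-Modulo framework improves chess puzzle solving, arguing that a general LLM paired with an external verifier is a more flexible alternative to training directly on synthetic data.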

Comments: 14 pages, 2 figures, 4 tables, 3 equations



Abstract

Recent work has fine-tuned language models on chess data and reported high benchmark scores as evidence that the resulting models can understand the rules of chess, play full chess games at a professional level, or generate human-readable explanations grounded in expert knowledge. We train KinGPT, a 25M-parameter character-level language model trained only on (position, best-move) pairs, which exceeds 3B-parameter ChessGPT on a 600-puzzle mate-in-N suite and 4B-parameter C1-4B on a 20-theme puzzle benchmark. We examine several claims made in existing literature regarding chess-trained language models and assert that their impressive benchmark performance is largely explained by pattern-matching. We also demonstrate how LLM-Modulo, a verifier-in-the-loop framework, raises RedPajama 3B's best-move accuracy from 1.2% to 21.2% and move generation validity from 19.3% to 95.3% on mate-in-N chess puzzles, comparable to gains achieved from ChessGPT's fine-tuning on chess-specific web corpora at a fraction of the cost. Our results illustrate how pairing a general LLM with an external verifier offers a more flexible alternative to directly training on synthetic data for well-defined domains. We open-source all training/evaluation code, datasets, puzzle samples, and KinGPT model checkpoints for reproducibility.
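The verifier-in-the-loop idea in this abstract can be sketched as a propose, check, and retry loop. This is a toy illustration, not the paper's LLM-Modulo setup: the function `verifier_in_the_loop`, the scripted generator standing in for an LLM, the move strings, and the feedback messages are all hypothetical, and the "position" is just a hand-written set of legal moves rather than a real chess engine.

```python
def verifier_in_the_loop(propose, verify, max_rounds=5):
    """Ask the generator for a move until the verifier accepts one or the
    round budget runs out. Returns (move, rounds_used); move is None on failure."""
    feedback = []
    for round_index in range(1, max_rounds + 1):
        move = propose(feedback)
        ok, message = verify(move)
        if ok:
            return move, round_index
        # Rejected moves and the verifier's critique go back to the generator.
        feedback.append((move, message))
    return None, max_rounds

# Toy stand-in for a mate-in-1 puzzle: made-up legal moves, one of which mates.
LEGAL_MOVES = {"Qh7#", "Kg2", "Rf1"}
MATING_MOVE = "Qh7#"

def verify(move):
    """External verifier: checks legality first, then whether the move mates."""
    if move not in LEGAL_MOVES:
        return False, "illegal move"
    if move != MATING_MOVE:
        return False, "legal but does not deliver mate"
    return True, "ok"

# Scripted generator standing in for an LLM: it never repeats a rejected move.
CANDIDATES = ["Qh9", "Kg2", "Qh7#"]

def propose(feedback):
    rejected = {move for move, _ in feedback}
    for move in CANDIDATES:
        if move not in rejected:
            return move
    return CANDIDATES[-1]

result = verifier_in_the_loop(propose, verify)
print(result)
```

The design point is that validity is enforced by the verifier rather than learned by the generator, which is why move validity can rise without any chess-specific fine-tuning.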


