arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.20624 2026-05-21 cs.CV cs.AI cs.LG

Accelerating Video Inverse Problem Solvers with Autoregressive Diffusion Models

Taesung Kwon, Jonghyun Park, Hyungjin Chung, Jong Chul Ye

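AI summary: This paper proposes the Autoregressive Video Inverse problem Solver (AVIS), which uses an autoregressive diffusion model to restore video in a streaming manner, sharply reducing initial latency and raising throughput while preserving restoration quality. It also proposes an accelerated variant, AVIS Flash, which reaches higher throughput with a favorable efficiency-performance trade-off, paving the way toward real-time deployment.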

Comments Project page is available here: https://avis-project.github.io/

Abstract

Diffusion models provide powerful priors for zero-shot video inverse problems, but their real-time deployment is hindered by two inefficiencies: high initial latency caused by holistic video restoration, and low throughput resulting from multiple VAE passes to enforce measurement consistency in pixel space. To overcome these limitations, we propose Autoregressive Video Inverse problem Solver (AVIS). The AVIS framework leverages autoregressive video diffusion models to restore videos in a streaming manner, naturally eliminating latency bottlenecks. Specifically, AVIS initializes reverse diffusion with a measurement-consistent estimate, reducing the required sampling steps. Compared to leading non-autoregressive solvers, AVIS drastically reduces initial latency from 114s to 4s and increases throughput from 0.71 to 1.18 FPS while achieving superior restoration quality. We further introduce a highly accelerated variant, dubbed AVIS Flash, that enforces measurement consistency solely on the first chunk. AVIS Flash substantially boosts throughput to 5.91 FPS on a single RTX 4090 GPU while maintaining competitive performance and achieving a favorable efficiency-performance trade-off, paving the way toward real-time deployment.
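
The figures quoted in this abstract imply the ratios below. The Python sketch only redoes that arithmetic; it assumes the three throughput numbers were measured under comparable conditions, which the abstract does not state explicitly.

```python
# Figures quoted in the abstract.
baseline_latency_s, avis_latency_s = 114.0, 4.0
baseline_fps, avis_fps, flash_fps = 0.71, 1.18, 5.91

# How many times shorter the initial latency becomes.
latency_reduction = baseline_latency_s / avis_latency_s

# Throughput gains relative to the non-autoregressive baseline.
# Assumption: all three FPS values are directly comparable.
avis_speedup = avis_fps / baseline_fps
flash_speedup = flash_fps / baseline_fps

print(f"latency reduction: {latency_reduction:.1f}x")
print(f"AVIS throughput gain: {avis_speedup:.2f}x")
print(f"AVIS Flash throughput gain: {flash_speedup:.2f}x")
```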

2605.20620 2026-05-21 cs.LG cs.DB cs.GT

Dynamic Shapley Computation

Xuan Yang, Hsi-Wen Chen, Ming-Syan Chen, Jian Pei

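AI summary: This paper proposes the D-Shap framework, which represents Shapley values as a player-by-task matrix to update training-data contribution estimates efficiently in dynamic settings, exploiting the locality of tasks and coalitions for fast updates and self-valuation.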

Abstract

Shapley-based data valuation provides a principled way to quantify the contribution of training data, but its high computational cost makes it impractical in dynamic settings where tasks and training players evolve. Existing methods treat Shapley computation as a one-shot process and collapse contributions into aggregated scores, preventing reuse and requiring recomputation under any change. We introduce a new perspective that represents Shapley values as a player-by-task matrix and formulates dynamic valuation as a structured matrix maintenance problem. We exploit the fact that each task depends on a small subset of training players and that similar tasks yield similar valuations, leading to utility locality and coalition locality. Based on these insights, we propose D-Shap, a dynamic valuation framework that enables efficient updates by modifying only a small portion of the matrix: new task valuations are inferred via structure-aware interpolation, while updates induced by new players are confined to affected local matrix blocks. To eliminate the need for pre-specified evaluation tasks, we introduce self-valuation, which constructs the initial matrix directly from training data, supported by scalable subset reuse and coverage-aware anchor selection. Experiments across diverse models show that D-Shap performs task updates in milliseconds and reduces the cost of player updates by up to three orders of magnitude, while achieving valuation quality competitive with full recomputation.
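
To make the player-by-task matrix view concrete, here is a minimal Python sketch under stated assumptions. It computes exact Shapley values by enumerating permutations, which is only feasible for a handful of players, and it uses a made-up additive utility (`coverage_utility`) in which each task depends on a small set of relevant players. It is not D-Shap: the interpolation, block updates, and self-valuation described in the abstract are not implemented, and all names are hypothetical.

```python
from itertools import permutations
from math import factorial

def shapley_column(players, utility):
    """Exact Shapley values of `players` for one task's utility function."""
    values = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when joining the current coalition.
            values[p] += utility(with_p) - utility(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: v / n_orders for p, v in values.items()}

def coverage_utility(relevant):
    """Toy utility (assumption): fraction of a task's relevant players present."""
    relevant = frozenset(relevant)
    return lambda coalition: len(coalition & relevant) / len(relevant)

players = ["a", "b", "c", "d"]
tasks = {"t1": {"a", "b"}, "t2": {"c"}}

# Player-by-task matrix stored column-wise: matrix[task][player].
matrix = {t: shapley_column(players, coverage_utility(rel)) for t, rel in tasks.items()}

# A new task adds one column; existing columns are left untouched.
matrix["t3"] = shapley_column(players, coverage_utility({"b", "d"}))
```

In this toy setting each column is nonzero only for the players its task depends on, which is the kind of locality the abstract says can be exploited to avoid full recomputation.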

2605.20619 2026-05-21 cs.LG math.OC stat.ML

SURF: Steering the Scalarization Weight to Uniformly Traverse the Pareto Front

Liuyuan Jiang, Chentong Huang, Lisha Chen

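AI summary: This paper proposes SURF, which steers the scalarization weight to cover the Pareto front uniformly, addressing the non-uniform coverage that conventional scalarization produces in multi-objective optimization.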

Abstract

Scalarization is widely used in multi-objective optimization owing to its simplicity and scalability. In many applications, the goal is to generate solutions that represent diverse user preferences, ideally with uniform coverage of the Pareto front (PF). However, uniformly sampling scalarization weights usually induces non-uniform coverage of the PF. We explain this mismatch through a geometric analysis of the scalarization path. As the scalarization weight varies, the corresponding solutions trace the PF with a generally non-uniform traversal speed. This speed induces an arc-length cumulative distribution function (CDF); inverting this CDF map yields a principled rule for selecting weights that produce uniform PF coverage. Building on this insight, we propose SURF (Sampling Uniformly along the PaReto Front). For structured problems, including bi-objective bandits, we derive closed-form expressions for this CDF map and the resulting PF-aware weight sampling rule. For general problems, SURF alternates between CDF reconstruction and weight sampling. Theoretically, we show that under provable conditions, SURF converges linearly to an unavoidable finite-sampling floor. Empirically, experiments on bandits, multi-objective-gymnasium, and multi-objective LLM alignment demonstrate that SURF efficiently achieves more uniform PF coverage than baselines.
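
The CDF-inversion idea can be illustrated on a toy problem. The sketch below is an assumption-laden example, not the SURF algorithm: it takes the bi-objective problem f1(x) = x^2, f2(x) = (x - 1)^2, whose weighted-sum minimiser is x = 1 - w, builds the arc-length CDF numerically on a dense weight grid, and inverts it by linear interpolation. All function names are hypothetical.

```python
import math

def pareto_point(w):
    """Minimiser of w*x**2 + (1-w)*(x-1)**2 is x = 1 - w; return (f1, f2)."""
    x = 1.0 - w
    return (x * x, (x - 1.0) ** 2)

def arc_length_cdf(n_grid=2001):
    """Normalised cumulative arc length of the front as a function of w."""
    ws = [i / (n_grid - 1) for i in range(n_grid)]
    pts = [pareto_point(w) for w in ws]
    cum = [0.0]
    for p, q in zip(pts, pts[1:]):
        cum.append(cum[-1] + math.dist(p, q))
    total = cum[-1]
    return ws, [c / total for c in cum]

def invert_cdf(ws, cdf, u):
    """Linear-interpolation inverse: the weight w whose CDF value is u."""
    for i in range(1, len(cdf)):
        if cdf[i] >= u:
            t = (u - cdf[i - 1]) / (cdf[i] - cdf[i - 1])
            return ws[i - 1] + t * (ws[i] - ws[i - 1])
    return ws[-1]

def spacing_spread(weights):
    """Ratio of largest to smallest gap between consecutive front points."""
    pts = [pareto_point(w) for w in weights]
    gaps = [math.dist(p, q) for p, q in zip(pts, pts[1:])]
    return max(gaps) / min(gaps)

ws, cdf = arc_length_cdf()
k = 11
uniform_w = [i / (k - 1) for i in range(k)]                       # naive choice
equalised_w = [invert_cdf(ws, cdf, i / (k - 1)) for i in range(k)]  # via inverse CDF

print("spread with uniform weights:  ", spacing_spread(uniform_w))
print("spread with equalised weights:", spacing_spread(equalised_w))
```

A spread close to 1 means the points are nearly evenly spaced along the front; on this toy problem the uniform weights give a visibly larger spread than the weights obtained through the inverse CDF.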

2605.20618 2026-05-21 cs.AI

COAgents: Multi-Agent Framework to Learn and Navigate Routing Problems Search Space

Oleksandr Yakovenko, Mahdi Mostajabdaveh, Cheikh Ahmed, Abdullah Ali Sivas, Xiaorui Li, Zirui Zhou, Mao Kun

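AI summary: This paper proposes COAgents, a multi-agent framework that models the search process as a graph to tackle the computational complexity of vehicle routing problems, training separate agents to guide intensification and exploration and achieving strong results on the CVRP and VRPTW benchmarks.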

Comments Accepted at LION 2026, The Learning and Intelligent Optimization Conference

Abstract

Although Vehicle Routing Problems (VRP) are essential to many real-world systems, they remain computationally intractable at scale due to their combinatorial complexity. Traditional heuristics rely on handcrafted rules for local improvements and occasional jumps to escape local minima, but often struggle to generalize across diverse instances. We introduce COAgents, a cooperative multi-agent framework that models the search process as a graph: nodes represent solutions, and edges correspond to either local refinements or large perturbations for diversification (i.e., jumps). A Partial Search Graph (PSG) is dynamically constructed during search, enabling COAgents to train a Node Selection Agent and a Move Selection Agent to guide intensification, and a Jump Agent to trigger well-timed explorations of new regions. Unlike end-to-end learning approaches, COAgents cleanly separates problem-agnostic search control from compact domain-specific encoding, facilitating adaptability across tasks. Extensive experiments on the CVRP and VRPTW benchmarks show that COAgents remains competitive with several learn-to-search baselines on CVRP and sets a new state of the art among learning-based methods on the more challenging VRPTW instances, reducing the gap to the best-known solutions by 14% at N=100 and 44% at N=50 relative to the strongest neural solver (POMO), and by 21% and 40% respectively relative to ALNS. Code is available at https://github.com/mahdims/COAgents.

2605.20616 2026-05-21 cs.CL

Auto-Dreamer: Learning Offline Memory Consolidation for Language Agents

Chongrui Ye, Yuxiang Liu, Yu Wang, Haofei Yu, Yining Zhao, Ge Liu, Julian McAuley, Jiaxuan You

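AI summary: This paper proposes Auto-Dreamer, a learned offline memory consolidation method for language agents that separates fast per-session memory acquisition from slow cross-session consolidation, improving memory integration and knowledge reuse across streams of tasks.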

Comments Preprint

Abstract

Language agents increasingly operate over streams of related tasks, yet existing memory systems struggle to convert accumulated experience into reusable knowledge. Retrieval-augmented and structured memory methods record per-session observations effectively, but often couple acquisition and consolidation into a single online process, leaving the agent without a global view across sessions to discover recurring patterns, abstract shared procedures, or prune redundant entries. Inspired by complementary learning systems theory, we propose Auto-Dreamer, a learned offline consolidator for language-agent memory. Auto-Dreamer decouples fast per-session memory acquisition from slow cross-session consolidation. Given a selected working region of a typed memory bank, the consolidator treats the region as read-only evidence, performs bounded tool-use to inspect entries and provenance-linked source trajectories, and synthesizes a fresh compact replacement set that abstracts across sessions and supersedes the original region. We train Auto-Dreamer via GRPO, using end-to-end agent performance as the reward signal to learn how to consolidate memories acquired through fast online experience. Trained on ScienceWorld trajectories alone, Auto-Dreamer outperforms fixed, RL-trained, and prompted memory baselines on ScienceWorld by 7 points while using an active memory bank 12× smaller than the strongest baseline, and continues to lead on held-out ALFWorld and WebArena without retraining, using 6× less memory than the strongest baseline on ALFWorld.

2605.20613 2026-05-21 cs.CL

HRM-Text: Efficient Pretraining Beyond Scaling

Guan Wang, Changling Liu, Chenyu Wang, Cai Zhou, Yuhao Sun, Yifei Wu, Shuai Zhen, Luca Scimeca, Yasin Abbasi Yadkori

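AI summary: This paper proposes HRM-Text, which combines a hierarchical recurrent model with new training methods to reach performance comparable to larger models at far lower compute cost, demonstrating that efficient pretraining is feasible.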

Abstract

The current pretraining paradigm for large language models relies on massive compute and internet-scale raw text, creating a significant barrier to foundational research. In contrast, biological systems demonstrate highly sample-efficient learning through multi-timescale processing, such as the functional organization of the frontoparietal loop. Taking this as inspiration, we introduce HRM-Text, which replaces standard Transformers with a Hierarchical Recurrent Model (HRM) that decouples computation into slow-evolving strategic and fast-evolving execution layers. To stabilize this deep recurrence for language modeling, we introduce MagicNorm and warmup deep credit assignment. Furthermore, instead of standard raw-text pretraining, we train exclusively on instruction-response pairs using a task-completion objective and PrefixLM masking. Serving as an empirical existence proof of efficient pretraining, a 1B-parameter HRM-Text model trained from scratch on only 40 billion unique tokens and a $1,500 budget achieves 60.7% on MMLU, 81.9% on ARC-C, 82.2% on DROP, 84.5% on GSM8K, and 56.2% on MATH. Despite utilizing roughly 100-900x fewer training tokens and 96-432x less estimated compute than standard baselines, HRM-Text performs competitively with 2-7B parameter open models. These results demonstrate that co-designing architectures and objectives can radically reduce the compute-to-performance ratio, making pretraining from scratch accessible to the broader research community.
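
PrefixLM masking, mentioned in this abstract, is a standard attention pattern: prefix tokens (here, the instruction) attend to each other bidirectionally, while the remaining tokens (the response) attend to the whole prefix and causally to earlier response tokens. The Python sketch below shows that generic pattern only. Restricting the loss to response positions is an assumption about what a task-completion objective might look like, not a detail confirmed by the abstract, and both function names are hypothetical.

```python
def prefix_lm_mask(prefix_len, total_len):
    """mask[i][j] is True when position i may attend to position j."""
    mask = []
    for i in range(total_len):
        row = []
        for j in range(total_len):
            in_prefix = j < prefix_len   # prefix tokens are visible to every position
            causal = j <= i              # otherwise only earlier-or-equal positions
            row.append(in_prefix or causal)
        mask.append(row)
    return mask

def loss_positions(prefix_len, total_len):
    """Assumption: only response tokens contribute to the training loss."""
    return list(range(prefix_len, total_len))

# Two instruction tokens followed by two response tokens.
for row in prefix_lm_mask(2, 4):
    print("".join("x" if allowed else "." for allowed in row))
```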

2605.20610 2026-05-21 cs.CV cs.AI

Beyond Routing: Characterising Expert Tuning and Representation in Vision Mixture-of-Experts

Gene Tangtartharakul, Katherine R. Storrs

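AI summary: This paper studies expert tuning and representation in vision Mixture-of-Experts models. It trains sparsely-gated convolutional MoE models with contrastive learning and analyses expert specialisation using tools from visual neuroscience, finding that an animate-inanimate distinction dominates expert partitioning and that experts are tuned to broader continuous visual and semantic dimensions.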

Comments 21 Pages, 6 Main Figures, 1 Table

Abstract

Mixture-of-Experts (MoE) models are often interpreted by analysing which categories are routed to which experts. However, routing alone does not reveal what each expert actually encodes. We train sparsely-gated convolutional MoE models with a contrastive objective on natural images and characterise expert specialisation using tools from visual neuroscience. Extending from gating-level to expert-level analyses, we measure per-expert category separability, and per-expert tuning using the most exciting inputs. Extending from category-level to feature-level explanations, we interpret tuning via semantic dimensions derived from a dataset of human behavioural judgements (THINGS). Finally, we use tuning and representational similarity analysis to assess the stability of expertise-allocation across independent initialisations. We find that an animate-inanimate distinction dominates expert partitioning, apparent from gating through to expert readout, and is stable across independently trained models. Although routing statistics suggest relatively sparse, categorical preferences, expert analyses reveal broader tuning to continuous visual and semantic dimensions that extend beyond category boundaries. Experts exhibit similar category-separability to one another, despite distinct feature tuning, demonstrating the explanatory benefits of moving beyond category-level analyses. Together, these results show that expert specialisation in vision MoEs extends well beyond category routing and is better understood by probing fine-grained expert-level tuning and representational structure.

2605.20609 2026-05-21 cs.LG

Compositional Transduction with Latent Analogies for Offline Goal-Conditioned Reinforcement Learning

Junseok Kim, Dohyeong Kim, Mineui Hong, Songhwai Oh

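AI summary: This paper proposes a compositional transduction method based on latent analogies to address goal generalization under novel contexts in offline goal-conditioned reinforcement learning, introducing a new analogy representation that improves goal reaching across contexts.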

Comments ICML 2026

Abstract

Compositional generalization is essential for reaching unseen goals under novel contextual variations in offline goal-conditioned reinforcement learning (GCRL), where a generalist goal-reaching agent must be learned from limited data. Most prior approaches pursue this via trajectory stitching over temporally contiguous segments, which limits composing behaviors across varying contexts. To overcome this limitation, we formalize analogy transduction as synthesizing new plans by composing task-endogenous analogies with given contexts and propose a novel analogy representation tailored for it. Grounded in our theory, this analogy representation captures what changes under optimal task execution, remains invariant to contextual variations, and is sufficient for optimal goal reaching. We further contend that generalization to unseen analogy-context pairs is a practical obstacle in analogy transduction, and introduce a new approach for offline GCRL that enables analogy transduction beyond seen pairs to unseen combinations. We empirically demonstrate the effectiveness of our approach on OGBench manipulation environments, substantially outperforming prior methods that do not perform analogy transduction. Project page: https://rllab-snu.github.io/projects/CTA/

2605.20608 2026-05-21 cs.AI cs.NI

From Automated to Autonomous: Hierarchical Agent-native Network Architecture (HANA)

Binghan Wu, Shoufeng Wang, Yunxin Liu, Ya-Qin Zhang, Joseph Sifakis, Ye Ouyang

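AI summary: This paper proposes a hierarchical multi-agent reference architecture aimed at Level 4/5 autonomous networks; by introducing agent self-awareness it unifies strategic planning with operational resilience, and it is validated in a 5G Core environment.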

Comments This manuscript has been accepted by IEEE Networking Letters

Journal ref
B. Wu, S. Wang, Y. Liu, Y. -Q. Zhang, J. Sifakis and Y. Ouyang, "From Automated to Autonomous: Hierarchical Agent-native Network Architecture (HANA)," in IEEE Networking Letters, 2026

Abstract

Realizing Level 4/5 Autonomous Networks (AN) demands a shift from static automation to agent-native intelligence. Current operations, reliant on rigid scripts, lack the cognitive agency to handle off-nominal conditions. To address this, this letter proposes a hierarchical multi-agent reference architecture enabling high-level autonomy. The framework features a Dual-Driven Orchestrator that coordinates specialized Executive Agents, supported by a shared Public Memory for unified domain knowledge. A key innovation is the integration of agent self-awareness, which empowers the system to harmonize deliberative strategic governance with reflexive fault recovery. We instantiate and validate this architecture within a 5G Core environment. Case studies demonstrate that the system sustains critical throughput under congestion and reduces Mean Time to Repair (MTTR) by 86%, confirming its efficacy in unifying strategic planning with operational resilience.

2605.20607 2026-05-21 cs.LG cs.CV cs.RO

Mechanistic Interpretability for Learning Assurance of a Vision-Based Landing System

Romeo Valentin, Olivia Beyer Bruvik, Marc R. Schlichting, Mykel J. Kochenderfer

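AI summary: This paper proposes a learning-assurance approach for a vision-based landing system that builds an interpretable model by separating content from style, providing assurance evidence, and introduces a new runtime assurance method that monitors the model's situation representation.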

Comments 10 pages, 4 figures

Abstract

EASA's learning-assurance guidance requires data-driven aviation systems to build and monitor their own situation representation, yet for neural networks the technical means to provide such evidence remain an open problem. We address this gap for a vision-based aircraft landing system: we propose that a minimally assurable model must at least be shown to separate content from style in its own situation representation. Showing that the model's predictions then rely largely on the contentful representation components leads to a concrete assurance path. To demonstrate this assurance path on a concrete model we train a vision transformer model for runway keypoint regression on the LARDv2 dataset. The model, which acts as the subject for our assurance demonstration, produces per-patch embeddings that we decompose into interpretable atoms via K-SVD sparse dictionary learning. A qualitative visualization confirms that contentful atoms track task-relevant runway structure and stylistic atoms track domain-specific appearance, and the regression head is shown to place almost all of its linear weight on contentful atoms. We further build on the content/style separation and define out-of-model-scope (OOMS) detection, a novel runtime assurance approach directly monitoring the model's situation representation. OOMS monitoring is complementary to operational design domain and output-space out-of-distribution monitoring and addresses concrete requirements of the recent EASA guidance. By directly analyzing a model's situation representation both at test time and runtime, this work delivers the first concrete piece of the representation-level evidence that EASA learning-assurance guidance demands, and points to mechanistic interpretability as a practical building block of future aviation safety cases.

2605.20602 2026-05-21 cs.CL cs.AI cs.LG

Self-Training Doesn't Flatten Language -- It Restructures It: Surface Markers Amplify While Deep Syntax Dies

Ming Liu

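AI summary: This study finds experimentally that self-training does not flatten language but restructures it: surface markers increase while deep syntactic structures disappear. It proposes the Structural Depth Hypothesis to explain the phenomenon.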

Comments 19 pages (14 main + 5 appendix), 8 figures, 3 tables

Abstract

Successive self-training on a language model's own outputs is widely characterized as a process of flattening: diversity drops, distributions narrow, and the text becomes "more like itself." We provide evidence that this characterization is incomplete. Across eleven generations of self-training on five models (GPT-2 124M, Pythia-410M, Pythia-1.4B, OPT-1.3B, Pythia-2.8B), language is not flattened uniformly -- it is restructured. Surface markers (discourse connectives, hedges, em-dashes) rise, while mid- and deep-syntactic structures (questions, parentheticals, passives, subjunctives) collapse. We formalize this asymmetric collapse as the Structural Depth Hypothesis (SDH): the per-generation decay rate of a linguistic feature is predicted primarily by its structural depth -- the number of nested syntactic dependencies it requires -- and only secondarily by its generation-zero output frequency. Pooling 17-feature panels from five models spanning three architecture families (N=85), the pooled Spearman correlation is rho=0.540 (p < 10^{-6}; cluster-bootstrap 95% CI [0.434, 0.634]), while frequency is a substantially weaker predictor (rho=0.225). A matched human-text fine-tuning control yields rho=0.039 (p=0.88), confirming the gradient is self-training-specific. We further document a Superficial Complexity Paradox: aggregate complexity proxies (dep-tree depth, TTR, word length) all rise as the underlying clause structure dies, with direct implications for training-data curation and LLM-text detection.
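
The headline statistic in this abstract is a Spearman rank correlation between structural depth and decay rate. As a reminder of what that quantity measures, the following self-contained Python sketch computes Spearman's rho as the Pearson correlation of average ranks. The `depth` and `decay_rate` values are made up purely for illustration and are not data from the paper; the cluster bootstrap used for the confidence interval is not reproduced here.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up numbers for illustration only; not data from the paper.
depth = [1, 1, 2, 3, 4]
decay_rate = [0.02, 0.05, 0.04, 0.10, 0.30]
rho = spearman(depth, decay_rate)
print(f"rho = {rho:.3f}")
```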

2605.20600 2026-05-21 cs.CV

Head-Aware Key-Value Compression for Efficient Autoregressive Image Generation

Guotao Liang, Baoquan Zhang, Zhiyuan Wen, Yunming Ye

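AI summary: This paper proposes the HeadKV framework, which allocates different cache budgets to attention heads according to their locality bias to improve the efficiency and memory use of autoregressive image generation, and designs a Stratified Token Eviction strategy to preserve long-range information.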

Comments Under review

Abstract

Autoregressive (AR) visual generation has achieved remarkable performance but suffers from high memory usage and low throughput, as it requires caching previously generated visual tokens. Recent research has shown that retaining only a few lines of cache tokens can maintain high-quality images while significantly reducing memory usage and improving throughput. However, these methods allocate a fixed budget to each attention head, overlooking the heterogeneity among attention heads, leading to suboptimal memory allocation. In this paper, we observe that attention heads across different layers exhibit diverse attention patterns, where some heads focus on local neighborhoods while others capture broader contextual dependencies. Based on this insight, we propose a novel head-aware key-value (KV) cache compression framework for autoregressive image generation, called HeadKV, which assigns smaller budgets to locality-biased heads and larger budgets to heads with broader attention. A key challenge lies in identifying the type of each attention head to guide cache compression. We further observe that, within the same layer, each head exhibits consistent attention patterns across token positions, i.e., a head's behavior for early tokens remains consistent with that for later tokens. This insight suggests that head types can be identified during the early stage and reused for KV compression throughout generation. Its advantage is that it requires no additional training or dataset-level statistics and generalizes seamlessly across different inputs. Moreover, we design a Stratified Token Eviction strategy to effectively preserve long-range information. Extensive experiments demonstrate its effectiveness across multiple autoregressive image generation models.
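
The idea of giving locality-biased heads a smaller cache budget can be sketched as follows. This is a hypothetical Python illustration, not the HeadKV implementation: the locality score, the window size, the threshold, and the two budget values are all assumptions chosen for the example, and the Stratified Token Eviction strategy is not modelled.

```python
def locality_score(attn_row, window):
    """Fraction of attention mass on the `window` most recent cached tokens."""
    total = sum(attn_row)
    return sum(attn_row[-window:]) / total

def assign_budgets(early_attn, window=4, threshold=0.8, small=8, large=64):
    """Pick a KV-cache budget per head from one early attention row.

    early_attn[h] is one attention row for head h taken early in generation.
    All default values are illustrative assumptions.
    """
    budgets = {}
    for head, row in early_attn.items():
        is_local = locality_score(row, window) >= threshold
        budgets[head] = small if is_local else large
    return budgets

early_attn = {
    "local_head": [0.01] * 12 + [0.22] * 4,  # most mass on the last 4 tokens
    "broad_head": [1 / 16] * 16,             # mass spread evenly
}
budgets = assign_budgets(early_attn)
print(budgets)
```

Because the abstract reports that a head's pattern stays consistent across token positions, a classification like this would only need to run once early in generation and could then be reused.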

2605.20599 2026-05-21 cs.LG

Unsupervised clustering and classification of upper limb EMG signals during functional movements: a data-driven approach

L. F. Salazar Álvarez, D. Escobar-Saltarén, M. B. Salazar Sánchez, S. C. Henao-Aguirre

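AI summary: This paper proposes a comprehensive, data-driven approach for clustering and classifying upper-limb surface EMG signals during functional reach and grasp movements, applied to the NINAPRO DB4 dataset. Its four-stage pipeline covers signal preprocessing, feature extraction, gesture selection via hierarchical clustering, and comparative model evaluation, and selects five key features for the classification tasks.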

Comments 19 Congreso Colombiano de Computación (19CCC)

Abstract

This study presents a comprehensive approach for the clustering and classification of upper-limb surface electromyography (sEMG) signals during functional reach and grasp movements. The methodology was applied to the NINAPRO DB4 dataset, which provides multichannel EMG recordings of 52 gestures. A four-stage pipeline was designed, including signal preprocessing, feature extraction, gesture selection via hierarchical clustering, and comparative model evaluation. Preprocessing involved a fourth-order low-pass filter (0.6 Hz) and Hilbert envelope transformation, effectively reducing noise and enhancing signal clarity. Feature extraction yielded 26 temporal and frequency-domain metrics, which were later refined using visual analysis, mutual information, principal component analysis, and decision tree importance scores. A final subset of five key features was selected for classification tasks. Gesture selection was performed through hierarchical clustering using Mahalanobis distance, resulting in six representative movements that balanced biomechanical diversity and computational efficiency. A 200 ms window was identified as optimal for temporal segmentation based on stability and physiological plausibility. Classifier models were evaluated in two stages. Automated comparison using PyCaret identified Extra Trees (ET) and Artificial Neural Networks (ANN) as top performers. Subsequent independent training confirmed their stability and generalization capacity, with ANN showing progressive learning and ET maintaining robust, consistent results. The findings support the implementation of adaptive, low-latency control strategies for myoelectric prostheses and provide a scalable pipeline for future real-time applications.
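
The 200 ms segmentation step can be pictured with a short Python sketch. The sampling rate, the non-overlapping step, and the two features shown (mean absolute value and root mean square, both common time-domain sEMG features) are assumptions for illustration; the abstract does not say which five features were finally selected, and the filtering and Hilbert-envelope stages are omitted.

```python
import math

def sliding_windows(signal, fs_hz, window_ms=200, step_ms=200):
    """Split a 1-D signal into fixed-length windows (non-overlapping by default)."""
    size = int(fs_hz * window_ms / 1000)
    step = int(fs_hz * step_ms / 1000)
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, step)]

def mean_absolute_value(window):
    """Average rectified amplitude of one window."""
    return sum(abs(v) for v in window) / len(window)

def root_mean_square(window):
    """Root mean square amplitude of one window."""
    return math.sqrt(sum(v * v for v in window) / len(window))

fs_hz = 1000  # assumed sampling rate for this synthetic example
# Synthetic 50 Hz sine standing in for one second of a single EMG channel.
signal = [math.sin(2 * math.pi * 50 * n / fs_hz) for n in range(1000)]

features = [
    (mean_absolute_value(w), root_mean_square(w))
    for w in sliding_windows(signal, fs_hz)
]
print(len(features), "windows; first window (MAV, RMS):", features[0])
```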

2605.20595 2026-05-21 cs.RO cs.MA cs.NI

Intent-First Aerial V2V for Tactical Coordination and Separation: Protocol and Performance Under Density and Disturbance

Mehrnaz Sabet

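AI summary: This study proposes an intent-first aerial vehicle-to-vehicle (V2V) protocol for dense UAS Traffic Management (UTM) operations, a deployable neighborhood communication mechanism that supplies fresh, trusted information for local coordination. It combines refreshed state and intent beacons for local awareness, cooperative perception, and degraded-mode assessment with event-triggered messages for yielding, sequencing, release, and contingency coordination.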

Comments Submitted to IEEE Transactions on Intelligent Transportation Systems

Abstract

Dense low-altitude aerial operations require more than pre-flight route coordination and last-resort collision avoidance. Once aircraft are airborne, disturbances can emerge on timescales shorter than strategic reauthorization can absorb, while collision avoidance is too late and disruptive to serve as routine traffic management. Although tactical separation is recognized as the intermediate layer, realizing it at scale requires a deployable neighborhood communication mechanism that provides fresh, trusted information for local coordination. This paper presents what is, to our knowledge, the first controller-coupled characterization of an all-airborne, sidelink-class, intent-first vehicle-to-vehicle (V2V) tactical neighborhood exchange stack for dense Unmanned Aircraft System Traffic Management (UTM) operations. Unlike awareness-only broadcast, the proposed exchange combines refreshed state and intent beacons for local awareness, cooperative perception, and degraded-mode assessment with event-triggered messages for yielding, sequencing, release, and contingency coordination. We implement and evaluate this model on an all-airborne V2V stack using sidelink-class C-V2X modules with authenticated freshness checks. Evaluation uses a scenario-driven, high-volume stress campaign supported by real-time, field-anchored infrastructure. Results show that V2V reduces stale-belief divergence, preserves observability through cooperative perception, rejects invalid tactical messages, suppresses false local inference, and structures shared-resource coordination. The implemented stack provides a viable communication layer for tactical separation in lower-to-moderate regimes, but transitions toward guarded fallback as density, impairment, and complexity increase. These findings position intent-first aerial V2V as a bounded enabler for scaling tactical coordination in disturbance-driven urban airspace.
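
An authenticated freshness check, as mentioned in this abstract, typically means two things: verifying that a message really comes from a trusted sender and rejecting it if it is too old. The Python sketch below is one generic way to do that with an HMAC over the message body plus a timestamp bound. The message layout, the shared symmetric key, and the 0.5 s age limit are all assumptions; the abstract does not describe the actual scheme used on the C-V2X modules.

```python
import hashlib
import hmac
import json

def sign_beacon(key, payload):
    """Serialise a beacon and attach an HMAC-SHA256 tag."""
    body = json.dumps(payload, sort_keys=True).encode()
    tag = hmac.new(key, body, hashlib.sha256).hexdigest()
    return {"body": body.decode(), "tag": tag}

def accept_beacon(key, message, now_s, max_age_s=0.5):
    """Return the payload if the tag verifies and the beacon is fresh, else None."""
    body = message["body"].encode()
    expected = hmac.new(key, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, message["tag"]):
        return None  # forged or corrupted message
    payload = json.loads(body)
    age = now_s - payload["sent_at_s"]
    if age < 0 or age > max_age_s:
        return None  # stale, or timestamped in the future
    return payload

key = b"demo-shared-key"  # hypothetical key for the example only
beacon = sign_beacon(key, {"id": "uav-7", "sent_at_s": 100.0, "intent": "yield"})

print(accept_beacon(key, beacon, now_s=100.2))  # fresh and authentic
print(accept_beacon(key, beacon, now_s=101.0))  # too old
```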

2605.20592 2026-05-21 cs.LG

ReversedQ: Opportunities for Faster Q-Learning in Episodic Online Reinforcement Learning

ReversedQ: 在回合制在线强化学习中更快的Q学习机会

Sofia R. Miskala-Dinc, Aviva Prins

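AI summary This paper studies the efficiency of model-free Q-learning in finite-horizon episodic Markov decision processes (MDPs) and proposes ReversedQ, which speeds up learning by changing the value-function update order, the update frequency, and the initialization. Experiments show that it outperforms RandomizedQ on the BDCL and chain MDP tasks.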

Comments This paper contains 5 pages and 2 figures. To be presented at the Adaptive and Learning Agents workshop (ALA 2026) at AAMAS 2026

详情
AI中文摘要

我们研究了在有限回合的回合制马尔可夫决策过程(MDPs)中使用无模型Q学习的性能,其中动态在回合间保持稳定。我们识别了新兴无模型后验抽样工作中一个核心问题:为了证明理论保证,必须依赖延迟学习。特别是,我们识别了三个加速学习的机会:(i)价值函数更新顺序,(ii)更新频率,以及(iii)价值函数初始化。基于Wang等人提出的RandomizedQ,我们展示了这些变化及其单独和累积的影响,并在多个经验研究中进行了验证。我们发现,我们的综合修改,称为ReversedQ,在Bidirectional Diabolical Combination Lock(BDCL)任务中,相对于RandomizedQ,缩放后的平均累积奖励从9.53%提升至78.78%,在链状MDP中,从21.76%提升至61.81%。

英文摘要

We study model-free Q-learning in finite-horizon episodic Markov Decision Processes (MDPs) with stationary dynamics across episodes. We identify a central issue in nascent model-free posterior-sampling works: the reliance on delayed learning in order to prove theoretical guarantees. In particular, we identify three opportunities for faster learning - (i) value-function update order, (ii) update frequencies, and (iii) value-function initialization. Using Wang et al.'s RandomizedQ as a basis, we illustrate these changes and their individual (as well as cumulative) impact in multiple empirical studies. We find that our combined modifications, termed ReversedQ, improve scaled mean cumulative reward compared to RandomizedQ, from 9.53% to 78.78% in the Bidirectional Diabolical Combination Lock (BDCL), and from 21.76% to 61.81% in a chain MDP.
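The first opportunity listed above, value-function update order, can be illustrated with a small tabular sketch. This is not the ReversedQ or RandomizedQ algorithm (the abstract does not give their update rules); it is plain finite-horizon Q-learning on a hypothetical 4-step chain, and the names `sweep` and `make_q` are invented for this illustration. It only shows why replaying one episode from the last step to the first can propagate a terminal reward in a single pass, while the forward order cannot.

```python
def sweep(Q, episode, order, lr=1.0):
    """Apply one pass of tabular Q-learning updates over a stored episode.

    episode: list of (h, s, a, r, s_next) tuples, one per step h = 0..H-1.
    order: "forward" (h = 0..H-1) or "backward" (h = H-1..0).
    """
    H = len(episode)
    steps = episode if order == "forward" else list(reversed(episode))
    for h, s, a, r, s_next in steps:
        # The value beyond the horizon is zero in a finite-horizon MDP.
        next_value = 0.0 if h == H - 1 else max(Q[h + 1][s_next])
        Q[h][s][a] += lr * (r + next_value - Q[h][s][a])
    return Q


def make_q(H, n_states, n_actions):
    """Zero-initialized table indexed as Q[step][state][action]."""
    return [[[0.0] * n_actions for _ in range(n_states)] for _ in range(H)]


# Hypothetical chain: action 0 moves right; reward 1 only at the last step.
H = 4
episode = [(h, h, 0, 1.0 if h == H - 1 else 0.0, h + 1) for h in range(H)]

q_fwd = sweep(make_q(H, H + 1, 2), episode, "forward")
q_bwd = sweep(make_q(H, H + 1, 2), episode, "backward")

first_step_forward = q_fwd[0][0][0]   # reward has not reached step 0 yet
first_step_backward = q_bwd[0][0][0]  # reward reached step 0 in one pass
```

With a learning rate of 1 and deterministic transitions, the backward pass fills in the whole trajectory at once; with the forward order, roughly one extra episode is needed per step of horizon before the first-step value changes.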

2605.20591 2026-05-21 cs.CL cs.CY

Do No Harm? Hallucination and Actor-Level Abuse in Web-Deployed Medical Large Language Models

有害吗?网络部署医疗大语言模型中的幻觉与作用层面滥用

Sunday Oyinlola Ogundoyin, Muhammad Ikram, Rahat Masood

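AI summary This paper studies hallucination and actor-level abuse in web-deployed medical large language models. Evaluating 6,233 MedGPTs and 10 open-source LLMs, it finds that 25-30% of MedGPTs have low factual accuracy, 33.6-54.3% violate operational thresholds, and 57.06% of Action-enabled models lack adequate privacy disclosures, exposing systemic weaknesses and underscoring the need for multi-metric evaluation and stronger safeguards.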

详情
AI中文摘要

医疗大语言模型(LLMs),包括定制医疗GPT(MedGPTs)和开源模型,正越来越多地部署在网页平台上以提供临床指导。然而,它们存在幻觉、政策不合规和不安全设计的风险。我们对6,233个MedGPT进行了大规模评估,评估了1,500个分层样本以及10个开源LLM。我们引入了两个框架:MedGPT-HEval用于幻觉检测,以及一个基于LLM的流程用于评估违规行为和开发者意图。我们的结果表明,25-30%的MedGPT事实准确性较低,底层和中层模型风险最高;33.6-54.3%违反操作阈值,57.06%的Action-enabled模型缺乏充分的隐私披露。与开源模型相比,MedGPT在事实准确性和语义对齐方面表现更好,但开源模型更稳定。这些结果揭示了幻觉和合规性的系统性缺口,强调了多指标评估和更强的安全保障的必要性。我们发布了HAA-MedGPT,一个结构化数据集,支持未来关于网络面向医疗LLM安全性的研究。

英文摘要

Medical large language models (LLMs), including custom medical GPTs (MedGPTs) and open-source models, are increasingly deployed on web platforms to provide clinical guidance. However, they pose risks of hallucination, policy noncompliance, and unsafe design. We conduct a large-scale assessment of 6,233 MedGPTs, evaluating a stratified sample of 1,500, together with 10 open-source LLMs. We introduce two frameworks: MedGPT-HEval for hallucination detection and an LLM-based pipeline for assessing policy violations and developer intent. Our results show that 25-30% of MedGPTs exhibit low factual accuracy, with bottom- and middle-tier models at highest risk; 33.6-54.3% violate operational thresholds, and 57.06% of Action-enabled models lack adequate privacy disclosures. Compared with open-source models, MedGPTs achieve higher factual accuracy and semantic alignment, though open-source models are more stable. These results reveal systemic gaps in hallucination and compliance, highlighting the need for multi-metric evaluation and stronger safeguards. We release HAA-MedGPT, a structured dataset that supports future research on the safety of web-facing medical LLMs.

2605.20588 2026-05-21 cs.CL cs.CV

Direct Translation between Sign Languages

手语之间的直接翻译

Zetian Wu, Bowen Xie, Wuyang Meng, Milan Gautam, Stefan Lee, Liang Huang

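AI summary This paper proposes a direct sign-to-sign translation method that uses back-translation to generate synthetic sign-sign pairs, avoiding the error propagation and information loss of conventional cascaded approaches, and achieves higher translation quality and faster inference on several sign language datasets.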

详情
AI中文摘要

手语翻译领域在手语与口语之间的翻译上取得了显著进展,但手语之间的翻译仍鲜为人知且难以实现。后者可以帮助15亿全球聋人和听力障碍者在语言障碍中交流,而无需依赖听力翻译者或书面语言能力。级联方法由单独的手语到文本、文本到文本和文本到手语系统组成,但存在误差传播、额外延迟以及视觉模态中独特信息的丢失。我们旨在开发直接的手语到手语翻译。然而,尚未有大规模的开放领域平行语料库在手语之间。为了实现直接的手语翻译,我们使用回译技术从不对齐的个体语言语音-手语语料库中生成合成的手语对。使用这些数据,我们联合训练了一个基于MBART的单一模型,用于文本到手语(T2S)和手语到手语(S2S)。在合成生成的美国手语(ASL)、中国手语(CSL)和德国手语(DGS)之间配对集上,我们的直接S2S方法在几何手语误差指标(20%更低的DTW对齐MPJPE)和翻译回句子后的语言匹配指标(50%高BLEU-4)上优于级联基线,同时实现了大约2.3倍的速度提升。在一小部分现有的跨语言手语数据上,我们发现我们的方法也实现了类似的改进。

英文摘要

The field of sign language translation has witnessed significant progress in the translation between sign and spoken languages, but the translation between sign languages remains largely unexplored and out of reach. The latter can help 1.5 billion deaf and hard-of-hearing (DHH) people worldwide communicate across language barriers without relying on hearing interpreters or written-language fluency. The cascade approach, which composes separate sign-to-text, text-to-text, and text-to-sign systems, suffers from error propagation and extra latency as well as the loss of information unique to the visual modality. We aim to develop direct sign-to-sign translation. However, a large-scale open-domain parallel corpus has not been curated between sign languages. To enable direct translation between sign language utterances, we use back-translation to produce synthetic sign-sign pairs from unaligned individual language utterance-sign corpora. Using this data, we jointly train a single MBART-based model for both text->sign (T2S) and sign->sign (S2S). On synthetically generated paired sets between American Sign Language (ASL), Chinese Sign Language (CSL), and German Sign Language (DGS), our direct S2S method outperforms the cascaded baseline on geometric sign error metrics (20% lower DTW-aligned MPJPE) and language matching metrics after predicted sign utterances are translated back to sentences (50% higher BLEU-4) while achieving a roughly 2.3x speedup. On a small set of pre-existing cross-lingual sign data, we find similar improvements for our proposed method.

2605.20584 2026-05-21 cs.CV

QwenSafe: Multimodal Content Rating Description Identification via Preference-Aligned VLMs

QwenSafe: 通过偏好对齐的视觉语言模型实现多模态内容评级描述识别

Dishanika Denipitiyage, Aruna Seneviratne, Suranga Seneviratne

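AI summary This paper presents QwenSafe, a vision-language model that automatically identifies Apple-defined content rating descriptors (CRDs) by jointly reasoning over app metadata and screenshots. It introduces the metadata2CRD data-construction pipeline and applies Direct Preference Optimization (DPO) to improve prediction accuracy. Experiments show that QwenSafe clearly outperforms existing models in binary CRD classification.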

详情
AI中文摘要

移动应用市场要求开发者披露标准化的内容评级描述(CRDs)以告知用户潜在敏感或受限制的内容。确保这些披露的准确性和一致性仍然具有挑战性,因为应用内容的多模态性质跨越了文本描述和视觉界面。在本文中,我们提出了QwenSafe,一种视觉语言模型(VLM),旨在通过联合推理应用元数据和截图自动识别苹果定义的CRDs。为了使该任务能够扩展训练,我们引入了metadata2CRD数据构建管道,通过结合应用描述、截图和正式描述定义来合成描述对齐的问题-答案对。我们通过监督微调后直接偏好优化(DPO)调整Qwen3-VL-8B,以使模型预测与视觉和文本模态的描述特定证据和解释对齐。我们在12个苹果定义的内容评级描述上评估QwenSafe,并将其与最先进的视觉语言模型进行比较,包括Qwen3-VL、LLaVA-1.6和Gemini-2.5-Flash。QwenSafe在二元CRD分类中始终优于所有基线模型,分别在正类召回率上实现了111.8%、36.1%和2.1%的提升。我们的结果表明,描述意识的多模态对齐显著提高了自动化内容分类,并突显了视觉语言模型在支持移动应用市场中可扩展和一致的内容评级方面的潜力。

英文摘要

Mobile app marketplaces require developers to disclose standardized content rating descriptors (CRDs) to inform users about potentially sensitive or restricted content. Ensuring the accuracy and consistency of these disclosures remains challenging due to the multimodal nature of app content, which spans textual descriptions and visual interfaces. In this paper, we present QwenSafe, a Vision-Language Model (VLM) designed to automatically identify the presence of Apple-defined CRDs by jointly reasoning over app metadata and screenshots. To enable scalable training for this task, we introduce metadata2CRD, a data-construction pipeline that synthesizes descriptor-aligned question-answer pairs by combining app descriptions, screenshots, and formal descriptor definitions. We adapt Qwen3-VL-8B using supervised fine-tuning followed by Direct Preference Optimization (DPO) to align model predictions with descriptor-specific evidence and explanations across visual and textual modalities. We evaluate QwenSafe on 12 Apple-defined content rating descriptors and compare it against state-of-the-art vision-language models, including Qwen3-VL, LLaVA-1.6, and Gemini-2.5-Flash. QwenSafe consistently outperforms all baselines in binary CRD classification, achieving improvements in positive-class recall of 111.8%, 36.1%, and 2.1%, respectively. Our results demonstrate that descriptor-aware multimodal alignment substantially improves automated content classification and highlight the potential of vision-language models to support scalable and consistent content rating in mobile app marketplaces.
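For readers unfamiliar with the DPO stage mentioned above, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. It is the generic objective, not QwenSafe's training code. The function name `dpo_loss`, the default `beta`, and the example log-probabilities are assumptions made for illustration.

```python
import math


def dpo_loss(logp_policy_chosen, logp_policy_rejected,
             logp_ref_chosen, logp_ref_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is a summed sequence log-probability, under either the
    trainable policy or the frozen reference model.
    """
    margin = beta * ((logp_policy_chosen - logp_ref_chosen)
                     - (logp_policy_rejected - logp_ref_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the loss equals log(2).
loss_at_reference = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Policy that raises the chosen answer and lowers the rejected one.
loss_after_shift = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

The loss falls below log(2) only when the policy prefers the chosen response more strongly than the reference does, which is the mechanism that pulls predictions toward descriptor-specific evidence in a preference dataset.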

2605.20581 2026-05-21 cs.LG cond-mat.mtrl-sci

TriForces: Augmenting Atomistic GNNs for Transferable Representations

TriForces: 为可迁移表示增强原子istic GNNs

Ali Ramlaoui, Alexandre Duval, Hannah Bull, Victor Schmidt, Hugues Talbot, Fragkiskos D. Malliaros, Joseph Musielewicz

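AI summary TriForces separates composition and structure information and combines this with self-supervised learning. It improves performance on MatBench and QM9 without DFT labels, enables efficient similar-structure retrieval through its learned latent space, and reduces energy MAE on OMat24 in the limited-data regime.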

Comments 28 pages, 11 figures. Accepted at ICML 2026

详情
AI中文摘要

机器学习互原子势(MLIPs)在训练于大规模密度泛函理论(DFT)数据时能取得优异的准确性。为了在实践中有用,它们通常需要通过小而昂贵的特定任务数据集进行调整。然而,MLIPs在不同领域之间的迁移不一致,其表示往往失去可访问的组成和结构信息。为此,我们提出了TriForces,一个模型无关的三流框架,通过分离组成和结构信息并结合自监督学习来保持可迁移的表示。TriForces在MatBench和QM9上优于基线模型,无需DFT标签,并通过其学习的潜在空间实现高效的相似结构检索。在OMat24上,在有限数据训练条件下,TriForces在20K样本仅需时将能量MAE减少57%,并在不同样本数量下提升力MAE。我们还发布了多个MLIP架构的预训练TriForces变体,并在https://github.com/Ramlaoui/triforces上提供代码。

英文摘要

Machine learning interatomic potentials (MLIPs) achieve excellent accuracy when trained on large Density Functional Theory (DFT) data. To be useful in practice, they must often be adapted to target chemistries using small and expensive task-specific datasets. However, MLIPs transfer inconsistently across domains, with representations that often lose accessible composition and structure information. To address this, we present TriForces, a model-agnostic three-stream framework that separates composition and structure information, combined with self-supervised learning to preserve transferable representations. TriForces improves performance on MatBench and QM9 over baselines without needing DFT labels and enables efficient similar structure retrieval through its learned latent space. On OMat24, in the limited-data training regime, TriForces reduces energy MAE by 57% with only 20K samples and improves force MAE across sample sizes. We release pretrained TriForces variants across multiple MLIP architectures with code at https://github.com/Ramlaoui/triforces.

2605.20580 2026-05-21 cs.LG

Deep Learning Surrogates for Emulating Stochastic Climate Tipping Dynamics

深度学习代理用于模拟随机气候临界动态

Adeline Hillier, Jennifer Sleeman, Jay Brett, Caroline Tang, Jenelle Millison, Anand Gnanadesikan

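AI summary This paper proposes a dynamics-informed Temporal Fusion Transformer as a data-driven surrogate for complex Earth system simulations, forecasting the timing of tipping events at a fraction of the simulator's computational cost.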

详情
AI中文摘要

本文探讨了一种基于动态信息的时序融合变换器(TFT)作为数据驱动代理,用于计算密集型地球系统模拟。聚焦于描述全球海洋输送的多变量时间序列,我们展示了该代理在数千个时间步上预测临界事件的能力。数据包括多达21个非平稳时间序列以及描述自由参数和初始条件的静态协变量。对架构和目标函数的修改使代理能够高保真地预测大西洋和太平洋崩溃的时间,并捕捉跨集合预测的随机不确定性。所学代理在数值模拟器上实现了465倍的计算加速,同时保持对参数和初始条件的可微性。

英文摘要

This work explores a dynamics-informed Temporal Fusion Transformer (TFT) as a data-driven surrogate for computationally intensive Earth system simulations. Focusing on multivariate time series describing global ocean transport, we demonstrate the surrogate's ability to forecast tip events across thousands of time steps. The data involve up to 21 non-stationary time series in addition to static covariates describing free parameters and initial conditions. Modifications to the architecture and objective function yield a surrogate that anticipates the timing of Atlantic and Pacific collapses to high fidelity and captures the stochastic uncertainty in transition timing across ensemble predictions. The learned surrogate achieves a 465x computational speedup over the numerical simulator while maintaining differentiability with respect to parameters and initial conditions.

2605.20577 2026-05-21 cs.AI cs.LG

Mahjax: A GPU-Accelerated Mahjong Simulator for Reinforcement Learning in JAX

Mahjax: 一种用于在JAX中进行强化学习的GPU加速麻将模拟器

Soichiro Nishimori, Shinri Okano, Keigo Habara, Sotetsu Koyamada, Eason Yu, Masashi Sugiyama

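AI summary This paper presents Mahjax, a Mahjong environment implemented in JAX that uses GPU acceleration for large-scale parallelization, addressing the high-dimensional state space and stochasticity of Mahjong and providing an efficient training platform for reinforcement learning.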

详情
AI中文摘要

Riichi Mahjong是一种多玩家、信息不完全的游戏,具有随机性和高维状态空间的特性。这些属性构成了强化学习中复杂决策问题的独特挑战。尽管先前研究主要依赖于从人类游戏日志中监督学习来预训练策略,但能够从头开始学习(tabula rasa)的算法在通用性上具有更大潜力,如AlphaZero所示。为促进此类研究,我们引入了Mahjax,一个完全向量化实现的Riichi Mahjong环境,用于在图形处理器(GPU)上实现大规模的回放并行化。我们还提供了一个高质量的可视化工具,以简化调试和与训练代理的交互。实验结果表明,Mahjax在八块NVIDIA A100 GPU上分别实现了高达200万和100万步每秒的吞吐量。此外,我们通过展示代理能够有效训练以提高其相对于基线策略的排名,验证了该环境在强化学习中的实用性。

英文摘要

Riichi Mahjong is a multi-player, imperfect-information game characterized by stochasticity and high-dimensional state spaces. These attributes present a unique combination of challenges that mirror complex real-world decision-making problems in reinforcement learning. While prior research has heavily relied on supervised learning from human play logs to pre-train the policy, algorithms capable of learning tabula rasa (from scratch) offer greater potential for general applicability, as evidenced by the AlphaZero lineage. To facilitate such research, we introduce Mahjax, a fully vectorized Riichi Mahjong environment implemented in JAX to enable large-scale rollout parallelization on Graphics Processing Units (GPUs). We also provide a high-quality visualization tool to streamline debugging and interaction with trained agents. Experimental results demonstrate that Mahjax achieves throughputs of up to 2 million and 1 million steps per second on eight NVIDIA A100 GPUs under the no-red and red rules, respectively. Furthermore, we validate the environment's utility for reinforcement learning by showing that agents can be trained effectively to improve their rank against baseline policies.

2605.20576 2026-05-21 cs.CV

$Δ$ynamics: Language-Based Representation for Inferring Rigid-Body Dynamics From Videos

$Δ$ynamics: 一种基于语言的表示方法,用于从视频中推断刚体动力学

Chia-Hsiang Kao, Cong Phuoc Huynh, Chien-Yi Wang, Noranart Vesdapunt, Stefan Stojanov, Bharath Hariharan, Oleksandr Obiednikov, Ning Zhou

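AI summary This paper proposes the $Δ$YNAMICS framework, which uses language as a unified representation of rigid-body dynamics: it generates scene configurations for physics simulation as structured text and improves generalization with natural-language motion reasoning and optical-flow input. On the CLEVRER dataset it achieves a segmentation IoU 7x that of existing VLMs, and it transfers well to a new dataset.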

Comments Accepted to CVPR 2026. Project page: https://iandrover.github.io/2026_dynamics

详情
AI中文摘要

从单目视频中推断刚体物理状态和属性是实现基于物理的感知和模拟的关键步骤。现有方法假设特定的物理系统、物体类型和相机姿态,无法泛化到复杂的现实环境。我们引入$Δ$YNAMICS,一种视觉-语言框架,利用语言作为刚体动力学的统一表示。不同于直接预测参数,$Δ$YNAMICS生成结构化的文本格式场景配置用于物理模拟。我们通过整合自然语言运动推理和利用光流作为语义无关的输入来增强模型的泛化能力。在CLEVRER数据集上,$Δ$YNAMICS实现了0.30的分割IoU,比领先的VLMs(InternVL3-8B,Qwen2.5-VL-7B和Claude-4-Sonnet)提高了7倍。此外,测试时采样和进化搜索分别将分割IoU提高27%和120%。最后,我们展示了在包含235个现实世界刚体视频的新数据集上的良好迁移能力,突显了语言驱动的物理推断在连接感知和模拟方面的潜力。

英文摘要

Inferring rigid-body physical states and properties from monocular videos is a fundamental step toward physics-based perception and simulation. Existing approaches assume specific underlying physical systems, object types, and camera poses, making them unable to generalize to complex real-world settings. We introduce $Δ$YNAMICS, a vision-language framework that uses language as a unified representation of rigid-body dynamics. Instead of directly predicting parameters, $Δ$YNAMICS generates scene configurations in a structured text format for physics simulation. We enhance the model's generalization by integrating natural language motion reasoning and leveraging optical flow as a semantic-agnostic input. On the CLEVRER dataset, $Δ$YNAMICS achieves a segmentation IoU of 0.30, a 7x improvement over leading VLMs (InternVL3-8B, Qwen2.5-VL-7B and Claude-4-Sonnet). Additionally, test-time sampling and evolutionary search further boost performance by 27% and 120% in segmentation IoU, respectively. Finally, we demonstrate strong transfer to a new dataset of 235 real-world rigid-body videos, highlighting the potential of language-driven physics inference for bridging perception and simulation.
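The headline metric above is segmentation IoU. As a reminder of what that number measures, here is a minimal sketch of intersection over union for two binary masks. The abstract does not specify how masks are obtained from the simulated scene or how scores are averaged over frames and objects, so the function `mask_iou`, the toy masks, and the empty-mask convention are all assumptions.

```python
def mask_iou(pred, target):
    """Intersection over union of two same-shaped binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention assumed here: two empty masks count as a perfect match.
    return inter / union if union else 1.0


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]

iou = mask_iou(pred, target)  # 2 shared pixels out of 4 covered pixels
```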

2605.20569 2026-05-21 cs.CV

End-to-End Unmixing with Material Prompts for Hyperspectral Object Tracking

端到端材料提示的超光谱目标跟踪

Xu Han, Mohammad Aminul Islam, Lei Wang, Zekun Long, Guanmanyi Fu, Wangshu Cai, Kuldip K. Paliwal, Jun Zhou

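AI summary This paper proposes an end-to-end material-aware tracking framework that jointly optimizes material decomposition and target localization, using a weighted target-oriented unmixing loss to align material representations with localization accuracy, so that hyperspectral tracking becomes more robust to appearance ambiguity, illumination change, and background clutter.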

详情
AI中文摘要

超光谱成像编码了丰富的材料属性,可以在外观模糊、光照变化和背景杂波下提高跟踪鲁棒性。然而,由于超光谱视频数据有限,许多现有方法通过空间或通道融合策略适应预训练的RGB跟踪器,很大程度上忽略了超光谱成像中的内在材料信息。此外,很少的材料感知方法通常依赖于外部光谱解混管道,这些管道与跟踪目标解耦,限制了对材料表示的有效优化。为了解决这些限制,我们将超光谱目标跟踪公式化为材料分解和目标定位的联合优化问题,通过加权目标导向的解混损失将两个任务耦合起来,显式地对齐材料表示与定位精度。具体来说,我们提出了一种用于深度学习光谱解混的材料表示分解模块,具有自适应频率分解。基于分解的材料表示,我们进一步引入了双分支小波增强的材料提示模块,通过频域中的高效空间-材料交互学习低频和高频的材料提示。该框架是模型无关的,可以无缝扩展到不同的解混后端。在标准的超光谱跟踪基准上的大量实验验证了所提出端到端材料感知跟踪框架的最先进性能,并验证了其有效性。代码可在https://github.com/han030927/E2EMPT上获得。

英文摘要

Hyperspectral imagery encodes rich material properties that can improve tracking robustness under appearance ambiguity, illumination change, and background clutter. However, due to the limited availability of hyperspectral video data, many existing methods adapt pretrained RGB trackers via spatial or channel fusion strategies, largely neglecting the intrinsic material information in hyperspectral imagery. Moreover, the few material-aware approaches typically rely on external spectral unmixing pipelines that are decoupled from the tracking objective, limiting effective optimization of material representations for target localization. To address these limitations, we formulate hyperspectral object tracking as a joint optimization problem of material decomposition and target localization, coupling the two tasks via a weighted target-oriented unmixing loss that explicitly aligns material representations with localization accuracy. Specifically, we propose a material representation decomposition module for deep learning-based spectral unmixing with adaptive frequency decomposition. Building on the decomposed material representations, we further introduce a dual-branch wavelet-enhanced material prompt module that learns low- and high-frequency material prompts through efficient spatial-material interactions in the frequency domain. The framework is model-agnostic and can be seamlessly generalized to different unmixing backbones. Extensive experiments on standard hyperspectral tracking benchmarks demonstrate state-of-the-art performance and validate the effectiveness of the proposed end-to-end material-aware tracking framework. Code is available at https://github.com/han030927/E2EMPT.

2605.20566 2026-05-21 cs.RO

Conflict-Aware Active Perception and Control in 3D Gaussian Splatting Fields via Control Barrier Functions

基于控制屏障函数的3D高斯点云场中冲突感知与控制

Amirhossein Mollaei Khass, Athanasios Cosse, Vivek Pandey, Nader Motee

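AI summary This paper proposes a conflict-aware active perception and control framework based on control barrier functions for navigating safely in 3D Gaussian Splatting environments while gathering information to reduce map uncertainty. A unified safety-critical, perception-aware quadratic program resolves the conflict between safety and perception objectives.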

Comments Project website: https://sircesoc.github.io/Conflict_Aware_Active_Perception/

详情
AI中文摘要

在不确定环境中主动感知要求机器人在安全导航的同时获取信息以减少地图不确定性。这些目标本质上存在冲突,因为信息丰富的视角通常位于不确定区域,具有更高的碰撞风险。为了解决这一挑战,我们开发了一种冲突感知和控制框架,用于在由3D高斯点云(3DGS)表示的环境中运行的机器人系统。通过从平均条件风险AV@R碰撞风险度量中导出的控制屏障函数(CBF)来确保安全,该度量考虑了几何不确定性和保证了安全集的前向不变性。为了提高感知,我们提出了一种风险感知的预期信息增益(EIG)公式,用于选择下一个最佳视角,并引入了将摄像机方向对齐局部信息上升方向的感知屏障函数。为了获得这些冲突的安全和感知目标的可处理公式,我们提出了一种统一的安全关键和感知感知二次规划程序,通过松弛变量放松感知约束。仿真结果表明,所提出的方法在安全性和信息获取方面均优于现有基于3DGS的方法。

英文摘要

Active perception in uncertain environments requires robots to navigate safely while acquiring informative observations to reduce map uncertainty. These objectives inherently conflict, as informative viewpoints often lie near uncertain regions with higher collision risk. To address this challenge, we develop a conflict-aware active perception and control framework for robotic systems operating in environments represented by 3D Gaussian Splatting (3DGS). Safety is enforced using a Control Barrier Function (CBF) derived from an Average Value-at-Risk (AV@R) collision-risk metric that accounts for geometric uncertainty and guarantees forward invariance of a safe set. To improve perception, we propose a risk-aware Expected Information Gain (EIG) formulation for selecting the next-best-view and introduce perception barrier functions that align the camera orientation with the local information-ascent direction. To obtain a tractable formulation for these conflicting safety and perception objectives, we propose a unified safety-critical, perception-aware quadratic program that enforces safety as a hard constraint while relaxing perception constraints through slack variables. Simulation results demonstrate that the proposed method improves both safety and information acquisition compared to existing 3DGS-based approaches.
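The abstract's quadratic program, with safety as a hard constraint and perception relaxed through slack, has a familiar generic shape, sketched below. The paper's exact cost, barrier functions, and dynamics are not given in the abstract, so every symbol here is an assumption: control-affine dynamics $\dot{x} = f(x) + g(x)u$, a nominal input $u_{\mathrm{ref}}$, a safety barrier $h_s$, a perception barrier $h_p$, class-$\mathcal{K}$ functions $\alpha_s, \alpha_p$, and a slack $\delta$ with penalty weight $\rho > 0$.

```latex
% Generic CBF-QP with one hard and one relaxed constraint (illustrative notation only).
\[
\begin{aligned}
\min_{u,\;\delta}\quad & \|u - u_{\mathrm{ref}}\|^{2} + \rho\,\delta^{2} \\
\text{s.t.}\quad
 & L_f h_s(x) + L_g h_s(x)\,u \;\ge\; -\alpha_s\!\big(h_s(x)\big)
   && \text{(safety, hard)} \\
 & L_f h_p(x) + L_g h_p(x)\,u + \delta \;\ge\; -\alpha_p\!\big(h_p(x)\big)
   && \text{(perception, relaxed)}
\end{aligned}
\]
```

Because $\delta$ is free, the perception constraint can always be satisfied, so the program stays feasible whenever the safety constraint alone is feasible. The penalty $\rho\,\delta^{2}$ then determines how much perception is sacrificed when the two objectives conflict.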

2605.20561 2026-05-21 cs.RO

Fault-Tolerant, Rigidity-Preserving Control of Inflatable Truss Robots

容错、保持刚性的可膨胀桁架机器人控制

James Wade, Isaac Weaver, Mihai Stanciu, Nathan Usevitch

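AI summary This paper presents a fault-tolerant control framework for inflatable robotic trusses that maintains functionality despite motor failures, with three key contributions: extending the kinematic optimization to handle arbitrary combinations of motor failures, introducing discrete-time control barrier function constraints that guarantee structural rigidity, and implementing closed-loop position control with onboard encoder feedback and a forward-kinematics-based state estimator.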

详情
AI中文摘要

等周机器人桁架可以适应不同的任务和环境,因为它们具有高强重比,能够大幅改变自身形状,并可以重新配置成多种不同形状。然而,操作环境中电机故障如果未得到妥善处理,会严重限制操作能力。本文提出了一种容错控制框架,用于可膨胀机器人桁架,能够在电机故障的情况下保持功能,通过三个关键贡献。首先,我们扩展运动学优化以处理任意组合的电机故障,通过施加等式约束确保故障执行器不被使用。其次,我们引入离散时间控制屏障函数(DTCBF)约束,数学上保证结构刚性的同时最大化工作空间利用率,这是在离散时间控制下可靠操作桁架机器人的重要要求。第三,我们利用 onboard 编码器反馈和基于正向运动学的状态估计器实现闭环位置控制,在存在干扰的情况下提高位置精度。我们通过模拟和硬件实验验证了我们的方法,针对一个具有6个执行器的2D等周桁架测试平台。对于具有6个执行器的2D配置,我们展示了在单个电机故障下工作空间保留超过69%,并利用闭环控制实现了跟踪精度的25%提升。这些结果为在退化驱动条件下更鲁棒和坚韧的等周桁架机器人奠定了基础。

英文摘要

Isoperimetric robotic trusses can adapt to different tasks and environments because they have a high strength-to-weight ratio, can change their own shape dramatically, and can be reconfigured into a variety of different shapes. However, motor failures in operational environments can severely limit operational capabilities if not properly addressed. This paper presents a fault-tolerant control framework for an inflatable robotic truss that maintains functionality despite motor failures, built on three key contributions. First, we extend the kinematic optimization to handle arbitrary combinations of motor failures by imposing equality constraints to ensure failed actuators are not used. Second, we introduce discrete-time control barrier function (DTCBF) constraints that mathematically guarantee structural rigidity while maximizing workspace utilization, a critical requirement for reliable operation of truss robots under discrete-time control. Third, we implement closed-loop position control using onboard encoder feedback and a forward kinematics-based state estimator, improving positional accuracy in the presence of disturbances. We validate our approach through simulation and hardware experiments on a 2D isoperimetric truss testbed. For a 2D configuration with 6 actuators, we demonstrate >69% workspace preservation under single-motor failures and a >25% improvement in tracking accuracy with closed-loop control. These results establish a foundation for more robust and resilient isoperimetric truss robots operating under degraded actuation.
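The second contribution relies on discrete-time control barrier functions. The abstract does not state the constraint, so the block below shows only the commonly used DTCBF condition and the invariance argument that follows from it. The symbols are assumptions: $x_k$ is the truss state at step $k$, $h(x) \ge 0$ describes the set of rigid configurations, and $0 < \gamma \le 1$ is a decay rate.

```latex
% Common DTCBF condition and the resulting invariance bound (illustrative notation only).
\[
  h(x_{k+1}) - h(x_k) \;\ge\; -\gamma\, h(x_k)
  \quad\Longrightarrow\quad
  h(x_k) \;\ge\; (1-\gamma)^{k}\, h(x_0) \;\ge\; 0
  \quad \text{whenever } h(x_0) \ge 0 .
\]
```

The bound follows by induction from $h(x_{k+1}) \ge (1-\gamma)\,h(x_k)$. A smaller $\gamma$ makes the controller more conservative near the rigidity boundary, while $\gamma = 1$ only requires $h(x_{k+1}) \ge 0$ and so leaves the most workspace available.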

2605.20555 2026-05-21 cs.LG cs.AI

Complementing reinforcement learning with SFT through logit averaging in the post training of LLMs

通过logit平均在LLMs后训练中补充强化学习

Xingwei Gan, Ying Zhu

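AI summary This paper proposes complementing reinforcement learning in LLM post-training through logit averaging. The method is integrated into Group Relative Policy Optimization (GRPO) and uses neither KL regularization nor a critic. The logit-averaging structure couples the trainable policy with a reference policy, leveraging the trainable policy's reasoning ability while retaining the formatting advantage of SFT.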

详情
AI中文摘要

我们介绍了一种新颖的方法,该方法对冻结的参考策略(例如SFT)和可训练策略的logits进行平均,并将该方法整合到Group Relative Policy Optimization (GRPO)中。与Reinforcement Learning with Verifiable Rewards (RLVR)方法不同,我们的方法不涉及Kullback Leibler (KL)正则化或critic;可训练策略和参考锚点通过logit平均结构耦合,以利用可训练策略的推理能力,同时保持SFT的格式优势。我们的方法在MATH、cn-k12和MMLU上进行了评估,结果表明其准确率高于或至少与传统的KL正则化GRPO相当。

英文摘要

We introduce a novel method that averages the logits of a frozen reference policy (e.g., SFT) and a trainable policy, and incorporate the method into Group Relative Policy Optimization (GRPO). In contrast to Reinforcement Learning with Verifiable Rewards (RLVR) methods, our proposal does not involve Kullback-Leibler (KL) regularization or a critic; the trainable policy and the reference anchor are coupled through the logit averaging structure to leverage the reasoning expertise of the trainable policy while maintaining the formatting advantage of SFT. Our method is evaluated on MATH, cn-k12, and MMLU, and the results show a higher accuracy or at least comparable accuracy relative to the canonical KL-regularized GRPO.
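The coupling described above can be sketched for a single next-token distribution. This is an illustration under stated assumptions, not the authors' implementation: the abstract says only that the logits of the two policies are averaged, so the `weight` parameter (0.5 gives a plain average), the function names, and the toy logits are invented here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def averaged_policy(ref_logits, train_logits, weight=0.5):
    """Token distribution obtained by mixing reference and trainable logits."""
    mixed = [(1 - weight) * r + weight * t
             for r, t in zip(ref_logits, train_logits)]
    return softmax(mixed)


# Toy 3-token vocabulary; the values are arbitrary.
ref_logits = [2.0, 0.5, -1.0]    # frozen reference (e.g., SFT) policy
train_logits = [0.0, 1.5, 0.5]   # trainable policy

probs = averaged_policy(ref_logits, train_logits)

# An equal-weight logit average is the normalized geometric mean of the two policies.
p_ref = softmax(ref_logits)
p_train = softmax(train_logits)
geo = [math.sqrt(a * b) for a, b in zip(p_ref, p_train)]
geo_total = sum(geo)
geo_mean_policy = [g / geo_total for g in geo]
```

Because the sampled distribution is a geometric mean, a token to which the reference assigns very low probability stays unlikely unless the trainable policy favors it strongly. That is one way to see how an anchor can act without an explicit KL penalty.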

2605.20554 2026-05-21 cs.AI cs.HC cs.SI

Personality Engineering with AI Agents: A New Methodology for Negotiation Research

利用AI代理的人格工程:谈判研究的新方法论

Michelle A. Vaccaro, Jared R. Curhan

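AI summary This paper proposes a methodology that uses AI agents to parameterize, manipulate, and evaluate negotiator personality, with the two core dimensions of the interpersonal circumplex, warmth and dominance, as the coordinate system. It offers a new way to rigorously test negotiation theory and to design the personalities of AI negotiation agents.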

详情
AI中文摘要

根据经典谈判理论,人们在谈判中的成功取决于他们平衡竞争需求的能力--共情与主张,表现出对他人的关心和对自己的关心,对人温和而对问题强硬。然而,人们难以管理这些张力,因此研究人员缺乏在受控条件下严格测试该领域规定的能力。AI代理没有相同的限制,其精确性、 repertoire、一致性以及可扩展性使能够贡献于谈判理论的新一类实验成为可能。在本文中,我们介绍了一种称为人格工程的方法论,该方法利用AI代理来精确参数化、操纵和评估谈判者的人格。我们提议使用人际圆周--以及其两个核心维度温暖和支配--作为该领域的基础坐标系统。这种方法不仅提供了一种严格测试经典谈判理论的方法,还为设计AI谈判代理的人格提供了一种实用指南。

英文摘要

According to canonical negotiation theory, people's success in a negotiation depends on how well they balance competing demands--empathizing and asserting, demonstrating concern for other and concern for self, being soft on the people and hard on the problem. Yet people struggle to manage these tensions, so researchers have lacked the ability to rigorously test the field's prescriptions under controlled conditions. AI agents do not face the same limitations, and their precision, repertoire, consistency, and scalability enable a new class of experiments to contribute to negotiation theory. In this article, we introduce personality engineering: a methodology that uses AI agents to precisely parameterize, manipulate, and evaluate negotiator personality. We propose using the interpersonal circumplex--and its two core dimensions of warmth and dominance--as a foundational coordinate system for the field. This approach offers both a rigorous methodology for testing classic negotiation theories and a practical guide for designing the personalities of AI negotiation agents.

2605.20551 2026-05-21 cs.CV cs.AI cs.RO

Faster or Stronger: Towards Flexible Visual Place Recognition via Weighted Aggregation and Token Pruning

更快或更强:通过加权聚合和标记剪枝实现灵活的视觉位置识别

Zichao Zeng, June Moh Goo, Junwei Zheng, Weijia Fan, Jiaming Zhang, Rainer Stiefelhagen, Jan Boehm

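AI summary This paper proposes the Weighted Aggregated Descriptor (WeiAD) and the token pruning framework WeiToP to improve the accuracy and efficiency of visual place recognition, allowing the accuracy-efficiency trade-off of feature extraction to be adjusted at inference time.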

详情
AI中文摘要

视觉位置识别(VPR)旨在将查询图像匹配到大规模数据库中相同地点的参考图像。最近最先进的方法采用视觉Transformer(ViTs)作为基础模型,提取对视角、光照和季节变化具有鲁棒性的补丁级特征,然后聚合为紧凑的全局描述符进行检索。大多数现有聚合方法将补丁标记均匀地池化到学习的簇中,尽管不同簇往往编码不同的空间或语义模式,并对VPR性能贡献不均。为了解决这一限制,我们提出了加权聚合描述符(WeiAD),在聚合过程中分配簇的权重,产生更具判别性的全局表示。除了准确性之外,检索延迟是大规模部署和资源受限边缘设备的关键关注点。先前的工作主要通过压缩全局描述符来减少延迟,而忽略了特征提取的成本,这在基于ViT的基础模型中变得更加严重。因此,我们引入了面向VPR的标记剪枝框架WeiToP,通过自蒸馏减少特征提取成本,其中聚合诱导的标记重要性监督一个轻量级剪枝模块,附加到早期Transformer层上,使推理时能够进行标记剪枝。在单次联合训练阶段后,WeiToP能够在推理时实现插拔式的标记剪枝,允许在不额外训练的情况下灵活地控制精度-效率权衡。此外,WeiToP在现有针对通用视觉任务的标记剪枝方法上表现更优。

英文摘要

Visual Place Recognition (VPR) aims to match a query image to reference images of the same place in a large-scale database. Recent state-of-the-art methods employ Vision Transformers (ViTs) as backbone foundation models to extract patch-level features that are robust to viewpoint, illumination, and seasonal variations, which are then aggregated into a compact global descriptor for retrieval. Most existing aggregation methods uniformly pool patch tokens into learned clusters, despite the fact that different clusters often encode distinct spatial or semantic patterns and contribute unequally to VPR performance. To address this limitation, we propose Weighted Aggregated Descriptor (WeiAD), which assigns weights to clusters during aggregation, producing more discriminative global representations. Beyond accuracy, retrieval latency is a critical concern for large-scale deployments and resource-constrained edge devices. Prior work mainly reduces latency by compressing global descriptors, while overlooking the cost of feature extraction, an issue exacerbated by ViT-based backbones. We therefore introduce WeiToP, a VPR-oriented token pruning framework that reduces feature extraction cost via self-distillation, where aggregation-induced token importance supervises a lightweight pruning module attached to an early transformer layer, enabling inference-time token pruning. After a single joint training phase, WeiToP enables plug-and-play token pruning at inference time, allowing flexible and on-demand control over the accuracy-efficiency trade-off without additional training. Moreover, WeiToP outperforms existing token pruning methods adapted from general vision tasks.

2605.20549 2026-05-21 cs.CV

MAPS: A Synthetic Dataset for Probing Vision Models in a Controlled 3D Scene Space

MAPS:用于在受控3D场景空间中探测视觉模型的合成数据集

Santiago Galella, Pamela Osuna-Vargas, Maren Wehrheim, Martina G. Vilas, Gemma Roig, Matthias Kaschube

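AI summary This paper presents the MAPS dataset for studying vision model behavior in a controlled 3D scene space. A regression-based sensitivity analysis of how 20 models rely on scene factors finds that camera distance and elevation are the main drivers of recognition failure, and that modern CNNs and transformers have similar sensitivity profiles.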

Comments 33 pages, 20 figures

详情
AI中文摘要

现代视觉模型在标准基准上表现强劲,但其整体准确率难以揭示驱动预测的场景属性。现有鲁棒性基准提供重要压力测试,但通常操纵全局2D图像属性,依赖现实世界变化或仅覆盖有限的3D对象和场景参数。我们引入MAPS(Manifolds of Artificial Parametric Scenes),一种可扩展的工具,用于受控地将视觉模型行为归因于场景参数。MAPS包含2,618个经过筛选的逼真3D网格,已验证在560个ImageNet类别上具有可识别性,并提供基于Blender的渲染管道,可按需生成图像,连续变化九个独立场景因素,涵盖背景、相机和照明,可扩展至其他因素。为了展示其适用性,我们使用MAPS评估20种卷积和Transformer模型,通过基于回归的敏感性分析量化其对这些场景因素的依赖性。我们发现所有测试架构中普遍存在一个几乎普遍的失败轴:相机距离和高度在识别失败中始终占主导地位,无论ImageNet准确性如何。然而,完整的敏感性结构揭示出现代CNN和Transformer模型聚集在一起,与旧架构不同,表明细粒度的架构设计选择,而非粗粒度的CNN与Transformer区别,是敏感性特征的更强决定因素。

英文摘要

Modern vision models achieve strong performance on standard benchmarks, yet their aggregate accuracy reveals little about which scene properties drive their predictions. Existing robustness benchmarks provide important stress tests, but typically manipulate global 2D image properties, rely on entangled real-world variation, or cover only a limited set of 3D objects and scene parameters. We introduce MAPS (Manifolds of Artificial Parametric Scenes), a scalable instrument for controlled attribution of vision model behavior to scene parameters. MAPS comprises 2,618 curated photorealistic 3D meshes validated for recognizability across 560 ImageNet classes and provides a Blender-based rendering pipeline for on-demand image generation under continuous variation of nine independent scene factors spanning background, camera, and lighting, extensible to other factors. To showcase its applicability, we use MAPS to evaluate 20 convolutional and transformer-based models by quantifying their reliance on these scene factors through regression-based sensitivity analysis. We find a near-universal failure axis across all tested architectures: camera distance and elevation consistently dominate recognition failure regardless of ImageNet accuracy. However, the full sensitivity structure reveals that modern CNNs and transformers cluster together, distinct from older architectures, suggesting that fine-grained architectural design choices, rather than the coarse CNN-versus-transformer distinction, are the stronger determinant of sensitivity profiles.

2605.20547 2026-05-21 cs.LG cs.AI stat.ML

Latent Process Generator Matching

潜在过程生成器匹配

Lukas Billera, Hedwig Nora Nordlinder, Ben Murrell

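AI summary This paper proposes latent process generator matching, a framework that treats the observed generative state as a deterministic image of a tractable Markov process, extending Generator Matching theory to time-dependent latent conditional processes.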

Comments 18 pages, 1 figure

详情
AI中文摘要

许多近期的流匹配和扩散式生成模型在训练过程中依赖于辅助的随机动力学:通过模拟更丰富的过程来定义条件目标,但辅助状态在生成时要么难以采样,要么并不属于期望的输出。现有的生成器匹配理论规范了对静态潜在随机变量的条件,而几篇近期论文证明了特定增强状态构造的投影结果的特殊情况。我们引入了潜在过程生成器匹配,一种通用框架,将观测到的生成状态视为可 tractable 马尔可夫过程的确定性图像 $X_t=Φ(Y_t)$。我们显示在这一设定下,可以在图像空间中学习一个随机过程的生成器,其一阶边缘分布与投影过程相同。这扩展并涵盖了文献中的离散潜在过程结果,并将生成器匹配从静态潜在变量扩展到丰富的时间依赖潜在条件过程家族。

英文摘要

Many recent flow-matching and diffusion-style generative models rely on auxiliary stochastic dynamics during training: a richer process is simulated to define conditional targets, but the auxiliary state is either intractable to sample at generation time or simply not part of the desired output. Existing Generator Matching theory formalises conditioning on static latent random variables, and several recent papers prove special cases of projection results for particular augmented-state constructions. We introduce latent process generator matching, a general framework that treats the observed generative state as a deterministic image $X_t=Φ(Y_t)$ of a tractable Markov process $Y_t$. We show that in this setting one may learn the generator of a stochastic process on the image space which has the same one-time marginal distributions as the projected process. This generalizes and subsumes the discrete latent process results from the literature, and extends Generator Matching from static latent variables to a rich family of time-dependent latent conditional processes.