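arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and related fields.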

2605.14917 2026-05-15 cs.LG cs.CE cs.IT math.IT stat.ML

A Mutual Information Lower Bound for Multimodal Regression Active Learning

Leonardo Ferreira Guilhoto, Akshat Kaushal, Paris Perdikaris

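AI summary: This paper addresses active learning for multimodal regression and proposes a new acquisition function, MI-LB, to capture model uncertainty more accurately. It introduces a dual-index framework that separates epistemic from aleatoric uncertainty and derives an information-theoretic lower bound on mutual information as the acquisition objective. Experiments show that the method performs strongly on multimodal system benchmarks and outperforms a range of existing baselines.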

2605.14915 2026-05-15 cs.LG

TILBench: A Systematic Benchmark for Tabular Imbalanced Learning Across Data Regimes

Ruizhe Liu, Jiaqi Luo

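AI summary: This paper presents TILBench, a systematic benchmark for evaluating imbalanced learning on tabular data. It tests more than 40 representative algorithms on 57 diverse tabular datasets across more than 200,000 controlled experiments, revealing how methods differ in predictive performance, robustness, and computational scalability. No single method is best in every setting; effectiveness depends heavily on data characteristics and compute constraints, and the authors offer method-selection guidance for practical use on that basis.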

2605.14913 2026-05-15 cs.CV

Representative Attention For Vision Transformers

Yuntong Li, Hainuo Wang, Hengxing Liu, Mingjia Li, Xiaojie Guo

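AI summary: This paper proposes Representative Attention (RPAttention), a linear global attention mechanism that targets two problems of conventional self-attention in Vision Transformers: high computational cost and dependence on image coordinates. Its core idea is to dynamically generate semantically relevant representative tokens in representation space, replacing intermediate tokens derived from fixed spatial partitions, so that semantically related regions can communicate across spatial distance. The method keeps a global receptive field while reducing token-interaction complexity from quadratic to linear, and experiments report strong performance on image classification, object detection, and semantic segmentation.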

English abstract

Linear attention has emerged as a promising direction for scaling Vision Transformers beyond the quadratic cost of dense self-attention. A prevalent strategy is to compress spatial tokens into a compact set of intermediate proxies that mediate global information exchange. However, existing methods typically derive these proxy tokens from predefined spatial layouts, causing token compression to remain anchored to image coordinates rather than the semantic organization of visual content. To overcome this limitation, we propose Representative Attention (RPAttention), a linear global attention mechanism that performs token compression directly in representation space. Instead of constructing intermediate tokens from fixed spatial partitions, it dynamically forms a compact set of learned representative tokens to enable semantically related regions to communicate regardless of their spatial distance, by following a lightweight Gather-Interact-Distribute paradigm. Spatial tokens are first softly gathered into representative tokens through competitive similarity-based routing. The representatives then perform global interaction within a compact latent space, before broadcasting the refined information back to all spatial tokens via query-driven cross-attention. By replacing coordinate-driven aggregation with representation-driven compression, RPAttention preserves global receptive fields while adaptively aligning token communication with the content structure of each input. RPAttention reduces the dominant token interaction complexity from quadratic to linear scaling with respect to the number of spatial tokens, while maintaining expressive global context modeling. Extensive experiments across diverse vision transformer backbones on image classification, object detection, and semantic segmentation demonstrate the effectiveness of our design.
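
The Gather-Interact-Distribute steps described in the abstract above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `gather_interact_distribute`, the use of plain dot-product similarity, and the absence of learned projection matrices, multiple heads, and normalization layers are all simplifying assumptions made here for illustration.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gather_interact_distribute(tokens, reps):
    """tokens: (N, d) spatial tokens; reps: (M, d) representative seeds, M << N.

    Hypothetical simplification: no learned query/key/value projections.
    """
    d = tokens.shape[1]
    scale = 1.0 / np.sqrt(d)

    # Gather: each token splits its mass across the M representatives
    # (softmax over representatives, so they compete for each token).
    route = softmax(tokens @ reps.T * scale, axis=1)             # (N, M)
    weights = route / (route.sum(axis=0, keepdims=True) + 1e-9)  # per-representative average
    gathered = weights.T @ tokens                                # (M, d)

    # Interact: dense attention among the M representatives only, O(M^2 d).
    attn = softmax(gathered @ gathered.T * scale, axis=1)        # (M, M)
    refined = attn @ gathered                                    # (M, d)

    # Distribute: every spatial token queries the refined representatives, O(N M d).
    back = softmax(tokens @ refined.T * scale, axis=1)           # (N, M)
    return back @ refined                                        # (N, d)
```

With M held fixed, the gather and distribute steps each cost O(N M d) and the interaction step costs O(M^2 d), so the total grows linearly in the number of spatial tokens N, which matches the scaling the abstract claims.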

2605.14912 2026-05-15 cs.AI cs.CY cs.HC cs.LG

From Sycophantic Consensus to Pluralistic Repair: Why AI Alignment Must Surface Disagreement

Varad Vishwarupe, Nigel Shadbolt, Marina Jirotka

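AI summary: This paper examines pluralistic alignment in AI alignment, arguing that current reinforcement-learning-based AI systems tend to defer to user opinions when confronted with differing values, so genuine value conflict and disagreement go unsurfaced. The authors propose three conversational mechanisms grounded in Gricean pragmatic principles (scoping, signaling, and repair), stressing that an AI should acknowledge the limits of its own perspective, expose value conflicts, and revise on principled grounds rather than simply accommodate. The study introduces the Pluralistic Repair Score (PRS) as a metric and shows experimentally that existing models follow user opinions on contested questions but repair weakly, which highlights the importance of deployment-time governance mechanisms for achieving pluralism.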

2605.14911 2026-05-15 cs.RO

Chrono-Gymnasium: An Open-Source, Gymnasium-Compatible Distributed Simulation Framework

Bocheng Zou, Harry Zhang, Khailanii Slaton, Jingquan Wang, Derrick Ruan, Huzaifa Mustafa Unjhawala, Radu Serban, Dan Negrut

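AI summary: This paper presents Chrono-Gymnasium, an open-source distributed simulation framework aimed at the high computational cost of high-fidelity physics simulation for robots and complex mechanical systems, which makes such simulation hard to use in data-intensive tasks. The framework is built on Ray, is compatible with the Gymnasium interface, integrates with modern machine learning libraries, and provides the synchronization and communication mechanisms needed for distributed execution. Two case studies show its use in reinforcement learning and Bayesian optimization, demonstrating substantially higher simulation throughput while preserving physical accuracy.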

2605.14908 2026-05-15 cs.CV

SteerSeg: Attention Steering for Reasoning Video Segmentation

Ali Cheraghian, Hamidreza Dastmalchi, Abdelwahed Khamis, Morteza Saberi, Aijun An, Lars Petersson

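AI summary: Reasoning video segmentation requires localizing target objects in video frames from natural-language descriptions, often involving spatial reasoning and implicit references. Existing methods extract attention maps from a frozen large vision-language model (LVLM) as a segmentation prior, giving training-free localization, but these attention maps are shaped for text generation, so the localization signal is diffuse. This paper proposes SteerSeg, a lightweight framework that identifies attention bias and introduces input-level conditioning to steer the attention distribution. Combined with learnable soft prompts and reasoning-guided chain-of-thought (CoT) prompts, it substantially improves the LVLM's spatial grounding and generalizes well across several benchmarks.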

Comments Project page: https://steerseg.github.io

2605.14907 2026-05-15 cs.AI

KGPFN: Unlocking the Potential of Knowledge Graph Foundation Model via In-Context Learning

Yisen Gao, Jiaxin Bai, Haoyu Huang, Zhongwei Xie, Yufei Li, Hong Ting Tsang, Sirui Han, Yangqiu Song

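AI summary: Knowledge graph foundation models aim to generalize to graphs with unseen entities and relations by learning transferable relational structure. Most existing methods focus on relation-level generality, however, and in-context learning, a key pillar of foundation models, has received little study in knowledge graph reasoning. This paper proposes KGPFN, a knowledge graph foundation model that incorporates a prior-data fitted network and reasons over local and global information in a structured context, achieving strong cross-graph adaptation and good results on several benchmarks.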

2605.14906 2026-05-15 cs.CV

MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models

Xiyu Ren, Zhaowei Wang, Yiming Du, Zhongwei Xie, Chi Liu, Xinlin Yang, Haoyue Feng, Wenjun Pan, Tianshi Zheng, Baixuan Xu, Zhengnan Li, Yangqiu Song, Ginny Wong, Simon See

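AI summary: MemLens is a comprehensive benchmark for evaluating the multimodal long-term memory of large vision-language models (LVLMs). It covers five aspects, including information extraction, multi-turn reasoning, and temporal reasoning, and tests model behavior at different context lengths. The study finds that long-context models do well in short conversations but degrade as conversations grow, while memory-augmented agents are more stable across lengths but lose visual detail when memories are compressed at storage time. The experiments indicate that no single approach handles multi-turn multimodal tasks well, and the authors point to hybrid architectures that combine long-context attention with structured multimodal retrieval.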

Comments Work in progress

2605.14900 2026-05-15 cs.AI

COREKG: Coreset-Guided Personalized Summarization of Knowledge Graphs

Sohel Aman Khan, Raghava Mutharaju, Supratim Shit

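AI summary: This paper proposes COREKG, a personalized knowledge graph summarization method based on coreset theory, aimed at the difficulty of using large-scale knowledge graphs for tasks such as question answering and visualization. Using sensitivity scores derived from user query patterns, the method samples a representative subset of triples from the knowledge graph so that the summary stays structurally and semantically accurate. Experiments on several real-world datasets show that COREKG beats existing methods on query accuracy and structural coverage while substantially reducing storage and query overhead.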

Comments Accepted at IJCAI 2026

2605.14897 2026-05-15 cs.LG cs.AI

Critic-Driven Voronoi-Quantization for Distilling Deep RL Policies to Explainable Models

Senne Deproost, Denis Steckelmacher, Ann Nowé

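AI summary: This paper studies how to distill deep reinforcement learning policies into explainable models, balancing the tension between performance and interpretability. It proposes a critic-driven Voronoi quantization method that partitions the state space and fits a linear function to each region, yielding a simplified representation of a complex policy. The method uses the original policy's critic network to iteratively refine the sub-policies, improving both the performance and the interpretability of the distilled model.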

Comments Accepted for presentation at EXTRAAMAS 2026
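
The representation described in the summary above (a Voronoi partition of the state space with one linear function per region) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the centroids are supplied by the caller, each cell's affine map is fit by ordinary least squares to the teacher's actions, and the critic-driven iterative refinement is not shown. The function names `fit_voronoi_linear` and `predict` are hypothetical.

```python
import numpy as np

def fit_voronoi_linear(states, actions, centroids):
    """Fit one affine map per Voronoi cell.

    states: (N, s) visited states; actions: (N, a) teacher actions;
    centroids: (K, s) cell centres (assumed given, not learned here).
    """
    # Assign every state to its nearest centroid (its Voronoi cell).
    dists = ((states[:, None, :] - centroids[None]) ** 2).sum(-1)  # (N, K)
    cell = np.argmin(dists, axis=1)

    # Append a constant column so each cell gets a bias term.
    X = np.hstack([states, np.ones((len(states), 1))])
    models = {}
    for k in range(len(centroids)):
        mask = cell == k
        if mask.any():  # cells that received no data get no model
            models[k] = np.linalg.lstsq(X[mask], actions[mask], rcond=None)[0]
    return models

def predict(state, centroids, models):
    # Look up the cell of a single state and apply that cell's affine map.
    k = int(np.argmin(((centroids - state) ** 2).sum(-1)))
    return np.append(state, 1.0) @ models[k]
```

Each prediction is then explainable as "the state fell in cell k, where the action is this linear function of the state". A state that lands in a cell with no fitted model raises a `KeyError` in this sketch.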

2605.14896 2026-05-15 cs.SD cs.LG

Text-Dependent Speaker Verification (TdSV) Challenge 2024: Team Naive System Report

Amir Mohammad Rostami, Pourya Jafarzadeh

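AI summary: This paper describes the system submitted by team "Naive" to the 2024 Text-Dependent Speaker Verification (TdSV) Challenge. The system adapts the existing state-of-the-art networks ResNet-TDNN and NeXt-TDNN and adds a lightweight, efficient EfficientNet-A0 model; combined with data augmentation and tuned hyperparameters, it reaches a minimum detection cost function (MinDCF) of 0.0461 and an equal error rate (EER) of 1.3%. The study shows the effectiveness of multi-model ensembling for speaker and phrase verification.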

2605.14894 2026-05-15 cs.CV

SEDiT: Mask-Free Video Subtitle Erasure via One-step Diffusion Transformer

Zheng Hui, Yunlong Bai

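AI summary: This paper proposes SEDiT, a video subtitle erasure method that removes subtitles directly without a precomputed mask. The method is based on a one-step diffusion transformer; its single-stage framework avoids the sub-optimality of conventional two-stage pipelines, and the authors give a theoretical argument for the feasibility of one-step denoising. To keep results temporally consistent, the paper adopts a hybrid training strategy and supports efficient processing of native high-resolution video.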

Comments Project page: http://zheng222.github.io/SEDiT_project

English abstract

Recent breakthroughs in video diffusion models have significantly accelerated the development of video editing techniques. However, existing methods often rely on inpainting video frames based on masked input, which requires extracting the target video mask in advance, and the precision of the segmentation directly affects the quality of the completion. In this paper, we present SEDiT, a novel one-stage video Subtitle Erasure approach via One-step Diffusion Transformer. We introduce a mask-free inference approach that enables direct erasure of the targeted subtitle. The proposed one-stage framework mitigates the sub-optimality inherent in the two-stage processing of prior models. Since subtitle removal is a localized editing task in which most pixels remain unchanged, the underlying distribution shift is minimal, making it well-suited to one-step generation under rectified flow. We empirically validate the reliability of one-step denoising and further provide a formal theoretical justification. Under the localized-editing structure of subtitle removal, the conditional optimal transport (OT) map and its induced rectified flow velocity field are Lipschitz continuous with respect to the latent variable, which underpins the theoretical feasibility of one-step sampling. To address the challenge of long-term temporal consistency, we adopt a hybrid training strategy by occasionally conditioning the model with a clean first-frame latent. This facilitates temporal continuity, allowing each segment during inference to leverage the output of its predecessor. To avoid visible seams caused by cropping and reinserting processed targets, particularly in scenarios involving substantial motion, we feed the original video directly into SEDiT. Thanks to one-step and chunk-wise streaming inference, our method can efficiently handle native 1440p video with infinite length.

2605.14893 2026-05-15 cs.CV cs.AI cs.LG

Your CLIP has 164 dimensions of noise: Exploring the embeddings covariance eigenspectrum of contrastively pretrained vision-language transformers

Jakub Grzywaczewski, Dawid Płudowski, Przemysław Biecek

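AI summary: This paper studies the structure of the latent space in contrastively pretrained vision-language models (VLMs) and finds that the shared latent space contains a large amount of non-semantic multimodal noise. Using a spectral decomposition of the covariance matrix, the authors split the latent space into a semantic-signal subspace and a shared-noise subspace, and observe that the noise structure is strongly invariant across data subgroups. Experiments show that removing these noise dimensions has little effect on downstream performance and can even improve it, indicating that the latent space of modern VLMs holds substantial architecture-induced noise rather than being dominated by task-relevant semantics alone.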
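
The covariance spectral decomposition mentioned in the summary above can be sketched in NumPy as follows. The summary does not say how the paper decides which eigen-directions count as noise, so this sketch leaves that choice to the caller through the `drop` argument; the function names `covariance_eigenbasis` and `project_out` are hypothetical.

```python
import numpy as np

def covariance_eigenbasis(emb):
    """Eigen-decompose the covariance of embeddings emb with shape (N, d).

    Returns eigenvalues in descending order and the matching eigenvectors
    as columns.
    """
    centered = emb - emb.mean(axis=0)
    cov = centered.T @ centered / (len(emb) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending order
    return eigvals[::-1], eigvecs[:, ::-1]

def project_out(emb, eigvecs, drop):
    """Remove the components of emb along the eigenvectors indexed by drop."""
    basis = eigvecs[:, drop]                # (d, k), orthonormal columns
    return emb - (emb @ basis) @ basis.T
```

A downstream check in the spirit of the summary would compare task performance on `emb` against `project_out(emb, eigvecs, drop)` for a candidate set of noise directions.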

2605.14891 2026-05-15 cs.CV

Hierarchical Image Tokenization for Multi-Scale Image Super Resolution

Isma Hadji, Enrique Sanchez, Adrian Bulat, Brais Martinez, Georgios Tzimiropoulos

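AI summary: This paper proposes a multi-scale image super-resolution method based on visual autoregressive (VAR) models. By introducing Hierarchical Image Tokenization (HIT) and a Direct Preference Optimization (DPO) regularization term, it addresses the shortcomings of existing methods in scale mapping and model complexity. HIT represents the image progressively at different scales and enforces token overlap across scales, which makes the model more flexible, while the DPO term relies only on low-resolution and high-resolution image pairs to steer the model toward higher-quality outputs. The method achieves leading multi-scale super-resolution results with a smaller model and without external training data.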

Comments Accepted for publication at ICML 2026. *Joint first authorship (alphabetical order). arXiv admin note: substantial text overlap with arXiv:2506.04990

2605.14888 2026-05-15 cs.SD cs.LG

PROCESS-2: A Benchmark Speech Corpus for Early Cognitive Impairment Detection

Madhurananda Pahar, Caitlin H. Illingworth, Bahman Mirheidari, Hend Elghazaly, Fritz Peters, Sophie Young, Wing-Zin Leung, Labhpreet Kaur, Daniel Blackburn, Heidi Christensen

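AI summary: PROCESS-2 is a large speech dataset for early cognitive impairment detection, intended to support research on automatic cognitive assessment from spontaneous and task-oriented speech. It contains recordings from 200 healthy participants, 150 people with mild cognitive impairment, and 50 people with dementia, about 21 hours in total, covering picture description and verbal fluency tasks, with manually verified transcripts and metadata. Through rigorous clinical validation and a defined partitioning scheme, PROCESS-2 is designed to be reliable and practical, offering a reproducible benchmark resource for the field.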

2605.14886 2026-05-15 cs.AI

BiFedKD: Bidirectional Federated Knowledge Distillation Framework for Non-IID and Long-Tailed ECG Monitoring

Zixuan Shu, Tiancheng Cao, Hen-Wei Huang

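AI summary: In Internet of Medical Things (IoMT) networks, electrocardiogram (ECG) monitoring is constrained by data-sharing regulations and privacy protection. To address the high communication cost of model updates in federated learning and the performance drop under non-IID and long-tailed label distributions, this paper proposes BiFedKD, a bidirectional federated knowledge distillation framework that improves model alignment through temperature scaling and an aggregated distillation mechanism. Experiments on the MIT-BIH arrhythmia dataset show that BiFedKD substantially improves accuracy and Macro-F1 while greatly reducing communication and computation overhead.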
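
The temperature scaling mentioned in the summary above is commonly realized with the standard temperature-scaled distillation loss, sketched below in NumPy. This is the generic formulation (KL divergence between softened teacher and student distributions, multiplied by T squared), not necessarily the exact loss used in BiFedKD; the function name `kd_loss` and the default `T=2.0` are assumptions.

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def kd_loss(student_logits, teacher_logits, T=2.0):
    """Mean KL(teacher || student) on temperature-softened distributions.

    Both inputs have shape (batch, classes). The T*T factor keeps gradient
    magnitudes comparable across temperatures.
    """
    log_p_teacher = log_softmax(teacher_logits / T)
    log_p_student = log_softmax(student_logits / T)
    kl = (np.exp(log_p_teacher) * (log_p_teacher - log_p_student)).sum(axis=-1)
    return float(T * T * kl.mean())
```

In a bidirectional scheme, the same loss could plausibly be applied once with the server model as teacher and once with the aggregated client outputs as teacher, though the summary does not spell out that detail.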

2605.14885 2026-05-15 cs.CV

Masked Next-Scale Prediction for Self-supervised Scene Text Recognition

Zhuohao Chen, Zeng Li, Yifei Zhang, Chang Liu, Yu Zhou

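AI summary: Scene text recognition needs to model how visual structure evolves from coarse layout to fine character strokes, but existing methods depend on large amounts of labeled data. This paper proposes Masked Next-Scale Prediction (MNSP), a unified self-supervised framework that jointly learns cross-scale prediction and masked image reconstruction to model the hierarchical structure of scene text explicitly. It introduces a Next-Scale Prediction (NSP) module that predicts high-resolution features from low-resolution context, together with a multi-scale language alignment module that preserves semantic consistency, and experiments report state-of-the-art results on several benchmark datasets.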

Comments Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026 Findings Track. 10 pages, 4 figures

2605.14880 2026-05-15 cs.CV cs.GR cs.LG

Denoising-GS: Gaussian Splatting with Spatial-aware Denoising

Qingyuan Zhou, Xinyi Liu, Weidong Yang, Ning Wang, Shuquan Ye, Ben Fei, Ying He, Wanli Ouyang

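AI summary: This paper proposes Denoising-GS, a high-fidelity novel view synthesis method. It targets the noise that 3D Gaussian Splatting (3DGS) introduces during optimization when the initial point cloud is sparse and incomplete, and introduces a spatial-aware denoising framework. By considering both the positions and the spatial structure of the Gaussian primitives, the method designs an optimizer that preserves the spatial optimization flow and a denoising strategy based on spatial gradients, improving the coherence and consistency of denoising; uncertainty estimation and spatial-consistency optimization improve results further. Experiments show that Denoising-GS achieves state-of-the-art results on several benchmark datasets.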

2605.14877 2026-05-15 cs.CV

HeatKV: Head-tuned KV-cache Compression for Visual Autoregressive Modeling

Jonathan Cederlund, Axel Berg, Durmus Alp Emre Acar, Chuteng Zhou, Pontus Giselsson

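AI summary: Visual autoregressive (VAR) models deliver strong image generation quality at low latency but face severe KV-cache memory limits. This paper proposes HeatKV, a compression method that allocates cache budget according to how much each attention head attends to previously generated scales, making memory use more efficient. The method ranks attention heads on a small offline calibration set and builds a static pruning schedule from that ranking, which substantially raises the KV-cache compression ratio while preserving image fidelity and generation quality, and sets a new state of the art for KV-cache compression in VAR models.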

Comments 18 pages total including appendix; 6 main-paper figures, 2 appendix figures; 4 tables

2605.14874 2026-05-15 cs.CV

LPH-VTON: Resolving the Structure-Texture Dilemma of Virtual Try-On via Latent Process Handover

Yixin Liu, Baihong Qian, Jinglin Jiang, Jeffery Wu, Yan Chen, Wei Wang, Yida Wang, Lanqing Yang, Guangtao Xue

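AI summary: Virtual try-on (VTON) aims to generate realistic garment images precisely aligned with body pose and structure. Current diffusion-based methods face a fundamental trade-off between structural integrity and texture fidelity. This paper proposes the LPH-VTON framework, which decouples structure and texture generation within a single continuous denoising process so that the two are optimized together, resolving the conflict and achieving a superior balance of structural alignment and perceptual realism on the standard VITON-HD dataset.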

2605.14868 2026-05-15 cs.LG

Fast Adversarial Attacks with Gradient Prediction

Kamil Ciosek, Aleksandr V. Petrov, Nicolò Felicioni, Konstantina Palla

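AI summary: This paper proposes speeding up adversarial example generation by predicting gradients, avoiding the costly backward pass of conventional methods. Building on a kernel view of neural networks, it estimates the input gradient from hidden states of the forward pass using a lightweight linear regression, which greatly improves generation efficiency. Experiments show that the method keeps a high attack success rate while substantially increasing throughput, running more than 5 times faster than FGSM.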

Comments 17 pages
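
The idea in the summary above (regress input gradients on forward-pass hidden states, then attack without a backward pass) can be sketched as follows. This is an illustrative simplification, not the paper's method: the ridge penalty, the function names `fit_gradient_predictor` and `fast_sign_attack`, and the choice of an FGSM-style sign step on the predicted gradient are all assumptions made here.

```python
import numpy as np

def fit_gradient_predictor(hidden, grads, ridge=1e-3):
    """Fit W so that hidden @ W approximates the true input gradients.

    hidden: (N, h) hidden states collected on a calibration set;
    grads:  (N, d) input gradients computed once with backpropagation.
    Closed-form ridge regression: W = (H^T H + ridge * I)^-1 H^T G.
    """
    h = hidden.shape[1]
    return np.linalg.solve(hidden.T @ hidden + ridge * np.eye(h), hidden.T @ grads)

def fast_sign_attack(x, hidden, W, eps):
    """Perturb x by eps in the sign direction of the predicted gradient.

    Only a forward pass (to obtain hidden) and one matrix product are needed.
    """
    return x + eps * np.sign(hidden @ W)
```

The backward pass is paid only while collecting the calibration gradients; at attack time the cost is one forward pass plus a matrix multiplication per batch.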

2605.14867 2026-05-15 cs.LG cs.AI q-bio.NC

REALM: Retrospective Encoder Alignment for LFP Modeling

Peicheng Wu, Zhenyu Bu, Runze Ma, Lin Du

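AI summary: This study proposes REALM, a causal LFP decoding framework that addresses two problems in behavior decoding from local field potentials (LFPs): low accuracy and non-causal architectures that are unsuitable for real-time use. REALM transfers representational knowledge from a pretrained bidirectional LFP model to a causal student model, enabling efficient real-time decoding. Experiments show that REALM keeps decoding performance high while substantially reducing parameter count and training time, demonstrating the practicality and scalability of LFP-only models for wireless implantable brain-computer interfaces.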

English abstract

Spike activity has been the dominant neural signal for behavior decoding due to its high spatial and temporal resolution. However, as brain-computer interfaces (BCIs) move toward high channel counts and wireless operation, the high sampling frequency of spike signals becomes a bottleneck due to high power and bandwidth requirements. Local field potentials (LFPs) represent a different spatial-temporal scale of brain activity compared to spikes, offering key advantages including improved long-term stability, reduced energy consumption, and lower bandwidth requirement. Despite these benefits, LFP-based decoding models typically show reduced accuracy and often rely on non-causal architectures that are unsuitable for real-time deployment. To address these challenges, we propose REALM: a retrospective distillation framework that enables causal LFP decoding. Inspired by offline-to-online distillation strategies in speech recognition, REALM transfers representational knowledge from a pretrained multi-session bidirectional LFP model to a causal version for real-time deployment. We first pretrain a bidirectional Mamba-2 teacher model using a masked autoencoding objective. We then distill this teacher model into a compact student model via a combined objective of representation alignment and task supervision. REALM consistently outperforms both causal and non-causal LFP-based SOTA methods for behavior decoding. Notably, our REALM improves decoding performance while achieving a $2\times$ reduction in parameter count and a $10\times$ reduction in training time. These results demonstrate that retrospective distillation effectively bridges the gap between offline and real-time neural decoding. REALM shows that LFP-only models can achieve competitive decoding performance without reliance on spike signals, offering a practical and scalable alternative for next-generation wireless implantable BCIs.

2605.14865 2026-05-15 cs.AI cs.CL

Holistic Evaluation and Failure Diagnosis of AI Agents

Netta Madvil, Gilad Dym, Alon Mecilati, Edo Dekel, Jonatan Liberman, Rotem Brazilay, Liron Schliesser, Max Svidlo, Shai Nir, Orel Shalom, Yaron Friedman, David Connack, Amos Rimon, Philip Tannor, Shir Chorev

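AI summary: This study proposes a holistic evaluation and failure diagnosis framework for AI agents, aimed at the weakness of existing evaluation methods in explaining why a failure happened and where. The framework combines top-down agent-level diagnosis with bottom-up span-level evaluation, decomposing the analysis into independent span assessments so that trajectories of any length can be analyzed and every verdict comes with span-level supporting evidence. Experiments show leading results on several benchmarks, with clear gains in classification, localization, and joint localization-classification accuracy.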

2605.14857 2026-05-15 cs.AI cs.IR

A Deterministic Agentic Workflow for HS Tariff Classification: Multi-Dimensional Rule Reasoning with Interpretable Decisions

Yu Zhang, Dongjiang Zhuang, Qu Zhou, Zheng Huang, Junhe Wu, Jing Cao, Kai Chen

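AI summary: This paper proposes a deterministic agentic workflow for Harmonized System (HS) tariff classification, an expert-level task. Using multi-dimensional rule reasoning with an interpretable decision process, it tackles the challenge of satisfying priority rules along several axes at once, such as material, form, and function. The authors design a fixed-pipeline agent architecture that confines large language model calls to specific stages and keeps reflection and verification as local mechanisms, producing structured, interpretable classification decisions. Experiments show relatively high classification accuracy on the HSCodeComp dataset and suggest that some of its labels may be inconsistent with HS rules.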

English abstract

Harmonized System (HS) tariff classification is a high-stakes, expert-level task in which a free-form product description must be mapped to a specific six- or eight-digit code under the General Interpretive Rules (GIR), section notes, chapter notes, and Explanatory Notes. The difficulty lies not in knowledge volume but in *multi-dimensional rule reasoning*: a correct classification must satisfy competing priority rules along several axes simultaneously, including material, form, function, essential character, the part-versus-whole boundary, and specific listing versus residual headings. End-to-end prompting of large language models fails characteristically by resolving one axis while ignoring the priority constraints on the others. We present a *deterministic agentic workflow* in contrast to self-planning agents: the control flow is fixed, language model calls are confined to narrow stages, and reflection and verification are retained as local mechanisms. This design yields interpretability by construction--each decision is decomposed into stage-wise structured outputs with verbatim citation of the chapter or section notes that bear on it. The architecture combines offline knowledge-engineering of the Chinese HS tariff with an online six-stage pipeline. Evaluated on HSCodeComp at the six-digit level, the workflow reaches 75.0% top-1 and 91.5% top-3 at four digits, and 64.2% top-1 and 78.3% top-3 at six digits with Qwen3.6-plus; an open-weight Qwen3.6-27B-FP8 backbone in non-thinking mode achieves 84.2% four-digit and 77.4% six-digit top-1 agreement with the frontier model. A two-stage manual audit of 226 six-digit disagreements suggests that a non-trivial fraction of HSCodeComp ground-truth labels may deviate from HS general rules; full adjudication records are released in the appendix as preliminary findings for community review.

2605.14855 2026-05-15 cs.LG cs.AI eess.SP

Exploitation of Hidden Context in Dynamic Movement Forecasting: A Neural Network Journey from Recurrent to Graph Neural Networks and General Purpose Transformers

Lukas Schelenz, Shobha Rajanna, Denis Gosalci, Lucas Heublein, Jonas Pirkl, Jonathan Ott, Felix Ott, Christopher Mutschler, Tobias Feigl

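AI summary: This paper studies how to exploit hidden context in dynamic movement forecasting, tracing the progression from recurrent neural networks to graph neural networks and general-purpose Transformers. It compares several machine learning methods on predicting the movement trajectories of NBA players and finds that an LSTM-based hybrid model augmented with contextual information achieves the lowest final displacement error, outperforming graph attention networks, Transformers, and other models. The experiments show that the models trade off prediction accuracy, generalization, and training efficiency differently, underlining the need to choose a model to fit the task when forecasting trajectories in fast-moving environments.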

Comments 12 pages

DOI: 10.1109/PLANS61210.2025.11028353
Journal ref: IEEE/ION Position, Location and Navigation Symposium (PLANS), Salt Lake City, UT, May 2025

English abstract

Forecasting within signal processing pipelines is crucial for mitigating delays, particularly in predicting the dynamic movements of objects such as NBA players. This task poses significant challenges due to the inherently interactive and unpredictable nature of sports, where abrupt changes in velocity and direction are prevalent. Traditional approaches, including (S)ARIMA(X), Kalman filters (KF), and Particle filters (PF), often struggle to model the non-linear dynamics present in such scenarios. Machine learning (ML) methods, such as long short-term memory (LSTM) networks, graph neural networks (GNNs), and Transformers, offer greater flexibility and accuracy but frequently fail to explicitly capture the interplay between temporal dependencies and contextual interactions, which are critical in chaotic sports environments. In this paper, we evaluate these models and assess their strengths and weaknesses. Experimental results reveal key performance trade-offs across input history length, generalizability, and the ability to incorporate contextual information. ML-based methods demonstrated substantial improvements over linear models across forecast horizons of up to 2s. Among the tested architectures, our hybrid LSTM augmented with contextual information achieved the lowest final displacement error (FDE) of 1.51m, outperforming temporal convolutional neural network (TCNN), graph attention network (GAT), and Transformers, while also requiring less data and training time compared to GAT and Transformers. Our findings indicate that no single architecture excels across all metrics, emphasizing the need for task-specific considerations in trajectory prediction for fast-paced, dynamic environments such as NBA gameplay.
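
The final displacement error (FDE) reported in the abstract above is, in its usual definition, the Euclidean distance between the predicted and true positions at the last forecast step, averaged over trajectories. A minimal NumPy sketch follows; the array layout `(num_trajectories, horizon, 2)` and the function name are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def final_displacement_error(pred, truth):
    """Mean Euclidean distance at the final forecast step.

    pred, truth: arrays of shape (num_trajectories, horizon, 2) holding
    planar positions in metres.
    """
    last_step_error = pred[:, -1] - truth[:, -1]  # (num_trajectories, 2)
    return float(np.linalg.norm(last_step_error, axis=-1).mean())
```

Because only the last time step enters the metric, two forecasts with very different intermediate paths can share the same FDE, which is why the metric is often read alongside an average displacement error over the full horizon.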

2605.14847 2026-05-15 cs.CV

SR-Prominence: A Crowdsourced Protocol and Dataset Suite for Perceptually-Weighted Super-Resolution Artifact Evaluation

Ivan Molodetskikh, Kirill Malyshev, Mark Mirgaleev, Nikita Zagainov, Evgeney Bogatyrev, Dmitriy Vatolin

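AI summary: Modern image super-resolution methods produce detailed, visually appealing results but often introduce visual artifacts that hurt perceived quality. This paper proposes "artifact prominence" as an evaluation measure, defined as the share of viewers who judge a region to contain a noticeable artifact, and builds the SR-Prominence dataset of 3,935 prominence-annotated artifact masks covering a range of real-world scenes. The study finds that classical full-reference quality metrics such as SSIM do notably well at predicting local prominence, while no-reference methods and dedicated artifact detectors generalize poorly. The dataset offers a new perception-oriented benchmark for evaluating super-resolution artifacts.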

2605.14845 2026-05-15 cs.CV

Exploring Vision-Language Models for Online Signature Verification: A Zero-Shot Capability Study

Marta Robledo-Moreno, Ruben Vera-Rodriguez, Ruben Tolosana, Javier Ortega-Garcia

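AI summary: This paper studies the zero-shot capability of vision-language models (VLMs) for online signature verification, evaluating advanced models such as GPT-5.2 and Gemini 2.5 Pro on the Signature Verification Challenge (SVC) benchmark. Raw kinematic time series are rendered as static images, and biometric scores are computed from the models' underlying token probabilities. The models perform well on random forgeries, with GPT-5.2 reaching an equal error rate as low as 0.32% on the mobile task, but performance drops sharply on the harder skilled-forgery scenario, and the models are shown to hallucinate kinematic details during chain-of-thought reasoning.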

Comments Accepted at the 14th International Workshop on Biometrics and Forensics

2605.14844 2026-05-15 cs.LG cs.AI

XFP: Quality-Targeted Adaptive Codebook Quantization with Sparse Outlier Separation for LLM Inference

Thomas Witt

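AI summary: This paper proposes XFP, a dynamic weight quantization method for efficient large language model inference. Given a per-channel floor on cosine similarity as the quality target, the method automatically determines each layer's codebook size, outlier budget, and packing scheme, with no manual bit-width selection and no calibration data. XFP decomposes each weight matrix into a sparse fp16 outlier residual and a dense tensor of sub-byte indices, and supports efficient decoding through two storage modes. Experiments show that XFP achieves higher inference speed and accuracy than existing methods on several large models and makes it possible to run models that would otherwise exceed memory limits.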

Comments 17 pages, 3 figures, 17 tables, 1 algorithm. Code: https://github.com/flash7777/vllm/tree/multiquant

2605.14843 2026-05-15 cs.CV

MechVerse: Evaluating Physical Motion Consistency in Video Generation Models

Rahul Jain, Mayank Patel, Asim Unmesh, Karthik Ramani

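AI summary: This paper presents MechVerse, a new benchmark for evaluating physical motion consistency in video generation models. It focuses on the tendency of current models to violate kinematic and geometric constraints when generating videos of mechanical structures, for example by deforming parts or transmitting motion inconsistently. MechVerse contains a large set of synthetic video clips with structured prompts for assessing generation under mechanical constraints, and experiments show that existing models do well on appearance and smoothness but still fall clearly short at producing motion that respects the physical mechanism.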

Comments Under Review

English abstract

Text- and image-conditioned video generation models have achieved strong visual fidelity and temporal coherence, but they often fail to generate motion governed by kinematic and geometric constraints. In these settings, object parts must remain rigid, maintain contact or coupling with neighboring components, and transfer motion consistently across connected parts. These requirements are especially explicit in articulated mechanical assemblies, where motion is constrained by rigid-link geometry, contact/coupling relations, and transmission through kinematic chains. A generated video may therefore appear plausible while violating the intended mechanism, such as rotating a part that should translate, deforming a rigid component, breaking coupling between parts, or failing to move downstream components. To evaluate this gap, we introduce MechVerse, a benchmark for mechanically consistent image-to-video generation. MechVerse contains 21,156 synthetic clips from 1,357 mechanical assemblies across 141 categories, organized into three tiers of increasing kinematic complexity: independent articulation, pairwise coupling, and densely coupled multi-part mechanisms. Each clip is paired with a structured prompt describing part identities, stationary supports, moving components, motion primitives, direction, speed/extent, and inter-part dependencies. We evaluate proprietary, open-source, and fine-tuned image-to-video models using standard video metrics, instruction-following scores, and human judgments of motion correctness and kinematic coupling. Results show that current models can preserve appearance and smoothness while failing to generate mechanically admissible motion, with errors increasing as coupling complexity grows. MechVerse provides a benchmark for measuring and improving mechanism-aware video generation from image and language inputs.

2605.14842 2026-05-15 cs.CV

Editor's Choice: Evaluating Abstract Intent in Image Editing through Atomic Entity Analysis

Mor Ventura, Roy Hirsch, Yonatan Bitton, Regev Cohen, Roi Reichart

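AI summary: This paper studies how to understand and evaluate abstract intent in image editing. It proposes Entity-Rubrics, an evaluation framework based on atomic entity analysis, and builds AbstractEdit, the first benchmark dataset focused on abstract image editing. The work gives the first formal definition and taxonomy of abstract image editing and, by decomposing an editing task into entity-level evaluation criteria, achieves high correlation with human judgments. Experiments show that existing models struggle considerably with abstract instructions, and that pairing stronger language-model encoders with iterative reasoning improves performance, pointing toward more natural multimodal interaction.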