arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.14917 2026-05-15 cs.LG cs.CE cs.IT math.IT stat.ML

A Mutual Information Lower Bound for Multimodal Regression Active Learning

Leonardo Ferreira Guilhoto, Akshat Kaushal, Paris Perdikaris

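AI summary This paper addresses active learning for multimodal regression and proposes a new acquisition function, MI-LB, to capture model uncertainty more accurately. It introduces a Two-Index framework that separates epistemic from aleatoric uncertainty and derives an information-theoretic mutual information lower bound as the acquisition objective. Experiments show that the method performs strongly on benchmarks featuring multimodal systems, matching or beating every baseline evaluated.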

Abstract

Active learning for continuous regression has lacked an acquisition function that targets epistemic uncertainty when the predictive distribution is multimodal: variance misses modal disagreement, and information-theoretic targets like BALD are designed for discrete outputs. We introduce a Two-Index framework that makes this separation explicit: one stochastic index selects among competing model hypotheses (epistemic source), while a second governs within-hypothesis randomness (aleatoric source). An entropy decomposition within the framework identifies the mutual information between the output and the epistemic index as a principled acquisition objective, and we prove this quantity vanishes as the model is trained on growing datasets, confirming that it captures exactly the uncertainty data can resolve. Because this mutual information is intractable for continuous outputs, we derive the Mutual Information Lower Bound (MI-LB) acquisition function, a closed-form approximation for Mixture Density Network ensembles. On benchmarks featuring multimodal systems, MI-LB matches or beats every baseline evaluated and is the only method to do so consistently -- geometric and Fisher-based baselines compete only when the input space already encodes the multimodality, and collapse otherwise.
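To make the acquisition objective concrete, the sketch below estimates the mutual information between the output and the ensemble index by plain Monte Carlo, for a toy ensemble of one-dimensional Gaussian mixtures. It is not the paper's closed-form MI-LB, which the abstract does not spell out. The function names, the uniform weighting over ensemble members, and the toy mixtures are all assumptions made for illustration.

```python
import math
import random

def mixture_pdf(y, mixture):
    """Density of a 1-D Gaussian mixture given as (weight, mean, std) triples."""
    return sum(
        w * math.exp(-0.5 * ((y - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
        for w, mu, sd in mixture
    )

def sample_mixture(mixture, rng):
    """Draw one sample: pick a component by weight, then sample its Gaussian."""
    weights = [w for w, _, _ in mixture]
    _, mu, sd = rng.choices(mixture, weights=weights, k=1)[0]
    return rng.gauss(mu, sd)

def mc_mutual_information(ensemble, n_samples=2000, seed=0):
    """Monte Carlo estimate of I(Y; K), with K uniform over ensemble members.

    Uses I(Y; K) = E_k E_{y ~ p_k}[log p_k(y) - log p_bar(y)],
    where p_bar is the average of the member densities.
    """
    rng = random.Random(seed)
    total = 0.0
    for member in ensemble:
        for _ in range(n_samples):
            y = sample_mixture(member, rng)
            marginal = sum(mixture_pdf(y, m) for m in ensemble) / len(ensemble)
            total += math.log(mixture_pdf(y, member)) - math.log(marginal)
    return total / (len(ensemble) * n_samples)

# Hypothetical toy ensembles sharing the same marginal predictive distribution.
bimodal = [(0.5, -2.0, 0.3), (0.5, 2.0, 0.3)]
agree = [bimodal, bimodal]                             # members agree on both modes
disagree = [[(1.0, -2.0, 0.3)], [(1.0, 2.0, 0.3)]]     # each member backs one mode

mi_agree = mc_mutual_information(agree)
mi_disagree = mc_mutual_information(disagree)
```

Both toy ensembles have the same marginal predictive distribution, yet the estimate is zero when the members agree and close to log 2 when they split across the two modes, which is the kind of modal disagreement the abstract says variance misses.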

2605.14915 2026-05-15 cs.LG

TILBench: A Systematic Benchmark for Tabular Imbalanced Learning Across Data Regimes

Ruizhe Liu, Jiaqi Luo

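AI summary This paper presents TILBench, a systematic benchmark for evaluating imbalanced learning on tabular data. It tests more than 40 representative algorithms on 57 diverse tabular datasets, covering over 200,000 controlled experiments, and reveals how methods differ in predictive performance, robustness, and computational scalability. The study finds that no single method is best in every setting; effectiveness depends strongly on dataset characteristics and computational constraints, and the authors offer method-selection recommendations for practical use on that basis.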

Abstract

Imbalanced learning remains a fundamental challenge in tabular data applications. Despite decades of research and numerous proposed algorithms, a systematic empirical understanding of how different imbalanced learning methods behave across diverse data characteristics is still lacking. In particular, it remains unclear how different method families compare in predictive performance, robustness under varying data characteristics, and computational scalability. In this work, we present Tabular Imbalanced Learning Benchmark (TILBench), a large-scale empirical benchmark for tabular imbalanced learning. TILBench evaluates more than 40 representative algorithms across 57 diverse tabular datasets, resulting in over 200000 controlled experiments across a wide range of data characteristics. Our findings show that no single method consistently dominates across all settings; instead, the effectiveness of imbalanced learning methods depends strongly on dataset characteristics and computational constraints. Based on these findings, we provide practical recommendations for selecting appropriate methods in real-world applications.

2605.14913 2026-05-15 cs.CV

Representative Attention For Vision Transformers

Yuntong Li, Hainuo Wang, Hengxing Liu, Mingjia Li, Xiaojie Guo

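AI summary This paper proposes Representative Attention (RPAttention), a linear global attention mechanism that addresses the high computational cost of conventional self-attention in Vision Transformers and the dependence of existing token compression on image coordinates. Its core idea is to dynamically form semantically related representative tokens in representation space, replacing intermediate tokens built from fixed spatial partitions, so that regions can communicate semantically across spatial distance. The method keeps a global receptive field while reducing token-interaction complexity from quadratic to linear, and experiments on image classification, object detection, and semantic segmentation show strong performance.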

Abstract

Linear attention has emerged as a promising direction for scaling Vision Transformers beyond the quadratic cost of dense self-attention. A prevalent strategy is to compress spatial tokens into a compact set of intermediate proxies that mediate global information exchange. However, existing methods typically derive these proxy tokens from predefined spatial layouts, causing token compression to remain anchored to image coordinates rather than the semantic organization of visual content. To overcome this limitation, we propose Representative Attention (RPAttention), a linear global attention mechanism that performs token compression directly in representation space. Instead of constructing intermediate tokens from fixed spatial partitions, it dynamically forms a compact set of learned representative tokens to enable semantically related regions to communicate regardless of their spatial distance, by following a lightweight Gather-Interact-Distribute paradigm. Spatial tokens are first softly gathered into representative tokens through competitive similarity-based routing. The representatives then perform global interaction within a compact latent space, before broadcasting the refined information back to all spatial tokens via query-driven cross-attention. By replacing coordinate-driven aggregation with representation-driven compression, RPAttention preserves global receptive fields while adaptively aligning token communication with the content structure of each input. RPAttention reduces the dominant token interaction complexity from quadratic to linear scaling with respect to the number of spatial tokens, while maintaining expressive global context modeling. Extensive experiments across diverse vision transformer backbones on image classification, object detection, and semantic segmentation demonstrate the effectiveness of our design.
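The Gather-Interact-Distribute pattern described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions, not the authors' RPAttention: it omits learned projections, heads, and normalization layers, and the function name, the softmax-over-representatives routing, and the random inputs are hypothetical choices.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gather_interact_distribute(tokens, reps):
    """tokens: (N, d) spatial tokens; reps: (M, d) representative seeds, M << N."""
    d = tokens.shape[1]
    scale = 1.0 / np.sqrt(d)

    # Gather: each token spreads its mass over the M representatives
    # (softmax over representatives, so they compete for tokens),
    # then each representative takes a weighted mean of its tokens.
    route = softmax(tokens @ reps.T * scale, axis=1)                       # (N, M)
    gathered = (route.T @ tokens) / (route.sum(axis=0)[:, None] + 1e-9)    # (M, d)

    # Interact: dense self-attention among representatives, cost O(M^2 d).
    refined = softmax(gathered @ gathered.T * scale, axis=1) @ gathered    # (M, d)

    # Distribute: every token queries the refined representatives, cost O(N M d).
    return softmax(tokens @ refined.T * scale, axis=1) @ refined           # (N, d)

# Hypothetical sizes: 16 spatial tokens, 4 representatives, width 8.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))
reps = rng.normal(size=(4, 8))
out = gather_interact_distribute(tokens, reps)
```

With M fixed, the gather and distribute steps cost O(N M d) and the interaction step O(M^2 d), so the total grows linearly in the number of spatial tokens N, in line with the abstract's complexity claim.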

2605.14912 2026-05-15 cs.AI cs.CY cs.HC cs.LG

From Sycophantic Consensus to Pluralistic Repair: Why AI Alignment Must Surface Disagreement

Varad Vishwarupe, Nigel Shadbolt, Marina Jirotka

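AI summary This paper examines pluralistic alignment in AI and argues that current RLHF-trained systems tend to defer to the user's opinion when values differ, so genuine value conflict and disagreement disappear. The authors propose three conversational mechanisms grounded in Grice's maxims (scoping, signalling, and repair), under which an AI should acknowledge the limits of its own perspective, surface value conflicts, and revise its position on principled grounds rather than simply capitulating. They introduce the Pluralistic Repair Score (PRS) as a metric and show in a small-scale experiment that existing models follow user opinion on contested prompts while exhibiting low repair quality, underscoring the importance of deployment-stage governance for achieving pluralism.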

Abstract

Pluralistic alignment is typically operationalised as preference aggregation: producing responses that span (Overton), steer toward (Steerable), or proportionally represent (Distributional) diverse human values. We argue that aggregation alone is an incomplete primitive for deployed pluralistic alignment. Under genuine value pluralism, the failure mode of contemporary RLHF-trained assistants is not insufficient coverage but sycophantic consensus: a learned tendency to agree with, validate, and minimise friction with the immediate interlocutor. Because deployed AI systems now mediate consequential deliberation across health, civic life, labour, and governance, the collapse of disagreement at the interaction layer is not a narrow technical concern but a structural failure with distributive consequences. We reframe pluralistic alignment around three conversational mechanisms drawn from Grice's maxims: scoping (acknowledging the limits of one's perspective), signalling (surfacing value-conflict rather than smoothing it over), and repair (revising one's position on principled grounds, not on user pressure). We formalise a metric, the Pluralistic Repair Score (PRS), distinguishing principled revision from capitulation, and present a small-scale empirical illustration on two frontier RLHF-trained models (Claude Sonnet 4.5, N=198; GPT-4o, N=100) showing that, for both, agreement-following coexists with low repair-quality on contested-value prompts. PRS measures an interactional precondition for pluralism (visible disagreement; principled revision) rather than pluralism in full; we discuss the difference, take seriously the reflexive question of whose "principled" counts, and argue that pluralism is most decisively made or unmade at the deployment-governance layer: interfaces, preference-data pipelines, and audit infrastructure.

2605.14911 2026-05-15 cs.RO

Chrono-Gymnasium: An Open-Source, Gymnasium-Compatible Distributed Simulation Framework

Bocheng Zou, Harry Zhang, Khailanii Slaton, Jingquan Wang, Derrick Ruan, Huzaifa Mustafa Unjhawala, Radu Serban, Dan Negrut

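AI summary This paper presents Chrono-Gymnasium, an open-source distributed simulation framework aimed at the high computational cost that keeps high-fidelity physics simulation of robots and complex mechanical systems out of data-intensive tasks. Built on Ray and compatible with the Gymnasium interface, it integrates with modern machine learning libraries and provides the synchronization and messaging primitives needed for distributed execution. Two case studies, one in reinforcement learning and one in Bayesian optimization, show that it reduces simulation wall-clock time without sacrificing physical accuracy.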

Abstract

High-fidelity physics simulation is essential for closing the sim-to-real gap in robotics and complex mechanical systems. However, the computational overhead of high-fidelity engines often limits their use in data-intensive tasks like Reinforcement Learning (RL) and global optimization. We introduce Chrono-Gymnasium, a distributed computing framework that scales the high-fidelity multi-body dynamics of Project Chrono across large-scale computing clusters. Built upon the Ray framework, Chrono-Gymnasium provides a standardized Gymnasium interface, enabling seamless integration with modern machine learning libraries while providing built-in synchronization and messaging primitives for distributed execution. We demonstrate the framework's capabilities through two distinct case studies: (1) the training of an RL agent for autonomous robotic navigation in complex terrains, and (2) the Bayesian Optimization of a planetary lander's design parameters to ensure landing stability. Our results show that Chrono-Gymnasium reduces wall-clock time for high-fidelity simulations without sacrificing physical accuracy, offering a scalable path for the design and control of complex robotic systems.

2605.14908 2026-05-15 cs.CV

SteerSeg: Attention Steering for Reasoning Video Segmentation

Ali Cheraghian, Hamidreza Dastmalchi, Abdelwahed Khamis, Morteza Saberi, Aijun An, Lars Petersson

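AI summary Video reasoning segmentation requires localizing target objects in video frames from natural language descriptions, often involving spatial reasoning and implicit references. Existing methods extract attention maps from a frozen large vision-language model (LVLM) as spatial priors for segmentation, enabling training-free grounding, but these maps are optimized for text generation, which leaves the grounding signal diffuse. This paper proposes SteerSeg, a lightweight framework that identifies attention misalignment as the bottleneck and steers attention through input-level conditioning, combining learnable soft prompts with reasoning-guided Chain-of-Thought (CoT) prompting; it significantly improves the spatial grounding of LVLMs and generalizes well across multiple benchmarks.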

Comments Project page: https://steerseg.github.io

Abstract

Video reasoning segmentation requires localizing objects across video frames from natural language expressions, often involving spatial reasoning and implicit references. Recent approaches leverage frozen large vision-language models (LVLMs) by extracting attention maps and using them as spatial priors for segmentation, enabling training-free grounding. However, these attention maps are optimized for text generation rather than spatial localization, often resulting in diffuse and ambiguous grounding signals. In this work, we introduce SteerSeg, a lightweight framework that identifies attention misalignment as the key bottleneck in attention-based grounding and proposes to steer attention at its source through input-level conditioning. SteerSeg combines learnable soft prompts with reasoning-guided Chain-of-Thought (CoT) prompting. The soft prompts reshape the attention distribution to produce more spatially concentrated maps, while CoT-derived attributes resolve ambiguity among similar objects by guiding attention toward the correct instance. The resulting attention maps are converted into point prompts across keyframes to guide a segmentation model, while candidate tracklets are ranked and selected using correlation-based scoring. Our approach freezes the LVLM and segmentation model parameters and learns only a small set of soft prompts, preserving the model's pretrained reasoning capabilities while significantly improving grounding. Despite being trained only on Ref-YouTube-VOS, SteerSeg generalizes well across diverse benchmarks, significantly improving the spatial grounding capability of LVLMs. Project page: https://steerseg.github.io

2605.14907 2026-05-15 cs.AI

KGPFN: Unlocking the Potential of Knowledge Graph Foundation Model via In-Context Learning

Yisen Gao, Jiaxin Bai, Haoyu Huang, Zhongwei Xie, Yufei Li, Hong Ting Tsang, Sirui Han, Yangqiu Song

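AI summary Knowledge graph foundation models aim to generalize to graphs with unseen entities and relations by learning transferable relational structure. Most existing methods, however, focus on relation-level universality, while in-context learning, another pillar of foundation models, remains little studied for knowledge graph reasoning. This paper proposes KGPFN, a knowledge graph foundation model built on a Prior-data Fitted Network that reasons over local and global information in structured context, achieving strong adaptation across graphs and strong results on multiple benchmarks.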

Abstract

Knowledge graph (KG) foundation models aim to generalize across graphs with unseen entities and relations by learning transferable relational structure. However, most existing methods primarily emphasize relation-level universality, while in-context learning, the other pillar of foundation models, remains under-explored for KG reasoning. In KGs, context is inherently structured and heterogeneous: effective prediction requires conditioning on the local context around the query entities as well as the global context that summarizes how a relation behaves across many instances. We propose KGPFN, a KG foundation model using a Prior-data Fitted Network that unifies transferable relational regularities with inference-time in-context learning from structured context. KGPFN first learns relation representations via message passing on relation graphs to capture cross-graph relational invariances. For query-specific reasoning, it encodes local neighborhoods using a multi-layer NBFNet as local context. To enable ICL at global scale, it constructs relation-specific global context by retrieving a large set of instances of the query relation together with their local neighborhoods, and aggregates them within a Prior-data Fitted Network framework that combines feature-level and sample-level attention. Through multi-graph pretraining on diverse KGs, KGPFN learns when to instantiate reusable patterns and when to override them using contextual evidence. Experiments on 57 KG benchmarks demonstrate that KGPFN achieves strong adaptation to previously unseen graphs through in-context learning alone, consistently outperforming competitive fine-tuned KG foundation models. Our code is available at https://github.com/HKUST-KnowComp/KGPFN.

2605.14906 2026-05-15 cs.CV

MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models

Xiyu Ren, Zhaowei Wang, Yiming Du, Zhongwei Xie, Chi Liu, Xinlin Yang, Haoyue Feng, Wenjun Pan, Tianshi Zheng, Baixuan Xu, Zhengnan Li, Yangqiu Song, Ginny Wong, Simon See

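AI summary MemLens is a comprehensive benchmark for evaluating multimodal long-term memory in large vision-language models (LVLMs). It covers five memory abilities, including information extraction, multi-session reasoning, and temporal reasoning, and tests models at several context lengths. The study finds that long-context models perform well on short conversations but degrade as conversations grow, whereas memory-augmented agents are more stable across lengths but lose visual detail under storage-time compression. Because neither approach alone handles multi-session multimodal tasks, the authors point to hybrid architectures that combine long-context attention with structured multimodal retrieval.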

Comments Work in progress

Abstract

Memory is essential for large vision-language models (LVLMs) to handle long, multimodal interactions, with two method directions providing this capability: long-context LVLMs and memory-augmented agents. However, no existing benchmark conducts a systematic comparison of the two on questions that genuinely require multimodal evidence. To close this gap, we introduce MEMLENS, a comprehensive benchmark for memory in multimodal multi-session conversations, comprising 789 questions across five memory abilities (information extraction, multi-session reasoning, temporal reasoning, knowledge update, and answer refusal) at four standard context lengths (32K-256K tokens) under a cross-modal token-counting scheme. An image-ablation study confirms that solving MEMLENS requires visual evidence: removing evidence images drops two frontier LVLMs below 2% accuracy on the 80.4% of questions whose evidence includes images. Evaluating 27 LVLMs and 7 memory-augmented agents, we find that long-context LVLMs achieve high short-context accuracy through direct visual grounding but degrade as conversations grow, whereas memory agents are length-stable but lose visual fidelity under storage-time compression. Multi-session reasoning caps most systems below 30%, and neither approach alone solves the task. These results motivate hybrid architectures that combine long-context attention with structured multimodal retrieval. Our code is available at https://github.com/xrenaf/MEMLENS.

2605.14900 2026-05-15 cs.AI

COREKG: Coreset-Guided Personalized Summarization of Knowledge Graphs

Sohel Aman Khan, Raghava Mutharaju, Supratim Shit

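AI summary This paper proposes COREKG, a personalized knowledge graph summarization method based on coreset theory, to address how unwieldy large knowledge graphs are for tasks such as question answering and visualization. Using sensitivity scores derived from a user's query patterns, it samples a representative subset of triples from the knowledge graph so that the summary remains structurally and semantically accurate. Experiments on several real-world datasets show that COREKG achieves better query accuracy and structural coverage than existing methods while substantially reducing storage and query overhead.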

Comments Accepted at IJCAI 2026

Abstract

Knowledge Graphs (KGs) are extensively used across different domains and in several applications. Often, these KGs are very large in size. Such KGs become unwieldy for tasks such as question answering and visualization. Summarization of KGs offers a viable alternative in such cases. Furthermore, personalized KG summarization is crucial in the current data-driven world as it captures the specific requirements of users based on their query patterns. Since it only maintains relevant information, the personalized summaries of KG are small, resulting in significantly smaller storage requirements and query runtime. In this work, we adapt coreset theory to create personalized KG summaries. For a given dataset and a user-specific query workload, we present an approach that samples a relevant subset of triples using sensitivity-based importance sampling. We ensure that the subset approximates the characteristics of the full dataset with bounded approximation error. We define sensitivity scores that measure the importance of a triple with respect to a user's query workload, which are then used by our coreset construction algorithm. We explicitly focus on personalized knowledge graph summarization by constructing summaries independently for each user based on their query behaviour. Our evaluation on Freebase, WikiData, and DBpedia shows that COREKG delivers higher query-answering accuracy and structural coverage than state-of-the-art methods such as GLIMPSE, PPR, iSummary, PEGASUS, and APEX$^2$, while requiring only a tiny fraction of the original graph.
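Sensitivity-based importance sampling, the generic coreset technique the abstract names, can be sketched as follows. The abstract does not give COREKG's sensitivity definition, so the scores below are hypothetical placeholders, as are the function name and the toy triples. The sketch only shows the standard sampling and reweighting step.

```python
import random

def sensitivity_sample(items, sensitivities, m, seed=0):
    """Sample m items with probability proportional to their sensitivity.

    Returns (item, weight) pairs. The weight 1 / (m * p_i) makes weighted
    sums over the sample unbiased estimates of sums over the full set.
    """
    total = sum(sensitivities)
    probs = [s / total for s in sensitivities]
    rng = random.Random(seed)
    picked = rng.choices(range(len(items)), weights=probs, k=m)
    return [(items[i], 1.0 / (m * probs[i])) for i in picked]

# Hypothetical triples and per-triple sensitivity scores for one user.
triples = [
    ("Paris", "capitalOf", "France"),
    ("Paris", "population", "2.1M"),
    ("Lyon", "locatedIn", "France"),
    ("France", "currency", "Euro"),
]
scores = [0.6, 0.1, 0.0, 0.3]   # a score of 0 means the triple is never sampled

coreset = sensitivity_sample(triples, scores, m=3)

# With uniform scores every weight equals len(items) / m.
uniform = sensitivity_sample(triples, [1.0] * len(triples), m=4)
```

Triples that matter more to the query workload are sampled more often but carry smaller weights, which is what keeps estimates over the summary unbiased with respect to the full graph.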

2605.14897 2026-05-15 cs.LG cs.AI

Critic-Driven Voronoi-Quantization for Distilling Deep RL Policies to Explainable Models

Senne Deproost, Denis Steckelmacher, Ann Nowé

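AI summary This paper studies how to distill deep reinforcement learning policies into explainable models so as to balance performance against interpretability. It proposes a critic-based Voronoi quantization method that partitions the state space and fits a linear function to each region, giving a simplified representation of a complex policy. The method uses the original policy's critic network to add and refine subpolicies iteratively, improving both the performance and the interpretability of the distilled model.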

Comments Accepted for presentation at EXTRAAMAS 2026

Abstract

Despite many successful attempts at explaining Deep Reinforcement Learning policies using distillation, it remains difficult to balance the performance-interpretability trade-off and select a fitting surrogate model. In addition, traditional distillation only minimizes the distance between the behavior of the original and the surrogate policy, while other RL-specific components such as action value are disregarded. To solve this, we introduce a new model-agnostic method called Critic-Driven Voronoi State Partitioning, which partitions a black-box control policy into regions where a simple class of model can be optimized using gradient descent. By exploiting the critic value network of the original policy, we iteratively introduce new subpolicies in regions with insufficient value, standing in for a measure of policy complexity. The partitioning, a Voronoi quantizer, uses nearest-neighbor lookups to assign a linear function to each point in the state space, resulting in a cell-like diagram. We validate our approach on several well-known benchmarks and show that this distillation approaches the original policy using a reasonably sized set of linear functions.
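The lookup step of such a Voronoi quantizer is simple to illustrate: find the nearest centroid, then apply that cell's linear function. The sketch below assumes a scalar action and hand-picked centroids and coefficients, all hypothetical. It does not implement the critic-driven insertion of new subpolicies or the gradient-based fitting described in the abstract.

```python
def nearest_cell(state, centroids):
    """Index of the centroid closest to state (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((s - c) ** 2 for s, c in zip(state, centroids[i])),
    )

def voronoi_policy(state, centroids, linear_models):
    """Evaluate the linear function a = w . s + b of the cell containing state."""
    weights, bias = linear_models[nearest_cell(state, centroids)]
    return sum(w * s for w, s in zip(weights, state)) + bias

# Hypothetical two-cell partition of a 2-D state space.
centroids = [(-1.0, 0.0), (1.0, 0.0)]
linear_models = [
    ((0.5, 0.0), 0.1),     # cell 0: a = 0.5 * s0 + 0.1
    ((-0.5, 0.0), -0.1),   # cell 1: a = -0.5 * s0 - 0.1
]

left_action = voronoi_policy((-0.8, 0.3), centroids, linear_models)
right_action = voronoi_policy((1.2, -0.5), centroids, linear_models)
```

Each cell can be read off as one small linear rule, which is where the interpretability of the surrogate comes from; the number of cells controls the trade-off against fidelity to the original policy.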

2605.14896 2026-05-15 cs.SD cs.LG

Text-Dependent Speaker Verification (TdSV) Challenge 2024: Team Naive System Report

Amir Mohammad Rostami, Pourya Jafarzadeh

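AI summary This paper describes team Naive's system for the 2024 Text-Dependent Speaker Verification (TdSV) Challenge. The system adapts the existing state-of-the-art networks ResNet-TDNN and NeXt-TDNN and adds a lightweight, efficient EfficientNet-A0 model; combined with data augmentation and tuned hyperparameters, it achieves a Minimum Detection Cost Function (MinDCF) of 0.0461 and an Equal Error Rate (EER) of 1.3%. The work demonstrates the effectiveness of multi-model ensemble learning for speaker and phrase verification.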

Abstract

This paper presents a system for the 2024 Text-Dependent Speaker Verification (TdSV) Challenge. The system achieved a Minimum Detection Cost Function (MinDCF) of 0.0461 and an Equal Error Rate (EER) of 1.3%. Our approach focused on adapting existing state-of-the-art neural networks, ResNet-TDNN and NeXt-TDNN, originally trained on the VoxCeleb dataset. This strategy was chosen because of the limited challenge duration and the available resources at the time. In addition, we designed a lightweight and resource-efficient model, EfficientNet-A0, trained specifically on the challenge dataset to improve adaptation and strengthen the ensemble approach. Our system combines advanced neural architectures, extensive data augmentation, and optimised hyperparameters. These components helped achieve strong performance in text-dependent speaker verification. The results also demonstrate the effectiveness of multi-model ensemble learning for both speaker and phrase verification.

2605.14894 2026-05-15 cs.CV

SEDiT: Mask-Free Video Subtitle Erasure via One-step Diffusion Transformer

Zheng Hui, Yunlong Bai

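AI summary This paper proposes SEDiT, a new video subtitle erasure method that removes subtitles directly without a precomputed mask. Built on a one-step diffusion transformer, its one-stage framework avoids the sub-optimality of conventional two-stage processing, and the authors give a theoretical justification for the feasibility of one-step denoising. To maintain temporal consistency, it adopts a hybrid training strategy and supports efficient processing of native high-resolution video.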

Comments Project page: http://zheng222.github.io/SEDiT_project

Abstract

Recent breakthroughs in video diffusion models have significantly accelerated the development of video editing techniques. However, existing methods often rely on inpainting video frames based on masked input, which requires extracting the target video mask in advance, and the precision of the segmentation directly affects the quality of the completion. In this paper, we present SEDiT, a novel one-stage video Subtitle Erasure approach via One-step Diffusion Transformer. We introduce a mask-free inference approach that enables direct erasure of the targeted subtitle. The proposed one-stage framework mitigates the sub-optimality inherent in the two-stage processing of prior models. Since subtitle removal is a localized editing task in which most pixels remain unchanged, the underlying distribution shift is minimal, making it well-suited to one-step generation under rectified flow. We empirically validate the reliability of one-step denoising and further provide a formal theoretical justification. Under the localized-editing structure of subtitle removal, the conditional optimal transport (OT) map and its induced rectified flow velocity field are Lipschitz continuous with respect to the latent variable, which underpins the theoretical feasibility of one-step sampling. To address the challenge of long-term temporal consistency, we adopt a hybrid training strategy by occasionally conditioning the model with a clean first-frame latent. This facilitates temporal continuity, allowing each segment during inference to leverage the output of its predecessor. To avoid visible seams caused by cropping and reinserting processed targets, particularly in scenarios involving substantial motion, we feed the original video directly into SEDiT. Thanks to one-step and chunk-wise streaming inference, our method can efficiently handle native 1440p video with infinite length.

2605.14893 2026-05-15 cs.CV cs.AI cs.LG

Your CLIP has 164 dimensions of noise: Exploring the embeddings covariance eigenspectrum of contrastively pretrained vision-language transformers

Jakub Grzywaczewski, Dawid Płudowski, Przemysław Biecek

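AI summary This paper studies the structure of the latent space in contrastively pre-trained vision-language models (VLMs) and finds that the shared latent space contains a large amount of non-semantic multi-modal noise. Using spectral decomposition of covariance matrices, the authors split the latent space into a semantic signal component and a shared noise subspace, and observe that the noise structure shows strong subgroup invariance across data subsets. Experiments show that removing these noise dimensions has little effect on downstream task performance and can even improve it, suggesting that a substantial part of the latent space of modern VLMs is governed by architecture-level noise rather than task-relevant semantics alone.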

Abstract

Contrastively pre-trained Vision-Language Models (VLMs) serve as powerful feature extractors. Yet, their shared latent spaces are prone to structural anomalies and act as repositories for non-semantic, multi-modal noise. To address this phenomenon, we employ spectral decomposition of covariance matrices to decompose the VLM latent space into a multi-modal semantic signal component and a shared noise subspace. We observe that this noise geometry exhibits strong subgroup invariance across distinct data subsets. Crucially, pruning these shared noise dimensions is mainly harmless, preserving or actively improving downstream task performance. By isolating true semantic signals from artifactual noise, this work provides new mechanistic insights into the representational structure of modern VLMs, suggesting that a substantial fraction of their latent geometry is governed by shared, architecture-level noise rather than task-relevant semantics alone.

2605.14891 2026-05-15 cs.CV

Hierarchical Image Tokenization for Multi-Scale Image Super Resolution

Isma Hadji, Enrique Sanchez, Adrian Bulat, Brais Martinez, Georgios Tzimiropoulos

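AI summary This paper proposes a multi-scale image super-resolution method based on Visual Auto-Regressive (VAR) modeling. It introduces Hierarchical Image Tokenization (HIT) and a Direct Preference Optimization (DPO) regularization term to address the limitations of existing methods in scale mapping and model complexity. HIT represents images progressively at different scales while enforcing token overlap across scales, which increases flexibility, and the DPO term relies only on low-resolution/high-resolution image pairs to steer the model toward higher-quality outputs. The method achieves state-of-the-art multi-scale super-resolution with a smaller model and no external training data.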

Comments Accepted for publication at ICML 2026. *Joint first authorship (alphabetical order). arXiv admin note: substantial text overlap with arXiv:2506.04990

Abstract

We introduce a multi-scale Image Super Resolution (ISR) method building on recent advances in Visual Auto-Regressive (VAR) modeling. VAR models break image tokenization into additive, gradually increasing scales, using Residual Quantization (RQ), an approach that aligns perfectly with our target ISR task. Previous works taking advantage of this synergy suffer from two main shortcomings. First, due to the limitations in RQ, they only generate images at a predefined fixed scale, failing to map intermediate outputs to the corresponding image scales. Second, they rely on large backbones or a large corpus of annotated data to achieve better performance. To address both shortcomings, we introduce two novel components to the VAR training for ISR, aiming at increasing its flexibility and reducing its complexity. In particular, we introduce a) a Hierarchical Image Tokenization (HIT) approach that progressively represents images at different scales while enforcing token overlap across scales, and b) a Direct Preference Optimization (DPO) regularization term that, relying solely on the (LR, HR) pair, encourages the transformer to produce the latter over the former. Our proposed HIT acts as a strong inductive bias for the VAR training, resulting in a small model (300M parameters versus the 1B parameters of VARSR) that achieves state-of-the-art results without external training data and delivers multi-scale outputs with a single forward pass.

2605.14888 2026-05-15 cs.SD cs.LG

PROCESS-2: A Benchmark Speech Corpus for Early Cognitive Impairment Detection

Madhurananda Pahar, Caitlin H. Illingworth, Bahman Mirheidari, Hend Elghazaly, Fritz Peters, Sophie Young, Wing-Zin Leung, Labhpreet Kaur, Daniel Blackburn, Heidi Christensen

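AI summary PROCESS-2 is a large-scale speech dataset for early cognitive impairment detection, designed to support research on automatic cognitive assessment from spontaneous and task-oriented speech. It contains recordings from 200 healthy controls, 150 participants with mild cognitive impairment, and 50 with dementia, about 21 hours in total, covering picture description and verbal fluency tasks and accompanied by manually verified transcripts and metadata. With comprehensive technical validation and predefined train/test partitions, PROCESS-2 offers a reliable, reproducible benchmark resource for this research.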

Abstract

Speech-based analysis offers a scalable and non-invasive approach for detecting cognitive decline, yet progress has been constrained by the limited availability of clinically validated datasets collected under realistic conditions. We introduce PROCESS-2, a large-scale speech dataset designed to support research on automatic assessment of cognitive impairment from spontaneous and task-oriented speech. The dataset comprises recordings from 200 healthy controls, 150 mild cognitive impairment, and 50 dementia diagnoses collected using the CognoMemory digital assessment platform. Each participant completed a single assessment session, including picture description and verbal fluency tasks, accompanied by manually verified transcripts and participant-level metadata. PROCESS-2 contains approximately 21 hours of speech audio with predefined train/test partitions. Comprehensive technical validation evaluated demographic balance, clinical consistency, recording stability, embedding-space structure, and reproducible baseline modelling performance, demonstrating clinically meaningful group separation and stable performance across modelling approaches while preserving real-world conversational variability. PROCESS-2 is released under controlled access via Hugging Face to enable responsible reuse while protecting participant privacy, providing a reproducible benchmark resource for speech-based cognitive assessment research.

2605.14886 2026-05-15 cs.AI

BiFedKD: Bidirectional Federated Knowledge Distillation Framework for Non-IID and Long-Tailed ECG Monitoring

Zixuan Shu, Tiancheng Cao, Hen-Wei Huang

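AI summary Electrocardiogram (ECG) monitoring in Internet of Medical Things (IoMT) networks is constrained by data-sharing regulations and privacy protection. To address the heavy communication cost of model updates in federated learning and the performance degradation under non-IID and long-tailed label distributions, this paper proposes BiFedKD, a bidirectional federated knowledge distillation framework that improves cross-client alignment through temperature scaling and an aggregation-by-distillation pipeline. Experiments on the MIT-BIH Arrhythmia dataset show that BiFedKD improves accuracy and Macro-F1 while substantially reducing communication and computation overhead.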

Abstract

Electrocardiogram (ECG) monitoring in Internet of Medical Things (IoMT) networks is constrained by strict data-sharing regulations and privacy concerns. Federated learning (FL) enables collaborative learning by keeping raw ECG data on devices, but frequent transmissions of high-dimensional model updates incur heavy per-round traffic over bandwidth-limited links. To alleviate this bottleneck, federated distillation (FD) replaces parameter exchange with logit-based knowledge transfer. However, the performance of FD often degrades under the non-independent and identically distributed (non-IID) and long-tailed label distributions in ECG deployments. To address these challenges, we propose a bidirectional federated knowledge distillation (BiFedKD) framework that employs an aggregation-by-distillation pipeline with temperature scaling to produce a stable global distillation signal for cross-client alignment. Experiments on the MIT-BIH Arrhythmia dataset show that BiFedKD improves accuracy and Macro-F1 over the baseline by 3.52% and 9.93%, respectively. Moreover, to reach the same Macro-F1, BiFedKD reduces communication overhead by 40% and computation cost by 71.7% compared with the baseline.
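Logit-based knowledge transfer with temperature scaling can be illustrated with a small standard-library sketch. The abstract does not specify BiFedKD's aggregation-by-distillation pipeline, so the plain averaging of client soft labels here is an assumption, as are the function names, the temperature value, and the toy logits.

```python
import math

def softmax_t(logits, temperature):
    """Temperature-scaled softmax; a larger temperature gives softer targets."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_soft_labels(client_logits, temperature):
    """Average the clients' temperature-softened class distributions."""
    probs = [softmax_t(logits, temperature) for logits in client_logits]
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits from two clients for one shared sample, three classes.
client_logits = [[4.0, 1.0, 0.0], [0.5, 3.0, 0.0]]
teacher = aggregate_soft_labels(client_logits, temperature=2.0)

# A student that is still uniform over classes pays a positive distillation loss.
student = softmax_t([1.0, 1.0, 1.0], temperature=2.0)
distill_loss = kl_divergence(teacher, student)
```

Only per-class probabilities cross the network in this scheme, not model parameters, which is the source of the bandwidth saving that federated distillation targets.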

2605.14885 2026-05-15 cs.CV

Masked Next-Scale Prediction for Self-supervised Scene Text Recognition

Zhuohao Chen, Zeng Li, Yifei Zhang, Chang Liu, Yu Zhou

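AI summary Scene text recognition requires modeling visual structure that evolves from coarse layout to fine-grained character strokes, but existing methods depend on large amounts of annotated data. This paper proposes Masked Next-Scale Prediction (MNSP), a unified self-supervised framework that jointly learns cross-scale prediction and masked image reconstruction to model the hierarchical structural evolution of scene text explicitly. It incorporates Next-Scale Prediction (NSP), which predicts higher-resolution features from lower-resolution context, together with a Multi-scale Linguistic Alignment module that maintains semantic consistency; experiments show state-of-the-art performance on multiple benchmark datasets.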

Comments Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026 Findings Track. 10 pages, 4 figures

Abstract

Scene Text Recognition requires modeling visual structures that evolve from coarse layouts to fine-grained character strokes. Training such models relies on large amounts of annotated data. Recent self-supervised approaches, such as Masked Image Modeling (MIM), alleviate this dependency by leveraging large-scale unlabeled data. Yet most existing MIM methods operate at a single spatial scale and fail to capture the hierarchical nature of scene text. In this work, we introduce Masked Next-Scale Prediction (MNSP), a unified self-supervised framework designed to explicitly model cross-scale structural evolution. The framework incorporates Next-Scale Prediction (NSP), which learns hierarchical representations by predicting higher-resolution features from lower-resolution contexts. Naive scale prediction, however, tends to produce spatially diffuse attention, directing the model toward background regions rather than textual structures. MNSP resolves this limitation by jointly learning cross-scale prediction and masked image reconstruction. NSP captures global layout priors across resolutions, while masked reconstruction imposes strong local constraints that guide attention toward informative text regions. A Multi-scale Linguistic Alignment module further maintains semantic consistency across different resolutions. Extensive experiments demonstrate that MNSP achieves state-of-the-art performance, reaching 86.2% average accuracy on the challenging Union14M benchmark and 96.7% across six standard datasets. Additional analyses show that our method improves robustness under extreme scale and layout variations. Code is available at https://github.com/CzhczhcHczh/MNSP

2605.14880 2026-05-15 cs.CV cs.GR cs.LG

Denoising-GS: Gaussian Splatting with Spatial-aware Denoising

Qingyuan Zhou, Xinyi Liu, Weidong Yang, Ning Wang, Shuquan Ye, Ben Fei, Ying He, Wanli Ouyang

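AI summary This paper proposes Denoising-GS, a high-fidelity novel view synthesis method. To address the noise that 3D Gaussian Splatting (3DGS) introduces during optimization because of sparse, incomplete initial point clouds, it introduces a spatial-aware denoising framework. By considering both the positions and the spatial structure of Gaussian primitives, the method designs an optimizer that preserves the spatial optimization flow and a spatial-gradient-based denoising strategy, improving the coherence and consistency of denoising. Uncertainty estimation and spatial coherence refinement further improve performance. Experiments show that Denoising-GS achieves state-of-the-art results on multiple benchmark datasets.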

Details
English abstract

Recent advances in 3D Gaussian Splatting (3DGS) have achieved remarkable success in high-fidelity Novel View Synthesis (NVS), yet the optimization process inevitably introduces noisy Gaussian primitives due to the sparse and incomplete initialization from Structure-from-Motion (SfM) point clouds. Most existing methods focus solely on adjusting the positions of primitives during optimization, while neglecting the underlying spatial structure. To this end, we introduce a new perspective by formulating the optimization of 3DGS as a primitive denoising process and propose Denoising-GS, a spatial-aware denoising framework for Gaussian primitives by taking both the positions and spatial structure into consideration. Specifically, we design an optimizer that preserves the spatial optimization flow of primitives, facilitating coherent and directed denoising rather than random perturbations. Building upon this, the Spatial Gradient-based Denoising strategy jointly considers the spatial supports of primitives to ensure gradient-consistent updates. Furthermore, the Uncertainty-based Denoising module estimates primitive-wise uncertainty to prune redundant or noisy primitives, while the Spatial Coherence Refinement strategy selectively splits primitives in sparse regions to maintain structural completeness. Experiments conducted on three benchmark datasets demonstrate that Denoising-GS consistently enhances NVS fidelity while maintaining representation compactness, achieving state-of-the-art performance across all benchmarks. Source code and models will be made publicly available.

2605.14877 2026-05-15 cs.CV

HeatKV: Head-tuned KV-cache Compression for Visual Autoregressive Modeling

Jonathan Cederlund, Axel Berg, Durmus Alp Emre Acar, Chuteng Zhou, Pontus Giselsson

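AI summary Visual Autoregressive (VAR) models deliver strong image generation quality at low latency but face severe KV-cache memory constraints. This paper proposes HeatKV, a new compression method that adapts cache allocation according to how much each attention head attends to previously generated scales, for more efficient memory use. The method ranks attention heads using a small offline calibration set and builds a static pruning schedule from that ranking. It significantly raises the KV-cache compression ratio while maintaining image fidelity and generation quality, setting a new state of the art for KV-cache compression in VAR models.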

Comments 18 pages total including appendix; 6 main-paper figures, 2 appendix figures; 4 tables

Details
English abstract

Visual Autoregressive (VAR) models have recently demonstrated impressive image generation quality while maintaining low latency. However, they suffer from severe KV-cache memory constraints, often requiring gigabytes of memory per generated image. We introduce HeatKV, a novel compression method that adapts cache allocation in each head based on its attention to previously generated scales. Using a small offline calibration set, the attention heads are ranked according to their attention scores over prior scales. Based on this ranking, we construct a static pruning schedule tailored to a given memory budget. Applied to the Infinity-2B model, HeatKV achieves $2 \times$ higher compression ratio in memory allocation for KV cache compared to existing methods, while maintaining similar or better image fidelity, prompt alignment and human perception score. Our method achieves a new state-of-the-art (SOTA) for VAR model KV-cache compression, showcasing the effectiveness of fine-grained, head-specific cache allocation.

2605.14874 2026-05-15 cs.CV

LPH-VTON: Resolving the Structure-Texture Dilemma of Virtual Try-On via Latent Process Handover

Yixin Liu, Baihong Qian, Jinglin Jiang, Jeffery Wu, Yan Chen, Wei Wang, Yida Wang, Lanqing Yang, Guangtao Xue

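AI summary Virtual Try-On (VTON) aims to generate photorealistic garment images precisely aligned with a person's body and pose. Current diffusion-based methods face a fundamental trade-off between structural integrity and textural fidelity. This paper proposes LPH-VTON, a framework that decouples structure and texture generation within a single continuous denoising process so that the two work together, resolving this tension and achieving a superior balance between structural alignment and perceptual realism on the standard VITON-HD dataset.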

Details
English abstract

Virtual Try-On (VTON) aims to synthesize photorealistic images of garments precisely aligned with a person's body and pose. Current diffusion-based methods, however, face a fundamental trade-off between structural integrity and textural fidelity. In this paper, we formalize this challenge as a consequence of complementary inductive biases inherent in prevailing architectures: models heavily reliant on spatial constraints naturally favor geometric alignment but often suppress textures, whereas models dominated by unconstrained generative priors excel at vibrant detail rendering but are prone to structural drift. Based on this diagnosis, we propose LPH-VTON, a new synergistic framework that resolves this tension within a single, continuous denoising process. LPH-VTON strategically decomposes the generation, leveraging a structure-biased model to establish a geometrically consistent latent scaffold in the early stages, before handing over control to a texture-biased model for high-fidelity detail rendering. Extensive experiments validate our approach. Our model achieves a superior Pareto-optimal balance, establishing new benchmarks in perceptual faithfulness while maintaining highly competitive structural alignment across the standard dataset VITON-HD, proving the efficacy of temporal architectural decoupling.

2605.14868 2026-05-15 cs.LG

Fast Adversarial Attacks with Gradient Prediction

Kamil Ciosek, Aleksandr V. Petrov, Nicolò Felicioni, Konstantina Palla

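AI summary This paper proposes accelerating adversarial example generation by predicting gradients, avoiding the costly backward pass of conventional methods. Building on a kernel view of neural networks, it estimates the input gradient from forward-pass hidden states via a lightweight linear regression, greatly improving generation efficiency. Experiments show that the method retains much of FGSM's attack performance while substantially raising throughput, a 532% increase over FGSM.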

Comments 17 pages

Details
English abstract

Generating adversarial examples at scale is a core primitive for robustness evaluation, adversarial training, and red-teaming, yet even "fast" attacks such as FGSM remain throughput-limited by the cost of a backward pass. We introduce a family of attacks that eliminates the backward pass by predicting the input gradient from forward-pass hidden states via a lightweight linear regression. The approach is motivated by a kernel view of neural networks and is exact in the Neural Tangent Kernel regime, while remaining effective for practical finite-width models. Empirically, our methods recover much of FGSM's attack performance while using only a small fraction of the time, corresponding to a $532\%$ increase in throughput. These results suggest gradient prediction as a simple and general route to significantly faster adversarial generation under realistic wall-clock constraints.
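
As a quick check on the throughput figure above: a percentage increase of p corresponds to a speedup factor of 1 + p/100, so a 532% increase means roughly 6.32 times the baseline throughput. The helper name below is illustrative and does not come from the paper.

```python
def speedup_from_percent_increase(percent_increase: float) -> float:
    """Convert a relative throughput increase (in percent) into a speedup factor."""
    # A 0% increase leaves throughput unchanged, i.e. a factor of 1.0.
    return 1.0 + percent_increase / 100.0


factor = speedup_from_percent_increase(532.0)
print(f"{factor:.2f}x the baseline throughput")
```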

2605.14867 2026-05-15 cs.LG cs.AI q-bio.NC

REALM: Retrospective Encoder Alignment for LFP Modeling

Peicheng Wu, Zhenyu Bu, Runze Ma, Lin Du

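AI summary This study proposes REALM, a causal LFP decoding framework that addresses the low accuracy of behavior decoding from local field potentials (LFPs) and the unsuitability of non-causal architectures for real-time use. REALM enables efficient real-time decoding by transferring representational knowledge from a pretrained bidirectional LFP model to a causal student model. Experiments show that REALM maintains high decoding performance while substantially reducing parameter count and training time, demonstrating the practicality and scalability of LFP-only models for wireless implantable brain-computer interfaces.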

Details
English abstract

Spike activity has been the dominant neural signal for behavior decoding due to its high spatial and temporal resolution. However, as brain-computer interfaces (BCIs) move toward high channel counts and wireless operation, the high sampling frequency of spike signals becomes a bottleneck due to high power and bandwidth requirements. Local field potentials (LFPs) represent a different spatial-temporal scale of brain activity compared to spikes, offering key advantages including improved long-term stability, reduced energy consumption, and lower bandwidth requirement. Despite these benefits, LFP-based decoding models typically show reduced accuracy and often rely on non-causal architectures that are unsuitable for real-time deployment. To address these challenges, we propose REALM: a retrospective distillation framework that enables causal LFP decoding. Inspired by offline-to-online distillation strategies in speech recognition, REALM transfers representational knowledge from a pretrained multi-session bidirectional LFP model to a causal version for real-time deployment. We first pretrain a bidirectional Mamba-2 teacher model using a masked autoencoding objective. We then distill this teacher model into a compact student model via a combined objective of representation alignment and task supervision. REALM consistently outperforms both causal and non-causal LFP-based SOTA methods for behavior decoding. Notably, our REALM improves decoding performance while achieving a $2\times$ reduction in parameter count and a $10\times$ reduction in training time. These results demonstrate that retrospective distillation effectively bridges the gap between offline and real-time neural decoding. REALM shows that LFP-only models can achieve competitive decoding performance without reliance on spike signals, offering a practical and scalable alternative for next-generation wireless implantable BCIs.

2605.14865 2026-05-15 cs.AI cs.CL

Holistic Evaluation and Failure Diagnosis of AI Agents

Netta Madvil, Gilad Dym, Alon Mecilati, Edo Dekel, Jonatan Liberman, Rotem Brazilay, Liron Schliesser, Max Svidlo, Shai Nir, Orel Shalom, Yaron Friedman, David Connack, Amos Rimon, Philip Tannor, Shir Chorev

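AI summary This study proposes a holistic evaluation and failure diagnosis framework for AI agents, addressing the inability of existing evaluation methods to explain why failures occur and where. The framework pairs top-down agent-level diagnosis with bottom-up span-level evaluation, decomposing analysis into independent per-span assessments. This supports traces of arbitrary length and provides a span-level rationale for each verdict. Experiments show leading results on multiple benchmarks, with significant gains in categorization, localization, and joint localization-categorization accuracy.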

Details
English abstract

AI agents execute complex multi-step processes, but current evaluation falls short: outcome metrics report success or failure without explaining why, and process-level approaches struggle to connect failure types to their precise locations within long, structured traces. We present a holistic agent evaluation framework that pairs top-down agent-level diagnosis with bottom-up span-level evaluation, decomposing analysis into independent per-span assessments. This decomposition scales to traces of arbitrary length and produces span-level rationales for each verdict. On the TRAIL benchmark, our framework achieves state-of-the-art results across all metrics on both GAIA and SWE-Bench, with relative gains over the strongest prior baselines of up to 38% on category F1, up to 3.5x on localization accuracy, and up to 12.5x on joint localization-categorization accuracy. Per-category analysis shows our framework leading in more error categories than any other evaluator. Notably, the same frontier model achieves several times higher localization accuracy when used inside our framework than as a monolithic judge over the full trace, showing that evaluation methodology, not model capability, is the bottleneck.

2605.14857 2026-05-15 cs.AI cs.IR

A Deterministic Agentic Workflow for HS Tariff Classification: Multi-Dimensional Rule Reasoning with Interpretable Decisions

Yu Zhang, Dongjiang Zhuang, Qu Zhou, Zheng Huang, Junhe Wu, Jing Cao, Kai Chen

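AI summary This paper proposes a deterministic agentic workflow for Harmonized System (HS) tariff classification, an expert-level task. Through multi-dimensional rule reasoning with an interpretable decision process, the method addresses the challenge of satisfying priority rules simultaneously along several axes such as material, form, and function. The authors design a fixed-control-flow agent architecture that confines large language model calls to specific stages and retains reflection and verification as local mechanisms, yielding structured, interpretable classification decisions. Experiments show high classification accuracy on the HSCodeComp dataset and suggest that some ground-truth labels may deviate from HS rules.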

Details
English abstract

Harmonized System (HS) tariff classification is a high-stakes, expert-level task in which a free-form product description must be mapped to a specific six- or eight-digit code under the General Interpretive Rules (GIR), section notes, chapter notes, and Explanatory Notes. The difficulty lies not in knowledge volume but in *multi-dimensional rule reasoning*: a correct classification must satisfy competing priority rules along several axes simultaneously, including material, form, function, essential character, the part-versus-whole boundary, and specific listing versus residual headings. End-to-end prompting of large language models fails characteristically by resolving one axis while ignoring the priority constraints on the others. We present a *deterministic agentic workflow* in contrast to self-planning agents: the control flow is fixed, language model calls are confined to narrow stages, and reflection and verification are retained as local mechanisms. This design yields interpretability by construction--each decision is decomposed into stage-wise structured outputs with verbatim citation of the chapter or section notes that bear on it. The architecture combines offline knowledge-engineering of the Chinese HS tariff with an online six-stage pipeline. Evaluated on HSCodeComp at the six-digit level, the workflow reaches 75.0% top-1 and 91.5% top-3 at four digits, and 64.2% top-1 and 78.3% top-3 at six digits with Qwen3.6-plus; an open-weight Qwen3.6-27B-FP8 backbone in non-thinking mode achieves 84.2% four-digit and 77.4% six-digit top-1 agreement with the frontier model. A two-stage manual audit of 226 six-digit disagreements suggests that a non-trivial fraction of HSCodeComp ground-truth labels may deviate from HS general rules; full adjudication records are released in the appendix as preliminary findings for community review.
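
One plausible reading of the four-digit and six-digit top-k figures is sketched below. It assumes a prediction counts as a hit when any of the first k ranked candidates shares its first d digits with the reference code. This scoring rule, the function names, and the sample code strings are assumptions for illustration, not the paper's evaluation code.

```python
from typing import Sequence


def top_k_hit(candidates: Sequence[str], reference: str, k: int, digits: int) -> bool:
    """True if any of the first k candidates matches the reference on its first `digits` digits."""
    prefix = reference[:digits]
    return any(code[:digits] == prefix for code in candidates[:k])


def top_k_accuracy(predictions: Sequence[Sequence[str]], references: Sequence[str],
                   k: int, digits: int) -> float:
    """Fraction of items whose ranked candidate list contains a prefix match within the top k."""
    hits = sum(top_k_hit(p, r, k, digits) for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical ranked candidates for two products (illustrative code strings only).
predictions = [["850440", "850450", "847130"], ["610910", "610990", "620520"]]
references = ["850450", "610510"]

print(top_k_accuracy(predictions, references, k=1, digits=4))
print(top_k_accuracy(predictions, references, k=1, digits=6))
print(top_k_accuracy(predictions, references, k=3, digits=6))
```

Under this rule, four-digit accuracy can never be lower than six-digit accuracy for the same k, because a six-digit match implies a four-digit match.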

2605.14855 2026-05-15 cs.LG cs.AI eess.SP

Exploitation of Hidden Context in Dynamic Movement Forecasting: A Neural Network Journey from Recurrent to Graph Neural Networks and General Purpose Transformers

Lukas Schelenz, Shobha Rajanna, Denis Gosalci, Lucas Heublein, Jonas Pirkl, Jonathan Ott, Felix Ott, Christopher Mutschler, Tobias Feigl

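AI summary This paper studies how to exploit hidden context in dynamic movement forecasting, tracing the progression from recurrent neural networks to graph neural networks and general-purpose Transformers. It compares several machine learning methods on forecasting the movement trajectories of NBA players and finds that a hybrid LSTM augmented with contextual information achieves the lowest final displacement error, outperforming graph attention networks, Transformers, and other models. The experiments show that each model has its own strengths and weaknesses in forecast accuracy, generalizability, and training efficiency, underscoring the need to choose a model to suit the task when predicting trajectories in fast-paced, dynamic environments.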

Comments 12 pages

Details
Journal ref
IEEE/ION Position, Location and Navigation Symposium (PLANS), Salt Lake City, UT, May 2025
English abstract

Forecasting within signal processing pipelines is crucial for mitigating delays, particularly in predicting the dynamic movements of objects such as NBA players. This task poses significant challenges due to the inherently interactive and unpredictable nature of sports, where abrupt changes in velocity and direction are prevalent. Traditional approaches, including (S)ARIMA(X), Kalman filters (KF), and Particle filters (PF), often struggle to model the non-linear dynamics present in such scenarios. Machine learning (ML) methods, such as long short-term memory (LSTM) networks, graph neural networks (GNNs), and Transformers, offer greater flexibility and accuracy but frequently fail to explicitly capture the interplay between temporal dependencies and contextual interactions, which are critical in chaotic sports environments. In this paper, we evaluate these models and assess their strengths and weaknesses. Experimental results reveal key performance trade-offs across input history length, generalizability, and the ability to incorporate contextual information. ML-based methods demonstrated substantial improvements over linear models across forecast horizons of up to 2s. Among the tested architectures, our hybrid LSTM augmented with contextual information achieved the lowest final displacement error (FDE) of 1.51m, outperforming temporal convolutional neural network (TCNN), graph attention network (GAT), and Transformers, while also requiring less data and training time compared to GAT and Transformers. Our findings indicate that no single architecture excels across all metrics, emphasizing the need for task-specific considerations in trajectory prediction for fast-paced, dynamic environments such as NBA gameplay.
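
Final displacement error (FDE) is conventionally the Euclidean distance between the predicted and the true position at the last forecast step. The sketch below assumes 2D positions in meters. The sample trajectories are made up and the function name is illustrative, not taken from the paper.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def final_displacement_error(predicted: Sequence[Point], actual: Sequence[Point]) -> float:
    """Euclidean distance between the last predicted and the last true position."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("trajectories must be non-empty and of equal length")
    (px, py), (ax, ay) = predicted[-1], actual[-1]
    return math.hypot(px - ax, py - ay)


# Made-up trajectories: only the final step contributes to FDE.
predicted = [(0.0, 0.0), (1.0, 1.0), (3.0, 4.0)]
actual = [(0.0, 0.0), (1.0, 2.0), (0.0, 0.0)]
print(final_displacement_error(predicted, actual))
```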

2605.14847 2026-05-15 cs.CV

SR-Prominence: A Crowdsourced Protocol and Dataset Suite for Perceptually-Weighted Super-Resolution Artifact Evaluation

Ivan Molodetskikh, Kirill Malyshev, Mark Mirgaleev, Nikita Zagainov, Evgeney Bogatyrev, Dmitriy Vatolin

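AI summary Modern image super-resolution methods generate detailed, visually appealing results but often introduce visual artifacts that degrade perceived quality. This paper proposes artifact prominence as an evaluation target, defined as the fraction of viewers who judge a highlighted region to contain a noticeable artifact, and builds the SR-Prominence dataset suite, which contains 3,935 artifact masks annotated with prominence across several realistic settings. The study finds that classical full-reference quality metrics such as SSIM give strong localized prominence signals, whereas no-reference methods and specialized artifact detectors generalize poorly. The dataset provides a new perception-oriented benchmark for super-resolution artifact evaluation.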

Details
English abstract

Modern image super-resolution methods generate detailed, visually appealing results, but they often introduce visual artifacts: unnatural patterns and texture distortions that degrade perceived quality. These defects vary widely in perceptual impact--some are barely noticeable, while others are highly disturbing--yet existing detection methods treat them equally. We propose artifact prominence as an evaluative target, defined as the fraction of viewers who judge a highlighted region to contain a noticeable artifact. We design a crowdsourced annotation protocol and construct SR-Prominence, a dataset suite containing 3,935 artifact masks from DeSRA, Open Images, Urban100, and a realistic no-ground-truth Urban100-HR setting, annotated with prominence. Re-annotating DeSRA reveals that 48.2% of its in-lab binary artifacts are not noticed by a majority of viewers. Across the suite, we audit SR artifact detectors, image-quality metrics, and SR methods. We find that classical full-reference metrics, especially SSIM and DISTS, provide surprisingly strong localized prominence signals, whereas no-reference IQA methods and specialized artifact detectors often fail to generalize across datasets and reference settings. SR-Prominence is released with an objective scoring protocol that allows new metrics to be benchmarked on our suite without further crowdsourcing. Together, the data and protocols enable SR artifact evaluation to move from binary defect presence toward perceptual impact. SR-Prominence is available at https://huggingface.co/datasets/imolodetskikh/sr-artifact-prominence.
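
The prominence definition above reduces to a simple ratio. The sketch below assumes each viewer gives one yes/no judgment per highlighted region. The function names and the sample votes are illustrative and are not part of the released scoring protocol.

```python
from typing import Sequence


def artifact_prominence(votes: Sequence[bool]) -> float:
    """Fraction of viewers who judged the highlighted region to contain a noticeable artifact."""
    if not votes:
        raise ValueError("at least one viewer judgment is required")
    return sum(votes) / len(votes)


def noticed_by_majority(votes: Sequence[bool]) -> bool:
    """True when strictly more than half of the viewers noticed the artifact."""
    return artifact_prominence(votes) > 0.5


# Made-up judgments from five viewers for one region.
votes = [True, True, False, False, False]
print(artifact_prominence(votes), noticed_by_majority(votes))
```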

2605.14845 2026-05-15 cs.CV

Exploring Vision-Language Models for Online Signature Verification: A Zero-Shot Capability Study

Marta Robledo-Moreno, Ruben Vera-Rodriguez, Ruben Tolosana, Javier Ortega-Garcia

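AI summary This paper studies the zero-shot capability of vision-language models (VLMs) for online signature verification, evaluating state-of-the-art models such as GPT-5.2 and Gemini 2.5 Pro on the Signature Verification Challenge (SVC) benchmark. Raw kinematic time series are converted into static images, and biometric scores are computed from the models' latent token probabilities. The models perform very well in random forgery scenarios, with GPT-5.2 reaching an Equal Error Rate as low as 0.32% in mobile tasks, but performance drops markedly in the harder skilled forgery scenarios, where chain-of-thought reasoning leads the models to produce kinematic hallucinations.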

Comments Accepted at the 14th International Workshop on Biometrics and Forensics

Details
English abstract

Recent advancements in Vision-Language Models (VLMs) have demonstrated strong capabilities in general visual reasoning, yet their applicability to rigorous biometric tasks remains unexplored. This work presents an exploratory study evaluating the zero-shot performance of state-of-the-art VLMs (GPT-5.2 and Gemini 2.5 Pro) on the Signature Verification Challenge (SVC) benchmark. To enable visual processing, raw kinematic time-series are converted into static images, encoding pressure information into stroke opacity whenever available in the source data. Furthermore, we introduce a scoring protocol that extracts latent token probabilities to compute robust biometric scores. Experimental results reveal a significant performance dichotomy dependent on signal quality and forgery type. In random forgery scenarios, the zero-shot VLM achieves exceptional discrimination, with GPT-5.2 reaching an Equal Error Rate of 0.32% in mobile tasks, outperforming supervised state-of-the-art systems. Conversely, in skilled forgery scenarios, where the task is more challenging because both signatures are almost identical, the results are significantly worse, and a critical "Rationalization Trap" emerges: chain-of-thought (CoT) reasoning degrades performance as the model produces kinematic hallucinations to justify forgery artifacts as natural variability.
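
The Equal Error Rate (EER) reported above is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below approximates it by sweeping thresholds over the observed scores, assuming that a higher score means "more likely genuine". The scores are made up and this is not the paper's scoring protocol.

```python
from typing import Sequence


def equal_error_rate(genuine: Sequence[float], impostor: Sequence[float]) -> float:
    """Approximate EER: the error rate at the threshold where FAR and FRR are closest."""
    thresholds = sorted(set(genuine) | set(impostor))
    best_gap, best_eer = float("inf"), 1.0
    for t in thresholds:
        # Accept a comparison when its score is at least the threshold.
        far = sum(s >= t for s in impostor) / len(impostor)  # forgeries accepted
        frr = sum(s < t for s in genuine) / len(genuine)     # genuine rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Made-up scores: one forgery (0.65) outscores one genuine signature (0.6).
genuine_scores = [0.9, 0.8, 0.7, 0.6]
impostor_scores = [0.1, 0.2, 0.3, 0.65]
print(equal_error_rate(genuine_scores, impostor_scores))
```

With only a handful of scores the estimate is coarse. Evaluations on real benchmarks typically interpolate between thresholds.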

2605.14844 2026-05-15 cs.LG cs.AI

XFP: Quality-Targeted Adaptive Codebook Quantization with Sparse Outlier Separation for LLM Inference

Thomas Witt

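AI summary This paper proposes XFP, a dynamic weight quantization method for efficient large language model inference. Given per-channel cosine-similarity quality floors, the method automatically determines the codebook size, outlier budget, and packing for each layer, with no manual bit-width selection or calibration data. XFP decomposes each weight matrix into a sparse fp16 outlier residual and a dense sub-byte index tensor, and supports efficient decoding through two storage modes. Experiments on several large models report higher inference speed and accuracy than existing methods, and the approach also handles models that exceed the memory envelope.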

Comments 17 pages, 3 figures, 17 tables, 1 algorithm. Code: https://github.com/flash7777/vllm/tree/multiquant

Details
English abstract

We introduce XFP, a dynamic weight quantizer for LLM inference that inverts the conventional workflow: the operator specifies reconstruction quality floors on per-channel cosine similarity (one strict floor for attention and shared experts, one lazy floor for routed-expert MoE); XFP determines codebook size, outlier budget, and packing per layer automatically -- no Hessian, no calibration data, no manual bit-width selection. Each weight matrix is decomposed into a sparse fp16 outlier residual and a dense sub-byte index tensor into a per-group learned codebook. Two storage modes share one auto-select frontend and one fused decode kernel: V2 (per-channel Lloyd) and V2a (shared library of L=32 codebooks per layer). On Qwen3.5-122B-A10B under V2, XFP reaches 138 tok/s single-stream decode on workstation hardware (RTX PRO 6000 Blackwell, TP=2) at 94.49% GSM8K strict-match (3 seeds, n=3957), and is 49% faster than Marlin INT4 at TP=1. For models that do not fit in the target memory envelope, we present the H-Process: a quality-driven iteration over the two cosine thresholds that finds the operating point at which the model just fits while still producing sensible output. Three constraints define its search space: the operator-set thresholds, an OOM boundary at quantize-on-load, and a garbage boundary in generation (cosine similarity steers; benches verify). On Qwen3.5-397B-A17B (512 routed experts/layer), the H-Process fits the full expert population into 2x96 GB at ~3.4 effective bits and delivers 100.9 tok/s long-output decode at 66.72% GSM8K strict-match on the full 1319-problem set (single seed at submission; multi-seed evaluation in progress), exceeding INT4 with routed-expert pruning on memory, throughput, and accuracy simultaneously.
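
The quality floor described above can be pictured as a per-channel check on cosine similarity between original and reconstructed weights. The sketch below treats each row of a weight matrix as one channel, which is an assumption, and uses made-up values. It illustrates only the check itself, not XFP's quantizer or its auto-select logic.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def channels_below_floor(original: Sequence[Sequence[float]],
                         reconstructed: Sequence[Sequence[float]],
                         floor: float) -> List[int]:
    """Indices of channels (rows, by assumption) whose reconstruction falls below the floor."""
    return [i for i, (w, w_hat) in enumerate(zip(original, reconstructed))
            if cosine_similarity(w, w_hat) < floor]


# Made-up weights: channel 0 is only rescaled, channel 1 loses a component.
original = [[1.0, 0.0], [1.0, 1.0]]
reconstructed = [[2.0, 0.0], [1.0, 0.0]]
print(channels_below_floor(original, reconstructed, floor=0.99))
```

Cosine similarity ignores magnitude, so the rescaled channel 0 passes the floor, while channel 1 (similarity about 0.707) does not.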

2605.14843 2026-05-15 cs.CV

MechVerse: Evaluating Physical Motion Consistency in Video Generation Models

Rahul Jain, Mayank Patel, Asim Unmesh, Karthik Ramani

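AI summary This paper proposes MechVerse, a new benchmark for evaluating physical motion consistency in video generation models. It targets the failure of current models to satisfy kinematic and geometric constraints when generating videos of mechanical structures, such as deformed parts and inconsistent motion transfer. MechVerse contains a large set of synthetic video clips with structured prompts for evaluating generation under mechanical constraints. Experiments show that existing models do well on appearance and smoothness but still fall clearly short in generating mechanically admissible motion.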

Comments Under Review

Details
English abstract

Text- and image-conditioned video generation models have achieved strong visual fidelity and temporal coherence, but they often fail to generate motion governed by kinematic and geometric constraints. In these settings, object parts must remain rigid, maintain contact or coupling with neighboring components, and transfer motion consistently across connected parts. These requirements are especially explicit in articulated mechanical assemblies, where motion is constrained by rigid-link geometry, contact/coupling relations, and transmission through kinematic chains. A generated video may therefore appear plausible while violating the intended mechanism, such as rotating a part that should translate, deforming a rigid component, breaking coupling between parts, or failing to move downstream components. To evaluate this gap, we introduce MechVerse, a benchmark for mechanically consistent image-to-video generation. MechVerse contains 21,156 synthetic clips from 1,357 mechanical assemblies across 141 categories, organized into three tiers of increasing kinematic complexity: independent articulation, pairwise coupling, and densely coupled multi-part mechanisms. Each clip is paired with a structured prompt describing part identities, stationary supports, moving components, motion primitives, direction, speed/extent, and inter-part dependencies. We evaluate proprietary, open-source, and fine-tuned image-to-video models using standard video metrics, instruction-following scores, and human judgments of motion correctness and kinematic coupling. Results show that current models can preserve appearance and smoothness while failing to generate mechanically admissible motion, with errors increasing as coupling complexity grows. MechVerse provides a benchmark for measuring and improving mechanism-aware video generation from image and language inputs.

2605.14842 2026-05-15 cs.CV

Editor's Choice: Evaluating Abstract Intent in Image Editing through Atomic Entity Analysis

Mor Ventura, Roy Hirsch, Yonatan Bitton, Regev Cohen, Roi Reichart

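AI summary This paper studies how abstract intent in image editing is understood and evaluated, proposing Entity-Rubrics, an evaluation framework based on atomic entity analysis, and building AbstractEdit, the first benchmark dedicated to abstract image editing. The work is the first to formalize the definition and taxonomy of abstract image editing. By decomposing edits into entity-level assessments, it achieves strong correlation with human judgment. Experiments show that existing models struggle considerably with abstract instructions, and that integrating advanced LLM text encoders and iterative thinking improves performance, pointing toward more natural multimodal interaction.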

Details
English abstract

Humans naturally communicate through abstract concepts like "mood". However, current image editing benchmarks focus primarily on explicit, literal commands, leaving abstract instructions largely underexplored. In this work, we first formalize the definition and taxonomy of abstract image editing. To measure instruction-following in this challenging domain, we introduce Entity-Rubrics, a framework that breaks down abstract edits into individual, entity-level assessments and achieves strong correlation with human judgment. Alongside this framework, we contribute AbstractEdit, the first benchmark dedicated to abstract image editing across diverse real-world scenes. Evaluating 11 leading models on this dataset reveals a fundamental challenge: standard architectures struggle to balance intent and preservation, commonly defaulting to under-editing or over-editing. Our analysis demonstrates that driving meaningful improvements relies heavily on integrating advanced LLM text encoders and iterative thinking. Looking forward, our entity-based paradigm can generalize beyond assessment to serve as a reward model, enable models to correctly interpret abstract communication, or highlight specific failures in test-time critique loops. Ultimately, we hope this work serves as a stepping stone toward seamless multimodal interaction, closing the gap between rigid machine execution and the natural, open-ended way humans communicate.