arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
2506.08543 2026-05-27 cs.CV

Spectral Principal Paths: A Spectral Perspective on Linear Representation Formation in LLMs

Bowei Tian, Xuntao Lyu, Meng Liu, Hongyi Wang, Ang Li

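AI summary: Proposes the Input-Space Linearity Hypothesis and the Spectral Principal Path framework, using spectral theory to explain how linear representations form and stabilize in large language models, with rigorous guarantees.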

Comments arXiv admin note: text overlap with arXiv:2503.22720

English abstract

High-level representations have become a central focus in enhancing AI transparency and control, shifting attention from individual neurons or circuits to structured semantic directions that align with human-interpretable concepts. While the Linear Representation Hypothesis (LRH) suggests that such directions emerge in representations, it remains unclear how these representations originate and why they become increasingly stable across layers. To address this, we introduce the Input-Space Linearity Hypothesis, which posits that concept-aligned directions originate in the input space and are steadily maintained with increasing depth. We then propose the Spectral Principal Path (SPP) framework, which formalizes how deep networks progressively distill linear representations along the spectral principal directions. We provide rigorous stability guarantees for the SPP based on the Wedin $\sin\Theta$ perturbation theorem, identifying testable conditions, including spectral gap and context incoherence, that jointly ensure layer-wise directional preservation. By bridging theoretical analysis with empirical evidence, this work offers a spectral view of how linear representations arise in LLMs, and suggests potential implications for concept-level controllable, robust, and coherent approaches to fairness and transparency in modern AI systems.
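
The role of the spectral gap in a Wedin-type guarantee can be checked numerically. The following is a minimal sketch in Python with NumPy, not the paper's analysis: the 3x3 matrix `a` (a stand-in for a layer map), the perturbation `e`, and the helper name `top_subspace_sin_theta` are all assumptions made for illustration. Wedin-type bounds control the sine of the principal angle by the perturbation size relative to the gap, so a large gap should keep the leading direction nearly fixed.

```python
import numpy as np

def top_subspace_sin_theta(a, a_perturbed, k=1):
    """Largest sine of the principal angles between the top-k left singular subspaces."""
    u, _, _ = np.linalg.svd(a)
    u_p, _, _ = np.linalg.svd(a_perturbed)
    # The singular values of U^T U_p are the cosines of the principal angles.
    cosines = np.linalg.svd(u[:, :k].T @ u_p[:, :k], compute_uv=False)
    cos_min = np.clip(cosines.min(), 0.0, 1.0)
    return float(np.sqrt(1.0 - cos_min ** 2))

# Hypothetical matrix with a large gap between its first and second singular values.
a = np.diag([10.0, 1.0, 0.5])
e = 0.01 * np.ones((3, 3))                 # small perturbation (assumed)
singular_values = np.linalg.svd(a, compute_uv=False)
gap = singular_values[0] - singular_values[1]
sin_theta = top_subspace_sin_theta(a, a + e)
ratio = np.linalg.norm(e, 2) / gap         # perturbation size relative to the gap
print(sin_theta, ratio)
```

Shrinking the gap (for example, replacing 10.0 with 1.1) makes the same perturbation rotate the leading direction much more, which is the qualitative behavior a spectral-gap condition is meant to rule out.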

2509.21167 2026-05-27 cs.LG cs.CV

A Unified Framework for Diffusion Model Unlearning with f-Divergence

Nicola Novello, Federico Fontana, Luigi Cinque, Deniz Gunduz, Andrea M. Tonello

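AI summary: Proposes a unified f-divergence framework that generalizes the MSE loss used for concept unlearning in diffusion models to arbitrary f-divergences, and examines through theoretical analysis and experiments how the choice of divergence affects unlearning.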

Comments Accepted at ICML 2026

English abstract

Most existing methods for concept unlearning in text-to-image diffusion models minimize a mean squared error (MSE) loss between the denoiser outputs conditioned on a target and an anchor concept, which is implicitly the KL divergence between two Gaussians. We generalize this objective to any $f$-divergence, recovering MSE as the KL instance, and identify a family of $\alpha$-divergences whose Gaussian closed form yields cheap, MSE-like training objectives. For the remaining $f$-divergences, we provide a min-max objective based on the variational formulation of the $f$-divergence. We theoretically analyze and numerically validate how different $f$-divergences impact the gradient magnitude and the convergence properties of the algorithm, affecting the quality of unlearning. For instance, we observe that the Hellinger closed-form instance consistently dominates MSE across multiple scenarios. More generally, the proposed unified framework offers a flexible paradigm for selecting the optimal divergence based on the application and user goal, allowing for finer control over the trade-off between unlearning efficacy and generative fidelity.
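
The link between MSE and KL divergence mentioned above follows from a standard identity for two Gaussians with the same isotropic covariance. The sketch below is an illustration under that assumption, not the paper's training objective; the function names are hypothetical, and the squared Hellinger distance uses the convention H^2 = 1 - (Bhattacharyya coefficient).

```python
import math

def squared_distance(mu_p, mu_q):
    return sum((a - b) ** 2 for a, b in zip(mu_p, mu_q))

def kl_gaussian(mu_p, mu_q, sigma):
    """KL(N(mu_p, sigma^2 I) || N(mu_q, sigma^2 I)): a scaled squared error."""
    return squared_distance(mu_p, mu_q) / (2.0 * sigma ** 2)

def squared_hellinger_gaussian(mu_p, mu_q, sigma):
    """Squared Hellinger distance between the same two Gaussians, bounded above by 1."""
    return 1.0 - math.exp(-squared_distance(mu_p, mu_q) / (8.0 * sigma ** 2))

# As the means move apart, KL grows without bound while Hellinger saturates.
for shift in (0.1, 1.0, 10.0):
    print(shift,
          kl_gaussian([shift, 0.0], [0.0, 0.0], 1.0),
          squared_hellinger_gaussian([shift, 0.0], [0.0, 0.0], 1.0))
```

The different growth rates are one simple way to see why the choice of divergence can change gradient magnitudes, which is the kind of effect the abstract says the authors analyze.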

2509.21106 2026-05-27 cs.CL cs.IR

BESPOKE: Benchmark for Search-Augmented Large Language Model Personalization via Diagnostic Feedback

Hyunseo Kim, Sangam Lee, Kwangwook Seo, Dongha Lee

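AI summary: Introduces the BESPOKE benchmark, which collects real user histories and pairs responses with fine-grained preference scores and feedback, to systematically evaluate search-augmented large language models on personalized information-seeking tasks.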

Comments Accepted to ICML 2026

English abstract

Search-augmented large language models (LLMs) have advanced information-seeking tasks by integrating retrieval into generation, reducing users' cognitive burden compared to traditional search systems. Yet they remain insufficient for fully addressing diverse user needs, which requires recognizing how the same query can reflect different intents across users and delivering information in preferred forms. While recent systems such as ChatGPT and Gemini attempt personalization by leveraging user histories, systematic evaluation of such personalization remains underexplored. To address this gap, we propose BESPOKE, a realistic benchmark for evaluating personalization in search-augmented LLMs. BESPOKE is designed to be both realistic, by collecting authentic chat and search histories directly from humans, and diagnostic, by pairing responses with fine-grained preference scores and feedback. The benchmark is constructed through long-term, deeply engaged human annotation, in which annotators contributed their own histories, authored queries with detailed information needs, and evaluated responses with scores and diagnostic feedback. Leveraging BESPOKE, we conduct systematic analyses that reveal key requirements for effective personalization in information-seeking tasks, providing a foundation for fine-grained evaluation of personalized search-augmented LLMs. Our code and data are available at https://augustinlib.github.io/BESPOKE/.

2509.18919 2026-05-27 cs.CV

Advancing Metallic Surface Defect Detection via Anomaly-Guided Pretraining on a Large Industrial Dataset

Chuni Liu, Hongjie Li, Jiaqi Du, Yangyang Hou, Qian Sun, Lei Jin, Ke Xu

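AI summary: Proposes Anomaly-Guided Self-Supervised Pretraining (AGSSP), a two-stage framework that uses anomaly priors to guide representation learning and improves metallic surface defect detection, with gains of up to 10% in mAP@0.5.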

Comments Accepted for publication in Pattern Recognition

Journal ref
Pattern Recognition, Volume 179, Part C, 2026, 113788
English abstract

The pretraining-finetuning paradigm is a crucial strategy in metallic surface defect detection for mitigating the challenges posed by data scarcity. However, its implementation presents a critical dilemma. Pretraining on natural image datasets such as ImageNet faces a significant domain gap. Meanwhile, naive self-supervised pretraining on in-domain industrial data is often ineffective because existing learning objectives cannot distinguish subtle defect patterns from complex background noise and textures. To resolve this, we introduce Anomaly-Guided Self-Supervised Pretraining (AGSSP), a novel paradigm that explicitly guides representation learning through anomaly priors. AGSSP employs a two-stage framework: (1) it first pretrains the model's backbone by distilling knowledge from anomaly maps, encouraging the network to capture defect-salient features; (2) it then pretrains the detector using pseudo-defect boxes derived from these maps, aligning it with localization tasks. To enable this, we develop a knowledge-enhanced method to generate high-quality anomaly maps and collect a large-scale industrial dataset of 120,000 images. Additionally, we present two small-scale, pixel-level labeled metallic surface defect datasets for validation. Extensive experiments demonstrate that AGSSP consistently enhances performance across various settings, achieving up to a 10% improvement in mAP@0.5 and 11.4% in mAP@0.5:0.95 compared to ImageNet-based models. All code, pretrained models, and datasets are publicly available at https://clovermini.github.io/AGSSP-Dev/.
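
One plausible way to derive pseudo-defect boxes from an anomaly map, as in stage (2) above, is to threshold the map and box each connected region. The sketch below assumes exactly that; the abstract does not say how the authors derive their boxes, and the function name `pseudo_boxes`, the threshold, and the toy map are all hypothetical.

```python
from collections import deque

def pseudo_boxes(anomaly_map, threshold):
    """Return (row_min, col_min, row_max, col_max) boxes around 4-connected regions at or above threshold."""
    rows, cols = len(anomaly_map), len(anomaly_map[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or anomaly_map[r][c] < threshold:
                continue
            # Breadth-first search over one connected high-anomaly region.
            queue = deque([(r, c)])
            seen[r][c] = True
            r0, c0, r1, c1 = r, c, r, c
            while queue:
                y, x = queue.popleft()
                r0, c0, r1, c1 = min(r0, y), min(c0, x), max(r1, y), max(c1, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and not seen[ny][nx] and anomaly_map[ny][nx] >= threshold:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((r0, c0, r1, c1))
    return boxes

toy_map = [
    [0.1, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.8, 0.0],
    [0.0, 0.7, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.6],
]
print(pseudo_boxes(toy_map, threshold=0.5))  # [(1, 1, 2, 2), (3, 3, 3, 3)]
```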

2509.18384 2026-05-27 cs.RO cs.FL

LAD-VF: LLM-Automatic Differentiation Enables Fine-Tuning-Free Robot Planning from Formal Methods Feedback

Yunhao Yang, Junyuan Hong, Gabriel Jacob Perin, Zhiwen Fan, Li Yin, Zhangyang Wang, Ufuk Topcu

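AI summary: Proposes LAD-VF, a framework that uses formal verification feedback and LLM automatic differentiation to optimize prompts automatically, raising specification compliance in robot planning without fine-tuning; success rates rise from 60% to over 90%.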

Comments Presented at ICRA 2026

English abstract

Large language models (LLMs) can translate natural language instructions into executable action plans for robotics, autonomous driving, and other domains. Yet, deploying LLM-driven planning in the physical world demands strict adherence to safety and regulatory constraints, which current models often violate due to hallucination or weak alignment. Traditional data-driven alignment methods, such as Direct Preference Optimization (DPO), require costly human labeling, while recent formal-feedback approaches still depend on resource-intensive fine-tuning. In this paper, we propose LAD-VF, a fine-tuning-free framework that leverages formal verification feedback for automated prompt engineering. By introducing a formal-verification-informed text loss integrated with LLM-AutoDiff, LAD-VF iteratively refines prompts rather than model parameters. This yields three key benefits: (i) scalable adaptation without fine-tuning; (ii) compatibility with modular LLM architectures; and (iii) interpretable refinement via auditable prompts. Experiments in robot navigation and manipulation tasks demonstrate that LAD-VF substantially enhances specification compliance, improving success rates from 60% to over 90%. Our method thus presents a scalable and interpretable pathway toward trustworthy, formally-verified LLM-driven control systems.

2509.09977 2026-05-27 cs.CV

ISTASTrack: Bridging ANN and SNN via ISTA Adapter for RGB-Event Tracking

Siying Liu, Zikai Wang, Hanle Zheng, Yifan Hu, Xilin Wang, Qingkai Yang, Jibin Wu, Hao Guo, Lei Deng

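AI summary: Proposes ISTASTrack, the first transformer-based ANN-SNN hybrid tracker, which uses ISTA adapters to fuse RGB and event features bidirectionally for efficient, robust tracking.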

Comments Accepted by IEEE Transactions on Image Processing, DOI: 10.1109/TIP.2026.3694138, 15 pages, 8 figures

English abstract

RGB-Event tracking has become a promising trend in visual object tracking to leverage the complementary strengths of both RGB images and dynamic spike events for improved performance. However, existing artificial neural networks (ANNs) struggle to fully exploit the sparse and asynchronous nature of event streams. Recent efforts toward hybrid architectures combining ANNs and spiking neural networks (SNNs) have emerged as a promising solution in RGB-Event perception, yet effectively fusing features across heterogeneous paradigms remains a challenge. In this work, we propose ISTASTrack, the first transformer-based ANN-SNN hybrid tracker equipped with ISTA adapters for RGB-Event tracking. The two-branch model employs a vision transformer to extract spatial context from RGB inputs and a spiking transformer to capture spatio-temporal dynamics from event streams. To bridge the modality and paradigm gap between ANN and SNN features, we systematically design a model-based ISTA adapter for bidirectional feature interaction between the two branches, derived from sparse representation theory by unfolding the iterative shrinkage thresholding algorithm. Additionally, we incorporate a temporal downsampling attention module within the adapter to align multi-step SNN features with single-step ANN features in the latent space, improving temporal fusion. Experimental results on RGB-Event tracking benchmarks, such as FE240hz, VisEvent, COESOT, and FELT, demonstrate that ISTASTrack achieves state-of-the-art performance while maintaining high energy efficiency, highlighting the effectiveness and practicality of hybrid ANN-SNN designs for robust visual tracking. The code is publicly available at https://github.com/lsying009/ISTASTrack.git.
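
For readers unfamiliar with the iterative shrinkage thresholding algorithm that the adapter unfolds, here is a minimal sketch of classical ISTA for the lasso problem in Python with NumPy. It shows the textbook algorithm only, not the ISTASTrack adapter; the dictionary `a`, the signal `y`, and the weight `lam` are assumed toy values.

```python
import numpy as np

def soft_threshold(v, tau):
    """Shrink each entry toward zero by tau, zeroing anything smaller than tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(a, y, lam, n_iters=100):
    """Minimize 0.5 * ||a @ x - y||^2 + lam * ||x||_1 with plain ISTA."""
    step = 1.0 / np.linalg.norm(a, 2) ** 2   # 1 / Lipschitz constant of the gradient
    x = np.zeros(a.shape[1])
    for _ in range(n_iters):
        grad = a.T @ (a @ x - y)             # gradient of the quadratic data term
        x = soft_threshold(x - step * grad, step * lam)
    return x

# With an identity dictionary the solution is simply the soft-thresholded signal.
x_hat = ista(np.eye(3), np.array([3.0, 0.5, -2.0]), lam=1.0)
print(x_hat)
```

Unfolding, in general, treats each loop iteration as one network layer and makes quantities such as the step size or threshold learnable; how the paper parameterizes its adapter is not specified in the abstract.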

2508.18444 2026-05-27 cs.CL cs.AI

How Reliable are LLMs for Reasoning on the Re-ranking task?

Nafis Tanveer Islam, Zhiming Zhao

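AI summary: Analyzes how different training methods affect the semantic understanding of LLMs on the re-ranking task, and investigates whether the models can generate more informed textual reasoning to address limited transparency and limited training data.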

Comments This chapter has been published in Advancements in AI From Foundations to Cross-Disciplinary Applications, Springer, 2026

English abstract

As the semantic understanding of Large Language Models (LLMs) improves, they exhibit greater awareness of and alignment with human values, but this comes at the cost of transparency. Although experimental analysis yields promising results, an in-depth understanding of an LLM's internal workings is necessary to comprehend the reasoning behind re-ranking and to give end users an explanation that enables them to make informed decisions. Moreover, in newly developed systems with limited user engagement and insufficient ranking data, accurately re-ranking content remains a significant challenge. While various training methods affect how LLMs are trained and how they generate inferences, our analysis finds that some training methods exhibit better explainability than others. This implies that not all training methods lead to an accurate semantic understanding; instead, some acquire abstract knowledge that optimizes the evaluation, which raises questions about the true reliability of LLMs. Therefore, in this work, we analyze how different training methods affect the semantic understanding of the re-ranking task in LLMs and investigate whether these models can generate more informed textual reasoning to overcome the challenges of LLM transparency and limited training data. To analyze LLMs on re-ranking tasks, we use a relatively small ranking dataset from the environmental and Earth science domain to re-rank retrieved content. Furthermore, we analyze the explainable information to see whether the re-ranking can be reasoned about using explainability.

2504.08593 2026-05-27 cs.CV cs.AI

Hands-On: Segmenting Individual Signs from Continuous Sequences

JianHe Low, Harry Walsh, Ozge Mercanoglu Sincan, Richard Bowden

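AI summary: Addresses continuous sign language segmentation with a transformer-based architecture that uses HaMeR hand features and 3D angles and models temporal dynamics with a BIO tagging scheme, achieving state-of-the-art results on the DGS Corpus.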

Comments Accepted in the 19th IEEE International Conference on Automatic Face and Gesture Recognition. Code Implementation Released

Journal ref
IEEE 19th International Conference on Automatic Face and Gesture Recognition. (2025) 1-5
English abstract

This work tackles the challenge of continuous sign language segmentation, a key task with major implications for sign language translation and data annotation. We propose a transformer-based architecture that models the temporal dynamics of signing and frames segmentation as a sequence labeling problem using the Begin-In-Out (BIO) tagging scheme. Our method leverages HaMeR hand features complemented with 3D angles. Extensive experiments show that our model achieves state-of-the-art results on the DGS Corpus, while our features surpass prior benchmarks on BSLCorpus.
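
Framing segmentation as sequence labeling means each video frame receives one of three tags. The sketch below shows one way to encode sign spans as BIO tags and decode them back; the helper names and the end-exclusive span convention are assumptions for illustration, not taken from the released code.

```python
def segments_to_bio(num_frames, segments):
    """Encode (start, end) frame spans, end exclusive, as per-frame B/I/O tags."""
    tags = ["O"] * num_frames
    for start, end in segments:
        tags[start] = "B"
        for frame in range(start + 1, end):
            tags[frame] = "I"
    return tags

def bio_to_segments(tags):
    """Decode per-frame tags back into (start, end) spans, end exclusive."""
    segments, start = [], None
    for frame, tag in enumerate(tags):
        if tag == "B" or (tag == "I" and start is None):
            if start is not None:
                segments.append((start, frame))   # a new B closes the previous sign
            start = frame
        elif tag == "O" and start is not None:
            segments.append((start, frame))
            start = None
    if start is not None:
        segments.append((start, len(tags)))
    return segments

tags = segments_to_bio(8, [(1, 3), (3, 6)])
print(tags)                   # ['O', 'B', 'I', 'B', 'I', 'I', 'O', 'O']
print(bio_to_segments(tags))  # [(1, 3), (3, 6)]
```

The explicit B tag is what lets two back-to-back signs (frames 1 to 2 and 3 to 5 here) be separated even with no O frame between them, which a plain in/out labeling could not do.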

2508.07996 2026-05-27 cs.CV

Structured Relational Reasoning for Group Activity Assessment

Thinesh Thiyakesan Ponbagavathi, Chengzheng Yang, Alina Roitberg

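AI summary: Proposes ProGraD, a framework that combines frozen vision foundation models with a lightweight GroupContext Transformer and uses structured relational reasoning to jointly infer group locations, memberships, and activities in a single forward pass; with only 10M trainable parameters it achieves state-of-the-art results on the Cafe and Social-CAD benchmarks.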

Comments Accepted to CVPR 2026 Workshop (SAUAFG)

English abstract

Group Activity Detection (GAD) involves recognizing social groups and their collective behaviors in videos. Vision Foundation Models (VFMs), like DINOv2, offer excellent features but are pretrained on object-centric data. We find that naively substituting them into existing GAD pipelines actually degrades performance, exposing structured group-aware decoding as the true bottleneck. We introduce ProGraD, a structured relational-reasoning framework for GAD built on top of frozen VFMs. At its core is a lightweight two-layer GroupContext Transformer that explicitly models actor-group associations and aggregates global context to infer collective behavior. Learnable group prompts serve as a minimal conditioning mechanism to guide the frozen backbone toward socially relevant representations, while the relational decoder performs the core reasoning over actors and groups. This design jointly infers group locations, memberships, and activities in a single pass using only 10M trainable parameters, less than half the number used by prior methods. On the Cafe benchmark with multiple concurrent social groups, ProGraD improves the state of the art by 6.5% Group mAP@1.0 and 8.2% Group mAP@0.5. On Social-CAD, it achieves state-of-the-art social and membership accuracy. ProGraD further produces interpretable attention maps that provide insights into actor-group reasoning.

2506.00250 2026-05-27 cs.CL cs.IT math.IT

PersianMedQA: Evaluating Large Language Models on a Persian-English Bilingual Medical Question Answering Benchmark

Mohammad Javad Ranjbar Kalahroodi, Amirhossein Sheikholselami, Sepehr Karimi, Sepideh Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery

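AI summary: Introduces PersianMedQA, a dataset of 20,785 Persian multiple-choice medical questions, to evaluate the bilingual medical reasoning of 41 LLMs in zero-shot and chain-of-thought settings; closed-weight general-purpose models perform best, Persian LLMs perform poorly, and translation loses cultural and clinical cues.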

Comments Accepted at LREC 2026 (The Fifteenth Language Resources and Evaluation Conference), Palma, Mallorca, Spain, May 2026

English abstract

Large Language Models (LLMs) have achieved remarkable performance on a wide range of Natural Language Processing (NLP) benchmarks, often surpassing human-level accuracy. However, their reliability in high-stakes domains such as medicine, particularly in low-resource languages, remains underexplored. In this work, we introduce PersianMedQA, a large-scale dataset of 20,785 expert-validated multiple-choice Persian medical questions from 14 years of Iranian national medical exams, spanning 23 medical specialties and designed to evaluate LLMs in both Persian and English. We benchmark 41 state-of-the-art models, including general-purpose, Persian, and medical LLMs, in zero-shot and chain-of-thought (CoT) settings. Our results show that closed-weight general models (e.g., GPT-4.1) consistently outperform all other categories, achieving 83.09% accuracy in Persian and 80.7% in English, while Persian LLMs such as Dorna underperform significantly (e.g., 34.9% in Persian), often struggling with both instruction-following and domain reasoning. We also analyze the impact of translation, showing that while English performance is generally higher, 3-10% of questions can only be answered correctly in Persian due to cultural and clinical contextual cues that are lost in translation. Finally, we demonstrate that model size alone is insufficient for robust performance without strong domain or language adaptation. PersianMedQA provides a foundation for evaluating bilingual and culturally grounded medical reasoning in LLMs. The dataset, along with a bilingual medical dictionary, is available: https://huggingface.co/datasets/MohammadJRanjbar/PersianMedQA .

2508.05305 2026-05-27 cs.CL

SONAR-LLM: Autoregressive Transformer that Thinks in Sentence Embeddings and Speaks in Tokens

Nikita Dragunov, Temurbek Rahmatullaev, Elizaveta Goncharova, Nikita Kurdiukov, Aysel Mirzoeva, Anna Borisiuk, Andrey Kuznetsov, Anton Razzhigaev

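AI summary: Proposes SONAR-LLM, a decoder-only transformer that "thinks" in the continuous SONAR embedding space but is supervised with token-level cross-entropy, combining semantic abstraction with a likelihood-based training signal and achieving competitive generation quality at 39M to 1.3B parameters.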

English abstract

The recently proposed Large Concept Model (LCM) generates text by predicting a sequence of sentence-level embeddings and training with either mean-squared error or diffusion objectives. We present SONAR-LLM, a decoder-only transformer that "thinks" in the same continuous SONAR embedding space, yet is supervised through token-level cross-entropy propagated via the frozen SONAR decoder. This hybrid objective retains the semantic abstraction of LCM while eliminating its diffusion sampler and restoring a likelihood-based training signal. Across model sizes from 39M to 1.3B parameters, SONAR-LLM attains competitive generation quality. We report scaling trends, ablations, benchmark results, and release the complete training code and all pretrained checkpoints to foster reproducibility and future research.

2508.00748 2026-05-27 cs.CV cs.AI cs.CR cs.MM

Is It Really You? Exploring Biometric Verification Scenarios in Photorealistic Talking-Head Avatar Videos

Laura Pedrouzo-Rodriguez, Pedro Delgado-DeRobles, Luis F. Gomez, Ruben Tolosana, Ruben Vera-Rodriguez, Aythami Morales, Julian Fierrez

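AI summary: Studies identity verification in photorealistic talking-head avatar videos using facial motion patterns as a behavioral biometric, and proposes a lightweight graph convolutional network model that reaches an AUC approaching 80%.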

Comments Accepted at the IEEE International Joint Conference on Biometrics (IJCB 2025)

Journal ref
2025 IEEE International Joint Conference on Biometrics (IJCB)
English abstract

Photorealistic talking-head avatars are becoming increasingly common in virtual meetings, gaming, and social platforms. These avatars allow for more immersive communication, but they also introduce serious security risks. One emerging threat is impersonation: an attacker can steal a user's avatar, preserving their appearance and voice, making it nearly impossible to detect fraudulent use by sight or sound alone. In this paper, we explore the challenge of biometric verification in such avatar-mediated scenarios. Our main question is whether an individual's facial motion patterns can serve as reliable behavioral biometrics to verify their identity when the avatar's visual appearance is a facsimile of its owner. To answer this question, we introduce a new dataset of realistic avatar videos created using a state-of-the-art one-shot avatar generation model, GAGAvatar, with genuine and impostor avatar videos. We also propose a lightweight, explainable spatio-temporal Graph Convolutional Network architecture with temporal attention pooling that uses only facial landmarks to model dynamic facial gestures. Experimental results demonstrate that facial motion cues enable meaningful identity verification with AUC values approaching 80%. The proposed benchmark and biometric system are available to the research community in order to bring attention to the urgent need for more advanced behavioral biometric defenses in avatar-based communication systems.
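
To make the reported AUC concrete: for a verification system, the area under the ROC curve equals the probability that a randomly chosen genuine comparison scores higher than a randomly chosen impostor comparison. The sketch below computes it directly from that definition; the score lists are made-up toy values, not results from the paper.

```python
def roc_auc(genuine_scores, impostor_scores):
    """Probability that a random genuine score exceeds a random impostor score (ties count half)."""
    wins = 0.0
    for g in genuine_scores:
        for i in impostor_scores:
            if g > i:
                wins += 1.0
            elif g == i:
                wins += 0.5
    return wins / (len(genuine_scores) * len(impostor_scores))

genuine = [0.9, 0.7, 0.6, 0.4]    # hypothetical similarity scores, same identity
impostor = [0.8, 0.5, 0.3, 0.2]   # hypothetical similarity scores, different identity
print(roc_auc(genuine, impostor))  # 0.75
```

An AUC of 0.5 corresponds to chance and 1.0 to perfect separation, so a value near 0.8 indicates a useful but far from conclusive signal.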

2508.01253 2026-05-27 cs.CV

ODOV: Benchmark the Open-Domain Open-Vocabulary Object Detection

Yupeng Zhang, Ruize Han, Fangnan Zhou, Wei Feng, Liang Wan

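AI summary: Targets real-world settings where domain shift and category shift occur together: defines the open-domain open-vocabulary object detection task, builds the OD-LVIS benchmark, and designs a VLM-based baseline that improves detection with domain-agnostic category prompts and a domain projection and grafting module.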

English abstract

Existing studies typically investigate domain shift and category shift as independent problems. In real-world scenarios, however, the two types of shift often occur simultaneously and interact, leading to significant degradation in detection performance. To address this, we propose and systematically study a novel problem, Open-Domain Open-Vocabulary (ODOV) object detection, which aims to evaluate a model's ability to adapt to compound domain and category shifts in real-world environments. We construct a new benchmark, OD-LVIS, which contains 46,949 images spanning 15 diverse real-world scenarios and 1,203 categories, for assessing object detection performance. Furthermore, we propose a novel ODOV detection baseline that fully leverages the powerful multi-modal alignment capabilities of VLMs and introduces two key mechanisms to enhance both category and domain generalization. One is the Domain-Agnostic Category Prompt (DAPmt), which strengthens category semantics while attenuating domain representations, enabling pure category representation. The other is the Domain Projection and Grafting (DP&G) module, which incorporates domain-specific features from input images, allowing the model to dynamically generalize across diverse open domains. These two components enable the model to maintain effective detection performance under simultaneous category and domain variations in real-world scenarios. We provide extensive benchmark evaluations for the proposed ODOV detection task and report experimental results. These results validate the soundness of the ODOV task, the practicality of the OD-LVIS dataset, and the superiority of the method.

2507.13762 2026-05-27 cs.LG q-bio.BM

MolPIF: A Parameter Interpolation Flow Model for Molecule Generation

Yaowei Jin, Junjie Wang, Yufan Tang, Wenkai Xiang, Duanhua Cao, Dan Teng, Zhehuan Fan, Jiacheng Xiong, Xia Sheng, Chuanlong Zeng, Duo An, Mingyue Zheng, Shuangjia Zheng, Qian Shi

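AI summary: Proposes MolPIF, a parameter interpolation flow model that unifies the generation of continuous coordinates and discrete atom types by interpolating distributions in parameter space, and outperforms baselines on the CrossDocked2020 dataset.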

Comments Accepted to Bioinformatics

English abstract

Motivation: Structure-based drug design (SBDD) has advanced with deep generative models, but bridging the gap between continuous atomic coordinates and discrete atom types remains a challenge. Current approaches, such as diffusion and flow matching models, often fail to unify these heterogeneous modalities, relying on separate strategies or ill-fitting Euclidean metrics for discrete variables. This lack of a consistent framework limits generative models' ability to capture the geometric and chemical structure of protein-ligand complexes. Results: We present MolPIF, a parameter interpolation flow mechanism designed to unify the generation of continuous and discrete molecular variables. Unlike traditional flow models that operate in sample space, MolPIF interpolates between distributions in the parameter space, theoretically recovering Wasserstein-2 optimal transport for continuous coordinates and establishing Fisher-Rao geodesics for discrete atom types. We further incorporate a geometry-enhanced learning strategy to improve the capture of atomic contexts. Extensive evaluations on the CrossDocked2020 dataset demonstrate that MolPIF outperforms baselines in binding affinity, chemical validity, geometric fidelity and chemical space coverage. Additionally, MolPIF exhibits versatility in lead optimization and offers flexible prior distribution selection (such as Laplace), establishing a robust paradigm for SBDD. Availability: Source code is freely available at https://github.com/BLEACH366/MolPIF. Supplementary information: Supplementary data are available at Bioinformatics.
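
The two geodesics named in the abstract have well-known closed forms in simple cases, which the sketch below illustrates. It is a textbook illustration rather than MolPIF's implementation: the function names are hypothetical, the Gaussian case is restricted to one dimension, and whether the paper uses exactly these parameterizations is not stated in the abstract.

```python
import math

def fisher_rao_geodesic(p0, p1, t):
    """Point at time t on the Fisher-Rao geodesic between two categorical distributions."""
    # Square roots of probabilities lie on the unit sphere; the geodesic is a great-circle arc.
    dot = sum(math.sqrt(a * b) for a, b in zip(p0, p1))
    theta = math.acos(min(1.0, max(-1.0, dot)))
    if theta < 1e-12:
        return list(p0)                      # identical distributions: nothing to interpolate
    w0 = math.sin((1.0 - t) * theta) / math.sin(theta)
    w1 = math.sin(t * theta) / math.sin(theta)
    return [(w0 * math.sqrt(a) + w1 * math.sqrt(b)) ** 2 for a, b in zip(p0, p1)]

def w2_gaussian_geodesic(mu0, sigma0, mu1, sigma1, t):
    """Wasserstein-2 geodesic between 1D Gaussians: mean and standard deviation interpolate linearly."""
    return (1.0 - t) * mu0 + t * mu1, (1.0 - t) * sigma0 + t * sigma1

p0 = [0.7, 0.2, 0.1]   # hypothetical atom-type distribution at t = 0
p1 = [0.1, 0.3, 0.6]   # hypothetical atom-type distribution at t = 1
midpoint = fisher_rao_geodesic(p0, p1, 0.5)
print(midpoint, sum(midpoint))
print(w2_gaussian_geodesic(0.0, 1.0, 2.0, 3.0, 0.5))
```

Every point on the Fisher-Rao path remains a valid probability vector, which is one reason such a path is a more natural fit for discrete atom types than straight-line interpolation under a Euclidean metric.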

2507.20758 2026-05-27 cs.AI

How Chain-of-Thought Works? Tracing Information Flow from Decoding, Projection, and Activation

Hao Yang, Qinghua Zhao, Lei Li, Lingyi Meng, Mengda Yu

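AI summary: Traces information flow backward through the decoding, projection, and activation stages to show that chain-of-thought acts as a decoding-space pruner and modulates neuron activation in a task-dependent way.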

Comments Accepted by ACL 2026

英文摘要

Chain-of-Thought (CoT) prompting significantly enhances model reasoning, yet its internal mechanisms remain poorly understood. We analyze CoT's operational principles by reversely tracing information flow across decoding, projection, and activation phases. Our quantitative analysis suggests that CoT may serve as a decoding space pruner, leveraging answer templates to guide output generation, with higher template adherence strongly correlating with improved performance. Furthermore, we surprisingly find that CoT modulates neuron engagement in a task-dependent manner: reducing neuron activation in open-domain tasks, yet increasing it in closed-domain scenarios. These findings offer a novel mechanistic interpretability framework and critical insights for enabling targeted CoT interventions to design more efficient and robust prompts. We released our code and data at https://anonymous.4open.science/r/cot-D247.

2507.16116 2026-05-27 cs.CV

Pusa V1.0: Unlocking Temporal Control in Pretrained Video Diffusion Models via Vectorized Timestep Adaptation

Pusa V1.0: 通过向量化时间步长适应解锁预训练视频扩散模型中的时间控制

Yaofang Liu, Yumeng Ren, Aitor Artola, Yuxuan Hu, Xiaodong Cun, Xiaotong Zhao, Alan Zhao, Raymond H. Chan, Suiyun Zhang, Rui Liu, Dandan Tu, Jean-Michel Morel

AI总结 提出向量化时间步长适应(VTA)方法,在统一视频扩散框架中实现细粒度时间控制,零样本完成图像到视频生成、起止帧控制等任务,且不破坏基础模型能力。

Comments Code is open-sourced at https://github.com/Yaofang-Liu/Pusa-VidGen

详情
AI中文摘要

视频扩散模型的快速发展受到时间建模基本限制的阻碍,特别是传统标量时间步长变量导致的帧演化刚性同步。尽管任务特定适应和自回归模型试图解决这些挑战,但它们仍受限于计算效率低下、灾难性遗忘或适用性狭窄。在这项工作中,我们提出了Pusa V1.0,一个利用向量化时间步长适应(VTA)在统一视频扩散框架中实现细粒度时间控制的通用模型。注意,VTA是一种非破坏性适应,意味着它完全保留了基础模型的能力。与Wan-I2V等传统方法(通过大量资源微调基础文本到视频(T2V)模型以进行图像到视频(I2V))不同,我们在基于VTA的超高效微调过程后以零样本方式实现了可比结果。此外,该方法还同时解锁了许多其他零样本能力,例如起止帧和视频扩展——所有这些都不需要任务特定训练。同时,它保留了基础模型的T2V能力。机制分析还表明,我们的方法保留了基础模型的生成先验,同时精确注入时间动态,避免了向量化时间步长固有的组合爆炸。这项工作为下一代视频合成建立了一个可扩展、高效且通用的范式,使高保真视频生成在研究和工业领域得以普及。

英文摘要

The rapid advancement of video diffusion models has been hindered by fundamental limitations in temporal modeling, particularly the rigid synchronization of frame evolution imposed by conventional scalar timestep variables. While task-specific adaptations and autoregressive models have sought to address these challenges, they remain constrained by computational inefficiency, catastrophic forgetting, or narrow applicability. In this work, we present Pusa V1.0, a versatile model that leverages vectorized timestep adaptation (VTA) to enable fine-grained temporal control within a unified video diffusion framework. Note that VTA is a non-destructive adaptation, which means that it fully preserves the capabilities of the base model. Unlike conventional methods like Wan-I2V, which finetune a base text-to-video (T2V) model with abundant resources to do image-to-video (I2V), we achieve comparable results in a zero-shot manner after an ultra-efficient finetuning process based on VTA. Moreover, this method also unlocks many other zero-shot capabilities simultaneously, such as start-end frames and video extension -- all without task-specific training. Meanwhile, it keeps the T2V capability from the base model. Mechanistic analyses also reveal that our approach preserves the foundation model's generative priors while surgically injecting temporal dynamics, avoiding the combinatorial explosion inherent to the vectorized timestep. This work establishes a scalable, efficient, and versatile paradigm for next-generation video synthesis, democratizing high-fidelity video generation for research and industry alike.

2507.06513 2026-05-27 cs.CV

What Demands Attention in Urban Street Scenes? From Scene Understanding towards Road Safety: A Survey of Vision-driven Datasets and Studies

城市街景中什么需要关注?从场景理解到道路安全:视觉驱动数据集与研究的综述

Yaoqi Huang, Julie Stephany Berrio, Mao Shan, Stewart Worrall

AI总结 本文通过系统分类交通场景中需要关注的关键元素,全面分析35个视觉驱动任务和73个数据集,提出统一分析框架,旨在促进道路安全研究。

Comments 40 tasks, 78 datasets

详情
AI中文摘要

基于视觉的传感器和计算机视觉算法的进步显著提升了对交通场景的分析与理解。为促进这些进步在道路安全中的应用,本综述系统分类了交通场景中需要关注的关键元素,并全面分析了现有的视觉驱动任务和数据集。与现有聚焦于孤立领域的综述相比,我们的分类法将值得关注的交通实体分为两大类:异常实体和正常但关键的实体,整合了十个类别和二十个子类。它建立了内在相关领域之间的联系,并提供了统一的分析框架。我们的综述重点分析了35个视觉驱动任务,并基于提出的分类法对73个可用数据集进行了全面检查和可视化。跨领域调查涵盖了每个基准的优缺点,旨在提供标准统一和资源优化的信息。文章最后系统讨论了现有弱点,从不同角度强调了潜在影响和有前景的解决方案。集成的分类法、全面分析和总结性表格为这一快速发展的领域提供了宝贵贡献,为研究人员提供了整体概览,指导战略性资源选择,并突出了关键研究空白。

英文摘要

Advances in vision-based sensors and computer vision algorithms have significantly improved the analysis and understanding of traffic scenarios. To facilitate the use of these improvements for road safety, this survey systematically categorizes the critical elements that demand attention in traffic scenarios and comprehensively analyzes available vision-driven tasks and datasets. Compared to existing surveys that focus on isolated domains, our taxonomy categorizes attention-worthy traffic entities into two main groups, anomalies and normal but critical entities, integrating ten categories and twenty subclasses. It establishes connections between inherently related fields and provides a unified analytical framework. Our survey highlights the analysis of 35 vision-driven tasks and comprehensive examinations and visualizations of 73 available datasets based on the proposed taxonomy. The cross-domain investigation covers the pros and cons of each benchmark with the aim of providing information on standards unification and resource optimization. Our article concludes with a systematic discussion of the existing weaknesses, underlining the potential effects and promising solutions from various perspectives. The integrated taxonomy, comprehensive analysis, and recapitulatory tables serve as valuable contributions to this rapidly evolving field by providing researchers with a holistic overview, guiding strategic resource selection, and highlighting critical research gaps.

2507.05757 2026-05-27 cs.CV

Normal Patch Retinex Robust Algorithm for White Balancing in Digital Microscopy

Normal Patch Retinex 稳健算法用于数字显微镜白平衡

Radoslaw Roszczyk, Artur Krupa, Izabella Antoniuk

AI总结 提出一种基于Normal Patch Retinex的全自动白平衡算法,用于校正数字显微镜彩色图像,实验证明其优于经典算法。

详情
Journal ref
Vol. 29 No. 1/4 (2020)
AI中文摘要

在光学显微镜中获取准确彩色、平衡的图像即使对于经验丰富的显微镜操作者也可能是一个挑战。本文提出了一种完全自动的白平衡机制,能够充分校正显微彩色图像。该算法的结果已在200张显微图像数据集上通过实验验证。这些图像包含病理形态学中常用的三种显微标本的扫描图。此外,将所得结果与数字摄影中其他常用的白平衡算法进行了比较。本文应用的算法对于苏木精-荧光桃红-番红染色的显微图像和免疫组织化学染色图像比彩色摄影中使用的经典算法更有效。

英文摘要

The acquisition of accurately coloured, balanced images in an optical microscope can be a challenge even for experienced microscope operators. This article presents an entirely automatic white-balancing mechanism that adequately corrects microscopic colour images. The results of the algorithm have been confirmed experimentally on a set of two hundred microscopic images. The images contained scans of three microscopic specimens commonly used in pathomorphology. The results were also compared with those of other white balance algorithms commonly used in digital photography. The algorithm applied in this work is more effective than the classical algorithms used in colour photography for microscopic images stained with hematoxylin-phloxine-saffron and for immunohistochemical staining images.

2506.23149 2026-05-27 cs.CL

AlignEvoSkill: Towards Knowledge-Aware and Task-Aligned Agent Skill Evolution

AlignEvoSkill: 迈向知识感知与任务对齐的智能体技能进化

Dingzirui Wang, Xuanliang Zhang, Keyan Xu, Qingfu Zhu, Wanxiang Che, Yang Deng

AI总结 提出AlignEvoSkill框架,通过联合建模知识覆盖和任务对齐,从失败轨迹中识别知识标签、检索并适配候选技能,再基于知识覆盖和任务对齐分数筛选高质量技能,在3个基准和4个LLM骨干上相对提升34.7%,实现技能进化新SOTA且成本更低。

详情
AI中文摘要

可重用技能在提升基于LLM的智能体中扮演关键角色,但现有技能进化方法往往无法确保进化后的技能既覆盖任务所需的知识,又与目标任务保持对齐。结果,进化后的技能可能不完整或无关。为解决这一局限,我们提出AlignEvoSkill,一个联合建模知识覆盖和任务对齐的技能进化框架。给定失败的任务轨迹,AlignEvoSkill首先识别与任务相关的知识标签,检索互补的先前技能,并将它们适配为弥补缺失知识的候选技能。然后,它使用基于知识覆盖和任务对齐分数的联合过滤标准选择高质量候选技能。在3个基准和4个LLM骨干上的实验表明,AlignEvoSkill相对于非进化基线实现了34.7%的相对增益,并以更低的成本实现了技能进化的新SOTA。

英文摘要

Reusable skills play a key role in improving LLM-based agents, but existing skill-evolution methods often fail to ensure that evolved skills both cover the knowledge required by the task and remain aligned with the target task. As a result, evolved skills can be incomplete or irrelevant. To address this limitation, we propose AlignEvoSkill, a skill-evolution framework that jointly models knowledge coverage and task alignment. Given failed task trajectories, AlignEvoSkill first identifies task-relevant knowledge tags, retrieves complementary prior skills, and adapts them into candidate skills that address missing knowledge. It then selects high-quality candidates using a joint filtering criterion based on knowledge-coverage and task-alignment scores. Experiments on 3 benchmarks with 4 LLM backbones show that AlignEvoSkill achieves a 34.7% relative gain over the non-evolution baseline and sets a new SOTA in skill evolution at lower cost.

2506.21443 2026-05-27 cs.CL cs.AI

Domain Knowledge-Enhanced LLMs for Fraud and Concept Drift Detection

领域知识增强的大语言模型用于欺诈和概念漂移检测

Ali Şenol, Garima Agrawal, Huan Liu

AI总结 提出一种领域知识增强的大语言模型框架,通过集成结构化领域知识和漂移检测单元,实现高准确率的欺诈对话检测和概念漂移分类。

详情
AI中文摘要

在动态平台上检测欺骗性对话变得越来越困难,原因是语言模式的演变和概念漂移(CD)——即随着时间推移,语义或主题的转变会改变交互的上下文或意图。这些转变可能掩盖恶意意图或模仿正常对话,使得准确分类具有挑战性。尽管大语言模型(LLMs)在自然语言任务中表现出色,但在风险敏感场景中,它们常常面临上下文模糊和幻觉问题。为了解决这些挑战,我们提出了一个领域知识(DK)增强的LLM框架,该框架将预训练的LLM与结构化的、任务特定的见解相结合,以执行欺诈和概念漂移检测。所提出的架构由三个主要组件组成:(1)一个DK-LLM模块,用于检测虚假或欺骗性对话;(2)一个漂移检测单元(OCDD),用于判断是否发生了语义转变;(3)第二个DK-LLM模块,用于将漂移分类为良性或欺诈性。我们首先使用虚假评论数据集验证领域知识的价值,然后将我们的完整框架应用于SEConvo,一个包含多种欺诈和垃圾攻击的多轮对话数据集。结果表明,我们的系统能够高精度地检测虚假对话,并有效分类漂移的性质。在结构化提示的引导下,基于LLaMA的实现达到了98%的分类准确率。与零样本基线的对比研究表明,在高风险NLP应用中,融入领域知识和漂移意识显著提高了性能、可解释性和鲁棒性。

英文摘要

Detecting deceptive conversations on dynamic platforms is increasingly difficult due to evolving language patterns and Concept Drift (CD), i.e., semantic or topical shifts that alter the context or intent of interactions over time. These shifts can obscure malicious intent or mimic normal dialogue, making accurate classification challenging. While Large Language Models (LLMs) show strong performance in natural language tasks, they often struggle with contextual ambiguity and hallucinations in risk-sensitive scenarios. To address these challenges, we present a Domain Knowledge (DK)-Enhanced LLM framework that integrates pretrained LLMs with structured, task-specific insights to perform fraud and concept drift detection. The proposed architecture consists of three main components: (1) a DK-LLM module to detect fake or deceptive conversations; (2) a drift detection unit (OCDD) to determine whether a semantic shift has occurred; and (3) a second DK-LLM module to classify the drift as either benign or fraudulent. We first validate the value of domain knowledge using a fake review dataset and then apply our full framework to SEConvo, a multi-turn dialogue dataset that includes various types of fraud and spam attacks. Results show that our system detects fake conversations with high accuracy and effectively classifies the nature of drift. Guided by structured prompts, the LLaMA-based implementation achieves 98% classification accuracy. Comparative studies against zero-shot baselines demonstrate that incorporating domain knowledge and drift awareness significantly improves performance, interpretability, and robustness in high-stakes NLP applications.

2506.17633 2026-05-27 cs.CV cs.AI

Adaptive Multi-prompt Contrastive Network for Few-shot Out-of-distribution Detection

自适应多提示对比网络用于少样本分布外检测

Xiang Fang, Arvind Easwaran, Blaise Genest

AI总结 针对少样本分布外检测问题,提出自适应多提示对比网络(AMCN),通过CLIP学习可学习文本提示和类间/类内分布,实现ID-OOD分离边界自适应。

Comments Published in ICML 2025

详情
AI中文摘要

分布外(OOD)检测旨在区分异常样本,以防止在分布内(ID)数据集上训练的模型产生不可用的输出。大多数OOD检测方法需要大量IID样本进行训练,这严重限制了它们的实际应用。为此,我们针对一个具有挑战性的场景:少样本OOD检测,其中只有少量标记的ID样本可用。因此,少样本OOD检测比传统的OOD检测设置更具挑战性。先前的少样本OOD检测工作忽略了不同类别之间的显著多样性。在本文中,我们提出了一种新颖的网络:自适应多提示对比网络(AMCN),它通过学习类间和类内分布来适应ID-OOD分离边界。为了弥补OOD的缺失和ID图像样本的稀缺,我们利用CLIP连接文本与图像,设计可学习的ID和OOD文本提示。具体来说,我们首先生成自适应提示(可学习ID提示、标签固定OOD提示和标签自适应OOD提示)。然后,我们通过引入类级阈值为每个类生成自适应类边界。最后,我们提出一个提示引导的ID-OOD分离模块来控制ID和OOD提示之间的间隔。实验结果表明,AMCN优于其他最先进的工作。

英文摘要

Out-of-distribution (OOD) detection attempts to distinguish outlier samples to prevent models trained on the in-distribution (ID) dataset from producing unavailable outputs. Most OOD detection methods require many IID samples for training, which seriously limits their real-world applications. To this end, we target a challenging setting: few-shot OOD detection, where only a few labeled ID samples are available. Therefore, few-shot OOD detection is much more challenging than the traditional OOD detection setting. Previous few-shot OOD detection works ignore the distinct diversity between different classes. In this paper, we propose a novel network: Adaptive Multi-prompt Contrastive Network (AMCN), which adapts the ID-OOD separation boundary by learning inter- and intra-class distribution. To compensate for the absence of OOD and scarcity of ID image samples, we leverage CLIP, connecting text with images, engineering learnable ID and OOD textual prompts. Specifically, we first generate adaptive prompts (learnable ID prompts, label-fixed OOD prompts and label-adaptive OOD prompts). Then, we generate an adaptive class boundary for each class by introducing a class-wise threshold. Finally, we propose a prompt-guided ID-OOD separation module to control the margin between ID and OOD prompts. Experimental results show that AMCN outperforms other state-of-the-art works.

2506.11253 2026-05-27 cs.CV cs.LG

Lifting Data-Tracing Machine Unlearning to Knowledge-Tracing for Foundation Models

将数据追踪的机器遗忘提升为基础模型的知识追踪

Yuwen Tan, Boqing Gong

AI总结 本文提出将数据追踪的机器遗忘提升为基础模型的知识追踪,以应对多样化遗忘请求,并更接近人类遗忘机制,通过视觉语言模型案例展示实现范式。

Comments Accepted to TMLR

详情
AI中文摘要

机器遗忘从AI模型中移除特定训练数据点及其影响(例如,当数据所有者撤销其同意允许模型从数据中学习时)。在这篇立场论文中,我们提出将数据追踪的机器遗忘提升为基础模型(FMs)的知识追踪。我们基于实际需求和认知研究的见解支持这一立场。实际上,追踪数据无法满足对FMs的多样化遗忘请求,这些请求可能来自监管机构、企业用户、产品团队等,他们无法访问FMs的大量训练数据。相反,这些相关方可以方便地提出关于FMs(不应)拥有的知识或能力的遗忘请求。认知上,知识追踪遗忘比追踪单个训练数据点更接近人脑的遗忘方式。我们进一步讨论了知识追踪机器遗忘范式中的重大挑战。最后,我们提供了一个关于视觉语言FMs的具体案例研究,以说明遗忘者如何实例化知识追踪机器遗忘范式。代码可在:https://1yuwen.github.io/Knowledge-Tracing-MU-Page 获取。

英文摘要

Machine unlearning removes certain training data points and their influence from AI models (e.g., when a data owner revokes their consent to allow models to learn from the data). In this position paper, we propose to lift data-tracing machine unlearning to knowledge-tracing for foundation models (FMs). We support this position based on practical needs and insights from cognitive studies. Practically, tracing data cannot meet the diverse unlearning requests for FMs, which may be from regulators, enterprise users, product teams, etc., who have no access to FMs' massive training data. Instead, it is convenient for these parties to issue an unlearning request about the knowledge or capability FMs (should not) possess. Cognitively, knowledge-tracing unlearning aligns with how the human brain forgets more closely than tracing individual training data points does. We further discuss the nontrivial challenges in the knowledge-tracing machine unlearning paradigm. Finally, we provide a concrete case study about a vision-language FM to illustrate how an unlearner might instantiate the knowledge-tracing machine unlearning paradigm. Code is available at: https://1yuwen.github.io/Knowledge-Tracing-MU-Page.

2506.10225 2026-05-27 cs.SD cs.AI eess.AS

Genre Controlled Music Generation via Activation Steering

通过激活引导实现体裁控制的音乐生成

Swathi Narashiman, Pranay Mathur, Dipanshu Panda, Jayden Koshy Joe, Harshith M R, Anish Veerakumar, Aniruddh Krishna, Keerthiharan A

AI总结 提出一种在推理时对自回归生成模型MusicGen进行干预的方法,利用线性探针权重引导残差流,实现细粒度的体裁控制。

详情
AI中文摘要

计算音乐生成正朝着非传统风格发展,需要能够精确且可控地融合不同音乐元素的方法。在这项工作中,我们提出了一种方法,通过对自回归生成变换器MusicGen进行推理时干预来实现细粒度控制。通过我们的方法,我们利用线性探针在残差流上的权重来引导残差流,从而实现体裁控制。通过将激活引导视为一种人类可控的交互,我们的工作突出了可解释的模型行为如何在协同创作的音乐生成中发挥作用。展示我们方法的音频样本可在我们的演示页面上找到。

英文摘要

Computational Music Generation is evolving towards non-conventional styles, demanding methods that enable precise and controllable blending of diverse music elements. In this work, we present a method for fine-grained control using inference-time interventions on an autoregressive generative transformer, MusicGen. Through our approach, we achieve genre control by steering the residual stream using weights of a linear probe on it. By framing activation steering as a human-controllable interaction, our work highlights how interpretable model behaviors can empower co-creative music generation. Audio samples demonstrating our method are available on our demo page.
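
The steering step mentioned in this abstract can be illustrated with a small sketch. It assumes the simplest form of activation steering: a unit-normalized probe direction, scaled by a strength `alpha`, is added to a residual-stream vector. The function name `steer_residual`, the value of `alpha`, and the toy vectors are hypothetical and are not taken from the paper or from MusicGen's code.

```python
import math

def steer_residual(hidden, probe_weight, alpha=4.0):
    """Shift one residual-stream vector along a linear-probe direction.

    hidden:       list of floats, one activation vector (toy stand-in)
    probe_weight: list of floats, weights of a hypothetical genre probe
    alpha:        steering strength (assumed hyperparameter)
    """
    norm = math.sqrt(sum(w * w for w in probe_weight))
    if norm == 0.0:
        raise ValueError("probe_weight must be non-zero")
    direction = [w / norm for w in probe_weight]  # unit-length steering direction
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy example: a 4-dimensional "residual stream" and probe.
hidden = [0.5, -1.0, 2.0, 0.0]
probe_weight = [3.0, 0.0, 4.0, 0.0]
steered = steer_residual(hidden, probe_weight, alpha=5.0)

# Projection of the change onto the unit probe direction equals alpha.
unit = [w / 5.0 for w in probe_weight]
shift = sum((s - h) * u for s, h, u in zip(steered, hidden, unit))
```

In a real model the same addition would be applied to the activations of a chosen layer at inference time; which layer and which strength work well is an empirical question this sketch does not address.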

2506.07813 2026-05-27 cs.CV cs.AI

Self-Cascaded Diffusion Models for Arbitrary-Scale Image Super-Resolution

自级联扩散模型用于任意尺度图像超分辨率

Junseo Bang, Joonhee Lee, Kyeonghyun Lee, Haechang Lee, Dong Un Kang, Se Young Chun

AI总结 提出自级联扩散框架CasArbi,通过将任意缩放因子分解为连续小步骤,逐步提升分辨率并保持尺度一致性,在感知和失真指标上优于现有方法。

详情
AI中文摘要

任意尺度图像超分辨率旨在将图像上采样到任意期望分辨率,比传统固定尺度超分辨率提供更大灵活性。最近基于回归或生成模型的方法显示出有希望的结果,但由于其单阶段公式必须同时处理大范围的缩放因子,常常遭受尺度不一致的问题。为了解决这个问题,我们提出了CasArbi,一个用于任意尺度图像超分辨率的自级联扩散框架。CasArbi将不同的缩放因子分解为更小的顺序步骤,逐步提升图像分辨率,并在每一步实现任意尺度的无缝过渡。CasArbi利用坐标条件扩散模型学习连续图像表示,并在推理时采用自一致性指导生成尺度一致的细节。大量实验表明,CasArbi在感知和失真指标上均优于现有方法,并在各种任意尺度超分辨率基准上展现出卓越的尺度一致性。我们的代码可在https://github.com/junseo88/CasArbi获取。

英文摘要

Arbitrary-scale image super-resolution aims to upsample images to any desired resolution, offering greater flexibility than traditional fixed-scale super-resolution. Recent approaches based on regression-based or generative models have shown promising results but often suffer from scale inconsistency due to their single-stage formulation, which must handle a wide range of scaling factors simultaneously. To address this, we propose CasArbi, a self-cascaded diffusion framework for arbitrary-scale image super-resolution. CasArbi decomposes varying scaling factors into smaller sequential steps, progressively enhancing the image resolution at each step with seamless transitions for arbitrary scales. CasArbi leverages a coordinate-conditioned diffusion model for learning continuous image representations and adopts self-consistency guidance to generate scale-consistent details at inference time. Extensive experiments show that CasArbi outperforms existing methods in both perceptual and distortion metrics and demonstrates superior scale consistency across diverse arbitrary-scale super-resolution benchmarks. Our code is available at https://github.com/junseo88/CasArbi.
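
To make the idea of decomposing a scaling factor into smaller sequential steps concrete, here is a minimal sketch. The abstract does not state the actual schedule, so the equal geometric split, the function name `cascade_scales`, and the `max_step` cap of 2.0 are all assumptions for illustration only.

```python
import math

def cascade_scales(total_scale, max_step=2.0):
    """Split an upsampling factor into equal sequential steps, each <= max_step.

    The equal geometric split is an assumed schedule, not the paper's.
    """
    if total_scale < 1.0:
        raise ValueError("total_scale must be >= 1")
    if max_step <= 1.0:
        raise ValueError("max_step must be > 1")
    # Smallest number of steps whose per-step factor stays within max_step;
    # the small epsilon guards against floating-point noise at exact powers.
    n_steps = max(1, math.ceil(math.log(total_scale) / math.log(max_step) - 1e-9))
    step = total_scale ** (1.0 / n_steps)
    return [step] * n_steps

steps = cascade_scales(6.0, max_step=2.0)   # three steps whose product is about 6
two_steps = cascade_scales(4.0, max_step=2.0)
```

Each element of the returned list would be the factor applied by one pass of the super-resolution model, so the product of the list recovers the requested overall scale.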

2502.06963 2026-05-27 cs.LG cs.AI cs.DC cs.MA

Intelligent Offloading in Vehicular Edge Computing: A Comprehensive Review of Deep Reinforcement Learning Approaches and Architectures

车辆边缘计算中的智能卸载:深度强化学习方法与架构综述

Ashab Uddin, Ahmed Hamdi Sakr, Ning Zhang

AI总结 本文综述了基于深度强化学习的车辆边缘计算卸载方法,分类比较了学习范式、系统架构和优化目标,并分析了马尔可夫决策过程的应用及未来研究方向。

Comments 33 Pages, 6 Figures, 7 Tables. Machine Learning, Reinforcement Learning, Multi Agent Reinforcement Learning, Computational Offloading and Edge Computing

详情
AI中文摘要

智能交通系统(ITS)日益复杂,导致对计算卸载到边缘服务器、车辆节点和无人机等外部基础设施的兴趣显著增加。这些动态异构环境给传统卸载策略带来了挑战,促使人们探索强化学习(RL)和深度强化学习(DRL)作为自适应决策框架。本综述全面回顾了基于DRL的车辆边缘计算(VEC)卸载的最新进展。我们根据学习范式(如单智能体、多智能体)、系统架构(如集中式、分布式、分层式)和优化目标(如延迟、能量、公平性)对现有工作进行分类和比较。此外,我们分析了马尔可夫决策过程(MDP)公式的应用方式,并强调了奖励设计、协调机制和可扩展性方面的新兴趋势。最后,我们确定了开放挑战,并概述了未来研究方向,以指导下一代ITS鲁棒且智能的卸载策略的开发。

英文摘要

The increasing complexity of Intelligent Transportation Systems (ITS) has led to significant interest in computational offloading to external infrastructures such as edge servers, vehicular nodes, and UAVs. These dynamic and heterogeneous environments pose challenges for traditional offloading strategies, prompting the exploration of Reinforcement Learning (RL) and Deep Reinforcement Learning (DRL) as adaptive decision-making frameworks. This survey presents a comprehensive review of recent advances in DRL-based offloading for vehicular edge computing (VEC). We classify and compare existing works based on learning paradigms (e.g., single-agent, multi-agent), system architectures (e.g., centralized, distributed, hierarchical), and optimization objectives (e.g., latency, energy, fairness). Furthermore, we analyze how Markov Decision Process (MDP) formulations are applied and highlight emerging trends in reward design, coordination mechanisms, and scalability. Finally, we identify open challenges and outline future research directions to guide the development of robust and intelligent offloading strategies for next-generation ITS.

2506.03627 2026-05-27 cs.CL cs.AI

Robustness of Prompting: Enhancing Robustness of Large Language Models Against Prompting Attacks

提示的鲁棒性:增强大型语言模型对抗提示攻击的鲁棒性

Lin Mu, Guowei Chu, Li Ni, Lei Sang, Yiwen Zhang

AI总结 提出RoP(提示鲁棒性)策略,通过错误校正和引导两个阶段,增强LLM对输入扰动的鲁棒性,在算术、常识和逻辑推理任务上显著提升性能。

Comments Accepted by IEEE Transactions on Artificial Intelligence

详情
AI中文摘要

大型语言模型(LLM)通过有效利用提示策略在各种任务中展现了卓越的性能。然而,它们对输入扰动高度敏感,例如拼写错误或轻微字符顺序错误,这些扰动会显著损害其性能。尽管在提示技术方面取得了进展,如思维链和自动提示生成,但开发一种明确减轻此类扰动负面影响的提示策略仍然是一个开放的挑战。为弥补这一差距,我们提出了提示鲁棒性(RoP),一种旨在增强LLM鲁棒性的新型提示策略。RoP包括两个阶段:错误校正和引导。在错误校正阶段,RoP应用多种扰动方法生成对抗样本,用于生成自动纠正输入错误的提示。在引导阶段,RoP基于校正后的输入生成最优引导提示,引导模型生成更鲁棒和准确的推理。通过在算术、常识和逻辑推理任务上的全面实验,我们证明RoP显著提高了LLM对抗对抗扰动的鲁棒性。至关重要的是,与干净输入场景相比,它仅以最小的精度下降保持了模型准确性,从而将RoP确立为在实际应用中增强LLM鲁棒性的实用且有效的方法。

英文摘要

Large Language Models (LLMs) have demonstrated remarkable performance across various tasks by effectively utilizing a prompting strategy. However, they are highly sensitive to input perturbations, such as typographical errors or slight character order errors, which can significantly impair their performance. Despite advances in prompting techniques such as Chain-of-Thought and automatic prompt generation, developing a prompting strategy that explicitly mitigates the negative impact of such perturbations remains an open challenge. To bridge this gap, we propose Robustness of Prompting (RoP), a novel prompting strategy aimed at enhancing the robustness of LLMs. RoP consists of two stages: Error Correction and Guidance. In the Error Correction stage, RoP applies diverse perturbation methods to generate adversarial examples, which are used to generate prompts that correct input errors automatically. In the Guidance stage, RoP generates an optimal guidance prompt based on the corrected input, guiding the model to generate more robust and accurate inferences. Through comprehensive experiments spanning arithmetic, commonsense, and logical reasoning tasks, we demonstrate that RoP significantly improves LLMs' robustness against adversarial perturbations. Crucially, it preserves model accuracy with only minimal degradation compared to clean input scenarios, thereby establishing RoP as a practical and effective approach for enhancing LLM robustness in real-world applications.

2411.02355 2026-05-27 cs.LG cs.AI

"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization

“给我BF16,否则给我死亡”?LLM量化中的精度-性能权衡

Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh

AI总结 本文通过超过50万次评估,全面研究了FP8、INT8和INT4量化在Llama-3.1模型族上的精度-性能权衡,发现FP8无损、INT8精度损失低、INT4权重仅量化具有竞争力,并基于vLLM框架给出了不同部署场景下的最优量化格式建议。

Comments Accepted to ACL 2025

详情
AI中文摘要

量化是加速大型语言模型(LLM)推理的强大工具,但不同格式下的精度-性能权衡仍不明确。在本文中,我们进行了迄今为止最全面的实证研究,评估了FP8、INT8和INT4量化在整个Llama-3.1模型族上的学术基准和实际任务。通过超过50万次评估,我们的研究得出了几个关键发现:(1)FP8(W8A8-FP)在所有模型规模上均无损;(2)良好调优的INT8(W8A8-INT)实现了令人惊讶的低精度下降(1-3%);(3)INT4权重仅量化(W4A16-INT)比预期更具竞争力,可与8位量化相媲美。此外,我们通过流行的vLLM框架分析推理性能,研究了不同部署场景下的最优量化格式。我们的分析提供了明确的部署建议:W4A16是同步设置中最具成本效益的,而W8A8在异步连续批处理中占主导地位。对于混合工作负载,最优选择取决于具体用例。我们的发现为大规模部署量化LLM提供了实用的、数据驱动的指导——确保速度、效率和精度之间的最佳平衡。

英文摘要

Quantization is a powerful tool for accelerating large language model (LLM) inference, but the accuracy-performance trade-offs across different formats remain unclear. In this paper, we conduct the most comprehensive empirical study to date, evaluating FP8, INT8, and INT4 quantization across academic benchmarks and real-world tasks on the entire Llama-3.1 model family. Through over 500,000 evaluations, our investigation yields several key findings: (1) FP8 (W8A8-FP) is effectively lossless across all model scales, (2) well-tuned INT8 (W8A8-INT) achieves surprisingly low (1-3%) accuracy degradation, and (3) INT4 weight-only (W4A16-INT) is more competitive than expected, rivaling 8-bit quantization. Further, we investigate the optimal quantization format for different deployments by analyzing inference performance through the popular vLLM framework. Our analysis provides clear deployment recommendations: W4A16 is the most cost-efficient for synchronous setups, while W8A8 dominates in asynchronous continuous batching. For mixed workloads, the optimal choice depends on the specific use case. Our findings offer practical, data-driven guidelines for deploying quantized LLMs at scale -- ensuring the best balance between speed, efficiency, and accuracy.
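
For readers unfamiliar with the integer formats compared in this abstract, the following sketch shows why fewer bits mean larger rounding error. It uses plain symmetric per-tensor round-to-nearest quantization of weights, which is a textbook scheme chosen here for illustration; it is not the calibrated quantization pipeline evaluated in the paper, and the helper names and toy weights are hypothetical.

```python
def quantize_symmetric(weights, bits):
    """Symmetric per-tensor round-to-nearest quantization.

    Returns (integer codes, scale). Textbook scheme, assumed for illustration.
    """
    qmax = 2 ** (bits - 1) - 1                 # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]

def max_abs_error(weights, bits):
    """Largest absolute round-trip error for the given bit width."""
    codes, scale = quantize_symmetric(weights, bits)
    restored = dequantize(codes, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))

weights = [0.31, -0.72, 0.05, 0.98, -0.44, 0.12]   # toy weight tensor
err_int8 = max_abs_error(weights, bits=8)
err_int4 = max_abs_error(weights, bits=4)
```

The round-trip error is bounded by half the scale, and the INT4 scale is roughly 18 times larger than the INT8 scale for the same tensor, which is why the abstract's finding that INT4 weight-only quantization stays competitive is notable.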

2505.18728 2026-05-27 cs.LG cs.AI

Message-Passing State-Space Models: Improving Graph Learning with Modern Sequence Modeling

消息传递状态空间模型:利用现代序列建模改进图学习

Andrea Ceni, Alessio Gravina, Claudio Gallicchio, Davide Bacciu, Carola-Bibiane Schonlieb, Moshe Eliasof

AI总结 提出MP-SSM,将现代状态空间模型的核心计算嵌入消息传递神经网络,实现静态和时序图上的高效、置换等变和长程信息传播,并通过精确敏感性分析刻画深层信息流问题。

详情
AI中文摘要

状态空间模型(SSM)在序列建模中的近期成功推动了其向图学习的迁移,催生了图状态空间模型(GSSM)。然而,现有的GSSM通过将SSM模块应用于从图中提取的序列,往往损害了置换等变性、消息传递兼容性和计算效率等核心属性。本文引入了一种新视角,将现代SSM计算的关键原理直接嵌入消息传递神经网络框架,从而为静态图和时序图提供统一的方法论。我们的方法MP-SSM能够实现高效、置换等变和长程信息传播,同时保持消息传递的架构简洁性。关键的是,MP-SSM支持精确的敏感性分析,我们利用该分析从理论上刻画信息流,并评估深层网络中的梯度消失和过压缩等问题。此外,我们的设计选择允许类似现代SSM的高度优化并行实现。我们在包括节点分类、图属性预测、长程基准和时空预测在内的广泛任务上验证了MP-SSM,展示了其多功能性和强大的实证性能。

英文摘要

The recent success of State-Space Models (SSMs) in sequence modeling has motivated their adaptation to graph learning, giving rise to Graph State-Space Models (GSSMs). However, existing GSSMs operate by applying SSM modules to sequences extracted from graphs, often compromising core properties such as permutation equivariance, message-passing compatibility, and computational efficiency. In this paper, we introduce a new perspective by embedding the key principles of modern SSM computation directly into the Message-Passing Neural Network framework, resulting in a unified methodology for both static and temporal graphs. Our approach, MP-SSM, enables efficient, permutation-equivariant, and long-range information propagation while preserving the architectural simplicity of message passing. Crucially, MP-SSM enables an exact sensitivity analysis, which we use to theoretically characterize information flow and evaluate issues like vanishing gradients and over-squashing in the deep regime. Furthermore, our design choices allow for a highly optimized parallel implementation akin to modern SSMs. We validate MP-SSM across a wide range of tasks, including node classification, graph property prediction, long-range benchmarks, and spatiotemporal forecasting, demonstrating both its versatility and strong empirical performance.

2505.18603 2026-05-27 cs.AI cs.CV

Doc-CoB: Enhancing Document Understanding with Visual Chain-of-Boxes Reasoning

Doc-CoB:通过视觉链式框推理增强文档理解

Ye Mo, Kai Ye, Xianwei Mao, Zirui Shao, Gang Huang, Bo Zhang, Hangdi Xing, Kehan Chen, Huan Zhou, Zixu Yan, Jiajun Bu, Sheng Zhou

AI总结 提出Doc-CoB框架,通过粗到细的布局感知视觉推理,结合多模态大语言模型,逐步聚焦查询相关布局区域,提升文档理解性能。

详情
AI中文摘要

文档理解旨在对文档图像进行问答和信息提取,其中视觉内容信息密集,大多数查询仅依赖于少数相关布局区域。然而,现有方法要么采用一次通过策略,隐式假设所有布局同等重要,要么过度关注小区域而丢失关键布局信息。为解决这些局限性,我们引入了Doc-CoB(链式框),一个简单而有效的框架,将粗到细的布局感知视觉推理集成到多模态大语言模型中。Doc-CoB不是直接放大到小区域,而是逐步聚焦于查询相关布局,同时保留全局文档信息。具体来说,它首先选择关键布局框,然后通过视觉提示聚焦于这些框进行进一步理解。为支持这一范式,我们引入了两个推理任务:框识别和框推理,并构建了一个自动流水线,生成24.9万个带有中间视觉监督的训练样本。在七个基准测试和四种流行模型上的广泛实验表明,Doc-CoB显著提升了性能,证明了其有效性和广泛适用性。

英文摘要

Document understanding aims to perform question answering and information extraction over document images, where the visual content is highly information-dense and most queries rely on only a few relevant layout regions. However, existing methods either adopt a one-pass strategy that implicitly assumes all layouts are equally important, or focus excessively on small regions at the cost of losing critical layout information. To address these limitations, we introduce Doc-CoB (Chain-of-Boxes), a simple-yet-effective framework that integrates coarse-to-fine layout-aware visual reasoning into multimodal large language models. Instead of directly zooming into small regions, Doc-CoB progressively focuses on query-relevant layouts while preserving global document information. Specifically, it first selects key layout boxes and then focuses on them for further understanding with visual prompting. To support this paradigm, we introduce two reasoning tasks for box recognition and box reasoning, with an automatic pipeline that constructs 249k training samples with intermediate visual supervision. Extensive experiments on seven benchmarks with four popular models show that Doc-CoB significantly improves performance, demonstrating its effectiveness and wide applicability.

2505.17163 2026-05-27 cs.LG cs.AI cs.CL cs.CV

OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning

OCR-Reasoning基准:揭示MLLMs在复杂文本丰富图像推理中的真实能力

Mingxin Huang, Yongxin Shi, Dezhi Peng, Songxuan Lai, Zecheng Xie, Lianwen Jin

AI总结 提出OCR-Reasoning基准,包含1069个人工标注样本,覆盖6种核心推理能力和18个实际推理任务,通过双标注(最终答案和逐步推理过程)评估多模态大语言模型在文本丰富图像推理中的能力,发现最先进模型准确率均低于50%。

Comments ICLR 2026

详情
AI中文摘要

近期多模态慢思考系统在各种视觉推理任务中表现出色。然而,由于缺乏专门且系统的基准,它们在文本丰富图像推理任务中的能力仍未得到充分研究。为填补这一空白,我们提出了OCR-Reasoning,一个新颖的基准,旨在系统评估多模态大语言模型在文本丰富图像推理任务上的表现。具体而言,OCR-Reasoning包含1069个人工标注的示例,涵盖文本丰富视觉场景中的6种核心推理能力和18个实际推理任务。与仅提供最终答案的现有文本丰富图像理解基准不同,本基准额外提供了详细的逐步推理过程。这种双标注使得能够同时评估模型的最终答案和推理过程,从而全面评估文本丰富推理能力。利用该基准,我们对最新的多模态大语言模型进行了全面评估。结果表明,即使是最先进的多模态大语言模型在文本丰富图像推理任务中也面临巨大困难,在我们的基准上没有一个模型的准确率超过50%,这表明文本丰富图像推理的挑战是一个亟待解决的问题。基准和评估脚本可在https://github.com/SCUT-DLVCLab/OCR-Reasoning获取。

英文摘要

Recent advancements in multimodal slow-thinking systems have demonstrated remarkable performance across various visual reasoning tasks. However, their capabilities in text-rich image reasoning tasks remain understudied due to the absence of a dedicated and systematic benchmark. To address this gap, we propose OCR-Reasoning, a novel benchmark designed to systematically assess Multimodal Large Language Models on text-rich image reasoning tasks. Specifically, OCR-Reasoning comprises 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. Unlike existing text-rich image understanding benchmarks that only provide a final answer, this benchmark additionally provides a detailed step-by-step reasoning process. This dual annotation enables the evaluation of both the models' final answers and their reasoning processes, thereby offering a holistic assessment of text-rich reasoning capabilities. By leveraging this benchmark, we conducted a comprehensive evaluation of the latest MLLMs. Our results demonstrate that even the most advanced MLLMs exhibit substantial difficulties in text-rich image reasoning tasks, with none achieving an accuracy above 50% on our benchmark, indicating that the challenges of text-rich image reasoning are an urgent issue to be addressed. The benchmark and evaluation scripts are available at https://github.com/SCUT-DLVCLab/OCR-Reasoning.