arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.22011 2026-05-22 cs.CV

Rethinking Token Reduction for Diffusion Models via Output-Similarity-Awareness

Hangyeol Lee, Hyojeong Lee, Joo-Young Kim

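Affiliations * KAIST

AI summary This paper proposes DiTo, an output-centric token reduction method that uses output similarity across adjacent timesteps to establish token correspondences, reducing computational cost while improving generation quality.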

English abstract

Diffusion Transformers (DiTs) achieve superior image generation quality but suffer from quadratic computational complexity relative to token count. While various token reduction (TR) methods have been proposed to mitigate this cost, they overlook the primary objective of generative models: minimizing recovery error, which requires reflecting output token similarity. They rely solely on input token similarity inherited from reduction-only ViT paradigms, leading to a fundamental misalignment with this objective. To bridge this gap, we propose DiTo, a novel TR paradigm that shifts the focus toward output-centric token reduction. Based on the observation that output token similarity is consistently preserved across adjacent timesteps, DiTo utilizes prior-step similarities as an effective proxy to establish token correspondences at a Matching timestep, which are then reused across multiple subsequent Reduction timesteps. To optimize this interleaved scheduling, we propose Pair Match Ratio (PMR)-guided Interval Scheduling to determine the optimal matching frequency. Furthermore, to mitigate localized approximation errors and resulting blocking artifacts caused by repeated reuse, we propose Frequency-aware Token Matching by incorporating a selection-frequency penalty. Extensive experiments demonstrate that DiTo consistently outperforms existing TR methods with 1.6-3.9 dB higher PSNR at comparable speedups, achieving a superior Pareto frontier.
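
The abstract does not spell out how the selection-frequency penalty enters the matching step, so the following is only a minimal sketch under stated assumptions: a greedy matcher that subtracts a penalty proportional to how often each destination token has already been chosen. The function name `match_tokens`, the `penalty` weight, and the toy similarity matrix are all hypothetical and are not taken from the paper.

```python
def match_tokens(similarity, selection_counts, penalty=0.1):
    """Greedily match each source token to one destination token.

    similarity[i][j]    : similarity of source token i to destination token j
                          (assumed to come from the previous timestep's outputs).
    selection_counts[j] : how often destination j has been selected so far.
    penalty             : hypothetical weight of the selection-frequency penalty.
    """
    matches = []
    for row in similarity:
        # Penalize destinations that were already picked many times.
        scores = [s - penalty * selection_counts[j] for j, s in enumerate(row)]
        best = max(range(len(scores)), key=scores.__getitem__)
        matches.append(best)
        selection_counts[best] += 1  # later rows see the updated count
    return matches


sim = [[0.90, 0.85], [0.92, 0.88], [0.91, 0.89]]
plain = match_tokens(sim, [0, 0], penalty=0.0)       # every token maps to destination 0
penalized = match_tokens(sim, [0, 0], penalty=0.05)  # the penalty spreads the matches
```

Without the penalty, all three source tokens collapse onto the same destination; with it, the second token is routed elsewhere, which is the kind of spreading the abstract associates with fewer localized errors.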

2605.22007 2026-05-22 cs.CL

Hallucination as Commitment Failure: Larger LLMs Misfire Despite Knowing the Answer

Jewon Yeom, Jaewon Sok, Heejun Kim, Seonghyeon Park, Jeongjae Park, Taesup Kim

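Affiliations * Graduate School of Data Science; Department of Rural Systems Engineering; Electrical Engineering and Computer Science; Department of Aerospace Engineering

AI summary This paper studies why large LLMs hallucinate even when they know the correct answer. It finds that hallucination is determined by how probability is distributed over the correct concept at generation time, not by whether the correct concept is present.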

English abstract

Hallucination is often viewed as a direct consequence of missing knowledge: a model answers incorrectly when the correct answer is absent from its generation-time distribution, and correctly when it is present. We test this assumption by introducing a semantic notion of answer availability that aggregates token-level variants expressing the same answer concept, and asks whether the correct concept is already available at the moment the model commits to an answer. Across Qwen and Llama models from 0.8B to 72B in both Instruct and Base variants, 16-47% of Instruct hallucinations occur with substantial probability mass already on the correct concept, and the rate rises monotonically with scale. Comparing such failures against correct generations with matched semantic support, the distinguishing factor is not whether the correct concept is represented, but how its probability is distributed: correct generations concentrate mass on a single surface form, hallucinations disperse it across alternatives. The same sharpening asymmetry extends across multi-token generation and is detectable in pre-generation hidden states. Together, these results identify a single mechanism: instruction tuning sharpens answer commitment with scale, making helpfulness and confident hallucination two consequences of the same underlying disposition.
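
The distinction between concept-level mass and its concentration on one surface form can be illustrated with a small sketch. The surface-form list, the toy distributions, and the function name `concept_support` are assumptions made for illustration; the paper's actual aggregation procedure is not described in the abstract.

```python
def concept_support(next_token_probs, surface_forms):
    """Return (total mass on the concept, share held by its top surface form)."""
    masses = [next_token_probs.get(form, 0.0) for form in surface_forms]
    total = sum(masses)
    top_share = max(masses) / total if total > 0 else 0.0
    return total, top_share


# Hypothetical token-level variants that express the same answer concept.
forms = ["Paris", " Paris", "paris"]

# Same concept-level mass (0.6) in both cases, distributed differently.
sharp = {"Paris": 0.50, " Paris": 0.05, "paris": 0.05, "Lyon": 0.40}
diffuse = {"Paris": 0.20, " Paris": 0.20, "paris": 0.20, "Lyon": 0.40}

sharp_total, sharp_share = concept_support(sharp, forms)
diffuse_total, diffuse_share = concept_support(diffuse, forms)

# Greedy decoding picks the single most likely token, not the most likely concept.
greedy_sharp = max(sharp, key=sharp.get)
greedy_diffuse = max(diffuse, key=diffuse.get)
```

In the diffuse case the correct concept holds most of the probability mass, yet the single most likely token is the wrong answer, which matches the failure pattern the abstract describes.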

2605.22003 2026-05-22 cs.CL cs.AI cs.IR cs.LG

From TF-IDF to Transformers: A Comparative and Ensemble Approach to Sentiment Classification

Dip Biswas Shanto, Mitali Yadav, Prajwal Panth, Suresh Chandra Satapathy

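Affiliations * School of Computer Engineering, KIIT Deemed to be University

AI summary This paper compares several machine learning models for sentiment classification of movie reviews, including Naive Bayes, Logistic Regression, SVM, LightGBM, LSTM, and the transformer-based RoBERTa and DistilBERT. RoBERTa achieves the best accuracy, and a soft-voting ensemble of all models further improves classification performance.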

Comments 6 pages, 9 figures. This is the author's accepted manuscript, presented at the International Conference on Intelligent Computing, Networks and Security (IC-ICNS 2026), March 26-28, Bhubaneswar, India. Proceedings publication pending

English abstract

Sentiment analysis, also referred to as opinion mining, aims to extract opinions from text data. In the context of movie reviews and critics, sentiment analysis can help predict whether a movie review is generally positive or negative. ML models can struggle to capture context or metaphorical sentiment accurately, because they rely largely on statistical word representations. The objective of this paper is to examine and categorise movie reviews into positive and negative sentiments. Diverse machine learning models are considered, and Natural Language Processing (NLP) methodologies are employed for data preprocessing and model assessment. The IMDb dataset is used. Specifically, Naive Bayes, Logistic Regression, Support Vector Machines (SVM), LightGBM, LSTM, and transformer-based models such as RoBERTa and DistilBERT were evaluated. After extensive evaluation using accuracy, precision, recall, F1-score, and ROC-AUC, RoBERTa outperformed all the other models, with an accuracy of 93.02%. A soft voting ensemble that combined all the models also improved classification performance, showing that model ensembling works well for sentiment analysis.
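
Soft voting averages each model's predicted class probability before thresholding, so a confident model can outweigh a hesitant one. The sketch below is a generic illustration with made-up probabilities; the equal weights and the 0.5 threshold are assumptions, not settings reported in the paper.

```python
def soft_vote(prob_rows, weights=None):
    """Average per-model P(positive) values, then threshold at 0.5.

    prob_rows[m][i] is model m's predicted probability that review i is positive.
    """
    if weights is None:
        weights = [1.0] * len(prob_rows)
    total_weight = sum(weights)
    n_samples = len(prob_rows[0])
    averaged = [
        sum(w * row[i] for w, row in zip(weights, prob_rows)) / total_weight
        for i in range(n_samples)
    ]
    labels = [1 if p >= 0.5 else 0 for p in averaged]
    return averaged, labels


# Three hypothetical models scoring three reviews.
model_probs = [
    [0.90, 0.40, 0.20],
    [0.80, 0.70, 0.10],
    [0.60, 0.55, 0.45],
]
averaged, labels = soft_vote(model_probs)
```

For the second review, one model votes negative (0.40), but the averaged probability of 0.55 still yields a positive label.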

2605.22002 2026-05-22 cs.CV

ConvNeXt-FD: A Fractal-Based Deep Model for Robust Biomedical Image Segmentation

Joao Batista Florindo, Amanda Pontes de Oliveira Ornelas

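Affiliations * Institute of Mathematics, Statistics and Scientific Computing, Department of Applied Mathematics, University of Campinas

AI summary This paper proposes ConvNeXt-FD, a fractal-based deep learning model for more robust biomedical image segmentation. It combines the Dice coefficient with a boundary-aware regularization term to increase the model's sensitivity to object boundaries and shape fidelity.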

English abstract

Biomedical image segmentation is a critical task in medical diagnosis and treatment planning, enabling precise delineation of anatomical structures and pathological regions. Despite significant advancements, challenges persist due to the inherent variability, noise, and complex morphology present in diverse medical imaging modalities. This paper introduces ConvNeXt-FD, a novel deep learning architecture for robust biomedical image segmentation, built upon a U-Net-like encoder-decoder framework leveraging the powerful ConvNeXt backbone. Our approach integrates a hybrid loss function combining the Dice coefficient with a boundary-aware regularization term inspired by a differentiable formulation of Fractal Dimension, designed to enhance the model's sensitivity to object boundaries and shape fidelity. We rigorously evaluate ConvNeXt-FD across six distinct biomedical datasets: BUSI (Breast Ultrasound Images), DDTI (Thyroid Ultrasound Images), FluoCells (Fluorescent Cell Images), IDRiD (Diabetic Retinopathy Images for Optic Disc Segmentation), ISIC2018 (Skin Lesion Images), and MoNuSeg (Nuclei Segmentation). Experimental results demonstrate that ConvNeXt-FD, particularly when initialized with ImageNet pre-trained weights, achieves competitive and often superior performance compared to existing state-of-the-art methods across various metrics, including Dice, Jaccard, Accuracy, Sensitivity, Specificity, and False Positive Rate. The integration of ConvNeXt as a strong encoder, coupled with the boundary-aware regularization, proves effective in capturing both high-level semantic features and fine-grained boundary details, leading to more accurate and reliable segmentations in challenging biomedical contexts.
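
The Dice part of the hybrid loss is standard and can be sketched directly. The abstract does not give the form of the fractal-dimension regularizer, so the sketch takes it as an externally computed number; the `weight` value and the helper names are assumptions.

```python
def soft_dice(pred, target, eps=1e-6):
    """Soft Dice coefficient over flattened masks with values in [0, 1]."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    return (2.0 * intersection + eps) / (denominator + eps)


def dice_loss(pred, target):
    """Dice loss: 0 for a perfect overlap, close to 1 for no overlap."""
    return 1.0 - soft_dice(pred, target)


def hybrid_loss(pred, target, boundary_term, weight=0.1):
    """Dice loss plus a weighted boundary-aware term.

    boundary_term stands in for the paper's fractal-dimension regularizer,
    whose formula is not given in the abstract; weight is a hypothetical value.
    """
    return dice_loss(pred, target) + weight * boundary_term


perfect = soft_dice([1, 0, 1, 0], [1, 0, 1, 0])
disjoint = soft_dice([1, 1, 0, 0], [0, 0, 1, 1])
```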

2605.22000 2026-05-22 cs.CV cs.AI

Virtual 3D H&E Staining from Phase-contrast Back-illumination Interference Tomography

Anthony Song, Boyan Zhou, Mayank Golhar, Marisa Morakis, Alex Baras, Nicholas Durr

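Affiliations * Department of Biomedical Engineering, Johns Hopkins University; Department of Pathology, Johns Hopkins Hospital

AI summary This paper introduces HistoBIT3D, the first voxel-wise paired dataset of BIT and fluorescence-labeled nuclei, for quantitatively evaluating structural preservation in unsupervised virtual staining. Using this dataset, the authors propose a new virtual staining framework that uses bidirectional multiscale content consistency and cross-domain style reuse to translate BIT volumes with shift-variant contrast into realistic H&E volumes, improving 3D nuclei segmentation accuracy and boundary preservation.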

English abstract

Three-dimensional (3D) histopathology of unprocessed tissues has the potential to transform disease management by enabling volumetric characterization of tissue microarchitecture and in-vivo assessment. Back-illumination Interference Tomography (BIT) is a new phase microscopy technology that provides rapid, non-destructive volumetric imaging of unprocessed tissues. However, translating BIT volumes into clinically interpretable H&E images remains challenging, particularly due to shift-variant contrast and the absence of quantitative validation benchmarks. We introduce HistoBIT3D, the first voxel-wise paired BIT and fluorescence-labeled nuclei dataset, enabling quantitative evaluation of structural preservation in unsupervised virtual staining against ground-truth nuclear distributions. Using this dataset, we present a novel virtual staining framework that translates BIT volumes with shift-variant contrast into realistic H&E volumes by leveraging bidirectional multiscale content consistency and cross-domain style reuse to enhance structural fidelity and perceptual realism. Our method achieves state-of-the-art realism metrics while significantly improving 3D nuclei segmentation accuracy and boundary preservation under zero-shot Cellpose evaluation. Together, these contributions establish a quantitatively validated, structurally faithful, and scalable pipeline for 3D virtual H&E staining, advancing the paradigm of slide-free, volumetric computational histopathology. Our data and code are available at: https://github.com/aasong113/HistoBIT3D_VirtualStaining.

2605.21999 2026-05-22 cs.LG

Toward Understanding Adversarial Distillation: Why Robust Teachers Fail

Hongsin Lee, Hye Won Chung

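Affiliations * School of Electrical Engineering, KAIST, Daejeon, Korea

AI summary This paper studies the relationship between teacher robustness and student robustness in adversarial distillation. It shows that a mismatch between the teacher's supervisory confidence and the student's representational limitations leads to robust overfitting, and it provides a theoretical framework with experimental validation.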

Comments Accepted to ICML 2026. Code is available at https://github.com/HongsinLee/why-robust-teachers-fail

English abstract

Adversarial Distillation aims to enhance student robustness by guiding the student with a robust teacher's soft labels within the min-max adversarial training framework, yet its success is notoriously inconsistent: a more robust teacher often fails to improve, or even harms, the student's robust generalization. In this paper, we identify a key mechanism of this teacher dependency: the misalignment between the teacher's supervisory confidence and the student's representational limitations on a consistent subset of training data -- the Robustly Unlearnable Set. We present a theoretical framework analyzing the feature learning dynamics of a two-layer neural network, demonstrating that this mismatch creates a dichotomy in distillation outcomes. We prove that when a teacher provides confident supervision on unlearnable samples, it compels the student to memorize spurious noise patterns that eventually overpower the learned robust signal, thereby driving robust overfitting. Conversely, a teacher that exhibits high uncertainty on these samples effectively suppresses noise memorization, allowing the student to rely solely on the learnable signal for robust generalization. We empirically validate our theory across both synthetic simulations and real-image classification datasets, confirming that robust overfitting is driven by the teacher's interaction with unlearnable samples. Finally, we demonstrate that a teacher's predictive entropy on unlearnable samples serves as a strong indicator of student robustness, validating our theoretical framework and offering a principled guideline for robust teacher selection.
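
The teacher-selection indicator mentioned at the end of the abstract, predictive entropy on a set of samples, is simple to compute from soft labels. The sketch below uses Shannon entropy in nats; the two toy teachers and the choice of averaging over samples are assumptions for illustration, and identifying the unlearnable set itself is outside this sketch.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of one soft-label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(soft_labels):
    """Average predictive entropy over a set of samples."""
    return sum(predictive_entropy(p) for p in soft_labels) / len(soft_labels)


# Hypothetical teacher outputs on the same two hard samples.
confident_teacher = [[0.97, 0.02, 0.01], [0.95, 0.03, 0.02]]
uncertain_teacher = [[0.40, 0.35, 0.25], [0.34, 0.33, 0.33]]

confident_score = mean_entropy(confident_teacher)
uncertain_score = mean_entropy(uncertain_teacher)
```

Under the abstract's claim, the teacher with the higher score on unlearnable samples would be the safer choice for distillation.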

2605.21994 2026-05-22 cs.LG cs.AI

Ex-GraphRAG: Interpretable Evidence Routing for Graph-Augmented LLMs

Yoav Kor Sade, Arvindh Arun, Rishi Puri, Steffen Staab, Maya Bechler-Speicher

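Affiliations * Tel Aviv University; Institute for AI, University of Stuttgart; NVIDIA; Meta AI

AI summary This paper proposes Ex-GraphRAG, which introduces a Multivariate Graph Neural Additive Network (M-GNAN) to make evidence routing in graph-augmented LLMs interpretable. It reveals a mismatch between semantic importance and structural connectivity, with implications for retrieval pruning, context construction, and failure diagnosis.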

English abstract

GraphRAG conditions language models on subgraphs retrieved from knowledge graphs, encoded via message-passing GNNs. Because these encoders entangle node contributions through iterated neighborhood aggregation, there is no closed-form way to determine how much each retrieved entity influenced the encoder's output, and therefore no way to faithfully audit what structural evidence actually reached the model. We introduce Ex-GraphRAG, which replaces the GNN encoder with a Multivariate Graph Neural Additive Network (M-GNAN), an extension of additive graph models to high-dimensional embedding spaces that yields an exact decomposition of the encoder's output across individual nodes and feature groups, without post-hoc approximation. On STaRK-Prime, this auditable encoder matches black-box performance. Using it to audit evidence routing, we uncover a semantic-structural mismatch: the nodes that dominate the encoder's output are structurally disconnected in the retrieved subgraph, held together by low-attribution intermediaries whose removal degrades multi-hop QA by up to 28%. This mismatch, invisible to any opaque encoder, reveals that semantic importance and structural connectivity are governed by disjoint sets of nodes, with direct implications for retrieval pruning, context construction, and failure diagnosis in graph-augmented LLMs.

2605.21993 2026-05-22 cs.AI cs.LG

ECPO: Evidence-Coupled Policy Optimization for Evidence-Certified Candidate Ranking

Miaobo Hu, Shuhao Hu, BoKun Wang, Yina Sa, Xin Wang, Xiaobo Guo, Daren Zha, Jun Xiao

Affiliations * Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences

AI summary This paper studies evidence-certified candidate ranking and proposes ECPO, a policy optimization method that couples ranking with evidence certificates to improve both ranking quality and evidence reliability.

English abstract

Ranking systems used in decision-support settings should not only order candidates but also expose evidence that can be independently checked. We study evidence-certified candidate ranking: given an intent_id, a predefined plan skeleton, a window-local candidate roster, and text-derived candidate trajectories with span provenance, a system must output a Top-K list together with doc_id:span evidence certificates whose cited spans are sufficient to recover the decision. We instantiate this task on MAVEN-ERE and RAMS with fixed upstream extraction, window-local randomized candidate identifiers, skeleton-aligned trajectory supervision, hard negatives, and audit references. We introduce Evidence-Coupled Policy Optimization (ECPO), a listwise policy-optimization objective whose action is the joint object of ranking and evidence certificate. ECPO first learns an interpretable trajectory reward from skeleton alignment, argument consistency, and optional graph features; it then optimizes a constrained policy with three coupled rewards: listwise ranking utility, span-level certificate validity, and an evidence-cycle reward computed by a label-free deterministic verifier that reconstructs candidate support from claim-stripped cited spans. This reframes the goal from maximizing ordinary NDCG alone to maximizing CertNDCG and decision-evidence coupling. The evaluation compares ECPO against zero-shot, SFT, and GRPO policies, RM-only scoring with deterministic evidence attachment, grammar/JSON-constrained decoding, validator retry, best-of-N RM selection, and post-hoc evidence rationalization under closed-roster, predicted-roster, and hybrid-roster settings.
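
Ordinary NDCG, the baseline metric the abstract contrasts with CertNDCG, can be sketched as follows. The abstract does not define CertNDCG, so `certified_ndcg` below is a purely hypothetical variant in which a ranked item earns its gain only when its evidence certificate passes verification; it should not be read as the paper's definition.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with the usual log2 position discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg(gains, k):
    """NDCG@k for a ranked list of relevance gains."""
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0


def certified_ndcg(gains, certificate_ok, k):
    """Hypothetical certificate-gated NDCG@k (not the paper's definition).

    An item contributes its gain only if its certificate verified.
    """
    gated = [g if ok else 0.0 for g, ok in zip(gains, certificate_ok)]
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gated[:k]) / ideal if ideal > 0 else 0.0


ranked_gains = [3, 2, 0, 1]
plain_score = ndcg(ranked_gains, k=4)
all_valid = certified_ndcg(ranked_gains, [True, True, True, True], k=4)
top_invalid = certified_ndcg(ranked_gains, [False, True, True, True], k=4)
```

Gating on certificates means a correct ranking with an unverifiable top item scores lower than the same ranking with valid evidence, which is one way to express the decision-evidence coupling the abstract describes.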

2605.21988 2026-05-22 cs.CV cs.AI

Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning

Dazhao Du, Jian Liu, Jialong Qin, Tao Han, Bohai Gu, Fangqi Zhu, Yujia Zhang, Eric Liu, Xi Chen, Song Guo

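Affiliations * Hong Kong University of Science and Technology; Tencent

AI summary This paper proposes CRPO, which uses counterfactual reinforcement learning to make Video LLMs more sensitive to spatiotemporal dynamics. By constructing counterfactual videos and introducing a counterfactual relation reward, it suppresses shortcut policies that rely on static cues and improves spatiotemporal sensitivity on the DyBench benchmark.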

Comments Project website: https://ddz16.github.io/crpo.github.io/

English abstract

Video large language models (Video LLMs) achieve strong benchmark accuracy, yet often answer video questions through shortcuts such as single-frame cues and language priors rather than by tracking spatiotemporal dynamics. This issue is exacerbated in RL post-training, where correctness-only rewards can further reinforce shortcut policies that obtain high reward without tracking video dynamics. We address this by asking a controlled counterfactual question: if the visual world changed while the question remained fixed, should the answer change or stay the same? Based on this view, we propose Counterfactual Relational Policy Optimization (CRPO), a dual-branch RL framework for improving spatiotemporal sensitivity. CRPO constructs counterfactual videos through horizontal flips and temporal reversals, trains on both original and counterfactual branches, and introduces a Counterfactual Relation Reward (CRR) between their answers. CRR encourages answers to change for dynamic questions and remain unchanged for static questions. This cross-branch constraint makes it difficult for shortcut policies to be consistently rewarded across both branches. To evaluate this property, we introduce DyBench, a paired counterfactual video benchmark with 3,014 videos covering reversible dynamics, moving direction, and event sequence, together with a strict pair-accuracy metric that prevents fixed-answer shortcuts from inflating scores. Experiments show that CRPO outperforms prior RL methods on spatiotemporal-sensitive evaluations while maintaining competitive general video performance. On Qwen3-VL-8B, CRPO improves DyBench P-Acc by +7.7 and TimeBlind I-Acc by +8.2 over the base model, indicating improved spatiotemporal sensitivity rather than stronger reliance on static shortcuts. The project website can be found at https://ddz16.github.io/crpo.github.io/.
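
The two counterfactual transforms and the idea behind a strict pair metric can be sketched on a toy video stored as nested lists (frames, rows, pixels). The pair-accuracy definition used here, where a pair counts only if both the original and the counterfactual video are answered correctly, is an assumption about what "strict" means; the abstract does not give the exact formula.

```python
def temporal_reverse(video):
    """Play the frames in reverse order."""
    return video[::-1]


def horizontal_flip(video):
    """Mirror every frame left to right."""
    return [[row[::-1] for row in frame] for frame in video]


def pair_accuracy(results):
    """Fraction of pairs where BOTH videos were answered correctly.

    results is a list of (original_correct, counterfactual_correct) booleans.
    This definition is an assumption, not the paper's stated formula.
    """
    return sum(1 for a, b in results if a and b) / len(results)


video = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]  # 2 frames of 2x2 pixels
reversed_video = temporal_reverse(video)
flipped_video = horizontal_flip(video)

# A fixed-answer policy on dynamic questions is right on one side of each pair only.
fixed_answer_results = [(True, False)] * 4
per_video_accuracy = sum(a + b for a, b in fixed_answer_results) / (2 * len(fixed_answer_results))
strict_accuracy = pair_accuracy(fixed_answer_results)
```

The fixed-answer policy reaches 50% per-video accuracy but 0% pair accuracy, which is how a paired metric keeps such shortcuts from inflating scores.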

2605.21984 2026-05-22 cs.AI cs.CL

Echo: Learning from Experience Data via User-Driven Refinement

Hande Dong, Xiaoyun Liang, Jiarui Yu, Jiayi Lin, Changqing Ai, Feng Liu, Wenjun Zhang, Rongbi Wei, Chaofan Zhu, Linjie Che, Feng Wu, Xin Shen, Dexu Kong, Xiaotian Wang, Qiuyuan Chen, Bingxu An, Yueting Lei, Qiang Lin

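Affiliations * Core Contributors; Qiang Lin is the team leader

AI summary This paper proposes Echo, a framework that turns raw experience data into learnable knowledge through user-driven refinement and improves model performance. Experiments show it raises the acceptance rate from 25.7% to 35.7%.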

English abstract

Static "human data" faces inherent limitations: it is expensive to scale and bounded by the knowledge of its creators. Continuous learning from "experience data" - interactions between agents and their environments - promises to transcend these barriers. Today, the widespread deployment of AI agents grants us low-cost access to massive streams of such real-world experience. However, raw interaction logs are inherently noisy, filled with trial-and-error and low information density, rendering them inefficient for direct model training. We introduce Echo, a generalized framework designed to operationalize the transition from raw experience to learnable knowledge, effectively "echoing" environmental feedback back into the training loop for model optimization. In today's agent ecosystem, user refinement serves as a primary source of such feedback: driven by responsibility for the outcome, users rigorously transform flawed agent proposals into verified solutions. These user-driven refinement sequences inherently distill agents' crude attempts into high-quality training signals. Echo systematically harvests these signals to continuously align the agent with real-world needs. Large-scale validation in a production code completion environment confirms that Echo effectively harnesses this pipeline, breaking the static performance ceiling by increasing the acceptance rate from 25.7% to 35.7%.

2605.21981 2026-05-22 cs.CV

RiT: Vanilla Diffusion Transformers Suffice in Representation Space

Le Zhang, Ning Mang, Aishwarya Agrawal

Affiliations * Mila – Québec AI Institute, UdeM; Utrecht University; Canada CIFAR AI Chair

AI summary This work examines how well vanilla diffusion transformers generate images in representation space. It finds that a pretrained representation space is more favorable for flow-matching learning, and on ImageNet the resulting model outperforms DiT-DH-XL.

English abstract

Flow matching with $x$-prediction -- regressing the clean data point rather than the ambient velocity -- is known to exploit low-dimensional manifold structure effectively in pixel space \cite{li2025back}. We ask whether a pretrained representation space, while containing a low-dimensional data manifold of comparable intrinsic dimensionality, offers a distribution more favorable for flow-matching learning. Comparing pixel, SD-VAE, and DINOv2 features along four geometric axes, we find that pixel and DINOv2 share nearly identical intrinsic dimensionalities (both $\hat{d} \approx 33$) yet DINOv2 exhibits $7.3\times$ higher effective rank, $35\times$ better covariance conditioning, $11.5\times$ lower excess kurtosis, and $1.7\times$ lower on-manifold interpolation error; SD-VAE latents are consistently intermediate, indicating that the advantage stems from representation-learning objectives rather than mere compression. These statistical properties render the flow-matching regression well-conditioned and remove the need for the specialized prediction heads or Riemannian transport used by prior DINOv2 diffusion methods. We propose the Representation Image Transformer (RiT): a vanilla Diffusion Transformer trained by $x$-prediction on frozen DINOv2 features, augmented only by a dimension-aware noise schedule and joint [CLS]-patch modeling. On ImageNet $256 \times 256$, RiT attains FID 1.45 without guidance and 1.14 with classifier-free guidance, outperforming DiT$^\text{DH}$-XL with $19\%$ fewer parameters (676M vs. 839M). The resulting ODE is efficiently solvable at coarse discretizations: with classifier-free guidance, $5$ Heun steps already reach FID 2.0 and $10$ steps reach 1.25, without distillation or consistency training. Code at https://github.com/lezhang7/RiT.
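
Two details in the abstract lend themselves to a quick check: the parameter reduction and the use of a small number of Heun steps. The sketch below implements the generic second-order Heun integrator and applies it to a toy ODE (dx/dt = -x), which is an assumption standing in for the learned flow; it says nothing about the FID values the paper reports.

```python
import math


def heun_solve(velocity, x0, t0, t1, steps):
    """Integrate dx/dt = velocity(x, t) from t0 to t1 with Heun's method."""
    x, t = x0, t0
    h = (t1 - t0) / steps
    for _ in range(steps):
        k1 = velocity(x, t)               # slope at the start of the step
        k2 = velocity(x + h * k1, t + h)  # slope at the Euler-predicted end
        x = x + 0.5 * h * (k1 + k2)       # average the two slopes
        t += h
    return x


def decay(x, t):
    """Toy velocity field with exact solution x(t) = exp(-t) for x(0) = 1."""
    return -x


exact = math.exp(-1.0)
approx_5 = heun_solve(decay, 1.0, 0.0, 1.0, steps=5)
approx_10 = heun_solve(decay, 1.0, 0.0, 1.0, steps=10)

# Parameter counts quoted in the abstract: 676M vs. 839M.
parameter_reduction = (839 - 676) / 839  # about 0.194, i.e. roughly 19% fewer
```

Even on this toy problem, 5 steps already land within about 1% of the exact value and 10 steps reduce the error further, which illustrates why a well-conditioned ODE tolerates coarse discretization.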

2605.21980 2026-05-22 cs.CV cs.AI

Interpreting and Enhancing Emotional Circuits in Large Vision-Language Models via Cross-Modal Information Flow

Chengsheng Zhang, Chenghao Sun, Zhining Xie, Xinmei Tian

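Affiliations * MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science; AIPD, Tencent

AI summary This paper proposes a steering-vector-based causal attribution framework for descriptive emotional reasoning. Using a purpose-built dataset, it reveals the emotional circuits underlying a three-stage "Adapt-Aggregate-Execute" mechanism: visual emotional cues are aggregated in middle layers via sentiment-specific attention heads and then translated into narrative generation in deep layers through emotion-general pathways. Regulating emotional information routing to strengthen attention flow and semantic activation improves performance and mitigates emotional hallucinations.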

Comments Accepted by ICML 2026

English abstract

Large Vision-Language Models (LVLMs) represent a significant leap towards empathetic agents, demonstrating remarkable capabilities in emotion understanding. However, the internal mechanisms governing how LVLMs translate abstract visual stimuli into coherent emotional narratives remain largely unexplored, primarily due to the scarcity of visual counterfactuals and the diffuse nature of emotional expression. In this paper, we bridge this gap by introducing a steering-vector-based causal attribution framework tailored for descriptive emotional reasoning. To this end, we construct a specialized dataset to demystify the emotional circuits underlying the three-stage "Adapt-Aggregate-Execute" mechanism. Crucially, we discover a functional decoupling: visual emotional cues are aggregated in middle layers via sentiment-specific attention heads, but are subsequently translated into narrative generation in deep layers through emotion-general pathways. Guided by these insights, we regulate the emotional information routing to strengthen attention flow and amplify the semantic activation to consolidate expression. Extensive experiments on the comprehensive MER-UniBench demonstrate that our methods significantly improve performance via inference-time intervention, effectively mitigating emotional hallucinations and corroborating the causal fidelity of the discovered circuits.

2605.21977 2026-05-22 cs.CV cs.AI

Video as Natural Augmentation: Towards Unified AI-Generated Image and Video Detection

Zhengcen Li, Chenyang Jiang, Liangxu Su, Tong Shao, Shiyang Zhou, Ming Tao, Jingyong Su

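Affiliations * Harbin Institute of Technology, Shenzhen; Pengcheng Laboratory; Shenzhen Loop Area Institute

AI summary This study addresses the cross-modal gap in AI-generated content detection and proposes VINA, a framework that jointly trains on image and video data, uses video frames as natural augmentations, and adds a cross-modal supervised contrastive objective to achieve unified detection with better robustness and transferability.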

英文摘要

AI-generated content (AIGC) is rapidly improving, creating an urgent need for detectors that generalize across data sources, deployment pipelines, and visual modalities. A strongly generalizable detector should remain robust under distributional variations. However, we identify a consistent failure mode: SOTA AI-generated image detectors often collapse when applied to frames extracted from videos. Through systematic analysis, we show that this cross-modal gap arises from both entangled synthesis-agnostic video processing shifts, including color conversion, codec compression, resizing, and blur, and model-specific fingerprints introduced by modern video generators. Motivated by these findings, we propose VINA (Video as Natural Augmentation), a unified AIGC detection framework that jointly trains on image and video data. VINA uses video frames as physically grounded natural augmentations and further introduces a cross-modal supervised contrastive objective to align image and video representations under a shared real/fake decision boundary. Extensive experiments on 14 image, video, and in-the-wild benchmarks show that VINA delivers bidirectional gains, improves robustness and transferability, and achieves state-of-the-art performance across nearly all evaluated settings without complex augmentation or dataset-specific tuning.

2605.21976 2026-05-22 cs.RO

TacO: Benchmarking Tactile Sensors for Object Manipulation

TacO: 用于物体操作的触觉传感器基准测试

Anya Zorin, Zilin Si, Myungsun Park, Junsung Park, Alexiy Buynitsky, Sachin Bhadang, Taejun Park, Sohee John Yoon, Yong-Lae Park, Oliver Kroemer, Zeynep Temel, Michael T. Tolley, Sha Yi, Xiaolong Wang

发表机构 * UC San Diego(加州大学圣迭戈分校) CMU(卡内基梅隆大学) SNU(首尔国立大学)

AI总结 本文提出了一种任务驱动的触觉传感器评估框架,为四种模态(视觉、声学、磁性和电阻性)的触觉传感器分别训练操作策略,并在三个操作任务上比较其表现,表明触觉信息的有效性取决于传感器模态、材料属性和具体任务。

详情
AI中文摘要

基于视觉的示范学习在使机器人执行操作任务和高层语义推理方面取得了显著成功,但仍不足以应对复杂且接触丰富的操作。尽管人们普遍认为触觉感知能够改善操作,但目前尚无实证指导说明哪种触觉传感器最适合哪种操作任务。本文对用于机器人操作的触觉传感器进行了系统性的、任务驱动的评估,并提出了一个基于操作策略性能来选择和评估传感器的框架。我们为四种不同模态(视觉、声学、磁性和电阻性)的触觉传感器分别训练了独立的操作策略,涵盖三个任务:未知质量物体的拾取与放置、物体重新定向和插头插入。对于每个任务,我们分析了空间分辨率、剪切感知和触觉表示等传感器属性以及材料固有摩擦对任务性能的影响。我们的结果表明,触觉感知并非以相同方式对所有任务都有帮助,触觉信息的有用性在很大程度上取决于传感器模态、材料属性和具体的操作任务。所有触觉传感器、代码、数据和硬件设置将在项目网站上公开。

英文摘要

Vision-based learning from demonstrations has achieved remarkable success in enabling robots to perform manipulation tasks and high-level semantic reasoning, yet it remains insufficient for complex, contact-rich manipulation. While there is broad agreement that tactile sensing improves manipulation, there is no empirical guidance on which tactile sensors are best suited for which manipulation tasks. In this paper, we provide a systematic, task-driven evaluation of tactile sensors for robot manipulation and propose a framework for selecting and evaluating sensors based on manipulation policy performance. Separate manipulation policies are trained for tactile sensors of four distinct modalities: visual, acoustic, magnetic, and resistive, across three tasks: pick-and-place with unknown mass, object reorientation, and plug insertion. For each task, an analysis of how sensor properties such as spatial resolution, shear sensing, and tactile representation, and the inherent material friction affect task performances is done. Rather than tactile sensing being universally beneficial in the same way, our results show that the usefulness of tactile information depends strongly on sensor modality, material properties, and the specific manipulation tasks. All of the tactile sensors, code, data, and hardware setup will be publicly available on the project website.

2605.21975 2026-05-22 cs.LG

Reasoning through Verifiable Forecast Actions: Consistency-Grounded RL for Financial LLMs

通过可验证的预测动作进行推理:面向金融大语言模型的一致性导向强化学习

Jialin Chen, Aosong Feng, Harshit Verma, Siyi Gu, Haiwen Wang, Ali Maatouk, Yixuan He, Yifeng Gao, Leandros Tassiulas, Rex Ying

发表机构 * Yale University(耶鲁大学) University of Texas Rio Grande Valley(德克萨斯大学大河谷分校) Arizona State University(亚利桑那州立大学)

AI总结 本文提出StockR1,一种结合时间序列的LLM,通过可验证的预测动作统一股票预测与金融推理,利用强化学习优化整个流程,提升金融问答和股票预测的准确性。

详情
AI中文摘要

金融市场以极端非平稳性、低信噪比以及对新闻、公司基本面和宏观经济信号等外部信息的强依赖性为特征。然而,现有方法要么将时间序列抽象为文本,要么将预测与基于语言的推理解耦,导致定性推理与定量结果之间存在根本性不匹配。为此,我们引入StockR1,一种时间序列增强的LLM,通过可验证的预测动作统一股票预测与金融推理。基于工具调用设计,模型首先发出预测动作,即对其定性市场展望的结构化、可解释的表示。然后,它调用一个以该动作为条件的时间序列解码器,生成分布形式的未来轨迹,从而支持更有依据的问答和金融推理。我们通过强化学习优化整个流程,其中奖励共同反映答案的有效性、预测的准确性以及生成动作与观察到的时间序列动态之间的一致性。此外,奖励通过样本级不确定性标量重新加权,鼓励模型适应市场动态中不断变化的不确定性。我们在一个大规模的10年基准上评估StockR1的金融问答和股票预测能力。我们的方法持续优于时间序列基线和通用LLM,将推理准确性提高了17.7%(4B)和25.9%(8B)。这些发现表明,对预测动作进行结构化能够在语言推理和时间预测之间建立强大的协同效应,使LLM能够通过可验证、可解释且有数值依据的决策进行推理。

英文摘要

Financial markets are characterized by extreme non-stationarity, low signal-to-noise ratios, and strong dependence on external information such as news, company fundamentals, and macroeconomic signals. Yet, existing approaches either abstract time-series into text or decouple forecasting from language-based reasoning, leading to a fundamental mismatch between qualitative reasoning and quantitative outcomes. To address this, we introduce StockR1, a time-series-enhanced LLM that unifies stock forecasting and financial reasoning through a verifiable forecast action. Based on a tool-call design, the model first emits a forecast action, which is a structured and interpretable representation of its qualitative market outlook. It then invokes a time-series decoder conditioned on this action to generate distributional future trajectories, leading to more informed question answering and financial reasoning. We optimize the full pipeline with reinforcement learning, where rewards jointly reflect answer validity, forecast accuracy, and consistency between generated actions and observed time-series dynamics. In addition, rewards are reweighted by a sample-level uncertainty scalar, encouraging the model to accommodate varying uncertainty in market dynamics. We evaluate StockR1 on financial question answering and stock forecasting over a large-scale 10-year benchmark. Our method consistently outperforms time-series baselines and general-purpose LLMs, improving reasoning accuracy by 17.7% (4B) and 25.9% (8B). These findings demonstrate that structuring the forecast actions establishes a powerful synergy between language reasoning and temporal prediction, enabling LLMs to reason through verifiable, interpretable, and numerically grounded decisions.

2605.21974 2026-05-22 cs.AI

Format-Constraint Coupling in Knowledge Graph Construction from Statistical Tables

从统计表构建知识图谱中的格式-约束耦合

Jingxuan Qi, Zhiqiang Ye, Yuxiang Feng

发表机构 * South China University of Technology(华南理工大学)

AI总结 本文研究了从统计CSV表构建知识图谱时序列化格式与模式约束之间的耦合效应,发现二者的联合影响超过各自独立影响之和,并发布了CSVFidelity-Bench基准以支持关注保真度的评估。

Comments 8 pages main body, 18 pages appendices. Submitted to EMNLP 2026 via ACL Rolling Review (ARR). Corresponding author: Yuxiang Feng (yxfeng@scut.edu.cn). Code and data available at https://anonymous.4open.science/r/sge_lightrag-BE19

详情
AI中文摘要

提取方案不应降低知识图谱的保真度。然而,在统计CSV表中却可能降低。我们研究了国家-年份时间序列矩阵,这是开放数据门户中常见的布局。在此设置中,序列化格式和模式约束的交互作用是超加性的。它们的联合效应超过独立效应的总和,最高可达+1.180(2x2因子,6个数据集)。Bootstrap 95%置信区间在4/6个数据集中严格为正,其中在宽型II矩阵上证据最强。更关键的是,应用于不匹配格式的模式可能触发灾难性不匹配。事实覆盖率在4/6个数据集中低于无约束基线,通过实体膨胀或提取拒绝实现。我们称这种观察到的模式为格式-约束耦合。探测和标记消融支持以列名参考为中心的表面形式锚定解释。在格式-模式配对、GraphRAG主机和LLM家族之间的受控变体中,结果在测量范围内保持相同方向;一个LLM家族仅显示部分激活。这一观察还具有诊断后果。三种标准检索模式在很大程度上掩盖了构建质量(delta <= 1pp),而直接图访问暴露了高达+47.6pp(p < 0.0001)的差距。为了支持保真度意识的评估,我们发布了CSVFidelity-Bench。它包含15个数据集、11个II型矩阵、4个III型表格和1,892个标准事实,覆盖6个领域。

英文摘要

An extraction schema should not reduce knowledge graph fidelity. On statistical CSV, however, it can. We study country-by-year time-series matrices, a common layout on open-data portals. In this setting, serialization format and schema constraints interact super-additively. Their joint effect exceeds the sum of independent effects by up to +1.180 (2x2 factorial, 6 datasets). Bootstrap 95% CIs are strictly positive on 4/6 datasets, with strongest evidence on wide Type-II matrices. More critically, a schema applied to a mismatched format can trigger catastrophic mismatch. Fact coverage falls below the unconstrained baseline on 4/6 datasets through entity inflation or extraction refusal. We call this observed pattern format-constraint coupling. Probing and token ablation support a surface-form anchoring explanation centred on column-name references. Controlled variants across format-schema pairings, GraphRAG hosts, and LLM families show the same direction within the measured scope; one LLM family shows only partial activation. The observation also has a diagnostic consequence. Three standard retrieval modes largely mask construction quality (delta <= 1pp), whereas direct graph access exposes gaps up to +47.6pp (p < 0.0001). To support fidelity-aware evaluation, we release CSVFidelity-Bench. It contains 15 datasets, 11 Type-II matrices, 4 Type-III tables, and 1,892 Gold Standard facts across 6 domains.
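The super-additivity claim in the abstract above can be made concrete with the standard 2x2 factorial interaction contrast. The sketch below is only an illustration under assumed inputs: the function name, the argument names, and the example scores are hypothetical and are not taken from the paper or from CSVFidelity-Bench.

```python
def interaction_contrast(y_base, y_format, y_schema, y_both):
    """Return the 2x2 factorial interaction term.

    y_base   : score with neither factor applied (hypothetical fidelity metric)
    y_format : score with only the serialization format changed
    y_schema : score with only the schema constraint applied
    y_both   : score with both factors applied

    A positive value means the joint effect exceeds the sum of the two
    single-factor effects, i.e. the factors interact super-additively.
    """
    effect_format = y_format - y_base
    effect_schema = y_schema - y_base
    effect_joint = y_both - y_base
    return effect_joint - (effect_format + effect_schema)


# Illustrative numbers only: each factor alone helps a little,
# both together help far more than the two gains added up.
example = interaction_contrast(y_base=0.20, y_format=0.30, y_schema=0.25, y_both=0.90)
print(f"interaction = {example:+.3f}")
```

A value of zero corresponds to purely additive factors; the abstract's bootstrap confidence intervals would be computed over such a contrast, although the exact metric and resampling scheme are not specified in this listing.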

2605.21973 2026-05-22 cs.CV

Foresee-to-Ground: From Predictive Temporal Perception to Evidence-Driven Reasoning for Video Temporal Grounding

Foresee-to-Ground:从预测性时间感知到证据驱动推理的视频时间定位

Zelin Zheng, Xinyan Liu, Ruixin Li, Antoni B. Chan, Guorong Li, Qingming Huang, Laiyun Qing

AI总结 本文提出了一种新的视频时间定位框架F2G,通过将时间定位问题重新表述为可验证的“先识别后测量”问题,结合预测性时间感知和证据驱动推理,以提高时间定位的准确性和鲁棒性。

Comments Accepted by ICML 2026

详情
AI中文摘要

当前用于视频时间定位(VTG)的视频大语言模型(Video-LLM)方法通常依赖于从非结构化的视觉token流中直接生成时间戳,这往往导致数值脆弱和边界不一致。为了解决这个问题,我们提出了Foresee-to-Ground(F2G),一种将VTG重新表述为可验证的“先识别后测量”问题的框架。F2G将预测性时间感知与证据驱动推理相结合:它学习对边界敏感的时间表示,以构建一个覆盖整个视频的候选事件片段证据池,并将这些片段作为可引用的证据单元提供给LLM,从而将边界预测与显式的事件假设绑定。通过将事件识别与精确的边界测量解耦,F2G使定位更加稳定,并使预测可验证。大量实验表明,F2G在多种基准上一致地提高了定位准确性,能够在不同的Video-LLM骨干网络之间稳健迁移,并保持了通用视频理解能力。

英文摘要

Current Video-LLM approaches for Video Temporal Grounding (VTG) typically rely on direct timestamp generation from an unstructured visual-token stream, often leading to brittle numerics and inconsistent boundaries. To address this, we propose Foresee-to-Ground (F2G), a framework that reformulates VTG as a verifiable Identify-then-Measure problem. F2G integrates Predictive Temporal Perception with Evidence-Driven Reasoning: it learns boundary-sensitive temporal representations to build a video-wide evidence pool of candidate event segments, and exposes these segments to the LLM as citable evidence units that bind boundary prediction to explicit event hypotheses. By decoupling event identification from precise boundary measurement, F2G stabilizes grounding and makes predictions verifiable. Extensive experiments demonstrate that F2G consistently improves grounding accuracy across diverse benchmarks, transfers robustly across different Video-LLM backbones, and preserves general video understanding capabilities.

2605.21972 2026-05-22 cs.LG

How Sparsity Allocation Shapes Label-Free Post-Pruning Recoverability

稀疏性分配如何塑造无标签后剪枝恢复能力

Qishi Zhan, Minxuan Hu, Liang He

发表机构 * Marquette University(马凯特大学) Cornell University(康奈尔大学) Tongji University(同济大学)

AI总结 本文研究了在固定激活统计修复后端下,稀疏性分配如何影响后修复恢复能力,通过比较ERK和LAMP分配在不同数据集和模型上的表现,发现分配选择对后修复准确性有显著影响,并揭示了修复敏感的过渡区域。

详情
AI中文摘要

在高稀疏度下进行非结构化幅值剪枝可能使神经网络精度降至接近随机水平,而在实际部署中可能无法进行带标签的重新训练。无标签的剪枝后修复方法可以部分恢复崩溃的稀疏模型,但其有效性取决于上游剪枝分配所留下的稀疏模型。本文研究了在固定的激活统计修复后端下,稀疏性分配如何影响修复后的可恢复性。我们在CIFAR-10、CIFAR-100和Imagenette上,使用ResNet-18、ResNet-34和ResNet-50,在90%到95.5%的稀疏度下,比较ERK和LAMP分配在相同无标签修复协议下的表现。结果表明,在相同全局稀疏度下,分配选择可以显著改变修复后的准确性,并且更优的分配会随架构、数据集难度和稀疏度水平而变化。我们识别出一个对修复敏感的过渡区域,在此区域内批归一化重新校准开始失效,而激活统计修复仍能恢复非平凡的准确性。在ImageNet-100和DenseNet-121上的额外验证表明,该可恢复区域的位置和宽度取决于数据规模和连接结构。这些发现表明,剪枝分配和剪枝后修复应联合研究,因为分配决定了可用于无标签恢复的激活信号量。

英文摘要

Unstructured magnitude pruning at high sparsity can reduce neural network accuracy to near-random performance, while labeled retraining may be unavailable in practical deployment settings. Label-free post-pruning repair methods can partially recover collapsed sparse models, but their effectiveness depends on the sparse model left by the upstream pruning allocation. This paper studies how sparsity allocation shapes post-repair recoverability under a fixed activation-statistic repair backend. We compare ERK and LAMP allocations under the same label-free repair protocol across CIFAR-10, CIFAR-100, and Imagenette with ResNet-18, ResNet-34, and ResNet-50 at sparsities from 90% to 95.5%. The results show that allocation choice can substantially change post-repair accuracy at the same global sparsity, and that the preferred allocation varies with architecture, dataset difficulty, and sparsity level. We identify a repair-sensitive transition regime in which BatchNorm recalibration begins to fail, while activation-statistic repair still recovers nontrivial accuracy. Additional validation on ImageNet-100 and DenseNet-121 shows that the location and width of this recoverable regime depend on data scale and connectivity structure. These findings suggest that pruning allocation and post-pruning repair should be studied jointly, since the allocation determines how much activation signal remains available for label-free recovery.

2605.21968 2026-05-22 cs.LG

An Improved Adaptive PID Optimizer with Enhanced Convergence and Stability for Deep Learning

一种改进的自适应PID优化器,具有增强的收敛性和稳定性,用于深度学习

Saurabh Saini, Kapil Ahuja, Thomas Wick, Saurav Kumar

发表机构 * Department of Computer Science & Engineering, Indian Institute of Technology Indore, India; National Remote Sensing Centre, Indian Space Research Organisation, India

AI总结 本文提出了一种改进的自适应PID优化器IAdaPID-ADG,通过引入非递增有效学习率和基于梯度差的调制因子来解决AdaPID在收敛性和稳定性方面的不足,实验表明其在多个数据集上表现优异。

Comments 11 Pages, Double Column, 6 Tables, 5 Figures

详情
AI中文摘要

优化在深度学习中至关重要。大多数优化器的基础方法是基于动量的随机梯度下降。然而,它有两个关键缺点。首先,它有噪声和变化的梯度,其次,它有超调现象。为了解决噪声梯度,提出了Adam,它仍然是最广泛使用的自适应优化器。为了解决超调现象,提出了一种基于控制理论的PID优化器。为了在单一框架内解决这些限制,最近提出了几种AdaPID的变体。尽管AdaPID表现良好,但它仍然继承了Adam的两个关键缺点,即收敛性和稳定性问题。在本文中,我们解决了这两个限制。为了修复收敛问题,我们独特地将使用非递增有效学习率的想法整合到AdaPID中(最初在AMSGrad中提出,是Adam的扩展)。为了修复稳定性问题,我们创新性地将基于梯度差的调制因子整合到AdaPID中(最初在DiffGrad中提出,是Adam的另一个扩展)。将这两种想法结合到AdaPID中,结果得到我们新的IAdaPID-ADG优化器。我们在多个数据集上评估了所提出的优化器,包括基准数据集(MNIST和CIFAR10)和实际数据集(IARC和AnnoCerv)。IAdaPID-ADG在所有竞争优化器中表现显著更好。此外,我们在MNIST数据集上进行了消融研究,以展示每个添加组件的贡献。

英文摘要

Optimization is essential in deep learning. The foundational method upon which most optimizers are built is momentum-based stochastic gradient descent. However, it suffers from two key drawbacks. First, it has noisy and varying gradients, and second, it has an overshoot phenomenon. To address noisy gradients, Adam was proposed, which remains the most widely used adaptive optimizer. To address the overshoot phenomenon, a control-theory-based PID optimizer was proposed. To tackle both the limitations within a single framework, several variants of Adaptive PID (AdaPID) have recently been proposed. Although AdaPID performs well, it still inherits two critical drawbacks from Adam, namely convergence and stability issues. In this work, we address both these limitations. To fix the convergence issue, we uniquely integrate the idea of using a non-increasing effective learning rate into AdaPID (originally proposed in AMSGrad, an extension of Adam). To fix the stability issue, we innovatively integrate a gradient difference based modulation factor into AdaPID (originally proposed in DiffGrad, another extension of Adam). Combining both these ideas in AdaPID, results in our novel IAdaPID-ADG optimizer. We evaluate our proposed optimizer on multiple datasets, including benchmark datasets (MNIST and CIFAR10) and real-world datasets (IARC and AnnoCerv). The IAdaPID-ADG substantially outperforms all competing optimizers. Additionally, we perform an ablation study on the MNIST dataset to demonstrate the contribution of each added component.
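The two borrowed ideas named in the abstract above, a non-increasing effective learning rate (AMSGrad) and a gradient-difference modulation factor (DiffGrad), can be sketched on top of a plain Adam-style update. This is not the IAdaPID-ADG optimizer: the PID terms are omitted because the listing does not specify them, and the function names, hyperparameters, and the toy objective are assumptions made for illustration.

```python
import numpy as np


def init_state(theta):
    """Create optimizer state for a parameter array (hypothetical helper)."""
    zeros = np.zeros_like(theta)
    return {"m": zeros.copy(), "v": zeros.copy(),
            "v_max": zeros.copy(), "g_prev": zeros.copy()}


def amsgrad_diffgrad_step(theta, grad, state, lr=0.05,
                          beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style step with two modifications.

    AMSGrad idea : divide by the running maximum of the second moment, so the
                   per-coordinate step scale lr / sqrt(v_max) never increases.
    DiffGrad idea: scale the step by sigmoid(|g_prev - g|), which damps updates
                   when consecutive gradients barely change.
    """
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * grad ** 2
    state["v_max"] = np.maximum(state["v_max"], state["v"])

    # Friction coefficient in [0.5, 1): small gradient change -> close to 0.5.
    friction = 1.0 / (1.0 + np.exp(-np.abs(state["g_prev"] - grad)))
    state["g_prev"] = grad.copy()

    return theta - lr * friction * state["m"] / (np.sqrt(state["v_max"]) + eps)


# Toy usage on f(x) = x^2, whose gradient is 2x.
theta = np.array([5.0])
state = init_state(theta)
for _ in range(2000):
    theta = amsgrad_diffgrad_step(theta, 2.0 * theta, state)
print(theta)
```

Bias correction is left out so that the non-increasing property of `lr / sqrt(v_max)` holds exactly, as in the original AMSGrad formulation; practical implementations often add it back.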

2605.21965 2026-05-22 cs.CL

SpecHop: Continuous Speculation for Accelerating Multi-Hop Retrieval Agents

SpecHop:连续推测用于加速多跳检索代理

Mehrdad Saberi, Keivan Rezaei, Soheil Feizi

发表机构 * University of Maryland, College Park(马里兰大学学院公园分校)

AI总结 本文研究如何在不改变最终轨迹的情况下加速多跳工具使用过程,提出了一种连续推测框架SpecHop,通过维护多个推测线程和异步验证预测观测来减少延迟。

详情
AI中文摘要

大型语言模型越来越多地使用网络搜索和文档检索等外部工具来解决信息密集型任务。然而,复杂任务中的多跳工具使用会引入显著的延迟,因为模型必须反复等待工具观察结果才能继续。我们研究如何在不改变模型在无加速情况下本应得到的最终轨迹的前提下加速此类轨迹,并假设可以访问更快但可靠性较低的推测工具。我们为多跳工具使用场景中的无损推测建立了一个理论框架,刻画了可实现的最优延迟增益。我们提出了SpecHop,一种连续推测框架,它维护多个推测线程,在目标工具输出到达时异步验证预测的观察结果,提交正确的分支并回滚错误的分支。这在保持准确性的同时减少了实际运行延迟。我们证明,当活跃线程足够多时,SpecHop可以接近理想(oracle)延迟增益。实验上,在检索增强的多跳任务中,SpecHop与理论预测高度吻合,并在某些设置中将延迟减少高达40%。代码:https://github.com/mehrdadsaberi/spechop

英文摘要

Large language models increasingly use external tools such as web search and document retrieval to solve information-intensive tasks. However, multi-hop tool use in complex tasks introduces substantial latency, since the model must repeatedly wait for tool observations before continuing. We study how to accelerate such trajectories without changing the final trajectory the model would have taken without acceleration, assuming access to faster but less reliable speculator tools. We develop a theoretical framework for lossless speculation in multi-hop tool-use settings, characterizing the optimal achievable latency gain. We propose SpecHop, a continuous speculation framework that maintains multiple speculative threads, verifies predicted observations asynchronously as target tool outputs arrive, commits correct branches, and rolls back incorrect ones. This preserves accuracy while reducing wall-clock latency. We show that SpecHop can approach oracle latency gains with enough active threads. Empirically, on retrieval-augmented multi-hop tasks, SpecHop closely matches theoretical predictions and reduces latency by up to 40% in some settings. Code: https://github.com/mehrdadsaberi/spechop
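The commit-or-rollback idea described above can be illustrated with a toy wall-clock model. This is a simplified sketch, not the SpecHop implementation or its theoretical framework: it assumes a linear chain of hops, a fixed latency per tool, unlimited speculative threads, and a speculator that is never slower than the target tool. All names and numbers are hypothetical.

```python
def sequential_latency(n_hops, tool_latency, model_step):
    """Baseline: every hop waits for the target tool, then the model runs."""
    return n_hops * (tool_latency + model_step)


def speculative_latency(spec_correct, tool_latency, spec_latency, model_step):
    """Wall-clock time of a lossless speculative run over a chain of hops.

    spec_correct[i] is True when the speculator's observation for hop i
    matches the target tool's output. The next hop starts early on a correct
    guess and only after verification (rollback) on a wrong one. The final
    answer is committed only after the last target output has been verified,
    so the resulting trajectory equals the sequential one.
    """
    start = 0.0   # time at which the current hop issues its tool calls
    finish = 0.0
    for correct in spec_correct:
        verified_at = start + tool_latency      # target output arrives
        finish = verified_at + model_step       # earliest safe commit of this hop
        if correct:
            start = start + spec_latency + model_step   # continue on the guess
        else:
            start = verified_at + model_step            # roll back and redo
    return finish


base = sequential_latency(3, tool_latency=1.0, model_step=0.2)
fast = speculative_latency([True, True, True], 1.0, 0.1, 0.2)
print(base, fast)
```

With every guess wrong the speculative run degenerates to the sequential latency, which is the sense in which speculation is lossless here: it can save time but never changes the committed observations.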

2605.21963 2026-05-22 cs.LG cs.AI

ChronoMedicalWorld: A Medical World Model for Learning Patient Trajectories from Longitudinal Care Data

ChronoMedicalWorld:一个用于从纵向护理数据中学习患者轨迹的医学世界模型

Jiangyuan Wang, Xuyong Chen, Junwei He, Xu Xu, Shasha Xie, Fuman Han

发表机构 * Beijing KidneyTec Medical Technology Co., Ltd.(北京肾科医疗技术有限公司)

AI总结 本文提出了一种名为ChronoMedicalWorld的模型,旨在通过纵向护理数据学习患者轨迹,该模型结合了联合嵌入状态编码器和宽动作编码器,并在包含六项的目标函数下训练循环潜在转移模块,以提高慢性病护理中长期预测的准确性。

Comments 14 pages, 2 figures, 6 tables

详情
AI中文摘要

长期临床模拟(预测患者在指定干预下数年间的生理演变)是慢性病护理的核心,但现有的电子健康记录(EHR)模型大多为判别式模型,而通用的大语言模型在重复干预下会发生漂移。我们提出了ChronoMedicalWorld模型(CMWM),一种用于从纵向护理数据中学习患者轨迹的动作条件潜在世界模型框架。CMWM将联合嵌入状态编码器与宽动作编码器相结合,后者可以同时接受结构化干预指标和自由文本交流嵌入,并在一个包含六项的目标函数下训练循环潜在转移模块:下一步观测监督、下一步潜在预测、SIGReg潜在正则化,以及三个生理感知的形状先验(斜率、连续性、大跳跃惩罚)。闭环rollout前缀协议使训练与部署保持一致,因此模型优化的正是其在推理时表现出的多步误差。作为具体案例研究,我们将CMWM实例化用于慢性肾病(CKD)的年度估算肾小球滤过率(eGFR)轨迹预测。在一个包含2,232名患者的肾病队列上,CKD实例化在动态50%历史rollout测试中取得了7.384的平均绝对误差(MAE)和10.256的均方根误差(RMSE),而经过调优的GPT-5.5结构化提示基线分别为7.964和11.069(MAE降低7.28%,RMSE降低7.35%),增益主要来自患者与健康教练交流中的对话部分。该框架并非专用于CKD:其架构、损失设计和训练协议适用于任何可被表述为周期性临床状态与结构化及对话式干预交替出现的慢性疾病。

英文摘要

Long-horizon clinical simulation, predicting how a patient's physiology evolves over years under specified interventions, is central to chronic-disease care, yet existing electronic health record (EHR) models are predominantly discriminative, and general-purpose large language models drift under repeated interventions. We propose the ChronoMedicalWorld Model (CMWM), an action-conditioned latent world-model framework for learning patient trajectories from longitudinal care data. CMWM couples a joint-embedding state encoder with a wide action encoder that admits both structured intervention indicators and free-text communication embeddings, and trains a recurrent latent transition module under a six-term objective: next-observation supervision, next-latent prediction, SIGReg latent regularisation, and three physiology-aware shape priors (slope, continuity, large-jump penalty). A closed-loop rollout-prefix protocol matches training to deployment, so the model is optimised against the same multi-step error it exhibits at inference. As a concrete case study, we instantiate CMWM for annual estimated glomerular filtration rate (eGFR) trajectory forecasting in chronic kidney disease (CKD). On a 2,232-patient nephrology cohort, the CKD instantiation achieves a dynamic-50% history rollout test mean absolute error (MAE) of 7.384 and root-mean-square error (RMSE) of 10.256, against 7.964 and 11.069 for a tuned GPT-5.5 structured-prompting baseline (-7.28% MAE, -7.35% RMSE), with the gain dominated by the dialogue portion of patient-health-coach communication. The framework is not CKD-specific: its architecture, loss design, and training protocol apply to any chronic condition that can be cast as periodic clinical state interleaved with structured and conversational interventions.

2605.21962 2026-05-22 cs.AI cs.CY cs.HC cs.MA

AI-Enabled Serious Games: Integrating Intelligence and Adaptivity in Training Systems

AI赋能的严肃游戏:在训练系统中整合智能与适应性

Priyamvada Tripathi, Bill Kapralos

发表机构 * Durham College(达勒姆学院) Ontario Tech University(安大略理工大学)

AI总结 本文探讨了如何利用人工智能技术提升严肃游戏中的实时教学适应能力,分析了智能与适应性的定义,并讨论了大型语言模型、强化学习和基于代理的架构在严肃游戏中的应用及面临的挑战。

Comments Book chapter, 1 figure. To appear in "Advances in Global Applied Artificial Intelligence," G. A. Tsihrintzis, M. Virvou, N. G. Bourbakis, and L. C. Jain (Eds.), Springer, Learning and Analytics in Intelligent Systems book series, 2026

详情
AI中文摘要

严肃游戏在医疗、国防和教育等多个领域被广泛用于学习和培训。然而,仍然存在静态场景设计、作者瓶颈、有限的学习者建模和实现有意义的实时教学适应的困难。近年来,人工智能(AI)的进步引入了动态场景变化、上下文反馈、自适应节奏和学习者状态建模等新能力,可能帮助解决一些限制。同时,将AI集成到严肃游戏中也引发了关于有效性、透明性、系统控制和学习者信任的重要问题。本章探讨了当代AI方法如何支持严肃游戏中的实时教学适应。它区分了教学智能,即系统推断学习者知识并合理回应的能力,以及适应性,即在交互过程中修改教学行动的能力。本文呈现了适应性学习系统的综述,从早期的计算机辅助教学到智能辅导系统(ITS)、动态难度调整(DDA)、作者平台、学习分析和最近的AI赋能架构。基于这一视角,本文讨论了大型语言模型(LLMs)、强化学习(RL)和基于代理的架构如何促进严肃游戏中更整合的智能和适应性。同时,它还突出了与AI赋能系统相关的实际和研究挑战,包括可解释性、验证、计算成本以及关于AI赋能严肃游戏中长期学习结果的有限实证证据。

英文摘要

Serious games are widely used for learning and training across domains such as healthcare, defense, and education. Persistent challenges remain, however, including static scenario design, authoring bottlenecks, limited learner modeling, and difficulty implementing meaningful real-time instructional adaptation. Recent advances in artificial intelligence (AI) introduce novel capabilities such as dynamic scenario variation, contextual feedback, adaptive pacing, and learner-state modeling that may help address some of these limitations. At the same time, integrating AI into serious games raises important questions related to validity, transparency, system control, and learner trust. This chapter examines how contemporary AI approaches may support real-time instructional adaptation in serious games. It distinguishes between instructional intelligence, defined as a system's capacity to infer learner knowledge and reason about pedagogically appropriate responses, and adaptivity, defined as the ability to modify instructional actions during interaction. A historical synthesis of adaptive learning systems is presented, tracing developments from early computer-assisted instruction through intelligent tutoring systems (ITS), dynamic difficulty adjustment (DDA), authoring platforms, learning analytics, and recent AI-enabled architectures. Building on this perspective, the chapter discusses how large language models (LLMs), reinforcement learning (RL), and agent-based architectures may contribute to more integrated forms of intelligence and adaptivity in serious games. It also highlights practical and research challenges associated with AI-enabled systems, including explainability, validation, computational cost, and the limited empirical evidence regarding long-term learning outcomes in AI-enabled serious games.

2605.21958 2026-05-22 cs.CL

Diagnosis Is Not Prescription: Linguistic Co-Adaptation Explains Patching Hazards in LLM Pipelines

诊断并非处方:语言共适应解释了LLM流水线中的修补危害

Yoon Jeonghun, Kim Dongchan

发表机构 * KAIST (Korea Advanced Institute of Science and Technology)(韩国科学技术院) NAVER Corp.(NAVER公司)

AI总结 本文研究了多模块LLM代理失败时,诊断与修补之间的矛盾,发现路由模块虽为瓶颈,但注入修正示例反而降低性能,而修正查询重写模块则更有效,提出了语言合同假说解释这种现象。

Comments Preprint. Under review at EMNLP 2026 (ARR)

详情
AI中文摘要

当多模块LLM代理失效时,最负责失败的模块未必是最佳干预地点。我们通过实证展示了这种诊断悖论:因果分析一致地将路由模块——选择下一步调用的工具——识别为三个独立代理家族中的主要瓶颈。然而,将提示级修正示例注入此模块会持续降低性能,有时甚至严重。相反,修补上游的查询重写模块则能可靠地改善结果。这种效果在两个代理家族中具有统计显著性,在第三个家族中表现出方向一致性;在路由模块的替代修补策略(指令重写、模型升级)则无明显影响,证实了危害仅特定于修正注入修补。我们通过语言合同假说解释这种不对称性:每个下游模块隐式地适应其上游的特征错误分布,因此修正瓶颈会打破这种隐式对齐,而上游修正不会。我们通过基于诊断的每代理共适应度度量来操作化这一概念,并展示其在所有代理家族中与修补危害一致相关:较高的共适应度与危害相关,较低的与安全性相关。这一趋势在所有三个代理家族中均成立,为假说提供了初步支持,超越了单代理观察。

英文摘要

When a multi-module LLM agent fails, the module most responsible for the failure is not necessarily the best place to intervene. We demonstrate this Diagnostic Paradox empirically: causal analysis consistently identifies the routing module -- which selects which tool to call next -- as the primary bottleneck across three independent agent families. Yet injecting prompt-level correction examples into this module consistently degrades performance, sometimes severely. Patching an upstream query-rewriting module instead reliably improves outcomes. The effect holds with statistical significance on two agent families and directional consistency on a third; alternative repair strategies at the routing module (instruction rewriting, model upgrade) are neutral, confirming that the harm is specific to correction-injection patching. We explain this asymmetry through the Linguistic Contract hypothesis: each downstream module implicitly adapts to its upstream's characteristic error distribution, so correcting the bottleneck breaks this implicit alignment in a way that upstream corrections do not. We operationalize this via a per-agent co-adaptation measure, derived from diagnosis alone, and show it is consistently associated with patching harm across agent families: higher co-adaptation co-occurs with harm, lower with safety. This trend holds across all three agent families, providing preliminary support for the hypothesis beyond a single-agent observation.

2605.21957 2026-05-22 cs.CV

Bounding-Box Trajectories Matter for Video Anomaly Detection

边界框轨迹对视频异常检测至关重要

Inpyo Song, Jangwon Lee

发表机构 * Sungkyunkwan University(成均馆大学)

AI总结 本文提出TrajVAD框架,通过建模多类边界框轨迹来学习正常运动模式,利用边界框轨迹作为主要异常线索,在ShanghaiTech数据集上取得优于现有姿态基方法的性能。

Comments 17 pages, 3 figures

详情
AI中文摘要

视频异常检测对于公共安全和安保至关重要,尽管已有大量研究,但仍极具挑战性,因为存在大量外观、视角和场景动态的变化。在现有方法中,基于人类姿态的方法已成为主要研究方向,由于许多公共数据集中的异常涉及人类,姿态表示对外观变化具有鲁棒性,同时提供紧凑的运动描述。然而,这些方法往往忽视了边界框轨迹,尽管这种信息在基于姿态的管道中本应是固有的。在本文中,我们明确利用这些轨迹作为主要异常线索。我们提出了TrajVAD框架,使用归一化流建模多类边界框轨迹以学习正常运动模式。其仅轨迹变体(TrajVAD-T)消除了姿态估计,并在ShanghaiTech上以87.7%的AP超越了所有比较的姿态基方法,同时在MSAD上取得最佳结果。扩展版本(TrajVAD-P)纳入了姿态信息,进一步将ShanghaiTech上的性能提升至88.6%的AUROC和90.9%的AP,突显了边界框轨迹作为视频异常检测中有效但尚未充分研究的模态。

英文摘要

Video anomaly detection is critical for public safety and security, yet remains highly challenging despite extensive research due to large variations in appearance, viewpoint, and scene dynamics. Among existing approaches, human pose-based methods have emerged as a major line of research, showing strong performance since many anomalies in public datasets involve humans and pose representations are robust to appearance changes while providing compact motion descriptions. However, these methods often overlook bounding-box trajectories, although such information is inherently available in pose-based pipelines. In this paper, we explicitly leverage these trajectories as a primary anomaly cue. We present TrajVAD, a framework that models multi-class bounding-box trajectories using normalizing flows to learn normal kinematic patterns. Its trajectory-only variant (TrajVAD-T) eliminates pose estimation and surpasses all compared pose-based methods on ShanghaiTech in AP (87.7%), while achieving the best results on MSAD. An extended version (TrajVAD-P) incorporates pose information and further improves performance to 88.6% AUROC and 90.9% AP on ShanghaiTech, highlighting bounding-box trajectories as an effective yet underexplored modality for video anomaly detection.

2605.21954 2026-05-22 cs.CV cs.AI

MLLMs Know When Before Speaking: Revealing and Recovering Temporal Grounding via Attention Cues

Dazhao Du, Liao Duan, Jian Liu, Tao Han, Yujia Zhang, Eric Liu, Xi Chen, Song Guo

发表机构 * Hong Kong University of Science and Technology(香港科技大学) Xi’an Jiaotong University(西安交通大学) Tencent(腾讯)

AI总结 本文研究了多模态大语言模型(MLLMs)在视频时间定位中的感知与生成之间的差距,提出了一种推理阶段的读取-再生成框架,通过利用注意力线索来提高时间定位的准确性,从而在三个视频时间定位基准上提升了MiMo-VL-7B、Qwen3-VL-8B和TimeLens-8B的性能。

Comments Project Website: https://ddz16.github.io/mllmsknowwhen.github.io/

详情
AI中文摘要

视频时间定位(VTG),即在未剪裁的视频中定位查询事件的起止时间,是检验多模态大语言模型(MLLMs)是否理解不仅发生了什么,而且何时发生的关键测试。尽管现代MLLMs能够流畅地描述视频内容,但它们的时间戳预测仍然不可靠,而现有的解决方案要么需要昂贵的后训练时间标注,要么依赖于粗略的训练无关启发式方法。在本文中,我们探测了MLLMs的跨模态注意力,并揭示了一个感知-生成的差距。我们的关键发现是,MLLMs在prefill阶段往往知道目标区间,但在生成最终答案时会丢失这个信号。在prefill阶段,一组稀疏的注意力头(我们称之为时间定位头(TG-Heads))会将查询到视频的注意力集中在真实区间上。然而,在自回归解码过程中,答案标记会将注意力从该区间转移到视觉显著但与查询无关的段落。这一观察促使我们提出了一种推理阶段的读取-再生成框架。我们首先将TG-Head prefill注意力转换为一个去偏的帧级相关性信号,并提取它突出的高注意力区间。然后,我们使用视频裁剪或注意力掩码来限制MLLM的视觉上下文,仅限于该区间,以抑制干扰项。在不进行参数更新和架构更改的情况下,我们的框架在三个VTG基准上一致地提高了MiMo-VL-7B、Qwen3-VL-8B和TimeLens-8B的性能,最大提升达到+3.5 mIoU。该项目网站可在https://ddz16.github.io/mllmsknowwhen.github.io/上找到。

英文摘要

Video temporal grounding (VTG), which localizes the start and end times of a queried event in an untrimmed video, is a key test of whether multimodal large language models (MLLMs) understand not only what happens but also when it happens. Although modern MLLMs describe video content fluently, their timestamp predictions remain unreliable, while existing remedies either require costly post-training on temporal annotations or rely on coarse training-free heuristics. In this work, we probe the cross-modal attention of MLLMs and uncover a perception-generation gap. Our key finding is that MLLMs often know the target interval during prefill, but lose this signal when generating the final answer. In the prefill stage, a sparse set of attention heads, which we call Temporal Grounding Heads (TG-Heads), concentrates query-to-video attention on the ground-truth interval. During autoregressive decoding, however, the answer tokens shift attention away from this interval toward visually salient but query-irrelevant segments. This observation motivates an inference-time read-then-regenerate framework. We first convert TG-Head prefill attention into a debiased frame-level relevance signal and extract the high-attention interval it highlights. We then re-invoke the MLLM with visual context restricted to this interval, using video cropping or attention masking to suppress distractors. Without parameter updates and architectural changes, our framework consistently improves MiMo-VL-7B, Qwen3-VL-8B, and TimeLens-8B on three VTG benchmarks, with gains of up to +3.5 mIoU. The project website can be found at https://ddz16.github.io/mllmsknowwhen.github.io/.
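The "extract the high-attention interval" step mentioned above can be sketched as a thresholded search for the contiguous run of frames carrying the most relevance mass. This is an assumed, simplified procedure for illustration: the abstract does not state how the interval is extracted or debiased, and the function names, the `ratio` threshold, and the example scores are hypothetical. Scores are assumed to be non-negative.

```python
def extract_interval(scores, ratio=0.5):
    """Return (start, end) frame indices (inclusive) of the best run.

    Frames whose score is at least ratio * max(scores) are kept; among the
    contiguous runs of kept frames, the one with the largest summed score wins.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    threshold = ratio * max(scores)
    best, best_mass = None, float("-inf")
    run_start = None
    # A trailing sentinel closes a run that reaches the last frame.
    for i, s in enumerate(list(scores) + [float("-inf")]):
        if s >= threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            mass = sum(scores[run_start:i])
            if mass > best_mass:
                best, best_mass = (run_start, i - 1), mass
            run_start = None
    return best


def to_seconds(interval, fps):
    """Convert inclusive frame indices into a [start, end) time span in seconds."""
    start, end = interval
    return start / fps, (end + 1) / fps


relevance = [0.1, 0.1, 0.8, 0.9, 0.7, 0.1, 0.6, 0.1]  # made-up frame scores
print(extract_interval(relevance), to_seconds(extract_interval(relevance), fps=2.0))
```

The resulting span is what a second model call would be restricted to, by cropping the video or masking attention outside it.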

2605.21951 2026-05-22 cs.LG

Dynamic Mixture of Latent Memories for Self-Evolving Agents

动态潜在记忆混合用于自演化智能体

Dianzhi Yu, Vireo Zhang, Hongru Wang, Yanyu Chen, Minda Hu, Wanghan Xu, Siki Chen, Philip Torr, Zhenfei Yin, Irwin King

发表机构 * The Chinese University of Hong Kong(香港中文大学) University of Oxford(牛津大学) Nanyang Technological University(南洋理工大学) University of Edinburgh(爱丁堡大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 本文提出MoLEM框架,通过动态混合专家机制实现智能体的持续学习,避免灾难性遗忘,提升任务学习和能力保持。

Comments 19 pages, 5 figures, 5 tables

详情
AI中文摘要

实现智能体的自演化需要在变化的任务序列中持续积累新知识,同时不遗忘之前获得的能力。现有方法要么通过更新模型参数内部化知识,导致灾难性遗忘,要么依赖外部记忆,无法真正增强模型的内在能力。我们提出MoLEM,一种基于动态混合专家(MoE)的生成性潜在记忆混合框架。我们将多个专家视为独立的记忆载体来生成记忆。路由器通过键-查询匹配选择并加权专家,聚合的潜在记忆被注入推理过程。基础模型保持完全冻结,所有经验知识被内部化到附加模块中,避免灾难性遗忘。对于持续学习,每个训练阶段配对一个轻量级自编码器,在推理时选择适当的路由组,输入若不匹配任何阶段则回退到预训练模型。实验在涵盖数学、科学和代码领域的持续学习序列上训练框架。训练后,我们在相应的测试集上评估框架,以测量跨持续适应阶段的任务学习和能力保持。在完整的持续学习序列后,我们的方法在Vanilla预训练基线基础上将平均准确率提高了10.40%,而其他方法在不同训练顺序中均无法超过此基线。

英文摘要

Achieving self-evolution in intelligent agents requires the continual accumulation of new knowledge across changing task sequences without forgetting previously acquired abilities. Existing approaches either internalize knowledge by updating model parameters, which induces catastrophic forgetting, or rely on external memory, which fails to genuinely enhance the model's intrinsic capabilities. We propose MoLEM, a generative mixture of latent memory framework based on a dynamic mixture-of-experts (MoE). We treat multiple experts as independent carriers to generate memory. A router selects and weights experts through key-query matching, and the aggregated latent memory is injected into the reasoning process. The base model for reasoning remains entirely frozen, with all experiential knowledge internalized into the additional modules, avoiding catastrophic forgetting. For continual learning, each training stage is paired with a lightweight autoencoder that selects the appropriate routing group at inference, and inputs that match no stage fall back to the pretrained model. Experiments train the framework on continual-learning sequences spanning math, science, and code domains. After training, we evaluate the framework on the corresponding test sets to measure task learning and competence preservation across continual adaptation stages. After the full continual-learning sequence, our method improves the average accuracy by 10.40% over the Vanilla pretrained baseline, while none of the competing methods consistently exceed this baseline across different training orders.
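The routing step described above (a router that selects and weights experts by key-query matching, then aggregates their latent memories) can be sketched as follows. This is a minimal illustration, not MoLEM itself: the scaled dot-product scoring, the top-k softmax, the tensor shapes, and all names are assumptions, and in the paper the experts generate memory rather than hold fixed arrays.

```python
import numpy as np


def route_latent_memory(query, keys, memories, top_k=2):
    """Select experts by key-query similarity and mix their latent memories.

    query    : (d,)       query vector derived from the current input
    keys     : (E, d)     one key per expert
    memories : (E, M, d)  M latent memory tokens produced by each expert
    Returns (weights over all E experts, aggregated memory of shape (M, d)).
    """
    scores = keys @ query / np.sqrt(query.shape[0])   # scaled dot product
    top = np.argsort(scores)[-top_k:]                 # indices of the k best experts
    logits = scores[top] - scores[top].max()          # stabilise the softmax
    probs = np.exp(logits)
    probs /= probs.sum()

    weights = np.zeros_like(scores)
    weights[top] = probs                              # unselected experts get zero
    aggregated = np.tensordot(weights, memories, axes=1)
    return weights, aggregated


keys = np.eye(3)
memories = np.arange(18, dtype=float).reshape(3, 2, 3)
weights, memory = route_latent_memory(np.array([4.0, 1.0, 0.0]), keys, memories)
print(weights, memory.shape)
```

Because only the router and experts are involved, the base model's parameters are untouched, which matches the frozen-backbone design stated in the abstract.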

2605.21949 2026-05-22 cs.CL

Claim-Selective Certification for High-Risk Medical Retrieval-Augmented Generation

面向高风险医疗检索增强生成的声明选择性认证

Shao Kan

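Affiliation * Jinglue Technology Development (Nanjing) Co., Ltd.

AI summary This paper studies claim-selective certification for retrieval-augmented generation in high-risk medical QA. Each response is decomposed into verifiable claims, scored against retrieved evidence, and mapped by an intent-aware selector to {full, partial, conflict, abstain}. On the weak-label certificate protocol, the full system reports action accuracy of 0.9204 on dev and 0.8997 on test.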

Comments 22 pages, 7 figures, 11 tables

Details
AI Chinese abstract

在高风险问答设置中,医疗RAG系统通常通过单个答案或回避决策进行评估,但混合证据可能支持一个声明,要求另一个声明的条件,并与第三个声明矛盾。我们研究声明选择性认证:每个响应被分解为可验证的声明,根据检索证据评分,并通过意图感知选择器映射到{完整、部分、冲突、回避}。在主要弱标签证书协议上,其真实源-only的开发/测试行覆盖了自然发生的非回避动作,完整的系统在开发集(n=314)上记录UCCR=0.0000,PAU=1.0000,PAU Precision=0.9901,以及动作准确率=0.9204,在测试集(n=319)上记录UCCR=0.0000,PAU=0.9967,PAU Precision=0.9739,以及动作准确率=0.8997。UCCR衡量证书定义内的不支持声明风险,而源缺失的反事实切片评估在空证据下的回避。捷径控制量化由源和意图元数据解释的动作-标签先验,而源/证据新颖切片表征转移边界。所得到的界面在混合证据下将动作-标签预测与证据关联的声明选择分开。

English abstract

Medical RAG systems in high-risk QA settings are often evaluated through a single answer-or-abstain decision, but mixed evidence may support one claim, require conditions for another, and contradict a third. We study claim-selective certification: each response is decomposed into verifiable claims, scored against retrieved evidence, and mapped by an intent-aware selector to {full, partial, conflict, abstain}. On the primary weak-label certificate protocol, whose real-source-only dev/test rows cover the naturally occurring non-abstain actions, the full system records UCCR=0.0000, PAU=1.0000, PAU Precision=0.9901, and action accuracy=0.9204 on dev (n=314), and UCCR=0.0000, PAU=0.9967, PAU Precision=0.9739, and action accuracy=0.8997 on test (n=319). UCCR measures unsupported-claim risk within the certificate definition, and a source-missing counterfactual slice evaluates abstain under empty evidence. Shortcut controls quantify the action-label prior explained by source and intent metadata, while source/evidence-novel slices characterize transfer boundaries. The resulting interface separates action-label prediction from evidence-linked claim selection under mixed evidence.
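To make the four-way action space concrete, here is a minimal sketch of mapping per-claim evidence labels to {full, partial, conflict, abstain}. The label names (`supported`, `conditional`, `contradicted`, `unsupported`) and the precedence rules are assumptions made for illustration. The abstract does not specify the selector's rules, and the intent-aware component is omitted entirely.

```python
def select_action(claim_labels):
    """Map per-claim evidence labels to one response-level action.

    Hypothetical rules (not taken from the paper):
      - no claims, or no claim backed by evidence -> "abstain"
      - any contradicted claim                   -> "conflict"
      - every claim fully supported              -> "full"
      - otherwise (some conditional/unsupported) -> "partial"
    """
    if not claim_labels or all(label == "unsupported" for label in claim_labels):
        return "abstain"
    if any(label == "contradicted" for label in claim_labels):
        return "conflict"
    if all(label == "supported" for label in claim_labels):
        return "full"
    return "partial"


if __name__ == "__main__":
    # Mixed evidence as in the abstract: one claim supported, one needing
    # conditions, one contradicted.
    print(select_action(["supported", "conditional", "contradicted"]))
```

The point of the sketch is the interface: the decision is made over a set of claims rather than over a single answer-or-abstain flag.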

2605.21948 2026-05-22 cs.LG

SCI-Defense: Defending Manipulation Attacks from Generative Engine Optimization

SCI-Defense: 防御生成引擎优化的操纵攻击

Xucheng Yu, Haibo Jin, Huimin Zeng, Haohan Wang

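Affiliations * Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign; School of Information Sciences, University of Illinois Urbana-Champaign; Amazon topcited.ai

AI summary This paper proposes SCI-Defense, a framework that combines perplexity detection, Semantic Integrity Scoring, and Inter-Candidate Detection to identify Generative Engine Optimization attacks. It reaches Precision=1.000 and FPR=0.000 on Amazon product descriptions, and it exposes the limits of existing defenses along with directions for future research.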

Comments 20 pages, NeurIPS 2026 submission

Details
AI Chinese abstract

基于大型语言模型的排序系统易受生成引擎优化(GEO)攻击影响,攻击者通过在产品描述中注入语义信号来人为提升排名。我们提出了SCI-Defense,一种结合困惑度检测(PPL)、语义完整性评分(SIS)和跨候选检测(ICD)的三元防御框架。SIS评估四个操纵维度:权威归因(AA)、叙事目的性(NP)、比较主张(CA)和时间主张(TC)。在600个亚马逊产品描述(6个类别)上评估,SCI-Defense实现了精度1.000和误报率0.000,召回率分别为1.000、0.952和0.830,分别针对字符串、推理和评论攻击。在600个MS MARCO网页段落上,字符串攻击被完美阻止,而评论攻击的召回率接近零,因为网页段落缺乏SIS在产品描述中针对的说服性信号。我们证明现有防御方法——仅PPL过滤、SafetyClf内容分类器和改写——在对抗语义操纵攻击时召回率为零。我们进一步展示了新的攻击方式如规范放大和用例饱和,可以暴露语义相关性操纵作为结构防御盲点,指明了未来研究的方向。

English abstract

LLM-based ranking systems are vulnerable to Generative Engine Optimization (GEO) attacks, where adversaries inject semantic signals into product descriptions to artificially boost rankings. We propose SCI-Defense, a three-component defense framework combining Perplexity detection (PPL), Semantic Integrity Scoring (SIS), and Inter-Candidate Detection (ICD). SIS evaluates four manipulation dimensions: Authority Attribution (AA), Narrative Purposiveness (NP), Comparative Claims (CA), and Temporal Claims (TC). Evaluated on 600 Amazon product descriptions across 6 categories, SCI-Defense achieves Precision=1.000 and FPR=0.000, with Recall of 1.000, 0.952, and 0.830 against String, Reasoning, and Review attacks respectively. On 600 MS MARCO web passages, String attacks are blocked with perfect recall while Review attacks yield near-zero recall, as web passages lack the persuasion-oriented signals that SIS targets in product descriptions. We demonstrate that existing defenses -- PPL-only filters, SafetyClf content classifiers, and paraphrasing -- achieve zero recall against semantic manipulation attacks. We further demonstrate that new attacks such as Specification Amplification and Use-Case Saturation can expose semantic relevance manipulation as a structural defense blind spot, which suggests directions for future research.
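The reported numbers combine three standard detection metrics. The sketch below shows how precision, recall, and false-positive rate are computed from ground-truth labels and detector flags, and why perfect precision and zero FPR can coexist with recall below 1. The function name and the toy data are illustrative assumptions, not material from the paper.

```python
def detection_metrics(is_attack, flagged):
    """Return (precision, recall, fpr) for a binary attack detector.

    is_attack: list of bools, True if the item was actually manipulated.
    flagged:   list of bools, True if the detector flagged the item.
    """
    tp = sum(1 for y, f in zip(is_attack, flagged) if y and f)
    fp = sum(1 for y, f in zip(is_attack, flagged) if not y and f)
    fn = sum(1 for y, f in zip(is_attack, flagged) if y and not f)
    tn = sum(1 for y, f in zip(is_attack, flagged) if not y and not f)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return precision, recall, fpr


if __name__ == "__main__":
    # Toy data: the detector never flags a clean item but misses one attack.
    is_attack = [True, True, False, False]
    flagged = [True, False, False, False]
    print(detection_metrics(is_attack, flagged))
```

A detector that never flags clean items has no false positives, so precision is 1 and FPR is 0 regardless of how many attacks it misses; recall is the metric that captures the misses.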

2605.21947 2026-05-22 cs.RO

A Visitation Grid for Complete Coverage Foraging in Robot Swarms

用于机器人群完全覆盖觅食的访问网格

Qi Arturo Gonzalez, Yifeng Gao, Li Zhang, Qi Lu

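Affiliation * Department of Computer Science, The University of Texas Rio Grande Valley

AI summary This paper proposes a grid-based stochastic foraging strategy that reduces redundant visits and accelerates late-stage collection, improving the efficiency and completeness of resource collection by robot swarms in large, unknown environments.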

Comments The 23rd International Conference on Ubiquitous Robots, 10 figures, 3 tables

Details
AI Chinese abstract

在大规模未知环境中对稀疏资源的完全收集仍然是自主机器人群面临的挑战。先前研究表明,收集阶段的大部分时间消耗在最终阶段,此时仅剩下少量随机分布的资源。因此,许多现有的群体觅食算法(搜索和收集)专注于在有限的时间窗口内收集大多数资源,而不是改进后期收集所有资源的效率。我们提出了一种基于网格的随机觅食策略,通过显式减少冗余访问并加速后期收集。未知的搜索区域被划分为网格地图,该地图由一个轻量级的中央服务器维护。为了保持可扩展性,机器人和服务器都在有限的内存和计算约束下运行。服务器根据机器人报告的位置更新网格级别的访问次数,生成探索密度的全局估计。对于每次新的觅食任务,机器人从一个局部3×3邻域网格中以最低访问次数的概率选择下一个搜索区域,从而将探索偏向于未访问的区域,同时保持随机性。广泛的模拟实验表明,所提出的策略在性能上始终优于传统的中央放置基线觅食算法(CPFA)。与CPFA相比,所提出的方法将总收集时间减少了多达33%,并在任务的最后阶段将收集效率提高了超过48%。这些结果表明,所提出的策略在机器人群的近完全和完全资源收集中具有鲁棒性、灵活性和可扩展性,并且可以作为在有限机载资源下随机群体觅食方法的一般增强。

English abstract

The complete collection of sparse resources in large, unknown environments remains a challenging problem for autonomous robot swarms. Previous studies have shown that a substantial portion of total mission time is consumed during the final stage of collection, where only a small fraction of randomly scattered resources remain. Consequently, many existing swarm foraging algorithms (search and collection) focus on collecting most resources within a limited time window, rather than improving end-stage efficiency for collecting all resources. We propose a grid-based stochastic foraging strategy that explicitly reduces redundant visits and accelerates late-stage collection. The unknown search area is partitioned into a grid map, which is maintained by a lightweight central server. To maintain scalability, both robots and the server operate within limited memory and computational constraints. The server updates the grid-level visitation counts based on robot-reported locations, producing a global estimate of the exploration density. For each new foraging trip, a robot probabilistically selects its next search area from a local 3×3 neighborhood of grid cells, favoring the cells with the lowest visitation counts, thus biasing exploration toward under-visited regions while maintaining stochasticity. Extensive simulation experiments demonstrate that the proposed strategy consistently outperforms the canonical centrally placed baseline foraging algorithm (CPFA). Compared to CPFA, the proposed method reduces the total collection time by up to 33% and improves collection efficiency by more than 48% during the final stage of the mission. These results indicate that the proposed strategy is robust, flexible, and scalable for near-complete and complete resource collection in robot swarms and can serve as a general enhancement for stochastic swarm foraging methods under limited onboard resources.
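The visitation-grid idea can be sketched in a few lines: a server-side table of per-cell visit counts, and a robot-side sampler over the local 3×3 neighborhood that prefers rarely visited cells. The inverse-count weighting `1 / (1 + visits)` and the names `record_visit` and `choose_next_cell` are assumptions for illustration. The abstract does not state the exact probability rule.

```python
import random


def record_visit(visits, row, col):
    """Server side: increment the visit count of the cell a robot reported."""
    visits[row][col] += 1


def choose_next_cell(visits, row, col, rng=random):
    """Robot side: sample a cell from the 3x3 neighborhood around (row, col).

    Assumption: sampling weight is 1 / (1 + visit count), so less-visited
    cells are more likely while every neighbor keeps a nonzero chance.
    Neighborhoods are clipped at the grid border.
    """
    rows, cols = len(visits), len(visits[0])
    neighbors = [
        (r, c)
        for r in range(max(0, row - 1), min(rows, row + 2))
        for c in range(max(0, col - 1), min(cols, col + 2))
    ]
    weights = [1.0 / (1 + visits[r][c]) for r, c in neighbors]
    return rng.choices(neighbors, weights=weights, k=1)[0]


if __name__ == "__main__":
    grid = [[0, 9, 9], [9, 9, 9], [9, 9, 9]]
    rng = random.Random(0)
    target = choose_next_cell(grid, 1, 1, rng)
    record_visit(grid, *target)
    print(target, grid)
```

Keeping the sampler stochastic, rather than always taking the minimum, matches the abstract's stated goal of biasing exploration toward under-visited regions while maintaining randomness.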

2605.21938 2026-05-22 cs.LG cs.CR cs.IT math.IT

Optimal Guarantees for Auditing Rényi Differentially Private Machine Learning

对Rényi差分隐私机器学习的最优审计保证

Benjamin D. Kim, Lav R. Varshney, Daniel Alabi

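Affiliations * Massachusetts Institute of Technology; Stony Brook University

AI summary This paper studies black-box auditing of machine learning algorithms that claim Rényi differential privacy (RDP) guarantees. It proposes a hypothesis-testing-based auditing framework that uses the Donsker-Varadhan (DV) variational estimator to directly estimate the Rényi divergence between neighboring executions, derives non-asymptotic confidence intervals via class-restricted DV estimators, and proves that the sample-complexity guarantees are information-theoretically optimal up to logarithmic factors, establishing the first optimal guarantees for auditing RDP via DV estimators.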

Comments 28 pages, 3 figures

Details
AI Chinese abstract

我们研究了声称具有Rényi差分隐私(RDP)保证的机器学习算法的黑盒审计问题。我们引入了一种基于假设检验的审计框架,该框架利用Donsker-Varadhan(DV)变分估计器直接估计相邻执行之间的Rényi散度。我们的分析得出通过类受限DV估计器进行RDP审计的显式且非渐近的置信区间,将统计估计误差与算法隐私泄漏分开。我们证明了匹配的minimax下界,表明在对数因子范围内,我们的样本复杂度保证在信息论上最优,从而建立了通过DV估计器审计RDP的首次最优保证。经验上,我们为在完全黑盒设置中审计DP-SGD实例化了我们的框架。在MNIST和CIFAR-10上,以及在广泛的隐私制度下,我们的审计器在经验RDP下界方面相比先前最先进的黑盒方法表现出显著的整体改进,尤其是在小和中等Rényi阶数,其中准确审计最为具有挑战性时。

English abstract

We study black-box auditing for machine learning algorithms that claim Rényi differential privacy (RDP) guarantees. We introduce an auditing framework, based on hypothesis testing, that directly estimates Rényi divergence between neighboring executions using the Donsker-Varadhan (DV) variational estimator. Our analysis yields explicit and non-asymptotic confidence intervals for RDP auditing via class-restricted DV estimators, separating statistical estimation error from algorithmic privacy leakage. We prove matching minimax lower bounds showing that, up to logarithmic factors, our sample-complexity guarantees are information-theoretically optimal, thereby establishing the first optimal guarantees for auditing RDP via DV estimators. Empirically, we instantiate our framework for auditing DP-SGD in a fully black-box setting. Across MNIST and CIFAR-10, and over a wide range of privacy regimes, our auditors produce a strong overall improvement on empirical RDP lower bounds compared to prior state-of-the-art black-box methods especially at small and moderate Rényi orders where accurate auditing is most challenging.
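For readers unfamiliar with the quantity being audited, the sketch below computes the Rényi divergence of order α between two equal-variance Gaussians in two ways: the closed form α(μ_P − μ_Q)²/(2σ²), and a plain Monte Carlo estimate of (1/(α−1)) log E_Q[(p/q)^α]. This is not the paper's method: it assumes the densities are known, so it is neither black-box nor a Donsker-Varadhan estimator. The function names and parameter values are illustrative.

```python
import math
import random


def renyi_gaussian_closed_form(mu_p, mu_q, sigma, alpha):
    """D_alpha(N(mu_p, sigma^2) || N(mu_q, sigma^2)) in closed form."""
    return alpha * (mu_p - mu_q) ** 2 / (2 * sigma ** 2)


def renyi_monte_carlo(mu_p, mu_q, sigma, alpha, n=100_000, seed=0):
    """Estimate D_alpha(P || Q) = log(E_Q[(p/q)^alpha]) / (alpha - 1).

    Samples are drawn from Q; the density ratio uses the known Gaussian
    densities (an assumption a black-box auditor would not have).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(mu_q, sigma)
        # log p(x) - log q(x) for equal-variance Gaussians
        log_ratio = ((x - mu_q) ** 2 - (x - mu_p) ** 2) / (2 * sigma ** 2)
        total += math.exp(alpha * log_ratio)
    return math.log(total / n) / (alpha - 1)


if __name__ == "__main__":
    exact = renyi_gaussian_closed_form(0.5, 0.0, 1.0, 2.0)
    estimate = renyi_monte_carlo(0.5, 0.0, 1.0, 2.0)
    print(exact, estimate)
```

An algorithm satisfies (α, ε)-RDP when this divergence between its output distributions on any two neighboring datasets is at most ε, so an empirical lower bound on the divergence that exceeds the claimed ε would contradict the claim.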