arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.22047 2026-05-22 cs.AI

Active Evidence-Seeking and Diagnostic Reasoning in Large Language Models for Clinical Decision Support

Chen Zhan, Xihe Qiu, Xiaoyu Tan, Xibing Zhuang, Gengchen Ma, Yue Zhang, Shuo Li, Peifeng Liu, Xiaoxiao Ge, Liang Liu, Lu Gan

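AI summary: This study examines active evidence seeking and diagnostic reasoning by large language models in clinical decision support. It introduces an OSCE-inspired standardized patient simulator and a controlled, reproducible benchmark, and finds that multi-turn evidence seeking lowers both diagnostic accuracy and supporting-evidence quality. The results suggest that static full-context benchmarks may overestimate performance in interactive evidence-seeking settings, so complementary interactive evaluation is needed for safer clinical decision support.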

Abstract

Large language models perform well on static medical examinations, yet clinical diagnosis often requires iterative evidence gathering under uncertainty. Building on prior interactive evaluation efforts, we introduce an OSCE-inspired standardized patient simulator and a controlled, reproducible benchmark for active diagnostic inquiry. Across 468 cases and 15 models in our protocol, we observe that multi-turn evidence seeking reduces diagnostic accuracy by 12.75% and lowers supporting-evidence quality by 24.36% relative to full-context evaluation; error analyses associate these drops with premature diagnostic closure and inefficient questioning. Together, these results suggest that static full-context benchmarks may overestimate performance in interactive evidence-seeking settings, motivating complementary interactive assessment for safer clinical decision support.

2605.22044 2026-05-22 cs.CV

Physiology and Anatomy Aware Inverse Inference of Myocardial Infarction for Cardiac Digital Twin

Mengxiao Wang, Yilin Lyu, Julia Camps, Ching Hui Sia, Mark Yan-Yee Chan, Yanrui Jin, Shuzhi Sam Ge, Chengliang Liu, Lei Li

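AI summary: This paper proposes a noninvasive myocardial infarction localization method based on cardiac digital twins. It integrates cine imaging with ECG and uses a Physiology and Anatomy Aware Network (PAA-Net) to infer the shape and location of infarct regions more accurately, improving the accuracy and interpretability of inverse inference.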

Comments Early-accepted by MICCAI 2026. This version corresponds to the submitted version. The final version will be available on Springer Link

Abstract

Accurate localization of myocardial infarction is essential for risk stratification. While LGE-MRI remains the gold standard, it is resource-intensive. Integrating cine MRI with ECG enables a more detailed representation of infarct properties. Existing inverse MI inference methods overlook realistic scar morphology and cardiac repolarization, reducing sensitivity to subtle ECG variations and interpretability of infarct-induced electrophysiological changes. In this paper, we propose a novel framework for noninvasive MI localization using cardiac digital twins. To bridge the domain gap between simulation and reality, we introduce an anatomy-aware stochastic infarct synthesis strategy to synthesize realistic, irregular scars with border zones, mimicking ischemic transmural progression. We then construct a virtual cohort to simulate QRS-T waveforms, capturing both depolarization and repolarization dynamics. Furthermore, we design a Physiology and Anatomy Aware Network (PAA-Net) that jointly encodes 3D myocardial geometry and multi-lead ECGs to infer infarct areas with varying localizations, sizes, spatial extents, and transmuralities. Experimental results demonstrate that our framework significantly outperforms existing methods in inverse inference, achieving Dice scores of 0.7391 and 0.5503 for scar and border zone segmentation, respectively, while further enhancing the interpretability of the ECG-infarct relationship. Our code will be released upon acceptance.

2605.22043 2026-05-22 cs.LG

CASE-NET: Deep Spatio-Temporal Representation Learning via Causal Attention and Channel Recalibration for Multivariate Time Series Classification

Fan Zhang, Yating Cui, Hua Wang

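AI summary: This paper proposes CASE-NET, which uses causal attention and a channel recalibration module to address inaccurate spatio-temporal representations in multivariate time series classification. It sets new state-of-the-art results on four tasks, with a peak accuracy of 98.6%.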

Comments 9 pages, 6 figures, 2 tables

Abstract

Multivariate time series (MTS) classification is foundational to pervasive computing and financial analysis, yet existing multi-scale paradigms are often constrained by suboptimal representation fidelity. We identify two critical bottlenecks: temporal non-causality in standard encoders that induces temporal confounding in non-stationary dynamics, and the absence of explicit channel saliency mechanisms that allows noise to contaminate the latent space. To address these challenges, we propose the Causal Attention and Spatio-temporal Encoder Network (CASE-NET), an architecture designed for structural manifold pre-conditioning. CASE-NET synergizes a Causal Temporal Encoder, which enforces physical arrow-of-time constraints via masked self-attention and causal convolutions, with an Adaptive Channel Recalibration module functioning as an information bottleneck to suppress detrimental noise. Comprehensive evaluations across six heterogeneous domains demonstrate that CASE-NET establishes new state-of-the-art benchmarks on four tasks, achieving a peak accuracy of 98.6% on the AWR dataset and superior robustness in non-stationary regimes.
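
The "arrow-of-time" constraint mentioned above can be illustrated with plain masked self-attention. The following is a minimal sketch, not CASE-NET itself: the function name `causal_self_attention` and the list-of-lists tensor layout are assumptions made for illustration, and the causal convolutions and channel recalibration module are omitted.

```python
import math


def causal_self_attention(queries, keys, values):
    """Scaled dot-product attention in which step t attends only to steps <= t.

    Illustrative sketch: inputs are lists of equal-length float vectors,
    one vector per time step.
    """
    dim = len(queries[0])
    outputs = []
    for t, q in enumerate(queries):
        # Score the query against past and present keys only; future steps
        # are never visited, which is equivalent to masking them out.
        scores = [
            sum(a * b for a, b in zip(q, keys[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(x - peak) for x in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append([
            sum(w * values[s][d] for s, w in enumerate(weights))
            for d in range(len(values[0]))
        ])
    return outputs
```

Because each output depends only on earlier steps, changing a later observation leaves every earlier output unchanged, which is the property a causal encoder is meant to guarantee.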

2605.22036 2026-05-22 cs.CV cs.AI

GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation

Jiahao Yang, Zihan Wang, Xiangyang Li, Xing Zhu, Yujun Shen, Yinghao Xu, Shuqiang Jiang

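AI summary: This paper proposes the GA-VLN framework, which introduces a geometry-aware bird's-eye-view representation (GA-BEV) that integrates explicit and implicit geometric information to improve the efficiency and performance of vision-language navigation. Experiments show state-of-the-art results using only navigation data.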

Abstract

Despite significant progress in Vision-Language Navigation (VLN), existing approaches still rely on dense RGB videos that produce excessive patch tokens and lack explicit spatial structure, resulting in substantial computational overhead and limited spatial reasoning. To address these issues, we introduce the Geometry-Aware BEV (GA-BEV), a compact, 3D-grounded feature representation that integrates both explicit and implicit geometric cues into multimodal large language model (MLLM)-based navigation systems. We construct BEV spatial maps from RGB-D inputs by projecting visual features into 3D space and aggregating them into an agent-centric layout that preserves geometric consistency while reducing token redundancy. To further enrich geometric understanding, we incorporate features from a pretrained 3D foundation model into the BEV space, injecting structural priors learned from large-scale 3D reconstruction tasks. Together, these complementary cues (explicit depth-based projection and implicit learned priors) yield compact yet spatially expressive representations that substantially improve navigation efficiency and performance. Experiments show that our method achieves state-of-the-art results using only navigation data, without DAgger augmentation or mixed VQA training, demonstrating the robustness and data efficiency of the proposed GA-VLN framework.

2605.22035 2026-05-22 cs.CV cs.CL

HyLoVQA: Dynamic Hypernetwork-Generated Low-Rank Adaptation for Continual Visual Question Answering

Yiran Wang, Chenyi Xiong, Ziyue Qin, Miao Zhang, Kui Xiao, Zhifei Li

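AI summary: HyLoVQA uses a dynamic hypernetwork to generate low-rank adapters, addressing task interference in continual visual question answering and improving adaptation to the current task and object.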

Comments Accepted by IJCAI 2026

Abstract

Continual Visual Question Answering (VQA) requires learning from non-stationary streams of visual inputs and questions while preserving past knowledge. Most prior methods adapt by updating a largely shared parameter set. This often leads to cross-level task interference, hindering accurate adaptation to the current task and object. To address this limitation, we propose HyLoVQA. It maintains a drift-resilient memory bank of anchors. The bank stores the content of visual objects and textual tasks, and they are updated using current input features. Conditioned on retrieved anchors, a hypernetwork generates lightweight Low-Rank Adaptation (LoRA) adapters. This ensures parameter efficiency, allowing the model to adapt to each task and object dynamically. Additionally, we formulate an alignment loss that aligns semantic discrepancies in the feature space with functional changes in the parameter space, thereby constraining LoRA adapters to remain focused on the current task and object. Extensive experiments on VQA v2 and NExT-QA under both standard and compositional settings demonstrate the superiority of HyLoVQA over prior state-of-the-art methods.

2605.22034 2026-05-22 cs.CV cs.AI

AgroVG: A Large-Scale Multi-Source Benchmark for Agricultural Visual Grounding

Haocheng Li, Juepeng Zheng, Zenghao Yang, Kaiqi Du, Guilong Xiao, Gengmeng Pu, Haohuan Fu, Jianxi Huang

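AI summary: This paper presents the AgroVG benchmark for evaluating agricultural visual grounding. Using multi-source datasets and task-specific protocols, it evaluates models in multi-target, multi-instance, and target-absent scenarios, revealing the shortcomings of existing models on agricultural visual grounding.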

Comments 45 pages, 12 figures

Abstract

Visual grounding, the task of localizing objects described by natural-language expressions, is a foundational capability for agricultural AI systems, enabling applications such as selective weeding, disease monitoring, and targeted harvesting. Reliable evaluation of agricultural visual grounding remains challenging because agricultural targets are often small, repetitive, occluded, or irregularly shaped, and instructions may refer to one, many, or no objects in an image. Evaluating this capability therefore requires jointly testing localization accuracy, target-set completeness, and existence-aware abstention. To address these challenges, we introduce AgroVG, a multi-source benchmark that formulates agricultural grounding as generalized set prediction: given an image and a referring expression, a model must return all matching target instances or abstain when no target is present. AgroVG contains 10,071 annotation-grounded image-query pairs from ten source datasets across six target families: crop/weed, fruit, wheat head, pest, plant disease, and tree canopy. It supports bounding-box grounding (T1) across all six families and instance-mask grounding (T2) on sources with reliable instance-level pixel annotations, with queries covering single-target, multi-target, and target-absent regimes. AgroVG further provides task-specific protocols for box-set matching and query-level mask coverage. Zero-shot evaluation of 26 model configurations spanning closed-source MLLMs, open-source VLMs, and specialized grounding systems reveals persistent gaps: the best multi-target Set-F1 reaches only 0.35, and the best positive-query mask success rate at IoU@0.75 remains below 0.17. Data and code are available at https://anonymous.4open.science/r/AgroVG-5172/.
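
To make the idea of set-level scoring with abstention concrete, here is a minimal sketch of one way a box-set F1 could be computed. It is not the AgroVG protocol: the greedy one-to-one matching, the 0.5 IoU threshold, and the names `box_iou` and `set_f1` are all assumptions chosen for illustration.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def set_f1(predicted, targets, iou_threshold=0.5):
    """F1 between a predicted box set and a target box set (hypothetical protocol)."""
    if not predicted and not targets:
        return 1.0  # the model correctly abstained on a target-absent query
    if not predicted or not targets:
        return 0.0  # missed every target, or answered when it should have abstained
    # Greedy one-to-one matching, highest IoU first.
    pairs = sorted(
        ((box_iou(p, t), i, j)
         for i, p in enumerate(predicted)
         for j, t in enumerate(targets)),
        reverse=True,
    )
    used_pred, used_target = set(), set()
    for iou, i, j in pairs:
        if iou < iou_threshold:
            break
        if i not in used_pred and j not in used_target:
            used_pred.add(i)
            used_target.add(j)
    matches = len(used_pred)
    if matches == 0:
        return 0.0
    precision = matches / len(predicted)
    recall = matches / len(targets)
    return 2 * precision * recall / (precision + recall)
```

Under this sketch, a model that localizes one of two targets perfectly still loses recall, which is why multi-target queries are harder to score well on than single-target ones.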

2605.22031 2026-05-22 cs.CV

SO-Mamba: State-Ownership Mamba for Unrolled MRI Reconstruction

Pengcheng Fang, Hongli Chen, Fangfang Tang, Feng Liu, Xiaohao Cai, Shanshan Shan

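AI summary: This paper proposes SO-Mamba, a state-ownership Mamba regularizer for unrolled MRI reconstruction. It assigns the reconstruction evidence in each Mamba stage to recurrent residency, state-interface access, and non-state output correction to improve reconstruction quality and efficiency.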

Abstract

Accelerated MRI reconstruction requires recovering missing details while preserving anatomically coherent structures across large spatial regions. State-space models such as Mamba provide efficient long-range modeling, making them attractive learned regularizers for unrolled reconstruction. However, in a data-consistency-coupled unrolled solver, different stages operate on different reconstruction iterates, where the resident carrier should preserve coherent reconstruction content across stages while stage-dependent non-resident evidence is tied to the current update. Treating these roles uniformly can place persistent resident-carrier evidence and update-dependent non-resident evidence into the same recurrent content route. We therefore propose SO-Mamba, a state-ownership Mamba regularizer that assigns reconstruction evidence within each Mamba stage to recurrent residency, state-interface access, and non-state output correction. SO-Mamba implements this ownership rule with a State-Ownership Router (SOR), which constructs a resident carrier for recurrent content and routes non-resident evidence to affine modulation of the B/C state interfaces and an output correction outlet. The resident carrier supplies the Mamba content route, while the non-resident evidence stream adapts the state interfaces and contributes through the output outlet without entering the recurrent content route. We further introduce a two-level outer-band leakage diagnostic that separates hidden-state storage from readout expression by measuring outer-band energy in the selective-scan state trajectory and the post-scan Mamba readout. Experiments on five public MRI reconstruction benchmarks spanning diverse anatomies, sampling patterns, and coil configurations show that SO-Mamba consistently improves over CNN-, Transformer-, and Mamba-based baselines with competitive computational efficiency.

2605.22021 2026-05-22 cs.RO

Industrial Dual-Arm Box Handling via Online Inertial Estimation and Convex Wrench Optimization

Kenzhi Iskandar Wong, Lin Yang, Qian Ying Lee, Domenico Campolo

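AI summary: This paper proposes a friction-aware dual-arm box-handling framework for objects with unknown inertial properties. It estimates the object's mass and center of mass online and uses a second-order cone program under ellipsoidal friction-limit-surface constraints to compute friction-feasible contact forces and torsional moments, achieving stable handling.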

Comments 14 pages, submitted to Robotics and Computer-Integrated Manufacturing (RCIM) Journal

Abstract

Industrial robotic object handling often involves boxes and packages whose mass and center of mass are not known in advance. These uncertainties affect the force-moment balance required for stable lifting, and improper regulation of contact wrenches can lead to slip, object drop, orientation deviation, or excessive squeezing. This paper presents a friction-aware dual-arm box-handling framework for objects with unknown inertial properties. The proposed approach estimates the object mass and center of mass online from measured contact wrenches, and computes friction-feasible contact forces and torsional moments through a second-order cone program (SOCP) under ellipsoidal friction-limit-surface constraints. An offline trajectory refinement stage is also included to reduce undesired object-environment contact when geometric constraints are present. By enforcing friction feasibility as a hard constraint and minimizing contact effort within the feasible region, the framework achieves stable lifting without treating slip avoidance and excessive squeezing as separately tuned objectives. Experiments on a real dual-arm robotic system under different center-of-mass configurations demonstrate that the method lifts objects with unknown inertial properties while maintaining stable frictional contact.
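
An ellipsoidal friction limit surface can be written as a single second-order cone constraint, which is what makes an SOCP formulation natural. The sketch below assumes one common parameterization, (|f_t| / (mu f_n))^2 + (tau / (r mu f_n))^2 <= 1, where `mu` is the friction coefficient and `torsion_radius` (r) scales the maximum torsional moment. The paper's exact parameterization is not given in the abstract, so every name and parameter here is hypothetical.

```python
import math


def within_limit_surface(normal_force, tangential_force, torsional_moment,
                         mu, torsion_radius):
    """Check one patch contact against an assumed ellipsoidal limit surface.

    tangential_force is an (fx, fy) pair in the contact plane.
    """
    if normal_force <= 0:
        return False  # the contact must be pressing to transmit friction
    f_max = mu * normal_force          # largest pure tangential force
    tau_max = torsion_radius * f_max   # largest pure torsional moment
    f_t = math.sqrt(tangential_force[0] ** 2 + tangential_force[1] ** 2)
    return (f_t / f_max) ** 2 + (torsional_moment / tau_max) ** 2 <= 1.0


def min_normal_force(tangential_force, torsional_moment, mu, torsion_radius):
    """Smallest squeezing force that keeps the contact on or inside the surface.

    Rearranging the ellipsoid gives the cone constraint
    ||(fx, fy, tau / r)|| <= mu * f_n, so the bound is the norm divided by mu.
    """
    norm = math.sqrt(
        tangential_force[0] ** 2
        + tangential_force[1] ** 2
        + (torsional_moment / torsion_radius) ** 2
    )
    return norm / mu
```

Minimizing contact effort subject to this constraint pushes the normal force toward `min_normal_force`, which illustrates how slip avoidance and limited squeezing can follow from one feasibility condition instead of two separately tuned objectives.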

2605.22017 2026-05-22 cs.CV

Diverse Yet Consistent: Context-Guided Diffusion with Energy-Based Joint Refinement for Multi-Agent Motion Prediction

Lei Chu, Yuhuan Zhao

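AI summary: This paper proposes a diffusion-based framework that improves multi-agent motion prediction by exploiting rich contextual information from historical trajectories. A guidance mechanism enhances the diversity and expressiveness of predicted motions, and an energy-based formulation refines the joint trajectory distribution while preserving the plausibility of individual trajectories. Experiments on multiple benchmark datasets show that it outperforms existing methods.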

Comments MEIS-- CVPR

Abstract

Deep generative models have become a promising approach for human motion prediction due to their ability to capture multimodal distributions and represent diverse human behaviors. However, generating predictions that are both diverse and jointly consistent among interacting agents remains challenging. In addition, most existing approaches are primarily evaluated using single-agent (marginal) metrics, which fail to fully reflect the joint dynamics of multi-agent interactions. We propose a diffusion-based framework that improves multi-agent motion prediction by leveraging rich contextual information from historical trajectories. This information is incorporated through a guidance mechanism to enhance the diversity and expressiveness of predicted motions. To further enforce interaction consistency, we introduce an energy-based formulation that refines the joint trajectory distribution while preserving the plausibility of individual trajectories. Extensive experiments on four benchmark datasets demonstrate that our approach consistently outperforms existing methods. Notably, our approach substantially improves both marginal (ADE/FDE) and joint (JADE/JFDE) metrics on ETH/UCY over strong marginal baselines. Compared with prior joint prediction methods, it delivers significant gains in marginal metrics while maintaining competitive joint performance.
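
The difference between marginal and joint metrics is easiest to see in code. The sketch below uses the commonly used best-of-K definitions, which are an assumption here because the abstract does not spell them out: for ADE/FDE each agent picks its own best sample, while for JADE/JFDE a single sample index must serve all agents. The function name `displacement_metrics` and the data layout are illustrative.

```python
import math


def displacement_metrics(samples, truth):
    """Marginal (ADE/FDE) and joint (JADE/JFDE) best-of-K errors.

    samples[k][a] is the predicted trajectory of agent a in joint sample k;
    truth[a] is that agent's ground-truth trajectory. Points are (x, y) tuples.
    """
    num_agents = len(truth)
    avg, fin = [], []  # avg[k][a]: mean error, fin[k][a]: final-step error
    for sample in samples:
        errs = [
            [math.dist(p, g) for p, g in zip(sample[a], truth[a])]
            for a in range(num_agents)
        ]
        avg.append([sum(e) / len(e) for e in errs])
        fin.append([e[-1] for e in errs])

    def marginal(table):
        # Each agent chooses its own best sample, then average over agents.
        return sum(min(row[a] for row in table) for a in range(num_agents)) / num_agents

    def joint(table):
        # One sample index is chosen for the whole scene.
        return min(sum(row) / num_agents for row in table)

    return {
        "ADE": marginal(avg), "FDE": marginal(fin),
        "JADE": joint(avg), "JFDE": joint(fin),
    }
```

A set of samples in which each agent is predicted correctly in a different sample scores perfectly on ADE yet poorly on JADE, which is the gap between marginal and joint evaluation that the abstract points to.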

2605.22015 2026-05-22 cs.CV cs.AR

ORBIS: Output-Guided Token Reduction with Distribution-Aware Matching for Video Diffusion Acceleration

Hangyeol Lee, Joo-Young Kim

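AI summary: This paper proposes ORBIS, an SW-HW co-designed accelerator for video diffusion Transformers. It uses output activations from the previous timestep to obtain more accurate token similarity, which improves matching quality and permits a higher token reduction ratio. Together with a distribution-aware token matching algorithm and dedicated hardware, it achieves a higher token reduction ratio, faster speed, and lower energy consumption than existing methods.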

Abstract

Diffusion Transformer (DiT) has emerged as a powerful model architecture for generating high-quality images and videos. In the case of video DiT, 3D Spatio-Temporal Attention increases token length in proportion to the number of frames, sharply increasing computational cost. Token reduction methods mitigate this cost by exploiting spatial redundancy, but existing approaches rely on inaccurate similarity estimates and lightweight matching algorithms, resulting in poor matching quality and only marginal acceleration. To overcome these limitations, we propose ORBIS, an SW-HW co-designed accelerator for video DiT. ORBIS leverages the output activation from the previous timestep to obtain more accurate inter-token similarity, substantially improving matching quality and enabling a higher token reduction ratio. We further introduce a Distribution-Aware Token Matching (DATM) algorithm that captures global token distribution and explicitly minimizes token-pair loss for additional gains. To fully hide DATM latency, we design specialized, deeply pipelined hardware and minimize its hardware cost through quantization, occupying only 2.4% of total area with negligible accuracy loss. Extensive experiments show that ORBIS achieves about 2x higher token reduction ratio than the state-of-the-art approach, AsymRnR, while delivering up to 4.5x speedup and 79.3% energy reduction compared to an NVIDIA A100 GPU.

2605.22013 2026-05-22 cs.CV cs.GR cs.LG

PointLLM-R: Enhancing 3D Point Cloud Reasoning via Chain-of-Thought

Chaoqi Chen, Qile Xu, Wenjun Zhou, Hui Huang

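AI summary: This paper proposes a data-centric framework for building large-scale chain-of-thought supervision to improve 3D point cloud understanding. A two-stage pipeline refines point-text instruction data and synthesizes high-quality reasoning paths, producing the 55K-sample PoCoTI dataset. PointLLM-R, trained on it, is a reasoning-capable 3D multimodal language model that achieves state-of-the-art performance on generative 3D classification and captioning.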

Abstract

Understanding 3D point clouds through language remains a fundamental challenge in computer graphics and visual computing, due to the irregular structure of point cloud data and the lack of explicit reasoning in existing 3D multimodal models. While Chain-of-Thought (CoT) reasoning has shown strong effectiveness in LLMs and image-based MLLMs, its extension to 3D understanding remains largely underexplored. In this paper, we propose a data-centric framework for constructing large-scale CoT supervision tailored to 3D point cloud understanding. Our framework consists of a two-stage pipeline that first refines point-text instruction data via vision-language-model-based quality evaluation and reference-guided refinement, and then synthesizes high-quality reasoning paths through Human-in-the-Loop Prompt Optimization (HiLPO). Using this approach, we build PoCoTI, a CoT-enhanced point-text instruction-following dataset containing 55K samples with explicit reasoning paths. Fine-tuning PointLLM on PoCoTI yields PointLLM-R, a reasoning-capable 3D multimodal language model. Extensive experiments on generative 3D classification and captioning demonstrate that PointLLM-R achieves state-of-the-art performance and generalizes robustly to real-world scanned point clouds and multi-turn dialogue scenarios.

2605.22012 2026-05-22 cs.CL cs.CV

LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

Yifan Dai, Zhenhua Wu, Bohan Zeng, Daili Hua, Jialing Liu, Bozhou Li, Yuran Wang, Chengzhuo Tong, Hao Liang, Xiaochen Ma, Junbo Niu, Tianyu Guo, Yang Shi, Yue Ding, Yiyan Ji, Bingyin Mei, Yushuo Guan, Yuanxing Zhang, Pengfei Wan, Fangcheng Fu, Wentao Zhang

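AI summary: This paper proposes LatentOmni, a framework that performs multimodal reasoning in a unified audio-visual latent space. It uses feature-level supervision and Omni-Sync Position Embedding (OSPE) to maintain temporal consistency, and achieves the best performance among the evaluated open-source models on multiple audio-visual reasoning benchmarks.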

Comments 21 pages, 15 figures

Abstract

Joint audio-visual reasoning is essential for omnimodal understanding, yet current multimodal large language models (MLLMs) still struggle when reasoning requires fine-grained evidence from both modalities. A central limitation is that explicit text-based chain-of-thought (CoT) compresses continuous audio-visual signals into discrete tokens, weakening temporal grounding and shifting intermediate reasoning toward language priors. We argue that a unified latent space is a better medium for such reasoning because it preserves dense sensory information while remaining compatible with autoregressive generation. Based on this insight, we propose LatentOmni, a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states. LatentOmni introduces feature-level supervision to align latent reasoning states with task-relevant sensory features and uses Omni-Sync Position Embedding (OSPE) to maintain temporal consistency between latent audio and visual states. We further construct LatentOmni-Instruct-35K, a dataset of audio-visual interleaved reasoning trajectories for supervising latent-space reasoning. Comprehensive evaluation across multiple audio-visual reasoning benchmarks demonstrates that LatentOmni achieves the best performance among the evaluated open-source models and consistently outperforms the Explicit Text CoT baseline, supporting latent-space joint reasoning as a promising path toward stronger omnimodal understanding.

2605.22011 2026-05-22 cs.CV

Rethinking Token Reduction for Diffusion Models via Output-Similarity-Awareness

Hangyeol Lee, Hyojeong Lee, Joo-Young Kim

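AI summary: This paper proposes DiTo, an output-centric token reduction method that uses output similarity across adjacent timesteps to establish token correspondences, reducing computational cost while improving generation quality.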

Abstract

Diffusion Transformers (DiTs) achieve superior image generation quality but suffer from quadratic computational complexity relative to token count. While various token reduction (TR) methods have been proposed to mitigate this cost, they overlook the primary objective of generative models: minimizing recovery error, which requires reflecting output token similarity. They rely solely on input token similarity inherited from reduction-only ViT paradigms, leading to a fundamental misalignment with this objective. To bridge this gap, we propose DiTo, a novel TR paradigm that shifts the focus toward output-centric token reduction. Based on the observation that output token similarity is consistently preserved across adjacent timesteps, DiTo utilizes prior-step similarities as an effective proxy to establish token correspondences at a Matching timestep, which are then reused across multiple subsequent Reduction timesteps. To optimize this interleaved scheduling, we propose Pair Match Ratio (PMR)-guided Interval Scheduling to determine the optimal matching frequency. Furthermore, to mitigate localized approximation errors and resulting blocking artifacts caused by repeated reuse, we propose Frequency-aware Token Matching by incorporating a selection-frequency penalty. Extensive experiments demonstrate that DiTo consistently outperforms existing TR methods with 1.6-3.9 dB higher PSNR at comparable speedups, achieving a superior Pareto frontier.
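
The match-then-reuse idea can be sketched in a few lines. This is an illustration under stated assumptions, not DiTo's algorithm: tokens are matched by cosine similarity of their previous-step outputs, the kept token set is supplied by the caller, and the PMR-guided scheduling and selection-frequency penalty are omitted. The names `match_tokens` and `restore` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def match_tokens(prev_outputs, kept):
    """Matching step: map each dropped token to its most similar kept token.

    Similarity is measured on the previous timestep's outputs, which serve
    as a proxy for the current step's output similarity.
    """
    kept_set = set(kept)
    mapping = {}
    for i, out in enumerate(prev_outputs):
        if i in kept_set:
            continue
        mapping[i] = max(kept, key=lambda j: cosine(out, prev_outputs[j]))
    return mapping


def restore(kept_outputs, mapping, num_tokens):
    """Reduction step: compute only kept tokens, then copy to dropped ones.

    kept_outputs maps kept token index -> freshly computed output vector.
    """
    full = [None] * num_tokens
    for j, out in kept_outputs.items():
        full[j] = out
    for i, j in mapping.items():
        full[i] = kept_outputs[j]
    return full
```

The mapping returned by `match_tokens` can be reused by `restore` over several consecutive timesteps, so the matching cost is paid once per interval instead of once per step.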

2605.22007 2026-05-22 cs.CL

Hallucination as Commitment Failure: Larger LLMs Misfire Despite Knowing the Answer

Jewon Yeom, Jaewon Sok, Heejun Kim, Seonghyeon Park, Jeongjae Park, Taesup Kim

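AI summary: This paper studies why large LLMs hallucinate even when they know the correct answer. It finds that hallucination is determined by how probability is distributed over the correct concept during answer generation, not by whether the correct concept is represented.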

Abstract

Hallucination is often viewed as a direct consequence of missing knowledge: a model answers incorrectly when the correct answer is absent from its generation-time distribution, and correctly when it is present. We test this assumption by introducing a semantic notion of answer availability that aggregates token-level variants expressing the same answer concept, and asks whether the correct concept is already available at the moment the model commits to an answer. Across Qwen and Llama models from 0.8B to 72B in both Instruct and Base variants, 16-47% of Instruct hallucinations occur with substantial probability mass already on the correct concept, and the rate rises monotonically with scale. Comparing such failures against correct generations with matched semantic support, the distinguishing factor is not whether the correct concept is represented, but how its probability is distributed: correct generations concentrate mass on a single surface form, hallucinations disperse it across alternatives. The same sharpening asymmetry extends across multi-token generation and is detectable in pre-generation hidden states. Together, these results identify a single mechanism: instruction tuning sharpens answer commitment with scale, making helpfulness and confident hallucination two consequences of the same underlying disposition.
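
The distinction between semantic support and concentration can be made concrete with a toy next-token distribution. The sketch below is illustrative only: the function names, the variant list, and the example probabilities are invented for the demonstration and are not the paper's measurements or exact definitions.

```python
def concept_support(token_probs, variants):
    """Total probability mass on surface forms expressing one answer concept."""
    return sum(token_probs.get(v, 0.0) for v in variants)


def concentration(token_probs, variants):
    """Share of the concept's mass carried by its single most likely form."""
    total = concept_support(token_probs, variants)
    if total == 0:
        return 0.0
    return max(token_probs.get(v, 0.0) for v in variants) / total


def greedy_token(token_probs):
    """Token a greedy decoder would commit to."""
    return max(token_probs, key=token_probs.get)


# Toy distributions (made-up numbers): both put about 0.6 on the concept "Paris".
variants = ["Paris", "paris", "PARIS"]
sharp = {"Paris": 0.55, "paris": 0.05, "Lyon": 0.40}
dispersed = {"Paris": 0.20, "paris": 0.20, "PARIS": 0.20, "Lyon": 0.40}
```

In the dispersed case the correct concept holds most of the probability mass, yet a greedy decoder commits to the wrong token because no single surface form outweighs the alternative. That is the pattern the abstract describes: the concept is available, but its mass is spread across variants.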

2605.22003 2026-05-22 cs.CL cs.AI cs.IR cs.LG

From TF-IDF to Transformers: A Comparative and Ensemble Approach to Sentiment Classification

Dip Biswas Shanto, Mitali Yadav, Prajwal Panth, Suresh Chandra Satapathy

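AI summary: This paper compares several machine learning models for sentiment classification of movie reviews, including Naive Bayes, Logistic Regression, SVM, LightGBM, LSTM, and the Transformer-based RoBERTa and DistilBERT. RoBERTa achieves the best accuracy, and a soft voting ensemble of all models further improves classification performance.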

Comments 6 pages, 9 figures. This is the author's accepted manuscript, presented at the International Conference on Intelligent Computing, Networks and Security (IC-ICNS 2026), March 26-28, Bhubaneswar, India. Proceedings publication pending

Abstract

Sentiment analysis, also referred to as opinion mining, primarily tries to extract opinion from any text-based data. In the context of movie reviews and critics, sentiment analysis can be a helpful tool to predict whether a movie review is generally positive or negative. It can be difficult for ML models to understand context or metaphysical sentiment accurately, as they rely largely on statistical word representations. The objective of this paper is to examine and categorise movie reviews into positive and negative sentiments. Diverse machine learning models are considered in doing so, and Natural Language Processing (NLP) methodologies are employed for data preprocessing and model assessment. The IMDb dataset is used. Specifically, Naive Bayes, Logistic Regression, Support Vector Machines (SVM), LightGBM, LSTM, and transformer-based models such as RoBERTa and DistilBERT were evaluated. Evaluated on accuracy, precision, recall, F1-score, and ROC-AUC, RoBERTa outperformed all the other models, with an accuracy of 93.02%. A soft voting ensemble that combined all the models also improved classification performance, showing that model ensembling works well for sentiment analysis.
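
Soft voting averages the class probabilities predicted by each model instead of counting their hard labels. The following is a minimal sketch of that rule: the function name `soft_vote`, the optional weights, and the example probabilities are illustrative assumptions, not the paper's configuration.

```python
def soft_vote(model_probs, weights=None):
    """Combine per-model class probabilities by (weighted) averaging.

    model_probs[m][c] is model m's probability for class c.
    Returns (winning class index, averaged probabilities).
    """
    weights = weights or [1.0] * len(model_probs)
    total = sum(weights)
    num_classes = len(model_probs[0])
    averaged = [
        sum(w * probs[c] for w, probs in zip(weights, model_probs)) / total
        for c in range(num_classes)
    ]
    label = max(range(num_classes), key=averaged.__getitem__)
    return label, averaged


# Made-up example with classes [negative, positive]: two models lean
# slightly negative, one is strongly positive.
example = [[0.60, 0.40], [0.55, 0.45], [0.10, 0.90]]
label, averaged = soft_vote(example)
```

A majority vote over hard labels would call this review negative, while soft voting lets the one confident model outweigh two uncertain ones. This sensitivity to confidence is the usual argument for soft voting when the base models produce calibrated probabilities.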

2605.22002 2026-05-22 cs.CV

ConvNeXt-FD: A Fractal-Based Deep Model for Robust Biomedical Image Segmentation

ConvNeXt-FD:一种基于分形的深度模型用于鲁棒的生物医学图像分割

Joao Batista Florindo, Amanda Pontes de Oliveira Ornelas

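AI summary This paper proposes ConvNeXt-FD, a fractal-based deep learning model for more robust biomedical image segmentation. It combines the Dice coefficient with a boundary-aware regularization term to make the model more sensitive to object boundaries and shape fidelity.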

详情
AI中文摘要

生物医学图像分割是医疗诊断和治疗计划中的关键任务,能够精确勾勒解剖结构和病理区域。尽管取得了显著进展,但由于不同医学成像模态中固有的变异性、噪声和复杂的形态,仍存在挑战。本文介绍了一种新的深度学习架构ConvNeXt-FD,基于类似U-Net的编码器-解码器框架,利用强大的ConvNeXt主干网络。我们的方法结合了一种混合损失函数,该函数结合了Dice系数和受可微分分形维度公式启发的边界感知正则化项,旨在增强模型对物体边界和形状保真的敏感性。我们严格评估了ConvNeXt-FD在六个不同的生物医学数据集上的表现:BUSI(乳腺超声图像)、DDTI(甲状腺超声图像)、FluoCells(荧光细胞图像)、IDRiD(糖尿病视网膜病变图像用于视盘分割)、ISIC2018(皮肤病变图像)和MoNuSeg(核分割)。实验结果表明,ConvNeXt-FD,特别是在使用ImageNet预训练权重初始化时,在各种指标上(包括Dice、Jaccard、准确率、灵敏度、特异度和假阳性率)均表现出竞争性甚至更优的性能。ConvNeXt作为强大编码器的结合,与边界感知正则化相结合,证明了在挑战性的生物医学上下文中捕获高级语义特征和细粒度边界细节的有效性,从而实现更准确和可靠的分割。

英文摘要

Biomedical image segmentation is a critical task in medical diagnosis and treatment planning, enabling precise delineation of anatomical structures and pathological regions. Despite significant advancements, challenges persist due to the inherent variability, noise, and complex morphology present in diverse medical imaging modalities. This paper introduces ConvNeXt-FD, a novel deep learning architecture for robust biomedical image segmentation, built upon a U-Net-like encoder-decoder framework leveraging the powerful ConvNeXt backbone. Our approach integrates a hybrid loss function combining the Dice coefficient with a boundary-aware regularization term inspired by a differentiable formulation of Fractal Dimension, designed to enhance the model's sensitivity to object boundaries and shape fidelity. We rigorously evaluate ConvNeXt-FD across six distinct biomedical datasets: BUSI (Breast Ultrasound Images), DDTI (Thyroid Ultrasound Images), FluoCells (Fluorescent Cell Images), IDRiD (Diabetic Retinopathy Images for Optic Disc Segmentation), ISIC2018 (Skin Lesion Images), and MoNuSeg (Nuclei Segmentation). Experimental results demonstrate that ConvNeXt-FD, particularly when initialized with ImageNet pre-trained weights, achieves competitive and often superior performance compared to existing state-of-the-art methods across various metrics, including Dice, Jaccard, Accuracy, Sensitivity, Specificity, and False Positive Rate. The integration of ConvNeXt as a strong encoder, coupled with the boundary-aware regularization, proves effective in capturing both high-level semantic features and fine-grained boundary details, leading to more accurate and reliable segmentations in challenging biomedical contexts.
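
For readers unfamiliar with the overlap metrics named above, here is a small sketch of the Dice coefficient, the Jaccard index, and a plain Dice loss on flat binary masks. It is illustrative only: the fractal-dimension regularizer is not specified in the abstract and is deliberately left out, and the function names are hypothetical.

```python
def dice_coefficient(pred, target):
    """Dice = 2|A and B| / (|A| + |B|) for flat binary masks (lists of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total

def jaccard_index(pred, target):
    """Jaccard = |A and B| / |A or B| for flat binary masks."""
    intersection = sum(p * t for p, t in zip(pred, target))
    union = sum(pred) + sum(target) - intersection
    return 1.0 if union == 0 else intersection / union

def dice_loss(pred, target):
    """Plain Dice loss; the paper's hybrid loss adds a boundary-aware term."""
    return 1.0 - dice_coefficient(pred, target)

pred_mask = [1, 1, 0, 0]
true_mask = [1, 0, 0, 0]
print(dice_coefficient(pred_mask, true_mask), jaccard_index(pred_mask, true_mask))
```

The two metrics are monotonically related (J = D / (2 - D)), so they rank segmentations identically even though their values differ.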

2605.22000 2026-05-22 cs.CV cs.AI

Virtual 3D H&E Staining from Phase-contrast Back-illumination Interference Tomography

从相位对比背光干涉断层扫描生成虚拟3D的H&E染色

Anthony Song, Boyan Zhou, Mayank Golhar, Marisa Morakis, Alex Baras, Nicholas Durr

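AI summary This paper introduces HistoBIT3D, the first voxel-wise paired dataset of BIT and fluorescence-labeled nuclei, for quantitatively evaluating structural preservation in unsupervised virtual staining. Using this dataset, the authors propose a virtual staining framework that uses bidirectional multiscale content consistency and cross-domain style reuse to translate BIT volumes with shift-variant contrast into realistic H&E volumes, improving 3D nuclei segmentation accuracy and boundary preservation.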

详情
AI中文摘要

三维(3D)未处理组织的组织病理学有望变革疾病管理,因为它能够对组织微结构进行体积表征并支持活体评估。背光干涉断层扫描(BIT)是一种新的相位显微技术,能够对未处理组织进行快速、非破坏性的体积成像。然而,将BIT体积转化为临床可解释的H&E图像仍然具有挑战性,主要原因是移变对比度以及缺乏定量验证基准。我们引入HistoBIT3D,这是首个体素级配对的BIT与荧光标记细胞核数据集,使得能够以真实细胞核分布为参照,定量评估无监督虚拟染色中的结构保持情况。利用该数据集,我们提出了一种新的虚拟染色框架,通过双向多尺度内容一致性和跨域风格复用来增强结构保真度和感知真实感,将具有移变对比度的BIT体积转化为逼真的H&E体积。我们的方法在真实感指标上达到最先进水平,同时在零样本Cellpose评估下显著提高了3D细胞核分割精度和边界保持性。这些贡献共同建立了一个经过定量验证、结构忠实且可扩展的3D虚拟H&E染色流程,推动了无切片、体积式计算组织病理学范式的发展。我们的数据和代码可在以下地址获取:https://github.com/aasong113/HistoBIT3D_VirtualStaining

英文摘要

Three-dimensional (3D) histopathology of unprocessed tissues has the potential to transform disease management by enabling volumetric characterization of tissue microarchitecture and in-vivo assessment. Back-illumination Interference Tomography (BIT) is a new phase microscopy technology that provides rapid, non-destructive volumetric imaging of unprocessed tissues. However, translating BIT volumes into clinically interpretable H&E images remains challenging, particularly due to shift-variant contrast and the absence of quantitative validation benchmarks. We introduce HistoBIT3D, the first voxel-wise paired BIT and fluorescence-labeled nuclei dataset, enabling quantitative evaluation of structural preservation in unsupervised virtual staining against ground-truth nuclear distributions. Using this dataset, we present a novel virtual staining framework that translates BIT volumes with shift-variant contrast into realistic H&E volumes by leveraging bidirectional multiscale content consistency and cross-domain style reuse to enhance structural fidelity and perceptual realism. Our method achieves state-of-the-art realism metrics while significantly improving 3D nuclei segmentation accuracy and boundary preservation under zero-shot Cellpose evaluation. Together, these contributions establish a quantitatively validated, structurally faithful, and scalable pipeline for 3D virtual H&E staining, advancing the paradigm of slide-free, volumetric computational histopathology. Our data and code are available at: https://github.com/aasong113/HistoBIT3D_VirtualStaining.

2605.21999 2026-05-22 cs.LG

Toward Understanding Adversarial Distillation: Why Robust Teachers Fail

迈向理解对抗蒸馏:为何鲁棒教师失败

Hongsin Lee, Hye Won Chung

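AI summary This paper studies how teacher robustness relates to student robustness in adversarial distillation. It shows that a mismatch between the teacher's supervisory confidence and the student's representational limitations drives robust overfitting, and it supports this with a theoretical framework and experiments.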

Comments Accepted to ICML 2026. Code is available at https://github.com/HongsinLee/why-robust-teachers-fail

详情
AI中文摘要

对抗蒸馏旨在通过在最小-最大对抗训练框架内利用鲁棒教师的软标签来增强学生的鲁棒性,但其成功却往往不一致:更鲁棒的教师往往无法提升甚至损害学生的鲁棒泛化能力。本文识别了这一教师依赖的关键机制:教师监督信心与学生表示限制在一致训练数据子集上的不匹配——鲁棒不可学集。我们提出了一个理论框架,分析了两层神经网络的特征学习动态,证明这种不匹配导致蒸馏结果的二元性。我们证明当教师在不可学样本上提供自信监督时,会迫使学生记忆虚假噪声模式,最终超过学习的鲁棒信号,从而驱动鲁棒过拟合。相反,教师在这些样本上表现出高不确定性时,会抑制噪声记忆,使学生仅依赖可学习信号进行鲁棒泛化。我们通过合成模拟和真实图像分类数据集验证了我们的理论,确认鲁棒过拟合由教师与不可学样本的交互驱动。最后,我们证明教师在不可学样本上的预测熵是学生鲁棒性的一个强指标,验证了我们的理论框架并提供了鲁棒教师选择的指导原则。

英文摘要

Adversarial Distillation aims to enhance student robustness by guiding the student with a robust teacher's soft labels within the min-max adversarial training framework, yet its success is notoriously inconsistent: a more robust teacher often fails to improve, or even harms, the student's robust generalization. In this paper, we identify a key mechanism of this teacher dependency: the misalignment between the teacher's supervisory confidence and the student's representational limitations on a consistent subset of training data -- the Robustly Unlearnable Set. We present a theoretical framework analyzing the feature learning dynamics of a two-layer neural network, demonstrating that this mismatch creates a dichotomy in distillation outcomes. We prove that when a teacher provides confident supervision on unlearnable samples, it compels the student to memorize spurious noise patterns that eventually overpower the learned robust signal, thereby driving robust overfitting. Conversely, a teacher that exhibits high uncertainty on these samples effectively suppresses noise memorization, allowing the student to rely solely on the learnable signal for robust generalization. We empirically validate our theory across both synthetic simulations and real-image classification datasets, confirming that robust overfitting is driven by the teacher's interaction with unlearnable samples. Finally, we demonstrate that a teacher's predictive entropy on unlearnable samples serves as a strong indicator of student robustness, validating our theoretical framework and offering a principled guideline for robust teacher selection.
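
The teacher-selection signal described above, predictive entropy, can be sketched in a few lines. This is only an illustration of the quantity itself: the probability vectors are made up, and how the paper identifies the Robustly Unlearnable Set is not covered here.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical teacher outputs on one hard sample (illustrative values).
confident_teacher = [0.98, 0.01, 0.01]   # low entropy: confident supervision
uncertain_teacher = [0.34, 0.33, 0.33]   # high entropy: hedged supervision

print(predictive_entropy(confident_teacher))
print(predictive_entropy(uncertain_teacher))
```

Under the abstract's claim, a teacher whose entropy is higher on unlearnable samples would be the safer choice, since it pushes the student less toward memorizing noise.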

2605.21997 2026-05-22 cs.AI cs.MA

The Log is the Agent: Event-Sourced Reactive Graphs for Auditable, Forkable Agentic Systems

日志即代理:用于可审计、可分支代理系统的事件源反应图

Yohei Nakajima

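AI summary This paper proposes an event-sourced reactive graph architecture that treats the log as the source of truth, yielding auditable, forkable agentic systems with deterministic replay, cheap forking, and end-to-end lineage.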

Comments 11 pages, 1 figure. Open-source Apache-2.0 implementation with reproducible quickstart demo, deterministic replay, fork-and-diff, and lineage tracing

详情
AI中文摘要

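大多数代理框架围绕语言模型构建:先有对话循环,然后是工具,接着是规则,最后再附加一个用于可观测性的日志层,状态则被保存为可检索的“记忆”。我们介绍ActiveGraph,一个颠倒了这种安排的运行时。仅追加的事件日志是事实来源;工作图是该日志的确定性投影;而各种行为(普通函数、类、由LLM支持的例程,或附着在类型化边上的逻辑)对图的变化作出反应并发出新事件。没有任何组件向其他组件下达指令;协调完全通过共享图进行。这一设计决策带来了检索与摘要式记忆系统所不具备的三个特性:可根据日志对任意一次运行进行确定性回放;可在任意事件处低成本地分叉一次运行,而无需重新执行共享前缀;以及从高层目标到产生每个产物的单次模型调用的端到端溯源。我们介绍了该架构、使回放可靠的确定性契约,以及一个完整因果结构仅凭日志即可重建的尽职调查示例。我们还讨论了(但并未声称已经证明)为何这一基底特别适合自我改进的代理,以及它如何延续BabyAGI一脉的工作和此前的图记忆研究。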

英文摘要

Most agent frameworks are built around the language model: a conversation loop comes first, then tools, then rules, and finally a logging layer bolted on for observability, with state persisted as retrievable "memory." We describe ActiveGraph, a runtime that inverts this arrangement. The append-only event log is the source of truth; the working graph is a deterministic projection of that log; and behaviors--ordinary functions, classes, LLM-backed routines, or logic attached to typed edges--react to changes in the graph and emit new events. No component instructs another; coordination happens entirely through the shared graph. This single design decision yields three properties that retrieval-and-summarization memory systems do not provide: deterministic replay of any run from its log, cheap forking that branches a run at any event without re-executing the shared prefix, and end-to-end lineage from a high-level goal down to the individual model call that produced each artifact. We present the architecture, a determinism contract that makes replay sound, and a worked diligence example whose full causal structure is reconstructable from the log alone. We discuss--without claiming to demonstrate--why this substrate is unusually well suited to self-improving agents, and how it extends the BabyAGI lineage and prior graph-memory research.
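
The log-as-source-of-truth idea can be sketched with a toy event log. This is not ActiveGraph's API: the `Event` schema, `project`, and `fork` are hypothetical names chosen for illustration, and the sketch only shows the two properties that follow directly from the abstract (deterministic projection and prefix-sharing forks).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    kind: str       # "add_node" or "add_edge" in this toy schema
    payload: tuple  # (node,) for add_node, (src, dst) for add_edge

def project(log):
    """Deterministically fold an event log into a graph (nodes, edges)."""
    nodes, edges = set(), set()
    for event in log:
        if event.kind == "add_node":
            nodes.add(event.payload[0])
        elif event.kind == "add_edge":
            edges.add(event.payload)
    return frozenset(nodes), frozenset(edges)

def fork(log, at):
    """Branch a run after `at` events; the shared prefix is copied, not re-executed."""
    return list(log[:at])

log = [
    Event("add_node", ("goal",)),
    Event("add_node", ("task",)),
    Event("add_edge", ("goal", "task")),
]
branch = fork(log, 2)
branch.append(Event("add_node", ("alt_task",)))

print(project(log))
print(project(branch))
```

Because `project` is a pure function of the log, replaying the same log always rebuilds the same graph, and a fork diverges only from the chosen event onward.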

2605.21994 2026-05-22 cs.LG cs.AI

Ex-GraphRAG: Interpretable Evidence Routing for Graph-Augmented LLMs

Ex-GraphRAG:图增强大语言模型中的可解释证据路由

Yoav Kor Sade, Arvindh Arun, Rishi Puri, Steffen Staab, Maya Bechler-Speicher

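AI summary This paper proposes Ex-GraphRAG, which uses a Multivariate Graph Neural Additive Network (M-GNAN) to make evidence routing in graph-augmented LLMs interpretable. It reveals a mismatch between semantic importance and structural connectivity, with implications for retrieval pruning, context construction, and failure diagnosis.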

详情
AI中文摘要

GraphRAG通过从知识图中检索子图并使用消息传递GNN进行编码,将语言模型置于这些子图上。由于这些编码器通过迭代邻域聚合将节点贡献纠缠在一起,因此无法确定每个检索实体对编码器输出的影响程度,因此无法忠实审计实际到达模型的结构证据。我们引入Ex-GraphRAG,用多变量图神经加法网络(M-GNAN)替代GNN编码器,这是一种扩展到高维嵌入空间的加法图模型,能够精确分解编码器的输出,而无需事后近似。在STaRK-Prime上,这种可审计的编码器与黑盒性能相匹配。利用它审计证据路由,我们发现语义-结构不匹配:主导编码器输出的节点在检索的子图中结构上是断开的,由低贡献的中介节点连接,其移除会使多跳问答性能下降高达28%。这种不匹配对任何不透明编码器都是不可见的,揭示了语义重要性与结构连通性由不同的节点集控制,对图增强大语言模型的检索剪枝、上下文构建和故障诊断有直接的影响。

英文摘要

GraphRAG conditions language models on subgraphs retrieved from knowledge graphs, encoded via message-passing GNNs. Because these encoders entangle node contributions through iterated neighborhood aggregation, there is no closed-form way to determine how much each retrieved entity influenced the encoder's output, and therefore no way to faithfully audit what structural evidence actually reached the model. We introduce Ex-GraphRAG, which replaces the GNN encoder with a Multivariate Graph Neural Additive Network (M-GNAN), an extension of additive graph models to high-dimensional embedding spaces that yields an exact decomposition of the encoder's output across individual nodes and feature groups, without post-hoc approximation. On STaRK-Prime, this auditable encoder matches black-box performance. Using it to audit evidence routing, we uncover a semantic-structural mismatch: the nodes that dominate the encoder's output are structurally disconnected in the retrieved subgraph, held together by low-attribution intermediaries whose removal degrades multi-hop QA by up to 28%. This mismatch, invisible to any opaque encoder, reveals that semantic importance and structural connectivity are governed by disjoint sets of nodes, with direct implications for retrieval pruning, context construction, and failure diagnosis in graph-augmented LLMs.

2605.21993 2026-05-22 cs.AI cs.LG

ECPO: Evidence-Coupled Policy Optimization for Evidence-Certified Candidate Ranking

ECPO:面向证据认证候选排序的证据耦合策略优化

Miaobo Hu, Shuhao Hu, BoKun Wang, Yina Sa, Xin Wang, Xiaobo Guo, Daren Zha, Jun Xiao

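AI summary This paper studies evidence-certified candidate ranking and proposes ECPO, a policy-optimization method that combines the ranking with its evidence certificate to improve both ranking quality and evidence reliability.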

详情
AI中文摘要

用于决策支持的排序系统不仅应对候选者进行排序,还应展示可独立验证的证据。我们研究了证据认证候选者排序:给定一个意图ID、预定义的计划骨架、窗口局部的候选者名单、以及通过文本推导出的候选者轨迹及其跨度来源,系统必须输出一个Top-K列表以及doc_id:span证据证书,其引用的跨度足以恢复决策。我们在此任务上在MAVEN-ERE和RAMS上进行了实例化,使用固定上游提取、窗口局部随机候选者标识符、骨架对齐的轨迹监督、难例和审计参考。我们引入了证据耦合策略优化(ECPO),一种列表级策略优化目标,其动作是排序和证据证书的联合对象。ECPO首先从骨架对齐、论点一致性以及可选图特征中学习可解释的轨迹奖励;然后优化一个受约束的策略,具有三个耦合奖励:列表级排序效用、跨度级证书有效性以及由一个无标签的确定性验证器计算的证据循环奖励,该验证器通过去除声明的引用跨度重建候选者支持。这将目标从单独最大化普通NDCG转变为最大化CertNDCG和决策-证据耦合。评估将ECPO与零样本、SFT和GRPO策略、仅RM的评分带确定性证据附件、语法/JSON约束解码、验证器重试、最佳-N RM选择以及后验证据合理化在封闭名单、预测名单和混合名单设置下进行比较。

英文摘要

Ranking systems used in decision-support settings should not only order candidates but also expose evidence that can be independently checked. We study evidence-certified candidate ranking: given an intent_id, a predefined plan skeleton, a window-local candidate roster, and text-derived candidate trajectories with span provenance, a system must output a Top-K list together with doc_id:span evidence certificates whose cited spans are sufficient to recover the decision. We instantiate this task on MAVEN-ERE and RAMS with fixed upstream extraction, window-local randomized candidate identifiers, skeleton-aligned trajectory supervision, hard negatives, and audit references. We introduce Evidence-Coupled Policy Optimization (ECPO), a listwise policy-optimization objective whose action is the joint object of ranking and evidence certificate. ECPO first learns an interpretable trajectory reward from skeleton alignment, argument consistency, and optional graph features; it then optimizes a constrained policy with three coupled rewards: listwise ranking utility, span-level certificate validity, and an evidence-cycle reward computed by a label-free deterministic verifier that reconstructs candidate support from claim-stripped cited spans. This reframes the goal from maximizing ordinary NDCG alone to maximizing CertNDCG and decision-evidence coupling. The evaluation compares ECPO against zero-shot, SFT, and GRPO policies, RM-only scoring with deterministic evidence attachment, grammar/JSON-constrained decoding, validator retry, best-of-N RM selection, and post-hoc evidence rationalization under closed-roster, predicted-roster, and hybrid-roster settings.

2605.21988 2026-05-22 cs.CV cs.AI

Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning

通过反事实强化学习学习视频大语言模型中的时空敏感性

Dazhao Du, Jian Liu, Jialong Qin, Tao Han, Bohai Gu, Fangqi Zhu, Yujia Zhang, Eric Liu, Xi Chen, Song Guo

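AI summary This paper proposes CRPO, a counterfactual reinforcement learning method that makes video LLMs more sensitive to spatiotemporal dynamics. By constructing counterfactual videos and adding a counterfactual relation reward, it suppresses shortcut policies that rely on static cues and improves spatiotemporal sensitivity on the DyBench benchmark.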

Comments Project website: https://ddz16.github.io/crpo.github.io/

详情
AI中文摘要

视频大语言模型(Video LLMs)在基准测试中表现出色,但往往通过单帧线索和语言先验来回答视频问题,而不是通过跟踪时空动态。在训练后强化学习(RL)中,这种问题进一步加剧,因为仅正确性奖励会进一步强化那些不跟踪视频动态但能获得高奖励的简略策略。为此,我们提出一个受控的反事实问题:如果视觉世界发生变化而问题保持不变,答案应改变还是保持不变?基于这一观点,我们提出了反事实关系策略优化(CRPO),一种双分支强化学习框架,用于提升时空敏感性。CRPO通过水平翻转和时间反转构建反事实视频,在原始和反事实分支上进行训练,并引入反事实关系奖励(CRR)以鼓励答案在动态问题中改变而在静态问题中保持不变。这种跨分支约束使简略策略难以在两个分支中持续获得奖励。为了评估这一特性,我们引入了DyBench,一个配对反事实视频基准,包含3,014个视频,涵盖可逆动态、运动方向和事件序列,以及一个严格的配对准确度指标,防止固定答案简略策略夸大分数。实验表明,CRPO在时空敏感性评估中优于先前的RL方法,同时保持了竞争性的通用视频性能。在Qwen3-VL-8B上,CRPO在DyBench P-Acc上比基模型提高了+7.7,在TimeBlind I-Acc上提高了+8.2,表明改进了时空敏感性而非更强依赖静态简略策略。项目网站可在https://ddz16.github.io/crpo.github.io/上找到。

英文摘要

Video large language models (Video LLMs) achieve strong benchmark accuracy, yet often answer video questions through shortcuts such as single-frame cues and language priors rather than by tracking spatiotemporal dynamics. This issue is exacerbated in RL post-training, where correctness-only rewards can further reinforce shortcut policies that obtain high reward without tracking video dynamics. We address this by asking a controlled counterfactual question: if the visual world changed while the question remained fixed, should the answer change or stay the same? Based on this view, we propose \textbf{Counterfactual Relational Policy Optimization (CRPO)}, a dual-branch RL framework for improving \emph{spatiotemporal sensitivity}. CRPO constructs counterfactual videos through horizontal flips and temporal reversals, trains on both original and counterfactual branches, and introduces a \textbf{Counterfactual Relation Reward (CRR)} between their answers. CRR encourages answers to change for dynamic questions and remain unchanged for static questions. This cross-branch constraint makes it difficult for shortcut policies to be consistently rewarded across both branches. To evaluate this property, we introduce \textbf{DyBench}, a paired counterfactual video benchmark with 3,014 videos covering reversible dynamics, moving direction, and event sequence, together with a strict pair-accuracy metric that prevents fixed-answer shortcuts from inflating scores. Experiments show that CRPO outperforms prior RL methods on spatiotemporal-sensitive evaluations while maintaining competitive general video performance. On Qwen3-VL-8B, CRPO improves DyBench P-Acc by +7.7 and TimeBlind I-Acc by +8.2 over the base model, indicating improved spatiotemporal sensitivity rather than stronger reliance on static shortcuts. The project website can be found at https://ddz16.github.io/crpo.github.io/ .
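
The counterfactual construction and the relation reward described above can be sketched on toy data. Everything here is an assumption made for illustration: videos are nested lists (frames, rows, pixels), `relation_reward` is a simplified binary stand-in for CRR, and `pair_accuracy` assumes that "strict" means both members of a pair must be answered correctly.

```python
def temporal_reverse(video):
    """Reverse the frame order of a video given as a list of frames."""
    return video[::-1]

def horizontal_flip(video):
    """Mirror every row of every frame left to right."""
    return [[row[::-1] for row in frame] for frame in video]

def relation_reward(answer_orig, answer_cf, question_is_dynamic):
    """Reward 1.0 when the answer changes exactly when the question is dynamic."""
    changed = answer_orig != answer_cf
    return 1.0 if changed == question_is_dynamic else 0.0

def pair_accuracy(pairs):
    """Fraction of (orig_correct, cf_correct) pairs where both are correct."""
    return sum(1 for a, b in pairs if a and b) / len(pairs)

video = [[[1, 2]], [[3, 4]]]  # two frames, one row, two pixels each
print(temporal_reverse(video), horizontal_flip(video))
print(relation_reward("left", "right", question_is_dynamic=True))
```

A fixed-answer shortcut illustrates the design: answering "left" on both branches of a direction question earns no relation reward and scores zero on that pair under the assumed pair metric.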

2605.21984 2026-05-22 cs.AI cs.CL

Echo: Learning from Experience Data via User-Driven Refinement

Echo:通过用户驱动的修订从经验数据中学习

Hande Dong, Xiaoyun Liang, Jiarui Yu, Jiayi Lin, Changqing Ai, Feng Liu, Wenjun Zhang, Rongbi Wei, Chaofan Zhu, Linjie Che, Feng Wu, Xin Shen, Dexu Kong, Xiaotian Wang, Qiuyuan Chen, Bingxu An, Yueting Lei, Qiang Lin

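AI summary This paper proposes Echo, a framework that turns raw experience data into learnable knowledge through user-driven refinement. In experiments, it raises the acceptance rate from 25.7% to 35.7%.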

详情
AI中文摘要

静态的'人类数据'面临固有局限:扩展成本高且受制于创造者知识。持续学习'经验数据'——智能体与其环境的交互——有望超越这些障碍。如今,AI智能体的广泛应用使我们能够以低成本获取大量真实世界经验数据。然而,原始交互日志本质上嘈杂,充满试错和低信息密度,使其不适合直接用于模型训练。我们引入Echo,一个通用框架,旨在将原始经验转化为可学习的知识,有效将环境反馈回训练循环以优化模型。在当今智能体生态系统中,用户细化是主要的反馈来源:出于对结果的责任感,用户严格地将缺陷智能体提案转化为已验证的解决方案。这些用户驱动的细化序列本质上将智能体的粗略尝试提炼为高质量的训练信号。Echo系统性地收集这些信号,持续使智能体与真实世界需求对齐。在大规模生产代码补全环境中的验证表明,Echo有效利用这一流程,打破静态性能上限,将接受率从25.7%提升至35.7%。

英文摘要

Static "human data" faces inherent limitations: it is expensive to scale and bounded by the knowledge of its creators. Continuous learning from "experience data" - interactions between agents and their environments - promises to transcend these barriers. Today, the widespread deployment of AI agents grants us low-cost access to massive streams of such real-world experience. However, raw interaction logs are inherently noisy, filled with trial-and-error and low information density, rendering them inefficient for direct model training. We introduce Echo, a generalized framework designed to operationalize the transition from raw experience to learnable knowledge, effectively "echoing" environmental feedback back into the training loop for model optimization. In today's agent ecosystem, user refinement serves as a primary source of such feedback: driven by responsibility for the outcome, users rigorously transform flawed agent proposals into verified solutions. These user-driven refinement sequences inherently distill agents' crude attempts into high-quality training signals. Echo systematically harvests these signals to continuously align the agent with real-world needs. Large-scale validation in a production code completion environment confirms that Echo effectively harnesses this pipeline, breaking the static performance ceiling by increasing the acceptance rate from 25.7% to 35.7%.

2605.21981 2026-05-22 cs.CV

RiT: Vanilla Diffusion Transformers Suffice in Representation Space

Le Zhang, Ning Mang, Aishwarya Agrawal

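AI summary This study examines image generation with vanilla diffusion transformers in representation space and finds that a pretrained representation space is better suited to flow-matching learning, yielding better ImageNet performance than DiT$^\text{DH}$-XL.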

详情
AI中文摘要

采用$x$预测的流匹配(回归干净数据点而非环境空间中的速度)已被证明能在像素空间中有效利用低维流形结构\cite{li2025back}。我们探究:预训练的表示空间虽然包含内在维度相当的低维数据流形,是否能提供更有利于流匹配学习的分布。通过在四个几何维度上比较像素、SD-VAE和DINOv2特征,我们发现像素与DINOv2具有几乎相同的内在维度(两者均为$\hat{d}\!\approx\!33$),但DINOv2的有效秩高7.3倍、协方差条件性好35倍、超额峰度低11.5倍、流形上插值误差低1.7倍;SD-VAE潜变量始终介于两者之间,表明这一优势源于表示学习目标,而非单纯的压缩。这些统计特性使流匹配回归具有良好的条件性,并且不再需要先前DINOv2扩散方法所使用的专用预测头或黎曼传输。我们提出了表示图像Transformer(RiT):一个在冻结的DINOv2特征上以$x$预测方式训练的原始Diffusion Transformer,仅辅以维度感知的噪声调度和联合\texttt{[CLS]}-patch建模。在ImageNet $256{\times}256$上,RiT在无引导时达到FID 1.45,在使用无分类器引导时达到1.14,以少$19\%$的参数(676M对839M)优于DiT$^\text{DH}$-XL。所得到的ODE在粗离散化下即可高效求解:使用无分类器引导时,5步Heun已达到FID 2.0,10步达到1.25,且无需蒸馏或一致性训练。代码见 https://github.com/lezhang7/RiT

英文摘要

Flow matching with $x$-prediction -- regressing the clean data point rather than the ambient velocity -- is known to exploit low-dimensional manifold structure effectively in pixel space \cite{li2025back}. We ask whether a pretrained representation space, while containing a low-dimensional data manifold of comparable intrinsic dimensionality, offers a distribution more favorable for flow-matching learning. Comparing pixel, SD-VAE, and DINOv2 features along four geometric axes, we find that pixel and DINOv2 share nearly identical intrinsic dimensionalities (both $\hat{d}\!\approx\!33$) yet DINOv2 exhibits $7.3\times$ higher effective rank, $35\times$ better covariance conditioning, $11.5\times$ lower excess kurtosis, and $1.7\times$ lower on-manifold interpolation error; SD-VAE latents are consistently intermediate, indicating that the advantage stems from representation-learning objectives rather than mere compression. These statistical properties render the flow-matching regression well-conditioned and remove the need for the specialized prediction heads or Riemannian transport used by prior DINOv2 diffusion methods. We propose the \emph{Representation Image Transformer} (RiT): a vanilla Diffusion Transformer trained by $x$-prediction on frozen DINOv2 features, augmented only by a dimension-aware noise schedule and joint \texttt{[CLS]}-patch modeling. On ImageNet $256{\times}256$, RiT attains FID 1.45 without guidance and 1.14 with classifier-free guidance, outperforming DiT$^\text{DH}$-XL with $19\%$ fewer parameters (676M vs.\ 839M). The resulting ODE is efficiently solvable at coarse discretizations: with classifier-free guidance, $5$ Heun steps already reach FID 2.0 and $10$ steps reach 1.25, without distillation or consistency training. Code at https://github.com/lezhang7/RiT.
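
The "Heun steps" mentioned above refer to Heun's second-order ODE solver. A minimal sketch on a toy scalar ODE follows; the velocity function here is a stand-in chosen for illustration, not RiT's learned field, and `heun_solve` is a hypothetical helper.

```python
import math

def heun_solve(velocity, x0, t0, t1, steps):
    """Integrate dx/dt = velocity(x, t) from t0 to t1 with Heun's method."""
    x, t = x0, t0
    h = (t1 - t0) / steps
    for _ in range(steps):
        k1 = velocity(x, t)                # slope at the current point
        k2 = velocity(x + h * k1, t + h)   # slope at the Euler-predicted point
        x += 0.5 * h * (k1 + k2)           # average the two slopes
        t += h
    return x

def decay(x, t):
    """Toy velocity field dx/dt = -x, whose exact solution is exp(-t)."""
    return -x

for steps in (5, 10):
    approx = heun_solve(decay, 1.0, 0.0, 1.0, steps)
    print(steps, approx, abs(approx - math.exp(-1.0)))
```

Each Heun step costs two velocity evaluations, so a 5-step solve corresponds to roughly 10 network evaluations when the velocity comes from a model.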

2605.21980 2026-05-22 cs.CV cs.AI

Interpreting and Enhancing Emotional Circuits in Large Vision-Language Models via Cross-Modal Information Flow

通过跨模态信息流解读并增强大型视觉-语言模型中的情感回路

Chengsheng Zhang, Chenghao Sun, Zhining Xie, Xinmei Tian

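AI summary This paper proposes a steering-vector-based causal attribution framework for descriptive emotional reasoning. Using a purpose-built dataset, it uncovers the emotional circuits behind a three-stage 'Adapt-Aggregate-Execute' mechanism: visual emotional cues are aggregated in middle layers via sentiment-specific attention heads and then translated into narrative generation in deep layers through emotion-general pathways. Regulating emotional information routing to strengthen attention flow and semantic activation improves performance and mitigates emotional hallucinations.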

Comments Accepted by ICML 2026

详情
AI中文摘要

大型视觉-语言模型(LVLMs)代表了迈向共情智能体的重要一步,在情感理解方面展现出卓越的能力。然而,LVLMs如何将抽象的视觉刺激转化为连贯的情感叙述,其内部机制在很大程度上仍未被探索,这主要是由于视觉反事实样本的稀缺以及情感表达的弥散性。本文通过引入一个专为描述性情感推理设计的、基于转向向量的因果归因框架来弥补这一空白。为此,我们构建了一个专用数据集,以揭示三阶段“适应-聚合-执行”(Adapt-Aggregate-Execute)机制背后的情感回路。关键的是,我们发现了一种功能解耦:视觉情感线索在中间层通过情感特定的注意力头进行聚合,随后在深层通过情感通用的通路转化为叙述生成。基于这些洞见,我们调控情感信息的路由以增强注意力流,并放大语义激活以巩固表达。在综合性基准MER-UniBench上的大量实验表明,我们的方法通过推理时干预显著提升了性能,有效缓解了情感幻觉,并印证了所发现回路的因果忠实性。

英文摘要

Large Vision-Language Models (LVLMs) represent a significant leap towards empathetic agents, demonstrating remarkable capabilities in emotion understanding. However, the internal mechanisms governing how LVLMs translate abstract visual stimuli into coherent emotional narratives remain largely unexplored, primarily due to the scarcity of visual counterfactuals and the diffuse nature of emotional expression. In this paper, we bridge this gap by introducing a steering-vector-based causal attribution framework tailored for descriptive emotional reasoning. To this end, we construct a specialized dataset to demystify the emotional circuits underlying the three-stage ``Adapt-Aggregate-Execute'' mechanism. Crucially, we discover a functional decoupling: visual emotional cues are aggregated in middle layers via sentiment-specific attention heads, but are subsequently translated into narrative generation in deep layers through emotion-general pathways. Guided by these insights, we regulate the emotional information routing to strengthen attention flow and amplify the semantic activation to consolidate expression. Extensive experiments on the comprehensive MER-UniBench demonstrate that our methods significantly improve performance via inference-time intervention, effectively mitigating emotional hallucinations and corroborating the causal fidelity of the discovered circuits.

2605.21977 2026-05-22 cs.CV cs.AI

Video as Natural Augmentation: Towards Unified AI-Generated Image and Video Detection

视频作为自然增强:迈向统一的AI生成图像和视频检测

Zhengcen Li, Chenyang Jiang, Liangxu Su, Tong Shao, Shiyang Zhou, Ming Tao, Jingyong Su

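AI summary This study addresses the cross-modal gap in AI-generated content detection with VINA, a framework that trains jointly on image and video data, uses video frames as natural augmentations, and adds a cross-modal supervised contrastive objective to achieve unified detection with better robustness and transferability.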

详情
AI中文摘要

AI生成内容(AIGC)正在迅速提升,催生了需要在数据源、部署管道和视觉模态间通用的检测器的紧迫需求。一个高度通用的检测器应在分布变化下保持稳健。然而,我们发现了一种一致的失败模式:最先进的AI生成图像检测器在应用于从视频中提取的帧时往往会崩溃。通过系统分析,我们发现这种跨模态差距源于交织的合成无关视频处理转换,包括颜色转换、编码压缩、缩放和模糊,以及由现代视频生成器引入的模型特定指纹。受这些发现的启发,我们提出了VINA(Video as Natural Augmentation),一个统一的AIGC检测框架,联合训练图像和视频数据。VINA利用视频帧作为物理上合理的自然增强,并进一步引入跨模态监督对比目标,以在共享的真/假决策边界下对齐图像和视频表示。在14个图像、视频和现实世界基准测试中,VINA展示了双向收益,提高了鲁棒性和迁移性,并在几乎所有评估设置中实现了最先进的性能,无需复杂的增强或数据集特定调整。

英文摘要

AI-generated content (AIGC) is rapidly improving, creating an urgent need for detectors that generalize across data sources, deployment pipelines, and visual modalities. A strongly generalizable detector should remain robust under distributional variations. However, we identify a consistent failure mode: SOTA AI-generated image detectors often collapse when applied to frames extracted from videos. Through systematic analysis, we show that this cross-modal gap arises from both entangled synthesis-agnostic video processing shifts, including color conversion, codec compression, resizing, and blur, and model-specific fingerprints introduced by modern video generators. Motivated by these findings, we propose VINA (Video as Natural Augmentation), a unified AIGC detection framework that jointly trains on image and video data. VINA uses video frames as physically grounded natural augmentations and further introduces a cross-modal supervised contrastive objective to align image and video representations under a shared real/fake decision boundary. Extensive experiments on 14 image, video, and in-the-wild benchmarks show that VINA delivers bidirectional gains, improves robustness and transferability, and achieves state-of-the-art performance across nearly all evaluated settings without complex augmentation or dataset-specific tuning.

2605.21976 2026-05-22 cs.RO

TacO: Benchmarking Tactile Sensors for Object Manipulation

TacO: 用于物体操作的触觉传感器基准测试

Anya Zorin, Zilin Si, Myungsun Park, Junsung Park, Alexiy Buynitsky, Sachin Bhadang, Taejun Park, Sohee John Yoon, Yong-Lae Park, Oliver Kroemer, Zeynep Temel, Michael T. Tolley, Sha Yi, Xiaolong Wang

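AI summary This paper proposes a task-driven framework for evaluating tactile sensors. It trains manipulation policies for tactile sensors of four modalities (visual, acoustic, magnetic, and resistive) on three tasks and examines how the usefulness of tactile information varies with materials and tasks.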

详情
AI中文摘要

基于视觉的示范学习在使机器人执行操作任务和高层语义推理方面取得了显著成功,但对于复杂且接触丰富的操作仍然不足。尽管人们普遍认为触觉感知能够改善操作,但对于哪种触觉传感器最适合哪类操作任务,目前尚无实证指导。本文对用于机器人操作的触觉传感器进行了系统性的、任务驱动的评估,并提出了一个基于操作策略性能来选择和评估传感器的框架。我们为四种不同模态(视觉、声学、磁性和电阻性)的触觉传感器分别训练了独立的操作策略,涵盖三个任务:未知质量物体的拾取与放置、物体重新定向和插头插入。针对每个任务,我们分析了空间分辨率、剪切感知、触觉表示等传感器属性以及材料固有摩擦对任务性能的影响。我们的结果表明,触觉感知并非以相同方式对所有任务普遍有益;触觉信息的有用性在很大程度上取决于传感器模态、材料属性和具体的操作任务。所有触觉传感器、代码、数据和硬件设置都将在项目网站上公开。

英文摘要

Vision-based learning from demonstrations has achieved remarkable success in enabling robots to perform manipulation tasks and high-level semantic reasoning, yet it remains insufficient for complex, contact-rich manipulation. While there is broad agreement that tactile sensing improves manipulation, there is no empirical guidance on which tactile sensors are best suited for which manipulation tasks. In this paper, we provide a systematic, task-driven evaluation of tactile sensors for robot manipulation and propose a framework for selecting and evaluating sensors based on manipulation policy performance. Separate manipulation policies are trained for tactile sensors of four distinct modalities (visual, acoustic, magnetic, and resistive) across three tasks: pick-and-place with unknown mass, object reorientation, and plug insertion. For each task, we analyze how sensor properties such as spatial resolution, shear sensing, and tactile representation, as well as inherent material friction, affect task performance. Our results show that tactile sensing is not universally beneficial in the same way; rather, the usefulness of tactile information depends strongly on sensor modality, material properties, and the specific manipulation task. All of the tactile sensors, code, data, and hardware setup will be publicly available on the project website.

2605.21975 2026-05-22 cs.LG

Reasoning through Verifiable Forecast Actions: Consistency-Grounded RL for Financial LLMs

通过可验证的预测动作进行推理:面向金融大语言模型的一致性导向强化学习

Jialin Chen, Aosong Feng, Harshit Verma, Siyi Gu, Haiwen Wang, Ali Maatouk, Yixuan He, Yifeng Gao, Leandros Tassiulas, Rex Ying

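AI summary This paper proposes StockR1, a time-series-enhanced LLM that unifies stock forecasting and financial reasoning through a verifiable forecast action. The full pipeline is optimized with reinforcement learning, improving accuracy on financial question answering and stock forecasting.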

详情
AI中文摘要

金融市场以极端非平稳性、低信噪比和对新闻、公司基本面和宏观经济信号的强依赖性为特征。然而,现有方法要么将时间序列抽象为文本,要么将预测与基于语言的推理解耦,导致定性推理与定量结果之间存在根本性不匹配。为此,我们引入StockR1,一种增强时间序列的LLM,通过可验证的预测动作统一股票预测与金融推理。基于工具调用设计,模型首先发出预测动作,即对其定性市场展望的结构化和可解释的表示。然后,它调用一个受此动作条件的时序解码器,生成分布式的未来轨迹,从而更有效地进行问答和金融推理。我们通过强化学习优化整个流程,其中奖励共同反映答案的正确性、预测的准确性以及生成动作与观察到的时序动态之间的一致性。此外,奖励通过样本级不确定性标量重新加权,鼓励模型适应市场动态中变化的不确定性。我们在大规模10年基准上评估StockR1的金融问答和股票预测。我们的方法在时间序列基线和通用LLM上均表现优异,将推理准确性提高了17.7%(4B)和25.9%(8B)。这些发现表明,结构化预测动作在语言推理和时间预测之间建立了强大的协同效应,使LLM能够通过可验证、可解释和数值基础的决策进行推理。

英文摘要

Financial markets are characterized by extreme non-stationarity, low signal-to-noise ratios, and strong dependence on external information such as news, company fundamentals, and macroeconomic signals. Yet, existing approaches either abstract time-series into text or decouple forecasting from language-based reasoning, leading to a fundamental mismatch between qualitative reasoning and quantitative outcomes. To address this, we introduce StockR1, a time-series-enhanced LLM that unifies stock forecasting and financial reasoning through a verifiable forecast action. Based on a tool-call design, the model first emits a forecast action, which is a structured and interpretable representation of its qualitative market outlook. It then invokes a time-series decoder conditioned on this action to generate distributional future trajectories, leading to more informed question answering and financial reasoning. We optimize the full pipeline with reinforcement learning, where rewards jointly reflect answer validity, forecast accuracy, and consistency between generated actions and observed time-series dynamics. In addition, rewards are reweighted by a sample-level uncertainty scalar, encouraging the model to accommodate varying uncertainty in market dynamics. We evaluate StockR1 on financial question answering and stock forecasting over a large-scale 10-year benchmark. Our method consistently outperforms time-series baselines and general-purpose LLMs, improving reasoning accuracy by 17.7% (4B) and 25.9% (8B). These findings demonstrate that structuring the forecast actions establishes a powerful synergy between language reasoning and temporal prediction, enabling LLMs to reason through verifiable, interpretable, and numerically grounded decisions.

2605.21974 2026-05-22 cs.AI

Format-Constraint Coupling in Knowledge Graph Construction from Statistical Tables

Format-Constraint Coupling for Statistical Tables in Knowledge Graph Construction

Jingxuan Qi, Zhiqiang Ye, Yuxiang Feng

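AI summary This paper studies the coupling between serialization format and extraction-schema constraints when building knowledge graphs from statistical tables. It finds that the joint effect of format and constraints exceeds the sum of their independent effects, and it introduces the CSVFidelity-Bench benchmark to support fidelity-aware evaluation.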

Comments 8 pages main body, 18 pages appendices. Submitted to EMNLP 2026 via ACL Rolling Review (ARR). Corresponding author: Yuxiang Feng (yxfeng@scut.edu.cn). Code and data available at https://anonymous.4open.science/r/sge_lightrag-BE19

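Details
AI Chinese abstract (translated to English)

An extraction schema should not reduce knowledge graph fidelity. On statistical CSV tables, however, it can. We study country-by-year time-series matrices, a layout common on open-data portals. In this setting, serialization format and schema constraints interact super-additively. Their joint effect exceeds the sum of the independent effects by up to +1.180 (2x2 factorial, 6 datasets). Bootstrap 95% confidence intervals are strictly positive on 4/6 datasets, with the strongest evidence on wide Type-II matrices. More critically, a schema applied to a mismatched format can trigger catastrophic mismatch. Fact coverage falls below the unconstrained baseline on 4/6 datasets, through entity inflation or extraction refusal. We call this observed pattern format-constraint coupling. Probing and token ablation support a surface-form anchoring explanation centred on column-name references. In controlled variants across format-schema pairings, GraphRAG hosts, and LLM families, the results keep the same direction within the measured scope; one LLM family shows only partial activation. The observation also has a diagnostic consequence. Three standard retrieval modes largely mask construction quality (delta <= 1pp), whereas direct graph access exposes gaps of up to +47.6pp (p < 0.0001). To support fidelity-aware evaluation, we release CSVFidelity-Bench. It contains 15 datasets, 11 Type-II matrices, 4 Type-III tables, and 1,892 Gold Standard facts across 6 domains.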

English abstract

An extraction schema should not reduce knowledge graph fidelity. On statistical CSV, however, it can. We study country-by-year time-series matrices, a common layout on open-data portals. In this setting, serialization format and schema constraints interact super-additively. Their joint effect exceeds the sum of independent effects by up to +1.180 (2x2 factorial, 6 datasets). Bootstrap 95% CIs are strictly positive on 4/6 datasets, with strongest evidence on wide Type-II matrices. More critically, a schema applied to a mismatched format can trigger catastrophic mismatch. Fact coverage falls below the unconstrained baseline on 4/6 datasets through entity inflation or extraction refusal. We call this observed pattern format-constraint coupling. Probing and token ablation support a surface-form anchoring explanation centred on column-name references. Controlled variants across format-schema pairings, GraphRAG hosts, and LLM families show the same direction within the measured scope; one LLM family shows only partial activation. The observation also has a diagnostic consequence. Three standard retrieval modes largely mask construction quality (delta <= 1pp), whereas direct graph access exposes gaps up to +47.6pp (p < 0.0001). To support fidelity-aware evaluation, we release CSVFidelity-Bench. It contains 15 datasets, 11 Type-II matrices, 4 Type-III tables, and 1,892 Gold Standard facts across 6 domains.
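The super-additivity claim can be made concrete with the standard interaction contrast for a 2x2 factorial design, paired with a percentile bootstrap interval. The sketch below is a generic illustration, not the paper's evaluation code: the cell layout, the function names, the resampling scheme, and all scores are assumptions, and the numbers are made up.

```python
import random
from statistics import mean


def interaction(cells):
    """Interaction contrast for a 2x2 design.

    `cells` maps (format_on, schema_on) -> list of per-run scores.
    A positive value means the joint effect exceeds the sum of the two
    independent effects, i.e. the factors interact super-additively.
    """
    m = {key: mean(values) for key, values in cells.items()}
    joint = m[(1, 1)] - m[(0, 0)]
    format_only = m[(1, 0)] - m[(0, 0)]
    schema_only = m[(0, 1)] - m[(0, 0)]
    return joint - (format_only + schema_only)


def bootstrap_ci(cells, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling runs within each cell."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        resampled = {
            key: [rng.choice(values) for _ in values]
            for key, values in cells.items()
        }
        stats.append(interaction(resampled))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Made-up scores for one hypothetical dataset (not from the paper).
cells = {
    (0, 0): [0.50, 0.52, 0.48, 0.51],  # baseline format, no schema
    (1, 0): [0.55, 0.57, 0.53, 0.56],  # alternative format only
    (0, 1): [0.40, 0.42, 0.38, 0.41],  # schema only (mismatched format)
    (1, 1): [0.90, 0.92, 0.88, 0.91],  # format and schema together
}

point_estimate = interaction(cells)
ci_low, ci_high = bootstrap_ci(cells)
```

In this toy data the schema alone lowers the score relative to the baseline, mirroring the mismatch case the abstract describes, while the combination scores well above what the two separate effects would predict. An interval that stays strictly above zero corresponds to the "strictly positive" criterion reported for 4/6 datasets.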

2605.21973 2026-05-22 cs.CV

Foresee-to-Ground: From Predictive Temporal Perception to Evidence-Driven Reasoning for Video Temporal Grounding

Foresee-to-Ground: Video Temporal Grounding from Predictive Temporal Perception to Evidence-Driven Reasoning

Zelin Zheng, Xinyan Liu, Ruixin Li, Antoni B. Chan, Guorong Li, Qingming Huang, Laiyun Qing

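AI summary This paper proposes F2G, a new framework for video temporal grounding that reformulates the task as a verifiable Identify-then-Measure problem and combines predictive temporal perception with evidence-driven reasoning to improve grounding accuracy and robustness.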

Comments Accepted by ICML 2026

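Details
AI Chinese abstract (translated to English)

Current Video-LLM approaches to Video Temporal Grounding (VTG) typically rely on generating timestamps directly from an unstructured visual-token stream, which often leads to brittle numerics and inconsistent boundaries. To address this, we propose Foresee-to-Ground (F2G), a framework that reformulates VTG as a verifiable Identify-then-Measure problem. F2G integrates Predictive Temporal Perception with Evidence-Driven Reasoning: it learns boundary-sensitive temporal representations to build a video-wide evidence pool of candidate event segments, and exposes these segments to the LLM as citable evidence units that bind boundary prediction to explicit event hypotheses. By decoupling event identification from precise boundary measurement, F2G stabilizes grounding and makes predictions verifiable. Extensive experiments show that F2G consistently improves grounding accuracy across diverse benchmarks, transfers robustly across different Video-LLM backbones, and preserves general video understanding capabilities.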

English abstract

Current Video-LLM approaches for Video Temporal Grounding (VTG) typically rely on direct timestamp generation from an unstructured visual-token stream, often leading to brittle numerics and inconsistent boundaries. To address this, we propose Foresee-to-Ground (F2G), a framework that reformulates VTG as a verifiable Identify-then-Measure problem. F2G integrates Predictive Temporal Perception with Evidence-Driven Reasoning: it learns boundary-sensitive temporal representations to build a video-wide evidence pool of candidate event segments, and exposes these segments to the LLM as citable evidence units that bind boundary prediction to explicit event hypotheses. By decoupling event identification from precise boundary measurement, F2G stabilizes grounding and makes predictions verifiable. Extensive experiments demonstrate that F2G consistently improves grounding accuracy across diverse benchmarks, transfers robustly across different Video-LLM backbones, and preserves general video understanding capabilities.