arXivDaily: daily arXiv academic digest, updated Monday to Friday
2606.06379 2026-06-05 cs.CV cs.AI

EasyLens: A Training-Free Plug-and-Play Subtle-Lesion Representation Amplifier for Medical Vision-Language Models


Qiwei Zeng, Hao Wang, Jinghao Lin, Shuchang Ye, Yuezhe Yang, Yige Peng, Haoyuan Che, Jinman Kim, Lei Bi

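Affiliations: Jilin University; School of Computer Science, The University of Sydney; ByteDance; Institute of Translational Medicine, Shanghai Jiao Tong University

AI summary: Proposes EasyLens, a training-free plug-and-play module that amplifies medical vision-language models' representations of subtle lesions by building a pathology-anatomy prototype space, selecting lesion-relevant patches through counterfactual reasoning, and applying morphology-guided residual enhancement.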

Abstract

Medical vision-language models (VLMs) have shown increasing potential for clinical image interpretation, including lesion detection and report generation. However, their practical utility remains limited by insufficient sensitivity to subtle lesions, whose visual evidence is often sparse, low-contrast, and embedded within complex anatomical context. As local visual tokens are aggregated, these weak lesion cues can become underrepresented in global image representations, making them difficult for medical VLMs to recognize. Existing efforts to improve lesion sensitivity mainly rely on medical-domain vision-encoder pre-training, clinical-term-guided alignment, or trainable pathological representation enhancement. Although effective, these approaches usually require additional training or model-specific adaptation and may overfit to particular disease morphologies, limiting their applicability to frozen medical VLMs. To address these limitations, we propose EasyLens, a training-free plug-and-play subtle-lesion representation amplifier for medical VLMs. EasyLens first constructs EasyBank, a pathology-anatomy prototype space that provides lesion-related prototypes and anatomy-aware normal references for comparing suspicious patches against both pathological and normal anatomical patterns. To avoid blindly amplifying normal tissues, EasyTag selects lesion-relevant patches through counterfactual prototype reasoning. To counteract the dilution of subtle lesion cues in global image representations, EasyAmplifier strengthens the selected lesion-relevant patch representations through morphology-guided residual enhancement, thereby increasing their contribution to the global image embedding. Experiments on multiple medical image datasets and frozen medical VLM backbones show that EasyLens improves subtle-lesion detection and outperforms existing encoder-enhancement baselines.
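
The selection and amplification steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the margin rule (best lesion-prototype similarity minus best normal-prototype similarity), the gain `1 + alpha * margin`, and mean pooling are all hypothetical choices, as are the function names.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def lesion_margin(patch, lesion_protos, normal_protos):
    """Best lesion-prototype similarity minus best normal-prototype similarity."""
    return (max(cosine(patch, p) for p in lesion_protos)
            - max(cosine(patch, p) for p in normal_protos))

def amplify_and_pool(patches, lesion_protos, normal_protos, alpha=1.0):
    """Scale up patches with a positive margin, then mean-pool all patches."""
    enhanced, selected = [], []
    for patch in patches:
        margin = lesion_margin(patch, lesion_protos, normal_protos)
        keep = margin > 0.0  # hypothetical selection rule
        selected.append(keep)
        gain = 1.0 + alpha * margin if keep else 1.0
        # Residual form: x + alpha * margin * x for selected patches.
        enhanced.append([gain * x for x in patch])
    dim = len(patches[0])
    pooled = [sum(p[i] for p in enhanced) / len(enhanced) for i in range(dim)]
    return pooled, selected
```

With one lesion-like patch among two normal ones, the lesion direction carries more weight in the pooled embedding than it would under plain averaging, which is the dilution effect the abstract describes.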

2606.06375 2026-06-05 cs.AI

Rethinking Infrastructure Inspection as Image Difference Classification: A Traffic Sign Case Study


Ching Yau Fergus Mok, Lavindra de Silva, Varun Kumar Reja, Ioannis Brilakis

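Affiliations: University of Cambridge; IIT Bombay

AI summary: Reframes infrastructure inspection as image difference classification (IDC), using the relational nature of continuous asset condition monitoring to reduce data reliance, and shows in a low-resource traffic sign inspection case study that an instruction-based classifier outperforms encoder-based ones.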

Comments
CVPR 2026 Computer Vision for the Built World Workshop (CV4AEC @ CVPR)
Abstract

Digital twins (DTs) allow the digitalization of road infrastructure inspection, though this is hindered by limited annotated data. This work exploits the relational nature of continuous asset condition monitoring to reformulate image-based defect detection as image difference classification (IDC) to reduce data reliance. This was evaluated in a case study on low-resource traffic sign inspection with different IDC classifiers using a newly curated, high-quality dataset. Results indicate that the instruction-based classifier outperforms encoder-based ones and gains from comparison with reference images. This shows that IDC can be an effective task formulation for tackling data constraints in infrastructure inspection and DT asset condition updating.

2606.06370 2026-06-05 cs.RO

Ensuring Interaction Safety in Multitask Exoskeleton Control: A Simulation-Trained Variable Impedance Framework


Muyuan Ma, Houcheng Li, Haotian Zhai, Lijun Han, Xinpan Meng, Xiuze Xia, Long Cheng

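Affiliations: Tsinghua University

AI summary: Proposes a simulation-trained variable impedance control framework that constrains stiffness variations using Lyapunov stability theory, achieving safe interaction control for multitask exoskeletons and reducing metabolic cost.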

Abstract

Wearable exoskeletons can augment human physical capabilities during complex activities. However, ensuring adaptation across diverse tasks while guaranteeing interaction safety remains a critical challenge. To address this, a simulation-trained variable impedance control approach with stability guarantees is proposed. First, a simulation-based human-exoskeleton motion data generation pipeline is established, utilizing Proximal Policy Optimization (PPO) to synthesize human muscle activations while the exoskeleton provides direct compensation for human biological joint torques. Subsequently, the generated dataset is used to train a dual-modality policy that fuses semantic instructions with proprioceptive history, enabling the prediction of reference trajectories and variable impedance gains for nine different motion tasks. To guarantee safety, the network outputs are constrained by a stability criterion derived from Lyapunov stability theory, which bounds stiffness variations to ensure the asymptotic stability of the coupled system. Experimental results indicate that the proposed framework reduces metabolic cost in real-world scenarios compared with standard baseline methods. These findings suggest the feasibility of the proposed framework for safe, multitask exoskeleton control.
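
A rough sketch of what constraining a predicted stiffness can look like in code. The abstract does not state the Lyapunov-derived criterion, so the rate limit below is only a stand-in for it; `limit_stiffness`, `max_rate`, and the gain bounds are hypothetical, and the torque law is the textbook single-joint impedance form.

```python
def impedance_torque(q, dq, q_ref, dq_ref, stiffness, damping):
    """Joint torque pulling the joint toward the reference trajectory."""
    return stiffness * (q_ref - q) + damping * (dq_ref - dq)

def limit_stiffness(k_prev, k_pred, max_rate, dt, k_min=0.0, k_max=200.0):
    """Clamp the predicted stiffness so it changes by at most max_rate * dt per step.

    Stand-in for a stability criterion: the real bound in the paper is
    derived from Lyapunov theory and is not reproduced here.
    """
    step = max_rate * dt
    k = min(max(k_pred, k_prev - step), k_prev + step)
    return min(max(k, k_min), k_max)

# One control step: the policy proposes a large stiffness jump, the limiter trims it.
k = limit_stiffness(k_prev=50.0, k_pred=80.0, max_rate=100.0, dt=0.01)
tau = impedance_torque(q=0.1, dq=0.0, q_ref=0.3, dq_ref=0.0, stiffness=k, damping=2.0)
```

The point of the design is that the learned policy stays free to propose gains, while a separate, checkable rule decides how much of the proposal is applied.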

2606.06369 2026-06-05 cs.CV

Visual Commonsense Driven Knowledge Refinements for Scene Graph Generation


Maëlic Neau, Salim Baloch, Jakob Suchan, Zoe Falomir, Mehul Bhatt

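Affiliations: Computing Science Department, Umeå University; School of Computer Science & Engineering, Constructor University; School of Science and Technology, Örebro University; CoDesign Lab EU

AI summary: Proposes a model-agnostic, semantically guided knowledge refinement framework that mines commonsense constraints from training data and uses declarative commonsense reasoning to correct scene graph predictions at inference time, with no manual rules or retraining, and consistently improves strong baselines on three benchmarks.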

Abstract

Learning-driven Scene Graph Generation (SGG) models excel on frequent relation types but degrade sharply under annotation sparsity, failing to capture reliable visual commonsense knowledge. We propose a model-agnostic, semantically guided knowledge refinement framework that systematically mines commonsense-grounded constraints from training data (capturing spatial, functional, and qualitative relational regularities) and uses general declarative commonsense reasoning to correct and refine ranked SGG predictions at inference time. The framework requires no manual rule authoring, no model retraining, and transfers across datasets and architectures. On three standard benchmarks, we obtain consistent improvements over strong baselines, demonstrating that structured visual commonsense reasoning over deep scene semantics is a practical and effective complement to purely learning-based scene graph generation.

2606.06366 2026-06-05 cs.RO

Waypoints Matter: A Systematic Study for Sampling-Based Trajectory Planning


Josep M. Barbera, Antonio Artuñedo, Jorge Villagra

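Affiliations: AUTOPIA Program at the Centre for Automation and Robotics, CSIC-Universidad Politécnica de Madrid

AI summary: Systematically studies how waypoint placement strategies (uniform spacing, an RDP* variant, and curvature-conditioned allocation) affect sampling-based trajectory planner performance, finding that the nominal inter-waypoint spacing is the primary performance driver and that uniform sampling at a well-tuned spacing matches or surpasses the alternatives.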

Comments
8 pages, 5 figures, 3 tables; accepted at IEEE ITSC 2026
Abstract

Real-time autonomous driving commonly relies on sampling-based trajectory planners that link candidate trajectories to target waypoints along the road centerline. The placement of these waypoints directly impacts both the existence and quality of feasible trajectories. Yet, its effect on planner performance remains largely unexplored. In this paper, we treat waypoint placement as a first-class design variable. We hold the trajectory primitive and candidate budget fixed, and systematically sweep three placement strategies (uniform spacing, an augmented Ramer-Douglas-Peucker variant (RDP*), and a novel curvature-conditioned allocation) across 449 configurations and five CommonRoad maps of increasing geometric complexity. Our results show that the nominal inter-waypoint spacing $d_s$ is the primary performance driver, with large differences in planner reliability attributed to placement alone. Uniform sampling at a well-tuned spacing matches or surpasses both RDP* and the centered curvature variant. The curvature variant offers a small but consistent advantage on geometrically complex roads under reliability-first and balanced weightings, while RDP* never outperforms uniform sampling. These findings suggest that $d_s$ should be treated as the dominant tuning parameter, with geometry-aware strategies reserved for curvature-rich corridors where feasibility is the limiting factor.
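
Two of the placement strategies compared above are easy to illustrate on a polyline centerline. The sketch below shows uniform arc-length spacing with a spacing parameter `d_s` and the classic Ramer-Douglas-Peucker simplification. It is an illustration only: the paper's RDP* is an augmented variant whose details are not in the abstract, and the function names here are assumptions.

```python
import math

def resample_uniform(points, d_s):
    """Place a waypoint every d_s units of arc length along a polyline."""
    out = [points[0]]
    carried = 0.0  # arc length travelled since the last emitted waypoint
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        seg = math.hypot(x1 - x0, y1 - y0)
        if seg == 0.0:
            continue
        pos = d_s - carried  # distance into this segment of the next waypoint
        while pos <= seg + 1e-12:
            t = pos / seg
            out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            pos += d_s
        carried = seg - (pos - d_s)
    return out

def rdp(points, eps):
    """Classic Ramer-Douglas-Peucker: keep points farther than eps from the chord."""
    if len(points) < 3:
        return list(points)
    (x0, y0), (x1, y1) = points[0], points[-1]
    chord = math.hypot(x1 - x0, y1 - y0)

    def dist(p):
        if chord == 0.0:
            return math.hypot(p[0] - x0, p[1] - y0)
        return abs((x1 - x0) * (y0 - p[1]) - (x0 - p[0]) * (y1 - y0)) / chord

    idx, dmax = max(((i, dist(p)) for i, p in enumerate(points[1:-1], 1)),
                    key=lambda item: item[1])
    if dmax <= eps:
        return [points[0], points[-1]]
    # Split at the farthest point and simplify both halves.
    return rdp(points[:idx + 1], eps)[:-1] + rdp(points[idx:], eps)
```

Uniform spacing ignores geometry and is controlled by a single number, while RDP concentrates points where the centerline bends; the study's finding is that tuning the single number matters more than the choice between them.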

2606.06364 2026-06-05 cs.LG stat.ML

End-to-End Subgraph Detection with GraphDETR


Dexiong Chen, Till Hendrik Schulz, Karsten Borgwardt

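Affiliations: Max Planck Institute of Biochemistry

AI summary: Proposes GraphDETR, a framework that treats subgraph detection as a set prediction problem: a graph neural network encodes the target graph, a transformer decoder jointly predicts all pattern instances, and the model is trained end-to-end with bipartite matching. It supports exact and approximate matching, detects patterns of up to 50 nodes in graphs of up to 1000 nodes, and reaches $\text{AP}_{100} = 91.2$ on the ChEMBL dataset.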

Abstract

Subgraph detection seeks to identify whether and where instances of query patterns occur within a larger graph. This problem is fundamental across scientific domains and is closely related to subgraph isomorphism, which is NP-complete, limiting combinatorial approaches to small patterns or moderately sized graphs. We introduce GraphDETR, a deep learning framework that formulates subgraph detection as a set prediction problem, analogous to DETR in object detection. GraphDETR encodes the target graph with a graph neural network, and employs a fixed set of learnable query vectors, decoded via a transformer decoder, to predict all pattern occurrences jointly in a single forward pass. This is enabled by training the model end-to-end with bipartite matching. Unlike traditional combinatorial methods that only solve exact structural matching, GraphDETR naturally extends to approximate matching, enabling detection beyond exact pattern correspondence. Empirically, we show that GraphDETR can detect diverse patterns, such as molecular structures, cycles, cliques, and fuzzy patterns of up to 50 nodes, in target graphs with up to 1000 nodes. We further evaluate on molecular functional group detection over the ChEMBL dataset, where GraphDETR predicts the complete set of functional groups per molecule, achieving a strong performance of $\text{AP}_{100} = 91.2$.
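
The bipartite-matching step in set prediction can be illustrated with a toy example. This is a hedged sketch, not the paper's loss: the cost (one minus Jaccard overlap of node sets) is a hypothetical choice, and the brute-force search over permutations stands in for the Hungarian algorithm that DETR-style training normally uses.

```python
from itertools import permutations

def match_cost(pred_nodes, true_nodes):
    """One minus the Jaccard overlap between predicted and true node sets."""
    p, t = set(pred_nodes), set(true_nodes)
    union = p | t
    return 1.0 - (len(p & t) / len(union) if union else 1.0)

def bipartite_match(preds, targets):
    """Return (assignment, cost); assignment[j] is the prediction index for target j.

    Brute force over permutations, so only suitable for tiny examples.
    """
    assert len(preds) >= len(targets)
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(preds[i], targets[j]) for j, i in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return best, best_cost

# Three query slots, two ground-truth pattern occurrences.
preds = [{4, 5, 6}, {0, 1, 2}, {9}]
targets = [{0, 1, 2}, {4, 5, 7}]
assignment, cost = bipartite_match(preds, targets)
```

Each target is paired with exactly one query slot regardless of slot order; in DETR-style training the unmatched slots are then supervised toward an empty or "no object" prediction.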

2606.06363 2026-06-05 cs.CV

GMBFormer: An NDVI-Guided Global Memory Bank Transformer for Urban Green-Space Extraction from Ultra-High-Resolution Imagery


Hao Lei, Xi Cheng, Chenlu Shu, Zhiheng Chen, Zhengjie Duan, Haoyu Wang, Zhanfeng Shen

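Affiliations: College of Geophysics, Chengdu University of Technology; National Engineering Research Center for Geomatics, Aerospace Information Research Institute, Chinese Academy of Sciences, and University of Chinese Academy of Sciences

AI summary: Addresses two problems in urban green-space extraction from ultra-high-resolution imagery, namely limited semantic reuse among visually similar vegetation patterns and blurred fusion of NDVI with RGB features, by proposing GMBFormer, which decouples NDVI as a physical gate and performs selective prototype retrieval from a global memory bank, improving segmentation accuracy in three evaluation settings.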

Comments
34 pages, 5 figures
Abstract

Urban green-space extraction from ultra-high-resolution (UHR) imagery is commonly performed patch by patch, which limits semantic reuse among spatially separated but visually similar vegetation patterns. Directly injecting the Normalized Difference Vegetation Index (NDVI) into red-green-blue (RGB) backbones can also blur the roles of visual appearance learning and physical vegetation confidence. We propose GMBFormer, a SegFormer-based framework that replaces adjacency-driven feature propagation with selective, similarity-driven prototype retrieval. Only RGB channels enter the backbone and decoder, while NDVI is decoupled as a physics-informed gate that admits high-confidence vegetation descriptors into a compact global memory bank through momentum updates. During training and inference, the current patch queries stored prototypes through memory-mediated cross-attention, and the retrieved response is integrated with bounded overhead. Experiments use a self-constructed Chengdu UHR dataset with 7,700 labeled 512 x 512 patches and two reduced-label settings derived from the public International Society for Photogrammetry and Remote Sensing (ISPRS) Potsdam dataset. Under the same training and evaluation protocol, GMBFormer obtains mean intersection over union (mIoU)/mean Dice (mDice) scores of 89.25%/94.31%, 92.17%/95.92%, and 83.72%/90.86%, respectively, improving the controlled SegFormer-B4 baseline in each setting. Ablation studies indicate that decoupled NDVI admission, memory retrieval, capacity, and momentum jointly shape the final performance.
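
A minimal sketch of an NDVI gate combined with a momentum update. The NDVI formula is the standard one, $(\text{NIR} - \text{Red}) / (\text{NIR} + \text{Red})$; everything else (the `gate` threshold of 0.6, the slot-based memory layout, the function names) is assumed for illustration and is not taken from the paper.

```python
def ndvi(nir, red, eps=1e-8):
    """Normalized Difference Vegetation Index for one pixel or region mean."""
    return (nir - red) / (nir + red + eps)

def momentum_update(prototype, descriptor, momentum=0.9):
    """Exponential moving average of a stored prototype toward a new descriptor."""
    return [momentum * p + (1.0 - momentum) * d
            for p, d in zip(prototype, descriptor)]

def admit(memory, slot, descriptor, nir, red, gate=0.6, momentum=0.9):
    """Update memory[slot] only when NDVI passes the gate; return whether it did."""
    if ndvi(nir, red) < gate:
        return False  # low vegetation confidence: the memory bank is left untouched
    memory[slot] = momentum_update(memory[slot], descriptor, momentum)
    return True

memory = [[0.0, 0.0]]
admitted = admit(memory, 0, [1.0, 2.0], nir=0.8, red=0.1)   # NDVI is about 0.78
rejected = admit(memory, 0, [9.0, 9.0], nir=0.3, red=0.3)   # NDVI is about 0.0
```

Keeping NDVI out of the feature path and using it only as an admission test is what the abstract means by decoupling: the descriptors stay RGB-derived, and NDVI only decides which of them are trusted.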

2606.06361 2026-06-05 cs.CV

Physics in 2-Steps: Locking Motion Priors Before Visual Refinement Erases Them


Woojung Han, Seil Kang, Youngjun Jun, Min-Hung Chen, Fu-En Yang, Seong Jae Hwang

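Affiliations: National Institute of Standards and Technology

AI summary: Finds that image-to-video diffusion models show better physical consistency in 2-step generation than in many-step generation, attributes this through spectral analysis to phase erosion during denoising, and proposes PhaseLock, a training-free framework that extracts a motion prior from 2-step inference and enforces it on high-fidelity generation via Latent Delta Guidance. It mitigates phase degradation and improves physical consistency by 6.2 points on average while maintaining visual fidelity with minimal overhead.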

Comments
ICML 2026
Abstract

Image-to-Video diffusion models leverage input images to generate visually stunning content, yet frequently produce motion that violates physical laws. We reveal a surprising finding: a 2-step generation often exhibits better physical consistency than a 50-step output from the same model. Through spectral analysis, we trace this to phase erosion during denoising; the phase degrades significantly (dropping by $\approx 18\%$ from step 2 to step 50), whereas the magnitude remains relatively stable. Building on this insight, we propose PhaseLock, a training-free framework that preserves the valid motion priors from few-step inference throughout the denoising trajectory. Rather than relying on full-step inference for physical consistency, PhaseLock extracts a motion prior from just 2 steps and enforces it onto high-fidelity generation via Latent Delta Guidance. Our approach effectively mitigates phase degradation, improving physical consistency by an average of 6.2 points across diverse models while largely maintaining visual fidelity, with negligible overhead ($1.06\times$ time, $1.02\times$ memory) and reduced reliance on expensive external guidance methods ($\sim5\times$ time).
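
The distinction between spectral phase and magnitude can be made concrete on a 1-D signal. The abstract does not define the phase metric used in the paper, so `phase_agreement` below is a hypothetical score (mean cosine of the per-frequency phase difference) chosen only to show that phase and magnitude vary independently.

```python
import cmath
import math

def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(signal))
            for k in range(n)]

def phase_agreement(a, b, tol=1e-9):
    """Mean cos(phase difference) over bins where both signals have energy."""
    scores = []
    for fa, fb in zip(dft(a), dft(b)):
        if abs(fa) > tol and abs(fb) > tol:
            scores.append(math.cos(cmath.phase(fa) - cmath.phase(fb)))
    return sum(scores) / len(scores) if scores else 1.0

wave = [0, 1, 0, -1, 0, 1, 0, -1]        # sine, two cycles over eight samples
louder = [2 * x for x in wave]           # magnitude doubled, phase unchanged
shifted = [1, 0, -1, 0, 1, 0, -1, 0]     # same magnitude, quarter-cycle shift

same_phase = phase_agreement(wave, louder)    # close to 1
moved_phase = phase_agreement(wave, shifted)  # close to 0
```

Doubling the amplitude leaves the score near 1, while a shift in time drives it toward 0 with the magnitude spectrum unchanged, which is the kind of separation a spectral analysis of motion relies on.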

2606.06359 2026-06-05 cs.CV

Comparison of Deep Learning Frameworks For Rice Disease Mapping From UAV Multispectral Imaging


Yadav Raj Ghimire, Jagrati Talreja, Tewodros Syum Gebre, Timothy Agboada, Shikha V. Chandel, Leila Hashemi Beni

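Affiliations: University of California, Los Angeles

AI summary: Uses CNN and transformer models to segment the severity of rice bacterial leaf blight in UAV multispectral imagery, finding that lightweight CNN backbones are more reliable for operational monitoring and that vegetation indices bring small, consistent improvements.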

Comments
This paper has been accepted in IGARSS 2026. Copyright 2026 IEEE
Abstract

In this study, UAV multispectral imagery is used to segment the severity of bacterial leaf blight (BLB) in rice using convolutional neural networks (CNNs) and transformer-based models. The evaluated architectures include U-Net with a ResNet-101 encoder, U-Net++ with EfficientNet-B3 and EfficientNet-B7, DeepLabV3+, and SegFormer, all trained under a common pipeline with three input configurations (multispectral only, multispectral+NDVI, and multispectral+NDRE). Experiments are conducted using the publicly available BLB dataset with performance reported using mean IoU (mIoU), mean F1 (mF1), mean accuracy (mAcc), precision, and recall. U-Net++ with EfficientNet-B3 achieved the highest performance, with an mIoU of 97.62%. SegFormer obtained lower segmentation accuracy but comparable inference speed. Overall, the results indicate that lightweight CNN backbones remain more reliable for operational BLB monitoring while integration of vegetation indices provides small and consistent improvements. The study also highlights the value of standardised UAV datasets to compare disease mapping methods and encourages the use of CNN architectures for field implementation.
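
For readers unfamiliar with the headline metric, mean IoU averages the per-class intersection over union. The sketch below uses flat label lists and skips classes absent from both prediction and ground truth; that skipping rule is a common convention and an assumption here, since the abstract does not say how the paper handles empty classes.

```python
def mean_iou(y_true, y_pred, num_classes):
    """Mean intersection over union across classes present in either labelling."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        union = sum(1 for t, p in zip(y_true, y_pred) if t == c or p == c)
        if union:  # skip classes that appear in neither list
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Six pixels, three severity classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = mean_iou(y_true, y_pred, num_classes=3)
```

In this toy case the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5 even though four of six pixels are correct, which shows why mIoU is stricter than pixel accuracy.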

2606.06357 2026-06-05 cs.SD cs.AI eess.AS

F3-Tokenizer: Taming Audio Autoencoder Latents for Understanding and Generation


Dinghao Zhou, Xingchen Song, Di Wu, Pengyu Cheng, Shengfan Shen, Sixiang Lv

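Affiliations: Nanjing University, China; WeNet Open Source Community

AI summary: Addresses the weak structure of continuous audio autoencoder latents and the non-decodability of self-supervised encoders by proposing F3-Tokenizer, which combines a noise-regularized autoencoder bottleneck with a latent-side representation encoder to give a single audio tokenizer for both understanding and generation.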

Comments
Technical report; early work; 9 pages, 2 figures, 5 tables
Abstract

Continuous audio autoencoders reconstruct waveforms well but often produce latents with weak structure for understanding, while self-supervised audio encoders capture semantics but are not directly decodable. This mismatch complicates a single audio tokenizer that must support both understanding and generation. We adapt continuous autoencoder latents to this setting with two components: a noise-regularized autoencoder bottleneck and a latent-side representation encoder. The bottleneck uses channel normalization and stochastic perturbation instead of KL-based variational training, yielding scale-controlled continuous latents for reconstruction and autoregressive generation. The representation encoder is trained on frozen autoencoder latents with RQ-MTP and frozen-LLM supervision. The resulting tokenizer provides high-dimensional representations for understanding while preserving normalized continuous latents as generation targets.

2606.06356 2026-06-05 cs.AI

Where Should Knowledge Enter? A Layered Framework for Knowledge Infusion in Multimodal Iterative Generative Models


Renjith Prasad, Chathurangi Shyalika, Anushka Pawar, Amit Sheth

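Affiliations: University of South Carolina; Indian AI Research Organization

AI summary: Proposes a layered framework that divides knowledge infusion in multimodal iterative generative models into four intervention layers (surface, trajectory, latent, and parametric), and shows in diffusion-model experiments that composing layers reduces knowledge-violating outputs in a complementary way.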

Abstract

Multimodal generative models produce fluent outputs but remain unreliable when generation must respect structured, domain-specific, or safety-critical knowledge. Existing methods incorporate knowledge through mechanisms such as prompt augmentation, guidance, latent editing, or fine-tuning, yet they are typically categorized by technique rather than by the component of the generative process they modify. We argue that knowledge infusion in iterative generative models is fundamentally an intervention-layer problem. Since the generative process unfolds as a trajectory of internal states, knowledge can act on four structurally distinct components of this process: the input/output boundary, the transition function, the intermediate state, and the model parameters. This maps to four intervention layers: surface, trajectory, latent, and parametric infusion. We instantiate the framework in diffusion models, map representative methods to all four layers, and derive design principles for multi-layer composition. In a controlled safety-alignment experiment using a multimodal knowledge graph with two diffusion backbones, we implement three of the four layers cumulatively: surface (input-side and output-side) and trajectory-latent (mid-generation). We show empirically that each additional layer addresses failure classes that prior layers cannot reach, reducing knowledge-violating outputs by 70.97% compared to vanilla generation and empirically confirming the framework's complementarity prediction.

2606.06353 2026-06-05 cs.LG

Maximising the Set-Piece Return: Optimising Football Corner Tactics with Graph Reinforcement Learning


Sean Groom, Michael Groom, Francisco Belo, Axl Rice, Liam Anderson, Victor-Alexandru Darvariu, Shuo Wang

Affiliations: School of Computer Science, University of Birmingham, Birmingham, UK; Nottingham Forest Football Club, Nottingham, UK; Oxford Robotics Institute, University of Oxford, Oxford, UK; School of Sport, Exercise and Rehabilitation Sciences, University of Birmingham, Birmingham, UK

AI summary: Proposes a graph reinforcement learning framework that adjusts attacking player positions and velocities to maximise first-contact shot probability from corners, outperforming traditional optimisation methods on Premier League corner data.

Comments
11 pages, 4 figures
Abstract

Machine learning is increasingly employed for the evaluation of football tactics. However, existing approaches focus on characterising historical actions or analyst-specified counterfactual scenarios. In this work, we seek to go beyond the imitation of historically observed patterns towards discovering new generalisable player configurations and strategies. To tackle this, we focus on optimising corner kick routines, and formulate a decision-making problem in which a central policy makes adjustments to attacking player positions and velocities to maximise first contact shot probability. Unlike classic optimisation that solves for isolated setups, we contribute a reinforcement learning architecture operating on graph-structured data that yields a general policy for adjusting arbitrary starting player positions. Evaluated on over 3,000 Premier League corners, our approach strongly outperforms baseline optimisation techniques under matched inference budgets. Our results suggest that graph reinforcement learning can shift set-piece analysis from historical evaluation and imitation towards reward-driven tactical discovery.

2606.06350 2026-06-05 cs.CL

EDIT: Evidence-Diagnosed Intervention Training for Rule-Faithful LLM Grading


Zhihao Wu, Linhai Zhang, Taiyi Wang, Runcong Zhao, Peter Andrews, Cesare Aloisi, Yulan He

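Affiliations: King's College London; University of Cambridge; AQA; The Alan Turing Institute

AI summary: Proposes EDIT, a framework that uses internal model signals to locate and revise faulty reasoning steps, combined with belief-guided reward shaping, to make LLM grading more faithful to the rubric.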

Abstract

Reliable rubric grading requires more than accurate score prediction. Each judgement must be grounded in the mark scheme and evidence from the student answer. Existing credit-assignment and intervention methods, primarily designed for self-contained reasoning tasks such as mathematics reasoning, struggle in this setting because they do not identify where grading reasoning goes wrong or how the model's belief about the final mark changes during reasoning. We propose Evidence-Diagnosed Intervention Training (EDIT), a two-phase framework for training more rubric-faithful LLM graders. First, EDIT-SFT locates problematic reasoning steps using internal model signals: posterior belief over the final mark and input-grounding scores. It then revises only these local steps with help from a rubric checklist. Second, EDIT-RL calibrates the grader with belief-guided reward shaping, penalising large harmful belief drifts while still allowing helpful exploration. Experiments on two real-world, multi-subject grading benchmarks demonstrate that EDIT consistently outperforms strong supervised fine-tuning and reinforcement learning baselines on both in-domain and out-of-domain splits, with ablation studies confirming that internal-state diagnostics drive these gains.

2606.06349 2026-06-05 cs.CL

"Chi nas dal soch el sent de legn" -- Auditing Text Corpora for Lombard


Edoardo Signoroni, Pavel Rychlý

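Affiliations: NLP Centre, Faculty of Informatics, Masaryk University

AI summary: Manually audits parallel and monolingual corpora for Lombard, finding that web-scraped data suffers from severe language misidentification, boilerplate text, and non-linguistic noise, and revealing a representational bias in which high-quality data skews towards Western Lombard varieties while Eastern varieties are marginalised. It stresses the need for variety-aware, community-driven data curation.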

Comments
Submitted to TSD 2026
Abstract

Several of the world's languages are still under-resourced in terms of Natural Language Processing (NLP) tools. This is mostly due to the lack of high-quality datasets to train, develop, and evaluate systems and models for several tasks, such as Machine Translation (MT). We conduct a manual audit of the parallel and monolingual corpora available for Lombard, an under-resourced language continuum from Italy. Our analysis reveals that the perceived abundance of web-scraped data is an illusion, with massive datasets plagued by severe language misidentification, boilerplate text, and non-linguistic noise. Furthermore, we analyze the orthographic composition of the valid Lombard portions across web-scraped datasets, curated corpora, and benchmarks. Our findings show conflicting orthographical systems and severe representational bias across all corpora: high-quality data is heavily skewed towards Western Lombard varieties, with Eastern ones left on the margins. This underscores the need for variety-aware, community-driven data curation rather than purely quantity-driven scraping.

2606.06348 2026-06-05 cs.LG

Performance Evaluation of GraphCast for Medium-Range Weather Forecasting over Brazil


Wolfgang R. Rowell, Lucas S. Kupssinskü

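Affiliations: MALTA, Machine Learning Theory and Applications Lab, PUCRS, Porto Alegre, Brazil

AI summary: Evaluates GraphCast against the ECMWF IFS HRES baseline for medium-range forecasting over four Brazilian climatic sub-regions, finding season-dependent skill: it underperforms in the medium range during winter but gains an advantage in the extended range, and in summer it accurately captures large-scale moisture transport while dampening high-frequency convective variability.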

Abstract

The paradigm of global weather forecasting is rapidly shifting with the emergence of Machine Learning Weather Prediction models (MLWP). While these data-driven architectures demonstrate remarkable global skill, regional benchmarks in the Global South remain scarce, leaving their efficacy in complex, highly convective environments largely unverified. This study evaluates the performance of GraphCast operational against the deterministic ECMWF IFS HRES as baseline across four distinct Brazilian climatic sub-regions. Utilizing a scalable, cloud-native pipeline and the WeatherBench-X framework for benchmarking weather models, we assess selected tropospheric variables ($T_{850}$, $Q_{850}$, $Z_{500}$) over four selected seasonal windows, employing the operational IFS analysis as the ground truth to calculate the statistical metrics for both models. Results reveal a regime-dependent skill profile. During the austral winter, GraphCast underperforms in the medium range (lead days 2-7) for $Z_{500}$ when resolving fast-propagating baroclinic systems over southern Brazil, but regains an advantage in the extended range, where its inherent smoothing of chaotic small-scale variability becomes beneficial under deterministic skill metrics. Conversely, during the austral summer wet season, GraphCast accurately captures large-scale moisture transport while intrinsically dampening the high-frequency convective variability that degrades deterministic NWP temperature forecasts. These findings establish a baseline for Brazil and define the specific physical boundaries that will guide future "tropicalization" efforts, aiming to optimize these foundational AI models for regional resilience.
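
The abstract does not name the statistical metrics it computes. As an illustration of the kind of deterministic score typically used in such benchmarks, here is a latitude-weighted RMSE on a small grid, assuming cosine-of-latitude area weights; the function name and grid layout (one row per latitude) are hypothetical.

```python
import math

def lat_weighted_rmse(forecast, truth, lats_deg):
    """RMSE over a lat-lon grid with cos(latitude) area weights.

    forecast and truth are lists of rows, one row per latitude in lats_deg.
    Weights are normalized to have mean 1 so the result stays in field units.
    """
    weights = [math.cos(math.radians(lat)) for lat in lats_deg]
    mean_w = sum(weights) / len(weights)
    total, count = 0.0, 0
    for w, f_row, t_row in zip(weights, forecast, truth):
        for f, t in zip(f_row, t_row):
            total += (w / mean_w) * (f - t) ** 2
            count += 1
    return math.sqrt(total / count)

# Two latitude rows (equator and 60 degrees), two longitudes each.
forecast = [[1.0, 1.0], [2.0, 2.0]]
analysis = [[0.0, 0.0], [0.0, 0.0]]
score = lat_weighted_rmse(forecast, analysis, lats_deg=[0.0, 60.0])
```

Because squared error rewards smooth fields, a score like this is one place where the smoothing behaviour mentioned in the abstract can turn into an apparent advantage at long lead times.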

2606.06345 2026-06-05 cs.AI cs.LG q-bio.NC

Boosting Brain-to-Image Decoding with TRIBE v2 Data Augmentation

使用TRIBE v2数据增强提升脑到图像解码

Yohann Benchetrit, Marlène Careil, Simon Dahan, Hubert Banville, Stéphane d'Ascoli, Jean-Rémi King

发表机构 * Meta AI

AI总结 针对脑解码中标记数据稀缺的问题,提出利用预训练的fMRI响应模型TRIBE v2生成合成数据来增强小样本数据集,在两个数据集上实现最高68%的Top-10图像检索准确率提升,并发现纯合成数据训练的解码器在零样本设置中也能达到高于随机水平的性能。

详情
AI中文摘要

脑解码受限于标记神经数据的可用性,在低数据量情况下仍然具有挑战性。为了解决这个问题,我们研究了是否以及何时可以通过使用预训练的fMRI刺激响应模型生成的合成数据来增强小样本fMRI数据集,从而提升脑解码性能。我们使用TRIBE v2,这是一个大型编码模型,在超过1000小时的视频、音频和语言fMRI响应数据上进行了预训练。对于每个数据集,我们评估了系统网格,展示了图像解码器性能如何随用于训练的合成数据量变化。基于两个数据集(7T fMRI自然场景数据集和3T fMRI BOLD5000)的结果显示,与仅使用真实数据训练的解码器相比,Top-10图像检索准确率最高提升68%。重要的是,达到给定图像解码性能所需的增强数据比例需要根据数据源进行调整。令人惊讶的是,仅使用合成fMRI数据训练的图像解码器在某些设置下性能高于随机水平,表明TRIBE v2可以支持零样本脑到图像解码。这些结果共同表明,大规模fMRI响应模型(针对视觉、声音和语言)可以为提高图像解码的数据效率提供基础。

英文摘要

Brain decoding is limited by the availability of labeled neural data, and remains challenging in low-data regimes. To address this issue, we investigate whether and when brain decoding can be boosted by augmenting small fMRI datasets with synthetic data generated by a pretrained model of fMRI responses to stimuli. We use TRIBE v2, a large encoding model pretrained on more than 1000 hours of fMRI responses to video, audio and language. For each dataset, we evaluate systematic grids that show how the performance of image decoders varies with the amount of synthetic data used for training. Our results, based on two datasets (the 7T fMRI Natural Scenes Dataset and 3T fMRI BOLD5000), show up to 68% improvement in Top-10 image-retrieval accuracy compared to decoders trained only on real data. Importantly, the proportion of augmented data required to reach a given image decoding performance needs to be adjusted depending on the data source. Surprisingly, image decoders trained exclusively on synthetic fMRI can perform above chance in some settings, suggesting that TRIBE v2 can support zero-shot brain-to-image decoding. Together, these results show how large-scale models of the fMRI responses to sight, sound and language may provide a foundation to improve the data efficiency for image decoding.

2606.06344 2026-06-05 cs.LG cs.SC

Equivariant Neural Belief Propagation

等变神经信念传播

Zehua Cheng, Wei Dai, Jiahao Sun

发表机构 * Department of Computer Science, University of Oxford(计算机科学系,牛津大学) FLock.io

AI总结 提出等变神经信念传播(ENBP),通过等变高斯混合消息和秩2精度矩阵合成,实现SE(3)对称性下的高效概率推理,在分子构象和机器人推理任务中显著超越基线。

详情
Comments
18 pages
AI中文摘要

空间嵌入变量上的概率推理需要尊重$SE(3)$对称性的信念,然而现有的等变网络仅产生标量和向量,而不是各向异性不确定性所需的秩2精度张量,并且单分量消息将多模态能量景观坍缩为物理上无意义的平均值。我们引入了等变神经信念传播(ENBP),一种因子图框架,其消息是等变高斯混合模型,其充分统计量在$SE(3)$下精确变换。秩2精度矩阵通过等变外积合成,经由可微谱分解接入,并通过贪婪的基于KL的混合约简保持可处理性,该约简可证明与$SE(3)$交换。在GEOM-QM9和GEOM-Drugs上,ENBP以0.090 $\mathring{A}$的误差实现了98.9%的构象覆盖率,延迟为亚秒级,比扩散基线快100倍以上且精度更高。在多体机器人推理中,普通环状BP在智能体达到15个及以上时发散,而ENBP收敛,碰撞率接近零,等变性误差达到机器精度(${\sim}10^{-7}$,而增强基线为$10^{-1}$)。

英文摘要

Probabilistic inference over spatially embedded variables requires beliefs that respect $SE(3)$ symmetry, yet existing equivariant networks produce only scalars and vectors, not the rank-2 precision tensors needed for anisotropic uncertainty, and single-component messages collapse multi-modal energy landscapes to physically meaningless averages. We introduce Equivariant Neural Belief Propagation (ENBP), a factor-graph framework whose messages are equivariant Gaussian mixture models with sufficient statistics that transform exactly under $SE(3)$. Rank-2 precision matrices are synthesised via equivariant outer products, ingested through differentiable spectral decomposition, and kept tractable by a greedy KL-based mixture reduction that provably commutes with $SE(3)$. On GEOM-QM9 and GEOM-Drugs, ENBP achieves 98.9% conformational coverage at 0.090 $\mathring{A}$ error with sub-second latency, over $100\times$ faster than diffusion baselines at higher accuracy. On multi-body robotic inference, vanilla loopy BP diverges at 15+ agents while ENBP converges with near-zero collision rates and machine-precision equivariance error (${\sim}10^{-7}$ vs. $10^{-1}$ for augmented baselines).
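The claim that rank-2 precision matrices can be built from equivariant outer products can be checked numerically. The sketch below is only an illustration of that one property, not the ENBP architecture: the helper names, the `eps` regulariser, and the choice of two input vectors are assumptions made for the example. It builds a symmetric positive-definite matrix as a sum of outer products and verifies that rotating the inputs first gives the same result as rotating the matrix afterwards, i.e. $P(Rv) = R\,P(v)\,R^\top$.

```python
import math

def matmul(a, b):
    # Plain dense matrix product for small lists-of-lists.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def matvec(a, v):
    return [sum(a[i][k] * v[k] for k in range(len(v))) for i in range(len(a))]

def rotation_z(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rotation_x(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def precision_from_vectors(vectors, eps=1e-3):
    # Hypothetical construction: sum of outer products v v^T plus eps * I.
    # The result is symmetric positive definite and transforms as a rank-2 tensor.
    n = len(vectors[0])
    p = [[eps if i == j else 0.0 for j in range(n)] for i in range(n)]
    for v in vectors:
        for i in range(n):
            for j in range(n):
                p[i][j] += v[i] * v[j]
    return p

def max_abs_diff(a, b):
    return max(abs(a[i][j] - b[i][j])
               for i in range(len(a)) for j in range(len(a[0])))

# An arbitrary rotation and two arbitrary direction vectors.
R = matmul(rotation_z(0.7), rotation_x(-1.1))
vectors = [[1.0, 2.0, -0.5], [0.3, -0.4, 2.0]]

rotate_then_build = precision_from_vectors([matvec(R, v) for v in vectors])
build_then_rotate = matmul(matmul(R, precision_from_vectors(vectors)), transpose(R))
equivariance_error = max_abs_diff(rotate_then_build, build_then_rotate)
```

The two routes agree up to floating-point rounding because $(Rv)(Rv)^\top = R\,vv^\top R^\top$ and $R\,(\epsilon I)\,R^\top = \epsilon I$ for any rotation $R$. Translations leave the matrix unchanged as long as the inputs are direction vectors such as relative positions.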

2606.06338 2026-06-05 cs.CV

StoryVideoQA: Scaling Deep Video Understanding with a Large-Scale, Multi-Genre and Auto-Generated Dataset

StoryVideoQA: 通过大规模、多类型和自动生成的数据集扩展深度视频理解

Zhengqian Wu, Zhixian Liu, Aodong Chen, Jingyang Zhang, Ruizhe Li, Hanlin Ge, Zhongyuan Wang, Chunxia Xiao, Chao Liang

发表机构 * School of Computer Science, Wuhan University(武汉大学计算机学院) National Engineering Research Center for Multimedia Software(多媒体软件国家工程研究中心) Hubei Key Laboratory of Multimedia and Network Communication Engineering(湖北省多媒体与网络通信工程重点实验室)

AI总结 提出StoryVideoQA数据集和PlotTree方法,通过多智能体协作框架自动生成大规模深度视频理解问答对,并利用层次化情节结构提升复杂故事线推理能力。

详情
Journal ref
International Journal of Computer Vision (2026)
Comments
Accepted by IJCV 2026
AI中文摘要

视频问答(VideoQA)旨在回答关于给定视频的问题。现有方法在事实型VideoQA上表现出色,但在深度视频理解(DVU)上存在困难,后者需要理解复杂的故事线。这一挑战源于固有的长程视频内容、多类型问题以及实例级故事元素,这些都限制了人工构建DVU数据集的规模和多样性。为了解决这些问题,我们之前引入了StoryMind来自动构建具有平衡细粒度主题的DVU数据集。尽管它能为电视剧生成高质量问答对,但在处理更长更复杂的电影时性能显著下降。本文进一步设计了StoryMindv2,一个增强的多智能体协作框架,用于为电视剧和电影生成高质量的DVU数据集。通过集成新颖的监督引导生成机制和精细的多审阅者投票策略,该框架用于构建StoryVideoQA,这是迄今为止最大的DVU数据集,包含超过363K个问答对,覆盖393.2小时多样化的故事视频,包括电视剧(平均1635秒)和电影(平均7878秒)。在此大规模基准上对20种最先进的VideoQA方法进行全面评估,发现它们无法完全维持长程角色关联或构建对复杂故事线的连贯理解。为弥补这一差距,我们提出PlotTree,一种新颖的视频理解智能体,将长程视频内容重新组织为层次化情节结构,从而在StoryVideoQA上实现高效的故事线推理。项目页面:https://github.com/nercms-mmap/StoryVideoQA/

英文摘要

Video question answering (VideoQA) aims to answer questions about given videos. While existing approaches excel on factoid VideoQA, they struggle with deep video understanding (DVU), which requires the comprehension of complex storylines. This challenge arises from the inherent long-range video content, multi-faceted question types, and instance-level story elements, all of which constrain the scale and diversity of manually constructed DVU datasets. To address these constraints, we previously introduced StoryMind to automatically construct DVU datasets with balanced fine-grained topics. Though it can generate high-quality question-answer pairs (QAs) for TV series, it suffers significant performance degradation when handling longer and more complex movies. In this paper, we further design StoryMindv2, an enhanced multi-agent collaboration framework to generate high-quality DVU datasets for both TV series and movies. By integrating a novel supervisor-guided generation mechanism and a refined multi-reviewer voting strategy, the framework is used to construct StoryVideoQA, the largest DVU dataset to date, featuring over 363K QAs on 393.2 hours of diverse story videos, including TV series (avg. 1,635 seconds) and movies (avg. 7,878 seconds). Comprehensive evaluations of 20 state-of-the-art VideoQA methods on this large-scale benchmark reveal that they cannot fully maintain long-range character associations or construct a coherent understanding of complex storylines. To bridge this gap, we propose PlotTree, a novel video understanding agent that reorganizes long-range video content into a hierarchical plot structure, enabling efficient storyline reasoning on StoryVideoQA. Project page: https://github.com/nercms-mmap/StoryVideoQA/

2606.06337 2026-06-05 cs.AI

TokenMizer: Graph-Structured Session Memory for Long-Horizon LLM Context Management

TokenMizer: 面向长程LLM上下文管理的图结构会话记忆

Shweta Mishra

发表机构 * Independent Researcher(独立研究者)

AI总结 提出TokenMizer,一种将LLM会话历史建模为类型化知识图的开源代理系统,通过混合提取、三级检查点和8层压缩流水线,在显著减少token开销的同时保留结构化决策信息。

详情
Comments
12 pages, 10 figures. Code and benchmark available at https://github.com/Shweta-Mishra-ai/tokenmizer
AI中文摘要

大型语言模型(LLM)在长程任务部署中面临一个基本约束:上下文窗口是有限的,而生产性工作会话却不是。当历史超过最大有效上下文窗口(MECW)时,关键的结构化信息——架构决策、任务转换、文件历史——会被静默丢弃。现有缓解方法将历史视为纯文本,破坏了使会话可恢复的关系结构。我们提出TokenMizer,一个将LLM会话历史建模为类型化知识图的开源代理系统。该模式定义了14种节点类型和7种边类型。混合提取流水线逐步填充图,而三级检查点系统将其序列化为紧凑的恢复块。8层压缩流水线减少上下文开销,语义缓存降低重复查询延迟。在涵盖5个领域的21个会话的受控基准上评估,TokenMizer展示了显著的token经济性。它生成的恢复块平均78个token(范围:42-124)——比评估基线(159-170个token)小2倍——同时实现了更高的决策召回率(+9-17个百分点)。关键的是,基线仅保留提到某项技术的事实;TokenMizer保留了其原理。在所有会话中,TokenMizer实现了平均任务召回率51.0%、决策召回率46.6%和文件召回率58.7%。方差反映了领域异质性:显式命令式表述(软件工程)得分高于隐式推理(研究)。消融研究表明模糊标签匹配是主要的改进因素(任务召回率+33个百分点)。启发式压缩实现了47.3%的token减少且零外部依赖。TokenMizer以一半的token成本提供了可查询的替代方案,优于文本保留基线。

英文摘要

Large language model (LLM) deployments for long-horizon tasks face a fundamental constraint: context windows are finite while productive work sessions are not. When history exceeds the Maximum Effective Context Window (MECW), critical structured information - architectural decisions, task transitions, file histories - is silently discarded. Existing mitigations treat history as flat text, destroying the relational structure that makes sessions resumable. We present TokenMizer, an open-source proxy system that models LLM session history as a typed knowledge graph. The schema defines 14 node types and 7 edge types. A hybrid extraction pipeline populates the graph incrementally, while a three-tier checkpoint system serializes it into compact resume blocks. An 8-layer compression pipeline reduces context overhead, and a semantic cache reduces repeated-query latency. Evaluated on a controlled benchmark of 21 sessions spanning 5 domains, TokenMizer demonstrates significant token economy. It produces resume blocks averaging 78 tokens (range: 42-124) - 2x smaller than evaluated baselines (159-170 tokens) - while achieving higher decision recall (+9-17 percentage points). Crucially, baselines only preserve that a technology was mentioned; TokenMizer preserves the rationale. Across all sessions, TokenMizer achieves mean task recall 51.0%, decision recall 46.6%, and file recall 58.7%. Variance reflects domain heterogeneity: explicit imperative phrasing (software engineering) scores higher than implicit reasoning (research). Ablation studies show fuzzy label matching is the dominant improvement factor (+33 pp task recall). The heuristic compression achieves 47.3% token reduction with zero external dependencies. TokenMizer provides a queryable alternative to text-retention baselines at half the token cost.

2606.06335 2026-06-05 cs.LG cs.AI

Bridging Domain Expertise and Generalization for Performance Estimation

弥合领域专业知识与泛化能力以实现性能估计

Shuxuan Li, Zhilin Zhao, Quyu Kong, Wei-Shi Zheng

发表机构 * School of Computer Science and Engineering, Sun Yat-sen University, China(中山大学计算机科学与工程学院) Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China(机器智能与先进计算教育部重点实验室) Shenzhen Loop Area Institute, China(深圳河套学院) Alibaba Cloud(阿里云)

AI总结 提出FRAP方法,利用外部基础模型(foundation model)与待评估基模型(base model)的互补优势,通过温度缩放校准对齐两者的预测分布,并按置信度加权融合为更可靠的参考分布以替代真实标签,从而在分布偏移下准确估计模型性能。

详情
AI中文摘要

分布偏移下的性能估计旨在预测模型在未标记测试集上的行为,该测试集的分布与训练数据不同,这一场景需要能够真实反映模型行为且无需真实标签的可靠指标。现有方法仅依赖给定模型的输出,而一旦分布发生偏移,其偏差会被放大,削弱了与真实性能的相关性。针对这一局限,我们提出融合参考对齐预测(FRAP),利用外部基础模型(foundation model)和基模型(base model)的互补优势,构建更可靠的真实标签替代。FRAP通过应用温度缩放校准来最小化两者预测分布之间的差异,从而将基础模型的预测分布与基模型的预测分布对齐。对齐后的预测通过基于置信度的加权融合成精炼的参考分布,该分布整合了基础模型的鲁棒性和基模型的领域专业知识,并通过测量基模型预测与该参考分布的一致性来获得性能估计。在多种数据集和架构上的大量实验表明,FRAP在分布偏移下相较于代表性性能估计方法取得了持续且显著的改进。

英文摘要

Performance estimation under distribution shift aims to predict how a model behaves on an unlabeled test set whose distribution differs from the training data, a scenario that requires reliable indicators that can faithfully reflect model behavior without ground-truth labels. Existing approaches rely solely on the outputs of the given model whose biases are amplified once the distribution shifts, weakening the correlation with the true performance. Motivated by this limitation, we propose Fused Reference Alignment Prediction (FRAP), which leverages the complementary strengths of an external foundation model and the base model to construct a more reliable surrogate of the ground-truth labels. FRAP aligns the prediction distribution of the foundation model with that of the base model by applying temperature-scaled calibration that minimizes their divergence. The aligned predictions are fused through confidence-based weighting into a refined reference distribution that integrates robustness from the foundation model and domain-specific expertise from the base model, and performance estimation is obtained by measuring how closely the base model predictions agree with this reference. Extensive experiments across diverse datasets and architectures show that FRAP provides consistent and substantial improvements over representative performance-estimation methods under distribution shift.
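The three steps named in the abstract (temperature-scaled alignment, confidence-based fusion, agreement with the fused reference) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the divergence is taken to be KL(base ‖ scaled foundation), the temperature is found by grid search, each model's confidence is its maximum class probability, and all function names and the toy logits are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q, eps=1e-12):
    # KL(p || q) with a small eps to avoid log(0).
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def argmax(p):
    return max(range(len(p)), key=p.__getitem__)

def fit_temperature(foundation_logits, base_probs, grid):
    # Assumed alignment step: pick the temperature whose scaled foundation
    # predictions have the lowest mean KL divergence from the base predictions.
    def mean_kl(t):
        return sum(kl(b, softmax(f, t))
                   for f, b in zip(foundation_logits, base_probs)) / len(base_probs)
    return min(grid, key=mean_kl)

def fuse(p_foundation, p_base):
    # Assumed confidence-based weighting: weight each model by its max probability.
    cf, cb = max(p_foundation), max(p_base)
    wf = cf / (cf + cb)
    return [wf * a + (1.0 - wf) * b for a, b in zip(p_foundation, p_base)]

def estimate_accuracy(foundation_logits, base_logits, grid=None):
    grid = grid or [0.25 * i for i in range(1, 41)]
    base_probs = [softmax(z) for z in base_logits]
    t = fit_temperature(foundation_logits, base_probs, grid)
    agree = 0
    for f, b in zip(foundation_logits, base_probs):
        reference = fuse(softmax(f, t), b)
        # The base model counts as correct when it agrees with the fused reference.
        agree += int(argmax(reference) == argmax(b))
    return agree / len(base_probs), t

# Toy unlabeled test set: 4 samples, 3 classes (made-up logits).
foundation_logits = [[4.0, 0.0, 0.0], [0.0, 5.0, 1.0], [0.0, 0.5, 6.0], [6.0, 0.0, 0.5]]
base_logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [1.2, 1.0, 0.9], [0.2, 0.3, 0.6]]
estimate, temperature = estimate_accuracy(foundation_logits, base_logits)
```

The returned `estimate` is the fraction of samples on which the base model's top class matches the fused reference, which serves as a label-free proxy for accuracy. When both models produce identical logits, the fitted temperature is 1 and the estimate is 1.0, which is a useful sanity check.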

2606.06334 2026-06-05 cs.LG

Quantifying the Privacy of Counterfactuals by Leveraging Membership Inference Attacks Against Synthetic Data

通过利用针对合成数据的成员推理攻击量化反事实的隐私性

Maryam Babaei, Yingke Wang, Hadrien Lautraite, Heber H. Arcolezi, Ulrich Aivodji, Sebastien Gambs

发表机构 * ÉTS Montreal and Mila Canada(蒙特利尔ÉTS学院和Mila加拿大) UQAM Canada(魁北克大学UQAM加拿大) Inria Grenoble France(法国格勒诺布尔Inria)

AI总结 本文利用针对合成数据的成员推理攻击,证明仅通过反事实即可成功实施成员推理攻击,无需访问模型,揭示了反事实发布中的隐私风险。

详情
AI中文摘要

反事实通常用于高风险决策领域,通过展示用户档案的变化如何导致期望结果来解释机器学习模型。然而,通过反事实解释模型决策也可能被对手利用,对模型或其训练数据进行隐私攻击。基于反事实提供真实训练数据的现实替代品(类似于合成数据)的类比,我们在本文中展示了如何通过借鉴针对合成数据开发的攻击,成功地对反事实进行隐私攻击。更准确地说,我们研究了针对合成数据设计的成员推理攻击在各种类型反事实上的有效性。此外,虽然现有的针对反事实的成员推理攻击通常需要能够查询模型,但我们展示了如何仅使用一组反事实(无需访问生成它们的模型)即可成功进行成员推理攻击。我们的结果表明,模型开发者在向不同用户发布反事实时应更加谨慎,因为这可能导致隐私泄露。

英文摘要

Counterfactuals are typically used in high-stakes decision areas to explain a machine learning model by showing how changes to the user profiles result in the desired outcome. However, explaining the model's decisions through counterfactuals can also be exploited by an adversary to conduct privacy attacks against the model or its training data. Drawing on the analogy that counterfactuals provide realistic substitutes for real training data, similar to synthetic data, we demonstrate in this paper how it is possible to successfully perform privacy attacks on counterfactuals by drawing on the attacks developed against synthetic data. More precisely, we investigate the effectiveness of the membership inference attacks designed for synthetic data on various types of counterfactuals. Additionally, while existing membership inference attacks against counterfactuals usually require to be able to query the model, we show how it is possible to perform successful membership inference attacks using only a set of counterfactuals, with no access to the model from which they are generated. Our results demonstrate that model developers should be more cautious when releasing counterfactuals to various users, as it can lead to a privacy breach.

2606.06333 2026-06-05 cs.LG cs.AI

Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability

子空间感知稀疏自编码器用于有效的机制可解释性

Seyed Arshan Dalili, Mehrdad Mahdavi

发表机构 * The Pennsylvania State University(宾夕法尼亚州立大学)

AI总结 针对稀疏自编码器将特征假设为一维导致特征分裂的问题,提出子空间感知稀疏自编码器(SASA),通过学习解码器子空间、块稀疏门控和核范数正则化,在GPT-2和Mistral-7B上减少特征分裂和吸收,提高单义性和可解释性。

详情
AI中文摘要

稀疏自编码器(SAEs)广泛用于大型语言模型的机制可解释性,但其公式为每个潜在特征分配单个解码器方向,隐含地假设特征是一维的。我们证明这一假设与模型特征的多维结构不匹配,通过两种不同机制可证明地诱导特征分裂。从几何角度看,用单方向解码器重构内在维度$d_i \ge 2$的特征到误差$\varepsilon$,所需的原子数量随$d_i$呈指数增长。从端到端优化角度看,这种分裂不仅是可能的,而且是主动偏好的。我们证明存在一条从真实的$d_i$维基到$\ell_1$正则化SAE目标严格更低风险的连续路径,其下降方向驱使任何训练字典进入该指数区域。因此,一个单一连贯的特征被碎片化到许多近乎共线的潜在变量中,产生虚假的多重性并掩盖内在几何结构。受此启发,我们引入子空间感知稀疏自编码器(SASA),用学习的解码器子空间替换单向量解码器,通过Top-$s$组门控强制块稀疏性,并用核范数正则化器适应每个组的有效秩。然后我们证明,一旦块大小满足$r \ge d_i$,单个组不仅能表示整个特征切片,而且是SASA目标的全局最小值。这种整合产生样本复杂度关于$d_i$的多项式而非指数——鉴于每次训练激活都需要LLM前向传递,这是一个决定性优势。实验上,在GPT-2和Mistral-7B上,SASA减少了特征分裂和吸收,提高了单义性和可解释性,并且在约一半的token预算下训练,性能匹配或超过标准SAE。

英文摘要

Sparse Autoencoders (SAEs) are widely used for mechanistic interpretability in large language models, yet their formulation assigns each latent feature a single decoder direction, implicitly assuming features to be one-dimensional. We show that this assumption mismatches with the multi-dimensional structure of model features, provably inducing feature splitting through two distinct mechanisms. Geometrically, reconstructing a feature of intrinsic dimension $d_i \ge 2$ to error $\varepsilon$ with single-direction decoders forces a number of atoms that is exponential in $d_i$. From an end-to-end optimization perspective, this splitting is not merely possible but actively preferred. We prove that there exists a continuous path from the true $d_i$-dimensional basis to a strictly lower risk of the $\ell_1$-regularized SAE objective, whose descent directions drive any trained dictionary into that exponential regime. A single coherent feature is therefore fragmented across many near-collinear latents, producing spurious multiplicity and obscuring the intrinsic geometry. Motivated by this, we introduce Subspace-Aware Sparse Autoencoders (SASA), which replace single-vector decoders with learned decoder subspaces, enforce block sparsity via Top-$s$ group gating, and adapt each group's effective rank with a nuclear-norm regularizer. We then show that once the block size satisfies $r \ge d_i$, a single group not only can represent the entire feature slice but is the global minimizer of the SASA objective. This consolidation yields a sample complexity polynomial in $d_i$ rather than exponential -- a decisive advantage given that every training activation costs an LLM forward pass. Empirically, on GPT-2 and Mistral-7B, SASA reduces feature splitting and absorption, improves monosemanticity and interpretability, and matches or exceeds standard SAEs while training on roughly half the token budget.
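A small sketch may help make Top-$s$ group gating and decoder subspaces concrete. It is an assumption-laden toy and not the SASA implementation: groups are ranked by the L2 norm of their latent block, the nuclear-norm regulariser is omitted, and the function names, block layout, and numbers are invented for the example.

```python
import math

def top_s_group_gate(latents, group_size, s):
    # Split the latent vector into contiguous groups and keep only the
    # s groups with the largest L2 norm (assumed ranking rule).
    if len(latents) % group_size != 0:
        raise ValueError("latent length must be a multiple of group_size")
    groups = [latents[i:i + group_size] for i in range(0, len(latents), group_size)]
    norms = [math.sqrt(sum(x * x for x in g)) for g in groups]
    ranked = sorted(range(len(groups)), key=lambda i: norms[i], reverse=True)
    keep = set(ranked[:s])
    gated = []
    for i, g in enumerate(groups):
        gated.extend(g if i in keep else [0.0] * group_size)
    return gated, sorted(keep)

def decode(gated, decoder_blocks, group_size):
    # Each group owns a (group_size x d_model) decoder block, so an active
    # group reconstructs within a subspace rather than along one direction.
    d_model = len(decoder_blocks[0][0])
    out = [0.0] * d_model
    for g, block in enumerate(decoder_blocks):
        for j in range(group_size):
            coeff = gated[g * group_size + j]
            if coeff != 0.0:
                for d in range(d_model):
                    out[d] += coeff * block[j][d]
    return out

# Three groups of size r = 2; only the strongest group survives with s = 1.
latents = [0.1, -0.2, 3.0, 4.0, 0.0, 0.5]
gated, active = top_s_group_gate(latents, group_size=2, s=1)

decoder_blocks = [
    [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
]
reconstruction = decode(gated, decoder_blocks, group_size=2)
```

Here the second group has norm 5 and is the only one kept, so a two-dimensional feature is carried by one group's two coefficients instead of being spread over several single-direction latents.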

2606.06329 2026-06-05 cs.LG cs.CG cs.CV stat.ML

Efficient Mean Curvature Computation on High-Dimensional Data Manifolds

高维数据流形上的高效平均曲率计算

Alexandre L. M. Levada

发表机构 * Federal University of São Carlos(圣卡洛斯联邦大学)

AI总结 针对高维数据集局部平均曲率计算中原始方法O(m^4)每点成本过高的问题,提出基于代数恒等式和截断SVD的快速估计器,将成本降至O(k^2 m + k m p^2),在真实数据集上实现50-300倍加速且精度损失可忽略。

详情
Comments
31 pages, 2 figures and 5 tables
AI中文摘要

估计高维数据集中每个点的局部平均曲率是几何感知机器学习算法(如平均曲率边界点(MCBP)方法)的关键组成部分。该计算的朴素实现基于从k近邻块近似的局部形状算子,涉及显式构造矩阵$H$,其迹形式导致每点成本为$O(m^4)$,使得该方法对于具有超过几十个特征的数据集变得难以处理。本文提出了两个互补的贡献,共同将这一成本降低了几个数量级。第一个贡献是一个精确的代数恒等式。该恒等式源自协方差矩阵特征向量的正交性和迹算子的循环性,完全消除了$H$,并将特征分解后的每点成本降低到$O(m^2)$。第二个贡献解决了完整特征分解中剩余的$O(m^3)$瓶颈。由于局部协方差矩阵的秩最多为$k-1 \ll m$,我们将其替换为$k \times m$中心数据矩阵的截断SVD,这是一个$O(k^2 m)$操作,并基于Haar测度下零空间特征向量外积的期望值,推导出其贡献的解析近似。得到的估计器总成本为$O(k^2 m + k m p^2)$,其中$p = k-1$。在真实数据集上的实验证实,相对于原始实现,加速比为50到300倍,当使用快速估计器替换原始版本时,精度损失可忽略。通过提供可扩展且数据驱动的局部曲率估计,所提出的方法将曲率确立为从经典到现代深度学习流水线的广泛机器学习任务中的实用几何特征。

英文摘要

Estimating local mean curvature at each point of a high-dimensional dataset is a key ingredient of geometry-aware machine learning algorithms, such as the Mean Curvature Boundary Points (MCBP) method. The naive implementation of this computation, based on a local shape operator approximated from k-nearest neighbor patches, involves an explicit construction of a matrix $H$ whose trace form yields an $O(m^4)$ cost per point, rendering the approach intractable for datasets with more than a few dozen features. This paper introduces two complementary contributions that together reduce this cost by several orders of magnitude. The first contribution is an exact algebraic identity. This identity, derived from the orthogonality of the eigenvectors of the covariance matrix and the cyclicity of the trace operator, eliminates $H$ entirely and reduces the per-point cost to $O(m^2)$ after the eigendecomposition. The second contribution addresses the remaining $O(m^3)$ bottleneck of the full eigendecomposition. Since the local covariance matrix has rank at most $k-1 \ll m$, we replace it with a truncated SVD of the $k \times m$ centered data matrix, an $O(k^2 m)$ operation, and derive an analytical approximation for the contribution of the null-space eigenvectors based on the expected value of their outer product under the Haar measure. The resulting estimator has total cost $O(k^2 m + k m p^2)$, where $p = k-1$. Experiments on real-world datasets confirm speedups of 50 to 300 times relative to the original implementation, with negligible loss when the fast estimator is used to replace the original version. By providing a scalable and data-driven estimate of local curvature, the proposed method establishes curvature as a practical geometric feature for a broad range of machine learning tasks, from classical to modern deep learning pipelines.
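Two facts the abstract relies on are easy to check on a toy patch: the local covariance of $k$ centered neighbours in $m$ dimensions has rank at most $k-1$, and by cyclicity of the trace the $m \times m$ matrix $X_c^\top X_c$ and the much smaller $k \times k$ matrix $X_c X_c^\top$ have the same trace. The sketch below is not the paper's estimator; it only illustrates why working with the $k \times m$ centered data matrix is enough. The patch is random data and every helper name is an assumption.

```python
import random

def center(rows):
    # Subtract the per-feature mean of the neighbourhood.
    k, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / k for j in range(m)]
    return [[r[j] - means[j] for j in range(m)] for r in rows]

def transpose(a):
    return [list(col) for col in zip(*a)]

def gram(a):
    # Returns a @ a.T for a list-of-lists matrix.
    return [[sum(x * y for x, y in zip(ri, rj)) for rj in a] for ri in a]

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

def rank(a, tol=1e-9):
    # Numerical rank via Gaussian elimination with partial pivoting.
    a = [row[:] for row in a]
    rows, cols = len(a), len(a[0])
    r = 0
    for c in range(cols):
        if r == rows:
            break
        pivot = max(range(r, rows), key=lambda i: abs(a[i][c]))
        if abs(a[pivot][c]) < tol:
            continue
        a[r], a[pivot] = a[pivot], a[r]
        for i in range(r + 1, rows):
            f = a[i][c] / a[r][c]
            for j in range(c, cols):
                a[i][j] -= f * a[r][j]
        r += 1
    return r

rng = random.Random(0)
k, m = 5, 12                      # k neighbours, m features, with k << m in practice
patch = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(k)]
xc = center(patch)

small = gram(xc)                  # k x k matrix  X_c X_c^T
large = gram(transpose(xc))       # m x m matrix  X_c^T X_c (unnormalised covariance)

rank_large = rank(large)
trace_gap = abs(trace(small) - trace(large))
```

For generic data `rank_large` equals `k - 1` and `trace_gap` is at rounding level, so all the spectral information of the $m \times m$ covariance is already contained in the smaller factorisation, which is what makes a truncated SVD of the centered patch sufficient.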

2606.06328 2026-06-05 cs.LG cs.AI

PAMF: Prior-Aware Multimodal Fusion for Incomplete Time Series Data

PAMF: 面向不完整时间序列数据的先验感知多模态融合

Ziwen Kan, Wugeng Zheng, Tianlong Chen, Song Wang

发表机构 * Department of Computer Science, University of Central Florida(中央佛罗里达大学计算机科学系) Department of Computer Science, University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校计算机科学系)

AI总结 提出PAMF框架,通过先验感知流匹配和权重共享显式处理模态内缺失和模态级缺失,将插补与下游预测耦合,提升多模态医疗时间序列任务性能。

详情
Comments
5 figures. arXiv preprint version
AI中文摘要

在医疗保健中,多模态时间序列任务在实践中通常处理不完整的观测,例如当电极脱落导致心电图片段丢失或夜间监测期间整个呼吸通道不可用时。这种缺失通常表现为两种结构上不同的模式:模态内缺失,即在某个观测模态内值缺失;以及模态级缺失,即整个模态不可用。现有方法通常通过掩码或缺失嵌入隐式表示未观测数据,而不学习实例特定的缺失信息,且大多数方法仅针对一种缺失模式设计。一种自然的方法是显式估计缺失数据;然而,现有的插补方法尽管缺失具有不同的结构先验,却统一处理缺失,并且插补过程通常与下游任务隔离,阻止下游任务引导插补朝向更具信息性的表示。为了解决这些局限性,我们提出了PAMF,一个多模态时间序列框架,它显式处理不同的缺失模式,同时通过先验感知流匹配和权重共享将插补与下游预测耦合。具体来说,该方法使用类型特定的先验初始化流匹配源状态,以区分两种缺失类型。它进一步通过架构匹配的编码器与权重共享连接插补和分类,将任务相关表示转移到插补过程中。在多个多模态医疗时间序列基准上的实验表明,与现有基线相比,所提出的方法在多样化的数据集和缺失设置下实现了最强的整体下游性能。

英文摘要

In healthcare, multimodal time series tasks often operate on incomplete observations in practice, for example when ECG segments are lost because electrodes detach or an entire respiratory channel is unavailable during overnight monitoring. Such missingness typically appears in two structurally distinct patterns: within-modality missing, where values are absent within an otherwise observed modality, and modality-level missing, where an entire modality is unavailable. Existing methods typically represent unobserved data implicitly through masks or missing embeddings, without learning instance-specific missing information, and most are designed for only one missingness pattern. A natural approach is to explicitly estimate the missing data; however, existing imputation methods treat missingness uniformly despite their different structural priors, and the imputation process is often isolated from downstream tasks, preventing downstream tasks from guiding imputation toward more informative representations. To address these limitations, we present PAMF, a multimodal time-series framework that explicitly handles different missingness patterns while coupling imputation with downstream prediction through prior-aware flow matching and weight sharing. Specifically, the method initializes the flow-matching source state with type-specific priors to distinguish two missing types. It further connects imputation and classification through architecturally matched encoders with weight sharing, transferring task-relevant representations into the imputation process. Experiments on multiple multimodal healthcare time-series benchmarks show that the proposed method achieves the strongest overall downstream performance across diverse datasets and missing settings compared with existing baselines.

2606.06322 2026-06-05 cs.AI

DragOn: A Benchmark and Dataset for Drag-Based GUI Interactions

DragOn:基于拖拽的GUI交互基准与数据集

Nathan Bout, Maxime Langevin, Ronan Riochet

发表机构 * GitHub arXiv

AI总结 针对GUI代理在拖拽操作(如拖放、滑动、高亮)上的性能不足,提出DragOn基准和训练数据集,涵盖文本高亮、单元格选择、元素缩放和滑块操作四个领域,包含28.6万张训练截图和350万个训练任务,评估了多个模型并显示数据集能提升下游任务性能。

详情
Journal ref
Published as a workshop paper at SCALE - 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026
AI中文摘要

GUI代理——通过图形用户界面控制桌面、网页浏览器和移动设备的视觉模型——有望自动化广泛的数字任务。虽然百万级数据集在点击定位方面取得了显著进展,但拖拽定位(例如拖放、滑动、高亮)的数据规模仍小一个数量级,当前模型在复杂的基于拖拽的交互上表现不足。我们引入了DragOn,一个拖拽定位基准和训练数据集,涵盖四个领域:文本高亮、单元格选择、元素缩放和滑块操作。该数据集包含28.6万张训练截图和350万个训练任务,外加一个2000个样本的保留评估集。我们评估了专有模型(GPT、Claude)和开源模型(Qwen、Kimi、Holo),以及在我们训练数据上微调的Qwen VLM。结果表明,我们的数据集可以提升最先进模型在下游计算机使用任务上的性能。

英文摘要

GUI agents - vision-based models that control desktops, web browsers, and mobile devices through graphical user interfaces - promise to automate a wide range of digital tasks. While million-scale datasets have enabled substantial progress on click-grounding, drag grounding (e.g. drag-and-drop, swipe, highlight) data remains an order of magnitude smaller and current models fall short on complex drag-based interactions. We introduce DragOn, a drag grounding benchmark and training dataset covering four domains: text highlighting, cell selection, element resizing and slider manipulation. The dataset comprises 286K training screenshots and 3.5M training tasks, plus a 2000-example held-out evaluation suite. We evaluate proprietary (GPT, Claude) and open-weight (Qwen, Kimi, Holo) models, as well as a Qwen VLM fine-tuned on our training data. Results suggest that our dataset could improve performance of state-of-the-art models on downstream computer-use tasks.

2606.06320 2026-06-05 cs.LG cs.AI cs.CL

Learning What to Forget: Improving LLM Unlearning via Learned Token-Level Importance

学习遗忘什么:通过习得的词元级重要性改进大语言模型遗忘

Gizem Yüce, Giorgos Nikolaou, Nicolas Flammarion

发表机构 * Theory of Machine Learning Lab, EPFL(机器学习理论实验室,EPFL)

AI总结 提出交替词元加权遗忘(ATWU)框架,通过联合学习词元遗忘特异性和模型参数,在无外部监督下实现最优的遗忘-保留权衡。

详情
AI中文摘要

机器遗忘旨在从训练好的模型中移除特定知识,同时保留其通用能力。对于自回归语言模型,遗忘样本中的并非所有词元都与遗忘同等相关。现有方法要么忽略这种异质性,要么依赖辅助模型、启发式方法或外部标注来估计每个词元对遗忘的相关性。我们转而通过其与保留目标的交互来刻画这种相关性:一个词元是遗忘特异性的,其程度取决于在该词元上最小化遗忘损失不与保留最优性冲突。我们将这一视角形式化为一个关于模型参数和词元权重的联合优化问题,并证明在自然分离条件下,所得目标能够恢复 oracle 遗忘特异性词元支持。受此公式启发,我们引入了交替词元加权遗忘(ATWU),这是一个轻量级框架,在遗忘过程中通过一个基于隐藏状态的简单线性评分器联合学习词元遗忘特异性和模型参数,无需外部词元级监督。在 TOFU 和 RWKU 上,ATWU 实现了最先进的遗忘-保留权衡,优于样本级方法、基于概率的词元加权启发式方法和基于辅助模型的方法。此外,学习到的分数与真实遗忘特异性跨度显著更好地对齐,表明 ATWU 识别了语义上有意义的词元级遗忘信号。总体而言,我们的结果表明,保留冲突为识别语言模型应遗忘什么提供了有效标准,使得能够直接从模型表示中以最小计算开销无监督学习词元级遗忘特异性。

英文摘要

Machine unlearning aims to remove targeted knowledge from a trained model while preserving its general capabilities. For autoregressive language models, not all tokens in a forget sample are equally relevant to forgetting. Existing approaches either ignore this heterogeneity or rely on auxiliary models, heuristics, or external annotations to estimate each token's relevance for forgetting. We instead characterize it through the interaction with the retain objective: a token is forget-specific to the extent that minimizing the forget loss on that token does not conflict with retain optimality. We formalize this perspective as a joint optimization problem over the model parameters and the token weights and show that, under a natural separation condition, the resulting objective recovers the oracle forget-specific token support. Motivated by this formulation, we introduce Alternating Token-Weighted Unlearning (ATWU), a lightweight framework that jointly learns token forget-specificity and model parameters during unlearning using a simple linear scorer over the hidden states, without external token level supervision. Across TOFU and RWKU, ATWU achieves state of the art forget-retain trade-offs, outperforming sample-level methods, probability-based token weighting heuristics, and auxiliary-model-based approaches. Moreover, the learned scores align substantially better with ground truth forget-specific spans, indicating that ATWU identifies semantically meaningful token level forgetting signals. Overall, our results suggest that retain conflict provides an effective criterion for identifying what language models should forget, enabling unsupervised learning of token level forget-specificity directly from model representations with minimal computational overhead.

2606.06315 2026-06-05 cs.AI

LLM Self-Recognition: Steering and Retrieving Activation Signatures

LLM 自我识别:引导与检索激活签名

Thibaud Ardoin, Jonas Schäfer, Gerhard Wunder

发表机构 * University of California, Berkeley(加州大学伯克利分校) ETH Zurich(苏黎世联邦理工学院)

AI总结 通过随机稀疏向量引导内部残差流,在LLM生成文本中嵌入可检测指纹,实现高精度归属,同时保持文本质量。

详情
Comments
To appear in Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)
AI中文摘要

近期可解释性进展表明,大型语言模型(LLMs)在其生成的文本中隐式编码信号,使其能够自我识别输出。我们证明这种能力是可靠的,即使在低熵场景中也是如此,并且可以通过定向干预来增强。通过在生成过程中使用随机稀疏向量引导内部残差流,我们创建了一个可检测的指纹,从而能够将给定文本归属于特定的LLM。该信号可从用作检测器的LLM的激活中恢复,在多种检测设置中实现超过98%的准确率,同时保持生成文本的质量。随着AI生成内容的激增,这种方法通过利用模型自然的表示结构进行归属,而非外部嵌入信号,为传统检测器提供了实用替代方案。我们的贡献包括:(i) 在LLM中建立可靠的自我识别能力,(ii) 一种简单的引导机制,实现多LLM识别且无质量下降,(iii) 证明激活空间包含可被利用的结构,用于编码信号而不产生语义干扰。

英文摘要

Recent advances in interpretability suggest that large language models (LLMs) implicitly encode signals in their generated text that enable self-recognition of their outputs. We demonstrate that this capability is reliable, even in low-entropy scenarios, and that it can be amplified through targeted intervention. By steering the internal residual stream during generation with a random sparse vector, we create a detectable fingerprint that enables attribution of a given text to a specific LLM. This signal is recoverable from the activations of an LLM used as a detector, achieving over 98% accuracy across multiple detection settings while preserving the quality of generated text. As AI-generated content proliferates, this approach offers a practical alternative to traditional detectors by leveraging the model's natural representation structure for attribution rather than embedding a signal externally. Our contributions include: (i) establishing reliable self-recognition capabilities in LLMs, (ii) a simple steering mechanism enabling multi-LLM identification with no quality degradation, (iii) demonstrating that activation spaces contain exploitable structure for encoding signals without semantic interference.
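The steering idea can be illustrated on synthetic activations. This toy deliberately simplifies the setting: in the abstract the signal is recovered from the activations of a detector LLM that reads the generated text, whereas here the score is computed directly on the steered vectors. The Gaussian "activations", the dimension, the steering strength, and all function names are assumptions for the example.

```python
import math
import random

def sparse_signature(dim, nonzeros, rng):
    # Random sparse unit vector: a few +/-1 entries, then normalised.
    v = [0.0] * dim
    for i in rng.sample(range(dim), nonzeros):
        v[i] = rng.choice([-1.0, 1.0])
    norm = math.sqrt(nonzeros)
    return [x / norm for x in v]

def steer(activation, signature, strength):
    # Additive intervention on a residual-stream vector.
    return [a + strength * s for a, s in zip(activation, signature)]

def detection_score(activations, signature):
    # Mean projection of the activations onto the signature direction.
    total = 0.0
    for act in activations:
        total += sum(a * s for a, s in zip(act, signature))
    return total / len(activations)

rng = random.Random(0)
dim, tokens = 256, 200
signature = sparse_signature(dim, nonzeros=16, rng=rng)

# Stand-in for per-token residual activations of unsteered generation.
plain = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(tokens)]
steered = [steer(a, signature, strength=1.0) for a in plain]

plain_score = detection_score(plain, signature)
steered_score = detection_score(steered, signature)
is_fingerprinted = steered_score > 0.5
```

Because the signature has unit norm, steering shifts the mean projection by exactly the steering strength, while the unsteered projection averages out close to zero over many tokens. A simple threshold then separates the two cases, and distinct random signatures could in principle label distinct models.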

2606.06312 2026-06-05 cs.RO

Meridian: Metric-Semantic Primitive Matching for Cross-View Geo-Localization Beyond Urban Environments

Meridian: 超越城市环境的跨视角地理定位的度量-语义基元匹配

Mason Peterson, Qingyuan Li, Yixuan Jia, Fernando Cladera, Carlos Nieto-Granda, Camillo Jose Taylor, Jonathan P. How

发表机构 * Massachusetts Institute of Technology(麻省理工学院) GRASP Laboratory, University of Pennsylvania(宾夕法尼亚大学GRASP实验室) U.S. Army Combat Capabilities Development Command, Army Research Laboratory(美国陆军战斗能力发展指挥部,陆军研究实验室)

AI总结 提出Meridian方法,通过匹配航拍图像与地面机器人RGB-D数据中的高层度量-语义基元,无需特定区域训练即可实现跨多种环境的全局定位,平均轨迹误差2.4米。

详情
Comments
9 pages, 6 figures
AI中文摘要

成功的机器人自动化需要准确的全局定位以支持可重复性、任务规划、目标指定和安全操作。然而,在GNSS受限环境中的可靠定位仍然是一个开放问题。高空航拍图像提供了一种有前景的解决方案,但现有方法主要针对结构化城市环境,很少在非结构化自然地形中得到验证。现有技术的局限性包括依赖针对特定环境训练的模型,以及在自然户外区域常见的重复几何和无特征景观中难以处理。为克服这些挑战,我们提出了Meridian,一种在航拍图像和地面机器人RGB-D相机数据之间匹配高层度量-语义基元的方法,实现了准确的全局定位,并在多样环境中具有良好的泛化能力,无需任何针对特定区域数据的训练或算法微调。我们提出了新颖的一致性度量来估计机器人子图位姿的分布,并在鲁棒的位姿图优化步骤中剔除异常假设,以实现准确的机器人轨迹估计。我们证明了我们的算法可以在多种环境中定位地面机器人,包括自动驾驶数据集、公园和校园区域以及荒野营地,在19公里的地面遍历中平均优化轨迹误差为2.4米。

英文摘要

Successful robot automation requires accurate global localization to support repeatability, task planning, goal specification, and safe operation. However, reliable localization in GNSS-denied environments remains an open problem. Overhead aerial imagery offers a promising solution, but existing approaches primarily target structured urban environments and have been rarely demonstrated in unstructured natural terrain. Limitations of the state-of-the-art include a reliance on models trained for specific environments, as well as difficulty handling repetitive geometries and featureless landscapes commonly found in natural outdoor areas. To overcome these challenges, we present Meridian, a method for matching high-level metric-semantic primitives across aerial images and ground robot RGB-D camera data that achieves accurate global localization and generalizes well across diverse environments, all without any training or algorithmic fine-tuning on area-specific data. We formulate novel consistency metrics to estimate a distribution over robot submap poses and to reject outlier hypotheses in a robust pose graph optimization step for accurate robot trajectory estimation. We demonstrate that our algorithm can localize a ground robot across a wide variety of environments, including an autonomous driving dataset, a park and campus area, and a wilderness camp, with an average optimized trajectory error of 2.4 m over 19 km of ground traversal.

2606.06311 2026-06-05 cs.AI

AIS-Based Vessel Trajectory Prediction Using Memory-Augmented Neural Networks

基于记忆增强神经网络的AIS船舶轨迹预测

Wonmo Koo, Sanha Chang, Heeyoung Kim

Affiliations * Department of Industrial and Systems Engineering, Korea Advanced Institute of Science and Technology (KAIST)(工业与系统工程系,韩国科学技术院)

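AI summary: This paper applies memory-augmented neural networks to vessel trajectory prediction from AIS data. On datasets from the Gulf of Mexico and the New York Bight, the approach substantially outperforms deep learning baselines that lack an external memory.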

Details
AI Chinese abstract

准确的船舶轨迹预测对于安全高效的海上作业至关重要,能够实现碰撞避免并支持航线优化。尽管记忆增强神经网络最近通过从外部记忆中选择性检索相关信息,在行人和道路车辆轨迹预测中表现出色,但其在船舶轨迹预测中的潜力尚未被充分探索。本文基于自动识别系统(AIS)数据,对基于记忆的轨迹预测进行了实证研究。在墨西哥湾和纽约湾数据集上的实验表明,与未集成外部记忆的多种深度学习基线相比,该方法持续且显著地提升了性能。

English abstract

Accurate vessel trajectory prediction is essential for safe and efficient maritime operations, enabling collision avoidance and supporting route optimization. Although memory-augmented neural networks have recently shown strong performance in pedestrian and road-vehicle trajectory prediction by selectively retrieving relevant information from an external memory, their potential for vessel trajectory prediction remains underexplored. This paper presents an empirical investigation of memory-based trajectory prediction using Automatic Identification System (AIS) data. Experiments on data from the Gulf of Mexico and the New York Bight demonstrate consistent and substantial performance gains over a range of deep learning baselines that do not incorporate an external memory.
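
The retrieval idea mentioned in the abstract above can be illustrated with a small sketch. This is not the paper's model: the memory layout (a key vector paired with stored future displacements), the cosine-similarity lookup, the averaging rule, and the names `retrieve` and `predict` are all assumptions made for illustration, and the toy data is invented.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(memory, query_key, k=2):
    """Return the k stored futures whose keys are most similar to the query.

    memory is a list of (key, future) pairs; this layout is hypothetical.
    """
    ranked = sorted(memory, key=lambda item: cosine(item[0], query_key), reverse=True)
    return [future for _, future in ranked[:k]]

def predict(memory, query_key, last_position, k=2):
    """Average the retrieved future displacements and add them to the last position."""
    futures = retrieve(memory, query_key, k)
    horizon = len(futures[0])
    preds = []
    for t in range(horizon):
        dx = sum(f[t][0] for f in futures) / len(futures)
        dy = sum(f[t][1] for f in futures) / len(futures)
        preds.append((last_position[0] + dx, last_position[1] + dy))
    return preds

# Toy memory: key = recent velocity, future = displacements at the next two steps.
memory = [
    ((1.0, 0.0), [(1.0, 0.0), (2.0, 0.0)]),  # heading east
    ((0.0, 1.0), [(0.0, 1.0), (0.0, 2.0)]),  # heading north
    ((1.0, 0.1), [(1.0, 0.2), (2.0, 0.4)]),  # heading east, drifting north
]

preds = predict(memory, query_key=(1.0, 0.05), last_position=(10.0, 20.0), k=2)
print(preds)  # approximately [(11.0, 20.1), (12.0, 20.2)]
```

In a learned system the keys and the read-out would come from trained encoders and decoders rather than raw velocities and a plain average; the sketch only shows the select-then-combine pattern of an external memory.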

2606.06309 2026-06-05 cs.CV

RhymeFlow: Training-Free Acceleration for Video Generation with Asynchronous Denoising Flow Scheduling

RhymeFlow: 基于异步去噪流调度的无训练加速视频生成

Chensheng Dai, Shengjun Zhang, Yifan Li, Zhang Zhang, Zheng Zhu, Yueqi Duan

Affiliations * Tsinghua University(清华大学) GigaAI

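AI summary: To address slow inference in DiT-based video generation models, this paper proposes RhymeFlow, a training-free framework that identifies keyframes and applies dense denoising only to them while non-keyframes progressively skip steps. A latent trajectory projection module preserves temporal consistency, yielding faster inference and better visual quality.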

Details
Comments
Project Page: https://simon-dcs.github.io/Website-of-RhymeFlow/, Code: https://github.com/Simon-Dcs/RhymeFlow
AI Chinese abstract

基于扩散变换器(DiTs)的视频生成模型在视频合成中取得了显著性能,但由于3D注意力的二次复杂度,它们存在高推理延迟和计算成本的问题。现有的加速方法主要通过稀疏注意力和KV缓存等技术降低每个单独去噪步骤内的计算复杂度。然而,它们严格遵循标准扩散管道的固有约束:目标视频序列中的每一帧都必须经历所有扩散时间步的完整、密集去噪过程。我们观察到,由于相邻帧之间的对应内容和运动,当锚定具有关键语义过渡的关键帧时,其他帧的中间状态通常遵循更可预测的轨迹,这表明这种均匀、密集的去噪过程对于自然视频数据本质上是冗余的。为此,我们引入了RhymeFlow,一个无训练框架,它将不同帧的去噪轨迹解耦。具体来说,我们首先识别出一组稀疏的关键帧,它们主导了潜在语义演化。然后,只有这些关键帧经历密集的逐步去噪以确保结构完整性,而非关键帧则逐步跳过去噪步骤以最小化计算成本。由于非关键帧跳过的中间状态破坏了关键帧去噪步骤中的时间连贯性,导致视觉退化,我们进一步引入了一个潜在轨迹投影模块,使关键帧能够与完整且时间一致的序列表示进行交互。在当前的基于DiT的视频生成模型上的大量实验表明,我们的方法以更高的推理速度和更好的视觉质量优于现有基线。

English abstract

Video generation models based on Diffusion Transformers (DiTs) have achieved remarkable performance in video synthesis, yet they suffer from high inference latency and computational costs due to the quadratic complexity of 3D attention. Existing acceleration methods primarily reduce computational complexity within each individual denoising step through techniques such as sparse attention and KV caching. However, they rigidly adhere to the inherent constraint of the standard diffusion pipeline: every frame in the target video sequence must be subjected to a complete, dense denoising process across all diffusion timesteps. We observe that, because adjacent frames share corresponding content and motion, once keyframes with critical semantic transitions are anchored, the intermediate states of the other frames often follow more predictable trajectories, which indicates that such a uniform, dense denoising process is inherently redundant for natural video data. To this end, we introduce RhymeFlow, a training-free framework that decouples the denoising trajectories of different frames. Specifically, we first identify a sparse set of pivotal keyframes that dominate the latent semantic evolution. Then, only these keyframes undergo dense, step-by-step denoising to ensure structural integrity, while non-keyframes progressively skip denoising steps to minimize computational cost. Since skipped intermediate states of non-keyframes break the temporal coherence in keyframe denoising steps, leading to visual degradation, we further introduce a latent trajectory projection module, which enables keyframes to interact with a complete and temporally consistent sequence representation. Extensive experiments on current DiT-based video generation models demonstrate that our method outperforms existing baselines with higher inference speed and better visual quality.
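
The decoupled per-frame schedule described in the abstract above can be pictured with a toy sketch. It is not the RhymeFlow implementation: the skipping rule (non-keyframes update every `stride`-th step and on the final step), the function names, and the assumption that keyframe indices are already given are all hypothetical, and the latent trajectory projection module is not modeled.

```python
def build_schedule(num_frames, num_steps, keyframes, stride=2):
    """Build a step-by-frame table; schedule[t][f] is True when frame f is denoised at step t.

    Keyframes are denoised at every step. Non-keyframes follow an assumed rule:
    they are denoised on every `stride`-th step and on the final step.
    """
    keyset = set(keyframes)
    schedule = []
    for t in range(num_steps):
        row = []
        for f in range(num_frames):
            if f in keyset:
                row.append(True)
            else:
                row.append(t % stride == 0 or t == num_steps - 1)
        schedule.append(row)
    return schedule

def frame_step_fraction(schedule):
    """Fraction of (step, frame) pairs that are actually denoised."""
    total = sum(len(row) for row in schedule)
    active = sum(sum(row) for row in schedule)
    return active / total

# Toy setting: 8 frames, 10 steps, frames 0 and 4 chosen as keyframes (assumed).
schedule = build_schedule(num_frames=8, num_steps=10, keyframes=[0, 4], stride=2)
fraction = frame_step_fraction(schedule)
print(fraction)  # 0.7
```

The fraction counts frame-steps only. Because attention cost grows faster than linearly with the number of tokens processed together, it should not be read as a wall-clock speedup estimate.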