arXivDaily: arXiv daily academic digest, updated Monday through Friday
2604.22476 2026-06-18 cs.CV cs.LG updated version

All Eyes on the Workflow: Automated and Efficient Event Discovery from Video Streams

Marco Pegoraro, Jonas Seng, Dustin Heller, Wil M. P. van der Aalst, Kristian Kersting

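Affiliations: Chair of Process and Data Science, RWTH Aachen University; Artificial Intelligence & Machine Learning Lab, Technical University of Darmstadt

AI summary: Proposes SnapLog, which extracts event data from video by using image embeddings and frame-wise similarity matrices for temporal segmentation, then applies generalized few-shot classification to produce interpretable sequences of labeled, timestamped frames.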

Comments 18 pages, 6 figures, 1 table, 27 references

Abstract

Disciplines such as business process management and process mining aid organizations by discovering insights about processes on the basis of recorded event data. However, an obstacle to process analysis is data multi-modality: for instance, data in video form are not directly interpretable as events. Existing approaches rely on a dictionary of activity labels as input, cannot provide frame-by-frame labeling explanations, or rely on superseded computer vision techniques. In this work, we present SnapLog, an approach to extract event data from videos by converting frames to feature vectors using image embeddings and performing temporal segmentation through frame-wise similarity matrices. A generalized few-shot classification is then used to assign labels to the video segments, yielding labeled, timestamped sub-sequences of frames that are interpretable as events. Conventional process mining techniques can be used to analyze the resulting data. We show that our approach produces logs that accurately reflect the process in the videos.
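As a rough illustration of temporal segmentation through a frame-wise similarity matrix, the sketch below builds a cosine-similarity matrix over per-frame feature vectors and starts a new segment wherever the similarity between consecutive frames drops below a threshold. The function names, the threshold value, and the toy vectors are assumptions made for illustration; the abstract does not specify SnapLog's embedding model or its actual segmentation rule.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(frames):
    """Frame-wise similarity matrix: entry [i][j] compares frame i with frame j."""
    return [[cosine(u, v) for v in frames] for u in frames]

def segment(frames, threshold=0.8):
    """Split frame indices into contiguous segments.

    A new segment starts when the similarity between a frame and its
    predecessor falls below `threshold` (a hypothetical rule, not SnapLog's).
    """
    if not frames:
        return []
    sim = similarity_matrix(frames)
    segments = [[0]]
    for i in range(1, len(frames)):
        if sim[i - 1][i] < threshold:
            segments.append([i])
        else:
            segments[-1].append(i)
    return segments

# Toy embeddings: two frames of one activity followed by two of another.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(segment(frames))  # [[0, 1], [2, 3]]
```

Each resulting segment would then be passed to a classifier for labeling; that step is omitted here.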

2508.21720 2026-06-18 cs.AI updated version

PosterForest: Hierarchical Multi-Agent Collaboration for Scientific Poster Generation

Jiho Choi, Seojeong Park, Seongjong Song, Hyunjung Shim

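Affiliations: Graduate School of Artificial Intelligence, KAIST; School of Integrated Technology, Yonsei University

AI summary: Proposes PosterForest, a training-free framework for scientific poster generation that represents document structure hierarchically as a Poster Tree and uses content and layout agents for hierarchical reasoning and recursive refinement, jointly optimizing content and layout to improve semantic coherence, logical flow, and visual balance.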

Comments ACL 2026

Abstract

Automating scientific poster generation requires hierarchical document understanding and coherent content-layout planning. Existing methods often rely on flat summarization or optimize content and layout separately. As a result, they often suffer from information loss, weak logical flow, and poor visual balance. We present PosterForest, a training-free framework for scientific poster generation. Our method introduces the Poster Tree, a structured intermediate representation that captures document hierarchy and visual-textual semantics across multiple levels. Building on this representation, content and layout agents perform hierarchical reasoning and recursive refinement, progressively optimizing the poster from global organization to local composition. This joint optimization improves semantic coherence, logical flow, and visual harmony. Experiments show that PosterForest outperforms prior methods in both automatic and human evaluations, without additional training or domain-specific supervision.

2604.20822 2026-06-18 cs.CV cs.LG updated version

Global Offshore Wind Infrastructure: Deployment and Operational Dynamics from Dense Sentinel-1 Time Series

Thorsten Hoeser, Felix Bachofer, Claudia Kuenzer

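Affiliations: Earth Observation Center (EOC), German Aerospace Center (DLR); Institute for Geography and Geology, University of Wuerzburg

AI summary: Presents a global Sentinel-1 SAR time series dataset that uses object detection and a rule-based classifier to identify the deployment and operational phases of offshore wind infrastructure, supporting global-scale analysis of its dynamics.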

Comments 29 pages, 18 figures

Abstract

The offshore wind energy sector is expanding rapidly, increasing the need for independent, high-temporal-resolution monitoring of infrastructure deployment and operation at global scale. While Earth Observation-based offshore wind infrastructure mapping has matured for spatial localization, existing open datasets lack temporally dense and semantically fine-grained information on construction and operational dynamics. We introduce a global Sentinel-1 synthetic aperture radar (SAR) time series data corpus that resolves deployment and operational phases of offshore wind infrastructure from 2016Q1 to 2025Q1. Building on an updated object detection workflow, we compile 15,606 time series at detected infrastructure locations, totaling 14,840,637 events as analysis-ready 1D SAR backscatter profiles, one profile per Sentinel-1 acquisition and location. To enable direct use and benchmarking, we release (i) the analysis-ready 1D SAR profiles, (ii) event-level baseline semantic labels generated by a rule-based classifier, and (iii) an expert-annotated benchmark dataset of 553 time series with 328,657 event labels. The baseline classifier achieves a macro F1 score of 0.84 in event-wise evaluation and an area under the collapsed edit similarity-quality threshold curve (AUC) of 0.785, indicating temporal coherence. We demonstrate that the resulting corpus supports global-scale analyses of deployment dynamics, the identification of differences in regional deployment patterns, vessel interactions, and operational events, and provides a reference for developing and comparing time series classification methods for offshore wind infrastructure monitoring.
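For readers unfamiliar with the event-wise metric quoted above, the following is a minimal sketch of how a macro F1 score is commonly computed: one F1 score per class, averaged without class weighting. The label names in the example are hypothetical and are not taken from the dataset.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical event labels for four acquisitions at one location.
y_true = ["construction", "construction", "operation", "operation"]
y_pred = ["construction", "operation", "operation", "operation"]
print(round(macro_f1(y_true, y_pred), 4))  # 0.7333
```

Because every class contributes equally, rare event types weigh as much as frequent ones, which matters when a few phases dominate the time series.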

2604.18109 2026-06-18 cs.CL cs.SD updated version

FLiP: Towards understanding and interpreting multimodal multilingual sentence embeddings

Santosh Kesiraju, Bolaji Yusuf, Šimon Sedláček, Oldřich Plchot, Petr Schwarz

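Affiliations: Brno University of Technology

AI summary: Proposes factorized linear projection (FLiP) models that recover lexical content from multilingual and multimodal sentence embeddings, revealing the modality and language biases of the encoders.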

Comments Accepted to Interspeech 2026

Abstract

This paper presents factorized linear projection (FLiP) models for understanding pretrained sentence embedding spaces. We train FLiP models to recover the lexical content from multilingual (LaBSE), multimodal (SONAR) and API-based (Gemini) sentence embedding spaces in several high- and mid-resource languages. We show that FLiP can recall more than 75% of lexical content from the embeddings, significantly outperforming existing non-factorized baselines. Using this as a diagnostic tool, we uncover the modality and language biases across the selected sentence encoders and provide practitioners with intrinsic insights about the encoders without relying on conventional downstream evaluation tasks. Our implementation is publicly available at https://github.com/BUTSpeechFIT/FLiP.

2604.13082 2026-06-18 cs.LG cs.AI updated version

The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior

Laura Gomezjurado Gonzalez

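Affiliations: Stanford University

AI summary: Investigates why generalization is delayed when Transformers learn arithmetic tasks. The encoder learns the relevant structure early, but a decoder bottleneck causes the delay; transplanting or freezing a trained encoder speeds up generalization, and the choice of numeral base affects how hard the task is to learn.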

Comments 19 pages, 10 figures

Abstract

Grokking in transformers trained on algorithmic tasks is characterized by a long delay between training-set fit and abrupt generalization, but the source of that delay remains poorly understood. In encoder-decoder arithmetic models, we argue that this delay reflects limited access to already learned structure rather than failure to acquire that structure in the first place. We study one-step Collatz prediction and find that the encoder organizes parity and residue structure within the first few thousand training steps, while output accuracy remains near chance for tens of thousands more. Causal interventions support the decoder bottleneck hypothesis. Transplanting a trained encoder into a fresh model accelerates grokking by 2.75 times, while transplanting a trained decoder actively hurts. Freezing a converged encoder and retraining only the decoder eliminates the plateau entirely and yields 97.6% accuracy, compared to 86.1% for joint training. What makes the decoder's job harder or easier depends on numeral representation. Across 15 bases, those whose factorization aligns with the Collatz map's arithmetic (e.g., base 24) reach 99.8% accuracy, while binary fails completely because its representations collapse and never recover. The choice of base acts as an inductive bias that controls how much local digit structure the decoder can exploit, producing large differences in learnability from the same underlying task.
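To make the task concrete, the sketch below shows what a one-step Collatz prediction example could look like when numbers are written as digit sequences in a chosen base. It assumes the standard Collatz map (halve even numbers, send odd n to 3n + 1); the paper's exact map variant, tokenization, and helper names are not given in the abstract, so everything named here is illustrative.

```python
def collatz_step(n):
    """One step of the standard Collatz map: n/2 if n is even, 3n + 1 if odd."""
    return n // 2 if n % 2 == 0 else 3 * n + 1

def to_digits(n, base):
    """Digits of a non-negative integer in `base`, most significant first."""
    if n == 0:
        return [0]
    digits = []
    while n > 0:
        digits.append(n % base)
        n //= base
    return digits[::-1]

def make_example(n, base):
    """(input digits, target digits) for one-step Collatz prediction."""
    return to_digits(n, base), to_digits(collatz_step(n), base)

# The same underlying example, 27 -> 82, in two numeral bases.
print(make_example(27, 24))  # ([1, 3], [3, 10])
print(make_example(27, 2))   # ([1, 1, 0, 1, 1], [1, 0, 1, 0, 0, 1, 0])
```

In base 24, which is divisible by both 2 and 3, the last digit alone determines parity and the residue modulo 3, whereas in binary the same information is spread differently across the digit sequence. This is one way to read the abstract's point that the base acts as an inductive bias.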

2603.13941 2026-06-18 cs.CV updated version

Bidirectional Cross-Attention Fusion of High-Resolution RGB and Low-Resolution Hyperspectral Inputs for Multimodal Semantic Segmentation

Jonas V. Funk, Lukas Roming, Andreas Michel, Paul Bäcker, Georg Maier, Thomas Längle, Markus Klute

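Affiliations: KIT, Karlsruhe Institute of Technology; Fraunhofer IOSB, Fraunhofer Institute of Optronics, System Technologies and Image Exploitation

AI summary: Proposes Bidirectional Cross-Attention Fusion (BCAF), which aligns high-resolution RGB with low-resolution hyperspectral images through localized bidirectional cross-attention, avoiding pre-upsampling and early spectral collapse and improving multimodal segmentation under real-time constraints.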

Comments Submitted to Image and Vision Computing (Elsevier). 23 pages, 10 figures, 7 tables

Abstract

Multimodal semantic segmentation with heterogeneous sensors must reconcile complementary information across modalities that differ in spatial resolution and channel dimensionality. In particular, high-resolution RGB imaging provides detailed spatial structure but often fails to distinguish visually similar materials, whereas hyperspectral imaging (HSI) provides discriminative spectral signatures but at lower spatial resolution. We present Bidirectional Cross-Attention Fusion (BCAF), which aligns high-resolution RGB with low-resolution HSI at their native grids via localized, bidirectional cross-attention, avoiding pre-upsampling or early spectral collapse. BCAF uses two independent backbones: a standard Swin Transformer for RGB and an HSI-adapted Swin backbone that preserves spectral structure through 3D tokenization with spectral self-attention. Although our evaluation targets RGB-HSI fusion, BCAF is modality-agnostic and applies to co-registered RGB with lower-resolution, high-channel auxiliary sensors. On the benchmark SpectralWaste dataset, BCAF delivers strong performance, achieving 75.4% at 55 images/s. We further evaluate a novel industrial dataset: K3I-Cycling (first RGB subset already released on Fordatis). On this dataset, BCAF reaches 62.3% mIoU for material segmentation (paper, metal, plastic, etc.) and 66.2% mIoU for plastic-type segmentation (PET, PP, HDPE, LDPE, PS, etc.). These results show that preserving native-grid spatial detail and spectral structure improves multimodal segmentation under real-time constraints. Code and model checkpoints are publicly available at https://github.com/jonasvilhofunk/BCAF_2026.

2604.05527 2026-06-18 cs.CV updated version

Prior-guided Fusion of Multimodal Features for Change Detection from Optical-SAR Images

Xuanguang Liu, Lei Ding, Yujie Li, Chenguang Dai, Zhenchao Zhang, Mengmeng Li, Ziyi Yang, Yifan Sun, Yongqi Sun, Hanyun Wang, Lorenzo Bruzzone

Affiliations: Institute of Geospatial Information, Information Engineering University; Academy of Digital China (Fujian), Fuzhou University; School of Electronics and Communication Engineering, Sun Yat-sen University; Department of Information Engineering and Computer Science, University of Trento

AI summary: Proposes STSF-Net, a framework that jointly models modality-specific and spatio-temporal common features and adaptively fuses multimodal features using semantic priors from visual foundation models, achieving the best results on three datasets.

Abstract

Multimodal change detection (MMCD) identifies changed areas in multimodal remote sensing data, demonstrating significant application value in land use monitoring and urban sustainable development. However, existing MMCD approaches exhibit limitations in both cross-modal interaction and exploiting modality-specific characteristics. This leads to insufficient modeling of fine-grained change information, thus hindering the precise detection of semantic changes. To address these problems, we propose STSF-Net, a framework designed for MMCD between optical and SAR images. STSF-Net jointly models modality-specific and spatio-temporal common features to enhance change representations. Specifically, modality-specific features are exploited to capture genuine semantic change signals, while spatio-temporal common features are embedded to suppress pseudo-changes caused by differences in imaging mechanisms. Furthermore, we introduce an optical and SAR feature fusion strategy that adaptively adjusts multimodal feature importance based on semantic priors obtained from visual foundation models. Finally, we introduce the novel Delta-SN6 dataset, the first openly accessible multiclass MMCD benchmark consisting of very-high-resolution fully polarimetric SAR and optical images. Experimental results on Delta-SN6, BRIGHT, and Wuhan datasets demonstrate that our method outperforms the state-of-the-art by 3.21%, 0.87%, and 1.32% in mIoU, respectively.

2604.04342 2026-06-18 cs.LG stat.ML updated version

Generative models for decision-making under distributional shift

Xiuyuan Cheng, Yunqin Zhu, Yao Xie

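Affiliations: Department of Mathematics, Duke University; H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology

AI summary: Presents a unified framework built on flow- and score-based generative models that uses tools such as transport maps and velocity fields to address decision-making under distributional shift, covering robustness, conditional distribution generation, and uncertainty quantification.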

Comments INFORMS TutORials in Operations Research, 2026

Abstract

Many data-driven decision problems are formulated using a nominal distribution estimated from historical data, while performance is ultimately determined by a deployment distribution that may be shifted, context-dependent, partially observed, or stress-induced. This tutorial presents modern generative models, particularly flow- and score-based methods, as mathematical tools for constructing decision-relevant distributions. From an operations research perspective, their primary value lies not in unconstrained sample synthesis but in representing and transforming distributions through transport maps, velocity fields, score fields, and guided stochastic dynamics. We present a unified framework based on pushforward maps, continuity, Fokker-Planck equations, Wasserstein geometry, and optimization in probability space. Within this framework, generative models can be used to learn nominal uncertainty, construct stressed or least-favorable distributions for robustness, and produce conditional or posterior distributions under side information and partial observation. We also highlight representative theoretical guarantees, including forward-reverse convergence for iterative flow models, first-order minimax analysis in transport-map space, and error-transfer bounds for posterior sampling with generative priors. The tutorial provides a principled introduction to using generative models for scenario generation, robust decision-making, uncertainty quantification, and related problems under distributional shift.
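For orientation, the objects named in the abstract have standard textbook forms, collected below. The notation (T, \mu, \rho_t, v_t, f, g) is an assumption chosen for this note; the tutorial itself may use different symbols and more general statements.

```latex
% Pushforward of a distribution \mu under a transport map T
(T_{\#}\mu)(A) = \mu\bigl(T^{-1}(A)\bigr) \quad \text{for every measurable set } A.

% Continuity equation: density \rho_t transported by a velocity field v_t
\partial_t \rho_t + \nabla \cdot (\rho_t v_t) = 0.

% Fokker-Planck equation for the SDE  dX_t = f(X_t, t)\,dt + g(t)\,dW_t
\partial_t \rho_t = -\nabla \cdot (f \rho_t) + \tfrac{1}{2}\, g(t)^2 \Delta \rho_t.

% Squared 2-Wasserstein distance between \mu and \nu
W_2^2(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \int \|x - y\|^2 \, d\pi(x, y).
```

Setting g = 0 in the Fokker-Planck equation recovers the continuity equation with v_t = f, which is the usual link between deterministic flow models and their stochastic, score-based counterparts.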

2604.03208 2026-06-18 cs.LG updated version

Hierarchical Planning with Latent World Models

Wancong Zhang, Basile Terver, Artem Zholus, Soham Chitnis, Harsh Sutaria, Mido Assran, Randall Balestriero, Amir Bar, Adrien Bardes, Yann LeCun, Nicolas Ballas

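Affiliations: FAIR at Meta; New York University; Mila - Québec AI Institute; Brown University

AI summary: Proposes HWM, an architecture that performs hierarchical model predictive control using latent world models at multiple temporal scales and latent matching, addressing the failure of single-level planning and the blow-up in computation on long-horizon tasks.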

Abstract

World models are a promising path to zero-shot embodied control through planning. However, existing world model planners struggle on long-horizon, multi-stage tasks: prediction errors compound and naive search is exponential in the planning horizon. Hierarchy mitigates both by decomposing tasks into shorter, tractable subproblems; yet prior hierarchical approaches either amortize control into task-specific policies (hierarchical RL) or assume low-dimensional states and known dynamics (classical hierarchical MPC). We present Hierarchical Planning with Latent World Models (HWM), an architecture and planning paradigm for hierarchical model predictive control (MPC) directly on visual world models trained solely via next-latent prediction. HWM learns world models at multiple temporal scales within a shared latent space, so predictions from the long-horizon model serve as subgoals for the short-horizon model via latent matching, without task-specific rewards, skill learning, or hierarchical policies. To keep long-horizon search tractable, HWM learns an action encoder that compresses primitive action chunks into latent macro-actions. On real-world Franka manipulation, HWM solves pick-and-place from a single goal image at 70% success vs. 0% for single-level planning. Across simulated push manipulation and maze navigation, HWM consistently improves performance on long-horizon tasks while requiring up to 3x less planning compute.

2604.03156 2026-06-18 cs.CV updated version

CAMEO: A Conditional and Quality-Aware Multi-Agent Image Editing Orchestrator

Yuhan Pu, Hao Zheng, Ziqian Mo, Zirui Pang, Hill Zhang, Tianyi Fan, Shuhong Wu, Jiaheng Wei

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Harbin Institute of Technology; Shenzhen University; Claremont McKenna College; Research Institute of Petroleum Exploration and Development, CNPC

AI summary: Proposes CAMEO, a multi-agent framework that recasts conditional image editing as a quality-aware, feedback-driven process by decomposing editing into stages and embedding an evaluation loop, raising the average win rate by 20% on anomaly insertion and human pose switching tasks.

Abstract

Conditional image editing aims to modify a source image according to textual prompts and optional reference guidance. Such editing is crucial in scenarios requiring strict structural control (e.g., anomaly insertion in driving scenes and complex human pose transformation). Despite recent advances in large-scale editing models (e.g., Seedream and Nano Banana), most approaches rely on single-step generation. This paradigm often lacks explicit quality control, may introduce excessive deviation from the original image, and frequently produces structural artifacts or environment-inconsistent modifications, typically requiring manual prompt tuning to achieve acceptable results. We propose CAMEO, a structured multi-agent framework that reformulates conditional editing as a quality-aware, feedback-driven process rather than a one-shot generation task. CAMEO decomposes editing into coordinated stages of planning, structured prompting, hypothesis generation, and adaptive reference grounding, where external guidance is invoked only when task complexity requires it. To overcome the lack of intrinsic quality control in existing methods, evaluation is embedded directly within the editing loop. Intermediate results are iteratively refined through structured feedback, forming a closed-loop process that progressively corrects structural and contextual inconsistencies. We evaluate CAMEO on anomaly insertion and human pose switching tasks. Across multiple strong editing backbones and independent evaluation models, CAMEO consistently achieves a 20% higher win rate on average than multiple state-of-the-art models, demonstrating improved robustness, controllability, and structural reliability in conditional image editing.

2603.29247 2026-06-18 cs.CL cs.AI cs.LG updated version

MemRerank: Preference Memory for Personalized Product Reranking

Zhiyuan Peng, Xuyang Wu, Huaixiao Tou, Yi Fang, Yu Gong

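Affiliations: Santa Clara University; Independent Researcher

AI summary: Proposes MemRerank, a framework that uses reinforcement learning to distill user purchase history into query-independent preference memory for personalized reranking by LLM shopping agents, improving accuracy on a 1-in-5 selection task by up to 10.61 percentage points.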

Comments correct author name in metadata

Abstract

LLM-based shopping agents increasingly rely on long purchase histories and multi-turn interactions for personalization, yet naively appending raw history to prompts is often ineffective due to noise, length, and relevance mismatch. We propose MemRerank, a preference memory framework that distills user purchase history into concise, query-independent signals for personalized product reranking. To study this problem, we build an end-to-end benchmark and evaluation framework centered on an LLM-based 1-in-5 selection task, which measures both memory quality and downstream reranking utility. We further train the memory extractor with reinforcement learning (RL), using downstream reranking performance as supervision. Experiments with two LLM-based rerankers show that MemRerank consistently outperforms no-memory, raw-history, and off-the-shelf memory baselines, yielding up to +10.61 absolute points in 1-in-5 accuracy. These results suggest that explicit preference memory is a practical and effective building block for personalization in agentic e-commerce systems.

2406.14399 2026-06-18 cs.LG cs.CV physics.ao-ph stat.ML updated version

Benchmarking Physics-Informed Time-Series Models for Operational Global Station Weather Forecasting

Tao Han, Zhibin Wen, Zhenghao Chen, Dazhao Du, Song Guo, Lei Bai

Affiliations: Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong SAR, China; Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen, China; School of Computer and Information Sciences, University of Newcastle, Newcastle, Australia; Hangzhou Innovation Institute of Beihang University, Hangzhou, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China

AI summary: Introduces WEATHER-5K, a large-scale observational dataset, and PhysicsFormer, a physics-informed model that strengthens physical consistency through pressure-wind alignment and energy-aware smoothness losses, and evaluates the gap between academic models and operational systems across several weather variables and extreme event prediction.

Comments Accepted by ICML2026

Abstract

The development of Time-Series Forecasting (TSF) models is often constrained by the lack of comprehensive datasets, especially in Global Station Weather Forecasting (GSWF), where existing datasets are small, temporally short, and spatially sparse. To address this, we introduce WEATHER-5K, a large-scale observational weather dataset that better reflects real-world conditions, supporting improved model training and evaluation. While recent TSF methods perform well on benchmarks, they lag behind operational Numerical Weather Prediction systems in capturing complex weather dynamics and extreme events. We propose PhysicsFormer, a physics-informed forecasting model combining a dynamic core with a Transformer residual to predict future weather states. Physical consistency is enforced via pressure-wind alignment and energy-aware smoothness losses, ensuring plausible dynamics while capturing complex temporal patterns. We benchmark PhysicsFormer and other TSF models against operational systems across several weather variables, extreme event prediction, and model complexity, providing a comprehensive assessment of the gap between academic TSF models and operational forecasting. The dataset and benchmark implementation are available at: https://github.com/taohan10200/WEATHER-5K.

2603.26557 2026-06-18 cs.CL updated version

MemBoost: A Memory-Boosted Framework for Cost-Aware LLM Inference

Joris Köster, Zixuan Liu, Siavash Khajavi, Zizhan Zheng

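Affiliations: University of Cambridge; ETH Zurich

AI summary: Proposes MemBoost, a framework in which a lightweight model reuses past answers and retrieves supporting information while hard queries are selectively routed to a stronger model, lowering LLM inference cost while maintaining answer quality.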

Comments ICML MemFM 2026 Workshop

Abstract

Large Language Models (LLMs) deliver strong performance but incur high inference cost in real-world services, especially under workloads with repeated or near-duplicate queries across users and sessions. In this work, we propose MemBoost, a memory-boosted LLM serving framework that enables a lightweight model to reuse previously generated answers and retrieve relevant supporting information for cheap inference, while selectively escalating difficult or uncertain queries to a stronger model. Unlike standard retrieval-augmented generation, which primarily grounds a single response, MemBoost is designed for interactive settings by supporting answer reuse, continual memory growth, and cost-aware routing. Experiments across multiple models under simulated workloads show that MemBoost substantially reduces expensive large-model invocations and overall inference cost, while maintaining high answer quality comparable to the strong model baseline.
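The interplay of answer reuse, memory growth, and cost-aware routing described above can be sketched as a small router. This is a toy under stated assumptions, not MemBoost's implementation: the class name, the string-similarity lookup (Python's `difflib`), the thresholds, and the stub models are all hypothetical stand-ins for whatever retrieval and confidence mechanisms the paper actually uses.

```python
from difflib import SequenceMatcher

class MemoryRouter:
    """Toy router: reuse stored answers, otherwise call a cheap or strong model."""

    def __init__(self, cheap_model, strong_model, reuse_threshold=0.9):
        self.memory = {}                  # past query text -> stored answer
        self.cheap_model = cheap_model    # callable: query -> (answer, confidence)
        self.strong_model = strong_model  # callable: query -> answer
        self.reuse_threshold = reuse_threshold

    def _lookup(self, query):
        """Stored answer of the most similar past query, if similar enough."""
        best_answer, best_score = None, 0.0
        for past_query, answer in self.memory.items():
            score = SequenceMatcher(None, query, past_query).ratio()
            if score > best_score:
                best_answer, best_score = answer, score
        return best_answer if best_score >= self.reuse_threshold else None

    def answer(self, query, min_confidence=0.5):
        """Return (answer, route) with route in {'memory', 'cheap', 'strong'}."""
        cached = self._lookup(query)
        if cached is not None:
            return cached, "memory"
        result, confidence = self.cheap_model(query)
        route = "cheap"
        if confidence < min_confidence:
            # Escalate uncertain queries to the expensive model.
            result, route = self.strong_model(query), "strong"
        self.memory[query] = result  # memory grows with every new answer
        return result, route

# Stub models standing in for real LLM calls.
def cheap(query):
    return ("4", 0.9) if "2+2" in query else ("?", 0.1)

def strong(query):
    return "strong answer"

router = MemoryRouter(cheap, strong)
print(router.answer("What is 2+2?"))        # ('4', 'cheap')
print(router.answer("What is 2+2?"))        # ('4', 'memory')
print(router.answer("Prove the theorem."))  # ('strong answer', 'strong')
```

Storing escalated answers means a repeated hard query is served from memory the second time, which is where most of the cost saving in such a design would come from.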

2601.01200 2026-06-18 cs.CV eess.IV updated version

Objective Quality Assessment of Point Clouds Using Multi-scale Implicit Structural Similarity

Zhang Chen, Shuai Wan, Yuezhe Zhang, Siyu Ren, Fuzheng Yang, Junhui Hou

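Affiliations: School of Electronics and Information, Northwestern Polytechnical University; Department of Computer Science, City University of Hong Kong; School of Telecommunication Engineering, Xidian University

AI summary: To address the difficulty of matching irregular data in point cloud quality assessment, proposes the Multi-scale Implicit Structural Similarity Measurement (MS-ISSM), which represents local features continuously with radial basis functions and compares implicit function coefficients; combined with a ResGrouped-MLP network, it outperforms existing methods on multiple benchmarks.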

Comments IEEE TMM Accepted

Abstract

The unstructured and irregular nature of point clouds poses a significant challenge for accurate point cloud quality assessment (PCQA), particularly in establishing accurate perceptual feature correspondence. To tackle this, we propose the Multi-scale Implicit Structural Similarity Measurement (MS-ISSM). Unlike traditional point-to-point matching, MS-ISSM utilizes radial basis functions (RBFs) to represent local features continuously, transforming distortion measurement into a comparison of implicit function coefficients. This approach effectively circumvents matching errors inherent in irregular data. Additionally, we propose a ResGrouped-MLP quality assessment network, which robustly maps multi-scale feature differences to perceptual scores. The network architecture departs from a traditional flat multi-layer perceptron (MLP) by adopting a grouped encoding strategy integrated with residual blocks and channel-wise attention mechanisms. This hierarchical design allows the model to preserve the distinct physical semantics of luma, chroma, and geometry while adaptively focusing on the most salient distortion features across High, Medium, and Low scales. Experimental results on multiple benchmarks demonstrate that MS-ISSM outperforms state-of-the-art metrics in both reliability and generalization. The source code is available at: https://github.com/ZhangChen2022/MS-ISSM.

2603.21583 2026-06-18 cs.CV updated version

HACMatch: Semi-Supervised Rotation Regression with Hardness-Aware Curriculum Pseudo Labeling

Mei Li, Huayi Zhou, Suizhi Huang, Yuxiang Lu, Yue Ding, Hongtao Lu

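Affiliations: Shanghai Jiao Tong University

AI summary: Proposes a hardness-aware curriculum learning framework that improves semi-supervised rotation regression with few labeled samples through dynamic selection of pseudo-labeled samples and structured data augmentation.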

Comments This is an accepted manuscript of an article published in Computer Vision and Image Understanding

Journal ref Computer Vision and Image Understanding (2026)

详情
AI中文摘要

从2D图像回归物体的3D旋转是一项关键且具有挑战性的任务,在自动驾驶、虚拟现实和机器人控制等领域有广泛应用。现有的旋转回归模型通常依赖大量标注数据进行训练,或需要点云、CAD模型等2D图像之外的额外信息。因此,探索仅使用有限数量标注2D图像的半监督旋转回归具有重要价值。尽管最近的工作FisherMatch将半监督学习引入旋转回归,但其基于熵的刚性伪标签过滤方法未能有效区分可靠和不可靠的无标注样本。为解决这一局限,我们提出一种难度感知课程学习框架,根据样本难度动态选择伪标签样本,从简单到复杂逐步推进。我们引入了多阶段和自适应课程策略,用更灵活、难度感知的机制替代固定阈值过滤。此外,我们提出一种专门针对旋转估计的新型结构化数据增强策略,通过从增强补丁中组装复合图像来引入特征多样性,同时保持关键几何完整性。在PASCAL3D+和ObjectNet3D上的综合实验表明,我们的方法在低数据场景下尤其优于现有的监督和半监督基线,验证了课程学习框架和结构化增强方法的有效性。

英文摘要

Regressing 3D rotations of objects from 2D images is a crucial yet challenging task, with broad applications in autonomous driving, virtual reality, and robotic control. Existing rotation regression models often rely on large amounts of labeled data for training or require additional information beyond 2D images, such as point clouds or CAD models. Therefore, exploring semi-supervised rotation regression using only a limited number of labeled 2D images is highly valuable. While recent work FisherMatch introduces semi-supervised learning to rotation regression, it suffers from rigid entropy-based pseudo-label filtering that fails to effectively distinguish between reliable and unreliable unlabeled samples. To address this limitation, we propose a hardness-aware curriculum learning framework that dynamically selects pseudo-labeled samples based on their difficulty, progressing from easy to complex examples. We introduce both multi-stage and adaptive curriculum strategies to replace fixed-threshold filtering with more flexible, hardness-aware mechanisms. Additionally, we present a novel structured data augmentation strategy specifically tailored for rotation estimation, which assembles composite images from augmented patches to introduce feature diversity while preserving critical geometric integrity. Comprehensive experiments on PASCAL3D+ and ObjectNet3D demonstrate that our method outperforms existing supervised and semi-supervised baselines, particularly in low-data regimes, validating the effectiveness of our curriculum learning framework and structured augmentation approach.

2509.22363 2026-06-18 cs.LG eess.AS 版本更新

Investigating Faithfulness in Large Audio Language Models

大型音频语言模型中的忠实性研究

Pooneh Mousavi, Lovenya Jain, Mirco Ravanelli, Cem Subakan

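Affiliations: Concordia University; Mila - Quebec AI Institute; Université Laval; Birla Institute of Technology and Science, Pilani

AI summary: Proposes a systematic framework for evaluating the faithfulness of reasoning chains in large audio language models, defines three criteria for audio faithfulness, and finds through benchmarking that model reasoning can be disconnected from the audio input.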

Comments Accepted to Interspeech 2026

详情
AI中文摘要

大型音频语言模型(LALMs)将音频编码器与预训练的大型语言模型集成,以执行复杂的多模态推理任务。虽然这些模型可以生成思维链(CoT)解释,但这些推理链的忠实性仍不清楚。在这项工作中,我们提出了一个系统框架来评估LALMs中CoT在输入音频和最终模型预测方面的忠实性。我们定义了音频忠实性的三个标准:无幻觉、整体性和专注聆听。我们还引入了一个基于音频和CoT干预的基准来评估忠实性(基准测试界面和评估结果可在以下网址获取:https://poonehmousavi.github.io/faithfulness/)。在Audio Flamingo 3和Qwen2.5-Omni上的实验表明存在潜在的多模态脱节:推理通常与最终预测一致,但并不总是强烈基于音频,并且可能容易受到幻觉或对抗性扰动的影响。

英文摘要

Large Audio Language Models (LALMs) integrate audio encoders with pretrained Large Language Models to perform complex multimodal reasoning tasks. While these models can generate Chain-of-Thought (CoT) explanations, the faithfulness of these reasoning chains remains unclear. In this work, we propose a systematic framework to evaluate CoT faithfulness in LALMs with respect to both the input audio and the final model prediction. We define three criteria for audio faithfulness: hallucination-free, holistic, and attentive listening. We also introduce a benchmark based on both audio and CoT interventions to assess faithfulness (the benchmarking interface and evaluation results are available at https://poonehmousavi.github.io/faithfulness/). Experiments on Audio Flamingo 3 and Qwen2.5-Omni suggest a potential multimodal disconnect: reasoning often aligns with the final prediction but is not always strongly grounded in the audio and can be vulnerable to hallucinations or adversarial perturbations.

2511.02036 2026-06-18 cs.RO 版本更新

TurboMap: GPU-Accelerated Local Mapping for Visual SLAM

TurboMap: 面向视觉SLAM的GPU加速局部建图

Parsa Hosseininejad, Kimia Khabiri, Shishir Gopinath, Soudabeh Mohammadhashemi, Karthik Dantu, Steven Y. Ko

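Affiliations: Simon Fraser University; University at Buffalo

AI summary: To address local mapping latency in visual SLAM, proposes TurboMap, a backend that combines GPU parallelization with CPU optimization; by restructuring map point creation, map point fusion, and keyframe management, it achieves 1.3x to 1.6x speedups while preserving accuracy.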

Comments Accepted for presentation at IROS 2026, preprint

详情
AI中文摘要

在实时视觉SLAM系统中,局部建图必须在严格的延迟约束下运行,因为延迟会降低地图质量并增加跟踪失败的风险。GPU并行化是降低延迟的有效途径。然而,由于同步共享状态更新以及将大型地图数据结构传输到GPU的开销,并行化局部建图具有挑战性。本文提出TurboMap,一个GPU并行化且CPU优化的局部建图后端,全面解决了这些挑战。我们重构了地图点创建,以在GPU上实现并行关键点对应搜索,重新设计并并行化了地图点融合,在CPU上优化了冗余关键帧剔除,并集成了基于GPU的快速局部光束法平差求解器。为最小化数据传输和同步成本,我们引入了持久化的GPU驻留关键帧存储。在EuRoC和TUM-VI数据集上的实验表明,平均局部建图速度分别提升1.3倍和1.6倍,同时保持精度不变。

英文摘要

In real-time Visual SLAM systems, local mapping must operate under strict latency constraints, as delays degrade map quality and increase the risk of tracking failure. GPU parallelization offers a promising way to reduce latency. However, parallelizing local mapping is challenging due to synchronized shared-state updates and the overhead of transferring large map data structures to the GPU. This paper presents TurboMap, a GPU-parallelized and CPU-optimized local mapping backend that holistically addresses these challenges. We restructure Map Point Creation to enable parallel Keypoint Correspondence Search on the GPU, redesign and parallelize Map Point Fusion, optimize Redundant Keyframe Culling on the CPU, and integrate a fast GPU-based Local Bundle Adjustment solver. To minimize data transfer and synchronization costs, we introduce persistent GPU-resident keyframe storage. Experiments on the EuRoC and TUM-VI datasets show average local mapping speedups of 1.3x and 1.6x, respectively, while preserving accuracy.

2602.05992 2026-06-18 cs.CL 版本更新

DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs

DSB: 扩散语言模型的动态滑动块调度

Lizhuo Luo, Shenggui Li, Yonggang Wen, Tianwei Zhang

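Affiliations: Nanyang Technological University

AI summary: To address the fact that fixed block scheduling in diffusion language models ignores semantic difficulty, proposes DSB, a training-free dynamic sliding block method, and DSB Cache, an accompanying KV-cache mechanism, which significantly improve generation quality and inference efficiency.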

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

扩散大语言模型(dLLMs)已成为文本生成的一种有前景的替代方案,其特点在于原生支持并行解码。在实践中,块推理对于避免全局双向解码中的顺序错乱以及提高输出质量至关重要。然而,广泛使用的固定、预定义块(朴素)调度忽略了语义难度,使其在质量和效率上均非最优策略:它可能迫使模型对不确定的位置过早做出承诺,同时延迟块边界附近的简单位置。在这项工作中,我们分析了朴素块调度的局限性,并揭示了根据语义难度动态调整调度对于可靠高效推理的重要性。受此启发,我们提出了动态滑动块(DSB),一种无训练的块调度方法,它使用动态大小的滑动块来克服朴素块的刚性。为了进一步提高效率,我们引入了DSB Cache,一种针对DSB量身定制的无训练KV缓存机制。跨多个模型和基准的大量实验表明,DSB与DSB Cache一起,持续提升了dLLMs的生成质量和推理效率。代码已发布在 https://github.com/lizhuo-luo/DSB。

英文摘要

Diffusion large language models (dLLMs) have emerged as a promising alternative for text generation, distinguished by their native support for parallel decoding. In practice, block inference is crucial for avoiding order misalignment in global bidirectional decoding and improving output quality. However, the widely-used fixed, predefined block (naive) schedule is agnostic to semantic difficulty, making it a suboptimal strategy for both quality and efficiency: it can force premature commitments to uncertain positions while delaying easy positions near block boundaries. In this work, we analyze the limitations of naive block scheduling and disclose the importance of dynamically adapting the schedule to semantic difficulty for reliable and efficient inference. Motivated by this, we propose Dynamic Sliding Block (DSB), a training-free block scheduling method that uses a sliding block with a dynamic size to overcome the rigidity of the naive block. To further improve efficiency, we introduce DSB Cache, a training-free KV-cache mechanism tailored to DSB. Extensive experiments across multiple models and benchmarks demonstrate that DSB, together with DSB Cache, consistently improves both generation quality and inference efficiency for dLLMs. Code is released at https://github.com/lizhuo-luo/DSB.

2603.09344 2026-06-18 cs.AI stat.ML 版本更新

Robust Regularized Policy Iteration under Transition Uncertainty

鲁棒正则化策略迭代在转移不确定性下

Hongqiang Lin, Zhenghui Fu, Weihao Tang, Pengfei Wang, Yiding Sun, Qixian Huang, Dongxu Zhang

Affiliations: College of Computer Science and Technology, Zhejiang University, Hangzhou, China; School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi'an, China; School of Software Technology, Zhejiang University, Hangzhou, China; School of Software Engineering, Xi'an Jiaotong University, Xi'an, China; School of Systems Science and Engineering, Sun Yat-sen University, Guangzhou, China

AI summary: Proposes Robust Regularized Policy Iteration (RRPI), which models offline reinforcement learning as robust policy optimization, replaces the intractable bilevel objective with a KL-regularized surrogate, and derives an efficient policy iteration procedure from a robust regularized Bellman operator; convergence is guaranteed theoretically, and experiments on D4RL benchmarks show strong performance.

详情
AI中文摘要

离线强化学习(RL)无需在线探索即可实现数据高效且安全的策略学习,但其性能常因分布偏移而下降。学习到的策略可能访问分布外的状态-动作对,其中价值估计和学习到的动态不可靠。为了在统一框架中处理策略引发的外推和转移不确定性,我们将离线RL建模为鲁棒策略优化,将转移核视为不确定性集内的决策变量,并针对最坏情况动态优化策略。我们提出鲁棒正则化策略迭代(RRPI),用可处理的KL正则化替代难解的最大-最小双层目标,并基于鲁棒正则化贝尔曼算子推导出高效的策略迭代过程。我们提供了理论保证,证明所提出的算子是$\gamma$-压缩算子,且迭代更新替代目标能单调改进原始鲁棒目标并收敛。在D4RL基准上的实验表明,RRPI实现了强大的平均性能,在大多数环境中优于包括基于百分位数方法在内的最新基线,并在其余环境中保持竞争力。此外,RRPI通过将较低的$Q$值与高认知不确定性对齐,展现出鲁棒性能,从而防止策略执行不可靠的分布外动作。

英文摘要

Offline reinforcement learning (RL) enables data-efficient and safe policy learning without online exploration, but its performance often degrades under distribution shift. The learned policy may visit out-of-distribution state-action pairs where value estimates and learned dynamics are unreliable. To address policy-induced extrapolation and transition uncertainty in a unified framework, we formulate offline RL as robust policy optimization, treating the transition kernel as a decision variable within an uncertainty set and optimizing the policy against the worst-case dynamics. We propose Robust Regularized Policy Iteration (RRPI), which replaces the intractable max-min bilevel objective with a tractable KL-regularized surrogate and derives an efficient policy iteration procedure based on a robust regularized Bellman operator. We provide theoretical guarantees by showing that the proposed operator is a $\gamma$-contraction and that iteratively updating the surrogate yields monotonic improvement of the original robust objective with convergence. Experiments on D4RL benchmarks demonstrate that RRPI achieves strong average performance, outperforming recent baselines including percentile-based methods on the majority of environments while remaining competitive on the rest. Moreover, RRPI exhibits robust performance by aligning lower $Q$-values with high epistemic uncertainty, which prevents the policy from executing unreliable out-of-distribution actions.
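The abstract does not spell out the robust regularized Bellman operator, so the following is only a minimal sketch of one standard way a KL-regularized worst-case backup can be computed, not the authors' operator. It relies on the well-known identity $\min_P \{\mathbb{E}_P[V] + \tau\,\mathrm{KL}(P \,\|\, P_0)\} = -\tau \log \mathbb{E}_{P_0}[e^{-V/\tau}]$. The function names, the temperature `tau`, and the toy numbers are all assumptions made for illustration.

```python
import math


def soft_worst_case_value(nominal_probs, next_values, tau):
    """Closed form of min_P E_P[V] + tau * KL(P || P0) over next-state distributions.

    nominal_probs: nominal transition probabilities P0(s') (assumed given).
    next_values:   value estimates V(s') for the same next states.
    tau:           hypothetical temperature; large tau trusts P0, small tau is pessimistic.
    """
    shift = min(next_values)  # subtract the minimum for numerical stability
    z = sum(p * math.exp(-(v - shift) / tau)
            for p, v in zip(nominal_probs, next_values))
    return shift - tau * math.log(z)


def robust_backup(reward, gamma, nominal_probs, next_values, tau):
    """One pessimistic Bellman target: r + gamma * soft worst-case next value."""
    return reward + gamma * soft_worst_case_value(nominal_probs, next_values, tau)


# Toy example (illustrative numbers, not from the paper).
p0 = [0.5, 0.5]
v = [0.0, 1.0]
soft = soft_worst_case_value(p0, v, tau=1.0)
target = robust_backup(reward=1.0, gamma=0.9, nominal_probs=p0, next_values=v, tau=1.0)
```

The soft worst-case value always lies between the minimum of `next_values` and their mean under `nominal_probs`, which is why such a backup is more pessimistic than the nominal one without collapsing to the hard minimum.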

2603.11417 2026-06-18 cs.CV cs.LG 版本更新

Zero-Shot Cross-City Generalization in End-to-End Autonomous Driving: Self-Supervised versus Supervised Representations

端到端自动驾驶中的零样本跨城市泛化:自监督与监督表示

Fatemeh Naeinian, Ali Hamza, Haoran Zhu, Anna Choromanska

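Affiliations: Department of Electrical and Computer Engineering, NYU Tandon School of Engineering

AI summary: Studies the generalization of end-to-end autonomous driving models under zero-shot cross-city transfer, finding that self-supervised pretraining (e.g., I-JEPA, DINOv2, MAE) substantially reduces displacement and collision degradation relative to supervised pretraining and improves out-of-distribution PDMS in closed-loop evaluation.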

详情
AI中文摘要

端到端自动驾驶模型通常使用监督的ImageNet预训练骨干网络在多城市数据集上训练,但其泛化到未见城市的能力尚未得到充分检验。当训练和评估数据在地理上混合时,模型可能隐含地依赖城市特定线索,掩盖了在真实世界域偏移下泛化到新位置时可能出现的失败模式。在这项工作中,我们将零样本跨城市迁移定义为端到端自动驾驶的受控表示级压力测试,并探究视觉预训练如何影响地理域偏移下的迁移行为。我们通过将自监督骨干网络I-JEPA、DINOv2和MAE集成到规划框架中进行了全面研究。我们在nuScenes上的开环设置和NAVSIM上的闭环评估协议中,在严格的地理划分下评估性能。我们的实验揭示了当模型在不同道路拓扑、交通规则和视觉环境的城市间迁移时存在显著的泛化差距。在开环评估中,监督骨干网络在城市间迁移时表现出严重退化,而某些领域特定的自监督方法可以显著减少位移和碰撞退化。在闭环评估中,自监督预训练在多个单城市训练设置中提高了平均分布外PDMS。我们的结果提供了经验证据,表明表示学习影响跨城市规划的鲁棒性,并促使将零样本地理迁移作为评估端到端自动驾驶系统的重要压力测试。

英文摘要

End-to-end autonomous driving models are typically trained on multi-city datasets using supervised ImageNet-pretrained backbones, yet their ability to generalize to unseen cities remains largely unexamined. When training and evaluation data are geographically mixed, models may implicitly rely on city-specific cues, masking failure modes that would occur under real-world domain shifts when generalizing to new locations. In this work, we formulate zero-shot cross-city transfer as a controlled representation-level stress test for end-to-end autonomous driving and ask how visual pretraining affects transfer behavior under geographic domain shift. We conduct a comprehensive study by integrating self-supervised backbones I-JEPA, DINOv2, and MAE into planning frameworks. We evaluate performance under strict geographic splits on nuScenes in the open-loop setting and on NAVSIM in the closed-loop evaluation protocol. Our experiments reveal a substantial generalization gap when transferring models across cities with different road topologies, traffic conventions, and visual environments. In open-loop evaluation, a supervised backbone exhibits severe degradation when transferring between cities, yet some domain-specific self-supervised methods can substantially reduce both displacement and collision degradation. In closed-loop evaluation, self-supervised pretraining improves average out-of-distribution PDMS in several single-city training settings. Our results provide empirical evidence that representation learning influences the robustness of cross-city planning and motivate zero-shot geographic transfer as an important stress test for evaluating end-to-end autonomous driving systems.

2603.10827 2026-06-18 cs.SD cs.AI 版本更新

Speaker Verification with Speech-Aware LLMs: Evaluation and Augmentation

语音感知大语言模型的说话人验证:评估与增强

Thomas Thebaud, Yuzhe Wang, Laureano Moro-Velazquez, Jesus Villalba-Lopez, Najim Dehak

Affiliations: Electrical and Computer Engineering Department, Johns Hopkins University, Baltimore, MD, USA; Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD, USA

AI summary: Proposes a model-agnostic scoring protocol to evaluate the speaker discrimination ability of speech-aware LLMs (EER above 20%), and reaches performance close to dedicated systems (1.03% EER) by injecting frozen ECAPA-TDNN speaker embeddings and fine-tuning with LoRA.

Comments 3 Tables, 1 Figure, Published in Interspeech 2026

详情
AI中文摘要

语音感知大语言模型(LLMs)可以接受语音输入,但其训练目标主要强调语言内容或特定领域(如情感或说话人性别),尚不清楚它们是否编码了说话人身份。首先,我们提出了一种模型无关的评分协议,该协议利用Yes/No令牌概率的置信度分数或对数似然比,为仅API模型和开放权重模型生成连续验证分数。使用该协议,我们评估了最近的语音感知LLMs,观察到较弱的说话人区分能力(在VoxCeleb1上EER高于20%)。其次,我们引入了一种轻量级增强方法,通过可学习的投影注入冻结的ECAPA-TDNN说话人嵌入,并仅训练LoRA适配器,使LLM具备自动说话人验证(ASV)能力。在TinyLLaMA-1.1B上,得到的ECAPA-LLM在VoxCeleb1-E上实现了1.03%的EER,接近专用说话人验证系统,同时保留了自然语言接口。

英文摘要

Speech-aware large language models (LLMs) can accept speech inputs, yet their training objectives largely emphasize linguistic content or specific fields such as emotions or the speaker's gender, leaving it unclear whether they encode speaker identity. First, we propose a model-agnostic scoring protocol that produces continuous verification scores for both API-only and open-weight models, using confidence scores or log-likelihood ratios from the Yes/No token probabilities. Using this protocol, we benchmark recent speech-aware LLMs and observe weak speaker discrimination (EERs above 20% on VoxCeleb1). Second, we introduce a lightweight augmentation that equips an LLM with ASV capability by injecting frozen ECAPA-TDNN speaker embeddings through a learned projection and training only LoRA adapters. On TinyLLaMA-1.1B, the resulting ECAPA-LLM achieves 1.03% EER on VoxCeleb1-E, approaching a dedicated speaker verification system while preserving a natural-language interface.
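As a rough illustration of the scoring idea described above, the sketch below turns Yes/No token probabilities into a log-likelihood-ratio score and estimates an equal error rate (EER) with a simple threshold sweep. It is not the authors' protocol: the function names, the threshold sweep, and the probability pairs are hypothetical, and a real evaluation would use trial lists such as those of VoxCeleb1.

```python
import math


def llr_score(p_yes, p_no, eps=1e-12):
    """Log-likelihood ratio between the 'Yes' and 'No' token probabilities."""
    return math.log(max(p_yes, eps)) - math.log(max(p_no, eps))


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping the threshold over all observed scores."""
    best_gap, eer = float("inf"), 1.0
    for thr in sorted(set(target_scores) | set(nontarget_scores)):
        # False rejection: same-speaker trials scored below the threshold.
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        # False acceptance: different-speaker trials scored at or above it.
        far = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Hypothetical (p_yes, p_no) pairs, not taken from the paper.
same_speaker = [llr_score(0.9, 0.1), llr_score(0.7, 0.3), llr_score(0.4, 0.6)]
diff_speaker = [llr_score(0.2, 0.8), llr_score(0.6, 0.4), llr_score(0.1, 0.9)]
eer = equal_error_rate(same_speaker, diff_speaker)
```

A continuous score of this kind is what makes threshold-based metrics like EER computable at all; a hard Yes/No answer would give only a single operating point.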

2603.04865 2026-06-18 cs.SD 版本更新

The First Environmental Sound Deepfake Detection Challenge: Benchmarking Robustness, Evaluation, and Insights

环境声音深度伪造检测挑战赛:鲁棒性、评估与洞察的基准测试

Han Yin, Yang Xiao, Rohan Kumar Das, Jisheng Bai, Ting Dang

Affiliations: School of Electrical Engineering, KAIST, Daejeon, Republic of Korea; University of Melbourne, Australia; Fortemedia Singapore, Singapore; Xi'an University of Posts & Telecommunications, Xi'an, China; Xi'an Lianfeng Acoustic Technologies Co., Ltd., China

AI summary: Presents the Environmental Sound Deepfake Detection challenge, covering robustness evaluation, system architectures, and future research directions, and outlines the key challenges and opportunities for environmental sound deepfake detection.

Comments Accepted by Interspeech 2026

详情
AI中文摘要

近年来,音频生成技术的进步使得创建高度逼真的环境声音景观变得更加容易,这可能被滥用于制造欺骗性内容,如假警报、枪声和人群声音,从而引发公众安全和信任的担忧。尽管语音和歌唱声的深度伪造检测已被广泛研究,但环境声音深度伪造检测(ESDD)仍处于探索阶段。为了推动ESDD的发展,首次ESDD挑战赛被启动,吸引了97支注册团队,收到了1748份有效提交。本文提出了该任务的定义、数据集构建、评估协议、基线系统以及挑战赛结果中的关键见解。此外,我们分析了高性能系统中常见的架构选择和训练策略。最后,我们讨论了ESDD的潜在未来研究方向,概述了关键机会和开放问题,以指导该领域后续研究。

英文摘要

Recent progress in audio generation has made it increasingly easy to create highly realistic environmental soundscapes, which can be misused to produce deceptive content, such as fake alarms, gunshots, and crowd sounds, raising concerns for public safety and trust. While deepfake detection for speech and singing voice has been extensively studied, environmental sound deepfake detection (ESDD) remains underexplored. To advance ESDD, the first edition of the ESDD challenge was launched, attracting 97 registered teams and receiving 1,748 valid submissions. This paper presents the task formulation, dataset construction, evaluation protocols, baseline systems, and key insights from the challenge results. Furthermore, we analyze common architectural choices and training strategies among top-performing systems. Finally, we discuss potential future research directions for ESDD, outlining key opportunities and open problems to guide subsequent studies in this field.

2603.05010 2026-06-18 cs.CV 版本更新

How far have we gone in Generative Image Restoration? A study on its capability, limitations and evaluation practices

生成式图像恢复进展:能力、局限性与评估实践研究

Xiang Yin, Jinfan Hu, Zhiyuan You, Kainan Yan, Yu Tang, Chao Dong, Jinjin Gu

Affiliations: Fudan University; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; University of the Chinese Academy of Sciences; Multimedia Laboratory, The Chinese University of Hong Kong; Shenzhen University of Advanced Technology

AI summary: Uses a multi-dimensional evaluation pipeline to systematically compare generative models such as diffusion and GAN models with PSNR-oriented models, reveals a paradigm shift from insufficient detail to detail quality and semantic control, and trains an IQA model that aligns better with human perception.

Comments Accepted by CVPR 2026 Findings

详情
AI中文摘要

生成式图像恢复(GIR)在感知真实感方面取得了显著进展,但与先前方法相比,其实际能力究竟有多大提升?为回答这一问题,我们基于新的多维度评估管道开展大规模研究,该管道从细节、清晰度、语义正确性和整体质量四个维度评估模型。我们的分析涵盖多种架构,包括基于扩散的、基于GAN的、PSNR导向的以及通用生成模型,揭示了关键的性能差异。此外,我们的分析揭示了失败模式的演变,这标志着以感知为导向的低层视觉领域发生了范式转变。核心挑战正从先前的细节稀缺(欠生成)问题演变为细节质量和语义控制(防止过生成)的新前沿。我们还利用我们的基准训练了一个新的IQA模型,该模型更符合人类感知判断。最终,本工作对现代生成式图像恢复模型进行了系统研究,提供了关键见解,重新定义了对其真实状态的理解,并为未来发展指明了方向。

英文摘要

Generative Image Restoration (GIR) has achieved impressive perceptual realism, but how far have its practical capabilities truly advanced compared with previous methods? To answer this, we present a large-scale study grounded in a new multi-dimensional evaluation pipeline that assesses models on detail, sharpness, semantic correctness, and overall quality. Our analysis covers diverse architectures, including diffusion-based, GAN-based, PSNR-oriented, and general-purpose generation models, revealing critical performance disparities. Furthermore, our analysis uncovers a key evolution in failure modes that signifies a paradigm shift for the perception-oriented low-level vision field. The central challenge is evolving from the previous problem of detail scarcity (under-generation) to the new frontier of detail quality and semantic control (preventing over-generation). We also leverage our benchmark to train a new IQA model that better aligns with human perceptual judgments. Ultimately, this work provides a systematic study of modern generative image restoration models, offering crucial insights that redefine our understanding of their true state and chart a course for future development.

2510.21605 2026-06-18 cs.CV 版本更新

S3OD: Towards Generalizable Salient Object Detection with Synthetic Data

S3OD:基于合成数据的通用显著目标检测

Orest Kupyn, Hirokatsu Kataoka, Christian Rupprecht

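Affiliations: University of Oxford, VGG

AI summary: Proposes S3OD, which substantially improves cross-dataset generalization in salient object detection through large-scale synthetic data generation and an ambiguity-aware architecture; training on synthetic data alone reduces error by 20-50%.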

详情
AI中文摘要

显著目标检测体现了数据受限任务的特点,昂贵的像素级精确标注迫使相关子任务(如DIS和HR-SOD)进行单独的模型训练。我们提出了一种通过大规模合成数据生成和歧义感知架构来大幅提升泛化能力的方法。我们引入了S3OD,一个包含超过139,000张高分辨率图像的数据集,通过我们的多模态扩散管道从扩散和DINO-v3特征中提取标签。迭代生成框架根据模型性能优先处理具有挑战性的类别。我们提出了一个简化的多掩码解码器,通过预测多个有效解释来处理显著目标检测中固有的歧义。仅使用合成数据训练的模型在跨数据集泛化中实现了20-50%的错误率降低,而微调版本在DIS和HR-SOD基准上达到了最先进的性能。

英文摘要

Salient object detection exemplifies data-bounded tasks where expensive pixel-precise annotations force separate model training for related subtasks like DIS and HR-SOD. We present a method that dramatically improves generalization through large-scale synthetic data generation and ambiguity-aware architecture. We introduce S3OD, a dataset of over 139,000 high-resolution images created through our multi-modal diffusion pipeline that extracts labels from diffusion and DINO-v3 features. The iterative generation framework prioritizes challenging categories based on model performance. We propose a streamlined multi-mask decoder that handles the inherent ambiguity in salient object detection by predicting multiple valid interpretations. Models trained only on synthetic data achieve 20-50% error reduction in cross-dataset generalization, while fine-tuned versions reach state-of-the-art performance across DIS and HR-SOD benchmarks.

2507.04219 2026-06-18 cs.LG cs.AI 版本更新

Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs

模型崩溃不是错误,而是大语言模型机器遗忘中的一种特性

Yan Scholten, Sophie Xhonneux, Leo Schwinn, Stephan Günnemann

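Affiliations: Dept. of Computer Science & Munich Data Science Institute, Technical University of Munich; Mila, Université de Montréal

AI summary: Proposes Partial Model Collapse (PMC), which achieves unlearning by deliberately triggering distribution collapse on the target data, without optimizing on the unlearning targets, effectively removing private information while preserving model utility.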

Comments Accepted at ICLR 2026

详情
AI中文摘要

当前大语言模型的遗忘方法通过将待移除的私有信息纳入微调数据来优化。我们认为这不仅可能强化对敏感数据的暴露,而且从根本上违背了最小化其使用的原则。作为补救,我们提出了一种新颖的遗忘方法——部分模型崩溃(PMC),该方法的遗忘优化目标中不需要包含待遗忘的数据。我们的方法受到最近观察的启发:在生成模型上训练其自身生成会导致分布崩溃,从而有效移除模型输出中的信息。我们的核心见解是,可以通过故意触发我们旨在移除的数据上的模型崩溃来利用模型崩溃进行机器遗忘。我们从理论上分析了我们的方法收敛到期望结果,即模型遗忘了待移除的数据。我们实验证明,PMC克服了现有显式优化遗忘目标的遗忘方法的四个关键限制,并在保持通用模型效用的同时更有效地从模型输出中移除私有信息。总体而言,我们的贡献代表了向更全面、更符合现实隐私约束的遗忘迈出的重要一步。代码可在以下网址获取:https://www.cs.cit.tum.de/daml/partial-model-collapse/。

英文摘要

Current unlearning methods for LLMs optimize on the private information they seek to remove by incorporating it into their fine-tuning data. We argue this not only risks reinforcing exposure to sensitive data, but also fundamentally contradicts the principle of minimizing its use. As a remedy, we propose a novel unlearning method, Partial Model Collapse (PMC), which does not require unlearning targets in the unlearning objective. Our approach is inspired by recent observations that training generative models on their own generations leads to distribution collapse, effectively removing information from model outputs. Our central insight is that model collapse can be leveraged for machine unlearning by deliberately triggering it for data we aim to remove. We show theoretically that our approach converges to the desired outcome, i.e., the model unlearns the data targeted for removal. We empirically demonstrate that PMC overcomes four key limitations of existing unlearning methods that explicitly optimize on unlearning targets, and more effectively removes private information from model outputs while preserving general model utility. Overall, our contributions represent an important step toward more comprehensive unlearning that better aligns with real-world privacy constraints. Code is available at https://www.cs.cit.tum.de/daml/partial-model-collapse/.

2603.00656 2026-06-18 cs.AI 版本更新

InfoPO: Information-Driven Policy Optimization for User-Centric Agents

InfoPO:面向用户智能体的信息驱动策略优化

Fanqi Kong, Jiayi Zhang, Mingyi Deng, Chenglin Wu, Yuyu Luo, Bang Liu

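Affiliations: Peking University; The Hong Kong University of Science

AI summary: To address credit assignment problems and insufficient advantage signals in multi-turn interaction, proposes InfoPO, which fuses an information-gain reward with task outcomes through adaptive variance-gated fusion and outperforms existing baselines on tasks such as intent clarification and collaborative coding.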

详情
AI中文摘要

现实世界中用户对LLM智能体的请求往往不明确。智能体必须通过交互获取缺失信息并做出正确的下游决策。然而,当前基于多轮GRPO的方法通常依赖于轨迹级奖励计算,这导致信用分配问题以及rollout组内优势信号不足。一种可行的方法是在细粒度上识别有价值的交互轮次,以驱动更有针对性的学习。为此,我们引入了InfoPO(信息驱动策略优化),它将多轮交互视为一个主动不确定性降低的过程,并计算信息增益奖励,该奖励对反馈可测量地改变智能体后续动作分布(与掩码反馈反事实相比)的轮次进行奖励。然后,通过自适应方差门控融合将该信号与任务结果结合,以在保持任务导向目标方向的同时识别信息重要性。在包括意图澄清、协作编码和工具增强决策在内的多种任务中,InfoPO始终优于提示和多轮RL基线。它还在用户模拟器偏移下表现出鲁棒性,并有效泛化到环境交互任务。总体而言,InfoPO为优化复杂的智能体-用户协作提供了一种原则性且可扩展的机制。代码可在以下网址获取:https://github.com/kfq20/InfoPO。

英文摘要

Real-world user requests to LLM agents are often underspecified. Agents must interact to acquire missing information and make correct downstream decisions. However, current multi-turn GRPO-based methods often rely on trajectory-level reward computation, which leads to credit assignment problems and insufficient advantage signals within rollout groups. A feasible approach is to identify valuable interaction turns at a fine granularity to drive more targeted learning. To address this, we introduce InfoPO (Information-Driven Policy Optimization), which frames multi-turn interaction as a process of active uncertainty reduction and computes an information-gain reward that credits turns whose feedback measurably changes the agent's subsequent action distribution compared to a masked-feedback counterfactual. It then combines this signal with task outcomes via an adaptive variance-gated fusion to identify information importance while maintaining task-oriented goal direction. Across diverse tasks, including intent clarification, collaborative coding, and tool-augmented decision making, InfoPO consistently outperforms prompting and multi-turn RL baselines. It also demonstrates robustness under user simulator shifts and generalizes effectively to environment-interactive tasks. Overall, InfoPO provides a principled and scalable mechanism for optimizing complex agent-user collaboration. Code is available at https://github.com/kfq20/InfoPO.
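The abstract describes the information-gain reward only in words: a turn is credited when its feedback measurably changes the agent's next action distribution relative to a masked-feedback counterfactual. One plausible way to quantify such a change is a KL divergence between the two distributions. The sketch below is built on that assumption and is not the paper's actual reward. The function names and the toy distributions are hypothetical, and the adaptive variance-gated fusion with task outcomes is not modeled.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same actions."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def info_gain_rewards(with_feedback, masked_feedback):
    """Per-turn reward: how far the real feedback moved the action distribution.

    with_feedback[t]   is the action distribution after seeing turn t's feedback.
    masked_feedback[t] is the counterfactual distribution with that feedback masked.
    """
    return [kl_divergence(p, q) for p, q in zip(with_feedback, masked_feedback)]


# Toy two-turn dialogue with two candidate actions (illustrative numbers only).
observed = [[0.9, 0.1], [0.5, 0.5]]
counterfactual = [[0.5, 0.5], [0.5, 0.5]]
rewards = info_gain_rewards(observed, counterfactual)
```

In this toy case the first turn earns a positive reward because the feedback shifted the agent toward one action, while the second turn earns roughly zero because masking the feedback changes nothing.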

2603.00026 2026-06-18 cs.CL cs.AI cs.IR 版本更新

ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents

ActMem:弥合LLM代理中记忆检索与推理之间的差距

Xiaohui Zhang, Zequn Sun, Chengyuan Yang, Yaqin Jin, Yazhong Zhang, Wei Hu

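Affiliations: State Key Laboratory for Novel Software Technology, Nanjing University, China; Alibaba Group, Hangzhou, China; National Institute of Healthcare Data Science, Nanjing University, China

AI summary: Proposes the ActMem framework, which converts unstructured dialogue history into a structured causal and semantic graph and combines counterfactual reasoning with commonsense completion to enable active causal reasoning, significantly improving LLM agents on complex memory-dependent tasks.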

详情
AI中文摘要

记忆管理对于长期交互中的LLM代理至关重要。当前的记忆框架通常将代理视为被动的“记录器”,并在不理解其深层含义的情况下检索信息。它们可能在需要推理和复杂决策的场景中失败。为了弥合这一关键差距,我们提出了一种新颖的可操作记忆框架ActMem,它将记忆检索与主动因果推理相结合。ActMem将非结构化对话历史转化为结构化的因果语义图。通过利用反事实推理和常识补全,它使代理能够推断隐含约束并解决过去状态与当前意图之间的潜在冲突。此外,我们引入了一个全面的数据集ActMemEval,用于评估代理在逻辑驱动场景中的推理能力,超越了现有记忆基准测试中事实检索的焦点。实验表明,ActMem在处理复杂的、依赖记忆的任务时显著优于基线,为更一致和可靠的智能助手铺平了道路。

英文摘要

Memory management is essential for LLM agents in long-term interactions. Current memory frameworks typically treat agents as passive "recorders" and retrieve information without understanding its deeper implications. They may fail in scenarios requiring reasoning and complex decision-making. To bridge this critical gap, we propose a novel actionable memory framework called ActMem that integrates memory retrieval with active causal reasoning. ActMem transforms unstructured dialogue history into a structured causal and semantic graph. By leveraging counterfactual reasoning and commonsense completion, it enables agents to deduce implicit constraints and resolve potential conflicts between past states and current intentions. Furthermore, we introduce a comprehensive dataset ActMemEval to evaluate agent reasoning capabilities in logic-driven scenarios, moving beyond the fact-retrieval focus of existing memory benchmarks. Experiments demonstrate that ActMem significantly outperforms baselines in handling complex, memory-dependent tasks, paving the way for more consistent and reliable intelligent assistants.

2602.19591 2026-06-18 cs.LG cs.AI 版本更新

Detecting High-Potential SMEs with Heterogeneous Graph Neural Networks

使用异构图神经网络检测高潜力中小企业

Yijiashun Qi, Hanzhe Guo, Yijiazhen Qi

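Affiliations: University of Michigan; The University of Hong Kong

AI summary: Proposes SME-HGT, a heterogeneous graph Transformer framework that builds a heterogeneous graph of companies, research topics, and government agencies from public data to predict whether SBIR Phase I awardees advance to Phase II, reaching an AUPRC of 0.621 and outperforming baseline models.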

Comments Accepted by ICIIS 2026

详情
AI中文摘要

中小企业占美国企业的99.9%,贡献44%的经济活动,但系统性地识别高潜力中小企业仍是一个开放挑战。我们提出了SME-HGT,一个异构图Transformer框架,仅使用公开数据预测哪些SBIR第一阶段获奖者将进入第二阶段资助。我们构建了一个异构图,包含32,268个公司节点、124个研究主题节点和13个政府机构节点,通过约99,000条边连接三种语义关系类型。SME-HGT在时间分割测试集上达到0.621±0.003的AUPRC,在五个随机种子上优于MLP基线(0.590±0.002)和R-GCN(0.608±0.013)。在筛选深度为100家公司时,SME-HGT达到89.6%的精确率,比随机选择提升2.14倍。我们的时间评估协议防止信息泄露,对公开数据的依赖确保了可重复性。这些结果表明,公司、研究主题和资助机构之间的关系结构为中小企业潜力评估提供了有意义的信号,对政策制定者和早期投资者具有启示意义。

英文摘要

Small and Medium Enterprises (SMEs) constitute 99.9% of U.S. businesses and generate 44% of economic activity, yet systematically identifying high-potential SMEs remains an open challenge. We introduce SME-HGT, a Heterogeneous Graph Transformer framework that predicts which SBIR Phase I awardees will advance to Phase II funding using exclusively public data. We construct a heterogeneous graph with 32,268 company nodes, 124 research topic nodes, and 13 government agency nodes connected by approximately 99,000 edges across three semantic relation types. SME-HGT achieves an AUPRC of 0.621 ± 0.003 on a temporally split test set, outperforming an MLP baseline (0.590 ± 0.002) and R-GCN (0.608 ± 0.013) across five random seeds. At a screening depth of 100 companies, SME-HGT attains 89.6% precision with a 2.14× lift over random selection. Our temporal evaluation protocol prevents information leakage, and our reliance on public data ensures reproducibility. These results demonstrate that relational structure among firms, research topics, and funding agencies provides meaningful signal for SME potential assessment, with implications for policymakers and early-stage investors.
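To make the screening metrics concrete, here is a small sketch of precision at depth k and lift over random selection. It assumes lift is defined as precision@k divided by the base rate of positives, which the abstract does not state explicitly. The function names and toy data are hypothetical. Under that assumption, the reported 89.6% precision and 2.14-fold lift would imply a positive base rate of roughly 42% in the test set.

```python
def precision_at_k(scores, labels, k):
    """Fraction of positives among the k highest-scored items."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


def lift_at_k(scores, labels, k):
    """Precision@k divided by the positive base rate (assumed definition of lift)."""
    base_rate = sum(labels) / len(labels)
    return precision_at_k(scores, labels, k) / base_rate


# Toy ranking of eight firms (illustrative data, not from the paper).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0]
p_at_3 = precision_at_k(scores, labels, 3)
lift_3 = lift_at_k(scores, labels, 3)

# Base rate implied by the reported figures, under the assumed definition.
implied_base_rate = 0.896 / 2.14
```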

2602.23092 2026-06-18 cs.AI 版本更新

Enhancing CVRP Solver through LLM-driven Automatic Heuristic Design

通过LLM驱动的自动启发式设计增强CVRP求解器

Zhuoliang Xie, Fei Liu, Zhenkun Wang, Qingfu Zhang

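Affiliations: Southern University of Science and Technology; City University of Hong Kong

AI summary: Proposes AILS-AHD, which combines an evolutionary search framework with large language models to dynamically generate and optimize ruin heuristics and adds an acceleration mechanism; it outperforms existing solvers on moderate and large-scale CVRP instances and sets new best-known solutions on 8 of 10 instances in the CVRPLib large-scale benchmark.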

详情
AI中文摘要

容量受限车辆路径问题(CVRP)是一个基本的组合优化挑战,专注于在车辆容量约束下优化车队运营。尽管在运筹学中得到了广泛研究,CVRP的NP-hard性质仍然带来显著的计算挑战,特别是对于大规模实例。本研究提出了AILS-AHD(自适应迭代局部搜索与自动启发式设计),一种利用大语言模型(LLMs)革新CVRP求解的新方法。我们的方法将进化搜索框架与LLMs集成,在AILS方法中动态生成和优化破坏启发式。此外,我们引入了一种基于LLM的加速机制以提高计算效率。针对最先进的求解器(包括AILS-II和HGS)的综合实验评估表明,AILS-AHD在中等和大规模实例上均表现出优越性能。值得注意的是,我们的方法在CVRPLib大规模基准的10个实例中为8个建立了新的最佳已知解,突显了LLM驱动的启发式设计在推进车辆路径优化领域的潜力。

英文摘要

The Capacitated Vehicle Routing Problem (CVRP), a fundamental combinatorial optimization challenge, focuses on optimizing fleet operations under vehicle capacity constraints. While extensively studied in operational research, the NP-hard nature of CVRP continues to pose significant computational challenges, particularly for large-scale instances. This study presents AILS-AHD (Adaptive Iterated Local Search with Automatic Heuristic Design), a novel approach that leverages Large Language Models (LLMs) to revolutionize CVRP solving. Our methodology integrates an evolutionary search framework with LLMs to dynamically generate and optimize ruin heuristics within the AILS method. Additionally, we introduce an LLM-based acceleration mechanism to enhance computational efficiency. Comprehensive experimental evaluations against state-of-the-art solvers, including AILS-II and HGS, demonstrate the superior performance of AILS-AHD across both moderate and large-scale instances. Notably, our approach establishes new best-known solutions for 8 out of 10 instances in the CVRPLib large-scale benchmark, underscoring the potential of LLM-driven heuristic design in advancing the field of vehicle routing optimization.

2505.03646 2026-06-18 cs.LG cs.AI cs.CV 版本更新

Revealing Hidden Vulnerabilities in Autoencoders through Gradient Signal Restoration

通过梯度信号恢复揭示自编码器中的隐藏漏洞

Chethan Krishnamurthy Ramanaik, Arjun Roy, Tobias Callies, Eirini Ntoutsi

Affiliations * University of the Bundeswehr Munich

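AI summary: Addresses the overestimation of robustness that arises when gradients vanish during adversarial attacks on autoencoders. Proposes the GRILL framework to restore the gradient signal, which significantly increases attack effectiveness and exposes hidden vulnerabilities.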

Details
Abstract

Adversarial robustness of deep autoencoders (AEs) has received less attention than that of discriminative models, although their compressed latent representations induce ill-conditioned mappings that can amplify small input perturbations and destabilize reconstructions. Existing white-box attacks for AEs, which optimize norm-bounded adversarial perturbations to maximize reconstruction damage, often converge to suboptimal perturbations, thereby potentially overstating AE robustness. We show that this limitation is linked to vanishing adversarial loss gradients during backpropagation through ill-conditioned layers, associated with near-zero singular values in their intermediate weight matrices. To address this, we propose GRILL (Gradient Signal Restoration in Ill-Conditioned Layers), a framework designed to mitigate gradient degradation and improve the reliability of adversarial robustness evaluation in encoder-decoder architectures. GRILL is designed to mitigate adversarial gradient degradation during optimization, enabling attacks to better approximate high-distortion perturbations under fixed norm constraints. Through extensive experiments across multiple AE architectures, under both sample-specific and universal attacks, as well as standard and adaptive attack settings, we show that GRILL significantly increases attack effectiveness, thereby exposing vulnerabilities hidden by existing attack limitations. Beyond AEs, we provide preliminary evidence that modern multimodal encoder-decoder architectures exhibit similar vulnerabilities.
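
The abstract links weak attacks to gradients that vanish when backpropagated through layers whose weight matrices have near-zero singular values. The pure-Python sketch below illustrates only that mechanism, using a stack of linear layers; it does not implement GRILL. The helper names and the diagonal toy weights (singular values 1 and 1e-3) are assumptions chosen for illustration.

```python
def matvec_T(W, g):
    """Return W^T g, the gradient of a linear layer y = W x with respect to x."""
    rows, cols = len(W), len(W[0])
    return [sum(W[i][j] * g[i] for i in range(rows)) for j in range(cols)]

def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return sum(x * x for x in v) ** 0.5

def backprop_norms(layers, grad_out):
    """Gradient norm at the output and after backpropagating through each layer."""
    norms = [norm(grad_out)]
    g = grad_out
    for W in reversed(layers):
        g = matvec_T(W, g)
        norms.append(norm(g))
    return norms

# Toy layer (assumed): diagonal weights, so the singular values are 1 and 1e-3.
W = [[1.0, 0.0],
     [0.0, 1e-3]]
layers = [W, W, W]

# A loss gradient aligned with the weak singular direction shrinks at every layer.
print(backprop_norms(layers, [0.0, 1.0]))
# A loss gradient aligned with the strong direction passes through unchanged.
print(backprop_norms(layers, [1.0, 0.0]))
```

In the first case the norm drops by a factor of about 1e-3 per layer, so after three layers the attacker's update signal is close to zero even though the input direction may still matter for the reconstruction.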