arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.04743 2026-06-04 cs.CL cs.AI cs.LG

TIDE: Proactive Multi-Problem Discovery via Template-Guided Iteration

Soyeong Jeong, Jinheon Baek, Minki Kang, Sung Ju Hwang

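AI summary: Proposes TIDE, a template-guided iterative framework that proactively discovers multiple hidden problems in a user's context and pairs each with concrete actions; in two settings, personal workspaces and software repositories, it substantially improves task coverage, problem identification, and resolution.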

Abstract

Agents are widely deployed as assistants over documents, tools, and code. However, they typically act only on explicit user requests, which surface only the problems the user has noticed, while many other important problems coexist, hidden in plain sight, within the broader user context, with their total number unknown in advance. We frame this as the task of discovering multiple hidden problems from context, in which coexisting problems should be uncovered, grounded in supporting evidence, and paired with concrete actions. To this end, we introduce TIDE, a template-guided iterative framework with two complementary mechanisms. Specifically, motivated by the observation that single-pass prediction anchors on the most salient cases and yields generic claims, we propose iterative discovery, which surfaces a small batch of candidates per round while conditioning on what has already been found, so subsequent rounds extend coverage; and thought templates, reusable schemas distilled from previously solved cases that specify what contextual signals to attend to and how to connect them, anchoring each prediction in a recognizable problem class. We validate TIDE on two realistic settings, personal workspaces and software repositories, across four model backbones, showing substantial gains over single-shot and parallel multi-agent baselines on task coverage, identification, and resolution.

2606.04739 2026-06-04 cs.SE cs.AI

Revisiting Vul-RAG: Reproducibility and Replicability of RAG-based Vulnerability Detection with Open-Weight Models

Sabrina Kaniewski, Fabian Schmidt, Tobias Heer

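AI summary: Reproduces and extends the Vul-RAG framework using local deployment and a range of open-weight models, finding a performance ceiling at about 0.30 pairwise accuracy that gains in model capability do not substantially lift.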

Comments
Accepted at AI&CCPS 2026 workshop, co-located with the 21st International Conference on Availability, Reliability and Security (ARES 2026). This is the authors' preprint version

Abstract

Large language models (LLMs) have shown strong potential for automated software vulnerability detection, particularly in retrieval-augmented generation (RAG) settings. However, for approaches relying on proprietary models and APIs, reproducibility and replicability remain largely unexplored, raising the question of whether reported results generalize or depend primarily on specific model choices. In this work, we present a reproducibility study of Vul-RAG, a RAG-based framework for source code vulnerability detection that enhances LLMs with high-level vulnerability knowledge. We first replicate the results in a fully local and open-weights setting using the reported open-weight baseline models. We then extend the evaluation to a diverse set of recent open-weight LLMs, including code-specialized, general-purpose, and reasoning models of varying parameter sizes. The results confirm that the findings of Vul-RAG are reproducible under local deployment, but with minor deviations. Across all evaluated models, we observe a performance plateau at approximately 0.30 pairwise accuracy (code pairs for which both the vulnerable and the patched function are correctly classified). Notably, this plateau persists even for more recent and advanced models, indicating that improvements in model capacity alone do not substantially enhance performance. Finally, we discuss practical implications and trade-offs between detection effectiveness, model capabilities, and model scale. Implementation and evaluation artifacts are publicly available at https://github.com/hs-esslingen-it-security/revisiting-Vul-RAG.
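
The pairwise accuracy metric described in this abstract can be illustrated with a minimal sketch. It assumes each code pair is given as two boolean predictions (hypothetical input format; the function name and data below are illustrative and not taken from the Vul-RAG artifacts).

```python
from typing import Sequence, Tuple


def pairwise_accuracy(pairs: Sequence[Tuple[bool, bool]]) -> float:
    """Fraction of code pairs where both functions are classified correctly.

    Each pair is (prediction for the vulnerable function,
    prediction for the patched function); True means "predicted vulnerable".
    A pair counts only if the vulnerable function is flagged AND the
    patched function is not.
    """
    if not pairs:
        raise ValueError("need at least one code pair")
    correct = sum(1 for vuln_pred, patched_pred in pairs
                  if vuln_pred and not patched_pred)
    return correct / len(pairs)


# Illustrative predictions only, not results from the study.
predictions = [
    (True, False),   # both functions classified correctly
    (True, True),    # patched function wrongly flagged
    (False, False),  # vulnerability missed
    (True, True),    # patched function wrongly flagged
]
print(pairwise_accuracy(predictions))  # 0.25
```

Under this definition, a detector that flags every function as vulnerable scores 0, which is why the metric is stricter than per-function accuracy.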

2606.04737 2026-06-04 cs.CV

Physics-Informed Video Generation via Mixture-of-Experts Latent Alignment

Cong Wang, Hanxin Zhu, Jiayi Luo, Yonglin Tian, Xiaoqian Cheng, Peiyan Tu, Xin Jin, Long Chen, Zhibo Chen

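AI summary: Proposes PILA, a framework that uses mixture-of-experts latent alignment to inject physics-structured latent guidance into the frozen flow-matching dynamics of pretrained video models, improving the physical plausibility of generated videos.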

Abstract

Large-scale video generation models have made remarkable progress in semantic consistency and visual quality, producing videos that are increasingly coherent and visually convincing. Nevertheless, the dynamics induced by pixel-level fitting do not naturally accommodate the regularities that govern real-world motion and interaction, resulting in persistent shortcomings in physical plausibility. To address this limitation, we propose PILA (Physics-Informed Latent Alignment), a framework that injects physics-structured latent guidance into the frozen flow-matching dynamics of pretrained video models. Specifically, PILA first employs anchored field estimation to map frozen-generator latents into an operational physical attribute bank organized by field-proxy slots, using observable motion as a kinematic anchor for constructing less directly observed proxies. To handle the heterogeneity of real-world dynamics, PILA adopts a mixture-of-experts design over physical categories. Label-prior masked expert routing selects category-specific operator experts, whose refinements are regularized by operational residuals abstracted from physical relations. Finally, the refined proxies are fused into the physical attribute bank and decoded into a correction to the flow-matching vector field, injecting physics-aware guidance while preserving the visual prior of the pretrained backbone. With staged adapter training on Wan 2.1-1.3B and direct transfer of the learned adapter to Wan 2.2-14B, PILA achieves state-of-the-art results on VBench-2.0, VideoPhy-2, and PhyGenBench in both visual quality and benchmark-measured physical plausibility.

2606.04736 2026-06-04 cs.LG cs.AI

Curvature-aware dynamic precision approach for physics-informed neural networks

Yingjie Shao, Ioannis N. Athanasiadis, George van Voorn, Taniya Kapoor

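AI summary: Proposes a curvature-aware precision controller that uses curvature information from the L-BFGS optimizer to adjust numerical precision dynamically, reducing the computational cost of double-precision training while retaining prediction accuracy.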

Abstract

Physics-informed neural networks (PINNs) have become a promising framework for simulating partial differential equations (PDEs) by embedding physical laws directly into neural network training. However, recent studies show that PINN optimisation is sensitive to numerical precision. Existing implementations commonly use either single precision (FP32), which is computationally efficient but prone to failure modes, or double precision (FP64), which is robust but substantially more expensive. This creates a trade-off between computational efficiency and numerical accuracy. To reduce the computational cost of double-precision training while retaining prediction accuracy, we propose a curvature-aware precision controller that adapts numerical precision during training rather than treating it as a fixed implementation choice. The proposed method reuses curvature information derived from the limited-memory BFGS (L-BFGS) optimiser to construct a precision controller, retaining FP32 when lower precision is sufficient and promoting computation to FP64 when the training dynamics indicate numerical sensitivity or precision-limited stagnation. We evaluate the proposed approach on four canonical PINN failure-mode benchmarks and an irradiance-driven ordinary differential equation example. We further test the proposed approach across different neural network architectures. The method consistently matches or even slightly exceeds full FP64 solution accuracy while reducing training time relative to full double-precision training on all benchmark equations. The obtained results indicate that precision sensitivity in PINN optimisation is phase-dependent, and that selectively applying higher precision only during numerically critical stages can lower computational cost without sacrificing predictive accuracy.

2606.04735 2026-06-04 cs.LG cs.AI

Trace-Mediated Peak Bias: Bridging Temporal Credit Assignment and Cognitive Heuristics in Deep Reinforcement Learning

Viktor Veselý, Aleksandar Todorov, Erwan Escudie, Matthia Sabatelli

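AI summary: Identifies Trace-Mediated Peak Bias (TMPB) in deep reinforcement learning, presents it as a mechanistic basis for the Peak-End Rule, and shows that adaptive optimizers mitigate the bias through second-moment normalization.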

Abstract

Temporal credit assignment is central to both biological and artificial intelligence, yet its interaction with non-linear function approximation is poorly understood. We identify a systematic failure mode in deep reinforcement learning (RL) termed Trace-Mediated Peak Bias (TMPB). At intermediate eligibility trace depths, agents irrationally prefer trajectories with high-magnitude reward "peaks" over alternatives with higher cumulative returns. This provides a mechanistic account of the Peak-End Rule: a human memory bias where experiences are judged by their most intense moments rather than integrated utility. We show that TMPB emerges because traces amplify distal Temporal Difference errors into "gradient shocks" that fixed-step-size Stochastic Gradient Descent cannot normalize, leading to global overestimation. Conversely, adaptive optimizers mitigate this pathology via second-moment normalization. Our results suggest that human-like saliency distortions may emerge naturally from the mathematical constraints of credit assignment in distributed systems, and that adaptive optimization is a theoretical necessity for rational value estimation.
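
The contrast between fixed-step SGD and second-moment normalization can be seen in a toy scalar example. The sketch below does not reproduce eligibility traces or the TMPB experiments; it only assumes a hypothetical gradient stream with one large spike and compares the resulting step sizes under plain SGD and under the standard Adam update rule. The gradient values and hyperparameters are illustrative choices.

```python
import math
from typing import List


def sgd_steps(grads: List[float], lr: float = 0.01) -> List[float]:
    """Fixed-step SGD: the update is directly proportional to the gradient."""
    return [lr * g for g in grads]


def adam_steps(grads: List[float], lr: float = 0.01, beta1: float = 0.9,
               beta2: float = 0.999, eps: float = 1e-8) -> List[float]:
    """Adam update sizes: first moment divided by root of second moment."""
    m = v = 0.0
    steps = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g        # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        steps.append(lr * m_hat / (math.sqrt(v_hat) + eps))
    return steps


# Hypothetical stream: 50 ordinary gradients, then one 1000x "shock".
grads = [0.1] * 50 + [100.0]
sgd = sgd_steps(grads)
adam = adam_steps(grads)

sgd_ratio = sgd[-1] / sgd[0]     # shock step relative to an ordinary step
adam_ratio = adam[-1] / adam[0]
print(f"SGD shock/ordinary step ratio:  {sgd_ratio:.1f}")
print(f"Adam shock/ordinary step ratio: {adam_ratio:.3f}")
```

In this toy setting the SGD step grows in proportion to the spike, while the Adam step stays on the order of the learning rate because the spike also inflates the denominator.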

2606.04733 2026-06-04 cs.LG cs.NI

Contrastive Learning and Correlation Clustering for Sequences of Network Telescope Data

Jannik Presberger, Alexander Männel, Maynard Koch, Thomas C. Schmidt, Matthias Wählisch, Bjoern Andres

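AI summary: Proposes a transformer model trained with contrastive learning, without pretraining or annotations, to estimate semantic relationships between sequences of network flow records, and groups scanner behavior without supervision via correlation clustering.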

Comments
Code: https://github.com/JannikPresberger/Contrastive_Learning_and_Correlation_Clustering_for_Sequences_of_Network_Telescope_Data

Abstract

Understanding activities of Internet scanners is challenging; it often requires identifying relationships between sources, a task for which semantic annotations are scarce. This work investigates whether semantically meaningful pairwise relationships between sequences of network flow records can be estimated by contrastive learning, without pretraining and without annotations. To this end, we propose a transformer model that embeds minimally preprocessed sequences of network flow records and train it using contrastive learning. With the similarities obtained from this model, we state a correlation clustering problem and solve it locally. Experimentally, we show: Learned similarities are higher on average for sequences originating from the same source than for sequences originating from different sources, and this property generalizes to unseen sequences of unseen sources. Moreover, correlation clustering yields clusters consistent with scanner labels. The complete source code of the algorithms and for reproducing the experiments is publicly available.

2606.04730 2026-06-04 cs.CL eess.AS

Multilingual Long-Form Speech Instruction Following: KIT's Submission to IWSLT 2026

Enes Yavuz Ugan, Maike Züfle, Yuka Ko, Supriti Sinhamahapatra, Fabian Retkowski, Seymanur Akti, Jan Niehues, Alexander Waibel

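AI summary: Proposes a general data augmentation pipeline that converts short-form speech corpora into long-form training data through segment concatenation, LLM-based label generation, and cross-lingual translation, and combines likelihood with Minimum Bayes Risk decoding to address the degradation of semantic tasks on long-form speech.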

Comments
9 pages main paper, IWSLT 2026 Instruction Following track

Abstract

With the advent of Large Language Models, single-task and token-based multi-task models have evolved into instruction-based systems that infer task and target language implicitly from natural language prompts. This trend is reflected in IWSLT's Instruction Following Track, which this year introduced new tasks including an unknown surprise task, posing a genuine challenge against overfitting to known tasks. We present KIT's submission to the Long and Short Instruction Following tracks in the unconstrained setting. Our approach combines a general data augmentation pipeline that converts short-form corpora into long-form training data through segment concatenation, LLM-based label generation, and cross-lingual translation, yielding over 1M instances across six tasks and four languages. We further show that likelihood-based re-ranking, while highly effective for ASR, systematically degrades semantic tasks by spuriously selecting candidates generated from segmented audio processing rather than holistic long-form inference, a failure mode resolved by combining likelihood with Minimum Bayes Risk decoding.
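
Minimum Bayes Risk decoding, mentioned at the end of this abstract, picks the candidate that agrees most with the other candidates rather than the one with the highest likelihood. The sketch below shows only this generic selection step. The token-overlap F1 utility and the example strings are assumptions for illustration; the abstract does not specify the utility function or how the submission combines MBR with likelihood.

```python
from collections import Counter
from typing import Callable, List


def token_f1(hyp: str, ref: str) -> float:
    """Token-overlap F1 between two strings (illustrative utility function)."""
    h, r = hyp.split(), ref.split()
    if not h or not r:
        return 0.0
    overlap = sum((Counter(h) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(h), overlap / len(r)
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates: List[str],
               utility: Callable[[str, str], float] = token_f1) -> str:
    """Return the candidate with the highest mean utility against the others."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]

    def expected_utility(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


# Hypothetical candidate outputs for one utterance.
candidates = [
    "the talk is about speech translation",
    "the talk covers speech translation",
    "talk about speech",
    "unrelated output text",
]
print(mbr_select(candidates))
```

Because the score depends on agreement among candidates, an outlier hypothesis is not selected merely because the model assigned it a high likelihood.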

2606.04722 2026-06-04 cs.CV

StrokeTimer: Robust Representation Learning for Ischemic Stroke Onset-Time Estimation from Non-contrast CT

Weiru Wang, Susanne G. H. Olthuis, Elizaveta Lavrova, Robert J. van Oostenbrugge, Charles B. L. M. Majoie, Wim H. van Zwam, Ruisheng Su

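AI summary: Proposes StrokeTimer, a framework that estimates ischemic stroke onset time from non-contrast CT using self-supervised disentanglement learning and energy-guided contrastive learning, reaching a macro AUC of 0.69 and a macro F1 of 0.57 on a large multi-center dataset, nearly 50% above the baseline.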

Comments
Early accepted at MICCAI 2026

Abstract

Ischemic stroke is a major global disease. Treatment decisions are highly time-sensitive, as eligibility for reperfusion therapies relies on the interval between stroke onset and intervention. However, the true onset time is often uncertain in clinical practice, necessitating imaging-based assessment of tissue age as a surrogate marker. Early ischemic changes on routinely acquired non-contrast CT (NCCT) are often subtle, and real-world clinical datasets exhibit pronounced onset-time class imbalance and center-scanner-related heterogeneity. In this work, we propose StrokeTimer, a fully automated framework for onset-time estimation in acute ischemic stroke. StrokeTimer integrates self-supervised disentanglement learning with energy-guided contrastive learning to capture subtle ischemic patterns while addressing long-tailed data distributions under acquisition variability. Onset time is categorized into three clinically relevant windows: <4.5 h, 4.5-6 h, and >6 h. Experimental results on a large multi-center NCCT dataset from two national cohorts, MR CLEAN Registry and MR CLEAN LATE, show that StrokeTimer achieves a macro AUC of 0.69 and a macro F1-score of 0.57, improving the strongest baseline by nearly 50% (p < 0.005). In this realistic, challenging setting, representative baseline approaches exhibit near-chance macro performance. Model explanations further highlight subtle gray-white matter blurring and hypodense regions consistent with established radiological biomarkers. These findings demonstrate the potential of StrokeTimer to support treatment decision-making in acute ischemic stroke. Code is available at https://github.com/BrainVas/StrokeTimer.
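
The abstract reports macro-averaged scores over three imbalanced onset-time windows. A small sketch may help show why macro F1 is informative under class imbalance: a classifier that always predicts the majority window can reach high accuracy while its macro F1 stays low. The labels and counts below are invented for illustration and are not data from the study.

```python
from typing import List, Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str],
             labels: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    scores: List[float] = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never present scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


labels = ["<4.5h", "4.5-6h", ">6h"]
# Hypothetical imbalanced sample: 8 early cases, 1 intermediate, 1 late.
y_true = ["<4.5h"] * 8 + ["4.5-6h"] + [">6h"]
y_pred = ["<4.5h"] * 10  # majority-class predictor

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred, labels)
print(accuracy)          # 0.8
print(round(score, 3))   # 0.296
```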

2606.04719 2026-06-04 cs.CL

Query-based Cross-Modal Projector Bolstering Mamba Multimodal LLM

SooHwan Eom, Jay Shim, Gwanhyeong Koo, Haebin Na, Mark A. Hasegawa-Johnson, Sungwoong Kim, Chang D. Yoo

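AI summary: Proposes a query-based cross-modal projector that compresses visual tokens through cross-attention, removes the need to hand-design a 2D scan order, and improves the performance and throughput of Mamba-based multimodal LLMs.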

Comments
Accepted to EMNLP 2024 Findings

Abstract

The Transformer's quadratic complexity with input length imposes an unsustainable computational load on large language models (LLMs). In contrast, the Selective Scan Structured State-Space Model, or Mamba, addresses this computational challenge effectively. This paper explores a query-based cross-modal projector designed to bolster Mamba's efficiency for vision-language modeling by compressing visual tokens based on input through the cross-attention mechanism. This innovative projector also removes the need for manually designing the 2D scan order of original image features when converting them into an input sequence for Mamba LLM. Experimental results across various vision-language understanding benchmarks show that the proposed cross-modal projector enhances Mamba-based multimodal LLMs, boosting both performance and throughput.

2606.04710 2026-06-04 cs.CV

Data Efficient Complex Feature Fusion Network For Hyperspectral Image Classification

Maitreya Shelare, Atharva Satam, Poonam Sonar, Sneha Burnase

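AI summary: Proposes DE-CFFN, a data-efficient attention-based dual-branch complex feature fusion network that reduces model complexity through Factor Analysis dimensionality reduction and halved filter counts in the 3D convolutional layers, while keeping classification performance comparable to CFFN.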

Journal ref
In Proceedings of International Conference on Wireless Communication (ICWiCOM 2025), Lecture Notes in Electrical Engineering, vol. 1499, Springer, 2025
Comments
10 pages, 3 figures

Abstract

This work presents a data-efficient variant of the Attention-Based Dual-Branch Complex Feature Fusion Network (CFFN) for hyperspectral image classification. The proposed model, termed DE-CFFN, retains the original two-stream structure: the Real-Valued Neural Network (RVNN) processes standard hyperspectral patches, while the Complex-Valued Neural Network (CVNN) handles their Fourier-transformed counterparts. The main contribution of this work lies in the feature extraction process and architectural enhancement. Factor Analysis is used for dimensionality reduction, offering improved latent feature representation over Principal Component Analysis. Additionally, both the RVNN and CVNN streams are structurally modified by successively halving the number of filters in the 3D convolutional layers to reduce complexity. The outputs of both branches are concatenated and passed through a Squeeze and Excitation (SE) block to enhance joint feature representation. Evaluated on the Pavia University and Salinas datasets, DE-CFFN achieves classification performance comparable to CFFN, while significantly reducing model size, memory consumption, and inference latency, making it suitable for real-time hyperspectral imaging applications.

2606.04706 2026-06-04 cs.CV

ReConFuse: Reconstruction-Error Guided Semantic Fusion for AI-Generated Video Detection

Xiaojing Chen, Xinyu Lu, Changtao Miao, Yunfeng Diao

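AI summary: Proposes ReConFuse, a framework that uses the reconstruction error of a pretrained WF-VAE as a discriminative cue, combined with multi-frame semantic features and Mamba-based temporal modeling, for robust detection of AI-generated video.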

Abstract

AI-generated videos are becoming increasingly realistic, raising serious concerns about misinformation, content authenticity, and media trust. Reliable AI-generated video detection is therefore essential for multimedia forensics, yet remains challenging due to the need to capture spatial artifacts and temporal dynamics and to generalize to evolving generative models. In this paper, we explore reconstruction error as a discriminative forensic cue for AI-generated video detection. By reconstructing input videos with a pretrained WF-VAE, we observe that real and generated videos exhibit distinguishable frame-wise reconstruction error patterns, suggesting that reconstruction errors can reveal their distributional discrepancies. However, extending reconstruction-based image detection to videos is non-trivial, since video reconstruction errors are temporally organized across frames and require semantic context for effective interpretation. To address these challenges, we propose ReConFuse, a reconstruction-guided semantic fusion framework for video-level AI-generated video detection. ReConFuse extracts reconstruction error cues from WF-VAE reconstructed videos, aligns them with multi-frame semantic features, and uses a Mamba-based module to model temporal evolution for video-level classification. Experiments across multiple generators and evaluation settings demonstrate the effectiveness and strong generalization ability of ReConFuse.

2606.04705 2026-06-04 cs.CV cs.AI

Enhancing MedSAM with a Lightweight Box Predictor for Medical Image Segmentation

Amirhossein Movahedisefat, Amirreza Fateh, Mohammad Reza Mohammadi

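AI summary: Proposes an enhanced MedSAM framework with an integrated lightweight Box Predictor that estimates a bounding box from a single click to strengthen the spatial guidance of point prompts, significantly improving the accuracy and robustness of multi-modality medical image segmentation with only 1.6M additional parameters.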

Abstract

Semantic segmentation in medical imaging is a critical yet challenging task due to data scarcity and high variability across modalities. While foundation models like the Segment Anything Model (SAM) show promise, they often struggle with medical images without specific adaptation. Moreover, point prompts, despite being the most natural form of user interaction, provide insufficient spatial context for reliable segmentation, particularly when target structures are irregular or poorly contrasted. In this paper, we propose an enhanced segmentation framework that integrates a lightweight Box Predictor module into the MedSAM architecture. The Box Predictor estimates an approximate bounding box from a single user click using localized image embedding features, providing spatial guidance that reduces the ambiguity of point prompts, while introducing only 1.6M additional parameters and negligible inference overhead. We introduce a two-stage training pipeline where the Box Predictor is trained independently before being integrated into MedSAM. To validate the generalization capability of our method, we conduct extensive evaluations on four diverse datasets (FLARE22, BRISC, BUSI, LungSegDB) spanning distinct imaging modalities, including CT, MRI, and Ultrasound. Our method improves segmentation accuracy and robustness across varied anatomical structures and imaging domains, achieving Dice scores of 0.89 (BUSI), 0.93 (FLARE22), 0.88 (BRISC), and 0.98 (LungSegDB). Code is available at https://github.com/Amirhosseinmovahedi/MedSAM-BoxPredictor

2606.04703 2026-06-04 cs.CL cs.LG

Rethinking Continual Experience Internalization for Self-Evolving LLM Agents

Jingwen Chen, Wenkai Yang, Shengda Fan, Wenbo Nie, Chenxing Sun, Shaodong Zheng, Yangen Hu, Lu Pan, Ke Zeng, Yankai Lin

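AI summary: Examines experience internalization along three dimensions (experience granularity, injection pattern, and internalization regime) and proposes a stable, sustainable internalization recipe that addresses capability collapse in multi-iteration experience learning.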

Comments
10 pages, 8 figures

Abstract

Experience internalization converts contextual experience from past interactions into reusable parametric capability, offering a promising path toward continual learning in large language models (LLMs). While prior work has predominantly focused on single-iteration transfer, we discover that under multi-iteration experience learning, existing methods suffer from a progressive capability collapse rather than compounding improvement. We systematically examine this failure through three vital dimensions of experience internalization: (1) Experience Granularity: We find that principle-level experience is more durable than instance-level experience, as it effectively abstracts transferable strategies away from trajectory-specific details. (2) Experience Injection Pattern: Our analysis reveals that step-wise injection significantly outperforms global injection by aligning experience with intermediate decision states, a property that is critical for long-horizon tool use. (3) Internalization Regime: We demonstrate that off-policy context-distillation on high-quality teacher trajectories provides a substantially more stable training signal than on-policy context-distillation, which is inherently limited by local corrections on student-induced flawed states. Together, these insights yield a simple yet robust recipe for stable and sustainable experience internalization, providing concrete guidance for engineering self-evolving and continually learning LLMs.

2606.04701 2026-06-04 cs.CV cs.CL

Benchmarking Living-Screen-Native GUI Agents on Short-Video Platforms

Jiashu Yao, Heyan Huang, Daiqing Wu, Wangke Chen, Huaxi Ai, Haoyu Wen, Zeming Liu, Yuhang Guo

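AI summary: For living-screen environments such as short-video platforms, proposes the LivingScreen benchmark, with a three-tier task suite and metrics that jointly score accuracy and information efficiency, and finds that existing GUI agents over-observe or under-observe.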

Comments
preprint

Abstract

GUI agents today assume a static screen, where the world is frozen between two actions. However, real interfaces such as short-video applications violate this assumption, as their content keeps playing, and a competent user must decide what to watch and for how long. We formalize this task as Living-Screen-Native GUI agents and introduce LivingScreen, the first benchmark instantiating it on short-video platforms, with a faithful browser-based environment, a three-tier task suite, and metrics that jointly score accuracy and information efficiency. Evaluating a broad set of frontier models, we find that none reaches the human cost-accuracy performance, and that their dominant failure mode is over- and under-observation, pointing to observation control as a missing capability axis for future GUI agents. All data and code will be available at https://github.com/BITHLP/LivingScreen.

2606.04700 2026-06-04 cs.CV

A New Angle on Bones: Robust Pose Estimation in X-Ray and Ultrasound

Ron Keuth, Christoph Großbröhmer, Franziska Halm, Miriam Johann, Anne-Nele Schröder, Ludger Tüshaus, Mattias P. Heinrich, Lasse Hansen

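AI summary: Proposes an automatic bone pose estimation method based on learned keypoint candidates and robust line models (RANSAC, Hough transform), which reaches clinically acceptable error and outperforms landmark-based methods in paediatric fracture and hip dysplasia assessment.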

Comments
Code and annotations for fracture angle assessment in radiographs: https://github.com/multimodallearning/RobustBonePoseEstimation

Abstract

Measuring the angle between bone structures is a routine task in medical image analysis and provides a key quantitative parameter for diagnosis and treatment planning. Automated methods can reduce time and cost while improving reproducibility. In this work, we address automatic bone pose estimation using a learning-based point candidate proposal followed by a line model to extract axis parameters. Since conventional line models such as least squares are sensitive to outliers, we incorporate false-positive reduction strategies and robust fitting techniques, such as RANSAC and Hough transforms, to improve robustness. We evaluate our method on three clinically relevant paediatric angle estimation tasks: fracture fragment assessment in radiographs and ultrasound and developmental dysplasia of the hip evaluation in ultrasound using the Graf method. Our approach achieves mean errors of 4.1°, 5.4°, and 5.51°, respectively, not only remaining within the expected clinical observer variability, but also significantly outperforming landmark-based methods. Our code and annotations for fracture angle assessment in radiographs are publicly available on GitHub.
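
The idea of fitting a robust line to point candidates and then measuring the angle between two axes can be sketched with a basic RANSAC loop. This is a generic illustration under stated assumptions, not the authors' pipeline: the candidate points are synthetic, the threshold and iteration count are arbitrary, and all function names are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def fit_direction(points: Sequence[Point]) -> float:
    """Orientation (radians) of the total-least-squares line through points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points)
    syy = sum((p[1] - my) ** 2 for p in points)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points)
    return 0.5 * math.atan2(2 * sxy, sxx - syy)  # principal-axis angle


def ransac_inliers(points: Sequence[Point], n_iters: int = 200,
                   threshold: float = 1.0, seed: int = 0) -> List[Point]:
    """Largest set of points within `threshold` of a line through two samples."""
    rng = random.Random(seed)
    pts = list(points)
    best: List[Point] = []
    for _ in range(n_iters):
        (x1, y1), (x2, y2) = rng.sample(pts, 2)
        dx, dy = x2 - x1, y2 - y1
        norm = math.hypot(dx, dy)
        if norm == 0:
            continue  # degenerate sample, skip
        inliers = [(x, y) for x, y in pts
                   if abs(dy * (x - x1) - dx * (y - y1)) / norm <= threshold]
        if len(inliers) > len(best):
            best = inliers
    return best


def axis_angle_deg(theta_a: float, theta_b: float) -> float:
    """Acute angle in degrees between two undirected axes."""
    diff = abs(theta_a - theta_b) % math.pi
    return math.degrees(min(diff, math.pi - diff))


def line_points(angle_deg: float, n: int = 20) -> List[Point]:
    """Synthetic candidates lying exactly on a line through the origin."""
    c, s = math.cos(math.radians(angle_deg)), math.sin(math.radians(angle_deg))
    return [(t * c, t * s) for t in range(n)]


# Two synthetic "bone axes" 30 degrees apart, each with three false positives.
outliers = [(5.0, 40.0), (10.0, -35.0), (15.0, 50.0)]
axis_a = line_points(0.0) + outliers
axis_b = line_points(30.0) + outliers

angle = axis_angle_deg(fit_direction(ransac_inliers(axis_a)),
                       fit_direction(ransac_inliers(axis_b)))
naive = axis_angle_deg(fit_direction(axis_a), fit_direction(axis_b))
print(round(angle, 2))  # 30.0
print(naive)            # fit on all points, pulled away by the outliers
```

Refitting on the consensus set after RANSAC is a common design choice: the sampling step rejects false positives, and the final least-squares fit uses all remaining points for a stable axis estimate.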

2606.04699 2026-06-04 cs.LG cs.AI cs.CV

Graph-Guided Universum Learning in Generalized Eigenvalue Proximal SVMs for Alzheimer's Disease Classification

基于图引导的广义特征值近端支持向量机中的Universum学习用于阿尔茨海默病分类

Yogesh Kumar, Vrushank Ahire, Mudasir Ganaie

AI总结 针对阿尔茨海默病分类,提出两种图引导的Universum学习模型UG-GEPSVM和IUG-GEPSVM,利用轻度认知障碍样本构建图拉普拉斯正则化,替代传统独立惩罚项,在ADNI MRI数据集上取得更优性能。

详情
AI中文摘要

早期准确检测阿尔茨海默病(AD)对于及时干预和疾病管理至关重要。广义特征值近端支持向量机(GEPSVM)及其基于Universum的变体在AD分类中显示出有希望的结果。然而,现有方法将Universum样本视为独立点,未考虑它们之间的几何关系。本文提出了两种图引导的Universum学习模型,即UG-GEPSVM和IUG-GEPSVM,用于使用结构MRI数据进行AD与认知正常(CN)分类。在所提出的框架中,轻度认知障碍(MCI)受试者被用作Universum数据,以提供AD和CN类别之间的中间信息。使用高斯相似性、最小生成树连通性和多跳传播在Universum样本上构建图。从该图中导出拉普拉斯矩阵,捕获MCI样本的几何结构。这种基于拉普拉斯的正则化被纳入学习过程,以替代传统的独立Universum惩罚项。UG-GEPSVM将此正则化集成到广义特征值公式中,而IUG-GEPSVM使用标准特征值公式扩展了数值稳定的改进GEPSVM框架。在ADNI MRI数据集变体上使用ICA和PCA特征在五个不同噪声水平下的实验表明,两种提出的模型始终优于现有的GEPSVM和基于Universum的方法。UG-GEPSVM实现了88.07%的最高平均AUC,并在增加的噪声水平下保持稳定的性能。统计检验进一步证实了观察到的改进的显著性。

英文摘要

Early and accurate detection of Alzheimer's disease (AD) is important for timely intervention and disease management. Generalized Eigenvalue Proximal Support Vector Machine (GEPSVM) and its Universum-based variants have shown promising results for AD classification. However, existing methods treat Universum samples as independent points and do not consider the geometric relationships among them. This paper proposes two graph-guided Universum learning models, namely UG-GEPSVM and IUG-GEPSVM, for AD versus cognitively normal (CN) classification using structural MRI data. In the proposed framework, mild cognitive impairment (MCI) subjects are used as Universum data to provide intermediate information between AD and CN classes. A graph is constructed over the Universum samples using Gaussian similarity, Minimum Spanning Tree connectivity, and multi-hop propagation. From this graph, a Laplacian matrix is derived that captures the geometric structure of the MCI samples. This Laplacian-based regularization is incorporated into the learning process in place of the conventional independent Universum penalty term. UG-GEPSVM integrates this regularization into the generalized eigenvalue formulation, while IUG-GEPSVM extends the numerically stable improved GEPSVM framework using a standard eigenvalue formulation. Experiments on ADNI MRI dataset variants using ICA- and PCA-based features at five different noise levels show that both proposed models consistently outperform existing GEPSVM and Universum-based methods. UG-GEPSVM achieves the highest average AUC of 88.07% and maintains stable performance under increasing noise levels. Statistical tests further confirm the significance of the observed improvements.
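The graph construction described above can be sketched as follows. This is a simplified illustration, not the paper's procedure: it keeps only minimum-spanning-tree edges, weights them by Gaussian similarity, and omits multi-hop propagation entirely. The function names and the `sigma` bandwidth are assumptions.

```python
import math

def gaussian_weight(a, b, sigma):
    """Gaussian similarity exp(-||a - b||^2 / (2 sigma^2))."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def mst_edges(points):
    """Prim's algorithm on Euclidean distances; returns n-1 edges (i, j)."""
    n = len(points)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Cheapest edge leaving the current tree.
        i, j = min(((a, b) for a in in_tree for b in range(n) if b not in in_tree),
                   key=lambda e: math.dist(points[e[0]], points[e[1]]))
        edges.append((i, j))
        in_tree.add(j)
    return edges

def graph_laplacian(points, sigma=1.0):
    """Unnormalized Laplacian L = D - W over the MST-connected samples."""
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i, j in mst_edges(points):
        W[i][j] = W[j][i] = gaussian_weight(points[i], points[j], sigma)
    L = [[-W[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        L[i][i] = sum(W[i])
    return L

def smoothness(L, u):
    """Quadratic form u^T L u, which penalizes differences across edges."""
    n = len(u)
    return sum(u[i] * L[i][j] * u[j] for i in range(n) for j in range(n))
```

The quadratic form `smoothness(L, u)` is zero for constant `u` and grows when connected samples receive different values, which is the sense in which a Laplacian regularizer couples the Universum points instead of penalizing each one independently.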

2606.04695 2026-06-04 cs.LG

Cone-Compatible Monge Geometry for High-Dimensional Ordered Optimal Transport

锥相容的Monge几何用于高维有序最优输运

Lei Luo, Hongliang Zhang, Jian Yang

AI总结 本文提出锥相容的Monge几何,通过闭凸锥诱导的偏序与输运成本兼容的条件,为高维有序数据提供闭式最优耦合。

详情
Comments
13 pages, 2 figures, including appendices
AI中文摘要

高维最优输运很少具有闭式解。一维情况是例外,因为实数线的顺序与凸输运成本兼容,使得单调重排最优。本文研究在更高维中如何从偏序恢复类似的Monge结构。我们引入锥相容的Monge几何:一个闭凸锥$K$诱导序$x\preceq_K y$当$y-x\in K$,并且如果有序对满足Monge交换不等式,则与成本兼容。对于平方马氏距离成本$c_M(x,y)=(x-y)^\top M(x-y)$,我们证明了一个尖锐的刻画:兼容性恰好当$K$在$M$-内积下是锐角锥,即对所有$u,v\in K$有$u^\top Mv\ge0$,等价于$K\subseteq K_M^*$。在此条件下,支撑在锥链上的测度允许分位数型的闭式最优耦合,在原始地面成本下(而非投影或度量替换后)得到精确输运。我们将由此产生的锥链Wasserstein度量(定义在规范有序的链分布上)与扩展的有向锥输运成本(定义在一般测度上)区分开来,并发展了可行性、对偶性、稳定性、逼近、高斯恢复、统计和计算方面的结果。该理论与切片和树Wasserstein距离互补:它不是通用的快速替代,而是为有序高维数据提供可解释、方向有效、原始空间单调输运的一种方法。

英文摘要

High-dimensional optimal transport is seldom available in closed form. The one-dimensional case is exceptional because the order of the real line is compatible with convex transport costs, making monotone rearrangement optimal. This paper studies when an analogous Monge structure can be recovered in higher dimensions from a partial order. We introduce a cone-compatible Monge geometry: a closed convex cone $K$ induces the order $x\preceq_K y$ whenever $y-x\in K$, and is compatible with a cost if ordered pairs satisfy a Monge exchange inequality. For squared Mahalanobis costs $c_M(x,y)=(x-y)^\top M(x-y)$, we prove a sharp characterization: compatibility holds exactly when $K$ is acute under the $M$-inner product, namely $u^\top Mv\ge0$ for all $u,v\in K$, equivalently $K\subseteq K_M^*$. Under this condition, measures supported on cone chains admit a quantile-type closed-form optimal coupling, yielding exact transport under the original ground cost rather than after projection or metric replacement. We distinguish the resulting cone-chain Wasserstein metric on canonically ordered chain distributions from an extended directed cone transport cost on general measures, and develop feasibility, duality, stability, approximation, Gaussian recovery, statistical, and computational results. The theory is complementary to sliced and tree Wasserstein distances: it is not a universal fast surrogate, but a way to obtain interpretable, direction-valid, original-space monotone transport for ordered high-dimensional data.
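The acuteness condition can be checked numerically when the cone is finitely generated, which is an extra assumption made only for this sketch. If every $u, v \in K$ is a nonnegative combination of generators $g_1, \dots, g_k$, then bilinearity gives $u^\top M v \ge 0$ on all of $K$ exactly when $g_i^\top M g_j \ge 0$ for every pair of generators. The helper names below are hypothetical.

```python
def m_inner(u, M, v):
    """Bilinear form u^T M v for plain Python lists."""
    return sum(u[i] * M[i][j] * v[j]
               for i in range(len(u)) for j in range(len(v)))

def is_acute_cone(generators, M, tol=1e-12):
    """True if the cone spanned by `generators` is acute under the M-inner product.

    Assumes a finitely generated cone, so checking generator pairs suffices.
    """
    return all(m_inner(g, M, h) >= -tol
               for g in generators for h in generators)
```

For example, the nonnegative orthant is acute under the identity matrix, but it stops being acute once $M$ has a negative off-diagonal entry coupling two coordinate directions.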

2606.04691 2026-06-04 cs.CL

SMADE-IE: Sparse Multi-Agent Framework with Evidence-Driven Debate for Zero-Shot Information Extraction

SMADE-IE: 基于证据驱动辩论的稀疏多智能体框架用于零样本信息抽取

Kenfeng Huang, Yi Cai, Xin Wu, Zikun Deng, Li Yuan

AI总结 提出SMADE-IE稀疏多智能体框架,通过自适应模式选择器和证据驱动辩论机制,在零样本信息抽取中减少冗余交互并提升性能。

详情
Comments
21 pages, 9 figures
AI中文摘要

基于大型语言模型的零样本信息抽取因其无需任务特定训练即可适应新模式和领域的灵活性而受到越来越多的关注。现有方法主要依赖于整体提示、逐类型提示或多智能体辩论。然而,整体提示常常遭受边界和类型错误,而逐类型提示和多智能体辩论引入了跨类型冲突、冗余智能体交互和大量令牌开销。为了解决这些挑战,我们提出了SMADE-IE,一种用于零样本信息抽取的稀疏且证据驱动的多智能体框架。SMADE-IE首先采用自适应模式选择器将输入动态路由到轻量级全局抽取模式或类型中心抽取模式,减少不必要的类型选择和推理噪声。对于冲突预测,我们进一步引入了证据驱动辩论机制,将论证结构化为图尔敏式组件,并通过外部证据评分和贝叶斯更新进行置信度聚合。在NER、RE和JERE任务的9个基准数据集上的实验结果表明,SMADE-IE在持续优于现有零样本信息抽取基线的同时,通过稀疏智能体选择和早期停止辩论提高了令牌效率。

英文摘要

Zero-shot information extraction (IE) with large language models (LLMs) has attracted increasing attention due to its flexibility in adapting to new schemas and domains without task-specific training. Existing approaches mainly rely on monolithic prompting, each-type prompting, or multi-agent debate. However, monolithic prompting often suffers from boundary and type errors, while each-type prompting and multi-agent debate introduce cross-type conflicts, redundant agent interactions, and substantial token overhead. To address these challenges, we propose SMADE-IE, a sparse and evidence-driven multi-agent framework for zero-shot IE. SMADE-IE first employs an Adaptive Mode Selector to dynamically route inputs into either a lightweight Global Extraction Mode or a Type-Centric Extraction Mode, reducing unnecessary type selection and reasoning noise. For conflicting predictions, we further introduce an Evidence-Driven Debate mechanism that structures arguments into Toulmin-style components and performs confidence aggregation through external evidence scoring and Bayesian updates. Experimental results on 9 benchmark datasets across NER, RE, and JERE tasks show that SMADE-IE consistently outperforms existing zero-shot IE baselines while also improving token efficiency through sparse agent selection and early-stopping debate.

2606.04689 2026-06-04 quant-ph cs.LG

QPredSGG: Hybrid Quantum Predicate Learning for Long-Tailed Scene Graph Generation

QPredSGG:面向长尾场景图生成的混合量子谓词学习

Prerana Ramkumar, Nouhaila Innan, Muhammad Shafique

AI总结 针对场景图生成中长尾谓词分布导致的分类偏差,提出用量子谓词头(QP-Head)替换经典谓词头,通过振幅嵌入和强纠缠层压缩特征,在Visual Genome 150上实现参数高效的长尾关系分类。

详情
Comments
11 pages, 5 figures
AI中文摘要

场景图生成(SGG)需要对物体及其交互进行关系推理,但性能常受严重的长尾谓词不平衡限制。经典SGG模型通常依赖数据集统计,导致预测偏向频繁关系而非细粒度语义谓词。尽管现有去偏策略提高了平均召回率,但当前框架中的谓词分类仍常依赖参数成本高的大型经典决策模块。本文通过用加权交叉熵训练的量子谓词头(QP-Head)替换因果特征增强网络(CFEN)中的经典谓词头,引入了一种用于SGG的混合量子谓词分类器。据我们所知,这是首批评估混合量子架构在Visual Genome 150上进行场景图谓词分类的研究之一。我们研究了量子比特数、编码策略、纠缠结构和电路深度对关系预测的影响。最佳4量子比特QP-Head使用振幅嵌入和强纠缠层将4096维对特征压缩为16维量子兼容表示,对应256倍缩减。它实现了57.25%的mR@100,而经典CFEN参考为41.1%,同时仅使用96个可训练量子参数。扩展到8量子比特保持了强大的长尾性能,达到55.38%的mR@100,使用384个量子参数,而深度分析显示了表达能力和运行时间开销之间的权衡。这些结果表明,紧凑的混合量子谓词头可以支持复杂视觉推理任务中参数高效的长尾关系分类。

英文摘要

Scene Graph Generation (SGG) requires relational reasoning over objects and their interactions, but performance is often limited by severe long-tail predicate imbalance. Classical SGG models frequently rely on dataset statistics, leading to biased predictions toward frequent relations rather than fine-grained semantic predicates. Although existing debiasing strategies improve mean recall, predicate classification in current frameworks still often depends on large classical decision modules with high parameter cost. This work introduces a hybrid quantum predicate classifier for SGG by replacing the classical predicate head in Causal Feature Enhancement Network (CFEN) with a Quantum Predicate Head (QP-Head) trained using weighted cross-entropy. To the best of our knowledge, this is among the first studies to evaluate a hybrid quantum architecture for scene graph predicate classification on Visual Genome 150. We study the effect of qubit count, encoding strategy, entangling structure, and circuit depth on relational prediction. The best 4-qubit QP-Head uses Amplitude Embedding and Strongly Entangling Layers to compress 4096-dimensional pair features into a 16-dimensional quantum-compatible representation, corresponding to a 256$\times$ reduction. It achieves an mR@100 of 57.25%, compared with 41.1% for the classical CFEN reference, while using only 96 trainable quantum parameters. Scaling to 8 qubits maintains strong long-tail performance, reaching an mR@100 of 55.38% with 384 quantum parameters, while the depth analysis shows a trade-off between expressibility and runtime overhead. These results suggest that compact hybrid quantum predicate heads can support parameter-efficient long-tail relational classification in complex visual reasoning tasks.
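The dimension and parameter figures quoted above can be cross-checked with a few lines of arithmetic. The sketch assumes that amplitude embedding on $n$ qubits encodes $2^n$ values and that each strongly entangling layer carries three rotation angles per qubit. The implied layer counts are an inference from the reported parameter totals, not numbers stated in the abstract.

```python
def amplitude_dim(n_qubits):
    """Number of amplitudes an n-qubit state can encode."""
    return 2 ** n_qubits

def implied_layers(n_params, n_qubits, params_per_qubit=3):
    """Layer count implied by a parameter total (assumes 3 angles per qubit per layer)."""
    layers, remainder = divmod(n_params, n_qubits * params_per_qubit)
    if remainder:
        raise ValueError("parameter count is not a whole number of layers")
    return layers

# 4096-dimensional pair features compressed to 2^4 = 16 values.
compression_factor = 4096 // amplitude_dim(4)
```

Under these assumptions, 96 parameters on 4 qubits and 384 parameters on 8 qubits would correspond to 8 and 16 layers respectively.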

2606.04688 2026-06-04 cs.CV

MeshWeaver: Sparse-Voxel-Guided Surface Weaving for Autoregressive Mesh Generation

MeshWeaver: 稀疏体素引导的表面编织用于自回归网格生成

Jiale Xu, Wang Zhao, Ying Shan

AI总结 提出MeshWeaver框架,通过多级稀疏体素编码器注入几何上下文,以自回归方式直接预测顶点实现表面编织,在压缩比、高多边形网格生成和几何保真度上达到最优。

详情
Comments
CVPR 2026
AI中文摘要

自回归网格生成通过将网格标记化为序列并以语言建模方式训练模型而受到关注。然而,现有方法存在两个基本限制:(i) 标记化效率低,导致长标记序列并阻碍扩展到高多边形网格;(ii) 缺乏几何感知引导,因为生成仅基于全局形状嵌入而非局部表面线索。我们提出MeshWeaver,一个自回归框架,将网格生成视为表面编织过程,直接预测下一个顶点而非独立坐标。其核心是多级稀疏体素编码器,通过三种互补方式将几何上下文注入生成过程:提供体素特征作为顶点表示,通过交叉注意力引导标记预测,以及作为结构支架约束生成围绕输入表面。我们的层次化设计使得在单次解码步骤中实现从粗到细的顶点预测,同时紧密耦合生成模型与3D几何。大量实验表明,MeshWeaver实现了18%的最先进压缩比,能够生成多达16K面的网格,并且在几何保真度上显著优于先前方法。

英文摘要

Autoregressive mesh generation has gained attention by tokenizing meshes into sequences and training models in a language-modeling fashion. However, existing approaches suffer from two fundamental limitations: (i) low tokenization efficiency, which yields long token sequences and prevents scaling to high-poly meshes, and (ii) absence of geometry-aware guidance, as generation is conditioned only on global shape embeddings rather than local surface cues. We introduce MeshWeaver, an autoregressive framework that treats mesh generation as a surface weaving process by directly predicting the next vertex instead of independent coordinates. At its core is a multi-level sparse-voxel encoder that injects geometric context into the generative process in three complementary ways: providing voxel features as vertex representations, guiding token prediction via cross-attention to voxel features, and serving as a structural scaffold that constrains generation around the input surface. Our hierarchical design enables coarse-to-fine vertex prediction in a single decoding step, while tightly coupling the generative model with 3D geometry. Extensive experiments demonstrate that MeshWeaver achieves a state-of-the-art compression ratio of 18%, can generate meshes with up to 16K faces, and significantly improves geometric fidelity over prior approaches.

2606.04684 2026-06-04 cs.CV cs.AI

Real-Time Automatic License Plate Recognition Using YOLOv8, SORT Tracking, and Temporal Data Interpolation

基于YOLOv8、SORT跟踪与时间数据插值的实时自动车牌识别

Mirza Muhammad Mobeen

AI总结 提出一个五阶段端到端算法流程,结合YOLOv8目标检测、SORT多目标跟踪和时间数据插值,解决动态交通监控中因光照变化、遮挡等导致的识别率低和跟踪路径断裂问题。

详情
Comments
7 pages. Code: https://github.com/mobeen-pmo/Automatic-License-Plate-Recognition
AI中文摘要

视频处理的实时困难严重限制了自动车牌识别(ALPR)在动态交通监控环境中的应用。对非受控变量(如光照剧烈变化、摄像机扫描角度、车辆高速行驶和物理遮挡)的高保真识别是一个问题,常导致跟踪路径断裂和光学字符识别(OCR)率低下。为缓解这些弱点,本研究提出一个五阶段端到端算法流程,涵盖基于深度学习的目标检测、运动学多目标跟踪和几何时间数据插值之间的平滑过渡。所提出的架构利用强大的YOLOv8 nano模型在第一阶段定位车辆,然后使用简单在线实时跟踪(SORT)算法建立帧间时空联系。另一种更具体的YOLOv8目标检测器检测车牌区域,将切片数组传递给EasyOCR链,并受位置语法验证约束。更重要的是,启动离线时间边界框插值机制以重新连接断裂的路径。

英文摘要

Real-time video-processing constraints severely limit the use of Automatic License Plate Recognition (ALPR) in dynamic traffic monitoring. Unconstrained conditions, such as drastic illumination changes, acute camera angles, high vehicle speeds, and heavy physical occlusion, often lead to fragmented tracking paths and poor Optical Character Recognition (OCR) rates. To mitigate these weaknesses, this study proposes a five-stage, end-to-end pipeline that combines deep-learning-based object detection, kinematic multi-object tracking, and geometric temporal data interpolation. The architecture first uses a YOLOv8 nano model to localize vehicles, then applies the Simple Online and Realtime Tracking (SORT) algorithm to build spatio-temporal links between frames. A second, more specialized YOLOv8 detector locates the license plate region and passes the cropped array to an EasyOCR chain constrained by positional syntax verification. More importantly, an offline temporal bounding-box interpolation mechanism is applied to reconnect fragmented paths.
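The offline interpolation stage can be illustrated with a minimal sketch that linearly fills missing frames between two detections of the same track. The data layout (a dict from frame index to an `(x1, y1, x2, y2)` box) and the function name are assumptions for this example. The repository may organize its tracks differently.

```python
def interpolate_track(detections):
    """Fill gaps in one track by linear interpolation between known boxes.

    detections: dict mapping frame index -> (x1, y1, x2, y2).
    Returns a new dict that also contains the interpolated frames.
    """
    frames = sorted(detections)
    filled = dict(detections)
    for f0, f1 in zip(frames, frames[1:]):
        b0, b1 = detections[f0], detections[f1]
        for f in range(f0 + 1, f1):
            t = (f - f0) / (f1 - f0)  # fraction of the way from f0 to f1
            filled[f] = tuple(a + t * (b - a) for a, b in zip(b0, b1))
    return filled
```

Because the pass runs after tracking has finished, it can look at both ends of a gap, which an online tracker such as SORT cannot do while frames are still arriving.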

2606.04680 2026-06-04 eess.AS cs.CL cs.SD

Read What You Hear: Reference-Free Hypotheses Evaluation with Acoustic Discrepancy

读你所听:基于声学差异的无参考假设评估

Zhihan Li, Hankun Wang, Yiwei Guo, Bohan Li, Xie Chen, Kai Yu

AI总结 提出READ指标,利用预训练自回归TTS模型计算语音与文本假设的声学差异,无需参考转录即可评估ASR假设,并在噪声条件下实现高达20%的相对错误率降低。

详情
Comments
Submitted to Interspeech 2026. 6 pages, 4 figures
AI中文摘要

自动语音识别系统通常依赖参考转录进行评估,而无参考方法往往依赖于内部置信度估计或辅助语言模型。我们提出READ(基于声学差异的无参考假设评估),一种直接从语音信号评估ASR假设的新颖指标。READ强调假设的声学基础。它使用预训练的自回归TTS模型计算给定文本假设下语音令牌的条件似然,以衡量语音与文本之间的细粒度声学差异。无需额外训练,READ即可应用于假设优化。实验表明,READ与特定识别错误相关,并改善ASR输出,实现高达20%的相对错误率降低,在噪声条件下尤其显著。

英文摘要

Automatic speech recognition systems commonly rely on reference transcriptions for evaluation, while reference-free approaches often depend on internal confidence estimation or auxiliary language models. We propose READ (Reference-free Hypothesis Evaluation with Acoustic Discrepancy), a novel metric that evaluates ASR hypotheses directly from the speech signal. READ emphasizes the acoustic grounding of hypotheses. It uses a pretrained auto-regressive TTS model to compute the conditional likelihood of speech tokens given a text hypothesis, to measure fine-grained acoustic discrepancy between speech and text. Without additional training, READ can be applied for hypothesis refinement. Experiments show that READ correlates with specific recognition errors and improves ASR outputs, achieving up to 20% relative error rate reduction, with particularly strong gains under noisy conditions.

2606.04678 2026-06-04 cs.LG

Test-Time Compute Scaling for ASR with Depth-Conditioned Looped Transformers

基于深度条件循环Transformer的ASR测试时计算缩放

Yacouba Kaloga, Shashi Kumar, Shakeel A. Sheikh, Driss Khalil, Petr Motlicek, Ina Kodrasi

AI总结 提出LARM模型,通过深度条件循环Transformer将循环编码器深度变为可控的测试时计算轴,结合稀疏CTC检查点、监督时钟嵌入、FiLM深度条件和延迟软后验反馈,在LibriSpeech上随推理循环次数增加提升WER,实现测试时计算缩放从自回归语言模型推理扩展到连续非自回归语音识别。

详情
AI中文摘要

端到端ASR系统通常在推理时使用固定深度的声学编码器,这使得在不训练更大模型的情况下,难以用额外的测试时计算换取更好的识别性能。一种自然的方法是循环重用共享的Transformer块,但我们发现简单的循环并不能充分利用额外的循环计算。我们引入了LARM,一种深度条件循环Transformer,将循环编码器深度变为可控的测试时计算轴。LARM结合了稀疏CTC检查点、监督时钟嵌入、FiLM深度条件和延迟软后验反馈。这些组件将循环结构化为由潜在精炼阶段分隔的识别检查点,并允许共享权重在循环步骤间进行特化。在LibriSpeech上,LARM随着推理循环次数的增加提高了WER,并达到了与更深的非共享参数基线相竞争的性能。我们的结果表明,测试时计算缩放可以超越自回归语言模型推理,扩展到连续非自回归语音识别。

英文摘要

End-to-end ASR systems typically use fixed-depth acoustic encoders at inference, making it difficult to trade additional test-time computation for improved recognition without training a larger model. A natural approach is to reuse a shared Transformer block recurrently, but we find that naive looping does not fully exploit additional recurrent compute. We introduce LARM, a depth-conditioned looped Transformer that turns recurrent encoder depth into a controllable test-time compute axis. LARM combines sparse CTC checkpoints, supervision-clock embeddings, FiLM depth conditioning, and delayed soft-posterior feedback. These components structure the loop into recognition checkpoints separated by latent refinement phases and allow shared weights to specialize across recurrent steps. On LibriSpeech, LARM improves WER as the number of inference loops increases and achieves performance competitive with deeper unshared-parameter baselines. Our results show that test-time compute scaling can extend beyond autoregressive language-model reasoning to continuous non-autoregressive speech recognition.

2606.04670 2026-06-04 math.NA cs.LG cs.MS cs.NA

Fitting scattered data with optional monotonicity constraints on GPU: LipFit package

在GPU上拟合带有可选单调性约束的散乱数据:LipFit包

Gleb Beliakov

AI总结 提出一种多变量散乱数据插值与逼近方法,在满足单调性约束下产生最优Lipschitz连续逼近,并实现GPU并行化的Python包LipFit。

详情
AI中文摘要

本文提出了一种多变量散乱数据插值与逼近方法,该方法在期望的单调性约束下产生最优的Lipschitz连续逼近。该方法依赖于数据的紧上下逼近,其精神类似于最近邻逼近,但不会遭受不连续性。还介绍了局部Lipschitz插值和Lipschitz平滑。该方法属于无训练阶段的基于实例的逼近范畴,适用于基于GPU的并行化。讨论了一个实现所述方法的Python GPU友好包LipFit。

英文摘要

This paper presents a method of multivariate scattered data interpolation and approximation that produces an optimal Lipschitz-continuous approximation, subject to the desired monotonicity constraints. The method relies on tight upper and lower approximations to the data, and is similar in spirit to nearest-neighbour approximation but does not suffer from discontinuities. Local Lipschitz interpolation and Lipschitz smoothing are also presented. This approach falls under the umbrella of instance-based approximation with no training phase, and it is suitable for GPU-based parallelisation. A GPU-friendly Python package, LipFit, which implements these methods, is also discussed.
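The upper and lower approximations mentioned above can be sketched in a few lines. This is the classical central Lipschitz interpolant, written here without monotonicity constraints and without any GPU code. It does not use the LipFit API, and the function name and signature are assumptions for illustration.

```python
import math

def lipschitz_interpolant(xs, ys, lip):
    """Return g(x) = (upper(x) + lower(x)) / 2 for data (xs[i], ys[i]).

    upper(x) = min_i (y_i + lip * ||x - x_i||) is the largest function with
    Lipschitz constant `lip` through the data; lower(x) is the smallest.
    """
    def g(x):
        upper = min(y + lip * math.dist(x, xi) for xi, y in zip(xs, ys))
        lower = max(y - lip * math.dist(x, xi) for xi, y in zip(xs, ys))
        return 0.5 * (upper + lower)
    return g
```

Each evaluation is an independent min and max over the data points, with no training phase, which is why this family of methods maps naturally onto parallel hardware.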

2606.04665 2026-06-04 cs.LG

Towards Accurate Model Selection in Deep Unsupervised Domain Adaptation

面向深度无监督域适应的精确模型选择

Kaichao You, Ximei Wang, Mingsheng Long, Michael I. Jordan

AI总结 针对深度无监督域适应中缺乏准确模型选择方法的问题,提出Deep Embedded Validation (DEV)方法,通过嵌入适应特征表示到验证过程中,获得目标风险的无偏估计,并利用控制变量技术降低方差,理论和实验证明了其有效性。

详情
Comments
upload to arxiv for record
AI中文摘要

深度无监督域适应(Deep UDA)方法成功利用源域中丰富的标记数据来提升相关但未标记的目标域上的性能。然而,由于缺乏准确且标准化的模型选择方法,Deep UDA中的算法比较变得繁琐,这阻碍了该领域的进一步进展。现有的Deep UDA模型选择方法要么高度有偏、受限、不稳定,甚至存在争议(需要标记的目标数据)。为此,我们提出了Deep Embedded Validation(DEV),它将适应后的特征表示嵌入到验证过程中,以获得目标风险的无偏估计,且方差有界。通过控制变量技术进一步降低了方差。该方法的有效性在理论和实验上都得到了验证。

英文摘要

Deep unsupervised domain adaptation (Deep UDA) methods successfully leverage rich labeled data in a source domain to boost performance on related but unlabeled data in a target domain. However, algorithm comparison is cumbersome in Deep UDA due to the absence of an accurate and standardized model selection method, which poses an obstacle to further advances in the field. Existing model selection methods for Deep UDA are either highly biased, restricted, unstable, or even controversial (requiring labeled target data). To this end, we propose Deep Embedded Validation (DEV), which embeds the adapted feature representation into the validation procedure to obtain an unbiased estimate of the target risk with bounded variance. The variance is further reduced by the control variate technique. The efficacy of the method has been justified both theoretically and empirically.
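The combination of importance weighting and a control variate can be sketched as below. The sketch assumes the density-ratio weights are already available (in practice they would be estimated, for example from a domain discriminator on the adapted features) and uses the weights themselves, whose expectation is 1, as the control variate. The function name is hypothetical.

```python
def dev_risk(weights, losses):
    """Importance-weighted validation risk with a control-variate correction.

    weights: estimated density ratios for source validation samples (assumed given).
    losses:  per-sample validation losses on the same samples.
    """
    n = len(weights)
    weighted = [w * l for w, l in zip(weights, losses)]
    mean_wl = sum(weighted) / n
    mean_w = sum(weights) / n
    cov = sum((a - mean_wl) * (w - mean_w) for a, w in zip(weighted, weights)) / n
    var = sum((w - mean_w) ** 2 for w in weights) / n
    eta = -cov / var if var > 0 else 0.0  # variance-minimizing coefficient
    # Since E[w] = 1, adding eta * (mean_w - 1) leaves the expectation unchanged.
    return mean_wl + eta * (mean_w - 1.0)
```

One sanity check on the design: when every sample has the same loss, the correction cancels the noise in the weights exactly and the estimate equals that loss.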

2606.04662 2026-06-04 cs.LG cs.AI

Why Muon Outperforms Adam: A Curvature Perspective

为什么 Muon 优于 Adam:曲率视角

Shuche Wang, Fengzhuo Zhang, Jiaxiang Li, Dirk Bergemann, Zhuoran Yang

AI总结 从曲率视角出发,通过泰勒展开和曲率分解,发现 Muon 因更低的归一化方向锐度(NDS)而比 Adam 实现更大的一步损失下降,数据不平衡和层内曲率是其主要优势来源。

详情
AI中文摘要

Muon 在大语言模型训练中相比 Adam 将训练效率提升约两倍,但这一优势的局部几何来源尚不清楚。我们的工作首次从曲率视角尝试揭开 Muon 优于 Adam 的原因。首先,我们对训练损失曲面应用二阶泰勒近似,表明在匹配验证损失下,Muon 比 Adam 实现更大的一步损失下降。两种优化器的一阶增益相当,但 Muon 始终承受更小的二阶曲率惩罚。其次,我们将该曲率惩罚分解为更新范数的平方和归一化方向锐度(NDS)。我们发现 Muon 和 Adam 的更新范数相当,因此 Muon 更小的曲率惩罚源于更低的 NDS,而非更新尺度。第三,我们研究训练数据和模型结构如何塑造 Muon 的 NDS 优势。使用具有受控不平衡的 Zipf-概率上下文无关文法(PCFG)数据,我们表明数据不平衡放大了 Muon 相对于 Adam 的 NDS 优势。进一步的层内/跨层分解表明,在训练的中后期,Muon 更低的 NDS 主要由更小的层内曲率维持。除了经验证据,我们还分析了具有异质曲率和梯度对齐于高曲率模式的风格化二次问题。我们证明 Muon 通过平衡曲率组间的更新能量,实现了比 GD 更低的平均 NDS;当曲率异质性足够强时,在相同步数后这也产生更低的局部二次损失。

英文摘要

Muon improves training efficiency over Adam in large language-model training by about two times, but the local geometric source of this advantage remains unclear. Our work takes a first step toward demystifying Muon's superiority over Adam from a curvature perspective. First, we apply a second-order Taylor approximation to the training landscape and show that Muon achieves a larger one-step loss decrease than Adam at matched validation loss. The two optimizers have comparable first-order gains, but Muon consistently incurs a smaller second-order curvature penalty. Second, we decompose this curvature penalty into the squared update norm and Normalized Directional Sharpness (NDS). We find that Muon and Adam have comparable update norms, so Muon's smaller curvature penalty is driven by lower NDS, not update scale. Third, we study how training data and model structure shape Muon's NDS advantage. Using Zipf-Probabilistic Context-Free Grammar (PCFG) data with controlled imbalance, we show that data imbalance amplifies Muon's NDS advantage over Adam. A within-/cross-layer decomposition further shows that, in the middle and late stages of training, Muon's lower NDS is mainly sustained by smaller within-layer curvature. Beyond empirical evidence, we analyze stylized quadratic problems with heterogeneous curvature and gradient alignment toward high-curvature modes. We prove that Muon attains a smaller average NDS than GD by balancing update energy across curvature groups; when curvature heterogeneity is sufficiently strong, this also yields lower local quadratic loss after the same number of steps.
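The decomposition described above can be made concrete on a toy quadratic. For an update $\delta$, the second-order Taylor expansion gives a loss change of $g^\top \delta + \tfrac{1}{2}\delta^\top H \delta$, and the curvature penalty factors as $\tfrac{1}{2}\|\delta\|^2 \cdot \mathrm{NDS}$ with $\mathrm{NDS} = \delta^\top H \delta / \|\delta\|^2$. This normalization is an assumption based on the abstract's wording, and the paper's exact definition may differ by a constant.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(H, v):
    return [dot(row, v) for row in H]

def taylor_terms(grad, H, delta):
    """Split the second-order loss change into its first-order and curvature parts."""
    first_order = dot(grad, delta)                 # g^T delta (negative for descent)
    sq_norm = dot(delta, delta)                    # ||delta||^2
    nds = dot(delta, matvec(H, delta)) / sq_norm   # directional sharpness
    penalty = 0.5 * sq_norm * nds                  # 0.5 * delta^T H delta
    return first_order, sq_norm, nds, penalty
```

Two updates with the same norm and the same first-order gain can therefore differ in one-step loss decrease purely through NDS, which is the comparison the abstract draws between Muon and Adam.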

2606.04661 2026-06-04 cs.CL cs.LG

CRAFT: Cost-aware Refinement And Front-aware Tuning of Prompts

CRAFT: 成本感知的提示精炼与前沿感知的调优

Shanu Kumar, Shubhanshu Khandelwal, Akhila Yesantarao Venkata, Parag Agrawal, Yova Kementchedjhieva, Manish Gupta

AI总结 提出CRAFT方法,通过帕累托前沿优化提示的准确性和成本,避免标量化崩溃,在多个基准上实现更广泛的准确-成本权衡。

详情
AI中文摘要

为准确性调优的提示通常变长,每次模型调用都会增加推理成本。最佳的准确-成本权衡取决于任务和预算,因此提示优化是在准确性和提示令牌成本的帕累托前沿上的搜索,而不是针对单个提示。通常的捷径是将目标折叠成加权和,在搜索前固定权衡权重,通常只能恢复前沿的狭窄区域,我们称之为标量化崩溃。我们提出了CRAFT(成本感知的精炼和前沿感知的调优),一种帕累托前沿提示优化器,将目标LLM验证调用视为稀缺资源,并将其分配给乐观候选前沿附近的候选。每轮,互补的面向准确性和面向成本的生成器提出编辑,帕累托差距获取花费每轮的验证预算,NSGA-II保留保持分布广泛的种群。在六个分类和推理基准上,CRAFT保留的前沿同时达到高准确性和低成本区域,而仅准确性、仅成本和加权和基线各自集中在更窄的区域。准确-成本权衡成为搜索后的选择,而不是搜索前的权重。

英文摘要

Prompts tuned for accuracy often grow long, raising inference cost on every model call. The best accuracy-cost trade-off depends on the task and the budget, so prompt optimization is a search over the Pareto front of accuracy and prompt-token cost rather than for one prompt. The usual shortcut, collapsing the objectives into a weighted sum, fixes the trade-off weight before search and often recovers only a narrow region of the front, a failure we call scalarization collapse. We present CRAFT (Cost-aware Refinement And Front-aware Tuning), a Pareto-front prompt optimizer that treats target-LLM validation calls as the scarce resource and allocates them to candidates near the optimistic candidate front. Each round, complementary accuracy-oriented and cost-oriented generators propose edits, Pareto-gap acquisition spends the per-round validation budget, and NSGA-II retention keeps a spread-out population. Across six classification and reasoning benchmarks, CRAFT's retained fronts reach both high-accuracy and low-cost regions, while accuracy-only, cost-only, and weighted-sum baselines each concentrate in narrower regions. The accuracy-cost trade-off becomes a post-search choice, not a pre-search weight.
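A small sketch shows why a weighted sum can miss part of the front. The candidate names and numbers below are made up for illustration. `pareto_front` keeps every non-dominated (accuracy, cost) pair, while `weighted_sum_pick` returns the single best candidate for one fixed trade-off weight.

```python
def pareto_front(candidates):
    """Keep candidates (name, accuracy, cost) that no other candidate dominates."""
    def dominates(a, b):
        # a is at least as good on both objectives and strictly better on one.
        return a[1] >= b[1] and a[2] <= b[2] and (a[1] > b[1] or a[2] < b[2])
    return [c for c in candidates
            if not any(dominates(o, c) for o in candidates)]

def weighted_sum_pick(candidates, lam):
    """Scalarized choice: maximize accuracy - lam * cost for a fixed weight lam."""
    return max(candidates, key=lambda c: c[1] - lam * c[2])

# Hypothetical prompts: (name, accuracy, prompt-token cost).
prompts = [
    ("long", 0.90, 100),
    ("medium", 0.60, 40),
    ("short", 0.50, 10),
    ("bloated", 0.55, 120),
]
```

Here `medium` is on the Pareto front, yet it lies below the line joining `long` and `short`, so no choice of weight ever selects it. Keeping the whole front defers that choice until after the search.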

2606.04660 2026-06-04 cs.CL

LifeSide: Benchmarking Agents as Lifelong Digital Companions

LifeSide: 将智能体作为终身数字伴侣的基准测试

Yuqian Wu, Zhijie Deng, Wei Chen, Junwei Li, Yutian Jiang, Junle Chen, Zhengjun Huang, Qingxiang Liu, Jing Tang, Jiaheng Wei, Yuxuan Liang

AI总结 针对现有评估无法捕捉终身数字伴侣所需的多会话记忆、用户理解和隐私适应能力的问题,提出LifeSide基准,通过多智能体模拟构建记忆-情感-环境循环,评估模型在记忆追踪、用户理解、隐私控制和情感陪伴方面的表现,发现即使当前记忆基准饱和的模型也无法在长期内维持准确的用户理解和真正的陪伴。

详情
Comments
28 pages, 23 figures, 7 tables
AI中文摘要

终身数字伴侣必须整合跨会话线索,持续更新对用户的理解,并适应不断变化的隐私边界。现有评估未能捕捉到这一点,而是孤立地测试记忆回忆和短期共情。为了弥补这一差距,我们引入了LifeSide,一个以多会话“记忆-情感-环境”循环为中心的基准。通过将用户建模为具有分层档案和事件轨迹的持久世界,LifeSide使用多智能体模拟将环境动态投射到对话中,保留了潜在思想与可观察表达之间的关键差距。在记忆追踪、用户理解、隐私控制和情感陪伴方面评估了2,000个角色和111K个任务,我们的实验结果揭示了一个严峻的现实:即使是在当前记忆基准上饱和的模型,也无法在长期内维持准确的用户理解和真正的陪伴。

英文摘要

Lifelong digital companions must integrate cross-session cues, continually update their understanding of users, and adapt to shifting privacy boundaries. Existing evaluations fail to capture this, testing memory recall and short-term empathy in isolation. To bridge this gap, we introduce LifeSide, a benchmark centered on multi-session Memory-Emotion-Environment loops. By modeling users as persistent worlds with layered profiles and event trajectories, LifeSide uses multi-agent simulation to project environmental dynamics into dialogue, preserving the critical gap between latent thoughts and observable expressions. Evaluating 2,000 personas and 111K tasks across memory tracking, user understanding, privacy control, and emotional companionship, our experimental results reveal a stark reality: even models that saturate current memory benchmarks fail to sustain accurate user understanding and true companionship over long horizons.

2606.04658 2026-06-04 cs.NE cs.LG

U-Net-Accelerated Quality-Diversity Optimization for Climate-Adaptive Urban Layouts

U-Net加速的质量-多样性优化用于气候适应性城市布局

Alexander Hagg, Tania Guerrero, Dirk Reith

AI总结 提出用U-Net替代慢速物理模拟器作为代理模型,结合离线MAP-Elites算法,实现快速生成数千个多样化且经气候评估的建筑布局。

详情
AI中文摘要

优化城市布局以适应气候需要在建筑密度与冷空气通风之间取得平衡。由于基于物理的气候模拟计算成本高昂,规划者通常只能评估少于十个手动设计方案。质量-多样性(QD)算法提供了一种系统性地照亮设计空间的方法,但需要代理模型才能实用。在本文中,我们用一个空间深度学习代理(U-Net)替换了缓慢的监管物理模拟器,并将其嵌入离线MAP-Elites循环中。我们系统地比较了这种空间方法与传统的高斯过程(GP)代理在不同训练数据策略(准随机Sobol采样 vs. 主动QD自举)下的表现。结果表明,标量GP代理在随机样本上训练时灾难性地失败,需要昂贵的、主动生成的QD存档才能泛化。相比之下,U-Net的空间归纳偏置使其能够稳健地学习底层物理映射(R² = 0.996),完全独立于训练数据来源。这使得离线QD优化仅需一次性随机训练样本批次即可实现高度准确的适应度排名(ρ = 0.994)。最终流程部署在开源OpenSKIZZE工具中,能在十分钟内生成数千个多样化且经气候评估的建筑布局。

英文摘要

Optimizing urban layouts for climate adaptation requires balancing building density with cold-air ventilation. Because physics-based climate simulations are computationally expensive, planners typically evaluate fewer than ten manual designs. Quality-Diversity (QD) algorithms offer a way to systematically illuminate the design space, but they require surrogate models to be practical. In this paper, we replace a slow, regulatory physics simulator with a spatial deep-learning surrogate (U-Net) inside an offline MAP-Elites loop. We systematically compare this spatial approach with a traditional Gaussian process (GP) surrogate across different training-data strategies (quasi-random Sobol sampling vs. active QD bootstrapping). Our results reveal that scalar GP surrogates fail catastrophically when trained on random samples, requiring expensive, actively generated QD archives to generalize. In contrast, the spatial inductive bias of the U-Net allows it to learn the underlying physics mapping robustly ($R^2 = 0.996$), completely independent of the training data source. This allows offline QD optimization to achieve highly accurate fitness rankings ($\rho = 0.994$) using only a one-time batch of random training samples. The resulting pipeline, deployed in the open-source OpenSKIZZE tool, generates thousands of diverse, climate-evaluated building layouts in under ten minutes.
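The MAP-Elites loop at the core of this pipeline can be sketched generically. In the setting above, `fitness` would be the surrogate's prediction for a layout. Here it is any cheap callable, and the two-parameter solution encoding, the single descriptor in $[0, 1]$, and all hyperparameters are assumptions chosen to keep the example small.

```python
import random

def map_elites(fitness, descriptor, n_bins=10, n_iters=2000, n_init=100, seed=0):
    """Minimal MAP-Elites: keep the best solution found in each descriptor bin.

    fitness:    callable solution -> float (higher is better).
    descriptor: callable solution -> float in [0, 1], selects the archive bin.
    """
    rng = random.Random(seed)
    archive = {}  # bin index -> (fitness, solution)
    for it in range(n_iters):
        if archive and it >= n_init:
            # Mutate a randomly chosen elite, clamped to the unit box.
            _, parent = rng.choice(list(archive.values()))
            child = [min(1.0, max(0.0, x + rng.gauss(0.0, 0.1))) for x in parent]
        else:
            child = [rng.random() for _ in range(2)]  # random initialization
        b = min(n_bins - 1, int(descriptor(child) * n_bins))
        f = fitness(child)
        if b not in archive or f > archive[b][0]:
            archive[b] = (f, child)
    return archive
```

Every iteration calls `fitness` once, so the loop is only practical when that call is cheap, which is the role a fast surrogate plays in place of a slow simulator.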

2606.04656 2026-06-04 cs.CV cs.AI

Instance-Level Post Hoc Uncertainty Quantification in Object Detection

目标检测中的实例级事后不确定性量化

Chongzhe Zhang, Zifan Zeng, Qunli Zhang, Feng Liu, Zheng Hu

AI总结 提出蒙特卡洛广义线性模型(MC-GLM),用于目标检测中实例级、近似事后不确定性量化,无需重新训练,在nuScenes数据集上验证了有效性。

详情
Comments
7 pages, 2 figures
AI中文摘要

目标检测是自动驾驶的安全关键组成部分。为了安全保证,量化边界框预测中的不确定性至关重要。无需重新训练的事后不确定性量化符合实际部署需求;因此,我们采用拉普拉斯近似。由于需要实例级不确定性,需要多次反向传播的线性化推理方法时间效率不高,而基于采样的方法并非完全事后。我们提出了蒙特卡洛广义线性模型(MC-GLM),它提供实例级且近似事后不确定性量化。蒙特卡洛步骤中所需的样本数量是恒定的,与输出实例数量无关,因此可以并行化。在nuScenes数据集上使用CenterPoint检测器的实验验证了我们方法的有效性,所得不确定性表现出良好质量。

英文摘要

Object detection is a safety-critical component of autonomous driving. It is essential to quantify the uncertainty in bounding-box predictions for safety assurance. Post hoc uncertainty quantification without retraining aligns with real-world deployment requirements; therefore, we employ the Laplace approximation. Because instance-level uncertainty is needed, linearized inference methods that require multiple backpropagations are not time-efficient, and sampling-based methods are not fully post hoc. We propose Monte-Carlo generalized linearized model (MC-GLM), which provides instance-level and approximately post hoc uncertainty quantification. The number of samples required in the Monte Carlo step is constant and independent of the number of output instances, so it can be parallelized. Experiments on the nuScenes dataset with the CenterPoint detector validate the effectiveness of our method, and the resulting uncertainties exhibit good quality.