arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.09195 2026-06-09 cs.CL New submission

Symbolic and Abstractive Reasoning with Complex Visual Queries

Yichi Zhang, Jingdian Lu, Zhuo Chen, Lingbing Guo, Jun Xu, Wen Zhang, Huajun Chen

Affiliations: Zhejiang University; Nanjing University; Ant Group

AI summary: Proposes the concept of complex visual queries (CVQs), synthesizes a dataset from multi-modal knowledge graphs, and designs a two-stage training framework to improve the symbolic and abstractive reasoning of multi-modal large language models.

Comments
Work in progress

English abstract

Understanding and reasoning over abstract visual content remains a challenge for current multi-modal large language models (MLLMs). In this paper, we explore a novel abstract data type termed complex visual query (CVQ), designed to probe symbolic and abstractive reasoning, which is a critical yet underexplored dimension of human-like neuro-symbolic reasoning for MLLMs. We present a comprehensive investigation from three perspectives: Data $\times$ Paradigm $\times$ Exploration. Specifically, we propose a scalable pipeline for synthesizing CVQs grounded in large-scale multi-modal knowledge graphs, generating a diverse dataset encompassing 14 distinct query types via systematic combinations of first-order logic operators. We further introduce a two-stage training framework that progressively equips MLLMs with robust visual reasoning capabilities. We conduct extensive experiments to rigorously evaluate MLLMs across multiple dimensions, including reasoning performance on CVQs, as well as cross-task and cross-scenario generalization. We believe our work opens new perspectives and avenues for advancing the reasoning frontiers of MLLMs.
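
As a rough illustration of what composing first-order logic operators over a knowledge graph can look like, the sketch below answers a conjunctive and a negated query on a toy graph with plain set operations. The triples, entity names, and helper functions are all hypothetical; the 14 query types and the visual grounding described in the abstract are not reproduced here.

```python
# Toy knowledge graph as (head, relation, tail) triples. All names are made up.
TRIPLES = {
    ("cat", "is_a", "mammal"),
    ("dog", "is_a", "mammal"),
    ("eagle", "is_a", "bird"),
    ("cat", "lives_in", "house"),
    ("dog", "lives_in", "house"),
    ("eagle", "lives_in", "mountain"),
}
ENTITIES = {e for h, _, t in TRIPLES for e in (h, t)}


def inverse_project(entities, relation):
    """All heads that reach any entity in `entities` via `relation`."""
    return {h for h, r, t in TRIPLES if r == relation and t in entities}


def intersect(*sets):
    """Logical AND over answer sets."""
    return set.intersection(*sets)


def negate(answers):
    """Logical NOT relative to the full entity set."""
    return ENTITIES - answers


# "Which entities are mammals AND live in a house?"
mammals = inverse_project({"mammal"}, "is_a")
house_dwellers = inverse_project({"house"}, "lives_in")
conjunctive_answer = intersect(mammals, house_dwellers)

# "Which entities live somewhere AND are NOT mammals?"
residents = inverse_project({"house", "mountain"}, "lives_in")
negated_answer = intersect(residents, negate(mammals))
```

In a multi-modal setting, the anchor entities would presumably be given as images rather than names, which is the part this text-only sketch leaves out.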

2606.09188 2026-06-09 cs.RO cs.CV New submission

Trajectory Optimization in Single and Dual-UAV Bearing-Only Target Localization

Zhijian Xiao, Huayu Huang, Bin Li, Yang Shang, Banglei Guan

Affiliations: College of Aerospace Science and Engineering, National University of Defense Technology; Hunan Key Laboratory of Image Measurement and Visual Navigation

AI summary: Proposes a Fisher information matrix-based trajectory optimization method that improves observation geometry through a spectrally weighted objective function and an intersection-angle sine term and, combined with an improved particle swarm algorithm, substantially reduces localization error.

Comments
16 pages, 13 figures and 6 tables. Submitted to Measurement

English abstract

Bearing-only target localization is a fundamental problem in optical measurement and finds extensive applications in unmanned aerial vehicle (UAV) technology. Effective trajectory planning establishes favorable observation geometries, thereby enhancing the target localization accuracy of bearing-only UAV systems. This paper proposes a trajectory optimization method for UAVs in bearing-only target localization scenarios. By leveraging the Fisher Information Matrix (FIM), the proposed approach dynamically integrates the geometric configuration and vehicle maneuverability into the optimization framework. Specifically, we introduce a spectrally-weighted FIM objective function that provides better gradient dynamics near degenerate configurations, enabling the planner to rapidly escape from poor observation conditions. For dual-UAV scenarios, an intersection angle sine term is introduced to optimize triangulation geometry by improving the sight-line intersection angle, thereby preventing trajectory aggregation. Furthermore, we propose an improved Particle Swarm Optimization (PSO) algorithm with motion model constraints and particle normalization to ensure the physical feasibility of the trajectory and enhance the compatibility with the objective functions. Simulation results demonstrate that the proposed method reduces the median localization error by 99.21% compared to conventional FIM-based approaches in single-UAV scenarios and achieves a 69.70% improvement for dual-UAV configurations, exhibiting superior performance in long-duration bearing-only localization of maneuvering targets at extended ranges.
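
To make the role of observation geometry concrete, the sketch below computes the standard 2D Fisher information matrix for bearing measurements with independent Gaussian noise. This is the textbook FIM, not the spectrally weighted objective from the abstract, whose exact form is not given here; the function names and the planar setting are assumptions. For two observers the determinant reduces to sin^2(intersection angle) / (sigma^4 r1^2 r2^2), which suggests why a sine of the sight-line intersection angle is a natural geometry term.

```python
import math


def bearing_fim(sensors, target, sigma):
    """Return (j11, j12, j22) of the 2x2 FIM for the target position.

    sensors: list of (x, y) observer positions; target: (x, y);
    sigma: bearing noise standard deviation in radians.
    """
    j11 = j12 = j22 = 0.0
    for sx, sy in sensors:
        dx, dy = target[0] - sx, target[1] - sy
        r2 = dx * dx + dy * dy
        # Gradient of the bearing atan2(dy, dx) with respect to the target position.
        gx, gy = -dy / r2, dx / r2
        j11 += gx * gx / sigma ** 2
        j12 += gx * gy / sigma ** 2
        j22 += gy * gy / sigma ** 2
    return j11, j12, j22


def fim_determinant(sensors, target, sigma):
    """Determinant of the FIM; larger values mean a smaller uncertainty area."""
    j11, j12, j22 = bearing_fim(sensors, target, sigma)
    return j11 * j22 - j12 * j12


def two_sensor_determinant(s1, s2, target, sigma):
    """Closed form for two observers in terms of the intersection angle."""
    b1 = math.atan2(target[1] - s1[1], target[0] - s1[0])
    b2 = math.atan2(target[1] - s2[1], target[0] - s2[0])
    r1 = math.dist(s1, target)
    r2 = math.dist(s2, target)
    return math.sin(b2 - b1) ** 2 / (sigma ** 4 * r1 ** 2 * r2 ** 2)
```

When both observers lie on the same sight line the determinant is zero, which is the kind of degenerate configuration a planner would want to leave quickly.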

2606.09183 2026-06-09 cs.RO New submission

Autonomous Obstacle Removal for Excavators through Policy Learning with Particle Simulation

Yuki Kadokawa, Sandro M. Alcantara Tacora, Taro Abe, Daisuke Endo, Genki Yamauchi, Takeshi Hashimoto, Takamitsu Matsubara

Affiliations: Nara Institute of Science and Technology; Public Works Research Institute

AI summary: Proposes a particle-simulation-based curriculum learning framework that, using RGB-D perception and parameterized trajectory outputs, enables an excavator to autonomously remove ground obstacles under varying burial depths, with robustness validated on a real 12-ton excavator.

Comments
under review

English abstract

Autonomous obstacle removal from the ground is an important earthwork task, but it is difficult to automate because an excavator must adapt its excavation trajectories over repeated cycles as soil-obstacle conditions change. Learning such state-dependent behavior requires a training environment that reproduces accumulated soil-obstacle interactions, including contact states, terrain deformation, and obstacle visibility. Accordingly, particle-based simulation is suitable for the relevant policy learning. However, particle simulation is computationally expensive, and repeated excavation cycles further increase the learning cost. We observe that the burial condition of an obstacle governs both task difficulty and simulation cost: deeper burial makes obstacle removal harder while also requiring more particles for accurate simulation. This observation motivates a burial-conditioned curriculum learning strategy. We propose a time-efficient sim-to-real policy learning framework in which the policy observes terrain and obstacle information from RGB-D measurements and then outputs a parameterized excavation trajectory; in this process, the simulator reproduces, under controllable burial conditions, the same observation-action interface used by the real-world excavator. The curriculum begins with shallow burial conditions and progressively increases burial depth while adjusting particle count, thus simultaneously controlling task difficulty and simulation cost. Experiments show that the proposed framework successfully learns an effective obstacle-removal policy, whereas baseline methods fail even after a full week of training. The proposed curriculum achieves effective performance within three days and transfers successfully to a real 12-ton excavator operating on open ground with various steel obstacles, thus demonstrating robust obstacle removal.
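
The coupling between burial depth and particle count can be pictured with a small schedule like the one below. It is only a sketch: the linear interpolation, the success-rate threshold, and every name and number are hypothetical, since the abstract does not state how the curriculum is parameterized.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    burial_depth_m: float   # how deep the obstacle is buried (hypothetical unit)
    particle_count: int     # simulation particles used at this stage


def build_curriculum(num_stages, max_depth_m, min_particles, max_particles):
    """Interpolate linearly from shallow and cheap to deep and expensive."""
    stages = []
    for i in range(num_stages):
        frac = i / (num_stages - 1) if num_stages > 1 else 1.0
        particles = round(min_particles + frac * (max_particles - min_particles))
        stages.append(Stage(frac * max_depth_m, particles))
    return stages


def next_stage_index(current, success_rate, num_stages, threshold=0.8):
    """Advance one stage once the recent success rate clears the threshold."""
    if success_rate >= threshold and current < num_stages - 1:
        return current + 1
    return current
```

Early stages are both easier and cheaper to simulate, so most training steps are spent where each step costs the least.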

2606.09180 2026-06-09 cs.CV New submission

Claude Code-Driving Scenario Mining for the Argoverse 2 Challenge

Wei Deng, Caoshengzhe Xue, Shuaikun Liu, Zhaohong Liu, Mengshi Qi, Huadong Ma

Affiliations: Beijing University of Posts and Telecommunications

AI summary: Proposes a four-stage pipeline for the Argoverse 2 Scenario Mining Challenge: autonomous code generation by Claude Code, iterative training-set screening, semantic code review, and scene-level verification.

English abstract

We present our submission to the CVPR 2026 Argoverse 2 Scenario Mining Challenge. Our system uses a four-stage pipeline: (1) autonomous code generation via a Claude Code agent powered by GLM 5.1, (2) iterative training set screening with a Timestamp Balanced Accuracy threshold of 0.8 to curate few-shot examples, (3) semantic code review by a separate Claude Code session, and (4) Qwen3-VL scene-level verification to filter false positives. We report results on the Argoverse 2 test set.
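
Stage (2) can be pictured as a simple filter over candidate examples. The sketch below assumes that Timestamp Balanced Accuracy is the mean of the positive and negative recall over per-timestamp boolean labels; the challenge's official definition may differ, and the function names and data layout are hypothetical.

```python
def balanced_accuracy(predicted, actual):
    """Average of per-class recall over boolean per-timestamp labels.

    Assumes `actual` is non-empty. A class absent from `actual` is skipped.
    """
    recalls = []
    for label in (True, False):
        support = sum(1 for a in actual if a == label)
        if support == 0:
            continue  # class absent from the ground truth
        hits = sum(1 for p, a in zip(predicted, actual) if a == label and p == label)
        recalls.append(hits / support)
    return sum(recalls) / len(recalls)


def screen_examples(candidates, threshold=0.8):
    """Keep the names of candidates whose score meets the threshold.

    candidates: dict mapping a name to a (predicted, actual) pair of label lists.
    """
    return [
        name
        for name, (pred, act) in candidates.items()
        if balanced_accuracy(pred, act) >= threshold
    ]
```

Balanced accuracy is a reasonable screen when positive timestamps are rare, because a generated program that never fires would otherwise score well on plain accuracy.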

2606.09175 2026-06-09 cs.LG cs.AI cs.DC New submission

CANS: Accelerating Multiuser Collaborative Edge Inference via Cooperative Autodidactic NeuroSurgeon

Zheshun Wu, Ziyang Zhang, Changyao Lin, Zenglin Xu, Jie Liu

Affiliations: Harbin Institute of Technology Shenzhen; Politecnico di Milano; Harbin Institute of Technology; Fudan University; Shanghai Academy of Artificial Intelligence for Science

AI summary: Proposes the CANS framework, which uses the FedLinUCB-DW algorithm to let heterogeneous devices adaptively learn optimal DNN partitions, accelerating multi-user collaborative edge inference by sharing online inference feedback and offline experience and substantially reducing latency.

Comments
24 pages, 14 figures, 5 tables, submitted for possible journal publication

English abstract

Recently, mobile edge computing (MEC)-enabled collaborative deep neural network (DNN) inference has emerged as a promising approach for delivering intelligent services to resource-constrained mobile devices. A representative scenario is multi-user collaborative edge inference, where distinct devices independently partition their DNN models and offload backend computation to a common edge server over wireless networks. However, determining the optimal DNN partition for each device is challenging due to unknown and time-varying system conditions, including fluctuating wireless links and diverse device capabilities. To address this problem, we propose Cooperative Autodidactic NeuroSurgeon (CANS), a collaborative edge inference framework that enables devices to adaptively learn optimal DNN partitions by sharing informative feedback during online inference. To handle the challenge of device heterogeneity and better leverage offline inference experience, we integrate a novel FedLinUCB-DW algorithm that groups devices of the same type and warm-starts online exploration using local offline early-exit inference experience. Furthermore, we provide theoretical guarantees for FedLinUCB-DW by deriving the regret upper bound. We also validate our method on both a simulated environment and a hardware prototype system. Empirical evaluations demonstrate that CANS achieves lower inference latency compared to state-of-the-art baselines. In particular, in prototype experiments on two edge devices, CANS reduces average inference latency by up to 50% compared to the non-cooperative baseline.
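
For readers unfamiliar with the bandit family the algorithm name points to, the sketch below shows plain single-device LinUCB choosing among candidate partition points, with negative latency as the reward. It is not FedLinUCB-DW: there is no feedback sharing, device grouping, or offline warm start, and the class and function names are hypothetical.

```python
import math


class LinUCBArm:
    """One candidate partition point with a ridge-regression reward model."""

    def __init__(self, dim, alpha=1.0):
        self.alpha = alpha
        # Inverse of A = I + sum(x x^T), kept up to date incrementally.
        self.a_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def _matvec(self, v):
        return [sum(row[j] * v[j] for j in range(len(v))) for row in self.a_inv]

    def ucb(self, x):
        """Predicted reward plus an exploration bonus for context x."""
        theta = self._matvec(self.b)          # theta = A^-1 b
        a_inv_x = self._matvec(x)
        mean = sum(t * xi for t, xi in zip(theta, x))
        bonus = self.alpha * math.sqrt(sum(xi * v for xi, v in zip(x, a_inv_x)))
        return mean + bonus

    def update(self, x, reward):
        """Sherman-Morrison rank-one update of A^-1, then accumulate b."""
        a_inv_x = self._matvec(x)
        denom = 1.0 + sum(xi * v for xi, v in zip(x, a_inv_x))
        n = len(x)
        for i in range(n):
            for j in range(n):
                self.a_inv[i][j] -= a_inv_x[i] * a_inv_x[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]


def select_arm(arms, x):
    """Index of the arm with the highest upper confidence bound."""
    return max(range(len(arms)), key=lambda i: arms[i].ucb(x))
```

In use, the context x would encode observable system conditions such as an estimate of link quality, and each observed latency would be fed back as `update(x, -latency)`.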

2606.09167 2026-06-09 cs.CV New submission

Vision-Language Guided Hyperspectral Object Tracking via Semantics Fusion and Contextual Template Updating

Rui Yao, Yuhong Zhang, Kunyang Sun, Hancheng Zhu, Jiaqi Zhao, Zhiwen Shao, Abdulmotaleb El Saddik

Affiliations: China University of Mining and Technology; University of Ottawa

AI summary: Proposes the VLHTrack framework, which mitigates spectral redundancy with a language-guided band selection module, integrates visual and linguistic features with a multi-modal fusion module, and handles target deformation with a dynamic template update strategy, outperforming existing methods on HOT2023 and HOT2024.

Comments
14 pages, 8 figures

English abstract

Hyperspectral object tracking (HOT) leverages the rich spectral information provided by hyperspectral videos (HSVs), offering substantial potential for object tracking. However, efficiently extracting and exploiting spectral information from redundant spectral bands remains a fundamental challenge, which severely limits model generalization and tracking performance. Moreover, in dynamic scenes, targets often experience drastic appearance variations due to factors such as occlusion and illumination changes. These variations lead to large deformations between the current frame and the template. Such discrepancies pose major challenges for existing temporal modeling approaches. In this work, we propose VLHTrack, a novel hyperspectral vision-language (VL) joint tracking framework. Specifically, we incorporate language priors to address the fundamental challenge of spectral redundancy by designing a Language-Guided Band Selection Module (LBSM). By leveraging Large Language Model (LLM) descriptions, LBSM establishes a semantic-to-spectral mapping that mitigates redundancy and accentuates discriminative spectral features. A Multi-Modal Vision-Language Fusion Module is then employed to seamlessly integrate visual and linguistic embeddings, harnessing their complementary advantages to learn coherent cross-modal representations. To address target deformation in long-term sequences, we propose a dynamic template-feature update strategy implemented via the Dynamic Template Update with Mamba (DTUM) module. By leveraging selective state space modeling, DTUM learns inter-frame dependencies to update template features, ensuring efficient template feature evolution guided by temporal context. Experiments on HOT2023 and HOT2024 demonstrate that VLHTrack outperforms state-of-the-art (SOTA) methods.

2606.09162 2026-06-09 cs.CV New submission

Zero-Parameter Geometric Gating for Temporally Stable Low-Altitude UAV Video Semantic Segmentation

Jingpu Yang, Fengxian Ji, Zhengzhao Lai, Juanfan Wu, Mingxuan Cui, Yufeng Wang

Affiliations: Beihang University; Northeastern University; The Chinese University of Hong Kong, Shenzhen; Beijing Institute of Technology

AI summary: Proposes a zero-parameter geometric gate that routes regions on a 16x16 grid using RANSAC homography inlier ratios and, combined with Semantic Similarity Propagation, achieves temporally stable segmentation, improving mIoU by up to 4.91% on synthetic UAVid.

English abstract

Video semantic segmentation for low-altitude UAVs requires temporal consistency, yet dense optical flow introduces spatially structured noise in the planar regions that dominate aerial imagery. We propose a zero-parameter geometric gate that uses RANSAC homography inlier ratios on a $16\times16$ spatial grid to route each region to either homography or optical flow warp before fusion via Semantic Similarity Propagation (SSP). The gate requires no learned parameters, only a median-threshold binary decision on RANSAC statistics, and adds only 211K trainable parameters (the SSP fusion layer) to a frozen backbone. On synthetic UAVid, the method achieves a +4.24 to +4.91% mIoU improvement over base models across two architectures (SegFormer-b2 and Hiera-S+UPerNet). Mechanism diagnostics reveal that flow residuals in planar regions are spatially autocorrelated (Moran's I = 0.32, $p < 0.001$), predict boundary instability (Spearman $\rho = 0.66$), and that rigidification recovers temporal consistency from 62% to 92% (+29.5pp) in homography-valid regions.
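
The median-threshold decision can be written in a few lines. The sketch below assumes that the per-cell inlier ratios are already computed, that cells at or above the median are routed to the homography warp, and that ties go to the homography branch; the abstract does not specify these details, and the function name is hypothetical.

```python
import statistics


def geometric_gate(inlier_ratios):
    """Return a grid of booleans: True routes the cell to the homography warp,
    False routes it to the optical flow warp.

    inlier_ratios: list of rows, each a list of RANSAC inlier ratios in [0, 1].
    """
    flat = [ratio for row in inlier_ratios for ratio in row]
    threshold = statistics.median(flat)  # data-driven, so nothing is learned
    return [[ratio >= threshold for ratio in row] for row in inlier_ratios]
```

Because the threshold is recomputed from each frame's own statistics, the gate adds no trainable parameters, which matches the "zero-parameter" description.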

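2606.09159 2026-06-09 cs.CL cs.AI New submission

Unified Energy for Invariant and Independent Decoding in Diffusion Language Models

Yuchen Yan, Minkai Xu, Zaiquan Yang, Yatao Bian

Affiliations: National University of Singapore; Stanford University; City University of Hong Kong

AI summary: To address the performance gap between diffusion language models that generate text in parallel and autoregressive models, proposes a unified energy (Uni-E) that handles model capacity, dependency, and invariance through an invariant energy and an independent energy, can be computed exactly without sampling, and can correct distribution shift.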

English abstract

Diffusion Language Models (DLMs) enable parallel text generation by iteratively denoising a full sequence, offering attractive flexibility compared to auto-regressive (AR) decoding. However, existing methods fail to fully capture token relationships, leading to a performance gap relative to AR baselines, especially as the degree of parallelism increases. In this paper, we give a systematic analysis of the gap, identifying three key factors: (i) model capacity, (ii) dependency, and (iii) invariance. To address these issues, we first propose an invariant energy (Inv-E) together with an effective sampling-based estimator to handle the invariance issue. By further combining it with the independent energy (Ind-E), we obtain a unified energy (Uni-E) that accounts for all these factors. Uni-E enjoys a unique advantage: it can be computed exactly without sampling-based partition estimation. Moreover, Uni-E is model agnostic and can therefore be scaled to models of arbitrary size. We further prove that Uni-E can correct the distribution shift caused by dependency and invariance. Extensive experiments across Diffusion Language Models (DLMs) and Diffusion Large Language Models (DLLMs) demonstrate the effectiveness of the proposed Uni-E.

2606.09157 2026-06-09 cs.CL cs.AI New submission

SEF-CLGC at SemEval-2026 Task 11: Logical Notation Impact on Language Model Performance

Hanna Abi Akl, Fabien Gandon, Catherine Faron, Pierre Monnin

Affiliations: Université Côte d’Azur, Inria, CNRS, I3S, Sophia Antipolis, France; Data ScienceTech Institute, Paris, France

AI summary: Presents the SEF-CLGC pipeline, which combines formal logical notations with small language models to evaluate reasoning performance on SemEval-2026 Task 11; the best model achieves a content score of 27.80% while lowering content bias.

Comments
Accepted to SemEval-2026 co-located with ACL 2026

English abstract

This paper revisits our pipeline called Syllogistic Evaluation Framework-Common Logic Grammar Construction (SEF-CLGC). We combine formal logical notations with Small Language Models (SLMs) to evaluate reasoning performance on the SemEval-2026 Task 11 Subtask 1: Disentangling Content and Formal Reasoning in Large Language Models. Our experiments show that by relying solely on SLMs, trained on a combination of natural and symbolic languages, our best model achieves a content score of 27.80% on the task while significantly lowering the content bias in reasoning.

2606.09156 2026-06-09 cs.CV New submission

OmniGen-AR: AutoRegressive Any-to-Image Generation

Junke Wang, Xun Wang, Qiushan Guo, Peize Sun, Weilin Huang, Zuxuan Wu, Yu-Gang Jiang

Affiliations: Institute of Trustworthy Embodied AI, Fudan University; Shanghai Collaborative Innovation Center of Intelligent Visual Computing; Bytedance Seed; The University of Hong Kong

AI summary: Proposes OmniGen-AR, a unified autoregressive framework that supports diverse conditional inputs such as text, spatial signals, and visual context through a shared visual tokenizer and disentangled causal attention, achieving state-of-the-art or competitive performance on multiple benchmarks.

Comments
Accepted by NeurIPS

English abstract

Autoregressive (AR) models have demonstrated strong potential in visual generation, offering superior performance with simple architectures and optimization objectives. However, existing methods are typically limited to single-modality conditions, e.g., text, restricting their applicability in real-world scenarios that demand image synthesis from diverse controls. In this work, we present OmniGen-AR, a unified autoregressive framework for Any-to-Image generation. By discretizing various visual conditions through a shared visual tokenizer and text prompts with a text tokenizer, OmniGen-AR supports a broad spectrum of conditional inputs within a single model, including text (text-to-image generation), spatial signals (segmentation-to-image and depth-to-image), and visual context (image editing, frame prediction, and text-to-video generation). To mitigate the risk of information leakage from condition tokens to content tokens, we introduce Disentangled Causal Attention (DCA), which separates the full-sequence causal mask into condition causal attention and content causal attention. It serves as a training-time regularizer without affecting the standard next-token prediction during inference. With this design, OmniGen-AR achieves new state-of-the-art or at least competitive results across a range of benchmarks, e.g., 0.63 on GenEval and 80.02 on VBench, demonstrating its effectiveness in flexible and high-fidelity visual generation.

2606.09150 2026-06-09 cs.CV New submission

Ultra Flash: Scaling Real-Time Streaming Video Generation to High Resolutions

Luxury, Jie Huang, Zihao Fan, Xiaoxiao Ma, Yuming Li, Jun-hao Zhuang, Zeyue Xue, Siming Fu, Haoran Li, Mingchen Zhong, Guohui Zhang, Shichen Ma, Yijun Liu, Jiaqi Shi, Yanwen Ma, Yaofeng Su, Haoyu Wang, Yaowei Li, Songchun Zhang, Weiyang Jin, Yuxuan Bian, Shiyi Zhang, Haojun Xu, Shuai Lu, Xin Han, Wei Tang, Haoyang Huang, Nan Duan

Affiliations: JD Explore Academy; USTC; PKU; THU; BUAA; FDU; HKUST; HKU; CUHK

AI summary: Proposes the Ultra Flash cascaded framework, which achieves real-time high-resolution streaming video generation at about 30 FPS at 1K resolution and about 18 FPS at 2K resolution on a single GPU through architecture-preserving super-resolution training, a causal streaming latent upsampler with a high-resolution decoder, and a cascade optimization scheme.

English abstract

While recent autoregressive video diffusion models achieve remarkable streaming quality, they remain confined to low resolutions (e.g., 480P), leaving efficient, scalable, real-time high-resolution video generation a fundamental open challenge. To bridge this gap, we present Ultra Flash, a cascaded streaming framework capable of real-time high-resolution video generation. Ultra Flash achieves ~30 FPS at 1K resolution and ~18 FPS at 2K resolution on a single GPU through three key contributions: (1) an architecture-preserving T2V-to-TV2V super-resolution training paradigm coupled with an AIGC-oriented data degradation pipeline that effectively preserves the generative capability of the base model, enabling enhanced high-resolution detail when cascaded after mainstream low-resolution generative models; (2) a causal streaming latent upsampler paired with a high-resolution decoder, which enhances spatiotemporal coherence while enabling efficient latent spatial scaling and precise high-resolution decoding with negligible computational overhead; and (3) a cascade high-resolution streaming video generation optimization scheme that first performs hybrid-reward-enhanced sparse causalization and single-step distillation of the super-resolution model, then introduces cascaded streaming self-forcing preference optimization with dynamic cache management, jointly enhancing overall coherence, improving quality, and enabling real-time high-resolution streaming video generation. Extensive experiments demonstrate that Ultra Flash reliably produces ultra-high-resolution streaming video while maintaining state-of-the-art visual quality and superior efficiency.

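2606.09143 2026-06-09 cs.CV New submission

CAMF-Det: Closure-Aware Multimodal Fusion for LiDAR-Camera 3D Object Detection on UAV Platforms

Yanze Jiang, Yanfeng Gu, Xian Li

Affiliations: School of Electronics and Information Engineering, Harbin Institute of Technology

AI summary: To address multimodal information degradation caused by tree-canopy occlusion in UAV top-down scenes, proposes CAMF-Det, a closure-aware fusion framework based on the Beer-Lambert law that explicitly models dual-modal occlusion intensity and injects it into the detection pipeline, improving hard-level mAP$_{\mathrm{BEV}}$ by 9.43% and 4.88% on self-built datasets.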

English abstract

Multimodal 3D object detection based on LiDAR and cameras has demonstrated excellent performance in ground-vehicle scenarios, but has not been explored for Unmanned Aerial Vehicle (UAV) platforms. In UAV top-down scenes, frequent ground-object occlusion dominated by tree canopies causes spatially varying and modality-dependent information degradation. Existing multimodal fusion frameworks neither explicitly model such ground-object occlusion nor embed occlusion awareness into the detection pipeline, limiting their performance in occluded UAV scenes. To address these challenges, we propose CAMF-Det, a closure-aware multimodal fusion framework for LiDAR-camera 3D object detection on UAV platforms, which derives dual-modal occlusion intensity through physics-inspired modeling and embeds it as a prior throughout the detection pipeline. First, a dual-modal closure modeling module explicitly constructs occlusion intensity ground truth for both modalities offline via a Beer-Lambert-inspired formulation and building-mask correction. Second, using these ground-truth maps as supervision, a dual-modal prediction network converts the offline modeling results into online occlusion intensity predictions under single-frame inference. Third, both ground-truth and predicted occlusion intensity are injected into data augmentation, feature encoding, multimodal fusion, and the detection head, enabling adaptive detection under spatially varying and modality-dependent information degradation. Experiments on two self-built UAV-based multimodal datasets, SI3D-DI and SI3D-DII, demonstrate that CAMF-Det achieves the best performance across all difficulty levels, with hard-level mAP$_{\mathrm{BEV}}$ improvements of 9.43% and 4.88% over the best competing methods, respectively. These results confirm the effectiveness of explicit occlusion prior modeling and exploitation for robust multimodal 3D detection in UAV scenes.
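
For orientation, the Beer-Lambert law models the fraction of a signal that survives a path through an absorbing medium as exp(-k L). The sketch below turns that into an occlusion value between 0 and 1. The extinction coefficient, the "one minus transmittance" mapping, and the function names are assumptions for illustration; the paper's actual formulation and its building-mask correction are not described in the abstract.

```python
import math


def transmittance(path_length_m, extinction_per_m):
    """Beer-Lambert attenuation: fraction of the signal that passes through."""
    return math.exp(-extinction_per_m * path_length_m)


def occlusion_intensity(path_length_m, extinction_per_m):
    """Assumed mapping: 0 means unobstructed, values near 1 mean heavily occluded."""
    return 1.0 - transmittance(path_length_m, extinction_per_m)
```

Under this reading, a longer path through the canopy increases the occlusion value smoothly instead of flipping a binary visible or hidden flag, and separate coefficients per sensor would be one way to make the value modality-dependent.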

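2606.09142 2026-06-09 cs.CV cs.AI New submission

Decoding Pedestrian Crossing Intention from Egocentric Vision via Vision Language Models

Danya Li, Xiang Su, Yan Feng, Rico Krueger

Affiliations: Technical University of Denmark; University of Helsinki; Delft University of Technology

AI summary: Uses vision language models (VLMs) to cast pedestrian crossing intention prediction as a visual question answering task; with parameter-efficient fine-tuning and contextual cues such as ego motion, vehicle motion, and eye gaze, it achieves a 14.5% accuracy improvement on egocentric video, setting a new state of the art.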

English abstract

Egocentric vision offers a first-person view of human perception and decision making, yet its potential for traffic-safety prediction remains underexplored. In this work, we study the decoding of pedestrian crossing intentions from short egocentric video clips. We approach this by formulating the task as a closed-ended visual question answering (VQA) problem and leveraging vision language models (VLMs) to predict the pedestrians' intent. We first benchmark three families of state-of-the-art VLMs in a zero-shot setting, finding that they achieve moderate gains over random guessing but exhibit limited higher-level traffic reasoning. Motivated by these findings, we further adapt VLMs to the target task using parameter-efficient fine-tuning. Our results show that the fine-tuned models substantially outperform their zero-shot counterparts and achieve a 9% accuracy improvement over a specialized transformer-based baseline. Finally, we demonstrate that incorporating additional contextual cues, including ego motion, vehicle motion, and eye gaze, further improves predictive performance. In particular, the fine-tuned Qwen3-VL-2B model guided by eye gaze and ego motion achieves a 14.5% accuracy improvement over the transformer baseline, establishing a new state of the art for egocentric pedestrian intent decoding.

2606.09140 2026-06-09 cs.CV New submission

DiffSight-Former: Modeling Structural Differences and Temporal Dynamics for Glaucoma Progression Prediction

Yi Huang, Lei Bi, Jinman Kim

Affiliations: The University of Sydney; Shanghai Jiao Tong University

AI summary: Proposes the DiffSight-Former framework, which predicts glaucoma progression from sequential fundus images through time-variant feature extraction, multi-structure difference modeling, and a time-aware Transformer, achieving high AUC and sensitivity on the SIGF and GRAPE datasets.

Comments
12 pages, 6 figures

English abstract

Glaucoma is a leading cause of irreversible blindness worldwide, and early detection from fundus images is critical for effective disease management. While deep learning has achieved promising performance in fundus image analysis, most existing methods rely on single time-point images and fail to capture longitudinal structural and vascular changes associated with disease progression. Sequential fundus images acquired during clinical follow-up provide valuable temporal information; however, current sequential models often struggle to detect subtle early progression signals and commonly depend on fixed-length inputs or diagnostic cues from already glaucomatous images, limiting their clinical utility for early prediction. To address these limitations, we propose DiffSight-Former, a framework for glaucoma progression prediction from sequential fundus images. It incorporates a time-variant feature extraction module based on a fundus-specific foundation model to obtain robust anatomical representations. A multi-structure difference modeling module is introduced to quantify progression-related changes in the optic disc/cup region and retinal vasculature. These representations are integrated with temporal interval embeddings and processed by a time-aware Transformer to model disease progression and estimate the probability of future glaucoma onset. Experiments were conducted on two longitudinal datasets, SIGF (405 sequences) and GRAPE (263 sequences). On SIGF, DiffSight-Former achieved an AUC of 91.54% and a sensitivity of 92.16% for progression prediction. On GRAPE, it achieved an average accuracy of 87.48% across three clinical visual-field progression criteria. Compared with existing approaches, DiffSight-Former demonstrates strong performance and robustness across different temporal settings, highlighting its potential for longitudinal glaucoma monitoring and early risk prediction.

2606.09139 2026-06-09 cs.CV 新提交

A Geometric Framework for Absolute Pose and Velocity Estimation with Event Cameras

事件相机的绝对位姿与速度估计的几何框架

Zibin Liu, Shunkun Liang, Banglei Guan, Yang Shang, Qifeng Yu, Ji Zhao

发表机构 * National University of Defense Technology(国防科技大学) independent researcher(独立研究者)

AI总结 提出利用3D直线及其触发事件的几何约束,通过线性与多项式求解器同时估计事件相机的绝对位姿和速度,最少仅需三个对应关系,在精度和效率上超越现有方法。

详情
AI中文摘要

尽管基于事件的运动估计取得了快速进展,当前的几何方法主要关注速度估计。然而,对于机器人导航和增强现实等关键应用同样至关重要的绝对位姿估计仍相对未被充分探索。因此,从事件流中同时恢复绝对位姿和速度仍然是一个开放且具有挑战性的问题。为弥补这一空白,我们提出了一种几何框架,通过利用场景中的3D直线及其触发的事件来估计绝对位姿和速度。该框架的核心是两个关键几何约束:3D直线与其对应事件平面的法向量之间的正交性,以及事件与其关联直线的2D投影之间的共线性。基于这些约束,我们提出了用于绝对位姿估计的线性求解器和多项式求解器。前者能够高效计算,而后者为旋转提供了全局最优解。对于速度估计,我们开发了一个高效的线性求解器和一个更精确的基于优化的求解器,以恢复角速度和线速度。值得注意的是,我们的方法最少需要三个事件-直线对应关系即可独立确定6自由度绝对位姿或速度。在仿真和真实世界数据集上的大量实验表明,我们的方法达到了最先进的性能,与现有方法相比,在精度和计算效率上都有显著提升。演示代码公开于 https://github.com/Zibin6/EventPoseVelocity。

英文摘要

Despite the rapid advancements in event-based motion estimation, current geometric methods primarily focus on velocity estimation. However, absolute pose estimation, which is equally crucial for key applications such as robotic navigation and augmented reality, remains relatively underexplored. Consequently, the simultaneous recovery of absolute pose and velocity from event streams remains an open and challenging problem. To address this gap, we propose a geometric framework for absolute pose and velocity estimation by leveraging 3D lines in the scene and the events they trigger. At the core of the framework lie two key geometric constraints: the orthogonality between a 3D line and the normal vector of its corresponding event plane, and the collinearity of an event with the 2D projection of its associated line. Based on these constraints, we present both linear and polynomial solvers for absolute pose estimation. The former enables efficient computation, while the latter provides a globally optimal solution for rotation. For velocity estimation, we develop an efficient linear solver and a more accurate optimization-based solver to recover both angular and linear velocities. Notably, our methods require a minimum of three event-line correspondences to determine the 6-DoF absolute pose or velocities independently. Extensive experiments in simulation and on real-world datasets demonstrate that our methods achieve state-of-the-art performance, with significant improvements in accuracy and computational efficiency compared to existing methods. The demo code is publicly available at https://github.com/Zibin6/EventPoseVelocity.

2606.09138 2026-06-09 cs.LG cs.CL 新提交

Claw-R1: A Step-Level Data Middleware System for Agentic Reinforcement Learning

Claw-R1:面向智能体强化学习的步骤级数据中间件系统

Daoyu Wang, Mingyue Cheng, Qingchuan Li, Shuo Yu, Jie Ouyang, Qi Liu

发表机构 * State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China(中国科学技术大学认知智能国家重点实验室)

AI总结 提出Claw-R1系统,通过网关服务器和数据池组件,将智能体交互步骤转化为结构化数据资产,支持实时检查、质量筛选和训练批次配置,解决智能体强化学习中数据生命周期管理问题。

详情
AI中文摘要

智能体强化学习已成为将大语言模型从静态聊天机器人转变为交互式智能体的重要后训练范式,催生了如OpenClaw等代表性应用。现有工作主要关注策略优化算法和训练框架,但对从数据产生到训练消费的智能体-环境交互完整数据生命周期关注不足。为弥补这一差距,我们提出Claw-R1,一个面向智能体强化学习的交互式步骤级数据中间件系统。Claw-R1通过两个核心组件——网关服务器和数据池——连接异构智能体运行时与强化学习训练后端。网关服务器通过统一的LLM API入口捕获多轮交互步骤,而数据池将其组织为由提示ID、响应ID、奖励和其他元数据组成的步骤级记录。在我们的演示中,用户可以交互式检查实时轨迹,查看每一步的状态、动作和奖励,根据质量和就绪程度筛选数据,并为不同的下游强化学习算法配置训练就绪批次。总体而言,Claw-R1将智能体交互轨迹视为受管理的数据资产,而非临时运行时日志。通过此演示,我们希望鼓励社区认识到数据管理在智能体强化学习中的重要性。我们的代码可在https://github.com/AgentR1/Claw-R1获取,演示视频可在https://youtu.be/Pw47dAOw6B0找到。

英文摘要

Agentic reinforcement learning (RL) has become an important post-training paradigm for turning LLMs from static chatbots into interactive agents, giving rise to representative applications such as OpenClaw. Existing work mainly focuses on policy optimization algorithms and training frameworks, but pays less attention to the full data lifecycle of agent-environment interactions, from data production to training consumption. To bridge this gap, we present Claw-R1, an interactive step-level data middleware system for agentic RL. Claw-R1 connects heterogeneous agent runtimes with RL training backends through two core components: a Gateway Server and a Data Pool. The Gateway Server captures multi-turn interaction steps through a unified LLM API entry point, while the Data Pool organizes them into step-level records consisting of prompt IDs, response IDs, rewards, and other metadata. In our demo, users can interactively inspect live trajectories, examine the state, action, and reward of each step, curate data by quality and readiness, and configure training-ready batches for different downstream RL algorithms. Overall, Claw-R1 treats agent interaction traces as managed data assets rather than temporary runtime logs. Through this demonstration, we hope to encourage the community to recognize the importance of data management in agentic RL. Our code is available at https://github.com/AgentR1/Claw-R1, and the demonstration video can be found at https://youtu.be/Pw47dAOw6B0.
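As a rough illustration of the step-level records described in the abstract above, the following Python sketch models a record holding prompt IDs, response IDs, a reward, and metadata, together with a minimal in-memory pool that returns training-ready steps. The names `StepRecord`, `DataPool`, `ready`, and `training_batch` are hypothetical and are not taken from the Claw-R1 codebase; the readiness rule (a reward has been attached and a response exists) is likewise an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class StepRecord:
    """One agent-environment interaction step (hypothetical schema)."""
    trajectory_id: str
    step_index: int
    prompt_ids: List[int]
    response_ids: List[int]
    reward: Optional[float] = None  # None until a reward has been attached
    metadata: Dict[str, Any] = field(default_factory=dict)

    @property
    def ready(self) -> bool:
        # Assumed readiness rule: the step has a response and a reward.
        return self.reward is not None and len(self.response_ids) > 0


class DataPool:
    """Minimal in-memory stand-in for a step-level data pool."""

    def __init__(self) -> None:
        self._records: List[StepRecord] = []

    def add(self, record: StepRecord) -> None:
        self._records.append(record)

    def training_batch(
        self, batch_size: int, min_reward: float = float("-inf")
    ) -> List[StepRecord]:
        # Keep only ready steps whose reward passes the quality threshold,
        # in insertion order, truncated to the requested batch size.
        ready = [r for r in self._records if r.ready and r.reward >= min_reward]
        return ready[:batch_size]


pool = DataPool()
pool.add(StepRecord("t1", 0, [1, 2], [3], reward=1.0))
pool.add(StepRecord("t1", 1, [1, 2, 3], [4]))  # no reward yet, so not ready
pool.add(StepRecord("t2", 0, [5], [6], reward=0.2))
batch = pool.training_batch(batch_size=8, min_reward=0.5)
```

Keeping readiness as a property of the record, rather than of the pool, lets different downstream RL algorithms apply their own filters over the same stored steps.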

2606.09134 2026-06-09 cs.RO cs.AI cs.CL cs.CV cs.GR 新提交

From USD Scenes to Knowledge Graphs: Zero-Shot Ontology Grounding with LLMs

从USD场景到知识图谱:基于LLM的零样本本体接地

Jiangtao Shuai, Zongxiong Chen, Manfred Hauswirth, Sonja Schimmler

发表机构 * Technical University of Berlin(柏林工业大学) Fraunhofer FOKUS(弗劳恩霍夫开放通信系统研究所)

AI总结 研究利用大语言模型(LLM)零样本地将3D场景对象自动映射到本体类别,无需训练,在厨房场景中达到90-96%准确率,并揭示语义线索是关键。

详情
Comments
Accepted to the IEEE ICRA 2026 International Joint Workshop on Ontologies, Semantic Maps and Autonomous Robotics Standardization (J-WOSMARS 2026), Vienna, 2026
AI中文摘要

从3D仿真场景构建知识图谱对于机器人任务推理至关重要,但关键瓶颈——将场景对象接地到形式本体类别——仍然依赖于手工制作的字典,这些字典脆弱且无法跨资产泛化。我们研究大语言模型(LLM)是否能够自动化通用场景描述(USD)场景的接地步骤,作为一种零样本、无需训练的替代方案。在具有SOMA-HOME本体的厨房场景(125个对象)中,LLM在描述性名称下达到90-96%的精确匹配准确率,在缩写名称下达到49-89%,显著优于字典和嵌入基线。在完全不透明名称下,上下文增强提示可恢复高达48%的准确率。特征消融表明,LLM主要利用场景图中的语义线索(兄弟名称和父路径);匿名化这些线索将准确率降至0-6%,而仅凭几何信息仅能达到4-17%。

英文摘要

Constructing knowledge graphs from 3D simulation scenes is essential for robot task reasoning, but the key bottleneck, grounding scene objects to formal ontology classes, still relies on manually curated dictionaries that are brittle and do not generalize across assets. We investigate whether large language models (LLMs) can automate this grounding step for Universal Scene Description (USD) scenes as a zero-shot, training-free alternative. On a kitchen scene (125 objects) with the SOMA-HOME ontology, LLMs achieve 90-96% exact-match accuracy with descriptive names and 49-89% with abbreviated names, substantially outperforming dictionary and embedding baselines. Under fully opaque names, context-augmented prompting recovers up to 48% accuracy. Feature ablation reveals that LLMs primarily exploit semantic cues in the scene graph (sibling names and parent paths); anonymizing these cues reduces accuracy to 0-6%, while geometry alone yields only 4-17%.

2606.09132 2026-06-09 cs.AI 新提交

Vision Language Model Helps Private Information De-Identification in Vision Data

视觉语言模型助力视觉数据中的隐私信息去标识化

Tiejin Chen, Pingzhi Li, Kaixiong Zhou, Tianlong Chen, Hua Wei

发表机构 * Arizona State University(亚利桑那州立大学) University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校) North Carolina State University(北卡罗来纳州立大学)

AI总结 提出VisShield框架,通过专用指令微调数据集OPTIC和训练策略,使视觉语言模型精准定位并掩码敏感文本,有效保护医学图像等视觉数据中的隐私信息。

详情
AI中文摘要

视觉语言模型(VLM)因其卓越的能力而广受欢迎。尽管存在多种增强文本应用隐私的方法,但视觉输入相关的隐私风险(如医学图像中的受保护健康信息)仍被广泛忽视。为解决此问题,需执行两项关键任务:准确定位敏感文本并处理以确保隐私保护。为此,我们引入VisShield(视觉隐私盾),一个端到端框架,旨在增强VLM的隐私意识。我们的框架包含两个关键组件:专用指令微调数据集OPTIC(光学隐私文本指令集)和定制训练方法。该数据集提供多样化的隐私导向提示,引导VLM执行目标光学字符识别(OCR)以精确定位敏感文本,而训练策略确保VLM有效适应隐私保护任务。具体而言,我们的方法确保VLM识别隐私敏感文本并输出检测实体的精确边界框,从而有效掩码敏感信息。大量实验表明,我们的框架在处理隐私信息方面显著优于现有方法,为视觉语言模型中的隐私保护应用铺平了道路。我们的数据集和代码可在此处获取。

英文摘要

Visual Language Models (VLMs) have gained significant popularity due to their remarkable capabilities. While various methods exist to enhance privacy in text-based applications, privacy risks associated with visual inputs, such as Protected Health Information (PHI) in medical images, remain largely overlooked. Tackling this problem requires two key tasks: accurately localizing sensitive text and processing it to ensure privacy protection. To this end, we introduce VisShield (Vision Privacy Shield), an end-to-end framework designed to enhance the privacy awareness of VLMs. Our framework consists of two key components: a specialized instruction-tuning dataset, OPTIC (Optical Privacy Text Instruction Collection), and a tailored training methodology. The dataset provides diverse privacy-oriented prompts that guide VLMs to perform targeted Optical Character Recognition (OCR) for precise localization of sensitive text, while the training strategy ensures effective adaptation of VLMs to privacy-preserving tasks. Specifically, our approach ensures that VLMs recognize privacy-sensitive text and output precise bounding boxes for detected entities, allowing for effective masking of sensitive information. Extensive experiments demonstrate that our framework significantly outperforms existing approaches in handling private information, paving the way for privacy-preserving applications in vision-language models. Our dataset and code can be found here.

2606.09131 2026-06-09 cs.AI cs.CL cs.CV cs.LG 新提交

Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

晚期融合足矣:面向视觉饱和的多模态大语言模型的双路径视觉令牌路由

Siyuan Liu, Jinyang Wu

发表机构 * School of Mechanics and Engineering Science, Peking University(北京大学力学与工程科学学院) Department of Automation, Tsinghua University(清华大学自动化系)

AI总结 针对多模态大语言模型中视觉令牌在深层饱和的问题,提出双路径视觉令牌路由(DPVR-LF),在饱和点将视觉令牌路由至单层可训练分支,仅最后层融合,以约3%可训练参数保持性能并减少计算。

详情
Comments
18 pages, 4 figures. Submitted to Pattern Recognition
AI中文摘要

多模态大语言模型(MLLMs)通常继承为单模态文本建模设计的深层对称Transformer骨干,并对图像和语言令牌应用相同的统一计算。这种设计忽略了一个关键的模态不对称性:图像和文本令牌在信息密度、冗余度和所需推理深度上存在显著差异。通过对LLaVA-1.5的逐层分析,我们观察到视觉令牌倾向于在中间层饱和。具体而言,文本到图像的注意力从第0层的0.68下降到第4层的0.07,并在第18层后稳定在0.04附近,而文本令牌则继续受益于深层语义处理。这些发现表明架构对称性与深度异步模态演化之间存在不匹配,导致冗余的视觉计算以及在深层任务特定适应期间感知表示的潜在漂移。受此启发,我们提出了双路径视觉令牌路由(DPVR),一种用于高效MLLMs的模态不对称路由框架。其核心实例DPVR-LF(晚期融合)在饱和点将视觉令牌路由到一个单层可训练侧分支,运行一个跳过深层堆栈中图像位置的十三层纯文本前向传播,并仅在最后一层重新融合视觉和文本流。使用约3%的可训练参数,DPVR-LF在标准基准上保持了有竞争力的多模态性能,同时减少了深层Transformer堆栈中的视觉计算。该结果挑战了视觉令牌必须遍历所有深层语言模型层的传统假设,并表明单个晚期融合层足以在LLaVA风格的MLLMs中维持强大的感知能力。

英文摘要

Multimodal large language models (MLLMs) commonly inherit the deep, symmetric Transformer backbone designed for unimodal text modeling, and apply the same computation uniformly to image and language tokens. This design overlooks a key modality asymmetry: image and text tokens differ substantially in information density, redundancy, and required reasoning depth. Through a layer-wise analysis of LLaVA-1.5, we observe that vision tokens tend to saturate in the middle layers. Specifically, text-to-image attention decreases from 0.68 at layer 0 to 0.07 by layer 4, and stabilizes near 0.04 after layer 18, whereas text tokens continue to benefit from deep semantic processing. These findings suggest a mismatch between architectural symmetry and depth-asynchronous modality evolution, resulting in redundant visual computation and possible drift in perceptual representations during deep task-specific adaptation. Motivated by this, we propose Dual-Path Vision Token Routing (DPVR), a modality-asymmetric routing framework for efficient MLLMs. Its core instantiation, DPVR-LF (Late-Layer Fusion), routes vision tokens at the saturation point into a one-layer trainable side branch, runs a thirteen-layer text-only forward that skips image positions in the deep stack, and re-fuses the visual and textual streams only at the final layer. With approximately 3% trainable parameters, DPVR-LF preserves competitive multimodal performance on standard benchmarks while reducing visual computation in the deep Transformer stack. The results challenge the conventional assumption that vision tokens must traverse all deep language-model layers, and indicate that a single late fusion layer can be sufficient for maintaining strong perceptual competence in LLaVA-style MLLMs.

2606.09118 2026-06-09 cs.AI 新提交

ComplexConstraints and Beyond: Expert Rubrics for RLVR

复杂约束与超越:RLVR的专家评分标准

Sushant Mehta, Liudas Panavas, Edwin Chen

发表机构 * Surge AI

AI总结 提出专家设计的评分标准作为评估和训练信号,通过复杂指令遵循和企业智能体任务验证,在RL训练中显著提升模型性能。

详情
Comments
Accepted to the GEM workshop at ACL 2026: https://gem-workshop.com/
AI中文摘要

随着LLM能力的快速提升,用于评估它们的方法越来越滞后。传统基准依赖于对狭窄、表面约束的程序化验证,但现实世界的指令遵循和智能体任务需要评估细微的、上下文依赖的行为,这些行为难以通过简单的脚本检查。我们提出了一个基于专家策划的评分标准评估的系统分析作为替代范式,借鉴了来自两个领域的实证证据:复杂指令遵循和企业智能体任务。我们首先阐述了构建高质量评分标准的五个设计原则,包括最大可行原子性、意图感知标准设计和迭代LLM判断校准。为了验证这些原则,我们引入了ComplexConstraints,一个新的专家策划的指令遵循数据集,其中每个提示与10-40个原子评分标准配对。我们证明这些专家评分标准不仅是更好的评估工具,而且是高度有效的训练信号:在大约1000个ComplexConstraints示例上训练,使得4B参数模型在指令遵循上提升+15.5%,235B参数模型提升+12.2%,而在评分标准评分的企业环境上进行单周期RL训练产生的收益可以转移到模型从未训练过的分布外基准(BFCL +4.5%,Tau2-Bench +7.4%,Tool-Decathlon +6.8%)。我们的发现表明,专家编写的评分标准既改进了前沿LLM能力的测量,也改进了其发展,作为有效的评估和RL训练信号。

英文摘要

As LLM capabilities advance rapidly, the evaluation methods used to assess them increasingly lag behind. Traditional benchmarks relied on programmatic verification of narrow, surface-level constraints, but real-world instruction following and agentic tasks demand assessment of nuanced, context-dependent behaviors that resist simple scripted checks. We present a systematic analysis of expert-curated rubric-based evaluation as an alternative paradigm, drawing on empirical evidence from two domains: complex instruction following and enterprise agentic tasks. We first articulate five design principles for constructing high-quality rubrics, including Maximum Viable Atomicity, intent-aware criterion design, and iterative LLM-judge calibration. To validate these principles, we introduce ComplexConstraints, a new expert-curated instruction-following dataset in which each prompt is paired with 10-40 atomic rubric criteria. We demonstrate that these expert rubrics are not only better evaluation instruments but also highly effective training signals: training on approximately 1,000 ComplexConstraints examples yields +15.5% improvement for a 4B-parameter model and +12.2% for a 235B-parameter model on instruction following, while single-epoch RL training on a rubric-graded enterprise environment produces gains that transfer to out-of-distribution benchmarks the model was never trained on (+4.5% BFCL, +7.4% Tau2-Bench, +6.8% Tool-Decathlon). Our findings establish that expert-authored rubrics improve both the measurement and the development of frontier LLM capabilities, serving as effective evaluation and RL training signals.
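To make the idea of atomic rubric criteria serving as a training signal more concrete, here is a minimal Python sketch that turns per-criterion pass/fail judgments into a scalar reward. The abstract does not say how criteria are aggregated, so the unweighted pass fraction used here, and the names `rubric_reward`, `criteria`, and `judgments`, are assumptions for illustration only.

```python
from typing import Dict, List


def rubric_reward(judgments: Dict[str, bool], criteria: List[str]) -> float:
    """Return the fraction of rubric criteria judged as satisfied.

    Criteria without a judgment are counted as failures (an assumption).
    """
    if not criteria:
        raise ValueError("rubric must contain at least one criterion")
    passed = sum(1 for c in criteria if judgments.get(c, False))
    return passed / len(criteria)


# Hypothetical rubric for a single prompt.
criteria = [
    "answer is under 200 words",
    "uses exactly three bullet points",
    "mentions the deadline",
    "avoids first-person pronouns",
]
judgments = {
    "answer is under 200 words": True,
    "uses exactly three bullet points": True,
    "mentions the deadline": False,
    "avoids first-person pronouns": True,
}
reward = rubric_reward(judgments, criteria)
```

With many atomic criteria per prompt, a pass-fraction reward varies in small steps rather than jumping between 0 and 1, which is one plausible reason such rubrics can act as a denser signal than a single pass/fail check.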

2606.09117 2026-06-09 cs.LG cs.AI 新提交

Optimizing Energy-based Neural Network Training with Coherent Ising Machine

利用相干伊辛机优化基于能量的神经网络训练

Chen-Rui Fan, Bo Lu, Zhi-Hong Zhang, Run-Qing Zhang, Jing-Wei Wen, Chuan Wang

发表机构 * School of Artificial Intelligence, Beijing Normal University(北京师范大学人工智能学院) Laboratory for Advanced Computing and Intelligence Engineering, Information Engineering University(信息工程大学先进计算与智能工程实验室) China Mobile (Suzhou) Software Technology Company Limited(中移(苏州)软件技术有限公司) School of Science, Beijing University of Posts and Telecommunications(北京邮电大学理学院)

AI总结 本文利用相干伊辛机结合平衡传播训练基于能量的神经网络,并通过Adam优化器加速收敛,展示了在深层架构和卷积操作上的可扩展性,为下一代AI硬件提供了物理框架。

详情
AI中文摘要

尽管伊辛机作为伊辛模型的高级物理求解器,在组合优化和神经网络训练中具有应用潜力,但其在大规模神经网络中的可扩展性仍受限于硬件连接限制和次优的训练方法。在这项工作中,我们利用相干伊辛机(CIM)通过平衡传播训练基于能量的神经网络,实现了与现有软件实现相当的性能。我们进一步通过集成Adam优化器来求解Hopfield能量网络的基态,从而显著提高了收敛速度和求解精度。此外,我们展示了该方法在更深层网络架构和卷积操作上的可扩展性。我们的结果突显了CIM动力学作为训练复杂神经网络的可扩展平台的潜力,为通过模拟电路、光电子或集成光子学实现节能实现提供了途径。这项工作为下一代AI硬件开发建立了一个新颖的物理框架。

英文摘要

While Ising machines serve as advanced physical solvers for the Ising model, enabling applications in combinatorial optimization and neural network training, their scalability for large-scale neural networks remains constrained by hardware connectivity limitations and suboptimal training methodologies. In this work, we leverage a Coherent Ising Machine (CIM) to train an energy-based neural network using Equilibrium Propagation, achieving performance comparable to existing software-based implementations. We further enhance the algorithm by integrating the Adam optimizer to solve for the ground state of a Hopfield energy network, significantly improving convergence speed and solution accuracy. Additionally, we demonstrate the scalability of our approach across deeper network architectures and convolutional operations. Our results highlight the potential of CIM dynamics as a scalable platform for training complex neural networks, offering a pathway toward energy-efficient implementations via analog circuits, optoelectronics, or integrated photonics. This work establishes a novel physical framework for next-generation AI hardware development.

2606.09114 2026-06-09 cs.CL 新提交

MAAM: Anchor-Preserving Compression and Contextual Calibration for Chinese Discriminatory Language Detection

MAAM:面向中文歧视性语言检测的锚点保留压缩与上下文校准

Yuxin Fu, Shijing Si

发表机构 * School of Economics and Finance, Shanghai International Studies University(上海外国语大学国际金融贸易学院)

AI总结 提出MAAM框架,通过保留歧视相关语义锚点并结合上下文先验校准,在轻量级模型上提升中文歧视性语言检测的准确性和校准性,同时构建首个中文LGBT歧视语料库ChLGBT。

详情
AI中文摘要

中文歧视性语言检测具有挑战性,因为有害意图往往是隐式的且依赖上下文。我们提出MAAM(近视-散光锚点机制),一种轻量级、模型无关的框架,受功能性视觉模糊启发:MAAM并非同等保留每个词元,而是保留歧视相关的语义锚点,并通过C-I-S上下文先验(上下文语气、群体身份和立场极性)对其进行校准。我们还引入了ChLGBT,据我们所知,这是首个专注于中文LGBT的歧视性语言数据集,包含8,120条人工标注样本和三个序数标签:显式偏见、隐式偏见和情感强度。在强编码器基线上,MAAM提升了所有三个预测维度,在准确率、F1、Brier分数和期望校准误差上均取得一致增益。与零样本和少样本提示协议下的前沿LLM基线相比,MAAM在保持竞争力的同时,提供了更强的紧凑性和稳定性。这些结果表明,可解释的锚点保留和上下文校准为中文歧视性语言评估提供了一种实用的替代方案,无需依赖更大规模的模型缩放。

英文摘要

Chinese discriminatory-language detection is challenging because harmful intent is often implicit and context-dependent. We propose MAAM (Myopia-Astigmatism Anchor Mechanism), a lightweight, model-agnostic framework inspired by functional visual blur: rather than preserving every token equally, MAAM retains discrimination-relevant semantic anchors and calibrates them with C-I-S contextual priors (Contextual Tone, Group Identity, and Stance Polarity). We also introduce ChLGBT, to our knowledge the first Chinese LGBT-focused discriminatory-language dataset, with 8,120 manually annotated samples and three ordinal labels: explicit bias, implicit bias, and emotional intensity. Across strong encoder baselines, MAAM improves all three prediction dimensions, with consistent gains in accuracy, F1, Brier score, and expected calibration error. Compared with frontier LLM baselines under zero-shot and few-shot prompting protocols, MAAM remains competitive while offering stronger compactness and stability. These results suggest that interpretable anchor preservation and contextual calibration provide a practical alternative to heavier model scaling for Chinese discriminatory-language assessment.
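The two calibration metrics named in the abstract above, Brier score and expected calibration error (ECE), can be sketched in a few lines of Python. This is a simplified binary version with equal-width confidence bins; the paper's labels are ordinal, and its exact metric definitions and bin count are not given in the abstract, so the choices below are assumptions.

```python
from typing import List, Sequence


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared difference between predicted probability and 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Weighted average gap between accuracy and confidence over equal-width bins."""
    bins: List[List[int]] = [[] for _ in range(n_bins)]
    for i, c in enumerate(confidences):
        # A confidence of exactly 1.0 is placed in the last bin.
        bins[min(int(c * n_bins), n_bins - 1)].append(i)

    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(confidences[i] for i in members) / len(members)
        accuracy = sum(1 for i in members if correct[i]) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# A model that says "75% sure" and is right 3 times out of 4 is well calibrated.
calibrated = expected_calibration_error([0.75] * 4, [True, True, True, False])
# A model that is fully confident but right only half the time is not.
overconfident = expected_calibration_error([1.0, 1.0], [True, False])
```

Lower is better for both metrics. The Brier score penalizes confident mistakes on individual samples, while ECE measures whether stated confidence matches observed accuracy in aggregate.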

2606.09112 2026-06-09 cs.LG cs.AI 新提交

Hybridizing Equilibrium Propagation with Ising Machines for Efficient Energy-Based Learning

将平衡传播与伊辛机混合以实现高效的基于能量的学习

Chen-Rui Fan, Bo Lu, Xing-Yu Wu, Tie-Jun Wang, Chuan Wang

发表机构 * School of Artificial Intelligence, Beijing Normal University(北京师范大学人工智能学院) Laboratory for Advanced Computing and Intelligence Engineering, Information Engineering University(信息工程大学先进计算与智能工程实验室) School of Physical Science and Technology, Beijing University of Posts and Telecommunications(北京邮电大学物理科学与技术学院)

AI总结 提出一种受伊辛动力学启发的平衡传播框架,通过扩展相空间动力学替代耗散Hopfield松弛,加速收敛、提高噪声鲁棒性,并在MNIST等数据集上实现与反向传播相当的性能。

详情
AI中文摘要

人工智能的快速发展推动了深度神经网络的重大进步。然而,传统的基于GPU的训练仍然高度耗能,这促使人们探索物理动力学和兼容的基于能量的学习方案,例如平衡传播(EP)。然而,基于EP的训练常常由于相空间收缩而陷入局部最小值。本文介绍了一种受伊辛动力学启发的平衡传播框架,其中耗散的Hopfield松弛被具有共轭变量的扩展相空间动力学所取代。由此产生的训练范式保留了EP的局部两阶段学习规则,同时改变了神经状态达到平衡的物理路径。我们表明,这种动力学降低了有效能量壁垒,加速了收敛,提高了噪声鲁棒性,并在MNIST、FashionMNIST和CIFAR-10上训练了深度卷积Hopfield网络,性能与反向传播相当。

英文摘要

The rapid evolution of artificial intelligence has led to substantial advances in deep neural networks. Nonetheless, conventional GPU-based training remains highly energy-demanding, motivating the exploration of physical dynamics and compatible energy-based learning schemes, such as equilibrium propagation (EP). EP-based training, however, frequently suffers from convergence to local minima due to phase-space contraction. Here we introduce an Ising-dynamics-inspired equilibrium-propagation framework in which dissipative Hopfield relaxation is replaced by an extended phase-space dynamics with conjugate variables. The resulting training paradigm keeps the local two-phase learning rule of EP while changing the physical route by which neural states reach equilibrium. We show that this dynamics lowers effective energy barriers, accelerates convergence, improves noise robustness, and trains deep convolutional Hopfield networks on MNIST, FashionMNIST, and CIFAR-10 with performance comparable to backpropagation.

2606.09110 2026-06-09 cs.CV 新提交

HDRAgent: An Agentic Framework for Multi-Exposure HDR Imaging

HDRAgent: 一种用于多曝光HDR成像的智能体框架

Weiyu Zhou, Tao Hu, Yijian Wang, Xiaogang Xu, Ruixing Wang, Qingsen Yan

发表机构 * School of Computer Science, Northwestern Polytechnical University(西北工业大学计算机学院) Shenzhen Research Institute, Northwestern Polytechnical University(西北工业大学深圳研究院) Zhejiang University(浙江大学) Camera Group, DJI(大疆相机部门)

AI总结 提出首个智能体驱动的HDR成像框架HDRAgent,通过细粒度上下文知识匹配、感知-失真反馈机制和智能体引导的生成对齐策略,自适应选择重建策略,减少复杂动态场景中的鬼影和局部伪影。

详情
AI中文摘要

大多数现有的多曝光HDR方法遵循固定的前馈重建范式,使其在复杂动态场景中容易产生鬼影伪影。为了解决这个问题,我们提出了HDRAgent,这是第一个用于HDR成像的智能体驱动框架,它根据当前场景条件自适应地选择重建策略。具体来说,为了提供场景特定的先验知识,我们引入了一个细粒度上下文知识匹配(FCM)模块。该模块利用多模态大语言模型(MLLM)衍生的场景感知来检索相关的历史案例和工具知识,并将它们组织成结构化证据,用于基于MLLM的自适应工具调度。此外,我们提出了一种感知-失真反馈机制,将执行后的质量评估和伪影诊断转化为结构化反馈,并累积到历史记忆中,以帮助后续的上下文知识细化和策略选择。此外,考虑到极端运动可能使对齐方法失效,我们设计了一种智能体引导的生成对齐策略,该策略使用基于MLLM的动态区域解析,在参考帧引导下重建非参考帧中的不可靠内容。实验表明,HDRAgent有效减少了鬼影和局部伪影,同时实现了具有竞争力或更优的客观性能和视觉质量。

英文摘要

Most existing multi-exposure HDR methods follow a fixed feed-forward reconstruction paradigm, making them prone to ghosting artifacts in complex dynamic scenes. To address this issue, we propose HDRAgent, the first agent-driven framework for HDR imaging, which adaptively selects reconstruction strategies according to the current scene conditions. Specifically, to provide scene-specific prior knowledge, we introduce a fine-grained contextual knowledge matching (FCM) module. This module leverages multimodal large language model (MLLM)-derived scene perception to retrieve relevant historical cases and tool knowledge, organizing them into structured evidence for MLLM-based adaptive tool scheduling. In addition, we propose a perception-distortion feedback mechanism that transforms post-execution quality assessment and artifact diagnosis into structured feedback, which is accumulated in historical memory to help subsequent contextual knowledge refinement and strategy selection. Furthermore, considering that extreme motion can invalidate alignment methods, we design an agent-guided generative alignment strategy that uses MLLM-based dynamic-region parsing to reconstruct unreliable contents in non-reference frames under reference-frame guidance. Experiments demonstrate that HDRAgent effectively reduces ghosting and local artifacts while achieving competitive or superior objective performance and visual quality.

2606.09109 2026-06-09 cs.CV cs.IR cs.LG 新提交

Driving Video Retrieval for Complex Queries with Structured Grounding

面向复杂查询的驾驶视频检索与结构化对齐

Manyi Yao, Sparsh Garg, Christian Shelton, Amit Roy-Chowdhury, Abhishek Aich

发表机构 * NEC Laboratories, America(美国NEC实验室) University of California, Riverside(加州大学河滨分校)

AI总结 提出STRIVE-D框架,通过弱监督领域视频校准规则、融合视觉语言与关键词检索信号,在驾驶视频检索中实现高达84%的top-1准确率提升。

详情
AI中文摘要

大规模视频检索是自动驾驶中数据整理和安全验证的核心,用户不仅希望找到场景,还希望找到诸如切入和急刹车等动态事件。现有的视觉语言和基于关键词的检索方法常常遗漏这些事件,因为相关的运动可能没有在文本中明确描述或通过词汇重叠捕获。基于规则的检索可以更直接地编码此类事件,但它是脆弱的:生成的或手工编写的规则在假设与真实驾驶数据不匹配时常常失败。我们提出了STRIVE-D,一种针对驾驶视频的数据校准检索框架。它使用弱标记的领域内视频来估计查询规则何时可靠,调整与观测数据不匹配的规则,并将校准后的规则分数与视觉语言和基于关键词的检索信号融合。在三个驾驶基准测试中,包括新发布的DrivingDojo上的人工标注事件数据,STRIVE-D相对于最先进方法在top-1准确率上实现了高达84%的相对改进。

英文摘要

Video retrieval at scale is central to data curation and safety validation in autonomous driving, where users want to find not only scenes but also dynamic events such as cut-ins and hard braking. Existing vision-language and keyword-based retrieval methods often miss these events because the relevant motion may not be explicitly described in text or captured by lexical overlap. Rule-based retrieval can encode such events more directly, but it is brittle: generated or hand-written rules often fail when their assumptions do not match real driving data. We propose STRIVE-D, a data-calibrated retrieval framework for driving videos. It uses weakly labeled in-domain videos to estimate when a query rule is reliable, adapt rules that mismatch observed data, and fuse calibrated rule scores with vision-language and keyword-based retrieval signals. Across three driving benchmarks, including newly released human-annotated event data on DrivingDojo, STRIVE-D delivers up to 84% relative improvement in top-1 accuracy over state-of-the-art methods.

2606.09104 2026-06-09 cs.LG cs.AI q-fin.PM 新提交

Addressing Market Regime Changes and Heavy-Tailed Returns in Portfolio Optimization via Bayesian VAR and Elliptical Black-Litterman

通过贝叶斯VAR和椭圆Black-Litterman解决投资组合优化中的市场机制变化和重尾收益问题

Daniil Mikriukov, Ruoyu Sun, Angelos Stefanidis, Jionglong Su, Zhengyong Jiang

发表机构 * University of Liverpool(利物浦大学) Xi'an Jiaotong-Liverpool University(西交利物浦大学)

AI总结 提出BAVAR-BLED算法,结合贝叶斯平均向量自回归和椭圆分布Black-Litterman模型,在TD3架构下自适应分配资产,在道琼斯工业平均指数成分股上实现夏普比率1.72和总收益57.26%。

详情
Comments
9 pages, 3 figures, 4 tables. Extends our prior work [Mikriukov et al., ICIC 2025] on Black-Litterman under Elliptical Distributions (BLED). Manuscript under review
AI中文摘要

用于投资组合优化的深度强化学习框架因其能够从市场数据中动态学习分配规则而显示出前景。然而,这些模型未能考虑肥尾收益,而肥尾收益以更频繁的极端事件为特征,描述了实际市场行为。此外,历史数据被同质化处理,未考虑时间重要性,导致模型在机制变化时失效。我们提出了一种新的BAVAR-BLED算法,该算法在TD3架构内结合了源自贝叶斯平均向量自回归(BAVAR)和使用椭圆分布的Black-Litterman模型(BLED)的方法。BAVAR捕获一组考虑多尺度时间特征的向量自回归表示,从而基于对收益预期和离散矩阵的机制感知估计实现自适应分配决策。这些估计作为BLED的先验输入,BLED使用学生t分布,允许更现实的肥尾收益估计。BAVAR-BLED算法使用Transformer网络进行观点构建,使用CNN进行风险厌恶估计,根据市场条件修改动态分配决策。对道琼斯工业平均指数29只成分股在十年市场周期内的评估表明,BAVAR-BLED显著优于最先进的方法,实现了1.72的夏普比率和2.70的索提诺比率,总收益为57.26%。

英文摘要

Deep reinforcement learning (DRL) frameworks for portfolio optimization have shown promise for their ability to learn allocation rules dynamically from market data. However, these models fail to account for fat-tailed returns, which characterize actual market behavior with more frequent extreme events. Furthermore, historical data is treated homogeneously, without accounting for temporal importance, leading models to fail during regime changes. We propose a new BAVAR-BLED algorithm that combines methods derived from Bayesian-Averaging Vector Autoregressive (BAVAR) and the Black-Litterman model using Elliptical Distributions (BLED) within a TD3 architecture. BAVAR captures a set of vector autoregressive representations that consider multi-scale temporal features, enabling adaptive allocation decisions based on regime-aware estimates of return expectations and dispersion matrices. These estimates serve as prior inputs to BLED, a model that uses Student's t-distributions, allowing for more realistic fat tail return estimates. The BAVAR-BLED algorithm uses transformer networks for view construction and CNNs for risk-aversion estimates, which modify dynamic allocation decisions based on market conditions. An evaluation of 29 Dow Jones Industrial Average constituents over a decade-long market period shows that BAVAR-BLED significantly outperforms state-of-the-art methods, achieving Sharpe and Sortino ratios of 1.72 and 2.70, respectively, and total returns of 57.26%.
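For readers unfamiliar with the two risk-adjusted metrics reported above, the following Python sketch computes Sharpe and Sortino ratios from a series of periodic returns. The abstract does not state the risk-free rate, the target return, or the annualization convention used in the paper, so the defaults below (zero risk-free rate, zero target, 252 periods per year) are assumptions, as are the function names.

```python
import math
from typing import Sequence


def sharpe_ratio(
    returns: Sequence[float], risk_free: float = 0.0, periods_per_year: int = 252
) -> float:
    """Annualized mean excess return divided by its sample standard deviation."""
    excess = [r - risk_free for r in returns]
    mean = sum(excess) / len(excess)
    variance = sum((e - mean) ** 2 for e in excess) / (len(excess) - 1)
    if variance == 0.0:
        raise ValueError("returns have zero variance")
    return mean / math.sqrt(variance) * math.sqrt(periods_per_year)


def sortino_ratio(
    returns: Sequence[float], target: float = 0.0, periods_per_year: int = 252
) -> float:
    """Like Sharpe, but only returns below the target count as risk."""
    excess = [r - target for r in returns]
    mean = sum(excess) / len(excess)
    downside = math.sqrt(sum(min(e, 0.0) ** 2 for e in excess) / len(excess))
    if downside == 0.0:
        raise ValueError("no returns fall below the target")
    return mean / downside * math.sqrt(periods_per_year)


# Toy per-period returns, left unannualized for readability.
toy_returns = [0.01, -0.01, 0.02, 0.0]
toy_sharpe = sharpe_ratio(toy_returns, periods_per_year=1)
toy_sortino = sortino_ratio(toy_returns, periods_per_year=1)
```

Because the Sortino ratio ignores upside volatility, it is usually higher than the Sharpe ratio for a strategy whose large moves are mostly gains, which is consistent with the ordering of the two figures quoted in the abstract.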

2606.09099 2026-06-09 cs.RO 新提交

LAEI: Layered Autonomous Edge Intelligence Framework for Robust UAV Swarm Operations

LAEI: 面向鲁棒无人机蜂群操作的分层自主边缘智能框架

Changmin Park, Wooyong Jung, Hwangnam Kim

发表机构 * Korea University(高丽大学)

AI总结 提出分层自主边缘智能框架,通过机载学习策略与轻量级任务级监督结合,实现无人机蜂群在通信受限、环境不确定和组件故障下的可扩展协调,显著降低任务完成时间并提高效率。

详情
Comments
Preprint. Submitted to arXiv
AI中文摘要

自主无人机蜂群需要可扩展的协调机制,以在有限通信、环境不确定性和组件故障下保持任务性能。集中式方法提供全局协调,但存在通信瓶颈和单节点脆弱性,而完全分散的方法通常缺乏任务级一致性。本文提出了分层自主边缘智能(LAEI),一种无人机蜂群框架,它将机载学习策略与轻量级任务级监督相结合。每个无人机在机载执行局部感知、避障和动作选择,而监督层提供自适应目标重分配、故障感知恢复和上下文相关策略指导,而不直接控制低级动作。LAEI进一步整合了恢复策略,包括动态重新关联、备份监督支持和回退局部自主性,以在代表性故障场景下维持任务连续性。我们在模拟的无人机蜂群场景中评估了LAEI,使用任务完成时间、碰撞率和覆盖效率。结果表明,LAEI减少了任务完成时间并提高了操作效率,同时保持了碰撞感知的分布式无人机级决策。

英文摘要

Autonomous UAV swarms require scalable coordination mechanisms that maintain mission performance under limited communication, environmental uncertainty, and component failures. Centralized approaches provide global coordination but suffer from communication bottlenecks and single-node vulnerabilities, whereas fully decentralized methods often lack mission-level consistency. This paper presents Layered Autonomous Edge Intelligence (LAEI), a UAV-swarm framework that combines onboard learned policies with lightweight mission-level supervision. Each UAV performs local perception, obstacle avoidance, and action selection onboard, while the supervisory layer provides adaptive goal reassignment, fault-aware recovery, and context-dependent policy guidance without directly controlling low-level actions. LAEI further incorporates recovery strategies, including dynamic reassociation, backup supervisory support, and fallback local autonomy, to maintain mission continuity under representative failure scenarios. We evaluate LAEI in simulated UAV-swarm scenarios using mission completion time, collision rate, and coverage efficiency. The results show that LAEI reduces mission completion time and improves operational efficiency while maintaining collision-aware distributed UAV-level decision-making.

2606.09091 2026-06-09 cs.LG cs.CV 新提交

Stabilizing On-Policy Distillation for MLLM Reasoning with Global Normalization

稳定基于策略的蒸馏用于多模态大语言模型推理的全局归一化

Dongze Hao, Zhiwei Jin, Chen Chen, Haonan Lu

发表机构 * OPPO AI Center(OPPO AI中心)

AI总结 针对策略蒸馏中异常状态导致梯度不稳定的问题,提出全局归一化蒸馏策略优化(GNDPO),通过将KL分数转化为批次级相对优势来稳定优化,提升多模态推理任务的训练鲁棒性和性能。

详情
AI中文摘要

基于策略的蒸馏(OPD)最近成为一种重要的后训练范式。通过使用更强的教师模型为采样轨迹提供密集、细粒度的监督,OPD相比依赖稀疏二元或基于结果的环境反馈的可验证奖励强化学习(RLVR)具有明显优势。然而,朴素的token级蒸馏可能因异常状态中的幅度不匹配而遭受梯度不稳定性。为了解决这个问题,我们提出了全局归一化蒸馏策略优化(GNDPO),这是一种实用方法,通过将原始KL分数转化为批次级相对优势来稳定优化。这种归一化有效缓解了梯度爆炸,同时保留了token级指导的优势。实验结果表明,GNDPO在多模态推理任务中显著提高了训练鲁棒性和下游性能。代码已发布在 https://github.com/OPPO-Mente-Lab/GNDPO。

英文摘要

On-policy distillation (OPD) has recently emerged as an important post-training paradigm. By using a stronger teacher model to provide dense, fine-grained supervision for sampled trajectories, OPD offers a clear advantage over reinforcement learning with verifiable rewards (RLVR), which typically depends on sparse binary or outcome-based environmental feedback. However, naive token-level distillation can suffer from gradient instability due to magnitude misalignment in outlier states. To address this issue, we propose Globally Normalized Distillation Policy Optimization (GNDPO), a practical method that stabilizes optimization by transforming raw KL scores into batch-level relative advantages. This normalization effectively mitigates gradient explosions while retaining the benefits of token-level guidance. Experimental results show that GNDPO substantially improves training robustness and downstream performance across multimodal reasoning tasks. The code is released at https://github.com/OPPO-Mente-Lab/GNDPO.
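One plausible reading of "transforming raw KL scores into batch-level relative advantages" is a standardization of per-token scores over the whole batch, sketched below in plain Python. This is not the released GNDPO implementation: the function name, the sign convention (a larger teacher-student KL yields a lower advantage), and the use of a plain z-score are all assumptions made for illustration.

```python
import math
from typing import List


def batch_relative_advantages(
    kl_scores: List[List[float]], eps: float = 1e-8
) -> List[List[float]]:
    """Standardize per-token KL scores using statistics of the entire batch.

    kl_scores[i][t] is the raw score for token t of sampled sequence i.
    """
    flat = [s for seq in kl_scores for s in seq]
    mean = sum(flat) / len(flat)
    std = math.sqrt(sum((s - mean) ** 2 for s in flat) / len(flat))
    # Assumed sign: tokens that diverge more from the teacher get lower advantage.
    return [[-(s - mean) / (std + eps) for s in seq] for seq in kl_scores]


# One outlier token (50.0) dominates the raw scores.
raw = [[0.1, 0.2], [0.3, 50.0]]
adv = batch_relative_advantages(raw)
flat_adv = [a for seq in adv for a in seq]
```

After standardization, the outlier no longer contributes a value hundreds of times larger than its neighbours: for a batch of N tokens, a population z-score cannot exceed sqrt(N - 1) in magnitude, which bounds each token's contribution to the update.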

2606.09088 2026-06-09 cs.RO 新提交

Autonomous FPV Flight with Translational Optical Flow and Uncertainty Mask

基于平移光流与不确定性掩膜的自主FPV飞行

Yang Deng, Yu Hu, Feng Yu, Linzuo Zhang, Danping Zou

发表机构 * Shanghai Jiao Tong University(上海交通大学)

AI总结 提出利用平移光流和不确定性掩膜增强FPV四旋翼自主飞行,在仿真和真实森林环境中实现高达13.91 m/s和11.79 m/s的飞行速度,成功率93.3%。

详情
AI中文摘要

在复杂环境中使用单目RGB相机作为唯一外部传感器的自主FPV四旋翼飞行仍然是一个基本挑战。最近的研究表明,使用光流作为神经网络的输入可以实现杂乱场景中的端到端自主飞行。然而,从光流估计中提取最相关信息是限制敏捷性和鲁棒性的关键瓶颈。现有方法难以将障碍物引起的光流与自运动背景光流分离,并且在膨胀焦点(FoE)附近信噪比低。为了解决这些问题,我们将光流分解为平移和旋转分量,并仅利用捕捉场景几何和深度线索的平移光流。此外,我们引入了一种基于前向和后向光流估计不一致性的不确定性掩膜。该掩膜突出显示障碍物结构,包括FoE区域内的结构。这两个线索被输入到在可微仿真框架中训练的控制策略中,该框架能够实现感知和控制的一阶优化。我们通过在仿真和真实森林环境中的大量实验验证了我们的方法。所提出的系统在仿真中实现了高达13.91 m/s的速度,在真实测试中实现了11.79 m/s的速度,在30次真实试验中成功率为93.3%,几乎使先前报道的单目RGB光流无人机避障系统的6 m/s真实速度翻倍。

英文摘要

Autonomous FPV quadrotor flight in complex environments using a monocular RGB camera as the sole exteroceptive sensor remains a fundamental challenge. Recent research has shown that using optical flow as the input of a neural network can achieve end-to-end autonomous flight in cluttered scenes. However, extracting the most relevant information from the flow estimation is the key bottleneck limiting agility and robustness. Existing methods struggle to disentangle obstacle-induced optical flow from the ego-motion background flow and suffer from low signal-to-noise ratios near the focus of expansion (FoE). To address these issues, we decompose the optical flow into translational and rotational components and utilize only the translational flow, which captures scene geometry and depth cues. In addition, we introduce an uncertainty mask derived from inconsistencies between forward and backward flow estimates. This mask highlights obstacle structures, including those within the FoE region. Both cues are fed to a control policy trained in a differentiable simulation framework, which enables efficient first-order optimization across perception and control. We validate our approach through extensive experiments in both simulated and real-world forest environments. The proposed system achieves robust flight at speeds of up to 13.91 m/s in simulation and 11.79 m/s in real-world tests, with a 93.3% success rate over 30 real-world trials, nearly doubling the previously reported 6 m/s real-world speed of the monocular-RGB optical-flow UAV obstacle avoidance system.
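The decomposition of optical flow into translational and rotational parts can be illustrated with the classical instantaneous-motion model for a pinhole camera. The Python sketch below works in normalized image coordinates and assumes the camera moves with linear velocity `t` and angular velocity `omega` (for example, from a gyroscope) expressed in the camera frame. The abstract does not describe how the authors perform the decomposition, so this is a textbook formulation with hypothetical function names, not their method.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]
Flow = Tuple[float, float]


def rotational_flow(x: float, y: float, omega: Vec3) -> Flow:
    """Flow induced by camera rotation alone; it does not depend on depth."""
    wx, wy, wz = omega
    u = x * y * wx - (1.0 + x * x) * wy + y * wz
    v = (1.0 + y * y) * wx - x * y * wy - x * wz
    return u, v


def translational_flow(x: float, y: float, depth: float, t: Vec3) -> Flow:
    """Flow induced by camera translation; it scales with inverse depth."""
    tx, ty, tz = t
    return (-tx + x * tz) / depth, (-ty + y * tz) / depth


def derotate(flow: Flow, x: float, y: float, omega: Vec3) -> Flow:
    """Remove the rotational component, leaving the depth-dependent part."""
    ru, rv = rotational_flow(x, y, omega)
    return flow[0] - ru, flow[1] - rv


# Synthesize the total flow at one pixel, then recover its translational part.
x, y, depth = 0.3, -0.2, 4.0
t, omega = (0.4, -0.2, 2.0), (0.05, -0.1, 0.02)
trans = translational_flow(x, y, depth, t)
rot = rotational_flow(x, y, omega)
total = (trans[0] + rot[0], trans[1] + rot[1])
recovered = derotate(total, x, y, omega)

# At the focus of expansion (tx/tz, ty/tz) the translational flow vanishes.
foe_flow = translational_flow(t[0] / t[2], t[1] / t[2], depth, t)
```

The last line shows why the signal-to-noise ratio is low near the FoE: translational flow shrinks to zero there regardless of depth, so obstacles straight ahead produce almost no motion cue. This is the situation the abstract's forward-backward uncertainty mask is meant to address.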

2606.09086 2026-06-09 cs.AI New submission

DynaOD: Dynamic Origin-Destination Flow Generation with Discrete-to-Continuous Temporal Semantic Modeling

DynaOD: 基于离散到连续时间语义建模的动态起讫点流量生成

Jie Zhao, Xianqi Dai, Jie Feng, Huandong Wang, Yong Li

Affiliations * Department of Electronic Engineering, BNRist, Tsinghua University(清华大学电子工程系,BNRist) Tsinghua Shenzhen International Graduate School(清华大学深圳国际研究生院) Zhongguancun Academy(中关村学院)

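AI summary Proposes the DynaOD framework, which models temporal semantics from two perspectives, discrete directional trends and continuous temporal evolution, and conditions a pretrained static OD generator in a lightweight, plug-and-play manner. This enables dynamic OD flow generation without historical observations and outperforms baselines in predictive accuracy and distributional fidelity.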

Details
Comments
Accepted by IJCAI2026
AI Chinese abstract

动态起讫点(OD)流量生成旨在仅从时间上下文合成逼真的移动动态,而不依赖历史OD观测。一个关键挑战是将语义时间信号转化为时间上连贯的OD模式,同时保留城市区域固有的空间异质性。我们提出DynaOD,一个语义驱动框架,通过两个互补视角建模时间动态:离散方向趋势,刻画城市活动模式的定性变化;连续时间演化,捕捉这些变化如何随时间展开。通过联合编码这些时间语义,该框架构建时变区域表示,以轻量即插即用方式调节预训练的静态OD生成器。这种模块化设计进一步支持可扩展部署和跨城市迁移。在大型真实世界数据集上的大量实验表明,我们的方法在预测精度和分布保真度上均持续优于代表性基线。代码公开于https://github.com/csjiezhao/DynaOD。

English abstract

Dynamic origin-destination (OD) flow generation seeks to synthesize realistic mobility dynamics from temporal context alone, without relying on historical OD observations. A key challenge is to translate semantic temporal signals into temporally coherent OD patterns while preserving the inherent spatial heterogeneity of urban regions. We propose DynaOD, a semantic-driven framework that models temporal dynamics through two complementary perspectives: discrete directional trends that characterize qualitative shifts in urban activity patterns, and continuous temporal evolution that captures how such shifts unfold over time. By jointly encoding these temporal semantics, the framework constructs time-varying region representations that condition pretrained static OD generators in a lightweight and plug-and-play fashion. This modular design further supports scalable deployment and cross-city transferability. Extensive experiments on large-scale real-world datasets show that our method consistently outperforms representative baselines in both predictive accuracy and distributional fidelity. Code is publicly available at https://github.com/csjiezhao/DynaOD.