arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.29577 2026-05-29 cs.CV

Mitigating State Aliasing in Vision-Language-Action Models via Inverse Dynamics Learning

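Kyujin Lee, Injae Kim, Jihwan Park, Yejun Ju, Minseok Joo, Hyunwoo J. Kim

AI summary: Proposes inverse dynamics learning as an auxiliary objective that directly supervises the VLA vision encoder; by predicting the action between current and future observations, it captures fine-grained visual differences and thereby mitigates state aliasing.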

Abstract

Vision-Language-Action (VLA) models have emerged as a promising framework that unifies perception, reasoning, and control for robot manipulation by adapting pretrained vision-language models (VLMs) to action prediction. However, VLM-derived representations are often insensitive to subtle visual distinctions required for low-level control, causing state aliasing between visually similar states that require substantially different actions. Prior VLA studies improve visual understanding by generating visual or reasoning outputs, such as future frames, 2D grounding points or traces, or intermediate spatial reasoning steps, but these objectives typically shape the vision encoder only indirectly through end-to-end prediction and do not explicitly analyze state aliasing in the learned visual feature space. To mitigate state aliasing, we introduce inverse dynamics learning as an auxiliary objective that directly supervises the VLA vision encoder. By predicting the action between current and future observations, our objective encourages the encoder to capture fine-grained visual distinctions that determine low-level actions. We further use pseudo-reversed supervision to expose the encoder to a broader range of action directions and improve generalization under limited robot demonstrations. Our method applies to diverse VLA baselines, uses only standard observation-action pairs without additional annotations, and preserves the original inference pipeline at test time. Experiments on CALVIN ABC-D and SimplerEnv show consistent gains across diverse VLA baselines. Frozen-encoder probing and state-feature alignment analyses further show that our method learns state-discriminative visual representations that reduce state aliasing and better align with robot state changes.
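
The auxiliary objective described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the encoder features are random arrays, the inverse-dynamics head is a single linear map (`weights`), and the pseudo-reversed label is taken to be the negated action, which only makes sense for delta-style actions.

```python
import numpy as np

def inverse_dynamics_loss(feat_now, feat_future, actions, weights):
    """MSE between the true actions and a linear prediction from paired features."""
    pair = np.concatenate([feat_now, feat_future], axis=1)
    pred = pair @ weights
    return float(np.mean((pred - actions) ** 2))

def with_pseudo_reversed(feat_now, feat_future, actions):
    """Append time-reversed pairs, labelled with the negated action (assumption)."""
    now = np.concatenate([feat_now, feat_future], axis=0)
    fut = np.concatenate([feat_future, feat_now], axis=0)
    act = np.concatenate([actions, -actions], axis=0)
    return now, fut, act

rng = np.random.default_rng(0)
feat_now = rng.normal(size=(4, 8))      # hypothetical encoder features at time t
feat_future = rng.normal(size=(4, 8))   # hypothetical encoder features at time t+k
actions = rng.normal(size=(4, 3))       # hypothetical delta actions
weights = np.zeros((16, 3))             # untrained linear inverse-dynamics head

now, fut, act = with_pseudo_reversed(feat_now, feat_future, actions)
loss = inverse_dynamics_loss(now, fut, act, weights)
```

In a real setup the gradient of this loss would flow into the vision encoder during training, and the head would be dropped at test time, which is consistent with the abstract's statement that the original inference pipeline is preserved.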

2605.29575 2026-05-29 cs.CV

Optimizing Latent Representations for Robust Building Damage Assessment Onboard Earth Observation Satellites

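Thomas Goudemant, Benjamin Francesconi

AI summary: Proposes an AI-based onboard system that encodes pre-disaster images into compact latent representations and compares them in orbit with post-disaster images to localize and classify building damage, reducing downlink volume and improving response speed.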

Comments IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2026), Jun 2026, Denver, United States

Abstract

Rapid identification of damaged buildings after natural disasters or in war zones is crucial to support emergency response and prioritize interventions. Earth Observation constellations provide timely, large-scale coverage, but actionable information is often delayed by data downlink constraints, on-ground processing, and human interpretation. Reducing this latency is essential to improve decision-making responsiveness. In this work, we propose an original AI-based system that enables object-level building damage assessment (localization and damage classification) directly onboard satellites from pre-disaster and post-disaster high-resolution optical imagery. Available pre-disaster images are encoded on the ground into compact latent representations, transmitted to the satellite, and compared onboard with newly acquired post-event observations. Leveraging AI interpretation capabilities and increasing onboard processing capabilities, the proposed design enables processing directly at the data source, reducing the amount of information to be downlinked while preserving task-relevant content and improving overall system responsiveness. We explore the design space through a systematic benchmark of onboard-compatible variants, analyzing the impact of siamese processing, cross-attention, latent-space compression, and robustness-oriented data augmentation. Experiments on the xBD dataset demonstrate reliable and robust damage assessment under misalignment, with minimal performance degradation under strong compression.

2605.29572 2026-05-29 cs.RO cs.HC

Learning to Feel Materials from Multisensory Tactile Data via Interpretable Models

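Li Zou, Yasemin Vardar

AI summary: Proposes an interpretable computational framework that uses multisensory tactile data (pressing, static contact, and sliding interactions) to model human material perception and recognition, finding that thermal and compliance cues are crucial for perceptual modeling and material classification.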

Comments 12 pages, 3 figures, journal

Abstract

Human tactile perception of materials relies on complex multisensory touch cues, yet the relationship between low-level tactile signals and perceptual representations remains poorly understood. This knowledge gap hinders the integration of touch in digital environments and the development of robots capable of human-like tactile perception. Here, we present an interpretable computational framework for modeling human material perception and recognition using multisensory touch data. Our framework comprises three interconnected models: Model 1 maps finger-surface interaction features to psychophysical sensory attributes, Model 2 classifies materials based on these perceptual representations, and Model 3 directly classifies materials from tactile features. The results showed that combining information from pressing, static contact, and sliding interactions improves prediction accuracy, and that thermal cues are particularly informative for both perceptual modeling and material classification. These findings highlight the importance of thermal and compliance cues, which remain underrepresented in current robotic fingers and haptic displays. Incorporating such cues may enhance artificial systems' ability to approximate human material perception and guide the design of more perceptually grounded haptic interfaces.

2605.29570 2026-05-29 cs.CV

DefSynUS: Real-time Patient-specific Intrahepatic Vessel Identification via Deformation-Aware CT-US Domain Adaptation

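Karl-Philippe Beaudet, Yordanka Velikova, Sidaty El Hadramy, Nassir Navab, Philippe Cattin, Juan Verde, Stéphane Cotin

AI summary: Proposes a domain adaptation framework based on physics-based rendering and deformation-aware data augmentation that enables real-time, patient-specific intrahepatic vessel-branch identification during surgery without preoperative ultrasound.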

Abstract

Purpose: Laparoscopic ultrasound (LUS) enhances the safety of liver surgery by visualizing intrahepatic vessels in real-time. Still, vessel identification remains difficult due to probe constraints, complex vascular structure, and tissue deformation. This work aims to enable real-time, patient-specific vessel identification that remains robust under deformation through deformable ultrasound augmentation. Methods: Preoperative CT vessel annotations are used to generate synthetic ultrasound data via optimized physics-based rendering, coupled with domain adaptation to intraoperative ultrasound. The rendering is trained end-to-end for vessel identification and patient-specificity, eliminating the need for preoperative ultrasound. A deformation-aware augmentation simulates realistic intraoperative motion and tissue deformation within the rendering pipeline. Results: In abdominal phantom and limited clinical feasibility experiments (single-case clinical evaluation), the framework achieved real-time intrahepatic vessel-branch identification, maintaining performance under new patient poses. Conclusion: The framework enables real-time vessel identification without preoperative ultrasound and supports technical feasibility, but multi-patient validation is still needed for generalizability and clinical feasibility.

2605.29568 2026-05-29 cs.AI

DeepTool: Scaling Interleaved Deliberation in Tool-Integrated Reasoning via Process-Supervised Reinforcement Learning

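Yang He, Xiao Ding, Bibo Cai, Yufei Zhang, Kai Xiong, Zhouhao Sun, Bing Qin, Ting Liu

AI summary: To address the lack of step-level supervision and self-correction in tool-integrated reasoning, proposes the DeepTool framework, which synthesizes interleaved trajectories and applies GRPO reinforcement learning with an action-centric process reward, significantly improving model performance on multiple benchmarks.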

Abstract

Tool-Integrated Reasoning (TIR) extends LLM capabilities by leveraging external environments. However, existing methods lack the deliberation during sequential tool invocation required for strategic planning and self-correction. While RL mitigates this, conventional TIR approaches are hindered by sparse outcome-based rewards, failing to supervise intermediate reasoning steps and tool invocations. To address this, we propose DeepTool, a novel framework that scales deliberate thinking within the interleaved process of thinking, action, and observation at each turn. In DeepTool, we first introduce a synthesis pipeline that evolves extended thinking into interleaved trajectories, integrating adversarial perturbations to ensure robustness and self-correction. Second, we devise Process-Supervised Reinforcement Learning based on GRPO, which utilizes an Action-Centric Process Reward to reinforce intermediate interleaved thinking and enforce precise tool invocation at every turn. Extensive experiments demonstrate that DeepTool achieves superior performance, boosting Qwen2.5-7B significantly across six benchmarks (e.g., AIME24: 3.2% -> 40.4% and HMMT25: 0.0% -> 28.6%). Furthermore, the token cost-effectiveness analysis confirms the utility of interleaved thinking, demonstrating DeepTool's optimal balance between performance and token efficiency.

2605.29565 2026-05-29 cs.CV cs.RO

From General Vision to Reliable Traversability Estimation: Adapting Vision Foundation Models for Unstructured Outdoor Environments

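Ji-Hoon Hwang, Jisung Bae, Dong-Wook Kim, Yeonkyu Lee, Seung-Woo Seo

AI summary: Proposes the ViTA framework, which adapts vision foundation models to reliable traversability estimation in unstructured outdoor environments through learnable prompts, perspective-diversified training, and geometric knowledge distillation, substantially reducing false positives and improving cross-domain generalization.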

Comments 8 pages, 5 figures

Abstract

Vision-based approaches have become the dominant paradigm for traversability estimation in unstructured outdoor environments, typically adapting vision foundation models (VFMs) via semantic segmentation supervision. However, this paradigm faces three fundamental challenges that undermine its reliability: the task-agnostic design of VFMs, the ambiguity of traversability annotations, and the discrepancy between semantic labels and physical safety. We propose Vision-to-Traversability Adaptation (ViTA), a framework that adapts VFMs for reliable traversability estimation, instantiated on SAM2. ViTA injects task-specific knowledge through learnable traversability prompts while preserving the VFM's cross-domain generalization. To handle annotation ambiguity, we introduce Perspective-Diversified Training, which estimates semantic uncertainty to suppress confident predictions at ambiguous boundaries. To bridge the semantic-traversability discrepancy, we distill geometric knowledge during training, enabling slope and elevation reasoning from RGB images alone at inference. The semantic and geometric outputs are fused into a continuous traversability score that reflects both semantic uncertainty and geometric risk. Evaluations across diverse domains, including challenging real-world off-road datasets, demonstrate that ViTA achieves state-of-the-art IoU and Precision with substantial false-positive reduction and strong cross-domain generalization.

2605.29564 2026-05-29 cs.RO

VE2VF: Vision-Enabled to Vision-Free Distillation via Real-world Reinforcement Learning for Robust Contact-Rich Manipulation

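Victor Kowalski, Chengxi Li, Dongheui Lee

AI summary: Proposes a human-in-the-loop reinforcement learning framework that uses teacher-student distillation to transfer knowledge from a vision-enabled policy to a vision-free policy relying only on proprioception, achieving robust generalization with real-world training and no domain randomization or data augmentation.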

Abstract

When using reinforcement learning (RL) for contact-rich robotic manipulation, vision can provide task-relevant information that accelerates learning beyond what proprioception alone can achieve. However, vision-enabled policies tend to overfit to the visual conditions seen during training, limiting their robustness and transferability. We present a human-in-the-loop RL framework that employs teacher-student distillation to achieve robust performance across multiple task variants, trained entirely in the real world without requiring domain randomization or data augmentation. A vision-enabled teacher distills its knowledge into a vision-free student that relies solely on pose, twist, and wrench sensing, combining fast training with strong task generalization. On the real-world NIST assembly benchmark board, our approach achieves 95% overall success after approximately 50 minutes of training on 3 representative tasks, including robust generalization to 8 unseen task variants. Fine-tuning with distillation achieves full success on the most challenging task. We demonstrate that the resulting policies outperform baselines in both robustness and adaptability.

2605.29562 2026-05-29 cs.RO cs.AI cs.CV

VLA-Pro: Cross-Task Procedural Memory Transfer for Vision-Language-Action Models

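Shengyu Si, Yuanzhuo Lu, Ruimeng Yang, Ziyi Ye, Zuxuan Wu, Yu-Gang Jiang

AI summary: Proposes the VLA-Pro framework, which stores and retrieves task-relevant LoRA adapters as procedural memories to achieve cross-task generalization, with significantly higher success rates on simulated and real-world tasks.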

Abstract

Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation, yet they still struggle to generalize to unseen tasks that necessitate transferring relevant experience across objects, scenes, and action patterns. This paper proposes VLA-Pro, a plug-and-play framework designed to enhance cross-task generalization by storing task-relevant procedural memories at training time and transferring these memories during inference. Specifically, VLA-Pro stores task-specific LoRA adapters as parameterized procedural memories during training. At inference time, VLA-Pro retrieves relevant procedural memories based on the current multi-modal context and dynamically fuses these memories for generating the current action chunk. Experiments on RoboTwin, RLBench, and real-world manipulation tasks show that VLA-Pro consistently improves cross-task generalization across multiple backbones, achieving up to a 207% relative improvement in simulation and increasing real-world success rate from 5.8% to 65.0%. These results suggest that procedural memory retrieval and adaptation provide an effective mechanism for transferring manipulation experience to novel tasks while preserving modularity and execution stability.
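
One plausible reading of "retrieve and dynamically fuse" is sketched below: stored LoRA adapters are ranked by cosine similarity between a context embedding and per-task keys, and the top-k low-rank updates are combined with softmax weights. The retrieval metric, `top_k`, `temperature`, and the key/adapter shapes are all assumptions for illustration; the abstract does not specify them.

```python
import numpy as np

def retrieve_and_fuse(query, keys, adapters, top_k=2, temperature=0.1):
    """Fuse the top-k most similar LoRA adapters into one weight delta.

    Each adapter is a pair (A, B) with A: (rank, d_in) and B: (d_out, rank),
    so its weight update is B @ A with shape (d_out, d_in).
    """
    q = query / np.linalg.norm(query)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    sims = k @ q                                  # cosine similarity per stored task
    top = np.argsort(sims)[::-1][:top_k]          # indices of the best matches
    logits = sims[top] / temperature
    w = np.exp(logits - logits.max())
    w /= w.sum()                                  # softmax fusion weights
    delta = sum(wi * (adapters[i][1] @ adapters[i][0]) for wi, i in zip(w, top))
    return delta, top, w

rng = np.random.default_rng(0)
d_in, d_out, rank = 6, 4, 2
keys = np.eye(3, 5)                               # three hypothetical task keys
adapters = [(rng.normal(size=(rank, d_in)), rng.normal(size=(d_out, rank)))
            for _ in range(3)]
query = np.array([0.9, 0.1, 0.0, 0.0, 0.0])       # hypothetical context embedding
delta, top, w = retrieve_and_fuse(query, keys, adapters)
```

The fused `delta` would be added to a frozen backbone weight for the current action chunk, which keeps the stored adapters modular.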

2605.29561 2026-05-29 cs.AI cs.SE

ParaTool: Shifting Tool Representations from Context to Parameters

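Zekai Yu, Qi Meng, Qizhi Chu, Yu Hao, Chuan Shi, Cheng Yang

AI summary: Proposes the ParaTool framework, which projects each tool into a loadable set of parameters and combines three stages (parametric tool pre-training, soft tool selection, and parametric tool fine-tuning) so that large language models can call tools without in-context documentation, significantly outperforming ICL-based baselines on Stable ToolBench and BFCL.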

Abstract

Tool calling extends large language models (LLMs) by enabling grounded interaction with external executable interfaces, thereby supporting environment-coupled problem solving. However, mainstream in-context learning (ICL) approaches typically incorporate detailed tool documentation and usage examples directly into the context. This results in substantial inference overhead and heightened risks of hallucination as the context length grows. Conversely, while tuning-based methods improve general tool-calling capabilities, they often fail to effectively internalize the specific details of previously seen tools, thereby retaining a dependency on in-context documentation. To address these limitations, we propose ParaTool, a framework that projects each tool into a dedicated, loadable set of parameters. Equipped with a dynamic integration of these parameterized tools, the LLM can perform tool calling without relying on in-context documents or examples. Specifically, our approach consists of three stages: (1) parametric tool pre-training encapsulates the knowledge of different tools into independent parameter modules; (2) soft tool selection employs a gating network to dynamically weigh and aggregate relevant tool parameters; and (3) parametric tool fine-tuning jointly updates tool parameters to align the training and inference processes. Experiments on Stable ToolBench and BFCL demonstrate that ParaTool significantly outperforms strong ICL-based baselines, achieving superior performance while reducing computational complexity.
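
Stage (2), soft tool selection, can be pictured as a gate that scores every tool for the current query and returns a weighted sum of the tools' parameter modules. The sketch below assumes a single linear gating layer followed by a softmax, and flat parameter vectors per tool; both are simplifications chosen for illustration rather than details from the paper.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def soft_tool_selection(query_embedding, gate_weights, tool_params):
    """Weight each tool's parameter module by its gate score and sum them."""
    gate = softmax(gate_weights @ query_embedding)     # one weight per tool
    merged = np.tensordot(gate, tool_params, axes=1)   # weighted sum of modules
    return gate, merged

# Hypothetical setup: 3 tools, 2-d query embedding, 4 parameters per tool.
gate_weights = np.array([[5.0, 0.0],
                         [0.0, 5.0],
                         [0.0, 0.0]])
tool_params = np.arange(12, dtype=float).reshape(3, 4)
query_embedding = np.array([1.0, 0.0])

gate, merged = soft_tool_selection(query_embedding, gate_weights, tool_params)
```

Because the gate is soft, the merged parameters stay differentiable with respect to the gating weights, which is what would let a later fine-tuning stage train selection and tool parameters jointly.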

2605.29560 2026-05-29 cs.AI

Battery-Sim-Agent: Leveraging LLM-Agent for Inverse Battery Parameter Estimation

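Jiawei Chen, Xiaofan Gui, Shikai Fang, Shengyu Tao, Shun Zheng, Weiqing Liu, Jiang Bian

AI summary: Proposes the Battery-Sim-Agent framework, which reframes inverse battery parameter estimation as a reasoning task: an LLM agent interacts in a closed loop with a high-fidelity simulator, forming physical hypotheses and structured parameter updates, and significantly outperforms black-box optimization methods such as Bayesian optimization.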

Abstract

Parameterizing high-fidelity "digital twins" of batteries is a critical yet challenging inverse problem that hinders the pace of battery innovation. Prevailing methods formulate this as a black-box optimization (BBO) task, employing algorithms that are sample-inefficient and blind to the underlying physics. In this work, we introduce a new paradigm that reframes the inverse problem as a reasoning task, and present Battery-Sim-Agent, the first framework to deploy a Large Language Model (LLM) agent in a closed loop with a high-fidelity battery simulator. The agent mimics a human scientist's workflow: it interprets rich, multi-modal feedback from the simulator, forms physically-grounded hypotheses to explain discrepancies, and proposes structured parameter updates. On a systematically constructed benchmark suite spanning diverse battery chemistries, operating conditions, and difficulty levels, our agent significantly outperforms strong BBO baselines like Bayesian optimization in identifying accurate parameters. We further demonstrate the framework's capability in complex long-horizon degradation fitting tasks and validate its practical applicability on real-world battery datasets. Our results highlight the promise of LLM-agents as reasoning-based optimizers for scientific discovery and battery parameter estimation.

2605.29559 2026-05-29 cs.CL

LiteCoder-Terminal: Scaling Long-Horizon Terminal Environments for Learning Language Agents

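Xiaoxuan Peng, Kaiqi Zhang, Xinyu Lu, Boxi Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun

AI summary: Proposes LiteCoder-Terminal-Gen, a zero-dependency synthesis framework that automatically generates executable and verifiable terminal training environments and builds large-scale SFT and RL datasets; supervised fine-tuning and Direct Multi-turn Preference Optimization significantly improve language agents on terminal tasks.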

Abstract

Mastering terminal environments requires language agents capable of multi-step planning, feedback-grounded execution, and dynamic state adaptation. However, training such agents is currently bottlenecked by a reliance on scraped external repositories, which limits domain diversity, environment controllability, and the targeting of specific capability deficits. We introduce LiteCoder-Terminal-Gen, a zero-dependency synthesis pipeline that autonomously generates executable and verifiable terminal training environments directly from domain specifications. Using this framework, we construct two large-scale resources: LiteCoder-Terminal-SFT, comprising 11,255 expert trajectories across 10 domains, and LiteCoder-Terminal-RL, featuring 602 verifiable environments for trajectory-level preference optimization. Supervised fine-tuning of Qwen-family models on our SFT dataset yields agents that significantly outperform their base counterparts. Notably, our 32B variant achieves 29.06%, 18.54%, and 34.00% pass@1 on Terminal Bench 1.0, 2.0, and Pro, respectively. Furthermore, applying Direct Multi-turn Preference Optimization (DMPO) on our RL environments yields additional performance gains. These results systematically demonstrate that fully synthetic, executable environments offer a scalable and verifiable supervision signal for mastering complex, real-world command-line workflows.
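
For readers unfamiliar with the pass@1 figures quoted above, the sketch below shows the standard unbiased pass@k estimator from the code-generation literature, which reduces to the fraction of successful attempts when k = 1. The abstract does not state how many attempts per task were sampled or whether this exact estimator was used, so treat the numbers in the example as made up.

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k attempts, drawn from n with c successes, passes."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a success
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(results, k=1):
    """Average pass@k over tasks; results is a list of (attempts, successes) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Hypothetical per-task results: (number of attempts, number that passed).
results = [(10, 3), (10, 0), (10, 10), (10, 5)]
score = benchmark_pass_at_k(results, k=1)
```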

2605.29558 2026-05-29 cs.CV

TAE: Target-aware enhancer for nighttime UAV tracking

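Yanyan Chen, Ruigang Fu, Yu Song, Ping Zhong

AI summary: Proposes TAE, a target-aware low-light enhancement framework that uses weak supervision from tracking boxes for region-aware enhancement and adaptive RGB multi-curve fusion, significantly improving nighttime UAV tracking, and contributes the DarkSOT benchmark of 268 sequences.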

Comments Accepted at ICIP 2026. Dataset is available at: https://github.com/Fu0511/DarkSOT-Dataset

Abstract

Severe image degradation under low-light nighttime conditions constitutes a core bottleneck preventing all-day applications for UAV-based single object tracking. Existing image enhancement methods often struggle to distinguish between target and background regions, which can easily amplify background noise or compromise target features. To overcome this limitation, we propose TAE, a target-aware low-light enhancement framework tailored for nighttime object tracking. Guided explicitly by weak supervisory signals from tracking bounding boxes, the framework performs region-aware enhancement to ensure operations focus on the target area. It further adopts an adaptive RGB multi-curve fusion mechanism to achieve refined modeling and adaptive adjustment across different regions. To facilitate research in this domain, we also contribute DarkSOT, a new benchmark for nighttime UAV tracking, comprising 268 sequences across 9 target categories. Experimental results on DarkSOT and UAVDark135 demonstrate that TAE significantly improves tracking performance in low-light nighttime scenarios, exhibiting strong robustness and generalization. The DarkSOT dataset is available at https://github.com/Fu0511/DarkSOT-Dataset.

2605.29556 2026-05-29 cs.AI

Opt-Verifier: Unleashing the Power of LLMs for Optimization Modeling via Dual-Side Verification

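Haoyang Liu, Jie Wang, Boxuan Niu, Xiongwei Han, Yian Xu, Mingxuan Ye, Zijie Geng, Fangzhou Zhu, Tao Zhong, Mingxuan Yuan, Jianye Hao

AI summary: Proposes the Opt-Verifier framework, which uses large language models to build mathematical optimization models automatically and applies dual-side verification on the structure side and the solution side, significantly improving modeling accuracy.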

Journal ref International Conference on Machine Learning (ICML), 2026

Abstract

Building mathematical optimization models is critical in operations research (OR), but it requires substantial human expertise. Recent advancements have utilized large language models (LLMs) to automate this modeling process. However, existing works often struggle to verify the correctness of the generated optimization models, checking neither the rationality of the constraints and variables nor the validity of solutions to the generated models. This hampers the subsequent verification and correction steps and thus severely hurts modeling accuracy. To address this challenge, we propose a novel LLM-based framework with Dual-side Verification (Opt-Verifier) from both structure and solution perspectives, thereby improving the modeling accuracy. The structure-side verification ensures that the modeling structure of the generated optimization models aligns with the original problem description, accurately capturing the problem's constraints and requirements. Meanwhile, the solution-side verification interprets and evaluates the solutions' validity, confirming that the optimization models are logically and mathematically sound. Experiments on popular benchmarks demonstrate that our approach achieves over 20% improvement in accuracy.

2605.29555 2026-05-29 cs.CL

From Blind Guess to Informed Judgment: Teaching LLMs to Evaluate Materials by Building Knowledge-Augmented Preference Signals

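Yeyong Yu, Wenya Hu, Xing Wu, Quan Qian

AI summary: Proposes MaterEval, a knowledge-augmented preference signal framework that uses pairwise preference data to steer large language models from intuitive judgment toward reliable, evidence-based evaluation, and introduces a fast-slow reasoning scheme to balance throughput, cost, and reliability; effectiveness is validated on high-entropy alloy assessment.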

Comments 33 pages, 5 figures

Abstract

As candidate generation and high-throughput experimentation advance, the primary bottleneck in materials discovery is shifting from property prediction to making reliable evaluations among massive candidate sets. We propose a Knowledge-Augmented Preference Signals Framework, MaterEval, that automatically produces, for the same candidate, two evaluations: an informed judgment that follows expert rules and provides supporting evidence, and a rule-removed blind guess. By pairing the two evaluations as preference data, we guide general-purpose large language models (LLMs), originally lacking materials-specific criteria, from intuitive judgment toward reliable evaluation supported by explicit evidence. To balance throughput, cost, and reliability, we further introduce a fast-slow reasoning scheme that decouples large-scale rapid screening from in-depth review on a small subset. Using high-entropy alloy (HEA) assessment as a case study, we show that, without external retrieval and relying solely on internalized capabilities, small open-source LLMs achieve substantial gains in accuracy, conclusion consistency, and evidence discrimination, approaching the performance of rule-based closed-source LLMs. These results demonstrate that expert rules can be systematically transformed into learnable preference signals, enabling a low-cost and deployable evaluation module for autonomous materials discovery loops.

2605.29549 2026-05-29 cs.CV

Learning Representations from 3D Gaussian Splats

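Julia Farganus, Krzysztof Żurawicki, Arkadiusz Gaweł, Weronika Jakubowska, Halina Kwaśnicka

AI summary: Compares several geometric deep learning architectures to evaluate how well 3D Gaussian Splatting scene representations support classification, revealing how architecture and input features affect representation quality.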

Comments 5 figures, 15 pages

Abstract

3D Gaussian Splatting (3DGS) is a recent approach for scene rendering. Although primarily designed for view synthesis, its potential for scene understanding tasks remains underexplored. In this work, we conduct a comparative evaluation of various geometric deep learning architectures for the classification of 3D scenes represented using Gaussian Splatting. We benchmark point-based and graph-based models across both traditional point cloud datasets and dedicated Gaussian Splatting datasets. Scenes are embedded into latent representations, which are evaluated through end-to-end classification, linear probing, and clustering analysis. Our study provides insight into the suitability of different geometry-aware architectures and input feature configurations for learning effective 3D Gaussian Splat representations. The results highlight consistent differences between architectural families and reveal the impact of Gaussian-specific attributes on the quality of representation.

2605.29547 2026-05-29 cs.LG cs.AI math.OC

Singularity-aware Optimization via Randomized Geometric Probing: Towards Stable Non-smooth Optimization

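Ruoran Xu, Borong She, Xiaobo Jin, Qiufeng Wang

AI summary: To address gradient chattering of the Adam optimizer in non-smooth optimization, proposes Singularity-aware Adam (S-Adam), which dynamically adjusts step sizes using a Local Geometric Instability (LGI) metric, stabilizing training and improving generalization.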

Comments International Conference on Machine Learning (ICML), 2026

详情
AI中文摘要

深度学习优化严重依赖于损失景观平滑的假设,而现代架构由于ReLU激活和量化算子等非光滑组件系统性地违反了这一条件。在这种非光滑情况下,Adam等自适应优化器会出现梯度抖动,即由Clarke次微分内冲突信号引起的剧烈振荡,导致收敛性差和泛化能力欠佳。为解决此问题,我们引入了奇异性感知Adam(S-Adam),一种通过基于局部几何不稳定性动态调整步长来稳定训练的新型优化器。我们的关键贡献是局部几何不稳定性(LGI)度量,一种从随机方向导数方差导出的Clarke次微分直径的计算高效估计量。S-Adam采用自适应阻尼机制$\exp(-\lambda\rho)$,在高不稳定性区域减缓更新,同时在平滑盆地保持快速收敛。我们使用微分包含提供了严格的收敛性分析,证明S-Adam以最优的$O(1/\sqrt{T})$速率几乎必然收敛到$(\delta,\epsilon)$-Clarke稳定点。在量化感知训练(QAT)和高噪声小批量学习上的实证评估表明,S-Adam持续优于AdamW和Prox-SGD,在CIFAR-100上实现高达6%的准确率提升,在TinyImageNet上实现3%的提升,同时有效缓解梯度振荡。

英文摘要

Deep learning optimization relies heavily on the assumption of smooth loss landscapes, a condition systematically violated by modern architectures due to non-smooth components such as ReLU activations and quantization operators. In such non-smooth regimes, adaptive optimizers such as Adam suffer from gradient chattering, violent oscillations caused by conflicting signals within the Clarke subdifferential, leading to poor convergence and suboptimal generalization. To address this, we introduce Singularity-aware Adam (S-Adam), a novel optimizer that stabilizes training by dynamically modulating step sizes based on local geometric instability. Our key contribution is the Local Geometric Instability (LGI) metric, a computationally efficient estimator of the Clarke subdifferential diameter derived from the variance of randomized directional derivatives. S-Adam incorporates an adaptive damping mechanism $\exp(-\lambda\rho)$ that decelerates updates in high-instability regions while preserving fast convergence in smooth basins. We provide a rigorous convergence analysis using differential inclusions, proving that S-Adam converges almost surely to $(\delta,\epsilon)$-Clarke stationary points at the optimal $O(1/\sqrt{T})$ rate. Empirical evaluations on Quantization-Aware Training (QAT) and high-noise small-batch learning demonstrate that S-Adam consistently outperforms AdamW and Prox-SGD, achieving accuracy gains of up to 6 percent on CIFAR-100 and 3 percent on TinyImageNet while effectively mitigating gradient oscillations.
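The damping idea in this abstract can be illustrated with a small NumPy sketch. The abstract does not give the LGI estimator, so `instability_proxy` below is a hypothetical stand-in, not the authors' metric: it probes random unit directions and measures how far the two one-sided directional derivatives are from cancelling. That gap is near zero at smooth points and positive at kinks. The names `instability_proxy`, `damping`, `num_dirs`, `h`, and `lam` are assumptions made for illustration.

```python
import numpy as np


def instability_proxy(f, x, num_dirs=32, h=1e-4, seed=0):
    """Hypothetical non-smoothness proxy (not the paper's LGI estimator).

    For each random unit direction u, the one-sided finite-difference
    derivatives along +u and -u cancel at a smooth point, so their sum is
    close to zero there and clearly positive at a kink such as |x| at 0.
    Returns the mean squared gap over the sampled directions.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    fx = f(x)
    gaps = np.empty(num_dirs)
    for i in range(num_dirs):
        u = rng.standard_normal(x.shape)
        u /= np.linalg.norm(u)
        d_plus = (f(x + h * u) - fx) / h   # one-sided derivative along +u
        d_minus = (f(x - h * u) - fx) / h  # one-sided derivative along -u
        gaps[i] = d_plus + d_minus
    return float(np.mean(gaps ** 2))


def damping(rho, lam=1.0):
    """Step-size multiplier exp(-lam * rho), as named in the abstract."""
    return float(np.exp(-lam * rho))
```

Under these assumptions, a step size scaled by `damping(rho)` stays close to its nominal value where the proxy is small and shrinks where the probe detects a kink.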

2605.29543 2026-05-29 cs.LG cs.AI cs.CL cs.HC cs.IR

SCOPE: A Lightweight-training LLM Framework for Air Traffic Control Readback Monitoring

SCOPE:一种用于空中交通管制复诵监控的轻量训练LLM框架

Qihan Deng, Minghua Zhang, Yang Yang, Zhenyu Gao

AI总结 提出SCOPE框架,通过冻结LLM结合插件式开放集分类器和上下文学习机制,实现高效准确的空管复诵监控,在少样本设置下开放集检测准确率达91.05%,异常纠正率96.63%。

详情
AI中文摘要

飞行员对空中交通管制(ATC)语音指令的复诵是航空运输中防止沟通失误的主要保障。然而,复诵异常仍与约80%的航空事故相关。这一脆弱性因交通量增加和认知负荷升高而进一步加剧,从而推动了机器自动化复诵监控的需求。传统的基于规则和机器学习的方法难以在高度可变且不断演变的空管-飞行员通信术语中泛化。尽管大语言模型(LLM)凭借其强大的推理和泛化能力开辟了新途径,但现有方法在实践中仍面临部署和计算障碍。在这项工作中,我们提出了SCOPE(Semantic reasoning for Communication via Open-set Plug-in with Examples),一种新颖的轻量训练LLM框架,提升了基于机器的ATC复诵监控的效率和准确性。核心思想是在冻结的LLM之上,将插件式开放集分类器与精心设计的上下文学习机制相结合。在半合成通信数据集上的大量实验表明,SCOPE在实现运行环境所需的低延迟响应的同时,达到了优越的准确性。在少样本设置下,SCOPE在开放集检测中达到91.05%的准确率,并纠正了96.63%的异常复诵,从而在提供决策解释的同时优于现有最强基线。这些发现证明了我们的框架作为通向可解释和可控的ATC复诵监控的实用途径的潜力。

英文摘要

Pilot readback of Air Traffic Control (ATC) voice instructions is a primary safeguard against miscommunication in air transportation. However, readback anomalies remain implicated in approximately 80% of aviation incidents. This vulnerability is further exacerbated by rising traffic volume and elevated cognitive workload, thereby motivating automated readback monitoring by machine. Traditional rule-based and machine learning approaches struggle to generalize across the highly variable and evolving phraseology of air traffic controller-pilot communications. While Large Language Models (LLMs) have opened a new avenue through their strong reasoning and generalization capabilities, existing approaches still face deployment and computational barriers in practice. In this work, we propose Semantic reasoning for Communication via Open-set Plug-in with Examples (SCOPE), a novel lightweight-training LLM framework that advances both the efficiency and accuracy of machine-based ATC readback monitoring. The core idea is to couple a plug-in open-set classifier with a carefully designed in-context learning mechanism on top of a frozen LLM. Extensive experiments on the semi-synthetic communication dataset show that SCOPE attains superior accuracy while delivering the low-latency response required for operational environments. Under a few-shot setting, SCOPE achieves 91.05% accuracy in open-set detection and corrects 96.63% of anomalous readbacks, thereby outperforming the strongest available baselines while providing explanations for its decisions. These findings demonstrate the potential of our framework as a practical pathway toward interpretable and controllable ATC readback monitoring.

2605.29538 2026-05-29 cs.CV

RadioFormer3D: Weakly Supervised 3D Radio Map Estimation in Low-Altitude Airspace via Generative Modeling

RadioFormer3D:通过生成式建模在低空空域中进行弱监督三维无线电地图估计

Zheng Fang, Junjie Liu, Kangjun Liu, Jianguo Zhang, Yaowei Wang, Ke Chen

AI总结 提出RadioFormer3D模型,采用傅里叶采样编码器、体素解码器和联合频谱完整性损失,在弱监督下实现三维空间稀疏测量的无线电地图估计,有效提升未标注高度层的重建质量。

详情
AI中文摘要

随着三维环境中无线应用(如低空空域和三维异构网络)的出现,无线电地图估计越来越需要表征信号在水平和垂直维度上的传播。然而,由于空间稀疏性增加和连续高度上的监督有限,将无线电地图估计从二维扩展到三维仍然具有挑战性。在本文中,我们提出了RadioFormer3D,一种专门用于弱监督下体素频谱重建的模型。基于RadioFormer的双流多粒度融合架构,RadioFormer3D引入了基于傅里叶的采样编码器和体素解码器,以有效处理三维空间中的稀疏测量。为了缓解垂直监督的缺乏,我们提出了联合频谱完整性损失,它将体素级伪标签监督、地图级几何感知无线电渲染和像素级局部约束整合到一个统一的优化方案中。这种设计使模型能够在稀疏监督下更有效地捕捉复杂的垂直结构关系。在多个无线电地图数据集上的大量实验表明,与现有代表性方法相比,RadioFormer3D实现了优越的整体性能。特别是,它在保持精度和推理效率之间良好权衡的同时,在未标注高度层上展示了改进的重建质量,使其成为未来三维环境感知无线网络的一个非常有前景的解决方案。

英文摘要

With the emergence of wireless applications in three-dimensional environments, such as the low-altitude airspace and 3D heterogeneous networks, radio map estimation is increasingly required to characterize signal propagation across both horizontal and vertical dimensions. However, extending radio map estimation from 2D to 3D remains challenging due to increased spatial sparsity and limited supervision across continuous altitudes. In this paper, we propose RadioFormer3D, a specialized model for volumetric spectrum reconstruction under weak supervision. Building on the dual-stream, multi-granularity fusion architecture of RadioFormer, RadioFormer3D introduces a Fourier-based sampling encoder and a volumetric decoder to efficiently process sparse measurements in 3D space. To alleviate the lack of vertical supervision, we propose the Joint Spectrum Integrity Loss, which integrates volume-level pseudo-label supervision, map-level geometry-aware radio rendering, and pixel-level localized constraints within a unified optimization scheme. This design enables the model to capture complex vertical structural relationships more effectively under sparse supervision. Extensive experiments across several radio map datasets show that RadioFormer3D achieves superior overall performance compared to representative existing methods. In particular, it demonstrates improved reconstruction quality at unlabeled altitudes while maintaining a favorable trade-off between accuracy and inference efficiency, positioning it as a highly promising solution for future 3D environment-aware wireless networks.

2605.29535 2026-05-29 cs.LG

AsymVLM: Asymmetric Token Pruning for Efficient Vision-Language Model Inference

AsymVLM:面向高效视觉-语言模型推理的非对称令牌剪枝

Yilin Feng, Ahmed Burak Gulhan, Mahmut Taylan Kandemir

AI总结 针对视觉和文本令牌在预填充与解码阶段的不同特性,提出非对称剪枝方法AsymVLM,通过视觉令牌的激进剪枝和文本令牌的基于阈值的驱逐,实现高达54%的FLOPs节省并在文档和图表理解任务上提升2-3%的准确率。

详情
AI中文摘要

视觉-语言模型(VLM)每张图像处理数千个视觉令牌,而文本令牌相对较少,但现有压缩方法对两种模态一视同仁。我们观察到两种模态具有根本不同的特性:视觉令牌在空间上冗余且主导预填充阶段,而文本令牌具有因果依赖性并在解码过程中累积。基于这种非对称性,我们提出并实证评估了AsymVLM,该方法在预填充前使用学习的重要性评分器结合每样本自适应预算对视觉令牌进行激进剪枝,并仅在文本令牌超过固定预算时执行基于时间阈值的驱逐。实验表明,AsymVLM在现有方法中实现了最高的FLOPs节省(高达54%),同时在视觉信息空间局部化且与查询相关的文档和图表理解任务上,比现有方法提升2-3%的准确率,并在整体基准上保持竞争性精度。在文本主导的场景中,我们的驱逐策略通过适应VLM的短上下文特性,显著优于标准的LLM缓存压缩方法。

英文摘要

Vision-Language Models (VLMs) process thousands of visual tokens per image alongside comparatively few text tokens, yet existing compression methods treat both modalities uniformly. We observe that the two modalities have fundamentally different properties: vision tokens are spatially redundant and dominate prefill, while text tokens are causally dependent and accumulate during decoding. Based on this asymmetry, we propose and empirically evaluate AsymVLM, which applies aggressive pruning to vision tokens before prefill using a learned importance scorer with per-sample adaptive budgeting, and temporal threshold-based eviction to text tokens only when they exceed a fixed budget. Our experiments indicate that AsymVLM achieves the highest FLOPs savings (up to 54%) among state-of-the-art methods while outperforming existing approaches by 2-3% on document and chart understanding tasks where visual information is spatially localized and query-specific, and maintaining competitive accuracy on holistic benchmarks. In text-dominated scenarios, our eviction strategy substantially outperforms standard LLM cache compression methods by adapting to the short-context nature of VLMs.
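The asymmetric treatment described in this abstract can be sketched in a few lines of NumPy. The abstract does not specify the scorer, the budgeting rule, or the eviction criterion, so everything below is an assumption for illustration: `prune_vision_tokens` keeps the top-scoring tokens until a `coverage` fraction of softmax score mass is reached (one possible form of per-sample adaptive budgeting), and `evict_text_tokens` acts only when the cache exceeds `budget`, dropping entries older than `max_age` steps.

```python
import numpy as np


def prune_vision_tokens(scores, coverage=0.9, min_keep=1):
    """Keep the highest-scoring tokens whose softmax mass reaches `coverage`.

    Peaked score distributions keep few tokens and flat ones keep many, so
    the budget adapts per sample. This rule is an illustrative assumption.
    """
    scores = np.asarray(scores, dtype=float)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    order = np.argsort(-probs)                   # most important first
    cum = np.cumsum(probs[order])
    k = int(np.searchsorted(cum, coverage)) + 1  # smallest prefix reaching coverage
    k = max(min_keep, min(k, len(scores)))
    return np.sort(order[:k])                    # kept indices, original order


def evict_text_tokens(last_used_step, current_step, budget, max_age):
    """Return indices of cached text tokens to keep.

    Nothing is evicted while the cache fits the budget. Once it does not,
    tokens whose last use is more than `max_age` steps old are dropped.
    """
    n = len(last_used_step)
    if n <= budget:
        return list(range(n))
    return [i for i, t in enumerate(last_used_step) if current_step - t <= max_age]
```

The design point is that the two functions run at different times: vision pruning happens once before prefill, while text eviction is checked during decoding and is a no-op for short contexts.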

2605.29534 2026-05-29 cs.AI

UI-KOBE: Knowledge-Oriented Behavior Exploration for Lightweight Graph-Guided GUI Agents

UI-KOBE:面向轻量级图引导GUI代理的知识导向行为探索

Yuxiang Chai, Han Xiao, Xinyu Fu, Jinpeng Chen, Rui Liu, Hongsheng Li

AI总结 提出UI-KOBE框架,通过自动构建应用知识图谱并引导轻量级GUI代理进行运行时决策,以提升其移动端GUI任务执行效果。

详情
AI中文摘要

近期移动GUI代理的进展显示出自动化移动任务的强大潜力,但大多数有效系统仍依赖大型视觉语言模型进行截图理解和长期规划。可直接部署在移动设备上的小型GUI代理在实际应用中更具吸引力,具有更低的推理成本和更好的敏感设备信息保护。然而,由于模型容量有限,这些轻量级代理在仅凭截图端到端规划和执行GUI任务时仍不可靠。我们提出知识导向行为探索(UI-KOBE),一种利用可复用的应用特定图知识来改进轻量级移动GUI代理的框架。UI-KOBE首先自主探索移动应用并构建应用知识图谱,其中节点代表不同的UI状态,边代表可执行的转换。运行时,轻量级GUI代理将图作为外部指导:给定用户任务和当前截图,它识别当前图节点,并选择与该节点关联的自循环动作、相邻转换、任务完成或回退自由动作。通过用应用特定的图指导支持运行时决策,UI-KOBE减轻了端到端GUI规划的负担,帮助轻量级模型更有效地执行移动GUI任务,为高效、可解释且注重隐私的设备端GUI代理提供了实用的一步。

英文摘要

Recent advances in mobile GUI agents have shown strong potential for automating mobile tasks, but most effective systems still depend on large vision-language models for screenshot understanding and long-horizon planning. Small GUI agents that can be deployed directly on mobile devices are more attractive for practical use, offering lower inference cost and better protection of sensitive on-device information. However, due to limited model capacity, such lightweight agents remain unreliable when planning and executing GUI tasks end-to-end from screenshots alone. We propose Knowledge-Oriented Behavior Exploration (UI-KOBE), a framework that improves lightweight mobile GUI agents with reusable app-specific graph knowledge. UI-KOBE first autonomously explores a mobile application and constructs an app knowledge graph, where nodes represent distinct UI states and edges represent executable transitions. At runtime, a lightweight GUI agent uses the graph as external guidance: given a user task and the current screenshot, it identifies the current graph node and selects among self-loop actions, neighboring transitions, task completion, or fallback free actions associated with that node. By supporting runtime decisions with app-specific graph guidance, UI-KOBE reduces the burden of end-to-end GUI planning and helps lightweight models perform mobile GUI tasks more effectively, offering a practical step toward efficient, interpretable, and privacy-conscious on-device GUI agents.
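The runtime use of the app knowledge graph described above can be sketched as a small data structure. This is a minimal illustration, not the UI-KOBE implementation: the class `AppGraph`, its methods, and the option labels (`self_loop`, `transition`, `task_complete`, `free_action`) are hypothetical names chosen to mirror the four choices listed in the abstract.

```python
from dataclasses import dataclass, field


@dataclass
class AppGraph:
    """Hypothetical app knowledge graph: UI states as nodes, actions as edges."""

    # edges[state][action] = next_state; a self-loop has next_state == state
    edges: dict = field(default_factory=dict)

    def add_transition(self, state, action, next_state):
        """Record that `action` in `state` leads to `next_state`."""
        self.edges.setdefault(state, {})[action] = next_state
        self.edges.setdefault(next_state, {})

    def candidates(self, state):
        """List the options offered to the agent at `state`.

        Graph-backed options come first, followed by the two options that
        are always available: declaring the task complete, or falling back
        to a free (graph-unconstrained) action.
        """
        options = []
        for action, nxt in self.edges.get(state, {}).items():
            kind = "self_loop" if nxt == state else "transition"
            options.append((kind, action, nxt))
        options.append(("task_complete", None, state))
        options.append(("free_action", None, None))
        return options
```

In this sketch the lightweight model only has to pick one entry from a short candidate list per step, which is the sense in which graph guidance reduces the planning burden. A state missing from the graph still yields the two fallback options.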

2605.29531 2026-05-29 cs.SD cs.CV cs.LG

Audio Deepfake Detection with Half-Truth Localisation Using Cross-Attentive Feature Fusion

使用交叉注意力特征融合的半真音频深度伪造检测与定位

S. Sutharya, Remya K. Sasi

AI总结 提出CAFNet模型,通过三元分类和边界回归联合检测部分伪造音频,在MLADDC数据集上达到92.71%准确率和0.075s定位误差。

Comments 13 pages, 5 figures, 11 tables

详情
AI中文摘要

音频深度伪造检测通常作为二分类问题研究,但部分篡改语音(其中一段短合成片段被拼接进真实语音)构成了更困难且更现实的威胁。检测此类半真音频不仅需要区分真实和完全伪造语音,还需要定位篡改发生的位置。我们提出了CAFNet,一个576k参数的架构,联合处理这两个任务:它在单次前向传播中执行三元分类(真实、完全伪造或半真)并回归合成区域的时间边界。CAFNet通过并行深度可分离卷积分支和交叉注意力融合梅尔频率倒谱系数(MFCC)、线性频率倒谱系数(LFCC)和色度短时傅里叶变换(Chroma-STFT)特征,随后使用双向长短期记忆(BiLSTM)回归头进行边界预测。在组合的多语言音频深度伪造检测语料库(MLADDC)T2+T3测试集上,CAFNet达到92.71%的准确率和0.9910的宏观曲线下面积(AUC),边界定位平均绝对误差(MAE)为0.075秒,中位误差为0.052秒。在二分类检测中,它达到96.76%的准确率和3.20%的等错误率(EER),以超过500倍的参数减少优于微调的XLS-R 300M(78.31%)和AST 87M(93.03%)。跨数据集研究进一步表明,即使在降低骨干学习率的情况下,标准微调也会破坏跨域表示。

英文摘要

Audio deepfake detection is well-studied as a binary problem, but partially manipulated speech, where a short synthesised segment is spliced into an otherwise genuine utterance, poses a harder and more realistic threat. Detecting such half-truth audio requires not only distinguishing it from real and fully fake speech, but also localising where the manipulation occurs. We present CAFNet, a 576k-parameter architecture that addresses both tasks jointly: it performs ternary classification (real, fully-fake, or half-truth) and regresses the temporal boundaries of the synthesised region in a single forward pass. CAFNet fuses Mel-Frequency Cepstral Coefficient (MFCC), Linear-Frequency Cepstral Coefficient (LFCC), and Chroma Short-Time Fourier Transform (Chroma-STFT) features through parallel depthwise-separable convolution branches with cross-attention, followed by a Bidirectional Long Short-Term Memory (BiLSTM) regression head for boundary prediction. On the combined Multi-Lingual Audio Deepfake Detection Corpus (MLADDC) T2+T3 test set, CAFNet achieves 92.71% accuracy and macro Area Under the Curve (AUC) of 0.9910, with boundary localisation Mean Absolute Error (MAE) of 0.075s and a median error of 0.052s. On binary detection, it achieves 96.76% accuracy and 3.20% Equal Error Rate (EER), outperforming fine-tuned XLS-R 300M (78.31%) and AST 87M (93.03%) at over 500 times fewer parameters. A cross-dataset study further shows that standard fine-tuning collapses cross-domain representations even under reduced backbone learning rates.

2605.29525 2026-05-29 cs.LG

Learning to Perturb Hidden Representations for Generalizable Deep Learning

学习扰动隐藏表示以实现可泛化深度学习

Hua Li

AI总结 提出学习扰动激活(LPA)方法,通过自适应地扰动隐藏层激活并利用PGD学习类别级扰动,提升模型泛化能力,在平衡分类、长尾分类和域泛化任务上优于现有方法。

详情
AI中文摘要

深度神经网络通过级联表示处理数据:输入特征、隐藏激活、logits和损失。虽然输入、logit和标签层面的扰动已被系统研究,但构成网络大部分计算的中间隐藏激活尚未得到统一的扰动分析。本文建立了隐藏激活扰动的统一框架,揭示了Dropout、Manifold Mixup、对抗特征扰动及相关方法都施加了特定形式的激活扰动,但采用类别无关或随机策略。我们推测扩张性扰动(增加激活范数)起到正增强作用,而收缩性扰动(减少激活范数)起到负增强作用,并且扰动层决定了效果类似于输入级增强(浅层)还是logit级操作(深层)。我们提出学习扰动激活(LPA),该方法在选定的隐藏层自适应地扰动激活,并通过PGD学习类别级扰动。我们进一步提供了将激活扰动与平坦最小值和通过层的扰动放大联系起来的理论分析。在平衡分类、长尾分类和域泛化上的实验表明,LPA一致优于现有方法,并为logit扰动方法(如LPL)提供互补优势。

英文摘要

Deep neural networks process data through a cascade of representations: input features, hidden activations, logits, and loss. While perturbations at the input, logit, and label levels have been systematically studied, the intermediate hidden activations, which constitute the bulk of the network's computation, have received no unified perturbation analysis. In this paper, we establish a unified framework for hidden activation perturbation, revealing that Dropout, Manifold Mixup, adversarial feature perturbation, and related methods all impose specific forms of activation perturbation but with class-agnostic or random strategies. We conjecture that expansive perturbation (increasing activation norm) acts as positive augmentation, while contractive perturbation (decreasing activation norm) acts as negative augmentation, and that the perturbation layer determines whether the effect resembles input-level augmentation (shallow layers) or logit-level manipulation (deep layers). We propose Learning to Perturb Activations (LPA), which adaptively perturbs activations at a selected hidden layer with class-level perturbations learned via PGD. We further provide theoretical analysis connecting activation perturbation to flat minima and perturbation amplification through layers. Experiments on balanced classification, long-tail classification, and domain generalization demonstrate that LPA consistently outperforms existing methods and provides complementary benefits to logit perturbation methods such as LPL.

2605.29523 2026-05-29 cs.LG

K-FinHallu: A Hallucination Detection Benchmark for Multi-Turn RAG in Korean Finance

K-FinHallu:面向韩语金融多轮RAG的幻觉检测基准

Eunbyeol Cho, Yunseung Lee, Mirae Kim, Jeewon Yang, Youngjun Kwak, Edward Choi

AI总结 提出K-FinHallu基准,通过构建多轮对话和层次化幻觉分类,评估LLM在韩语金融RAG中的幻觉检测能力,发现即使最强模型在细粒度金融诊断和合理弃权上表现不佳。

详情
AI中文摘要

大型语言模型(LLMs)通过检索增强生成(RAG)推动了金融自动化,但幻觉仍然是高风险环境中部署的关键障碍。现有基准侧重于单轮、以英语为中心的任务,未解决韩语金融领域的多轮动态和语言-监管细微差别。我们引入K-FinHallu,这是首个用于多轮韩语金融RAG中幻觉检测的基准。我们从真实的韩语金融文档中构建多轮对话,并在基于上下文可回答性(明确考虑合理弃权)的层次化分类下注入幻觉。将前沿和开源LLMs作为幻觉检测器进行基准测试,我们发现即使最强的模型也难以进行细粒度的金融诊断和拒绝行为。虽然在我们的训练集上微调8B模型可获得与前沿LLMs竞争的性能,但合理弃权仍然是所有评估模型中最薄弱的方面。

英文摘要

Large Language Models (LLMs) have advanced financial automation through Retrieval-Augmented Generation (RAG), yet hallucinations remain a critical barrier to deployment in high-stakes environments. Existing benchmarks focus on single-turn, English-centric tasks, leaving the multi-turn dynamics and linguistic-regulatory nuances of the Korean financial domain unaddressed. We introduce K-FinHallu, the first benchmark for hallucination detection in multi-turn Korean financial RAG. We construct multi-turn dialogues from authentic Korean financial documents and inject hallucinations under a proposed hierarchical taxonomy based on context answerability that explicitly accounts for justified abstention. Benchmarking frontier and open-source LLMs as hallucination detectors, we find that even the strongest models struggle with fine-grained financial diagnostics and refusal behavior. While fine-tuning an 8B model on our training split yields performance competitive with frontier LLMs, justified abstention remains the weakest axis across all evaluated models.

2605.29522 2026-05-29 cs.AI

DeepSurvey: Enhancing Analytical Depth and Citation Reliability in Automated Survey Generation

DeepSurvey: 提升自动综述生成中的分析深度与引用可靠性

Ziyue Yang, Da Ma, Hanqi Li, Zijian Wang, Tiancheng Huang, Zijian Hu, Chenrun Wang, Yunzhe Zhang, Xiaobao Wu, Kai Yu, Lu Chen

AI总结 提出DeepSurvey智能体系统,通过结构化全文笔记、跨论文关系建模和代码仓库分析增强分析深度,结合引文图扩展与混合过滤、证据约束引用分配及多粒度智能体精炼提升引用可靠性,在内容质量和引用准确性上超越现有方法。

详情
AI中文摘要

随着科学文献的快速增长,自动综述生成已成为AI科学家和人类研究者的关键能力。然而,现有系统由于依赖摘要和孤立论文处理而分析深度有限,并且由于不精确的检索和事后归因而导致引用不可靠,从而产生肤浅的综述并可能误导研究者。我们提出DeepSurvey,一个解决这两个问题的智能体系统。为了增强深度,DeepSurvey从全文论文中提取结构化要点,通过聚类和比较分析建模跨论文关系,并集成代码仓库分析以恢复实现级细节。为了加强可靠性,它结合引文图扩展与混合过滤进行主题聚焦检索,强制执行证据约束的引用分配,并部署多粒度智能体精炼以验证引用-声明对齐。实验表明,DeepSurvey在内容得分(8.644/10)和引用质量(召回率和精确率分别比最强基线提高12.3%和9.3%)上达到最高,跨领域泛化更稳健(CS到非CS的下降为0.14 vs 0.22至0.69),并且领域专家更倾向于选择它而非人类撰写的综述(整体质量83.3%,内容深度100%)。

英文摘要

As scientific literature grows rapidly, automated survey generation has become a key capability for AI scientists and human researchers. However, existing systems suffer from limited analytical depth due to reliance on abstracts and isolated paper processing, and from unreliable citations caused by imprecise retrieval and post-hoc grounding, producing superficial surveys that may mislead researchers. We present DeepSurvey, an agentic system that addresses both. To enhance depth, DeepSurvey extracts structured keynotes from full-text papers, models cross-paper relationships through clustering and comparative analysis, and integrates code-repository analysis to recover implementation-level details. To fortify reliability, it combines citation-graph expansion with hybrid filtering for topic-focussed retrieval, enforces evidence-constrained citation assignment, and deploys multi-granularity agentic refinement to validate citation-claim alignment. Experiments show that DeepSurvey achieves the highest content score (8.644/10) and citation quality (12.3% and 9.3% recall and precision gains over the strongest baseline), generalizes more robustly across domains (0.14 vs 0.22 to 0.69 CS-to-non-CS drop), and is preferred over human-written surveys by domain experts (83.3% overall quality, 100% content depth).

2605.29512 2026-05-29 cs.AI

MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs

MINDGAMES: 多智能体LLM中社会与策略推理评估的实时竞技场

Kevin Wang, Anna Thöni, Benjamin Kempinski, Bobby Cheng, Jianzhu Yao, Benjamin Finch, Leon Guertler, Viraj Nadkarni, Yihan Jiang, Aliaksei Korshuk, Alexander Buyantuev, Ilya Makarov, Siyuan Wu, Yu-Chi Cheng, Yan-Ru Ju, Ti-Rong Wu, I-Hsuan Chu, Yu-Yu Yang, I-Chen Wu, Yitian Huang, Qinlu Cao, Yiheng Sun, Yuhong Dai, Hongkun Yao, Jingxuan Fu, Jiwei Zhang, Hao Liao, Mossimo Ebeling, Govind Arun, Sadhvik Bathini, Mihir S Arya, Avinash Anish, Aditya Ranjan, Kirtana Sunil Phatnani, Paval KS, Vrushali Mehta, Aravind S, Nikhil Arora, Tanya Upadhyay, Amol Bandagale, Yuan Lu, ChunEn Hsiao, YuTing Lin, Arvin Chung, Jerry John Thomas, Mathieu Laurière, Leshem Choshen, Yoram Bachrach, Pramod Viswanath, Maria Polukarov, Cheston Tan, Tal Kachman, Atlas Wang

AI总结 提出MINDGAMES多游戏竞技平台,通过四个游戏环境评估LLM智能体的社会推理与策略能力,揭示规则遵循瓶颈与排行榜有效性差异。

详情
AI中文摘要

大型语言模型(LLM)正越来越多地被部署为交互式智能体,但它们在长时间交互中的社会与策略推理能力仍知之甚少。现有评估依赖于静态场景或单一游戏基准,无法捕捉现实多智能体环境所需的持续、多面推理。我们引入MINDGAMES,一个多游戏竞技场和LLM智能体评估平台,它操作化了与“心智理论”相关的互补推理需求:隐藏信息下的信念归因、通过重复策略交互进行对手建模、知识不对称下的合作推理,以及社会推理中的持续欺骗。基于TextArena,MINDGAMES提供了统一的交互界面、基于TrueSkill的评分和四个游戏环境的完整轨迹记录。我们通过2025年在一场主要AI会议上举办的竞赛周期实例化MINDGAMES,评估了来自76个团队的944个提交智能体,涉及四个游戏:Colonel Blotto、迭代囚徒困境、Codenames和Secret Mafia。我们的分析揭示了智能体层面和评估层面的局限性:脆弱的规则遵循仍是主要瓶颈,顶级系统反复依赖显式结构支撑,且排行榜有效性在不同环境中差异显著。特别是,失败密集的环境可能同样奖励对对手错误的鲁棒性和策略能力,其中Secret Mafia在本周期中表现出明显的错误生存混杂。我们发布了一个包含29,571场多智能体游戏的数据集,包含回合级观察、动作和奖励,以及MG-Ref,一个确定性离线锦标赛协议,该协议使用与本分析相同的错误归因视角,将新智能体与冻结的顶级、低错误Stage II提交参考池进行评分。

英文摘要

Large language models (LLMs) are increasingly deployed as interactive agents, yet their capacity for social and strategic reasoning over extended interaction remains poorly understood. Existing evaluations rely on static vignettes or single-game benchmarks that cannot capture the sustained, multi-faceted reasoning that real-world multi-agent settings demand. We introduce Mindgames, a multi-game arena and evaluation platform for LLM agents that operationalizes complementary reasoning demands relevant to "theory of mind": belief attribution under hidden information, opponent modeling through repeated strategic interaction, cooperative inference under knowledge asymmetries, and sustained deception in social deduction. Built on TextArena, Mindgames provides a unified interaction interface, TrueSkill-based rating, and full trajectory logging across four game environments. We instantiate Mindgames through a 2025 competition cycle hosted at a major AI conference, which assessed 944 submitted agents from 76 teams across four games: Colonel Blotto, Iterated Prisoner's Dilemma, Codenames, and Secret Mafia. Our analysis surfaces both agent-level and evaluation-level limitations: brittle rule adherence remains a major bottleneck, top-performing systems repeatedly rely on explicit structural scaffolding, and leaderboard validity differs sharply across environments. In particular, failure-heavy environments can reward robustness to opponent errors as much as strategic ability, with Secret Mafia exhibiting a pronounced error-survival confound in this cycle. We release a dataset of 29,571 multi-agent games with turn-level observations, actions, and rewards, together with MG-Ref, a deterministic offline tournament protocol that scores new agents against a frozen reference pool of top-ranked, low-error Stage II submissions under the same error-attribution lens used in this analysis.

2605.29509 2026-05-29 cs.CV

KGEdit: Ambiguity-Aware Knowledge Graphs for Training-Free Precise Video Generation and Editing

KGEdit: 面向无训练精确视频生成与编辑的歧义感知知识图谱

Mingshu Cai, Miao Zhang, Chenghe Yang, Yixuan Li, Osamu Yoshie, Yuya Ieiri

AI总结 提出KGEdit框架,通过构建歧义感知知识图谱和结构化语义注入模块,解决文本到视频扩散模型中的语义歧义、概念绑定错误和跨帧不一致问题,实现无需额外训练的精确视频生成与编辑。

详情
AI中文摘要

近年来,无训练视频生成取得了显著进展。然而,在处理复杂文本指令时,现有方法仍存在语义歧义、概念绑定错误和跨帧不一致的问题。为解决这些问题,我们提出了KGEdit,一种用于文本到视频(T2V)扩散模型的结构化语义控制框架。具体而言,我们首先构建一个歧义感知知识图谱(AAKG)来解耦和消歧输入提示,将其转换为四种类型的结构化语义:身份、关系、属性和负约束。然后,我们设计了一个结构化语义注入模块(SSIM),将这些语义信号注入扩散Transformer的关键层,实现细粒度的语义控制。此外,我们引入了一个时间感知语义控制(TASC)模块,根据去噪过程的阶段特性动态调度语义目标,进一步提高了语义对齐和时间一致性。实验表明,KGEdit在编辑精度和时间稳定性方面优于现有方法,同时在文本驱动的交互场景中提供了更高的效率和可控性。

英文摘要

In recent years, training-free video generation has progressed remarkably. However, when handling complex textual instructions, existing methods still suffer from semantic ambiguity, incorrect concept binding, and cross-frame inconsistency. To address these issues, we propose KGEdit, a structured semantic control framework for text-to-video (T2V) diffusion models. Specifically, we first construct an ambiguity-aware knowledge graph (AAKG) to disentangle and disambiguate the input prompt, converting it into four types of structured semantics: identity, relation, attribute, and negative constraints. We then design a structured semantic injection module (SSIM) to inject these semantic signals into key layers of the diffusion Transformer, enabling fine-grained semantic control. In addition, we introduce a temporal-aware semantic control (TASC) module that dynamically schedules semantic objectives according to the stage-wise characteristics of the denoising process, further improving semantic alignment and temporal consistency. Experiments show that KGEdit outperforms existing methods in editing precision and temporal stability, while offering higher efficiency and controllability in text-driven interaction scenarios.

2605.29507 2026-05-29 cs.AI cs.IR

Xetrieval: Mechanistically Explaining Dense Retrieval

Xetrieval:机械解释稠密检索

Zhixin Cai, Jun Bai, Yang Liu, Jiaqi Li, Yichi Zhang, Taichuan Li, Zhuofan Chen, Zixia Jia, Zilong Zheng, Wenge Rong

AI总结 提出Xetrieval框架,通过嵌入级别的推理内化器和稀疏可解释特征分解,机械地解释稠密检索模型为何赋予高相关性分数。

Comments Code: https://github.com/Hihiczx/Xetrieval ; Project page: https://hihiczx.github.io/Xetrieval

详情
AI中文摘要

解释稠密检索器为何赋予高相关性分数仍然具有挑战性,因为检索决策是通过不透明的高维嵌入做出的。现有的解释通常关注表面信号,如词汇匹配、令牌对齐或事后文本理由,因此对塑造稠密检索行为在嵌入级别的潜在因素提供的洞察有限。我们提出Xetrieval,一个嵌入级别的机械框架,用于解释稠密检索。Xetrieval首先引入一个轻量级推理内化器,通过单次前向传递直接在嵌入空间中近似思维链推理,丰富句子嵌入的推理导向信息,同时避免昂贵的自回归生成。然后,它将这些推理增强的嵌入分解为稀疏、人类可解释的特征,每个特征与连贯的自然语言描述相关联。通过聚合多个文档端视图上的稀疏特征重叠,Xetrieval提供单个检索决策的特征级解释。在多种检索器和基准上的实验表明,Xetrieval揭示了连贯的可解释特征,产生更强的成对干预效果,并支持任务级特征引导。项目页面和源代码可在https://hihiczx.github.io/Xetrieval获取。

英文摘要

Explaining why dense retrievers assign high relevance scores remains challenging because retrieval decisions are made through opaque high-dimensional embeddings. Existing explanations often focus on surface signals, such as lexical matches, token alignments, or post-hoc textual rationales, and thus provide limited insight into the latent factors that shape dense retrieval behavior at the embedding level. We propose Xetrieval, an embedding-level mechanistic framework for explaining dense retrieval. Xetrieval first introduces a lightweight reasoning internalizer that approximates Chain-of-Thought reasoning directly in the embedding space with a single forward pass, enriching sentence embeddings with reasoning-oriented information while avoiding expensive autoregressive generation. It then decomposes these reasoning-enhanced embeddings into sparse, human-interpretable features, each associated with a coherent natural language description. By aggregating sparse feature overlaps across multiple document-side views, Xetrieval provides feature-level explanations of individual retrieval decisions. Experiments on diverse retrievers and benchmarks show that Xetrieval uncovers coherent interpretable features, yields stronger pair-level intervention effects, and supports task-level feature steering. The project page and source code are available at https://hihiczx.github.io/Xetrieval .

2605.29505 2026-05-29 cs.CV

ESAM++: Efficient Online 3D Perception on the Edge

ESAM++:边缘上的高效在线3D感知

Qin Liu, Lavisha Aggarwal, Saptarashmi Bandyopadhyay, Vikas Bahirwani, Marc Niethammer, Ehsan Adeli, Andrea Colaco

AI总结 提出ESAM++,一种轻量级可扩展的在线3D场景感知方法,通过3D稀疏特征金字塔网络(SFPN)在边缘设备上实现高效、准确的3D实例分割。

详情
AI中文摘要

实时在线3D场景感知对于机器人、AR/VR和自主系统至关重要,尤其是在计算资源有限且隐私至关重要的边缘计算场景中。最近的最先进方法如EmbodiedSAM(ESAM)通过利用Segment Anything Model(SAM)进行实时、细粒度且泛化的3D实例分割,展示了在线3D感知的前景。然而,ESAM仍然依赖计算昂贵的3D稀疏UNet进行点云特征提取,这占据了3D推理时间的大部分,阻碍了其在资源受限设备上的实用性。在本文中,我们提出ESAM++,一种轻量级且可扩展的在线3D场景感知替代方案,专为无GPU加速的边缘设备设计。我们的方法引入了3D稀疏特征金字塔网络(SFPN),该网络高效地从流式3D点云中捕获多尺度几何特征,同时显著降低计算开销和模型大小。我们在四个具有挑战性的分割基准(即ScanNet、ScanNet200、SceneNN和3RScan)上评估了我们的方法,结果表明,与ESAM相比,我们的模型在实现竞争性精度的同时,推理速度提升高达3倍,模型大小缩小2倍,从而能够在边缘设备上实际部署。

英文摘要

Online 3D scene perception in real time is essential for robotics, AR/VR, and autonomous systems, particularly in edge computing scenarios where computational resources are limited and privacy is crucial. Recent state-of-the-art methods like EmbodiedSAM (ESAM) demonstrate the promise of online 3D perception by leveraging the Segment Anything Model (SAM) for real-time, fine-grained, and generalized 3D instance segmentation. However, ESAM still relies on a computationally expensive 3D sparse UNet for point cloud feature extraction, which accounts for the majority of the 3D inference time, hindering its practicality on resource-constrained devices. In this paper, we propose ESAM++, a lightweight and scalable alternative for online 3D scene perception tailored to edge devices without GPU acceleration. Our method introduces a 3D Sparse Feature Pyramid Network (SFPN) that efficiently captures multi-scale geometric features from streaming 3D point clouds while significantly reducing computational overhead and model size. We evaluate our approach on four challenging segmentation benchmarks, namely ScanNet, ScanNet200, SceneNN, and 3RScan, demonstrating that our model achieves competitive accuracy with up to 3 times faster inference with a 2 times smaller model size compared to ESAM, enabling practical deployment on edge devices.

2605.29502 2026-05-29 cs.CL cs.AI

Source-Grounded Semantic Reinforcement Learning for Low-Resource Target-Language Generation

源语言锚定的语义强化学习用于低资源目标语言生成

Zeli Su, Ziyin Zhang, Zewei Pan, Zhou Liu, Dingcheng Huang, Dehan Li, Zhankai Xu, Longfei Zheng, Xiaolu Zhang, Jun Zhou, Wentao Zhang

AI总结 提出源语言锚定的语义强化学习(SG-SRL),通过跨语言语义奖励模型利用源语言单语数据,结合轻量级恢复阶段解决奖励黑客问题,在低资源目标语言生成中提升语义锚定和事实覆盖。

详情
AI中文摘要

低资源目标语言生成通常受限于稀缺的平行数据,而高资源源语言单语数据丰富但难以通过标准监督微调使用。我们提出源语言锚定的语义强化学习(SG-SRL),一种资源利用框架,将源语言单语数据转换为用于目标语言生成的跨语言语义监督。SG-SRL使用跨语言语义奖励模型(由跨语言重排序器实例化,对源输入与目标语言生成之间的语义相关性进行评分)在源语言数据上执行无参考强化学习(RL)。虽然这会导致严重的基于冗长的奖励黑客问题,但使用小型平行语料库的轻量级恢复阶段在保留语义增益的同时恢复了流畅性、简洁性和任务格式。在中文到泰语生成上的实验表明,SG-SRL在冷启动SFT基础上改善了语义锚定和事实覆盖。对长文本迁移和基于藏语嵌入奖励的额外分析阐明了SG-SRL的泛化行为,并表明在现实低资源语言设置中,基于编码器的语义奖励可以替代基于LLM的重排序器。

英文摘要

Low-resource target-language generation is often limited by scarce parallel data, while high-resource source-language monolingual data is abundant but difficult to use with standard supervised fine-tuning. We propose Source-Grounded Semantic Reinforcement Learning (SG-SRL), a resource-utilization framework that converts source-language monolingual data into cross-lingual semantic supervision for target-language generation. SG-SRL performs reference-free reinforcement learning (RL) on source-language data using a cross-lingual semantic reward model, instantiated by a cross-lingual reranker that scores the semantic relevance between the source input and the target-language generation. While this induces severe verbosity-based reward hacking, a lightweight recovery stage using a small parallel corpus restores fluency, conciseness, and task format while preserving the semantic gains. Experiments on Chinese-to-Thai generation show that SG-SRL improves semantic grounding and factual coverage over cold-start SFT. Additional analyses on long-form transfer and Tibetan embedding-based rewards clarify the generalization behavior of SG-SRL and show that an encoder-based semantic reward can substitute for an LLM-based reranker in a realistic low-resource language setting.

2605.29500 2026-05-29 cs.LG cs.AI

Quotient DAGs for Off-Policy Evaluation: Forward-Flow Importance Sampling and Exact Slate Propensities

离线策略评估的商DAG:前向流重要性采样与精确板倾向

Ziwen Xie, Shaowen Xiang, Hongyu He, Dianbo Liu

AI总结 提出商DAG视角,通过前向流比率合并等价历史,实现精确的无序板倾向计算,减少方差并提高计算效率。

Comments 31 pages, 3 figures, 7 tables

详情
AI中文摘要

离线策略评估利用不同行为策略收集的数据来估计目标策略的表现,这在在线测试成本高或风险大时(如推荐或医疗)至关重要。标准重要性采样对每条记录轨迹进行重加权,但即使评估目标忽略生成过程的某些细节,它仍可能将其视为有意义:例如,自回归板推荐器可能生成有序的项目序列,而奖励和下游估计器仅依赖于无序板。这产生了噪声方差和计算差距,因为精确的无序板倾向需要对所有生成顺序求和。我们引入商DAG视角,合并对评估等价的历史,并在合并图上使用目标与行为的前向流比率分配权重。对于在集合充分的下一个项目接口下的板推荐,这产生了Forward-DP,一种子集DAG动态规划,无需阶乘枚举即可计算精确的无序倾向。得到的倾向基元使得能够对上下文相关的自回归板记录器进行实用的基于倾向的评估和模型选择。

英文摘要

Off-policy evaluation estimates how a target policy would perform using data collected by a different behavior policy, which is crucial when online testing is costly or risky, such as in recommendation or healthcare. Standard importance sampling reweights each logged trajectory, but it can treat details of the generation process as meaningful even when the evaluation target ignores them: for example, an autoregressive slate recommender may generate an ordered sequence of items while the reward and downstream estimator depend only on the unordered slate. This creates nuisance variance and a computational gap, since exact unordered slate propensities require summing over all generation orders. We introduce a quotient-DAG view that merges histories equivalent for evaluation and assigns weights using target-to-behavior forward-flow ratios on the merged graph. For slate recommendation under a set-sufficient next-item interface, this yields Forward-DP, a subset-DAG dynamic program that computes exact unordered propensities without factorial enumeration. The resulting propensity primitive enables practical propensity-based evaluation and model selection for context-dependent autoregressive slate loggers.
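The gap between factorial enumeration and a subset dynamic program can be made concrete with a short sketch. It follows the idea stated in the abstract but is not the authors' Forward-DP code: it assumes a set-sufficient interface `next_item_prob(item, chosen)` that returns the probability of emitting `item` given only the unordered set `chosen` of items already placed. Both function names and that interface are hypothetical.

```python
from itertools import permutations


def unordered_propensity(slate, next_item_prob):
    """Probability of producing `slate` in any order, via a subset DP.

    flow[mask] is the total probability of reaching the partial slate
    encoded by `mask`, summed over all generation orders. Each subset is
    visited once, so the cost grows like 2^k rather than k!.
    """
    items = list(slate)
    n = len(items)
    flow = [0.0] * (1 << n)
    flow[0] = 1.0  # the empty slate is reached with probability 1
    for mask in range(1, 1 << n):
        total = 0.0
        for j in range(n):
            if mask & (1 << j):
                prev = mask ^ (1 << j)  # the subset before adding items[j]
                chosen = frozenset(items[i] for i in range(n) if prev & (1 << i))
                total += flow[prev] * next_item_prob(items[j], chosen)
        flow[mask] = total
    return flow[(1 << n) - 1]


def brute_force_propensity(slate, next_item_prob):
    """Reference implementation: sum the probability of every ordering."""
    total = 0.0
    for order in permutations(slate):
        p, chosen = 1.0, frozenset()
        for item in order:
            p *= next_item_prob(item, chosen)
            chosen = chosen | {item}
        total += p
    return total
```

An importance weight for an unordered slate is then the ratio of `unordered_propensity` under the target policy to the same quantity under the behavior policy. The brute-force version is included only to check the dynamic program on small slates.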