arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.10819 2026-06-10 cs.CV cs.AI New submission

Earth-OneVision: Extending Remote Sensing Multimodal Large Language Models to More Sensor Modalities and Tasks

Earth-OneVision:将遥感多模态大语言模型扩展到更多传感器模态和任务

Miaoxin Cai, Guanqun Wang, Wei Zhang, Guangyao Zhou, Yin Zhuang, Tong Zhang, Hao Wang, He Chen, Jun Li

Affiliations: National Key Laboratory of Science and Technology on Space-Born Intelligent Information Processing (SBIIP), Beijing Institute of Technology; Aerospace Information Research Institute, Chinese Academy of Sciences; Key Laboratory of Technology in Geo-Spatial Information Processing and Application System, Chinese Academy of Sciences; Advanced Research Institute of Multidisciplinary Sciences, Beijing Institute of Technology; School of Mechatronical Engineering, Beijing Institute of Technology; School of Earth and Space Sciences, Peking University; School of Electronics, Peking University; School of Computer Science and Hubei Key Laboratory of Intelligent Geo-Information Processing

AI summary: Proposes Earth-OneVision, a 2B-parameter RS-MLLM that unifies six sensor modalities and nine task categories through full-granularity vision-language alignment, spatial-linguistic isomorphic serialization, and progressive cross-modality adaptation, matching or outperforming 4B-72B models on multiple benchmarks.

Chinese abstract (AI-generated)

RS-MLLM能够对地球观测图像进行自然语言理解和空间推理。然而,现有模型仅支持狭窄的传感器类型和任务范围,导致对地球的碎片化视角,并使得跨模态地球科学知识在很大程度上未被利用。本文提出了Earth-OneVision,一个2B参数的RS-MLLM,它在单一自回归框架内统一了六种传感器模态(即光学、SAR、红外、多光谱、时序和视频)以及跨传感器融合,涵盖9个任务类别。三种专用机制解决了三个瓶颈。全粒度视觉语言对齐(FGVLA)将多级视觉特征与多维语言空间对齐。空间语言同构序列化(SLIS)将异构空间输出统一为自回归令牌。渐进式跨模态适应(PCMA)将复合领域差距分解为连续阶段,依次解决视角和成像物理差距。为了支持联合训练,构建了MMRS-OneVision,包含约340万QA对,涵盖所有六种传感器模态和9个任务类别的跨传感器融合,大大超过了现有的遥感多模态指令数据集。仅用2B参数,Earth-OneVision在广泛基准上取得了具有竞争力或最先进的结果,持续匹配或超越4B-72B的RS-MLLM。它在光学视觉定位的OPT-RSVG测试集上达到87.52%的P@0.5,在SAR VQA基准SARLANG-Bench上达到80.68%,超过7B模型7%以上。它还在多光谱分类的BigEarthNet-MS测试集上达到75.74%的召回率,在跨模态推理的EarthMind-Bench上达到81.94%的MCQ准确率。

English abstract

RS-MLLMs enable natural-language understanding and spatial reasoning over earth observation imagery. However, existing models support only a narrow range of sensor types and tasks, yielding a fragmented view of the earth and leaving cross-modal geoscientific knowledge largely unexploited. This work presents Earth-OneVision, a 2B RS-MLLM that unifies six sensor modalities (i.e., optical, SAR, infrared, multispectral, temporal, and video) and cross-sensor fusion across 9 task categories within a single autoregressive framework. Three dedicated mechanisms address three bottlenecks. Full-Granularity Vision-Language Alignment (FGVLA) aligns multi-level visual features with the multi-dimensional language space. Spatial-Linguistic Isomorphic Serialization (SLIS) unifies heterogeneous spatial outputs as autoregressive tokens. Progressive Cross-Modality Adaptation (PCMA) decomposes the compound domain gap into sequential stages, tackling the viewpoint and imaging physics gaps in turn. To support joint training, MMRS-OneVision is constructed with ~34M QA pairs spanning all six sensor modalities and cross-sensor fusion across 9 task categories, substantially exceeding existing RS multimodal instruction datasets. With only 2B parameters, Earth-OneVision achieves competitive or state-of-the-art results across extensive benchmarks, consistently matching or outperforming 4B-72B RS-MLLMs. It achieves 87.52% P@0.5 on the OPT-RSVG testset for optical visual grounding and 80.68% on the SAR VQA benchmark SARLANG-Bench, exceeding 7B models by over 7%. It further achieves 75.74% recall on the BigEarthNet-MS testset for multispectral classification, and 81.94% MCQ accuracy on EarthMind-Bench for cross-modality reasoning.

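2606.10818 2026-06-10 cs.RO cs.CV New submission

IMPACT: Learning Internal-Model Predictive Control for Forceful Robotic Manipulation

IMPACT:面向强力机器人操控的内部模型预测控制学习

Jiawei Gao, Chaoqi Liu, Peilin Wu, Haonan Chen, Yilun Du

Affiliations: Harvard University; Stanford University

AI summary: Proposes IMPACT, a framework that decouples forceful manipulation tasks into task planning and internal-model-based predictive control; simulation and real-world experiments show gains in success rate, generalization, safety, and energy efficiency.

Comments
Project website: https://gao-jiawei.com/IMPACT/
Chinese abstract (AI-generated)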

现实世界中的机器人操控任务通常涉及与环境的有力交互,例如使用不同重量的工具、运输不同质量的物体以及执行接触密集任务(如擦桌子)。先前的基于学习方法通常采用模仿学习策略,输出由低级阻抗控制器跟踪的目标末端执行器姿态。在这些系统中,有力交互要么通过稳态跟踪误差隐式实现,要么使用腕部力/扭矩或触觉传感器显式命令。然而,隐式方法在不同物体重量下泛化能力差,而显式方法需要专用硬件并增加系统复杂性。在这项工作中,我们提出了IMPACT,一个将这些有力任务解耦为任务规划和基于内部模型的预测控制的框架。广泛的仿真和真实世界实验表明,所提出的框架实现了更高的成功率、对未见物体重量的更好泛化性,以及更好的安全性和能效。

English abstract

Real-world robotic manipulation tasks often involve forceful interactions with the environment, such as using tools of varying weights, transporting objects with different masses, and performing contact-rich tasks like table wiping. Previous learning-based approaches typically employ imitation learning policies that output target end-effector poses tracked by low-level impedance controllers. In these systems, forceful interactions are either implicitly realized through steady-state tracking errors or explicitly commanded using wrist force/torque or tactile sensors. However, implicit approaches generalize poorly across object weights, while explicit approaches require specialized hardware and increase system complexity. In this work, we propose IMPACT, a framework that decouples these forceful tasks into task-planning and internal-model-based predictive control. Extensive simulation and real-world experiments demonstrate that the proposed framework achieves higher success rates and improved generalization to unseen object weights, as well as better safety and energy efficiency.

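2606.10808 2026-06-10 cs.RO New submission

Bridging Semantics and Physical Execution: A Neuro-Symbolic Framework for Multi-Pair Robotic Assembly

桥接语义与物理执行:面向多对机器人装配的神经符号框架

Xinyi Li, Aiguo Song, Linhu Wei, Huijun Li

Affiliations: School of Instrument Science and Engineering, Southeast University

AI summary: Proposes an end-to-end neuro-symbolic framework that hierarchically generates optimal subgraphs, decouples generality from edge cases, and coordinates global sequences to handle spatial interference and contact uncertainty in multi-pair assembly in unstructured environments, reaching 97% global executability on 100 real-world scenes and a 90% success rate when deployed on a UR3 arm.

Comments
Corresponding author: Aiguo Song (a.g.song@seu.edu.cn)
Chinese abstract (AI-generated)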

非结构化环境中的多对机器人装配面临空间干扰和接触不确定性。现有范式无法桥接认知决策与物理执行,要么遭遇状态空间爆炸和知识瓶颈,要么遭受逻辑幻觉和拓扑冲突。我们提出一种端到端神经符号框架,分层解决该挑战:为每对生成最优子图,将通用性与边缘情况解耦,然后解决跨对干扰。给定眼在手RGB-D装配场景,框架提取语义实例身份和状态,同时量化场景以计算散度。对于每对,通过LLM使用基本动作生成最优子图以减轻幻觉。边缘情况的支撑动作通过轻量级判别器推理并插入。由量化基线与当前场景之间的散度驱动,该框架易于以低成本扩展。增强的子图在拓扑上协调为全局序列,同时保持内部行为一致性。嵌入原子技能的动态行为树闭环力感知执行循环。在100个真实场景上的离线评估达到97.00%的全局可执行性,优于经典和最新规划器。在UR3机械臂上的真实机器人部署在强干扰下达到90%的成功率,公差0.5毫米,展示了复杂自主装配的统一且可验证解决方案。

English abstract

Multi-pair robotic assembly in unstructured environments faces spatial interference and contact uncertainties. Existing paradigms fail to bridge cognitive decision-making and physical execution, as they either encounter state-space explosion and knowledge bottlenecks or suffer from logical hallucinations and topological conflicts. We propose an end-to-end neuro-symbolic framework that solves the challenge hierarchically: generating optimal subgraphs for each pair, decoupling generality from edge cases, and then resolving cross-pair interferences. Given an eye-on-hand RGB-D assembly scene, the framework extracts semantic instance identity and state while quantifying the scene for divergence calculation. For each pair, an optimal subgraph is generated by an LLM using only basic actions to mitigate hallucinations. Supportive actions for edge cases are inferred and inserted by a lightweight discriminator. Because this step is driven by the divergence between the quantified baseline and the current scene, the framework is easily extensible at low cost. Augmented subgraphs are topologically coordinated into global sequences while preserving internal behavioral coherence. Dynamic behavior trees embedding atomic skills close the force-aware execution loop. Offline evaluation on 100 real-world scenes achieves 97.00% global executability, outperforming classical and state-of-the-art planners. Real-robot deployment on a UR3 arm attains a 90% success rate with 0.5 mm tolerance under strong interference, demonstrating a unified and verifiable solution for complex autonomous assembly.

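2606.10803 2026-06-10 cs.CL cs.AI cs.CV New submission

Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use

超越API:探索多模态大语言模型在物理工具使用中的极限

Zhixin Ma, Yutong Zhou, Yongqi Li, Chong-Wah Ngo, Wenjie Li

Affiliations: Singapore Management University; The Hong Kong Polytechnic University

AI summary: Proposes the PhysTool-Bench benchmark, which evaluates the ability of multimodal large language models to identify physical tools in real-world scenes and plan their use; the strongest model completes only 21% of tasks, revealing deficits in both perception and planning.

Chinese abstract (AI-generated)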

多模态大语言模型(MLLMs)在利用数字API方面表现出色,并日益成为具身AI的“大脑”,指导机器人与物理世界交互。在这种具身环境中,核心能力之一是使用物理工具,这支撑着MLLMs在现实任务中协助人类的能力。尽管重要性显著,MLLMs在物理工具使用方面的熟练程度仍在很大程度上未被探索。为填补这一空白,我们引入了PhysTool-Bench,这是首个评估MLLMs理解真实场景、识别物理工具并规划其使用能力的物理工具使用基准。PhysTool-Bench包含2,510个查询,覆盖2,678个真实世界物理工具,涉及制造、电气工程、农业和医疗等多个领域。具体而言,模型沿两个主要维度进行评估:1)识别场景中所有存在的物理工具,2)根据指令和视觉上下文规划工具选择和使用顺序。在13个领先的MLLMs中,即使最强的模型(Gemini-3.1-Pro)也只能识别场景中58.7%的工具,并仅完成21.0%的端到端查询。我们的分析揭示了两个层面的缺陷:MLLMs难以在真实场景中感知工具,而规划阶段更大的下降进一步表明缺乏将感知到的工具映射到任务语义的功能常识,这指出了发展实用具身AI的关键瓶颈。

English abstract

Multimodal Large Language Models (MLLMs) excel at utilizing digital APIs and increasingly serve as the "brain" of embodied AI, instructing robots to interact with the physical world. In such embodied settings, a central capability is the use of physical tools, which underpins MLLMs' ability to assist humans in real-world tasks. Despite the importance, MLLMs' proficiency in physical tool use remains largely unexplored. To address this gap, we introduce PhysTool-Bench, the first physical tool-use benchmark designed to evaluate MLLMs' ability to comprehend real-world scenarios, identify physical tools, and plan their use. PhysTool-Bench comprises 2,510 queries over 2,678 real-world physical tools spanning diverse domains, including manufacturing, electrical work, agriculture, and healthcare. Concretely, models are evaluated along two primary dimensions: 1) recognizing all physical tools present in the scene, and 2) planning the tool selection and use sequence based on the instruction and visual context. Across 13 leading MLLMs, even the strongest model (Gemini-3.1-Pro) identifies only 58.7% of tools in a scene and completes merely 21.0% of queries end-to-end. Our analysis reveals a two-level deficit: MLLMs struggle to perceive tools in realistic scenes, and the much larger drop at the planning stage further indicates a lack of functional commonsense for mapping perceived tools onto task semantics, pinpointing a critical bottleneck for the development of practical embodied AI.

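2606.10802 2026-06-10 cs.LG cs.AI New submission

Boosting ECG Classification Performance by Pre-training with Synthesized Data

通过合成数据预训练提升心电图分类性能

Naoki Nonaka, Jun Seita

Affiliations: Advanced Data Science Project, RIKEN Information R&D and Strategy Headquarters

AI summary: Proposes a medical-knowledge-based Gaussian-composition synthesis algorithm that generates single-lead II ECG data for pre-training deep neural networks, with a largest average gain of 33.2% among the four abnormality classes; the benefit is most pronounced on small datasets.

Chinese abstract (AI-generated)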

深度神经网络通常需要大量数据集才能有效训练。在医学领域,由于隐私问题和某些疾病的罕见性,获取大规模数据往往具有挑战性。为了解决数据稀缺问题,我们研究了使用基于领域医学知识生成的合成数据训练深度神经网络模型的有效性。具体来说,我们针对单导联II心电图开发了一种知识驱动的高斯组合合成算法,其中每个心跳由高斯形状的P、Q、R、S和T波分量表示。使用该模拟器,我们为四种异常心电图类别生成合成数据:心房颤动、心房扑动、室性早搏和沃尔夫-帕金森-怀特综合征。我们通过使用十种不同的深度神经网络架构进行异常心电图分类来评估该合成数据的效用。结果表明,合成到真实的训练提高了四种目标异常中三种的分类性能,其中心房扑动观察到的最大架构平均增益为33.2%。进一步分析表明,合成数据带来的性能提升在真实数据集较小时更为明显。这些发现表明,基于领域知识的合成心电图可以作为有用的预训练资源,特别是在真实数据有限或难以获取的场景中。

English abstract

Deep Neural Networks (DNNs) typically require extensive datasets for effective training. In the medical domain, acquiring large-scale data is often challenging due to privacy concerns and the rarity of certain diseases. To address this data scarcity, we investigate the efficacy of training DNN models using synthetic data, generated based on domain-specific medical knowledge. Specifically, we develop a knowledge-driven Gaussian-composition synthesis algorithm for single-lead II ECGs, in which each heartbeat is represented by Gaussian-shaped P, Q, R, S, and T wave components. Using this simulator, we generate synthetic data for four abnormal electrocardiogram (ECG) classes: atrial fibrillation (AF), atrial flutter (AFLT), premature ventricular complex (PVC), and Wolff-Parkinson-White Syndrome (WPW). We evaluate the utility of this synthetic data by conducting abnormal ECG classification using ten different DNN architectures. Our results demonstrate that synthetic-to-real training improves classification performance for three of the four target abnormalities, with the largest architecture-averaged gain of $33.2\%$ observed for AFLT. Further analysis reveals that the performance enhancement from synthetic data is more pronounced with smaller real-world datasets. These findings suggest that domain-knowledge-based synthetic ECGs can serve as a useful pre-training resource, particularly in scenarios where real-world data are limited or difficult to obtain.
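The Gaussian-composition idea in this abstract can be illustrated with a minimal sketch in which one heartbeat is the sum of five Gaussian bumps, one each for the P, Q, R, S, and T waves. The amplitudes, centers, widths, sampling rate, and function names below are hypothetical values chosen for illustration; they are not taken from the paper, and the paper's simulator presumably adds class-specific variation that is not modeled here.

```python
import math

# Hypothetical wave parameters: (amplitude in mV, center in s, width in s).
# Illustrative values only; they are not the paper's settings.
WAVES = {
    "P": (0.15, 0.20, 0.025),
    "Q": (-0.10, 0.35, 0.010),
    "R": (1.00, 0.38, 0.012),
    "S": (-0.20, 0.41, 0.010),
    "T": (0.30, 0.60, 0.040),
}


def gaussian(t, amplitude, center, width):
    """Value at time t of a Gaussian bump with the given amplitude, center, and width."""
    return amplitude * math.exp(-((t - center) ** 2) / (2.0 * width ** 2))


def synth_beat(duration=0.8, fs=500, waves=WAVES):
    """Return one heartbeat as a list of samples: the sum of the five wave components."""
    n = round(duration * fs)
    return [sum(gaussian(i / fs, *params) for params in waves.values()) for i in range(n)]


beat = synth_beat()
# The largest sample sits at the R-wave center (0.38 s at 500 Hz is sample 190).
r_peak = max(range(len(beat)), key=beat.__getitem__)
```

Under these assumptions, changing a single entry of `WAVES` (for example widening or removing the P wave) is enough to produce a differently shaped beat, which is the kind of knob a knowledge-driven simulator could expose.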

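2606.10799 2026-06-10 cs.AI New submission

Evaluating Research-Level Math Proofs via Strict Step-Level Verification

通过严格的步骤级验证评估研究级数学证明

Yifeng Sun

Affiliations: Independent Researcher

AI summary: Proposes a strict step-level verification framework that constrains the reasoning context and the sources of theorems to counter "context poisoning" when large models verify complex mathematical proofs; it outperforms global evaluation on a dataset drawn from the FirstProof challenge and exposes implicit ambiguities in the benchmark.

Chinese abstract (AI-generated)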

大型语言模型(LLM)难以严格验证复杂的数学证明。标准的全局评估方法遭受“上下文中毒”,即表面上合理的陈述掩盖了微妙的逻辑缺陷,导致幻觉或过度怀疑。为了解决这个问题,我们从全局评估转向严格的步骤级验证:我们的框架为每个推理步骤维护详细的上下文,并严格约束所应用定理的来源。我们在从FirstProof挑战中精心策划的对抗性诊断套件上评估研究级证明。系统的消融研究表明,这些演绎约束是不可或缺的,因为无约束的全局提示始终无法定位微妙的逻辑错误。除了优于全局评估,我们的方法从根本上改变了失败分类。错误分析显示,剩余的拒绝主要是“迂腐的过度严谨”实例,源于未说明的领域约定,而不是表现出严重的逻辑幻觉,这有效地暴露了专家基准本身中的隐含歧义。我们的发现表明,提示代理以谨慎的、类似人类数学家的方式组织其验证笔记,可以显著提高其区分严谨证明和有缺陷证明的能力,有可能加强基础模型尚不熟悉的前沿数学概念上的代理推理,并为未来的自动化证明审查系统奠定理论基础。代码和提示可在GitHub上获取。

English abstract

Large Language Models (LLMs) struggle to rigorously verify complex mathematical proofs. Standard global evaluation approaches suffer from "context poisoning," in which superficially plausible statements mask subtle logical flaws, leading to hallucination or over-skepticism. To address this, we shift from global evaluation to strict step-level verification: our framework maintains detailed context for each deduction step and strictly constrains the sources of applied theorems. We evaluate on a carefully curated adversarial diagnostic suite of research-level proofs drawn from the FirstProof challenge. A systematic ablation study demonstrates that these deductive constraints are indispensable, as unconstrained global prompting consistently fails to localize subtle logical errors. Beyond outperforming global evaluation, our approach fundamentally alters the failure taxonomy. Error analysis reveals that, rather than exhibiting severe logical hallucinations, remaining rejections are primarily instances of "pedantic hyper-rigor" stemming from unstated domain conventions, effectively exposing implicit ambiguities within the expert benchmark itself. Our findings suggest that prompting agents to organize their verification notes in a cautious, human-mathematician-like manner can substantially improve their ability to distinguish rigorous proofs from flawed ones, with the potential to strengthen agentic reasoning on frontier mathematical concepts that the base model does not already know well, and to lay a theoretical foundation for future automated proof-review systems. Code and prompts are available at GitHub.

2606.10796 2026-06-10 cs.CL cs.AI New submission

Dep-LLM: Training-Free Depression Diagnosis via Evidence-Guided Structured Multi-factor with Reliable LLM Reasoning

Dep-LLM:基于证据引导的结构化多因素与可靠LLM推理的无训练抑郁症诊断

Yiqing Lyu, Xianbing Zhao, Buzhou Tang, Ronghuan Jiang

Affiliations: School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China; School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China; Guangdong Provincial Key Laboratory of Intelligent Information Processing; Pengcheng Laboratory; Chinese People’s Liberation Army General Hospital, Beijing, China

AI summary: Proposes Dep-LLM, a training-free framework that performs depression diagnosis on frozen LLMs through chain-of-thought multi-factor analysis, confidence modulation, and collaborative prediction, surpassing zero-shot and fine-tuned methods.

Chinese abstract (AI-generated)

从临床访谈中进行自动抑郁症检测(ADD)是计算心理健康领域的关键任务,但由于两个关键障碍仍然具有挑战性:1)在冗长、多主题的临床访谈中建模复杂但稀疏分布的抑郁线索困难,导致推理肤浅且不可靠;2)由于临床隐私导致标记数据稀缺,加上训练和微调的高成本,限制了监督式ADD系统的部署。为了共同应对这些挑战,我们提出了Dep-LLM,一个无训练框架,它模仿临床精神科医生的逐步推理,并完全在冻结的现成基础LLM上运行。Dep-LLM包含三个阶段。首先,思维链(CoT)抑郁症多因素分析模块将长对话结构性地分解为五个临床对齐的主题,并产生基于证据的推理,有效处理长上下文依赖。其次,我们引入了置信度分析与调制模块,该模块从每个推理的token级熵中量化认知可靠性,并应用标签内和主题间调制,在不进行额外训练的情况下放大可信信号同时抑制不确定信号。第三,协作多因素预测模块动态整合由置信度加权的多因素信号,形成最终诊断。在DAIC-WOZ和E-DAIC数据集上的大量实验证明了Dep-LLM的有效性和泛化性:它在几乎所有21个基础LLM上,在准确率、宏F1和加权平均F1等9个指标上超越了零样本基线,并进一步优于最先进的监督式领域特定LLM以及最新的闭源商业LLM,同时无需额外训练。

English abstract

Automatic Depression Detection (ADD) from clinical interviews is a pivotal task in computational mental health, yet it remains challenging due to two critical obstacles: 1) the difficulty of modeling complex but sparsely distributed depression clues within lengthy, multi-topic clinical interviews, leading to superficial and unreliable reasoning; 2) the scarcity of labeled data due to clinical privacy, together with the high cost of training and fine-tuning, limiting the deployment of supervised ADD systems. To jointly address these challenges, we propose Dep-LLM, a training-free framework that mirrors the step-by-step reasoning of clinical psychiatrists and operates entirely on frozen off-the-shelf foundation LLMs. Dep-LLM comprises three stages. First, a Chain-of-Thought (CoT) Depression Multi-factor Analysis module structurally decomposes the long dialogue into five clinically aligned themes and produces evidence-grounded rationales, effectively handling long-context dependencies. Second, we introduce a Confidence Analysis and Modulation module that quantifies epistemic reliability from the token-level entropy of each rationale and applies intra-label and inter-theme modulation that amplifies trustworthy signals while suppressing uncertain ones without extra training. Third, a Collaborative Multi-factor Prediction module dynamically integrates multi-factor signals weighted by confidence into the final diagnosis. Extensive experiments on the DAIC-WOZ and E-DAIC datasets demonstrate the effectiveness and generalizability of Dep-LLM: it surpasses the zero-shot baseline on nearly all 21 foundation LLMs across 9 metrics such as accuracy, macro F1 and weighted-average F1, and further outperforms state-of-the-art supervised domain-specific LLMs as well as the latest closed-source commercial LLMs, while requiring no extra training.
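The second and third stages (entropy-based confidence, then confidence-weighted integration) can be sketched roughly as follows. This is only one plausible reading: the abstract does not give the formulas, so the mapping `exp(-mean entropy)`, the normalized weighted average, and all function names here are assumptions rather than the authors' method, and the intra-label and inter-theme modulation steps are not modeled.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def rationale_confidence(token_dists):
    """Map the mean token entropy of a rationale to (0, 1].

    Assumed mapping: exp(-mean entropy), so lower entropy gives higher confidence.
    """
    mean_h = sum(token_entropy(d) for d in token_dists) / len(token_dists)
    return math.exp(-mean_h)


def weighted_prediction(theme_scores, theme_confidences):
    """Combine per-theme scores using confidences as normalized weights (assumed rule)."""
    total = sum(theme_confidences)
    return sum(s * c for s, c in zip(theme_scores, theme_confidences)) / total


# Two hypothetical themes: one rationale generated with certainty, one with a 50/50 token.
conf_sure = rationale_confidence([[1.0, 0.0], [1.0, 0.0]])
conf_unsure = rationale_confidence([[0.5, 0.5]])
score = weighted_prediction([1.0, 0.0], [conf_sure, conf_unsure])
```

With these toy inputs, the confident theme receives twice the weight of the uncertain one, so the combined score leans toward the confident theme's vote.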

2606.10791 2026-06-10 cs.SD New submission

Overview of ESDD2: Environment-Aware Speech and Sound Deepfake Detection Challenge

ESDD2概述:环境感知语音与声音深度伪造检测挑战赛

Xueping Zhang, Han Yin, Yang Xiao, Lin Zhang, Ting Dang, Rohan Kumar Das, Ming Li

Affiliations: Duke Kunshan University; Korea Advanced Institute of Science and Technology; The University of Melbourne; Johns Hopkins University; Fortemedia Singapore; The Chinese University of Hong Kong, Shenzhen

AI summary: Introduces the ESDD2 challenge, which evaluates systems for detecting speech and environmental sounds manipulated independently or jointly; the best system reaches a Macro-F1 of 0.8775, with modular decomposition, cross-domain self-supervised encoders, data augmentation, and selective ensembling as the key factors.

Comments
Accepted to 2026 ICME workshop
Chinese abstract (AI-generated)

与ICME 2026联合举办的环境感知语音与声音深度伪造检测挑战赛(ESDD2)评估了五个组件级别的音频欺骗检测系统,其中语音和环境声音可能被独立或联合操纵。挑战结束后,我们分析了最终排行榜,并总结了来自顶级提交的有效设计选择。该挑战吸引了来自16个国家的94个注册;在验证提交要求和元数据后,保留了13个团队进行最终分析。在测试集上,最佳系统实现了0.8775的Macro-F1分数,显著优于分离增强的联合学习基线(0.6327)。顶级系统一致受益于模块化任务分解、跨域自监督编码器、针对性数据增强和选择性集成,而非简单的模型缩放。同时,辅助EER分析揭示了在检测伪造环境组件以及泛化到测试集中未见生成器方面的持续困难。本文报告了挑战结果,并为未来环境感知深度伪造检测研究提供了见解。CompSpoofV2数据集和基线代码仍公开可用,以促进可重复性。

English abstract

The Environment-Aware Speech and Sound Deepfake Detection Challenge (ESDD2), held in conjunction with ICME 2026, evaluated systems for five component-level audio spoofing detection, where speech and environmental sounds may be manipulated independently or jointly. After the challenge concludes, we analyze the final leaderboard and summarize effective design choices from the top-performing submissions. The challenge attracted 94 registrations from 16 countries; after verification of submission requirements and metadata, 13 teams were retained for the final analysis. On the test set, the best system achieved a Macro-F1 score of 0.8775, substantially outperforming the separation-enhanced joint learning baseline (0.6327). Top systems consistently benefited from modular task decomposition, cross-domain self-supervised encoders, targeted data augmentation, and selective ensembling rather than simple model scaling. At the same time, auxiliary EER analyses reveal persistent difficulty in detecting the spoofed environmental component and in generalizing to unseen generators in the test set. This paper reports challenge results and provides insights for future environment-aware deepfake detection research. The CompSpoofV2 dataset and baseline code remain publicly available for reproducibility.
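For readers unfamiliar with the headline metric, Macro-F1 is the unweighted mean of per-class F1 scores, so rare classes count as much as frequent ones. The sketch below uses the standard definition; the label values and the helper name `macro_f1` are illustrative and do not come from the challenge's scoring code.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example with three hypothetical classes.
example = macro_f1([0, 0, 1, 1, 2], [0, 1, 1, 1, 2])
```

In the toy example the per-class F1 values are 2/3, 4/5, and 1, so the macro average is their plain mean regardless of how many samples each class has.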

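2606.10790 2026-06-10 cs.CV New submission

A Multimodal RGB and Events Dataset for Hand Detection in First-Person View

第一人称视角下用于手部检测的多模态RGB和事件数据集

Bharghav Kota, Yulia Sandamirskaya

Affiliations: Zurich University of Applied Sciences

AI summary: To address motion blur from conventional cameras in low light on moving robotic systems, proposes multimodal hand detection that combines event-based and RGB cameras, and reaches performance comparable to existing methods using a synthetic event dataset.

Chinese abstract (AI-generated)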

现有的手部检测算法基于图像工作,检测率受限于相机的帧率。在移动机器人系统的手部检测应用中,传统相机会导致运动模糊,尤其是在较暗的光照条件下。我们可以利用事件相机,它具有高动态范围、高时间分辨率和低功耗的特点。最近的研究表明,使用事件相机和帧相机的立体设置可以提高检测精度和带宽-延迟权衡。在目标检测和识别任务中使用事件相机的主要瓶颈是训练数据量相对较少。在这项工作中,我们提出了一种方法以及一个从自我中心、第一人称视角合成的示例性事件手部数据集。数据使用v2e工具箱从现有的RGB Egohands数据集合成。通过改变v2e工具箱的参数,提供不同光照条件和尺度的数据集版本。使用微调后的YOLOv8模型生成地面真值检测,该模型应用于Egohands数据集中的RGB图像,并在高时间分辨率事件上进行插值。我们使用多模态数据集,利用现有的使用事件和RGB相机多模态设置的目标检测算法进行手部检测,并展示了与最先进方法相当的性能。

English abstract

Existing hand detection algorithms work on images and the detection rate is restricted by the frame rate of the camera. In hand detection applications for moving robotic systems, conventional cameras cause motion blur, especially in darker lighting conditions. We can leverage the use of event-based cameras which possess a high dynamic range, high temporal resolution, and low power consumption. Recent work has shown that using a stereo setup of an event-based and a frame-based camera improves detection accuracy and the bandwidth-latency tradeoff. The main bottleneck in using event-based cameras in object detection and recognition tasks is a relatively low amount of training data. In this work, we propose a methodology and an exemplary synthetic event-based hand dataset from an egocentric, first-person view perspective. The data is synthesized from the existing RGB Egohands dataset with the v2e toolbox. Parameters of the v2e toolbox are varied to provide versions of the dataset with different lighting conditions and scales. Ground truth detections are generated with a fine-tuned YOLOv8 model which is applied to the RGB images in the Egohands dataset and interpolated on the high-temporal resolution events. We use the multi-modal dataset to perform hand detection with existing object detection algorithms which use a multi-modal setup of event and RGB cameras and demonstrate performance comparable to the state-of-the-art.

2606.10778 2026-06-10 cs.CV New submission

From Patches to Patients: A study of the tile-to-slide performance transferability in Digital Pathology

从斑块到患者:数字病理学中斑块到全切片性能可迁移性的研究

Sofiène Boutaj, Leo Fillioux, Maria Vakalopoulou, Stergios Christodoulidis, Pierre Marza

Affiliations: Université Paris-Saclay, CentraleSupélec, Gustave Roussy, INSERM, IHU PRISM, Cancer Data Science Unit; Université Paris-Saclay, CentraleSupélec, MICS Laboratory

AI summary: Studies whether tile-level linear probing can serve as a reliable proxy for slide-level performance; benchmarking 19 foundation models on 42 slide-level and 16 tile-level tasks shows that tile and slide performance are highly correlated and that tile-level benchmarking can effectively shortlist candidate models.

Comments
Accepted to MICCAI 2026
Chinese abstract (AI-generated)

基础模型最近通过为全切片图像分析提供稳健表示,重新定义了组织病理学中的最先进技术。然而,为特定临床队列选择最优基础模型目前需要多个预处理步骤,随后对每个模型进行计算昂贵的特征提取和训练多实例学习聚合器。在这项工作中,我们研究高效的斑块级线性探测能否作为切片级性能的可靠代理,从而减少对每个候选编码器运行完整切片级管道的需求。我们在42个切片级和16个斑块级任务上对19个最先进的基础模型进行基准测试,使用ABMIL和均值池化聚合器比较斑块探测指标与切片级结果。我们观察到在不同任务难度下,斑块与切片性能之间存在高度相关性,表明编码器表示质量是WSI成功的主要决定因素。敏感性分析显示,可迁移性在不同模型间稳定,且受队列规模和每张切片斑块数量的影响大于平均任务难度。我们还测量了斑块级和切片级任务中最佳表现模型的一致性,表明斑块基准测试可靠地筛选出强候选模型。总体而言,我们的研究表明,斑块级基准测试为缩小候选模型范围提供了高效且实用的第一步,而切片级评估对于临床任务的最终验证仍然必不可少。

English abstract

Foundation Models (FMs) have recently redefined the state-of-the-art in histopathology by providing robust representations for whole-slide image (WSI) analysis. However, selecting the optimal foundation model (FM) for a specific clinical cohort currently requires multiple preprocessing steps, followed by computationally expensive feature extraction and the training of a Multiple Instance Learning (MIL) aggregator for every model. In this work, we investigate whether efficient tile-level linear probing can serve as a reliable proxy for slide-level performance, reducing the need to run full slide-level pipelines for every candidate encoder. We benchmark 19 state-of-the-art FMs on 42 slide-level and 16 tile-level tasks, comparing tile probing metrics against slide-level outcomes using ABMIL and Mean Pooling aggregations. We observe a high correlation between tile and slide performance across varying task difficulties, indicating that encoder representation quality is the primary determinant of WSI success. Sensitivity analyses show that transferability is stable across models and is more influenced by cohort sizes and numbers of tiles per slide than by average task difficulty. We also measure the agreement in best performing models between tile and slide-level tasks, showing tile benchmarks reliably shortlist strong candidates. Overall, our study indicates that tile-level benchmarking provides an efficient and practical first step for narrowing down candidate models, while slide-level evaluation remains essential for final validation on clinical tasks.
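One way to quantify the tile-to-slide agreement described here is a rank correlation between per-model tile scores and per-model slide scores. The abstract does not say which correlation coefficient the authors use, so Spearman's rho below is an assumption, and the four model scores are made-up numbers for illustration.

```python
import math


def ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical scores for four encoders (not results from the paper).
tile_scores = [0.70, 0.80, 0.85, 0.90]
slide_scores = [0.60, 0.75, 0.72, 0.88]
rho = spearman(tile_scores, slide_scores)
```

A rank-based coefficient fits the shortlisting use case, since what matters for model selection is whether the tile benchmark orders the encoders the same way the slide benchmark does.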

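2606.10777 2026-06-10 cs.LG New submission

Can we trust our models? Epistemic calibration in second-order classification

我们能信任我们的模型吗?二阶分类中的认知校准

Arthur Hoarau

Affiliations: Université de Lorraine, CentraleSupélec Loria, CNRS

AI summary: Proposes an epistemic calibration criterion that measures whether epistemic uncertainty estimates are reliable, proves it is stricter than classical calibration, and uses the EECE metric in experiments to reveal differences among uncertainty quantification methods.

Chinese abstract (AI-generated)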

不确定性估计对于在高风险场景中部署机器学习模型至关重要。然而,经典校准仅评估预测概率的可靠性,并不评估认知不确定性估计本身是否可信。这一局限性对于二阶分类模型尤为突出。我们引入认知校准,这是一个有原则的准则,用于衡量报告的认知不确定性是否忠实地反映了模型预测围绕真实值的分散程度。我们证明认知校准是比经典校准更严格的概念,并能捕捉标准指标无法发现的失败模式。通过一个在认知校准假设下成立的不可能性定理,我们将这项工作与现有文献联系起来。为了将这一概念付诸实践,我们提出了期望认知校准误差(EECE),并证明它是真实认知校准误差(TECE)的一致估计量。在广泛的不确定性量化方法上的实验表明,认知校准是一个连贯且有意义的准则,并揭示了不同方法之间的显著差异,尽管它们的预测性能相似。

English abstract

Uncertainty estimation is critical for deploying machine learning models in high-stakes settings. However, classical calibration only assesses the reliability of predicted probabilities and does not evaluate whether epistemic uncertainty estimates are themselves trustworthy. This limitation is particularly relevant for second-order classification models. We introduce epistemic calibration, a principled criterion that measures whether reported epistemic uncertainty faithfully reflects the dispersion of model predictions around the ground truth. We show that epistemic calibration is a strictly stronger notion than classical calibration and captures failure modes invisible to standard metrics. We relate this work to the existing literature through an impossibility theorem that holds under the epistemic calibration hypothesis. To operationalize this concept, we propose the Expected Epistemic Calibration Error (EECE), which we prove to be a consistent estimator of a True Epistemic Calibration Error (TECE). Experiments across a broad range of uncertainty quantification methods show that epistemic calibration is a coherent and meaningful criterion and reveal substantial differences across methods, despite similar predictive performance.

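2606.10769 2026-06-10 cs.CV New submission

ZODS-RS -- Zero-training Oriented Detection & Segmentation for Remote Sensing

ZODS-RS -- 面向遥感的零训练目标检测与分割

Zuan Gu, Tianhan Gao, Langxu Zhao

Affiliations: Northeastern University, China

AI summary: Proposes ZODS-RS, a training-free closed-form pipeline that unifies horizontal-box detection and instance segmentation for remote sensing imagery through prototype purification, rotation-scale equivariant matching, and uncertainty-aware pixelwise merging, with strong results on several datasets.

Chinese abstract (AI-generated)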

遥感与无人机应用需要模型能够跨平台和视角泛化,而无需特定任务训练。然而,无训练管道在处理有向几何、尺度/旋转变化以及拥挤的港口或机场时常常失败,并且很少统一检测与分割。我们提出ZODS-RS,一种无训练、封闭式的管道,输出水平框(HBB)和实例掩码。基于DINOv3密集特征和SAM风格的提议,ZODS-RS链式包含:PP(通过Tyler协方差进行原型纯化)、R-SEM(使用可分离核和全局匈牙利分配的旋转尺度等变匹配)以及UAM(具有自适应先验和可选负原型的不确定性感知逐像素合并)。一个轻量级的CWLA融合多个DINOv3层。在FAIR1M(HBB)上,我们获得$\mathrm{mAP}_{0.50:0.95}=\mathbf{13.06}$和$\mathrm{AP}_S=\mathbf{2.93}$(船舶/飞机类别平均);在xView(HBB)上,我们报告$\mathrm{mAP}=\mathbf{16.69}$。在我们的无人机数据集上,ZODS-RS实现了掩码$\mathrm{mIoU}=\mathbf{31.10}$,并在单张5090上将小目标AP相对于Grounded-SAM提升了$\mathbf{+30.70}$。这项工作为航空影像中的水平框检测加实例分割提供了统一的、无需训练的解决方案;提供了与DINOv3紧密耦合的PP/R-SEM/UAM的显式封闭形式公式;并在小目标和拥挤目标以及跨域迁移下展示了一致的增益,同时保持部署简单。

English abstract

Remote-sensing and UAV applications need models that generalize across platforms and viewpoints without task-specific training. Yet training-free pipelines often falter on oriented geometry, scale/rotation variation, and crowded ports or airfields, and rarely unify detection and segmentation. We introduce ZODS-RS, a training-free, closed-form pipeline that outputs horizontal boxes (HBB) and instance masks. Built on DINOv3 dense features and SAM-style proposals, ZODS-RS chains: PP (prototype purification via Tyler covariance), R-SEM (rotation-scale equivariant matching with separable kernels and global Hungarian assignment), and UAM (uncertainty-aware pixelwise merging with adaptive priors and optional negative prototypes). A lightweight CWLA fuses multiple DINOv3 layers. On FAIR1M (HBB) we obtain $\mathrm{mAP}_{0.50:0.95}=\mathbf{13.06}$ and $\mathrm{AP}_S=\mathbf{2.93}$ \emph{(class-averaged over ship/airplane)}; on xView (HBB) we report $\mathrm{mAP}=\mathbf{16.69}$. On our UAV dataset, ZODS-RS achieves mask $\mathrm{mIoU}=\mathbf{31.10}$ and improves small-object AP by $\mathbf{+30.70}$ over Grounded-SAM on a single 5090. This work offers a unified, \emph{no-training} solution for horizontal-box detection plus instance segmentation in aerial imagery; provides explicit closed-form formulations for PP/R-SEM/UAM tightly coupled with DINOv3; and demonstrates \emph{consistent} gains on small and crowded targets and under cross-domain shifts while keeping deployment simple.

2606.10768 2026-06-10 cs.LG cs.CL New submission

N-GRPO: Embedding-Level Neighbor Mixing for Enhanced Policy Optimization

N-GRPO:嵌入级邻居混合增强策略优化

Xukun Zhu, Hang Yu, Peng Di, Linchao Zhu

Affiliations: Zhejiang University; Ant Group

AI summary: To address the exploration trade-off in LLM mathematical reasoning, proposes N-GRPO, which injects diversity at the embedding layer through semantic neighbor mixing, improving policy optimization while preserving semantic consistency.

Comments
ACL 2026 Findings. 16 pages, 3 figures. Code: https://github.com/ZJUSCL/N-GRPO
Chinese abstract (AI-generated)

大语言模型在数学推理中的成功很大程度上依赖于生成多样化且有效的解题路径。然而,当前的展开技术面临一个基本折衷:token级采样通常产生仅在措辞上不同的冗余轨迹,而利用随机噪声的嵌入级方法则经常破坏语义一致性。为解决此问题,我们引入N-GRPO,一种集成到组相对策略优化(GRPO)框架中的新型探索策略。我们的方法不依赖于token级采样或原生嵌入级噪声,而是利用语义邻居混合机制。该机制通过混合锚点token及其最近语义邻居的嵌入来动态构建输入表示,从而在严格遵循局部语义流形的同时注入多样性。在不同大小的DeepSeek-R1-Distill-Qwen模型上的实验评估表明,N-GRPO不仅在数学推理基准上相比强基线取得一致改进,而且在分布外任务上展现出鲁棒的泛化能力。

English abstract

The success of Large Language Models in mathematical reasoning relies heavily on the generation of diverse and valid solution paths during the rollout phase. However, current rollout techniques face a fundamental trade-off: token-level sampling often yields redundant trajectories that differ only in rephrasing, while embedding-level methods utilizing random noise frequently disrupt semantic consistency. To resolve this, we introduce N-GRPO, a novel exploration strategy integrated into the Group Relative Policy Optimization (GRPO) framework. Rather than relying on token-level sampling or native embedding-level noise, our approach leverages Semantic Neighbor Mixing. This mechanism dynamically constructs input representations by mixing the embeddings of an anchor token and its nearest semantic neighbors, thereby injecting diversity while strictly adhering to the local semantic manifold. Experimental evaluations on the DeepSeek-R1-Distill-Qwen models across different sizes show that N-GRPO not only achieves consistent improvements over strong baselines on math reasoning benchmarks but also exhibits robust generalization capabilities on out-of-distribution tasks.
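The mixing step described in this abstract can be pictured with a toy sketch: find the anchor token's nearest neighbors in embedding space and blend them into the anchor's embedding. The abstract does not specify the similarity measure, the number of neighbors, or the mixing weights, so cosine similarity, `k`, `alpha`, the uniform neighbor average, and the function names are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def neighbor_mix(embeddings, anchor_id, k=2, alpha=0.8):
    """Blend an anchor embedding with the mean of its k nearest neighbors.

    alpha is the weight kept on the anchor (assumed convex combination).
    Returns the mixed vector and the chosen neighbor ids.
    """
    anchor = embeddings[anchor_id]
    others = [i for i in range(len(embeddings)) if i != anchor_id]
    neighbors = sorted(others, key=lambda i: cosine(anchor, embeddings[i]), reverse=True)[:k]
    mean_nb = [
        sum(embeddings[i][d] for i in neighbors) / len(neighbors)
        for d in range(len(anchor))
    ]
    mixed = [alpha * a + (1.0 - alpha) * m for a, m in zip(anchor, mean_nb)]
    return mixed, neighbors


# Toy 2-D embedding table with four hypothetical tokens.
table = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]]
mixed, neighbors = neighbor_mix(table, anchor_id=0)
```

Because the mixed vector is a convex combination of nearby embeddings, it stays close to the anchor, which is the intuition behind injecting diversity without leaving the local semantic neighborhood; unrelated tokens such as token 2 in the toy table are never selected.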

2606.10752 2026-06-10 cs.AI 新提交

AutoPDE: Reliable Agentic PDE Solving via Explicitly Represented Solver Strategies

AutoPDE: 通过显式表示的求解器策略实现可靠的智能体PDE求解

Huanshuo Dong, Keyao Zhang, Hong Wang, Zhezheng Hao, Zhiwei Zhuang, Ziyan Liu, Jiacong Wang, Gengyuan Liu, Xin Jin

发表机构 * University of Science and Technology of China(中国科学技术大学) Zhejiang University(浙江大学) University of the Chinese Academy of Sciences(中国科学院大学) Tsinghua University(清华大学) Eastern Institute of Technology, Ningbo(宁波东方理工大学)

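AI summary: Proposes AutoPDE, a code agent that maintains the solver strategy as an explicit object built in three stages (PDE analysis, numerical method selection, and adaptive tuning). It reaches a 54.5% pass rate on PDE Agent Bench, 14.2 percentage points above the strongest baseline.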

详情
AI中文摘要

偏微分方程(PDE)的数值求解器是科学和工程中的核心计算工具。构建可靠的PDE求解器不仅需要可执行的代码,还需要一个数值求解器策略——一组关于离散化、稳定化、求解器配置和分辨率控制的决策,这些决策需与PDE结构相匹配。最近基于LLM的编码智能体通过生成和调试求解器实现,开始减轻编程负担。然而,它们通常直接从PDE问题跳到求解器代码,将求解器策略隐含在实现细节中。因此,求解失败的反馈被路由回代码编辑,而不是底层策略,导致数值决策在代码生成前难以检查,且在失败时难以利用数值证据进行修改。为解决这一局限,我们提出AutoPDE,一种在整个求解过程中将求解器策略作为显式表示对象维护的代码智能体:一个独立的、可检查的对象,在编写任何代码之前构建,并在求解失败时可根据数值证据进行修订。AutoPDE通过三个阶段构建和维护该对象,所有阶段均利用可重用的PDE求解技能库:PDE分析识别方程类型和代数结构;数值方法选择选择与分析结果匹配的数值方法,并确定离散化、稳定化和线性求解器;自适应调优运行低成本试算以在规定的精度和运行时间预算下校准分辨率和容差。我们在PDE Agent Bench上评估AutoPDE,实验结果表明,AutoPDE的通过率达到54.5%,比最强基线提高了14.2个百分点。

英文摘要

Numerical solvers for partial differential equations (PDEs) are core computational tools in science and engineering. Building reliable PDE solvers requires not only executable code but also a numerical solver strategy: a set of decisions about discretization, stabilization, solver configuration, and resolution control that matches the PDE structure. Recent LLM-based coding agents have begun to reduce the programming burden by generating and debugging solver implementations. However, they typically move directly from a PDE problem to solver code, leaving the solver strategy implicit in implementation details. Feedback from a failed solve is therefore routed back to code edits rather than to the underlying strategy, so numerical decisions remain hard to check before code is generated and hard to revise using numerical evidence when it fails. To address this limitation, we propose AutoPDE, a code agent that maintains the solver strategy as an explicitly represented object throughout the solving process: an independent, inspectable object that is built before any code is written and can be revised, using numerical evidence, whenever a solve fails. AutoPDE builds and maintains this object in three stages, all drawing from a library of reusable PDE-solving skills: PDE analysis identifies the equation type and algebraic structure; numerical method selection chooses a numerical method that matches the analysis result and commits to a discretization, stabilization, and linear solver accordingly; and adaptive tuning runs low-cost pilot solves to calibrate resolution and tolerances under the prescribed accuracy and runtime budget. We evaluate AutoPDE on the PDE Agent Bench, where experimental results show that AutoPDE achieves a pass rate of 54.5%, improving over the strongest baseline by 14.2 percentage points.
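To make the idea of an explicitly represented solver strategy more concrete, here is a minimal sketch. The abstract does not describe AutoPDE's data structures, so the class `SolverStrategy`, its fields, and the rule in `revise_on_failure` are hypothetical; they only show how feedback can target the strategy rather than the code.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class SolverStrategy:
    """Hypothetical strategy record, fixed before any solver code is written."""
    pde_type: str
    discretization: str
    stabilization: str
    linear_solver: str
    resolution: int
    tolerance: float

def revise_on_failure(strategy, evidence):
    """Hypothetical revision rule driven by numerical evidence, not code edits."""
    # Spurious oscillations without stabilization: switch stabilization on.
    if evidence.get("oscillations") and strategy.stabilization == "none":
        return replace(strategy, stabilization="SUPG")
    # Error above the target tolerance: refine the mesh.
    if evidence.get("error", 0.0) > strategy.tolerance:
        return replace(strategy, resolution=strategy.resolution * 2)
    return strategy

initial = SolverStrategy(
    pde_type="advection-diffusion",
    discretization="finite elements (P1)",
    stabilization="none",
    linear_solver="GMRES",
    resolution=64,
    tolerance=1e-3,
)
after_oscillation = revise_on_failure(initial, {"oscillations": True})
after_error = revise_on_failure(after_oscillation, {"error": 5e-2})
```

Keeping the record immutable means every revision yields a new, inspectable strategy, so the history of numerical decisions can be compared step by step.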

2606.10747 2026-06-10 cs.AI 新提交

The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment

仲裁者代理:持续监控多智能体对话以检测涌现性失调

Filippo Tonini, Federico Torrielli, Anton Danholt Lautrup, Peter Schneider-Kamp, Mustafa Mert Çelikok, Lukas Galke Poech

发表机构 * University of Southern Denmark(南丹麦大学) University of Turin(都灵大学)

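AI summary: Proposes the Arbiter agent, which monitors multi-agent conversations in real time under a limited inspection budget and uses active inspection tools to detect misaligned behavior. Experiments show that it reliably detects misalignment early, and the paper analyzes how detection difficulty varies across misalignment types.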

详情
Comments
AITC 2026
AI中文摘要

随着由多个语言模型代理构建的AI系统变得越来越普遍,它们被越来越多地用于共同决策:讨论、协商并执行共享任务。尽管单个代理在单独测试时可能表现良好,但它们之间的交互方式可能会引发问题。我们引入了仲裁者,一个旨在实时监控多智能体对话并识别哪些参与者可能表现出失调行为的代理。仲裁者在有限的“检查预算”下运行,这意味着它必须谨慎决定如何使用其资源。当它逐步观察对话时,可以选择等待、询问参与者、检查系统提示或推理轨迹等内部信息,或记录可疑行为。最后,它生成一份报告,识别失调的可能来源。我们在五种对话条件下评估仲裁者,范围从风险金融建议模型生物到评估感知和共谋代理,测试了五种能力递增的工具配置和两种骨干模型。我们发现仲裁者能在对话结束前可靠地检测到失调代理,主动检查工具提高了检测准确性和速度。权重引起的失调最难检测,而指令引起的失调即使在被动观察下也能可靠识别。记录工具表现出双重效果,以精度为代价提高了召回率。这些结果表明,持续的、预算感知的监控可以有效捕捉失调,并且监督多智能体系统可能需要将审计者视为过程中的积极参与者。代码可在以下网址获取:https://github.com/aisilab/arbiter。

英文摘要

As AI systems built from multiple language-model agents become more common, they are increasingly used to make decisions together: discussing, negotiating, and acting on shared tasks. While individual agents may appear well-aligned when tested on their own, problems can arise from how they interact with one another. We introduce the Arbiter, an agent designed to monitor multi-agent conversations in real time and identify which participants may be behaving in misaligned ways. The Arbiter operates under a limited "inspection budget", meaning it must decide carefully how to use its resources. As it observes a conversation step by step, it can choose to wait, question a participant, examine internal information such as system prompts or reasoning traces, or log concerning behavior. At the end, it produces a report identifying the likely source of misalignment. We evaluate the Arbiter across five conversation conditions, ranging from risky financial advice model organisms to evaluation-aware and colluding agents, and we test five tool configurations of increasing capability and two backbone models. We find that the Arbiter reliably detects misaligned agents well before the end of the conversation, with active inspection tools improving both detection accuracy and speed. Weight-induced misalignment proves hardest to detect, while instruction-induced misalignment is identified reliably even under passive observation. The logging tool exhibits a dual effect, improving recall at the cost of precision. These results suggest that continual, budget-aware monitoring can effectively catch misalignment, and that overseeing multi-agent systems may require treating the auditor as an active participant in the process. The code is available at https://github.com/aisilab/arbiter.

2606.10746 2026-06-10 cs.RO 新提交

ros2probe: Non-intrusive, Kernel-selective Observability for Robot Operating System 2 Middleware

ros2probe: 面向机器人操作系统2中间件的非侵入式、内核选择性可观测性

Jisang Yu, Sanghoon Lee, Yeonwoo Choi, Kyung-Joon Park

发表机构 * DGIST(大邱庆北科学技术院)

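AI summary: Targets the probe effect that ROS 2 observation tools cause by joining the DDS domain (inflating the discovery plane, adding deserialization cost, and skewing reported loss). Proposes ros2probe, which passively captures discovery packets to reconstruct the communication state and uses in-kernel filtering to extract only packets for user-specified topics. This removes the probe effect, keeps the discovery graph within 0.5% of the unobserved system, drops no messages, and cuts observer CPU and memory overhead by up to 7× and 28×.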

详情
Comments
13 pages, 8 figures, 7 tables
AI中文摘要

机器人操作系统2(ROS 2)是机器人的事实标准中间件框架,它将每个机器人作为节点图运行,节点通过数据分发服务(DDS)——一种发布/订阅底层——进行通信。实时观察这种节点间通信对机器人开发至关重要,但需要付出代价。工具只能通过作为订阅者加入DDS域来接收数据,而发现过程会将其与发布者匹配,因此观察将工具折叠到其所测量的系统中并扰动该系统。我们将这种协议固有的扰动定义为观察者的探针效应。它会膨胀发现平面,增加观察者的反序列化成本,使其报告的丢包与订阅者实际接收的丢包偏离,并在接近饱和时取代订阅者的消息。唯一的逃避方法是被动捕获所有线路流量,但这会丢弃ROS 2消息语义,并且其规模与总流量成正比,而非被观察的流量。我们提出ros2probe,一种非侵入式观察框架,消除了探针效应。它从域中的发现数据包重构完整的ROS 2通信状态,且无带宽成本,然后驱动一个内核级过滤器,仅限用户请求的主题,以最小成本提取这些数据包,并观察真实订阅者接收的内容。其接口和记录与标准ROS 2工具匹配。在三个硬件平台(笔记本电脑、Jetson和树莓派)、两种DDS实现和七种机器人操作工作负载上,ros2probe将发现图保持在未观察系统的0.5%以内,而加入域的工具将发现膨胀高达2.6倍,并在饱和时丢弃订阅者38.5%的消息,而ros2probe无丢包。其丢包报告召回率为1.0,将观察者的CPU和内存开销分别降低高达7倍和28倍,并在现有工具会使系统过载的嵌入式机器人上保持实用性。

英文摘要

Robot Operating System 2 (ROS 2), the de facto standard middleware framework for robots, runs each robot as a graph of nodes communicating over the Data Distribution Service (DDS), a publish/subscribe substrate. Observing this inter-node communication in real time is essential to robot development, yet it has a price. A tool can receive data only by joining the DDS domain as a subscriber that discovery has matched to the publisher, so observing folds the tool into the system it measures and perturbs it. We define this protocol-inherent perturbation as the observer's probe effect. It inflates the discovery plane, adds deserialization cost on the observer, makes the loss it reports diverge from what the subscriber actually received, and near saturation displaces the subscriber's messages. The only escape, capturing all wire traffic passively, discards ROS 2 message semantics and scales with total traffic, not what is observed. We present ros2probe, a non-intrusive observation framework that removes the probe effect. It reconstructs the full ROS 2 communication state from the domain's discovery packets at no bandwidth cost, then drives an in-kernel filter restricted to the topics the user asks for, lifting only those packets at minimal cost and observing what the real subscriber receives. Its interfaces and recordings match the standard ROS 2 tools. Across three hardware platforms (laptop, Jetson, and Raspberry Pi), two DDS implementations, and seven robot-operation workloads, ros2probe holds the discovery graph within 0.5% of an unobserved system, whereas domain-joining tools inflate discovery up to 2.6× and drop 38.5% of the subscriber's messages at saturation while ros2probe drops none. It reports loss with a recall of 1.0, cuts observer CPU and memory by up to 7× and 28×, and stays practical on the embedded robots where existing tools overload the system.

2606.10743 2026-06-10 cs.RO 新提交

Hand-centric Human-to-Robot Trajectory Transfer from Video Demonstrations via Open-World Contact Localization

基于开放世界接触定位的以手为中心的人到机器人轨迹迁移

Yitian Shi, Di Wen, Zhengqi Han, Zicheng Guo, Yu Hu, Edgar Welte, Kunyu Peng, Rainer Stiefelhagen, Rania Rayyes

发表机构 * Karlsruhe Institute of Technology (KIT)(卡尔斯鲁厄理工学院)

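AI summary: Proposes the HOWTransfer framework, which uses contact localization to extract contact-aware robot trajectories from human videos without object-specific descriptions, achieving an 86% success rate across diverse manipulation tasks.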

详情
AI中文摘要

由于嘈杂的手-物体交互、部分观测下的未知物体以及跨实体差异,从人类视频演示中学习仍然具有挑战性。为了解决这些问题,我们提出了HOWTransfer(Hand-Object Open-World Transfer),这是一个以手为中心的框架,将人类演示提炼为接触感知、分类学信息丰富且多样化的机器人轨迹。HOWTransfer不依赖于物体特定描述、视觉语言查询或显式物体状态跟踪,而是通过推理观测到的手-物体交互线索,恢复时间一致的三维手部运动并定位时间接触区间。然后,利用定位的接触起始点将人类抓取意图重定向到多模态平行颚抓取假设,这些假设沿恢复的手腕轨迹传播以生成机器人可执行的运动。最后,轨迹编辑阶段细化接触对齐,并从单个演示生成多样化的可执行变体。跨多种操作任务的实验表明,HOWTransfer能够实现准确的接触定位和高质量的机器人运动重定向,成功率为86%,在盲选偏好研究中优于遥操作轨迹。

英文摘要

Learning from human video demonstrations remains challenging due to noisy hand-object interactions, unseen objects with partial observation, and cross-embodiment discrepancy. To address these challenges, we present HOWTransfer (Hand-Object Open-World Transfer), a hand-centric framework that distills human demonstrations into contact-aware, taxonomy-informed, and diverse robotic trajectories. Instead of relying on object-specific descriptions, vision-language queries, or explicit object-state tracking, HOWTransfer recovers temporally consistent 3D hand motion and localizes temporal contact intervals by reasoning over observed hand-object interaction cues. The localized contact onsets are then used to retarget human grasp intent into multi-modal parallel-jaw grasp hypotheses, which are propagated along the recovered wrist trajectory to generate robot-executable motions. Finally, a trajectory editing stage refines contact alignment and produces diverse executable variants from a single demonstration. Experiments across diverse manipulation tasks show that HOWTransfer enables accurate contact localization and high-quality robot motion retargeting with 86% success, which is preferred over teleoperated trajectories in a blinded preference study.

2606.10736 2026-06-10 cs.CL cs.AI cs.CY 新提交

Detecting Knowledge Gaps from Conversational AI Interactions Using Curriculum Prerequisite Graphs

利用课程先决条件图检测对话式AI交互中的知识缺口

Youssef Medhat, Junsoo Park, Ploy Thajchayapong, Ashok K. Goel

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

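AI summary: Proposes a pipeline that maps student questions posed to a conversational AI teaching assistant onto curriculum topics with a few-shot text classifier, grounded in a GPT-4-extracted prerequisite knowledge graph, in order to detect topic-level knowledge gaps.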

详情
Comments
Accepted as a short paper at the 10th CSEDM Workshop, co-located with the 18th International Conference on Educational Data Mining (EDM 2026). 7 pages, 2 figures, 2 tables
AI中文摘要

大型在线课程会产生数千条学生向对话式AI助教提出的问题,但这些交互日志作为诊断信号在很大程度上未被利用。我们提出一个流水线,使用少样本文本分类器,将学生向对话式AI助教提出的问题映射到课程主题,该分类器基于GPT-4提取的课程概念先决条件知识图谱。在研究生级别AI课程的164名学生的1,340个问题事件上评估,我们的分类器在43个标签(42个课程主题加上一个“未知”弃权类别)上达到80.0%的准确率。主题级问题数量与独立期中调查中学生自我报告的难度显著相关(rho = 0.491, p = 0.008, n = 28个主题),提供了趋同证据,表明分类后的问题流反映了真实的主题难度。这些结果表明,映射到课程结构上的对话式AI交互日志携带关于主题级知识缺口的可操作信号,并为教师提供基于课程视角的哪些主题需要关注的视图。

英文摘要

Large online courses generate thousands of student questions directed at conversational AI teaching assistants, yet these interaction logs remain largely untapped as diagnostic signals. We present a pipeline that maps student questions from a conversational AI teaching assistant to curriculum topics using a few-shot text classifier, grounded in a GPT-4-extracted prerequisite knowledge graph of course concepts. Evaluated on 1,340 question events from 164 students in a graduate-level AI course, our classifier achieves 80.0% accuracy across 43 labels (42 curriculum topics plus an "unknown" abstention class). Topic-level question volume correlates significantly with student self-reported difficulty from an independent mid-semester survey (rho = 0.491, p = 0.008, n = 28 topics), providing convergent evidence that the classified question stream reflects genuine topic difficulty. These results demonstrate that conversational AI interaction logs, mapped onto curriculum structure, carry actionable signals about topic-level knowledge gaps and provide instructors with a curriculum-grounded view of which topics warrant attention.

2606.10734 2026-06-10 cs.LG stat.ME stat.ML 新提交

SPACR: Single-Pass Adaptive Training of Uncertainty-Aware Conformal Regressors

SPACR: 单次自适应训练的不确定性感知共形回归器

Soundouss Messoudi, Sylvain Rousseau, Sébastien Destercke

发表机构 * Heudiasyc - UMR CNRS 7253, Université de Technologie de Compiègne(法国贡比涅技术大学 - CNRS 7253联合实验室 Heudiasyc)

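AI summary: Proposes SPACR, which trains uncertainty-aware regressors directly with a differentiable loss that jointly optimizes efficiency and validity, without batch splitting or predefined confidence levels. A single model supports prediction intervals at multiple confidence levels at inference time, and experiments show narrower intervals, better coverage-efficiency trade-offs, and lower computational cost.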

详情
AI中文摘要

共形预测(CP)为预测模型提供了鲁棒的不确定性保证,但通常事后应用,这导致模型训练与产生高效(即窄)区间的共形目标不一致。我们提出SPACR(单次自适应共形回归器),一种在可微损失内直接训练不确定性感知回归器的新方法。SPACR联合优化效率和有效性,无需在训练期间进行批分割或预定义置信水平。因此,单个SPACR模型在推理时能在多个置信水平下产生有效的预测区间,避免了像DOICR等方法所需的高成本重训练。在多个数据集上的实验表明,与标准CP和DOICR相比,SPACR始终提供更紧的区间和更好的覆盖-效率权衡,同时显著降低计算成本。

英文摘要

Conformal Prediction (CP) provides robust uncertainty guarantees for predictive models, but is typically applied post hoc, which misaligns model training with the conformal goal of producing efficient (i.e., narrow) intervals. We propose SPACR (Single-Pass Adaptive Conformal Regressor), a novel method for directly training uncertainty-aware regressors within a differentiable loss. SPACR jointly optimizes efficiency and validity without batch-splitting or predefined confidence levels during training. As a result, a single SPACR model yields valid prediction intervals at multiple confidence levels during inference, avoiding the costly retraining required by methods like DOICR. Experiments on diverse datasets show that SPACR consistently gives tighter intervals and better coverage-efficiency trade-offs compared to standard CP and DOICR, while significantly reducing computational costs.
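For context, the post hoc baseline that the abstract calls standard CP can be sketched as split conformal regression. This is the textbook procedure with absolute-residual scores, not SPACR itself; the calibration data and function names are made up for illustration.

```python
import math

def conformal_quantile(residuals, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration residuals."""
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return float("inf")  # too few calibration points for this level
    return sorted(residuals)[rank - 1]

def predict_interval(point_prediction, q):
    """Symmetric interval around a point prediction."""
    return point_prediction - q, point_prediction + q

# Toy calibration set: true targets and the model's point predictions.
cal_true = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
cal_pred = [1.5, 1.0, 3.5, 4.0, 5.0, 4.0, 7.5, 8.0, 8.5, 13.0]
residuals = [abs(y - p) for y, p in zip(cal_true, cal_pred)]

q80 = conformal_quantile(residuals, alpha=0.2)  # 80% target coverage
q90 = conformal_quantile(residuals, alpha=0.1)  # 90% target coverage
interval80 = predict_interval(5.0, q80)
```

Here the interval width is set only after training, from held-out residuals, so the model itself is never optimized for narrow intervals. That is the misalignment the abstract describes.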

2606.10733 2026-06-10 cs.RO 新提交

Pushing the Performance Limits in Autonomous Racing: Continuous Stability-Aware Adaptive Velocity Planning in Formula Student Driverless

推动自动驾驶赛车的性能极限:大学生方程式无人驾驶中的连续稳定性感知自适应速度规划

Tamara Bergerhoff, Sebastian Baader, Pascal Meißner, Frank Deinzer

发表机构 * Center for Artificial Intelligence and Robotics (CAIRO), TUAS Würzburg-Schweinfurt(维尔茨堡-施韦因富特应用技术大学人工智能与机器人中心(CAIRO))

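AI summary: Proposes a continuous, stability-aware adaptive velocity planning method that infers a continuous scaling factor to build a friction map and compute optimal target speeds in real time. In tests on a real race car it improves lap time by 35% over ten laps.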

详情
Comments
Accepted as a conference paper in IEEE Intelligent Vehicles Symposium (IV) 2026, Detroit, MI, United States
AI中文摘要

在自动驾驶赛车中,尤其是在大学生方程式无人驾驶等比赛中,精确规划赛车的目标速度对于实现有竞争力的圈速和稳定的驾驶行为至关重要。特别是在高速行驶时,速度规划是一项重大挑战,因为它必须实时进行,同时考虑赛道布局、环境影响、机械公差以及由此产生的控制不准确性。在本文中,我们提出了一种新颖的速度规划方法,能够动态适应这些变化的条件。该方法不是估计物理轮胎-路面摩擦系数,而是从车辆稳定性中间接推断出一个连续缩放因子。该因子不仅反映了有效的轮胎-路面相互作用,还捕捉了控制不准确性的影响。由此,我们生成一个连续的摩擦图,作为稳健、自适应的基础,用于计算考虑车辆和环境限制的最优目标速度。我们提出的方法在一辆真实的大学生方程式赛车上进行了评估,结果显示,在十圈内圈速提高了35%,与非自适应方法相比平均提高了8%。

英文摘要

In autonomous racing, especially in competitions such as Formula Student Driverless, precise planning of the target velocity of a race car is crucial for competitive lap times and stable driving behavior. At high speeds in particular, Velocity Planning (VP) is a significant challenge because it has to be performed in real time while taking into account track layouts, environmental influences, mechanical tolerances, and the resulting control inaccuracies. In this paper, we present a novel approach to VP that dynamically adapts to such changing conditions. Instead of estimating the physical Tire-Road Friction Coefficient (TRFC), a continuous scaling factor is inferred indirectly from vehicle stability. This factor not only reflects the effective tire-road interaction but also captures effects of control inaccuracies. From this, we generate a continuous friction map, which serves as a robust, adaptive basis for computing the optimal target speed, accounting for both vehicle and environmental limits. Our proposed approach was evaluated on a real Formula Student race car, showing a lap time improvement of 35% over ten laps and an average increase of 8% compared to a non-adaptive approach.
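A rough illustration of how a per-segment scaling factor can feed a target speed: the sketch below uses the common point-mass cornering limit, where lateral acceleration v²·κ must stay below scale·μ·g. The abstract does not give the paper's formulas, so this limit model, the nominal friction value, and all names are assumptions.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2

def speed_limit(curvature, mu_nominal, scale, v_cap):
    """Max speed for one track point under a scaled lateral-acceleration limit.

    scale plays the role of a friction-map entry (1.0 = nominal grip);
    this point-mass limit is an assumed stand-in, not the paper's model.
    """
    a_lat_max = scale * mu_nominal * G
    if abs(curvature) < 1e-9:  # effectively straight: only the cap applies
        return v_cap
    return min(v_cap, math.sqrt(a_lat_max / abs(curvature)))

def velocity_profile(curvatures, scales, mu_nominal=1.0, v_cap=30.0):
    """Target speed per track point from curvature and friction-map scale."""
    return [speed_limit(k, mu_nominal, s, v_cap)
            for k, s in zip(curvatures, scales)]

# Three track points: a straight, a corner at nominal grip,
# and the same corner where the map has been scaled down to 0.64.
curvatures = [0.0, 0.1, 0.1]  # 1/m
scales = [1.0, 1.0, 0.64]
profile = velocity_profile(curvatures, scales)
```

Since the limit grows with the square root of the scale, lowering a map entry to 0.64 reduces the cornering target speed to 80% of its nominal value.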

2606.10732 2026-06-10 cs.RO 新提交

Vehicle Prediction Model for Enhanced MPC Path Tracking in Formula Student Driverless

面向大学生无人驾驶方程式赛车增强MPC路径跟踪的车辆预测模型

Sebastian Baader, Tamara Bergerhoff, Pascal Meißner, Frank Deinzer

发表机构 * Center for Artificial Intelligence and Robotics (CAIRO), TUAS Würzburg-Schweinfurt(维尔茨堡-施韦因富特应用技术大学人工智能与机器人中心(CAIRO))

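AI summary: Proposes a real-time vehicle prediction model that combines offline Bayesian linear regression with online sparse Gaussian process regression, improving prediction accuracy by up to 57%, and validates it in an MPC path-tracking controller on a real race car.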

详情
Comments
Accepted as a conference paper in IEEE Intelligent Vehicles Symposium (IV) 2026, Detroit, MI, United States
AI中文摘要

自动驾驶赛车,如大学生无人驾驶方程式赛车,在接近其物理操控极限下运行。由此产生的高度非线性车辆行为增加了路径跟踪的复杂性,尤其是在狭窄赛道上。模型预测控制(MPC)通常用于解决此问题,其性能与底层预测模型的准确性密切相关。本文提出一种新颖的、实时能力强的自动驾驶赛车预测模型,该模型通过结合过去运行和当前驾驶情况的信息来适应变化的条件。我们的模型分为三个连续的子模型:名义运动学自行车模型、离线贝叶斯线性回归(BLR)模型和在线稀疏高斯过程回归(SGPR)模型。所提出的方法能够在不显著增加计算成本的情况下有效整合所有可用数据,确保从运行开始就具有高预测精度和定量不确定性评估。与现有方法相比,预测精度提高了高达57%。此外,我们成功地在基于MPC的路径跟踪控制器中,在真实的大学生方程式赛车上展示了该模型的实际适用性。

英文摘要

Autonomous race cars, such as in Formula Student Driverless, operate close to their physical handling limits. The resulting highly nonlinear vehicle behavior increases the path tracking complexity, especially on narrow tracks. Model Predictive Control (MPC) is commonly used to address this issue, a method whose performance is closely tied to the accuracy of the underlying prediction model. This paper presents a novel, real-time capable prediction model for autonomous race cars that adjusts to changing conditions by combining information from past runs and the current driving situation. Our model is divided into three consecutive submodels: a nominal Kinematic Bicycle Model, an offline Bayesian Linear Regression (BLR) model, and an online Sparse Gaussian Process Regression (SGPR) model. The proposed approach enables efficient integration of all available data without significantly increasing computational cost, ensuring high prediction accuracy and a quantitative uncertainty assessment right from the start of the run. Compared to existing approaches, an improvement in prediction accuracy of up to 57% was achieved. Further, we successfully demonstrated the practical applicability of the model within an MPC-based path tracking controller on a real Formula Student race car.
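The first of the three submodels, the nominal kinematic bicycle model, has a standard textbook form that can be sketched as follows. The rear-axle reference point, explicit Euler integration, wheelbase, and time step are assumptions; the BLR and SGPR corrections described in the abstract are not modeled here.

```python
import math

def bicycle_step(state, steer, accel, wheelbase, dt):
    """One explicit-Euler step of the kinematic bicycle model.

    state = (x, y, yaw, v), referenced at the rear axle.
    """
    x, y, yaw, v = state
    x += v * math.cos(yaw) * dt
    y += v * math.sin(yaw) * dt
    yaw += v / wheelbase * math.tan(steer) * dt
    v += accel * dt
    return (x, y, yaw, v)

def rollout(state, controls, wheelbase=1.5, dt=0.05):
    """Predict a trajectory over a horizon of (steer, accel) inputs."""
    trajectory = [state]
    for steer, accel in controls:
        state = bicycle_step(state, steer, accel, wheelbase, dt)
        trajectory.append(state)
    return trajectory

# 1 s horizon at 10 m/s: once driving straight, once with constant left steer.
start = (0.0, 0.0, 0.0, 10.0)
straight = rollout(start, [(0.0, 0.0)] * 20)
left_turn = rollout(start, [(0.1, 0.0)] * 20)
```

A rollout like this is what an MPC controller evaluates over its horizon. The kinematic model ignores tire slip, which is one reason a learned correction on top of it can improve accuracy near the handling limits.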

2606.10722 2026-06-10 cs.CL 新提交

Continual LLM Upcycling: A Predictor-Gated Bank-Wise Sparsity Training Recipe for Dense-to-Sparse LLMs

持续LLM升级:一种用于稠密到稀疏LLM的预测器门控bank级稀疏训练方案

Ruixuan Huang, Jinyuan Shi, Hantao Huang, Yifan Huang, Ziyi Guan, Hao Zeng, Ian En-Hsu Yen, Minghui Yu

发表机构 * Nanyang Technological University(南洋理工大学) Salesforce AI Huawei Noah's Ark Lab(华为诺亚方舟实验室)

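AI summary: Proposes a continual-training method for building channel-sparse large language models from dense checkpoints. A predictor-gated sparse SwiGLU FFN with a bank-wise top-k rule yields 4x sparsity, and a single-layer repair algorithm addresses a long-context failure mode.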

详情
AI中文摘要

我们研究稠密到稀疏的持续训练,作为从稠密检查点构建通道稀疏大语言模型的一种方式。从Qwen2.5-8B稠密骨干网络开始,我们在32K上下文中继续训练,并在32K阶段引入预测器门控稀疏SwiGLU FFN。对于每个token和层,我们使用低秩预测器生成FFN通道路由logits。然后应用bank级top-k规则,在每个64通道的bank中保留16个通道,从而在FFN中间激活中实现4倍稀疏性。与事后稀疏推理方法不同,路由模块被放置在主要语言建模路径上,并在持续训练期间进行优化,使稠密模型能够升级为面向硬件的稀疏模型。我们报告了架构、训练方案、基准性能以及训练经验。我们还识别了RULER-CWE上的层局部长上下文失败模式,并提出了一种单层修复算法,显著改善了受影响长度范围内的性能。

英文摘要

We study dense-to-sparse continual training as a way to construct channel-sparse large language models from dense checkpoints. Starting from a Qwen2.5-8B dense backbone, we continue training at 32K context and introduce a predictor-gated sparse SwiGLU FFN in the 32K stage. For each token and layer, we use a low-rank predictor to produce FFN-channel routing logits. We then apply a bank-wise top-k rule to retain 16 channels in every 64-channel bank, yielding 4x sparsity in the FFN intermediate activation. Unlike post-hoc sparse inference methods, the routing module is placed on the main language modeling path and optimized during continual training, enabling the dense model to be upcycled into a hardware-oriented sparse model. We report the architecture, training recipe, benchmark performance, and training lessons. We also identify a layer-local long-context failure mode on RULER-CWE and propose a single-layer repair algorithm that substantially improves the affected length range.
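The bank-wise top-k rule stated above (keep 16 channels in every 64-channel bank) can be written down directly. The sketch below covers only the selection rule applied to given routing logits; the low-rank predictor that produces those logits is not modeled, and the toy logits are arbitrary.

```python
def bankwise_topk_mask(logits, bank_size=64, keep=16):
    """Return a 0/1 mask keeping the `keep` largest logits in each bank."""
    if len(logits) % bank_size != 0:
        raise ValueError("channel count must be a multiple of bank_size")
    mask = [0] * len(logits)
    for start in range(0, len(logits), bank_size):
        bank = range(start, start + bank_size)
        # Rank channel indices inside this bank by their routing logit.
        top = sorted(bank, key=lambda i: logits[i], reverse=True)[:keep]
        for i in top:
            mask[i] = 1
    return mask

# Toy routing logits for 128 FFN channels (two banks); values are arbitrary.
logits = [((37 * i) % 101) / 101.0 for i in range(128)]
mask = bankwise_topk_mask(logits)
active_per_bank = [sum(mask[s:s + 64]) for s in range(0, 128, 64)]
```

Unlike a global top-k, every bank contributes the same number of active channels, so the sparsity pattern is regular. That regularity is presumably what makes the scheme hardware-oriented, and 64 / 16 gives the 4x figure.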

2606.10718 2026-06-10 cs.LG cs.AI 新提交

Transformer Based Model for Spatiotemporal Feature Learning in EEG Emotion Recognition

基于Transformer的脑电情绪识别时空特征学习模型

Xinglong Cui, Dian Gu

发表机构 * Beijing Neurodeep Technology Co., Ltd(北京纽罗德普科技有限公司) University of Pennsylvania(宾夕法尼亚大学)

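AI summary: Proposes the EEG-TransNet architecture, which captures spatiotemporal features of EEG signals with a Local Self-Attention Block and a Fuzzy-Attention Synchronous Transformer, outperforming existing methods on three datasets.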

详情
AI中文摘要

脑电图(EEG)是一种广泛采用的监测大脑活动的技术,因其高时间分辨率和成本效益,为神经状态提供了有价值的见解。为了增强对复杂EEG数据的分析,我们提出了EEG-TransNet,一种旨在捕捉EEG信号的时间、区域和同步特征的架构。EEG-TransNet引入了三个关键模块:1)利用ResNet和基于小波去噪的预处理与特征提取模块,2)用于区域特征学习的局部自注意力块,以及3)用于建模时空依赖性的模糊注意力同步Transformer(FAST)。通过在三个EEG数据集(BETA、SEED和DepEEG)上的大量实验,所提出的模型在不同信号长度下的分类准确性和鲁棒性方面始终优于其他方法。消融研究证实了局部自注意力块在提高性能方面的贡献,并且解码器中引入深度可分离卷积降低了计算复杂度,同时保持了高准确性。EEG-TransNet在受试者间具有最小的性能变化,突显了其作为基于EEG的大脑活动分类和情绪识别任务的鲁棒工具的潜力。

英文摘要

Electroencephalography (EEG) is a widely adopted technique for monitoring brain activity, offering valuable insights into neurological states due to its high temporal resolution and cost-effectiveness. To enhance the analysis of complex EEG data, we propose EEG-TransNet, an architecture designed to capture temporal, regional, and synchronous features of EEG signals. EEG-TransNet introduces three key modules: 1) a preprocessing and feature extraction module leveraging ResNet and wavelet-based denoising, 2) a Local Self-Attention Block for regional feature learning, and 3) a Fuzzy-Attention Synchronous Transformer (FAST) to model spatiotemporal dependencies. Through extensive experiments on three EEG datasets (BETA, SEED, and DepEEG), the proposed model consistently outperforms other methods in terms of classification accuracy and robustness across varying signal lengths. Ablation studies confirm the contribution of the Local Self-Attention Block in improving performance, and the inclusion of depthwise separable convolutions in the decoder reduces computational complexity while maintaining high accuracy. EEG-TransNet's ability to generalize across subjects with minimal performance variation highlights its potential as a robust tool for EEG-based brain activity classification and emotion recognition tasks.

2606.10716 2026-06-10 cs.CL cs.AI 新提交

Attention Expansion: Enhancing Keyphrase Extraction from Long Documents with Attention-Augmented Contextualized Embeddings

注意力扩展:利用注意力增强的上下文嵌入提升长文档关键短语提取

Roberto Martínez-Cruz, Alvaro J. López-López, José Portela

发表机构 * Institute for Research in Technology, ICAI School of Engineering, Comillas Pontifical University(技术研究所,ICAI工程学院,科米利亚斯宗座大学) DD-AIM, Senior Machine Learning Researcher(DD-AIM,高级机器学习研究员)

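AI summary: Proposes an attention expansion mechanism that augments the contextual representations of PLMs with pre-trained word embeddings, widening the effective context without full-document attention or LLM-based inference and significantly improving keyphrase extraction from long documents.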

详情
AI中文摘要

预训练语言模型(PLM)在关键短语提取(KPE)中取得了强劲性能,主要得益于其生成丰富上下文表示的能力。然而,长文档KPE仍然具有挑战性,因为显著的关键短语证据可能分散在遥远的文档部分,而这些部分无法在大多数PLM有限的上下文窗口内被联合捕获。尽管长上下文大语言模型(LLM)可以处理更广泛的文本上下文,但其计算成本限制了它们在高效和高通量KPE中的实用性。为了克服这一限制,我们提出了一种注意力扩展机制,该机制利用预训练词嵌入,用周围超出上下文的块中的信息来增强PLM的令牌表示。所提出的机制扩展了基于PLM的KPE模型的有效上下文范围,而无需全文档注意力或昂贵的基于LLM的推理。我们在五个PLM骨干网络上评估了我们的方法,包括通用、科学、任务特定和长上下文编码器,使用了两种训练机制和来自科学和新闻领域的五个基准语料库。实验结果表明,注意力扩展在所有评估设置中一致地提升了KPE性能,超越了最先进的模型,并在F1分数上取得了显著改进。这些改进扩展到领域特定、任务专门化和原生长上下文模型,表明所提出的机制提供了互补信息,而不仅仅是补偿有限的输入长度。这些结果确立了注意力扩展作为长文档KPE的一种高效且有效的策略。

英文摘要

Pre-trained language models (PLMs) have achieved strong performance in keyphrase extraction (KPE), largely due to their ability to generate rich contextualized representations. However, long-document KPE remains challenging because salient keyphrase evidence may be scattered across distant document sections that cannot be jointly captured within the limited context window of most PLMs. Although long-context large language models (LLMs) can process broader textual contexts, their computational cost limits their practicality for efficient and high-throughput KPE. To overcome this limitation, we propose an attention expansion mechanism that augments PLM token representations with information from surrounding out-of-context chunks using pre-trained word embeddings. The proposed mechanism expands the effective contextual scope of PLM-based KPE models without requiring full-document attention or expensive LLM-based inference. We evaluate our approach across five PLM backbones, including general-purpose, scientific, task-specific, and long-context encoders, using two training regimes and five benchmark corpora from scientific and news domains. Experimental results demonstrate that attention expansion consistently enhances KPE performance across all evaluation settings, outperforming state-of-the-art models and yielding notable improvements in F1 score. The improvements extend to domain-specific, task-specialized, and native long-context models, showing that the proposed mechanism provides complementary information rather than merely compensating for limited input length. These results establish attention expansion as an efficient and effective strategy for long-document KPE.

2606.10706 2026-06-10 cs.LG cs.AI 新提交

Unifying Data, Memory, and Compute Efficiency in LLM training: A Survey

统一LLM训练中的数据、内存和计算效率:一项综述

Vanessa Schmidt, Huy Hoang Nguyen, Cédric Jung, Shirin Salehi, Anke Schmeink

发表机构 * Chair of Information Theory and Data Analytics (INDA), RWTH Aachen University(亚琛工业大学信息理论与数据分析教席) AIT Austrian Institute of Technology GmbH(奥地利技术研究所) Automation and Control Institute, Technische Universität Wien (TUW)(维也纳工业大学自动化与控制研究所)

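AI summary: Surveys, from a resource-constraint perspective, three bottlenecks in large language model training (data efficiency, memory efficiency, and compute budget awareness), stressing that they must be optimized jointly rather than in isolation.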

详情
Comments
Accepted for publication in IEEE Transactions on Artificial Intelligence (TAI)
AI中文摘要

资源约束日益决定了大语言模型(LLM)中可以训练、微调和部署的内容,然而效率通常通过孤立的技术而非作为相互作用的限制系统来研究。本综述采用以约束为中心的视角,围绕三个耦合的瓶颈组织近期进展:数据效率(训练什么)、内存效率(如何适应训练)和计算预算感知(何时何地消耗FLOPs)。在数据轴上,我们回顾了最大化每个token学习量的选择和剪枝方法,从基于学习动态的可扩展代理信号到基于梯度和影响的评分,以及难度感知和课程式策略。我们强调新兴证据表明,不同的“好数据”概念在不同机制中占主导地位,这意味着最优子集取决于任务目标和资源预算,而非普遍适用。在系统方面,我们表明GPU内存(而非原始计算)通常是微调中的主要瓶颈,有效的扩展需要联合减少权重存储、优化器状态和激活内存,而不是孤立地优化任何单一组件。超越内存,我们将训练和推理视为计算主导的过程,其中优化、数据选择和解码必须明确考虑有限的FLOP预算。我们回顾了计算最优分配和停止规则的证据,其中一旦边际性能增益低于预算依赖的阈值,计算应停止或重新分配。总之,这些结果将计算感知的数据选择、缩放定律和自适应推理统一在资源条件决策的共同原则下。

英文摘要

Resource constraints increasingly determine what can be trained, fine-tuned, and deployed in large language models (LLMs), yet efficiency is often studied through isolated techniques rather than as an interacting system of limits. This survey adopts a constraint-centric perspective and organizes recent progress around three coupled bottlenecks: data efficiency (what to train on), memory efficiency (how to fit training), and compute budget awareness (when and where to spend FLOPs). On the data axis, we review selection and pruning methods that maximize learning per token, ranging from scalable proxy signals based on learning dynamics to gradient- and influence-based scoring, as well as difficulty-aware and curriculum-style strategies. We highlight emerging evidence that different notions of good data dominate in different regimes, implying that optimal subsets depend on the task objective and resource budget rather than being universal. On the systems side, we show that GPU memory, not raw compute, is often the dominant bottleneck in fine-tuning, and that effective scaling requires jointly reducing weight storage, optimizer states, and activation memory rather than optimizing any single component in isolation. Beyond memory, we frame training and inference as compute-governed processes in which optimization, data selection, and decoding must explicitly account for finite FLOP budgets. We review evidence for compute-optimal allocation and stopping rules, where computation should be halted or reallocated once marginal performance gains fall below a budget-dependent threshold. Together, these results unify compute-aware data selection, scaling laws, and adaptive inference under a common principle of resource-conditioned decision-making.

2606.10701 2026-06-10 cs.CV 新提交

Vector Map as Language: Toward Unified Remote Sensing Vector Mapping

向量地图即语言:迈向统一的遥感向量制图

Yinglong Yan, Yunkai Yang, Haoyi Wang, Wei Fu, Linshan Wu, Honghu Pan, Shaobo Xia, Shanghang Zhang, Hao Chen, Leyuan Fang

发表机构 * School of Artificial Intelligence and Robotics, Hunan University(湖南大学人工智能与机器人学院) Department of Computer Science and Engineering, The Hong Kong University of Science and Technology(香港科技大学计算机科学与工程系) Department of Geomatics Engineering, Changsha University of Science and Technology(长沙理工大学测绘工程系) State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University(北京大学计算机学院多媒体信息处理国家重点实验室)

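AI summary: Proposes the VecLang paradigm, which recasts multiclass vector mapping as structured text generation and expresses different geospatial entities in a unified GeoJSON-like language. A progressive vision-language mapping framework and Hierarchical Vector Language Optimization enable cross-category, cross-dataset, and open-vocabulary vector map generation.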

详情
AI中文摘要

遥感向量制图旨在从遥感图像中生成地理实体的结构化地图,如建筑物、道路和水体。实践中,向量地图通常包含多个类别层和异构实体结构,需要统一模型满足多样化的制图需求。然而,现有方法通常将向量对象表示为多边形或图,使其仅适用于特定类别:多边形难以捕捉拓扑关系,而图往往模糊实例边界。我们观察到,语言作为人类交流的自然媒介,提供了一种灵活且富有表现力的表示,能够容纳异构地图元素,包括几何、语义和拓扑。受此启发,我们提出向量地图即语言(VecLang),一种统一范式,将多类向量制图重构为结构化文本生成。VecLang将不同地理实体的共同元素编码为类GeoJSON的向量语言,从而在共享文本格式内实现跨类别建模。为了可靠地生成这种语言,我们设计了一个渐进式视觉-语言映射框架,首先定位向量化单元,然后生成结构化地图元素。我们进一步引入层次化向量语言优化,利用强化学习提高语法有效性、内容保真度和地图可执行性。我们还构建了包含54K图像和800K实例的VecMap-Bench,支持标准和泛化设置下的训练与评估。大量实验表明,VecLang能够处理单类和多类向量制图,同时实现强大的跨数据集和开放词汇泛化。模型和数据集已公开于 https://github.com/yyyyll0ss/VecLang。

英文摘要

Remote sensing vector mapping aims to generate structured maps of geospatial entities, such as buildings, roads, and water bodies, from remote sensing imagery. In practice, vector maps usually contain multiple category layers and heterogeneous entity structures, requiring a unified model for diverse mapping needs. However, existing methods typically represent vector objects as polygons or graphs, making them suitable only for specific categories: polygons poorly capture topological relations, while graphs often blur instance boundaries. We observe that language, as a natural medium for human communication, offers a flexible and expressive representation that can accommodate heterogeneous map elements, including geometry, semantics, and topology. Motivated by this insight, we propose Vector Map as Language (VecLang), a unified paradigm that reformulates multiclass vector mapping as structured text generation. VecLang encodes the common elements of different geospatial entities into a GeoJSON-like vector language, enabling cross-category modeling within a shared textual format. To generate this language reliably, we design a progressive vision-language mapping framework that first localizes vectorization units and then generates structured map elements. We further introduce Hierarchical Vector Language Optimization, which uses reinforcement learning to improve syntax validity, content fidelity, and map executability. We also build VecMap-Bench with 54K images and 800K instances, supporting training and evaluation across standard and generalization settings. Extensive experiments demonstrate that VecLang handles both single-class and multiclass vector mapping while achieving strong cross-dataset and open-vocabulary generalization. The model and dataset are publicly available at https://github.com/yyyyll0ss/VecLang.

2606.10696 2026-06-10 cs.CV 新提交

Don't waste SAM

不要浪费 SAM

Nermeen Abou Baker, Uwe Handmann

发表机构 * Ruhr West University of Applied Sciences - Dept of Computer Science(鲁尔西部应用科学大学计算机科学系)

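AI summary: Evaluates SAM's generalization on waste segmentation. A fine-tuned SAM-ViT-H model improves IoU by +30 on the Zerowaste and TACO datasets and comes within 1.44 of the state of the art on TrashCan 1.0, indicating that fine-tuning SAM as a foundation model is essential for downstream tasks.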

详情
Comments
Published at European Symposium on Artificial Neural Networks (ESANN2023), Computational Intelligence and Machine Learning. Bruges (Belgium)
AI中文摘要

Meta AI 最近发布了 Segment Anything Model (SAM),该模型在各种任务中展示了卓越的零样本图像分割性能,具有显著的准确性。尽管 SAM 无法在多个研究领域提供精确的分割,但它仍然是支持分割流程的宝贵起点,特别是对于需要大量高级技能标注的任务。本研究旨在使用三个垃圾分割数据集评估 SAM 和微调 SAM 模型的泛化能力。尽管这些数据集是从真实场景中捕获的(与 SAM 预训练的数据相同),但它们带来了若干挑战,包括遮挡、可变形物体、透明物体以及易与背景混淆的物体。我们的发现表明,微调的 SAM-ViT-H 模型在 Zerowaste 和 TACO 数据集上优于最先进的方法,IoU 显著提高了 +30,并且非常接近 TrashCan 1.0 的性能水平,仅相差 -1.44。在评估这些流行的垃圾数据集后,很明显,微调 SAM 作为基础模型是为下游垃圾分割任务提供更好泛化能力的关键步骤。因此,SAM 不应被忽视或浪费。

English abstract

Meta AI has recently released the Segment Anything Model (SAM), which demonstrates exceptional zero-shot image segmentation performance across various tasks with remarkable accuracy. Despite its inability to provide accurate segmentation across multiple research fields, SAM still serves as a valuable starting point for supporting the segmentation pipeline, particularly for tasks whose annotation requires extensive expert skill. This study aims to evaluate the generalization of SAM and fine-tuned SAM models using three waste segmentation datasets. Although these datasets are captured from real scenes, like the data SAM was pretrained on, they present several challenges, including occlusions, deformable objects, transparency, and objects easily confused with backgrounds. In our findings, the fine-tuned SAM-ViT-H model outperforms the state of the art on the Zerowaste and TACO datasets with a significant increase of +30 in IoU, and it closely approaches the state-of-the-art performance on TrashCan 1.0, with only a -1.44 difference. After evaluating these popular waste datasets, it became evident that fine-tuning SAM as a foundational model is a crucial step for providing better generalization for downstream waste segmentation tasks. Therefore, SAM should not be disregarded or wasted.
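
Since the results above are reported in IoU (intersection over union), a minimal Python sketch of the metric for binary masks may help. The toy masks and the convention of returning 1.0 when both masks are empty are assumptions for illustration; this is not the paper's evaluation code.

```python
def iou(pred, target):
    """Intersection over union of two equally sized binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
score = iou(pred, target)  # 2 overlapping pixels / 4 pixels in the union
```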

2606.10694 2026-06-10 cs.CL New submission

REAL: A Reasoning-Enhanced Graph Framework for Long-Term Memory Management of LLMs

REAL: 一种增强推理的图框架用于LLM的长期记忆管理

Keer Lu, Liwei Chen, Guoqing Jiang, Zhiheng Qin, Yunhuai Liu, Wentao Zhang

Affiliations: School of Computer Science, Peking University; Kuaishou Technology; Center for Data Science, Academy for Advanced Interdisciplinary Studies, Peking University

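AI summary: Proposes REAL, a framework that builds a temporal and confidence-aware directed property graph and uses non-destructive updates and hybrid beam search retrieval to address missing relations, overwritten facts, and passive querying in LLM long-term memory, with an average performance gain of 22.72%.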

AI Chinese abstract

大型语言模型(LLM)越来越期望与用户进行长时间跨度的交互。然而,由于其有限的上下文窗口,LLM无法保留所有过去的交互,因此长期记忆管理对于存储、更新和检索超出上下文限制的历史信息至关重要。尽管最近的记忆系统试图通过外部存储历史信息来解决这个问题,但现有方法存在三个关键限制:基于平面文本的记忆组织无法捕捉记忆之间的显式关系,结构化记忆系统通常会破坏性地覆盖演变的事实,而当前的检索机制在证据不完整时仍然与查询无关且被动。REAL将长期对话记忆构建为时序和置信度感知的有向属性图,其中每个原子事实都用实体、关系、有效时间区间、置信度分数和探索意图标签表示。在记忆构建过程中,REAL采用非破坏性时序更新策略,保留并行的事实版本及其有效性区间,从而能够忠实地追踪事实的演变。在检索过程中,REAL锚定与查询相关的根实体,解耦其探索意图,并执行语义评估器引导的混合束搜索以提取紧凑的记忆子图。它进一步结合反事实推理来修复不可靠的检索状态,并通过隐式逻辑关系恢复缺失的记忆证据。综合实验表明,REAL在长期记忆性能上显著优于平面文本、基于图和现有记忆基线,平均提升22.72%。

English abstract

Large Language Models (LLMs) are increasingly expected to interact with users over long time horizons. However, due to their finite context window, LLMs cannot retain all past interactions, making long-term memory management essential for storing, updating, and retrieving historical information beyond the context limit. Although recent memory systems attempt to address this issue by storing historical information externally, existing approaches suffer from three key limitations: flat text-based memory organizations fail to capture explicit relations among memories, structured memory systems often destructively overwrite evolving facts, and current retrieval mechanisms remain query-agnostic and passive when evidence is incomplete. REAL constructs long-term conversational memory as a temporal and confidence-aware directed property graph, where each atomic fact is represented with entities, relations, valid-time intervals, confidence scores, and exploration intent labels. During memory construction, REAL adopts a non-destructive temporal update strategy that preserves parallel fact versions and their validity intervals, enabling faithful tracking of fact evolution. During retrieval, REAL anchors query-relevant root entities, decouples their exploration intents, and performs semantic evaluator-guided hybrid beam search to extract compact memory subgraphs. It further incorporates counterfactual inference to repair unreliable retrieval states and recover missing memory evidence through implicit logical relations. Comprehensive experiments demonstrate that REAL substantially improves long-term memory performance over flat-text, graph-based, and existing memory baselines, achieving an average improvement of 22.72%.
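
The non-destructive temporal update described above can be illustrated with a small Python sketch. It is not REAL's implementation: the class and method names are hypothetical, timestamps are assumed to be integers, and the sketch assumes each (subject, relation) pair has a single current value. It leaves out the graph structure, exploration intent labels, beam search, and counterfactual inference entirely.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Fact:
    subject: str
    relation: str
    obj: str
    valid_from: int           # assumed integer timestamp
    valid_to: Optional[int]   # None means the fact is still valid
    confidence: float


class TemporalFactStore:
    """Hypothetical store that closes old fact versions instead of overwriting them."""

    def __init__(self) -> None:
        self.facts: List[Fact] = []

    def update(self, subject: str, relation: str, obj: str, at: int,
               confidence: float = 1.0) -> Fact:
        open_versions = [f for f in self.facts
                         if f.subject == subject and f.relation == relation
                         and f.valid_to is None]
        for f in open_versions:
            if f.obj == obj:
                return f          # same value is already current; nothing to change
        for f in open_versions:
            f.valid_to = at       # end the old version's validity, but keep it
        new_fact = Fact(subject, relation, obj, at, None, confidence)
        self.facts.append(new_fact)
        return new_fact

    def as_of(self, subject: str, relation: str, at: int) -> List[Fact]:
        """Return versions whose interval [valid_from, valid_to) contains `at`."""
        return [f for f in self.facts
                if f.subject == subject and f.relation == relation
                and f.valid_from <= at
                and (f.valid_to is None or at < f.valid_to)]


store = TemporalFactStore()
store.update("alice", "lives_in", "Paris", at=1)
store.update("alice", "lives_in", "Berlin", at=5, confidence=0.9)
```

Querying `store.as_of("alice", "lives_in", 3)` still returns the Paris version after the Berlin update. A destructive overwrite would have lost that earlier version, which is the limitation of structured memory systems the abstract points to.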

2606.10684 2026-06-10 cs.LG cs.AI New submission

Divide and Cooperate: Role-Decomposed Multi-Agent LLM Training with Cross-Agent Learning Signals

分工与合作:基于跨智能体学习信号的角色分解多智能体LLM训练

Jaewan Park, Solbee Cho, Jay-Yoon Lee

Affiliations: Seoul National University

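AI summary: Proposes DAC, a framework that decomposes multi-step reasoning into search and generation subtasks handled by dedicated agents and uses cross-agent learning signals to address credit assignment, outperforming fully fine-tuned monolithic models on QA benchmarks.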

AI Chinese abstract

现代语言智能体通过多步推理在知识密集型问答中表现出色。然而,现有方法通常将证据获取和答案生成耦合在单一策略中。这迫使单个模型扮演多个可能冲突的角色,导致策略空间组合爆炸并阻碍高效探索。同时,训练中引入信用分配问题:当生成失败时,检索到足够证据的搜索动作仍可能受到惩罚,反之亦然。我们提出DAC(分工与合作),一个角色分解的多智能体训练框架,将智能体搜索分解为两个合作性子任务,每个子任务由专用智能体处理,并使用角色特定的学习信号进行训练。生成器扮演双重角色:既是答案生成器,也是证据充分性验证器,当检索到的证据不足时放弃回答。该放弃信号被纳入搜索智能体的奖励中,提供结构化的跨智能体学习信号以改进信用分配。相反,搜索器通过硬阳性证据增强向生成器暴露多样且具有挑战性的证据环境,提高其鲁棒性。在通用和多跳问答基准上的实验表明,DAC通过共享骨干网络上的参数高效LoRA模块实现,在性能上优于先前依赖全参数微调单体模型的基线方法。

English abstract

Modern language agents which perform multi-step reasoning have shown strong performance in knowledge-intensive question answering. However, existing approaches typically couple evidence acquisition and answer generation within a single policy. This forces a single model to play multiple potentially conflicting roles, inducing a combinatorial explosion in the policy space and hindering efficient exploration. It also introduces a credit assignment problem during training: a search action that retrieves sufficient evidence may still be penalized when generation fails, and vice versa. We propose DAC (Divide and Cooperate), a role-decomposed multi-agent training framework that divides agentic search into two cooperative subtasks, each handled by a dedicated agent trained with role-specific learning signals. The generator serves a dual role as both an answer producer and an evidence sufficiency verifier, abstaining when retrieved evidence is insufficient. This abstention signal is incorporated into the search agent's reward, providing structured cross-agent learning signals that improve credit assignment. Conversely, the searcher exposes the generator to diverse and challenging evidence environments by hard-positive evidence augmentation, improving its robustness. Experiments on general and multi-hop QA benchmarks show that DAC, implemented via parameter-efficient LoRA modules over a shared backbone, achieves strong performance against prior baselines that rely on full fine-tuning of monolithic models.

2606.10683 2026-06-10 cs.RO cs.AI cs.CV New submission

UniDexTok: A Unified Dexterous Hand Tokenizer from Real Data

UniDexTok:基于真实数据的统一灵巧手分词器

Dong Fang, Youjun Wu, Yuanxin Zhong, Rui Zhang, Yunlong Wang, Xiaosong Jia, Yu-Gang Jiang

Affiliations: Fudan University; Hefei University of Technology; Rimbot; Beijing University of Posts and Telecommunications

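AI summary: Proposes the Unified Dexterous Hand Model (UDHM), which maps human and robot hand states into a shared 22-DoF semantic interface, and builds on it UniDexTok, a retargeting-free state tokenizer that learns discrete tokens from real joint states, giving heterogeneous dexterous hands a unified representation and reducing error by more than 98%.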

AI Chinese abstract

灵巧手对于精细操作至关重要,但其硬件设计在不同实施例之间存在显著差异。运动学、关节定义和自由度方面的差异使得定义共享状态表示变得困难,与平行夹爪相比更是如此。因此,灵巧手数据仍然碎片化,难以用于联合训练。在这项工作中,我们提出了统一灵巧手模型(UDHM),它将人手和机器人手状态映射到一个共享的22自由度语义接口。基于UDHM,我们引入了UniDexTok,一种免重定向的状态分词器,它从标准化的真实关节状态中学习基于实施例的离散token。UniDexTok为异构灵巧手提供了统一表示,无需依赖重定向或仿真数据。与最近的基线UniHM相比,UniDexTok将MPJAE从15.63度降低到0.16度,MPJPE从18.51毫米降低到0.18毫米,误差分别减少了98.98%和99.03%。这些结果将重建精度从厘米级提升到亚毫米级。实验进一步表明,来自其他实施例的数据提高了目标实施例的重建精度,证明了跨实施例分词的优势。当引入新的灵巧手时,UniDexTok还表现出强大的零样本和少样本重建能力。

English abstract

Dexterous hands are essential for fine-grained manipulation, but their hardware designs vary substantially across embodiments. Differences in kinematics, joint definitions, and degrees of freedom make it difficult to define a shared state representation compared with parallel grippers. As a result, dexterous-hand data remains fragmented and difficult to use for joint training. In this work, we propose the Unified Dexterous Hand Model (UDHM), which maps human and robot hand states into a shared 22-DoF semantic interface. Based on UDHM, we introduce UniDexTok, a retargeting-free state tokenizer that learns embodiment-conditioned discrete tokens from standardized real joint states. UniDexTok provides a unified representation for heterogeneous dexterous hands without relying on retargeting or simulation data. Compared with the recent baseline UniHM, UniDexTok reduces MPJAE from 15.63 degrees to 0.16 degrees and MPJPE from 18.51 mm to 0.18 mm, corresponding to error reductions of 98.98% and 99.03%, respectively. These results improve reconstruction from centimeter-scale to sub-millimeter accuracy. Experiments further show that data from other embodiments improves target-embodiment reconstruction accuracy, demonstrating the benefit of cross-embodiment tokenization. UniDexTok also shows strong zero-shot and few-shot reconstruction ability when new dexterous hands are introduced.
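
The reported error reductions follow directly from the before and after values quoted in the abstract. The short Python check below reproduces them; the function name is a hypothetical helper, and the only inputs are the four numbers stated above.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Percentage by which `new` is lower than `baseline`."""
    return 100.0 * (baseline - new) / baseline


# Values quoted in the abstract: UniHM baseline vs. UniDexTok.
mpjae_reduction = relative_reduction(15.63, 0.16)  # joint angle error, degrees
mpjpe_reduction = relative_reduction(18.51, 0.18)  # joint position error, mm

print(f"MPJAE: {mpjae_reduction:.2f}%  MPJPE: {mpjpe_reduction:.2f}%")
# prints "MPJAE: 98.98%  MPJPE: 99.03%"
```

The position figures also account for the scale claim: 18.51 mm is about 1.85 cm, while 0.18 mm is well under a millimeter.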