arXivDaily: daily arXiv academic digest, updated Monday to Friday
2606.18656 2026-06-18 cs.CL New submission

The Wrong Kind of Right: Quantifying and Localizing Misfired Alignment in LLMs

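Naihao Deng, Yiming Feng, Chimaobi Okite, Kaijian Zou, Lu Wang, Rada Mihalcea, Yulong Chen

Affiliations: University of Michigan; University of Cambridge; University of Aberdeen

AI summary: The paper introduces the VETO benchmark and the Misfired Alignment Rate (MAR) metric. All evaluated LLMs show non-trivial misfired alignment on stereotype-related questions, whereas human participants score 0%; mechanistic analysis indicates that alignment-induced cues amplify the effect.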

English abstract

Warning: This paper studies stereotypes and biases, and contains potentially disturbing examples, used for illustration purposes only. Our findings should not be interpreted as an argument against alignment. Instead, this paper highlights the need for principled approaches to more advanced alignment. Alignment aims to ensure that large language models (LLMs) behave safely and reliably, including by avoiding unsafe inferences. However, we show that such safety-oriented behaviors can misfire: models may reject warranted conclusions even when they are explicitly supported by context. We call this failure mode misfired alignment, where alignment-induced changes cause LLMs to override explicit evidence. To quantify this phenomenon, specifically on stereotype-related alignment, we introduce VETO, a benchmark consisting of 2,032 BBQ-derived contrastive pairs, and define a new metric, Misfired Alignment Rate (MAR), which measures on a 0 to 100 scale how often a model fails on a stereotype-related question but succeeds on its contrastive counterpart. We benchmark 25 LLMs on VETO, and show that all LLMs, including the most recent ones, exhibit non-trivial (4.7 to 18.9%) MARs while all human participants achieve 0.0% MAR. Controlled priming experiments further show that alignment-induced cues can substantially amplify MAR across LLMs, indicating that these failures are not merely artifacts of individual examples but can be induced by safety-related framing. Mechanistic analyses on open-weight LLMs reveal late-layer suppression of evidence-supported answers, and comparisons between instruct and base LLMs suggest that this suppression emerges after instruction training. These findings show that current alignment methods can overgeneralize surface-level safety cues, to the point of overriding objective evidence, motivating more work on alignment objectives that better preserve contextual grounding.
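
The MAR definition in the abstract can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the input layout (one pair of correctness flags per contrastive pair) and the choice of the total number of pairs as the denominator are assumptions.

```python
from typing import List, Tuple


def misfired_alignment_rate(pairs: List[Tuple[bool, bool]]) -> float:
    """Return MAR on a 0 to 100 scale.

    Each pair is (stereotype_correct, contrastive_correct) for one
    contrastive pair. A pair counts as a misfire when the model fails the
    stereotype-related question but answers its counterpart correctly.
    Dividing by the total number of pairs is an assumption.
    """
    if not pairs:
        raise ValueError("need at least one contrastive pair")
    misfires = sum(
        1 for stereo_ok, contrast_ok in pairs if not stereo_ok and contrast_ok
    )
    return 100.0 * misfires / len(pairs)


# Hypothetical results for four contrastive pairs.
results = [(True, True), (False, True), (False, False), (True, True)]
mar = misfired_alignment_rate(results)
print(mar)  # 25.0
```

Only the second pair counts as a misfire; the third pair fails both questions, so it reflects a general error rather than a misfire under this definition.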

2606.18650 2026-06-18 cs.LG New submission

BLADE: Scalable Bi-level Adaptive Data Selection for LLM Training

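Jiaxing Wang, Deping Xiang, Jin Xu, Zirui Liu, Zicheng Zhang, Guoqiang Gong, Jun Fang, Chao Liu, Pengzhang Liu, Tongxuan Liu, Ke Zhang, Qixia Jiang

Affiliations: University of Oxford; Renmin University of China; University of Chinese Academy of Sciences

AI summary: Proposes the BLADE framework, which uses Lagrange multipliers to recast bi-level optimization as a penalized single-level objective, avoiding inverse-Hessian computation and yielding a dynamic reference model; first-order convergence is guaranteed in theory, and experiments show gains over existing methods.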

English abstract

As Large Language Model (LLM) datasets scale to trillions of tokens, data selection has emerged as a critical frontier to filter out uninformative noise and construct adaptive learning trajectories. Beyond static heuristic filtering, advanced data selection methods for LLM training largely follow two paradigms, each with fundamental limitations. Influence-based methods provide principled bi-level objectives but require intractable inverse-Hessian computations, while excess-loss methods are computationally efficient but rely on a static reference model that becomes misaligned with the evolving proxy model during training. We propose BLADE (Bi-Level Adaptive Data sElection), a Hessian-free framework for data selection. BLADE reformulates the bi-level optimization problem underlying influence-based methods as a penalized single-level objective via Lagrange multipliers, avoiding inverse-Hessian computation while revealing a principled connection to excess-loss based data selection. The resulting objective recovers an excess-loss form but replaces the static reference model with a dynamic one that stays synchronized with training. Theoretically, we prove that this penalized formulation guarantees first-order convergence. For efficient online batch selection, we instantiate BLADE as a memoryless randomized block-coordinate Frank-Wolfe algorithm. Extensive experiments show that BLADE consistently outperforms state-of-the-art data selection baselines, providing a practical recipe for LLM training.
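
For readers unfamiliar with excess-loss selection, the generic idea that the abstract contrasts BLADE with can be sketched as below. This is not BLADE itself: it assumes per-example losses from a proxy model and a reference model are already available, and the function name and top-k rule are illustrative choices.

```python
from typing import List, Sequence


def select_by_excess_loss(
    proxy_losses: Sequence[float],
    reference_losses: Sequence[float],
    k: int,
) -> List[int]:
    """Return the indices of the k examples with the largest excess loss.

    Excess loss is taken here as proxy loss minus reference loss, so
    examples the proxy still gets wrong but the reference handles well
    are ranked first. Generic sketch, not the BLADE algorithm.
    """
    if len(proxy_losses) != len(reference_losses):
        raise ValueError("loss lists must have the same length")
    excess = [p - r for p, r in zip(proxy_losses, reference_losses)]
    ranked = sorted(range(len(excess)), key=lambda i: excess[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical per-example losses for a batch of four examples.
proxy = [2.0, 0.5, 3.0, 1.0]
reference = [1.9, 0.4, 1.0, 0.2]
selected = select_by_excess_loss(proxy, reference, k=2)
print(selected)  # [2, 3]
```

With a static reference, `reference_losses` would come from a fixed model; the abstract's point is that BLADE replaces it with a reference that stays synchronized with training.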

2606.18646 2026-06-18 cs.RO New submission

A Scalable Embodied Intelligence Platform for Seamless Real-to-Sim-to-Real Transfer of Household Mobile Manipulation Tasks

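Kui Yang, Xianlei Long, Haoxuan Li, Yan Ding, Chao Chen

Affiliations: School of Computer Science, Chongqing University; R&D Department, Lumos Robotics Technology (Suzhou) Co., Ltd

AI summary: Proposes the BestMan platform, which combines automated scene generation, simulation-guided task formalization, and hardware-agnostic middleware to address scene reconstruction, strategy evaluation, and deployment compatibility in real-to-sim-to-real transfer, enabling seamless transfer for household mobile manipulation.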

Comments CCF Transactions on Pervasive Computing and Interaction

English abstract

Mobile manipulation is a fundamental capability in embodied intelligence robotics. The growing demand for robust and generalizable manipulation in unstructured household environments has driven rapid progress in embodied intelligence platforms. However, achieving a seamless transfer across the real-to-sim-to-real cycle faces three key challenges: costly high-fidelity simulation scene reconstruction, the complexity of systematic strategy evaluation in simulation, and incompatible real-world deployments. To address these challenges, we develop BestMan, a scalable and seamless real-to-sim-to-real platform that bridges the gap between simulation and the real world, enabling effective strategy development, integration, and deployment for household mobile manipulation. Specifically, we design a novel Automated Scene Generation (ASG) module to reconstruct realistic simulations from real observations. Then, we propose a simulation-guided task formalization and skill learning architecture that supports the flexible integration and large-scale evaluation of hybrid skill strategies in simulation. Finally, to enhance real-world scalability, we develop a Hardware-agnostic and Unified Middleware (HUM) to ensure seamless and compatible sim-to-real transfer across heterogeneous mobile manipulators for real deployments. Experimental results demonstrate the superior performance of our proposed platform in establishing standardized benchmarks and facilitating promising research in the field of mobile manipulation.

2606.18644 2026-06-18 cs.CV New submission

Spiking Pyramid Wavelet Transformation for High-efficient and Low-energy Image Restoration

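Chen Zhao, Xiantao Hu, Song Wu, Qian Wang, Chen Wu, Rui Xie, Jian Yang, Ying Tai

Affiliations: Nanjing University; Nanjing University of Science and Technology; University of Science and Technology of China; China Mobile Institute

AI summary: Proposes SPWM, a model based on spiking neural networks and a pyramid wavelet transform; its SDPW block models long-range dependencies and exploits degradation properties in the wavelet domain, substantially reducing computation and energy consumption while maintaining image quality.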

Comments Accepted by Pattern Recognition

English abstract

Spiking neural networks (SNNs) have garnered significant interest in computer vision due to their potential for efficiency and biological inspiration. While spiking CNN-based methods have shown promise for image restoration (IR) tasks, their performance is constrained by the inherent receptive field limitations of CNN operations. In this paper, we explore the benefits of the discrete wavelet transform and propose a spiking pyramid wavelet-based model (SPWM) that targets high efficiency and low energy consumption. Specifically, we develop a spiking dual pyramid wavelet (SDPW) block to model long-range dependency and exploit the properties of the degradation in the wavelet domain. Experimental results on several benchmarks demonstrate that SPWM significantly lowers computational costs and energy consumption while maintaining image quality. Our method showcases the potential of SNNs in the field of IR, offering new insights for future applications on resource-limited devices.

2606.18640 2026-06-18 cs.LG q-bio.QM New submission

MetaboNet-Bench: A Multi-modal Benchmark for Glucose Forecasting in Type 1 Diabetes

Nathaniel Jeffries, Miriam Wolff, Sam Royston, Elizabeth Healey, Caleb Mayer, David Klonoff, Michael Snyder, Tao Wang

Affiliations: Department of Genetics, Stanford University School of Medicine; Replica Health; Boston Children’s Hospital, Harvard Medical School; Diabetes Research Institute, Mills-Peninsula Medical Center

AI summary: To address the lack of a standardized evaluation benchmark for glucose forecasting algorithms in type 1 diabetes, the paper proposes MetaboNet-Bench, a multimodal benchmark integrating glucose, insulin, and carbohydrate data, and compares several models to examine how multimodal data affects model performance.

Comments main content in 10 pages with 5 figures; supplementary section with 11 more pages and 5 more figures

English abstract

Glucose forecasting algorithms are an important aspect of glycemic control management in type 1 diabetes. So far, the research community has developed numerous algorithms and models for forecasting. However, it is well-recognized that the lack of standardized model performance evaluation benchmarks makes fair comparison difficult and hinders further innovation, and thus benchmark standardization is in urgent need. Furthermore, many published glucose forecasting algorithms are limited to CGM data alone, ignoring other multimodal signals such as insulin dosing and carbohydrate intake. Here, we introduce MetaboNet-Bench, a benchmark for multimodal glucose forecasting for patients with type 1 diabetes that provides an extensible open-source evaluation framework for comparison of glucose forecasting algorithms that leverage glucose, insulin, and carbohydrate data. We then demonstrate its utility by benchmarking several recently published glucose forecasting models and a custom multimodal time-series model, representing different model architectures. The results show that the benefit of adding data modalities is conditioned on the complexity of the model and that incorporating more clinical metrics helps identify meaningful gaps to fill for future research.

2606.18636 2026-06-18 cs.CL cs.AI New submission

PEC-Home: Interpretation of Progressively Elliptical Commands in Smart Homes

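Yingyu Shan, Zeming Liu, Silin Li, Boao Qian, Jiashu Yao, Yuhang Guo, Haifeng Wang

Affiliations: Beijing Institute of Technology; Beihang University; Baidu Inc.

AI summary: Users in smart homes rely on shared context and issue progressively elliptical commands, which creates referential and intention ambiguity. The paper proposes PEC-Home, the first simulated home dataset for this problem; experiments show that existing LLM assistants struggle to execute elliptical commands accurately.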

Comments Accepted by ACL 2026 Findings

English abstract

Recent advancements in Large Language Models (LLMs) have empowered home assistants with natural language interaction capabilities. However, current assistants overlook the progressive omission that occurs in human dialogue as shared context accumulates, leading to more elliptical expressions for efficient communication. Thus, current assistants still struggle to interpret such elliptical expressions accurately, which limits their effectiveness in real-world applications. In practical smart home scenarios, assistants face two major challenges caused by elliptical commands: (1) referential ambiguity caused by different environmental expectations among multiple users; and (2) intention ambiguity resulting from user preferences that evolve over time or change with the environment. To address these challenges, we introduce PEC-Home, the first simulated home dataset specifically designed for interpreting progressively elliptical commands in smart homes. Extensive experiments on various LLMs, including GPT-4o, show that existing home assistants struggle to execute user-intended operations based solely on elliptical commands. Even when equipped with tools for storing and retrieving user dialogue history, execution accuracy remains below that achieved with complete commands.

2606.18634 2026-06-18 cs.RO cs.AI New submission

EffiNav: Fusing Depth and Vision-Language for Efficient Object Goal Navigation

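Zecheng Yin, Benedict Jun Ma

Affiliations: Systems Hub of Intelligence Transportation HKUST(GZ)

AI summary: Proposes the EffiNav framework, which fuses depth information with a vision-language model and guides navigation by predicting exploration frontiers and semantic priors; it matches or surpasses baselines on the HM3D and OVON datasets, improving path efficiency and generalization.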

English abstract

Locating a target object while exploring an unknown environment is a fundamental capability for autonomous agents, with applications ranging from search-and-rescue to field robots. A simplified version of this task is Object Goal Navigation (ObjNav). In ObjNav, successful arrival at the target object provides a basic measure of performance; however, the efficiency of the navigation trajectory is equally important, as it indicates how intelligently the agent explores and how much time remains for subsequent tasks. In unknown environments, the key to efficient navigation lies in deciding where to explore next. While many prior works aim to address this core challenge and have achieved promising performance in certain settings, recent training-based models and non-training frameworks still suffer from generalization and efficiency issues respectively, which in the worst cases can lead to excessive exploration of already-visited areas or redundant back-and-forth motion. We evaluate EffiNav on two widely used simulation benchmarks, Habitat Matterport 3D (HM3D) and Open-Vocabulary Object goal Navigation (OVON), and further validate its effectiveness on physical robots in real-world settings. We conduct failure analysis on a large number of simulation episodes. With minimal modification, we also extend EffiNav to a memory-augmented ObjNav task on the GOAT-BENCH dataset, demonstrating its adaptability beyond standard ObjNav settings. Across two standard metrics, Success Rate (SR) and Success weighted by Path Length (SPL), EffiNav matches or outperforms recent baselines, reflecting its efficiency, robustness, and practical applicability. Given the different emphases of the two datasets, the results indicate that the framework is more balanced and generalizable for efficient ObjNav.
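
The two metrics named in the abstract have standard definitions in the embodied-navigation literature, sketched below. The episode tuple layout is an assumption for illustration; this is not EffiNav's evaluation code.

```python
from typing import List, Tuple

# (success, shortest_path_length, agent_path_length) per episode; this
# layout is a hypothetical choice for the sketch.
Episode = Tuple[bool, float, float]


def success_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes in which the agent reached the target."""
    return sum(1 for success, _, _ in episodes if success) / len(episodes)


def spl(episodes: List[Episode]) -> float:
    """Success weighted by Path Length.

    Each successful episode contributes shortest / max(taken, shortest),
    so a success along the shortest path scores 1 and detours score less.
    Failed episodes contribute 0.
    """
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)
    return total / len(episodes)


episodes = [
    (True, 5.0, 5.0),    # optimal path
    (True, 4.0, 8.0),    # success with a path twice as long as needed
    (False, 6.0, 10.0),  # failure
    (True, 3.0, 6.0),    # success with a path twice as long as needed
]
print(success_rate(episodes))  # 0.75
print(spl(episodes))           # 0.5
```

The gap between SR (0.75) and SPL (0.5) in this toy example is the kind of inefficiency the abstract attributes to redundant exploration.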

2606.18632 2026-06-18 cs.RO New submission

ROBOSHACKLES: A Safety Dataset for Human-Injury Prevention in Embodied Foundation Models

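Zhuowen Yin, Chongyang Liu, Wenzhang Yang, Renjue Li, Yinxing Xue

Affiliations: Institute of AI for Industries, Chinese Academy of Sciences; University of Science and Technology of China

AI summary: Because data of robots injuring humans is hard to collect safely, the paper proposes a safety data construction pipeline grounded in real observations and builds ROBOSHACKLES, a dataset of 10,000 videos covering direct-harm and indirect-harm categories; evaluation finds that existing models produce unsafe actions in 100% of the safety-critical scenarios.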

English abstract

Embodied Foundation Models (EFMs) integrate multimodal understanding, future-state reasoning, and executable robot actions. Yet their safety alignment for human-injury prevention remains underexplored, primarily because real-world data of robots harming humans or creating hazardous household situations cannot be safely or ethically collected. To address this challenge, we propose a safety-critical data construction pipeline for human-injury prevention in EFMs. Starting from real DROID observations, our construction pipeline proceeds through scene understanding, hazard-aware image editing, temporal prompt generation, and single-pass rollout synthesis. The temporal prompts specify the expected scene evolution, while Wan2.7 synthesizes realistic robotic rollouts from the edited hazardous states in a single pass. Using this pipeline, we construct ROBOSHACKLES, a 10,000-clip robotic video dataset derived from real DROID observations, spanning two direct-harm and four indirect-harm categories. To ensure dataset quality, we assess task completion and visual quality with automatic metrics, and evaluate six representative EFMs under a refusal-based safety criterion. Results show that all evaluated models produce unsafe actions in the tested safety-critical scenarios, yielding a 100% unsafe action generation rate. ROBOSHACKLES serves as a scalable benchmark and training resource for refusal learning and hazard anticipation before robot action execution. The dataset is publicly available at https://huggingface.co/datasets/YZW00/RoboShackles.

2606.18630 2026-06-18 cs.RO New submission

DNN Koopman-Based Deviation Compensation for UGV Path Tracking Control on Coupled Slope and Potholed Road

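Jian Zhao, Wenbo Zhou, Zhicheng Chen, Bing Zhu, Jiayi Han, Dongjian Song, Yinju Lin, Peixing Zhang

Affiliations: Xiamen King Long United Automotive Industry Co., Ltd.

AI summary: Proposes a DNN Koopman-based deviation compensation strategy that combines adaptive-forgetting recursive least squares estimation of tire stiffness, Laguerre model predictive control, and event-triggered cooperative compensation, improving UGV path tracking accuracy by more than 11.5% on coupled slope and potholed roads.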

Comments 22 pages, 13 figures

English abstract

Unmanned ground vehicles (UGVs) operating in off-road scenarios are confronted with complex terrain disturbances that can substantially degrade path tracking performance. To address this challenge, this paper proposes a deep neural network (DNN) Koopman-based deviation compensation strategy for UGV path tracking control. Firstly, based on the vehicle dynamic function on coupled slope, an adaptive forgetting recursive least squares method with decoupled error terms is designed to estimate tire cornering stiffness. On this basis, a Laguerre model predictive control (LMPC) path tracking control strategy is designed by incorporating Laguerre functions, which can reduce computational resource usage while maintaining reliable tracking performance across different coupled slope scenarios. Then, by integrating Koopman operator theory with DNN, a DNN Koopman (DK) path deviation compensation method is proposed, which significantly improves the path tracking accuracy of UGV under potholed road disturbances. Furthermore, an event-triggered parallel cooperative (EPC) compensation mechanism that couples LMPC with DK is established based on compensation activation criteria and credibility verification. This mechanism improves path tracking accuracy on potholed road while ensuring the feasibility of overall steering command and stability of vehicle after DK compensation. Finally, a hardware-in-the-loop (HiL) experimental platform is constructed for validation. Experimental results demonstrate that the proposed UGV path tracking strategy improves tracking performance by more than 11.5% across multiple operating conditions.
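
As background for the stiffness-estimation step, a textbook scalar recursive least squares update with a fixed forgetting factor is sketched below. It is not the paper's adaptive variant with decoupled error terms; the linear tire model (lateral force = stiffness times slip angle), the numeric values, and all names are assumptions for illustration.

```python
from typing import Sequence


def rls_forgetting(
    regressors: Sequence[float],
    outputs: Sequence[float],
    forgetting: float = 0.98,
    initial_covariance: float = 1000.0,
) -> float:
    """Estimate theta in y = x * theta with exponentially forgotten data.

    Standard scalar RLS: a gain is computed from the covariance, the
    estimate is corrected by the prediction error, and the covariance is
    inflated by 1/forgetting so that older samples count less.
    """
    theta = 0.0
    covariance = initial_covariance
    for x, y in zip(regressors, outputs):
        gain = covariance * x / (forgetting + x * covariance * x)
        theta += gain * (y - x * theta)
        covariance = (covariance - gain * x * covariance) / forgetting
    return theta


# Hypothetical noiseless data: slip angles (rad) and lateral forces
# generated with a true stiffness of 3.0 (arbitrary units).
slip_angles = [0.1 * (i % 10 + 1) for i in range(50)]
lateral_forces = [3.0 * a for a in slip_angles]
estimate = rls_forgetting(slip_angles, lateral_forces)
print(round(estimate, 2))
```

A smaller forgetting factor tracks changing stiffness faster at the cost of noisier estimates, which is the trade-off an adaptive forgetting scheme aims to manage.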

2606.18628 2026-06-18 cs.RO New submission

Self-Supervised Mask-Aware Transformers for Fault-Tolerant FBG Force Sensing in Minimally Invasive Surgical Robotics

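Peibo Sun, Shiyuan Dong, Shucheng Ye, Jianrong Cai, Yushan Liu, Hongen Liao, Tianqi Huang, Fang Chen

Affiliations: Shanghai Jiao Tong University; Shenzhen International Graduate School, Tsinghua University

AI summary: To counter the degradation of force estimation caused by channel coupling and fiber breakage in FBG sensors for minimally invasive surgical robots, the paper proposes a unified self-supervised mask-aware Transformer. Masked-channel reconstruction pretraining and fine-tuning with a dynamic corruption curriculum give graceful degradation under multi-channel failures, with an RMSE of 0.0066 N on an 8-channel dataset.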

English abstract

In minimally invasive surgical robotics, catheter-scale Fiber Bragg Grating (FBG) sensors are promising due to their ability to estimate multi-dimensional forces by multiplexing several optical channels. However, deploying these compact multi-channel sensors introduces two critical engineering challenges: inherent nonlinear cross-axis coupling during complex deformations, and intermittent channel dropouts caused by fiber fractures in constrained workspaces. These compounding issues severely degrade force estimation. Existing fault-tolerant approaches rely on combinatorial model banks, which scale exponentially with the channel count and demand prohibitively expensive per-pattern calibration. In this paper, we propose a unified, self-supervised mask-aware Transformer that explicitly models channel availability to enable graceful degradation under diverse and dynamic sensor failures. The encoder is pretrained via masked-channel reconstruction on unlabeled data streams and fine-tuned for force regression using a balanced clean-and-corrupted-view objective alongside a dynamic corruption curriculum. Furthermore, a parallel uncertainty head, trained via heteroscedastic Gaussian negative log-likelihood, predicts per-axis confidence in a single forward pass, circumventing the overhead of multi-pass ensembles. Evaluated on a catheter-scale 8-channel FBG dataset, our single unified model achieves a nominal Root Mean Square Error (RMSE) of 0.0066 N and degrades gracefully to 0.0126 N under severe 4-channel failures. This significantly outperforms a comprehensive model bank of 255 per-pattern neural networks (0.0154 N at 4-channel loss) while eliminating pattern-specific calibration.
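
Two quantities in the abstract can be made concrete with a short sketch. The first function is the standard per-sample heteroscedastic Gaussian negative log-likelihood with a predicted log-variance; the second counts non-empty subsets of available channels, which gives 255 for 8 channels. Whether the authors enumerate their model bank exactly this way is an assumption, and all names are illustrative.

```python
import math
from itertools import combinations


def gaussian_nll(target: float, mean: float, log_variance: float) -> float:
    """Negative log-likelihood of target under N(mean, exp(log_variance)).

    Predicting the log-variance keeps the variance positive. A confident
    prediction (small variance) is rewarded when accurate and penalized
    heavily when wrong.
    """
    variance = math.exp(log_variance)
    squared_error = (target - mean) ** 2
    return 0.5 * (math.log(2.0 * math.pi) + log_variance + squared_error / variance)


def count_availability_patterns(num_channels: int) -> int:
    """Count non-empty subsets of channels that could remain available."""
    return sum(
        1
        for size in range(1, num_channels + 1)
        for _ in combinations(range(num_channels), size)
    )


print(count_availability_patterns(8))  # 255
# The same 1.0 error costs more when the model claims low variance.
print(gaussian_nll(1.0, 0.0, -2.0) > gaussian_nll(1.0, 0.0, 0.0))  # True
```

The count doubles with every added channel, which illustrates why per-pattern model banks scale poorly compared with a single mask-aware model.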

2606.18627 2026-06-18 cs.LG New submission

PACT: Preserving Anchored Cores in Task-vectors for Model Merging

Ningyuan Shi, Zhipeng Zhou, Hao Wang, Chunyan Miao, Peilin Zhao

Affiliations: Shanghai Jiao Tong University; Nanyang Technological University; The Hong Kong University of Science and Technology (Guangzhou)

AI summary: Proposes PACT, which identifies and preserves the load-bearing wall dimensions in the pre-trained weights and anchors task-specific cores in the task vectors, addressing task conflicts and performance degradation under the task-vector paradigm and improving model merging.

Comments 33 pages, 14 figures

English abstract

Model merging has emerged as a training-free alternative to multi-task learning, aiming to combine multiple task-specific fine-tuned models into a single multi-task model. Most existing model merging approaches follow the Task Arithmetic paradigm, which decomposes fine-tuned weights into pre-trained parameters and task vectors, and performs merging exclusively in the task-vector space. The effectiveness of this paradigm implicitly relies on the assumption that task-specific knowledge is encoded solely within task vectors. We argue that this assumption generally does not hold due to the intrinsic task preferences of pre-trained models. Specifically, we identify Load-Bearing Wall (LBW) dimensions, namely some task-critical knowledge that remains embedded in the pre-trained weights rather than being fully transferred into task vectors. We characterize LBW dimensions from both scalar-weight and subspace perspectives, thereby covering the major paradigms of existing model merging methods. Our analysis reveals that, by ignoring LBW dimensions, task-vector-based approaches fail to fully resolve task conflicts and may inadvertently damage task-specific knowledge encoded in the pre-trained model, leading to degradation. To address this issue, we propose PACT, which preserves the anchored task-specific cores (i.e., LBW dimensions) within task vectors by aligning their orthogonal complements with the subspace of the pre-trained weights. These aligned subspace components are then removed from the task vectors before applying existing model merging algorithms. Furthermore, we develop an efficient variant based on randomized SVD to improve scalability. PACT can be seamlessly integrated with existing methods. Extensive experiments across multiple benchmarks demonstrate that PACT consistently enhances mainstream model merging approaches and establishes new state-of-the-art performance.
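
The Task Arithmetic paradigm that the abstract builds on can be sketched in a few lines: a task vector is the difference between fine-tuned and pre-trained weights, and merging adds a scaled sum of task vectors back to the pre-trained weights. This shows only the baseline paradigm, not PACT; the dictionary-of-lists weight layout, the parameter name "w", and the scale value are assumptions.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def task_vector(finetuned: Weights, pretrained: Weights) -> Weights:
    """Task vector = fine-tuned weights minus pre-trained weights."""
    return {
        name: [f - p for f, p in zip(finetuned[name], pretrained[name])]
        for name in pretrained
    }


def merge(pretrained: Weights, task_vectors: List[Weights], scale: float) -> Weights:
    """Add the scaled sum of task vectors to the pre-trained weights."""
    merged: Weights = {}
    for name, base in pretrained.items():
        summed = [
            sum(vector[name][i] for vector in task_vectors)
            for i in range(len(base))
        ]
        merged[name] = [b + scale * s for b, s in zip(base, summed)]
    return merged


# Hypothetical two-parameter model fine-tuned on two tasks.
pretrained = {"w": [1.0, 2.0]}
finetuned_a = {"w": [2.0, 2.0]}
finetuned_b = {"w": [1.0, 4.0]}
vectors = [task_vector(finetuned_a, pretrained), task_vector(finetuned_b, pretrained)]
merged = merge(pretrained, vectors, scale=0.5)
print(merged)  # {'w': [1.5, 3.0]}
```

Note that merging here operates only on the task vectors; the abstract's argument is that some task-critical knowledge stays in the pre-trained weights, which this baseline ignores.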

2606.18625 2026-06-18 cs.RO New submission

SRL: Combining SLIP Model and Reinforcement Learning for Agile Robotic Jumping

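Xiaowen Hu, Linqi Ye, Yudi Zhu, Chenyue Shao, Rankun Li, Qingdu Li, Yan Peng

Affiliations: Institute of Artificial Intelligence, Shanghai University; Institute of Machine Intelligence, University of Shanghai for Science and Technology

AI summary: Proposes the SRL framework, which fuses the physical baseline of the SLIP model with the adaptive capability of reinforcement learning, optimizing robotic jumping through feedforward control signals and real-time feedback; it markedly reduces training time while maintaining accurate tracking.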

Comments 17 pages, 12 figures

English abstract

Robotic jumping is pivotal in applications such as search and rescue and logistics, where crossing obstacles and enhancing mobility efficiency are critical. The Spring-Loaded Inverted Pendulum (SLIP) model leverages simplified spring-mass dynamics that naturally encode biologically plausible hopping motions, yet its performance degrades on irregular terrain due to idealized assumptions regarding contact and joint dynamics. Meanwhile, Reinforcement Learning (RL) can adapt to diverse and complex environments but often requires extensive data from unguided exploration. The complementary strengths of SLIP's physically grounded baseline and RL's adaptive capabilities motivate a hybrid framework that overcomes these individual limitations. We therefore propose Spring-loaded Reinforcement Learning (SRL), which integrates SLIP-based feedforward control signals with RL-driven real-time feedback, enabling continuous optimization of robotic jumping. Experimental results demonstrate that SRL can achieve more stable jumps with much less training time than the baseline method, maintaining an average position tracking error below 0.1 m and velocity tracking errors within +/-3% of the target values. Through bipedal and quadrupedal simulations of ground and stair jumping, as well as sim-to-sim and sim-to-real validations, SRL exhibits robust adaptability to various task requirements and environmental complexities, underscoring its potential for real-world deployment.

2606.18624 2026-06-18 cs.CL New submission

PragReST: Self-Reinforcing Counterfactual Reasoning for Pragmatic Language Understanding

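Jihyung Park, Minchao Huang, Leqi Liu, Elias Stengel-Eskin

Affiliations: The University of Texas at Austin

AI summary: Proposes the PragReST framework, which builds pragmatic QA data in a self-supervised way, generates counterfactual reasoning traces, and combines supervised fine-tuning with reinforcement learning to improve the pragmatic reasoning of large language models, clearly outperforming baseline models on four benchmarks.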

Comments First two authors contributed equally. Code and models: https://github.com/jihyung803/PragReST

English abstract

Natural language understanding often depends on meanings that are implied rather than explicitly stated, requiring pragmatic reasoning. Despite strong performance on math and logical reasoning, large language models (LLMs) still struggle with making pragmatic inferences, often choosing literal interpretations. To improve LLM pragmatic reasoning, we introduce PragReST, a self-supervised framework that constructs pragmatic QA data, generates counterfactual reasoning traces, and trains models to internalize them through supervised fine-tuning and reinforcement learning, without human-labeled training data or distillation from a stronger teacher. Across four pragmatic benchmarks (PragMega, Ludwig, MetoQA, and AltPrag), PragReST improves over backbone models, task-specific pragmatic tuning baselines, and non-counterfactual variants of the same pipeline. On accuracy-based benchmarks, PragReST improves over the instruct backbone by 5.37 and 5.50% (absolute) for Qwen3-8B and Qwen3-14B, respectively. Our error analysis and ablations underscore the importance of counterfactual reasoning: PragReST primarily reduces errors caused by failures to contrast observed utterances with plausible alternatives, and removing counterfactual reasoning substantially reduces performance. Moreover, our training preserves out-of-domain performance on general-knowledge and mathematical reasoning benchmarks.

2606.18623 2026-06-18 cs.CV eess.IV 新提交

Intrinsic 4D Gaussian Segmentation from Scene Cues

内在4D高斯分割:基于场景线索

Hasan Yazar, Mohamed Rayan Barhdadi, Erchin Serpedin, Mehmet Tuncel, Hasan Kurban

发表机构 * Istanbul Technical University(伊斯坦布尔理工大学) Texas A&M University(德克萨斯农工大学) Hamad Bin Khalifa University(哈马德·本·哈利法大学)

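AI summary: Proposes Intrinsic-GS, a training-free, mask-free method that builds an affinity graph over Gaussian primitives and applies community detection to segment 4D scenes; on Neu3D a geometry-only variant matches SAM-supervised TRASE (0.902 mIoU), and on HyperNeRF it runs 12.5x faster than the mask-generation and feature-rendering stages of mask-supervised pipelines.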

Comments 15 pages, 4 figures, 7 tables. Includes supplementary material. Preprint

AI中文摘要

动态4D高斯泼溅以高保真度重建变形场景,并越来越多地被用作动态3D场景的表示。要利用此类场景进行编辑、操作或运动分析,首先需要对其进行分割:将高斯原语分组为连贯的对象。当前流程通过从基础模型(如SAM)导入2D掩码,并将其提升或蒸馏到高斯表示中来获得这种分组。在动态场景中,这些掩码必须在多个帧和视角中生成,成本高昂,并且所得分割可能强烈依赖于这些外部掩码的质量和一致性。我们探究能否从高斯本身恢复更多的对象级结构,并提出Intrinsic-GS,一种无需训练、无需掩码的方法,该方法根据外观、方向、尺度、变形轨迹和非学习渲染边界线索,在高斯原语上构建稀疏亲和图。该图通过Leiden社区检测进行划分,无需基础模型,也无需学习特征场。在标准的4D高斯分割基准Neu3D和HyperNeRF上,Intrinsic-GS在没有掩码监督的情况下恢复了大量的对象结构,在Neu3D上达到0.746 mIoU,在HyperNeRF上达到0.575;在Neu3D上,仅几何变体达到0.902 mIoU,与SAM监督的TRASE相当。在HyperNeRF上,Intrinsic-GS的运行速度比掩码监督流程中使用的掩码生成和特征渲染阶段快12.5倍。这些结果表明,大部分分割信号已经编码在高斯本身中,为3D和4D高斯分割提供了一种快速、无需掩码的方向,也可能指向在外部掩码不可靠或昂贵的情况下更可泛化、更鲁棒的分割。

英文摘要

Dynamic 4D Gaussian Splatting reconstructs deforming scenes with high fidelity and is increasingly adopted as a representation for dynamic 3D scenes. Putting such a scene to use, for editing, manipulation or motion analysis, first requires segmenting it: grouping the Gaussian primitives into coherent objects. Current pipelines obtain this grouping by importing 2D masks from foundation models such as SAM and lifting or distilling them into the Gaussian representation. In dynamic scenes these masks must be generated across many frames and views, which is costly, and the resulting segmentation can depend strongly on the quality and consistency of those external masks. We ask how much object-level structure can instead be recovered from the Gaussians themselves, and propose Intrinsic-GS, a training-free, mask-free method that builds a sparse affinity graph over Gaussian primitives from appearance, orientation, scale, deformation-trajectory and non-learned rendered-boundary cues. The graph is partitioned with Leiden community detection, requiring no foundation model and no learned feature field. On the standard 4D Gaussian segmentation benchmarks, Neu3D and HyperNeRF, Intrinsic-GS recovers substantial object structure without mask supervision, reaching 0.746 mIoU on Neu3D and 0.575 on HyperNeRF; on Neu3D, a geometry-only variant reaches 0.902 mIoU, matching SAM-supervised TRASE. On HyperNeRF, Intrinsic-GS runs 12.5x faster than the mask-generation and feature-rendering stages used by mask-supervised pipelines. These results suggest that much of the segmentation signal is already encoded in the Gaussians themselves, offering a fast, mask-free direction for 3D and 4D Gaussian segmentation that may also point toward more generalizable, robust segmentation in settings where external masks are unreliable or expensive.
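
The graph-then-partition idea in this abstract can be illustrated with a small sketch. It is only a toy under stated assumptions: the per-primitive feature vectors, the Gaussian-kernel affinity, and the threshold are hypothetical, and connected components are swapped in as a simple stand-in for Leiden community detection, which the paper actually uses.

```python
import math

def build_affinity_edges(features, sigma=1.0, threshold=0.5):
    """Keep an edge (i, j, w) when the kernel affinity w reaches the threshold.

    `features` is a list of equal-length tuples, one per primitive
    (hypothetical stand-ins for appearance, scale, or trajectory cues).
    """
    edges = []
    n = len(features)
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
            w = math.exp(-d2 / (sigma ** 2))
            if w >= threshold:
                edges.append((i, j, w))
    return edges

def connected_components(n, edges):
    """Partition nodes with union-find (stand-in for Leiden, not Leiden itself)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j, _ in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(n)]

# Two well-separated toy groups of primitives.
toy_features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
toy_labels = connected_components(len(toy_features), build_affinity_edges(toy_features))
```

A modularity-based method such as Leiden would also split weakly connected groups that share a few edges, which plain connected components cannot do.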

2606.18621 2026-06-18 cs.LG 新提交

Towards Anomaly Detection on Relational Data

面向关系数据的异常检测

Shiyuan Li, Yunfeng Zhao, Yue Tan, Qingfeng Chen, Yixin Liu, Shirui Pan

发表机构 * Griffith University(格里菲斯大学) Guangxi University(广西大学)

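AI summary: Proposes RelAD, a framework that uses conditional sparse-gated attribute reconstruction and dual-view multi-relational edge reconstruction to detect attribute anomalies and abnormal connection patterns in relational data; it outperforms existing methods on 6 benchmark datasets.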

AI中文摘要

关系数据库广泛应用于现实系统中管理结构化数据。从这类关系数据中检测异常对于识别欺诈、风险和异常行为至关重要,但尚未得到充分探索。关键挑战在于关系数据的内在复杂性:多表属性是高维且异质的,使得稀疏的异常线索容易被正常或无关信息淹没;异常还可能表现为跨不同外键关系的异常连接模式,而现有的表格和图异常检测方法难以捕捉。为解决这些问题,我们提出RelAD,一个基于重建的框架,从属性和关系边重建中捕捉异常。RelAD包含两个核心模块:条件稀疏门控属性重建,抑制冗余的多表属性并强调异常语义块;以及双视图多关系边重建,从内在和行为实体画像中检测关系特定的异常连接。得到的属性和关系信号通过轻量级融合模块整合,产生最终异常分数。我们进一步构建了6个具有系统性异常的基准数据集,大量实验表明RelAD在取得竞争性效率的同时,始终优于其他基线方法。

英文摘要

Relational databases are widely used for managing structured data in real-world systems. Detecting anomalies from such relational data is crucial for identifying fraud, risks, and abnormal behaviors, yet remains under-explored. The key challenges lie in the intrinsic complexity of relational data: multi-table attributes are high-dimensional and heterogeneous, so sparse abnormal clues are easily overwhelmed by normal or irrelevant information; and anomalies may further manifest as abnormal connection patterns across different foreign-key relations, which existing tabular and graph anomaly detection methods are ill-suited to capture. To address these challenges, we propose RelAD, a reconstruction-based framework that captures anomalies from both attribute and relational edge reconstruction. RelAD contains two core modules: conditional sparse-gated attribute reconstruction, which suppresses redundant multi-table attributes and emphasizes abnormal semantic blocks, and dual-view multi-relational edge reconstruction, which detects relation-specific abnormal connections from both intrinsic and behavioral entity profiles. The resulting attribute and relational signals are integrated through a lightweight fusion module to produce the final anomaly score. We further construct 6 benchmark datasets with systematic anomalies, on which extensive experiments show that RelAD consistently outperforms other baselines while achieving competitive efficiency.

2606.18620 2026-06-18 cs.CL cs.AI 新提交

BCL: Bayesian In-Context Learning Framework for Information Extraction

BCL:面向信息抽取的贝叶斯上下文学习框架

Haoliang Liu, Chengkun Cai, Xu Zhao, Han Zhu, Shizhou Huang, Xinglin Zhang, Tao Chen, Jenq-Neng Hwang, Zhang Huaping, Lei Li

发表机构 * HiThink Research(海天瑞声研究) University College London(伦敦大学学院) University of Edinburgh(爱丁堡大学) The Hong Kong University of Science and Technology(香港科技大学) East China Normal University(华东师范大学) Shanghai Medical Image Insights(上海医学影像洞察) University of Waterloo(滑铁卢大学) University of Washington(华盛顿大学) Beijing Institute of Technology(北京理工大学)

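AI summary: Proposes BCL, a framework that uses particle filtering with Bayesian updates to optimize in-context learning for information extraction, with substantial gains on sequence labeling and relation classification tasks.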

Comments ACL 2026 Findings

AI中文摘要

现有的信息抽取(IE)任务越来越多地采用大型语言模型的上下文学习(ICL)。然而,当前的方法要么在不同模型规模上表现不一致,要么缺乏系统优化和泛化能力。基于此,我们提出了BCL(面向信息抽取的贝叶斯上下文学习框架),这是第一个使用贝叶斯更新的粒子滤波来系统优化IE任务中标签表示的优化框架。通过四个步骤——初始化、观测、权重更新和重采样,BCL可以泛化到序列标注和关系分类两种范式。大量实验表明,与现有方法相比,BCL取得了显著且一致的改进。

英文摘要

Information extraction (IE) tasks are increasingly approached with in-context learning (ICL) using large language models. However, current approaches either show inconsistent performance across model scales or lack systematic optimization and generalizability. To address this, we propose BCL (Bayesian In-Context Learning Framework for Information Extraction), the first optimization framework that uses particle filtering with Bayesian updates to systematically refine label representations across IE tasks. Through four steps (initialization, observation, weight update, and resampling), BCL generalizes to both sequence labeling and relation classification paradigms. Extensive experiments demonstrate substantial and consistent improvements over existing approaches.
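
The four steps named in this abstract (initialization, observation, weight update, resampling) follow the shape of a generic particle filter. The sketch below shows that generic loop on a toy one-dimensional quantity with a Gaussian likelihood. It is not BCL's label-representation procedure; the function name, the prior range, and the jitter are all hypothetical.

```python
import math
import random

def particle_filter(observations, n_particles=500, obs_std=1.0, seed=0):
    """Generic bootstrap particle filter for a static 1-D quantity (toy example)."""
    rng = random.Random(seed)
    # 1) Initialization: draw particles from a broad uniform prior (assumed range).
    particles = [rng.uniform(-10.0, 10.0) for _ in range(n_particles)]
    for y in observations:
        # 2) Observation: score each particle against the new observation y.
        likelihoods = [
            math.exp(-0.5 * ((y - p) / obs_std) ** 2) for p in particles
        ]
        # 3) Weight update: Bayes' rule with equal prior weights, then normalize.
        total = sum(likelihoods)
        if total == 0.0:
            weights = [1.0 / n_particles] * n_particles
        else:
            weights = [l / total for l in likelihoods]
        # 4) Resampling: draw particles in proportion to their weights, then add
        #    a small jitter so the particle set does not collapse to one value.
        particles = rng.choices(particles, weights=weights, k=n_particles)
        particles = [p + rng.gauss(0.0, 0.1) for p in particles]
    return sum(particles) / n_particles

# Illustrative observations scattered around 3.0.
estimate = particle_filter([3.1, 2.9, 3.0, 3.2, 2.8])
```

In an IE setting the "state" would be a candidate label representation and the likelihood would come from model behavior on held-out examples; how BCL defines these is not stated in the abstract.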

2606.18613 2026-06-18 cs.CL cs.AI 新提交

Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance

LLMs 是否已准备好辅助医生?PhysAssistBench:交互式医患-电子病历辅助基准

Tianming Du, Peijie Yu, Sihan Shang, Danli Shi, My Linh Nguyen, Shengbo Gao, Guangyuan Li, Yinghong Yu, Yan Jiang, Qianlong Zhao, Behzad Bozorgtabar, Shaoxiong Ji, Jiazhen Pan, Daniel Rueckert, Jiancheng Yang

发表机构 * Aalto University(阿尔托大学) Tencent(腾讯) Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) Hong Kong Polytechnic University(香港理工大学) Aarhus University(奥胡斯大学) Technical University of Munich(慕尼黑工业大学)

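AI summary: Proposes the PhysAssistBench benchmark, which builds interactive patient agents to evaluate how well LLMs coordinate doctor-patient-EHR interaction; current models prove unreliable, and the bottleneck is coordination across capabilities rather than any single capability.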

Comments 34 pages with 8 figures

AI中文摘要

医疗LLM最合理的近期角色是辅助而非替代医生,但当前的评估通常测试孤立能力:临床知识、EHR系统交互或患者沟通。而医生辅助需要在同一交互中协调这些能力,其中医生提出不明确的请求,患者模糊描述症状,EHR系统要求精确的工具使用。我们引入PhysAssistBench,一个用于交互式医患-EHR辅助的基准。基于真实的MIMIC-IV病例,PhysAssistBench使用可扩展的流水线构建交互式、记录驱动的患者代理,将静态EHR记录转化为多轮临床场景,同时保持临床事实准确性。PhysAssistBench提供了一个精选的双语评估集,包含1,296个经过人工审查和医生验证的轮次。与领先LLM的实验表明,当前模型在此设置下仍不可靠,这暴露了临床LLM的关键瓶颈:可靠的辅助需要知识、沟通和系统之间的协调,而非任何单一能力的孤立提升。

英文摘要

The most plausible near-term role of medical LLMs is to assist rather than replace physicians, yet current evaluations often test isolated capabilities: clinical knowledge, EHR system interaction, or patient communication. Physician assistance instead requires coordinating these capabilities within the same interaction, where physicians issue underspecified requests, patients describe symptoms ambiguously, and EHR systems demand precise tool use. We introduce PhysAssistBench, a benchmark for interactive doctor-patient-EHR assistance. Built from real MIMIC-IV cases, PhysAssistBench uses a scalable pipeline to construct agentic patients: interactive, record-grounded agents that turn static EHR records into multi-turn clinical scenarios while preserving clinical factuality. PhysAssistBench provides a curated bilingual evaluation set of 1,296 manually reviewed and physician-validated turns. Experiments with leading LLMs show that current models remain unreliable in this setting, which exposes a key bottleneck for clinical LLMs: reliable assistance requires coordination across knowledge, communication, and systems, not isolated gains in any of them.

2606.18611 2026-06-18 cs.SD cs.AI cs.LG stat.ML 新提交

QC-GAN: A Parameter-Efficient Quaternion Conformer GAN for High-Fidelity Speech Enhancement

QC-GAN: 一种参数高效的四元数Conformer GAN用于高保真语音增强

Shogo Yamauchi, Hideaki Tamori, Makoto Sakai, Yosuke Yamano, Tohru Nitta

发表机构 * The Asahi Shimbun Company(朝日新闻社) Tokyo Woman's Christian University(东京女子基督教大学)

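AI summary: Proposes the parameter-efficient QC-GAN, which pairs a Quaternion Conformer generator with MetricGAN-based training and reduces parameters through the weight sharing of the Hamilton product; on VoiceBank+DEMAND it reaches PESQ 3.48 with 0.89M parameters, comparable to state-of-the-art models more than twice its size.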

Comments 10 pages, 6 figures and 5 tables. Accepted at Interspeech2026

AI中文摘要

我们提出了一种参数高效的语音增强框架——四元数Conformer GAN(QC-GAN),它将四元数Conformer生成器与基于MetricGAN的训练相结合。汉密尔顿积通过结构化权重共享对幅度和相位进行编码,在减少层参数数量的同时保持其相互依赖性。采用度量学习判别器,通过优化近似感知评估分数来最大化感知质量。在VoiceBank+DEMAND数据集上,QC-GAN仅用0.89M参数就达到了3.48的语音质量感知评估(PESQ)分数,其性能与最先进模型相当,而参数量不到后者的一半。一个35K参数的变体实现了3.23的PESQ分数,以显著更少的参数超越了传统方法。在DNS-Challenge 3数据集上的评估进一步证实了其在真实世界条件下的泛化能力。

英文摘要

We propose a parameter-efficient speech enhancement framework, Quaternion Conformer GAN (QC-GAN), which combines a Quaternion Conformer generator with MetricGAN-based training. The Hamilton product encodes the magnitude and phase via structured weight sharing, reducing the number of layer parameters while preserving their interdependencies. A metric-learning discriminator was employed to maximize perceptual quality by optimizing the approximate perceptual evaluation scores. On the VoiceBank+DEMAND dataset, QC-GAN achieved a Perceptual Evaluation of Speech Quality (PESQ) score of 3.48 with only 0.89M parameters, delivering a performance comparable to state-of-the-art models at less than half their size. A 35K-parameter variant achieved a PESQ score of 3.23, surpassing conventional methods with significantly fewer parameters. Evaluation on the DNS-Challenge 3 dataset further confirmed generalization to real-world conditions.
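
The weight sharing this abstract attributes to the Hamilton product can be made concrete. The sketch below implements the standard quaternion product and compares parameter counts for a real-valued dense layer and a quaternion-valued one of the same width. The function names are hypothetical, and the layer-size comparison is the usual argument for quaternion layers in general, not a statement about QC-GAN's exact architecture.

```python
def hamilton_product(p, q):
    """Hamilton product of quaternions p = (a1, b1, c1, d1), q = (a2, b2, c2, d2)."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    )

def dense_layer_params(n_in, n_out):
    """Weights in a real-valued dense layer (bias ignored)."""
    return n_in * n_out

def quaternion_layer_params(n_in, n_out):
    """Weights in a quaternion dense layer over the same number of real features.

    Inputs and outputs are grouped into quaternions (4 reals each), and each
    quaternion weight contributes 4 reals that are reused across the 16 real
    input-output pairings of one Hamilton product.
    """
    if n_in % 4 or n_out % 4:
        raise ValueError("feature counts must be multiples of 4")
    return 4 * (n_in // 4) * (n_out // 4)

i, j = (0, 1, 0, 0), (0, 0, 1, 0)
ij = hamilton_product(i, j)  # equals k
ji = hamilton_product(j, i)  # equals -k, the product is not commutative
```

For a 256-to-256 layer this gives 65,536 real weights against 16,384 quaternion-shared weights, a fourfold reduction.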

2606.18610 2026-06-18 cs.RO cs.CV 新提交

SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation

SC3-Eval: 通过自洽视频生成评估机器人基础模型

Wei-Cheng Tseng, Gashon Hussein, Yuzhu Dong, Allen Z. Ren, Lucy X. Shi, XuDong Wang, Sergey Levine, Zhaoshuo Li, Jinwei Gu, Florian Shkurti, Ming-Yu Liu, Quan Vuong

发表机构 * University of Toronto(多伦多大学) Vector Institute(向量研究所) NVIDIA(英伟达) Physical Intelligence Stanford University(斯坦福大学) UC Berkeley(加州大学伯克利分校) Allen Institute for AI(艾伦人工智能研究所)

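AI summary: Proposes SC3-Eval, which uses forward-inverse dynamics consistency, cross-view consistency, and test-time consistency to turn a pre-trained video foundation model into an accurate policy evaluator, reaching a Pearson correlation of 0.929 across 7 real-world policies.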

AI中文摘要

在真实世界中评估通用机器人操作策略成本高、速度慢且难以扩展。动作条件视频世界模型通过模拟策略 rollout 提供了一种可扩展的替代方案。自回归 rollout 会累积复合误差,多视角观测必须保持相互一致,且评估器必须泛化到行为超出训练分布的策略。我们通过 SC3-Eval 解决这些挑战,这是一种自洽视频生成方案,通过强制三种互补的一致性,将预训练视频基础模型转化为准确的策略评估器。首先,前向-反向动力学一致性联合训练模型从动作预测帧以及从帧恢复动作,将生成的 rollout 锚定在物理上合理的动作流形上,并抵消仅前向模型无法惩罚的漂移。其次,跨视角一致性训练模型从每个相机视角修补其他视角,使多相机观测在长 rollout 中保持连贯,无需任何显式记忆机制。第三,测试时一致性在推理时重用反向动力学模式作为每个动作块的置信度信号,当生成的帧偏离请求的动作时终止 rollout。我们还展示了 SC3-Eval rollout 复现了策略在真实世界 rollout 中表现出的失败模式,支持细粒度的诊断比较而不仅仅是聚合排名。在七个真实世界的视觉-语言-动作策略上,SC3-Eval 达到了闭环皮尔逊相关系数 0.929 和 MMRV 0.119,优于三个强先前的基于视频模型的基线,并泛化到新任务。

英文摘要

Evaluating generalist robot manipulation policies in the real world is expensive, slow, and difficult to scale. Action-conditioned video world models offer a scalable alternative by simulating policy rollouts. However, autoregressive rollouts accumulate compounding errors, observations across multiple camera views must remain mutually consistent, and the evaluator must generalize to policies whose behaviors lie outside the training distribution. We address these challenges with SC3-Eval, a self-consistent video generation recipe that adapts a pre-trained video foundation model into an accurate policy evaluator by enforcing three complementary forms of consistency. First, forward-inverse dynamics consistency jointly trains the model to predict frames from actions and to recover actions from frames, anchoring generated rollouts to a physically plausible action manifold and counteracting the drift a forward-only model cannot penalize. Second, cross-view consistency trains the model to inpaint each camera view from the other, keeping the multi-camera observation coherent over long rollouts without any explicit memory mechanism. Third, test-time consistency reuses the inverse dynamics mode at inference as a per-action-chunk uncertainty signal that terminates rollouts whose generated frames drift away from the requested actions. We also demonstrate that SC3-Eval rollouts reproduce the failure modes that policies exhibit in real-world rollouts, supporting fine-grained diagnostic comparison rather than aggregate ranking alone. Across seven real-world vision-language-action policies, SC3-Eval attains a closed-loop Pearson correlation of 0.929 and MMRV of 0.119, outperforming three strong prior video-model-based baselines, and generalizes to new tasks.
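
The headline metric here is a Pearson correlation between evaluator-predicted and real-world policy performance. The sketch below computes that statistic from two score lists. The per-policy success rates are made-up placeholders for illustration, not results from the paper, and the function name is hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Placeholder success rates for a handful of policies (illustrative only).
simulated = [0.20, 0.35, 0.50, 0.65, 0.80]
real_world = [0.25, 0.30, 0.55, 0.60, 0.85]
r = pearson(simulated, real_world)
```

A value near 1 means the evaluator ranks and spaces policies much as real-world trials do. It does not by itself show that absolute success rates agree.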

2606.18609 2026-06-18 cs.CV 新提交

Hallucination Detection and Correction in Medical VLMs via Counter-Evidence Verification

基于反事实证据验证的医学视觉语言模型幻觉检测与纠正

Nan Zhou, Ke Zou, Meng Liu, Linchao He, Jiaqi Zhu, Yi Zhang, Hu Chen, Huazhu Fu

发表机构 * College of Computer Science, Sichuan University(四川大学计算机科学学院) Yong Loo Lin School of Medicine, National University of Singapore(新加坡国立大学杨潞龄医学院) Key Laboratory of Data Protection and Intelligent Management, Ministry of Education, Sichuan University(四川大学数据保护与智能管理教育部重点实验室) National Key Laboratory of Autonomous Intelligent Unmanned Systems, Beijing Institute of Technology(北京理工大学自主智能无人系统国家重点实验室) Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR)(新加坡科技研究局高性能计算研究所)

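AI summary: Proposes CoEV, a framework that detects and corrects hallucinations in medical VLMs through bidirectional verification between textual assertions and visual evidence, without retraining; detection and correction performance improve markedly on four datasets.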

Comments MICCAI 2026 Accept. Submission Version

AI中文摘要

视觉语言模型(VLM)在医学诊断中的可靠性受到幻觉的挑战,这削弱了信任。现有的幻觉检测方法主要关注识别生成文本与参考数据之间的事实不一致性。虽然一些研究分析了模型在图像中的注意力区域,但它们很少验证这种注意力是否真正反映了支持生成文本的视觉证据。为了解决这一差距,我们提出了反事实证据验证(CoEV),一个无需训练的即插即用框架,通过基于证据的事实一致性验证来检测和纠正幻觉。CoEV在文本断言和视觉证据之间执行双向验证,测试每个陈述是否得到其对应证据区域的支持,并将每个陈述分配到一个四象限诊断图中,该图捕获文本事实性和视觉基础性的组合。CoEV检测幻觉内容,并作为事后细化工具,无需重新训练即可纠正幻觉。在四个医学数据集上的大量实验表明,CoEV能够对抗幻觉。在幻觉检测方面,CoEV始终优于现有方法,平均PR-AUC和ROC-AUC分别提高了3.0%和3.9%的绝对百分点,在特定VQA场景中提升高达18.5%。在幻觉纠正方面,它将Micro-F1提高了高达12.5%,在医学报告生成中将幻觉率降低了超过11.9%,并提高了医学VQA的准确性。这些结果表明,CoEV能够可靠地检测和纠正幻觉,为临床医生提供可靠的、基于证据的诊断线索。代码将在接收后发布。

英文摘要

The reliability of vision-language models (VLMs) in medical diagnosis is challenged by trust-undermining hallucinations. Existing hallucination detection approaches mainly focus on identifying factual inconsistencies between generated text and reference data. While some studies analyze where models attend in images, they seldom verify whether such attention truly reflects the visual evidence supporting the generated text. To address this gap, we propose Counter-Evidence Verification (CoEV), a training-free plug-and-play framework that detects and corrects hallucinations through evidence-based factual consistency verification. CoEV performs bidirectional verification between textual assertions and visual evidence, testing whether each statement is supported by its corresponding evidence region, and assigns each statement to a four-quadrant diagnostic map capturing combinations of text factuality and visual grounding. CoEV detects hallucinated content and serves as a post hoc refinement tool, correcting hallucinations without retraining. Extensive experiments on four medical datasets show that CoEV combats hallucinations in VLMs. For hallucination detection, CoEV consistently outperforms existing methods, improving average PR-AUC and ROC-AUC by 3.0 and 3.9 absolute percentage points respectively, with notable gains of up to 18.5% in specific VQA scenarios. For hallucination correction, it improves Micro-F1 by up to 12.5%, reduces hallucination rates by over 11.9% on medical report generation, and also boosts medical VQA accuracy. These results show that CoEV enables reliable detection and correction of hallucinations, providing clinicians with dependable, evidence-based cues for diagnosis. Code will be released upon acceptance.

2606.18606 2026-06-18 cs.CL cs.AI 新提交

Steerable Cultural Preference Optimization of Reward Models

可引导的文化偏好优化奖励模型

Minsik Oh, Advit Deepak, Sophie Wu, Douwe Kiela, Ekaterina Shutova

发表机构 * Stanford University(斯坦福大学) University of Amsterdam(阿姆斯特丹大学)

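AI summary: Proposes the SCPO algorithm, which trains reward models by balancing diverse cultural preferences; on the PRISM and GlobalOpinionQA datasets it raises minority reward model performance by up to 7 points and is up to 280% more training-data-efficient.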

Comments Accepted to Pluralistic Alignment @ ICML 2026

AI中文摘要

大型语言模型(LLM)技术以每个文化子社区可接受的方式服务于众多不同文化子社区至关重要。然而,迄今为止,关于LLM对齐的研究主要集中于预测来自特定地区的标注者的统一响应偏好。本文旨在以更全球化的视角推进对齐模型的发展,使其能够准确代表子社区的偏好,并且不对任何子社区表现出过度偏见。我们专注于为此目的开发奖励模型,并提出一种新颖的奖励模型训练算法(SCPO),该算法能够以平衡的方式融入多样化的文化偏好。我们的方法使得少数群体奖励模型在两个数据集(PRISM和GlobalOpinionQA)以及7个国家上的性能比基线模型提升最多7点。SCPO在训练数据效率上比奖励模型的完整数据微调高出最多280%。此外,我们通过分别评估子社区的偏好来进行偏见分析,并表明我们的加权方法减轻了过度偏见。我们的代码可在以下网址获取:this https URL

英文摘要

It is essential for large language model (LLM) technology to serve many different cultural sub-communities in a manner that is acceptable to each community. However, research on LLM alignment has so far predominantly focused on predicting a unified response preference of annotators from certain regions. This paper aims to advance the development of alignment models with a more global outlook: models that accurately represent the preferences of sub-communities and do not exhibit excessive bias towards any of them. We focus on the development of reward models for this purpose and present a novel reward model training algorithm (SCPO) that can incorporate diverse cultural preferences in a balanced manner. Our method increases the performance of the minority reward model by up to 7 points over the baseline model across two datasets, PRISM and GlobalOpinionQA, and across 7 countries. SCPO is up to 280% more training data-efficient than full-data finetuning of reward models. In addition, we analyze bias by evaluating separately on the preferences of each sub-community and show that excessive bias is mitigated by our weighting method. Our code is available at https://github.com/minsik-ai/Steerable-Cultural-Preference

2606.18601 2026-06-18 cs.RO 新提交

Admittance-Based Surface Alignment for Human-in-the-Loop Robotic Visual Inspection

基于导纳的表面对齐用于人在环机器人视觉检测

Antara Banerjee, Colin Acton, Xu Chen

发表机构 * University of Washington(华盛顿大学)

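AI summary: Proposes a real-time, closed-loop, admittance-based control framework that fuses operator input with perception-driven alignment to orient a robot end-effector precisely to the local surface; experiments on a 6-DOF manipulator show stable normal tracking and a final mean orientation error of 0.4°.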

AI中文摘要

精密视觉检测是航空航天、半导体和医疗制造中质量保证的基础,这些领域中高价值零件上未被检测到的表面缺陷直接导致报废、返工和现场故障。机器人视觉检测需要在存在感知噪声和表面不规则的情况下,实现末端执行器与局部表面几何的精确对齐。在工业环境中,通常通过遥操作或共享自主性将人类操作员保持在回路中,引入实时调整,使得纯离线运动规划不足。这激发了能够在人类和感知不确定性下做出反应性、顺从行为的控制架构。本文提出了一种新颖的实时闭环机器人定向控制流程,用于精密视觉检测,该流程采用基于导纳的框架,统一了操作员输入和感知驱动的表面对齐。我们将末端执行器设计为在粘性介质中运动的虚拟球体,使得由此产生的物理可解释的质量-阻尼系统根据定向误差和操作员命令生成同步、顺从的运动。我们在6自由度机械臂上验证了该框架,展示了稳定的法向跟踪和0.4°的最终平均定向误差。

英文摘要

Precision visual inspection underpins quality assurance across aerospace, semiconductor, and medical manufacturing, where undetected surface anomalies on high-value parts translate directly into scrap, rework, and field failures. Robotic visual inspection requires precise alignment between the end-effector and local surface geometry in the presence of perception noise and surface irregularities. In industrial settings, a human operator is often kept in the loop via teleoperation or shared autonomy, introducing real-time adjustments that render purely offline motion planning inadequate. This motivates control architectures capable of reactive, compliant behavior under combined human and perceptual uncertainty. This paper presents a novel real-time, closed-loop robotic orientation control pipeline for precision visual inspection, with an admittance-based framework that unifies operator input and perception-driven surface alignment. We design the end-effector as a virtual sphere moving through a viscous medium, such that the resulting physically interpretable mass-damper system generates synchronized, compliant motion from orientation error and operator commands. We validate the framework on a 6-DOF manipulator demonstrating stable normal-tracking and a final mean orientation error of 0.4°.
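
The mass-damper admittance idea in this abstract can be sketched for a single rotational axis. The toy below drives a virtual inertia with viscous damping using an orientation-error term plus an operator torque, then integrates it with semi-implicit Euler steps. The single-axis reduction, the proportional error drive, and every gain value are assumptions for illustration; the paper's actual control law is not given in the abstract.

```python
def simulate_admittance(theta0, theta_target, operator_torque=lambda t: 0.0,
                        mass=0.1, damping=2.0, gain=10.0,
                        dt=0.001, steps=3000):
    """Integrate  mass * d(omega)/dt + damping * omega = drive  on one axis.

    drive = gain * (theta_target - theta) + operator_torque(t)
    All parameter values are hypothetical and chosen for a well-damped response.
    """
    theta, omega = theta0, 0.0
    for k in range(steps):
        t = k * dt
        # Perception-driven alignment term plus the operator's command.
        drive = gain * (theta_target - theta) + operator_torque(t)
        # Virtual sphere in a viscous medium: inertia resisted by damping.
        alpha = (drive - damping * omega) / mass
        omega += alpha * dt   # update angular velocity first
        theta += omega * dt   # then orientation, using the new velocity
    return theta

# Start 0.5 rad off the surface normal with no operator input.
final_theta = simulate_admittance(0.5, 0.0)
# A constant operator torque shifts the resting orientation by torque / gain.
biased_theta = simulate_admittance(0.5, 0.0, operator_torque=lambda t: 1.0)
```

With these values the system is critically damped, so the orientation settles without overshoot, and the operator input blends into the same dynamics rather than overriding the alignment term.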

2606.18598 2026-06-18 cs.AI cs.LG 新提交

Optimizing Lithium Production Decisions under Geological, Demand, and Pricing Uncertainties: A POMDP Framework for Multi-Objective Decision Making

在地质、需求和定价不确定性下优化锂生产决策:多目标决策的POMDP框架

Anna C. Edmonds, Mansur M. Arief, Robert J. Moss, Mykel J. Kochenderfer, Jef Caers

发表机构 * Computer Science Department, Stanford University(斯坦福大学计算机科学系) Aeronautics and Astronautics Department, Stanford University(斯坦福大学航空与航天系) Earth and Planetary Sciences Department, Stanford University(斯坦福大学地球与行星科学系)

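AI summary: Proposes a POMDP framework that optimizes lithium mining decisions through belief-state planning, adapts dynamically to price uncertainty, and achieves higher demand fulfillment and more balanced economic and environmental outcomes.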

Comments 24 pages, 14 tables, 4 figures

AI中文摘要

锂生产中的决策制定具有挑战性,无论是从投资者角度还是战略生产角度。决定开采哪些矿山以及何时开采,不仅涉及地质和价格不确定性,还涉及提取方法选择的复杂性,从直接锂提取到硬岩开采。先前的工作探索了该问题的模型和优化采矿决策的不同方法;这些模型没有考虑定价不确定性、需求不确定性或提取锂的不同采矿技术。将不同的定价模型和提取技术纳入这些模型,可以制定更稳健的策略,不仅决定何时何地开采矿山,还决定采用哪种生产方法。我们将问题表述为部分可观测马尔可夫决策过程(POMDP),并使用信念状态规划方法求解以获得最优决策。在我们的研究中,我们表明POMDP求解器通过信念状态规划和显式不确定性管理,动态适应变化的锂价格机制(静态、线性、指数和随机),优于人类启发式启发法。通过优化勘探、生产和技术选择的顺序,该框架在所有不同的定价和矿床情景下,在项目生命周期内实现了更高的需求满足和更平衡的经济环境结果。

英文摘要

Decision making in lithium production is challenging, whether from an investor's perspective or a strategic production standpoint. Determining which mines to open and when to open them involves not only geological and price uncertainties, but also complexities around the choice of extraction method, from direct lithium extraction to hard rock mining. Prior work explored models of this problem and different methods to optimize mining decisions; these models did not account for uncertainty in pricing, uncertainty in demand, or different mining technologies to extract lithium. Incorporating different pricing models and extraction technologies into these models enables more robust strategies for determining not only when and where to open a mine, but also which method of production to pursue. We frame the problem as a partially observable Markov decision process (POMDP) and solve it using belief-state planning methods to obtain optimal decisions. In our study, we show that POMDP solvers outperform human-inspired heuristics by dynamically adapting to shifting lithium price regimes (static, linear, exponential, and stochastic) through belief-state planning and explicit uncertainty management. By optimally sequencing exploration, production, and technology choice, the framework achieves higher demand fulfillment and more balanced economic and environmental outcomes over the project's lifetime across all pricing and deposit scenarios.
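
Belief-state planning rests on a Bayesian belief update after each observation. The sketch below shows the standard discrete update for a fixed action, b'(s') ∝ O(o | s') · Σ_s T(s' | s) · b(s), on a toy two-state deposit. The state names, observation names, and probabilities are hypothetical and are not taken from the paper's model.

```python
def belief_update(belief, transition, observation_prob, obs):
    """Discrete Bayes filter step for a fixed action.

    belief:            {state: probability}
    transition:        {state: {next_state: probability}}
    observation_prob:  {next_state: {observation: probability}}
    """
    states = list(belief)
    # Prediction: push the belief through the transition model.
    predicted = {
        s2: sum(transition[s][s2] * belief[s] for s in states) for s2 in states
    }
    # Correction: weight each predicted state by the observation likelihood.
    unnormalized = {s2: observation_prob[s2][obs] * predicted[s2] for s2 in states}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s: v / total for s, v in unnormalized.items()}

# Toy example: deposit quality is static, and an exploration action returns a
# noisy assay result (all numbers are illustrative assumptions).
prior = {"rich": 0.5, "poor": 0.5}
static = {"rich": {"rich": 1.0, "poor": 0.0}, "poor": {"rich": 0.0, "poor": 1.0}}
assay = {"rich": {"positive": 0.8, "negative": 0.2},
         "poor": {"positive": 0.3, "negative": 0.7}}
posterior = belief_update(prior, static, assay, "positive")
```

A planner then chooses actions (explore, open a mine, pick an extraction method) as a function of this belief rather than of the unobserved true state.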

2606.18597 2026-06-18 cs.CL 新提交

Low-resource Language Discrimination Towards Chinese Dialects with Transfer learning and Data Augmentation

低资源中文方言辨识:基于迁移学习与数据增强

Fan Xu, Yangjie Dan, Keyu Yan, Yong Ma, Mingwen Wang

发表机构 * Jiangxi Normal University(江西师范大学)

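AI summary: To address the scarcity of annotated resources for Chinese dialects, proposes the CDDTLDA framework, which combines transfer learning with data augmentation: a source-side ASR model, target-side data augmentation and fine-tuning, and a self-attention mechanism that captures shared semantic features; it significantly outperforms existing methods.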

Comments Published in ACM TALLIP

AI中文摘要

中文方言辨识是一项具有挑战性的自然语言处理任务,由于标注资源稀缺。本文中,我们开发了一种新颖的中文方言辨识框架,结合迁移学习与数据增强(CDDTLDA),以克服资源短缺问题。具体来说,我们首先使用一个较大的中文方言语料库训练一个源端自动语音识别(ASR)模型。然后,我们采用一种简单但有效的数据增强方法(即速度、音高和噪声干扰)来增强目标端低资源中文方言,并基于之前的源端ASR模型微调另一个目标ASR模型。同时,通过使用自注意力机制,可以捕获源端和目标端ASR模型之间的潜在共性语义特征。最后,我们提取目标ASR模型中的隐藏语义表示来进行中文方言辨识。我们广泛的实验结果表明,我们的模型在两个基准中文方言语料库上显著优于最先进的方法。

英文摘要

Chinese dialect discrimination is a challenging natural language processing task because annotated resources are scarce. In this article, we develop a novel Chinese dialect discrimination framework with transfer learning and data augmentation (CDDTLDA) to overcome this shortage of resources. Specifically, we first use a relatively large Chinese dialect corpus to train a source-side automatic speech recognition (ASR) model. Then, we adopt a simple but effective data augmentation method (i.e., speed, pitch, and noise perturbation) to augment the target-side low-resource Chinese dialects, and fine-tune another target ASR model based on the source-side ASR model. Meanwhile, the potential common semantic features between the source-side and target-side ASR models are captured using a self-attention mechanism. Finally, we extract the hidden semantic representation in the target ASR model to perform Chinese dialect discrimination. Our extensive experimental results demonstrate that our model significantly outperforms state-of-the-art methods on two benchmark Chinese dialect corpora.

2606.18594 2026-06-18 cs.RO cs.AI 新提交

Benchmarking Action Spaces in Reinforcement Learning for Vision-based Robotic Manipulation

基于视觉的机器人操作中强化学习动作空间的基准测试

Seyed Alireza Azimi, Homayoon Farrahi, Abhishek Naik, Colin Bellinger, A. Rupam Mahmood

发表机构 * Department of Computing Science, University of Alberta(阿尔伯塔大学计算机科学系) National Research Council Canada(加拿大国家研究委员会) School of Electrical Engineering and Computer Science, University of Ottawa(渥太华大学电气工程与计算机科学学院) Vector Institute(向量研究所) Alberta Machine Intelligence Institute (Amii)(阿尔伯塔机器智能研究所)

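AI summary: Evaluates four action spaces on object picking and pushing tasks with sim-to-real transfer, finds that the joint velocity action space gives the best smoothness and task performance, and offers RL practitioners guidance on choosing an action space.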

Comments 9 pages with references

AI中文摘要

在现实世界的强化学习(RL)中,动作空间的选择在塑造运动平滑性、安全性和整体任务性能方面起着关键作用。在本研究中,我们评估了位姿增量、位姿速度、关节位置增量和关节速度在两项基于视觉的操作任务(物体抓取和推动)中的表现。我们在模拟中训练策略,并通过模拟到现实的迁移将其部署到现实世界。我们发现,动作空间表示确实显著影响模拟到现实的性能。特别是,我们发现关节速度动作空间在平滑性和最终任务性能方面最适合基于视觉的抓取和推动任务。我们还为RL实践者在模拟和现实实验中选择动作空间提供了实用指导。

英文摘要

In real-world reinforcement learning (RL), the choice of action space can play a key role in shaping motion smoothness, safety, and overall task performance. In this study, we evaluate pose increment, pose velocity, joint position increment, and joint velocity across two vision-based manipulation tasks: object picking and pushing. We train policies in simulation and deploy them to the real world using sim-to-real transfer. We find that action-space representation indeed significantly affects sim-to-real performance. In particular, we find that the joint velocity action space is best for the vision-based picking and pushing tasks in terms of smoothness and final task performance. We also provide practical guidance for RL practitioners in choosing action spaces for both simulation and real-world experiments.

2606.18591 2026-06-18 cs.CV 新提交

Bridging Creative Intent and Visual Quality: Creator-Driven Recurrent Video Generation with Agentic Feedback Loops

桥接创意意图与视觉质量:基于创作者驱动的循环视频生成与代理反馈循环

Denis Savytski, Aiden Lei, Heding Liu, Warren Yang, Sihan Liang, Alexander Liu, Zhe Zhao

发表机构 * University of California, Davis(加州大学戴维斯分校) The Harker School(哈克学校) Basis Independent Silicon Valley(硅谷贝斯独立学校) Saratoga High(萨拉托加高中)

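AI summary: Proposes CHIEF, a framework for iterative video refinement through human-AI collaboration that combines creator-driven direction with subjective feedback from agents to improve the narrative coherence and creative direction of long videos.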

Comments Accepted to the Workshop on Human-AI Co-Creativity at ICML 2026

AI中文摘要

生成式AI使内容创作日益普及,但许多AI生成的视频缺乏叙事连贯性和创意方向,尤其在较长时长时问题更为突出。与编码不同,AI生成受益于可靠的反馈和循环自我改进等技术,而视频生成需要关于情节、场景和叙事的主观反馈,这自然激发了融入人类创意方向的方法。我们提出了CHIEF,一个人类-AI协同创作视频生成框架,将创作者置于人机循环迭代视频精炼的中心,并通过提供自动主观反馈来支持他们。创作者通过驱动每次迭代来融入其创意方向,而他们的修订则由专门的精炼代理整合。反馈循环由基于角色条件的多模态LLM生成,这些LLM观看生成的视频并从观众角度产生主观批评,提供自我评估无法捕捉的反馈。为测试我们提出框架的有效性,我们与没有电影制作经验的高中生和大学生合作,创作从1分钟短视频到具有复杂情节的完整10分钟短片的视频。

英文摘要

Generative AI has made content creation increasingly accessible, but many AI-generated videos lack narrative coherence and creative direction, issues that become more pronounced at longer durations. Unlike coding, where AI generation benefits from reliable feedback and techniques such as recurrent self-improvement, video generation requires subjective feedback about plot, scenes, and narrative, which naturally motivates approaches that incorporate human creative direction. We introduce CHIEF, a human-AI co-creation video generation framework that places the creator at the center of human-in-the-loop iterative video refinement and supports them with automatic subjective feedback. The creator sets the creative direction by driving each iteration, while a specialized refiner agent integrates their revisions. The feedback is generated by persona-conditioned multimodal LLMs that watch generated videos and produce subjective critiques from audience perspectives, providing feedback that self-evaluation alone cannot capture. To test the effectiveness of our proposed framework, we work with high school and college students with no prior filmmaking experience to create videos, from short 1-minute videos to a complete 10-minute short film with a complicated plot.

2606.18589 2026-06-18 cs.RO 新提交

DREAM-Chunk: Reactive Action Chunking with Latent World Model

DREAM-Chunk:基于潜在世界模型的反应式动作分块

Wenxi Chen, Kaidi Zhang, Chi Lin, Zhiyuan Zhang, Yu She, Yuejiang Liu, Raymond A. Yeh, Shaoshuai Mou, Yan Gu

发表机构 * Purdue University(普渡大学) Stanford University(斯坦福大学)

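AI summary: Proposes DREAM-Chunk, which uses a lightweight latent world model to sample multiple candidate action chunks at test time and select the best-matching one for execution, improving the robustness of action-chunking policies under stochastic dynamics.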

AI中文摘要

动作分块已成为视觉-语言-动作(VLA)模型的常见接口,使得低频策略推理能够驱动高频机器人执行。然而,一旦动作分块被提交,其开环执行在随机动态、硬件执行错误和部分可观测性下可能变得脆弱。我们提出DREAM-Chunk,一种测试时扩展方法,通过轻量级潜在世界模型增强基于分块的策略,无需额外的策略微调。在测试时,DREAM-Chunk采样多个候选动作分块,展开其预测的潜在未来,并从预测状态与观测展开最匹配的分块中选择动作。通过这种方式,DREAM-Chunk利用额外的测试时计算覆盖多个可能的随机未来,并提高长时域分块执行期间的响应性。在Kinetix基准测试中,DREAM-Chunk在增加的动作噪声下提高了鲁棒性,并从更大的候选样本量中受益,尤其是当演示包含纠正行为时。我们进一步在两个机器人平台的四个操作任务和两种VLA策略下,针对各种随机性来源验证了DREAM-Chunk。在仿真和硬件实验中,DREAM-Chunk提高了动作分块策略在随机动态下的鲁棒性。

英文摘要

Action chunking has become a common interface for vision-language-action (VLA) models, enabling low-frequency policy inference to drive high-frequency robot execution. However, once an action chunk is committed, its open-loop execution can be brittle under stochastic dynamics, hardware execution errors, and partial observability. We propose DREAM-Chunk, a test-time scaling method that augments chunking-based policies with a lightweight latent world model, without requiring additional policy fine-tuning. At test time, DREAM-Chunk samples multiple candidate action chunks, rolls out their predicted latent futures, and selects actions from the chunk whose predicted state best matches the observed rollout. In this way, DREAM-Chunk uses additional test-time computation to cover multiple plausible stochastic futures and improve reactivity during long-horizon chunk execution. On the Kinetix benchmark, DREAM-Chunk improves robustness under increasing action noise and benefits from larger candidate sample sizes, especially when demonstrations contain corrective behaviors. We further validate DREAM-Chunk on four manipulation tasks across two robot platforms and two VLA policies under various sources of stochasticity. Across simulation and hardware experiments, DREAM-Chunk improves the robustness of action-chunking policies in stochastic dynamics.
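
The selection step described in this abstract (sample candidate chunks, roll out predicted latent futures, keep the chunk that best matches what was observed) can be sketched as follows. Everything here is a toy under stated assumptions: the latent state is a single number, the world model is a hypothetical additive function, and squared error is an assumed matching criterion.

```python
def rollout(world_model, state, chunk):
    """Predict the sequence of latent states produced by executing `chunk`."""
    states = []
    for action in chunk:
        state = world_model(state, action)
        states.append(state)
    return states

def select_chunk(world_model, start_state, candidates, observed_states):
    """Pick the candidate whose predicted states best match the observed ones.

    Returns the actions of the winning chunk that have not been executed yet,
    that is, everything after the first len(observed_states) steps.
    """
    k = len(observed_states)

    def mismatch(chunk):
        predicted = rollout(world_model, start_state, chunk)[:k]
        return sum((p - o) ** 2 for p, o in zip(predicted, observed_states))

    best = min(candidates, key=mismatch)
    return best[k:]

# Hypothetical additive latent dynamics and three candidate chunks.
toy_model = lambda state, action: state + action
toy_candidates = [[1.0, 1.0, 1.0], [0.5, 0.5, 0.5], [2.0, 2.0, 2.0]]
# After one executed step the observed latent state is 0.6.
remaining_actions = select_chunk(toy_model, 0.0, toy_candidates, [0.6])
```

Re-running this selection as new observations arrive is what lets a chunked policy react mid-chunk without retraining the policy itself.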

2606.18587 2026-06-18 cs.CL cs.AI 新提交

Dual Dimensionality for Local and Global Attention

局部与全局注意力的双重维度

Zhiyuan Wang, Xuan Luo, Sirui Zeng, Xifeng Yan

发表机构 * UC Santa Barbara(加州大学圣塔芭芭拉分校)

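AI summary: Proposes Distance-Adaptive Representation (DAR), which keeps full-dimensional representations for the local context and lower-dimensional ones for distant tokens, shrinking the KV cache while preserving performance.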

AI中文摘要

解码器仅Transformer计算前面token的KV缓存上的注意力。键(和值)通常以相同的维度表示,无论其与预测目标的距离如何。然而,在自然语言中,下一个词受紧邻的前一个词影响最大。我们假设局部和远距离token对表示能力有不对称需求:局部token对预测即时输出更关键,因此需要更丰富的表示,而远距离token主要作为长期记忆,低维表示可能就足够了。我们将这一思想形式化为距离自适应表示(DAR),在受控设置中实现,该设置在局部上下文窗口内保留全维度表示,同时为超出该窗口的token分配降维表示(例如原始维度的1/4)。在多个预训练规模(70M到410M参数)以及1B规模模型上的持续监督微调中,该方法与全维度基线的性能紧密匹配。相比之下,在所有token位置上均匀降低维度会导致性能下降。这些结果挑战了键和值维度应在所有token位置上均匀的常见假设。我们的发现为设计注意力架构提供了新方向,该架构可自适应地跨序列分配表示能力,从而在推理期间进一步减少KV缓存。

英文摘要

Decoder-only Transformers compute attention over the KV cache of preceding tokens. Keys (and values) are typically represented with the same dimensionality, regardless of their distance from the prediction target. In natural language, however, the next word is most strongly influenced by the immediately preceding tokens. We hypothesize that local and distant tokens impose asymmetric demands on representational capacity: local tokens are more critical for predicting immediate outputs and thus require richer representations, whereas distant tokens primarily serve as long-range memory, for which lower-dimensional representations may suffice. We formalize this idea as Distance-Adaptive Representation (DAR), implemented in a controlled setting that preserves full-dimensional representations within a local context window while assigning reduced-dimensional representations (e.g., 1/4 of the original dimensionality) to tokens beyond that window. Across multiple pretraining scales (70M to 410M parameters), as well as continued supervised fine-tuning on a 1B-scale model, this approach closely matches the performance of full-dimensional baselines. In contrast, uniformly reducing dimensionality across all token positions leads to worse performance. These results challenge the common assumption that key and value dimensionality should be uniform across token positions. Our findings suggest a new direction for designing attention architectures that adaptively allocate representational capacity across sequences, enabling further reductions in KV cache during inference.
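A minimal NumPy sketch of the idea, under stated assumptions: the abstract does not say how the reduced-dimensional representations are obtained, so this example simply truncates keys and values to their first `reduced_dim` coordinates for tokens outside the local window. The function names and the truncation scheme are hypothetical and are not taken from the paper.

```python
import numpy as np

def dar_attention(q, k, v, window, reduced_dim):
    """Causal single-head attention with distance-dependent key/value width.

    q, k, v: (seq_len, d). Keys/values within `window` positions of the query
    use all d dimensions; farther ones use only the first `reduced_dim`
    dimensions (truncation is an assumption made for this sketch).
    """
    n, d = q.shape
    r = reduced_dim
    pos = np.arange(n)
    dist = pos[:, None] - pos[None, :]      # dist[i, j] = i - j
    causal = dist >= 0
    local = causal & (dist < window)
    distant = causal & (dist >= window)

    full_scores = q @ k.T / np.sqrt(d)                # full-dimensional scores
    low_scores = q[:, :r] @ k[:, :r].T / np.sqrt(r)   # reduced-dimensional scores
    scores = np.where(local, full_scores, low_scores)
    scores = np.where(causal, scores, -np.inf)        # mask future positions

    # Row-wise softmax; the diagonal is always unmasked, so each row max is finite.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)

    # Distant values contribute only their first r dimensions.
    v_low = np.zeros_like(v)
    v_low[:, :r] = v[:, :r]
    return (weights * local) @ v + (weights * distant) @ v_low

def kv_cache_elements(seq_len, d, window, reduced_dim):
    """Number of scalars stored for keys and values of one head."""
    local_tokens = min(seq_len, window)
    distant_tokens = max(seq_len - window, 0)
    return 2 * (local_tokens * d + distant_tokens * reduced_dim)
```

With illustrative numbers that are not from the paper, `kv_cache_elements(1024, 64, 128, 16)` gives 45,056 stored scalars against 131,072 for a full-dimensional cache of the same length, about 34% of the original size.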

2606.18586 2026-06-18 cs.CV cs.AI New submission

APT: Atomic Physical Transitions for Causal Video-Language Understanding

Shang Wu, Haoran Lu, Songling Liu, Chenwei Xu, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Zhaoran Wang, Han Liu

Affiliations: Northwestern University; Dolby Laboratories

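AI summary: Proposes Atomic Physical Transitions (APTs) as an explicit representation of causal state changes in video, builds a mixed-source dataset, and introduces the APT-Tune fine-tuning recipe so that VLMs learn physical transitions without forgetting event-level knowledge.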

Abstract

Physical events are not understood by their names alone, but by the causal state changes that compose them. A clip-level label such as "bounce" can be correct while hiding the process that makes the event physically valid, from support loss and contact onset to rebound and settling. To make this hidden process explicit, we introduce Atomic Physical Transitions (APTs): minimal, temporally localized state changes that bind a visible cue to an active physical mechanism and before/after dynamical regimes. An APT chain represents a video as an ordered causal transition sequence rather than a single aggregate event label: event labels tell what happened; APT chains explain why it happened. To make APTs learnable by VLMs, we construct mixed-source APT data from human annotations and simulator ground truth, covering 14 transition types across contact, gravity, friction, and rotation/stability, with 27,303 timed instances over 1,246 trials. Using this data, we find that current VLMs miss transition-level physics, with zero-shot recall at most 14% and errors dominated by missed transitions. Direct fine-tuning on APT chains improves transition detection but causes event-level forgetting, indicating that the model learns a specialized answer format rather than a reusable physical representation. We therefore propose APT-Tune, a parameter-efficient recipe that teaches VLMs to use causal transitions without forgetting how to answer video questions. It combines image-pad-aware supervision, format-conditional co-training, and mechanism-conditioned domain-to-type decoding to make APT learning format-robust and physically grounded. With only 11M LoRA parameters on Qwen3-VL-2B, APT-Tune substantially improves APT recall while also improving event-level video transfer. These results show that APTs are not a new answer format, but a human-aligned causal supervision signal for physical video understanding.
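To make the notion of an APT chain concrete, the sketch below models one transition as a small record and scores a predicted chain by recall. The field names, the example type strings, and the matching rule (same transition type within a time tolerance, each prediction used at most once) are assumptions for illustration only; the abstract does not specify the dataset schema or the paper's exact recall definition.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class APT:
    """One atomic physical transition (hypothetical schema)."""
    time_s: float          # when the state change occurs in the clip
    transition_type: str   # e.g. "contact_onset" (illustrative label)
    cue: str               # visible evidence for the change
    mechanism: str         # e.g. contact, gravity, friction, rotation/stability
    before: str            # dynamical regime before the transition
    after: str             # dynamical regime after the transition

def chain_recall(predicted, reference, tolerance_s=0.5):
    """Fraction of reference transitions matched by a same-type prediction
    within `tolerance_s` seconds (an assumed matching rule)."""
    unused = list(predicted)
    hits = 0
    for ref in sorted(reference, key=lambda t: t.time_s):
        for pred in unused:
            same_type = pred.transition_type == ref.transition_type
            if same_type and abs(pred.time_s - ref.time_s) <= tolerance_s:
                unused.remove(pred)   # each prediction can match only once
                hits += 1
                break
    return hits / len(reference) if reference else 0.0

# A "bounce" clip written as an ordered chain instead of a single label.
bounce = [
    APT(0.2, "support_loss", "ball leaves hand", "gravity", "held", "falling"),
    APT(0.9, "contact_onset", "ball touches floor", "contact", "falling", "compressing"),
    APT(1.0, "rebound", "ball moves upward", "contact", "compressing", "rising"),
    APT(3.5, "settling", "ball stops moving", "friction", "rolling", "at rest"),
]
```

A model that reports only the floor contact would score a recall of 0.25 on this chain, which mirrors the abstract's observation that errors are dominated by missed transitions.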

2606.18584 2026-06-18 cs.CL New submission

Speech-Driven End-to-End Language Discrimination towards Chinese Dialects

Fan Xu, Jian Luo, MingWen Wang, GuoDong Zhou

Affiliations: Jiangxi Normal University; Soochow University

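AI summary: To address the difficulty of discriminating between similar languages and dialects, proposes a speech-driven method built on MFCC features and an HMM-DNN end-to-end model, combining an attention mechanism with a CNN that fuses word embeddings and MFCC features; it outperforms existing methods on benchmark corpora.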

Comments: Published in ACM TALLIP

Abstract

Language discrimination among similar languages, varieties, and dialects is a challenging natural language processing task. Traditional text-driven approaches yield poor results. In this paper, we explore the effectiveness of speech-driven features for language discrimination among Chinese dialects. First, we systematically explore the appropriateness of speech-driven MFCC features for CNN-based language discrimination. Then, we design an end-to-end speech recognition model based on HMM-DNN to predict Chinese dialect words. We adopt attention to extract the discriminative words related to different Chinese dialects. Finally, through a CNN, we combine the word-level embeddings and the MFCC-based features. Evaluation on two benchmark Chinese dialect corpora shows the appropriateness and effectiveness of the proposed speech-driven approach to fine-grained Chinese dialect discrimination compared to state-of-the-art methods.