arXivDaily: arXiv daily academic digest, updated Monday through Friday
2507.16481 2026-05-19 cs.RO cs.SY eess.SY

Guided Reinforcement Learning for Omnidirectional 3D Jumping in Quadruped Robots

为四足机器人提供全方位三维跳跃的引导强化学习

Riccardo Bussola, Michele Focchi, Giulio Turrisi, Claudio Semini, Luigi Palopoli

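AI summary: This paper proposes a guided reinforcement learning method that combines Bézier curves with a Uniformly Accelerated Rectilinear Motion (UARM) model to make 3D jumping in quadruped robots more efficient and explainable, and validates its advantages in simulation and experiments.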

AI Chinese abstract

跳跃对四足机器人来说是一个重大挑战,尽管在许多操作场景中至关重要。虽然存在用于控制此类运动的优化方法,但它们往往耗时且需要大量的机器人和地形参数知识,使其在现实世界中不够稳健。强化学习(RL)正逐渐成为一种可行的替代方案,但传统端到端方法在样本复杂性方面效率低下,需要在模拟中进行大量训练,并且最终运动的可预测性差,这使得难以验证最终运动的安全性。为克服这些限制,本文介绍了一种新的引导强化学习方法,通过结合贝塞尔曲线与匀加速直线运动(UARM)模型,利用物理直觉实现高效且可解释的跳跃。广泛的仿真和实验结果清楚地证明了我们的方法相较于现有方法的优势。

English abstract

Jumping poses a significant challenge for quadruped robots, despite being crucial for many operational scenarios. While optimisation methods exist for controlling such motions, they are often time-consuming and demand extensive knowledge of robot and terrain parameters, making them less robust in real-world scenarios. Reinforcement learning (RL) is emerging as a viable alternative, yet conventional end-to-end approaches lack efficiency in terms of sample complexity, requiring extensive training in simulations, and predictability of the final motion, which makes it difficult to certify the safety of the final motion. To overcome these limitations, this paper introduces a novel guided reinforcement learning approach that leverages physical intuition for efficient and explainable jumping, by combining Bézier curves with a Uniformly Accelerated Rectilinear Motion (UARM) model. Extensive simulation and experimental results clearly demonstrate the advantages of our approach over existing alternatives.
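
The two ingredients named in the abstract can be sketched in a few lines. This is only an illustration of a Bézier curve evaluation and of the UARM kinematic equations; the function names, the gravity constant, and any particular way of combining the two are assumptions, not the authors' implementation.

```python
import math


def bezier_point(control_points, t):
    """Evaluate a Bezier curve at t in [0, 1] with De Casteljau's algorithm."""
    pts = [tuple(p) for p in control_points]
    while len(pts) > 1:
        # Repeated linear interpolation between neighbouring control points.
        pts = [tuple((1 - t) * a + t * b for a, b in zip(p, q))
               for p, q in zip(pts, pts[1:])]
    return pts[0]


def uarm_state(p0, v0, a, t):
    """Uniformly accelerated rectilinear motion: position and velocity at time t."""
    pos = p0 + v0 * t + 0.5 * a * t * t
    vel = v0 + a * t
    return pos, vel


def liftoff_speed_for_apex(height, g=9.81):
    """Vertical lift-off speed needed to reach a given apex height in ballistic flight."""
    # From v^2 = 2 * g * h with zero vertical velocity at the apex.
    return math.sqrt(2.0 * g * height)
```

In a guided setup of this kind, a learned policy might only choose a handful of parameters (for example control points or a lift-off velocity) while the curve and the kinematic model fill in the rest of the motion; that division of labour is an assumption made here for illustration.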

2507.16307 2026-05-19 cs.LG cond-mat.mtrl-sci cs.AI physics.chem-ph

Perovskite-R1: a domain-specialized large language model for intelligent discovery of precursor additives and experimental design

钙钛矿-R1:一个专门领域的大型语言模型,用于智能发现前驱体添加剂和实验设计

Xin-De Wang, Zhi-Rui Chen, Peng-Jie Guo, Ze-Feng Gao, Cheng Mu, Zhong-Yi Lu

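AI summary: This study presents Perovskite-R1, a large language model specialized for discovering precursor additives for perovskite solar cells and for experimental design. By systematically mining and curating 1,232 high-quality scientific publications and integrating 33,269 candidate materials, the authors build a domain-specific instruction-tuning dataset intended to make materials discovery more efficient.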

Comments 24 pages; 5 figures

Journal ref
Communications Materials 7, 86 (2026)
AI Chinese abstract

钙钛矿太阳能电池(PSCs)因其卓越的功率转换效率和有利的材料特性而迅速成为下一代光伏技术的有力竞争者。尽管有这些进展,长期稳定性、环境可持续性和可扩展制造等挑战仍然阻碍其商业化。前驱体添加剂工程显示出通过提高PSCs的性能和耐久性来解决这些问题的潜力。然而,科学文献的爆炸性增长以及材料、工艺和设备架构之间的复杂相互作用,使研究人员难以高效地访问、组织和利用该领域内的领域知识。为此,我们介绍了Perovskite-R1,一个具有先进推理能力的专门大型语言模型(LLM),专门用于发现和设计PSC前驱体添加剂。通过系统挖掘和整理1232篇高质量科学出版物,并整合一个包含33,269种候选材料的全面库,我们使用自动问答生成和推理链的方法构建了一个领域特定的指令微调数据集。在该数据集上微调QwQ-32B模型,得到了Perovskite-R1,它可以智能地综合文献见解,生成创新且实用的解决方案用于缺陷钝化和前驱体添加剂的选择。对几个模型提出策略的实验验证证实了它们在提高材料稳定性和性能方面的有效性。我们的工作展示了领域适应的LLM在加速材料发现中的潜力,并提供了一个闭环框架,用于智能、数据驱动的钙钛矿光伏研究进展。

English abstract

Perovskite solar cells (PSCs) have rapidly emerged as a leading contender in next-generation photovoltaic technologies, owing to their exceptional power conversion efficiencies and advantageous material properties. Despite these advances, challenges such as long-term stability, environmental sustainability, and scalable manufacturing continue to hinder their commercialization. Precursor additive engineering has shown promise in addressing these issues by enhancing both the performance and durability of PSCs. However, the explosive growth of scientific literature and the complex interplay of materials, processes, and device architectures make it increasingly difficult for researchers to efficiently access, organize, and utilize domain knowledge in this rapidly evolving field. To address this gap, we introduce Perovskite-R1, a specialized large language model (LLM) with advanced reasoning capabilities tailored for the discovery and design of PSC precursor additives. By systematically mining and curating 1,232 high-quality scientific publications and integrating a comprehensive library of 33,269 candidate materials, we constructed a domain-specific instruction-tuning dataset using automated question-answer generation and chain-of-thought reasoning. Fine-tuning the QwQ-32B model on this dataset resulted in Perovskite-R1, which can intelligently synthesize literature insights and generate innovative and practical solutions for defect passivation and the selection of precursor additives. Experimental validation of several model-proposed strategies confirms their effectiveness in improving material stability and performance. Our work demonstrates the potential of domain-adapted LLMs in accelerating materials discovery and provides a closed-loop framework for intelligent, data-driven advancements in perovskite photovoltaic research.

2507.16059 2026-05-19 cs.RO

Therapist-Exoskeleton-Patient Interaction for Gait Therapy

治疗师-外骨骼-患者互动用于步态治疗

Emek Barış Küçüktabak, Matthew R. Short, Lorenzo Vianello, Daniel Ludvig, Levi Hargrove, Kevin Lynch, Jose Pons

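AI summary: This paper proposes a new gait rehabilitation approach based on physical Human-Robot-Human Interaction (pHRHI): the therapist and the stroke patient both wear lower-limb exoskeletons that are virtually connected at the hips and knees through spring-damper elements, enabling bidirectional interaction and improving rehabilitation outcomes.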

AI Chinese abstract

中风后,个体常因下肢无力和失去独立关节控制而出现运动和平衡障碍。步态恢复是康复的关键目标,传统上通过高强度的治疗师指导训练实现。然而,手动辅助对治疗师来说体力消耗大,并限制了治疗师同时与多个关节互动的能力。机器人外骨骼能够提供多关节支持,减少治疗师的负担,并提供客观反馈,但当前的控制策略往往限制了治疗师的参与和适应性。本文提出了一种基于物理人机人交互(pHRHI)的步态康复新范式,其中治疗师和中风患者均佩戴下肢外骨骼,并通过弹簧阻尼元件在髋膝处虚拟连接。这使得双向互动成为可能,允许治疗师引导运动并接收触觉反馈。在一项针对八名慢性中风患者的研究中,pHRHI训练优于传统治疗师指导的跑步机行走,导致关节活动范围、步态指标、肌肉激活和动机均有所增加。这些结果突显了pHRHI在结合机器人精度与治疗师直觉方面对改善康复结果的潜力。

English abstract

Following a stroke, individuals often experience mobility and balance impairments due to lower-limb weakness and loss of independent joint control. Gait recovery is a key goal of rehabilitation, traditionally achieved through high-intensity therapist-led training. However, manual assistance can be physically demanding and limits the therapist's ability to interact with multiple joints simultaneously. Robotic exoskeletons offer multi-joint support, reduce therapist strain, and provide objective feedback, but current control strategies often limit therapist involvement and adaptability. We present a novel gait rehabilitation paradigm based on physical Human-Robot-Human Interaction (pHRHI), where both the therapist and the post-stroke individual wear lower-limb exoskeletons virtually connected at the hips and knees via spring-damper elements. This enables bidirectional interaction, allowing the therapist to guide movement and receive haptic feedback. In a study with eight chronic stroke patients, pHRHI training outperformed conventional therapist-guided treadmill walking, leading to increased joint range of motion, step metrics, muscle activation, and motivation. These results highlight pHRHI's potential to combine robotic precision with therapist intuition for improved rehabilitation outcomes.
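
The virtual spring-damper connection described above can be illustrated with a minimal sketch. The function names, the gain values in the usage note, and the choice of a simple linear spring-damper law are assumptions for illustration; the abstract does not give the controller equations.

```python
def coupling_torque(theta_a, omega_a, theta_b, omega_b, stiffness, damping):
    """Torque applied to joint A by a virtual spring-damper linking it to joint B.

    theta_* are joint angles (rad) and omega_* are joint velocities (rad/s).
    """
    return stiffness * (theta_b - theta_a) + damping * (omega_b - omega_a)


def bidirectional_torques(therapist, patient, stiffness, damping):
    """Return (therapist_torque, patient_torque) for one pair of coupled joints.

    Each argument is an (angle, velocity) tuple. The two torques are equal and
    opposite, which is what lets the therapist both guide the patient and feel
    the patient's motion as haptic feedback.
    """
    tau_patient = coupling_torque(patient[0], patient[1],
                                  therapist[0], therapist[1],
                                  stiffness, damping)
    return -tau_patient, tau_patient
```

For example, with a hypothetical stiffness of 10 N·m/rad and damping of 1 N·m·s/rad, a therapist hip at 0.5 rad and a patient hip at 0.3 rad pulls the patient joint toward the therapist's posture and pushes back on the therapist with the same magnitude.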

2507.01099 2026-05-19 cs.CV cs.AI cs.LG cs.RO

Geometry-aware 4D Video Generation for Robot Manipulation

面向机器人操作的几何感知4D视频生成

Zeyi Liu, Shuang Li, Eric Cousineau, Siyuan Feng, Benjamin Burchfiel, Shuran Song

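AI summary: This paper proposes a geometry-aware 4D video generation model trained with cross-view pointmap alignment to enforce multi-view 3D consistency of the generated videos. Given a single RGB-D image per view, it generates spatio-temporally consistent future video sequences and produces visually stable, spatially aligned predictions without relying on camera poses as input.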

Comments ICLR 2026; Project website: https://robot4dgen.github.io

AI Chinese abstract

理解并预测物理世界的动态可以增强机器人在复杂环境中的规划和交互能力。尽管最近的视频生成模型在建模动态场景方面显示出强大的潜力,但生成在不同摄像机视角下既时间一致又几何一致的视频仍然是一项重大挑战。为此,我们提出了一种4D视频生成模型,通过在训练过程中使用跨视角点图对齐来监督模型,以确保生成视频的多视角3D一致性。通过这种几何监督,模型学习了一个共享的3D场景表示,使其能够从单个RGB-D图像输入中,根据新的视角生成时空一致的未来视频序列,而无需依赖相机姿态作为输入。与现有基线方法相比,我们的方法在多个模拟和现实世界机器人数据集上产生了更稳定和空间对齐的预测。我们进一步表明,预测的4D视频可用于使用现成的6自由度姿态跟踪器恢复机器人末端执行器轨迹,从而生成在新相机视角下具有良好泛化能力的机器人操作策略。

English abstract

Understanding and predicting dynamics of the physical world can enhance a robot's ability to plan and interact effectively in complex environments. While recent video generation models have shown strong potential in modeling dynamic scenes, generating videos that are both temporally coherent and geometrically consistent across camera views remains a significant challenge. To address this, we propose a 4D video generation model that enforces multi-view 3D consistency of generated videos by supervising the model with cross-view pointmap alignment during training. Through this geometric supervision, the model learns a shared 3D scene representation, enabling it to generate spatio-temporally aligned future video sequences from novel viewpoints given a single RGB-D image per view, and without relying on camera poses as input. Compared to existing baselines, our method produces more visually stable and spatially aligned predictions across multiple simulated and real-world robotic datasets. We further show that the predicted 4D videos can be used to recover robot end-effector trajectories using an off-the-shelf 6DoF pose tracker, yielding robot manipulation policies that generalize well to novel camera viewpoints.

2506.23549 2026-05-19 cs.AI cs.HC cs.LG

CooT: Learning to Coordinate In-Context with Coordination Transformers

CooT: 通过协调转换器学习协调上下文

Huai-Chih Wang, Hsiang-Chun Chuang, Hsi-Chun Cheng, Dai-Jie Wu, Shao-Hua Sun

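AI summary: This study proposes CooT, a framework that uses in-context learning for real-time partner adaptation, addressing the challenge of coordinating with unfamiliar partners in multi-agent systems. It learns to align its actions with partner intentions through observation, and its main contribution is generalization across diverse partner behaviors.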

Comments ICML 2026

AI Chinese abstract

在多智能体系统中,协调不熟悉合作伙伴仍然是一个重大挑战。现有方法,如基于种群的方法,通过多样性提高鲁棒性,但通常缺乏在训练分布之外高效适应的机制。此外,微调在少样本设置中不可行,因为其交互成本高。为了解决这些限制,我们提出了CooT,一个利用上下文学习(ICL)进行实时合作伙伴适应的框架。与以往专注于任务泛化的ICL方法不同,CooT旨在在多样化的合作伙伴行为上实现泛化。在行为偏好智能体的轨迹上训练,它通过观察学习对齐动作与合作伙伴意图。我们在两个具有挑战性的多智能体基准测试中评估了CooT:Overcooked和Google Research Football。结果表明,CooT在性能上始终优于基于种群的方法、基于梯度的微调和Meta-RL基线,实现了稳定且快速的适应,而无需参数更新。人类评估也发现CooT是更受青睐的合作者,我们的消融实验确认了其快速适应新合作伙伴并在突然合作伙伴变化下保持稳定的能力,使其在现实世界的人机协作中具有可靠性。

English abstract

Effective coordination among unfamiliar partners remains a major challenge in multi-agent systems. Existing approaches, such as population-based methods, improve robustness through diversity but often lack mechanisms for efficient adaptation beyond training distribution. Moreover, fine-tuning is impractical in few-shot settings due to its high interaction cost. To address these limitations, we propose CooT, a framework that leverages in-context learning (ICL) for real-time partner adaptation. Unlike prior ICL approaches that focus on task generalization, CooT is designed to generalize across diverse partner behaviors. Trained on trajectories from behavior-preferring agents, it learns to align actions with partner intentions purely through observation. We evaluate CooT on two challenging multi-agent benchmarks: Overcooked and Google Research Football. Results show that CooT consistently outperforms population-based methods, gradient-based fine-tuning, and Meta-RL baselines, achieving stable and rapid adaptation without parameter updates. Human evaluations also identify CooT as a preferred collaborator, and our ablations confirm its ability to adapt quickly to new partners and remain stable under sudden partner changes, making it reliable for real-world human-AI collaboration.

2506.23287 2026-05-19 cs.LG q-bio.QM

HDTree: Generative Modeling of Cellular Hierarchies for Robust Lineage Inference

HDTree: 用于鲁棒谱系推断的细胞层次生成建模

Zelin Zang, WenZhe Li, Yongjie Xu, Chang Yu, Changxi Chi, Jingbo Zhou, Zhen Lei, Stan Z. Li

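AI summary: This paper proposes HDTree, a generative modeling framework for robust lineage inference. It captures cellular hierarchies with a unified hierarchical codebook and a quantized diffusion process, improving stability and scalability, and is reported to outperform existing methods in lineage inference accuracy, reconstruction quality, and hierarchical consistency on general-purpose and single-cell datasets.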

Comments accepted by ICML26

AI Chinese abstract

在单细胞研究中,追踪和分析高通量单细胞分化轨迹对于理解生物过程至关重要。关键在于对支配细胞发育的层次结构的稳健建模。传统方法在计算成本、性能和稳定性方面存在局限。基于VAE的方法虽有所进展,但仍需要分支特定的网络模块,限制了其可扩展性和稳定性,同时常遭遇后验崩溃问题。为克服这些挑战,我们引入HDTree,一种用于稳健谱系推断的生成建模框架。HDTree通过统一的层次代码库在层次化潜在空间中捕捉树状关系,并利用量化扩散过程建模连续细胞状态转换。通过将生成过程与Waddington景观对齐,该方法不仅提高了稳定性和可扩展性,还增强了推断谱系的生物学合理性。HDTree的有效性通过在通用和单细胞数据集上的比较得到验证,其在谱系推断准确性、重建质量和层次一致性方面均优于现有方法。这些贡献使细胞分化路径的准确高效建模成为可能,为生物学发现提供可靠见解。代码可在 https://github.com/zangzelin/code_HDTree_icml 获取。

English abstract

In single-cell research, tracing and analyzing high-throughput single-cell differentiation trajectories is crucial for understanding biological processes. Key to this is the robust modeling of hierarchical structures that govern cellular development. Traditional methods face limitations in computational cost, performance, and stability. VAE-based approaches have made strides but still require branch-specific network modules, limiting their scalability and stability, while often suffering from posterior collapse. To overcome these challenges, we introduce HDTree, a generative modeling framework designed for robust lineage inference. HDTree captures tree relationships within a hierarchical latent space using a unified hierarchical codebook and employs a quantized diffusion process to model continuous cell state transitions. By aligning the generative process with the Waddington landscape, this method not only improves stability and scalability but also enhances the biological plausibility of inferred lineages. HDTree's effectiveness is demonstrated through comparisons on both general-purpose and single-cell datasets, where it outperforms existing methods in lineage inference accuracy, reconstruction quality, and hierarchical consistency. These contributions enable accurate and efficient modeling of cellular differentiation paths, offering reliable insights for biological discovery. Code is available at https://github.com/zangzelin/code_HDTree_icml.

2506.20522 2026-05-19 cs.CV

AI-assisted radiographic analysis in detecting alveolar bone-loss severity and patterns

辅助人工智能的放射学分析用于检测牙槽骨丧失的严重程度和模式

Chathura Wimalasiri, Piumal Rathnayake, Shamod Wijerathne, Sumudu Rasnayaka, Dhanushka Leuke Bandara, Roshan Ragel, Vajira Thambawita, Isuru Nawinne

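AI summary: This study proposes an AI-based deep learning framework that automatically detects and quantifies alveolar bone loss and its patterns from intraoral periapical (IOPA) radiographs. It combines YOLOv8 for tooth detection with Keypoint R-CNN models for anatomical landmarks to compute bone-loss severity, determines horizontal versus angular bone-loss patterns through geometric analysis, and reports high accuracy on 1,000 expert-annotated radiographs.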

Comments This manuscript is 17 pages with 5 tables and 12 figures. The manuscript is under review at Nature Scientific Reports

AI Chinese abstract

牙周炎是一种慢性炎症性疾病,导致牙槽骨丧失,显著影响口腔健康和生活质量。准确评估骨丧失的严重程度和模式对于诊断和治疗计划至关重要。在本研究中,我们提出了一种新型的基于人工智能的深度学习框架,利用牙内窥镜根尖放射图像自动检测和量化牙槽骨丧失及其模式。我们的方法结合YOLOv8进行牙齿检测,与Keypoint R-CNN模型识别解剖标志物,从而实现对骨丧失严重程度的精确计算。此外,YOLOv8x-seg模型用于分割骨水平和牙齿掩码,通过几何分析确定骨丧失模式(水平 vs. 角状)。在1000张大规模、专家标注的放射图像上进行评估,我们的方法在检测骨丧失严重程度(类内相关系数高达0.80)和骨丧失模式分类(准确率87%)方面取得了高准确率。这种自动化系统提供了一种快速、客观且可重复的牙周评估工具,减少了对主观手动评估的依赖。通过将人工智能整合到牙科放射学分析中,我们的框架有潜力提高牙周炎的早期诊断和个性化治疗计划,最终改善患者护理和临床结果。

English abstract

Periodontitis, a chronic inflammatory disease causing alveolar bone loss, significantly affects oral health and quality of life. Accurate assessment of bone loss severity and pattern is critical for diagnosis and treatment planning. In this study, we propose a novel AI-based deep learning framework to automatically detect and quantify alveolar bone loss and its patterns using intraoral periapical (IOPA) radiographs. Our method combines YOLOv8 for tooth detection with Keypoint R-CNN models to identify anatomical landmarks, enabling precise calculation of bone loss severity. Additionally, YOLOv8x-seg models segment bone levels and tooth masks to determine bone loss patterns (horizontal vs. angular) via geometric analysis. Evaluated on a large, expertly annotated dataset of 1000 radiographs, our approach achieved high accuracy in detecting bone loss severity (intra-class correlation coefficient up to 0.80) and bone loss pattern classification (accuracy 87%). This automated system offers a rapid, objective, and reproducible tool for periodontal assessment, reducing reliance on subjective manual evaluation. By integrating AI into dental radiographic analysis, our framework has the potential to improve early diagnosis and personalized treatment planning for periodontitis, ultimately enhancing patient care and clinical outcomes.

2506.06114 2026-05-19 cs.LG

Scalable unsupervised feature selection via weight stability

通过权重稳定性实现可扩展的无监督特征选择

Xudong Zhang, Renato Cordeiro de Amorim

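AI summary: This paper proposes an unsupervised feature selection method based on Minkowski weighted k-means that aggregates feature weights across different Minkowski exponents to identify stable, informative features and thereby improve clustering performance.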

AI Chinese abstract

无监督特征选择对于在高维数据中提升聚类性能至关重要,其中无关特征可能会掩盖有意义的结构。在本文中,我们引入了Minkowski加权k-均值++,一种新的Minkowski加权k-均值初始化策略。我们的初始化策略利用数据本身得出的特征相关性估计,以概率方式选择质心。在此基础上,我们提出了两种新的特征选择算法,FS-MWK++,通过聚合不同Minkowski指数下的特征权重来识别稳定且信息丰富的特征,以及SFS-MWK++,一种基于子采样的可扩展变体。我们通过理论分析支持我们的方法,证明在显式假设噪声特征和聚类结构的情况下,相关特征在不同Minkowski指数下均被赋予比噪声特征更高的权重。我们的软件可在https://github.com/xzhang4-ops1/FSMWK找到。

English abstract

Unsupervised feature selection is critical for improving clustering performance in high-dimensional data, where irrelevant features can obscure meaningful structure. In this work, we introduce the Minkowski weighted $k$-means++, a novel initialisation strategy for the Minkowski Weighted $k$-means. Our initialisation selects centroids probabilistically using feature relevance estimates derived from the data itself. Building on this, we propose two new feature selection algorithms, FS-MWK++, which aggregates feature weights across a range of Minkowski exponents to identify stable and informative features, and SFS-MWK++, a scalable variant based on subsampling. We support our approach with a theoretical analysis, demonstrating that, under explicit assumptions on noise features and cluster structure, relevant features are assigned consistently higher weights than noise features across a range of Minkowski exponents. Our software can be found at https://github.com/xzhang4-ops1/FSMWK.
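
The idea of aggregating feature weights over several Minkowski exponents can be sketched as follows. This is not the FS-MWK++ algorithm: cluster labels are assumed to be given, cluster centres are plain arithmetic means (a simplification for exponents other than 2), and the weight formula is the dispersion-based one commonly stated for Minkowski weighted k-means, used here as an assumption. All function names are hypothetical.

```python
def mwk_weights(points, labels, p, eps=1e-9):
    """Per-cluster feature weights from within-cluster dispersions (requires p > 1)."""
    n_features = len(points[0])
    weights = {}
    for k in set(labels):
        members = [x for x, lab in zip(points, labels) if lab == k]
        center = [sum(x[v] for x in members) / len(members)
                  for v in range(n_features)]
        # Dispersion of each feature inside cluster k under exponent p.
        disp = [sum(abs(x[v] - center[v]) ** p for x in members) + eps
                for v in range(n_features)]
        # w_kv = 1 / sum_u (D_kv / D_ku)^(1/(p-1)), written in normalised form.
        inv = [d ** (-1.0 / (p - 1.0)) for d in disp]
        total = sum(inv)
        weights[k] = [w / total for w in inv]
    return weights


def stable_feature_scores(points, labels, exponents=(1.5, 2.0, 3.0)):
    """Average the feature weights over clusters and over Minkowski exponents."""
    n_features = len(points[0])
    scores = [0.0] * n_features
    count = 0
    for p in exponents:
        for w in mwk_weights(points, labels, p).values():
            for v in range(n_features):
                scores[v] += w[v]
            count += 1
    return [s / count for s in scores]
```

Features whose averaged score stays high across exponents would be the ones to keep; a feature that only scores well for a single exponent is less stable in this sense.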

2506.05606 2026-05-19 cs.CL cs.HC

OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation

OPeRA: 一个用于评估LLM在模拟人类在线购物行为上的表现的观察、人设、推理和行动数据集

Ziyi Wang, Yuxuan Lu, Wenbo Li, Amirali Amini, Bo Sun, Yakov Bart, Weimin Lyu, Jiri Gesi, Tian Wang, Jing Huang, Yu Su, Upol Ehsan, Malihe Alikhani, Toby Jia-Jun Li, Lydia Chilton, Dakuo Wang

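AI summary: This paper presents OPeRA, a dataset for evaluating how well LLMs simulate human online shopping behavior. It collects observations, personas, rationales, and actions from real users during online shopping sessions and establishes the first benchmark for evaluating how well LLMs predict a specific user's next action and rationale.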

Comments ACL 2026 main

AI Chinese abstract

大型语言模型(LLMs)能否准确模拟特定用户的下一步网页操作?尽管LLMs在生成“可信”的人类行为方面表现出色,但评估其模仿真实用户行为的能力仍是一个开放性挑战,这主要归因于缺乏高质量、公开可用的数据集,这些数据集能够捕捉到实际人类用户的可观测行为和内部推理。为了解决这一差距,我们引入了OPeRA,一个新型的数据集,收集自真实人类参与者在在线购物会话中的观察、人设、推理和行动。OPeRA是首个公开数据集,全面捕捉了用户人设、浏览器观察、细粒度网页操作以及即时自报的推理。我们开发了在线问卷和定制浏览器插件来收集该数据集,以高保真度的方式获取。使用OPeRA,我们建立了首个基准,用于评估当前LLMs在给定人设和<观察、行动、推理>历史的情况下,预测特定用户下一步行动和推理的能力。该数据集为未来研究LLM代理提供了基础,这些代理旨在作为人类的个性化数字双胞胎发挥作用。

English abstract

Can large language models (LLMs) accurately simulate the next web action of a specific user? While LLMs have shown promising capabilities in generating "believable" human behaviors, evaluating their ability to mimic real user behaviors remains an open challenge, largely due to the lack of high-quality, publicly available datasets that capture both the observable actions and the internal reasoning of an actual human user. To address this gap, we introduce OPeRA, a novel dataset of Observation, Persona, Rationale, and Action collected from real human participants during online shopping sessions. OPeRA is the first public dataset that comprehensively captures user personas, browser observations, fine-grained web actions, and self-reported just-in-time rationales. We developed both an online questionnaire and a custom browser plugin to gather this dataset with high fidelity. Using OPeRA, we establish the first benchmark to evaluate how well current LLMs can predict a specific user's next action and rationale given a persona and an <observation, action, rationale> history. This dataset lays the groundwork for future research into LLM agents that aim to act as personalized digital twins for humans.

2506.01523 2026-05-19 cs.LG stat.ML

Beyond RLHF: A Unified Theoretical Framework of Alignment

超越RLHF:对齐的统一理论框架

Jihun Yun, Juno Kim, Jongho Park, Junhyuck Kim, Jongha Jon Ryu, Jaewoong Cho, Kwang-Sung Jun

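AI summary: This paper proposes a unified theoretical framework for alignment. By treating alignment as distribution learning from pairwise preferences, it derives three alignment objectives, proves that they enjoy non-asymptotic $O(1/n)$ convergence, and provides theoretical justification for RLHF.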

AI Chinese abstract

通过强化学习从人类反馈(RLHF)对大型语言模型(LLMs)输出质量进行控制已成为主流方法。然而,现有理论未能为RLHF目标本身提供有力的理论依据,并且由于不同方法通常在不同框架下分析,难以比较各种方法的保证。为建立统一的对齐框架,本文探讨在何种假设下可以推导出现有或新的训练目标并获得理论保证。为此,本文将对齐重新定义为基于成对偏好的分布学习,这建立了一个概率假设,描述了偏好如何揭示关于目标LM的信息。这导致我们提出三种原理性的对齐目标:偏好最大似然估计、偏好蒸馏和反KL最小化。我们证明了它们都自然地避免退化,并具有O(1/n)的收敛性。特别是,反KL高度类似于RLHF目标,为RLHF提供了有力的理论支持。此外,本文的理论首次解释了实证发现:在策略性目标(如RLHF)通常优于似然式目标(如DPO)。最后,实验结果表明,所提出的目标在多个任务和模型上与强基线竞争。

English abstract

Alignment via reinforcement learning from human feedback (RLHF) has become the dominant paradigm for controlling the quality of outputs from large language models (LLMs). However, existing theories do not provide strong justification for the RLHF objective itself and do not allow comparisons of the guarantees between various methods because different methods are often analyzed under different frameworks. Toward a unified framework for alignment, we ask under what assumptions can we derive existing or new training objectives and obtain theoretical guarantees. To this end, we reframe alignment as distribution learning from pairwise preferences, which makes a probabilistic assumption describing how preferences reveal information about the target LM. This leads us to propose three principled alignment objectives: preference maximum likelihood estimation, preference distillation, and reverse KL minimization. We prove that they all enjoy strong non-asymptotic $O(1/n)$ convergence to the target LM, naturally avoiding degeneracy. In particular, reverse KL highly resembles the RLHF objective, providing strong justification for RLHF. Furthermore, our theory explains, for the first time, the empirical finding that on-policy objectives (e.g., RLHF) typically outperform likelihood-style objectives (e.g., DPO). Finally, empirical results indicate that the proposed objectives are competitive with strong baselines across several tasks and models.

2505.20650 2026-05-19 cs.CL cs.AI cs.CE

FinTagging: Benchmarking LLMs for Extracting and Structuring Financial Information

FinTagging: 评估LLM提取和结构化财务信息

Yan Wang, Lingfei Qian, Xueqing Peng, Yang Ren, Keyi Wang, Yi Han, Dongji Feng, Fengran Mo, Shengyuan Lin, Qinchuan Zhang, Kaiwen He, Chenri Luo, Jianxing Chen, Junwei Wu, Chen Xu, Ziyang Xu, Jimin Huang, Guojun Xiong, Xiao-Yang Liu, Qianqian Xie, Jian-Yun Nie

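AI summary: This paper presents FinTagging, a benchmark for evaluating LLMs on extracting and structuring financial information. By decomposing the task into two subtasks, FinNI and FinCL, it reveals the limitations of LLMs in fine-grained concept linking.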

AI Chinese abstract

准确解读财务报告中的数字数据对市场和监管机构至关重要。尽管XBRL(可扩展商业报告语言)提供了对财务数据进行标记的标准,但将数千个事实映射到超过1万项美国通用会计准则(US GAAP)概念仍然成本高昂且容易出错。现有基准将此任务简化为对小概念子集的扁平单步分类,忽略了分类法的层次语义和财务文档的结构特性。因此,这些基准无法评估LLM在真实报告条件下的表现。为弥合这一差距,我们引入FinTagging,首个全面的结构感知和全范围XBRL标记基准。我们将复杂的标记过程分解为两个子任务:(1)FinNI(财务数字识别),从异构上下文中提取实体和类型;(2)FinCL(财务概念链接),将提取的实体映射到完整的US GAAP分类法。这种两阶段的框架使能够公平评估LLM在数值推理和分类法对齐方面的能力。在零样本设置下评估多种LLM发现,尽管模型在提取方面表现良好,但在细粒度概念链接上存在显著困难,突显了领域特定结构感知推理的关键限制。

English abstract

Accurate interpretation of numerical data in financial reports is critical for markets and regulators. Although XBRL (eXtensible Business Reporting Language) provides a standard for tagging financial figures, mapping thousands of facts to over 10k US GAAP concepts remains costly and error-prone. Existing benchmarks oversimplify this task as flat, single-step classification over small subsets of concepts, ignoring the hierarchical semantics of the taxonomy and the structured nature of financial documents. Consequently, these benchmarks fail to evaluate Large Language Models (LLMs) under realistic reporting conditions. To bridge this gap, we introduce FinTagging, the first comprehensive benchmark for structure-aware and full-scope XBRL tagging. We decompose the complex tagging process into two subtasks: (1) FinNI (Financial Numeric Identification), which extracts entities and types from heterogeneous contexts including text and tables; and (2) FinCL (Financial Concept Linking), which maps extracted entities to the full US GAAP taxonomy. This two-stage formulation enables a fair assessment of LLMs' capabilities in numerical reasoning and taxonomy alignment. Evaluating diverse LLMs in zero-shot settings reveals that while models generalize well in extraction, they struggle significantly with fine-grained concept linking, highlighting critical limitations in domain-specific, structure-aware reasoning.

2505.19155 2026-05-19 cs.CV cs.CL

Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs

稀疏到密集:一种无损加速视频理解的LLM免费午餐

Xuan Zhang, Cunxiao Du, Sicheng Yu, Jiawei Wu, Fengzhuo Zhang, Wei Gao, Qian Liu

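AI summary: This paper proposes Sparse-to-Dense (StD), a decoding strategy that combines a sparse top-K attention module with a dense full-attention module to accelerate video large language models (Video-LLMs) without loss, speeding up the processing of long video sequences.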

Comments Accepted by ACL 2025

AI Chinese abstract

由于当前视频大语言模型(Video-LLMs)的自回归性质,输入序列长度的增长会导致推理延迟增加,这给处理通常非常长的视频序列带来了挑战。我们发现,在解码过程中,Video-LLMs中大多数标记的注意力分数趋于稀疏和集中,只有某些标记需要全面的全注意力。基于这一见解,我们引入了Sparse-to-Dense(StD),一种新颖的解码策略,集成了两个不同的模块:一个利用稀疏top-K注意力,另一个采用密集全注意力。这些模块协同工作,以在不损失的情况下加速Video-LLMs。快速(稀疏)模型推测解码多个标记,而缓慢(密集)模型并行验证它们。StD是一种无调优、即插即用的解决方案,可在视频处理中实现高达1.94倍的壁时加速。它在保持模型性能的同时,使从标准Video-LLM无缝过渡到稀疏Video-LLM变得可能,只需最小的代码修改。

English abstract

Due to the auto-regressive nature of current video large language models (Video-LLMs), the inference latency increases as the input sequence length grows, posing challenges for the efficient processing of video sequences that are usually very long. We observe that during decoding, the attention scores of most tokens in Video-LLMs tend to be sparse and concentrated, with only certain tokens requiring comprehensive full attention. Based on this insight, we introduce Sparse-to-Dense (StD), a novel decoding strategy that integrates two distinct modules: one leveraging sparse top-K attention and the other employing dense full attention. These modules collaborate to accelerate Video-LLMs without loss. The fast (sparse) model speculatively decodes multiple tokens, while the slow (dense) model verifies them in parallel. StD is a tuning-free, plug-and-play solution that achieves up to a 1.94$\times$ walltime speedup in video processing. It maintains model performance while enabling a seamless transition from a standard Video-LLM to a sparse Video-LLM with minimal code modifications.
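
The draft-then-verify loop described above can be illustrated with a toy greedy speculative decoder. The two "models" below are hypothetical stand-in functions, not Video-LLMs, and the verification is written as a Python loop where a real system would score all drafted positions in one batched forward pass. The point of the sketch is the lossless property: the output matches what the slow model alone would produce.

```python
def speculative_decode(prefix, draft_next, target_next, num_new, gamma=4):
    """Greedy speculative decoding: draft with a fast model, verify with a slow one."""
    seq = list(prefix)
    goal = len(seq) + num_new
    while len(seq) < goal:
        # Draft phase: the fast model proposes up to gamma tokens.
        draft = []
        for _ in range(min(gamma, goal - len(seq))):
            draft.append(draft_next(seq + draft))
        # Verify phase: accept drafted tokens until the first disagreement,
        # then substitute the slow model's token at that position.
        for i, tok in enumerate(draft):
            expected = target_next(seq + draft[:i])
            if tok != expected:
                seq.extend(draft[:i])
                seq.append(expected)
                break
        else:
            seq.extend(draft)
    return seq


def dense_next(seq):
    """Toy stand-in for the slow (dense attention) model."""
    return (3 * seq[-1] + len(seq)) % 11


def sparse_next(seq):
    """Toy stand-in for the fast (sparse attention) model: usually agrees."""
    tok = dense_next(seq)
    return (tok + 1) % 11 if len(seq) % 5 == 0 else tok
```

The speedup in practice depends on how often the fast model's drafts are accepted; when the sparse and dense variants share weights and mostly agree, several tokens can be committed per verification step.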

2505.07813 2026-05-19 cs.RO cs.AI cs.CV cs.LG cs.SY eess.SY

DexWild: Dexterous Human Interactions for In-the-Wild Robot Policies

DexWild:面向真实场景的机器人策略的灵巧交互

Tony Tao, Mohan Kumar Srirama, Jason Jingzhou Liu, Kenneth Shaw, Deepak Pathak

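AI summary: This paper presents DexWild, a framework that co-trains on human and robot demonstration data to improve robot generalization across diverse environments. Experiments show that its success rate in unseen environments is substantially higher than that of policies trained on robot data alone.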

Comments In RSS 2025. Website at https://dexwild.github.io

AI Chinese abstract

大规模、多样化的机器人数据集已成为使灵巧操作策略泛化到新环境的有希望途径,但获取此类数据集存在诸多挑战。虽然远程操作能提供高保真的数据集,但其高成本限制了可扩展性。相反,如果人们可以像在日常生活中一样使用自己的手来收集数据呢?在DexWild中,一个多样化的数据收集团队使用他们的手在多种环境和物体上收集数小时的交互数据。为了记录这些数据,我们创建了DexWild-System,一种低成本、移动且易于使用的设备。DexWild学习框架在人类和机器人示范数据上共同训练,相较于单独训练每个数据集,其性能得到提升。这种组合产生了能够泛化到新环境、任务和形态的稳健机器人策略,只需少量额外的机器人特定数据。实验结果表明,DexWild显著提高了性能,在未见环境中实现了68.5%的成功率,几乎是仅使用机器人数据训练的策略的四倍,并提供了5.8倍更好的跨形态泛化能力。视频结果、代码库和说明可在https://dexwild.github.io上找到。

English abstract

Large-scale, diverse robot datasets have emerged as a promising path toward enabling dexterous manipulation policies to generalize to novel environments, but acquiring such datasets presents many challenges. While teleoperation provides high-fidelity datasets, its high cost limits its scalability. Instead, what if people could use their own hands, just as they do in everyday life, to collect data? In DexWild, a diverse team of data collectors uses their hands to collect hours of interactions across a multitude of environments and objects. To record this data, we create DexWild-System, a low-cost, mobile, and easy-to-use device. The DexWild learning framework co-trains on both human and robot demonstrations, leading to improved performance compared to training on each dataset individually. This combination results in robust robot policies capable of generalizing to novel environments, tasks, and embodiments with minimal additional robot-specific data. Experimental results demonstrate that DexWild significantly improves performance, achieving a 68.5% success rate in unseen environments, nearly four times higher than policies trained with robot data only, and offering 5.8x better cross-embodiment generalization. Video results, codebases, and instructions are available at https://dexwild.github.io

2505.06907 2026-05-19 cs.AI cs.CV cs.NE

A Survey on Foundation Models for Personalized Federated Intelligence

面向个性化联邦智能的基础模型综述

Yu Qiao, Huy Q. Le, Avi Deb Raha, Phuong-Nam Tran, Apurba Adhikary, Mengchun Zhang, Loc X. Nguyen, Eui-Nam Huh, Dusit Niyato, Choong Seon Hong

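AI summary: This survey reviews foundation models for personalized federated intelligence, examines the combination of federated learning with foundation models, and proposes personalized federated intelligence (PFI) as a new paradigm intended to support artificial personalized intelligence.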

Comments Accepted ACM Computing Survey

AI Chinese abstract

大语言模型(如ChatGPT、Gemini和Grok)的兴起重塑了人工智能领域。作为基础模型(FMs)的典型实例,它们在生成类人内容方面表现出色,推动人工智能向通用人工智能(AGI)迈进。然而,它们的规模庞大、隐私敏感和计算需求高,给个性化定制带来了挑战。为此,我们提出了人工智能个性化(API)的愿景,专注于将FMs适应到个体用户,同时确保隐私。作为API的核心赋能者,我们提出个性化联邦智能(PFI),这是一种新的范式,不仅整合了联邦学习(FL)的隐私优势和FMs的泛化能力,还将个性化置于核心。为此,我们首先回顾了最近的FL和FMs进展,为PFI奠定基础。然后,我们探讨了PFI流水线的核心阶段:边缘的高效个性化、可信的适应和通过检索增强生成的自适应细化。最后,我们强调了实现PFI的未来方向。总体而言,本文的综述旨在为API的发展奠定基础,作为AGI的补充方向,PFI是关键的赋能范式。

English abstract

The rise of large language models (LLMs), such as ChatGPT, Gemini, and Grok, has reshaped the AI landscape. As prominent instances of foundational models (FMs), they exhibit remarkable capabilities in generating human-like content, pushing the boundaries towards artificial general intelligence (AGI). However, their large-scale nature, privacy sensitivity, and substantial computational demands pose significant challenges for personalized customization for end users. To bridge this gap, we present the vision of artificial personalized intelligence (API), which focuses on adapting FMs to individual users while ensuring privacy. As a central enabler of API, we propose personalized federated intelligence (PFI), a new paradigm that not only integrates the privacy benefits of federated learning (FL) with the generalization capabilities of FMs but also places personalization at its core. To this end, we first survey recent advances in FL and FMs that lay the foundation for PFI. We then explore core stages of the PFI pipeline: efficient personalization at the edge, trustworthy adaptation, and adaptive refinement via retrieval-augmented generation. Finally, we highlight future directions for enabling PFI. Overall, this survey aims to lay a foundation for the development of API as a complementary direction to AGI, with PFI as a key enabling paradigm.

2505.06852 2026-05-19 cs.LG stat.ML

Improving Random Forests by Smoothing

通过平滑改进随机森林

Ziyi Liu, Phuc Luong, Mario Boley, Daniel F. Schmidt

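AI summary: This paper proposes a kernel-based smoothing mechanism that adds local regularity to random forest predictions while preserving their adaptive partitioning, improving predictive performance particularly in data-scarce settings.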

Comments v2: Accepted manuscript. 30 pages (18 main + 12 appendix), 6 figures

AI Chinese abstract

随机森林回归是一种强大的非参数方法,通过数据驱动的分区适应局部数据特征,在各种应用领域中表现出色。然而,随机森林预测的分段常数性质意味着每个分区都是独立预测的,忽略了潜在的函数平滑性。特别是在小数据情况下,输入空间内缺乏信息共享可能导致性能不佳。在本文中,我们提出了一种基于核的平滑机制,通过引入局部正则性来增强随机森林,同时保留其自适应分区能力。我们的方法将核平滑应用于随机森林的分段常数输出,有效地结合了基于树的方法的适应性和核方法的平滑性假设。我们证明这种平滑过程可以被解释为在重新采样训练输入的情况下捕捉树切分点的变异性/不确定性。实验证实,所提出的平滑随机森林模型在各种测试案例中一致提高了预测性能,特别是在数据稀缺的情况下。代码、数据集和实验结果可在 https://github.com/Neal-Liu-Ziyi/SmoothedRandomForest.git 公开获取。

English abstract

Random forest regression is a powerful non-parametric method that adapts to local data characteristics through data-driven partitioning, making it effective across diverse application domains. However, the piecewise constant nature of random forest predictions means each partition is predicted independently, ignoring potential smoothness in the underlying function. Particularly in the small data regime, this lack of information sharing across the input space can lead to suboptimal performance. In this work, we propose a kernel-based smoothing mechanism that enhances random forests by introducing local regularity to their predictions while preserving their adaptive partitioning capabilities. Our approach applies kernel smoothing to the piecewise constant outputs of random forests, effectively combining the adaptability of tree-based methods with the smoothness assumptions of kernel methods. We show that this smoothing procedure can be interpreted as capturing the variability/uncertainty in the tree cut points under resampling of the training inputs. Empirical results demonstrate that the proposed smoothed random forest model consistently improves predictive performance across diverse test cases, particularly in data-scarce settings. Code, datasets, and experiment results are publicly available at https://github.com/Neal-Liu-Ziyi/SmoothedRandomForest.git.
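
A minimal sketch of kernel smoothing applied to a piecewise-constant predictor is given below. The step function stands in for a fitted forest, and the Gaussian kernel, the bandwidth, and the grid-based averaging are assumptions chosen for illustration; the paper's actual smoothing procedure may differ.

```python
import math


def step_predictor(x):
    """Stand-in for a fitted forest: a piecewise-constant function of one input."""
    if x < 1.0:
        return 0.0
    if x < 2.0:
        return 1.0
    return 3.0


def gaussian_smooth(predict, x, bandwidth, grid_radius=4.0, n_grid=401):
    """Kernel-smoothed prediction: a Gaussian-weighted average of predict() around x."""
    total_w = 0.0
    total = 0.0
    for i in range(n_grid):
        # Offset from x, measured in units of the bandwidth.
        u = -grid_radius + 2.0 * grid_radius * i / (n_grid - 1)
        w = math.exp(-0.5 * u * u)
        total_w += w
        total += w * predict(x + bandwidth * u)
    return total / total_w
```

Far from a cut point the smoothed value equals the original constant, while near a cut point it blends the two neighbouring levels. Averaging the predictor over small input perturbations in this way is one informal reading of the abstract's remark about variability in the tree cut points.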

2505.02621 2026-05-19 cs.LG math.OC stat.ML

Mirror Mean-Field Langevin Dynamics

镜像均场 Langevin 动力学

Anming Gu, Juno Kim

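AI summary: This paper proposes mirror mean-field Langevin dynamics (MMFLD) for optimizing probability measures constrained to a convex subset of $\mathbb{R}^d$. It obtains linear convergence guarantees for continuous MMFLD via a uniform log-Sobolev inequality, and uniform-in-time propagation of chaos results for its time- and particle-discretized counterpart.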

Comments ICML 2026

详情
AI中文摘要

均场 Langevin 动力学(MFLD)在 $\mathbb{R}^d$ 上的 Wasserstein 空间上最小化一个熵正则化的非线性凸函数,并最近因其作为无限宽度两层神经网络等相互作用粒子系统的梯度下降动力学模型而受到关注。然而,许多感兴趣的问题具有受限的域,而现有的均场算法由于全局扩散项无法解决此类问题。我们通过将 MFLD 扩展到镜像 Langevin 框架,提出镜像均场 Langevin 动力学(MMFLD),以研究受限在 $\mathbb{R}^d$ 的凸子集上的概率测度的优化。我们通过统一的对数 Sobolev 不等式获得了连续 MMFLD 的线性收敛性保证,并获得了其时间-粒子离散化版本的统一时间传播混沌结果。

英文摘要

The mean-field Langevin dynamics (MFLD) minimizes an entropy-regularized nonlinear convex functional on the Wasserstein space over $\mathbb{R}^d$, and has gained attention recently as a model for the gradient descent dynamics of interacting particle systems such as infinite-width two-layer neural networks. However, many problems of interest have constrained domains, which are not solved by existing mean-field algorithms due to the global diffusion term. We study the optimization of probability measures constrained to a convex subset of $\mathbb{R}^d$ by proposing the \emph{mirror mean-field Langevin dynamics} (MMFLD), an extension of MFLD to the mirror Langevin framework. We obtain linear convergence guarantees for the continuous MMFLD via a uniform log-Sobolev inequality, and uniform-in-time propagation of chaos results for its time- and particle-discretized counterpart.

2503.02161 2026-05-19 cs.LG

LLM-TabLogic: Preserving Inter-Column Logical Relationships in Synthetic Tabular Data via Prompt-Guided Latent Diffusion

LLM-TabLogic: 通过提示引导的潜在扩散模型在合成表格数据中保留列间逻辑关系

Yunbo Long, Liming Xu, Alexandra Brintrup

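AI summary: This paper proposes LLM-TabLogic, which uses large language model reasoning to capture complex logical relationships among table columns and passes them as conditional constraints to a score-based diffusion model that generates data in latent space, preserving inter-column relationships in synthetic tabular data without requiring domain knowledge.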

详情
AI中文摘要

合成表格数据越来越多地被用来替代真实数据,作为一种同时保护隐私和解决数据稀缺问题的有效解决方案。然而,除了保持全局统计属性外,合成数据集还必须维持领域特定的逻辑一致性——特别是在供应链等复杂系统中,诸如运输日期、位置和产品类别等字段必须保持逻辑一致性以确保现实应用。现有生成模型往往忽视这些列间关系,导致现实应用中不可靠的合成表格数据。为了解决这些挑战,我们提出了LLM-TabLogic,一种新颖的方法,利用大语言模型推理来捕捉和压缩表格列间的复杂逻辑关系,同时这些条件约束被传递到Score-based Diffusion模型中,在潜在空间中进行数据生成。通过在真实工业数据集上的广泛实验,我们评估了LLM-TabLogic在列推理和数据生成中的表现,将其与SMOTE和最先进的生成模型等五个基线进行比较。我们的结果表明,LLM-TabLogic在逻辑推理方面具有强大的泛化能力,在未见过的表格上实现了超过90%的准确率。此外,我们的方法在数据生成方面优于所有基线,通过完全保留列间关系的同时保持数据保真度、实用性和隐私的最佳平衡。本研究提出了首个在不需领域知识的情况下有效保持合成表格数据中列间关系的方法,为创建逻辑一致的现实表格数据提供了新的见解。代码可在https://github.com/Yunbo-max/TabKG获取。

英文摘要

Synthetic tabular data are increasingly being used to replace real data, serving as an effective solution that simultaneously protects privacy and addresses data scarcity. However, in addition to preserving global statistical properties, synthetic datasets must also maintain domain-specific logical consistency, especially in complex systems like supply chains, where fields such as shipment dates, locations, and product categories must remain logically consistent for real-world usability. Existing generative models often overlook these inter-column relationships, leading to unreliable synthetic tabular data in real-world applications. To address these challenges, we propose LLM-TabLogic, a novel approach that leverages Large Language Model reasoning to capture and compress the complex logical relationships among tabular columns, while these conditional constraints are passed into a Score-based Diffusion model for data generation in latent space. Through extensive experiments on real-world industrial datasets, we evaluate LLM-TabLogic for column reasoning and data generation, comparing it with five baselines including SMOTE and state-of-the-art generative models. Our results show that LLM-TabLogic demonstrates strong generalization in logical inference, achieving over 90% accuracy on unseen tables. Furthermore, our method outperforms all baselines in data generation by fully preserving inter-column relationships while maintaining the best balance between data fidelity, utility, and privacy. This study presents the first method to effectively preserve inter-column relationships in synthetic tabular data generation without requiring domain knowledge, offering new insights for creating logically consistent real-world tabular data. The code is available at https://github.com/Yunbo-max/TabKG.

2503.02087 2026-05-19 cs.RO cs.LG cs.SY eess.SY

Uncertainty Representation in a SOTIF-Related Use Case with Dempster-Shafer Theory for LiDAR Sensor-Based Object Detection

基于Dempster-Shafer理论的LiDAR传感器目标检测SOTIF相关用例中的不确定性表示

Milin Patel, Rolf Jung

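AI summary: This paper presents a systematic approach that uses Dempster-Shafer Theory to construct a Frame of Discernment representing uncertainty in LiDAR sensor-based object detection, and applies variance-based sensitivity analysis to quantify and prioritize these uncertainties in support of safety in automated driving scenarios.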

Comments submitted as extended paper of Vehicle Technology and Intelligent Transport Systems (VEHITS) 2024 conference and will be published by Springer in a CCIS Series book later in 2025

详情
AI中文摘要

LiDAR传感器目标检测中的不确定性源于环境变化和传感器性能限制。表示这些不确定性对于确保预期功能安全(SOTIF)至关重要,SOTIF旨在防止自动驾驶场景中的危险。本文提出了一种系统的方法,用于识别、分类和表示LiDAR目标检测中的不确定性。Dempster-Shafer理论(DST)被用于构建判定框架(FoD)以表示检测结果。基于识别的不确定性来源之间的依赖性,应用条件基本概率分配(BPAs)。Yager的证据组合规则用于解决多个来源的冲突证据,提供一个结构化的框架来评估不确定性对检测准确性的影响。研究应用方差基于敏感性分析(VBSA)来量化和优先处理不确定性,详细说明其对检测性能的具体影响。

英文摘要

Uncertainty in LiDAR sensor-based object detection arises from environmental variability and sensor performance limitations. Representing these uncertainties is essential for ensuring the Safety of the Intended Functionality (SOTIF), which focuses on preventing hazards in automated driving scenarios. This paper presents a systematic approach to identifying, classifying, and representing uncertainties in LiDAR-based object detection within a SOTIF-related scenario. Dempster-Shafer Theory (DST) is employed to construct a Frame of Discernment (FoD) to represent detection outcomes. Conditional Basic Probability Assignments (BPAs) are applied based on dependencies among identified uncertainty sources. Yager's Rule of Combination is used to resolve conflicting evidence from multiple sources, providing a structured framework to evaluate uncertainties' effects on detection accuracy. The study applies variance-based sensitivity analysis (VBSA) to quantify and prioritize uncertainties, detailing their specific impact on detection performance.
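
Yager's Rule of Combination, mentioned in this abstract, can be sketched in a few lines. This is an illustrative sketch only: the two-element frame, the source names `lidar` and `weather`, and all mass values are hypothetical and not taken from the paper. The defining feature shown is that mass from conflicting (disjoint) hypotheses is assigned to the full frame, that is, to ignorance, rather than being normalised away as in Dempster's rule.

```python
from itertools import product

def yager_combine(m1, m2, frame):
    """Combine two basic probability assignments (BPAs) with Yager's rule.

    m1, m2: dicts mapping frozenset hypotheses to masses that sum to 1.
    frame:  frozenset of all hypotheses (the frame of discernment).
    """
    combined = {}
    conflict = 0.0
    for (b, mass_b), (c, mass_c) in product(m1.items(), m2.items()):
        intersection = b & c
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + mass_b * mass_c
        else:
            conflict += mass_b * mass_c  # evidence that cannot both be true
    # Yager: conflicting mass is moved to the whole frame (ignorance).
    combined[frame] = combined.get(frame, 0.0) + conflict
    return combined

# Hypothetical example: two sources give evidence about one detection outcome.
FRAME = frozenset({"detected", "missed"})
lidar = {frozenset({"detected"}): 0.7, FRAME: 0.3}
weather = {frozenset({"missed"}): 0.4, FRAME: 0.6}
fused = yager_combine(lidar, weather, FRAME)
```

With these made-up masses, the conflicting product 0.7 × 0.4 = 0.28 is added to the frame, so the fused assignment still sums to 1 without rescaling the other masses.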

2502.05462 2026-05-19 cs.RO cs.MA cs.SY eess.SY math.OC

Motion Planning of Cooperative Nonholonomic Mobile Manipulators

协作非holonomic移动机械臂的运动规划

Keshab Patra, Arpita Sinha, Anirban Guha

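AI summary: This paper proposes a real-time implementable motion planning framework for cooperative object transportation by nonholonomic mobile manipulator robots in dynamic environments. A global planner finds a start-to-goal path through static obstacle-free regions and generates convex, static, obstacle-free regions around the path using a novel, fast, and computationally lightweight ellipse-based technique. A planner based on nonlinear Model Predictive Control (NMPC) jointly plans feasible motion for the mobile base and the manipulator arm and generates feasible, collision-free trajectories for cooperative transport. Simulation and hardware experiments validate the framework.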

Comments Published in ASME Letters in Translational Robotics. This includes supplementary materials

详情
Journal ref
Patra, K., Sinha, A., and Guha, A. (May 2, 2026). "Motion Planning of Cooperative Nonholonomic Mobile Manipulators." ASME. Letters Trans. Robotics. December 2025; 1(4): 041003
AI中文摘要

我们提出了一种实时可实现的运动规划框架,用于非holonomic移动机械臂机器人(MMRs)在动态环境中协作运输物体。我们的全局规划器通过环境中的静态无障碍区域找到从起点到目标的路径,并利用一种新颖、快速且计算轻量的基于椭圆的技术生成路径周围的凸、静态、无障碍区域。我们引入了一种基于非线性模型预测控制(NMPC)的实时可实现规划技术,该技术联合规划移动基底和机械臂的可行运动,并生成可行的、无碰撞的轨迹以实现协作物体运输。仿真和硬件实验验证了我们所提规划框架的效率。

英文摘要

We propose a real-time implementable motion planning framework for cooperative object transportation by nonholonomic mobile manipulator robots (MMRs) in dynamic environments. Our global planner finds a path from start to goal through the static, obstacle-free regions in the environment and generates a set of convex, static, obstacle-free regions around the path using a novel, fast, and computationally lightweight ellipse-based technique. We introduce a nonlinear Model Predictive Control (NMPC) based real-time implementable planning technique that jointly plans feasible motion for the mobile base and the manipulator's arm and generates a kinodynamic feasible, collision-free trajectory for cooperative object transportation. Simulation and hardware experiments validate the efficiency of our proposed planning framework.

2502.04055 2026-05-19 cs.LG

Evaluating Inter-Column Logical Relationships in Synthetic Tabular Data Generation

评估合成表格数据生成中列之间的逻辑关系

Yunbo Long, Liming Xu, Alexandra Brintrup

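AI summary: This paper proposes three evaluation metrics for assessing how well logical relationships among columns are preserved in synthetic tabular data, shows experimentally that existing methods fall short on logical consistency, and discusses possible ways to model these relationships better.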

详情
AI中文摘要

当前对合成表格数据的评估主要集中在联合分布建模的质量上,往往忽略了其在保持真实事件序列和列间一致实体关系方面的有效性。本文提出了三种评估指标,用于评估合成表格数据中列间逻辑关系的保持情况。我们通过在真实工业数据集上评估经典和最新生成方法的性能来验证这些指标。实验结果表明,现有方法往往无法严格保持逻辑一致性(例如地理或组织中的层级关系)和依赖性(例如时间序列或数学关系),这些对于保持真实世界表格数据的细粒度真实性至关重要。基于这些见解,本文还讨论了在建模合成表格数据分布时更好地捕捉逻辑关系的可能路径。代码可在https://github.com/Yunbo-max/TabLogicEval获取。

英文摘要

Current evaluations of synthetic tabular data mainly focus on how well joint distributions are modeled, often overlooking the assessment of their effectiveness in preserving realistic event sequences and coherent entity relationships across columns. This paper proposes three evaluation metrics designed to assess the preservation of logical relationships among columns in synthetic tabular data. We validate these metrics by assessing the performance of both classical and state-of-the-art generation methods on a real-world industrial dataset. Experimental results reveal that existing methods often fail to rigorously maintain logical consistency (e.g., hierarchical relationships in geography or organization) and dependencies (e.g., temporal sequences or mathematical relationships), which are crucial for preserving the fine-grained realism of real-world tabular data. Building on these insights, this study also discusses possible pathways to better capture logical relationships while modeling the distribution of synthetic tabular data. The code is available at https://github.com/Yunbo-max/TabLogicEval.

2501.17549 2026-05-19 cs.CL

Query-Aware Learnable Graph Pooling Tokens as Prompt for Large Language Models

具有查询意识的可学习图池化标记作为大语言模型的提示

Wooyoung Kim, Byungyoon Park, Wooju Kim

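AI summary: This paper proposes Learnable Graph Pooling Token (LGPT), which introduces learnable parameters that act as tokens in a large language model to address the scalability issues of node-level projection and the information loss of graph-level projection. Combined with an Early Query Fusion technique that improves the graph embeddings, it achieves a 4.13% performance improvement on the GraphQA benchmark.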

详情
AI中文摘要

图结构数据在许多领域中发挥着重要作用,例如社交网络、引用网络、常识推理图和知识图。尽管图神经网络已被用于图处理,但最近的进展探索了将大语言模型整合到图相关任务中。在本文中,我们提出了一种新的方法,称为可学习图池化标记(LGPT),该方法解决了节点级投影的可扩展性问题和图级投影的信息丢失问题。LGPT通过引入可学习参数作为大语言模型中的标记,实现了灵活且高效的图表示,平衡了细粒度和全局图信息。此外,我们研究了一种早查询融合技术,该技术在构建图表示之前融合查询上下文,从而产生更有效的图嵌入。我们的方法在不训练大语言模型的情况下,在GraphQA基准测试中实现了4.13%的性能提升,展示了在处理复杂文本属性图数据方面的显著优势。

英文摘要

Graph-structured data plays a vital role in numerous domains, such as social networks, citation networks, commonsense reasoning graphs and knowledge graphs. While graph neural networks have been employed for graph processing, recent advancements have explored integrating large language models for graph-based tasks. In this paper, we propose a novel approach named Learnable Graph Pooling Token (LGPT), which addresses the scalability issues of node-level projection and the information loss of graph-level projection. LGPT enables flexible and efficient graph representation by introducing learnable parameters that act as tokens in large language models, balancing fine-grained and global graph information. Additionally, we investigate an Early Query Fusion technique, which fuses query context before constructing the graph representation, leading to more effective graph embeddings. Our method achieves a 4.13\% performance improvement on the GraphQA benchmark without training the large language model, demonstrating significant gains in handling complex textual-attributed graph data.

2501.13795 2026-05-19 cs.CV

Training-Free Zero-Shot Temporal Action Detection with Vision-Language Models

无需训练的零样本时序动作检测与视觉-语言模型

Chaolei Han, Hongsong Wang, Jidong Kuang, Lei Zhang, Jie Gui

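AI summary: This paper proposes FreeZAD, a training-free zero-shot temporal action detection method that uses existing vision-language models to classify and localize unseen activities in untrimmed videos without additional fine-tuning or adaptation, and improves performance through LogOIC, frequency-based Actionness Calibration, and a test-time adaptation strategy.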

详情
Journal ref
IEEE Transactions on Multimedia, 2026
AI中文摘要

现有的零样本时序动作检测(ZSTAD)方法主要采用全监督或无监督策略来识别未见活动。然而,这些基于训练的方法容易出现领域偏移且计算成本高,阻碍了其在现实场景中的应用。在本文中,不同于以往的工作,我们提出了一种无需训练的零样本时序动作检测(FreeZAD)方法,利用现有的视觉-语言(ViL)模型,直接对未修剪视频中的未知活动进行分类和定位,而无需任何额外的微调或适应。我们通过设计Logarithmic decay weighted Outer-Inner-Contrastive Score(LogOIC)和基于频率的动作校准,消除了显式时间建模和伪标签质量的依赖。此外,我们引入了使用原型中心采样(PCS)的测试时适应(TTA)策略来扩展FreeZAD,使ViL模型能够更有效地适应ZSTAD。在THUMOS14和ActivityNet-1.3数据集上的大量实验表明,我们的无需训练的方法在性能上优于最先进的无监督方法,且仅需1/13的运行时间。当配备TTA时,增强的方法进一步缩小了与全监督方法之间的差距。

英文摘要

Existing zero-shot temporal action detection (ZSTAD) methods predominantly use fully supervised or unsupervised strategies to recognize unseen activities. However, these training-based methods are prone to domain shifts and require high computational costs, which hinder their practical applicability in real-world scenarios. In this paper, unlike previous works, we propose a training-Free Zero-shot temporal Action Detection (FreeZAD) method, leveraging existing vision-language (ViL) models to directly classify and localize unseen activities within untrimmed videos without any additional fine-tuning or adaptation. We mitigate the need for explicit temporal modeling and reliance on pseudo-label quality by designing the LOGarithmic decay weighted Outer-Inner-Contrastive Score (LogOIC) and frequency-based Actionness Calibration. Furthermore, we introduce a test-time adaptation (TTA) strategy using Prototype-Centric Sampling (PCS) to expand FreeZAD, enabling ViL models to adapt more effectively for ZSTAD. Extensive experiments on the THUMOS14 and ActivityNet-1.3 datasets demonstrate that our training-free method outperforms state-of-the-art unsupervised methods while requiring only 1/13 of the runtime. When equipped with TTA, the enhanced method further narrows the gap with fully supervised methods.

2412.18158 2026-05-19 cs.CV eess.IV

Semantics Disentanglement and Composition for Universal Image Coding with Efficiently LLM Reasoning and Generative Diffusion

语义解耦与组合用于具有高效LLM推理和生成扩散的通用图像编码

Jinming Liu, Yuntao Wei, Junyan Lin, Shengyang Zhao, Heming Sun, Zhibo Chen, Wenjun Zeng, Xin Jin

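AI summary: This paper proposes UniCodec, a universal image coding framework based on semantic disentanglement and compositional generation, which uses efficient LLM reasoning and a generative diffusion model to serve both human and machine needs with a single codec and no retraining.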

详情
AI中文摘要

已学习的图像压缩方法在性能上表现出色,但通常高度专门化于人类感知或特定机器视觉任务。这种专门化限制了其通用性和重新训练成本。为此,我们引入UniCodec,一种基于编码器的语义解耦和解码器的组合生成的通用编码器。该框架旨在同时满足人类和机器需求,消除任务特定重新训练的需要。在编码器中,UniCodec利用由大型语言模型(LLM)预先生成的任务特定标签代码本。对于任何给定任务,接地模型使用相应的代码本进行任务感知的解耦,压缩最相关的图像区域。这种机制不仅节省了大量位数,而且是系统快速零重新训练适应的关键:切换到新任务只需选择新代码本。解码器则进行组合生成:它将紧凑的解耦组件与生成扩散模型的强大先验结合,从而重建高质量、完整的图像,优化以满足人类感知的丰富细节和机器视觉任务的精确特征。广泛的实验表明,UniCodec在性能上始终优于现有方法,有效弥合了以人类为中心和以机器为中心压缩之间的差距。

英文摘要

Learned image compression methods have shown impressive performance but are often highly specialized for either human perception or specific machine vision tasks. This specialization limits their versatility and requires costly retraining for new applications. To address this, we introduce UniCodec, a universal codec built on a novel paradigm of semantic disentanglement at the encoder and compositional generation at the decoder. This framework is designed to simultaneously serve both human and machine needs, eliminating the need for task-specific retraining. At the encoder, UniCodec leverages pre-generated, task-specific label codebooks created by a Large Language Model (LLM). For any given task, a grounding model uses the corresponding codebook to perform task-aware disentanglement, compressing only the most relevant image regions. This mechanism not only saves significant bits but is also the key to our system's rapid, zero-retraining adaptation: switching to a new task is as simple as selecting a new codebook. The decoder then performs compositional generation: it combines the compact, disentangled components with powerful priors from a generative diffusion model. This process reconstructs a high-quality, complete image optimized with rich detail for human perception and precise features for machine vision tasks. Extensive experiments demonstrate that UniCodec consistently outperforms existing methods, effectively bridging the gap between human-centric and machine-centric compression.

2411.03936 2026-05-19 cs.LG stat.ML

GUIDE-VAE: Advancing Data Generation with User Information and Pattern Dictionaries

GUIDE-VAE:利用用户信息和模式词典推进数据生成

Kutay Bölat, Simon Tindemans

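AI summary: This paper proposes GUIDE-VAE, a generative model based on user embeddings and pattern dictionaries that integrates user information and complex feature dependencies to improve generation performance and sample realism on multi-user datasets.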

详情
AI中文摘要

多用户数据集的生成建模在科学和工程中变得突出。生成特定用户的样本需要利用用户信息,而传统生成模型,包括变分自编码器(VAEs),通常忽略这一点。本文介绍了GUIDE-VAE,一种新的条件生成模型,利用用户嵌入生成用户引导的数据。通过利用用户之间的共享模式,GUIDE-VAE在多用户设置中提升了性能,即使在数据不平衡显著的情况下。除了整合用户信息外,GUIDE-VAE还采用基于模式词典的协方差组成(PDCC)来提高生成样本的真实性和捕捉复杂特征依赖性。虽然用户嵌入推动了性能提升,但PDCC解决了VAEs中常见的噪声和过平滑问题。所提出的GUIDE-VAE在具有显著用户数据不平衡的多用户智能电表数据集上进行了评估。定量结果表明,GUIDE-VAE在合成数据生成和缺失记录填补任务中表现良好,而定性评估表明其生成的数据更加合理且噪声更少。这些结果确立了GUIDE-VAE作为多用户数据集可控、真实数据生成的有前景工具,具有跨领域应用的潜力。

英文摘要

Generative modelling of multi-user datasets has become prominent in science and engineering. Generating a data point for a given user requires employing user information, and conventional generative models, including variational autoencoders (VAEs), often ignore this. This paper introduces GUIDE-VAE, a novel conditional generative model that leverages user embeddings to generate user-guided data. By leveraging shared patterns across users, GUIDE-VAE improves performance in multi-user settings, even under significant data imbalance. In addition to integrating user information, GUIDE-VAE incorporates a pattern dictionary-based covariance composition (PDCC) to improve the realism of generated samples by capturing complex feature dependencies. While user embeddings drive performance gains, PDCC addresses common issues such as noise and over-smoothing typically seen in VAEs. The proposed GUIDE-VAE was evaluated on a multi-user smart meter dataset characterised by substantial data imbalance across users. Quantitative results show that GUIDE-VAE performs effectively on both synthetic data generation and missing-record imputation tasks, while qualitative evaluations indicate that it produces more plausible and less noisy data. These results establish GUIDE-VAE as a promising tool for controlled, realistic data generation in multi-user datasets, with potential applications across domains that require user-informed modelling.

2410.13846 2026-05-19 cs.CL cs.AI cs.LG

LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation

LightTransfer: 你的长上下文LLM实际上是一个具有轻松适应能力的混合模型

Xuan Zhang, Fengzhuo Zhang, Cunxiao Du, Chao Du, Tianyu Pang, Wei Gao, Min Lin

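AI summary: This paper proposes LightTransfer, which converts models such as LLaMA into hybrid architectures for more efficient generation. On long-context understanding tasks, even when half of the layers are identified as lazy, it achieves up to 2.17× throughput improvement with less than 1.5% performance loss on LongBench, and it reaches 53.3% on the math benchmark AIME24.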

Comments Accepted by TMLR 2025

详情
AI中文摘要

将语言模型扩展到处理更长上下文引入了由于键值(KV)缓存成本增加而带来的显著内存挑战。受混合模型的效率提升和预训练大变压器骨干的广泛可用性启发,我们探索将变压器模型转换为混合架构以实现更高效的生成。在本工作中,我们提出了LightTransfer,一种轻量级方法,将模型如LLaMA转换为混合变体。我们的方法识别出懒层——那些专注于最近或初始token的层,并将它们的完整注意力替换为流式注意力。这种转换可以在无需任何训练的情况下用于长上下文理解任务,或在需要更强推理能力的o1-like长推理生成任务中进行最小微调。在多样化的基准和模型(如LLaMA、Mistral、QwQ-STILL)上的实验表明,即使有半数层被识别为懒层,LightTransfer在性能损失小于1.5%(在LongBench上)的情况下,也能实现高达2.17倍的吞吐量提升,并在数学基准AIME24上达到先进o1-like长推理模型QwQ-STILL的53.3%。

英文摘要

Scaling language models to handle longer contexts introduces substantial memory challenges due to the growing cost of key-value (KV) caches. Motivated by the efficiency gains of hybrid models and the broad availability of pretrained large transformer backbones, we explore transitioning transformer models into hybrid architectures for more efficient generation. In this work, we propose LightTransfer, a lightweight method that transforms models such as LLaMA into hybrid variants. Our approach identifies lazy layers -- those focusing on recent or initial tokens -- and replaces their full attention with streaming attention. This transformation can be performed without any training for long-context understanding tasks or with minimal fine-tuning for o1-like long reasoning generation tasks that require stronger reasoning capabilities. Experiments across diverse benchmarks and models (e.g., LLaMA, Mistral, QwQ-STILL) demonstrate that, even when half of the layers are identified as lazy, LightTransfer achieves up to 2.17$\times$ throughput improvement with minimal performance loss ($<1.5\%$ on LongBench) and reaches 53.3\% on the math benchmark AIME24 with the advanced o1-like long-reasoning model QwQ-STILL.
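
The attention pattern that this abstract associates with lazy layers (initial tokens plus recent tokens) can be sketched as follows. This is an assumption-laden illustration, not the LightTransfer code: the function names and the `num_sink` and `window` values are hypothetical, and the paper's procedure for deciding which layers are lazy is not shown.

```python
def streaming_keys(query_pos, num_sink=2, window=3):
    """Key positions visible to the query at query_pos.

    Assumed pattern: the first num_sink tokens plus the most recent
    window tokens (causal, so nothing after query_pos).
    """
    sinks = range(min(num_sink, query_pos + 1))
    recent = range(max(0, query_pos - window + 1), query_pos + 1)
    return sorted(set(sinks) | set(recent))

def kv_cache_size(seq_len, num_sink=2, window=3):
    """Number of cached key/value pairs such a layer needs at length seq_len."""
    return min(seq_len, num_sink + window)

late_query = streaming_keys(9)    # attends to sinks and the last few tokens
early_query = streaming_keys(1)   # short prefix: everything is still visible
cache_at_1000 = kv_cache_size(1000)
```

The point of the sketch is that the cache for such a layer stops growing with sequence length, which is where the memory saving of a hybrid model would come from.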

2410.04941 2026-05-19 cs.LG cs.AI

TOAST: Transformer Optimization using Adaptive and Simple Transformations

TOAST: 使用自适应和简单变换的Transformer优化

Irene Cannistraci, Simone Antonelli, Emanuele Palumbo, Thomas M. Sutter, Emanuele Rodolà, Bastian Rieck, Julia E. Vogt

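AI summary: This paper proposes TOAST, a framework that exploits redundancy inside transformers to approximate entire transformer blocks with lightweight closed-form mappings, such as linear transformations or the identity function, reducing parameters and computation without additional training while preserving, and sometimes improving, downstream performance.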

Comments 33 pages, 16 figures, 22 tables

详情
AI中文摘要

基础模型在不同任务上实现了最先进的性能,但其规模和计算需求引发了关于可访问性和可持续性的担忧。现有的效率方法通常需要额外的重新训练或微调,限制了其实用性。最近的研究发现,深度神经网络表现出内部表示的相似性。虽然这种相似性已被用于启用技术如模型缝合和合并,但网络内部的冗余性仍较少被用作效率提升的来源。在本文中,我们介绍了Transformer优化使用自适应和简单变换(TOAST),一个框架利用这些冗余性,用轻量级闭式映射(如线性变换或甚至身份函数)近似整个Transformer块,而无需任何额外训练。在最先进的预训练视觉模型(如ViT、DINOv2、DeiT)和从MNIST到ImageNet-1k的各类数据集上,TOAST在减少参数和计算量的同时,保持并有时提升下游性能。这些结果表明,Transformer深度的大部分可以被简单函数替代,为高效基础模型提供了新的视角。

英文摘要

Foundation models achieve state-of-the-art performance across different tasks, but their size and computational demands raise concerns about accessibility and sustainability. Existing efficiency methods often require additional retraining or finetuning, limiting their practicality. Recent findings suggest that deep neural networks exhibit internal representation similarities. While such similarities across different models have been exploited for enabling techniques such as model stitching and merging, intra-network redundancy remains underexplored as a source for efficiency gains. In this paper, we introduce Transformer Optimization using Adaptive and Simple Transformations (TOAST), a framework that exploits these redundancies to approximate entire transformer blocks with lightweight closed-form mappings, such as linear transformations or even the identity function, without any additional training. Across state-of-the-art pretrained vision models (e.g., ViT, DINOv2, DeiT) and datasets ranging from MNIST to ImageNet-1k, TOAST reduces parameters and computation while preserving, and in some cases improving, downstream performance. These results show that large portions of transformer depth can be replaced by trivial functions, opening a new perspective on efficient foundation models.

2410.02064 2026-05-19 cs.LG cs.AI cs.CL

Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct

对Llama3-8b-Instruct自生成文本识别能力的检查与控制

Christopher Ackerman, Nina Panickssery

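AI summary: This study examines whether LLMs can recognize their own generated text. It finds that Llama3-8b-Instruct can distinguish its own outputs from human-written ones, and that a specific vector in the residual stream can be used to steer the model's behavior and its perception of authorship, shedding light on the mechanism behind the model's self-authorship judgments.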

Comments 10 pages, 13 figs, 2 tables, accepted as conference paper to ICLR 2025

详情
Journal ref
The Thirteenth International Conference on Learning Representations (ICLR 2025)
AI中文摘要

已报告LLM能够识别其自身生成的文本,这可能对AI安全有重要影响,但研究较少。我们调查这一现象,以确定其在行为层面是否稳健发生,观察行为是如何实现的,以及是否可以控制。首先,我们发现Llama3-8b-Instruct聊天模型(而非基础Llama3-8b模型)能够可靠地区分自身输出与人类输出,并提供证据表明聊天模型很可能利用其在训练后对自身输出的经验来完成文本识别任务。其次,我们识别出残差流中一个在模型正确识别自身生成文本时被差异激活的向量,证明该向量对自我归属相关信息的响应,并提供证据表明该向量与模型中的“自我”概念相关,并展示该向量与模型感知和声明自我归属能力的因果关系。最后,我们证明该向量可用于控制模型的行为和感知,通过将其应用于模型生成输出时,可引导模型声称或否认作者身份;通过将其应用于模型阅读的文本时,可引导模型相信或不相信其写了任意文本。

英文摘要

It has been reported that LLMs can recognize their own writing. As this has potential implications for AI safety, yet is relatively understudied, we investigate the phenomenon, seeking to establish whether it robustly occurs at the behavioral level, how the observed behavior is achieved, and whether it can be controlled. First, we find that the Llama3-8b-Instruct chat model - but not the base Llama3-8b model - can reliably distinguish its own outputs from those of humans, and present evidence that the chat model is likely using its experience with its own outputs, acquired during post-training, to succeed at the writing recognition task. Second, we identify a vector in the residual stream of the model that is differentially activated when the model makes a correct self-written-text recognition judgment, show that the vector activates in response to information relevant to self-authorship, present evidence that the vector is related to the concept of "self" in the model, and demonstrate that the vector is causally related to the model's ability to perceive and assert self-authorship. Finally, we show that the vector can be used to control both the model's behavior and its perception, steering the model to claim or disclaim authorship by applying the vector to the model's output as it generates it, and steering the model to believe or disbelieve it wrote arbitrary texts by applying the vector to them as the model reads them.

2409.15980 2026-05-19 cs.CV cs.AI

Leveraging Unsupervised Learning for Cost-Effective Visual Anomaly Detection

利用无监督学习实现高效视觉异常检测

Yunbo Long, Zhengyang Ling, Sam Brook, Duncan McFarlane, Alexandra Brintrup

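AI summary: This study presents a low-cost visual anomaly detection system that combines pre-trained models with inexpensive hardware to reach high detection accuracy from a small amount of data, making it suitable for small and medium-sized enterprises.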

详情
AI中文摘要

传统的基于机器学习的视觉检测系统需要大量数据收集和重复模型训练来提高准确性。这些系统通常需要昂贵的相机、计算设备和显著的机器学习专业知识,这对中小型企业构成重大负担。本研究探索利用预训练模型和低成本硬件的无监督学习方法,开发一种高效的视觉异常检测系统。该系统利用Anomalib的无监督学习模型,并通过openVINO部署在经济型Raspberry Pi硬件上。结果表明,该系统仅用10张正常产品图像即可在Raspberry Pi上完成异常检测的训练和推理,耗时仅90秒,达到F1宏评分超过0.95的性能。尽管系统对环境变化如光照、产品摆放或背景略有敏感,但其仍为中小型企业提供了一种快速且经济的工厂自动化检测方法。代码可在https://github.com/Yunbo-max/Cost-Effective-Visual-Anomaly-Detection-using-Unsupervised-Learning获取。

英文摘要

Traditional machine learning-based visual inspection systems require extensive data collection and repetitive model training to improve accuracy. These systems typically require expensive cameras, computing equipment, and significant machine learning expertise, which can substantially burden small and medium-sized enterprises. This study explores leveraging unsupervised learning methods with pre-trained models and low-cost hardware to create a cost-effective visual anomaly detection system. The research aims to develop a low-cost visual anomaly detection solution that uses minimal data for model training while maintaining generalizability and scalability. The system utilises unsupervised learning models from Anomalib and is deployed on affordable Raspberry Pi hardware through openVINO. The results show that this cost-effective system can complete anomaly detection training and inference on a Raspberry Pi in just 90 seconds using only 10 normal product images, achieving an F1 macro score exceeding 0.95. While the system is slightly sensitive to environmental changes like lighting, product positioning, or background, it remains a swift and economical method for factory automation inspection for small and medium-sized manufacturers. The code is available at https://github.com/Yunbo-max/Cost-Effective-Visual-Anomaly-Detection-using-Unsupervised-Learning.
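
The F1 macro score reported in this abstract is the unweighted mean of the per-class F1 scores. The sketch below shows that computation on made-up labels; the label names and predictions are hypothetical and do not reproduce the paper's evaluation.

```python
def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical inspection results: one normal part is wrongly flagged.
y_true = ["normal"] * 4 + ["anomaly"] * 4
y_pred = ["normal"] * 3 + ["anomaly"] * 5
score = f1_macro(y_true, y_pred)
```

Because each class counts equally, the macro average does not let a large number of normal parts mask poor performance on the rarer anomaly class.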

2409.12190 2026-05-19 cs.RO cs.CV

Bundle Adjustment in the Eager Mode

急切模式下的捆绑调整

Zitong Zhan, Huan Xu, Zihang Fang, Xinpeng Wei, Yaoyu Hu, Chen Wang

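AI summary: This paper presents an efficient eager-mode bundle adjustment library that integrates seamlessly with PyTorch, using a sparsity-aware auto-differentiation design and GPU-accelerated sparse operations to improve the runtime efficiency and performance of bundle adjustment in robotic applications.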

详情
AI中文摘要

捆绑调整(BA)是各种机器人应用中的关键技术,例如同步定位与建图(SLAM)、增强现实(AR)和摄影测量学。BA通过优化诸如相机姿态和3D地标等参数,使它们与观测结果对齐。随着深度学习在感知系统中的重要性日益增加,将BA与深度学习框架整合已成为提高可靠性和性能的迫切需求。然而,广泛使用的基于C++的BA库,如GTSAM、g²o和Ceres Solver,缺乏与现代深度学习库如PyTorch的原生整合。这种限制影响了它们的灵活性、调试简便性和整体实现效率。为了解决这一差距,我们引入了一种与PyTorch无缝集成的高效急切模式BA库。我们的方法包括稀疏感知的自动微分设计和针对二次优化设计的GPU加速稀疏运算。我们的GPU急切模式BA在所有基准测试中均实现了显著的运行时间效率,与GTSAM、g²o和Ceres相比,平均加速分别为18.5×、22×和23×。

英文摘要

Bundle adjustment (BA) is a critical technique in various robotic applications such as simultaneous localization and mapping (SLAM), augmented reality (AR), and photogrammetry. BA optimizes parameters such as camera poses and 3D landmarks to align them with observations. With the growing importance of deep learning in perception systems, there is an increasing need to integrate BA with deep learning frameworks for enhanced reliability and performance. However, widely-used C++-based BA libraries, such as GTSAM, g$^2$o, and Ceres Solver, lack native integration with modern deep learning libraries like PyTorch. This limitation affects their flexibility, ease of debugging, and overall implementation efficiency. To address this gap, we introduce an eager-mode BA library seamlessly integrated with PyTorch with high efficiency. Our approach includes a sparsity-aware auto-differentiation design and GPU-accelerated sparse operations designed for 2nd-order optimization. Our eager-mode BA on GPU demonstrates substantial runtime efficiency, achieving an average speedup of 18.5$\times$, 22$\times$, and 23$\times$ across all benchmarks compared to GTSAM, g$^2$o, and Ceres, respectively.

2405.19189 2026-05-19 cs.LG

DyDiff: Long-Horizon Rollout via Dynamics Diffusion for Offline Reinforcement Learning

DyDiff: 通过动力学扩散实现离线强化学习中的长周期 rollout

Hanye Zhao, Xiaoshen Han, Zhengbang Zhu, Minghuan Liu, Yong Yu, De-Chuan Zhan, Weinan Zhang

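AI summary: This paper proposes DyDiff, which uses a dynamics diffusion model to generate long-horizon trajectories in offline reinforcement learning. By iteratively injecting information from the learning policy, it addresses the mismatch between the behavior policy and the learning policy and improves long-horizon rollout accuracy.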

Comments 18 pages, 10 figures, 9 tables. The article has been accepted by Frontiers of Computer Science (FCS), with the DOI: 10.1007/s11704-026-52028-5

详情
AI中文摘要

随着扩散模型(DMs)在生成逼真合成视觉数据方面的巨大成功,许多研究者探索其在决策和控制中的潜力。大多数工作利用DMs直接从轨迹空间采样,其中DMs可视为动力学模型和策略的结合。在本工作中,我们探讨如何在完全离线设置中解耦DMs作为动力学模型的能力,使学习策略能够生成轨迹。由于DMs从数据集中学习数据分布,其内在策略实际上是数据集诱导的行为策略,导致行为策略与学习策略之间存在不匹配。我们提出Dynamics Diffusion,简称DyDiff,可以迭代地将学习策略的信息注入DMs中。DyDiff在保持策略一致性的同时确保长周期rollout的准确性,并且可以轻松部署在无模型算法上。我们提供了理论分析,证明DMs在长周期rollout上的优势优于其他模型,并在离线强化学习的上下文中验证了DyDiff的有效性,其中提供了一个rollout数据集但没有交互环境。

英文摘要

With the great success of diffusion models (DMs) in generating realistic synthetic vision data, many researchers have investigated their potential in decision-making and control. Most of these works utilized DMs to sample directly from the trajectory space, where DMs can be viewed as a combination of dynamics models and policies. In this work, we explore how to decouple DMs' ability as dynamics models in fully offline settings, allowing the learning policy to roll out trajectories. As DMs learn the data distribution from the dataset, their intrinsic policy is actually the behavior policy induced from the dataset, which results in a mismatch between the behavior policy and the learning policy. We propose Dynamics Diffusion, DyDiff for short, which can inject information from the learning policy into DMs iteratively. DyDiff ensures long-horizon rollout accuracy while maintaining policy consistency and can be easily deployed on model-free algorithms. We provide theoretical analysis to show the advantage of DMs on long-horizon rollout over models and demonstrate the effectiveness of DyDiff in the context of offline reinforcement learning, where the rollout dataset is provided but no online environment is available for interaction.