arXivDaily: daily arXiv academic digest, updated Monday to Friday
2602.15851 2026-06-18 cs.CL cs.AI (updated version)

Narrative Theory-Driven LLM Methods for Automatic Story Generation and Understanding: A Survey

David Y. Liu, Aditya Joshi, Paul Dawson

Affiliations: School of Computer Science and Engineering; School of Arts and Media; University of New South Wales (UNSW)

AI summary: Surveys narrative-theory-driven LLM methods for automatic story generation and understanding, analyzes the current state of the field, finds that generation tasks lag behind understanding tasks in theoretical application, post-training methods, non-fiction narratives, and narrative levels, and proposes future directions.

Comments: 31 pages

Abstract

Applications of narrative theories using large language models (LLMs) deliver promising methods in automatic story generation and understanding tasks. Our survey examines how natural language processing (NLP) research uses LLM methods to engage with diverse concepts from narrative studies. We use established distinctions from narratology to categorise ongoing efforts and discover the following: (a) narrative texts come from diverse sources beyond just literature, (b) theoretical synthesis and validation are potential outcomes, (c) generation tasks lag behind understanding in several ways: theoretical application, post-training methods, exploring non-fiction narratives and addressing narrative levels beyond fabula and discourse. For future directions, instead of the pursuit of a single, generalised benchmark for 'narrative quality', we believe that progress can benefit from efforts that focus on the following: defining and improving theory-based metrics for individual narrative attributes; continuing to conduct large-scale, theory-driven literary/social/cultural analysis; generating narratives in situated contexts; and continuing experiments where outputs can be used to validate or refine narrative theories. This work provides a contextual foundation for more systematic and theoretically informed narrative research in NLP by providing an overview of ongoing research efforts and the broader narrative studies landscape.

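2602.14789 2026-06-18 cs.LG stat.ML (updated version)

On the Stability of Nonlinear Dynamics in GD and SGD: Beyond Quadratic Potentials

Rotem Mulayoff, Sebastian U. Stich

Affiliations: CISPA Helmholtz Center for Information Security

AI summary: Studies how nonlinear terms affect the stability of gradient descent and stochastic gradient descent dynamics, derives an exact condition for stable oscillations in the multivariate setting, and finds that the stability of SGD can be determined by a single unstable batch.

Comments: Accepted to COLT 2026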

Abstract

The dynamical stability of the iterates during training plays a key role in determining the minima obtained by optimization algorithms. For example, stable solutions of gradient descent (GD) correspond to flat minima, which have been associated with favorable features. While prior work often relies on linearization to determine stability, it remains unclear whether linearized dynamics faithfully capture the full nonlinear behavior. Recent work has shown that GD may stably oscillate near a linearly unstable minimum and still converge once the step size decays, indicating that linear analysis can be misleading. In this work, we explicitly study the effect of nonlinear terms. Specifically, we derive an exact criterion for stable oscillations of GD near minima in the multivariate setting. Our condition depends on high-order derivatives, generalizing existing results. Extending the analysis to stochastic gradient descent (SGD), we show that nonlinear dynamics can diverge in expectation even if only a single batch is unstable. This implies that stability can be dictated by a single batch that oscillates unstably, rather than an average effect, as linear analysis suggests. Finally, we prove that if all batches are linearly stable, the nonlinear dynamics of SGD are stable in expectation.
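
The claim that GD can oscillate stably near a linearly unstable minimum can be illustrated with a one-dimensional toy problem. The sketch below is not taken from the paper: the losses x^2/2 and log(cosh(x)), the step size 2.5, and the helper `gd` are all assumptions chosen for illustration. Both losses have curvature 1 at the minimum x = 0, so linear analysis predicts instability for any step size above 2.

```python
import math

def gd(grad, x0, lr, steps):
    """Run plain gradient descent and return the list of iterates."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - lr * grad(xs[-1]))
    return xs

lr = 2.5  # above the linear stability limit 2 / f''(0) = 2 for both losses

# Quadratic loss f(x) = x^2 / 2: the update is x <- (1 - lr) * x = -1.5 * x.
quad = gd(lambda x: x, x0=0.1, lr=lr, steps=50)

# Toy loss f(x) = log(cosh(x)): same curvature at 0, but the gradient
# tanh(x) saturates, which keeps the iterates bounded.
logcosh = gd(math.tanh, x0=0.1, lr=lr, steps=200)

print(abs(quad[-1]))             # very large: the quadratic iterates diverge
print(logcosh[-2], logcosh[-1])  # two values of opposite sign: a period-2 oscillation
```

In this toy case the nonlinear iterates settle on a two-point cycle satisfying lr * tanh(x) = 2x, even though the linearized dynamics at the minimum are identical to the diverging quadratic case.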

2512.09185 2026-06-18 cs.CV cs.AI (updated version)

Learning Patient-Specific Disease Dynamics with Latent Flow Matching for Longitudinal Imaging Generation

Hao Chen, Rui Yin, Yifan Chen, Qi Chen, Chao Li

Affiliations: University of Cambridge; Nanjing First Hospital; Nanjing Medical University; Johns Hopkins University; University of Dundee

AI summary: Proposes the Δ-LFM framework, which uses flow matching to align patients' latent trajectories and models monotonic disease progression through patient-specific latent alignment; interpretability and performance are validated on three longitudinal MRI benchmarks.

Comments: ICLR 2026 accepted

Abstract

Understanding disease progression is a central clinical challenge with direct implications for early diagnosis and personalized treatment. While recent generative approaches have attempted to model progression, key mismatches remain: disease dynamics are inherently continuous and monotonic, yet latent representations are often scattered, lacking semantic structure, and diffusion-based models disrupt continuity with a random denoising process. In this work, we propose to treat the disease dynamic as a velocity field and leverage Flow Matching (FM) to align the temporal evolution of patient data. Unlike prior methods, it captures the intrinsic dynamic of disease, making the progression more interpretable. However, a key challenge remains: in latent space, Auto-Encoders (AEs) do not guarantee alignment across patients or correlation with clinical-severity indicators (e.g., age and disease conditions). To address this, we propose to learn patient-specific latent alignment, which forces patient trajectories to lie along a specific axis, with magnitude increasing monotonically with disease severity. This leads to a consistent and semantically meaningful latent space. Together, we present $Δ$-LFM, a framework for modeling patient-specific latent progression with flow matching. Across three longitudinal MRI benchmarks, $Δ$-LFM demonstrates strong empirical performance and, more importantly, offers a new framework for interpreting and visualizing disease dynamics.

2510.10779 2026-06-18 cs.CV (updated version)

Structured Spectral Graph Representation Learning for Multi-label Abnormality Analysis from 3D CT Scans

Theo Di Piazza, Carole Lazarus, Olivier Nempont, Loic Boussel

Affiliations: INSA Lyon, University of Lyon, CNRS, INSERM, CREATIS UMR 5220, U1294

AI summary: Proposes a 2.5D framework based on spectral graph convolution that represents a 3D CT volume as a structured graph whose nodes are axial slice triplets, modeling inter-slice dependencies for multi-label abnormality classification with strong cross-dataset generalization.

Comments: Accepted at MELBA Journal 2026

Abstract

With the growing volume of CT examinations, there is an increasing demand for automated tools such as organ segmentation, abnormality detection, and report generation to support radiologists in managing their clinical workload. Multi-label classification of 3D Chest CT scans remains a critical yet challenging problem due to the complex spatial relationships inherent in volumetric data and the wide variability of abnormalities. Existing methods based on 3D convolutional neural networks struggle to capture long-range dependencies, while Vision Transformers often require extensive pre-training on large-scale, domain-specific datasets to perform competitively. In this work, we propose a 2.5D alternative by introducing a new graph-based framework that represents 3D CT volumes as structured graphs, where axial slice triplets serve as nodes processed through spectral graph convolution, enabling the model to reason over inter-slice dependencies while maintaining complexity compatible with clinical deployment. Our method, trained and evaluated on 3 datasets from independent institutions, achieves strong cross-dataset generalization, and shows competitive performance compared to state-of-the-art visual encoders. We further conduct comprehensive ablation studies to evaluate the impact of various aggregation strategies, edge-weighting schemes, and graph connectivity patterns. Additionally, we demonstrate the broader applicability of our approach through transfer experiments on automated radiology report generation and abdominal CT data.

2602.11467 2026-06-18 cs.LG (updated version)

PRISM: A 3D Probabilistic Neural Representation for Interpretable Shape Modeling

Yining Jiao, Sreekalyani Bhamidi, Carlton Jude Zdanski, Julia S Kimbell, Andrew Prince, Cameron P Worden, Samuel Kirse, Christopher Rutter, Benjamin H Shields, Jisan Mahmud, Marc Niethammer

Affiliations: Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, USA; Department of Computer Science, University of California San Diego, La Jolla, USA; School of Medicine, University of North Carolina at Chapel Hill, Chapel Hill, USA

AI summary: Proposes PRISM, a framework that combines implicit neural representations with uncertainty-aware statistical shape analysis and uses a closed-form Fisher information metric for efficient local temporal uncertainty quantification; it performs well on shape evolution, personalized prediction, and anomaly detection tasks.

Comments: ICML 2026, camera-ready version, 24 pages

Abstract

Understanding how anatomical shapes evolve in response to developmental covariates - and quantifying their spatially varying uncertainties - is critical in healthcare research. Existing approaches typically rely on global time-warping formulations that ignore spatially heterogeneous dynamics. We introduce PRISM, a novel framework that bridges implicit neural representations with uncertainty-aware statistical shape analysis. PRISM models the conditional distribution of shapes given covariates, providing spatially continuous estimates of both the population mean and covariate-dependent uncertainty at arbitrary locations. A key theoretical contribution is a closed-form Fisher Information metric that enables efficient, analytically tractable local temporal uncertainty quantification via automatic differentiation. Experiments on three synthetic datasets and one clinical dataset demonstrate PRISM's strong performance across diverse tasks - from modeling shape evolution to personalized shape prediction and anomaly detection - within a unified framework, while providing interpretable and clinically meaningful uncertainty estimates.

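2602.09234 2026-06-18 cs.LG cs.AI (updated version)

Do Neural Networks Lose Plasticity in a Gradually Changing World?

Tianhui Liu, Lili Mou

Affiliations: Dept. Computing Science & Alberta Machine Intelligence Institute (Amii), University of Alberta; Canada CIFAR AI Chair

AI summary: Studies how the abruptness of task transitions affects loss of plasticity in neural networks, simulating gradually changing environments through input/output interpolation and task sampling; theory and experiments show that the severity of plasticity loss is closely tied to transition abruptness and is substantially reduced in gradually changing environments.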

Abstract

Continual learning has become a trending topic in machine learning. Recent studies have discovered an interesting phenomenon called loss of plasticity, referring to neural networks gradually losing the ability to learn new tasks. However, existing plasticity research largely relies on benchmarks with abrupt task transitions, without examining whether the abruptness itself contributes to the observed plasticity loss. In this paper, we investigate the role of transition abruptness by simulating gradually changing environments through input/output interpolation and task sampling. We perform theoretical and empirical analysis, showing that the severity of plasticity loss is closely tied to the abruptness of task transitions, and can be substantially reduced when the environment changes gradually.
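
The two mechanisms named in the abstract, input interpolation and task sampling, can be sketched in a few lines. This is only one plausible reading under stated assumptions, not the authors' protocol: the linear schedule and the names `alpha_schedule`, `interpolate`, and `sample_task` are hypothetical, and `transition_steps` is assumed to be positive.

```python
import random

def alpha_schedule(step, transition_steps):
    """Mixing weight in [0, 1]; a small transition_steps means an abrupt switch."""
    return min(1.0, step / transition_steps)

def interpolate(x_old, x_new, alpha):
    """Blend an input from the old task with one from the new task."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(x_old, x_new)]

def sample_task(alpha, rng):
    """Draw the new task with probability alpha, otherwise the old task."""
    return "new" if rng.random() < alpha else "old"

rng = random.Random(0)
for step in (0, 5, 10):
    a = alpha_schedule(step, transition_steps=10)
    print(step, a, interpolate([0.0, 2.0], [2.0, 4.0], a), sample_task(a, rng))
```

Setting `transition_steps=1` recovers the abrupt-transition benchmark style, while larger values spread the change over many steps.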

2602.07544 2026-06-18 cs.CV (updated version)

MUFASA: A Multi-Layer Framework for Slot Attention

Sebastian Bock, Leonie Schüßler, Krishnakant Singh, Simone Schaub-Meyer, Stefan Roth

Affiliations: TU Darmstadt; Zuse School ELIZA

AI summary: Proposes MUFASA, a lightweight plug-and-play framework that computes slot attention across multiple ViT encoder layers and fuses the results, improving segmentation in unsupervised object-centric learning and setting a new state of the art.

Comments: CVPR 2026. Authors Sebastian Bock and Leonie Schüßler contributed equally. Project page: https://visinf.github.io/mufasa/

Abstract

Unsupervised object-centric learning (OCL) decomposes visual scenes into distinct entities. Slot attention is a popular approach that represents individual objects as latent vectors, called slots. Current methods obtain these slot representations solely from the last layer of a pre-trained vision transformer (ViT), ignoring valuable, semantically rich information encoded across the other layers. To better utilize this latent semantic information, we introduce MUFASA, a lightweight plug-and-play framework for slot-attention-based approaches to unsupervised object segmentation. Our model computes slot attention across multiple feature layers of the ViT encoder, fully leveraging their semantic richness. We propose a fusion strategy to aggregate slots obtained on multiple layers into a unified object-centric representation. Integrating MUFASA into existing OCL methods improves their segmentation results across multiple datasets, setting a new state of the art while simultaneously improving training convergence with only minor inference overhead.

2602.06774 2026-06-18 cs.AI (updated version)

Towards Understanding What State Space Models Learn About Code

Jiali Wu, Abhinav Anand, Shweta Verma, Mira Mezini

Affiliations: TU Darmstadt; Hessian Center for Artificial Intelligence; National Research Center for Applied Cybersecurity ATHENE

AI summary: Presents the first systematic analysis of what state space models (SSMs) learn for code understanding, finds that SSMs capture syntactic and semantic structure more effectively than Transformers during pretraining but forget certain relations during fine-tuning, and proposes the SSM-Interpret framework and architectural improvements that raise MRR on NLCodeSearch by up to 6.

Abstract

State Space Models (SSMs) have emerged as an efficient alternative to the Transformer architecture. Prior work shows that, when trained under comparable conditions, SSMs can match or surpass Transformers on code understanding tasks. However, their internal mechanisms remain a black box. We present the first systematic analysis of what SSM-based code models learn, along with a direct comparison between SSM and Transformer models in this domain. Our analysis shows that SSMs capture syntactic and semantic structure more effectively than Transformers during pretraining but forget certain relations during fine-tuning on some tasks. To investigate this behavior, we introduce SSM-Interpret, a frequency-domain framework that exposes a spectral shift toward short-range dependencies during fine-tuning. Guided by these findings, we propose architectural modifications that significantly improve the performance of an SSM-based code model by up to +6 MRR on NLCodeSearch. This demonstrates that our analysis not only explains model behavior but also leads directly to better designs.

2602.04401 2026-06-18 cs.RO cs.CV (updated version)

Quantile Transfer for Reliable Operating Point Selection in Visual Place Recognition

Dhyey Manish Rajani, Michael Milford, Tobias Fischer

Affiliations: QUT Centre for Robotics; School of Electrical Engineering and Robotics; Queensland University of Technology

AI summary: Proposes a method that transfers thresholds via quantile normalisation to automatically select the operating point of a visual place recognition system, maximising recall at 100% precision without manual tuning.

Comments: Accepted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2026

Abstract

Visual Place Recognition (VPR) is a key component for localisation in Global Navigation Satellite System (GNSS)-denied environments, but its performance critically depends on selecting an image matching threshold (operating point) that balances precision and recall. Thresholds are typically hand-tuned offline for a specific environment and fixed during deployment, leading to degraded performance under environmental change. We propose a method that automatically selects the operating point of a VPR system to maximise recall at 100% precision. The method uses a small calibration traversal with known correspondences and transfers thresholds to deployment via quantile normalisation of similarity score distributions. This quantile transfer ensures that thresholds remain stable across calibration sizes and query subsets. Experiments with seven state-of-the-art VPR techniques across five benchmark datasets demonstrate that our proposed approach consistently outperforms existing baselines, enabling the underlying VPR technique to operate at 100% precision in approximately twice as many deployment scenarios (median improvement), while retrieving up to 29% more correct matches at that precision. The method eliminates manual tuning by adapting to new environments and generalising across operating conditions. Our code is available at https://github.com/DhyeyR-007/Quantile-Transfer-for-Reliable-VPR.
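
The idea of transferring a threshold through score quantiles can be sketched as follows. This is a simplified reading of the abstract, not the code from the linked repository: the function names, the toy scores, and the rule "pick the highest-scoring wrong calibration match as the threshold" are assumptions, and the sketch assumes the calibration set contains at least one wrong match.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a list, for q in [0, 1]."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

def calibrate_quantile(scores, is_correct):
    """Quantile of the strictest threshold that rejects every wrong calibration match."""
    worst_wrong = max(s for s, ok in zip(scores, is_correct) if not ok)
    # Fraction of calibration scores at or below that threshold.
    return sum(s <= worst_wrong for s in scores) / len(scores)

def transfer_threshold(q, deploy_scores):
    """Map the calibrated quantile onto the deployment score distribution."""
    return quantile(deploy_scores, q)

# Toy calibration traversal with known correspondences (hypothetical values).
calib_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
calib_correct = [False, False, False, True, False, True, True, True, True, True]

# Deployment scores follow a differently scaled distribution.
deploy_scores = [0.5 * s for s in calib_scores]

q = calibrate_quantile(calib_scores, calib_correct)
threshold = transfer_threshold(q, deploy_scores)
print(q, threshold)
```

A deployment match would then be accepted only when its similarity score strictly exceeds `threshold`. Working in quantile space rather than raw score space is what lets the operating point follow a shift or rescaling of the score distribution.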

2602.01700 2026-06-18 cs.RO (updated version)

Tilt-Ropter: A Fully Actuated Hybrid Aerial-Terrestrial Vehicle with Tilt Rotors and Passive Wheels

Ruoyu Wang, Xuchen Liu, Zongzhou Wu, Zixuan Guo, Wendi Ding, Ben M. Chen

Affiliations: Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong; Faculty of Engineering, The University of Hong Kong; Peng Cheng Laboratory

AI summary: Proposes Tilt-Ropter, a fully actuated hybrid aerial-terrestrial vehicle that achieves efficient multi-modal locomotion with tilt rotors and passive wheels, together with a unified nonlinear model predictive controller that yields low tracking errors; ground locomotion reduces power consumption by 92.8% compared to flight.

Comments: 8 pages, 10 figures. Accepted by the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)

Abstract

In this work, we present Tilt-Ropter, a fully actuated hybrid aerial-terrestrial vehicle (HATV) that integrates tilt rotors with passive wheels to enable efficient multi-modal locomotion. Unlike conventional underactuated HATVs, the fully actuated design of Tilt-Ropter allows decoupled force and torque control, improving maneuverability and ground locomotion efficiency. A unified nonlinear model predictive controller (NMPC) is developed to track reference trajectories, enforce non-holonomic constraints, and accommodate contact effects across locomotion modes, while ensuring actuator feasibility through dedicated control allocation. To address complex wheel-ground dynamics, an external wrench estimator is incorporated to provide real-time interaction wrench estimates. The system is validated through simulation and real-world experiments, including seamless air-ground transitions and trajectory tracking tasks. Experimental results demonstrate low tracking errors in both modes and reveal a 92.8% reduction in power consumption during ground locomotion compared to flight, highlighting the platform's suitability for long-duration missions in energy-constrained environments.

2503.09439 2026-06-18 cs.CV (updated version)

SuperCarver: Texture-Consistent 3D Geometry Super-Resolution for High-Fidelity Surface Detail Generation

Qijian Zhang, Xiaozheng Jian, Xuan Zhang, Wenping Wang, Junhui Hou

Affiliations: Tencent Games, China; Department of Computer Science & Engineering, Texas A&M University; Department of Computer Science, City University of Hong Kong

AI summary: Proposes SuperCarver, a 3D geometry super-resolution pipeline that adds texture-consistent surface details to coarse meshes using a prior-guided normal diffusion model and noise-resistant inverse rendering, enabling high-fidelity detail generation.

Comments: Accepted in IEEE TVCG

Abstract

The conventional production workflow for high-precision mesh assets requires a cumbersome and laborious process of manual sculpting by specialized 3D artists/modelers. Recent years have witnessed remarkable advances in AI-empowered 3D content creation for generating plausible structures and intricate appearances from images or text prompts. However, synthesizing realistic surface details still poses great challenges, and enhancing the geometry fidelity of existing lower-quality 3D meshes (instead of image/text-to-3D generation) remains an open problem. In this paper, we introduce SuperCarver, a 3D geometry super-resolution pipeline for supplementing texture-consistent surface details onto a given coarse mesh. We start by rendering the original textured mesh into the image domain from multiple viewpoints. To achieve detail boosting, we construct a deterministic prior-guided normal diffusion model, which is fine-tuned on a carefully curated dataset of paired detail-lacking and detail-rich normal map renderings. To update mesh surfaces from potentially imperfect normal map predictions, we design a noise-resistant inverse rendering scheme through a deformable distance field. Experiments demonstrate that our SuperCarver is capable of generating realistic and expressive surface details depicted by the actual texture appearance, making it a powerful tool to both upgrade historical low-quality 3D assets and reduce the workload of sculpting high-poly meshes.

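2602.00176 2026-06-18 cs.CV cs.AI (updated version)

Posterior Continuation with Noise-Conditioned Frequency Exposure for Diffusion Inverse Problems

Feng Tian, Yixuan Li, Weili Zeng, Weitian Zhang, Yichao Yan, Xiaokang Yang

Affiliations: Shanghai Jiao Tong University

AI summary: Proposes a posterior continuation framework that progressively exposes measurement frequencies according to the diffusion noise level and, combined with a stabilized sampler, achieves strong performance on super-resolution, inpainting, and deblurring.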

Abstract

Diffusion posterior sampling solves inverse problems by combining a pretrained diffusion prior with measurement-consistency guidance. However, full-band guidance can be unreliable at high noise levels, where clean estimates contain score-induced errors and high-frequency measurement directions are weakly identifiable. We argue that posterior guidance should expose measurement frequencies according to the instantaneous diffusion noise level. Based on this principle, we propose a posterior continuation framework that constructs a family of intermediate posteriors whose likelihood emphasizes currently reliable frequency bands and gradually returns to full-band consistency. We instantiate this framework with a stabilized sampler that combines a diffusion predictor, frequency-limited likelihood refinement, and a Haar-domain commitment rule that commits reliable coarse corrections while deferring weakly identifiable details. Across super-resolution, inpainting, and deblurring, our method achieves competitive-to-state-of-the-art restoration performance, including up to 5 dB PSNR improvement on motion deblurring over strong baselines in evaluations on FFHQ and ImageNet.

2602.00161 2026-06-18 cs.LG cs.AI cs.CL quant-ph (updated version)

LLM Compression by Block Removal with Constrained Binary Optimization

David Jansen, Roman Rausch, Ali Hashemi, David Montero, Román Orús

Affiliations: Multiverse Computing; Donostia International Physics Center; Ikerbasque Foundation for Science

AI summary: Formulates block-removal compression of large language models as a constrained binary optimization problem mapped to an Ising glass system, enabling efficient ranking and high-quality removal of non-consecutive blocks; at 50% compression it improves MMLU by almost 23 percentage points, and it is computationally efficient and broadly applicable.

Comments: 16 pages, 3 figures

Abstract

In this paper, we formulate the compression of large language models (LLMs) by optimally deleting transformer blocks ("block removal") as a constrained binary optimization (CBO) problem that can be mapped to a physical system (Ising glass), whose energies are a strong proxy for downstream model performance. This formulation enables an efficient ranking of a large number of candidate block-removal configurations, yielding many high-quality, non-trivial solutions beyond those only removing consecutive regions. Our method performs strongly in the deep compression regime, such as for 50% compression of Llama-3.3-70B-Instruct, where we achieve an almost 23 percentage point increase on the MMLU benchmark compared to other state-of-the-art (SOTA) block-removal methods. For lighter compression, it performs on par with those methods across several benchmarks for Llama-3.1-8B-Instruct, Qwen3-14B (both before and after retraining), as well as Llama-3.3-70B-Instruct. The approach is computationally efficient and requires only forward and backward passes on a calibration dataset for a few active parameters. Additionally, we demonstrate that using good heuristic solvers for the CBO problem provides solutions that perform well on downstream tasks in negligible runtime when it is infeasible to solve the problem exactly. The method can be readily applied to any architecture. We illustrate this generality on the recent NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 model, which exhibits a highly inhomogeneous and challenging block structure, and where we outperform SOTA for AIME25 and GPQA when removing either 2 attention layers or 3 mixture-of-experts layers.
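
To make the constrained binary optimization framing concrete, the sketch below ranks every way of removing exactly k blocks under a quadratic, Ising-style energy. It is a toy under stated assumptions: the per-block costs `h`, the pairwise couplings `J`, and the brute-force enumeration are all hypothetical, and the abstract does not say how the authors estimate the energies or which solver they use.

```python
from itertools import combinations

def energy(removed, h, J):
    """Quadratic energy of removing the given blocks: single-block plus pairwise terms."""
    e = sum(h[i] for i in removed)
    e += sum(J.get((i, j), 0.0) for i, j in combinations(sorted(removed), 2))
    return e

def rank_removals(h, J, k):
    """Rank all ways to remove exactly k blocks, lowest energy first."""
    configs = combinations(range(len(h)), k)
    return sorted(configs, key=lambda c: energy(c, h, J))

# Hypothetical costs for a 4-block model: block 0 is expensive to remove,
# and removing blocks 1 and 2 together carries an extra penalty.
h = [1.0, 0.2, 0.5, 0.3]
J = {(1, 2): 0.6}

ranking = rank_removals(h, J, k=2)
print(ranking[0])  # lowest-energy pair of blocks to remove
```

Here the pairwise term steers the best configuration away from the adjacent pair (1, 2) toward the non-consecutive pair (1, 3), which is the kind of solution the abstract highlights. Exhaustive enumeration is only viable for tiny instances; at realistic depths a heuristic solver would stand in for it.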

2511.20002 2026-06-18 cs.CV cs.AI cs.CR (updated version)

Semantic Router: On the Feasibility of Hijacking MLLMs via a Single Adversarial Perturbation

Changyue Li, Jiaying Li, Youliang Yuan, Jiaming He, Zhicong Huang, Pinjia He

Affiliations: The Chinese University of Hong Kong, Shenzhen, China; School of Data Science, School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen, China

AI summary: Proposes the Semantic-Aware Universal Perturbation (SAUP), which acts as a semantic router to hijack multiple stateless decisions simultaneously, realized through theoretical analysis and the SORT optimization strategy; it reaches a 66% attack success rate over five targets against Qwen.

Comments: Accepted to ICML 2026

Abstract

Multimodal Large Language Models (MLLMs) are increasingly deployed in stateless systems, such as autonomous driving and robotics. This paper investigates a novel threat: Semantic-Aware Hijacking. We explore the feasibility of hijacking multiple stateless decisions simultaneously using a single universal perturbation. We introduce the Semantic-Aware Universal Perturbation (SAUP), which acts as a semantic router, "actively" perceiving input semantics and routing them to distinct, attacker-defined targets. To achieve this, we conduct theoretical and empirical analysis on the geometric properties in the latent space. Guided by these insights, we propose the Semantic-Oriented (SORT) optimization strategy and annotate a new dataset with fine-grained semantics to evaluate performance. Extensive experiments on three representative MLLMs demonstrate the fundamental feasibility of this attack, achieving a 66% attack success rate over five targets using a single frame against Qwen.

2601.21626 2026-06-18 cs.LG cs.AI (updated version)

HeRo-Q: A General Framework for Stable Low Bit Quantization via Hessian Conditioning

Jinhao Zhang, Yunquan Zhang, Zicheng Yan, Boyang Zhang, Jun Sun, Daning Cheng

Affiliations: Beijing University of Posts and Telecommunications; Institute of Computing Technology, Chinese Academy of Sciences; University of Science and Technology of China; Zhejiang Lab; Peng Cheng Laboratory

AI summary: Addresses the 'low error, high loss' paradox in post-training quantization with the HeRo-Q algorithm, which uses a lightweight learnable rotation-compression matrix to reshape the loss landscape and reduce the largest Hessian eigenvalue, improving robustness to quantization noise and outperforming existing methods on Llama and Qwen models.

Abstract

Post-Training Quantization (PTQ), a mainstream model compression technique, often leads to the paradoxical 'low error, high loss' phenomenon because it focuses solely on minimizing quantization error. The root cause lies in the Hessian matrix of the LLM loss landscape: a few high-curvature directions are extremely sensitive to perturbations. To address this, we propose the Hessian Robust Quantization (HeRo-Q) algorithm, which applies a lightweight, learnable rotation-compression matrix to the weight space prior to quantization. This joint framework reshapes the loss landscape by reducing the largest Hessian eigenvalue, thereby significantly enhancing robustness to quantization noise. HeRo-Q requires no architectural modifications, incurs negligible computational overhead, and integrates seamlessly into existing PTQ pipelines. Experiments on Llama and Qwen models show that HeRo-Q consistently outperforms state-of-the-art methods including GPTQ, AWQ, and SpinQuant, not only achieving superior performance under standard W4A8 settings but also excelling in the highly challenging W3A16 ultra-low-bit regime, where it boosts GSM8K accuracy on Llama3 8B to 70.15% and effectively avoids the logical collapse commonly seen in aggressive quantization.
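
The 'low error, high loss' effect described in this abstract can be read off a standard second-order expansion of the loss. As a sketch, assume the model sits near a minimum where the gradient is negligible, write δ for the weight perturbation introduced by quantization, H for the (symmetric) loss Hessian, and λ_max(H) for its largest eigenvalue; these symbols are chosen here for illustration and are not taken from the paper.

```latex
\[
\Delta L \;\approx\; \tfrac{1}{2}\,\delta^{\top} H\,\delta
\;\le\; \tfrac{1}{2}\,\lambda_{\max}(H)\,\lVert \delta \rVert^{2}
\]
```

Two perturbations with the same norm can therefore cause very different loss increases, depending on how much of δ lies along the high-curvature eigenvectors of H. Lowering λ_max(H) tightens the worst-case bound for a given quantization error.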

2601.20381 2026-06-18 cs.RO 版本更新

STORM: Slot-based Task-aware Object-centric Representation for robotic Manipulation

STORM:基于槽的任务感知面向对象的机器人操作表示

Alexandre Chapin, Emmanuel Dellandréa, Liming Chen

发表机构 * Ecole Centrale de Lyon, LIRIS(里昂中央理工学院,LIRIS实验室)

AI总结 提出STORM模块,通过多阶段训练策略将冻结的视觉基础模型与语义感知槽结合,生成面向对象的任务感知表示,提升机器人操作在视觉干扰下的泛化性和控制性能。

详情
AI中文摘要

视觉基础模型为机器人提供了强大的感知特征,但其密集表示缺乏显式的对象级结构,限制了操作任务的鲁棒性和可控性。我们提出STORM(基于槽的任务感知面向对象的机器人操作表示),一个轻量级的面向对象适应模块,通过一组语义感知槽增强冻结的视觉基础模型,用于机器人操作。STORM不重新训练大型骨干网络,而是采用多阶段训练策略:首先通过使用语言嵌入的视觉-语义预训练稳定面向对象的槽,然后与下游操作策略联合适应。这种分阶段学习防止了退化槽的形成,并在保持语义一致性的同时将感知与任务目标对齐。在对象发现基准和模拟操作任务上的实验表明,与直接使用冻结的基础模型特征或端到端训练面向对象的表示相比,STORM改善了对视觉干扰物的泛化能力和控制性能。我们的结果强调了多阶段适应作为将通用基础模型特征转化为用于机器人控制的任务感知面向对象表示的有效机制。

英文摘要

Visual foundation models provide strong perceptual features for robotics, but their dense representations lack explicit object-level structure, limiting robustness and controllability in manipulation tasks. We propose STORM (Slot-based Task-aware Object-centric Representation for robotic Manipulation), a lightweight object-centric adaptation module that augments frozen visual foundation models with a small set of semantic-aware slots for robotic manipulation. Rather than retraining large backbones, STORM employs a multi-phase training strategy: object-centric slots are first stabilized through visual-semantic pretraining using language embeddings, then jointly adapted with a downstream manipulation policy. This staged learning prevents degenerate slot formation and preserves semantic consistency while aligning perception with task objectives. Experiments on object discovery benchmarks and simulated manipulation tasks show that STORM improves generalization to visual distractors and control performance compared to directly using frozen foundation model features or training object-centric representations end-to-end. Our results highlight multi-phase adaptation as an efficient mechanism for transforming generic foundation model features into task-aware object-centric representations for robotic control.

2601.20361 2026-06-18 cs.LG cs.NA math.NA 版本更新

TINNs: Time-Induced Neural Networks for Solving Time-Dependent PDEs

TINNs:时间诱导神经网络求解时变偏微分方程

Chen-Yang Dai, Che-Chia Chang, Te-Sheng Lin, Ming-Chih Lai, Chieh-Hsin Lai

发表机构 * Department of Applied Mathematics, National Yang Ming Chiao Tung University, Hsinchu 30010, Taiwan(应用数学系,国立阳明交通大学,新竹30010,台湾) Institute of Artificial Intelligence Innovation, National Yang Ming Chiao Tung University, Hsinchu 30010, Taiwan(人工智能创新研究所,国立阳明交通大学,新竹30010,台湾) National Center for Theoretical Sciences, National Taiwan University, Taipei 10617, Taiwan(理论科学研究中心,国立台湾大学,台北10617,台湾)

AI总结 提出时间诱导神经网络(TINNs),将网络权重参数化为时间的函数,使空间表示随时间演化,结合Levenberg-Marquardt优化,在时变PDE求解中相对误差降低4倍,收敛速度提升10倍。

Comments Accepted at ICML 2026. Camera-ready version. Includes appendix

详情
AI中文摘要

物理信息神经网络(PINNs)通过学习一个无网格、可微的解来求解时变偏微分方程(PDE),该解可在空间和时间的任意位置进行评估。然而,标准的时空PINNs将时间作为输入,但在所有时间上重用具有共享权重的单一网络,迫使相同的特征表示显著不同的动力学。这种耦合会降低精度,并在联合强制执行PDE、边界和初始条件时可能破坏训练稳定性。我们提出时间诱导神经网络(TINNs),一种新颖的架构,将网络权重参数化为时间的可学习函数,允许有效的空间表示随时间演化,同时保持共享结构。由此产生的公式自然产生一个非线性最小二乘问题,我们使用Levenberg-Marquardt方法高效优化。在各种时变PDE上的实验表明,与PINNs和强基线相比,相对误差最多降低4倍,收敛速度最多提高10倍。

英文摘要

Physics-informed neural networks (PINNs) solve time-dependent partial differential equations (PDEs) by learning a mesh-free, differentiable solution that can be evaluated anywhere in space and time. However, standard space-time PINNs take time as an input but reuse a single network with shared weights across all times, forcing the same features to represent markedly different dynamics. This coupling degrades accuracy and can destabilize training when enforcing PDE, boundary, and initial constraints jointly. We propose Time-Induced Neural Networks (TINNs), a novel architecture that parameterizes the network weights as a learned function of time, allowing the effective spatial representation to evolve over time while maintaining shared structure. The resulting formulation naturally yields a nonlinear least-squares problem, which we optimize efficiently using a Levenberg-Marquardt method. Experiments on various time-dependent PDEs show up to 4 times lower relative error and 10 times faster convergence compared to PINNs and strong baselines.
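A minimal sketch of the core idea, weights parameterized as a function of time, might look like the following. The linear-in-time form `p0 + t * p1`, the single tanh hidden layer, and all names (`time_dependent`, `tinn_forward`, `params`) are assumptions made for illustration; the abstract does not specify the authors' parameterization, and no training loop or Levenberg-Marquardt step is shown.

```python
import math

def time_dependent(p0, p1, t):
    """Hypothetical parameterization: each weight is p0 + t * p1."""
    return [a + t * b for a, b in zip(p0, p1)]

def tinn_forward(x, t, params):
    """Evaluate u(x, t) with a one-hidden-layer network whose weights depend on t.

    Unlike a standard space-time PINN, t is not a network input here;
    it only selects the weights used for the spatial network in x.
    """
    w = time_dependent(params["w0"], params["w1"], t)   # input weights
    b = time_dependent(params["b0"], params["b1"], t)   # hidden biases
    a = time_dependent(params["a0"], params["a1"], t)   # output weights
    return sum(ai * math.tanh(wi * x + bi) for ai, wi, bi in zip(a, w, b))

params = {
    "w0": [1.0, -0.5], "w1": [0.3, 0.2],
    "b0": [0.0, 0.1], "b1": [0.0, -0.1],
    "a0": [1.0, 1.0], "a1": [0.5, -0.5],
}

u_early = tinn_forward(0.5, 0.0, params)   # spatial network at t = 0
u_late = tinn_forward(0.5, 1.0, params)    # same x, different weights at t = 1
```

The same spatial point is mapped through different effective weights at different times, which is the property the abstract contrasts with shared-weight space-time PINNs.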

2503.01805 2026-06-18 cs.LG cs.AI cs.CL 版本更新

Depth-Width tradeoffs in Algorithmic Reasoning of Graph Tasks with Transformers

图任务算法推理中Transformer的深度-宽度权衡

Gilad Yehudai, Clayton Sanford, Maya Bechler-Speicher, Orr Fischer, Ran Gilad-Bachrach, Amir Globerson

发表机构 * Courant Institute of Mathematical Sciences, New York University(纽约大学应用数学科学研究所) Google Research(谷歌研究) Meta AI Bar-Ilan University(巴伊兰大学) Department of Bio-Medical Engineering, Edmond J. Safra Center for Bioinformatics, Tel-Aviv University(生物医学工程系,埃德蒙·J·萨法中心,特拉维夫大学) Tel Aviv University(特拉维夫大学)

AI总结 研究Transformer在图算法任务中深度与宽度的权衡,发现线性宽度下常数深度足以解决许多图问题,而某些问题需要二次宽度,实验验证了宽模型在保持精度的同时训练和推理更快。

Comments Updated ISF grant number

详情
AI中文摘要

Transformer已经彻底改变了机器学习领域。特别是,它们可用于解决复杂的算法问题,包括基于图的任务。在此类算法任务中,一个关键问题是能够实现该任务的Transformer的最小尺寸是多少。最近的工作开始探索图任务的这个问题,表明对于次线性嵌入维度(即模型宽度),对数深度就足够了。然而,我们在这里解决的一个开放问题是,如果允许宽度线性增长而深度保持固定,会发生什么。我们分析了这种情况,并得出了一个令人惊讶的结果:在线性宽度下,常数深度足以解决一系列基于图的问题。这表明宽度的适度增加可以允许更浅的模型,这在推理和训练时间方面是有利的。对于其他问题,我们表明需要二次宽度。我们的结果展示了Transformer实现图算法的复杂而有趣的格局。我们通过实验研究了深度和宽度相对能力之间的这些权衡,并发现宽模型在具有与深模型相同准确度的任务中,由于可并行化的硬件,训练和推理时间更快。

英文摘要

Transformers have revolutionized the field of machine learning. In particular, they can be used to solve complex algorithmic problems, including graph-based tasks. In such algorithmic tasks a key question is what is the minimal size of a transformer that can implement the task. Recent work has begun to explore this problem for graph-based tasks, showing that for sub-linear embedding dimension (i.e., model width) logarithmic depth suffices. However, an open question, which we address here, is what happens if width is allowed to grow linearly, while depth is kept fixed. Here we analyze this setting, and provide the surprising result that with linear width, constant depth suffices for solving a host of graph-based problems. This suggests that a moderate increase in width can allow much shallower models, which are advantageous in terms of inference and train time. For other problems, we show that quadratic width is required. Our results demonstrate the complex and intriguing landscape of transformer implementations of graph-based algorithms. We empirically investigate these trade-offs between the relative powers of depth and width and find tasks where wider models have the same accuracy as deep models, while having much faster train and inference time due to parallelizable hardware.

2601.17226 2026-06-18 cs.CL cs.AI 版本更新

Retell, Reward, Repeat: Reinforcement Learning for Narrative Theory-Informed Story Retelling

复述、奖励、重复:面向叙事理论启发的故事复述的强化学习

David Y. Liu, Xanthe Muston, Dipankar Srirag, Aditya Joshi, Sebastian Sequoiah-Grayson

发表机构 * University of New South Wales(新南威尔士大学)

AI总结 提出RRR强化学习框架,结合结构主义叙事学与标量叙事性,通过d-RLAIF从文本特征中获取训练信号,无需参考输出,提升LLM故事复述的逻辑性、合理性和完整性。

Comments 8 Pages, 7 figures

详情
AI中文摘要

反事实故事复述暴露了LLM在受限叙事解空间中的缺陷,此时它们无法依赖回忆记忆的训练数据。基于真实值的后训练(如SFT)无法教会LLM生成逻辑合理的叙事事件。本文提出Retell, Reward, Repeat (RRR),一个基于强化学习的流水线,将结构主义叙事学与标量叙事性相结合,以教授故事结构。我们扩展了TimeTravel数据集,加入人工标注的叙事平衡阶段,以评估奖励模型。通过d-RLAIF,RRR从文本特征的叙事性中推导训练信号,无需参考输出。评估表明,RRR训练的LLM在逻辑性、合理性和完整性上优于少样本和SFT基线,输出质量还通过盲评人类偏好得到验证。RRR仅依赖小型查询数据集,为故事讲述这一目前缺乏有效后训练方法的领域提供了一种基于语言学、成本效益高的后训练机制。RRR强调了将既定语言学理论整合到当代NLP中的持续相关性。

英文摘要

Counterfactual story retelling exposes LLM shortcomings in constrained narrative solution spaces where they can no longer rely on recalling memorised training data. Ground-truth-based post-training, such as SFT, fails to teach LLMs how to generate logical and rational narrative events. In this paper, we introduce Retell, Reward, Repeat (RRR), an RL-based pipeline synthesising Structuralist Narratology with scalar narrativity to teach storytelling structure. We extend the TimeTravel dataset with human-annotated stages of narrative equilibrium to evaluate reward models. By using d-RLAIF, RRR derives training signals from the narrativity of textual features without the need for reference outputs. Evaluations demonstrate that RRR-trained LLMs outperform few-shot and SFT baselines in logic, rationality, and completeness, with output quality additionally validated by blind human preference. Relying on a small, query-only dataset, RRR provides a linguistically grounded, cost-effective post-training mechanism for storytelling, a domain currently lacking effective post-training methods. RRR highlights the continued relevance of integrating established linguistic theories into contemporary NLP.

2506.13196 2026-06-18 cs.LG 版本更新

KEPLA: A Knowledge-Enhanced Deep Learning Framework for Accurate Protein-Ligand Binding Affinity Prediction

KEPLA:一种用于精确预测蛋白质-配体结合亲和力的知识增强深度学习框架

Han Liu, Keyan Ding, Peilin Chen, Yinwei Wei, Liqiang Nie, Dapeng Wu, Shiqi Wang

发表机构 * Department of Computer Science, City University of Hong Kong(香港城市大学计算机科学系) ZJU-Hangzhou Global Scientific and Technological Innovation Center, Zhejiang University(浙江大学杭州国际科技创新中心) School of Software, Shandong University(山东大学软件学院) College of Informatics, Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳)计算机学院)

AI总结 提出KEPLA框架,通过整合基因本体和配体属性的先验知识,利用全局表示对齐与局部交叉注意力,提升蛋白质-配体结合亲和力预测的准确性,在多个基准数据集上超越现有方法。

详情
AI中文摘要

准确预测蛋白质-配体结合亲和力对药物发现至关重要。尽管最近的深度学习方法已展现出有希望的结果,但它们通常仅依赖蛋白质和配体的结构特征,忽略了与结合亲和力相关的宝贵生化知识。为解决这一局限,我们提出KEPLA,一种新颖的深度学习框架,明确整合来自基因本体和配体属性的先验知识以增强预测性能。KEPLA以蛋白质序列和配体分子图作为输入,并优化两个互补目标:(1)将全局表示与知识图谱关系对齐,以捕获领域特定的生化见解;(2)利用局部表示之间的交叉注意力构建细粒度联合嵌入用于预测。在两个基准数据集上的域内和跨域场景实验表明,KEPLA始终优于最先进的基线方法。此外,基于知识图谱关系和交叉注意力图的可解释性分析为潜在的预测机制提供了有价值的见解。

英文摘要

Accurate prediction of protein-ligand binding affinity is critical for drug discovery. While recent deep learning approaches have demonstrated promising results, they often rely solely on structural features of proteins and ligands, overlooking valuable biochemical knowledge associated with binding affinity. To address this limitation, we propose KEPLA, a novel deep learning framework that explicitly integrates prior knowledge from Gene Ontology and ligand properties to enhance prediction performance. KEPLA takes protein sequences and ligand molecular graphs as input and optimizes two complementary objectives: (1) aligning global representations with knowledge graph relations to capture domain-specific biochemical insights, and (2) leveraging cross attention between local representations to construct fine-grained joint embeddings for prediction. Experiments on two benchmark datasets across both in-domain and cross-domain scenarios demonstrate that KEPLA consistently outperforms state-of-the-art baselines. Furthermore, interpretability analyses based on knowledge graph relations and cross attention maps provide valuable insights into the underlying predictive mechanisms.

2601.14968 2026-06-18 cs.LG cs.AI 版本更新

InstructTime++: Time Series Classification with Multimodal Language Modeling via Implicit Feature Enhancement

InstructTime++: 通过隐式特征增强的多模态语言建模进行时间序列分类

Mingyue Cheng, Xiaoyu Tao, Huajian Zhang, Qi Liu, Zhiding Liu, Yucong Luo, Yiheng Chen, Enhong Chen

发表机构 * State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China(中国科学技术大学认知智能国家重点实验室)

AI总结 提出将时间序列分类转化为多模态生成任务,通过离散化模块和对齐投影层弥合模态差距,并利用隐式特征建模提升语言模型性能。

详情
AI中文摘要

大多数现有的时间序列分类方法采用判别范式,将输入序列直接映射到独热编码的类别标签。虽然有效,但这种范式难以融入上下文特征,也无法捕捉类别间的语义关系。为了解决这些局限性,我们提出了InstructTime,一种将时间序列分类重新定义为多模态生成任务的新框架。具体来说,连续的数值序列、上下文文本特征和任务指令被视为多模态输入,而类别标签则通过调优的语言模型作为文本输出生成。为了弥合模态差距,InstructTime引入了一个时间序列离散化模块,将连续序列转换为离散的时间标记,同时结合对齐投影层和生成式自监督预训练策略,以增强跨模态表示对齐。在此框架基础上,我们进一步提出了InstructTime++,通过引入隐式特征建模来扩展InstructTime,以补偿语言模型有限的归纳偏差。InstructTime++利用专门的工具包从原始时间序列和上下文输入中挖掘信息丰富的隐式模式,包括统计特征提取和基于视觉-语言模型的图像描述,并将其转化为文本描述以实现无缝集成。在多个基准数据集上的大量实验证明了InstructTime++的优越性能。

英文摘要

Most existing time series classification methods adopt a discriminative paradigm that maps input sequences directly to one-hot encoded class labels. While effective, this paradigm struggles to incorporate contextual features and fails to capture semantic relationships among classes. To address these limitations, we propose InstructTime, a novel framework that reformulates time series classification as a multimodal generative task. Specifically, continuous numerical sequences, contextual textual features, and task instructions are treated as multimodal inputs, while class labels are generated as textual outputs by tuned language models. To bridge the modality gap, InstructTime introduces a time series discretization module that converts continuous sequences into discrete temporal tokens, together with an alignment projection layer and a generative self-supervised pre-training strategy to enhance cross-modal representation alignment. Building upon this framework, we further propose InstructTime++, which extends InstructTime by incorporating implicit feature modeling to compensate for the limited inductive bias of language models. InstructTime++ leverages specialized toolkits to mine informative implicit patterns from raw time series and contextual inputs, including statistical feature extraction and vision-language-based image captioning, and translates them into textual descriptions for seamless integration. Extensive experiments on multiple benchmark datasets demonstrate the superior performance of InstructTime++.
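The step of converting a continuous series into discrete temporal tokens can be sketched as follows. This sketch swaps in plain uniform binning, which is an assumption: the abstract does not say how InstructTime's discretization module works, and the function name `discretize` and the `<ts_k>` token format are hypothetical.

```python
def discretize(series, num_bins):
    """Map each value to a token by uniform binning over the series range.

    A stand-in for a learned discretization module; values at the top of
    the range are clamped into the last bin.
    """
    lo, hi = min(series), max(series)
    width = (hi - lo) / num_bins or 1.0   # avoid zero width for constant series
    tokens = []
    for value in series:
        index = min(int((value - lo) / width), num_bins - 1)
        tokens.append(f"<ts_{index}>")
    return tokens

tokens = discretize([0.0, 0.4, 1.0, 0.9, 0.1], num_bins=4)
```

The resulting token strings could then be placed in a prompt next to the textual context and task instruction, so that a language model generates the class label as text.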

2601.13836 2026-06-18 cs.CL cs.CV cs.MM 版本更新

FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

FutureOmni:从全模态上下文中评估多模态大语言模型的未来预测能力

Qian Chen, Jinlan Fu, Changsong Li, Min Zhang, See-Kiong Ng, Xipeng Qiu

发表机构 * Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院) Harbin Institute of Technology, Shenzhen(哈尔滨工业大学深圳分校) National University of Singapore(新加坡国立大学)

AI总结 提出FutureOmni基准,评估多模态大模型从音视频线索预测未来的能力,发现现有模型在语音密集场景下表现差,并设计OFF训练策略提升性能。

Comments Accepted by ICML 2026

详情
AI中文摘要

尽管多模态大语言模型(MLLMs)展现出强大的全模态感知能力,但它们从音视频线索预测未来事件的能力仍未被充分探索,因为现有基准主要关注回顾性理解。为弥补这一差距,我们引入了FutureOmni,这是第一个旨在从音视频环境中评估全模态未来预测的基准。评估模型需要执行跨模态因果和时间推理,并有效利用内部知识预测未来事件。FutureOmni通过可扩展的LLM辅助、人在回路流水线构建,包含8个主要领域的919个视频和1034个多项选择问答对。对13个全模态和7个仅视频模型的评估表明,当前系统在音视频未来预测方面存在困难,尤其是在语音密集场景中,Gemini 3 Flash达到最佳准确率64.8%。为缓解这一局限,我们整理了一个7K样本的指令微调数据集,并提出全模态未来预测(OFF)训练策略。在FutureOmni以及流行的音视频和仅视频基准上的评估表明,OFF增强了未来预测和泛化能力。我们公开发布所有代码( https://github.com/OpenMOSS/FutureOmni )和数据集( https://huggingface.co/datasets/OpenMOSS-Team/FutureOmni )。

英文摘要

Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on retrospective understanding. To bridge this gap, we introduce FutureOmni, the first benchmark designed to evaluate omni-modal future forecasting from audio-visual environments. The evaluated models are required to perform cross-modal causal and temporal reasoning, as well as effectively leverage internal knowledge to predict future events. FutureOmni is constructed via a scalable LLM-assisted, human-in-the-loop pipeline and contains 919 videos and 1,034 multiple-choice QA pairs across 8 primary domains. Evaluations on 13 omni-modal and 7 video-only models show that current systems struggle with audio-visual future prediction, particularly in speech-heavy scenarios, with the best accuracy of 64.8% achieved by Gemini 3 Flash. To mitigate this limitation, we curate a 7K-sample instruction-tuning dataset and propose an Omni-Modal Future Forecasting (OFF) training strategy. Evaluations on FutureOmni and popular audio-visual and video-only benchmarks demonstrate that OFF enhances future forecasting and generalization. We publicly release all code (https://github.com/OpenMOSS/FutureOmni) and datasets (https://huggingface.co/datasets/OpenMOSS-Team/FutureOmni).

2508.07375 2026-06-18 cs.CL cs.SD eess.AS 版本更新

TurnGuide: Enhancing Meaningful Full Duplex Spoken Interactions via Dynamic Turn-Level Text-Speech Interleaving

TurnGuide: 通过动态轮次级文本-语音交错增强有意义的全双工口语交互

Wenqian Cui, Lei Zhu, Xiaohui Li, Zhihan Guo, Haoli Bai, Lu Hou, Irwin King

发表机构 * The Chinese University of Hong Kong(香港中文大学) Huawei Technologies(华为技术)

AI总结 提出TurnGuide方法,通过动态分割助手语音为对话轮次并交错生成轮次级文本和语音,解决全双工语音语言模型在连续双通道音频中集成离散文本令牌导致的时间对齐问题,显著提升语义连贯性和轮次交互性能。

Comments Interspeech 2026 Long Paper Track

详情
AI中文摘要

全双工语音语言模型(FD-SLMs)是专门的基础模型,旨在通过建模复杂的对话轮次(如打断、反馈和重叠语音)来实现自然的实时口语交互。端到端(e2e)FD-SLMs利用真实世界的双通道对话数据捕捉细微的双说话者对话模式以实现类人交互,但由于语音序列过长和高质量口语对话数据有限,其对话能力往往比纯文本对话有所下降。尽管交错文本-语音生成可以缓解这种退化,但将离散文本令牌集成到连续双通道音频流中可能会破坏流畅交互所需的时间对齐。为了解决这个问题,我们提出了TurnGuide,一种用于e2e FD-SLMs的新型文本-语音交错生成方法,该方法动态地将助手语音分割成对话轮次,并交错生成轮次级文本和语音。这种方法使FD-SLMs能够整合LLMs的语义智能,同时不损害自然的声学流畅性。大量实验表明,TurnGuide不仅显著提升了e2e FD-SLMs生成语义有意义且连贯语音的能力,而且在各种轮次事件上达到了最先进的性能。演示请访问 https://dreamtheater123.github.io/TurnGuide-Demo/ ,代码请访问 https://github.com/dreamtheater123/TurnGuide 。

英文摘要

Full-Duplex Speech Language Models (FD-SLMs) are specialized foundation models designed to enable natural, real-time spoken interactions by modeling complex conversational turn-taking such as interruptions, backchannels, and overlapping speech. End-to-end (e2e) FD-SLMs leverage real-world double-channel conversational data to capture nuanced two-speaker dialogue patterns for human-like interactions, but their conversational abilities often degrade compared to pure-text conversation due to prolonged speech sequences and limited high-quality spoken dialogue data. Although interleaved text-speech generation could mitigate this degradation, integrating discrete text tokens into continuous double-channel audio streams could disrupt the precise time alignment required for fluid interaction. To address this, we propose TurnGuide, a novel text-speech interleaved generation approach for e2e FD-SLMs that dynamically segments assistant speech into dialogue turns and interleaves turn-level text and speech generation. This approach allows FD-SLMs to integrate the semantic intelligence of LLMs without compromising the natural acoustic flow. Extensive experiments show that TurnGuide not only significantly improves e2e FD-SLMs to produce semantically meaningful, coherent speech but also achieves state-of-the-art performance on various turn-taking events. Demos are available at https://dreamtheater123.github.io/TurnGuide-Demo/. Code is available at https://github.com/dreamtheater123/TurnGuide.

2601.07052 2026-06-18 cs.RO 版本更新

RSLCPP -- Deterministic Simulations Using ROS 2

RSLCPP——使用ROS 2进行确定性仿真

Simon Sagmeister, Marcel Weinmann, Phillip Pitschi, Markus Lienkamp

发表机构 * Technical University of Munich, Germany(慕尼黑技术大学) School of Engineering & Design, Department of Mobility Systems Engineering, Institute of Automotive Technology(工程与设计学院,移动系统工程系,汽车技术研究所) School of Engineering & Design, Department of Engineering Physics and Computation, Institute of Automatic Control(工程与设计学院,工程物理与计算系,自动控制研究所)

AI总结 针对ROS异步多进程设计导致仿真结果不可复现的问题,提出RSLCPP库,通过确定性回调执行实现跨平台可复现仿真,无需修改现有节点代码。

Comments Accepted for publication at the 'IEEE Robotics and Automation Practice'

详情
AI中文摘要

仿真在现实机器人技术中至关重要,为开发各种机器人应用提供了安全、可扩展且高效的环境。虽然机器人操作系统(ROS)在学术界和工业界已被广泛采用作为这些机器人应用的基础,但其异步、多进程的设计使得复现变得复杂,尤其是在不同的硬件平台上。当计算时间和通信延迟变化时,无法保证确定性回调执行。这种缺乏复现性的问题给科学基准测试和持续集成带来了困难,因为在这些场景中一致的结果至关重要。为了解决这个问题,我们提出了一种使用ROS 2节点创建确定性仿真的方法。我们的面向C++的ROS仿真库(RSLCPP)实现了这种方法,使得现有节点可以组合成一个产生可复现结果的仿真例程,通常无需更改任何源代码。我们证明,在测试合成基准测试和真实机器人系统时,我们的方法在各种CPU和架构上产生相同的结果。RSLCPP已开源,网址为: https://github.com/TUMFTM/rslcpp 。

英文摘要

Simulation is crucial in real-world robotics, offering safe, scalable, and efficient environments for developing a variety of robotic applications. While the Robot Operating System (ROS) has been widely adopted as the backbone of these robotic applications in both academia and industry, its asynchronous, multi-process design complicates reproducibility, especially across varying hardware platforms. Deterministic callback execution cannot be guaranteed when computation times and communication delays vary. This lack of reproducibility complicates scientific benchmarking and continuous integration, where consistent results are essential. To address this, we present a methodology to create deterministic simulations using ROS 2 nodes. Our ROS Simulation Library for C++ (RSLCPP) implements this approach, enabling existing nodes to be combined into a simulation routine that yields reproducible results, usually without requiring any source code changes. We demonstrate that our approach produces identical results across various CPUs and architectures when testing both a synthetic benchmark and a real-world robotics system. RSLCPP is open-sourced at https://github.com/TUMFTM/rslcpp.
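To make the notion of deterministic callback execution concrete, here is a small sketch of an executor that orders callbacks by simulated time and node name rather than by wall-clock arrival. It is an illustration only, written in Python rather than C++, and does not use or reproduce the RSLCPP or ROS 2 APIs; `DeterministicExecutor`, `run_once`, and the node names are hypothetical.

```python
import heapq
import itertools

class DeterministicExecutor:
    """Runs callbacks ordered by simulated time, then node name.

    The insertion counter only breaks ties between callbacks of the same
    node at the same simulated time, so arrival order across nodes
    cannot change the execution order.
    """

    def __init__(self):
        self._queue = []
        self._seq = itertools.count()
        self.trace = []

    def schedule(self, sim_time, node, callback):
        heapq.heappush(self._queue, (sim_time, node, next(self._seq), callback))

    def run(self):
        while self._queue:
            sim_time, node, _, callback = heapq.heappop(self._queue)
            callback()
            self.trace.append((sim_time, node))
        return self.trace

def run_once(arrival_order):
    """Schedule no-op callbacks in the given arrival order and run them."""
    executor = DeterministicExecutor()
    for sim_time, node in arrival_order:
        executor.schedule(sim_time, node, lambda: None)
    return executor.run()

events = [(0.02, "planner"), (0.01, "sensor"), (0.02, "controller"), (0.01, "estimator")]
trace_a = run_once(events)
trace_b = run_once(list(reversed(events)))   # simulate different arrival timing
```

Both traces are identical even though the events arrive in opposite orders, which is the kind of reproducibility the abstract says an asynchronous, multi-process design cannot guarantee.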

2501.02874 2026-06-18 cs.RO 版本更新

Steering Flexible Linear Objects in Planar Environments by Two Robot Hands Using Euler's Elastica Solutions

利用欧拉弹性线解,以两只机器人手在平面环境中操控柔性线性物体

Aharon Levin, Elon Rimon, Amir Shapiro

发表机构 * Dept. of ME, Technion, Israel(以色列理工学院机械工程系,以色列) Dept. of ME, Ben-Gurion University, Israel(本-古里安大学机械工程系,以色列)

AI总结 本文利用欧拉弹性线解,通过控制两机器人手的抓取端点位置和切线,实现平面环境中柔性线性物体的无自交、稳定和避障操控。

详情
AI中文摘要

机器人手对柔性物体(如电缆、电线和生鲜食品)的操控构成了机器人抓取力学中的一个特殊挑战。本文考虑了两机器人手在平面环境中操控柔性线性物体的问题。柔性线性物体被建模为弹性不可拉伸杆,通过改变抓取端点位置同时保持端点切线相等来进行操控。柔性线性物体的形状具有基于抓取端点位置和切线的闭式解,称为欧拉弹性线。本文在最优控制框架下获得了弹性线解,然后利用弹性线解得到了柔性线性物体无自交、稳定性和避障的闭式判据。这些新工具被整合到一个规划方案中,用于在稀疏障碍物分布的平面环境中操控柔性线性物体。该方案已完全实现并通过详细示例进行了演示。

英文摘要

The manipulation of flexible objects such as cables, wires and fresh food items by robot hands forms a special challenge in robot grasp mechanics. This paper considers the steering of flexible linear objects in planar environments by two robot hands. The flexible linear object, modeled as an elastic non-stretchable rod, is manipulated by varying the gripping endpoint positions while keeping equal endpoint tangents. The flexible linear object shape has a closed form solution in terms of the grasp endpoint positions and tangents, called Euler's elastica. This paper obtains the elastica solutions under the optimal control framework, then uses the elastica solutions to obtain closed-form criteria for non self-intersection, stability and obstacle avoidance of the flexible linear object. The new tools are incorporated into a planning scheme for steering flexible linear objects in planar environments populated by sparsely spaced obstacles. The scheme is fully implemented and demonstrated with detailed examples.

2512.21109 2026-06-18 cs.RO 版本更新

Robust and Efficient MuJoCo-based Model Predictive Control via Web of Affine Spaces Derivatives

基于仿射空间网络导数的鲁棒高效MuJoCo模型预测控制

Chen Liang, Daniel Rakita

发表机构 * Department of Computer Science, Yale University(耶鲁大学计算机科学系)

AI总结 针对MJPC中有限差分导数计算瓶颈,引入仿射空间网络(WASP)导数替代,实现高效稳定的导数计算,在多种机器人任务中实现高达2倍加速,并优于随机采样规划器。

Comments Accepted to 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)

详情
AI中文摘要

MuJoCo是一个强大且高效的物理模拟器,广泛应用于机器人领域。其在实际中的一种常见应用是通过模型预测控制(MPC),该控制方法利用模拟器的重复滚动来优化未来动作,并实时生成响应性控制策略。为了使这一过程更易于使用,开源库MuJoCo MPC(MJPC)提供了直接构建在MuJoCo模拟器之上的即用型MPC算法和实现。然而,MJPC依赖有限差分(FD)来计算通过底层MuJoCo模拟器的导数,这通常是一个关键瓶颈,可能使其在时间敏感任务中成本过高,尤其是在高自由度系统或复杂场景中。在本文中,我们介绍了在MJPC中使用仿射空间网络(WASP)导数作为FD的即插即用替代方案。WASP是一种最近开发的方法,用于高效计算精确导数近似序列。通过重用先前相关导数计算的信息,WASP加速并稳定了新导数的计算,使其特别适合MPC随时间迭代的细粒度更新。我们在涵盖多种机器人形态的多样化MJPC任务集上评估了WASP。我们的结果表明,WASP导数在MJPC中特别有效:它无缝集成到各种任务中,提供一致鲁棒的性能,并且与基于导数的规划器(如iLQG)一起使用时,相比FD后端实现了高达2倍的加速。此外,基于WASP的MPC在我们的评估任务中优于MJPC的随机采样规划器,提供了更高的效率和可靠性。为了支持采用和未来研究,我们发布了完全集成WASP导数的MJPC开源实现。

英文摘要

MuJoCo is a powerful and efficient physics simulator widely used in robotics. One common way it is applied in practice is through Model Predictive Control (MPC), which uses repeated rollouts of the simulator to optimize future actions and generate responsive control policies in real time. To make this process more accessible, the open source library MuJoCo MPC (MJPC) provides ready-to-use MPC algorithms and implementations built directly on top of the MuJoCo simulator. However, MJPC relies on finite differencing (FD) to compute derivatives through the underlying MuJoCo simulator, which is often a key bottleneck that can make it prohibitively costly for time-sensitive tasks, especially in high-DOF systems or complex scenes. In this paper, we introduce the use of Web of Affine Spaces (WASP) derivatives within MJPC as a drop-in replacement for FD. WASP is a recently developed approach for efficiently computing sequences of accurate derivative approximations. By reusing information from prior, related derivative calculations, WASP accelerates and stabilizes the computation of new derivatives, making it especially well suited for MPC's iterative, fine-grained updates over time. We evaluate WASP across a diverse suite of MJPC tasks spanning multiple robot embodiments. Our results suggest that WASP derivatives are particularly effective in MJPC: they integrate seamlessly across tasks, deliver consistently robust performance, and achieve up to a 2× speedup compared to an FD backend when used with derivative-based planners, such as iLQG. In addition, WASP-based MPC outperforms MJPC's stochastic sampling-based planners on our evaluation tasks, offering both greater efficiency and reliability. To support adoption and future research, we release an open-source implementation of MJPC with WASP derivatives fully integrated.

2502.10239 2026-06-18 cs.LG cs.AI 版本更新

Efficient Zeroth-Order Federated Finetuning of Language Models on Resource-Constrained Devices

资源受限设备上语言模型的高效零阶联邦微调

Mohamed Aboelenien Ahmed, Kilian Pfeiffer, Ramin Khalili, Heba Khdr, Jörg Henkel

发表机构 * Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院) Huawei(华为) Heisenberg Research Center (Munich), Germany(海森堡研究中心(慕尼黑),德国)

AI总结 提出一种基于零阶优化的联邦微调方法,通过分块模型并分配更多扰动到后一块,复用中间激活减少前向评估次数,在保持内存和通信优势的同时将计算量降低至其他零阶方法的1/3。

Comments Published at TMLR

详情
AI中文摘要

联邦学习是一种有前景的范式,可以在分布式数据源上微调大型语言模型,同时保护数据隐私。然而,在边缘设备上微调如此大的模型由于资源需求高而具有挑战性。零阶优化通过有限差分近似估计梯度,依赖于模型参数随机扰动下的函数评估。因此,与任务对齐的零阶优化提供了一种潜在解决方案,允许仅使用前向传播(推理级内存需求和低通信开销)进行微调,但存在收敛慢和计算需求高的问题。在本文中,我们提出了一种新的基于零阶优化的方法,应用更高效的技术来减少使用大量扰动带来的计算需求,同时保留其收敛优势。这是通过将模型分成连续的块,并为第二块分配更多扰动来实现的,从而能够高效复用中间激活,以更少的前向评估更新整个网络。我们在RoBERTa-large、OPT1.3B、LLaMa-3-3.2B模型上的评估显示,与其他基于零阶优化的技术相比,计算量减少了高达3倍,同时保留了一阶联邦学习技术的内存和通信优势。

英文摘要

Federated Learning (FL) is a promising paradigm for finetuning Large Language Models (LLMs) across distributed data sources while preserving data privacy. However, finetuning such large models is challenging on edge devices due to their high resource demand. Zeroth-order Optimization (ZO) estimates gradients through finite-difference approximations, which rely on function evaluations under random perturbations of the model parameters. Consequently, ZO with task alignment provides a potential solution, allowing finetuning using only forward passes with inference-level memory requirements and low communication overhead, but it suffers from slow convergence and higher computational demand. In this paper, we propose a new ZO-based method that applies a more efficient technique to reduce the computational demand associated with using a large number of perturbations while preserving their convergence benefits. This is achieved by splitting the model into consecutive blocks and allocating a higher number of perturbations to the second block, enabling efficient reuse of intermediate activations to update the full network with fewer forward evaluations. Our evaluation on RoBERTa-large, OPT-1.3B, LLaMa-3-3.2B models shows up to 3× reduction in computation compared to the other ZO-based techniques, while retaining the memory and communication benefits over first-order federated learning techniques.
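The basic zeroth-order estimator described above, a finite-difference gradient estimate from function evaluations under random perturbations, can be sketched as follows. This is the generic two-point estimator averaged over several perturbations, not the paper's block-wise method with activation reuse; the toy quadratic `loss`, the step size, and all names are assumptions for illustration.

```python
import random

def zo_gradient(loss, params, num_perturbations, eps, rng):
    """Two-point zeroth-order gradient estimate using forward evaluations only."""
    grad = [0.0] * len(params)
    for _ in range(num_perturbations):
        z = [rng.gauss(0.0, 1.0) for _ in params]          # random direction
        plus = [p + eps * zi for p, zi in zip(params, z)]
        minus = [p - eps * zi for p, zi in zip(params, z)]
        # Directional derivative along z, approximated by a central difference.
        scale = (loss(plus) - loss(minus)) / (2.0 * eps)
        for i, zi in enumerate(z):
            grad[i] += scale * zi / num_perturbations
    return grad

def loss(params):
    """Toy objective standing in for a model's training loss."""
    return sum((p - 1.0) ** 2 for p in params)

rng = random.Random(0)
params = [0.0, 0.0, 0.0]
initial = loss(params)
for _ in range(200):
    g = zo_gradient(loss, params, num_perturbations=4, eps=1e-3, rng=rng)
    params = [p - 0.05 * gi for p, gi in zip(params, g)]
final = loss(params)
```

Each perturbation costs two forward evaluations, so the compute grows with the number of perturbations. That cost is what the abstract's method aims to reduce.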

2512.14428 2026-06-18 cs.RO 版本更新

Odyssey: An Automotive Lidar-Inertial Odometry Dataset with GNSS-denied situations

Odyssey:一种面向GNSS拒止场景的汽车激光雷达-惯性里程计数据集

Aaron Kurda, Simon Steuernagel, Lukas Jung, Marcus Baum

发表机构 * University of Göttingen(哥廷根大学) iMAR Navigation(iMAR导航)

AI总结 提出Odyssey数据集,采用导航级环形激光陀螺仪RTK/INS提供高精度真值,包含36个序列和长时间GNSS拒止环境(隧道、室内停车场),用于评估LIO/SLAM系统。

Comments 10 pages, 4 figures, 3 tables, submitted to International Journal of Robotics Research (IJRR)

详情
AI中文摘要

激光雷达-惯性里程计(LIO)及同时定位与建图(SLAM)系统的开发与评估需要精确的真值。全球导航卫星系统(GNSS)常作为其基础,但在遮挡环境中,由于多径效应或信号丢失,其信号可能不可靠。现有数据集通过引入惯性测量单元(IMU)测量来补偿偶发的GNSS丢失,但由于累积漂移,常用系统不允许对GNSS拒止环境进行长时间研究。因此,此类数据集的多样性有限。为弥补这一空白,我们提出了Odyssey,一个汽车LIO数据集,其特点包括:(1)基于导航级环形激光陀螺仪(RLG)的RTK/INS导出的真值,其偏置稳定性比现有汽车数据集好1到4个数量级;(2)跨不同环境的36个序列的全面收集,支持稳健且全面的评估;(3)长时间的GNSS拒止环境,包括隧道以及汽车基准测试中此前未见过的室内停车场。在此,我们的RLG系统能够在常用系统会过度漂移的场景中实现准确评估。除了为LIO提供数据外,Odyssey还通过三次轨迹重复和通过精确大地坐标集成外部地图数据来支持地点识别任务。所有数据、数据加载器和补充材料均可在线获取,网址为: https://odyssey.uni-goettingen.de/ 。

英文摘要

The development and evaluation of Lidar-Inertial Odometry (LIO) and Simultaneous Localization and Mapping (SLAM) systems require a precise ground truth. The Global Navigation Satellite System (GNSS) is often used as a foundation for this, but its signals can be unreliable in obstructed environments due to multi-path effects or loss-of-signal. While existing datasets compensate for sporadic GNSS loss by incorporating Inertial Measurement Unit (IMU) measurements, the commonly used systems do not permit prolonged study of GNSS-denied environments due to accumulated drift. Therefore, the diversity of such datasets is limited. To close this gap, we present Odyssey, an automotive LIO dataset featuring: (1) a ground truth derived from a navigation-grade Ring Laser Gyroscope (RLG)-based RTK/INS, offering bias stability one to four orders of magnitude better than existing automotive datasets; (2) a comprehensive collection of 36 sequences across diverse environments, enabling robust and comprehensive evaluation; and (3) prolonged GNSS-denied environments, including tunnels and, previously unseen in the context of automotive benchmarks, indoor parking garages. Here, our RLG-based system enables accurate evaluation in scenarios where commonly employed systems would drift excessively. Besides providing data for LIO, Odyssey also supports place recognition tasks through threefold trajectory repetition and integration of external mapping data via precise geodetic coordinates. All data, the dataloader, and supplementary material are available online at https://odyssey.uni-goettingen.de/ .

2507.07574 2026-06-18 cs.CV 版本更新

Beyond the Linear Separability Ceiling: Aligning Representations in VLMs

超越线性可分上限:对齐视觉-语言模型中的表征

Enrico Vompa, Tanel Tammet, Mohit Vaishnav

发表机构 * Applied Artificial Intelligence Group(应用人工智能小组) Tallinn University of Technology(塔林技术大学)

AI总结 提出线性可分上限(LSC)诊断框架,发现VLM存在对齐差距,并通过对比目标重塑视觉流形,使模型在抽象组合推理任务上显著超越LSC。

Comments Accepted TMLR

详情
AI中文摘要

推进视觉-语言模型(VLM)的一个挑战是确定其在抽象推理任务(如Bongard问题)上的失败源于有缺陷的感知还是有缺陷的自顶向下推理。为了分离这些因素,我们引入了一个诊断框架,该框架以线性可分上限(LSC)为中心,即线性分类器在VLM的原始视觉嵌入上可达到的性能。将该框架应用于最先进的VLM,我们发现了一个普遍的“对齐差距”,其中大多数模型无法在生成性能上超越其表征的线性可分性。我们发现,少数超越这一上限的模型通过两种机制实现:进一步将视觉表征细化为更线性可分的形式,或执行非线性决策逻辑。我们证明,这一瓶颈并非根本限制,而是可解决的视觉对齐问题。我们的方法用对比目标增强标准的下一个词预测,以将视觉流形重塑为更一维线性的几何结构,改进图像间比较,并使模型在抽象组合推理任务上显著超越LSC。

英文摘要

A challenge in advancing Visual-Language Models (VLMs) is determining whether their failures on abstract reasoning tasks, such as Bongard problems, stem from flawed perception or faulty top-down reasoning. To disentangle these factors, we introduce a diagnostic framework centered on the Linear Separability Ceiling (LSC), the performance achievable by a linear classifier on a VLM's raw visual embeddings. Applying this framework to state-of-the-art VLMs, we uncover a pervasive "alignment gap", where most models fail to generatively outperform the linear separability of their representations. We find that the few models surpassing this ceiling do so via two mechanisms: by further refining visual representations into a more linearly separable format or by executing non-linear decision logic. We demonstrate that this bottleneck is not a fundamental limitation but a solvable visual alignment issue. Our method augments standard next-token prediction with a contrastive objective to restructure the visual manifold into a more one-dimensionally linear geometry, improving image-to-image comparison and enabling models to significantly surpass the LSC on abstract compositional reasoning tasks.
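The LSC is defined above as the performance of a linear classifier on raw visual embeddings. A minimal sketch of measuring such a ceiling is shown below, under loud assumptions: toy 2-D vectors stand in for VLM embeddings, a perceptron stands in for whatever linear classifier the authors use, and accuracy is measured on the training points for brevity. The names `train_perceptron` and `accuracy` are hypothetical.

```python
def train_perceptron(embeddings, labels, epochs=50):
    """Fit a linear classifier (weights w, bias b) with the perceptron rule."""
    w = [0.0] * len(embeddings[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:                      # misclassified: update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

def accuracy(w, b, embeddings, labels):
    """Fraction of points on the correct side of the linear boundary."""
    correct = 0
    for x, y in zip(embeddings, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += 1 if y * score > 0 else 0
    return correct / len(labels)

# Toy "embeddings" with labels +1 / -1 (placeholders, not real VLM features).
embeddings = [[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]]
labels = [1, 1, -1, -1]

w, b = train_perceptron(embeddings, labels)
linear_ceiling = accuracy(w, b, embeddings, labels)
```

In the framing of the abstract, a model's generative accuracy on the same task would then be compared against this linear-probe number to see whether it exceeds the ceiling.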

2512.11736 2026-06-18 cs.RO 版本更新

Bench-Push: Benchmarking Pushing-based Navigation and Manipulation Tasks for Mobile Robots

Ninghan Zhong, Steven Caro, Megnath Ramesh, Rishi Bhatnagar, Avraiem Iskandar, Stephen L. Smith

Affiliations: Institute for Robotics and Intelligent Machines, Georgia Institute of Technology; Department of Electrical and Computer Engineering, University of Waterloo; Department of Mechanical Engineering, University of Alberta

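AI summary: Presents Bench-Push, the first unified benchmark for pushing-based mobile robot navigation and manipulation, comprising multiple simulated environments, new evaluation metrics, and baseline implementations, to address how pushing tasks are evaluated in environments with movable obstacles.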

Comments Published in CRV 2026

Abstract

Mobile robots are increasingly deployed in cluttered environments with movable objects, posing challenges for traditional methods that prohibit interaction. In such settings, the mobile robot must go beyond traditional obstacle avoidance, leveraging pushing or nudging strategies to accomplish its goals. While research in pushing-based robotics is growing, evaluations rely on ad hoc setups, limiting reproducibility and cross-comparison. To address this, we present Bench-Push, the first unified benchmark for pushing-based mobile robot navigation and manipulation tasks. Bench-Push includes multiple components: 1) a comprehensive range of simulated environments that capture the fundamental challenges in pushing-based tasks, including navigating a maze with movable obstacles, autonomous ship navigation in ice-covered waters, box delivery, and area clearing, each with varying levels of complexity; 2) novel evaluation metrics to capture efficiency, interaction effort, and partial task completion; and 3) demonstrations using Bench-Push to evaluate example implementations of established baselines across environments. Bench-Push is open-sourced as a Python library with a modular design. The code, documentation, and trained models can be found at https://github.com/IvanIZ/BenchNPIN.