arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.06120 2026-06-05 cs.CV

Diff-CA: Separating Common and Salient Factors with Diffusion Models

Michaël Soumm, Alexandre Fournier Montgieux, Yunlong He, Pietro Gori, Alasdair Newson

Affiliations: INRIA at Univ. Grenoble Alpes; CEA List, Palaiseau; Télécom Paris, Institut Polytechnique de Paris

AI summary: Proposes a conditioning framework for diffusion models that uses weak supervision to decompose the image conditioning into common and salient factors, separating the two for contrastive analysis while preserving high-fidelity image generation.

Abstract

Contrastive Analysis aims to separate factors that are common between two data distributions from those that are salient to only one of them. Existing contrastive methods are based on generative models (e.g., VAEs or GANs) that often suffer from limited reconstruction and image quality, which hampers effective latent factor separation and limits their applicability to high-fidelity image generation and editing. We propose a novel conditioning framework for diffusion models that enables contrastive decomposition without compromising generation quality. We first train a prompt-free, image-conditioned diffusion model, and then learn to decompose the conditioning into a common and a salient factor, using weak supervision. We prove that the additive contrastive factorization, commonly assumed in prior work, is identifiable under mild conditions. This factorization enables targeted operations by swapping or interpolating only the salient factor.
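
The additive factorization and the swap and interpolation operations can be sketched in a few lines. This is a minimal illustration, assuming the conditioning vector is simply the sum of a common part and a salient part. The names `Condition`, `compose`, `swap_salient`, and `interpolate_salient` are hypothetical and not taken from the paper, and no diffusion model is involved.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vector = List[float]


@dataclass
class Condition:
    """Hypothetical container: conditioning split into two additive parts."""
    common: Vector   # factors shared by both data distributions
    salient: Vector  # factors specific to one distribution


def compose(c: Condition) -> Vector:
    """Additive factorization: full conditioning = common + salient."""
    return [a + b for a, b in zip(c.common, c.salient)]


def swap_salient(a: Condition, b: Condition) -> Tuple[Condition, Condition]:
    """Exchange only the salient factors; common factors stay in place."""
    return Condition(a.common, b.salient), Condition(b.common, a.salient)


def interpolate_salient(a: Condition, b: Condition, t: float) -> Condition:
    """Keep a's common factor and move its salient factor toward b's (t in [0, 1])."""
    mixed = [(1 - t) * x + t * y for x, y in zip(a.salient, b.salient)]
    return Condition(a.common, mixed)
```

In a real pipeline the composed vector would be passed to the image-conditioned diffusion model as its conditioning input; here it is only a list of floats.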

2605.03413 2026-06-05 cs.LG cs.AI

Learning to Theorize the World from Observation

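Doojin Baek, Gyubin Lee, Junyeob Baek, Hosung Lee, Sungjin Ahn

Affiliations: University of Washington

AI summary: Inspired by cognitive science, proposes the Learning-to-Theorize paradigm, in which the Neural Theorizer (NEO) model infers explicit explanatory theories from raw, non-textual observations and achieves explanation-driven generalization.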

Abstract

What does it mean to understand the world? Contemporary world models often operationalize understanding as accurate future prediction in latent or observation space. Developmental cognitive science, however, suggests a different view: human understanding emerges through the construction of internal theories of how the world works, even before mature language is acquired. Inspired by this theory-building view of cognition, we introduce Learning-to-Theorize, a learning paradigm for inferring explicit explanatory theories of the world from raw, non-textual observations. We instantiate this paradigm with the Neural Theorizer (NEO), a probabilistic neural model that induces latent programs as a learned Language of Thought and executes them through a shared transition model. In NEO, a theory is represented as an executable, compositional program whose learned primitives can be systematically recombined to explain novel phenomena. Experiments show that this formulation enables explanation-driven generalization, allowing observations to be understood in terms of the programs that generate them.

2606.06109 2026-06-05 cs.CL cs.AI

Harnessing Structural Context for Entity Alignment Foundation Models

Xingyu Chen, Yuanning Cui, Zequn Sun, Wei Hu

Affiliations: State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China; Nanjing University of Information Science and Technology, Nanjing, China; National Institute of Healthcare Data Science, Nanjing University, Nanjing, China

AI summary: Proposes the ContextEA framework, which strengthens the construction and use of structural context through a cross-KG interaction encoder and a structural calibration decoder, outperforms strong baselines on 29 datasets, and transfers better across KGs.

Abstract

Entity alignment (EA) aims to identify equivalent entities across heterogeneous knowledge graphs (KGs) and is a key component of knowledge fusion and cross-KG reasoning. The recent EA foundation model demonstrates that alignment knowledge, once pretrained, can be directly applied to diverse previously unseen KG pairs. However, it still underuses structural context in two places: cross-KG interaction is weak during encoding, and final candidate ranking still relies too heavily on coarse similarity. We address these limitations with ContextEA, an enhanced encoder-decoder framework for transferable EA. On the encoder side, we introduce a cross-KG interaction encoder that unifies the two KGs with anchor bridges and performs earlier relation-aware cross-graph propagation. On the decoder side, we introduce a structural calibration decoder that calibrates alignment scores with entity-level, neighborhood-level, relation-level, and anchor-aware structural evidence. This design strengthens both structural context construction and structural context exploitation while remaining lightweight. Experiments on 29 EA datasets in OpenEA, SRPRS, and DBP show consistent gains over strong transferable baselines. Notably, the pretrained ContextEA already surpasses the finetuned baselines on all three benchmark groups, demonstrating substantially stronger transfer to unseen KGs. These results suggest that explicitly harnessing structural context is an effective direction for improving EA foundation models.

2606.06104 2026-06-05 cs.LG

A Sliced-Wasserstein Framework on Correlation Matrices for EEG Decoding

Chen Hu, Rui Wang, Jiale Zhou, Jingjun Yi, Shaocheng Jin, Yidong Song, Yefeng Zheng

Affiliations: Westlake University; School of Artificial Intelligence and Computer Science; Jiangnan University; Sun Yat-sen University

AI summary: Proposes a Sliced-Wasserstein framework based on pullback Euclidean metrics, instantiates two sliced-Wasserstein discrepancies on correlation matrices, and builds a domain generalization method for EEG decoding; experiments on three datasets show improved generalization under distribution shift.

Comments
Accepted by KDD 2026
Abstract

Electroencephalography (EEG) offers noninvasive, millisecond-resolution recordings of neuronal activity and is widely used in neuroscience and healthcare. Many EEG decoding pipelines rely on covariance descriptors for their robustness to noise, but such representations are sensitive to channel-wise scaling. Recent studies have therefore advocated full-rank correlation matrices as a scale-invariant alternative for EEG decoding. In this paper, we propose a general framework for Sliced Wasserstein (SW) discrepancies on manifolds endowed with Pullback Euclidean Metrics (PEMs), termed Pullback Euclidean Metric Sliced Wasserstein (PEMSW). Within this framework, we instantiate two Correlation Sliced-Wasserstein (CorSW) discrepancies on the manifold of full-rank correlation matrices under two recently introduced correlation geometries, i.e., the Off-Log Metric (OLM) and Log-Scaled Metric (LSM). Building on CorSW, we further develop a domain generalization (DG) framework for EEG decoding. Experiments on three EEG datasets demonstrate improved generalization under distribution shifts, with low training overhead and no additional inference cost. The source code is available at https://github.com/ChenHu-ML/CorSW.
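
The general idea of a sliced-Wasserstein discrepancy under a pullback Euclidean metric can be sketched as follows: map each matrix to a flat vector space, then run the ordinary Euclidean sliced Wasserstein there. This is a rough sketch, not the authors' implementation (see their repository for that). It assumes the Off-Log map is the off-diagonal part of the matrix logarithm, keeps only the strict upper triangle (which rescales distances by a constant), and requires both sets to have the same number of matrices. The function names are hypothetical.

```python
import numpy as np


def off_log(corr: np.ndarray) -> np.ndarray:
    """Map a full-rank correlation matrix to the strict upper triangle of its matrix log."""
    eigvals, eigvecs = np.linalg.eigh(corr)
    log_c = (eigvecs * np.log(eigvals)) @ eigvecs.T
    rows, cols = np.triu_indices(corr.shape[0], k=1)
    return log_c[rows, cols]


def sliced_wasserstein(x: np.ndarray, y: np.ndarray, n_proj: int = 64, seed: int = 0) -> float:
    """Monte Carlo estimate of squared SW-2 between two equal-size point clouds (rows are points)."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_proj, x.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    # In one dimension, optimal transport matches sorted samples.
    px = np.sort(x @ dirs.T, axis=0)
    py = np.sort(y @ dirs.T, axis=0)
    return float(np.mean((px - py) ** 2))


def cor_sw(corrs_a, corrs_b, **kwargs) -> float:
    """Sliced Wasserstein between two sets of correlation matrices after the off-log map."""
    xa = np.stack([off_log(c) for c in corrs_a])
    xb = np.stack([off_log(c) for c in corrs_b])
    return sliced_wasserstein(xa, xb, **kwargs)
```

For a 2x2 correlation matrix with off-diagonal entry r, the off-log coordinate reduces to atanh(r), the Fisher z-transform, which gives a quick sanity check of the map.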

2606.06103 2026-06-05 cs.CV

MS-DKC: A Dataset Knowledge Card Framework for Designing and Adapting Medical Image Segmentation Models

Tariq M. Khan, Syed Saud Naqvi, Thantrira Porntaveetus, Hamid Alinejad-Rokny, Shahzaib Iqbal, Imran Razzak, Mohammad AU Khan

Affiliations: Center of Excellence in Precision Medicine and Digital Health, Faculty of Dentistry, Chulalongkorn University, Bangkok, Thailand; Department of Computer Engineering, COMSATS University Islamabad, Islamabad, Pakistan; School of Biomedical Engineering, UNSW, Sydney, NSW, Australia; Visiting Scholar (Collaborative Projects), Center of Excellence in Precision Medicine and Digital Health, Chulalongkorn University, Bangkok, Thailand; Department of Computing, Abasyn University Islamabad Campus (AUIC), Islamabad, Pakistan; Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates; College of Computer and Information Sciences, Prince Sultan University, Riyadh, Saudi Arabia

AI summary: Proposes the MS-DKC framework, which explicitly records dataset characteristics (such as foreground occupancy, morphology, and boundary ambiguity) and maps them to failure modes, design priors, and risk-aligned criteria to guide the design and adaptation of medical image segmentation models; dataset-conditioned design is validated on DRIVE, ISIC2018, and ACDC.

Abstract

Medical image segmentation is often framed as a search for stronger architectures, but this can obscure a more fundamental question: what does the dataset require from the model? In medical imaging, this requirement is shaped by foreground occupancy, morphology, boundary ambiguity, topology sensitivity, annotation quality, acquisition variation, and operating point. This paper introduces the Medical Segmentation Dataset Knowledge Card (MS-DKC), a framework for making these factors explicit. MS-DKC records dataset evidence through image/acquisition, morphology, supervision, context-dependence, and deployment-risk descriptors. These descriptors are mapped to failure modes, design priors, and risk-aligned criteria, making segmentation design more traceable than architecture-first comparison. We evaluate MS-DKC on DRIVE, ISIC2018, and ACDC, representing distinct regimes. DRIVE contains sparse, thin, branching vessels, favoring detail-preserving models, sensitivity-aware optimization, threshold analysis, and topology-aware metrics. DKC-TNet-v2 achieved Dice 0.8044 and IoU 0.6730 with 35103 parameters, while SA-UNetv2-DKC-AmbRef reached Dice 0.8141, IoU 0.6865, sensitivity 0.8265, specificity 0.9804, and AUC 0.9853. ISIC2018 involves compact but appearance-variable lesions; validation-constrained score-function selection on Att-Next-Topo/ATTNext produced MS-DKC-AttNextTopo-VCSF-NoAug with Dice 0.8872, IoU 0.8214, precision 0.9173, Boundary F1 0.4878, and ASSD 4.13, while plausible additions failed to improve the risk-aligned profile. ACDC provides a multi-class cardiac case, where MS-DKC recommends four-class softmax segmentation, class-balanced Dice/CE supervision, and class-wise surface evaluation. Overall, the results support dataset-conditioned design: different datasets require different priors, operating points, and evidence before a model can be judged appropriate.
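
The Dice and IoU figures quoted above are overlap scores between a predicted and a reference mask. The sketch below computes both for flattened binary masks; the function name and the empty-mask convention are assumptions. For a single pair of masks the two are tied by IoU = Dice / (2 - Dice). The reported pairs sit at or slightly above that value (Dice 0.8044 gives about 0.6728), which would be expected if scores are averaged per image, though the abstract does not state how they were averaged.

```python
from typing import Sequence, Tuple


def dice_and_iou(pred: Sequence[int], target: Sequence[int]) -> Tuple[float, float]:
    """Dice and IoU for two flattened binary masks (1 = foreground)."""
    tp = sum(1 for p, t in zip(pred, target) if p and t)      # true positives
    fp = sum(1 for p, t in zip(pred, target) if p and not t)  # false positives
    fn = sum(1 for p, t in zip(pred, target) if t and not p)  # false negatives
    if tp + fp + fn == 0:
        # Both masks are empty: treated here as perfect agreement (a convention).
        return 1.0, 1.0
    dice = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return dice, iou
```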

2606.06100 2026-06-05 cs.CV

HyperVis: Continuous Latent Visual Relational Graphs on the Lorentz Hyperboloid for Compositional Reasoning

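Moshiur Farazi, Sameera Ramasinghe, Mahbub Ahmed Turza, Shafin Rahman

Affiliations: Data Science and AI, University of Doha for Science and Technology, Qatar; Pluralis Research, Australia; Department of Electrical and Computer Engineering, North South University, Bangladesh

AI summary: To address the difficulty vision-language models have with inter-object relationships in compositional reasoning, proposes HyperVis, which computes a dense visual relation tensor, projects it onto the Lorentz hyperboloid, and reinforces hierarchy with spatial physics (IoA-driven entailment cones and exterior-angle repulsion). It acts as a training-time regularizer that improves generative VQA and as an inference-time relational encoder that improves discriminative compositional scoring.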

Abstract

Vision-Language Models (VLMs) struggle with compositional reasoning that requires understanding inter-object relationships. A natural remedy is to inject explicit scene graph triplets $\langle s, p, o \rangle$ from an off-the-shelf scene graph generator (SGG), but we show this backfires: discrete text labels collide with the continuous visual modality, degrading GQA accuracy from 60.38% to 58.86%. We propose HyperVis, which bypasses the SGG semantic bottleneck entirely. From $N$ class-agnostic region proposals, we compute a dense $O(N^2)$ visual relation tensor via spatially-biased cross-attention, project it onto a Lorentz hyperboloid, and enforce hierarchy through spatial physics, namely IoA-driven entailment cones and exterior-angle repulsion. We discover that HyperVis contributes in two complementary ways: (1) as a training-time regularizer, the hyperbolic relational losses shape LoRA representations that improve generative VQA (GQA 61.03% vs. 57.21% for LoRA fine-tuning without relational losses, recovering and surpassing the baseline); and (2) as an inference-time relational encoder, hyperbolic prefix tokens boost discriminative compositional scoring (SugarCrepe 79.94%, +6.25 pp over baseline). The learned curvature stabilises at $\kappa = 4.0$, an order of magnitude above prior hyperbolic VLMs where $\kappa$ typically collapses toward zero, indicating that continuous visual features genuinely require the exponential volume of strongly curved space. A controlled Euclidean ablation confirms this decomposition: the relational pipeline regularises LoRA comparably in flat space (GQA 60.81%), but the compositionality gain is specifically hyperbolic (SugarCrepe +4.58 pp over Euclidean), with entailment loss $\sim 6\times$ higher in Euclidean training. The code location is still to be announced (TBA).
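
The projection step can be illustrated with the exponential map at the origin of the Lorentz model, which is one common way to lift Euclidean features onto a hyperboloid. This sketch assumes the convention that the hyperboloid has curvature -kappa (points satisfy <x, x>_L = -1/kappa); whether HyperVis uses this exact map and convention is not stated in the abstract, and the entailment-cone and repulsion losses are not shown. Function names are hypothetical.

```python
import math
from typing import List


def exp_map_origin(v: List[float], kappa: float) -> List[float]:
    """Lift a Euclidean (tangent) vector onto the hyperboloid of curvature -kappa."""
    sqrt_k = math.sqrt(kappa)
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [1.0 / sqrt_k] + [0.0] * len(v)  # the origin of the hyperboloid
    time = math.cosh(sqrt_k * norm) / sqrt_k
    scale = math.sinh(sqrt_k * norm) / (sqrt_k * norm)
    return [time] + [scale * x for x in v]


def lorentz_inner(x: List[float], y: List[float]) -> float:
    """Lorentzian inner product: minus the time components plus the spatial dot product."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def lorentz_distance(x: List[float], y: List[float], kappa: float) -> float:
    """Geodesic distance between two points on the hyperboloid."""
    arg = max(1.0, -kappa * lorentz_inner(x, y))  # clamp against rounding below 1
    return math.acosh(arg) / math.sqrt(kappa)
```

With this map, the geodesic distance from the origin to a lifted point equals the Euclidean norm of the original vector, so feature magnitude directly encodes depth in the hierarchy.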

2606.06099 2026-06-05 cs.AI

CogManip: Benchmarking Manipulative Behavior in Multi-Turn Interactions with Large Language Model

Zeyang Yue, Chenfei Yan, Feifei Zhao, Haibo Tong, Mengwen Xu, Xiaozhen Wang, Erliang Lin, Yi Zeng

Affiliations: School of Artificial Intelligence, Beihang University; BrainCog AI Lab, CASIA; Gaoling School of AI, Renmin University of China; Beijing-AISI; Beijing Key Laboratory of Safe AI and Superalignment; School of Artificial Intelligence, UCAS; Huawei Technologies Co., Ltd.

AI summary: Proposes the CogManip benchmark, which evaluates 15 manipulation strategy risks across 1,000 multi-turn interaction scenarios, finds significant risk heterogeneity among frontier models, and highlights the importance of prompt-based defense engineering.

Abstract

Whether Large Language Models (LLMs) exhibit covert psychological manipulation in complex human-AI interactions has raised growing safety concerns. However, existing AI safety benchmarks remain largely restricted to explicit rule compliance and static prompts, failing to capture the dynamic and covert nature of manipulative strategies in multi-turn dialogues. We introduce CogManip, a comprehensive benchmark that evaluates 15 manipulation strategy risks across 1,000 multi-turn interaction scenarios, validated by human experts. A systematic evaluation of 13 representative models, including frontier models like GPT-5.4 and DeepSeek-V3.2, reveals significant risk heterogeneity and points to targeted directions for future defenses. Further analysis of objective function perturbation reveals that DeepSeek-V3.2's manipulation tactics are highly sensitive to both negative and benign system prompts, demonstrating the critical necessity of prompt-based defense engineering and implicit goal auditing. CogManip offers a robust instrument and perspective for auditing the implicit psychological influence and dynamic strategy selection of modern LLMs.

2606.06098 2026-06-05 cs.CL cs.LG

IR3DE: A Linear Router for Large Language Models

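Eros Fanì, Oğuzhan Ersoy

Affiliations: Gensyn

AI summary: Proposes IR3DE, a ridge-regression-based linear router that cheaply and quickly selects the most suitable domain-expert LLM for each prompt, outperforms baselines on reasoning tasks, and supports adding or removing expert models dynamically.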

Comments
Accepted at the ICML 2026 Workshop on Resource-Adaptive Foundation Model Inference
Abstract

Foundational Large Language Models (LLMs) demonstrate proficiency on a wide range of general tasks, and achieve remarkable results on various specialized tasks via domain-expert LLMs. With the ever-growing list of available LLMs, inference routers are being proposed to select the most appropriate LLM for each prompt. However, existing routing methods either optimize cost across weak-to-strong generalist LLMs or require substantial training to support domain-expertise routing. In this paper, we propose IR3DE, a Ridge Regression-based Router for Domain Experts that provides cheap and fast routing decisions for each prompt. We evaluate IR3DE in two Causal Language Modeling (CLM) settings where the tasks are next-token prediction for all domains, and one reasoning setting where each domain has its own distinct reasoning task. Despite being a linear router, IR3DE achieves performance comparable to the other baselines in both CLM settings and surpasses them in the reasoning setting, with a normalized performance of 98.4%. Moreover, IR3DE enables the addition or removal of domain experts without requiring the router to be retrained from scratch, allowing a dynamic set of LLMs to be served with minimal disruption to the router itself. Our code is available at: github.com/gensyn-ai/IR3DE.
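
One way a ridge-regression router could support adding and removing experts without retraining is to fit one weight vector per expert against a shared regularized Gram matrix, since the closed-form ridge solution decouples across output columns. The sketch below illustrates that idea only. The class `RidgeRouter`, the use of prompt embeddings as features, and the per-prompt expert scores as targets are all assumptions, not details taken from the IR3DE code.

```python
import numpy as np


class RidgeRouter:
    """Hypothetical linear router: one ridge-regression weight vector per expert."""

    def __init__(self, features: np.ndarray, lam: float = 1.0):
        # features: (n_prompts, dim) prompt embeddings from a calibration set.
        gram = features.T @ features + lam * np.eye(features.shape[1])
        # (X^T X + lam I)^{-1} X^T, computed once and shared by all experts.
        self.solver = np.linalg.solve(gram, features.T)
        self.weights = {}  # expert name -> (dim,) weight vector

    def add_expert(self, name: str, scores: np.ndarray) -> None:
        """Fit one expert from its per-prompt scores; other experts are untouched."""
        self.weights[name] = self.solver @ scores

    def remove_expert(self, name: str) -> None:
        """Drop an expert; no refit is needed for the remaining ones."""
        del self.weights[name]

    def route(self, embedding: np.ndarray) -> str:
        """Pick the expert with the highest predicted score for this prompt."""
        return max(self.weights, key=lambda n: float(embedding @ self.weights[n]))
```

Routing a prompt then costs one dot product per expert, which is what makes a linear router cheap at inference time.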

2606.06096 2026-06-05 cs.LG cs.AI cs.CL

OrderGrad: Optimizing Beyond the Mean with Order-Statistic Policy Gradient Estimation

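Paavo Parmas, Yongmin Kim, Kohsei Matsutani, Shota Takashiro, Soichiro Nishimori, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo

Affiliations: The University of Tokyo

AI summary: Proposes OrderGrad, a family of likelihood-ratio and reparameterization gradient estimators for order-statistic objectives, implemented as a reward transformation that gives a unified plug-and-play approach to risk-averse, robust, and exploratory learning.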

Abstract

Policy-gradient methods usually optimize expected return, but many real-world applications care about distributional properties of returns: tail risk, outlier robustness, or best-of-K discovery. We introduce OrderGrad, a family of likelihood-ratio and reparameterization gradient estimators for order-statistic objectives. OrderGrad optimizes finite-sample L-statistics, i.e., weighted averages of sorted rewards or costs, recovering objectives such as VaR, CVaR, trimmed means, medians, and top-m/best-of-K criteria by changing only the rank weights. For any fixed sample size and rank-weight vector, OrderGrad provides an unbiased gradient estimator for the corresponding order-statistic objective. The method is implemented as a simple reward transformation that can then be used in an otherwise standard policy-gradient or reparameterized update. We study the resulting estimator's variance behavior and evaluate it on tasks where mean optimization is mismatched to the deployment objective, including LLM math post-training and other tasks. OrderGrad provides a unified, plug-and-play route to risk-averse, robust, and exploratory learning. Code: https://github.com/paavo5/ordergrad
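
The point that different objectives arise "by changing only the rank weights" can be made concrete with a small L-statistic calculator. This sketch evaluates the objective only; it is not the OrderGrad gradient estimator, and the helper names and the particular finite-sample CVaR convention (uniform weight on the worst ceil(alpha * n) rewards) are assumptions.

```python
import math
from typing import List, Sequence


def l_statistic(rewards: Sequence[float], rank_weights: Sequence[float]) -> float:
    """Weighted average of rewards sorted in ascending order."""
    ordered = sorted(rewards)
    return sum(w * r for w, r in zip(rank_weights, ordered))


def mean_weights(n: int) -> List[float]:
    """Equal weight on every rank: the ordinary sample mean."""
    return [1.0 / n] * n


def cvar_weights(n: int, alpha: float) -> List[float]:
    """Uniform weight on the worst ceil(alpha * n) rewards (lower-tail CVaR)."""
    k = max(1, math.ceil(alpha * n))
    return [1.0 / k] * k + [0.0] * (n - k)


def trimmed_mean_weights(n: int, trim: int) -> List[float]:
    """Drop the `trim` smallest and `trim` largest rewards and average the rest."""
    kept = n - 2 * trim
    return [0.0] * trim + [1.0 / kept] * kept + [0.0] * trim


def best_of_weights(n: int) -> List[float]:
    """All weight on the largest reward in the batch."""
    return [0.0] * (n - 1) + [1.0]
```

For the batch `[3.0, -1.0, 10.0, 2.0]`, the same `l_statistic` call yields the mean, the lower-tail CVaR, a trimmed mean, or the batch maximum depending only on which weight vector is passed.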

2606.06090 2026-06-05 cs.AI

Beyond Semantic Organization: Memory as Execution State Management for Long-Horizon Agents

Yaoqi Chen, Haibin Lai, Yuru Feng, Chuyu Han, Qianxi Zhang, Baotong Lu, Menghao Li, Xinjiang Wang, Zhirui Wang, Shusen Xu, Zengzhong Li, Zewen Jin, Hao Wu, Cheng Li, Qi Chen

Affiliations: University of Science and Technology of China; Microsoft; Nanjing University; University of California, San Diego

AI summary: Observing that agents in long-horizon tasks depend on execution state rather than semantic similarity, proposes MAGE (Memory as Agent-Guided Exploration), which manages interactions in a hierarchical state tree to preserve state integrity and isolate errors; on MemoryArena it raises task success rate by 7.8 to 20.4 percentage points and cuts token consumption by 55.1%.

Comments
16 pages
Abstract

LLM-based agents increasingly tackle long-horizon tasks with interdependent decisions, where each action reshapes future constraints and intermediate errors can cascade. Existing RAG and agent memory systems organize histories by semantic similarity, retrieving content-relevant entries at decision time. We argue that this design mismatches execution-state dependencies: it fragments decision trajectories and mixes valid and erroneous traces, hindering coherent state reconstruction and error isolation. We propose MAGE (Memory as Agent-Guided Exploration), an active execution-state manager that stores interactions in a hierarchical state tree. The agent derives its state from the active root-to-current path, combining subgoal summaries, recent traces, and hints from prior branches. Four coupled operations maintain the tree: Grow records new traces, Compress summarizes completed subgoals, Maintain validates summaries, and Revise restores a target boundary and resumes on a new branch. This design bounds context growth while preserving state integrity and isolating flawed segments from the active path. Experiments on MemoryArena show that MAGE improves the average task success rate by 7.8 to 20.4 pp over baselines, while reducing token consumption by 55.1%.
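
A toy version of a hierarchical state tree with Grow, Compress, and Revise might look like the sketch below. Only the operation names come from the abstract; the node fields, method signatures, and the way context is assembled are assumptions made for illustration. Maintain (summary validation) and hints from prior branches are left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    goal: str
    parent: Optional["Node"] = None
    traces: List[str] = field(default_factory=list)
    summary: Optional[str] = None
    abandoned: bool = False
    children: List["Node"] = field(default_factory=list)


class StateTree:
    def __init__(self, task: str):
        self.root = Node(task)
        self.current = self.root

    def open_subgoal(self, goal: str) -> None:
        child = Node(goal, parent=self.current)
        self.current.children.append(child)
        self.current = child

    def grow(self, trace: str) -> None:
        """Grow: record a new trace under the active subgoal."""
        self.current.traces.append(trace)

    def compress(self, summary: str) -> None:
        """Compress: replace a finished subgoal's traces with a summary, return to its parent."""
        self.current.summary = summary
        self.current.traces.clear()
        self.current = self.current.parent or self.root

    def revise(self, boundary: Node, new_goal: str) -> None:
        """Revise: abandon everything below `boundary` on the active path, resume on a new branch."""
        node = self.current
        while node is not None and node is not boundary:
            node.abandoned = True
            node = node.parent
        self.current = boundary
        self.open_subgoal(new_goal)

    def context(self) -> List[str]:
        """Build the agent's state from the active root-to-current path only."""
        path, node = [], self.current
        while node is not None:
            path.append(node)
            node = node.parent
        lines: List[str] = []
        for n in reversed(path):
            lines.append(f"goal: {n.goal}")
            lines += [c.summary for c in n.children if c.summary and not c.abandoned]
            lines += n.traces
        return lines
```

Because `context` walks only the active path and skips abandoned children, traces from a flawed branch stay in the tree for inspection but never reach the prompt.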

2606.06087 2026-06-05 cs.CL cs.AI

LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents

Aofan Yu, Chenyu Zhou, Tianyi Xu, Zihan Guo, Rong Shan, Zhihui Fu, Jun Wang, Weiwen Liu, Yong Yu, Weinan Zhang, Jianghao Lin

Affiliations: Shanghai Jiao Tong University; Sun Yat-Sen University; Shanghai Innovation Institute; OPPO Research Institute

AI summary: Proposes the LatentSkill framework, which uses a pretrained hypernetwork to convert textual skills into plug-and-play LoRA adapters, storing skill knowledge in weight space rather than context space, which reduces prefill tokens and improves performance.

Comments
16 pages, 4 figures
Abstract

Agent systems increasingly use textual skills to encode reusable task procedures, but injecting these skills into the prompt at every step incurs substantial context overhead and exposes skill content as plaintext. We present LatentSkill, a framework that converts textual skills into plug-and-play LoRA adapters through a pretrained hypernetwork. LatentSkill stores skill knowledge in weight space rather than context space, removing per-step skill tokens while preserving modular loading, scaling, and composition. On ALFWorld and Search-QA, LatentSkill outperforms the corresponding in-context skill baseline while using substantially fewer prefill tokens: it improves ALFWorld success by 21.4 and 13.4 points on the seen and unseen splits with 64.1% fewer prefill tokens, and improves Search-QA exact match by 3.0 points with 72.2% lower skill-token overhead. Further analysis shows that generated skill LoRAs form a structured semantic geometry, can be precisely controlled via the LoRA scaling coefficient, and can be composed through parameter-space arithmetic when skill components are aligned. These findings suggest that weight-space skills provide an efficient, modular, and less exposed substrate for extending LLM agents.
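
Scaling and composing LoRA adapters in weight space can be sketched with plain matrix arithmetic. The example assumes the standard LoRA form W + s * B A and assumes that composition means summing scaled low-rank updates; the abstract says only "parameter-space arithmetic", so the exact rule used by LatentSkill may differ. The hypernetwork that generates the adapters is not shown, and the function names are hypothetical.

```python
import numpy as np


def lora_delta(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Low-rank update B @ A, with A of shape (r, d_in) and B of shape (d_out, r)."""
    return b @ a


def apply_skills(w: np.ndarray, skills, scales) -> np.ndarray:
    """Return W + sum_i scale_i * B_i @ A_i, leaving the base weight W unmodified."""
    out = w.copy()
    for (a, b), s in zip(skills, scales):
        out += s * lora_delta(a, b)
    return out
```

Setting a scale to zero unloads a skill, and intermediate values dial its influence up or down, which is the kind of control over the LoRA scaling coefficient the abstract describes.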

2606.06081 2026-06-05 cs.AI cs.HC

A Framework for Measuring Appropriate Reliance on Set-Valued AI Advice

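Ranjan Mishra, Jakob Schoeffer

Affiliations: University of California, Berkeley; ETH Zurich

AI summary: Proposes the first formal framework for measuring appropriate reliance on set-valued AI advice within the sequential judge-advisor paradigm, covering classification and regression tasks, and defines new metrics that capture nuances existing methods overlook.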

Abstract

Appropriate reliance on AI advice has become a central research theme in human-AI collaboration. Existing frameworks have focused exclusively on point predictions as AI advice. However, set-valued AI advice (e.g., discrete sets or continuous intervals) is increasingly being used to communicate uncertainty and improve human decision making. In this paper, we develop the first formal framework for measuring appropriate reliance on set-valued AI advice within the sequential judge-advisor paradigm, spanning both classification and regression tasks. For classification, we first introduce the dimensions that are necessary for evaluating set-valued AI advice. We then define two metrics: correct reliance rate on AI and correct reliance rate on self, which jointly characterize appropriate reliance in this setting. For regression, we introduce quantity of AI reliance and quality of AI reliance, which respectively measure whether a decision maker utilized the AI advice and whether their reliance helped them get closer to the ground truth relative to their initial estimate. Through the application of our framework, we demonstrate how these metrics capture important nuances in human-AI collaboration that existing measures overlook.

2606.06080 2026-06-05 cs.LG cs.AI cs.CL

On Advantage Estimates for Max@K Policy Gradients

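Shota Takashiro, Soichiro Nishimori, Paavo Parmas, Yongmin Kim, Kohsei Matsutani, Gouki Minegishi, Yusuke Iwasawa, Takeshi Kojima, Yutaka Matsuo

Affiliations: The University of Tokyo

AI summary: To address the difficulty of post-training reasoning models under sparse rewards, proposes MaxPO, a new advantage estimation method whose Leave-Two-Out baseline yields centered advantages, reduces gradient variance, and improves performance.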

Abstract

Reinforcement learning with verifiable rewards is widely used for post-training reasoning models, but sparse outcome rewards make exploration difficult. A complementary approach is to optimize inference-time objectives such as pass@K and max@K directly, yet existing policy-gradient estimators for these objectives use different signals, baselines, and normalizations, making their relationships unclear. We study this issue through baseline design and advantage centering. Starting from the advantage estimator of a leading method in the field, we show that it is policy-gradient unbiased but yields a non-centered advantage. We then introduce a Leave-Two-Out baseline that preserves policy-gradient unbiasedness while making realized batch advantages exactly centered. The resulting method, MaxPO, has an efficient quadratic-time implementation and integrates naturally into group-based RL for LLM post-training. We further derive the canonical finite-batch advantage for max@K, providing a unified view of existing advantage estimators. Empirically, we verify that the L2O baseline reduces gradient variance and outperforms non-centered alternatives.
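
For readers unfamiliar with the max@K objective, the sketch below estimates it from a batch of n sampled rewards by averaging the maximum over all size-K subsets, using a closed form over sorted rewards and a brute-force check. This is the standard combinatorial estimator of the expected best-of-K reward; it is not the MaxPO advantage or the Leave-Two-Out baseline, whose formulas the abstract does not give. With 0/1 rewards the same quantity reduces to pass@K.

```python
from itertools import combinations
from math import comb
from typing import Sequence


def max_at_k(rewards: Sequence[float], k: int) -> float:
    """Average of the maximum over all size-k subsets, via sorted rewards."""
    n = len(rewards)
    ordered = sorted(rewards)
    # The reward at 0-indexed rank i has i rewards below it, so it is the
    # maximum of comb(i, k - 1) of the comb(n, k) subsets.
    return sum(comb(i, k - 1) * r for i, r in enumerate(ordered)) / comb(n, k)


def max_at_k_bruteforce(rewards: Sequence[float], k: int) -> float:
    """Same quantity by explicit enumeration; only practical for small n."""
    subsets = list(combinations(rewards, k))
    return sum(max(s) for s in subsets) / len(subsets)
```

The closed form needs one sort and n binomial coefficients instead of enumerating every subset, which matters once the group size grows beyond a handful of samples.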

2606.06079 2026-06-05 cs.CL

SkillComposer: Learning to Evolve Agent Skills for Specification and Generalization

Qi Zhang, Zhaopeng Feng, Xiaonan Shi, Xiaomeng Hu, Chu Liu, Pengjun Xie, Xiaobin Wang, Jieping Ye, Bryan Hooi, Haobo Wang, Junbo Zhao

Affiliations: Zhejiang University; Tongyi Lab; National University of Singapore

AI summary: Proposes the SkillComposer framework, in which three learnable operations (create, improve, and merge) let language models self-evolve skills at inference time; it supports offline, online, and hybrid deployment modes and improves performance on several benchmarks.

Comments
Under Review
Abstract

Agent skills, which consist of reusable strategies that guide agent reasoning and action, have shown strong potential for improving model capability at inference time. However, current skill construction methods treat the problem as one-shot extraction, overlooking a fundamental tension: a skill tailored to a specific task fails to transfer, while an abstracted skill often provides insufficient guidance. We attribute this fragility to the absence of explicit mechanisms for skill specification and generalization. To address this gap, we introduce SkillComposer, a framework that decomposes skill construction into three learnable operations: create, improve, and merge. Trained via a systematic rejection-sampling recipe, SkillComposer enables language models to self-evolve skills at inference time and supports three deployment modes: offline for building generalized libraries, online for task-specific refinement, and hybrid for combining both. Comprehensive experiments on $τ^2$-Bench, LiveCodeBench v6, and AppWorld show that SkillComposer consistently outperforms baselines. Our SkillComposer-4B improves a 27B executor by up to +4.5 on agent tasks and +3.4 on code tasks, while generalizing across domains and task types unseen during training. Analysis reveals that merge and improve address orthogonal quality dimensions and that skill composition is a transferable meta-ability, providing a practical recipe for skill-augmented inference.

2606.06078 2026-06-05 cs.CV

Knowledge Distillation for Visual Autoregressive Models

视觉自回归模型的知识蒸馏

Elia Peruzzo, Aritra Bhowmik, Guillaume Sautiere, Yuki M Asano, Amirhossein Habibian

发表机构 * Qualcomm AI Research(高通人工智能研究) University of Technology Nuremberg(纽伦堡技术大学)

AI总结 针对视觉自回归模型计算开销大的问题,提出VarKD蒸馏框架,通过选择性教师监督和减少令牌级歧义,在ImageNet上多个AR骨干网络中优于现有蒸馏方法。

详情
AI中文摘要

自回归图像生成模型具有高表达能力但计算密集,因此需要有效的模型压缩。知识蒸馏是模型压缩的自然方法,已在语言建模中得到广泛研究,但其在视觉自回归生成中的行为尚未充分探索。在这项工作中,我们首次系统研究了AR图像模型的蒸馏策略。我们的分析表明,虽然标准蒸馏可以带来有意义的收益,但最近为语言开发的方法不能直接迁移到图像:长解码视野和视觉令牌歧义使得教师监督不可靠,尤其是在学生条件下的上下文中。为了解决这个问题,我们提出了VarKD,一个针对视觉自回归模型的蒸馏框架,它在学生样本上进行蒸馏,同时选择性应用教师监督并减少令牌级歧义。在ImageNet上多个AR骨干网络上的实验表明,VarKD始终优于先前的蒸馏基线,缩小了与大规模模型的差距。

英文摘要

Autoregressive (AR) image generation models are highly expressive but computationally intensive, motivating effective model compression. Knowledge distillation (KD) is a natural approach for model compression and has been widely studied in language modeling, yet its behavior in visual AR generation remains underexplored. In this work, we present the first systematic study of distillation strategies for AR image models. Our analysis shows that while standard distillation can yield meaningful gains, recent methods developed for language do not directly transfer to images: long decoding horizons and visual token ambiguity make teacher supervision unreliable, especially under student-conditioned contexts. To address this, we propose VarKD, a distillation framework for visual autoregressive models that distills on student samples while selectively applying teacher supervision and reducing token-level ambiguity. Experiments on ImageNet across multiple AR backbones show that VarKD consistently outperforms prior distillation baselines, narrowing the gap to large-scale models.

2606.06077 2026-06-05 cs.RO cs.LG

3D Underwater Path Planning via Generative Flow Field Surrogates

基于生成流场代理的三维水下路径规划

Zachary Cooper-Baldock, Paulo E. Santos, Russell S. A. Brinkworth, Karl Sammut

发表机构 * Flinders University(弗林德斯大学)

AI总结 针对自主水下航行器回收过程中复杂三维螺旋桨尾流的高成本CFD仿真问题,提出用条件生成对抗网络(cGAN)作为替代,结合能量加权A*路径规划,实现快速且有效的路径规划。

详情
Comments
41 pages, 5 figures, 11 tables
AI中文摘要

自主水下航行器(AUV)从行进中的母船船体发射和回收(LAR)需要穿越复杂的三维螺旋桨尾流,其水动力学结构无法用均匀流模型表征。高保真雷诺平均Navier-Stokes(RANS)计算流体动力学(CFD)仿真能够以足够精度解析该结构以用于路径规划,但其计算成本使其无法在机载使用。我们通过集成两种条件生成对抗网络(cGAN)架构——正则化PatchGAN和带有自注意力的2D3DGAN——作为三维能量加权A*路径规划框架中RANS CFD数据的即插即用替代方案来填补这一空白。两个生成器均由一个分层流水线驱动,该流水线仅从标量操作条件输入合成完整的$128^3$体素流场体积,端到端推理时间约为28-146微秒,而单个RANS计算则需要数小时。我们在550种不同流动条件下的19,800条独立生成轨迹上对所有四种环境知识水平(均匀流、真实CFD、PatchGAN和2D3DGAN SA)进行了基准测试。与均匀流规划相比,完整的CFD尾流知识使能量消耗降低5.7-12.5%,高速尾流核心遭遇减少高达77.8%,且两种优势随操作严重程度增加而扩大。cGAN代理在推理速度与边缘设备兼容的情况下,恢复了约45-60%的CFD能量收益和高速单元规避收益。这些结果首次系统量化了cGAN预测水动力场在三维海洋机器人应用中的下游路径规划价值。

英文摘要

Autonomous underwater vehicle (AUV) launch and recovery (LAR) into the hull of an advancing host platform requires traversal of a complex, three-dimensional propeller wake whose hydrodynamic structure cannot be characterised by a uniform current model. High-fidelity Reynolds-Averaged Navier-Stokes (RANS) Computational Fluid Dynamics (CFD) simulations resolve this structure with sufficient accuracy for path planning, but their computational cost renders them impractical for onboard use. We address this gap by integrating two conditional generative adversarial network (cGAN) architectures -- a regularised PatchGAN and a 2D3DGAN with self-attention -- as drop-in replacements for RANS CFD data within a three-dimensional, energy-weighted A* path planning framework. Both generators are driven by a hierarchical pipeline that synthesises full $128^3$ voxel flow field volumes from scalar operating condition inputs alone, with end-to-end inference times of approximately 28-146 $μ$s, compared to hours for a single RANS computation. We benchmark all four environmental knowledge levels (uniform current, ground-truth CFD, PatchGAN, and 2D3DGAN SA) across 19,800 independently generated trajectories spanning 550 distinct flow conditions. Full CFD wake knowledge reduces energy expenditure by 5.7-12.5% and high-velocity wake-core encounters by up to 77.8% relative to uniform-current planning, with both benefits scaling with operating severity. The cGAN surrogates recover approximately 45-60% of the CFD energy benefit and high-velocity cell avoidance benefit while operating at inference speeds compatible with edge device use. These results provide the first systematic quantification of the downstream path planning value of cGAN-predicted hydrodynamic fields in a three-dimensional maritime robotics application.

2606.06074 2026-06-05 cs.CV

VZCrash: A Large-Scale IMU Dataset of Ego-Vehicle Crashes

VZCrash:大规模自车碰撞IMU数据集

Tommaso Bianconcini, Henrique Piñeiro Monteagudo, Aurel Pjetri, Tomaso Trinci, Leonardo Taccari

发表机构 * Verizon Connect

AI总结 提出VZCrash,目前最大的真实车辆碰撞IMU数据集,包含超过31,000个验证碰撞和158,000个负样本,并基于该数据集对多种碰撞检测方法进行了基准测试和规模效应分析。

详情
Comments
Accepted at the 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC 2026). VZCrash is publicly available at this URL: https://huggingface.co/datasets/vzc-research-chapter/VZCrash
AI中文摘要

我们介绍了VZCrash,这是目前最大的公开真实车辆碰撞数据集,包含惯性测量单元(IMU)遥测数据。该数据集包含超过31,000个经过验证的碰撞事件和158,000个负样本,包括困难案例和干扰项。每个样本包含100 Hz的加速度和角速度,以及1 Hz的GPS速度。VZCrash中的事件由安装在美国各地行驶的73,010辆不同尺寸商用车辆上的设备捕获,时间跨度数年。我们还利用该数据集的规模进行了广泛的实验研究。首先,我们对从简单的基于阈值的启发式方法到最先进的深度学习模型等多种方法进行了基准测试。然后,我们进行了一项实验,证明了数据规模对于训练高质量碰撞检测模型的重要性,并表明当这些模型需要部署到真实环境中时,规模尤其重要。

英文摘要

We introduce VZCrash, the largest publicly available dataset of real-world vehicle collision data featuring Inertial Measurement Unit (IMU) telemetry. The dataset contains more than 31,000 validated crashes and 158,000 negative samples, including hard cases and distractors. Each sample includes acceleration and angular velocity at 100 Hz, and GPS speed at 1 Hz. Events in VZCrash were captured by devices installed on a fleet of 73,010 commercial vehicles of different sizes driving in the United States over the span of several years. We also present an extensive experimental study enabled by the volume of the dataset. We first benchmark several different approaches, from a simple threshold-based heuristic to state-of-the-art deep learning models. Then, we present an experiment demonstrating the importance of scaling data to train high-quality crash detection models, and we show that scale is especially important when these models need to be deployed into a real-world environment.

2606.06066 2026-06-05 cs.CV cs.GR

FontFusion: Enhancing Generative Text in Diffusion Models with Typographic Conditioning

FontFusion: 通过排版条件增强扩散模型中的生成文本

Marian Lupascu, Nipun Jindal, Ionut Mironica, Zhaowen Wang

发表机构 * Adobe Research(Adobe研究院) Department of Computer Science, University of Bucharest(布加勒斯特大学计算机科学系)

AI总结 提出FontFusion框架,通过层次化token表示、位置感知嵌入和多级token丢弃策略,在扩散Transformer中实现精确字体控制与文本可读性的平衡,显著提升排版保真度。

详情
Comments
12 pages, 8 figures, accepted at ICANN 2026
AI中文摘要

扩散模型中的排版生成面临持续的权衡:精确的字体控制通常会降低文本可读性,而保持可读性往往牺牲排版保真度。我们提出FontFusion,一种用于扩散Transformer(DiT)架构的即插即用条件框架,通过三个核心创新解决了这一困境:(1)层次化token表示,在多个粒度上建立明确的文本-字体关系;(2)位置感知嵌入,在排版和图像内容之间创建空间绑定;(3)多级token丢弃策略,提高计算效率和对未见字体的泛化能力。我们对字体嵌入空间的系统评估表明,结合DeepFont和DINOv2的双编码器在排版任务上优于任何单一编码器。FontFusion在挑战性装饰字体上相比单编码器基线实现了76%的相对改进,相比无条件模型字体一致性增益超过约68-76%,同时无需重新训练即可集成到现有DiT架构中。

英文摘要

Typography generation in diffusion models faces a persistent trade-off: enabling precise font control typically degrades text legibility, while maintaining readability often sacrifices typographic fidelity. We present FontFusion, a plug-and-play conditioning framework for Diffusion Transformer (DiT) architectures that resolves this dilemma through three core innovations: (1) a hierarchical token representation establishing explicit text-font relationships at multiple granularities, (2) position-aware embeddings creating spatial bindings between typography and image content, and (3) a multi-level token dropping strategy improving both computational efficiency and generalization to unseen fonts. Our systematic evaluation of font embedding spaces reveals that a dual encoder combining DeepFont and DINOv2 outperforms any single encoder for typography tasks. FontFusion demonstrates 76% relative improvement on challenging decorative fonts over single-encoder baselines and font consistency gains exceeding approximately 68-76% over unconditioned models, while integrating into existing DiT architectures without retraining.

2606.06061 2026-06-05 cs.RO

A Conversational Framework for Human-Robot Collaborative Manipulation with Distributed Generative AI models

基于分布式生成式AI模型的人机协作操作对话框架

Arash Ghasemzadeh Kakroudi, Roel Pieters

发表机构 * Automation Technology and Mechanical Engineering, Tampere University(坦佩雷大学自动化技术与机械工程)

AI总结 提出一个分布式对话框架,集成语言和视觉语言模型与ROS 2执行栈,实现从自由形式用户命令生成结构化操作请求,并通过视觉基础将图像空间目标转换为机器人框架目标,实验验证了端到端任务可靠性和延迟。

详情
Comments
Accepted to the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026). The final published version will appear under the title "A Distributed Conversational Framework for Human-Robot Collaborative Manipulation Using Local LLMs and VLMs"
AI中文摘要

本文提出了一种用于人机协作操作的分布式对话框架,该框架将本地语言和视觉语言模型(VLM)与基于机器人操作系统2(ROS 2)的执行栈集成在一起。语言理解、视觉基础、编排和运动执行作为独立的ROS 2节点运行,能够在保持响应控制循环的同时,跨分布式硬件灵活部署。该系统从自由形式的用户命令中生成拾取、放置和交接的结构化动作请求。它使用VLM返回图像空间目标,并通过深度和校准将其转换为度量机器人框架目标。一个Web仪表板显示中间意图和基础叠加(像素、深度和机器人框架),并在执行任何运动之前需要操作员明确确认。在Franka FR3平台上的实验评估了在不断增加的工作台场景模糊性下的端到端任务可靠性和延迟,并比较了同一流水线中替代的LLM/VLM配置。代码和完整文档可在[github.com/cogrob-tuni/franka-llm](https://github.com/cogrob-tuni/franka-llm)获取。

英文摘要

This paper presents a distributed conversational framework for human-robot collaborative manipulation that integrates local language and vision-language models (VLMs) with a Robot Operating System 2 (ROS 2)-based execution stack. Language understanding, visual grounding, orchestration, and motion execution run as separate ROS 2 nodes, enabling flexible deployment across distributed hardware while maintaining a responsive control loop. From free-form user commands, the system generates structured action requests for pick, place, and handover. It uses a VLM to return image-space targets, which are converted into metric robot-frame goals using depth and calibration. A web dashboard exposes intermediate intent and grounding overlays (pixel, depth, and robot-frame) and requires explicit operator confirmation before any motion is executed. Experiments on a Franka FR3 platform evaluate end-to-end task reliability and latency under increasing working table scene ambiguity and compare alternative LLM/VLM configurations in the same pipeline. Code and full documentation are available at [github.com/cogrob-tuni/franka-llm](https://github.com/cogrob-tuni/franka-llm).
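The step that converts image-space targets into metric robot-frame goals "using depth and calibration" can be illustrated with a standard pinhole back-projection followed by a rigid transform. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `pixel_to_robot_frame`, the intrinsics (`fx`, `fy`, `cx`, `cy`), and the example extrinsic matrix are all hypothetical, and a pinhole camera without lens distortion is assumed.

```python
def pixel_to_robot_frame(u, v, depth_m, fx, fy, cx, cy, T_robot_camera):
    """Back-project a pixel with metric depth and map it into the robot base frame.

    Assumes a pinhole camera (no distortion) and a 4x4 homogeneous
    camera-to-robot transform given as nested lists.
    """
    # Pinhole back-projection: pixel (u, v) at depth Z -> camera-frame point (X, Y, Z).
    x_cam = (u - cx) * depth_m / fx
    y_cam = (v - cy) * depth_m / fy
    p_cam = (x_cam, y_cam, depth_m, 1.0)  # homogeneous coordinates
    # Apply the first three rows of the extrinsic transform.
    return tuple(
        sum(T_robot_camera[i][j] * p_cam[j] for j in range(4)) for i in range(3)
    )


# Hypothetical calibration: camera axes aligned with the robot base,
# camera origin offset by 0.5 m in x and 0.2 m in z.
T_ROBOT_CAMERA = [
    [1.0, 0.0, 0.0, 0.5],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.2],
    [0.0, 0.0, 0.0, 1.0],
]

goal = pixel_to_robot_frame(
    u=380, v=240, depth_m=1.0,
    fx=600.0, fy=600.0, cx=320.0, cy=240.0,
    T_robot_camera=T_ROBOT_CAMERA,
)
print(goal)
```

In a real system the extrinsic transform would come from hand-eye calibration and the depth value from an aligned depth image; both are outside the scope of this sketch.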

2606.06060 2026-06-05 cs.CV

ReCache: Learning Budget-Aware Caching Schedules for Diffusion Models via REINFORCE

ReCache: 通过REINFORCE学习扩散模型的预算感知缓存调度

Mishan Aliev, Eva Neudachina, Ilya Bykov, Aleksandr Oganov, Kirill Struminsky, Aibek Alanov, Denis Rakitin

发表机构 * HSE University(俄罗斯高等经济学院) Yandex Research(Yandex研究院)

AI总结 提出ReCache,利用策略梯度学习在给定计算预算下最大化生成质量的去噪步骤重计算调度,无需标注数据且兼容多种缓存机制。

详情
AI中文摘要

现代扩散模型生成高质量图像和视频,但其迭代去噪过程导致推理成本高昂。特征缓存通过重用或预测相邻去噪步骤的中间激活来加速采样,利用沿反向轨迹的计算冗余。本文关注缓存调度:选择哪些去噪步骤应完全重计算。现有调度要么是固定的(如均匀),要么根据每步误差启发式自适应选择;这两种情况下,实际计算成本是手动调整阈值的副作用,而非用户可指定的量。我们提出ReCache,它反转了这一过程:给定目标预算k,学习最大化生成质量的重计算调度,将计算变为可直接控制的输入。ReCache通过策略梯度训练,避开了通过完整扩散推理的反向传播,且不使用任何标注数据。来自无缓存推理的生成作为匹配目标,并配以生成质量的奖励。ReCache兼容任何缓存机制,包括特征重用和特征预测;对于每种机制,单个训练好的策略在推理时适应不同计算预算。ReCache持续优于调度基线:在FLUX上减少$\times5.04$ FLOPs时,与DiCache相比,LPIPS降低31%(从0.456降至0.316);在Wan 2.1上实现$\sim\times2.6$加速时,与均匀HiCache相比,LPIPS降低65%(从0.480降至0.169),VBench分数提升7%(5.6分,从70.4升至76.0)。代码见https://github.com/thecrazymage/ReCache。

英文摘要

Modern diffusion models generate high-quality images and videos, but their iterative denoising process makes inference expensive. Feature caching accelerates sampling by reusing or predicting intermediate activations across neighboring denoising steps, exploiting the redundancy of computations along the reverse trajectory. In this work, we focus on the caching schedule: selecting which denoising steps should be fully recomputed. Existing schedules are either fixed (e.g. uniform) or chosen adaptively from per-step error heuristics; in both cases, the actual compute cost is a side-effect of hand-tuned thresholds rather than a quantity the user can specify. We propose ReCache, which inverts this: given a target budget k, it learns the recomputation schedule that maximizes generation quality, turning compute into a directly controllable input. ReCache trains via policy gradients, sidestepping backpropagation through full diffusion inference, and uses no labelled data. Generations from uncached inference serve as matching targets, paired with a reward for generation quality. ReCache is compatible with any caching mechanism, including feature reuse and feature forecasting; for each mechanism, a single trained policy adapts across computational budgets at inference time. ReCache consistently outperforms scheduling baselines: under a $\times5.04$ FLOPs reduction on FLUX, it reduces LPIPS by 31% (from 0.456 to 0.316) compared to DiCache; on Wan 2.1 at a $\sim \times2.6$ speedup, it drops LPIPS by 65% (from 0.480 to 0.169) and boosts the VBench score by 7% (5.6 points, from 70.4 to 76.0) over uniform HiCache. Code is available at https://github.com/thecrazymage/ReCache.
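To make the notion of a caching schedule under a budget k concrete, the sketch below builds the fixed uniform baseline mentioned in the abstract and runs a toy denoising loop that reuses the last computed feature on skipped steps. It does not implement ReCache's learned policy; `uniform_schedule`, `run_with_cache`, and the stand-in `compute` callable are hypothetical names chosen for illustration.

```python
def uniform_schedule(num_steps, budget):
    """Pick `budget` evenly spaced steps (always including step 0) to fully recompute."""
    if not 1 <= budget <= num_steps:
        raise ValueError("budget must be between 1 and num_steps")
    # Integer floor keeps the chosen steps distinct because the spacing is >= 1.
    return [(i * num_steps) // budget for i in range(budget)]


def run_with_cache(num_steps, recompute_steps, compute):
    """Toy sampling loop: recompute on scheduled steps, otherwise reuse the cache."""
    recompute = set(recompute_steps)
    cached, full_calls, outputs = None, 0, []
    for t in range(num_steps):
        if t in recompute or cached is None:
            cached = compute(t)  # expensive full forward pass in a real model
            full_calls += 1
        outputs.append(cached)   # feature reuse on every skipped step
    return outputs, full_calls


schedule = uniform_schedule(num_steps=10, budget=4)
outputs, full_calls = run_with_cache(10, schedule, compute=lambda t: t)
print(schedule, full_calls)
```

Here the number of full computations equals the budget by construction, which is the property the abstract describes as making compute a directly controllable input; a learned scheduler would replace `uniform_schedule` with a policy that chooses which k steps to recompute.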

2606.06058 2026-06-05 cs.LG cs.AI cs.CL

MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following

MDP-GRPO:面向多约束指令跟随的稳定化组相对策略优化

Mohammad Mahdi Salmani-Zarchi, Zahra Rahimi, Heshaam Faili, Mohammad Javad Dousti

发表机构 * Department of Electrical and Computer Engineering, College of Engineering, University of Tehran(德黑兰大学电气与计算机工程系,工程学院) Department of Statistics, Mathematics and Computer Science, Allameh Tabataba’i University(阿拉梅·塔巴塔巴伊大学统计、数学与计算机科学系)

AI总结 针对标准GRPO在离散低分散奖励下的不稳定性,提出MDP-GRPO,通过多温度采样、双锚优势、前景理论整形和非对称KL正则化,在FollowBench等数据集上提升严格约束满足率最高5.0%。

详情
Comments
Accepted to ACL 2026 Main Conference. 14 pages, 9 figures
AI中文摘要

可验证奖励的强化学习非常适合多约束指令跟随,但标准组相对策略优化(GRPO)在离散、低分散奖励下变得不稳定,此时组内奖励分布常常同质。我们识别并形式化了在此场景下z-score组归一化的三种病理:低方差放大、均值中心盲视和零方差崩溃。为解决这些问题,我们提出MDP-GRPO,通过以下方式稳定学习:(1)多温度采样以增加奖励分散度,(2)双锚优势以恢复同质组中的梯度并阻止均值中心盲视,(3)基于Kahneman和Tversky理论的前景理论整形以限制更新并惩罚违规,以及(4)非对称KL正则化。在FollowBench、IFEval和一个精心策划的多约束数据集上评估,MDP-GRPO优于标准GRPO,在Llama-3.2-3B上将严格约束满足率提高了最多5.0%。我们的方法还能够在保持MMLU和ARC上通用能力的同时,在较小组规模下实现稳定收敛。

英文摘要

Reinforcement learning with verifiable rewards is ideal for multi-constraint instruction following, yet standard group-relative policy optimization (GRPO) becomes unstable under discrete, low-dispersion rewards, where within-group reward distributions are frequently homogeneous. We identify and formalize three pathologies of z-score group normalization in this regime: low-variance amplification, mean-centering blindness, and zero-variance collapse. To address them, we propose MDP-GRPO, which stabilizes learning through (1) multi-temperature sampling to increase reward dispersion, (2) dual-anchor advantages to restore gradients in homogeneous groups and stop mean-centering blindness, (3) prospect-theoretic shaping to bound updates and penalize violations based on Kahneman and Tversky's theory, and (4) asymmetric KL regularization. Evaluated on FollowBench, IFEval, and a curated multi-constraint dataset, MDP-GRPO outperforms standard GRPO, improving strict constraint satisfaction by up to 5.0% on Llama-3.2-3B. Our method also enables stable convergence with small group sizes while preserving general capabilities on MMLU and ARC.
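The instability described above can be reproduced with the standard z-score group normalization, advantage = (reward - group mean) / (group std + eps). The sketch below is one plausible reading of the three named pathologies, not the authors' code; the function name `zscore_advantages` and the value of `eps` are assumptions.

```python
import statistics


def zscore_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps) within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Zero-variance collapse: a homogeneous group yields all-zero advantages,
# so the group contributes no gradient signal.
all_pass = zscore_advantages([1.0, 1.0, 1.0, 1.0])

# Mean-centering blindness: a group that fails every constraint is
# indistinguishable from one that satisfies every constraint.
all_fail = zscore_advantages([0.0, 0.0, 0.0, 0.0])

# Low-variance amplification: a 0.1 reward gap is scaled to a large advantage.
near_homogeneous = zscore_advantages([1.0, 1.0, 1.0, 0.9])

print(all_pass, all_fail, near_homogeneous)
```

The abstract's remedies (multi-temperature sampling, dual-anchor advantages, prospect-theoretic shaping, asymmetric KL regularization) target these cases, but their exact formulas are not given in the abstract and are therefore not sketched here.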

2606.06055 2026-06-05 cs.AI

When Should Memory Stay Silent: Measuring Memory-Use Boundaries in Memory-Augmented Conversational Agents

记忆何时应保持沉默:衡量记忆增强型对话代理的记忆使用边界

Lingxiang Xu, Jiaoyun Yang, Min Hu, Hongtu Chen, Ning An

发表机构 * Hefei University of Technology(合肥工业大学) Harvard Medical School(哈佛医学院)

AI总结 提出RBI-Eval框架,通过探针集比较模型在有/无敏感记忆时的行为差异,发现当前检索增强生成系统无法避免敏感记忆的不当整合,需在检索和生成阶段同时进行记忆感知决策。

详情
Comments
21 pages, 10 figures
AI中文摘要

长期记忆使语言模型代理能够支持个性化交互,但目前尚不清楚何时可用记忆应被整合到响应中。现有的记忆评估强调检索准确性和下游任务效用,而忽略了检索到的敏感记忆内容在当前轮次中是否合理。我们引入RBI-Eval,这是一种基于探针集的受控测量研究,比较模型在相同良性提示下访问和不访问敏感记忆时的行为。我们在四种记忆访问设置(全上下文暴露和三种检索系统)下,针对四个基础LLM与匹配的无记忆参考进行评估。我们的结果揭示了显著的行为差异。在有记忆可用时,GPT-5.4-mini的敏感记忆整合分离分数相对于匹配的无记忆参考下降了8.9%–26.6%,而Claude-Sonnet-4.6、DeepSeek-V4-Flash和Qwen3.5-9B下降了51.1%–82.9%。对DeepSeek和GPT-5.4-mini的对照实验表明,这种效应是敏感内容特有的,而非一般个性化。检索系统减少了暴露,但一旦敏感记忆到达生成器,并不能消除整合。这些发现表明,安全个性化需要在检索和生成时都进行记忆感知决策。

英文摘要

Long-term memory enables language model agents to support personalized interactions, but it remains unclear when available memories warrant integration into responses. Existing memory evaluations emphasize retrieval accuracy and downstream task utility, while overlooking whether retrieved sensitive memory content is warranted in the current turn. We introduce RBI-Eval, a controlled measurement study built around a probe set that compares model behavior with and without access to sensitive memory under identical benign prompts. We evaluate four base LLMs against a matched no-memory reference across four memory-access settings: full-context exposure and three retrieval systems. Our results reveal substantial behavioral divergence. With memory available, the separation score for sensitive-memory integration decreases by 8.9%-26.6% relative to the matched no-memory reference for GPT-5.4-mini, but by 51.1%-82.9% for Claude-Sonnet-4.6, DeepSeek-V4-Flash, and Qwen3.5-9B. Control experiments on DeepSeek and GPT-5.4-mini show this effect is specific to sensitive content, rather than general personalization. Retrieval systems reduce exposure but do not eliminate integration once sensitive memory reaches the generator. These findings suggest safe personalization requires memory-aware decisions at both retrieval and generation time.

2606.06054 2026-06-05 cs.AI

Beyond Similarity: Trustworthy Memory Search for Personal AI Agents

超越相似性:面向个人AI代理的可信记忆搜索

Jiawen Zhang, Kejia Chen, Jiachen Ma, Yangfan Hu, Lipeng He, Yechao Zhang, Jian Liu, Xiaohu Yang, Tianwei Zhang, Ruoxi Jia

发表机构 * University of Science and Technology of China(中国科学技术大学) Tsinghua University(清华大学) National University of Singapore(新加坡国立大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 针对个人AI代理中基于语义相似性的记忆检索存在的信任漏洞,提出轻量级记忆插件MemGate,通过查询条件神经门控实现可信记忆搜索。

详情
AI中文摘要

个人AI代理越来越依赖长期记忆来跨会话提供持久个性化。然而,现有的记忆流水线主要由语义相似性驱动:检索与当前查询语义接近的记忆数据并将其注入模型上下文。这造成了关键的信任差距,因为语义相关的记忆可能仍然在上下文中不合适,导致跨域泄露、谄媚、工具调用漂移或记忆引发的越狱等威胁。在本文中,我们将记忆搜索作为个人AI代理中的信任边界进行研究。我们评估了代表性的代理记忆框架,包括A-Mem、Mem0和MemOS,以及OpenClaw(一个具有持久状态和工具使用能力的真实世界个人代理环境)。我们的结果表明,长期记忆不仅仅是一个实用层,而是一个持久的控制通道,可以重塑代理如何解释任务和执行操作,使其极易受到上述威胁的影响。为了缓解这些漏洞,我们提出了MemGate,一个轻量级且可部署的记忆插件,用于可信记忆搜索,仅9M参数和35.1MB占用空间。MemGate插入在向量记忆存储和骨干LLM之间,无需修改LLM、重写记忆数据库或推理时LLM评判。它对候选记忆表示应用查询条件神经门控,将原始相似性搜索转化为任务条件记忆准入。在多个主流记忆框架、真实世界代理设置和多样化LLM骨干上,MemGate在保留长期记忆效用的同时减少了记忆引发的威胁。

英文摘要

Personal AI agents increasingly rely on long-term memory to provide persistent personalization across sessions. However, existing memory pipelines are largely driven by semantic similarity: memory data close to the current query is retrieved and injected into the model context. This creates a critical trustworthiness gap, since a semantically related memory may still be contextually inappropriate, leading to threats such as cross-domain leakage, sycophancy, tool-call drift, or memory-induced jailbreaks. In this paper, we study memory search as a trust boundary in personal AI agents. We evaluate representative agentic memory frameworks, including A-Mem, Mem0, and MemOS, together with OpenClaw, a real-world personal-agent environment with persistent state and tool-use capability. Our results show that long-term memory is not merely a utility layer, but a durable control channel that can reshape how agents interpret tasks and execute actions, leaving them highly susceptible to the aforementioned threats. To mitigate these vulnerabilities, we propose MemGate, a lightweight and deployable memory plug-in for trustworthy memory search, with only 9M parameters and a 35.1MB footprint. MemGate is inserted between the vector memory store and the backbone LLM, requiring no LLM modification, memory-database rewriting, or inference-time LLM judge. It applies a query-conditioned neural gate to candidate memory representations, turning raw similarity search into task-conditioned memory admission. Across multiple mainstream memory frameworks, real-world agent settings, and diverse LLM backbones, MemGate reduces memory-induced threats while preserving long-term memory utility.

2606.06049 2026-06-05 cs.RO

L-SDPPO: Policy Optimization of Spiking Diffusion Policy for Intra-vehicular Robotic Manipulation

L-SDPPO:用于舱内机器人操作的脉冲扩散策略优化

Liwen Zhang, Dong Zhou, Guanghui Sun, Yifei Zheng, Yuhui Hu, Kaihong Ouyang, Zuoquan Zhao

发表机构 * Department of Control Science and Engineering, Harbin Institute of Technology(控制科学与工程系,哈尔滨工业大学) Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong(机械与自动化工程系,香港中文大学)

AI总结 提出L-SDPPO框架,结合脉冲扩散策略与强化学习优化,并引入状态依赖延迟注入机制,在舱内机器人操作任务中实现高成功率和低能耗。

详情
AI中文摘要

航天器中的舱内机器人有助于减少宇航员的工作量并提高任务效率。最近的研究集中于使用深度学习方法来实现这些复杂环境中操作所需的精确控制。然而,在没有重力阻尼的情况下,物体会表现出不可预测、无约束的漂移。这些因素要求对复杂的多模态动作分布具有鲁棒性。扩散策略(DP)可以建模这些复杂动作,但其迭代采样过程对于航天器有限的功率预算来说消耗过多能量。因此,我们提出了一种低能耗的舱内机器人操作框架L-SDPPO,其中脉冲扩散策略(SDP)通过强化学习(RL)算法进行优化。此外,为了解决微重力下动态时空特征感知不足的问题,我们提出了状态依赖延迟注入(SDLI)机制,该机制模拟生物神经延迟以动态调节输入信息的时间。在五个代表性的舱内日常任务(例如舱门打开和精密容器盖合)上的评估表明,与最先进的机器人操作方法相比,我们的方法始终能实现更高的成功率和更低的能耗。这些结果表明我们的方法是一种可行的舱内机器人操作方法。

英文摘要

Intra-vehicular robots in spacecraft help reduce astronaut workload and improve mission efficiency. Recent research focuses on using deep learning methods to achieve the precise control required for operations in these complex environments. However, objects exhibit unpredictable, unconstrained drift without gravitational damping. These factors demand robustness against complex multimodal action distributions. Diffusion policies (DP) can model these complex actions, but their iterative sampling process consumes too much energy for the limited power budgets of spacecraft. We therefore propose a low-energy intra-vehicular robotic manipulation framework, L-SDPPO, in which the Spiking Diffusion Policy (SDP) is optimized with a reinforcement learning (RL) algorithm. Furthermore, to address the insufficient perception of dynamic spatiotemporal features in microgravity, we propose the state-dependent latency injection (SDLI) mechanism, which mimics biological neural delays to dynamically regulate the timing of input information. Evaluation on five representative intra-vehicular daily tasks (e.g., hatch opening and precision container capping) shows that our method consistently achieves higher success rates and lower energy consumption, compared to the state-of-the-art robotic manipulation methods. These results demonstrate our method is a viable intra-vehicular robotic manipulation method.

2606.06044 2026-06-05 cs.CL

IA-RAG: Interval-Algebra-Driven Temporal Reasoning for Dynamic Knowledge Retrieval

IA-RAG:基于区间代数的动态知识检索时间推理

Xiaoman Wang, Yaoze Zhang, Wenzhuo Fan, Hongwei Zhang, Ding Wang, Guohang Yan, Song Mao, Botian Shi, Yunshi Lan, Pinlong Cai

发表机构 * East China Normal University(华东师范大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) University of Shanghai for Science and Technology(上海理工大学) Harbin Engineering University(哈尔滨工程大学)

AI总结 提出IA-RAG框架,通过区间代数建模时间约束,实现层次化时间检索与推理,在复杂时间问答任务上表现优异。

详情
Comments
22 pages, 10 figures, 13 tables. Code available at https://github.com/xiaoAugenstern/LogicalRAG_TemporalQA
AI中文摘要

检索增强生成(RAG)在利用外部知识增强大语言模型(LLMs)方面表现出强大的有效性。然而,现有的RAG和Graph RAG框架大多将知识视为静态,或仅将时间与粗粒度的时间戳或元数据关联,未能捕捉丰富的时间结构,如持续时间、重叠和包含关系。我们提出IA-RAG,一种层次化时间RAG框架,将知识建模为时间区间,并在形式化时间约束下进行检索。IA-RAG将事实表示为区间事件单元(IEUs),并将其组织成层次化的主题森林,其中时间依赖关系由Allen的区间代数控制。为处理不完整或不确定的时间边界,IA-RAG进一步引入子图时间收紧机制,通过连接事件子图中的逻辑约束来细化模糊区间。此外,IA-RAG通过区间代数引导的遍历支持隐式时间语义检索。在多个时间问答基准(包括TimeQA、TempReason和ComplexTR)上的实验表明,IA-RAG在时间检索和推理性能上表现优异,尤其是在复杂的组合时间推理任务上。我们的代码已发布在https://github.com/xiaoAugenstern/LogicalRAG_TemporalQA。

英文摘要

Retrieval-Augmented Generation (RAG) has shown strong effectiveness in grounding Large Language Models (LLMs) with external knowledge. However, existing RAG and Graph RAG frameworks largely treat knowledge as static or associate time with coarse-grained timestamps or metadata, failing to capture rich temporal structures such as duration, overlap, and containment. We propose IA-RAG, a hierarchical temporal RAG framework that models knowledge as time intervals and performs retrieval under formal temporal constraints. IA-RAG represents facts as Interval Event Units (IEUs) and organizes them into a hierarchical Thematic Forest, where temporal dependencies are governed by Allen's Interval Algebra. To handle incomplete or uncertain temporal boundaries, IA-RAG further introduces a Sub-graph Time Tightening mechanism that refines fuzzy intervals through logical constraints within connected event subgraphs. In addition, IA-RAG supports implicit temporal semantic retrieval through interval-algebra-guided traversal. Experiments on multiple temporal question answering benchmarks, including TimeQA, TempReason, and ComplexTR, demonstrate that IA-RAG achieves strong temporal retrieval and reasoning performance, particularly on complex compositional temporal reasoning tasks. Our code is released at https://github.com/xiaoAugenstern/LogicalRAG_TemporalQA.
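The temporal structures named above (duration, overlap, containment) correspond to the thirteen basic relations of Allen's Interval Algebra. The sketch below classifies the relation between two intervals; it illustrates the algebra only and is not IA-RAG's retrieval code. The `Interval` type and `allen_relation` function are hypothetical names, and intervals are assumed to be proper (start strictly before end) with fully known boundaries.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    start: float
    end: float


def allen_relation(a: Interval, b: Interval) -> str:
    """Return which of Allen's 13 basic relations holds between a and b."""
    if a.end < b.start:
        return "before"
    if a.end == b.start:
        return "meets"
    if b.end < a.start:
        return "after"
    if b.end == a.start:
        return "met-by"
    # From here on the two intervals share interior points.
    if a.start == b.start and a.end == b.end:
        return "equals"
    if a.start == b.start:
        return "starts" if a.end < b.end else "started-by"
    if a.end == b.end:
        return "finishes" if a.start > b.start else "finished-by"
    if b.start < a.start and a.end < b.end:
        return "during"
    if a.start < b.start and b.end < a.end:
        return "contains"
    return "overlaps" if a.start < b.start else "overlapped-by"


# Example: two terms of office expressed in years.
first_term = Interval(2000, 2004)
second_term = Interval(2004, 2008)
print(allen_relation(first_term, second_term))
```

A question such as "what happened during X?" can then be phrased as a filter on the relation label, which is one way a retrieval step could apply formal temporal constraints; how IA-RAG handles uncertain boundaries is not covered by this sketch.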

2606.06041 2026-06-05 cs.RO cs.AI cs.NE

Sample-efficient Low-level Motion Planning for Robotic Manipulation Tasks via Zero-shot Transfer Learning

通过零样本迁移学习实现机器人操作任务的样本高效低级运动规划

Yuanzhi He, Victor Romero-Cano, José J. Patiño, Juan David Hernández, William Sawtell, Gualtiero Colombo

发表机构 * School of Computer Science & Informatics, Cardiff University, Cardiff, UK(计算机科学与信息学系,卡迪夫大学,卡迪夫,英国)

AI总结 提出iCEM+TL框架,通过迁移学习和奖励重塑提高复杂操作任务的成功率,仿真中提升高达23%,并在真实机器人上验证。

详情
Comments
12 pages, 5 figures, accepted at the International Conference on Artificial Neural Networks (ICANN) 2026
AI中文摘要

随着机器人系统变得日益复杂,其运动规划模型的复杂性和更长的训练时间带来了巨大挑战。进化算法如样本高效交叉熵方法(iCEM)最近通过利用高效的知识重用策略来提升性能,在低级实时规划中展现出潜力。尽管在许多控制任务中有效,但iCEM在更复杂场景中的性能可能受到限制,特别是那些需要堆叠、滑动和放置到架子的任务。在这项工作中,我们提出了一种新颖的iCEM+TL框架,明确利用迁移学习(TL),其中关键的iCEM参数从较简单的上游任务迁移以指导更复杂的下游任务。此外,我们通过任务分解对堆叠物体和放置到架子应用了奖励重塑(RR)以优化任务特定性能。仿真结果表明,我们的框架实现了高达23%的成功率提升。该框架还在真实的Franka Emika机器人上的堆叠任务中得到进一步验证,展示了其在实际部署中的可行性。

英文摘要

As robotic systems become more sophisticated, the growing complexity of their motion planning models and the longer training times pose substantial challenges. Evolutionary algorithms such as the Sample-efficient Cross-Entropy Method (iCEM) have recently demonstrated promising potential for low-level real-time planning by leveraging efficient knowledge reuse strategies to improve performance. Although effective in many control tasks, iCEM's performance can be constrained in more complex scenarios, particularly those requiring stacking, sliding, and shelf placement. In this work, we propose a novel iCEM+TL framework that explicitly leverages Transfer Learning (TL), where key iCEM parameters are transferred from simpler upstream tasks to guide more complex downstream tasks. Additionally, we applied Reward Redesign (RR) through task decomposition for stacking objects and shelf placement to optimize task-specific performance. Results from the simulation show that our framework achieves success rate improvements of up to 23%. The framework is further validated on a real Franka Emika robot in a stacking task, demonstrating its practical feasibility for real-world deployment.

2606.06040 2026-06-05 cs.RO cs.SY eess.SY

Gotta Grow Fast: Design and Benchmarking of a Tip Mount for High-Speed Vine Robots

快速生长:高速藤蔓机器人尖端支架的设计与基准测试

Antonio Alvarez Valdivia, Robert Reeve, Ankush Dhawan, Ciera McFarland, Chad Council, Margaret McGuinness, Nathaniel Hanson

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Lincoln Laboratory(林肯实验室) Stanford University(斯坦福大学) University of Notre Dame(圣母大学)

AI总结 提出一种三角滚轮尖端支架,通过滚动代替滑动减少生长阻力,实现TPU涂层防撕裂尼龙藤蔓机器人的一致外翻,并建立可重复的基准测试框架。

详情
Comments
Accepted to IEEE Robotics & Automation Letters
AI中文摘要

软体生长藤蔓机器人通过尖端外翻机制扩展,该机制使其能够在杂乱环境中导航。然而,在尖端集成摄像头和其他传感器具有独特挑战,因为形成尖端的材料随着机器人生长而不断更新。这种持续的材料更替,加上内层之间的摩擦、增加的尖端重量和织物收缩,使传感器和工具安装复杂化。这些限制阻碍了藤蔓机器人在检查和搜索任务中的应用,而快速生长并携带尖端传感器至关重要。在这项工作中,我们提出了一种三角滚轮尖端支架,通过滚动而非滑动与机器人本体接触,减少生长过程中的内部阻力。通过迭代故障分析优化设计,首次实现了在TPU涂层防撕裂尼龙藤蔓机器人上的一致外翻。为了定量评估支架性能,我们引入了一个定制测试台,通过测量外翻过程中的尾部张力来隔离尖端安装效应。跨多个支架变体(包括先前设计)的比较实验表明,我们的三角滚轮支架实现了最低的尾部张力和最可重复的生长性能。这些结果既建立了一个经过验证的尖端支架设计,也为推进软体生长机器人中传感器和工具集成提供了一个可重复的基准测试框架。支架和测试台的CAD文件可在以下网址获取:https://sprout-mitll.github.io/tip_mounts/。

英文摘要

Soft, growing vine robots extend through tip eversion, a mechanism that enables navigation through cluttered environments. However, integrating cameras and other sensors at the tip is uniquely challenging because the material forming the tip is constantly renewed as the robot grows. This continual material turnover, combined with friction between internal layers, added tip weight, and fabric constriction, complicates sensor and tool mounting. These limitations hinder the deployment of vine robots for inspection and search tasks, where rapid growth while carrying tip-mounted sensors is essential. In this work, we present a triangular roller tip mount that reduces internal resistance during growth by rolling rather than sliding against the robot body. The design was refined through iterative failure analysis, enabling, for the first time, consistent eversion on a TPU-coated ripstop nylon vine robot. To quantitatively evaluate mount performance, we introduce a custom testbed that isolates tip mounting effects by measuring tail tension during eversion. Comparative experiments across multiple mount variants, including prior designs, show that our triangular roller mount achieves the lowest tail tension and most repeatable growth performance. These results establish both a validated tip mount design and a repeatable benchmarking framework for advancing sensor and tool integration in soft growing robots. CAD for the mount and testbed is available at: https://sprout-mitll.github.io/tip_mounts/.

2606.06039 2026-06-05 cs.CV

Texture-preserving implicit neural representation for Cone beam CT truncated reconstruction

保留纹理的隐式神经表示用于锥束CT截断重建

Genyuan Zhang, Junyao Wang, Haoran Lan, Chuandong Tan, Songtao Zhu, Fenglin Liu

发表机构 * National Key Research and Development Program of China(中华人民共和国国家重点研发计划) National Natural Science Foundation of China(中华人民共和国国家自然科学基金) Fundamental Research Funds for the Central Universities(中央高校基本科研业务费)

AI总结 提出一种自监督的3D重建框架,基于神经场景表示,结合物理迭代细化模块,解决锥束CT截断重建中的伪影和纹理丢失问题。

详情
AI中文摘要

锥束计算机断层扫描(CBCT)经常受到数据截断的影响,这引入了严重的伪影并限制了有效视场(FOV)。现有的用于截断锥束CT重建的深度学习方法存在严重局限性,包括严格依赖有监督的真实数据和未能考虑连续3D空间截断变化。为了解决这些挑战,我们引入了一个基于神经场景表示的自监督3D重建框架。通过在投影监督下将空间坐标直接映射到辐射密度,我们的方法固有地绕过了传统的滤波和反投影操作,从而从根本上消除了截断引起的环状伪影,同时实现了鲁棒的连续3D数据外推。然而,坐标网络容易受到固有的频谱偏差影响,这导致临床关键的高频纹理严重丢失。为了解决这一瓶颈,我们进一步将基于物理的迭代细化模块集成到神经场景表示架构中。利用来自坐标网络的无伪影外推体积作为最优初始化,该模块逐步从原始投影中重新提取高频结构信息并将其注入体积中。在模拟和真实数据集上的大量实验表明,我们的方法成功地将神经网络的优异伪影抑制和外推能力与迭代算法的高保真细节保留统一起来。

英文摘要

Cone-beam computed tomography (CBCT) frequently suffers from data truncation, which introduces severe artifacts and limits the effective field of view (FOV). Existing deep learning methods for truncated CBCT reconstruction suffer from serious limitations, including a strict reliance on supervised ground truth and a failure to account for continuous 3D spatial truncation variations. To address these challenges, we introduce a self-supervised 3D reconstruction framework based on neural scene representations. By directly mapping spatial coordinates to radiodensity under projection supervision, our approach inherently bypasses traditional filtering and backprojection operations, thereby fundamentally eliminating truncation-induced ring artifacts while enabling robust continuous 3D data extrapolation. However, coordinate networks are susceptible to an inherent spectral bias, which leads to a severe loss of clinically vital high-frequency textures. To resolve this bottleneck, we further incorporate a physics-based iterative refinement module into the neural scene representation architecture. Leveraging the artifact-free, extrapolated volume from the coordinate network as an optimal initialization, this module progressively re-extracts and injects high-frequency structural information from the original projections back into the volume. Extensive experiments on both simulated and real-world datasets demonstrate that our method successfully unifies the exceptional artifact suppression and extrapolation capabilities of neural networks with the high-fidelity detail preservation of iterative algorithms.

2606.06038 2026-06-05 cs.CL

English-to-Prakrit Machine Translation via Multilingual Transfer Learning

英语到普拉克里特语的机器翻译:基于多语言迁移学习

Om Choksi, Smit Kareliya, Shrikant Malviya, Pruthwik Mishra

Affiliations * Sardar Vallabhbhai National Institute of Technology

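AI summary For Prakrit, a low-resource target language, the work maps Prakrit to the Hindi language tag and uses a multilingual model to achieve workable machine translation from a small parallel corpus, showing the transfer potential of script-compatible language routing for unsupported classical languages as well as the limits imposed by data scarcity and dialect mismatch.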

Details
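AI Chinese abstract (translated to English)

We study English-to-Prakrit machine translation in a low-resource setting in which the target language is not supported by IndicTrans2. We adapt the multilingual model by mapping Prakrit to the Hindi tag (hin_Deva), without modifying the tokenizer, vocabulary, or architecture. Using a Maharashtri Prakrit parallel corpus of 1,474 pairs and evaluating on a 20-sample Ardhamagadhi test set, we report corpus BLEU improvements over an untuned baseline. The results show that script-compatible language routing can enable feasible transfer to unsupported classical languages, while highlighting the limitations imposed by data scarcity and dialect mismatch. Our code and trained models are publicly released for further exploration: https://github.com/D3v1s0m/indictrans2-prakrit-mt.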

English abstract

We study English-to-Prakrit machine translation in a low-resource setting where the target language is unsupported by IndicTrans2. We adapt the multilingual model by mapping Prakrit to the Hindi language tag (hin_Deva) without modifying the tokenizer, vocabulary, or architecture. Using a 1,474-pair Maharashtri Prakrit parallel corpus and evaluating on a 20-sample Ardhamagadhi test set, we report corpus BLEU improvements over an untuned baseline. The results indicate that script-compatible language routing can enable feasible transfer to unsupported classical languages, while highlighting limitations due to data scarcity and dialect mismatch. Our code and trained models are publicly released for further exploration: https://github.com/D3v1s0m/indictrans2-prakrit-mt.
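
The language-routing step described in this abstract can be sketched as a small lookup that sends an unsupported language to a script-compatible supported tag before the input is tagged. This is only an illustration under stated assumptions: `pra_Deva` is a hypothetical tag for Prakrit, the supported-tag set is a two-entry subset, and the "source tag, target tag, sentence" prefix layout is assumed for the example rather than taken from the paper or its repository.

```python
# Hypothetical proxy table: route an unsupported language to a supported tag
# that shares its script. Only hin_Deva is named in the abstract.
PROXY_TAGS = {"pra_Deva": "hin_Deva"}

# Illustrative subset of tags the underlying model is assumed to accept.
SUPPORTED_TAGS = {"eng_Latn", "hin_Deva"}


def route_tag(lang_tag):
    """Return a supported tag for lang_tag, applying the proxy table if needed."""
    routed = PROXY_TAGS.get(lang_tag, lang_tag)
    if routed not in SUPPORTED_TAGS:
        raise ValueError(f"no supported or proxy tag for {lang_tag!r}")
    return routed


def tag_source(sentence, src_tag, tgt_tag):
    """Prefix a source sentence with routed language tags (assumed input layout)."""
    return f"{route_tag(src_tag)} {route_tag(tgt_tag)} {sentence.strip()}"


example = tag_source("The king went to the forest.", "eng_Latn", "pra_Deva")
print(example)
```

Because the routing happens entirely in the input tags, the tokenizer, vocabulary, and architecture stay untouched, which matches the constraint the abstract states.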

2606.06036 2026-06-05 cs.AI cs.IR

Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents

记忆是重建的,而非检索的:面向LLM智能体的图记忆

Shuo Ji, Yibo Li, Bryan Hooi

Affiliations * University of California, Berkeley; Stanford University

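AI summary Proposes the MRAgent framework, which uses an associative memory graph and an active reconstruction mechanism to let LLM agents dynamically adjust memory access during reasoning, significantly improving long-horizon memory reasoning performance.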

Details
Comments
Accepted at ICML 2026
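AI Chinese abstract (translated to English)

Despite recent progress, LLM agents still struggle to reason over long interaction histories. Current memory-augmented agents rely on a static retrieve-then-reason paradigm, and this rigid pipeline design prevents them from dynamically adjusting memory access according to intermediate evidence discovered during reasoning. To close this gap, we propose MRAgent, a framework that combines an associative memory graph with an active reconstruction mechanism. We represent memory as a Cue-Tag-Content graph in which associative tags act as semantic bridges connecting fine-grained cues to memory contents. On top of this structure, our active reconstruction mechanism integrates LLM reasoning directly into memory access, allowing the agent to iteratively explore and prune retrieval paths based on accumulated evidence. This ensures that memory retrieval adapts dynamically to the reasoning context while avoiding the combinatorial explosion caused by unconstrained expansion. Experiments on the LoCoMo and LongMemEval benchmarks show significant improvements over strong baselines (up to 23%) together with substantially lower token and runtime costs, highlighting the effectiveness of active, associative reconstruction for long-horizon memory reasoning.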

English abstract

Despite recent progress, LLM agents still struggle with reasoning over long interaction histories. Current memory-augmented agents rely on a static retrieve-then-reason paradigm, and this rigid pipeline design prevents them from dynamically adapting memory access to intermediate evidence discovered during inference. To bridge this gap, we propose MRAgent, a framework that combines an associative memory graph with an active reconstruction mechanism. We represent memory as a Cue-Tag-Content graph, where associative tags serve as semantic bridges connecting fine-grained cues to memory contents. Operating on this structure, our active reconstruction mechanism integrates LLM reasoning directly into memory access, allowing the agent to iteratively explore and prune retrieval paths based on accumulated evidence. This ensures that memory retrieval is dynamically adapted to the reasoning context while avoiding the combinatorial explosion caused by unconstrained expansion. Experiments on the LoCoMo and LongMemEval benchmarks demonstrate significant improvements over strong baselines (up to 23%), while substantially reducing token and runtime cost, highlighting the effectiveness of active and associative reconstruction for long-horizon memory reasoning.
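
The explore-and-prune loop over a Cue-Tag-Content graph described in this abstract can be sketched roughly as follows. This is a toy illustration, not MRAgent itself: the two dictionaries, the function names, and the `beam` and `rounds` parameters are all hypothetical, and a simple word-overlap score stands in for the LLM reasoning that the paper uses to decide which retrieval paths to keep.

```python
def overlap(terms, text):
    """Stand-in relevance score: number of known terms that appear in the text."""
    return len(terms & set(text.lower().split()))


def reconstruct(query, cue_to_tags, tag_to_contents, beam=1, rounds=2):
    """Iteratively expand cues -> tags -> contents, keeping only the top `beam` tags."""
    terms = set(query.lower().split())
    evidence, seen_tags = [], set()
    for _ in range(rounds):
        # Explore: cues matched by the query or by evidence gathered so far.
        active_cues = [cue for cue in cue_to_tags if cue in terms]
        candidates = {t for cue in active_cues for t in cue_to_tags[cue]} - seen_tags
        # Prune: rank candidate tags by their best-scoring memory, keep `beam`.
        ranked = sorted(
            candidates,
            key=lambda t: (-max(overlap(terms, m) for m in tag_to_contents[t]), t),
        )
        kept = ranked[:beam]
        if not kept:
            break
        for tag in kept:
            seen_tags.add(tag)
            for memory in tag_to_contents[tag]:
                if memory not in evidence:
                    evidence.append(memory)
                    # Accumulated evidence can activate new cues next round.
                    terms |= set(memory.lower().split())
    return evidence


cue_to_tags = {"alice": ["travel", "pets"], "paris": ["travel"], "louvre": ["museums"]}
tag_to_contents = {
    "travel": ["alice visited the louvre in paris"],
    "pets": ["alice adopted a cat"],
    "museums": ["the louvre was closed on tuesday"],
}
evidence = reconstruct("what happened when alice visited paris", cue_to_tags, tag_to_contents)
print(evidence)
```

In this example the first round retrieves the travel memory, which introduces the cue `louvre`; the second round follows that cue to the museum memory while the unrelated pet memory is pruned. A static retrieve-then-reason pass over the original query alone would not have reached the second memory.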