arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 1942
2605.20476 2026-05-21 cs.CV

Goodbye Drift: Anchored Tree Sampling for Long-Horizon Video-to-Video Generation

告别漂移:用于长时视频到视频生成的锚定树采样

Matthew Bendel, Stephen W. Bailey, Mithilesh Vaidya, Sumukh Badam, Xingzhe He

AI总结 本文提出了一种名为锚定树采样的方法,通过减少关键路径步骤来解决长时视频生成中的漂移问题,并在静态相机模式下实现了稳定且高质量的视频生成。

Comments 30 pages, 23 figures

详情
AI中文摘要

长时视频生成面临两个交织的问题。首先,漂移问题,即视频质量随时间下降。其次,连续性问题,表现为物体永久性问题或不当渲染瞬态内容(例如,出现在非连续帧中的物体颜色/风格变化)。最近的工作集中在自回归蒸馏技术上,旨在同时解决这两个问题。我们选择专注于漂移问题,并引入锚定树采样(ATS):一种无训练的推理时间调度器,用稀疏到密集、锚定范围内的填补方法替代从左到右的滚动。根调用在全时间范围内生成稀疏锚点,递归细化生成中间锚点,最终叶跨度在相邻锚点之间合成。这将关键路径从K个连续滚动步骤减少到L+1个树状步骤,并将时间累积漂移转换为锚定范围内的漂移。我们专注于静态相机模式下的V2V生成,其中稀疏锚点在时间范围内可由密集条件信号近似,且基础模型可在不重新训练的情况下生成它们。我们在Wan 2.1 + VACE上评估了ATS,针对五种条件模式(修复、扩展、边缘、姿态、深度)。我们证明ATS在整体质量和漂移防止方面均优于两个竞争对手。此外,我们还展示了在LTX-2.3上稳定生成至少40分钟的视频。最后,我们提出了一条路径,将ATS扩展到任意长的T2V生成,以及动态相机和多镜头模式。

英文摘要

Long-horizon video generation suffers from two intertwined issues. First, there is drift, where video quality degrades over time. Second, there are continuity issues which manifest as object permanence issues, or improperly rendering transient content (e.g., an object that appears in non-consecutive frames changing color/style). Recent work has focused on autoregressive distillation techniques that attack both problems simultaneously. We instead choose to focus on drift directly and introduce \textbf{Anchored Tree Sampling (ATS)}: a training-free inference-time scheduler that replaces left-to-right rollout with sparse-to-dense, anchor-bounded imputation organized as a tree. A root call produces sparse anchors over the full horizon, recursive refinement generates intermediate anchors, and final leaf spans are synthesized between neighboring anchors. This reduces the critical path from $K$ sequential rollout steps to $L+1$ tree-hierarchical steps and converts horizon-compounding drift into anchor-bounded drift. We focus on V2V generation in the \emph{static-camera} regime, where sparse anchors over the horizon are well approximated by the dense conditioning signal, and the base model can produce them without retraining. We evaluate ATS against two contemporary autoregressive baselines on Wan $2.1$ $+$ VACE, across five conditioning modalities (inpainting, outpainting, edge, pose, depth). We show that ATS outperforms both competitors in overall quality, as well as in drift prevention. We additionally demonstrate stable $\geq 40$-minute generation on LTX-$2.3$ across the same five modalities. We conclude by proposing a path forward to extend ATS to arbitrarily long T2V generation, as well as the dynamic-camera and multi-shot regimes.

2605.20470 2026-05-21 cs.CV cs.AI physics.med-ph

EPC-3D-Diff: Equivariant Physics Consistent Conditional 3D Latent Diffusion for CBCT to CT Synthesis

EPC-3D-Diff: 基于CBCT到CT合成的等价物理一致条件3D潜在扩散模型

Alzahra Altalib, Chunhui Li, Haytham Al Ewaidat, Khaled Alawneh, Ahmad Qendel, Alessandro Perelli

AI总结 本文提出EPC-3D-Diff,一种新的条件3D潜在扩散框架,用于体积CBCT到CT合成,通过引入从成像物理导出的投影域等价损失,提高了物理一致性。该方法在训练过程中通过正向投影旋转合成的CT体积,并将其与相应角度偏移的投影进行匹配,从而在扩散目标中集成物理一致的等价约束。

Comments 10 pages, 4 figures

详情
AI中文摘要

锥束CT(CBCT)在放疗中常用于患者定位,但其定量可靠性受到散射、噪声和重建伪影的限制,限制了Hounsfield单位(HU)的准确性。我们提出了EPC-3D-Diff,一种新的条件3D潜在扩散框架,用于体积CBCT到CT合成,引入了从成像物理导出的投影域等价损失。与常见的图像域等价性不同,我们利用体积内旋转对应于其投影的角偏移的事实。在训练过程中,我们通过正向投影旋转合成的CT体积并将其与适当角度偏移的投影进行匹配,从而在扩散目标中集成物理一致的等价约束。为了高效捕捉完整的3D上下文,条件扩散在由轻量3D自动编码器学习的紧凑潜在空间中进行,保持轴向深度的同时在平面分辨率上进行下采样以实现稳定训练。我们验证了配对的头CBCT/CT假体数据集,包括重复扫描,并使用患者层面的分割进行配对临床数据验证,并进行了单域和混合域训练、消融实验和与扩散和CycleGAN的比较。EPC-3D-Diff具有良好的泛化能力,并在PSNR上相比最先进的方法取得了显著的改进,分别在假体和临床数据上提高了+7.4 dB和+1.8 dB,同时在SSIM和HU准确性方面也有所提升,在组织边界内。总体而言,EPC-3D-Diff提高了鲁棒性和物理一致性,支持HU意识的合成,以支持下游的放疗工作流程。

英文摘要

Cone-beam CT (CBCT) is routinely acquired during radiotherapy for patient setup, but its quantitative reliability is degraded by scatter, noise, and reconstruction artifacts, limiting Hounsfield Unit (HU) accuracy. We propose EPC-3D-Diff, a novel conditional 3D latent diffusion framework for volumetric CBCT to CT synthesis that introduces a projection domain equivariance loss derived from acquisition physics. Unlike common image domain equivariance, we exploit the fact that an in plane rotation of the volume corresponds to an angular shift in its projections. During training, we enforce this relationship by forward projecting rotated synthesized CT volumes and matching them to appropriately angle shifted projections of the paired target CT, yielding a physics consistent equivariance constraint integrated into the diffusion objective. To capture full 3D context efficiently, conditional diffusion is performed in a compact latent space learnt by a lightweight 3D autoencoder, preserving axial depth while downsampling in plane resolution for stable training. We validate on a paired head CBCT/CT phantom dataset, including repeat scans, and paired clinical data using patient wise splits, and perform single and mixed domain training, ablations, and comparisons with diffusion and CycleGAN. EPC-3D-Diff generalizes well and achieved substantial improvements, +7.4 dB (phantom) and +1.8 dB (clinical data) in PSNR compared to state of the art methods, alongside improved SSIM and HU accuracy, within tissue boundaries. Overall, EPC-3D-Diff improves robustness and physics consistency, supporting HU aware synthesis for downstream radiotherapy workflows.

2605.20469 2026-05-21 cs.CV

HalluCXR: Benchmarking and Mitigating Hallucinations in Medical Vision-Language Models for Chest Radiograph Interpretation

HalluCXR: 评估和缓解医疗视觉-语言模型在胸部X光解读中的幻觉

Haoyu Wang, Zitong Li

AI总结 本文提出HalluCXR基准,评估六种不同架构的视觉-语言模型在856例分层MIMIC-CXR胸部X光图像上的表现,发现61.9%-82.3%的输出存在幻觉,其中80.2%存在临床危险错误,通过引入幻觉分类学、检测管道和模型集成方法,提出了缓解幻觉的策略。

详情
AI中文摘要

视觉-语言模型(VLMs)在医学影像解读中日益被使用,但它们经常产生幻觉,即生成在临床上合理但事实错误的发现,这直接对患者安全构成风险。我们介绍了HalluCXR,一个基准,评估了六个架构各异的VLMs在856例分层MIMIC-CXR胸部X光图像和三种查询类型上的表现,产生15,408次模型评估。一个八类幻觉分类学,带有临床严重程度评分和一个双层检测管道,经过250个人类注释验证(自动检测F1=0.959;LLM判断F1=0.907)。我们发现61.9%-82.3%的输出包含幻觉,其中最多80.2%存在临床危险错误。三种关键模式显现:正常X光图像反而吸引最严重的幻觉,常见发现被系统性夸大,而罕见发现被低估,且响应长度本身预测幻觉风险(AUC最高达0.908)。一个六模型集成减少了伪造的84.8%,但增加了遗漏;一个三模型子集在成本减半的情况下保持了相当的性能。这些结果表明,幻觉审计、基于 verbosity 的风险监控和基于集成的安全层是临床部署的先决条件。

英文摘要

Vision-language models (VLMs) are increasingly used for medical image interpretation, yet they frequently hallucinate, generating clinically plausible but factually incorrect findings that pose direct patient safety risks. We introduce HalluCXR, a benchmark evaluating six architecturally diverse VLMs across 856 stratified MIMIC-CXR chest radiographs and three query types, yielding 15,408 model evaluations. An eight-category hallucination taxonomy with clinical severity ratings and a two-layer detection pipeline are validated against 250 human annotations (auto-detection F1=0.959; LLM judge F1=0.907). We find that 61.9--82.3% of outputs contain hallucinations, with clinically dangerous errors in up to 80.2%. Three key patterns emerge: normal radiographs paradoxically attract the most severe hallucinations, common findings are systematically over-fabricated while rare findings go under-detected, and response length alone predicts hallucination risk (AUC up to 0.908). A six-model ensemble reduces fabrication by up to 84.8% at the cost of increased omission; a three-model subset retains comparable performance at half the cost. These results establish that hallucination auditing, verbosity-based risk monitoring, and ensemble-based safety layers are prerequisites for clinical deployment.

2605.20467 2026-05-21 cs.AI

High Quality Embeddings for Horn Logic Reasoning

用于霍恩逻辑推理的高质量嵌入

Yifan Zhang, Yasir White, Dean Clark, Joseph Sanchez, Jevon Lipsey, Ashely Hirst, Jeff Heflin

AI总结 本文提出了一种生成高质量逻辑语句嵌入的方法,通过三元组损失训练嵌入,并通过生成重复术语的锚点、平衡易难例以及强调最困难的例子来提高下游任务的表现。

详情
Journal ref
Proceedings of Machine Learning Research 284:1-14, 2025
AI中文摘要

神经网络可以被训练以对逻辑推理者的选择进行排序,从而更高效地寻找答案。这一过程中的关键步骤是创建有用的嵌入,即逻辑语句的数值表示。本文介绍了并评估了几种生成嵌入的方法,以获得更好的下游结果。我们使用三元组损失训练嵌入,这需要由锚点、正例和负例组成的示例。我们引入了三个想法:生成更可能具有重复术语的锚点,以生成正例和负例的方式确保在简单、中等和困难示例之间有良好的平衡,并在训练过程中定期强调最困难的例子。我们进行了几项实验来评估这种方法,包括在不同知识库中比较不同嵌入的性能,以尝试确定哪些特征使嵌入适合特定的推理任务。

英文摘要

Neural networks can be trained to rank the choices made by logical reasoners, resulting in more efficient searches for answers. A key step in this process is creating useful embeddings, i.e., numeric representations of logical statements. This paper introduces and evaluates several approaches to creating embeddings that result in better downstream results. We train embeddings using triplet loss, which requires examples consisting of an anchor, a positive example, and a negative example. We introduce three ideas: generating anchors that are more likely to have repeated terms, generating positive and negative examples in a way that ensures a good balance between easy, medium, and hard examples, and periodically emphasizing the hardest examples during training. We conduct several experiments to evaluate this approach, including a comparison of different embeddings across different knowledge bases, in an attempt to identify what characteristics make an embedding well-suited to a particular reasoning task.

2605.20461 2026-05-21 cs.CV

Understanding Model Behavior in Monocular Polyp Sizing

理解单目肠镜下息肉大小的模型行为

Xinqi Xiong, Andrea Dunn Beltran, Junmyeong Choi, Sarah K. McGill, Marc Niethammer, Roni Sengupta

AI总结 本文通过多中心数据集和多种模型对二元息肉大小分类(≤5 mm vs. >5 mm)进行诊断审核,发现模型性能在不同架构和输入模态下较为一致,表明其依赖于与检查行为相关的线索而非真实度量尺度,并展示了完美尺度信息的潜在改进以及当前深度估计和全局校准的有限增益。

详情
AI中文摘要

准确的息肉大小分层指导监视决策,通常大于5 mm的病变需要更密切的随访。然而,单目结肠镜缺乏可靠的参考度量标准。我们对多个公共多中心数据集、模型家族和患者分层交叉验证中的二元息肉大小分类(≤5 mm vs. >5 mm)进行了诊断审核。在不同架构和输入模态(包括RGB外观、相对深度和照度)下,模型性能相对一致,表明其依赖于与检查行为相关的线索而非真实度量尺度。通过提供不同粒度的地面真实尺度,我们量化了完美尺度信息的潜在改进,并显示当前深度估计和全局校准提供的增益有限。我们进一步证明,在分布偏移下分割错误消除了大部分潜在增益,具有预测掩码的oracle尺度仅恢复基线性能。这些结果突显了度量尺度和掩码鲁棒性作为两个独立的瓶颈,并提供了可重用的评估工具,如oracle尺度梯子、快捷分组和掩码替换,用于审核未来的息肉大小管道。我们的代码在https://github.com/anaxqx/polyp-sizing-audit上公开可用。

英文摘要

Accurate polyp size stratification guides surveillance decisions, with lesions larger than 5 mm typically requiring closer follow-up. However, monocular colonoscopy lacks a reliable metric reference. We present a diagnostic audit of binary polyp size classification (<=5 mm vs. >5 mm) across multiple public multi-center datasets, model families, and patient-stratified cross-validation. Across architectures and input modalities, including RGB appearance, relative depth, and photometry, model performance is moderately consistent, suggesting reliance on cues correlated with examination behavior rather than true metric scales. By providing ground-truth scale at varying granularities, we quantify the potential improvement from perfect scale information and show that current depth estimation and global calibration offer limited gains. We further demonstrate that segmentation errors under distribution shift eliminate most of this potential, with oracle scale under predicted masks recovering only baseline performance. These results highlight metric scale and mask robustness as two independent bottlenecks and provide reusable evaluation tools such as oracle scale ladders, shortcut partitions, and mask substitution for auditing future polyp sizing pipelines. Our code is publicly accessible at https://github.com/anaxqx/polyp-sizing-audit.

2605.20459 2026-05-21 cs.CV cs.AI

Pixel Wised Lesion Prediction on COVID-19 CT Imagery: A Comparative Analysis of Automated Image Segmentation Architectures

基于像素的新冠CT影像病变预测:自动图像分割架构的比较分析

Sarmad Khan, Arslan Shaukat, Umer Asgher, Basim Azam

AI总结 本文通过比较四种深度学习架构与六种预训练编码器,评估了在新冠CT影像中预测病变的性能,发现深度学习在分割任务中具有高精度和效率,其中二分类分割达到98%的F1分数,多分类分割在不同数据集上分别达到75%和77%的F1分数。

Comments 7 pages, 6 figures, 4 tables

详情
AI中文摘要

近年来,深度学习算法在医学图像分割领域受到了越来越多的关注。然而,由于缺乏标准化的性能分析方法和先前研究中使用不同数据集,该领域的可靠性受到阻碍。本研究的主要目的是全面评估当前的分割框架与最先进的预训练骨干网络,以准确预测CT影像中的新冠病变。此外,这种评估可以作为其他成像场景图像分割的参考点。为了实现这一目标,我们整合了四个不同的深度学习架构,即Unet、PSPNet、Linknet和FPN,以及六个预训练编码器,包括VGG 19、DenseNet 121、Inception ResNet V2、MobileNet V2、SeresNet 101和EfficientNet B0。这种方法使能够开发出多样化的测试架构。在图像分割的背景下,我们的研究涵盖了二分类和多分类实验。通过分析三个不同的新冠CT分割数据集,我们的分析结果表明深度学习架构能够产生精确且高效的分割结果。显著的是,二分类分割的最高F1分数达到98%,而多分类分割在两个不同的数据集上分别达到了75%和77%的F1分数。人工智能和深度学习的使用在多个维度上增强了对流行病疾病诊断过程的帮助。

英文摘要

In recent years, there has been a notable increase in the level of attention that is given to algorithms based on deep learning in the context of medical image segmentation. Nevertheless, the reliability of the field has been hindered due to the absence of a standardized methodology for performance analysis and the utilization of different datasets in previous research. The primary objective of the research is to comprehensively evaluate contemporary segmentation frameworks combined with state-of-the-art pre-trained backbones in order to accurately predict COVID-19 lesions in CT images. Moreover, this evaluation can serve as a point of reference for the segmentation of images in various other imaging scenarios. In order to accomplish this, we integrate four distinct deep learning architectures, namely Unet, PSPNet, Linknet, and FPN, with six pre-trained encoders, including VGG 19, DenseNet 121, Inception ResNet V2, MobileNet V2, SeresNet 101, and EfficientNet B0. This approach enables the development of diverse testing architectures. In the context of image segmentation, our research encompassed both binary and multi-class experimentation. The findings derived from our analysis of three distinct COVID-19 CT segmentation datasets indicate that deep learning architectures yield precise and efficient segmentation outcomes. Significantly, a maximum F1-Score of 98% was attained for binary class segmentation, while multi-class segmentation yielded F1-Scores of 75% and 77% across two separate datasets. The utilization of artificial intelligence and deep learning enhances the diagnostic process for pandemic diseases across multiple dimensions.

2605.20458 2026-05-21 cs.CV

ELEMENT: Multi-Modal Retinal Vessel Segmentation Based on a Coupled Region Growing and Machine Learning Approach

ELEMENT:基于耦合区域生长和机器学习方法的多模态视网膜血管分割

Erick O. Rodrigues, Aura Conci, Panos Liatsis

AI总结 本文提出了一种基于耦合区域生长和机器学习方法的多模态视网膜血管分割框架ELEMENT,通过区域生长和机器学习提取特征并进行像素分类,提高了分割的准确性和效率,实验表明其在多个数据集上均优于现有方法。

详情
Journal ref
IEEE Journal of Biomedical and Health Informatics 2020
AI中文摘要

视网膜血管结构包含重要的信息,用于检测和分析眼部疾病,包括年龄相关性黄斑变性、糖尿病视网膜病变和青光眼。常用的诊断模态包括视网膜摄影、扫描激光眼底镜(SLO)和荧光素血管造影(FA)。通常,视网膜血管分割是手动或交互式进行的,这使得过程耗时且容易出错。在本研究中,我们提出了一种新的多模态分割框架,称为ELEMENT(vEsseL sEgmentation using Machine lEarning and coNnecTivity)。该框架由区域生长和机器学习进行的特征提取和像素分类组成。所提出的特征基于灰度级和血管连通性属性捕获互补证据。后者信息在分类阶段无缝传播通过像素。ELEMENT减少了不一致性和加快了分割吞吐量。我们分析并比较了所提出方法与现有血管分割算法在三个主要实验组中的性能,针对每种眼部模态。我们的方法产生了更高的整体性能,整体准确率为97.40%,优于26种现有方法中的25种,包括6种基于深度学习的工作,评估在广泛知名的DRIVE视网膜图像数据集上。在STARE、CHASE-DB、VAMPIRE FA、IOSTAR SLO和RC-SLO数据集中,所提出的框架分别以98.27%、97.78%、98.34%、98.04%和98.35%的准确率超过了所有现有方法。

英文摘要

Vascular structures in the retina contain important information for the detection and analysis of ocular diseases, including age-related macular degeneration, diabetic retinopathy and glaucoma. Commonly used modalities in diagnosis of these diseases are fundus photography, scanning laser ophthalmoscope (SLO) and fluorescein angiography (FA). Typically, retinal vessel segmentation is carried out either manually or interactively, which makes it time consuming and prone to human errors. In this research, we propose a new multi-modal framework for vessel segmentation called ELEMENT (vEsseL sEgmentation using Machine lEarning and coNnecTivity). This framework consists of feature extraction and pixel-based classification using region growing and machine learning. The proposed features capture complementary evidence based on grey level and vessel connectivity properties. The latter information is seamlessly propagated through the pixels at the classification phase. ELEMENT reduces inconsistencies and speeds up the segmentation throughput. We analyze and compare the performance of the proposed approach against state-of-the-art vessel segmentation algorithms in three major groups of experiments, for each of the ocular modalities. Our method produced higher overall performance, with an overall accuracy of 97.40%, compared to 25 of the 26 state-of-the-art approaches, including six works based on deep learning, evaluated on the widely known DRIVE fundus image dataset. In the case of the STARE, CHASE-DB, VAMPIRE FA, IOSTAR SLO and RC-SLO datasets, the proposed framework outperformed all of the state-of-the-art methods with accuracies of 98.27%, 97.78%, 98.34%, 98.04% and 98.35%, respectively.

2605.20450 2026-05-21 cs.LG cs.CR

SMA-DP: Spectral Memory-Aware Differential Privacy for Deep Learning

SMA-DP:基于频谱记忆的差分隐私用于深度学习

Mohammad Partohaghighi, Roummel Marcia

AI总结 本文提出了一种名为SMA-DP-SGD的差分隐私随机梯度下降方法,通过引入频谱记忆分支来增强DP-SGD的隐私保护性能,从而在多个数据集上实现了更优的准确率和隐私保护。

详情
AI中文摘要

差分隐私随机梯度下降(DP-SGD)通过每个示例裁剪和校准的高斯噪声实现私人的深度学习,但其高方差更新会降低在具有挑战性的数据集上的效用。我们提出了SMA-DP-SGD,一种基于频谱记忆的差分隐私随机梯度下降方法,该方法通过在之前隐私化噪声发布中构建的分数记忆分支来增强DP-SGD。受WeightWatcher启发的幂律频谱指数提供了组级可靠性信号,在实验中以层级方式实现,以适应衰减和有效记忆深度。隐私历史对齐、范数匹配和激活预热稳定了记忆贡献。隐私保持透明:在给定隐私发布历史的条件下,记忆分支是固定的,而唯一新的数据依赖项是当前裁剪总和乘以固定系数β。因此,SMA-DP-SGD保持了干净的条件敏感度结构,并且当β=1时,精确恢复组级DP-SGD。在CIFAR-100、CIFAR-10和MNIST上的实验显示,SMA-DP-SGD在多个DP优化基线中表现竞争或更优,尤其在CIFAR-100和CIFAR-10上获得最大收益。CIFAR-10的消融实验显示,β控制隐私-效用轨迹,而频谱和记忆诊断确认了受控的短至中等有效记忆深度和小的记忆分支比。运行时分析显示,该机制带来了额外的开销,大约是DP-SGD的2.94倍,在我们的CIFAR-10实现中,揭示了适应性隐私记忆与计算成本之间的实际权衡。

英文摘要

Differentially private stochastic gradient descent (DP-SGD) enables private deep learning through per-example clipping and calibrated Gaussian noise, but its high-variance updates can reduce utility on challenging datasets. We propose \textbf{SMA-DP-SGD}, a \textbf{Spectral Memory-Aware Differentially Private Stochastic Gradient Descent} method that augments DP-SGD with a fractional memory branch built only from previously privatized noisy releases. WeightWatcher-inspired power-law spectral exponents provide group-wise reliability signals, instantiated layer-wise in our experiments, to adapt the decay and effective memory depth. Private-history alignment, norm matching, and warm-up activation stabilize the memory contribution. Privacy remains transparent: conditioned on the private release history, the memory branch is fixed, and the only newly data-dependent term is the current clipped sum scaled by a fixed coefficient \(β\). Hence, SMA-DP-SGD preserves a clean conditional sensitivity structure and exactly recovers group-wise DP-SGD when \(β=1\). Experiments on CIFAR-100, CIFAR-10, and MNIST show competitive or superior accuracy over several DP optimization baselines, with the largest gains on CIFAR-100 and CIFAR-10. CIFAR-10 ablations show that \(β\) controls the privacy--utility trajectory, while spectral and memory diagnostics confirm a controlled short-to-moderate effective memory depth and a small memory-branch ratio. Runtime analysis shows that the mechanism incurs additional overhead, about \(2.94\times\) DP-SGD in our CIFAR-10 implementation, revealing a practical trade-off between adaptive private memory and computational cost.

2605.20449 2026-05-21 cs.LG cs.AI

LLM Pretraining Shapes a Generalizable Manifold: Insights into Cross-Modal Transfer to Time Series

LLM预训练塑造了可泛化的流形:跨模态迁移至时间序列的洞察

Alexis Roger, Prateek Humane, Zhenghan Tai, Gwen Legate, Andrei Mircea, Vasilii Feofanov, Irina Rish

AI总结 研究探讨了语言预训练的Transformer能否成为有效的时序预测器,并揭示了跨模态迁移的机制,指出预训练构建了流形,微调则将数值动态投影到任务相关方向。

详情
AI中文摘要

语言预训练的Transformer能否成为有效的时序预测器,以及原因是什么?本文表明,跨模态迁移出现是因为语言预训练为时序训练预设了一个可重用的流形。在冻结的LLM状态上进行线性探测可以解码出真实的时序轨迹而无需配对监督,该投影空间中的检索能产生具有竞争力的预测,表明在微调之前就已经存在结构和动态。预训练初始化还提升了优化效果,产生连贯的梯度和高度各向异性的损失景观,不同于随机初始化。微调则起到低维对齐的作用,重用已有的方向而非从头学习时间原始特性,这通过低秩更新、子空间对齐和共享的周期性、趋势和重复特征得到证实。这些结果支持了LLM到时序迁移的几何解释:语言预训练构建了流形,微调将数值动态投影到任务相关方向上。

英文摘要

Can language-pretrained transformers become effective time-series forecasters, and why? In this paper, we show that cross-modal transfer arises because language pretraining preconditions time series training with a reusable manifold. A linear probe on frozen LLM states decodes realistic time-series trajectories without paired supervision, and retrieval in this projected space yields competitive forecasts, showing that structure and dynamics exist before finetuning. Pretrained initialization also improves optimization, producing coherent gradients and a highly anisotropic loss landscape unlike random initialization. Finetuning then acts as low-dimensional alignment, reusing existing directions rather than learning temporal primitives from scratch, as evidenced by low-rank updates, subspace alignment, and shared features for periodicity, trend, and repetition. Together, these results support a geometric account of LLM-to-time-series transfer: language pretraining builds the manifold, and finetuning projects numerical dynamics onto task-relevant directions.

2605.20448 2026-05-21 cs.CV cs.LG

Do Vision--Language Models Understand 3D Scenes or Just Catalogue Objects?

视觉-语言模型是理解3D场景还是仅仅 catalogue 物体?

Animesh Maheshwari, Divyansh Sahu, Nishit Verma

AI总结 本文通过一个包含3034个样本的人工整理基准,探讨了视觉-语言模型对空间理解的深度有序遮挡、光学几何推断和体积重新安排规划能力,发现模型在重新安排可见布局时表现优异,但在遮挡和反射推断上表现较差。

详情
AI中文摘要

视觉-语言模型能够可靠地命名场景中的物体,但它们是否代表这些物体所处的3D布局?我们引入了一个包含3034个样本的人工整理基准,针对空间理解的三个组成部分:深度有序遮挡(通过三种独立的反事实操作化进行探测)、可见反射的光学几何推断,以及体积重新安排规划。六个前沿和开放权重的VLMs在18,204个响应上由训练注释者评分,没有使用LLM作为判断标准,揭示了明显的分离:在53-97%的准确率下,能够对可见布局进行重新安排的模型,在遮挡任务中表现不佳,仅在6-45%之间,而在反射任务中低于7%。一个具身推理模型重现了相同的模式。对Qwen3-VL-8B-Thinking的白盒分析显示,失败归因于视觉标记合并:在视觉编码器中可恢复的空间信息在标记压缩后变得不可用,只有在清洁的标记合并后激活被重新引入语言解码器后才恢复。

英文摘要

Vision--language models reliably name objects in a scene, but do they represent the 3D layout those objects inhabit? We introduce a 3,034-sample human-curated benchmark targeting three components of spatial understanding: depth-ordered occlusion (probed via three independent counterfactual operationalisations), optical-geometry inference over visible reflections, and volumetric rearrangement planning. Six frontier and open-weight VLMs, scored by trained annotators on 18,204 responses with no LLM-as-judge, reveal a sharp dissociation: models that plan rearrangements over visible layouts at 53--97% accuracy and rarely violate collision constraints fall to 6--45% on occlusion and below 7% on reflections. An embodied-reasoning model reproduces the same profile. White-box analysis on Qwen3-VL-8B-Thinking localises the failure to the visual-token merger: spatial information recoverable throughout the vision encoder becomes inaccessible after token compression and only stabilises again when clean post-merger activations are patched into the language decoder.

2605.20445 2026-05-21 cs.CV cs.AI

A Comprehensive Comparison of Deep Learning Architectures for COVID-19 Classification on CT & X-ray Imagery

对用于CT和X光影像中新冠分类的深度学习架构的全面比较

Sarmad Khan, Arslan Shaukat, Umer Asgher, Basim Azam

AI总结 本文通过比较多种深度学习架构,提出基于卷积神经网络的计算机辅助诊断系统,以区分新冠和正常肺部影像,并在X光和CT数据集上取得了95至98%的平均准确率。

Comments 6 pages, 2 figures, 5 tables

详情
AI中文摘要

新冠是一种造成大量人员伤亡的重大挑战,不仅涉及某些国家,甚至全球也因冠状病毒而遭受影响。使用计算断层扫描(CT)和X光的肺部影像技术是新冠或其他大流行病筛查过程中最有效的工具。如今,技术已通过人工智能取代手动过程,用自动化机器使系统能够模仿人类大脑,通过经验做出明智决策。受此启发,我们的工作提出使用卷积神经网络(CNN)模型设计一个计算机辅助诊断(CAD)系统,以区分新冠和正常肺部影像。我们使用了两组不同的肺部X光影像和两组不同的CT扫描,并利用预训练的多种网络(如VGG(16, 19)、Densenet(121)、Resnet(50, 50 V2, 101 V2)、MobileNet(V2)、Xception Inception(V3, Resnet V2)、EfficientNet(B0)和Nasnet(Large))进行分类。在X光和CT图像数据集上,Resnet和VGG架构显示出能够正确区分新冠和正常图像的能力,平均准确率分别为95至98%。我们在分类数据集上的结果具有竞争力,并优于文献中已报告的发现。

英文摘要

COVID-19 was a significant challenge that led to the loss of numerous lives daily. Not only a certain country was involved in this outbreak, but even the world has suffered because of the coronavirus. Imaging techniques using computed tomography (CT) and X-rays of the lungs are the most useful tools for the COVID-19 or any other pandemic disease screening process. Technology today has revolutionized the world by using artificial intelligence to replace manual processes with automated machines, which enable the system to imitate the human brain by making wise decisions based on experience. Motivated by this, our work proposes to use convolutional neural networks (CNN) based models for designing a computer-aided diagnosis (CAD) system that differentiates between COVID-19 and healthy lung pictures. We used two different sets of X-ray images of the lungs in addition to two different sets of CT scans and the classification is done using a variety of networks that have been pre-trained such as VGG (16, 19), Densenet (121), Resnet (50, 50 V2, 101 V2), Mobile net (V2), Xception Inception (V3, Resnet V2), Efficient net (B0) and Nasnet (Large). On the X-ray and CT image datasets, Resnet and VGG architecture have shown the ability to properly differentiate COVID-19 from normal images, with an average accuracy of 95 to 98 percent respectively. Our acquired results on the classification datasets are competitive and superior to previously reported findings in the literature.

2605.20441 2026-05-21 cs.LG cs.AI cs.NE

Weight Decay Regimes in Grokking Transformers: Cheap Online Diagnostics

Transformer在Grokking中的权重衰减区域:廉价的在线诊断

Lucky Verma

AI总结 研究探讨了在模运算中训练的Transformer模型在记忆、泛化和崩溃之间的尖锐转变,并通过权重衰减作为标量经验控制参数来分析这些区域,引入了两种廉价的在线诊断方法,通过注意力激活来跟踪训练动态,并在较低计算成本下补充损失景观诊断。

Comments 28 pages, 11 figures, 5 tables. Code and aggregate JSONs: https://github.com/lucky-verma/grokking-diagnostics. Per-run JSONs: https://huggingface.co/datasets/lucky-verma/grokking-diagnostics-runs. Lean 4/mathlib v4.29.0 formal checks available in the code repository

详情
AI中文摘要

在模运算中训练的Transformer模型表现出记忆、泛化和崩溃之间的尖锐转变。我们证明权重衰减作为这些区域的标量经验控制参数,并引入了两种廉价的在线诊断方法,即平均成对注意力头余弦相似度和熵标准差,这些方法仅通过注意力激活来跟踪训练动态,并在较低计算成本下补充损失景观诊断。在十一种实验条件和三种模型规模(0.82M到85M参数)中,权重衰减轴将记忆、发展性Grokking和崩溃分开。一个接近临界点的逻辑拟合将记忆到发展性的边界定位在λ_c=0.0158(95%置信区间[0.0109, 0.0200],N=210);一个幂律拟合给出经验指数ν=0.757(置信区间[0.725, 0.799])。参考指数ν=1/2和3D伊辛ν≈0.63在我们四格网格下位于此经验置信区间之外,因此我们报告ν为经验值,并将临界点类别的识别推迟到更密集的有限大小缩放工作。一个与地平线匹配的多任务复制(n=280,四个模运算)保留了权重衰减控制模式;在λ=0.05时进行的配对注意力头重新初始化实验改变了阶段2的振幅(Cohen的d=-1.190,n=10,p_t=4.5×10^-3),而匹配的权重范数裁剪则没有。三个跨架构探测(4L MLP,4L LSTM和4L Mamba;每个n=70)在小Transformer注意力模型的模运算中复制了权重衰减控制的转变,具有架构特定的λ_c值。主要诊断主张限于小Transformer注意力模型的模运算;非注意力实验是范围探测,架构广泛、语言模型和临界点类别的主张超出范围。

英文摘要

Transformers trained on modular arithmetic exhibit sharp transitions between memorization, generalization, and collapse. We show that weight decay acts as a scalar empirical control parameter for these regimes, and introduce two cheap online diagnostics, mean pairwise attention-head cosine similarity and entropy standard deviation, that track training dynamics from attention activations alone and complement loss-landscape diagnostics at lower compute cost. Across eleven experimental conditions and three model scales (0.82M to 85M parameters), the weight-decay axis separates memorization, developmental grokking, and collapse. A near-transition logistic fit localizes the memorization-to-developmental boundary at $λ_c=0.0158$ (95% CI [0.0109, 0.0200], N=210); a power-law fit gives an empirical exponent $ν=0.757$ (CI [0.725, 0.799]). Reference exponents $ν=1/2$ and 3D Ising $ν\approx 0.63$ lie outside this empirical CI under our four-bin grid, so we report $ν$ as empirical and defer universality-class identification to denser finite-size-scaling work. A horizon-matched multi-task replication (n=280, four modular operations) preserves the weight-decay control pattern; a paired attention-head re-initialization experiment at $λ=0.05$ changes Phase-2 amplitude (Cohen's $d=-1.190$, n=10, $p_t=4.5 \times 10^{-3}$), while matched weight-norm clipping does not. Three cross-architecture probes (4L MLP, 4L LSTM, and 4L Mamba; each n=70) replicate the weight-decay-controlled transition with architecture-specific $λ_c$ values. Main diagnostic claims are scoped to modular arithmetic in small transformer attention models; the non-attention experiments are scope probes, and architecture-wide, language-model, and universality-class claims are out of scope.

2605.20440 2026-05-21 cs.LG cs.AI math.RA

Group-Algebraic Tensors: Provably-optimal Equivariant Learning and Physical Symmetry Discovery

群代数张量:可证明最优的等变学习与物理对称性发现

Paulina Hoyos, Shashanka Ubaru, Dongsung Huh, Vasileios Kalantzis, Kenneth L. Clarkson, Misha Kilmer, Haim Avron, Lior Horesh

AI总结 本文提出了一种群代数张量框架,通过将有限群G的乘法规则引入张量代数,使等变性成为代数属性而非架构限制。该框架基于三个理论支柱:(i) Eckart-Young最优性保证的星G-SVD;(ii)通过Kronecker分解组合多个对称性;(iii)600行的Lean4形式化证明。该框架提供了等变神经网络无法实现的能力:每个预测的闭式分解和数据驱动发现最佳对称群。在QM9分子几何上,通过八面体子群恢复角动量选择规则,展示了数据驱动的物理发现。

详情
AI中文摘要

我们引入了$\star_G$张量代数,在其中任何有限群$G$定义乘法规则,使等变性成为代数属性而非架构约束。该框架基于三个机器验证的理论支柱:(i) $\star_G$-SVD的Eckart-Young最优性保证,是首个对称保持张量近似的结果,精确且多项式时间;(ii) 通过Kronecker分解组合多个对称性,通过将$F_G$替换为$F_{G_1} \otimes F_{G_2}$无需架构重设计;(iii) 600行的Lean~4形式化证明了$\star_G$代数。该框架提供了等变神经网络(ENNs)结构无法实现的能力:每个预测的闭式分解,以及数据驱动发现最佳对称群。作为非平凡的实证演示,分解QM9分子几何的八面体子群恢复了角动量选择规则,仅凭数据而非量子力学输入:标量性质由A$_1$主导,偶极子成分由T$_1$主导,各向异性极化率对l=1不敏感,因为秩2迹分解l=0⊕l=2要求,T$_1$/A$_1$预测能力比将向量可观测量与标量可观测量分离了五倍。在完整的QM9(130,831分子)上,$\star_G$-SVD与岭回归提供闭式预测,参数数量比参数匹配的MLP少50-90倍。代数等变性因此补充架构等变性,不是更快、更好、更便宜的替代方案,而是不同的数学能力:可证明最优的对称保持压缩,每irrep可解释性,以及数据驱动的物理发现。

英文摘要

We introduce the $\star_G$ tensor algebra, in which any finite group $G$ defines the multiplication rule, making equivariance an intrinsic algebraic property rather than an architectural constraint. The framework rests on three machine-verified theoretical pillars: (i)~an Eckart-Young optimality guarantee for the $\star_G$-SVD: the first such result for symmetry-preserving tensor approximation, exact and polynomial-time; (ii)~a Kronecker factorization that composes multiple symmetries by replacing $F_G$ with $F_{G_1} \otimes F_{G_2}$ with no architectural redesign; and (iii)~a 600-line Lean~4 formalization of the $\star_G$ algebra. The framework provides capabilities that equivariant neural networks (ENNs) structurally cannot: a closed-form per-irreducible-representation decomposition of every prediction, and data-driven discovery of the symmetry group that best fits a dataset. As a non-trivial empirical demonstration, decomposing QM9 molecular geometry over the chiral octahedral subgroup of SO(3) recovers the Wigner--Eckart selection rules of angular momentum from data alone, with no quantum mechanical input: scalar properties are A$_1$-dominated, dipole components are T$_1$-dominated, the isotropic polarizability is uniquely insensitive to $l\!=\!1$ as the rank-2-trace decomposition $l\!=\!0 \oplus l\!=\!2$ requires, and the T$_1$/A$_1$ predictive-power ratio separates vector observables from scalar observables by a factor of five. On full QM9 (130{,}831 molecules), $\star_G$-SVD with ridge regression provides closed form predictions at $\sim50-90\times$ fewer parameters than parameter-matched MLPs. Algebraic equivariance thus complements architectural equivariance not as a faster-better-cheaper alternative but as a different mathematical affordance: provably-optimal symmetry-preserving compression, per-irrep interpretability, and data-driven physical discovery.

2605.20439 2026-05-21 cs.LG cs.HC

Can Conversational XAI Improve User Performance? An Experimental Study

对话式XAI能否提升用户表现?一项实证研究

Sven Kruschel, Julian Rosenberger, Lasse Bohlen, Mathias Kraus, Patrick Zschech

AI总结 本研究通过实验评估对话式XAI对用户表现的影响,探讨其在预测准确性、模型理解和错误识别方面的核心方法及主要贡献。

Comments Accepted at Thirty-Fourth European Conference on Information Systems (ECIS 2026), Milan, Italy

详情
AI中文摘要

可解释人工智能(XAI)技术旨在为预测模型提供洞察并提升用户表现,但往往未能达到这些期望。对话式XAI助手承诺克服这些限制,但关于其对客观性能指标影响的实证证据仍然有限。我们提出了一种实验设计,通过预测准确性、模型理解和错误识别来评估解释辅助。使用一个可解释性设计的预测模型,我们创建了用户能够通过识别和补偿系统性误差而超越模型的条件。我们将对话辅助与问答辅助进行比较,以评估哪种辅助更有效地支持用户与模型解释互动。初步测试我们实验设计的结果显示,两组参与者(N=42)均显著超越了模型,但两种辅助类型之间没有表现差异,整体参与度较为有限。这些发现为我们的计划全面研究提供了改进方向,包括增强的参与干预措施和对驱动预测改进机制的调查。

英文摘要

Explainable AI (XAI) techniques aim to provide insights into predictive models and enhance user performance, yet they often fall short of these expectations. Conversational XAI assistants promise to overcome such limitations, but empirical evidence on their impact on objective performance measures remains limited. We propose an experimental design for evaluating explanation assistance through prediction accuracy, model understanding, and error identification. Using an explainable-by-design prediction model, we create conditions where users can outperform the model by identifying and compensating for systematic errors. We compare conversational assistance against Q&A-based assistance to assess which better supports users in working with model explanations. Preliminary results from testing our experimental design show that participants (N=42) in both treatments significantly outperformed the model but reveal no performance differences between assistance types and modest engagement overall. These findings inform refinements for our planned full study, including enhanced engagement interventions and investigation of the mechanisms driving improved predictions.

2605.20433 2026-05-21 cs.RO

Spacetime Optimal-Transport Attention for Visuo-Haptic Imitation Learning of Contact-Rich Manipulation

时空最优传输注意力用于视觉-触觉模仿学习中的富接触操作

Yue Feng, Weicheng Huang, I-Ming Chen

AI总结 本文提出了一种三模态融合框架Spacetime Optimal-Transport Attention (SO-TA),通过熵正则化的最优传输对力-姿态衍生的子查询和视觉块进行对齐,以解决富接触操作中多模态信息融合的问题,并在真实机器人任务中实现了高成功率。

Comments 8 pages, 16 figures, 3 tables. Preprint

详情
AI中文摘要

接触密集的操作任务,如紧密间隙插入、连接器配合、抛光和表面适应擦拭,仍然难以由数据驱动控制器处理,因为它们耦合了不连续接触动力学、部分可观测性和严格的安全约束。单一传感模态不足以满足需求:视觉在接触前提供全局上下文,力/扭矩(F/T)反馈在接触后控制交互,而本体感觉姿态提供一致的运动学骨架。大多数先前的接触密集任务模仿学习策略仅在单模态或双模态信号上操作,而少数融合三种模态的策略通常采用现成的注意力模块,没有明确的先验知识指导注意力质量如何分布在任务相关的区域。我们提出了Spacetime Optimal-Transport Attention (SO-TA),一种三模态融合骨干,用熵正则化的最优传输(OT)对齐代替softmax归一化的块注意力。显式的边缘约束作为结构化的归纳偏置,鼓励在接触密集任务中条件感知的空间选择,这种选择在光照、干扰和部分遮挡下保持稳定。SO-TA与基于扩散的序列策略相结合,将观察窗口映射到姿态-动作块。我们在三个真实机器人任务上评估了SO-TA:紧密圆柱体装配、BCM布线连接器插入和曲面标记擦除。在每个条件约200次滚出下,SO-TA在紧密圆柱体装配任务中达到100%的成功率,而在匹配容量下的交叉注意力为93%,在光照、干扰和部分遮挡扰动下保持82.5%的成功率,而连接基线降至43.5%。OT衍生的块热图和留一法模态影响比提供可解释的、相位依赖的诊断。

英文摘要

Contact-rich manipulation tasks such as tight-clearance insertion, connector mating, polishing, and surface-conforming wiping remain difficult for data-driven controllers because they couple discontinuous contact dynamics, partial observability, and strict safety constraints. No single sensing modality suffices: vision supplies global context before contact, force/torque (F/T) feedback governs interaction after contact, and proprioceptive pose provides a consistent kinematic backbone. Most prior imitation-learning policies for contact-rich tasks operate on uni- or bi-modal signals, and the few that fuse three modalities typically adopt off-the-shelf attention modules with no explicit prior on how attention mass should be distributed across task-relevant regions. We present Spacetime Optimal-Transport Attention (SO-TA), a tri-modal fusion backbone that replaces softmax-normalized patch attention by an entropy-regularized Optimal Transport (OT) alignment between force-pose-derived sub-queries and visual patches. Explicit marginal constraints act as a structured inductive bias for contact-rich tasks, encouraging conditioning-aware spatial selection that is stable across illumination, distractors, and partial occlusion. SO-TA is paired with a diffusion-based sequence policy mapping observation windows to pose-action chunks. We evaluate SO-TA on three real-robot tasks: tight peg-in-hole assembly, BCM wiring-connector insertion, and curved-surface mark erasing. With ~200 rollouts per condition, SO-TA reaches 100% success on tight peg-in-hole versus 93% for cross-attention at matched capacity, and retains 82.5% success under illumination, distractor, and partial-occlusion perturbations where a concatenation baseline drops to 43.5%. OT-derived patch heatmaps and leave-one-out modality-influence ratios provide interpretable, phase-dependent diagnostics.

2605.20425 2026-05-21 cs.AI

AgentCo-op: Retrieval-Based Synthesis of Interoperable Multi-Agent Workflows

AgentCo-op: 基于检索的互操作多智能体工作流合成

Shuaike Shen, Wenduo Cheng, Shike Wang, Mingqian Ma, Jian Ma

AI总结 本文提出AgentCo-op,一种基于检索的多智能体工作流合成框架,通过类型化任务传递合成可执行工作流,并在执行证据表明失败时应用有界自引导局部修复。在两个开放世界基因组学案例研究中,AgentCo-op在不重设计或运行全局拓扑搜索的情况下,将独立开发的科学智能体和外部工具库整合到可审计的工作流中。

详情
AI中文摘要

在开放性科学环境中设计多智能体工作流尤其困难,因为任务缺乏经过精心编写的训练集、可靠的标量评估度量和现有工具和智能体之间的标准化接口。我们提出AgentCo-op,一种基于检索的合成框架,通过类型化任务传递将可重用的技能、工具和外部智能体组合成可执行的工作流,然后在执行证据表明失败时应用有界自引导局部修复。在两个开放世界基因组学案例研究中,AgentCo-op将独立开发的科学智能体和外部工具库整合到可审计的工作流中,而无需重设计或运行全局拓扑搜索。它协调专门的智能体进行空间转录组学和基因集解释,以从空间转录组学数据中实现协作发现,并为单细胞多组数据的跨模态标记分析构建并行工作流。AgentCo-op还可以将搜索的工作流作为结构先验导入,并通过用检索到的组件 grounding 节点并应用局部修复来改进它,表明合成和搜索是互补的。在六个编码、数学和问答基准测试中,AgentCo-op在四个基准测试中取得最佳结果,在统一骨干设置下获得最佳平均分数,同时相对于多智能体基线始终减少每任务成本。这些结果共同表明,基于检索的合成可以将自动化智能体工作流设计扩展到由现有智能体、工具和类型化艺术制品构建的开放世界工作流中。

英文摘要

Designing multi-agent workflows is especially difficult in open-ended scientific settings where tasks lack curated training sets, reliable scalar evaluation metrics, and standardized interfaces between existing tools and agents. We propose AgentCo-op, a retrieval-based synthesis framework that composes reusable skills, tools, and external agents into executable workflows through typed artifact handoffs, then applies bounded self-guided local repair to implicated components when execution evidence indicates failure. In two open-world genomics case studies, AgentCo-op composes independently developed scientific agents and external tool repositories into auditable workflows without redesigning them or running global topology search. It coordinates specialized agents for spatial transcriptomics and gene-set interpretation to enable collaborative discovery from spatial transcriptomics data, and builds a parallel workflow for cross-modality marker analysis on single-cell multiome data. AgentCo-op can also import a searched workflow as a structural prior and improve it by grounding nodes with retrieved components and applying local repair, showing that synthesis and search are complementary. On six coding, math, and question-answering benchmarks, AgentCo-op achieves the best result on four benchmarks and the best average score under a unified backbone setting, while consistently reducing per-task cost relative to multi-agent baselines. Together, these results suggest that retrieval-based synthesis can extend automated agentic workflow design beyond benchmark-optimized agent graphs to open-world workflows built from existing agents, tools, and typed artifacts.

2605.20423 2026-05-21 cs.AI

OSCToM: RL-Guided Adversarial Generation for High-Order Theory of Mind

OSCToM: 用于高阶理论之心的强化学习引导对抗生成

Sharmin Sultana Srishty, Kazi Mahathir Rahman, Malaika Parizat Sakkhi, Samia Shahid Prianna, Shaikhul Islam Sinat

AI总结 本文提出OSCToM,一种结合强化学习、领域特定语言和组合替代模型的方法,用于建模LLM中的嵌套信念冲突,通过生成观察者-自我冲突来提升复杂社会场景下的理论之心推理能力。

Comments 15 pages, 12 figures containing 15 images, 3 tables. Code available at https://github.com/sharminsrishty/osct

详情
AI中文摘要

大型语言模型(LLMs)在许多语言任务上表现良好,但其在复杂社会场景中的理论之心(ToM)推理仍不均衡。现有基准测试,如ExploreToM,往往不总是测试导致这些场景困难的递归信念和信息不对称。本文提出了OSCToM(观察者-自我冲突理论之心),一种用于建模LLM基于理论之心任务中嵌套信念冲突的方法。关键案例是观察者对另一个代理的看法与其自身信念状态冲突的情况。此类情况超越了简单的视角转换,需要递归、多层推理。OSCToM结合强化学习(RL)、扩展的领域特定语言和组合替代模型来生成观察者-自我冲突。在我们的实验中,OSCToM-8B在测试系统中表现最佳。它在FANToM上优于已报告的ExploreToM结果,并在Hi-ToM和BigToM上保持竞争力。在信息不对称的FANToM基准测试中,OSCToM达到76%的准确率,相比ExploreToM报告的0.2%。数据合成过程也提高了6倍的效率,表明有针对性的训练数据可以帮助较小的模型处理高级认知推理。项目代码可在https://github.com/sharminsrishty/osct上找到。

英文摘要

Large Language Models (LLMs) perform well on many language tasks, but their Theory of Mind (ToM) reasoning is still uneven in complex social settings. Existing benchmarks, including ExploreToM, do not always test the recursive beliefs and information asymmetries that make these settings difficult. This paper presents OSCToM (Observer-Self Conflict Theory of Mind), an approach for modeling nested belief conflicts in LLM-based ToM tasks. The key case is one in which an observer's view of another agent conflicts with the observer's own belief state. Such cases go beyond simple perspective-taking and require recursive, multi-layered reasoning. OSCToM combines reinforcement learning (RL), an extended domain-specific language, and compositional surrogate models to generate observer-self conflicts. In our experiments, OSCToM-8B gives the best overall result among the systems tested. It improves on the reported ExploreToM results on FANToM and remains competitive on Hi-ToM and BigToM. On the information-asymmetric FANToM benchmark, OSCToM reaches 76% accuracy, compared with the 0.2% reported by ExploreToM. The data-synthesis procedure is also 6x more efficient, indicating that targeted training data can help smaller models handle advanced cognitive reasoning. The project code is available at https://github.com/sharminsrishty/osct.

2605.20413 2026-05-21 cs.LG

Supervised Latent Restructuring for Small-Data Quantum Learning in Plant Phenomics

监督潜在重构在植物表型小数据量子学习中的应用

Alakananda Mitra, David H. Fleisher, Vangimalla Reddy, Chittaranjan Ray

AI总结 本文研究了在小数据条件下,通过监督潜在重构提升植物表型数据中高维特征压缩的几何分离性,提出混合工作流程结合PCA和LDA进行潜在空间重构,并利用GPU加速的量子核对齐方法,发现潜在几何结构在小数据量子学习中是关键设计变量。

Comments 11 pages, 4 Tables, 3 Figures

详情
AI中文摘要

高维生物数据往往表现出特征维度与样本数量之间的严重不匹配,这使得在极小数据条件下可靠分类变得困难。在这些情况下,核方法在潜在压缩无法保持类别分离结构时会失去判别能力。我们研究了细粒度植物表型学中的这一问题,并提出了一种混合工作流程,将1280维的深度图像嵌入压缩到64维的PCA空间,然后通过线性判别分析(LDA)重构为11维的监督潜在空间,并在NVIDIA L40S硬件上进行GPU加速的量子核对齐(QKA)。实证研究表明,监督潜在重构显著提高了压缩表示的几何分离性,使轮廓系数从原始嵌入空间中的0.003和PCA-64空间中的-0.006增加到监督LDA-11空间中的0.197。然而,下游经典评估显示存在明显的压缩权衡:线性SVM和XGBoost在重构的潜在空间中有所改善,而RBF-SVM和随机森林在相同的11维瓶颈下则有所下降。在受限的优化预算下,该领域的QKA仍然具有挑战性,表明潜在几何结构本身不足以实现强可训练的量子性能。这些发现将表示几何学定位为小数据量子学习中的关键设计变量,并揭示了从剧烈压缩的生物表示中恢复非线性判别结构的实践难度。

英文摘要

High-dimensional biological data often exhibit a severe mismatch between feature dimensionality and sample size, making reliable classification difficult in extremely small-data regimes. In these settings, kernel methods can lose discriminative power when latent compression fails to preserve class-separating structure. We study this problem in fine-grained plant phenomics and propose a hybrid workflow that compresses 1280-dimensional deep image embeddings into a 64-dimensional PCA space and then restructures them into an 11-dimensional supervised latent space using Linear Discriminant Analysis (LDA), followed by GPU-accelerated Quantum Kernel Alignment (QKA) on NVIDIA L40S hardware. Empirically, supervised latent restructuring substantially improves the geometric separability of the compressed representation, increasing the Silhouette coefficient from 0.003 in the raw embedding space and -0.006 in PCA-64 to 0.197 in the supervised LDA-11 space. However, downstream classical evaluation reveals a clear compression trade-off: Linear SVM and XGBoost improve in the restructured latent space, whereas RBF-SVM and Random Forest degrade under the same 11-dimensional bottleneck. Under a constrained optimization budget, QKA in this regime remains challenging, indicating that latent geometry alone is not sufficient for strong trainable quantum performance. These findings position representation geometry as a central design variable in small-data quantum learning and expose the practical difficulty of recovering nonlinear discriminative structure from aggressively compressed biological representations.

2605.20410 2026-05-21 cs.CL cs.AI

Mechanics of Bias and Reasoning: Interpreting the Impact of Chain-of-Thought Prompting on Gender Bias in LLMs

偏见与推理的力学:分析链式推理提示对大语言模型中性别偏见的影响

Edie Pearman, Sophia Osborne, Mira Kandlikar-Bloch, Mina Arzaghi, Florian Carichon, Golnoosh Farnadi

AI总结 本文研究了链式推理提示对大语言模型中性别偏见的影响,结合基准测试评估与机制可解释性技术,发现链式推理并未有效减少偏见,偏见仍存在于隐藏表示中。

Comments 24 pages, 6 figures, including appendix. Accepted at the ICLR 2026 Workshop on Algorithmic Fairness Across Alignment Procedures and Agentic Systems. Submitted to COLM 2026

详情
AI中文摘要

尽管有大量文献表明大型语言模型(LLMs)在社会敏感领域得到广泛应用,但它们仍编码了性别偏见。链式推理(CoT)提示已被提出作为减轻偏见的方法。然而,现有评估主要关注LLM基准性能的变化,提供了有限的关于模型内部机制是否真正改变的见解。在本文中,我们研究了链式推理提示如何影响LLM中的性别偏见,结合基准测试评估与机制可解释性技术以及推理链失败分析。我们的结果证实了LLM输出中存在刻板印象偏见,显示链式推理提示并未一致减少偏见差距。机制分析显示,尽管链式推理在某些注意力头集群中平衡了偏见行为,但性别偏见仍嵌入在隐藏表示中,表明仅是表面缓解。对推理链的检查进一步表明,这些改进源于对数据集的记忆和熟悉,而非对偏见的真实理解。

英文摘要

Large language models (LLMs) are increasingly deployed in socially sensitive settings despite substantial documentation that they encode gender biases. Chain-of-Thought (CoT) prompting has been proposed as a bias-mitigation approach. However, existing evaluations primarily focus on changes in LLM benchmark performance, providing limited insight into whether apparent bias reductions reflect meaningful changes in a model's internal mechanisms. In this work, we investigate how CoT prompting affects gender bias in LLMs, combining benchmark-based evaluation with mechanistic interpretability techniques and reasoning chain failure analysis. Our results confirm a stereotypical bias present in LLM outputs across benchmarks, showing that CoT prompting does not consistently reduce the bias gap. Mechanistic analyses reveal that although CoT balances biased behavior in certain attention head clusters, gender bias remains embedded in hidden representations, indicating only superficial mitigation. Inspection of reasoning chains further suggests that these improvements stem from memorization and familiarity with the dataset rather than genuine understanding of bias.

2605.20408 2026-05-21 cs.LG

Spectral Souping: A Unified Framework for Online Preference Alignment

谱汤:一种在线偏好对齐的统一框架

Yinlam Chow, Guy Tennenholtz, Ted Yun, James Harrison, Arthur Gretton, Andre Barreto, Bo Dai

AI总结 本文提出了一种统一的在线偏好对齐框架Spectral Souping,通过发现LLM中的通用谱表示,实现了高效的模型合并,从而在不需昂贵在线重训练的情况下快速适应个体用户偏好。

详情
AI中文摘要

基于人类反馈的强化学习(RLHF)能够有效地将大型语言模型(LLMs)与聚合人类偏好对齐,但往往无法解决个体用户多样且冲突的需求。为了解决这个问题,我们引入了Spectral Souping,一种高效的在线偏好对齐统一框架。我们的贡献是发现LLM中的通用谱表示,该表示已被证明对模型合并具有高度适应性。这一理论洞察使我们能够采用两阶段方法:我们首先在离线学习中学习一组专门的策略,每个策略专注于不同的细粒度偏好维度。一个在线适应算法随后在推理时间高效地对这些策略进行“汤化”,通过合并其输出或参数,使模型能够快速适应而无需昂贵的在线重训练。在在线偏好对齐基准测试中的实验表明,我们的方法在现有最先进方法上实现了显著的性能提升,提供了一种可扩展且计算高效的方法,用于动态适应LLMs以适应个体用户偏好。

英文摘要

Reinforcement Learning from Human Feedback (RLHF) effectively aligns Large Language Models (LLMs) with aggregate human preferences but often fails to address the diverse and conflicting needs of individual users. To overcome this issue, we introduce Spectral Souping, a unified framework for efficient, online preference alignment. Our contribution is the discovery of a universal spectral representation within LLMs, which is proven to be highly amenable to model merging. This theoretical insight enables a two-phase methodology: we first learn a basis of specialized policies offline, each focused on a distinct, fine-grained preference dimension. An online adaptation algorithm then efficiently ``soups'' these policies at inference time, either by merging their outputs or parameters, enabling rapid model adaptation without the need for costly online retraining w.r.t. tailored preference rewards. Experiments on online preference alignment benchmarks demonstrate that our method achieves significant performance improvements over existing state-of-the-art approaches, presenting a scalable and computationally efficient solution for dynamically adapting LLMs to individual user preferences.

2605.20404 2026-05-21 cs.CL

Puzzled By ChatGPT? No more! A Jigsaw Puzzle to Promote AI Literacy and Awareness

对ChatGPT感到困惑?不了!一个拼图游戏以促进AI素养和意识

Francesca Padovani, Malvina Nissim

AI总结 本文提出了一种基于游戏的互动拼图,通过完成后的漫画信息图来展示生成AI的工作原理、能力、限制及社会影响,并通过漫画草图作为独立信息卡提供具体AI使用、设计和影响的解释,以提高公众对AI的理解和意识。

详情
AI中文摘要

生成式AI的快速采用,包括基于LLM的聊天机器人如ChatGPT,凸显了需要支持公众理解和AI素养的可访问方法。为了解决这一需求,我们介绍了一种基于游戏的互动方法,形式为一个拼图,其完成图像是一幅基于漫画的信息图,展示了这些技术的工作原理、能力、限制和社会影响。每个漫画草图也作为独立的信息卡片,提供关于AI使用、设计和影响的具体解释。视觉内容是在专业插画师和多学科专家及非专家群体的实时协作会议中创建的,结合了结构化的知识和讨论期间分享的非正式、探索性反思。通过整合动手组装、视觉叙事和协作互动,该拼图提供了一种引人入胜且有趣的工具,用于在非正式学习环境中探索AI系统的机制、优点和危险。

英文摘要

The rapid adoption of Generative AI, including LLM-based chatbots like ChatGPT, has highlighted the need for accessible ways to support public understanding and AI literacy. To address this need, we introduce a game-based, interactive approach in the form of a jigsaw puzzle whose completed image is a comic-based infographic illustrating the workings, capabilities, limitations, and societal implications of these technologies. Each comic sketch also functions as a standalone informational card, providing focused explanations of specific facets of AI use, design, and impact. The visual content was created in a live collaborative session with a professional illustrator and a multidisciplinary group of experts and non experts, combining structured knowledge with informal, exploratory reflections shared during the discussion. By integrating hands-on assembly, visual storytelling, and collaborative interaction, the puzzle provides an engaging and playful tool for exploring the mechanisms, perks, and perils of AI systems in informal learning contexts.

2605.20396 2026-05-21 cs.LG stat.ML

Score-Based Causal Discovery of Latent Variable Causal Models

基于得分的潜在变量因果模型因果发现

Ignavier Ng, Xinshuai Dong, Haoyue Dai, Biwei Huang, Peter Spirtes, Kun Zhang

AI总结 本文提出了一种基于得分的方法,用于识别包含因果相关潜在变量的因果结构,并提供了可识别性保证,同时通过实验验证了方法的有效性。

Comments ICML 2024

详情
AI中文摘要

识别潜在变量及其涉及的因果结构在各种科学领域中都是至关重要的。尽管许多现有工作属于约束性方法(例如条件独立性或秩不足测试),但它们可能面临经验挑战,如测试顺序依赖性、误差传播和选择合适显著性水平的问题。这些问题可以通过精心设计的基于得分的方法(如在没有潜在变量的情况下使用的贪心等价搜索(GES))来缓解。然而,设计包含潜在变量的基于得分的方法却极具挑战性。在本文中,我们开发了能够识别包含因果相关潜在变量的因果结构的基于得分的方法,并提供了可识别性保证。具体而言,我们证明了适当制定的评分函数可以实现结构学习的得分等价性和一致性。我们进一步对文献中考虑的多种结构假设下观测变量边缘分布的有效自由度进行了表征,并据此开发了精确和连续的基于得分的方法。这为几种现有约束性方法提供了统一的视角。实验结果验证了所提出方法的有效性。

英文摘要

Identifying latent variables and the causal structure involving them is essential across various scientific fields. While many existing works fall under the category of constraint-based methods (with e.g. conditional independence or rank deficiency tests), they may face empirical challenges such as testing-order dependency, error propagation, and choosing an appropriate significance level. These issues can potentially be mitigated by properly designed score-based methods, such as Greedy Equivalence Search (GES) (Chickering, 2002) in the specific setting without latent variables. Yet, formulating score-based methods with latent variables is highly challenging. In this work, we develop score-based methods that are capable of identifying causal structures containing causally-related latent variables with identifiability guarantees. Specifically, we show that a properly formulated scoring function can achieve score equivalence and consistency for structure learning of latent variable causal models. We further provide a characterization of the degrees of freedom for the marginal over the observed variables under multiple structural assumptions considered in the literature, and accordingly develop both exact and continuous score-based methods. This offers a unified view of several existing constraint-based methods with different structural assumptions. Experimental results validate the effectiveness of the proposed methods.

2605.20395 2026-05-21 cs.RO

Scalable Multi-robot Motion Planning via Hierarchical Subproblem Expansion and Workspace Decomposition Refinement

通过分层子问题扩展和工作空间分解细化实现可扩展的多机器人运动规划

Isaac Ngui, Courtney McBeth, James D. Motes, Marco Morales, Nancy M. Amato

AI总结 本文提出了一种多机器人运动规划方法,通过工作空间分解的离散搜索提高规划效率,核心方法是分层子问题扩展和工作空间分解细化,主要贡献是通过迭代优化工作空间表示来搜索更小的解耦配置空间。

Comments Accepted to WAFR 2026

详情
AI中文摘要

多机器人运动规划中的基本挑战是在不引起大规模计算开销的情况下实现足够的协调以避免机器人间的冲突。在本文中,我们提出了一种多移动机器人运动规划方法,通过利用工作空间分解的离散搜索来提供规划过程中的协调。虽然先前的工作利用工作空间拓扑来指导何时需要机器人之间的协调,然后将机器人组合到它们的联合配置空间中,我们进一步通过迭代优化工作空间表示,使规划器能够搜索更小、解耦的配置空间,从而将规划时间提高了数量级。

英文摘要

A fundamental challenge in multi-robot motion planning is achieving sufficient coordination to avoid inter-robot conflicts without incurring the large computational expense of searching the joint configuration space of the robot group. In this work, we present a method for multiple mobile robot motion planning that achieves an improvement in planning time up to an order of magnitude by leveraging the insight that we can use discrete search over a workspace decomposition to provide coordination between robots during planning. While prior work uses workspace topology to inform when coordination between robots is needed and then composes robots into their joint configuration space, we take a step further by iteratively refining our workspace representation to allow our planner to search smaller, decoupled configuration spaces.

2605.20392 2026-05-21 cs.RO

VBT-MPC: Vision-Based Tactile MPC for Contour Following

VBT-MPC:基于视觉的触觉MPC用于轮廓跟踪

Edison Velasco-Sanchez, Luis F. Recalde, Guanrui Li, Pablo Gil

AI总结 本文提出了一种基于视觉的触觉模型预测控制(VBT-MPC)框架,用于机器人轮廓跟踪,通过眼在手配置安装的基于视觉的触觉传感器(VBTS)直接在轮廓特征空间中操作,避免了单独的姿态估计模块和复杂的力控制架构,并在仿真和实际实验中评估了在不同几何形状和材料物体上的轮廓跟踪性能。

Comments This article has been accepted for publication in IEEE Robotics and Automation Letters. This is a preprint version. This work was supported by the Interreg-VI Sudoe and European Regional Development Funds through the REMAIN Project under Grant S1/1.1/E0111

详情
AI中文摘要

触觉感知在机器人操作中起着关键作用,特别是在表面检查等任务中。成功的执行需要在准确跟踪物体轮廓的同时保持接触。在本工作中,我们提出了一种基于视觉的触觉模型预测控制(VBT-MPC)框架,用于使用安装在眼在手配置中的基于视觉的触觉传感器(VBTS)的机器人轮廓跟踪。所提出的控制器直接在轮廓特征空间中操作,从而避免了单独的姿态估计模块或复杂的力控制架构。我们进一步将我们的VBT-MPC与适应于触觉特征的视觉伺服策略进行比较,并在仿真和实际实验中评估了在具有不同几何形状和材料的物体上的轮廓跟踪性能。

英文摘要

Tactile sensing plays a key role in robotic manipulation, particularly in tasks like surface inspection. Successful execution requires maintaining contact while accurately tracking object contours. In this work, we propose a Vision-Based Tactile Model Predictive Control (VBT-MPC) framework for robotic contour following using a Vision-Based Tactile Sensor (VBTS) mounted in an eye-in-hand configuration. The proposed controller operates directly in contour features space, thereby avoiding the need for separate pose-estimation modules or complex force-control architectures. We further compare our VBT-MPC with visual-servoing strategies adapted to tactile features, and evaluate contour tracking on objects with diverse geometries and materials in both simulation and real-world experiments.

2605.20390 2026-05-21 cs.CV cs.AI cs.LG cs.RO

STELLAR: Scaling 3D Perception Large Models for Autonomous Driving

STELLAR: 为自动驾驶扩展3D感知大模型

Yingwei Li, Xin Huang, Yang Liu, Yang Fu, Alex Zihao Zhu, Chen Song, Junwen Yao, Anant Subramanian, Hao Xiang, Weijing Shi, Yuliang Zou, Tom Hoddes, Zhaoqi Leng, Govind Thattai, Dragomir Anguelov, Mingxing Tan

AI总结 本文研究了大规模训练在自动驾驶感知系统中的应用,通过扩展输入模态并训练大规模模型,实现了在Waymo数据集上的新状态-of-the-art性能。

详情
AI中文摘要

模型扩展通过在多样化数据集上进行大规模训练已显示出显著的成功。然而,尚不清楚相同的范式是否适用于自动驾驶感知系统,因为存在独特的挑战,如融合异构传感器数据和需要复杂的3D空间理解。为弥合这一差距,我们进行了系统分析,研究了规模对这些系统的影响。我们基于稀疏窗口变换器开发了STELLAR模型,扩展了输入模态,包括LiDAR、雷达、相机和地图先验。我们在一个包含5000万驾驶示例的大规模数据集上训练该模型,参数数量高达5亿。我们的大规模实验揭示了模型性能与模型大小、数据和计算之间的经验扩展趋势。所得到的模型在Waymo Open Dataset挑战中建立了新的状态-of-the-art,大幅超越了先前的成果。我们的工作表明,大规模训练是提升自动驾驶感知模型能力极具前景的路径。

英文摘要

Model scaling has demonstrated remarkable success through large-scale training on diverse datasets. It remains an open question whether the same paradigm would apply to autonomous driving perception systems due to unique challenges, such as fusing heterogeneous sensor data and the need for sophisticated 3D spatial understanding. To bridge this gap, we present a comprehensive study on systematically analyzing the impact of scale on these systems. We develop our STELLAR model based on Sparse Window Transformer, by extending the input modalities to include LiDAR, radar, camera, and map prior. We train the model on a large-scale dataset of 50 million driving examples with up to 500 million parameters. Our large-scale experiments reveal empirical scaling trends that connect model performance to model size, data, and compute. The resulting model establishes a new state-of-the-art on the Waymo Open Dataset challenge, outperforming prior arts by a large margin. Our work demonstrates that large-scale training is a highly promising path for advancing the capabilities of perception models for autonomous driving.

2605.20389 2026-05-21 cs.LG cs.AI

Nonlocal operator learning for fMRI encoding and decoding tasks

非局部算子学习用于fMRI编码和解码任务

Andreas Kramer, Saugat Acharya, Alice Giola, Emanuele Zappala

AI总结 本文提出了一种基于神经积分算子的框架,用于fMRI数据的编码和解码任务,探讨了非局部时空上下文的作用,并通过实验验证了更长的时间窗口和视觉皮层与全脑记录对性能和潜在空间几何的影响。

Comments 18 pages, 4 figures, 5 tables. Comments are welcome!

详情
AI中文摘要

功能性磁共振成像(fMRI)数据表现出高维时空结构,使得预测和解码变得具有挑战性。在本工作中,我们研究了基于神经积分算子的模型用于fMRI的编码和解码任务,特别强调非局部时空上下文的作用。我们实现了一个潜在的神经积分算子框架,该框架在辅助空间中执行固定点迭代,通过解码器进行分类和刺激预测。我们在两个开源fMRI数据集上评估了我们的模型。我们的实验检验了从fMRI记录中解码刺激以及从刺激表示中编码fMRI动态。主要关注点是时空上下文的影响:我们系统比较了短和长的时间窗口,以及使用视觉皮层与全脑记录,并分析其对性能和潜在空间几何的影响。在不同任务和数据集中,更长的时间窗口通常会改善结果并产生更具结构化的学习表示。在解码实验中,学习的潜在空间通常比原始数据提供更清晰的类别分离。在编码实验中,尽管由于任务难度绝对性能保持中等,但更长的时间窗口仍能产生一致的改进。这些发现表明,神经积分算子为建模fMRI动态提供了一个有前景的框架,并且更广泛的时空上下文对预测和表示学习都是有益的。更广泛地说,结果表明,利用大脑动态中的分布式非局部结构需要专门设计的模型架构来捕捉此类依赖关系。

英文摘要

Functional MRI data exhibit high-dimensional spatiotemporal structure, making both prediction and decoding challenging. In this work, we investigate neural integral-operator-based models for encoding and decoding tasks in fMRI, with particular emphasis on the role of nonlocal spatiotemporal context. We implement a latent neural integral operator framework that performs fixed point iterations in an auxiliary space from which classification and stimuli prediction is performed via a decoder. We evaluate our model on two open-source fMRI datasets. Our experiments examine both decoding of stimuli from fMRI recordings and encoding of fMRI dynamics from stimulus representations. A main focus is the effect of spatiotemporal context: we systematically compare short and long temporal windows, as well as the use of visual cortex vs whole brain recordings, and analyze their influence on performance and latent-space geometry. Across tasks and datasets, larger temporal windows generally improve results and produce more structured learned representations. In decoding experiments, the learned latent space often provides clearer class separation than the raw data. In encoding experiments, although absolute performance remains moderate due to the difficulty of the task, longer temporal windows still yield consistent gains. These findings suggest that neural integral operators provide a promising framework for modeling fMRI dynamics and that broader spatiotemporal context can be beneficial for both prediction and representation learning. More broadly, the results indicate that exploiting distributed nonlocal structure in brain dynamics requires model architectures specifically designed to capture such dependencies.

2605.20388 2026-05-21 cs.CV

How You Move Tells What You'll Do: Trajectory-Conditioned Egocentric Prediction

如何移动决定了将来的行动:轨迹条件的自身视角预测

Sejoon Jun, Hai Nguyen-Truong, Luigi Seminara, Lorenzo Torresani

AI总结 该研究通过轨迹条件自身视角预测,发现轨迹能更精确地表达意图,从而在任务规划中优于语言条件,且在预测时无需观察轨迹即可获得显著优势。

Comments Project page: https://farsightlab.github.io/TrajPilot

详情
AI中文摘要

预测一个人的第一人称视角如何演变(接下来会采取什么行动,什么计划完成任务,正在进行的投篮是否会得分)从根本上是不充分的:相同的情境允许许多可能的未来,而一个训练以最小化预测误差的模型被迫在这些未来之间做妥协或平均,无论哪种方式都可能出错。我们的方法基于两个发现。首先,未来的相机轨迹,即头部在空间中划出的路径,让模型承诺其中一个未来:它以足够精细的形式承载操作者的意图,从而决定行动如何展开,显著优于语言作为条件信号。其次,这种意图使轨迹本身可以从当前情境中部分预测出来,足以在测试时无需观察轨迹即可恢复大部分收益。我们将其实例化为TrajPilot,一个模型从自身视角上下文预测候选未来轨迹,并利用这些轨迹在与行动对齐的嵌入空间中引导动作预测,其中语言塑造了结构但从未用作条件输入。TrajPilot在Ego-Exo4D原子、Ego-Exo4D Keystep、Ego-Exo4D GoalStep和EgoPER的程序规划任务中优于VLM和结构化规划基线,随着预测范围的扩大(正是先前规划器崩溃的地方),并且在仅使用RGB的相机姿态估计下保持稳定。在推理时目标被遮蔽,同一模型能够进行无目标的预测,其在Ego-Exo4D原子任务上击败VLM基线,并扩展到EPIC-Kitchens-100和篮球投篮结果预测。

英文摘要

Predicting how a person's first-person view will evolve (what action will follow, what plan completes a task, whether an in-progress shot will score) is fundamentally under-specified: the same context admits many plausible futures, and a model trained to minimize prediction error is forced to hedge or average across them, getting it wrong either way. Two findings shape our approach. First, the future camera trajectory, the path the head carves through space, lets the model commit to one of those futures: it carries the operator's intent in a form fine enough to determine how an action will unfold, substantially outperforming language as a conditioning signal. Second, this same intent makes the trajectory itself partially predictable from the context at hand, enough that trajectory need not be observed at test time to recover most of the gain. We instantiate these findings as TrajPilot, a model that predicts candidate future trajectories from egocentric context and uses them to pilot action prediction in an action-aligned embedding space where language shapes the structure but is never used as a conditioning input. TrajPilot beats VLM and structured-planner baselines on procedural planning across Ego-Exo4D atomic, Ego-Exo4D Keystep, Ego4D GoalStep, and EgoPER, with the trajectory advantage widening with horizon (exactly where prior planners collapse) and holding under RGB-only camera-pose estimation. With the goal masked at inference, the same model performs goal-free anticipation, beating VLM baselines on Ego-Exo4D atomic and extending to EPIC-Kitchens-100 and basketball shot-outcome prediction.

2605.20385 2026-05-21 cs.CV cs.AI

ConceptSeg-R1: Segment Any Concept via Meta-Reinforcement Learning

ConceptSeg-R1: 通过元强化学习实现任意概念的分割

Yuan Zhao, Youwei Pang, Jiaming Zuo, Wei Ji, Kailai Zhou, Bin Fan, Yunkang Cao, Lihe Zhang, Xiaofeng Liu, Huchuan Lu, Weisi Lin, Dacheng Tao, Xiaoqi Zhao

AI总结 本文提出ConceptSeg-R1框架,通过元强化学习机制学习可迁移的任务规则,结合轻量级概念翻译模块实现概念分割,并在多个领域基准上验证了其在概念层次上的强性能。

详情
AI中文摘要

近年来,可提示分割的进步使视觉感知从对象级定位转向概念级理解。然而,概念的定义仍不明确,使得当前方法是否真正超越类别识别仍存疑问。本文通过包含上下文无关(CI)、上下文依赖(CD)和上下文推理(CR)概念的三级分类,揭示了随着认知复杂性增加的能力差距。为解决这一挑战,我们提出ConceptSeg-R1统一框架,将概念分割重新公式化为规则诱导的概念定位。核心方法是Meta-GRPO,通过视觉示范学习可迁移的任务规则并通过代理推理验证。推导出的推理状态通过轻量级概念翻译模块转换为分割准备的概念提示,使推理应用能够扩展到目标图像。快捷路由策略进一步保留了分割模型在简单情况下的原生效率。为系统评估概念分割,我们在自然、工业、医疗和推理密集领域进行了广泛的实验。无需额外装饰,ConceptSeg-R1在完整概念层次上实现了强性能,同时保持了可提示分割主干的原生能力。作为向分割任何概念的初步步骤,我们希望ConceptSeg-R1能成为推进分割从对象级预测到概念级理解的实用基线。

英文摘要

Recent progress in promptable segmentation has shifted visual perception from object-level localization toward concept-level understanding. However, the notion of a concept remains under-specified, making it unclear whether current methods truly generalize beyond category recognition. In this work, we formalize generalized concept segmentation through a three-level taxonomy consisting of context-independent (CI), context-dependent (CD), and context-reasoning (CR) concepts, which reveals a clear capability gap across increasing levels of cognitive complexity. To address this challenge, we propose ConceptSeg-R1, a unified framework that reformulates concept segmentation as rule-induced concept grounding. At the core of our method is Meta-GRPO, a meta-reinforcement learning mechanism that learns transferable task rules from visual demonstrations and verifies them through proxy reasoning. The inferred reasoning states are then translated into segmentation-ready concept prompts via a lightweight concept translation module, enabling deductive application to target images. A shortcut routing strategy further preserves the native efficiency of segmentation models on simple cases. To systematically evaluate generalized concept segmentation, we conduct extensive experiments across diverse CI, CD, and CR concept segmentation benchmarks spanning natural, industrial, medical and reasoning-intensive domains. Without bells and whistles, ConceptSeg-R1 achieves strong performance across the full concept hierarchy while maintaining the native capability of promptable segmentation backbones. As an initial step toward segmenting any concept, we hope ConceptSeg-R1 can serve as a practical baseline for advancing segmentation from object-level prediction toward concept-level understanding.

2605.20382 2026-05-21 cs.CL cs.AI

Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs

言行不一:大型语言模型中的指令诱导冲突

Carolina Camassa, Derek Shiller

AI总结 研究探讨了大型语言模型在面对指令与模式完成之间的冲突时的表现,发现指令遵循率在不同模型和指令下差异显著,输出多样性是预测鲁棒性的主要因素。

Comments 31 pages

详情
AI中文摘要

语言模型被训练以遵循指令,但它们也是强大的模式完成器。当这些两个目标发生冲突时会发生什么?我们构建了对话,其中用户指令要求以目标方式T(例如始终输出特定标记、用特定语言回答或采用特定角色)行为,与N个硬编码助手回合展示的竞争对手模式P相冲突。然后我们测量在此设置下的指令遵循率(IF率),在13个模型和16种不同指令上,最多50回合。平均指令遵循率在模型之间从1%到99%不等,与标准能力基准 largely 无关。从指令遵循到模式遵循的转变是普遍的但高度模型依赖的。鲁棒性由指令内容调节,当指令与训练价值先验一致时,模型对诱导的抵抗更长;同时由输出格式调节,多样化的多标记响应比单标记输出更耐受。链式推理提高鲁棒性但不消除易受性,并可能导致正确推理与错误输出之间的脱节。当被要求预测其在此设置下的行为时,模型平均准确率为83.5%,但系统性低估了自身对诱导压力的抵抗力。这些结果表明,即使对于其他方面能力较强的模型,指令遵循在诱导压力下仍很脆弱,而输出多样性而非输入的语义参与是预测鲁棒性的主要因素。

英文摘要

Language models are trained to follow instructions, but they are also powerful pattern completers. What happens when these two objectives conflict? We construct conversations in which a user instruction to behave in a target way T (e.g., always output a specific token, answer in a particular language, or adopt a persona) is opposed by N hardcoded assistant turns demonstrating a competing pattern P. We then measure instruction-following (IF) rates in this setting, across 13 models and 16 different instructions, for up to 50 turns. Average instruction-following rates range from 1% to 99% across models, largely uncorrelated with standard capability benchmarks. The transition from instruction-following to pattern-following is universal but highly model-dependent. Robustness is modulated both by instruction content, with models resisting induction longer when instructions align with their trained value priors, and by output format, with diverse multi-token responses proving substantially more resistant than single-token outputs. Chain-of-thought reasoning improves robustness but does not eliminate susceptibility, and can produce dissociation between correct deliberation and incorrect output. When asked to predict their behavior in this setting, models achieve 83.5% accuracy on average but systematically underestimate their own resistance to induction pressure. These results suggest that instruction-following remains brittle under induction pressure even for otherwise capable models, and that output diversity, rather than semantic engagement with the input, is the primary factor predicting robustness.

2605.20373 2026-05-21 cs.RO cs.AI cs.CV

SUGAR: A Scalable Human-Video-Driven Generalizable Humanoid Loco-Manipulation Learning Framework

SUGAR: 一种可扩展的人类-视频驱动的通用人形机器人运动-操作学习框架

Tianshu Wu, Xiangqi Kong, Yue Chen, Qize Yu, Hang Ye, Jia Li, Yizhou Wang, Hao Dong

AI总结 该研究提出SUGAR框架,通过将多样化的视频转化为可部署的人形机器人运动-操作技能,无需特定任务的奖励工程或参考动作条件,在仿真和现实硬件中实现了六种代表性任务的高性能表现,展示了可扩展性和零样本现实迁移能力。

Comments Project Page: https://tianshuwu.github.io/sugar-humanoid/

详情
AI中文摘要

构建能够实现在现实世界中通用的全身体运动-操作能力的人形机器人仍是一个根本性挑战。现有方法要么依赖于繁琐的特定任务奖励工程,要么依赖于僵化的参考动作回放,无法泛化,或者依赖于昂贵的远程操作,限制了可扩展性。尽管人类视频捕捉了多样化的动作行为,但从中推断出的运动先验固有地不完美,受到遮挡、接触伪影和重定向误差的影响,使其不适合直接的策略学习。为此,我们提出了SUGAR,一种可扩展的数据驱动框架,能够将多样化的视频转化为可部署的人形机器人运动-操作技能,无需任何特定任务的奖励工程或参考动作条件。SUGAR分为三个阶段。首先,一个完全自动化的流程从无结构的人类视频中提取运动交互先验,包括人类-物体运动轨迹和接触标签。第二,一个特权物理基础的细化器利用统一的模仿奖励和渐进状态池,将不完美的先验转化为物理上可行的、高保真的技能。第三,经过细化的技能被转化为一个分层的自主策略,包括一个命令生成器和一个命令跟踪器。我们在仿真和现实世界的人形硬件中评估了SUGAR,我们的方法在六种代表性运动-操作任务上显著优于参考跟踪基线,性能随着人类视频数据量的增加而明显提升。它还实现了零样本现实迁移,具有可靠的闭环执行、自主故障恢复和在外部扰动下的稳定长时程性能。项目页面:https://tianshuwu.github.io/sugar-humanoid/

英文摘要

Building humanoid robots capable of generalizable whole-body loco-manipulation in the real world remains a fundamental challenge. Existing methods either rely on laborious task-specific reward engineering, rigidly replay reference motions that fail to generalize, or depend on costly teleoperation that limits scalability. While human videos capture diverse human behaviors, motion priors inferred from them are inherently imperfect, suffering from occlusion, contact artifacts, and retargeting errors that render them unsuitable for direct policy learning. To address this, we present SUGAR, a scalable data-driven framework that converts diverse human videos into deployable humanoid loco-manipulation skills, without any task-specific reward engineering or reference-motion conditioning at inference. SUGAR proceeds in three stages. First, a fully automated pipeline extracts kinematic interaction priors including human-object motion trajectories and contact labels from unstructured human videos. Second, a privileged physics-based refiner uses a unified mimic reward and progressive state pool to transform imperfect priors into physically feasible, high-fidelity skills. Third, refined skills are distilled into a hierarchical autonomous policy consisting of a command generator and a command tracker. We evaluate SUGAR on six representative loco-manipulation tasks in simulation and real-world humanoid hardware. Our method substantially outperforms reference-tracking baselines, and performance scales clearly with the amount of human video data. It also achieves zero-shot real-world transfer with reliable closed-loop execution, autonomous failure recovery, and stable long-horizon performance under external perturbations. Project Page: https://tianshuwu.github.io/sugar-humanoid/