arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.01964 2026-06-02 cs.CL

Eyettention II: A Dual-Sequence Architecture for Modeling Fixation Location, Within-Word Landing Position, and Fixation Duration in Reading

Shuwen Deng, Cui Ding, David R. Reich, Paul Prasse, Lena A. Jäger

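AI summary: Proposes Eyettention II, a lightweight, end-to-end trained deep-learning model whose dual-sequence architecture generates complete scanpaths comprising fixation location, within-word landing position, and fixation duration; it outperforms existing models in prediction and captures key psycholinguistic phenomena.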

English abstract

The way our eyes move while reading provides valuable insights into both the reader's cognitive processes and the properties of the text. In particular, eye-tracking-while-reading data has been shown to be highly beneficial in various technological applications, such as enhancing and interpreting language models and inferring a reader's characteristics. However, these applications often rely on large-scale, data-driven models, which demand extensive eye-tracking datasets that are challenging to obtain due to the resource-intensive nature of data collection. To address the challenge of data scarcity, we develop Eyettention II, an end-to-end trained deep-learning model capable of generating realistic scanpaths consisting of a complete set of fixation attributes in chronological order, including fixation location, within-word landing position, and fixation duration. Our model is lightweight, efficiently trainable on limited GPU resources, and closely aligned with cognitive theories. We demonstrate that Eyettention II surpasses state-of-the-art models in scanpath prediction and mirrors human-like gaze behavior by capturing key psycholinguistic phenomena. With its robust performance, Eyettention II holds the potential to drive advancements in natural language processing, facilitate piloting the materials of psycholinguistic experiments, and uncover new insights beyond what is explicitly encoded in theoretical cognitive models.

2606.01955 2026-06-02 cs.RO cs.CV

WALL-WM: Carving World Action Modeling at the Event Joints

Shalfun Li, Victor Yao, Charles Yang, Truth Qu, Regis Cheng, Ryan Yu, Howard Lu, Newton Von, Vincent Chen, Yohann Tang, Maeve Zhang, Ellie Ma, Gody Li, Sage Yang, Lorien Shu, J. W. Gao, Ethan Chen, Colin Ye, Yu Sun, Elise Mon, PS Zhang, Neo Li, Lily Li, James Wang, Ping Yang, Chris Pan, Lucy Liang, Hang Su, Roy Gan, Hao Wang, Qian Wang

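AI summary: Proposes WALL-WM, a World Action Model that uses event-level Vision-Language-Action pretraining to resolve the granularity mismatch that fixed-length action chunks create among language, vision, and action; it generalizes across language, scenes, and tasks and reaches state-of-the-art performance in large-scale real-world evaluation.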

English abstract

WALL-WM is a World Action Model that shifts video-action learning from chunk-centric optimization to event-grounded Vision-Language-Action pretraining, using semantically coherent action events as the atomic unit of learning. Existing WAMs commonly initialize from multimodal or video foundation models and then optimize fixed-length action chunks conditioned directly on the current observation and instruction. Although convenient, this chunk-centric formulation creates a fundamental granularity mismatch. Language describes semantic goals and events, vision evolves through continuous scene dynamics, and actions operate at control-level timescales; forcing all three into the same fixed-length prediction window turns VLA training into short-horizon correlation fitting. WALL-WM addresses this mismatch by organizing both supervision and data around semantic events. Specifically, it pairs event-grounded VLA pretraining with a data ecosystem built from event-level captions and cluster-balanced sampling, enabling scalable learning over diverse behaviors, scenes, and task structures. From the same event-pretrained backbone, WALL-WM supports two complementary inference modes. The event mode consumes next-event descriptions and enables variable-length execution chunks, while the unified mode uses a VLM with Staircase Decoding to condition conventional fixed-length chunk inference while preserving a gradient-continuous VLA path. Together with Muon-optimizer-based large-scale pretraining infrastructure, WALL-WM provides a practical scale-up recipe for general-purpose WAMs. Experiments show that WALL-WM generalizes broadly across language, scenes, and tasks, achieving state-of-the-art performance in large-scale real-world generalization evaluation.

2606.01954 2026-06-02 cs.LG stat.ML

Flow-Transformed Implicit Processes for Function-Space Variational Inference

Luis A. Ortega, Andrés R. Masegosa, Thomas D. Nielsen

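AI summary: Proposes Flow-Transformed Implicit Processes (FTIP), which enrich the variational distribution over combination weights with a normalizing flow to capture asymmetric, heavy-tailed, and multimodal posterior structure in function space, optimized with a Black-Box α objective.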

Comments
24 pages, 4 figures, 10 tables. Pre-print submitted for revision
English abstract

Implicit-process priors define distributions over functions through flexible generative mechanisms, making them attractive for Bayesian function-space modelling. However, performing posterior inference with such priors is challenging because their induced function-space distributions are typically not available in closed form. One practical strategy is to approximate the prior using a finite collection of sampled functions, and then represent posterior functions as learned combinations of these samples. Existing approaches commonly place a Gaussian variational distribution over the combination weights. While tractable, this choice limits the shapes of posterior uncertainty that can be represented, especially when the true posterior is asymmetric, heavy-tailed, or multimodal. We propose Flow-Transformed Implicit Processes (FTIP), a variational inference method that makes this finite-dimensional function-space approximation more expressive. Instead of using a Gaussian distribution over the combination weights, FTIP uses a normalizing flow to define a richer variational distribution. This induces a flexible posterior distribution over functions while preserving tractable optimization. We train the model using a Black-Box α objective, allowing us to compare mass-covering and mode-seeking variational behaviour. Experiments show that FTIP captures asymmetric and multimodal posterior structure in function space that Gaussian coefficient approximations tend to smooth or collapse.

2606.01952 2026-06-02 cs.LG

Randomized Least Squares Value Iteration itself is Joint Differentially Private

Haiyang Lu, Pratik Gajane, Shaojie Bai, Mohammad Sadegh Talebi

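AI summary: Studies privacy protection for the randomized-exploration algorithm RLSVI in tabular MDPs and proves that its intrinsic exploration noise simultaneously provides a joint differential privacy guarantee.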

Comments
12 pages, 0 figures
English abstract

As reinforcement learning (RL) is increasingly applied to sensitive domains, such as health care and recommendation systems, privacy-preserving techniques have become essential to protect users' sensitive information. We investigate privacy-preserving RL under an episodic setting, focusing on algorithms based on randomized exploration, such as Randomized Least Squares Value Iteration (RLSVI). The overall goal is to study how randomized exploration interacts with the injected noise required by privacy mechanisms. In this work, we show a new privacy analysis that characterizes how the noise that RLSVI injects for exploration simultaneously provides privacy protection. Specifically, we prove that RLSVI, as is, is $(\varepsilon(\delta),\delta)$-joint differentially private in tabular MDPs with $\varepsilon(\delta) = \frac{2AK}{H^2\log(2HSA)} + 2\sqrt{\frac{2AK\log(1/\delta)}{H^2\log(2HSA)}}$, where $S$ and $A$ are the number of states and actions respectively, $H$ is the length of an episode, and $K$ is the number of episodes.
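
To get a feel for the bound stated in the abstract, the closed-form expression for $\varepsilon(\delta)$ can be evaluated numerically. The sketch below is only an illustration: the function name `rlsvi_jdp_epsilon` is hypothetical, and it assumes that $\log$ denotes the natural logarithm, which the abstract does not specify.

```python
import math


def rlsvi_jdp_epsilon(delta: float, S: int, A: int, H: int, K: int) -> float:
    """Evaluate epsilon(delta) from the abstract's bound.

    Assumption: log is the natural logarithm (not stated in the abstract).
    S, A = number of states and actions, H = episode length, K = episodes.
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    # Shared factor 2AK / (H^2 log(2HSA)) that appears in both terms.
    base = 2.0 * A * K / (H ** 2 * math.log(2 * H * S * A))
    return base + 2.0 * math.sqrt(base * math.log(1.0 / delta))


if __name__ == "__main__":
    for K in (10, 100, 1000):
        print(K, rlsvi_jdp_epsilon(delta=1e-5, S=10, A=4, H=20, K=K))
```

Under this formula the privacy parameter grows with the number of episodes $K$ and shrinks as the horizon $H$ grows, which is easy to confirm by varying the arguments.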

2606.01951 2026-06-02 cs.RO

Co-training with Ego-centric Video and Demonstration for Robot Navigation Task

Shoya Kuno, Yumo Ouchi, Kanata Suzuki

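AI summary: Proposes a framework that converts egocentric walking videos into imitation-learning datasets for mobile robots; co-training a VLA model on human-derived and robot-collected data improves language understanding and action generation.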

English abstract

Vision-language-action (VLA) models are promising for diverse robotic tasks, but their performance heavily depends on large-scale high-quality training data, whose collection on real robots is costly and time-consuming. While prior work has explored augmenting manipulation datasets with egocentric human videos, applying such approaches to mobile robot navigation remains challenging due to viewpoint changes during locomotion. In this paper, we propose a framework that converts egocentric walking videos into datasets for mobile robot imitation learning. The proposed method estimates camera motion from human videos and transforms it into action representations compatible with ground mobile robots. By jointly training a VLA model on human-derived and robot-collected datasets, the model achieves improved language understanding and more robust action generation than training with either data source alone. Experiments on a fruit-search navigation task demonstrate that human egocentric videos provide an effective and scalable data source for mobile robot learning.
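
The abstract does not say how estimated camera motion is turned into robot actions. As one plausible illustration only, the sketch below assumes that camera poses have already been projected onto the ground plane as (x, y, yaw) and that the target robot accepts unicycle-style commands (linear velocity, angular velocity). The names `Pose2D` and `poses_to_unicycle_actions` are hypothetical and not taken from the paper.

```python
import math
from typing import List, Tuple

# Assumed input: ground-plane pose (x [m], y [m], yaw [rad]) per video frame.
Pose2D = Tuple[float, float, float]


def wrap_angle(angle: float) -> float:
    """Wrap an angle to the interval (-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


def poses_to_unicycle_actions(poses: List[Pose2D], dt: float) -> List[Tuple[float, float]]:
    """Convert consecutive planar poses into (v, omega) commands.

    Sideways motion, which a ground robot of this type cannot execute,
    is discarded by projecting the displacement onto the current heading.
    """
    actions = []
    for (x0, y0, th0), (x1, y1, th1) in zip(poses, poses[1:]):
        dx, dy = x1 - x0, y1 - y0
        forward = dx * math.cos(th0) + dy * math.sin(th0)
        v = forward / dt
        omega = wrap_angle(th1 - th0) / dt
        actions.append((v, omega))
    return actions


if __name__ == "__main__":
    walk = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, math.pi / 2)]
    print(poses_to_unicycle_actions(walk, dt=1.0))
```

Dropping the lateral component is a design choice of this sketch; a real pipeline would also need to handle scale ambiguity and head motion in the video, which are outside its scope.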

2606.01950 2026-06-02 cs.RO cs.CV cs.LG

Learning Action-Conditional and Object-Centric Gaussian Splatting World Models for Rigid Objects

Jens U. Kreber, Lukas Mack, Joerg Stueckler

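AI summary: Proposes MRO-GWM, which uses object-centric Gaussian representations and a spatio-temporal transformer to learn action-conditional 3D dynamics of rigid objects, supporting future-motion prediction in multi-object scenes under partial observations.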

English abstract

World models enable intelligent agents to predict the consequences of their actions on the environment. In this paper, we propose Multi Rigid Object Gaussian World Model (MRO-GWM), a novel model that learns action-conditional dynamics of rigid objects in 3D. By representing the scene by object-centric Gaussians, we can represent arbitrary object shapes and multi-object scenes. We develop a novel spatio-temporal transformer architecture that predicts future rigid body motion from a history of object Gaussians and future actions. Objects are represented by their Gaussians in a canonical frame, which allows for describing object motion as rigid body transformation. Our model is trained on reconstructions from multiple viewpoints, which requires the model to handle partial observations of objects due to occlusions. We analyze prediction performance of our approach on synthetic datasets composed of typical household objects with multi-object dynamics and interactions by a robot end effector. We also evaluate our model in model-predictive control for non-prehensile manipulation in simulation.

2606.01948 2026-06-02 cs.IR cs.AI

Rank-Constrained Deep Matrix Completion for Group Recommendation

Mubaraka Sani Ibrahim, Lehel Csató, Isah Charles Saidu

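AI summary: Proposes the Group RC-DMC framework, which integrates group-level representation learning through a Set-Transformer aggregator and combines low-rank structure with attention-based nonlinear modeling to predict accurately at both the individual and group levels.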

English abstract

The growing popularity of group activities has increased the need for methods that provide recommendations to groups of users given their individual preferences. Many existing group recommender systems rely on aggregating individual user preferences, but they often struggle with high-dimensional and highly sparse rating data commonly found in real-world scenarios. We propose Group Rank-Constrained Deep Matrix Completion (Group RC-DMC), a novel framework that extends RC-DMC by integrating group-level representation learning via a Set-Transformer aggregator, jointly leveraging low-rank structure and attention-based nonlinear modeling. Unlike most existing group recommender systems, Group RC-DMC unifies explicit low-rank regularization, linear encoder-decoder architectures, and attention-based nonlinear group modeling within a single framework, yielding accurate predictions at both the individual and group levels. Group RC-DMC addresses data sparsity through low-rank matrix completion, computing per-user latent representations from observed ratings only, and enforcing a rank constraint on the latent space using a nuclear-norm proximal step based on periodic singular value thresholding. The decoder is parametrized as a low-rank factorization, enabling efficient inference. Experimental results on the MovieLens and Goodbooks datasets demonstrate that Group RC-DMC achieves superior reconstruction accuracy, measured by lower group RMSE, while remaining computationally efficient and competitive in group-level performance in terms of precision, recall, and F1 score compared with weighted-before-factorization (WBF) and after-factorization (AF) baselines. The results highlight the model's ability to recover the underlying low-rank structure of user-item interactions and provide robust group recommendations across small, medium, and large user groups.

2606.01947 2026-06-02 cs.CV cs.AI

Parameter-Efficient Fine-Tuning of Large Pretrained Models for Instance Segmentation Tasks

Nermeen Abou Baker, David Rohrschneider, Uwe Handmann

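AI summary: Studies two parameter-efficient fine-tuning methods for instance segmentation, adapters and Low-Rank Adaptation (LoRA); fine-tuning only about 1-6% of the parameters yields competitive performance, and 2-3 adapters per transformer block give the best balance of performance and efficiency.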

Journal ref
Abou Baker N, Rohrschneider D, Handmann U. Parameter-Efficient Fine-Tuning of Large Pretrained Models for Instance Segmentation Tasks. Machine Learning and Knowledge Extraction. 2024; 6(4):2783-2807
Comments
Published by the Machine Learning and Knowledge Extraction Journal
English abstract

Research and applications in artificial intelligence have recently shifted with the rise of large pretrained models, which deliver state-of-the-art results across numerous tasks. However, the substantial increase in parameters introduces a need for parameter-efficient training strategies. Despite significant advancements, limited research has explored parameter-efficient fine-tuning (PEFT) methods in the context of transformer-based models for instance segmentation. Addressing this gap, this study investigates the effectiveness of PEFT methods, specifically adapters and Low-Rank Adaptation (LoRA), applied to two models across four benchmark datasets. Integrating sequentially arranged adapter modules and applying LoRA to deformable attention (explored here for the first time) achieves competitive performance while fine-tuning only about 1-6% of model parameters, a marked improvement over the 40-55% required in traditional fine-tuning. Key findings indicate that using 2-3 adapters per transformer block offers an optimal balance of performance and efficiency. Furthermore, LoRA exhibits strong parameter efficiency when applied to deformable attention, and in certain cases surpasses adapter configurations. These results show that the impact of PEFT techniques varies based on dataset complexity and model architecture, underscoring the importance of context-specific tuning. Overall, this work demonstrates the potential of PEFT to enable scalable, customizable, and computationally efficient transfer learning for instance segmentation tasks.
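
For readers unfamiliar with LoRA, the sketch below shows the generic mechanism on a single linear layer: the frozen weight W is left untouched and a trainable rank-r update B·A is added to its output. This is the standard LoRA formulation, not the paper's deformable-attention variant; the layer sizes, the scaling `alpha / rank`, and all function names are illustrative assumptions.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(M: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: Vector, alpha: float) -> Vector:
    """y = W x + (alpha / r) * B (A x), with W frozen and only A, B trainable.

    Shapes: W is d_out x d_in, A is r x d_in, B is d_out x r.
    """
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_counts(d_in: int, d_out: int, rank: int) -> Tuple[int, int]:
    """Return (parameters in the dense layer, trainable LoRA parameters)."""
    return d_in * d_out, rank * (d_in + d_out)


if __name__ == "__main__":
    full, lora = lora_param_counts(d_in=256, d_out=256, rank=8)
    print(f"dense: {full}, LoRA: {lora}, ratio: {lora / full:.4f}")
```

The per-layer ratio printed here is not comparable to the 1-6% figure in the abstract, which refers to the share of trainable parameters in the whole model.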

2606.01946 2026-06-02 cs.RO

Closed-Form Pose Estimation of Endoluminal Medical Devices via Gradiometer-Based Electromagnetic Localization System

Zhiwei Wu, Jiahao Luo, Yubo Pu, Siyi Wei, Yuankai Chen, Jinhui Zhang

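AI summary: Proposes a Gradiometer-Based Electromagnetic Localization System (GELS) that uses a compact magnetometer array as a quasi-gradiometer to estimate the local magnetic field and gradient tensor, maps them to displacements via the Euler homogeneous relation, and recovers the pose in closed form through multi-source Procrustes registration, without pre-calibrated field maps or iterative optimization.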

English abstract

Embedded magnetic tracking holds highly attractive prospects for remote navigation of endoluminal medical devices. However, existing six-degree-of-freedom pose recovery approaches often require pre-calibrated workspace field maps or iterative nonlinear optimization. This letter presents a Gradiometer-Based Electromagnetic Localization System (GELS), a closed-form tracking framework that uses a compact magnetometer array as an embedded quasi-gradiometer to estimate local magnetic fields and gradient tensors. These quantities are mapped by the Euler homogeneous relation to displacements between source and array, from which multi-source Procrustes registration recovers the array orientation and position using at least three non-collinear sources. The algorithm requires known source positions and array geometry, but no pre-calibrated workspace field maps, initial pose guesses, or calibrated excitation-source moments. The recovered pose also enables a proof-of-concept sub-level dipole localization task by serving as a mobile magnetic reference frame. Benchtop experiments across sensor-array configurations and excitation modes demonstrate sequence-averaged position errors of 10.80 mm to 15.57 mm, a fastest update rate of 14.49 Hz, and a median solver runtime of 172.00 µs. A perturbation-based error propagation analysis further identifies inter-sensor inconsistency and dipole-model mismatch as the dominant accuracy limits, thereby informing future sensor array and magnetic source design for further reducing pose-estimation error.
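
The Euler homogeneous relation mentioned in the abstract can be illustrated on an ideal point dipole. Because the dipole field B is homogeneous of degree -3 in the displacement r from source to sensor, Euler's theorem gives G r = -3 B, where G is the gradient tensor with entries dB_i/dr_j, so r = -3 G^{-1} B. The sketch below is a simulation under stated assumptions, not the GELS pipeline: the field constant is scaled to 1, the gradient is taken by finite differences on the analytic field rather than from a magnetometer array, and all function names are hypothetical.

```python
import math
from typing import List

Vec = List[float]
Mat = List[List[float]]


def dipole_field(r: Vec, m: Vec) -> Vec:
    """Ideal point-dipole field at displacement r, with mu0/(4 pi) scaled to 1."""
    d = math.sqrt(sum(c * c for c in r))
    m_dot_r = sum(a * b for a, b in zip(m, r))
    return [3.0 * m_dot_r * r[i] / d ** 5 - m[i] / d ** 3 for i in range(3)]


def gradient_tensor(r: Vec, m: Vec, h: float = 1e-5) -> Mat:
    """G[i][j] = dB_i/dr_j, approximated by central differences."""
    G = [[0.0] * 3 for _ in range(3)]
    for j in range(3):
        r_plus, r_minus = list(r), list(r)
        r_plus[j] += h
        r_minus[j] -= h
        b_plus, b_minus = dipole_field(r_plus, m), dipole_field(r_minus, m)
        for i in range(3):
            G[i][j] = (b_plus[i] - b_minus[i]) / (2.0 * h)
    return G


def solve3(A: Mat, b: Vec) -> Vec:
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda k: abs(M[k][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for k in range(col + 1, 3):
            factor = M[k][col] / M[col][col]
            for c in range(col, 4):
                M[k][c] -= factor * M[col][c]
    x = [0.0] * 3
    for i in reversed(range(3)):
        s = sum(M[i][c] * x[c] for c in range(i + 1, 3))
        x[i] = (M[i][3] - s) / M[i][i]
    return x


def displacement_from_field_and_gradient(B: Vec, G: Mat) -> Vec:
    """Closed-form displacement from Euler's relation G r = -3 B."""
    return solve3(G, [-3.0 * b for b in B])


if __name__ == "__main__":
    r_true, moment = [0.10, 0.05, 0.20], [0.0, 0.0, 1.0]
    B = dipole_field(r_true, moment)
    G = gradient_tensor(r_true, moment)
    print(displacement_from_field_and_gradient(B, G))
```

Note that the solve needs neither the dipole moment nor an initial guess, which matches the abstract's point that no calibrated excitation-source moments are required. The subsequent Procrustes registration over three or more sources is not shown.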

2606.01945 2026-06-02 cs.CV

Beyond Low-Rank: Low-Rank Sparse Prompting via Spiking Neural Network and Prompt Factorization

Yumiao Zhao, Bo Jiang, Beibei Wang, Xixi Wan, Xiao Wang, Jin Tang

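AI summary: Proposes the LoRSP framework, which combines the sparse firing mechanism of spiking neurons with low-rank factorization to generate instance-specific sparse visual prompts for efficient, robust visual prompt learning.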

English abstract

Visual Prompting (VP) has emerged as an efficient paradigm for adapting large-scale pre-trained vision models to downstream tasks by incorporating learnable prompts at the input level. However, existing VP methods typically employ dense pixel-level prompts, which often suffer from redundant perturbations, limited generalization, and energy inefficiency. To overcome these limitations, we propose to integrate brain-inspired spiking learning into visual prompt learning tasks. Spiking neurons are known to process information inexpensively by converting input data into discrete spike trains and returning sparse outputs. Inspired by this, we propose Low-Rank visual Spike Prompting (LoRSP), a novel framework that learns dynamic low-rank sparse visual prompts naturally via a spiking neuron learning mechanism. The core idea of LoRSP is to exploit the brain-inspired sparse firing mechanism of spiking neurons to generate a pixel-level sparse prompt for each instance. Specifically, we first construct a series of prompt factors via low-rank factorization to capture distinct prompt subspaces. These prompt factors are then fed into an SNN architecture, which performs the integrate-and-fire process to emit spikes. As a result, LoRSP generates a sparse visual prompt while maintaining the low-rank constraint. This design enables instance-specific selective prompting, leading to more compact and robust adaptation across diverse downstream tasks. Extensive experiments on five heterogeneous vision backbones and multiple benchmarks demonstrate that LoRSP achieves competitive performance while requiring fewer tunable parameters than existing VP methods.

2606.01940 2026-06-02 cs.CV

SCAPO: Self-Supervised Category-Level Articulated Pose Estimation from a Single 3D Observation

Can Zhang, Gim Hee Lee

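AI summary: Proposes SCAPO, a self-supervised framework that estimates canonical geometry, rigid part segmentation, and joint parameters of articulated objects from a single RGB-D image, without ground-truth labels or category-specific models.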

English abstract

Existing methods for category-level object articulation from a single 3D observation often rely on dense supervision, multi-frame inputs, or CAD templates, and still struggle to disentangle geometry from articulation or to recover explicit joint parameters. We propose SCAPO, a self-supervised framework that estimates canonical geometry, rigid part segmentation, and joint pivots, axes, and articulation states from a single RGB-D observation without ground-truth labels or category-specific models. Our SCAPO first uses an SE(3)-equivariant vector-neuron autoencoder to factor out global pose and align diverse instances into a shared canonical space. On this aligned shape, a joint-aware blend-skinning module is then designed to model part motion. We learn this representation through cycle reconstruction between observed and canonical shapes and cross-space alignment with a learnable canonical template that decouples shared category geometry from instance-specific residual shape. Experiments on synthetic and real articulated-object datasets show that our SCAPO recovers consistent part structure and accurate articulation parameters and outperforms all self-supervised baselines.

2606.01939 2026-06-02 cs.CV

SAVMap: Structure-Aided Visual Mapping of Large-Scale 2.5D Manhattan Wireframes from Panoramic Video

Howard Huang, Bharath Surianarayanan, Keifer Lee, Chenyu Wang, Chen Feng

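AI summary: Proposes SAVMap, which uses panoramic video and a semantic segmentation network together with Manhattan-grid geometric constraints to generate semantic wireframe maps of warehouse scenes, achieving accurate large-scale 3D reconstruction.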

Comments
IEEE ICRA 2026
English abstract

Precise 3D representations of industrial environments enable tasks such as robot localization and digital twin generation. We propose SAVMap, a method for generating a semantic wireframe map of warehouse shelf and light structures using only a panoramic video camera as the sensor input. Sequences of rectified images with shelf-facing and ceiling-facing views are extracted from a panoramic video captured along the warehouse aisles. Using a semantic segmentation network front end, a set of sparse, semantic structure feature points (e.g., corners of shelf structures, centers of lights) is extracted from each image and tracked across the sequences. By accounting for real-world geometric relationships among the points, such as Manhattan grids, a constrained structure-from-motion algorithm yields the 3D points that form a wireframe map. We demonstrate the scalability and accuracy of our proposal in a warehouse with 46 shelving rows, each with faces spanning 55 m by 7 m. From an hour of panoramic video content, we create wireframe maps for over 5000 shelf elements across the rows, achieving an aggregate mean absolute error of 4.8 cm with respect to ground truth.

2606.01936 2026-06-02 cs.CL

What to Format and How: A Benchmark and Workflow Approach for Document Formatting

Shihao Rao, Liang Li, Jiapeng Liu, Tong Lin, Bing Li, Xiyan Gao, Peng Fu, Jing Huang, Can Ma

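AI summary: For content-aware document formatting, proposes the benchmark DocFormBench and the workflow method DocFormFlow, which decouples target localization from modification execution and improves accuracy while reducing token consumption.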

English abstract

Recent advances in large language models (LLMs) have opened up new possibilities for automated document formatting. However, real-world formatting often requires identifying targets based on document content. This content-aware setting remains challenging and underexplored, primarily due to the lack of dedicated evaluation datasets. To enable evaluation in realistic content-aware scenarios, we introduce DocFormBench, a benchmark that extends Text-to-Format evaluation to diverse formatting requirements, along with metrics for both accuracy and efficiency. To mitigate redundant document reading in existing methods during formatting, we propose DocFormFlow, a workflow formatting method that decouples target localization from modification execution into what to format and how. Extensive experiments across multiple LLMs and multimodal models show that DocFormFlow consistently improves formatting accuracy while reducing token consumption compared to representative baselines. Further analysis reveals that precise target localization is the primary factor influencing formatting performance. We hope DocFormBench and DocFormFlow will facilitate future research toward more intelligent and reliable document formatting.

2606.01934 2026-06-02 cs.LG cs.CL

HMPO: Hybrid Median-length Policy Optimization for Chain-of-Thought Compression

Minghui Zheng, Hongxu Chen, Huimin Ren, Hongsheng Xin, Xiaoyang Qu, Ze Wang, Shuling Yang, Ziyu Peng, Kaike Zhang, Pan Zhou, Kun Zhan

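AI summary: Proposes HMPO, a single-stage reinforcement learning framework that combines an adaptive median-based budget, a cosine-decay token reward, and a multiplicative reward formulation; trained on mathematical data, it achieves 19%-46% token compression with minimal accuracy loss and generalizes to a variety of tasks.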

English abstract

Large language models achieve remarkable performance via extended chain-of-thought (CoT) reasoning, yet this lengthy process incurs substantial inference overhead. Existing CoT compression methods struggle with inflexible manual length budgets, computationally expensive multi-stage training pipelines, and fragile scalability restricted to small models. We propose HMPO (Hybrid Median-length Policy Optimization), a cost-effective, single-stage reinforcement learning framework. HMPO efficiently compresses CoT via three synergistic components: an adaptive median-based budget derived from successful rollouts to eliminate manual tuning, a cosine-decay token reward for smooth length penalization, and a multiplicative reward formulation that substantially mitigates trivial reward hacking by strictly prioritizing answer correctness. Trained exclusively on mathematical data, HMPO generalizes seamlessly across math, code, science, and instruction-following tasks. Extensive experiments scaling from 9B to 122B parameters across dense and Mixture-of-Experts (MoE) architectures demonstrate that HMPO achieves 19%--46% token compression with negligible accuracy degradation, all while drastically reducing training costs compared to existing multi-stage baselines.
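
The abstract names three reward components but gives no formulas, so the sketch below is only one plausible reading and should not be taken as the paper's definition. It assumes that the budget is the median length of the correct rollouts in a group, that the length reward stays at 1 up to the budget and then decays along a half cosine to 0 at a hypothetical cap `max_length`, and that the final reward is the product of a 0/1 correctness term and the length reward. All function names are hypothetical.

```python
import math
from statistics import median
from typing import Sequence


def median_budget(lengths: Sequence[int], correct: Sequence[bool]) -> float:
    """Assumed budget: median token length over the successful rollouts."""
    successful = [n for n, ok in zip(lengths, correct) if ok]
    if not successful:
        raise ValueError("no successful rollouts to derive a budget from")
    return float(median(successful))


def cosine_length_reward(length: int, budget: float, max_length: int) -> float:
    """Assumed shape: 1 within budget, half-cosine decay to 0 at max_length."""
    if length <= budget:
        return 1.0
    if length >= max_length:
        return 0.0
    frac = (length - budget) / (max_length - budget)
    return 0.5 * (1.0 + math.cos(math.pi * frac))


def multiplicative_reward(length: int, is_correct: bool, budget: float, max_length: int) -> float:
    """Correctness gates the length reward, so a short wrong answer earns nothing."""
    correctness = 1.0 if is_correct else 0.0
    return correctness * cosine_length_reward(length, budget, max_length)


if __name__ == "__main__":
    lengths = [100, 200, 300, 1000]
    correct = [True, True, True, False]
    budget = median_budget(lengths, correct)
    for n, ok in zip(lengths, correct):
        print(n, ok, multiplicative_reward(n, ok, budget, max_length=600))
```

The multiplicative form is what makes correctness take strict priority in this sketch: with an additive combination, a short but incorrect rollout could still collect a positive length bonus.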

2606.01933 2026-06-02 cs.CV

3rd Place at CVPR 2026 CASTLE Challenge: Agentic Multi-View Long-Context Video Understanding via Hierarchical Knowledge Graph Retrieval

Raghad Albusayes, Munirah Alyahya

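AI summary: Proposes a training-free agentic framework that uses a Video Knowledge Graph and a hierarchical retrieval index to handle complex spatiotemporal reasoning over large-scale multi-view video, placing third in the CASTLE Challenge.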

English abstract

This paper presents our winning methodology for the CASTLE 2026 Challenge at the CVPR 2026 EgoVis Workshop, where our team secured third place globally. The challenge tasks participants with answering highly complex visual, spatiotemporal, and verbal questions, including visual counting, action localization, multi-view tracking, and speaker temporal reasoning, within massive, multimodal video streams. The underlying dataset consists of over 600 hours of synchronized footage captured by 15 ego and exo camera sources. To tackle the extreme scale and long-context demands of this environment, we introduce a training-free agentic framework optimized for long-form video understanding. Our framework introduces two core architectural components: i) a Video Knowledge Graph that maps static and dynamic entities, their temporal relationships, and intersecting events to enable multi-hop relational reasoning, and ii) an adaptive agentic workflow that resolves complex queries through hierarchical retrieval and indexing. Empirical results demonstrate that our framework achieves high zero-shot reasoning accuracy on long-context multi-view streams. Our code will be released at https://github.com/RaghadKhaled/CASTLE-Challenge-Framework.

2606.01926 2026-06-02 cs.CL

Mitigating Bias in Locally Constrained Decoding via Tractable Proposals

通过可处理提议缓解局部约束解码中的偏差

Meihua Dang, Linxin Song, Honghua Zhang, Jieyu Zhao, Guy Van den Broeck, Stefano Ermon

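AI summary: To address the sampling bias that myopic masking causes in locally constrained decoding, proposes globally constrained decoding (GCD) proposals built on tensorized finite automata and probabilistic GCD (P-GCD) proposals; combined with sequential Monte Carlo, they achieve unbiased sampling, significantly reduce the number of particles needed, and speed up convergence on function calling, keyword-based generation, and SQL generation.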

详情
Comments
13 pages, 5 figures
AI中文摘要

大型语言模型的生成结果往往不符合期望的约束,如JSON模式。现有的局部约束解码(LCD)方法通过短视地掩蔽下一个词元来强制约束,导致采样偏差和性能下降。最近的工作使用序贯蒙特卡洛(SMC)方法来缓解此类偏差,但设计有效的提议分布或势函数仍然是一个关键挑战。在这项工作中,我们提出了一种通用方法来构建从 $p_{\mathrm{lm}}( \cdot \mid \mathrm{constraint})$ 进行SMC采样的提议和势函数。首先,我们证明了以有限自动机形式指定的约束可以张量化以在GPU上高效执行,我们利用这一点构建了全局约束解码(GCD)提议。此外,利用张量化有限自动机与隐马尔可夫模型共享相同电路结构的事实,我们通过电路乘法得到概率全局约束解码(P-GCD)提议,该提议编码了目标分布的逻辑和概率信息。我们在函数调用、基于关键词的生成和SQL生成任务上评估了(P-)GCD。实验表明,在相同的SMC采样设置下,与LCD提议相比,(P-)GCD以显著更少的粒子更快地收敛到目标分布。

英文摘要

Generations from large language models often fail to conform to desired constraints such as JSON schema. Existing locally constrained decoding (LCD) approaches enforce constraints by myopically masking out next tokens, resulting in biased sampling and degradation in performance. Recent work uses sequential Monte Carlo (SMC) methods to mitigate such biases, but designing effective proposal distributions or potential functions remains a key challenge. In this work, we propose a generic approach to construct proposals and potentials for SMC sampling from $p_{\mathrm{lm}}( \cdot \mid \mathrm{constraint})$. First, we show that constraints specified as finite automata can be tensorized for efficient execution on GPUs, which we use to construct globally constrained decoding (GCD) proposals. In addition, leveraging the fact that tensorized finite automata share the same circuit structure as hidden Markov models, we circuit-multiply them to obtain the probabilistic GCD (P-GCD) proposals encoding both logical and probabilistic information about the target distributions. We evaluate (P-)GCD on the tasks of function calling, keyword-based generation, and SQL generation. Experiments show that under the same SMC sampling setup, compared to LCD proposals, (P-)GCD converges faster to the target distribution with significantly fewer particles.
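The bias that the abstract above attributes to myopic masking can be made concrete with a toy example. This is a sketch under stated assumptions, not the paper's GCD or P-GCD construction: the two-token "language model", its probabilities, and the constraint are all invented, and the correction shown is plain importance reweighting computed by exhaustive enumeration rather than a full SMC sampler.

```python
from itertools import product

VOCAB = ("a", "b")
# Toy two-token language model: P(first) and P(second | first). All numbers are made up.
P_FIRST = {"a": 0.5, "b": 0.5}
P_SECOND = {"a": {"a": 0.1, "b": 0.9}, "b": {"a": 0.5, "b": 0.5}}

def allowed(prefix, token):
    """Constraint: the second token must be 'a'; any first token is allowed."""
    return token == "a" if len(prefix) == 1 else True

def lm_prob(seq):
    return P_FIRST[seq[0]] * P_SECOND[seq[0]][seq[1]]

def satisfies(seq):
    return all(allowed(seq[:i], seq[i]) for i in range(len(seq)))

def lcd_prob_and_weight(seq):
    """Probability of seq under local masking, plus its importance weight
    (the product of the allowed probability mass at each step)."""
    prob, weight, prefix = 1.0, 1.0, ()
    for tok in seq:
        dist = P_FIRST if not prefix else P_SECOND[prefix[0]]
        mass = sum(p for t, p in dist.items() if allowed(prefix, t))
        prob *= dist[tok] / mass
        weight *= mass
        prefix += (tok,)
    return prob, weight

valid = [s for s in product(VOCAB, repeat=2) if satisfies(s)]
z = sum(lm_prob(s) for s in valid)
target = {s: lm_prob(s) / z for s in valid}           # p_lm(. | constraint)
lcd = {s: lcd_prob_and_weight(s)[0] for s in valid}   # what local masking samples from
reweighted = {s: lcd_prob_and_weight(s)[0] * lcd_prob_and_weight(s)[1] for s in valid}
total = sum(reweighted.values())
reweighted = {s: v / total for s, v in reweighted.items()}
```

In this toy setting local masking assigns "aa" and "ba" equal probability, while the true conditional puts only one sixth of the mass on "aa", because the mask never accounts for how unlikely the continuation "a" is after "a". Multiplying by the per-step allowed mass recovers the target, which is the role the importance weights play in an SMC sampler.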

2606.01923 2026-06-02 cs.CL cs.LG

Resonant Context Anchoring: Decoupling Attention Routing and Signal Gain at Inference Time

共振上下文锚定:推理时解耦注意力路由与信号增益

Mingkuan Zhao, Yide Gao, Wentao Hu, Suquan Chen, Tianchen Huang, Zhenhua An, Zetao Chang, Xiayu Sun, Yuheng Min

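AI summary: Proposes Resonant Context Anchoring (RCA), which decouples routing logic from information magnitude in self-attention and dynamically amplifies the signal of context tokens at inference time, effectively suppressing parametric hallucinations in large language models and improving factual consistency.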

详情
AI中文摘要

大型语言模型(LLM)在面对与内部参数记忆冲突的输入证据时,经常表现出“上下文忽视”,导致持续的事实幻觉。现有的缓解策略主要依赖于抑制特定神经元激活或使用计算昂贵的对比解码机制,这往往会导致困惑度增加或推理延迟显著升高。为了解决这些局限性,我们从残差流信号动力学的角度提出了一种轻量级的推理时干预方法——共振上下文锚定(RCA)。RCA旨在解决外部证据在深层网络传播过程中的信号衰减问题。其核心机制是在自注意力模块中正交解耦路由逻辑和信息幅度。通过利用原始的softmax前注意力分数作为语义对齐的即时度量,我们通过非线性整流构建动态增益场,选择性地放大上下文令牌对应的值向量的范数,而不改变注意力概率分布。该机制有效提升了残差流混合中输入证据的信噪比(SNR),从而在推理时稳健地将生成轨迹锚定到真实上下文。在Llama-3模型系列上的大量实验表明,RCA在多个事实一致性和强知识冲突任务中显著提高了上下文忠实度,有效抑制了参数化幻觉。此外,结果证实,作为一个无需训练且计算量可忽略的即插即用模块,RCA在保持模型通用语言理解能力的同时,在忠实度和流畅性上实现了帕累托改进。

英文摘要

Large Language Models (LLMs) frequently exhibit "contextual disregard" when faced with input evidence that conflicts with their internal parametric memory, leading to persistent factual hallucinations. Existing mitigation strategies primarily rely on suppressing specific neuron activations or employing computationally expensive contrastive decoding mechanisms, which often result in increased perplexity or significantly elevated inference latency. To address these limitations, we propose Resonant Context Anchoring (RCA), a lightweight inference-time intervention method grounded in the perspective of residual stream signal dynamics. RCA aims to resolve the signal attenuation of external evidence during its propagation through deep networks. The core mechanism involves the orthogonal decoupling of routing logic and information magnitude within the self-attention module. By utilizing raw pre-softmax attention scores as an instantaneous metric of semantic alignment, we construct a dynamic gain field via non-linear rectification to selectively amplify the norms of value vectors corresponding to context tokens, without altering the attention probability distribution. This mechanism effectively elevates the signal-to-noise ratio (SNR) of input evidence within the residual stream mixture, thereby robustly anchoring the generation trajectory to the truthful context during inference. Extensive experiments on the Llama-3 model series demonstrate that RCA significantly improves contextual faithfulness across multiple factual consistency and strong knowledge-conflict tasks, effectively suppressing parametric hallucinations. Furthermore, results confirm that as a training-free and computationally negligible plug-and-play module, RCA achieves a Pareto improvement in faithfulness and fluency while maintaining the model's general language understanding capabilities.
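The core mechanism described above, keeping the attention probabilities fixed while rescaling value vectors by a score-dependent gain, can be sketched for a single query. The exact gain function is not stated in the abstract; the form `1 + alpha * ReLU(score)`, the parameter `alpha`, and the `is_context` mask are illustrative assumptions.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend_with_gain(query, keys, values, is_context, alpha=0.5):
    """Single-query attention whose value vectors are rescaled by a score-dependent gain.

    The attention probabilities come from the unmodified scores; only the value
    vectors of context tokens are amplified (assumed gain: 1 + alpha * ReLU(score)).
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    probs = softmax(scores)
    gains = [1.0 + alpha * max(s, 0.0) if ctx else 1.0 for s, ctx in zip(scores, is_context)]
    out = [sum(p * g * v[i] for p, g, v in zip(probs, gains, values)) for i in range(len(values[0]))]
    return probs, gains, out

# Three tokens; the first and third are marked as context (made-up vectors).
query = [1.0, 0.0]
keys = [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs, gains, out = attend_with_gain(query, keys, values, is_context=[True, False, True])
```

The routing (`probs`) is identical to standard attention, so only the magnitude of what the aligned context token contributes changes. The third token is context but has a negative score, so the rectification leaves its gain at 1.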

2606.01920 2026-06-02 cs.CV

Pool-Select-Refine: Allocation-Aware Generative Dataset Distillation with Soft-Label-Guided Latent Refinement

Pool-Select-Refine: 基于软标签引导潜在精化的分配感知生成式数据集蒸馏

Wenmin Li, Shunsuke Sakai, Zhongkai Zhao, Tatsuhito Hasegawa

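AI summary: Proposes the two-stage Pool-Select-Refine framework, which decouples generation, selection, and refinement through selection from an over-complete candidate pool and soft-label-guided latent refinement, improving budget utilization in diffusion-based dataset distillation.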

详情
AI中文摘要

基于扩散的数据集蒸馏最近作为一种有前景的范式出现,用于将大规模数据集压缩为紧凑的合成集。通过利用预训练的生成先验,这些方法可以比传统的基于匹配的方法更高效地生成逼真的类别条件样本。然而,大多数现有的基于扩散的方法仍然采用僵化的“生成即用”策略,其中生成的样本在固定的每类图像预算下直接被视为最终的蒸馏集。这种设计将候选生成与最终预算分配紧密耦合,可能导致有限预算的冗余浪费或信息不足的样本。在本文中,我们提出“Pool-Select-Refine”,一个用于分配感知生成式数据集蒸馏的两阶段框架。首先,我们不直接使用固定数量的生成样本,而是构建一个过完备的候选池,并在目标预算下选择一个紧凑的子集。其次,我们使用从教师模型导出的软标签监督在潜在空间中精化所选样本,提高语义对齐同时保留生成先验。这种设计明确地将生成、选择和精化解耦,从而更有效地利用蒸馏预算。在大规模和细粒度图像分类基准上的实验表明,所提出的框架在基于扩散的基线上取得了一致的改进。结果表明,在精化之前引入一个筛选阶段是改进基于扩散的数据集蒸馏的一种简单而有效的方法。

英文摘要

Diffusion-based dataset distillation has recently emerged as a promising paradigm for condensing large-scale datasets into compact synthetic sets. By leveraging pretrained generative priors, these methods can produce realistic class-conditional samples more efficiently than traditional matching-based approaches. However, most existing diffusion-based methods still adopt a rigid "Generate-and-Use" strategy, where the generated samples are directly treated as the final distilled set under a fixed images-per-class budget. Such a design tightly couples candidate generation with final budget allocation, which may waste the limited budget on redundant or insufficiently informative samples. In this paper, we propose "Pool-Select-Refine", a two-stage framework for allocation-aware generative dataset distillation. First, instead of directly using a fixed number of generated samples, we construct an over-complete candidate pool and select a compact subset under the target budget. Second, we refine the selected samples in latent space using soft-label supervision derived from the teacher model, improving semantic alignment while preserving the generative prior. This design explicitly decouples generation, selection, and refinement, enabling more effective use of the distillation budget. Experiments on large-scale and fine-grained image classification benchmarks show that the proposed framework delivers consistent gains over diffusion-based baselines. The results suggest that introducing a curation stage before refinement is a simple yet effective way to improve diffusion-based dataset distillation.

2606.01914 2026-06-02 cs.CL cs.CV

Mechanistic Diagnostics of Spatial Lexical Bias in Multimodal Large Language Model Spatial Reasoning

多模态大语言模型空间推理中空间词汇偏差的机制诊断

Chuang Ma, Qianying Liu, Tomoyuki Obuchi, Fei Cheng, Wang Yang, Sudong Cai, Shuyuan Zheng, Akiko Aizawa, Sadao Kurohashi

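AI summary: Finds that multimodal large language models exhibit spatial lexical bias, in which adding a spatial relation word attracts the model toward that option; uses mechanistic interpretability tools to show that the bias arises mainly on the language side rather than the visual side; and proposes a lightweight LLM-only DPO update that effectively mitigates it.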

详情
AI中文摘要

多模态大语言模型(MLLMs)在空间多项选择题上仍不可靠,其失败常归因于视觉信息关注不足。本文识别了一种互补的失败模式——空间词汇偏差:向答案选项添加空间关系词会吸引模型决策,使新添加的选项更可能被选中。使用九个开放权重的MLLMs,我们证明该现象广泛存在。特别地,模型能正确回答二元空间问题,但一旦向答案集添加第三个空间选项,模型便持续选择错误的第三选项。我们将这种二元稳定但三元脆弱的案例隔离为诊断示例,并利用机制可解释性工具,揭示失败的主要原因来自语言侧而非视觉侧:视觉注意力分析和残差流探针表明,在这些失败中,正确的空间关系在内部仍然可用,而不相关选项控制、激活修补和稀疏组件干预将偏差追溯到特定的LLM侧通道和神经元。基于此发现,我们证明在微小的单对象对合成数据上进行轻量级仅LLM的DPO更新可缓解偏差,在合成数据上将四路鲁棒准确率提升高达100个百分点,在更广泛的评估数据集WhatsUp、SpatialMQA-Direct和VSR上分别提升68.0、32.6和20.1个百分点。

英文摘要

Multimodal large language models (MLLMs) remain unreliable on spatial multiple-choice questions, and their failures are often attributed to poorly attended visual information. In this work, we identify a complementary failure mode, spatial lexical bias: adding a spatial relation word to the answer options can attract the model's decision and make the newly added option likely to be selected. Using nine open-weight MLLMs, we show that this phenomenon is widely observed. In particular, models can answer a binary spatial question correctly, yet consistently select an incorrect third spatial option once it is added to the answer set. We isolate such binary-stable but ternary-fragile cases as diagnostic examples and leverage mechanistic interpretability tools, revealing that a substantial part of the failure originates on the language side rather than the visual side: visual attention analyses and residual-stream probes show that the correct spatial relation remains internally available in these failures, while irrelevant-option controls, activation patching, and sparse component interventions trace the bias to specific LLM-side channels and neurons. Based on this finding, we show that a lightweight LLM-only DPO update on tiny single-object-pair synthetic data mitigates the bias, lifting four-way robust accuracy by up to 100 points on synthetic data, and by 68.0, 32.6, and 20.1 points on the broader evaluation datasets WhatsUp, SpatialMQA-Direct, and VSR, respectively.

2606.01912 2026-06-02 cs.AI

SMH-Bench: Benchmarking LLM Agents for Environment-Grounded Reasoning and Action in Smart Homes

SMH-Bench:用于智能家居中环境基础推理与行动的LLM代理基准测试

Kuan Li, Shuo Zhang, Huacan Wang, Fangzhou Yu, Zecheng Sheng, Yi Gu, Weipeng Ming, Lei Xue, Chen Liu, Sen Hu, Ronghao Chen, Siyue Lin, Yuqing Hou, Xiaofeng Mou, Yi Xu

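AI summary: Proposes the SMH-Bench benchmark, built on the executable simulator HomeEnv, which uses 1,100 tasks to evaluate LLM reasoning and action in smart homes and finds that frontier models fall short in automation scheduling, ambiguity handling, and personalized reasoning.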

详情
AI中文摘要

智能家居正朝着复杂的、依赖于状态的生活环境发展,需要大型语言模型(LLM)对用户意图、偏好和多设备交互进行推理。然而,现有的智能家居基准通常侧重于静态的指令到API映射或有限的模拟,未能评估LLM是否能够在现实家庭场景中可靠地进行推理、交互和行动。为了解决这些局限性,我们引入了SMH-Bench,这是一个用于评估智能家居环境中LLM的全面基准。基于可执行且可验证的智能家居模拟器HomeEnv,SMH-Bench包含1100个高质量任务,涵盖7个类别和22个细粒度子类别。它进一步将任务分层为简单、中等和复杂家庭,范围从小型公寓到拥有135个设备的密集多房间环境。实验表明,尽管前沿LLM在显式控制和查询任务上表现强劲,但在自动化任务调度、模糊处理和个性化推理方面仍存在显著弱点,尤其是在家庭复杂性增加时。我们希望SMH-Bench能够促进更可靠、上下文感知且实际可部署的智能家居代理的发展。

英文摘要

Smart homes are evolving toward complex state-dependent living environments, requiring Large Language Models (LLMs) to reason over user intent, preferences, and multi-device interactions. However, existing smart-home benchmarks often focus on static instruction-to-API mapping or limited simulations, failing to evaluate whether LLMs can reason, interact, and act reliably in realistic household scenarios. To address these limitations, we introduce SMH-Bench, a comprehensive benchmark for evaluating LLMs in smart-home environments. Built upon HomeEnv, an executable and verifiable smart-home simulator, SMH-Bench contains 1,100 high-quality tasks spanning 7 categories and 22 fine-grained subcategories. It further stratifies tasks across simple, medium and complex homes, ranging from small apartments to dense multi-room environments with 135 devices. Experiments show that although frontier LLMs achieve strong performance on explicit control and query tasks, they still exhibit significant weaknesses in automation task scheduling, ambiguity handling and personalized reasoning, especially as home complexity increases. We hope SMH-Bench will facilitate the development of more reliable, context-aware, and practically deployable smart-home agents.

2606.01911 2026-06-02 cs.CV

Residual Decoder Adapter: ID-Preserving Tokenizer Adaption for Autoregressive Text Rendering

残差解码器适配器:用于自回归文本渲染的身份保持分词器适配

Dongxing Mao, Jinpeng Wang, Jiahao Tang, Kevin Qinghong Lin, Linjie Li, Zhengyuan Yang, Lijuan Wang, Min Li, Jingru Tan

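AI summary: Proposes the Residual Decoder Adapter (RDA), which introduces a paired codebook and a parallel branch that learns pixel-space residuals, substantially improving text rendering without retraining the tokenizer or the autoregressive model.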

详情
Comments
CVPR 2026 poster
AI中文摘要

视觉自回归(AR)模型通过预测由视觉分词器解码的离散标记来生成图像。尽管展示了强大的整体图像生成能力,但在文本渲染方面仍表现不佳,出现模糊笔画和破坏字母形状。在这项工作中,我们将这一限制追溯到视觉分词器,它难以重建细粒度细节。改进分词器直接但昂贵,因为它需要重新训练分词器和AR模型。我们能否在不重新训练现有分词器和AR模型的情况下提高AR模型的文本渲染性能?为实现这一目标,我们提出了残差解码器适配器(RDA),它在不改变标记空间的情况下事后升级现有分词器。具体来说,它通过引入两个新颖组件来细化视觉分词器的解码器输出:(i)一个与原始标记分布共享的配对码本;(ii)一个并行分支,用于学习像素空间中重建图像与真实图像之间的微小差异(残差)。这种残差设计使我们能够非侵入性地增强分词器,同时保持与先前AR模型的兼容性。RDA大幅提升了文本渲染性能。例如,在具有竞争力的TextAtlas基准测试上,我们使微调后的Janus-Pro OCR准确率从24.52%提高到58.26%(TextVisionBlend),从12.75%提高到36.81%(StyledTextSynth)。代码可在https://github.com/CSU-JPG/RDA获取。

英文摘要

Visual Autoregressive (AR) models generate images by predicting discrete tokens that are decoded by a visual tokenizer. Despite demonstrating strong overall image generation ability, they still underperform on text rendering, producing blurred strokes and distorted letter shapes. In this work, we trace this limitation to the visual tokenizer, which struggles to reconstruct fine-grained detail. Improving the tokenizer is straightforward but expensive, as it necessitates retraining both the tokenizer and the AR model. Can we improve the text rendering performance of AR models without retraining the existing tokenizer and AR model? To achieve this, we propose the Residual Decoder Adapter (RDA), which upgrades an existing tokenizer post-hoc without changing its token space. Specifically, it refines the decoder output of the visual tokenizer by introducing two novel components: (i) a paired codebook that shares the token distribution with the original one; (ii) a parallel branch that learns the small differences (residuals) between the reconstructed image and the ground-truth image in pixel space. This residual design allows us to enhance the tokenizer non-invasively while preserving compatibility with prior AR models. RDA improves text rendering by a large margin. For instance, on the competitive TextAtlas benchmark, we boost the OCR accuracy of finetuned Janus-Pro from 24.52% to 58.26% (TextVisionBlend) and from 12.75% to 36.81% (StyledTextSynth). The code is available at https://github.com/CSU-JPG/RDA

2606.01910 2026-06-02 cs.GR cs.CV

Single-Line Drawing Generation via Semantics-Driven Optimization

基于语义驱动的单线图生成

Tanguy Magne, Alexandre Binninger, Ruben Wiersma, Olga Sorkine-Hornung

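AI summary: Proposes a semantics-driven method that automatically generates single-line drawings in vector format from a text prompt or an input image, using score distillation sampling to optimize the parameters of a uniform rational B-spline curve and additional loss terms to control artistic style; the results outperform existing methods and support downstream fabrication.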

详情
Comments
18 pages, published in Computer Graphics Forum 2026
AI中文摘要

线条画是一种高度表现力的艺术形式,要求艺术家抽象和提炼其主题的本质。我们提出了第一种语义驱动的方法,用于自动生成矢量格式的单线图,该方法可由描述概念的文本提示或描绘概念的输入图像引导。我们的方法利用分数蒸馏采样来优化均匀有理B样条(URBS)曲线的参数,确保绘图由单一连续笔画组成。这种表示提供了对细节水平的精细控制,而额外的损失项使我们能够引导最终的艺术风格。我们证明,我们的方法在此任务上优于最先进的文本到图像模型和优化流程,产生的结果在美学上更令人愉悦,并且更忠实于连续线条画艺术家的风格。此外,由于我们的方法生成矢量化的曲线,它直接支持下游制造过程,如刺绣、激光雕刻和弯线。我们的代码和结果可在 https://github.com/tanguymagne/SLDgen 获取。

英文摘要

Line drawings are a highly expressive art form that requires the artist to abstract and distill the essence of their subject. We present the first semantics-driven method for automatically generating single-line drawings in vector format, guided either by a text prompt describing the concept or an input image depicting it. Our approach leverages score distillation sampling to optimize the parameters of a uniform rational B-spline (URBS) curve, ensuring that the drawing consists of a single continuous stroke by design. This representation provides fine-grained control over the level of detail, while additional loss terms allow us to steer the final artistic style. We demonstrate that our method outperforms state-of-the-art text-to-image models and optimization pipelines for this task, producing results that are both more aesthetically pleasing and more faithful to the style of continuous line drawing artists. Furthermore, because our method generates a vectorized curve, it directly supports downstream fabrication processes such as embroidery, laser engraving and wire bending. Our code and results are available at https://github.com/tanguymagne/SLDgen.

2606.01909 2026-06-02 cs.SD cs.AI eess.AS

Echo: A Joint-Embedding Predictive Architecture for Speaker Diarization and Speech Recognition in a Shared Latent Space

Echo: 一种用于共享潜在空间中说话人日志和语音识别的联合嵌入预测架构

Louis Mouchon

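AI summary: Proposes Echo, a system built on a single 25M-parameter ViT encoder that, through JEPA pretraining and staged specialization, jointly performs speaker diarization, speech separation, and speech recognition in a 512-dimensional latent space, with no fine-tuning at deployment.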

详情
Comments
18 pages, 17 tables, 1 figure. Proof-of-concept, independent research
AI中文摘要

我们提出Echo,一个围绕单个25M参数ViT编码器构建的概念验证音频系统。该编码器使用JEPA目标进行预训练,然后分阶段特化,以在同一个512维潜在空间中承载说话人身份、语音内容和动态源路由,部署时无需针对每个任务进行微调。轻量级头部处理说话人日志(ArcFace + VBx)和动态源分离(空目标K集预测)。在未知K的合成VoxCeleb2混合数据上,标准堆栈达到15.00%的盲DER、97.80%的PIT分离准确率,潜在SI-SDR提升+9.52 dB,以及在留出k-NN探针上说话人/内容因子化差距为+53.50分。Echo的意义不在于任何单一任务上的新SOTA,而在于三个任务在一个编码器上以这种规模共同共存。我们逐阶段记录了设计,报告了死胡同,并识别了通过VQ瓶颈进行端到端ASR的结构性障碍,该瓶颈仍然限制了PoC。

英文摘要

We present Echo, a proof-of-concept audio system built around a single 25 M-parameter ViT encoder. The encoder is pretrained with a JEPA objective and then specialised by stages to carry speaker identity, phonetic content, and dynamic source routing in the same 512-dimensional latent space, with no per-task fine-tuning at deployment. Light heads handle diarization (ArcFace + VBx) and dynamic source separation (null-target K-set prediction). On synthetic VoxCeleb2 mixtures with unknown K, the canonical stack reaches 15.00% blind DER, 97.80% PIT separation accuracy with +9.52 dB latent SI-SDR, and a +53.50-point speaker/content factorisation gap on a held-out k-NN probe. The point of Echo is not a new SOTA on any single task but the joint coexistence of three tasks on one encoder at this footprint. We document the design stage by stage, report the dead-ends, and identify the structural wall on end-to-end ASR through the VQ bottleneck that still bounds the PoC.

2606.01908 2026-06-02 cs.LG cs.CV

Private and Stable Test-Time Adaptation with Differential Privacy

具有差分隐私的私有且稳定的测试时自适应

Zefeng Li, Qiaoyue Tang, Mathias Lecuyer, Evan Shelhamer

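AI summary: Casts several test-time adaptation methods into differentially private form, protecting test-data privacy with per-sample gradient clipping and Gaussian noise; balances privacy and accuracy on ImageNet-C and finds that the clipping mechanism can improve the accuracy and stability of continual adaptation.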

详情
Comments
ICML 2026
AI中文摘要

测试时自适应(TTA)可以通过在推理过程中更新模型来减少在新数据上的误差。然而,这些更新引发了关于测试数据隐私的问题,因为模型参数现在依赖于所有过去的输入。为了控制这种隐私风险,我们将多种流行的TTA方法(Tent、EATA、SAR、DeYO和COME)转化为差分隐私(DP)形式,对所有更新应用逐样本梯度裁剪和高斯噪声。在ImageNet-C上,我们的DP-TTA方法在精度损失较小的情况下提供了足够的隐私,并且在低隐私机制下,DP的裁剪机制甚至可以改善连续设置中自适应的准确性和稳定性。这些对隐私和精度的改进仅带来适度的计算开销。这些关于私有TTA的初步结果提高了对该问题的认识,为开发更私密的测试时更新提供了信息,并确定了逐样本裁剪作为提高自适应准确性和稳定性的有效技术。

英文摘要

Test-time adaptation (TTA) can reduce error on new and different data by updating the model on these inputs during inference. However, these updates raise the issue of privacy w.r.t. the testing data, because the model parameters now depend on all past inputs. To control this privacy risk, we cast multiple popular TTA methods (Tent, EATA, SAR, DeYO, and COME) into differential privacy (DP) forms that apply per-sample gradient clipping and Gaussian noise for all updates. On ImageNet-C, our DP-TTA methods provide adequate privacy at small cost to accuracy, and in the low-privacy regime the clipping mechanism of DP can even improve the accuracy and stability of adaptation in the continual setting. These improvements to privacy and accuracy come at only modest computational overhead. These first results on private TTA raise awareness of the issue, inform the development of more private test-time updates, and identify per-sample clipping as an effective technique for improving the accuracy and stability of adaptation.
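The two DP ingredients the abstract names, per-sample gradient clipping and Gaussian noise, follow the standard DP-SGD recipe, sketched below in plain Python. This is a generic illustration, not the authors' DP-TTA code: the function name, the parameter names, and the toy gradients are assumptions, and privacy accounting is omitted entirely.

```python
import math
import random

def dp_update(per_sample_grads, clip_norm, noise_multiplier, rng):
    """Clip each per-sample gradient to clip_norm, sum, add Gaussian noise, and average."""
    dim = len(per_sample_grads[0])
    summed = [0.0] * dim
    for grad in per_sample_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale down only gradients whose L2 norm exceeds the clipping threshold.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            summed[i] += g * scale
    # Noise standard deviation is tied to the clipping norm (the per-sample sensitivity).
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [x / len(per_sample_grads) for x in noisy]

# Two made-up per-sample gradients: one large (norm 5), one small (norm 0.5).
grads = [[3.0, 4.0], [0.3, 0.4]]
update = dp_update(grads, clip_norm=1.0, noise_multiplier=1.0, rng=random.Random(0))
clipped_only = dp_update(grads, clip_norm=1.0, noise_multiplier=0.0, rng=random.Random(0))
```

With the noise switched off (`clipped_only`), the large gradient is rescaled to unit norm while the small one passes through unchanged. Bounding the influence of any single test input in this way is the property the abstract credits with stabilizing continual adaptation.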

2606.01906 2026-06-02 cs.AI

Bayesian Spectral Emotion Transition Discovery from Multi-Annotator Disagreement

贝叶斯谱情感转移发现:来自多标注者分歧

Keito Inoshita, Takato Ueno

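AI summary: Proposes Bayesian Spectral Emotion Transition Discovery (BSETD), a two-stage framework that mines emotion-transition structure from multi-annotator soft labels and separates inertia and contagion components by spectral decomposition; agreement with psychological theory is validated on the EmotionLines dataset.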

详情
AI中文摘要

情感通过对话的动态过程演变,理解其转移结构对于从心理健康筛查到对话系统等应用至关重要。然而,现有研究通常通过多数投票将多评分者判断压缩为单个硬标签,丢弃了理解轮次间转移所需的不确定性信号。本文提出贝叶斯谱情感转移发现(BSETD),一个从多评分者软标签中发现情感转移结构的两阶段框架。第一阶段,通过软标签的外积构建层次狄利克雷-多项后验,为K×K转移矩阵的每个单元配备可信区间和Benjamini-Hochberg(BH)错误发现率(FDR)控制的显著性。第二阶段,对称图拉普拉斯矩阵经谱分解,分离出低频(惯性)和高频(传染)成分。在EmotionLines上,BSETD同时恢复了两个不同情感空间的标志:Plutchik相邻的转移——厌恶到愤怒(log2提升+0.94)和愤怒到厌恶(+0.86)被过度表示,而Russell效价反转的转移——快乐到愤怒(-0.90)和愤怒到快乐(-0.89)被欠表示。五源跨语料验证得到英语内成对皮尔逊相关0.91-0.98,与中文M3ED对比0.79-0.85,以及同一话语集上人类硬标签与LLM虚拟软标签之间0.979的相关性,表明保留标注者不确定性的流程将情感动态的计算研究与既有的心理学理论联系起来。

英文摘要

Emotions evolve through the dynamics of conversation, and understanding their transition structure is foundational to applications ranging from mental-health screening to dialogue systems. However, existing studies typically compress multi-rater judgments into a single hard label by majority voting, discarding the uncertainty signal needed to understand turn-to-turn transitions. In this article, we propose Bayesian Spectral Emotion Transition Discovery (BSETD), a two-stage framework that discovers emotion-transition structure from multi-rater soft labels. In the first stage, a hierarchical Dirichlet-Multinomial posterior is constructed through the outer product of soft labels, equipping each cell of the $K \times K$ transition matrix with a credible interval and Benjamini-Hochberg (BH) false discovery rate (FDR)-controlled significance. In the second stage, the symmetrized graph Laplacian is spectrally decomposed to separate a low-frequency (inertia) component from a high-frequency (contagion) component. On EmotionLines, BSETD simultaneously recovers the signatures of two distinct affective spaces: the Plutchik-adjacent transitions disgust to anger (log2 lift +0.94) and anger to disgust (+0.86) are over-represented, while the Russell-valence-reversed transitions joy to anger (-0.90) and anger to joy (-0.89) are under-represented. A five-source cross-corpus validation yields pairwise Pearson correlations of 0.91-0.98 within English, 0.79-0.85 against Chinese M3ED, and 0.979 between the human hard labels and the LLM virtual soft labels on the same utterance set, demonstrating that a pipeline preserving annotator uncertainty bridges the computational study of emotion dynamics with established psychological theory.
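The first-stage quantities mentioned above, soft transition counts built from outer products and the log2 lift of each cell, can be sketched as point estimates. This is a deliberate simplification and an assumption on my part: it omits the hierarchical Dirichlet-Multinomial posterior, the credible intervals, and the FDR control, and the lift definition used (joint mass over the product of marginals) is the common one rather than a formula quoted from the paper.

```python
import math

def soft_transition_counts(soft_labels):
    """Sum of outer products of consecutive soft-label vectors (expected transition counts)."""
    k = len(soft_labels[0])
    counts = [[0.0] * k for _ in range(k)]
    for prev, nxt in zip(soft_labels, soft_labels[1:]):
        for i in range(k):
            for j in range(k):
                counts[i][j] += prev[i] * nxt[j]
    return counts

def log2_lift(counts):
    """log2 of observed joint mass over the product of its marginals; None where undefined."""
    k = len(counts)
    total = sum(map(sum, counts))
    row = [sum(counts[i]) for i in range(k)]
    col = [sum(counts[i][j] for i in range(k)) for j in range(k)]
    lift = [[None] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            if counts[i][j] > 0 and row[i] > 0 and col[j] > 0:
                lift[i][j] = math.log2(counts[i][j] * total / (row[i] * col[j]))
    return lift

# Four turns of a dialogue with two emotion classes; each row is an annotator vote share.
soft = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.1, 0.9]]
lift = log2_lift(soft_transition_counts(soft))
```

A positive entry means the transition occurs more often than independent marginals would predict, and a negative entry means less often, which is how the signed values quoted in the abstract read. Using vote shares instead of majority labels lets a split annotation contribute fractional mass to several cells.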

2606.01905 2026-06-02 eess.AS cs.SD

Advancing Electrolaryngeal Speech Enhancement Through Speech-Text Representation Learning

通过语音-文本表示学习推进电喉语音增强

Ding Ma, Jinyi Mi, Fengji Li, Lester Phillip Violeta, Jiajun He, Wenchin Huang, Kazuhiro Kobayashi, Tomoki Toda

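AI summary: Proposes a learning framework that fuses speech and text representations to improve the mapping and reconstruction quality of a sequence-to-sequence voice conversion model from electrolaryngeal to normal speech; experiments show it outperforms methods relying on speech representations alone.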

详情
Journal ref
IEEE Transactions on Biomedical Engineering, Early Access, 2026
Comments
15 pages, 7 figures. Accepted to IEEE TBME
AI中文摘要

目的:喉切除者依赖机电设备产生电喉(EL)语音。与正常语音相比,EL语音存在严重失真、有限的语音变化、不自然的韵律和时间偏移,降低了自然度和可懂度。尽管基于序列到序列(seq2seq)语音转换(VC)的EL语音到正常语音转换(EL2SP)很有前景,但EL与正常语音之间的显著不匹配不可避免地导致累积映射误差,限制了性能。为解决这一问题,我们描述了一种新颖的表示学习框架,该框架整合语音和文本表示,以改善seq2seq VC模型内的映射和重建质量。方法:我们的方法包括两个主要阶段:1)表示整合与学习,以及2)重建训练。首先构建一个能够融入辅助文本信息的网络,使用预训练模块学习基于语音-文本的整合表示。然后,采用自编码器风格的重建策略完成EL2SP模型,以继承这些表示而不增加模型复杂度。我们引入了三种融合策略,包括中级、输入级和混合级融合策略,逐步增强学习。此外,除了标准的seq2seq VC目标外,还引入了对整合表示的额外重建损失,以细化表示迁移。结果:在不同EL2SP数据集上的实验一致表明,我们的方法结合数据增强,优于仅依赖语音表示的基线方法。此外,随着系统设计深度的逐步改进验证了我们方法的有效性。意义:所提出的方法为EL语音增强和辅助通信技术提供了一种可扩展且实用的方法。

英文摘要

Objective: Laryngectomees depend on an electromechanical device to generate electrolaryngeal (EL) speech. Compared with normal speech, EL speech suffers from severe distortion, limited phonetic variation, unnatural prosody, and temporal shifts, degrading naturalness and intelligibility. Although sequence-to-sequence (seq2seq) voice conversion (VC) based EL-speech-to-normal-speech conversion (EL2SP) is promising, substantial mismatches between EL and normal speech inevitably cause cumulative mapping errors that limit performance. To address this, we describe a novel representation learning framework integrating speech and text representations to improve mapping and reconstruction quality within a seq2seq VC model. Methods: Our methodology comprises two main stages: 1) representation integration and learning, and 2) reconstruction training. A network capable of incorporating auxiliary text information is first constructed with pretrained modules to learn integrated speech-text representations. Then, an autoencoder-style reconstruction strategy finalizes the EL2SP model so that it inherits these representations without increasing model complexity. We introduce three fusion strategies (middle-, input-, and hybrid-level fusion) that progressively enhance learning. Moreover, besides standard seq2seq VC objectives, an additional reconstruction loss on the integrated representation is introduced to refine representation transfer. Results: Experiments on different EL2SP datasets consistently demonstrate that our methods, combined with data augmentations, outperform baselines relying solely on speech representations. Furthermore, progressive improvements with system design depth validate the effectiveness of our methods. Significance: The proposed methods provide an extensible and practical methodology for EL speech enhancement and assistive communication technologies.

2606.01901 2026-06-02 cs.CV cs.AI cs.CL

The Image Reconstruction Game: Drawing Common Ground Through Iterative Multimodal Dialogue

图像重建游戏:通过迭代多模态对话建立共同基础

Sherzod Hakimov, Mattia D'Agostini, Ivan Samodelkin, David Schlangen

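AI summary: Proposes the Image Reconstruction Game benchmark, in which a vision-language model issues corrective instructions to an image generator over multiple turns so that accumulated common ground is directly visible as the reconstructed image; finds that the describer is the dominant factor in reconstruction quality, while the generator determines the effect of iterative refinement.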

详情
AI中文摘要

我们引入了图像重建游戏,这是一个全自动基准测试,其中视觉语言模型在多轮迭代中向图像生成器发出纠正指令,使得累积的共同基础直接可视化为渲染图像。通过对七个图像类别中的两个描述器模型与两个生成器模型进行交叉基准测试,我们发现描述器是重建质量的主导因素,而生成器决定迭代改进是否有益。数学和几何图像构成了最大的挑战。描述器的令牌预算强烈影响收敛性:较短的预算产生更稀疏的初始渲染,有更多可见改进的空间,而较长的预算提高了绝对质量,但留下的修复空间较少。更强的描述器使用更丰富的纠正词汇,涵盖空间、数值和结构类别,而较弱的描述器则集中于表面属性,并且往往在几轮后停止。人工验证表明,最佳自动评判器与人类偏好之间仅达到轻微到中等的一致性,并且自动评分需要人工重新校准才能可靠使用。

英文摘要

We introduce the Image Reconstruction Game, a fully automated benchmark in which a vision-language model issues corrective instructions to an image generator across multiple turns, making accumulated common ground directly observable as a rendered image. Benchmarking two Describer models crossed with two Generator models across seven image categories, we find that the describer is the dominant factor in reconstruction quality, while the generator determines whether iterative refinement helps or hurts. Mathematical and geometric images pose the greatest challenge. The describer's token budget strongly affects convergence: shorter budgets yield sparser first renderings with more room for visible improvement, while longer budgets raise absolute quality but leave less to fix. Stronger describers use a richer correction vocabulary spanning spatial, numeric, and structural categories, while weaker describers concentrate on surface properties and tend to stop after a few turns. Human validation shows that the best automated judge reaches only slight-to-fair agreement with human preferences, and automated scores require human recalibration to be used reliably.

2606.01900 2026-06-02 cs.CV

Auteur: Language-Driven Cinematographic Framing for Human-Centric Video Generation

Auteur: 以语言驱动的电影化取景实现以人为中心的视频生成

Muhammed Burak Kizil, Enes Sanli, Niloy J. Mitra, Xuelin Chen, Erkut Erdem, Aykut Erdem, Duygu Ceylan

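AI summary: Proposes Auteur, which parameterizes camera motion as human-centric framing (shot size, angle, and composition) and uses a domain-specific language (DSL) with a fine-tuned multimodal large language model to achieve language-driven cinematographic framing, outperforming existing methods in human-centric video generation.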

详情
AI中文摘要

生成式视频模型在视觉保真度和时间连贯性方面取得了显著进展,但有意地控制相机仍然难以实现。现有框架将相机运动视为像素合成的副产品,产生的轨迹具有随机性、空间不一致性,并且对驱动场景的人类主体漠不关心。在这项工作中,我们提出了Auteur,一种用于生成式视频中语言驱动的、以人为中心的相机取景方法。我们的核心见解是,专业电影制作人构思镜头时并非将其视为世界空间中的轨迹,而是定义为相对于演员的取景,将镜头尺寸、角度和构图编码为人体姿态和运动的函数。我们将这一直觉形式化为一种以人为中心的相机参数化,并引入一种可转换为标准6自由度相机参数的领域特定语言(DSL)。然后,一个微调的多模态大语言模型充当虚拟导演,将自然语言描述和粗略的人体运动映射为稀疏的DSL关键帧,这些关键帧通过确定性插值生成连续的相机轨迹,并作为输入提供给视频生成器。我们在一个新数据集上训练和评估Auteur,该数据集包含34K个对齐的文本、人体运动和DSL标注的相机轨迹,这些轨迹来自程序化合成和CondensedMovies数据集中的真实电影片段。Auteur实现了以人为中心的场景的电影化取景,这一能力在先前的生成模型中基本缺失。为了评估这一行为,我们提出了新的以取景为中心的指标,实验表明Auteur持续优于现有方法。

英文摘要

Generative video models have achieved remarkable visual fidelity and temporal coherence, yet intentional camera control remains elusive. Existing frameworks treat camera motion as a byproduct of pixel synthesis, producing trajectories that are stochastic, spatially inconsistent, and indifferent to the human subject driving the scene. In this work, we present Auteur, a method for language-driven, human-centric camera framing in generative video. Our core insight is that professional filmmakers conceive shots not as world-space trajectories but as framings defined relative to the actor, encoding shot size, angle, and composition as functions of human pose and motion. We formalize this intuition as a human-centric camera parameterization and introduce a Domain-Specific Language (DSL) that is convertible to standard 6-DoF camera parameters. A fine-tuned multimodal large language model then acts as a virtual director, mapping natural language descriptions and coarse human motion to sparse DSL keyframes that are deterministically interpolated into continuous camera trajectories, which are then provided as input to video generators. We train and evaluate Auteur on a new dataset of 34K aligned text, human motion, and DSL-annotated camera trajectories drawn from procedural synthesis and real-world movie footage from the CondensedMovies dataset. Auteur enables cinematographic framing of human-centered scenes, a capability largely absent in prior generative models. To assess this behavior, we propose new framing-focused metrics, and our experiments show that Auteur consistently outperforms existing methods.

2606.01899 2026-06-02 eess.SP cs.AI

RA-LWLM: Retrieval-Augmented In-Context Localization with Wireless Foundation Models

RA-LWLM:基于检索增强的上下文无线定位基础模型

Guangjin Pan, Hui Chen, Hei Victor Cheng, Henk Wymeersch

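AI summary: Proposes the RA-LWLM framework, which externalizes scene-specific information into a fingerprint database to enable training-free cross-scene wireless localization, predicting user position with a frozen wireless foundation model encoder, a retrieval module, and a transformer-based in-context learning module.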

详情
Comments
13 pages, 9 figures. This work has been submitted to the IEEE for possible publication
AI中文摘要

无线定位是第六代(6G)网络的基本能力。传统的基于模型的方法需要对传播环境进行精确建模,在复杂的多径和非视距场景中性能下降,而基于学习的方法将模型参数紧密耦合到训练场景中,每当基站(BS)配置或传播环境变化时需要昂贵的重新训练。在本文中,我们提出RA-LWLM,一种检索增强的上下文定位框架,通过将场景特定信息外化到每个场景的指纹数据库(而非编码在模型权重中)来实现无需训练的跨场景适应。该框架由三个组件组成:一个冻结的无线基础模型(FM)编码器,将原始信道状态信息映射为场景无关的表示;一个检索模块,通过表示空间中的相似性搜索从每个场景的数据库中选择最具信息量的参考;以及一个基于Transformer的上下文学习(ICL)模块,将查询与检索到的参考融合以预测用户设备(UE)位置。为了适应不同查询的检索质量和传播复杂性,ICL模块采用混合专家设计,其中专家专注于不同的上下文大小,并由可学习的选择器软组合。跨不同BS配置的异构场景的广泛基于射线追踪的实验表明,RA-LWLM在未见和已见场景上实现了几乎相同的精度,无需任何每个场景的重新训练,显著优于端到端和基于FM的基线。这些结果验证了所提出的检索增强上下文范式作为6G网络中跨场景定位的可扩展解决方案。

英文摘要

Wireless localization is a fundamental capability of sixth-generation (6G) networks. Conventional model-based methods require accurate modeling of the propagation environment and degrade in complex multipath and non-line-of-sight scenarios, while learning-based methods couple model parameters tightly to the training scene, requiring costly retraining whenever the base station (BS) configuration or propagation environment changes. In this paper, we propose RA-LWLM, a retrieval-augmented in-context localization framework that achieves training-free cross-scene adaptation by externalizing scene-specific information into a per-scene fingerprint database rather than encoding it in model weights. The framework consists of three components: a frozen wireless foundation model (FM) encoder that maps raw channel state information into a scene-agnostic representation; a retrieval module that selects the most informative references from the per-scene database via similarity search in the representation space; and a transformer-based in-context learning (ICL) module that fuses the query with the retrieved references to predict the user equipment (UE) position. To accommodate varying retrieval quality and propagation complexity across queries, the ICL module adopts a mixture-of-experts design in which experts specialize in different context sizes and are softly combined by a learnable selector. Extensive ray-tracing-based experiments across heterogeneous scenes with diverse BS configurations show that RA-LWLM achieves nearly identical accuracy on seen and unseen scenes without any per-scene retraining, substantially outperforming end-to-end and FM-based baselines. These results validate the proposed retrieval-augmented in-context paradigm as a scalable solution for cross-scene localization in 6G networks.
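The retrieval step described above, a similarity search over a per-scene fingerprint database, can be sketched as follows. The sketch assumes cosine similarity and a database of `(embedding, position)` pairs, neither of which is specified in the abstract. The final similarity-weighted average is a deliberately simple stand-in for the paper's transformer-based ICL module, not a description of it.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def retrieve(query, database, k):
    """Return the k (similarity, position) pairs whose embeddings best match the query."""
    scored = [(cosine(query, emb), pos) for emb, pos in database]
    return sorted(scored, key=lambda item: item[0], reverse=True)[:k]

def weighted_position(neighbours):
    """Similarity-weighted mean of the retrieved positions (stand-in for the ICL module)."""
    weights = [max(sim, 0.0) for sim, _ in neighbours]
    total = sum(weights)
    if total == 0:
        weights, total = [1.0] * len(neighbours), float(len(neighbours))
    dim = len(neighbours[0][1])
    return tuple(sum(w * pos[i] for w, (_, pos) in zip(weights, neighbours)) / total for i in range(dim))

# Per-scene fingerprint database: (embedding, known 2-D position). All values are made up.
database = [([1.0, 0.0], (0.0, 0.0)), ([0.0, 1.0], (10.0, 10.0)), ([1.0, 0.1], (1.0, 0.0))]
neighbours = retrieve([1.0, 0.0], database, k=2)
estimate = weighted_position(neighbours)
```

Because the scene knowledge lives in `database` rather than in model weights, moving to a new scene only requires swapping the database, which is the training-free adaptation argument the abstract makes.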

2606.01896 2026-06-02 cs.CV cs.AI

Train, Test, Re-evaluate: Schedule-Sensitive Evaluation of Generative Data for Hand Detection

Train, Test, Re-evaluate: A Schedule-Sensitive Evaluation of Generated Data for Hand Detection

Atmika Bhardwaj, Silvia Vock, Nico Steckhan

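AI summary This study uses multi-stage training-schedule experiments to evaluate how generative inpainting data affects hand detection performance in safety-critical settings, and finds that an appropriate training procedure can significantly improve real-world deployment results.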

Details
Comments
16 pages, 4 figures
AI Chinese abstract

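Generated (or synthetic) image data is increasingly used to augment or replace real training datasets when target images are scarce, expensive, or biased. In hand detection, especially in occupational safety settings, public datasets mostly contain bare hands. This under-represents the variation in hand appearance introduced by gloves, tattoos, jewelry, and other personal protective equipment, creating the distribution shift that safety-critical applications encounter at deployment. We test whether generative inpainting, that is, editing only the hand region of a real photograph to introduce accessories, can narrow this shift gap. On a paired dataset of real images and their synthetic counterparts, we train YOLOv8n hand detectors under six training and scheduling regimes (Experiments A-F, three random seeds each), evaluate each detector on a real test set and on a real-gloves-only test subset, report mean average precision (mAP) at two overlap thresholds (mAP@0.5 and mAP@0.5:0.95), and run paired statistical tests. A two-stage experiment, which trains on real plus synthetic data and then fine-tunes the resulting weights on real data only at a lower learning rate, raises mAP@0.5 over the real-only baseline on the standard real test set and narrows the out-of-distribution gap on real gloves. A three-stage experiment best preserves box tightness, reaching the highest mAP@0.5:0.95 of all experiments in the study. The utility of synthetic data for safety-critical hand detection is determined by the training procedure, and simple multi-stage experiments extract substantial real-deployment gains from inpainted accessory data.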

英文摘要

Generated (or synthetic) image data is increasingly used to augment or replace real training datasets when target imagery is scarce, expensive, or biased. For hand detection, particularly in occupational safety settings, public datasets mostly contain bare hands. This under-represents the variation in hand appearance introduced by gloves, tattoos, jewelry, and other personal protective equipment, creating a distribution shift that safety-critical applications encounter at deployment. We test whether generative inpainting, editing only the hand region of a real photograph to introduce accessories, can close this shift gap. On a paired dataset of real images and their synthetic counterparts, we train YOLOv8n hand detectors under six training-and-scheduling regimes (Experiments A-F, three random seeds each), evaluate every detector on a real test set and on a real-gloves-only test split, and report the mean average precision (mAP) at two overlap thresholds (mAP@0.5 and mAP@0.5:0.95) along with paired statistical tests. A two-stage experiment, which trains on real ∪ synthetic data and then fine-tunes the resulting weights on real-only data at a lower learning rate, increases mAP@0.5 compared to the real-only baseline model on the standard real test set and narrows the real-gloves out-of-distribution gap. A three-stage experiment preserves box tightness best, reaching the highest mAP@0.5:0.95 of any experiment in the study. The utility of synthetic data for safety-critical hand detection is determined by the training procedure, and simple multi-stage experiments extract substantial real-deployment benefit from inpainted accessory data.
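The link between box tightness and the two reported metrics can be made concrete with a small sketch. It assumes the common COCO-style convention in which mAP@0.5:0.95 averages over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05; the abstract does not spell out the thresholds, and this is not a full mAP computation (which also involves ranking detections and precision-recall curves). The helper names `iou` and `thresholds_passed` are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Assumed COCO-style thresholds: 0.50, 0.55, ..., 0.95 (ten values).
IOU_THRESHOLDS = [t / 100 for t in range(50, 100, 5)]


def thresholds_passed(pred, truth):
    """Count the IoU thresholds at which `pred` would match `truth`."""
    score = iou(pred, truth)
    return sum(score >= t for t in IOU_THRESHOLDS)


if __name__ == "__main__":
    truth = (0, 0, 10, 10)
    loose = (0, 0, 10, 14)    # overshoots the hand region
    tight = (0, 0, 10, 10.5)  # hugs the hand region closely
    for name, box in (("loose", loose), ("tight", tight)):
        print(name, round(iou(box, truth), 3), thresholds_passed(box, truth))
```

Both toy predictions count as correct at the 0.5 threshold, so they are indistinguishable under mAP@0.5, but only the tight box keeps matching at the stricter thresholds. That is why a schedule can improve mAP@0.5 without being the best at mAP@0.5:0.95, and the reverse.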