arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.19974 2026-05-20 cs.CV

SphericalDreamer: Generating Navigable Immersive 3D Worlds with Panorama Fusion

Antoine Schnepf, Karim Kassab, Flavian Vasile, Andrew Comport

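AI summary This work proposes SphericalDreamer, which generates multiple panoramic images, lifts them into 3D, and fuses them to produce highly detailed, navigable, immersive 3D outdoor environments, substantially improving scale and navigability.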

Comments Accepted at ICML 2026. Project page available at https://sphericaldreamer.github.io

Abstract

The generation of immersive and navigable 3D environments is increasingly prevalent with the growing adoption of virtual reality and 3D content. However, recent methods face a fundamental limitation: they cannot produce 3D worlds that simultaneously (i) are navigable over long-range spatial extents and (ii) cover the complete omnidirectional field of view ($360^\circ$ horizontally and $180^\circ$ vertically). To address this challenge, we introduce SphericalDreamer, a method for generating fully immersive and long-range 3D outdoor environments from textual prompts. Our approach is built on the generation of multiple panoramic images, which are subsequently lifted into 3D and fused together while maintaining visual and geometric consistency. SphericalDreamer produces highly detailed, fully immersive 3D environments, while substantially improving scale and navigability compared to prior approaches.

2605.19972 2026-05-20 cs.LG cs.AI cs.DB cs.DS

Block-Sphere Vector Quantization

Heesang Ann, Joongkyu Lee, Min-hwan Oh

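AI summary This paper studies vector quantization: it compares rotation-based quantizers under a unified theory, shows that their performance depends on the distortion criterion used, and proposes the Block-Sphere Quantization algorithm to improve rotation-based block quantization.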

Abstract

Vector quantization is a fundamental primitive for scalable machine learning systems, enabling memory-efficient storage, fast retrieval, and compressed inference. Recent rotation-based quantizers such as EDEN, RabitQ, and TurboQuant have introduced strong guarantees and empirical performance, but the surrounding comparisons have been difficult to interpret because they rely on different distortion criteria, probability regimes, and implementation assumptions. As our first contribution, we provide a unified theoretical comparison of these methods and show that their relative advantages are criterion-dependent rather than absolute: EDEN and TurboQuant are favorable for MSE distortion, EDEN is also effective for expected inner-product distortion, and RabitQ provides strong high-probability control. This comparison further clarifies that EDEN provides particularly strong guarantees for expected distortion measures. As our second contribution, we introduce Block-Sphere Quantization (BlockQuant), a new rotation-based block quantization algorithm designed around the spherical geometry of randomly rotated vectors. Unlike coordinate-wise quantizers, BlockQuant quantizes blocks on the sphere, preserving the geometry of rotated embeddings more faithfully. We prove that this block-spherical design theoretically improves over the baselines considered in this paper for both reconstruction MSE and expected inner-product distortion. Our experiments on real embedding datasets and long-context LLM inference tasks show practical gains that are consistent with our theoretical improvements.

2605.19966 2026-05-20 cs.LG cs.AI

Detecting Fluent Optimization-Based Adversarial Prompts via Sequential Entropy Changes

Mohammed Alshaalan, Miguel R. D. Rodrigues

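AI summary This paper proposes CPD, an adversarial-suffix detector based on online change-point detection that standardizes user-token entropies and applies a one-sided CUSUM statistic, improving detection of optimization-based adversarial prompts and achieving higher F1 and AUC across several large language models.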

Comments Accepted at ICML 2026; 20 pages, including 9 pages main text, references, and appendix

Abstract

Optimization-based adversarial suffixes can jailbreak aligned large language models (LLMs) while remaining fluent, weakening static and windowed perplexity-based detectors. We cast adversarial suffix detection as an online change-point detection problem over the token-level next-token entropy stream. Using the LLM system prompt to estimate a robust baseline, we standardize user-token entropies and apply a one-sided CUSUM statistic. The resulting detector, CPD Online (CPD), is model-agnostic, training-free, runs online, and localizes the adversarial suffix onset. On a benchmark of 1,012 optimization-based suffix attacks (GCG, AutoDAN, AdvPrompter, BEAST, AutoDAN-HGA) and 1,012 perplexity-controlled benign prompts, CPD improves F1 over the strongest windowed-perplexity baseline on all six open-weight chat models (LLaMA-2-7B/13B, Vicuna-7B/13B, Qwen2.5-7B/14B). On LLaMA-2-7B at the canonical CUSUM setting ($k=0$), CPD reaches AUROC $0.88$ and F1 $0.82$. Beyond prompt-level detection, CPD concentrates 79.6% of its triggers inside the adversarial suffix, versus 17-46% for windowed perplexity. Finally, when used as a lightweight gate for LLaMA Guard, CPD reduces guard calls by 17-22% on a high-volume, benign-dominated deployment while preserving guard-level detection quality.
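The detection rule described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the median/MAD baseline, the upward direction of the shift, the threshold `h`, and the function name `cusum_detect` are hypothetical choices. Only the standardize-then-one-sided-CUSUM structure and the reference value `k` come from the text.

```python
import statistics

def cusum_detect(system_entropies, user_entropies, k=0.0, h=5.0):
    """One-sided CUSUM over standardized next-token entropies.

    Returns the index of the first user token at which the statistic
    exceeds the threshold h, or None if it never does.
    """
    # Robust baseline from the system prompt: median and MAD. This choice is
    # an assumption; the abstract only says the baseline is "robust".
    med = statistics.median(system_entropies)
    mad = statistics.median([abs(e - med) for e in system_entropies])
    scale = 1.4826 * mad
    if scale == 0.0:
        scale = 1.0  # guard against a degenerate (constant) baseline

    s = 0.0
    for t, e in enumerate(user_entropies):
        z = (e - med) / scale        # standardized user-token entropy
        s = max(0.0, s + z - k)      # one-sided: only upward shifts accumulate
        if s > h:                    # h is a hypothetical alarm threshold
            return t
    return None
```

Called with per-token entropies from the system prompt and from the user turn, the function returns the position at which the alarm first fires, which is how a detector of this kind can localize a suffix onset, or `None` when the statistic never crosses the threshold.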

2605.19959 2026-05-20 cs.LG math.FA

Learning Orthonormal Bases for Function Spaces

Hamidreza Kamkari, Mohammad Sina Nabizadeh, Justin Solomon

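AI summary This paper proposes learning and optimizing orthonormal bases of function spaces with neural networks, exploiting the manifold structure of a Lie group, and proves that even finite-rank generators reach a dense set of orthonormal bases under an appropriate operator topology.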

Abstract

Infinite-dimensional orthonormal basis expansions play a central role in representing and computing with function spaces due to their favorable linear algebraic properties. However, common bases such as Fourier or wavelets are fixed and do not adapt to the structure of a given problem or dataset. In this paper, we aim to represent these bases with neural networks and optimize them. Our key idea is that any target infinite-dimensional orthonormal basis can be viewed either as a point on the Lie manifold of the orthogonal group, or equivalently, as the endpoint of a continuous path on that manifold that connects a reference basis, e.g. Fourier, to that target. Paths on the Lie manifold satisfy ordinary differential equations (ODEs) governed by skew-adjoint integral operators. Using neural networks to define finite-rank generators of such ODEs allows us to parameterize and optimize orthonormal bases in function space. While relying on finite-rank generators to model infinite operators might seem restrictive, we prove a universality result: even with a rank-2 generator, the integrated solutions of the ODE are dense in the orthogonal group under the appropriate operator topology. In other words, for any target orthonormal basis, there exists a path originating from a reference basis and driven by finite-rank generators that gets arbitrarily close to that target basis. We demonstrate the flexibility of our framework by transforming the Fourier basis into the principal components of a functional dataset, eigenfunctions of linear operators, or dynamic modes of energy-preserving physical simulations.
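A finite-dimensional analogue makes the path construction concrete. The sketch below is an assumption-laden toy, not the paper's operator-level method: it uses plain Python lists in place of function-space operators, treats the standard basis as the reference basis, and all helper names are hypothetical. It relies on the fact that a rank-2 skew-symmetric generator $K = v u^\top - u v^\top$ with orthonormal $u, v$ satisfies $K^3 = -K$, so the ODE $\dot{Q} = KQ$, $Q(0) = I$, has the closed-form solution $Q(\theta) = I + \sin\theta\, K + (1 - \cos\theta) K^2$.

```python
import math

def rank2_rotation(u, v, theta):
    """Return exp(theta * K) for K = v u^T - u v^T, with u, v orthonormal."""
    n = len(u)
    # K is skew-symmetric with rank 2; K^2 = -(u u^T + v v^T).
    K = [[v[i] * u[j] - u[i] * v[j] for j in range(n)] for i in range(n)]
    K2 = [[-(u[i] * u[j] + v[i] * v[j]) for j in range(n)] for i in range(n)]
    s, c = math.sin(theta), 1.0 - math.cos(theta)
    return [[(1.0 if i == j else 0.0) + s * K[i][j] + c * K2[i][j]
             for j in range(n)] for i in range(n)]

def matmul(A, B):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def is_orthogonal(Q, tol=1e-9):
    """Check Q^T Q = I entrywise up to tol."""
    n = len(Q)
    Qt = [list(col) for col in zip(*Q)]
    G = matmul(Qt, Q)
    return all(abs(G[i][j] - (1.0 if i == j else 0.0)) < tol
               for i in range(n) for j in range(n))
```

Chaining several such rotations with `matmul` corresponds to a piecewise-constant generator along the path. Every intermediate matrix stays orthogonal, so the transformed reference basis (the columns of the product) remains orthonormal at each step.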

2605.19958 2026-05-20 cs.RO

TravExplorer: Cross-Floor Embodied Exploration via Traversability-Aware 3-D Planning

Han Zheng, Zhe Chen, Yudong Huang, Haoran Liu, Jinghao Wang, Ming Yang, Tong Qin

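AI summary This paper proposes TravExplorer, a framework that combines zero-shot semantic guidance with traversability-aware 3-D planning for cross-floor embodied exploration. A unified volumetric map distinguishes occupied structures from robot-reachable support surfaces and yields traversable frontiers, and a FOV-aware active perception strategy handles incomplete observations during cross-floor traversal. The system is evaluated over 4,195 simulated episodes on HM3D and MP3D and validated in the real world on open-vocabulary target search without prior maps or human intervention.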

Abstract

Zero-shot Object Navigation (ZSON) has shown promise for open-vocabulary target search in unseen environments, yet most existing systems remain tied to planar representations and single-floor assumptions. These assumptions become inadequate in real buildings, where navigation involves floors, stairs, landings, and vertically overlapping spaces. This article presents TravExplorer, a cross-floor embodied exploration framework that couples zero-shot semantic guidance with traversability-aware 3-D planning. TravExplorer maintains a unified volumetric map that distinguishes occupied structures from robot-reachable support surfaces and extracts traversable frontiers from connected support surfaces, including floors, stairs, and landings. A FOV-aware active perception strategy further resolves incomplete observations during cross-floor traversal. To reduce semantic-reasoning latency, a lightweight guidance module aligns a probabilistic instance map from online open-vocabulary segmentation with a spatial value map from fast image-to-text matching. Based on these geometric and semantic memories, a hierarchical planner performs target-aware frontier touring over object hypotheses, traversable frontiers, and stair landmarks, and generates executable cross-floor motions through foothold-guided 3-D search and vertically constrained local trajectory optimization. Experiments over 4,195 simulated episodes on HM3D and MP3D demonstrate consistent advantages over representative ObjectNav baselines. Fifty real-world trials on a Unitree Go2 further validate open-vocabulary target search across single-floor and cross-floor indoor environments without prior maps or human intervention. The code will be released at https://github.com/wuyi2121/TravExplorer.

2605.19957 2026-05-20 cs.CV cs.AI cs.RO

World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks

Zuyao Lin, Jianhui Zhang, Peidong Jia, Xiaoguang Zhao, Shanghang Zhang, Xingyu Chen

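AI summary This paper proposes World-Ego Modeling, a paradigm that decomposes future evolution into world and ego components to address degradation in long-horizon embodied scenarios with hybrid tasks, and validates it on the HTEWorld benchmark.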

Abstract

World models are widely explored in embodied intelligence, yet they typically predict distinct evolutions of the world and the ego within a single stream, where the world captures persistent instruction-agnostic scene regularities and the ego captures robot-centric instruction-conditioned dynamics. This world-ego entanglement leads to degradation in long-horizon embodied scenarios, particularly in hybrid tasks with interleaved navigation and manipulation behaviors. In this paper, we introduce World-Ego Modeling, a new conceptual paradigm that decomposes future evolution into world and ego components. We define the world-ego boundary from three perspectives, i.e., motion-, semantic-, and intention-based views, and analyze three disentanglement strategies with post-, pre-, and full disentanglement. Further, we instantiate this paradigm as the World-Ego Model (WEM), a unified embodied world model that couples an implicit separate world-ego planner with a cascade-parallel mixture-of-experts (CP-MoE) diffusion generator. To enable rigorous evaluation, we further construct HTEWorld, the first benchmark for long-horizon world modeling with hybrid navigation-manipulation tasks, providing 125K video clips (over 4.5M frames) with fine-grained action annotations and 300 multi-turn evaluation trajectories (over 2K instructions). Extensive experiments show that WEM achieves state-of-the-art performance on HTEWorld while remaining competitive on existing manipulation-only benchmarks.

2605.19956 2026-05-20 cs.CV

Towards Fine-Grained Robustness: Attention-Guided Test-Time Prompt Tuning for Vision-Language Models

Jia-Wei Hai, Yijun Wang, Xiu-Shen Wei

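AI summary This paper proposes Attention-Guided Test-Time Prompt Tuning (A-TPT) to improve the robustness of vision-language models under adversarial attacks, using a refined gradient attention mechanism and spatially varying augmentation intensities to improve performance in fine-grained scenarios.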

Comments Accepted by ICML 2026. Project page: this https URL. Code: this https URL.

Abstract

Vision-Language Models (VLMs), such as CLIP, have achieved significant zero-shot performance on downstream tasks with various fine-tuning adaptation methods. However, recent studies have shown that adversarial attacks can significantly degrade the inference ability of VLMs, posing substantial risks to their practical applications. Prevalent test-time adaptation methods typically rely on multi-view augmentation to implement various fine-tuning strategies, which struggle to identify semantic information and are prone to destroying discriminative regions in fine-grained scenarios. To address these limitations, we propose Attention-Guided Test-Time Prompt Tuning (A-TPT), a semantics-preserving method designed for test-time adaptation. We first refine the gradient attention rollout mechanism to identify semantically meaningful regions surviving under adversarial attacks. Furthermore, we leverage them to guide the spatially varying augmentation intensities and multi-view ensemble for prompt tuning and inference. Extensive experiments demonstrate that A-TPT outperforms existing test-time adaptation methods on both adversarial and clean data. Code is available at https://github.com/SEU-VIPGroup/A-TPT.

2605.19952 2026-05-20 cs.CL

Rethinking How to Remember: Beyond Atomic Facts in Lifelong LLM Agent Memory

Jingwei Sun, Jianing Zhu, Jiangchao Yao, Tongliang Liu, Bo Han

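AI summary This paper proposes TriMem, a memory system that maintains three coexisting representation granularities (retained raw dialogue segments, extracted atomic facts, and synthesized profiles) to store accumulated dialogue history faithfully, retrieve it efficiently, and reason over it deeply, addressing the detail loss and weak reasoning of existing methods.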

Abstract

To enable reliable long-term interaction, LLM agents require a memory system that can faithfully store, efficiently retrieve, and deeply reason over accumulated dialogue history. Most existing methods adopt an extracted-fact-based paradigm: handcrafted static prompts compress raw dialogues into atomic facts, which are then stored, matched, and injected into downstream reasoning. Nevertheless, such fact-centric designs inevitably discard fine-grained details in original dialogues and fail to support deep reasoning over scattered isolated facts. Moreover, static prompts cannot maintain consistent extraction granularity across diverse dialogue styles. To address these limitations, we propose TriMem, which maintains three coexisting representation granularities: raw dialogue segments anchored by source identifiers for storage fidelity, extracted atomic facts for efficient memory retrieval, and synthesized profiles that aggregate dispersed facts into holistic semantic understanding for deep reasoning. We further adopt TextGrad-based prompt optimization, which iteratively refines extraction and profiling prompts via response quality feedback, achieving lifelong evolution without any parameter updates. Extensive experiments on LoCoMo and PerLTQA across multiple LLM backbones demonstrate that TriMem consistently outperforms strong memory baselines. The code is available at https://TMLR-TriMem.github.io.

2605.19950 2026-05-20 cs.CV

AffectVerse: Emotional World Models for Multimodal Affective Computing

Bo Zhao, Fanghua Ye, Yixin Ji, Sicheng Zhao, Xiaojiang Peng, Zitong Yu

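AI summary This work proposes AffectVerse, a Qwen2.5-Omni-based multimodal affective computing model whose Emotion World Module performs short-horizon latent affective prediction, using future prediction as a self-supervised signal to improve accuracy.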

Abstract

Humans infer emotions by integrating observed multimodal cues with expectations about how affective states may unfold. Existing multimodal large language models (MLLMs), however, often treat emotion recognition as static fusion over complete audiovisual-text inputs, leaving affective dynamics implicit. We propose AffectVerse, a Qwen2.5-Omni-based model equipped with an Emotion World Module (EWM), an action-free representation-level module for short-horizon latent affective prediction. EWM contains three modules: 1) Cross-Modal Temporal Imagination predicts future video/audio representations from past tokens with multi-step rollout. 2) MAMA (Modality-Aware Multi-step Attention) Belief Aggregation compresses imagined tokens into modality-aware belief tokens. 3) Belief Injection inserts these belief tokens into the LLM for affective reasoning. AffectVerse uses future prediction as a past-conditioned self-supervised signal: it does not replace modeling observed history or require unseen signals at inference, but forces the current belief state to encode transition cues that are predictive of subsequent affective change. Across nine benchmarks, AffectVerse improves by at least 2.57% over other models, while controlled ablations show additive gains from temporal imagination, cross-modal rollout, and belief aggregation. These results suggest predictive belief-state modeling is a practical alternative for affective computing.

2605.19949 2026-05-20 cs.CV

Feed-Forward Gaussian Splatting from Sparse Aerial Views

Dongli Wu, Zhuoxiao Li, Tongyan Hua, Yinrui Ren, Xiaobao Wei, Rongjun Qin, Wufan Zhao

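AI summary This paper proposes AnyCity, an observation-grounded generative reconstruction framework that addresses evidence imbalance when reconstructing large-scale urban scenes from sparse aerial views, using an observation-supported geometry latent and scaffold-conditioned aerial completion tokens to reconstruct high-quality 3D Gaussian scenes.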

Abstract

Reconstructing large-scale urban scenes from sparse aerial views is a crucial yet challenging task. Due to biased top-down and shallow-oblique camera poses, sparse aerial captures exhibit strong evidence imbalance: roofs and open regions are repeatedly observed, while facades, distant buildings, and occluded structures receive little multi-view support. Existing feed-forward 3D Gaussian Splatting methods directly regress a deterministic representation from sparse inputs, but this often leads to ghosting, melted facades, and stretched textures. Recent pseudo-view and video-based generative reconstruction methods use additional supervision or generative priors. However, they often lack a clear separation between observed geometry and prior-driven content, which can lead to plausible but inconsistent structures. We propose AnyCity, an observation-grounded generative reconstruction framework for sparse aerial urban scenes. AnyCity first predicts an observation-supported geometry latent to anchor reliable structures, and then uses scaffold-conditioned aerial completion tokens to predict a gated residual update for weakly constrained content before Gaussian decoding. During training, dense-to-sparse distillation transfers structural cues from dense-view reconstruction, while an aerial-adapted video diffusion prior provides fine-grained urban appearance cues through gated token conditioning. Observation-preserving objectives keep the refined representation consistent with input-supported geometry. At inference time, AnyCity reconstructs the final 3D Gaussian scene from sparse aerial views in a single feed-forward pass, achieving coherent urban novel-view synthesis with second-level inference. Experiments on synthetic, aerial-domain, UAV-textured, and real-world scenes show consistent improvements over feed-forward baselines.

2605.19947 2026-05-20 cs.LG

Exploiting Non-Negativity in DAG Structure Learning

Samuel Rey, Madeline Navarro, Gonzalo Mateos

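AI summary This paper studies how a non-negativity constraint simplifies the non-convex optimization problem in DAG structure learning, proposes a regularized non-negative DAG learning algorithm based on the method of multipliers, and proves that in the population regime the true DAG is the unique global minimizer.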

Abstract

This work addresses the problem of learning directed acyclic graphs (DAGs) from nodal observations generated by a linear structural equation model. DAG learning is a central task in signal processing, machine learning, and causal inference, but it remains challenging because acyclicity is a global combinatorial property. Continuous acyclicity constraints have led to important algorithmic advances by replacing the discrete DAG constraint with smooth equality constraints. However, existing formulations still involve difficult non-convex optimization landscapes and may suffer from degenerate first-order optimality conditions. Here, we restrict attention to DAGs with non-negative edge weights and exploit this additional structure to obtain a simpler characterization of acyclicity. Building on this characterization, we formulate a regularized non-negative DAG learning problem and develop an algorithm based on the method of multipliers. We further analyze the benign optimization landscape induced by non-negativity. In the population regime, we show that the true DAG is the unique global minimizer of the proposed augmented-Lagrangian formulation; moreover, the landscape contains no spurious interior stationary points, and the true DAG is the only acyclic KKT point. Numerical experiments on synthetic and real-world data show that the proposed method improves over state-of-the-art continuous DAG-learning alternatives.
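Why non-negativity simplifies acyclicity can be seen from one standard fact, sketched below. This is an illustration and not necessarily the characterization used in the paper: for a non-negative weighted adjacency matrix $W$ on $d$ nodes, $\operatorname{tr}(W^k)$ sums the weight products of all closed walks of length $k$, so no cancellation is possible and the graph is acyclic exactly when $\sum_{k=1}^{d} \operatorname{tr}(W^k) = 0$. The function name `cycle_score` is hypothetical.

```python
def matmul(A, B):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def cycle_score(W):
    """Sum of tr(W^k) for k = 1..d; zero iff a non-negative W is acyclic."""
    d = len(W)
    if any(w < 0 for row in W for w in row):
        raise ValueError("this test is only valid for non-negative weights")
    total, P = 0.0, W
    for _ in range(d):
        total += sum(P[i][i] for i in range(d))  # weight of closed walks
        P = matmul(P, W)                         # move to the next power
    return total

# A chain 0 -> 1 -> 2 is acyclic; adding the edge 2 -> 0 closes a cycle.
chain = [[0, 2, 0], [0, 0, 3], [0, 0, 0]]
cyclic = [[0, 2, 0], [0, 0, 3], [0.5, 0, 0]]
```

With signed weights, closed walks of opposite sign could cancel in the trace, which is why general continuous formulations typically work with an elementwise-squared matrix instead; restricting to non-negative weights removes that need.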

2605.19944 2026-05-20 cs.LG cs.AI cs.CC cs.CL

A Measure-Theoretic Analysis of Reasoning: Structural Generalization and Approximation Limits

Yuyang Zhang, Yifu Zhang, Xuehai Zhou, Xiaoyin Chen

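AI summary This paper analyzes reasoning through optimal transport theory, clarifying the theoretical mechanisms behind structural generalization and approximation limits, and finds that position-dependent attention and Transformer circuit depth strongly affect reasoning performance.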

Comments Preprint

Abstract

While empirical scaling laws for LLM reasoning are well-documented, the theoretical mechanisms governing out-of-distribution (OOD) generalization remain elusive. We formalize reasoning via optimal transport, projecting discrete trajectories into a continuous metric space to quantify domain shifts using the Wasserstein-1 distance. Invoking Kantorovich duality, we bound OOD generalization via architectural Lipschitz continuity and functional approximation limits. This exposes two primary constraints. First, position-dependent attention (e.g., Absolute Positional Encoding) fails to preserve shift invariance, yielding an $\Omega(1)$ Lipschitz constant and expected risk, whereas shift-invariant mechanisms (e.g., Rotary Embeddings) preserve equivariance and bound the error. Second, by mapping sequential backtracking to a Dyck-$k$ language, we establish a strict circuit depth lower bound for $\text{TC}^0$ Transformers. Scaling physical layer depth is necessary to avert representation collapse -- a constraint that scaling representation width cannot bypass due to irreducible approximation bounds in Barron spaces. Evaluations across 54 Transformer configurations on combinatorial search corroborate these bounds, demonstrating that generalization risk degrades monotonically with the Wasserstein domain shift.
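The role of Kantorovich duality in a bound of this kind can be illustrated with the standard Kantorovich-Rubinstein argument. This is a textbook sketch, not the paper's theorem: the symbols $P$ (training distribution), $Q$ (shifted distribution), $g$ (the loss composed with the model), and $L$ (its Lipschitz constant) are assumptions introduced here for illustration.

```latex
W_1(P, Q) \;=\; \sup_{\|f\|_{\mathrm{Lip}} \le 1}
  \Big( \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{x \sim Q}[f(x)] \Big)
\quad\Longrightarrow\quad
\big| \mathbb{E}_{x \sim Q}[g(x)] - \mathbb{E}_{x \sim P}[g(x)] \big|
  \;\le\; L \, W_1(P, Q)
\quad \text{for every $L$-Lipschitz } g .
```

The inequality follows by applying the duality to $g/L$ and to $-g/L$, both of which are 1-Lipschitz. Under these assumptions the risk gap under shift is controlled by the product of the Lipschitz constant and the Wasserstein-1 shift, which is the sense in which an architecture's Lipschitz behavior enters such a bound.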

2605.19943 2026-05-20 cs.AI

Probabilistic Tiny Recursive Model

Amin Sghaier, Ali Parviz, Alexia Jolicoeur-Martineau

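AI summary This paper proposes the Probabilistic Tiny Recursive Model (PTRM), which injects Gaussian noise at each recursion step so that parallel trajectories explore diverse solution basins, improving accuracy on several benchmarks, including Sudoku-Extreme and puzzles from Pencil Puzzle Bench, without retraining or task-specific augmentation.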

Abstract

Tiny Recursive Models (TRM) solve complex reasoning tasks with a fraction of the parameters of modern large language models (LLMs) by iteratively refining a latent state and final answer. While powerful, their deterministic recursion can lead to convergence at suboptimal solutions, with no escape mechanism. A common workaround relies on task-specific input perturbations at test time combined with answer aggregation via voting. We introduce Probabilistic TRM (PTRM), a task-agnostic framework for test-time compute scaling that addresses this limitation through stochastic exploration. PTRM injects Gaussian noise at each deep recursion step, enabling parallel trajectories to explore diverse solution basins, and selects among them using the model's existing Q head (used for early stopping in the original TRM). Without requiring retraining or task-specific augmentations, PTRM enables substantial accuracy gains across benchmarks, including Sudoku-Extreme (87.4% to 98.75%) and various puzzles from Pencil Puzzle Bench (62.6% to 91.2%). On the latter, PTRM achieves nearly double the accuracy of frontier LLMs (91.2% vs. 55.1%) at less than 0.0001x the cost, using only 7M parameters.
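The test-time procedure can be sketched generically. Everything concrete below is a stand-in chosen for illustration: `step` plays the role of one deep recursion step, `q_head` the role of the model's Q head, and the noise scale, trajectory count, and the choice to add noise after each step are assumptions rather than details taken from the paper.

```python
import random

def ptrm_select(step, q_head, z0, n_traj=8, n_steps=6, sigma=0.1, seed=0):
    """Run several noisy recursions and keep the one the scorer prefers.

    step   : hypothetical stand-in for one recursion step, z -> z
    q_head : hypothetical stand-in for the Q head, z -> score (higher is better)
    """
    rng = random.Random(seed)
    best_z, best_q = None, float("-inf")
    for _ in range(n_traj):  # trajectories are independent, so parallelizable
        z = list(z0)
        for _ in range(n_steps):
            # Deterministic refinement followed by Gaussian noise injection
            # (the placement of the noise is an assumption).
            z = [zi + rng.gauss(0.0, sigma) for zi in step(z)]
        q = q_head(z)
        if q > best_q:
            best_z, best_q = z, q
    return best_z, best_q

# Toy stand-ins: contraction toward the origin, scored by negative squared norm.
toy_step = lambda z: [0.5 * zi for zi in z]
toy_q = lambda z: -sum(zi * zi for zi in z)
```

Setting `sigma=0.0` recovers the deterministic recursion, so the sketch makes the contrast in the abstract explicit: the only changes are the per-step noise and the final selection among trajectories, with no retraining of `step` or `q_head`.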

2605.19940 2026-05-20 cs.AI cs.RO

Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains

Rebecca Ramnauth, Drazen Brscic, Brian Scassellati

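AI summary This paper proposes a robotics-inspired guardrail framework for runtime behavioral control of foundation models in socially sensitive domains, mitigating drift of interaction trajectories into undesirable regimes while adapting to diverse social contexts.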

Comments Under review at Journal of Artificial Intelligence Research (JAIR)

Abstract

Foundation models are increasingly deployed in socially sensitive domains such as education, mental health, and caregiving, where failures are often cumulative and context-dependent. Existing guardrail approaches -- ranging from training-time alignment to prompting, decoding constraints, and post-hoc moderation -- primarily provide empirical risk reduction rather than enforceable behavioral guarantees, and largely treat safety as a property of individual outputs rather than interaction trajectories. We reframe guardrails as a problem of runtime behavioral control over interaction trajectories, drawing on robotics to introduce formal constructs for constraint enforcement in uncertain, closed-loop systems. We instantiate these ideas in the Grounded Observer framework and apply it across three real-world deployments: small talk, in-home autism therapy, and behavioral de-escalation in schools. Across settings, the framework enables runtime interventions that mitigate drift into undesirable interaction regimes while adapting to diverse social contexts. We discuss extensions to the framework and propose research directions toward stronger guarantees.

2605.19936 2026-05-20 cs.CL

What Are LLMs Doing to Scientific Communication? Measuring Changes in Writing Practices and Reading Experience

Filip Miletić, Neele Falk

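AI summary This study examines how large language models affect the style of scientific communication. Analyzing more than 37,000 NLP papers and 3,000 human-written passages with their LLM-improved versions, it finds significant changes in writing practices and reading experience and highlights the subjective effect of AI-assisted writing on reading experience.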

Comments Accepted to LREC 2026

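Details
AI Chinese abstract

Has the style of scientific communication changed because of the widespread use of large language models in the writing process? We explore this question through two data resources in the field of Natural Language Processing: a naturalistic corpus of more than 37,000 ACL conference papers (2020-2024) and a synthetic dataset of 3,000 human-written passages with their LLM-generated improved versions. We first carry out a series of diachronic lexical analyses, showing that word frequencies and usage contexts have changed significantly over time, which indicates semantic specialization in some cases and generalization in others. Broadening our view, we then model more complex stylistic features and find that LLM-modified texts more often contain certain syntactic constructions, more complex and longer words, and lower lexical diversity. Finally, through a pilot annotation study with 20 domain experts, we link these changes in writing practices to subjective reading experience. Overall, the experts consider LLM-improved texts easier to understand and more exciting, yet they also express negative qualitative attitudes towards LLMs, underscoring the strongly subjective effect of AI-assisted writing on reading experience.

English abstract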

Has the style of scientific communication changed due to the growing use of large language models in the writing process? We address this question in the domain of Natural Language Processing by leveraging two data resources we create: a naturalistic corpus of over 37,000 papers from the ACL Anthology (2020-2024); and a synthetic dataset of 3,000 human-written passages and their LLM-generated improvements. We first implement a series of diachronic lexical analyses, showing that both word frequency and usage contexts have changed significantly over time, indicating semantic specialization in some cases and generalization in others. Broadening our perspective, we then model a range of more complex stylistic features and find that LLM-modified texts more frequently contain certain syntactic constructions, more complex and longer words, and lower lexical diversity. Finally, we connect these changes in writing practices to subjective reading experience through a pilot annotation study with 20 domain experts. Overall, they rate LLM-improved texts as more understandable and exciting, but also express negative qualitative attitudes towards LLMs, highlighting the strongly subjective effect of AI-assisted writing on reading experience.
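
To make the lexical-diversity finding concrete, here is a minimal sketch of one common diversity measure, the type-token ratio. The abstract does not say which measure the authors use, so the choice of metric, the tokenizer, and the two example sentences are all assumptions made for illustration.

```python
import re


def type_token_ratio(text: str) -> float:
    """Share of distinct word forms among all word tokens (0.0 for empty text)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Hypothetical sentences, not taken from the paper's datasets.
human = "we test the model and report what we find"
revised = "we leverage the model and we leverage the results"

print(type_token_ratio(human))    # 8 distinct words out of 9 tokens
print(type_token_ratio(revised))  # 6 distinct words out of 9 tokens
```

The type-token ratio is sensitive to text length, so comparisons are only meaningful between passages of similar size.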

2605.19932 2026-05-20 cs.AI cs.CL cs.LG

PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

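PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

Zhuohan Gu, Qizheng Zhang, Omar Khattab, Samuel Madden

AI summary: This paper presents PEEK, a system that caches and maintains orientation knowledge in a context map to make long-context LLM agents more accurate and efficient when interacting with recurring external contexts, with clear gains over baselines on both reasoning and context learning tasks.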

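Details
AI Chinese abstract

Large language model (LLM) agents increasingly operate over long and recurring external contexts, such as document corpora and code repositories. Across invocations, existing approaches preserve the agent's trajectory, passive access to raw material, or task-level strategies. None of them preserves what we consider most needed for repeated same-context workloads: reusable orientation knowledge about the recurring context itself (for example, what the context contains, how it is organized, and which entities, constants, and schemas have been useful in the past). We introduce PEEK, a system that caches and maintains this orientation knowledge in a context map: a small, fixed-size artifact that is always present in the agent's prompt and gives the agent a persistent view of the external context. The map is maintained by a programmable cache policy with three modules: a Distiller that extracts transferable knowledge from inference-time signals, a Cartographer that turns it into structured edits, and a priority-based Evictor that enforces a fixed token budget. On long-context reasoning and information aggregation, PEEK improves over strong baselines by 6.3-34.0% while using 93-145 fewer iterations and costing 1.7-5.8x less than ACE, the state-of-the-art prompt-learning framework. On context learning, PEEK raises solving rate and rubric accuracy by 6.0-14.0% and 7.8-12.1%, respectively, at 1.4x lower cost than ACE. These gains generalize across language models and agent architectures, including the production-grade OpenAI Codex. These results show that a context map helps long-context LLM agents interact with recurring external contexts more accurately and efficiently.

English abstract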

Large language model (LLM) agents increasingly operate over long and recurring external contexts, like document corpora and code repositories. Across invocations, existing approaches preserve either the agent's trajectory, passive access to raw material, or task-level strategies. None of them preserves what we argue is most needed for repeated same-context workloads: reusable orientation knowledge (e.g., what the context contains, how it is organized, and which entities, constants, and schemas have historically been useful) about the recurring context itself. We introduce PEEK, a system that caches and maintains this orientation knowledge as a context map: a small, constant-sized artifact in the agent's prompt that gives it a persistent peek into the external context. The map is maintained by a programmable cache policy with three modules: a Distiller that extracts transferable knowledge from inference-time signals, a Cartographer that translates it into structured edits, and a priority-based Evictor that enforces a fixed token budget. On long-context reasoning and information aggregation, PEEK improves over strong baselines by 6.3-34.0% while using 93-145 fewer iterations and incurring 1.7-5.8x lower cost than the state-of-the-art prompt-learning framework, ACE. On context learning, PEEK improves solving rate and rubric accuracy by 6.0-14.0% and 7.8-12.1%, respectively, at 1.4x lower cost than ACE. These gains generalize across LMs and agent architectures, including OpenAI Codex, a production-grade coding agent. Together, these results show that a context map helps long-context LLM agents interact with recurring external contexts more accurately and efficiently.
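
The idea of a priority-based Evictor enforcing a fixed token budget can be sketched as follows. This is not PEEK's implementation: the `MapEntry` type, the priority scores, the whitespace token counter, and the example entries are hypothetical stand-ins chosen to keep the sketch self-contained.

```python
from dataclasses import dataclass


@dataclass
class MapEntry:
    text: str
    priority: float  # hypothetical usefulness score; higher means keep longer


def count_tokens(text: str) -> int:
    # Crude whitespace count standing in for a real LM tokenizer (assumption).
    return len(text.split())


def evict_to_budget(entries: list[MapEntry], budget: int) -> list[MapEntry]:
    """Drop lowest-priority entries until the map fits the token budget."""
    kept = sorted(entries, key=lambda e: e.priority, reverse=True)
    total = sum(count_tokens(e.text) for e in kept)
    while kept and total > budget:
        total -= count_tokens(kept.pop().text)  # last item has the lowest priority
    return kept


# Hypothetical context-map entries for a code repository.
context_map = [
    MapEntry("repo uses src/ layout with tests under tests/", 0.9),
    MapEntry("config constants live in settings.py", 0.6),
    MapEntry("README mentions an old build script", 0.1),
]
trimmed = evict_to_budget(context_map, budget=14)
print([e.text for e in trimmed])
```

Because the budget is enforced after every update, the prompt overhead of the map stays constant no matter how many invocations contribute knowledge to it.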

2605.19931 2026-05-20 cs.CV cs.AI cs.LG

StruMPL: Multi-task Dense Regression under Disjoint Partial Supervision and MNAR Labels

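StruMPL: Multi-task Dense Regression under Disjoint Partial Supervision and MNAR Labels

Reza M. Asiyabi, Juan Alberto Molina-Valero, The SEOSAW Partnership, Steven Hancock, Casey M. Ryan

AI summary: For multi-task dense regression under disjoint partial supervision and MNAR labels, this paper proposes StruMPL, which combines a shared encoder, a learnable physics module, and an Augmented IPW loss to improve the accuracy of forest aboveground biomass estimation.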

Comments 10 pages with 3 figures and 4 tables, References and Appendix 12 pages with 1 figure and 4 tables

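Details
AI Chinese abstract

Estimating forest aboveground biomass (AGB) from Earth observation combines two structurally incompatible label sources: spaceborne lidar provides canopy structure at millions of locations but no biomass estimate, whereas ground plots provide biomass at thousands of biased locations but no structure metrics. No single training sample carries labels for every target variable, plot labels are missing not at random (MNAR), and biomass is tied to the structural variables by known but biome-specific allometric laws. We formalize this as multi-task dense regression under heterogeneous disjoint partial supervision, with MNAR labels and inter-task physical constraints, and propose StruMPL to solve it jointly. A shared encoder feeds per-variable regression, imputation, and propensity heads for spatial MNAR correction, together with a learnable physics module that evaluates the inter-task constraint on the model's own predictions at every pixel. The supervised loss uses an Augmented IPW (AIPW) pseudo-outcome with stop-gradients on the propensity and on the imputation baseline; we show analytically and empirically that both are necessary for joint optimization to recover IPW-weighted stationary points while keeping the loss bounded. On two ecologically distinct biomes, StruMPL outperforms ablation variants and the closest published method in AGB RMSE and bias, and a stratified analysis shows that AIPW reduces high-AGB bias by about 54%.

English abstract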

Estimating forest aboveground biomass (AGB) from Earth observation combines two structurally incompatible label sources: spaceborne lidar provides canopy structure at millions of locations but no biomass estimate, and ground-based plots provide biomass at thousands of biased locations but no metrics of structure. No single training sample carries labels for all target variables, plot labels are missing not at random (MNAR), and biomass is linked to the structural variables by known but biome-specific allometric laws. We formalise this as multi-task dense regression under heterogeneous disjoint partial supervision with MNAR labels and inter-task physical constraints, and propose StruMPL to address it jointly. A shared encoder feeds per-variable regression, imputation, and propensity heads for spatial MNAR correction, and a learnable physics module that evaluates the inter-task constraint on the model's own predictions at every pixel. The supervised loss uses an Augmented IPW (AIPW) pseudo-outcome with stop-gradients on the propensity and on the imputation baseline; we show analytically and empirically that both are necessary for joint optimisation to recover IPW-weighted stationary points while keeping the loss bounded. On two ecologically distinct biomes, StruMPL outperforms ablation variants and the closest published method on AGB RMSE and bias, with a stratified analysis showing AIPW reduces high-AGB bias by ~54%.
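
For readers unfamiliar with AIPW, the textbook form of the pseudo-outcome with the two stop-gradients can be written as below. The notation is an assumption rather than the paper's own: $r_i$ indicates whether location $i$ has a plot label $y_i$, $\hat{\pi}$ is the propensity head, $\hat{m}$ the imputation head, $f_\theta$ the regression head, and $\operatorname{sg}[\cdot]$ blocks gradients. The squared-error loss is likewise only one plausible choice.

```latex
\tilde{y}_i = \operatorname{sg}\!\left[\hat{m}(x_i)\right]
  + \frac{r_i}{\operatorname{sg}\!\left[\hat{\pi}(x_i)\right]}
    \left( y_i - \operatorname{sg}\!\left[\hat{m}(x_i)\right] \right),
\qquad
\mathcal{L}_{\text{sup}} = \frac{1}{N} \sum_{i=1}^{N}
  \left( f_\theta(x_i) - \tilde{y}_i \right)^2
```

When $r_i = 0$ the correction term vanishes and the target falls back to the imputed value; when $r_i = 1$, labels from rarely sampled locations (small $\hat{\pi}$) are upweighted, which is what counteracts the MNAR sampling bias.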

2605.19929 2026-05-20 cs.CV cs.AI

Breaking Modality Heterogeneity in Low-Bit Quantization for Large Vision-Language Models

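Breaking Modality Heterogeneity in Low-Bit Quantization for Large Vision-Language Models

Yi Zhong, Haotong Qin, Xindong Zhang, Lei Zhang, Guolei Sun

AI summary: This paper proposes the SplitQ framework, which uses channel splitting and an adaptive cross-modal calibration module to address the accuracy loss that modality heterogeneity causes in low-bit quantization of large vision-language models, substantially improving performance on multiple multi-modal datasets.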

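Details
AI Chinese abstract

Low-bit post-training quantization (PTQ) is a key technique for deploying Vision-Language Models (VLMs) on resource-constrained devices. However, existing PTQ methods reduce VLM accuracy because text and vision modalities have heterogeneous activation distributions during quantization. We find that this cross-modal heterogeneity is spread unevenly across channels: a small subset of channels contains most modality-specific outliers, and these outliers usually sit in different channels for each modality. Motivated by this, we propose SplitQ, a post-training quantization framework based on channel splitting. At its core is a new Modality-specific Outlier Channel Decoupling (MOCD) module that isolates salient modality-specific outlier channels with minimal overhead. To address the remaining cross-modal distribution discrepancies, we design an Adaptive Cross-Modal Calibration (ACC) module that uses dual lightweight learnable branches to dynamically mitigate modality-induced quantization errors. Extensive experiments on popular VLMs show that SplitQ significantly outperforms existing methods on 6 popular multi-modal datasets under all evaluated quantization settings, including W4A8, W4A4, W3A3, and W3A2. Notably, SplitQ retains 93.5% of FP16 performance under the challenging W3A3 setting (69.5 vs. 74.3), advancing the efficiency frontier for deploying advanced VLMs. Our code is available at https://github.com/EMVision-NK/SplitQ.

English abstract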

Low-bit post-training quantization (PTQ) is a pivotal technique for deploying Vision-Language Models (VLMs) on resource-constrained devices. However, existing PTQ methods often degrade VLMs' accuracy due to the heterogeneous activation distributions of text and vision modalities during quantization. We find that this cross-modal heterogeneity is distributed unevenly across channels: a small subset of channels contains most modality-specific outliers, and these outliers typically reside in different channels for each modality. Motivated by this, we propose SplitQ, a channel-Splitting-driven post-training Quantization framework. At its core, SplitQ introduces a novel Modality-specific Outlier Channel Decoupling (MOCD) module that effectively isolates salient modality-specific outlier channels with minimal overhead. To further address the remaining cross-modal distribution discrepancies, we design an Adaptive Cross-Modal Calibration (ACC) module that employs dual lightweight learnable branches to dynamically mitigate modality-induced quantization errors. Extensive experiments on popular VLMs demonstrate that SplitQ significantly outperforms existing approaches across 6 popular multi-modal datasets under all evaluated quantization settings, including W4A8, W4A4, W3A3, and W3A2. Notably, SplitQ preserves 93.5% of FP16 performance under the challenging W3A3 setting (69.5 vs. 74.3), pushing the efficiency frontier for deploying advanced VLMs. Our code is available at https://github.com/EMVision-NK/SplitQ
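
Why isolating outlier channels helps at low bit widths can be seen in a toy example. The sketch below is not SplitQ's MOCD module: it uses plain symmetric uniform quantization, a hand-picked outlier index, and made-up activation values, all of which are assumptions for illustration.

```python
def quantize(values, bits):
    """Symmetric uniform fake-quantization with one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def split_quantize(values, outlier_idx, bits):
    """Quantize outlier channels and regular channels with separate scales."""
    out = list(values)
    regular_idx = [i for i in range(len(values)) if i not in outlier_idx]
    for group in (sorted(outlier_idx), regular_idx):
        if group:
            for i, q in zip(group, quantize([values[i] for i in group], bits)):
                out[i] = q
    return out


# Hypothetical per-channel activations; channel 3 is the outlier.
acts = [0.1, -0.2, 0.3, 12.0, -0.12, 0.22]
shared = mse(acts, quantize(acts, bits=3))
split = mse(acts, split_quantize(acts, {3}, bits=3))
print(shared, split)
```

With a single scale, the outlier stretches the 3-bit grid so far that every small channel rounds to zero; giving the outlier its own scale lets the remaining channels use the full grid.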

2605.19926 2026-05-20 cs.LG

JAXenstein: Accelerated Benchmarking for First-Person Environments

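JAXenstein: Accelerated Benchmarking for First-Person Environments

Ruo Yu Tao, George Konidaris

AI summary: This paper presents JAXenstein, an open-source JAX-based benchmark for accelerating experiments on first-person visual tasks; by implementing the Wolfenstein 3D rendering engine, it improves experimental efficiency and supports more complex environments.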

Comments Main paper: 5 pages, supplementary material: 3 pages

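Details
AI Chinese abstract

Progress in reinforcement learning algorithms has long been driven by challenging benchmarks. The speed at which researchers can iterate on a problem setting directly affects the speed of algorithm development. Modern machine learning has produced tools, such as the JAX library, that enable fast and scalable algorithm development. With these tools available, the main bottleneck in algorithm development is the availability of large and complex domains for experimentation. In particular, the JAX reinforcement learning ecosystem has no benchmark that tests visual first-person tasks; these domains are essential for testing exploration and an agent's ability to overcome partial observability. We introduce JAXenstein, an open-source JAX-based benchmark that implements the Wolfenstein 3D rendering engine to enable fast and scalable experimentation on visual first-person tasks. JAXenstein is several times faster than comparable vision-based benchmarks and can easily be extended to more complex first-person domains.

English abstract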

The progress of reinforcement learning algorithms has been driven by challenging benchmarks. The rate at which a researcher can iterate on a problem setting directly impacts the speed of algorithm development. Modern machine learning has produced tools, like the JAX library, that allow for fast and scalable algorithm development. With these tools available, a serious bottleneck in algorithm development is the availability of large and complex domains for experimentation. Most notably, the JAX reinforcement learning ecosystem does not have any benchmarks that test visual first-person tasks; these domains are crucial for testing both exploration and an agent's ability to overcome partial observability. We introduce JAXenstein: an open-source JAX-based benchmark that implements the Wolfenstein 3D rendering engine for fast and scalable experimentation in visual first-person tasks. JAXenstein is several times faster than comparable vision-based benchmarks, and is easily extensible to more complex first-person domains.
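
Wolfenstein-style rendering rests on ray casting over a 2D grid: one ray per screen column, with the wall slice drawn taller the closer the hit. The sketch below illustrates that idea in plain Python with simple ray marching. It is not JAXenstein's code, it is not written in JAX, and the map, step size, and field of view are assumptions.

```python
import math

# Hypothetical map: '#' is a wall, '.' is free space.
GRID = [
    "#####",
    "#...#",
    "#...#",
    "#####",
]


def cast_ray(x, y, angle, max_dist=20.0, step=0.01):
    """March along a ray until it enters a wall cell; return the distance."""
    dx, dy = math.cos(angle), math.sin(angle)
    dist = 0.0
    while dist < max_dist:
        cx, cy = int(x + dx * dist), int(y + dy * dist)
        if GRID[cy][cx] == "#":
            return dist
        dist += step
    return max_dist


def render_column_heights(x, y, heading, fov=math.pi / 3, width=5, screen_h=100):
    """One wall-slice height per screen column, inversely proportional to distance."""
    heights = []
    for col in range(width):
        angle = heading - fov / 2 + fov * col / (width - 1)
        dist = max(cast_ray(x, y, angle), 1e-6)
        heights.append(screen_h / dist)  # no fisheye correction in this sketch
    return heights


print(render_column_heights(1.5, 1.5, heading=0.0))
```

Every column is computed independently of the others, which is the property that makes this style of renderer a natural fit for vectorized, accelerator-friendly implementations.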

2605.19924 2026-05-20 cs.RO

RoHIL: Robust Human-in-the-Loop Robotic Reinforcement Learning Against Illumination Variations

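RoHIL: Robust Human-in-the-Loop Robotic Reinforcement Learning Against Illumination Variations

Shuoqin Zhang, Yixin Xiong, Xiru Gao, Kai Liu, Ke Wang, Xichuan Zhou, Zhe Hu

AI summary: This paper proposes RoHIL, an offline fine-tuning framework that addresses the performance drop robots suffer when illumination changes between workstations, preserving performance at the original workstation and avoiding data re-collection and retraining.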

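Details
AI Chinese abstract

Human-in-the-loop reinforcement learning systems perform near-perfectly at the workstation where they are trained, but when the same robot is moved to a workstation a few meters away, the system collapses because new lamp positions and window light shift the visual input distribution. Re-collecting demonstrations and re-running HIL at every workstation is impractical, and naively fine-tuning on shifted-light data triggers catastrophic forgetting. To close this cross-domain gap, we propose RoHIL, an offline fine-tuning framework that requires no additional real-robot interaction. RoHIL combines (i) a world-model-based image relighter that re-synthesizes the visual stream of source-workstation trajectories under multiple virtual HDRI environments; (ii) Illumination-Retention Replay (IRR), a data-level anti-forgetting mechanism that interleaves relit adaptation transitions with original-light retention transitions to preserve source-workstation Bellman coverage; and (iii) an anchored Bellman-actor regularizer that constrains representation and policy drift from the original source-workstation policy. On four real-robot manipulation tasks with significant cross-workstation illumination changes, RoHIL substantially improves performance under shifted light, where standard HIL-RL collapses, while retaining source-workstation performance and removing the need to re-collect data and retrain for every new workstation and environment. Project page: https://anonymous4365.github.io/RoHIL/

English abstract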

Human-in-the-loop reinforcement learning systems achieve near-perfect success on the workstation where they are trained, but collapse when the same robot is moved to a workstation a few meters away due to shifts in the visual input distribution caused by new lamp positions and window light. Re-collecting demonstrations and re-running HIL on every workstation is incompatible with deployment, and naively fine-tuning on shifted-light data triggers catastrophic forgetting of the source workstation. To close this cross-domain gap, we present RoHIL, an offline fine-tuning framework that uses no extra real-robot interaction. RoHIL combines (i) a world-model-based image relighter that re-synthesises the visual stream of source-workstation trajectories under multiple virtual HDRI environments, leaving actions and rewards real; (ii) Illumination-Retention Replay (IRR), a data-level anti-forgetting mechanism that interleaves relit adaptation transitions with original-light retention transitions to preserve source-workstation Bellman coverage; and (iii) an anchored Bellman-actor regulariser that constrains representation and policy drift from the original source-workstation policy. Across four real-robot manipulation tasks under significant cross-workstation illumination variations, RoHIL substantially improves shifted-light performance where standard HIL-RL collapses, while preserving source-workstation performance, eliminating the need to re-collect data and retrain for every new workstation and environment. Project page: https://anonymous4365.github.io/RoHIL/
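
The interleaving in Illumination-Retention Replay can be pictured as a batch sampler that draws from two buffers at a fixed ratio. The sketch below is an assumption-laden illustration, not RoHIL's code: the function name, the `retention_ratio` parameter, and the tuple-shaped transitions are hypothetical, and the abstract does not state the mixing ratio.

```python
import random


def sample_irr_batch(relit, original, batch_size, retention_ratio=0.5, rng=None):
    """Mix relit (adaptation) and original-light (retention) transitions in one batch."""
    rng = rng or random.Random()
    n_keep = round(batch_size * retention_ratio)
    batch = rng.choices(original, k=n_keep) + rng.choices(relit, k=batch_size - n_keep)
    rng.shuffle(batch)
    return batch


# Hypothetical transitions tagged with their lighting condition.
relit_buffer = [("relit", i) for i in range(10)]
original_buffer = [("original", i) for i in range(10)]

batch = sample_irr_batch(relit_buffer, original_buffer, batch_size=8,
                         retention_ratio=0.25, rng=random.Random(0))
print(batch)
```

Keeping a fixed share of original-light transitions in every update is what lets the value function keep seeing source-workstation states while it adapts to the relit ones.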

2605.19919 2026-05-20 cs.RO

Beyond Action Residuals: Real-World Robot Policy Steering via Bottleneck Latent Reinforcement Learning

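Beyond Action Residuals: Real-World Robot Policy Steering via Bottleneck Latent Reinforcement Learning

Dongjie Yu, Kun Lei, Zhennan Jiang, Jia Pan, Huazhe Xu

AI summary: This paper proposes Z-Perturbation Reinforcement Learning (ZPRL), which steers pretrained policies through a compact bottleneck latent space, improving sample efficiency and final performance and markedly raising success rates on real-world tasks.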

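Details
AI Chinese abstract

Pretrained imitation policies have become a strong foundation for robot manipulation, but they usually need online improvement to overcome execution errors, limited dataset coverage, and deployment mismatch. A central question is therefore how reinforcement learning (RL) should adapt a policy after offline pretraining. Existing lightweight methods typically apply residual corrections directly in action space, which often leads to noisy and poorly structured exploration. In this work, we propose Z-Perturbation Reinforcement Learning (ZPRL), a method that steers pretrained policies through a compact bottleneck latent space rather than through policy weights or output actions. During offline training, we augment the policy with a plug-in variational information bottleneck (VIB) module to extract a task-relevant latent interface from observation embeddings. During online fine-tuning, the base policy is frozen and RL learns only a residual perturbation on this latent, whose decoded representation conditions the frozen action generator. We instantiate ZPRL on flow-matching policies and evaluate it on eight simulated tasks and four real-world tasks. Across diverse manipulation settings, ZPRL outperforms strong post-training baselines in sample efficiency and final performance. In the real world, ZPRL raises the average success rate on four tasks by 33.7% over the imitation base policies while producing smoother exploration behavior than the action-residual counterpart. These results suggest that a compact, task-aligned bottleneck latent space provides an effective interface for online RL adaptation. More videos are available at https://manutdmoon.github.io/ZPRL/.

English abstract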

Pretrained imitation policies have become a strong foundation for robot manipulation, but they often require online improvement to overcome execution errors, limited dataset coverage, and deployment mismatch. A central question is therefore how reinforcement learning (RL) should adapt policies after offline pretraining. Existing lightweight methods commonly apply residual corrections directly in action space, but this often leads to noisy and poorly structured exploration. In this work, we propose Z-Perturbation Reinforcement Learning (ZPRL), an approach that steers pretrained policies through a compact bottleneck latent rather than through policy weights or output actions. During offline training, we augment the policy with a plug-and-play variational information bottleneck (VIB) module to extract a task-relevant latent interface from observation embeddings. During online finetuning, the base policy is frozen and RL learns only a residual perturbation on this latent, whose decoded representation conditions the frozen action generator. We instantiate ZPRL on flow-matching policies and evaluate it on eight simulation tasks and four real-world tasks. Across diverse manipulation settings, ZPRL improves both sample efficiency and final performance over strong post-training baselines. In the real world, ZPRL improves the average success rate on four tasks by 33.7% over imitation base policies while producing smoother exploration behaviors than an action residual counterpart. These results suggest that a compact, task-aligned bottleneck latent provides an effective interface for online RL adaptation. More videos can be found at https://manutdmoon.github.io/ZPRL/.

2605.19916 2026-05-20 cs.LG cs.AI

Fast and Featureless Node Representation Learning with Partial Pairwise Supervision

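Fast and Featureless Node Representation Learning with Partial Pairwise Supervision

Sujan Chakraborty, Saptarshi Bej

AI summary: This study proposes a fast, unified framework for scalable node representation learning on graphs with partially available pairwise node labels and no node features; by combining community-aware structural signals with signed pairwise constraints, it yields an efficient optimization scheme.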

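Details
AI Chinese abstract

We introduce Contrastive FUSE, a fast and unified framework for scalable node representation learning on graphs, designed for settings with partially available pairwise node labels and no node features. Unlike existing methods, we directly optimize a spectral contrastive objective that integrates community-aware structural signals with signed pairwise constraints. To support large-scale training, we replace the expensive modularity gradient with a lightweight approximation that preserves the behavior of modularity while greatly reducing computational cost. The result is an efficient optimization scheme with a natural gradient decomposition and adaptive learning-rate scaling, enabling fast iterative updates even on million-edge graphs. Extensive experiments on benchmark citation networks, large co-purchase graphs, and OGB datasets show that Contrastive FUSE achieves competitive or superior contrastive classification performance without relying on node features, while delivering substantial runtime gains over existing baselines. These results highlight the effectiveness of combining modularity-inspired structural learning with contrastive supervision for efficient and scalable contrastive node representation learning.

English abstract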

We introduce Contrastive FUSE, a fast and unified framework for scalable node representation learning in graphs with partially available pairwise node labels and no available node features. Unlike existing methods, we directly optimize a spectral contrastive objective that integrates community-aware structural signals with signed pairwise constraints. To support large-scale training, we replace the expensive modularity gradient with a lightweight approximation, which preserves the structure-seeking behavior of modularity while reducing the computational cost significantly. This yields an efficient optimization scheme with a natural gradient decomposition and adaptive learning-rate scaling, enabling fast iterative updates even on million-edge graphs. Extensive experiments on benchmark citation networks, large co-purchase graphs, and OGB datasets show that Contrastive FUSE achieves competitive or superior contrastive classification performance without relying on node features, while offering substantial runtime gains over existing baselines. These results highlight the effectiveness of coupling modularity-inspired structural learning with contrastive supervision for efficient and scalable contrastive node representation learning.
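
As background for the modularity term, here is a minimal sketch of standard Newman modularity for a hard partition of an undirected, unweighted graph. It shows the quantity that modularity-based objectives build on; it is not Contrastive FUSE's objective or its lightweight gradient approximation, and the example graph is made up.

```python
def modularity(edges, community):
    """Newman modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    degree, internal = {}, {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if community[u] == community[v]:
            c = community[u]
            internal[c] = internal.get(c, 0) + 1
    total_degree = {}
    for node, d in degree.items():
        c = community[node]
        total_degree[c] = total_degree.get(c, 0) + d
    # Q = sum over communities of (internal edge share) - (degree share)^2
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in total_degree.items())


# Hypothetical graph: two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
one_group = {n: 0 for n in range(6)}

print(modularity(edges, two_groups))  # 5/14
print(modularity(edges, one_group))   # 0.0
```

Splitting the graph at the bridge scores higher than lumping all nodes together, which is the structure-seeking behavior the abstract refers to.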

2605.19902 2026-05-20 cs.LG q-bio.QM

Hierarchical Contrastive Learning for Multi-Domain Protein-Ligand Binding

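Hierarchical Contrastive Learning for Multi-Domain Protein-Ligand Binding

Shuo Zhang, Rongqi Hong, Huifeng Zhang, Jian K. Liu

AI summary: This study proposes HCLBind, a hierarchical contrastive learning framework for predicting binding affinity of multi-domain protein-ligand complexes. Its core approach separates geometric representation learning from affinity regression and adopts a novel hierarchical decoy strategy, combined with a domain-gated graph attention network and cross-modal attention to prioritize domain interfaces. Experiments show that HCLBind learns discriminative interface features and provides robust uncertainty estimates.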

Comments Accepted by ISBRA2026

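Details
AI Chinese abstract

Predicting binding affinity for multi-domain protein-ligand complexes remains challenging because inter-domain dynamics govern molecular recognition. Existing geometric deep learning methods usually treat a protein as a single static graph, which leads to rigid-body assumptions and aleatoric noise in flexible regions. To address this, we introduce HCLBind, a self-supervised framework that separates geometric representation learning from affinity regression. HCLBind uses a general-to-specific pre-training paradigm on the Q-BioLiP database to learn a robust physical grammar of binding. We propose a novel hierarchical decoy strategy: the model learns local physicochemical constraints through coordinate perturbation of single-domain proteins and global conformational geometry through domain rotation in multi-domain complexes. Our hybrid architecture integrates a domain-gated graph attention network with cross-modal attention to explicitly prioritize domain interfaces. In addition, we apply LoRA to protein and ligand foundation models, ensuring efficient optimization while preserving evolutionary knowledge. Experiments on PDBBind show that HCLBind effectively learns discriminative interface features and provides robust uncertainty estimates, overcoming the limitations of standard supervised learning. The code is available at https://github.com/jiankliu/HCLBind.

English abstract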

Predicting protein-ligand binding affinity remains intractable for multi-domain proteins, where inter-domain dynamics govern molecular recognition. Existing geometric deep learning methods typically treat proteins as monolithic static graphs, suffering from rigid-body assumptions and aleatoric noise in flexible regions. To address this, we introduce HCLBind, a self-supervised framework that decouples geometric representation learning from affinity regression. HCLBind leverages a general-to-specific pre-training paradigm on the Q-BioLiP database to learn a robust physical grammar of binding. We propose a novel hierarchical decoy strategy: the model learns local physicochemical constraints through protein coordinate perturbation in single-domain proteins and global conformational geometry through inter-domain rotation in multi-domain complexes. Our hybrid architecture integrates a domain-gated graph attention network and cross-modal attention to explicitly prioritize domain interfaces. Furthermore, we employ LoRA on protein and ligand foundation models, ensuring efficient optimization while preserving evolutionary knowledge. Experiments on PDBBind demonstrate that HCLBind effectively learns discriminative interface features and provides robust uncertainty estimation, overcoming the limitations of standard supervised learning. The code is available at https://github.com/jiankliu/HCLBind.

2605.19895 2026-05-20 cs.AI

Streamlined Constraint Reasoning via CNN Pattern Recognition on Enumerated Solutions

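Streamlined Constraint Reasoning via CNN Pattern Recognition on Enumerated Solutions

Patrick Spracklen

AI summary: This paper proposes a new approach that enumerates feasible solutions and trains a convolutional neural network to recognize structural patterns, from which MiniZinc streamliner constraints are generated to speed up solving hard problems in constraint programming.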

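Details
AI Chinese abstract

Constraint programming practitioners speed up hard problems by applying a layered set of techniques in order of risk. Standard hardening (symmetry breaking and implied constraints) is applied first and preserves satisfiability. Streamliner constraints restrict search to a structural sub-family of solutions, do not preserve satisfiability, and are kept as a last resort. Existing automated streamliner-synthesis methods either search a constraint grammar or prompt a large language model directly on the problem model. We propose a different approach: enumerate feasible solutions, train a convolutional neural network contrastively against perturbed non-solutions to detect structural patterns, and turn the CNN's discriminative signal into candidate MiniZinc streamliners through LLM-driven synthesis. The CNN grounds the LLM's constraint generation in observed solution structure rather than in model text alone. We evaluate on hardened benchmark models, where streamliner discovery is the remaining performance lever. Our pipeline achieves a 98.8% portfolio time reduction on hardened Vessel Loading, 98.6% on hardened Social Golfers, and 89.4% on Black Hole. The discovered streamliners include class-based packing constraints on Vessel Loading, beyond-hardening canonicalizations on Social Golfers, and layout-coordinate bounds on Black Hole.

English abstract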

Constraint programming practitioners accelerate hard problems through a layered set of techniques applied in order of risk. Standard hardening (symmetry-breaking and implied constraints) is applied first and preserves satisfiability. Streamliner constraints, which restrict search to a structural sub-family of solutions, do not preserve satisfiability and are reserved as a final lever. Existing automated streamliner-synthesis approaches either search a constraint grammar or prompt a Large Language Model directly on the problem model. We propose a different approach: enumerate feasible solutions, train a Convolutional Neural Network contrastively against perturbed non-solutions to detect structural patterns, and translate the CNN's discriminative signal into candidate MiniZinc streamliners through LLM-driven synthesis. The CNN grounds the LLM's constraint generation in observed solution structure rather than model text alone. We evaluate on hardened benchmark models where streamliner discovery is the residual performance lever. Our pipeline achieves 98.8% portfolio time reduction on hardened Vessel Loading, 98.6% on hardened Social Golfers, and 89.4% on Black Hole, with best-single streamliners reaching geometric-mean speedups of 932x, 356x, and 1103x respectively. Discovered streamliners include class-based packing constraints on Vessel Loading, beyond-hardening canonicalisations on Social Golfers, and layout-coordinate bounds on Black Hole.

2605.19890 2026-05-20 cs.CV

GoTTA be Diverse: Rethinking Memory Policies for Test-Time Adaptation

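GoTTA be Diverse: Rethinking Memory Policies for Test-Time Adaptation

Shyma Alhuwaider, Yasmeen Alsaedy, Merey Ramazanova, Silvio Giancola, Bernard Ghanem

AI summary: This paper studies the importance of memory policies in test-time adaptation and proposes GOTTA, a method based on class balance and feature-space diversity, showing that managing diversity improves adaptation performance under constrained memory and non-i.i.d. streams.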

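Details
AI Chinese abstract

Test-time adaptation (TTA) lets a pre-trained model adapt online to a test stream under distribution shift. Although most TTA research focuses on the adaptation objective, practical streams also depend heavily on the memory mechanism used to select the test samples that drive adaptation. Existing memory mechanisms are usually evaluated as components of specific TTA algorithms, which makes it hard to determine which memory design choices matter and when. In this paper, we provide a systematic benchmark that decouples memory from the adaptation algorithm and evaluates memory policies under unified conditions on i.i.d., non-i.i.d., continual, and practical test streams. Our study shows that effective memory management involves more than retaining recent or class-balanced samples. In particular, intra-class diversity is a key factor in avoiding redundant buffers and maintaining representative adaptation signals under temporally correlated and label-skewed streams. Motivated by this finding, we introduce Guided Observational Test-Time Adaptation (GOTTA), a diversity-aware memory policy that combines class-balanced allocation with feature-space diversity. GOTTA memories can serve as drop-in replacements for existing buffers and can be paired with different TTA objectives. On corruption benchmarks and video-stream settings, diversity-aware memory markedly improves adaptation under constrained memory budgets and challenging non-i.i.d. streams, while remaining competitive as memory capacity grows. These results highlight memory management as a first-class component of robust test-time adaptation and identify diversity as a core principle for practical TTA.

English abstract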

Test-time adaptation (TTA) enables a pre-trained model to adapt online to an unlabeled test stream under distribution shift. While most TTA research focuses on the adaptation objective, practical streams also depend critically on the memory used to select which test samples drive adaptation. Existing memory mechanisms are usually evaluated as components of specific TTA algorithms, making it difficult to isolate which memory design choices matter and when they matter. In this work, we provide a systematic benchmark that decouples memory from the adaptation algorithm and evaluates memory policies under unified conditions across i.i.d., non-i.i.d., continual, and practical test streams. Our study shows that effective memory management requires more than retaining recent or class-balanced samples. In particular, intra-class diversity is a key factor for avoiding redundant buffers and maintaining representative adaptation signals under temporally correlated and label-skewed streams. Motivated by this finding, we introduce Guided Observational Test-Time Adaptation (GOTTA), a family of diversity-aware memory policies that combine class-balanced allocation with feature-space diversity. GOTTA memories act as drop-in replacements for existing buffers and can be paired with different TTA objectives. Across corruption benchmarks and video-stream settings, diversity-aware memory improves adaptation most clearly under constrained memory budgets and challenging non-i.i.d. streams, while remaining competitive as memory capacity increases. These results highlight memory management as a first-class component of robust test-time adaptation and identify diversity as a central principle for practical TTA.
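
One way to combine class-balanced allocation with feature-space diversity is sketched below. This is a hypothetical policy written for illustration, not the GOTTA algorithm: the class name, the equal per-class quota, and the rule of evicting the sample closest to its nearest neighbour are all assumptions.

```python
import math


class DiverseClassBalancedMemory:
    """Fixed-capacity buffer with an equal quota per (pseudo-)class."""

    def __init__(self, capacity, num_classes):
        if capacity < num_classes:
            raise ValueError("capacity must allow at least one slot per class")
        self.per_class = capacity // num_classes
        self.slots = {c: [] for c in range(num_classes)}

    def add(self, feature, pseudo_label):
        bucket = self.slots[pseudo_label]
        bucket.append(feature)
        if len(bucket) > self.per_class:
            # Evict the sample closest to its nearest neighbour (most redundant).
            bucket.pop(self._most_redundant(bucket))

    @staticmethod
    def _most_redundant(bucket):
        def nearest(i):
            return min(math.dist(bucket[i], bucket[j])
                       for j in range(len(bucket)) if j != i)
        return min(range(len(bucket)), key=nearest)


memory = DiverseClassBalancedMemory(capacity=4, num_classes=2)
for feat in [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]:  # two near-duplicates, one distinct
    memory.add(feat, pseudo_label=0)
print(memory.slots[0])
```

A recency-only buffer would keep whichever samples arrived last; here the near-duplicate is dropped and the distinct sample survives, so a temporally correlated stream cannot fill a class slot with redundant frames.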

2605.19881 2026-05-20 cs.RO

Trajectory Planning and Control near the Limits: an Open Experimental Benchmark on the RoboRacer Platform

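Trajectory Planning and Control near the Limits: An Open Experimental Benchmark on the RoboRacer Platform

Mattia Piccinini, Patrick Zambiasi, Aniello Mungiello, Mattia Piazza, Felix Jahncke, Johannes Betz

AI summary: This paper presents a modular framework for benchmarking new and existing trajectory planning and control methods in high-acceleration maneuvers. Tests on two circuits with the RoboRacer platform show that MS-NN improves tracking accuracy and reduces steering oscillations, and that online velocity replanning improves lap times and allows higher safe speeds.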

Comments Accepted - 2026 IEEE 29th International Conference on Intelligent Transportation Systems (ITSC)

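Details
AI Chinese abstract

We present a modular framework for benchmarking new and existing methods for trajectory planning and control in high-acceleration maneuvers. The framework includes time-optimal raceline generation, online time-optimal velocity replanning, geometric path tracking controllers, and a new model-structured neural network (MS-NN) that learns the inverse dynamics for steering control. We deploy the framework on a 1:10-scale RoboRacer platform, using two circuits. Through several ablation studies, we examine the performance of individual modules and their combinations. We show that the MS-NN significantly improves tracking accuracy, reduces steering oscillations, and is physically interpretable. In addition, online velocity replanning improves lap times by compensating for execution errors and lets the vehicle safely reach higher speeds and accelerations. To support future research, our code, datasets, videos, and results are publicly available at https://roboracer-benchmark.github.io/planning_control_benchmark/.

English abstract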

We present a modular framework to benchmark new and existing methods for trajectory planning and control in high-acceleration maneuvers that push autonomous driving to the limits. Our framework includes time-optimal raceline generation, online time-optimal velocity replanning, geometric path tracking controllers, and a new model-structured neural network (MS-NN) to learn the inverse dynamics for steering control. We deploy our framework on a 1:10-scale RoboRacer platform, using two circuits. Through several ablations with cautious and aggressive racelines, we study the performance of single modules and their combinations. We show that our MS-NN significantly improves tracking accuracy, decreases steering oscillations, and is physically interpretable. Moreover, online velocity replanning improves lap times by compensating for execution errors, and enables the vehicle to safely reach higher speeds and accelerations. To support future research, our code, datasets, videos and results are publicly available at https://roboracer-benchmark.github.io/planning_control_benchmark/.
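
As an example of what a geometric path tracking controller looks like, here is a minimal pure pursuit steering law for a kinematic bicycle model. The abstract does not name the controllers in the benchmark, so pure pursuit is an assumed representative of the family, and the 0.33 m wheelbase is a hypothetical value for a 1:10-scale car.

```python
import math


def pure_pursuit_steering(pose, target, wheelbase):
    """Steering angle (rad) that arcs a bicycle-model car through the target point."""
    x, y, yaw = pose
    dx, dy = target[0] - x, target[1] - y
    lookahead = math.hypot(dx, dy)
    alpha = math.atan2(dy, dx) - yaw  # bearing of the target in the vehicle frame
    return math.atan2(2.0 * wheelbase * math.sin(alpha), lookahead)


WHEELBASE = 0.33  # metres; hypothetical

straight = pure_pursuit_steering((0.0, 0.0, 0.0), (1.0, 0.0), WHEELBASE)
left = pure_pursuit_steering((0.0, 0.0, 0.0), (1.0, 1.0), WHEELBASE)
print(straight, left)
```

The law uses only geometry (pose, a lookahead point on the raceline, and the wheelbase), which is why controllers of this kind lose accuracy near the handling limits and leave room for a learned inverse-dynamics model.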

2605.19869 2026-05-20 cs.CV cs.AI

Passive Construction Site Safety Monitoring via Persona-Scaffolded Adversarial Chain-of-Thought VLM Verification

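Passive Construction Site Safety Monitoring via Persona-Scaffolded Adversarial Chain-of-Thought VLM Verification

Ananth Sriram, Neel Mokaria, Rajveer Singh

AI summary: This paper proposes a passive construction site safety monitoring approach that processes video through a three-stage architecture combining fine-tuned YOLO11, SAM 3, and Qwen3-VL-8B-Instruct, and uses a persona-scaffolded adversarial chain-of-thought protocol to improve compliance verification and hallucination control. The main contribution is the Stage 3 prompt design, which improves precision by 12%.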

Comments 10 pages, 4 figures. First place, Ironsite.ai Spatial Intelligence Hackathon, University of Maryland, February 2026. Code available at https://github.com/ananthsriram1/ironsite-hackathon-project-safety_assistant

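Details
AI Chinese abstract

Construction remains the deadliest industry sector in the United States, with 1,055 fatal worker injuries recorded in 2023, most of them preventable. Existing monitoring approaches are expensive, require real-time human operators, or address only a narrow subset of violations. This paper presents a passive, end-of-shift construction safety monitoring pipeline that processes video from POV body-worn and fixed wall-mounted cameras through a three-stage architecture: (1) fine-tuned YOLO11 for primary PPE and hazard detection, (2) SAM 3 for segmentation refinement and worker deduplication, and (3) Qwen3-VL-8B-Instruct with a method-prompted, persona-scaffolded three-pass adversarial chain-of-thought protocol for compliance verification and hallucination control. The main contribution is the Stage 3 prompt design: professional persona backstories that follow the method-actor framing yielded a 12% precision improvement in an informal three-author review of the 12-video Ironsite development corpus, with the largest gains on hallucination-prone violation categories. Structural message isolation enforces independence among the generator, discriminator, and reconciliation passes, which are governed by asymmetric rules encoding the reliability of human observation versus automated detection. The system maps violations to OSHA standards, performs REBA-inspired ergonomic risk scoring from pose keypoints, and generates per-worker safety reports with timestamped evidence. An evaluation harness is released for future reproduction.

English abstract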

Construction remains the deadliest industry sector in the United States, with 1,055 fatal worker injuries recorded in 2023, and the majority preventable. Existing monitoring approaches are expensive, require real-time human operators, or address only a narrow subset of violations. This paper presents a passive, end-of-shift construction safety monitoring pipeline processing video from POV body-worn and fixed wall-mounted cameras through a three-stage architecture: (1) fine-tuned YOLO11 for primary PPE and hazard detection, (2) SAM 3 for segmentation refinement and worker deduplication, and (3) Qwen3-VL-8B-Instruct with a method-prompted, persona-scaffolded three-pass adversarial chain-of-thought protocol for compliance verification and hallucination control. The principal contribution is the Stage 3 prompt design: professional persona backstories following the method-actor framing drive an observed 12% precision improvement over single-pass prompting in an informal three-author review of the 12-video Ironsite development corpus, with the largest gains on hallucination-prone violation categories. Structural message isolation enforces observational independence between a generator, discriminator, and reconciliation pass governed by asymmetric rules encoding priors about human observation versus automated detection reliability. The system maps violations to OSHA standards, performs REBA-inspired ergonomic risk scoring from pose keypoints, and produces per-worker safety reports with timestamped evidence. An evaluation harness is released for future reproduction.

2605.19868 2026-05-20 cs.CV

WoundFormer: Multi-Scale Spatial Feature Fusion for Multi-Class Wound Tissue Segmentation

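WoundFormer: Multi-Scale Spatial Feature Fusion for Multi-Class Wound Tissue Segmentation

Muhammad Ashad Kabir, Rabin Dulal

AI summary: This study proposes the WoundFormer framework, which uses multi-scale spatial feature fusion to improve the accuracy of multi-class wound tissue segmentation and addresses the shortcomings of existing methods in handling heterogeneous tissue composition.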

Comments 10 pages

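Details
AI Chinese abstract

Chronic wounds such as diabetic foot ulcers and pressure injuries require accurate tissue-level assessment to guide treatment planning and monitor healing. Although deep learning methods have advanced automated wound analysis, most existing approaches focus on binary segmentation and, because of high intra-class variability and limited annotated data, struggle to model heterogeneous tissue composition. Multi-class wound tissue segmentation therefore remains an important, clinically relevant challenge. We propose WoundFormer, a transformer-based framework that improves multi-class wound tissue segmentation by enhancing hierarchical spatial feature fusion. Specifically, we replace the standard SegFormer decoder with a spatially-preserving multi-scale aggregation head that maintains feature topology during cross-scale integration and strengthens contextual interactions through convolutional fusion. This design improves boundary localization and discrimination between visually similar tissue categories while preserving transformer efficiency. We evaluate WoundFormer on the WoundTissueSeg dataset (147 images, six tissue classes) and on a second benchmark (the DFUTissue dataset). The proposed method achieves an overall Dice score of 81.9% on the WoundTissueSeg benchmark, exceeding strong CNN- and transformer-based baselines by up to 4.3 Dice points, with consistent improvements on minority tissue classes. These results indicate that explicitly modeling hierarchical spatial interactions enhances transformer representations for heterogeneous wound tissue segmentation and supports more reliable quantitative wound assessment.

English abstract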

Chronic wounds such as diabetic foot ulcers and pressure injuries require accurate tissue-level assessment to guide treatment planning and monitor healing progression. While deep learning methods have advanced automated wound analysis, most existing approaches focus on binary segmentation and inadequately model heterogeneous tissue composition due to high intra-class variability and limited annotated data. Multi-class wound tissue segmentation, therefore, remains a challenging and clinically relevant problem. We propose WoundFormer, a transformer-based framework that enhances hierarchical spatial feature fusion for multi-class wound tissue segmentation. Specifically, we replace the standard SegFormer decoder with a spatially-preserving multi-scale aggregation head that maintains feature topology during cross-scale integration and strengthens contextual interactions through convolutional fusion. This design improves boundary localization and discrimination between visually similar tissue categories while preserving transformer efficiency. We evaluate WoundFormer on the WoundTissueSeg dataset (147 images, six tissue classes) and a second benchmark (DFUTissue dataset). The proposed method achieves an overall Dice score of 81.9%, outperforming strong CNN- and transformer-based baselines by up to 4.3 Dice points on the WoundTissueSeg benchmark, with consistent improvements across minority tissue classes. These results indicate that explicit modeling of hierarchical spatial interactions enhances transformer representations for heterogeneous wound tissue segmentation and supports more reliable quantitative wound assessment.
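
For reference, the Dice score reported above can be computed per class as in the sketch below. The label maps are tiny made-up examples, and how the paper aggregates per-class scores into an overall figure is not stated in the abstract, so no averaging scheme is assumed here.

```python
def dice_per_class(pred, target, num_classes):
    """Dice = 2 * |P and T| / (|P| + |T|) per class, over flat label lists."""
    scores = {}
    for c in range(num_classes):
        p = sum(1 for v in pred if v == c)
        t = sum(1 for v in target if v == c)
        inter = sum(1 for a, b in zip(pred, target) if a == b == c)
        if p + t:  # skip classes absent from both prediction and ground truth
            scores[c] = 2 * inter / (p + t)
    return scores


# Hypothetical flattened segmentation maps with three tissue classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(dice_per_class(pred, target, num_classes=3))
```

Reporting the score per class matters for wound tissue because minority classes cover few pixels and can be hidden by a pixel-weighted average.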

2605.19866 2026-05-20 cs.CV

Structured Layout Priors for Robust Out-of-Distribution Visual Document Understanding

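Structured Layout Priors for Robust Document Understanding

Peter El Hachem, Ahmed Nassar, A. Said Gurbuz, Christoph Auer, Peter W. J. Staar

AI summary: This paper proposes a structured layout prior: a lightweight RT-DETR detector runs outside the decoder, its detections are converted into the parser's DocTags vocabulary and injected into the prompt, resolving the two-hop bottleneck in document layout parsing and making document understanding more robust.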

Comments 18 pages, 7 figures. Main text: 9 pages (4 figures); Appendix: 9 pages (3 figures)

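AI Chinese abstract

Vision-language models (VLMs) can parse documents end-to-end, but they often perform poorly on layouts that differ from those seen during training. We attribute this to a two-hop bottleneck: before the decoder can extract content (Hop 2), it must first classify and localize the enclosing layout entity (Hop 1). When Hop 1 fails, Hop 2 degrades into omissions, malformed structure, or autoregressive repetition. We resolve Hop 1 outside the decoder by running a lightweight RT-DETR detector, serializing its detections in the parser's DocTags vocabulary, and injecting them into the prompt while keeping the full page image. Unlike analyze-then-parse approaches, which crop the page, or earlier prompt-level priors written in plain text, our prior shares the decoder's generation space and keeps the global image as a fallback when detections are noisy. On a 10,000-page structural out-of-distribution benchmark, markdown F1 rises from 0.37 to 0.92; on the Chinese subset of OmniDocBench, table TEDS rises from 0.01 to 0.36; and on the 26,000-page ViDoRe V3 benchmark, infinite-loop decoding failures drop in every industrial domain. These gains cost 15% wall-clock latency and a median of 74 prompt tokens, with no architectural change to the base VLM. An attention-level analysis further reveals a bimodal phase shift: the decoder attends to the injected layout tokens when emitting structure and to image patches when emitting content, consistent with the two-hop bottleneck being alleviated. Model weights will be released to support reproducibility.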

English abstract

Vision-Language Models (VLMs) parse documents end-to-end but frequently break down on layouts unlike those seen in training. We attribute this to a two-hop bottleneck: before the decoder can extract content (Hop 2), it must first classify and localize the enclosing layout entity (Hop 1), and when the first hop fails the second collapses into omissions, malformed structure, or autoregressive repetition. We pre-resolve Hop 1 outside the decoder by running a lightweight RT-DETR detector, serializing its outputs in the parser's native DocTags vocabulary, and injecting them into the prompt alongside the full page image. Unlike analyze-then-parse approaches that crop the page, or prior prompt-level priors written in plain text, our prior shares the decoder's generation space and leaves the global image in view as a fallback when detections are noisy. On a 10k-page structural out-of-distribution benchmark, markdown F1 rises from $0.37$ to $0.92$; on the Chinese subset of OmniDocBench, table TEDS rises from $0.01$ to $0.36$; and on the 26k-page ViDoRe V3 benchmark, infinite-loop decoding failures drop across every industrial domain tested. These gains cost $15\%$ wall-clock latency and a median of $74$ prompt tokens, with no architectural change to the base VLM. An attention-level analysis further reveals a bimodal phase shift in which the decoder attends to injected layout tokens when emitting structure and to image patches when emitting content, consistent with the two-hop bottleneck being alleviated. Model weights will be released to support reproducibility.
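The prompt-injection step described above (detect layout regions, serialize them as tokens, prepend them to the prompt) can be sketched roughly as follows. This is only an illustration under stated assumptions: the tag and `<loc_…>` token format is a hypothetical stand-in and not the actual DocTags vocabulary, the `Detection` type, the 500-bin coordinate grid, the score threshold, and the top-to-bottom ordering are all invented for the example, and the detector itself is not shown.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """One layout region from a detector (hypothetical container type)."""
    label: str                               # e.g. "title", "table"
    box: Tuple[float, float, float, float]   # (x0, y0, x1, y1), normalized to [0, 1]
    score: float                             # detector confidence


def quantize(v: float, grid: int) -> int:
    """Map a normalized coordinate to an integer bin in [0, grid - 1]."""
    return max(0, min(grid - 1, int(v * grid)))


def serialize_layout(dets: List[Detection], min_score: float = 0.5,
                     grid: int = 500) -> str:
    """Turn detections into a flat token string (assumed format, not real DocTags)."""
    kept = [d for d in dets if d.score >= min_score]
    # Assumed reading order: top-to-bottom, then left-to-right.
    kept.sort(key=lambda d: (d.box[1], d.box[0]))
    parts = []
    for d in kept:
        locs = "".join(f"<loc_{quantize(v, grid)}>" for v in d.box)
        parts.append(f"<{d.label}>{locs}</{d.label}>")
    return "".join(parts)


def build_prompt(instruction: str, dets: List[Detection]) -> str:
    """Append the serialized layout prior to the text instruction.

    The full page image would still be passed to the model separately.
    """
    return f"{instruction}\n{serialize_layout(dets)}"
```

Because the prior is only extra prompt text, the model can still fall back on the untouched page image when a detection is missing or wrong, which is the property the abstract emphasizes.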

2605.19865 2026-05-20 cs.CV

Landscape-Awareness for Geometric View Diffusion Model

A Diffusion Model for Geometric Viewpoints

Yan-Ting Chen, Hao-Wei Chen, Tsu-Ching Hsiao, Chun-Yi Lee

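AI summary This paper proposes a score-based method for viewpoint estimation with geometric view diffusion models. It reshapes the optimization landscape to guide updates toward the ground-truth viewpoint and then refines the estimate with a viewpoint-conditioned diffusion model, improving convergence, reducing reliance on brute-force sampling, and achieving higher sample efficiency.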

Comments CVPR 2026

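AI Chinese abstract

Accurately estimating camera viewpoint under sparse-view conditions remains challenging, especially in two-view scenarios. Recent methods use diffusion models such as Zero123 to synthesize novel views conditioned on relative viewpoint, and show promising results when optimized with an MSE loss. However, existing methods often face a nonconvex loss landscape with many local minima, which makes them sensitive to initialization and dependent on simple multistart strategies. We analyze these optimization challenges and visualize failure cases, showing that geometric ambiguities such as symmetry and self-similarity can steer gradient updates toward incorrect viewpoints. To address these limitations, we propose a score-based method that reshapes the optimization landscape to guide updates toward the ground-truth viewpoint, followed by refinement with a viewpoint-conditioned diffusion model. Experiments show that our method improves convergence, reduces reliance on brute-force sampling, and achieves competitive accuracy with higher sample efficiency.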

English abstract

Accurate camera viewpoint estimation under sparse-view conditions remains challenging, particularly in two-view scenarios. Recent approaches leverage diffusion models such as Zero123 to synthesize novel views conditioned on relative viewpoint, showing promising results when repurposed for viewpoint estimation via optimization with an MSE loss. However, existing methods often suffer from a nonconvex loss landscape with numerous local minima, making them sensitive to initialization and reliant on naive multistart strategies. We analyze these optimization challenges and visualize failure cases, showing that geometric ambiguities, such as symmetry and self-similarity, can mislead gradient-based updates toward incorrect viewpoints. To address these limitations, we propose a score-based method that reshapes the optimization landscape to guide updates toward the ground-truth viewpoint, followed by a refinement stage using a viewpoint-conditioned diffusion model. Experiments show that our method improves convergence, reduces reliance on brute-force sampling, and achieves competitive accuracy with higher sample efficiency.
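To make the initialization sensitivity concrete, the toy sketch below runs plain gradient descent on a one-dimensional nonconvex loss and compares a single start with a naive multistart. It illustrates the baseline behaviour the abstract criticizes, not the proposed score-based method. The loss is a hand-made trigonometric function standing in for the image MSE over a single viewpoint angle; the function names, step size, and number of starts are assumptions chosen for the example.

```python
import math
from typing import Tuple


def loss(x: float, target: float) -> float:
    """Toy nonconvex loss: global minimum (0) at x == target, plus local minima."""
    u = x - target
    return (1.0 - math.cos(u)) + 0.5 * (1.0 - math.cos(4.0 * u))


def grad(x: float, target: float) -> float:
    """Analytic derivative of the toy loss with respect to x."""
    u = x - target
    return math.sin(u) + 2.0 * math.sin(4.0 * u)


def descend(x0: float, target: float, lr: float = 0.1, steps: int = 500) -> float:
    """Plain gradient descent from a single initial angle."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x, target)
    return x


def multistart(target: float, n_starts: int = 8) -> Tuple[float, float]:
    """Naive multistart: descend from evenly spaced angles, keep the lowest loss."""
    best_x, best_loss = 0.0, float("inf")
    for k in range(n_starts):
        x0 = -math.pi + 2.0 * math.pi * k / n_starts
        x = descend(x0, target)
        value = loss(x, target)
        if value < best_loss:
            best_x, best_loss = x, value
    return best_x, best_loss
```

In this toy setting a start 1.6 rad away from the target settles in a nearby local minimum, while eight evenly spaced starts recover the target at eight times the cost. Reducing that brute-force cost is the motivation the abstract gives for reshaping the landscape.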