arXivDaily: arXiv daily academic digest (updated Monday to Friday)
2605.18238 2026-05-19 cs.CV

Non-Colliding Biometric Identities for Digital Entities: Geometry, Capacity, and Million-Scale Virtual Identity Provisioning

Yuyang Ji, Yixuan Shen, Anil Jain, Xiaoming Liu, Feng Liu

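AI summary: This work proposes the Biometric Identity Provisioning (BIP) framework, which provisions virtual identities that do not collide with any identity in a gallery of real human identities. It takes a geometric approach that allocates unclaimed gaps within the real face manifold, realizes the identities as high-fidelity face images, and demonstrates 10M non-colliding virtual identity embeddings.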

Comments 25 pages, 11 figures

English abstract

Digital entities such as AI agents and humanoid robots increasingly operate alongside real humans, yet their identity infrastructure is based on credentials rather than embodied biometric identity. We introduce Biometric Identity Provisioning (BIP), a new problem and solution framework that addresses: given an enrollment gallery of real human identities, provision virtual identities that are non-colliding with every enrolled identity, maintain sufficient inter-class separability, and are realizable as high-fidelity face images. The key geometric insight is that real face identities occupy a low-dimensional subspace of the embedding hypersphere, leaving no residual subspace for virtual identities. Hence, virtual identities must instead be allocated as unclaimed gaps within the real face manifold itself. BIP is therefore a constrained packing problem: available gaps vastly exceed any foreseeable enrollment scale, and provisioned identities remain non-colliding even as new real identities are subsequently enrolled. Grounded in this geometry, our repulsion-based allocation is not bounded by any fixed provisioning count; we demonstrate 10M non-colliding virtual identity embeddings against a gallery of 360K real identities. Realizing these embeddings as face images requires a generator that operates outside the training distribution of real face images; we introduce GapGen, a gap-aware generator trained with a curriculum that progressively extends synthesis into non-colliding regions, validated at 1M photorealistic virtual face images. We further construct v-LFW, a virtual counterpart to the LFW face dataset, with protocols for virtual face verification, cross-reality matching, real-vs-virtual detection, and unified recognition and detection.

2605.18233 2026-05-19 cs.CV

Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos

X. Feng, J. Zhu, M. Wu, C. Chen, F. Mao, H. Guo, J. Wu, X. Chu, K. Huang

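AI summary: This paper proposes MIGA, which uses a two-stage alignment mechanism and a dual consistency enhancement mechanism to address the training-inference mismatch and the difficulty of maintaining long-term consistency, improving long video generation.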

Comments Accepted by ICML 2026

English abstract

Without incurring significant computational overhead, train-free long video generation aims to enable foundation video generation models to produce longer videos. Frame-level autoregressive frameworks, e.g., FIFO-diffusion, offer the advantage of generating infinitely long videos with constant memory consumption. However, the mismatch between training and inference, coupled with the challenge of maintaining long-term consistency, limits the effective utilization of foundation models. To mitigate these concerns, we propose MIGA, a novel infinite-frame long video generation method. First, we propose an effective two-stage alignment mechanism that mitigates the training-inference gap by reducing the excessive noise span fed to the model. We then introduce an innovative dual consistency enhancement mechanism, where the self-reflection approach corrects early high-noise frames and the long-range frame guidance approach leverages later low-noise frames with broad coverage to steer generation, jointly improving temporal consistency. Extensive experiments on VBench and NarrLV demonstrate the state-of-the-art performance of MIGA. Our project page is available at https://xiaokunfeng.github.io/miga_homepage/.

2605.18232 2026-05-19 cs.CL cs.AI cs.IR

SomaliWeb v1: A Quality-Filtered Somali Web Corpus with a Matched Tokenizer and a Public Language-Identification Benchmark

Khalid Yusuf Dahir

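AI summary: This paper presents SomaliWeb v1, a quality-filtered Somali corpus with a matched BPE-16K tokenizer and the first public Somali language-identification benchmark, and documents quality defects in existing distributions.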

Comments 16 pages, 6 figures, 6 tables. Code: https://github.com/khaledyusuf44/somali-corpus Dataset: https://huggingface.co/datasets/khaledyusuf44/somaliweb-v1

English abstract

Somali is a Cushitic language of the Horn of Africa with ~25 million speakers, yet no documented dedicated Somali pretraining corpus with a companion tokenizer and language-identification benchmark has been publicly released. Existing Somali text appears either inside multilingual distributions (HPLT v2, CC100, MADLAD-400, OSCAR, mC4) or in small, undocumented Somali-only uploads on Hugging Face. We introduce SomaliWeb v1, a quality-filtered Somali corpus of 819,322 documents (~303M tokens) built from three upstream sources (HPLT v2, CC100, Somali Wikipedia) through a six-stage reproducible pipeline. We release (i) the corpus, (ii) a matched BPE-16K tokenizer, and (iii) the first public side-by-side Somali benchmark of three production language identifiers. Our measurements reveal concrete quality defects in existing distributions: HPLT v2's "cleaned" Somali release retains 17.3% byte-exact duplicates, 56.1% of its documents contain fixable mojibake, and 10.7% of its byte-unique documents are near-duplicates at Jaccard tau=0.80. Our BPE-16K tokenizer emits 40.2% fewer tokens than GPT-4's cl100k_base on FLORES-200 Somali devtest as a tokenizer-level measurement; downstream language-model perplexity comparisons are deferred to a follow-up release.
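
The two duplicate measurements quoted in this abstract (byte-exact duplicates and near-duplicates at Jaccard tau=0.80) can be illustrated with a small sketch. This is not the SomaliWeb pipeline: the word-level 3-gram shingling, the all-pairs comparison, and every function name are assumptions chosen for brevity, and a corpus-scale pipeline would more likely rely on an approximation such as MinHash.

```python
import hashlib


def drop_exact_duplicates(docs):
    """Keep the first copy of each byte-identical document."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def shingles(text, n=3):
    """Set of word-level n-grams (n=3 is an assumption, not from the paper)."""
    words = text.split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def near_duplicate_pairs(docs, tau=0.80):
    """Index pairs whose shingle sets reach Jaccard similarity tau (O(n^2))."""
    sets = [shingles(d) for d in docs]
    return [(i, j)
            for i in range(len(sets))
            for j in range(i + 1, len(sets))
            if jaccard(sets[i], sets[j]) >= tau]
```

Running exact deduplication first and near-duplicate detection second mirrors the order in which the abstract reports the two defects; whether the actual pipeline uses that order is not stated.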

2605.18229 2026-05-19 cs.LG cs.AI

Are Sparse Autoencoder Benchmarks Reliable?

David Chanin

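AI summary: This study evaluates the reliability of sparse autoencoder (SAE) benchmarks. Two metrics fail under multiple lenses and the remaining metrics fall short of expectations, indicating that SAE benchmarks need improvement.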

English abstract

Sparse autoencoders (SAEs) are a core interpretability tool for large language models, and progress on SAE architectures depends on benchmarks that reliably distinguish better SAEs from worse ones. We audit the SAE quality metrics in SAEBench, the de-facto standard SAE evaluation suite, through three complementary lenses: reseed noise on a fixed SAE, ground-truth correlation on synthetic SAEs, and discriminability across training trajectories. We find that two of these metrics, Targeted Probe Perturbation (TPP) and Spurious Correlation Removal (SCR), fail multiple lenses at their canonical settings and should not be used to evaluate SAEs. The other metrics show higher reseed noise and lower discriminability than the field assumes. The sae-probes variant of k-sparse probing is the most reliable metric we tested, but even sae-probes struggles to separate variants of the same SAE architecture. Our results show the field needs better SAE benchmarks.

2605.18226 2026-05-19 cs.CL cs.AI

Context Memorization for Efficient Long Context Generation

Yasuyuki Okoshi, Hao Mark Chen, Guanxi Lu, Hongxiang Fan, Masato Motomura, Daichi Fujiki

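AI summary: This paper proposes a training-free context memorization method that externalizes the prefix into a lightweight lookup table of precomputed attention states, improving the accuracy and efficiency of long-context generation while reducing attention latency.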

English abstract

Modern large language model (LLM) applications increasingly rely on long conditioning prefixes to control model behavior at inference time. While prefix-augmented inference is effective, it incurs two structural limitations: i) the prefix's influence fades as generation proceeds, and ii) attention computation over the prefix scales linearly with its length. Existing approaches either keep the prefix in attention while compressing it, or internalize it into model parameters through gradient-based training. The former still attends to the prefix at inference, while the latter is training-intensive and ill-suited to prefix updates. To address these issues, we propose attention-state memory, a training-free approach that externalizes the prefix into a lightweight, lookup-based memory of precomputed attention states between prefix and query tokens. On ManyICLBench with LLaMA-3.1-8B, our method improves accuracy over in-context learning at 1K-8K memory budgets while reducing attention latency by 1.36x at 8K, and surpasses full-attention RAG performance on the NBA benchmark using only 20% of its memory footprint.

2605.18221 2026-05-19 cs.SD cs.CL cs.CV cs.LG physics.med-ph

SIREM: Speech-Informed MRI Reconstruction with Learned Sampling

Md Hasan, Nyvenn Castro, Daiqi Liu, Lukas Mulzer, Jana Hutter, Jonghye Woo, Moritz Zaiss, Andreas Maier, Paula A. Perez-Toro

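AI summary: This paper proposes SIREM, a speech-informed MRI reconstruction framework that uses synchronized speech as a cross-modal prior. It exploits the correlation between vocal-tract configurations and the produced acoustics to predict part of the image content, achieving anatomically plausible reconstructions at higher throughput.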

English abstract

Real-time magnetic resonance imaging (rtMRI) of speech production enables non-invasive visualization of dynamic vocal-tract motion and is valuable for speech science and clinical assessment. However, rtMRI is fundamentally constrained by trade-offs among spatial resolution, temporal resolution, and acquisition speed, often leading to undersampled k-space measurements and degraded reconstructions. We propose SIREM, a speech-informed MRI reconstruction framework that uses synchronized speech as a cross-modal prior. The central idea is that vocal-tract configurations during speech are correlated with the produced acoustics, making part of the image content predictable from audio. SIREM models each frame as a fusion of an audio-driven component and an MRI-driven component through a spatial weighting map. The audio branch predicts articulator-related structure from speech, while the MRI branch reconstructs complementary content from measured k-space data. We further introduce a learnable soft weighting profile over spiral arms, enabling a differentiable study of how k-space arm usage interacts with speech-informed fusion. This yields a unified multimodal formulation that combines audio-driven prediction, MRI reconstruction, and sampling adaptation. We evaluate SIREM on the USC speech rtMRI benchmark against standard baselines, including gridding, wavelet-based compressed sensing, and total variation. SIREM introduces a speech-informed reconstruction paradigm that operates in a substantially higher-throughput regime than iterative methods while preserving anatomically plausible vocal-tract structure. These results establish an initial benchmark for multimodal speech-informed rtMRI reconstruction and highlight the potential of synchronized speech as an auxiliary prior for fast reconstruction. The source code is available at https://github.com/mdhasanai/SIREM
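
The fusion of an audio-driven and an MRI-driven component through a spatial weighting map can be pictured as a per-pixel blend. The sketch below assumes a convex combination with weights in [0, 1] and plain nested lists for images; the abstract does not state the exact fusion rule, so the formula and the name `fuse_frame` are illustrative assumptions only.

```python
def fuse_frame(audio_img, mri_img, weight_map):
    """Per-pixel blend w * audio + (1 - w) * mri, with each w assumed in [0, 1].

    All three arguments are 2D lists (rows of floats) of identical shape.
    """
    return [
        [w * a + (1.0 - w) * m for a, m, w in zip(a_row, m_row, w_row)]
        for a_row, m_row, w_row in zip(audio_img, mri_img, weight_map)
    ]
```

Under this assumption, a weight near 1 means the pixel is taken mostly from the audio prediction, and a weight near 0 means it is taken mostly from the k-space reconstruction.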

2605.18211 2026-05-19 cs.CL cs.AI

Leveraging Graph Structure in Seq2Seq Models for Knowledge Graph Link Prediction

Luu Huu Phuc, Ratan Bahadur Thapa, Mojtaba Nayyeri, Jingcheng Wu, Evgeny Kharlamov, Steffen Staab

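AI summary: This paper proposes GA-S2S, a sequence-to-sequence model that incorporates graph structure by integrating a T5-small encoder-decoder with a Relational Graph Attention Network (RGAT) to improve knowledge graph link prediction.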

Comments 9 pages, 1 figure, 2 tables. Preprint of a paper accepted at the 5th Workshop on LLM-Integrated Knowledge Graph Generation from Text (TEXT2KG), co-located with ESWC 2026, May 10-14, 2026, Dubrovnik, Croatia

English abstract

We introduce Graph-Augmented Sequence-to-Sequence (GA-S2S), a novel framework that integrates a T5-small encoder-decoder with a Relational Graph Attention Network (RGAT) to improve link prediction in knowledge graphs. While existing Seq2Seq models rely solely on surface-level textual descriptions of entities and relations and, at best, flatten the neighborhoods of a query entity into a single linear sequence, thereby discarding the inherent graph structure, GA-S2S jointly encodes both textual features and the full k-hop subgraph topology surrounding the query entity. By integrating raw encoder outputs with RGAT's relation-aware embeddings, our model captures and leverages richer multi-hop relational patterns and textual information. Our preliminary experiments on the CoDEx dataset demonstrate that GA-S2S outperforms competitive Seq2Seq-based baseline models, achieving up to a 19% relative gain in link prediction accuracy.

2605.18209 2026-05-19 cs.CV cs.AI

SPATIOROUTE: Dynamic Prompt Routing for Zero-Shot Spatial Reasoning

Pawat Chunhachatrachai, Gueter Josmy Faure, Hung-Ting Su, Winston H. Hsu

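AI summary: This paper proposes SpatioRoute, a dynamic prompt generation approach that routes each question to a semantically tailored prompt template without additional training or 3D sensor input, improving zero-shot spatial reasoning. It also finds that Chain-of-Thought prompting performs poorly for spatial video understanding.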

Comments 10 pages, 2 figures, 2nd Workshop on 3D-LLM/VLA, CVPR 2026

English abstract

Spatial question answering over egocentric video is a challenging task that requires Vision-Language Models (VLMs) to reason about 3D object positions, scene affordances, and directional relationships, particularly in the zero-shot setting where no task-specific fine-tuning is available. We introduce SpatioRoute, a dynamic prompt generation approach that routes each incoming question to a semantically tailored prompt template, without any additional training, fine-tuning, or 3D sensor input. SpatioRoute operates in two complementary modes: SpatioRoute-R, a rule-based router that deterministically maps question typologies (e.g., What, Is, How, Can, Which) to specialized prompt templates; and SpatioRoute-L, an LLM-driven approach that generates task-specific prompts from the question and situational context alone, with no video input at routing time. We evaluate SpatioRoute on the SQA3D benchmark across VLMs spanning model families. SpatioRoute achieves consistent overall accuracy gains of up to 5% over fixed prompt baselines, establishing a new state of the art for zero-shot video-only spatial VQA without requiring 3D point-cloud inputs. As an additional finding, we observe that Chain-of-Thought (CoT) prompting, implemented via the Think it Twice architecture, consistently degrades performance in this setting on Qwen series models, confirming that question-aware routing is more effective than uniform reasoning instructions for spatial video understanding.
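
A rule-based router of the kind described for SpatioRoute-R can be sketched as a lookup on the question's first word. The template wording, the fallback template, and the function name `route` below are hypothetical; only the five question types come from the abstract.

```python
# Hypothetical templates: the wording is illustrative, not taken from the paper.
TEMPLATES = {
    "what": "Identify the object or attribute being asked about. Question: {q}",
    "is": "Answer yes or no after checking the stated condition. Question: {q}",
    "how": "Count or measure as needed before answering. Question: {q}",
    "can": "Decide whether the action is feasible from the described position. Question: {q}",
    "which": "Pick the direction or object that best matches. Question: {q}",
}
DEFAULT_TEMPLATE = "Answer the question about the scene. Question: {q}"


def route(question):
    """Deterministically map a question to a template by its first word."""
    cleaned = question.strip()
    words = cleaned.lower().split()
    key = words[0] if words else ""
    return TEMPLATES.get(key, DEFAULT_TEMPLATE).format(q=cleaned)
```

Because the mapping depends only on the question text, routing needs no model call and no video input, which matches the deterministic behaviour the abstract attributes to the rule-based mode.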

2605.18202 2026-05-19 cs.LG cs.AI

Concise and Logically Consistent Conformal Sets for Neuro-Symbolic Concept-Based Models

Samuele Bortolotti, Emanuele Marconato, Andrea Pugnana, Andrea Passerini, Stefano Teso

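AI summary: This paper proposes COCOCO, a framework that brings conformal prediction to neuro-symbolic concept-based models to address overconfident label and concept predictions. It satisfies three desiderata (consistency, coverage, and conciseness) and improves model reliability.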

English abstract

Neuro-Symbolic Concept-based Models (NeSy-CBMs) are a family of architectures that integrate neural networks with symbolic reasoning for enhanced reliability in high-stakes applications. They work by first extracting high-level concepts from the input and then inferring a task label from them in a way that is compatible with given logical constraints. Yet, their label and concept predictions can be overconfident, making it difficult for stakeholders to gauge when the model's decisions can be trusted. We address this issue by integrating ideas from Conformal Prediction (CP), a framework providing rigorous, distribution-free coverage guarantees. We formalize three desiderata (consistency, coverage, and conciseness) that any conformal method for NeSy-CBMs should satisfy, and show that existing approaches fall short of at least one. We then introduce COCOCO, a post-hoc framework that conformalizes concepts and labels jointly and reconciles them via a single deduction-abduction revision step. COCOCO satisfies all three desiderata, retains distribution-free coverage, is robust to imperfect knowledge, and supports user-specified size budgets. Our experiments on 8 data sets highlight how COCOCO compares favorably against competitors and natural baselines in terms of performance and set size.
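
For readers unfamiliar with Conformal Prediction, the sketch below shows standard split conformal prediction for a single classifier, using the nonconformity score 1 - p(true label). It is background only: it does not implement COCOCO's joint treatment of concepts and labels or its deduction-abduction revision step, and the function names are assumptions.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points: include every label
    return sorted(cal_scores)[k - 1]


def prediction_set(probs, threshold):
    """Labels whose nonconformity score 1 - p(label) is within the threshold.

    probs maps each candidate label to the model's predicted probability.
    """
    return {label for label, p in probs.items() if 1.0 - p <= threshold}
```

With exchangeable calibration and test data, sets built this way contain the true label with probability at least 1 - alpha, which is the distribution-free coverage guarantee the abstract refers to.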

2605.18197 2026-05-19 cs.RO cs.AI cs.CV

RGB-only Active 3D Scene Graph Generation for Indoor Mobile Robots

Giorgia Modi, Davide Buoso, Giuseppe Averta, Daniele De Martini

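AI summary: This paper proposes an active 3D scene graph generation method that uses RGB input only. A structured representation shared by perception and planning removes the dependence on dedicated depth sensors, and the method is validated on the Replica dataset.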

English abstract

Current approaches to 3D scene graph generation rely on dedicated depth sensors, such as LiDAR or RGB-D cameras, for metric 3D reconstruction. This limits deployment to specialized robotic platforms and excludes settings where only RGB cameras are available, such as fixed external infrastructure. Existing pipelines also typically operate on passively collected observation trajectories, rather than selecting viewpoints based on the partially built scene representation, and therefore fail to effectively exploit the semantic and spatial information encoded within the graph during exploration. This paper presents a fully visual framework for the active, incremental construction of 3D scene graphs from RGB input only, addressing both limitations. The proposed approach unifies perception and planning around a shared structured representation that captures object semantics, 3D geometry, relational context, and information from multiple viewpoints. Because the framework is hardware-agnostic and relies only on RGB observations, it can incorporate inputs from both onboard robot cameras and fixed external cameras within the same representation. Experiments on the Replica dataset show that the RGB-only pipeline achieves F1-score parity with baselines using ground-truth depth. Active exploration experiments on ReplicaCAD further show that semantic-driven viewpoint selection detects more than twice as many objects as a geometric frontier-based baseline under the same exploration budget. Finally, the external-camera setting demonstrates that complementary RGB views can effectively bootstrap the scene graph and improve contextual understanding at no additional exploration cost.

2605.18194 2026-05-19 cs.AI cs.CV

Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks

Yajing Zhou, Xiangyu Kong

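AI summary: This paper studies two-stage spatial inference in multi-modal large language models under perceptual bottlenecks. It proposes an Anchor-Based Embodied Spatial Decomposition Chain-of-Thought to handle spatial symmetry and out-of-view ambiguity, improving multi-modal Theory of Mind performance.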

Comments 17 pages, 3 figures

English abstract

While Multi-Modal Large Language Models (MLLMs) demonstrate impressive capabilities in general reasoning, their embodied spatial intelligence remains hampered by a "Cartesian Illusion" - a reliance on text-based probability distributions that lack grounded, 3D topological understanding. This limitation is starkly exposed in multi-agent environments, which demand more than just scene perception; they require second-order Theory of Mind (ToM). Specifically, an Agent A must be able to infer Agent B's belief about the environment, governed strictly by Agent B's physical orientation and sensory limitations. In this paper, we probe the limits of two-stage spatial inference in MLLMs through a novel audio-visual task: requiring Agent A to predict Agent B's estimation of A's relative location. To solve this, we propose an Epistemic Sensory Bottleneck module that abandons rigid, rule-based coordinate transformations. Instead, we introduce an Anchor-Based Embodied Spatial Decomposition Chain-of-Thought (CoT). This guides the MLLM through a "geometric-to-semantic" projection, forcing it to first establish B's local coordinate system and then dynamically weight visual and auditory modalities based on whether A falls within B's visual frustum. Extensive evaluations reveal that while current MLLMs fundamentally struggle with spatial symmetry and out-of-view ambiguities (establishing a rigorous zero-shot baseline of 42% accuracy), our sensory-bounded reasoning chain robustly outperforms pure egocentric and allocentric baselines. By systematically benchmarking these perceptual bottlenecks, our work exposes the current limits of MLLM spatial reasoning and establishes a foundational paradigm for epistemic, modality-aware inference in Embodied AI.

2605.18193 2026-05-19 cs.CV cs.GR

Best Segmentation Buddies for Image-Shape Correspondence

Itai Lang, Dongwei Lyu, Dale Decatur, Rana Hanocka

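AI summary: This paper studies segmentation-to-segmentation correspondence between images in the wild and untextured 3D shapes. It links image pixels to 3D shape vertices and computes feature similarity from deep visual features to find cross-modal correspondences.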

Comments CVPR 2026. Project page: https://threedle.github.io/bsb/

English abstract

Finding correspondences is a fundamental and extensively researched problem in computer vision and graphics. In this work, we examine the underexplored task of estimating segmentation-to-segmentation correspondence between images in the wild and untextured 3D shapes. This task is highly challenging due to substantial differences in appearance, geometry, and viewpoint. Our approach bridges the cross-modality gap by linking pixels in the image segment to vertices in the corresponding semantic part of the 3D shape. To achieve this, we first distill deep visual features from a 2D vision model onto the 3D shape surface, allowing for the computation of feature similarity between image pixels and shape vertices. Then, we identify Best Segmentation Buddies, vertices whose most similar image pixel lies within the image segmentation region, enabling the reliable discovery of vertices in semantically corresponding shape parts. Finally, we leverage distilled 3D features from the 2D image segmentation model to segment the shape directly in 3D, bootstrapping the correspondence process. We demonstrate the generality and robustness of our approach across a wide range of image-shape pairs, showcasing accurate and semantically meaningful correspondences. Our project page is at https://threedle.github.io/bsb/.
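
The selection rule for Best Segmentation Buddies (vertices whose most similar image pixel lies within the image segmentation region) can be sketched directly from that definition. The sketch assumes per-vertex and per-pixel feature vectors are already available and L2-normalised, and it uses a brute-force search; the data layout and names are assumptions, not the authors' implementation.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def best_segmentation_buddies(vertex_feats, pixel_feats, segment_pixels):
    """Indices of vertices whose most similar pixel lies inside the segment.

    vertex_feats / pixel_feats: lists of equal-length feature vectors,
    assumed L2-normalised so the dot product acts as cosine similarity.
    segment_pixels: set of pixel indices covered by the image segment.
    """
    buddies = []
    for v_idx, v in enumerate(vertex_feats):
        best_pixel = max(range(len(pixel_feats)),
                         key=lambda p_idx: dot(v, pixel_feats[p_idx]))
        if best_pixel in segment_pixels:
            buddies.append(v_idx)
    return buddies
```

A practical version would batch the similarity computation on a GPU, since the brute-force loop costs one dot product per vertex-pixel pair.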

2605.18192 2026-05-19 cs.CV

View-Aware Semantic Alignment for Aerial-Ground Person Re-Identification

Quan Zhang, Zeqiang Cai, Peiming Zhao, Jingze Wu, Cailun Wu, Hongbo Chen, Jianhuang Lai

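AI summary: This paper proposes ViSA, a view-aware semantic alignment framework that addresses viewpoint differences in aerial-ground person re-identification. An Expert-driven Token Generation Module and a Dual-branch Local Fusion Module achieve cross-view semantic consistency, and experiments show a 10.06% mAP improvement on the CARGO benchmark.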

Comments CVPR 2026 POSTER

English abstract

Aerial-Ground Person Re-Identification (AGPReID) remains highly challenging due to drastic viewpoint variations between drones and fixed cameras. Existing methods typically follow a view-invariant paradigm, aligning shared features across views to achieve robustness. However, view invariance inherently enforces part-level alignment, which ignores view-specific cues and discriminative identity information. To this end, this work proposes ViSA (View-aware Semantic Alignment), a view-aware framework for cross-view semantic consistency that comprises an Expert-driven Token Generation Module (ETGM) and a Dual-branch Local Fusion Module (DLFM). Technically, the former constructs a set of view-aware experts to generate adaptive semantic queries that perceive viewpoint-specific patterns, while the latter leverages graph reasoning to extract and align local regions responsive to different experts. Extensive experiments on three AGPReID benchmarks, including AG-ReID.v2, CARGO, and LAGPeR, demonstrate that ViSA consistently achieves superior performance, with a notable 10.06% mAP improvement on the challenging CARGO cross-view protocol. The code is available at https://github.com/Cat-Zero/ViSA.

2605.18191 2026-05-19 cs.AI

Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation

Guining Cao, Jiaxin Peng, Chu Zeng, Yu Zhao, Shuangyong Song, Yongxiang

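AI summary: This paper proposes PPR-GDE, a reinforcement learning method for open-ended generation. Its pairwise preference reward and group-based diversity enhancement address the hard-to-verify responses, high computational cost, and reduced diversity that affect conventional methods on open-ended generation tasks.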

Comments Work in progress

English abstract

Current reinforcement learning (RL) methods are broadly applicable and powerful in verifiable settings where scalar rewards can be provided. However, in open-ended generation tasks, verifying the correctness of responses remains challenging, and training reward models incurs substantial computational and annotation costs. Moreover, reinforcement learning with verifiable rewards (RLVR) often leads to diversity collapse and produces stereotypical or rigid outputs, which are particularly undesirable in open-domain scenarios. We propose Pairwise Preference Reward and Group-based Diversity Enhancement (PPR-GDE), an RL method that is better suited to open-ended generation. PPR-GDE does not require scalar rewards and incorporates group-level diversity into the reward signal. It preserves the comparative structure of subjective evaluation through a pairwise preference reward, mitigates judge position bias via repeated comparisons with swapped response order, and introduces a group-based diversity reward that explicitly encourages semantic dispersion within a response group. All of these reward signals are integrated into a unified group-relative policy optimization objective. We instantiate PPR-GDE on a role-playing task, and experiments show that it achieves better alignment quality and expressive diversity than strong RL baselines. Further analysis shows that pairwise preference is critical for preference alignment from a subjective perspective, while the diversity metric plays an essential role in achieving superior expressive diversity and broader semantic coverage.
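
The idea of mitigating judge position bias by comparing each pair in both presentation orders can be sketched as follows. The judge interface, the averaging rule, and the win-rate aggregation over the group are assumptions made for illustration; the diversity reward and the group-relative policy optimization objective are left out.

```python
def debiased_preference(judge, response_a, response_b):
    """Score for response_a in [0, 1], averaged over both presentation orders.

    judge(first, second) returns 1.0 if it prefers `first`, else 0.0
    (a hypothetical interface).
    """
    a_shown_first = judge(response_a, response_b)
    a_shown_second = 1.0 - judge(response_b, response_a)
    return 0.5 * (a_shown_first + a_shown_second)


def pairwise_rewards(judge, group):
    """Mean debiased win rate of each response against the rest of its group."""
    rewards = []
    for i, resp in enumerate(group):
        others = [r for j, r in enumerate(group) if j != i]
        wins = [debiased_preference(judge, resp, o) for o in others]
        rewards.append(sum(wins) / len(wins) if wins else 0.5)
    return rewards
```

A judge that always prefers whichever response is shown first yields 0.5 for every pair under this scheme, so pure position bias contributes no preference signal.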

2605.18190 2026-05-19 cs.LG cs.CV

Dual-Rate Diffusion: Accelerating diffusion models with an interleaved heavy-light network

Grigory Bartosh, David Ruhe, Emiel Hoogeboom, Jonathan Heek, Thomas Mensink, Tim Salimans

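AI summary: This paper proposes Dual-Rate Diffusion, which interleaves a high-capacity context encoder with a lightweight denoising model to accelerate diffusion model inference while preserving sample quality, balancing performance and computational cost on ImageNet benchmarks.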

英文摘要

Diffusion models achieve state-of-the-art generative performance but suffer from high computational costs during inference due to the repeated evaluation of a heavy neural network. In this work, we propose Dual-Rate Diffusion, a method to accelerate sampling by interleaving the execution of a heavy high-capacity context encoder and a light efficient denoising model. The context encoder is evaluated sparsely to extract high-dimensional features, which are effectively reused by the light denoising model at every step to refine the sample efficiently. This approach significantly accelerates inference without compromising sample quality. On ImageNet benchmarks, Dual-Rate Diffusion matches the performance of standard baselines while reducing computational cost by a factor of 2 to 4. Furthermore, we demonstrate that our method is compatible with distillation techniques, such as Moment Matching Distillation, enabling further efficiency gains in few-step generation.
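
The interleaving described above can be sketched as a sampling loop in which the heavy encoder runs only on some steps and its cached output is reused in between. The fixed every-k-steps schedule, the function names, and the toy scalar "networks" below are assumptions for illustration; the abstract does not specify how the sparse evaluation is scheduled.

```python
def dual_rate_sample(x, num_steps, encoder_every, heavy_encoder, light_denoiser):
    """Run the light denoiser at every step, the heavy encoder every k-th step."""
    context = None
    for step in range(num_steps):
        t = 1.0 - step / num_steps             # toy time schedule, assumed
        if step % encoder_every == 0:
            context = heavy_encoder(x, t)      # sparse, expensive call
        x = light_denoiser(x, t, context)      # cheap call, reuses cached context
    return x


# Toy stand-ins that only count calls; real models would be neural networks.
calls = {"heavy": 0, "light": 0}


def toy_encoder(x, t):
    calls["heavy"] += 1
    return 0.0                                 # pretend prediction of the clean sample


def toy_denoiser(x, t, context):
    calls["light"] += 1
    return x + 0.5 * (context - x)             # move halfway toward the prediction


sample = dual_rate_sample(8.0, num_steps=12, encoder_every=4,
                          heavy_encoder=toy_encoder, light_denoiser=toy_denoiser)
```

In this toy run the heavy function is called 3 times and the light one 12 times, which is where the savings would come from when the encoder dominates the per-step cost.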

2605.18188 2026-05-19 cs.LG

UTOPYA: A Multimodal Deep Learning Framework for Physics-Informed Anomaly Detection and Time-Series Prediction

UTOPYA:一种用于物理信息异常检测和时间序列预测的多模态深度学习框架

Robson W. S. Pessoa, Julien Amblard, Alessandra Russo, Idelfonso B. R. Nogueira

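AI summary This paper proposes the UTOPYA framework, which fuses eight data modalities through FiLM-conditioned cross-modal attention and gated fusion to jointly address anomaly detection, time-series prediction, and phase classification in batch distillation, and improves performance with a physics-informed regularisation scheme and curriculum learning.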

详情
AI中文摘要

批次过程中的异常检测受到瞬态动态、稀少故障标签和依赖单一模态传感器数据的限制。本文介绍了UTOPYA(统一时间观测用于物理信息异常检测和时间序列预测),一种具有15.2M参数的多模态框架,通过特征-wise线性调制(FiLM)条件交叉模态注意力和门控融合,共同解决批次蒸馏中的异常检测、时间序列预测和相分类问题。本文引入的物理信息正则化方案强制时间平滑性和热力学单调性,而课程学习则按物理难度顺序引入训练样本。在Arweiler等人(2026)的119次实验多模态批次蒸馏数据集上,UTOPYA在窗口级别测试中达到0.832和0.874的AUROC,显著优于四个外部基线(PCA、自动编码器、隔离森林和LSTM自动编码器)在相同条件下的表现(+0.147窗口级别AUROC超过最佳基线)。对15种架构配置的多模态消融研究显示,通过FiLM条件的静态上下文是关键使能器,使实验级别多信号AUROC提高+0.145(从0.729到0.874)。此外,对14种设计选择的训练消融研究发现,包括实例归一化、Mixup、集成、测试时增强和随机权重平均在内的几种广泛采用的技巧在数据稀少的设置中未能提升或主动降低泛化能力。这些负面结果揭示了平滑基于正则化和异常检测之间的根本矛盾,为多模态过程监控部署提供了实际指导。

英文摘要

Anomaly detection in batch processes is hindered by transient dynamics, scarce fault labels, and reliance on single-modality sensor data. This work introduces UTOPYA (Unified Temporal Observation for Physics-Informed Anomaly Detection and Time-Series Prediction), a 15.2M-parameter multimodal framework that jointly addresses anomaly detection, time-series prediction, and phase classification in batch distillation by fusing eight data modalities through Feature-wise Linear Modulation (FiLM) conditioned cross-modal attention and gated fusion. A physics-informed regularisation scheme introduced in this work enforces temporal smoothness and thermodynamic monotonicity, while curriculum learning introduces training samples in order of physical difficulty. On the 119-experiment multimodal batch distillation dataset of Arweiler et al. (2026), UTOPYA achieves a window-level test AUROC of 0.832 and an AUROC of 0.874 under multi-signal experiment-level scoring, substantially outperforming four external baselines (PCA, autoencoder, Isolation Forest, and LSTM autoencoder) evaluated under identical conditions (+0.147 window-level AUROC over the best baseline). A multimodal ablation over 15 architectural configurations shows that static context via FiLM conditioning is the key enabler, lifting experiment-level multi-signal AUROC by +0.145 over the unimodal baseline (0.729 to 0.874). Separately, a training ablation across 14 design choices reveals that several widely adopted techniques, including instance normalisation, Mixup, ensembling, test-time augmentation, and stochastic weight averaging, fail to improve or actively degrade generalisation in this data-scarce setting. These negative results expose a fundamental tension between smoothing-based regularisation and anomaly detection, providing practical guidance for multimodal process monitoring deployment.
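
Feature-wise Linear Modulation, which the abstract identifies as the key enabler, scales and shifts each feature channel using parameters predicted from a conditioning input. The sketch below shows only that generic operation on plain Python lists; the shapes, weight names, and the use of a single linear layer to produce the scale and shift are assumptions, not UTOPYA's architecture.

```python
def linear(weights, bias, vec):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def film(features, context, w_gamma, b_gamma, w_beta, b_beta):
    """Apply gamma(context) * x + beta(context) to every time step.

    features: list of time steps, each a list of channel values
    context:  static conditioning vector (e.g. experiment metadata)
    """
    gamma = linear(w_gamma, b_gamma, context)   # per-channel scale
    beta = linear(w_beta, b_beta, context)      # per-channel shift
    return [[g * x + b for x, g, b in zip(step, gamma, beta)]
            for step in features]


# Two time steps, two channels, one-dimensional static context.
modulated = film(
    features=[[1.0, 2.0], [3.0, 4.0]],
    context=[1.0],
    w_gamma=[[2.0], [0.5]], b_gamma=[0.0, 0.0],
    w_beta=[[1.0], [-1.0]], b_beta=[0.0, 0.0],
)
```

Because the same scale and shift are applied at every time step, static context changes how the temporal features are read without adding a time-varying input of its own.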

2605.18184 2026-05-19 cs.RO cs.AI cs.CV

Fixed External Cameras as Common Prior Maps for Active 3D Scene Graph Generation

固定外部摄像头作为主动3D场景图生成的共同先验地图

Giorgia Modi, Davide Buoso, Giuseppe Averta, Daniele De Martini

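AI summary This paper proposes using fixed external RGB cameras as Common Prior Maps for active, incremental 3D scene graph generation, fusing data from onboard robot cameras and fixed external cameras to improve the efficiency and accuracy of scene understanding.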

详情
AI中文摘要

常用的先验信息,如BIM模型、平面图和遥感图像,可以为自主机器人系统提供有价值的几何和语义上下文。在本文中,我们将固定外部RGB摄像头的观测视为共同先验地图(CPMs):环境的广角视图,在任何机器人运动开始之前初始化一个语义和几何场景先验。我们提出一个仅使用RGB的框架,用于主动、渐进式的3D场景图(3DSG)生成,该框架在单一硬件无关的管道中无缝融合来自机器人 onboard 摄像头和固定外部摄像头的观测。通过仅依赖RGB观测并通过前馈3D重建模型进行处理,系统将所有摄像头——机器人 onboard 或外部——视为相同,无需硬件修改。基于图的主动语义探索框架然后直接利用部分场景图,引导机器人向高语义不确定性区域前进,逐步完成和细化先验。实验表明,使用单个外部摄像头初始化场景图可使初始物体召回率提高高达+79%,并且先验的更丰富上下文显著提高了后续主动探索的效率。

英文摘要

Commonly available prior information, such as BIM models, floor plans, and remote sensing images, can provide valuable geometric and semantic context for autonomous robotic systems. In this paper, we treat observations from fixed external RGB cameras as Common Prior Maps (CPMs): wide-field views of the environment that initialize a semantic and geometric scene prior before any robot motion begins. We present an RGB-only framework for active, incremental 3D scene graph (3DSG) generation that seamlessly fuses observations from both onboard robot cameras and fixed external cameras within a single hardware-agnostic pipeline. By relying solely on RGB observations processed by a feed-forward 3D reconstruction model, the system treats all cameras, onboard or external, identically, requiring no hardware modifications. A graph-based active semantic exploration framework then directly leverages the partial scene graph to guide the robot toward regions of high semantic uncertainty, progressively completing and refining the prior. Experiments demonstrate that bootstrapping the scene graph with even a single external camera increases initial object recall by up to +79%, and that the richer context of the prior significantly improves the efficiency of subsequent active exploration.

2605.18181 2026-05-19 cs.AI cs.CL

Scalable Environments Drive Generalizable Agents

可扩展环境驱动可泛化的智能体

Jiayi Zhang, Fanqi Kong, Guibin Zhang, Maojia Song, Zhaoyang Yu, Jianhao Ruan, Jinyu Xiang, Bang Liu, Chenglin Wu, Yuyu Luo

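AI summary This paper examines why generalizable agents need scalable environments in order to adapt to diverse tasks and unseen environments, sets out the core challenge of environment scaling, and proposes a unified taxonomy and construction paradigms for scalable environments.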

详情
AI中文摘要

可泛化智能体应能够适应多样任务和未见环境,而不仅仅是训练分布内的任务。本文认为这种泛化需要环境扩展:扩展智能体交互的可执行规则集分布,而不是仅增加轨迹或任务数量。当前扩展实践主要集中在固定交互规则下收集更多经验或更广的任务集,导致智能体在底层接口、动态、观察或反馈信号变化时变得脆弱。因此,核心挑战是世界层面的分布偏移:智能体需要系统性地暴露于具有显著不同可执行规则集的环境中。为澄清这一挑战,我们提出了一个统一的分类法,通过主要交付成果和可执行规则集的变化将轨迹扩展、任务扩展和环境扩展区分开来。基于此分类法,我们综合了可扩展环境的构建范式,对比了优先考虑可控性和可验证性的程序生成器与提供更广泛覆盖和开放性的生成世界模型。我们进一步概述了如何将环境扩展与具有状态的学习机制结合,强调学习的更新规则用于跨环境适应。最后,我们讨论了其他观点并论证可扩展环境是实现稳健通用智能体的必要基础。

英文摘要

Generalizable agents should adapt to diverse tasks and unseen environments beyond their training distribution. This position paper argues that such generalization requires environment scaling: expanding the distribution of executable rule-sets that agents interact with, rather than only increasing trajectories or tasks within fixed benchmarks. Current scaling practices largely focus on collecting more experience or broader task sets under fixed interaction rules, leaving agents brittle when underlying interfaces, dynamics, observations, or feedback signals change. The core challenge is therefore a world-level distribution shift: agents need systematic exposure to environments with meaningfully different executable rule-sets. To clarify this challenge, we propose a unified taxonomy that separates trajectory scaling, task scaling, and environment scaling by their primary deliverables and by what changes in the executable rule-set. Building on this taxonomy, we synthesize construction paradigms for scalable environments, contrasting programmatic generators that prioritize controllability and verifiability with generative world models that offer broader coverage and open-endedness. We further outline how environment scaling can be coupled with stateful learning mechanisms, emphasizing learned update rules for cross-environment adaptation. We conclude by discussing alternative perspectives and argue that scalable environments provide the essential substrate for measurable and controllable progress toward robust general agents.

2605.18177 2026-05-19 cs.CV

Token-Space Mask Prediction for Efficient Vision Transformer Segmentation

基于令牌空间的掩码预测用于高效的视觉变换器分割

Calvin Galagain, Martyna Poreba, François Goulette

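AI summary This paper proposes TokenMask, a method that computes mask logits directly from query-token affinities and interpolates in logit space, reducing computational and memory requirements while maintaining accuracy and improving segmentation efficiency.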

Comments CVPR, EVW 2026

详情
AI中文摘要

基于查询的视觉变换器分割模型通常重建密集的空间特征图以预测掩码,继承了卷积架构的设计模式。我们证明这种显式的图像空间重建是不必要的。我们引入TokenMask,一种令牌空间掩码头,直接从查询令牌亲和力计算掩码logits,并在logit空间而非特征空间中进行插值。这种重新表述保留了原始的线性评分机制,同时简化了计算结构。在多样化的ViT后端、数据集和分割任务中,TokenMask通过减少计算和内存需求,同时保持竞争性的准确性,提高了在NVIDIA Jetson AGX Orin上的实际速度。总体而言,TokenMask为嵌入式视觉系统提供了一种更简单且更易于部署的设计。

英文摘要

Query-based Vision Transformer segmentation models typically reconstruct dense spatial feature maps to predict masks, inheriting design patterns from convolutional architectures. We show that this explicit image-space reconstruction is not required. We introduce TokenMask, a token-space mask head that computes mask logits directly from query-token affinities and performs interpolation in logit space rather than feature space. This reformulation preserves the original linear scoring mechanism while simplifying the computational structure. Across diverse ViT backbones, datasets and segmentation tasks, TokenMask consistently improves efficiency over prior approaches by reducing computational and memory requirements while maintaining competitive accuracy, leading to tangible speedups on NVIDIA Jetson AGX Orin using TensorRT FP16 inference. Overall, TokenMask yields a simpler and more deployment-friendly design for embedded vision systems.
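
The claim that the linear scoring mechanism is preserved follows from linearity: a dot product with a fixed query commutes with linear interpolation, so interpolating scalar logits gives the same result as interpolating full feature vectors and scoring afterwards. The one-dimensional sketch below illustrates this under stated assumptions (a single query, a 1-D token sequence, plain linear interpolation); function names are hypothetical and this is not the TokenMask implementation.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def lerp_sequence(seq, factor):
    """Linearly interpolate between consecutive vectors, `factor` samples per gap."""
    out = []
    for left, right in zip(seq, seq[1:]):
        for k in range(factor):
            a = k / factor
            out.append([(1 - a) * l + a * r for l, r in zip(left, right)])
    out.append(list(seq[-1]))
    return out


def mask_logits_token_space(query, tokens, factor):
    """Score each token once, then upsample the scalar logits."""
    coarse = [[dot(query, tok)] for tok in tokens]
    return [v[0] for v in lerp_sequence(coarse, factor)]


def mask_logits_feature_space(query, tokens, factor):
    """Upsample the full feature vectors first, then score every position."""
    dense = lerp_sequence(tokens, factor)
    return [dot(query, feat) for feat in dense]


query = [1.0, -2.0]
tokens = [[0.0, 1.0], [2.0, 0.0], [4.0, 4.0]]
token_space = mask_logits_token_space(query, tokens, factor=4)
feature_space = mask_logits_feature_space(query, tokens, factor=4)
```

Both paths produce the same logits here, but the token-space path interpolates one scalar per position instead of a C-dimensional vector, which is the kind of compute and memory saving the abstract refers to.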

2605.18176 2026-05-19 cs.CV cs.AI

MARS: Technical Report for the CASTLE Challenge at EgoVis 2026

MARS:EgoVis 2026 CASTLE挑战的技术报告

Haoyu Zhang, Qiaohui Chu, Yisen Feng, Meng Liu, Weili Guan, Yaowei Wang, Liqiang Nie

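AI summary This paper presents MARS, a system for the CASTLE Challenge at EgoVis 2026 that uses multimodal agentic reasoning to answer complex questions requiring information from multiple sources; its core method is multimodal evidence selection, and its main contribution is effective reasoning over multi-source data.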

Comments The Runner-up Solution for CASTLE Challenge @ EgoVis 2026

详情
AI中文摘要

本报告介绍了MARS,即多模态代理推理与源选择系统,是参与EgoVis 2026 CASTLE挑战的系统。参赛者必须在CASTLE 2024数据集上回答185个封闭式问题。与以往单视频眼动基准不同,CASTLE要求对四天活动、15个同步视角、官方 transcripts 及多种辅助模态(包括个人照片、辅助视频、注视、热成像和心率测量)进行推理。MARS将任务视为多模态源的代理证据选择问题,而非纯粹文本流程。MARS首先遵循官方CASTLE目录组织,从视频和 transcripts 两个主要来源以及注视、心率、照片和热成像四个辅助来源构建证据记忆。长视频仅转换为caption和基于DeepSeek的摘要,因为CASTLE视频太长无法直接输入模型上下文;此步骤压缩时间证据,同时保留照片和其他辅助媒体作为源特定证据。在推理时,一个GPT-5.4决策代理反复选择是否继续推理、请求特定缺失模态、生成答案或回退到随机选项,当证据不足时。所得到的系统在最终CASTLE挑战排行榜上获得第二名。我们的代码可在https://github.com/Hyu-Zhang/MARS获取。

英文摘要

This report presents MARS, short for Multimodal Agentic Reasoning with Source selection, our system for the CASTLE Challenge at EgoVis 2026. Participants must answer 185 closed-form questions over the CASTLE 2024 dataset. In contrast to prior single-video egocentric benchmarks, CASTLE requires reasoning over four days of activity, 15 synchronized perspectives, official transcripts, and multiple auxiliary modalities, including personal photos, auxiliary videos, gaze, thermal imagery, and heart rate measurements. MARS therefore treats the task as an agentic evidence-selection problem over multimodal sources rather than a purely text-only pipeline. MARS first follows the official CASTLE directory organization to build evidence memories from two primary sources (videos and transcripts) and four auxiliary sources (gaze, heart rate, photos, and thermal imagery). Long videos are converted into captions and DeepSeek-based summaries only because CASTLE videos are too long to fit directly into the model context for every question; this step compresses temporal evidence while keeping photos and other auxiliary media available as source-specific evidence. At inference time, a GPT-5.4 decision agent repeatedly chooses whether to continue reasoning, request a specific missing modality, produce an answer, or fall back to a random option when the evidence remains insufficient. The resulting system achieved second place on the final CASTLE Challenge leaderboard. Our code is available at https://github.com/Hyu-Zhang/MARS.

2605.18175 2026-05-19 cs.SD

Sonalyzer-Moz: A Framework for Analyzing the Structure of Mozart's Sonata Form

Sonalyzer-Moz: 一个用于分析莫扎特奏鸣曲形式结构的框架

Jing Zhao, KokSheik Wong, Vishnu Monn Baskaran, Kiki Adhinugraha, David Taniar

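AI summary This paper proposes the Sonalyzer-Moz framework, which combines feature aggregation with sequential modeling to analyze sonata form structure automatically, and builds SoSA-Moz, the first large-scale annotated dataset, laying a foundation for the systematic study of sonata form.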

Comments 6 pages, 2 figures

详情
AI中文摘要

奏鸣曲形式是一种音乐丰富且层级结构复杂的形式,对自动分析提出了重大挑战。尽管近年来音乐结构分析取得了进展,但奏鸣曲形式分析仍处于早期阶段。这主要是由于标注古典音乐结构耗时且需要较高的音乐背景知识。为推动该领域研究,我们编制了SoSA-Moz,这是首个大规模数据集,包含全面的层级结构标注。本工作为系统奏鸣曲形式分析奠定了基础。利用这一新贡献的资源,我们进一步提出了Sonalyzer-Moz,一个专门用于研究复杂奏鸣曲结构的基线模型。该框架整合了特征聚合与序列建模,使其能够捕捉局部特征和高层结构依赖性。实验结果表明,Sonalyzer-Moz能够识别对理解奏鸣曲形式至关重要的上层结构组件边界。因此,该方法首次展示了自动上层分析奏鸣曲形式的有效性,并为未来自动理解奏鸣曲形式的研究提供了稳健的基线,同时推进了古典音乐结构分析的研究。

英文摘要

The sonata form is a musically rich and hierarchically structured form that poses significant challenges for automatic analysis. While music structure analysis has made considerable progress in recent years, sonata form analysis remains in its early stages. This is largely because annotating classical music structures is time-consuming and demands substantial musical background. To advance research in this area, we curated SoSA-Moz, the first large-scale dataset featuring comprehensive hierarchical structure annotations. This work establishes a foundation for systematic sonata form analysis. Leveraging this newly contributed resource, we further propose Sonalyzer-Moz, a baseline model specifically designed for investigating complex sonata structures. This framework integrates feature aggregation with sequential modeling, enabling it to capture both local features and upper-level structural dependencies. Experimental results show that Sonalyzer-Moz can identify the boundaries of the upper-level structural components that are critical to understanding sonata form. This method therefore demonstrates, for the first time, the effectiveness of automatic upper-level analysis of sonata form, and provides a robust baseline for future research on the automatic understanding of sonata form while advancing the study of classical music structure analysis.

2605.18174 2026-05-19 cs.LG cs.DC math.OC stat.ML

Ringmaster LMO: Asynchronous Linear Minimization Oracle Momentum Method

Ringmaster LMO: 异步线性最小化Oracle动量方法

Abdurakhmon Sadiev, Artavazd Maranjyan, Ivan Ilin, Peter Richtárik

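AI summary This paper proposes Ringmaster LMO, an asynchronous Linear Minimization Oracle momentum method for unconstrained stochastic nonconvex optimization that improves on conventional synchronous methods through a delay-thresholding mechanism and suits heterogeneous distributed systems; experiments show that its advantage grows as system heterogeneity increases.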

详情
AI中文摘要

Muon最近作为一种强大的替代AdamW方法出现,展现出大规模预训练的良好结果和矩阵结构更新在实践中可能更快的证据。然而,Muon以及更一般的线性最小化Oracle(LMO)方法通常用于同步方式。这在异构分布式系统中存在问题,因为工人完成梯度计算的速度不同,同步训练必须反复等待较慢的工人。本文引入Ringmaster LMO,一种用于无约束随机非凸优化的异步LMO基于动量方法。我们的方法基于Ringmaster ASGD的延迟阈值思想。对于SGD类型方法,Ringmaster ASGD通过丢弃过于陈旧的梯度实现最优时间复杂度。Ringmaster LMO将这一机制扩展到一般LMO更新。我们建立了在广义$(L_0, L_1)$-平滑条件下的收敛保证,并进一步开发了参数无关变体,具有递减步长和自适应延迟阈值。最后,我们将我们的迭代保证转换为在异构工人计算时间下的时间复杂度界限。在经典欧几里得平滑设置中,这些界限恢复了Ringmaster ASGD的最优时间复杂度。在随机二次问题和NanoChat语言模型预训练中的实验表明,Ringmaster LMO的优势随着系统异构性增加而增强,并且该方法在同步和异步基线方法中表现更优。

英文摘要

Muon has recently emerged as a strong alternative to AdamW for training neural networks, with encouraging large-scale pretraining results and growing evidence that matrix-structured updates can be faster in practice. Yet Muon, and more generally Linear Minimization Oracle (LMO) based methods, are typically used synchronously. This is problematic in heterogeneous distributed systems, where workers complete gradient computations at different speeds and synchronous training must repeatedly wait for slower workers. In this work, we introduce Ringmaster LMO, an asynchronous LMO-based momentum method for unconstrained stochastic nonconvex optimization. Our method builds on the delay-thresholding idea of Ringmaster ASGD. For SGD-type methods, Ringmaster ASGD achieves optimal time complexity by discarding overly stale gradients. Ringmaster LMO extends this mechanism to general LMO-based updates. We establish convergence guarantees under generalized $(L_0, L_1)$-smoothness and further develop a parameter-agnostic variant with decreasing stepsizes and adaptive delay thresholds. Finally, we translate our iteration guarantees into time complexity bounds under heterogeneous worker computation times. In the classical Euclidean smooth setting, these bounds recover the optimal time complexity of Ringmaster ASGD. Experiments on stochastic quadratic problems and NanoChat language-model pretraining show that the advantages of Ringmaster LMO grow with system heterogeneity and that the method outperforms strong synchronous and asynchronous baselines.
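
The delay-thresholding idea combined with an LMO-based momentum update can be sketched as a single server-side step. This is a simplified illustration under several assumptions: the LMO is taken over the unit l-infinity ball (which yields a sign update, whereas Muon uses a matrix norm), the rejection rule is `delay >= max_delay`, and the names `server_step`, `lr`, and `beta` are hypothetical rather than the paper's notation.

```python
def lmo_linf(momentum):
    """argmin over the unit l-infinity ball of <momentum, d>: the negated sign."""
    return [-1.0 if m > 0 else (1.0 if m < 0 else 0.0) for m in momentum]


def server_step(state, grad, grad_version, max_delay, lr=0.1, beta=0.5):
    """Apply one asynchronous update, or discard the gradient if it is too stale.

    grad_version is the iterate counter at which the worker read the model.
    Returns True when the gradient was used.
    """
    delay = state["version"] - grad_version
    if delay >= max_delay:
        return False                                   # overly stale: ignore
    state["momentum"] = [(1 - beta) * m + beta * g
                         for m, g in zip(state["momentum"], grad)]
    direction = lmo_linf(state["momentum"])
    state["x"] = [x + lr * d for x, d in zip(state["x"], direction)]
    state["version"] += 1
    return True


# Toy run on f(x) = 0.5 * ||x||^2, whose gradient at x is x itself.
state = {"x": [1.0, -1.0], "momentum": [0.0, 0.0], "version": 0}
fresh = server_step(state, grad=[1.0, -1.0], grad_version=0, max_delay=2)
second = server_step(state, grad=list(state["x"]), grad_version=1, max_delay=2)
stale = server_step(state, grad=[1.0, -1.0], grad_version=0, max_delay=2)
```

The third call arrives with a delay of 2 and is dropped without touching the iterate, so a slow worker cannot drag the model back toward an outdated point.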

2605.18173 2026-05-19 cs.CV

Do You Need Text Rectification? Soft Attention Mask Embedding for Rectification-Free Scene Text Spotting

你需要文本校正吗?用于无校正场景文本识别的软注意力掩码嵌入

Antonio Colombo, Giovanni Bianchi

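AI summary This paper proposes a new Soft Attention Mask Embedding (SAME) module that uses the global receptive field of Transformer encoders to encode high-level features and compute soft attention weights, which are then hierarchically embedded with predicted masks to produce refined, text-boundary-aware masks that effectively suppress background noise. Building on this module, the paper presents SAME-Net, a robust end-to-end text spotting framework that requires neither character-level annotations nor auxiliary text rectification modules.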

详情
AI中文摘要

端到端场景文本识别,即在一个框架内统一文本检测和识别,已因深度学习的进步而取得显著进展。然而,大多数现有方法仍然受到多尺度变化、任意文本形状和复杂背景干扰导致的不完整掩码提案的影响,从而降低识别准确性。在本文中,我们提出了一种新的软注意力掩码嵌入模块(SAME),该模块利用Transformer编码器的全局感受野来编码高级特征并计算软注意力权重,然后与预测的掩码进行分层嵌入,生成精细的文本边界感知掩码,从而有效抑制背景噪声。基于该模块,我们提出了SAME-Net,一个鲁棒的端到端文本识别框架,无需字符级标注或辅助文本校正模块。由于软注意力机制是完全可微分的,识别损失梯度可以反向传播通过SAME模块到检测分支,从而实现检测和识别目标的联合优化。在具有挑战性的基准测试中进行了广泛的实验,证明了我们方法的有效性:SAME-Net在任意形状的Total-Text数据集上实现了84.02%的端到端H-mean,比之前的最先进方法GLASS在全词典准确率上高出1.02%,且无需额外训练数据;在多方向ICDAR 2015数据集上获得了具有竞争力的83.4%强词典结果。

英文摘要

End-to-end scene text spotting, which unifies text detection and recognition within a single framework, has witnessed remarkable progress driven by deep learning advances. However, most existing approaches still suffer from incomplete mask proposals caused by multi-scale variation, arbitrary text shapes, and complex background interference, thereby degrading recognition accuracy. In this paper, we propose a novel Soft Attention Mask Embedding module (SAME) that leverages the global receptive field of Transformer encoders to encode high-level features and compute soft attention weights, which are then hierarchically embedded with predicted masks to generate refined text-boundary-aware masks that effectively suppress background noise. Building upon this module, we present SAME-Net, a robust end-to-end text spotting framework that requires neither character-level annotations nor auxiliary text rectification modules. Since the soft attention mechanism is fully differentiable, recognition loss gradients can be back-propagated through the SAME module to the detection branch, enabling joint optimization of detection and recognition objectives. Extensive experiments on challenging benchmarks demonstrate the effectiveness of our approach: SAME-Net achieves 84.02% end-to-end H-mean on the arbitrarily-shaped Total-Text dataset, surpassing the previous state-of-the-art GLASS by 1.02% in full-lexicon accuracy without additional training data, and obtains competitive 83.4% strong-lexicon results on the multi-oriented ICDAR 2015 dataset.

2605.18165 2026-05-19 cs.LG

Elastic-dLLM: Position Preserving Context Compression and Augmentation of Diffusion LLMs

Elastic-dLLM: Diffusion LLMs的弹性上下文压缩与增强

Junyi Wu, Tianchen Zhao, Shaoqiu Zhang, Linfeng Zhang, Guohao Dai, Yu Wang

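AI summary This paper addresses context compression and augmentation in diffusion large language models, proposing position-preserving context compression and terminal-aware augmentation to improve decoding efficiency and enable long-context scaling.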

详情
AI中文摘要

与自回归模型生成一个token一次不同,dLLMs通过联合去噪一批[MASK] tokens并每一步采样一个或多个token;尽管这允许并行解码,但由于被掩码token的大量批大小,这个过程会带来显著的计算成本。我们观察到,大部分成本用于重复处理前面的上下文和许多[MASK] tokens的相同特征表示,表明存在相当大的计算冗余。在本工作中,我们从[MASK] tokens的角度重新审视dLLM的冗余性。通过系统分析,我们验证了[MASK] tokens的冗余性并揭示了它们在提供结构信息中的关键作用。基于这些发现,我们提出了位置保持的[MASK] token压缩和终端感知增强。通过压缩冗余的[MASK]计算,该方法加速了解码,并进一步为受有限输入长度约束的完整序列dLLMs(如LLaDA-8B-Instruct和LLaDA-1.5)提供了自然的上下文折叠式长上下文扩展。此外,对于块dLLMs(如LLaDA2.0-mini),它通过添加受保护的终端[MASK] token来增强生成质量,且无显著开销。

英文摘要

Unlike autoregressive models, which generate one token at a time, diffusion LLMs (dLLMs) denoise a chunk of [MASK] tokens jointly and sample one or more tokens per step; despite enabling parallel decoding, this process incurs substantial computational cost due to the large chunk size of masked tokens. We observe that much of this cost is spent on repeatedly processing the preceding context and many [MASK] tokens with the same feature representations, indicating considerable computational redundancy. In this work, we revisit dLLM redundancy from the perspective of [MASK] tokens. Through systematic analysis, we verify the redundancy of [MASK] tokens while revealing their critical role in providing structural information. Guided by these findings, we propose position-preserving [MASK] token compression and terminal-aware augmentation. By compressing redundant [MASK] computation, this approach accelerates decoding and further provides a natural extension toward context-folding-like long-context scaling under limited input-length constraints for full-sequence dLLMs such as LLaDA-8B-Instruct and LLaDA-1.5. Moreover, for block dLLMs such as LLaDA2.0-mini, it augments the context with a protected terminal [MASK] token to enhance generation quality with negligible overhead.

2605.18163 2026-05-19 cs.AI cs.CL

TRACE: Trajectory Correction from Cross-layer Evidence for Hallucination Reduction

TRACE: 通过跨层证据进行轨迹修正以减少幻觉

Tej Sanibh Ranade

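AI summary This paper proposes TRACE, an algorithm that corrects LLM hallucinations at inference time using cross-layer evidence, requires no training or labels, selects its correction strategy from model-internal evidence, and improves performance on several benchmarks.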

Comments 25 pages, 8 figures, 4 tables

详情
AI中文摘要

幻觉修正并非单向问题。我们表明中间层既不总是比最终层更诚实,也不总是更不可信。然而,幻觉减少通常通过固定干预形式实现:对比一层另一层,引导沿诚实方向,或依赖外部证据。这种框架结构不完整。跨层事实证据并不均匀演变:在某些失败中,内部真实支持存在并随后被压制,而在其他情况下,候选竞争在深度上保持 genuinely 多方向性,因此没有单一符号标量家族通常足够。我们引入TRACE(Trajectory Correction from Cross-layer Evidence for Hallucination Reduction),一种确定性、无训练的算法,在LLM自身前向传递中,通过从每个输入的跨层候选轨迹中推导出修正层和适当的修正操作符,在推理时修正幻觉。在一种冻结超参数设置下,TRACE仅使用模型内部证据,在8个模型家族和3个事实性基准测试中评估为单一通用算法,对15个模型进行评估,提升所有评估单元,产生+12.26 MC1点和+8.65 MC2风格点的平均增益,无退化,增益达到+47.20 MC1和+43.38 MC2风格点。该方法不使用标签、检索、预训练、微调或每模型校准。

英文摘要

Hallucination correction is not a one-direction problem. We show that intermediate layers are neither uniformly more truthful than final layers nor uniformly less trustworthy. Yet hallucination reduction is usually instantiated through one fixed intervention form: contrast one layer against another, steer along a truthfulness direction, or defer to external evidence. This framing is structurally incomplete. Cross-layer factual evidence does not evolve uniformly: in some failures truthful support is present internally and later suppressed, whereas in others candidate competition remains genuinely multi-directional across depth, so no single signed scalar family is generally sufficient. We introduce Trajectory Correction from Cross-layer Evidence for Hallucination Reduction (TRACE), a deterministic, training-free algorithm which corrects hallucinations at inference time by deriving both the corrective layer and the appropriate correction operator from each input's cross-layer candidate trajectory inside the LLM's own forward pass. Under one frozen hyperparameter setting, TRACE selects among scalar reversal, earlier-state recovery, and candidate-space correction using only model-internal evidence. Evaluated as a single universal algorithm across 15 models, 8 model families, and 3 factuality benchmarks, TRACE improves every evaluation cell, yielding mean gains of +12.26 MC1 points and +8.65 MC2-style points with no regressions, with gains reaching +47.20 MC1 and +43.38 MC2-style points. The method uses no labels, retrieval, pretraining, finetuning, or per-model calibration.

2605.18162 2026-05-19 cs.CV cs.AI

Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency

通过几何逻辑一致性实现视觉语言模型中的自演化空间推理

Junming Liu, Yuqi Li, Yifei Sun, Maonan Wang, Piotr Koniusz, Yirong Chen, Ding Wang

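AI summary This paper proposes the SAGE framework, which achieves self-evolving spatial reasoning in vision-language models through geometric and linguistic duality operations, improving robustness and generalization on spatial reasoning tasks.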

Comments 23 pages, 7 figures, 3 tables

详情
AI中文摘要

视觉语言模型(VLMs)在视觉和语言任务上取得了显著进展,但其空间推理能力仍然脆弱:能够正确回答原始输入的模型在面对具有可预测答案映射的配对变换时仍可能失败,揭示了实例级正确性与鲁棒空间推理之间的差距。为此,我们提出空间对齐通过几何演化(SAGE),一种自演化框架,通过几何和语言二元操作在VLMs中强制逻辑一致性。SAGE将二元一致性作为辅助奖励纳入GRPO训练,鼓励模型在原始和变换输入之间产生逻辑一致的答案。一个动态操作池持续探测不一致,促进具有挑战性的操作并淘汰已掌握的操作,使训练聚焦于最有信息量的信号。SAGE具有模型无关性,比先前的GRPO方法更数据高效,并可作为轻量级的后训练阶段应用于任何现有的VLM。在视频和空间推理基准上的实验表明,SAGE在强基线模型上表现一致提升,并增强了对未见数据的泛化能力。

英文摘要

Vision-Language Models (VLMs) have made striking progress, yet their spatial reasoning remains fragile: models that answer an original input correctly can still fail under paired transformations with predictable answer mappings, revealing a gap between instance-level correctness and robust spatial reasoning. To address this, we propose Spatial Alignment via Geometric Evolution (SAGE), a self-evolving framework that enforces logical consistency in VLMs through geometric and linguistic duality operations. SAGE incorporates duality consistency as an auxiliary reward within GRPO training, encouraging models to produce logically coherent answers across original and transformed inputs. A dynamic operation pool continuously probes for inconsistencies, promoting challenging operations and retiring mastered ones, so that training focuses on the most informative signals. SAGE is model-agnostic, data-efficient compared to prior GRPO methods, and can be applied as a lightweight post-training stage to any existing VLM. Experiments on video and spatial reasoning benchmarks demonstrate consistent improvements over strong baselines and enhanced generalization to unseen data.
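
A duality-consistency reward of the kind described above can be sketched for one transformation with a predictable answer mapping. The horizontal flip, the answer vocabulary, the operation names, and the `mastery_threshold` used to retire operations are all hypothetical choices for illustration; the abstract does not list SAGE's actual operations or pool rules.

```python
# Under a horizontal flip, "left" and "right" swap; other answers are unchanged.
FLIP_MAP = {"left": "right", "right": "left"}


def map_answer_under_hflip(answer):
    return FLIP_MAP.get(answer, answer)


def duality_consistency_reward(answer_original, answer_transformed, answer_map):
    """1.0 when the answer on the transformed input matches the mapped original."""
    return 1.0 if answer_transformed == answer_map(answer_original) else 0.0


def update_pool(pool, consistency_rate, mastery_threshold=0.9):
    """Retire operations the model already handles consistently."""
    return [op for op in pool if consistency_rate[op] < mastery_threshold]


consistent = duality_consistency_reward("left", "right", map_answer_under_hflip)
inconsistent = duality_consistency_reward("left", "left", map_answer_under_hflip)
remaining = update_pool(["hflip", "swap_subject"],
                        {"hflip": 0.95, "swap_subject": 0.6})
```

Note that the reward checks agreement between the two answers rather than correctness, so it can be computed without ground-truth labels for the transformed input.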

2605.18156 2026-05-19 cs.CV

Semi-LAR: Semi-supervised Contrastive Learning with Linear Attention for Removal of Nighttime Flares

Semi-LAR: 基于线性注意力的半监督对比学习用于夜间光斑去除

Xiyu Zhu, Wei Wang, Kui Jiang, Zhengguo Li

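AI summary This paper proposes a semi-supervised contrastive learning framework that jointly addresses pseudo-label reliability and representation discrimination, effectively mitigating error accumulation in nighttime flare removal; experiments confirm that the framework is model-agnostic and improves performance.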

详情
AI中文摘要

由于光斑瑕疵的大空间范围及其与场景结构的纠缠,光斑去除具有挑战性,而现有方法严重依赖大规模配对数据。我们提出了一种半监督光斑去除框架,通过联合处理伪标签可靠性与表征歧视性,实现了从未标记图像中稳定学习。我们提出了一种自适应伪标签存储库,通过无参考质量评估、动量更新和无效标签过滤逐步细化伪监督,有效缓解了误差累积。此外,我们提出了一种光斑感知的对比损失,明确将受光斑污染的输入视为负样本,并进行基于块的对比学习,鼓励表征在区分光斑模式的同时保持与可靠伪目标的一致性。在多个光斑基准上的广泛实验表明,所提出的框架具有模型无关性,并且在性能和鲁棒性方面均表现出一致的提升。

英文摘要

Lens flare removal is challenging due to the large spatial extent of flare artifacts and their entanglement with scene structures, while existing methods heavily rely on large-scale paired data. We propose a semi-supervised flare removal framework that enables stable learning from unlabeled images by jointly addressing pseudo-label reliability and representation discrimination. We propose an adaptive pseudo-label repository that progressively refines pseudo supervision through no-reference quality assessment, momentum-based updates, and invalid label filtering, effectively mitigating error accumulation. Moreover, we propose a flare-aware contrastive loss that explicitly treats flare-contaminated inputs as negatives and performs patch-level contrastive learning, encouraging representations that are discriminative against flare patterns while remaining consistent with reliable pseudo targets. Extensive experiments on multiple flare benchmarks demonstrate that the proposed framework is model-agnostic and consistently improves performance and robustness.

2605.18155 2026-05-19 cs.CL

FOL2NS: Generating Natural Sentences from First-Order Logic

FOL2NS:从一阶逻辑生成自然句子

Mei Jia

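AI summary This paper proposes the FOL2NS framework, which combines rule-driven modules with fine-tuned language models to generate synthetic first-order logic formulas and convert them into natural-language expressions, improving the diversity and coverage of generated samples, though semantic precision and natural generation remain challenging as structural complexity increases.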

Comments 11 pages, 8 figures

详情
AI中文摘要

将形式语言翻译成自然语言是自然语言处理中的基础挑战,推动了语义解析、定理验证和问答等下游应用的发展。在本研究中,我们引入了First-Order Logic to Natural Sentence (FOL2NS),一种神经符号框架,旨在生成合成的一阶逻辑公式并将它们转换为自然人类表达。该框架能够处理深度嵌套结构,具有变化的量词深度(QD),这些结构很少被现有语料库捕捉。通过结合规则驱动模块和微调的语言模型,FOL2NS增强了生成样本的多样性和覆盖范围。在我们的实验中,我们通过字符级分析和整体性能指标系统地评估了该框架的能力。实验结果表明,FOL2NS能够可靠地生成正确格式的模板和流畅的陈述,但在结构复杂度增加时面临实现精确语义表示和自然生成的挑战。

英文摘要

Translating formal language into natural language is a foundational challenge in NLP, driving various downstream applications in semantic parsing, theorem validation, and question answering. In this study, we introduce First-Order Logic to Natural Sentence (FOL2NS), a neurosymbolic framework designed to generate synthetic FOL formulas and convert them into natural human expressions. It handles deeply nested structures with varying quantifier depths (QD), which are rarely captured by existing corpora. By combining rule-driven modules with fine-tuned language models, FOL2NS enhances the diversity and coverage of the generated samples. In our experiments, we systematically evaluate the framework's capabilities through both character-level analysis and overall performance metrics. Experimental results show that FOL2NS can reliably produce well-formed templates and fluent statements, but it faces challenges in achieving precise semantic representations and natural generation as structural complexity increases.

2605.16142 2026-05-19 cs.AI cs.LG

Property-Guided LLM Program Synthesis for Planning

基于属性的LLM程序合成用于规划

André G. Pereira, Augusto B. Corrêa, Jendrik Seipp

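AI summary This paper studies property-guided LLM program synthesis, which checks whether candidate programs satisfy formally defined properties to guide the LLM toward higher-quality programs, reducing generation and evaluation costs.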

详情
AI中文摘要

LLMs在程序合成中表现出色,能够发现超越先前解决方案的程序。然而,这些方法依赖于简单的数值评分来指示程序质量,如解决方案的值或通过的测试数量。因为评分无法指导程序为何失败,系统必须生成并评估许多候选程序,希望其中一些成功,从而增加LLM推理和评估成本。我们研究了一种不同的方法:属性引导的LLM程序合成。与评分程序后评估不同,我们检查候选程序是否满足形式定义的属性。当属性被违反时,我们提前停止评估并提供具体的反例,显示程序为何失败。这种反馈显著减少了程序生成的数量和评估成本,并可以指导LLM生成更强大的程序。我们在PDDL规划领域评估了这种方法,要求LLM合成直接启发函数:每个通过严格改进转换可达的状态都有严格改进的后继。具有这种属性的启发函数可使爬山算法直接到达目标状态。反例引导的修复循环生成一个候选程序,检查训练集上的属性,并返回第一个违反属性的案例。我们在十个规划领域上评估了这种方法,并使用分布外测试集。合成的启发函数在几乎所有测试任务中都是直接的,与最佳先前生成方法相比,我们的方法在每个领域平均生成的程序数量少七倍,无需使用搜索即可解决更多任务,并且评估候选人的计算量减少了几个数量级。只要问题允许可验证的属性,属性引导的LLM合成可以降低成本并提高程序质量。

英文摘要

LLMs have shown impressive success in program synthesis, discovering programs that surpass prior solutions. However, these approaches rely on simple numeric scores to signal program quality, such as the value of the solution or the number of passed tests. Because a score offers no guidance on why a program failed, the system must generate and evaluate many candidates hoping some succeed, increasing LLM inference and evaluation costs. We study a different approach: property-guided LLM program synthesis. Instead of scoring programs after evaluation, we check whether a candidate satisfies a formally defined property. When the property is violated, we stop the evaluation early and provide the LLM with a concrete counterexample showing exactly how the program failed. This feedback drastically reduces both the number of program generations and the evaluation cost, and can guide the LLM to generate stronger programs. We evaluate this approach on PDDL planning domains, asking the LLM to synthesize direct heuristic functions: every state reachable by strictly improving transitions has a strictly improving successor. A heuristic with this property leads a hill-climbing algorithm directly to a goal state. A counterexample-guided repair loop generates one candidate program, checks the property over a training set, and returns the first case that violates the property. We evaluate our approach on ten planning domains with an out-of-distribution test set. The synthesized heuristics are effectively direct on virtually all test tasks, and compared to the best prior generation method our approach generates seven times fewer programs per domain on average, solves more tasks without using search, and requires several orders of magnitude less computation to evaluate candidates. Whenever a problem admits a verifiable property, property-guided LLM synthesis can reduce cost and improve program quality.
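
The property check with early stopping can be sketched as a search that follows only strictly improving transitions and returns the first state where none exists. The function name, the explicit exemption of goal states, and the toy integer-line domain are assumptions made for illustration; the paper's checker operates on PDDL tasks and may differ in detail.

```python
def find_counterexample(initial, successors, is_goal, h):
    """Return the first reachable non-goal state with no strictly improving
    successor under heuristic h, or None if the property holds from `initial`."""
    stack, seen = [initial], {initial}
    while stack:
        state = stack.pop()
        if is_goal(state):
            continue                                  # goals need no successor
        improving = [s for s in successors(state) if h(s) < h(state)]
        if not improving:
            return state                              # concrete counterexample
        for s in improving:
            if s not in seen:
                seen.add(s)
                stack.append(s)
    return None


# Toy domain: walk along the integers, goal is 0.
def successors(n):
    return [n - 1, n + 1]


def is_goal(n):
    return n == 0


good = find_counterexample(5, successors, is_goal, h=abs)
bad = find_counterexample(5, successors, is_goal, h=lambda n: abs(n - 3))
```

With `h=abs` the check passes, while the second heuristic bottoms out at state 3, which is not a goal. Returning that state, rather than a score, is what lets a repair loop tell the LLM exactly where the candidate failed.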

2605.16015 2026-05-19 cs.RO cs.LG

Adaptive Outer-Loop Control of Quadrotors via Reinforcement Learning

通过强化学习实现四旋翼机的自适应外环控制

Vishnu Saj, Sushil Vemuri, Dileep Kalathil, Moble Benedict

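AI summary This paper proposes a novel adaptive control architecture that uses reinforcement learning and a Residual Dynamics Predictor to improve quadrotor control under dynamic disturbances; experiments demonstrate more accurate trajectory tracking in real-world settings.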

详情
AI中文摘要

深度强化学习(DRL)在四旋翼飞行器控制中通常依赖于领域随机化(DR)进行仿真到现实的转移,导致过于保守的策略难以应对动态扰动。为了解决这个问题,我们提出了一种新的自适应控制架构,能够主动感知并响应即时扰动。首先,我们训练了一个最优的外环策略,然后用残差动力学预测器(RDP)替代其对地面真实扰动数据的依赖。RDP通过仅使用状态和控制动作的历史数据在线估计飞行器所受的外部力和力矩。为了实现无缝的硬件转移,我们引入了数据高效的线性校准桥和在线推力校正机制,利用仅几秒的飞行数据将模拟的潜在空间与现实对齐。在真实世界中对Crazyflie微型四旋翼的验证表明,我们的自适应控制器在严重不确定性下,包括质量变化、不对称载荷和动态悬挂载荷,均显著优于基线方法,保持了精确的轨迹跟踪性能。

英文摘要

Deep Reinforcement Learning (DRL) for quadrotor flight control typically relies on Domain Randomization (DR) for sim-to-real transfer, resulting in overly conservative policies that struggle with dynamic disturbances. To overcome this, we propose a novel adaptive control architecture that actively perceives and reacts to instantaneous perturbations. First, we train an optimal outer-loop policy, then replace its reliance on ground-truth disturbance data with a Residual Dynamics Predictor (RDP). The RDP estimates online the external forces and moments acting on the aircraft in flight, using only the history of states and control actions. For seamless hardware transfer, we introduce a data-efficient linear calibration bridge and an online thrust correction mechanism that align the simulated latent space with reality using mere seconds of flight data. Real-world validations on a Crazyflie micro-quadrotor demonstrate that our adaptive controller significantly outperforms baselines, maintaining precise trajectory tracking under severe uncertainties including mass variations, asymmetric payloads, and dynamic slung loads.