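arXivDaily: arXiv daily academic digest, updated Monday through Friday

AI large models

Large vision models / VLM

Vision-language models, visual reasoning, visual question answering, image-text understanding, and visual grounding.

Collected for today/current date: 5. Signal sources: cs.CV, cs.AI, cs.LG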
2605.16385 2026-06-18 cs.CV cs.AI cs.CL Version update Topic 90

Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning

Ruoran Xu, Haoyu Cheng, Bin Dong, Qiufeng Wang

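Affiliations: Xi’an Jiaotong-Liverpool University; Ricoh Software Research Center Beijing Co., Ltd

Topic match: Visual reasoning. Solves solid geometry problems with neural-symbolic reasoning, which involves visual reasoning.

AI summary: Proposes the Hilbert-Geo framework and the Parse2Reason method, which use a conditional description language and a theorem bank to reason rigorously about solid geometry problems, achieving SOTA performance on SolidFGeo2k and MathVerse-Solid.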

Comments Computer Vision and Pattern Recognition (CVPR), 2026

Abstract

Geometric problem solving, as a typical multimodal reasoning problem, has attracted much attention and made great progress recently; however, most works focus on plane geometry and usually fail in solid geometry because of 3D spatial diagrams and complex reasoning. To bridge this gap, we introduce Hilbert-Geo, the first unified formal language framework for solid geometry, including an extensive predicate library and a dedicated theorem bank. Based on this framework, we propose a Parse2Reason method with two steps: first parsing, then reasoning. In the parsing step, we utilize conditional description language (CDL), a formalized language composed of predicates specifically designed to construct geometric conditions, to represent both the problem description (natural text) and the solid diagram (visual image). In the reasoning step, we leverage the formal CDL and the theorem bank to perform relational inference and algebraic computation, generating strictly correct, verifiable, and human-readable reasoning processes. Notably, our proposed Hilbert-Geo is also applicable to plane geometry. To advance geometric reasoning, we curate two expert-annotated datasets, SolidFGeo2k and PlaneFGeo3k, which are furnished with geometric formal language annotations, solutions, and answers. Extensive experiments show that our proposed method achieves state-of-the-art (SOTA) performance of 77.3% on SolidFGeo2k and 84.1% on MathVerse-Solid (a small subset of MathVerse dedicated to solid geometry), substantially outperforming leading MLLMs such as Gemini-2.5-pro (54.2% on SolidFGeo2k) and GPT-5 (62.9% on MathVerse-Solid). In addition, our method achieves SOTA accuracy of 80.2% on PlaneFGeo3k, demonstrating the generality of Hilbert-Geo in geometric reasoning. Our code and datasets are released at https://github.com/PremiLab-Math/Hilbert-Geo.
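
The parse-then-reason idea in this abstract can be illustrated with a toy forward-chaining loop. This is only a sketch under loud assumptions: the quantity names, the two-entry `TOY_THEOREM_BANK`, and the `reason` function are hypothetical and are not the paper's CDL syntax, predicate library, or theorem bank. The parsing step is replaced by a hand-written dictionary of conditions.

```python
import math
from typing import Callable, Dict, List, Tuple

# Known quantities, e.g. {"face_area": 9.0}. A real CDL would use predicates.
Facts = Dict[str, float]
# (theorem name, required quantities, derived quantity, formula): all hypothetical.
Theorem = Tuple[str, Tuple[str, ...], str, Callable[..., float]]

TOY_THEOREM_BANK: List[Theorem] = [
    ("cube_volume", ("edge",), "volume", lambda e: e ** 3),
    ("square_side_from_area", ("face_area",), "edge", lambda a: math.sqrt(a)),
]


def reason(
    facts: Facts, goal: str, bank: List[Theorem] = TOY_THEOREM_BANK
) -> Tuple[float, List[str]]:
    """Forward-chain over the bank until the goal is derived or nothing new fires."""
    known = dict(facts)
    trace: List[str] = []
    progress = True
    while goal not in known and progress:
        progress = False
        for name, inputs, output, formula in bank:
            # Skip theorems that are already applied or not yet applicable.
            if output in known or not all(i in known for i in inputs):
                continue
            known[output] = formula(*(known[i] for i in inputs))
            trace.append(f"{name}: {output} = {known[output]}")
            progress = True
    if goal not in known:
        raise ValueError(f"cannot derive {goal!r} from the given conditions")
    return known[goal], trace


# Hand-written stand-in for parsing: "A cube has a face of area 9. Find its volume."
parsed_conditions = {"face_area": 9.0}
volume, steps = reason(parsed_conditions, "volume")
for step in steps:
    print(step)
```

The recorded trace is what makes each derived value checkable step by step, which is the property the abstract describes as verifiable and human-readable.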

2507.07574 2026-06-18 cs.CV Version update Topic 90

Beyond the Linear Separability Ceiling: Aligning Representations in VLMs

Enrico Vompa, Tanel Tammet, Mohit Vaishnav

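Affiliations: Applied Artificial Intelligence Group; Tallinn University of Technology

Topic match: Visual reasoning. Diagnoses the linear separability ceiling of VLMs on abstract reasoning tasks.

AI summary: Proposes the Linear Separability Ceiling (LSC) diagnostic framework, finds an alignment gap in VLMs, and reshapes the visual manifold with a contrastive objective so that models significantly surpass the LSC on abstract compositional reasoning tasks.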

Comments Accepted TMLR

Abstract

A challenge in advancing Visual-Language Models (VLMs) is determining whether their failures on abstract reasoning tasks, such as Bongard problems, stem from flawed perception or faulty top-down reasoning. To disentangle these factors, we introduce a diagnostic framework centered on the Linear Separability Ceiling (LSC), the performance achievable by a linear classifier on a VLM's raw visual embeddings. Applying this framework to state-of-the-art VLMs, we uncover a pervasive "alignment gap", where most models fail to generatively outperform the linear separability of their representations. We find that the few models surpassing this ceiling do so via two mechanisms: by further refining visual representations into a more linearly separable format or by executing non-linear decision logic. We demonstrate that this bottleneck is not a fundamental limitation but a solvable visual alignment issue. Our method augments standard next-token prediction with a contrastive objective to restructure the visual manifold into a more one-dimensionally linear geometry, improving image-to-image comparison and enabling models to significantly surpass the LSC on abstract compositional reasoning tasks.
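
The LSC described above is the accuracy of a linear classifier on frozen visual embeddings. A minimal sketch of that measurement follows, assuming a plain perceptron as the linear classifier and made-up 2-D vectors in place of real VLM embeddings. The abstract does not say which linear classifier the authors use, and the `generative_accuracy` value below is invented purely for illustration.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def train_perceptron(
    X: List[Vector], y: List[int], epochs: int = 50
) -> Tuple[List[float], float]:
    """Fit a linear decision rule w.x + b on labels in {-1, +1}."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, t in zip(X, y):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if t * score <= 0:  # misclassified: nudge the boundary toward x
                w = [wi + t * xi for wi, xi in zip(w, x)]
                b += t
                errors += 1
        if errors == 0:
            break
    return w, b


def accuracy(w: List[float], b: float, X: List[Vector], y: List[int]) -> float:
    """Fraction of points whose sign of w.x + b matches the label."""
    hits = 0
    for x, t in zip(X, y):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        hits += (1 if score > 0 else -1) == t
    return hits / len(X)


# Toy "embeddings" (hypothetical): positives vs. negatives of some visual concept.
train_X = [(2.0, 1.0), (3.0, 2.0), (2.5, 1.5), (-1.0, -1.0), (-2.0, -0.5), (-1.5, -2.0)]
train_y = [1, 1, 1, -1, -1, -1]
test_X = [(2.2, 0.8), (-1.2, -1.1)]
test_y = [1, -1]

w, b = train_perceptron(train_X, train_y)
lsc = accuracy(w, b, test_X, test_y)      # linear probe accuracy on held-out points
generative_accuracy = 0.5                 # invented number, for illustration only
alignment_gap = lsc - generative_accuracy # positive: generation trails the probe
print(lsc, alignment_gap)
```

A positive gap in this sketch corresponds to the case the abstract calls the alignment gap: the embeddings already separate the classes linearly, yet the model's generated answers do worse.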

2606.05409 2026-06-18 cs.CV cs.CL Version update Topic 85

Would you still call this Dax? Novel Visual References in VLMs and Humans

Ada Defne Tür, Gaurav Kamath, Joyce Chai, Siva Reddy, Benno Krojer

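Affiliations: McGill University; Mila Quebec AI Institute; University of Michigan - Ann Arbor; Canada CIFAR AI Chair

Topic match: Visual reasoning. Compares how VLMs and humans generalize to novel visual concepts.

AI summary: Introduces the Novel Visual References Dataset (NVRD) and compares how VLMs and humans generalize to novel visual concepts, finding that models struggle to acquire new concepts that contradict prior knowledge and that they overgeneralize.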

Abstract

Vision-language models (VLMs), like human learners, are frequently exposed to new visual concepts, but how they map novel visual references to language after exposure remains largely underexplored, particularly when those references contradict prior knowledge from pre-training. To study this, we present the Novel Visual References Dataset (NVRD): 19,176 images spanning 90 visual concepts across different levels of visual novelty, each with up to 20 increasingly perturbed versions of the original object to probe generalization. Unlike prior work on visual augmentations of familiar concepts, NVRD comprises entirely novel, open-ended stimuli constructed from scratch, mirroring how humans encounter genuinely new concepts. We evaluate 3 open- and 2 closed-source models alongside 2,400 human judgments for direct human-model comparison, and find that (i) models struggle to acquire novel concepts in-context when they contradict prior knowledge, and (ii) while models and humans show correlated sensitivity to visual perturbations, models significantly overgeneralize, extending learned labels to stimuli that humans reject. We contribute NVRD as a corpus and benchmark for research on visual concept learning in both humans and machines.

2606.02800 2026-06-18 cs.CV cs.AI cs.LG cs.MM cs.RO Version update Topic 80

Cosmos 3: Omnimodal World Models for Physical AI

NVIDIA: Aditi, Niket Agarwal, Arslan Ali, Jon Allen, Martin Antolini, Adeline Aubame, Alisson Azzolini, Junjie Bai, Maciej Bala, Yogesh Balaji, Josh Bapst, Aarti Basant, Mukesh Beladiya, Mohammad Qazim Bhat, Zaid Pervaiz Bhat, Dan Blick, Vanni Brighella, Han Cai, Tiffany Cai, Eric Cameracci, Jiaxin Cao, Yulong Cao, Mark Carlson, Carlos Casanova, Ting-Yun Chang, Yan Chang, Yu-Wei Chao, Prithvijit Chattopadhyay, Roshan Chaudhari, Chieh-Yun Chen, Junyu Chen, Ke Chen, Qizhi Chen, Wenkai Chen, Xiaotong Chen, Yu Chen, An-Chieh Cheng, Click Cheng, Xiu Chia, Jeana Choi, Chaeyeon Chung, Wenyan Cong, Yin Cui, Magdalena Dadela, Nalin Dadhich, Wenliang Dai, Joyjit Daw, Alperen Degirmenci, Rodrigo Vieira Del Monte, Robert Denomme, Sameer Dharur, Marco Di Lucca, Ke Ding, Wenhao Ding, Yifan Ding, Yuzhu Dong, Nicole Drumheller, Yilun Du, Aigul Dzhumamuratova, Aleksandr Efitorov, Hamid Eghbalzadeh, Naomi Eigbe, Imad El Hanafi, Hassan Eslami, Benedikt Falk, Jiaojiao Fan, Jim Fan, Amol Fasale, Sergiy Fefilatyev, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Vikram Fugro, Prashant Gaikwad, TJ Galda, Katelyn Gao, Yihuai Gao, Wenhang Ge, Sreyan Ghosh, Arushi Goel, Vivek Goel, Akash Gokul, Rama Govindaraju, Jinwei Gu, Miguel Guerrero, Elfie Guo, Aryaman Gupta, Siddharth Gururani, Hugo Hadfield, Song Han, Ankur Handa, Zekun Hao, Mohammad Harrim, Ali Hassani, Nathan Hayes-Roth, Yufan He, Chris Helvig, Cyrus Hogg, Madison Huang, Michael Huang, Sophia Huang, Yufan Huang, Jacob Huffman, DeLesley Hutchins, Suneel Indupuru, Boris Ivanovic, Arihant Jain, Joel Jang, Ryan Ji, Yanan Jian, Dongfu Jiang, Jingyi Jin, Atharva Joshi, Nikhilesh Joshi, Pranjali Joshi, Andy Ju, Jaehun Jung, Weiwei Kang, Scott Kassekert, Jan Kautz, Ashna Khetan, Julia Kiczka, Slawek Kierat, Gwanghyun Kim, Kuno Kim, Sunny Kim, Kezhi Kong, Xin Kong, Zhifeng Kong, Tomasz Kornuta, Egor Krivov, Hui Kuang, Saurav Kumar, Chia-Wen Kuo, George Kurian, Wojciech Kutak, JF Lafleche, Himangshu Lahkar, Omar Laymoun, Jayjun Lee, Sanggil Lee, Gabriele Leone, Boyi Li, Freya Li, Jiajun Li, Jinfeng Li, Ling Li, Pengcheng Li, Shangru Li, Tingle Li, Xiaolong Li, Xuan Li, Zhaoshuo Li, Zhiqi Li, Hao Liang, Maosheng Liao, Chen-Hsuan Lin, Tsung-Yi Lin, Ming-Yu Liu, Sifei Liu, Zihan Liu, Hai Loc Lu, Xiangyu Lu, Alice Luo, Ruipu Luo, Wenjie Luo, Jiangran Lyu, Martin Ding Ma, Nic Ma, Qianli Ma, Dawid Majchrowski, Louis Marcoux, Miguel Martin, Qing Miao, Ashkan Mirzaei, Shreyas Misra, Kaichun Mo, Durra Mohsin, Hyejin Moon, Pawel Morkisz, Saeid Motiian, Kirill Motkov, Seungjun Nah, Yashraj Narang, Deepak Narayanan, Thabang Ngazimbi, Julian Ouyang, Shubham Pachori, David Page, Yatian Pang, Sehwi Park, Mahesh Patekar, Mostofa Patwary, Marco Pavone, Trung Pham, Wei Ping, Soha Pouya, Shrimai Prabhumoye, Varun Praveen, Delin Qu, Hesam Rabeti, Morteza Ramezanali, Marilyn Reeb, Xuanchi Ren, Kristen Rumley, Wojciech Rymer, Jun Saito, Yeongho Seol, John Shao, Piyush Shekdar, Tianwei Shen, Humphrey Shi, Min Shi, Stella Shi, Kevin Shih, Mohammad Shoeybi, Mateusz Sieniawski, Shuran Song, Alexander Sotelo, Amir Sotoodeh, Sunil Srinivasa, Vignesh Srinivasakumar, Bartosz Stefaniak, Rahul Heinrich Steiger, Shangkun Sun, Jiaxiang Tang, Shitao Tang, Yangyang Tang, Yue Tang, Tolou Tavakkoli, Kayley Ting, Krzysztof Tomala, Wei-Cheng Tseng, Jibin Varghese, Sergei Vasilev, Thomas Volk, Raju Wagwani, Roger Waleffe, Andrew Z. Wang, Boxiang Wang, Haoxiang Wang, Qiao Wang, Shihao Wang, Shijie Wang, Ting-Chun Wang, Yan Wang, Yu Wang, Rohit Watve, David Wehr, Fangyin Wei, Xinshuo Weng, Jay Zhangjie Wu, Kedi Wu, Hongchi Xia, Summer Xiao, Tianjun Xiao, Kevin Xie, Daguang Xu, Jiashu Xu, Mengyao Xu, Ruqing Xu, Xingqian Xu, Yao Xu, Dinghao Yang, Dong Yang, Hans Yang, Xiaodong Yang, Xuning Yang, Yichu Yang, Yurong You, Zhiding Yu, Hao Yuan, Simon Yuen, Xiaohui Zeng, Pengcuo Zeren, Cindy Zha, Haotian Zhang, Jenny Zhang, Jing Zhang, Liangkai Zhang, Paris Zhang, Shun Zhang, Xuanmeng Zhang, Zhizheng Zhang, Ann Zhao, Yilin Zhao, Yuliya Zhautouskaya, Charles Zhou, Fengzhe Zhou, Shilin Zhu, Yuke Zhu, Dima Zhylko, Artur Zolkowski

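Affiliations: NVIDIA

Topic match: Visual reasoning. Vision-language models; understanding and generation tasks.

AI summary: Proposes Cosmos 3, an omnimodal world model built on a unified mixture-of-transformers architecture that jointly processes language, image, video, audio, and action sequences, sets a new state of the art on understanding and generation tasks, and offers a scalable, general-purpose backbone for embodied agents.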

Abstract

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI -- effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a new state-of-the-art across a diverse suite of understanding and generation tasks, demonstrating omnimodal world models as scalable, general-purpose backbones for embodied agents. Our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW-1.1 License at https://github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3. The project website is available at https://research.nvidia.com/labs/cosmos-lab/cosmos3.

2601.19792 2026-06-18 cs.CL cs.AI cs.HC Version update Topic 70

LVLMs and Humans Ground Differently in Referential Communication

Peter Zeng, Weiling Li, Amie J. Paige, Zhengxiang Wang, Panagiotis Kaliosis, Dimitris Samaras, Gregory Zelinsky, Susan E. Brennan, Owen Rambow

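Topic match: Visual reasoning. Studies how LVLMs ground in referential communication.

AI summary: Through multi-turn referential communication experiments pairing humans and AI, finds that LVLMs cannot use common ground to generate and resolve referring expressions as humans do, which hinders smooth communication.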

Comments 27 pages, 16 figures

Abstract

For generative AI agents to partner effectively with human users, the ability to accurately predict human intent is critical. But this ability to collaborate remains limited by a critical deficit: an inability to model common ground. We present a referential communication experiment with a factorial design involving director-matcher pairs (human-human, human-AI, AI-human, and AI-AI) that interact with multiple turns in repeated rounds to match pictures of objects not associated with any obvious lexicalized labels. We show that LVLMs cannot interactively generate and resolve referring expressions in a way that enables smooth communication, a crucial skill that underlies human language use. We release our corpus of 356 dialogues (89 pairs over 4 rounds each) along with the online pipeline for data collection and the tools for analyzing accuracy, efficiency, and lexical overlap.
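
The abstract mentions tools for analyzing lexical overlap but does not define the metric. One common choice, shown here purely as an assumption and not as the authors' tool, is the Jaccard overlap between the word sets of a director's descriptions of the same picture in two rounds. The example utterances are invented; the final line only checks the corpus size stated in the abstract.

```python
import re
from typing import Set


def tokens(text: str) -> Set[str]:
    """Lowercased word set of an utterance (simple regex tokenizer)."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard_overlap(utterance_a: str, utterance_b: str) -> float:
    """Shared words divided by all distinct words across both utterances."""
    a, b = tokens(utterance_a), tokens(utterance_b)
    if not a and not b:
        return 1.0  # two empty utterances are treated as identical
    return len(a & b) / len(a | b)


# Invented descriptions of the same picture in an early and a later round.
round_1 = "the one that looks like a dancing ice skater"
round_2 = "the ice skater"
overlap = jaccard_overlap(round_1, round_2)
print(round(overlap, 3))

# Corpus size as stated in the abstract: 89 pairs over 4 rounds each.
total_dialogues = 89 * 4
print(total_dialogues)
```

Under this assumed metric, a shortened later description that reuses earlier words shares all of its own words with the longer one, which is one way to quantify reuse of established referring expressions across rounds.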