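arXivDaily: arXiv daily academic digest, updated Monday through Friday

Vision and Robotics

Image Generation

Image generation, text-to-image, image editing, diffusion models, and controllable generation.

Included for today/current date: 3. Signal sources: cs.CV, cs.GR, cs.MM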
2601.12870 2026-06-19 cs.CE version update 75%

Text2Structure3D: Graph-Based Generative Modeling of Equilibrium Structures with Diffusion Transformers

Lazlo Bleker, Zifeng Guo, Kaleb E. Smith, Kam-Ming Mark Tam, Karla Saldaña Ochoa, Pierluigi D'Acunto

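Topic match: Controllable generation. Generates equilibrium structural graphs from text, a form of controllable structure generation.

AI summary: Proposes Text2Structure3D, which combines latent diffusion, a variational graph autoencoder, and graph transformers to generate near-equilibrium structural graphs from natural language prompts, and applies residual force optimization so that the results fully satisfy static equilibrium.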

Journal ref Results in Engineering 31 (2026) 111375

English abstract

This paper presents Text2Structure3D, a graph-based Machine Learning (ML) model that generates equilibrium structures from natural language prompts. Text2Structure3D is designed to support new intuitive ways of design exploration and iteration in the conceptual structural design process. The approach combines latent diffusion with a Variational Graph Auto-Encoder (VGAE) and graph transformers to generate structural graphs that are close to an equilibrium state. Text2Structure3D integrates a residual force optimization post-processing step that ensures generated structures fully satisfy static equilibrium. The model was trained and validated using a cross-typological dataset of funicular form-found and statically determinate bridge structures, paired with text descriptions that capture the formal and structural features of each bridge. Results demonstrate that Text2Structure3D generates equilibrium structures with strong adherence to text-based specifications and greatly improves generalization capabilities compared to parametric model-based approaches. Text2Structure3D represents an early step toward a general-purpose foundation model for structural design, enabling the integration of generative AI into conceptual design workflows.
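
The abstract does not say how the residual force optimization step is implemented. As a rough illustration of the idea only, the sketch below treats a pin-jointed structure with assumed force densities `q` and moves the free nodes along their residual forces until the residuals vanish. The function names, the force-density formulation, the step size, and the toy three-node cable are all assumptions made for this example, not the authors' method.

```python
import numpy as np

def residual_forces(nodes, edges, q, loads):
    """Residual force per node: external load plus the sum of member forces.

    A member (i, j) with force density qk pulls node i toward node j with
    force qk * (x_j - x_i) and pulls node j toward node i with the opposite.
    """
    res = loads.copy()
    for (i, j), qk in zip(edges, q):
        f = qk * (nodes[j] - nodes[i])
        res[i] += f
        res[j] -= f
    return res

def relax(nodes, edges, q, loads, free, step=0.1, iters=500):
    """Shift free nodes along their residuals (hypothetical fixed-step scheme)."""
    x = nodes.astype(float).copy()
    for _ in range(iters):
        r = residual_forces(x, edges, q, loads)
        x[free] += step * r[free]
    return x

# Toy case: two supports, one free node with a downward unit load.
nodes = np.array([[0.0, 0, 0], [1.0, 0, 0.5], [2.0, 0, 0]])
edges = [(0, 1), (1, 2)]
q = [1.0, 1.0]                      # assumed tensile force densities
loads = np.array([[0.0, 0, 0], [0, 0, -1.0], [0, 0, 0]])
free = [1]

x = relax(nodes, edges, q, loads, free)
print(x[1], np.linalg.norm(residual_forces(x, edges, q, loads)[free]))
```

With fixed force densities the equilibrium equations are linear, so a direct solve would also work here; the iterative form is used only because it mirrors the idea of starting from a near-equilibrium graph and driving the residual to zero.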

2601.21081 2026-06-19 cs.CV version update 70%

Shape of Thought: Progressive Object Assembly via Visual Chain-of-Thought

Yu Huo, Siyu Zhang, Kun Zeng, Haoyue Liu, Owen Lee, Junlin Chen, Yuquan Lu, Yifu Guo, Yaodong Liang, Xiaoying Tang

Affiliations: School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen; School of Data Science, The Chinese University of Hong Kong, Shenzhen; Sun Yat-sen University; The Hong Kong University of Science and Technology, Guangzhou; Shenzhen Future Network of Intelligence Institute (FNii-Shenzhen); Guangdong Provincial Key Laboratory of Future Networks of Intelligence, CUHK(SZ)

Topic match: Controllable generation. Compositional structural constraints in text-to-image generation.

AI summary: Proposes the Shape-of-Thought (SoT) framework, which assembles shapes step by step in the rendered 2D domain through a visual chain of thought to address compositional structural constraints in text-to-image generation, and substantially outperforms direct generation on component numeracy and structural topology.

Comments ICML2026

English abstract

Multimodal models for text-to-image generation have achieved strong visual fidelity, yet they remain brittle under compositional structural constraints, notably generative numeracy, attribute binding, and part-level relations. To address these challenges, we propose Shape-of-Thought (SoT), a visual CoT framework for process-supervised progressive shape assembly in the rendered 2D domain, without external engines at inference time. SoT trains a unified multimodal autoregressive model to generate interleaved textual plans and rendered intermediate states, helping the model capture shape-assembly logic without producing explicit geometric representations. Unlike text-only CoT, each decision is grounded in a rendered state, making counts, attachments, topology, and intermediate part-addition errors inspectable across the trajectory. To support this paradigm, we introduce SoT-26K, a large-scale dataset of grounded assembly traces derived from part-based CAD hierarchies, and T2S-CompBench, a benchmark for evaluating structural integrity and trace faithfulness. Fine-tuning on SoT-26K achieves 88.4% on component numeracy and 84.8% on structural topology, outperforming direct generation by +24.2 points on component numeracy and +19.3 points on structural topology. SoT establishes a transparent testbed for rendered-domain structure-aware generation. The code is available at https://github.com/yuhuo03/Shape-of-Thought.
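
The reported scores and margins imply the direct-generation baselines, which the abstract does not state outright. The short calculation below recovers them, assuming the "+points" figures are absolute percentage-point differences; the variable and function names are illustrative.

```python
# Reported SoT scores (%) and margins over direct generation (percentage points).
reported = {
    "component numeracy": (88.4, 24.2),
    "structural topology": (84.8, 19.3),
}

def implied_baseline(score, gain):
    """Direct-generation score implied by score minus gain, rounded to 0.1."""
    return round(score - gain, 1)

baselines = {name: implied_baseline(s, g) for name, (s, g) in reported.items()}
for name, base in baselines.items():
    print(f"{name}: direct generation is about {base}%")
```

Under that assumption, direct generation sits at roughly 64.2% on component numeracy and 65.5% on structural topology.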

2503.01425 2026-06-19 cs.GR cs.CV version update 70%

MeshPad: Interactive Sketch-Conditioned Artist-Reminiscent Mesh Generation and Editing

Haoxuan Li, Ziya Erkoc, Lei Li, Daniele Sirigatti, Vladislav Rosov, Angela Dai, Matthias Nießner

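Affiliations: Technical University of Munich; AUDI AG

Topic match: Controllable generation. Sketch-conditioned generation of 3D meshes, a form of controllable generation.

AI summary: Proposes MeshPad, an interactive method for generating and editing 3D meshes from sketch input. It decomposes edits into deletion and addition of mesh regions and pairs a Transformer with a vertex-aligned speculative prediction strategy for fast iterative editing, improving mesh quality by more than 22% in Chamfer distance and being preferred by 90% of participants in perceptual evaluations.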

Comments Project page: https://derkleineli.github.io/meshpad/ Video: https://www.youtube.com/watch?v=_T6UTGTMZ1E

English abstract

We introduce MeshPad, a generative approach that creates 3D meshes from sketch inputs. Building on recent advances in artist-reminiscent triangle mesh generation, our approach addresses the need for interactive mesh creation. To this end, we focus on enabling consistent edits by decomposing editing into 'deletion' of regions of a mesh, followed by 'addition' of new mesh geometry. Both operations are invoked by simple user edits of a sketch image, facilitating an iterative content creation process and enabling the construction of complex 3D meshes. Our approach is based on a triangle sequence-based mesh representation, exploiting a large Transformer model for mesh triangle addition and deletion. In order to perform edits interactively, we introduce a vertex-aligned speculative prediction strategy on top of our additive mesh generator. This speculator predicts multiple output tokens corresponding to a vertex, thus significantly reducing the computational cost of inference and accelerating the editing process, making it possible to execute each editing step in only a few seconds. Comprehensive experiments demonstrate that MeshPad outperforms state-of-the-art sketch-conditioned mesh generation methods, achieving more than 22% mesh quality improvement in Chamfer distance, and being preferred by 90% of participants in perceptual evaluations.
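
For readers unfamiliar with the metric, the sketch below shows one common form of the Chamfer distance between two point sets, such as points sampled from a generated mesh and a reference mesh. Conventions vary (squared or unsquared distances, sum or mean), and the abstract does not say which variant the authors use, so the definition and the function name here are assumptions. Lower values mean the two shapes are closer.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3).

    Uses squared Euclidean distances and averages the nearest-neighbor
    distance in both directions (one of several common conventions).
    """
    # Pairwise squared distances via broadcasting, shape (N, M).
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

# Two tiny point sets that differ in one point.
a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
b = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 1.0]])

print(chamfer_distance(a, a))  # identical sets give zero
print(chamfer_distance(a, b))
```

The dense pairwise matrix is fine for small samples; larger point clouds would typically use a spatial index for the nearest-neighbor queries.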