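arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering and systems.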

2604.26498 2026-06-09 cs.LG q-bio.QM Version update

Do Larger Models Really Win in Drug Discovery? A Benchmark Assessment of Model Scaling in AI-Driven Molecular Property and Activity Prediction

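Jinjiang Guo, Sheng Ding

Affiliations: Global Health Drug Discovery Institute; School of Pharmaceutical Sciences

AI summary: Evaluating 26 ADME, toxicity and bioactivity endpoints, the paper finds that classical machine learning takes the largest share of best-performing entries, that larger models are competitive only on some of the harder splits, and that performance depends on the fit among model, task and validation scenario rather than on scale alone.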

Comments Improved benchmark design and reproducibility, replaced restricted datasets with public benchmarks in primary analyses, and added sensitivity analyses supporting the interpretation of model scaling and evaluation protocol effects in molecular prediction

Abstract

The rapid growth of molecular foundation models and large language models (LLMs) has encouraged a scale-centred view of AI in drug discovery, in which larger pretrained models are expected to supersede compact cheminformatics models. We test this assumption across 26 ADME, toxicity and bioactivity endpoints, covering 165,541 endpoint-level compound label records. The benchmark contains 78 endpoint and split entries evaluated under random, Murcko scaffold and structure-separated 5-fold cross-validation protocols, representing increasing chemical generalization difficulty. Across 156 task and metric comparisons, classical machine learning (ML) provides the largest share of best-performing entries (47.4%), followed by pretrained molecular sequence models (28.8%), graph neural networks (21.8%) and LLM-based SAR baselines (1.9%). Classical ML dominates random-split interpolation and remains the largest winner family overall. GNN and sequence models are competitive in selected harder splits, but their strict winner shares decrease under a fixed final-window readout, indicating sensitivity to training settings and model selection. Paired bootstrap analyses show that small numerical differences between individual models should not be read as decisive victories. SAR knowledge from training folds improves GPT5.5-SAR and Opus4.7-SAR metrics but does not make rule-based reasoning a universal substitute for supervised predictors. Compact specialized models remain highly effective, and predictive performance depends on the fit among model, task and validation scenario, not on scale alone.
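
The abstract's point about paired bootstrap analyses can be illustrated with a minimal sketch. It assumes a metric that averages per-compound scores, so that resampling compounds amounts to resampling per-compound score differences. The function name, the scores and the 95% interval are hypothetical choices for illustration, not the paper's protocol.

```python
import random


def paired_bootstrap_diff(scores_a, scores_b, n_resamples=2000, seed=0):
    """Return (observed mean difference A - B, lower bound, upper bound).

    The bounds are the 2.5th and 97.5th percentiles of the resampled mean
    differences (a simple percentile interval; an assumption of this sketch).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equally long score lists")
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / n
    resampled = []
    for _ in range(n_resamples):
        # Drawing from the per-compound differences keeps each A/B pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled.append(sum(sample) / n)
    resampled.sort()
    lower = resampled[int(0.025 * n_resamples)]
    upper = resampled[int(0.975 * n_resamples) - 1]
    return observed, lower, upper


# Hypothetical per-compound scores for two models on the same test compounds.
model_a = [0.81, 0.64, 0.90, 0.55, 0.72, 0.68, 0.77, 0.59]
model_b = [0.79, 0.66, 0.88, 0.57, 0.70, 0.69, 0.75, 0.60]
observed, lower, upper = paired_bootstrap_diff(model_a, model_b)
# A "win" is only treated as decisive when the interval excludes zero.
decisive = lower > 0 or upper < 0
```

With these made-up scores model A is ahead on average, yet the interval straddles zero, which is the situation the abstract warns against reading as a victory.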

2503.01125 2026-06-09 cs.RO Version update

TACO: General Acrobatic Flight Control via Target-and-Command-Oriented Reinforcement Learning

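Zikang Yin, Canlun Zheng, Shiliang Guo, Zhikun Wang, Shiyu Zhao

Affiliations: College of Computer Science and Technology, Zhejiang University; WINDY Lab, Department of Artificial Intelligence, Westlake University

AI summary: The paper proposes TACO, a target-and-command-oriented reinforcement learning framework that handles different acrobatic flight tasks in a unified way and supports online parameter adjustment. It uses spectral normalization to improve the smoothness and symmetry of the policy, and demonstrates high-speed circular flight and continuous multi-flip maneuvers.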

Comments For the experiment video, please refer to https://youtu.be/x1v7nD2iHIk

2605.14211 2026-06-09 cs.AI cs.LG Version update

ASH: Agents that Self-Hone via Embodied Learning

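Benjamin Schneider, Xavier Schneider, Victor Zhong, Sun Sun

Affiliations: University of Waterloo; National Research Council Canada

AI summary: Proposes ASH, a system that learns an embodied policy from unlabeled internet video using a self-improvement loop and an inverse dynamics model, and substantially outperforms baselines on long-horizon tasks.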

Comments Published as a workshop paper at ICML 2026 Workshop on Scalable Learning and Optimization for Efficient Multimodal AI Agents

Abstract

Long-horizon embodied tasks remain a fundamental challenge in AI, as current methods rely on hand-engineered rewards or action-labeled demonstrations, neither of which scales. We introduce ASH, an agentic system that learns an embodied policy from unlabeled, noisy internet video, without reward shaping or expert annotation. ASH follows a self-improvement loop; when it gets stuck, ASH learns an Inverse Dynamics Model (IDM) from its own trajectories, and uses its IDM to extract supervision from relevant internet video. ASH uses unsupervised learning to identify key moments from large-scale internet video and retains them as long-term memory -- allowing it to tackle long-horizon problems. We evaluate ASH on two complementary environments demanding multi-hour planning: Pokemon Emerald, a turn-based RPG, and The Legend of Zelda: The Minish Cap, a real-time action-adventure game. In both games, behavioral cloning, retrieval-augmented and zero-shot foundation-model baselines plateau, while ASH sustains progression across our 8-hour evaluation. ASH reaches an average of $11.2/12$ milestones in Pokemon Emerald and $9.9/12$ in Legend of Zelda, while the strongest baseline gets stuck in both environments at an average of $6.5/12$ and $6.0/12$ milestones, respectively. We demonstrate that self-improving agents are a scalable recipe for long-horizon embodied learning.

2605.13768 2026-06-09 cs.LG cs.AI cs.IT math.IT Version update

High-Rate Quantized Matrix Multiplication II

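Or Ordentlich, Yury Polyanskiy

Affiliations: Hebrew University of Jerusalem; MIT

AI summary: The paper studies high-rate quantized matrix multiplication when the covariance matrix of the columns of the second factor is known, uses waterfilling to improve LLM quantization methods, and characterizes how close the WaterSIC scheme comes to the information-theoretic limit.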

Abstract

This is the second part of the work investigating quantized matrix multiplication (MatMul). In part I we considered the case of calibration-free quantization, whereas here we discuss the setting where the covariance matrix $Σ_X$ of the columns of the second factor is available. This setting arises in the ubiquitous task of weight-only post-training quantization of LLMs. Weight-only quantization is related to the problem of weighted mean squared error (WMSE) source coding, whose classical (reverse) waterfilling solution dictates how one should distribute rate between coordinates of the vector. We show how waterfilling can be used to improve practical LLM quantization algorithms (GPTQ), which at present allocate rate equally. A recent scheme (known as "WaterSIC") that only uses scalar INT quantizers is analyzed and its high-rate performance is shown to be (a) basis free (i.e., characterized by the determinant of $Σ_X$ and, thus, unlike existing schemes, is immune to applying random rotations); and (b) within a multiplicative factor of $\frac{2πe}{12}$ (or 0.25 bit/entry) of the information-theoretic distortion limit. GPTQ's performance, in turn, is affected by the choice of basis, but for a random rotation and actual $Σ_X$ from Llama-3-8B we find it to be within 0.1 bit (depending on the layer type) of WaterSIC, suggesting that GPTQ with random rotation is also near optimal, at least in the high-rate regime.
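
As a rough illustration of the classical reverse waterfilling the abstract refers to, the sketch below uses the textbook setting of independent Gaussian components under plain MSE. It is not the paper's WMSE formulation or the WaterSIC scheme, and the function name and example variances are hypothetical. Each component receives distortion min(θ, σ²), with the water level θ chosen so that the distortions sum to the target. The last line converts the 2πe/12 distortion factor into the rate gap of about 0.25 bit per entry quoted in the abstract.

```python
import math


def reverse_waterfill(variances, total_distortion, iters=100):
    """Return (water level theta, per-component rates in bits).

    Textbook reverse waterfilling: component i gets distortion
    min(theta, variance_i) and rate max(0, 0.5 * log2(variance_i / theta)).
    """
    if not 0 < total_distortion <= sum(variances):
        raise ValueError("total distortion must lie in (0, sum of variances]")
    lo, hi = 0.0, max(variances)
    for _ in range(iters):
        theta = (lo + hi) / 2
        # The achieved distortion grows monotonically with theta, so bisect.
        if sum(min(theta, v) for v in variances) < total_distortion:
            lo = theta
        else:
            hi = theta
    theta = (lo + hi) / 2
    rates = [max(0.0, 0.5 * math.log2(v / theta)) for v in variances]
    return theta, rates


# Hypothetical component variances; low-variance components get less rate.
theta, rates = reverse_waterfill([4.0, 1.0, 0.25], total_distortion=0.75)

# Rate penalty implied by a distortion factor of 2*pi*e/12 at high rate.
gap_bits = 0.5 * math.log2(2 * math.pi * math.e / 12)
```

In this toy case the component with the smallest variance receives no rate at all, which is the unequal allocation that distinguishes waterfilling from spending the same number of bits everywhere.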

2605.11212 2026-06-09 cs.CL Version update

ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

Amirhossein Abaskohi, Yuhang He, Peter West, Giuseppe Carenini, Pranit Chawla, Vibhav Vineet

Affiliations: University of British Columbia; Microsoft Research

AI summary: ReVision removes redundant visual patches, cutting token usage while raising success rate, so that agents can process longer trajectories.

Abstract

Computer-use agents (CUAs) rely on visual observations of graphical user interfaces, where each screenshot is encoded into a large number of visual tokens. As interaction trajectories grow, the token cost increases rapidly, limiting the amount of history that can be incorporated under fixed context and compute budgets. As a result, unlike in other domains, using history has yielded little or no performance improvement. We address this inefficiency by introducing ReVision, which is used to train multimodal language models on trajectories where redundant visual patches are removed using a learned patch selector that compares patch representations across consecutive screenshots while preserving the spatial structure required by the model. Across three benchmarks, OSWorld, WebTailBench, and AgentNetBench, when processing trajectories with 5 history screenshots using Qwen2.5-VL-7B, ReVision reduces token usage by 46% on average while improving success rate by 3% over the no-drop baseline. This establishes a clear efficiency gain, enabling agents to process longer trajectories with fewer tokens. With this improved efficiency, we revisit the role of history in CUAs and find that performance continues to improve as more past observations are incorporated when redundancy is removed.

2605.12213 2026-06-09 cs.AI Version update

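Engagement Process: Rethinking the Temporal Interface between Actions and Observations

Jialian Li, Yuchen Cao, Junhong Liu, Weiran Guo, Xutao Wang, Jiaming Song, Jiahao Zhang, Jie Chen

Affiliations: University of Science and Technology of China

AI summary: The paper proposes the Engagement Process (EP), a formalism with an explicit temporal interface for interactions in which actions and observations unfold over different time scales. It supports multi-rate coordination and composition of subsystems, exposes hidden temporal behaviors, and lets policies adapt to explicit time costs.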

Abstract

Task completion in digital and physical environments increasingly involves complex temporal interaction, where actions and observations unfold over different time scales rather than align with fixed observation-action steps. To model such interactions, we propose Engagement Process (EP), an interaction formalism that inherits the decision-theoretic structure of POMDPs while making time explicit in the action-observation interface. EP represents actions and observations as decoupled event streams along time, rather than updates paired at fixed decision steps. This interface captures single-agent timing issues such as deliberation latency, delayed feedback, and persistent actions, while supporting richer agent-side organization, multi-rate coordination, and compositional interaction among subsystems. Across toy, LLM-agent, and learning experiments, EP exposes temporal behaviors hidden by step-based interfaces and enables policies to adapt under explicit time costs.

2601.23286 2026-06-09 cs.CV cs.AI cs.LG Version update

VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation

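Hongyang Du, Junjie Ye, Xiaoyan Cong, Runhao Li, Jingcheng Ni, Aman Agarwal, Zeqi Zhou, Zekun Li, Randall Balestriero, Yue Wang

Affiliations: University of California, Berkeley

AI summary: VideoGPA distills geometry priors to improve the 3D consistency of video generation, using a data-efficient self-supervised framework to guide video diffusion models, and markedly improves temporal stability, geometric plausibility and motion coherence.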

Comments 8 pages, 5 figures, ICML 2026

2605.10376 2026-06-09 cs.CV Version update

PACE: Post-Causal Entropy Modeling for Learned LiDAR Point Cloud Compression

Jiahao Zhu, Kang You, Dandan Ding, Zhan Ma

Affiliations: School of Information Science and Technology, Hangzhou Normal University, Hangzhou, China; School of Electronic Science and Engineering, Nanjing University, Nanjing, China

AI summary: PACE improves LiDAR point cloud compression with a non-causal backbone and a lightweight predictor, cutting decoding latency by over 90% while achieving BD-BR savings.

Abstract

LiDAR point cloud compression is vital for autonomous systems to handle massive data from high-resolution sensors. While learned entropy modeling built upon octree structures yields high compression gains, it faces two critical bottlenecks: 1) prohibitive latency, particularly during decoding, caused by causal, multi-stage context modeling; and 2) a rigid performance-latency trade-off, preventing a single model from adapting to varying constraints. These limitations stem from the tight coupling between the context aggregation backbone and probability prediction. To address this, we propose PACE, a new framework that reformulates ancestral context aggregation as a non-causal backbone and confines causality to a lightweight, stage-scalable predictor, eliminating repetitive backbone executions and reducing computational overhead. The predictor supports an arbitrary number of prediction stages, enabling seamless adaptation across diverse performance-latency trade-offs without reloading parameters. Experiments demonstrate that PACE sets a new state-of-the-art in compression efficiency, achieving notable BD-BR savings and reducing decoding latency by over 90% in autoregressive mode, making it attractive for practical applications.

2605.01171 2026-06-09 cs.CV cs.LG Version update

CADFit: Precise Mesh-to-CAD Program Generation with Hybrid Optimization

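Ghadi Nehme, Eamon Whalen, Faez Ahmed

Affiliations: Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

AI summary: Proposes CADFit, a framework that recovers complex, editable CAD construction sequences from meshes by incrementally fitting and validating parametric operations with geometric feedback. It outperforms existing methods on multiple benchmarks and substantially lowers the Invalid Ratio.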

Abstract

Despite recent progress, recovering parametric CAD construction sequences from geometric input, such as meshes or point clouds, remains a key challenge for design and manufacturing, as existing CAD reconstruction and generation methods are largely restricted either to difficult-to-edit formats such as meshes or B-reps, or to editable but simple sketch-and-extrude pipelines and low-complexity datasets. We introduce CADFit, a hybrid optimization-based CAD reconstruction framework that recovers complex, editable CAD construction sequences from meshes by incrementally fitting and validating parametric operations using geometric feedback. Our approach is distinguished by formulating reconstruction as an IoU-driven optimization over structured CAD programs and supporting a rich set of operations, including extrusions, revolutions, fillets, and chamfers. Experiments on multiple CAD benchmarks show that CADFit outperforms state-of-the-art mesh-to-CAD methods in volumetric Intersection-over-Union and Chamfer Distance, while substantially reducing the Invalid Ratio of reconstructed CAD programs, particularly for complex designs. We further present a multimodal pipeline that enables end-to-end reconstruction of CAD construction sequences from images by combining image-based geometry reconstruction with CADFit. By enabling accurate reconstruction of higher-complexity CAD models, CADFit provides a practical foundation for generating richer datasets and advancing future learning-based approaches to CAD reverse engineering. The code is available at: https://github.com/ghadinehme/CADFit.
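
The two reconstruction metrics named in the abstract can be sketched on toy data. This is a generic illustration, not code from the CADFit repository. It assumes shapes have already been voxelized into integer grid cells for IoU and sampled into surface points for Chamfer Distance. The Chamfer variant shown (sum of the two mean squared nearest-neighbour distances) is one common convention among several, and all names are hypothetical.

```python
import math


def voxel_iou(voxels_a, voxels_b):
    """Volumetric IoU of two shapes given as collections of occupied cells."""
    a, b = set(voxels_a), set(voxels_b)
    union = a | b
    # Two empty shapes are treated as identical (an assumption of this sketch).
    return len(a & b) / len(union) if union else 1.0


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance using mean squared nearest-neighbour distance."""
    def mean_nearest_sq(src, dst):
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)

    return mean_nearest_sq(points_a, points_b) + mean_nearest_sq(points_b, points_a)


# Toy shapes: two 2-cell bars that overlap in exactly one cell.
target = [(0, 0, 0), (1, 0, 0)]
candidate = [(1, 0, 0), (2, 0, 0)]
iou = voxel_iou(target, candidate)
cd = chamfer_distance(target, candidate)
```

An IoU-driven fit in this spirit would keep a candidate operation only if it raises the IoU against the target, though how CADFit scores and validates operations is specified in the paper and repository rather than here.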

2506.20588 2026-06-09 cs.CV Version update

TRIM: A Self-Supervised Video Summarization Framework Maximizing Temporal Relative Information and Representativeness

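Pritam Mishra, Coloma Ballester, Dimosthenis Karatzas

Affiliations: Pompeu Fabra University; Universitat Autònoma de Barcelona

AI summary: TRIM achieves efficient video summarization through self-supervised learning without attention or other heavy components, outperforms existing unsupervised methods, and challenges the reliance on complex architectures.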

Abstract

The increasing ubiquity of video content and the corresponding demand for efficient access to meaningful information have elevated video summarization and video highlights as a vital research area. However, many state-of-the-art methods depend heavily either on supervised annotations or on attention-based models, which are computationally expensive and brittle in the face of distribution shifts that hinder cross-domain applicability across datasets. We introduce a pioneering self-supervised video summarization model that captures both spatial and temporal dependencies without the overhead of attention, RNNs, or transformers. Our framework integrates a novel set of Markov process-driven loss metrics and a two-stage self-supervised learning paradigm that ensures both performance and efficiency. Our approach achieves state-of-the-art performance on the SumMe and TVSum datasets, outperforming all existing unsupervised methods. It also rivals the best supervised models, demonstrating the potential for efficient, annotation-free architectures. This paves the way for more generalizable video summarization techniques and challenges the prevailing reliance on complex architectures.

2605.00647 2026-06-09 cs.LG Version update

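Simple Self-Conditioning Adaptation for Masked Diffusion Models

Michael Cardei, Huu Binh Ta, Ferdinando Fioretto

Affiliations: University of Virginia

AI summary: The paper proposes a simple, effective post-training adaptation that conditions masked diffusion models on their own previous predictions, lowering generative perplexity and improving image synthesis and molecule generation quality.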

Abstract

Masked diffusion models (MDMs) generate discrete sequences by iterative denoising under an absorbing masking process. In standard masked diffusion, if a token remains masked after a reverse update, the model discards its clean-state prediction for that position. Thus, still-masked positions must be repeatedly inferred from the mask token alone. This design choice limits cross-step refinement. To address this limitation, this paper proposes a simple, yet effective, post-training adaptation for MDMs that conditions each denoising step on the model's own previous clean-state predictions. The resulting method, called Self-Conditioned Masked Diffusion Models (SCMDM), requires minimal architectural change, does not introduce a recurrent latent-state pathway, does not rely on an auxiliary reference model, and adds no extra denoiser evaluations during sampling. This is an important departure from partial self-conditioning approaches, which require expensive model training from scratch. In particular, the paper shows that partial self-conditioning, including the commonly used 50% dropout strategy for training self-conditioned models from scratch, is suboptimal in the post-training regime. Instead, once the model's self-generated clean-state estimates become informative, specialization to refinement is preferable to mixing conditional and unconditional objectives. SCMDM is evaluated across multiple domains, demonstrating consistent improvement over vanilla MDM baselines, achieving nearly a 50% reduction in generative perplexity on OWT-trained models (42.89 to 23.72), alongside strong improvements in discretized image synthesis quality, small-molecule generation, and enhanced fidelity in genomic distribution modeling.
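
For readers who want to check the perplexity figures, the sketch below shows the usual definition of perplexity as the exponential of the mean negative log-likelihood per token, and works out the relative reduction from the two values quoted in the abstract. It assumes generative perplexity means generated text scored by a separate evaluation model, which the abstract does not spell out. The function names and the toy log-probabilities are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def relative_reduction(before, after):
    """Fraction by which a metric dropped, e.g. 0.25 for a 25% reduction."""
    return (before - after) / before


# Toy check: a scorer assigning probability 0.5 to every token gives
# a perplexity of about 2.
toy_ppl = perplexity([math.log(0.5)] * 4)

# Values quoted in the abstract for OWT-trained models.
reduction = relative_reduction(42.89, 23.72)
```

By this arithmetic the quoted drop from 42.89 to 23.72 is a relative reduction of about 44.7%.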

2604.25781 2026-06-09 cs.CV cs.GR Version update

Sketch2Arti: Sketch-based Articulation Modeling of CAD Objects

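Yi Yang, Hao Pan, Yijing Cui, Alla Sheffer, Changjian Li

Affiliations: University of Edinburgh; Tsinghua University; University of British Columbia

AI summary: Proposes Sketch2Arti, a system that takes 2D sketches drawn by the user from a chosen viewpoint, automatically discovers the movable parts of a CAD model and predicts their motion parameters. It supports iterative modeling of multiple articulations on complex objects, needs no category information, and generalizes to diverse objects.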

Comments Project page: https://arlo-yang.github.io/Sketch2Arti

Abstract

Articulation modeling aims to infer movable parts and their motion parameters for a 3D object, enabling interactive animation, simulation, and shape editing. In this paper, we present Sketch2Arti, the first sketch-based articulation modeling system for CAD objects. Our key observation is that designers naturally communicate articulation intent through lightweight sketches (e.g., arrows and strokes) that indicate how parts should move, yet translating such sketches into articulated 3D models remains largely manual. Sketch2Arti bridges this gap by enabling users to specify articulation through simple 2D sketches drawn from a chosen viewpoint. Given a CAD model and user sketches, our approach automatically discovers the corresponding movable parts and predicts their motion parameters, allowing iterative modeling of multiple articulations on complex objects with fine-grained control. Importantly, Sketch2Arti is trained in a category-agnostic manner without requiring object category information, leading to strong generalization to diverse objects beyond existing articulation datasets. Moreover, for shell models lacking interior structures, Sketch2Arti supports controllable internal completion guided by user sketches, generating plausible internal components consistent with the existing geometry and predicted motion constraints. Comprehensive experiments and user evaluations demonstrate the effectiveness, controllability, and generalization of Sketch2Arti. The code, dataset, and the prototype system are at https://arlo-yang.github.io/Sketch2Arti.
