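arXivDaily daily academic digest: synced with the full arXiv data, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.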

2605.16258 2026-05-22 cs.CV cs.AI cs.RO

IVGT: Implicit Visual Geometry Transformer for Neural Scene Representation

Yuqi Wu, Tianyu Hu, Wenzhao Zheng, Yuanhui Huang, Haowen Sun, Jie Zhou, Jiwen Lu

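AI summary: This paper proposes IVGT, an Implicit Visual Geometry Transformer that implicitly models continuous and coherent geometry from pose-free multi-view images to build a neural scene representation. It supports continuous spatial queries at arbitrary 3D positions to predict signed distance and color, and performs strongly across multiple tasks.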

Comments Code: https://github.com/wzzheng/IVGT/

Abstract

Reconstructing coherent 3D geometry and appearance from unposed multi-view images is a fundamental yet challenging problem in computer vision. Most existing visual geometry foundation models predict explicit geometry by regressing pixel-aligned pointmaps, often suffering from redundancy and limited geometric continuity. We propose IVGT, an Implicit Visual Geometry Transformer that implicitly models continuous and coherent geometry from pose-free multi-view images. This formulation learns a continuous neural scene representation in a canonical coordinate system and supports continuous spatial queries at any 3D position, retrieving local features to predict signed distance function (SDF) values and colors using lightweight decoders. It allows direct extraction of continuous and coherent surface geometry, enabling rendering of RGB images, depth maps, and surface normal maps from arbitrary viewpoints. We train IVGT via multi-dataset joint optimization with 2D supervision and 3D geometric regularization. IVGT demonstrates generalization across scenes and achieves strong performance on various tasks, including mesh and point cloud reconstruction, novel view synthesis, depth and surface normal estimation, and camera pose estimation.
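
To make the idea of querying a signed distance field at arbitrary 3D positions concrete, here is a minimal sketch of how depth and a surface normal can be read off any SDF. It is not IVGT's renderer: an analytic sphere (`sphere_sdf`, hypothetical) stands in for the learned decoder, and sphere tracing plus central differences are generic techniques chosen for illustration. All function names and parameters are assumptions.

```python
import math

def sphere_sdf(p, center=(0.0, 0.0, 3.0), radius=1.0):
    """Stand-in for a learned SDF decoder: signed distance to a sphere."""
    return math.dist(p, center) - radius

def sphere_trace(sdf, origin, direction, max_steps=64, eps=1e-4, max_depth=20.0):
    """March along a ray, stepping by the SDF value; return hit depth or None."""
    norm = math.sqrt(sum(d * d for d in direction))
    direction = tuple(d / norm for d in direction)
    t = 0.0
    for _ in range(max_steps):
        p = tuple(o + t * d for o, d in zip(origin, direction))
        dist = sdf(p)
        if dist < eps:          # close enough to the zero level set
            return t
        t += dist               # safe step: no surface is nearer than dist
        if t > max_depth:       # ray left the region of interest
            break
    return None

def sdf_normal(sdf, p, h=1e-4):
    """Surface normal as the normalized central-difference gradient of the SDF."""
    grad = []
    for axis in range(3):
        hi, lo = list(p), list(p)
        hi[axis] += h
        lo[axis] -= h
        grad.append((sdf(tuple(hi)) - sdf(tuple(lo))) / (2 * h))
    norm = math.sqrt(sum(g * g for g in grad))
    return tuple(g / norm for g in grad)

if __name__ == "__main__":
    depth = sphere_trace(sphere_sdf, (0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
    print("depth:", depth)
    print("normal:", sdf_normal(sphere_sdf, (0.0, 0.0, 2.0)))
```

Running one such ray per pixel would give a depth map and a normal map; a color decoder queried at the hit point would give the RGB value.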

2605.15588 2026-05-22 cs.CL cs.LG

Calibrating LLMs with Semantic-level Reward

Fengfei Yu, Ruijia Niu, Dongxia Wu, Yian Ma, Rose Yu

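AI summary: This paper proposes CSR, a new calibration framework that calibrates language models directly in semantic space, avoiding the inconsistency that verbalized confidence introduces in conventional methods. Experiments show that CSR lowers ECE and raises AUROC across multiple datasets.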

Abstract

As large language models (LLMs) are deployed in consequential settings such as medical question answering and legal reasoning, the ability to estimate when their outputs are likely to be correct is essential for safe and reliable use, requiring well-calibrated uncertainty. Standard reinforcement learning with verifiable rewards (RLVR) trains models with a binary correctness reward that is indifferent to confidence, providing no penalty for confident but wrong predictions and thereby degrading calibration. Recent work addresses this by training models to produce verbalized confidence scores alongside answers and rewarding agreement with correctness. However, verbalized confidence is calibrated at the token level and thus exhibits inconsistency across textual variations with the same semantic meaning. We propose Calibration with Semantic Reward (CSR), a framework that calibrates language models directly in semantic space without a verbalized confidence interface. CSR combines the correctness reward with a novel semantic calibration reward that encourages exploitation among correct rollouts by promoting semantic agreement, and exploration among incorrect ones by discouraging spurious consistency. Experiments across three model families on HotpotQA (in-distribution) and TriviaQA, MSMARCO, and NQ-Open (out-of-distribution) show that CSR consistently achieves lower ECE and higher AUROC than verbalized-confidence baselines across nearly all settings, reducing ECE by up to 40% and improving AUROC by up to 31%, with calibration behavior generalizing robustly across all four evaluation settings.
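
For readers unfamiliar with ECE (expected calibration error), the metric reported above, the following is a minimal sketch of the common equal-width binned estimator. The bin count of 10 and the function name are assumptions for illustration; the paper's exact evaluation protocol is not described in this listing.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted mean of |accuracy - mean confidence| per bin.

    confidences: floats in [0, 1]; correct: 0/1 labels of equal length.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, label in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, label))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece

if __name__ == "__main__":
    # Four answers given at 90% confidence, three of them correct.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

A model that says 90% and is right 75% of the time is overconfident by 0.15 in that bin; lower ECE means stated confidence tracks observed accuracy more closely.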

2605.15505 2026-05-22 cs.AI cs.IR cs.LG

X-SYNTH: Beyond Retrieval -- Enterprise Context Synthesis from Observed Digital Human Attention

Guruprasad Raghavan, George Nychis, Rohan Narayana Murthy

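AI summary: This paper proposes the X-SYNTH framework, which addresses enterprise context synthesis by analyzing behavioral patterns in digital human attention. Its core approach synthesizes context from behavioral patterns rather than conventional retrieval, substantially raising the True Lead Rate and lowering the False Lead Rate.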

Comments 11 pages, 7 figures, 5 tables

Abstract

In enterprise operations, the context required for an AI agent task is scattered across systems of record, static information stores, and communication channels. What is stored is system state, a lossy representation of the work that actually happened. The prevailing approach retrieves by matching request content to what is stored; for narrow requests this works well. But synthesis quality depends on knowing what to surface and how to interpret it: knowledge specific to each organization, team, and individual, present in behavioral patterns, absent from any retrieval index. For the agentic task of proposing enterprise-valuable leads to sellers, this approach breaks down: True Lead Rate is low, False Lead Rate is high, and the model has no mechanism to improve. We present X-SYNTH, a framework for enterprise context synthesis grounded in digital human attention, the digitally observable interaction signatures of each worker, encoding what they did, the sequence in which they did it, and implicit reward signals. Behavioral traces preceding positive outcomes are distinguishable from those that did not, without external labeling. X-SYNTH models each individual's behavioral baseline as a Digital Twin Signature (DTS) and selects among seven attention filters, Proportional, Inverse, Differential, Recurrent, Comparative, Sequential, and Collective, per individual and per query, to identify causally relevant activity signatures. A four-stage pipeline assembles ranked context grounded in behavioral patterns rather than query embeddings. A frontier model unaided achieves 9.5% True Lead Rate (TLR) with 90.5% False Lead Rate (FLR). Augmented with X-SYNTH, TLR rises to 61.9% (6.5x) while FLR falls to 18.8%. Enterprise context synthesis is not a retrieval problem. It is a relevance problem, and digital human attention is its most reliable ground truth.

2605.14598 2026-05-22 cs.RO

Instance-Adaptive Online Multicalibration

Zhiming Huang, Jamie Morgenstern, Aaron Roth, Claire Jie Zhang

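AI summary: This paper proposes an efficient instance-adaptive online multicalibration algorithm that dynamically adjusts a binary grid over prediction values to balance worst-case and easy instances, achieving optimal error control across different instances.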

Comments We tightened the analysis and added a comparison to the concurrent work of Liu et al. (arXiv:2605.11490)

Tree to Flow and Back: Unifying Decision Trees and Diffusion Models

Sai Niranjan Ramachandran, Suvrit Sra

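AI summary: By establishing a mathematical correspondence between hierarchical decision trees and diffusion processes, this paper unifies decision trees and diffusion models and reveals a shared optimization principle, "global trajectory score matching". It presents two practical applications: treeflow, which performs strongly on tabular data generation while computing faster, and dsmtree, which transfers hierarchical decision logic into a neural network and performs close to the teacher model on several benchmarks.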

Comments 12 pages (main), 68 pages (inclusive of appendix), Accepted in the Forty-Third International Conference on Machine Learning (ICML) 2026

2605.00185 2026-05-22 cs.LG cs.AI

Fair Dataset Distillation via Cross-Group Barycenter Alignment

Mohammad Hossein Moslemi, Nima Hosseini Dashtbayaz, Zhimin Mei, Bissan Ghaddar, Boyu Wang

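AI summary: This paper studies the fairness problems in dataset distillation that arise from differing predictive patterns across groups, and proposes cross-group barycenter alignment to reduce prediction bias between groups and thereby improve model fairness.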

Comments Accepted by ICML 2026

Abstract

Dataset Distillation aims to compress a large dataset into a small synthetic one while maintaining predictive performance. We show that as different demographic groups exhibit distinct predictive patterns, the distillation process struggles to simultaneously preserve informative signals for all subgroups, regardless of whether group sizes are mildly or severely imbalanced. Consequently, models trained on distilled data can experience substantial performance drops for certain subgroups, leading to fairness gaps. Crucially, these gaps do not disappear by merely correcting group imbalance, since they stem from fundamental mismatches in subgroup predictive patterns rather than from sample-size disparities alone. We therefore formally analyze the interaction between these two sources of bias and cast the solution as identifying a group-imbalance-agnostic barycenter of the predictive information that induces similar representations across all subgroups. By distilling toward this shared aggregate representation, we show that group fairness concerns can be reduced. Our approach is compatible with existing distillation methods, and empirical results show that it substantially reduces bias introduced by dataset distillation. Code is available at https://github.com/mhmoslemi/COBRA.

2604.24762 2026-05-22 cs.CV

Flow of Truth: Proactive Temporal Forensics for Image-to-Video Generation

Yuzhuo Chen, Zehua Ma, Han Fang, Hengyi Wang, Guanjie Wang, Weiming Zhang

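AI summary: This paper proposes a proactive temporal forensics method for image-to-video generation that traces how pixels flow and transform through the video, addressing the limitations of conventional spatial forensics along the temporal dimension.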

Abstract

The rapid rise of image-to-video (I2V) generation enables realistic videos to be created from a single image but also brings new forensic demands. Unlike static images, I2V content evolves over time, requiring forensics to move beyond 2D pixel-level tampering localization toward tracing how pixels flow and transform throughout the video. As frames progress, embedded traces drift and deform, making traditional spatial forensics ineffective. To address this unexplored dimension, we present Flow of Truth, the first proactive framework focusing on temporal forensics in I2V generation. A key challenge lies in discovering a forensic signature that can evolve consistently with the generation process, which is inherently a creative transformation rather than a deterministic reconstruction. Despite this intrinsic difficulty, we innovatively redefine video generation as the motion of pixels through time rather than the synthesis of frames. Building on this view, we propose a learnable forensic template that follows pixel motion and a template-guided flow module that decouples motion from image content, enabling robust temporal tracing. Experiments show that Flow of Truth generalizes across commercial and open-source I2V models, substantially improving temporal forensics performance.

2604.14084 2026-05-22 cs.LG cs.AI

TIP: Token Importance in On-Policy Distillation

Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, Alborz Geramifard

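AI summary: This study examines which tokens carry the most useful learning signal in on-policy knowledge distillation, proposes a two-axis taxonomy based on student entropy and teacher-student divergence, and validates experimentally that distilling on a small fraction of tokens is effective under limited memory.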

Abstract

On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We ask a direct question: which tokens carry the most useful learning signal in OPD? Our answer is that informative tokens come from two regions: positions with high student entropy, and positions with low student entropy plus high teacher-student divergence, where the student is overconfident and wrong. Empirically, student entropy is a strong first-order proxy: retaining 50% of tokens with entropy-based sampling matches or exceeds all-token training while reducing peak memory by up to 47%. But entropy alone misses a second important region. When we isolate low-entropy, high-divergence tokens, training on fewer than 10% of all tokens nearly matches full-token baselines, showing that overconfident tokens carry dense corrective signal despite being nearly invisible to entropy-only rules. We organize these findings with TIP (Token Importance in on-Policy distillation), a two-axis taxonomy over student entropy and teacher-student divergence, and give a theoretical explanation for why entropy is useful yet structurally incomplete. This view motivates type-aware token selection rules that combine uncertainty and disagreement. We validate this picture across three teacher-student pairs spanning Qwen3, Llama, and Qwen2.5 on MATH-500 and AIME 2024/2025, and on the DeepPlanning benchmark for long-horizon agentic planning, where Q3-only training on <20% of tokens surpasses full-token OPD. Our experiments are implemented by extending the OPD repository https://github.com/HJSang/OPSD_OnPolicyDistillation, which supports memory-efficient distillation of larger models under limited GPU budgets.
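
The two informative regions described above can be illustrated with a small selection rule. This is a sketch under stated assumptions, not the paper's implementation: the thresholds, the use of KL(student || teacher) as the divergence, the region labels, and the function names are all hypothetical choices made for illustration.

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; q is floored at eps to avoid log(0)."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def classify_tokens(student_probs, teacher_probs, entropy_thresh, div_thresh):
    """Map selected token positions to the region that makes them informative.

    - "high_entropy": the student is uncertain at this position.
    - "overconfident": the student is confident but disagrees with the teacher.
    Positions that are confident and in agreement are left out.
    """
    selected = {}
    for i, (s, t) in enumerate(zip(student_probs, teacher_probs)):
        if entropy(s) >= entropy_thresh:
            selected[i] = "high_entropy"
        elif kl_divergence(s, t) >= div_thresh:
            selected[i] = "overconfident"
    return selected

if __name__ == "__main__":
    student = [[0.5, 0.5], [0.99, 0.01], [0.99, 0.01]]
    teacher = [[0.5, 0.5], [0.01, 0.99], [0.99, 0.01]]
    print(classify_tokens(student, teacher, entropy_thresh=0.5, div_thresh=1.0))
```

An entropy-only rule would keep position 0 and drop position 1, even though position 1 is where the student is confidently wrong; the second branch is what recovers it.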

2604.12325 2026-05-22 cs.LG cs.AI

Black-Box Optimization From Small Offline Datasets via Meta Learning with Synthetic Tasks

Azza Fadhel, The Hung Tran, Trong Nghia Hoang, Jana Doppa

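AI summary: This paper proposes OptBias, a meta-learning framework that generates synthetic tasks to tackle black-box optimization from small offline datasets, learning reusable optimization biases to improve performance in the small-data regime.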

Comments Accepted for Publication at International Conference on Artificial Intelligence and Statistics (AISTATS)

Rule-State Inference (RSI): A Bayesian Framework for Compliance Monitoring in Rule-Governed Domains

Abdou-Raouf Atarmla

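AI summary: This paper proposes Rule-State Inference (RSI), a Bayesian framework that addresses three structural challenges of compliance monitoring in rule-governed domains: the absence of labeled outcomes at deployment, strategically missing observations from non-compliant entities, and a regulatory environment that changes faster than any supervised model can be retrained. RSI treats an authoritative, formalized rule set as structured Bayesian priors and infers the latent compliance state of a population using variational inference with exact coordinate-ascent updates.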

Comments 18 pages. Experimental validation forthcoming

Abstract

Compliance monitoring in rule-governed domains (tax administration, clinical protocol adherence, environmental regulation) faces three structural obstacles that standard machine learning does not simultaneously address: the absence of labeled outcomes at deployment, strategically missing observations where non-compliant entities selectively withhold evidence, and a regulatory environment that changes faster than any supervised model can be retrained. We introduce Rule-State Inference (RSI), a Bayesian framework that reverses the usual paradigm. Rather than learning rules from data, RSI treats an authoritative, formalized rule set as structured Bayesian priors and infers the latent compliance state of a population through mean-field variational inference with exact coordinate-ascent updates. The central modeling object is a joint latent state per regulatory period: a global compliance-culture factor eta and per-rule components for activation, population compliance level, and parametric drift. RSI delivers three formal guarantees: O(n_k + K) regulatory adaptability per rule update; Bernstein-von Mises consistency for the identifiable continuous components; and monotone ELBO convergence at every iteration. We instantiate RSI on the Togolese fiscal system on a benchmark of 2,000 synthetic enterprises grounded in official regulatory law; full numerical validation is forthcoming. The framework is designed for direct extension to Sequential RSI, a state-space formulation where the posterior from one regulatory period becomes the prior for the next, yielding an exact Kalman filter for compliance-trajectory tracking and entity-level Bayesian scoring.

2603.16077 2026-05-22 cs.LG

MDM-Prime-v2: Binary Encoding and Index Shuffling Enable Scaling of Diffusion Language Models

Chen-Hao Chao, Wei-Fang Sun, Junwei Quan, Chun-Yi Lee, Rahul G. Krishnan

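AI summary: This paper proposes MDM-Prime-v2, which improves diffusion language models through Binary Encoding and Index Shuffling. It addresses two problems: the increase in cross-entropy loss caused by pairing the subtokenizer's functional form with BPE tokenizers, and the lack of tools for choosing the subtokenizer's granularity hyperparameter. The result is higher zero-shot accuracy on commonsense reasoning benchmarks.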

Abstract

Masked diffusion models (MDM) exhibit superior generalization when learned using a Partial masking scheme (Prime). This approach converts tokens into sub-tokens and models the diffusion process at the sub-token level. We identify two limitations of the MDM-Prime framework. First, we find that the functional form of the subtokenizer significantly increases the cross-entropy loss in the objective when paired with commonly used Byte-Pair-Encoding (BPE) tokenizers. Second, we lack tools to guide the hyperparameter choice of the token granularity in the subtokenizer. To address these limitations, we analyze the optimal design of the subtokenizer that minimizes MDM-Prime training objective and develop MDM-Prime-v2, a masked diffusion language model which incorporates Binary Encoding and Index Shuffling. Our analysis characterizes how token granularity and sub-token entropy influence the training objective and downstream performance, providing principled criteria for subtokenizer design. When extending the model size to 1.1B parameters, MDM-Prime-v2 demonstrates superior average zero-shot accuracy across eight commonsense reasoning benchmarks, outperforming similar-sized baselines including GPT-Neo, OPT, Pythia, Bloom, SMDM, and TinyLLaMA.
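
The abstract names two components, Binary Encoding and Index Shuffling, without giving their construction. The sketch below shows one plausible reading, offered purely as an illustration: token indices are first permuted by a fixed random shuffle, then written as a sequence of binary sub-tokens. The function names, the seeded permutation, and the most-significant-bit-first ordering are assumptions, and the paper's actual subtokenizer may differ.

```python
import random

def bits_needed(vocab_size):
    """Number of binary sub-tokens needed to index every token."""
    return max(1, (vocab_size - 1).bit_length())

def make_index_shuffle(vocab_size, seed=0):
    """Fixed random permutation of token indices, plus its inverse."""
    perm = list(range(vocab_size))
    random.Random(seed).shuffle(perm)
    inverse = [0] * vocab_size
    for original, shuffled in enumerate(perm):
        inverse[shuffled] = original
    return perm, inverse

def encode(token_id, perm, n_bits):
    """Token id -> shuffled index -> list of bits, most significant first."""
    code = perm[token_id]
    return [(code >> k) & 1 for k in reversed(range(n_bits))]

def decode(bits, inverse):
    """List of bits -> shuffled index -> original token id."""
    code = 0
    for b in bits:
        code = (code << 1) | b
    return inverse[code]

if __name__ == "__main__":
    vocab_size = 1000
    perm, inverse = make_index_shuffle(vocab_size)
    n_bits = bits_needed(vocab_size)
    sub_tokens = encode(42, perm, n_bits)
    print(sub_tokens, "->", decode(sub_tokens, inverse))
```

In this reading, the shuffle breaks any correlation between a BPE token's rank and the bit pattern it maps to, and masking then operates on individual bits rather than whole tokens.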
