Dynamic Topic Modeling with a Higher-Order Hypergraphical Representation
Hanjia Gao, Hanwen Ye, Qing Nie, Annie Qu
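AI summary: Traditional topic models ignore higher-order interactions among words and the semantic overlap found in dynamic corpora. To address this, the paper represents text as a hypergraph and builds a dynamic topic modeling framework on it, implemented through structured low-rank factorization and temporal regularization. The theory guarantees convergence and provides error bounds, and the experiments show the method outperforming existing models.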
Details
- Comments: 34 pages, 4 figures
Dynamic topic modeling is widely used to analyze evolving trends in scientific literature, medical records, and social media. Traditional topic models represent each topic through a single probability vector on the multinomial simplex and implicitly couple word occurrence and repetition within one probabilistic mechanism. However, this formulation restricts the dependence structure among words and overlooks informative higher-order interactions, particularly in dynamic corpora with overlapping semantics. To address these limitations, we introduce a hypergraph representation of text where each document is modeled as a hyperedge connecting all co-occurring words, with repetition intensities encoded as node weights. This representation naturally separates word occurrence from repetition and induces a novel hypergraph-based multinomial distribution with a nonlinear normalization depending on the observed word set of each document. Building on this likelihood, we develop a dynamic topic modeling framework via structured low-rank factorizations with explicit temporal regularization on topic-word profiles. Moreover, we establish local convergence guarantees and derive non-asymptotic error bounds despite the intrinsic nonconvexity induced by bilinear factorization and document-specific nonlinear normalization. Numerical experiments on synthetic data and an application to the International Conference on Learning Representations (ICLR) corpus demonstrate consistent improvements over existing multinomial-based topic models.
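The hypergraph representation described in the abstract can be illustrated with a minimal Python sketch. It covers only the data representation: each document becomes a hyperedge (the set of distinct words it contains), and repetition is stored separately as node weights. The function name `document_to_hyperedge`, the toy documents, and the choice of raw counts as the "repetition intensity" are assumptions made for illustration, not taken from the paper; the hypergraph-based multinomial likelihood and the estimation procedure are not reproduced here.

```python
from collections import Counter


def document_to_hyperedge(tokens):
    """Split a tokenized document into occurrence and repetition parts.

    Hypothetical helper for illustration only.
    Returns:
      hyperedge: frozenset of the distinct words (which words occur)
      weights:   dict mapping each word to its count (how often it repeats);
                 using the raw count as the node weight is an assumption.
    """
    counts = Counter(tokens)
    hyperedge = frozenset(counts)   # occurrence: the observed word set
    weights = dict(counts)          # repetition: per-word intensity
    return hyperedge, weights


# Toy corpus (made up for this example).
docs = [
    "topic model topic hypergraph".split(),
    "hypergraph model".split(),
]

# One (hyperedge, weights) pair per document.
hypergraph = [document_to_hyperedge(d) for d in docs]

for edge, weights in hypergraph:
    print(sorted(edge), weights)
```

Keeping the word set and the counts in separate structures mirrors the separation of occurrence from repetition that the abstract emphasizes; a standard bag-of-words vector would fold both into a single count per word.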