Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (1): 154-164. DOI: 10.3778/j.issn.1002-8331.2212-0259

• Pattern Recognition and Artificial Intelligence •

Enhanced Contextual Neural Topic Model for Short Texts

LIU Gang, WANG Tongli, TANG Hongwei, ZHAN Kai, YANG Wenli   

  1. College of Computer Science and Technology, Harbin Engineering University, Harbin 150001, China
    2.Modeling and Emulation in E-Government National Engineering Laboratory, Harbin Engineering University, Harbin 150001, China
    3.PwC Enterprise Digital, PricewaterhouseCoopers, Sydney 2070, Australia
  • Online: 2024-01-01 Published: 2024-01-01

Abstract: Most current topic models are built solely on the word co-occurrence information of the texts themselves and do not introduce sparsity constraints on topics to improve their topic extraction ability. In addition, short texts inherently suffer from sparse word co-occurrence, which severely degrades the accuracy of short-text topic modeling. To address these problems, an enhanced context neural topic model (ECNTM) is proposed. ECNTM imposes a sparsity constraint on topics through a topic controller that filters out irrelevant topics, while the model input becomes the concatenation of a BOW vector and an SBERT sentence embedding. In the Gaussian decoder, the topic distributions over words are treated as multivariate Gaussian distributions or Gaussian mixture distributions in the embedding space, which explicitly enriches the limited contextual information of short texts and alleviates the sparsity of their word co-occurrence features. Experimental results on four public datasets (WS, Reuters, KOS, and 20 NewsGroups) show that the proposed model clearly outperforms the baseline models in perplexity, topic coherence, and text classification accuracy, demonstrating the effectiveness of introducing topic sparsity constraints and rich contextual information into short-text topic modeling.
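
The abstract names two concrete architectural pieces: a VAE encoder whose input is the concatenation of a BOW vector with an SBERT sentence embedding, and a Gaussian decoder that treats each topic as a Gaussian over pretrained word embeddings. The PyTorch sketch below is a minimal, hypothetical rendering of that pipeline; the class names, layer sizes, diagonal-Gaussian simplification, and the omission of the topic controller and the Gaussian-mixture variant are all our assumptions, not the authors' implementation.

    # Hypothetical sketch of the ECNTM pipeline described in the abstract.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ECNTMEncoder(nn.Module):
        # VAE encoder over the concatenation of a BOW vector and an SBERT embedding.
        def __init__(self, vocab_size, sbert_dim, n_topics, hidden=256):
            super().__init__()
            self.fc = nn.Linear(vocab_size + sbert_dim, hidden)
            self.mu = nn.Linear(hidden, n_topics)       # mean of q(z | x)
            self.logvar = nn.Linear(hidden, n_topics)   # log-variance of q(z | x)

        def forward(self, bow, sbert_emb):
            x = torch.cat([bow, sbert_emb], dim=-1)     # enrich sparse BOW with sentence context
            h = F.softplus(self.fc(x))
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
            theta = F.softmax(z, dim=-1)                # document-topic distribution, (B, K)
            return theta, mu, logvar

    class GaussianDecoder(nn.Module):
        # Scores each word by the log-density of its embedding under a per-topic
        # diagonal Gaussian (the abstract's multivariate-Gaussian case).
        def __init__(self, word_emb, n_topics):
            super().__init__()
            emb_dim = word_emb.size(1)
            self.word_emb = nn.Parameter(word_emb, requires_grad=False)  # (V, D) pretrained
            self.topic_mu = nn.Parameter(torch.randn(n_topics, emb_dim))
            self.topic_logvar = nn.Parameter(torch.zeros(n_topics, emb_dim))

        def forward(self, theta):
            diff = self.word_emb.unsqueeze(0) - self.topic_mu.unsqueeze(1)  # (K, V, D)
            var = self.topic_logvar.exp().unsqueeze(1)                      # (K, 1, D)
            # log N(word | topic) up to an additive constant, shape (K, V)
            log_density = -0.5 * ((diff ** 2) / var + self.topic_logvar.unsqueeze(1)).sum(-1)
            beta = F.softmax(log_density, dim=-1)       # topic-word distributions
            return theta @ beta                         # reconstructed word probabilities, (B, V)

In a full model, the two modules would presumably be trained jointly by maximizing the usual VAE evidence lower bound: the reconstruction log-likelihood of the BOW under theta @ beta minus the KL divergence between q(z | x) and the prior.
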

Key words: neural topic model, short text, sparsity constraint, variational auto-encoder, topic modeling