Journal of Computer Applications ›› 2014, Vol. 34 ›› Issue (10): 2859-2864. DOI: 10.11772/j.issn.1001-9081.2014.10.2859

• Artificial Intelligence •

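Summary extraction of news comments based on weighted textual matrix factorization and information entropy model

GUO Yujing, JI Donghong

  1. School of Computer Science, Wuhan University, Wuhan Hubei 430072, China
  • Received: 2014-04-14 Revised: 2014-06-18 Online: 2014-10-01 Published: 2014-10-30
  • Contact: GUO Yujing
  • About the authors: GUO Yujing (born 1989), female, from Tianjin, is a master's student whose research interests are natural language processing and data mining. JI Donghong (born 1966), male, from Beijing, is a professor with a Ph.D. whose research interests are natural language processing, data mining, intelligent information processing, search technology, machine learning, biological information processing, lexical semantics, modern linguistics, and cognitive linguistics.
  • Supported by:

    Key Program of the National Natural Science Foundation of China; General Program of the National Natural Science Foundation of China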

Abstract:

To address the problem of extracting a summary of comments for a news article, that is, selecting the most interesting and useful comments for an online news article, a method for automatically extracting social media comments was proposed that combines Weighted Textual Matrix Factorization (WTMF) with information entropy. The method builds a WTMF model based on a heterogeneous graph over tweets and news, which alleviates the feature sparsity of short texts and preserves the similarity of information. Meanwhile, according to the feature distribution of tweets, feature-based binary entropy and continuous entropy were constructed to guarantee the diversity of information. Finally, exploiting submodularity, a greedy algorithm was designed to obtain an approximately optimal solution to the optimization problem. The experimental results show that combining WTMF with information entropy effectively improves the extraction of summary comments from social media: recall and F1 on ROUGE-2 reach 0.40074 and 0.27330 respectively, increases of 0.05 and 0.03 over the Biterm Topic Model (BTM), an extension of Latent Dirichlet Allocation (LDA). The proposed model thus improves the quality of news comment summaries.
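The interplay the abstract describes, a relevance signal balanced against an entropy-based diversity term under greedy selection, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the paper's formulation: the `relevance` scores stand in for WTMF-based similarity to the news article, `feature_entropy` is a plain Shannon entropy over pooled feature counts rather than the paper's binary or continuous entropy, and the weight `lam` and the objective itself are hypothetical. The usual approximation guarantee for greedy selection holds only for monotone submodular objectives, which this toy objective is not claimed to be.

```python
import math
from collections import Counter


def feature_entropy(selected_features):
    """Shannon entropy (in bits) of the pooled feature counts of the selected comments."""
    counts = Counter()
    for feats in selected_features:
        counts.update(feats)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def objective(selected, relevance, features, lam):
    """Hypothetical objective: total relevance plus lam times feature entropy."""
    rel = sum(relevance[i] for i in selected)
    ent = feature_entropy([features[i] for i in selected])
    return rel + lam * ent


def greedy_select(relevance, features, k, lam=1.0):
    """Pick up to k comments, each time adding the one with the largest marginal gain."""
    selected = []
    remaining = set(range(len(relevance)))
    while remaining and len(selected) < k:
        base = objective(selected, relevance, features, lam)
        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(
            sorted(remaining),
            key=lambda i: objective(selected + [i], relevance, features, lam) - base,
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (assumed): relevance stands in for similarity to the news article.
relevance = [0.9, 0.85, 0.4]
features = [["budget", "tax"], ["budget", "tax"], ["school", "health"]]
picked = greedy_select(relevance, features, k=2, lam=1.0)
```

On this toy input, `lam=1.0` selects comments 0 and 2, because the second comment repeats the features of the first and adds no entropy, whereas `lam=0.0` ranks purely by relevance and selects comments 0 and 1.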

CLC Number: