Summary extraction of news comments based on weighed textual matrix factorization and information entropy model

doi:10.11772/j.issn.1001-9081.2014.10.2859

Journal of Computer Applications, 2014, Vol. 34, Issue 10: 2859-2864. DOI: 10.11772/j.issn.1001-9081.2014.10.2859

GUO Yujing, JI Donghong

School of Computer Science, Wuhan University, Wuhan Hubei 430072, China

Received: 2014-04-14 Revised: 2014-06-18 Online: 2014-10-01 Published: 2014-10-30
Corresponding author: GUO Yujing
About the authors: GUO Yujing (born 1989), female, from Tianjin, master's student; research interests: natural language processing, data mining. JI Donghong (born 1966), male, from Beijing, professor, Ph.D.; research interests: natural language processing, data mining, intelligent information processing, search technology, machine learning, biological information processing, lexical semantics, modern linguistics, cognitive linguistics.
Supported by: Key Program of the National Natural Science Foundation of China; General Program of the National Natural Science Foundation of China

Abstract

To address the problem of extracting summary comments for a news article, this paper proposes an automatic method for extracting social media comments that combines Weighted Textual Matrix Factorization (WTMF) with information entropy. The method builds a WTMF model over a heterogeneous graph of microblog posts (tweets) and news, which alleviates the feature sparsity of short texts and preserves the similarity between the selected comments and the news. Based on the feature distribution of the tweets, feature-based binary entropy and continuous entropy are constructed to ensure the diversity of the selected information. Finally, exploiting submodularity, a greedy algorithm is designed to obtain an approximately optimal solution to the optimization problem. Experimental results show that combining WTMF with information entropy effectively improves the extraction of summary comments from social media: recall and F1 on ROUGE-2 reach 0.40074 and 0.27330, respectively, which is 0.05 and 0.03 higher than the Biterm Topic Model (BTM), an extension of Latent Dirichlet Allocation (LDA).
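The greedy step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the paper's actual objective, so everything concrete here is an assumption made for illustration: the function names (`binary_entropy`, `objective`, `greedy_select`), the precomputed `relevance` scores (standing in for a WTMF-based similarity between each tweet and the news article), the 0/1 `features` matrix, and the `trade_off` weight. The toy objective adds a relevance sum to the binary entropy of each feature's presence rate in the selected set; it is not claimed to be the authors' formulation, nor to be monotone submodular. The classical result of Nemhauser, Wolsey and Fisher (1978) guarantees that greedy selection reaches at least a (1 - 1/e) fraction of the optimum only when the objective is monotone and submodular.

```python
import math
from typing import List, Sequence


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) variable; defined as 0 at p = 0 and p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def objective(selected: Sequence[int],
              relevance: Sequence[float],
              features: Sequence[Sequence[int]],
              trade_off: float) -> float:
    """Toy score (an assumption, not the paper's formula):
    summed relevance plus trade_off times the summed binary entropy
    of each feature's presence rate within the selected set."""
    if not selected:
        return 0.0
    similarity = sum(relevance[i] for i in selected)
    n_features = len(features[selected[0]])
    diversity = 0.0
    for f in range(n_features):
        # Fraction of selected comments in which feature f is present.
        p = sum(features[i][f] for i in selected) / len(selected)
        diversity += binary_entropy(p)
    return similarity + trade_off * diversity


def greedy_select(relevance: Sequence[float],
                  features: Sequence[Sequence[int]],
                  k: int,
                  trade_off: float = 1.0) -> List[int]:
    """Pick up to k comment indices, each time adding the candidate
    that yields the highest objective value (largest marginal gain)."""
    selected: List[int] = []
    remaining = set(range(len(relevance)))
    while remaining and len(selected) < k:
        best_index, best_value = -1, float("-inf")
        # Sorted iteration makes tie-breaking deterministic (lowest index wins).
        for i in sorted(remaining):
            value = objective(selected + [i], relevance, features, trade_off)
            if value > best_value:
                best_index, best_value = i, value
        selected.append(best_index)
        remaining.remove(best_index)
    return selected


if __name__ == "__main__":
    # Three hypothetical comments: two similar high-relevance ones and one different one.
    scores = [0.9, 0.8, 0.1]
    feats = [[1, 0], [1, 0], [0, 1]]
    print(greedy_select(scores, feats, k=2))
```

With `trade_off=0.0` the selection falls back to pure relevance ranking, while a positive `trade_off` lets a less relevant but differently featured comment displace a near-duplicate, which is the similarity versus diversity balance the abstract describes.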

CLC Number: TP391.1

Cite this article: GUO Yujing, JI Donghong. Summary extraction of news comments based on weighed textual matrix factorization and information entropy model[J]. Journal of Computer Applications, 2014, 34(10): 2859-2864.

