ABSTRACT
Text classification is one of the most important tasks in natural language processing and information retrieval, owing to the increasing availability of documents in digital form and the ensuing need to access them in flexible ways. By assigning documents to labeled classes, text classification reduces the search space and expedites the retrieval of relevant documents. In this paper, we propose a novel text representation method, Hybrid Word Embeddings (HWE), which combines semantic information obtained from WordNet with contextual information extracted from text documents to provide concise and accurate representations of text documents. By exploiting the semantic relationships encoded in WordNet, HWE can derive word semantics from a smaller training corpus than purely corpus-driven methods require. An experimental study on document classification shows that HWE outperforms existing methods, including Doc2Vec and Word2Vec, in classification accuracy, recall, and precision.
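The abstract describes combining a corpus-derived contextual vector with a WordNet-derived semantic vector per word; the exact combination rule is not given here, so the following is only a minimal sketch under the assumption that the two sources are concatenated. The `contextual` and `semantic` dictionaries are hypothetical stand-ins for Word2Vec output and WordNet-derived features, respectively.

```python
import numpy as np

# Toy "contextual" vectors: hypothetical stand-ins for Word2Vec output (dim 4).
contextual = {
    "network": np.array([0.1, 0.3, -0.2, 0.5]),
    "graph":   np.array([0.2, 0.2, -0.1, 0.4]),
}

# Toy "semantic" vectors: hypothetical stand-ins for features derived from
# WordNet relations (e.g. hypernym depth, synset overlap) (dim 2).
semantic = {
    "network": np.array([0.8, 0.1]),
    "graph":   np.array([0.7, 0.2]),
}

def hybrid_embedding(word):
    """Assumed combination rule: concatenate contextual and semantic vectors."""
    return np.concatenate([contextual[word], semantic[word]])

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Documents can then be compared via similarities over their hybrid word vectors.
sim = cosine(hybrid_embedding("network"), hybrid_embedding("graph"))
```

In a real pipeline the contextual vectors would come from a trained embedding model and the semantic vectors from WordNet; the concatenation is one plausible fusion strategy, not the paper's confirmed formula.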
Index Terms
- HWE: Hybrid Word Embeddings For Text Classification