Vector Space Representations of Documents in Classifying Finnish Social Media Texts

Venekoski, Viljami; Puuska, Samir; Vankka, Jouko

doi:10.1007/978-3-319-46254-7_42

Conference paper
First Online: 22 September 2016

Part of the book series: Communications in Computer and Information Science (CCIS, volume 639)

Abstract

Computational analysis of linguistic data requires that texts be transformed into numeric representations. The aim of this research is to evaluate different methods for building vector representations of text documents from social media. The methods are compared with respect to their performance in a classification task. Specifically, the traditional count-based term frequency-inverse document frequency (TFIDF) representation is compared to semantic distributed word embedding representations. Unlike previous research, we investigate document representations in the context of Finnish, a morphologically rich language. Based on the results, we suggest a framework for building vector space representations of social media texts, applicable to language technologies for morphologically rich languages. In the current study, lemmatization of tokens increased classification accuracy, while lexical filtering generally hindered performance. Finally, we report that distributed embeddings and TFIDF perform at comparable levels with our data.
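The two families of document representation compared in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy documents, the two-dimensional embeddings, and the helper names `tfidf_vectors` and `mean_embedding` are hypothetical. The sketch assumes tokens are already lemmatized, uses the plain weighting tf × log(N/df), and averages word vectors as one common way (an assumption here) to turn word embeddings into a document vector.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for a list of tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Plain idf = log(N / df); libraries often use smoothed variants instead.
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append([counts[tok] / length * idf[tok] for tok in vocab])
    return vocab, vectors


def mean_embedding(doc, embeddings, dim):
    """Average the vectors of the tokens that have an embedding."""
    known = [embeddings[tok] for tok in doc if tok in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy lemmatized documents (hypothetical, not taken from the paper's data).
docs = [
    ["koira", "juosta", "puisto"],
    ["kissa", "nukkua", "sohva"],
    ["koira", "nukkua"],
]
vocab, tfidf = tfidf_vectors(docs)

# Hypothetical 2-dimensional word embeddings, for illustration only.
toy_embeddings = {
    "koira": [1.0, 0.0],
    "kissa": [0.8, 0.2],
    "nukkua": [0.0, 1.0],
}
doc_vec = mean_embedding(docs[2], toy_embeddings, dim=2)
```

The TFIDF vector has one dimension per vocabulary item and is mostly zeros, whereas the embedding-based vector has a fixed, small dimensionality regardless of vocabulary size. Either kind of vector can then be passed to a standard classifier.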

Author information

Authors and Affiliations

National Defence University, 00861, Helsinki, Finland
Viljami Venekoski, Samir Puuska & Jouko Vankka

Corresponding author

Correspondence to Viljami Venekoski.

Editor information

Editors and Affiliations

Kaunas University of Technology, Kaunas, Lithuania
Giedre Dregvaite
Kaunas University of Technology, Kaunas, Lithuania
Robertas Damasevicius

Cite this paper

Venekoski, V., Puuska, S., Vankka, J. (2016). Vector Space Representations of Documents in Classifying Finnish Social Media Texts. In: Dregvaite, G., Damasevicius, R. (eds) Information and Software Technologies. ICIST 2016. Communications in Computer and Information Science, vol 639. Springer, Cham. https://doi.org/10.1007/978-3-319-46254-7_42

DOI: https://doi.org/10.1007/978-3-319-46254-7_42
Published: 22 September 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46253-0
Online ISBN: 978-3-319-46254-7
eBook Packages: Computer Science, Computer Science (R0)
