Vector Space Representations of Documents in Classifying Finnish Social Media Texts

  • Conference paper
Part of the book series: Communications in Computer and Information Science (CCIS, volume 639)

Abstract

Computational analysis of linguistic data requires that texts be transformed into numeric representations. The aim of this research is to evaluate different methods for building vector representations of text documents from social media. The methods are compared with respect to their performance in a classification task. Specifically, traditional count-based term frequency-inverse document frequency (TFIDF) is compared to semantic distributed word embedding representations. Unlike previous research, we investigate document representations in the context of morphologically rich Finnish. Based on the results, we suggest a framework for building vector space representations of texts in social media, applicable to language technologies for morphologically rich languages. In the current study, lemmatization of tokens increased classification accuracy, while lexical filtering generally hindered performance. Finally, we report that distributed embeddings and TFIDF perform at comparable levels with our data.
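To make the two kinds of representation named in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' pipeline: the idf variant (log of N over document frequency), the choice to average word vectors into a document vector, the function names, the toy lemmatized tokens, and the two-dimensional embeddings are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) for a list of tokenized documents.

    Assumes raw term frequency and idf = log(N / df); other weighting
    variants exist and the paper's exact choice is not stated here.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[tok] * idf[tok] for tok in vocab])
    return vocab, vectors


def mean_embedding(doc, embeddings, dim):
    """Average the word vectors of the tokens that have an embedding.

    Averaging is one common way to get a document vector from word
    embeddings; it is an assumption here, not the paper's stated method.
    """
    known = [embeddings[tok] for tok in doc if tok in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy lemmatized tokens, invented for this example.
docs = [["kissa", "juosta"], ["koira", "juosta"], ["kissa", "nukkua"]]
vocab, vectors = tfidf_vectors(docs)

# Hypothetical 2-dimensional embeddings, invented for this example.
toy_embeddings = {"kissa": [1.0, 0.0], "koira": [0.0, 1.0]}
doc_vec = mean_embedding(["kissa", "koira", "juosta"], toy_embeddings, dim=2)
```

Either kind of vector can then be passed to a standard classifier. Lemmatizing tokens first, as the abstract reports, collapses inflected forms into one vocabulary entry, which matters for a morphologically rich language where a single lemma can surface in many forms.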

Corresponding author

Correspondence to Viljami Venekoski.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Venekoski, V., Puuska, S., Vankka, J. (2016). Vector Space Representations of Documents in Classifying Finnish Social Media Texts. In: Dregvaite, G., Damasevicius, R. (eds) Information and Software Technologies. ICIST 2016. Communications in Computer and Information Science, vol 639. Springer, Cham. https://doi.org/10.1007/978-3-319-46254-7_42

  • DOI: https://doi.org/10.1007/978-3-319-46254-7_42

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-46253-0

  • Online ISBN: 978-3-319-46254-7

  • eBook Packages: Computer Science, Computer Science (R0)
