Acta Univ. Agric. Silvic. Mendelianae Brun. 2018, 66(6), 1431-1439 | DOI: 10.11118/actaun201866061431

Analysis of the Association between Topics in Online Documents and Stock Price Movements

František Dařena, Jan Přichystal
Department of Informatics, Faculty of Business and Economics, Mendel University in Brno, Zemědělská 1, 61300 Brno, Czech Republic

This paper aims to discover topics hidden in newspaper articles that have an impact on the stock price movements of the corresponding companies. Document topics are characterized by combinations of specific words in documents and are shared across a document collection. We describe the process of discovering the topics, creating a mapping of the topics to stock price movements, and quantifying and evaluating the results. To find and quantify the association, we use machine learning-based classification. We achieved an accuracy of stock price movement prediction higher than 70%. A feature selection procedure was applied to the features characterizing the topics to make it easier for a human expert to assign a label to each topic.

Keywords: stock prices, topics in document collections, machine learning, classification, feature selection
Grants and funding:

This research was supported by the Czech Science Foundation [grant No. 16-26353S "Sentiment and its Impact on Stock Markets"].

Published: December 19, 2018

Dařena, F., & Přichystal, J. (2018). Analysis of the Association between Topics in Online Documents and Stock Price Movements. Acta Universitatis Agriculturae et Silviculturae Mendelianae Brunensis, 66(6), 1431-1439. doi: 10.11118/actaun201866061431

This is an open access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0), which permits non-commercial use, distribution, and reproduction in any medium, provided the original publication is properly cited. No use, distribution or reproduction is permitted which does not comply with these terms.