Abstract
This research investigates the problem of news article classification. Classification is performed using N-gram textual features extracted from the article text and visual features generated from one representative image. The application domain is English-language news articles in four categories (Business-Finance, Lifestyle-Leisure, Science-Technology, and Sports), downloaded from three well-known news websites (BBC, Reuters, and The Guardian). Several classification experiments were performed with the Random Forests machine learning method using the N-gram textual features and the visual features from the representative image. Using the N-gram textual features alone yielded much higher accuracy (84.4%) than using the visual features alone (53%). Combining the textual and visual features yielded slightly higher accuracy still (86.2%). The main contributions of this work are a news article classification framework based on Random Forests and multimodal (textual and visual) features, and a late fusion strategy that makes use of the operational capabilities of Random Forests.
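To make the late fusion idea concrete, the following is a minimal sketch of weighted late fusion: one model per modality outputs a probability for each of the four categories, and the two outputs are combined by a weighted average. The abstract does not state how the weights are obtained, so the function names, the example probabilities, and the choice of weights (the reported single-modality accuracies, used purely as illustrative stand-ins) are all assumptions rather than the authors' method.

```python
from typing import Dict, List

# The four categories named in the abstract.
CATEGORIES: List[str] = [
    "Business-Finance",
    "Lifestyle-Leisure",
    "Science-Technology",
    "Sports",
]


def late_fusion(
    text_probs: Dict[str, float],
    visual_probs: Dict[str, float],
    w_text: float,
    w_visual: float,
) -> Dict[str, float]:
    """Weighted average of per-class probabilities from two modality models.

    Hypothetical helper: the weighting scheme is an assumption, not taken
    from the paper.
    """
    total = w_text + w_visual
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {
        c: (w_text * text_probs[c] + w_visual * visual_probs[c]) / total
        for c in CATEGORIES
    }


def predict(fused: Dict[str, float]) -> str:
    """Return the category with the highest fused probability."""
    return max(fused, key=lambda c: fused[c])


# Made-up per-class probabilities, standing in for the outputs of a
# text-based model and an image-based model on a single article.
text_probs = {
    "Business-Finance": 0.50,
    "Lifestyle-Leisure": 0.10,
    "Science-Technology": 0.30,
    "Sports": 0.10,
}
visual_probs = {
    "Business-Finance": 0.20,
    "Lifestyle-Leisure": 0.20,
    "Science-Technology": 0.50,
    "Sports": 0.10,
}

# Illustrative weights only: the single-modality accuracies from the abstract.
fused = late_fusion(text_probs, visual_probs, w_text=0.844, w_visual=0.53)
label = predict(fused)
print(label, fused)
```

In this made-up example the two modalities disagree, and the more heavily weighted textual model decides the fused label. With a Random Forests implementation, the per-class probabilities would typically come from the fraction of trees voting for each class.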
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Liparas, D., HaCohen-Kerner, Y., Moumtzidou, A., Vrochidis, S., Kompatsiaris, I. (2014). News Articles Classification Using Random Forests and Weighted Multimodal Features. In: Lamas, D., Buitelaar, P. (eds) Multidisciplinary Information Retrieval. IRFC 2014. Lecture Notes in Computer Science, vol 8849. Springer, Cham. https://doi.org/10.1007/978-3-319-12979-2_6
DOI: https://doi.org/10.1007/978-3-319-12979-2_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12978-5
Online ISBN: 978-3-319-12979-2
eBook Packages: Computer Science, Computer Science (R0)