Abstract
Currently, a large share of the information available on websites such as social networks, forums and blogs comes from user interaction: readers can post comments and often become regular visitors. Some blogs that specialize in particular subjects gain credibility with users and become references in their field. Nevertheless, the ease of inserting content through text comments makes room for unwanted messages, which degrade the user experience, reduce the quality of the information provided by the websites and indirectly cause personal and economic losses. In this scenario, this paper presents a comprehensive study of established machine learning techniques applied to automatically detecting undesired comments posted on blogs. Furthermore, different sets of attributes were evaluated along with text normalization techniques. Experiments carried out with a real and public database indicate that support vector machines, logistic regression and stacking ensemble methods, trained with both attributes extracted from the text messages and posting information, are promising for the task of blocking undesired comments.
Cite this article
Alberto, T.C., Lochter, J.V. & Almeida, T.A. Post or Block? Advances in Automatically Filtering Undesired Comments. J Intell Robot Syst 80 (Suppl 1), 245–259 (2015). https://doi.org/10.1007/s10846-014-0105-y
DOI: https://doi.org/10.1007/s10846-014-0105-y