Post or Block? Advances in Automatically Filtering Undesired Comments

Journal of Intelligent & Robotic Systems

Abstract

Much of the information available on websites such as social networks, forums and blogs now comes from user interaction: readers post comments and often become regular visitors. Blogs that specialize in a particular subject can earn their readers' trust and become references in the field. However, the ease of posting text comments also makes room for unwanted messages, which degrade the user experience, reduce the quality of the information the websites provide and indirectly cause personal and economic losses. This paper presents a comprehensive study of established machine learning techniques applied to the automatic detection of undesired comments posted on blogs. Different sets of attributes were also evaluated, along with text normalization techniques. Experiments carried out with a real, public database indicate that support vector machines, logistic regression and stacking ensemble methods, trained with both attributes extracted from the text messages and posting information, are promising for the task of blocking undesired comments.
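
To make the idea of combining text attributes with posting information concrete, the following is a minimal sketch in plain Python, not the authors' pipeline. It trains a small logistic regression by stochastic gradient descent on token-presence features plus one posting attribute. The field names `text` and `author_url`, the `urltoken` placeholder, the hyperparameters and the four toy comments are all assumptions made for illustration; the abstract does not specify the attributes or the normalization steps used.

```python
import math
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TOKEN_RE = re.compile(r"[a-z0-9_]+")

def normalize(text):
    """Lowercase the text and replace each URL with a placeholder token."""
    return URL_RE.sub(" urltoken ", text.lower())

def featurize(comment):
    """Build a sparse feature dict: token presence plus one posting attribute."""
    features = {"bias": 1.0}
    for token in TOKEN_RE.findall(normalize(comment["text"])):
        features["tok=" + token] = 1.0
    # Hypothetical posting attribute: did the commenter supply an author URL?
    features["meta=has_author_url"] = 1.0 if comment["author_url"] else 0.0
    return features

def predict_proba(weights, features):
    """Return the estimated probability that a comment is undesired."""
    z = sum(weights.get(name, 0.0) * value for name, value in features.items())
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-z))

def train(comments, labels, epochs=200, lr=0.5):
    """Fit logistic regression weights with plain stochastic gradient descent."""
    weights = {}
    rows = [featurize(c) for c in comments]
    for _ in range(epochs):
        for features, label in zip(rows, labels):
            error = predict_proba(weights, features) - label
            for name, value in features.items():
                weights[name] = weights.get(name, 0.0) - lr * error * value
    return weights

# Toy data invented for this sketch (label 1 = undesired, 0 = legitimate).
COMMENTS = [
    {"text": "Great post, thanks for sharing the analysis", "author_url": ""},
    {"text": "I disagree with the second point but nice write-up", "author_url": ""},
    {"text": "Cheap pills at http://example.com buy now",
     "author_url": "http://example.com"},
    {"text": "Buy cheap watches www.example.org best price",
     "author_url": "http://example.org"},
]
LABELS = [0, 0, 1, 1]

weights = train(COMMENTS, LABELS)

new_comment = {"text": "buy cheap pills now http://example.net",
               "author_url": "http://example.net"}
decision = "block" if predict_proba(weights, featurize(new_comment)) > 0.5 else "post"
print(decision)
```

Because text tokens and posting information share one feature dictionary, the same `featurize` output could feed a support vector machine or serve as a base learner in a stacking ensemble. A real evaluation would need a labeled corpus and held-out data rather than four hand-written comments.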


Author information

Corresponding author

Correspondence to Tiago A. Almeida.

About this article

Cite this article

Alberto, T.C., Lochter, J.V. & Almeida, T.A. Post or Block? Advances in Automatically Filtering Undesired Comments. J Intell Robot Syst 80 (Suppl 1), 245–259 (2015). https://doi.org/10.1007/s10846-014-0105-y

  • DOI: https://doi.org/10.1007/s10846-014-0105-y
