Abstract
Web spam refers to Web pages that use tricks to mislead search engines into ranking them higher than they deserve. It causes serious damage to e-commerce and Web users and threatens Web security, so combating it is an urgent task. In this paper, Web quality and semantic measurements are integrated with content and link features to construct a more representative feature set. A cascade detection mechanism based on an entropy-based outlier mining (EOM) algorithm is proposed. The mechanism consists of three stages, each using a different feature group. Experiments on WEBSPAM-UK2007 show that the quality and semantic features effectively improve detection, and that the EOM algorithm outperforms many classic classification algorithms on unbalanced data. The cascade detection mechanism can clean out more spam.
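The abstract does not define the EOM algorithm or the three feature groups, so the following is only a minimal sketch of the general idea of a staged, entropy-style outlier filter, not the authors' method. It assumes features are already discretized, scores each page by its average surprisal (-log2 of the empirical frequency of each feature value), and removes flagged pages before the next stage runs. The feature names, the toy data, and the 2.0-bit threshold are all hypothetical.

```python
import math
from collections import Counter


def surprisal_scores(rows, features):
    """Average surprisal (in bits) of each row over the given discrete features."""
    n = len(rows)
    counts = {f: Counter(r[f] for r in rows) for f in features}
    return [
        sum(-math.log2(counts[f][r[f]] / n) for f in features) / len(features)
        for r in rows
    ]


def cascade_detect(rows, stages, threshold):
    """Run the stages in order; rows flagged at one stage are dropped before the next."""
    remaining = list(range(len(rows)))
    flagged = []
    for features in stages:
        if not remaining:
            break
        subset = [rows[i] for i in remaining]
        scores = surprisal_scores(subset, features)
        keep = []
        for idx, score in zip(remaining, scores):
            (flagged if score > threshold else keep).append(idx)
        remaining = keep
    return flagged


# Hypothetical toy data: seven ordinary pages plus three pages that are each
# unusual in exactly one (assumed) feature group.
normal = {"content": "low", "link": "low", "quality": "high"}
rows = [dict(normal) for _ in range(7)]
rows.append({**normal, "quality": "low"})   # index 7
rows.append({**normal, "content": "high"})  # index 8
rows.append({**normal, "link": "high"})     # index 9

# Assumed stage order: content features, then link features, then quality features.
stages = [["content"], ["link"], ["quality"]]

print(cascade_detect(rows, stages, threshold=2.0))  # [8, 9, 7]
```

Because flagged pages are removed between stages, the value frequencies used by later stages are computed only on pages that passed the earlier ones, which is the property that makes the filter a cascade rather than three independent detectors.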
Acknowledgements
This work was supported by the Academic and Technological Leadership Training Foundation of Sichuan Province, China [WZ0100112371601/004, WZ0100112371408, YH1500411031402].
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Wei, S., Zhu, Y. (2017). Cleaning Out Web Spam by Entropy-Based Cascade Outlier Detection. In: Benslimane, D., Damiani, E., Grosky, W., Hameurlain, A., Sheth, A., Wagner, R. (eds) Database and Expert Systems Applications. DEXA 2017. Lecture Notes in Computer Science, vol 10439. Springer, Cham. https://doi.org/10.1007/978-3-319-64471-4_19
DOI: https://doi.org/10.1007/978-3-319-64471-4_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64470-7
Online ISBN: 978-3-319-64471-4
eBook Packages: Computer Science, Computer Science (R0)