DOI: 10.1145/1244408.1244410
Article

Splog detection using self-similarity analysis on blog temporal dynamics

Published: 08 May 2007

ABSTRACT

This paper focuses on spam blog (splog) detection. Blogs are a highly popular new-media mechanism for social communication. The presence of splogs degrades blog search results and wastes network resources. Our approach exploits the unique temporal dynamics of blogs to detect splogs.

There are three key ideas in our splog detection framework. First, we represent the blog temporal dynamics using self-similarity matrices defined on the histogram intersection similarity measure of the time, content, and link attributes of posts. Second, we show via a novel visualization that the blog temporal characteristics reveal attribute correlation, depending on the type of blog (normal blog or splog). Third, we propose the use of temporal structural properties computed from self-similarity matrices across different attributes. In a splog detector, these novel features are combined with content-based features. We extract a content-based feature vector from different parts of the blog -- URLs, post content, etc. The dimensionality of the feature vector is reduced by Fisher linear discriminant analysis. We have tested an SVM-based splog detector using the proposed features on real-world datasets, achieving 90% accuracy.
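
To illustrate the first idea, here is a minimal sketch of a self-similarity matrix built with histogram intersection over post content. It is not the authors' implementation. It assumes each post is already tokenized into words, that each histogram is normalized to sum to one, and that similarity is the plain sum of bin-wise minima. The function names and the toy posts are hypothetical. The abstract also describes matrices on time and link attributes, which would follow the same pattern with different histograms.

```python
from collections import Counter


def normalized_histogram(tokens):
    """Turn a list of tokens into a histogram whose values sum to 1 (assumed normalization)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {token: n / total for token, n in counts.items()}


def histogram_intersection(h1, h2):
    """Sum of bin-wise minima: 1 for identical normalized histograms, 0 for disjoint ones."""
    return sum(min(weight, h2.get(token, 0.0)) for token, weight in h1.items())


def self_similarity_matrix(posts):
    """posts: token lists in chronological order. Returns an n x n list of lists."""
    hists = [normalized_histogram(p) for p in posts]
    n = len(hists)
    return [[histogram_intersection(hists[i], hists[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical toy data: the first and third posts share most of their words.
posts = [
    "cheap pills buy now".split(),
    "my trip to the lake".split(),
    "buy cheap pills today".split(),
]
S = self_similarity_matrix(posts)
```

In this sketch, entry `S[i][j]` compares post `i` with post `j`, so repeated content at regular intervals would appear as off-diagonal structure in the matrix.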


            • Published in

              AIRWeb '07: Proceedings of the 3rd international workshop on Adversarial information retrieval on the web
              May 2007
              98 pages
              ISBN:9781595937322
              DOI:10.1145/1244408

              Copyright © 2007 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

              Publisher

              Association for Computing Machinery

              New York, NY, United States

