Splog detection using self-similarity analysis on blog temporal dynamics

Authors:
Yu-Ru Lin, Arizona State University
Hari Sundaram, Arizona State University
Yun Chi, NEC Laboratories America, Cupertino, CA
Junichi Tatemura, NEC Laboratories America, Cupertino, CA
Belle L. Tseng, NEC Laboratories America, Cupertino, CA

AIRWeb '07: Proceedings of the 3rd international workshop on Adversarial information retrieval on the web, May 2007, Pages 1–8
https://doi.org/10.1145/1244408.1244410

Published: 08 May 2007

ABSTRACT

This paper focuses on spam blog (splog) detection. Blogs are a highly popular new-media mechanism for social communication. Splogs degrade blog search results and waste network resources. Our approach exploits the distinctive temporal dynamics of blogs to detect splogs.

Our splog detection framework rests on three key ideas. First, we represent blog temporal dynamics using self-similarity matrices defined on the histogram intersection similarity measure of the time, content, and link attributes of posts. Second, we show via a novel visualization that blog temporal characteristics reveal attribute correlation that depends on the type of blog (normal blog or splog). Third, we propose the use of temporal structural properties computed from self-similarity matrices across different attributes. In a splog detector, these novel features are combined with content-based features. We extract a content-based feature vector from different parts of the blog, such as URLs and post content. The dimensionality of the feature vector is reduced by Fisher linear discriminant analysis. We tested an SVM-based splog detector using the proposed features on real-world datasets, with excellent results (90% accuracy).
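To make the first idea concrete, the following is a minimal sketch of a self-similarity matrix over a blog's posts, built on a histogram intersection measure. It is an illustration under stated assumptions, not the authors' implementation: the normalization (sum of minima divided by sum of maxima), the function names `histogram_intersection` and `self_similarity_matrix`, and the toy posts are all hypothetical. Only the content attribute is shown; time and link attributes would need their own histograms.

```python
from collections import Counter


def histogram_intersection(h1, h2):
    """Similarity in [0, 1] between two count histograms.

    Assumed normalization: sum of bin-wise minima over sum of
    bin-wise maxima. The abstract does not specify the exact form.
    """
    keys = set(h1) | set(h2)
    overlap = sum(min(h1.get(k, 0), h2.get(k, 0)) for k in keys)
    union = sum(max(h1.get(k, 0), h2.get(k, 0)) for k in keys)
    # Two empty histograms are treated as dissimilar by convention here.
    return overlap / union if union else 0.0


def self_similarity_matrix(histograms):
    """Entry (i, j) is the similarity between post i and post j."""
    n = len(histograms)
    return [
        [histogram_intersection(histograms[i], histograms[j]) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy posts, ordered by post time.
posts = [
    "cheap pills buy now",
    "cheap pills buy today",
    "my trip to the lake",
]

# Content attribute: one word-count histogram per post.
content_hists = [Counter(p.split()) for p in posts]
S = self_similarity_matrix(content_hists)

for row in S:
    print([round(v, 2) for v in row])
```

In this toy case the two near-duplicate posts share three of five distinct words, so their similarity is 0.6, while the unrelated post scores 0 against both. A blog whose matrix shows large, regular high-similarity blocks off the diagonal has highly repetitive posts, which is the kind of temporal regularity the abstract proposes to turn into features.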

Published in
AIRWeb '07: Proceedings of the 3rd international workshop on Adversarial information retrieval on the web
May 2007, 98 pages
ISBN: 9781595937322
DOI: 10.1145/1244408
Conference Chairs: Carlos Castillo (Yahoo! Research), Kumar Chellapilla (Microsoft Live Labs), Brian D. Davison (Lehigh University)
Copyright © 2007 ACM
Publisher: Association for Computing Machinery, New York, NY, United States

Author Tags
blogs
regularity
self-similarity
spam
splog detection
temporal dynamics
topology