DOI: 10.1145/1531914.1531922
research-article

Linked latent Dirichlet allocation in web spam filtering

Published: 21 April 2009

ABSTRACT

Latent Dirichlet allocation (LDA) (Blei, Ng, Jordan 2003) is a fully generative statistical language model of the content and topics of a corpus of documents. In this paper we apply an extension of LDA to web spam classification. Our linked LDA technique also takes linkage into account: topics are propagated along links in such a way that the linked document directly influences the words in the linking document. The inferred LDA model can be applied to classification as a dimensionality reduction, similarly to latent semantic indexing. We test linked LDA on the WEBSPAM-UK2007 corpus. Using a BayesNet classifier, we achieve a 3% improvement in classification AUC over plain LDA with BayesNet, and 8% over the public link features with C4.5. Adding this method to a log-odds based combination of strong link and content baseline classifiers results in a 3% improvement in AUC. Our method even slightly improves over the best Web Spam Challenge 2008 result.
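The abstract does not spell out how the log-odds based combination is computed or how AUC is measured, so the following is only a minimal sketch under stated assumptions: each classifier outputs a spam probability per host, the combination is an equal-weight average of log-odds (the paper may well use learned weights instead), and AUC is computed as the fraction of correctly ordered (spam, non-spam) pairs. The function names and the probability values are hypothetical and do not come from the paper.

```python
import math

def log_odds(p, eps=1e-6):
    """Map a probability to log-odds, clipping to avoid infinities."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def combine_log_odds(prob_lists):
    """Average the per-classifier log-odds for each item.

    Equal weights are an assumption of this sketch, not the paper's scheme.
    """
    n_classifiers = len(prob_lists)
    return [sum(log_odds(p) for p in probs) / n_classifiers
            for probs in zip(*prob_lists)]

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical spam probabilities from three classifiers on five hosts
# (1 = spam, 0 = non-spam). These numbers are made up for illustration.
labels = [1, 0, 1, 0, 0]
link_probs = [0.70, 0.40, 0.45, 0.50, 0.10]
content_probs = [0.60, 0.55, 0.80, 0.30, 0.20]
topic_probs = [0.90, 0.20, 0.60, 0.35, 0.05]

combined = combine_log_odds([link_probs, content_probs, topic_probs])
print("link-only AUC:", auc(link_probs, labels))
print("combined AUC:", auc(combined, labels))
```

Averaging in log-odds space rather than probability space lets a confident classifier outweigh uncertain ones, which is one common motivation for this kind of fusion; whether it matches the combination used in the paper is not something the abstract confirms.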

References

  1. J. Abernethy, O. Chapelle, and C. Castillo. WITCH: A New Approach to Web Spam Detection. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2008.
  2. I. Bíró, J. Szabó, and A. A. Benczúr. Latent Dirichlet Allocation in Web Spam Filtering. Manuscript, 2008.
  3. I. Bíró, J. Szabó, and A. A. Benczúr. Very Large Scale Link Based Latent Dirichlet Allocation for Web Document Classification. Manuscript, http://www.ilab.sztaki.hu/~ibiro/linkedLDA/, 2009.
  4. D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(5):993--1022, 2003.
  5. A. Bratko, B. Filipič, G. Cormack, T. Lynam, and B. Zupan. Spam Filtering Using Statistical Data Compression Models. The Journal of Machine Learning Research, 7:2673--2698, 2006.
  6. C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: web spam detection using the web topology. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 423--430, 2007.
  7. D. Cohn and T. Hofmann. The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity. Advances in Neural Information Processing Systems, pages 430--436, 2001.
  8. S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391--407, 1990.
  9. L. Dietz, S. Bickel, and T. Scheffer. Unsupervised prediction of citation influences. In Proceedings of the 24th international conference on Machine learning, pages 233--240. ACM Press New York, NY, USA, 2007.
  10. E. Erosheva, S. Fienberg, and J. Lafferty. Mixed-membership models of scientific publications, 2004.
  11. D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics -- Using statistical analysis to locate spam web pages. In Proceedings of the 7th International Workshop on the Web and Databases (WebDB), pages 1--6, Paris, France, 2004.
  12. D. Fetterly, M. Manasse, and M. Najork. Detecting phrase-level duplication on the world wide web. In Proceedings of the 28th ACM International Conference on Research and Development in Information Retrieval (SIGIR), Salvador, Brazil, 2005.
  13. T. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl_1):5228--5235, 2004.
  14. Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), Chiba, Japan, 2005.
  15. G. Heinrich. Parameter estimation for text analysis. Technical report, 2004.
  16. M. R. Henzinger, R. Motwani, and C. Silverstein. Challenges in web search engines. SIGIR Forum, 36(2):11--22, 2002.
  17. T. Hofmann. Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning, 42(1):177--196, 2001.
  18. Z. Kou and W. W. Cohen. Stacked graphical models for efficient inference in markov random fields. In SDM 07, 2007.
  19. T. Lynam, G. Cormack, and D. Cheriton. On-line spam filter fusion. In Proc. of the 29th international ACM SIGIR conference on Research and development in information retrieval, pages 123--130, 2006.
  20. R. Nallapati, A. Ahmed, E. Xing, and W. Cohen. Joint Latent Topic Models for Text and Citations. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM Press New York, NY, USA, 2008.
  21. A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In Proceedings of the 15th International World Wide Web Conference (WWW), pages 83--92, Edinburgh, Scotland, 2006.
  22. A. Singhal. Challenges in running a commercial search engine. In IBM Search and Collaboration Seminar 2004. IBM Haifa Labs, 2004.
  23. I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, second edition, June 2005.
  24. X. Zhu, J. Kandola, Z. Ghahramani, and J. Lafferty. Nonparametric transforms of graph kernels for semi-supervised learning. Advances in Neural Information Processing Systems, 17:1641--1648, 2005.

Published in

AIRWeb '09: Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
April 2009
67 pages
ISBN: 9781605584386
DOI: 10.1145/1531914

Copyright © 2009 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States
