research-article

Linked latent Dirichlet allocation in web spam filtering

Authors:
István Bíró, Dávid Siklósi, Jácint Szabó, András A. Benczúr

Computer and Automation Research Institute of the Hungarian Academy of Sciences

AIRWeb '09: Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web, April 2009, Pages 37–40
https://doi.org/10.1145/1531914.1531922

Published: 21 April 2009

ABSTRACT

Latent Dirichlet allocation (LDA) (Blei, Ng, Jordan 2003) is a fully generative statistical language model of the content and topics of a corpus of documents. In this paper we apply an extension of LDA to web spam classification. Our linked LDA technique also takes linkage into account: topics are propagated along links in such a way that the linked document directly influences the words in the linking document. As with latent semantic indexing, the inferred LDA model can be used for classification as a form of dimensionality reduction. We test linked LDA on the WEBSPAM-UK2007 corpus. Using a BayesNet classifier, we achieve a 3% improvement in classification AUC over plain LDA with BayesNet, and 8% over the public link features with C4.5. Adding this method to a log-odds based combination of strong link and content baseline classifiers results in a 3% improvement in AUC. Our method even slightly improves over the best Web Spam Challenge 2008 result.
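
The two evaluation ideas named in the abstract, a log-odds based combination of classifiers and AUC, can be illustrated with a minimal Python sketch. It assumes that "log-odds based combination" means averaging the logit of each classifier's predicted spam probability, which is one common reading and not necessarily the exact scheme used in the paper. The function names, the three classifiers, and all scores below are hypothetical and made up for illustration; they are not results from the paper.

```python
import math


def log_odds(p, eps=1e-6):
    """Logit of a probability, clipped away from 0 and 1 so it stays finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def combine_log_odds(score_lists):
    """Average the log-odds of several classifiers, host by host.

    Assumption: an unweighted average; the paper may weight or
    calibrate the individual classifiers differently.
    """
    n = len(score_lists)
    return [sum(log_odds(p) for p in scores) / n for scores in zip(*score_lists)]


def auc(labels, scores):
    """AUC as the fraction of (spam, non-spam) pairs ranked correctly.

    Ties count as one half. Quadratic in the number of hosts, which is
    fine for a toy example.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Made-up spam probabilities for five hosts (1 = spam, 0 = non-spam)
# from three hypothetical classifiers.
labels = [1, 0, 1, 0, 0]
link_scores = [0.9, 0.4, 0.3, 0.2, 0.1]
content_scores = [0.6, 0.5, 0.7, 0.1, 0.3]
lda_scores = [0.8, 0.2, 0.6, 0.3, 0.4]

combined = combine_log_odds([link_scores, content_scores, lda_scores])
print("AUC, link features only:", auc(labels, link_scores))
print("AUC, log-odds combination:", auc(labels, combined))
```

Averaging in log-odds space rather than probability space lets a confident classifier outweigh uncertain ones, which is one reason this kind of fusion is popular in spam filtering. Whether the combination helps on real data depends on how complementary the individual classifiers are.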

References

J. Abernethy, O. Chapelle, and C. Castillo. WITCH: A New Approach to Web Spam Detection. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2008.
I. Bíró, J. Szabó, and A. A. Benczúr. Latent Dirichlet Allocation in Web Spam Filtering. manuscript, 2008.
I. Bíró, J. Szabó, and A. A. Benczúr. Very Large Scale Link Based Latent Dirichlet Allocation for Web Document Classification. manuscript, http://www.ilab.sztaki.hu/~ibiro/linkedLDA/, 2009.
D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(5):993--1022, 2003.
A. Bratko, B. Filipič, G. Cormack, T. Lynam, and B. Zupan. Spam Filtering Using Statistical Data Compression Models. The Journal of Machine Learning Research, 7:2673--2698, 2006.
C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: web spam detection using the web topology. Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 423--430, 2007.
D. Cohn and T. Hofmann. The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity. Advances in Neural Information Processing Systems, pages 430--436, 2001.
S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391--407, 1990.
L. Dietz, S. Bickel, and T. Scheffer. Unsupervised prediction of citation influences. In Proceedings of the 24th international conference on Machine learning, pages 233--240. ACM Press New York, NY, USA, 2007.
E. Erosheva, S. Fienberg, and J. Lafferty. Mixed-membership models of scientific publications, 2004.
D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics -- Using statistical analysis to locate spam web pages. In Proceedings of the 7th International Workshop on the Web and Databases (WebDB), pages 1--6, Paris, France, 2004.
D. Fetterly, M. Manasse, and M. Najork. Detecting phrase-level duplication on the world wide web. In Proceedings of the 28th ACM International Conference on Research and Development in Information Retrieval (SIGIR), Salvador, Brazil, 2005.
T. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl_1):5228--5235, 2004.
Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), Chiba, Japan, 2005.
G. Heinrich. Parameter estimation for text analysis. Technical report, 2004.
M. R. Henzinger, R. Motwani, and C. Silverstein. Challenges in web search engines. SIGIR Forum, 36(2):11--22, 2002.
T. Hofmann. Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning, 42(1):177--196, 2001.
Z. Kou and W. W. Cohen. Stacked graphical models for efficient inference in Markov random fields. In SDM 07, 2007.
T. Lynam, G. Cormack, and D. Cheriton. On-line spam filter fusion. Proc. of the 29th international ACM SIGIR conference on Research and development in information retrieval, pages 123--130, 2006.
R. Nallapati, A. Ahmed, E. Xing, and W. Cohen. Joint Latent Topic Models for Text and Citations. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM Press New York, NY, USA, 2008.
A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In Proceedings of the 15th International World Wide Web Conference (WWW), pages 83--92, Edinburgh, Scotland, 2006.
A. Singhal. Challenges in running a commercial search engine. In IBM Search and Collaboration Seminar 2004. IBM Haifa Labs, 2004.
I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, second edition, June 2005.
X. Zhu, J. Kandola, Z. Ghahramani, and J. Lafferty. Nonparametric transforms of graph kernels for semi-supervised learning. Advances in Neural Information Processing Systems, 17:1641--1648, 2005.

Index Terms

Linked latent Dirichlet allocation in web spam filtering
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
2. Information systems
  1. Information retrieval
  2. Information storage systems

Published in
AIRWeb '09: Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
April 2009
67 pages
ISBN:9781605584386
DOI:10.1145/1531914
Editors: Dennis Fetterly (Microsoft Research), Zoltán Gyöngyi (Google Research)
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
document classification
feature selection
information retrieval
latent Dirichlet allocation
text analysis
web content spam