Naive (Bayes) at forty: The independence assumption in information retrieval

Lewis, David D.

doi:10.1007/BFb0026666

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1398)

Included in the following conference series: European Conference on Machine Learning

Abstract

The naive Bayes classifier, currently experiencing a renaissance in machine learning, has long been a core technique in information retrieval. We review some of the variations of naive Bayes models used for text retrieval and classification, focusing on the distributional assumptions made about word occurrences in documents.

Author information

Authors and Affiliations

AT&T Labs - Research, 180 Park Avenue, Florham Park, NJ 07932-0971, USA
David D. Lewis

Editor information

Claire Nédellec, Céline Rouveirol

About this paper

Cite this paper

Lewis, D.D. (1998). Naive (Bayes) at forty: The independence assumption in information retrieval. In: Nédellec, C., Rouveirol, C. (eds) Machine Learning: ECML-98. ECML 1998. Lecture Notes in Computer Science, vol 1398. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0026666

DOI: https://doi.org/10.1007/BFb0026666
Published: 16 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-64417-0
Online ISBN: 978-3-540-69781-7
eBook Packages: Springer Book Archive
