Abstract
The naive Bayes classifier, currently experiencing a renaissance in machine learning, has long been a core technique in information retrieval. We review some of the variations of naive Bayes models used for text retrieval and classification, focusing on the distributional assumptions made about word occurrences in documents.
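As an illustration of what a distributional assumption about word occurrences can look like in practice, the following is a minimal sketch of one common variant, in which each vocabulary word is treated as an independent binary feature (present or absent in a document) given the class. It is not taken from the paper: the function names, the add-one smoothing, and the toy documents are all assumptions made for this example. Other variants would instead model word counts, which this sketch ignores.

```python
import math
from collections import Counter


def train_bernoulli_nb(docs, labels):
    """Estimate class priors and per-class word-presence probabilities.

    docs is a list of token lists and labels is a parallel list of class
    names. Add-one smoothing is an assumption of this sketch.
    """
    vocab = sorted({w for tokens in docs for w in tokens})
    classes = sorted(set(labels))
    n_docs = Counter(labels)
    # doc_freq[c][w] = number of class-c documents containing w at least once
    doc_freq = {c: Counter() for c in classes}
    for tokens, c in zip(docs, labels):
        doc_freq[c].update(set(tokens))
    prior = {c: n_docs[c] / len(docs) for c in classes}
    p_word = {
        c: {w: (doc_freq[c][w] + 1) / (n_docs[c] + 2) for w in vocab}
        for c in classes
    }
    return vocab, prior, p_word


def classify(tokens, vocab, prior, p_word):
    """Return the class with the highest log posterior score.

    Independence assumption: the score is the log prior plus one log term
    per vocabulary word, for presence or for absence. Repeated words and
    words outside the training vocabulary have no effect.
    """
    present = set(tokens)
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for w in vocab:
            p = p_word[c][w]
            score += math.log(p) if w in present else math.log(1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)


# Toy data, invented purely for illustration.
train_docs = [
    ["cheap", "pills"],
    ["cheap", "offer"],
    ["meeting", "agenda"],
    ["project", "meeting"],
]
train_labels = ["spam", "spam", "ham", "ham"]

vocab, prior, p_word = train_bernoulli_nb(train_docs, train_labels)
predicted = classify(["cheap", "offer", "now"], vocab, prior, p_word)
print(predicted)
```

Because only presence or absence is recorded, a document that repeats a word ten times is scored the same as one that uses it once. That is the kind of modeling choice the distributional assumptions in the abstract refer to.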
Copyright information
© 1998 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lewis, D.D. (1998). Naive (Bayes) at forty: The independence assumption in information retrieval. In: Nédellec, C., Rouveirol, C. (eds) Machine Learning: ECML-98. ECML 1998. Lecture Notes in Computer Science, vol 1398. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0026666
DOI: https://doi.org/10.1007/BFb0026666
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-64417-0
Online ISBN: 978-3-540-69781-7
eBook Packages: Springer Book Archive