Abstract
Many KDD systems incorporate an implicit or explicit preference for simpler models, but this use of “Occam's razor” has been strongly criticized by several authors (e.g., Schaffer, 1993; Webb, 1996). This controversy arises partly because Occam's razor has been interpreted in two quite different ways. The first interpretation (simplicity is a goal in itself) is essentially correct, but is at heart a preference for more comprehensible models. The second interpretation (simplicity leads to greater accuracy) is much more problematic. A critical review of the theoretical arguments for and against it shows that it is unfounded as a universal principle, and demonstrably false. A review of empirical evidence shows that it also fails as a practical heuristic. This article argues that its continued use in KDD risks causing significant opportunities to be missed, and should therefore be restricted to the comparatively few applications where it is appropriate. The article proposes and reviews the use of domain constraints as an alternative for avoiding overfitting, and examines possible methods for handling the accuracy–comprehensibility trade-off.
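The abstract's claim that "simplicity leads to greater accuracy" fails as a universal principle can be made concrete with a small worked example. The sketch below is hypothetical and not taken from the article: data are generated by a genuinely quadratic process, and the simpler linear model, fit by ordinary least squares, generalizes worse than the more complex quadratic model (fit by solving the 3x3 normal equations with Cramer's rule).

```python
import random

# Hypothetical illustration (not from the article): when the true
# structure is quadratic, preferring the simpler linear model REDUCES
# out-of-sample accuracy.

random.seed(0)

def make_data(n):
    # y = x^2 plus small Gaussian noise, x uniform on [0, 2]
    xs = [random.uniform(0.0, 2.0) for _ in range(n)]
    ys = [x * x + random.gauss(0.0, 0.05) for x in xs]
    return xs, ys

def fit_linear(xs, ys):
    # ordinary least squares for y = a + b*x (closed form)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return lambda x: a + b * x

def det3(m):
    # determinant of a 3x3 matrix by cofactor expansion
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def fit_quadratic(xs, ys):
    # least squares for y = a + b*x + c*x^2: solve the normal
    # equations (X'X) theta = X'y via Cramer's rule
    n = len(xs)
    s = lambda p: sum(x ** p for x in xs)
    sy = lambda p: sum((x ** p) * y for x, y in zip(xs, ys))
    A = [[n, s(1), s(2)], [s(1), s(2), s(3)], [s(2), s(3), s(4)]]
    rhs = [sy(0), sy(1), sy(2)]
    d = det3(A)
    theta = []
    for i in range(3):
        Ai = [row[:] for row in A]
        for r in range(3):
            Ai[r][i] = rhs[r]
        theta.append(det3(Ai) / d)
    a, b, c = theta
    return lambda x: a + b * x + c * x * x

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = make_data(50)
test_x, test_y = make_data(200)

lin_mse = mse(fit_linear(train_x, train_y), test_x, test_y)
quad_mse = mse(fit_quadratic(train_x, train_y), test_x, test_y)
# here the more complex (quadratic) model has lower test error
```

This does not contradict the first interpretation of the razor discussed in the abstract: the linear model is still easier to read. It only shows that simplicity by itself is no guarantee of accuracy when the simpler hypothesis space does not contain the true structure.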
References
Abu-Mostafa, Y.S. 1989. Learning from hints in neural networks. Journal of Complexity, 6:192-198.
Akaike, H. 1978. A Bayesian analysis of the minimum AIC procedure. Annals of the Institute of Statistical Mathematics, 30A:9-14.
Andrews, R. and Diederich, J. (Eds.). 1996. Proceedings of the NIPS-96 Workshop on Rule Extraction from Trained Artificial Neural Networks, Snowmass, CO: NIPS Foundation.
Bernardo, J.M. and Smith, A.F.M. 1994. Bayesian Theory. New York, NY: Wiley.
Bishop, C.M. 1995. Neural Networks for Pattern Recognition. Oxford, UK: Oxford University Press.
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M.K. 1987. Occam's razor. Information Processing Letters, 24:377-380.
Breiman, L. 1996. Bagging predictors. Machine Learning, 24:123-140.
Breiman, L. and Shang, N. 1997. Born again trees. Technical Report, Berkeley, CA: Statistics Department, University of California at Berkeley.
Brunk, C., Kelly, J., and Kohavi, R. 1997. MineSet: An integrated system for data mining. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 135-138.
Cestnik, B. and Bratko, I. 1988. Learning redundant rules in noisy domains. Proceedings of the Eighth European Conference on Artificial Intelligence, Munich, Germany: Pitman, pp. 348-356.
Cheeseman, P. 1990. On finding the most probable model. In Computational Models of Scientific Discovery and Theory Formation, J. Shrager and P. Langley (Eds.). San Mateo, CA: Morgan Kaufmann, pp. 73-95.
Chickering, D.M. and Heckerman, D. 1997. Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables. Machine Learning, 29:181-212.
Clark, P. and Matwin, S. 1993. Using qualitative models to guide inductive learning. Proceedings of the Tenth International Conference on Machine Learning, Amherst, MA: Morgan Kaufmann, pp. 49-56.
Clearwater, S. and Provost, F. 1990. RL4: A tool for knowledge-based induction. Proceedings of the Second IEEE International Conference on Tools for Artificial Intelligence, San Jose, CA: IEEE Computer Society Press, pp. 24-30.
Cohen, W.W. 1994. Grammatically biased learning: Learning logic programs using an explicit antecedent description language. Artificial Intelligence, 68:303-366.
Cohen, W.W. 1995. Fast effective rule induction. Proceedings of the Twelfth International Conference on Machine Learning, Tahoe City, CA: Morgan Kaufmann, pp. 115-123.
Cooper, G.F. 1997. A simple constraint-based algorithm for efficiently mining observational databases for causal relationships. Data Mining and Knowledge Discovery, 1:203-224.
Cover, T.M. and Thomas, J.A. 1991. Elements of Information Theory. New York, NY: Wiley.
Craven, M.W. 1996. Extracting comprehensible models from trained neural networks. Unpublished doctoral dissertation, Department of Computer Sciences, University of Wisconsin—Madison, Madison, WI.
Datta, P. and Kibler, D. 1995. Learning prototypical concept descriptions. Proceedings of the Twelfth International Conference on Machine Learning, Tahoe City, CA: Morgan Kaufmann, pp. 158-166.
Djoko, S., Cook, D.J., and Holder, L.B. 1995. Analyzing the benefits of domain knowledge in substructure discovery. Proceedings of the First International Conference on Knowledge Discovery and Data Mining, Montréal, Canada: AAAI Press, pp. 75-80.
Domingos, P. 1996a. Two-way induction. International Journal on Artificial Intelligence Tools, 5:113-125.
Domingos, P. 1996b. Unifying instance-based and rule-based induction. Machine Learning, 24:141-168.
Domingos, P. 1997a. Knowledge acquisition from examples via multiple models. Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, TN: Morgan Kaufmann, pp. 98-106.
Domingos, P. 1997b. Why does bagging work? A Bayesian account and its implications. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 155-158.
Domingos, P. 1998a. A process-oriented heuristic for model selection. Proceedings of the Fifteenth International Conference on Machine Learning, Madison, WI: Morgan Kaufmann, pp. 127-135.
Domingos, P. 1998b. When (and how) to combine predictive and causal learning. Proceedings of the NIPS-98 Workshop on Integrating Supervised and Unsupervised Learning, Breckenridge, CO: NIPS Foundation.
Domingos, P. 1999. Process-oriented estimation of generalization error. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, Stockholm, Sweden: Morgan Kaufmann.
Domingos, P. and Pazzani, M. 1997. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29:103-130.
Donoho, S. and Rendell, L. 1996. Constructive induction using fragmentary knowledge. Proceedings of the Thirteenth International Conference on Machine Learning, Bari, Italy: Morgan Kaufmann, pp. 113-121.
Drucker, H., Cortes, C., Jackel, L.D., LeCun, Y., and Vapnik, V. 1994. Boosting and other machine learning algorithms. Proceedings of the Eleventh International Conference on Machine Learning, New Brunswick, NJ: Morgan Kaufmann, pp. 53-61.
Edgington, E.S. 1980. Randomization Tests. New York, NY: Marcel Dekker.
Elomaa, T. 1994. In defense of C4.5: Notes on learning one-level decision trees. Proceedings of the Eleventh International Conference on Machine Learning, New Brunswick, NJ: Morgan Kaufmann, pp. 62-69.
Fisher, D.H. and Schlimmer, J.C. 1988. Concept simplification and prediction accuracy. Proceedings of the Fifth International Conference on Machine Learning, Ann Arbor, MI: Morgan Kaufmann, pp. 22-28.
Freund, Y. and Schapire, R.E. 1996. Experiments with a new boosting algorithm. Proceedings of the Thirteenth International Conference on Machine Learning, Bari, Italy: Morgan Kaufmann, pp. 148-156.
Friedman, J.H. 1997. On bias, variance, 0/1—loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery, 1:55-77.
Gams, M. 1989. New measurements highlight the importance of redundant knowledge. Proceedings of the Fourth European Working Session on Learning, Montpellier, France: Pitman, pp. 71-79.
Geman, S., Bienenstock, E., and Doursat, R. 1992. Neural networks and the bias/variance dilemma. Neural Computation, 4:1-58.
Grove, A.J. and Schuurmans, D. 1998. Boosting in the limit: Maximizing the margin of learned ensembles. Proceedings of the Fifteenth National Conference on Artificial Intelligence, Madison, WI: AAAI Press, pp. 692-699.
Han, J., Fu, Y., Wang, W., Chiang, J., Gong, W., Koperski, K., Li, D., Lu, Y., Rajan, A., Stefanovic, N., Xia, B., and Zaiane, O. 1996. DBMiner: A system for mining knowledge in large relational databases. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR: AAAI Press, pp. 250-255.
Hasling, D.W., Clancey, W.J., and Rennels, G. 1984. Strategic explanations for a diagnostic consultation system. Developments in Expert Systems, M.J. Coombs (Ed.), London, UK: Academic Press, pp. 117-133.
Haussler, D. 1988. Quantifying inductive bias: AI learning algorithms and Valiant's learning framework. Artificial Intelligence, 36:177-221.
Heckerman, D., Geiger, D., and Chickering, D.M. 1995. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197-243.
Holte, R.C. 1993. Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11:63-91.
Imielinski, T., Virmani, A., and Abdulghani, A. 1996. DataMine: Application programming interface and query language for database mining. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR: AAAI Press, pp. 256-262.
Jensen, D. 1992. Induction with Randomization Testing: Decision-Oriented Analysis of Large Data Sets. Unpublished doctoral dissertation, Washington University, Saint Louis, MO.
Jensen, D. and Cohen, P.R. 1999. Multiple comparisons in induction algorithms. Machine Learning, to appear.
Jensen, D. and Schmill, M. 1997. Adjusting for multiple comparisons in decision tree pruning. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 195-198.
Joachims, T. 1998. Text categorization with support vector machines: Learning with many relevant features. Proceedings of the Tenth European Conference on Machine Learning, Chemnitz, Germany: Springer-Verlag.
Kamber, M., Han, J., and Chiang, J.Y. 1997. Metarule-guided mining of multi-dimensional association rules using data cubes. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 207-210.
Kohavi, R. and Kunz, C. 1997. Option decision trees with majority votes. Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, TN: Morgan Kaufmann, pp. 161-169.
Kohavi, R. and Sommerfield, D. 1998. Targeting business users with decision table classifiers. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, New York, NY: AAAI Press, pp. 249-253.
Kong, E.B. and Dietterich, T.G. 1995. Error-correcting output coding corrects bias and variance. Proceedings of the Twelfth International Conference on Machine Learning, Tahoe City, CA: Morgan Kaufmann, pp. 313-321.
Kononenko, I. 1990. Comparison of inductive and naive Bayesian learning approaches to automatic knowledge acquisition. In Current Trends in Knowledge Acquisition, B. Wielinga (Ed.). Amsterdam, The Netherlands: IOS Press.
Langley, P. 1996. Induction of condensed determinations. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR: AAAI Press, pp. 327-330.
Lawrence, S., Giles, C.L., and Tsoi, A.C. 1997. Lessons in neural network training: Overfitting may be harder than expected. Proceedings of the Fourteenth National Conference on Artificial Intelligence, Providence, RI: AAAI Press, pp. 540-545.
Lee, Y., Buchanan, B.G., and Aronis, J.M. 1998. Knowledge-based learning in exploratory science: Learning rules to predict rodent carcinogenicity. Machine Learning, 30:217-240.
Liu, B., Hsu, W., and Chen, S. 1997. Using general impressions to analyze discovered classification rules. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 31-36.
MacKay, D. 1992. Bayesian interpolation. Neural Computation, 4:415-447.
Maclin, R. and Opitz, D. 1997. An empirical evaluation of bagging and boosting. Proceedings of the Fourteenth National Conference on Artificial Intelligence, Providence, RI: AAAI Press.
Maclin, R. and Shavlik, J. 1996. Creating advice-taking reinforcement learners. Machine Learning, 22:251-281.
Meo, R., Psaila, G., and Ceri, S. 1996. A new SQL-like operator for mining association rules. Proceedings of the Twenty-Second International Conference on Very Large Databases, Bombay, India: Morgan Kaufmann, pp. 122-133.
Miller, Jr., R.G. 1981. Simultaneous Statistical Inference, 2nd ed. New York, NY: Springer-Verlag.
Mingers, J. 1989. An empirical comparison of pruning methods for decision tree induction. Machine Learning, 4:227-243.
Mitchell, T.M. 1980. The need for biases in learning generalizations, Technical report, New Brunswick, NJ: Computer Science Department, Rutgers University.
Murphy, P. and Pazzani, M. 1994. Exploring the decision forest: An empirical investigation of Occam's razor in decision tree induction. Journal of Artificial Intelligence Research, 1:257-275.
Murthy, S. and Salzberg, S. 1995. Lookahead and pathology in decision tree induction. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, Montréal, Canada: Morgan Kaufmann, pp. 1025-1031.
Nédellec, C., Rouveirol, C., Adé, H., Bergadano, F., and Tausend, B. 1996. Declarative bias in ILP. In Advances in Inductive Logic Programming, L. de Raedt (Ed.). Amsterdam, The Netherlands: IOS Press, pp. 82-103.
Oates, T. and Jensen, D. 1998. Large datasets lead to overly complex models: An explanation and a solution. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, New York, NY: AAAI Press, pp. 294-298.
Ourston, D. and Mooney, R.J. 1994. Theory refinement combining analytical and empirical methods. Artificial Intelligence, 66:273-309.
Padmanabhan, B. and Tuzhilin, A. 1998. A belief-driven method for discovering unexpected patterns. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, New York, NY: AAAI Press, pp. 94-100.
Pazzani, M., Mani, S., and Shankle, W.R. 1997. Beyond concise and colorful: Learning intelligible rules. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 235-238.
Pazzani, M.J. 1991. Influence of prior knowledge on concept acquisition: Experimental and computational results. Journal of Experimental Psychology: Learning, Memory, and Cognition, 17:416-432.
Pearl, J. 1978. On the connection between the complexity and credibility of inferred models. International Journal of General Systems, 4:255-264.
Piatetsky-Shapiro, G. 1996. Editorial comments. KDD Nuggets, 96:28.
Provost, F. and Jensen, D. 1998. KDD-98 Tutorial on Evaluating Knowledge Discovery and Data Mining. New York, NY: AAAI Press.
Quinlan, J.R. 1996. Bagging, boosting, and C4.5. Proceedings of the Thirteenth National Conference on Artificial Intelligence, Portland, OR: AAAI Press, pp. 725-730.
Quinlan, J.R. and Cameron-Jones, R.M. 1995. Oversearching and layered search in empirical learning. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, Montréal, Canada: Morgan Kaufmann, pp. 1019-1024.
Quinlan, J.R. and Rivest, R.L. 1989. Inferring decision trees using the minimum description length principle. Information and Computation, 80:227-248.
Rao, J.S. and Potts, W.J.E. 1997. Visualizing bagged decision trees. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 243-246.
Rao, R.B., Gordon, D., and Spears, W. 1995. For every action, is there really an equal and opposite reaction? Analysis of the conservation law for generalization performance. Proceedings of the Twelfth International Conference on Machine Learning, Tahoe City, CA: Morgan Kaufmann, pp. 471-479.
Rissanen, J. 1978. Modeling by shortest data description. Automatica, 14:465-471.
Russell, S.J. 1986. Preliminary steps towards the automation of induction. Proceedings of the Fifth National Conference on Artificial Intelligence, Philadelphia, PA: AAAI Press, pp. 477-484.
Schaffer, C. 1993. Overfitting avoidance as bias. Machine Learning, 10:153-178.
Schaffer, C. 1994. A conservation law for generalization performance. Proceedings of the Eleventh International Conference on Machine Learning, New Brunswick, NJ: Morgan Kaufmann, pp. 259-265.
Schapire, R.E., Freund, Y., Bartlett, P., and Lee, W.S. 1997. Boosting the margin: A new explanation for the effectiveness of voting methods. Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, TN: Morgan Kaufmann.
Schölkopf, B., Burges, C., and Smola, A. 1998. Advances in Kernel Methods: Support Vector Machines. Cambridge, MA: MIT Press.
Schölkopf, B., Burges, C., and Vapnik, V. 1995. Extracting support data for a given task. Proceedings of the First International Conference on Knowledge Discovery and Data Mining, Montréal, Canada: AAAI Press, pp. 252-257.
Schuurmans, D. 1997. A new metric-based approach to model selection. Proceedings of the Fourteenth National Conference on Artificial Intelligence, Providence, RI: AAAI Press, pp. 552-558.
Schuurmans, D., Ungar, L.H., and Foster, D.P. 1997. Characterizing the generalization performance of model selection strategies. Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, TN: Morgan Kaufmann, pp. 340-348.
Schwarz, G. 1978. Estimating the dimension of a model. Annals of Statistics, 6:461-464.
Shawe-Taylor, J., Bartlett, P.L., Williamson, R.C., and Anthony, M. 1996. Structural risk minimization over data-dependent hierarchies, Technical report No. NC-TR-96-053, Egham, UK: Department of Computer Science, Royal Holloway, University of London.
Shen, W.-M., Ong, K., Mitbander, B., and Zaniolo, C. 1996. Metaqueries for data mining. In Advances in Knowledge Discovery and Data Mining, U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (Eds.). Menlo Park, CA: AAAI Press, pp. 375-398.
Smola, A., Bartlett, P., Schölkopf, B., and Schuurmans, D. (Eds.). 1998. Proceedings of the NIPS-98 Workshop on Large Margin Classifiers, Breckenridge, CO: NIPS Foundation.
Srikant, R., Vu, Q., and Agrawal, R. 1997. Mining association rules with item constraints. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 67-73.
Todorovski, L. and Džeroski, S. 1997. Declarative bias in equation discovery. Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, TN: Morgan Kaufmann, pp. 376-384.
Tornay, S.C. 1938. Ockham: Studies and Selections. La Salle, IL: Open Court.
Vapnik, V.N. 1995. The Nature of Statistical Learning Theory. New York, NY: Springer-Verlag.
Wallace, C.S. and Boulton, D.M. 1968. An information measure for classification. Computer Journal, 11:185-194.
Webb, G.I. 1996. Further experimental evidence against the utility of Occam's razor. Journal of Artificial Intelligence Research, 4:397-417.
Webb, G.I. 1997. Decision tree grafting. Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, Nagoya, Japan: Morgan Kaufmann, pp. 846-851.
Wolpert, D. 1992. Stacked generalization. Neural Networks, 5:241-259.
Wolpert, D. 1996. The lack of a priori distinctions between learning algorithms. Neural Computation, 8:1341-1390.
Domingos, P. The Role of Occam's Razor in Knowledge Discovery. Data Mining and Knowledge Discovery 3, 409–425 (1999). https://doi.org/10.1023/A:1009868929893