CONSTRUCTION OF DECISION TREES USING DATA CUBE

Fu, Lixin

doi:10.1007/978-1-4020-5347-4_10


Abstract

Data classification is an important problem in data mining. Traditional classification algorithms based on decision trees are widely used because they build models quickly and the resulting models are easy to understand. However, existing decision tree algorithms must recursively partition the dataset into subsets according to some splitting criterion; that is, they must repeatedly compute the records belonging to a node (called F-sets) and then compute the splits for that node. For large data sets, this requires multiple passes over the original dataset and is therefore infeasible in many applications. In this paper we present a new approach to constructing decision trees using a pre-computed data cube. We use statistics trees to compute the data cube and then build a decision tree on top of it. Mining aggregated data stored in a data cube is much more efficient than mining flat data files or relational databases directly. Since a data cube server is usually a required component of an analytical system for answering OLAP queries, we essentially provide “free” classification by eliminating the dominant I/O overhead of scanning the massive original data set. Our new algorithm generates trees with the same prediction accuracy as existing decision tree algorithms such as SPRINT and RainForest but improves performance significantly. We also give a system architecture that integrates DBMS, OLAP, and data mining seamlessly.
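The idea of evaluating splits on aggregated counts instead of raw records can be illustrated with a minimal sketch. It is not the chapter's algorithm: it assumes a plain dictionary of cell counts in place of statistics trees, uses the Gini index as one common splitting criterion (the abstract does not say which criterion is used), and runs on a made-up toy dataset. All function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (outlook, windy, class_label). Not from the chapter.
RECORDS = [
    ("sunny", "no", "play"),
    ("sunny", "yes", "stay"),
    ("rainy", "no", "stay"),
    ("rainy", "yes", "stay"),
    ("overcast", "no", "play"),
    ("overcast", "yes", "play"),
]


def build_count_cube(records):
    """Single pass over the raw data: count records per (attributes..., class) cell.

    Assumption: a flat dictionary of counts stands in for the data cube.
    """
    return Counter(records)


def gini(class_counts):
    """Gini impurity of a class-count distribution."""
    total = sum(class_counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in class_counts.values())


def split_gini(cube, attr, path):
    """Weighted Gini impurity of splitting a node on attribute index `attr`.

    `path` maps attribute index -> required value and selects the node.
    Only aggregated counts are read; the raw records are never rescanned.
    """
    groups = defaultdict(Counter)
    for cell, count in cube.items():
        *attrs, label = cell
        if all(attrs[i] == v for i, v in path.items()):
            groups[attrs[attr]][label] += count
    total = sum(sum(g.values()) for g in groups.values())
    if total == 0:
        return 0.0
    return sum(sum(g.values()) / total * gini(g) for g in groups.values())


def best_split(cube, candidate_attrs, path):
    """Pick the candidate attribute with the lowest weighted Gini impurity."""
    return min(candidate_attrs, key=lambda a: split_gini(cube, a, path))


cube = build_count_cube(RECORDS)
root_choice = best_split(cube, [0, 1], {})
```

After the single pass in `build_count_cube`, every call to `split_gini` works only on cell counts, which is the property the abstract relies on to avoid repeated scans of the original data.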

Author information

Authors and Affiliations

University of North Carolina at Greensboro, 383 Bryan Bldg., Greensboro, NC, 27402-6170, USA
Lixin Fu

Editor information

Editors and Affiliations

Florida International University, Miami, FL, U.S.A.
Chin-Sheng Chen
INSTICC/ EST, Setúbal, Portugal
Joaquim Filipe
Universidade Portucalense, Porto, Portugal
Isabel Seruca
INSTICC/ EST, Setúbal, Portugal
José Cordeiro

About this paper

Cite this paper

Fu, L. (2007). CONSTRUCTION OF DECISION TREES USING DATA CUBE. In: Chen, CS., Filipe, J., Seruca, I., Cordeiro, J. (eds) Enterprise Information Systems VII. Springer, Dordrecht. https://doi.org/10.1007/978-1-4020-5347-4_10

DOI: https://doi.org/10.1007/978-1-4020-5347-4_10
Publisher Name: Springer, Dordrecht
Print ISBN: 978-1-4020-5323-8
Online ISBN: 978-1-4020-5347-4
eBook Packages: Computer Science (R0)