Framework for formal implementation of the business understanding phase of data mining projects
Introduction
Data mining (DM) has been defined as the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Frawley, Piatetsky-Shapiro, & Matheus, 1992). Research in data mining has addressed applications as diverse as sales and customer relationship management (Berry and Linoff, 1997, Berry and Linoff, 2000, Hung et al., 2006), financial forecasting (Chun & Park, 2006), fraud detection (Fawcett & Provost, 1997), gene mapping (Kantardzic & Zurada, 2005), and mining of health care data (Alonso, 2002, Phillips-Wren et al., 2008), to name a few. Interest in DM has surged mainly because of the rapid growth in the volume of data generated and collected by companies (Han & Kamber, 2006). A KDnuggets poll (June 2007, http://www.kdnuggets.com/polls/) on the largest data set mined found that nearly 22% of respondents reported mining databases of 1 terabyte or more, roughly double the 11.5% of respondents who mined terabyte-size databases in 2006. However, there has also been increasing recognition that mere access to data is not sufficient to learn about interesting patterns in the data or to uncover novel relationships. There is value in having a formal process that details how a DM project can be implemented (Berry & Linoff, 2000). DM methodologies (Anand and Buchner, 1998, Berry and Linoff, 1997, Cabena, 1998, Cios and Kurgan, 2005, CRISP-DM, 2003, Fayyad et al., 1996b) address this issue by providing explicit guidance on implementing data mining projects. They describe a data mining project as consisting of various phases and suggest how each phase can be carried out.
The methodologies differ somewhat in their prescribed phases, in the sequence of these phases, and in the particular tasks needed to implement each phase. However, most methodologies recommend starting a data mining project by developing an understanding of the business domain. This phase generally encompasses determining the business and data mining objectives of the project, the associated success criteria, and an assessment of the resources required to execute the project. Certain methodologies, such as CRISP-DM (CRoss Industry Standard Process for Data Mining; CRISP-DM, 2003), also recommend developing a plan for the remaining phases of the project in addition to the above objectives. Fig. 1 shows the process model for the CRISP-DM methodology.
Our review of published data mining case studies suggests that the business understanding (BU) phase of data mining (DM) projects is often implemented in an ad hoc manner. Hardly any published case studies provide a detailed description of how this phase was formally implemented. We believe that the reason behind such an unstructured approach is the general lack of guidance on how this phase can be implemented. Charest et al. (2006) believe this to be a broader issue and state that “DM methodologies provide very little detailed advice to the novice miner on how to actually carry out a given step”. In our view, this issue is more pronounced in the case of the BU phase.
The opposite situation appears to hold for the modeling phase, about which relatively more information is generally available. We argue that formally implementing the business understanding phase is just as important as implementing the modeling phase or any other phase of the data mining project. The business understanding phase is perhaps even somewhat more important than the other phases, given that a number of decisions about those phases (modeling, data preparation, data understanding, evaluation, etc.) are made, or ideally should be made, during the BU phase. Fig. 2 shows how the BU phase pervades all other phases of the DM project.
Not making appropriate decisions during the BU phase seems to lead to two problems. First, it creates inefficiencies, because these decisions have to be dealt with in later phases, taking away time and resources that were allocated to the tasks of those phases. The second problem is even more detrimental: not making certain decisions during the BU phase can lead the DM project in a completely different direction than was intended. This second problem originates from the numerous dependencies that exist between the various phases and tasks of a data mining project. These dependencies need to be clearly identified and effectively managed in order to formally implement the BU phase.
Accordingly, the objective of this paper is threefold: to present an organizationally grounded framework for implementing the various tasks of the BU phase; to identify the dependencies among these tasks; and to explain the facets of each task, namely its desired output, the motivation behind it, the roles of the organizational actors involved, when it should be performed and its predecessor tasks, and how it can be performed. We use an illustrative example of a typical data mining application from the financial services sector to elucidate our approach. By carefully streamlining the tasks of the BU phase, the proposed framework allows them to be implemented formally, which is likely to improve the efficiency and reliability with which such projects can be implemented.
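To make the idea of task facets and predecessor dependencies concrete, the following is a minimal Python sketch, not the framework itself. The four task names follow the chief BU tasks of CRISP-DM discussed in this paper; the `BUTask` class, the example outputs and actors, and in particular the predecessor links are illustrative assumptions rather than the dependencies reported in Fig. 3.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class BUTask:
    """One BU-phase task with a few of the facets named in the text."""
    name: str
    desired_output: str
    actors: list[str] = field(default_factory=list)
    predecessors: list[str] = field(default_factory=list)


# Hypothetical facet values and predecessor links, for illustration only.
TASKS = [
    BUTask(
        "determine_business_objectives",
        "list of business objectives",
        actors=["business sponsor"],
    ),
    BUTask(
        "determine_business_success_criteria",
        "measurable business success criteria",
        actors=["business sponsor"],
        predecessors=["determine_business_objectives"],
    ),
    BUTask(
        "determine_dm_goals",
        "data mining goals",
        actors=["data mining analyst"],
        predecessors=["determine_business_objectives"],
    ),
    BUTask(
        "determine_dm_success_criteria",
        "data mining success criteria",
        actors=["data mining analyst"],
        predecessors=[
            "determine_dm_goals",
            "determine_business_success_criteria",
        ],
    ),
]


def execution_order(tasks: list[BUTask]) -> list[str]:
    """Return task names so that every predecessor precedes its dependents."""
    graph = {task.name: set(task.predecessors) for task in tasks}
    # static_order() raises graphlib.CycleError if the links are circular.
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    for step, name in enumerate(execution_order(TASKS), start=1):
        print(step, name)
```

Recording predecessors explicitly means that a circular or missing dependency surfaces as an error at planning time instead of being discovered in a later phase of the project.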
Framework for implementation of BU phase
Creating a framework to formally implement the BU phase first requires selecting a DM methodology to serve as an anchor. We reviewed various DM methodologies proposed in the literature (Anand and Buchner, 1998, Berry and Linoff, 1997, Cabena, 1998, Cios and Kurgan, 2005, CRISP-DM, 2003, Fayyad et al., 1996b) to assess their suitability as an anchor. Since we intend the framework to be used across all methodologies, we wanted to select a detailed methodology that also covered the
Mapping BU phase of CRISP-DM to the proposed framework
In this section, we present four chief tasks of the BU phase and map them to the organizationally grounded framework proposed in the previous section. These include determination of business objectives, determination of business success criteria, determination of data mining goals, and determination of data mining success criteria. For space considerations, the other tasks have been included in the Appendix. Fig. 3 describes the dependencies between the various tasks uncovered by the mapping of
Illustrative example
Let us consider an illustrative example of how business understanding of a data mining project can be developed. The example serves to highlight various tasks of the BU phase. We use the problem scenario of a credit scoring application from the financial services sector to illustrate our approach. We selected this particular application because it is commonly referenced in data mining papers and textbooks, and its principles find applicability in not just the financial sector, but also
Summary
This paper highlights a chief limitation of existing DM methodologies: they suggest only a prescriptive checklist of tasks and activities to be executed during a DM project, but do not establish a definitive workflow or a description of how the entire process is to be implemented. The paper also recognizes that the relative lack of research on the formal implementation of the DM process is more pronounced in the case of the BU phase. This is problematic since the BU phase is a pivotal phase of
References (16)
- Combining expert knowledge and data mining in a medical diagnosis domain. Expert Systems with Applications (2002)
- et al. Decision support using data mining (1998)
- et al. Data mining techniques for marketing, sales and customer support (1997)
- et al. Mastering data mining: The art and science of customer relationship management (2000)
- Discovering data mining: From concepts to implementation (1998)
- Charest, M. et al. (2006). Intelligent data mining assistance via CBR and ontologies. In Proceedings of the 17th...
- et al. A new hybrid data mining technique using a regression case based reasoning: Application to financial forecasting. Expert Systems with Applications (2006)
- et al. Trends in data mining and knowledge discovery