An intelligent middleware for linear correlation discovery

https://doi.org/10.1016/S0167-9236(01)00127-0

Abstract

Although it is widely accepted that research on data mining, knowledge discovery, and data warehousing should be synthesized, little research addresses the integration of existing data management and analysis software. We develop an intelligent middleware that facilitates linear correlation discovery, the discovery of associations between attributes and attribute groups. This middleware integrates data management and data analysis tools to improve traditional data analysis in three ways: (1) it identifies the appropriate linear correlation functions to apply based on the semantics of a data set; (2) it executes the appropriate functions contained in the data analysis packages; and (3) it derives useful knowledge from the data analysis results.

Introduction

Much recent research has focused on database integration [21], data warehousing [13], and data mining and knowledge discovery [6]. All of these research areas have attempted to address an emerging business need: the exploitation of large amounts of data to derive useful information (i.e. obtain business intelligence).

Many businesses own separate software for data definition, data manipulation, and data analysis. For example, while business data may be stored in a Microsoft Access database, the necessary data analysis functions are contained in heterogeneous data analysis packages such as SPSS/Base [10] or SAS [14]. These businesses face three major problems in leveraging their existing software for knowledge discovery:

Scarce data analysis expertise. Few users have formal training with advanced data analysis methods such as data mining and on-line analytical processing (OLAP) [4], and data management tools such as data warehouses. Experienced analysts continue to be in short supply, especially since the growth in data to be analyzed continues to outpace the number of new trained data analysts entering the market [9].

Affordability of integrated tools. While integrated prototype and commercial database/data analysis systems do exist (e.g. [2], [8], [13], [25]), many companies are either unable or unwilling to adopt these products for reasons of technical, economic, and operational feasibility, among others.

Lack of a well-accepted data analysis communication standard. To transfer data from a database to a data analysis package, it is necessary to create an export file in a format the data analysis package understands and manually import it into the package. Results obtained from a data analysis package must also be manually keyed into the database. Furthermore, standard interfaces for data analysis tools do not exist. For example, statistical packages adopt different languages (e.g. SAS and SPSS employ different commands to perform a linear regression) and generate output in different formats. Data analysts often do not have the time, knowledge, or ability to integrate their databases with their disparate data analysis packages.

The transfer of information between databases and data analysis packages is not only tedious and labor intensive, but also error prone. Data analysis is fundamentally iterative: knowledge obtained from one analysis is used to guide a second, and then a third. Each time an analyst exports data to an analysis package or manually enters results into a database, there is a chance of a mistake, and the more often the same tasks are repeated, the more likely an error becomes. To reduce these inaccuracies, we propose to simplify or automate the integration between databases and data analysis systems.

Therefore, there is an emerging and urgent need for an intelligent system that seamlessly integrates existing data management and data analysis tools, allowing businesses to maximize the use of their information.

In our research, we aim to develop an intelligent middleware between databases and data analysis packages. Because data analysis is a very broad topic, we restrict the scope of our research to linear correlation discovery. Linear correlation discovery refers to the discovery of associations between attributes and attribute groups (sets of attributes). For example, a store manager may want to know whether alcohol sales are directly related to temperature and consumer profile (e.g. gender, age). While our work concentrates on linear correlation discovery, it can be generalized to other forms of data analysis such as market basket analysis [1], comparisons of groups, or prediction. Our research does not attempt to discover non-linear associations, or associations between data with a time-dependent component; such analyses often require techniques more sophisticated than those incorporated in this research.

The middleware is developed to accomplish the following objectives:

Automatic identification of appropriate functions. In data analysis, the appropriate function to apply is determined by the kind of analysis to perform and the characteristics of the data to analyze. For example, when one measures associations between nominal (i.e. unordered) attributes, it is best to use a contingency table, whereas associations between ordinal (i.e. ranked) attributes are measured using Spearman's rho or Kendall's tau.

Most existing data analysis packages require that the users determine both the data set and the function for the analysis. This adds a cognitive burden to the user, because the user must (in a single step) identify not only the kind of analysis to perform, but also the function that best performs the task. Novice users are often unable to perform this task correctly. Our middleware identifies the appropriate correlation function to apply based on the characteristics of the data to be analyzed, thereby relieving the user of this task.
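To make this selection step concrete, the following is a minimal sketch (not the middleware's actual implementation) of choosing an association measure from the attributes' measurement levels, using the scipy.stats library; the MeasurementLevel enumeration and the function name are illustrative assumptions.

```python
# A minimal sketch of measurement-level-based function selection using
# scipy.stats; the enum and function names are illustrative, not the
# middleware's actual interface.
from enum import Enum
import numpy as np
from scipy import stats

class MeasurementLevel(Enum):
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    INTERVAL = "interval"

def select_and_run(x, y, level_x, level_y):
    """Pick an association measure appropriate to the attributes' levels."""
    levels = {level_x, level_y}
    if MeasurementLevel.NOMINAL in levels:
        # Unordered categories: cross-tabulate and test with chi-square.
        table = np.asarray(
            [[np.sum((x == a) & (y == b)) for b in np.unique(y)]
             for a in np.unique(x)])
        chi2, p, _, _ = stats.chi2_contingency(table)
        return "chi-square (contingency table)", chi2, p
    if MeasurementLevel.ORDINAL in levels:
        # Ranked data: use a rank correlation such as Spearman's rho.
        rho, p = stats.spearmanr(x, y)
        return "Spearman's rho", rho, p
    # Both attributes interval-scaled: Pearson's correlation applies.
    r, p = stats.pearsonr(x, y)
    return "Pearson's r", r, p
```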

Standardized access to data analysis packages. No standard language currently exists among data analysis packages. For example, SPSS and SAS use different commands to execute a linear regression. As migration to more sophisticated data analysis packages requires expensive retraining, users are often locked into one particular package.

Our middleware is developed to translate users' data analysis requests into the commands of the target data analysis package. Thus, users do not have to learn the appropriate command syntax to express their data analysis requirements. Furthermore, users are not constrained by any one package and can apply different packages for their analysis task.
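As an illustration of such translation, the sketch below renders the same linear-regression request in SPSS and SAS command syntax; the request format and function name are assumptions for illustration, not the middleware's interface.

```python
# A sketch of request-to-command translation. The templates reflect common
# SPSS and SAS syntax for a simple linear regression; the surrounding
# function and data set names are illustrative assumptions.
def to_package_syntax(package, dependent, independents):
    """Render a linear-regression request in the target package's language."""
    if package == "SPSS":
        return (f"REGRESSION\n"
                f"  /DEPENDENT {dependent}\n"
                f"  /METHOD=ENTER {' '.join(independents)}.")
    if package == "SAS":
        return (f"PROC REG DATA=work.analysis;\n"
                f"  MODEL {dependent} = {' '.join(independents)};\n"
                f"RUN;")
    raise ValueError(f"No translation rules for package: {package}")

print(to_package_syntax("SPSS", "sales", ["temperature", "age"]))
print(to_package_syntax("SAS", "sales", ["temperature", "age"]))
```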

Automatic extraction and interpretation of data analysis results. The functions of most data analysis packages produce voluminous amounts of information, most of it irrelevant to the specific data analysis task performed. Furthermore, different data analysis packages report the same information in different ways. This results in additional learning and human information processing costs, as the user must learn how to extract and interpret results from different packages. Our middleware scans the data analysis output, and extracts only the relevant information.
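One simple way to realize this extraction step is to scan the listing with patterns keyed to the statistics of interest, as in the sketch below; the patterns mimic typical SPSS correlation output but are illustrative only, since listings vary across packages and versions.

```python
# A minimal sketch of output scanning: pull only the correlation coefficient
# and p-value out of a package's listing and discard the rest. The regular
# expressions are illustrative assumptions, not a guaranteed match for any
# particular package release.
import re

def extract_correlation(listing_text):
    """Return (coefficient, p_value) if found in the analysis output."""
    coef = re.search(r"Pearson Correlation\s+(-?\d*\.\d+)", listing_text)
    pval = re.search(r"Sig\. \(2-tailed\)\s+(\d*\.\d+)", listing_text)
    if coef and pval:
        return float(coef.group(1)), float(pval.group(1))
    return None  # statistics not found; listing format differs
```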

Special terms used in this paper are defined as follows:

Function: A function refers to a theoretic construct used to perform data analysis. For example, linear regression and classification trees are functions.

Algorithm: An algorithm is a specific implementation of a function. For example, the CART [3] and QUEST [16] algorithms are two implementations of classification trees. Similarly, a linear regression can be implemented using stochastic approximation or through a matrix minimization approach.

Package: A package is a software product that is widely available and adaptable to many situations. Microsoft Access and SPSS are packages. Customized systems specific to a business are not packages.

Section snippets

Intelligent middleware development

The intelligent middleware performs the following tasks to achieve the research objective:

Store expert knowledge concerning data analysis. The middleware stores knowledge concerning statistical function selection, data analysis package execution, and data analysis output interpretation as production rules. This enables users to perform effective data analysis with only minimal training. These rules can be easily adapted and revised to suit an organization's specific data analysis requirements.
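For example, function-selection knowledge could be encoded declaratively so that an organization can revise it without modifying the middleware itself; the rule format below is an assumed illustration, not the paper's internal representation.

```python
# An assumed illustration of selection knowledge stored as production rules
# (condition -> recommended function), editable as data rather than code.
SELECTION_RULES = [
    ({"nominal"},  "contingency table"),  # both attributes nominal
    ({"ordinal"},  "Spearman's rho"),     # both attributes ordinal
    ({"interval"}, "Pearson's r"),        # both attributes interval-scaled
]

def recommend_function(level_x, level_y, rules=SELECTION_RULES):
    """Fire the first rule whose condition covers the attribute pair."""
    for condition, function in rules:
        if {level_x, level_y} <= condition:
            return function
    return None  # no rule applies, e.g. a mixed nominal/interval pair
```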

Attribute classification

The selection assistant adopts a three-stage selection process. In the first stage, the attributes are classified according to the analysis functions that can be appropriately applied. Schema and instance information are used to obtain these classes. In many cases, functions on attributes are not applicable to their attribute groups (i.e. sets of attributes). For example, while a Pearson's coefficient of determination can be applied to determine the association between Salary and Years_of_
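A hedged sketch of how such a first-stage classification could be derived from schema and instance information is shown below; the thresholds and type names are assumptions for illustration rather than the selection assistant's actual rules.

```python
# A sketch of first-stage attribute classification: infer a likely
# measurement level from a column's declared type (schema information) and
# its observed values (instance information). Thresholds are illustrative.
def classify_attribute(values, declared_type):
    """Guess nominal/ordinal/interval from column type and observed values."""
    if declared_type in ("TEXT", "CHAR", "VARCHAR"):
        return "nominal"                    # unordered string categories
    distinct = len(set(values))
    if declared_type in ("INTEGER", "SMALLINT") and distinct <= 10:
        return "ordinal"                    # few integer codes: likely ranks
    return "interval"                       # numeric with many distinct values
```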

Function coupler

Once the appropriate functions to apply to the data are known, the middleware must call the data analysis package to execute the functions, and read and interpret the package's output results. The function coupler provides facilities to perform these tasks.

Most of the data analysis packages (e.g. SPSS, SAS, Minitab) allow batched requests to be submitted following the process illustrated in Fig. 5. In this process, a sequence of data analysis commands, and a data file to be analyzed are
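As an assumed illustration of this batch process, the sketch below writes the translated commands to a file, invokes SAS in batch mode, and returns the listing for the output parser; the command-line flags reflect common SAS batch usage, and all paths and file names are illustrative.

```python
# A sketch of batch submission in the style described above: write the
# translated commands to disk, run the package in batch mode, and collect
# the listing for the output parser. Paths and file names are assumptions.
import subprocess
from pathlib import Path

def run_batch(commands: str, workdir: str = "."):
    """Submit a command file to SAS in batch mode and return the listing."""
    program = Path(workdir) / "request.sas"
    listing = Path(workdir) / "request.lst"
    program.write_text(commands)
    subprocess.run(
        ["sas", "-sysin", str(program), "-print", str(listing)],
        check=True)
    return listing.read_text()
```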

Conclusion and future research

In this paper, we have presented the development of an intelligent middleware that facilitates knowledge discovery. It integrates the data manipulation power of available databases with the data analysis capability of available data analysis packages. The middleware benefits the user in the following way:

The user does not need to learn about disparate systems: Since the middleware ‘knows’ how to operate the database and data analysis packages and read the data analysis output, the user does

Acknowledgements

The authors wish to thank the Editor-in-chief and the anonymous reviewers for their supportive comments on earlier versions of this manuscript.

References (25)

  • R. Agrawal et al., Mining association rules between sets of items in large databases.
  • S.S. Anand et al., Designing a kernel for data mining, IEEE Expert/Intelligent Systems and Their Applications (March–April 1997).
  • L. Breiman et al., Classification and Regression Trees (1984).
  • S. Chaudhuri et al., An overview of data warehousing and OLAP technology, ACM SIGMOD Record (March 1997).
  • C. Chua et al., A heuristic method for correlating attribute group pairs in data mining.
  • U. Fayyad et al., Data mining and knowledge discovery in databases, Communications of the ACM (November 1996).
  • C. Glymour et al., Statistical themes and lessons for data mining, Data Mining and Knowledge Discovery (1997).
  • J. Han et al., DBMINER: a system for data mining in relational databases and data warehouses.
  • D.J. Hand, Intelligent data analysis: issues and opportunities.
  • J. Hedderson, SPSS/PC+ Made Simple (1991).
  • G.J. Holzmann, The model checker SPIN, IEEE Transactions on Software Engineering (May 1997).
  • W. Hou, Extraction and application of statistical relationships in relational databases, IEEE Transactions on Knowledge and Data Engineering (December 1996).

Cecil Eng Huang Chua is a PhD student at Georgia State University. In 1995 he received both a Bachelor of Business Administration in Computer Information Systems and Economics and a Masters Certificate in Telecommunications Management from the University of Miami. He received a Masters of Business by Research from Nanyang Technological University in 2000. His research interests include methods for the representation, application, and reuse of knowledge.

Roger Chiang is an Associate Professor of Information Systems at the College of Business Administration, University of Cincinnati. He received his BS degree in Management Science from National Chiao Tung University, Taiwan, MS degrees in Computer Science from Michigan State University and in Business Administration from University of Rochester, and PhD degree in Computers and Information Systems from University of Rochester. His research interests are in data and knowledge management and intelligent systems, particularly in database reverse engineering, database integration, data mining, and common sense reasoning and learning. He is currently on the editorial board of Journal of Database Management and International Journal of Intelligent Systems in Accounting, Finance and Management. His research has been published in a number of international journals including ACM Transactions on Database Systems, Data Base, Data and Knowledge Engineering, Decision Support Systems, and the Journal of Database Administration. He is a member of AAAI, ACM, AIS, IEEE Computer Society, and INFORMS.

Ee-Peng Lim received the BS (Honours) degree in information systems and computer science from the National University of Singapore, in 1989, and the PhD degree in computer science from the University of Minnesota, Minneapolis, in 1994. Since 1994, he has been on the faculty of the School of Applied Science at the Nanyang Technological University, Singapore, where he founded the Centre for Advanced Information Systems (CAIS). His current research interests include database integration, web warehousing, and digital libraries.
