Data Preprocessing: the Techniques for Preparing Clean and Quality Data for Data Analytics Process

Models and patterns for real-time data mining play an important role in decision making. Meaningful real-time data mining depends fundamentally on the quality of the data, yet the data available in a warehouse are raw or rough. Warehouse data can be in any format, and they may be huge or unstructured. Such data require processing before they can be analyzed efficiently; the process of making them ready to use is called data preprocessing. Data preprocessing can involve many activities, such as data transformation, data cleaning, data integration, data optimization, and data conversion, which are used to convert rough data into quality data. Data preprocessing is therefore a vital step in data mining: the analysis result can only be as good as the quality of the data. This paper describes the different data preprocessing techniques that can be used to prepare quality data for analysis from the available rough data.

CONTACT Ashish P. Joshi joshiashish_mca@rediffmail.com BCA Department, Vitthalbhai Patel & Rajratna P.T. Patel Science College, Sardar Patel University, Vallabh Vidyanagar-388120, India.

© 2020 The Author(s). Published by Oriental Scientific Publishing Company. This is an Open Access article licensed under a Creative Commons license: Attribution 4.0 International (CC-BY). Doi: http://dx.doi.org/10.13005/ojcst13.0203.03

Article History Received: 10 August 2020 Accepted: 01 September 2020


Introduction to Data Preprocessing
The general model for real-time data mining is shown in fig.1. The first step is the selection of the domain, which determines the selection of the dataset.
The important thing is to select the target data, which must be drawn from the original dataset to enhance reliability. Once the target data have been generated, data preprocessing is required to make them ready to use. In the next step, the ready-to-use data are analyzed, and knowledge or results are generated by applying mining techniques. Data preprocessing comprises five activities: Data Cleaning, Data Optimization, Data Transformation, Data Integration, and Data Conversion.

Data Cleaning or Data Cleansing
Data cleaning is one of the activities of data preprocessing. In the data cleaning process, the imperfect, incorrect, incomplete, inaccurate, or irrelevant parts of the data are identified. This kind of dirty data can then be replaced, modified, or deleted using specific techniques. Data cleaning is also called data cleansing. The following are the steps of the data cleaning process.
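As an illustration only (this sketch is not the paper's own list of steps), the replace, modify, and delete operations described above might look as follows in Python. The records, the field names, the valid age range of 0 to 120, and the choice of the mean as a replacement value are all assumptions made for the example.

```python
# Hypothetical raw records; None marks a missing value.
raw_records = [
    {"name": " Asha ", "age": 34, "city": "anand"},
    {"name": "Ravi", "age": None, "city": "Anand"},
    {"name": "Ravi", "age": None, "city": "Anand"},   # exact duplicate
    {"name": "Mina", "age": -5, "city": "Vadodara"},  # impossible age
    {"name": "", "age": 41, "city": "Surat"},         # missing name
    {"name": "Kiran", "age": 40, "city": "surat "},
]


def clean(records):
    """Return cleaned copies of the records; the input list is not changed."""
    # Modify: trim whitespace and unify the capitalisation of text fields.
    tidy = [
        {
            "name": r["name"].strip(),
            "age": r["age"],
            "city": r["city"].strip().title(),
        }
        for r in records
    ]
    # Delete: drop records without a name or with an age outside 0..120.
    tidy = [
        r for r in tidy
        if r["name"] and (r["age"] is None or 0 <= r["age"] <= 120)
    ]
    # Delete: drop exact duplicates, keeping the first occurrence.
    seen, unique = set(), []
    for r in tidy:
        key = (r["name"], r["age"], r["city"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    # Replace: fill missing ages with the mean of the known ages.
    known = [r["age"] for r in unique if r["age"] is not None]
    if known:
        mean_age = sum(known) / len(known)
        for r in unique:
            if r["age"] is None:
                r["age"] = mean_age
    return unique


cleaned = clean(raw_records)
for record in cleaned:
    print(record)
```

Whether a dirty value should be replaced, modified, or deleted depends on the domain; the mean is only one of several possible replacement values.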

Data Transformation
Data transformation is used to convert the structure of the data and the format of its attributes.
For example, the data may be available in integer format while the dataset requires them to be stored as floating-point values.
Another example is replacing the values true and false with 1 and 0, or categorizing the age ranges 1-12, 13-20, 21-40, and 41-60 under labels such as child, teenager, young, and old. The different transformation methods are given below.
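The three examples above can be sketched in a few lines of Python. The function names and sample values are hypothetical; the age ranges and labels are the ones given in the text, and ages outside those ranges are returned as None because the text does not say how to label them.

```python
def to_float(value):
    """Convert an integer attribute value to floating point."""
    return float(value)


def bool_to_bit(flag):
    """Store True as 1 and False as 0."""
    return 1 if flag else 0


# Age ranges and labels as listed in the text.
AGE_LABELS = [
    (1, 12, "child"),
    (13, 20, "teenager"),
    (21, 40, "young"),
    (41, 60, "old"),
]


def age_label(age):
    """Map an age in years to its category label."""
    for low, high, label in AGE_LABELS:
        if low <= age <= high:
            return label
    return None  # outside the ranges listed in the text


prices = [to_float(p) for p in [10, 25, 7]]
flags = [bool_to_bit(f) for f in [True, False, True]]
labels = [age_label(a) for a in [8, 17, 35, 59, 75]]
print(prices, flags, labels)
```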

Smoothing (Remove Noise from Dataset)
Data smoothing is a technique that uses an algorithm to remove noise from a dataset. This allows the essential patterns to stand out, and it can be used to help estimate trends.
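The text does not name a particular smoothing algorithm. One common choice is a moving average, sketched below under the assumption of an odd window size; the sample values are made up, with 30 and 2 playing the role of noise.

```python
def moving_average(values, window=3):
    """Smooth a sequence with a centred moving average.

    Assumes an odd window size. Near the two ends the window
    shrinks to the values that actually exist.
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


noisy = [10, 12, 30, 13, 14, 15, 2, 17]
smooth = moving_average(noisy, window=3)
print([round(v, 2) for v in smooth])
```

A larger window removes more noise but also flattens genuine short-term changes, so the window size is a trade-off.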

Aggregation (Preparing Data in Abstract Format)
Data aggregation is a process that prepares a summary from the gathered data. It is used to obtain more information about class-based and group-based data.
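A group-based summary of this kind can be sketched as follows. The sales records, the field names, and the helper summarize are hypothetical and serve only to show the idea.

```python
from collections import defaultdict

# Hypothetical detail records.
sales = [
    {"region": "North", "amount": 120.0},
    {"region": "South", "amount": 80.0},
    {"region": "North", "amount": 60.0},
    {"region": "South", "amount": 100.0},
    {"region": "East", "amount": 50.0},
]


def summarize(records, group_key, value_key):
    """Return count, total, and mean of value_key for each group."""
    groups = defaultdict(list)
    for r in records:
        groups[r[group_key]].append(r[value_key])
    return {
        group: {
            "count": len(vals),
            "total": sum(vals),
            "mean": sum(vals) / len(vals),
        }
        for group, vals in groups.items()
    }


summary = summarize(sales, "region", "amount")
for region, stats in summary.items():
    print(region, stats)
```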

Discretization (Transforming Continuous Data into Intervals)
Discretization is the practice of transforming continuous data into a set of fixed intervals. The majority of data mining activities in the real world involve continuous data.
Handling such attributes remains an open area of research, because existing data mining frameworks are still not able to handle them fully.
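A minimal sketch of one discretization scheme, equal-width binning, is given below. The text does not prescribe a scheme, so the choice of equal-width intervals, the number of bins, and the temperature readings are assumptions.

```python
def equal_width_bins(values, n_bins):
    """Assign each value the index (0 .. n_bins - 1) of its interval.

    The range from min(values) to max(values) is cut into n_bins
    intervals of equal width; the maximum falls into the last one.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = []
    for v in values:
        if width == 0:
            idx = 0  # all values are identical
        else:
            idx = min(int((v - lo) / width), n_bins - 1)
        bins.append(idx)
    return bins


temps = [18.5, 21.0, 24.9, 30.2, 35.0, 27.5]
temp_bins = equal_width_bins(temps, n_bins=3)
print(temp_bins)
```

Equal-width intervals are simple but sensitive to outliers; equal-frequency intervals are a common alternative.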

Normalization (Scaling Data into a Smaller Range)
Basically, this process scales the data of an attribute. Normalization is used to map the data into a smaller range, such as 0 to 1. It is generally useful for classification algorithms.
The methods for data normalization are: • Decimal Scaling • Min-Max Normalization • z-Score Normalization
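The three methods can be sketched as below, using their usual textbook definitions: min-max maps v to (v - min) / (max - min), z-score maps v to (v - mean) / std, and decimal scaling divides v by 10^j for the smallest j that brings every absolute value below 1. The sample data are made up, the z-score uses the population standard deviation (an assumption), and the sketch assumes the values are not all equal.

```python
import math


def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale linearly so that min -> new_min and max -> new_max."""
    lo, hi = min(values), max(values)
    scale = (new_max - new_min) / (hi - lo)
    return [(v - lo) * scale + new_min for v in values]


def z_score(values):
    """Centre on the mean and divide by the population std deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def decimal_scaling(values):
    """Divide by 10**j, with the smallest j giving max(|v|) < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


data = [200, 300, 400, 600, 1000]
scaled_min_max = min_max(data)
scaled_z = z_score(data)
scaled_decimal = decimal_scaling(data)
print(scaled_min_max)
print([round(z, 3) for z in scaled_z])
print(scaled_decimal)
```

Min-max preserves the shape of the distribution but is sensitive to extreme values, while the z-score is usually preferred when the minimum and maximum are unknown or unstable.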

Data Reduction
Data reduction is a technique that compresses the data in such a way that their meaning is not lost. For example, if the analysis must be done year-wise but the data are available quarterly, data cube aggregation merges the data of the four quarters into yearly form. The following methods are used for data reduction.

Data Cube Aggregation
Quarterly data are merged to produce yearly data.
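The quarterly-to-yearly roll-up can be sketched as follows; the sales figures and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical quarterly sales, keyed by (year, quarter).
quarterly_sales = {
    (2018, "Q1"): 100, (2018, "Q2"): 150,
    (2018, "Q3"): 200, (2018, "Q4"): 250,
    (2019, "Q1"): 120, (2019, "Q2"): 160,
    (2019, "Q3"): 210, (2019, "Q4"): 260,
}


def roll_up_to_year(quarterly):
    """Sum the four quarters of each year into one yearly figure."""
    yearly = defaultdict(int)
    for (year, _quarter), amount in quarterly.items():
        yearly[year] += amount
    return dict(yearly)


yearly_sales = roll_up_to_year(quarterly_sales)
print(yearly_sales)
```

Eight stored values are reduced to two, which is sufficient as long as the analysis only needs yearly figures.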

Dimension Reduction
It removes redundant features. The methods are: • Stepwise Forward Selection • Stepwise Backward Selection
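As a rough sketch of stepwise forward selection: start from an empty feature set and repeatedly add the feature that improves a score the most, stopping when no feature improves it. The scoring function here is a toy stand-in (the INFORMATION table and toy_score are hypothetical); in practice the score would come from evaluating a model on the candidate feature subset.

```python
def forward_selection(features, score):
    """Greedily add the best feature until the score stops improving."""
    selected = []
    best_score = score(selected)
    remaining = list(features)
    while remaining:
        # Try adding each remaining feature and keep the best one.
        candidate = max(remaining, key=lambda f: score(selected + [f]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best_score:
            break  # no remaining feature improves the score
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected


# Hypothetical toy data: the information each feature carries.
INFORMATION = {
    "age": {"age"},
    "birth_year": {"age"},   # redundant with "age"
    "income": {"income"},
    "row_id": set(),         # carries nothing useful
}


def toy_score(subset):
    """Reward distinct information, with a small cost per feature."""
    covered = set().union(*(INFORMATION[f] for f in subset))
    return len(covered) - 0.1 * len(subset)


chosen = forward_selection(list(INFORMATION), toy_score)
print(chosen)
```

Stepwise backward selection works the other way round: it starts from the full feature set and removes the least useful feature at each step.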

Data Compression
It reduces the size of files using a compression mechanism.
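The text leaves the mechanism open. As one example, lossless compression with Python's standard zlib module is sketched below; the repeated log line is made-up data, chosen because repetitive data compress well.

```python
import zlib

# Hypothetical, highly repetitive data.
original = ("2020-08-10,sensor-1,OK\n" * 500).encode("utf-8")

# Compress at the highest compression level (9).
compressed = zlib.compress(original, 9)

# Lossless: decompression restores the data byte for byte.
restored = zlib.decompress(compressed)

print(len(original), "bytes before,", len(compressed), "bytes after")
```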

Numerosity Reduction
In this reduction technique, the actual data are replaced by mathematical models or by a smaller representation of the data; only the parameters of the representation need to be stored.
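A simple parametric example is a straight-line model: instead of storing every (x, y) pair, only the slope and intercept of a least-squares line are kept. The sketch below assumes the data are roughly linear; the sample values are made up.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical measurements that lie close to y = 2x + 3.
xs = list(range(10))
ys = [3.1, 4.9, 7.2, 8.8, 11.1, 13.0, 14.9, 17.2, 18.9, 21.0]

# Ten values are replaced by two stored parameters.
slope, intercept = fit_line(xs, ys)

# Approximate values can be regenerated from the parameters.
approx = [slope * x + intercept for x in xs]
print(round(slope, 3), round(intercept, 3))
```

The reduction is lossy: the regenerated values are approximations, so the model must fit the data well enough for the intended analysis.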

Discretization
Data discretization divides the range of a continuous attribute into specific intervals. The many continuous values of the attribute can then be replaced by the labels of a small number of intervals.

Concept Hierarchy Operation
The size of the data can be reduced by collecting low-level concepts (such as 25 degrees for temperature) and replacing them with high-level concepts (categorical values such as hot or cold).
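The temperature example can be sketched as follows. The threshold of 30 degrees that separates hot from cold is an assumption made for the example; the text does not give one.

```python
from collections import Counter


def temperature_concept(degrees, hot_threshold=30.0):
    """Replace a low-level reading with a high-level concept.

    The threshold is an assumed value, not taken from the text.
    """
    return "hot" if degrees >= hot_threshold else "cold"


readings = [25, 31, 18, 40, 29]
concepts = [temperature_concept(t) for t in readings]
concept_counts = Counter(concepts)
print(concepts, dict(concept_counts))
```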

Data Integration
Data integration means merging two or more datasets into one dataset. Some applications generate a database per time interval, and these databases must be merged if we want to process all the data at once. For example, a financial accounting system may generate data yearly; to perform an analysis over 10 years, the 10 yearly datasets must be merged into one dataset, which is called data integration. Data integration also includes merging data from dissimilar sources into a single, unified view, bringing data that come from multiple places together in a single place. It may require a data conversion step to give all the data a unified format.
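A minimal sketch of such a merge is shown below for two yearly ledgers whose field names and value types differ. The ledgers, the field names, and the mapping-based approach are hypothetical.

```python
# Hypothetical yearly datasets from two differently structured sources.
ledger_2019 = [
    {"account": "A1", "amount": 100.0},
    {"account": "B2", "amount": 75.5},
]
ledger_2020 = [
    {"acct_id": "A1", "amt": "250.50"},  # other names, amount as text
]


def integrate(sources):
    """Merge several datasets into one list with a unified format.

    Each source is (year, rows, field_map), where field_map maps the
    source's own field names to the unified names.
    """
    unified = []
    for year, rows, field_map in sources:
        for row in rows:
            record = {target: row[src] for src, target in field_map.items()}
            record["amount"] = float(record["amount"])  # unify the type
            record["year"] = year
            unified.append(record)
    return unified


combined = integrate([
    (2019, ledger_2019, {"account": "account", "amount": "amount"}),
    (2020, ledger_2020, {"acct_id": "account", "amt": "amount"}),
])
for record in combined:
    print(record)
```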

Data Conversion
At present, data are available in many different formats and often need to be converted from their existing format to the required format.
For example, Python works well with the CSV data format, but not all data are available as CSV; they may be SQL, JSON, or XML data. Data conversion is used to convert the data into the required format. The model in fig.2, developed in PHP, converts SQL, JSON, or XML data into CSV, and it also converts CSV data to MySQL. Fig.2 shows the selection of a JSON file for conversion into CSV. When the user selects the JSON file and clicks convert, the JSON file is converted into CSV, as displayed in fig.3.
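The JSON-to-CSV case can be illustrated with a short Python sketch. This is not the PHP model described above; it is an independent illustration that assumes the JSON input is a flat list of objects, and the function name and sample data are hypothetical.

```python
import csv
import io
import json


def json_to_csv(json_text):
    """Convert a JSON array of flat objects into CSV text."""
    records = json.loads(json_text)
    # Collect the column names in order of first appearance.
    fieldnames = []
    for rec in records:
        for key in rec:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames,
                            lineterminator="\n")
    writer.writeheader()
    # Keys missing from a record are written as empty fields.
    writer.writerows(records)
    return buffer.getvalue()


sample = ('[{"id": 1, "name": "Asha"}, '
          '{"id": 2, "name": "Ravi", "city": "Anand"}]')
csv_text = json_to_csv(sample)
print(csv_text)
```

Nested JSON objects would need to be flattened first, which this sketch does not attempt.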

Conclusion
Rough data introduce errors into the data analytics process. Data analysis cannot produce the required results efficiently on the basis of rough and noisy data, and results obtained from unprocessed data may differ from the actual results. The different data preprocessing techniques are useful for removing noisy data and preparing quality data, which in turn allows the data analysis to produce efficient results.