Educational Data Mining Model Using Rattle

Abstract: Data mining is the extraction of knowledge from large databases. It has affected fields ranging from combating terror attacks to the analysis of human genome databases. The R programming language plays a key role in many kinds of data analysis. Rattle, an effective GUI for R, is used extensively for generating reports based on currently popular models such as random forest and support vector machine. Without such a tool, it is hard to compare models and decide which one to choose for the data to be mined. This paper proposes a method that uses Rattle to select an educational data mining model.


I. INTRODUCTION
Dibrugarh University, the easternmost university of India, was set up in 1965 under the provisions of the Dibrugarh University Act, 1965, enacted by the Assam Legislative Assembly. It is a teaching-cum-affiliating university with limited residential facilities. The University is situated at Rajabheta, about five kilometers south of the premier town of Dibrugarh, in the eastern part of Assam as well as of India. Dibrugarh, a commercially and industrially advanced town of the northeastern region, also enjoys a unique place in the fields of art, literature and culture. The district of Dibrugarh is well known for its vast treasure of minerals (including oil, natural gas and coal), its flora and fauna, and the largest concentration of tea plantations. The diverse tribes, with their distinct dialects, customs, traditions and culture, form a polychromatic ethnic mosaic that makes the region a paradise for the study of anthropology and sociology, besides art and culture. The Dibrugarh University campus is well linked by road, rail, air and waterways; National Highway No. 37 passes through the campus. The territorial jurisdiction of Dibrugarh University covers seven districts of Upper Assam, viz. Dibrugarh, Tinsukia, Sivasagar, Jorhat, Golaghat, Dhemaji and Lakhimpur [1].
More than one hundred colleges and institutes affiliated to or permitted under the University offer the TDC (Three Year Degree) course. Since the number of students in the Arts stream is larger than in the other streams (B.Sc., B.Com., B.Tech., etc.), we considered the data for the B.A. (Bachelor of Arts) course for the present study of educational data mining. The required digitized data were collected from the Dibrugarh University Examination Branch for the B.A. programme of the affiliated colleges from 2010 to 2013. This paper evaluates the performance of the students gender-wise as well as caste-wise. The colleges are categorized as Urban or Rural depending on their location. For the caste-wise observations, the binomial operators are Urban and Rural.
Several data mining tools and statistical models are available. This paper focuses on which data mining tool is best suited to such knowledge discovery and which statistical models should be used.

A. Data Mining
Data mining detects relevant patterns in databases and data warehouses, using different programs and algorithms to examine current and historical data, which can then be analyzed to predict future trends [2]. It is very difficult for any organization to extract hidden patterns from huge data marts and data warehouses without the help of data mining tools and programs; it is like searching for pearls in a sea of data. The resulting knowledge is extremely useful for developing a knowledge support system and for making important decisions about future trends.
Over the years, statisticians have used different manual techniques to predict trends and results from data for the benefit of business. Business houses developed huge databases and data warehouses that became "data tombs": the data were never transformed into information. With the help of data mining tools and algorithms, professionals from different areas can now extract knowledge quickly and with ease.

B. Educational Data Mining
Data mining, often called knowledge discovery in databases (KDD), is known for its powerful role in uncovering hidden information from large volumes of data [3]. Its advantages have led to its application in numerous fields, including e-commerce, bioinformatics and, lately, educational research, where it is commonly known as Educational Data Mining (EDM) [4]. The Educational Data Mining community website, www.educationaldatamining.org, defines EDM as an emerging discipline concerned with developing methods for exploring the unique types of data that come from educational settings, and with using those methods to better understand students and the settings in which they learn. EDM often stresses the improvement of student models, which denote the student's current knowledge, motivation and attitudes [5].

C. Rattle: A Data Mining GUI for R
The data miner draws heavily on methodologies, techniques and algorithms from statistics, machine learning and computer science [6]. The R programming language is a powerful tool for data mining. Rattle (the R Analytical Tool To Learn Easily) provides a GUI for the R programming environment. Loading the package with library(rattle) and then calling rattle() brings up the GUI. Highly skilled statisticians can use the R programming language efficiently, but it is out of reach for many people without in-depth knowledge of statistics. Rattle provides a sophisticated GUI for data analysis and produces the necessary graphs with a click. It adds another dimension to R programming and gives novice data miners a platform on which to work efficiently. Rattle's user interface provides an entry into the power of R as a data mining tool [6].

D. ROC Curves Analysis
Receiver operating characteristic (ROC) curves are used in many areas to determine a cutoff value. They may also be used to select the best-suited model. In our educational data mining experiment, we use the ROC curve to select the model.
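As an illustration of how an ROC curve can support model selection, the following minimal Python sketch sweeps the cutoff over a model's scores, collects the (false positive rate, true positive rate) points, and summarizes each curve by its area (AUC). It is not the code Rattle runs internally. The function names roc_points and auc, and the toy labels and scores, are hypothetical and are not taken from the Dibrugarh University data.

```python
def roc_points(labels, scores):
    """Return the (FPR, TPR) points obtained by lowering the cutoff step by step.

    labels: 1 for the positive class, 0 for the negative class.
    scores: model scores, where a higher score means "more likely positive".
    """
    pos = sum(labels)
    neg = len(labels) - pos
    # Sort by descending score so the cutoff moves from strict to lenient.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        cutoff = ranked[i][0]
        # Consume every case tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == cutoff:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve, computed with the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical toy data: true classes and the scores of two candidate models.
labels = [1, 1, 0, 1, 0, 0]
scores_a = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
scores_b = [0.6, 0.3, 0.7, 0.5, 0.4, 0.8]

auc_a = auc(roc_points(labels, scores_a))
auc_b = auc(roc_points(labels, scores_b))
best = "model A" if auc_a > auc_b else "model B"
print(f"AUC A = {auc_a:.3f}, AUC B = {auc_b:.3f}, preferred: {best}")
```

Under this sketch, the model whose curve lies closer to the top-left corner has the larger area and would be preferred. Using AUC as the single selection criterion is an assumption of the example.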

A. The Data Set
We have included a small part of the category-based and gender-based tables, as Table I and Table II, for which suitable models need to be selected. The Examination Branch of Dibrugarh University assigns College Codes to the colleges under its jurisdiction. The field 'Appeared' is the number of candidates who appeared for an examination, and 'Passed' is the number of candidates who passed that examination. The field 'PassPercentage' is the pass percentage of the candidates in a particular category. We define the various terms by their codes as below:
The data fields in the sample Table II have the same meaning as in Table I, except for one additional field, 'Gender'. With the data sets defined, the experiments can be performed.
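The relationship between the three fields can be written out as a short Python sketch. The field names 'Appeared', 'Passed' and 'PassPercentage' come from the tables; the formula (passed divided by appeared, times 100) is the usual reading of a pass percentage and is assumed here. The function name and the sample row are hypothetical.

```python
def pass_percentage(appeared, passed):
    """Percentage of candidates who passed among those who appeared."""
    if appeared <= 0:
        raise ValueError("'Appeared' must be a positive count")
    if not 0 <= passed <= appeared:
        raise ValueError("'Passed' must lie between 0 and 'Appeared'")
    return 100.0 * passed / appeared


# Hypothetical row; the values are illustrative, not taken from Table I.
row = {"Appeared": 120, "Passed": 90}
row["PassPercentage"] = pass_percentage(row["Appeared"], row["Passed"])
print(row)
```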

B. Experiments performed by Rattle
The main objective of this paper is to select the best-suited models for the statistical analysis of the data sets. A Xeon-based database server was used for the experiments, together with the rattle package. The data, stored in .csv format, were imported into R. The target variable was categorical, and the partition chosen was 70/30/0. When exploring the data, one may visualise them using box plots, histograms, cumulative plots and Benford curves. The histogram, cumulative and Benford curves are presented in Figs. 1 to 4. Next, one may use the Model tab and select all the models for comparison. The models are of the types tree, random forest, boost, support vector machine, regression and neural network. The data are evaluated with all the models. Our goal is to find the best-suited models for the data by means of the ROC curve.
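A 70/30/0 partition places 70% of the rows in the training set, 30% in the validation set and none in the test set. The following Python sketch shows one way to make such a split on shuffled row indices. It does not reproduce Rattle's own sampling; the function name partition, the seed value and the stand-in rows are assumptions made for illustration.

```python
import random


def partition(rows, percents=(70, 30, 0), seed=42):
    """Split rows into (training, validation, test) lists by percentage.

    percents must sum to 100. The seed is an arbitrary choice that makes
    the shuffle repeatable.
    """
    if sum(percents) != 100:
        raise ValueError("percentages must sum to 100")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    # Integer arithmetic avoids floating-point rounding at the boundaries.
    cut1 = len(rows) * percents[0] // 100
    cut2 = len(rows) * (percents[0] + percents[1]) // 100
    train = [rows[i] for i in indices[:cut1]]
    valid = [rows[i] for i in indices[cut1:cut2]]
    test = [rows[i] for i in indices[cut2:]]
    return train, valid, test


# Hypothetical stand-in for the imported .csv rows.
rows = list(range(100))
train, valid, test = partition(rows)
print(len(train), len(valid), len(test))
```

With no test rows, the models in this sketch would be compared on the 30% validation set only.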

C. Evaluation of the Experiments
Fig. 5 shows one of the ROC curves for the category data. The following are the actual findings obtained with Rattle on the category-wise data.

Fig. 1. Cumulative diagram showing the category-wise, pass-percentage-wise and performance-wise distribution on the basis of location.

Fig. 3. Benford diagram showing the performance by gender of the candidates.

Fig. 4. Cumulative diagram showing the performance by pass percentage and gender.

Fig. 5. ROC curve for the first experiment, i.e. performance by category.

TABLE I. SAMPLE DATA FOR YEAR-WISE COLLEGE-WISE CATEGORY-WISE LOCATION-WISE DATA OF THE B.A. CANDIDATES

TABLE II. SAMPLE DATA FOR YEAR-WISE COLLEGE-WISE GENDER-WISE DATA OF THE B.A. CANDIDATES