IMPLEMENTING MACHINE LEARNING ALGORITHMS ON SPARK

SHWETA MITTAL, OM PRAKASH SANGWAN
Department of Computer Science, Guru Jambheshwar University of Science & Technology, Hisar, Haryana, India

Abstract: Massive amounts of data are being generated from numerous sources on a day-to-day basis. Spark is a very popular open-source platform, available freely on the web, for storing and processing big databases. To train machines to learn hidden patterns and information from these huge raw databases, machine learning algorithms need to be implemented. ML and MLlib are the two machine learning libraries for implementing machine learning algorithms in Spark. In this paper, Decision Trees, Random Forests and Gradient Boosted Trees have been implemented on a cardiac and a telecom dataset, both on a local PC and on Google Colab, and it was concluded that Gradient Boosted Trees performed better than Decision Trees and Random Forests in terms of accuracy but took longer to execute. Further, it was also observed that the algorithms took less time to run on the Colab GPU as compared to the local PC.


INTRODUCTION
Tremendous amounts of data are being generated on a day-to-day basis across a number of domains, e.g. the health sector, banking, educational institutions and social media. Spark is an open-source platform developed by Apache to store and process such big databases, and it supports data in the form of text as well as images. The fundamental data structure of Spark is the RDD (Resilient Distributed Dataset), a read-only dataset distributed over numerous nodes. In Spark, there is one master node that controls resource allocation, job scheduling, task management, etc., and several worker nodes that perform the jobs. Spark performs in-memory computing, and is thus faster than Hadoop and well suited to iterative algorithms. Spark clusters can be implemented on a number of platforms, such as Azure Databricks, Amazon Elastic MapReduce, Google Cloud, Docker, etc.
Machine learning is a technique for training a system to learn from data and to make predictions from it. Two libraries are available in Spark for implementing machine learning algorithms: ML and MLlib. MLlib is the primary API for Spark and is built on top of RDDs [7]. A framework integrating Jupyter notebooks with Spark for scalable big data mining has been presented by C. Tianshi et al. [8]. P. Gupta et al. compared a cluster-based LSTM with an ordinary LSTM, and as per the results, the cluster-based LSTM performed better in terms of RMSE [11]. O. Aydin et al. implemented an LSTM on Spark using the Keras and Elephas libraries for distributed computing, and the resulting model proved to be reliable [12]. S. Kumar et al. implemented LSTM and GRU networks with up to 3 hidden layers for energy load forecasting using a cluster of 7 machines, and it was concluded that the GRU performed better than the LSTM [13]; it was also concluded that, with the use of the cluster, training was 6 times faster. E. Huh et al. analyzed the performance of an LSTM in container and host environments with respect to CPU and GPU, and as per the results, the training time under Docker was less than on the local host [14].
From the review of the work done by various researchers, it can be concluded that Spark performs satisfactorily for implementing machine learning algorithms and that execution time is considerably reduced for big databases. The machine learning algorithms implemented in Spark by various researchers performed well in terms of both accuracy and training time.

EXPERIMENTAL SETUP
As discussed in the previous section, Spark is a popular and efficient platform for learning from big data. A Spark environment can be set up in a number of ways, e.g. Azure Databricks, Amazon Elastic MapReduce, Google Cloud, Docker, etc. In this section, the Spark environment has been set up on a local PC and on Google Colaboratory to implement the machine learning algorithms.
Spark can be implemented on a local machine by installing Java, Python, Winutils, Spark and the Anaconda distribution (which uses the IPython kernel at its backend), all available freely on the web.
Python, Java and Scala are the programming languages used with Spark. Anaconda is a distribution of the Python and R languages which simplifies the task of package management and deployment. Jupyter Notebook is a web application which supports the Julia, Python and R programming languages and allows users to create and share documents.
Google Colaboratory is an interactive environment developed by Google for writing and executing Python code, and it can be run on a CPU, GPU or TPU. Data from a local PC can be uploaded to Google Drive and then used in Google Colab notebooks. Spark can also be run on Google Colab by installing all the necessary Spark dependencies in the Colab environment, as shown in Figure 2. The major limitation of Colab is that it is available for only 12 hours per session. The Heart Disease and Telecom Churn datasets, containing 70,000 and 3,333 records respectively (available freely on Kaggle.com), have been used for implementation [16,17]. The algorithms have been implemented on Spark using Spark's ML library, which has a number of in-built functions to ease the task of implementation. Transformers (which accept a dataframe as input and generate a new dataframe by appending one or more columns), Estimators (algorithms fitted on a training dataframe to create a model) and Pipelines (sequences of stages where each stage is either a transformer or an estimator) are the basic concepts used to implement the algorithms.
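The Transformer/Estimator/Pipeline pattern above can be illustrated in plain Python. This is a conceptual analogue of the Spark abstractions, not the Spark API itself; the class names (SquareTransformer, MeanEstimator) are made up for illustration, with lists of dicts standing in for dataframes.

```python
# Plain-Python sketch of the Transformer / Estimator / Pipeline pattern.

class SquareTransformer:
    """Transformer: maps an input 'dataframe' to a new one with an extra column."""
    def transform(self, rows):
        return [{**r, "squared": r["x"] ** 2} for r in rows]

class MeanEstimator:
    """Estimator: fit() learns from the data and returns a model (a transformer)."""
    def fit(self, rows):
        mean = sum(r["x"] for r in rows) / len(rows)
        class MeanModel:
            def transform(self, rows):
                return [{**r, "above_mean": r["x"] > mean} for r in rows]
        return MeanModel()

class Pipeline:
    """Pipeline: a sequence of stages, each a transformer or an estimator."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, rows):
        fitted, data = [], rows
        for stage in self.stages:
            if hasattr(stage, "fit"):          # estimator: train, keep the model
                model = stage.fit(data)
                fitted.append(model)
                data = model.transform(data)
            else:                               # transformer: use as-is
                fitted.append(stage)
                data = stage.transform(data)
        return fitted

rows = [{"x": 1}, {"x": 2}, {"x": 3}]
models = Pipeline([SquareTransformer(), MeanEstimator()]).fit(rows)
out = rows
for m in models:
    out = m.transform(out)
print(out)  # each row now carries 'squared' and 'above_mean' columns
```

Spark's own Pipeline works the same way: fitting it returns a PipelineModel whose transform() applies every fitted stage in order.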

A. DATA PREPARATION:
The input dataset contains a mix of numerical and categorical columns, which need to be handled cautiously during implementation so that categorical attributes are not mistreated as numerical ones. The categorical string values present in the input dataset first need to be converted into integer labels via the StringIndexer class, which also stores the metadata of the attributes. String values can be retrieved back from the generated integer labels via the IndexToString class. As there is no ordinal relation among these newly generated integer labels, One Hot Encoding of the attributes is performed via the OneHotEncoder class, which converts the integer labels into binary vectors as shown in Figure 3. The ML algorithms accept input data in the form of a vector, so the next step is to assemble all the input attributes, i.e. the numerical columns and the output of the One Hot Encoder, into a single vector column via the VectorAssembler class (as shown in Figure 4). The VectorAssembler class produces a sparse vector as its output, i.e. a vector with a huge number of zeroes, while the ML algorithms accept dense vectors (as shown in Figure 6). Thus, the sparse vector needs to be converted into a dense vector, which is then given as input to the VectorIndexer class. VectorIndexer distinguishes between continuous and categorical values and also assigns indices to the categorical values.
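The indexing, encoding and assembling steps above can be mimicked in plain Python to show what each stage produces. This is an illustration of the transformations, not the Spark classes themselves; the column names and values are invented, and the simplified one-hot encoding here keeps every category (Spark's encoder can drop one by default).

```python
# Plain-Python illustration of StringIndexer -> OneHotEncoder -> VectorAssembler.

rows = [{"age": 63, "chest_pain": "typical"},
        {"age": 45, "chest_pain": "atypical"},
        {"age": 58, "chest_pain": "typical"}]

# StringIndexer: map each distinct string to an integer label,
# most frequent value first (StringIndexer's default ordering).
values = sorted({r["chest_pain"] for r in rows},
                key=lambda v: -sum(r["chest_pain"] == v for r in rows))
index = {v: i for i, v in enumerate(values)}   # {'typical': 0, 'atypical': 1}

# OneHotEncoder: integer label -> binary vector (no ordinal relation implied).
def one_hot(i, size):
    vec = [0.0] * size
    vec[i] = 1.0
    return vec

# VectorAssembler: concatenate the numerical columns and the encoded
# columns into a single feature vector per row.
features = [[float(r["age"])] + one_hot(index[r["chest_pain"]], len(values))
            for r in rows]
print(features[0])  # [63.0, 1.0, 0.0]
```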

B. MODEL TRAINING AND EVALUATION
The output of the VectorIndexer class is then given as input to the machine learning algorithm. As shown in Figure 5, 'indexedFeatures', the output column of the VectorIndexer class, is given as input to the Decision Tree algorithm. For training the model, the data is split into training and testing sets in the ratio of 70:30. The fit function has been used to train the model on the dataframe, and the transform function has been used to generate predictions from the dataset. The results obtained are summarized in Table 2.

CONCLUSION AND FUTURE WORK
Apache Spark is a very popular framework for processing big databases, and it provides the ML and MLlib libraries for implementing machine learning algorithms. In this research work, Decision Trees, Random Forests and Gradient Boosted Trees have been implemented on Spark using the ML library, and it can be concluded that Gradient Boosted Trees are the slowest of the three but provide the most accurate results. The algorithms have also been implemented on Google Colaboratory with the GPU runtime selected, and from the results it can be inferred that the accuracy is similar to that of the algorithms run on the local machine, but the running time is considerably reduced on Google Colab.
In future work, other machine learning algorithms can be implemented using Spark's ML library, and much bigger databases can be considered. Transfer learning and deep learning algorithms can also be applied to further improve the accuracy of the model.

CONFLICT OF INTERESTS
The author(s) declare that there is no conflict of interests.