Big Data Classification Using Distributed Optimized Hoeffding Trees

Abstract. Heavy use of social media, online shopping, and online transactions generates voluminous data. Visual representation and analysis of such large amounts of data is a major research topic today. Because this data changes over time, we need an approach that handles the velocity of data as well as its volume and variety. In this paper, we propose a distributed method that handles all three dimensions of data and gives good results compared with other methods. Traditional algorithms are based on global optima and are essentially memory-resident programs. Our approach, which is based on an optimized Hoeffding bound, uses a local-optima method and a distributed map-reduce architecture. It does not require copying the whole data set into memory. Because the model is updated frequently and concurrently on multiple nodes, it is well suited to time-varying data. The Hoeffding bound is itself well suited to real-time data streams. We propose a distributed map-reduce architecture to implement the Hoeffding tree efficiently, and we use deep learning at the leaf level to optimize the Hoeffding tree. Drift detection is handled by the architecture itself; no separate provision is required. Experimental results show that our method takes less learning time and achieves higher accuracy. A distributed algorithm for Hoeffding tree implementation is also proposed.


I. INTRODUCTION
While Big Data is perhaps not an entirely new concept, it is certainly a hot topic of the modern, digital era. However, it is an evolving concept, and any standards created must take into account that what is classed as 'Big Data' today is likely to change rapidly over the next few years. Standards must therefore look beyond the 'here and now' of how Big Data is currently being used and instead seek to establish frameworks for dealing with data sets that represent a significant logistical challenge [11]. Making decisions based on large amounts of past and current data is an emerging topic today [4,5]; examples include predicting exchange rates or forecasting stock prices from past records. Such data is enormous in volume, changes over time, and varies widely in format. Decision tree algorithms are well suited to this kind of data because they give a visual representation of the data, which is beneficial for analysis. Traditional decision tree algorithms such as C4.5, ID3, and Random Forest cannot handle large volumes of data because of their large memory requirements [16]. These methods are based on global optima, which requires copying the whole data set into memory to build a model. A new approach is therefore required that takes care of the three dimensions of Big Data (volume, variety, and velocity).
Our approach is based on a distributed implementation of Hoeffding trees. Hoeffding trees are based on the Hoeffding bound, which is used to reduce tree height and increase accuracy. A MapReduce architecture is used to implement the distributed system. The remainder of the paper presents the literature survey in Section II, the proposed methodology in Section III, and the experimental results in Section IV.

II. LITERATURE SURVEY
In the literature, many authors have used different machine learning algorithms to analyze time-changing big data. In [1], the author used a neural network to analyze customer behavior using a social media data set. In [2], the author explained a way to find a link between two users of social media such as Twitter or Facebook using a machine learning algorithm. In [3,4,5], the authors explained big data architecture, challenges, and related topics. In the work of Isvani Frías-Blanco and José del Campo-Ávila [29], a moving-average method is suggested for online, non-parametric drift detection based on Hoeffding's bounds. "Short-Term Load Forecasting Based on Big Data Technologies" by Pei Zhang [31] explains a decision tree framework for forecasting short-term loads such as electricity. Petra Perner, in the paper "Decision Tree Induction Methods and their Applications to Big Data", explained why decision tree induction is more suitable than traditional methods.
There is also a large body of work on social media data analysis. Java et al. [21] studied the topological and geographical properties of Twitter's social network. Naaman et al. [23] studied the characteristics of social activities and patterns of communication on Twitter. Davidov et al. [24] used hashtags and other sentiment labels for sentiment analysis. Hannon et al. [25] built an effective and efficient followee recommender system. Kwak et al. [26] proposed methods to recommend influential users. Burton et al. [27] compared Twitter use within and across organizations and geographic markets. Kim et al. [28] explained how to maximize the outcomes of SMM through Word-of-Mouth (WOM) marketing by identifying the core group of users.
Some work is available on distributed implementation of decision trees. In [28], the author implemented the C4.5 decision tree algorithm using MapReduce. In [30], the author explained how to extract knowledge using decision tree and Naïve Bayes algorithms for classification and generation of actionable knowledge for direct marketing. A distributed implementation of support vector machines is proposed in [10].

III. PROPOSED METHODOLOGY
Because the data is dynamic, we need an approach based on a local-optima model. This model does not require the whole data set to be present in memory, and it can be modified as and when required. To handle missing data, biased data, and concept drift, we concentrate on the preprocessing technique. The proposed technique is based on a distributed architecture that contains all steps from data preprocessing to tree optimization. It is based on horizontal splitting of samples, as shown in Figure 1.

A. Overall Flow of Processing
Algorithm A.1 explains the entire process.
The central node is responsible for continuously reading the stream and splitting it among the mapper nodes. Every mapper node collects the stream assigned to it and processes it to remove noisy and missing data. It then calculates statistics such as entropy and information gain for every attribute Ai ∈ A, where i varies from 1 to n and n is the number of attributes. Based on these statistics, the mapper creates a locally optimal tree from the set of samples assigned to it and forwards this tree to its respective reducer. Every reducer combines the trees received from its mappers. The algorithm for combining local trees is given below. After the trees are combined, tree pruning is done to remove unused attributes. The same technique is applied by every mapper node to prune its local tree.
Because the data preprocessing, statistics computation, and tree creation tasks are divided among multiple mapper and reducer nodes, they are performed in parallel and the learning time is reduced. Tree pruning is done at every node to minimize the tree size and increase accuracy.
At the leaf level, other classifiers such as Naïve Bayes, Naïve Bayes Adaptive, and majority class are used to counter biased data and increase accuracy. This technique is based on vertical splitting of data: data samples are divided attribute-wise, and a set of attributes (Ai, …, At) together with the class label attribute (Ac) is assigned to each node. As the number of attributes increases, the number of map-reduce slots can be increased, so vertical splitting is well suited to big data.
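The attribute-wise division described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names `vertical_split` and `project` and the round-robin assignment policy are assumptions made for the example.

```python
def vertical_split(attributes, class_attribute, n_nodes):
    """Assign attributes to n_nodes groups round-robin (assumed policy).

    Every non-empty group also receives the class label attribute,
    so each node can compute class-conditional statistics on its own.
    """
    groups = [[] for _ in range(n_nodes)]
    for i, attr in enumerate(attributes):
        groups[i % n_nodes].append(attr)
    return [group + [class_attribute] for group in groups if group]


def project(sample, columns):
    """Keep only the given columns of one sample (a dict)."""
    return {column: sample[column] for column in columns}


# Five attributes plus the class label Ac, spread over two nodes.
groups = vertical_split(["A1", "A2", "A3", "A4", "A5"], "Ac", 2)
sample = {"A1": 1, "A2": 0, "A3": 7, "A4": 2, "A5": 5, "Ac": "yes"}
per_node = [project(sample, columns) for columns in groups]
```

With this layout, adding attributes only lengthens the list of groups, which is why more map-reduce slots can absorb a wider data set.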
Drift is detected at every local mapper node, and the tree model is modified accordingly. This change is automatically taken into account by the reducer node.

Algorithm A.1: Tree Creation
INPUT
  X: Stream of samples
  A: Set of attributes
  f(A): Function to split a node
  δ: Priority of choosing the correct attribute
OUTPUT
  T(A,C): Tree with A attributes and C class labels
PROCEDURE: Create_Tree(X, A, f(A), δ)
  Read data stream X.
  Divide stream X into a set of streams X1, X2, …, where every set contains t samples.
  Assign stream Xi to MapperNodei.
  Distribute stream X among the nodes using MAP_DATA(Xt, A, f(X), δ).
  Collect class labels from the nodes using REDUCE_DATA(Xt1, f(Xt1), Ct), (Xt2, f(Xt2), Ct), …
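The divide, map, and reduce steps of Algorithm A.1 can be sketched in a few lines. This is a simplified, single-process illustration: the names `chunk_stream`, `mapper`, `reducer`, and `run` are hypothetical, and per-chunk class-label counts stand in for the local trees that the real mappers would build.

```python
from collections import Counter
from itertools import islice


def chunk_stream(stream, t):
    """Divide a (possibly unbounded) stream into chunks of at most t samples."""
    iterator = iter(stream)
    while True:
        chunk = list(islice(iterator, t))
        if not chunk:
            return
        yield chunk


def mapper(chunk):
    """Stand-in for MAP_DATA: summarize one chunk by its class-label counts."""
    return Counter(label for _, label in chunk)


def reducer(partials):
    """Stand-in for REDUCE_DATA: merge the partial results from all mappers."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run(stream, t):
    """Run the chunks through mapper and reducer sequentially."""
    return reducer(mapper(chunk) for chunk in chunk_stream(stream, t))


# Ten samples of the form (features, label), processed in chunks of four.
stream = [((i,), "b" if i % 3 == 0 else "a") for i in range(10)]
merged = run(stream, 4)
```

Because each chunk is read lazily and summarized independently, the whole stream never needs to be in memory at once, which is the property the architecture relies on.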

B. Normalizing Leaf Node
If the data samples at a leaf node belong to different classes, the leaf node needs to be normalized. For this, Algorithm B.1 is applied to split the leaf node.
Here, entropy f(X) is calculated for all attributes X. From these, the attribute Xta with the maximum f(X) value and the attribute Xtb with the second-highest f(X) value are selected.
The entropy values of these two attributes are compared against the Hoeffding bound, HB. The attribute with the highest entropy value f(X) is selected as the parent node; nodes with lower entropy values f(X) are added on the left side and the rest on the right.

Algorithm B.1: Splitting Node
PROCEDURE: Splitnode(X, A, f(X), δ)
  Select attribute f(Xta) >= max(f(Xt)) and max(f(Xt)) <= f(Xtb) < f(Xta) from each node.
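The comparison against the Hoeffding bound can be illustrated with the standard form of the bound, ε = sqrt(R² ln(1/δ) / (2n)), where R is the range of the split measure, n is the number of samples seen at the node, and 1 − δ is the desired confidence. The paper does not state its exact formula, so the sketch below assumes this standard form; the function names are hypothetical.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Standard Hoeffding bound: epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_score, second_score, value_range, delta, n):
    """Split only if the best attribute beats the runner-up by more than epsilon."""
    return (best_score - second_score) > hoeffding_bound(value_range, delta, n)


# For a two-class problem the information-based measure has range log2(2) = 1.
epsilon = hoeffding_bound(value_range=1.0, delta=1e-7, n=1000)
decision = should_split(0.50, 0.30, value_range=1.0, delta=1e-7, n=1000)
```

The bound shrinks as n grows, so a node that cannot separate its two best attributes yet simply waits for more samples instead of committing to a split.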

C. Mapper Processing
Every data stream is divided into sub-streams that are assigned to different mapper nodes. Each mapper node is responsible for data preprocessing, statistics calculation, and creation of the local tree, as given in Algorithm C.1.

Algorithm C.1: Mapping Data
PROCEDURE MAP_DATA(Xt, A, f(X), δ)
  Call Data_Preprocessing(Xt, A, f(X), δ)
  Call Statistics_calc(Xt, A, f(X), δ)
  Call Create_Tree(Xt, A, f(X), δ)
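The first two calls of Algorithm C.1 can be sketched as below. This is an illustration under assumptions: samples are taken to be (attribute dict, label) pairs, "missing" is represented by None, and the function names are hypothetical. The tree-building call is omitted because the paper does not give enough detail to reproduce it.

```python
from collections import defaultdict


def preprocess(samples):
    """Drop samples with a missing label or any missing attribute value."""
    return [
        (x, label)
        for x, label in samples
        if label is not None and None not in x.values()
    ]


def count_statistics(samples):
    """Build counts[attribute][value][label], the raw input for entropy and gain."""
    counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for x, label in samples:
        for attribute, value in x.items():
            counts[attribute][value][label] += 1
    return counts


def map_data(samples):
    """Mapper step: clean the sub-stream, then compute its statistics."""
    return count_statistics(preprocess(samples))


samples = [
    ({"a": 1, "b": "x"}, "yes"),
    ({"a": None, "b": "y"}, "no"),   # removed: missing attribute value
    ({"a": 1, "b": "y"}, "yes"),
]
counts = map_data(samples)
```

Keeping only counts, rather than the samples themselves, is what lets a mapper discard each sample after it has been processed.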

F. Statistics Calculation
For attribute selection, statistics such as entropy and information gain are calculated. The attribute with the highest information gain is selected as a node in the tree.
Entropy is a measure of impurity. The higher the entropy, the greater the information value of an attribute. If a decision tree is created based on entropy alone, it tends to select attributes that form nodes with many branches.
Information gain is therefore used to decide the proper ordering of attributes. Information gain is the difference between the entropy before and the entropy after splitting on an attribute. It is explained in Algorithm F.1.
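As a worked example of these two statistics, the sketch below computes Shannon entropy and information gain for a single categorical attribute. It assumes the standard definitions (entropy in bits, entropy after the split weighted by branch size); the function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(values, labels):
    """Entropy before the split minus the size-weighted entropy after it."""
    total = len(labels)
    branches = {}
    for value, label in zip(values, labels):
        branches.setdefault(value, []).append(label)
    after = sum(len(b) / total * entropy(b) for b in branches.values())
    return entropy(labels) - after


labels = ["yes", "yes", "no", "no"]
gain_informative = information_gain(["a", "a", "b", "b"], labels)
gain_uninformative = information_gain(["a", "b", "a", "b"], labels)
```

The first attribute separates the classes perfectly and recovers the full entropy of the labels as gain, while the second leaves every branch as mixed as the parent and gains nothing.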

G. Combining Local Trees
To combine two local trees, the information gains of the roots of the two trees are compared using the Hoeffding bound (HB). The root node with the higher probability is selected as the root of the combined tree; the node with less information gain is added on the left side and the other nodes on the right side.
After a single tree is formed, the class distribution of the leaf-level nodes is checked. If it is biased or uneven, other classifiers such as Naïve Bayes are used.
  Select R2 as root and add R2Left = R1
  End if all attributes of A1 and A2 are included
  For all Xi ∈ {leaf nodes},
    Fnb: arg max over r of {n'i,j,1, n'i,j,2, …, n'i,j,r}
    n'i,j,r = P(X|cf) · P(cf) / P(X)
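The leaf-level rule above, choosing the class that maximizes P(X|c)·P(c), can be sketched with a small count-based Naïve Bayes leaf. This is an illustration, not the authors' code: the class name `NaiveBayesLeaf` is hypothetical, Laplace (add-one) smoothing is an added assumption to avoid zero probabilities, and P(X) is dropped because it is the same for every class.

```python
from collections import Counter, defaultdict


class NaiveBayesLeaf:
    """Count-based Naive Bayes over categorical attributes at one leaf."""

    def __init__(self):
        self.class_counts = Counter()
        self.value_counts = defaultdict(Counter)  # (attribute, label) -> value counts
        self.values_seen = defaultdict(set)       # attribute -> distinct values

    def update(self, x, label):
        """Add one training sample (attribute dict, label) to the counts."""
        self.class_counts[label] += 1
        for attribute, value in x.items():
            self.value_counts[(attribute, label)][value] += 1
            self.values_seen[attribute].add(value)

    def predict(self, x):
        """Return arg max over classes of P(c) * prod P(x_i | c)."""
        total = sum(self.class_counts.values())
        best_label, best_score = None, -1.0
        for label, n_c in self.class_counts.items():
            score = n_c / total
            for attribute, value in x.items():
                k = len(self.values_seen[attribute]) or 1
                # Laplace smoothing (assumed): add one to every value count.
                score *= (self.value_counts[(attribute, label)][value] + 1) / (n_c + k)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


leaf = NaiveBayesLeaf()
for weather, label in [("sunny", "no"), ("sunny", "no"), ("rain", "yes"), ("rain", "yes")]:
    leaf.update({"weather": weather}, label)
prediction = leaf.predict({"weather": "sunny"})
```

A majority-class leaf would predict the same label for every sample reaching it; the Naïve Bayes leaf can still discriminate between samples when the class distribution at the leaf is mixed.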

H. Pruning Tree
After a tree is formed at the reducer node, pruning is done to remove less useful attributes. While local trees are being created, an attribute may have good information gain, but after the trees are combined the attribute may lose information or carry the same information as other attributes. In that case, its information gain is compared against the Hoeffding bound (HB). This is explained in Algorithm 3.

IV. EXPERIMENTAL RESULTS
Initially, results were generated for the sequential Hoeffding tree and compared with other decision tree algorithms. These results (Table 1) were generated using the Java Weka 3.7.11 library and Eclipse. They show that C4.5 works well up to 50000 attributes, whereas the Hoeffding tree continues to work well beyond 50000 attributes. Even as the data increases, the accuracy of the Hoeffding tree remains linear.
The learning speed of the distributed and sequential Hoeffding tree algorithms is compared for different numbers of attributes in Figure 3. As the number of attributes increases, the learning time of both the distributed and the sequential Hoeffding tree increases, but the distributed algorithm's learning time grows threefold while the sequential algorithm's grows fourfold.
Because the Hoeffding tree is distributed, the learning time is reduced, as expected. However, even though the model is distributed, its accuracy is higher than that of the sequential architecture, as shown in Figure 4. As the number of samples increases, the accuracy of the sequential Hoeffding tree declines while that of the distributed Hoeffding tree stays steady. This is because a local tree is created at every mapper node and a leaf node is split if its class distribution is biased. At every reducer node, tree pruning is applied to remove irrelevant attributes.
To increase the accuracy of the distributed Hoeffding tree, we used different classifiers at the leaf level. This helps to remove biased data. The results are shown in Table 2.

V. CONCLUSION
Big Data analysis is an emerging topic today and is used to analyze past and current data in domains such as finance, medicine, and social media. This analysis supports many kinds of decisions. Because data changes in volume, velocity, and variety, we need an approach that addresses all three aspects. In this paper, we have proposed a distributed model based on Hoeffding trees and a map-reduce architecture. This model does not require the whole data set to be present in memory.
Locally optimal tree models are generated at every node, taking care of drift (changed data) as well as biased data. Tree pruning is done to remove attributes that are redundant or less useful, which keeps the tree height to a minimum. Experimental results show that the proposed model gives better accuracy as well as better learning speed. Other classifiers such as Naïve Bayes are used at the leaf level to increase the accuracy of the model.

Figure 1. Distributed Architecture for Big Data Classification

Algorithm F.1: Statistics Calculation
PROCEDURE Statistics_calc(Xt, A, f(X), δ)
  Compute entropy for all Ai, where t is the number of samples for attribute Ai
  Compute information gain for all Ai: I(Ai) = E1(Ai) - E2(Ai),
  where E1(Ai) is the entropy before the split and E2(Ai) is the entropy after the split
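The entropy expression itself is not given in the algorithm. Assuming the standard Shannon definitions, with S the set of t samples, p_c the fraction of samples in S belonging to class c, and S_v the subset of S in which attribute A_i takes value v (the symbols S, p_c, and S_v are introduced here for illustration), the quantities can be written as:

```latex
\begin{align}
E(S)     &= -\sum_{c \in C} p_c \log_2 p_c \\
E_1(A_i) &= E(S) \\
E_2(A_i) &= \sum_{v \in \mathrm{values}(A_i)} \frac{|S_v|}{t}\, E(S_v) \\
I(A_i)   &= E_1(A_i) - E_2(A_i)
\end{align}
```

Under these definitions I(A_i) is never negative, and it equals E(S) exactly when every branch S_v contains samples of a single class.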

Figure 3. Learning Time Comparison (Different Number of Attributes)

Figure 4. Accuracy Comparison (Number of attributes = 100)
A comparison of the accuracy of the different classifiers used at the leaf is shown in Figure 5. Naïve Bayes and Naïve Bayes Adaptive show better results than majority class.

Table 1. Comparison of Different Decision Tree Algorithms
Results of the distributed Hoeffding tree are compared with those of the sequential Hoeffding tree. Even as the number of samples increases, the learning time of the distributed Hoeffding tree is lower than that of the sequential Hoeffding tree, as shown in Figure 2. Here, the number of attributes is kept fixed at 100.

Table 2. Comparison of Algorithms Used at Leaf Level