Introduction
With the increasing demand and growth of Internet of Things (IoT) automated network systems, IoT models are getting more complicated day by day [1,2]. People are becoming accustomed to data-driven infrastructure, and this is steering research toward Machine Learning based applications alongside IoT. IoT and Machine Learning based techniques are used in every domain of human life at present. In medicine, interpretation of ECG, disease detection using X-Ray, pattern finding in genomic data, automated pathological systems for cancer detection, and brain signal modeling are complex tasks that require machine learning approaches [3]. The application of machine learning approaches can also cover the aerospace domain. D'Angelo et al. [4] applied content-based image retrieval and machine learning techniques to the electrical impedance plane generated from eddy current testing. Eddy current testing is a complex task used in the aircraft industry for finding defects. Besides machine learning, IoT services are also applied to these domains. The growing complexity of IoT infrastructures is raising unwanted vulnerability in these systems. In IoT devices, security breaches and anomalies have become common phenomena nowadays.
IoT devices use a wireless medium to broadcast data, which makes them an easier target for attack [5]. A conventional communication attack on a local network is limited to local nodes or a small local domain, but an attack on an IoT system spreads over a larger area and has devastating effects on IoT sites [6].
Consequently, a secured IoT infrastructure is necessary for protection from cybercrime. Existing security measures become ineffective as the IoT devices themselves become vulnerable. For some stakeholders and entrepreneurs, data is the lifeblood of their business. For governments and some private agencies, certain data are classified and confidential. A vulnerable IoT node offers an attacker a backdoor for gathering confidential data from any important organization [7].
There are some trivial methods to solve the problems mentioned above. In signature based [8] methods, known attacks and anomalies are stored in a database beforehand, and the system is checked against the database at particular time intervals. However, this methodology generates processing overhead and is vulnerable to unknown threats. The advantage of data-analysis-based techniques is that they work faster than other methodologies and can overcome the problem of unknown threats. Hence, data-analysis-based techniques are used in this paper.
The primary goal of the system is to develop a smart, secured and reliable IoT based infrastructure which can detect its own vulnerability, maintain a secure firewall against cyber attacks and recover itself automatically. Here, a Machine Learning based solution is proposed which can detect and protect the system when it is in an abnormal state. For this task, several machine learning classifiers have been exploited. Another key aspect of this paper is the realization that a simple model such as a Decision Tree or Random Forest can compete with a complex network such as an ANN for anomaly detection.
Further analysis and comparison with other works will be briefly described in the following sections. Section 2 provides a description of other research works on IoT attack and anomaly detection. The description of the dataset, the different kinds of attacks and anomalies, the learning models and the system framework are detailed in Section 3. Section 4 explains our experimental setup, result analysis and comparison with other state-of-the-art methods. Finally, limitations, conclusion and future scope are presented in Section 5.

Literature review
There have been several similar works in the IoT field, and researchers are still active in this area. Pahl et al. [1] mainly developed a detector and firewall for anomalies of IoT microservices in an IoT site. Clustering methods such as K-Means and BIRCH have been implemented [9] for the different microservices in this work. In clustering, two clusters were grouped together if their centers lay within three standard deviations of each other. The clustering model was updated using an online learning technique. With the algorithms implemented, the overall accuracy obtained by the system is 96.3%.
A detailed description of a smart home system, where security breaches are detected by the deep learning method Dense Random Neural Network (DRNN) [10], has been introduced in [11]. The authors mainly describe the Denial of Service attack and the Denial of Sleep attack in a simple IoT site.
Liu et al. [5] proposed a detector for On and Off attacks by a malicious network node in an industrial IoT site. By an On and Off attack they mean that the IoT network can be attacked by a malicious node when the node is in an active or On state, while the network behaves normally when the malicious node is in the inactive or Off state. The system was developed using a light probe routing mechanism, calculating a trust estimation for each neighbor node to detect an anomaly.
Diro et al. [8] discussed the detection of attacks using a fog-to-things architecture. The authors presented a comparative study of a deep and a shallow neural network using an open source dataset. This work's primary focus was to detect four classes of attack and anomaly. For the four-class problem, the system achieved an accuracy of 98.27% with the deep neural network model and 96.75% with the shallow neural network model.
Usmonov et al. [12] described the security problems that arise when developing embedded technologies for the IoT. Securing data transfer between the physical, logical and virtual components of an IoT system is also challenging. For these problems, the use of digital watermarks was proposed by the authors.
Anthi et al. [13] presented an intrusion detection system for the IoT. Toward this purpose, several ML classifiers were used to successfully identify network scanning, probing and simple forms of Denial of Service (DoS) attacks. To generate the dataset, network traffic was captured over four consecutive days using the software Wireshark. The ML classifiers were applied using the software Weka.
Ukil et al. [14] discussed the detection of anomalies in healthcare analytics based on IoT. A model of cardiac anomaly detection through a smartphone was also introduced in this paper. For the anomaly detection in healthcare; IoT sensors, medical image analysis, biomedical signal analysis, big data mining, and predictive analytics were used.
Pajouh et al. [6] presented a model for intrusion detection based on two-layer dimension reduction and a two-tier classification module. The model was also designed to identify malicious activities such as User to Root (U2R) and Remote to Local (R2L) attacks. For dimension reduction, principal component analysis and linear discriminant analysis were used. The NSL-KDD dataset was used to carry out the whole experiment. For detecting suspicious behaviors with the two-tier classification module, Naive Bayes and a Certainty Factor version of K-Nearest Neighbor were applied. D'Angelo et al. [15] applied the Uncertainty-managing Batch Relevance-based Artificial Intelligence (U-BRAIN) algorithm to the binary NSL-KDD dataset and to Real Traffic Data (from the Federico II University of Naples). U-BRAIN is a dynamic model operated on multiple machines which can handle missing data. The NSL-KDD dataset contains 41 features, from which 6 features were selected using a J-48 based classification algorithm. The accuracy values were 94.1% for NSL-KDD and 97.4% (10-fold training mean) for the Real Traffic Data.
Kozik et al. [16] presented a classification based attack detection service utilizing cloud architecture. An Extreme Learning Machine (ELM), scalable in the Apache Spark cloud framework, is employed on artificial Netflow-formatted data generated by an IoT network. The work focuses on three significant scenarios in IoT systems: scanning, command and control, and infected host. The best accuracy values found for these scenarios are 0.99, 0.76 and 0.95, respectively.

Materials and methods
The overall framework is a combination of several independent processes. Fig. 1 depicts the overall framework of the system. The first process of this framework is dataset collection and dataset observation. In this process, the dataset was collected and observed meticulously to find out the types of data. Then data preprocessing was applied to the dataset. Data preprocessing consists of cleaning of data, visualization of data, feature engineering and vectorization steps. These steps converted the data into feature vectors. These feature vectors were then split in an 80:20 ratio into training and testing sets. The training set was used in the learning algorithm, and a final model was developed using an optimization technique. Different classifiers used in this work employed different optimization techniques. Logistic Regression used coordinate descent [17]. SVM and ANN used the conventional gradient descent technique. No optimizer is used in the case of DT and RF because these are non-parametric models. The final model was evaluated against the testing set using different evaluation metrics.
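The 80:20 split step can be sketched with scikit-learn; the feature matrix and labels below are placeholders, not the actual DS2OS feature vectors:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature vectors and binary labels standing in for the real data
X = np.arange(100).reshape(50, 2)
y = np.array([0, 1] * 25)

# 80% training, 20% testing; stratify keeps the class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
```

The `stratify` argument matters for data like this, where anomalous classes are a small fraction of the samples.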

Dataset collection and description
The open source dataset was collected from Kaggle [18], provided by Pahl et al. [1]. They created a virtual IoT environment using the Distributed Smart Space Orchestration System (DS2OS) to produce synthetic data. Their architecture is a collection of microservices which communicate with each other using the Message Queuing Telemetry Transport (MQTT) protocol. The dataset contains 357,952 samples and 13 features. It has 347,935 normal samples and 10,017 anomalous samples, distributed over eight classes. The features "Accessed Node Type" and "Value" have 148 and 2050 missing entries, respectively. Table 1 gives a detailed picture of the distribution of the different attacks and anomalies through the whole data. Descriptions of the 13 features are given in Table 2. All of the features in the described dataset are of object type except the timestamp, which is of int64 type. Besides these, the frequency distribution of several features is illustrated in Fig. 2.
1. Denial of Service (DoS): a DoS attack is caused by too much unwanted traffic at a single source or receiver. The attacker sends many ambiguous packets to flood the target and make its services unavailable to other services [11]. The dataset contains 5780 samples of the DoS attack.
2. Data Type Probing (D.P): in this case, a malicious node writes a data type different from the intended data type [1]. The dataset contains 342 samples of Data Type Probing.
3. Malicious Control (M.C): through a vulnerable session, an attacker can gain control of the system and capture or manipulate the data flow. The dataset contains 889 samples of Malicious Control.
4. Malicious Operation (M.O): malware can trigger decoy or unwanted operations that hamper the actual operation of the system [21]. The dataset contains 805 samples of Malicious Operation.
5. Scan (SC): sometimes data is acquired through hardware by scanning the system, and in this process the data can get corrupted [22]. The dataset contains 1547 samples of Scan.
6. Spying (SP): in Spying, the attacker exploits the vulnerabilities of the system, using a backdoor channel to break into the system and discover important information [8]. In some cases attackers manipulate data, causing great harm to the whole system. The dataset contains 532 samples of Spying.
7. Wrong Setup (W.S): the data may also get disrupted by a wrong system setup [23]. The dataset contains 122 samples of Wrong Setup.
8. Normal (NL): if the data is entirely correct and accurate, then the data is called normal. Out of the 357,952 samples, 347,935 samples are of the normal class.

Data preprocessing
Any machine learning research requires exploratory data analysis and data observation. The first task in this research was to make the dataset usable by a classifier. For this reason, the first step was to handle the missing data. In the dataset, the "Accessed Node Type" column and the "Value" column contain missing values due to anomalies raised during data transfer. Of these two features, "Accessed Node Type" has categorical values, while "Value" has continuous values. The "Accessed Node Type" feature has 148 rows containing the 'NaN' value, depicted as Not a Number, and the corresponding class or label of each such row is found to be anomalous. As the "Accessed Node Type" feature is categorical and removing these 148 rows might cause loss of valuable data, the 'NaN' value in "Accessed Node Type" is replaced with the 'Malicious' value. Similarly, the "Value" column also contains some unexpected entries which are not continuous values. These unexpected values are transformed into meaningful continuous values that help the classifiers achieve better accuracy. The unexpected values "False", "True", "Twenty" and "none" in the "Value" feature are replaced by the meaningful values "0.0", "1.0", "20.0" and "0.0", respectively.
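The two replacement rules described above can be sketched in pandas on a toy frame; only the column names follow the paper, and the sample values are illustrative:

```python
import pandas as pd

# A tiny stand-in for the real DS2OS frame
df = pd.DataFrame({
    "Accessed Node Type": ["/sensor", None, "/light"],
    "Value": ["21.5", "False", "Twenty"],
})

# Categorical feature: replace NaN with an explicit 'Malicious' category
df["Accessed Node Type"] = df["Accessed Node Type"].fillna("Malicious")

# Continuous feature: map non-numeric artifacts to meaningful numbers
df["Value"] = df["Value"].replace(
    {"False": "0.0", "True": "1.0", "Twenty": "20.0", "none": "0.0"}
).astype(float)
```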
For feature selection, no machine learning approach has been taken here, as in Pahl et al. [1], because this would not have any significant impact on the data analysis. Besides this, the timestamp column has been removed from the dataset, as it has minimal correlation with the dataset's predictor variable, normality.
In the feature engineering step, it is first necessary to determine the feature types in the dataset. The dataset contains categorical and numerical data. Categorical data can be further classified into ordinal and nominal values, and numerical data into discrete and continuous values. Table 2 depicts the column types. From Table 2, it can be seen that all columns except the "Value" column and the "Timestamp" column are categorical nominal variables, while the "Value" and "Timestamp" columns are continuous numerical variables. The "Timestamp" column is not considered here, as it was removed from the dataset.
The next vital task is converting the nominal categorical data into vectors. Categorical data can be converted into vectors in many ways; Label Encoding and One-Hot Encoding are prevalent among them. In this research, the label encoding technique has been used to convert the data into feature vectors. Most of the features in this dataset contain nominal categorical values with many unique values. If one-hot encoding were applied to these features, the number of features would have increased significantly, and the resulting dataset would have had very high dimensionality. With label encoding, by contrast, the number of features stays the same, so the dimensionality of the dataset is not increased. Besides, one-hot encoded features would be sparse, which is harder to fit in a machine learning algorithm and takes a lot of processing time. Hence, label encoding is applied to the dataset.
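A minimal sketch of label encoding with scikit-learn, using made-up node-type strings: each nominal column stays a single integer column, whereas one-hot encoding would add one column per unique value:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical nominal values for one categorical feature
source_types = ["/lightController", "/thermostat", "/lightController", "/doorLock"]

enc = LabelEncoder()
codes = enc.fit_transform(source_types)
# One integer per row; equal strings map to equal codes
```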

Theoretical considerations
For the data analysis part, several machine learning algorithms were used. The following are the algorithms with their descriptions.

Logistic Regression (LR)
Logistic Regression (LR) is a discriminative model which depends on the quality of the dataset. Given the features X = X_1, X_2, ..., X_n, a weight vector W and a bias b, LR estimates the probability of a class y through the sigmoid function:

P(y = 1 | X) = 1 / (1 + e^−(W·X + b))

The weights are fitted by maximizing the likelihood of the training labels; in this work, coordinate descent was used as the optimizer [17]. For more than two classes, the one-vs.-rest scheme is applied.
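A hedged sketch of LR on synthetic data (not the DS2OS features); `predict_proba` exposes the sigmoid outputs:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data: the label depends linearly on two features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = LogisticRegression().fit(X, y)
probs = clf.predict_proba(X[:5])  # each row: P(y=0|x), P(y=1|x), summing to 1
```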

Support Vector Machine (SVM)
Support Vector Machine is another discriminative model like LR. It is a supervised learning model for analyzing data, used for classification, regression and outlier detection [25,26]. SVM is most applicable in the case of non-linear data.
Given input x_i with class or label c_i and Lagrange multipliers α_i, the weight vector can be calculated by the following equation:

w = Σ_i α_i c_i x_i   (2)

The target of the SVM is to optimize the following equation:

max_α Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j c_i c_j ⟨x_i, x_j⟩   (3)

In Eq. (3), ⟨x_i, x_j⟩ is an inner product which can be computed by different kernels such as the polynomial kernel, the Radial Basis Function kernel and the sigmoid kernel [27].
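The kernel trick can be illustrated with scikit-learn's SVC on synthetic, non-linearly separable data; this is a sketch, not the paper's exact configuration:

```python
import numpy as np
from sklearn.svm import SVC

# Points inside vs. outside the unit circle: not linearly separable
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)

# An RBF kernel replaces the inner product <x_i, x_j> in the dual objective
clf = SVC(kernel="rbf").fit(X, y)
```

`clf.dual_coef_` holds the products α_i·c_i for the support vectors, matching the role of the multipliers in the dual formulation.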

Decision Tree (DT)
Decision Tree allows each node to weigh possible actions against one another based on their benefits, costs, and probabilities. Overall, it is a map of the possible outcomes of a series of related choices [28]. A DT generally starts with a single node and then branches into possible outcomes. Each of these outcomes leads to additional nodes, which branch off into other instances, giving a tree-like shape; in other words, a flowchart-like structure. Consider the binary tree of Fig. 3, where a parent node is split into two child nodes, a left child and a right child. The parent node, left child and right child contain data P_d, LC_d and RC_d, respectively [29]. Given features x, an impurity measure I(data), the number of samples in the parent node P_n, the number of samples in the left child LC_n and the number of samples in the right child RC_n, the DT's target is to maximize the Information Gain in Eq. (4).

Information Gain(P_d, x) = I(P_d) − (LC_n / P_n) · I(LC_d) − (RC_n / P_n) · I(RC_d)   (4)

As the impurity measure, the entropy can be used:

I(n) = − Σ_c p(c | n) log_2 p(c | n)

where c denotes classes or labels, n denotes any node and p(c | n) denotes the ratio of samples of class c with respect to the samples in node n.
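The information-gain computation of Eq. (4) can be worked through directly; the class counts below are made up purely to illustrate the arithmetic:

```python
import math

def entropy(counts):
    """Entropy of a node given its per-class sample counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

parent = [40, 40]            # class counts at the parent node (illustrative)
left, right = [30, 10], [10, 30]

n = sum(parent)
gain = entropy(parent) - (sum(left) / n) * entropy(left) \
                       - (sum(right) / n) * entropy(right)
# The split separates the classes partially, so the gain is positive
```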

Random Forest (RF)
As the name implies, the random forest algorithm creates a forest of many decision trees. It is a supervised classification algorithm, attractive due to its high execution speed [30]. Many decision trees are ensembled together to form a random forest, which predicts by aggregating the predictions of its component trees. It usually has much better predictive accuracy than a single decision tree, and in general, the more trees in the forest, the more robust the model.
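A minimal RF sketch on synthetic data; the tree count here is an arbitrary choice, not the paper's setting:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data where the label depends on the sign of one feature
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = (X[:, 0] > 0).astype(int)

# An ensemble of 100 trees; prediction aggregates the trees' votes
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```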

Artificial Neural Network (ANN)
Artificial Neural Network (ANN) is a machine learning technique which is the skeleton of different deep learning algorithms. An ANN model can be trained on raw data. Compared to other classifiers it has a large number of tunable parameters, which makes it a complex structure, and it takes longer to optimize its error than other techniques. For this reason, neural network instances are trained on a Graphics Processing Unit using CUDA programming. Each single neuron node of the ANN is fed the feature set X = X_1, X_2, X_3, ..., X_n (where X_1 to X_n are distinct features). The features are multiplied by some randomly initialized weights W = W_1, W_2, W_3, ..., W_n and added with bias values b = b_1, b_2, ..., b_n, giving the weighted input z = Σ_i W_i X_i + b. This value is then given as input to a non-linear activation function [8]. Activation functions can be of several types, for example:

Sigmoid: f(z) = 1 / (1 + e^−z)
Tanh: f(z) = (e^z − e^−z) / (e^z + e^−z)
ReLU: f(z) = max(0, z)
Leaky ReLU: f(z) = z if z > 0, otherwise 0.01z

After applying the non-linear function, a softmax function is applied to get the initial predicted value, as shown in Eq. (12).

Predicted value: ŷ_i = e^(z_i) / Σ_j e^(z_j)   (12)

Lastly, from the true value y and the predicted value ŷ, the loss function is calculated, and the weights of the whole neural network architecture are modified using the backpropagation technique, gradient descent and the error obtained from the loss function. The cross-entropy loss function is given by:

L(y, ŷ) = − Σ_i y_i log ŷ_i
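One forward pass through the steps above can be sketched in NumPy with small fixed weights; everything here is illustrative, not the trained model:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

x = np.array([0.5, -1.2, 3.0])       # one hypothetical feature vector
W = np.ones((4, 3)) * 0.1            # hidden-layer weights (assumed)
b = np.zeros(4)                      # hidden-layer biases
V = np.ones((8, 4)) * 0.1            # output weights for 8 classes (assumed)

h = relu(W @ x + b)                  # weighted sum + non-linear activation
p = softmax(V @ h)                   # class probabilities, as in Eq. (12)
loss = -np.log(p[0])                 # cross-entropy if the true class is 0
```

With these uniform weights every output logit is equal, so the softmax is uniform over the 8 classes; real training would differentiate the weights via backpropagation.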

Evaluation criteria
The following metrics were calculated for evaluating the performance of the developed system. Using these metrics, one can decide which technique is best suited for this work.

Confusion matrix
The confusion matrix is used to visualize the performance of a technique. It is a table that describes the performance of a classification model on a set of test data for which the true values are known, and it allows easy identification of confusion between classes. Most performance measures are computed from it; in short, a confusion matrix is a summary of prediction results on a classification problem [31]. Definitions of True Positive (TP), False Positive (FP), False Negative (FN) and True Negative (TN) for multiple classes can be given from the confusion matrix. Let C i be any class out of the eight classes. Following are the definitions of TP, FP, FN, and TN for C i : • TP(C i ) = All the instances of C i that are classified as C i .
• FP(C i ) = All the non C i instances that are classified as C i .
• FN(C i ) = All the C i instances that are not classified as C i .
• TN(C i ) = All the non C i instances that are not classified as C i .
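These four quantities fall out of the confusion matrix by row and column sums; the toy 3-class matrix below stands in for the paper's 8-class one:

```python
import numpy as np

# Rows: true class, columns: predicted class (made-up counts)
cm = np.array([[5, 1, 0],
               [2, 7, 1],
               [0, 0, 9]])

i = 1                        # pick class C_i
TP = cm[i, i]                # C_i classified as C_i
FP = cm[:, i].sum() - TP     # non-C_i classified as C_i
FN = cm[i, :].sum() - TP     # C_i classified as something else
TN = cm.sum() - TP - FP - FN # everything else
```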

Accuracy
A model's accuracy is only a subset of the model's performance. Accuracy is one of the metrics for evaluating classification models; it is the fraction of predictions the model got right:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Receiver operating characteristic curve
It is a commonly used graph that summarizes the performance of a classifier over all possible thresholds. It is generated by plotting the True Positive Rate against the False Positive Rate as the threshold for assigning observations to a given class is varied [31]. The True Positive Rate and False Positive Rate are calculated by the following equations:

True Positive Rate = TP / (TP + FN)
False Positive Rate = FP / (FP + TN)

The threshold value is the probability value for each predicted class. The ROC curve can be drawn for binary classes; however, using the one-vs.-rest method, it can be extended to multiple classes. The values of the true positive rate and false positive rate for each class range from 0 to 1.
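A one-vs.-rest ROC sketch with scikit-learn, using synthetic three-class data in place of the real eight classes:

```python
import numpy as np
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc
from sklearn.linear_model import LogisticRegression

# Synthetic multiclass data (random labels, purely illustrative)
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))
y = rng.integers(0, 3, size=300)
Y = label_binarize(y, classes=[0, 1, 2])   # one binary column per class

scores = LogisticRegression().fit(X, y).predict_proba(X)

# One-vs.-rest: class 0 against the rest gives one ROC curve
fpr, tpr, _ = roc_curve(Y[:, 0], scores[:, 0])
```

Repeating the last line per class column yields the full set of one-vs.-rest curves.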

Experimental setup
The experiment was done on an HP (EliteDesk 800 G3 TWR) desktop where the operating system was Windows 10 Enterprise 64-bit (10.0, Build 17134) (17134.rs4_release.180410-1804) and the processor was an Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz (8 CPUs). The memory of the desktop was 16.384 GB RAM. An NVIDIA GeForce GTX 1070 graphics card was used for running the program. For data handling, cleaning and feature engineering, the Pandas and NumPy frameworks; for data visualization, the Matplotlib and Seaborn frameworks; and for data analysis, the scikit-learn and Keras frameworks were used.

Result analysis
In the Data Analysis subsection, it was described that several machine learning techniques were applied to the dataset. Five-fold cross-validation was performed on the dataset with each of these techniques. Fig. 4(a) and (b) shows how the accuracy results converge over the five folds. From the cross-validation, it can be inferred that RF and ANN performed best in both training and testing accuracy. DT performed approximately as well as RF and ANN in training. In testing, DT showed the largest deviations of all techniques and performed poorly at first; however, in the last three folds it performed similarly to RF and ANN. SVM and LR performed worse than the other techniques in training. In testing, over the first two folds, SVM and LR both performed better than the other techniques, with logistic regression best among them, but over the last three folds they performed worse than the others. Table 3 presents different evaluation metrics for each technique trained on the dataset. From Table 3, it can be seen that DT and RF have higher accuracy, precision, recall and F1 score values than the other techniques. ANN also performed well under these metrics; however, DT and RF are slightly more accurate than ANN. LR and SVM also do reasonably well on our dataset, but not as well as the other classifiers. Finally, considering the confusion matrices of each technique, the most effective technique can be identified: from the confusion matrices in Fig. 5, it can be concluded that RF is the best technique for this work.
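The five-fold cross-validation step can be sketched as follows; toy data and a single DT stand in for the full experiment:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data for the real feature vectors
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

# cv=5 yields one held-out accuracy per fold, as plotted in Fig. 4
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
```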

Comparative study
Pahl et al. [1] acquired a 96.3% accuracy value for multiclass classification using K-Means and BIRCH clustering. Liu et al. [5] developed a data-packet-based anomaly detector for a binary classification problem and focused on the energy consumption of each node; their identification rate of the malicious node is above 80%. Like the data source used in this research, Diro et al. [8], Pajouh et al. [6] and D'Angelo et al. [15] used a dataset from another popular open source; the link to the dataset is given in Table 4. The authors of [11] did not describe much about their dataset and analysis results; they detected when an attack occurs using time series data.
Compared to the other papers, ours provides a much more detailed description of the dataset. It also provides a clear explanation of the dataset preprocessing steps. The paper focuses on classifying multiple classes, which is harder than binary classification. Lastly, a clear description of the evaluation metric values for each classifier is given in this paper.

Conclusion
Based on the full study, it was found that the RF technique should be used on these kinds of datasets for countering cyberattacks on IoT networks, because RF predicted the D.P, M.C, M.O, SC, SP and W.S attacks more accurately than the other approaches. In the case of DoS and Normal, it also predicted more samples accurately than the other techniques. Hence, relying on these estimations, it can be concluded that RF is the best technique for this particular study. However, only classical machine learning approaches are employed over the dataset here, and a comparative study is given; no new algorithm is devised on this dataset. Hence, further study is needed to develop a robust detection algorithm, and more analysis should be devoted to designing the whole framework. Besides, this work is based on virtual environment data; in the case of real-time data, different problems may arise, so a more empirical study focusing on real-time data is needed. In an IoT network, microservices behave differently at different times, which causes deviations from normal behavior in IoT services and thus creates anomalies. Further study is needed to interpret these problems in a more in-depth way. In this study, RF performs comparatively better, with an accuracy of 99.4%. However, this does not assure that RF will perform this way in the case of big data and other unknown problems. Hence, more study will be needed.

Conflict of interest
None.