An Efficient Framework for Detection and Classification of IoT Botnet Traffic

The Internet of Things (IoT) has become an integral part of everyday life. According to IDC, the number of IoT devices may grow exponentially, approaching a trillion in the near future. The inherent vulnerabilities of their cyberspace expose them to a range of serious cyber-attacks, so the security of IoT systems has become a prime concern for consumers and businesses. Enhancing the reliability of IoT security systems therefore requires a better, real-time approach, and for this purpose the creation of a real-time dataset is essential for IoT traffic analysis. In this paper, an experimental testbed is devised for the generation of a real-time dataset from IoT botnet traffic, in which each of the bots can launch several possible attacks. In addition, an extensive comparative study of the proposed dataset against existing datasets is conducted using popular Machine Learning (ML) techniques to show its relevance in real-time scenarios.

IoT is a highly demanded technical paradigm that bridges the gap between physical devices and remote decentralized applications. The number of IoT devices might increase to 75 billion on a global scale by 2025. 1 As tech firms continue to produce a plethora of IoT devices for consumers and businesses, security measures are becoming critical. These devices are linked directly to the Internet and are therefore prone to remote exploitation. Unfortunately, the majority of these IoT devices were produced without built-in security, leaving them ideal targets for security threats. 2 In the context of IoT, there are many security challenges such as privacy, backdoor entry, flood attacks, etc. 3 The IoT botnet is one of the most significant security risks in the field of IoT applications. An IoT botnet is a network of exploited devices (bots) controlled remotely by a botmaster or an attacker, as shown in Fig. 1. The IoT botnet is managed by a Command and Control (C&C) server that uses different bots to carry out synchronized attacks or tasks. 4 In addition, an IoT bot is a software-based robot that searches for vulnerable devices and turns them into bots in a similar manner to existing bots; it operates in the background as a malware extension. Several IoT botnets are upgrading and spreading rapidly to exploit weaknesses in IoT devices.
IoT botnets are triggered after receiving instructions from the C&C server and then generate massive amounts of traffic to victimize target nodes. These botnets can be detected using various techniques such as Intrusion Detection Systems (IDS), Convolutional Neural Network (CNN)-LSTM models, SIEM-based detection, etc. 5 The growth of IoT botnets has risen exponentially since the release of the Mirai IoT botnet in October 2016. 6 An estimated 1.5 billion IoT botnet attacks occurred in the first six months of 2021. 7 IoT botnets have targeted numerous popular internet services including DYN, Amazon's Alexa, Google Home, Siri, and other smart devices. As a result, strengthening the protection of IoT devices and their networks is becoming increasingly important for researchers, particularly when contending with IoT malware. To protect against IoT botnet attacks, many top-down approaches among the most popular privacy and security solutions have been explored and compared. 8 Tsunami, Bashlite, Mozi, and further variants of Mirai emerged after the source code of Mirai was exposed on the Internet in 2016. 9 Since the Mirai source code is publicly available, developers are creating new kinds of malicious code for IoT devices that build on the advantages of previous IoT botnets. 10 To analyze such IoT botnet traffic, the UNSW_NB15 and BoT_IoT datasets were created by the Cyber Range Lab of UNSW Canberra, Australia, in 2015 and 2019 respectively. UNSW_NB15 is a virtual dataset with a limited number of attack vectors, and although BoT_IoT is a real-time dataset, it still contains too few attack vectors to support a detailed traffic analysis of IoT botnets and to find root causes across the various possible attack scenarios. Given these limitations, we decided to create a real-time dataset based on an experimental testbed in an actual attack scenario.
The dataset we have created contains a larger number of attack vectors than its counterparts, which makes it more useful for future developments because these vectors describe many attack scenarios in greater detail.
The major contributions of the paper are listed below:
• To propose a framework for the detection of IoT botnets in a real-time IoT network scenario.
• To build a real-time IoT botnet dataset (RI_BoT) using Raspberry Pi modules and a real-time configured architecture.
• To review and apply several machine learning algorithms for classification, comparing their performance on the RI_BoT dataset against popular datasets such as BoT_IoT and UNSW_NB15.
• Finally, to compare the results on the basis of performance parameters and identify the potential threats.
The rest of this paper is arranged as follows. In Section 2, we have mentioned the contributions of our scheme. In Section 3, we go through some of the existing schemes and related work. Section 4 contains our proposed methodology. Section 5 highlights the description of the selected datasets for comparative analysis. Section 6 contains the results obtained from our proposed method and its comparison with existing datasets. Finally, Section 7 concludes our research work.

Related Work
With the advancement of IoT, many real-world smart technologies have recently attracted the attention of different manufacturers aiming to enhance the quality of human life. Because IoT devices have limited resources and lack embedded security mechanisms, attackers can easily target each device and the entire IoT system. As a result, attackers' goals have shifted from attacking individuals to attacking organizations. This has necessitated the development of an effective IoT botnet analysis system to discover new, unknown, and groundbreaking botnets in IoT devices. 3 The major issue in today's IoT botnet analysis is finding invisible and cryptic malicious files or programs, as malware programmers develop various bypassing strategies that tuck information away to avoid detection. Also, with the massive rise in internet users, cyberattacks, including zero-day attacks, have a disastrous influence on people's lives in terms of effectiveness and scale. 7 In this section, we discuss past research work in two aspects: one related to IoT botnet dataset generation, and the second related to IoT botnet detection frameworks using machine learning models.
Here, we describe the popular dataset generation efforts in the field of IoT botnets.
Mrutyunjaya Panda 11 used the well-structured UNSW-NB15 dataset to classify cyberattacks. This dataset is based on VM simulation rather than real-time scenarios, and the scheme focused on detecting IoT botnets quickly and accurately rather than diving deep into analyzing novel IoT botnets like Mirai. To overcome this, Koroniotis N 12 proposed the BoT_IoT dataset, which includes real and virtual IoT network traffic as well as several kinds of attacks. However, only a limited number of ML and DL models were applied to evaluate the results, which lack a dedicated IoT botnet detection feature.
Hai-Viet Le 13 proposed the IoT-BDA framework to combat the growing sophistication of IoT botnets. The IoT-BDA architecture comprises a variety of honeypots and newly evolved sandboxes, and it is made up of blocks such as BCB and BAB, which are still being improved to make them more dependable and scalable across various CPU architectures and forthcoming IoT devices. Valerian Rey 14 presented a paradigm based on a newer technology known as Federated Learning (FL) to overcome this. The scheme was evaluated on the N-BaIoT dataset, which comprises network traffic across various IoT devices tainted by malware. Several adversarial setups with multiple malicious participants were constructed to test the FL-based framework's resilience, robustness, and scalability. The model's performance is still being observed as it is compared with many updated malware versions.
Jun Jeon 15 devised the DAIMD model to identify both existing IoT malware and newly developed varieties. DAIMD works by first learning the malware's behavior from data and then acting on it. The DAIMD model falls short when IoT malware evades the analysis phase inside a VM, necessitating a hybrid analytic technique. Segun I. Poopla 16 developed a hybrid analysis model based on Deep Learning (DL) approaches to overcome this; however, the DL was applied to the BoT_IoT dataset, a completely VM-based simulation that lacks the features and parameters of real-time datasets. Javed Ashraf 17 brought up the IoTBoT-IDS model, which examines the usual behavior of IoT network traffic using statistical learning techniques such as Correntropy and BMM. This model has yet to be applied in smart controllers for SDN or extended with additional AI, ML, and DL approaches; its analysis is purely statistical and would need to be ML/DL-based to be more effective. To overcome this, Segun I. Poopla 18 devised an RNN model for identifying traffic in unbalanced networks in addition to ML and DL approaches, but only one dataset was used for classification and comparison, and rather than analyzing IoT botnet variants, the experimental analysis was compared only with other ML/DL models.
Addressing all the problems mentioned above requires an intelligent IDS to detect IoT botnets.

Proposed Methodology
The IoT botnet architecture explained in Fig. 2 demonstrates the complete testbed setup and attack procedure of the IoT botnet. The steps followed for its execution are as follows:
Build your own botnet.-The attacker creates the IoT botnet using BYOB, which produces bots, before sending it to the C&C server.
C&C server (attacker).-The IoT botnet is shared with the linked IoT devices with the help of the local admin.
IoT devices (victim).-The data received from the server is executed by the IoT devices.
Dataset analysis.-The datasets are produced using the Wireshark network analyzer tool and then classified using CatBoost and six other models, as specified in phase 4 of the proposed model.

Build Your Own Botnet
The entire architecture is built upon IoT botnets produced with the help of the BYOB tool. 19 The BYOB-generated bots include 13 different malware exploitation modules that are activated once the bots are installed on IoT devices. The following steps describe the workflow of Phase 1.
Step 1: Using a Linux operating system, the attacker gains access to the BYOB web-GUI.
Step 2: The attacker creates an IoT botnet by accessing the BYOB payload generation section and choosing the appropriate architecture of the victim's machine that he intends to target.
Step 3: Once the IoT botnet has been successfully installed on the victim's machine, the attacker generates bots by selecting any of the 13 specific exploitation modules from the control panel area of the BYOB web-GUI.
Step 4: Finally, the IoT botnets are loaded onto the C&C server in preparation for the attack on the IoT devices.

C&C Server (Attacker)
IoT botnets are loaded on the C&C server. The server must deliver essential updates or distribute files to all local or remote users that have access to it. The sequence of events for Phase 2 is outlined in the steps below.
Step 1: The IoT botnets are loaded on this server without the local administrator being informed, and it operates as a C&C server under the attacker's control.
Step 2: Unaware of the attack, the local administrator simply sends the appropriate files and updates to the local server as per its schedule.
Step 3: The C&C server then forwards (or broadcasts) the updates or files associated with the linked IoT botnet to all local IoT devices.
Step 4 (Impacts): The attacker monitoring the server learns about the files or changes that are sent to victims via the local admin. The attacker's method is depicted in the sequential flow diagram in Fig. 3.
As shown in Fig. 3, the attacker starts by spoofing the source address and sending a large number of TCP packets to the victim, making the traffic appear arbitrary or even as though it originates from a legitimate source. Figure 3 shows the involvement of the attacker, who creates a batch script (.bat) and sends it to the victim over FTP (FileZilla). The script's function is to start an auto-download and installation operation in the background as soon as the victim double-clicks the file, without the victim being aware of the attack. The script used here was written by the attacker with the constraint that Windows 10, installed on the Raspberry Pi 3B+, does not treat it as malware.

IoT Devices (Victim)
The server's updates or files are received over Wi-Fi on both IoT devices, which are Raspberry Pi 3B+ boards. The victim is notified of the received data as a high-priority update or file from the server to be installed. The IoT botnet attack auto-initializes as soon as it is installed on the victim's device, i.e., when the victim installs the updates or double-clicks the files to open them. Meanwhile, the attacker watches the entire process: s(he) first halts the traffic flood into the victim's device and then uses the BYOB web-GUI's control panel to select one of the 13 exploitation modules and load the bots accordingly. The attacker then has complete access to the services provided by the IoT devices on the victim's end, so they can be easily exploited. The bot services subsequently operate in the background on the victim's devices, resulting in an IoT botnet attack against the IoT devices.
In addition, the Wireshark network protocol analyzer was utilized to construct our new IoT botnet dataset, RI_BoT, and to analyze network traffic. 20 We build a .csv dataset using this tool for IoT botnet analysis on the IoT devices.
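As an illustration of how a Wireshark export can be turned into a labelled .csv for model input, the sketch below loads a hypothetical two-packet export with pandas and attaches a binary class label. The column names and the attacker address are assumptions for this example only, not the actual RI_BoT schema.

```python
import io

import pandas as pd

# Hypothetical excerpt of a CSV exported from Wireshark; the real RI_BoT
# column set is richer than this two-row sketch.
raw = io.StringIO(
    '"No.","Time","Source","Destination","Protocol","Length"\n'
    '"1","0.000000","192.168.1.10","192.168.1.20","TCP","74"\n'
    '"2","0.000112","192.168.1.99","192.168.1.20","TCP","1514"\n'
)
df = pd.read_csv(raw)

# Illustrative labelling rule: traffic from the known attacker host is
# marked "attack" (the attacker IP below is an assumption, not from the paper).
ATTACKER_IP = "192.168.1.99"
df["label"] = df["Source"].apply(lambda s: "attack" if s == ATTACKER_IP else "normal")
print(df["label"].tolist())
```

A real pipeline would apply such a rule to the full capture and write the result back out with `df.to_csv(...)` for the classification phase.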

Dataset Analysis
Our new RI_BoT dataset has been created as a result of the aforementioned processes. In this phase, we analyze it mainly using the CatBoost classifier along with six other models: Logistic Regression, Decision Tree, Support Vector Machine, Neural Network, Gradient Boosting, and XGBoost. The CatBoost model helps us determine the results obtained with BYOB in terms of accuracy, precision, recall, and F1 score on the IoT devices.

Classification Models and Ensemble Learning
The different classification models and ensemble learning approaches that we applied to the datasets are discussed below.
Classification models.-The classification methods we used are supervised learning techniques that utilize training data to identify classification criteria for new inferences. They use ML techniques to learn how to classify datasets from the concerned domain with a class label. Targets, labels, and categories are other terms used to describe the classes, for example, attack or normal. A variety of classical ML and ensemble learning approaches may be used; several of them are described below.
Logistic Regression: This technique uses the dataset divided into a training ratio of 70 percent and a testing ratio of 30 percent. Since logistic regression uses a sigmoid function, the default decision threshold is 0.5. For hyperparameter tuning, "fit_intercept" is set to false and the maximum number of iterations to 10. The intercept in logistic regression is the predicted mean value of Y when X = 0. After running 10 iterations, the model achieves an F1-score of 97.5% and an accuracy of 97.4%.
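A minimal sketch of this configuration using scikit-learn, on synthetic data standing in for the RI_BoT features (the dataset itself is not reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the RI_BoT feature matrix.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# 70/30 train/test split, as described in the text.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Hyperparameters quoted in the text: no intercept, at most 10 iterations.
clf = LogisticRegression(fit_intercept=False, max_iter=10)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)  # predict() applies the default 0.5 threshold

acc = accuracy_score(y_te, pred)
f1 = f1_score(y_te, pred)
```

With so few iterations the solver may not fully converge; the quoted scores apply to the actual RI_BoT data, not this synthetic sketch.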
Decision Tree: This technique has the advantage that we do not need to standardize the data (mean = 0 and variance = 1). The same dataset split as before was used. Hyperparameter tuning was also performed: the maximum depth is set to 4, so node splitting ceases once the tree reaches that depth during construction. Further, "min_weight_fraction_leaf" is set to 0.04, so each leaf node must receive at least that fraction of the total sample weight (with per-instance sample weights allowed); this is one way to cope with class imbalance. The splitter is set to random. Splitting continues throughout training until the nodes are homogeneous, which is the most likely reason the decision tree works so well. After hyperparameter tuning, the model achieves an F1-score of 91% and an accuracy of 91.7%.
Support Vector Machine: Here, an SVM with a linear kernel is used on the input dataset. It is mostly used whenever a data collection contains a significant number of features. The maximum number of iterations is set to 10, which yields an F1-score of 62.4% and an accuracy of 57.3%.
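The two configurations above can be sketched with scikit-learn as follows, again on a synthetic stand-in for the real features:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Decision tree with the hyperparameters quoted in the text.
tree = DecisionTreeClassifier(
    max_depth=4,                    # stop splitting once depth 4 is reached
    min_weight_fraction_leaf=0.04,  # each leaf keeps >= 4% of the sample weight
    splitter="random",
    random_state=0,
)
tree.fit(X_tr, y_tr)
tree_acc = accuracy_score(y_te, tree.predict(X_te))

# Linear-kernel SVM capped at 10 iterations, as in the text.
svm = SVC(kernel="linear", max_iter=10)
svm.fit(X_tr, y_tr)
svm_acc = accuracy_score(y_te, svm.predict(X_te))
```

Capping the SVM at 10 iterations will typically raise a convergence warning, which is consistent with the comparatively low SVM scores reported above.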
Neural Network: This comprises a set of algorithms that seek hidden patterns inside the dataset, used as in the previous algorithms. The maximum number of iterations is 50; beyond that, the model yields higher validation loss. Another hyperparameter is the solver, "Adam," which handles weight optimization. Adam is a DL training optimizer used in place of plain stochastic gradient descent; it combines the best aspects of the RMSprop and AdaGrad techniques to provide an optimized solution for noisy problems with sparse gradients. The model achieves an F1-score of 66.31% and an accuracy of 61.32%.
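A sketch of this setup with scikit-learn's MLPClassifier; the hidden-layer architecture is the library default, an assumption not stated in the text:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Solver "adam" and at most 50 iterations, as in the text; the hidden-layer
# sizes are the scikit-learn default (one layer of 100 units).
mlp = MLPClassifier(solver="adam", max_iter=50, random_state=0)
mlp.fit(X_tr, y_tr)
pred = mlp.predict(X_te)

mlp_acc = accuracy_score(y_te, pred)
mlp_f1 = f1_score(y_te, pred)
```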
Ensemble learning.-This is a generic ML approach that aims to improve prediction results by integrating predictions from various models. It is mostly used to improve a model's performance or to lessen the chance of randomly picking a bad one. Ensemble learning may also be used to attach a confidence level to the model's choice, to pick optimal (or near-optimal) attributes, and in data acquisition, reinforcement learning, stochastic learning, and error correction. We used the following ensemble learning approaches to obtain useful results.
Gradient Boosting: This technique combines numerous weak classifiers to create a robust regression and classification model. For hyperparameter tuning, the learning rate is set to 0.001. A slower learning rate is expected to optimize the result toward better global weights, although at the cost of longer training time. With this tuning, the model achieves an F1-score of 96.4% and an accuracy of 96.3%.
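This tuning can be sketched with scikit-learn's GradientBoostingClassifier; the number of boosting stages is the library default, an assumption not stated in the text:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Learning rate 0.001 as quoted; the number of boosting stages is the
# scikit-learn default (100), an assumption here.
gb = GradientBoostingClassifier(learning_rate=0.001, random_state=0)
gb.fit(X_tr, y_tr)
gb_acc = accuracy_score(y_te, gb.predict(X_te))
```

With a rate this small, many more stages would usually be needed for the ensemble's predictions to move far from the initial estimate, which is the training-time trade-off described above.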
XGBoost: This algorithm is comparatively dominant at providing results on structured or tabular data. It is a high-speed, high-performance implementation of gradient boosted decision trees. The learning rate option, which determines how new trees are weighted, is set to 0.01. The model gives an F1-score of 96.4% and an accuracy of 96.3%.
CatBoost: This technique generates state-of-the-art results without the extensive data training required by conventional ML approaches. The model is trained on 70% of the data points. The maximum depth is set to 3; for CatBoost, a maximum depth in the range of 3 to 10 is advisable to achieve good accuracy and avoid overfitting. The learning rate is set to 0.001 and kept small so that the optimizer can find local minima more efficiently. The maximum number of iterations is 5, meaning the model iterates 5 times and the best result is taken as the output. The other CatBoost hyperparameters are left at their defaults. With this tuning, the model achieves an F1-score of 96.4% and an accuracy of 96.2%.

Dataset Description
UNSW_NB15: The UNSW_NB15 dataset's unprocessed network traffic was created in UNSW Canberra's Cyber Range Laboratory using the IXIA PerfectStorm tool to generate a mixture of genuine normal activity and simulated modern attacks. Using the tcpdump software, 100 Gigabytes of raw traffic were recorded as Pcap files. 21 The UNSW_NB15 dataset contains 9 distinct forms of attack: worms, analysis, shellcode, fuzzers, exploits, generic, DoS, reconnaissance, and backdoors. 22 The Bro-IDS and Argus tools are used to create a total of 49 features associated with class labels, and 12 methods are created. 23 The UNSW_NB15_features.csv file describes these features.
The Training.csv and Testing.csv files are two partitions of this dataset specified as training and testing sets. The dataset comprises 2,540,044 records in total. The training and testing parts consist of 175,341 and 82,332 records respectively, covering various forms of traffic, both normal and attack. 24 The dataset includes GT.csv, the ground-truth table, while EVENTS.csv is the complete events file. The simulated UNSW_NB15 dataset is 100 Gigabytes in size. 25 It is composed of PCAP files including a mix of typical and invasive traffic; Analysis, Worms, Generic, Fuzzers, Shellcode, Exploits, DoS, Reconnaissance, and Backdoors are the IoT botnet attack types covered by the UNSW_NB15 dataset.
BoT_IoT: The BoT_IoT dataset was generated in UNSW Canberra's Cyber Range Lab by simulating a practical real-time environment. 26 In this environment, there is a mixture of normal and attack network traffic. 27 The dataset's source files come in several formats, including raw pcap files and .csv files. 28 Furthermore, these files are divided into attack types and subtypes to make the tagging process easier. With almost 72,000,000 records, the recorded pcap files are approximately 69.3 Gigabytes in size; in CSV format, the captured flow traffic is 16.7 Gigabytes. 29 Data Exfiltration, Service Scan, OS scan, DoS, Keylogging, and DDoS attacks are all included in the dataset, with DoS and DDoS attacks further categorized by the underlying protocols used. 30 A 5% extraction of the entire BoT_IoT dataset was produced using MySQL queries to make the dataset easier to handle; this 5% subset is made up of four files with a combined size of 1.07 Gigabytes and around 3 million entries. 12
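The 5% extraction idea can be sketched in pandas on a toy stand-in table (the original work used MySQL queries over the full capture of roughly 72 million rows):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the full BoT_IoT flow table (~72 million rows in reality).
rng = np.random.default_rng(0)
full = pd.DataFrame({
    "flow_id": np.arange(10_000),
    "label": rng.choice(["normal", "attack"], size=10_000),
})

# The published subset keeps 5% of the rows; a random sample reproduces
# the idea at small scale.
subset = full.sample(frac=0.05, random_state=0)
print(len(subset))  # 500
```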

Results and Discussion
In this section, we describe the performance metrics used to analyze the performance of three different datasets: the RI_BoT, BoT_IoT, and UNSW_NB15 datasets. Figure 4 shows the architecture of the ML models applied to our datasets. We measured performance with parameters such as Precision, Accuracy, Recall, the Confusion Matrix, and the F1 score. The results were obtained with the classification models described above and used to compare the efficiency of the three datasets, to determine which of the three performs best, and to present the confusion matrix for the best-performing classification models.

Performance Parameters
Performance metrics summarize the results of the data models in order to evaluate the performance of the datasets. We applied the standard parameters, namely Precision, Recall, Accuracy, the Confusion Matrix, the F1 score, and the Receiver Operating Characteristic (ROC) curve, to the 7 different ML models mentioned above. In the formulas below, TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives respectively.

Accuracy: The simplest and most intuitive performance metric; the number of correctly predicted observations divided by the total number of observations.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision: The ratio of correctly predicted positive observations to all observations predicted positive; a high Precision corresponds to a low false-positive rate.

Precision = TP / (TP + FP)

Recall: The ratio of correctly predicted positive observations to all actual positive observations.

Recall = TP / (TP + FN)

F1 score: The harmonic mean of Precision and Recall.

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Fig. 4. It shows how a classifier's diagnostic performance changes as the classification threshold changes.
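The metrics above can be computed directly from confusion-matrix counts; a worked example on hypothetical counts:

```python
# Hypothetical confusion-matrix counts, for illustration only.
TP, TN, FP, FN = 80, 90, 10, 20

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
# prints: 0.85 0.889 0.8 0.842
```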

Evaluation
This section highlights the results of the different classification models and ensemble learning techniques on several performance criteria: Precision, Accuracy, F1 Score, and Recall. We evaluated and compared the efficiency of the best-performing existing datasets, BoT_IoT and UNSW_NB15, against our new RI_BoT dataset. The overall performance metrics of the three IoT botnet datasets are summarized in Table I. On the three separate datasets RI_BoT, UNSW_NB15, and BoT_IoT, we achieved accuracies of 96.290%, 99.983%, and 96.645% respectively by applying the above-mentioned ML models, with the CatBoost model as our major focus. Using the performance metrics, we found that the CatBoost model produces the best results.

Confusion Matrix and ROC Curve
This section presents the confusion matrix for the CatBoost model and the combined ROC curve as applied to the three datasets.
From Table I, it can be concluded that the CatBoost classifier performs well on our newly generated RI_BoT dataset as well as on the best-performing existing datasets, UNSW_NB15 and BoT_IoT. Figures 6, 8, and 10 present the ROC curves used to compare the results obtained from the several ML models applied. Finally, considering the ROC curves and the real-time nature of the generated datasets, RI_BoT with the CatBoost model provides the best results in terms of Accuracy and Recall.

Conclusion and Future Work
This paper presents a novel scheme that produces a previously non-existent real-time dataset, named RI_BoT, consisting of detailed descriptions of real-time traffic from sensors and actuators implemented on the Raspberry Pi 3B+ module. The dataset was created on a realistic testbed and tested alongside pre-defined datasets, UNSW_NB15 and BoT_IoT, using the above-mentioned ML models. A detailed comparative analysis of these datasets is provided in the results section, where RI_BoT achieves roughly the same accuracy as BoT_IoT with the CatBoost model. As this work is based on a real-time environment, it could serve as a benchmark for further investigation of IoT botnets in terms of the vulnerabilities and potential threats associated with them. The proposed real-time dataset (RI_BoT) could be further investigated using hybrid techniques for IoT botnet traffic analysis.