An Intelligent Context-Aware Threat Detection and Response Model for Smart Cyber-Physical Systems

The Internet of Things (IoT) now connects smart cities, businesses, workplaces, and even residences. The types and characteristics of IoT devices vary across Industry 4.0 applications, and their number has increased rapidly in recent years, especially in smart homes. Because of their computing constraints and weaknesses in the security-by-design concept, these devices can expose users to serious cyber threats. This study presents a smart home network testbed that is used to evaluate and validate the protection of a smart cyber-physical system. The context-aware threat intelligence and response model identifies the states of the connected smart devices to distinguish between real-world normal and attack scenarios. It then dynamically writes specific rules for protection against potential cyber threats. The context-aware model is trained on the IoT Research and Innovation Lab - Smart Home System (IRIL-SHS) testbed dataset. The labeled dataset is used to build a random forest model, which is subsequently used to train and test the effectiveness and performance of the context-aware threat intelligence SHS model. Finally, the model's logic is used to derive rules for inclusion in Suricata signatures and firewall rulesets for the response system. The results showed significant values for the measured performance parameters. The presented model can be used for the real-time security of smart home cyber-physical systems and offers a perspective on the security challenges of Industry 4.0.


Introduction
The exponential rise in IoT devices has also driven the enhancement of communication services [1]. IoT devices have revolutionized and simplified our way of life, including how we live, work, communicate, study, manage our health, and enjoy ourselves. It is predicted that there will be more than 13 trillion connected IoT devices by 2030 [2]. As a result of this exponential growth, more security challenges are emerging. One of the best-known applications of the Internet of Things is the smart home, which poses a significant security problem in preventing threats from unidentified sources. According to the latest survey, smart home devices are being used by over two-fifths of developed countries, nearly twice the percentage of developing countries [3]. Even though 98% of respondents are concerned about the confidentiality of their devices, more than 50% have taken no action to safeguard them [4]. The most prevalent attacks on IoT devices include hijacking them for bitcoin mining and other cybercrime, data leaks, man-in-the-middle attacks, DDoS attacks, and spamming [5]. Furthermore, an IoT device in a smart home might be exploited to gain access and chain together a combination of actions to intrude into other devices in the house. Even without physical intrusion, silent surveillance through a hacked or malfunctioning device can lead to a strategic campaign to compromise other vulnerable devices and cyberstalk the residents [6]. Smart home defense is an evolving notion with no well-defined threat detection and response model that fits different threat scenarios.
Context is any knowledge or information that may be used to comprehend the situation of the underlying environment in which the application is running [7]. According to the authors of [8], the three crucial context factors are where you are, who you are with, and what resources are close by. Location, time, identity, environment, network, history, and activity are the major contexts used for the development of any context-aware system.
A smart home is envisioned as a novel environment in which the IoT is widely used. The IoT, which is built on information and communication technology (ICT), fundamentally alters how we live by changing virtual interactions between people in a variety of scenarios, from the workplace to interpersonal connections [9]. Consequently, the creation of a smart home requires the seamless integration of user interactions, physical items, and human engagement. In particular, a smart home is expected to offer individuals a new kind of smart environment and lifestyle, greatly enhancing quality of life [10]. Figure 1 shows the context-aware smart home system. The IoT is a dynamic network in which things change continually and users are mobile, so conventional fixed security solutions prove ineffective. Context-aware defense extends classical security by using context information to make decisions [11]. Various machine learning methodologies have been proposed to control and expedite the creation of context-specific access control for IoT devices [12]. Contextual information from past records of ambient IoT access has been used to decide whether to grant or deny a given IoT direct access request in the future [13]. Many network-based security solutions have been proposed, but they offer limited help in deciphering the traffic within the internet-connected system environment, and they require in-depth knowledge of the network standards and the specific infrastructure setup, which motivates a context-aware threat detection system [14]. Context-aware security models also have to contend with the question of whether context features will yield more precise security results [15].
Consequently, mapping the context of the device and the network into a model that delivers the best security breach detection while reducing false positive rates in alert generation is a major challenge. Many methods currently in use were designed primarily for static networks and are therefore unable to detect the mobile activity of IoT devices [16]. Most methods claim to guarantee the accuracy of contextual data and its seamless transfer between devices via the cloud, but this is not assessed in a comprehensive scenario [17]. Some methods encrypt data using computationally expensive cryptographic algorithms such as the Advanced Encryption Standard (AES) and Rivest-Shamir-Adleman (RSA); however, these are inefficient for resource-constrained IoT devices because they demand more computational power [18].
Moreover, most research studies tested and refined their proposed context-aware security models using publicly available datasets [19].
To address the deficiencies of current threat detection systems, we propose an effective context-aware threat intelligence and response model for smart home systems. Existing threat and attack detection models lack real-time datasets from smart cyber-physical systems, which hinders progress in optimizing intelligent methods for context-aware threat detection and response. The development of state-of-the-art tracking and detection methods requires instantaneous access to sensor data. To track the whereabouts of deployed IoT devices and take prompt action in the smart home, this study employs a context-aware system that collects network-related dynamic information from the deployed IoT devices. A dynamic rule writing technique is then used for the response model. This research also provides an innovative smart home network testbed architecture, which is used to analyze and secure smart cyber-physical systems. With the deployed IoT Research and Innovation Lab - Smart Home System (IRIL-SHS) testbed, the proposed context-aware threat intelligence model identifies the states of the connected smart devices and builds a contextual model to differentiate between real-world normal and attack scenarios. Once the contextual features are extracted, the labeled dataset is used to train and validate the efficiency and performance of the context-aware threat intelligence SHS model.
The remaining sections of the research are structured as follows: Section 2 presents a comprehensive overview of the available context-aware threat detection and response mechanisms literature along with a comparative analysis. Section 3 describes dataset generation using the topology of IoT devices. The context-aware threat detection and response model is presented in section 4. Section 5 presents the performance analysis through results and a discussion of our suggested model with the comparison of previous studies. Finally, this paper outlines the conclusion in section 6 and talks about future work.

Literature Review
Many studies on context awareness have been conducted to build various context-aware systems that considerably improve people's daily lives. Starting with obtaining expertise in context, acquiring context, and defining the rules, the system must determine what adaptations are required [20]. When adopting and accessing context-aware systems, a lot of variables contribute to consumers developing substantial privacy concerns and potential security risks. Users are also exposed to a range of threats, such as processing enormous amounts of data, spending a lot of energy, and coping with data leakage. As a result, the need for adequate security solutions is growing [21]. Intrusion detection systems, intrusion prevention systems, context-aware security systems, firewalls, and other systems have been used to combat these threats. Context-aware security systems can adequately manage security mechanisms associated with constant context changes [22].

Background
Context-awareness in the IoT for threat intelligence is a contemporary research area. Context is any information that may be used to comprehend the situation of the underlying environment in which the application is running [23]. Location, time, identity, environment, network, history, and activity are the major contexts used for the development of any context-aware system [24]. When adopting and accessing context-aware systems, many variables contribute to consumers developing substantial privacy concerns and potential security risks. As a result, the need for adequate security solutions is growing [9]. Context-aware security systems can adequately manage security mechanisms associated with constant context changes [25]. In recent research, context-aware security systems have been introduced in smart grids, smart cities, smart industries, smart health care, smart homes, and smart transportation systems to provide security against a variety of attacks and weaknesses, including data loss, phishing, service disruption (DoS/DDoS), power losses, and unsecured ports, using approaches such as machine learning, anomaly detection, and artificial intelligence. The term "cyber threat" describes the potential for a successful cyberattack with the intent of gaining unauthorized access to, destroying, disrupting, or stealing a computer network, proprietary information, or any other type of personal information. Cyber threats may originate from a company's own trusted employees or from distant, unidentified parties [26]. Numerous detection strategies are available to cope with cyber threats. These techniques can be roughly divided into host-based and network-based categories.

Host and Network-based Detection Technique
To identify potential threats at the system level, the host-based detection technique gathers and analyses information about the internal operations of the computing device, such as log files, register data, API calling patterns, etc. [27]. The host-based detection software, as opposed to a network-based approach, provides the advantage of detailed insight because detection software is installed and running on the host. Numerous host-based tools and approaches have been created to detect bots, worms, and other hazardous threats [28]. Another method of identifying malicious software that might dodge the effects of packers, polymorphism, and deformation technologies is a dynamic behavioral analysis based on the API hook. The protected host's performance is impacted by host-based methods, which also provide incompatible network filtering signatures [29]. The network-based technique is a surveillance strategy that seeks to locate cybersecurity risks by monitoring network traffic for specific network portions or devices, analyzing the network, and using protocol activity to identify odd behavior [30]. The two primary categories of network-based approaches are those that rely on signatures and those that analyze network traffic [29].

Signature and Traffic-Based Threat Detection Techniques
An attack or intrusion may be recognized by a signature-based IDS if the attack's fingerprint is currently stored in the prevailing database. These methods are often used in the field because they can reliably identify known assaults [31]. To identify cyber dangers, data transmission can be observed and analyzed. The two types of surveillance are active surveillance and passive surveillance [32]. Active monitoring detects activity from malicious traffic. This technique actively injects packets into the network or employs network scanners [33]. For instance, Nmap is a well-known tool that actively collects location and remote information from the Internet. It can do this automatically or manually and includes the domains and servers of this malicious software. The biggest drawback of this approach is how easy it is to discover a network scanner's IP address.
One of the primary categories of passive monitoring is the anomaly-based threat detection technique, which is based on passively observing network traffic [34]. Based on network traffic anomalies, such as excessive network latency, large traffic volumes, traffic on unusual ports, unexpected DNS responses, and traffic behavior characteristics, anomaly-based detection tries to discover malicious code [35]. Three essential steps typically comprise this strategy: First off, some malicious software can be easily captured because it is used in a controlled setting (like a particular honeypot). Second, network security defense software analyses the network traffic generated by the malicious software, and both its static and dynamic characteristics are modeled using mathematical or statistical tools. Finally, models are used to carry out this identification process using data mining and machine learning approaches [28].
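As a rough illustration of the modeling and identification steps, the following minimal sketch learns a statistical baseline from benign traffic volumes and flags windows that deviate strongly from it. The z-score rule, the threshold of 3, and the sample byte counts are assumptions made for illustration and are not taken from the cited works.

```python
from statistics import mean, stdev

def fit_baseline(benign_volumes):
    """Model benign traffic volume by its mean and sample standard deviation."""
    return mean(benign_volumes), stdev(benign_volumes)

def is_anomalous(volume, baseline, threshold=3.0):
    """Flag a window whose volume lies more than `threshold` deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        # Degenerate baseline: any difference from the constant value is unusual.
        return volume != mu
    return abs(volume - mu) / sigma > threshold

# Hypothetical bytes-per-window values observed during benign operation.
benign = [980, 1010, 1005, 995, 1020, 990, 1000, 1015]
baseline = fit_baseline(benign)

typical_flag = is_anomalous(1008, baseline)   # a window close to the baseline
burst_flag = is_anomalous(25000, baseline)    # a flood-like burst
```

A real detector would model several characteristics at once (latency, ports, DNS behavior) rather than a single volume series.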

Machine Learning Techniques
In intrusion detection and prevention systems, malicious traffic can be detected using machine learning (ML) techniques [36]. ML is a component of artificial intelligence (AI) that aims to use algorithms to learn from data and produce predictions from that data. Deep learning is computationally expensive because of the massive amount of training data it requires, as well as sophisticated hardware and software [37]. In Industry 4.0 settings, the functioning of devices and processes must be checked regularly in order to spot or anticipate errors or other circumstances that could lead to unfavorable outcomes. ML algorithms can be trained on existing data to capture the hidden aspects of integrated applications and then evaluated on new data to forecast their state. Network threat analysis, which is the process of identifying dangers to the network, is one of the many uses of ML in the field of cybersecurity [38]. Because it can track both incoming and outgoing traffic and identify potentially suspicious activity, ML can be useful in this endeavor [39]. ML tasks can be carried out using a variety of ML models, each of which uses its own mathematical formulation to analyze the provided data. Different ML methods, such as K-nearest neighbor (KNN), XGBoost, decision tree (DT), and random forest (RF), are frequently employed for threat detection. The KNN supervised learning model is one of the most basic ML models currently available. KNN is referred to as a lazy learner because it does not require a training phase and instead uses the training data directly to categorize new data [40]. The decision tree is a supervised learning algorithm that is useful for presenting a visual representation of a model. A decision tree employs a hierarchical architecture with multiple connected nodes, much like a flowchart [41].
This network contains evaluations of the dataset's features, and each one has a split that either goes to another node or makes a classification judgment for the data. The predicted data is processed via the nodes of the tree that was constructed using the training data until the data can be categorized [42].
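The traversal described above can be sketched as follows. This is a minimal illustration, not a trained model: the tree is written by hand, and the feature names (`packets_per_second`, `distinct_dst_ports`), thresholds, and labels are hypothetical.

```python
# Hypothetical hand-built tree; in practice the splits are learned from training data.
TREE = {
    "feature": "packets_per_second",
    "threshold": 500,
    "left": {"label": "benign"},               # packets_per_second <= 500
    "right": {
        "feature": "distinct_dst_ports",
        "threshold": 100,
        "left": {"label": "dos"},              # few destination ports
        "right": {"label": "reconnaissance"},  # many destination ports
    },
}

def classify(node, record):
    """Walk from the root to a leaf, testing one feature at each internal node."""
    while "label" not in node:
        go_left = record[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]

flow = {"packets_per_second": 4200, "distinct_dst_ports": 3}
predicted = classify(TREE, flow)
```

Each internal node holds one feature test, and each leaf holds the classification judgment, which mirrors the flowchart-like structure in the text.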
The Random Forest-supervised learning algorithm is thought to perform better than the DT model. The randomness of the model comes from two core concepts [43]. The first is that when the model is being trained, each tree is given a random sampling of the data, which can result in some trees using the same data more than once [44]. The goal is to narrow the gap between the scores for the expected outcomes by lowering the model's variance. The second concept suggests utilizing a sparse subset of the features to separate the nodes in the trees [45].
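The two sources of randomness can be sketched in a few lines. This is an illustrative fragment only: the function names and feature names are assumptions, and tree construction itself is omitted.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so some rows may appear more than once."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(feature_names, k, rng):
    """Pick k distinct candidate features to consider when splitting a node."""
    return rng.sample(feature_names, k)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = list(range(10))  # stand-ins for dataset records
features = ["duration", "src_bytes", "dst_bytes", "protocol", "dst_port", "packets"]

sample = bootstrap_sample(rows, rng)               # first concept: random rows per tree
subset = random_feature_subset(features, 2, rng)   # second concept: random features per split
```

For classification, a common choice for `k` is roughly the square root of the number of features.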
XGBoost is an enhancement of the gradient boosted regression tree (GBRT) model. GBRT is a boosting model built with a sequential ensemble method from a series of simple regression trees [46]. More trees can be adaptively added to the model to expand its capacity. Both XGBoost and RF use decision trees as the base classifiers of an ensemble [47]. The enhancements in XGBoost include regularization, a technique for working with sparse and balanced data, and a block structure for parallel learning [48].

Critical Analysis
The smart home is among the best-known applications of the Internet of Things paradigm, with approximately 27 billion IoT devices in 2017; this number is expected to grow rapidly by 12% per year until it reaches more than 13 trillion devices by 2030. It also presents a significant security problem of escalating threats from anonymous sources that degrade the customer experience [49]. Many researchers are working on this challenge and have modeled workable security systems for smart homes, while highlighting the limitations of home automation in detecting and responding to sophisticated attackers.
In [50], an approach for detecting intrusions in general, with a low false positive rate, is presented for Smart Home Systems (SHS). The authors dynamically modeled the SHS's variegated information into contextual arrays of location and time features for behavior analysis, which is centered on one-class learning to detect various forms of abnormal behaviors. A web interface was used to keep track of the daily usage of household equipment to collect 8,110 normal records throughout the course of 90 days. Premised on the baseline model for authentic detection, they revealed that a context sequence of a length with 3 attributes (time slot, gateway availability, and physical location) yields a 2.1 percent false positive ratio. More than 98 percent detection accuracy was achieved for normal usage, involving various related behaviors, but for DoS, asset manipulation and break in attacks involving more than two behaviors, they obtained a detection ratio of more than 94 percent. In terms of future study, researchers suggested that for attacks involving only one behavior, the proposed technique should take detection accuracy into account and make it possible to operate properly for resources that don't require any context information to operate and can conduct complex state changes.
The researchers in [51] monitored the configurations of sensors and devices in a smart home for various user behaviors and usage patterns and constructed a context-specific model to distinguish between harmful and benign behavior. They consider context awareness to be an understanding of the variation in the statuses of sensors and devices that results from ongoing user behavior. Their framework, Aegis, uses a Markov Chain-based machine learning technique to detect suspected malicious activity by analyzing the present state of smart home assets and comparing it to previously learned user behavior. They tested Aegis in a variety of smart home environments, including with real-life individuals, genuine SHS devices (such as the Samsung SmartThings platform), and various day-to-day routines. By evaluating Aegis against a variety of malicious behaviors such as impersonation, false data injection, denial-of-service, and triggering a malicious app, they obtained more than 95% accuracy with minimal overhead in a variety of smart home situations, detecting threats regardless of smart home designs, user counts, or imposed user regulations.
The authors of [52] emphasized trustworthy context manufacturers and customers, who should secure confidential material in smart surroundings from disclosure or inspection. They have used the attack scenarios of eavesdropping communications between sensors, system components, and applications. The upgraded Cerberus approach is proposed to provide context suppliers and recipients with authentication while also protecting the integrity of the context-specific data and its convenient movement among gadgets. Confidentiality is provided by a symmetric key cryptographic approach, and the integrity of data is ensured using a hash function based on a digital signature. After authentication, the new context providers and recipients were dynamically inserted by the proposed unique method. The strategy is based and evaluated in a genuine cloud platform with six actual devices, proving efficiency, authentication methods, negligible energy usage, and extensibility of multiple portable devices operating in perfect sync, and overall control of the system's privacy rights. The lowest power usage of around 0.35% is reported even without data transmission and while transferring 207.5 B/s toward the cloud.
Another research [32] focused on network-based tracking methodologies and proposed a fundamental structure HomeSnitch for increasing smart home control and visibility by categorization of IoT device exchange mechanisms based on semantic efficacy. Patterns in correlation datagram unit transfers that indicate application-layer interactions between network nodes are searched by HomeSnitch. The model, which was based on a typical wireless router and employed software-defined networks (SDN) primitives to impose connection restrictions, can characterize device behavior by employing destination and content-agnostic features. Using a Random Forest classifier, the model classified behavior from an independent data set of normal and man-in-the-middle attack pcap traffic with greater than 99 percent accuracy. Through all these initiatives, researchers proved the effectiveness of computer networks in classifying behaviors and imposing control on IoT devices.
Furthermore, by relying on IoT device usage data, the authors in [53] developed a machine learning method to autonomously learn contextual access policies from observed social behaviors in home automation. LoFTI is a federated multi-task learning system that identifies six main categories of attributes to record contextual access privileges and trains tailored context-aware rules from numerous smart homes. By designing a novel data augmentation strategy to handle the problem of outliers in learning, the model achieved a favorable trade-off between performance and computational cost in distributed learning. The results showed that LoFTI produces few false alarms: compared with the solo-learning and learning-to-all methods, the false negative rate dropped by 24.2% and the false positive rate dropped by 49.5%.
Many researchers have proposed context-aware robust intrusion detection systems using publicly available datasets such as NSL-KDD, UNSW-NB15, and AWID [54], which cover attack scenarios of DoS, eavesdropping, MITM, false data injection, impersonation, and triggering a malicious app, but they did not implement any preventative measures. For effective detection and categorization of novel and complex attacks, multiple independent deep reinforcement learning agents should be distributed over the network, producing a model with higher accuracy and a lower false-positive rate than existing systems [55]. The scarcity of real-time datasets from smart cyber-physical systems hinders progress in optimizing intelligent methods for context-aware threat detection and response.

Dataset Generation
The dataset generation comprises a few sections. In section 3.1, the proposed testbed setup is discussed. In 3.2, benign and attack scenarios with tactics, techniques, and procedures (TTP) are explained, while section 3.3 gives a comparison between the IRIL-SHS dataset and other popular publicly available datasets.

Testbed Setup
This section provides a detailed description of the IRIL-SHS dataset, including the architecture design, smart devices, and security protocols used. For this research study, eighteen IoT and non-IoT devices were connected to a Linksys smart Wi-Fi Router WRT1200AC in a star topology within a real smart home testbed. Four of them were IoT devices: a TUYA smart plug, a Thingz access smart hub, a TUYA smart door lock, and a v380 smart Wi-Fi camera, all controlled by the homeowners through mobile and web applications. These IoT devices are shown in Figures 2, 3, 4 & 5. Seven desktop personal computers and mobile phones were connected to the router and used by the home users in their normal routine. The Thingz access smart hub was connected to and controlled the fans (3), lights (5), and ACs (2) in the smart home architecture. The topology of connected devices in the smart home is shown in Figure 6. Table 1 shows the connected device types in the network topology with their protocols and placement. The IRIL-SHS dataset is a new collection of network traffic statistics from IoT devices. The data were collected over six days in a medium-sized smart home setup and comprise 10 pcap files of attack traffic and 16 pcap files of benign traffic. This IoT network flow of smart home systems was recorded in the IoT Research and Innovation Lab (IRIL) at the Al-Khwarizmi Institute of Computer Science (KICS), University of Engineering and Technology, Lahore [56]. Its objective was to provide researchers with a sizable dataset made up of real-time, labeled IoT attack and benign traffic for training and testing machine learning models. In particular, the residents of the smart home behaved normally throughout the first two days of the collection period, and the associated traffic records showed the system to be in a normal state.
Three distinct anomalous behavioral scenarios of DoS, DDoS, and reconnaissance attacks, as explained in section 3.2.2 and 3.2.3, were put into action over the course of the experiment's last four days to imitate device failure or malicious attacker activity. Researchers carried out a specific attack sample utilizing several tools that made use of various protocols and carried out various operations in each harmful situation. In a controlled network environment with an unrestricted internet connection, just like any other actual IoT device, both harmful and benign situations were tested.

Benign and Attack Scenarios
We had twenty-six scenarios for this dataset. The pcap files were recorded for normal and attack traffic scenarios. We had 16 scenarios for normal traffic captured at random timing of different days and 10 scenarios for attacks captured at random timing of different days. All devices in a network used different protocols. The dataset contains TCP, DNS, HTTP, TLS, QUIC, MDNS, UDP, and ARP protocols. DDoS attacks were captured from inside as well as outside the network performing HTTP and TCP flooding.

Tactics, Techniques, and Procedures (TTP)
DoS (Denial of Service), DDoS (Distributed Denial of Service), and reconnaissance attacks of three different sorts were carried out against the network. DoS and DDoS attacks were performed within the network as well as outside the network using the HOIC tool [57]. The reconnaissance was performed using the Nmap tool [58]. All the data were captured by the Wireshark tool and saved into pcap files. Table 2 shows the malicious activities performed.

Reconnaissance Scenario
A scanning attack, also referred to as reconnaissance or probing, is the first step in the cyber kill chain model or in penetration testing [62]. The goal of this attack was to gather data about the victim systems, including system vulnerabilities and current IP addresses. An attacker with IP 192.168.1.159 executed a Computer OS Fingerprint Probe, Network/Port scan, TCP Null scan, TCP SYN/FIN scan, and TCP Xmas scan for a random amount of time against a variety of IoT and non-IoT devices in the SHS from outside the network.
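The three TCP scan types named above differ only in which TCP header flags the probe packets set: a Null scan sets none, an Xmas scan sets FIN, PSH, and URG, and a SYN/FIN scan sets SYN and FIN together. The following minimal sketch classifies a flags byte accordingly. The function name and return labels are illustrative assumptions, and this is not the detection logic of the proposed model.

```python
# Standard TCP header flag bits.
FIN, SYN, RST, PSH, ACK, URG = 0x01, 0x02, 0x04, 0x08, 0x10, 0x20

def classify_probe(flags):
    """Map a TCP flags byte to the scan type it is characteristic of, or None."""
    flags &= 0x3F  # keep only the six classic flag bits
    if flags == 0:
        return "null"
    if flags == (FIN | PSH | URG):
        return "xmas"
    if (flags & SYN) and (flags & FIN):
        return "syn-fin"
    return None  # an ordinary flag combination, such as SYN or SYN+ACK

probe_label = classify_probe(FIN | PSH | URG)
```

Combinations such as a lone SYN are valid in normal traffic, so they are deliberately left unclassified here.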

Validation of the IRIL-SHS dataset
To validate the IRIL-SHS dataset developed in this study, it was evaluated with random forest, decision tree, and K-nearest neighbor classification models using 12 selected network features. The same technique was applied to the publicly available UNSW-NB15 and KDD99 datasets. These datasets were adopted because (i) they do not contain record duplication while collecting modern network traffic, and (ii) they resolve the issue of imbalance between normal and attack observations. The UNSW-NB15 and KDD99 datasets both contain normal and malicious network traffic activity and have 257,673 and 466,530 records in total, respectively. Each record is described by 49 attributes, including class labels. The 12 attributes that are common to the IRIL-SHS, UNSW-NB15, and KDD99 datasets were passed to the classification models after the data was divided in a 70:30 ratio, with 70% used for training and 30% used afterwards for testing. The classification techniques were implemented in Python using the Google Colaboratory environment. Table 3 shows the key differences in classification model output between the IRIL-SHS dataset and the publicly available datasets to demonstrate the usefulness and correctness of the IRIL-SHS dataset.
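The 70:30 split and a basic accuracy measure can be sketched with the standard library alone. The helper names, the seed, and the toy records below are assumptions for illustration; the study's own implementation is not shown in the text.

```python
import random

def train_test_split(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records and cut it into training and testing parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy labeled records: (record id, label). Real records would carry the 12 features.
records = [(i, "attack" if i % 4 == 0 else "benign") for i in range(100)]
train, test = train_test_split(records)

example_accuracy = accuracy(["attack", "benign", "attack", "attack"],
                            ["attack", "benign", "benign", "attack"])
```

Shuffling before the cut keeps both parts representative when the records are stored in capture order.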

Comparison with Popular Security Datasets
The design and development of credible datasets remains a key area of research for developing accurate intrusion detection methods that can identify and prevent threats and attacks. Existing datasets, while useful in some contexts, present a number of problems, including a lack of consistently labelled data, a lack of attack variety such as DDoS attacks from outside the network, redundancy of traffic, and the absence of ground truth [67]. This research study has carefully compared the IRIL-SHS dataset with other popular datasets, namely KDD99 [63], TUIDS ISCX [64,65], UNSW-NB15, and CICIDS2017 [66]. These datasets have been used extensively in many research studies to provide security solutions. The analysis showed that the KDD99 dataset is now obsolete and does not reflect current network traffic. The TUIDS ISCX, UNSW-NB15, and CICIDS2017 datasets were collected from actual situations and contain comprehensive packet captures with a range of security events. However, they lack heterogeneous data sources and do not consider traffic (normal and attack) from outside the network. In contrast, the IRIL-SHS dataset provides heterogeneous data sources (Table 1) with many malicious activities (Table 2) from both internal and external networks. The generated dataset captures authentic network traffic from actual networks of IoT devices created by router configuration and consists of heterogeneous data sources gathered from IoT device telemetry datasets, Windows and Linux-based datasets, and network traffic datasets. Table 4 shows the key differences between the IRIL-SHS dataset and other common datasets to provide a fair comparison.

Context-aware Threat Detection and Response Model
Our approach uses the IRIL-SHS dataset gathered from the communication of various IoT devices in the proposed testbed setup. Our context-aware threat detection and response model consists of three major modules, as illustrated in Figure 7. The contextual feature generation module contains the network and flow features extracted from the pcap files into CSV files using NFStream and Tshark. After data preprocessing, we performed feature selection to obtain the high-performance features. In the threat detection module, we used machine learning algorithms to validate the dataset and detect threats accurately. The response module dynamically writes rules against the detected attack traffic.

Contextual Features Generation Module
The contextual feature generation module is founded on context modeling, which represents the data obtained from IoT devices and resources in a way that can explicitly explain the system's behavior [50]. The diversity of IoT device information can be characterized by tracing the key information into a high-level context that is highly specific and accurate. The context mapping for the smart home system's hierarchical structure is shown in Figure 8. The topmost level of the hierarchy defines the structure of the context with the following pairs [68]: Who is involved? What is the current situation? How do they behave? In the second tier of the SHS context mapping, several context classes are formed to apply grouping methods to the IoT data. A class created by grouping, such as the sender context class, accepts the following inputs:
1) The IP address, authentication data, and HTTP cookies of the sender, all taken from the packet stream.
2) Account information for individuals, obtained from the configuration log.
3) The sender's precise location, as determined by a GPS signal [50].
If a device's position is maliciously changed, for example when an IP camera captures traffic from a restricted database or a server room, its location context can be used to determine the threat or attack scenario. As a further example, consider a trucking company that transports items throughout the country with a fleet of vehicles. Each vehicle is fitted with a GPS device and a telematics system that tracks the vehicle's location, speed, and other critical data. This contextual data can be used to identify a malicious scenario, such as sudden changes in the vehicle's route, unexpected stops or detours, or unusual driving behavior. Information from a real-world situation supports human-oriented decision-making through machine-learning approaches such as feature extraction, learning, and inference. Finally, all class features containing network flow information are stored in the low-level hierarchy. A behavior sequence and a behavior set are formed from pairs of related events captured in the traffic, such as Close Door number 1 (D1C) or Turn off Light number 1 (L1O). If the length of the behavior sequence is 3, one possible set is {"D1C", "L2O", "D3O"}. Figure 9 shows the flow diagram for the contextual feature generation module. This research used NFStream [64] and Tshark [65] to construct the features for the IRIL-SHS dataset. All features obtained from NFStream and Tshark were then combined to capture as much information as possible about the proposed context-aware smart home system. To obtain high-performance contextual features, the feature selection technique explained in section 4.1.2 was applied so that only relevant data, free of noise, was passed to the model.
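The behavior sequences described above can be pictured as sliding windows over a stream of device events. The following is a minimal sketch under that assumption; the function name, the sample event stream, and the use of a sliding window are illustrative and are not taken from the study.

```python
from typing import List, Tuple


def behavior_sequences(events: List[str], length: int = 3) -> List[Tuple[str, ...]]:
    """Return every run of `length` consecutive events (a sliding window)."""
    if length <= 0:
        raise ValueError("length must be positive")
    return [tuple(events[i:i + length]) for i in range(len(events) - length + 1)]


# Hypothetical event stream; the codes follow the D1C / L1O naming used in the text.
events = ["D1C", "L2O", "D3O", "L1O"]
sequences = behavior_sequences(events, length=3)

# A behavior set is the order-free view of the same window.
behavior_sets = [frozenset(seq) for seq in sequences]
print(sequences)
```

Keeping both the ordered tuple and the unordered set lets later stages decide whether the order of actions matters for a given context.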
To make live and offline network data more comprehensible, NFStream, an open-source Python library, allows simple and customizable feature conversion [66]. According to its authors, the main goal is for the library to act as a common network packet analytics platform for academics, enabling data reproducibility between studies. NFStream provides the following advantages:
1. Extraction of statistical features: For feature engineering, NFStream provides both early flow features and post-mortem statistical features (such as the minimum, mean, standard deviation, and maximum of packet size and inter-arrival time), as well as a sequence of the first n packet sizes, inter-arrival times, and directions [66].
2. Flexibility: Extending NFStream is simple. The work is open-sourced, and the feature selection technique can be implemented via NFPlugins.
Based on the configurations defined by NFStreamer (a driver process), new flow characteristics were developed. The driver's main responsibility is to set the entire workflow, which is mostly an arrangement of concurrent metering tasks. The features that were extracted using NFStreamer are listed in Appendix A, Table A1.
From the network pcap files, we retrieved 86 flow features/attributes using NFStream and stored them as a CSV file. TShark is the command-line companion of Wireshark and is used to monitor network protocols and analyze traffic. Although Wireshark can examine pcap files that record network data, its export abilities are limited [67]. Beyond static feature extraction, TShark offers a more versatile and powerful export capability that can produce analytical, calculated data. We used the TShark conversations export option to retrieve several fundamental traffic- and connection-based features [68]. We also used TShark to extract TCP, HTTP, and UDP information. The following command, run on Kali Linux, extracted the TCP, HTTP, and UDP features.
tshark -r input.pcap.pcapng -T fields -e tcp.window_size -e tcp.flags -e tcp.len -e tcp.seq -e tcp.stream -e tcp.ack -e ip.ttl -e http.request -e udp.port -E header=y -E separator=, > output.csv
The features retrieved using TShark are listed in Table 5. A total of 10 features were extracted by TShark, and the files containing them were merged with the NFStreamer files, which hold 86 features of different data types, giving a feature set of 96 features in total. These features were sufficient for extracting context-aware features to provide a heterogeneous security system and model for the SHS. All CSV files were labeled manually and merged into one file using a Python script. It is important to highlight that each record, whether normal or attack, was tagged using an authentic tagging procedure. Manual labeling assigned 0 to Normal traffic, 1 to DoS attack traffic, 2 to DDoS attack traffic, and 3 to Reconnaissance attack traffic.
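The labeling and merging step can be sketched with the standard `csv` module. This is an illustration only: the study's own script is not shown in the text, so the function name, the in-memory CSV snippets, and the `label` column name are assumptions. Only the label codes (0 to 3) come from the text.

```python
import csv
import io

# Label codes as stated in the text.
LABELS = {"normal": 0, "dos": 1, "ddos": 2, "reconnaissance": 3}


def label_and_merge(captures):
    """Merge several CSV exports into one list of rows with a 'label' column.

    `captures` is an iterable of (csv_text, traffic_type) pairs, where
    traffic_type is one of the keys of LABELS.
    """
    merged = []
    for csv_text, traffic_type in captures:
        label = LABELS[traffic_type]
        for row in csv.DictReader(io.StringIO(csv_text)):
            row["label"] = label
            merged.append(row)
    return merged


# Hypothetical two-column exports standing in for the real feature files.
normal_csv = "tcp.window_size,ip.ttl\n64240,64\n"
dos_csv = "tcp.window_size,ip.ttl\n512,255\n1024,255\n"
rows = label_and_merge([(normal_csv, "normal"), (dos_csv, "dos")])
print(len(rows), rows[0]["label"], rows[-1]["label"])
```

Labeling per capture file, as here, presumes each capture holds a single traffic type; a mixed capture would need per-record labeling instead.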

Preprocessing of Labeled Dataset
The IRIL-SHS dataset contains 608,500 records of real-time traffic, including both attacks and normal events, with far more normal records than anomalous records. To avoid overfitting and to examine the generalisation capacity of each model, all redundant records were eliminated from the dataset, which also reduces the impact of imbalance. To address the remaining imbalance, an experiment compared the performance of the proposed model on the original dataset and on a balanced dataset produced with oversampling and under-sampling techniques. Specifically, the synthetic minority over-sampling technique (SMOTE) was used for oversampling and random under-sampling (RUS) for under-sampling. For oversampling, SMOTE was applied to the minority class to create synthetic samples, resulting in a balanced dataset. For under-sampling, samples were randomly removed from the majority class until the dataset was balanced. The model was then trained on both the original and the balanced datasets, and its performance was evaluated using the same metrics.
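The two balancing ideas can be sketched in plain Python. This is not the implementation used in the study: `random_under_sample` follows the RUS description directly, while `interpolate_minority` is a deliberately simplified stand-in for SMOTE that interpolates between two randomly chosen minority samples instead of between a sample and one of its k nearest neighbors. All names and the toy data are assumptions.

```python
import random
from collections import defaultdict


def random_under_sample(rows, labels, seed=0):
    """Randomly drop samples so every class keeps as many rows as the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    target = min(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        for row in rng.sample(group, target):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def interpolate_minority(minority, n_new, seed=0):
    """SMOTE-style idea only: place each new point on the segment between two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic


# Toy data: 8 majority rows (class 0) and 2 minority rows (class 1).
rows = [[float(i), float(2 * i)] for i in range(10)]
labels = [0] * 8 + [1] * 2

balanced_rows, balanced_labels = random_under_sample(rows, labels)
minority = [r for r, y in zip(rows, labels) if y == 1]
synthetic = interpolate_minority(minority, n_new=6)
print(len(balanced_rows), len(synthetic))
```

Under-sampling discards data, while oversampling adds synthetic points; applying either only to the training split avoids leaking synthetic or duplicated samples into the test set.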
The experiment showed that the performance of the model improved significantly when using the balanced dataset compared to the original dataset. Notably, the accuracy improved from 88% to 99%, and the F1-score improved from 0.7 to 0.85. These results indicate that dataset imbalance can have a significant impact on the performance of the model and that addressing this issue using oversampling and under-sampling techniques can improve the overall performance. Converting categorical data into numerical information and removing any anomalies and incorrect data from the dataset are the first steps in the data pre-processing process. As a result, we encoded the categorical features using an ordinal encoder, thereby increasing the feature count.
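Ordinal encoding of a categorical column can be illustrated as follows. This is a minimal sketch assuming that codes are assigned in sorted order of the category values; the function names and the sample protocol column are hypothetical. An encoding of this kind is presumably why a categorical value such as a destination IP can later appear as a plain integer.

```python
def fit_ordinal(values):
    """Map each distinct category to an integer code, in sorted order."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


def transform_ordinal(values, mapping, unknown=-1):
    """Replace each category by its code; unseen categories get `unknown`."""
    return [mapping.get(value, unknown) for value in values]


# Hypothetical categorical column.
protocols = ["tcp", "udp", "tcp", "icmp"]
mapping = fit_ordinal(protocols)
encoded = transform_ordinal(protocols, mapping)
print(mapping, encoded)
```

Fitting the mapping on the training data and reusing it for the test data keeps the codes consistent across both splits.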

Contextual Features Selection
In this step, we selected the set of attributes that yielded the best performance. By lowering the number of features and removing unwanted or noisy ones, feature selection speeds up training [69]. The Extra Trees classifier, an ensemble learning method also known as the Extremely Randomized Trees classifier, is employed in this study for feature selection [70]. It aggregates the results of multiple de-correlated decision trees collected in a forest to produce its classification. Each decision tree in the Extra Trees forest is constructed from random samples of the training dataset. At each test node, a tree is given a random sample of k features, from which it selects the best feature to split the data according to a criterion such as the Gini index or information gain. This score, also known as the feature importance, separates meaningful from irrelevant aspects of the data. The features are then arranged in descending order of importance [71], the top K are chosen, and the rest are ignored [72]. The entropy on which information gain is based can be determined with the following formula:
$$E(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$
where c is the number of distinct class labels and p_i is the probability that class i occurs in the dataset. In this study, the top 10 dataset features were chosen using information gain, as shown in Figure 10. Selecting 10 features for the IRIL-SHS dataset represents an 86% reduction in the size of the feature set. The Extra Trees classifier selects pertinent features such as the packet TCP stream and window size, whose values reflect the tendency of cyberattacks to have packet dimensions distinct from usual traffic, and which therefore offer the most information about the class [73].
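As a worked illustration of entropy and information gain, the sketch below scores two features on a single threshold split each and ranks them. It is not the Extra Trees procedure itself, which averages importances over many randomized trees; the feature values, thresholds, and names are made up for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting on values <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


# Made-up flows: label 1 = attack, label 0 = normal.
labels = [1, 1, 1, 0, 0, 0]
features = {
    "tcp_window_size": ([512, 512, 1024, 64240, 65535, 64240], 2000),
    "ip_ttl": ([64, 255, 64, 64, 255, 64], 100),
}

gains = {name: information_gain(vals, labels, thr) for name, (vals, thr) in features.items()}
ranking = sorted(gains, key=gains.get, reverse=True)  # descending importance
top_k = ranking[:1]
print(gains, top_k)
```

Here the window size separates the two classes perfectly, so its gain equals the full entropy of the labels, while the TTL split leaves both sides as mixed as before and gains nothing.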
Table 6 lists the 10 contextual features chosen for all attack types, together with their importance. The features most useful for classifying DoS, DDoS, and Reconnaissance traffic are identification, destination IP, TCP window size, TCP stream, source-to-destination and bidirectional maximum packet inter-arrival time, SYN packets from source to destination, bidirectional duration, bidirectional standard deviation of packet inter-arrival time, and the requested server name.

Threat Detection Module
The second module trains the machine learning models used to identify malicious behavior in the SHS, taking as input the contextual features provided by the contextual features generation module. Many learning approaches are used to identify cybersecurity threats, including decision trees, random forests, naive Bayes, support vector machines, K-nearest neighbors, deep belief networks, artificial neural networks, and XG-Boost [73]. Four techniques were considered here: Decision Trees, Random Forests, K-Nearest Neighbors, and XG Boost. Each is briefly discussed below. The aim is to assess the effectiveness and performance of different machine-learning approaches against distinct attack types. The data was divided in a 70:30 ratio, with 70% used for training and 30% for testing. The classification techniques were implemented in Python in the Google Colaboratory environment. The libraries employed in this work include sklearn, pandas, matplotlib, and NumPy.
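The 70:30 split can be sketched without any third-party library. The version below is stratified, meaning each class is split 70:30 separately so that rare attack classes appear in both parts. Stratification is an assumption made for this illustration; the text only states the overall ratio. The toy label counts are also hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_indices, test_indices), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 0 = Normal, 1 = DoS, 2 = DDoS, 3 = Reconnaissance.
labels = [0] * 70 + [1] * 10 + [2] * 10 + [3] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the split repeatable, which matters when several classifiers are compared on the same partition.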

Decision Tree Classifier
The decision tree functions by splitting the data points of the dataset (D), and each split is carried out in a way that maximizes the separation on the related features while minimizing informational entropy (E). The splits are known as leaves (L), and the terminal leaf is the last split [38]. We chose a minimum sample split of two, a maximum depth of ten, and a random state of zero for our decision tree classifier. Let G be the sample split metric to be maximized, R the range of its possible values, δ a user-defined confidence value, and n the total number of available training samples. The bound used by the classifier may be calculated as follows:
$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$$
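The symbols defined above (a split metric G, a range R, a user-defined confidence value, and n training samples) match those of the Hoeffding bound used in incremental decision-tree learning. Assuming that is the intended relation, the sketch below computes the bound and applies the usual split test: split when the best candidate beats the runner-up by more than the bound. The function names and all numeric values are hypothetical.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, value_range, delta, n):
    """Split once the best split metric G leads the runner-up by more than epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# With four classes, information gain lies in [0, log2(4)], so R = 2 here.
R = math.log2(4)
eps_1000 = hoeffding_bound(R, delta=1e-7, n=1000)
eps_4000 = hoeffding_bound(R, delta=1e-7, n=4000)
print(round(eps_1000, 4), round(eps_4000, 4))
print(should_split(best_gain=0.9, second_gain=0.5, value_range=R, delta=1e-7, n=1000))
```

Because the bound shrinks with the square root of n, quadrupling the number of samples halves it, so closer contests between candidate splits can be decided as more data arrives.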

Random Forest Classifier
Ensemble learning combines the output of several machine learning algorithms to improve predictive performance. Accordingly, the random forest method combines many decision trees. Implementation of random forest involves the following steps:
Step 1: Choose B arbitrary data points from the training set.
Step 2: Build the decision tree associated with these B data points.
Step 3: Decide how many trees to construct and repeat steps 1 and 2.
Step 4: For a new data point, have each decision tree predict its category, and then assign the category that received the majority of votes [39]. The class prediction of the random forest is therefore the majority vote over the T trees:
$$\hat{y}(x) = \operatorname{mode}\{h_1(x), h_2(x), \ldots, h_T(x)\}$$
where h_t(x) is the category predicted by tree t.
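The four steps can be sketched end to end in plain Python. To keep the example short, each "tree" is a one-threshold stump on a single feature, which is a stand-in and not a full decision tree; the study itself used sklearn. All names and the toy data are assumptions.

```python
import random
from collections import Counter


def majority_vote(predictions):
    """Step 4: the category predicted most often wins."""
    return Counter(predictions).most_common(1)[0][0]


def bootstrap_sample(rows, labels, size, rng):
    """Step 1: draw `size` data points at random, with replacement."""
    picks = [rng.randrange(len(rows)) for _ in range(size)]
    return [rows[i] for i in picks], [labels[i] for i in picks]


def fit_stump(rows, labels):
    """Step 2 (simplified): one threshold on feature 0 with the fewest errors."""
    best = None
    for threshold in sorted({r[0] for r in rows}):
        left = [y for r, y in zip(rows, labels) if r[0] <= threshold]
        right = [y for r, y in zip(rows, labels) if r[0] > threshold]
        left_label = majority_vote(left)
        right_label = majority_vote(right) if right else left_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda row: left_label if row[0] <= threshold else right_label


def fit_forest(rows, labels, n_trees=15, seed=0):
    """Step 3: repeat steps 1 and 2 for the chosen number of trees."""
    rng = random.Random(seed)
    return [fit_stump(*bootstrap_sample(rows, labels, len(rows), rng)) for _ in range(n_trees)]


def predict_forest(forest, row):
    return majority_vote([tree(row) for tree in forest])


# Toy data: small values are class 0, large values are class 1.
rows = [[1], [2], [3], [4], [10], [11], [12], [13]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(rows, labels)
print(predict_forest(forest, [0]), predict_forest(forest, [20]))
```

Individual stumps differ because each sees a different bootstrap sample; the vote smooths out those differences, which is the point of the ensemble.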

K Nearest Neighbor Classifier
The KNN algorithm determines how to assign a new data point (D) to one of the categories by using the subsequent steps.
Step 1: Decide how many neighbors there will be; we chose K=7.
Step 2: Using the Euclidean distance, get the K-nearest neighbors of the new data point.
Step 3: Count the number of data points in each category among these K neighbors.
Step 4: Assign the new data point to the category with the greatest number of neighbors [36]. The Euclidean distance (d) between samples x_i and x_l (l = 1, 2, 3, …, n) is defined as
$$d(x_i, x_l) = \sqrt{\sum_{j=1}^{p} (x_{ij} - x_{lj})^2}$$
where p is the number of features.

XG Boost Classifier
Accuracy problems, data loss, and the resulting discrepancies can all be managed via XG Boost. The following steps make up the XG Boost process.
Step 1: A decision tree is initially created.
Step 2: The majority decision is then examined in a Bagging operation to produce predictions.
Step 3: Trees are independent; hence random forest is used to form forests.
Step 4: Boosting is carried out to find losses (L).
Step 5: Gradient boosting is used to minimize these losses, with regularization to limit overfitting of the dataset.
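The boosting idea in the last two steps can be illustrated with a small plain-Python sketch. It implements ordinary gradient boosting with squared loss and one-split regression stumps, not XGBoost itself, which adds regularization and second-order information. Each round fits a stump to the current residuals, that is, to the errors left by the earlier stumps. Names, the learning rate, and the toy data are assumptions.

```python
def fit_regression_stump(xs, residuals):
    """Pick the single threshold that minimizes squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # drop the largest value so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        loss = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or loss < best[0]:
            best = (loss, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Return (predict, losses); losses[i] is the mean squared error after round i."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps, losses = [], []
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_regression_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, losses


# Toy data: 0 = normal, 1 = attack, separated by the feature value.
xs = [1, 2, 3, 4, 10, 11, 12, 13]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
predict, losses = gradient_boost(xs, ys)
print(losses[0], losses[-1])
```

The learning rate shrinks each stump's contribution, so the loss falls gradually over the rounds instead of being driven to zero by the first fit.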
After applying all classifiers, the Random Forest classifier separated attack traffic from normal traffic with 99.177211% accuracy. The machine learning analysis thus produced a predictive model that can distinguish between various sorts of traffic. With a model trained to distinguish good from bad traffic, several add-on tasks can be carried out, for example:
• Based on the outcomes of the random forest, create extra firewall rules to block offending traffic.
• Create IDS rules based on the outcomes of the random forest to identify malicious traffic.
• Review logs or traffic captures regularly, reporting any high-priority malicious traffic that is found [74].

Response Module
The final section of our study offers a straightforward set of rules for finding correlations in the information collected through our Random Forest model. By integrating these rules into already-in-use defenses like IDS, firewalls, specially-written detection scripts, or classification software, malicious traffic or data can be prevented from entering our SHS [74].
To obtain the rules, we used the WEKA tool, a collection of machine-learning techniques for data mining. It includes techniques for data preparation, classification, clustering, association-rule mining, and visualization. Weka is simple to use and can be extended using Java programming [75].
JRip is WEKA's implementation of RIPPER, a rule learner that examines the classes in order of increasing size and generates an initial rule set for each class. JRip treats all examples of a given decision in the training dataset as a class and generates a set of rules that covers all members of that class. It then moves on to the next class and repeats the procedure until all classes have been covered [76]. This research work used JRip because of its prominence and widespread use in earlier studies for rule set generation [77,78]. To differentiate between DoS, DDoS, Reconnaissance, and normal traffic, we obtained a total of 9 rules. Simple rules in the form of a tree outlined the metric values that can be used to identify the sort of attack in the traffic. These rules helped establish the precise values to include in the Suricata rules when creating the SHS prevention system.
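How metric values from a learned rule might feed a Suricata signature can be sketched as a small formatting helper. This is an illustration only and not one of the study's signatures: the `flags`, `detection_filter`, `classtype`, `sid`, and `rev` options are standard Suricata rule keywords, but the function name, message, SYN count, time window, and sid below are all hypothetical.

```python
def suricata_drop_rule(sid, msg, syn_count, seconds, home_net="$HOME_NET"):
    """Format a drop rule that fires once a source exceeds a SYN rate.

    syn_count and seconds would come from a learned rule, for example a
    condition on SYN packets from source to destination within a flow duration.
    """
    if syn_count <= 0 or seconds <= 0:
        raise ValueError("syn_count and seconds must be positive")
    options = [
        'msg:"{}"'.format(msg),
        "flags:S",  # match packets with only the SYN flag set
        "detection_filter: track by_src, count {}, seconds {}".format(syn_count, seconds),
        "classtype:attempted-dos",
        "sid:{}".format(sid),
        "rev:1",
    ]
    body = "; ".join(options)
    return "drop tcp any any -> {} any ({};)".format(home_net, body)


# Hypothetical values; local rules conventionally use sids from 1000000 upward.
rule = suricata_drop_rule(sid=1000001, msg="SHS possible SYN flood", syn_count=100, seconds=10)
print(rule)
```

A rule generated this way should be loaded in a test deployment first, since the `drop` action only takes effect when Suricata runs inline.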

Results and Discussion
Based on contextual features, the classification model predicts whether incoming traffic is harmful or benign. The proposed model predicts the type of attack and determines the correlation between the 10 selected variables. The chosen attributes showed a high correlation, with values near or equal to -1 or 1. Figure 11 shows the Pearson correlation matrix of the selected features for our dataset.
True Positive (TP) is the amount of attack (anomalous) traffic correctly identified as an attack. True Negative (TN) is the amount of normal traffic correctly identified as benign. False Positive (FP) is the amount of benign (normal) traffic incorrectly identified as attack traffic. False Negative (FN) is the amount of attack traffic incorrectly identified as benign (normal) traffic. Table 7 provides definitions of the measures in terms of positives and negatives. This study represents multi-class classification using the confusion matrices of all models applied to the dataset. Table 8 compares the confusion matrices of the threat detection machine learning module. The confusion matrix shows difficulty in discriminating packets of DDoS attacks, as 10.9% of DDoS traffic was classified as normal traffic because of insufficient training packets for DDoS attacks. Increasing the number of packets should therefore help detect these attacks accurately.
The Precision, Recall, and F1 Score comparison for our threat detection module is shown in Figure 12. The trials and outcomes above make it evident that there is a trade-off between identifying destructive behavior and maintaining a low false positive rate (FPR). In most cases, Random Forest obtained the lowest false positive rates, which explains its strong F1-score performance. The accuracy achieved by our models is compared in Figure 13. Compared with the other machine learning techniques, Random Forest's performance showed a notable improvement.
The best KNN reached 98.4239% accuracy, whereas the Decision Tree classifier and XG Boost obtained 98.1791% and 98.2421% accuracy, respectively. The Random Forest model, however, has the highest accuracy of 99.1772%.
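The measures discussed in this section can be derived from a multi-class confusion matrix by treating each class in turn as the positive class. The sketch below shows the arithmetic; the matrix values are invented for illustration and are not the results reported in Table 8.

```python
def per_class_metrics(matrix):
    """matrix[i][j] = number of samples of true class i predicted as class j."""
    total = sum(sum(row) for row in matrix)
    metrics = []
    for k in range(len(matrix)):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                  # class k predicted as something else
        fp = sum(row[k] for row in matrix) - tp   # other classes predicted as class k
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        fpr = fp / (fp + tn) if fp + tn else 0.0
        metrics.append({"precision": precision, "recall": recall, "f1": f1, "fpr": fpr})
    return metrics


def accuracy(matrix):
    """Fraction of all samples that lie on the diagonal."""
    return sum(matrix[k][k] for k in range(len(matrix))) / sum(sum(row) for row in matrix)


# Invented counts; rows and columns are Normal, DoS, DDoS, Reconnaissance.
matrix = [
    [90, 5, 5, 0],
    [2, 48, 0, 0],
    [10, 0, 40, 0],
    [0, 0, 0, 50],
]
metrics = per_class_metrics(matrix)
print(accuracy(matrix), metrics[2]["recall"], metrics[0]["fpr"])
```

In this invented matrix the overall accuracy is high even though one fifth of the DDoS samples are missed, which is why per-class recall and FPR are reported alongside accuracy.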

Comparison with the previous study
The most pertinent research in the field of machine learning is described in [69], where the authors applied supervised machine learning (ML) methods such as Random Forest to the UNSW-NB15 dataset. When the developed IRIL-SHS dataset was applied to a straightforward machine learning model as in that prior study, without integrating context-aware features, the random forest model obtained 97.99% accuracy. The proposed context-aware threat detection model improved on this, achieving 99.1772% accuracy.
For the response module, nine rules were obtained for the developed dataset using the JRip classifier. The JRip classifier [70] yielded the best feature set with defined values for each feature, which is beneficial for generating Suricata signatures. For instance, if the bidirectional duration is less than or equal to 94, the bidirectional maximum packet inter-arrival time is greater than or equal to 13, the destination IP is 877, and the source-to-destination maximum packet inter-arrival time is less than or equal to 0, the traffic is reconnaissance and should be blocked from entering the smart home system. Using these metrics, we wrote Suricata emerging-threats rules for our prevention system, as shown in Figure 14. The IRIL-SHS dataset is based on the same benchmark datasets: KDD99, TUIDS ISCX, UNSW-NB15, and CICIDS2017. It offers a variety of data sources and traffic, including many malicious actions from both internal and external networks. The conclusion section discusses the stages of Random Forest model deployment on the IRIL-SHS dataset and how contextual feature generation and transfer learning improve threat detection accuracy to 99.177%. Based on the generated Random Forest model, conversion rules will be developed in our response module to generate Suricata signatures automatically.
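The example rule quoted above translates directly into a predicate over a flow record. In the sketch below the thresholds come from the text, while the dictionary keys, the assumption that the destination IP is an integer code, and the block/allow decision helper are illustrative.

```python
def is_reconnaissance(flow):
    """Apply the example rule: all four conditions must hold."""
    return (
        flow["bidirectional_duration"] <= 94
        and flow["bidirectional_max_piat"] >= 13
        and flow["dst_ip_code"] == 877
        and flow["src2dst_max_piat"] <= 0
    )


def decide(flow):
    """Hypothetical response step: block flows matched by the rule."""
    return "block" if is_reconnaissance(flow) else "allow"


# Made-up flow records using the assumed key names.
probe = {
    "bidirectional_duration": 40,
    "bidirectional_max_piat": 20,
    "dst_ip_code": 877,
    "src2dst_max_piat": 0,
}
browsing = dict(probe, bidirectional_duration=5000)
print(decide(probe), decide(browsing))
```

Keeping each learned rule as a small predicate like this makes it easy to check a rule against labeled flows before its values are copied into a signature.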

Conclusion and Future Work
Various learning models are applied to diverse kinds of online threats, yet few researchers have drawn attention to the limitations that machine learning techniques encounter. To test the most recent developments in machine learning for cyber-attacks, we observed and advised that an inclusive benchmark dataset should be created, one that includes not only data from heterogeneous smart IoT devices but also attack data from both internal and external networks. This dataset was then used for context-aware feature generation in the random forest model, which showed that the accuracy of threat identification on the IRIL-SHS dataset increased. When context-aware results are considered, the system obtained a high detection rate and a high level of precision. Of the four ML algorithms, RF performed better than the others, consistently achieving excellent detection performance and F1-score with low false positive rates. To prevent malicious traffic from accessing the smart home system, this study proposed generating rule sets for the open-source tool Suricata. The JRip rules supplied the metric values needed to add emerging-threat rules to Suricata for system defense. This study acknowledges that the performance of the developed machine learning model was evaluated against a limited set of attack types. While the researchers believe that the results demonstrate the robustness of the model against the attacks that were tested, they also recognize that other types of attacks were not included in the experiments. Additionally, because the model relies on 4-tuple information, consisting of source and destination IP addresses and ports, it may not be able to extract additional features such as payload or application-layer information from encrypted traffic.
Further research is needed to fully evaluate the performance of the proposed model in real-world scenarios where a variety of attack types may be present, and to follow advancements in encryption and decryption technologies that could improve the ability to extract features from encrypted traffic for critical analysis. Building on the real-time IRIL-SHS dataset collected from IoT devices, future studies can also concentrate on generating an Industry 4.0 dataset that includes data from industrial IoT devices such as distributed control systems (DCS), programmable logic controllers (PLCs), and gateways, together with attack data such as Modbus protocol attacks. Feature extraction from encrypted traffic in that setting would be a distinct task. As the presented model performs well for smart homes and small offices, it may also perform effectively on an Industry 4.0 dataset. In future work, based on various threat scenarios and zero-day attacks, we also intend to enhance our context-aware model for the Industry 4.0 dataset.