IoT Wireless Intrusion Detection and Network Traffic Analysis

Advances in wireless networking have given users the ability to use the Internet without a physical connection to the router. Almost every Internet of Things (IoT) device, such as smartphones, drones, and cameras, uses wireless technology (Infrared, Bluetooth, IrDA, IEEE 802.11, etc.) to establish multiple inter-device connections simultaneously. With the flexibility of wireless networks, one can set up numerous ad hoc networks on demand, connecting hundreds to thousands of users and increasing productivity and profitability significantly. However, the number of attacks on wireless networks that exploit this flexibility in setting up and tearing down networks has become alarming. Perpetrators can launch attacks because an ad hoc network setup has no first line of defense besides the standard IEEE 802.11 WPA2 authentication. One feasible countermeasure is to deploy intrusion detection systems at the edge of these ad hoc networks (network-based IDS) or at the node level (host-based IDS). The challenge is that no benchmark data is readily available for IoT network traffic. Creating such benchmark data is tedious because IoT can work on multiple platforms and networks, and crafting and labelling such a dataset is labor-intensive. This research studies the characteristics of existing datasets, such as KDD Cup and NSL-KDD, and their suitability for wireless IDS implementation. We hypothesize that network features are parametrically different depending on the type of network, and that assigning weights dynamically to these features can potentially improve the subsequent threat classification. This paper analyses packet and flow features for data packets captured on a wireless network rather than a wired network. Combining domain heuristics and early classification results, the paper identifies 19 header fields exclusive to wireless networks that carry high information gain and can be used as ML features in a wireless IDS.


Introduction
Humayun et al. [1] state that the main objective of the Internet of Things (IoT) is the automatic exchange of information between two systems or devices without any manual input. IoT devices readily trust other devices and exchange information with them, which makes them a target of attacks. Moreover, most IoT devices use existing wireless connections for their convenience and flexibility without considering their weaknesses. A wireless access point is usually not configured for secure operation and comes only with front-end authentication. Some common attacks, such as Distributed Denial of Service (DDoS), are not preventable through traffic filtering since ICMP traffic is considered legitimate. In a DDoS attack, many computers launch denial-of-service attacks against the same target server. There are three types of DDoS attack: application-layer, protocol, and volume-based. DDoS attacks can severely damage an organization's business and network security. A DDoS attack can last anywhere from a few hours to several days, making the organization's website and network unreachable during the attack. To improve IoT security on the network, an Intrusion Detection System (IDS) can be deployed to analyze the network traffic [2]. An IDS is a system that monitors a network or a system for malicious activities and reports them to, or alerts, the user of the system. Because it is designed to observe the activities in a system, an IDS investigates application vulnerabilities and identifies abnormal activity and data injection. The IDS helps the network administrator detect malicious activity on the network and alerts the administrator so that appropriate actions can be taken to secure the data against those attacks.
To implement an effective IDS in a wireless environment, careful selection of datasets or network traffic is of utmost importance. To that end, this research presents an analysis of network traffic from wired and wireless (IEEE 802.11) environments. The study can contribute to future research, mainly on IoT and wireless security, and can help researchers who wish to implement intrusion detection systems for their IoT networks. A careful selection of network traffic features can contribute towards an exemplary implementation of a wireless network IDS. Therefore, a comparison between wired and wireless networks and their traffic characteristics is presented in the following sections, followed by traffic characteristics for wireless (IEEE 802.11) networks.

Classification of IDS
Two types of IDS are commonly deployed for intrusion detection, namely (1) wired IDS and (2) wireless IDS.

Wired Intrusion Detection System
A wired IDS is the standard IDS, connected with all the network components (e.g. router, switch, IDS manager server, end devices) to perform intrusion detection [3]. The most common techniques used by wired IDS are the signature-based, anomaly-based and hybrid techniques. The signature-based method works by comparing observed information against a database of known signatures; however, it cannot identify unknown attacks. The anomaly-based method compares current user activities against previously loaded logs of user activity. The drawback of this technique is that it produces a larger number of false alarms because of irregular network and user behaviour. The hybrid detection technique combines the signature-based and anomaly-based detection techniques. Its weakness is the complexity introduced by integrating both signatures and anomalies. Tab. 1 shows a simple analysis of the intrusion detection techniques.
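The hybrid technique described above can be illustrated with a minimal Python sketch. It is not the method of any cited work: the signature database, the packet-rate baseline, and the three-standard-deviation threshold are all assumptions chosen only for illustration.

```python
from statistics import mean, pstdev

# Hypothetical signature database: byte patterns of known attacks.
SIGNATURES = {
    b"/etc/passwd": "path-traversal",
    b"' OR 1=1": "sql-injection",
}


def signature_match(payload):
    """Return the name of the first known signature found in the payload, else None."""
    for pattern, name in SIGNATURES.items():
        if pattern in payload:
            return name
    return None


def is_anomalous(value, baseline, threshold=3.0):
    """Flag a value lying more than `threshold` standard deviations from the baseline mean."""
    mu, sigma = mean(baseline), pstdev(baseline)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


def hybrid_detect(payload, packet_rate, baseline_rates):
    """Check known signatures first, then fall back to the anomaly test."""
    name = signature_match(payload)
    if name is not None:
        return "signature:" + name
    if is_anomalous(packet_rate, baseline_rates):
        return "anomaly"
    return "benign"
```

The signature check catches only patterns already in the database, while the anomaly check can flag unseen behaviour at the cost of false alarms when the baseline is irregular, which mirrors the trade-off stated in the text.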
The wired or standard IDS architecture connects all the devices with cables. The IDS console monitors and analyzes the network traffic. When traffic or a packet arrives from the Internet, the router passes the data to the IDS server, which collects the traffic and performs the analysis. The IDS does not drop any packets, since its job is to collect and analyze the data. A wired IDS requires more components and devices for the network setup, such as routers, switches, IDS consoles, IDS servers, and other end devices, as shown in Fig. 1.
One weakness of traditional wired IDS for wireless deployments such as IoT is that it does not generally detect network intrusion from internal hosts of the network. Although it is possible to protect an organization's internal network from wireless attackers, there may be only one link between the wireless network and the main network, so such a network intrusion system will not cover all of the traffic on the wireless network [4]. The traditional wired IDS may face challenges in securing the wireless network because it fundamentally ignores the monitoring of airspace, from which most attacks are perpetrated. In conclusion, the wired IDS is not suitable for deployment in IoT networks because most IoT devices are connected via wireless media such as IEEE 802.11, IEEE 802.15.4 and IEEE 802.15.1.

Wireless Intrusion Detection System
Compared with wired IDS, wireless IDS is more suitable for monitoring IoT networks. Wireless IDS is an improved version of wired IDS because it covers characteristics specific to wireless networks [3]. According to [5], wireless IDS has more intrusion detection techniques to discover possible attacks on a wireless network. These methods cover the target system, detection technique, collection process, trust model, analysis and response used to identify and analyze the network traffic.
A wireless IDS has a more efficient method for investigating wireless network traffic.
The wireless IDS architecture resembles a wired IDS architecture, but it uses a wireless access point for network connectivity. The wireless IDS architecture is more suitable for IoT networks due to the wireless sensor deployment. Furthermore, the typical components in a wireless IDS are the console, database server, and sensors, as shown in Fig. 2. In conclusion, wireless IDS is more suitable for investigating IoT network traffic, from the technique used to the architecture and the network setup.

Internet of Things Dataset
Network traffic datasets can be sniffed from both wired and wireless networks. There are considerable differences in the attacks that target wired and wireless networks. Fadlullah et al. [6] state that a wired network has an access medium that is physically secured compared to the wireless network, as it does not require the monitoring of airspace. The datasets used in most studies are sample datasets such as KDD Cup '99, NSL-KDD, and Kyoto 2006+. However, there is a lack of studies on intrusion detection in wireless networks that state which parameters are crucial for the classification algorithms to detect intrusions. Furthermore, the wireless packets used as the dataset consist of multiple parameters and fields, which require feature selection before the algorithms can be applied to detect intrusions in a network. This research aims to identify the critical parameters and fields required in a wireless network dataset to produce optimal results. In addition, traditional network traffic characteristics at the packet level, flow level, connection level and host level are analysed to investigate whether existing parameters are fit for use in a wireless network.

Network Traffic Analysis
Understanding the type of network traffic data is fundamental before investigating a network traffic dataset. Humayun et al. [7,8] have clearly summarized and compared these data categories. Since network traffic plays an important role in wireless intrusion detection design, the remaining sections focus on the types of wireless traffic and the techniques to analyze it. Moreover, some of the challenges in using existing benchmark data for wireless intrusion detection are discussed in detail. Network data is grouped into four categories: packet-level, flow-level, connection-level, and host-level data. The atomic unit of network traffic is the packet. Packets are captured by specific applications or libraries such as Libpcap, WinPCap, Wireshark, and Libtrace [9].

Packet Level Data
Network applications generate traffic (packets) containing headers from multiple protocols through the encapsulation process. Examples of these protocols are the Transmission Control Protocol (TCP), User Datagram Protocol (UDP) and Internet Control Message Protocol (ICMP). See Tab. 2 for a snippet of packet payload information and packet activity information. This traffic can then be used to detect DDoS and worm attacks. Packet-level data is commonly collected using TCPDump, Snort [10] and Nmap [11]. TCPDump [12] is similar to Wireshark but without a GUI, whereas Snort is a standard intrusion detection tool that captures packets at the same time. Nmap is used for host discovery and for service and operating system detection. Port mirroring is another method that can be used for data collection. This is a hardware-based approach in which packet forwarding devices forward packets from one port to another port where packet capture devices are connected. With this method, all incoming and outgoing packets within a whole network can be viewed. However, this approach requires enough bandwidth, as mirroring can cause packet loss and delay [13]. In their detailed analysis of the important features for detecting attacks, Mahoney et al. [14] identified 33 header features that are crucial in detecting anomalies in networks. The authors performed a comprehensive analysis of the types of attacks detected, their occurrences in terms of percentage, and the header fields that contribute to detection effectiveness. Jing et al. [15] present a detailed analysis of seven fields and their significance in detecting attacks: source/destination IP address, source/destination port, time to live (TTL), timestamp, packet payload, packet size and the number of packets.
As for the IP address, the way addresses are distributed and their changing patterns are useful for detecting botnet attacks. Similarly, the changes in and distribution of port addresses can be used to detect worm attacks. An IP spoofing attack can be detected by looking at the TTL value, because the TTL can show whether a packet comes from a spoofed internal IP address. Likewise, timestamp values reveal the inter-arrival time and round-trip time. A non-uniform pattern in these can indicate a non-repudiation attack. Packet payload typically carries important information destined for the victim network.
For example, a deep inspection of the packet application header and its payload can indicate whether the packet carries malware. One weakness of this method is that some protocols such as SSL and TLS encrypt the payload information, which makes it hard to detect malicious information carried by the payload. The packet size can indicate whether the payload was initiated by a bot, because attack packets coming from all the bots tend to have almost the same size. A DDoS attack can be detected by analyzing packet counts, based on the inbound and outbound packet counts. Examples include checking the ICMP request/reply count, the traffic distribution among different network protocols, or the ratio of TCP packets with different flag values [15]. On high-speed networks, packet-level data collection may require expensive hardware, and sophisticated encryption and obfuscation methods may hinder packet inspection.
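The count-based checks mentioned above (ICMP request/reply count, protocol distribution, TCP flag ratio) can be sketched in a few lines of Python. This is an illustrative sketch only: the packet record layout (`proto`, `icmp_type`, `flags`) is a hypothetical format, not the output of any particular capture tool, and no detection threshold is implied.

```python
from collections import Counter


def packet_count_features(packets):
    """Compute simple count-based features from an iterable of packet records.

    Each record is a dict in a hypothetical layout:
    {"proto": "TCP" | "UDP" | "ICMP", "icmp_type": int, "flags": set of str}
    """
    proto_counts = Counter()
    icmp_requests = icmp_replies = 0
    tcp_total = tcp_syn_only = 0
    for pkt in packets:
        proto = pkt["proto"]
        proto_counts[proto] += 1
        if proto == "ICMP":
            if pkt.get("icmp_type") == 8:      # ICMP echo request
                icmp_requests += 1
            elif pkt.get("icmp_type") == 0:    # ICMP echo reply
                icmp_replies += 1
        elif proto == "TCP":
            tcp_total += 1
            if pkt.get("flags") == {"SYN"}:    # SYN set and nothing else
                tcp_syn_only += 1

    total = sum(proto_counts.values())
    if icmp_replies:
        icmp_ratio = icmp_requests / icmp_replies
    else:
        icmp_ratio = float("inf") if icmp_requests else 0.0
    return {
        "protocol_share": {p: c / total for p, c in proto_counts.items()},
        "icmp_request_reply_ratio": icmp_ratio,
        "tcp_syn_only_share": tcp_syn_only / tcp_total if tcp_total else 0.0,
    }
```

A strongly skewed request/reply ratio or an unusually high share of SYN-only segments would be candidates for closer inspection; where to draw the line depends on the monitored network.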

Flow Level Data
Flow-level data is usually produced by aggregating the relevant data under a flow key, depending on the application. The applications can be summarized as network/application/host monitoring, intrusion detection, security awareness and network application classification. A collection of flow-based datasets for intrusion detection is outlined in Umer et al. [16]. Flow-based data can be collected using either depth-first or breadth-first methods. The former chooses a specific flow key to aggregate data, and the latter collects as much information as possible to give a more comprehensive view of the network traffic [17]. Flow collection (Fig. 3), types of flow and flow analysis are discussed fully in Li et al. [17][18][19][20].
A flow is defined as a unidirectional sequence of packets that belong to the same TCP session. The purposes of a flow are to provide an overview of network traffic and to deal with encrypted packets. Flow-level data comprises (as shown in Tab. 3) a tuple with a flow key aggregated with a collection of information such as srcIP, dstIP, src_port, dest_port, as in the case of Cisco routers [18]. The reason for collecting flow-level data is to reduce the amount of network traffic to be analysed. Flow data is typically not helpful for deep analysis that requires packet payload. Flow-level data is statistical information about the flow, which comprises flow count (number of flows with the same flow key), flow type (ICMP, TCP, UDP, HTTP, DNS etc.), flow size, flow direction (inflow, outflow), flow duration (time from first packet arrival to end of flow) and flow rate.
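As a rough illustration of how packets are aggregated under a flow key, the sketch below groups packet records by a 5-tuple and derives flow size, duration and rate. The record keys (`src_ip`, `dst_ip`, `src_port`, `dst_port`, `proto`, `ts`, `size`) are assumed names for this example and do not correspond to a specific exporter format; the per-flow packet count reported here is a convenience field, not the flow count defined in the text.

```python
from collections import defaultdict


def aggregate_flows(packets):
    """Group packet records into unidirectional flows keyed by a 5-tuple.

    Each packet is a dict with hypothetical keys:
    src_ip, dst_ip, src_port, dst_port, proto, ts (seconds), size (bytes).
    """
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0, "first": None, "last": None})
    for pkt in packets:
        key = (pkt["src_ip"], pkt["dst_ip"], pkt["src_port"], pkt["dst_port"], pkt["proto"])
        flow = flows[key]
        flow["packets"] += 1
        flow["bytes"] += pkt["size"]
        flow["first"] = pkt["ts"] if flow["first"] is None else min(flow["first"], pkt["ts"])
        flow["last"] = pkt["ts"] if flow["last"] is None else max(flow["last"], pkt["ts"])

    summary = {}
    for key, flow in flows.items():
        duration = flow["last"] - flow["first"]
        summary[key] = {
            "flow_type": key[4],
            "flow_size": flow["bytes"],
            "packet_count": flow["packets"],
            "flow_duration": duration,
            # Rate in bytes per second; undefined for single-packet flows.
            "flow_rate": flow["bytes"] / duration if duration > 0 else None,
        }
    return summary
```

Because the key includes direction (source before destination), the two directions of one TCP session appear as two separate flows, consistent with the unidirectional definition above.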

Connection Level Data
Connection-level data records global information between two IP addresses from the viewpoint of a particular network, providing a finer level of network traffic granularity than flow-level data. Using connection-level, packet-level, or flow-level data, we may obtain detailed information about network activities. Connection-level data can be summarised as connection size (size of packets and length of flow), connection period (time from connection establishment to connection termination), connection count (number of connections per unit time), and connection type (TCP, UDP, ICMP etc.).
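Two of the connection-level statistics named above, connection period and connection count per unit time, can be computed as in the following sketch. The connection record layout (`start` and `end` timestamps in seconds) and the 60-second window are assumptions made for this example.

```python
from collections import Counter


def connection_period(conn):
    """Time from connection establishment to termination, in seconds."""
    return conn["end"] - conn["start"]


def connections_per_window(connections, window=60.0):
    """Count how many connections start in each consecutive time window.

    Returns a dict mapping window index (0 for [0, window), 1 for
    [window, 2 * window), ...) to the number of connections started in it.
    """
    counts = Counter()
    for conn in connections:
        counts[int(conn["start"] // window)] += 1
    return dict(counts)
```

For instance, connections starting at 5 s, 30 s and 65 s fall into windows 0, 0 and 1 respectively when the window is 60 s.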

Host Level Data
Any internal changes in the host can be seen in the host-level data, and most attacks have a direct effect on the host's reliability. Internal attacks such as unauthorised login or entry, file system alteration, and privilege escalation can be detected using host-level data. Host-level data is commonly used in HIDS for monitoring internal abnormalities. It can be collected using open source tools like Collect on Linux machines [21] and Load runner on Windows machines [22]. The collected data comprises CPU and memory usage and operation logs (equipment and application operation logs). Equipment operation logs include device events such as mouse clicks, keyboard input, cursor changes etc. In contrast, the application log relates directly to a specific application on the host and covers local port creation/destruction, login attempts, system calls, software usage events etc.

Summary of Network Traffic Dataset
Tab. 4 summarizes and compares traffic in terms of data types, based on their strengths and weaknesses. Packet-level data is useful because it gives a full view of the entire packet (payload and header) and allows network activities to be assessed in a much more granular way, hence it is suitable for real-time detection. On the other hand, this method needs sophisticated hardware as network speeds increase exponentially, and packet inspection also breaches data privacy since it is done at a significantly deeper level. In contrast, flow-level data can cope with the speed of traffic by aggregating data into flows and reducing the amount of traffic to be inspected. One setback of this method is that it is not suitable for payload inspection, because packet payload is not considered at the flow level. In this regard, connection-level data is preferable because it provides a more comprehensive picture of network traffic between two hosts. However, connection-level data collection necessitates keeping track of connection status, which consumes more resources. As a result, packet-level, flow-level, and connection-level data complement one another in their advantages and disadvantages. A combination of all three is ideal since it provides comprehensive information on network activity. Host-level data, on the other hand, gives a complete view of the events in the system but not of any network activities. Therefore, host-level data cannot be used alone, as it can give a very high false-positive rate even when normal user activities are performed. In summary, each type of data can be used to detect specific attacks by considering the detection method and the network environment. Any application-specific attack detection may require packet-level and connection/flow-level data, whereas inspecting malware within a host might require more host-level data and some extent of connection data.
Many network-based attacks such as DDoS and botnets may need packet-level and connection-level data. The analysis of such an attack can be made more granular by integrating host-level data to see the effect of DDoS on host performance. Higher detection accuracy can be achieved through a thorough analysis of the nature of the attacks and their impact on the host or the network.

Existing Benchmark Datasets
Live data collected from the network can be used for intrusion detection, but many compiled datasets for network intrusion detection are also available on the Internet. One of the most widely used is the KDD Cup data set [23]. According to Can et al. [24], the KDD Cup data set consists of approximately 4,900,000 training instances and 41 features (Tab. 5). Each instance is labelled either as normal or as exactly one specific attack type. The training data set includes 80% attack and 20% normal data. Some of the KDD features may not be applicable to intrusion detection because they play no useful role in classifying the outputs. Network intrusion detection using KDD Cup by Can et al. [24] shows an 89.17% success rate on 67,500 attack samples with an artificial neural network-based intrusion detection system for wireless sensor networks. Due to the rapid change of technology, KDD datasets might have some limitations in detecting abnormal traffic in ad hoc wireless networks [25]. Traffic collectors such as TCPdump, as used to capture the KDD data, are very likely to become overloaded and drop packets under heavy traffic load. Some network traffic datasets are tied to the operating systems and hardware current at the time of collection, so the KDD dataset might be a challenge for investigating ad hoc network traffic, since ad hoc networks and IoT devices are built with different operating systems and hardware. Finally, most work using the KDD dataset targets wired intrusion detection, and therefore wireless network features and attacks are not included in KDD datasets.
Host-level data table (excerpt):
Equipment operation logs: The running records of equipment that connects with a host
Application operation logs: The running records of an application
Tab. 5 (excerpt; the feature name of the first row is missing in the source):
(name missing), Integer: Bytes sent in one connection
Dst_bytes, Integer: Bytes received in one connection
Land, Binary: If src/dst IP address and port numbers are same, then 1
Wrong_fragment, Integer: Sum of bad checksum packets in a connection
Urgent, Integer: Sum of urgent packets in a connection
Sobh [26] performed a complete analysis of the KDD data set and found some severe limitations. Further analysis by Sobh (2006) [26] covers the top 20 alarms generated, the attacks detected, the contribution of fields to detection, the attacks not seen, and the overhead in terms of time and space. For the top 20 alarms, the author concludes that 8 correctly detected attacks, with the destination IP address as the important field. Nine other alarms were false positives (FP), and the remaining three detected ARP poisoning but without the IP address, since it involves only ARP packets. As for attack detection, out of 201 attack instances created, 67 were detected, such as DoS, probe and R2L, and the detection rate is above 50% for all these cases, with TTL as one of the main header features contributing to detection, with 33 detections out of the 19 types of attacks. Eight of the 67 attacks were detected by IP addresses and none by port numbers. Thirty checksum errors were created due to IP fragmentation, with no proper reasoning for using smaller fragments. The author performed a statistical evaluation, applied their own classification methods, and found some shortcomings in the KDD Cup data.
Some of the weaknesses identified by the author are: a very ambiguous attack definition, packet drops due to traffic overflow, too many redundant records (75% in training and 75% in testing), odd inputs to poorly configured software, odd data from impractical attacks, and some unrealistic data inserted to hide some of the attacks. Furthermore, with just a random selection of data for training and testing, the results show a very unrealistic accuracy value.
Therefore, [27] proposed a new dataset known as NSL-KDD, built from properly selected KDD data records. NSL-KDD (Tab. 6) is a refined version of the KDD Cup data with all the essential 42 features, 5 classes and 4 attacks. According to Chae et al. [27], the dataset is a better selection because it does not have many redundant records in the training set, and therefore no bias would occur during classification. Also, the number of records selected from each difficulty level is inversely proportional to the number of records at that level in the original KDD Cup data. As a result, classification using various machine learning methods yields a wider range of accuracy rates, which makes the evaluation of different classification methods more meaningful. Likewise, the number of records selected for training and testing is reasonable, without being randomly selected as in the KDD Cup data. The investigations conducted by various researchers [27][28][29][30][31][32][33] show that NSL-KDD gives consistent and more realistic results.
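The redundancy issue noted for KDD Cup can be checked on any tabular dataset with a few lines of Python. The sketch below measures the share of duplicate records and removes them; it illustrates plain de-duplication only and is not the record selection procedure used to build NSL-KDD. The row layout in the usage note is hypothetical.

```python
def redundancy_rate(records):
    """Return the fraction of records that duplicate another record."""
    rows = [tuple(r) for r in records]
    if not rows:
        return 0.0
    return 1 - len(set(rows)) / len(rows)


def deduplicate(records):
    """Keep the first occurrence of each record, preserving the original order."""
    seen = set()
    kept = []
    for row in map(tuple, records):
        if row not in seen:
            seen.add(row)
            kept.append(row)
    return kept
```

For example, a list holding three copies of `("tcp", "http", 0, "normal")` and one `("udp", "dns", 1, "attack")` row has a redundancy rate of 0.5, and `deduplicate` reduces it to two rows.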

A Preliminary Wireless Network Dataset
IEEE 802.11-based WLANs use several frame types that carry the packet information. The expanded packets in the dataset provide a clear comparison between the frame structures. For example, the number of fields in each packet differs according to the type of frame it belongs to. Therefore, feature selection relies heavily on the fields that all frames share, so that differences in the same fields across frame types can be compared. The frame type is an important feature in the study of wireless networks because it affects the number and types of fields present in the frame. Tab. 7 shows important frame information in the IEEE 802.11 dataset for traffic analytics. The features presented here can be used as a guideline for designing an effective wireless intrusion detection system. Besides the traditional TCP, UDP and IP packet headers, it is suggested to include the IEEE 802.11 frame headers for effective wireless intrusion detection design.
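Since the header fields are ranked by information gain, a minimal sketch of that measure for a categorical field may help. It uses the standard entropy-based definition; the field values in the usage note (such as a frame subtype of "deauth") are hypothetical examples and are not taken from Tab. 7.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder
```

With labels `["attack", "attack", "normal", "normal"]`, a field whose values are `["deauth", "deauth", "data", "data"]` separates the classes perfectly and yields a gain of 1 bit, whereas a field with the same value in every frame yields 0 and would be dropped.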

Discussion
To the best of our knowledge, there is no fundamental research in IoT intrusion detection (ID) that focuses mainly on wireless networks. Most IDS research focuses on traditional wired networks. Applying wired-network IDS research to wireless networks may not be feasible due to the architectural differences of IoT. Traditional security countermeasures and privacy enforcement cannot be directly applied to IoT technologies due to three fundamental aspects:
1. The limited computing power of IoT components
2. The high number of interconnected devices
3. Sharing of data among objects and users
Moreover, intrusion response in wireless networks depends on the type of intrusion, the network protocols and applications in use, and the confidence in the evidence, which differs from wired networks. A few works have used IDS to counter wireless network attacks in IoT security. The main challenge is the nature of the wireless network. Unlike in a wired network, centralized access control is hard to implement in a wireless network due to its distributed nature. IoT devices and networks generate massive amounts of unstructured data. Until now, researchers have usually not had access to complete IoT network data that can be used for intrusion detection research. A wireless intrusion detection system will need to collect as much protocol data from the wireless network as needed.
Moreover, there are specific vulnerabilities in the physical and data link (MAC) layers of wireless networks that were not addressed in the design of wired IDS. Therefore, simply deploying a wired IDS as a wireless IDS would offer false hope, as it may not detect some wireless-specific attacks, especially at the data link layer. No reliable research work has been conducted to create a standard benchmark dataset in a wireless IoT environment.

Conclusion
Careful selection of datasets is important in training ML-based wireless intrusion detection systems. As discussed, the KDD Cup and NSL-KDD datasets contain traffic features that are detrimental to model accuracy when they are used to train models to detect IoT variants of network intrusions. In IoT networks, wireless traffic carries more critical information at the data link layer. A detailed comparison between wired and wireless data showed that most of the features relevant to wireless IDS are found in the physical and data link layers. The findings indicate that adjusting feature weights for wireless-specific header information can potentially improve intrusion classification. Currently, to the best of our knowledge, no reliable research has been conducted to create a standard benchmark dataset in a wireless IoT environment. This paper identified a set of high-gain features (in Tab. 7) that is highly correlated with network intrusion on wireless networks. The feature sets were filtered through a combination of domain heuristics and preliminary testing results of ML models trained with these custom feature sets [34][35][36][37][38][39]. Future investigations can leverage these feature sets to customize the scope of data collection for any ML-based wireless IDS design for IoT infrastructure.
Funding Statement: The author(s) received no specific funding for this study.
Conflicts of Interest: The authors declare that they have no conflict of interest to report regarding the present study.