Network Learning-Enabled Sensor Association for Massive Internet of Things



Introduction
Recently, newly emerged internet of things (IoT) applications such as smart agriculture, smart industries, and smart transportation systems have introduced massive growth in the number of connected devices and sensors, which increases wireless bandwidth consumption [1]. In an IoT environment, sensor device association plays a vital role in applications such as localization [2] and indoor positioning [3]. One of the well-known wireless communication technologies, the Institute of Electrical and Electronics Engineers (IEEE) 802.11 wireless local area network (WLAN), has become a communication service provider to meet the increasing demand for bandwidth [4]. A WLAN for massive IoT is composed of several access points (APs) acting as gateways (GWs) for sensory data collection. In such a scenario, multiple GWs coexist with the same service set identifier (SSID) and create an extended service set (ESS) as a single WLAN environment. Examples can be found in IoT applications such as smart industries or smart agriculture, where several smart devices (SDs) with attached sensors are required to connect to a single internet-access GW. However, a WLAN environment suffers from network performance degradation as the number of connected devices grows, due to the limited number of available channels and limited bandwidth. Hence, channel resource scarcity has been one of the challenges for WLAN environments.
A recent IEEE 802.11 WLAN amendment, IEEE 802.11ax, has introduced many new features such as spatial reuse (SR) and 5 gigahertz (GHz) bands [5]. One purpose of these competitive features is to improve channel access mechanisms that may have unsatisfactory efficiency in highly dense scenarios. In a WLAN, the channel scarcity problem is addressed using the carrier sense multiple access with collision avoidance (CSMA/CA) mechanism with the distributed coordination function (DCF) at the medium access control (MAC) layer. The currently implemented CSMA/CA is simple and effective and performs well for a small WLAN with a limited number of users and low throughput requirements. With rising user density and higher throughput requirements, CSMA/CA fails to tackle the collision problem. In IoT network systems, such a situation is typically handled by associating SDs using a strongest received signal strength indicator (RSSI) strategy. However, this may introduce an unbalanced use of the different GWs for the sensor data. The issue with an RSSI-based strategy is the use of the physical-layer RSSI as the metric for association, which does not consider the traffic situation at the GW or in the network. Fig. 1 depicts an example IoT scenario, where the two highlighted regions, blue and red, contain unbalanced SD associations. As shown in the blue highlighted region of the figure, GW 1 and GW 2 could each be associated with three SDs; however, due to the RSSI-based strategy, GW 1 carries a higher load than GW 2. Similarly, in the red highlighted region, GW 6 has four associated SDs, while GW 5 has only one. In each region, the associated devices could be distributed more intelligently to enhance overall network performance. Therefore, this work suggests considering the network situation as part of the network configuration for suitable GW selection for connected SDs in an IoT network.
Today, machine learning (ML) techniques have a proven ability to provide solutions to such optimization problems. In this paper, a multi-armed bandit (MAB) technique [6], an ML method, is used to address the GW association problem in an IoT network based on IEEE 802.11 WLANs. In MAB techniques, actions are performed by an agent within a defined environment with the intent of maximizing the accumulated reward. To do so, this work considers SDs as agents that learn the optimal action, based on the environment, for associating themselves with the GWs. This work addresses the problem in a decentralized multi-agent context, where GWs and SDs compete for a common set of spectrum resources through multiple actions. The proposed MAB-based algorithm allows SDs to intelligently associate themselves with a suitable GW by learning from their experience (accumulated reward) through a learning-by-interaction approach to improve their future performance. In particular, it uses MAB to study the suitable-GW association problem in a large WLAN-based IoT network, utilizing the effectiveness of the ε-greedy algorithm as an optimal action exploitation strategy. The following are the key contributions of this paper:
• This paper identifies the challenges posed by IoT sensor device association with IoT gateways.
• A learning-based multi-armed bandit algorithm is proposed for sensor device association with an IoT GW.
• Several simulations are performed to evaluate the proposed learning-based algorithm.
The remainder of the paper proceeds as follows: Section 2 describes related research work. Section 3 describes the system model of the proposed ML-enabled algorithm. Section 4 assesses the problem statement using the MAB-based mechanism. Section 5 evaluates the performance of the proposed ML-enabled algorithm with the help of simulation results. Finally, Section 6 provides conclusions from this work.

Related Research Work
Several research surveys and articles have been published on IoT network systems. These studies mainly summarize topics such as the age of information in massive IoT [7] and wireless sensor networks in massive IoT systems [8]. Most of this work targets the application design perspective of IoT systems; the authors discuss the influence of buffers, queueing modeling, traffic scheduling, and MAC-layer channel access mechanisms on the design of IoT applications that require frequent updates from the network. In [9], the authors propose a data reduction algorithm to reduce the burden on the IoT gateway. There has been related research in this area, especially in the IEEE 802.11 WLAN context, addressing the number of devices associated with a GW (or access point in WLAN terminology). In [10], the authors proposed an algorithm for device association to achieve optimal throughput. Their mechanism estimated AP utilization based on the required throughput of the device to carry out the association. The authors in [11] proposed two different algorithms: the first based on the quality of the uplink/downlink channel, and the second using the airtime of each WLAN. In [12], the authors used the average payload of the WLAN to associate network devices. Their work had the limitation of requiring changes to the standard messaging frames. In [13], devices were moved (re-associated) to the AP with the lowest network load. However, their mechanism did not consider link quality, which significantly reduces the throughput of a WLAN. In [14], the authors explored a centralized approach for device association by introducing an online GW. Similarly, in [15], the authors introduced a software-defined network (SDN) solution for the unbalanced association of devices with GWs.
In their proposed solution, the GW with congested associations was requested to re-associate by adjusting the transmission power.
Several authors have applied ML-enabled techniques for device association; for example, [16] utilized a reward-based algorithm for GW (AP) selection. In [17], the authors proposed a MAB algorithm for the GW selection procedure by extending the ε-greedy exploration/exploitation mechanism. Moreover, a deep neural network-based wireless user association mechanism with six hidden layers was proposed in [18]. In [19], the authors proposed an association rule based on a mean-field game solver using MAB games. In [20], an optimization problem was formulated to maximize the total gain of the network under different IoT requirements; the authors proposed a stable matching algorithm to solve this high-complexity problem of IoT slices, and also proposed an RL algorithm to enable practical implementation, employing a learning framework to learn different IoT network dynamics [20]. The research in [21,22] studied IEEE 802.11 WLANs in dense networks for WLAN device association with deterministic (fixed) APs. In [23], the authors proposed turning off as many GWs as possible under low network load. In [24], the authors proposed a centralized RL-based device association with GWs, with their RL mechanism minimizing subscriber dissatisfaction.
Critical analysis: In light of this related research, device association, especially in IoT network systems, has been conducted either in a traditional way (strongest-signal strategy) or in a heuristic way (always retaining the best past association). However, such strategies may lead to less efficient situations: relying on signal strength alone overburdens GWs with the nearest sensor devices, and even a heuristic mechanism cannot reach an optimal association strategy because it uses the highest experienced data rate, an instantaneous performance metric. Therefore, this work utilizes an ML-enabled solution for the association mechanism.
The state-of-the-art mechanism for SD association in IoT networks is to examine all nearby available GWs and select the one with the highest RSSI. However, choosing the nearby GW (g) based on the highest RSSI may not be the best action at a particular moment, as it may lead to an unevenly connected device distribution, where network resources remain unused. Therefore, this work proposes an RL-based learning algorithm in which IoT devices (SDs) act as agents. In RL-based ML techniques, the MAB addresses the problem where an interacting agent selects an action among a given number of actions (referred to as arms in a MAB algorithm) [6]. An agent chooses one action from the available set of actions, one at a time. Every time an arm is pulled as an action, a reward (feedback) is generated, which allows the agent to evaluate the performance of that specific action/arm. The objective of the agent is to learn the action with the optimal accumulated reward value. One of the challenges faced by MAB-based mechanisms is balancing exploration and exploitation: an agent must balance the tradeoff between learning faster and learning more slowly, and a learning-rate parameter is introduced for this purpose. This tradeoff is crucial, as slow learning may waste too much time, while a faster learning rate may leave too little exploration. The ultimate goal of a MAB algorithm is to find the optimal action that produces the maximum reward. Once an agent converges on a maximum-value action, we say the agent has learned the environment. One way to measure this is through a regret function over the actions performed by the agent. The regret L_{i,T} of an agent device d_i after T total time steps can be given by [6]

L_{i,T} = Σ_{t=1}^{T} (R*_{i,t} − R_{i,t}),

where R*_{i,t} defines the reward of the optimal action at time t and R_{i,t} is the reward actually received.
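The regret computation above can be sketched as follows. This is a minimal illustration with hypothetical per-round throughput rewards, not the paper's implementation:

```python
def cumulative_regret(rewards, optimal_reward):
    """Regret after T steps: sum over t of (R*_{i,t} - R_{i,t}).

    Assumes a fixed optimal per-step reward for simplicity.
    """
    return sum(optimal_reward - r for r in rewards)

# Hypothetical per-round rewards (Mbps) that approach the optimum of 4.0
rewards = [2.0, 3.0, 3.5, 4.0, 4.0]
print(cumulative_regret(rewards, optimal_reward=4.0))  # 3.5
```

As the agent learns, the per-step gap shrinks, so the cumulative regret grows increasingly slowly.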
The objective of the agent is to minimize the expected regret with the passage of time, that is,

min E[L_{i,T}], as T → ∞.

One of the challenges of MAB implementation is choosing a fair tradeoff between exploration and exploitation. In this study, the ε-greedy method is used: exploration with probability ε and exploitation with probability (1 − ε). Each IoT device acts as an agent and implements the ε-greedy method to explore/exploit the available arms (GWs) in its sensing range. An agent receives a reward for each selected GW and accumulates these rewards for future exploitation. At each iteration of time t, the devices perform re-association with the nearby GWs. The experienced average throughput τ_i of a device from a specific arm (GW) is used as the reward for that arm. Thus, the instant reward R_{i,t} for a GW g_i at time t is calculated as the average of the rewards received by an agent device from the associated GW g_i as follows,

R_{i,t} = (1/n_i(t)) Σ_{k=1}^{n_i(t)} τ_{i,k},

where n_i(t) is the number of times GW g_i has been selected up to time t and τ_i is the achieved throughput of the device, measured as the amount of successfully transmitted data per unit of time. As mentioned earlier, ε-greedy is implemented at each sensor device (SD), and each GW in range is considered an arm for the sensor device. Thus, every SD as an agent keeps track of the reward collected for a specific GW selection, that is, the achieved performance. Fig. 2 shows the flowchart of the proposed MAB-based GW association algorithm, where exploration and exploitation play the key role in selecting the GW, either through random selection (exploration) or as the GW with the maximum accumulated reward (exploitation). Each exploration/exploitation step leads to a re-association with the selected GW. The ε-greedy procedure in Fig. 2 works in iterations of time t, which represent the association rounds. In addition, this work uses the achieved average throughput as the reward for each of the GWs.
Thus, a sensor device accumulates the average throughput received as the result of any GW selection, and based on this accumulated throughput, the reassociation for the GW is performed using the ε-greedy mechanism.
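The per-device ε-greedy procedure described above can be sketched as follows. This is a minimal sketch, not the paper's implementation; the class structure, GW indexing, and incremental-average update are illustrative assumptions:

```python
import random

class EpsilonGreedyAgent:
    """One sensor device (agent); each in-range GW is an arm.

    Keeps a running average of the throughput (reward) observed
    from each GW and re-associates via the ε-greedy rule.
    """

    def __init__(self, num_gateways, epsilon=0.2):
        self.epsilon = epsilon
        self.counts = [0] * num_gateways      # times each GW was selected
        self.values = [0.0] * num_gateways    # average reward per GW

    def select_gateway(self):
        # Explore with probability ε: pick a GW at random.
        if random.random() < self.epsilon:
            return random.randrange(len(self.values))
        # Exploit with probability 1 − ε: pick the GW with the
        # highest accumulated (average) reward so far.
        return max(range(len(self.values)), key=lambda g: self.values[g])

    def update(self, gw, throughput):
        # Incremental running average of the achieved throughput.
        self.counts[gw] += 1
        n = self.counts[gw]
        self.values[gw] += (throughput - self.values[gw]) / n
```

In each association round, a device calls `select_gateway()`, associates with the chosen GW, measures its achieved throughput, and feeds it back through `update()`.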
One of the challenges for an ML-enabled algorithm is extracting the learning features from the data set and effectively tuning its hyperparameters. This study proposes an RL-based association mechanism for device association in IoT networks; RL concerns how an intelligent agent (a device equipped with RL capabilities) takes actions in an environment to maximize the accumulated reward. The proposed ML-powered association mechanism differs from supervised or unsupervised learning in that it does not require explicit correction of extracted features with suboptimal actions. Since supervised and unsupervised learning are both used for problems where the desired output is known and provided in the training data, it is inconvenient to use such models for a dynamic environment like IoT. RL, in contrast, can handle problems with sparse or delayed rewards, which are not well suited to supervised or unsupervised learning. Additionally, RL can handle problems with long-term dependencies, such as sequential decision-making tasks, which can be challenging for other learning methods. It finds a balance between exploration of the environment and exploitation of the learned knowledge of the environment (learned in terms of the actions with the highest accumulated reward). One of the key parameters for exploration/exploitation strategies such as ε-greedy is ε itself. Tuning this parameter requires an educated guess about the desired level of exploration and exploitation: a low value of ε lets the agent exploit more (with probability 1 − ε) and explore less, while a high value of ε lets the agent explore more and exploit less. In this study, ε = 0.2 is used, so that an agent exploits 80% of the time and explores 20% of the time.
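The 80/20 exploitation/exploration split implied by ε = 0.2 can be sanity-checked with a quick simulation (illustrative only; the seed and trial count are arbitrary choices):

```python
import random

random.seed(0)
epsilon = 0.2
trials = 100_000

# Count how often the ε-greedy coin flip lands on "explore".
explored = sum(1 for _ in range(trials) if random.random() < epsilon)
print(f"explored {explored / trials:.1%} of the time")  # close to 20%
```

Over many association rounds, the empirical exploration fraction converges to ε, leaving the remaining (1 − ε) of the rounds for exploiting the best-known GW.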

Simulation Setup
A simulation setup is developed in the MATLAB environment, where a grid of 100 m² is created to randomly place IoT GWs. Table 1 shows some of the important simulation parameters used for the performance evaluation. To evaluate the proposed association mechanism, this study performed several simulations in the 100 m² area with eight IoT GWs and 32 IoT devices placed randomly. A wireless network with a channel bandwidth of 20 megahertz (MHz) was used, with a transmission power of 20 decibel-milliwatts (dBm), a clear channel assessment (CCA) interference level of −82 dBm, and an RSSI level of −72 dBm. Fig. 3a shows an IoT environment with eight GWs randomly placed and 32 IoT devices using the standard RSSI-based mechanism to associate with the GWs in their range. The figure shows that a few of the GWs were unnecessarily overloaded, such as GWs 8, 6, and 1. This happens because every device tries to associate itself with the nearest GW offering the highest received signal strength. This also degrades overall network performance through wasted resources and increased collisions due to congestion. In addition, two of the GWs have no associated sensor devices due to the comparatively longer distance between them and the SDs. However, these GWs could serve nearby SDs to lessen the burden on other overloaded GWs. Further simulations were performed on a similar network to create an SD-GW association using the proposed MAB-based mechanism. Fig. 3b shows that GWs 5 and 4 were also associated with nearby sensors, which lessened the overload on GW 1 and GW 3. GW 6 is not highlighted as GWs 4-5 are; however, it also contributed to the more efficient association procedure.

Results and Discussion

In this paper, the proposed association mechanism is evaluated in terms of the enhanced average (mean) throughput (that is, the successfully received data rate). Fig. 4 shows the comparison of the normalized throughput of an IoT network system with a fixed (deterministic) association, an RSSI-based (standard) mechanism, a heuristic (best from the past/history) mechanism, and the proposed MAB-based mechanism. The results showed that the deterministic (fixed) association achieved the lowest normalized throughput due to its fixed associations between GWs and SDs. The proposed mechanism allows an IoT network system to explore and exploit the GWs and associate the SDs with less-loaded GWs. The heuristic algorithm came close to the proposed MAB-based mechanism due to its nature of utilizing the best experience from the past, which is somewhat similar to a MAB-based algorithm. In the case of an ML-enabled GW association, the network may also suffer from low throughput, mainly during the exploration period of the algorithm. During exploration, the proposed MAB-based algorithm searches for the best available GWs and may associate SDs with already overloaded GWs, resulting in lower network throughput. This study further evaluates the proposed MAB-based algorithm in terms of network satisfaction. The network satisfaction (S) is calculated as follows,

S = T*/T,

where T* is the number of times the network is satisfied out of T total iterations. In the above calculation, w determines whether the network is satisfied and is given by

w_i = 1 if τ_i ≥ τ_r, and w_i = 0 otherwise,

so that T* = Σ_i w_i. That is, the network is satisfied at the i-th iteration if the received data rate τ_i is greater than or equal to the required data rate τ_r. Fig. 5 gives a better representation of the performance of the proposed mechanism, showing how the proposed MAB-based mechanism achieves network satisfaction over the simulation duration. As shown in the figure, the RSSI-based mechanism remains constant due to its one-time association based on the strongest-signal strategy. Similarly, a deterministic approach (fixed association between SDs and GWs) may perform better than the RSSI-based approach because the association problem is fixed in advance. However, since wireless networks are usually dynamic and change rapidly, this limits the performance of a fixed association algorithm. On the other hand, the proposed MAB-based algorithm allows the network to become increasingly satisfied, even more so than the heuristic algorithm. The proposed MAB-based mechanism provides a higher average received data rate than the heuristic mechanism. One reason for this improvement is that the heuristic mechanism uses the association links with the highest experienced data rate, an instantaneous performance metric, whereas the proposed MAB-based mechanism exploits the association links with the highest accumulated reward, which identifies the optimal links over a longer period. This is also visible in Fig. 5, where the agents using the heuristic mechanism converged earlier and at a lower level than the proposed mechanism. Fig. 6 further illustrates this with box plots showing how the network devices reached their satisfaction level (4 Mbps in our case) using the different association mechanisms.
In the figure, it can be seen that the average successfully received data rate with the fixed algorithm remained very low compared to the required data rate. However, with the heuristic and MAB-based algorithms, the network was more satisfied; that is, it reached the required data rate.
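The network satisfaction metric described above (the fraction of iterations in which the achieved data rate meets the required rate) can be sketched as follows; the throughput values are hypothetical, not simulation outputs:

```python
def network_satisfaction(throughputs, required_rate):
    """S = T*/T: fraction of iterations in which the achieved
    data rate met the required rate (w_i = 1 if tau_i >= tau_r)."""
    w = [1 if tau >= required_rate else 0 for tau in throughputs]
    return sum(w) / len(throughputs)

# Hypothetical per-iteration throughputs (Mbps) against a 4 Mbps requirement
print(network_satisfaction([3.0, 4.2, 4.0, 3.9, 5.1], required_rate=4.0))  # 0.6
```

A mechanism that keeps re-associating SDs toward less-loaded GWs raises the number of satisfied iterations T*, and hence S, over the simulation duration.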

Conclusions
Today's IoT networks include a massive number of connected sensor devices. Each sensor device (SD) must associate itself with a nearby IoT gateway (GW) to transfer its collected sensory data. For this purpose, SDs are permitted to associate with any GW in the network, naturally choosing the one from which the highest signal power is received (RSSI), regardless of whether it is the ideal choice for network performance. Recently, ML techniques have seen rising use in wireless networks as a viable way to determine the effect of various models on the system, and reinforcement learning (RL) is one of those ML models. In this research work, we used an RL-based model for the association of SDs with nearby GWs in an IoT network. A multi-armed bandit solution was used to find the best GW with which to associate a sensor device. The convergence of the ε-greedy MAB-based mechanism was used to evaluate the proposed mechanism alongside the standard and related mechanisms. The evaluation results indicated that our intelligent MAB-based mechanism enhances the association compared to other approaches, and the normalized throughput results indicated that the proposed mechanism also enhances system performance.

Limitations and Future Works
One of the challenges that arises with MAB-based algorithms is the trade-off between exploration and exploitation. A network environment such as GW selection for IoT devices is highly dynamic and constantly changing. Therefore, an immediate limitation of the proposed mechanism is the need to select the ε-greedy parameter wisely. A higher value of ε (more exploration) may delay learning, while a low value of ε (more exploitation) leads to immediate exploitation of the environment with very little information. In future work, we aim to explore the educated guess for an ε value. Moreover, we also aim to work on other possible algorithms for exploration and exploitation, such as softmax and the upper confidence bound (UCB) [6].