Self-Adaptive Fault Recovery Mechanism Based on Task Migration Negotiation

Long Range Radio (LoRa) has become one of the most widely adopted Low-Power Wide Area Network (LPWAN) technologies in the power Internet of Things (PIoT). Its major advantages include long range, a large number of links and low power consumption. However, in LoRa-based PIoT, terminals are often deployed outdoors in remote areas and are easily affected by bad weather or disasters, which can lead to large-scale operation faults and seriously affect the normal operation of the network. At the same time, the wide coverage and large number of links of outdoor terminals sharply increase the difficulty and cost of fault recovery. Given this background, this paper proposes a self-adaptive fault recovery mechanism for PIoT terminals based on task migration negotiation. Firstly, based on the assessment of the terminal fault type and service category, a selection strategy for a candidate neighbor terminal or terminal set is studied. It handles fault recovery in two scenarios, terminals with the same data rate and terminals at the boundary where the data rate changes, while considering the adaptive characteristics of the LoRa data rate. Secondly, the adaptive terminal task migration negotiation mechanism is discussed. Then, a novel Terminal Fault Self-Adaptive Recovery (TFSR) algorithm is proposed. Simulation results show that, compared with the Genetic Algorithm (GA) and the Discrete Particle Swarm Optimization (DPSO) algorithm, the proposed algorithm maintains a higher fault recovery rate and a lower task recovery cost when faults are frequent.

technology, Long Range Radio (LoRa) has a wider propagation range than others under the same power consumption, so it is preferentially adopted in PIoT sensing networks.
The PIoT sensing network based on LoRa can deploy a large number of terminals over a wide area [6,7]. However, in complex and changing outdoor environments, sensing terminal faults occur frequently, which cannot be tolerated if the PIoT sensing network is to operate continuously and effectively. Since manual recovery is labor-intensive, a self-adaptive fault recovery mechanism for terminals, which saves time and resources, is regarded as a promising solution. LoRa suits this solution because the terminals are directly connected to the LoRa gateway, which can quickly detect faults and negotiate with other terminals to ensure the recovery efficiency of tasks and data.
The fault recovery mechanism of the LoRa Wide Area Network (LoRaWAN) is different from that of the Wireless Sensor Network (WSN). Considering the adaptive data rate mechanism of LoRa, the self-adaptive terminal fault recovery mechanism can be divided into two cases: same-rate recovery and variable-rate recovery. If terminals have the same data transmission rate, the scheme priority is calculated from the migration energy, the communication load and the number of sensor types. Then, the optimal neighbor terminal or terminal set can be selected. At the boundary where the terminal data transmission rate changes, the recovery of the fault terminal can either choose a higher-rate terminal or speed up a low-rate terminal. In this case, the rate is additionally considered in the priority calculation to balance lower latency against increased energy consumption.
Based on the above analysis, this paper proposes a self-adaptive terminal fault recovery mechanism based on task migration negotiation, with the aim of recovering from frequent faults of PIoT sensing networks in complex and variable environments, improving the self-adaptive fault recovery capability of the network and ensuring business continuity. Specifically, the main contributions of this paper are summarized as follows. First, a mechanism for judging terminal fault types and service categories based on the LoRa gateway is proposed. Gateways and servers periodically check data to determine whether data is missing or abnormal. Second, a self-adaptive terminal fault recovery algorithm is proposed to find the optimal neighboring terminal or terminal set to recover the IoT sensing task, taking into account energy consumption, communication resources, rate and the number of sensor types. Third, a task migration negotiation mechanism is formulated. The LoRa gateway performs task assignment and data migration by judging the candidates' actual status to realize the self-adaptive recovery of terminal faults.
The rest of this paper is organized as follows: Section 2 introduces related work. The self-adaptive fault recovery mechanism based on task migration negotiation is established in Section 3. Section 4 proposes the TFSR algorithm. Simulation results and analysis can be found in Section 5. Section 6 concludes this paper.

Related Work
In terms of LPWAN fault handling, current research mainly focuses on traditional sensor fault detection and recovery methods, with less research on fault recovery mechanisms that incorporate LPWAN characteristics. Reference [8] uses a LoRa wireless mesh topology for forest fire monitoring to reduce the transmission delay of forest fire information. Reference [9] proposes a LoRa mesh network system for wide-area monitoring in IoT applications, which improves the communication range and data transmission rate of the gateway. Reference [10] describes the composition and role of a typical LoRaWAN system, discusses the characteristics of LPWAN construction and demonstrates some of the advantages of LoRaWAN technology through a large number of network tests and application cases in different environments. Based on LoRa technology, Reference [11] uses star and chain networks for self-organizing network design and builds an intelligent meter reading system with long communication distance and resistance to multiple interference sources in a complex network environment.
At present, there is a wide variety of fault detection and recovery mechanisms for wireless sensor networks. The network in [12] is divided into virtual cell grids using a cellular architecture, and fault detection and recovery are performed within the grid with minimum energy consumption. Reference [13] assigns nodes corresponding credit ratings by calculating the difference between the predicted and measured values and proposes a fault recovery algorithm for opportunity credibility. There is also research on fault recovery through the gradient diffusion algorithm and the genetic algorithm [14,15], which aims to reduce the energy consumed for fault recovery. Reference [16] analyzes and compares similar WSN recovery algorithms mainly in terms of energy efficiency, scalability and network type. However, the above references all address terminal recovery strategies in the WSN scenario, while techniques for recovering terminal faults in LoRa networks remain scarce.
In terms of task negotiation, Reference [17] introduces a node sleep scheduling mechanism based on network coverage, which reduces the energy consumption of the network and ensures the monitoring range of the network. Reference [18] introduces the obvious advantages of LoRa over other LPWANs and elaborates on the concept and process of the LoRa adaptive rate mechanism. Reference [19], in its study of collaborative methods for wireless sensor networks, proposes network task allocation based on dynamic negotiation and combinatorial auction.

Terminal Fault Type and Service Category Judgment
The fault types of PIoT sensing terminals are mainly divided into two categories: sensor module faults, and faults of the communication module and other components. In the case of a sensor module fault, the terminal communication module and other components are considered normal, which means that the missing or abnormal data is caused by the sensor module. Hence, the corresponding type of data that needs to be recovered can be obtained. If other components such as the communication module fail, the terminal is considered unable to perform the data sensing function, and the data services supported by all sensor modules of the terminal need to be recovered.
The sensor module fault can be judged by missing or abnormal data from the terminal [20]. Missing data can be discovered by statistical methods over a fixed period, and abnormal data can be found by a comparative analysis of historical data. Faults of other components, such as communication modules, can be detected by the network server by checking whether any data is reported from that terminal within the same period.
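The judgment described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `classify_fault`, the dictionary layout and the three-sigma threshold `z_limit` are assumptions made for the example, not details given in this paper.

```python
from statistics import mean, pstdev


def classify_fault(expected_types, reports, history, z_limit=3.0):
    """Classify the state of one terminal for one reporting period.

    expected_types: sensor types the terminal should report each period.
    reports: dict type -> value received in this period (may be empty).
    history: dict type -> list of past values used as the reference.
    z_limit: hypothetical threshold for flagging a value as abnormal.
    """
    if not reports:
        # No data at all in the period: treat it as a communication-module
        # (or other component) fault, so every data service must be recovered.
        return "communication", sorted(expected_types)

    faulty = []
    for sensor_type in sorted(expected_types):
        if sensor_type not in reports:
            faulty.append(sensor_type)  # missing data
            continue
        past = history.get(sensor_type, [])
        if len(past) >= 2:
            mu, sigma = mean(past), pstdev(past)
            if sigma > 0 and abs(reports[sensor_type] - mu) > z_limit * sigma:
                faulty.append(sensor_type)  # abnormal compared with history
    return ("sensor", faulty) if faulty else ("none", [])
```

A sensor fault returns only the affected data types, while a silent terminal returns all of its types, which mirrors the two recovery scopes described in this section.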

Candidate Neighbor Terminal or Terminal Set Selection
After determining the type of data services supported by the fault sensing terminals that need to be recovered, this paper classifies the recovery scheme into single neighbor terminal recovery or neighbor terminal set recovery according to the number of candidate terminals. The optimal recovery method is selected by comparing different candidate solutions. The first task in fault recovery is to select a candidate set of recovery terminals. Under the condition that the terminal can be recovered, the credibility $\theta_j$ of the candidate terminal $device_j$ is used to filter the set of candidate terminals, with $\theta_0$ being the lower bound of the credibility of a candidate terminal.
The credibility of terminal $device_j$ is calculated by Eq. (1), where $\mu_j^t$ and $\sigma_j^t$ are the mean and variance of the nearest k data at the current moment, and $\mu_j^{t-1}$ and $\sigma_j^{t-1}$ are the mean and variance of the nearest k data at the previous moment. $\tau$ and $\varepsilon$ are the thresholds for the change in variance and mean, respectively. They determine whether the change in variance and mean between the two moments increases or decreases the credibility value. $\partial$ is the magnitude of the increase or decrease. The initial credibility of all terminals is set to 1, and the credibility $\theta_j$ of terminal $device_j$ is calculated iteratively according to Eq. (1).
When a single terminal can recover multiple types of tasks, the same selection method is adopted as for the recovery of a single type of sensor data task. When more than one terminal is needed, the terminal set should contain as few terminals as possible. The credibility $\theta_{s_i}$ of the terminal set is calculated as follows:
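The credibility filtering step can be illustrated with a small Python sketch. The exact form of Eq. (1) is not reproduced here, so the update rule below is an assumed stand-in: credibility rises by a fixed step when both the mean and the variance of the latest window stay within their thresholds, and falls by the same step otherwise. All function and parameter names are hypothetical.

```python
from statistics import mean, pvariance


def update_credibility(credibility, prev_window, curr_window,
                       var_threshold, mean_threshold, step):
    """One assumed credibility update from two consecutive data windows."""
    mean_change = abs(mean(curr_window) - mean(prev_window))
    var_change = abs(pvariance(curr_window) - pvariance(prev_window))
    # Assumption: stable statistics raise credibility, unstable ones lower it.
    stable = var_change <= var_threshold and mean_change <= mean_threshold
    credibility += step if stable else -step
    # Keep the value in [0, 1]; the initial credibility is 1.
    return min(1.0, max(0.0, credibility))


def filter_candidates(credibilities, lower_bound):
    """Keep only terminals whose credibility reaches the lower bound."""
    return [device for device, value in credibilities.items()
            if value >= lower_bound]
```

Under this assumed rule, a terminal whose readings jump between windows gradually drops below the lower bound and is excluded from the candidate set.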

Same Spreading Factor
The model with the same spreading factor (SF) is shown in Fig. 1. Each self-organized network contains a monitored point P and several sensing terminals T. The sensing terminals include normal working terminals, fault terminals and dormant terminals. The dotted line in the network indicates the neighbor relationship between terminals in the network. Since all the terminals have the same data transmission rate, only energy consumption, the number of sensor types and communication load are taken into consideration when calculating the priority of the candidate terminals.
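As a rough illustration of a same-SF priority calculation, the sketch below ranks candidates by a weighted cost over the three factors named above. The weighted-sum form, the equal default weights and the assumption that every input is already normalized to [0, 1] are choices made for this example only, not the model of this paper.

```python
def rank_same_sf(candidates, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Rank candidate terminals, best first.

    candidates: dict name -> (energy_cost, sensor_type_ratio, comm_load),
                each assumed to be normalized to [0, 1].
    weights: hypothetical normalization coefficients for the three factors.
    """
    w_energy, w_types, w_load = weights

    def cost(item):
        energy, type_ratio, load = item[1]
        # Lower energy and load are better; covering more sensor types is
        # better, so the uncovered share (1 - ratio) is what gets penalized.
        return w_energy * energy + w_types * (1 - type_ratio) + w_load * load

    return [name for name, _ in sorted(candidates.items(), key=cost)]
```

For example, a lightly loaded neighbor that carries every required sensor type ranks ahead of a cheaper neighbor that covers only half of them when the weights are equal.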

Boundary of Spreading Factor
Considering the adaptive rate mechanism of LoRa, sensing terminals at the boundary can reduce the communication delay by speeding up the data rate, but it also brings additional energy consumption to the terminal. Therefore, how to balance the pros and cons is also a problem in the optimization model. The model is shown in Fig. 2.
When recovering a fault terminal, the following information about the neighbor terminal $device_j$ needs to be considered: the energy $e_{i,j}$ required to recover the terminal fault, the proportion of sensor types $M_j$ of $device_j$, the communication load $T_j$ of $device_j$, and the percentage increase in data transmission rate $P_j$. Based on this information, the recovery priority $\omega_j$ of the candidate neighbor terminal is calculated. The selection optimization model is constructed as follows:
LoRaWAN mainly uses the 125 kHz signal bandwidth. The relationship between data rate and SF is given by Eq. (5), where BW represents the signal bandwidth and CR represents the coding rate.
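To make the rate trade-off concrete, the following Python sketch uses the commonly cited LoRa bit-rate relation, bit rate = SF × (BW / 2^SF) × CR. It is assumed here that the data rate function of this section takes this standard form. The default coding rate of 4/5 and the helper `rate_increase` are illustrative assumptions.

```python
def lora_bit_rate(sf, bandwidth_hz=125_000, coding_rate=4 / 5):
    """Nominal LoRa bit rate in bit/s for spreading factor sf."""
    # Each symbol carries sf bits and lasts 2**sf / BW seconds;
    # the coding rate removes the forward-error-correction overhead.
    return sf * (bandwidth_hz / 2 ** sf) * coding_rate


def rate_increase(old_sf, new_sf, **kwargs):
    """Relative data-rate gain when a terminal moves from old_sf to new_sf."""
    old_rate = lora_bit_rate(old_sf, **kwargs)
    new_rate = lora_bit_rate(new_sf, **kwargs)
    return (new_rate - old_rate) / old_rate
```

With these defaults, lowering the SF of a boundary terminal from 9 to 8 raises its nominal bit rate from about 1758 bit/s to 3125 bit/s, an increase of roughly 78%, which is the kind of gain that has to be weighed against the extra energy consumption.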

Task Migration Negotiation
The task migration negotiation mechanism aims to adaptively assign tasks to the neighboring terminals of the fault terminal. This process is implemented through the LoRa gateway in the self-organizing network, as shown in Fig. 3.
When the sensor module or communication module of the terminal fails, the LoRa gateway will analyze the data uploaded by the terminal, find abnormality and report the terminal fault information to the server. The server will analyze the candidate neighbor terminals, select the optimal result by the algorithm and send the recovery solution to the LoRa gateway. Then, the LoRa gateway sends the task migration command to the candidate terminal or the terminal set.
LoRa usually uses Over-The-Air Activation (OTAA) as the terminal access method when a dormant terminal must be activated to recover a task. Once the terminal intends to apply for access to the network, it sends an access request and waits for approval from the server. In this way, the dormant terminal is activated and joins the network. When the terminal needs to speed up, the LoRa gateway sends a speed change command to the terminal to change the data rate by modifying the SF.
The LoRa gateway negotiates according to the real-time status of the terminal. When the selected terminal is insufficient to support the recovery task due to sudden fault or insufficient energy, etc., the terminal reports back to the LoRa gateway. The LoRa gateway will notify the server that the task assignment has failed. At this time, the server needs to reassign normal terminals for fault recovery or directly migrate the task using the suboptimal solution.
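The fallback behavior of the negotiation can be sketched as a simple loop over ranked recovery solutions. This is a minimal sketch: the function name `negotiate` and the `can_accept` callback, which stands in for the real-time status report a terminal sends back to the gateway, are assumptions for illustration.

```python
def negotiate(ranked_solutions, can_accept):
    """Try recovery solutions in priority order until one is accepted.

    ranked_solutions: list of solutions, best first; each solution is a
                      list of terminal identifiers (one terminal or a set).
    can_accept: hypothetical callback returning True when a terminal reports
                that it can take over the task (alive, enough energy).
    Returns the accepted solution and the number of failed assignments.
    """
    for failures, solution in enumerate(ranked_solutions):
        # Every terminal of the solution must confirm before tasks migrate.
        if all(can_accept(terminal) for terminal in solution):
            return solution, failures
    return None, len(ranked_solutions)
```

Each failed assignment corresponds to one extra round of reassignment, which is where the additional recovery cost discussed in the simulation section comes from.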

Our Proposed TFSR Algorithm
In order to find the optimal recovery solution that can recover the data tasks of the fault terminal, the recovery process is divided into two stages: dynamic coefficient adjustment and network fault recovery. The information from the fault terminal $device_i$ and the neighboring terminal $device_j$ is used to find the candidate set whose credibility is higher than the lower bound. The normalization coefficients vary dynamically according to the cost of recovering from terminal faults using various resources. The dynamic normalization coefficients are used to calculate the current fault recovery rate, and the iteration stops when the new fault recovery rate rNew is less than the old rate rPre or the maximum number of iterations is reached. Then, the optimal recovery scheme for the terminal fault is calculated using the optimal coefficients.
$N_i$ in Eq. (7) is the number of sensors in the terminal. The variables of the terminal set are calculated by Eq. (8) over $device_j, device_k \in S_{device_i}$.
After calculating the values of the variables in the current candidate set, the variables that account for the higher cost of fault recovery can be identified. The average cost avg can be calculated as follows:
For the optimal selection strategy described above, a fixed normalization coefficient will cause insufficient generalization of the model. The normalization coefficients are therefore dynamically adjusted according to the actual proportion of energy consumption, the number of sensor types, the communication load and the percentage increase in data transmission rate in the network. The coefficient i can be iterated as follows:
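The stopping rule of the first stage can be sketched as follows. Only the loop structure (stop when rNew falls below rPre or when the iteration limit is reached) comes from the text. The update rule, which raises the coefficient of any resource whose cost is above the average and lowers the others before renormalizing, is an assumption for illustration, as are the callback names `costs_for` and `recovery_rate`.

```python
def tune_coefficients(coeffs, costs_for, recovery_rate,
                      alpha=0.02, max_iter=10):
    """Iteratively adjust normalization coefficients (assumed update rule).

    coeffs: initial coefficients, one per resource, summing to 1.
    costs_for: hypothetical callback, coefficients -> per-resource cost list.
    recovery_rate: hypothetical callback, coefficients -> fault recovery rate.
    """
    best = list(coeffs)
    r_pre = recovery_rate(best)
    for _ in range(max_iter):
        costs = costs_for(best)
        avg = sum(costs) / len(costs)
        # Assumption: resources that cost more than average gain weight.
        trial = [c + alpha if cost > avg else max(c - alpha, 0.0)
                 for c, cost in zip(best, costs)]
        total = sum(trial)
        if total == 0:
            break
        trial = [c / total for c in trial]  # renormalize to sum to 1
        r_new = recovery_rate(trial)
        if r_new < r_pre:
            break  # the new rate is worse: keep the previous coefficients
        best, r_pre = trial, r_new
    return best, r_pre
```

The defaults `alpha=0.02` and `max_iter=10` match the coefficient of variation and the iteration limit used in the simulation settings.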

Simulation Results
In this paper, MATLAB is used for simulation. Within a range of 100 m × 400 m, 1200 terminals are uniformly deployed to simulate the LoRa network at the SF boundary. The left half of the terminals have SF = 8 and the right half have SF = 9. Within a range of 400 m × 400 m, 5000 terminals are deployed uniformly at random to simulate the LoRa network with the same transmission rate. The dormant terminals account for about 10%. A first-order energy consumption model is adopted and the initial energy is between 0 J and 1 J. The communication load is between 0 and 1. Each terminal contains 1 to 3 kinds of sensors to monitor the environment. Terminals within 10 m of the fault terminal are identified as neighbors. Terminals within 50 m of the SF boundary are set as rate-adjustable terminals. The coefficient of variation is $\alpha = 0.02$. The maximum number of iterations is set to 10.
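For readers who want to reproduce a comparable setup outside MATLAB, the SF-boundary scenario can be approximated in Python as sketched below. The orientation of the area (400 m wide, 100 m high, with the SF boundary at the middle of the width), the field names and the fixed random seed are assumptions of this sketch, not settings stated in the paper.

```python
import math
import random


def deploy(n, width, height, seed=0, dormant_ratio=0.10):
    """Place n terminals uniformly at random and assign their attributes."""
    rng = random.Random(seed)
    terminals = []
    for i in range(n):
        x, y = rng.uniform(0, width), rng.uniform(0, height)
        terminals.append({
            "id": i, "x": x, "y": y,
            "sf": 8 if x < width / 2 else 9,           # left half SF 8, right half SF 9
            "adjustable": abs(x - width / 2) <= 50.0,  # within 50 m of the SF boundary
            "dormant": rng.random() < dormant_ratio,   # about 10% dormant
            "energy": rng.uniform(0.0, 1.0),           # initial energy in joules
            "load": rng.uniform(0.0, 1.0),             # communication load
            "sensor_types": rng.randint(1, 3),         # 1 to 3 kinds of sensors
        })
    return terminals


def neighbors(terminals, fault, radius=10.0):
    """Terminals within the given radius (in metres) of the fault terminal."""
    origin = (fault["x"], fault["y"])
    return [t for t in terminals
            if t["id"] != fault["id"]
            and math.dist((t["x"], t["y"]), origin) <= radius]
```

Calling `deploy(1200, 400.0, 100.0)` gives the boundary scenario, and `deploy(5000, 400.0, 400.0)` gives the positions for the same-rate scenario, in which the `sf` and `adjustable` fields would simply be ignored.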
Figs. 4 and 6 show the fault recovery rate with different initialization coefficients at the SF boundary and the same SF case, respectively. With the increase in the number of fault terminals, the fault recovery rate gradually decreases. It can be seen that different initialization coefficients have an impact on the final result. Reassigning tasks after an assignment fails will result in additional recovery costs. It can be seen from Figs. 5 and 7 that with the increase of the number of fault terminals, the cost of fault recovery gradually increases. Since some terminals are assigned multiple recovery tasks at the same time, it is difficult for the terminals to complete these tasks. The total cost in Fig. 5 is relatively high because the addition of the rate coefficient leads to an increasing number of alternatives.
Figs. 8 and 9 compare the TFSR algorithm with the GA and DPSO algorithms in terms of fault recovery rate and additional recovery cost. It can be seen that the TFSR algorithm maintains a higher fault recovery rate and a lower extra recovery cost. The TFSR algorithm reduces the consumption of network resources while maintaining the normal operation of the network. The GA and DPSO algorithms do not perform as well as the TFSR algorithm.

Conclusion
A large number of terminals in the PIoT sensing network face the problems of frequent faults and high manual recovery cost. Therefore, it is necessary to recover the fault data tasks through the neighbors of the fault terminals. This paper proposes a self-adaptive terminal fault recovery mechanism for PIoT based on task migration negotiation. After the LoRa gateway detects the fault terminal, the proposed TFSR algorithm solves the problem of self-adaptive fault recovery. Finally, the fault data tasks are allocated through task negotiation. Simulation results show that the TFSR algorithm is superior to the GA and DPSO algorithms in the fault recovery rate and can maintain a lower fault recovery cost.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.