FASDQ: Fault-Tolerant Adaptive Scheduling with Dynamic QoS-Awareness in Edge Containers for Delay-Sensitive Tasks

As the requirement for real-time data analysis increases, edge computing is being implemented to leverage the resources of edge devices to reduce system response times and decrease the latency. However, due to the resource constraints of edge clouds, edge servers are more prone to failures than other systems. Therefore, guaranteeing the reliability of services in edge clouds is critical. In this paper, we propose a fault-tolerant adaptive scheduling mechanism with dynamic quality of service (QoS) awareness (FASDQ), which extends the primary/backup (PB) model by applying QoS on demand to task copies. The aim of the method is to reduce the latency and achieve reliable service for tasks by changing the execution time of task copies. This paper also proposes a container resource-adaptive adjustment mechanism, which adjusts the timing of resources when the available resources cannot meet the task copy requirements. Finally, this paper reports the results of simulation experiments on the EdgeCloudSim platform to evaluate the difference in performance between FASDQ and other methods. The results show that the mechanism effectively reduces the execution time of task copies and outperforms other methods in terms of reliability and general resource utilization.


Introduction
The goal of edge computing [1] is to obtain real-time responses and provide storage and computation at the edge of the network. This process extends cloud resources to the edge of the network and reduces the network access latency. Although edge computing has many advantages, due to the increased functionality and real-time requirements of smart devices, the chances of resource failures in edge computing are not low, which can reduce the availability of resources and affect the response latency. In smart car networking, smart manufacturing, driverless and other smart devices, and latency-sensitive applications, delays due to reliability issues must be taken seriously. Few previous works have deeply studied the fault tolerance of edge clouds, and tasks are often resubmitted to available resources after they fail, which may lead to unfulfilled service level agreements (SLA) due to the delays encountered by requests [2]. Fault-tolerant scheduling is also widely used as an effective way to assure reliability. However, the current literature on fault-tolerant task scheduling does not sufficiently consider the varying requirements of different tasks for quality of service (QoS) so that edge resources can be allocated as needed. Therefore, the ability to guarantee the reliable completion of latency-sensitive tasks within time constraints and the effective utilization of edge resources is a key issue.
Due to the long startup time of traditional virtual machines and their tendency to waste resources, many systems and studies have used container technology to build the underlying technology stack of edge computing platforms based on the reliability and bandwidth limitations of edge networks [3]. For example, Google launched "the Anthos for Telecom" [4], which will bring Anthos to the edge network by packaging its applications into containers that can run at the edge, providing an immersive experience to customers.

1.
For latency-sensitive tasks, we propose a dynamic QoS-aware fault-tolerant task scheduling mechanism, which extends the traditional PB model by introducing different reliability requirements for tasks; i.e., when scheduling task copies, their QoS levels are dynamically adjusted according to their reliability requirements and latency constraints.

2.
After the primary succeeds, the resources and time interval occupied by the backup are released. In order to make full use of the released resources, the start time of task copies that are waiting in the queue will be reselected to improve resource utilization at the container level.

3.
To reduce the resource pressure exerted by the PB model at the edge, we propose a container resource-adaptive adjustment mechanism for the resource quota of each edge container, which replenishes resources according to their usage by the assigned task copies, as well as the latency constraints and reliable service requirements of unexecuted task copies.
The rest of this paper is organized as follows: Section 2 briefly analyzes the problem and related work, and Section 3 provides a systematic analysis. Section 4 describes the FASDQ solution, the primary/backup scheduling algorithm, and the container resourceadaptive adjustment algorithm. Section 5 reports and analyzes the experimental results, and conclusions are drawn at the end of the paper.

Related Work
Among fault-tolerant approaches, the PB model has been extended to remote clouds in several studies, in which each task has a primary copy and backup copy, and fault-tolerant scheduling is performed in two different computing instances [12]. Primary copies are used when tasks are executed normally, and backup copies are executed when the primary copies fail in order to ensure that the task requests are completed within the specified deadline [13]: this is known as a passive backup scheme [14]. However, the passive backup scheme requires the task to have a sufficient time delay to start the backup copy, which is unrealistic for delay-sensitive tasks. In contrast, an active backup scheme [15] can complete tasks with less slack. The authors of [16] proposed a backup overlay in which task copies are overlapped in the same time slot of the same computation instance, and a release technique, which refers to releasing the resources of the corresponding backup copy after the primary copy succeeds. In [17], two backup schemes were analyzed, and the fault-tolerant elastic scheduling algorithm (FESTAL) was designed for real-time tasks in a virtual cloud environment. This method balances fault tolerance and high resource utilization in the cloud.
Various approaches have been proposed and studied to deal with fault tolerance in cloud computing systems. In edge networks, unlike cloud computing environments, the occurrence of failures may cause a large portion of edge-side resources to go offline, resulting in delays and affecting the overall QoS level. To address the failure of edge servers, redundancy-based replication techniques have also been studied for fault tolerance in edge computing. The research work in [18] provided fault tolerance for the Internet of Things (IoT) by modeling the failure behavior of the Internet grid storage area network (SAN) to provide fault tolerance through redundancy to reduce system downtime. In [19], the PB model was extended to the edge cloud, and a fault-tolerant based QoS-aware scheduling algorithm (FTBQA) was proposed to improve the reliability of edge cloud application services. DeCoRAM middleware [20] tolerates processor failures by considering the execution time and failover order of primary and backup copies in order to meet the soft real-time requirements of each task. To achieve fault tolerance, the authors of [21] proposed a reliable and fault-tolerant agent-based hierarchical IoT-cloud architecture by replicating data at the edge of the network, which redirects the backup server when the service fails to apply. The system in [22] adaptively replicates data according to application requirements for data loss and latency. The replication operation is deferred to some extent, and the data loss and latency requirements are satisfied. In [23], the same input sequence was enforced for virtual machines (VM) replicated across hosts, enabling fast multicore scalable VM fault tolerance. NiLiCon [24] provides seamless failover from a failed container to a backup container of another host, providing fault tolerance to protected applications and external clients. Replication-based fault-tolerant scheduling techniques have been extensively studied in edge cloud environments, but they can also place a burden on resource-limited edge servers to provide timely service to task copies.
Optimizing resource utilization on a fault-tolerant basis is a challenge due to the cost of coordinating reliability safeguards, especially since replication-based fault-tolerance techniques suffer from relatively high resource consumption. Currently, edge computing is still in its infancy in terms of resource management. The authors of [25] developed a computing resource management algorithm for multi-user mobile edge computing (MEC) systems. This algorithm minimizes the average power consumption of mobile devices and MEC servers. Another study [26] proposed a node resource reallocation mechanism based on the characteristics of containers; this method calculates the resource quota for each task in the subsequent phases of task execution and reallocates CPU (central processing unit) resources in the node, making it possible to increase the number of tasks processed in a certain time cycle and effectively reducing the task latency. To some extent, these resource management techniques allocate resources to tasks and effectively improve system resource utilization, but none of them involve fault tolerance. Thus, they cannot support fault tolerance when optimizing resource management.
In summary, for resource-limited edge computing, it is crucial to realize the faulttolerance capability of the system while effectively utilizing the limited resources at the edge. We find that existing research lacks studies in this area. In this paper, we propose a fault-tolerant adaptive scheduling mechanism with dynamic QoS awareness, which extends the PB model by determining the QoS level for task copies according to reliable service and latency requirements. In addition, during scheduling, FASDQ adaptively adjusts resources according to the resource usage to meet the demands of task copies. This mechanism achieves fault tolerance, accelerates the execution time of tasks, and improves the efficiency of resource use.

Problem Analysis
Traditional cloud-based industrial Internet of Things (IIoT) architectures struggle to meet requirements such as low latency and high reliability. For example, traditional telecom networks are unable to meet industry-specific requirements for reliable, predictable, and efficient communications. With low latency, scalability, and reduced bandwidth, edge containers are suited for serving real-time or latency-sensitive applications. In 2019, Edgegap, a disruptive gaming startup, released a gaming platform that reduced latency by 58% using real-time telemetry and edge containers [27]. However, edge containers are deployed close to users, their resources are constrained by the environment, and they have higher uncertainty than traditional cloud containers, so reliability assurance for edge containers is critical.
Fault-tolerant scheduling is an effective method to guarantee reliability and is based on replication techniques. The PB model is widely used for fault-tolerant scheduling. In order to reduce the extra resource consumption of backups, the authors of [16] studied a backup overlapping technique in which backups are able to overlap in the same time slots of the same compute instances. However, since the actual compute instances of the edge containers are containers, it is only the primary copies of tasks that cannot be scheduled in the same edge node, whereas their backup copies can overlap in the same container. Another typical technique of the PB model is the reclaiming of resources for backup copies when corresponding primary copies succeed [10]. If the time slot can be utilized after the backup is released, it greatly reduces the latency.
In addition, to meet the strict QoS requirements imposed by latency-sensitive tasks, the method in this paper improves the reliability of task copies by dynamically assigning them QoS levels when task scheduling is performed. Regardless of whether the fault-tolerant technique is based on replication or the assignment of QoS levels to tasks to guarantee reliability, we find that: (1) while processing task requests, task copies are always gathered in containers with higher processing capability, resulting in an uneven distribution of idle and busy containers (in Figure 1, containers C 12 and C 21 have more task copies, while container C 31 is idle); and (2) the existing resources cannot meet the demand of some task copies. Most previous studies have focused on migrating containers as a whole, which has some effect on resource tuning in edge nodes, but they have ignored the effective utilization of resources within individual containers. Therefore, our method makes more adaptive adjustments to resources at the container and node levels separately. When the resources of a task's backup are released, the copies waiting in that container are rescheduled so that the container can handle more task copies; if the current resources cannot meet the task demand, nearby edge nodes that are in the shutdown state will be assessed to determine whether they can be turned on. Furthermore, if some resources are idle for a long time during system operation, the resource adjustment mechanism will eliminate the idle resources and thus improve resource utilization. In the method described in this paper, fault-tolerant adaptive scheduling is performed for latency-sensitive tasks in the edge container, extending the PB model by determining QoS levels for the reliability assurance of task copies. Then, the on-demand adjustment of edge container resources occurs according to resource usage during task scheduling. The approach proposed in this paper not only achieves a tradeoff between the real-time requirement and reliability of tasks, but also avoids reducing resource utilization due to an unbalanced load on edge container resources.

System Model
In Figure 2, edge users can access the edge cloud through wireless networks and submit requests to it. Each edge server includes two modules: the Task Scheduler and the Resource Manager. The Task Scheduler determines whether the edge container can satisfy the requested latency constraint based on the arrival order of requests, and if the request is accepted, the Task Scheduler transmits it to the Resource Manager, which receives tasks and allocates resources for task requests. Before introducing relevant concepts, we first define the parameters and symbols that are used in this paper, as shown in Table 1.
Relevant concepts and definitions are given below. There is a set EN = {EN 1 , EN 2 , · · · , EN n } of edge nodes around devices, where each edge node has fixed computing resources, including CPU, RAM (random access memory), bandwidth, etc. The processing capability P i (MIPS, million instructions per second) is used to determine the type of computing resources of the edge nodes. Each node EN i includes multiple containers denoted by a set of containers c i = {c i1 , c i2 , · · · , c im } with different processing capability P ij .
We use the set T = {t 1 , t 2 , · · · , t n } of latency-sensitive tasks in end devices. Each task has three attributes, t i = {a i , d i , l i }: arrival time a i , deadline d i , and task size l i , which is measured in millions of instructions (MI).
Each task is replicated into two copies: t P i and t B i , which are available in different nodes so that the backup can take over execution in another node if the primary fails.
The execution time of t P i and tiB and the unit is second.
The processing capability and the remaining processing capability of EN i (unit: MIPS).
The processing capability and the remaining processing capability of c ij (unit: MIPS).
Indicates whether t * i is assigned to the container successfully.

Definition 1.
Backup states state t B i : the backup is divided into two standby states according to its start time and the completion time of its primary (active (represented by 1) and passive (represented by 0)), as shown in Equation (1).
If the completion time of the primary copy is later than the start time of its backup copy, the backup state is set to 1; otherwise, it is set to 0. Active backups do not stop until the primary is completed, while passive backups start only after the primary fails. As shown in Figure 3, t B 1 is passive, but t B 2 is active, and the part that is redundant with the primary is denoted by t BR 2 . In container c kl , the execution time e kl t * i of t * i (* denotes two copies of tasks: e.g., t * i represents t P i and t B i ) is the ratio of its task size over the processing capability of this container [28,29]: In this paper, Equation (2) is improved to reduce task latency (see Equations (3) and (5)). The execution time of task copies is changed by assigning different QoS levels that provide different processing capability to different task copies on demand, so the primary copy can be completed as early as possible within the time constraint. In the container c kl , the QoS level of t * i is denoted by q t * i ; P kl q t * i is the processing capability of t * i within q t * i . The two copies of a task can be assigned different QoS levels to improve task schedulability. Their execution times and constraint conditions are described below.

Definition 2. The execution time of t P
i . This is defined as the ratio of the task copy size over the processing capability of a container with the QoS level q t P i , as shown in Equation (3).
In order to meet the delay requirements of primary copies, the following constraints should also be met: The execution time of t B i . According to the backup state of tasks, the execution time of the backup copy depends on whether its primary t P i succeeds. Assuming that the backup t B i is scheduled in the container c st (not in the edge node of t P i ), then: where η * ij denotes whether task copies succeed on EN j . If so, η * ij = 1; otherwise, η * ij = 0. Equation (5) is explained as follows: 1.
If t P i succeeds and state t B i = 0, then t B i is not executed; 2.
If t P i succeeds and state t B i = 1, then the execution If t P i fails, then the execution time of .
In order to meet the delay requirements of tasks and reduce the time redundancy between the active backup and its primary copy, the following conditions should also be met: When a task copy is successfully assigned to an edge container, the remaining processing capability of the container should be calculated first.

Definition 4.
Remaining processing capability. This is defined as the number of instructions processed by the remaining resources of the edge container per second, and the unit is MIPS. Therefore, the remaining processing capability of c kl and EN l are shown in Equations (7) and (8).
Maximizing the average QoS level of task copies in the edge container under fault tolerance is one of the scheduling objectives of this paper.

Definition 5.
The average QoS level. This is obtained as the ratio of all QoS levels of all successful tasks to the number of all successful tasks [30], denoted by AQL. If t * i is successfully assigned to a container, then z * i = 1; otherwise, z * i = 0.
The main goal of task scheduling in this paper is to make the edge container accommodate more task copies while adopting the PB model for fault tolerance. In the next section, we provide the solution according to the system model.

Solution: FASDQ Framework and Key Algorithms
We designed the FASDQ framework shown in Figure 4, which consists of two main parts: the primary/backup scheduling mechanism and the resource-adaptive adjustment mechanism. When a task request is received at the edge, the task is first replicated into two copies by the Copy Controller, and the task copies are assigned different QoS levels by the QoS Controller to change their execution times according to different reliability quality requirements to determine the optimal scheduling slot for the copies. The task copies in the queue are rescheduled by the Resource Controller after the resources of the backups are released so that the primary copies of the tasks can start as early as possible. In addition to dynamic QoS-aware task copy scheduling, the FASDQ framework also includes a container resource-adaptive adjustment mechanism, which adaptively replenishes resources for tasks with insufficient processing capability during task execution to improve resource utilization in the edge node. When task copies cannot be scheduled at the edge, the tasks are handed over to the cloud for processing.
The following introduces the dynamic QoS-aware primary/backup scheduling mechanism and the container resource-adaptive adjustment mechanism at the edge.

Dynamic QoS-Aware Primary/Backup Scheduling Mechanism
In this paper, we consider a case in which the traditional PB model application is extended at the edge container side. FASDQ uses a backup overlapping technique to reduce redundancy in the task scheduling process and determines the schedulable interval of tasks in the edge container with dynamic QoS-awareness to find the best time interval for scheduling. This process is represented by the scheduling function.
According to the analysis in [17], when the primary copies of multiple tasks are not scheduled in the same nodes, their backups in the same virtual machine can overlap. In this paper, we also apply this rule to our scheduler mechanism. In order to ensure that no conflicts occur during the execution of task copies and achieve fault tolerance, we allow backup copies to overlap in time and space when scheduling two copies of a task. Task copies in the same container are subject to the following constraints: (1) the primary copy cannot overlap with other copies; (2) the redundant part of the active backup copy cannot overlap with other copies; (3) for the passive backup copy, if EN t P i = EN(t P j ), then their backups cannot overlap.
The goal of the function scheduler schedulable t P i , c jk is to designate schedulable time slots for t P i in c jk . The schedulable time slots need to be chosen so that t P i starts as early as possible before its deadline in [a i , d i ]; the unschedulable time slots are those that are occupied by other primary/backup copies. Therefore, we traverse the schedulable time slot in [a i , d i ] from left to right to find the earliest start time eq jk t P i of t P i on c jk that satisfies Equation (4); otherwise, return +∞.
The goal of the function scheduler schedulable t B i , c st is to designate schedulable time slots for t B i in c st . If t P i is scheduled successfully, then the scheduling of t B i will start. The schedulable time slots need to be chosen so that t B i starts as late as possible within its deadline in [a i , d i ], and this slot is not part of EN t P i ; the unschedulable time slots of t B i are the time slots occupied by its primary copies, the time slot occupied by the backup copy before f P i , and the redundant parts of active backup copies. As shown in Figure 5, (6); otherwise, return −∞. Algorithm 1 describes the process of dynamic QoS-aware task scheduling for the primary copies in order to assign QoS levels to primary copies on demand to guarantee reliability. The main function of Algorithm 1 is arranging the processing capability of the edge nodes in descending order so that the primary copies are first assigned to the edge container with a higher processing capacity (lines 1-3). Then, it finds the earliest start time of t P i by schedulable t P i , c jk and a container satisfied time constraint in Equation (4); if t P i is successfully scheduled, the label flag is set to be 1, otherwise, it is set to 0 (lines [5][6][7][8][9][10][11][12][13][14]; if failed, the QoS level will be degraded by 1 until it becomes q 1 , and the above operation will be repeated (line 15). If it is still not found, then the function resAdapt() is called (see Algorithm 3) (lines 17-23). The time complexity of Algorithm 1 depends on the number of tasks K. Suppose initially that there are L active nodes with N containers of each node in the neighborhood of the user, and let the QoS level of the task copies be a random integer in (1, m). m and L are negligible in comparison with N and K. Thus, the time complexity of schedulable t P i , c jk is O(NK). Then, Algorithm 1 comes to the function resAdapt(). We will analyze its time complexity in Section 4.2.
If t P i is successfully scheduled, the scheduling of t B i starts. Algorithm 2 describes the process of dynamic QoS-aware task scheduling for the backup. The main function of Algorithm 2 is sorting the edge nodes in descending order of remaining processing capability, except for the nodes in which its primary copy is located (lines 1-3). Then, the latest start time of t B i that satisfies the time constraint in Equation (6) is found by schedulable t B i , c lk in a container and the backup states are marked. If a container is successfully selected, the label flag is set to 1, otherwise, it is set to 0 (lines [5][6][7][8][9][10][11][12][13][14][15][16][17][18]. If failed, the QoS level will be degraded by 1 until it becomes q 1 , and the above operation will be repeated (line 19); if it is still not found, then the function resAdapt() is called (see Algorithm 3) (lines 22-28).

Algorithm 1. Primary Scheduling in FASDQ
Input: The set of tasks: T; QoS level: Q = {q 1 , q 2, · · · , q m }; the set of ENs: EN = {EN 1 , EN 2 , · · · EN n }. Output: c t P i // The container that can be scheduled for t P i with the earliest start time.

Begin:
1 for each new task t i ∈ T, do 2 f lag = 0; est P = +∞;c t P i = null;q t P i = q m ; 3 Sort ENs in descending order of processing capability; 4 while q t P i = q 1 , do // The QoS level of t P i is initialized to the maximum value within time constraints. 5 for each c jk ∈ EN j , do 6 Set the processing capability P jk q t P i ; 7 eq kl t P i = l i P kl (q(t P i )) ; 8 est P i = schedulable t P i , c jk ; 9 if est P i ≤ est P and eq jk t P i ≤ d i − est P i , then 10 //Select the container with the earliest completion time of the primary. 11 c t P i = c jk ; est P = est P i ; f lag = Assuming that the number of the successfully scheduled primary is X, the time complexity of schedulable t B i , c lk , in the worst case, is O(NX). Then, Algorithm 2 comes to the function resAdapt(). We will analyze its time complexity in Section 4.2.
When the primary copy of the task succeeds and the corresponding backup copy is released, the vacant time slot can be used by other copies in the same container. If the primary can start earlier and the backup can start later within the time constraint, the redundant part of the active backup can be reduced, and the time performance of the primary copy that has not yet been executed is improved. Therefore, a new start time is chosen for the other copies in that edge container. The selection process is shown in Figure 6, where σ B i is defined as the time required to release the backup. Figure 6a shows the arrangement of each task copy in the container before t p 1 succeeds. Figure 6b: After t p 1 succeeds and t B 1 is released, then schedulable t P 3 , c 21 determines whether t p 3 can be scheduled in f P 1 − σ B 1 , s B 2 . If so, the new earliest start time nest P 3 of t P 3 is returned, and state t B 3 is changed. In this example, state t B 3 is changed from 1 to 0, reducing energy consumption. Figure 6c: schedulable t B 2 , c 21 determines whether t B 2 can be scheduled in s B 2 , f P 3 , and then the new latest start time nlst p 3 of t B 2 is updated; this can also change state t B 2 from 1 to 0, which can effectively reduce the redundancy. After the primary succeeds, the resources and time interval occupied by the backup are released. The rescheduling of task copies can result in the earlier completion of the primary copy within the time constraints, and the backup can start later, which effectively reduces the task delay.

Algorithm 2. Backup Scheduling in FASDQ
Input: The set of tasks T p whose primary copy is successfully assigned to a container, c t P i ; Q = {q 1 , q 2, · · · , q m }; the neighboring edge nodes of task t i : EN(t i ) ⊆ EN. Output: c t B i //After the primary is successfully scheduled in a container, output the container for which the backup is scheduled.
Sort EN − EN t P i in descending order of the remaining processing capability; 4 while q t B i = q 1 , do //The QoS level of t B i is initialized to the maximum value within time constraints. 5 for each EN l ⊆ EN − EN t P i , do 6 for each c lk ∈ EN l , do 7 Set the processing capability P lk q t B i ; 8 Calculate eq lk t B i using Equation (5)

Container Resource-Adaptive Adjustment Mechanism
In the above fault-tolerant scheduling process, the copy rescheduling mechanism eases the resource utilization of one container. In this subsection, we present the optimization of resource utilization across edge nodes, which means that edge containers can accept as many tasks as possible. However, if the edge container accepts multiple tasks over a period of time, which may lead to insufficient available resources for some tasks, then it will not be able to meet the latency constraint of task requests. To address the above situation, this paper proposes an elastic adjustment mechanism for container resources: i.e., when the current computational resources cannot meet the demand of task copies, the mechanism can: (1) add containers dynamically, (2) balance their workload with other containers, and (3) migrate containers to another edge node with sufficient processing capability. This process is shown by resAdapt().
Migration constraints. To ensure fault tolerance when migrating containers across edge nodes, the following constraints need to be satisfied during container migration: (1) c t B i cannot be migrated to the edge node of t P i , nor can c t P i be migrated to the same edge node of t B i . Otherwise, two copies of the task are in the same edge node and it cannot be fault-tolerant.
(2) If there is overlap between t B i and t B j , c t P i cannot be migrated to EN t P j . The newly created container has a different processing capability and should meet conditions in terms of time: where t(c new ) denotes the creation time of the new container, and θ new is the ready time of the task in the new container. resAdapt() is a container resource-adaptive adjustment algorithm, as shown in Algorithm 3.

1.
The set {EN(t i )} of running edge nodes near the user is sorted in ascending order of remaining processing capability p rem i , and these nodes are traversed in turn and judged to be able to create new containers (processing capacity is denoted by p(c new )) to accommodate the task copy (lines 1-7). If there is no suitable running node, some containers are migrated to a nearby running node to make space to create a new container (lines 8-11).

2.
If the creation of containers in the running node still cannot meet the scheduling demand of task copies, an unpowered node near the task copy (processing capacity is indicated by p(EN new )) starts up, and a new suitable container in this node is created (lines 13-18).
We now evaluate the time complexity of Algorithm 3. Firstly, it takes O(L) to create a new container to an active node, and it takes O(L) to migrate a container to another destination node. Hence, the time complexity of adding a new container by migration (see lines 8-12) is O(L 2 ). For creating a new container to EN new , the time complexity is O(1). Therefore, the complexity of resAdapt() is O(L + L 2 + 1) = O(L 2 ).
In conclusion, the time complexity of primary scheduling in FASDQ is O(NK + L 2 ), and the time complexity of backup scheduling in FASDQ is O(NX + L 2 ). If else return null; 20 end for END

Formatting of Mathematical Components
This section describes simulation experiments for latency-sensitive tasks in FASDQ on the EdgeCloudSim [31] platform, which is capable of modeling edge computing environments and provides a modular architecture that can support various key functions. Since the FASDQ proposed in this paper is mainly in an edge environment, the experimental framework was designed at the edge, as shown in Figure 7. In this study, Kubernetes v1.8.3 was extended and deployed on the Centos 7.0 system. The experiments used fault injection to interrupt the execution of task copies. Four types of edge nodes and four types of containers were set up in EdgeCloudSim, and the processing capability of each edge node was set to 1000, 1500, 2000, and 3000 MIPS. The processing capability of the containers was 250, 500, 700, and 1000 MIPS, respectively, to evaluate the performance of FASDQ.
We want the edge containers to process task requests as quickly as possible. Therefore, we implemented the Task Scheduler and Resource Manager modules in the edge server for edge containers, which enable the containers to run a larger number of tasks in parallel.
In this paper, we compare FASDQ with the following solutions: (1) Fault-tolerant elastic scheduling algorithm (FESTAL) [17]: it can be added or removed certain virtual machines depending on the load state to optimize resource utilization while supporting fault tolerance in the cloud. In terms of using virtualization technology, the approach in this paper uses the more lightweight container virtualization technology. This experiment applied FESTAL to the edge side for comparative experiments to verify the effectiveness of FASDQ in terms of latency and resource utilization.
(2) Fault-tolerance based QoS-aware scheduling algorithm (FTBQA) [19]: adjusts the QoS level of tasks and has an adjustment mechanism, which is invoked when the backup copy of a task in a compute node is deleted after the corresponding primary copy succeeds. To study the performance under a more realistic scenario, we used an open dataset, YFCC100M [32], provided by Yahoo in 2014. YFCC100M contains 100 million media objects. There are approximately 0.8 million videos and 99.2 million images. Each object in YFCC100M contains several metadata, such as media source, owner name, camera, title, tags, and location [33]. We filtered the multimedia contents of 300 videos and 300 photos in the YFCC100M dataset. Multimedia content, namely, the number of tasks, is a random number in [5000, 400, 000]. At a given time, some of the multimedia devices are active, and the average number of active devices is 320. The distance of the edge base station was set to 1000 m, and a hexagonal structure was adopted to depict the range of the geographical location [34].
In order to more clearly observe the impact of different parameters on performance, experiments based on random synthetic workloads were conducted. The data traffic generated from each source point is related to total active users and is transmitted to the edge computing system as packets. The packet size is a random number in [34,6550] and the unit is Byte. The average packet arrival rate is one packet per second per node. The bandwidth capacity between the users and the edge network is 1 Gbps [35].
Assume that the task arrivals at the edge server follow a Poisson distribution and that the average interval of task arrivals is 1/λ, uniformly distributed in 1 λ , 1 λ + 2 [17]. The task size is uniformly distributed in 1 × 10 5 , 2 × 10 5 MI with a deadline of d i = a i + max eq t * i + basedeadline [36], and basedeadline is the benchmark deadline, which is a random positive real number in {100, 140, · · · , 420} in units of s, indicating the range of task delays. The edge server failure rate is uniformly distributed with a time unit of 10 −7 /h [14].
In each set of experiments, only one parameter was changed, and the others were left unchanged. Table 2 shows the parameters and their values. In this experiment, the effectiveness of fault-tolerant scheduling of primary copies was evaluated by the task guarantee rate. The guarantee rate (GR) [17] is defined as the percentage of total tasks that are guaranteed to be completed successfully among all submitted tasks.
(1) The task arrival interval 1/λ value range is from 0 to 16 with a step of 2 s. The value is inversely proportional to the workload per unit time. Other parameters remain unchanged.
(2) Base deadline is set from 100 to 420 with a step of 40 s. The higher the value of the base deadline, the more relaxed the task deadline constraint. The other parameters remain unchanged. Figure 8a shows the variation in the guarantee rate with the task arrival rate of the five algorithms. The arrival rate decreases as the task volume increases, but all have higher guarantee rate (GR) values. Because the task arrival rate decreases, the workload decreases and the resources are more sufficient, so the GR value increases as the arrival rate decreases. Compared with FASDQ, NONRAD, and NONBOL, NONRAD has the lowest GR value because it does not use the resource-adaptive adjustment mechanism proposed in this paper, and when the task volume is large, the current resources cannot meet the demand of the task copy. NONRAD cannot expand resources on demand, resulting in tasks that do not start on time, thus missing deadlines. When a large number of task requests occur at the same time, FASDQ has a better effect on GR. This is because FASDQ can confirm the QoS level of task copies to reduce their execution time, and the resource adjustment mechanism can add new resources adaptively. FASDQ is thus able to handle more tasks under the same conditions. This is in line with our expectations.  Figure 8b shows the effect of the base deadline on the GR value. The data show that base deadline has a greater effect on the GR value because when the task is urgent, the system does not have time to adjust the available resources; therefore, task copies that arrive later are not able to meet the deadline. For several methods, when the latency is relatively lenient, the variation in GR values is usually stable because they implement fault tolerance and use overlapping technology. In particular, FASDQ can provide different QoS levels to tasks according to their different urgency levels to guarantee reliable service. Moreover, FASDQ extends the fault tolerance mechanism by considering the dynamic resource adjustment technique when resources are insufficient. Therefore, FASDQ performs better in terms of GR value compared with the other methods. Figure 8c shows that the guarantee rate decreases as the number of tasks increases. Since failures are uniformly distributed, the number of tasks increases, leading to more failures, but all methods maintain a high guarantee rate, except for FTBQA. When the number of tasks increases, the mechanism can handle new requests by adjusting resources; while FTBQA has no resource-adaptive adjustment, it can guarantee a high GR because if there are no resources in the edge cloud to meet the task requirements, the task will be considered for scheduling in a remote cloud data center. With the increase in the number of tasks, FASDQ's GR value remains above 94%, which proves the effectiveness of the proposed method once again.
In the proposed method, the latency is reduced by providing a QoS level to change the execution time of task copies, allowing the system to accommodate more task copies, so the effect of task latency can be better measured by the average reduced time. Figure 9 shows the results for the average reduction in time for the five methods. Both FASDQ and NONBOL, which adaptively adjust resources, significantly reduce the task latency with time constraints between 180 and 420 s, while the reduction in time is not significant between 100 and 180 s. This is due to the resource-adaptive adjustment mechanism, which is optimized to reduce resource consumption and improve resource utilization, resulting in the increased data-processing capability of edge servers and thus reduced task completion time. The difference in the latency results between FESTAL and FASDQ shows that providing QoS levels to tasks also improves their execution time. However, for tasks with short latency constraints, FASDQ cannot significantly reduce the task latency because of insufficient resources for adjusting the urgent-task system. Because FASDQ dynamically assigns the QoS level to copies during scheduling, this experiment also evaluated the average QoS level (AQL) on the basis of 1/λ, base deadline, and the number of tasks. AQL is the average QoS level of all accepted tasks, and a higher AQL indicates a higher reliability performance of the framework. Let q t * i denote the QoS level of task copies t * i with 0 ≤ q t * i ≤ 1 taking a value of [0, 0.1, 0.2, · · · , 0.9, 1]. Figure 10a shows the variation in AQL as the task arrival rate decreases. As the task arrival rate decreases, the burden is reduced, the competition for resources decreases, and the task copies can be assigned higher QoS levels. Since FASDQ, NONRAD, NONBOL, and FTBQA can dynamically adjust the QoS level according to the demand of the copies when scheduling tasks, its average QoS level is higher than that of FESTAL. Compared with NONRAD and NONBOL, FASDQ has a higher average QoS level, which indicates that the combined application of the proposed method provides a higher QoS level for task copies, enabling the edge containers to handle more tasks in the same time interval.  Figure 10b shows that the more relaxed the task deadline, the better the scheduling of task copies, and the task copies can receive more reliable service guarantees. Therefore, the QoS levels are improved in all five methods. The AQL of FESTAL fluctuates more and is smaller than those of the other methods because it schedules without providing QoS levels to the copies on demand. As shown, the AQL of FASDQ can also exceed 0.6 for urgent task deadlines. This is because FASDQ can adaptively assign different QoS levels according to the urgency of the task. The more urgent the task, the higher the QoS level, and the more resources it consumes, resulting in fewer tasks being processed. However, FASDQ uses a resource adjustment mechanism, so its average QoS level can be higher than those in the other methods. Figure 10c depicts the decreases in AQL as the number of tasks increases. As shown, when the number of tasks is between 5000 and 40,000, FASDQ can obtain a higher AQL than the other methods. The continuous increase in the number of tasks with a fixed number of available resources causes the average computational resources per task to become increasingly scarce, and the task copies obtain a lower QoS level. The results of comparing NONBOL and NONRAD show that the resource-adaptive adjustment mechanism proposed in this paper strongly improves the scheduling effect.

Resource Consumption Evaluation
In this subsection, we verify the effect of the number of tasks on the total active time (HAT) of the edge nodes and the ratio of the service time of tasks to the active time of the edge node (RTH). The number of tasks in this experiment increased from 5000 to 40,000 with a step of 5000. HAT reflects the resource consumption of the system, RTH reflects the resource utilization of the system, and RTH is defined as follows: {TET/HAT} (11) where TET denotes the total execution time of the task. Figure 11a shows that as the number of tasks increases, the HAT of the five algorithms maintains an increasing trend. The HAT is worse in both FTBQA and NONRAD, but for different reasons. FTBQA does not use container technology, and when there is a lack of computing resources, it needs to increase the edge server resources to meet task requirements, which results in significant resource consumption. The HAT is worse in NONRAD because there is no resource adjustment mechanism to replenish the processing capability for task copies in a timely manner, which also further illustrates the predominant role of resource-adaptive adjustment in fault-tolerant scheduling. When the number of tasks increases, FASDQ can migrate containers with a lower processing capability in the current candidate nodes to free up resources. This allows the candidate nodes to create new containers, thus avoiding the need to add or start new nodes and saving node resources. Figure 11. Comparison of resource utilization. (a) Impact of task count on HAT (the total active time of the edge nodes); (b) impact of task count on RTH (the ratio of the task service time to the active time of the edge node). Figure 11b depicts the variation in RTH performance as the number of tasks increases for the five methods. The RTH performance of NONRAD and FTBQA is poor when the number of tasks increases: when the tasks increase substantially, the system is burdened, and more active nodes must be added to the processing capability, resulting in greater resource consumption. The RTH of NONRAD outperforms that of FTBQA by 8.6%, indicating that the overlapping technique can effectively use the available resources to accommodate more tasks and avoid a significant increase in resources, effectively improving resource utilization. Compared with the other methods, the RTH of FASDQ is higher and relatively stable because FASDQ provides better QoS of reliability for task copies and reduces the execution time of tasks; therefore, FASDQ can always process more task copies in a given time, which effectively improves resource utilization.
In summary, the FASDQ mechanism extends the PB model to achieve fault tolerance by providing QoS levels for task copies, which effectively reduces task latency. From the evaluation of AQL, the FASDQ mechanism clearly provides higher QoS levels for latencysensitive tasks and performs better in terms of reliability requirements and scheduling. As illustrated by the variation in HAT and RTH values, FASDQ can effectively improve resource utilization in the edge node and container levels by implementing container resource-adaptive adjustment in terms of consumption and saving resources.

Conclusions and Future Work
This paper introduces FASDQ, a fault-tolerant adaptive scheduling mechanism with dynamic QoS-awareness that guarantees reliable service for latency-sensitive tasks and improves the resource utilization of edge containers. FASDQ extends the PB model and proposes a dynamic QoS-aware task scheduling mechanism to achieve fault tolerance, which dynamically provides QoS levels to task copies according to the different latency requirements of tasks to change the execution time, so the edge containers can accommodate more task copies with a given processing capability while being fault-tolerant. In addition, FASDQ implements a resource-adaptive adjustment mechanism to address the uneven resource distribution and insufficient processing capability of tasks during execution, which effectively replenishes resources for tasks with insufficient processing capacity and reduces resource consumption at the node and container levels. The experimental results show that the method can significantly reduce the execution time of each task copy by about 12%, effectively reducing the task latency and improving system resource utilization.
In this paper, we do not consider task scheduling enough when the tasks are uploaded to the cloud after the failure of edge container scheduling, and we will further extend the fault-tolerant adaptive scheduling mechanism between cloud edges in the future so that the mechanism can tolerate multiple edge server failures. Furthermore, this paper addresses relatively static user requests, such as monitoring devices that are deployed in fixed locations. In realistic scenarios, such as 5G systems, there is a need to support mobility for mobile users. If the corresponding contents, which are located in edge containers, could be migrated to the place close to mobile users as they move, the QoS of reliability will be improved during the movement of the mobile users and the communication delay will be reduced between mobile users and the nodes. However, we did not delve into the task requests of the mobile users, which are common in realistic environments. This is our research focus for future studies.