A Novel Coevolutionary Approach to Reliability Guaranteed Multi-Workflow Scheduling upon Edge Computing Infrastructures

Software Theory and Technology Chongqing Key Lab, Chongqing University, Chongqing, China
School of Mathematics, Kunming University of Science and Technology, Kunming, Yunnan 650500, China
School of Computer and Software Engineering, Xihua University, Chengdu, Sichuan 610065, China
School of Computer and Information Engineering, Jiangxi Normal University, Nanchang 330022, China
Shanghai Jiaotong University Chongqing Research Institute, Chongqing 401121, China
Chongqing Animal Husbandry Techniques Extension Center, Chongqing 401121, China


Introduction
Edge computing is an evolving computing paradigm that offers a more efficient alternative: data is processed and analyzed closer to the point where it is created. It enables a computation-as-a-service model and provides a proximity-based, mobility-aware model of virtualized resources provisioned on demand [1,2]. Edge service providers are equipped with computational facilities, which allow them to provide the capacity required by commercial and noncommercial users.
Recently, as novel bioinspired and genetic algorithms become increasingly versatile and powerful, a great deal of research effort has been devoted to applying them to edge-environment-oriented workflow scheduling problems [9][10][11]. However, it remains a great challenge to develop efficient scheduling algorithms with good scheduling performance, a low service-level-agreement (SLA) violation rate, and high user-perceived quality of service.
In this paper, we propose a novel edge-environment-based multi-workflow scheduling approach that leverages a multi-workflow reliability estimation model and a preference-inspired coevolutionary algorithm, PICEA-g, to yield scheduling decisions. We show through simulative studies that our proposed method clearly outperforms traditional ones in terms of multiple metrics.

Related Work.
It is widely believed that arranging multitask business processes or workflows upon distributed nodes or computing resources with Quality of Service (QoS) constraints, e.g., reliability, is an NP-hard problem [12,13]. It is therefore extremely time-consuming to yield optimal schedules through traversal-based algorithms. Fortunately, heuristic and metaheuristic strategies with polynomial complexity are capable of producing approximate or near-optimal solutions at the cost of acceptable optimality loss.
For example, Wang et al. [14] proposed a look-ahead genetic algorithm (LAGA), which utilized reliability-based reputation scores for optimizing the makespan and the reliability of a workflow application. Wen et al. [15] aimed at solving the problem of deploying workflow applications over federated clouds while meeting reliability, security, and cost requirements. Wu et al. [16] proposed a soft-error-aware and energy-efficient task scheduling method for workflow applications in DVFS-enabled cloud infrastructures under reliability and completion-time constraints. Cao et al. [17] proposed a soft-error-aware VM selection and task scheduling approach to minimize the execution cost of cloud workflows under makespan, reliability, and memory constraints while considering soft errors in cloud data centers. Garg et al. [18] proposed a new scheduling algorithm, the reliability and energy-efficient workflow scheduling algorithm, which jointly optimized the lifetime reliability of an application and its energy consumption while guaranteeing the user-specified QoS constraint. Nik et al. [19] proposed a scheduling approach comprising four algorithms for minimizing the workflow execution cost while also meeting user-specified deadline and reliability constraints.
To minimize the overall error probability in a multiserver mobile edge computing (MEC) network, where wireless data transmission/offloading was carried by finite blocklength (FBL) codes, Zhu et al. [20] characterized the FBL reliability of the transmission phase, investigated the extreme event of queue-length violation in the computation phase by applying extreme value theory, and provided an optimal framework for deciding time allocation and server selection. Peng et al. [8] proposed a novel method to evaluate resource reliability in the mobile edge computing environment and addressed the workflow scheduling problem by using a krill-herd-based algorithm. Kouloumpris et al. [21] considered an architecture consisting of an edge node, an intermediate node (hub), and the cloud infrastructure, and then used a mathematical-programming-based framework to derive an application-reliability-optimal task allocation subject to multiple operational constraints. Wang et al. [22] developed a reinforcement-learning-based approach to multi-workflow scheduling. However, they considered the centralized cloud environment as the underlying infrastructure and thus ignored the overhead of inter-edge-node data transmission. For a similar optimization objective, Wang et al. [23] and Saeedi et al. [24] employed an immune-based PSO algorithm for scheduling workflows over centralized clouds.

System
Architecture. An edge computing system usually consists of an edge computing agent (ECA) and multiple edge servers. The edge computing agent manages all resources, and each edge server owns several virtual machines (VMs), each of which can usually handle one workflow task that a user offloads at a time. An edge server usually has limited capacity for storage and computation. Due to the requirements of signal strength and channel stability, as illustrated in Figure 1, it is usually assumed that an edge server covers a limited circular range, and thus users can only offload their tasks to the edge servers reachable within such coverage ranges.
As can be seen in Figure 2, instead of considering monolithic task configurations, we consider that user requests can be structured and that process-like requests can be expressed as workflows with different constructs. A workflow is a directed acyclic graph (DAG), G = (T, E), where T = {t_1, t_2, ..., t_n} denotes the task set, E is the set of edges between tasks, and e_ij = (t_i, t_j) is a precedence constraint indicating that t_i is a predecessor task of t_j. The notations used in this paper are shown in Table 1.
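As a minimal illustration, a workflow DAG of this form can be held as a predecessor map; the task names and the topological-sort helper below are hypothetical, not part of the paper.

```python
# Hypothetical sketch: a workflow DAG G = (T, E) with precedence edges.
# Task names and structure are illustrative only.
workflow = {
    "t1": [],           # t1 has no predecessors (entry task)
    "t2": ["t1"],       # edge (t1, t2): t1 must finish before t2 starts
    "t3": ["t1"],
    "t4": ["t2", "t3"], # t4 joins the two parallel branches
}

def predecessors(task):
    """Return pred(t): all tasks that must complete before `task` starts."""
    return workflow[task]

def topological_order(dag):
    """Order tasks so every task appears after all of its predecessors."""
    order, visited = [], set()
    def visit(t):
        if t in visited:
            return
        visited.add(t)
        for p in dag[t]:
            visit(p)
        order.append(t)
    for t in dag:
        visit(t)
    return order
```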

Problem Formulation.
In engineering, reliability is the probability that a system or component performs its required functions under the stated conditions and with dependable outcomes. Guaranteeing the reliability of computing systems and applications is a challenging problem because faults are hard to avoid: hardware failures, software bugs, transient faults, devices operating at high temperature, and so on. The reliability of edge-environment-based multi-workflows is further complicated by the fact that structured, process-based task flows are more susceptible to various types of faults, especially transmission errors and faults occurring when wireless communications between edge nodes and users are required. As shown in Figure 3, the reliability of a workflow is usually structure dependent:

R = ∏_{j=1}^{n_α} r_j for a sequential routing, R = ∏_{j=1}^{n_β} r_j for a parallel routing, and R = Σ_{j=1}^{n_γ} λ_j r_j for a selective routing,

where n_α denotes the number of tasks in a sequential routing, n_β is the number of tasks succeeded by a split point in a parallel routing, and n_γ is that of a selective routing, respectively. For a task executed on the edge server p, its reliability can be estimated as its success rate of execution, i.e., the probability that its time-to-failure (TTF_p) exceeds its completion time:

r_j = Pr(TTF_p > FT(T_pk)),

where x_pk = 1 if VM_pk is selected for the task, and x_pk = 0 otherwise. To estimate the monetary cost of workflows, we first have to estimate the cost for renting server p:

C_p = C^r_p × max_k[FT(T_pk)],

where max_k[FT(T_pk)] is the completion time of the task-execution queue on VM_pk and C^r_p is the charge per unit time for renting server p. The transmission time for the task i can be estimated as

Δ_i = Δ^ul_i + Δ^dl_i + Δ^bh_i,

where Δ^ul_i is the uplink time, Δ^dl_i is the downlink time, and Δ^bh_i is the backhaul link time. According to [26,27], Δ^bh_i can be infinitesimal, and the downlink time Δ^dl_i can usually be taken as a constant ξ. Therefore, Δ_i can be expressed as

Δ_i = d_i/ω_p + ξ,

where ω_p is decided by the distance between the task (user) and the server; as the distance increases, the bit error rate increases and the average transmission speed decreases [27].
Moreover, ω_p indicates the averaged bandwidth of the server p, and d_i is the data size of task i. Denoting the transmission price per unit time of the server p as C^t_p, the transmission fee can be estimated as

C^tr_i = C^t_p × Δ_i.

Based on the described system configuration, the problem we are interested in is thus, for given proximity constraints of server-user communications and given deadlines, how to schedule workflows with higher reliability and lower cost. The resulting formulation is subject to

ST(T_ij) ≥ max FT(T_il), T_il ∈ pred(T_ij) and l ∈ {1, ..., n_i}.
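To make the estimates above concrete, the following is a minimal sketch, assuming (as one common model, not stated by the paper) exponentially distributed time-to-failure with mean MTTF_p; all function names and values are illustrative, not the paper's implementation.

```python
import math

def task_reliability(finish_time, mttf):
    """P(TTF > FT) for a server whose time-to-failure is assumed
    exponentially distributed with mean `mttf` (an illustrative model)."""
    return math.exp(-finish_time / mttf)

def sequential_reliability(task_rels):
    """A sequential (or parallel) routing succeeds only if every task succeeds."""
    r = 1.0
    for x in task_rels:
        r *= x
    return r

def selective_reliability(branches):
    """A selective routing: sum of lambda_j * r_j over the branches,
    where lambda_j is the selection probability of branch j."""
    return sum(lam * r for lam, r in branches)

def rental_cost(queue_finish_times, charge_per_unit):
    """C^r_p times the latest completion time over the server's VM queues."""
    return charge_per_unit * max(queue_finish_times)

def transmission_time(data_size, bandwidth, xi):
    """Uplink time d_i / omega_p plus the (roughly constant) downlink
    time xi; the backhaul time is taken as negligible."""
    return data_size / bandwidth + xi
```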

Preference-Inspired Coevolutionary Algorithms Using
Goal Vectors. It has long been known that preference-based approaches are useful for generating trade-off surfaces in the objective subspaces of interest to the decision maker. Wang et al. [28] offered one realization of such an approach, named the preference-inspired coevolutionary algorithm using goal vectors (PICEA-g), which has been shown to outperform four other best-in-class multiobjective evolutionary algorithms, i.e., NSGA-II, MOEA, HypE, and MOEA/D. PICEA-g is a coevolutionary approach in which the usual population of candidate solutions is coevolved with a set of goal vectors during the search. In this algorithm, the optimality of candidate solutions is decided by a Pareto-dominance model. To be specific, a family of goal vectors and a population of candidate solutions coevolve during the search process. A candidate solution gains fitness by meeting a set of goal vectors in the objective space, but the fitness contribution must be shared with the other solutions satisfying those goal vectors. Goal vectors only gain fitness by being satisfied by a candidate solution, and their fitness is reduced the more often a goal is satisfied by other solutions in the population. Ultimately, the population of candidate solutions and the goal vectors coevolve toward the Pareto optimal front. The fitness F_s of a candidate solution s and the fitness F_g of a preference g can be calculated by (9)-(11) as follows:

F_s = Σ_{g satisfied by s} (1/n_g),   (9)

where n_g denotes the number of solutions that satisfy preference g. In this formulation, when s fails to satisfy any g, the fitness F_s is defined as 0. And,

F_g = 1/(1 + α),   (10)

α = (n_g − 1)/(N_S − 1) if n_g ≥ 1, and α = 1 if n_g = 0,   (11)

where N_S is the population size of candidate solutions. A (μ + λ) elitist framework is usually used for implementing the above model, as shown in Figure 4. As can be seen, a population of N_S candidate solutions and a set of N_G preferences, denoted by S and G, respectively, are evolved for a fixed number of generations, maxGen. In each generation t, genetic variation operators are applied to the parents S(t) to produce N_S offspring, Sc(t). Meanwhile, N_G new goal vectors, Gc(t), are randomly regenerated within the predefined bounds. Then, S(t) and Sc(t), as well as G(t) and Gc(t), are pooled, respectively, whereafter the combined populations are sorted according to fitness. Finally, truncation selection is applied to select the best N_S candidate solutions and N_G goal vectors as the new populations, S(t + 1) and G(t + 1).

Table 1: Notations used in this paper.

Notation | Description
N | The total number of workflows
M | The total number of edge servers
R_i | The reliability of the workflow i
λ_j | The selection probability for a task
x_pk | A boolean variable indicating whether VM_pk is selected for a task
ST(T_pk) | The start time of the task on VM_pk
MTTF_p | The mean time-to-failure of the server p
D(W_i) | User-defined deadline of the workflow i
pred(T_ij) | All predecessor tasks of T_ij in the workflow i
cov_p | The coverage area of the server p
n_i | The total number of tasks in the workflow i
m_p | The total number of virtual machines in the server p
r_j | The success rate of the task j
T_pk | The task executed on VM_pk
t_pk | The execution time of the task on VM_pk
FT(T_pk) | The completion time of the task on VM_pk
prior(T_pk) | The prior task of T_pk in the execution queue of VM_pk
T(W_i) | The finish time of the workflow i
dist_ip | The distance between the server p and device i
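A minimal sketch of the (μ + λ) elitist coevolution loop described in this section, assuming placeholder variation and fitness operators (the operator signatures and the number of goal vectors are illustrative, not the paper's):

```python
import random

def picea_g(init_solutions, goal_bounds, max_gen, vary, fitness):
    """Coevolve solutions S and goal vectors G for max_gen generations.
    `vary` produces one offspring from a parent; `fitness` returns the
    joint fitness lists (F_s, F_g) for pooled solutions and goals."""
    n_s = len(init_solutions)
    n_g = 4  # illustrative number of goal vectors
    S = list(init_solutions)
    G = [tuple(random.uniform(lo, hi) for lo, hi in goal_bounds)
         for _ in range(n_g)]
    for _ in range(max_gen):
        Sc = [vary(s) for s in S]                        # N_S offspring
        Gc = [tuple(random.uniform(lo, hi) for lo, hi in goal_bounds)
              for _ in range(n_g)]                       # N_G new goals
        pool_s, pool_g = S + Sc, G + Gc                  # pool parents+offspring
        fs, fg = fitness(pool_s, pool_g)                 # joint fitness
        # truncation selection: keep the best N_S solutions and N_G goals
        S = [s for _, s in sorted(zip(fs, pool_s),
                                  key=lambda p: -p[0])][:n_s]
        G = [g for _, g in sorted(zip(fg, pool_g),
                                  key=lambda p: -p[0])][:n_g]
    return S, G
```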

4.2.
Encoding. For a workflow application, a chromosome is a data structure in which a scheduling solution is encoded. We use a two-dimensional string to represent a scheduling solution: one dimension represents the index of resources, which depicts the task-resource mapping, while the other denotes the order between tasks. As illustrated in Figure 5, in this solution there are tasks from three workflows, namely, w_1, w_2, and w_3, which are assigned to virtual machines on two edge servers. For instance, VM_21 executes four tasks in the processing sequence t_12 ⟶ t_13 ⟶ t_21 ⟶ t_24.
The decoding scheme is the reverse of encoding.
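As an illustration, the two-dimensional encoding can be sketched as a task-resource mapping plus per-VM task orders, using the Figure 5 example for VM_21; the remaining task and VM names are assumptions:

```python
# Dimension 1: task -> VM (the task-resource mapping).
mapping = {"t12": "VM21", "t13": "VM21", "t21": "VM21", "t24": "VM21",
           "t11": "VM11"}

# Dimension 2: the execution order of the tasks on each VM.
order = {"VM21": ["t12", "t13", "t21", "t24"], "VM11": ["t11"]}

def decode(mapping, order):
    """Decoding reverses the encoding: recover, per VM, its task queue
    and check that it agrees with the task-resource mapping."""
    queues = {}
    for vm, tasks in order.items():
        assert all(mapping[t] == vm for t in tasks)
        queues[vm] = list(tasks)
    return queues
```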

Initialization.
Two constraints are applied to generate uniformly feasible chromosomes, improving the quality of the initial population while accelerating the convergence rate: the topological constraint and the proximity constraint. Based on these constraints, the initial population is generated as follows: (1) First, each workflow is converted into a task list T = {t_0, ..., t_{j−1}, t_j} by topological sorting. (2) Second, a resource r_i from R = {r_0, ..., r_{i−1}, r_i} is selected as the computing resource (VM) only if r_i is available for t_j; then, t_j is assigned to that VM. (3) The above steps are repeated until all workflow tasks are assigned; then, a chromosome is generated.
When the population size reaches the defined value, the initialization process stops. The initial goal vectors are randomly generated as objective vectors in the objective space within predefined bounds. In practice, the bounds are estimated via preliminary single-objective optimizations.
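A hedged sketch of this constrained initialization, where `reachable` stands in for the proximity (coverage-range) check and the workflows are given as topologically sorted task lists; the helper names are assumptions:

```python
import random

def init_chromosome(workflows, vms, reachable, rng):
    """workflows: list of topologically sorted task lists;
    reachable(task, vm) -> bool encodes the proximity constraint."""
    mapping = {}
    for task_list in workflows:               # topological constraint
        for t in task_list:
            feasible = [v for v in vms if reachable(t, v)]
            mapping[t] = rng.choice(feasible)  # proximity constraint
    return mapping

def init_population(size, workflows, vms, reachable):
    """Generate chromosomes until the population reaches `size`."""
    rng = random.Random(42)
    return [init_chromosome(workflows, vms, reachable, rng)
            for _ in range(size)]
```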

Population Update.
The iterative update of the population consists of the discrete steps described below, repeated until the termination condition is satisfied.

Genetic Variation.
The genetic variation changes the workflow task-allocation information to maintain diversity in the population. In our proposed genetic variation operation, a solution is mutated intelligently based on a resource-priority heuristic. To generate a promising offspring solution, Dongarra et al. [29] have proven that the resource with the minimal product Γ_p of certain key performance indicators should have a higher priority to be selected in the scheduling. Hence, we let 1/Γ_p indicate the priority of the server p. The genetic variation operation randomly selects one task in the solution and reassigns it to an available server with a higher priority. In the example shown in Figure 6(b), task t_8 is originally scheduled to VM_4, whose priority is 3. Thus, the genetic variation reassigns it to VM_1, which has a higher priority of 4.
According to the precedence constraint, we insert t_8 into the position behind t_6, as shown in Figure 6(c).
Simultaneously, N_G new preference vectors, Gc(t), are randomly regenerated based on the initial bounds.
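The priority-guided mutation can be sketched as follows, with the Figure 6 priorities (VM_1: 4, VM_4: 3) used as illustrative values; `feasible_vms` is an assumed placeholder for the availability check:

```python
import random

def mutate(mapping, priority, feasible_vms, rng=random.Random(0)):
    """mapping: task -> VM; priority: VM -> its 1/Gamma score;
    feasible_vms(task) -> the VMs the task may be assigned to.
    Moves one random task to a feasible VM of higher priority, if any."""
    child = dict(mapping)
    task = rng.choice(list(child))
    current = priority[child[task]]
    better = [v for v in feasible_vms(task) if priority[v] > current]
    if better:
        child[task] = rng.choice(better)  # reassign to a higher-priority VM
    return child
```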

Fitness Calculation.
Fitness calculation is based on the distribution of objective value vectors and goal vectors in the objective space. Assume that there are two candidate solutions s_1 and s_2, their offspring s_3 and s_4, two existing preferences g_1 and g_2, and two new preferences g_3 and g_4 (i.e., N_S = N_G = 2), as shown in Figure 7. The process of calculating the fitness F_s of a candidate solution s and the fitness F_g of a preference g is shown in Table 2.
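A sketch of the fitness rules (9)-(11), assuming (as an illustration) that a solution satisfies a goal vector when it is at least as good in every minimized objective:

```python
def satisfies(obj, goal):
    """Assumed dominance test: obj meets goal in every minimized objective."""
    return all(o <= g for o, g in zip(obj, goal))

def picea_fitness(objs, goals):
    """objs: objective vectors of candidate solutions; goals: goal vectors.
    Returns (F_s list, F_g list) following equations (9)-(11)."""
    n_s = len(objs)
    # n_g: for each goal, the number of solutions satisfying it
    counts = [sum(satisfies(o, g) for o in objs) for g in goals]
    # F_s: shared credit 1/n_g over every goal the solution satisfies
    f_s = [sum(1.0 / counts[j] for j, g in enumerate(goals)
               if satisfies(o, g) and counts[j] > 0) for o in objs]
    # F_g: 1/(1+alpha), alpha = (n_g-1)/(N_S-1) if n_g >= 1, else 1
    f_g = []
    for c in counts:
        alpha = 1.0 if c == 0 else ((c - 1) / (n_s - 1) if n_s > 1 else 0.0)
        f_g.append(1.0 / (1.0 + alpha))
    return f_s, f_g
```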

Truncation Selection.
Truncation selection aims to select the best N_S candidate solutions from the union population according to their fitness. However, some solutions with higher fitness may be Pareto-dominated. Therefore, we identify all nondominated solutions before the selection. If the number of nondominated solutions does not exceed the population size, we assign the maximum fitness to all nondominated solutions. However, if more than N_S nondominated solutions are found, we disregard the dominated solutions prior to applying truncation selection (implicitly, their fitness is set to zero).
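A minimal sketch of this nondominance-aware truncation, assuming minimized objectives; the function names are illustrative:

```python
def dominates(a, b):
    """a Pareto-dominates b under minimization."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def truncate(solutions, objs, fitness, n_s):
    """Keep the best n_s solutions by fitness, favoring nondominated ones:
    boost them to the maximum fitness when they fit in the population,
    or zero out dominated solutions when they do not."""
    nondom = {i for i, o in enumerate(objs)
              if not any(dominates(objs[j], o)
                         for j in range(len(objs)) if j != i)}
    if len(nondom) <= n_s:
        top = max(fitness) if fitness else 0.0
        fitness = [top if i in nondom else f for i, f in enumerate(fitness)]
    else:
        fitness = [f if i in nondom else 0.0 for i, f in enumerate(fitness)]
    ranked = sorted(range(len(solutions)), key=lambda i: -fitness[i])
    return [solutions[i] for i in ranked[:n_s]]
```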

Termination Conditions.
This phase is a major part of the proposed algorithm, as it determines the final solutions. In this article, the termination condition is examined in two ways: (1) as soon as the maximum-iteration criterion is met, the proposed algorithm terminates, and (2) T is a threshold value for terminating the algorithm, set to 0.9 in our study. In every generation, after calculating the fitness of the populations, if the fitness value is less than T, the algorithm continues; otherwise, it terminates. Whenever the algorithm ends, a set of optimal solutions is presented to the user. According to all the levels presented in this article, the final solution is the best solution across all objectives, including reliability and cost. Algorithm 1 presents all the operations of the PICEA-g algorithm.

Performance Evaluation
To evaluate the effectiveness and correctness of our proposed method, we conduct extensive simulative experiments and show through the results that our method outperforms traditional ones. We initially intended to employ a real-world edge-workflow-scheduling environment to test the developed algorithms. However, we found that such an edge environment for executing real-world scientific workflows is yet to come. Consequently, we rely on simulations and simulative datasets for model validation and comparison.
We consider that all edge servers have 3 different types of resource configurations and charging plans, i.e., tp1, tp2, and tp3, as shown in Table 3. We collected historical time-to-failure (TTF) records of the three types, as illustrated in Figure 9, as the input reliability data for edge servers. Then, the MTTF of each type of edge server can be estimated by a Monte Carlo method [31].
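As an illustration, a simple Monte Carlo (resampling) estimate of MTTF from historical TTF records might look as follows; the sample records and the method details are assumptions, not the paper's procedure:

```python
import random

def estimate_mttf(ttf_records, n_samples=10_000, rng=random.Random(0)):
    """Resample the historical TTF records and average, a simple
    Monte Carlo estimate of the mean time-to-failure."""
    total = 0.0
    for _ in range(n_samples):
        total += rng.choice(ttf_records)
    return total / n_samples
```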
We assume as well that edge servers and users are located according to the EUA dataset [2] as shown in Figure 10.
We compare our proposed method with three existing approaches, namely, NSGA-II [32], MOEA/D [33], and SPEA-II [34]. Figure 11 shows the solutions obtained by the abovementioned approaches for different workflow cases, where the x and y axes represent the resulting success rate and cost, respectively. Figure 12 shows the comparison of Pareto optimal solutions of the different methods with varying numbers of edge servers.

Security and Communication Networks
As can be seen from Figures 11 and 12: (1) our method achieves better Pareto optimal fronts than its peers, regardless of the workflow cases or the number of edge servers, and (2) our method acquires more feasible solutions than its peers, because multiple goal vectors help guide the solution population toward the Pareto front.

Conclusion
In this paper, we address the problem of reliability-guaranteed multi-workflow scheduling in the edge computing environment. We develop a reliability-driven scheduling strategy based on the PICEA-g algorithm. Extensive simulations based on several well-known workflow templates and a real-world edge-server-location dataset clearly indicate that our proposed method outperforms its counterparts in terms of different performance metrics.

Data Availability
The EUA dataset used to support the findings of this study is available at https://github.com/swinedge/eua-dataset.

Disclosure
Zhenxing Wang and Wanbo Zheng are co-first authors.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Authors' Contributions
Zhenxing Wang and Wanbo Zheng contributed equally to this work.