Article

Workflow Scheduling Scheme for Optimized Reliability and End-to-End Delay Control in Cloud Computing Using AI-Based Modeling

1 Computer Department, College of Science, University of Sulaimani, Kurdistan Regional Government, Sulaimani 46001, Iraq
2 Department of Computer Science, College of Computer and Information Sciences, King Saud University, P.O. Box 51178, Riyadh 11543, Saudi Arabia
3 Department of Computer Science, College of Science and Mathematics, Montclair State University, Montclair, NJ 07043, USA
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(20), 4334; https://doi.org/10.3390/math11204334
Submission received: 25 August 2023 / Revised: 3 October 2023 / Accepted: 10 October 2023 / Published: 18 October 2023
(This article belongs to the Special Issue Application of Cloud Computing and Distributed Systems)

Abstract: In the context of cloud systems, the effectiveness of placing modules for optimal reliability and end-to-end delay (EED) is directly linked to the success of scheduling distributed scientific workflows. However, the measures used to evaluate these aspects (reliability and EED) are in conflict with each other, making it impossible to optimize both simultaneously. Thus, we introduce a scheduling algorithm for distributed scientific workflows that focuses on enhancing reliability while maintaining specific EED limits. This is particularly important given the inevitable failures of processing servers and communication links. To achieve our objective, we first develop an artificial intelligence-based model that merges an improved version of the wild horse optimization technique with a levy flight approach. This hybrid approach enhances the ability to explore new possibilities effectively. Additionally, we establish a viable strategy for sharing mapping decisions and stored information among processing servers, promoting scalability and robustness—essential qualities for large-scale distributed systems. This strategy not only boosts local search capabilities but also prevents premature convergence of the algorithm. The primary goal of this study is to pinpoint resource placements that strike a balance between global exploration and local exploitation. This entails effectively harnessing the search space and minimizing the inclination toward resources with a high likelihood of failures. Through experimentation in various system configurations, our proposed method consistently outperformed competing workflow scheduling algorithms. It achieved notably higher levels of reliability while adhering to the same EED constraints.

1. Introduction

Cloud computing remains a prominent framework for managing extensive distributed workflow module applications due to its adaptability, consistent performance, and robust oversight [1,2,3]. Through virtualization techniques, users can conduct multiple application tasks globally and concurrently. These applications vary in complexity, ranging from straightforward linear pipelines to intricate Directed Acyclic Graphs (DAGs). Effectively dispatching these workflow applications to achieve optimal network performance has garnered research attention, leading to numerous real network deployment experiments in grid and cloud infrastructures [4].
Over the past few years, numerous approaches to workflow scheduling have emerged to address the challenge of meeting End-to-End Delay (EED) constraints—the time required to complete a workflow from start to finish. This is a complex problem known to be NP-hard [5], which can be tackled using metaheuristic optimization techniques. Yet, the reliability of workflow applications has received comparatively little attention [6]. Given the across-the-board adoption of cloud systems, the reliability of workflow application systems has become an increasingly crucial performance metric alongside EED [7,8]. Inevitable failures in processing servers and network links underscore the need to carefully select operational nodes and transfer links to minimize their detrimental effects. Accessible resources might unexpectedly become deficient, behave unpredictably, or exhibit malicious behavior. Supporting scientific workflows in unstable and ever-changing network environments is therefore critical: mission-sensitive tasks require precise end-to-end execution times, and accurate, prompt handling of data is a necessity rather than a preference, as seen in applications such as climate systems and data analysis using advanced scientific mechanisms.
To enhance execution reliability, redundancy can be introduced by duplicating tasks, which diminishes the likelihood of failure but increases network traffic and computational requirements. An alternative solution entails recording checkpoints and restarting the application in the event of a failure, albeit with added overhead and potential delays due to repeated computations and communications. Hence, it is important to decrease the likelihood of failure even before mapping the application onto the network. Simultaneously achieving the objectives of minimizing EED and maximizing reliability is impossible due to the inherent conflict between the two; traditional mapping and scheduling algorithms instead focus on optimizing a single objective or a weighted cost function with subjective weight values.

1.1. Research Motivation

Centralized scheduling schemes have been the primary focus of past research. These schemes require a central server to collect network conditions and establish mapping strategies. However, this approach exhibits certain limitations, including the risk of a single point of failure, additional storage and transfer requirements, and an extra computational load. A distributed system that shares network conditions and decisions among various nodes is essential to ensure scalability in workflow mapping within extensive networks, because the central server may not finish executing the mapping algorithm before the dynamic network status changes. Achieving both optimization objectives in this context is highly desirable.
Utilizing multiple meta-heuristic computations proves more advantageous than relying solely on individual meta-heuristic analyses, which tend to suffer from premature convergence. Consequently, the incorporation of population-based meta-heuristics becomes crucial for achieving optimal arrangements of task–server combinations. One artificial intelligence-driven technique is the wild horse optimization (WHO) algorithm. The amalgamation of this optimization algorithm with the levy flight algorithm has been empirically validated as a potent solution. This combined approach not only addresses network contention effectively but also plays a pivotal role in optimizing both scheduling reliability and End-to-End Delay (EED) to align with user expectations. By harnessing the synergy between thin–thick clients and the cloud network, this approach enhances task scheduling within the processing system, tackling prevalent issues and thereby elevating Quality of Service (QoS), user satisfaction, and system dependability.

1.2. Research Novelty

Recent studies have predominantly adopted a centralized mapping strategy, wherein the global network condition is aggregated to make mapping decisions. However, this approach carries a notable risk of encountering a single point of failure, consequently impairing both storage and computational efficiency. Moreover, this vulnerability can potentially trigger network failures when dynamic changes in the network state outpace the central server's ability to execute the mapping algorithm promptly. Given these constraints, our proposed workflow scheduling hybrid optimization (WSHO) algorithm breaks free from the reliance on a central server during the mapping phase. Instead, it employs a decentralized approach by broadcasting network decisions among processing servers. This enables the dissemination of messages across the network, precluding failure issues before application scheduling takes place. The worst case arises when a server fails to detect its own failure, in which case the system relies on downstream servers to detect the failure via timeouts. Upon failure detection, our model reassigns only the pending applications within and after the layer containing the failed application.

1.3. Research Contribution

The key discoveries of this investigation are as follows:
  • A distributed workflow scheduling algorithm using meta-heuristic techniques is introduced to enhance application placement reliability goals while strictly adhering to the limitations imposed by the EED objective (Section 1).
  • The integration of the levy flight model enriches the WHO paradigm by achieving a balance between exploring globally and exploiting locally. This also involves efficiently utilizing the search space and reducing the preference for resources prone to high failure rates (Section 3).
  • The study suggests a workflow scheduling approach that incorporates critical path analysis, topological sorting, layer prioritization, and sorting techniques. This method aims to enhance the resilience and reduce the energy consumption of applications in each cycle. The process will persist until convergence is achieved (Section 5).
  • The proposed algorithm simultaneously addresses two opposing goals: achieving the lowest EED and the highest reliability (Section 6).

1.4. Research Outline

The subsequent parts are structured as follows: Section 2 delves into an extensive examination of existing research efforts. Section 3 outlines the primary structure of the proposed system framework. Section 4 presents the issue’s formulation, demonstrating the incompatibility of EED and reliability goals. The algorithms’ development process is discussed in Section 5, while Section 6 encompasses the results from experimental trials of the suggested algorithm. Section 7 offers the concluding remarks.

2. Related Works

The hybrid workflow scheduling approach proposed has the potential to enhance the dependability of allocating workflow module applications across cloud resources, while concurrently decreasing their execution duration. Additionally, this algorithm can more effectively utilize the computing resources available in a diverse device environment. Various task scheduling methods, like the ACO, PSO, and GA, have been explored in the existing literature. However, the majority of these methods possess limitations as they disregard the simultaneous optimization of both reliability and EED during task–resource allocation. In comparison, our solution surpasses conventional task scheduling algorithms, achieving a 42% increase in placement reliability and a 67% reduction in mean execution time.
The publication referenced as [9] introduced a multi-dimensional challenge of assigning workflows within multi-cloud systems. Their approach to multi-objective workflow management encompasses factors like task completion time, operational expenses, and system reliability. To enhance reliability, they incorporated a backup technique. They outlined two distinctive strategies. The first, referred to as a diversification strategy, incorporates problem-specific genetic operators to generate diverse offspring entities. Conversely, the intensification strategy leverages four distinct neighborhood operators, tailored to critical path and resource usage, to enhance the quality of entities stored in the archive set. The suggested approach considers particular limitations, such as the duration of tasks, operational expenditures, and system reliability, but does not give equal attention to other goals, like minimizing energy usage or maximizing service provider profits. Our method distinguishes itself from their model by incorporating an enhanced iteration of the wild horse optimization method in conjunction with a levy flight strategy.
The publication referenced as [10] introduced a dual-objective optimization scenario that focuses on both makespan and enhancing reliability. The authors tackled the challenge of scheduling scientific workflows across cloud resources. They introduced a centralized log, functioning as a repository, to record various system failures. Additionally, they introduced a novel scheduling failure factor (SFF) inversely correlated with system reliability. Consequently, the previously mentioned model was cast into a dual-objective optimization problem, aiming to minimize both makespan and SFF—a task known to be NP-Hard. To solve this combinatorial issue and strike a balance between exploration and exploitation in optimization, they employed a discrete cuckoo search approach alongside levy flight operators. While their approach represents a pioneering solution for deploying applications on functional hosts, it lacks consideration for specific Quality of Service (QoS) aspects related to computing servers. In contrast, our model distinguishes itself from their methods by effectively overseeing the allocation of time slots for virtual machines to prevent any overlap.
The publication referred to as [11] delved into the scheduling of directed acyclic graph (DAG) module applications across cloud resources. They transformed their DAG task placement challenge into an integer programming problem with binary values (zero and one), aiming to minimize makespan and ensure a high rate of successful executions. To address this issue, they introduced a dynamic downward ranking approach that organizes the scheduling priorities of various subtasks. This approach takes into explicit consideration the sequential execution nature of the DAG structure. Moreover, they introduced a mechanism that relies on degree-based weighting and earliest finish time calculation. This mechanism assigns the subtask with the highest scheduling priority to available resources, thereby facilitating swift task execution and dependable communication links. The problem of dispatching applications onto available resources without reusing the allocated virtual machines leads to higher latency overhead and increased computational costs as the volume of requests grows. Our approach sets itself apart from their methods by taking into account more complex system attributes.
The publication referenced as [12] introduced a heuristic framework with the goal of reducing the operational expenses associated with microservice-oriented workflow modules. This framework also ensures adherence to predefined deadlines and reliability thresholds. To achieve this, the authors implemented a greedy approach to placing application replicas, which considers fault tolerance. This method aims to locate suitable resources that meet the deadlines and minimize costs while ensuring reliability. Additionally, they introduced a resource optimization process to enhance the utilization of available resources. The extensive mapping of applications onto cloud resources results in amplified transmission delays, particularly when users are distant from data sources. Our algorithm distinguishes itself from their methods by actively addressing and alleviating the cloud computing bottleneck.
The paper [13] introduced an innovative method for arranging tasks on cloud resources through a unique scheduling technique based on the reliability of critical parent tasks. The objective was to enhance the reliability of the scheduling process while upholding restrictions such as deadlines and operational expenses. The authors also factored in potential failures in processors and communication links within their system design. Their suggested solution demonstrated superior performance compared to rival algorithms, as evidenced by evaluations on benchmark workflow systems like Cybershake, Sipht, and Montage. Their approach falls short when it comes to articulating the intricacies of the workload balancing challenge and relies on basic system characteristics. In contrast, our model stands apart from their strategies because our technique arrives at a dependable decision by reducing the workload, a crucial aspect that becomes inadequate when virtual machines migrate away from overloaded server machines.
The paper [14] tackled the challenge posed by resource-intensive scientific workflows with diverse requirements. The authors pointed out that relying solely on cloud solutions is not always sufficient to meet the escalating demands of these intensive applications. Given these constraints, they introduced a multi-cloud approach centered around robust workflow placement systems. Their primary objectives were to enhance the dependability of workflow execution and to minimize the associated costs. They achieved this by applying the Weibull distribution to analyze the reliability and hazard rate of module application execution. Additionally, they harnessed the billing strategies of multiple cloud service providers to enhance overall system efficiency. The article also introduces a cost-effective layer for DAG-structured workflows, integrating a fault-tolerant and cost-efficient scheduling model. This integration aims to reduce both the operational costs and time of task execution, all while ensuring their reliability. Nonetheless, the authors do not prioritize the service provider’s profitability as one of the top objectives in their assessed metrics. Therefore, our approach addresses this concern by incorporating cloud service providers into the system design and formulating strategies to enhance their profits.
The paper [15] introduced an innovative approach to mapping workflow module applications, utilizing fuzzy resource utilization and incorporating the PSO model. The primary goal of this approach is to reduce both operational costs and the time required for completion, while still adhering to reliability constraints. The proposed system takes into account factors such as the origin of the data and the sequence of conveyance. While their algorithm has indeed demonstrated superior system performance, the proposed solution focuses exclusively on three evaluated metrics: makespan, operational costs, and reliability. It falls short in terms of optimizing power costs and resource utilization rates effectively.
The article [16] presented a dependable system using a distributed scientific workflow scheduling method to allocate module applications within a diverse distributed computing setup, in which failures in processing hosts and communication networks are inevitable. The proposed algorithm aims to improve the reliability of task execution while adhering to an unavoidable end-to-end delay (EED) limit. The algorithm operates through sequential phases. Initially, it gives priority to, arranges, and organizes workflow tasks on the critical path (CP). These tasks are then assigned to dependable and time-conscious resources, using techniques such as iterative CP exploration and layer priority definition. In the subsequent phase, the algorithm assigns tasks on the non-critical path to cloud resources for enhanced execution dependability.
The article [17] proposed a new placement model to adjust the makespan and operational costs for scheduled tasks while security and reliability metrics are optimized. The proposed paradigm works in two phases. In the first phase, the optimal task–server pairing is detected, guaranteeing system performance, security, and reliability. In contrast, the second phase of the reallocating procedure is repeated iteratively to modify the execution process and operational expenses by lowering the variance of the gauged makespan. Their proposed system outperforms other state-of-the-art algorithms in terms of completion time, operating costs, efficient resource utilization rates, and more acceptable time and economic expenditure trade-offs. The existing paradigm has not incorporated specific objectives like service provider profits and virtual machine time slots, whereas our approach takes these objectives into account.
The paper [18] introduced a mixed reliability-conscious solution designed for executable resources, with the goal of enhancing users’ quality of service. The dependability of a resource can significantly influence the efficiency of workload distribution. The arrangement of workloads is structured as a sophisticated scientific workflow. Within the suggested framework, strategies for dividing deadlines and budgets are explored to differentiate between resource instances and module applications. The sequence of users is established through a user sequencing method. Moreover, the allocation strategy for applications is examined, focusing on the arrangement of tasks that are prepared for assignment onto appropriate and available resources. While their approach effectively achieves load balancing among virtual machines, it does not take into account the potential energy savings that can be realized through application placement. Conversely, our method calculates energy consumption by considering the allocated power costs.
As the demand for workflow applications continues to grow, the task of locating the best pair of task servers is becoming increasingly challenging. This challenge is exacerbated by the rising number of workflow applications, resulting in higher system latency. Additionally, incorporating meta-heuristic methods to strike a balance between exploration and exploitation in the search process poses another obstacle, often leading to longer processing times and latency-related delays. These represent the primary constraints of the suggested algorithm that require substantial consideration.

3. The Key Segments of the Suggested WSHO Paradigm

This section delineates the primary constituents of the WSHO system under consideration, encompassing aspects such as system architecture, reliability framework, and the articulation of the issue.

3.1. System Architecture

The system architecture of WSHO comprises the subsequent grades, as depicted in Figure 1.

3.1.1. Grade 1—Tenants

Tenants, situated across various geographical locations, send their requests to the cloud. These requests come in various sizes and levels of complexity and require careful placement. Each request has its own specific deadline, and it is essential for the cloud provider to finish these requests by or before their respective deadlines in order to enhance the Quality of Service (QoS).

3.1.2. Grade 2—Workflow Parser

The workflow engine is responsible for receiving application modules from tenants and transforming them into intricate DAG-based workflows. During this process, the computational complexity of the applications is calculated, and their dependency edges are established and interconnected while considering the failure rate of communication links. Our workflow service engine constructs workflows using a weighted applications graph $UL = (M_a, D_e)$ with a module applications set $M_a = \{\alpha_1, \alpha_2, \ldots, \alpha_n\}$ containing $n$ modules. This graph begins with $\alpha_1$ and concludes with $\alpha_n$. The weight $w_{i,j}$ of dependency edge $de_{i,j}$ signifies the data size transferred from $\alpha_i$ to $\alpha_j$; capturing these edge weights is important for exploiting parallelism. Module $\alpha_i$ receives information such as byte weight from its preceding modules in the graph and uses this to compute its computational routine, based on its total aggregated and normalized complexity requirements. This complexity is a conceptual value influenced not only by the computational complexity function but also by the operational details of real-world algorithm scenarios. To streamline this complexity function in our model, we incorporate the data aggregation function of the computational application. In our workflow system, when a resource accommodates a specific module, the module transmits its byte outputs to the succeeding applications. However, an application's execution cycle can begin only once all the processing information required by that application has been received. Moreover, for the sake of realism, we introduce a virtual starting or ending node of zero weight for workflows with multiple starting or ending points.
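To make the parser's data structures concrete, the following Python sketch shows one way the weighted application graph $UL = (M_a, D_e)$ could be represented, with byte-weighted dependency edges, per-module aggregation of incoming bytes, and zero-weight virtual entry/exit nodes. The class and method names are illustrative assumptions, not the authors' implementation.

# A minimal sketch of the parser's weighted application graph UL = (Ma, De);
# all names here are illustrative assumptions.
from collections import defaultdict

class WorkflowGraph:
    def __init__(self):
        self.modules = set()            # Ma = {a1, ..., an}
        self.succ = defaultdict(dict)   # succ[ai][aj] = w_ij, bytes sent ai -> aj
        self.pred = defaultdict(dict)

    def add_edge(self, ai, aj, w_ij):
        self.modules.update((ai, aj))
        self.succ[ai][aj] = w_ij
        self.pred[aj][ai] = w_ij

    def incoming_bytes(self, ai):
        # gamma_i: total byte weight received from all preceding modules
        return sum(self.pred[ai].values())

    def add_virtual_endpoints(self):
        # Zero-weight virtual entry/exit nodes for workflows with multiple
        # starting or ending points
        starts = [m for m in self.modules if not self.pred[m]]
        ends = [m for m in self.modules if not self.succ[m]]
        for s in starts:
            self.add_edge("a_entry", s, 0)
        for e in ends:
            self.add_edge(e, "a_exit", 0)

g = WorkflowGraph()
g.add_edge("a1", "a2", 512)  # a1 sends 512 bytes to a2
g.add_edge("a1", "a3", 256)
g.add_edge("a2", "a4", 128)
g.add_edge("a3", "a4", 64)
g.add_virtual_endpoints()
print(g.incoming_bytes("a4"))  # 192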

3.1.3. Grade 3—Green Cloud Scheduler

The role of the green cloud scheduler (GCS) involves serving as a bridge connecting tenants and the IaaS network platform, ensuring dependable and energy-efficient-aware allocation of workflow applications. The GCS undertakes the subsequent duties.
  • Service Inspecting: This function assesses the service prerequisites of module applications, encompassing factors like architectural specifications and deadlines, in order to determine if any processing server within the IaaS Network platform can fulfill the execution demands of the application.
  • Reliability and EED Monitor: Part of the GCS’s responsibilities is to direct workflow applications to those resources that yield a balance between reliability and EED constraints. It continuously checks the workload levels status of the servers to prevent system degradation and network congestion.
  • Pricing Charge: The GCS computes the percentage of service charges and approximates the fees imposed by the service provider.
  • Virtual Machine Supervisor: The GCS refreshes the status of virtual machines (VMs) to allocate new VMs and recycle existing ones.
  • Resource Monitor: The GCS oversees real resource utilization and tracks the associated resource expenses.

3.1.4. Grade 4—IaaS Network Platform

The IaaS network deployment setting encompasses processing servers situated within a data center. These robust computational units constitute an arbitrary weighted network graph denoted as $LL = (P_s, D_e)$, where the size of set $P_s$ is $m$ and $P_s = \{\beta_1, \beta_2, \ldots, \beta_m\}$. Our network framework incorporates processing servers on the placement platform based on their computational capacities, failure probabilities, transmission link reliability, and minimal link latency. Moreover, we characterize the length of module application placement as the duration required for all applications to undergo processing on a specific server. Given the dynamic nature of our system, we particularly emphasize CPU sharing, particularly in scenarios in which multiple module applications are executed simultaneously. This implies that when a server accommodates multiple module applications, CPU resources are evenly distributed among them, as long as no dependency relationship exists. This signifies that once an application completes its computational tasks, it promptly releases the utilized CPU. The same principle is applicable to network bandwidth allocation. Our underlying infrastructure supports time sharing when multiple applications concurrently utilize the same CPU.

3.2. Reliability and EED Trade-off Framework

This part delves into three crucial elements concerning EED, data transfer dependency edges, and reliability considerations encompassing server and link failure rates. To begin, we elucidate the EED associated with running a module application on a designated host. Subsequently, we outline the process of calculating data transfer times across dependency edges using network links. Lastly, we investigate the dependability of workflow execution, an outcome of the reliability exhibited by all module applications and communication links.

3.2.1. The EED Execution Process

Consider $\gamma_i$ as the cumulative incoming byte dimensions and $\zeta_j(.)$ as the normalized complexity of the corresponding aggregated data input. The application $\alpha_i$ is part of the workflow $W_i$, with $W_i$ representing the workflow scheduled as the $i$th in the sequence. Consequently, the EED execution time for application $\alpha_i$ on server $\beta_j$ is expressed as follows:

$EED(\alpha_i, \beta_j) = \sum_{i=1}^{n} \frac{\gamma_i \cdot \zeta_j(.)}{\sum_{t=1}^{t_{max}} P_{j,\tau}}$ (1)

Here, $t_{max}$ signifies the utmost count of intervals that task $\alpha_i$ can sustain, and $P_{j,\tau}$ stands for the power usage of server $\beta_j$ during interval $\tau$. In order to meet the stipulated delay condition, the value of $t_{max}$ can be computed as follows:

$t_{max} = \frac{D_{s,d}}{P_{j,\tau}}$ (2)
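A literal transcription of Equations (1) and (2) in Python may help fix the symbol meanings; the argument names and the example values are illustrative assumptions, not values from the paper.

# Sketch of Equations (1) and (2) under the stated symbol meanings.
import math

def t_max(deadline_Dsd, power_Pj_tau):
    # Eq. (2): the largest count of intervals task alpha_i can sustain under D_{s,d}
    return math.floor(deadline_Dsd / power_Pj_tau)

def eed(gamma, zeta, power_intervals):
    # Eq. (1): aggregated computational demand gamma_i * zeta_j(.) divided by the
    # power expended over the sustained intervals, sum_{t=1}^{t_max} P_{j,tau}
    demand = sum(g * z for g, z in zip(gamma, zeta))
    return demand / sum(power_intervals)

print(t_max(10.0, 1.5))                              # 6 intervals
print(eed([512, 256], [0.4, 0.7], [2.0, 2.0, 1.5]))  # ~69.8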

3.2.2. The Network Link Dependency Edge

Let us consider $\alpha(\tau)$ as the count of simultaneous data transfers on the link $L_{i,j}$ over $\Delta\tau$, and $\zeta_{i,j}(\tau)$ as the extent of data transfer partially accomplished during the time span $[\tau, \tau + \Delta\tau]$, with $\alpha(\tau)$ staying constant. This can be expressed mathematically as:

$NDE(De_{i,j}, L_{i,j}) = \sum_{i=1}^{n} \frac{\alpha(\tau) \times \zeta_{i,j}(\tau)}{B_{i,j}} + D_{i,j}$ (3)

Here, the extent of data transfer $\zeta_{i,j}(\tau)$ partially accomplished during the time span $[\tau, \tau + \Delta\tau]$ can be expressed as follows:

$\zeta_{i,j}(\tau) = \frac{B_{i,j}}{\alpha(\tau)} \Delta\tau$ (4)
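Under the same definitions, Equations (3) and (4) transcribe as follows; the interval list, bandwidth, and latency values below are illustrative assumptions.

# Sketch of Equations (3) and (4); B_ij is the link bandwidth and D_ij the latency.
def partial_transfer(B_ij, alpha_tau, delta_tau):
    # Eq. (4): bytes moved during [tau, tau + delta_tau] while alpha(tau)
    # simultaneous transfers share the bandwidth of link L_ij
    return B_ij / alpha_tau * delta_tau

def nde(intervals, B_ij, D_ij):
    # Eq. (3): intervals is a list of (alpha(tau), delta_tau) pairs, with
    # alpha(tau) held constant inside each interval
    total = sum(a * partial_transfer(B_ij, a, dt) / B_ij for a, dt in intervals)
    return total + D_ij

print(nde([(2, 0.5), (3, 0.25)], B_ij=100.0, D_ij=0.02))  # 0.77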

3.2.3. The Reliability Computation Analysis

One challenge when deploying workflow module applications on cloud-based execution resources is the likelihood of server failures. These failures can manifest in two distinct forms: permanent and transient. However, it is important to note that transient failures, which are more prevalent, can have a more pronounced impact compared to permanent ones [19,20]. Transient failures can stem from various sources such as hardware issues, software glitches, or even radiation. As a result, this paper focuses on addressing transient faults due to their substantial influence on the system's overall efficiency. Given that cloud resource failures are unavoidable and can detrimentally affect scheduling's end-to-end delay, we assume that these failures follow a Poisson distribution [21]. Specifically, we are concerned with the fault arrival rate denoted as $\lambda$, a parameter primarily influenced by the server's operational frequency, as discussed in the same paper.
Given the frequent occurrence of transient faults during application placement, we formulate the fault rate according to the approach outlined in reference [20], as follows.
$\lambda(S_{j,f}) = \lambda_0 \cdot \xi(S_{j,f})$ (5)

Here, $\lambda_0$ signifies the initial rate at which faults occur when the server $S(v_j)$ is running at its highest frequency, $S_{j,f}$ represents the current operational frequency of server $S(v_j)$, and $\xi(S_{j,f})$ corresponds to a diminishing function.
In our scheduling methodology and experimental investigation, we adopt the exponential model presented in [20] and subsequently utilized in [19,22], commonly recognized as Equation (6). This equation characterizes the relationship between the circuit's critical cost and the fault rate $\lambda$, reminiscent of the general exponential relation denoted as Equation (5).
$\lambda(S_{j,f}) = \lambda_0 \cdot \xi(S_{j,f}) = \lambda_0 \cdot 10^{\frac{d\,(1 - S_{j,f})}{1 - S_{j,min}}}$ (6)

Here, $d$ is a positive constant of the fault rate model, and $S_{j,min}$ signifies the minimal operational frequency of server $S(v_j)$. Therefore, taking into account the models utilized in [19,20,22], we establish the reliability of workflow module applications running on cloud resources in the manner described below. Nonetheless, when it comes to task scheduling, resource failures are typically statistically independent because there is a wide variety of resource types involved. For instance, certain resources such as servers might go offline due to excessive demand, while others like communication links remain operational and prepared to accept tasks. This is why the failure of one resource does not necessarily impact the status of others.

$R(T, S_{j,f}) = e^{-\lambda(S_{j,f}) \cdot \frac{T(C_1, C_2)}{S_{j,f}}}$ (7)
In this context, $T$ belongs to the set of module applications within a workflow, denoted as $W$. Additionally, $C_1$ represents the computational complexity related to the data input size, while $C_2$ signifies the computational complexity itself. The overall reliability of all the module applications within the workflow is determined by multiplying the individual reliability of each application, and it is expressed in the following manner:

$R(W) = \prod_{i=1}^{n} R(T_i, S_{j,f})$ (8)
This equation holds importance in the scheduling process, as it guarantees the dependability of each workload application when it is assigned to computing resources.
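The reliability model of Equations (5)–(8) can be sketched as follows, assuming the frequency-dependent exponential fault model of [19,20,22]; the values for $\lambda_0$, $d$, frequencies, and workloads are placeholders for illustration.

# Sketch of Equations (5)-(8); parameter values below are placeholders.
import math

def fault_rate(lambda0, d, f, f_min):
    # Eq. (6): transient fault rate grows exponentially as frequency f scales down
    return lambda0 * 10 ** (d * (1.0 - f) / (1.0 - f_min))

def task_reliability(lambda0, d, f, f_min, workload):
    # Eq. (7): R = exp(-lambda(f) * T(C1, C2) / f), with workload = T(C1, C2)
    return math.exp(-fault_rate(lambda0, d, f, f_min) * workload / f)

def workflow_reliability(tasks):
    # Eq. (8): the workflow survives only if every module application survives
    r = 1.0
    for params in tasks:
        r *= task_reliability(*params)
    return r

# Two tasks running at 80% and 100% of the maximum frequency
print(workflow_reliability([(1e-9, 3, 0.8, 0.4, 1e6), (1e-9, 3, 1.0, 0.4, 5e5)]))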

3.2.4. Integrated Wild Horse Optimization and Levy Flight Model

We implemented an efficient approach to improve the arrangement of applications during their execution on cloud resources. To achieve this, we employed a technique known as wild horse optimization based on the model proposed by [23]. This involves organizing the initial population $N$ into several groups $G$, each led by a stallion. The remaining population $RN$, consisting of foals and mares, was similarly distributed across these groups. The subsequent equation, derived from [23], defines the grazing behavior:

$\bar{U}_j^i(G) = 2\,\alpha_P \cos(2\pi R\,\alpha_P) \times (S_j - U_j^i(G)) + S_j$ (9)

Here, $U_j^i(G)$ refers to the present position of a member within the foal and mare group $G$, while $S_j$ indicates the position of the stallion. The term $R$ represents a random number uniformly selected from the range of $-2$ to $2$, and $\alpha_P$ corresponds to the adaptive process evaluated in the following equation:

$V = \alpha_{R1} < TDR; \quad IDX = (V == 0); \quad \alpha_P = \alpha_{R2}\,\Theta\,IDX + \alpha_{R3}\,\Theta\,(\sim IDX)$ (10)

$V$ denotes a vector encompassing values from 0 to 1. The terms $\alpha_{R1}$ and $\alpha_{R3}$ stand for arbitrary values selected within the range of 0 to 1. Similarly, $\alpha_{R2}$ represents a random value obtained uniformly from the interval between 0 and 1. $TDR$ represents an adaptive parameter that commences at one and gradually decreases to zero, expressed in the following manner:

$TDR = 1 - itr \times \frac{1}{\max(itr)}$ (11)
Here, i t r represents the count of iterations, while max ( i t r ) establishes the upper limit for the number of iterations. Motivated by the insights from [24], we have incorporated the foal formula, as outlined in Equation (12), to offer an evaluation of the performance of foals when they exit their initial group and join different groups upon reaching the age of maturity.
$L_\delta^x = \mathrm{Mean}(L_\gamma^y, L_\xi^z), \quad \text{where } \delta \neq \gamma \neq \xi$ (12)

Here, $L_\delta^x$ represents the present position of horse $x$ within the group $\delta$, $L_\gamma^y$ indicates the current location of horse $y$ in the group $\gamma$, and $L_\xi^z$ signifies the current position of horse $z$ within the group $\xi$. The stallions guide their respective groups to the waterhole. The stallion that reaches the waterhole first takes control of the area for its group members. This process can be described as follows:

$S_{j+1} = \begin{cases} 2\,\alpha_P \cos(2\pi R\,\alpha_P) \times (WH - S_j) + WH, & \text{if } R_3 > 0.5 \\ 2\,\alpha_P \cos(2\pi R\,\alpha_P) \times (WH - S_j) - WH, & \text{if } R_3 \le 0.5 \end{cases}$ (13)

Here, $S_j$ and $S_{j+1}$ denote the present and subsequent positions of the stallion, while $WH$ represents the location of the waterhole. The cosine function leads to motion resulting in various improved end locations. In every iteration, the stallion is tasked with determining the most optimal fitness values. To accomplish this objective, we incorporated the levy flight model outlined in [23], using the following approach:

$U_j(G)' = U_j(G) + \zeta\,(U_j(G) - OPT) \oplus LV(\lambda) = U_j(G) + 0.01\,\frac{U}{|V|^{1/\lambda}}\,(U_j(G) - OPT)$ (14)

Here, $U_j(G)$ indicates the position of the $j$th member within the group, $\zeta$ signifies the scaling weight factor, $OPT$ stands for the best solution, and $LV(\lambda)$ represents the exponent for the levy flight. The variables $U$ and $V$ are defined as follows:

$U \sim N(0, \sigma_U^2), \quad V \sim N(0, \sigma_V^2)$ (15)

Taking into account the equation provided earlier, the expressions for $\sigma_U$ and $\sigma_V$ are developed as follows:

$\sigma_U = \left[\frac{\Gamma(1+\lambda)\,\sin(\pi\lambda/2)}{\Gamma\!\left(\frac{1+\lambda}{2}\right)\lambda\,2^{(\lambda-1)/2}}\right]^{1/\lambda}, \quad \sigma_V = 1$ (16)
The primary objective of incorporating the levy flight term of Equation (14) is to improve both global exploration and localized exploitation.
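The update rules in Equations (9)–(16) translate into the following NumPy sketch; the fitness evaluation, bounds handling, and group bookkeeping that surround these operators are omitted, and all names are illustrative assumptions.

# Compact sketch of the WHO + levy flight operators (Eqs. (9)-(16)).
import numpy as np

rng = np.random.default_rng(0)
LAM = 1.5  # levy flight exponent lambda

def tdr(itr, max_itr):
    # Eq. (11): adaptive parameter decaying from one to zero over the iterations
    return 1 - itr * (1 / max_itr)

def adaptive_alpha(dim, itr, max_itr):
    # Eq. (10): per-dimension mix of two random vectors gated by TDR
    v = rng.random(dim) < tdr(itr, max_itr)
    idx = (v == 0)
    return rng.random(dim) * idx + rng.random(dim) * (~idx)

def graze(member, stallion, alpha_p):
    # Eq. (9): circular grazing motion of a foal/mare around its stallion
    r = rng.uniform(-2, 2, member.shape)
    return 2 * alpha_p * np.cos(2 * np.pi * r * alpha_p) * (stallion - member) + stallion

def stallion_step(stallion, waterhole, alpha_p):
    # Eq. (13): the stallion moves relative to the waterhole; branch chosen at random
    r = rng.uniform(-2, 2, stallion.shape)
    move = 2 * alpha_p * np.cos(2 * np.pi * r * alpha_p) * (waterhole - stallion)
    return move + waterhole if rng.random() > 0.5 else move - waterhole

def levy_step(member, best):
    # Eqs. (14)-(16): Mantegna-style levy flight pulling a member toward the best
    from math import gamma, sin, pi
    sigma_u = ((gamma(1 + LAM) * sin(pi * LAM / 2)) /
               (gamma((1 + LAM) / 2) * LAM * 2 ** ((LAM - 1) / 2))) ** (1 / LAM)
    u = rng.normal(0, sigma_u, member.shape)
    v = rng.normal(0, 1, member.shape)
    return member + 0.01 * u / np.abs(v) ** (1 / LAM) * (member - best)

# One illustrative update of a five-dimensional group member
member, stallion = rng.random(5), rng.random(5)
member = graze(member, stallion, adaptive_alpha(5, itr=10, max_itr=100))
member = levy_step(member, best=stallion)
print(member)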

4. Problem Articulation

Managing the allocation of cloud resources for workflow modules employing virtualization technology presents a formidable task. This technology enables the concurrent operation of multiple virtual machines, facilitating the execution of distributed applications in parallel. To illustrate this, let us imagine a scenario in which we have a limited number of servers, each with a finite capacity for hosting virtual machines.
$S = \{s_1, s_2, \ldots, s_n\}$ (17)

$V = \{v_1, v_2, \ldots, v_k\}$ (18)
As previously stated, we represent our applications as a complex directed acyclic graph (DAG) workflow comprising a group of interconnected dependency applications linked by weighted edges.
$DAG = \{W_1, W_2, \ldots, W_w\}$ (19)

$W = \{a_1, a_2, \ldots, a_\tau\}$ (20)

$E = \{e_1, e_2, \ldots, e_\epsilon\}$ (21)
When deploying these workflow applications on cloud servers for processing via virtual machines, it becomes vital to take into account specific limitations to enhance scalability and resilience—both crucial attributes for large-scale distributed systems. Initially, this involves sharing mapping decisions and stored data among processing servers. Additionally, it entails not only improving local search capabilities but also preventing the algorithm from converging prematurely. To achieve this objective, we have devised an artificial intelligence-driven model that combines an enhanced version of the wild horse optimization technique with a levy flight approach. This amalgamation enables more effective exploration of new possibilities while considering various significant constraints such as provider throughput, energy consumption, application completion time, and reliability. Our primary aim is to strike a balance between the execution time of applications and the reliability of their performance—two vital constraints we prioritize in our research. Mathematically, this can be expressed as follows:
$\mathrm{Dispatch}_{cycles}: \sum_{i=1}^{\tau} a_i \longmapsto \bigcup_{k}^{J} S_J(v_k), \quad \text{where } a_i \in W_i$ (22)
Subject to the following constraints:
1. minimized completion time and energy consumption
2. maximized execution reliability and service throughput
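Since the formulation above does not fix a closed-form objective, the following sketch shows one plausible scalarization of the four constraints; the weights, field names, and sign convention are assumptions for illustration only, not the paper's exact fitness function.

# One plausible scalarized fitness for a candidate mapping; lower is better,
# so time and energy are penalized while reliability and throughput are rewarded.
def fitness(metrics, w=(0.25, 0.25, 0.25, 0.25)):
    # metrics holds precomputed values for one candidate task-server mapping
    return (w[0] * metrics["eed"] + w[1] * metrics["energy"]
            - w[2] * metrics["reliability"] - w[3] * metrics["throughput"])

print(fitness({"eed": 19.1, "energy": 12.4, "reliability": 0.95, "throughput": 49.7}))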

5. Algorithm Formulation

This section seeks to clarify the method we propose to augment the efficiency of the suggested WSHO algorithm within cloud systems by integrating hybridized metaheuristic approaches. The paired procedural methods of the suggested WSHO algorithm are elucidated in Algorithm 1.
The initial method integrates an iterative exploration of the critical path and prioritization techniques based on layer structure. This approach is employed to identify the critical path for designated workflows, divide applications into distinct layers, and arrange the applications within each layer according to their computational and communicative needs. This phase differentiates between module applications as sensitive or non-sensitive tasks, harnessing the EED minimization function.
In the second phase, the analysis accounts for the likelihood of server failures and aims to avoid servers with elevated probabilities of experiencing failures. During this stage, we have implemented a wild horse optimization algorithm incorporating a levy flight model. This choice strengthens the objective of enhancing reliability by strategically assigning module applications to appropriate server resources.
Algorithm 1: Workflow scheduling horse optimization (WSHO)
   Input: Module Applications Graph $M_a = \{\alpha_1, \alpha_2, \ldots, \alpha_n\}$, Network Placement Graph $P_s = \{\beta_1, \beta_2, \ldots, \beta_m\}$, Network Dependency Edges in the $M_a$ Graph $D_e = \{e_1, e_2, \ldots, e_d\}$, and $W_{ToT}$: deadline for module applications in the workflow
   Output: Scheduling module applications on the most reliable and EED-optimized processing servers in the $P_s$ graph

5.1. Module Application Scheduling

This algorithm operates within the architecture of a workflow parser system to identify a mapping solution while considering resource sharing for all module applications present in the directed acyclic graph (DAG). By analyzing the architecture and requirements of each application, the algorithm organizes tasks and creates various types of workflows. It calculates the time costs for both computation and data transfer of the applications and then utilizes a linear-time, layer-based sorting model to group the applications into different layers. Applications within the same layer can be executed concurrently, with at most one application from the critical path included in the same layer.
Subsequently, the algorithm prioritizes applications within the same layer based on their respective computation and transfer time costs. Applications with higher costs are given greater priority and are assigned first. The algorithm establishes the initial critical path for each workflow, starting from the first application and proceeding through to the final application, using the polynomial-time technique of determining the longest path. Identifying the critical path is crucial as it highlights the longest route, and any application within this path is recognized as consuming the most time and therefore must be executed sequentially.
The algorithm also incorporates a consideration of failure rates to enhance task security during the execution process. It calculates both the initial and overall reliability-enhanced execution time for the critical path, along with the initial and real reliability-enhanced data transfer time for that path. The newly adjusted critical path is computed based on the existing mapping solution, resulting in the regeneration of reliability-enhanced execution and transfer times. The comprehensive description of the algorithm is provided in Algorithm 2.
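The layer decomposition and longest-path (critical path) steps described above can be sketched as follows; the cost model, a single computation cost per module, is a simplification of the combined computation-plus-transfer costs used in the paper, and the graph values are illustrative.

# Sketch of layer-based topological sorting plus critical path extraction.
def layers_and_critical_path(nodes, edges, cost):
    # edges: dict node -> list of successors; cost: node -> computation time
    indeg = {n: 0 for n in nodes}
    for u in edges:
        for v in edges[u]:
            indeg[v] += 1
    # Peel off zero in-degree nodes level by level (same-layer tasks can run in parallel)
    layers, frontier = [], [n for n in nodes if indeg[n] == 0]
    dist = {n: cost[n] for n in nodes}    # longest path cost ending at n
    parent = {n: None for n in nodes}
    while frontier:
        layers.append(frontier)
        nxt = []
        for u in frontier:
            for v in edges.get(u, []):
                if dist[u] + cost[v] > dist[v]:
                    dist[v], parent[v] = dist[u] + cost[v], u
                indeg[v] -= 1
                if indeg[v] == 0:
                    nxt.append(v)
        frontier = nxt
    # Walk back from the costliest node to recover the critical path
    end = max(nodes, key=lambda n: dist[n])
    cp = []
    while end is not None:
        cp.append(end)
        end = parent[end]
    return layers, cp[::-1]

nodes = ["a1", "a2", "a3", "a4"]
edges = {"a1": ["a2", "a3"], "a2": ["a4"], "a3": ["a4"]}
cost = {"a1": 2, "a2": 5, "a3": 1, "a4": 3}
print(layers_and_critical_path(nodes, edges, cost))
# ([['a1'], ['a2', 'a3'], ['a4']], ['a1', 'a2', 'a4'])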
Algorithm 2: Application Scheduling Algorithm (ASA)
Input: $M_a = \{\alpha_1, \alpha_2, \ldots, \alpha_n\}$, $D_e = \{e_1, e_2, \ldots, e_v\}$, and $W_{ToT}$
Output: Improved reliability–EED provisional and ultimate mapping arrangement

5.2. Module Application Placement

Algorithm 3 initially enhances the reliability of data execution and transmission durations as outlined in Equations (23) and (24). These factors have a direct impact on the operational timeframe of module applications and the occurrence of network link failures.
$R(EED_j^i) = \frac{EED_j^i}{1 - f_j}$ (23)
The equation above indicates that the improvement in data execution efficiency is inversely proportional to the success rate of the assigned processing server.
$R(L_{i,j}) = \frac{L_{i,j}(e_{i,j})}{1 - f_{i,j}}$ (24)
The equation above suggests that the enhancement of data transmission efficiency is inverse to the success rate of the designated bandwidth connection.
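Assuming Equations (23) and (24) divide the raw times by the corresponding success rates $(1 - f)$, as the surrounding text states, they transcribe directly; the numeric values below are illustrative.

# Reliability-enhanced execution and transfer times (Eqs. (23) and (24)).
def reliability_enhanced_eed(eed_ji, f_j):
    # Eq. (23): inflate the execution time by the server's success rate (1 - f_j)
    return eed_ji / (1.0 - f_j)

def reliability_enhanced_link(link_time, f_ij):
    # Eq. (24): inflate the transfer time by the link's success rate (1 - f_ij)
    return link_time / (1.0 - f_ij)

print(reliability_enhanced_eed(5.75, 0.02))   # ~5.87
print(reliability_enhanced_link(1.20, 0.05))  # ~1.26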
The algorithm transmits historical data from module applications to the underlying cloud hardware and obtains details about processing servers, encompassing their connections and assigned virtual machines. These data aid the algorithm in creating a query for precursor and successor servers, streamlining the assignment of dependent modules. Applications are received by the algorithm in a sequential manner, prioritized based on their importance, commencing from the initial layer and continuing until completion. After mapping all applications in the first layer, the algorithm enters a brief waiting period prior to commencing the mapping of applications in the subsequent layer. This encompasses a check to ascertain whether all module applications in the first layer have been situated on servers that are conscious of the trade-off between reliability and EED. In the aftermath of this process, each server exchanges information about its neighboring connections with other processing servers. Consequently, three potential scenarios come into play.
Algorithm 3: Application placement algorithm (APA)
  Input: $M_a = \{\alpha_1, \alpha_2, \ldots, \alpha_n\}$, $S_n = \{s_1, s_2, \ldots, s_t\}$, $nS_n = \{ns_1, ns_2, \ldots, ns_t\}$, $D_e = \{e_1, e_2, \ldots, e_v\}$, $W_{ToT}$, $S_{ch}$, $R_{ToT}$, and $P_s = \{\beta_1, \beta_2, \ldots, \beta_n\}$
  Output: Final module application mapping scheme with reliability and EED improvement
  • The processing server β j holds all preceding module applications leading up to the current preprocessing s i , which could render it a viable option for selecting the server destination.
  • The processing server β j has the task of hosting a module application that comes before s i in sequence. We check if β j is connected to any servers holding module applications before s i . If such connections are established, then β j becomes a possible choice for assigning the mapping; otherwise, it is not considered suitable.
  • The processing server β j cannot host any module applications before s i in sequence. In this case, we investigate the connections of the server with others. If it connects with a server that houses module applications preceding s i , then β j becomes a viable choice for mapping; otherwise, it is deemed unsuitable.
The primary aim of these situations is to determine the server that can provide the most favorable balance between reliability and time. To achieve a portion of this improvement, the process involves computing the partial end-to-end delay of the mapping procedure. This calculation aids in selecting the mapping approach that results in the lowest partial EED for the mapping process.
Once the mapping of highly sensitive module applications is completed, the algorithm proceeds to map applications that are not part of the critical path. Similar to the initial mapping phase, it is necessary to balance between reliability and EED. In this regard, the algorithm begins with applications that have a new completion time that is earlier than their latest recorded finish time. Just as in the previous mapping procedure, processing servers that have direct connections to servers containing previously assigned module applications are taken into account for this round of mapping.

6. Experimental Findings and Discussions

This section primarily concentrates on the examination of three key phases. Initially, we delve into the configuration of the simulation. Subsequently, we elucidate the approach based on scenarios employed in this study. Lastly, we assess the results derived from the simulation.

6.1. Software Setup

We implemented the suggested WSHO algorithm using the CloudSim toolkit [25] on the Windows 11 operating system. The system utilized an Intel Core i5 CPU from the 12th generation, clocked at 1.3 GHz, along with 8.0 GB of memory. The specifics of the simulation configuration, processing servers, and virtual machines can be found in Table 1.

6.2. Scenario-Based Study

To verify the experimental simulation results concerning time efficiency and reliability, we executed two distinct scenarios. The initial scenario encompassed applications with modest to moderate resource requirements, while the subsequent scenario entailed applications spanning from moderate to substantial data sizes. Both scenarios incorporated factors such as server processing capacity, server failure likelihood, communication link reliability, bandwidth failure probability, and minimal link latency.
Furthermore, for the purpose of performing comparative assessments, we employed three effective algorithms:
  • Reliability-Aware Multi-Objective Memetic Algorithm (RA-MOMA) [9]: This represents a versatile algorithm that harmonizes task completion speed, operational costs, and system dependability through a diversified approach to generating various offspring entities. It takes into account specific constraints, including task duration, operational expenses, and system reliability.
  • Critical Parent Reliability-based Scheduling (CPRS) [13]: The authors of this research brought forth a novel approach to organizing tasks on cloud resources by employing an exclusive scheduling method that hinges on the dependability of crucial parent tasks. Its objective is to boost the dependability of the scheduling procedure while adhering to constraints like deadlines and operational costs. Additionally, the algorithm takes into account processor failures and communication links as integral elements of their system design.
  • Distributed Reliability Maximization workflow mapping algorithm under End-to-end Delay constraint (dis-DRMED) [16]: This study introduced a reliable system that utilizes a distributed scientific workflow scheduling technique to allocate module applications in a varied distributed computing environment, recognizing the inevitability of failures in processing hosts and communication networks. The primary goal is to enhance the dependability of task execution while respecting an unavoidable end-to-end delay (EED) constraint.

6.3. WSHO Complexity Analysis

The computational intricacy of the WSHO model is characterized by a complexity of $O(S \cdot N \cdot |E_z|)$, with $S$ representing the count of processing servers, $N$ indicating the quantity of applications, and $E_z$ standing for the iterations. The highest level of complexity emerges when the network is fully connected, leading to every server potentially serving as a candidate mapping destination for all module applications.

6.4. Simulation-Based Result Analysis

This section elucidates the primary elements that directly influence the results of the experiments. This encompasses their effects on completion time, reliability, energy usage, and provider throughput across a spectrum of application sizes, ranging from small to medium and up to large datasets.

6.4.1. Impact of Application Sizes on the EED

Figure 2a depicts the time analysis of the suggested WSHO approach and the existing dis-DRMED, RA-MOMA, and CPRS methods, concerning workloads ranging from small to moderate sizes. With an increase in the number of module applications, there is a corresponding rise in the time rate. This occurrence is attributed to the submission of more applications by users from various locations, each carrying distinct weights. Nevertheless, the proposed WSHO algorithm outperforms its competitors. For instance, when the applications reach a count of 100, the time value for WSHO stands at 5.75, compared to 6.38 for dis-DRMED, 7.0 for RA-MOMA and 7.80 for CPRS. Moreover, as the number of applications surges to 500, the EED values for WSHO, dis-DRMED, RA-MOMA, and CPRS are 19.12, 21.88, 25.12, and 29.34, respectively.
In Figure 2b, the same evaluation is conducted for the proposed WSHO approach and current algorithms, namely dis-DRMED, RA-MOMA, and CPRS. However, this time, the weights of the workload are adjusted to encompass tasks of moderate to extensive sizes. Initially, all approaches exhibit similar trends, but as the number of applications increases, differences between the algorithms become more pronounced. For instance, when the application count reaches 100, the EED values for WSHO, dis-DRMED, RA-MOMA, and CPRS are 17.15, 18.10, 19.30, and 21.13, respectively. Yet, with 500 applications, these values escalate to 27.45, 28.52, 32.80, and 35.14, respectively.
The aforementioned data analysis underscores that extensive applications demanding high computational power and complexity tend to result in longer finish times due to heightened application transfer latency. As evident from the comparison figures, the time values for WSHO are significantly lower than those of the other three algorithms. This performance gap becomes especially conspicuous as the module application data transfer size expands. Overall, the proposed WSHO consistently maintains a lower finish time across all scenarios. This accomplishment is attributed to the distinctive manner in which our model distributes the mapping schedule, enabling superior scalability as the network environment scales up.

6.4.2. Impact of Application Sizes on the Reliability

Figure 3a illustrates the evaluation of the reliability of the suggested WSHO algorithm in comparison to existing methods, dis-DRMED, RA-MOMA, and CPRS, concerning data sizes ranging from small to moderate. The effectiveness of the system is directly impacted by the accurate placement of module applications on dependable destinations. Therefore, enhancing this percentage rate holds crucial importance. The proposed WSHO algorithm outperforms alternative methods by a substantial margin. For instance, when the count of applications reaches 100, the proposed WSHO algorithm achieves a reliability rate of 0.98, whereas the competing algorithms, dis-DRMED, RA-MOMA, and CPRS, attain rates of 0.95, 0.94, and 0.93, respectively. Moreover, as the count of module applications reaches 500, WSHO achieves a reliability improvement of 0.95. In contrast, dis-DRMED, RA-MOMA, and CPRS experience enhancements of 0.93, 0.92, and 0.90, respectively.
Figure 3b presents a comparison between the proposed WSHO algorithm and alternative approaches like dis-DRMED, RA-MOMA, and CPRS, focusing on varying numbers of workloads, spanning from moderate to intense. Across all scenarios, as the sizes of applications increase, there is a gradual decline in reliability, leading to a degradation of the system’s performance. However, the suggested WSHO algorithm maintains a satisfactory level of reliability. To illustrate, when the module count reaches 100, the proposed WSHO algorithm achieves a reliability value of 0.94, in contrast to dis-DRMED, RA-MOMA, and CPRS, which achieve reliabilities of 0.93, 0.91, and 0.88, respectively. This enhancement is also evident with a workload size of 500. The proposed WSHO algorithm achieves a reliability of 0.88, while the competing algorithms, dis-DRMED, RA-MOMA, and CPRS, record reliabilities of 0.85, 0.83, and 0.81, respectively.
The data assessments clearly indicate that the proposed WSHO algorithm exhibits considerably superior reliability rates compared to the other algorithms, dis-DRMED, RA-MOMA, and CPRS. This performance gap becomes notably conspicuous as both the module application sizes and network connection links expand. Additionally, it is evident that reliability experiences a reduction as the problem size grows due to increased computing and application transfer activities within the system. However, the improvement in reliability achieved using the proposed WSHO algorithm remains noteworthy when compared to alternative methods of similar nature. Our approach distinguishes itself from other algorithms by factoring in potential failures of processing servers and network links, thereby achieving enhanced scalability as the network environment expands.

6.4.3. Impact of Application Sizes on the Energy

Energy usage poses a significant concern when deploying applications on cloud resources, as this factor can negatively impact other performance metrics. To demonstrate the effectiveness of the proposed WSHO algorithm, Figure 4a provides a comparison between the algorithm and dis-DRMED, RA-MOMA, and CPRS in terms of energy consumption. The data analysis reveals that the suggested WSHO algorithm outperforms its competitors in scenarios involving small to moderate workloads. In scenarios involving 100 workloads, the energy consumption was approximately 3.45 units for the WSHO algorithm, 3.88 for dis-DRMED, 4.65 for RA-MOMA, and 5.64 for CPRS. As the workload count increased to 500, all algorithms experienced a sharp rise in energy consumption, with the WSHO algorithm remaining the most conservative, while the dis-DRMED, RA-MOMA, and CPRS algorithms consumed around 13.05, 22.11, and 24.65 units, respectively.
To affirm the efficacy of the introduced WSHO algorithm, we subjected it to an assessment under conditions of moderate to high workloads. Upon initial observation, the proposed WSHO algorithm demonstrates a greater capacity for energy preservation compared to alternative algorithms, specifically when the application count is set at 100. In this scenario, its consumption stands at a mere 7.65 units, while competing models, namely dis-DRMED, RA-MOMA, and CPRS, register consumption levels of approximately 9.45, 11.23, and 13.14 correspondingly. As the number of applications increases, consumption values also rise accordingly. For instance, at an application count of 500, the consumption rates are measured at 23.74 for the WSHO algorithm, 25.09 for dis-DRMED, 28.27 for RA-MOMA, and 31.21 for CPRS, reaffirming the superior performance of the proposed WSHO algorithm.
As both figures indicate, the proposed WSHO framework conserves energy more proficiently than the alternative strategies, dis-DRMED, RA-MOMA, and CPRS. This superiority stems from how the proposed WSHO algorithm redistributes virtual machines, averting the unnecessary overheads that arise when adjacent virtual machines are tasked with handling applications. A notable advantage of the proposed WSHO algorithm lies in its adept allocation of applications across servers, which not only mitigates latency and tolerates server and network failures but also optimizes processing resource utilization during application execution.
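As a concrete illustration of how per-server figures like those above can be aggregated, the following sketch evaluates a placement under a simple linear utilization-based power model (corresponding in spirit to the P_{j,t} term in the notation table). The power figures and sampling interval are illustrative assumptions only.

```python
def server_energy(power_idle, power_peak, utilization, interval):
    """Energy (J) drawn by one server over `interval` seconds under the
    common linear power model P(u) = P_idle + (P_peak - P_idle) * u."""
    return (power_idle + (power_peak - power_idle) * utilization) * interval

def placement_energy(samples):
    """Total energy of a placement, given per-server samples of
    (P_idle, P_peak, utilization, interval) over the schedule."""
    return sum(server_energy(*s) for s in samples)

# Example: two servers observed over 30 s rounds (placeholder figures).
samples = [(90.0, 250.0, 0.75, 30.0), (90.0, 250.0, 0.20, 30.0)]
print(placement_energy(samples))
```

Under such a model, avoiding redundant VM activity reduces the idle-power share of total energy, which is consistent with the redistribution behavior described above.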

6.4.4. Impact of Application Sizes on the Service Throughput

Cloud service providers continuously strive to raise service throughput, as it plays a pivotal role in augmenting the providers' returns. As illustrated in Figure 5a, we conducted a comparative analysis on small to moderate datasets, measuring provider profit percentages for our WSHO algorithm against dis-DRMED, RA-MOMA, and CPRS. The numerical results underscore the substantial profit gains achieved by the proposed WSHO algorithm compared to the other algorithms; the least favorable outcome is observed with CPRS, which exhibits the lowest values.
We executed two distinct experimental scenarios. First, we evaluated the algorithms across workloads ranging from light to moderate; second, across moderate to heavy workload sizes. In the first scenario, with 100 applications in play, the proposed WSHO algorithm achieved a rate of 37.66, while dis-DRMED, RA-MOMA, and CPRS registered rates of 35.84, 31.12, and 29.66, respectively. Even as the workload count escalated to 500, the proposed WSHO algorithm maintained its superiority with a rate of 49.72, compared to 45.12 for dis-DRMED, 41.88 for RA-MOMA, and 38.69 for CPRS.
In the second scenario, we evaluated the algorithms' effectiveness for provider outcomes under moderate to intense workloads. Once again, the proposed WSHO algorithm consistently outperformed dis-DRMED, RA-MOMA, and CPRS in throughput percentage. At a data size of 100, the proposed WSHO algorithm achieved a throughput of 49.65, while dis-DRMED achieved 45.63, RA-MOMA 43.15, and CPRS 40.10. Even as the workload count escalated to 500, the proposed WSHO model continued to outperform the other three methods: the throughput efficiency measured 58.25 for the WSHO algorithm, 56.78 for dis-DRMED, 54.74 for RA-MOMA, and 52.98 for CPRS.
As the packet weights of the module applications grow, the service profits for the providers rise gradually, enhancing overall framework profitability. Notably, the suggested WSHO algorithm outperforms the alternative models, dis-DRMED, RA-MOMA, and CPRS, across all scenarios. This heightened provider return is primarily attributed to the proposed WSHO algorithm's adept management of overhead latency, energy consumption, completion time, scheduling reliability, and application placement, which collectively yield the highest level of provider benefit.
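One plausible way to operationalize the throughput percentage reported above is as the share of submitted modules that complete before their deadlines. The short sketch below, with hypothetical finish times and deadlines, illustrates that reading; it is not the exact metric definition used by the simulator.

```python
def service_throughput(finish_times, deadlines):
    """Percentage of workflow modules that complete before their
    deadlines -- one plausible reading of the throughput metric."""
    met = sum(f <= d for f, d in zip(finish_times, deadlines))
    return 100.0 * met / len(finish_times)

# Hypothetical finish times and deadlines for four modules.
print(service_throughput([12.0, 30.5, 44.1, 80.0], [15.0, 29.0, 50.0, 90.0]))
```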

6.5. Real-Time-Based Result Analysis

Reliability and time constraints are two of the most telling evaluation metrics for resource placement algorithms: users want their applications placed on reliable resources and finished before their deadlines. For this evaluation, we compare our proposed paradigm with comparable algorithms on two real scientific workflows, LIGO and SIPHT, across different workload sizes.

6.5.1. Real-Time Evaluation Based on Application Finish Time

Figure 6 illustrates a comparison between our WSHO algorithm and alternative algorithms, including dis-DRMED, RA-MOMA, and CPRS, concerning scientific workflow applications like LIGO and SIPHT. For smaller application sizes, such as indexes 1, 2, and 3, the distinctions between the algorithms are minimal. However, as the dimensions increase in indexes 4 and 5, the disparities between them become more pronounced.
Figure 6a demonstrates that, when analyzing LIGO scientific workflow applications, our proposed WSHO algorithm achieves faster application execution times compared to the other three algorithms. Initially, the performance trends for all algorithms appear to be similar. However, as application sizes increase, the distinctions among them become more pronounced. This discrepancy arises because applications in indexes 1, 2, and 3 demand lower computational resources and exhibit lower complexity. Conversely, applications in indexes 4 and 5 necessitate substantial computational power and intricate processing, resulting in longer execution times.
Figure 6b illustrates that, in the evaluation of SIPHT scientific workflow applications, our proposed WSHO algorithm completes the assigned tasks more quickly than the dis-DRMED, RA-MOMA, and CPRS algorithms. The difference becomes particularly significant as the size of the workflow applications increases, notably in indexes 4 and 5, where the incoming large-scale applications demand more executable resources. In certain cases, particularly with the RA-MOMA and CPRS algorithms, applications are placed in waiting queues and experience long wait times due to limited resource availability and challenges related to virtual machine window matrices, resulting in extended execution times and missed deadlines. The overall improvement in finish time achieved by the WSHO algorithm over dis-DRMED, RA-MOMA, and CPRS is remarkable, and is attributed to WSHO's dynamic management of executable resources and its avoidance of virtual machine overlap.
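For reference, the finish time of a scheduled workflow can be computed by propagating earliest finish times through the DAG, adding an inter-server transfer delay whenever dependent modules sit on different servers. The sketch below is a simplified illustration of that calculation; the task graph, run times, and delay function are hypothetical.

```python
from collections import defaultdict

def workflow_finish_time(tasks, edges, exec_time, placement, link_delay):
    """Earliest finish time (makespan) of a DAG workflow.

    tasks: task ids in topological order.
    edges: list of (u, v) precedence pairs.
    exec_time[t]: run time of task t on its assigned server.
    placement[t]: server hosting t; link_delay(a, b): transfer delay
    between servers a and b (0 when a == b).
    """
    preds = defaultdict(list)
    for u, v in edges:
        preds[v].append(u)
    finish = {}
    for t in tasks:  # topological order assumed
        ready = max(
            (finish[p] + link_delay(placement[p], placement[t]) for p in preds[t]),
            default=0.0,
        )
        finish[t] = ready + exec_time[t]
    return max(finish.values())

# Example: a 4-task diamond DAG on two servers (illustrative numbers).
tasks = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
exec_time = {"a": 2.0, "b": 3.0, "c": 4.0, "d": 1.0}
placement = {"a": 0, "b": 0, "c": 1, "d": 0}
delay = lambda s1, s2: 0.0 if s1 == s2 else 0.5
print(workflow_finish_time(tasks, edges, exec_time, placement, delay))
```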

6.5.2. Real-Time Evaluation Based on Execution Reliability

Figure 7 presents a comparison between our WSHO model and three other algorithms, dis-DRMED, RA-MOMA, and CPRS, in the context of scientific workflow applications such as LIGO and SIPHT. As application sizes increase, ensuring their execution reliability becomes increasingly challenging. This is primarily due to the inclusion of more computationally intensive applications that demand greater attention to reliability.
Figure 7a assesses the reliability of LIGO workflow module applications, where an effective algorithm can reduce resource and communication fault rates. To comprehensively evaluate our proposed WSHO algorithm against its competitors, we conducted an analysis of execution reliability across various index problem sizes. The results depicted in Figure 7a indicate that the WSHO algorithm adeptly manages server and communication link failures during application placement, leading to superior reliability outcomes. For problem sizes 1, 2, and 3, the algorithms exhibit similar performances, but as the workload increases in indexes 4 and 5, WSHO outperforms the other algorithms. Based on the findings from Figure 7a, it is evident that the WSHO algorithm excels in identifying and utilizing the most reliable resources for accommodating LIGO workflow applications.
Figure 7b provides a comparison of algorithm reliability and efficiency concerning various sizes of SIPHT workflow applications. The results depicted in Figure 7b confirm that our proposed WSHO algorithm consistently achieves higher execution reliability compared to dis-DRMED, RA-MOMA, and CPRS across all workflow sizes. As is evident from Figure 7b, as the size of workflow applications increases, the reliability of all algorithms decreases significantly due to the inclusion of more computationally demanding applications. Nevertheless, the WSHO algorithm outperforms the other models. Notably, the WSHO and dis-DRMED algorithms exhibit closely aligned performances because they both prioritize server and communication link reliability during application placement.
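The execution reliability plotted in Figure 7 can be understood as the joint probability that every module execution and every inter-server transfer in the workflow is fault-free. Assuming independent Poisson faults, this is a product of exponentials, as the brief sketch below illustrates; the rates shown are merely sample points drawn from the fault-rate ranges configured in Table 1.

```python
import math

def execution_reliability(server_rates, exec_times, link_rates, transfer_times):
    """End-to-end reliability of one placement: the workflow succeeds only
    if every task execution and every inter-server transfer is fault-free.
    With independent Poisson faults this is a product of exponentials."""
    r = 1.0
    for lam, t in zip(server_rates, exec_times):
        r *= math.exp(-lam * t)
    for lam, t in zip(link_rates, transfer_times):
        r *= math.exp(-lam * t)
    return r

# Illustrative rates within the simulated ranges of Table 1
# (computational 0.01-0.000005, network 0.0005-0.000005 per unit time).
print(execution_reliability([1e-4, 5e-5], [120.0, 300.0], [5e-5], [40.0]))
```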

7. Conclusions and Future Work

This study introduces an innovative method for arranging and situating workflow module applications on appropriate cloud resources. The objective is to strike a balance between scheduling reliability and end-to-end application delay at a point that maximizes provider benefits. Optimizing both of these conflicting goals concurrently poses a formidable challenge. To address it, we integrated artificial intelligence-driven techniques, employing an enhanced wild horse optimization (WSHO) that blends in the Lévy flight strategy to explore novel regions of the search space efficiently.
Our hybrid approach facilitates the exchange of mapping decisions and routing information among processing servers. This not only accounts for processing server and communication link failures, helping to prevent system crashes and network congestion, but also endows the model with local search capabilities and safeguards against premature convergence. To validate the efficacy of the proposed WSHO algorithm, we conducted two scenarios of varying application sizes: the first covers applications of light to moderate complexity, the second applications of moderate to heavy complexity. The central aim of this research is to identify resource placements that harmoniously blend global exploration with local exploitation, efficiently utilizing the search space while avoiding resources prone to frequent failures.
Moreover, we evaluated the simulation against four key metrics. In all scenarios, our proposed WSHO algorithm outperforms the dis-DRMED, RA-MOMA, and CPRS methods, demonstrating its superiority. In future work, we will build a real-time environment in which to assess the effectiveness of our proposed algorithm and examine how it compares to similar algorithms.

Author Contributions

Methodology, M.I.K.; Software, M.I.K.; Validation, M.I.K., M.S. and M.Z.; Investigation, M.I.K. and S.A.; Data curation, M.I.K.; Writing—original draft, M.I.K.; Writing—review and editing, M.I.K., M.S., S.A. and M.Z.; Visualization, M.I.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Deputyship for Research and Innovation, “Ministry of Education” in Saudi Arabia (IFKSUOR3-010-4).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

There is no statement regarding the data.

Acknowledgments

The authors extend their appreciation to the Deputyship for Research and Innovation, “Ministry of Education” in Saudi Arabia for funding this research (IFKSUOR3-010-4).

Conflicts of Interest

The authors declare no conflict of interest.

Main Notations Applied in This Work

Notation | Description | Equation
γ_i | the cumulative incoming byte dimensions | Equation (1)
ζ_j(·) | the normalized complexity of the corresponding aggregated data input | Equation (1)
α_i | the application that is part of the workflow W_i | Equation (1)
W_i | the workflow scheduled as the ith in the sequence | Equation (1)
t_max | the utmost count of intervals that task α_i can sustain | Equation (2)
P_{j,t} | the power usage of server β_j during time t | Equation (2)
α(τ) | the count of simultaneous data transfers on the link L_{i,j} over Δτ | Equation (3)
ξ_{i,j}(τ) | the extent of data transfer partially accomplished during the time span [τ, τ + Δτ] | Equation (3)
λ_0 | the initial fault rate when server S(v_j) runs at its highest frequency | Equation (5)
S_{j,f} | the current operational frequency of server S(v_j) | Equation (5)
ξ(S_{j,f}) | the parameter corresponding to a diminishing function | Equation (5)
d | the positive constant fault rate | Equation (6)
S_{j,min} | the minimal operational frequency of server S(v_j) | Equation (6)
T | the set of module applications within a workflow, denoted as W | Equation (7)
C_1 | the computational complexity related to the data input size | Equation (7)
C_2 | the computational complexity itself | Equation (7)
U_j^i(G) | the current member position within the foal and mare group G | Equation (9)
S_j | the position of the stallion | Equation (9)
R | a random number uniformly selected from the range −2 to 2 | Equation (9)
α_P | the adaptive process evaluated | Equation (9)
V | the vector encompassing values from 0 to 1 | Equation (10)
α_{R1} | an arbitrary value selected within the range 0 to 1 | Equation (10)
α_{R2} | a random value obtained uniformly from the interval 0 to 1 | Equation (10)
α_{R3} | an arbitrary value selected within the range 0 to 1 | Equation (10)
itr | the number of iterations | Equation (11)
max(itr) | the upper limit on the number of iterations | Equation (11)
L_δ^x | the present position of horse x within the group δ | Equation (12)
L_γ^y | the current location of horse y in the group γ | Equation (12)
L_ξ^z | the current position of horse z within the group ξ | Equation (12)
S_{j+1} | the subsequent position of the stallion | Equation (13)
S_j | the current position of the stallion | Equation (13)
WH | the location of the waterhole | Equation (13)
U_j(G) | the position of the jth member within the group | Equation (14)
ζ | the scaling weight factor | Equation (14)
OPT | the best solution | Equation (14)
LV(λ) | the exponent of the Lévy flight | Equation (14)
f_j | the current working frequency of the server β_j | Equation (23)
f_{i,j} | the common working frequency used by servers β_i and β_j | Equation (24)
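To tie this notation together, the following Python sketch shows the general shape of a wild horse optimization loop hybridized with Lévy flights, in the spirit of Equations (9) through (14). It is an illustrative outline rather than a line-for-line rendering of our model: the population size, iteration budget, scaling weight ζ, and sphere objective are all placeholder assumptions.

```python
import numpy as np

def levy_step(dim, lam=1.5, rng=None):
    """Mantegna-style Levy-flight step with exponent lam (the LV(lambda) term)."""
    from math import gamma, sin, pi
    rng = rng if rng is not None else np.random.default_rng()
    sigma = (gamma(1 + lam) * sin(pi * lam / 2)
             / (gamma((1 + lam) / 2) * lam * 2 ** ((lam - 1) / 2))) ** (1 / lam)
    u = rng.normal(0.0, sigma, dim)
    v = rng.normal(0.0, 1.0, dim)
    return u / np.abs(v) ** (1 / lam)

def who_levy(cost, dim, n=30, iters=200, lo=0.0, hi=1.0, zeta=0.01, seed=0):
    """Sketch of wild horse optimization hybridized with Levy flights."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(lo, hi, (n, dim))
    fit = np.apply_along_axis(cost, 1, pop)
    best = pop[fit.argmin()].copy()
    for itr in range(iters):
        tdr = 1.0 - itr / iters              # decaying adaptive factor (Eq. 11 analog)
        stallion = pop[fit.argmin()].copy()  # group leader (S_j)
        for i in range(n):
            R = rng.uniform(-2.0, 2.0)       # R drawn from [-2, 2] (Eq. 9)
            idx = rng.random(dim) < tdr      # adaptive mask (Eq. 10 analog)
            z = np.where(idx, rng.random(dim), rng.random(dim))
            # grazing move around the stallion (Eq. 9 analog)
            cand = 2.0 * z * np.cos(2.0 * np.pi * R * z) * (stallion - pop[i]) + stallion
            # Levy perturbation around the incumbent best (Eq. 14 analog)
            cand = np.clip(cand + zeta * levy_step(dim, rng=rng) * (cand - best), lo, hi)
            f = cost(cand)
            if f < fit[i]:
                pop[i], fit[i] = cand, f
        best = pop[fit.argmin()].copy()
    return best, fit.min()

# Example: minimize a sphere function as a stand-in for the placement cost.
b, fb = who_levy(lambda x: float(np.sum(x ** 2)), dim=5, lo=-5, hi=5)
print(b, fb)
```

In the full scheduler, the decision vector would encode the module-to-server placement and the objective would score reliability under the EED constraint; the sphere function merely stands in for that cost.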

Figure 1. The Proposed WSHO System Framework.
Figure 2. Finish time evaluation measurement versus (small to large) workloads. (a) Impact of workload (S-to-M) on the finish time; (b) impact of workload (M-to-L) on the finish time.
Figure 3. Reliability evaluation measurement versus (small to large) workloads. (a) Impact of workload (S-to-M) on the reliability; (b) impact of workload (M-to-L) on the reliability.
Figure 4. Energy evaluation measurement versus (small to large) workloads. (a) Impact of workload (S-to-M) on the energy; (b) impact of workload (M-to-L) on the energy.
Figure 5. Throughput evaluation measurement versus (small to large) workloads. (a) Impact of workload (S-to-M) on the throughput; (b) impact of workload (M-to-L) on the throughput.
Figure 6. Finish time of LIGO and SIPHT workflow application weights for different comparison algorithms. (a) LIGO: finish time vs. application size; (b) SIPHT: finish time vs. application size.
Figure 7. Reliability of LIGO and SIPHT workflow application weights for different comparison algorithms. (a) LIGO: reliability vs. application size; (b) SIPHT: reliability vs. application size.
Table 1. Configurations of the Simulation-based Approach.

Component | Measurement
Programming language | Java-based language
Round termination time | ≈30 s
Application instance type | Variant mode
Number of applications (S-to-M) | 100–500
Processing server count | 100 Hst.
Computing ability | 500–2700 MIPS
Running system | Windows system ver. 11
Scenario 1—Workload | Light-to-Medium
Virtual machine count | 1000 VMs
Processor | Intel Core i5 ver. 12
Computational failure rate | 0.01–0.000005
Number of applications (M-to-L) | 100–500
Scenario 2—Workload | Medium-to-Heavy
Tuple problem size | [M_a, |D_e|]
Installed memory | 8.0 GB
Closing time | 8 s
Software package | CloudSim ver. 5.0
Link delay | 0–10 UoT
Cycle required time | 7330 s
Service handling time | 0.410–0.970 s
Computing power | 600–2000 MIPS
Round initiate time | ≈145 s
Booting time | 100 s
Network failure rate | 0.0005–0.000005