Makespan Optimisation in Cloudlet Scheduling with Improved DQN Algorithm in Cloud Computing

Although cloud service providers increasingly adopt advanced cloud infrastructure management, substantial execution time is still lost to low server usage. Given the importance of reducing total execution time (makespan), a vital metric for cloud service providers while sustaining Quality-of-Service (QoS), this study established an enhanced scheduling algorithm, MCS-DQN, that minimises cloudlet scheduling (CS) makespan with the deep Q-network (DQN) algorithm. A novel reward function was proposed to improve the convergence of the DQN model. Additionally, an open-source simulator (CloudSim) was employed to assess the performance of the suggested work. The recommended MCS-DQN scheduler achieved the best outcomes in minimising the makespan metric and related metrics (task waiting time, resource usage of the virtual machines, and the degree of imbalance) against the compared algorithms.


Introduction
Cloud computing denotes an established shared-computing technology that dynamically delivers measurable on-demand services over the global network [1]. Essentially, cloud computing offers users limitless and diverse virtual resources that can be obtained on demand under different billing standards (subscription and static-oriented) [2]. CS (task scheduling or TS) outlines the mapping of independent tasks onto a set of obtainable resources within a cloud context (for workflow applications) for execution within users' specified QoS restrictions (makespan and cost). Workflows (common applications associated with empirical studies involving astronomy, earthquakes, and biology) are migrated to the cloud for execution. Although identifying the optimal resource for every workflow task (to fulfil user-defined QoS) has been widely studied over the years, substantial intricacies require further research: (1) TS on a cloud computing platform is an acknowledged NP-hard problem; (2) multiple TS optimisation objectives are evident, such as completion time reduction and high resource usage for the entire task queue; (3) the dynamics, measurability, and heterogeneity of cloud resources result in high complexity. Recent research has enhanced TS in cloud environments through artificial intelligence algorithms, particularly metaheuristics with TS capacities such as particle swarm optimisation (PSO) [3], ant colony optimisation, and the genetic algorithm (GA). However, this article does not rely on these algorithms; instead, it proposes a viable alternative and compares it with PSO, a widely used technique in the task scheduling area. The primary objective of the proposed method is a novel DQN scheduler that achieves optimal outcomes on TS measures (waiting time, makespan reduction, and enhanced resource usage).
The remaining sections are arranged as follows: Section 2 outlines pertinent literary works, Section 3 presents the DQN algorithm, Section 4 highlights the recommended work, Section 5 explains the experimental setup and simulation outcomes, and Section 6 offers the conclusion.

Related Work
In cloud computing, TS (job scheduling or resource selection) is one of the most substantial complexities and has garnered the attention of cloud service providers and customers. Additionally, specific studies on TS intricacies have reflected positive outcomes. The literature's research accomplishments in cloud resource scheduling can be divided into the following categories based on the techniques used.

Heuristic-Based Research.
Heuristic algorithms, including metaheuristic ones [4] developed from intuition or experiment, offer a potential alternative for the affordable resolution of every optimisation instance. Given the unpredictable degree of variance between optimal and viable alternatives, past studies selected metaheuristic algorithms such as PSO [5], GA [6], and ACO [7] to derive an optimal TS policy in cloud computing. Huang et al. [8] recommended a PSO-based scheduler with a logarithm-reducing approach for makespan optimisation and achieved higher performance than other heuristic algorithms. Meanwhile, Liang et al. [9] suggested a TS approach following PSO in cloud computing that omits some inferior particles to accelerate the convergence rate and dynamically adjusts the PSO parameters; the experimental findings revealed that this PSO algorithm obtained improved outcomes compared to its counterparts. A proposed shift in genetic algorithm crossover and mutation operators yielded flexible genetic algorithm operators (FGAO) [10]; the FGAO algorithm minimised execution time and iterations compared to GA. Furthermore, Musa et al. [11] recommended an improved GA-PSO hybrid that applies small position values (SPV) to the initial population to diverge from arbitrariness and enhance convergence speed; the improved hybrid reflected more valuable outcomes than the conventional GA-PSO algorithm in resource usage and makespan. Yi et al. [12] recommended a task scheduler model following an enhanced ant colony algorithm under the cyber-physical system (CPS); the numerical simulation implied that the model resolved local searchability and TS quality concerns. Peng et al. [13] proposed a scheduling algorithm based on two-phase best heuristic scheduling in cloud computing to reduce the makespan and energy storage metrics.
The authors of [14] suggested a VM clustering technique that allocates VMs based on the duration of the requested task and the bandwidth level in order to improve efficiency, availability, and other factors such as VM utilisation, bucket size, and task execution time.
Sun and Qi [15] proposed a hybrid task scheduler based on local search and differential evolution (DE) to enhance the makespan and cost metrics. The authors of [16] presented a parallel optimised relay selection protocol to minimise latency, collision, and energy for wake-up radio-enabled WSNs.

Reinforcement Learning-Based Research.
Reinforcement learning (RL) is a machine-learning category that communicates with a specified context through consecutive trials to reach an optimal TS method. Recently, RL has garnered much attention in cloud computing. For example, a higher TS success rate and minimal delay and energy consumption were attained in [17] by recommending a Q-learning-oriented and flexible TS from a global viewpoint (QFTS-GV). In [18], Ding et al. recommended a task scheduler using Q-learning for energy-efficient cloud computing (QEEC). QEEC proved the most energy-efficient task scheduler compared to its counterparts, primarily catalysed by the M/M/S queueing model and the Q-learning method. In [19], a TS algorithm with Q-learning was proposed for wireless sensor networks (WSN) to establish Q-learning scheduling on time division multiple access (QS-TDMA); the results implied QS-TDMA to be a near-optimal TS algorithm that potentially enhances real-time WSN performance. In [20], Che et al. recommended a novel TS model with a deep RL (DRL) algorithm that incorporates TS into resource-utilisation (RU) optimisation; evaluated against conventional TS algorithms on real datasets, the model demonstrated higher performance on the defined metrics. Another task scheduler under a DRL architecture (RLTS) was suggested by Dong et al. [21] for minimal task execution time with dynamic links to cloud servers; compared against four heuristic counterparts, RLTS could efficiently resolve TS in a cloud manufacturing setting. In [22], a cloud-edge collaboration scheduler was constructed following the asynchronous advantage actor-critic (CECS-A3C).
The simulation outcomes demonstrated that the CECS-A3C algorithm decreased the task processing period compared to the existing DQN and RL-G algorithms. The authors of [23] suggested a learning-based approach built on the deep deterministic policy gradient algorithm to improve the fog resource provisioning performance of mobile devices. Wang et al. [24] introduced an adaptive data placement architecture that can modify the data placement strategy based on LSTM and Q-learning to maximise data availability while minimising overall cost. The authors in [25] presented a hybrid deep neural network scheduler to solve task scheduling issues and minimise the makespan metric. Wu et al. [26] utilised DRL to address scheduling in edge computing, enhancing the quality of the services offered to consumers in IoT apps. The authors of [27] applied a DQN model in a multiagent reinforcement learning setting to control task scheduling over cloud computing.

The RL.
The RL theory was inspired by the psychological and neuroscientific viewpoints of human behaviour [28]: an agent contextually selects a pertinent action (from a set of actions) to maximise the cumulative reward. Although a trial-and-error approach is initially utilised for goal attainment (RL is not offered a direct path), experience is eventually employed towards an optimal path. The agent determines the most appropriate action solely from the current condition, as in a Markov decision process [29]. Figure 1 presents a pictorial RL representation where the RL model encompasses the following elements [30]: (1) a set of environment and agent states (S); (2) a set of actions (A) of the agent; (3) policies for transitioning from states to actions; (4) rules that identify the immediate scalar reward of a transition; (5) rules that outline agent perception.

The Q-Learning.
Q-learning is one of the solutions for the reinforcement problem in polynomial time. As Q-learning can manage problems involving stochastic transitions and rewards without requiring action adaptations or probabilities at a specific point, the technique is also known as a "model-free" approach. Although RL proved successful in different domains (e.g., game playing), it was previously restricted to low-dimensional state spaces or domains with manual feature assignation. Equation (1) presents the Q-value computation, Q(S_t, a_t) <- Q(S_t, a_t) + α[r_t + γ max_a Q(S_{t+1}, a) − Q(S_t, a_t)], where S denotes the actual, immediate agent situation, α is the learning rate, γ is the discount factor, and Q(S_t, a_t) denotes the Q-value of attaining state S by taking action a. Specifically, reinforcement begins with trial and error followed by post-training experience (the decisions correspond to policy values that result in high rewards).
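To make the update in equation (1) concrete, here is a minimal tabular Q-learning sketch. It is illustrative only (not the paper's implementation): the states, actions, reward, and hyperparameter values are assumptions.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # Q(S_t, a_t) <- Q(S_t, a_t) + alpha * (r + gamma * max_a' Q(S_{t+1}, a') - Q(S_t, a_t))
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

Q = defaultdict(float)                       # unseen (state, action) pairs start at 0
q_update(Q, 0, 1, 1.0, 1, actions=[0, 1])    # first visit: only alpha * r contributes
```

After one update with reward 1.0 and an untrained table, Q[(0, 1)] equals alpha * 1.0 = 0.1, which is exactly the trial-and-error bootstrapping the paragraph describes.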

The DQN Architecture.
Training encompassed specific parameters [s_t, a, r, s_{t+1}, done] that were stored as agent experiences: s_t implied the current state, a the action, r the reward, s_{t+1} the following state, and done a Boolean value identifying goal attainment. The initial idea was to use the state and action as the neural network input, with the output denoting a value representing how well that action would perform within the given state (see Figure 2).

Experience Replay.
Experience replay [31] highlights the capacity to learn from mistakes and adjust rather than repeating the same errors. As above, the parameters [s_t, a, r, s_{t+1}, done] were stored as agent experiences. All experiences were stored in a fixed-size memory without being linked to values (raw data input for the neural network). Once the memory reached a saturation point during the training process, arbitrary batches of a specific size were chosen from it. Upon the insertion of new experiences, old experiences were eliminated once the memory became full. In this vein, experience replay deterred overfitting problems. Notably, the same data could be utilised multiple times for network training to resolve insufficient training data.
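The fixed-size, oldest-first-evicting memory described above can be sketched as follows. This is an illustrative implementation, not the authors' code; the capacity and tuple layout are assumptions.

```python
from collections import deque
import random

class ReplayBuffer:
    """Fixed-size memory of (s, a, r, s_next, done) experiences.
    When full, the oldest experience is dropped, as described above."""
    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)   # deque evicts the oldest item automatically
    def store(self, s, a, r, s_next, done):
        self.memory.append((s, a, r, s_next, done))
    def sample(self, batch_size):
        # arbitrary batches break the correlation between consecutive experiences
        return random.sample(list(self.memory), min(batch_size, len(self.memory)))
    def __len__(self):
        return len(self.memory)

buf = ReplayBuffer(capacity=3)
for t in range(5):                  # insert 5 experiences into a capacity-3 memory
    buf.store(t, 0, 1.0, t + 1, False)
```

After the five insertions only the three newest experiences remain, which is the FIFO eviction behaviour the section (and Step 9 of the training workflow) relies on.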

The TS Problem.
The TS protocol in cloud computing is one of the vital problem-solving mechanisms for the significant overlap between cloud provider and user needs, including QoS and high profit [32]. Cloud service providers strive to attain optimal virtual machine (VM) group usage through reduced makespan and waiting time. Following Figure 3, a large set of autonomous tasks with varying parameters is submitted by multiple users (to be managed by cloud providers in a cloud computing setting); the cloud broker performs task delegation to the current VMs [33]. Different optimisation algorithms are employed to attain optimal VM utilisation. Notably, equation (2) computes the overall execution time (makespan) as makespan = max_j Ex_vm_j, where Ex_vm_j = Σ_{i=0}^{n} Ex_j(Cloudlet_i). Here, Ex_j(Cloudlet_i) denotes the execution time of cloudlet_i on vm_j [34], n is the total number of cloudlets, and Ex_vm_j is the complete execution time of the set of cloudlets assigned to vm_j. Figure 4 presents an example of the first-come first-served (FCFS) scheduling process with 2 virtual machines and 7 tasks, where every task has a different length in time units. Notably, the makespan denotes the largest execution time among the VMs; here the makespan (computed on VM2) was 45.
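Equation (2) amounts to summing the cloudlet execution times per VM and taking the maximum over VMs. A small sketch with hypothetical assignments (the numbers below are illustrative, not the data of Figure 4):

```python
def makespan(assignments):
    """assignments: {vm_id: [cloudlet execution times on that VM]}.
    Ex_vm_j is the sum of its cloudlets' times; makespan is the largest Ex_vm_j."""
    return max(sum(times) for times in assignments.values())

# Hypothetical FCFS-style placement of 7 tasks onto 2 VMs:
plan = {1: [10, 12, 8], 2: [20, 15, 10, 5]}   # Ex_vm1 = 30, Ex_vm2 = 50
```

With this hypothetical plan, VM2 dominates and the makespan is 50; reducing the makespan means rebalancing cloudlets away from the most loaded VM.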

Environment Definition.
This study regarded a system with multiple virtual machines and cloudlets. Every VM encompassed specific attributes (processing power in MIPS, memory in GB, and bandwidth in GB/s). As users submitted distinct cloudlets that arrived in a queue, the broker implemented the defined scheduling algorithm to assign every cloudlet to an adequate VM. Because the broker's scheduling algorithm had to make an assignment decision for every cloudlet taken from the queue, the system state changed in line with each decision. Figure 5 presents the scheduling of a cloudlet with a length of 3 to VM2.

State Space.
Only the time taken by each virtual machine during the execution of a set of tasks was regarded in this study to identify the defined system state. The time counted on every virtual machine is the total running time of the cloudlets on that VM. The virtual machine running times facilitated makespan computation after each new cloudlet delegation, whereby the system state changed. In this vein, the system state at time t with n VMs was given by the per-VM total times {T_1, ..., T_n}, where T_i = Σ_{j=1}^{k_i} Ex_i(Cloudlet_j), Ex_i(Cloudlet_j) denotes the run time of cloudlet_j on VM_i, and k_i is the total number of cloudlets on VM_i. Figure 5 presents the t state as {9, 7, 11} and the t + 1 state as {9, 10, 11}.

Action Space.
Available agent actions were defined in the action space. The broker scheduling algorithm was required to choose one VM from all current VMs to schedule the existing task from the queue. The agent takes an action in a space with the same dimension as the number of VMs, so the action space denotes all VMs in the system. With n VMs, the action space is outlined by {1, ..., i, ..., n}, wherein i denotes the VM index conceded by the scheduler for cloudlet assignment. In Figure 5, the action space is {1, 2, 3}, while the chosen action is 2.
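The state transition implied by an action can be sketched directly from the Figure 5 example: assigning a length-3 cloudlet to VM2 moves the state {9, 7, 11} to {9, 10, 11}. A minimal sketch (helper name and 1-based action indexing are assumptions):

```python
def step(state, action, cloudlet_length):
    """state: list of per-VM total running times; action: 1-based VM index.
    Assigning the cloudlet adds its execution time to the chosen VM."""
    next_state = list(state)            # copy so the previous state is preserved
    next_state[action - 1] += cloudlet_length
    return next_state

s_t = [9, 7, 11]                                  # three VMs, as in Figure 5
s_t1 = step(s_t, action=2, cloudlet_length=3)     # schedule the cloudlet to VM2
```

The resulting s_t1 is [9, 10, 11], matching the t + 1 state quoted in the text.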

Model Training.
The MCS-DQN model was retrained each episode in line with the workflow in Figure 6 as follows:
Step 1: the environment and agent contexts were established, including server, virtual machine, and cloudlet attributes.
Step 2: the environment state and cloudlet queues were reset.
Step 3: the next cloudlet was selected from cloudlet queues.
Step 4: the agent selected the following action in line with the existing environment state under the ε factor. Essentially, the ε factor (exploration rate) influenced the choice between exploration and exploitation in every iteration: the probability of the agent arbitrarily choosing a VM (exploration) was ε, while the probability of the agent choosing a VM under the model (exploitation) was 1 − ε. The ε factor (initialised to one) was reduced in every iteration following a decay factor.
Step 5: the environment state was updated by adding the cloudlet execution time to the chosen VM.
Step 6: the environment produced a reward under the recommended reward function in the following subsection.
Step 7: the agent saved the played experience into the experience replay queue.
Step 8: upon experience storage, the algorithm checked for more cloudlets to schedule (repeating from Step 3 if more cloudlets remained).
Step 9: the model was retrained every episode (upon completing the cloudlet queue) with a batch of defined cloudlets from the experience queue. The experience replay queue was applied as a FIFO queue; the oldest experience was omitted when the queue reached its limit.
Step 10: the algorithm was repeated from Step 2 if the number of iterations was yet to reach the predefined episode limits.
Step 11: the trained MCS-DQN model was saved and exited.
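Steps 3-10 above can be sketched as a loop with ε-greedy action selection and decay. This is a hedged outline only: the decay rate, minimum ε, dummy cloudlet lengths, and the placeholder Q-values are assumptions, and the network, environment update, and replay training are elided.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Explore (random VM) with probability epsilon, otherwise exploit the
    model's best VM (Step 4); epsilon decays towards eps_min each iteration."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # exploration
    return max(range(len(q_values)), key=q_values.__getitem__)    # exploitation

epsilon, decay, eps_min = 1.0, 0.995, 0.01        # assumed values for the sketch
for episode in range(3):                          # Step 2: reset per episode
    for cloudlet in [4, 2, 7]:                    # Step 3: next cloudlet from the queue
        q_values = [0.0, 0.0, 0.0]                # placeholder for the network's output
        a = epsilon_greedy(q_values, epsilon)     # Step 4: choose a VM index
        # Steps 5-7 (state update, reward, experience storage) elided here
        epsilon = max(eps_min, epsilon * decay)   # decay the exploration rate
```

Because ε starts at 1, early iterations are almost purely exploratory, and the model increasingly drives the choice as ε decays across episodes.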

Reward Function.
The recommended reward function was utilised with the MCS-DQN model in Algorithm 1. The makespan of every potential scheduling was first computed.

Scientific Programming
Every VM was subsequently ranked following the makespan computation during CS. A simple example presents the recommended MCS-DQN reward function (see Figure 7). The example encompasses the reward computation for a specific VM state (elaborated from the total execution time on every VM_i). Given five VMs with the set of cloudlet execution times {9, 7, 11, 8, 8}, a newly arrived cloudlet with a length of five was scheduled to VM2 in the example (see Figure 7(a)) by iterating over the VMs, creating a copy of the VM state in every iteration, adding the cloudlet to the VM chosen in that iteration, and computing the resulting makespan. Figure 7(b) presents the first iteration, where the arrived cloudlet was added to VM1. Figure 7(c) presents the computed makespans; for example, the makespan would be 14 when the cloudlet was added to VM1 in the first iteration, 13 when added to VM2, and so on. The computed makespans were then ranked by sorting them from the lowest value (see Figure 7(d)) and assigning the highest score to the lowest makespan, with the score decreasing for each following makespan (see Figure 7(e)). Lastly, the corresponding reward was identified from the makespan rank of the VM to be scheduled; in this example, VM2 received a reward of 2.
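The ranking above can be sketched as follows. This is an illustrative reading of the described reward, not the authors' Algorithm 1; in particular, how ties between equal makespans are broken is an assumption, so scores on tied placements may differ from Figure 7.

```python
def reward_for(vm_times, vm_index, cloudlet_length):
    """Rank every candidate placement by the makespan it would produce and
    return the rank score of the VM actually chosen (highest score goes to
    the lowest makespan). vm_index is 1-based, as in the action space."""
    makespans = []
    for i in range(len(vm_times)):
        trial = list(vm_times)          # copy the VM state in every iteration
        trial[i] += cloudlet_length     # place the cloudlet on VM i
        makespans.append(max(trial))    # makespan after the trial placement
    # sort placements from lowest makespan; assign decreasing scores
    order = sorted(range(len(makespans)), key=makespans.__getitem__)
    scores = [0] * len(makespans)
    for score, i in zip(range(len(makespans) - 1, -1, -1), order):
        scores[i] = score
    return scores[vm_index - 1]

r = reward_for([9, 7, 11], vm_index=2, cloudlet_length=3)
```

With three VMs at {9, 7, 11} and a length-3 cloudlet, VM2 yields the lowest trial makespan (11) and therefore the highest score, 2, while the worst placement (VM3) scores 0.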

Experimental Setup.
The recommended model trained under deep Q-learning was assessed against the FCFS and PSO algorithms with the CloudSim simulator.

CloudSim Parameters.
CloudSim is a modular simulation toolkit for modelling and simulating cloud computing systems and application provisioning environments [33]. It enables the modelling of cloud system components such as data centres, virtual machines (VMs), and resource provisioning rules at both a system and a behavioural level [33]. The CloudSim configuration in the implementation began by establishing one data centre, two hosts, and five VMs with the parameters in Table 1. This configuration is taken from example 6 of the CloudSim source code available on GitHub (CloudSim codebase: https://github.com/Cloudslab/cloudsim), which is based on real server and VM information. At the VM level, a time-shared policy (one of the two scheduling algorithms available in CloudSim) was selected; the time-shared policy allows VMs and cloudlets to multitask and progress immediately within the host. Moreover, the task data used in the experiments are real-world workloads of real computer systems recorded by the High-Performance Computing Center North (HPC2N) in Sweden (the HPC2N data: https://www.cse.huji.ac.il/labs/parallel/workload/l_hpc2n/). The data contain information about tasks such as the number of processors, the average CPU time, the used memory, and other task specifications. The tasks utilised from this workload completely differ from the independent tasks employed to train the model.

The MCS-DQN Model Parameters.
The MCS-DQN model employs a neural network with five fully connected layers (see Figure 8): an input layer (for the state), three hidden layers (64 × 128 × 128), and an output layer (for the actions). The network was taken from an original Keras RL tutorial [35] and modified to fit our defined environment. The training was executed following the parameters in Table 2; these parameters were obtained through specific training-process executions (for a high score in queue scheduling).

PSO Parameters.
The PSO algorithm was applied following the version recommended in [5] with 1000 iterations, 500 particles, local weights (c1 and c2) both set to 1.49445, and a fixed inertia weight of 0.9. In Figure 9, the MCS-DQN agent's average assessment score is reported over 800 episodes. Perceivably, learning remained steady after approximately 800 training iterations. The evolution of the ε parameter of the ε-greedy exploration method during training is also shown. Given the increase in agent scores once ε began decaying, MCS-DQN could already generate sufficiently good Q-value estimates for more thoughtful state and action exploration, accelerating the agent's learning process.
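For reference, the canonical PSO velocity and position update that these parameters feed into is sketched below. This is a generic single-dimension update with the stated w, c1, and c2 values, not the full scheduler of [5]; the particle values are illustrative.

```python
import random

W, C1, C2 = 0.9, 1.49445, 1.49445   # inertia and local weights from the setup

def pso_step(x, v, pbest, gbest):
    """Canonical PSO update for one particle dimension:
    v <- w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x);  x <- x + v."""
    r1, r2 = random.random(), random.random()   # fresh random factors per step
    v = W * v + C1 * r1 * (pbest - x) + C2 * r2 * (gbest - x)
    return x + v, v

x, v = pso_step(x=2.0, v=0.5, pbest=3.0, gbest=5.0)
```

Each particle is pulled stochastically towards its personal best and the swarm's global best, with the inertia weight damping the previous velocity.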

Experimental Results and Analysis.
After the training process, various cloudlet sets were executed with the saved MCS-DQN scheduler model, FCFS, and PSO algorithms to assess every metric. As every cloudlet of the same set was simultaneously executed, this study essentially emphasised the makespan metric (the elapsed time when simultaneously executing cloudlet groups on the available VMs). Figure 10 presents the reduced makespan of this research compared to the other algorithms.
The makespan metric (employed as the primary model-training objective) impacted other performance metrics: (1) The degree of imbalance (DI) metric demonstrates load-balancing between VMs. Specifically, DI computes the incongruence between VMs when simultaneously executing a set of cloudlets; a reduced DI indicates a more balanced system. Equation (3), DI = (E_max − E_min)/E_avg, was employed in this research to calculate the DI metric.
Specifically, E_avg, E_min, and E_max are the average, minimum, and maximum total execution times over all VMs [34]. Figure 11 presents the recommended MCS-DQN scheduler minimising the DI metric on every utilised set of cloudlets, yielding an enhanced load-balancing system. (2) For the waiting time (WT) metric, the cloudlets arrived in the queue and were executed following the scheduling algorithm. The average waiting time over the whole cloudlet sequence was computed as in equation (4), WT_avg = (Σ_{i=1}^{n} WT_i)/n, where WT_i denotes the waiting time of cloudlet i and n is the queue length. Following Figure 12, the recommended MCS-DQN scheduler could efficiently heighten cloudlet queue management speed and effectiveness by reducing cloudlet waiting time and queue length. (3) The RU metric proved vital for elevated resource utilisation in the CS process. Equation (5), RU = (Σ_{i=1}^{N} ET_VM_i)/(makespan × N), was employed to compute the average RU [34].
Specifically, ET_VM_i denotes the time VM_i took to complete all its cloudlets, while N is the number of resources. In Figure 13, the recommended MCS-DQN scheduler improved on PSO and FCFS regarding RU. Specifically, the MCS-DQN scheduler kept resources busy during CS, as service providers intend to earn high profits by renting limited resources.
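Because equations (3)-(5) are not fully legible in the source, the sketch below uses the standard forms of these metrics from the TS literature, consistent with the symbol definitions above; treat the exact formulas as assumptions.

```python
def degree_of_imbalance(vm_times):
    """DI = (E_max - E_min) / E_avg over the VMs' total execution times."""
    e_avg = sum(vm_times) / len(vm_times)
    return (max(vm_times) - min(vm_times)) / e_avg

def average_waiting_time(waits):
    """Mean of the per-cloudlet waiting times WT_i over a queue of length n."""
    return sum(waits) / len(waits)

def average_resource_utilisation(vm_times):
    """RU = sum(ET_VM_i) / (makespan * N); 1.0 means every VM stayed busy
    for the whole makespan."""
    return sum(vm_times) / (max(vm_times) * len(vm_times))
```

For instance, VM completion times of [30, 15, 15] give an RU of 2/3: two VMs sit idle for half the makespan, which both the DI and RU metrics penalise.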
Furthermore, to prove the effectiveness of our proposed work, more executions based on the same previous VM configurations were conducted. Figure 14 illustrates the results of these executions, where we increased the number of virtual machines to 10, 15, 20, and 30, respectively. For each set of VMs, we scheduled 60, 140, and 200 tasks, respectively. These experiments targeted the makespan, since it is our main metric, and were compared with PSO, the scheduling algorithm chosen in this work. We notice that our proposed MCS-DQN algorithm still outperforms the PSO scheduler in these additional experiments.
However, our suggested approach is restricted to a set number of virtual machines, and any change in the number of virtual machines requires retraining the model. We intend to concentrate on variable-length output prediction in the future so that the number of VMs does not impact the model and no retraining is necessary for every change in VMs.

Conclusion
This study encompassed an effective CS application using deep Q-learning in cloud computing. Additionally, the MCS-DQN scheduler was recommended for TS problem enhancement and metric optimisation. The simulation outcomes revealed that the presented work attained optimal performance in minimal waiting time and makespan and maximum resource employment. Additionally, the recommended algorithm regarded load-balancing when distributing cloudlets to the current resources, beyond the PSO and FCFS algorithms. This proposed model can be applied to solve task scheduling problems in cloud computing, specifically in the cloud broker. To address the limitation of fixed VMs, we plan to enhance our work by relying on variable-length output prediction using dynamic neural networks to include various VM sizes, as well as adding other optimisation approaches that take into account more efficiency metrics such as task priority, VM migration, and energy consumption. Furthermore, assuming that n tasks are scheduled to m fog computing resources, we can adjust the proposed algorithm to work on edge computing; this may also be an idea for future work.

Conflicts of Interest
The authors declare no conflicts of interest.

Authors' Contributions
All of the authors contributed to the article's development, including information gathering, editing, modelling, and reviewing. The final manuscript was reviewed and approved by all of the authors.
