Deep Reinforcement Learning-based Scheduling for Optimizing System Load and Response Time in Edge and Fog Computing Environments

Edge/fog computing, as a distributed computing paradigm, satisfies the low-latency requirements of an ever-increasing number of IoT applications and has become the mainstream computing paradigm behind them. However, because a large number of IoT applications require execution on edge/fog resources, the servers may become overloaded, which can disrupt the edge/fog servers and negatively affect the response time of IoT applications. Moreover, many IoT applications are composed of dependent components, which incur extra constraints on their execution. Besides, edge/fog computing environments and IoT applications are inherently dynamic and stochastic. Thus, efficient and adaptive scheduling of IoT applications in heterogeneous edge/fog computing environments is of paramount importance. However, the limited computational resources of edge/fog servers impose an extra burden on applying optimal but computationally demanding techniques. To overcome these challenges, we propose a Deep Reinforcement Learning-based IoT application Scheduling algorithm, called DRLIS, to adaptively and efficiently optimize the response time of heterogeneous IoT applications and balance the load of the edge/fog servers. We implemented DRLIS as a practical scheduler in the FogBus2 function-as-a-service framework for creating an edge-fog-cloud integrated serverless computing environment. Results obtained from extensive experiments show that DRLIS significantly reduces the execution cost of IoT applications by up to 55%, 37%, and 50% in terms of load balancing, response time, and weighted cost, respectively, compared with metaheuristic algorithms and other reinforcement learning techniques.


Introduction
The past few years have witnessed the rapid rise of the Internet of Things (IoT) industry, enabling the connection of people to things and things to things, and facilitating the digitization of the physical world [1]. Meanwhile, with the explosive growth of IoT devices and various applications, the expectation for stability and low latency is higher than ever [2]. As the main enabler of IoT, cloud computing stores and processes data and information generated by IoT devices. Leveraging powerful computing capabilities and advanced storage technologies, cloud computing ensures the security and reliability of stored information. However, servers in the cloud computing paradigm are usually located at a long physical distance from IoT devices, and the high latency caused by this distance cannot efficiently satisfy real-time IoT applications. Prompted by these issues, edge and fog computing have emerged as popular computing paradigms in the IoT context. Although some researchers use the terms edge computing and fog computing interchangeably, we clearly distinguish them in this paper. We refer to the case that uses only edge resources for real-time IoT applications as edge computing, and to the case that uses edge resources and, whenever necessary, also utilizes cloud resources in a seamless manner as fog computing.
Edge computing, as a decentralized computing architecture, brings processing, storage, and intelligent control to the vicinity of IoT devices [3]. This flexible architecture extends cloud computing services to the edge of the network. In contrast, the fog computing paradigm inherits the advantages of both cloud and edge computing [4]: it not only provides powerful computational capabilities but also reduces the need to transfer data to the cloud for processing, analysis, and storage, thus reducing the inter-network distance. In the real world, edge and fog computing provide strong support for innovation and development in various fields. For example, in the field of smart healthcare, edge computing nodes deployed on wearable and medical devices can monitor patients' physiological parameters in real time and transmit the data to the cloud for analysis and diagnosis, realizing telemedicine and personalized medicine [5]; in the field of autonomous driving, edge computing nodes deployed on self-driving vehicles can perform real-time sensing and decision processing, enabling shorter response time and improving driving safety [6].
However, the massive growth in the number of IoT applications and servers in fog computing environments also creates new challenges. Firstly, the execution time is expected to be minimized [7], which means that the applications should be processed by the best (i.e., the most powerful and physically closest) server. Besides, the load should ideally be balanced and distributed across multiple operating units. For example, by distributing requests across multiple servers in a seamless manner (as in serverless computing environments), load balancing can avoid overloading individual servers and ensure that each server handles a moderate load. This improves response times, overall system performance, and throughput, and also helps servers run more consistently. Therefore, improving the load balancing level of servers (i.e., lowering the variance of server resource utilization) while reducing the response time becomes an important but challenging problem for scheduling IoT applications on servers in edge/fog computing environments. Since this is an NP-hard problem, metaheuristic and rule-based solutions can be considered [8], [9]. However, these approaches often rely on omniscient knowledge of global information and assume that the designer of the solution has control over changes in the environment. In the fog computing environment, there is often no regularity in server performance, utilization, and downtime. The number of IoT applications and the corresponding resource requirements are even closer to random. Besides, in reality, Directed Acyclic Graphs (DAGs) are often used to model IoT applications [10], where nodes represent tasks and edges represent data communication between dependent tasks. The dependency among tasks introduces higher complexity in scheduling applications. Therefore, metaheuristic and rule-based solutions cannot efficiently cope with the IoT application scheduling problem in fog computing environments.
Deep Reinforcement Learning (DRL) is the product of combining deep learning with reinforcement learning, integrating the powerful understanding of deep learning on perceptual problems with the decision-making capabilities of reinforcement learning. In deep reinforcement learning, the agent continuously interacts with the environment, recording a large number of empirical trajectories (i.e., sequences of states, actions, and rewards), which are used in the training phase to learn optimal policies. In contrast to metaheuristic algorithms, agents in deep reinforcement learning are able to autonomously sense and respond to changes in the environment, which allows deep reinforcement learning to solve complex problems in realistic scenarios. However, due to the limited computational resources of devices in fog computing environments [11], the computational requirements of complex Deep Neural Networks (DNNs) are often not supported [12]. Therefore, how to balance implementation simplicity, sample complexity, and solution performance becomes a key research problem in applying deep reinforcement learning to fog computing environments to cope with complex situations.
To address the above challenges, we propose a Deep Reinforcement Learning-based IoT application Scheduling algorithm (DRLIS), which employs the Proximal Policy Optimization (PPO) [13] technique for solving the IoT application scheduling problem in fog computing environments. DRLIS can effectively optimize the load balancing cost of the servers, the response time cost of the IoT applications, and their weighted cost. Besides, by using a clipped surrogate objective to limit the magnitude of policy updates in each iteration, and by performing multiple update iterations on the sampled data, the convergence speed of the algorithm is improved. Moreover, considering the limited computational resources and the optimization objective under study, we design efficient reward functions. The main contributions of this paper are:
• We propose a weighted cost model for scheduling DAG-based IoT applications in fog computing environments to improve the load balancing level of the servers while minimizing the response time of the applications. In addition, we adapt this weighted cost model to make it applicable to DRL algorithms.
• We propose a DRL-based algorithm (DRLIS) to solve the defined weighted cost optimization problem in dynamic and stochastic fog computing environments. When the computing environment changes (e.g., requests from different IoT applications, server computing resources, the number of servers), it can adaptively update the scheduling policy with a fast convergence speed.
• Based on DRLIS, we implement a practical scheduler in the FogBus2 function-as-a-service framework [14] for handling scheduling requests of IoT applications in heterogeneous fog and edge computing environments. We also extend the functionality of the FogBus2 framework to make different DRL techniques applicable to it.
• We conduct practical experiments and use real IoT applications with heterogeneous tasks and resource demands to evaluate the performance of DRLIS in a real system setup. By comparing with common metaheuristics (Non-dominated Sorting Genetic Algorithm 2 (NSGA2) [16], Non-dominated Sorting Genetic Algorithm 3 (NSGA3) [17]) and other reinforcement learning algorithms (Q-Learning [18]), we demonstrate the superiority of DRLIS in terms of convergence speed, optimization cost, and scheduling time.
The rest of the paper is organized as follows. Section 2 discusses related work and Section 3 presents the system model and problem formulation. The Deep Reinforcement Learning model for IoT applications in edge and fog computing environments is presented in Section 4. DRLIS is discussed in Section 5. Section 6 evaluates the performance of DRLIS and compares it with other counterparts. Finally, Section 7 concludes the paper and states future work.

Related Work
In this section, we review the literature on scheduling IoT applications in edge and fog computing environments. The related works are divided into metaheuristic and reinforcement learning categories.

Metaheuristic
In the metaheuristic category, Liu et al. [19] adopted a Markov Decision Process (MDP) approach to achieve shorter average task execution latency in edge computing environments. They proposed an efficient one-dimensional search algorithm to find the optimal task scheduling policy. However, this work cannot adapt to changes in the computing environment and is difficult to extend to solve complex weighted cost optimization problems in heterogeneous fog computing environments. Wu et al. [20] modeled the task scheduling problem in edge and fog computing environments as a DAG and used an estimation of distribution algorithm (EDA) and a partitioning operator to partition the graph in order to queue tasks and assign appropriate servers. However, they did not practically implement and test their work. Sun et al. [21] improved the NSGA2 algorithm and designed a resource scheduling scheme among fog nodes in the same fog cluster, taking into account the diversity of different devices. This work aims to reduce the service latency and improve the stability of task execution. Although capable of handling weighted cost optimization problems, this work only considers scheduling problems in the same computing environment. Hoseiny et al. [22] proposed a Genetic Algorithm (GA)-based technique for minimizing the total computation time and energy consumption of task scheduling in a heterogeneous fog cloud computing environment. By introducing features for tasks, the technique can find a more suitable computing environment for each task. However, it does not consider the dependencies of different tasks in the application, and due to the use of metaheuristic algorithms, scheduling rules need to be manually set, so it cannot adapt to changing computing environments. Ali et al. [23] proposed an NSGA2-based technique for minimizing the total computation time and system cost of task scheduling in heterogeneous fog cloud computing environments. Their work formulates the task scheduling problem as an optimization problem in order to dynamically allocate appropriate resources for predefined tasks. Similarly, due to the limitations of metaheuristic algorithms, this work assumes that the technique has some knowledge of the submitted tasks when developing the scheduling policy and thus cannot cope with dynamic and complex scenarios.

Reinforcement Learning
In the reinforcement learning category, Shahidani et al. [24] proposed a Q-learning-based algorithm to reduce task execution latency and balance the load in a fog cloud computing environment. However, this work does not consider the inter-task dependencies and the heterogeneity of fog and cloud computing environments. Baek et al. [25] adapted the Q-learning algorithm and proposed an approach that aims at improving load balancing in fog computing environments. This work considers the heterogeneity of nodes in fog computing environments but still assumes that the tasks within the application are independent of each other. Jie et al. [26] proposed a Deep Q-Network (DQN)-based approach to minimize the total latency of task processing in edge computing environments. This work formulates task scheduling as a Markov Decision Process while considering the heterogeneity of IoT applications. However, this work only considers the scheduling problem in edge computing environments and investigates only one optimization objective. Xiong et al. [27] adapted the DQN algorithm and proposed a resource allocation strategy for IoT edge computing systems. This work aims at minimizing the average job completion time but does not take into account more complex functions with multiple optimization objectives. Wang et al. [28] focused on edge computing environments and proposed a deep reinforcement learning-based resource allocation (DRLRA) scheme based on DQN. This work aims to reduce the average service time and balance the resource usage within the edge computing environment. However, the work does not consider the resources in fog computing environments, and the technique is not practically implemented and tested. Huang et al. [29] adopted a DQN-based approach to address the resource allocation problem in the edge computing environment. This work investigated minimizing the weighted cost, including the total energy consumption and the latency to complete the task. However, it does not consider the heterogeneity of servers in fog computing environments and assumes that the tasks are independent. Chen et al. [30] proposed an approach based on double DQN to balance task execution time and energy consumption in edge computing environments. Similarly, this work is only applicable to the edge environment and does not consider the dependencies between tasks. Zheng et al. [31] proposed a Soft Actor-Critic (SAC)-based algorithm to minimize the task completion time in an edge computing environment. This work focuses on the latency problem and the experiments are simulation-based. Zhao et al. [32] proposed a Twin Delayed DDPG (TD3)-based DRL algorithm. The goal of this work is to minimize the latency and energy consumption, but inter-task dependencies are not considered and the results are also simulation-based. Liao et al. [33] used Deep Deterministic Policy Gradient (DDPG) and Double Deep Q-Network (DDQN) algorithms to model computation in an edge environment. This work aims to reduce energy consumption and latency but does not consider the fog environment and the heterogeneity of devices. Sethi et al. [34] proposed a DQN-based algorithm to optimize energy consumption and load balancing of fog servers. Similarly, this work is simulation-based and does not consider the dependencies between tasks.
Table 1 presents the comparison of the related work with our proposed algorithm, in terms of application properties, architecture properties, algorithm properties, and evaluation. In the application properties section, the number of tasks included in the IoT application and the dependencies between tasks are studied. In the architectural properties section, three aspects are studied: the IoT device layer, the edge/fog layer, and the multi-cloud layer. For the IoT device layer, the application type and request type are identified. The real application column indicates whether the work deploys actual IoT applications, adopts simulated applications, or uses random data, and the heterogeneous request type column indicates whether the work considers requests of different types. For the edge/fog layer, the computing environment and the heterogeneity of deployed servers are investigated. Besides, the multi-cloud layer studies whether the work considers the scenario of different cloud service providers with heterogeneity. In the algorithm properties section, we investigate the main technique on which each work is based and the corresponding optimization objectives. The evaluation section identifies whether the work is based on simulation or practical experiments. Recent works that we reviewed (e.g., [31], [32], [33], [34], [35], [36], [37]) have often used reinforcement learning approaches to deal with workload scheduling problems. This is because reinforcement learning can learn by interacting with the environment and continuously optimize the policy through feedback signals (e.g., reward or penalty). This learning ability gives reinforcement learning an advantage when facing complex, dynamic environments [38], whereas metaheuristic techniques require manual adaptation and guidance.

System Model and Problem Formulation
In this section, we first introduce the topology of the IoT systems in the edge and fog computing environment. Then, we discuss the problem formulation. The key notations are listed in Table 2. A DAG $G = (V, E)$ is used to model an IoT application, as depicted in Fig. 2. A vertex $v_i = S_l^i$ denotes a certain task of the application $S_l$, and an edge $e_{i,j}$ denotes the data flow between tasks $v_i$ and $v_j$, so some tasks must be executed after their predecessor tasks are completed. $CP(S_l)$ represents the critical path (i.e., the path with the highest cost) of the DAG, marked in red in the figure. A set containing $|M|$ servers is used to process the application set $S = \{S_l \mid 1 \le l \le |S|\}$, denoted as $M = \{m_k \mid 1 \le k \le |M|\}$. To reflect the heterogeneity of the servers, for each server $m_k$, $m_k^{cpu\_ut}$ represents its CPU utilization (%), $m_k^{freq}$ represents its CPU frequency (MHz), $m_k^{ram\_ut}$ represents its RAM utilization (%), and $m_k^{ram\_size}$ represents its RAM size (GB). Moreover, $PS(S_l^i)$ represents the server set to which the parent tasks of task $S_l^i$ are assigned, and $t^{trans}_{j,k}$, $t^{prop}_{j,k}$, $d^{pkt}_{j,k}$, and $b_{j,k}$ denote the transmission time (ms), the propagation time (ms), the packet size (MB), and the data rate (bit/s) between server $m_j$ and server $m_k$, respectively.

Problem Formulation
Since an application contains one or multiple tasks, it may be executed on different servers. With a set of servers $M$, the scheduling configuration $x^{S_l^i}$ of a task $S_l^i$ is defined as:

$$x^{S_l^i} = m_k \tag{1}$$

where $k$ shows the server's index. Accordingly, the scheduling configuration $x^{S_l}$ of an application $S_l$ is equal to the set of the scheduling configurations of the tasks it contains, defined as:

$$x^{S_l} = \{x^{S_l^i} \mid S_l^i \in S_l,\ 1 \le i \le |S_l|\} \tag{2}$$

The scheduling configuration $x^{S}$ of the application set $S$ is equal to the set of scheduling configurations per application:

$$x^{S} = \{x^{S_l} \mid S_l \in S,\ 1 \le l \le |S|\} \tag{3}$$

In addition, we consider that for a given application, the execution model of tasks can be hybrid (i.e., sequential and/or parallel). That is, child tasks depend on their parent tasks and must be executed after the parents complete, and we use $P(S_l^i)$ to represent the parent task set of task $S_l^i$ [39]. Tasks that do not depend on each other can be executed in parallel, and we use $CP(S_l^i)$ to indicate whether a task $S_l^i$ is located on the critical path of application $S_l$.
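The dependency model above can be illustrated with a minimal Python sketch. It is not the scheduler's actual data structure: the four-task application, the task names, and the per-task costs are hypothetical values chosen for illustration. The sketch derives an execution order that respects parent/child dependencies and marks which tasks lie on the critical path (the highest-cost path through the DAG).

```python
# Hypothetical 4-task application: task -> set of parent tasks.
parents = {"t1": set(), "t2": {"t1"}, "t3": {"t1"}, "t4": {"t2", "t3"}}
# Hypothetical per-task cost (ms) used to rank paths through the DAG.
cost = {"t1": 5.0, "t2": 20.0, "t3": 8.0, "t4": 4.0}


def topological_order(parents):
    """Return tasks so that every task appears after all of its parents."""
    order, placed = [], set()
    remaining = dict(parents)
    while remaining:
        ready = sorted(t for t, p in remaining.items() if p <= placed)
        if not ready:
            raise ValueError("dependency graph contains a cycle")
        for t in ready:
            order.append(t)
            placed.add(t)
            del remaining[t]
    return order


def critical_path(parents, cost):
    """Return the highest-cost path through the DAG as a list of tasks."""
    best = {}      # task -> cost of the most expensive path ending at it
    previous = {}  # task -> parent on that path (None for entry tasks)
    for t in topological_order(parents):
        p = max(parents[t], key=lambda q: best[q], default=None)
        best[t] = cost[t] + (best[p] if p is not None else 0.0)
        previous[t] = p
    node = max(best, key=best.get)
    path = []
    while node is not None:
        path.append(node)
        node = previous[node]
    return path[::-1]


cp = critical_path(parents, cost)
# Indicator that is 1 for tasks on the critical path and 0 otherwise.
on_critical_path = {t: int(t in cp) for t in parents}
```

Tasks that become ready in the same round of `topological_order` (here `t2` and `t3`) have no mutual dependency and could run in parallel on different servers.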

Load Balancing Model
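The load balancing model is used to measure the resource balancing level of the server set $M$ during the processing of the application set $S$. Regarding the server resources, both CPU and RAM are considered. For task $S_l^i$, the load balancing model $\psi_{x^{S_l^i}}$ is defined as:

$$\psi_{x^{S_l^i}} = \alpha_1 \cdot \psi^{cpu}_{x^{S_l^i}} + \alpha_2 \cdot \psi^{ram}_{x^{S_l^i}} \tag{4}$$

where $\psi^{cpu}_{x^{S_l^i}}$ and $\psi^{ram}_{x^{S_l^i}}$ represent the CPU and RAM models, and $\alpha_1$ and $\alpha_2$ are the control parameters by which the weighted load balancing model can be tuned. They satisfy:

$$0 \le \alpha_1, \alpha_2 \le 1, \qquad \alpha_1 + \alpha_2 = 1$$

The CPU and RAM models are the variances of the CPU utilization and the RAM utilization across the server set after task $S_l^i$ is scheduled:

$$\psi^{cpu}_{x^{S_l^i}} = \mathrm{Var}\left(M^{cpu\_ut}\right), \qquad \psi^{ram}_{x^{S_l^i}} = \mathrm{Var}\left(M^{ram\_ut}\right)$$

where $M^{cpu\_ut}$ and $M^{ram\_ut}$ denote the sets of CPU utilization and RAM utilization values of the servers in $M$. Correspondingly, for application $S_l$, the load balancing model $\Psi(S_l)$ is defined as the sum of the load balancing models for each task processed by server set $M$:

$$\Psi(S_l) = \sum_{i=1}^{|S_l|} \psi_{x^{S_l^i}} \tag{8}$$

Our main goal is to find the best-possible scheduling configuration for the application set $S$ such that the variance of the overall CPU and RAM utilization of the server set $M$ during the processing of the application set $S$ can be minimized. Therefore, for the application set $S$, the load balancing model $\Psi(S)$ is defined as:

$$\Psi(S) = \sum_{l=1}^{|S|} \Psi(S_l) \tag{9}$$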
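As a rough illustration of the load balancing model, the following Python sketch computes a weighted sum of the variances of CPU and RAM utilization across servers. It assumes utilizations are given as fractions in [0, 1] and uses the population variance; the function name, the default weights, and the example values are hypothetical, not taken from the DRLIS implementation.

```python
from statistics import pvariance


def load_balance_cost(cpu_util, ram_util, w_cpu=0.5, w_ram=0.5):
    """Weighted variance of per-server CPU and RAM utilization.

    cpu_util / ram_util: one utilization value per server, in [0, 1].
    w_cpu / w_ram: control parameters, each in [0, 1] and summing to 1.
    A lower value means a more evenly balanced server set.
    """
    if not (0.0 <= w_cpu <= 1.0 and 0.0 <= w_ram <= 1.0):
        raise ValueError("control parameters must lie in [0, 1]")
    if abs(w_cpu + w_ram - 1.0) > 1e-9:
        raise ValueError("control parameters must sum to 1")
    return w_cpu * pvariance(cpu_util) + w_ram * pvariance(ram_util)


# Hypothetical snapshots of three servers after scheduling a task.
balanced = load_balance_cost([0.5, 0.5, 0.5], [0.4, 0.4, 0.4])
skewed = load_balance_cost([0.9, 0.1, 0.5], [0.8, 0.2, 0.2])
```

An application-level cost in the sense of the text would sum this per-task value over all tasks of the application, evaluating the utilization snapshot after each scheduling decision.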

Response Time Model
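The response time model $\omega_{x^{S_l^i}}$ for task $S_l^i$ covers the time the task waits for its input data and the time the assigned server takes to process it. The task ready time model $\omega^{ready}_{x^{S_l^i}}$ represents the maximum time for the data required by task $S_l^i$ to arrive at the server $m_k$ to which it is assigned. Each transfer from a server $m_j$ in $PS(S_l^i)$, the server set hosting the parent tasks, incurs a transmission time $t^{trans}_{j,k}$ and a propagation time $t^{prop}_{j,k}$:

$$\omega^{ready}_{x^{S_l^i}} = \max_{m_j \in PS(S_l^i)} \left( t^{trans}_{j,k} + t^{prop}_{j,k} \right), \qquad t^{trans}_{j,k} = \frac{d^{pkt}_{j,k}}{b_{j,k}}$$

where $d^{pkt}_{j,k}$ is the packet size (converted to bits) and $b_{j,k}$ is the data rate between the two servers. The processing time of the task depends on $m_k^{freq}$, which represents the CPU frequency of server $m_k$ (for multi-core CPUs, the average frequency is considered).
Accordingly, the response time model $\Omega(S_l)$ for application $S_l$ is defined as:

$$\Omega(S_l) = \sum_{i=1}^{|S_l|} \omega_{x^{S_l^i}} \cdot CP(S_l^i) \tag{15}$$

where $CP(S_l^i)$ equals 1 if task $S_l^i$ is on the critical path of application $S_l$, and 0 otherwise.
The main goal for the response time model $\Omega(S)$ is to find the best-possible scheduling configuration for the application set $S$ such that the total time for the server set $M$ to process them can be minimized. Therefore, for the application set $S$, the response time model $\Omega(S)$ is defined as:

$$\Omega(S) = \sum_{l=1}^{|S|} \Omega(S_l) \tag{16}$$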
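The response time model can be sketched in Python as follows. This is an illustrative reading rather than the paper's exact equations: it assumes a task's response time is its ready time (the slowest incoming transfer from a parent's server) plus a processing time estimated as CPU cycles divided by CPU frequency. The cycle count per task, the decimal interpretation of MB, and all function names are assumptions introduced here.

```python
def transfer_time_ms(packet_mb, rate_bit_s, propagation_ms):
    """Transmission time (packet size / data rate) plus propagation time."""
    bits = packet_mb * 8 * 1_000_000  # assumes 1 MB = 10^6 bytes
    return bits / rate_bit_s * 1000 + propagation_ms


def ready_time_ms(incoming_links):
    """Maximum arrival time over all parent-to-task data transfers.

    incoming_links: iterable of (packet_mb, rate_bit_s, propagation_ms).
    Entry tasks without parents are ready immediately.
    """
    return max((transfer_time_ms(*link) for link in incoming_links),
               default=0.0)


def task_response_ms(incoming_links, cpu_cycles, freq_mhz):
    """Ready time plus processing time on the assigned server."""
    processing_ms = cpu_cycles / (freq_mhz * 1e6) * 1000
    return ready_time_ms(incoming_links) + processing_ms


def app_response_ms(task_times, on_critical_path):
    """Sum the response times of the tasks that lie on the critical path."""
    return sum(task_times[t] * on_critical_path[t] for t in task_times)


# Hypothetical example: one parent link, 1 MB over 8 Mbit/s, 2 ms delay.
one_link = [(1, 8_000_000, 2)]
t_task = task_response_ms(one_link, cpu_cycles=2e9, freq_mhz=2000)
t_app = app_response_ms({"a": 10.0, "b": 5.0, "c": 7.0},
                        {"a": 1, "b": 0, "c": 1})
```

Only tasks on the critical path contribute to the application-level value, because tasks off the critical path run in parallel with it and do not extend the end-to-end time.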

Weighted Cost Model
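The weighted cost model is defined as the weighted sum of the normalized load balancing and normalized response time models. For task $S_l^i$:

$$\phi_{x^{S_l^i}} = w_1 \cdot \mathrm{Norm}\left(\psi_{x^{S_l^i}}\right) + w_2 \cdot \mathrm{Norm}\left(\omega_{x^{S_l^i}}\right) \tag{17}$$

where $w_1$ and $w_2$ are the control parameters of the weighted cost model. For application $S_l$:

$$\Phi(S_l) = w_1 \cdot \mathrm{Norm}\left(\Psi(S_l)\right) + w_2 \cdot \mathrm{Norm}\left(\Omega(S_l)\right) \tag{18}$$

where $\Psi(S_l)$ and $\Omega(S_l)$ are obtained from Eq. 8 and Eq. 15, and $\mathrm{Norm}$ represents the normalization. The weighted cost model for the application set $S$ is defined as:

$$\Phi(S) = w_1 \cdot \mathrm{Norm}\left(\Psi(S)\right) + w_2 \cdot \mathrm{Norm}\left(\Omega(S)\right) \tag{19}$$

where $\Psi(S)$ and $\Omega(S)$ are obtained from Eq. 9 and Eq. 16.
Therefore, the weighted cost optimization problem of IoT applications can be formulated as:

$$\min_{x^{S}} \Phi(S) \quad \text{subject to } C1, C2, C3, C4, C5, C6 \tag{20}$$

where $C1$ states that any task can only be assigned to one server for processing. $C2$ states that for any server, the CPU utilization and RAM utilization are between 0 and 1.
Besides, $C3$ states that the CPU frequency and the RAM size of any server are larger than 0. Moreover, $C4$ denotes that any server should have sufficient RAM resources to process any task. Also, $C5$ denotes that any task can only be processed after its parent tasks have been processed, and thus its cumulative cost is always larger than or equal to that of its parent tasks. In addition, $C6$ denotes that the control parameters of the weighted cost model can only take values from 0 to 1, and their sum should be equal to 1.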
The formulated problem is a non-convex optimization problem, because there may be an infinite number of local optima in the feasible domain, and the complexity of an algorithm that finds the global optimum is usually exponential (NP-hard) [40]. To cope with such non-convex optimization problems, most works decompose them into several convex sub-problems and then solve these sub-problems iteratively until the algorithm converges [41]. This type of approach reduces the complexity of the original problem at the expense of accuracy [42]. In addition, such approaches are highly dependent on the current environment and cannot be applied in dynamic environments with complex and continuously changing parameters and computational resources [42]. To deal with this problem, we propose DRLIS to efficiently handle uncertainties in dynamic environments by learning from interaction with the environment.

Deep Reinforcement Learning Model
In reinforcement learning, the autonomous agent first interacts with the surrounding environment through an action. As a result of the action, the environment moves to a new state and gives the agent an immediate reward. In this cycle, the agent interacts with the environment continuously and thus generates sufficient data. The reinforcement learning algorithm uses the generated data to modify its own action policy, then interacts with the environment to generate new data, and uses the new data to further improve its behavior. Formally, we use a Markov Decision Process (MDP) to model the reinforcement learning problem. Specifically, the learning problem can be described by the tuple $\langle \mathbb{S}, \mathbb{A}, \mathbb{P}, \mathbb{R}, \gamma \rangle$, where $\mathbb{S}$ denotes a finite set of states; $\mathbb{A}$ denotes a finite set of actions; $\mathbb{P}$ denotes the state transition probability; $\mathbb{R}$ denotes the reward function; and $\gamma \in [0, 1]$ is the discount factor, used to compute the cumulative rewards.
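To make the role of the discount factor concrete, the short Python sketch below computes the cumulative discounted reward of one trajectory. The reward values are hypothetical and the helper name is introduced here for illustration only.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a trajectory, computed backwards."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("discount factor must lie in [0, 1]")
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical rewards from three consecutive scheduling decisions.
g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

With a discount factor of 0.5, the three unit rewards contribute 1, 0.5, and 0.25, so later decisions weigh less than earlier ones; a discount factor of 1 would weigh them equally.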
We assume that the time $T$ of the learning process is divided into multiple time steps $t$, and that the agent interacts with the environment at each time step and passes through multiple states $s_t$. At a particular time step $t$, the agent observes the environment state $s_t = s$, where $s \in \mathbb{S}$.
• State space $\mathbb{S}$: Since the optimization problem is related to tasks and servers, the state of the problem consists of the feature space of the task currently being processed and the state space of the current server set $M$. Based on the discussion in Section 3, at the time step $t$, the feature space of the task $S_l^i$ includes the task ID, the task's predecessors and successors, the application ID to which the task belongs, the number of tasks in the current application, the estimate of the occupied CPU resources for the execution of the task, the task's RAM requirements, the estimate of the task's response time, etc. Formally, the feature space $\mathbb{F}$ for task $S_l^i$ at the time step $t$ is defined as follows:

$$\mathbb{F}_t(S_l^i) = \{ f_t^n \mid 1 \le n \le |\mathbb{F}| \}$$

where $n$ represents the index of the feature in the task feature space $\mathbb{F}$, and $|\mathbb{F}|$ represents the number of features. Moreover, at the time step $t$, the state space of the current server set $M$ includes the number of servers, each server's CPU utilization, CPU frequency, RAM utilization, and RAM size, and the propagation time and bandwidth between different servers, etc. Formally, the state space $\mathbb{M}$ for the server set $M$ at the time step $t$ is defined as:

$$\mathbb{M}_t = \{ g_t^p(m_k),\ h_t^q(m_j, m_k) \mid m_j, m_k \in M,\ 1 \le p \le |g|,\ 1 \le q \le |h| \}$$

where $g$ represents the state type that is related to only one server (e.g., CPU utilization), $p$ represents its index, and $|g|$ represents the length of this type of state; besides, $h$ denotes the state type that is related to two servers (e.g., propagation time), and similarly, $q$ represents its index and $|h|$ represents the length of this type of state. Therefore, the state space $\mathbb{S}$ is defined as:

$$\mathbb{S} = \{ s_t \mid s_t = (\mathbb{F}_t(S_l^i), \mathbb{M}_t) \}$$

• Action space $\mathbb{A}$: The goal is to find the best-possible scheduling configuration for the application set $S$ to minimize the objective function in Eq. 20. Therefore, at the time step $t$, the action can be defined as the assignment of a server to the task $S_l^i$:

$$a_t = x^{S_l^i} = m_k$$

Accordingly, the action space $\mathbb{A}$ can be defined as the server set $M$:

$$\mathbb{A} = M$$

• Reward function $\mathbb{R}$: Since this is a weighted cost optimization problem, we need to define the reward function for each sub-problem. First, as the penalty $\rho$, a very large negative value is introduced if the task cannot be processed on the assigned server for any reason. Also, for the load balancing problem, based on the discussion in Section 3.2.1, the reward function $r^{lb}_t$ is defined as:

$$r^{lb}_t = \begin{cases} \psi_{x^{S_l^{i-1}}} - \psi_{x^{S_l^i}}, & \text{if the task is processed successfully} \\ \rho, & \text{otherwise} \end{cases}$$

where $\psi_{x^{S_l^i}}$ is obtained from Eq. 4 and $S_l^{i-1}$ denotes the previously scheduled task. The value output by the reward function $r^{lb}_t$ is the difference between the load balancing models of the server set after scheduling the previous task and after scheduling the current one. If the value of the load balancing model of the server set is reduced after scheduling the current task, the output reward is positive; otherwise it is negative. Besides, for the response time problem, based on the discussion in Section 3.2.2, the reward function $r^{rt}_t$ is defined as:

$$r^{rt}_t = \begin{cases} \bar{\omega}_{x^{S_l^i}} - \omega_{x^{S_l^i}}, & \text{if the task is processed successfully} \\ \rho, & \text{otherwise} \end{cases}$$

where $\omega_{x^{S_l^i}}$ is obtained from Eq. 10, and $\bar{\omega}_{x^{S_l^i}}$ represents the average response time for task $S_l^i$. The value output by the reward function $r^{rt}_t$ is the difference between the average response time (the current response time is also considered) and the current response time for task $S_l^i$. If the current response time is lower than the average one, the output reward is positive; otherwise it is negative. The reward function $r_t$ for the weighted cost optimization problem is defined as:

$$r_t = w_1 \cdot \mathrm{Norm}\left(r^{lb}_t\right) + w_2 \cdot \mathrm{Norm}\left(r^{rt}_t\right)$$

where $w_1$ and $w_2$ are the control parameters, and $\mathrm{Norm}$ represents the normalization process.
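The three reward signals described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the penalty constant, the running-average bookkeeping, and the normalization by fixed scale factors are hypothetical choices, since the text does not specify the magnitude of the penalty or the normalization method.

```python
PENALTY = -1e6  # hypothetical "very large negative value"


def load_balance_reward(prev_cost, cur_cost, feasible=True):
    """Positive when scheduling the current task lowers the imbalance."""
    return prev_cost - cur_cost if feasible else PENALTY


class ResponseTimeReward:
    """Reward = running average response time - current response time.

    The running average includes the current sample, as in the text.
    """

    def __init__(self):
        self.total_ms = 0.0
        self.count = 0

    def __call__(self, response_ms, feasible=True):
        if not feasible:
            return PENALTY
        self.total_ms += response_ms
        self.count += 1
        return self.total_ms / self.count - response_ms


def weighted_reward(r_lb, r_rt, w1=0.5, w2=0.5, scale_lb=1.0, scale_rt=1.0):
    """Weighted sum of the two rewards after a simple scale normalization.

    scale_lb / scale_rt are assumed normalization constants that bring
    both rewards into a comparable range.
    """
    return w1 * (r_lb / scale_lb) + w2 * (r_rt / scale_rt)


# Hypothetical sequence of three scheduling decisions for one task type.
rt_reward = ResponseTimeReward()
first = rt_reward(100.0)   # average 100, current 100
second = rt_reward(50.0)   # average 75, current 50
third = rt_reward(200.0)   # average ~116.7, current 200
combined = weighted_reward(load_balance_reward(0.2, 0.1), second,
                           scale_rt=100.0)
```

A faster-than-average decision (`second`) earns a positive reward and a slower one (`third`) a negative reward, which mirrors the sign convention stated for both sub-problems.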
Currently, many advanced deep reinforcement learning algorithms (e.g., PPO, TD3, SAC) have been proposed by different researchers. They show excellent performance in different fields. PPO improves convergence and sampling efficiency by adopting importance sampling and clipping of the probability ratio [13]. TD3 (Twin Delayed DDPG) introduces a dual Q network and a delayed update strategy to effectively solve the overestimation problem in the continuous action space [43]. SAC (Soft Actor-Critic) combines policy optimization and learning of Q-value functions, providing more robust and exploratory policy learning through maximum entropy theory [44]. These algorithms have achieved remarkable results in different tasks and environments. In our research problem, the agent's action and state spaces are discrete, which hinders the application of TD3, because it is designed for continuous control [45]. In addition, the original SAC only considers continuous spaces [44]. Although some works discuss how to apply SAC to discrete spaces, they usually need to adopt special tricks and extensions, such as soft-max or sample-prune techniques, to accommodate discrete actions [46]. Besides, Wang et al. [47] show that SAC requires more computation time and convergence time than PPO. Our study focuses on edge and fog computing environments, where latency sensitivity and environmental variation are important considerations when choosing the appropriate DRL algorithm. We choose PPO as the basis of DRLIS because PPO is designed to be more easily adaptable to discrete action spaces [48] and we aim for the algorithm to converge quickly and perform well in diverse environments.

DRL-based Optimization Algorithm
Based on the above-mentioned MDP model, we propose DRLIS to achieve weighted cost optimization of IoT applications in edge and fog computing environments. In this section, we introduce the mathematical principle of the PPO algorithm and discuss the proposed DRLIS.

Preliminaries
The PPO algorithm belongs to the family of Policy Gradient (PG) algorithms, which consider the impact of actions on rewards and adjust the probability of actions accordingly [49]. We use the same notations as in Section 4 to describe the algorithm. We consider that the time horizon $T$ is divided into multiple time steps $t$, and that the agent has a policy $\pi_\theta$ for determining its actions and interactions with the environment. The objective can be expressed as adjusting the parameter $\theta$ to maximize the expected cumulative discounted rewards $\mathbb{E}_{\pi_\theta}\left[\sum_{t \in T} \gamma^t r_t\right]$ [13], expressed by the formula:

$$J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t \in T} \gamma^t r_t\right]$$

Since this is a maximization problem, the gradient ascent algorithm can be used to find the maximum value:

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$

where $\alpha$ is the update step size. The key is to obtain the gradient of the reward function $J(\theta)$ with respect to $\theta$, which is called the policy gradient. The algorithm for solving reinforcement learning problems by optimizing the policy gradient is called the policy gradient algorithm.
The policy gradient can be presented as:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\pi_\theta}(s_t, a_t)\right]$$

where $A^{\pi_\theta}(s_t, a_t)$ is the advantage function at time step $t$, used to evaluate the action $a_t$ at the state $s_t$. Here, the policy gradient is the expectation of $\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\pi_\theta}(s_t, a_t)$, which can be estimated using the empirical average obtained by sampling. However, the PG algorithm is very sensitive to the update step size, and choosing a suitable step size is challenging [50]. Moreover, practice shows that the difference between old and new policies in training is usually large [13].
To address this problem, Trust Region Policy Optimization (TRPO) [51] was proposed. This algorithm introduces importance sampling to evaluate the difference between the old and new policies and restricts the new policy if the importance sampling ratio grows large. Importance sampling refers to replacing the original sampling distribution with a new one to make sampling easier or more efficient. Specifically, TRPO maintains two policies: the first policy $\pi_\theta$ is the current policy to be refined, and the second policy $\pi_{\theta'}$ is used to collect the samples. The optimization problem is defined as follows:

$$\max_\theta \; \mathbb{E}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta'}(a_t \mid s_t)}\, A_{\theta'}(a_t \mid s_t)\right] \quad \text{subject to} \quad \mathbb{E}_t\left[KL\left[\pi_{\theta'}(\cdot \mid s_t), \pi_\theta(\cdot \mid s_t)\right]\right] \le \delta,$$

where $KL$ represents the Kullback-Leibler divergence, used to quantify the difference between two probability distributions [52], and $\delta$ represents the restriction of the update between the old policy $\pi_{\theta'}$ and the new policy $\pi_\theta$. After linear approximation of the objective and quadratic approximation of the constraint, the problem can be efficiently approximated using the conjugate gradient algorithm. However, the computation of the conjugate gradient makes the implementation of TRPO more complex and inflexible in practice [53], [54].
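To make this algorithm practical, the KL-PPO algorithm [13] was proposed. Rather than using the constraint $\mathbb{E}_t\left[KL\left[\pi_{\theta'}(\cdot \mid s_t), \pi_\theta(\cdot \mid s_t)\right]\right] \le \delta$, the $KL$ divergence is added as a penalty in the objective function:

$$L^{KL}(\theta) = \mathbb{E}_t\left[r_t(\theta)\, A_{\theta'}(a_t \mid s_t) - \beta\, KL\left[\pi_{\theta'}(\cdot \mid s_t), \pi_\theta(\cdot \mid s_t)\right]\right],$$

where $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta'}(a_t \mid s_t)}$ is the ratio of the new policy to the old policy, obtained in Eq. 38, and the parameter $\beta$ can be dynamically adjusted during the iterative process according to the $KL$ divergence. If the current $KL$ divergence is larger than the predefined maximum value, the penalty is not strong enough and the parameter $\beta$ needs to be increased. Conversely, if the current $KL$ divergence is smaller than the predefined minimum value, the parameter $\beta$ needs to be reduced.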
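The adaptive adjustment of the penalty coefficient can be sketched as below. The multiplicative factor of 2 and the function and parameter names are assumptions for illustration; the text only states that the coefficient is increased above a maximum KL value and reduced below a minimum one.

```python
def update_kl_coefficient(beta, kl, kl_min, kl_max, factor=2.0):
    """Adapt the KL penalty coefficient after a policy update.

    The coefficient grows when the measured KL divergence exceeds
    `kl_max` (penalty too weak) and shrinks when it falls below
    `kl_min` (penalty too strong). Otherwise it is left unchanged.
    """
    if kl > kl_max:
        return beta * factor
    if kl < kl_min:
        return beta / factor
    return beta
```

For example, with `kl_min=0.01` and `kl_max=0.03`, a measured divergence of 0.05 doubles the coefficient, while 0.005 halves it.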
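Another way to restrict the difference between the old policy $\pi_{\theta'}$ and the new policy $\pi_\theta$ is to use the clipped surrogate function $clip$. The PPO algorithm using the clip function (CLIP-PPO) removes the KL penalty and the need for adaptive updates to simplify the algorithm. Practice shows that CLIP-PPO usually performs better than KL-PPO [13]. Formally, the objective function of CLIP-PPO is defined as follows:

$$L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\, A_{\theta'}(a_t \mid s_t),\; clip(r_t(\theta), 1-\epsilon, 1+\epsilon)\, A_{\theta'}(a_t \mid s_t)\right)\right],$$

where $r_t(\theta)$ is the ratio of the new policy to the old policy. The function $clip(r_t(\theta), 1-\epsilon, 1+\epsilon)$ restricts the ratio $r_t(\theta)$ to the interval $[1-\epsilon, 1+\epsilon]$, and is defined as:

$$clip(r_t(\theta), 1-\epsilon, 1+\epsilon) = \begin{cases} 1-\epsilon, & r_t(\theta) < 1-\epsilon \\ r_t(\theta), & 1-\epsilon \le r_t(\theta) \le 1+\epsilon \\ 1+\epsilon, & r_t(\theta) > 1+\epsilon \end{cases}$$

By removing the constraint function used in TRPO, both PPO algorithms significantly reduce the computational complexity, while ensuring that the updated policy does not deviate too much from the previous one.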
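A minimal sketch of the clipped surrogate objective, written in plain Python over lists of per-step probability ratios and advantage values, is given below. The function names are assumptions for this example; a practical implementation would operate on tensors.

```python
def clip(value, low, high):
    """Restrict `value` to the closed interval [low, high]."""
    return max(low, min(value, high))


def clipped_surrogate(ratios, advantages, epsilon):
    """Average of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over samples."""
    terms = []
    for ratio, advantage in zip(ratios, advantages):
        unclipped = ratio * advantage
        clipped = clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
        # Taking the minimum gives a pessimistic bound on the objective.
        terms.append(min(unclipped, clipped))
    return sum(terms) / len(terms)
```

With a clipping range of 0.2, a ratio of 1.5 with a positive advantage is capped at 1.2, so the objective gives no incentive to move the policy further in that direction. With a negative advantage, the unclipped term is the smaller one and is kept.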

DRLIS: DRL-based IoT Application Scheduling
Since CLIP-PPO usually outperforms KL-PPO in practice, we choose it as the basis for the optimization algorithm. DRLIS is based on the actor-critic framework, which is a reinforcement learning method combining Policy Gradient and Temporal Difference (TD) learning. As the name implies, this framework consists of two parts, the actor and the critic, which in implementation are usually realized as Deep Neural Networks (DNNs). The actor network is used to learn a policy function $\pi_\theta(a \mid s)$ to maximize the expected cumulative discounted reward $\mathbb{E}_{\pi_\theta}\left[\sum_{t \in T} \gamma^t r_t\right]$, while the critic network is used to evaluate the current policy and to guide the next stage of the actor's action. In the learning process, at time step $t$, the reinforcement learning agent inputs the current state $s_t$ into the actor network, and the actor network outputs the action $a_t$ to be performed by the agent in the MDP. The agent performs the action $a_t$, receives the reward $r_t$ from the environment, and moves to the next state $s_{t+1}$. The critic network receives the states $s_t$ and $s_{t+1}$ as input and estimates their value functions $V_\phi(s_t)$ and $V_\phi(s_{t+1})$. The agent then computes the TD error $\delta_t$ for the time step $t$:

$$\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t),$$

where $\gamma$ denotes the discount factor, as discussed in Section 3, and the actor network and critic network update their parameters using the TD error $\delta_t$. DRLIS accumulates the TD errors over multiple steps to obtain an estimate $\hat{A}_t$ of the advantage function $A_\theta$, which can be written as:

$$\hat{A}_t = \sum_{l=0}^{T-t-1} \gamma^{l}\, \delta_{t+l}$$

DRLIS maintains three networks: one critic network and two actor networks (i.e., the old actor and the new actor), representing the old policy function $\pi_{\theta'}$ and the new policy function $\pi_\theta$, as discussed in Section 5.
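The two quantities above can be sketched in a few lines of Python. The sketch assumes that the advantage estimate is the discounted sum of the TD errors from the current step to the end of the collected trajectory, and that the critic's value estimates are already available as a list; the function names are illustrative only.

```python
def td_errors(rewards, values, gamma):
    """Compute delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) for each step.

    `values` holds V(s_0), ..., V(s_T), so it has one more entry
    than `rewards`.
    """
    return [
        rewards[t] + gamma * values[t + 1] - values[t]
        for t in range(len(rewards))
    ]


def advantage_estimates(deltas, gamma):
    """Discounted sum of the TD errors from each step to the end."""
    advantages = [0.0] * len(deltas)
    running = 0.0
    # Work backwards so each estimate reuses the one after it.
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * running
        advantages[t] = running
    return advantages
```

For two steps with reward 1, zero value estimates, and a discount factor of 0.5, the TD errors are both 1 and the advantage estimates are 1.5 and 1.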
We consider a scheduler that is implemented based on DRLIS. When this scheduler receives a scheduling request from an IoT application, it obtains information about the set of servers currently available and initializes a DRL agent based on this information. This agent contains three deep neural networks: a new actor network $\Pi_\theta$ with parameter $\theta$, an old actor network $\Pi_{\theta'}$ with parameter $\theta'$, where $\theta' = \theta$, and a critic network $V_\phi$ with parameter $\phi$. After that, the scheduler obtains the information about the currently submitted task and generates the current state $s_t$ based on the information regarding the task and servers. Inputting the state $s_t$ to the new actor network $\Pi_\theta$ outputs an action $a_t$, representing the target server to which the current task is to be assigned. The scheduler then assigns the task to the target server and receives the corresponding reward $r_t$, which is calculated based on Eq. 32, 33, 34. The reward $r_t$ indicates the positive or negative impact of the agent's current scheduling policy on the optimization objectives (e.g., IoT application response time and server load balancing level). Also, a tuple with three values $(s_t, a_t, r_t)$ is stored in a buffer. The scheduler repeats the process $T$ times until sufficient information is collected to update the neural networks. When updating the neural networks, the estimate of the advantage function is first computed based on Eq. 44. Then the neural networks are optimized $K$ times. Both the actor network and the critic network use the Adam optimizer, and the loss function is computed as:

$$L(\theta, \phi) = -c_1 L^{CLIP}(\theta) + c_2 L^{VF}(\phi) - c_3 L^{S}(\theta), \quad (45)$$

where $L^{CLIP}(\theta)$ is the policy objective function from Eq. 41, $L^{VF}(\phi)$ is the loss function for the state value function, $L^{S}(\theta)$ is the entropy bonus for the current policy, and $c_1$, $c_2$, and $c_3$ are the coefficients. After updating the neural networks, the parameter $\theta$ of the new actor network $\Pi_\theta$ is copied to the old actor network $\Pi_{\theta'}$.
Assuming that there are $N$ tasks, it follows from Algorithm 1 that the agent updates the policy $K$ times after every $T$ scheduled tasks, so the complexity of the algorithm is $O(N + \frac{N}{T}K)$. In practical applications, both $T$ and $K$ are hyperparameters that can be customized to suit different computational environments. Thus, the computational complexity of the algorithm depends on the number of tasks $N$ and can be written as $O(N)$. For the edge/fog environment with limited computational resources, we consider this computational complexity to be acceptable.
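The structure of the combined loss in Eq. 45 and the update count behind the complexity argument can be sketched as follows. The helper names, the mean-squared-error form of the value loss, and the categorical entropy are assumptions for illustration; the coefficients are passed in rather than taken from the paper's settings.

```python
import math


def entropy(probs):
    """Entropy of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def value_loss(predicted, targets):
    """Mean squared error between value estimates and their targets."""
    squared = [(p - t) ** 2 for p, t in zip(predicted, targets)]
    return sum(squared) / len(squared)


def total_loss(policy_objective, v_loss, entropy_bonus, c1, c2, c3):
    """Loss to minimize: the policy objective and the entropy bonus are
    maximized (negative sign), the value loss is minimized."""
    return -c1 * policy_objective + c2 * v_loss - c3 * entropy_bonus


def count_policy_updates(num_tasks, batch_size, epochs):
    """Number of optimization passes when the policy is updated `epochs`
    times after every `batch_size` scheduled tasks."""
    return (num_tasks // batch_size) * epochs
```

For 100 tasks, a batch of 20 tasks, and 4 optimization passes per batch, the policy is optimized 20 times in total, which grows linearly with the number of tasks once the batch size and the number of passes are fixed.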

Practical Implementation in the FogBus2 Framework
We extend the scheduling module of the FogBus2 framework [14] to design and develop DRLIS in practice for processing placement requests from different IoT applications in edge and fog computing environments.
FogBus2 is a lightweight container-based distributed/serverless framework (realized using Docker microservices software) for integrating edge and fog/cloud computing environments. A scheduling module is implemented to decide the deployment of heterogeneous IoT applications, enabling the management of distributed resources in the hybrid computing environment. There are five main components within the FogBus2 framework, namely Master, Actor, RemoteLogger, TaskExecutor, and User. Fig. 3 shows the relationship between different components in the FogBus2 framework, and the updated sub-components used to implement the reinforcement learning function.
• Actor: It informs the Remote Logger and Master components of the computing resources of the corresponding node to coordinate the resource scheduling of the framework. Furthermore, it is responsible for launching the appropriate Task Executor components to process the submitted IoT application. We extend the functionality of the Profiler and the Message Handler components to allow system characteristics regarding servers to be passed to the reinforcement learning scheduling module in Master components.
• Task Executor: It is responsible for executing the corresponding tasks of the submitted application. The results are passed to the Master component.
• User: It runs on IoT devices and is responsible for processing raw data from sensors and users. It sends the processed data to the Master component and submits the execution request. We extend the functionality of the Actuator and the Message Handler components to allow information related to IoT applications to be passed to the reinforcement learning scheduling module in Master components.
Fig. 4 shows our implementation of the reinforcement learning scheduling module in the FogBus2 framework. The module can be divided into four sub-modules: 1) Reinforcement Learning Models, 2) Rewards Models, 3) Reinforcement Learning Agent, and 4) Model Warehouse.
• Rewards Models: This sub-module contains the models associated with the reward functions. According to Section 3.2 and Section 4, we implemented the Load Balancing Model, the Response Time Model, and the Weighted Cost Model. This sub-module is responsible for calculating the reward values based on the collected information (e.g., CPU and RAM utilization) and transferring them to the Agent sub-module.
• Reinforcement Learning Agent: This sub-module implements the functions of the reinforcement learning agent. The Agent Initiator calls the Reinforcement Learning Models sub-module and initializes the corresponding models. The Action Selector is responsible for outputting the target server index for the currently scheduled task. The Model Optimizer optimizes the running reinforcement learning scheduling policy based on the reward values returned from the Rewards Models sub-module. The State Converter is responsible for converting the parameters of the server and IoT application into state vectors that can be recognized by the reinforcement learning scheduling model. The Scheduling Policy Runner is the running program of the reinforcement learning scheduling Agent and is responsible for receiving submitted tasks, saving or loading the trained policies, and requesting and accessing parameters from other FogBus2 components (e.g., FogBus2 Actor, FogBus2 User) for the computation of reward functions.
• Model Warehouse: This sub-module can save the hyperparameters of the trained scheduling policy to the database and load the hyperparameters to initialize a well-trained scheduling Agent.
The framework first initializes a scheduler based on Algorithm 1. In addition, two buffers for storing information from the Actor component and the User component are also initialized. After the User component submits the IoT application to be processed, the Master component first checks whether the Actor components that have been registered to the framework have the corresponding resources to process the application. If true, the IoT application, which contains one or multiple tasks, will be scheduled; otherwise, the Master component will inform the User component that the current application cannot be processed. For each task of an IoT application, the scheduler will place it on the target Actor component for execution based on Algorithm 1. After that, the Actor component sends the relevant information (e.g., CPU utilization and RAM utilization) to the Master component, which is stored in the Actor buffer. The User component also sends relevant information (e.g., response time and the result of task execution) to the Master component, which is stored in the User buffer. When the Master component collects sufficient information, it will update the scheduler, where the data in the two buffers are used to compute the reward for each step, as discussed in Algorithm 1 and Eq. 32, 33, 34.
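The buffer bookkeeping described above can be sketched as follows. The class and method names, the use of the negative weighted cost as the reward, and the omission of any normalization between the load balancing and response time terms are assumptions for illustration; the actual reward is defined by Eq. 32, 33, 34.

```python
from statistics import pvariance


def load_balancing_cost(cpu_utils, ram_utils, w_cpu=0.5, w_ram=0.5):
    """Weighted variance of CPU and RAM utilization across servers."""
    return w_cpu * pvariance(cpu_utils) + w_ram * pvariance(ram_utils)


class SchedulerBuffers:
    """Hypothetical Master-side buffers for Actor and User reports."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.actor_buffer = []  # (cpu_utils, ram_utils) per scheduled task
        self.user_buffer = []   # response time per scheduled task

    def add_actor_report(self, cpu_utils, ram_utils):
        self.actor_buffer.append((cpu_utils, ram_utils))

    def add_user_report(self, response_time):
        self.user_buffer.append(response_time)

    def ready(self):
        """True once both buffers hold a full batch of reports."""
        collected = min(len(self.actor_buffer), len(self.user_buffer))
        return collected >= self.batch_size

    def pop_rewards(self, w_lb=0.5, w_rt=0.5):
        """Turn one batch of reports into per-step rewards and drop it."""
        n = self.batch_size
        rewards = []
        pairs = zip(self.actor_buffer[:n], self.user_buffer[:n])
        for (cpu, ram), response_time in pairs:
            cost = w_lb * load_balancing_cost(cpu, ram) + w_rt * response_time
            rewards.append(-cost)  # assumption: lower cost, higher reward
        del self.actor_buffer[:n]
        del self.user_buffer[:n]
        return rewards
```

Keeping the two buffers separate mirrors the fact that server utilization and response time arrive from different components and possibly at different times; the update only proceeds once both sides of a batch are present.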

Performance Evaluation
In this section, we first describe the experimental setup and sample applications used in the evaluation. Then, we investigate the hyperparameters of DRLIS. Finally, we discuss the performance of DRLIS by comparing it with its counterparts.

Experiment Setup
We first give a short introduction to the experimental environment and describe the IoT applications used in the experiment. Next, the baseline algorithms used to compare with DRLIS are presented.

Experiment Environment
As discussed in Section 5.3, we implemented a scheduler based on DRLIS in the FogBus2 framework, and we use this scheduler for evaluation. We consider a heterogeneous experimental environment consisting of IoT devices, resource-limited fog servers, and resource-rich cloud servers. To simulate the heterogeneous multi-cloud computing environment, we used two instances of Nectar Cloud infrastructure (Intel Xeon 2 cores @2.0GHz, 9GB RAM, and Intel Xeon 16 cores @2.0GHz, 64GB RAM) and one instance of AWS Cloud (AMD EPYC 2 cores @2.2GHz, 4GB RAM). In the fog computing environment, to reflect the heterogeneity of the servers, we used a Raspberry Pi 3B (Broadcom BCM2837 4 cores @1.2GHz, 1GB RAM), a MacBook Pro (Apple M1 Pro 8 cores, 16GB RAM), and a Linux virtual machine (Intel Core i5 2 cores @3.1GHz, 4GB RAM). In addition, the IoT devices are configured with 2 cores @3.2GHz and 4GB RAM. Furthermore, we profiled the average bandwidth (i.e., data rate) and latency between servers as follows: the latency between the IoT device and the cloud server is around 15ms, and the bandwidth is around 6MB/s, while the latency between the IoT device and the fog server is around 3ms, and the bandwidth is around 25MB/s. Also, both weights in Eq. 19 are set to 0.5, meaning that load balancing and response time are equally important.
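To give a feel for the profiled network values, the sketch below estimates the time needed to move a payload as the latency plus the payload size divided by the bandwidth. This simple additive model, the function name, and the 1 MB payload are assumptions for illustration, not measurements from the experiments.

```python
def transfer_time_ms(size_mb, latency_ms, bandwidth_mb_per_s):
    """Latency plus transmission time, in milliseconds."""
    transmission_ms = size_mb * 1000.0 / bandwidth_mb_per_s
    return latency_ms + transmission_ms


# Profiled averages reported above, applied to a hypothetical 1 MB payload.
fog_ms = transfer_time_ms(1.0, latency_ms=3.0, bandwidth_mb_per_s=25.0)
cloud_ms = transfer_time_ms(1.0, latency_ms=15.0, bandwidth_mb_per_s=6.0)
```

Under this model, the same payload takes roughly four times longer to reach the cloud server than the fog server, which is why placement decisions have a direct effect on response time.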

Sample IoT Applications
We used four IoT applications for evaluating the performance of the scheduler based on DRLIS. All applications implement both real-time and non-real-time features. Real-time means that the application can receive live streams, and non-real-time means that the application can receive pre-recorded video files. Specifically, applications follow a sensor-actuator architecture, with each application operating as a single data stream. Sensors (e.g., cameras) capture environmental information and process it into data patterns (e.g., image frames) that will be forwarded to surrogate servers for processing, while actuators receive the processed data and present the final outcome to the user. In addition, all applications provide a parameter called application label, which can be used to set the frame size in the video. These applications are described as follows:
• Face Detection [15]: Detects and captures human faces. The human faces in the video are marked by squares. This application is implemented based on OpenCV.
• Color Tracking [15]: Tracks colors from video. The user can dynamically configure the target colors through the GUI provided by the application. This application is implemented based on OpenCV.
• Face And Eye Detection [15]: In addition to detecting and capturing human faces, the application also detects and captures human eyes. This application is implemented based on OpenCV.
• Video OCR [14]: Recognizes and extracts text information from the video and transmits it back to the user. The application will automatically filter out keyframes. This application is implemented based on Google's Tesseract-OCR Engine.

Baseline Algorithms
To evaluate the performance of DRLIS, four other schedulers based on metaheuristic algorithms and reinforcement learning techniques are implemented, as follows:
• DQN: It is one of the most widely adopted techniques in deep reinforcement learning, which constructs an end-to-end architecture from perception to decision. This algorithm has been used by many works in the current literature, such as [26], [27], [28], and [29]. To compare with our proposed algorithm, we implement a DQN-based scheduler and integrate it into the FogBus2 framework. This scheduler can minimize the weighted load balancing and response time cost.
• Q-Learning: This technique belongs to value-based reinforcement learning techniques that combine the Monte Carlo method and the TD method. Its ultimate goal is to learn a table (Q-Table). Works including [25], [55] adopt this technique. To integrate it into the FogBus2 framework, we implemented a scheduling policy. Furthermore, as a comparison, the scheduler can be used in the weighted cost problem to minimize the weighted load balancing and response time cost.
• NSGA2: It is a weighted cost genetic algorithm. It adopts the strategy of fast non-dominated sorting and crowding distance to reduce the complexity of the non-dominated sorting genetic algorithm. The algorithm has high efficiency and a fast convergence rate [56]. This algorithm is implemented using Pymoo [57].
• NSGA3: The framework of NSGA3 is basically the same as NSGA2, using fast non-dominated sorting to classify population individuals into different non-dominated fronts, and the difference mainly lies in the selection mechanism. Whereas NSGA2 uses crowding distance to select individuals of the same non-dominated level, NSGA3 introduces well-distributed reference points to maintain population diversity under high-dimensional goals [58]. This algorithm is implemented using Pymoo [57].
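For reference, the table update behind a Q-Learning baseline can be sketched as follows. This is the standard tabular Q-learning rule, not the baseline scheduler's actual code; the state and action encodings, the learning rate, and the function name are assumptions for this example.

```python
from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state,
                      actions, alpha, gamma):
    """Apply one tabular Q-learning update and return the new Q-value.

    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    """
    best_next = max(q_table[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])
    return q_table[(state, action)]


# Unseen (state, action) pairs default to a Q-value of 0.
q_table = defaultdict(float)
first = q_learning_update(q_table, "s0", 0, 1.0, "s1", [0, 1], 0.5, 0.9)
second = q_learning_update(q_table, "s0", 0, 1.0, "s1", [0, 1], 0.5, 0.9)
```

Because the table holds one entry per state and action pair, its size grows with the number of distinct states, which is one reason function approximators such as DQN and PPO are used when the state space is large.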

Hyperparameter Tuning
The scheduler based on DRLIS is implemented via PyTorch. Considering the limited computational resources of some devices in the fog computing environment, both the actor network and the critic network consist of an input layer, a hidden layer, and an output layer. Henderson et al. [59] investigate the effect of hyperparameter settings on the performance of reinforcement learning models. They survey the literature on different reinforcement learning techniques, list the hyperparameter settings used in the literature, and compare the actual performance of the models under different hyperparameter settings. They compare the performance of the PPO algorithm under different network architectures, and the results show that the model performs best when the hidden layer contains 64 hidden units and the hyperbolic tangent (TanH) function is used as the activation function. Therefore, we used the same network architecture for our experiments. In addition, we performed a grid search to tune the four main hyperparameters (i.e., clipping range, discount factor, learning rate for the actor network, and learning rate for the critic network), and the results are shown in Fig. 5. The two control parameters of the load balancing model are both set to 0.5 to give equal importance to CPU and RAM; however, these values can be tuned by users based on their objectives.
All the hyperparameter tuning experiments are conducted on the weighted cost problem, as discussed in Section 3.2.3. We describe the hyperparameter tuning process of our reinforcement learning model below. For tuning the clipping range $\epsilon$, we followed Schulman et al. [13], who proposed PPO and reported that the model performs best with the clipping range $\epsilon$ set to 0.1, 0.2, or 0.3. Fig. 5a shows that our model performs best when the clipping range $\epsilon$ is set to 0.3. For the discount factor $\gamma$, we reviewed related work on DRL in order to understand the common range for $\gamma$. According to [13,60], the best setting for $\gamma$ lies between 0.9 and 0.999. Accordingly, to keep the search area for tuning $\gamma$ in a viable range, we used the values nominated in these works and found that our model converges faster when $\gamma$ is set to 0.9. Fig. 5b shows the tuning process of $\gamma$. Following a similar approach to that used for $\epsilon$ and $\gamma$, we referred to [13,59,61] to design the tuning range for the actor network learning rate. Accordingly, we used 0.003, 0.0003, and 0.00003 as candidate values. Fig. 5c shows that our model performs best when the actor network learning rate is set to 0.0003. Using the same approach for the critic network learning rate, we followed [62,63,64], set our tuning range to {0.01, 0.001, 0.0001}, and found that our model works best when the critic network learning rate is 0.001. Fig. 5d shows the performance of our model under different settings of the critic network learning rate. Overall, the deep neural network and training hyperparameter settings are presented in Table 3. Besides, we also tune the hyperparameters of the baseline techniques to study their performance fairly. The corresponding results are shown in Table 4.
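A grid search over the four hyperparameters can be sketched as below. The candidate values for the clipping range and the two learning rates are the ones listed above; the three discount factor candidates and the `evaluate` callback, which stands in for training a scheduler and returning its weighted cost, are assumptions for this example.

```python
from itertools import product

SEARCH_GRID = {
    "clip_range": [0.1, 0.2, 0.3],
    "discount": [0.9, 0.99, 0.999],  # hypothetical candidates in 0.9-0.999
    "actor_lr": [0.003, 0.0003, 0.00003],
    "critic_lr": [0.01, 0.001, 0.0001],
}


def grid_search(grid, evaluate):
    """Return the configuration with the lowest cost and that cost.

    `evaluate` maps a configuration dict to a cost (lower is better).
    """
    names = list(grid)
    best_config, best_cost = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        cost = evaluate(config)
        if cost < best_cost:
            best_config, best_cost = config, cost
    return best_config, best_cost
```

With three candidates per hyperparameter, the full grid contains 81 configurations, so each call to `evaluate` should be kept short, for example by limiting the number of policy updates per run.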

Performance Study
We performed two experiments to evaluate DRLIS compared to its counterparts, regarding the load balancing of the servers, the response time of the IoT applications, and the weighted cost.

Cost vs Policy Update Analysis
In this experiment, we investigate the performance of the algorithms over successive policy updates. We used the four applications mentioned in Section 6.1.2 for training with the resolution parameter set to 480, and the maximum number of iterations is set to 100. The training results of the algorithms with the three optimization objectives are shown in Fig. 6.
As shown in Fig. 6a, when optimizing the load-balancing problem of the servers, the average computational resource variance of the servers is lower for the Q-Learning-based, DQN-based, and DRLIS-based schedulers than for the NSGA2-based and NSGA3-based schedulers. Moreover, only the reinforcement learning-based scheduler can achieve a stable [...] In addition, in the weighted cost scenario, the DRLIS-based scheduler can converge the cost to a stable level after about 30 policy updates, while the Q-Learning-based scheduler usually takes about 60 updates to converge to a slightly higher level, and the DQN-based scheduler needs more than 80 updates to converge to the same level. Overall, compared with the Q-Learning-based scheduler, which converges stably and fastest among the baseline algorithms, the average performance of the DRLIS-based scheduler improves by 55%, 37%, and 50% in terms of server load balancing, IoT application response time, and weighted cost, respectively.

Scheduling Overhead Analysis
In this section, we investigate the scheduling overhead of schedulers based on different techniques when handling IoT applications. The environment settings are the same as in Section 6.1.1, and the resolution of the IoT applications is set to 480. For each scheduler, we repeat the experiment for 100 rounds, feeding four IoT applications to the scheduler in each round. Besides, we define the average scheduling overhead as $O_{avg} = \frac{O_{total}}{100}$, where $O_{total}$ represents the total overhead spent by the scheduler to handle the applications in 100 rounds.
Figure 8 depicts the average scheduling overhead $O_{avg}$ with a 95% Confidence Interval (CNFI) of schedulers based on different techniques when handling IoT applications. The scheduling overheads of reinforcement learning techniques (i.e., DRLIS, DQN, Q-Learning) are usually lower than those of metaheuristic techniques (i.e., NSGA2, NSGA3). In addition, the 95% CNFI of the scheduling overhead of reinforcement learning techniques is also much narrower than that of metaheuristic techniques. Specifically, the scheduling overhead of DRLIS is more than 50% lower than that of NSGA2 and NSGA3, and more than 33% lower than that of DQN, but it is about 2ms higher than that of Q-Learning. However, considering that the convergence speed of DRLIS is much faster than that of Q-Learning, as discussed in Section 6.3.1, the increased overhead of DRLIS over Q-Learning is negligible.
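The average overhead and its confidence interval can be computed from per-round measurements as sketched below. The normal approximation with the 1.96 quantile and the function name are assumptions for this example; the paper does not state how the interval was derived.

```python
from math import sqrt
from statistics import mean, stdev


def mean_with_ci95(samples):
    """Return (mean, lower, upper) for a 95% confidence interval.

    Uses the normal approximation: mean +/- 1.96 * s / sqrt(n), where s
    is the sample standard deviation and n the number of samples.
    """
    n = len(samples)
    average = mean(samples)
    half_width = 1.96 * stdev(samples) / sqrt(n)
    return average, average - half_width, average + half_width
```

For 100 rounds, `samples` would hold the 100 per-round overheads of one scheduler. A narrower interval indicates that the scheduler's overhead varies less from round to round.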

Conclusions and Future Work
In this paper, we proposed DRLIS, a DRL-based algorithm to solve the weighted cost optimization problem for IoT application scheduling in heterogeneous edge and fog computing environments. First, we proposed corresponding cost models for optimizing load balancing and response time in heterogeneous edge and fog computing environments and formulated a weighted cost model based on both of them. In addition, we implemented a practical scheduler in the FogBus2 function-as-a-service framework for scheduling IoT applications. Compared to existing work, DRLIS has significant advantages in convergence speed, optimization cost, and scheduling overhead. Extensive experiments and comparisons with other works in the literature show that DRLIS achieves performance improvements of up to 49%, 60%, and 55% in terms of load balancing, response time, and weighted cost, respectively.
For future work, considering the limited resources and the distribution of the devices in edge computing, we plan to explore distributed deep reinforcement learning to further improve the scheduler's performance.Also, we plan to consider more models to extend our proposed weighted cost model, including economic aspects and energy consumption aspects in large-scale serverless computing environments.In addition, to optimize the performance of IoT applications involving GPU tasks (e.g., image processing oriented applications), we will extend FogBus2 framework to consider resource usage when scheduling such applications on Application-Specific Integrated Circuit (ASIC)/GPU-based edge and cloud servers for more efficient performance.

Fig. 1 represents a layered view of the IoT systems in the fog computing environment. Consider $\mathcal{S} = \{S_i \mid 1 \le i \le |\mathcal{S}|\}$ as a collection of $|\mathcal{S}|$ applications, where each application contains one or more tasks, denoted as $S_i = \{S_i^j \mid 1 \le j \le |S_i|\}$. The DAG $G = (V, E)$ is used to model an IoT application, as depicted in Fig. 2. A vertex $v_j = S_i^j$ denotes a certain task of the application, and an edge $e_{j,k}$ denotes the data flow between tasks $v_j$ and $v_k$, so some tasks must be executed after their predecessor tasks are completed. $CP(S_i)$ represents the critical path (i.e., the path with the highest cost) of the DAG, marked in red in the figure.

Figure 1: A view of the IoT system in fog computing

Figure 2: Sample IoT application with the critical path in red color

The CPU model and RAM model are defined as the variance of CPU and RAM utilization of the server set after the scheduling configuration.

The agent chooses an action $a_t = a$ according to the policy $\pi(a \mid s)$, where $a \in \mathbb{A}$, and $\pi(a \mid s) = P[a_t = a \mid s_t = s]$ is the policy function, which denotes the probability of choosing the action $a$ in state $s$. After choosing action $a$, the agent receives a reward $r = \mathbb{R}[s_t = s, a_t = a]$ from the environment based on the reward function $\mathbb{R}$, and it moves to the next state $s_{t+1} = s'$ based on the state transition function $P^{a}_{ss'} = \mathbb{P}[s_{t+1} = s' \mid s_t = s, a_t = a]$. The goal of the reinforcement learning agent is to learn a policy $\pi$ that maximizes the expectation of cumulative discounted reward $\mathbb{E}_{\pi}\left[\sum_{t \in T} \gamma^t r_t\right]$. Based on the weighted cost optimization problem of IoT applications in edge and fog computing environments, the state space $\mathbb{S}$, action space $\mathbb{A}$, and reward function $\mathbb{R}$ for the MDP are defined as follows:

Figure 3: Updated Sub-Components for Reinforcement Learning in FogBus2 Framework

Figure 4: Reinforcement Learning Scheduling Module in FogBus2 Framework

Figure 5: (a) Clipping range (b) Discount factor (c) Actor network learning rate (d) Critic network learning rate

Figure 6: Cost vs policy update analysis -train phase

Figure 7: Cost vs policy update analysis -evaluation phase

Table 1
A qualitative comparison of related works with ours

Table 2
List of key notationsThe variance of RAM utilization of the server set after the scheduling configuration      One application (one task set) Ψ(  ) The load balancing model after the scheduling configuration     One task Ψ() The load balancing model after the scheduling configuration   The server set    The total execution time (ms) for task   based on the scheduling configuration       The scheduling configuration of task       The ready time (ms) for task   based on the scheduling configuration   The time (ms) consumed for required data by task   to be sent from server   to server The scheduling configuration of applications   (  )The parent tasks set of task     _  The CPU utilization (%) of server    (  ) The server set to which the dependency tasks of task    are assigned     The CPU frequency (MHz) of server      , The transmission time (ms) between server   and server    _  The RAM utilization (%) of server      , The propagation time (ms) between server   and server    _ consider the response time model      for the task    consisting of two components, the task ready time model ,  denotes the time consumed for required data by task    sent from server   to server   , and   is the server where the task    will be executed based on scheduling configuration    , and   represents the server where the parent task of task    is executed.Therefore,     for task    between server   and server   :   represents the packet size from server   to server   for task    , and    ,  represents the current bandwidth between server   and server   when the data for task    is transmitted.

Algorithm 2 summarizes the scheduling mechanism based on DRLIS.

Table 3
The hyperparameters setting for DRLIS

Table 4
The hyperparameters setting for baseline techniques