1 Introduction

Many software systems designed nowadays exploit cloud computing infrastructures, which offer high availability, security and the flexibility to allocate resources on demand. The last factor often drives the decision to implement a specific system using cloud resources, as it greatly reduces the cost of running a distributed application. Such elasticity unfortunately comes at the price of designing the application to handle scaling events, e.g., changing the number of virtual machines (horizontal scaling) or adding or removing RAM, CPU or storage (vertical scaling). Deploying the application also requires creating a policy which defines the conditions under which the system should be scaled and which resources should be utilized in such a case. It might be possible to create a configuration which works correctly over a long period of time if the environment shows stable seasonal usage patterns. Unfortunately, in many cases such patterns do not exist, which calls for using an automatic scaling policy. We can define it as a dynamic process, often operating on a Physical Machine (PM), that adapts software configurations (e.g., threads, connections, cache, etc.) and hardware resource provisioning (e.g., CPU, memory, etc.) on demand, according to time-varying environmental conditions [1].

Reinforcement Learning (RL) techniques have been explored for a long time [2, 3]. Initially, the techniques and algorithms from this category could only be used for relatively simple problems: it was assumed that the environment can be observed with the use of only a few metrics and that there are few actions to execute. Handling more complex domains became possible with recent advancements in, e.g., computer games [4], robot control [5] or the game of Go [6]. One of the main drivers of progress has been the application of Deep Learning in various forms, e.g., Deep Q Learning [7] or Policy Gradient Optimization (PGO) methods such as proximal policy optimization (PPO) [8] and Phasic Policy Gradient [9]. One of the main advantages of the mentioned methods is the ability to learn through observing and interacting with an environment which is similar to, or the same as, the one the agent is going to operate in.

One of the most popular methods in DRL is Deep Q Learning [7]. It is an extension of the classic Q Learning algorithm, in which the policy in a given state (s) always chooses the action (a) of the highest quality. The quality is defined through a Q function (Q(s, a)), which for a given combination of state and action provides a numeric value that can easily be used to compare different actions. Such a function is often hard to define analytically, hence it is approximated through an iterative algorithm. Unfortunately, this approach is not well suited for problems with many actions and possible states. Deep Q Learning attempts to solve that issue by using a neural network to approximate the Q-function. This approach has rendered interesting results, however it has some shortcomings. It cannot be easily applied to environments where not all information is known and included in the state, in which probabilistic policies are preferable. Q-Learning focuses on satisfying the Bellman equation, hence it only indirectly optimizes the policy's behavior. Those shortcomings can be addressed by using a PGO method, which optimizes the policy's parameters directly and generates probabilistic policies. Such an approach made it possible to surpass the performance of humans in a computer game [10], control 3D bodies in a simulated arena [11, 12] or solve Rubik's cube using a robotic arm [13]. There are many variants of the PGO methods: Advantage Actor-Critic (A2C) [14], Asynchronous Actor-Critic Agents (A3C) [14], Deterministic Policy Gradient (DPG) [15], Soft Actor Critic [16] or proximal policy optimization (PPO) [8].
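
To make the value-based view more concrete, the following minimal sketch shows the tabular Q-learning update that Deep Q Learning generalizes with a neural network. The state/action counts, learning rate, discount factor and the single transition are illustrative assumptions, not values used in this paper.

```python
# Minimal sketch of tabular Q-learning (the precursor of Deep Q Learning).
# All constants below are illustrative.
import numpy as np

n_states, n_actions = 16, 4
alpha, gamma, epsilon = 0.1, 0.99, 0.1
Q = np.zeros((n_states, n_actions))

def select_action(s, rng):
    # epsilon-greedy: mostly pick the action with the highest Q(s, a)
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def update(s, a, r, s_next):
    # iterative approximation of the Q function (Bellman update)
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

rng = np.random.default_rng(0)
s = 0
a = select_action(s, rng)
update(s, a, r=1.0, s_next=3)   # one hypothetical transition
```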

Many successful experiments suggest that Deep Reinforcement Learning (DRL) is a generic approach, which encourages further research and application to other domains. One area which could benefit from it is the automatic scaling of distributed applications deployed to a cloud infrastructure. The cloud resources become the environment in which an automatic agent operates, and their state becomes the state which is subject to change. Cloud vendor API calls become the actions the agent can potentially execute. The measurements and metrics which can be used to determine the mentioned state are driven by the technologies used to implement the application and are therefore well defined. The goals of the system are also clear (e.g., reduce RAM consumption, CPU load, request latency, cost of resources), which helps translate them into a reward function. Such a reward function becomes the feedback mechanism for the agent and allows it to evaluate the impact of the executed actions. Thanks to that, the agent does not need to rely on any prior knowledge and can use a process of trial-and-error experiments to discover the optimal management policy.

Unfortunately, with this approach the cost of creating the DRL policy becomes the main disadvantage. The training algorithm needs to go through multiple iterations of interacting with the managed system and observing its responses. Especially at the beginning, the actions chosen for execution might be quite random, which can easily destabilize the observed application or even make it completely unusable for the end users. Since such a situation is unacceptable in a production system, the training requires a separate, duplicate environment. Unfortunately, this increases the overall cost of running the system. Another approach, which does not incur such high costs, is to use an artificial, simulated environment. In such a scenario, the observed workloads can be replayed multiple times in a reliable fashion. Additionally, the flow of time in such an environment can be controlled, which allows speeding up the simulation and reducing the training time. DRL training processes usually also benefit from increasing the number of iterations, which can help obtain a more efficient policy.

Both Deep Q Learning and PGO methods have been applied to the automatic control of cloud resources [17,18,19,20,21]. Both approaches have their strengths and weaknesses, so it is important to evaluate which of them renders superior results in a given environment. Unfortunately, in many cases PGO methods are not taken into account, and they are very rarely used in the context of real-world test-beds. One example is [22]. In that paper, the Deep Deterministic Policy Gradient (DDPG) algorithm is used to create a control policy which sets thresholds of a traffic flow control system. While that use case made it possible to optimize the work of a cluster of machines, it addressed only the networking aspect of data processing. Unfortunately, in that example the controlled cluster is homogeneous: all the available resources have the same configuration. In many cases, though, especially in the context of automatic scaling of applications deployed to popular cloud infrastructures, resources of more than one type could potentially be employed. Depending on the situation, using a more or a less powerful resource might be the most beneficial choice. In our previous work we evaluated various algorithms from the PGO family [23] and the DQN approach and observed that PPO provides the best results in our environment. We have also demonstrated how to leverage the described ideas to create a system capable of automatically scaling a homogeneous cloud infrastructure hosting a CPU-intensive workload. In the current paper, we extend this approach to heterogeneous cloud resources: the system can adjust not only the amount of resources (i.e., the number of virtual machines) but can also decide on the type of the resources (i.e., the type of the virtual machines) to adjust. The training does not require providing any additional information about the managed application or specifying resource capabilities. All decisions are derived from the experience gained from simulations. The automatic management system under discussion has been implemented as an extension to the Semantic-based Automatic Monitoring and Management (SAMM) monitoring software [24] and includes a decision-making component utilizing a PGO training method.

The contributions of this paper are as follows:

  • We provide an extension to an existing automatic management system [25] with capabilities to manage heterogeneous resources.

  • We provide the design and implementation of a control policy capable of controlling heterogeneous resources and a policy training procedure based on the Deep Reinforcement Learning approach.

  • We demonstrate the correctness of the presented approach. We train a policy based on a long short-term memory (LSTM) [26] Deep Neural Network and deploy it to a real-world cloud-based application. While DRL and LSTM have been combined before, they were used in simulated environments. In this paper we demonstrate how a system leveraging an LSTM policy trained with the use of a PGO method can be used in a real cloud environment.

  • We analyze the efficiency of the new policy: we compare the cost of resources used by the policy in a sample scenario with the cost of resources used by a threshold-based policy typically available in a cloud environment.

The paper is organized as follows: in Sect. 2 we review the related work. Section 3 describes the design and architecture of the environment under discussion whereas Sect. 4 explains the policy training procedure. Section 5 discusses the design of the experiment and the environment it was executed in. Section 6 provides the experiment results and their discussion. Section 7 summarizes our research and outlines further work.

2 Related work

In this section we present the research which provides a broader context for our work.

2.1 Automatic resources management

Minimizing the monetary cost of cloud resources while maintaining the business requirements (sometimes defined through Quality-of-Service metrics) is a very complex task and has been an active research area for years. There are many different approaches available; the decision on which of them should be chosen depends on the features of the environment which is to be managed (e.g., granularity of managed resources, available actions, etc.). The mentioned approaches can be categorized in multiple ways. In [1] the authors describe a taxonomy which uses the following features to classify automatic management systems:

  • self-awareness—the capability to obtain and maintain information about the state of the system. There are many kinds of awareness: interaction awareness, time awareness, stimulus awareness, goal awareness, meta awareness.

  • self-adaptivity—the capability to adjust its own decision policy to new circumstances. Depending on the area which those decisions affect, we can distinguish between variants of self-configuring, self-healing, self-protecting and self-optimizing.

  • architectural patterns—the way the process of auto-scaling is structured (what are the components and how do they interact with each other). The three best known approaches are: Feedback loop [27], Observe-Decide-Act (ODA) [28], Monitor-Analysis-Plan-Execute (MAPE-K) [29].

  • QoS modeling—the controls which allow adjusting the managed system and a model which connects those controls with QoS metrics. The models can be categorized into three groups: static (the connection between metrics and the way resources are allocated is defined prior to starting the system, e.g., [30]), semi-dynamic (based on machine learning models [31] or simulation [32]), dynamic (resource allocation policies based on statistical analysis of historical workload logs, e.g., [33]).

  • granularity of control—what are the basic objects which are under the control of the agent and are used to perform scaling: virtual machine [34], container [35], application [36].

  • decision-making—the specification of the process which leads to a decision about changing the resource allocation. It should include a definition of objectives and their representations, an algorithm for how to reason and search for a decision, and a definition of the controls which need to be adjusted.

    There have been multiple attempts to tackle this issue, with the most distinctive being: rule-based control [37, 38] (an action execution occurs when a condition defined a priori is met), control theory-based [39] (control theory mechanisms are used to make a decision), search-based optimization [31, 40,41,42,43] (decisions form a large, finite search space and choosing among them is treated as a search problem; Machine Learning-based attempts are also included in this category).

The approach presented in this paper can be classified as self-optimizing, goal and stimulus aware, with a dynamic QoS modeling capability. The control agent we have developed adjusts the capacity of the system by adding and removing Virtual Machines. The policy is created using a Deep Reinforcement Learning algorithm, which places it among the search-based optimization approaches.

2.2 Reinforcement learning

One of the more active areas of research in machine learning is Reinforcement Learning [2, 44]. Its primary focus is to discover a policy for agents which autonomously take actions within a specific environment. The policy maximizes a reward whose value is returned to the agent. The process of training an agent relies on executing a series of actions. After each of them the agent observes the consequences and builds up its own knowledge. There is no supervising entity providing feedback on whether taking a certain action is better than taking another. This distinguishes this approach from supervised learning. RL is also different from unsupervised learning, which focuses on discovering the internal structure of a collection of unlabeled data.

Over the years many different approaches to RL were proposed. We can broadly categorize them as:

  • Policy-based and value-based which focus on training different components of the decision-making mechanism. The former explicitly builds a representation of the policy (a function which maps a state to an action which should be taken in that state). The latter creates a value function, which can be used e.g., to compare the values of different actions, and that in turn can be used to make a decision.

  • Monte-Carlo and Temporal Difference in which the training update is implemented either after the control episode finishes (Monte-Carlo), or after one or more steps are executed in an episode (Temporal Difference).

  • Online and offline which differ in when the agent’s policy is changed. In case of the online approach, an update happens after every step, in the offline case—after the full episode (i.e., when the training scenario is finished, the environment needs to restart and the reward is presented to the agent).

  • Model-based and model-free which differ in how the environment is modeled by the agent. In the former approach an explicit model is created (e.g., through reward estimations or specification of state transitions); in the latter, creating such a model is not necessary (the decision-making process assumes that it is sufficient to have a sample of information about state transitions).

Combining Deep Learning techniques with the model-free approach has become popular recently and resulted in the creation of so-called Deep Reinforcement Learning. In this approach, neural networks can be used to create an approximation of a function which is a part of an algorithm (e.g., the Q-function in [7]). Alternatively, in the case of policy gradient methods, neural networks can be used directly as the policy functions. The training process adjusts their weights (\(\Theta\)) based on the gradient of an estimated scalar performance objective function \(J(\Theta )\) with respect to those policy parameters:

$$\begin{aligned} \Theta _{k+1} = \Theta _{k} + \alpha \nabla _{\Theta } J(\Theta _{k}) \end{aligned}$$
(1)

where \(\Theta _{k}\) denotes policy’s parameters in the k-th iteration of the training process. The performance is usually understood as a reward returned from the environment. There are multiple versions of policy gradient optimization. In our research we focus on the proximal policy optimization (PPO) [8].
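
As a minimal illustration of the update in Eq. (1), the sketch below performs one gradient ascent step on a simple surrogate estimate of \(J(\Theta)\) in PyTorch; the tiny policy network, the sampled transition and the learning rate are assumptions made for the example, not the setup used in this paper.

```python
# Sketch of the gradient ascent update from Eq. (1): maximizing J(theta)
# is implemented by minimizing -J(theta) with a standard optimizer.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 3))
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)  # lr plays the role of alpha

state = torch.randn(1, 4)             # hypothetical observation
dist = torch.distributions.Categorical(logits=policy(state))
action = dist.sample()
reward = torch.tensor(1.0)            # hypothetical return from the environment

surrogate_J = (dist.log_prob(action) * reward).mean()  # simple estimator of J(theta)
loss = -surrogate_J                   # ascending on J == descending on -J
optimizer.zero_grad()
loss.backward()
optimizer.step()                      # theta_{k+1} = theta_k + alpha * grad J
```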

Algorithm 1 Proximal policy optimization (basic variant); rendered as a figure in the original.

The method can be outlined as presented in Algorithm 1, which is the basic variant of the algorithm introduced in [8]. The aim of the algorithm is to calculate the policy parameter update in such a way that the updated policy does not deviate from the previous one by more than a predefined threshold. That helps to avoid a situation where a single parameter update changes the policy in such a negative way that many training iterations would be required to recover. This goal is achieved through a modification of the objective function, which is defined as follows:

$$\begin{aligned} J(\Theta ) = L^{CLIP} (\Theta ) = {\mathbb {E}}_{t} \left[ min( r_{t} (\Theta ) A_{t}, clip(r_{t} (\Theta ), 1 - \epsilon , 1+ \epsilon ) A_{t} ) \right] \end{aligned}$$
(2)

where:

  • \({\mathbb {E}}_{t}\) denotes calculating the average over a batch of samples at timestep t,

  • \(A_{t}\) is an estimator of the advantage function which helps to evaluate which action is the most beneficial in a given environment state,

  • \(r_{t}\) marks the probability ratio \(r_{t}(\Theta ) = \frac{\pi _{\Theta }(a_{t} \vert s_{t})}{\pi _{\Theta _{old}}(a_{t} \vert s_{t})}\) in which \(\pi _{\Theta }(a_{t} \vert s_{t})\) denotes the probability of taking an action a in state s by a stochastic policy and \(\Theta _{old}\) are the policy parameters before the update,

  • \(clip(r_{t} (\Theta ), 1 - \epsilon , 1+ \epsilon )\) function keeps the value of \(r_{t} (\Theta )\) within some specified limits (clips it at the end of the range), as shown in Fig. 1,

  • \(\epsilon\) is a hyperparameter with a typical value between 0.1 and 0.3.

Fig. 1
figure 1

Visualization of the clip function

The improvements to the training progress offered by the PPO algorithm come from the introduction of the clip function. The clipped term (after applying the clip function) is compared with the un-clipped term \(r_{t} (\Theta ) A_{t}\) and the smaller value is chosen. This means that part of the change is ignored only if it would have too large a positive influence on the value of \(L^{CLIP}\) (the negative influence is left unchanged). That influence is depicted in Fig. 2.
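
A short sketch of how the clipped objective from Eq. (2) can be computed is shown below; it assumes that the log-probabilities under the new and old policies and the advantage estimates for a batch of timesteps are already available, and the sample numbers are purely illustrative.

```python
# Sketch of the clipped surrogate objective L^CLIP from Eq. (2).
import torch

def clipped_surrogate(logp_new, logp_old, advantages, epsilon=0.2):
    # r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # element-wise minimum: changes that would improve the objective "too much"
    # are ignored, while the negative influence is kept
    return torch.min(unclipped, clipped).mean()

# hypothetical batch of 5 timesteps
logp_old = torch.log(torch.tensor([0.2, 0.5, 0.3, 0.4, 0.1]))
logp_new = torch.log(torch.tensor([0.3, 0.4, 0.3, 0.6, 0.1]))
adv = torch.tensor([1.0, -0.5, 0.2, 2.0, -1.0])

L_clip = clipped_surrogate(logp_new, logp_old, adv)
loss = -L_clip   # PPO maximizes L^CLIP, so the optimizer minimizes its negation
```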

Fig. 2
figure 2

Visualization of the influence of using the clip function on the \(L^{CLIP}\). \(r_{t}\) as a ratio of probabilities is always greater than 0 hence we analyze two cases: when \(A > 0\) (left) and \(A < 0\) (right)

Another important component of PPO is the advantage function (\(A_{t}\)), which helps to estimate how good an action is compared with the other actions available in a specific state. To estimate the advantage function values, the generalized advantage estimation (GAE) method [12] is employed. The advantage function is expressed in units corresponding to the ones used in the reward function.
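
A minimal sketch of how GAE can be computed from per-step rewards and value estimates is given below; the discount and smoothing coefficients are typical defaults and are not claimed to be the values used in this work.

```python
# Sketch of generalized advantage estimation (GAE) producing the A_t values.
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """rewards[t] and values[t] for t = 0..T-1; values[T] bootstraps the tail."""
    T = len(rewards)
    advantages = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        last = delta + gamma * lam * last
        advantages[t] = last
    return advantages

rewards = np.array([-1.0, -1.0, -0.5, 0.0])        # e.g., negative resource cost per step
values = np.array([-2.0, -1.5, -1.0, -0.3, 0.0])   # value head outputs (+ bootstrap value)
print(gae(rewards, values))
```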

The policy (\(\pi _{{\Theta }_{old}}\)) is typically implemented as a neural network whose parameters are shared with another network that is employed by the GAE.

2.3 Applying deep reinforcement learning to automatic resources management

In practice, the complexity of cloud systems constantly increases, which makes them increasingly harder to model accurately. This translates to issues with applying classic approaches to automatic control (e.g., threshold-based rules, which also become more and more complex). That encourages using Machine Learning techniques, where a policy can be trained directly on the collected measurements. This includes using the most recent advancements in the form of DRL. Applying DRL in this context has a number of advantages. DRL is capable of creating complex decision-making policies due to the use of DNNs. Using a training process which exploits the gradient descent algorithm makes it possible to adjust the policy parameters to various optimization objectives. Past experiences can be memorized thanks to the use of experience replay buffers.

The DRL approach has already been used to automate many different tasks in resource management. In [17, 46,47,48,49] the authors train a policy which allocates tasks (or jobs) to specific servers. [18, 50] demonstrate how to create an agent which can allocate VMs to sustain the workload under management. In [51] the authors explain how a DRL agent can be used to control Mobile Edge Computing resources (both compute and networking). [52] presents an agent which is capable of horizontally and vertically scaling resources used to process a given workload.

DRL policies can be trained to optimize a variety of objectives, including: resource cost [18, 19], resource utilization [52], service latency [49], energy consumption [17, 48], task turnaround time [46]. It is also possible to optimize a combination of such objectives, e.g., service latency and energy consumption together [51]. Most often, the trained policies are evaluated by observing their behavior in the context of a sample workload running in a simulated environment [17, 19, 46, 48, 49, 51, 52]. The performance in many cases is very promising; however, it is relatively rare for the control policies to be tested in real-world systems [18, 47, 50].

The Deep-Q-Learning [7] approach seems to be the most widely exploited in the resource management domain [17,18,19, 52]. Methods which derive from that approach, e.g., Double Deep Q-Learning [50, 51], Stack Autoencoder Q-Network [47], continuous-time Q-learning for SMDP [48] and Dueling Deep Q-Learning [49], are also relatively popular.

There are also examples of combining Deep-Q-Learning with other techniques, e.g., in [53] the authors describe a novel, multilevel hybrid architecture in which agents are trained with the use of that algorithm. The described system has been used to manage the allocation of workloads to cloud resources and focused specifically on the VM placement problem. It allocated a VM to a specific host when it arrived at the cloud system, then monitored the cloud resources and their SLAs and, if necessary, relocated the VM to optimize profit or energy goals or to meet an SLA. The proposed architecture consists of three levels of control components: Node Controllers (which dynamically adjust configuration to satisfy demand on each node), Lead Nodes (which are higher-level controllers for groups of Node Controllers) and Data Center Controllers (which manage the Lead Node controllers). The authors demonstrated that Deep-Q-Learning can be used to train agents working on all of those levels. To validate the presented approach, an experiment in a simulator was conducted. Its results demonstrated a significant improvement in reducing SLA violations compared to an established heuristic (Modified Best Fit Decreasing).

In [54] a modified version of the Deep-Q-Learning algorithm is presented. The modification incorporates Successive Over-Relaxation (SOR). The authors analyze the performance of the new algorithm by training policies to play games available in the Atari collection. Additionally, they attempt to create a policy to horizontally scale resources (virtual machines) used by cloud applications. The experiments are performed in a simulated environment which uses HTTP logs from the ClarkNet and NASA servers as the workload. The experiments showed an improvement over the basic version of the DQN approach.

There are also first examples of the use of policy gradient methods. In [22] a system for automatic traffic optimization (AuTO) is presented. The authors implement it with the use of the Deep Deterministic Policy Gradient (DDPG) training algorithm, which utilizes two neural networks: the actor (responsible for making decisions) and the critic (used to evaluate the actor's decisions). The former consists of two fully connected hidden layers with 600 neurons each. The latter reuses the two mentioned layers and adds an additional fully connected layer on top. Such a model is used to demonstrate the performance and adaptability of the discussed approach to the control of a dynamic traffic flow consisting of web search and data mining requests.

In [20] the authors use a combination of Convolutional Neural Networks (CNN) and fully connected layers to create a policy which is used to schedule processing jobs in a data center. They use the Advantage Actor-Critic (A2C) algorithm to train it. The resulting policy is evaluated using the average job waiting time and average job slowdown metrics. The experiments are carried out in a simulated data center cluster containing a number of nodes with two resources: CPU and memory. The authors concluded that the proposed method performed better than the widely used Shortest Job First (SJF) and Tetris [55] approaches.

In [21] a system for the automatic allocation of Spark framework executors to virtual machines is developed. The authors use two algorithms for the policy training: Deep Q-Learning and a policy gradient method called REINFORCE. In order to evaluate how the policies generated by the mentioned algorithms perform, a simulated environment has been created. It represents a cloud-deployed Spark cluster which executes jobs from the BigDataBench benchmark suite. The pricing model was similar to the AWS EC2 instance pricing in Australia. The experiments showed that training using the PGO method was more stable than DQN and achieved superior results in terms of cost-efficiency and lower average job duration. In some cases, though, classic algorithms, e.g., Integer Linear Programming (ILP) and Adaptive Executor Placement (AEP), were able to outperform the DRL policies. The results were very promising, yet limited to a single type of workload (the Spark framework).

In [25] we demonstrated how a similar algorithm from the policy gradient optimization family, the proximal policy optimization (PPO) [8], can be used to horizontally scale cloud resources. Initially, we also experimented with the very popular DQN approach; however, in our environment we found it hard to generate a policy which would reasonably scale the resources. On the other hand, PPO reached a stable result quite quickly. Figure 3 presents a sample training progress for both algorithms. The control policies were trained using 100,000 steps in the simulated environment. The episode length depends on the actions taken by the policy. In the case of DQN the episodes varied in length, with some episodes taking many simulation steps to complete. That resulted in a reduced number of completed episodes compared to the PPO algorithm.

Fig. 3
figure 3

Training progress of DQN and PPO: the reward obtained in subsequent episodes of the resource management simulation

In our previous work the implementation was limited to controlling resources of a single type. In the present paper we extend that approach to include resources of other types and employ a recurrent network to represent the policy. We explain how a policy can be trained using a synthetically created, simulated workload, and then present how the results of training in a simulated environment can be transferred to a real-world environment. We provide an evaluation of the behavior of the control policy deployed to manage AWS cloud resources used by a scientific application. Unfortunately, automatic scaling with the use of experimental management systems is typically tested in simulated environments and rarely deployed to real cloud infrastructures. Published results also differ in many details, e.g., the type of managed workload, which makes direct comparison very hard. To provide a reference point on the quality of decisions made by the presented system, we compare it with a threshold-based management policy which is available in the employed cloud infrastructure.

3 Architecture

In this section, we present the architecture of the automatic management system under discussion. The system manages cloud infrastructure resources which are used to host a distributed application and uses RL techniques to create a decision policy.

One of the main challenges in the design of an autonomous management system is how to organize the training process. Using an environment with real cloud resources, nearly identical to the one in which the policy is going to operate, would be the best solution. Unfortunately, such a scenario would usually introduce significant additional costs which may outweigh the benefits of automatic resource allocation. Using the actual, production environment is not an option either: an untrained policy would most likely lead to a significant degradation of performance due to its poor decisions and, consequently, to business losses. To avoid such a situation, we decided to use a simulation as the training environment, which is a common solution to this problem [56]. This allows the policy to be trained in an isolated, safe environment. Regardless of the decisions made, their consequences do not influence a real infrastructure, which allows experimenting even with actions that lead to catastrophic events.

Using a simulator had a big influence on the architecture of the discussed system. It required introducing an interface between the policy and the environment it operates in. Such an interface had to create an abstraction which hides whether the policy is accessing a simulation or real cloud resources. The capabilities delivered by the simulator and the API built on top of the cloud vendor libraries had to be aligned with each other.

From a high-level perspective, a system in which an autonomous control policy is deployed needs to use some form of a feedback loop. Such a feedback loop comprises, on the one hand, a stream of actions triggered by the policy and, on the other hand, information about the state of the observed environment, which allows understanding the consequences of those actions. The presented system also follows this pattern. First, the information about the state of cloud resources is obtained by monitoring the components of the managed application. That data is then aggregated into a form that can be used by the neural network acting as the control policy. Finally, the output from the network is interpreted as the identifier of the action to execute. This action is then implemented in the managed cloud infrastructure through an available cloud API. The components of the system and the implemented feedback loop are presented in Fig. 4.

Fig. 4
figure 4

Components of the discussed system. Arrows denote interactions between them

The loop starts with collecting measurements about the resources which take part in executing the workload (marked with number 1 on the diagram). Each of the resources is configured to start reporting relevant measurements as soon as it comes online. The measurements often differ in their nature, which influences how often their values are delivered, e.g., the amount of free RAM and the CPU usage are reported every 10 s while the virtual machine (VM) count is reported once per minute. To simplify the collection of those raw measurements, we introduced the Graphite monitoring tool [57] (marked with number 2 on the diagram). Graphite aggregates all the collected values into a single interval to create a consistent snapshot of the environment. In our case this interval is set to one minute.

Next, the measurements are passed to the SAMM monitoring and management system [24] (marked with number 3 on the diagram). SAMM enables experimenting with new approaches to management automation. It makes it easy to add support for new types of resources and relevant metrics, to integrate new algorithms and technologies, and to observe their impact on the observed system. In our use case, SAMM is used to pass information between the other elements of the system. It periodically polls measurements which portray the current state of the system and aggregates them into the metrics used by the decision policy. Next, it delivers the current state of the system, in the form of metric values, to the Policy Evaluation Service (marked with number 4 on the diagram). Finally, it retrieves decisions (marked with number 5 on the diagram) and executes them through the cloud vendor API (e.g., the Amazon Web Services API), taking into account the environment constraints (e.g., the warm-up and cool-down periods; marked with number 6 on the diagram). SAMM calculates the values of the following metrics: the ratio of allocated cores, average CPU utilization, 90th percentile of CPU utilization, average RAM utilization, 90th percentile of RAM utilization, the ratio of jobs waiting for processing to the number of jobs submitted, and the ratio of jobs waiting for processing to the number of jobs submitted in the last monitoring interval.
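
The sketch below illustrates, in Python, the kind of aggregation described above (SAMM itself runs on the JVM); the per-VM sample layout, the helper name and the assumption that the queue length is reported with every sample are made up for illustration.

```python
# Sketch of aggregating raw per-VM measurements into the metric vector
# consumed by the decision policy.
import numpy as np

def build_state(samples, total_cores, jobs_submitted_total, jobs_submitted_last_interval):
    """samples: per-VM measurements collected by Graphite for one interval."""
    cpu = np.array([s["cpu_util"] for s in samples])        # 0..1 per VM
    ram = np.array([s["ram_util"] for s in samples])        # 0..1 per VM
    allocated_cores = sum(s["cores"] for s in samples)
    jobs_waiting = samples[0]["jobs_waiting"]                # queue length (shared value)
    return [
        allocated_cores / total_cores,                       # ratio of allocated cores
        float(cpu.mean()),                                   # average CPU utilization
        float(np.percentile(cpu, 90)),                       # 90th percentile of CPU utilization
        float(ram.mean()),                                   # average RAM utilization
        float(np.percentile(ram, 90)),                       # 90th percentile of RAM utilization
        jobs_waiting / max(jobs_submitted_total, 1),         # waiting vs. submitted overall
        jobs_waiting / max(jobs_submitted_last_interval, 1), # waiting vs. recently submitted
    ]
```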

The Policy Evaluation Service provides decisions on how to change the allocation of resources based on the results of evaluation of the observed system state. The decisions are made according to the policy trained with the use of the PPO algorithm. The results of the evaluation may include:

  • starting a new VM of a specific type—deficient resources are used to handle the workload under the current system state,

  • removing resources—shutting down VM of a specific type—in the current state of the system, the resources are underutilized,

  • doing nothing—a proper amount of resources is allocated.

One should remember that implementing a change to the resource allocation is always subject to the environment constraints. It is not always possible to execute an action immediately. We might need to wait for a while because: the system is in a warm-up or cool-down period (a period of inactivity which allows the metrics to stabilize after the previous action has been executed), the previous request might still be being fulfilled, or the request failed and needs to be retried after some time. In order to be able to train a policy which can cope with such limitations, these factors need to be included in the simulation used for training.
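
The following sketch illustrates how an action returned by the Policy Evaluation Service could be guarded by such constraints; the cool-down length, the executor callback and the bookkeeping variables are assumptions for illustration only.

```python
# Sketch of executing a scaling decision subject to a cool-down period and
# an in-flight request check.
import time

COOLDOWN_SECONDS = 300
last_action_time = 0.0
request_in_flight = False

def maybe_execute(action, executor):
    """action: e.g. ('start', 'm5a.large'), ('stop', 'm5a.xlarge') or ('noop', None)."""
    global last_action_time, request_in_flight
    kind, vm_type = action
    if kind == "noop":
        return "nothing to do"
    if request_in_flight:
        return "previous request still being fulfilled, retry later"
    if time.time() - last_action_time < COOLDOWN_SECONDS:
        return "within cool-down period, action postponed"
    request_in_flight = True
    try:
        executor(kind, vm_type)        # e.g., a thin wrapper around the cloud API
        last_action_time = time.time()
    finally:
        request_in_flight = False
    return f"executed {kind} for {vm_type}"
```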

The described system makes a few assumptions about the workload it helps to manage. First of all, the workload needs to be organized into many independent tasks, otherwise it would not be possible to distribute the work among a number of resources. It is also necessary to provide a possibility to monitor the number of tasks which are yet to be executed. This implies creating an explicit queue of jobs which are submitted for processing. If such a component does not exist in the controlled system, it needs to be created. Most importantly, the tasks need to be idempotent (i.e., executing them multiple times does not change the end result) and the scheduling subsystem needs to be able to track their progress. Since the resources can be added and removed at any point in time, an interruption of a task before it terminates successfully (e.g., in case the processing VMs are shut down) needs to be treated as a normal, common situation. In case of any failure, the scheduler should automatically reschedule the relevant tasks. For safety reasons, the resources that are administering the workload (e.g., accepting the input requests) should be isolated and not included in the automatic management. This should prevent the workload from being accidentally terminated.

Fulfilling the monitoring requirements may require introducing extensions to the software which generates the workloads and instrumenting the resources which are used to create tasks.

4 Using simulation in policy training

The system presented in Sect. 3 requires a policy to operate. Such a policy could potentially be trained as the first stage of managing a system. Using an environment with real cloud resources would be the best solution for the purpose of training; in such a case, the agent would be able to learn about all of the details of the controlled environment. Unfortunately, as mentioned before, with this approach the cost of creating a DRL policy would become a major challenge. The training algorithm needs to go through multiple iterations of interacting with the managed system and observing its responses, which significantly increases the resource consumption and, consequently, the cost of training.

In our system a simulation environment has been chosen as the foundation of the training process. This approach has the huge advantages of cost efficiency and of isolating the training from production environments. From the training perspective it also has a range of interesting properties. Since the simulation is isolated, the training process can be replicated and parallelized to allow for the evaluation of multiple agents at the same time. This increases the number of interactions which can be tested by the policy within a given amount of time. The flow of time in a simulation can be changed (e.g., sped up), which further reduces the time required to conduct training. The behavior of the environment and that of the workload are fully deterministic and can be easily repeated if needed. This makes the training predictable and repeatable and helps to tune the training algorithm parameters.

The policy training process has been implemented using an environment which was different from the real-world one. This environment is depicted in Fig. 5.

Fig. 5
figure 5

Components of the training system; arrows denote interactions between them

The simulator can replay any workload written in the Standard Workload Format, including the workload collection available as the Parallel Workloads Archive [58]. Jobs are submitted for processing according to the order and timing defined in the workload traces. Their actual execution is simulated. This allows the system to behave differently depending on the actions taken by the trained agent. The mentioned pre-recorded workloads may span many months or years and can include huge numbers of jobs to process. In order to make the training process faster, the flow of time can be sped up. In such a case, the events occurring in the simulation are not processed immediately when they happen. Instead, events which happen within the same, configured time interval are grouped together and processed as a batch, as sketched below. This may result in inaccuracy of the simulation (e.g., if the time flow is sped up 1000 times, new events can be scheduled for processing only in the next group, i.e., in the group which includes events from the following 1000 time units). The simulator also allows adjusting other parameters, e.g., the cost of resources by type, SLA penalties (queue wait penalty), maximum counts of VMs per type, etc.
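
A minimal sketch of this event grouping is shown below; the event tuples and the interval length are illustrative and do not reflect the internal data structures of the simulator.

```python
# Sketch of batching simulation events: events that fall into the same
# (speed-up dependent) time interval are processed as a single group.
from itertools import groupby

def batch_events(events, interval):
    """events: list of (timestamp, payload) tuples sorted by timestamp."""
    return [list(group) for _, group in groupby(events, key=lambda e: e[0] // interval)]

events = [(0, "submit job"), (120, "vm started"), (900, "job finished"), (950, "submit job")]
for batch in batch_events(events, interval=1000):   # e.g., time sped up 1000x
    print(batch)                                     # each batch is handled in one step
```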

The simulation includes a single datacenter with a configurable number of host machines. The hosts have a uniform configuration, and each of them can support multiple Virtual Machines (VMs). The resources used by the virtual machines are directly mapped to the resources of the hosts; in other words, the simulation does not allow over-provisioning of the simulated hosts. The number of virtual machines available at the beginning of the simulation is configurable as well.

The simulator has been implemented following the results of our prior research [23]. Its main process utilizes the CloudSim Plus simulation framework [59]. To decouple the simulator from other components and allow for easy reuse, it is additionally wrapped with the interface provided by the Open AI Gym framework [60]. This makes it easy to launch experiments with various RL algorithms independently of the system we have developed.
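
The sketch below shows how such a Gym-wrapped simulator can be driven through the standard reset/step interface; the environment id is hypothetical and a trained policy would replace the random action sampling.

```python
# Sketch of driving the Gym-wrapped simulator; "CloudSimPlus-v0" is a
# hypothetical environment id assumed to have been registered beforehand.
import gym

env = gym.make("CloudSimPlus-v0")
obs = env.reset()
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()   # a trained policy would be queried here
    obs, reward, done, info = env.step(action)
    total_reward += reward
print("episode reward:", total_reward)
```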

5 Experimental design

In order to evaluate our approach, we designed an experiment in which we compare our policy with a threshold-based control policy. The overall objective was to perform sample computations while limiting the cost of the used cloud resources. The diagram explaining the experiment is presented in Fig. 6.

First, we identified a sample workload which we believe could benefit from automated resource provisioning. Then we prepared a simulated training environment, which resembled the target environment in which the real application would be managed. We prepared a simulation workload and conducted the training of the control policy (phase 1 on the diagram). Next, we deployed the sample application into a publicly available cloud infrastructure and configured our system to manage that application using the trained policy (phase 2 on the diagram). Finally, to provide a reference point on the performance of our solution, we attempted to manage the sample application with the use of a publicly available tool: the threshold-based scaling policies provided in Auto-Scaling Groups (phase 3 on the diagram).

Fig. 6
figure 6

Phases of the discussed experiment

5.1 Workload

As a sample workload, we used the pytorch-dnn-evolution tool [61]. This tool attempts to discover an optimal structure of a Deep Neural Network (DNN) for a given problem (e.g., categorizing images in a given set) using a co-evolutionary algorithm. Such an approach can be used in domains where supervised learning techniques apply, i.e., there are well-defined training and test datasets. Unfortunately, due to the size of those datasets, in many such problems evolution-based methods are costly and time consuming. The evaluation of individuals (complete DNNs), which is required for the evolution process to progress, includes training them over the mentioned large datasets. To mitigate this issue, the co-evolutionary algorithm interleaves two evolutionary processes. Such an approach is possible because an absolute objective function is not always necessary to identify which individuals should be promoted to the next iteration of evolution. This can be achieved by comparing the fitness of individuals with the use of an approximation of such a function; using a subset of the original training set (the so-called fitness predictor) is one such solution. The first evolutionary process evolves the DNNs to find the best neural network architecture for a particular task. Since in many cases training over the complete dataset would be too costly to repeat for the whole population, training is conducted over a small portion of the dataset (the fitness predictor). The elements of that subset need to be carefully chosen; in a way, they can be described as the samples which are the hardest from the point of view of the evaluation. This is the purpose of the second evolutionary process: it aims at discovering such a subset of the initial training dataset. It uses the best DNN from the first process to evaluate potential subsets. The subset which receives the lowest evaluation score becomes the subset used by the first process to evaluate DNNs. In this approach the amount of data used to conduct the evaluation is greatly reduced. This in turn greatly speeds up the comparison between individuals and thus makes the evolutionary approach a viable option for problems which can be translated into supervised learning processes. The described evolutionary process is depicted in Fig. 7.
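
The compact sketch below illustrates the interleaving of the two evolutionary processes described above; the variation operators, population handling and the evaluation helper are placeholders and do not reproduce the actual pytorch-dnn-evolution implementation.

```python
# Highly simplified sketch of interleaved co-evolution of DNNs and fitness
# predictors (subsets of the training data).
import random

def evolve(population, fitness, n_keep=8):
    ranked = sorted(population, key=fitness, reverse=True)
    parents = ranked[:n_keep]
    children = [mutate(random.choice(parents)) for _ in range(len(population) - n_keep)]
    return parents + children

def mutate(individual):
    return individual  # placeholder for recombination/mutation of a genotype

def accuracy_after_short_training(net, subset):
    return random.random()  # placeholder for a few training iterations + evaluation

def coevolution_step(dnns, predictors):
    # 1) evaluate DNNs only on the current best fitness predictor (a small subset)
    best_predictor = predictors[0]
    dnns = evolve(dnns, fitness=lambda net: accuracy_after_short_training(net, best_predictor))
    # 2) evolve predictors so that they are "hard" for the current best DNN
    best_dnn = dnns[0]
    predictors = evolve(predictors,
                        fitness=lambda subset: -accuracy_after_short_training(best_dnn, subset))
    return dnns, predictors
```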

Fig. 7
figure 7

Evolutionary process used as a sample workload. In the experiment described in Sect. 5, pytorch-dnnevo is deployed and managed in a publicly available cloud environment (AWS)

The evolutionary algorithm produces a high number of relatively small tasks that are independent of each other and can be easily processed in parallel on a cluster of machines. The workload scheduling is resilient to task failures and reschedules tasks whose processing has not succeeded. The capacity of the job queue is in practice infinite thanks to the small size of a single job description. This means we can safely assume that none of the tasks is going to be dropped due to technical limitations of the queue system. Each task is going to complete successfully, regardless of how many times it needs to be restarted. Those features help to implement support for scaling events: each virtual machine used to conduct training can be safely shut down at any time. New machines can be added and start processing the evaluation tasks ad hoc, without additional configuration. The number of tasks varies over time and is hard to predict upfront. This creates an opportunity to reduce the cost of running the evolutionary process by reducing the amount of used resources (VMs) when the demand for them drops.

In our case, the evolutionary process tried to find an optimal architecture for a neural network which recognizes handwritten digits. We ran 20 iterations of evolution over a population of 32 individuals and 16 fitness predictors (subsets of 2000 images from the large training set). The evaluation of a single neural network comprised training over 10 iterations on a given fitness predictor. We used the MNIST dataset [62] as the training set from which the subsets were selected.

5.2 Target environment

As the compute infrastructure we used the Amazon Web Services Elastic Compute Cloud (AWS EC2) [63]. The managed environment consisted of three Auto Scaling Groups of m5a.large, m5a.xlarge and m5a.2xlarge virtual machines, each of which could have up to 10 instances. All VMs were running in the US North Virginia region and in the same availability zone to avoid the network latency problems introduced by multi-zone setups. The workload driver, together with SAMM and Graphite, ran on a separate VM.

As mentioned above, for the purpose of the training process, we simulated a single datacenter capable of hosting VMs of three types: small, medium and large. Their specification followed the configuration of Amazon's large (2-core CPU and 8 GB of RAM), xlarge (4-core CPU and 16 GB of RAM) and 2xlarge (8-core CPU and 32 GB of RAM) EC2 instances. Each simulation started with one virtual machine of each type active and ran until all the scheduled tasks were completed (no artificial deadline was imposed). In order to reduce the training time, the simulation time was sped up sixty times.

5.3 Policy training

We attempted to use a few real-world workloads from the Parallel Workloads Archive [58]. However, the best results in training were achieved by using a set of 1551 jobs generated specifically for the purpose of our experiment. The job scheduling pattern resembles a single run of evolution in pytorch-dnn-evolution. We organized the jobs into 21 batches (10 batches of 100 and 11 batches of 50 jobs) submitted every 8 min. Every job requested 360 s on a single CPU core. The final job was added 30 min after the final batch, which ensured that there was always a cool-down period at the end. Such a dataset, on the one hand, was similar to the real-world workload (the jobs were submitted in multiple batches which generated spikes of activity). On the other hand, it differed from the real workload in the actual number of batches and their sizes. Since the number of jobs was low, the simulation time was shorter compared with other recorded workloads. We believe such a training dataset allowed the training to focus on the general features of the environment under control (e.g., the latency of the VM control mechanism, job submission spikes), while it reduced the simulation time. This in turn allowed us to increase the number of simulations, which made it possible to obtain an improved control policy.
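
The following sketch shows how a job trace with this structure could be generated; the tuple layout is a simplification of the Standard Workload Format and the duration of the final short job is an assumption made for illustration.

```python
# Sketch of generating the synthetic training workload described above.
def generate_jobs():
    jobs = []
    submit_time, job_id = 0, 1
    batches = [100] * 10 + [50] * 11                 # 10x100 + 11x50 = 1550 jobs
    for batch_size in batches:
        for _ in range(batch_size):
            jobs.append((job_id, submit_time, 360, 1))  # (id, submit [s], runtime [s], cores)
            job_id += 1
        submit_time += 8 * 60                        # batches submitted every 8 minutes
    last_batch_submit = submit_time - 8 * 60
    jobs.append((job_id, last_batch_submit + 30 * 60, 1, 1))  # final short job, 30 min later
    return jobs

print(len(generate_jobs()))                          # 1551 jobs in total
```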

The training objective was defined as maximizing the following reward function (a sketch of this computation follows the symbol definitions below):

$$\begin{aligned} F(V) = - \sum \limits _{x \in V} (T_{x} \cdot C_{x}) - T_{Q} \cdot C_{Q} \end{aligned}$$
(3)

where:

  • F(V) is the negative cost of resources used for processing,

  • V denotes a set of possible VM sizes. In our experiments it includes S, M or L which represent small, medium or large VMs, accordingly,

  • \(T_{x}\) denotes the number of hours of running VMs of size x,

  • \(C_{x}\) is the hourly cost of running a machine of size x. In our case \(C_{S} = \$0.2\), \(C_{M} = \$0.4\) and \(C_{L} = \$0.8\),

  • \(T_{Q}\)—the hours spent by tasks waiting for execution,

  • \(C_{Q}\)—the hourly penalty for missing SLA targets when a task is waiting for execution. The cost of 0.036 US dollars is accrued for every second of a delay between submitting task for execution and actual execution. There were no limitations on the waiting time or the waiting queue size.
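
A minimal sketch of this reward computation is given below; it uses the hourly VM prices listed above and applies the SLA penalty per second of waiting, following the description of \(C_{Q}\). The variable names are illustrative.

```python
# Sketch of the reward from Eq. (3): negative cost of resources plus SLA penalty.
HOURLY_COST = {"S": 0.2, "M": 0.4, "L": 0.8}   # dollars per VM-hour
QUEUE_PENALTY_PER_SECOND = 0.036               # dollars per second a task waits

def reward(vm_hours, queue_wait_seconds):
    """vm_hours: e.g. {'S': 1.5, 'M': 0.5, 'L': 2.0} accumulated in the episode."""
    resource_cost = sum(HOURLY_COST[size] * hours for size, hours in vm_hours.items())
    sla_penalty = QUEUE_PENALTY_PER_SECOND * queue_wait_seconds
    return -(resource_cost + sla_penalty)      # the policy maximizes the negative cost

print(reward({"S": 1.0, "M": 1.0, "L": 1.0}, queue_wait_seconds=100))  # -(1.4 + 3.6) = -5.0
```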

The training algorithm used to create the control policy follows the proximal policy optimization procedure as described in Sect. 2.2.

5.4 Policy neural network model

We experimented with different architectures of the neural network used as the decision policy. The best results were obtained with the use of the LSTM architecture [26]. LSTM is a type of recurrent neural network, which means it passes the output of a layer back to its input. This makes it well suited to processing data in the form of sequences, as it has access to the previously made decisions. One large drawback of recurrent networks is that they are likely to suffer from the problem of exploding [64] or vanishing gradients [65]. While the network error is back-propagated, its value can grow and, over a number of iterations, accumulate in the weight values. At some point it becomes so large that it is no longer possible to represent it as a number in computer memory. On the other hand, the network error can become so small that it is not able to affect the values of the weights in a meaningful way. LSTM networks try to mitigate those issues by enhancing the internal structure of a single neuron. The additional components allow controlling the flow of information within the network, e.g., it can be multiplied by a small number (forgotten) while passing.

In our experiment, the policy neural network included LSTM cells and feed-forward layers which interpret the output of the LSTM cells in different contexts (as the policy or the value function). The LSTM layer included 128 cells. The complete network architecture is presented in Fig. 8. The network has two outputs: value and policy. The former is used in the GAE algorithm to estimate the advantage function, while the latter is used to determine the action taken by the policy.
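
A PyTorch sketch of a network with this overall shape is shown below; the input size (e.g., the seven metrics computed by SAMM), the number of actions and the way the recurrent state is handled are illustrative assumptions rather than the exact model used in the experiments.

```python
# Sketch of an LSTM-based policy/value network: a 128-cell LSTM layer followed
# by two small feed-forward heads.
import torch
import torch.nn as nn

class LstmPolicyValueNet(nn.Module):
    def __init__(self, n_inputs=7, n_actions=7, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_inputs, hidden_size=hidden, batch_first=True)
        self.policy_head = nn.Linear(hidden, n_actions)  # action logits
        self.value_head = nn.Linear(hidden, 1)           # state value used by GAE

    def forward(self, obs_seq, state=None):
        # obs_seq: (batch, time, n_inputs); state carries (h, c) between calls
        out, state = self.lstm(obs_seq, state)
        last = out[:, -1, :]                             # use the most recent timestep
        return self.policy_head(last), self.value_head(last), state

net = LstmPolicyValueNet()
logits, value, state = net(torch.randn(1, 8, 7))         # a sequence of 8 observations
action = torch.distributions.Categorical(logits=logits).sample()
```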

Fig. 8
figure 8

Neural network trained in the experiment. \(x_n\) denote the network inputs, \(l_n\) cells of the LSTM layer, s the output of a cell which is passed back to input, \(v_n\) and \(p_n\) the neurons of the value and policy outputs respectively

The progress of training that model (the reward obtained in the subsequent simulations) is depicted in Figs. 9 and 10. The first chart shows how the reward evolved over the course of training and demonstrates a clearly visible growth tendency. The second chart focuses on the length of the simulation and has been smoothed by averaging the metric values over 10 subsequent simulations.

Fig. 9
figure 9

Policy training progress—reward obtained during training

Fig. 10
figure 10

Policy training progress—simulation episode length during training

Both figures show that initially, for approximately 500 iterations, the policy was not making good decisions. This resulted in a high cost of resources within the early simulations and a high number of simulation steps. Both factors gradually improved over time. The cost plateaued at single-digit values and the simulation length at 230 iterations. Such a number of iterations is driven by the workload: it is not possible to finish the calculations earlier, because the last, very short job is scheduled at iteration 229. Since this job is short, it successfully finishes in the next iteration. This means that iteration 230 is the earliest possible iteration at which the simulation can be finished.

The policy training algorithm parameter values are given in Table 1.

Table 1 Parameters of the policy training process

Basing the implementation of the training process on the source code of the Open AI Baselines project [66] allowed us to shorten the development time and ensure the correctness of the algorithms.

5.5 Using threshold-based policy

We have found it challenging to find performance reports of similar automated management systems in real (not simulated) environments. In order to provide a reference point for the results obtained with the use of the presented control policy, we also attempted to manage the pytorch-dnnevo workload with the use of a rule-based policy configured within the ASG. This feature, provided by the cloud vendor, allows starting and stopping virtual machines based on the CPU usage of the currently running machines. The user defines a threshold which is periodically compared with the average CPU usage of all virtual machines running within the ASG. If the CPU usage is above the specified threshold, a new virtual machine is started. Conversely, if the CPU usage drops below the threshold, one of the running machines is terminated.

The workload generated by the pytorch-dnnevo framework has its own unique characteristics. The driver machine performs only cheap, simple operations of genotype recombination, mutation or fitness comparison. The time-costly operation of individual evaluation is performed solely by the workers. Unfortunately, the evaluation requests are not evenly distributed over time. They are submitted by the driver in groups whenever the evaluation of the whole population is required. This means that the workers' resources are fully allocated only at the beginning and are fully released after all individual evaluations are done. As a result, the CPU load of a worker machine oscillates between very low (5–15%) and very high (85–100%) values. Choosing a policy threshold value around the low end would force the policy to scale the number of workers up at the beginning and keep such a configuration till the end of the workload. On the other hand, choosing a threshold value from the high end would make the policy eager to remove resources, which might result in very slow progress. A value between 20 and 80% enables the threshold policy to add (or remove) resources when they are needed (or no longer needed). The exact value influences the sensitivity of the policy to load changes. We found empirically that a threshold value of 75% average CPU usage yields the lowest resource costs when managing the sample pytorch-dnnevo workload.
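
The sketch below mimics the decision logic of such a threshold rule; it is a simplified illustration and does not call the actual Auto Scaling Group API.

```python
# Sketch of the baseline threshold rule: a single average-CPU threshold decides
# whether to add or remove a VM within an Auto Scaling Group.
def threshold_decision(avg_cpu_utilization, threshold=0.75):
    """avg_cpu_utilization: mean CPU usage (0..1) over the VMs in the group."""
    if avg_cpu_utilization > threshold:
        return "start one more VM"
    if avg_cpu_utilization < threshold:
        return "terminate one VM"
    return "keep the current number of VMs"

print(threshold_decision(0.9))   # above 75% -> scale out
print(threshold_decision(0.3))   # below 75% -> scale in
```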

6 Experimental results

In Fig. 11 we present the course of the experiment. We show how many virtual machines of different types were active at a given point in time compared to the actual number of jobs waiting for processing. The shape of the charts (the steps) is caused by an artificial delay introduced after executing an action (the cool-down period).

Fig. 11
figure 11

Number of started VMs in context of jobs waiting in the queue

The overall results of the experiment are as follows: the experiment runtime was 173 min with a resource cost of $8.67 for the PPO-trained policy, and respectively 149 min and $9.95 for the threshold-based approach. The PPO-trained policy executed more slowly (by 16.1%, i.e., 24 min) but at a lower resource cost (by 12.9%, i.e., $1.28). The cost of the infrastructure required to manage the workload and the management components is the same in both cases (an additional VM to host the other elements of the system). The main objective of the policy was to conduct the computations while minimizing the costs. In this context, the PPO-trained policy obtained a lower cost: it traded additional processing time for lowering the overall cost of resources.

The PPO-trained policy maintained a similar number of VMs of all types running most of the time. Occasionally it would attempt to reduce the number of small VMs, which seemed to be a result of the pauses between submitting the jobs of subsequent evolution iterations. However, those drops were quickly compensated. The number of medium and large machines was relatively stable.

The threshold-based policy was more eager to perform scaling operations and was able to launch machines of different types at the same time. As soon as the processing load decreased, the policy started to reduce the amount of used resources. Most of the time all resource types were treated similarly (the numbers of small, medium and large VMs were increased and decreased at the same time). This is caused by the fact that all VM types were scaled based on the same metric: CPU utilization.

Fig. 12
figure 12

Cores allocated by a control policy out of the cores available to the system throughout the experiment

The way the PPO-trained policy decided to perform actions allowed it to achieve higher resource utilization throughout the experiment. The overall number of allocated cores remained stable after the initial increase (Fig. 12) and was lower by \(25.76\%\) on average (\(49.93\%\) for the PPO-trained and \(66.24\%\) for the threshold-based policy). This helped to prevent the over-provisioning visible in the case of the threshold-based policy. The average CPU utilization was 0.69 for the PPO-trained policy and 0.61 for the threshold-based one, which can be read as a \(13.11\%\) improvement. Respectively, the average percentage of memory used was 2.33 and 1.99, a \(17.08\%\) improvement. The average CPU load during the experiment (for both policies) is depicted in Fig. 13, while the average memory usage is presented in Fig. 14. Interestingly, both policies seemed to reach similar utilization values after 90 min of processing (about half of the workload executed), which suggests that they were both able to discover an infrastructure configuration capturing a near-optimal trade-off between cost and speed of processing.
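
For clarity, the quoted utilization improvements are consistent with relative differences computed against the threshold-based baseline:

\[
\frac{0.69 - 0.61}{0.61} \approx 13.11\%, \qquad \frac{2.33 - 1.99}{1.99} \approx 17.08\%.
\]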

Fig. 13 Average CPU load for the cores allocated by a control policy

Fig. 14 Percentage of all cores available to the system allocated by the control policy

We acknowledge that this might not be a fully fair comparison: e.g., it might be possible to fine-tune the threshold to avoid the described initial slow-down, or a policy using multiple thresholds might achieve even better results. This experiment shows, however, that a PPO-trained policy renders results which are on par with a well-established approach. An RL-based policy has the additional advantage of being able to take multiple factors into account without having to specify a dedicated parameter (e.g., a threshold) for each of them.

The training process proved to be flexible and can be easily reused. To create policies for other, similar workloads, one needs to adjust a few elements. First, a dataset with jobs which can be simulated in the training process has to be created. This can be achieved in various ways: one can record sample jobs executed in a real environment or create them artificially, in line with expectations about the real workload. Depending on the chosen platform and SLAs, the reward function might also need to be adjusted (e.g., by including more VM types in the V set). Finally, the monitoring of the system to be automatically controlled might need to be extended to include the metrics used to describe the state of that system.
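
To make the above list more concrete, the sketch below gathers these elements into a single configuration object. All names and values are hypothetical and serve only to indicate where the job dataset, the set V of VM types, the reward parameters and the monitored metrics would plug into the training process.

```python
from dataclasses import dataclass, field

# Hypothetical grouping of the adjustable elements listed above; the names
# and default values are illustrative, not part of the actual framework.
@dataclass
class TrainingSetup:
    # recorded or artificially generated jobs replayed by the simulator
    job_dataset: str = "workloads/sample_jobs.json"
    # the set V of VM types the policy may allocate, with hourly prices
    vm_types: dict = field(default_factory=lambda: {
        "small": 0.05, "medium": 0.10, "large": 0.20})
    # reward shaping: weight of the resource cost and of the waiting jobs
    cost_weight: float = 1.0
    queue_penalty: float = 0.1
    # metrics exposed by the monitoring system and fed to the policy as state
    state_metrics: tuple = ("cpu_util", "mem_util", "queue_length", "active_vms")
```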

7 Conclusions and further research

In this paper we have presented a novel approach to automating heterogeneous resource allocation. We proposed an architecture of a monitoring and management system which exploits recent advancements in the Deep Reinforcement Learning field. Through an experiment in the AWS Elastic Compute Cloud, we explained how to train a policy with the PPO algorithm and deploy it to a real-world cloud infrastructure. We demonstrated that the use of such a policy can render better results compared with a traditional threshold-based one. One needs to remember that the observed cost reduction depends on many factors, e.g., on the amount of managed resources (if that amount is low, the benefits of automated scaling may not be significant). Due to the cost of the additional VM hosting the management components, in smaller infrastructures the cost improvement expressed as a percentage of the initial resource spend might not be as high as reported here. Applying the presented approach in a scenario where more resources are used would render better absolute results, in other words bigger resource cost savings. The DRL-based approach also has other advantages: we did not have to manually set policy thresholds, which may depend largely on the workload being managed; we can easily include other metrics at the input of the trained policy; and, since the policy does not contain any hard-coded parameters, it can be reused in the context of other, similar applications.

The approach we used to train the policy delivered a good outcome. The resulting policy could manage a sample AWS-based infrastructure, while the training time was not prohibitively long. The use of the simulator allows running many more interactions with the resources than would be possible in a real environment. At the same time, the cost of training is greatly reduced compared to running a copy of a production version of the managed application. The training time could be reduced further by running multiple simulations in parallel: simulations are independent of each other and rely only on CPU-based calculations, which makes it easy and relatively cheap to run many of them at the same time.
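
One possible way to exploit this independence is sketched below: each worker process runs its own simulator instance and returns a trajectory, and the trajectories are then merged into a single batch for the PPO update. The `simulate_episode` function is a stand-in for the actual simulator, which is not reproduced here.

```python
from multiprocessing import Pool
import random

def simulate_episode(seed: int) -> list:
    """Stand-in for one full simulator rollout; it merely returns a dummy
    trajectory of (state, action, reward) tuples derived from the seed."""
    rng = random.Random(seed)
    return [(rng.random(), rng.randrange(3), rng.random()) for _ in range(10)]

def collect_rollouts(num_episodes: int = 8, workers: int = 4) -> list:
    """Run independent, CPU-only simulations in parallel and merge the
    resulting trajectories into a single batch for the PPO update."""
    with Pool(processes=workers) as pool:
        trajectories = pool.map(simulate_episode, range(num_episodes))
    return [step for traj in trajectories for step in traj]

if __name__ == "__main__":
    print(f"collected {len(collect_rollouts())} transitions")
```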

We have identified some issues which require further work. Our resource allocation policy was unable to react to changes in the environment quickly enough: it had to wait through the resource allocation grace period after executing an action, and it was capable of starting or stopping only a single VM of a given type at a time. This issue could be mitigated by introducing actions which affect multiple instances of resources of the same type; such a solution could also help to reach even better results in the cost optimization.
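
A straightforward way to encode such multi-instance actions is to enumerate (VM type, count change) pairs instead of single start/stop operations. The sketch below shows one hypothetical encoding; the type names and the maximum step size are assumptions, not the action space used in our experiments.

```python
from itertools import product

# Hypothetical extended action space: each action changes the number of VMs
# of one type by up to MAX_STEP instances at once, plus a global no-op.
VM_TYPES = ("small", "medium", "large")
MAX_STEP = 3

ACTIONS = [("noop", 0)] + [
    (vm_type, delta)
    for vm_type, delta in product(VM_TYPES, range(-MAX_STEP, MAX_STEP + 1))
    if delta != 0
]

# The policy network would then output a distribution over len(ACTIONS) choices.
print(len(ACTIONS))  # 1 no-op + 3 types * 6 non-zero deltas = 19 actions
```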

Although the training process delivered good results, it was still limited in a number of ways. The parameters of the training procedure (e.g., the learning rate, \(\lambda\), \(\gamma\) and the clipping factor of the PPO algorithm) had to be fine-tuned to our specific case; otherwise, the process might end up with an exploding or vanishing gradient, or with a policy converging to a local minimum (e.g., using only a single action all the time). There is also no indication of how the input is being used by the training algorithm or the policy, which may lead to creating a huge and very expensive-to-train neural network model. It is possible that, e.g., one of the metrics could be removed from the input completely because its values are mostly ignored, which would result in a smaller, easier-to-train model.
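
For reference, these are the standard PPO knobs; the values below are only an example of the kind of configuration that had to be tuned (we assume \(\lambda\) denotes the generalized advantage estimation parameter), not the exact values used in our training.

```python
# Example PPO hyperparameter set; the values are illustrative only and
# correspond to the parameters that required fine-tuning in our setup.
ppo_config = {
    "learning_rate": 3e-4,   # step size of the gradient updates
    "gamma": 0.99,           # discount factor for future rewards
    "gae_lambda": 0.95,      # lambda of generalized advantage estimation
    "clip_range": 0.2,       # PPO clipping factor for the policy ratio
    "n_epochs": 10,          # passes over each collected batch of rollouts
    "minibatch_size": 64,    # minibatch size of the policy/value updates
}
```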

In line with our expectations, the policy was able to make good decisions only in situations to which it had been exposed during the prior training (e.g., it was rather slow to shut down the unused resources after the workload had stopped completely). Unfortunately, since the training process was performed offline (outside of the cloud environment), it may be very hard to update the policy after deployment. One might argue that a simple solution to this problem is to allow the network to be continuously trained while it is operating in the cloud environment. However, such an approach has one significant disadvantage: due to the nature of the training process, the updated policy might not make decisions as good as the current one. This means that we would risk introducing potentially disastrous changes into an environment where such changes should be avoided at all cost. To mitigate this issue, the performance of the new version of the policy needs to be verified prior to deployment to the managed environment. One way to do this is to compare it with the previous, currently deployed version, e.g., by simulating the behavior of both policies in the same environment with the same entry conditions and comparing the rewards after the simulation finishes. An additional advantage of such an approach is that the decision policy becomes more closely aligned with the environment it controls, as new information is constantly being added to the representation of the policy (e.g., in the case of a DNN, to the neural network weights).
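
Such a verification step could take the form of a simple acceptance gate: both policies are rolled out in the same simulated environment with the same initial conditions and seeds, and the candidate replaces the deployed policy only if its average reward is not worse. The sketch below is a hypothetical illustration; `evaluate(policy, seed)` stands in for a full simulation run returning the episode reward.

```python
import statistics
from typing import Callable, Iterable

def should_promote(candidate, deployed,
                   evaluate: Callable[[object, int], float],
                   seeds: Iterable[int] = range(10),
                   margin: float = 0.0) -> bool:
    """Promote the candidate policy only if its mean reward over the same
    simulated scenarios is at least as good as the deployed policy's.

    `evaluate(policy, seed)` is assumed to run one full simulation with the
    given random seed and to return the total episode reward."""
    seeds = list(seeds)
    candidate_rewards = [evaluate(candidate, s) for s in seeds]
    deployed_rewards = [evaluate(deployed, s) for s in seeds]
    return statistics.mean(candidate_rewards) >= statistics.mean(deployed_rewards) - margin
```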