1 Introduction

As the cloud computing industry has grown rapidly and matured, increasingly diverse and heterogeneous workloads have been running in datacenters, including big data analytics, scientific computing, social networks, and other web service workloads. These workloads have their own service goals, and meeting these service objectives is important for user satisfaction [1].

However, it is difficult to run these workloads efficiently and effectively in datacenters because their resource demands differ. When the demands conflict, contention for the limited shared resources increases. To minimize this contention, it is important to know the workloads’ resource consumption patterns. In particular, CPU and memory are the most intensively shared resources, given the multicore architecture of most servers and the wide use of in-memory workloads such as in-memory caches [2] and in-memory databases [3].

Even though the different resource demands of datacenter workloads, including memory intensive and latency-sensitive ones, need to be considered, current datacenter frameworks [4, 5] do not take them into account sufficiently. Traditional frameworks have focused mainly on allocating resources to workloads; they have not considered resource consumption patterns such as memory accesses or CPU interrupt handling. These metrics are as important as traditional ones such as the allocated memory size, because they reflect the execution behavior of the workloads and thus can be directly associated with their performance. By using these metrics for scheduling, datacenters can handle dynamic changes in workload behavior. The key to software-defined compute (SDC), one of the emerging trends for datacenters, is to consider the dynamic behavior of workloads when allocating resources to them.

SDC is one part of the software-defined data center (SDDC), in which all datacenter functions are controlled by a software controller [6]. The concept originates from software-defined networking (SDN), which decouples the control and data planes. Similar to SDN, the SDDC aims to reconfigure and reorganize datacenter infrastructure through the software controller. With software-defined components such as SDN and SDC, the datacenter can be more flexible and cost-effective. For example, if SDC is realized, the datacenter infrastructure can easily be optimized for specific workloads and flexibly reconfigured in response to changes in workload behavior.

We introduce an effective workload-aware resource management framework for SDC. For workload-aware resource management, we monitor the behavior of workloads in the datacenter and place or schedule them based on that behavior, not on resource availability alone. The workload profiler and the workload-aware schedulers are the key components; they profile workload characteristics and schedule workloads to avoid performance interference. In particular, when demonstrating the effectiveness of the framework, we focus on performance isolation with respect to CPU and memory for latency-sensitive workloads, which are among the most common in datacenters. The workload profiler continuously monitors the resource usage patterns of latency-sensitive workloads, and if a performance anomaly is detected, the workload-aware schedulers handle the problem by dynamically scheduling workloads or migrating them to other hosts.

We implemented our schedulers in an OpenStack testbed, adapted them to a VMWare one, and evaluated the scheduling algorithms in both testbeds. Our experiments show that we can improve the performance of latency-sensitive workloads; specifically, we achieve up to twice the throughput and reduce the tail latency by up to 95 % compared to the existing frameworks.

The rest of this paper is organized as follows. Section 2 describes the mechanisms of existing frameworks such as OpenStack and VMWare vSphere and presents the proposed SDC architecture. Section 3 presents a motivational scenario and explains the new performance metrics for profiling the behavior of workloads. Section 4 explains the methods for mitigating contention for shared resources and describes the workload-aware scheduling algorithms in the proposed framework. Section 5 provides the details of our scheduler implementation. Section 6 presents the results of experiments comparing the proposed algorithms with the existing scheduling algorithms. Section 7 discusses the applicability of our approach in practice. Section 8 presents related work. Finally, Sect. 9 concludes the paper.

2 Workload-aware resource management framework

As cloud computing has become popular and the demand for constructing private or public clouds has been increasing, cloud platforms have evolved and many cloud projects have been proposed. Despite these projects, the technologies for dynamic infrastructure reconfiguration and flexible resource sharing, which are the keys to the SDDC, have been insufficiently developed and used. To address this problem, we have designed and developed an SDC framework based on OpenStack [4], one of the most popular open-source cloud platforms. The framework performs profiling and scheduling to control shared resources such as CPU and memory; for this reason, we focus on the SDC architecture. Figure 1 illustrates the overall high-level SDC architecture. With this architecture, we fill the gap between the traditional datacenter resource management framework and an ideal SDC. For more flexible and efficient resource usage, it is important to capture the behavior of workloads and schedule them based on that behavior more dynamically. In an SDC framework, all computing resources, including CPU and memory, are reorganized in a flexible manner or coordinated by a centralized or decentralized controller in order to achieve specific operational goals such as guaranteeing service level objectives (SLOs). In contrast to the traditional datacenter architecture, an SDC architecture requires components that provide the SDC functionality, such as a performance monitoring tool and a coordinator. The subsequent subsections describe the profiler and scheduler components of our SDC framework.

Fig. 1 Proposed software-defined compute architecture. The components colored in black are the ones newly added for workload profiling and scheduling; the components colored in grey are the OpenStack compute (Nova), the resource monitor (Ceilometer), and the VMs

2.1 Workload profiling for SDC

Identifying the behavior of workloads is essential to support SDC. The profiler should monitor workload behavior, and if performance degradation is detected, a recovery action should be taken. However, OpenStack has no profiler to detect performance anomalies. There is a resource metering component called Ceilometer, but it simply counts resource availability for billing. Currently, Ceilometer provides only alarm services, which report resource states to the user. The alarm services monitor the resource consumption of workloads, and if the consumption exceeds a predefined threshold, they notify the user or trigger additional actions such as autoscaling. Figure 2 shows the high-level architecture of the proposed workload profiler. It consists of two components: the performance metric monitor, which collects resource consumption statistics such as memory contention information, and the interference estimator, which evaluates the interference among VMs and hosts and propagates the evaluated information to other services. With the profiler, the cloud platform can detect performance degradation and leverage the performance metrics to schedule workloads.

Fig. 2 Proposed high-level workload profiler architecture. The workload profiler consists of two components, the performance metric monitor and the interference estimator. Each workload profiler resides in a compute host and delivers the profiled information to other services

Fig. 3 VMWare vSphere DRS. DRS collects the stats of VMs and hosts and dynamically schedules the VMs, possibly by migrating them to other hosts

2.2 Resource management for SDC

Similar to profiling, existing resource management frameworks have focused on resource availability. To the current frameworks, “how much of each resource is available in the datacenter?” is the important question. In the case of OpenStack, the number of allocatable virtual cores, the amount of available RAM, and the number of network IPs are regarded as the important resources. For this reason, the OpenStack compute service (Nova) and the OpenStack resource monitor (Ceilometer) consider resource availability first, and Nova distributes the workloads to the hosts as evenly as possible.

Fig. 4 Different levels of memory intensity (a, b) and different VM-to-core mappings (c, d). The darker a VM is colored, the more memory intensive it is. The physical CPUs are depicted as black circles. We assume that each VM has only one vCPU to show the problem clearly

VMWare vSphere, another production-level cloud solution, uses the distributed resource scheduler (DRS) as its cluster scheduler; DRS works based on resource entitlements, which are weighted resource usages computed from CPU and memory utilizations [5]. Figure 3 depicts the architecture of VMWare vSphere DRS. It dynamically reconfigures the pools of resources allocated to the VMs based on their requirements. A set of resources in a datacenter is represented by a resource pool tree, and the resources are controlled by three policies: reservation, limit, and shares. DRS differs from the Filter scheduler in the granularity of resource control, but it is similar to OpenStack in that it evaluates cluster imbalance by using the standard deviation of host resource entitlements and dynamically schedules the workloads. These resource management strategies share a limitation: neither of them considers performance goals such as the SLO.

To overcome these limitations, datacenter frameworks require flexibility and efficiency. Flexibility means that the frameworks should prioritize workloads by their performance goals and allocate resources accordingly. Efficiency means that the frameworks have to maximize the performance of workloads by allocating resources properly. To gain flexibility and efficiency, we propose a resource manager consisting of a local scheduler and a global scheduler. These schedulers differ from existing datacenter schedulers such as the OpenStack Filter scheduler in that they focus on performance goals. The global scheduler periodically predicts changes in latency, and if the predicted value exceeds a predefined threshold, it triggers the performance recovery procedure. The local scheduler classifies workloads based on their characteristics and allocates resources considering their service goals. For example, the local scheduler allows latency-critical workloads to use a dedicated resource pool, but assigns batch workloads, for which latency does not matter, to a shared resource pool. The local scheduler also adaptively reacts to resource contention on the host, and thus it can mitigate the performance degradation of workloads and improve performance isolation.

3 Workload profiling for performance isolation

This section presents the main problem with some motivational examples. We use metrics to detect contention for resources so that we can mitigate it, and metrics that capture the behavior of latency-sensitive workloads, in order to maximize resource utilization while meeting the SLOs.

3.1 Motivational scenario

We consider contention for memory in a motivating example. Suppose that VMs with various resource demands are placed on the hosts as shown in Fig. 4a, b. Although there are many types of workloads in a datacenter, we assume only two types exist in this simple scenario. In this case, some hosts may have many memory intensive VMs and others may not. If each host has an equal number of VMs, there may be no difference in resource utilization among the hosts, because the resources are allocated based on the types of VM instances. However, even when the allocated resources are similar between two hosts, there may be a large difference in the performance of their VMs; for example, the intensity of memory contention differs among hosts due to differences in the number of simultaneous memory accesses. Existing datacenter VM management is based on a strategy that maximizes the overall resource utilization of the datacenter, so current VM scheduling algorithms evenly distribute the VMs among the hosts. This strategy may be good for resource utilization, but it may not guarantee low latency or high throughput for the workloads.

Figure 4c, d show the different levels of VM memory intensity in a host depending on the VM-to-core mapping. In Fig. 4c, the memory intensive VMs are spread across the cores in the host, which can result in high memory intensity at once; in the worst case, all cores handle memory requests at the same time, and the VMs accessing the memory subsystem in the same socket suffer from performance degradation. However, if the VMs are placed and scheduled as shown in Fig. 4d, simultaneous memory requests are reduced and the performance degradation decreases accordingly. Furthermore, to avoid performance degradation, it is essential to schedule workloads across the cluster considering the resource states of both the hosts and the cluster, because the resource demands in the hosts change so dynamically over time that load imbalance can easily occur.

3.2 Performance metrics for workload behavior

This section presents the metrics used by the workload-aware schedulers for estimating memory contention and for measuring the performance of latency-sensitive workloads. In the following subsections, we answer two questions: “which metrics are useful and effective for understanding workload characteristics such as memory intensity or latency sensitivity?” and “how can the workload-aware schedulers use these metrics?”

3.2.1 Memory intensity

The memory intensity differs from the memory utilization in that it reflects the memory access behavior of a workload. For example, a workload may have high memory utilization but low memory intensity, and vice versa. Therefore, when balancing the load in the datacenter, it can be ineffective to consider only the memory utilization without taking the memory intensity into account. Memory contention is critical not only to memory intensive VMs, but also to other colocated VMs: high memory contention, caused by ineffective consolidation, can affect any VM that accesses the shared memory subsystem. For this reason, we first investigated which metric effectively identifies the memory intensity of VMs.

Fig. 5 Correlation between average normalized MBFR and average slowdown

There are many memory-related components that affect the performance of workloads. However, among the various components such as the L3 cache, the prefetcher, and the memory controller, there is no single dominant factor that determines memory contention [7]. The memory subsystem consists of two parts: the core part, which is dedicated to each core, such as the L1 and L2 caches, and the uncore part, which is shared across the cores and includes the L3 cache and the memory controller. Because of the complexity of the uncore part, we treat the entire uncore part as a black box and focus on the requests rejected by the uncore memory subsystem to decide how much memory contention occurs. Between the L2 and L3 caches, there is a per-core SQ (Super Queue) that buffers requests to the uncore part [8]. We monitor the rejections of memory requests to the SQ: when a request is rejected because the SQ has become full, an SQ-full event is generated. We call the rate of requests rejected from the SQ the ‘Memory Buffer Full Rate’, or MBFR for short.
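For illustration, the following sketch shows one way the MBFR could be sampled with Linux Perf, which our profiler uses (Sect. 5). The event name offcore_requests_buffer.sq_full, the core list, and the sampling interval are assumptions that depend on the CPU microarchitecture; this is a minimal sketch, not the exact implementation.

```python
import subprocess

# Assumed SQ-full event name; the exact name varies by Intel microarchitecture.
SQ_FULL_EVENT = "offcore_requests_buffer.sq_full"

def sample_mbfr(cpus="0-7", interval_sec=1.0):
    """Return SQ-full events per second (MBFR) over one sampling interval."""
    cmd = [
        "perf", "stat", "-x", ",",      # CSV output for easy parsing
        "-a", "-C", cpus,               # system-wide counting on the listed cores
        "-e", SQ_FULL_EVENT,
        "sleep", str(interval_sec),
    ]
    # perf stat writes its counter summary to stderr
    out = subprocess.run(cmd, capture_output=True, text=True).stderr
    for line in out.splitlines():
        if SQ_FULL_EVENT in line:
            try:
                count = float(line.split(",")[0])
            except ValueError:          # e.g. "<not counted>"
                count = 0.0
            return count / interval_sec
    return 0.0
```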

To check how MBFR reflects memory intensity and affects the performance of VMs, we conducted stress tests. We chose four workloads from the SPEC2006 benchmarks [9]: two well-known memory intensive workloads, lbm and GemsFDTD, and two well-known CPU intensive workloads, zeusmp and sjeng. We investigated the relationship between the MBFR and the slowdown of the workloads while increasing the number of instances of the same workload in a host.

Figure 5 shows the experimental results for the four different workloads. We compared the execution time and MBFR with those of the solo execution. As shown in Fig. 5, there is a strong correlation between MBFR and the average VM slowdown. For memory intensive VMs, the more VMs are colocated, the higher the MBFR and the slowdown become. In contrast, for CPU intensive VMs, neither MBFR nor slowdown changes much no matter how many VMs are colocated.

Fig. 6 Comparison between L3 miss rate (left) and MBFR (right)

To confirm that MBFR is an appropriate metric for memory intensiveness, we compared it with the L3 miss rate, which has been regarded as a possible metric for identifying memory intensity [7, 10]. Figure 6 shows the total L3 miss rate and the total MBFR for each workload. As shown in Fig. 6 (left), some memory intensive workloads have low L3 miss rates that do not reflect their memory intensity. For example, lbm is known as a memory intensive workload and shows a higher slowdown than the CPU intensive workloads, but it has a lower L3 miss rate than the CPU intensive workloads zeusmp and sjeng. In contrast, Fig. 6 (right) shows that the memory intensive workloads, lbm and GemsFDTD, have higher memory buffer full rates than the CPU intensive workloads. These results indicate that a single factor such as the L3 miss rate does not capture memory contention in all cases. Through these stress tests and comparison experiments, we found that MBFR is an effective metric reflecting the slowdown of a VM.

3.2.2 SoftIRQs

To understand the behavior of latency-sensitive workloads, we need to understand how their network I/O operations are performed. When latency-sensitive workloads communicate with clients, network I/O occurs, and interrupts, known as SoftIRQs (software interrupt requests), are generated from the NIC of the physical machine to process the incoming network packets. As shown in Fig. 7, both the throughput of the memcached workload, measured as queries processed per second (QPS), and the number of generated SoftIRQs (y-axis) increase as the request rate of the client (x-axis) grows. In other words, the more network traffic comes in, the more packets are processed and the more interrupts occur. Based on this result, we conclude that there is a strong correlation among the network traffic, the number of interrupts, and the performance of latency-sensitive workloads.
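As a minimal illustration, the NET_RX SoftIRQ counter that Linux exposes in /proc/softirqs can be sampled as follows; the per-VM attribution performed by our profiler is omitted, and the helper below is only a hypothetical sketch.

```python
def read_net_rx_softirqs(path="/proc/softirqs"):
    """Total NET_RX SoftIRQs serviced so far, summed over all CPUs."""
    with open(path) as f:
        for line in f:
            if line.strip().startswith("NET_RX:"):
                return sum(int(v) for v in line.split()[1:])
    return 0

# Sampling this counter twice and subtracting yields SoftIRQs per interval,
# the quantity that Fig. 7 correlates with memcached throughput.
```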

3.2.3 Time window for collecting the values of metrics

To detect the change in the performance of workloads, we collect the MBFR and SoftIRQ samples for some intervals of time such as every 10 (for the local scheduler) and 30 s (for the global scheduler), considering these intervals as the time windows. The use of time window is necessary for estimating the slowdown of VMs or predicting the violation of SLOs. The local/global scheduler has its own scheduling interval, and the workload profiler monitors the consecutive samples for the intervals in order to provide the information to the schedulers. If the samples are collected, the information for scheduling is calculated from the collected samples and sent to the schedulers. For example, the averages of metric values are calculated and sent to the local schedulers in order to decide the memory intensities of the hosts. For global scheduling, the workload profilers send the samples of metric values collected every 30 s to the global scheduler, and the scheduler can decide which host has workloads causing SLO violations, and then perform migrations for the workloads.
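A simplified sketch of this windowing is shown below; the window lengths follow the text, while the data structure and names are assumptions made for illustration.

```python
from collections import deque

class MetricWindow:
    """Keeps the most recent samples of a metric for a fixed-length time window."""
    def __init__(self, window_sec, sample_period_sec=1):
        self.samples = deque(maxlen=window_sec // sample_period_sec)

    def add(self, value):
        self.samples.append(value)

    def average(self):
        return sum(self.samples) / len(self.samples) if self.samples else 0.0

# One window per metric and per scheduler, following the intervals in the text.
local_mbfr_window = MetricWindow(window_sec=10)    # consumed by the local scheduler
global_mbfr_window = MetricWindow(window_sec=30)   # consumed by the global scheduler
```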

Fig. 7 Relationship between SoftIRQs and throughput. As the SoftIRQ count saturates, the throughput also saturates, indicating that the SoftIRQ count reflects the load in the queueing system

Many other researchers have recognized the importance of interrupt processing for latency-sensitive workloads and have tried to improve its performance [11–15]. Therefore, we decided to use the number of interrupts as an indicator of the performance of latency-sensitive workloads.

4 Workload-aware scheduling for performance isolation

Our approach solves the contention problem via VM scheduling. To highlight the effects of workload-aware scheduling, we construct scenarios for placing and dynamically scheduling VMs.

4.1 Method for mitigating resource contention

This subsection describes the method for mitigating resource contention and thereby improving performance. It presents the strategy for reducing simultaneous memory accesses as well as the method for predicting latency to detect performance anomalies. To reduce resource contention, we adjust the vCPU-to-core mapping and restrict the number of cores that memory intensive workloads can run on, which helps prevent the performance of colocated workloads from degrading. To meet the SLOs of latency-sensitive workloads, we predict latency by utilizing information on the memory buffer full rates and software interrupts. We use a threshold to determine SLO violations: if the predicted latency exceeds the threshold, the contention mitigation process is triggered.

4.1.1 Restricted scheduling

It is important to minimize memory contention for latency-sensitive workloads, especially in-memory latency-critical ones such as in-memory cache or in-memory database services. These services are easily affected by co-located workloads with high memory intensity, so careful scheduling is essential to meet their latency requirements. To do this, it is desirable to give these workloads dedicated access to shared resources such as cores and memory. Latency-sensitive workloads should be executed on dedicated cores, because they are especially sensitive to the sharing of CPU resources [13]. To mitigate accesses to the shared memory subsystem, it is necessary to restrict the number of memory intensive cores, that is, cores with high memory intensiveness.

Without restricting the number of such cores, the memory intensity may increase across all cores, resulting in high memory contention due to the limited resources of the memory subsystem. It is therefore critical to schedule workloads considering the limits of resources such as memory bandwidth. In this paper, to show the effectiveness of our workload-aware resource management framework, we propose a scheduling algorithm that dynamically restricts the maximum number of memory intensive cores on which memory intensive workloads can run.

4.1.2 Predicted latency

In cloud datacenters, latency-sensitive workloads such as web servers and in-memory database systems are common, and for these workloads the tail latency is considered a critical performance metric. We propose a simple, efficient model to predict SLO (service level objective) violations. The model predicts the trend of the tail latency by using only server-side information. We estimated the latency from our metrics using memcached, a well-known representative latency-sensitive workload. The derived equation, Eq. 1, is as follows:

$$\begin{aligned} W = \frac{c\times {MBFR}_{vm}}{1-{R_{IRQ}}_{vm}}, \end{aligned}$$
(1)

where W is the average waiting time, \(c\times {MBFR}_{vm}\) is the average service time of the VM (c is a constant), and \({R_{IRQ}}_{vm}\) is the utilization of the queueing system, with \(0\le {R_{IRQ}}_{vm}<1\).

$$\begin{aligned} {R_{IRQ}}_{vm} = \frac{{IRQ}_{vm}}{{IRQ}_{max}} = \frac{{IRQ}_{vm}}{{CPU}_{vm}}\times k, \end{aligned}$$
(2)

where \({R_{IRQ}}_{vm}\) is the ratio of SoftIRQs, \({IRQ}_{vm}\) is the number of SoftIRQs, \({IRQ}_{max}\) is the maximum number of SoftIRQs, \({CPU}_{vm}\) is the CPU utilization for the VM, and k is a constant.

Equation 1 was derived by using Little’s law [16] to predict latency; this law can be applied to any queueing system. To apply it to our system, we assumed that our queueing system is an M/M/1 queue, which is commonly used to model web servers. In the M/M/1 model, W can be calculated from the average service time and the utilization of the queueing system. Because the MBFR approximates the execution time of memory intensive workloads, it can be substituted for the service time. From the ratio of SoftIRQs (\({R_{IRQ}}_{vm}\)), we calculate the utilization of the queueing system as shown in Eq. 2. To obtain the maximum IRQ of the system, we simply use the fact that the more SoftIRQs are processed, the more CPU is consumed, as previously shown in Fig. 7.
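To make the model concrete, the following sketch evaluates Eqs. 1 and 2 for a single VM. The constants c and k must be calibrated per platform, and the numbers in the example call are purely illustrative.

```python
def r_irq(irq_vm, cpu_vm, k):
    """Eq. 2: utilization of the queueing system, kept strictly below 1."""
    return min((irq_vm / cpu_vm) * k, 0.999)

def predicted_latency(mbfr_vm, irq_vm, cpu_vm, c, k):
    """Eq. 1: average waiting time W of the M/M/1 model, with c*MBFR as the
    service time and R_IRQ as the utilization of the queueing system."""
    rho = r_irq(irq_vm, cpu_vm, k)
    return (c * mbfr_vm) / (1.0 - rho)

# Illustrative call only: c and k are platform dependent, and these numbers
# carry no meaning beyond exercising the formulas.
w = predicted_latency(mbfr_vm=2.0e4, irq_vm=5.0e4, cpu_vm=80.0, c=1.0e-7, k=1.0e-3)
```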

4.2 Workload-aware scheduler

We introduce workload-aware VM schedulers that minimize the memory contention of latency-sensitive VMs while meeting their latency requirements. In particular, we focus on in-memory latency-sensitive workloads such as memcached, whose latency requirements can be met by minimizing the memory contention of the corresponding VMs. The proposed schedulers perform two-phase scheduling, consisting of host-level and core-level scheduling.

4.2.1 Global scheduler

The objective of the global scheduler is to place and dynamically schedule VMs among the hosts by detecting performance anomalies against the threshold for LSVMs (latency-sensitive VMs) and to meet the performance requirements of these VMs while minimizing resource contention. First, the global scheduler must predict the latencies of the LSVMs, so it receives information on the MBFR and SoftIRQs from the profiler.

(Algorithm listing for the global scheduling procedure; the line numbers cited below refer to this listing)

The MBFR is used to measure the memory intensity, and the SoftIRQ count is used to predict performance anomalies for the LSVMs. Every 30 s, these measurements are used to check whether migrating VMs would be beneficial to the LSVMs (Line 4). We set the interval of the global scheduler to 30 s because we empirically found that this interval is adequate for the global scheduler to collect the performance information about the hosts. The global scheduler periodically checks the predicted latency of each LSVM, and when it exceeds the threshold, SLO_threshold, the scheduler starts the live migration procedure (Lines 7-18). The threshold is intended to make a latency-sensitive workload meet its performance requirement as if it ran alone. Specifically, in the case of the memcached workloads, we set the threshold to the latency predicted under a load of about 75 % of the peak throughput, which kept the latency below 10 ms. If there are other latency-sensitive workloads, multiple SLO_thresholds can be used, and the SLO_threshold of each latency-sensitive workload should be determined as in the case of the memcached workload. If the live migration procedure is triggered, the global scheduler finds the destination to migrate the LSVM to based on the MBFR; it searches for the lowest memory intensive host among all hosts except the source host. If it succeeds in finding such a host, it migrates the LSVM to that destination (Lines 10-13); otherwise, it migrates the non-LSVM with the highest memory intensity to the second lowest memory intensive host (Lines 14-17).
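Because the algorithm listing above is referenced only by its line numbers, the following Python sketch reconstructs the described behavior under our reading of it; the dict-based host/VM representation and the callbacks predict_latency_ms() and migrate() are hypothetical placeholders, not the actual implementation.

```python
SLO_THRESHOLD_MS = 10.0     # memcached threshold used in the text
SCHED_INTERVAL_SEC = 30     # the scheduler runs once per interval (Line 4)

def lowest_mbfr_host(hosts, exclude):
    """Host with the lowest MBFR, excluding the LSVM's current host."""
    candidates = [h for h in hosts if h is not exclude]
    return min(candidates, key=lambda h: h["mbfr"]) if candidates else None

def global_schedule_once(hosts, predict_latency_ms, migrate):
    """One pass over all LSVMs (Lines 7-18 of the listing)."""
    for host in hosts:
        for vm in host["vms"]:
            if not vm["latency_sensitive"]:
                continue
            if predict_latency_ms(vm) <= SLO_THRESHOLD_MS:
                continue                      # SLO expected to be met; skip
            dst = lowest_mbfr_host(hosts, exclude=host)
            if dst is not None and dst["mbfr"] < host["mbfr"]:
                migrate(vm, dst)              # move the LSVM itself (Lines 10-13)
            else:
                # the LSVM already sits on the calmest host, so move the most
                # memory intensive non-LSVM to the next calmest host (Lines 14-17)
                others = [v for v in host["vms"] if not v["latency_sensitive"]]
                if others and dst is not None:
                    migrate(max(others, key=lambda v: v["mbfr"]), dst)
```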

4.2.2 Local scheduler

The goal of the local scheduler is to maximize the performance of latency-sensitive workloads by running them on dedicated cores while executing the memory intensive VMs on a limited set of other cores. The local scheduling procedure is triggered every 15 s; we set the interval to 15 s because we empirically found that information on about 15–20 VMs was collected in 10–12 s. Once the scheduler is invoked, it classifies the vCPUs in the host into three groups: latency-sensitive vCPUs (ls), memory intensive vCPUs (mem), and non-memory intensive vCPUs (nmem) (Line 4). After classifying the vCPUs, the local scheduler allocates dedicated cores to the latency-sensitive vCPUs by pinning each such vCPU to its own dedicated core in order to meet the low latency requirement (Line 6). The other vCPUs (mem and nmem) are pinned to the remaining, non-dedicated cores. Dividing the resources into two groups, dedicated and shared, is important for achieving high performance of latency-critical workloads while maintaining high overall server utilization. To schedule the workloads other than the latency-sensitive ones, the scheduler decides the number of memory intensive cores for throttling concurrent memory accesses (Line 8).

The scheduler determines this number dynamically based on the memory intensity of the current host. Given that the MBFR of a host increases quadratically, as shown in Fig. 6 (right), we empirically found that the MBFR of a host rises too rapidly when memory intensive VMs run on more than approximately half of the cores in the host. We therefore set the maximum number of memory intensive cores to half of the cores in the host. The local scheduler schedules memory intensive workloads on this limited set of memory intensive cores and distributes the remaining, non-memory intensive workloads over the rest of the shared cores. When allocating cores to vCPUs, the scheduler considers the number of vCPUs running on each core, which can be regarded as a kind of fairness, so that it can maximize core utilization in the host (Lines 12, 15, and 17).

(Algorithm listing for the local scheduling procedure; the line numbers cited in the surrounding text refer to this listing)

To maximize utilization, the local scheduler finds the core on which the smallest number of vCPUs is scheduled. For example, suppose a vCPU is in the unscheduled set, and core 0 and core 1 are the candidate destination cores, hosting one vCPU and three vCPUs, respectively. The local scheduler calculates the difference in the number of vCPUs between the cores in the get_min_vcpu_core() function. If the difference is 0, the vCPU is scheduled on the more memory intensive core; otherwise, there is an imbalance between the cores, and the vCPU is scheduled on the less loaded core (core 0 in this example).

Once the destination core is determined, the CPU affinity of the vCPU is set to the destination core in the schedule() function. If the destination core was already allocated to the vCPU, the function does nothing in order to avoid unnecessary overhead (Line 18).
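A simplified reconstruction of the local scheduling pass described above is sketched below; the dict-based vCPU representation, the pin() callback standing in for schedule(), and the choice of capping memory intensive cores at half of the shared cores are assumptions made for illustration.

```python
def local_schedule_once(vcpus, cores, pin):
    """One pass of the local scheduler. `vcpus` is a list of dicts carrying a
    'kind' in {'ls', 'mem', 'nmem'} and an 'mbfr' value; `pin(vcpu, cores)`
    stands in for the schedule() step that sets CPU affinity."""
    ls   = [v for v in vcpus if v["kind"] == "ls"]     # classification (Line 4)
    mem  = [v for v in vcpus if v["kind"] == "mem"]
    nmem = [v for v in vcpus if v["kind"] == "nmem"]

    # Dedicated cores: one per latency-sensitive vCPU (Line 6).
    dedicated, shared = cores[:len(ls)], cores[len(ls):]
    for vcpu, core in zip(ls, dedicated):
        pin(vcpu, [core])

    # Cap the memory intensive cores at half of the shared cores (Line 8).
    mem_cores = shared[:max(1, len(shared) // 2)]
    load = {c: 0 for c in shared}

    def least_loaded(candidates):          # plays the role of get_min_vcpu_core()
        return min(candidates, key=lambda c: load[c])

    for vcpu in sorted(mem, key=lambda v: v["mbfr"], reverse=True):
        core = least_loaded(mem_cores)     # throttle concurrent memory accesses
        pin(vcpu, [core]); load[core] += 1
    for vcpu in nmem:
        core = least_loaded(shared)        # balance by vCPU count (Lines 12, 15, 17)
        pin(vcpu, [core]); load[core] += 1
```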

5 Implementation

We implemented a prototype of the workload-aware schedulers in Python. The prototype is composed of three parts, the global scheduler, the local scheduler, and the workload profiler, which are implemented as user-level programs and run without any modification of the host and guest OSes. The global scheduler makes RPC calls to receive the host metric values from the profiler that resides in each host. The global and local schedulers are implemented on top of OpenStack Nova and KVM, and they schedule the VMs by using the Nova and Libvirt APIs. The workload profiler uses Linux Perf [17] to collect the MBFR and SoftIRQ values; to obtain these values, it reads the PMU counters every second. For detecting LSVMs, we use the tracepoint events provided by Perf. We utilize the MetricWeigher, which is part of the OpenStack Filter scheduler, to receive the metric values and decide whether a host is memory intensive or not.
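As an example of the Libvirt-based scheduling step, the following sketch pins a vCPU of a KVM guest through the libvirt Python bindings; the guest name is hypothetical, and error handling is omitted.

```python
import libvirt

def pin_vcpu(domain_name, vcpu_index, target_cores, total_pcpus):
    """Pin one vCPU of a running KVM guest to a set of physical cores."""
    conn = libvirt.open("qemu:///system")
    try:
        dom = conn.lookupByName(domain_name)
        # libvirt expects one boolean per physical CPU in the host
        cpumap = tuple(i in target_cores for i in range(total_pcpus))
        dom.pinVcpu(vcpu_index, cpumap)
    finally:
        conn.close()

# e.g. pin vCPU 0 of a hypothetical guest "memcached-vm" to core 3 of an 8-core host:
# pin_vcpu("memcached-vm", 0, {3}, 8)
```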

6 Evaluation

To illustrate the benefit of our workload-aware scheduling, we show the results of executing the scheduling algorithms step by step. We present the performance improvements from the initial placement of workloads to their dynamic scheduling in the OpenStack and VMWare testbeds, respectively. We show how beneficial the proposed workload-aware scheduling can be to a latency-sensitive workload when memory intensity and latency sensitivity are considered. We first show the benefit of local placement, which assigns all the workloads (including the non-latency-sensitive ones) to cores by considering the memory intensity of the host, and then the benefit of global placement, which assigns the workloads to hosts and subsequently to the cores of the selected hosts by considering the difference in memory intensity between the selected host and the others. Finally, we present how the proposed approach works even in a workload-changing scenario where the memory intensity of a host changes dynamically. To create a high contention environment, we ran some SPEC2006 workloads as interfering ones. We chose lbm as the memory intensive workload and sjeng as the CPU intensive one, because the memory intensities of these two workloads are almost constant. We also used the memcached workload as the latency-sensitive one; it is one of the most popular in-memory cache services for the Web. Because memcached must respond quickly to requests from web services, its tail latency is important. We show the benefit of our proposed mechanisms by applying them in each testbed and comparing them with the existing scheduling mechanisms: OpenStack uses KVM on top of a Linux OS, and VMWare vSphere uses ESXi, a bare-metal hypervisor. To show the benefit of our algorithms in the VMWare testbed, we manually performed live migration and set the CPU affinity in the UMA architecture according to our algorithm.

6.1 Experimental setup

We performed experiments in the OpenStack and VMWare vSphere testbeds. Each testbed consists of four compute hosts, running twelve VMs, and one controller host providing services such as networking and managing the compute hosts. Each host in each testbed has an eight-core Intel Xeon E5-2650 v2 CPU @ 2.6 GHz with both Hyper-Threading and DVFS disabled, and 256 GB of main memory. The servers are connected through 1 Gbps links. For dynamic live migration, we used the Ceph filesystem as the OpenStack volume backend and vSphere's VMFS filesystem for vMotion.

We considered memory contention on a UMA system. To differentiate the memory intensity of each host in a scenario, we ran the memory and CPU intensive VMs at different ratios, as shown in Table 1. We chose the ‘mutilate’ benchmark to generate realistic memcached workloads [18]. With mutilate, we simulated the Facebook ‘ETC’ workload, the representative Facebook workload with the lowest hit rate (81.4 %) [19]. To generate the load, five clients were used, and each client issued requests using sixteen threads on a sixteen-core client host. We ran the memcached server in a VM instance with two vCPUs and 6 GB of RAM, using two memcached server threads in total.

Table 1 Different configurations for experiments

6.2 Experiment results

When a VM is created, its performance depends on the scheduling policy. The existing VM scheduling algorithms used in cloud platforms such as OpenStack and VMWare are based on resource availability or resource entitlement. These approaches are resource efficient because they maximize cluster utilization by making the utilization of each host uniform. However, they are not the best in terms of VM performance because they do not consider workload patterns or characteristics. Moreover, they decide which hosts to run the VMs on, but not which cores, and thus contention for the shared resources caused by the colocation of workloads can make their performance degraded and unpredictable.

6.2.1 Local placement

Fig. 8 Comparison for local placement

To show the effect of local placement of VMs in a host, we ran thirteen VMs (one memcached VM, six CPU intensive VMs, and six memory intensive VMs) in a host. Figure 8 shows the performance improvements for the different placements of VMs on each testbed; the higher the load on the memcached VM, the lower the latency and the higher the throughput achieved by the proposed approach.

The existing approaches do not determine which workloads to schedule on which cores, leading to high CPU and memory contention. As shown in Fig. 8, our proposed approach improves the maximum throughput by 1.4 times and reduces the latency by 58 % when applied to OpenStack, and improves the maximum throughput by 1.6 times and reduces the latency by 95 % when applied to VMWare vSphere.

6.2.2 Global placement

Through the global placement experiments, we examined how and where the memcached VM is scheduled with the different scheduling algorithms, and then measured its performance. When the VM was provisioned, the schedulers in OpenStack and vSphere could not schedule the VM in the best place in terms of memory intensity. We performed the VM provisioning experiment three times in each case. In the case of OpenStack, although the available RAM size was equal among the hosts, the VM was always scheduled on host 1: the Filter scheduler selects the host based on the available RAM size, but since all hosts had the same amount of available RAM, the scheduler effectively picked a host among all hosts at random. Tables 2 and 3 show the resource usage in the OpenStack and VMWare testbeds, respectively.

Table 2 Resource Usage in the OpenStack Testbed
Table 3 Resource Usage in the vSphere Testbed
Fig. 9 Comparison for global placement

Since each host had the same amount of available memory, the memory intensive workloads in the OpenStack testbed were scheduled across all of the hosts, and the memcached VM suffered from higher memory contention. In the case of VMWare vSphere, the VM was scheduled on host 1 twice and on host 3 once, because the DRS algorithm is based on the host resource entitlement, which is the resource usage relative to the resource capacity. Given the resource capacities, the memory capacity was so much larger than the CPU capacity that CPU became the most influential factor in the host load metric. As a result, DRS did not consider memory intensity, the VMs were scheduled regardless of it, and their performance was degraded.

In contrast, when our proposed approach was applied to the OpenStack and VMWare testbeds, the scheduler always scheduled the memcached VM on host 4, which had the lowest memory intensity, and placed the colocated workloads on cores so as to minimize contention for the shared resources. As shown in Fig. 9, we obtained performance improvements in both throughput and tail latency. Although latency degraded somewhat under high load compared with the default, this is likely due to the trade-off between throughput and latency, as the throughput is twice as high.

6.2.3 Dynamic scheduling

Cloud datacenters are so dynamic that we should consider the case where workload behavior changes over time. We performed experiments in such dynamic scenarios to check how much performance could be improved with different dynamic scheduling algorithms. We started from the initial placement of the global placement experiments described above, and changed one CPU intensive workload to a memory intensive one every 30 s after the memcached VM was placed. As shown in Fig. 10, the existing default OpenStack was not able to reschedule the VM automatically via live migration; that is, it could not deal with workload changes properly, leading to increases in the tail latency. At that point, the memcached VM violated its SLO but stayed on the memory intensive host, leading to performance degradation.

Fig. 10 Dynamic scheduling (OpenStack). The existing OpenStack (a) could not handle the dynamic changes in memory intensity, whereas our proposed approach (b) detected the SLO violation and migrated the LSVM to the lowest memory intensive host other than the current one

Fig. 11 Dynamic scheduling (VMWare vSphere). The existing DRS (a) dispatched the LSVM on a memory intensive host, which is inappropriate because memory contention is not considered. Migrating only the LSVM (b) led to violations of the LSVM's SLO for a longer period of time, whereas our approach of also migrating non-LSVMs (c) led to violations for a shorter period of time

However, when our proposed approach was applied, the scheduler predicted the violation of the SLO (10 ms) and live-migrated the memcached VM to the lowest memory intensive host other than the current one. Although the tail latency increased temporarily because of the migration overhead, the SLO of the memcached VM was met. Figure 11 shows the results of VMWare vSphere's dynamic scheduling in three cases: DRS scheduling (default), the LSVM-only migration approach, and migration of all VMs (both LSVM and non-LSVMs), which is our proposed approach. With DRS, the tail latency of the memcached VM was exacerbated, but DRS did not migrate the VM: since DRS schedules VMs based on the resource entitlement, the load imbalance standard deviation (0.021) across all hosts did not exceed the migration threshold (0.141), and thus the migration process was not triggered. With LSVM-only migration combined with core mapping by our local algorithm, the schedulers migrated the LSVM. This approach shows a stable tail latency in the OpenStack testbed (Fig. 10b). Unlike that case, the approach sometimes shows an unpredictable tail latency, as in vSphere (Fig. 11b), leading to SLO violations; that is, migrating only the LSVM might fail to meet its SLO. In that case, we should also migrate the other VMs, the non-LSVMs, to the lowest memory intensive hosts other than their current ones. As shown in Fig. 11c, even after the LSVM migration, if the SLO of the LSVM was still violated, the scheduler migrated non-LSVMs until the SLO was satisfied. Consequently, our proposed approach met the SLO.

7 Applicability

Applying our proposed approach to real-world applications could be highly beneficial. For example, the approach can be adopted in public clouds such as Amazon EC2 and Google Compute Engine, which offer many memory intensive services such as in-memory databases; these services are both memory intensive and latency sensitive, so they could benefit from the approach. Other examples are entertainment services such as gaming. A gaming service is latency critical, requiring low latency and a large amount of memory for storing temporary data, and its dynamically changing behavior makes our approach applicable and effective. In fact, the approach could be useful for any in-memory latency-sensitive workload.

8 Related work

Our research is related to profiling resource contention. There have been several studies in this area; one approach measures workload sensitivity by applying pressure to a shared resource [10, 20, 21]. The benefit of this approach is that it estimates the allowable limit of contention for the shared resource, which affects workload performance regardless of the server microarchitecture. However, it has the drawback that offline profiling must precede the estimation of workload sensitivity. Another approach utilizes metrics to decide the state of the shared resources. In this work, we take this approach, where it is important how strongly the profiled metric values correlate with the workload characteristics. By suggesting a new performance metric that reflects the memory intensity, we overcome the weakness of the L3 miss rate, which has been dominantly used to characterize the memory intensity of workloads but turns out to be only partially useful. There have been a number of attempts to detect performance anomalies using various performance metrics. Google uses CPI (cycles per instruction) information for classifying the abnormal behavior of workloads [22]; this approach finds interfering workloads by using the correlation between a victim workload's CPI and the interfering workloads' CPU utilization. It is similar to ours in that it restricts the activity of interfering workloads, but it estimates workload behavior with CPI, which can be influenced by various interference sources, and thus it is limited in identifying and recovering from contention for shared resources. Monasca is one of the recent OpenStack projects [23]; it attempts to detect performance anomalies using machine learning algorithms and is currently under continuous development.

Our research is also related to scheduling of datacenter workloads. Recently, several studies have considered a datacenter as one big logical computer and managed its resources accordingly. Mesos is one of the representative projects; it views a datacenter cluster as a resource pool and maximizes resource utilization by sharing clusters among distributed workloads [24]. However, Mesos targets distributed workloads such as Spark [25], and our performance isolation techniques could be integrated into Mesos as well. There have also been projects on scheduling workloads at a datacenter while considering its heterogeneity and resource contention, such as Paragon [26] and Quasar [27]. They use histories of profiled data and need dedicated hosts for profiling, so their approaches require additional data or infrastructure for applying their resource management frameworks. In contrast, our workload-aware resource management framework does not require any offline-obtained data or hosts for offline profiling. Some researchers have also developed resource management frameworks such as Kubernetes [28] for containers and Mercury [29] for the resource management of big data workloads. Kubernetes is being actively developed by Google, and our workload-aware resource management techniques could be adopted to further improve its performance. Mercury works on top of YARN [30] and uses dynamic scheduling algorithms based on two policies, guaranteed and queueable; however, it focuses on efficient scheduling of many short-lived distributed jobs, and thus it is orthogonal to our workload-aware resource management framework. In this paper, we present our mechanisms of local scheduling (host scheduling) and global scheduling (intra-cluster scheduling) in order to demonstrate the effectiveness of our workload-aware resource management framework. We plan to perform research on cluster-level scheduling such as inter-cluster or inter-zone scheduling, as well as on policy-based scheduling with more complicated workloads such as distributed analytics and multi-tier web platforms. These are important research subjects for datacenters, and the demand for research on them is increasing continuously.

Improving the QoS of latency-sensitive workloads is crucial to datacenter users. To enhance the QoS of such workloads, there have been many attempts such as Bubble-flux [31] and Heracles [32]. Similar to our research, these studies predict and/or monitor latencies in order to meet SLOs. By enhancing our metric, we can build a better and more effective algorithm for predicting tail latencies. When improving the performance of latency-sensitive workloads, consolidating them is problematic [33]. Recent studies have focused mainly on the colocation of latency-sensitive workloads, represented as long-running production workloads, with best-effort batch workloads such as distributed data intensive workloads. However, most workloads in public cloud datacenters may be latency sensitive, and colocating them could lead to a critical problem.

9 Conclusion

We have proposed a workload-aware resource management framework that improves the latency and throughput of target workloads. By using the proposed framework and scheduling algorithms, we minimize performance interference with latency-sensitive workloads, which are representative datacenter workloads. To mitigate resource contention and understand workload behavior, we have developed effective performance metrics that reflect memory intensity and the performance of latency-sensitive workloads. Based on these metrics, we have developed workload-aware scheduling algorithms that minimize performance interference, letting host scheduling and inter-host scheduling cooperatively minimize shared resource contention. To demonstrate the effectiveness of the framework, we have implemented local (host) and global (inter-host) schedulers on the OpenStack and VMWare vSphere testbeds, two of the most popular cloud platforms, and found that the framework can lead to significant performance improvements compared with the existing scheduling algorithms. We plan to extend our algorithms to bigger and more complex systems such as NUMA ones and to other diverse emerging workloads.