Application of the Single Imputation Method to Estimate Missing Wind Speed Data in Malaysia

In almost all research fields, missing values must be handled before a detailed analysis can be made, so a suitable imputation method should be chosen. In engineering practice, wind speed has been found to be the most significant parameter in wind power. However, researchers are sometimes faced with missing wind speed data caused by equipment failure. In this study, we implement four single imputation methods to estimate missing wind speed data at three adjacent stations in Malaysia. The methods, namely the site-dependent effect method, the hour mean method, the last and next method, and the row mean method, are compared using the index of agreement to identify the best method for estimating the missing values. The results indicate that the last and next method is the best of the four methods for estimating the missing data at the wind stations considered.


INTRODUCTION
Missing data are a concern in almost all research fields and need to be addressed before data analysis. Wind speed data may be missing for several reasons, including malfunctioning equipment, severe weather and incorrect data recording. Wind speed data that are missing for these types of reasons can be categorized as Missing Completely at Random (MCAR) because their absence does not depend on other variables.
Among the previous studies on the missing data problem is the work by Plaia and Bondi (2006). They proposed a new single imputation method known as the Site-Dependent Effect Method (SDEM) and compared its performance with that of other single and multiple imputation methods, using the correlation coefficient, index of agreement, root mean square deviation and mean absolute deviation as measures of performance. SDEM was found to be the best of the methods compared in terms of all of the performance measures considered. Junninen et al. (2004) evaluated and compared univariate and multivariate methods for missing data imputation in air quality data sets. Among the univariate methods studied were linear interpolation, spline interpolation, and the nearest neighbor method, while the multivariate methods were regression-based imputation, the multivariate nearest neighbor method, the self-organizing map and the multilayer perceptron. The performance of each method was evaluated with respect to the index of agreement, the squared correlation coefficient, the root mean squared error and the mean absolute error with bootstrapped standard error. The results indicated that the univariate methods depend on the length of the gap in the data and that their performance depends on the variable under study. The results also showed that a slight improvement in the performance of multivariate methods can be achieved using hybridization, and a more substantial improvement can be achieved by using multiple imputation, where a final estimate is derived from the outputs of several multivariate fill-in methods.

MISSING DATA MECHANISM
Technically, missing data can be classified into three categories: Missing Completely at Random (MCAR), Missing at Random (MAR) and Not Missing at Random (NMAR). Consider a set of wind speed data $X = (x_j)$ and an indicator (dummy) variable $M = (m_j)$, where $m_j$ has a value of 1 if $x_j$ is missing and 0 if $x_j$ is observed. The missing data mechanism is expressed by the conditional distribution of $M$ given $X$, say $f(M \mid X, \theta)$, where $\theta$ denotes unknown parameters. Let $X_{obs}$ and $X_{miss}$ denote the observed and missing components of $X$, respectively. The missing data can be classified as MCAR if the following is true for all $X$:

$$f(M \mid X, \theta) = f(M \mid \theta)$$

If the following is true for all $X_{miss}$ and $\theta$:

$$f(M \mid X, \theta) = f(M \mid X_{obs}, \theta)$$

then the missing data are said to be MAR. However, if the distribution of $M$ also depends on the missing values, so that

$$f(M \mid X, \theta) \neq f(M \mid X_{obs}, \theta)$$

for some $X_{miss}$, the missing data are said to be NMAR. In this study, as mentioned above, the wind speed data can be categorized as MCAR because their absence does not depend on other variables.
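To make the MCAR case concrete, the following is a minimal sketch of a missingness indicator that is drawn without looking at the data at all. The function name `mcar_indicator`, the missing rate and the sample wind speeds are illustrative assumptions and are not taken from the study.

```python
import numpy as np

def mcar_indicator(n, missing_rate, seed=0):
    """Draw M = (m_j) with m_j = 1 (missing) or 0 (observed).

    The draw never looks at X, so the mechanism depends only on the
    parameter `missing_rate`, which is what MCAR requires.
    """
    rng = np.random.default_rng(seed)
    return (rng.random(n) < missing_rate).astype(int)

# Hypothetical hourly wind speeds in m/s (illustrative values only).
x = np.array([3.1, 2.8, 4.0, 5.2, 1.9, 2.2])
m = mcar_indicator(len(x), missing_rate=0.3)

x_obs = x[m == 0]    # observed component X_obs
x_miss = x[m == 1]   # component that would be lost, X_miss
```

Under MAR or NMAR the probability of `m_j = 1` would instead have to be computed from the observed or the missing values, respectively.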

METHODOLOGY
A number of methods are available in the literature to address the missing value problem (Ding and Ross, 2012; Ferrari et al., 2012; Jung et al., 2007; Templ et al., 2011). However, in this study, we focus on single imputation methods, which provide a satisfactory but simple way to impute missing wind speed data.

Site-Dependent Effect Method (SDEM):
This single imputation method was proposed by Plaia and Bondi (2006). SDEM considers spatial and temporal information to impute the missing values in the data. Table 1 shows the data set structure for the 3 wind stations considered in this study. Let $x_{swdh}$ be a generic element of the data set, where s refers to the wind station (s = 1, 2 and 3), w refers to the week (w = 1, 2, …, 53), d refers to the day of the week (d = 1, 2, …, 7) and h refers to the hour (h = 1, 2, …, 24). SDEM explicitly considers a week effect, a day effect and an hour effect. SDEM in this study is given by the following: SDEM incorporates the spatial and temporal information from each station involved in this study.
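As a rough sketch of how such an estimate could be computed, the code below assumes the additive form usually attributed to Plaia and Bondi (2006): the cross-station mean for the same week, day and hour, shifted by the average of the station's week, day and hour effects. The equal 1/3 weighting, the function name `sdem_impute` and the array layout `(station, week, day, hour)` with `np.nan` for missing values are assumptions made for illustration, not details confirmed by this study.

```python
import numpy as np

def sdem_impute(x):
    """Fill NaNs in x, an array of shape (S, W, D, H), with an SDEM-style estimate.

    Assumed form (not confirmed by the source text):
        x_hat[s,w,d,h] = mean_over_s(x[:,w,d,h])
                         + (week_effect + day_effect + hour_effect) / 3
    where each effect is a station mean minus the all-station mean.
    """
    # Cross-station mean for the same week, day and hour.
    base = np.nanmean(x, axis=0, keepdims=True)                      # (1, W, D, H)
    # Station-specific deviations from the all-station means.
    week_eff = (np.nanmean(x, axis=(2, 3), keepdims=True)
                - np.nanmean(x, axis=(0, 2, 3), keepdims=True))      # (S, W, 1, 1)
    day_eff = (np.nanmean(x, axis=(1, 3), keepdims=True)
               - np.nanmean(x, axis=(0, 1, 3), keepdims=True))       # (S, 1, D, 1)
    hour_eff = (np.nanmean(x, axis=(1, 2), keepdims=True)
                - np.nanmean(x, axis=(0, 1, 2), keepdims=True))      # (S, 1, 1, H)
    estimate = base + (week_eff + day_eff + hour_eff) / 3.0
    # Observed values are kept; only the gaps are replaced.
    return np.where(np.isnan(x), estimate, x)
```

If all stations are missing at the same week, day and hour, the cross-station mean is undefined and the gap remains `np.nan` in this sketch.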

Hour Mean Method (HMM):
The hour mean method imputes missing data from a given station using hourly information from the same station. According to Li et al. (1999), HMM fills in missing hourly observations with the mean of all known observations for the same station at the same hour over the whole year. HMM is given by the following:

$$\hat{x}_{swdh} = \bar{x}_{s..h}$$

where $\bar{x}_{s..h}$ is the mean of the values observed at site s in hour h, s = 1, 2 and 3, and h = 1, 2, …, 24.

Last and Next Method (LNM):
The last and next method imputes missing values in the data from a given station by incorporating information from the same station. LNM is performed by assigning the average of the last known and next known observations to the missing value. LNM is given by the following:

$$\hat{x}_{swdh} = \frac{x_{swd(h-1)} + x_{swd(h+1)}}{2}$$

where s = 1, 2 and 3; w = 1, 2, …, 53; d = 1, 2, …, 7; and h = 1, 2, …, 24.

However, in this form LNM can only be applied to a single missing value at a time. For runs of consecutive missing values, that is, gaps of length l ≥ 2, LNM is not applicable. We suggest a more general formulation of LNM to overcome this limitation. LNM can also be performed by assigning the average of the last known and next known observations to the missing value of a particular day, which may be written as follows:

$$\hat{x}_{swdh} = \frac{x_{sw(d-1)h} + x_{sw(d+1)h}}{2}$$

where s = 1, 2 and 3; w = 1, 2, …, 53; d = 1, 2, …, 7; and h = 1, 2, …, 24.

In addition, LNM can be performed by assigning the average of the last known and next known observations to the missing value of a particular week, which may be written as follows:

$$\hat{x}_{swdh} = \frac{x_{s(w-1)dh} + x_{s(w+1)dh}}{2}$$

where s = 1, 2 and 3; w = 1, 2, …, 53; d = 1, 2, …, 7; and h = 1, 2, …, 24.

With these formulations, LNM is more general and can easily be used to impute missing values with long gap lengths. However, care should be taken to examine the weekly and daily patterns in the data before applying this method (Plaia and Bondi, 2006; Engels and Diehr, 2003).
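The hour mean and the three last and next variants can be sketched as follows. The function names, the `(station, week, day, hour)` array layout and the use of `np.nan` for missing values are illustrative assumptions. As a simplification, the hourly variant here does not step across day boundaries, so a gap in the first or last hour of a day is left unfilled.

```python
import numpy as np

def hmm_impute(x):
    """Hour mean: fill each gap with that station's mean for the same hour."""
    hour_mean = np.nanmean(x, axis=(1, 2), keepdims=True)   # shape (S, 1, 1, H)
    return np.where(np.isnan(x), hour_mean, x)

def lnm_impute(x, axis=3):
    """Last and next: average the two neighbours along one time axis.

    axis=3 uses the previous and next hour, axis=2 the previous and next
    day, axis=1 the previous and next week. A gap stays NaN when either
    neighbour is missing or lies outside the array.
    """
    y = np.moveaxis(x, axis, -1)
    last = np.full_like(y, np.nan)
    nxt = np.full_like(y, np.nan)
    last[..., 1:] = y[..., :-1]    # value one step earlier
    nxt[..., :-1] = y[..., 1:]     # value one step later
    filled = np.where(np.isnan(y), (last + nxt) / 2.0, y)
    return np.moveaxis(filled, -1, axis)
```

A gap of length l ≥ 2 along the hour axis leaves the hourly variant without two known neighbours, which is the situation in which the day or week variant may still be able to supply an estimate.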

Row Mean Method (RMM):
The row mean method imputes missing data at a given station using the values observed at the stations in the same week, day and hour. RMM is performed by computing the mean of all known observations in the same row of the data matrix as the missing value. RMM is given by the following:

$$\hat{x}_{swdh} = \bar{x}_{.wdh}$$

where w = 1, 2, …, 53; d = 1, 2, …, 7; and h = 1, 2, …, 24.

Index of agreement as a performance indicator:
The index of agreement has been used to evaluate the effectiveness of each imputation method. Let $O_i$ denote the i-th data point and $\bar{O}$ the average of the observed data.
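A minimal sketch of the row mean and of an agreement index is given below. The row mean is taken here as the mean over stations for the same week, day and hour. The index follows Willmott's commonly used definition, d = 1 − Σ(P_i − O_i)² / Σ(|P_i − Ō| + |O_i − Ō|)², which is an assumption, since the exact form used in this study is not shown. The function names and the `(station, week, day, hour)` layout are also illustrative.

```python
import numpy as np

def rmm_impute(x):
    """Row mean: fill each gap with the mean of the known values recorded
    at the same week, day and hour (one row of the data matrix)."""
    row_mean = np.nanmean(x, axis=0, keepdims=True)   # shape (1, W, D, H)
    return np.where(np.isnan(x), row_mean, x)

def index_of_agreement(observed, predicted):
    """Willmott-style index of agreement (assumed definition)."""
    o = np.asarray(observed, dtype=float)
    p = np.asarray(predicted, dtype=float)
    o_bar = o.mean()
    num = np.sum((p - o) ** 2)
    den = np.sum((np.abs(p - o_bar) + np.abs(o - o_bar)) ** 2)
    return 1.0 - num / den
```

The index is computed only over the positions that were imputed, with `observed` holding the true values and `predicted` the imputed ones.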

ANALYSIS, RESULTS AND CONCLUSION
Before a detailed analysis is made, it is important to explore some descriptive statistics to obtain preliminary information about the data. Table 2 shows the percentage of missing data for each station. The K. Terengganu and Kemaman stations have small percentages of missing data, approximately 1.062% and 2.564%, respectively. However, the Kertih station has quite a large percentage of missing data, approximately 14.566%. To examine these gaps in detail, we use the four gap length (l) categories identified by Plaia and Bondi (2006) for missing data, namely 1 observation gap, 1 to 3 observation gaps, 3 to 12 observation gaps and more than 12 observation gaps. Table 2 also shows the distribution of missing values for each station according to this categorization. Figure 1 shows the frequency distribution of the gap length for each station.
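The descriptive step above can be sketched for a single station as follows. The code assumes an hourly series with `np.nan` for missing values and reads the four categories as the non-overlapping bins l = 1, 1 < l ≤ 3, 3 < l ≤ 12 and l > 12, which is an interpretation rather than something stated in the text. The sample values are hypothetical.

```python
import numpy as np

def gap_lengths(series):
    """Return the length of every run of consecutive NaNs in a 1-D series."""
    missing = np.isnan(np.asarray(series, dtype=float)).astype(int)
    # Pad with zeros so each run yields one +1 (start) and one -1 (end).
    edges = np.diff(np.concatenate(([0], missing, [0])))
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)
    return ends - starts

def summarize_gaps(series):
    """Percentage of missing values and gap counts per length category."""
    series = np.asarray(series, dtype=float)
    lengths = gap_lengths(series)
    return {
        "percent_missing": 100.0 * np.isnan(series).mean(),
        "l = 1": int(np.sum(lengths == 1)),
        "1 < l <= 3": int(np.sum((lengths > 1) & (lengths <= 3))),
        "3 < l <= 12": int(np.sum((lengths > 3) & (lengths <= 12))),
        "l > 12": int(np.sum(lengths > 12)),
    }

nan = np.nan
# Hypothetical hourly wind speeds with gaps of length 1, 2 and 5.
speeds = np.array([3.2, nan, 2.9, nan, nan, 4.1, nan, nan, nan, nan, nan, 3.0])
summary = summarize_gaps(speeds)
```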
To apply the imputation methods described above, especially SDEM, it is informative to examine the spatial and temporal characteristics of the data from each station. Figure 2 shows a line plot of the mean values observed in week w for each station. The week mean plots for the K. Terengganu and Kertih stations show quite similar trends, while the week mean plot for the Kemaman station has a different trend. However, the difference in the Kemaman trend is not significant. This indicates that there is some correlation in the week effect for each station. Figures 3 and 4 show the line plots of the mean values observed at hour h and on day d.
The hour mean values and day mean values exhibit approximately similar patterns for each station, except that they differ by a fairly constant amount. The similarity of the patterns for each station indicates the

Because we already know that there is some correlation in the wind speed data from the three stations, based on our subjective evaluation of Fig. 1 to 3, it is informative to apply each imputation method, particularly SDEM, to address the missing data problem. To determine which method provides the best imputation of the missing values, the index of agreement is used as the measure of performance. The evaluation process begins by simulating an incomplete data set from the portion of the original data with no missing values. Each imputation method is then applied to estimate the simulated missing values in the simulated data set. The performance of each method is calculated using the index of agreement. The method with the smallest index value is considered the best method to estimate our missing wind speed data.

Table 3 shows the index of agreement results for each method. For the data considered in this study, we found that the last and next method had the smallest index of agreement values for all of the stations, which indicates that the last and next method is the best method for estimating the missing data for the stations considered in this study. The site-dependent effect method is found to be the second-best method for estimating the missing values. Thus, we conclude that the last and next method is the best method for estimating missing values in the data used in this study.
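The evaluation procedure described above can be sketched on a single hourly series as follows. The synthetic series, the 5% missing rate, the function names and the Willmott-style form of the index are all assumptions made for illustration. The study's own data and index values are not reproduced here.

```python
import numpy as np

def index_of_agreement(observed, predicted):
    """Willmott-style index of agreement (assumed definition)."""
    o_bar = observed.mean()
    num = np.sum((predicted - observed) ** 2)
    den = np.sum((np.abs(predicted - o_bar) + np.abs(observed - o_bar)) ** 2)
    return 1.0 - num / den

def last_and_next(series):
    """Average of the previous and next hour; NaN if either is unknown."""
    last = np.concatenate(([np.nan], series[:-1]))
    nxt = np.concatenate((series[1:], [np.nan]))
    return np.where(np.isnan(series), (last + nxt) / 2.0, series)

def hour_mean(series):
    """Mean of the known values at the same hour of day (length must be a multiple of 24)."""
    by_hour = np.nanmean(series.reshape(-1, 24), axis=0)
    estimate = np.tile(by_hour, series.size // 24)
    return np.where(np.isnan(series), estimate, series)

def evaluate(complete, methods, missing_rate=0.05, seed=0):
    """Delete values at random from a complete series, impute, and score each method."""
    rng = np.random.default_rng(seed)
    mask = rng.random(complete.size) < missing_rate   # simulated gaps
    simulated = complete.copy()
    simulated[mask] = np.nan
    scores = {}
    for name, impute in methods.items():
        filled = impute(simulated)
        usable = mask & ~np.isnan(filled)             # gaps the method could fill
        scores[name] = index_of_agreement(complete[usable], filled[usable])
    return scores

# Synthetic stand-in for a gap-free stretch of hourly wind speed (60 days).
hours = np.arange(24 * 60)
noise = np.random.default_rng(1).normal(0.0, 0.3, hours.size)
complete = 5.0 + 2.0 * np.sin(2 * np.pi * hours / 24) + noise

scores = evaluate(complete, {"last_and_next": last_and_next, "hour_mean": hour_mean})
```

Scoring only the simulated gaps keeps the comparison focused on the imputed values, since every method leaves the observed values unchanged.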