The Probability Distribution Model of Wind Speed over East Malaysia

Many studies have found that wind speed is the most significant parameter of wind power. Thus, an accurate determination of the probability distribution of wind speed is an important step before estimating the wind energy potential of a particular region. Utilizing an accurate distribution will minimize the uncertainty in wind resource estimates and improve the site assessment phase of planning. Because wind regimes vary from region to region, it is reasonable to expect that different wind speed distributions will be found for different regions. Accordingly, nine different statistical distributions were fitted to the mean hourly wind speed data from 20 wind stations in East Malaysia for the period from 2000 to 2009. The Kolmogorov-Smirnov statistic, Akaike’s Information Criterion, the Bayesian Information Criterion and the $R^2$ correlation coefficient were compared across the distributions to determine the best fit for describing the observed data. A good fit for most of the stations in East Malaysia was found using the Gamma and Burr distributions, though no clear pattern was observed across all regions of East Malaysia. However, the Gamma distribution was a clear fit to the data from all stations in southern Sabah.


INTRODUCTION
In wind turbine design and site planning, the probability distribution of wind speed is critically important for estimating energy production (Morgan et al., 2011). In engineering practice, the average wind turbine power, $\bar{P}_w$, associated with the Probability Density Function (PDF) of wind speed, $X$, is obtained from:

$$\bar{P}_w = \int_0^{\infty} P_w(X)\, f(X)\, dX$$

where,
$f(X)$ = The PDF of $X$
$P_w(X)$ = The turbine power curve, which describes the power output as a function of wind speed

Generally, $P_w(X)$ is defined in proportion to the area of the airstream, measured in a plane perpendicular to the direction of the wind:

$$P_w(X) = \frac{1}{2} \rho A X^3$$

where,
$A$ = The area
$\rho$ = A constant for air density

Morgan et al. (2011) stated that the largest uncertainty in the estimation of $\bar{P}_w$ lies in the choice of the wind speed PDF, $f(X)$, since the turbine manufacturer can measure $P_w(X)$ fairly accurately. Thus, the use of a more accurate wind speed PDF will minimize the uncertainty in wind resource estimates and improve the site assessment phase of planning.
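The average power integral described above can be approximated numerically. The following is a minimal sketch, not the procedure used in this study: it assumes the standard cubic relation $P_w(X) = \frac{1}{2}\rho A X^3$, uses a 2-parameter Weibull density as a stand-in for $f(X)$, and takes hypothetical parameter values (shape 2, scale 6 m/s, air density 1.225 kg/m³, unit area). Wind speeds must be in m/s for the result to be in watts, so the km/h station data would need converting first.

```python
import math


def weibull_pdf(x, shape, scale):
    """Two-parameter Weibull density, used here only as a stand-in for f(X)."""
    if x <= 0.0:
        return 0.0  # boundary treated as zero density (exact for shape > 1)
    z = x / scale
    return (shape / scale) * z ** (shape - 1.0) * math.exp(-(z ** shape))


def power_curve(x, rho=1.225, area=1.0):
    """Assumed P_w(X) = 0.5 * rho * A * X**3 (X in m/s, rho in kg/m^3, A in m^2)."""
    return 0.5 * rho * area * x ** 3


def mean_power(pdf, power, upper=60.0, steps=20000):
    """Trapezoidal approximation of the integral of power(x) * pdf(x) on [0, upper]."""
    h = upper / steps
    total = 0.0
    for i in range(steps + 1):
        x = i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * power(x) * pdf(x)
    return total * h


# Hypothetical Weibull parameters (shape 2, scale 6 m/s), not fitted values.
estimate = mean_power(lambda x: weibull_pdf(x, 2.0, 6.0), power_curve)
print(f"mean power for unit area: {estimate:.1f} W")
```

For a Weibull density the integral also has the closed form $\frac{1}{2}\rho A c^3 \Gamma(1 + 3/k)$, with $k$ the shape and $c$ the scale, which gives a convenient check on the numerical result.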
Wind speed distributions have been explored by several researchers, and the 2-parameter Weibull and Rayleigh distributions are often quoted as the most popular choices for wind speed. However, several authors have indicated that the Weibull and Rayleigh distributions should not be used in a generalized way, as they fail to represent some wind regimes (Carta and Ramirez, 2007; Brano et al., 2011; Jaramillo and Borja, 2004; Safari, 2011). For example, Brano et al. (2011) investigated 7 probability density functions employed to describe wind speed frequency distributions: Weibull, Rayleigh, Lognormal, Gamma, Inverse Gaussian, Pearson type V and Burr. They found the Burr distribution to be the most reliable statistical distribution. Jaramillo and Borja (2004) showed that the Mixed Weibull distribution is more appropriate than the 2-parameter Weibull distribution for regions where wind speed presents a bimodal PDF. Safari (2011) used 5 probability distribution functions to fit wind speed data from 4 wind stations in Rwanda: Weibull, Rayleigh, Lognormal, Normal and Gamma. His results showed that Weibull and Gamma were the most suitable distributions. Further research on wind speed distributions has been conducted by Carta et al. (2008, 2009), Zhou et al. (2010) and Celik (2003), among others.
Among the early works on wind energy research in Malaysia is the work by Sopian et al. (1995). They analyzed data from 10 wind stations in Malaysia using the Weibull distribution. Their results indicated that Mersing and Kuala Terengganu possess the best potential for wind energy development. Masseran et al. (2012) investigated the persistence of wind speed in Peninsular Malaysia using the stationary properties of time series and the wind speed duration curve. Their results revealed that the Chuping wind station had the most persistent wind speeds compared to other stations. However, data from the Mersing wind station were the most persistent at the level of wind speed suitable for generating energy. In addition, Ong et al. (2011) noted that a 150 kW wind turbine, which was built in Terumbu Layang-Layang in 2005, demonstrated some success. Furthermore, Tenaga Nasional Berhad (TNB), which is the only electricity supplier in Malaysia, built two wind turbines at Pulau Perhentian. Additionally, the Ministry of Rural and Regional Development built 8 small wind turbines in Sabah and Sarawak for local communities (Ong et al., 2011).
In this study, we focus on determining the best statistical model that describes the wind regime in East Malaysia. This model provides vital information for the assessment of wind energy potential.
Study area, regional climate and data: East Malaysia is the part of Malaysia situated on the island of Borneo. It lies entirely in the equatorial zone, with a geographic coordinate of 2° 30' north latitude and 119° 30' east longitude. Throughout the year, East Malaysia experiences a wet and humid climate with daily temperatures ranging from 25.5 to 35°C. The wind that blows across East Malaysia is influenced by the northeast monsoon, which occurs from November until March. East Malaysia is also influenced by sea breezes and land breezes, especially when the sky is clear. During the afternoon, sea breezes occur with a speed of 10 to 15 knots, while land breezes occur at night. The data used in this study consist of hourly wind speeds (km/h) from 20 wind stations across the region. The collection period was from January 2000 to November 2009. Data were obtained from the Department of Environment and the Malaysian Meteorology Department.

METHODOLOGY
Wind speed probability distribution (Table 1): To describe the behavior of wind speed at a particular location, it is necessary to identify the distribution that best fits the data. Suitable distributions for each wind station were determined by fitting nine types of statistical distribution to the data: Weibull (WE), Burr (BR), Gamma (GA), Inverse Gamma (IGA), Inverse Gaussian (IGU), Exponential (EX), Rayleigh (RY), Lognormal (LN) and Erlang (ER). Here, ER is simply a special case of the Gamma distribution in which the shape parameter is an integer. Table 1 lists the probability density functions with their respective cumulative distribution functions (Morgan et al., 2011; Carta and Ramirez, 2007; Carta et al., 2009; Zhou et al., 2010; Evans et al., 1993).
In this study, parameter estimation for each model was performed using the maximum likelihood method (Table 2). The Maximum Likelihood Estimators (MLEs) for the parameters of the WE, GA, IGU, ER, IGA and BR distributions can be determined numerically using methods such as Newton-Raphson, scoring, the EM algorithm, quasi-Newton and the Nelder-Mead method. In this study, the Nelder-Mead method was used as the optimization technique for determining the MLEs (R Development Core Team, 2008). For the other distributions, namely LN, RY and EX, the MLEs can be easily determined. After the parameter estimation process, several goodness-of-fit measures were used to determine the most suitable statistical distribution for the data from each wind station. These measures included the Kolmogorov-Smirnov (KS) statistic, Akaike's Information Criterion (AIC) and the Bayesian Information Criterion (BIC). In addition, the $R^2$ correlation coefficient was used to evaluate the goodness of fit for each method. A large $R^2$ value indicates a good fit of the theoretical distribution to the data.
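The numerical maximum likelihood step described above can be sketched for the Gamma distribution as follows. This is an illustrative sketch under stated assumptions, not the code used in this study: the `nelder_mead` routine is a basic hand-written variant of the simplex search, the parameters are optimized on a log scale to keep them positive, the starting point comes from method-of-moments estimates, and the sample is synthetic (shape 2.5, scale 3.0) rather than station data. All function names are hypothetical.

```python
import math
import random


def gamma_nll(theta, n, sum_x, sum_log_x):
    """Negative Gamma log-likelihood; theta holds log(shape) and log(scale)."""
    shape, scale = math.exp(theta[0]), math.exp(theta[1])
    loglik = ((shape - 1.0) * sum_log_x - sum_x / scale
              - n * math.lgamma(shape) - n * shape * math.log(scale))
    return -loglik


def nelder_mead(f, x0, step=0.1, tol=1e-10, max_iter=2000):
    """Basic Nelder-Mead simplex search (reflect, expand, contract, shrink)."""
    n = len(x0)
    simplex = [list(x0)]
    for i in range(n):
        vertex = list(x0)
        vertex[i] += step
        simplex.append(vertex)
    values = [f(v) for v in simplex]
    for _ in range(max_iter):
        order = sorted(range(n + 1), key=lambda i: values[i])
        simplex = [simplex[i] for i in order]
        values = [values[i] for i in order]
        if values[-1] - values[0] < tol:
            break
        worst = simplex[-1]
        centroid = [sum(v[j] for v in simplex[:-1]) / n for j in range(n)]
        reflected = [c + (c - w) for c, w in zip(centroid, worst)]
        f_reflected = f(reflected)
        if f_reflected < values[0]:
            expanded = [c + 2.0 * (c - w) for c, w in zip(centroid, worst)]
            f_expanded = f(expanded)
            if f_expanded < f_reflected:
                simplex[-1], values[-1] = expanded, f_expanded
            else:
                simplex[-1], values[-1] = reflected, f_reflected
        elif f_reflected < values[-2]:
            simplex[-1], values[-1] = reflected, f_reflected
        else:
            contracted = [c + 0.5 * (w - c) for c, w in zip(centroid, worst)]
            f_contracted = f(contracted)
            if f_contracted < values[-1]:
                simplex[-1], values[-1] = contracted, f_contracted
            else:  # shrink every vertex halfway toward the best one
                best = simplex[0]
                simplex = [best] + [[b + 0.5 * (p - b) for b, p in zip(best, v)]
                                    for v in simplex[1:]]
                values = [values[0]] + [f(v) for v in simplex[1:]]
    best_index = min(range(n + 1), key=lambda i: values[i])
    return simplex[best_index], values[best_index]


def fit_gamma(data):
    """Gamma MLE via Nelder-Mead, started from method-of-moments estimates."""
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / n
    start = [math.log(mean * mean / var), math.log(var / mean)]
    sum_x, sum_log_x = sum(data), sum(math.log(x) for x in data)
    theta, nll = nelder_mead(lambda t: gamma_nll(t, n, sum_x, sum_log_x), start)
    return math.exp(theta[0]), math.exp(theta[1]), -nll


# Synthetic sample (shape 2.5, scale 3.0); not the station data from the study.
rng = random.Random(0)
sample = [rng.gammavariate(2.5, 3.0) for _ in range(2000)]
shape_hat, scale_hat, loglik = fit_gamma(sample)
print(f"shape={shape_hat:.3f}, scale={scale_hat:.3f}, log-likelihood={loglik:.1f}")
```

At the Gamma MLE the product of shape and scale equals the sample mean, which offers a quick sanity check on the optimizer. In practice a library routine such as R's `optim` or SciPy's `scipy.optimize.minimize` with `method="Nelder-Mead"` would normally replace the hand-written search.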

The Kolmogorov-Smirnov statistic (KS):
To determine the suitable probability distribution of wind speed for each station, the Kolmogorov-Smirnov statistic was calculated by comparing the cumulative distribution of the observed data with the cumulative distribution of the fitted model. The empirical distribution function $F_n$ for $n$ observations is defined as:

$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x)$$

where $I(X_i \le x)$ is an indicator function that equals 1 if $X_i \le x$ and 0 otherwise. The Kolmogorov-Smirnov statistic for a given theoretical cumulative distribution function $F(x)$ is given by:

$$D_n = \sup_x \left| F_n(x) - F(x) \right|$$

where $\sup_x$ is the supremum of the set of distances between $F_n(x)$ and $F(x)$. If the sample comes from the distribution $F(x)$, then $D_n$ converges to 0 almost surely. To compute the Kolmogorov-Smirnov statistic, the theoretical cumulative distribution function, $F(x)$, of each candidate distribution needs to be known (Shorack and Wellner, 1986). The theoretical Cumulative Distribution Function (CDF) for each distribution is shown in Table 1.
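The statistic described above, $D_n = \sup_x |F_n(x) - F(x)|$, can be computed from the sorted sample because the empirical distribution function only changes at the observed values. The sketch below is illustrative: the function names are hypothetical, the Weibull CDF and its parameters are placeholders for whichever fitted CDF is being tested, and the speeds are made-up values rather than station data.

```python
import math


def ks_statistic(data, cdf):
    """D_n = sup_x |F_n(x) - F(x)| for a sample and a fitted CDF."""
    xs = sorted(data)
    n = len(xs)
    d_n = 0.0
    for i, x in enumerate(xs, start=1):
        fx = cdf(x)
        # F_n jumps from (i - 1)/n to i/n at x, so check both sides of the step.
        d_n = max(d_n, i / n - fx, fx - (i - 1) / n)
    return d_n


def weibull_cdf(x, shape, scale):
    """Two-parameter Weibull CDF, standing in for a fitted F(x)."""
    return 1.0 - math.exp(-((x / scale) ** shape)) if x > 0.0 else 0.0


# Hypothetical hourly speeds (km/h) and Weibull parameters, for illustration only.
speeds = [3.2, 5.1, 7.4, 8.0, 9.6, 11.3, 12.8, 15.5]
d_n = ks_statistic(speeds, lambda x: weibull_cdf(x, 2.0, 10.0))
print(f"D_n = {d_n:.4f}")
```

A smaller $D_n$ indicates closer agreement between the empirical and fitted cumulative distributions, so candidate distributions can be ranked by this value.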

Akaike's Information Criterion (AIC):
The AIC is a tool for model selection that is defined in terms of an information criterion. It offers a relative measure of the information lost when a given model is used to describe reality. The AIC assigns a score to each candidate model and is commonly used to compare the goodness of fit of statistical models. It is not a hypothesis test of a model; rather, it provides a means for comparison among models. The general formula for the AIC is given by:

$$\mathrm{AIC} = 2k - 2\ln(L)$$

where,
$L$ = The likelihood
$k$ = The number of parameters in the fitted model (Burnham and Anderson, 2002; Hirotugu, 1974)

The AIC acts as a penalized log-likelihood criterion, affording a balance between goodness of fit and model complexity. The model with the minimum AIC value is the preferred model. The AIC was used in this study because of its theoretical connection to maximum likelihood estimation (Claeskens and Hjort, 2010).
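As an illustration of how AIC scores are compared, the sketch below assumes the standard form $\mathrm{AIC} = 2k - 2\ln(L)$ and uses the Exponential and Rayleigh models, whose maximum likelihood estimates have simple closed forms. The function names and the sample speeds are hypothetical and are not taken from the study.

```python
import math


def aic(loglik, k):
    """AIC = 2k - 2 ln(L), where loglik is the maximized log-likelihood."""
    return 2.0 * k - 2.0 * loglik


def exponential_loglik(data):
    """Maximized log-likelihood of the Exponential model (rate = 1 / mean)."""
    n = len(data)
    mean = sum(data) / n
    return -n * math.log(mean) - n


def rayleigh_loglik(data):
    """Maximized log-likelihood of the Rayleigh model (sigma^2 = sum(x^2) / (2n))."""
    n = len(data)
    sigma2 = sum(x * x for x in data) / (2.0 * n)
    return sum(math.log(x) for x in data) - n * math.log(sigma2) - n


# Hypothetical hourly speeds (km/h), for illustration only.
speeds = [3.2, 5.1, 7.4, 8.0, 9.6, 11.3, 12.8, 15.5]
scores = {
    "EX": aic(exponential_loglik(speeds), k=1),
    "RY": aic(rayleigh_loglik(speeds), k=1),
}
best = min(scores, key=scores.get)  # the model with the minimum AIC is preferred
print(scores, best)
```

Both models here have one parameter, so the comparison reduces to the log-likelihoods; the penalty term matters when candidates differ in the number of parameters.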

Bayesian Information Criterion (BIC):
The BIC is another model selection criterion, which chooses the candidate model with the highest probability given the data. Gideon E. Schwarz developed the BIC using the Bayesian framework. The BIC uses the prior probabilities and the prior densities of all parameter vectors in the different models to select a model. It is closely related to the AIC and is also known as Schwarz's Bayesian Criterion (SBC). The formula for the BIC is given by:

$$\mathrm{BIC} = -2\ln(L) + k\ln(n)$$

where,
$L$ = The likelihood
$k$ = The number of parameters in the fitted model
$n$ = The number of observations

The BIC takes the form of a penalized log-likelihood function, where the penalty is equal to the logarithm of the sample size times the number of estimated parameters in the model (R Development Core Team, 2008; Claeskens and Hjort, 2010; McQuarrie and Tsai, 1998). The model with the minimum BIC value is selected. In this study, the BIC and AIC scores were compared with the results from the Kolmogorov-Smirnov test statistic.
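The effect of the BIC penalty can be seen when candidates differ in their number of parameters. The sketch below assumes the standard form $\mathrm{BIC} = -2\ln(L) + k\ln(n)$ and compares a 2-parameter Lognormal model with a 1-parameter Exponential model, both of which have closed-form maximum likelihood estimates. The function names and the sample speeds are hypothetical.

```python
import math


def bic(loglik, k, n):
    """BIC = k ln(n) - 2 ln(L), where loglik is the maximized log-likelihood."""
    return k * math.log(n) - 2.0 * loglik


def lognormal_loglik(data):
    """Maximized Lognormal log-likelihood (mu, sigma^2 from the log-data moments)."""
    n = len(data)
    logs = [math.log(x) for x in data]
    mu = sum(logs) / n
    sigma2 = sum((v - mu) ** 2 for v in logs) / n
    return -sum(logs) - 0.5 * n * math.log(2.0 * math.pi * sigma2) - 0.5 * n


def exponential_loglik(data):
    """Maximized Exponential log-likelihood (rate = 1 / mean)."""
    n = len(data)
    return -n * math.log(sum(data) / n) - n


# Hypothetical hourly speeds (km/h), for illustration only.
speeds = [3.2, 5.1, 7.4, 8.0, 9.6, 11.3, 12.8, 15.5]
n_obs = len(speeds)
scores = {
    "LN": bic(lognormal_loglik(speeds), k=2, n=n_obs),
    "EX": bic(exponential_loglik(speeds), k=1, n=n_obs),
}
best = min(scores, key=scores.get)  # the model with the minimum BIC is selected
print(scores, best)
```

Because the penalty per parameter is $\ln(n)$ rather than 2, the BIC penalizes additional parameters more heavily than the AIC whenever $n \ge 8$.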
Evaluating the goodness of fit: The $R^2$ correlation coefficient was used to evaluate the goodness of fit for each method. A larger $R^2$ value indicates a better fit of the theoretical distribution to the data. $R^2$ was used for goodness-of-fit comparisons because it quantifies the correlation between the observed probabilities and the probabilities predicted by a distribution. The $R^2$ coefficient is determined from the estimated cumulative probabilities, $\hat{F}$, which were derived from the cumulative distribution function of the proposed model, and the empirical cumulative probabilities, $F$. The $R^2$ coefficient has been used by other researchers in similar studies to measure goodness of fit (Morgan et al., 2011; Carta et al., 2008, 2009).
Table 3 provides the results from the goodness of fit statistics, the associated $R^2$ values and the selected distribution to describe the data for each station. Based on the results for all distributions and with respect to all goodness of fit methods, all $R^2$ values were found to be greater than 0.97, indicating that these distributions fit the data well. However, for the purpose of selecting the best distribution, we used the largest value of $R^2$. The Weibull, Gamma, Erlang and Burr distributions were found to be the most suitable for explaining the hourly mean wind speed in East Malaysia. The most frequent distribution was identified by counting the number of stations for which each distribution provided the best fit. Based on the results shown in Table 2, GA was the most frequently selected distribution: it provided the best fit to wind speed observations at 12 stations. The second most frequently selected distribution was BR, which provided the best fit at six stations. This was followed by WE and ER, which each provided the best fit at 1 station. LN, EX, RY, IGA and IGU were not selected as the best fit at any station. A map of East Malaysia is provided in Fig. 1, which shows the pattern of suitable statistical models that describe the wind regime. Data from most of the stations in the Sabah region (especially northern Sabah) were best fit by the Gamma distribution. A variety of different statistical distributions were observed for data from other regions.

Fig. 1: The spatial distribution of wind speed over East Malaysia
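The selection rule described above (keep the distribution with the largest $R^2$ at each station, then count how many stations each distribution wins) can be sketched as follows. The station names and $R^2$ values are made-up placeholders used only to show the mechanics; they are not results from this study.

```python
from collections import Counter

# Made-up R^2 values for three hypothetical stations (not results from the study).
r_squared = {
    "Station A": {"WE": 0.991, "GA": 0.996, "BR": 0.994},
    "Station B": {"WE": 0.989, "GA": 0.993, "BR": 0.997},
    "Station C": {"WE": 0.990, "GA": 0.998, "BR": 0.995},
}


def best_distribution(scores):
    """Return the distribution code with the largest R^2 for one station."""
    return max(scores, key=scores.get)


# Best distribution per station, then the number of stations won by each.
selected = {station: best_distribution(s) for station, s in r_squared.items()}
tally = Counter(selected.values())
print(selected)
print(tally.most_common())
```

The same tally could be built from the KS, AIC or BIC scores by replacing `max` with `min`, since smaller values indicate a better fit for those measures.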

CONCLUSION
The Gamma distribution most frequently provided an adequate description of wind speed at the 20 stations considered in this study. A variety of different statistical distributions were observed in East Malaysia, except in northern Sabah, where the Gamma distribution was the best fit for all stations. However, we suggest that a more comprehensive analysis be conducted in the future. More stations should be included to obtain a better understanding of wind speed in East Malaysia before an effort is made to assess wind energy potential.