Modeling the distribution of duration time for unhealthy air pollution events

The information about how long a severe unhealthy air pollution event will last is crucial for the purpose of planning a possible measure to mitigate its risk. Thus, analyzing the distribution of duration data on the past occurrences of air pollution events is important. This study analyzes the hourly data of air pollution index (API) in Klang City, Malaysia from 1997 to 2018. Air pollution duration data are determined from the period when API > 100, preceded and followed by periods when API < 100. In this study, four types of statistical distributions, namely, Exponential, Gamma, Lognormal, and Weibull are proposed as practical models. Goodness-of-fit measures are compared for each distribution to determine the best fitted one to describe the observed data. Results indicate that the Lognormal distribution provides the best fitted model among others.


Introduction
Air pollution is an important issue that must be addressed around the world, particularly in urban areas. To deal with this issue, characterizing the behaviors of air pollution events on the basis of observed data is informative. Statistical analysis is an important tool for extracting various kinds of information from observed air pollution data. One approach is to describe the probability distribution of the data. A statistical distribution can provide knowledge on the probabilistic behavior of the magnitude of an air pollution event. For example, an extreme air pollution event can be modeled and described using the generalized extreme value distribution or the generalized Pareto distribution [1,2,3,4]. AL-Dhurafi et al. [5] showed that the distribution of general air pollution index data can be modeled using several unimodal distributions, such as Gamma, Lognormal, and Weibull. The distribution of air pollution index data, which correspond to their compositional sub-indices, can be modeled and described using various forms of mixture distribution [6]. Leiva et al. [7] suggested that the skewed Sinh-normal distribution can be used to model air pollution data. Bartoletti and Loperfido [8] found that a skew-normal distribution can provide a good fit to PM10 pollutant data, particularly in estimating the probabilities of high concentration values. Kan and Chen [9] found that Lognormal, Pearson-V, and extreme value distributions are suitable to represent the distributions of the daily data of pollutant variables, such as SO2, NO2, and PM10, in Shanghai, China. Nadarajah [10] proposed a truncated inverted beta distribution as an alternative model for air pollution data, particularly for the ozone level pollutant variable.
Information about the probability distribution is important because it represents a building block for developing advanced and complex statistical techniques for analyzing air pollution behaviors. For example, Sak et al. [11] proposed a copula model for air pollution risk assessment based on a generalized hyperbolic distribution, which represents a marginal model for the data of PM2.5 in several cities in China. Masseran [12] used standard normal, standardized Student-t, and generalized residual distributions to represent the residuals in a mixed model of autoregressive integrated moving average and generalized autoregressive conditional heteroscedasticity to forecast the PM10 pollutant data in Kuala Lumpur, Malaysia. AL-Dhurafi et al. [13] employed the generalized Pareto distribution to represent unhealthy air pollution data as input for developing a hierarchical model, which combines air pollution data from several locations in Selangor, Malaysia. Masseran and Hussain [14] used the generalized Pareto model to represent the tail distribution of air pollution data as an input for dynamic copula modeling among five pollutant variables. Zhang et al. [15] employed a Gamma distribution as a building block for a hidden Markov model for ozone level prediction. Masseran and Mohd Safari [16] used the generalized extreme value distribution as a building block for developing intensity-duration-frequency curves to evaluate the risks of extreme air pollution events in Klang, Malaysia.
Most studies discuss the probability distribution of air pollution data in terms of their magnitude or real value. In this research, we take a different perspective by attempting to describe the probability distribution of air pollution events in terms of a characteristic defined as the duration size. The duration size of an air pollution event refers to the length of the period over which air pollution remains at an unhealthy level. Thus, a large duration size implies a prolonged air pollution event [17]. The probability distribution for this kind of data can be used as a building block for developing an advanced statistical model, particularly for the purpose of air pollution risk assessment and analysis.

Study area and data
This study uses Klang City, which is located in Peninsular Malaysia, as a case study. Klang has a dense population and is one of the large cities in Malaysia, with a land area of approximately 573 km². Figure 1 shows the location of Klang City on the map of Peninsular Malaysia [18].

Klang is involved in many industrial activities and plays an important role in the economy of Malaysia. The most important of these are the import and export activities that operate through Port Klang. Klang has also been recognized as the 13th busiest trans-shipment port and the 16th busiest container port in the world [19]. However, the rapid development of urban commercial and industrial areas in Klang in recent decades has elevated its risk of atmospheric pollution [20]. Given the importance of its industrial activities, analyzing the behaviors of air pollution duration events in Klang on the basis of previous data is important. The data used in this study comprise the hourly API data in Klang for the period of January 1, 1997 to December 31, 2018. The API values are evaluated based on the breakpoints of 50, 100, 200, 300, 400, and 500. These breakpoints correspond to the following air quality statuses: Healthy [0-50], Moderate [50-100), Unhealthy [100-200), Very Unhealthy [200-300), and Hazardous (above 300) [21]. The data have a small percentage of missing values at random points. To estimate these missing values, single imputation based on the average of the last known and next known observations is used. This method is easy to implement and can provide good results for missing data with random behavior [22]. Next, based on the hourly API data, the duration size of each air pollution event is determined by applying an indicator function $I_j(\mathrm{API}_i)$ to the $N$ hourly observations: the duration of an event is the number of consecutive hours with API > 100, preceded and followed by hours with API < 100.
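The two data preparation steps described here (gap filling and duration extraction) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names `impute_missing` and `event_durations`, the use of `None` for a missing hour, the handling of gaps at the very start or end of the record, and the choice to count an event still in progress at the end of the record are all assumptions made for the example.

```python
def impute_missing(api):
    """Fill each run of missing hours (None) with the average of the
    last known and next known observations (single imputation)."""
    filled = list(api)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is None:
            start = i
            while i < n and filled[i] is None:
                i += 1
            prev_val = filled[start - 1] if start > 0 else None
            next_val = filled[i] if i < n else None
            if prev_val is None and next_val is None:
                raise ValueError("series has no observed values")
            # Assumption: a gap at either end of the record copies its only neighbor.
            if prev_val is None:
                fill = next_val
            elif next_val is None:
                fill = prev_val
            else:
                fill = (prev_val + next_val) / 2
            for k in range(start, i):
                filled[k] = fill
        else:
            i += 1
    return filled


def event_durations(api, threshold=100):
    """Return the length in hours of every run of consecutive API > threshold."""
    durations = []
    run = 0
    for value in api:
        if value > threshold:
            run += 1
        elif run > 0:
            durations.append(run)
            run = 0
    if run > 0:
        # Assumption: an event still in progress at the end of the record is counted.
        durations.append(run)
    return durations


# Hypothetical hourly API values; None marks a missing hour.
hourly = [80, 120, None, 130, 90, 101, 100, 150, 160, 170]
print(event_durations(impute_missing(hourly)))
```

An hour with API exactly equal to 100 ends an event in this sketch, because only values strictly above the threshold are treated as unhealthy.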

Methodologies
Several statistical models have been selected to describe the distribution of duration time for unhealthy air pollution events in Klang. These distributions are popular and have been used to model the duration time of flood and drought events in hydrological analysis [23,24]. We believe that they can also be useful in modeling the distribution of duration time for unhealthy air pollution events.

Exponential distribution (EX)
EX is a simple statistical model that has only one parameter. Nevertheless, EX has been used as a base model for many applications in diverse areas (see [25] for example). The probability density function (PDF) for EX is given as
$$f(x) = \frac{1}{\theta}\exp\left(-\frac{x}{\theta}\right), \quad x > 0,$$
where $\theta$ is a scale parameter. Its cumulative distribution function (CDF) can be written as
$$F(x) = 1 - \exp\left(-\frac{x}{\theta}\right).$$
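As a small illustration of how EX would be fitted to duration data, the sketch below uses the standard result that the maximum likelihood estimate of the scale parameter is the sample mean. The function names and the duration values are hypothetical and are not taken from the Klang data.

```python
import math


def fit_exponential(durations):
    """MLE of the exponential scale parameter theta (the sample mean)."""
    return sum(durations) / len(durations)


def exponential_cdf(x, theta):
    """CDF of the exponential distribution with scale theta."""
    return 1.0 - math.exp(-x / theta) if x > 0 else 0.0


# Hypothetical event durations in hours.
durations = [1, 2, 3, 3, 10, 48]
theta_hat = fit_exponential(durations)

# Estimated probability that an unhealthy event lasts longer than 24 hours.
p_exceed_24 = 1.0 - exponential_cdf(24, theta_hat)
print(theta_hat, p_exceed_24)
```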

Gamma distribution (GA)
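GA is a popular statistical model in various fields, including theoretical and applied research. In terms of environmental modeling, applications of GA can be found in research on wind speed behaviors [26], hydrological analysis [27], air pollution analysis [15,28], and many more. The PDF for GA is given as follows:
$$f(x) = \frac{1}{\Gamma(\alpha)\,\beta^{\alpha}}\, x^{\alpha-1} \exp\left(-\frac{x}{\beta}\right), \quad x > 0,$$
where $\alpha$ is a shape parameter and $\beta$ is a scale parameter. The CDF of GA can be written as
$$F(x) = \frac{\gamma(\alpha, x/\beta)}{\Gamma(\alpha)},$$
where $\gamma$ is the lower incomplete Gamma function. The estimated parameters for GA based on maximum likelihood estimation (MLE) can be obtained using the following equations:
$$\ln\hat{\alpha} - \psi(\hat{\alpha}) = \ln\bar{x} - \frac{1}{n}\sum_{i=1}^{n}\ln x_i, \qquad \hat{\beta} = \frac{\bar{x}}{\hat{\alpha}},$$
where $\psi$ is the digamma function and $\bar{x}$ is the sample mean of the $n$ observed durations $x_1, \ldots, x_n$.

Lognormal distribution (LN)
LN is also a popular model with ubiquitous applications [29,30]. The PDF for LN is given as follows:
$$f(x) = \frac{1}{x\sigma\sqrt{2\pi}} \exp\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right), \quad x > 0,$$
where $\mu$ is a location parameter and $\sigma$ is a shape parameter. The CDF for LN can be written as
$$F(x) = \Phi\left(\frac{\ln x - \mu}{\sigma}\right),$$
where $\Phi$ is the CDF of the standard normal distribution.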
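The maximum likelihood fits for GA and LN can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' implementation: the GA shape parameter is found by bisection on the standard likelihood equation, the digamma function is approximated by a recurrence plus an asymptotic series, and the LN estimates are the mean and the (divide-by-n) standard deviation of the log durations. The function names and sample values are hypothetical.

```python
import math


def digamma(x):
    """Approximate digamma for x > 0: shift x upward, then use an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 * inv - series


def fit_gamma(data):
    """MLE of the Gamma shape alpha and scale beta."""
    n = len(data)
    mean = sum(data) / n
    s = math.log(mean) - sum(math.log(x) for x in data) / n
    if s <= 0:
        raise ValueError("data must contain at least two distinct values")
    lo, hi = 1e-6, 1e6
    for _ in range(200):
        mid = math.sqrt(lo * hi)  # bisection on a logarithmic scale
        if math.log(mid) - digamma(mid) > s:
            lo = mid  # left-hand side decreases in alpha, so alpha is too small
        else:
            hi = mid
    alpha = math.sqrt(lo * hi)
    return alpha, mean / alpha


def fit_lognormal(data):
    """MLE of the Lognormal location mu and shape sigma."""
    logs = [math.log(x) for x in data]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma


# Hypothetical event durations in hours.
durations = [1, 2, 3, 3, 10, 48]
print(fit_gamma(durations))
print(fit_lognormal(durations))
```

Bisection is used here only because it is easy to verify; a Newton iteration on the same equation would converge faster.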

Weibull distribution (WE)
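The WE distribution is also a popular statistical model in many fields of applied research involving various kinds of environmental data [27,31]. The PDF for WE is given as
$$f(x) = \frac{\alpha}{\beta}\left(\frac{x}{\beta}\right)^{\alpha-1} \exp\left[-\left(\frac{x}{\beta}\right)^{\alpha}\right], \quad x > 0,$$
where $\alpha$ is a shape parameter and $\beta$ is a scale parameter. The CDF for WE is given as
$$F(x) = 1 - \exp\left[-\left(\frac{x}{\beta}\right)^{\alpha}\right].$$
Next, the estimated parameters for WE based on MLE can be computed using the following equations:
$$\frac{\sum_{i=1}^{n} x_i^{\hat{\alpha}} \ln x_i}{\sum_{i=1}^{n} x_i^{\hat{\alpha}}} - \frac{1}{\hat{\alpha}} - \frac{1}{n}\sum_{i=1}^{n}\ln x_i = 0, \qquad \hat{\beta} = \left(\frac{1}{n}\sum_{i=1}^{n} x_i^{\hat{\alpha}}\right)^{1/\hat{\alpha}},$$
where $x_1, \ldots, x_n$ are the $n$ observed durations.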
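A minimal Python sketch of the WE maximum likelihood fit is given below. It solves the standard likelihood equation for the shape parameter by bisection and then computes the scale parameter in closed form. The function name `fit_weibull`, the bisection bounds, and the sample durations are assumptions made for illustration; the data are rescaled by their maximum purely for numerical stability, which does not change the shape estimate.

```python
import math


def fit_weibull(data):
    """MLE of the Weibull shape alpha and scale beta."""
    n = len(data)
    x_max = max(data)
    if min(data) == x_max:
        raise ValueError("data must contain at least two distinct values")
    # Work with log(x / x_max) <= 0 so that exp(alpha * log_z) cannot overflow.
    log_z = [math.log(x / x_max) for x in data]
    mean_log = sum(log_z) / n

    def score(alpha):
        w = [math.exp(alpha * lz) for lz in log_z]
        weighted = sum(wi * lz for wi, lz in zip(w, log_z)) / sum(w)
        return weighted - 1.0 / alpha - mean_log

    lo, hi = 1e-3, 1e3  # assumed search range for the shape parameter
    for _ in range(200):
        mid = math.sqrt(lo * hi)
        if score(mid) < 0:
            lo = mid  # score increases in alpha, so alpha is too small
        else:
            hi = mid
    alpha = math.sqrt(lo * hi)
    mean_pow = sum(math.exp(alpha * lz) for lz in log_z) / n
    beta = x_max * mean_pow ** (1.0 / alpha)
    return alpha, beta


# Hypothetical event durations in hours.
durations = [1, 2, 3, 3, 10, 48]
alpha_hat, beta_hat = fit_weibull(durations)
print(alpha_hat, beta_hat)
```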

Goodness-of-fit measurement
The performance of each fitted model is evaluated using three goodness-of-fit measures: Akaike's Information Criterion (AIC), the Kolmogorov-Smirnov statistic (K-S statistic), and the R² coefficient. The AIC formula is given as follows:
$$\mathrm{AIC} = 2k - 2\ln(L),$$
where $k$ is the number of parameters and $L$ is the maximized value of the likelihood function of each fitted model [32,33]. The formula for the K-S statistic is given as follows:
$$D = \sup_x \left| F_n(x) - F(x) \right|,$$
where $\sup_x$ is the supremum of the set of distances between the empirical CDF $F_n(x)$ and the theoretical CDF $F(x)$ of each fitted model. Lower values of AIC and the K-S statistic indicate a better fit to the observed data [34]. In contrast, a higher value of the R² coefficient indicates a better fit to the observed data [34].
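The AIC and K-S comparisons can be sketched as follows. The helper names, the two example models (EX and LN), and the duration values are illustrative assumptions; the R² coefficient is omitted from the sketch. The K-S statistic is evaluated at the sorted data points, comparing the fitted CDF with the empirical CDF just before and just after each point.

```python
import math


def aic(log_likelihood, k):
    """Akaike's Information Criterion from a maximized log-likelihood."""
    return 2 * k - 2 * log_likelihood


def ks_statistic(data, cdf):
    """Largest distance between the empirical CDF and a fitted CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def exponential_model(data):
    """Return (maximized log-likelihood, number of parameters, fitted CDF)."""
    n = len(data)
    theta = sum(data) / n
    loglik = -n * math.log(theta) - sum(data) / theta
    return loglik, 1, lambda x: 1.0 - math.exp(-x / theta)


def lognormal_model(data):
    """Return (maximized log-likelihood, number of parameters, fitted CDF)."""
    logs = [math.log(x) for x in data]
    n = len(logs)
    mu = sum(logs) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / n)
    loglik = sum(
        -v - math.log(sigma) - 0.5 * math.log(2 * math.pi)
        - (v - mu) ** 2 / (2 * sigma ** 2)
        for v in logs
    )
    return loglik, 2, lambda x: 0.5 * (1 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2))))


# Hypothetical event durations in hours.
durations = [1, 1, 2, 3, 3, 5, 10, 24, 48, 120]
for name, model in [("EX", exponential_model), ("LN", lognormal_model)]:
    loglik, k, cdf = model(durations)
    print(name, aic(loglik, k), ks_statistic(durations, cdf))
```

Because hourly durations are integers and contain many ties, the K-S statistic is only a descriptive distance here; the sketch does not attempt to compute a p-value.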

Results
Before modeling the distribution of duration size for unhealthy air pollution events in detail, exploring the descriptive statistics of the data is informative. Table 1 shows the descriptive statistics for the duration data of unhealthy air pollution events. As presented in Table 1, the mean duration size of unhealthy air pollution events is approximately 21.393 hours, with a standard deviation of 35.318 hours. This implies that the variability of the duration size among the air pollution events is very high relative to its mean. Meanwhile, the median is only three hours, which is evidently lower than the mean. The range between the minimum value (one hour) and the maximum value is quite large (224 hours), indicating the presence of some extreme duration sizes. Likewise, the skewness and kurtosis values are unequal to zero, suggesting that the duration data of unhealthy air pollution events in Klang do not follow a normal distribution. Thus, this study attempts to determine a statistical model that can suitably describe all these properties of the duration data of pollution events.
Table 2 provides the estimated parameters for each model obtained using the MLE approach. On the basis of the estimated parameters, Figure 2 shows the density plot and the CDF plot of each statistical model fitted to the duration time data for unhealthy air pollution events in Klang. The PDF plot shows that all the fitted statistical models can represent the properties of the empirical data. First, the distribution of the empirical data is skewed to the right. Thus, the median, rather than the mean, should be used as a robust measure of center. Second, the existence of a long tail is indicated by some extreme data points. This behavior is consistent with the descriptive statistics in Table 1.
Likewise, based on the PDF and CDF plots, we believe that the Gamma, Lognormal, and Weibull distributions can provide good fitted models to describe the distribution of the duration size data for unhealthy air pollution events. However, deciding which model best approximates the empirical data on the basis of graphical representations of probability densities alone is imprecise. Thus, a further evaluation must be conducted using several goodness-of-fit (GOF) measures, such as Akaike's information criterion (AIC), the Kolmogorov-Smirnov (KS) statistic, and the R² coefficient. A low value of AIC indicates less information loss for a particular fitted model. A low value of the KS statistic indicates a high level of similarity between the empirical CDF and the CDF of a fitted model. A high value of the R² coefficient implies that the fitted model covers a high degree of the variation in the empirical data. Table 3 presents the results of the GOF evaluation for all fitted distribution models. As shown in Table 3, all the GOF measures indicate that Lognormal is the best model for approximating the distribution of duration size data for unhealthy air pollution events in Klang. Considering this information, we suggest that the Lognormal distribution can be used as a marginal model for further analysis involving the risk of the occurrence of hazardous air pollution events corresponding to their duration size.

Conclusion
The duration sizes of air pollution events describe the consecutive periods of unhealthy air pollution conditions over time. Thus, a large duration size implies a prolonged air pollution event. To provide a preliminary analysis of this topic, a case study is conducted using the data of Klang, Malaysia for the period of January 1, 1997 to December 31, 2018. Based on the empirical data, this study investigates a potential probability model that can be used to represent such data. A suitable probability model can be used as a building block for developing an advanced statistical model for the purpose of air pollution risk assessment and analysis. Given that the duration data for unhealthy air pollution events are unimodal and skewed to the right, four well-known statistical models that satisfy these properties, namely, Exponential, Gamma, Lognormal, and Weibull, are fitted to represent the data. The method of maximum likelihood is used for parameter estimation, whereas AIC, the KS statistic, and the R² coefficient are employed to evaluate the GOF of each fitted model. Overall, the results reveal that the Lognormal distribution provides the best fitted model for the duration size data for unhealthy air pollution events in Klang. In future studies, the Lognormal distribution can potentially be used as a marginal model for developing a complex statistical model for the risk assessment of the occurrence of hazardous air pollution events corresponding to their duration size.