Modeling generalized statistical distributions of PM2.5 concentrations during the COVID-19 pandemic in Jakarta, Indonesia

Article history: Received October 28, 2020; received in revised format December 29, 2020; accepted January 27, 2021


Introduction
The World Health Organization (WHO, 2020) declared COVID-19 a pandemic in early March 2020. The first cases of COVID-19 in Indonesia were reported on March 2, 2020; two of these cases were from Depok, a city just south of Jakarta. In the initial phase, up to the second week, the spread was relatively slow and concentrated in Jakarta and the surrounding areas. However, the number of cases then rose rapidly through the end of March 2020. To combat the spread of the virus, the President of the Republic of Indonesia (Government of the Republic of Indonesia, 2020) declared a social distancing policy named the Large-Scale Social Restriction (LSSR). After receiving an approval letter from the Health Minister, the Governor of Jakarta declared the LSSR on April 10, 2020. This led to a reduction in outdoor human activities. Recent literature has reported air quality improvements associated with social distancing measures. For example, Li et al. (2020) reported that during the COVID-19 control period, human activities decreased greatly and air pollutants such as SO2 and PM2.5 in the Yangtze Delta Region, China, were reduced significantly. Similar findings are reported by Kerimray et al. (2020), Otmani et al. (2020), Zambrano-Monserrate et al. (2020), and Cameletti (2020). Air pollution studies that monitor ambient levels and quantify the concentrations of pollutants, such as particulate matter (PM), entering a given area are motivated largely by the possible adverse health effects of these pollutants. For example, fine particulate matter smaller than 2.5 μm in aerodynamic diameter (PM2.5) has become a major health burden that plays an increasingly negative role in China's social and economic development (Li et al., 2018) and has been associated with population mortality in an urbanized valley in the American tropics (Aguiar-Gil et al., 2020). Information about the frequency distribution of pollutants is therefore crucial for air pollution management.
When the parent statistical distribution of PM2.5 concentrations is correctly selected, the fitted distribution can be used, for example, to predict the mean concentration, the probability of exceeding a critical concentration, and the return period, which in turn support setting regulatory targets and issuing environmental alerts for public health (Lu and Fang, 2002; Xi et al., 2013; Plocoste et al., 2020). Statistical distributions can also provide a simple description or summary of large data sets.
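The quantities named above can be sketched for one candidate model. The example below is a minimal illustration, assuming a lognormal parent distribution and one observation per day; the function name `lognormal_summary`, the parameter values, and the threshold are hypothetical and are not taken from this study.

```python
import math
from statistics import NormalDist


def lognormal_summary(mu, sigma, threshold):
    """Mean, exceedance probability, and return period for LN(mu, sigma).

    mu and sigma are the mean and standard deviation of ln(X).
    """
    # Mean of a lognormal variable: exp(mu + sigma^2 / 2)
    mean = math.exp(mu + 0.5 * sigma ** 2)
    # P(X > c) = 1 - Phi((ln c - mu) / sigma)
    z = (math.log(threshold) - mu) / sigma
    p_exceed = 1.0 - NormalDist().cdf(z)
    # Return period in days, assuming one observation per day: T = 1 / P(X > c)
    return_period = math.inf if p_exceed == 0.0 else 1.0 / p_exceed
    return mean, p_exceed, return_period


# Hypothetical parameters and threshold, chosen only for illustration.
mean, p_exceed, return_period = lognormal_summary(mu=4.0, sigma=0.5, threshold=150.0)
print(f"mean concentration     : {mean:.1f}")
print(f"P(X > 150)             : {p_exceed:.4f}")
print(f"return period (in days): {return_period:.1f}")
```

The same three summaries can be computed for any other fitted model by swapping in that model's mean and CDF.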
A significant number of published studies suggest that empirical distributions of pollutant concentration data tend to be lognormally distributed. Lu (2002) presented a critical review of statistical distributions, such as the lognormal (LN), Weibull, and type V Pearson distributions, and stated that the LN distribution was the most appropriate for representing PM10 and PM2.5 in Sha-Lu, Taiwan. More recently, Xi et al. (2013) reported that the LN distribution, rather than the Weibull or gamma distribution, was the best-fitting distribution for daily PM10 concentrations in five cities of China. Other examples of the LN distribution for air quality data can be found in Kalpasanov and Kurchatova (1976), Kushner (1976), Owen and DeRouen (1980), Mage and Ott (1984), Kao and Friedlander (1995), and Lu and Fang (2002). However, other parametric distributions can also be used successfully for air pollutant data. On the basis of the sum-of-squares error, Bencala and Seinfeld (1976) showed that Weibull models yield lower errors than the LN model for five of eight CO data sets. In fitting PM2.5 and PM10, Karaca et al. (2004) concluded that the log-logistic (LL) distribution, rather than the LN, gamma, or Weibull distribution, was the most suitable for representing their statistical characteristics. Moreover, Marani et al. (1986) demonstrated that, in terms of log-likelihood (log-L) values, the generalized gamma (GG) distribution was a more suitable statistical model than the LN, Weibull, and gamma distributions for air pollutant concentrations in the area of Venice, Italy. Applications of the GG distribution to air pollution problems can also be found in Lavagnini and Camuffo (1987). Evidently, in fitting air pollutant data, none of the probability models, including the classical LN, has been identified as superior to the others in a general sense.
One approach to overcoming this problem is to use a very general model that includes most of these distributions as limiting distributions or special cases. Among the general models, the generalized log-logistic (GLL) distribution has good potential for fitting environmental pollutant data. The GLL distribution is an extension of the LL distribution. It contains several well-known distributions, such as LN, Weibull, and GG, as special cases (Singh et al., 1988). The main focus of this study, therefore, is to assess the performance of the generalized distributions, especially the GLL model, and of the classical distributions in fitting PM2.5 concentrations in the periods without COVID-19 (February-June 2018 and February-June 2019) and during the period with COVID-19 (February-June 2020) in Jakarta, Indonesia.

Data Source
This study uses daily PM2.5 concentration data for the periods of February-June 2018, February-June 2019, and February-June 2020 in Central Jakarta, as reported by the Indonesian Department of Meteorology, Climatology and Geophysics (https://aqicn.org/city/indonesia/jakarta/us-consulate/central/). We designate February-June 2018 and February-June 2019 as the periods without COVID-19 and February-June 2020 as the period with COVID-19.

Methods
As noted in the Introduction, no single probability model, including the classical LN, has been identified as superior to the others in a general sense for fitting air pollutant data. It is reasonable, therefore, to use a rich family of generalized distributions that includes several well-known distributions as special cases. This study proposes the generalized distributions GLL, GG, and generalized extreme value (GEV) to fit PM2.5 concentration data in the periods of February-June 2018, February-June 2019, and February-June 2020 in Jakarta, Indonesia. For comparison purposes, the study also considers the classical distributions LN, gamma, Weibull, LL, and Gumbel. The probability density functions (PDFs) of the compared models are given in Table 1. The parameters of each distribution are estimated by the method of maximum-likelihood estimation (MLE). The fitted models are compared by using graphical methods and numerical goodness-of-fit statistics. Graphically, this study utilizes the empirical PDF, the cumulative distribution function (CDF), and the probability-probability (P-P) plot. In terms of goodness-of-fit statistics, this study uses the Kolmogorov-Smirnov (K-S), Cramér-von Mises (CvM), and Anderson-Darling (A-D) statistics and the log-L value. For computation, this study uses the R distribution-fitting package "fitdistrplus", which provides functions for estimating the parameters of distributions, producing diagnostic figures, and calculating goodness-of-fit statistics (Delignette-Muller and Dutang, 2015). This package requires that a distribution be defined through three functions: the PDF, the CDF, and the quantile function. Except for the GLL distribution, all of the compared distributions are available as functions in R. Thus, the initial step is to define the GLL distribution, which is not yet available.
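The three-function definition described above can be illustrated with a simpler relative of the GLL model. The sketch below is written in Python rather than R and uses the two-parameter log-logistic (LL) distribution, because the GLL parametrization is not reproduced in this text; the function names `dllogis`, `pllogis`, and `qllogis` and the parameter values are hypothetical and only mirror the usual density/probability/quantile naming pattern.

```python
def dllogis(x, shape, scale):
    """Log-logistic PDF: f(x) = (shape / x) * z / (1 + z)^2, with z = (x / scale)^shape."""
    if x <= 0:
        return 0.0
    z = (x / scale) ** shape
    return (shape / x) * z / (1.0 + z) ** 2


def pllogis(q, shape, scale):
    """Log-logistic CDF: F(q) = z / (1 + z), with z = (q / scale)^shape."""
    if q <= 0:
        return 0.0
    z = (q / scale) ** shape
    return z / (1.0 + z)


def qllogis(p, shape, scale):
    """Log-logistic quantile function: Q(p) = scale * (p / (1 - p))^(1 / shape)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return scale * (p / (1.0 - p)) ** (1.0 / shape)


# Hypothetical shape and scale values, for illustration only.
shape, scale = 2.5, 60.0
median = qllogis(0.5, shape, scale)  # the median of the LL distribution equals its scale
print("median:", median)
print("CDF at the median:", pllogis(median, shape, scale))
print("PDF at the median:", dllogis(median, shape, scale))
```

Keeping the three functions mutually consistent matters in practice: the quantile function must invert the CDF, and the PDF must be the derivative of the CDF, otherwise the fitted parameters and the diagnostic plots will disagree.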
The fitting function of this package estimates the parameters by MLE by default. The likelihood function is optimized by default with the Nelder-Mead method (Nelder and Mead, 1965), which is a direct search method. To help choose the best distribution model, the package also provides goodness-of-fit statistics through the "gofstat" function.
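To make the estimation and goodness-of-fit steps concrete, the sketch below fits a lognormal model by its closed-form MLE and evaluates the log-L, K-S, CvM, and A-D statistics with their standard computing formulas. It is a Python illustration under stated assumptions, not the fitdistrplus implementation: the function names and the sample values are hypothetical, and no numerical optimizer such as Nelder-Mead is needed because the lognormal MLE has a closed form.

```python
import math
from statistics import NormalDist


def fit_lognormal_mle(data):
    """Closed-form MLE for LN(mu, sigma): moments of ln(x), dividing by n."""
    logs = [math.log(x) for x in data]
    n = len(logs)
    mu = sum(logs) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / n)
    return mu, sigma


def lognormal_loglik(data, mu, sigma):
    """Log-likelihood of the data under LN(mu, sigma)."""
    const = -math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
    return sum(
        const - math.log(x) - (math.log(x) - mu) ** 2 / (2.0 * sigma ** 2)
        for x in data
    )


def gof_statistics(data, cdf):
    """K-S, Cramer-von Mises, and Anderson-Darling statistics for a fitted CDF."""
    xs = sorted(data)
    n = len(xs)
    f = [cdf(x) for x in xs]
    # K-S: largest gap between the fitted CDF and the empirical step function
    ks = max(max((i + 1) / n - f[i], f[i] - i / n) for i in range(n))
    # CvM: 1/(12n) + sum over ranks of (F_i - (2i - 1)/(2n))^2
    cvm = 1.0 / (12 * n) + sum((f[i] - (2 * i + 1) / (2 * n)) ** 2 for i in range(n))
    # A-D: -n - (1/n) * sum of (2i - 1) * [ln F_i + ln(1 - F_(n+1-i))]
    ad = -n - sum(
        (2 * i + 1) * (math.log(f[i]) + math.log(1.0 - f[n - 1 - i]))
        for i in range(n)
    ) / n
    return ks, cvm, ad


# Hypothetical concentrations, for illustration only (not the Jakarta data).
sample = [38.0, 52.0, 61.0, 70.0, 74.0, 83.0, 95.0, 102.0, 118.0, 140.0]
mu, sigma = fit_lognormal_mle(sample)
fitted_cdf = lambda x: NormalDist(mu, sigma).cdf(math.log(x))
ks, cvm, ad = gof_statistics(sample, fitted_cdf)
print("log-L:", lognormal_loglik(sample, mu, sigma))
print("K-S:", ks, "CvM:", cvm, "A-D:", ad)
```

Smaller K-S, CvM, and A-D values and a larger log-L value indicate a closer fit, which is how the candidate models are ranked in this study.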

Table 1
The probability density functions of the proposed distributions
Distribution    Probability Density Function
Note: … is the beta function and G( ) = 1 + ( ( )

Descriptive Data
As an initial step in assessing the frequency distributions of PM2.5 concentrations in the periods without COVID-19 (February-June 2018 and February-June 2019) and during the period with COVID-19 (February-June 2020) in Jakarta, Indonesia, this study explores these data graphically by using time series plots, histograms, and boxplots. In Figure 2, the frequency distribution of PM2.5 concentrations during the period with COVID-19 (February-June 2020) appears somewhat right-skewed and unimodal. However, the number of observations in the right-hand bins of the histogram, between 100 and 400, is smaller during the period with COVID-19 than in the periods without COVID-19. Moreover, the boxplots in Figure 3 show that the data during the period with COVID-19 are less spread out than those in the periods without COVID-19. The third quartile and the second quartile (median) of the data during the period with COVID-19 are slightly lower than those in the periods without COVID-19. This is one indication of an improving trend in air quality in Jakarta.
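The boxplot and histogram comparisons above rest on a few summary numbers: the quartiles, the interquartile range, and the direction of skewness. A minimal Python sketch of these summaries follows; the helper `describe` and both data lists are hypothetical and serve only to show the calculation, not to reproduce the Jakarta data.

```python
import statistics


def describe(values):
    """Quartiles, interquartile range, and moment-based sample skewness."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    skewness = 0.0 if m2 == 0 else m3 / m2 ** 1.5  # positive means right-skewed
    return {"q1": q1, "median": median, "q3": q3, "iqr": q3 - q1, "skewness": skewness}


# Hypothetical values for two periods, for illustration only.
period_a = [60, 75, 88, 92, 101, 110, 124, 137, 150, 162]
period_b = [48, 55, 61, 66, 70, 74, 79, 85, 96, 131]

for label, values in (("period A", period_a), ("period B", period_b)):
    print(label, describe(values))
```

A smaller interquartile range corresponds to a shorter box in the boxplot, and a positive skewness value corresponds to a longer right tail in the histogram.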

Fitting Distribution
To model the frequency distributions of PM2.5 concentrations in the periods without COVID-19 (February-June 2018 and February-June 2019) and during the period with COVID-19 (February-June 2020) in Jakarta, Indonesia, the classical distributions (LN, gamma, Weibull, LL, and Gumbel) and the generalized distributions (GG, GEV, and GLL) were fitted to the data. Computations and graphs were produced with R packages. To select the most appropriate distributions, the K-S, CvM, A-D, and log-L goodness-of-fit statistics were calculated for each distribution (Table 2). Fig. 4 displays the PDFs of the compared models superimposed on relative frequency histograms of the data. The graphs suggest that in the first period without COVID-19 (February-June 2018) and during the period with COVID-19 (February-June 2020), the GLL and GEV distributions lie close to the frequency distribution of the data. Meanwhile, in the second period without COVID-19 (February-June 2019), the GLL and GG models lie close to the data. Hence, graphically, the GLL, GEV, and GG distributions seem to be better than the LN, gamma, Weibull, LL, and Gumbel distributions in modeling PM2.5 concentrations in the first period without COVID-19 (February-June 2018) and during the period with COVID-19 (February-June 2020). In the second period without COVID-19 (February-June 2019), however, the GLL and GG distributions are the most appropriate models for describing the data.

Table 2
Goodness-of-fit statistics
Distribution    K-S         CvM         A-D         log-L
LN              0.09 (7)    0.23 (8)    1.45 (7)    -627.29 (6)
Gamma           0.09 (7)    0.17 (7)    1.10 (5 …

Fig. 5 presents graphs of the CDFs of the compared models superimposed on the empirical distribution function, which is defined as the proportion of sample observations less than or equal to a given value. From the standpoint of the CDFs, the graphs suggest that although every compared model fits the PM2.5 concentrations fairly consistently in the periods without COVID-19 (February-June 2018 and February-June 2019) and during the period with COVID-19 (February-June 2020), the GLL and GEV distributions lie closest to the empirical distribution function of the data. Thus, the GLL and GEV distributions fit better than the LN, gamma, Weibull, LL, Gumbel, and GG distributions in the periods without COVID-19 and during the period with the COVID-19 pandemic. Fig. 6 provides the P-P plots of the LN, gamma, Weibull, LL, Gumbel, GG, GEV, and GLL distributions. According to the figure, the P-P plot of the GLL distribution is approximately linear and lies closest to the diagonal line in all of the periods, making the GLL distribution the most suitable model for representing the PM2.5 concentrations in the periods without COVID-19 (February-June 2018 and February-June 2019) and during the period with COVID-19 (February-June 2020). The smaller the K-S, CvM, and A-D statistics, the better the distribution fits the data. Conversely, the higher the log-L value, the better the distribution matches the data. According to Table 2, the log-L values of the GLL, GG, and GEV distributions are higher than those of the LN, gamma, Weibull, LL, and Gumbel distributions for all three data sets (February-June 2018, February-June 2019, and February-June 2020).
Fig. 6. The probability-probability (P-P) plots of the LN, gamma, Weibull, LL, Gumbel, GG, GEV, and GLL distributions of PM2.5 concentrations in the periods without COVID-19 and during the period with COVID-19.

Furthermore, the K-S, CvM, and A-D values of the GLL, GG, and GEV distributions are generally smaller than those of the LN, gamma, Weibull, LL, and Gumbel distributions for the three periods. Therefore, in terms of the K-S, CvM, A-D, and log-L values, the generalized distributions seem to be better than the classical distributions, including the LN distribution. The LN distribution is very well known for fitting environmental pollution data. Because environmental pollution data are generally positive and highly skewed, the LN model, with its positive range, positive skewness, and heavy right tail, is usually regarded as an ideal descriptor of such data. Having developed physical mechanisms that generate environmental quality data, Ott (1990) provided an argument as to why the LN distribution is so ubiquitous in environmental phenomena. Ott's explanations involve the central limit theorem and the diffusion law. However, as can be seen from the histograms, the distributions of PM2.5 concentrations in the periods of February-June 2018 and February-June 2019 (the periods without COVID-19) are not positively skewed. For the period with COVID-19 (February-June 2020), the histogram is only slightly positively skewed, with a short tail. Even so, in fitting the PM2.5 concentrations during the period with COVID-19 (February-June 2020), the LN distribution still performs worse than the other distributions. Furthermore, as can be seen in Table 2, on the basis of almost all goodness-of-fit statistics, the LN distribution fits worse than the other classical distributions. In fact, in terms of all of the proposed goodness-of-fit statistics, the GLL distribution is considerably better than the classical distributions, including the LN distribution, in describing PM2.5 concentrations in the periods without COVID-19 and during the period with the COVID-19 pandemic in Jakarta, Indonesia. The GLL distribution is an extension of the LL distribution, which is roughly similar in shape to the LN distribution. As noted by Singh et al. (1988), the family of four-parameter GLL distributions is flexible and contains several submodels, such as LL, LN, Weibull, and GG. Consequently, the GLL distribution, as a general model, should provide at least as good a fit as its special cases. In support of this argument, Warsono et al. (2000) demonstrate that the GLL model provides a better representation of environmental pollutant data than do the exponential, LN, Weibull, and GG models.
Furthermore, for the periods of February-June 2018 and February-June 2019, Table 2 shows that, in general, the K-S, CvM, and A-D values of the GLL distribution are smaller than those of the GG and GEV distributions. The log-L value of the GEV distribution is somewhat larger than those of the GLL and GG distributions, but the differences in log-L among these distributions are quite small. Thus, in the periods without COVID-19 (February-June 2018 and February-June 2019), the GLL distribution seems to be an ideal descriptor of the PM2.5 concentrations. Similar results are obtained for the period with the COVID-19 pandemic (February-June 2020). The log-L and A-D values of the GEV distribution are slightly higher than those of the GLL distribution, but the differences are small, and the K-S and CvM values of the GLL and GEV distributions are the same. Therefore, both the GLL and GEV distributions fit the PM2.5 concentrations well during the period with COVID-19. Because the GLL distribution has more parameters than the classical distributions, its PDF is mathematically more complicated, and estimating its parameters may involve intensive computation. Fortunately, the computational aspect of modeling univariate distributions, especially for environmental data, is not so difficult because of the availability of the R package. We have shown that this package can accommodate general distributions, including distributions not yet defined as R functions, such as the GLL distribution. In the future, other general distributions that extend the classical models could be applied to air pollution data in the same way, with modest computational effort.
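The graphical comparisons used in this section can also be summarized numerically. The sketch below builds the empirical distribution function and the P-P points for a candidate CDF in Python, and reports the largest departure of the points from the diagonal. The plotting position (i - 0.5)/n is one common convention and is an assumption here, as are the helper names, the sample values, and the two candidate models (an exponential and a lognormal, each fitted by its closed-form MLE).

```python
import bisect
import math
from statistics import NormalDist


def ecdf(data):
    """Empirical distribution function: proportion of observations <= x."""
    xs = sorted(data)
    n = len(xs)
    return lambda x: bisect.bisect_right(xs, x) / n


def pp_points(data, cdf):
    """(theoretical, empirical) probability pairs for a P-P plot."""
    xs = sorted(data)
    n = len(xs)
    # Assumed plotting position: (i - 0.5) / n for the i-th ordered observation
    return [(cdf(x), (i + 0.5) / n) for i, x in enumerate(xs)]


def max_pp_deviation(points):
    """Largest vertical distance of the P-P points from the diagonal line."""
    return max(abs(theoretical - empirical) for theoretical, empirical in points)


# Hypothetical concentrations, for illustration only (not the Jakarta data).
sample = [38.0, 52.0, 61.0, 70.0, 74.0, 83.0, 95.0, 102.0, 118.0, 140.0]

# Candidate 1: exponential distribution with rate 1 / sample mean
rate = len(sample) / sum(sample)
exp_cdf = lambda x: 1.0 - math.exp(-rate * x)

# Candidate 2: lognormal distribution with closed-form MLE parameters
logs = [math.log(x) for x in sample]
mu = sum(logs) / len(logs)
sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
ln_cdf = lambda x: NormalDist(mu, sigma).cdf(math.log(x))

print("ECDF at 80:", ecdf(sample)(80.0))
print("exponential, max deviation:", max_pp_deviation(pp_points(sample, exp_cdf)))
print("lognormal, max deviation:", max_pp_deviation(pp_points(sample, ln_cdf)))
```

A model whose P-P points stay close to the diagonal has a small maximum deviation, which is the numerical counterpart of the "approximately linear" P-P plot described for the GLL distribution.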

Conclusions
Graphically and numerically, the generalized distributions perform considerably better than the classical distributions in fitting the data in the periods without COVID-19 and during the period with COVID-19 in Jakarta, Indonesia. In particular, on the basis of the PDFs of the compared models superimposed on relative frequency histograms of the data, the GLL, GEV, and GG distributions seem to be better than the LN, gamma, Weibull, LL, and Gumbel distributions in modeling PM2.5 concentrations in the first period without COVID-19 (February-June 2018) and during the period with COVID-19 (February-June 2020). However, in the second period without COVID-19 (February-June 2019), the GLL and GG distributions are the most appropriate models for describing the data. The CDF graphs also show that the GLL and GEV distributions perform better than the LN, gamma, Weibull, LL, Gumbel, and GG distributions in the periods without COVID-19 and during the period with the COVID-19 pandemic. Furthermore, according to the P-P plots, the GLL distribution is the most suitable model for representing the data. In short, graphically, the GLL distribution seems to be a promising distribution for modeling PM2.5 concentrations in the periods without COVID-19 (February-June 2018 and February-June 2019) and during the period with COVID-19 (February-June 2020) in Jakarta, Indonesia. In terms of the K-S, CvM, and A-D statistics, the GLL distribution seems to be an ideal descriptor of the PM2.5 concentrations in the periods without COVID-19 (February-June 2018 and February-June 2019). Similar results were obtained for the period with the COVID-19 pandemic (February-June 2020). From the standpoint of the log-L values, the GLL model, as well as the GG and GEV distributions, fits the data well.
Therefore, on the basis of goodness-of-fit statistics, the GLL distribution is a good alternative to the classical distributions, including the LN distribution, in fitting PM2.5 concentrations in the periods without COVID-19 (February-June 2018 and February-June 2019) and during the period with COVID-19 (February-June 2020) in Jakarta, Indonesia. More generally, the GLL distribution may become a good alternative to the LN distribution and other distributions in fitting PM2.5 concentration data.