Estimation methods for the ratio of medians of three-parameter lognormal distributions containing zero values and their application to wind speed data from northern Thailand

Wind speed has an important impact on the formation and dispersion of fine particulate matter (PM), which can cause several health problems. During the transition from the winter to the summer season in northern Thailand, the wind speed has been low for longer than usual, which has resulted in fine PM accumulating in the air. Motivated by this, we have identified a need to investigate wind speed due to its effect on PM formation and dispersion and to raise awareness among the general public. The hourly windspeed can be approximated by using confidence intervals for the ratio of the medians of three-parameter lognormal distributions containing zero values. Thus, we constructed them by using fiducial, normal approximation, and Bayesian methods. By way of comparison, the performance measures for all ofthe proposed methods (the coverage percentage, lower and upper error probabilities (LEP and UEP,respectively), and expected length) were assessed via Monte Carlo simulation. The results of Monte Carlo simulation studies show that the Bayesian method provided coverage percentages close to the nominal confidence level and shorter intervals than the other methods. Importantly, it maintained a good balance between LEP and UEP even for large variation and percentage of zero-valued observations. To illustrate the efficacy of our proposed methods, we applied them to hourly wind speed data from northern Thailand.


INTRODUCTION
Oxygen in the air is necessary for the survival of humans and other animals. How fast the air moves past a certain point is known as wind speed (measured in km/h), which is an important phenomenon in meteorology for monitoring and predicting weather patterns in a given area. It is reported daily along with the temperature, precipitation, and humidity for all provinces on the Thai Meteorological Department website. Importantly, low wind speeds during the transition from the winter to the summer season lead to increased fine particulate matter (PM) levels. In 2020, high PM 2.5 (PM ≤ 2.5 µm) increased the incidences of several ailments (respiratory illness, allergic reactions in the eyes and nasal passages, etc.) in almost all of the regions in Thailand, especially in the northern region (Tanraksa & Kendall, 2020). Spikes in PM 2.5 levels in the northern and northeastern regions of Thailand occur during the transition from the winter to the summer season (January-March) (Wipatayotin & Tangprasert, 2021). These reasons have led to our interest in estimating wind speeds to provide essential information on current PM 2.5 levels based on historic data. The hourly wind speed data for Phitsanulok and Phayao provinces located in the northern region follow the assumptions for a three-parameter lognormal (TPLN) distribution containing zero values indicating no wind. By way of comparison, estimating the ratio of the median wind speeds in two areas is used as a starting point in this study. The median, a measure of central tendency, is the central value of a dataset (the midpoint of a distribution). Moreover, it is more efficacious to use the median than the mean in analyses when the distribution of data is skewed. In addition, the ratio of the medians of two datasets can be used to measure the difference between them.
A lognormal distribution is used to represent right-skewed data when the threshold parameter (the lower bound of the data) is equal to zero. The TPLN distribution first introduced by Aitchison & Brown (1963) is suitable for highly right-skewed data that do not fit a lognormal distribution because the threshold parameter is greater than zero. It has been used in hydrology (Burges, Lettenmaier & Bates, 1975;Charbeneau, 1978) and for the analysis of flood frequency (Singh & Rajagopal, 1986;Singh & Singh, 1987). In this study, zero values are included among both simulated and wind speed data that follow a TPLN distribution (i.e., a TPLN distribution containing zero values).
One of statistical inference methods is the parameter estimations, including the point and interval estimations with the best-known example of the latter being a confidence interval (CI). According to Casella & Berger (2002), the CI is a range of numbers containing the parameter of interest with the desirable level of confidence which is better than a point estimator. For this reason, the CI is focused in this study. The point parameter estimations of the TPLN distribution have been formulated and discussed by various authors. Cohen & Whitten (1980) modified the local maximum likelihood and the moment estimators for the mean, variance and threshold parameters by using the first, second and third order statistics. Next, the moment estimation has been developed by replacing the third moment in a function of the first order statistics, as previously described in Cohen, Whitten & Ding (1985). Later, Singh, Cruise & Ma (1990) compared the five methods, including the regular method of moments, the modified method of moment (Cohen & Whitten, 1980), the regular maximum likelihood estimate (MLE), the modified MLE, and entropy to estimate the parameters and the quantiles of the TPLN via Monte Carlo simulation.
In particular, some researchers have formulated methods for estimating the CIs for the parameters of the TPLN. Royston (1992) used the zero skewness method to estimate the threshold parameter and its certain functions, motivated by Griffiths (1980). After that, Pang et al. (2005) presented the Bayesian estimation using Markov chain Monte Carlo to approximate the coefficient of variation of three-parameter Weibull, lognormal and where n i1 is the number of nonzero values that are binomially distributed with sample size n i = n i0 + n i1 ; ρ i is the proportion of nonzero values ; n i0 is the number of zero values. Eq.
(3) is computed from the first derivative about (γ i ,µ X i ,σ 2 X i ,ρ i ). The respective MLEs of (γ i ,µ X i ,σ 2 X i ,ρ i ) are solved by setting their first derivative to zero as follows: It is difficult to determine the explicit form ofγ i from the Eq. (4), so the MLE of γ i can be obtained by maximizing Eq. (4) using the Adam algorithm. It is a replacement optimization algorithm for stochastic gradient descent based on adaptive moment estimation that is mainly used in neural networks and other machine learning algorithms (Ruder, 2016). It provides a way of computing adaptive learning rates for specific parameters. Let β 1 and β 2 be the initial time step and decay rate, respectively. Adam performs well in practice when β 1 = 0.9 and β 2 = 0.999 (Singh, 2020). By optimizing γ i , the decaying averages and the past squared gradients are the estimates of the first and second moments of the gradients as follows: where g t denotes the first derivative of the target function (4) at time t = 0. Note that (m t ,v t ) are initialized as vectors of 0's, and the Adam algorithm operates until their gradients approach zero. The bias-corrected first-and second-moment estimates are used to update parameter γ i as follows: Finally, the parameter γ i is updated by Adam update rule, which is defined as where δ and denote the step size and the learning rate, respectively. is fixed at 10 −8 for a sufficient learning rate using the Adam optimization algorithm, as can be seen in Algorithm 1.

Algorithm 1: Adam optimization algorithm
After obtainingγ i by using Adam, one can compute the MLEs of (µ X i ,σ 2 X i ,ρ i ) as follows: Since η i = ρ i [γ i + exp(µ X i )] provides the medians of X i , the ratio of the medians of X i (the parameter of interest in the this study) is given by By substituting MLEs γ i ,μ X i ,σ 2 X i ,mle ,ρ i , estimateη i =ρ i [γ i + exp(μ X i )] becomes the MLE of η i . The concepts are elaborated in the methods for constructing CIs for ω in the next section.

METHODS
Here, we present constructing the CIs for the ratio of the medians in the TPLN models containing zero observations based on the fiducial methods (the fiducial generalized pivotal quantity (fiducial GPQ)) and the method of variance estimates recovery-based fiducial generalized pivotal quantity (MOVER-fiducial GPQ), NA (McKean & Schrader, 1984) and the Bayesian methods.
where λ and θ are vectors of the parameter of interest and the nuisance parameter, respectively. Moreover, let y = (y 1 ,y 2 ,...,y n ) be the observed values of Y , and assume that fiducial GPQ T (Y ;y,λ,θ ) is only a function of λ. This method is especially associated with the fiducial inference proposed by Fisher (1935). The fiducial GPQ CI depends on the fiducial GPQ defined by Hannig, Iyer & Patterson (2006) in Definition 1. Recently, Chankham, Niwitpong & Niwitpong (2022) recommended the fiducial GPQ-based CI for estimating the coefficient of variation of an inverse gaussian distribution when the sample size was small. Similarly, the performance of a fiducial GPQ CI in terms of expected length was the shortest when used to estimate the common coefficient of variation of delta-lognormal distributions by Yosboonruang, Niwitpong & Niwitpong (2022).

The fiducial GPQ CI
The fiducial GPQ CI can be constructed based on the fiducial GPQ conditions in Definition 1. Recall that (γ i ,µ X i ,σ 2 X i ,ρ i ) are the parameters controlling the behavior of a TPLN containing zero observations. Motivated by the possible value γ i < X i(1) , we proposed the fiducial GPQ of γ i based on a continuously uniform distribution and the fiducial GPQ of ρ i based on a beta distribution, respectively, as follows: where X i(1) denotes the minimum value of X i and n i1 is the sample size of nonzero values n i = n i1 + n i0 . Furthermore, the fiducial GPQ of µ X i can be obtained by using the concepts of Krishnamoorthy & Mathew (2003) as Note that the random variables W i and V i are independently draw from a standard normal distribution and a chi-square distribution with n i1 −1 degrees of freedom, respectively. Since the three fiducial pivots (T γ i ,T ρ i ,T µ X i ) satisfy the fiducial GPQ properties and T η i = T 1−ρ i [T γ i + exp(T µ X i )] is the fiducial GPQ of η i , then we can obtain the the 100(1 − ϕ)% fiducial GPQ CI for ω as follows: where T ω = T η 1 /T η 2 , and T ω (ϕ) denotes the ϕ th percentile of T ω . The steps for calculating the CP of fiducial GPQ CI (CP F ) for ω can be carried out by using Algorithm 2.

The MOVER-fiducial GPQ CI
MOVER is a well-known method for estimating the CI of the parameter of interest (Donner & Zou, 2012;Harvey & Van der Merwe, 2012;Hasan & Krishnamoorthy, 2017;Maneerat & Niwitpong, 2020;Maneerat, Niwitpong & Niwitpong, 2021;Zhang et al., 2021;Maneerat, Nakjai & Niwitpong, 2022a). Moreover, it can produce an explicit form of the CI that is easy to compute. For these reasons, we derived the MOVER-fiducial GPQ CI for ω as follows: The fiducial GPQ CIs for (γ i ,µ X i ,ρ i ) can respectively be written as These intervals can be formulated the MOVER CI by using the concept of Donner & Zou (2012) such that the 100(1 − ϕ)% MOVER-fiducial CI for lnη i becomes thereby providing the 100(1 − ϕ)% CI-based the MOVER-fiducial GPQ CI for ω as Algorithm 3 presents the computational steps for calculating the CP of the MOVERfiducial GPQ CI (CP M ) for ω.

The NA method
According to probability theory, the concept behind this method is the assumption that the approximate distributions of all of the samples approach a normal distribution pattern if the sample size is sufficiently large. This idea is integrated with the central limit theorem, in which the distribution of a given sample mean is approximated as a normal pattern if the sample size is sufficiently large under the assumption that all of the samples are similar to each other regardless of the shape of the population distribution. Recently, Maneerat, Niwitpong & Nakjai (2022b) proposed an NA-based CI for the median of a TPLN distribution, which performed well for a large sample size. Thus, we also considered the NA method. Given a set of observations, threshold γ i can be estimated by using the Adam algorithm to find the MLEs of the mean and variance (μ X i ,σ 2 X i ). Here, the medians of TPLN models with zero observations can be log-transformed to become which can be approximated by using Using the delta method, the variance of lnρ i can be derived as Likewise, McKean & Schrader (1984) estimated the variance of the median as a distribution-free estimate defined as follows: where X i(r) denotes the r th order statistic in a random sample drawn from a TPLN model with zero observations of size n i1 , and c = [(n i1 + 1) Hettmansperger & Sheather (1986) claimed that V MS is a consistent estimator of the variance of the median. Similarly, the variance of lnη i can be derived by applying the delta method as follows: where Thus, by applying the central limit theorem , the random variable W i can be defined as which approaches a standard normal distribution as n → ∞. Subsequently, the 100(1−ϕ)% NA-based CI for ω can be written as where W ϕ denotes the ϕ th percentile of a standard normal distribution. Algorithm 4 was used to compute the CP of the NA (CP N ).

The Bayesian method
Bayesian methods are based on treating probability as beliefs rather than frequencies.
Given unknown parameter θ , a prior distribution p(θ ) represents the subjective belief as a subjective distribution formulated before the data are seen. The posterior distribution is obtained from a prior that is updated with the likelihood function (or sample information) by using Bayes' rule (Casella & Berger, 2002). Importantly, the posterior distribution is considered to be a random quantity and can be used to make a statement about θ , For example, the point and interval estimates of θ can be computed by using its posterior. Equal-tailed Bayesian intervals based on the Jeffreys' and uniform priors based on the posterior densities of the zero proportion and the variance have been shown to perform well in certain scenarios (Yosboonruang, Niwitpong & Niwitpong , 2022).
Here, the Bayesian CI for ω is formulated based on the Bayesian method. First, we define an informative prior for our objective assumption depending on the amount of information available in the data as The posterior densities of (γ i ,ρ i ) are obtained by obtaining Eq. (30) with the likelihood function (2) as Next, we apply NA to the posterior distribution of κ i = (µ X i ,logσ X i ); the logarithm of the posterior density is approximated by using a quadratic function of κ i . The second derivatives of the log-posterior density are needed for constructing the approximation. From Eq. (3), the log-likelihood can be expressed as After that the first and second derivatives of (µ X i ,logσ X i ) respectively become The point estimates of (µ X i ,logσ X i ) are derived after setting their first derivatives to zero. Meanwhile, the variances of their estimates are respectively obtained by using an inverse Fisher information matrix as follows: After using the Jacobian to back-transform from lnσ X i to σ 2 X i , the posterior densities of (µ X i ,σ 2 X i ) are respectively approximated as normal distribution as follows: whereσ 2 X i = n i1σ 2 X i /(n i1 + 2). Therefore, the posterior density of ω becomes Finally, the 100(1 − ϕ)% Bayesian-based CI for ω is where f ω (ϕ) denotes the ϕ th percentile of f ω . The CP of Bayesian CI (CP B ) can be computed by using Algorithm 5.

THE MONTE CARLO SIMULATION STUDY
The comparative performances of the proposed Bayesian-, the fiducial GPQ-and MOVERfiducial GPQ-, and NA-based CI were evaluated via a Monte Carlo simulation study. The settings for the simulation parameters for the two simulation studies were as follows. For each specified parameter combination, the 95% CIs for the ratio of the medians (ω) of TPLN distributions containing zero values were constructed based on 5000 randomly generated samples. In addition, 2500 Monte Carlo sampling passes were used for each generated sample for the fiducial GPQ-based method. To assess the performances of the methods, their CPs were calculated by using the proportion of 5000 simulated CIs covering ω. LEP and UEP are defined as the proportion of times that ω falls below and above the stimulated CIs, respectively. At the 95% nominal confidence level, the expected lengths (ELs) of the CIs is also needed for deciding which method performs the best. A good performance will produce CP = 95% and LEP = UEP = 2.5%. Likewise, the comparison between LEP and UEP can be expressed in terms of the relative bias, which is defined as Thus, a good balance between LEP and UEP will produce a relative bias close to zero. Last, the best-performing method will provide the shortest EL.
In the first simulation study, we chose a small proportion of zeros and variance ((d 1 ,d 2 ) = (10%,10%), (10%,30%) and σ 2 X 1 = σ 2 X 2 = 1.25, respectively), the results of which provide insight into the sampling behavior of the CIs (Table 1 and Figs. 1, 2 and 3). It can be seen that although all of the methods generated CPs above or close to the nominal confidence level in almost all of the scenarios, the LEPs, UEPs, and ELs produced by the Bayesian method demonstrated its superiority. Thus, the Bayesian method performed the best in situations with a small proportion of zeros and sample variance except for with a large sample size, with the MOVER-fiducial GPQ method performing the best in that case.
In the next simulation, we were interested in scenarios with a large proportion of zeros and variance (d 1 = d 2 = (20%,40%),(40%,40%) and σ 2 X 1 = σ 2 X 2 = 3, respectively) ( Table 2 and Figs. 4, 5 and 6). Once again, the Bayesian method provided acceptable CPs, as well as better ELs and a better balance between LEP and UEP, than the other methods.

APPLICATION OF THE METHODS TO COMPARE HOURLY WIND SPEED DATA FROM TWO AREAS IN NORTHERN THAILAND
Due to the rapid effects of climate change, agricultural growth, and the social economy, seasonal air pollution from the burning of agricultural waste in preparation for planting, forest fires, and waste disposal during the transition from the winter to the dry season are important factors that influence the environment in northern Thailand (IQAir, 2021). Wind can affect the movement of PM 2.5 and PM 10 when its speed is 7.2 km/hr or higher (Liu et al., 2020). When cold air mass moves from China to Thailand, the upper region of Thailand can potentially become very cold toward the end of winter, which reduces  Full-size DOI: 10.7717/peerj.14194/ fig-3 the northeast monsoon to a calm wind. For this reason, PM 2.5 levels usually increase during the transition between the winter season to the dry season (Teerasuphaset & Culp, 2020). Phitsanulok is a city in lower northern Thailand about halfway between Chiang Mai and Bangkok where crop and forestland burning is extensive, resulting in extreme PM 2.5 occurrences (IQAir, 2022),while Phayao is one of the three highest-ranking provinces for PM 2.5 in the upper northern region (Group, 2021). We used datasets of the hourly wind speed from Phitsanulok and Phayao (Table 3) recorded in January 2021 to illustrate the efficacies of our proposed methods for formulating CIs for the ratio of the medians of TPLN distributions containing zero values. The data were taken from the Thai Meteorological Department Automatic Weather System (Thai Meteorlogical Department Automatic Weather System, 2022). Since wind speed observations are always non-negative, they are suitable for fitting to the following distributions: Cauchy, chi-squared, exponential, lognormal, TPLN, logistic, normal, and t-distributions. The Akaike information criterion (AIC) can be used to determine the best-fitting distribution.
We found that the TPLN model was suitable for the wind speed data, as evidenced by its smallest AIC value in Table 4. The basic statistics for the datasets are reported in Table 5. By way of comparison, the ratio of the medians of the TPLN distributions of the hourly wind speed data from Phayao and Phitsanulok isω = 1.0217 for mediansη phayao = 7.4091 andη phitsanulok = 7.2514. The 95% CIs and corresponding lengths based on the fiducial, Table 2 Monte Carlo simulation results from the simulation study 2: σ 2 X 1 = σ 2 X 2 = 3.    Table 6. It can be interpreted that there is no difference between the wind speeds in Phayao and Phitsanulok. The majority of the population in both areas are agriculturists, and so agricultural burning is often carried out in preparation for planting and after harvesting. Furthermore, the empirical example results are in agreement with the Monte Carlo simulation results in the previous section; the EL of the MOVER-fiducial GPQ CI was the smallest with a suitable CP for a small variance and a large sample size. Overall, the Bayesian-based method is the most suitable for formulating CIs for the ratio of the medians of TPLN distributions containing zero values when taking the checking criteria results from scenarios 1-12 and 37-72 into account.

DISCUSSION
We applied fiducial, NA, and Bayesian-based method to formulate CIs for the ratio of the medians of TPLN distributions containing zero values. From the results of the simulation study, the fiducial methods based on fiducial GPQ and MOVER-fiducial GPQ always provided CPs greater than the nominal 95% confidence level because the fiducial GPQ of (γ i , ρ i , µ X i ) have strong points (Hannig, Iyer & Patterson, 2006), as revealed by the conditions for FGPQ2 in Definition 1. However, the MOVER-fiducial GPQ method produced shorter interval than the fiducial GPQ and worked well for a small variance and a large sample size. The NA method provided CPs greater than the nominal 95% confidence   The ratio of the medians (ω =η 1 /η 2 )ω = 1.1765 Table 6 The 95% CIs for the ratio of the median wind speed in Phisanulok and Phayao provinces. level, thereby making its ELs longer than the other methods, which could have been caused by the variance in the estimated median (McKean & Schrader, 1984). Meanwhile, the Bayesian method provided suitable CPs with the shortest interval length, except for a small variance and a large sample size for which the MOVER-fiducial GPQ method performed the best. This could be because of using the uniform prior to gain information about the parameters from the data to obtain their posterior density. As such, the constructed CIs for the ratio of the medians of TPLN distributions containing zero values based on the Bayesian method with a uniform prior performed well.

CONCLUSIONS
CIs for the ratio of the medians of TPLN distributions containing zero values were formulated by using fiducial-, NA-, and Bayesian-based methods. Since a theoretical comparison was not possible, a Monte Carlo simulation and empirical application with two real datasets of wind speed observations were used to evaluate their performances in terms of their CPs and ELs. The results of the simulation study led us to recommend the Bayesian method for constructing the CIs for the ratio of the medians of TPLN distributions containing zero values because it attained CPs close to the nominal 95% confidence level and the shortest EL in most cases, except for a small variance and a large sample size for which the MOVER-fiducial GPQ method should be used.