Bayesian Nonparametric Model for Estimating Multistate Travel Time Distribution

Multistate models, that is, models with two or more component distributions, are preferred over single-state probability models in modeling the distribution of travel time. The literature indicates that finite multistate modeling of travel time using the lognormal distribution is superior to modeling with other probability functions. In this study, we extend the finite multistate lognormal model for estimating the travel time distribution to an unbounded lognormal mixture. In particular, a nonparametric Dirichlet Process Mixture Model (DPMM) with a stick-breaking process representation was used. The strength of the DPMM is that it can choose the number of components dynamically as part of the algorithm during parameter estimation. To reduce computational complexity, the modeling process was limited to a maximum of six components. Then, the Markov Chain Monte Carlo (MCMC) sampling technique was employed to estimate the parameters’ posterior distribution. Speed data from nine links of a freeway corridor, aggregated on a 5-minute basis, were used to calculate the corridor travel time. The results demonstrated that this model offers significant flexibility in modeling to account for complex mixture distributions of the travel time without specifying the number of components. The DPMM modeling further revealed that freeway travel time is characterized by multistate or single-state models depending on the inclusion of onset and offset of congestion periods.


Introduction
Modeling travel time distribution is essential for measuring the consistency of the traffic performance of a highway system. Moreover, the distribution of the travel time is useful in simulation and theoretical derivations regarding different traffic performance measures such as travel time reliability and variability. The accurate estimation and prediction of travel time are essential for traffic operators, planners, and traveler information systems [1].
This study develops a nonparametric Bayesian model to estimate the travel time distribution for freeways. The model is based on the Dirichlet process, extended with a hierarchical structure to account for the mixture/multistate characteristics of a given dataset. During the modeling process, the proposed model is truncated with an upper bound of six mixture components to reduce computational cost. Unlike a parametric model, this model does not require specifying the true number of components; instead, the number of components grows with the dataset and is automatically inferred using the Bayesian posterior inference framework. The posterior distributions of the model parameters are derived using the Metropolis-Hastings Markov Chain Monte Carlo (MCMC) sampler. For this study, an Interstate 295 freeway corridor located in Jacksonville, Florida, was studied using 2015 traffic data.
In the next section, a review of relevant studies is presented, followed by the methodology framework used in this research. Then, the dataset and the method used to estimate the travel time are discussed. Next, the results and a model evaluation using simulated data with known parameters are presented, after which conclusions and recommendations for possible future research are made.

Literature Review
The literature indicates that models for estimating the travel time distribution can be divided into two groups, that is, single probability (unimodal) and multistate/mixture models. Unimodal distributions commonly used to estimate the travel time distribution are Gaussian, lognormal, gamma, Weibull, and Burr [2]. Findings from several comparative studies of unimodal distribution functions suggest that the travel time distribution is skewed, which makes lognormal, gamma, Burr, and Weibull more accurate than the Gaussian distribution in modeling the travel time distribution. For example, using hourly-based data, Kieu et al. [2] compared Gaussian, lognormal, gamma, Burr, and Weibull models and concluded that the lognormal function fits the travel time distribution better than the rest of the models. Similar findings are reported by Arroyo and Kornhauser [3], Rakha et al. [4], and Emam and Al-Deek [5]. On the other hand, Pu [6] reported that, during congested and free flow conditions, the travel time distribution is close to symmetrical, suggesting that a Gaussian distribution is appropriate. However, at the onset and offset of congestion, the distribution is skewed. The study by Pu [6] suggested that the lognormal distribution fits these conditions well.
Journal of Advanced Transportation
The multistate/mixture models refer to models comprising two or more distributions. In mixture modeling, the individual distributions forming the mixture are combined linearly as a weighted sum. The weights are the mixing probabilities of the model. Studies comparing the performance of mixture models to single models revealed that mixture models provide a superior fit of the travel time distribution over single models [1, 7–9]. Using field data collected on the Interstate 35 freeway in San Antonio, Texas, Guo et al. [7] compared different multistate models. They found that the lognormal multistate distribution outperforms the rest of the models in modeling the travel time distribution. This finding is consistent with results by Yang and Wu [10]. As a result, our study also adopts the lognormal distribution in the analysis. It should be understood that, with the same road geometric characteristics (e.g., lane width, pavement condition, posted speed limit, and the number of lanes), the multistate characteristic of travel time on freeways is attributed to different vehicle types, traffic conditions, incidents, and driving characteristics. In addition to the previously mentioned factors, arterial roads are influenced by traffic signals, conflicts with pedestrians, and other factors [9, 11, 12].
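The weighted-sum definition of a mixture given above can be illustrated with a minimal Python sketch. The two-state weights, means, and standard deviations below are hypothetical values chosen for illustration only; they are not estimates from this study.

```python
import math

def lognormal_pdf(t, mu, sigma):
    """Density of a lognormal variable whose logarithm is Normal(mu, sigma)."""
    if t <= 0:
        return 0.0
    z = (math.log(t) - mu) / sigma
    return math.exp(-0.5 * z * z) / (t * sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(t, weights, mus, sigmas):
    """Weighted sum of lognormal component densities (weights sum to 1)."""
    return sum(w * lognormal_pdf(t, m, s)
               for w, m, s in zip(weights, mus, sigmas))

# Hypothetical two-state example: free flow near 20 min, congestion near 30 min.
weights = [0.7, 0.3]
mus = [math.log(20.0), math.log(30.0)]
sigmas = [0.05, 0.15]
density_at_22 = mixture_pdf(22.0, weights, mus, sigmas)
```

Because the weights sum to one and each component is a proper density, the mixture itself integrates to one.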
In multistate modeling, there are two commonly used methods for finding model parameters, that is, the maximum likelihood estimation-expectation maximization (MLE-EM) and the Bayesian approach (BA) [13]. The MLE-EM method treats the component memberships of the mixture as missing variables and iteratively alternates between the E-step and the M-step to find the parameters of the model [14]. In addition, the method starts from a random initial guess and, after sufficient iterations, the parameters converge. Compared to the BA, the MLE-EM method is computationally less expensive. However, it is susceptible to becoming trapped in local maxima, which could result in overfitting of the resulting model [14]. Unlike the MLE-EM estimation method, the BA treats the model parameters as distributions that can be updated when new data become available. The BA method also incorporates prior knowledge regarding the travel time distribution [15], which can be obtained from previously observed characteristics of the data distribution. Moreover, studies indicate that, by using informative priors, the BA can estimate the posterior distributions with smaller sample sizes than the MLE-EM approach [15, 16].
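The alternation between the E-step and the M-step described above can be sketched for a two-component lognormal mixture, which is equivalent to fitting a Gaussian mixture to the log travel times. This is an illustrative sketch, not the procedure of any cited study: the simulated data are hypothetical, and the start is deterministic (lower and upper halves of the sorted data) rather than the random initial guess mentioned in the text.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Gaussian density, applied here to log travel times."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def em_two_component(log_times, n_iter=100):
    """EM for a two-component Gaussian mixture on log travel times."""
    xs = sorted(log_times)
    half = len(xs) // 2
    # Assumption: deterministic start from the two halves of the sorted data.
    mus = [sum(xs[:half]) / half, sum(xs[half:]) / (len(xs) - half)]
    sigmas = [0.1, 0.1]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: probability that each observation belongs to each component.
        resp = []
        for x in xs:
            p = [weights[k] * normal_pdf(x, mus[k], sigmas[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p] if total > 0 else [0.5, 0.5])
        # M-step: re-estimate weights, means, and standard deviations.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(xs)
            mus[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sigmas[k] = max(math.sqrt(var), 1e-6)
    return weights, mus, sigmas

# Hypothetical data: 60% free-flow trips near 20 min, 40% congested near 30 min.
rng = random.Random(1)
log_times = [rng.gauss(math.log(20.0), 0.05) if rng.random() < 0.6
             else rng.gauss(math.log(30.0), 0.10) for _ in range(1000)]
weights, mus, sigmas = em_two_component(log_times)
```

Note that the number of components (two) is fixed in advance here, which is the limitation of finite mixtures that motivates the nonparametric model.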
Taken together, the probability distributions discussed above are parametric, with either single-model or multistate characteristics, whereby the multistate model consists of a fixed number of mixture components. The number of mixture components is specified as an input to the model. Information criteria, cross-validation, and the Bayes factor are procedures commonly used to select the best model among a set of candidates [13]. However, these model selection procedures sometimes produce a model that suffers from overfitting or underfitting, depending on the amount of data available and on the complexity bound of the model [17, 18].
Two methods can be used in modeling to avoid these overfitting and underfitting problems. The use of the infinite Dirichlet Process Mixture Model (DPMM) with a truncated number of mixture components overcomes the underfitting problem [17–20]. The overfitting problem can be overcome by the use of a BA to estimate the posterior distribution of the parameters [18]. In this study, both the DPMM and the BA were used in modeling the travel time distribution. As indicated above, the infinite DPMM was selected. The infinite number of mixture components is achieved by applying the stick-breaking process to build the mixing weights of the mixture. Having an infinite set of mixture components is what makes the model a typical nonparametric model [21, 22]. Although the model is taken as infinite, only a few nonempty components are drawn, depending on the actual characteristics of the given dataset [23]. Generally, the number of nonempty components is smaller than the number of observations considered in the analysis.
Bayesian nonparametric mixture models have been implemented in a wide range of applications, including topic modeling, image analysis, and lifetime distribution [21, 24–26]. The attractiveness of Bayesian nonparametric mixture models includes the ability to handle randomness of the mixing distribution of a noisy dataset. The randomness of the mixing component is estimated using infinite-dimensional priors, whereby, during sampling, true mixture components are built automatically and the rest die out. This study constructed priors using the stick-breaking process [21]. This process represents an infinite discrete distribution in which previously drawn values can recur with positive probability. This characteristic makes the stick-breaking process appropriate for clustering data with multistate characteristics. However, controlling an infinite-dimensional posterior distribution can be computationally expensive [27]. To mitigate this problem, the literature suggests the use of truncated-dimension priors [27].
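A truncated stick-breaking prior of the kind discussed above can be sketched in a few lines of Python. The sketch assumes the standard construction in which each break fraction is drawn from Beta(1, alpha), and it closes the stick at the last piece so that the truncated weights sum to one; the function name and the value alpha = 1 are illustrative choices.

```python
import random

def stick_breaking_weights(alpha, K, rng):
    """Draw K mixing weights from a stick-breaking prior truncated at K pieces."""
    weights = []
    remaining = 1.0  # length of the stick not yet assigned to a component
    for k in range(K):
        # Assumption: break fractions follow Beta(1, alpha); the final fraction
        # is set to 1 so the last piece takes whatever length is left.
        v = 1.0 if k == K - 1 else rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights

rng = random.Random(0)
w = stick_breaking_weights(alpha=1.0, K=6, rng=rng)
```

Smaller values of alpha tend to concentrate the mass on the first few pieces, so most of the six components stay nearly empty unless the data call for them.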

Model Framework
The Dirichlet distribution is the generalization of the Beta distribution to more than two outcomes. The distribution is parameterized by a concentration parameter $\alpha$ and the number of mixture components $K$. Its probability density function is given by
$$f(\pi_1, \ldots, \pi_K \mid \alpha) = \frac{\Gamma(\alpha)}{\Gamma(\alpha/K)^{K}} \prod_{k=1}^{K} \pi_k^{\alpha/K - 1}. \tag{1}$$
The definitions of the terms of (1) through (4) are given in the Abbreviations. The Dirichlet process is described as a distribution over distributions on an infinite sample space [21]. A mixture model with a hierarchical structure can be constructed using the Dirichlet process, which is also referred to as the DPMM [21, 28]. Figure 1 shows a graphical representation of the hierarchical mixture model.
The model in Figure 1 can also be mathematically represented as follows:
$$T_i \mid \theta_i \sim \mathrm{LN}(\theta_i), \qquad \theta_i \mid G \sim G, \qquad G \sim \mathrm{DP}(\alpha, G_0), \tag{2}$$
$$G = \sum_{k=1}^{\infty} \pi_k^{*}\, \delta_{\theta_k^{*}}. \tag{3}$$
In this study, the above model is implemented using the stick-breaking process (SBP), which involves repeatedly breaking a unit-length stick into infinitely many disjoint pieces [20]. The initial break, $k = 1$, is determined randomly with a probability $V_1$, which is considered as the probability of the first mixture component. After the first break, the next break, $k = 2$, has the probability $(1 - V_1) V_2$. The process of breaking continues until an infinite number of groups is created [22]. To reduce the computational complexity of the model, the breaking process can be truncated to $k = K$ groups. In this study, $K = 6$ was selected, and it was checked later in the analysis that the truncation did not bias the estimated mixture components for our dataset. In particular, the highest mixture component used by the data was identified in the matrix of mixture component probabilities. Recalling (3), the following conditions apply for the stick-breaking construction process:
$$V_k \sim \mathrm{Beta}(1, \alpha), \qquad \pi_k^{*} = V_k \prod_{j=1}^{k-1} (1 - V_j), \qquad \theta_k^{*} \sim G_0, \qquad \sum_{k=1}^{\infty} \pi_k^{*} = 1. \tag{4}$$
Estimating the posterior distribution of the hierarchical Bayesian model is analytically difficult as it involves a high-dimensional integral in the marginal likelihood [1]. The common method used for approximating the posterior distribution of the model parameters is MCMC simulation. In this study, we also apply MCMC simulation to estimate the posterior distribution of the unknown parameters. In particular, we adopted the Metropolis-Hastings sampling step through PyMC3, an open source package for approximating the posterior distribution of model parameters [29]. The Metropolis-Hastings sampling step uses an acceptance probability to decide whether each proposed draw is kept as a sample from the posterior distribution [29]. The priors are taken as noninformative, with Gamma(1, 1) for the concentration parameter $\alpha$, and Normal($\mu_1$, $\sigma_1$) and HalfCauchy(0, 1) for the mean $\mu$ and the standard deviation $\sigma$, respectively. On the other hand, the hyperpriors for the hyperparameters $\mu_1$ and $\sigma_1$ are Normal(0, 0.001) and HalfCauchy(0, 1), respectively.
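The role of the acceptance probability in Metropolis-Hastings sampling can be illustrated with a toy random-walk sampler. This is not the PyMC3 model used in the study: it is a one-parameter sketch that samples the mean of hypothetical log travel times, assuming a known standard deviation of 0.1 and a weak Normal(0, 10) prior, both chosen only for illustration.

```python
import math
import random

def log_posterior(mu, log_times, sigma=0.1, prior_sd=10.0):
    """Unnormalized log posterior of mu (assumed known sigma, Normal prior)."""
    log_lik = sum(-0.5 * ((x - mu) / sigma) ** 2 for x in log_times)
    log_prior = -0.5 * (mu / prior_sd) ** 2
    return log_lik + log_prior

def metropolis_hastings(log_times, n_iter, step, rng, start):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    current = start
    current_lp = log_posterior(current, log_times)
    draws = []
    for _ in range(n_iter):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_posterior(proposal, log_times)
        # Accept with probability min(1, posterior ratio), on the log scale.
        # 1 - random() lies in (0, 1], which keeps the logarithm finite.
        if math.log(1.0 - rng.random()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        draws.append(current)  # a rejected proposal repeats the current value
    return draws

# Hypothetical data: 50 log travel times centered near log(20 min).
rng = random.Random(42)
log_times = [rng.gauss(math.log(20.0), 0.1) for _ in range(50)]
draws = metropolis_hastings(log_times, n_iter=5000, step=0.03,
                            rng=rng, start=log_times[0])
kept = draws[1000:]  # discard early draws as burn-in
posterior_mean = sum(kept) / len(kept)
```

With a prior this weak, the posterior mean should sit very close to the sample mean of the log travel times, which gives a simple sanity check on the sampler.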

Study Data and Travel Time Estimation
For this study, data from the 20.4-mile corridor of the Interstate 295 freeway (Figure 2) in Jacksonville, Florida, were acquired. The corridor was divided into nine links running between interchanges. Each link had a posted speed limit of 65 miles per hour (mph). The archived traffic data for analysis were provided by the Regional Integrated Transportation Information System (RITIS) and comprise speed data from microwave vehicle detectors (MVDs) aggregated at 5-minute intervals. The five-minute interval was selected to avoid fluctuation in short-duration travel time [10]. The data were collected for the period of January 1, 2015, through December 31, 2015. Weekends and days on which incidents (crashes, work zones, etc.) happened were omitted from the dataset to reduce variability. If a link had more than one MVD in a lane, the average speed from the MVDs was calculated and used to represent the link travel speed. Except for Link 9, all links have at least two MVDs in each lane (Table 1). The travel time of each link was estimated using the average of the speeds reported by all MVDs in a segment. In addition, time of day was an important parameter in the analysis. The travel time of segment $i$ at time $t$ is computed by the following equation:
$$TT_{i,t} = \frac{L_i}{\frac{1}{n_i} \sum_{j=1}^{n_i} v_{i,j,t}},$$
where $n_i$ represents the number of detectors on link $i$, $L_i$ is the length of segment $i$, and $v_{i,j,t}$ is the speed reported by MVD $j$ on segment $i$ at time $t$. We considered the same departure time in estimating the corridor travel time from the individual links' travel times. By aggregating the travel time, the results showed that the morning peak hour for both directions (northbound and southbound) occurred between 7 a.m. and 8 a.m. while the evening peak hour occurred between 5 p.m. and 6 p.m. Figure 3 shows the travel times plotted against time of day for both the northbound and the southbound traffic.
The data in Figure 3 reveal that southbound traffic frequently experiences longer travel times than northbound traffic, particularly during the morning peak hours.
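The link and corridor travel time calculation described in this section can be sketched as follows. The link lengths and detector speeds are hypothetical, and the conversion to minutes is an illustrative choice; the corridor value is the sum of link travel times evaluated at the same departure interval, as in the text.

```python
def link_travel_time_minutes(length_miles, detector_speeds_mph):
    """Link travel time from the mean of the detector speeds on that link."""
    mean_speed = sum(detector_speeds_mph) / len(detector_speeds_mph)
    return 60.0 * length_miles / mean_speed

def corridor_travel_time_minutes(link_lengths, link_speeds):
    """Sum of link travel times for the same 5-minute departure interval."""
    return sum(link_travel_time_minutes(length, speeds)
               for length, speeds in zip(link_lengths, link_speeds))

# Hypothetical three-link corridor observed in one 5-minute interval.
lengths = [2.0, 3.0, 1.5]                              # miles
speeds = [[60.0, 66.0], [45.0, 55.0, 50.0], [65.0]]    # mph, one list per link
total = corridor_travel_time_minutes(lengths, speeds)
```

Repeating this for every 5-minute interval of every retained weekday gives the travel time observations to which the mixture model is fitted.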

Results and Discussion
Two chains were drawn; the first 10,000 iterations were discarded as burn-in, and the next 10,000 iterations were used for inference. To reduce correlation between drawn samples, the inference iterations were thinned by a factor of 10. Figure 4 presents predicted and actual data densities for some of the hours considered in the analysis. As shown in the figure, the proposed model provided a good fit, in that the actual and predicted probability densities are close. Furthermore, the Kolmogorov-Smirnov (KS) goodness-of-fit test was conducted to test whether the predicted and actual distributions are the same. The null hypothesis for the test is that the actual cumulative distribution of the travel time is equal to the predicted cumulative distribution. The test did not reject this hypothesis (p-value ≥ 0.05), indicating that the predicted cumulative travel time distribution is consistent with the empirical one. Figure 5 compares the predicted and empirical cumulative distributions. Table 2 lists the number of mixture components, the model parameters, and the KS test results of the predicted travel time distribution for some of the hours. The results in this table indicate that truncating the mixture at a maximum of six components did not bias the results for the dataset: the largest number of mixture components used by the data was three.
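The burn-in and thinning steps described above amount to a simple slicing operation on each chain. The sketch below uses a placeholder list in place of real sampler output, and the function name is illustrative.

```python
def burn_and_thin(chain, burn_in, thin):
    """Drop the first burn_in draws, then keep every thin-th remaining draw."""
    return chain[burn_in::thin]

# Placeholder chain of 20,000 draws standing in for sampler output.
chain = list(range(20000))
kept = burn_and_thin(chain, burn_in=10000, thin=10)
```

With 10,000 burn-in iterations, 10,000 inference iterations, and thinning by 10, each chain contributes 1,000 retained draws.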
As can be seen in Table 2, the northbound travel time distributions predominantly have two mixture components, with two hours having one component, while the southbound distributions have one, two, or three mixture components. However, the third component of the three-component distributions has a very low likelihood, less than 0.1.

Model Evaluation.
To understand the performance of the DPMM in estimating the distribution of the travel time, four finite mixture models (i.e., single-, two-, and three-component mixture models) were simulated. The simulation was aimed at evaluating the accuracy of the model given known parameters. The simulation was conducted with known means and variances, which were chosen randomly from the links' averages and variances of the travel time data. Subsequently, the true parameters were used to simulate samples of various sizes, including 100, 1,000, and 10,000, following the lognormal distribution with the predefined finite mixture. The reason for simulating different sample sizes was to evaluate the influence of sample size on the proposed model. The DPMM truncated at six components was applied to each simulated sample. After discarding the first 10,000 iterations, the next 10,000 iterations were used for inference of the model parameters. Table 3 lists the true and predicted parameters. According to this table, the number of mixture components, the mean, and the variance converged close to the true parameters. Comparing the true to the predicted values, the results are promising: the number of components is predicted accurately, although some of the means, standard deviations, and mixing probabilities deviate somewhat from the true parameters.
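Simulating data from a finite lognormal mixture with known parameters, as in the evaluation described above, can be sketched as follows. The weights, log-scale means, and standard deviations are hypothetical stand-ins; the study drew its true values from link-level travel time statistics.

```python
import math
import random

def simulate_lognormal_mixture(n, weights, mus, sigmas, rng):
    """Draw n travel times: pick a component by weight, then sample from it."""
    samples = []
    for _ in range(n):
        k = rng.choices(range(len(weights)), weights=weights)[0]
        samples.append(rng.lognormvariate(mus[k], sigmas[k]))
    return samples

# Hypothetical two-component truth, simulated at the three sample sizes.
rng = random.Random(7)
true_weights = [0.6, 0.4]
true_mus = [math.log(20.0), math.log(30.0)]
true_sigmas = [0.05, 0.15]
datasets = {n: simulate_lognormal_mixture(n, true_weights, true_mus,
                                          true_sigmas, rng)
            for n in (100, 1000, 10000)}
```

Each simulated dataset can then be passed to the fitted model, and the recovered number of components and parameters compared against the known truth.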
Regardless of the number of observations, the number of mixture components was predicted correctly. The true and the predicted distributions are plotted in Figure 6. The KS test was also used to compare the true cumulative distributions against those predicted by the DPMM. The results of the analysis give no evidence to reject the null hypothesis, indicating that the predicted probability density follows the observed data. As indicated in Table 4, the p-value for each considered sample is greater than 0.05, suggesting that there is no significant difference between the distributions of the predicted and actual data.
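The comparison of two cumulative distributions by the KS test can be sketched with a two-sample version of the statistic. This is an illustrative implementation rather than the routine used in the study: the p-value uses the asymptotic Kolmogorov series with a common small-sample correction, and it is set to 1 when the scaled statistic is too small for the truncated series to be reliable.

```python
import math

def ks_two_sample(a, b):
    """Return the KS statistic D and an approximate p-value for two samples."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Advance both empirical CDFs past every value equal to x.
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    ne = n * m / (n + m)  # effective sample size
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    if lam < 0.1:
        return d, 1.0  # series converges too slowly here; p is essentially 1
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)

# Small hypothetical samples with partial overlap.
d_stat, p_value = ks_two_sample([1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0])
```

A p-value above 0.05 means the test gives no evidence that the two samples come from different distributions, which is the criterion applied to the predicted and actual travel times.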

Conclusions and Recommendations for Future Research
This study evaluated the application of a nonparametric Dirichlet Process Mixture Model (DPMM) for estimating the distribution of travel time without specifying the number of components. In the analysis, the uncertainties related to the number of mixture components were incorporated as well. The performance of the model, based on the KS test on the actual and predicted cumulative distributions, was promising. Moreover, when the proposed model was tested using simulated data, the true number of mixture components, the mean, and the standard deviation were correctly predicted. It is important to note that in this study the travel time for the corridor was aggregated using the same departure time. This process may not represent the actual travel time of the corridor. Future studies may consider a vehicle trajectory-based method, dynamic time slice methods, or other methods to aggregate travel time across links. In addition, future studies could aim at analyzing and comparing the finite mixture and nonparametric mixture models using different sample sizes and other kernel functions such as gamma and normal distributions.
Abbreviations
$\mathrm{DP}(\alpha, G_0)$: The random probability density function coming from the Dirichlet distribution with parameters $\alpha$ and $G_0$
$G_0$: The base measure
$\alpha$: Concentration parameter
$G$: The random distribution drawn from the Dirichlet process $\mathrm{DP}(\alpha, G_0)$
$V$: The parameter of the distribution which follows a stick-breaking process (SBP)
$\pi$: The nonnegative vector representing a probability mass function of length $K$
$\pi^{*}$: The mixing proportion
$\delta_{\theta^{*}}$: A Dirac delta function concentrated at $\theta^{*}$
LN: The lognormal kernel distribution function with a parameter $\theta$
$K$: The number of mixture components, usually equal to or less than the total number of realizations
$T$: Travel time
$\Gamma(\cdot)$: The gamma function.