SHORT-TERM TRAFFIC STATE ESTIMATION USING BREAKPOINT FLOW

Estimation of road traffic state is gaining increasing attention in intelligent transportation systems. Accurate, real-time estimation of changes in traffic conditions is critical for the management and control of road networks. Thus, efforts are being made to predict short-term traffic conditions based on measured traffic data such as speed, flow and density. In this work, the state of the traffic is estimated through a three-step process. First, both speed and flow are predicted 15 minutes ahead for a particular freeway segment. Four different regression models are used for the prediction task, namely, multi-layer perceptron neural networks (MLPNN), support vector regression (SVR), gradient boosted decision trees (GBDT), and k-nearest neighbors (kNN). Next, the breakpoint (BP) flow is calculated using the distribution of these predicted speed and flow values. In the final step, the predictions are classified as belonging to a “stable state” or a “metastable state” by using the calculated BP as the threshold between these states. According to the experimental results, the $R^2$ values for MLPNN are the highest for both speed (0.8564) and flow (0.9862) predictions. An identical BP, 1050 pc/15 min, is calculated for the actual data as well as for all prediction methods.


INTRODUCTION
Intelligent transportation systems (ITS) are being equipped with smart sensing, computation and communication technologies to increase the operational efficiency and capacity of transportation systems [1][2][3]. Thus, it is possible to collect data related to traffic parameters accurately, reliably and in real-time. In particular, advanced traffic management systems (ATMS) and advanced traveler information systems (ATIS) need accurate and reliable traffic information to predict traffic characteristics for transport users to understand and estimate future traffic conditions. Hence, accurate and real-time traffic prediction has been defined as a very critical need for the operational efficiency of ITS [4].
Data-driven management and control of transportation systems have become possible as a result of recent advancements in technology and computer science. Prediction of traffic parameters has been a popular research subject since the late 1970s. Consequently, short-term traffic prediction, which refers to estimating traffic conditions up to 60 minutes ahead, has become an essential part of ITS. Such predictions support the optimization of transport systems, for example through real-time traffic management, the development of control strategies, and the reduction of delay, congestion, and energy consumption.
The main variables of traffic flow theory are speed, flow and density. These variables alone cannot provide sufficient information to explain the irregular nature of traffic. Thus, traffic conditions have been explained using fundamental diagrams such as speed-density, density-flow and speed-flow diagrams. These diagrams were first established by Greenshields [5] and later improved by other researchers. They now form the basis of traffic theories and models. In addition, they are important subjects of traffic measurement and of teaching in the area of transportation. Since 1965, in all editions of the Highway Capacity Manual (HCM), the inspection of speed-flow diagrams has constituted the basis of design and analysis methodologies for basic freeway segments and uninterrupted flow segments of multilane highways. Furthermore, these diagrams are used as the basic methodology for empirical studies of measured traffic data.
Speed-flow diagrams are used to determine the capacity and level of service in uninterrupted flow segments of highways and basic freeway segments. Additionally, the relationship between these variables is useful in detecting the phase transition of traffic flow. The transition from stable flow to metastable flow occurs at the breakpoint (BP). Specifically, it is the point separating the constant-speed portion of the speed-flow curve from the rest of it. Stable free flow prevails up to the BP; beyond this point, metastable free flow prevails up to the maximum capacity value. The BP is the point where the traffic flow situation starts to change; therefore, it is important to detect this point to understand the transition from stable traffic flow to the regime in which vehicle speed changes.
In the existing literature, several efforts have been made to address the prediction of traffic speed, density or flow; however, most of these studies focus on predicting only one of these parameters. Predicting traffic speed or flow alone cannot explain traffic conditions adequately. Therefore, it is necessary to determine the BP flow after which the traffic speed starts to decrease. Proper identification of the BP from the predicted values is thus important for predicting the short-term transition of flow from the stable to the metastable state [6].
Relevant studies involve numerous methods for short-term traffic predictions. Van Lint and Van Hinsbergen [7] classified the approaches used in short-term traffic predictions into four categories: naïve, parametric, nonparametric and hybrid. These include the use of machine learning methods such as the k-nearest neighbors (kNN) [8], support vector regression (SVR) [9] and artificial neural networks (ANN) [10].
In this work, the state of traffic flow is estimated using the flow level corresponding to the BP. The BP value is determined from the predicted flow and speed values. To achieve this, 15-minute-ahead predictions are performed using four different regression models, namely, multi-layer perceptron neural networks (MLPNN), SVR, gradient boosted decision trees (GBDT), and kNN. Next, speed-flow diagrams are generated from the predictions of each of these methods in order to calculate the BP. The rate of change of the standard deviation of speed with respect to the predicted flow is calculated to determine the BP. Finally, the state of traffic flow is estimated by checking which side of the calculated BP the predictions fall on. The main contributions of this paper are that (i) both speed and flow predictions are made for a particular freeway segment, (ii) these predictions are analyzed together to calculate the BP, and (iii) the traffic flow state is estimated using the predictions and the calculated BP.

BACKGROUND
The speed-flow diagram is a parabolic curve and Hall et al. [11] described it as three regions representing uncongested, queue discharge and congested flow (Figure 1). It is essential to understand and interpret the speed-flow relationship for basic freeway segments as the related analysis method is based on calibrations of the speed-flow relationships under base uncongested flow conditions. The mathematical model adopted in HCM explains the speed-flow relationship, which is used for both freeways and multilane highways [12]. The same model is used in HCM as well, and based on this, a defined set of speed-flow curves for basic freeway segments under the base condition is given with the generalized graph shown in Figure 2 [13]. In Figure 2, under uncongested traffic flow conditions, the curves consist of two regions: the linear part and the concave part. The linear part is the constant-speed portion of the curve and represents the free flow speed (FFS). FFS is an important parameter as several conditions such as capacity, service flow rates, daily service volumes and service volumes depend on it. In the regions where the flow rate is higher, the speed starts to decrease, and it shows a curvilinear change until it reaches the capacity value of the road segment. The transition between the linear part and the concave part is expressed as the BP.
The model proposed in HCM [12] explains the speed-flow relationship with curves as shown in Figure 2. The speed-flow function is anchored by two points (BP, FFS) and (C, CS) that represent two regions, while C and CS represent the capacity and the speed at capacity, respectively. The basic approach of HCM [12] is quite simple since these anchor points can be algebraically determined with given equations. The equations require the estimation of deterministic values for BP, FFS, C, and CS. In this regard, some researchers have analyzed the speed-flow relationship to find BP. However, the proposed methods to find BP are relatively complicated and computationally heavy [14,15]. One simple and effective approach using standard deviations of speed measurements was offered by Roess [16]. This method is based on the assumption that the standard deviation of speed is low for flow values smaller than BP and begins to increase abruptly as flow is greater than BP.

The dataset and features
Traffic flow and speed data are obtained from the performance measurement system (PeMS). PeMS is a freeway performance measurement system that supplies historical and real-time data collected from detectors on freeways throughout California [17]. The dataset includes readings of a dual-loop detector on the California SR-17 freeway. Four weeks of data from each of the four seasons of 2017 and 2018 (32 weeks of data in total) were used. The original dataset involved speed and flow data collected at 5-minute intervals. Therefore, three of these measurement intervals are combined to obtain a dataset with 15-minute intervals. This combination procedure involves the calculation of total flow and average speed. Three consecutive flow values are added to obtain the total flow. The average speed is calculated as the flow-weighted mean of the three speed values, i.e., the sum of the speed values weighted by the corresponding flow values divided by the total flow. Thus, one hour of data is represented by four examples.
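The aggregation described above can be illustrated with a minimal Python sketch. The function name `aggregate_15min` and the handling of an interval with zero total flow are assumptions made for illustration; they are not taken from the PeMS tooling or from the original experiments.

```python
from typing import List, Tuple


def aggregate_15min(flows: List[float], speeds: List[float]) -> List[Tuple[float, float]]:
    """Combine consecutive 5-minute readings into 15-minute samples.

    Returns one (total_flow, flow_weighted_mean_speed) pair for every
    complete group of three readings; an incomplete trailing group is dropped.
    """
    if len(flows) != len(speeds):
        raise ValueError("flows and speeds must have the same length")
    samples = []
    for start in range(0, len(flows) - 2, 3):
        f = flows[start:start + 3]
        v = speeds[start:start + 3]
        total = sum(f)
        if total > 0:
            # Sum of speeds weighted by flow, normalized by the total flow.
            mean_speed = sum(fi * vi for fi, vi in zip(f, v)) / total
        else:
            # Assumption: with no vehicles observed, use the plain average.
            mean_speed = sum(v) / 3
        samples.append((total, mean_speed))
    return samples


if __name__ == "__main__":
    # Twelve 5-minute readings (one hour) yield four 15-minute samples.
    flows = [100, 120, 80, 90, 95, 105, 110, 115, 120, 130, 125, 135]
    speeds = [60, 65, 70, 64, 63, 62, 61, 60, 59, 58, 57, 56]
    for total, speed in aggregate_15min(flows, speeds):
        print(f"flow={total:.0f}  speed={speed:.2f}")
```

Weighting by flow makes the 15-minute speed reflect the vehicles actually observed, so a 5-minute interval with few vehicles has little influence on the average.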
The features extracted from the dataset fall into two categories: temporal features and measurement features. The temporal features are categorical and denote "hour of day" and "day of week". These features are represented by a one-hot encoding; the dimensions of the corresponding binary vectors are therefore 24 and 7, respectively. The measurement features are continuous values and involve current and historical data for flow and speed. The historical data consist of measurements from one day before and one year before the time to be predicted. Hence, three features (current, yesterday, and last year) for two different measurements (speed and flow) form a six-dimensional feature vector. A representation of the feature vector is illustrated in Figure 3.
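A minimal sketch of how such a feature vector might be assembled is given below. Concatenating the two one-hot vectors with the six measurements into a single 37-dimensional input, the ordering of the components, and the helper names `one_hot` and `build_features` are all assumptions for illustration; the layout actually used is the one shown in Figure 3.

```python
from typing import List


def one_hot(index: int, size: int) -> List[float]:
    """Return a binary vector of length `size` with a single 1 at `index`."""
    if not 0 <= index < size:
        raise ValueError(f"index {index} outside [0, {size})")
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def build_features(hour: int, weekday: int,
                   flow_now: float, flow_yesterday: float, flow_last_year: float,
                   speed_now: float, speed_yesterday: float, speed_last_year: float) -> List[float]:
    """Concatenate temporal (24 + 7) and measurement (6) features.

    The ordering below is hypothetical.
    """
    temporal = one_hot(hour, 24) + one_hot(weekday, 7)
    measurements = [flow_now, flow_yesterday, flow_last_year,
                    speed_now, speed_yesterday, speed_last_year]
    return temporal + measurements


if __name__ == "__main__":
    x = build_features(hour=8, weekday=0,
                       flow_now=950.0, flow_yesterday=930.0, flow_last_year=910.0,
                       speed_now=62.0, speed_yesterday=63.5, speed_last_year=61.0)
    print(len(x))
```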

Speed and flow prediction
The experiments involved in this work may be collected into three major groups. First, for a prediction horizon of 15 minutes, speed and flow values are predicted using four different machine learning methods. The next step starts with generating speed-flow diagrams using the predicted values. These diagrams provide useful information for detecting traffic conditions of the relevant road segment. Therefore, the predicted values are used to calculate the BP flow, which is an important parameter for traffic analysis and modeling. In the final step, the state of the traffic is estimated by comparing the predicted flow value with the BP flow calculated in the second step.
For speed and flow prediction, the dataset is split into training and test sets with proportions of 75 and 25%, respectively. To evenly distribute the seasonal data into these sets, the first three weeks from each season are merged to generate the training set and the following one-week data are merged to generate the test set. Instead of training a model for the prediction of every sample in the test set, only one model is sufficient to make predictions for all data in the test set as the traffic speed and flow patterns have similar structures throughout the day and week [18,19].
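The week-based split can be sketched as follows. The record layout (dictionaries with `year`, `season` and `week` keys, where `week` runs from 1 to 4 within a season) is a hypothetical one chosen for illustration.

```python
from typing import Dict, List, Tuple


def split_by_week(samples: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Put weeks 1-3 of every season in the training set and week 4 in the test set."""
    train, test = [], []
    for sample in samples:
        if sample["week"] <= 3:
            train.append(sample)
        else:
            test.append(sample)
    return train, test


if __name__ == "__main__":
    # One placeholder record per week: 2 years x 4 seasons x 4 weeks = 32 weeks.
    weeks = [{"year": y, "season": s, "week": w}
             for y in (2017, 2018)
             for s in ("winter", "spring", "summer", "autumn")
             for w in (1, 2, 3, 4)]
    train, test = split_by_week(weeks)
    print(len(train), len(test))
```

Splitting by whole weeks rather than by random samples keeps every season represented in both sets and avoids mixing adjacent time steps of the same week across the two sets.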
Using the MLPNN, SVR, GBDT, and kNN methods, four different models are trained and tested on these sets. Relevant parameters for these models are selected using 10-fold cross-validation, and the corresponding prediction performance values are provided in the results section. It is important to note at this point that the speed and flow prediction is an intermediate step to determining BP flow.
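As an illustration of the simplest of the four regressors, a kNN prediction can be sketched in a few lines of Python. The use of Euclidean distance and of an unweighted average of the neighbors' targets are assumptions, since the text does not state the distance metric or weighting scheme; the default of nine neighbors follows the value reported in the results section. The function name `knn_predict` is hypothetical.

```python
import math
from typing import List, Sequence


def knn_predict(train_x: List[Sequence[float]], train_y: List[float],
                query: Sequence[float], k: int = 9) -> float:
    """Predict a target as the mean of the k nearest training targets."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training samples")
    # Pair every training target with its Euclidean distance to the query.
    by_distance = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / k


if __name__ == "__main__":
    # Toy data: the target is twice the single input feature.
    xs = [[float(i)] for i in range(20)]
    ys = [2.0 * i for i in range(20)]
    print(knn_predict(xs, ys, [10.0]))
```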

Determining breakpoint values
It is critical to determine the first BP in the flow axis of the speed-flow relationship. The speed values up to this BP are considered to be constant. Hence, for the values greater than the BP, the speed values start to decrease while flow increases. Thus, it may be concluded that the standard deviation of speed value from the FFS increases after BP [16]. The use of standard deviation is a common method for determining BP in the literature [6,20].
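BP values for the actual data ($BP_{act}$) and for the predicted data are calculated from the standard deviation of speed about the FFS within consecutive flow ranges, $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mathrm{FFS})^2}$, where $x_i$ is the speed value of the $i$-th sample in the range, FFS is the free-flow speed for the site, and $N$ is the number of observations belonging to the range. The size of the ranges is selected as 50 pc/15 min and the corresponding $\sigma$ values are calculated for flow rates greater than 200 pc/15 min.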
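The per-range deviation of speed from the FFS can be sketched as below. Two details are assumptions: the ranges are aligned so that the first one starts at 200 pc/15 min, and the squared deviations are divided by N rather than N - 1. The function name `binned_std_about_ffs` is hypothetical.

```python
import math
from collections import defaultdict
from typing import List, Tuple


def binned_std_about_ffs(flows: List[float], speeds: List[float], ffs: float,
                         bin_width: float = 50.0,
                         min_flow: float = 200.0) -> List[Tuple[float, float]]:
    """Return (range_lower_edge, sigma) pairs sorted by flow.

    sigma is the root-mean-square deviation of speed from the free-flow
    speed, computed over the samples whose flow falls in the range.
    """
    groups = defaultdict(list)
    for q, v in zip(flows, speeds):
        if q > min_flow:  # only flow rates greater than min_flow are used
            lower = min_flow + math.floor((q - min_flow) / bin_width) * bin_width
            groups[lower].append(v)
    result = []
    for lower in sorted(groups):
        xs = groups[lower]
        sigma = math.sqrt(sum((x - ffs) ** 2 for x in xs) / len(xs))
        result.append((lower, sigma))
    return result


if __name__ == "__main__":
    flows = [210, 240, 260, 150]
    speeds = [64, 66, 60, 65]
    print(binned_std_about_ffs(flows, speeds, ffs=65.0))
```

Measuring the deviation from the FFS instead of from the mean of each range means that a range whose speeds have all dropped by the same amount still produces a large value, which is what signals the end of the constant-speed portion.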

Estimating the state of traffic flow
Calculation of the BP flow allows the speed-flow space to be divided into two parts: the stable and the metastable regions. Therefore, the samples with flow values smaller than the calculated BP are estimated as "stable state". The other samples, which have flow values higher than the calculated BP, are labeled as "metastable state".
This state estimation procedure is carried out on both the actual data and all predictions. The labels obtained by comparing the actual data with $BP_{act}$ are considered as ground truth. The state estimation performance of each regression method is calculated by generating a confusion matrix using the ground truth labels and the estimated labels.
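The labeling and evaluation steps can be sketched as follows. Treating "metastable" as the positive class matches the way sensitivity and specificity are discussed in the results, while assigning a sample whose flow equals the BP to the stable state is an assumption, since the text only covers flows strictly below or above the BP. The function names are hypothetical.

```python
from typing import Dict, List


def label_states(flows: List[float], bp: float) -> List[bool]:
    """Return True for "metastable" (flow above BP) and False for "stable"."""
    return [q > bp for q in flows]


def confusion_metrics(truth: List[bool], estimate: List[bool]) -> Dict[str, float]:
    """Compute confusion-matrix counts and the derived rates."""
    tp = sum(t and e for t, e in zip(truth, estimate))
    tn = sum((not t) and (not e) for t, e in zip(truth, estimate))
    fp = sum((not t) and e for t, e in zip(truth, estimate))
    fn = sum(t and (not e) for t, e in zip(truth, estimate))
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": (tp + tn) / len(truth),
        # Sensitivity: share of metastable samples that are detected.
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        # Specificity: share of stable samples that are detected.
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


if __name__ == "__main__":
    actual_flow = [900, 1000, 1100, 1200]
    predicted_flow = [950, 1060, 1040, 1300]
    truth = label_states(actual_flow, bp=1050)        # ground truth labels
    estimate = label_states(predicted_flow, bp=1050)  # labels from predictions
    print(confusion_metrics(truth, estimate))
```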

Performance metrics
To evaluate the prediction performance of each machine learning method, four different metrics commonly used in the literature are calculated. These metrics include the coefficient of determination ($R^2$).

Results for speed and flow prediction
The parameters of the prediction methods have a direct impact on performance. As stated earlier, 10-fold cross-validation is applied to the training set and the parameters with the best validation results are selected. The number of hidden units in the MLPNN model is set to 40. The model is trained with a learning rate of 0.0001 and the rectified linear unit (ReLU) is used as the activation function. The SVR method is evaluated with two separate models that use different kernel functions. The first one (SVR-RBF) uses the Gaussian radial basis function (RBF) as the kernel. The regularization term ($C$) and the threshold term ($\varepsilon$) are selected as 100 and 0.1, respectively. In addition, the standard deviation ($\sigma$) for the RBF kernel is set to 0.027. A polynomial kernel with a degree of 4 and a bias term of 1 is used in the other SVR model (SVR-POLY). This polynomial model uses $C = 80$ and $\varepsilon = 0.1$. The GBDT model, trained with a learning rate of 0.1, contains 300 boosting steps and weak learners with a maximum depth of 3.
For the kNN method, $K = 9$ is selected. The relevant prediction results for speed and flow are provided in Tables 1 and 2 and in Figure 4, while scatter plots of predicted data versus actual data are provided in Figure 5. The majority of the predictions with a higher error are in the low-speed, high-flow region. This region belongs to the metastable traffic flow, and it corresponds to a relatively small portion of the day. Thus, the proportion of data related to the metastable state is low, which indicates that the methods have limited opportunity to learn this state.

Although inspecting the quantitative results in Tables 1 and 2 makes it possible to determine which method has a better

Results for BP calculation
The predictions made by the machine learning methods are used to calculate the BP flow value. For this purpose, the standard deviation method is applied to the actual speed and flow data as well as to their predictions. The speed-flow distribution for the actual data and the corresponding predictions with the MLPNN method is given in Figure 6 [20]. To observe the change in standard deviation over consecutive flow ranges, the related plots for all prediction methods, as well as for the actual data, are generated (Figure 7). The BP is defined as the flow value at which there is a significant increase in the standard deviation. The smallest flow value at which the first derivative of the standard deviation is greater than 0.1 is determined as the BP. For all the predicted data and the actual data, the standard deviation analysis yields an identical BP (1050 pc/15 min).
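The threshold rule on the first derivative can be sketched with a finite difference over consecutive flow ranges. Two choices here are assumptions: the derivative is taken per unit of flow (change in standard deviation divided by change in flow), and the flow reported as the BP is the lower edge of the range in which the rise is first observed. The function name `find_breakpoint` and the example values are hypothetical.

```python
from typing import List, Optional, Tuple


def find_breakpoint(bins: List[Tuple[float, float]],
                    threshold: float = 0.1) -> Optional[float]:
    """Return the smallest flow at which the slope of sigma exceeds `threshold`.

    `bins` holds (flow, sigma) pairs sorted by increasing flow.
    Returns None when no slope exceeds the threshold.
    """
    for (q_prev, s_prev), (q_cur, s_cur) in zip(bins, bins[1:]):
        slope = (s_cur - s_prev) / (q_cur - q_prev)  # finite-difference derivative
        if slope > threshold:
            return q_cur
    return None


if __name__ == "__main__":
    # Illustrative values only: sigma stays low, then rises sharply.
    bins = [(200.0, 1.0), (250.0, 1.2), (300.0, 1.5), (350.0, 9.0), (400.0, 15.0)]
    print(find_breakpoint(bins))
```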

Results for state of flow estimation
The speed-flow distributions with the calculated BP levels are visualized for the actual data and the MLPNN predictions in Figure 8. Quantitative results of the estimation obtained with the different methods are given in Table 3. The state estimation accuracy is highest for the MLPNN method. On the other hand, the kNN and SVR-POLY methods have better specificity and sensitivity values, respectively. This means that the rate of true positive estimations is higher with SVR-POLY; hence, it is better at detecting metastable states. Conversely, a high true negative rate for kNN means that this method is relatively more successful than the others in estimating the stable states of traffic. However, the best overall accuracy is obtained via the MLPNN method, which has the highest $R^2$ value for regression as well. The average flow of the misclassified samples is 1050.62 pc/15 min and the corresponding standard deviation is 77.01 pc/15 min. Therefore, it is possible to conclude that the majority of the misclassifications are in the vicinity of the calculated BP.

CONCLUSIONS
In this work, a method to estimate the state of traffic 15 minutes ahead is proposed. In contrast with the majority of related papers, in which only speed or only flow is predicted, this method involves predicting and further processing both of these quantities. Using these predictions, speed-flow diagrams are generated and then the BP flow is calculated as the threshold separating two traffic states: stable and metastable. As the final step, the predicted samples are labeled with one of these states as the estimation of the traffic state. Although the highest prediction and state estimation performance is obtained via the MLPNN method, the results obtained through the other methods are very close to it. Moreover, the BP flow values calculated using the predicted values are all identical, and they are also equal to the BP flow calculated using the actual data. This indicates that the speed and flow predictions are capable of representing the state transition despite some errors in the predictions.
The samples having lower speed and higher flow values are related to the metastable state, and the proportion of these data is relatively small. Therefore, the patterns in the metastable data cannot be learned well. Consequently, the error in the predictions for samples of this state is high. To eliminate the imbalance in the data, increasing the number of samples belonging to the metastable state may be considered as future work.