Developing a Travel Time Estimation Method of Freeway Based on Floating Car Using Random Forests

Travel time of traffic flow is the basis of traffic guidance. To improve the estimation accuracy, a travel time estimation model based on Random Forests is proposed. Seven influence variables are considered as candidates in this paper. Data obtained from VISSIM simulation are used to verify the model. Unlike other machine learning algorithms, which act as black boxes, Random Forests can provide interpretable results through variable importance. The variable importance results show that the mean travel time of floating car $t_f$, the traffic state parameter $X$, the density of vehicle $K_{all}$, and the median travel time of floating car $t_{menf}$ are important variables affecting travel time of traffic flow; the other variables also have a certain influence on travel time. Compared with the BP (Back Propagation) neural network model and the quadratic polynomial regression model, the proposed Random Forests model is more accurate and contains more variables.


Introduction
Along with economic and population growth, the number of cars has increased dramatically, causing a series of problems such as traffic congestion, traffic accidents, and environmental pollution [1][2][3]. To tackle these issues, Intelligent Transportation Systems (ITS) are applied to the road system. Through the harmonious and close cooperation of people, vehicles, and roads, ITS can improve the efficiency of transportation, ease traffic congestion, improve road network capacity, reduce traffic accidents, lower energy consumption, and decrease environmental pollution. Travel time is the most intuitive index of running conditions and an important foundation for constructing ITS [4]. With accurate travel time information, traffic departments can improve traffic management decisions, and travelers can make better travel choices [5]. Therefore, the accurate travel time of traffic flow receives increasing attention from travelers, traffic managers, and scholars. Because travel time is the basis of ITS, researchers have conducted dedicated studies on it.
Travel time can be obtained directly or indirectly. Direct methods measure travel time using probe vehicles, records at toll stations, tracking of cell phones, and many other technologies [6,7]. Indirect methods infer travel time from the traffic volume, speed, and occupancy measured by point sensors (e.g., loop detectors and video cameras) along the vehicle trajectory [8].
Recently, GPS devices on vehicles and smartphones carried by the occupants of motor vehicles have provided data support for travel time [9]. Therefore, travel time estimation using GPS data has been carried out [10][11][12]. Over the past few decades, a great number of models for travel time estimation have been developed, including models based on mathematical statistics and models based on artificial intelligence technology.
(1) Models based on mathematical statistics. These models provide interpretable parameters and a simple model structure [7,13]. For example, Sun [14] proposed a piecewise truncated quadratic speed trajectory to estimate travel time. The speed value could be selected between the highest and the lowest values according to the running condition of the vehicle. The method was more accurate when the vehicle was in the state of transition or congestion; in the state of free-flow, however, the advantages of the proposed model were not obvious. Taking the number of single lanes, the speed limit, and the instantaneous speed as independent variables, Bobba [15] proposed a multiple linear regression model based on the floating car. The model was applicable during peak and off-peak periods. Yet the road section studied was the section between two signalized intersections (excluding the signalized intersections themselves), which differs from most current research. Choosing different parameters, Faria [16] developed a linear regression model and a multiple linear regression model. The multiple linear regression model was more accurate, but its accuracy was limited to only 60%. Using GPS data with lower frequency, Sanaullah [17] proposed two mathematical models. The two models were based on the number of map-matched points, the connectivity of links, and the spatial and temporal travel time components of the link, respectively. The experimental results indicated that vehicle penetration rates, data sampling frequencies, vehicle coverage on the links, and time window lengths all influence the accuracy of link travel time estimation. Zhan [18] used GPS-OD data from New York taxis to estimate the travel time of road network segments. The impact of the single lane of the road on the driving vehicle was taken into account. However, a wider road section may have more lanes, so a single lane may not capture the fineness of the road network. Using the same data, Zhan [19] presented a Bayesian model to estimate short-term travel time. However, several assumptions were made to reduce the modeling complexity. Because they are easy to implement and require little computation, models based on mathematical statistics are widely used. However, their accuracy is generally low.
(2) Models based on artificial intelligence technology. These models do not assume any particular model structure for the data but treat it as unknown. Successful models include fuzzy reasoning [20], machine learning [21], and hybrid models [22,23]. For example, Zheng [24] presented a three-layer Artificial Neural Network (ANN) model to estimate link travel time. In the proposed model, the positions, link IDs, timestamps, and speeds of individual probe vehicles were used as input information. Compared with Hellinga's model, the ANN model performed quite well under different traffic conditions. However, the ANN model estimated travel time based on only one car with GPS. Using sparse and large-scale GPS trajectories, Tang [25] presented a tensor-based context-aware approach to estimate personalized travel time. The model comprised map matching, travel time tensor construction, context-aware feature extraction, and travel time tensor factorization. It considered the spatial correlation between different road segments, the deviation between different drivers, the fine-grained temporal correlation between different time slots, and the coarse-grained temporal correlation between recent and historical traffic conditions. Reddy [26] proposed a bus travel time prediction model based on SVM. The model used v-support vector regression with a linear kernel function and was validated with data collected by public buses equipped with a GPS system. The results showed that the accuracy of the model was significantly improved under conditions of high variance. Although these models need large amounts of computation, their high accuracy has driven scholars to shift their research focus to artificial intelligence methods.
In summary, a wide range of models has been developed for travel time estimation. Although these models have their own advantages, the number of independent variables selected is limited, and the influence of traffic flow parameters on travel time has not been thoroughly considered.
In recent years, data mining and machine learning have gradually come into view. The development of traffic information acquisition technology (such as GPS trajectory data) has provided a large amount of traffic data, which offers an opportunity to develop more accurate travel time estimation based on data mining. Compared with traditional parametric models, data mining algorithms can deeply explore the implicit relationships between variables. In view of this, the paper introduces a data mining technique called Random Forests for travel time estimation. The influence of variables on travel time can be deeply explored through Random Forests.
The rest of the paper is structured as follows. The next section gives the methodology of Random Forests used to build a travel time estimation model, followed by Section 3, which describes the data used in this paper. Results and discussions are presented in Section 4. Finally, the conclusions are outlined in Section 5.

Methodology of Random Forests
Random Forests is an ensemble learning algorithm based on decision trees, proposed by Breiman in 2001 [27]. It is a high-precision machine learning algorithm that can overcome the shortcomings of a single prediction or classification model.

2.1. Theory. Random Forests is a combination model consisting of a set of regression decision trees. Equation (1) shows the definition of Random Forests [28]:

$$\{h(x, \theta_t),\ t = 1, 2, \ldots, T\} \tag{1}$$

where $h(x, \theta_t)$ is a tree-structured classifier and $\{\theta_t\}$ are independent identically distributed random vectors; $x$ is the independent variable, $\theta_t$ is the independent identically distributed random variable, and $T$ represents the number of decision trees.

Following the idea of ensemble learning, the average of the outputs of the decision trees is taken as the regression prediction result, as shown in (2):

$$\bar{h}(x) = \frac{1}{T} \sum_{t=1}^{T} h(x, \theta_t) \tag{2}$$

where $h(x, \theta_t)$ is the output based on $x$ and $\theta_t$.
To overcome the low accuracy and the tendency to overfit of a single decision tree model, the ideas of bagging and stochastic subspace were introduced in Random Forests [28,29].

(1) Bagging. Bagging is a Bootstrap sampling technique proposed in 1996 [28]. Assume that $S$ is the original sample and $N$ is the number of samples in $S$. The probability that each sample in $S$ is not extracted is $(1 - 1/N)^N$.

(2) Stochastic Subspace. In the process of constructing the regression decision tree, each split node randomly extracts a feature subspace from the total feature space as the candidate feature set of the node and selects the optimal feature for splitting. This ensures that the feature subsets differ among trees, preserves the independence and diversity of the trees, and further increases the randomness in node splitting of Random Forests. Determining the stochastic subspace means choosing the number of explanatory variables to be checked for the splitting process.
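As a small numerical illustration of the Bagging step, the sketch below assumes the usual bootstrap setting in which N samples are drawn with replacement from N originals, so that a given sample is missed with probability (1 - 1/N)^N, which approaches 1/e (about 0.368) as N grows. The function names `prob_not_drawn` and `oob_fraction` are hypothetical and used only for this illustration.

```python
import math
import random


def prob_not_drawn(n: int) -> float:
    """Probability that one given sample is never picked in n draws with replacement."""
    return (1.0 - 1.0 / n) ** n


def oob_fraction(n: int, seed: int = 0) -> float:
    """Empirical share of samples left out of a single bootstrap sample of size n."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}
    return 1.0 - len(drawn) / n


if __name__ == "__main__":
    for n in (10, 100, 10_000):
        print(n, round(prob_not_drawn(n), 4), round(oob_fraction(n), 4))
    print("limit 1/e:", round(math.exp(-1), 4))
```

The left-out samples are what is commonly called out-of-bag (OOB) data, which is the basis of the OOB error used later in the paper.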
In Random Forests, the final predictive performance of the model is determined by the number of trees in the forest (T) and the number of explanatory variables to be checked for the splitting process (m).
Figure 1 is the flowchart of the classifier and the flow of training and testing phases, which shows the process of establishing a Random Forests.
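The process of establishing a Random Forests can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not the SPM implementation used in the paper: each "tree" is a one-split regression stump, so drawing the random feature subset once per tree is equivalent to drawing it at the (single) split node. The names `fit_stump`, `fit_forest`, and `predict_forest`, and the synthetic data, are assumptions made for this example.

```python
import random


def avg(values):
    return sum(values) / len(values)


def fit_stump(X, y, features):
    """Best single split (feature, threshold) among the candidate features, by squared error."""
    best = None
    for j in features:
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= thr]
            right = [yi for row, yi in zip(X, y) if row[j] > thr]
            ml, mr = avg(left), avg(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, ml, mr)
    if best is None:  # every candidate feature is constant: fall back to the mean
        m = avg(y)
        return (features[0], float("inf"), m, m)
    return best[1:]


def predict_stump(stump, row):
    j, thr, ml, mr = stump
    return ml if row[j] <= thr else mr


def fit_forest(X, y, T=30, m=1, seed=0):
    """T stumps, each trained on a bootstrap sample with m randomly chosen features."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(T):
        idx = [rng.randrange(n) for _ in range(n)]   # bagging: sample with replacement
        feats = rng.sample(range(p), m)              # stochastic subspace
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict_forest(forest, row):
    """Regression output: the average of the individual tree outputs."""
    return avg([predict_stump(s, row) for s in forest])


# Synthetic data: the target depends on feature 0 only.
_rng = random.Random(1)
X = [[_rng.uniform(0, 10), _rng.uniform(0, 10)] for _ in range(120)]
y = [3.0 * a + _rng.gauss(0, 0.5) for a, _ in X]
forest = fit_forest(X, y, T=30, m=1, seed=0)
low = predict_forest(forest, [1.0, 5.0])
high = predict_forest(forest, [9.0, 5.0])
```

Here `T` and `m` play the roles of the number of trees and the number of explanatory variables checked at a split. A practical model would grow full, unpruned trees rather than stumps.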
2.2. Generalization Error. Generalization error reflects the ability of the model to predict data outside the training set and is an important indicator for judging the quality of the model.

Definition 1. It is assumed that the training sets are drawn from independent and identically distributed random vectors $(X, Y)$, and the formed training sets are independent of each other. Then the mean squared error of the output $h(X)$ is $E_{X,Y}(Y - h(X))^2$.

In Random Forests, when there are enough regression decision trees and $h_t(X) = h(X, \theta_t)$, Theorem 2 can be obtained according to the law of large numbers.

Theorem 2. When the number of trees $T \to \infty$, the mean squared generalization error converges to

$$E_{X,Y}\left(Y - \frac{1}{T}\sum_{t=1}^{T} h(X, \theta_t)\right)^2 \to E_{X,Y}\left(Y - E_{\theta}\, h(X, \theta)\right)^2 = PE^* \tag{3}$$

where $\theta_t$ is the random variable of the $t$-th regression decision subtree, $E$ is the mathematical expectation, and $PE^*$ is the generalization error of Random Forests.

Theorem 2 shows that, as the number of regression decision subtrees $T$ increases, Random Forests gradually converges, and the generalization error eventually tends to a limit value. Although Random Forests has been proven mathematically not to be prone to overfitting [27], in practice the parameters of Random Forests are still tuned experimentally to further avoid overfitting.

Data
3.1. Traffic Simulation Software. To collect enough data for training and testing, traffic simulation software is used. Traffic simulation software is widely used in the study of traffic planning and traffic flow. Microscopic traffic simulation software can describe the road network and simulate the traffic flow through different models. Many types of traffic simulation software can be used to collect travel time data [30,31]. PARAMICS supports multiuser parallel computing with a powerful application interface. However, PARAMICS lacks a model of mixed traffic and complex traffic flow.
(4) SimTraffic. SimTraffic was originally developed as transportation software for signal timing optimization and traffic model building. With the development of traffic simulation technology, SimTraffic has gradually developed into mature and fully functional microscopic traffic simulation software. It adds ramps, roundabouts, and highway modeling tools to the original functions. Nevertheless, SimTraffic does not support dedicated lanes or bus and car parking spots.
The descriptions above show that each traffic simulation package has its own advantages and disadvantages. VISSIM, however, can output parameters to separate files, describes traffic with high accuracy, and simulates traffic with diverse characteristics. Therefore, data produced by the VISSIM simulation software are used to verify the proposed travel time estimation model in this paper.
3.2. The Source of Data

3.2.1. Selection of the Simulation Section. The Nanjing Airport freeway between the Airport Interchange and the Lukou Interchange, with a length of 1048.28 m and 4 lanes in one direction, is selected as the research area. Time detectors are set at both ends of the selected freeway section. The route diagram is presented in Figure 2.

3.2.2. Determination of the Simulation Parameters

(1) Simulation Traffic. The VISSIM simulation software is calibrated according to the actual hourly traffic flow on the Nanjing Airport freeway from Nanjing to the airport, surveyed at the airport toll station from 9:00 to 15:00 on August 22, 2017. Since the real traffic flow does not include congestion, in order to cover the states of free-flow, transition, and congestion on the freeway, the traffic flow during 15:00-17:00 was increased by 600 veh/h over the measured value of the previous period to reflect the state of congestion. Increasing the number of vehicles alone does not necessarily result in congestion. However, starting from the state of transition, the authors keep all other variables constant and continue to increase the traffic flow to characterize the state of congestion. The input traffic flow is shown in Table 1.
(2) Vehicle Type. The user-defined taxi type is 1, and the vehicle color is blue; the truck type is 2, and the vehicle color is yellow; the bus type is 3, and the vehicle color is blue; the car type is 4, and the vehicle color is red. The user-defined taxi is chosen as the floating car in this paper.
(3) Speed Distribution. On the freeway, the expected speeds of cars, trucks, and buses are 120, 100, and 100 km/h, respectively. The speed distribution of cars, trucks, buses, and taxis is shown in Figure 3.
(4) Vehicle Proportion. Through investigation, the vehicle proportion on the airport freeway section is car : truck : bus : taxi = 0.42 : 0.12 : 0.26 : 0.2.

3.3. Design of Experimental Scheme. In the experiment, the dynamic changing process of the freeway traffic flow, including the states of free-flow, transition, and congestion, was simulated by changing the input traffic flow.
Using different random seed numbers, the simulation was run 133 times, each with a simulation time of 28800 s. In total, 133 sets of data were obtained, representing 133 days of data for 9:00-17:00.
Travel time of the traffic flow was obtained at a sampling interval of 300 seconds. At the same time, travel time of the floating cars was acquired at a sampling interval of 1 second.
3.4. Variables of the Model

3.4.1. Traffic State Parameter. In the Highway Capacity Manual [32], the traffic state of the freeway is divided into six levels (namely, A to F) according to the average speed and density. Speed, density, and traffic flow are the three basic, interrelated parameters of traffic flow. If the values of two of these parameters are known, the third can be computed. The standard of traffic state classification of a freeway is shown in Table 2.
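The statement that the third basic parameter follows from the other two rests on the fundamental relation of traffic flow, flow = density x speed. The sketch below illustrates it together with the mile-to-kilometre conversion quoted in the note to Table 2 (1 mi = 1.609 km). The function names and the numerical values are hypothetical and are not taken from the paper's tables.

```python
MI_TO_KM = 1.609  # conversion factor quoted in the note to Table 2


def flow_from(density_veh_per_km: float, speed_km_per_h: float) -> float:
    """Traffic flow (veh/h) from density (veh/km) and space-mean speed (km/h)."""
    return density_veh_per_km * speed_km_per_h


def density_from(flow_veh_per_h: float, speed_km_per_h: float) -> float:
    """Density (veh/km) from flow (veh/h) and speed (km/h)."""
    return flow_veh_per_h / speed_km_per_h


def speed_from(flow_veh_per_h: float, density_veh_per_km: float) -> float:
    """Speed (km/h) from flow (veh/h) and density (veh/km)."""
    return flow_veh_per_h / density_veh_per_km


def per_mile_to_per_km(density_per_mi: float) -> float:
    """Convert a density given per mile into a density per kilometre."""
    return density_per_mi / MI_TO_KM


if __name__ == "__main__":
    print(flow_from(20.0, 100.0))       # illustrative values only
    print(per_mile_to_per_km(16.09))
```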
In this paper, the traffic state parameter refers to the standard of traffic state classification of a freeway: let $X = 1, \ldots, 6$ represent the traffic states A to F of the freeway, respectively. The paper combines the existing traffic state levels and describes the freeway at a coarser level, so the traffic state of the freeway is divided into three categories. The state of free-flow includes levels A and B, namely $X = 1, 2$; the state of transition includes levels C and D, namely $X = 3, 4$; the state of congestion includes levels E and F, namely $X = 5, 6$, as presented in Table 3. The traffic state parameter $X$ therefore takes its values in three groups, one for each state.

3.4.2. Travel Time Calculation of the Floating Car. On any freeway section, there are $s$ floating cars within the time interval from $t_i$ to $t_{i+1}$, as shown in Figure 4.
Assume that the travel time of each floating car on the freeway section is $t_g$ ($g = 1, 2, 3, \ldots, s$). Since both the mean value, denoted $t_f$, and the median value, denoted $t_{menf}$, can represent the general level of the whole data, the travel time of the floating car is calculated by the mean value and the median value, respectively.
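The mean and median travel time of the floating cars in each sampling interval can be computed as sketched below. The record layout (detector exit time in seconds, travel time in seconds) and the rule of assigning a vehicle to the interval in which it leaves the section are assumptions made for this illustration; the paper does not specify them. The 300 s interval and the 28800 s simulation time are taken from Section 3.3.

```python
from collections import defaultdict
from statistics import mean, median

INTERVAL_S = 300        # sampling interval of the traffic-flow travel time
SIMULATION_S = 28800    # one simulated day, 9:00-17:00


def floating_car_travel_times(records, interval_s=INTERVAL_S):
    """Group (exit_time_s, travel_time_s) records by interval.

    Returns {interval_index: (mean_travel_time, median_travel_time, n_floating_cars)}.
    """
    groups = defaultdict(list)
    for exit_time, travel_time in records:
        groups[int(exit_time // interval_s)].append(travel_time)
    return {k: (mean(v), median(v), len(v)) for k, v in sorted(groups.items())}


# Hypothetical floating-car records, for illustration only.
records = [(10, 40.0), (120, 42.0), (290, 50.0), (310, 45.0)]
summary = floating_car_travel_times(records)
intervals_per_day = SIMULATION_S // INTERVAL_S
```

With these settings each simulated day yields `intervals_per_day` sampling intervals. The median is less sensitive than the mean to a single unusually slow or fast floating car, which is one reason the two statistics can carry different information.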
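3.4.3. Variables of the Model. In the process of VISSIM simulation, 14 traffic variables can be obtained: number of floating cars, occupancy of floating cars, number of vehicles, occupancy of vehicles, density of floating cars, speed of floating cars, traffic flow of floating cars, density of vehicles $K_{all}$, speed of vehicles, traffic flow of vehicles, ratio of floating cars, travel time of floating cars (mean value $t_f$ and median value $t_{menf}$), and the traffic state parameter $X$.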

Results and Discussions
Using the data obtained from VISSIM and the variables discussed above, the Random Forests model for travel time estimation was established. SPM 8.2 data mining software developed by Salford Systems was used to establish the Random Forests model [33]. Although Random Forests can use the OOB error to evaluate the model, in order to compare it with other models, the 133 sets of data were used to train and validate the different models in two scenarios. In the first scenario, the data of 132 simulations were selected as training data and the data of the 5th simulation were selected as test data, which were used for comparison with the mathematical statistics model. In the second scenario, for comparison with the machine learning model, the data of 133 days were divided into two data sets: days 27-133 were used as training data and days 1-26 were used as test data.
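The two evaluation scenarios can be expressed as simple splits over the day indices 1 to 133. The helper names below are hypothetical; only the day ranges come from the text.

```python
DAYS = list(range(1, 134))  # 133 simulated days


def leave_one_day_out(days, test_day):
    """Scenario 1: one day for testing, all remaining days for training."""
    train = [d for d in days if d != test_day]
    return train, [test_day]


def split_by_range(days, first_test_day, last_test_day):
    """Scenario 2: a contiguous block of days for testing, the rest for training."""
    test = [d for d in days if first_test_day <= d <= last_test_day]
    test_set = set(test)
    train = [d for d in days if d not in test_set]
    return train, test


train_1, test_1 = leave_one_day_out(DAYS, test_day=5)
train_2, test_2 = split_by_range(DAYS, first_test_day=1, last_test_day=26)
```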
Mean Square Error (MSE), Mean Absolute Deviation (MAD), and Relative Error (RE) were selected as evaluation criteria.
where, in (5)-(7), $n$ is the total number of samples, $y_i$ is the real value of travel time, and $\hat{y}_i$ is the estimated value of travel time.
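A minimal sketch of the three evaluation criteria follows. It uses the common textbook definitions and is an assumption about the exact form of (5)-(7); in particular, relative error is taken here as the mean of the absolute error divided by the real value, which may differ from the paper's definition. The sample values are invented for illustration.

```python
def mse(real, estimated):
    """Mean Square Error: average squared difference."""
    return sum((a - b) ** 2 for a, b in zip(real, estimated)) / len(real)


def mad(real, estimated):
    """Mean Absolute Deviation: average absolute difference."""
    return sum(abs(a - b) for a, b in zip(real, estimated)) / len(real)


def relative_error(real, estimated):
    """Mean absolute error relative to the real value (assumed definition)."""
    return sum(abs(a - b) / a for a, b in zip(real, estimated)) / len(real)


# Hypothetical travel times in seconds.
real = [60.0, 80.0, 100.0]
estimated = [63.0, 76.0, 100.0]
print(mse(real, estimated), mad(real, estimated), relative_error(real, estimated))
```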
4.1. Parameter Determination. There are two parameters to be determined in Random Forests, namely, the number of trees in the forest (T) and the number of explanatory variables to be checked for the splitting process (m).

(1) The Number of Trees in the Forest (T).
In Random Forests, decision trees are not pruned. Beyond a certain point, increasing the number of trees does not improve precision but adds computational burden. However, if too few trees are generated, the calculated variable importance may not be accurate enough. The number of trees in the forest was determined by 10-fold cross-validation. Table 4 shows the 10-fold cross-validation errors with different numbers of trees in the forest.
It can be seen from Table 4 that forests with 500 and 600 trees share the same minimum test error. The fewer the trees, the smaller the computational burden; therefore, the number of trees was set to 500 in the Random Forests model.
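The selection rule described above (take the minimum cross-validation error and, in case of a tie, the smaller forest) can be sketched as follows, together with a simple 10-fold index generator. The error values in `cv_error_by_trees` are invented for illustration and are not those of Table 4; the function names are likewise hypothetical.

```python
def k_fold_indices(n, k=10):
    """Yield (train_indices, test_indices) for k contiguous folds over n samples."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


def pick_num_trees(cv_error_by_trees):
    """Smallest number of trees among those reaching the minimum CV error."""
    best = min(cv_error_by_trees.values())
    return min(t for t, err in cv_error_by_trees.items() if err == best)


# Hypothetical cross-validation errors (NOT the values of Table 4).
cv_error_by_trees = {100: 6.9, 300: 6.5, 500: 6.3, 600: 6.3, 800: 6.4}
chosen_T = pick_num_trees(cv_error_by_trees)
folds = list(k_fold_indices(133, k=10))
```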

(2) The Number of Explanatory Variables to Be Checked for the Splitting Process (m).
In Random Forests, only a subset of the independent variables is checked to find the best split, which makes the forest development more efficient. Breiman [27] has shown that randomly selecting a subset of independent variables to find the best split makes the process faster and leads to accurate results. Ghasri [34] adopted the square root of the number of independent variables in each split. Other choices include twice the square root and half the square root of the number of independent variables in each split. When the number of explanatory variables to be checked for the splitting process is determined by the 10-fold cross-validation error, the effect is not obvious. Random Forests can use OOB (Out-of-Bag) error estimates as unbiased estimates of the generalization error, without running a cross-validation procedure, to measure the Random Forests model [28,[34][35][36]. Therefore, the OOB error was used to determine the number of explanatory variables to be checked for the splitting process. In this paper, the OOB errors are obtained for different numbers of explanatory variables to be checked for the splitting process, and the number with the smallest OOB error is chosen. Table 5 shows the OOB errors with different numbers of independent variables in each split.
Finally, the number of explanatory variables to be checked for the splitting process was set to 3 in the Random Forests model.
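The rule-of-thumb candidates mentioned above (half, one, and two times the square root of the number of independent variables) and the choice by smallest OOB error can be sketched as follows. Rounding down and clamping to the range 1..p are assumptions of this example, and the OOB values are invented rather than taken from Table 5.

```python
import math


def candidate_m(p):
    """Candidate numbers of variables per split for p independent variables."""
    root = math.sqrt(p)
    return sorted({max(1, int(root / 2)), max(1, int(root)), min(p, int(2 * root))})


def pick_m(oob_error_by_m):
    """Number of variables per split with the smallest OOB error."""
    return min(oob_error_by_m, key=oob_error_by_m.get)


candidates = candidate_m(14)                 # 14 candidate variables, as in Section 3.4.3
oob_error_by_m = {1: 7.1, 3: 6.3, 7: 6.6}    # hypothetical OOB errors (NOT Table 5)
chosen_m = pick_m(oob_error_by_m)
```

For 14 variables the rounded-down square root is 3, which happens to coincide with the value selected in the paper through the OOB error.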
4.2. Variable Importance. By training the model on the training data, the ranking of variable importance can be obtained. Variable importance explains the influence of the independent variables on the dependent variable. The higher the value of the variable importance, the stronger the influence on the model. Variable importance is shown in Table 6.
It can be seen from Table 6 that the mean travel time of floating car $t_f$, the density of vehicle $K_{all}$, the traffic state parameter $X$, and the median travel time of floating car $t_{menf}$ are the important factors, with importance values much greater than those of the other variables. This indicates that travel time of traffic flow is closely related to the travel time of floating car, the density of vehicle, and the traffic state parameter.
4.3. Filtering Feature Variables. As can be seen from Table 6, in the established Random Forests model, many variables have low importance values, such as the ratio of floating car, indicating that there are some redundant variables in the model and that the variables need to be screened. The literature [37] uses the variable importance obtained by the model to filter the feature variables, as follows:
(1) Create a Random Forests model using a set of feature variables containing $n$ variables and rank the variable importance of the $n$ feature variables in descending order.
(2) Delete the variable with the lowest variable importance among the $n$ feature variables, and get the feature variable set containing $n-1$ variables.
(3) Create a Random Forests model using a set of feature variables containing $n-1$ variables and rank the variable importance of the $n-1$ feature variables in descending order.
(4) Delete the variable with the lowest variable importance among the $n-1$ feature variables, and get the feature variable set containing $n-2$ variables.
(5) Repeat steps (3) and (4) until there is one remaining feature variable.
(6) Random Forests models containing $n, n-1, n-2, \ldots, 1$ variables are thus established. The OOB errors are ranked in order, and the Random Forests model with the smallest OOB error and its feature variable set are selected.
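The filtering procedure can be sketched as a generic backward elimination loop. The callback `fit`, which returns an OOB error and a variable importance dictionary for a given feature set, stands in for training a Random Forests model and is a hypothetical interface, as is the toy scoring function used to exercise the loop.

```python
def backward_elimination(features, fit):
    """Drop the least important variable step by step; keep the set with the lowest OOB error.

    fit(features) must return (oob_error, {feature: importance}).
    """
    current = list(features)
    history = []
    while current:
        oob, importance = fit(current)
        history.append((oob, list(current)))
        if len(current) == 1:
            break
        weakest = min(current, key=lambda f: importance[f])
        current.remove(weakest)
    best_oob, best_set = min(history, key=lambda item: item[0])
    return best_set, best_oob, history


# Toy stand-in for model training: "a" and "b" are useful, "c" and "d" are redundant.
WEIGHTS = {"a": 5.0, "b": 3.0, "c": 0.2, "d": 0.1}


def toy_fit(features):
    missing_useful = sum(w for f, w in WEIGHTS.items() if w >= 1.0 and f not in features)
    oob = missing_useful + 0.1 * len(features)   # each extra variable adds a little noise
    return oob, {f: WEIGHTS[f] for f in features}


best_set, best_oob, history = backward_elimination(["a", "b", "c", "d"], toy_fit)
```

In this toy run the loop fits four models (with 4, 3, 2, and 1 variables) and keeps the two-variable set, mirroring how the procedure in the text keeps the feature set with the smallest OOB error.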
According to the method described above, a set of feature variables and a Random Forests model containing 7 feature variables is obtained. The training result of the model is shown in Figure 5 and the variable importance is shown in Table 7.
As Figure 5 shows, reducing the number of feature variables does not reduce the performance of the model; on the contrary, the OOB error decreases from 6.342 to 5.586. It can be observed from Table 7 that the mean travel time of floating car $t_f$, the traffic state parameter $X$, the density of vehicle $K_{all}$, and the median travel time of floating car $t_{menf}$ are still the most important factors, with importance values much greater than those of the other three variables.
Among the 7 variables, $t_f$ and $t_{menf}$ enter the model at the same time, but the variable importance of $t_f$ is much larger than that of $t_{menf}$. In previous research, the two parameters were generally not included in a model at the same time because of their correlation. However, the mean value and the median value represent the general level of the whole data with different statistical significance. Therefore, the Random Forests model uses both $t_f$ and $t_{menf}$ as variables to take full advantage of the different variables. The density of vehicle $K_{all}$ and the density of floating car have different effects on the Random Forests model. Density is the most important parameter of traffic flow, and it is an evaluation index of traffic demand. The higher the density, the slower the speed and the longer the travel time. Density is therefore an important indicator that affects travel time. The density of vehicle is much more important than the density of floating car, which is easy to understand: the paper uses the travel time of the floating car to calculate the travel time of traffic flow, but the density of vehicle represents the condition of all vehicles and thus reflects the travel time of traffic flow more directly.
The speed of vehicle and the speed of floating car have an influence on the estimated travel time because speed is the most intuitive reflection of travel time. The variable importance value of the speed of floating car is the lowest (0.94), but it is still an important factor affecting travel time of traffic flow. The reason is that the travel time estimation model is based on the travel time of floating car, and the speed of floating car is closely related to the travel time of floating car.
The traffic state parameter $X$ is the second most important influence variable in the Random Forests model. As a newly introduced parameter in this paper, $X$ is an intuitive indicator that directly reflects traffic states.
To sum up, the paper uses the travel time of floating car to reflect the travel time of traffic flow. The travel time of floating car (both $t_f$ and $t_{menf}$) is an indispensable factor in the Random Forests model. Speed and density are the most intuitive reflections of travel time; therefore, the speed and density of both vehicles and floating cars are selected as influence variables of the Random Forests model. The variable $X$ is selected in the Random Forests model because it directly reflects the traffic states.
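For reference, the construction of the traffic state parameter described in Section 3.4.1 (levels A to F mapped to 1 to 6, then grouped into free-flow, transition, and congestion) can be written as a small lookup. The dictionary and function names are hypothetical; the level thresholds of Table 2 are not reproduced here.

```python
LEVEL_TO_X = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6}

STATE_BY_X = {
    1: "free-flow", 2: "free-flow",
    3: "transition", 4: "transition",
    5: "congestion", 6: "congestion",
}


def traffic_state_parameter(level: str) -> int:
    """Numeric traffic state parameter for a level A to F."""
    return LEVEL_TO_X[level.upper()]


def traffic_state(x: int) -> str:
    """Coarse traffic state for a traffic state parameter value 1 to 6."""
    return STATE_BY_X[x]


if __name__ == "__main__":
    for level in "ABCDEF":
        x = traffic_state_parameter(level)
        print(level, x, traffic_state(x))
```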
4.4. Accuracy of the Established Model. To test the accuracy of the model presented in this paper, a quadratic polynomial regression model with different states was established according to the method of [38]. Consistent with the Random Forests model, the data of 132 simulations were selected for quadratic polynomial regression and the data of the 5th simulation were selected as validation data. The quadratic polynomial regression model is provided in (8).

The topology of the BP neural network includes an input layer, a hidden layer, and an output layer, and its operation is divided into forward propagation of information and back propagation of error [40]. A BP neural network model with a three-layer feedforward perceptron structure is used to estimate travel time. Figure 6 shows the network structure of the three-layer BP neural network model designed in this paper.
For a fair comparison with the Random Forests model, the input variables of the BP neural network model are the variables selected in Section 4.3, and the output variable is the estimated travel time. The activation function of the hidden layer is the logistic function. The network structure is 7-6-1; that is, the number of input layer nodes is 7, the number of hidden layer nodes is 6, and the number of output layer nodes is 1. The model was then tested using the data sets of days 1-26. The training and test errors of the different models are shown in Table 9.
Figure 7 shows the travel time obtained by the different models. Figure 8 shows the comparison between the travel time of the 5th day in the test data sets (real travel time) and the travel time obtained with the various models.
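The forward pass of a 7-6-1 network with a logistic hidden layer can be sketched as follows. The linear output node, the random initial weights, and the function names are assumptions of this example; the paper does not state the output activation or the training settings, and training by error back propagation is not shown.

```python
import math
import random


def logistic(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def init_network(n_in: int = 7, n_hidden: int = 6, seed: int = 0) -> dict:
    """Randomly initialised weights for an n_in - n_hidden - 1 network (illustrative only)."""
    rng = random.Random(seed)
    return {
        "W1": [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "W2": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b2": 0.0,
    }


def forward(net: dict, x: list) -> float:
    """Forward propagation: logistic hidden layer, linear output node (assumed)."""
    hidden = [
        logistic(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(net["W1"], net["b1"])
    ]
    return sum(w * h for w, h in zip(net["W2"], hidden)) + net["b2"]


def n_parameters(net: dict) -> int:
    """Total number of weights and biases."""
    return sum(len(row) for row in net["W1"]) + len(net["b1"]) + len(net["W2"]) + 1


net = init_network()
estimate = forward(net, [0.0] * 7)
```

A 7-6-1 structure has 7 x 6 + 6 weights and biases in the hidden layer and 6 + 1 in the output layer, which `n_parameters` adds up.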
Several conclusions can be drawn based on Tables 8-9 and Figures 7-8. Firstly, it can be seen from Table 8 that the accuracy of the Random Forests model is much greater than that of the quadratic polynomial regression model. In addition to the travel time (mean and median) of floating car, the proposed model has selected another five variables, which indicates that Random Forests is not sensitive to the interaction between variables. Therefore, the Random Forests model can include a richer set of influence variables.
Secondly, Table 9 compares the Random Forests model with the BP neural network model; the error of the Random Forests model is generally smaller than that of the BP neural network model in both the training data sets and the test data sets. A possible reason is that, unlike machine learning algorithms that act as black boxes (such as the BP neural network and SVM), Random Forests has data mining capabilities. The relationships between variables can be deeply exploited through Random Forests.
Thirdly, Table 9 reveals that, when traffic flow is operating in the state of free-flow with high speed, the travel time obtained by the two models is close to the real value in both data sets. In the states of transition and congestion with lower speed, however, the error of the proposed model is clearly smaller than that of the BP neural network model in both data sets. This shows that the model proposed in this paper has more advantages in the states of transition and congestion.
Fourthly, Figures 7-8 indicate that the travel time obtained in this paper is consistent with the real travel time, which shows that the proposed Random Forests model is effective. In addition, the paper established a multiple linear regression model for the different states using the 14 variables mentioned in Section 3.4.3. Before establishing a regression model, the multicollinearity between the independent variables is tested. Multicollinearity (collinearity for short), proposed by Freund [41], refers to an exact or high degree of correlation between the variables in a linear regression model, which makes the model difficult to estimate accurately. After eliminating multicollinearity, the variables entering the model are shown in Table 10.
It can be seen from Table 10 that, due to the multicollinearity of the variables, only 4 variables are selected when the regression model is established, and some of the variables that affect travel time, such as speed and density, are ignored. Random Forests, however, is not affected by the multicollinearity of variables. The relationship between travel time and the variables can be deeply explored through Random Forests.

Conclusion
In this paper, Random Forests, a popular machine learning algorithm that can deeply explore the complex relationships between variables, is proposed for travel time estimation. The proposed model is established with 7 variables, namely, the mean travel time of floating car $t_f$, the traffic state parameter $X$, the density of vehicle $K_{all}$, the median travel time of floating car $t_{menf}$, the speed of vehicle, the density of floating car, and the speed of floating car. Using different random seed numbers, the experiment is simulated 133 times with the VISSIM simulation software. The data of 132 simulations are selected as training data, and the data of the 5th simulation are selected as test data, which are used for comparison with the quadratic polynomial regression model. Meanwhile, the data of 133 days are divided into two data sets, in which days 27-133 are used as training data and days 1-26 are used as test data, for comparison with the BP neural network model. The comparison results show that the Random Forests model is more accurate than the quadratic polynomial regression model and the BP neural network model. The Random Forests model also includes more variables. However, the data are obtained by VISSIM, which limits the diversity of the data. In future research, weather, driver characteristics, and other variables that affect travel time will be considered in the Random Forests model.

Figure 1: The process of establishing a Random Forests.

(1) VISSIM. VISSIM is microscopic traffic simulation software developed by PTV of Germany; it is a simulation system based on a traffic behavior model. It uses a discrete, random, microscopic model with a time step of 0.1 s. The longitudinal movement of the vehicle adopts the psychophysical car-following model proposed by Professor Wiedemann, and the lane-changing behavior of the vehicle adopts a rule-based algorithm. With an open COM interface, VISSIM has a good secondary development capability.

(2) CORSIM. CORSIM is developed by the US Federal Highway Administration (FHWA) and consists of two models, FRESIM and NETSIM. FRESIM is mainly used for the simulation of highways and expressways, while NETSIM is used for the simulation of urban road networks. It has lane-change and car-following simulation modules and simulates the state of traffic flow in the road network with a 1 s simulation step. The software has functions such as analog timing, dynamic filter control, and cooperative filtering control. However, CORSIM lacks an allocation algorithm, and it is difficult to evaluate the traffic volume transfer caused by ramp control, accidents, and travel information.

(3) PARAMICS. Developed by British Quadstone, PARAMICS can be applied to traffic simulation at different levels, from a single road network to a large-scale urban road network. PARAMICS supports multiuser parallel computing.

(5) Time Detector. In the freeway section, time detectors are set up to collect the travel time of each individual floating car and the travel time of the traffic flow. The mean and median values of travel time are calculated from the collected travel times of the individual floating cars.

Figure 4: The distribution of floating cars on the freeway.

Figure 5: The result of Random Forests.

Figure 6: The network structure of the three-layer BP neural network model.

Table 1: The input traffic flow.

Table 2: The standard of traffic state classification of a freeway [32].
Note: In order to keep the data neat, the units used in Table 2 are pc/mi/ln and mi/h. Table 2 can be converted to pc/km/ln and km/h using 1 mi = 1.609 km. The numbers in Table 2 are the maximum values of each level.

Table 3: The table of traffic state parameter.
, speed of floating car   , traffic flow of floating car   , density of vehicle   , speed of vehicle   , traffic flow of vehicle   , ratio of floating car   , travel time of floating car (mean value   and median value   ), and traffic state parameter .

Table 4: 10-fold cross-validation errors with a different number of trees in the forest.

Table 5: The OOB errors with a different number of independent variables in each split.

Table 7: Variable importance after filtering feature variables.

In (8), the dependent variable is the travel time of traffic flow, and the independent variables are the mean travel time of floating car in the state of free-flow, the median travel time of floating car in the state of transition, and the mean travel time of floating car in the state of congestion. Equation (8) is a regression model with different states. Although the Random Forests model is not established separately for each state, the introduced traffic state parameter $X$ can distinguish the different traffic states. The errors of the different models are presented in Table 8.

Table 9: The MAD of Random Forests model and BP neural network model.

Table 10: The variables entering the model after eliminating multicollinearity.
Traffic state: variables
free-flow: mean travel time of floating car, traffic state parameter, occupancy of vehicle, number of floating car
transition: median travel time of floating car, traffic state parameter, occupancy of vehicle, number of floating car
congestion: mean travel time of floating car, traffic state parameter, occupancy of vehicle, number of floating car