Modelling on Car-Sharing Serial Prediction Based on Machine Learning and Deep Learning

The car-sharing system is a popular rental model for cars in shared use. It has become particularly attractive due to its flexibility; that is, the car can be rented and returned anywhere within one of the authorized parking slots. The main objective of this research work is to predict the car usage in parking stations and to investigate the factors that help to improve the prediction. Thus, new strategies can be designed so that more cars are on the road and fewer sit in the parking stations. To achieve that, various machine learning models, namely, vector autoregression (VAR), support vector regression (SVR), eXtreme gradient boosting (XGBoost), and k-nearest neighbors (kNN), and deep learning models, specifically long short-term memory (LSTM), gated recurrent unit (GRU), convolutional neural network (CNN), CNN-LSTM, and multilayer perceptron (MLP), were applied to different kinds of features. These features include the past usage levels, Chongqing's environmental conditions, and temporal information. After comparing the obtained results using different metrics, we found that CNN-LSTM outperformed the other methods in predicting future car usage. Meanwhile, the model using all the feature categories yields more precise predictions than any of the models using one feature category at a time.


Introduction
Predicting the future is considered one of the most challenging tasks in applied sciences. Computational and statistical methods are used for deducing dependencies between past and future observed values in order to build effective predictors from historical data. Transport answers people's desire to participate in different activities in different places [1]. Cars have become a part of the mobility ecosystem owing to the flexibility and freedom that they provide [2]. People are more dependent on cars for both intercity and intracity transit, causing traffic congestion and parking difficulties [3]. "Looking for a parking space creates additional delays and impairs local circulation. In central areas of large cities, cruising may account for more than 10% of the local circulation as drivers can spend 20 minutes looking for a parking spot", said Dr. Jean-Paul Rodrigue of the Department of Global Studies and Geography at Hofstra University. Many rental models have emerged to solve these parking problems, one of which is the car-sharing model, which aims to distribute cars within a city for use at a low cost. In this fashion, individuals can exploit all the benefits of a private vehicle without the hassles of lease payments, maintenance, or parking. The program comprises one-way or round-trip rentals, depending on whether the pick-up and the drop-off stations are the same or not [4]. The car-sharing system provides an option to the numerous people who opt not to own a vehicle, and they use this system whenever a private vehicle is needed. This system usually bases its cost on a price per minute that includes variables such as fuel and price per kilometre, as well as the share of fixed costs for the operator like maintenance, rebalancing, insurance, and parking [5].
Besides helping in decreasing the level of congestion and managing the lack of parking lots, car-sharing systems have many other advantages such as the reduction of vehicle ownership that leads to efficient use of road and infrastructure, economical savings for the users, and diminution of air and noise pollution [5].
However, this program is facing many issues [6], one of which is the unsuitable distribution of vehicles within car-sharing systems. As a result, cars tend to be available in excess in low-demand parking lots whereas an insufficient number of vehicles are available in high-demand parking lots [6]. For car-sharing companies, this problem causes a major financial loss. To improve the car usage rate, the companies employ a variety of techniques that hold great promise in car-sharing predictions.
Over the last few years, machine learning and deep learning have proved their efficiency and gained recognition in different fields. Machine learning approaches make use of learning algorithms that make inferences from data to learn new tasks [7] and are widely adopted in a number of massive and complex data-intensive fields such as medicine, astronomy, and biology [8][9][10][11]. Deep learning models yield good results in the fields of computer vision and natural language processing [12][13][14][15], where they can automatically extract multidimensional features and effectively extract the data patterns for classification or regression [16].
In our work, a multivariate time series approach is presented, and it aims to predict the car usage in the short term and to investigate the factors that help to improve its prediction accuracy. Multiple machine learning and deep learning models have already fulfilled their promises for multivariate time series prediction and have also proved their ability in extracting meaningful understandings that are hard for humans to analyze and infer [5]. Those models were applied with different feature sets including the past usage levels, Chongqing's environmental conditions, and the temporal information. The rest of the paper is organized as follows. Section 2 presents a literature review of current studies on time series models. Section 3 gives a description of the studied problem. A time series analysis is presented in Section 4. Section 5 demonstrates the framework of our approach. Section 6 describes the experimental framework used to evaluate the performance of the models used for the multivariate time series approach. Finally, Section 7 concludes the paper and outlines future work directions.

Literature Review
Car-sharing has become one of the most popular research subjects in transportation. Many studies have been conducted, but to the best of our knowledge, no work in the scientific literature compares different machine learning and deep learning models in predicting the future usage of the car-sharing system and in investigating the factors that help improve its prediction accuracy. Interesting related works are instead the following. Studies related to this topic can be categorized into numerous subgroups, including [5,17]: (i) user characteristics: investigating the ways the users interact with the service [18,19]; (ii) characterizing the shaping service in charge of the provision and distribution of the cars around the city [20,21]; (iii) car demand level prediction [22,23].
Car demand level prediction for car-sharing systems can be formulated as a time series prediction problem. Time series prediction uses many approaches, such as the autoregressive integrated moving average (ARIMA) model, which focuses on extracting the temporal variation patterns of the traffic flow and uses them for prediction; the support vector regression (SVR) model, which captures complex nonlinearities ([20] demonstrated that this approach generally performs better on traffic flow time series); and eXtreme gradient boosting (XGBoost) ([21] showed that this model improves the prediction's precision and efficiency). Before starting the calculation, XGBoost sorts the traffic data according to the feature values and also realizes parallel computing on feature enumeration.
Recently, machine learning methods have been challenged by deep learning methods on traffic prediction. Deep learning approaches have a strong ability to express multimodal patterns in data, in order to reduce the overfitting problem and to obtain high prediction accuracy. In addition, as a traffic flow process is complicated in nature, deep learning algorithms can represent traffic features without prior knowledge, which has good performance for traffic flow prediction.
Xu and Lim [22] used an evolutionary neural network to prove the effectiveness of this algorithm and its possible usage as a tool for forecasting the net flow of a car-sharing system in order to offer the vehicle in the shortest time possible with the best accuracy; [23] attempted to use the deep belief network (DBN) to define a deep architecture for traffic flow prediction that learns features with limited prior knowledge. The abovementioned models require the input length to be predefined and static, and they cannot automatically determine the optimal time lags. To remedy these problems, many works have been done: [24] used a model called long short-term memory recurrent neural network (LSTM RNN) that captures the nonlinearity and randomness of traffic flow more effectively and automatically determines the optimal time lags; [25] presented a novel long short-term memory neural network to predict travel speed using microwave detector data, where the future traffic condition is commonly relevant to previous events with long time spans; Mo et al. [26] predicted the future trajectory of a surrounding vehicle in congested traffic by using the CNN-LSTM. To the best of our knowledge, no work is found in the literature on car-sharing time series prediction using CNN-LSTM.
Regarding the investigation of factors improving the prediction, [6] conducted the study about the effect of seasonal factors on the bookings of cars in Montreal, and after analysing the results, it was concluded that the usage outcomes scored better in summer season.
With respect to the above works, our approach presents the following highlights: (i) comparison of various machine learning and deep learning models to predict the future number of bookings made by car-sharing users, using different metrics; (ii) investigation of the factors that help predict the car-sharing usage by estimating the relationship between data features and model performances.

Problem Description
The aim of this study is to predict the number of vehicles that are going to be used in the parking stations at a given moment and to investigate the factors that improve the accuracy of predictions. The number of vehicles used at a given time t_x is likely to be correlated with a set of features [6], which are as follows: (i) the past usage (F_x^usage): the usage history is tracked to build prediction models. It comprises the number of car-sharing transactions based on the data from a car-sharing operator located in Chongqing, China. (ii) Temporal information (F_x^time): the time at which the past usages have been acquired. Since the car demands may vary over time, we partition the time period into segments to capture different temporal trends (e.g., holidays/working days, 1 h timeslots) [6]. (iii) The environmental conditions (F_x^weather) at that time: the user transportation habits are usually affected by the weather conditions. Table 1 summarizes the description of the feature categories of the car rental time series.

Car Rental Prediction Problem.
Hereafter, we formulate the multivariate regression problem addressed in this paper [6]. It consists of predicting the car usage based on the values of features belonging to categories F_x^usage, F_x^time, and F_x^weather. Let T be the historical time period considered for training, t_1, ..., t_{k−1} be the past time points in T, and t_k be the current sampling time. We denote by usage(t_j), 1 ≤ j ≤ k, the usage level at time t_j. We use 1 h timeslots as the prediction horizon [6].
Since the future car usage is related to multiple features, the multivariate regressor R is expressed as follows:

usage(t_{x+1}) = R(F_x^usage, F_x^time, F_x^weather),

where F_x^usage is the usage levels of cars, F_x^time is the temporal information at t_x, and F_x^weather is the weather conditions in the area at time t_x [6].

Factors Investigation Problem.
Another objective of this research is to determine the factors that help to predict vehicle usage. The features considered in this study are classified into different categories, namely, the past usage, temporal information, and the environmental conditions. We studied numerous machine learning and deep learning models, merging the different feature categories or using them separately one by one, aiming to find the features that improve the prediction accuracy of the models.

Time Series Analysis
A time series is a sequential collection of recorded observations in consecutive time periods, and it can be univariate or multivariate [27,28]. We may perform time series analysis with the aim of either predicting future values or understanding the processes driving them [29].
To address the problems stated in the previous section, multiple machine learning and deep learning models were performed.

Vector Autoregression (VAR).
Vector autoregression is a forecasting algorithm used when two or more time series influence each other. It is considered an autoregressive model because the predictors for each variable include not only its own past lags but also the past lags of the other variables [30]. Suppose we measure three different time series variables, denoted by x_{t,1}, x_{t,2}, x_{t,3}. The vector autoregression model of order 1, denoted VAR(1), is as follows [30]:

x_t = a + Φ_1 x_{t−1} + w_t,

where the variable a is a k-vector of constants serving as the intercept of the model, Φ_1 is a time-invariant (k × k) matrix, and w_t is a k-vector of error terms.
Each variable is a linear function of the lag 1 values for all variables in the set. In general, for a VAR(p) model, the first p lags of each variable in the system would be used as regression predictors for each variable [31,32].
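The one-step VAR(1) forecast described above can be sketched in a few lines of numpy. The intercept vector and coefficient matrix below are illustrative made-up values, not coefficients fitted to the paper's data.

```python
import numpy as np

# Hypothetical VAR(1) parameters for three series (illustrative values only).
a = np.array([0.1, 0.2, 0.0])             # intercept vector a
Phi = np.array([[0.5, 0.1, 0.0],          # lag-1 coefficient matrix Phi_1
                [0.0, 0.4, 0.2],
                [0.1, 0.0, 0.3]])

def var1_forecast(x_prev):
    """One-step VAR(1) forecast: x_t = a + Phi_1 @ x_{t-1} (error term set to zero)."""
    return a + Phi @ x_prev

x_prev = np.array([1.0, 2.0, 3.0])        # previous observation of the three series
print(var1_forecast(x_prev))              # each output mixes lags of all three series
```

Note how each forecast component is a linear function of the lag-1 values of all three variables, which is exactly what distinguishes VAR from a univariate autoregression.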

eXtreme Gradient Boosting (XGBoost).
XGBoost is an efficient and scalable implementation of the gradient boosting framework by Friedman et al. [33] and Friedman et al. [34]. The package includes an efficient linear model solver and tree learning algorithm [35]. XGBoost fits the new model to the residuals of the previous prediction and then minimizes the loss while adding the latest prediction [36]. What makes it unique is that it uses "a more regularized model formalization to control overfitting, which gives it better performance" (Tianqi Chen). XGBoost is used for supervised learning problems, where we use the training data x_i to predict a target variable y_i. After choosing the target variable y_i, we need to define the objective function that measures how well the model fits the training data; it consists of two parts, a training loss and a regularization term, as follows:

obj(θ) = L(θ) + Ω(θ),

where θ denotes the parameters that we need to learn from data, L is the training loss function, and Ω is the regularization term. A common choice of L is the mean-squared error, which is given by

L(θ) = Σ_i (y_i − ŷ_i)².

The regularization term controls the complexity of the model, which helps us to avoid overfitting [27].
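The two-part objective above can be sketched directly in numpy. This is a simplified illustration of the loss-plus-regularization structure, not XGBoost's full objective (which also includes per-leaf penalty terms and second-order approximations).

```python
import numpy as np

def objective(y_true, y_pred, weights, lam=1.0):
    """Simplified sketch of obj(theta) = L(theta) + Omega(theta):
    mean-squared training loss plus an L2 penalty on model parameters."""
    training_loss = np.mean((y_true - y_pred) ** 2)   # L(theta): how well we fit the data
    regularization = lam * np.sum(weights ** 2)       # Omega(theta): penalizes complexity
    return training_loss + regularization

y_true = np.array([3.0, 1.0])
y_pred = np.array([2.0, 1.0])
w = np.array([0.5])                                   # illustrative model parameters
print(objective(y_true, y_pred, w))
```

Increasing `lam` trades training fit for a simpler model, which is how the regularization term combats overfitting.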

Support Vector Regression (SVR).
The foundations of support vector machines (SVM) were laid by Vapnik and Chervonenkis, and the methodology has been gaining in popularity. The variants of SVM that deal with classification problems are called support vector classification (SVC), and those that deal with modelling and prediction are called support vector regression (SVR) [28].
Most real-world problems cannot be modelled using linear forms [31]. The SVR methodology handles such nonlinear real-world problems through kernel functions. Common kernels used in SVR modelling include the linear, polynomial, radial basis function (RBF), and sigmoid kernels [25].
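A minimal SVR sketch with the RBF kernel (the kernel later used in our experiments) is shown below, using scikit-learn on a synthetic lagged series; the hyperparameter values are illustrative, not the tuned ones.

```python
import numpy as np
from sklearn.svm import SVR

# Toy periodic series turned into (previous value -> next value) pairs.
rng = np.random.default_rng(0)
t = np.arange(200)
series = np.sin(2 * np.pi * t / 24) + 0.1 * rng.standard_normal(200)
X = series[:-1].reshape(-1, 1)   # single lag feature
y = series[1:]                   # one-step-ahead target

model = SVR(kernel="rbf", C=10.0, gamma="scale")  # RBF kernel captures the nonlinearity
model.fit(X, y)
pred = model.predict(X[-5:])     # forecasts for the last five lag values
print(pred.shape)
```

In practice the `C` (cost) and `gamma` parameters would be tuned with a grid search, as described in the experimental section.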

K-Nearest Neighbors (kNN).
K-nearest neighbors (kNN) is an efficient and intuitive method that has been used extensively for classification in pattern recognition [32]. It is a distance-based classifier, which implies that it implicitly presumes that the smaller the distance between two points, the more similar they are [37]. The kNN classification algorithm is by far more popular than kNN regression [37].
In the kNN regression model, the information derived from the observed data is applied to forecast the value of the predicted variable in real time [38]. In other words, it estimates the response of a testing point X_t as an average of the responses of the k closest training points, X_(1), X_(2), ..., X_(k), in the neighborhood of X_t [32]. Let X = {X_1, X_2, ..., X_M} be a training data set consisting of M training points, each of which possesses N features [32]. The Euclidean distance is used to calculate how close each training point X_i is to the testing point X_t:

d(X_t, X_i) = sqrt( Σ_{n=1}^{N} (x_{t,n} − x_{i,n})² ),

where N is the number of features, x_{t,n} is the nth feature value of the testing point X_t, and x_{i,n} is the nth feature value of the training point X_i. Other distance methods include the Manhattan, Minkowski, and Hamming distances.
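The distance computation and average-of-neighbors estimate above fit in a short numpy sketch; the training points are made up for illustration.

```python
import numpy as np

def knn_regress(X_train, y_train, x_test, k=3):
    """Estimate the response at x_test as the mean response of the
    k training points closest to it in Euclidean distance."""
    dists = np.sqrt(np.sum((X_train - x_test) ** 2, axis=1))  # Euclidean distances
    nearest = np.argsort(dists)[:k]                           # indices of the k closest points
    return y_train[nearest].mean()                            # average their responses

X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([0.0, 1.0, 2.0, 10.0])
print(knn_regress(X_train, y_train, np.array([1.1]), k=3))
```

The distant outlier at 10.0 is excluded by the neighborhood, illustrating kNN's implicit assumption that closer points are more similar.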

Long Short-Term Memory (LSTM).
Long short-term memory neural network (LSTM NN) was initially introduced by Hochreiter and Schmidhuber (1997) [21]. The primary objective of LSTM NN is to overcome the vanishing gradients problem of the standard recurrent neural network (RNN) when dealing with long-term dependencies [39]. Its features are especially desirable for traffic prediction in the transportation domain [40]. Figure 1 shows the architecture of the long short-term memory cell. The core concept of LSTM is a network composed of recurrently connected memory blocks, each of which contains one or more memory cells, along with three multiplicative "gate" units. All of those gates use the current input x_t, the state h_{t−1} generated by the previous step, and the current state of the cell c_{t−1} to decide whether to take the inputs, forget the memory stored before, and output the state generated later, as the following equations demonstrate [39]:

i_t = σ(W_i x_t + U_i h_{t−1} + b_i),
f_t = σ(W_f x_t + U_f h_{t−1} + b_f),
o_t = σ(W_o x_t + U_o h_{t−1} + b_o),
c_t = f_t ∘ c_{t−1} + i_t ∘ tanh(W_c x_t + U_c h_{t−1} + b_c),
h_t = o_t ∘ tanh(c_t).

The network controls the flow of information through its sigmoid layers, which output numbers between zero and one (σ(t) = 1/(1 + e^{−t})).

Gated Recurrent Unit (GRU).
The gated recurrent unit (GRU) architecture contains two gates: an update gate z_t, which decides how much the unit updates its activation or content, and a reset gate r_t, which allows the unit to forget the previously computed state [41]. The model is defined by the following:

z_t = σ(W_z x_t + U_z h_{t−1} + b_z),
r_t = σ(W_r x_t + U_r h_{t−1} + b_r),
H_t = tanh(W_H x_t + U_H (r_t ∘ h_{t−1}) + b_H),
h_t = (1 − z_t) ∘ h_{t−1} + z_t ∘ H_t,

where h_t represents the output state vector at time t, H_t is the candidate state obtained with a hyperbolic tangent, x_t represents the input vector at time t, and the parameters of the model are W_z, W_r, W_H (the feed-forward connections), U_z, U_r, U_H (the recurrent weights), and the bias vectors b_z, b_r, b_H [42].
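A single GRU step following these equations can be written directly in numpy; the dimensions and random weights below are purely illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, W_z, U_z, b_z, W_r, U_r, b_r, W_H, U_H, b_H):
    """One GRU step: update gate z, reset gate r, candidate state H, new state h."""
    z = sigmoid(W_z @ x + U_z @ h + b_z)          # update gate
    r = sigmoid(W_r @ x + U_r @ h + b_r)          # reset gate
    H = np.tanh(W_H @ x + U_H @ (r * h) + b_H)    # candidate state
    return (1.0 - z) * h + z * H                  # interpolate old state and candidate

# Tiny illustrative example: 2 inputs, 2 hidden units, random weights.
rng = np.random.default_rng(1)
params = dict(W_z=rng.standard_normal((2, 2)), U_z=rng.standard_normal((2, 2)), b_z=np.zeros(2),
              W_r=rng.standard_normal((2, 2)), U_r=rng.standard_normal((2, 2)), b_r=np.zeros(2),
              W_H=rng.standard_normal((2, 2)), U_H=rng.standard_normal((2, 2)), b_H=np.zeros(2))
h_new = gru_cell(np.array([0.5, -0.2]), np.zeros(2), **params)
print(h_new)
```

The final line shows why GRU resists vanishing gradients: the new state is a gated interpolation between the old state and the candidate, rather than a full rewrite at every step.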

Convolutional Neural Network (CNN).
Convolutional neural networks (CNNs) are analogous to traditional artificial neural networks (ANNs) in that they are comprised of neurons that self-optimize through learning [43]. They were initially developed for computer vision tasks; nevertheless, there have been a few recent studies applying them to time series forecasting tasks. CNNs comprise three types of layers: convolutional layers, pooling layers, and fully connected layers, as shown in Figure 2.
The convolutional layer determines the output of neurons connected to local regions of the input through the calculation of the scalar product between their weights and the region connected to the input volume [43].
There are two important techniques used in the convolutional layers to accelerate the training process: local connectivity and weight sharing. The two techniques are implemented using a filter with a specific kernel size, which defines the number of nodes that share weights. Their usage significantly decreases the number of learned and stored weights and allows the network to grow deeper with fewer parameters. The pooling layer is usually incorporated between two successive convolutional layers [44]. The main idea of pooling is to reduce the complexity for further layers by down-sampling [45]. Max-pooling is one of the most common types of pooling methods as it performs better [44]. It consists in partitioning the input into subregion rectangles and returning only the maximum value of each subregion [45].
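Weight sharing and max-pooling can be illustrated with a minimal 1D convolution sketch in numpy (the input and the 2-tap filter are made-up examples):

```python
import numpy as np

def conv1d(x, kernel):
    """Valid 1D convolution (cross-correlation): the SAME kernel slides over the
    whole input, which is exactly the weight-sharing idea."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

def max_pool(x, size=2):
    """Non-overlapping max-pooling: keep only the maximum of each subregion."""
    return np.array([x[i:i + size].max() for i in range(0, len(x) - size + 1, size)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
feat = conv1d(x, np.array([1.0, -1.0]))   # one shared 2-tap filter, 2 weights total
print(feat)
print(max_pool(feat, 2))                  # down-sampled feature map
```

Note that the filter contributes only two learned weights regardless of the input length, which is why local connectivity and weight sharing let CNNs grow deeper with fewer parameters.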
The fully connected layers are simply feed-forward neural networks. They form the last few layers in the network [46]. The input to the fully connected layer is the output from the final pooling or convolutional layer, which is flattened and then fed into the fully connected layer in order to perform the same duties found in standard ANNs [43,46].

CNN-LSTM Model.
CNN-LSTM is a hybrid model built by combining CNN with LSTM to improve forecasting accuracy [47]. Figure 3 shows the architecture of the CNN-LSTM model. The model comprises two main components: the first component consists of convolutional and pooling layers in which complicated mathematical operations are performed to filter the input data and extract useful information. More specifically, the convolutional layers apply the convolution operation between the raw input data and the convolution kernels, producing new feature values [48]. The convolution kernel can be considered a window that contains coefficient values in matrix form. This window slides all over the input matrix, applying the convolution operation on each subregion of it. The result of all these operations is a convolved matrix that represents a feature value. The convolutional layers are usually followed by a nonlinear activation function and then a pooling layer. A pooling layer is a subsampling technique that extracts certain values from the convolved features and produces new matrices (i.e., summarized versions of the convolved features produced by the convolutional layer). The second component exploits the generated features using LSTM, which possesses the ability to learn long-term and short-term dependencies through the utilization of feedback connections, and dense layers [48].
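The two-component architecture can be sketched in Keras as follows. The window length, feature count, and layer sizes here are illustrative assumptions, not the exact configuration used in our experiments.

```python
# Minimal CNN-LSTM sketch: convolution + pooling extract local features,
# then an LSTM models the temporal dependencies of those features.
import numpy as np
from tensorflow.keras import layers, models

timesteps, n_features = 24, 10   # assumed: 24 hourly steps, 10 input features

model = models.Sequential([
    layers.Input(shape=(timesteps, n_features)),
    layers.Conv1D(64, kernel_size=2, activation="relu"),  # first component: convolution
    layers.MaxPooling1D(pool_size=2),                     # first component: pooling
    layers.LSTM(50),                                      # second component: LSTM
    layers.Dense(1),                                      # predicted usage level
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```

The convolutional front end shortens and denoises the sequence before the LSTM sees it, which is the intuition behind combining the two models.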

Multilayer Perceptron (MLP).
Multilayer perceptrons (MLPs) are deep artificial neural networks and are often applied to supervised learning problems [49]. As we can see from Figure 4, a multilayer perceptron consists of three types of layers: an input layer to receive the signal, an output layer that performs the required tasks to make a decision or prediction about the input, and an arbitrary number of hidden layers that are the true computational engine [49,50]. In MLP, the data flow in the forward direction from the input to the output layer, and the neurons are trained with the backpropagation learning algorithm on a set of input-output pairs. The training involves adjusting the parameters, or the weights and biases, of the model in order to minimize errors [49].

Vanishing Gradients Problem.
Artificial neural networks often experience training problems due to vanishing and exploding gradients. The training problem is amplified exponentially in deep learning due to its complex artificial neural network architectures [51]. The vanishing gradient is one example of unstable behavior that may be encountered during training with gradient-based methods (e.g., backpropagation) [52]. The neural network's weights receive an update proportional to the partial derivative of the error function with respect to the current weight in each iteration of training. In some cases, the gradient tends to get smaller as we move backward through the hidden layers. This means that neurons in the earlier layers learn much more slowly than neurons in later layers, preventing the weights from changing their values [52].
Several approaches exist to reduce this effect in practice, for example, through careful initialization, hidden layer supervision, and batch normalization [53]. In our work, batch normalization has been used, as it was effective in augmenting the performance of the deep neural network. Figure 5 shows the process of car-sharing usage prediction and factors investigation approach based on machine and deep learning models.

Collecting Chongqing's Car-Sharing Operator Data.
Chongqing's car-sharing operator data set contains more than 1M records of car-sharing usage over 860 parking lots, from January 1st, 2017, 00:00:00 to January 31st, 2019, 23:00:00. The initial records were obtained at different time intervals, and for study purposes, the data are aggregated by hours, days, and weeks for the whole network.

Data Preprocessing.
Data preprocessing is performed on the data set to improve the performance [55].

Processing Missing Values.
Some values were missing in the data set from Chongqing's car-sharing operator. Due to their numerical meaning, we replaced each missing value by the mean of the previous and next hour's number of car-sharing usages. This method yields better results compared with the removal of rows. The detailed calculation is as follows:

C^i_{j,k} = (C^i_{j,k−1} + C^i_{j,k+1}) / 2,

where C^i_{j,k} represents station i's missing value on the kth hour of the jth day of the year. After handling the missing values of Chongqing's car-rental operator data set, we merged it with Chongqing's weather conditions data set based upon dates and times.
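The fill rule above can be sketched as follows for a single station's hourly series (the values are made up for illustration):

```python
import numpy as np

def fill_missing(values):
    """Replace each isolated missing hourly value with the mean of the previous
    and next hour, mirroring C^i_{j,k} = (C^i_{j,k-1} + C^i_{j,k+1}) / 2."""
    values = values.copy()
    for k in range(1, len(values) - 1):
        if np.isnan(values[k]) and not np.isnan(values[k - 1]) and not np.isnan(values[k + 1]):
            values[k] = (values[k - 1] + values[k + 1]) / 2.0
    return values

usage = np.array([4.0, np.nan, 8.0, 5.0])   # hourly usage with one missing hour
print(fill_missing(usage))
```

Unlike dropping the row, this keeps the hourly grid intact, which matters when the series is later windowed for the sequence models.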

Encoding the Categorical Data.
Since the final data set, combining Chongqing's car-sharing operator and weather data, contains some categorical data such as weather condition and season, we converted the categorical data to numerical data using the one-hot encoding method. It consists of representing each categorical variable with a binary vector that has one element for each unique label, marking the class label with a 1 and all other elements with 0 [56].
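A minimal one-hot encoding sketch with pandas is shown below; the column names are illustrative, and the real data set's schema may differ.

```python
import pandas as pd

# Illustrative frame with one categorical column and one numerical column.
df = pd.DataFrame({"season": ["winter", "summer", "winter"],
                   "usage": [12, 30, 8]})

# One binary column per unique label: the class label becomes 1, all others 0.
encoded = pd.get_dummies(df, columns=["season"])
print(encoded.columns.tolist())
```

The resulting binary columns can be fed directly to the regression models, which expect numerical inputs.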

Clustering Car-Sharing Parking Stations.
To identify and understand the car-rental behaviors across stations and reveal the relationships between the time of day and usage [57], we organized the parking stations with similar patterns into five distinct classes as follows: (i) Class A: daily rented cars; (ii) Class B: frequently used cars; (iii) Class C: sometimes used cars; (iv) Class D: occasionally used cars [58].
First, the usage frequencies of the parking stations were put in order, and then, the range was calculated as follows [59]: range = largest frequency value − smallest frequency value.
Second, an approximate class width was calculated by dividing the range by the number of classes: class width = range / number of classes. The lowest usage frequency represents the first minimum data value.
Third, the next lower class value was calculated by adding the class width to the lowest usage frequency: lower class value = lowest usage frequency + class width.
This step was repeated for the other minimum data values until the chosen number of classes was created. Fourth, the upper class limits (the highest values possible in each class) were calculated by subtracting 1 from the class width and adding that to the minimum data value: upper class limit = lowest usage frequency + (class width − 1).

Finally, the list of classes is obtained by including in each class the usage frequencies that are greater than the lower class value and smaller than the upper class limit.
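The class-construction procedure above can be sketched as follows; the usage frequencies are made-up illustrative values.

```python
# Build equal-width classes from ordered usage frequencies (illustrative data).
freqs = [3, 8, 15, 22, 27, 41, 55, 63, 78, 90]
num_classes = 5

freq_range = max(freqs) - min(freqs)          # range = largest - smallest frequency
width = freq_range // num_classes             # approximate class width

classes = []
lower = min(freqs)                            # first minimum data value
for _ in range(num_classes):
    upper = lower + (width - 1)               # upper class limit
    classes.append((lower, upper))
    lower = lower + width                     # next lower class value
print(classes)
```

Each station's usage frequency then falls into exactly one (lower, upper) interval, producing the five station classes.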

Deseasonalization.
Stationarity is an important concept for time series analysis. Some experts believe that neural networks are able to model seasonality directly and that no prior deseasonalization is required, whereas others believe the contrary. The results in [60] show that prior data processing is required to construct a forecasting model. To test our time series for stationarity, the Augmented Dickey-Fuller test (ADF test) was conducted before applying the machine learning and deep learning models, to make the predictions more accurate [61]. After the ADF test, we employed differencing to remove seasonality from the nonstationary time series [61,62].

Scaling.
The scaling phase is crucial to move the time series into a reasonable range. In our work, MinMaxScaler was used to scale each feature to a given range.
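A minimal MinMaxScaler sketch is shown below on illustrative usage counts:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

usage = np.array([[4.0], [10.0], [7.0], [1.0]])   # illustrative hourly usage counts
scaler = MinMaxScaler(feature_range=(0, 1))       # rescale each feature to [0, 1]
scaled = scaler.fit_transform(usage)
print(scaled.ravel())
```

After prediction, `scaler.inverse_transform` maps the model's outputs back to the original usage units.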

Splitting the Data Set.
After handling the previously mentioned steps, we prepared our data set properly. We split the data into training and test sets. The training set runs from January 1st, 2017, to December 31st, 2018, and the test set from January 1st, 2019, to January 31st, 2019. Nested cross-validation with an outer loop of ten folds and an inner loop of five folds is used to calculate and compare each model's error. All models use the same validation procedure for consistency.
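The nested cross-validation scheme (outer loop for error estimation, inner loop for hyperparameter tuning) can be sketched with scikit-learn. The estimator, parameter grid, and synthetic data below are illustrative placeholders, not the actual models and grids of this study.

```python
import numpy as np
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data standing in for the prepared feature matrix.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(100)

outer = KFold(n_splits=10)                 # outer loop: 10 folds for error estimation
errors = []
for train_idx, test_idx in outer.split(X):
    # Inner loop: 5-fold grid search tunes hyperparameters on the outer-training data only.
    inner = GridSearchCV(KNeighborsRegressor(), {"n_neighbors": [3, 5, 7]},
                         cv=5, scoring="neg_mean_squared_error")
    inner.fit(X[train_idx], y[train_idx])
    pred = inner.predict(X[test_idx])
    errors.append(np.mean((pred - y[test_idx]) ** 2))
print(round(float(np.mean(errors)), 4))    # averaged outer-fold error
```

Because the test fold of each outer split is never seen by the inner tuning loop, the averaged outer error is an (approximately) unbiased estimate of each tuned model's performance.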

Data Set.
The experiments were performed on the preprocessed Chongqing's car-sharing operator data set combined with Chongqing's weather data set, to extract the features that help to predict car usage and to demonstrate the effectiveness of deep learning, more precisely of the CNN-LSTM compared with the other models.
We implemented the proposed models using a PC with an Intel(R) Core(TM) i7-7500U CPU (Table 2).
From Table 2, fair is the most prevalent meteorological condition in Chongqing, closely followed by fog, light rain, partly cloudy, and cloudy.
From Table 3, July and August are the hottest months of the year with an average temperature of 28°C, and January is the coldest month of the year with an average temperature of 7°C.

Evaluation Metrics.
The evaluation metrics are measures that reflect how closely the predictions match the historical data. They are useful for comparing prediction methods on the same set of data [63].

Root-Mean-Squared Error (RMSE).
The mean-squared error described above is in the squared units of the predictions. It can be transformed back into the original units of the predictions by taking the square root of the mean-squared error score [64]:

RMSE = sqrt( (1/n) Σ_{t=1}^{n} (A_t − F_t)² ),

where A_t is the actual value, F_t is the forecast value, and n is the number of fitted points. The RMSE is chosen as an evaluation metric because it penalizes large prediction errors more than the mean absolute error (MAE).

Mean Absolute Percentage Error (MAPE).
Mean absolute percentage error (MAPE) is one of the most widely used measures of forecast accuracy, due to its advantages of scale independence and interpretability [65]. It is calculated by dividing the absolute error in each period by the observed value for that period and then averaging those percentages [66]. MAPE indicates how large the prediction error is compared with the real value. The MAPE can be defined by the following formula:

MAPE = (100% / n) Σ_{t=1}^{n} |A_t − F_t| / |A_t|,

where A_t is the actual value, F_t is the forecast value, and n denotes the number of fitted points.

Root-Mean-Squared Log Error (RMSLE).
The root-mean-squared log error (RMSLE) is the RMSE of the log-transformed predicted and target values [54]. RMSLE only considers the relative error between predicted and actual values, and the scale of the error is nullified by the log-transformation [67]. The formula for RMSLE is as follows:

RMSLE = sqrt( (1/n) Σ_{i=1}^{n} (log(p_i + 1) − log(a_i + 1))² ),

where n is the total number of observations in the data set, p_i is the predicted target, a_i is the actual target for observation i, and log(x) is the natural logarithm of x, log_e(x).
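The three metrics can be implemented in a few lines of numpy; the actual/forecast vectors below are illustrative.

```python
import numpy as np

def rmse(a, f):
    """Root-mean-squared error, in the original units of the predictions."""
    return np.sqrt(np.mean((a - f) ** 2))

def mape(a, f):
    """Mean absolute percentage error, in percent; actual values must be nonzero."""
    return np.mean(np.abs((a - f) / a)) * 100.0

def rmsle(a, f):
    """RMSE of the log-transformed values; log1p(x) = log(x + 1) handles zeros."""
    return np.sqrt(np.mean((np.log1p(f) - np.log1p(a)) ** 2))

actual = np.array([10.0, 20.0, 40.0])
forecast = np.array([12.0, 18.0, 40.0])
print(round(float(rmse(actual, forecast)), 4),
      round(float(mape(actual, forecast)), 4),
      round(float(rmsle(actual, forecast)), 4))
```

Note how the same absolute error of 2 counts more in MAPE and RMSLE when the actual value is small, which is the relative-error behavior described above.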

Experiments and Analysis.
Many experiments were conducted on parking stations of different classes to determine which features improve the prediction and to predict the demands accordingly. Before applying the different models, some tests were performed on our time series.

Granger Causality Tests.
The Granger causality test is a statistical hypothesis test for determining whether one time series is useful for predicting another [68]. In other words, it is an approach that analyses the causal relationships between different variables of the time series.
After analysing the results shown in Table 4, we observed that all the given p values were smaller than the significance level (0.05). For example, the value of 0.0003 at (row 4) represents the p value of the Granger causality test for temperature_x causing number_rented_cars_y (p value (0.0003) < significance level of 0.05).
From the results, we can infer that all the variables are good candidates to help predict the number of rented cars.

Machine Learning Configuration
(1) eXtreme Gradient Boosting. A grid search was created for the XGBoost model in order to locate the optimal hyperparameters for the data set.
(2) Vector Autoregression. To select the right order of the VAR model, we iteratively fit increasing orders of the VAR model and picked the order that gave the model with the least AIC [29]. In our work, we chose a lag order of 4. Before predicting the future values of the target variable with VAR, we used the serial correlation of residuals to check whether our model is able to explain the patterns in the time series. The scores in Table 5 show that our model is able to capture all the patterns without leftover.
(3) Support Vector Regression. The RBF kernel was chosen in this study for its good performance and its advantages in time series prediction problems, as demonstrated in past research [69,70]. The penalty parameters were all tuned using a grid search. Predictions were computed with the optimal combination of the cost and gamma parameters.
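A minimal sketch of tuning the cost (C) and gamma parameters of an RBF-kernel SVR with scikit-learn; the grid ranges are assumptions, since the paper does not report them:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.svm import SVR

rng = np.random.default_rng(3)

# Illustrative lag-feature matrix and nonlinear target (not the paper's data).
X = rng.normal(size=(200, 4))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

# Hypothetical grid over the cost and gamma parameters of the RBF kernel.
search = GridSearchCV(
    SVR(kernel="rbf"),
    {"C": [0.1, 1, 10, 100], "gamma": ["scale", 0.01, 0.1, 1]},
    cv=TimeSeriesSplit(n_splits=3),
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)
print("optimal C/gamma combination:", search.best_params_)
```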

(4) K-Nearest Neighbors.
The most important hyperparameters for KNN are the number of neighbors (K) and the distance metric; they determine the way in which the nearest neighbors are chosen.
To choose the number of neighbors (K), a grid search was performed. Moreover, the distance metric plays a crucial role in the nearest neighbor algorithm. Most of the referenced papers used the Euclidean distance. Reference [71] compared the Euclidean and Manhattan distances and found that, statistically, neither distance metric is significantly better than the other. The Euclidean distance was selected since it is the most widely used in time series forecasting with KNN regression.
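Both choices can be folded into a single scikit-learn grid search, as in this sketch on toy data (the ranges are placeholders; with the Minkowski metric, p=2 is the Euclidean distance and p=1 the Manhattan distance):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(4)

X = rng.normal(size=(200, 4))               # illustrative lag features
y = X.sum(axis=1) + 0.1 * rng.normal(size=200)

# Search the number of neighbors (K) and compare both distance metrics
# in the same grid.
search = GridSearchCV(
    KNeighborsRegressor(),
    {"n_neighbors": list(range(1, 16)), "p": [1, 2]},
    cv=TimeSeriesSplit(n_splits=3),
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print("best K:", search.best_params_["n_neighbors"],
      "| distance:",
      "Euclidean" if search.best_params_["p"] == 2 else "Manhattan")
```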

Deep Learning Configuration
(1) Long Short-Term Memory and Gated Recurrent Unit Models. In this study, the LSTM and GRU models have one neuron in the output layer. Both neural network models were designed with only one hidden layer, 50 epochs were chosen, and the learning rate was set to 0.01. The input and hidden neurons for the GRU model were the same as those for the LSTM model and were set to 41 and 50, respectively [56].
(2) Convolutional Neural Network. CNN is one of the most successful deep learning methods, and its network structures include 1D CNN, 2D CNN, and 3D CNN. A 1D CNN is used in this paper, as it is well suited to time series analysis. In our experiment, we used one convolutional layer with a kernel size of 2 and 64 filters, followed by a max-pooling layer; a rectified linear unit (ReLU) activation function is applied in the convolutional and output layers. To minimize the mean-squared error, the gradient descent backpropagation algorithm and the Adam optimizer were used; a dropout rate of 0.5 was employed to avoid overfitting [72], and 70 epochs were used for training the model.
(3) CNN-LSTM. In our implementation, we utilized a version of the CNN-LSTM model that consists of two one-dimensional convolutional layers with a kernel size of 5 and 32 and 64 filters, respectively, followed by a max-pooling layer, an LSTM layer with 50 units, and a dense layer of 32 neurons [48]. To avoid overfitting during training, the dropout rate was set to 0.2, and 100 epochs were used to train the model.
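A hedged Keras sketch of such an architecture follows; the input window (24 time steps, 5 features) is a placeholder, since the paper does not state its exact input shape, and only the model construction is shown, not the 100-epoch training:

```python
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Conv1D, Dense, Dropout, LSTM,
                                     MaxPooling1D)

# Assumed window: 24 past time steps with 5 features each (placeholders).
n_steps, n_features = 24, 5

model = Sequential([
    Conv1D(32, kernel_size=5, activation="relu",
           input_shape=(n_steps, n_features)),  # first 1D conv, 32 filters
    Conv1D(64, kernel_size=5, activation="relu"),  # second 1D conv, 64 filters
    MaxPooling1D(pool_size=2),
    Dropout(0.2),                  # dropout of 0.2 against overfitting
    LSTM(50),                      # LSTM layer with 50 units
    Dense(32, activation="relu"),  # dense layer of 32 neurons
    Dense(1),                      # single-step usage forecast
])
model.compile(optimizer="adam", loss="mse")

# Shape check only; in the paper the model is trained for 100 epochs.
dummy = np.zeros((2, n_steps, n_features), dtype="float32")
print(model.predict(dummy, verbose=0).shape)
```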

Prediction Results.
For the good organization of the paper, and to avoid redundancy in our explanations, we only discuss the result analysis of class "A", as the other classes exhibit the same behavior and lead to the same conclusions.
To perform the comparison while fitting the models with different features, only the results of CNN-LSTM among the deep learning models and of XGBoost among the machine learning models are described in our analysis; the same analysis applies to each of the other applied models.
In our investigation, the metrics MAE, MSE, RMSE, MAPE, and RMSLE are used, in that order, to make the comparison. Note that the smallest errors are shown in bold text in Tables 6-13.

Univariate Time Series
(1) Machine Learning Models. The results for the univariate time series are shown in Table 6.
(2) Deep Learning Models. The corresponding deep learning results can be seen in Table 7.

The Effect of Weekend Information
(2) Deep Learning Models. The addition of weekend features improved the prediction accuracy, as can be observed from Table 9: adding the weekend feature to the univariate time series of class "A" improved the results at rates of (2.4%, 4.95%, 2.54%, 4.23%, 0.43%) for CNN-LSTM.
Regarding the models' comparison, CNN-LSTM achieved the best results with an improvement rate of (43.75%, 46

The Effect of Weather Information.
In addition to the rental information in the city of Chongqing, we can leverage weather data to improve the prediction at different times of the day. A thorough examination showed that integrating the weather data as features improved the prediction accuracy. When the CNN-LSTM model was applied to class "A" data with the weather features, it reduced the MAE by 2.67%, the MSE by 6.04%, the RMSE by 3.03%, the MAPE by 11.10%, and the RMSLE by 5.02% compared to the univariate time series results. Similarly, it also reduced the MAE by 0.28%, the MSE by 1.16%, the RMSE by 0.51%, the MAPE by 7.17%, and the RMSLE by 5.43% compared to the weekend results.

Combined Effect of Historical, Weekends, and Weather Information
(1) Machine Learning Models. The evaluation error of XGBoost was the lowest, as shown in Table 12. Our findings show that when all the features were used together, XGBoost performed better at rates of (3.96%, 12.92%, 6.65%, 6%, 2.85%). Similarly, it performed (2.85%, 12.38%, 6.72%, 6.14%, 2.62%) better than when the history combined with the weekend features was used and (1.31%, 10.6%, 6.41%, 5.48%, 1.31%) better than when the history combined with the weather features was used.

Comparison of the Results.
One of our objectives is to compare the accuracy of various machine learning and deep learning models in predicting the future number of car-sharing transactions. After analysing the results obtained in our study, the following are our findings.

Machine Learning.
First, with regard to the results obtained with the machine learning models, the comparison based on the evaluation metrics shows that XGBoost gave the best results, followed by VAR, SVR, and KNN. The XGBoost model had several advantages in model prediction, such as complete feature extraction, a good fitting effect, and high prediction accuracy.
Second, the SVR prediction series failed to capture random and nonlinear patterns; hence, it did not perform well, while the XGBoost and VAR forecast series were able to capture random walk patterns.
Third, KNN performed the worst compared to the other machine learning models because of the high number of inputs.

Deep Learning.
After comparing the results, we can deduce that CNN-LSTM generated the best outcomes, followed by LSTM, GRU, CNN, and MLP. The hybrid CNN-LSTM model yielded better performance on the strength of its capability to support very long input sequences, which can be read as subsequences by the CNN component and then assembled by the LSTM component.
Besides the CNN-LSTM model, the long short-term memory model achieved good results on account of its ability to learn patterns from sequential data more effectively.
The key difference between the gated recurrent unit model and the long short-term memory model is that GRU is less complex than LSTM, as it only has two gates (reset and update) while LSTM has three (input, output, and forget). By comparing the two models using the different evaluation metrics, it can be concluded that the LSTM model had a better memory for longer sequences than the GRU model, and it outperformed GRU in tasks requiring the modelling of long-distance relations. The CNN produced quite impressive results because of the ability of its convolutional layer to identify patterns between time steps. Contrary to the LSTM model, the CNN model is not recurrent, and it can only use the data that are input to the model at a particular time step.
Unlike the other models, the multilayer perceptron model performed worst. The model received its inputs without treating them as sequence data, which led to the loss of temporal dependencies and sequence patterns.

Comparison between Machine Learning and Deep Learning Models.
It can be inferred from the obtained results that the deep learning models outperformed all the machine learning time series prediction models. Among the different models, we noticed that CNN-LSTM gave the best performance measures and achieved the most accurate prediction results. Furthermore, Figure 6 shows that the dashed line of predicted values almost coincides with that of the real values, which proves that the hybrid CNN-LSTM model generated good results. Figure 7 compares the two best machine learning and deep learning models on class "A", and it can be noticed that CNN-LSTM slightly outperformed the XGBoost model.

The Computational Time.
The computational times of the various machine and deep learning models can be found in the following tables. Table 14 shows that XGBoost has the fastest computational time, while SVM is the most demanding. For the deep learning models, Table 15 shows that the computational time of CNN-LSTM is larger than those of the LSTM, GRU, CNN, and MLP models.
Machine learning models exhibit faster computational times, while deep learning models take longer because of their high number of parameters and their complex mathematical operations. It also showed that the environmental condition features dominated the other features, followed by the temporal information features.

Conclusions
This research paper, through applying different machine learning and deep learning models to multivariate time series, aims to predict the car usage and to investigate the factors that help to improve the prediction accuracy. The evaluation of the different machine learning and deep learning models with MAE, MSE, RMSE, MAPE, and RMSLE reveals that the hybrid model (CNN-LSTM) gives substantially smaller errors than the standalone models.
The experimental results show that utilizing the CNN-LSTM model on the number of car-sharing transactions, together with the environmental conditions and temporal information features, yields the highest prediction accuracy. The principal idea of the hybrid model is to efficiently amalgamate the advantages of two deep learning techniques: it exploits the ability of convolutional layers to extract useful knowledge and learn the internal representation of time series data, as well as the effectiveness of long short-term memory (LSTM) layers in remembering events over both short and very long horizons. Furthermore, through our experimental analysis, we conclude that even though LSTM models constitute an efficient choice for car-sharing time series prediction, their usage along with additional convolutional layers provides a significant boost to the forecasting performance. Although CNN-LSTM requires a long search time due to its sensitivity to various hyperparameters and its high complexity, it shows the highest forecasting accuracy and the best overall performance.
All the results of the used models confirm that the car-rental usage is more sensitive to environmental conditions than to temporal information, which means that the impact of weather on car-rental transportation deserves more attention in research. However, our work is limited to temporal features. Future studies can extend it by adding more features, such as a longer time span of data and spatiotemporal variables, and by extending the model to consumers' habits [73][74][75][76].

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this study.