Optimizing Location of Car-Sharing Stations Based on Potential Travel Demand and Present Operation Characteristics: The Case of Chengdu

Car-sharing is becoming an increasingly popular travel mode in China, and many companies, including vehicle enterprises and Internet companies, invest heavily in it. But in the early development of their business, most of them build car-sharing stations by experience, or randomly as long as there is parking space. This results in many stations with low operational efficiency and causes capital loss. This study aims to use different data sources with statistical models and machine learning algorithms to help car-sharing operators choose the optimal location of new stations and adjust the location of existing stations. We select Chengdu, where there is a large amount of car-sharing travel demand and several large car-sharing operators, as the research area and two main operators as the research objects. Instead of focusing on the buffers generated by stations, Chengdu is divided into 58724 square grids, each of which is 0.5 km × 0.5 km. We try to find a model to estimate a potential travel demand value for each small grid with three data sources: order data, population data, and Point of Interest (POI) data. This problem is transformed into a binary form and five different methods, Logistic Regression, Logistic Regression with LASSO, Naive Bayes, Linear Discriminant Analysis, and Quadratic Discriminant Analysis, are implemented. The optimal model, Logistic Regression with LASSO, is chosen to estimate the probability of existence of demand in all grids. With car-sharing order data from different operators, an existing order heat value is also computed for each grid. Then we analyze and classify all the grids into four groups. For different groups of grids, we give different suggestions on the optimal location of stations. This study focuses on a more competitive market and finds the influential factors on order number. Suggestions on the optimal location of stations are given in consideration of competitors.
We hope that our research can help operators improve their business and make rational plans.


Introduction
Car-sharing is a mode that allows members to access a fleet of vehicles for short-term use without actual ownership. Members just need to reserve a vehicle online or by mobile app, go to the parking lot, and drive the car. They usually pay afterwards according to the travel distance and/or time [1,2]. This kind of new service allows people to avoid buying a car and spending time finding parking lots [3]. Car-sharing has become an increasingly important travel mode in the last five years, especially with the rapid growth of electric vehicles since 2016. The advantages of car-sharing include reducing vehicle ownership, reducing vehicle kilometres travelled, and reducing greenhouse gas emissions [4][5][6].
So far, there are three main types of car-sharing mode: station-based, free-floating, and peer-to-peer car-sharing [7]. A station-based system requires the operators to rent parking space for stations, and the vehicles are distributed among these stations. According to the operation mode, there are two forms of station-based car-sharing: one-way and round-trip. One-way car-sharing allows customers to return cars at any designated station regardless of where the trip started. In contrast, with round-trip car-sharing, cars must be returned to the station where the trip started. The first form provides more flexibility but causes the problem of spatial imbalance over the stations, so the operators have to consider the relocation problem, which is costly. A free-floating system operates without stations. Instead, the operators define an area within which customers can park cars at any parking lot [7,8]. Free-floating car-sharing could also be treated as a special case of one-way car-sharing [9]. Our research mainly focuses on the one-way station-based car-sharing system, for which station siting is a big challenge that requires many factors to be taken into consideration. A good choice can bring high efficiency, large profit, low operation cost, and competitive advantages over other operators [10,11]. However, at the beginning of their business, operators usually locate stations by experience. For example, stations in trading areas, universities, or airports usually bring great profit. But different cities have different features and attitudes towards car-sharing. Therefore, experience may sometimes be unreliable, and relocations become necessary when the business is mature.
What is more, some operators, relying on their large funds, simply choose locations randomly as long as there are parking spaces in order to seize the market ahead of competitors, in which case relocations are even more necessary.
Researchers have done a lot of work on the relocation or site selection problem using various methods. The analytic hierarchy process (AHP) is the most popular among multicriteria decision-making methods [12]. One study uses AHP to solve the site selection problem for EVCARD, a car-sharing operator in Shanghai. The researchers consider the potential users, potential travel demand, potential travel purposes, and distances from existing stations, a total of 15 factors, as the decision criteria. But this method is based on candidate stations and on expert scoring, which is subjective [11]. Mathematical and statistical models are also applied. Several studies use multilinear models and mixed-integer programming models to find the optimal location of stations [13][14][15]. Another paper introduces an intensity model that estimates the demand and an imbalance model that describes the difference between pick-up and drop-off to find the optimal location of EVCARD stations in Shanghai, which involves usage intensity, usage imbalance, transportation information, and built environment. These two models utilize a combination of Elastic Net and the adaptive Least Absolute Shrinkage and Selection Operator (LASSO) [3].
The previous two studies about site location of EVCARD both focus on the market in Shanghai, where there is no competitor. Therefore, it is much easier to decide on the optimal site location. Our research focuses on a more complicated market: Chengdu, where there are more than five car-sharing operators, each with its own advantages. In Chengdu, two main car-sharing operation companies account for the majority of the market share. For privacy, we call them operator F and operator H. They are both station-based car-sharing operators. The differences between them are the number of electric vehicles and stations, vehicle models, and charging mode. We will consider the potential demand heat combined with the existing order heat to give suggestions on site location.

Data
We divide Chengdu into 58724 square grids, each of which has an area of 0.5 km × 0.5 km. All the data we use are allocated to these grids. The data sources consist of three parts. The first one is the order data. We collect the order data from the mobile apps of the two car-sharing operators and compute the average daily order numbers for each car-sharing point. The time period is between March 28, 2018, and April 17, 2018. Each order is counted only once, at its starting point. For each grid, we sum up the average daily order numbers of all points of these two car-sharing operation companies in that grid. There are 1834 grids with car-sharing stations. The summary of these grids is shown in Table 1.
The second data source is the Point of Interest (POI) information of Chengdu, collected from AMAP in July 2018. AMAP is an electronic map that is similar to Google Maps and is widely used in China. However, AMAP uses the "GCJ-02" coordinate system, which is different from the "WGS-84" coordinate system used in Google Maps. A coordinate transformation is therefore applied to ensure that the POI information is based on the "WGS-84" coordinate system. The total number of POIs reaches 860196 and they are categorized into fourteen classes shown in the first column of Table 2. In the AMAP POI categories, auto service contains various services such as filling stations, automobile rental, and charging stations. Here we only consider one subcategory, automobile rental, since car-rental service will affect the demand for car-sharing service. Additionally, we separate transportation service into five parts: Bus Station, Underground Station, Train Station, Airport, and Parking Lots. Again we allocate these POIs to the grids, and the summary is shown in Table 2.
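The allocation of order data to grids described above can be sketched as follows. This is a minimal illustration, not the study's own procedure: it assumes station coordinates have already been projected to metres, and the origin, the function names `grid_index` and `orders_per_grid`, and the three station records are hypothetical.

```python
from collections import defaultdict

CELL_SIZE_M = 500.0  # grid edge length from the text: 0.5 km


def grid_index(x_m, y_m, origin_x=0.0, origin_y=0.0, cell=CELL_SIZE_M):
    """Return (column, row) of the square cell containing a projected point."""
    col = int((x_m - origin_x) // cell)
    row = int((y_m - origin_y) // cell)
    return col, row


def orders_per_grid(stations):
    """Sum the average daily order numbers of all stations in each cell."""
    totals = defaultdict(float)
    for x_m, y_m, avg_daily_orders in stations:
        totals[grid_index(x_m, y_m)] += avg_daily_orders
    return dict(totals)


# Hypothetical stations: (x in metres, y in metres, average daily orders)
stations = [(120.0, 80.0, 2.5), (450.0, 499.0, 1.0), (510.0, 20.0, 4.0)]
grid_orders = orders_per_grid(stations)
print(grid_orders)
```

The first two stations fall into the same cell, so their order numbers are added together, while the third lands in the neighbouring cell. POI counts could be aggregated per cell in the same way.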
The last data source used in our research is the population data (POP) called Gridded Population of the World (GPW) from the Socioeconomic Data and Applications Center in NASA. It is now in its fourth version, which models the population counts and densities on a continuous global raster surface. The population data are collected from the population and housing censuses between 2005 and 2014 and are used to estimate the population for the years 2000, 2005, 2010, 2015, and 2020. For these years, a set of estimates adjusted to national-level historic and future population predictions from the United Nations' World Population Prospects report (2015 Revision) is also provided. GPW is gridded with an output resolution of 30 arc-seconds, which is approximately equal to 1 km at the equator. The value for each grid is not the population number in it but a scale that reflects the level of population. In our research we use the population density in 2015 and adjust the rasters, together with their population densities, to our grids [16].

[Figure: Distribution of daily order numbers; axis label: Daily order numbers.]

Methodology
The true value of the potential demand is unknown since it is an abstract concept that is difficult to measure. One possible way is to conduct a full-sample questionnaire, but this requires substantial human and financial resources, and the results might be biased since questionnaires are subjective. Another way is to use usage intensity as a proxy of demand [3] or to use the number of bookings, but this quantity can only reflect the present demand, which may be restricted by the number of car-sharing stations and vehicles. Therefore, using a specific quantity to represent the potential demand will cause deviation. The alternative applied in our research is simply to distinguish whether demand for car-sharing exists in a grid or not. The question thus becomes a classification problem to which classification algorithms can be applied. So far 1834 grids contain at least one car-sharing station. The distribution of order numbers in them is shown in Figure 1(a). Clearly, those with larger order numbers can be treated as high demand. Those with tiny order numbers are reasonably defined as no demand, since a very small order number reflects an occasional demand that operators do not need to satisfy at the cost of renting parking space. Therefore, we choose the lower 20% and upper 20% of grids as the sample set (as shown in Figure 1(b)) and set the response to "1" for demand and "0" for no demand. What is more, we would also like to know the level of demand in a grid, which is the probability that the grid belongs to class "1". Classification algorithms such as k-nearest neighbour (KNN) and tree models cannot provide us with such a probability, so we choose the following four methods: Logistic Regression, Naive Bayes, Linear Discriminant Analysis, and Quadratic Discriminant Analysis.
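The sample construction described above can be illustrated with a short sketch. It is an assumption-laden simplification: the function name `select_extreme_samples` and the ten grid values are hypothetical, and the cut is made by rank, whereas the study appears to use quantile thresholds on the order numbers, which can give slightly unequal class sizes when values are tied.

```python
def select_extreme_samples(orders_by_grid, tail=0.2):
    """Label the lowest `tail` share of grids 0 and the highest `tail` share 1.

    Grids in the middle of the distribution are left out of the sample set.
    """
    ranked = sorted(orders_by_grid.items(), key=lambda item: item[1])
    k = int(len(ranked) * tail)  # number of grids taken from each tail
    if k == 0:
        return {}
    labels = {grid: 0 for grid, _ in ranked[:k]}          # no demand
    labels.update({grid: 1 for grid, _ in ranked[-k:]})   # demand
    return labels


# Hypothetical average daily order numbers for ten grids
orders = {"g1": 0.1, "g2": 0.5, "g3": 1.2, "g4": 2.0, "g5": 3.3,
          "g6": 4.1, "g7": 5.0, "g8": 6.5, "g9": 7.0, "g10": 9.9}
labels = select_extreme_samples(orders)
print(labels)
```

With ten grids and a 20% tail, the two grids with the fewest orders receive label 0 and the two with the most orders receive label 1.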

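Logistic Regression with LASSO.
Logistic Regression is used to find the relationship between variables $X = (X_1, X_2, \ldots, X_p)$ and a binary response $Y$. It models the probability that $Y$ belongs to a particular class. If we use a linear regression model to represent the probability, we might get a result that is smaller than 0 or larger than 1, which does not make sense for a probability. Therefore, in order to get an output between 0 and 1, we use the logistic function
$$p(X) = \Pr(Y = 1 \mid X) = \frac{\exp(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)}{1 + \exp(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)}.$$
Rearranging gives the odds $p(X) / (1 - p(X)) = \exp(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)$. By taking the logarithm of both sides, we have
$$\log \frac{p(X)}{1 - p(X)} = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p.$$
We can see that the Logistic Regression model has a log-odds which is linear in $X$.
The coefficients of Logistic Regression models are usually estimated by the maximum likelihood method. The likelihood function for $n$ observations $(x_i, y_i)$, $i = 1, \ldots, n$, is
$$L(\beta) = \prod_{i=1}^{n} p(x_i)^{y_i} \, (1 - p(x_i))^{1 - y_i},$$
where $p(x_i) = \Pr(Y = 1 \mid X = x_i)$ and $\beta = (\beta_0, \beta_1, \ldots, \beta_p)$. The log-likelihood can be written as
$$\ell(\beta) = \sum_{i=1}^{n} \left[ y_i \log p(x_i) + (1 - y_i) \log (1 - p(x_i)) \right].$$
Setting its derivatives with respect to $\beta$ to zero gives the maximum likelihood estimators for $\beta$ [17].
The LASSO is a shrinkage method that constrains the coefficient estimates, which can significantly reduce their variance. A penalty term $\lambda \sum_{j=1}^{p} |\beta_j|$ is added to the model, where $\lambda \geq 0$ is a tuning parameter.
Therefore, for Logistic Regression with LASSO, we would maximize the penalized log-likelihood function
$$\max_{\beta} \left\{ \ell(\beta) - \lambda \sum_{j=1}^{p} |\beta_j| \right\}.$$
Selecting a good value of $\lambda$ is critical since it significantly affects the coefficients. Cross-validation is applied to choose the $\lambda$ that produces the smallest cross-validation error [18].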
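To make the effect of the penalty concrete, the following is a minimal sketch of L1-penalized logistic regression fitted by proximal gradient descent (ISTA). The text does not say which solver the study used, so this technique is a stand-in. The sketch minimizes the negative log-likelihood divided by the number of observations plus `lam` times the sum of absolute coefficients, so `lam` matches the tuning parameter above only up to that scaling. The toy data, step size, and iteration count are illustrative assumptions, and no cross-validation is performed.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink towards zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso_logistic(X, y, lam, step=0.1, iterations=2000):
    """Minimise -(1/n) * log-likelihood + lam * sum_j |beta_j| by ISTA.

    The intercept beta0 is not penalised.
    """
    n, p = len(X), len(X[0])
    beta0, beta = 0.0, [0.0] * p
    for _ in range(iterations):
        # Residuals p(x_i) - y_i give the gradient of the negative log-likelihood.
        resid = [sigmoid(beta0 + sum(b * v for b, v in zip(beta, row))) - yi
                 for row, yi in zip(X, y)]
        beta0 -= step * sum(resid) / n
        for j in range(p):
            grad_j = sum(r * row[j] for r, row in zip(resid, X)) / n
            # Gradient step followed by shrinkage from the L1 penalty.
            beta[j] = soft_threshold(beta[j] - step * grad_j, step * lam)
    return beta0, beta


# Hypothetical toy data: the first feature separates the classes, the second is noise.
X = [[-2.0, 0.3], [-1.0, -0.4], [-0.5, 0.1], [0.5, -0.2], [1.0, 0.4], [2.0, -0.3]]
y = [0, 0, 0, 1, 1, 1]
_, beta_small = fit_lasso_logistic(X, y, lam=0.01)
_, beta_large = fit_lasso_logistic(X, y, lam=10.0)
print(beta_small, beta_large)
```

With a very large penalty every coefficient is shrunk exactly to zero, while a small penalty leaves a positive coefficient on the informative feature. In practice the penalty would be chosen by cross-validation over a range of candidate values, as the text describes.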

Naive Bayes.
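Assume that there are $K$ classes and we wish to classify an observation $X$ into one of them. According to Bayes' theorem,
$$\Pr(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)},$$
where $\Pr(Y = k \mid X = x)$ denotes the probability that $X$ belongs to class $k$; $\pi_k$ is the prior probability that a random observation is from class $k$; and $f_k(x) = \Pr(X = x \mid Y = k)$ is the probability density function of $X$ for an observation that belongs to class $k$.
For an observation $X = (X_1, X_2, \ldots, X_p)$, Naive Bayes assumes that the features are independent given a class $k$. So we have
$$f_k(x) = \prod_{j=1}^{p} f_{kj}(x_j),$$
where $f_{kj}(x_j)$ is the probability density function of the $j$-th feature given class $k$. Then the probability of $X$ coming from class $k$ is
$$\Pr(Y = k \mid X = x) = \frac{\pi_k \prod_{j=1}^{p} f_{kj}(x_j)}{\sum_{l=1}^{K} \pi_l \prod_{j=1}^{p} f_{lj}(x_j)}.$$
See [18].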
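A small sketch may help to show how the Naive Bayes posterior is assembled from per-feature densities. It assumes Gaussian densities for every feature, which is the choice the Results section reports for this study. The function names and the priors, means, and standard deviations in the example are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def naive_bayes_posterior(x, priors, means, stds):
    """Posterior Pr(Y = k | X = x) with class-wise independent Gaussian features."""
    joint = {}
    for k in priors:
        likelihood = 1.0
        for xj, m, s in zip(x, means[k], stds[k]):
            likelihood *= gaussian_pdf(xj, m, s)  # product over features
        joint[k] = priors[k] * likelihood
    total = sum(joint.values())
    return {k: value / total for k, value in joint.items()}


# Hypothetical two-class, two-feature parameters
priors = {0: 0.5, 1: 0.5}
means = {0: [0.0, 0.0], 1: [2.0, 2.0]}
stds = {0: [1.0, 1.0], 1: [1.0, 1.0]}
post_mid = naive_bayes_posterior([1.0, 1.0], priors, means, stds)
post_far = naive_bayes_posterior([2.0, 2.0], priors, means, stds)
print(post_mid, post_far)
```

A point halfway between the two class means receives equal posterior probabilities, while a point at the mean of class 1 is assigned to class 1 with high probability. A feature with zero standard deviation within a class cannot be handled by this density and would need separate treatment.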

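Linear Discriminant Analysis.
Linear Discriminant Analysis is based on Bayes' theorem and assumes that the observation $X = (X_1, X_2, \ldots, X_p)$ is generated from a multivariate Gaussian distribution with a class-specific mean vector and a common covariance matrix. We write $X \sim N(\mu_k, \Sigma)$, where $\mu_k$ is the mean vector for class $k$ and $\Sigma$ is the covariance matrix that is the same for all classes. Then the probability density function (pdf) of $X$ that is from class $k$ is
$$f_k(x) = \Pr(X = x \mid Y = k) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_k)^{T} \Sigma^{-1} (x - \mu_k) \right).$$
The parameters of the multivariate Gaussian distributions are unknown and we need to estimate them from training data by
$$\hat{\pi}_k = \frac{n_k}{n}, \qquad \hat{\mu}_k = \frac{1}{n_k} \sum_{i : y_i = k} x_i, \qquad \hat{\Sigma} = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i : y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^{T},$$
where $n$ is the total number of observations, $n_k$ is the number of observations in class $k$, and $y_i$ is the label of the $i$-th observation [17].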

Quadratic Discriminant Analysis.
Quadratic Discriminant Analysis differs from Linear Discriminant Analysis in that it assumes a different covariance matrix for each class. Then the observation $X$ from class $k$ has the distribution $N(\mu_k, \Sigma_k)$. So the pdf becomes
$$f_k(x) = \frac{1}{(2\pi)^{p/2} |\Sigma_k|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_k)^{T} \Sigma_k^{-1} (x - \mu_k) \right).$$
See [17].
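The difference between the two discriminant methods can be seen in a univariate sketch, in which the covariance matrix reduces to a single variance. This is a deliberate simplification of the multivariate setting used in the study. The function names and the six data points are hypothetical, and the pooled variance uses the usual divisor of the number of observations minus the number of classes.

```python
import math


def class_stats(xs, ys):
    """Per-class count, mean, and sum of squared deviations."""
    stats = {}
    for k in set(ys):
        values = [x for x, y in zip(xs, ys) if y == k]
        mean = sum(values) / len(values)
        ss = sum((v - mean) ** 2 for v in values)
        stats[k] = (len(values), mean, ss)
    return stats


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution with variance `var`."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def discriminant_posterior(x, xs, ys, shared_variance=True):
    """Posterior class probabilities; shared_variance=True is LDA, False is QDA."""
    stats = class_stats(xs, ys)
    n, n_classes = len(xs), len(stats)
    pooled = sum(ss for _, _, ss in stats.values()) / (n - n_classes)
    joint = {}
    for k, (n_k, mean, ss) in stats.items():
        var = pooled if shared_variance else ss / (n_k - 1)
        joint[k] = (n_k / n) * normal_pdf(x, mean, var)  # prior times density
    total = sum(joint.values())
    return {k: v / total for k, v in joint.items()}


# Hypothetical data: class 1 is more spread out than class 0
xs = [-1.0, 0.0, 1.0, 2.0, 4.0, 6.0]
ys = [0, 0, 0, 1, 1, 1]
lda = discriminant_posterior(2.0, xs, ys, shared_variance=True)
qda = discriminant_posterior(2.0, xs, ys, shared_variance=False)
print(lda, qda)
```

At the midpoint between the class means, the shared variance gives equal posteriors, whereas the class-specific variances favour the more dispersed class 1. This is the sense in which QDA produces a curved rather than linear decision boundary.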

Results and Discussion
As mentioned in Section 3, we take the lower 20% and upper 20% of grids as the sample set. The quantiles are 0.79 and 6.33, respectively, which means that the grids with average daily order numbers less than 0.79 or larger than 6.33 are chosen as the samples. The sample size is 737, of which 370 samples are of no demand and 367 are of high demand. We randomly choose 500 samples (approximately 68% of the total samples) as the training set and the remaining 237 as the test set. Since population density and POI information are on different scales, all variables are normalized by their mean and standard error (shown in Table 3).
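The split and normalization steps can be sketched as below. Two assumptions are made that the text does not confirm: the "standard error" used for normalization is taken to be the sample standard deviation, and the scaling statistics are computed from whichever rows are passed in. The function names, the seed, and the example rows are hypothetical.

```python
import random
import statistics


def train_test_split(samples, n_train, seed=0):
    """Randomly split samples into a training set of size n_train and a test set."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


def fit_scaler(rows):
    """Column-wise mean and sample standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    stds = [statistics.stdev(c) for c in columns]
    return means, stds


def standardize(rows, means, stds):
    """Apply (x - mean) / std column by column."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# 737 placeholder sample identifiers split into 500 training and 237 test samples
train_ids, test_ids = train_test_split(list(range(737)), n_train=500)

# Hypothetical rows with two variables on very different scales
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
means, stds = fit_scaler(rows)
scaled = standardize(rows, means, stds)
print(len(train_ids), len(test_ids), scaled)
```

After scaling, both columns have mean zero and unit standard deviation, so a population density and a POI count become comparable in magnitude. The same means and standard deviations would then be reused when scoring new grids.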

Logistic Regression.
The results of Logistic Regression are shown in Table 4. Only five variables, Food & Beverages, Medical Service, Governmental Organization & Social Group, Bus Station, and Parking Lots, are significant at the 5% significance level. The population factor is not significant. Food & Beverages positively affects the probability of existence of demand, while Medical Service and Governmental Organization & Social Group have negative effects. According to our investigation, the car-sharing stations in Governmental Organizations usually have limited access, which means that only staff who work there can use the stations. Once a car is returned at these stations, it usually takes a long time to receive the next order. The negative effect of Medical Service could be attributed to the limited parking space and huge parking demand: the stations might be occupied by fuel vehicles. The coefficient of Bus Station is also positive, which is consistent with the results in the literature [3,19]. Parking Lots has a positive impact on the probability of existence of demand, which implies that more space provides more opportunities to build car-sharing stations. However, Chen's paper [3] concludes that more parking space leads to more private vehicle trips instead of car-sharing service.
... [20]. In this model, Train Station and Airport are also significant, which could be attributed to their function as traffic hubs that bring a high exposure rate and a large mobile population. The population factor is again not significant.

Linear Discriminant Analysis.
As mentioned in Section 3.3, Linear Discriminant Analysis (LDA) assumes that the sample is from a multivariate Gaussian distribution where each class has its own mean vector and the covariance matrix is the same for all classes. Table 6 shows the mean vectors for the classes demand = 0 and demand = 1. The covariance matrix is shown in Appendix A. The priors for these two classes are $\pi_0 = 0.502$ and $\pi_1 = 0.498$.

Quadratic Discriminant Analysis.
The mean vectors of Quadratic Discriminant Analysis (QDA) are the same as those of LDA. However, QDA assumes that the covariance matrix is different for these two classes. The results are shown in Appendix B.

Naive Bayes.
Naive Bayes assumes that each predictor is independent and can be represented by a Gaussian distribution within each group. Train Station is a constant within the class demand = 0 since its standard deviation in that class is zero. The mean and standard deviation for the other variables, grouped by class, are shown in Table 7.

Comparing These Five Models.
To judge the performance of a classifier, the AUC value and accuracy rate are commonly used. The AUC value is the area under the receiver operating characteristic (ROC) curve and varies between 0 and 1. The higher this value, the better the discrimination. The accuracy rate is the number of correct predictions divided by the total number of predictions. We test these models with the remaining 237 observations. The values of these two measures for the previous five models are shown in Table 8. We can see that all models give an AUC value between 0.8 and 0.9 and an accuracy rate between 0.65 and 0.8. The QDA model produces the lowest AUC value and accuracy rate, while the pure Logistic Regression and Logistic Regression with LASSO perform best since their AUC values and accuracy rates are the largest. The LDA model and Naive Bayes model are a bit worse than Logistic Regression. Therefore, we can choose either the pure Logistic Regression or Logistic Regression with LASSO as the final model. Here we choose the Logistic Regression with LASSO model.
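Both measures can be computed directly from predicted probabilities, as in the sketch below. The AUC is obtained through its equivalent interpretation as the probability that a randomly chosen positive case scores higher than a randomly chosen negative one. The 0.5 cutoff for accuracy is an assumption, since the text does not state the threshold used for the accuracy rate, and the labels and probabilities are made up.

```python
def accuracy(y_true, probs, threshold=0.5):
    """Share of predictions (prob > threshold means class 1) that match the labels."""
    hits = sum((p > threshold) == (y == 1) for y, p in zip(y_true, probs))
    return hits / len(y_true)


def auc(y_true, probs):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    Each positive/negative pair scores 1 if the positive has the higher
    probability, 0.5 for a tie, and 0 otherwise.
    """
    pos = [p for y, p in zip(y_true, probs) if y == 1]
    neg = [p for y, p in zip(y_true, probs) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test labels and predicted probabilities of demand
y_true = [0, 0, 1, 1, 0, 1]
probs = [0.1, 0.4, 0.35, 0.8, 0.6, 0.9]
print(auc(y_true, probs), accuracy(y_true, probs))
```

In this example seven of the nine positive/negative pairs are ranked correctly and four of the six predictions are correct at the 0.5 cutoff. The pairwise form is quadratic in the sample size, which is acceptable for a test set of 237 observations.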

Optimizing Location of Car-Sharing Stations.
For all of the 58724 square grids, we normalize the variables with the mean and standard error in Table 3 and then run the Logistic Regression with LASSO model. The predictive probabilities of existence of demand for these grids against present order numbers are shown in Figure 2. The left figure (a) shows the results for all 58724 grids. We can see that most grids have order numbers less than 10, so for clarity we plot these grids in figure (b). We define a predictive probability larger than 0.5 as high demand heat and otherwise low demand heat. What is more, it is reasonable to consider average daily order numbers less than 1 as low order heat and otherwise high order heat. Therefore, the grids can be sorted into 4 groups as shown in Figure 2(b): I: high demand heat and high order heat; II: high demand heat and low order heat; III: low demand heat and low order heat; IV: low demand heat and high order heat. For stations in group III grids, operators are advised to close or remove them after investigation, such as checking the time interval between two orders. For stations in group IV, further work is required to check whether other influential factors that are omitted here exist, since low demand heat and high order heat contradict each other. We will mainly focus on groups I and II when optimizing the locations of car-sharing stations. Figure 3 shows the grids with high demand heat in Chengdu. The red colour refers to group I and the blue colour refers to group II. It is obvious that these grids concentrate in the city centre and town centres, which are also the areas where crowds and business gather.
For the grids in groups I and II, we can give suggestions to the two operators, operator F and operator H, on the optimal location of car-sharing stations. Three cases of grids are considered:
Moreover, in Case 3, we find that these two operators have not yet entered the town centres, where most of these grids are of high demand heat. These two operators are advised to investigate these areas and consider building stations.
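The four-group rule defined above amounts to two threshold comparisons, sketched here with a hypothetical function name `classify_grid`. The cutoffs of 0.5 for the predictive probability and 1 for the average daily order number are taken from the text, and the example inputs are made up.

```python
def classify_grid(demand_probability, avg_daily_orders,
                  prob_cutoff=0.5, order_cutoff=1.0):
    """Return the group label I-IV for a grid.

    High demand heat: predictive probability larger than prob_cutoff.
    Low order heat: average daily order number less than order_cutoff.
    """
    high_demand = demand_probability > prob_cutoff
    high_orders = avg_daily_orders >= order_cutoff
    if high_demand:
        return "I" if high_orders else "II"
    return "IV" if high_orders else "III"


# Hypothetical grids: (predictive probability, average daily orders)
examples = [(0.8, 3.0), (0.8, 0.2), (0.3, 0.2), (0.3, 3.0)]
groups = [classify_grid(p, o) for p, o in examples]
print(groups)
```

The four example grids fall into groups I, II, III, and IV in that order. Making the cutoffs parameters allows the order threshold to be varied, for instance with the cost of building a station.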

Conclusion
This research focuses on optimizing the location of car-sharing stations in the Chengdu market. The main approach is to estimate the potential demand and combine it with the present order numbers. Unlike previous research that applies multiple linear regression to model the demand, this study transforms the question into a binary problem of whether the demand exists or not. Then five classification models are
(iv) In the Naive Bayes model, each variable is assumed to be normally distributed, with the mean and standard deviation estimated within each class. The standard deviation of Train Station in the class demand = 0 is zero; thus it is treated as a constant.
(v) Comparing the performance of these five models by AUC value and accuracy rate, we find that QDA models give the worst estimation while Logistic Regression and Logistic Regression with LASSO perform best. LDA is a bit worse than these two models and Naive Bayes works slightly better than QDA. Therefore we conclude that linear models work better in our case.
The Logistic Regression with LASSO is chosen as the final model and used to estimate the probability of existence of demand for all grids. The predictive probability larger than 0.5 is treated as high demand heat and otherwise low demand heat. These grids with high demand heat are concentrated in the city centre or town centre. Together with the present order numbers, 4 groups of grids are defined. Different suggestions on optimizing location of car-sharing stations are given for each group. Both operator F and operator H are advised to build stations in the absent grids with high demand heat and close or remove part of stations in the grids with low demand heat and low order heat. Operator H is also advised to build stations in the north-west of Chengdu where high demand exists and operator F has low operation efficiency. However, there are still some limitations in our research.
(i) First, Chengdu is a competitive market with more than five car-sharing operators, of which only two are considered in our research. The suggestions may be less reliable because the other operators are not taken into consideration.
(ii) Second, our research is based on 500 m × 500 m square grids, which are suitable in the city centre but too small for other areas, since the building density is much lower in these areas.
(iii) Third, one important factor, the cost of building stations, is not considered in our research. The cost usually includes renting the parking space, building charging piles, and purchasing electric vehicles. The definition of high order heat and low order heat in our research is simply based on whether the order number reaches one. However, high-cost stations require high order numbers to recover the cost. Therefore, this definition should vary with the cost of building stations.
(iv) Fourth, the samples are chosen subjectively as the lower 20% and upper 20% of grids according to average daily order numbers. Other choices, such as the lower 30% and upper 30%, could be applied to include more observations. The definition of the 4 groups is also subjective. A more objective method such as clustering may be applied.
(v) Fifth, the models we use are all based on strong assumptions. More flexible classification algorithms, such as bagging, boosting, random forest, and Gaussian Process, could be applied.
(vi) Sixth, the order numbers only count where an order is placed and omit the return behaviour. One station may have few pick-up orders while having many drop-off orders, which is called usage imbalance. The decision on such a station should be made carefully.
(vii) Seventh, variable selection could be done before we train the classification models, since not all predictors are related to the response.
Furthermore, it is worth investigating the effect of adopting those suggestions by estimating the order numbers when a certain number of stations are added. One possible way is to first find the features of relationship between start and end grids and compute the intensity. Then for a grid with some new added stations, possible related grids can be found and the order number can also be estimated by intensity. Another possible way is to do transportation simulation where all of the transportation modes, the population, and the traffic operation are considered, which is a huge and challenging work.

C. Distributions of Predictors for Naive Bayes
See Figure 7.

Data Availability
All data included in this study are available upon request by contacting the author Yu Cheng (chengyu@shevdc.org). They will all be provided at the level of the grids defined in the research, since order data for a specific station are quite sensitive and private.

Disclosure
This research was presented at the conference "International Conference on Smart Mobility and Logistics in Future Cities", and presentation slides that include a brief introduction to this research were shared with that conference. This manuscript is also published on the main website of our organization for internal study and communication.