Statistical patterns of human mobility in emerging Bicycle Sharing Systems

The emerging Bicycle Sharing System (BSS) provides a new social microscope that allows us to “photograph” the main aspects of the society and to create a comprehensive picture of human mobility behavior in this new medium. BSS has been deployed in many major cities around the world as a short-distance trip supplement for public transportations and private vehicles. A unique value of the bike flow data generated by these BSSs is to understand the human mobility in a short-distance trip. This understanding of the population on short-distance trip is lacking, limiting our capacity in management and operation of BSSs. Many existing operations research and management methods for BSS impose assumptions that emphasize statistical simplicity and homogeneity. Therefore, a deep understanding of the statistical patterns embedded in the bike flow data is an urgent and overriding issue to inform decision-makings for a variety of problems including traffic prediction, station placement, bike reallocation, and anomaly detection. In this paper, we aim to conduct a comprehensive analysis of the bike flow data using two large datasets collected in Chicago and Hangzhou over months. Our analysis reveals intrinsic structures of the bike flow data and regularities in both spatial and temporal scales such as a community structure and a taxonomy of the eigen-bike-flows.


Introduction
Understanding human mobility pattern is a longstanding scientific pursuit of mankind [1][2][3][4][5]. Many new data resource, such as GPS trajectory [6][7][8] and mobile phone data [3,9,10], are nowadays powerful social microscopes that bring new opportunities for us to study human mobility in new mediums, allowing to "photograph" the main aspects of the society and to create a comprehensive picture of human mobility behavior. Recently, the Bicycle Sharing System (BSS) has been spreading over 1,000 cities around the world [11] as a powerful approach to improve the first/last mile connection to other transportations. Comparing with the first-generation BSS such as the White Bicycle Plan deployed in Amsterdam in 1960s, the third generation BSS highlights the integration of information technology that enables users to borrow PLOS  bike from any station and return the bike to any station in a city. As nowadays the trips could be automatically recorded, this data provide a great opportunity to understand the human mobility in a short-distance trip, which could lead to better management and operation of BSSs in traffic prediction [12,13], station placement [14][15][16], usage pattern analysis [17,18], bike reallocation [19,20], and inventory management [21,22], all are crucial aspects to better manage BSSs to meet the population's dynamic needs. Generally, there are two different schools of approaches to analyze the data of BSS. One considers individual's trip as a basic study object. For example, to provide an efficient service schedule for bike reallocation, Zhang et al. [18] considered a trip destination and duration prediction model on the individual level. Chen et al. [14] formulated the bike station placement issue as a bike trip demand prediction problem. Zhang and Yu [12] studied a trip route planning problem for individuals. Studying one trip information leads to analytical tractability, however, methodologies developed from this perspective would find limitations when considering decision-makings on the system level involving all stations and all users over time. Thus, another type of approaches aggregate individual's trips in a time window (commonly refereed as bike flows, as a bike flow is the collection of all trips from an ingress station to an egress station in a time window). For instances, Li et al. [13] provided a hierarchical prediction model to predict the bike flows that will be rent from/returned to each station in a future period so that reallocation of imbalance bikes can be executed in advance. Etienne and Latifa [17] proposed a model-based clustering algorithm to classify bike stations for efficient management.
In this paper, we focus on the bike flow data as it provides system-level information. Comparing with other existing works that analyzed the bike flow data, we notice that most of the existing works largely focus on prediction using the bike flow data rather than inquiring the data for extracting system-level statistical patterns. Probably because of this, none of them aimed to conduct a delicate analysis of the variation structure in the bike flow data. On the other hand, a series of challenges arise for analyzing the bike flow data. First, it has been found that the bike flow data is very noisy, showing an intrinsic uncertainty structure in both spatial and temporal domains. This often raises up the concern of how much regularity (which then determines predictability) is embedded in the BSS data as the bikes are shared by massive users all over the city, not to mention other uncontrollable conditions such as weather, transportation infrastructure, daily transportation conditions, demographics, and geographical disparities. Second, speaking of the bike flow data as a statistical object, significant dependence has been observed among the bike flows. The dependence makes the analysis of bike flow data difficult since many classical models assume independent assumption. Third, bike flows have a high-dimensional structure. Consider a BSS with N bike stations, there are O(N 2 ) bike flows. The high dimensionality and dependency of the bike flows present major difficulties for statistical analysis.
To overcome the aforementioned challenges, we realize that a crucial step is to decide on what spatial scale the data should be analyzed. Solid evidences are identified in our study that we should first use clustering approach to detect the community structure among stations, and then, build the analysis on these clusters rather than on individual stations. By aggregation of the bike flows in or between detected communities (called as aggregate bike flows (ABFs)), not only the number of bike flows is reduced into a manageable size, but also the statistical regularity embedded in the bike flow data is sharpened which can be statistically articulated by the Principal Component Analysis (PCA). Both hierarchical clustering and PCA are not modelbased methods, so they do not rely on the independent assumption of bike flows. In this paper, we show that this assembly of statistical analysis pipeline could reveal interesting city-wide statistical patterns on both datasets from Chicago and Hangzhou. Note that, in this paper, we use lower-case letters, e.g., x, to represent scalars, bold-face lower-case letters, e.g., x, to represent vectors, and bold-face upper-case letters, e.g., X, to represent matrices.

Data description
The two datasets used in this analysis were provided by Chicago and Hangzhou BSSs (see S1 and S2 Datasets). In the two datasets, each trip records the user ID, the trip start and end time, and the origin and destination stations. In the Chicago data, we only focus on regular subscribers of the system that form the majority of the users. Table 1 shows detailed descriptions of the two datasets.
As shown in Fig 1(a), stations of both BSSs are usually located next to each other, forming a certain spatial pattern. Also, it could be seen that the number of trips per day exhibits a mix of regularity and uncertainty, as shown in Fig 1(b). For instance, the amount of trips of Hangzhou BSS is mostly stable, but could be occasionally small such as the amount of trips on the day of 7th Oct, 2013. Because this was the last day of National Holiday in China, many tourists were leaving Hangzhou city and many local customers were resting at home. This phenomenon is called short-lived property of BSS that has been reported in the literature [13,18]. In   average number of trips of each day in one week, respectively. It is observed that users in Chicago like using the bike in workday and rush hours, indicating that those users may have found usage of the BSS for home-workplace commutes. The average number of trips of Hangzhou BSS is relatively stable over days in one week, and have two peaks in the rush hours of one day.

The community structure
In many real-world networks, it is common that some nodes in the network would be recognized as hubs that either connect with many other nodes or contribute substantially to the network activities. It is interesting that in the bike flow data from both cities, we didn't identify significant hub stations that can account for the amount of bike flow traffic that is significantly larger than average. To show that, we present the cumulative distribution functions (CDFs) of the number of trips in the bike stations in Fig 2. It indicates that, to account for 80% of the total trip records, it took 36% of the stations for Chicago and 42% of the stations for Hangzhou. Although we didn't discover significant hubs, we identified a community structure of the stations [23,24], i.e., showing a pattern that there are many bike flows within the stations in the same cluster but less bike flows between stations in different clusters. Specifically, we used the classical hierarchical clustering method to detect the community structure. Assume that denotes the total number of bike flows between the i-th and j-th stations in the t-th time epoch, T is the total number of time epochs, and N is the number of stations. Also, denote the distance at the t-th time epoch between the i-th and j-th stations as K is the number of clusters, and I(i 2 C k , j 2 C l ) is the indicator function that equals to one only if the i-th station is in the k-th cluster (denoted as C k ) and the j-th station is in the l-th community (denoted as C l ). Denotẽ The columns x p , p = 1, . . ., P of X are the time series of the p-th bike flow in a BSS that is called aggregate bike flows(ABFs). By applying PCA to X, a low intrinsic dimensionality of the ABFs could be found in both BSS datasets, as shown in the scree plots in Fig 4. This indicates that a vast majority of the temporal variability of the ABFs is contributed by the first few eigen-bike-flows (around 5), which is much lower than the number of ABFs. As shown in Fig 5, we randomly select two ABFs and show that the two ABFs can be sufficiently approximated by the top 5 eigen-bikeflows. This observation could be generalized on all ABFs, as shown in Fig 6 that the relative reconstruction errors (RRE) via the first k eigen-bike-flows decrease dramatically as k

Taxonomy of the eigen-bike-flows
The aforementioned analysis of the ABF data emphasizes the central role of eigen-bike-flows in understanding the ABFs. It seems that the eigen-bike-flows can be divided into two categories: deterministic eigen-bike-flows (d-flows) and spike eigen-bike-flows (s-flows). To show this, randomly selected d-flows and s-flows from the two BSS data sets are shown in  To detect the d-flows, we conduct Fourier analysis of the eigen-bike-flows. The fourth column of Fig 7 shows that the spectrum of the selected d-flows all exhibit a significant standalone peak at twenty-four hours. To find the s-flows, the 5-sigma rule can be employed: whether there is a point whose distance from the mean exceeds 5 times standard deviations. The category of each eigen-bike-flow can be determined according to the criteria aforementioned. However, there are eigen-bike-flows who belong to more than one category. To overcome this contradiction, we define the d-flows as the eigen-bike-flows that have a significant standalone peak in the spectrum regardless the existence of the spikes.

Statistical representational power of the eigen-bike-flows
As shown in the Method Section, each ABF can be reconstructed as a weighted sum of eigenbike-flows (e.g., see Eq 6). Particularly, each row of the principal matrix V specifies the extent to which each eigen-bike-flow contributes to the corresponding ABF. Thus, we are interested to examine the rows of V to discern the structure of the ABFs. Particularly, for an ABF, the entries of the corresponding row of V whose magnitudes are remarkably larger than a threshold indicate the significant eigen-bike-flows that constitute the ABF. Here, we set the threshold as 1= ffiffiffi P p , i.e., this is because that, in an extreme situation that all the eigen-bike-flows contribute to one ABF equally, all the entries of the corresponding row of V will be 1= ffiffiffi P p due to the unit norm constrain of the columns of V.
Furthermore, CDFs of the number of entries which exceed 1= ffiffiffi P p in their magnitudes are shown in Fig 9. The figure indicates that overall each ABF only has a small set of constitutional eigen-bike-flows. We further show the entries of V whose magnitudes exceed the threshold for the two data sets in Fig 10, after the rows of V are sorted by the variance of their corresponding ABFs. The top rows in each plot indicate the eigen-bike-flows that are significant in forming the strongest ABFs, and the bottom rows show the significant eigen-bike-flows for the weakest ABFs. Two interesting observations can be drew. First, from the vertical direction, the elements of one ABF are clustered in a small region. Second, from the horizontal direction, the elements of the ABF with larger variance are mainly top eigen-bike-flows, while the ones with smaller variance are mainly less significant eigen-bike-flows.

Temporal stability of the bike flow structure
It is of interest to see if the bike flow structure revealed in aforementioned sections could remain stable over time. To examine its temporal stability, here, we divide the measurement matrix X 2 R TÂP into X 1 2 R T 1 ÂP and X 2 2 R T 2 ÂP where T = T 1 + T 2 . Then, we apply PCA on X 1 and use the obtained eigen-structure to predict X 2 . Our rationale is that, if the eigen-structure learned from X 1 is stable, then it could show significant prediction power for X 2 . Details of how we could leverage the eigen-structure learned from X 1 to predict X 2 in shown in the Section Methods. The performance of the one-step prediction of X 2 is shown in Fig 11, which shows the root mean square error (RMSE) per ABF in X 2 , while the ABFs are ordered with decreasing variances from left to right and T 1 = T 2 = T/2. From Fig 11, it can be seen that the eigen-structure learned from X 1 could lead to accurate prediction of X 2 . Accurately forecasting the ABFs will no doubt benefit many decision-makings for managing the BSSs such as station placement [14,15] and bike reallocation [19,20]. The commonly accepted approaches in bike flow forecasting consider each ABF as a time series, and then, use some time series models such as the Autoregressive Integrated Moving Average (ARIMA for short) method [25] to predict the bike flows. Here, Fig 11 also shows that the performance of the PCA-based prediction model is better than ARIMA model for most ABFs of X 2 .

Discussion
The emerging bike sharing systems provide a new data source for us to understand human mobility. A unique value of this new data, comparing with existing mobility datasets such as GPS trajectory [6][7][8] and mobile phone data [3,9,10], lies on its characterization of human mobiliy in short-distance trips. Thus, as a deep understanding of the population on short-distance trip is currently lacking, we conduct a systematic analysis of the bike flow data collected in two major cities in the United States and China. Understanding the statistical characteristics of bike flow data holds great potential to develop informed decision-makings for better traffic prediction, infrastructure design such as station placement, real-time bike reallocation, and inventory management.
By analyzing the bike flow datasets from the two cities, we found statistical regularities underlying the irregular surface of the bike flow data. Our basic approach is inspired by the recognition of the spatial organizing principles of the bike flow, such that stations could be first clustered into distinct clusters. Then, by using PCA to analyze the aggregated bike flow data on the cluster level, we could identify a taxonomy of constituting eigen-flows that correspond to routine and outburst in the bike flow, i.e., (i) deterministic eigen-bike-flows, which capture the periodic trends that have been reported in previous works of other bike flow data of similar nature [13,18]; (ii) spike eigen-bike-flows, which capture the occasional shortlived bursts [13,18] in BSSs. Besides the interpretability of the eigen-flows, we also find out that with a small set of eigen-flows, the ABFs could be accurately reconstructed, demonstrating their statistical significance and efficiency. We further study the temporal stability of the eigenstructure by using it to predict on unseen bike flow. Thus, although irregularity could be observed from the surface of the data, regularity emerges when looking into the spatial structure embedded in data. Further, on top of the spatial structure, temporal regularity could also be detected. On what level we should interrogate the data and ask what questions seems to be a crucial precondition for us to properly understand the data and translate that understanding into better decision makings. In contrast to some popular assumptions made in some operations research and management methods for BSS that emphasize statistical simplicity and homogeneity, our analysis reveals far more intrinsic structures, heterogeneous patterns, and statistical complexities in both spatial and temporal scales in the bike flow data. Thus, our study anticipates further development of more realistic operations research and management methods which could optimize their performances to account for the unique statistical characteristics embedded in the BSS data. On the other hand, comparing with other existing works that used the bike flow data, we notice that most of the existing works largely focus on prediction using the bike flow data rather than inquiring the data for extracting system-level statistical patterns. Probably because of this, none of them aimed to conduct a delicate analysis of the variation structure in the bike flow data. One exception [13] in these prediction works exploited the idea of first spatially clustering the stations, and then, predicting on the combined bike flows very much like the ABF in our paper. While this study showed positive evidences to backup our finding, it was motivated by gaining prediction accuracy rather than a systematic revelation of the spatial structure and temporal eigen-structure in our paper. In summery, this paper is, to the best of our knowledge, the first attempt to comprehensively investigate the human mobility patterns in short-distance trips, characterized by their manifestation on bike sharing systems. Consistent patterns have been discovered from datasets collected in two major cities in the US and China, implying that these patterns may represent universal conditions that shape the bike flow activities in real-world. This study has limitations. First, the methods used in this study, the classic hierarchical clustering and PCA methods, reveal interesting structures but also impose limitations on the structures they could identify. Particularly, as shown in the taxonomy of the eigen-bike-flows, some d-flows contain both periodic trends and spikes. This indicates the limitations of PCA and suggests that methods that can clearly separate the underlying signals could reveal further structures in the data. Second, while PCA is useful in analyzing the ABFs, more delicate time series analysis tools or signal processing methods could be used to study the dynamics embedded in the time series data. Last but not least, the two datasets used in this study may not fully present the complexity of the BSS data in other cities. While the observations made in this study are interesting and inspiring, this study lays the foundation for further inquiries such as traffic prediction, infrastructure design such as station placement, real-time bike reallocation, and inventory management.

Principle component analysis
PCA, as an unsupervised statistical learning method for studying the underlying structure in complex data, has been used for coordinate transformation and dimension reduction tasks [26,27]. It maps the original data onto a new set of axes via coordinate transformation. The new axes are referred to principal components that point to the directions with the largest variance or energy in the data. Under the assumption that the most important structure exists along the new coordinate with the largest variance, the first few principal components may well capture the concerned structure in complex data. Due to the superiority of PCA, it has been widely used in many scientific fields, such as eigenfaces for recognition [28], network traffic analysis [29] and human mobility modeling [30]. Human mobility in emerging Bicycle Sharing Systems Let X 2 R TÂP be the measurement matrix. The p-th column denotes the p-th ABF and the t-th row represents an instance of all the ABFs at time t. Deriving the principal components is to solve the eigen-decomposition problem for matrix X > X, because X > X measures the covariance between ABFs. The mathematical formulation is where λ p is the p-th eigenvalue corresponding to eigenvector v p . Since X > X is symmetric and semidefinite, its eigenvectors fv p g P p¼1 are orthogonal and the corresponding eigenvalues fl p g P p¼1 are nonnegative. It is required that the eigenvectors fv p g P p¼1 have unit norm. Also, the eigenvalues are arranged from largest to smallest, i.e., λ 1 ! λ 2 ! . . . ! λ P ! 0. If eigenvalues fl p g P p¼1 have r nonzero values, then the rank of X is r. It is well known that fv p g r p¼1 are the principal components of X. According to (1), where V = [v 1 , . . ., v r ] and Λ = diag(λ 1 , . . ., λ r ). Calculating principal components actually is intimately related to Singular Value Decomposition (SVD) [31]. SVD is matrix decomposition tool that can be expressed in the form of matrix multiplication as where U = [u 1 , . . ., u r ], u > i u j ¼ 0 for i 6 ¼ j and S = diag(σ 1 , . . ., σ r ) is an r × r diagonal matrix with singular values fs p g r p¼1 on the diagonal. Therefore, Comparing with (2), we find that l p ¼ s 2 p . Furthermore, based on (3), it has u p ¼ Xv p =s p ; p ¼ 1; . . . ; r; ð5Þ and x p ¼ s p UðV > Þ p ; p ¼ 1; . . . ; P; ð6Þ where u p , p = 1, . . ., r are vectors of size T and orthogonal by construction, and (V > ) p is the pth row of V. The Eq (5) indicates that all the ABFs can be transformed into a new coordinate with weights v p . u p captures the temporal variation common to all flows along principal axis p. Specifically, u 1 captures the strongest temporal trend common to all ABFs, u 2 captures the second strongest, and so on so forth. The Eq (6) shows that each ABF is in turn a linear combination of the eigen-bike-flows, weighted by (V > ) p . Using SVD, a low-rank approximation matrix of X can be constructed as follows. The approximation form isX whereX k is an approximation of X with rank k < r, U k and V k are the first k columns of U and V, respectively, and S k is the top-left part of S of size k. The low-rank approximation of X is actually a dimension reduction approach via PCA with the form Writing -review & editing: Xiangyu Chang, Shuai Huang.