Computers, Environment and Urban Systems

Short-term demand prediction is important for managing transportation infrastructure, particularly in times of disruption, or around new developments. Many bike-sharing schemes face the challenges of managing service provision and bike fleet rebalancing due to the "tidal flows" of travel and use. For them, it is crucial to have precise predictions of travel demand at fine spatiotemporal granularities. Despite recent advances in machine learning approaches (e.g. deep neural networks) and in short-term traffic demand predictions, relatively few studies have examined this issue using a feature engineering approach to inform model selection. This research extracts novel time-lagged variables describing graph structures and flow interactions from real-world bike usage datasets, including graph node Out-strength, In-strength, Out-degree, In-degree and PageRank. These are used as inputs to different machine learning algorithms to predict short-term bike demand. The results of the experiments indicate the graph-based attributes to be more important in demand prediction than more commonly used meteorological information. The results from the different machine learning approaches (XGBoost, MLP, LSTM) improve when time-lagged graph information is included. Deep neural networks were found to be better able to handle the sequences of the time-lagged graph variables than other approaches, resulting in more accurate forecasting. Thus incorporating graph-based features can improve understanding and modelling of demand patterns in urban areas, supporting bike-sharing schemes and promoting sustainable transport. The proposed approach can be extended into many existing models using spatial data and can be readily transferred to other applications for predicting dynamics in mass transit systems. A number of limitations and areas of further work are discussed.


Introduction
Research has shown that bike-sharing contributes to improved air quality and reduced congestion in cities as a part of a sustainable travel infrastructure (Lovelace & Philips, 2014;Shaheen, Guzman, & Zhang, 2010). Its global popularity has increased in the last few years due to advantages in both cost and convenience over other forms of transport such as cars. A growing number of cities have operated such schemes to promote sustainable mobility, such as Santander Bikes (London), Citi Bikes (New York) and more advanced dock-less systems (e.g. Mobike in Chinese cities). Bike-sharing schemes provide a key component of urban transportation infrastructures by providing an "extension service" for the "first/last mile" from other public transport hubs (Ma, Liu, & Erdoğan, 2015;Saberi, Ghamami, Gu, Shojaei, & Fishman, 2018;Shaheen et al., 2010).
While bike-sharing greatly enhances urban mobility as an affordable and sustainable traffic mode (Fishman, 2016), meeting the demand of users poses a challenge to scheme operators. This is due to the "tidal flows" of bike-sharing trips, with certain areas in the city facing the problem of insufficient bikes (Beecham, Wood, & Bowerman, 2014). For example, during the morning rush hour, the number of commuting trips departing from residential areas will be high, potentially leading to a deficit of available bikes in those areas. This results in reduced service reliability and reduced user satisfaction (Fishman, 2016;O'Brien, Cheshire, & Batty, 2014). Accurate and up-to-date estimations of travel demands across the city over the course of the day are crucial for successful bike scheme management and fleet rebalancing. This also has attracted a lot of research interest in recent years.
Researchers have used a combination of statistical models, machine learning and, more recently, deep learning neural networks to forecast short-term travel demands (Karlaftis & Vlahogianni, 2011; Lin, He, & Peeta, 2018; Vlahogianni, Karlaftis, & Golias, 2014).
https://doi.org/10.1016/j.compenvurbsys.2020.101521 Received 8 January 2020; Received in revised form 25 June 2020; Accepted 26 June 2020
There are two conventional approaches for dealing with dock-based bike-sharing travel demand forecasting problems: predicting at an individual station level or over aggregated groups or areas. The former approach models dynamics at each station (Lin et al., 2018), while the latter focuses on regional dynamics (Xu, Ying, Wu, & Lin, 2013). Station-level modelling can support bike-fleet management at finer spatial granularities, but can be less accurate due to higher levels of noise in the data. Many studies (Li, Zheng, Zhang, & Chen, 2015) attempt to predict demand over small geographical areas for the following reasons. Firstly, the set of bike docking stations in an urban area changes over long periods: new stations may be added, and existing stations removed or relocated. Analysing small clusters of stations allows local travel dynamics to be captured and supports a deeper understanding of these dynamics (Li et al., 2015). Secondly, the emergence and rise of dockless bike-sharing may change the nature of bike-sharing in the future. Dockless schemes allow individuals to borrow and return bikes at any location, rather than at fixed docking stations; this makes it both challenging and important to understand travel demand at the small area level (Cao & Shen, 2019; Yang, Heppenstall, Turner, & Comber, 2019). Finally, grouping stations into small area-based clusters supports bike fleet management regardless of the scheme type, with sufficient spatial grain to support rebalancing (Li et al., 2015).
A broad range of data-driven models have been proposed to forecast short-term travel demand in bike-sharing systems and other transportation systems such as the metro, buses and taxis (Vlahogianni et al., 2014). These can be categorised into parametric statistical models, and nonparametric machine learning (ML) approaches (Zhang, Cheng, & Ren, 2019). Some examples of the former group include ARIMA (Autoregressive Integrated Moving Average model) and its variants (e.g. ARIMAX, seasonal ARIMA) and Bayesian Networks (Froehlich, Neumann, & Oliver, 2009). Statistical models are easier to interpret but may have lower prediction accuracies when compared to ML models. Karlaftis and Vlahogianni (2011) observed a trend of research moving from statistical models to ML models as a result of both increased data accessibility and computing power. Different ML models have been applied to forecast short-term traffic demand, such as support vector regression (Xu et al., 2013) and Regression Trees (Li et al., 2015). More recently, deep neural networks have attracted significant research interest due to their automatic feature extraction capacity and their success in handling temporal, spatial and semantic dependencies.
Temporal dependencies include snapshots of historical relationships, and have been widely used for traffic demand prediction problems (Froehlich et al., 2009; Giot & Cherrier, 2014; Li & Shuai, 2018). For example, useful travel demand information is retained from the last few hours to suggest demand intensity trends. Deep neural networks such as Recurrent Neural Networks (RNNs) provide powerful tools for dealing with sequential information, and are suitable for analysing temporal dependencies. These recurrently connect hidden layers across different timestamps, identifying sequential characteristics and patterns that are then used to predict the next likely scenario. LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) networks, both enhanced forms of RNNs, have been used to predict travel demand (Fu, Zhang, & Li, 2016; Xu, Ji, & Liu, 2018). They are able to overcome the "vanishing gradients" problem common in neural networks. This occurs when gradients of the loss function approach zero, making the neural network hard to train, and commonly happens when processing long-term temporal dependencies with standard RNNs.
The idea of spatial dependencies (Tobler, 1970) suggests that information from nearby locations can contribute to improved forecasting. Some studies (Ke, Zheng, Yang, & Chen, 2017) have applied Convolutional Neural Networks (CNNs) to capture spatial dependencies in traffic demand forecasting. CNNs were initially designed for the analysis of gridded data, such as images. They capture spatial dependencies between grid locations using localised filters or kernels. Previous research (Ke et al., 2017;Zhang, Zheng, & Qi, 2017) using this approach to analyse travel demand divided urban areas into two-dimensional grid cells and calculated the demand across each grid, with demand intensity represented as colour scales. However, the selection of grid size is critical and difficult to determine objectively: if the grid is too coarse, it will fail to capture sufficient spatial granularity to support bike fleet management. If it is too fine, then the computational burden increases significantly due to the large image-like matrices containing redundant information (grid cells with zero demand).
More recent studies have used semantic dependence. Semantically similar areas may not be contiguous or near each other. For example, bike stations located in two distant residential areas may have similar temporal patterns of travel demand. Characterising semantic dependencies from such similar areas may improve model performance. Some research has quantified the similarity of historical travel demand sequences over different sites and constructed semantic graphs to connect similar places (Hoang et al., 2016;Yao, Wu, et al., 2018). Lin et al. (2018) applied Multi-Graph Convolutional Neural Networks (MGCNN) to capture pairwise relations between bike stations, using spatial and semantic graphs to provide multi-graph embedding. However, the pre-processing requirements of capturing demand sequence similarities for MGCNNs are heavy, requiring at least one year's historical data to obtain a good prediction accuracy in bike demand forecasting (Chai, Wang, & Yang, 2018;Lin et al., 2018). This leads to limitations for analyses of sites and systems with insufficient historical travel records, for example, when new service stations or areas are introduced into a bike-sharing scheme.
Y. Yang, et al. Computers, Environment and Urban Systems 83 (2020) 101521
Outside of the deep neural networks family, XGBoost (Chen & Guestrin, 2016), an implementation of gradient boosted decision/regression trees, has been found to perform well in transport prediction problems and was the winner of the Kaggle bike-sharing prediction competition (Kaggle, 2015). Some research has compared XGBoost to neural networks (Lin et al., 2018; Ma, Guo, Guo, & Guo, 2019; Yao, Tang, Wei, Zheng, & Li, 2019; Yao, Wu, et al., 2018), and most of these studies suggest that XGBoost is capable of obtaining better or similar performance in travel demand forecasting when compared to RNNs (LSTM, GRU), CNNs and hybrid neural networks such as ConvLSTM (convolutional LSTM) and ST-ResNet (Deep Spatio-Temporal Residual Network). XGBoost has also been found to have comparable performance to MGCNN, producing better predictions than MGCNN in 50% of datasets. However, XGBoost may be inferior to some state-of-the-art fusion deep neural networks, such as Spatio-Temporal U-shape Networks and Spatial-Temporal Dynamic Networks (Yao et al., 2018). Despite the intense competition among complex algorithms, whether one model outperforms the others is questionable. Li et al. (2019) compared various models for traffic demand forecasting, and concluded that a universally best model does not exist. When considering different specific areas and timestamps, several algorithms (e.g. LSTM and XGBoost) may offer better solutions depending on the nature of the spatial and temporal variables. Therefore, it could be beneficial to combine the prediction results of different models (KDD-Cup, 2017). There are also reproducibility issues in the literature; for example, ST-ResNet has been reported to outperform XGBoost, whereas other studies (Ma et al., 2019; Yao et al., 2019) show contrasting results. MGCNN shows advances in predicting dynamics in planar networks (e.g. road networks) that have a clear concept of graph construction. However, for non-planar networks such as origin and destination graphs (e.g. a bike-sharing graph), the connections between nodes (docking stations) are subject to several factors, including time-series similarity and distance to each other. They rely on an arbitrary and ambiguous choice of threshold (e.g. proximity, similarity, consistency), as well as specific preprocessing (e.g. removing less-used stations) (Chai et al., 2018; Lin et al., 2018). These lead to reproducibility issues to some extent. For example, MGCNN has been found to perform worse than XGBoost on several datasets, while Lin et al. (2018) suggested a better performance on New York bike-sharing data. Differences in results may be due to the complexity of hyperparameter tuning in deep neural networks, varied model performance on different datasets, different preprocessing or unfair comparisons (Karlaftis & Vlahogianni, 2011). This makes the "best" models even more of a challenge to identify.
Overall, short-term traffic forecasting is a highly dynamic and developing research arena with an ever-growing literature that has mainly focused on testing and comparing the performance of alternative models (Vlahogianni et al., 2014). This focus on models has left other vital questions relatively unaddressed, for example, consideration of what kinds of variables should be included in models. The performance of a predictive model is not only associated with its generalisation ability but also its dependency on the input data and features (Hall & Smith, 1998). Deep learning neural networks require less effort to manually extract features from raw data (Goodfellow, Bengio, & Courville, 2016; Lin et al., 2018), but may still benefit from effective feature engineering, especially when the size of the training dataset is limited (Ketkar, 2017). Research has suggested that short-term traffic demand can be inferred from its spatiotemporal properties (e.g. historical travel demand) but may also benefit from other explanatory variables (Ke et al., 2017). However, there is only limited insight into the nature and direction of feature engineering, with studies generally using temporal features (e.g. time of day, day of the week) and meteorological features (e.g. temperature) to forecast travel demand (Giot & Cherrier, 2014; Lin et al., 2018; Salaken, Hosen, Khosravi, & Nahavandi, 2015). For example, the work of Yang et al. (2016) suggests that the average number of trips on weekdays is relatively smaller than at weekends (with the pattern being the opposite for stations in residential areas). Both day of week and calendar events (Kim, 2018) are informative for modelling trip demand. Meteorological factors have a huge influence on user behaviours in bike-sharing systems, and good weather is strongly correlated with higher trip numbers (Kim, 2018; Yang et al., 2016).
In particular, temperature has been included in many studies and identified as a useful feature for predicting bike trip demand in various cities and regions (e.g. America, Asia, Europe) under different climates and cultural backgrounds (Rudloff & Lackner, 2013; Li et al., 2015; Salaken et al., 2015; Yang et al., 2016). Some studies have also used urban context such as land-use, Points of Interest (POI) (Tran, Ovtracht, & D'arcier, 2015; Xu et al., 2018) and event information (e.g. metro delays, concerts) (Rodrigues, Markou, & Pereira, 2019) to improve forecasts. The work of Xu et al. (2018) suggests that land-use information derived from POI is not as helpful as meteorological features, but can still enhance prediction performance for neural network models. However, these are data enrichment approaches, requiring data from other sources (e.g. POI, textual data from Twitter), some of which are relatively difficult to obtain, process and merge into models. This leaves an important question: is it possible to derive additional useful information from the flow data itself, such as bike travel records, to improve the prediction performance further? In machine learning, feature engineering is the process of using domain knowledge to extract and transform raw data into explanatory features. The result is that ML algorithms are better able to detect patterns in input data, leading to better outcomes. As yet, relatively little research in this area has considered what features can be derived from raw travel data using domain knowledge, and whether they can improve different traffic prediction models. Here we examine the graph structures present in bike travel records.
Graph theory has been successfully applied to analyse urban phenomena such as polycentric transformation, urban resilience, infrastructure updates and mobility change, and to understand urban flows such as travel (Batty, 2013; Yang et al., 2019; Zhong, Arisona, Huang, Batty, & Schmitt, 2014). Graph structures describing the spatial and temporal patterns of travel flows may be used for interpreting urban dynamics as well as for traffic demand prediction (Zhang et al., 2017). Austwick, O'Brien, Strano, and Viana (2013) examined bike-sharing systems in different cities and highlighted the use of graph analysis for understanding urban flow in spatial systems, whilst Zhang et al. (2017) argued that historical regional inflows are related to outflows. Generally, studies examining short-term traffic demand forecasting have not fully exploited inflow interactions. The current state of the art in this area uses historical demand and common environmental variables (e.g. temperature) to predict future demand (Feng, Chen, Du, Li, & Jing, 2018; Li et al., 2015; Li & Axhausen, 2019; Li & Shuai, 2018; Lin et al., 2018; Xu et al., 2013; Yao, Wu, et al., 2018). There are many kinds of graph information (e.g. degree, PageRank) that can be derived from bike travel data to describe flow interactions and to characterise the different urban places within the graph, for example, to infer the likelihoods of bike trips starting from specific regions. The utility of spatio-temporal graph properties to support short-term bike-sharing demand prediction has not been evaluated, and the research described in this paper starts to address this.

Study area and data
This study uses dock-based bike-sharing data from two cities to ensure the findings are not exclusive to a specific case: the New York Citi Bike and Chicago Divvy bike schemes, as shown in Table 1. The datasets cover one year and contain variables describing bike trip departure and end time, departure station and end station. Corresponding hourly meteorological data were obtained from OpenWeatherMap (https://openweathermap.org/), and the variables included temperature, humidity, wind speed, pressure and weather description (e.g. cloudy, light rain).

Station groups
This study predicts regional (small area) demand and groups stations based on their spatial proximity. A hierarchical clustering method was applied to cluster stations into 120 and 80 groups in the New York and Chicago data, respectively. The choice of the number of clusters k is arbitrary and usually depends on knowledge of the study area (Li et al., 2015). Here, values of k were chosen to generate groups consisting of roughly 6 or 7 stations on average (Table 1 shows the total number of stations). Fig. 1(a, b) shows the groups of stations (small areas), where the shading and plot characters indicate different clusters.
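As an illustration of grouping stations by spatial proximity, the following is a minimal sketch of agglomerative (bottom-up hierarchical) clustering that merges the two closest groups until k groups remain. The centroid linkage, the Euclidean distance on projected coordinates, the function name `cluster_stations` and the toy coordinates are all assumptions made for this example; the linkage criterion and distance measure used in the study are not stated in the text.

```python
import math


def cluster_stations(coords, k):
    """Merge the two closest clusters (by centroid distance) until k remain.

    coords: list of (x, y) station positions in projected units (assumed).
    Returns one integer cluster label per station.
    """
    # Every station starts in its own cluster (stored as a list of indices).
    clusters = [[i] for i in range(len(coords))]

    def centroid(members):
        xs = [coords[i][0] for i in members]
        ys = [coords[i][1] for i in members]
        return (sum(xs) / len(xs), sum(ys) / len(ys))

    while len(clusters) > k:
        best = None  # (distance, index_a, index_b) of the closest pair
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = math.dist(centroid(clusters[a]), centroid(clusters[b]))
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]                          # b > a, so index a is unchanged

    labels = [0] * len(coords)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical station coordinates in metres: two well-separated groups.
stations = [(0, 0), (50, 10), (20, 40), (1000, 1000), (1040, 980), (990, 1030)]
labels = cluster_stations(stations, k=2)
```

This brute-force version is only practical for small inputs; for several hundred stations an optimised library implementation of hierarchical clustering would normally be preferred.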

Travel flow graph structure construction
Graph theory is a mathematical approach for modelling pairwise relations between individuals. A graph structure typically consists of observations represented by nodes or vertices and their relationships represented by links or edges (although this can be reversed). A system formed of nodes and links that are interconnected is termed a graph. In urban and transport studies, public transportation systems have been viewed as complex networks (Saberi et al., 2018;Yang et al., 2019;Zhong et al., 2014) and represented as graphs in order to generate different scale-free graph-based measures pertaining to the network.
Generally, transportation hubs and urban areas are regarded as graph nodes, and the travel flows between a pair of nodes generate links to connect them. Analysis of the network flows between nodes and their changes, for example over time, provides insights into spatiotemporal mobility characteristics in transportation systems. Saberi et al. (2018) used graph-based analysis to examine the impact of public transit disruptions on bike-sharing usage and travel behaviours.
In this study, hourly graph structures were constructed from bike trip records. Each group of bike stations was cast as a node, and the volume of hourly bike trips between any two nodes was used to generate edges representing the origin-destination flows between them. This resulted in a series of temporally weighted and directed graph structures, from which a number of graph properties were calculated, describing the state of each node at different times.
(1) Strength: the total of the edge weights. In a directed and weighted graph structure, there are two strength measures, in-strength and out-strength. Here they represent the number of trips that end at and start from a node in the network, respectively. Out-strength can also be interpreted as the number of departures, i.e. travel demand.
(2) Degree: the number of edges that are incident to the node, indicating the number of neighbouring nodes. In-degree and out-degree account for the number of in-flow and out-flow links in a directed graph. A node is considered important if it is connected to many neighbours, and for urban mobility networks, the degree can be used to describe the connectivity and accessibility to destinations or activities across the network (Zhong et al., 2014).
(3) PageRank: a measure of node importance. This was first introduced by Google to evaluate the importance of a web page (Brin & Page, 1998). The key idea behind PageRank in a graph context is that nodes with the same degree may not have the same importance in a graph. By not counting links from other nodes equally, PageRank treats an edge from a strongly connected node as more important than an edge linked to a node with few connections. Assume graph node A has incoming edges from nodes $T_1, \dots, T_n$, the parameter $d$ is a damping factor (0.85 as the default value), and $C(T_i)$ is defined as the out-degree of node $T_i$. The PageRank (PR) of a node A is denoted as follows:
$$PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \dots + \frac{PR(T_n)}{C(T_n)} \right)$$
The PageRank of node A can thus be calculated using an iterative algorithm that corresponds to the principal eigenvector of the normalised link matrix of the graph (Brin & Page, 1998). Note that the PageRank forms a probability distribution over graph nodes, so that the sum of all nodes' PageRanks will be one. PageRank is an additional indicator of relative node importance and centrality in a graph. In a transportation network, PageRank can help to identify key nodes (places) in the system that have a high impact on transportation efficiency.
(4) Betweenness: the number of shortest paths that pass through a node. The greater the betweenness of a node, the more important it is (Newman, 2005). For each pair of nodes in a graph, there exists at least one shortest path between them. Node betweenness refers to the number of these shortest paths that pass through a node. Betweenness represents the extent to which nodes are connected, and indicates transfers from one area to another in a transport system. Although bike-sharing trips generally do not rely on, and are not impacted by, intermediate stations to reach the destination, they are still limited by station availability (available bikes and empty docks) to start or complete journeys. The work of Saberi et al. (2018) suggests that in bike-sharing systems, the probability and spatial distribution of betweenness changed in response to urban public transit failure. Furthermore, betweenness is helpful for examining system changes during special events and adverse weather conditions.
Fig. 2 (a) shows the map of the New York bike-sharing station groups, with each dot indicating the group's central position. Fig. 2 (b-e) gives examples of the graph information properties for each node, with the different properties normalised to [0,1] for visualization purposes. The redder and larger the plot character, the higher its value. This graph was constructed using 1 h (8:00 to 9:00 am on October 25, 2017) of bike-sharing travel data, which represents flow interactions in the morning rush hour. Fig. 2 (b) shows the out-strength, directly representing area travel demand intensities. Areas close to Grand Central Terminal (GCT) have the most bike trips, with high numbers of trips in surrounding areas (Midtown Manhattan). Fig. 2 (c) illustrates the distribution of out-degree, and suggests that different regions in Manhattan all have high levels of flow interactions, indicated by the number of neighbours linked by travel flows. Interestingly, the GCT region does not have the highest out-degree. This is because the trip destinations are less diverse during the selected period. Fig. 2 (d) shows the PageRank and has a similar pattern to Fig. 2 (b), emphasising the importance of GCT in the network. Fig. 2 (e) shows that betweenness has a different spatial pattern to the other figures (Fig. 2 b, c, d). It highlights the region of Williamsburg, located at the east side of the Williamsburg bridge. The high betweenness value indicates its crucial role as a bridge in the graph connecting different parts of the city (e.g. Manhattan and Brooklyn). The different graph properties describe the flows and their interactions in the graph structure, allowing the importance of each node to be characterized in different ways.
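The construction of an hourly directed, weighted graph and the strength, degree and PageRank properties described in this section can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `hourly_graph_features`, the trip tuples and the node labels are hypothetical. PageRank is computed by power iteration on the unweighted link structure using the common normalised variant, with a teleport term of (1 - d)/N and the score of nodes without out-links redistributed evenly, so that the scores sum to one. Trips within the same group are treated as self-loops, and betweenness is omitted for brevity.

```python
from collections import defaultdict


def hourly_graph_features(trips, nodes, d=0.85, iters=100):
    """Compute per-node graph properties for one hour of trips.

    trips: iterable of (origin_group, destination_group) pairs (hypothetical format).
    nodes: list of all group identifiers, including groups with no trips.
    """
    # Edge weight = number of trips from origin to destination in this hour.
    weight = defaultdict(int)
    for origin, dest in trips:
        weight[(origin, dest)] += 1

    out_strength = {n: 0 for n in nodes}
    in_strength = {n: 0 for n in nodes}
    out_degree = {n: 0 for n in nodes}
    in_degree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for (origin, dest), w in weight.items():
        out_strength[origin] += w   # departures from origin
        in_strength[dest] += w      # arrivals at destination
        out_degree[origin] += 1     # one distinct out-link
        in_degree[dest] += 1        # one distinct in-link
        successors[origin].append(dest)

    # PageRank by power iteration (normalised variant, assumed here).
    n = len(nodes)
    pr = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        dangling = sum(pr[v] for v in nodes if not successors[v])
        new = {v: (1 - d) / n + d * dangling / n for v in nodes}
        for v in nodes:
            for t in successors[v]:
                new[t] += d * pr[v] / len(successors[v])
        pr = new

    return {
        v: {
            "out_strength": out_strength[v],
            "in_strength": in_strength[v],
            "out_degree": out_degree[v],
            "in_degree": in_degree[v],
            "pagerank": pr[v],
        }
        for v in nodes
    }


# Hypothetical trips between three station groups during one hour.
trips = [("A", "B"), ("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]
feats = hourly_graph_features(trips, nodes=["A", "B", "C"])
```

In this toy hour, group A has an out-strength of 3 (three departures) but an out-degree of 2 (two distinct destinations), which illustrates why strength and degree can rank nodes differently.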

Feature importance
Various models can be used to evaluate feature importance for making predictions, for example, Random Forest and Support Vector Machine. Among these approaches, XGBoost (extreme gradient boosting) is a gradient boosted regression tree algorithm and has been found to be one of the most powerful models for predicting bike-sharing travel, both in the literature (Li & Axhausen, 2019; Lin et al., 2018) and in competitions (Kaggle, 2015; KDD-Cup, 2017). It has been shown to have a comparable (or better) performance to several advanced deep neural networks such as ST-ResNet (Ma et al., 2019; Yao et al., 2019). Another advantage of XGBoost is that its results are easily explainable: once the boosted trees are constructed, importance (i.e. gain) scores of each feature can readily be retrieved. The importance metric provides an evaluation of how useful or valuable each feature is, based on the degree to which a feature is used to make key decisions in trees. Therefore, this study used XGBoost to evaluate feature importance.
There are potential multi-collinearity problems that may impact the feature importance identified from different models. Strong collinearity can affect model reliability and precision (Comber et al., 2018) and can result in unstable estimates of feature importance and therefore inferential and prediction biases (Dormann et al., 2013). As a result, model extrapolation may be erroneous, and there may be problems in separating variable effects (Meloun, Militký, Hill, & Brereton, 2002). For example, in a random forest model, the importance of a feature may be diluted by another highly correlated variable, because each tree is built independently of the others and features are chosen at random. XGBoost has been found to be relatively immune to the multi-collinearity problem (Chen & Guestrin, 2016; Chen, Tong, Benesty, & Tang, 2018) because the algorithm does not re-focus on any specific link between feature and outcome after it has been made and learnt in the boosting process. Table 2 lists the input variables in the XGBoost model used to predict bike-sharing demand. Based on the literature reviewed, temporal and meteorological variables were included in the Basic Features group (Li et al., 2015). Bike travel flows were transformed into directed weighted graphs, allowing the strength and degree properties to represent the flow directions. Time-lagged travel demand is identical to time-lagged out-strength. All time-lagged properties were obtained from the last hour to provide temporal dependence for the prediction (longer time-lags are examined in Sections 4.2 and 4.3). As XGBoost only accepts numeric values, categorical variables (e.g. hour of day) were processed using Multiple Correspondence Analysis (Meng et al., 2016) to generate lower-dimensional numeric representations.
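The time-lagged inputs described above can be illustrated with a small sketch that pairs the graph properties of a node at hour t - 1 with its travel demand (out-strength) at hour t. The function name `make_lagged_rows`, the dictionary layout and the example values are hypothetical and chosen only for illustration; the `lags` argument is included so that longer lags (e.g. -1 to -5 h) can be built in the same way.

```python
def make_lagged_rows(series, lags=(1,)):
    """Build (features, target) pairs for one node.

    series: list of per-hour dicts of graph properties, in time order.
    lags:   which previous hours to use as inputs, e.g. (1,) or (1, 2, 3).
    The target is the out-strength (travel demand) of the current hour.
    """
    max_lag = max(lags)
    rows = []
    for t in range(max_lag, len(series)):
        features = {}
        for lag in lags:
            for name, value in series[t - lag].items():
                features[f"{name}_lag{lag}"] = value
        rows.append((features, series[t]["out_strength"]))
    return rows


# Hypothetical hourly properties for a single station group.
hours = [
    {"out_strength": 5, "pagerank": 0.20},
    {"out_strength": 8, "pagerank": 0.30},
    {"out_strength": 6, "pagerank": 0.25},
]
rows = make_lagged_rows(hours)  # one-hour lag only
```

Because the target is the current out-strength, the feature `out_strength_lag1` is the time-lagged travel demand, matching the equivalence noted in the text.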

Adding time and graph information
A good feature is one that improves model performance (e.g. prediction) as it allows more parsimonious (less complex) models to be constructed, and non-optimised model hyperparameters to be included, whilst still generating good results. By continually adding different features into a machine learning model, changes in prediction results can be evaluated accordingly. A good feature will reduce forecasting errors, while bad features will result in higher errors (and more noise). In this study, Multi-Layer Perceptron (MLP) neural networks were constructed to confirm the usefulness of various input features. MLP is a class of feed-forward neural networks. It utilises backpropagation for training, and its multiple layers and non-linear nature contribute to its ability to distinguish data that is not linearly separable. As a neural network, it has a relatively simple structure, making it easier to construct and train than others. MLP has also been shown to have a strong performance in predicting short-term traffic demand (Li & Axhausen, 2019; Lin et al., 2018). This study first constructed an MLP that is neither under- nor over-fitted, using the "Basic Features" listed in Table 2, consisting of meteorological and temporal features and including time-lagged travel demand of −1 h. Different time-lagged travel demand variables and graph information properties were then sequentially added into the MLP, with the outputs evaluated accordingly. This identifies which lagged time steps are more strongly associated with current travel demand and also provides validation of the important features identified by XGBoost.
An investigation of the hyperparameters determined that an MLP with two layers of 32 and 8 units was neither under- nor over-fitted on both datasets. The mini-batch size was set to 1024 and training epochs to 150 (enough for convergence). The loss function used was RMSE (Root Mean Square Error), which is denoted as
$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (A_i - F_i)^2}$$
where $A_i$ and $F_i$ are the actual value and forecast value respectively, and $n$ is the number of observations. There are alternative loss functions, such as Mean Absolute Error (MAE), which may be used for training machine learning models. However, the errors are squared before being averaged in RMSE, thereby giving relatively high weight to large errors. RMSE was used in this study because, in a bike-sharing system, large errors in demand estimation may pose significant difficulties to scheme operators for successful bike fleet rebalancing.
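The difference between RMSE and MAE in how they weight large errors can be seen in a small worked example. The demand values below are made up purely for illustration: both forecasts have the same total absolute error, but one concentrates it in a single hour.

```python
import math


def rmse(actual, forecast):
    """Root Mean Square Error: errors are squared before averaging."""
    n = len(actual)
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / n)


def mae(actual, forecast):
    """Mean Absolute Error: every unit of error counts equally."""
    n = len(actual)
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / n


# Hypothetical hourly demand and two hypothetical forecasts.
actual = [10, 10, 10, 10]
spread_errors = [12, 8, 12, 8]    # four errors of 2 bikes each
one_big_error = [10, 10, 10, 18]  # a single error of 8 bikes

mae_spread, mae_big = mae(actual, spread_errors), mae(actual, one_big_error)
rmse_spread, rmse_big = rmse(actual, spread_errors), rmse(actual, one_big_error)
# MAE is 2.0 for both forecasts, whereas RMSE is 2.0 for the spread errors
# and 4.0 for the single large error.
```

Under MAE the two forecasts are indistinguishable, while RMSE penalises the forecast with the single large miss, which is the behaviour motivated in the text for fleet rebalancing.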

Feature importance and variable selection
The data was split into training, validation and test datasets, with the data ordered by time. The first 80% of records were assigned as the training set, the following 10% as validation and the final 10% as the test set. The training datasets were input to the XGBoost models to rank the importance of different features, with the results shown in Fig. 3. Generally, temperature is considered an important factor related to cycling activity (Miranda-Moreno & Nosal, 2011; Thomas, Jaarsma, & Tutert, 2013) and many bike travel demand prediction studies include temperature or a series of time-lagged temperatures as model inputs (Salaken et al., 2015). Interestingly, in both case studies (Fig. 3 a, b), out-strength, in-strength, out-degree, in-degree and PageRank were all found to have greater (or comparable) importance scores than temperature, indicating their potential utility in short-term demand prediction. This suggests that despite temperature being widely used for bike-sharing demand prediction studies, several graph features are potentially more important. Betweenness failed to outperform temperature, probably because it is less associated with travel demand intensity. As observed in Fig. 2, strength, degree and PageRank are relatively similar in their spatial patterns, while betweenness is different as it describes the "bridge effect" of a node.
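A time-ordered 80/10/10 split of the kind described above can be sketched as follows. The record layout (a dict with a `time` key), the function name `chronological_split` and the toy data are assumptions for illustration only. The key point is that records are sorted by time and cut into consecutive blocks rather than sampled at random, so that validation and test data always come after the training data.

```python
def chronological_split(records, train_pct=80, val_pct=10):
    """Split records into train / validation / test blocks ordered by time."""
    ordered = sorted(records, key=lambda r: r["time"])
    n = len(ordered)
    train_end = n * train_pct // 100
    val_end = n * (train_pct + val_pct) // 100
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Hypothetical hourly records, deliberately given out of order.
records = [{"time": t, "demand": (t * 7) % 13} for t in reversed(range(20))]
train, val, test = chronological_split(records)
```

With 20 records this yields 16 training, 2 validation and 2 test records, and every training timestamp precedes every validation timestamp.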
In summary, feature selection using an initial XGBoost model identified the following features for inclusion in subsequent models: out-strength (OS), in-strength (IS), out-degree (OD), in-degree (ID) and PageRank (PR). In the following section, the results of applying a different machine learning model (MLP) are described to confirm the utility of these graph information properties in solving bike-sharing demand prediction problems.

Adding time and graph information comparison
Two types of MLP were constructed in the experiment, namely MLP-GI and MLP-DT. In the former, graph information (GI) properties at a time lag of −1 h are sequentially added into the model, in the order of out-strength (OS), in-strength (IS), out-degree (OD), PageRank (PR) and in-degree (ID), as suggested in Fig. 4 (a). The MLP-DT model used time-lagged travel demand (DT), starting from only −1 h and extending to a group of −1 to −5 h. This is a common approach: using multiple time-lagged demands from the last few hours provides a greater indication of temporal dependence in the models (Ke et al., 2017; Lin et al., 2018). Fig. 4 shows box plots of the distribution of the RMSE of 15 experiments, with the results evaluated on the validation set. Initially, the two models (MLP-GI and MLP-DT) are identical because they both use the travel demand (i.e. out-strength) from the last hour. As more variables are included, the MLP-GI models benefit from the additional lagged graph information, with decreasing RMSE in both average and median values. This is observed in both datasets (Fig. 4 a, b). Another finding is that adding OD (out-degree) and ID (in-degree) reduces prediction errors for the New York dataset (Fig. 4 a), but has less effect with the Chicago data (Fig. 4 b). The pattern accords with the previous finding in Fig. 3, where OD and ID clearly outperform the benchmark (temperature) in New York (Fig. 3 a). This again confirms the variable importance identified by XGBoost in Section 4.1.
In the MLP-DT groups, there is a different pattern to MLP-GI. Although adding more time-lagged travel demand variables reduces errors initially, this improvement is reversed with longer sequences. In the case of New York (Fig. 4 a), the RMSE slightly increased after adding the travel demand of −5 h. For the Chicago data (Fig. 4 b), the model shows underperformance after adding demand intensity of −4 h, with a higher mean and median RMSE.
The possible underperformance with a longer temporal dependence is a general phenomenon, observed and discussed in many studies (Ke et al., 2017). Performance does not always improve when a long sequence of previous observations is fed into machine learning approaches for modelling temporal dependency. The inclusion of information at less correlated timestamps can lead to poor forecasting. Therefore, the majority of previous studies chose only specific time steps to provide temporal dependence and to predict travel demand (Ke et al., 2017; Lin et al., 2018).
Comparing MLP-GI and MLP-DT with the same number of extra variables, MLP-GI always outperforms MLP-DT (see Fig. 4). The pattern indicates that using the groups of graph information properties is more effective than only using time-lagged observations of the forecasting target (travel demand).
In summary, temporal dependence modelling is limited if only historical travel demand is utilised, because only a finite number of time lags will improve the prediction. However, better forecasting results may be obtained by introducing graph information properties.
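The time-lagged inputs used in the MLP-DT experiments can be pictured as a sliding window over the hourly demand series. The sketch below is illustrative only: the function name and the toy series are hypothetical, and it builds lagged demand features for a single region rather than reproducing the study's pipeline.

```python
def lagged_features(series, lags):
    """Build (X, y): each row of X holds series[t - lag] for every lag in `lags`,
    and y holds the value to predict, series[t]."""
    max_lag = max(lags)
    X, y = [], []
    for t in range(max_lag, len(series)):
        X.append([series[t - lag] for lag in lags])
        y.append(series[t])
    return X, y

# Hypothetical hourly demand for one region.
demand = [5, 7, 9, 12, 8, 6, 4, 3]

# Only the previous hour (-1 h).
X1, y1 = lagged_features(demand, lags=[1])

# The previous five hours (-1 to -5 h).
X5, y5 = lagged_features(demand, lags=[1, 2, 3, 4, 5])
print(X5[0], y5[0])  # [8, 12, 9, 7, 5] 6
```

Longer lag sets widen each input row but also shorten the usable series, since the first `max(lags)` hours have no complete history.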

Model comparison
The analysis and results from the previous sections indicate the potential usefulness of information derived from the bike flow interaction graph, but the graph properties were all derived from a single lagged timestep. This section examines how different ML models can comprehensively use varying lagged time-sequences of graph features and compares their performance with two baseline approaches: HA (Historical Average) and ARIMA (Autoregressive Integrated Moving Average). The models are described as follows:
(1) HA: uses the historical average demand for prediction. For example, the travel demand of Tuesday 12:00 is predicted as the average value of all past Tuesdays at 12:00 in the training dataset.
(2) ARIMA: a statistical model, ARIMA is commonly used for analysing and forecasting time-series data. It has been widely applied in traffic prediction problems (Van Der Voort, Dougherty, & Watson, 1996; Williams & Hoel, 2003). In this work, to predict demand at time T, the inputs to ARIMA were the demand observations from the first hour until T-1. It was undertaken using the automatic ARIMA model provided by the "forecast" package in R, a variation of the Hyndman-Khandakar algorithm (Hyndman et al., 2018). The model combines unit root tests, minimisation of the Akaike Information Criterion and Maximum Likelihood Estimation to construct the ARIMA. It should also be noted that the performance can be significantly influenced by model tuning, and there are also several variants, such as seasonal ARIMA, which may generate better results.
In the LSTM model, time-lagged variables are reshaped into a sequence and passed to a bidirectional LSTM layer. Other temporal features (hour of day, day of week, holiday) and meteorological features are placed into a vector and processed using a densely-connected layer, which is concatenated to the LSTM layer. The two branches are merged using another densely-connected layer. The LSTM unit is composed of three gates: input, forget and output gates. These gates determine whether to include new inputs, whether to delete information, and whether the hidden state of the current time step is carried over to the next time step (iteration). As a result, LSTMs suffer less from the vanishing gradients problem and can handle complex temporal dependencies.
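The HA baseline described above is simple enough to sketch in a few lines. This is a minimal illustration, assuming hourly observations with timestamps; the function names and the toy values are hypothetical and do not come from the study's data.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def fit_historical_average(timestamps, demand):
    """Average the training demand for each (weekday, hour) slot."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for ts, value in zip(timestamps, demand):
        slot = (ts.weekday(), ts.hour)
        totals[slot] += value
        counts[slot] += 1
    return {slot: totals[slot] / counts[slot] for slot in totals}

def predict_historical_average(model, ts):
    """Look up the slot average; raises KeyError for a slot unseen in training."""
    return model[(ts.weekday(), ts.hour)]

# Hypothetical training data: three consecutive Tuesdays at 12:00.
first_tuesday = datetime(2018, 1, 2, 12)  # 2 January 2018 was a Tuesday
train_ts = [first_tuesday + timedelta(weeks=k) for k in range(3)]
train_demand = [30, 36, 42]

model = fit_historical_average(train_ts, train_demand)
next_tuesday = first_tuesday + timedelta(weeks=3)
print(predict_historical_average(model, next_tuesday))  # 36.0
```

The prediction for a future Tuesday at 12:00 is simply the mean of the past Tuesdays at 12:00, which makes HA a useful lower bar for the learned models.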
The XGBoost, MLP and LSTM models each have three variants, denoted as "-TD", "-PGI" and "-FGI" respectively. They all use the basic features, including meteorological and temporal variables, but have different inputs in terms of graph information features.
(1) TD: uses time-lagged travel demand (out-strength) for prediction, as commonly observed in the literature (Lin et al., 2018).
(2) PGI: uses part of the time-lagged graph information. Out-strength and in-strength are provided for temporal dependence modelling and demand forecasting.
(3) FGI: uses the full set of time-lagged graph information properties that were identified as more important than the baseline temperature variable: out-strength, in-strength, out-degree, in-degree and PageRank.
Models with the same suffix (e.g. -PGI) used identical input features for travel demand prediction.
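To make the PGI and FGI inputs more concrete, the node-level properties can be sketched for one time window as below. This is a hedged illustration that assumes the flows are available as (origin, destination, trip count) tuples and uses the standard weighted directed-graph definitions (strength as summed edge weight, degree as the number of distinct neighbours). The function name, input layout and toy flows are hypothetical; PageRank is omitted, and counting self-loops is an assumption made here.

```python
from collections import defaultdict

def node_flow_features(flows):
    """Compute OS, IS, OD and ID per region from (origin, destination, trips) flows."""
    out_strength = defaultdict(int)
    in_strength = defaultdict(int)
    out_neighbours = defaultdict(set)
    in_neighbours = defaultdict(set)
    for origin, destination, trips in flows:
        out_strength[origin] += trips       # trips leaving the origin
        in_strength[destination] += trips   # trips arriving at the destination
        out_neighbours[origin].add(destination)
        in_neighbours[destination].add(origin)
    nodes = set(out_strength) | set(in_strength)
    return {
        node: {
            "OS": out_strength.get(node, 0),
            "IS": in_strength.get(node, 0),
            "OD": len(out_neighbours.get(node, ())),
            "ID": len(in_neighbours.get(node, ())),
        }
        for node in nodes
    }

# Hypothetical flows between three regions in one hour.
flows = [("A", "B", 5), ("A", "C", 2), ("B", "A", 3), ("C", "A", 1), ("C", "B", 4)]
features = node_flow_features(flows)
print(features["A"])  # {'OS': 7, 'IS': 4, 'OD': 2, 'ID': 2}
```

Under these assumptions a -PGI model would take the lagged OS and IS values of each region, while a -FGI model would add OD, ID and PageRank computed on the same graph.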
Incorporating the flexibility of feature engineering in machine learning models allows them to achieve better results under the same or even reduced complexity. In this experiment, the hyperparameters of each -TD model were fine-tuned using grid search, and the -PGI and -FGI models used the same hyperparameters. Therefore, the -PGI and -FGI models do not significantly increase complexity in the algorithms and hyperparameters (e.g. the learning rate in XGBoost, the number of hidden layers in the NN) compared to the -TD models. For the neural networks, the Adam optimizer was applied together with an early-stopping callback with a threshold of 10. This means that if the model performance does not improve for 10 epochs, training stops, which limits potential overfitting.
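The stopping rule with a threshold of 10 can be sketched independently of any deep learning library. The sketch below is an assumption-laden illustration: it replays a hypothetical sequence of validation losses, and the function name and loss values are invented for the example rather than taken from the experiments.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses.

    Training stops once the loss has failed to improve for `patience` epochs in a row.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Hypothetical run of 150 epochs: the loss improves for 20 epochs, then plateaus.
losses = [1.0 / e for e in range(1, 21)] + [0.06] * 130
print(early_stopping_epoch(losses))  # (30, 20)
```

In this toy run the best validation loss is reached at epoch 20 and training halts at epoch 30, well before the 150-epoch budget is used.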
In order to utilise hourly, daily and weekly periodicities in the model temporal dependencies (Zhang et al., 2019), time lags of today (−1 to −4 h for the New York data and −1 to −3 h for the Chicago data), yesterday (−23 to −25 h) and 7 days ago (−167 to −169 h) were selected, to provide three kinds of temporal dependence for forecasting. Graph information at these time lags was calculated and incorporated into the different models. Table 3 reports the evaluation metrics of the model forecasts. To reduce the effect of randomness in NN outputs (MLP and LSTM), Table 3 shows the average MAPE (Mean Absolute Percentage Error) and RMSE over nine experiments. MAPE is denoted as follows:

$$\mathrm{MAPE} = \frac{1}{n}\sum_{t=1}^{n}\left|\frac{A_t - F_t}{A_t}\right| \qquad (3)$$

where $A_t$ and $F_t$ are the actual and forecast values at time $t$. This is a measure of relative error used to remove the scale effect of demand intensity levels, with lower MAPE generally indicating better prediction. Because the bike trip number ($A_t$ in Eq. (3)) may be 0 or near 0 at certain places and times, leading to calculation and sensitivity problems, a threshold for $A_t$ is usually set in MAPE evaluations (Ke et al., 2017). This study uses a threshold of 5.
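A thresholded MAPE of this kind might be computed as in the sketch below. It is only an illustration: whether the comparison is inclusive (at least 5) or strict (greater than 5) is not stated in the text and is assumed to be inclusive here, MAPE is returned as a fraction rather than a percentage, and the function name and values are hypothetical.

```python
def mape_with_threshold(actual, forecast, threshold=5):
    """MAPE over the observations whose actual value is at least `threshold`.

    Low-demand observations are dropped so that division by zero or by
    very small counts does not dominate the metric.
    """
    pairs = [(a, f) for a, f in zip(actual, forecast) if a >= threshold]
    if not pairs:
        raise ValueError("no observations at or above the threshold")
    return sum(abs(a - f) / a for a, f in pairs) / len(pairs)

# Hypothetical hourly trip counts: the first two fall below the threshold.
actual = [0, 2, 10, 20]
forecast = [1, 3, 8, 25]
score = mape_with_threshold(actual, forecast)
print(round(score, 3))
```

Only the last two observations contribute in this toy case, with relative errors of 2/10 and 5/20, so the hours with zero or two trips cannot inflate the score.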
Overall, the machine learning models (XGBoost, MLP and LSTM) outperform the two baseline approaches (HA and ARIMA), as shown in Table 3. An important pattern is also evident: the more graph information that is included in an ML model, the lower the MAPE and RMSE values. However, different models have varying abilities in processing input features. XGBoost performed the best among the -TD models, similar to findings in other research. Lin et al. (2018) suggested that XGBoost outperforms LSTM and MLP in predicting New York bike-sharing demand, with historical travel demand included in the feature set.
When additional graph information is provided, the various -PGI models show significant improvements over the -TD models, and errors are lower still when the remaining features are added (-FGI). Although the full feature set gives better forecasting results overall, XGBoost-FGI performs worse than LSTM-FGI. This is because time-lagged graph information properties are directly transformed into a vector for XGBoost and MLP. Although model improvements can be achieved, it is harder for these models to differentiate information from different timestamps, and they fail to take full advantage of the long feature vector. However, LSTM, as a special RNN, leads to an improvement in forecasting (lower RMSE and MAPE) when using the complex full set (-FGI) of time-lagged graph information properties.
Overall, the results in Table 3 confirm that the feature engineering in this study results in better prediction and that different kinds of machine learning models can generally benefit from time-lagged graph information properties for bike travel demand prediction.

Spatial patterns of errors
Table 3 indicates that XGBoost is the strongest in the "-TD" group, and LSTM performs the best in the "-FGI" family. These approaches are from two broad categories of machine learning models: regression trees and neural networks, respectively. Therefore, this section provides a spatial interpretation as a supplementary analysis of the two models, and the results are shown in Figs. 5 and 6. Fig. 5 (a) indicates the total travel demand in each region (groups of stations) in the New York case study over the period of the test set. Fig. 5 (b, c, d) shows how the LSTM model benefits from additional graph information variables to forecast bike travel demand in New York. Areas close to Manhattan Midtown South (marked as "1" in Fig. 5a) show improvement in the LSTM-PGI model (Fig. 5c); these are also areas with a high bike trip intensity. LSTM-FGI (Fig. 5d) further improves the prediction by reducing MAPE in the Upper East Side and Brooklyn (marked as "2" and "3" in Fig. 5a), which have medium-to-high travel demand. XGBoost also benefited from additional graph information properties (Fig. 5e, f, g), but to a lesser degree than LSTM, especially when the "-PGI" and "-FGI" models are compared. For example, less improvement was found in the Manhattan Midtown South area in Fig. 5 (e-f-g), compared to Fig. 5 (b-c-d). This pattern also accords with the findings in Table 3, as LSTM experienced a significant decrease in MAPE from -PGI to -FGI. This again highlights LSTM's ability to process complicated sequential information.
It should also be noted that LSTM and XGBoost may outperform each other in different areas of the city, suggesting that no single ML algorithm will have the best performance in all areas, as discussed in the work of Li and Axhausen (2019).
Similar patterns to Fig. 5 are observed in Fig. 6 for the Chicago case study. XGBoost outperformed LSTM in the "-TD" models (Fig. 6 b, e; Table 3), but LSTM-FGI (Fig. 6 d) obtained better predictions than both LSTM-PGI (Fig. 6 c) and XGBoost-FGI (Fig. 6 g) in areas of high travel demand around the city centre. This is helpful for bike fleet management because regions with higher demand may experience greater bike shortages, and more precise forecasting benefits the rebalancing work of scheme operators.

Discussion and conclusions
By examining travel flow interactions in transport systems, it is possible to shed light on the underlying structural characteristics of regions. This work highlights the importance of domain knowledge and feature engineering in machine learning problems. Casting complex urban systems such as transport networks into graph structures allows graph-derived measures such as node importance and centrality to be included in models to capture and represent travel flow and regional attractiveness patterns (Batty, 2013; Yang et al., 2019; Zhang et al., 2017). Related graph features improve and enhance modelling and prediction in both tree-based models and neural networks, demonstrating the utility of better feature engineering.
It should be noted that this research predicts demand at small area levels (groups of stations), rather than at the individual station level, in order to avoid the impacts of service change and to reduce noise in such a complex system (Li et al., 2015). There are other strategies to eliminate these. For example, Lin et al. (2018) sought to predict demand at the individual station level, and removed more than half of the New York bike stations from the data in order to focus only on stations that persisted over time with relatively high travel demand. Despite its finer spatial granularity (station level), their approach may provide only a partial representation of actual demand patterns.
The station group/cluster size used in this work was a relatively arbitrary decision and may have affected the graph properties used in the models. Very large groups (areas) may result in many travel flows that start and end in the same region, making various centrality measures less representative of the actual dynamics and flows. Therefore, the choice of group number and clustering needs to strike a balance between a fine representation of flows and the elimination of system noise.
There are several shortcomings in this study that will be addressed and investigated in future work. First, more statistical time-series models could be used for comparison, such as KNN and seasonal ARIMA. Other hybrid deep neural networks may also be applied to verify the FGI improvement; examples include MGCNN and ST-ResNet. MGCNN can benefit from RNN (LSTM, GRU) layers to model time-lagged variables (Lin et al., 2018), and it may be enhanced further by FGI, as LSTM has been in this work. Second, this study applied node-level graph information properties for better forecasting. Future work may examine the utility of including edge-level (e.g. edge betweenness) and sub-graph-level (e.g. modularity) information to improve transport demand forecasting. Third, this work only used data from two American cities. Although similar patterns were identified, it is uncertain whether these findings are universally applicable to bike-sharing systems in other regions (e.g. Asia, Europe). Additionally, both datasets are from dock-based bike-sharing systems. Examining dockless bike-sharing systems as graphs and deriving useful information for demand forecasting is an area for further study.
Overall, this study identified the importance and effectiveness of time-lagged graph information properties in bike-sharing travel demand forecasting. Analysis of real-world data from different cities suggests that several time-lagged graph properties are of greater relevance for prediction of bike demand than more commonly used environmental measures. Graphs capture important structural information and system properties, and graph derived measures should be included in forecasting models. The follow-up experiments confirmed the improvements to several advanced machine learning approaches, noting that LSTM neural networks are able to effectively use a complex set of graph features, due to their ability to process sequential information.
A number of graph information variables were found to improve machine learning prediction of bike travel demand when included as lagged information in ML models: out-strength, in-strength, out-degree, in-degree and PageRank. Using in-strength can significantly decrease prediction errors, while the inclusion of the full set can lead to even lower average errors. The improvement also presents a spatial pattern and is more evident in areas with a medium and high volume of journeys, which is helpful in real-world applications. Unlike many data enrichment methods, this approach does not require data from other sources (e.g. land-use information from POI, Twitter) or extra processing, data cleaning and fusion. These features are easily derived from bike flow graphs and are relatively easy to include in existing models. Predictions using such data can inform bike scheme operators, helping them to better understand and model demand patterns in different urban areas and to run more successful bike-sharing schemes, thereby promoting sustainable transport. The improved short-term demand predictions can also benefit "user-based rebalance" activities (Duan & Wu, 2019; Wu, Liu, & Shi, 2019), which often use directed user incentives to help bike rebalancing work, and can help to dynamically optimise service provision.
The key finding from this work is that time-lagged graph flow information derived from actual bike-sharing patterns was found to be a stronger predictor of demand than more commonly used meteorological features. This is because graph structural information captures important spatial and behavioural properties. Our study also found LSTM neural networks to be the most effective at handling a complex set of graph features and at processing sequential information. Combining these resulted in enhanced and more accurate demand forecasting in bike-sharing systems.