Abstract

Direct ridership models can predict station-level urban rail transit ridership. Previous research indicates that the direct modeling of urban rail transit ridership uses different coverage overlapping area processing methods (such as naive method or Thiessen polygons), area analysis units (such as census block group and census tract), and various regression models (such as linear regression and negative binomial regression). However, the selection of these methods and models seems arbitrary. The objective of this research is to suggest methods of station-level urban rail transit ridership model selection and evaluate the impact of this selection on ridership model results and prediction accuracy. Urban rail transit ridership data in 2010 were collected from five cities: New York, San Francisco, Chicago, Philadelphia, and Boston. Using the built environment characteristics as the independent variables and station-level ridership as the dependent variable, an analysis was conducted to examine the differences in the model performance in ridership prediction. Our results show that a large overlap of circular coverage areas will greatly affect the accuracy of models. The equal division method increases model accuracy significantly. Most models show that the generalized additive models have lower mean absolute percentage errors (MAPE) and higher adjusted values. By comparison, the Akaike information criterion (AIC) values of the negative binomial models are lower. The influence of different basic spatial analysis unit on the model results is marginal. Therefore, the selection of basic area unit can use existing data. In terms of model selection, advanced models seem to perform better than the linear regression models.

1. Introduction

Urban rail transit is a popular form of urban public transportation because of its large capacity, environmental friendliness, and fast speed. The emergence of the urban rail transit has alleviated the problems of congestion and exhaust pollution caused by private vehicles in the city. Strong evidence shows that urban rail transit stations reduce the private car ownership rates of nearby households [1]. A growing number of mega cities have adopted various measures to develop rail transit systems. In cities such as Seoul and Shanghai, the government promotes the use of urban rail transits through transit-oriented development (TOD) [2, 3]. The station-level ridership is a main factor for determining the operation and planning of the urban rail transit system. Because of this, ridership modeling at station level has consistently been a topic of interest for scholars and practitioners. A reliable ridership model can reflect the underlying factors that influence station-level ridership and facilitate operation and management, formulating measures to promote urban rail transit ridership. An accurate ridership model can assist operators in determining service frequency and best practice in operation.

In ridership modeling, an important factor to consider is the aspect of land use within the station service coverage area. Selecting a suitable station service coverage area is the first step towards building an effective ridership model. Double counting caused by overlapping area is a recognized issue in ridership modeling [4, 5]. To address the potential issues with double-counted data, the use of Thiessen polygons is a commonly proposed measure [69]. In brief, Thiessen polygons assign the closest point of the station to that station to tackle this issue. However, issues may still occur when using Thiessen polygons. For example, in places where the station layout is relatively compact, the station service coverage area formed by the Thiessen polygon becomes proportionally smaller, which may underestimate the effects of built environment variables on ridership in a model.

After determining the service coverage area of a station, the next step is to extract the built environment variables (such as population density and employment density) within the service coverage area, which is the basic unit for spatial analysis. Because station service coverage area usually does not completely match with the basic spatial analysis unit, is it often assumed that population and employment data are evenly distributed in the basic spatial analysis unit. Data weight is determined by the ratio of area covered to total area within the analysis unit. For example, the data weight is 1 when the analysis unit is completely covered by the station service coverage area. In existing literature, two basic spatial analysis units are often used: census block groups and census tracts [6, 1012]. Both census block groups (CBGs) and census tracts (CTs) are geographic units used by the U.S. Census Bureau. CT is a larger geographic unit, consisting of multiple CBG units.

Selection of a fit regression model is the following step to reveal the relationship between station-level ridership and its influencing factors. Linear regression is the most commonly used method in literature [2, 3, 7, 13]. Because of the differences in station locations, ridership can vary significantly by station, resulting in a possible disperse distribution of station-level ridership. To deal with this possibility, negative binomial regression models are suggested in literature and have been adopted [14, 15]. In addition, generalized additive regression models are proposed in tackling the likely nonlinear relationship between the dependent variable and the independent variables in the transportation field [16, 17].

This study selects five cities in the United States as case studies: New York City, Philadelphia, Boston, Chicago, and San Francisco. The urban rail transit systems in these cities are among the largest in USA. The inclusion of these five cities is to yield a more generalizable conclusion, as they vary across many aspects [18]. This study aims to determine which regression method is the most reliable in dealing with station coverage area overlapping issues, to explore if there exist significant differences between modeling CBGs versus CTs as spatial analysis units, and to provide insights into which model performs the best when modeling direct ridership at the station level.

The rest of this article is structured as follows. The next section reviews the existing literature in contributing factors of transit ridership, measures to address the overlapping issues of station coverage area, and applications of various spatial analysis units and models. The third section describes the data of the study and the three proposed methods to overcome the overlapping issue of station coverage area. It also explores the differences in the ordinary least square regression, negative binomial regression, and generalized additive models using two spatial analysis units (CBG and CT) and different station coverage treatment methods. The fourth section assesses the accuracy and reliability of the models and proposes directions for model improvement. Finally, a conclusion of our study findings, merits, and limitations is presented.

2. Literature Review

A large number of studies on urban rail ridership are based on station level. It is important to study the impact of the built environment on station-level ridership. Many studies have found that higher population density and employment density increase the ridership [3, 1921]. Built environment factors around the station such as density, diversity, and design have a significant effect on the ridership of urban rail [14, 2224]. The attributes of the station itself also have a significant impact on ridership, with transfer and terminal stations associated with higher ridership [2527].

To determine the built environment attributes around a station, buffers are usually drawn around the stations to represent coverage areas of the stations. Generally, a circular buffer zone with the radius of 800 meters is commonly used as the service coverage area of a station [25, 28, 29]. However, this naive approach tends to ignore the problem of overlapping buffers, especially when stations are close to each other. Some scholars have proposed solutions to such problem. The Thiessen polygon method is the most widely used method [6, 7, 9]. However, in places where the stations are close to each other, the use of Thiessen polygons can make the coverage area of stations be very small, which introduces errors when estimating the values of the built environment variables. As a result, neither of the two methods seems to address the issue of overlapping buffer well. Therefore, this paper proposes a new method to deal with this issue: calculate the values of the variables such as population, employment, and number of bus stops in the overlapping area of buffers, divide the value of such variables by the number of overlapping buffers, and assign the result to each overlapping buffer.

Different spatial analysis units are used by different studies. These units usually include CBG and CT. As the coverage area of stations does not completely match the spatial analysis unit, when estimating the values of built environment variables around stations, the common practice is to treat those variables as evenly distributed in the spatial analysis unit and calculate the values of the variables within the coverage area. Therefore, the choice of spatial analysis unit affects the results of the direct ridership model. Studies using CT as the spatial analysis unit include [6, 10], and studies using CBG as the spatial analysis unit include [11, 12].

Different studies also use different regression models to construct the relationship between independent variables and the station-level ridership. The linear regression model is the most common method [24, 26, 27]. Another widely used model is the negative binomial regression [14, 15]. Recently, nonlinear models have been extensively used. The nonlinear models include machine learning models and polynomial statistical models. Compared to statistical models, machine learning models have no significance inference for the independent variable and are prone to overfitting [30, 31]. GAM is an advanced statistical model that captures the nonlinear relationship between the independent and dependent variables through a smoothing function [16, 17, 31, 32]. Ding et al. [16] found the nonlinear association between almost all built environment variables and electric bike ownership. Hu et al. [17] used the generalized additive mixed effects model (GAMM) to investigate the nonlinear relationship between determinants and the attractiveness of car sharing. Since spatial autocorrelation is always observed when dealing with spatial data, spatial econometric models are usually used to deal with this issue [15, 33, 34]. Gan et al. [33] applied the spatial error model and found the factors that significantly influence station-level ridership while controlling for spatial autocorrelation.

In conclusion, previous studies have used different spatial analysis units, treatments of overlapping buffer areas, and regression models when estimating the station-level ridership of urban rail transit system. However, these methods and models have not been compared yet. So far, there is no guideline on which spatial analysis unit, treatment of overlapping buffer areas, and regression model should be used. As a result, this study will analyze the effect of different spatial analysis units, treatments of overlapping buffer areas, and regression models on the results so as to provide guidelines on which method or model should be used.

3. Research Design

3.1. Study Area

We collected the 2010 demographic variables from latest smart location database (SLD) and urban rail transit data of 2010 for the five selected U.S. cities: New York, San Francisco, Chicago, Philadelphia, and Boston. The five cities are selected because they provide ridership data that is available to the public. In addition, the numbers of urban rail transit stations in these five cities are relatively diverse. New York city has the largest urban rail transit system among the five cites, with 421 stations spreading across Manhattan, Brooklyn, Queens, and Bronx. San Francisco has the least number of stations, with 44 stations connecting cities in the Bay Area. The numbers of stations in Chicago, Philadelphia, and Boston are 136, 156, and 153, respectively.

3.2. Variables and Data Sources

Our dependable variables are the urban rail transit station-level ridership in 2010, which are in form of average weekday station ridership. The data sources and information are shown in (Table 1).

Our independent variables are derived from smart location database (SLD) 2010 and open data in five cities. SLD is available through the US Environmental Protection Agency. We obtain demographic data from SLD with CBG as our basic unit of spatial analysis and cluster them into CTs through GEOIDs. GEOIDs are numeric codes that uniquely identify geographic units. Shapefiles of roads, bus stops, bus lines, and urban rail transit stations are retrieved from the open data. Using ArcGIS spatial analysis tools, the variable data used for modeling are divided into three categories: demographic, land use, and station characteristics. Variable information is described in Table 2.

3.3. Overlapping Issue of Station Coverage Area

This study applies three different methods to deal with the overlapping issue of the station service coverage area: naive method (i.e., no treatment), equal division method, and the Thiessen polygons method. The circular buffer zone is a circular area with a station as the center and a radius of 800 meters as shown in Figure 1. However, when stations are located densely, a large amount of overlapping can occur with this method. As a result, this may significantly affect the model accuracy. For this reason, two other methods are used. The equal division method is similar to the naïve method. But the overlapped population, employment, and bus stops will be equally divided within a spatial unit. However, it should be noted that other variables (Table 1) do not involve equal splits under this method. As highlighted in the red square in Figure 1, when the two circular buffer areas overlap, the data in the overlapping part will be evenly assigned to the service coverage area of the two nearby stations. Another possible method is the use of Thiessen polygons, where any data point within the polygon is assigned to its closest station measured in distance, as illustrated in Figure 2.

3.4. The Modeling Approaches

In addition to the multiple linear regression, negative binomial regression, and spatial models that are common in the literature, this research also adds a generalized additive model. When building a direct ridership model for urban rail transit stations, linear regression is the most commonly used model by scholars. However, one assumption of linear regression is that the relationship between the dependent variable and the independent variable(s) can only be linear, which is rarely the case for some types of data. We thus introduce a generalized additive model to perform nonlinear fitting to capture the possibility of any possible nonlinear correlation between the dependent variable and the independent variable(s) through smoothing function. The station-level ridership difference among stations can be large. Through calculations, it is found that the coefficient of variation averages 1.49 from 0.83 (Chicago) to 3.07 (Philadelphia), an indication of overdispersion. Therefore, the negative binomial regression is proposed and used to overcome this overdispersion pattern of the dependent variable in the study [18]. Spatial variability in station-level ridership, for which a spatial error model is applied to address the spatial autocorrelation. All modeling and analyses are performed with the R software (version 3.6. 1), in which the generalized additive model is activated through the “mgcv” package [35], negative binomial regression is provided by the “MASS” package [36], and spatial error model is provided by the “spatialreg” package [37].

The negative binomial, generalized additive, and spatial error models used in this study can be expressed as follows.

3.4.1. Negative Binomial Regression

where represents the station-level ridership and is the intercept. represent the independent variables, which are population, employment, and other variables in our context, are the coefficients of each independent variable, and is the residual term.

3.4.2. Generalized Additive Regression

The generalized additive model uses a spline function to capture the nonlinear relationship between the independent and dependent variables, with the following formula:where represents the station-level ridership and is the intercept. are the independent variables used for linear fitting, are the coefficients of independent variables, represent the independent variables of nonlinear fitting, and is the residual term.

3.4.3. Spatial Error Model

The spatial error model assumes that only the effect from the error term in the spatial autocorrelation process of the elements is captured by the following formula:where is the weight matrix, is the coefficient vector of the random error term, is the spatial error coefficient, and is the random error.

3.5. Model Performance Indicators

To evaluate the accuracy of the model, this study used adjusted , MAPE, and AIC. Among them, adjusted is used to indicate goodness of fit of the models [38]. MAPE is used to evaluate the accuracy of the model [39]. AIC is used to compare the quality of different models [40]. These three indicators can be expressed mathematically aswhere is the residual sum of squares and is the degree of freedom of the . is the total sum of squares, and its degree of freedom is . represents the true values of the dependent variable in the model, while is the predicted value of the dependent variable. denotes the likelihood function and represents number of parameters.

4. Results

4.1. Model Outputs and Evaluations

In this section, we discuss differences in the use of various spatial analysis units, the processing methods for the overlapping parts of the station coverage areas, and the results of the three types of models. A total of 90 models have been built in this study. For model evaluation, we first use the variance inflation factor (VIF) value to examine potential collinearity between the variables and filter out the variables with a VIF value exceeding 10 [41]. The variables “housing unit” and “household” are removed. With generalized additive models, we use the AIC to determine the value of k. The model results for New York based on the spatial analysis unit of CBG and the naive treatment method for the overlapping buffer area are given in Table 3.

We then obtain and summarize the adjusted , MAPE, and AIC values of all final fitted models, as shown in Tables 47 . In addition, we calculate the size of the overlapping area and the overlap rates in the five cities using circular buffers (Table 8). With this calculation method, New York and Boston have a relatively high overlap rate, close to half of the area size, and this reflects the stations layout compactly.

4.2. Other Methods to Address the Overlapping Issue

In cities with compact station layouts, such as New York and Boston, the coverage overlap rates are close to half. With the same model parameters and analysis units, the equal division method in processing the overlapping area appears to be the most effective method overall with average adjusted  = 0.6, MAPE = 112.078, and AIC = 8020.195, which is better than the naive method (0.454, 164.758, and 8094.405) and the Thiessen polygon method (0.560, 126.657, and 8046.428), respectively. The method using Thiessen polygons performs better than the naive method as well. This is likely because when urban stations are densely located, the use of naive method recalculates a large amount of data, while the other two methods can better dilute the situation and reflect it around the station more closely. In cities with sparsely located stations, such as San Francisco, Philadelphia, and Chicago, the overlap rates are 0.034, 0.08, and 0.254, respectively. The naive method, equal division method, and Thiessen polygon method do not seem to show significant differences.

4.3. Model Comparison

The goodness of fit of a model can be assessed using the AIC value. Controlling for the same processing on the overlapping part of the station service coverage area, the same spatial analysis unit, and the same city, our results show that the AIC values of most negative binomial regression models are lower than those of the linear regression and the generalized additive models, indicating a better fit of the model. Besides, generalized additive models yield the highest accuracy rates. Typically, almost all generalized additive models have a smaller MAPE in our result, compared to the linear regression and negative binomial regression models. Negative binomial regression models also perform slightly better than the linear regression models on MAPE. Regarding the generalized additive models, the adjusted values of the models are higher than those of other models. Yet, smaller AIC values are observed with negative binomial regression models, indicating an overall better fit of data with such models. Smaller MAPE values are observed with the generalized additive models, indicating higher prediction accuracy with these models. Spatial error model also generates better results than the multiple linear regression model based on adjusted .

4.4. Spatial Analysis Unit Comparison

From the results, after controlling for other factors, CBG and CT methods yield distinct performances in different cities, and there is no clear consensus on which one performs better. A possible explanation is that the data obtained from CBGs are inferred from surveys which are subject to sampling biases. Therefore, when CBG data are aggregated onto CTs, these biases may wash out each other. Therefore, the accuracy of the data at the CT level may not be necessarily worse than that at the CBG level. On the other hand, due to the unobserved heterogeneity of urban rail transit station-level ridership, the room for improving the CBG level analysis can be marginal, despite having a higher accuracy. Both methods have close performance in accuracy. Therefore, for future station-level ridership modeling efforts, the selection of spatial analysis unit may not hold an upmost importance as changes in the overall outcome due to unit selection are minor. In addition, our results show that, for the same city, the model performance indicators AIC and adjusted carry the same functionality. The larger adjusted , the smaller the AIC. A majority of MAPE and adjusted values are also the same in terms of functionality. Last, we find that the accuracy of the station-level direct ridership model varies greatly across different cities (ranging from 0.4 to 0.9 in adjusted values).

5. Discussion

This study discusses the effect of the treatment of service coverage overlap areas, spatial analysis units, and different model choices on the direct ridership modeling results of urban rail transit systems.

First, we investigate the treatment of overlapping buffers. Several previous studies have used the Thiessen polygons approach to deal with overlapping buffers [69]. However, they did not compare the Thiessen polygons approach with the naive approach. Although the method of equal division is more cumbersome to use, it can generate better results in dealing with the problem of overlapping buffers.

Regarding the issue of different spatial analysis units (CBG or CT), our results show that there is no evidence as to which one is better. Previous studies have used different spatial analysis units [6, 1012], and in the future scholars can still use the most readily available spatial analysis unit.

With regard to different regression models, taking into account the overdispersion of ridership, nonlinear relationship between the independent and the dependent variables, and spatial error model could generate better results than the multiple linear regression based on adjusted . In particular, the nonlinear model, GAM, has both lower MAPE value and higher adjusted , which is consistent with the findings of existing studies [17, 23, 42].

6. Conclusions

The main contribution of this study is that we explored the effect of different treatment methods for the overlapping buffer area of stations, spatial analysis units, and regression models on station-level demand modeling results. This exploration answers the question of which treatment methods for the overlapping buffer area, spatial analysis unit, and regression model should be used in station-level demand modeling of urban rail transit. To obtain more convincing and universal conclusions, the ridership data of urban rail transit from five major cities in the United States are used to perform the study. We found that the nonlinear model outperforms the linear models in most of the cases. The equal division method usually performs better than the other two methods. Regarding the spatial analysis unit, the choice of CBG or CT does not influence the model results much. Thus, researchers could use either of them to perform the station-level demand modeling study.

Data Availability

The data used in this study are composed of two parts. The ridership data were taken from the transportation agencies of the five cities: New York [43], Philadelphia [44], Chicago [45], Boston [46], and San Francisco [47]. The built environment data were taken from the smart location database [48].

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Authors’ Contributions

H. Yang, C. Li, and X. Li conceived and designed the study. C. Li and X. Li performed data collection. H. Yang, C. Li, X. Li, J. Huo, Y. Wen, E. Parks, and Y. Liu performed analysis and interpretation of results. H. Yang, C. Li, X. Li, Y. Wen, E. Parks, and Y. Liu wrote the manuscript. All the authors reviewed the results and approved the final version of the manuscript.

Acknowledgments

This study was funded by the National Natural Science Foundation of China (Grants nos. 71704145, 51774241, and 71831006), Humanity and Social Science Foundation of Ministry of Education of China (Grant no. 18YJCZH138), China Postdoctoral Science Foundation, and Sichuan Youth Science and Technology Innovation Research Team Project (Grants nos. 2019JDTD0002 and 2020JDTD0027).