Distance-decay functions of travel to work trips in India

In 2011, for the first time, Census in India reported travel distance and mode of travel for the workers. The distance reported is in the form of aggregate counts for each mode of travel in 7 distance bins (0–1 km, 2–5 km, 6–10 km, 11–20 km, 21–30 km, 31–50 km, and >50 km). In this data article, methods are described to model categorical count data as distance-decay functions using continuous probability distributions. The distributions have been developed for 8 categories of modes—walk, cycle, motorised two-wheelers, car, tempo/auto rickshaw/taxi, bus, train, and all modes combined, for the 33 mainland states of India and all states combined. Distance for walk is modelled using exponential distribution, and for all the other modes using lognormal or Weibull distribution. For estimating parameters of the distributions, chi-square minimization has been used in a spreadsheet program. The data presented includes parameters of the 272 (34 × 8) probability distributions as well as descriptive statistics of these distributions.


The data presented in this article includes the parameters of distance-decay functions for 8 categories of travel modes in the form of exponential, lognormal or Weibull distributions. The data also includes descriptive statistics of travel distance for each mode in the 33 mainland states of India. This is the first such travel-related data available for the whole of India. The data will find use in multiple areas of transport-related research as well as policy making, and the method described is generic and can be applied to estimate distance-decay functions at district level.

Data
The commute distance data reported by the census is in the form of aggregate counts of workers classified into 7 distance bins for each mode. The count data will be modelled as continuous probability distribution functions to estimate mean distance travelled by each mode in each state. The data presented has been used for developing an accident prediction model for the states of India [5].

Materials and methods
In 2011, India had 28 states and 7 Union Territories (UTs). While the former have their own elected governments at the state level, the latter are governed directly by the federal government, and are usually much smaller in size than the states. The average population of the UTs is 2.9 million while that of the states is 41 million. Two of the UTs are islands, Andaman and Nicobar Islands in the east and Lakshadweep in the west, and together contribute 0.04% of the total population of the country. These were excluded from this analysis. The remaining 28 states and 5 UTs will be referred to as 33 states henceforth. In addition, the analysis includes all states combined, referred to as India.
Census in independent India has been conducted every decade since 1951 using personal interviews and covers the whole population. In 2011, Census introduced two questions regarding the commute of workers [1]. The two questions covered mode of travel and one-way distance (in kilometres) from residence to place of work. There are 9 options for the travel modes: (1) walk, (2) cycle, (3) moped/scooter/motorcycle, (4) car, (5) tempo/auto rickshaw/taxi, (6) bus, (7) train, (8) water transport, and (9) any other, and an option of 'No travel'. Category 3 is referred to as motorised two-wheelers (2W), and category 5 as para-transit modes or Intermediate Public Transport (IPT) such as three-wheeled auto rickshaws, common across India (for their description see [6,9]).
These questions were asked of a subset of all the workers, the category called 'other workers'. These are defined as the workers other than those involved in economic activities such as cultivation, agricultural labour, or a household-based industry. Within urban areas, the workers are classified as working in a household-based industry if the business is conducted by the household members within the premises of their household. In rural areas, workers are classified as household-based if the industry is conducted within the village. If the person was engaged in more than one economic activity during the last year, this question was asked with reference to the main economic activity. The category of 'other workers' represents 42% of all the workers in India [2].
Among the 9 options of travel modes, only one could be selected by a worker. The question on mode thus disregards the multimodal characteristics of some of the trips, and the census provides no details in this regard [3]. Thus, the working assumption is that the respondents reported their main mode of travel, that is, the one by which they covered the longest travel distance. Since the census is conducted using personal interviews, it is possible that these questions, in some cases, were answered by proxy respondents, for instance, by other members of the household. However, no such information is available from Census to account for this bias.
For each mode, the census has reported mode-specific counts of workers classified into 7 distance categories: 0-1 km, 2-5 km, 6-10 km, 11-20 km, 21-30 km, 31-50 km, and >50 km. For walk, counts have been reported for 3 categories up to 10 km, and for cycle, 5 categories up to 30 km [1]. Table 1 presents this data for all India. The data has been reported only at the aggregate level of states and districts, with a further classification into rural, urban and total. In this article, total data (urban plus rural) has been used at the state level. All modes combined is also included as an additional category. 'Water transport' and 'any other' categories were excluded. These two categories were reported by 1.2% of those travelling by one of the 9 travel modes. In total, this article presents analysis of 8 travel mode categories in 33 states plus all India.
To estimate mode-specific average distance, the count data is modelled as continuous probability distributions. Such distributions, for distance, are often referred to as distance-decay functions [7,8,12]. The decay function for walking is often modelled with an exponential distribution. This implies that the likelihood is the highest for walking trips with distance close to zero and that this likelihood reduces thereafter. For all the other modes, however, the peak occurs at a point away from zero, followed by a long tail towards longer distances. Such variations in probabilities are often expressed using lognormal or Weibull distributions.
In their original form, the exponential, lognormal and Weibull distributions have domains reaching up to infinity. This means that an integral of these distributions from zero to infinity is unity. Since commute travel distance has a finite maximum value, the truncated form of each distribution was used. Without the truncation, the distributions are likely to overestimate the average distance. Mathematically, truncation implies that the probability density functions integrate to unity within a restricted domain, i.e. up to a finite maximum value.
The truncated forms of the cumulative distribution functions are shown in Eqs. (1)-(3). Each distribution is expressed using two parameters, α and β, and the combination of the two parameters is specific to a given combination of distribution type (denoted in the subscript by l: lognormal; w: Weibull; e: exponential), state (denoted by s), and mode of travel (denoted by m). In the case of the exponential function, the subscript for mode has not been used as this distribution is applicable only for walking. In these equations, $F'(D_{max})$ is the normalising factor which ensures that the integration of the distribution from 0 to $D_{max}$ equals unity. It is calculated as the cumulative probability of the untruncated distribution at $x = D_{max,m}$, where the subscript m denotes that this distance is specific to a mode. The objective is to find the parameters α and β.
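The role of the normalising factor can be illustrated with a minimal Python sketch. It assumes the standard textbook parameterisations of the lognormal (log-scale `mu`, `sigma`) and Weibull (`shape`, `scale`) distributions; how these map onto the article's α and β is not stated here, so the names and the parameter values are hypothetical.

```python
import math

def lognormal_cdf(x, mu, sigma):
    """Untruncated lognormal CDF (mu, sigma on the log scale; assumed naming)."""
    if x <= 0.0:
        return 0.0
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))

def weibull_cdf(x, shape, scale):
    """Untruncated Weibull CDF (shape, scale; assumed naming)."""
    if x <= 0.0:
        return 0.0
    return 1.0 - math.exp(-((x / scale) ** shape))

def truncate(cdf, d_max):
    """Rescale a CDF by its value at d_max so that it reaches 1 at d_max."""
    norm = cdf(d_max)  # normalising factor: untruncated CDF evaluated at d_max

    def truncated(x):
        return 1.0 if x >= d_max else cdf(x) / norm

    return truncated

# Hypothetical parameter values, chosen only for illustration.
lognormal_100 = truncate(lambda x: lognormal_cdf(x, 2.0, 1.0), 100.0)
weibull_100 = truncate(lambda x: weibull_cdf(x, 1.2, 15.0), 100.0)
print(lognormal_100(10.0), weibull_100(10.0))
```

Dividing by the value at `d_max` shifts every cumulative probability slightly upwards, which is why the truncated mean is lower than the untruncated one.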
For the exponential distribution, the mean and standard deviation were analytically derived. For the lognormal and Weibull distributions, I used analytical forms of mean and standard deviation reported in the literature ([4] for Weibull and [11] for lognormal). Crénin [4] reported an R script for estimating mean and standard deviation, given the parameters α, β and $D_{max,m}$ (see Box 1 for expressions of mean and standard deviation for the three distributions).
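As an illustration of an analytically derived truncated mean, the sketch below uses the standard one-parameter exponential distribution with rate `rate`, truncated at `d_max`. This is an assumption: the article's exponential form carries a second parameter, and its exact expressions are given in Box 1, not here. The closed form is checked against a simple midpoint-rule integration.

```python
import math

def truncated_exp_mean(rate, d_max):
    """Closed-form mean of an exponential distribution truncated to [0, d_max]."""
    tail = math.exp(-rate * d_max)
    return 1.0 / rate - d_max * tail / (1.0 - tail)

def numeric_mean(rate, d_max, n=100000):
    """Midpoint-rule integration of x * pdf(x) over [0, d_max] as a cross-check."""
    norm = 1.0 - math.exp(-rate * d_max)  # normalising factor
    dx = d_max / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * dx
        total += x * rate * math.exp(-rate * x) / norm * dx
    return total

# Hypothetical rate; 10 km is the maximum walking distance in the census bins.
analytic = truncated_exp_mean(0.5, 10.0)
numeric = numeric_mean(0.5, 10.0)
print(analytic, numeric)
```

The truncated mean is always below the untruncated mean `1 / rate`, consistent with the point that untruncated distributions overestimate average distance.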
The counts in the 7 distance bins specific to each mode are referred to as $n^{1}_{obs,m,s}$, $n^{5}_{obs,m,s}$, $n^{10}_{obs,m,s}$, $n^{20}_{obs,m,s}$, $n^{30}_{obs,m,s}$, $n^{50}_{obs,m,s}$ and $n^{50+}_{obs,m,s}$, and the total number of workers corresponding to each mode as $n^{total}_{obs,m,s}$, where the subscript obs refers to the observed numbers, m refers to the mode of travel, s refers to the state, and the numbers in superscript refer to the distance bin. The numbers of workers within each distance bin modelled by the distribution function are referred to as $n^{1}_{mod,m,s}$, $n^{5}_{mod,m,s}$, $n^{10}_{mod,m,s}$, $n^{20}_{mod,m,s}$, $n^{30}_{mod,m,s}$, $n^{50}_{mod,m,s}$ and $n^{50+}_{mod,m,s}$, thus replacing 'obs' by 'mod' in the subscripts.
To estimate the parameters α and β, an optimisation problem is set up, where the objective is to minimise the chi-square statistic given in Eq. (4).
Box 1, 3) Exponential distribution: $F(x \mid \alpha, \beta) = (1 - \beta e^{-\alpha x}) / F(D \mid \alpha, \beta)$.
In Eq. (4), $F(x = 1\,\mathrm{km})$ refers to the percentage of counts less than or equal to 1 km, and $(F(x = 5\,\mathrm{km}) - F(x = 1\,\mathrm{km}))$ refers to the percentage of counts greater than 1 km but less than 5 km. $F(x)$ refers to the cumulative probability distribution presented in Eqs. (1)-(3). The optimisation problem is to find the combination of parameter values α and β for which the value of the $\chi^2$ statistic is minimised.
To obtain the solutions, I used the Solver tool of the Microsoft Excel spreadsheet program. In Solver, the GRG (generalised reduced gradient [10]) non-linear algorithm was used for optimisation. The solution of the algorithm is sensitive to the selection of starting points for α and β. From a preliminary analysis for all the modes, it was clear that for a given mode and a given distribution, the values of these parameters belonged to a narrow range. Therefore, any outlying solution in terms of parameters would be easy to detect. A positive value of 1 was used as the starting point for both parameters.
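The fitting step can be sketched in Python as follows. This is not the Excel Solver GRG procedure used in the article: a simple compass (pattern) search is swapped in so that the example needs only the standard library. The Pearson form of the chi-square statistic, the lognormal parameter names `mu` and `sigma`, the bin edges, and the synthetic counts are all assumptions made for illustration.

```python
import math

EDGES = [1.0, 5.0, 10.0, 20.0, 30.0, 50.0]  # upper edges of the first six bins (km)
D_MAX = 100.0  # assumed finite maximum for the open-ended last bin

def lognormal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))

def modelled_counts(mu, sigma, total):
    """Expected count in each of the 7 bins under a lognormal truncated at D_MAX."""
    norm = lognormal_cdf(D_MAX, mu, sigma)
    if norm <= 0.0:
        return [0.0] * (len(EDGES) + 1)
    cum = [0.0] + [lognormal_cdf(e, mu, sigma) / norm for e in EDGES] + [1.0]
    return [total * (hi - lo) for lo, hi in zip(cum, cum[1:])]

def chi_square(params, observed):
    """Pearson chi-square between observed and modelled bin counts (assumed form)."""
    mu, sigma = params
    if sigma <= 0.0:
        return math.inf
    model = modelled_counts(mu, sigma, sum(observed))
    if min(model) <= 0.0:
        return math.inf
    return sum((o - m) ** 2 / m for o, m in zip(observed, model))

def compass_search(objective, start, step=0.5, tol=1e-7, max_iter=20000):
    """Try +/- step on each parameter; halve the step when nothing improves."""
    best, best_val = list(start), objective(start)
    for _ in range(max_iter):
        if step < tol:
            break
        improved = False
        for i in range(len(best)):
            for delta in (step, -step):
                trial = list(best)
                trial[i] += delta
                val = objective(trial)
                if val < best_val:
                    best, best_val, improved = trial, val, True
        if not improved:
            step /= 2.0
    return best, best_val

# Synthetic "observed" counts generated from known parameters (not census data).
observed = modelled_counts(1.8, 0.9, 100000)
fit, chi2 = compass_search(lambda p: chi_square(p, observed), [1.0, 1.0])
print(fit, chi2)
```

As in the article, both parameters start at 1. Because the synthetic counts come from the model itself, the search should recover the generating parameters with a chi-square close to zero; real census counts would leave a non-zero residual.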
In the census data, $D_{max}$ is 10 km for walk and 30 km for cycle. For all the other modes (2W, car, IPT, bus, and train), the last bin is for distance greater than 50 km and is open ended (>50 km). I assumed a $D_{max}$ of 100 km for 2W, 200 km for car and train, 100 km for IPT, and 100 km for bus. To test the sensitivity of these assumptions, I estimated the mean distance for car assuming a $D_{max}$ of 150 km and of 300 km. The maximum difference in means between the two assumptions is 0.5 km.
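A sensitivity check of this kind can be sketched as below, assuming a truncated lognormal with hypothetical log-scale parameters `MU` and `SIGMA` (not the fitted values from the article). The mean is obtained by numerically integrating the survival function, since for a distance X in [0, D_max] the mean equals the integral of 1 - F(x) over that range.

```python
import math

def lognormal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))

def truncated_mean(mu, sigma, d_max, n=20000):
    """Mean of a lognormal truncated at d_max, via midpoint integration of 1 - F(x)."""
    norm = lognormal_cdf(d_max, mu, sigma)  # normalising factor
    dx = d_max / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * dx
        total += (1.0 - lognormal_cdf(x, mu, sigma) / norm) * dx
    return total

# Hypothetical parameters, chosen only to illustrate the comparison.
MU, SIGMA = 2.0, 0.8
mean_150 = truncated_mean(MU, SIGMA, 150.0)
mean_300 = truncated_mean(MU, SIGMA, 300.0)
print(mean_150, mean_300, mean_300 - mean_150)
```

When little probability mass lies beyond the smaller cut-off, the two means differ only marginally, which is the behaviour the sensitivity test in the text is looking for.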
During the initial analysis, it was found that, for walking, the exponential distribution provides a perfect fit for all the states, indicated by a minimum $\chi^2$ statistic value of zero. Thus, for walk,    between the modelled and observed number of counts in the bins. It is because the $\chi^2$ statistic is not a normalised value and hence does not facilitate comparison across different cases. The distinction between good and poor fit was easy to identify, as a good fit provided a correlation of 0.99 or more while that of a poor fit was much lower. Fig. 1 presents an example of fitting two different distributions for cars in the state of Assam. It can be seen that lognormal is a much better fit than Weibull: the former has a $\chi^2$ value of 560 while the latter has a value of 5574.
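The comparison of modelled and observed bin counts by correlation can be sketched as follows. The Pearson correlation coefficient is assumed here, and all counts are hypothetical values invented for illustration, not census or fitted data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical counts in the 7 distance bins.
observed = [1200, 5400, 4300, 2600, 900, 500, 100]
good_fit = [1180, 5450, 4280, 2590, 930, 480, 90]
poor_fit = [3000, 3500, 3000, 2500, 1500, 1000, 500]

r_good = pearson(observed, good_fit)
r_poor = pearson(observed, poor_fit)
print(r_good, r_poor)
```

Unlike the chi-square statistic, the correlation is bounded between -1 and 1, so values from different modes and states can be compared directly.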
In the case of cycle, it was observed that in most states the counts in the 21-30 km bin are more than, or in some cases almost the same as, those in the preceding bin of 11-20 km. Note that both bins are of equal size (10 km). There are some exceptions, such as Delhi, where the number of cycle trips in the last bin (21-30 km) is 30% of that in the preceding bin (11-20 km). Any distribution for cycle distance is likely to have a negative slope in this distance range (>10 km) (see, for example, Fig. 2). Therefore, with the two bins of equal size (10 km), it is not possible to have more trips in the 21-30 km range than in the 11-20 km range. There is clearly some discrepancy in the data.
An alternative way to highlight this discrepancy is by observing the relative shares of different modes within each distance bin. For bins of longer trip distance, the relative share of motorised modes is expected to increase, and that of non-motorised modes is expected to decrease. Fig. 3 presents this data for one of the states (Andhra Pradesh) where the 21-30 km bin has a higher count than the 11-20 km bin, and for Delhi, where the reverse is true. For 21-30 km, the relative share of cycle trips (chequered pattern) in Andhra Pradesh increases abruptly, while this transition is gradual in the case of Delhi.
To correct this discrepancy, the two bins (10-20 km and 20-30 km) were combined into one large bin of 10-30 km. Thus, for cycling, the revised observed data consists of 4 bins: $n^{1}_{obs}$, $n^{5}_{obs}$, $n^{10}_{obs}$, and $n^{30'}_{obs}$, in which the last bin refers to the number of trips from 10-30 km. After this modification, the distribution fit improved considerably. Using the fitted distributions, the number of trips was estimated for the two bins, 10-20 km and 20-30 km. It was observed that a large number of trips in the 20-30 km bin are shifted to 10-20 km. In the case of Delhi, the distribution was fitted without any modification in the bins, and a perfect fit was obtained.
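The bin merging and the later re-allocation of trips can be sketched as below. The counts and the lognormal parameters are hypothetical, and the fitting step itself is omitted: the sketch only shows how a fitted truncated CDF splits the merged 10-30 km count back into two bins.

```python
import math

def lognormal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))

def merge_last_two(counts):
    """Combine the last two bins (11-20 km and 21-30 km) into one 10-30 km bin."""
    return counts[:-2] + [counts[-2] + counts[-1]]

def split_merged_bin(total, mu, sigma, d_max=30.0):
    """Modelled trips in 10-20 km and 20-30 km from a lognormal truncated at d_max."""
    norm = lognormal_cdf(d_max, mu, sigma)
    f10 = lognormal_cdf(10.0, mu, sigma) / norm
    f20 = lognormal_cdf(20.0, mu, sigma) / norm
    return total * (f20 - f10), total * (1.0 - f20)

# Hypothetical cycle counts for the 0-1, 2-5, 6-10, 11-20 and 21-30 km bins,
# with the last bin implausibly larger than the one before it.
observed_cycle = [3000, 5000, 1500, 300, 320]
merged = merge_last_two(observed_cycle)

# Hypothetical fitted parameters standing in for the result of the fit.
n_10_20, n_20_30 = split_merged_bin(sum(merged), 1.2, 0.8)
print(merged, n_10_20, n_20_30)
```

With a density that slopes downwards beyond 10 km, the modelled 10-20 km count exceeds the 20-30 km count, which mirrors the shift of trips described in the text.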
Once the parameters of each mode in each state were known, I calculated the average distance as well as the total distance travelled by the mode by all workers. Table 2 presents the average and standard deviation of the distance. Fig. 4 presents distance-decay functions of six modes for all India. The maximum distance on the x-axis is 30 km for a better representation. The distribution parameters and detailed descriptive statistics for each mode and for each state are presented in the Supplementary material.

Conclusion
This article presents a method to fit continuous probability distributions (distance-decay functions) to the categorical count data reported by Census. Also presented are the descriptive statistics of travel distance by mode. This method can be further extended to estimate distance-decay functions at smaller levels of jurisdiction such as districts (counties) or cities. Due to the simplistic questions in the Census questionnaire about the mode of travel to work, it has been assumed that the Census respondents reported their main mode of travel, that is, the one by which they covered the longest travel distance. Further, since the census in India is conducted using personal interviews, it is possible that these questions, in some cases, were answered by proxy respondents, for instance, by other members of the household. However, no such information is available from Census to account for this bias.