Uncovering the spatial structure of mobility networks

The extraction of a clear and simple footprint of the structure of large, weighted and directed networks is a general problem that has many applications. An important example is given by origin-destination matrices which contain the complete information on commuting flows, but are difficult to analyze and compare. We propose here a versatile method which extracts a coarse-grained signature of mobility networks, under the form of a $2\times 2$ matrix that separates the flows into four categories. We apply this method to origin-destination matrices extracted from mobile phone data recorded in thirty-one Spanish cities. We show that these cities essentially differ by their proportion of two types of flows: integrated (between residential and employment hotspots) and random flows, whose importance increases with city size. Finally the method allows to determine categories of networks, and in the mobility case to classify cities according to their commuting structure.

The increasing availability of pervasive data in various fields has opened exciting possibilities of renewed quantitative approaches to many phenomena. This is particularly true for cities and urban systems for which different devices at different scales produce a very large amount of data potentially useful to construct a 'new science of cities' [1].
A new problem we have to solve is then to extract useful information from these huge datasets. In particular, we are interested in extracting coarse-grained information and stylized facts that encode the essence of a phenomenon, and that any reasonable model should reproduce. Such meso-scale information helps us to understand the system, to compare different systems, and also to propose models. This issue is particularly striking in the study of commuting in urban systems. In transportation research and urban planning, individuals daily mobility is usually captured in Origin-Destination (OD) matrices which contain the flows of individuals going from a point to another (see [2,3]). An OD matrix thus encapsulates the complete information about individuals flows in a city, at a given spatial scale and for a specific purpose. It is a large network, and as such does not provide a clear, synthetic and useful information about the structure of the mobility in the city. More generally, it is very difficult to extract high-level, synthetic * Correspondence and requests for materials should be addressed to MB (Email: marc.barthelemy@cea.fr).
information from large networks and methods such as community detection [4] and stochastic block modeling (see for example [5] and [6,7]) were recently proposed. Both these methods group nodes in clusters according to certain criteria and nodes in a given cluster have similar properties (for example, in the stochastic block modeling, nodes in a given group have similar neighborhood). These methods are very interesting when one wants to extract meso-scale information from a network, but are unable to construct expressive categories of links and to propose a classification of weighted (directed) networks. This is particularly true in the case of commuting networks in cities, where edges represent flows of individuals that travel daily from their residential neighborhood to their main activity area. Several types of links can be distinguished in these mobility networks, some constitute the backbone of the city by connecting major residential neighborhoods to employment centers, while other flows converge from smaller residential areas to important employment centers, or diverge from major residential neighborhoods to smaller activity areas. In addition, the spatial properties of these commuting flows are fundamental in cities and a relevant method should be able to take this aspect into account.
There is an important literature in quantitative geography and transportation research that focuses on the morphological comparison of cities [8][9][10][11] and notably on multiple aspects of polycentrism, ranging from schematic pictures proposed by urban planners and architects [12] to quantitative case studies and contextualized comparisons of cities [13][14][15]. So far most comparisons of large sets of cities have been based on morphological indicators [8,9]) -built-up areas, residential density, number of sub-centers, etc. -and aggregated mobility indicators [10,11] -motorization rate, average number of trips per day, energy consumption per capita per transport mode, etc. -, and have focused on the spatial organization of residences and employment centers. But these previous studies did not propose generic methods to take into account the spatial structure of commuting trips, which consist of both an origin and a destination. Such comparisons based on aggregated indicators thus fail to give an idea of the morphology of the city in terms of daily commuting flows. We still need some generic methods that are expressive in a urban context, and that could constitute the quantitative equivalent of the schematic pictures of city forms that have been pictured for long by urban planners [12].
In this paper we propose a simple and versatile method designed to compare the structure of large, weighted and directed networks. In the next section we describe this method in detail. The guiding idea is that a simple and clear picture can be provided by considering the distribution of flows between different types of nodes. We then apply the method to commuting (journey to work) OD matrices of thirty-one cities extracted from a large mobile phone dataset. We discuss the urban spatial patterns that our method reveals, and we compare these patterns observed in empirical data to those obtained with a reasonable null model that generates random commuting networks. Finally the method allows determining categories of networks with respect to their structure, and here to classify cities according to their commuting structure. This classification highlights a clear relation between commuting structure and city size.

Extracting coarse-grained information from OD matrices
For the sake of clarity, we will use here the language of OD matrices, but the method could easily be applied to any weighted and directed network from which we want to extract high-level information.
We assume that for a given city, we have the n × n matrix F ij where n is the number of spatial units that compose the city at the spatial aggregation level considered (for example a grid composed of square cells of size a, see Methods). This OD matrix F ij represents the number of individuals living in the location i and commuting to the location j where they have their main, regular activity (work or school for most people). By convention, when computing the numbers of inhabitants and workers in each cell we do not consider the diagonal of the OD matrix. This means that we omit the individuals who live and work in the same cell (considered as 'immobile' at this spatial scale).
In order to extract a simple signature of the OD matrix, we proceed in two steps. We first extract both the residential and the work locations with a large density -the so called 'hotspots' (see [16]). The number of residents of cell i is given by j =i F ij and its number of workers is given by j =i F ji . The hotspots then correspond to local maxima of these quantities. It is important to note that the method is general, and does not depend on how we determine these hotspots.
Once we have determined the cells that are the residential and the work hotspots (some cells can possibly be both), we proceed to the second and main step of the method. We reorder the rows and columns of the OD matrix in order to separate hotspots from non-hotspots. We put the m residential hotspots on the top lines, and do the same for columns by putting the p work hotspots on the left columns. The OD matrix then becomes a 4-quadrants matrix where the flows F ij are spatially positioned in the matrix with respect to their nature: on the top left the individuals that live in hotspots and work in hotspots; at the top right the individuals that live in hotspots and do not work in hotspots; at the bottom left individuals that do not live in hotspots but work in hotspots; and finally in the bottom right corner the individuals that neither live or work in hotspots. For each quadrant we sum the number of commuters and normalize it by the total number of commuters in the OD matrix, which gives the proportion of individuals in each of the four categories of flows. In other words, for a given city, we reduce the OD matrix to a 2 × 2 matrix is the proportion of Integrated flows that go from residential hotspots to work hotspots; is the proportion of Convergent flows that go from random residential places to work hotspots; is the proportion of Divergent flows that go from residential hotspots to random activity places;  1: Illustration of the ICDR method. The method decomposes the commuting flows in the city in four categories: the Integrated flows (I) from hotspot to hotspot, the Convergent flows (C) to hotspots, the divergent flows (D) originating at hotspots and finally the random flows (R), which are neither starting nor ending at hotspots. For each city with its origin-destination matrix, we can compute the importance of each commuting flow category and get a simple picture of the mobility structure in the city.
By construction, we have I, C, D, R ∈ [0; 1] and I + C + D + R = 1. This matrix Λ is thus a very simple footprint of the OD matrix that gives an expressive picture of the structure of commuting in the city, as illustrated by Fig. 1.

Commuting data
Large scale individual mobility networks are nowadays extracted from pervasive geolocated data, such as mobile phone, GPS, public transport cards or social apps data [17][18][19][20][21]. In particular, if an individual's mobile phone geolocated activity is available during a sufficiently long period of time, it is possible -under certain regularity conditions -to infer the most likely locations of her home and her workplace, and by aggregation to construct OD matrices [22][23][24]. Several parameters however impact the construction of OD matrices such as the nature of the data source (survey or user-generated geolocated data), or the spatial scale at which the OD matrix is built which can be dictated by administrative units (divisions in wards, counties, municipalities, etc.) or by technical reasons such as the density of antennas in the case of mobile phone data. Given this variety of data collection protocols, it is thus particularly remarkable that when considering the commuting flows at a city scale, various sources of pervasive data provide a very similar mobility information when compared to the OD matrices built from surveys [24]. This result needs to be confirmed for other cities and countries, but it already opens the door to a systematic use of pervasive, geolocated data as a relevant substitute to traditional transport surveys.
In the following we apply our ICDR method to OD matrices that have been extracted from mobile phone records in thirty-one Spanish urban areas during a five weeks period (see the Methods section for details on the dataset and the calculation of the OD matrices).

Hotspots
As described above, the first step consists in determining hotspots. Several possible methods have been proposed in the literature [15,25,26], and we use here a parameter-free method based on the Lorenz curve of the densities that we have proposed in a recent study (see [16] and the Methods section). Once we have determined both the origin (residential) hotspots and the destination (work) hotspots in each city, we first observe how their number scale with the population size of the city. Both these numbers for residential and employment hotspots scale sublinearly with the population size (see Supplementary Figures 4 and 5). The number of work hotspots grows significantly slower than the number of residential hotspots, showing that residential areas are (i) more dispersed in the city, and (ii) are more numerous than activity centers, as intuitively expected (see Supplementary Figure 6 that displays the locations of Home and Work hotspots in four cities that exhibit different spatial organizations). We also note here that the sublinear scaling of the work hotspots confirms previous results obtained with a totally different dataset (the number of employment centers in US cities) [27].

ICDR values
We now apply the second part of the method in order to calculate the I,C,D and R values for each OD matrix. For the 31 Spanish urban areas under study (see Supplementary Figure 1), we obtain the values shown in Fig. 2. In Fig. 2(a), we plot these values versus the population size of these cities. For this sample of cities we see that globally the proportion I of individuals that commute from hotspot to hotspot decreases as the population size increases, while the proportion R of 'random' flows increases and the proportions C and D of convergent and divergent flows seem surprisingly constant whatever the city size. In Fig. 2(b) we plot the same values but sorted by decreasing values of I which shows clearly that the I and R values are the relevant parameters for distinguishing cities from each other.
We also notice that the values obtained for another spatial scale of data aggregation confirm this trend (see Supplementary Figure 7). The decay of I flows ('integrated') flows in favor of R flows ('random') when P increases shows that the population growth among Spanish cities goes with a decentralization of both activity places and residences. As cities get bigger, their numbers of residential and employment hotspots grow (sublinearly), but these hotspots catch a smaller part of the commuting flows.

A null model
In order to evaluate to what extent the ICDR signatures of cities are characteristic of their commuting structure, we compare these values to the ones returned by a null model of commuting flows. For each city we generate random OD matrices of the same size than the reference OD matrix but with random flows of individuals that preserve the in-and out-degree of each node (see Supplementary Note 5). for the quantity R of the city i, the Z-score is given by . Essentially, we observe that the Z-scores of I and R are positive and large, while those of C and D are negative (and large in absolute value). Also as cities grow, the Z-scores of I and R increase while those of C and D decrease. These results demonstrate that the larger a city, and the less random it appears. This is in contrast with the naive expectation that the larger a city and the more disordered is the structure of individuals' mobility. For large cities, there seems to be a commuting backbone which cannot result from purely random movements of individuals. This backbone is the footprint of the city's structure and history, and probably results from strong constraints and efficiency considerations.

Robustness
Since our method first requires to determine origin and destination hotspots, one could argue that the interpretation of the I,C,D and R values will crucially depend on the particular method chosen to define these hotspots. The identification of hotspots is a problem that has been broadly discussed in urban economics (see Supplementary Note 3). Roughly speaking, starting from a spatial distribution of densities, the goal is to identify the local maxima and amounts to choose a threshold ρ * for the density ρ of individuals: a cell i is a hotspot if the local density of people is such that ρ(i) > ρ * . In order to test the impact of the choice of a particular threshold ρ * on the resulting ICDR values, we measure the sensitivity of these values as a function of the density threshold ρ * (see Supplementary Figure 8). As expected, the lower the density threshold, the larger the number of origin and destination hotspots, and consequently the larger I and the smaller R. In contrast, changing the density threshold has little impact on the C and D terms. More importantly, the conclusions drawn from the comparison of the ICDR values across cities remain the same: whatever the density threshold ρ * chosen to define residential/employment hotspots, we still observe the same qualitative behavior: a decay of integrated flows (I) in favor of 'random' flows (R), when the population size increases.

Distance of each type of flows
We now want to characterize spatially these different flows and the relation between city size and the commuting distances traveled by individuals. In each city we compute the average distance traveled by individuals per type of flows I,C,D and R. The resulting average distances measured in data are plotted in red on Fig. 3(a). We observe that the average distance for all categories of flows increases with population size, an expected effect as the city's area also grows with population size. We also observe that the average distance of convergent flows C increases faster than for other types of flows (we note that the average distance associated to convergent flows increases more than the distance associated to divergent flows D, showing that these flows are not symmetric as one could have naively expected). This result means that for this set of Spanish cities, commuters from small residential areas to important activity centers travel on average a longer distance than all other individuals. This observation could be an indication that for our set of cities, residential areas have expanded while activity centers remained at their location, leading to longer commuting distances (see Supplementary Figures 10 and 11 when considering various spatial scales of aggregation).
Another interesting information is provided by the comparison of distances measured in the data with average distances measured from the random OD matrices generated by the null model. The average distances associated to the null model are plotted in blue in Fig. 3(a). We see that for all types of flows the distances measured in the empirical data are shorter than those generated by the null model. This is another clear indication of the spatial organization of individual flows in cities. It also highlights the importance of the travel time budget in the residential locations choice. Remarkably enough, the distance of convergent flows (C) is both the largest and the one that increases the fastest with population, indicating a low degree of efficiency. The comparison of this behavior with the null model leads to interesting results. On Fig. 3(b) we plot the ratio D N ull /D Data for the four types of flow. Values lower than 1 indicate that the average commuting distance generated by the null model is shorter than the distance observed in the city. Surprisingly, we observe that small cities display a value less than one indicating the lesser importance of space at this short scale. We also see that this ratio increases faster for random flows (R) than for the others (D, C, I), suggesting a remarkable spatial structure of these R flows.
We also consider the fraction of total commuting distance by type of flow (Fig. 4). We see that for each type of flows, their respective fraction is constant and independent of city size. With the LouBar hotspots detection method [16] (see Supplementary Figure 3) and with a grid of 1km 2 square cells, we measure that roughly 40% of the total commuting distance is made on random flows while the other types represent each about 20%. This result shows that the method is able to identify where most  [16] is used here to define residential and employment hotspots.
of the commuting distance is traveled. In particular, we see that the natural, obvious flows (I) from residential centers to activity centers are not the most important ones, and that the decentralization of commuting flows seems to be the rule for the Spanish cities in our sample.

Classification of cities
Finally, the ICDR signature of their OD matrix allows to cluster cities with respect to the structure of their commuting patterns. We measure the euclidian distance between the cities' ICDR signatures and we then perform a hierarchical cluster analysis. Fig. 5 shows the dendrogram resulting from the classification. Four well-separated clusters are identified on this dendrogram, and Table I gives the average value of each term along with the average population of the cities composing the cluster. Remarkably these summary statistics show that largest cities are clustered together and are characterized by a larger proportion of 'random' flows (R) of individuals both living and working in parts of the city that are not the dominant residential and activity centers. This can be interpreted as an increased facility in bigger urban areas to commute from any part of the city to any other part. Further studies on other cities and countries are needed at this stage in order to discuss the relevance of the proposed classification.
It is also important to test the robustness of this classification and we show that introducing a reasonable amount of noise in the OD matrices does not change the classification (see Methods and Supplementary Figure 12). This sensitivity test confirms that the clustering is robust against possible errors in the data source and in the extraction of the mobility networks. The classification of cities based on their ICDR values is also reasonnably robust to a change of the method used to define residential and employment hotspots (see Supplementary Figure 13).

Discussion
We have proposed a method to extract high-level information from large weighted and directed networks, such as origin-destination matrices. This method relies on the identification of origin and destination hotspots, and this first step can be performed with any reasonable method. The important second step consists in aggregating flows in four different types, depending whether they start and end from/to a hotspot or not.
We have applied this method to commuting networks extracted from mobile phone data available in thirty-one Spanish cities. The method has allowed us to highlight several remarkable patterns in the data: • Independently of the density threshold chosen to determine hotspots, the proportion of integrated flows (I) decreases with city size, while the proportion of random flows (R) increases; • On average and for all cities considered here, individuals that live in residential main hubs and that work in employment main hubs (I flows) travel shorter distances than the others (C,D,R flows); • When the city size increases, the largest impact is on convergent flows (C) of individuals living in smaller residential areas (typically in the suburbs) and commuting to important employment centers;  The largest cities are clustered together. As cities get bigger, the 'random' component (R) of their commuting flows increases, which signals that it is easier to commute from any place to any other in large cities.
• The classification of cities based on the ICDR values leads to groups with consistent population size, highlighting a clear relationship between the population size of cities and their commuting structure.   In addition, the comparison with a null model led to interesting conclusions. Flows in cities display a high level of spatial organization and as the population size of the city grows, the increase of the Z-scores of I,C,D,R shows that the structure of the mobility is more and more specific, and far from a random organization. We note that an interesting direction for future research would be to find some analytical arguments using simple models of city organization for estimating these flows, and how they vary with population size.
Our coarse-graining method provides a large scale picture of individual flows in the city. In this respect it could be particularly useful for validating synthetic results of urban mobility models (such as [28] for example), and for comparing different models. An accurate modeling of mobility is indeed crucial in a large number of applications, including the important case of epidemic spreading which needs to be better understood, in particular at the intra-urban level [29,30].
It would also be interesting to apply the proposed method to other mobility datasets, with different time and spatial scales, in order to test the robustness of these commuting patterns. Another direction for future studies could be to inspect the time evolution of the I,C,D and R values for datasets describing travel to work journeys over several decades (such as national travel surveys). The method could also be applied at larger spatial scales, for example to capture dominant effects in international migration flows. More generally, we believe that an important feature of the ICDR method is its versatility, as it could be applied to any type of data that is naturally represented by a weighted and directed network.

Data
The dataset used for our analysis comprises 55 days of aggregated and anonymized records for 31 urban areas of more than 200,000 inhabitants. No individual information or records were available for this study. The records included the set of Base Transceiver Stations (communication antennas, BTS) used for the communications as presented in the Call Detail Records (CDR). A CDR is produced for each active phone event, including call/sms sending or reception. The number of anonymized users represents on average 2% of the total population and at most 5% of the total population. These percentages are almost the same for all the urban areas. From the CDR data obtained for 20 weekdays (from mondays to thrusdays only), we extracted home and work places for all the anonymized mobile phone users in the dataset. The output of this processing phase is an OD commuting matrix for each urban area, at the scale of the BTS point pattern. In order to facilitate the calculations and the comparison of the results between different cities, the OD matrices are then transposed on regular square cells grids of varying size a (see Supplementary Note 2).

Extraction of OD matrices from mobile phone data
In order to extract OD matrices from phone calls, we select a subset of users with a mobility displaying a sufficient level of statistical regularity. For this analysis we considered commuting patterns during workdays only. The users' Home and Work locations are identified as the Voronoi cells which are the most frequently visited on weekdays by each user between 8 pm and 7 am (Home) and between 9 am and 5 pm (Work). We assume that there must be a daily travel between the home and work locations of each individual. Users with a call activity larger than 40% of the days under study at home or work are considered as valid. We then aggregate the complete flow of users and construct the OD matrix with the flows between a Voronoi cell classified as home and another cell classified as a work place. Since the Voronoi areas do not exactly match the grid cells, we use a transition matrix to change the spatial scale of the OD matrix, that is to transform the

Spatial scale of the OD matrix
The OD matrix is the standard object in mobility studies and transport planning [2] and contains information about movement of individuals in a given area. More precisely, an OD matrix is a n * m matrix where n is the number of different 'Origin' zones, m is the number of 'Destination' zones and F ij is the number of people commuting from place i to place j during a given period of time. In transport surveys the size of the OD matrix depends on the spatial scale at which the mobility data has been collected. Traditionally the zones that are used to partition the city are the administrative units, whose size can vary from census and electoral units to whole departments or states, depending on the purpose for building the OD matrix.
In this study we applied our ICDR method to cities divided in square cells that are smaller than administrative units, allowing for a better spatial resolution. In the case of OD matrices extracted from CDR mobile phone data, the maximal resolution corresponds to the BTS (antennas) point pattern. The ICDR method proposed in this paper does however not depend on a particular spatial scale and can be applied on OD matrices available at coarser spatial resolutions as well. For a given territory the results obtained with the ICDR methodthe I,C,D and R values in the first place -will obviously depend on the spatial resolution and will also depend on the method used to define hotspots. It is important to note that when the ICDR method is used for comparing cities, the spatial resolution and the hotspots identification method should be the same for all cities (see Supplementary Notes 4 and 6, and Supplementary Figures 8 and 9 for results obtained with another hotspots delimitation method, and Supplementary Figure 7 for results obtained when considering another spatial scale of aggregation).

Robustness of the classification of cities
In order to ensure that the classification of cities based on their ICDR matrices is robust, we introduce a noise in the flows F ij (for all the thirty-one cities). We focus on the case where the workplace of an individual can be modified and where the number of individuals living in each cell i is kept constant. More precisely, the noise is introduced as follows: 1. We pick up a uniform random positive integer g, the number of individuals whose workplace is reshuffled. This number g varies from 1 to N = F ij the total number of commuters in the city.
2. We repeat g times the following operation: we pick up randomly a residence and a workplace (a couple of values (i, j)) and move one individual from her workplace j to put another randomly chosen workplace j : The parameter of this workplace reshuffling is then f = g/N . In order to evaluate how much the classification of cities is affected by this noise, we compute the Jaccard index J I between the reference classification of cities in k groups and the classification in k groups obtained for the noisy OD matrices. The Jaccard index measures the similarity of two partitions P and P of same size: where • a : number of city pairs that are in the same group for both P and P • b : number of city pairs that are in different groups in P but in the same one in P • c : number of city pairs that are in the same group in P but not in P (or conversely) The Jaccard index J I is in the range [0; 1] and the closest it is to 1, the larger the similarity between P and P . We generate 100 noisy matrices for each value of f and compute the average valueJ I of the Jaccard index. This average value encodes the distance between the reference partition P of cities in k groups and the partitions of cities in k groups obtained for the noisy OD matrices. Supplementary Figure 12 shows the values ofJ I versus the proportion f of reshuffled individuals, for different number of groups k. The red shaded rectangle on each panel corresponds to the mean value +/-the standard deviation obtained for 1, 000 replications of a null model, in which permutations of cities among the k groups are randomly performed. We observe here that up to 20% of reshuffled individuals, the average valueJ I obtained is always significantly larger than the one obtained for the null model, indicating that the classification is robust even for important values of the noise.

Acknowledgments
We thank the anonymous referees for interesting and constructive comments. The research leading to these results has received funding from the European Union Seventh Framework Programme   The scaling relation is sublinear, indicating an 'economy of scale' in the spatial organisation of cities. The exponent is remarkably smaller in the case of the work/main daily activity hotspots, which means that in Spanish cities the number of important places that concentrate many jobs and daily amenities, grows slower than the number of important residential places.  Figure S5. Effect of grid size on scalings. Numbers of residential hotspots and work hotspots (defined with the LouBar method, see section ) vs. the population size of the city for different sizes a of grid cells. In any case the scaling relations for both quantities stay sublinear, indicating an "economy of scale" in the spatial organisation of cities. Also remarkable is that the scaling exponent of the number of residential centers is always larger than the exponent of work centers. rank Flows I C D R Figure S9. ICDR values obtained with the 'Average' criteria. I (integrated),C (convergent),D (divergent) and R (random) values of the 31 Spanish cities ranked by decreasing order of I, for a = 1km and when defining hotspots with respect to the 'Average' criteria (i.e. all cells whose density of residents/workers is greater than the mean value are considered as residential/employment hotspots.  Figure S13. Effect of hotspots definition on clusters. Jaccard index JI resuming the sensitivity of the clusters to the threshold chosen to determine hotspots. The light red shaded rectangle on each panel corresponds to the mean value +/-the standard deviation obtained for 1000 replications of a null model, in which permutations of cities among the k groups are performed randomly.

Supplementary Notes
Supplementary note 1 Comparing properties of cities of very different population sizes and areas requires to rely on a harmonized, spatial delimitation of cities that goes beyond the administrative boundaries defined by national authorities. These administrative boundaries may be somewhat arbitrary and do not follow neither the morphological nor functionnal spatial footprint of the city. To this end we have chosen to rely on the urban areas defined by the AUDES initiative (Areas Urbanas De ESpaña: AU-DES project. Documentation and open data available at http://alarcos.esi.uclm.es/per/fruiz/audes/ (accessed January 27, 2014)). This project is an attempt to capture some coherent spatial delimitations of Spanish cities regarding the (journey to work) commuting patterns of individuals living in the core city of each urban area and in their surrounding municipalities. These delimitations are built upon statistical criteria based on the proportion of residents of surrounding municipalities that commute to the main city to work. The locations and spatial boundaries of the 31 Spanish urban areas analysed in this study are represented on Supplementary Figure S1. 4. The Voronoi cells between the UA, the territory outside the UA and the sea.
To compute the number of users associated with the intersections between the Voronoi cells and the UA we have to take into account these different types of Voronoi cells. Let m be the number of Voronoi cells, N v the number of mobile phone users in the Voronoi cell v and A v the area of the Voronoi cell v, v ∈ |[1, m]|. The number of users N v∩U A in the intersection between v and UA is given by the following equation: We note in Equation 3 that we remove the intersection of the Voronoi area with the sea, indeed, we assume that the number of users calling from the sea are negligeable. Now we consider the number of mobile phone users N v and the associated area A v of the Voronoi cells intersecting the UA (see Supplementary Figure S2b).
Extraction of Origin-Destination matrices We cannot directly extract an OD matrix between the grid cells with the mobile phone data because each users' home and work locations are identified by the Voronoi cells. Thus, we need a transition matrix P to transform the BTS OD matrix B into a grid OD matrix G.
Let m be the number of Voronoi cells covering the urban area and n be the number of grid cells. Let B be the OD matrix between BTSs where B ij is the number of commuters between the BTS i and the BTS j. To transform the matrix B into an OD matrix between grid cells G we define the transition matrix P where P ij is the area of the intersection between the grid cell i and the BTS j. Then we normalize P by column in order to consider a proportion of the BTSs areas instead of an absolut value, thus we obtain a new matrixP (Equation 4).
The OD matrix between the grid cells G is obtained by a matrices multiplication given in the following equation:

Supplementary note 3
A city is divided in n cells and the data give us access to its n × n OD matrix extracted from the mobile phone data. After straightforward calculation we obtain the distributions (C R ) and (C W ) of the numbers of residents/workers in the n cells composing the city. The determination of centres and subcentres is a problem which has been broadly tackled in urban economics (see for example the references given in the paper and references in [16]). Starting from a spatial distribution of densities, the goal is to identify the local maxima. This is in principle a simple problem solved by the choice of a threshold δ for the density ρ: a cell i is a hotspot if the density of users ρ(i) > δ. It is however clear that such method based on a fixed threshold introduce some arbitrariness due to the choice of δ, and also requires prior knowledge of the city to which it is applied to choose a relevant value of δ. In [16] we proposed a generic method to determine hotspots from the Lorenz curve of the densities. In the following we quickly introduce the principle of the method and its application to the determination of the residential and work hotspots of each city. We invite the interested readers to refer to [16] for further discussion on this method.
We first sort (C R ) and (C W ) in increasing order, and denote the ranked values by C 1 R (resp. C 1 W ) < C 2 R < ... < C n R where n is the number of cells. The two Lorenz curves of the distribution of residents/workers are constructed by plotting on the x-axis the proportion of cells F = i/n and on the y-axis the corresponding proportion of commuters L with: If all the cells had the same number of residents/workers the Lorenz curves would be the diagonal from (0, 0) to (1,1). In general we observe a concave curve with a more or less strong curvature. In the Lorenz curve, the stronger the curvature the stronger the inequality and, intuitively, the smaller the number of hotspots. This remark allows us to construct a criterion by relating the number of dominant places (i.e. those that have a very high number of residents/workers compared to the other cells) to the slope of the Lorenz curve at point F = 1: the larger the slope, the smaller the number of dominant values in the statistical distribution. The natural way to identify the typical scale of the number of hotspots is to take the intersection point F * between the tangent of L(F ) at point F = 1 and the horizontal axis L = 0 (see Suplementary Figure S3). This method is inspired from the classical scale determination for an exponential decay: if the decay from F = 1 were an exponential of the form exp −(1 − F )/a where a is the typical scale we want to extract, this method would give 1 − F * = a (see [16] for further discussion).

Supplementary note 4
On Supplementary Figure S8 we plot the I,C,D,R values of the 31 Spanish urban areas as a function of the density threshold ρ * chosen to define hotspots (here defined relatively to the density value ρ LB returned by the 'LouBar' method -see section )).
In the extreme case represented on Supplementary Figure S9 all cells whose number of residents/workers are greater than the mean value of the distribution of residents/workers are tagged as hotspots (see [16] for a discussion of this criteria and his comparison to the "LouBar" criteria used in this study). With a much broader acceptation of what is an hotspot, it is obvious that the term I will increase drastically since we increase both the number of residential hotspots and the number of work hotspots. Still what is important to notice is that we can still observe the qualitative trend observed with the LouBar method. As the population size increases, the decrease of the proportion I of integrated flows is accompanied by an increase of the proportion R of random flows.

Supplementary note 5
In order to evaluate to what extent the ICDR values of a given city are characteristic of its commuting structure, we compare these values to the values returned by a reasonable null model of commuting flows. The guiding idea is to generate OD matrices that (i) have the same size than the city's OD matrix (i.e that contain the same total number of individuals) ; (ii) that respect the city's static spatial organisation (i.e. the in-and out-degrees of all nodes should stay constant) ; and (iii) that randomize the flows between the nodes, i.e. with different weights of the edges. Such a null model of flows that respects the static organization of the city is indeed more reasonnable and realistic than a null model that would respect the total number of individuals in the matrix but that would modify the in-and out-degrees of the nodes.
To generate a random graph that conserves the in-and out-degree of each node of the reference graph, we use the Molloy-Reed algorithm [31] which complexity is in O(n), where n is the sum of the weights of the edges (i.e. the number of individuals in the OD case).

Supplementary note 6
In order to evaluate the sensitivity of the classification of cities to the number of hotspots selected in cities, for each city i we make vary the number of work hotspots between H i w the reference value returned by the LouBar method (see Supplementary Note 3), and two times the reference value 2 × H i w : H LB w (1 + δ), δ ∈ [0; 1]. As for the sensitivity to noise in the flows C ij , we evaluate the stability of the classification of cities in k groups with the Jaccard index (see Supplementary Note ). The values of J I as a function of δ are represented on Supplementary Figure S13.