Generalized gravity model for human migration

The gravity model (GM) analogous to Newton’s law of universal gravitation has successfully described the flow between different spatial regions, such as human migration, traffic flows, international economic trades, etc. This simple but powerful approach relies only on the ‘mass’ factor represented by the scale of the regions and the ‘geometrical’ factor represented by the geographical distance. However, when the population has a subpopulation structure distinguished by different attributes, the estimation of the flow solely from the coarse-grained geographical factors in the GM causes the loss of differential geographical information for each attribute. To exploit the full information contained in the geographical information of subpopulation structure, we generalize the GM for population flow by explicitly harnessing the subpopulation properties characterized by both attributes and geography. As a concrete example, we examine the marriage patterns between the bride and the groom clans of Korea in the past. By exploiting more refined geographical and clan information, our generalized GM properly describes the real data, a part of which could not be explained by the conventional GM. Therefore, we would like to emphasize the necessity of using our generalized version of the GM, when the information on such nongeographical subpopulation structures is available.


Introduction
For decades, the gravity model (GM) has successfully explained flows between geographically separated two regions such as traffic flow [1,2,3,4,5], international economic trades [6,7], and human migration [8,9].The GM is named after Newton's law of universal gravitation because of the similarity in the formula: a certain type of flow between two regions is proportional to the product of 'mass' of each region and inversely proportional to a certain power of distance between the regions.We interpret the mass depending on contexts; we can quantify the relative importance of regions in human migration from their population sizes, and the relative importance of countries in international trades from their economic scales.This simple but powerful model has succeeded in interpreting the real world.For instance, the GM indeed accurately describes the empirical data of daily human mobility in multiscale mobility networks [10].It also nicely explains the inter-and intra-city traffic flows in Korea [11], along with the passenger flows in the Korean subway system [12].
However, those examples only concern the spatial aspect of population.We can easily imagine more complicated situations such as the population flow between other attributes than the spatial or regional attributes, when the population at a given region consists of subpopulation structures.The subpopulation structures characterized by attributes for population flows can be ethnic groups, income levels, etc.In particular, when spatial movement of the subpopulation belonging to one attribute to another subpopulation takes place, the flow between attributes becomes relevant.This applies not only to the population flow but also to the international trade, for instance.Goods are transferred in different economic sectors as the attributes, and one may aim at estimating the flow between economic sectors.One such previous attempt to apply the GM to explain flow between attributes is [13] partly by a subset of the authors of this paper.However, the results have revealed the limitation of the conventional GM when the geographical location of population center of a clan is not representative.The limitation stems from the process of significant coarse graining of the detailed geographical information of clans into a single point (the population center).
In this paper, we exploit detailed information on the population substructures by generalizing the GM, rather than coarse graining the information.First, we formulate the generalized gravity model (GGM) to take the full information available on the subpopulation (characterized by both geographical information and attribute information) flows.As we show in figure 1(a), when subpopulations are distinguished not only by geographical regions but also by attributes, flows between regions or attributes can be calculated by properly taking subpopulation flows without information loss (see figures 1(b) and (c), respectively).We also would like to emphasize the necessity of calculating subpopulation flows, because the population flows calculated from the coarse grained population data and those from individual subpopulation data are not equivalent.As a concrete example, we apply the GGM to marriage records combined with census data.The results show that it effectively captures the geographical constraint imposed in the marriage patterns in the past, in contrast to the GM.
The paper is organized as follows.We first formulate the GGM in section 2. We We present the schematic figure for two spatial distributions of attributes, i and j, and flows between subpopulations.Each attribute has its own spatial population distribution represented by the color gradient with the center of mass, e.g., i CM or j CM (marked by the × symbol).Since the population for a given region is again distinguished by its attribute, each subpopulation is distinguished by both region and attribute indices denoted by uppercase letters, A, B, C, and D, and lowercase letters, i and j, respectively.We illustrate all of the possible flow combinations between subpopulations, and we use the rectangular and oval boundaries to distinguish regions and attributes, respectively.(b) We show the comparison of flows between regions in the viewpoints of GM and GGM.The GM described on the left considers the flow between regions.It integrates the population size at a given region first to calculate the flow, while the GGM on the right first considers subpopulation flows and integrates them.As we show in section 2, the results from the GM and GGM are not equivalent.(c) We compare the flows between attributes in the GM and GGM.The GM on the left takes into account the flow between attributes i and j from the centers of populations as in [13], which causes some information loss due to the coarse graining.In contrast, we keep the entire information available by integrating all of the subpopulation flows in the GGM (this paper).
introduce our data set in section 3 and apply the model to this data.The result in section 4 demonstrates that the GGM indeed captures the geographical information not available from the GM.Finally, we conclude our work in section 5.

The GGM
Our derivation of the GGM is a natural extension of using the maximum entropy principle to derive the GM [14,15], where we replace regional indices with both regional and attribute indices.We provide our step-by-step derivation in this section partly for a pedagogical purpose and the self-containedness of this paper, but most importantly, we can directly demonstrate the problem of using the coarse-grained population data during the derivation.The maximum entropy principle is the way to estimate a real probability distribution by maximizing entropy.In particular, this method is useful for systems with many degrees of freedom because it focuses on only a few macroscopic quantities.The real probability distribution is estimated by the maximum entropy principle from the agreement of those observed quantities.Each Lagrange multiplier corresponding to each constraint in maximization gives the corresponding model parameter.
Let us start from the number N (i j) AB of people who move from region A with attribute i to region B with attribute j, where we take the convention of uppercase letters as the subscript for the region indices, and the lowercase letters as the superscript for the attribute indices.The sets of attributes {i} for the sender side and { j} for the receiver side can be different.For example, {i} and { j} can be the education level and the income level, respectively, when we try to describe the flow of employment from one city's education system to another city's industry.Our formalism adopts the discrete indices and summation, but it should be straightforward to deal with the continuous cases by using continuous variables and integration.The total number N (i) A of people moving from region A with attribute i to anywhere with any attribute is then In the same way, the total number Ñ( j) B of people who arrive at region B with attribute j from anywhere with any attribute is given by When people move, they have to pay the cost, which is naturally a function of the distance between two regions, among other factors.We denote the cost to move from (i, A) to ( j, B) for each unit of movement by c (i j) AB .Then, the total moving cost C is the following weighted sum, We can then write down the number W of all possible arrangements of travelers considering the multiplicity factor N (i j) AB as where N is the total number of moving people, N = i,A, j,B N (i j) AB .In the entropy maximization scheme, {N (i j) AB } is estimated from maximizing the Boltzmann entropy k B log W (equivalently maximizing W) with constraints.We consider three constraints under which W is maximized: the outflows {N (i)  A }, the inflows { Ñ( j) B }, and the total moving cost C represented in equations ( 1)-(3).For this optimization problem under given constraints, we have to use the Lagrange multiplier method, i.e., to find the stationary point of the Lagrangian where λ i A , λ j B , and γ are the Lagrange multipliers for each constraint.This problem is essentially the recap of the derivation for the most probable distribution in terms of a given energy value, from the standard formalism of canonical ensemble in statistical mechanics, so the readers may check the details in any standard statistical mechanics textbooks such as [19].The moving cost for each movement and the γ parameter here play the roles of energy and inverse temperature there, respectively.The solution of maximizing equation ( 5) is given by Note that all λ i A and λ j B become unity from the constraint itself (the mass conservation), so there is only one free parameter γ, which is determined by the real data.Later, we will specifically choose the γ value that minimizes the error between the model and real data.
The flow from region A to region B usually decays as a function of the distance between them due to the obviously rising cost, and thus we have to choose the cost function c(r) as an increasing function of the distance r between two regions.Conventionally, we set the form c(r) ∝ ln r [20], which leads equation (6) to Note that, though the numbers of leaving or arriving people, N A and Ñ(j) B , are not the population size P attribute region at each region and attribute, generally those are assumed to be linear to the population sizes [N (i) A and Ñ( j) yielding In this case, we can reproduce the GM for the flow between two spatial regions where P A and P B are total numbers of people who live in region A and region B, respectively.However, when N (i) A and Ñ( j) B are nonlinear with respect to the population sizes such as in [10], i.e., N (i) β , the population flow from region A with attribute i to region B with attribute j: In this case, the flow from region A to region B regardless of attributes: N AB cannot be explained by the GM unless α = β = 1 or the population contains only one attribute, because Note that the conventional GM provides the upper or lower bounds: i [P (i) A ] α ≤ [P] α for α > 1 and i [P (i)  A ] α ≥ [P] α for α < 1, by using the convexity or concavity of the functional form.In parallel, summing up N (i j) AB for all of the regions gives the number of moving people from attribute i to j, This is not reducible to the GM either, because where r(i CM , j CM ) is the distance between the centers of population of i and j, as shown in figure 1.The only case when the two expressions actually coincide is the assumption implicitly made in [13]-the population of each attribute i is treated as the 'point mass' located in a single location in space, namely, i CM .The GGM estimates the flow between attributes without such a coarse graining process involving the information loss.Hence, the GGM is the correct way to handle subpopulation structures when it comes to the GM of population flow.
We later show that it indeed effectively captures the geographical constraints for the marriage flow in the past obtained from the data, which was not possible with the coarse-grained version of the GM due to the widely distributed population of the clans [13].

Data sets
In the traditionally patriarchal culture of Korea after around 17th century, a bride usually moved to her groom's place in the past, once they got married.We treat this type of migration caused by the marriage as our main data of human migration.By applying the GGM to marriage patterns between clans in the past, we estimate the geographical constraints.We take the real marriage flow O (i j) from the bride clan i to the groom clan j from the family book data called jokbo.We present more details in section 3.1.To compute the model flow, we extract the distance r AB between two regions and the distribution of the population for each clan, N (i) A (the bride side) and Ñ( j) B (the groom side), from the modern census data in 1985, 2000, and 2015-the three particular years when the information on the regional distribution of each clan's population is available.
We measure the distance r AB based on geographical coordinates of the regions using the Google maps application programming interface [21].The traveling distance within the same region r AA is estimated as the square root of the region's area.We assume that the size of the moving populations, represented by N (i) A and Ñ(j) B , are proportional to that of the resident populations (from the census data) of the corresponding clans living in the corresponding regions, so we just take the face values of populations in the census data and treat them as the migrating population for simplicity.As argued in [13], we use this modern population data to estimate the past migration flow between clans in jokbo data, based on the fact that the proportion of each clan living in each region with respect to the total population of Korea has been relatively steady.

Jokbo data
Jokbo, or the Korean family book, records the members of paternal lineage and each member's spouse and children.Even though a bride does not change her family name after marriage in Korean culture, she was (and still is in many conservative families) considered Table 1.The volume of the jokbo and the census data.The number of bride clans and the total number of entries in each jokbo are counted based on the existing clans in the census data.Note that we count all brides in the jokbo whether it includes birth and death dates or not, and thus the volume can be different from the previous research [13,16,17,18].In addition, the population sizes of the groom clans in the census data are presented.to belong to the groom's family after marriage.The key element of jokbo for our research is the fact that it records information of the female spouse's original clan including the information on its geographical origin.Each clan has its own jokbo, which is passed down to descendants.Previously, the distributions of clans in Korea have been studied based on ten jokbo data [16,17,18].Marriage patterns using the same data set have been studied in [13] with the GM framework under the assumption described in the left figure of figure 1(c).We also use the same ten jokbo data set, but at this time we merge two jokbo among ten because those two are different subgroups of the same clan (because our attribute unit is the clans), which results in the total number of nine distinct jokbo used in our analysis.We count how many brides from clan i married the grooms from clan j, the owner of jokbo, and treat the number of brides as the real migration flow O (i j) from the bride clan i to the groom clan j.Each jokbo contains between 5 146 and 189 158 marriage entries (see table 1 for detailed statistics).We index the jokbo in the ascending order of the value of γ ( j)  opt (that will be introduced in section 4) predicted from the 2015 census data.There is a single case of a tie, and we break it by using γ ( j)  opt from the 1985 data.

Census data
We assume that the outflow N (i) A from (i, A) and the inflow Ñ(j) B to ( j, B) are proportional to their population sizes, P (i) A and P ( j) B , to predict the flow N (i j) from equation (11).The population size of each clan residing in each region is taken from the Korean census data [22], where the spatial resolution of the data is determined by the set of 194 administrative regions.In particular, we use the census data in the years 1985 and 2000 as in [13], and the new data in 2015.We present the detailed population statistics for each groom clan in the census data in table 1.
Due to the changes in the administrative boundaries over 30 years (between 1985 and 2015), we have generated the common set of 194 administrative regions for the three different The error landscapes as the function of γ and the scatter plot of real and estimated marriage flows at γ ( j) opt for each groom clan j, with the census data in 2015.We index the groom clans according to γ ( j)  opt in the ascending order, from 1 to 9. The panels (a)-(i) correspond to the groom clans 1-9, respectively.For the actual error landscape plots, we use the normalized error E ( j) (γ)/E ( j) (γ = 0) with respect to the γ = 0 case.The vertical dashed lines in the error landscapes indicate the γ ( j)  opt value that gives the minimum value of E ( j) .
years, where we have unified the administrative regions whose boundaries had been changed, following the procedure of [13] to unify the administrative regions (for the two different years: 1985 and 2000, in [13]).

Results
We apply the GGM expressed in equation ( 11) to the marriage patterns in the past.The real number of marriage entries {O (i j) } are listed in the jokbo data, and we compare the predicted flow {N (i j) } from the model with {O (i j) }.For each groom clan (corresponding to each jokbo clan) j, the difference is quantified by the error calculated from the list of the bride clans {i}.Note that we discard the self migration flow E ( j j) when we calculate E ( j) , because the marriage between the same clans was forbidden in the past and it is indeed significantly underrepresented as reported in [13].In practice, we also checked that ignoring E ( j j) does not make much of a difference in our results.The proportionality factor for equation ( 11) is calculated by minimizing E ( j) at a given value of γ ( j) .The optimal value γ ( j) opt is assigned as the γ ( j) value that minimizes the error E ( j) .In this Figure 3.The estimated exponent γ ( j) opt of distance in equation ( 11) from the census data in 1985, 2000, and 2015, for each groom clan j.The horizontal axis indicates groom clans, and we use three different types of symbols to distinguish the results for each year.We shade every other column for better readability.
Table 2.The results of γ ( j)  opt for each groom clan j, from 1985, 2000, and 2015 census data, respectively.For the comparison between the GGM and the GM, we provide the values of ∆e ( j) (%) in equation ( 14) as the percentage, representing the relative performance of the GGM.We also characterize the population distribution of each clan by measuring the dispersion ∆R ( j) (km) in equation (15) and the effective number of occupied regions n ( j) in equation (16).∆R ( j) n ( j) γ ( j) opt ∆e ( j) ∆R ( j) n ( j) γ ( j) opt ∆e ( j) ∆R ( j) n ( j) case, we vary the γ ( j) value from 0 to 10 with the resolution of 0.1.The obtained γ ( j) opt value indicates the geographical constraint for the brides' migration to the groom clan j.
In figure 2, on the left side of each panel (corresponding to each groom clan j), we show the error in equation ( 13) as a function of γ ( j) , via the fact that N (i j) is a function of γ ( j) .On the right side of each panel, we also show scatter plots comparing the real flow O (i j) versus the predicted flow N (i j) from the GGM at a given γ ( j)  opt value with the guideline corresponding to O (i j) = N (i j) .The predicted flow N (i j) and the number of entries O (i j) in jokbo are indeed close to the O (i j) = N (i j) line.Since the results from 1985, 2000, and 2015 census data are qualitatively the same, we only show the results in 2015.Except for the clan 1, the GGM actually captures nonzero γ ( j)  opt values, while the exponent γ ( j) opt always vanishes when we use the GM for all of the jokbo [13].
We calculate N (i j) from each census data in 1985, 2000, and 2015, so we obtain the The performance of the GGM expressed as the normalized reduced error in equation ( 14) from the census data in 2015 for each groom clan j, compared with the model performance from geographically scrambled data.The values for the real data and the average values from the shuffled data are shown as the large purple diamonds and the small green diamonds with the error bars representing the standard deviation over 100 realizations, respectively.As in figure 3, The horizontal axis indicates groom clans and we shade every other column for better readability.
three γ ( j) opt values corresponding to each year, for each groom clan (see figure 3).From the similarity of error landscapes in each census data, the γ ( j)  opt values for different years are not much different for a given groom clan j (check table 2 for details), which indicates that the results of γ ( j)  opt are temporally robust for a given groom clan.It is also interesting to note that many γ ( j)  opt values are around 2 corresponding to the same formula with demographic gravitation introduced in [2].Most importantly, compared with the GM results, i.e., γ ( j) opt 0 for all of the groom clans [13], the GGM indeed yields nonzero γ ( j)  opt values except for the clan 1.It implies that the GGM actually captures the information of the geographical constraint on the flow, in contrast to the GM where it is hard to capture this geographical information due to the coarse graining, i.e., treating the population of a clan as a point particle at a center of mass.On the contrary, the GGM uses much more detailed information of the subpopulation structure that eventually leads us to capture the actual geographical constraint imposed on the flow.
To validate our model, we measure the performance of our model compared with the GM using the normalized reduced error with respect to the case of no geographical constraint, i.e., γ = 0, defined as The normalized reduced error ∆e ( j) quantifies the improvement of performance by using the GGM compared with the GM, which results in γ ( j) opt 0 for all of the clans [13].Large values of ∆e ( j) indicate significance of geographical constraints in the migration flow.Except for clans such as j =1, 5, and 9, the normalized reduced error ∆e ( j) 10%, as shown in figure 4.
To demonstrate the statistical significance of geographical information of the data, we shuffle regional indices for the groom clan j to obtain the corresponding surrogate γ ( j)  opt values, also shown in figure 4.
For the exceptional cases of j = 1, 5, and 9, we suspect the lack of geographical information in the data itself, as we argue.To test the statistical significance of spatial correlation in the data, we shuffle the regional indices to scramble geographical information.We examine the result of our model in this shuffled data, which is shown in figure 4. It supports that the GGM extracts more geographical information than the GM, by capturing the nonzero γ exponent.If the shuffled data gives similar results to those from original data, the original data contains a small amount of geographical information.As shown in figure 4, this situation precisely happens for the clans j = 1, 5, and 9, whose original and shuffled data give similar results.In other words, the data corresponding to those three clans originally contain less geographical information than the other clans.The small ∆e ( j) , therefore, does not come from the GGM but from the clan data itself.Hence, we conclude that as long as the data has enough geographical information, the GGM effectively extracts the corresponding information.
As mentioned above, geographical information of the population distribution is closely related to the model performance and the statistical significance of the nonzero γ ( j)  opt values.To quantify the geographical information in the distribution of clans more systematically, we introduce two measures: the dispersion that quantifies how strongly localized the populations are, and the homogeneity that focuses on how uniformly the populations occupy distinct regions.For the latter, we use the concept of the effective number of occupied regions based on the Rényi entropy for a given probability distribution, as in [23].We define the dispersion ∆R ( j) from the centroid R ( j) of the clan j by taking population fractions as weights: where r A is the location of the administrative region A. The population fraction f ( j) A is the population of clan j living in A divided by the total population of clan j, and || • • • || is the Euclidean norm.The population centroid of clan j is then R ( j) = A f ( j) A r A .As the concept of moment of inertia or radius of gyration [13], ∆R ( j) measures how (geographically) widely a certain clan is distributed.The effective number of occupied regions is defined as the reciprocal of the heterogeneity quantified by the second moment of the population fraction, given by n In contrast to ∆R ( j) , n ( j) measures how many of distinct regions (regardless of their geographical location) a certain clan occupies effectively; there are scaling relations for extreme cases: n ( j) the total number of administrative regions when the clan j is uniformly distributed to the entire set of administrative regions, while n ( j) 1 when the clan j is almost exclusively living in a single particular administrative region [23].The dispersion ∆R ( j) and the number n ( j) of occupied regions are usually positively correlated.However, ∆R ( j) can be large even when n ( j) is small, e.g., when the clan has multiple localized residential regions.Hence, we use both measures for more accurate identification of the population distributions.For all of the 788 clans in the 2015 census data, we measure both ∆R ( j) and n ( j) and present them as the density plot in figure 5.Among all of the clans listed in the census data, all of the groom clans corresponding to the jokbo data have relevantly large values of dispersion and the effective number of occupied regions.Note that the combination of large ∆R ( j) and small n ( j) (as discussed before) is observed indeed, while the combination of small ∆R ( j) and large n ( j) does not appear, as shown in figure 5.This contrast hints the existence of such multiple localized residential regions, which is also discussed in [13].
Finally, we compare the relative performance of GGM with these two measures, presented in figure 6.There is a trend of increasing ∆e ( j) when the population distribution has more geographical information, low dispersion and low effective number of occupation regions.The Pearson correlation coefficients between ∆e ( j) and the dispersion R ( j) or the effective number n ( j) of occupied regions are −0.270 and −0.213, respectively, implying the anticorrelation.This observation again confirms that when the data includes meaningful geographical information, the GGM captures the geographical constraint, while the GM may not.

Conclusions and discussion
In this paper, we formulate the GGM to properly take subpopulation structures for human migration, by keeping the entire geographical distribution of the subpopulations.The key aspect of our point is that we need to calculate individual subpopulation flows, before trying any geographical coarsening.To test the validity of the GGM, we investigate the marriage patterns of Korea in the past.Applying our model to the marriage pattern, we identify the geographical constraint.The results demonstrate that the GGM captures the subpopulation aspect of the data without the information loss occurred in the GM.We believe that our approach is applicable to a wide range of research on population dynamics.Moreover, we would like to point out that the GGM is in fact even more general than the treatment for our particular data set, e.g., the different types of attributes for the departure and arrival places by taking the different sets {i} { j}.For instance, the attribute in the departure place for education can be the education level of people, while the attribute of the arrival place for work can be the income level.Furthermore, if we release the constraints α = 1 and β = 1, one can also allow the nonlinear mass relation.In this case, the GGM becomes particularly important to prevent the loss of information because the scale factor, in addition to the distance factor, also has the nonlinear relation with the flow.
Finally, this scheme can be extended for multiple types of attributes that can be represented by the attribute vectors.For instance, people living in the region A with the attribute i = (e i , m i ) representing the education level e i and the income level m i can move to the region B with j = (e j , m j ).We hope to extend this type of general scheme for a wide variety of different data sets of human (and possibly nonhuman) migration or flow patterns in the future.

Figure 4 .
Figure 4.The performance of the GGM expressed as the normalized reduced error in equation (14) from the census data in 2015 for each groom clan j, compared with the model performance from geographically scrambled data.The values for the real data and the average values from the shuffled data are shown as the large purple diamonds and the small green diamonds with the error bars representing the standard deviation over 100 realizations, respectively.As in figure3, The horizontal axis indicates groom clans and we shade every other column for better readability.

Figure 5 .
Figure 5.The density plot of dispersion ∆R( j) and the effective number of occupied regions n ( j) for all of the clans in the 2015 census data.The nine orange diamonds correspond to the groom clans in the jokbo data.

Figure 6 .
Figure 6.The same ∆R ( j) -n ( j) diagram as figure5, but only with the groom clans, using the census data in the three different years.The color of the points represents ∆e( j) , and the different symbols indicate the results from the census data in different years.