Estimating Population Attribute Values in a Table: “Get Me Started in” Iterative Proportional Fitting

Nik Lomax & Paul Norman

To cite this article: Nik Lomax & Paul Norman (2016) Estimating Population Attribute Values in a Table: “Get Me Started in” Iterative Proportional Fitting, The Professional Geographer, 68:3, 451-461, DOI: 10.1080/00330124.2015.1099449

To link to this article: http://dx.doi.org/10.1080/00330124.2015.1099449

There are various data situations in population geography and demography when values for population attributes for areas might be missing due to being unknown, unreliable, outdated, or a sample. This article provides a guide to using iterative proportional fitting (IPF) as a tool for estimating the missing values for these population attributes and makes the case for it as a practical technique for answering a range of research questions. Although IPF is used widely in demographic analysis, the method is rarely presented in a way that is easily reproduced and, as a result, it can be opaque to nonexpert users. The aim of this article is to highlight that IPF is a technique that can be readily applied to a variety of data and scenarios and to provide researchers new to this technique with an introductory guide and an awareness of tools they can use to apply IPF in their own work.
To set the scene and introduce relevant terminology, Figure 1A shows a table of cells that are counts of people with a specific attribute. Each table row has data for an area and each column has counts of a particular population attribute in each area. External to the table are marginal cells: row totals of the number of people in each area and column totals of the population attribute across all areas. In Figure 1A, the sums of the rows and columns within the table agree with the marginal row and column totals. Suppose that data for a subsequent year become available, but only the total population in each area (the row totals) and the population attribute totals for the larger area that these smaller areas comprise (the column totals). This is the situation in Figure 1B, where the sums of the rows and columns of the table cells no longer agree with the external marginal cells. Using the internal table cell values as initial or "seed" values, IPF can be used to constrain (control or scale) the table to fit the marginal totals. Once IPF has been implemented, as in Figure 1C, the internal values in the table sum to the marginal row and column constraints and the data sets are said to have converged.
Following a brief discussion on the background of IPF, its previous applications, and some analogous methods, this article goes on to step through the IPF procedure and offer some pointers on operationalizing the algorithm. It then offers discussion of software implementation and applies the IPF method to three practical case study examples: estimating populations by age and sex, estimating migration flows between areas, and estimating multiple attributes for local area populations using a sample distribution. Finally, some conclusions are offered. The following section serves to highlight that although IPF is a widely used technique, the extant literature is not particularly useful for the casual user or someone new to IPF.

Background of IPF
IPF has been used in a wide variety of applications from multiple disciplines and the technique is referred to by various names: RAS in economics (from the notation of the model rAŝ; see Bacharach 1965), Cross-Fratar (Fratar 1954) or Furness (Furness 1965) in transport engineering, and raking in computer science and statistics (Cohen 2008). IPF has also been referred to as rim-weighting or structure-preserving estimation (Simpson and Tranmer 2005). Johnston and Pattie (1993, 317) pointed toward a large body of literature in the field of geography that deals with approaches that are "entropy-maximizing, based on maximum likelihood estimation for which the IPF procedure is a means to that end." We discuss other equivalent methods that aim to achieve maximum likelihood in the next section. In demographics, the first use of IPF is widely attributed to Deming and Stephan (1940), who applied the technique to data from the 1940 U.S. census of population. Deming and Stephan found that although there were complete counts of the population for certain characteristics, when these characteristics were cross-tabulated the output was limited to a sample of the population. They used this sample as the starting distribution (the seeds) and applied IPF to derive an estimate of these cross-tabulated characteristics for the whole population. The ideas presented in Deming and Stephan (1940) were further explored and discussed by Deming (1943), Bishop (1969), Friedlander (1961), and Fienberg (1970), to name just a few. Many of these early papers, however, are presented from a mathematician's perspective; they are likely to be incomprehensible to nonspecialist audiences and do not step through the process of using IPF such that somebody new to the technique could emulate it in similar settings. More recent demographic applications of IPF cover a variety of data sets and data availability and reliability issues.
At the microdata level, Birkin and Clarke (1988) used IPF to estimate the characteristics of residents of small geographical areas, Rees (1994) updated the age and sex structure of small area populations, and Pritchard and Miller (2012) assigned multiple attributes to a synthetic population as an input to a microsimulation model. Simpson and Tranmer (2005) scaled small area population counts to large area information, using IPF to estimate a cross-tabulation of car ownership and tenure type using 1991 Census data. Lomax et al. (2013) used IPF to estimate missing migration data for the United Kingdom, and aggregate migration data were disaggregated by age and sex by Willekens, Por, and Raquillet (1981) and Willekens (1982).
Although all of these papers have applied IPF effectively to specific data problems, none are designed to guide the reader through the process of estimating the required information; rather, they present the tool as a means of getting at the results. Clearer explanations of IPF as a technique are offered by Wong (1992) and Rees (1994), but both are still opaque to those struggling with algebra. One exception is Norman (1999), who provided a guide on IPF but still without a step-through of the calculations involved. For a comprehensive history, summary of various applications, and detailed discussion on the robustness of IPF, see Založnik (2011).

Analogous Methods
IPF is not the only method that can be used for combining population data and estimating missing values. For example, in population estimation, the apportionment method can be used to ensure that small area data are made consistent with larger area information and the ratio method to update earlier cell counts. Both the apportionment and ratio methods can be regarded as ways of scaling data so that one source agrees with another. For definitions of these methods, see Rees, Norman, and Brown (2004). When estimating a contingency table of migration data, linear regression models are often the method of choice, in the form of Poisson regression models (Boyle 1993; Bohara and Krieg 1996) or log-linear regression models (Rogers, Little, and Raymer 2010; Raymer, de Beer, and van der Erf 2011). Similarly, spatial interaction models have a well-established place in the estimation of interaction data (Rees, Fotheringham, and Champion 2004; Congdon 2010) with a very useful introduction provided by Dennett (2012). When estimating a multidimensional age by sex by origin by destination table, van Imhoff et al. (1997) experimented with both log-linear modeling and IPF. They found that the fitted rates from the two methods are the same but favored IPF for its efficiency and speed.
For creating small area synthetic populations with multiple attributes, IPF is compared to hill climbing algorithms by Kurban et al. (2011) and to the combinatorial optimization (CO) method by Ryan, Maoh, and Kanaroglou (2009). The hill climbing algorithms are used by Kurban et al. (2011) to create cross-tabulations of households where only univariate distributions are available by swapping households within a randomly generated distribution until this distribution matches the real marginal totals. The CO method used by Ryan, Maoh, and Kanaroglou (2009) builds a synthetic population by swapping individuals until they closely match an observed distribution. Both studies found IPF a capable tool for the job but stated a preference for the analogous method due to improved accuracy. Both acknowledged, however, that their conclusions were drawn from the estimation of relatively small synthetic populations and called for further research on larger synthetic populations. These examples demonstrate that choosing a method (IPF or another) is largely down to the preferences of the researcher, the data problem being investigated, and the resources (time, software, and skills) available.

An Example of the IPF Algorithm
In this section we explain the steps involved in implementing IPF and how the values in Figure 1B became the fitted values in Figure 1C. Table 1 begins with the initial seed values at what is referred to as Step 0, along with the sum of the table rows and columns and the marginal row and column constraints. Note that values are reported to two decimal places. The mathematical equations for the procedure are presented in the Appendix. In Step 0, a table of initial seed values is available but the sum of the table rows does not equal the constraint row totals and the sum of the table columns does not equal the column constraint totals. IPF will adjust the table seed values to agree with both the row and column constraints.
IPF proceeds as follows. In Step 1a, the values within the table are scaled to sum to the row constraints. The top left cell in Step 1a is calculated as 1.25 = 1.00 × 5.00/4.00, where 1.00 is the initial seed value, 5.00 is the row constraint, and 4.00 is the sum of the table row values in Step 0. The first cell in the middle row of Step 1a is calculated as 3.46, taking the value from Step 0 of 3.00 in the table, multiplied by the row constraint (15.00) and divided by the table row sum (13.00). All other cells are calculated accordingly and at the end of Step 1a, the sum of each table row equals the row constraint. The sum of the table columns is different to that at Step 0 but still does not sum to the column constraint.
Step 1b then adjusts the table cell counts in Step 1a to agree with the constraint column totals. The top left cell in the table in Step 1b is calculated using values from Step 1a so that 1.45 = 1.25 × 11.00/9.51. The next cell down is calculated as 4.00 = 3.46 × 11.00/9.51 and the bottom cell as 5.55 = 4.80 × 11.00/9.51. The other table cells are scaled similarly so that the sum of the table columns now agrees with the column constraints. Although the sum of the table rows agreed with the row constraints at the end of Step 1a, this is no longer the case. At the end of Step 1b, one iteration is complete. Because the difference between the row totals and the row constraints is larger than the predefined threshold (here 0.01), we go back to Step 1 and begin the next iteration. This predefined threshold (convergence) is user specified and can be measured by individual row and column differences or by the difference between the row and column totals.
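As a minimal sketch of Steps 1a and 1b, the following Python fragment reproduces the first-column figures quoted above. The function names scale_rows and scale_cols are illustrative only. The text quotes only some of the table values (seed cells 1.00 and 3.00, row sums 4.00 and 13.00, row constraints 5.00 and 15.00, and the column constraint 11.00); the remaining seed cells and constraints are assumptions chosen so that the quoted results are obtained.

```python
# Seed table and constraints. Values not quoted in the text are assumed
# for illustration (second and third columns, third row constraint, and
# the second and third column constraints).
seed = [[1.0, 2.0, 1.0],
        [3.0, 5.0, 5.0],
        [6.0, 2.0, 2.0]]
row_constraints = [5.0, 15.0, 8.0]
col_constraints = [11.0, 8.0, 9.0]


def scale_rows(table, row_targets):
    """Step a: scale each row so that it sums to its row constraint."""
    return [[cell * target / sum(row) for cell in row]
            for row, target in zip(table, row_targets)]


def scale_cols(table, col_targets):
    """Step b: scale each column so that it sums to its column constraint."""
    col_sums = [sum(col) for col in zip(*table)]
    return [[cell * target / total
             for cell, target, total in zip(row, col_targets, col_sums)]
            for row in table]


step_1a = scale_rows(seed, row_constraints)
step_1b = scale_cols(step_1a, col_constraints)

# First column after Step 1a: [1.25, 3.46, 4.8]
print([round(row[0], 2) for row in step_1a])
# First column after Step 1b: [1.45, 4.0, 5.55]
print([round(row[0], 2) for row in step_1b])
```

After scale_rows every row sum equals its constraint, and after scale_cols every column sum equals its constraint, but the row sums have drifted again, which is why a further iteration is needed.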
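Step 2 is the next iteration. In Step 2a the table values from Step 1b are scaled to the row constraints, and in Step 2b they are scaled to the column constraints. Having been controlled to sum to the column constraints, the table row totals do not sum to the row constraints (but the difference is not as large as at the end of the first iteration at Step 1b).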
The IPF routine then proceeds by alternating the scaling of the table cell values to agree with the row constraints and then to the column constraints.

[Table 1: A step-through of the iterative proportional fitting calculation]

There is a formal test for whether the table values fit the constraints (e.g., because the preceding data are only shown to two decimal places and further precision might show that the fit is not so exact). Bishop, Fienberg, and Holland (1975) discussed the convergence of the procedure and stopping rules. Convergence has occurred and the procedure stops when no cell value would change in the next iteration by more than a predefined amount that obtains the desired accuracy. A straightforward way to test for convergence is to carry out an iteration and to calculate the absolute difference between the tables generated by the row and the column constraints. Then, find the maximum value of the absolute differences and check this against the required convergence value.
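The alternating procedure and the convergence test just described can be sketched as a short loop. This is an illustrative Python version under stated assumptions, not the code used for the article: the function name ipf_2d, its parameters, and the example table (which repeats values only partly quoted in the text) are hypothetical. The test compares the row-constrained and column-constrained tables from the same iteration and stops when the largest absolute cell difference is no greater than the chosen value.

```python
def ipf_2d(seed, row_targets, col_targets, tol=0.01, max_iter=100):
    """Alternate row and column scaling until the two tables agree within tol.

    Returns (fitted table, iterations used, whether the test was satisfied).
    """
    table = [row[:] for row in seed]
    for iteration in range(1, max_iter + 1):
        # Step a: fit to the row constraints.
        by_row = [[c * t / sum(row) for c in row]
                  for row, t in zip(table, row_targets)]
        # Step b: fit to the column constraints.
        col_sums = [sum(col) for col in zip(*by_row)]
        by_col = [[c * t / s for c, t, s in zip(row, col_targets, col_sums)]
                  for row in by_row]
        # Test: largest absolute cell difference between the two tables.
        max_diff = max(abs(a - b)
                       for ra, rb in zip(by_row, by_col)
                       for a, b in zip(ra, rb))
        table = by_col
        if max_diff <= tol:
            return table, iteration, True
    return table, max_iter, False


# Illustrative inputs; cells not quoted in the text are assumed.
seed = [[1.0, 2.0, 1.0],
        [3.0, 5.0, 5.0],
        [6.0, 2.0, 2.0]]
fitted, n_iter, converged = ipf_2d(seed, [5.0, 15.0, 8.0], [11.0, 8.0, 9.0])
print(converged, n_iter)
```

Because the loop ends on a column-scaling step, the column sums match their constraints exactly, while the row sums match only to within the tolerance implied by tol.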

IPF: Further Aspects
Here we flag some elements to be aware of when preparing data for use in and operationalizing IPF. For the marginal constraints, the sum of the row constraints must equal the sum of the column constraints and be of the same data type (i.e., counts, proportions); otherwise, IPF will not converge. Lomax et al. (2013) outlined a method for adjusting row constraints to agree with column constraints where the differences are small. There might be issues with using noninteger constraints in some programming languages (e.g., Visual Basic for Applications [VBA]), due to the way that the double data type is handled. Lovelace and Ballas (2013) offered some advice on creating integer weights. Many formulations of the IPF algorithm do not deal well with zero values in the constraints because there would be divisions by zero (although there are some exceptions; see, e.g., Dennett's [2011] Desktop IPF program). A simple solution to this problem is to add a small constant (e.g., 0.0001). There can be no negative values because the scaling leads to strange results.
For the initial table seed values, avoid having zero values in the table. Bishop et al. (1975, 101) noted that "too many" zero cells in the initial matrix might prevent convergence through a "persistence of zeros." Norman (1999) noted that "too many" is undefined, but in practice this is found to be around 30 percent of the values within the seed table if they are distributed evenly, or around 10 percent if the zeros are clustered together. The simplest way to allow for a large number of zero cells in the initial matrix is to add a small constant (less than the convergence test value) to all cells.
A convergence test value needs to be chosen that is appropriate to the data being used, the application, and the precision needed. It could be that for population-related data, the nearest 0.5 person is adequate. Setting a convergence value that is very low will result in more iterations, so it is important to weigh up the requirements for accuracy against time and computational aspects. Note that it is also possible to specify the maximum number of iterations, so the procedure will end before the convergence value is reached.
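Several of these checks are easy to automate before the fitting starts. The sketch below is one possible arrangement rather than a prescribed one: the function name prepare_inputs, the relative tolerance used to compare the constraint totals, and the choice of one tenth of the convergence value as the small constant are all assumptions made for illustration.

```python
def prepare_inputs(seed, row_targets, col_targets, tol=0.01, rel_tol=1e-9):
    """Check the constraints and pad the seed table before running IPF.

    The small constant (tol / 10) is an arbitrary illustrative choice that
    keeps it below the convergence test value.
    """
    if any(v < 0 for v in list(row_targets) + list(col_targets)):
        raise ValueError("constraints must not be negative")
    if any(cell < 0 for row in seed for cell in row):
        raise ValueError("seed values must not be negative")
    row_total, col_total = sum(row_targets), sum(col_targets)
    if abs(row_total - col_total) > rel_tol * max(row_total, col_total, 1.0):
        raise ValueError("row and column constraints must sum to the same total")
    # Add a small constant to all cells so that no seed cell is zero.
    small = tol / 10.0
    return [[cell + small for cell in row] for row in seed]


padded = prepare_inputs([[0.0, 2.0], [3.0, 0.0]], [4.0, 6.0], [5.0, 5.0])
print(padded)
```

Raising an error when the totals disagree is a conservative choice; where the differences are small, the constraints could instead be adjusted to agree before fitting, as the text notes.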

Expansion to Three and Four Dimensions
The example presented in the previous section can be referred to as two-dimensional (2D) IPF where the row and column margins represent two one-dimensional variables (e.g., these could be sex by age). IPF can be expanded to include three dimensions (3D) or even n dimensions (nD) when additional variables are included in the adjustment (e.g., age group by sex by ethnic group would require a 3D IPF solution). Deming and Stephan (1940) referred to the third dimension as a slice. For 3D IPF, the three margins (column, row, and slice) can be one-dimensional variables of age, sex, and ethnicity, respectively, or combinations of these, so age by sex, age by ethnicity, and sex by ethnicity, for example. As with the 2D example, these margins must all sum to the same value. The third dimension is often geography, as is the case in Simpson and Tranmer (2005), and the method is used to add multiple population attributes to synthetic populations (Beckman, Baggerly, and McKay 1996; Rich and Mulalic 2012). The technical requirements and good practice set out earlier still apply when using IPF on a data set with more than two dimensions. An example of 3D IPF is presented in Case Study 3 later in this article, where the algorithm deals with three dimensions in the order row, column, slice.
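To illustrate the extension, the following Python sketch fits a three-dimensional table to three one-dimensional margins, applied in the order row, column, slice. It assumes a strictly positive seed and margins that share the same total. The function name ipf_3d, the stopping rule (largest cell change over one full cycle), and the small example cube are hypothetical choices for illustration, not the method of any particular package.

```python
def ipf_3d(seed, row_t, col_t, slice_t, tol=1e-6, max_iter=200):
    """Fit seed[i][j][k] to row (i), column (j), and slice (k) totals."""
    n_i, n_j, n_k = len(row_t), len(col_t), len(slice_t)
    t = [[[float(seed[i][j][k]) for k in range(n_k)]
          for j in range(n_j)] for i in range(n_i)]
    for _ in range(max_iter):
        max_change = 0.0
        for axis, targets in ((0, row_t), (1, col_t), (2, slice_t)):
            # Current totals along this dimension.
            sums = [0.0] * len(targets)
            for i in range(n_i):
                for j in range(n_j):
                    for k in range(n_k):
                        sums[(i, j, k)[axis]] += t[i][j][k]
            # Scale each cell by target / current total for its index.
            for i in range(n_i):
                for j in range(n_j):
                    for k in range(n_k):
                        idx = (i, j, k)[axis]
                        new = t[i][j][k] * targets[idx] / sums[idx]
                        max_change = max(max_change, abs(new - t[i][j][k]))
                        t[i][j][k] = new
        if max_change <= tol:
            break
    return t


# Hypothetical 2 x 2 x 2 seed and margins, each margin summing to 100.
cube = [[[1, 2], [3, 4]],
        [[2, 1], [1, 3]]]
fitted_cube = ipf_3d(cube, [40.0, 60.0], [45.0, 55.0], [30.0, 70.0])
row_sums = [sum(sum(j_vals) for j_vals in i_vals) for i_vals in fitted_cube]
print([round(v, 3) for v in row_sums])
```

Margins that are themselves cross-tabulated (e.g., age by sex) would need the scaling step to work on pairs of indices rather than a single index; that case is not covered by this sketch.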
In the next section we present three case studies that step through the implementation of IPF in different population-related data challenges. The first two case studies describe the use of IPF in two dimensions; the third case study presents an example using IPF to estimate three dimensions of a table.

Practical Applications: Using IPF in the Real World
As a method, IPF has substantial advantages for solving real-world problems. It is fast and requires little computational power when compared with other methods (Lovelace et al. 2015); the methodology is transparent (once it is properly explained) and is reproducible; that is, with the same inputs, the outcome is the same no matter how many times it is implemented. There is also growing support for implementing IPF in a variety of statistical packages, discussed next. Following that we present three case study examples, where IPF has been used to overcome some real-world data issues. Links to the supplementary materials are supplied with each of these examples.

Software for Implementing IPF
IPF can be implemented in a variety of different software packages and the choice is down to the preference of the researcher. In the examples used for this article, the estimation of populations by age and sex has been implemented using VBA in Microsoft Excel (Norman 1999), and the estimation of migration flows and estimation of population-level attributes in two- and three-dimensional tables, respectively, have been implemented in the R software package. Modules or user-produced syntax are available for a number of other platforms, including SAS, Matlab, Stata, and SPSS. The code and data files used in the examples presented in this article are available at https://github.com/niklomax/IPFexamples. The IPF code used in Case Study 2 was originally developed by Tomlinson and Hunsinger for the Alaska Department of Labor and Workforce Development (2009) and is freely available for researchers to download. The code has been used by the Alaska Department of Labor and Workforce Development to integrate characteristics (e.g., race) into population totals derived from the U.S. Census Bureau. The code has also been used to estimate cyclical employment and unemployment flows in the United States by Coleman (2010) and to create cross-tabulations of area variables where only univariate distributions are available by Kurban et al. (2011). Case Study 3 is implemented using the 'mipfp' package in R.

Case Study 1: Using IPF in Population Estimates
Small area population estimates by age and sex are needed to show the population size and structure and as denominators for the calculation of rates (Norman, Simpson, and Sabater 2008). Although these populations are available for the midyear closest to the census, this is not necessarily the case in other years. A cohort-component method is commonly used whereby a base population by age and sex is updated to a later time point using counts of the births and deaths and the migration moves in and out of the area in the intervening period (Rees et al. 2003). Although data on births and deaths are usually available for small area geographies, the necessary migration counts are rarely obtainable. An approach that combines data sources and methods is a pragmatic solution.
Thus, for a set of small areas that comprise a larger area, a simple cohort-component method can be used to update the base populations by five-year age and sex with allowances for births, deaths, and aging but not for migration. This can provide initial seed values for IPF to then constrain these age-sex values at small area level to be consistent with separately estimated total populations for each area and with age-sex information from the containing larger area. As an example, populations by five-year groups for 1991 will be updated to 1996. This draws on Rees et al. (2003); Rees, Norman, and Brown (2004); and Norman, Simpson, and Sabater (2008), including their data inputs, and will be for the local government district of Bradford, England, which includes thirty electoral wards. Official age-sex estimates are available for Bradford as a whole for 1996 and births and deaths occurring in each ward between 1991 and 1996. Gross migration flows in and out of each ward are not available. Total ward populations have been separately estimated using the ratio method (Rees et al. 2003), using indicators of change in overall population size (thereby including change due to migration).
The top portion of Table 2 has the initial populations for 1996 derived as just stated. Selected wards and males up to age fifteen to nineteen are shown. The full data set has males and females up to age eighty-five for all thirty wards. The sum of the age-sex information provides a total in each ward and the sum across wards provides totals for the district. These ward and district populations are different from those obtained for 1996 from the specific estimates of total populations and the official estimates for Bradford from the Office for National Statistics (ONS; bottom of Table 2). Both of these estimates include indicators of change due to migration, but the initial ward age-sex estimates do not. Table 2 has the initial estimates constrained using IPF to be consistent with the ward (row) marginals and district (column) marginals.
Population estimates are just that, estimates, and we cannot know whether they are correct. Various data sources are available to measure demographic change in an area and these all have strengths and weaknesses. Combining these sources in a way that uses their strengths and compensates for their weaknesses makes the subsequent estimates defensible, and IPF plays a key role in this (Rees et al. 2003; Norman, Simpson, and Sabater 2008). The method outlined here has been shown to be an improvement over simpler methods and to perform equivalently to methods that, after time-consuming data preparation, incorporate updated gross migration flows based on the previous census (Rees, Norman, and Brown 2004).
The case study is implemented using VBA in Microsoft Excel. The implementation has the advantage of providing a clear step-through interface but lacks the ability to deal with very large data problems. For an alternative implementation of IPF in VBA, see Dennett (2011).

Case Study 2: Estimating Migration Flows Between Areas
Estimating the flow of people between one area and another is an application that is particularly suited to IPF as data sets are often available for total moves into and out of an area (i.e., the row and column constraints) but the data for the interaction between these areas are often not available, sparse, or incomplete. Previous examples include Chilton and Poet (1973), who used IPF to estimate migration between London boroughs reported in the 1966 Sample Census; Rees and Duke-Williams (1997), who estimated suppressed flows reported in the 1991 Census; Nair (1985), who estimated migration in India and Korea using lifetime migration tables; and Schoen and Jonsson (2003), who estimated interregional migration in the United States between 1980 and 1990.
More recently, Lomax et al. (2013) used IPF to update a seed table of Local Authority District (LAD)-level interactions derived from the 2001 Census for the four countries of the United Kingdom. LADs are the administrative units at which resources and funding are allocated, so a good estimate of migration is necessary to ensure the population estimates are accurate. Different statistical reporting systems are in place for England, Wales, Scotland, and Northern Ireland, so Lomax et al. (2013) used IPF to produce a consistent UK data set. Between censuses, the interaction data (moves between LADs) are sparse, heavily rounded for disclosure control purposes, or not available at all (in Northern Ireland and for moves between LADs located in different countries). Outside of the census (conducted every ten years), the only data that are consistently available are the total outmigration from an LAD to all other LADs and the total inmigration to an LAD from all other LADs, derived from National Health Service (NHS) data.
The example presented in Table 3 focuses on migration among the twenty-six LADs of Northern Ireland (for which there are no interaction data) in a single year (2001-2002). In the Lomax et al. (2013) article the estimate is extended to incorporate moves between LADs where a migrant crosses the border from one UK country to another for each year 2001-2002 to 2010-2011. Table 3 shows two matrices of origin-destination interaction data for moves among the twenty-six LADs for Northern Ireland, before and after the IPF routine has been applied. These matrices are collapsed and show only the first and last two entries in each table. The seed table (the top portion of Table 3) contains data taken from the 2001 Census for the distribution of origin-destination interactions. The vertical margin contains total outmigration from each LAD to all other LADs, whereas the horizontal margin contains inmigration totals. The shaded margins in the bottom portion of Table 3 show the total in- and outmigration for 2001-2002, derived from NHS data. The IPF routine is applied to the data, and after sixteen iterations the data converge and the seed table is adjusted to agree with both the vertical and horizontal margins. Thus, the high-level interactions between areas that are reported in the 2001 Census are maintained, but the flows that are reported in the estimated table now sum to the in- and outmigration totals reported for 2001-2002. Lomax et al. (2013) reported that the IPF-derived estimates are reliable and useful, especially where there are no observed data, as is the case in Northern Ireland. This example is implemented using R code developed by the Alaska Department of Labor and Workforce Development (2009). The code has the advantage of being very efficient, offers failsafe checks (e.g., ensuring column and row constraints are equal), and deals with zeros in the margin by adding a small constant (0.001). An additional step-through guide was developed by Hunsinger (2008).
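One way to see in what sense the interactions in the seed are maintained is to note that IPF only ever multiplies whole rows and whole columns by constants, so the cross-product (odds) ratios between any two origins and any two destinations are unchanged by the fitting. The Python sketch below demonstrates this on a hypothetical three-area origin-destination table; the flows, the margins, and the function names ipf and odds_ratio are invented for illustration and are not data or code from the article.

```python
def ipf(seed, out_totals, in_totals, iterations=50):
    """Alternate scaling to outmigration (row) and inmigration (column) totals."""
    table = [row[:] for row in seed]
    for _ in range(iterations):
        table = [[c * t / sum(row) for c in row]
                 for row, t in zip(table, out_totals)]
        col_sums = [sum(col) for col in zip(*table)]
        table = [[c * t / s for c, t, s in zip(row, in_totals, col_sums)]
                 for row in table]
    return table


def odds_ratio(table, i, j, k, l):
    """Cross-product ratio for origins i, k and destinations j, l."""
    return (table[i][j] * table[k][l]) / (table[i][l] * table[k][j])


# Hypothetical seed flows and updated margins (both margins sum to 360).
seed_flows = [[50.0, 30.0, 20.0],
              [10.0, 80.0, 40.0],
              [25.0, 15.0, 60.0]]
out_totals = [120.0, 150.0, 90.0]
in_totals = [100.0, 140.0, 120.0]

fitted_flows = ipf(seed_flows, out_totals, in_totals)
before = odds_ratio(seed_flows, 0, 0, 1, 1)
after = odds_ratio(fitted_flows, 0, 0, 1, 1)
print(round(before, 6), round(after, 6))
```

The individual flows change so that the table sums to the new totals, but the relative pattern of which origins send people to which destinations is carried over from the seed.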
The R code used in Case Study 2 can also deal with IPF in three and four dimensions.

Case Study 3: Estimating Multiple Attributes for Local Area Populations
Calculating age-specific rates of general health by ethnic group for each local authority is useful to indicate whether there are differences in age gradients of health by ethnic group and is essential as an input to directly standardized illness measures. Data from the 2011 UK Census are available from the Local and Detailed Characteristics tables, which have cross-tabulations of elements of these dimensions but, even for broad ethnic groupings, without sufficient detail on age (e.g., LC2301ew, LC3206ew, DC3201ew), general health (DC3201ew), or geography (DC3204ewr). Table DC3204ewr comes closest but, due to small cell counts that result from the cross-tabulation, the data are only released at the regional and country level. The Census Samples of Anonymised Records are individual-level microdata that have great versatility in terms of creating application-relevant recoded variables and in enabling cross-tabulations not readily available in the census area tables to be carried out (Norman and Boyle 2010). A 2011 Census Microdata Teaching File of an anonymized, random sample of census records was released by the Office for National Statistics to allow users to analyze census data in a way that is not possible using standard census tables. For England and Wales (i.e., with no subnational geography), a cross-tabulation of the 2011 file of age (eight groups from age zero to fifteen to seventy-five and over), ethnic group (five broad groups), and general health (five levels from very good health to very bad health) can provide the initial seed values for IPF in these three dimensions.
The constraints for each LAD in England and Wales are obtained from tables QS103ew: Age; KS201ew: Ethnic group; and KS301ew: General health. Respectively, these are the rows, columns, and slice. The data required for this adjustment can be seen in Figure 2. For each LAD in England and Wales, the number of people by ethnic group (column totals), the number of people by age (row totals), and the number of people by health (slice) are known but not the cross-tabulation between these three variables. The cross-tabulations between these variables are obtained from the microdata sample, and these form the starting seed distribution, conceptualized in Figure 2 as a cube to be adjusted: age by ethnicity by very good health at the front through to age by ethnicity by very bad health at the back. The seed is adjusted and constrained to the available totals (first by age, then ethnicity, then health), and convergence occurs in around eleven steps. This procedure is repeated for each LAD for which there are individual column, row, and slice total constraints. The constrained result presumes that the same interaction between the dimensions exists at the local level as at the national level. In a research situation, this could be addressed by using LAD-level microdata (Office for National Statistics 2015). Using this method, a range of local area cross-tabulations can be estimated for the local and detailed characteristics tables. The example presented here is implemented using 'mipfp' in R, a fast and versatile package designed for the multidimensional implementation of IPF. The syntax supplied at https://github.com/niklomax/IPFexamples shows how the algorithm can be implemented using very few lines of code, once an external package is relied on to undertake the calculation.

[Figure 2: Iterative proportional fitting over three dimensions (age, ethnicity, and health).]
Mipfp benefits from being continuously updated, has its own documentation, and can be expanded to deal with problems where available constraints are cross-tabulated (age by health, age by ethnicity, ethnicity by health, etc.).

Conclusion
This article provides a how-to guide on using IPF to estimate data where information is outdated, missing, or inaccurate. It builds on and adds to existing literature by providing a discussion on the practicalities of using IPF and offers a clear and jargon-explained description of how to implement the method. This article demonstrates that IPF has been used extensively in previous research as the preferred method for solving real-world data problems, and we presented three case studies where IPF has been used. The diversity of issues presented in these case studies serves to highlight the flexibility of IPF as a method and its applicability as a research tool.
Other methods can be used to solve these data problems. These are identified within the article, and we suggest that it is largely up to the researcher to decide which would be best, whether this be IPF or an analogous method. We make a case for choosing IPF, however, and argue that IPF is a method that is transparent and computationally efficient. We also believe that IPF is a fairly simple solution to implement, which produces consistent results. These outputs can be reproduced by other users given the same inputs and we encourage readers to explore the data files and code associated with the article.
The aim of this article was to highlight the research applicability of IPF and provide researchers new to its implementation with an introductory guide and an awareness of tools they can use. The next steps in terms of advised reading would be Norman (1999), Wong (1992), and Simpson and Tranmer (2005). Comprehensive and detailed coverage is provided by Založnik (2011) and an overview of IPF use in geography can be found in Johnston and Pattie (1993).

Appendix
The steps involved in IPF are defined as follows.
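Step 0: Set the initial (seed) values

\[ P_{ij}^{[0]} \]

where $P_{ij}$ is a (population) value (to be estimated or adjusted) in table (matrix) row $i$ and column $j$.

Step 1a: Scale the table to the row constraints

\[ P_{ij}^{[1a]} = P_{ij}^{[0]} \times \frac{R_i}{\sum_{j} P_{ij}^{[0]}} \]

where $R_i$ denotes the constraint total for row $i$.

Step 1b: Scale the table to the column constraints

\[ P_{ij}^{[1b]} = P_{ij}^{[1a]} \times \frac{C_j}{\sum_{i} P_{ij}^{[1a]}} \]

where $C_j$ denotes the constraint total for column $j$.

Test:

\[ \max_{i,j} \left| P_{ij}^{[1b]} - P_{ij}^{[1a]} \right| > \delta \]

where $\delta$ denotes the convergence test value. That is, if the absolute difference between the row constrained and column constrained tables is greater than the test value, then the values from Step 1b become the seed values in Step 0 and the procedure starts again at Step 1a. Steps 1a and 1b iterate (are repeated) until the Test condition is satisfied.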