`Truncate, replicate, sample': a method for creating integer weights for spatial microsimulation

Iterative proportional fitting (IPF) is a widely used method for spatial microsimulation. The technique results in non-integer weights for individual rows of data. This is problematic for certain applications and has led many researchers to favour combinatorial optimisation approaches such as simulated annealing. An alternative to this is `integerisation' of IPF weights: the translation of the continuous weight variable into a discrete number of unique or `cloned' individuals. We describe four existing methods of integerisation and present a new one. Our method --- `truncate, replicate, sample' (TRS) --- recognises that IPF weights consist of both `replication weights' and `conventional weights', the effects of which need to be separated. The procedure consists of three steps: 1) separate replication and conventional weights by truncation; 2) replication of individuals with positive integer weights; and 3) probabilistic sampling. The results, which are reproducible using supplementary code and data published alongside this paper, show that TRS is fast, and more accurate than alternative approaches to integerisation.


Introduction
Spatial microsimulation has been widely and increasingly used as a term to describe a a set of techniques used to estimate the characteristics of individuals within geographic zones about which only aggregate statistics are available (Tanton and Edwards, 2012;Ballas et al., 2013). The model inputs operate on a different level from those of the outputs. To ensure that the individual-level output matches the aggregate inputs, spatial microsimulation mostly relies on one of two methods. Combinatorial optimisation algorithms are used to select a unique combination of individuals from a survey dataset. This approach was first demonstrated and applied by Williamson et al. (1998) and there have been several applications and refinements since then. Alternatively, deterministic reweighting iteratively alters an array of weights, N , for which columns and rows correspond to zones and individuals, to optimise the fit between observed and simulated results at the aggregate level. This approach has been implemented using iterative proportional fitting (IPF) to combine national survey data with small area statistics tables (e.g. Beckman et al., 1996;Ballas et al., 2005a). A recent review, published in this journal, highlights the advances made in methods for simulating spatial microdata (Hermes and Poulsen, 2012) since these works were published. Harland et al. (2012) also discuss the state of spatial microsimulation research and present a comparative critique of the performance of deterministic reweighting and combinatorial optimisation methods. Both approaches require micro-level and spatially aggregated input data and a predefined exit point: the fit between simulated and observed results improves, at a diminishing rate, with each iteration. 1 The benefits of IPF include speed of computation, simplicity and the guarantee of convergence (Deming, 1940;Mosteller, 1968;Fienberg, 1970;Wong, 1992;Pritchard and Miller, 2012). A major potential disadvantage, however, is that non-integer weights are produced: fractions of individuals are present in a given area whereas after combinatorial optimisation, they are either present or absent. Although this is not a problem for many static spatial microsimulation applications (e.g. estimating income at the small area level, at one point in time; for example see Anderson 2013), several applications require integer rather than fractional weights. For example, integer weights are required if a population is to be simulated dynamically into the 1 In IPF, model fit improves from one iteration to the next. Due to the selection of random individuals in simulated annealing, the fit can get worse from one iteration to the next (Williamson et al., 1998;Hynes et al., 2009). It is impossible to predict the final model fit in both cases. Therefore exit points may be somewhat arbitrary. For IPF, 20 iterations has been used as an exit point (Lee, 2009;Anderson, 2007). For simulated annealing, 5000 iterations have be used (Hynes et al., 2009;Goffe et al., 1994).
Integerisation solves this problem by converting the weights -a 2D array of positive real numbers (N ∈ R ≥0 ) -into an array of integer values (N ∈ N) that represent whether the associated individuals are present (and how many times they are replicated) or absent. The integerisation function must perform f (N ) = N whilst minimizing the difference between constraint variables and the aggregated results of the simulated individuals. Integerisation has been performed on the results of the SimBritain model, based on simple rounding of the weights and two deterministic algorithms that are evaluated subsequently in this paper (see Ballas et al., 2005a). It was found that integerisation "resulted in an increase of the difference between the 'simulated' and actual cells of the target variables" (Ballas et al., 2005a, p. 26), but there was no further analysis of the amount of error introduced, or which integerisation algorithm performed best.
To the best of our knowledge, no published research has quantitatively compared the effectiveness of different integerisation strategies. We present a new method -truncate, replicate sample (TRS) -that combines probabilistic and deterministic sampling to generate representative integer results. The performance of TRS is evaluated alongside four alternative methods.
An important feature of this paper is the provision of code and data that allow the results to be tested and replicated using the statistical software R (R Core Team, 2012). 2 Reproducible research can be defined as that which allows others to conduct at least part of the analysis (Table 1). Best practice is well illustrated by Williamson (2007), an instruction manual on combinatorial optimisation algorithms described in previous work. Reproducibility is straightforward to achieve (Gentleman and Temple Lang, 2007), has a number of important benefits (Ince et al., 2012), yet is often lacking in the field.
The next section reviews the wider context of spatial microsimulation Provide a mechanism for others to access data, software, and documentation research and explains the importance of integerisation. The need for new methods is established in Section 3, which describes increasingly sophisticated methods for integerising the results of IPF. Comparison of these five integerisation methods show TRS to be more accurate than the alternatives, across a range of measures (Section 4). The implications of these findings are discussed in Section 5.
2. Spatial microsimulation: the state of the art 2.1. What is spatial microsimulation, and why use it? Spatial microsimulation is a modelling method that involves sampling rows of survey data (one row per individual, household, or company) to generate lists of individuals (or weights) for geographic zones that expand the survey to the population of each geographic zone considered. The problem that it overcomes is that most publicly available census datasets are aggregated, whereas individual-level data are sometimes needed. The ecological fallacy (Openshaw, 1983), for example, can be tackled using individual-level data.
Microsimulation cannot replace the 'gold standard' of real, small area microdata (Rees et al., 2002, p. 4), yet the method's practical usefulness (see Tomintz et al., 2008) and testability (Edwards and Clarke, 2009) are beyond doubt. With this caveat in mind, the challenge can be reduced to that of optimising the fit between the aggregated results of simulated spatial microdata and aggregated census variables such as age and sex (Williamson et al., 1998). These variables are often referred to as 'constraint variables' or 'small area constraints ' (Hermes and Poulsen, 2012). The term 'linking variables' can also be used, as they link aggregate and survey data.
The wide range of methods available for spatial microsimulation can be divided into static, dynamic, deterministic and probabilistic approaches (Table 2). Static approaches generate small area microdata for one point in time. These can be classified as either probabilistic methods which use a random number generator, and deterministic reweighting methods, which do not. The latter produce fractional weights. Dynamic approaches project small area microdata into the future. They typically involve modelling of life events such as births, deaths and migration on the basis of random sampling from known probabilities on such events (Ballas et al., 2005a;Vidyattama and Tanton, 2010); more advanced agent-based techniques, such as spatial interaction models and household-level phenomena, can be added to this basic framework (Wu et al., 2008(Wu et al., , 2010. There are also 'implicitly dynamic' models, which employ a static approach to reweight an existing microdata set to match projected change in aggregate-level variables (e.g. Ballas et al., 2005b).

IPF-based Monte
Carlo approaches for the generation of synthetic microdata Individual-level, anonymous samples from major surveys, such as the Sample of Anonymised Records (SARs) from the UK Census have only been available since around the turn of the century (Li, 2004). Beforehand, researchers had to rely on synthetic microdata. These can be created using probabilistic methods (Birkin and Clarke, 1988). The iterative proportional fitting (IPF) technique was first described in 1940(Deming, 1940, and has become well established for spatial microsimulation (Axhausen and Müller, 2010;Birkin and Clarke, 1989).
The first application of IPF in spatial microsimulation was presented by Birkin andClarke (1988 and1989) to generate synthetic individuals, and allocate them to small areas based on aggregated data. They produced spatial microdata (a list of individuals and households for each electoral ward in Leeds Metropolitan District). Their approach was to select rows of synthetic data using Monte Carlo sampling. Birkin and Clarke suggested that the microdata generation technique known as 'population synthesis' could be of great practical use (Birkin and Clarke, 2012).  (Ballas et al., 2005c)

Combinatorial optimisation approaches
Since the work of Birkin andClarke (1988 and1989) there have been considerable advances in data availability and computer hardware and software. In particular, with the emergence of anonymous survey data, the focus of spatial microsimulation shifted towards methods for reweighting and sampling from existing microdata, as opposed to the creation of entirely synthetic data (Lee, 2009).
This has enabled experimentation with new techniques for small area microdata generation. A significant contribution to the literature was made by Williamson et al. (1998). The authors presented microsimulation as a problem of combinatorial optimisation: finding the combination of SARs which best fits the constraint variables. Various approaches to combinatorial optimisation were compared, including 'hill climbing', simulated annealing approaches and genetic algorithms (Williamson et al., 1998). These approaches involve the selection and replication of a discrete number of individuals from a nationally representative list such as the SARs. Thus, subsets of individuals are taken from the global microdataset (geocoded at coarse geographies) and allocated to small areas. There have been several refinements and applications of the original ideas suggested by Williamson et al. (1998), including research reported by Voas and Williamson (2000), Williamson et al. (2002), and Ballas et al. (2006).

Deterministic reweighting
The methods described in the previous section involve the use of random sampling procedures or 'probabilistic reweighting ' (Hermes and Poulsen, 2012). In contrast, Ballas et al. (2005c) presented an alternative deterministic approach based on IPF. It is the results of this method, that does not use random number generators and thus produces the same output with each run, 3 that the integerisation methods presented here take as their starting point. The underlying theory behind IPF has been described in a number of papers (Deming, 1940;Mosteller, 1968;Wong, 1992). Fienberg (1970) proves that IPF converges towards a single solution.
IPF can be used to produce maximum likelihood estimates of spatially disaggregated conditional probabilities for the individual attributes of interest. The method is also known as 'matrix raking', RAS or 'entropy maximising' (see Johnston and Pattie, 1993;Birkin and Clarke, 1988;Axhausen and Müller, 2010;Huang and Williamson, 2001;Kalantari et al., 2008;Jiroušek and Přeučil, 1995). The mathematical properties of IPF have have been described in several papers (see for instance Bishop et al., 1975;Fienberg, 1970;Birkin and Clarke, 1988). Illustrative examples of the procedure can be found in Saito (1992), Wong (1992) andNorman (1999). Wong (1992) investigated the reliability of IPF and evaluated the importance of different factors influencing its performance; Simpson and Tranmer (2005) evaluated methods for improving the performance of IPF-based microsimulation. Building on these methods, IPF has been employed by others to investigate a wide range of phenomena (e.g. Mitchell et al., 2000;Ballas et al., 2005a;Williamson et al., 2002;Tomintz et al., 2008).
Practical guidance on how to perform IPF for spatial microsimulation is also available. In an online working paper, Norman (1999) provides a user guide for a Microsoft Excel macro that performs IPF on large datasets. Simpson and Tranmer (2005) provided code snippets of their procedure in the statistical package SPSS. Ballas et al. (2005c) describe the process and how it can be applied to problems of small area estimation. In addition to these resources, a practical guide to running IPF in R has been created to accompany this paper. 4

Combinatorial optimisation, IPF and the need for integerisation
The aim of IPF, as with all spatial microsimulation methods, is to match individual-level data from one source to aggregated data from another. IPF does this repeatedly, using one constraint variable at a time: each brings the column and row totals of the simulated dataset closer to those of the area in question (see Ballas et al., 2005c and Fig. 5 below).
Unlike combinatorial optimisation algorithms, IPF results in non-integer weights. As mentioned above, this is problematic for certain applications. In their overview of methods for spatial microsimulation Williamson et al. (1998) favoured combinatorial optimisation approaches, precisely for this reason: "as non-integer weights lead, upon tabulation of results, to fractions of households or individuals" (p. 791). There are two options available for dealing with this problem with IPF: • Use combinatorial optimisation microsimulation methods instead (Williamson et al., 1998). However, this can be computationally intensive (Pritchard and Miller, 2012).
• Integerise the weights: Translate the non-integer weights obtained through IPF into discrete counts of individuals selected from the original survey dataset (Ballas et al., 2005a).
We revisit the second option, which arguably provides the 'best of both worlds': the simplicity and computational speed of deterministic reweighting and the benefits of using whole cases.
In summary, IPF is an established method for combining microdata with spatially aggregated constraints to simulate target variables whose characteristics are not recorded at the local level. Intergerisation translates the real number weights obtained by IPF into samples from the original microdata, a list of 'cloned' individuals for each simulated area. Integerisation may also be useful conceptually, as it allows researchers to deal with entire individuals. The next section reviews existing strategies for integerisation.

Method
Despite the importance of integer weights for dynamic spatial microsimulation, and the continued use of IPF, there has been little work directed towards integerisation. It has been noted that "the integerization and the selection tasks may introduce a bias in the synthesized population" (Axhausen and Müller, 2010, 10), yet little work has been done to find out how much error is introduced.
To test each integerisation method, IPF was used to to generate an array of weights that fit individual-level survey data to geographically aggregated Census data (see Section 3.7). Five methods for integerising the results are described, three deterministic and two probabilistic. These are: 'simple rounding', its evolution into the 'threshold approach' and the 'counter-weight' method and the probabilistic methods 'proportional probabilities' and finally 'truncate, replicate, sample'. TRS builds on the strengths of the other methods, hence the order in which they are presented.
The application of these methods to the same dataset (and their implementation in the same language, R) allows their respective performance characteristics to be quantified and compared. Before proceeding to describe the mechanisms by which these integerisation methods work, it is worth taking a step back, to consider the nature and meaning of IPF weights.

Interpreting IPF weights: replication and probability
It is important to clarify what we mean by 'weights' before proceeding to implement methods of integerisation: this understanding was central to the development of the integerisation method presented in this paper. The weights obtained through IPF are real numbers ranging from 0 to hundreds (the largest weight in the case study dataset is 311.8). This range makes integerisation problematic: if the probability of selection is proportional to the IPF weights (as is the case with the 'proportional probabilities' method), the majority of resulting selection probabilities can be very low. This is why the simple rounding method rounds weights up or down to the nearest integer weight to determine how many times each individual should be replicated (Ballas et al., 2005a): to ensure replication weights do not differ greatly from non-integer IPF weights. However, some of the information contained in the weight is lost during rounding: a weight remainder of 0.501 is treated the same as 0.999.
This raises the following question: Do the weights refer to the number of times a particular individual should be replicated, or is it related to the probability of being selected? The following sections consider different approaches to addressing this question, and the integerisation methods that result.

Simple rounding
The simplest approach to integerisation is to convert the non-integer weights into an integer by rounding. If the decimal remainder to the right of the decimal is 0.5 or above, the integer is rounded up; if not, the integer is rounded down.
Rounding alone is inadequate for accurate results, however. As illustrated in Fig. 2 below, the distribution of weights obtained by IPF is likely to be skewed, and the majority of weights may fall below the critical 0.5 value and be excluded. As reported by Ballas et al. (2005a, 25), this results in inaccurate total populations. To overcome this problem Ballas et al. (2005a) developed algorithms to 'top up' the simulated spatial microdata with representative individuals: the 'threshold' and 'counter-weight' approaches. Ballas et al. (2005a) tackled the need to 'top up' the simulated area populations such that P op sim ≥ P op cens . To do this, an inclusion threshold (IT ) is created, set to 1 and then iteratively reduced (by 0.001 each time), adding extra individuals with incrementally lower weights. 5 Below the exit value of IT for each zone, no individuals can be included (hence the clear cut-off point around 0.4 in Fig. 1). In its original form, based on rounded weights, this approach over-replicates individuals with high decimal weights.

The threshold approach
To overcome this problem, we took the truncated weights as the starting population, rather than the rounded weights. This modified approach improved the accuracy of the integer results and is therefore what we refer to when the 'threshold approach' is mentioned henceforth. 6 The technique successfully tops-up integer populations yet has a tendency to generate too many individuals for each zone. This oversampling is due to duplicate weights -each unique weight was repeated on average 3 times in our model -and the presence of weights that are different, but separated by less than 0.001. (In our test, the mean number of unique weights falling into non-empty bins between 0.3 and 0.48 in each area -the range of values reached by IT before P op sim ≥ P op cens -is almost two.)

The counter-weight approach
An alternative method for topping-up integer results arrived at by simple rounding was also described by Ballas et al. (2005a). The approach was labelled to emphasise its reliance on both counter and a weight variables. Each individual is first allocated a counter in ascending order of its IPF weight. The algorithm then tops-up the integer results of simple rounding by iterating over all individuals in the order of their count. With each iteration the new integer weight is set as the rounded weight plus the rounded sum of its decimal weight plus the decimal weight of the next individual, until the desired total population is reached. 7 There are two theoretical advantages of this approach: its more accurate final populations (it does not automatically duplicate individuals with equal weights as the threshold approach does) and the fact that individuals with decimal weights down to 0.25 may be selected. This latter advantage is minor, as IT reached below 0.4 in many cases (Supplementary Information, Fig. 2) -not far off. A band of low weights (just above 0.25) selected by the counter-weight method can be seen in Fig. 1. The total omission of weights below some threshold is problematic for all deterministic algorithms tested here: they imply that someone with a weight below this threshold, for example 0.199 in our tests, has the same sampling probability as someone with a weight of 0.001: zero! The complete omission of low weights fails to make use of all the information stored in IPF weights: in fact, the individual with an IPF weight of 0.199 is 199 times more representative of the area (in terms of the constraint variables and the make-up of the survey dataset) than the individual with an IPF weight of 0.001. Probabilistic approaches to integerisation ensure that all such differences between decimal weights are accounted for.

The proportional probabilities approach
This approach to integerisation treats IPF weights as probabilities. The chance of an individual being selected is proportional to the IPF weight: Sampling until P op sim = P op cens with replication ensures that individuals with high weights are likely to be repeated several times whereas individuals with low weights are unlikely to appear. The outcome of this strategy is correct from a theoretical perspective, yet because all weights are treated as probabilities, there is a non-zero chance that an individual with a low weight (e.g. 0. 3) is replicated more times than an individual with a higher weight (e.g. 3.3). (In this case the probability for any given area is ∼ 1%, regardless of the population size). Ideally, this should never happen: the individual with weight 0.3 should be replicated either 0 or 1 times, the probability of the latter being 0.3. The approach described in the next section addresses these issues.

Truncate, replicate, sample
The problems associated with the aforementioned integerisation strategies demonstrate the need for an alternative method. Ideally, the method would build upon the simplicity of the rounding method, select the correct simulated population size (as attempted by the threshold approach and achieved by using 'proportional probabilities'), make use of all the information stored in IPF weights and reduce the error introduced by integerisation to a minimum. The probabilistic approach used in 'proportional probabilities' allows multiple answers to be calculated (by using different 'seeds'). This is advantageous for analysis of uncertainty introduced by the process and allows for the selection of the best fitting result. Consideration of these design criteria led us to develop TRS integerisation, which interprets weights as follows: IPF weights do not merely represent the probability of a single case being selected. They also (when above one) contain information about repetition: the two types of weight are bound up in a single number. An IPF weight of 9, for example, means that the individual should be replicated 9 times in the synthetic microdataset. A weight of 0.2, by contrast, means that the characteristics of this individual should count for only 1/5 of their whole value in the microsimulated dataset and that, in a representative sampling strategy, the individual would have a probability of 0.2 of being selected. Clearly, these are different concepts. As such, the TRS approach to integerisation isolates the replication and probability components of IPF weights at the outset, and then deals with each separately. Simple rounding, by contrast, interprets IPF weights as inaccurate count data. The steps followed by the TRS approach are described in detail below.

Truncate
By removing all information to the right of the decimal point, truncation results in integer values -integer replication weights that determine how many times each individual should be 'cloned' and placed into the simulated microdataset. In R, the following command is used: where w is a matrix of individual weights. Saving these values (as count) will later ensure that only whole integers are counted. The decimal remainders (dr), which vary between 0 and 1, are saved by subtracting the integer weights from the full weights: This separation of conventional and replication weights provides the basis for the next stage: replication of the integer weights.

Replicate
In spreadsheets, replication refers simply to copying cells of data and pasting them elsewhere. In spatial microsimulation, the concept is no different. The number of times a row of data is replicated depends on the integer weight: an IPF weight of 0.99, for example, would not be replicated at this stage because the integer weight (obtained through truncation) is 0.
To reduce the computational requirements of this stage, it is best to simply replicate the row number (index) associated with each individual, rather than replicate the entire row of data. This is illustrated in the following code example, which appears within a loop for each area (i) to be simulated: Here, the indices (of weights above 1, index) are selected and then repeated. This is done using the function rep(). The first argument (1:nrow(index)) simply defines the indices to be replicated; the second (count) refers to the integer weights defined in the previous subsection. (Note: count in this context refers only to the integer weights above 1 in each area). Once the replicated indices have been generated, they can then be used to look up the relevant characteristics of the individuals in question. Weights of sampled individuals (n = 3415) count Figure 2: Histograms of original microdata weights (above) and sampled microdata after TRS integerisation (below) for a single area -zone 71 in the case study data.

Sample
As with the rounding approach, the truncation and replication stages alone are unable to produce microsimulated datasets of the correct size. The problem is exacerbated by the use of truncation instead of rounding: truncation is guaranteed to produce integer microdataset populations that are smaller, and in some cases much smaller than the actual (census) populations. In our case study, the simulated microdataset populations were around half the actual size populations defined by the census. This under-selection of whole cases has the following advantage: when using truncation there is no chance of over-sampling, avoiding the problem of simulated populations being slightly too large, as can occur with the threshold approach.
Given that the replication weights have already been included in steps 1 and 2, only the decimal weight remainders need to be included. This can be done using weighted random sampling without replacement. In R, the following function is used: Here, the argument size within the sample command is set as the difference between the known population of each area (pops [i,1]) and the size obtained through the replication stage alone (pops [i,2]). The probability (prob) of an individual being sampled is determined by the decimal remainders. dr varies between 0 and 1, as described above.
The results for one particular area are presented in Fig. 2. The distribution of selected individuals has shifted to the right, as the replication stage has replicated individuals as a function of their truncated weight. Individuals with low weights (below one) still constitute a large portion of those selected, yet these individuals are replicated fewer times. After TRS integerisation individuals with high decimal weights are relatively common. Before integerisation, individuals with IPF weights between 0 and 0.3 dominated. An individual-by-individual visualisation of the Monte Carlo sampling strategy is provided in Fig. 3. Comparing this with the same plot for the probabilistic methods ( Fig. 1), the most noticeable difference is that the TRS and proportional probabilities approaches include individuals with very low weights. Another important difference is average point density, as illustrated by the transparency of the dots: in Fig. 1, there are shifts near the decimal weight threshold (∼ 0.4 in this area) on the y-axis. In Fig. 3, by contrast, the transition is smoother: average darkness of single dots (the number of replications) gradually increases from 0 to 5 in both probabilistic methods. Fig. 4 illustrates the mechanism by which the TRS sampling strategy works to select individuals. In the first stage (up to x = 1,717, in this case) there is a linear relationship between the indices of survey and sampled individuals, as the model iteratively moves through the individuals, replicating those with truncated weights greater than 0. This (deterministic) replication stage selects roughly half of the required population in our example dataset (this proportion varies from zone to zone). The next stage is probabilistic sampling (x = 1,718 onwards in Fig. 4): individuals are selected from the entire microdataset with selection probabilities equal to weight remainders.

The test scenario: input data and IPF
The theory and methods presented above demonstrate how five integerisation methods work in abstract terms. But to compare them quantitatively a test scenario is needed. This example consists of a spatial microsimulation model that uses IPF to model the commuting and socio-demographic characteristics of economically active individuals in Sheffield. According to the 2001 Census, Sheffield has a working population of just over 230,000. The characteristics of these individuals were simulated by reweighting a synthetic microdataset based on aggregate constraint variables provided at the medium super output area (MSOA) level. The synthetic microdataset was created by 'scrambling' a subset of the Understanding Society dataset (USd). 8 MSOAs contain on average just over 7,000 people each, of whom 44% are economically active in the study area; for the less sensitive aggregate constraints, real data were used. These variables are summarised in Table 3.
The data contains both continuous (age, distance) and categorical (mode, NS-SEC) variables. In practice, all variables are converted into categorical variables for the purposes of IPF, however. To do this statistical bins are used. Table 3 illustrates similarities between aggregate and survey data overall (car drivers being the most popular mode of travel to work in both categories, for example). Large differences exist between individual zones and survey data, however: it is the role of iterative proportional fitting to apply weights to minimize these differences. IPF was used to assign 71 weights to each of the 4,933 individuals, one weight for each zone. The fit between census and weighted microdata can be seen improving after constraining by each of the 40 variables (Fig. 5). The process is repeated until an adequate level of convergence is attained (see Fig. 6). 9 The weights were set to an initial value of one. 10 The weights were then iteratively altered to match the aggregate (MSOA) level statistics, as described in Section 2.4.
Four constraint variables link the aggregated census data to the survey, containing a total of 40 categories. To illustrate how IPF works, it is useful 9 What constitutes an 'adequate' level of fit has not been well defined in the literature, as mentioned in the next section. In this example, 20 iterations were used.
10 An initial value must be selected for IPF to create new weights which better match the small area constraints. It was set to one as this tends to be the average weight value in social surveys (the mean Understanding Society dataset interview plus proxy individual cross-sectional weight is 0.986). to inspect the fit between simulated and census aggregates before and after performing IPF for each constraint variable. Fig. 5 illustrates this process for each constraint. By contrast to existing approaches to visualising IPF (see Ballas et al., 2005c), Fig. 5 plots the results for all variables, one constraint at a time. This approach can highlight which constraint variables are particularly problematic. After 20 iterations (Fig. 6), one can see that distance and mode constraints are most problematic. This may be because both variables depend largely on geographical location, so are not captured well by UK-wide aggregates. Fig. 5 also illustrates how IPF works: after reweighting for a particular constraint, the weights are forced to take values such that the aggregate statistics of the simulated microdataset match perfectly with the census aggregates, for all variables within the constraint in question. Aggregate values for the mode variables, for example, fit the census results perfectly after constraining by mode (top right panel in Fig. 5). Reweighting by the next constraint disrupts the fit imposed by the previous constraint -note the increase scatter of the (blue) mode variables after weights are constrained by distance (bottom left).
However, the disrupted fit is better than the original. This leads to a convergence of the weights such that the fit between simulated and known variables is optimised: Fig. 5 shows that accuracy increases after weights are constrained by each successive linking variable.

Results
This section compares the five previously describe approaches to integerisation -rounding, inclusion threshold, counter-weight, proportional prob-  Figure 5: Visualisation of IPF method. The graphs show the iterative improvements in fit after age, mode, distance and finally NS-SEC constraints were applied (see Table 3). See footnote 4 for resources on how IPF works.
abilities and TRS methods. The results are based on the 20 th iteration of the IPF model described above. The following metrics of performance were assessed: • Speed of calculation.
• Accuracy of results.
-Total Absolute Error (TAE) of simulated areas.
-Anomalies (aggregate cell values out by more than 5%).
-Correlation between constraint variables in the census and microsimulated data. Of these performance indicators accuracy is the most problematic. Options for measuring goodness-of-fit have proliferated in the last two decades, yet there is no consensus about which is most appropriate (Voas and Williamson, 2001). The approach taken here, therefore, is to use a range of measures, the most important of which are summarised in Table 4 and Fig. 7.

Speed of calculation
The time taken for the integerisation of IPF weights was measured on an Intel Core i5 660 (3.33 GHz) machine with 4 Gb of RAM running Linux 3.0. The simple rounding method of integerisation was unsurprisingly the fastest, at 4 seconds. In second and third place respectively were the proportional probabilities and TRS approaches, which took a a couple of seconds longer for a single integerisation run for all areas. Slowest were the inclusion threshold and counter-weight techniques, which took three times longer than simple rounding. To ensure representative results for the probabilistic approaches, both were run 20 times and the result with the best fit was selected. These imputation loops took just under a minute.
The computational intensity of integerisation may be problematic when processing weights for very large datasets, or using older computers. However, the results must be placed in the context of the computational requirements of the IPF process itself. For the example described in Section 3.7, IPF took approximately 30 seconds per iteration and 5 minutes for the full 20 iterations.

Accuracy
In order to compare the fit between simulated microdata and the zonally aggregated linking variables that constrain them, the former must first be aggregated by zone. This aggregation stage allows the fit between linking variables to be compared directly (see Fig. 7). More formally, this aggregation allows goodness of fit to be calculated using a range of metrics (Williamson et al., 1998). We compared the accuracy of integerisation techniques using 5 metrics: • Pearson's product-moment correlation coefficient (r).
• Total and standardised absolute error (TAE and SAE), • Proportion of simulated values falling beyond 5% of the actual values, • The proportion of Z-scores significant at the 5% level.
• Size of the sampled populations, The simplest way to evaluate the fit between simulated and census results was to use Pearson's r, an established measure of association (Rodgers, 1988).
The r values for all constraints were 0.9911, 0.9960, 0.9978, 0.9989 and 0.9992 for rounding, threshold, counter-weight, proportional probabilities and TRS methods respectively. IPF alone had an r value of 0.9996. These correlations establish an order of fit that can be compared to other metrics.
TAE and SAE are crude yet effective measures of overall model fit (Voas and Williamson, 2001). TAE has the additional advantage of being easily understood: where U and T are the observed and simulated values for each linking variable (j) and each area (i). SAE is the TAE divided by the total population of the study area. TAE is sensitive to the number of people within the model, while SAE is not. The latter is seen by Voas and Williamson (2001) as "marginally preferable" to the former: it allows cross-comparisons between models of different total populations (Kongmuang, 2006). The proportion of values which fall beyond 5% of the actual values is a simple metric of the quality of the fit. It implies that getting a perfect fit is not the aim, and penalises fits that have a large number of outliers. The precise definition of 'outlier' is somewhat arbitrary (one could just as well use 1%).
The final metric presented in Table 4 is based on the Z-statistic, a standardised measure of deviance from expected values, calculated for each cell of data. We use Zm, a modified version of the Z-statistic which is a robust measure of fit for each cell value Williamson et al. (1998). The measure of fit is appropriate here as it takes into account absolute, rather than just relative, differences between simulated and observed cell count: where To use the modified Z-statistic as a measure of overall model fit, one simply sums the squares of zm to calculate Zm 2 . This measure can handle observed cell counts below 5, which chi-squared tests cannot (Voas and Williamson, 2001).
The results presented in Table 4 confirm that all integerisation methods introduce some error. It is reassuring that the comparative accuracy is the same across all metrics. Total absolute error (TAE), the simplest goodness-of-fit metric, indicates that discrepancies between simulated and census data increase by a factor of 3.2 after TRS integerisation, compared with raw (fractional) IPF weights. 11 Still, this is a major improvement on the simple rounding, threshold and counter-weight approaches to integerisation presented by Ballas et al. (2005a): these increased TAE by a factor of 13, 7 and 5 respectively. The improvement in fit relative to the proportional probabilities method is more modest. The proportional probabilities method increased TAE by a factor of 3.8, 23% more absolute error than TRS.
The differences between the simulated and actual populations (P op sim − P op cens ) were also calculated for each area. The resulting differences are summarised in Table 5, which illustrates that the counter-weight and two probabilistic methods resulted in the correct population totals for every area. Simple rounding and threshold integerisation methods greatly underestimate and slightly overestimate the actual populations, respectively.

Discussion and conclusions
The results show that TRS integerisation outperforms the other methods of integerisation tested in this paper. At the aggregate level, accuracy improves in the following order: simple rounding, inclusion threshold, counterweight, proportional probabilities and, most accurately, TRS. This order of preference remains unchanged, regardless of which (from a selection of 5) measure of goodness-of-fit is used. These results concur with a finding derived from theory -that "deterministic rounding of the counts is not a satisfactory integerization" (Pritchard and Miller, 2012, p. 689). Proportional probability and TRS methods clearly provide more accurate alternatives.
An additional advantage of the probabilistic TRS and proportional probability methods is that correct population sizes are guaranteed. 12 In terms of speed of calculation, TRS also performs well. TRS takes marginally more time than simple rounding and proportional probability methods, but is three times quicker than the threshold and counter-weight approaches. In practice, it seems that integerisation processing time is small relative to running IPF over several iterations. Another major benefit of these nondeterministic methods is that probability distributions of results can be generated, if the algorithms are run multiple times using unrelated pseudorandom numbers. Probabilistic methods could therefore enable the uncertainty introduced through integerisation to be investigated quantitatively (Beckman et al., 1996; Little and Rubin) and subsequently illustrated using error bars.
Overall the results indicate that TRS is superior to the deterministic methods on many levels and introduces less error than the proportional probabilities approach. We cannot claim that TRS is 'the best' integerisation strategy available though: there may be other solutions to the problem and different sets of test weights may generate different results. 13 The issue will still present a challenge for future researchers considering the use of IPF to generate sample populations composed of whole individuals: whether to use deterministic or probabilistic methods is still an open question (some may favour deterministic methods that avoid psuedo-random numbers, to ensure reproducibility regardless of the software used), and the question of whether combinatorial optimisation algorithms perform better has not been addressed.
Our results provide insight into the advantages and disadvantages of five integerisation methods and guidance to researchers wishing to use IPF to generate integer weights: use TRS unless determinism is needed or until superior alternatives (e.g. real small area microdata) become available. Based on the code and example datasets provided in the Supplementary Information, we encourage others to use, build-on and improve TRS integerisation.
A broader issue raised by the this research, that requires further investigation before answers emerge, is 'how do the integerised results of IPF compare with combinatorial optimisation approaches to spatial microsimulation?' Studies have compared non-integer results of IPF with alternative approaches (Smith et al., 2009;Ryan et al., 2009;Rahman et al., 2010;Harland et al., 2012). However, these have so far failed to compare like with like: the integer results of combinatorial approaches are more useful (applicable to more types of analysis) than the non-integer results of IPF. TRS thus offers a way of 'levelling the playing field' whilst minimising the error introduced to the results of deterministic re-weighting through integerisation.
In conclusion, the integerisation methods presented in this paper make integer results accessible to those with a working knowledge of IPF. TRS outperforms previously published methods of integerisation. As such, the technique offers an attractive alternative to combinatorial optimisation approaches for applications that require whole individuals to be simulated based on aggregate data.

Acknowledgements
Thanks to: Mark Green, Luke Temple, David Anderson and Krystyna Koziol for proof reading and suggestions; to Eveline van Leeuwen for testing the methods on real data and improving the code; and to the anonymous reviewers for constructive comments. This research was funded by the resulting from the integerisation of a different weight matrix.

Engineering and Physical Sciences Research council (EPSRC) the via the E-Futures Doctoral Training Centre.
Birkin, M., Clarke, G., 2012. The enhancement of spatial microsimulation models using geodemographics. The Annals of Regional Science 49, 515-532.
Birkin, M., Clarke, M., 1988. SYNTHESIS -a synthetic spatial information system for urban and regional analysis: • Enhance transparency, repeatability and knowledge transfer within the field; • Allow others to use, test and further develop existing work, rather than starting from nothing each time, and; • Allow other researchers to critically assess the four integerisation methods presented in the paper -named simple round, the threshold approach, proportional probabilities and truncate, replicate, sample (TRS) -so they can be improved.
This worked example can be used in different ways, depending on one's aims. The first section shows how the necessary files can be downloaded and loaded into R. Section 2 (Running the Spatial Microsimulation Model) is linked to aim 1, and shows how the microsimulation model, which forms the foundation of the analysis presented in the paper, works. Section 3 (Integerisation in R) is linked to aim 2, and demonstrates how to run each type of integerisation, and display some of the results. Finally, Section 4 (Adapting the Model), illustrates how the code constituting the spatial microsimulation model and integerisation techniques can be adapted to different constraint variables for areas with different population sizes. Aim 3 can be met throughout: we encourage other researchers to experiment with and re-use our code, citing this work where appropriate. To do this, the first stage is to download the dataset and load it into R.
1 Downloading and loading the files into R Throughout this example, we assume that the data has been downloaded and extracted to a folder titled 'ints-public' onto your desktop and that R is installed on your computer. To download R, visit the project's homepage and follow the instructions. For new R users, it is recommended that a introductory text is acquired and referred to throughout. A range of excellent introductory guides are available online at http://cran.r-project.org/otherdocs.html. To access the files, unzip the file titled 'ints-public.zip', which is available online in the supplementary data. 2 A list of the folders and files contained within the folder 'ints-public' is provided in Table 1. To use the data, the first stage is to set the working directory. Find out the current working directory using the command getwd() from the R command line. Correctly setting the working directory will allow quick access the files of the microsimulation model and a logical place to save the results. The command list.files() is used to check the contents of the working directory from within R. Assuming the folder 'ints-public' has been extracted to the desktop in a Windows 7 computer with the user-name 'username', type the following into the R command line interface and press enter to set the working directory (change 'username' to your personal user name or retype the path completely if the folder was extracted elsewhere):

setwd("C:/Users/username/Desktop/ints-public/etsim")
To run the model, one can simply type the following (warning: this may take several minutes, so entering the code block-by-block is recommended): If the aim is to understand how the method works, we recommend opening the script files using a text editor and sending the commands to R block by block. This can be done by copying and pasting blocks of code into the R command line. Alternatively a graphical user interface such as Rstudio can be used. In both cases, running the code contained in etsim.R should take around one minute on modern computers, depending on the CPU. This will result in a number of objects being loaded onto your R session's workspace. These objects are listed by the command ls() and can be referred to by name. The constraint variables, for example, can be summarised using the command: summary(all.msim) R objects can also be loaded directly, having been saved from previous sessions. The command: load("iteration-20.RData") for example, when run in the working directory 'R', will load the results of the IPF model results after 20 iterations. This may be useful for users who want to move straight to integerisation, without running the IPF model first. Referring to file-names in R can be made easier using the auto-complete capabilities of some R editors. Rstudio, for example, allows auto-completion of file-names and R objects (see RStudio's website for more information).

Running the spatial microsimulation model
The code for running the spatial microsimulation model is contained within 'etsim.R' the folder entitled 'etsim' -see Table 1. As with all R script files, the contents of this file can be viewed using any text editor. With an R console running, R's reaction to each chunk of code can be seen by copy and pasting the script code line-by-line. This should give some indication of how the model works, and which parts take most time to process. Note that R accepts input from external files. Within 'etsim.R', this technique was used to reduce the number of lines of code and make the model modular. The constraint variables, for example, are read-in using the following command, that is contained within the main etsim.R script file: source(file="cons.R") As before, this simply sends the commands contained within the file to R, line by line, but without displaying the results until the script has finished.
It is good practice to provide comments within the code, so that others can see what is going on. In R, this is done using the hash symbol (#). Anything following the hash is ignored by R, until a new line is formed.
Once the first iteration of the entire model has run, you can check to see if the model has worked by analysing the objects that R has created. The raw weights are saved as 'weights0', 'weights1', etc. The number of each set of weights corresponds to the constraint which was applied. All of 'weights0' are set to 1 in the first iteration (the initial condition). The object 'weights5' represents the cumulative weight so far, after the weights have been constrained by all 4 constraint variables.
The simulated zonal aggregates are stored in objects labeled 'USd.agg' (this stands for 'Understanding Society dataset, aggregated'), from the original value (the summary results of the survey data) to 'USd.agg4' (after fitting for the fourth constraint). A good first indication of whether the model has worked is to compare 'USd.agg4' with 'all.msim' (the latter being the census aggregates). This can be done by using the command: head(USd.agg4) and running the same command for 'all.msim'. The command head() simply displays the first 5 rows or elements of an R object, to get a feel for what it looks like. (The meaning of any command can be prefixing the command name by "?". In this case ?head() would be used.) To make the comparison more interesting, one can plot the results. Try the following: plot(all.msim [,13],USd.agg4[,13]) The '[,' part of this command means "all rows within"; the '13]' part means "in column 13". In this model, column 13 is the variable "mainly works from home" ("mfh"). This can be established using names(all.msim), to identify the variables contained within the dataframe. If the plot looks the same as as that illustrated below (Fig .1), the model has worked.

Integerisation in R
The code used for integerisation of the results of IPF reweighting are kept in a separate folder, entitled 'R' (see Table 1). As before, navigate to this folder using a modified version of the following command, this time navigating to the folder 'R': setwd("C:/Users/username/Desktop/ints-public/R") As always, it is worth opening the script files in a text editor or within a dedicated R development environment such as RStudio. This will allow the commands to be seen in context and experimented with.

Simple rounding
The following section describes the code contained within the file 'int-meth1round.R'. Open the file and send its content to R line by line. The aim of this script is to round the IPF weights (calculated in the previous section) up or down and then select individuals accordingly. Once the IPF weights have been loaded using the load() command, the new weights are created using the following command: 3 intp <-round(i20.w5) In the following line of code, the decimal remainders are saved by subtracting the rounded weights from the original weights. Note that each new set of data is given a name, ready to be referred back to later: deci <-i20.w5 -intp # Decimal weights Before running a loop to select individuals based on their rounded weights, we created a number of R objects to be used during integerisation. Of note, the object pops is a dataframe for saving data about the population of each zone that are calculated while the loop is in operation. It has the same number of rows as there are areas in the constraint table (1:nrow(all.msim)). The columns are bound together using cbind(). The contents of this object are updated with each iteration, allowing the results for different methods to be compared directly.
In order to perform calculations on one zone at a time, a loop is used: for ( The final list of individuals is saved by replicating the integer weights the same number of times as the integer value of the rounded weights: The replication command rep() is used in this instance to replicate the individuals who are 'cloned' more than once. Note the use of double square brackets. This is used to refer to objects (dataframes in this case) that are part of a list. Because the matrix of rounded IPF weights ('intp') has indexes that correspond to the original survey data, we can extract their characteristics by simply referring to the previously defined index:

intall[[i]] <-USd[ints[[i]],]
Finally, the results are aggregated by converting the raw data into the categories of the constraint variables -using source("area.cat.R")-and then summing columns to provide zone-wide counts for each category: This same procedure is followed for each of the remaining 2 integerisation methods. The defining features of each are outlined below.

The inclusion threshold approach
The starting point of this method is an incomplete simulated population of integer results (the length of which is defined as P op sim ). The 5 steps of the threshold approach are as follows: 4. Recalculate P op sim with the additional individuals included.
5. Subtract x from IT to reduce the inclusion threshold for the next iteration. If P op sim is still less than P op cens return to step 2; if not exit.
The script file 'int-meth2-thresh.R' replicates these steps in two main loops, each iterating over the areas whose populations are being simulated. The first is identical to that of the simple rounding approach, (except in this case the IPF weights are truncated, not rounded) 5 and saves the resulting microdata as a list of vectors, each containing row names of individuals from the Understanding Society dataset (see ints [[i]] for area i).
The second loop adds additional individuals to those contained in ints for each area, by gradually reducing the inclusion threshold. This is done in a third loop which is nested within the second. Note that the value of the threshold (wv, for weighting variable -equivalent to IT , as described above) is set to 1 outside this third loop. This is done so that the threshold is reduced from one iteration to the next within the loop: wv <-wv -0.001 Note also that this third loop is initiated by the command while(), instead of the command for() used for the previous loops. This is because the number of iterations performed by the first two loops is fixed (to the number of areas), while the number of iterations in this one is determined by the threshold at which the sample population is greater than or to equal the census population: Here, the command c() appends the additional individuals to those already saved. After the while loop exits, the population and aggregate data for each area are saved, as with the simple rounding method.
To analyse the threshold reached for each area, this information is saved as for each area within the main loop:

pops$thresh[i] <-wv
This information can be subsequently analysed, e.g. to investigate the distribution of thresholds reached (Fig. 2 -This plot was produced by the following command: hist(pops$thresh)). A similar process is used to save information about the exit point of the counter-weight algorithm.

The counter-weight method
The counter-weight method is similar to the threshold algorithm: it begins with a crude integerisation strategy (in this case simple rounding, not truncation as described above -this starting point was found to lead to more accurate results), and then tops up each area with additional individuals that depend on their decimal weights.
The process can be summarised in the following 4 steps: • Sort the IPF weights in ascending order: sweights <-sort(i20.w5[,j], index.return = T)$x and save their order for future reference: ord <-rank(i20.w5[,j], ties.method="first") • If the total population is too small, top up the results for each individual by the rounded sum of their decimal weight plus the decimal weight of the next individual in the sorted vector of weights: if (

Proportional probability method
The script file 'int-meth3-pp.R' also contains two loops. The first simply creates proportional weights for each individual-zone combination using the following command: i20.w5[,i] / sum(i20.w5 [,i]). This is the code equivalent of the following equation: The result (saved as prop.weights[,i]) is used in the second loop as the selection probability for each individual. The second loop contains three main parts. First, individuals are randomly selected from the USd dataset, with probability set as follows: prob=prop.weights [,i] (Note that here we are sample with replacement -replace=T). Second, the population of the integerised sample is saved. Third, as with all integerisation methods, the command source("area.cat.R") is run to extract the additional information about individual from the Understanding Society dataset, based solely on their index. The results are saved as intagg.prop. The next stage is to run the TRS integerisation method.

TRS integerisation in R
The final method is contained in the script file 'Int-meth4-TRS.R'. It involves weight truncation, replication of integerised weights, and sampling based on the decimal remainders. Of these steps, sampling is the only one which requires detailed attention here: the others have already been described. Suffice to say that integer weights are generated by the command x%/%1, which is synonymous with the command trunc(x). Note that the command round() was used for integerisation in the simple rounding and threshold integerisation methods.
The population following truncation is guaranteed to be less than the census population as no rounding up occurs. This differs from the simple rounding and threshold approaches, and ensures that there will always be a difference between census and simulated results. The challenge is to fill the difference: where popstrs[,1] is the census population and popstrs[,2] is the simulated population based on truncated weights. The command: sample() allows an exact number of rows to be selected to make up the difference. The first argument of the command is the vector from which the sample is taken. The second is the sample size. For our purposes, the vector is the row names of all individuals from the survey. This vector is referred to by the command which(i20.w5[,i]>-1), which means "all individuals with weights greater than −1, for area i", i.e. all individuals. The size is the difference between census and simulated population sizes for the area in question (as defined above).
So far so good, but the sample strategy is simple random, meaning that probabilities will be equally assigned to all rows, unless stated. This is where the decimal weights -the 'conventional weight' components of the IPF weights -come into play. Conventional weights can be used to determine the probability of an individual being selected.
The final argument used, therefore, is the probability of selection (prob=...). The decimal weights are calculated in-situ by subtracting the integer weights from the actual weights: prob = i20.w5[,i]-i20.w5[,i] %/% 1)) As with the previous methods, the loop finishes by extracting the full survey data from the survey dataset, and saving the aggregate level results: intagg.trs [i,] <-colSums(area.cat) After the script files associated with all four integerisation methods have been run, the aggregate results are saved in R objects entitled intagg.round, intagg.thresh and intagg.trs. These results form the basis of the integerisation method performance comparison presented in the paper, and can be replicated using the file 'Analysis.R' (Table 1).

Adapting the model
So far the model has been used on a single case study. For the techniques showcased here to be truly useful, they must be be applicable to a wide range of situations. This section therefore illustrates how to adapt the model to simulate the individuals living in Output Areas (which contain around 300 people or ∼100 employed people, 20 times smaller than the Medium Super Output Areas used up until now), using different constraints and a different (smaller) survey dataset from which individuals are to be extracted.

Setting-up the constraint variables
In order to show the model's flexibility, 3 new constraint variables were used: • Hours worked per week

• Marital status
• Housing tenure of home These variables are available in both aggregate form for small areas, and from the Understanding Society dataset. The aggregate data can be downloaded by UK academics from the Casweb census data portal. The raw data (named 'hrs worked.csv' 'marital status.csv' and 'tenancy.csv') is read into R and cleaned by the commands contained in the script file 'cons.R' within the folder 'OA-eg'. The comments in this script file should explain most of the commands, which read the .csv files and remove superfluous variables. In one case (tenancy) the variables are also manipulated such that the category 'other' is the sum of three other variables: ten$other <-ten$other + ten$council + ten$assoc The reason for modifying the data in this way is so that the constraint data match the individual-level survey data. Also, the USd is a huge dataset (50994 rows by 1322 columns, contained in a 90 Mb file). Dropping unneeded information makes the data more manageable.
The script to load and subset the USd data is contained in the file 'load.R' (also in the folder 'OA-eg'). For confidentiality reasons the original data is not provided; the steps taken to process the USd dataset into a form ready for spatial microsimulation should be applicable to any survey dataset (the R package 'foreign' may be used to load unusual data types as an R object). The steps taken here should be fairly self-explanatory, based on the names of the commands and the comments. Although the script has been set-up to process the USd survey, in anticipation of running IPF constrained by the three constraint variables mentioned above, it would be possible to modify 'load.R' to accept different input survey datasets and subset the data for different constraints.
The data is also simplified to match available constraint categories in 'load.R'. To provide one example, the USd variable for married status -'pmarstat' -contains 14 categories, many of which can be merged. To ensure the categories of the survey data matched the census constraints (5 marriage status categories), the following command was used: levels(Und.sub$mas <-s[sample(nrow(s), size=500),]rstat) <-c( rep("other",5), "single", "married", "single", "separated", "divorced", "widowed", rep("other",3)) After running both 'load.R' and 'cons.R' we are left with four R objects in the workspace: 6 's', the survey micro-level dataset and 'hrs', 'mar' and 'ten' -the three constraint variables.

Modifying the spatial microsimulation model
The script that runs the spatial microsimulation model in the previous example is called 'etsim.R'. In order for it to use new constraint variables it must be modified. These modifications (which maintain the original structure and semantics of the original script) can be seen by comparing 'etsim.R' contained within the 'OA-eg' folder against the file of the same name contained within the folder 'etsim'. The following points summarise the changes made: • Add or remove constraints and loading functions depending on the input data. In this case, for example, the input survey dataframe 's' is too large relative to the average size of the zones under investigation (nrow(s) = 1678, more than 10 times greater the average size of individuals in Output areas -∼ 100). Therefore a simple random sample is taken to reduce the number of rows to 500: s <-s[sample(nrow(s), size=500),] • Alter the file 'USd.cat.r' so to convert the survey dataframe 's' into a wide data frame whose dimensions match 'all.msim'. This involves converting categorical variables into binary (1 or 0) using the subsets. Females who work more than 48 hours per week, for example, are allocated the value of 1 in the appropriate column using the following command: s.cat[which(s$jbhrs >= 49 & s$sex=="female"),12] <-1 • Names of the R objects referred to are changed to reflect the new input data. The object 'USd', for example, is renamed as 's'.

Integerisation of the new results
The integerisation scripts must also be modified slightly to accept the new input data. Therefore the files 'int-meth1-round.R' to 'int-meth4-TRS.R' described in Table 1 have been altered. The changes we need to account for include the new name of the weights (i1.w4 instead of i20.w5 in this caseonly iteration of the new model has been run for brevity) and, again, the renaming of the survey to 's' from 'USd' in the original files. It is recommended that differences in the R scripts for integerisation between files in the folder 'R' and those (with the same file names) in the folder 'OA-eg' are identified to understand how the methods can be generalised to accept any weighted input data. Figure 3: Scatter plots illustrating the relationship between census constraint variables (x axis) and simulated counts for these variables (y axis) after IPF and four methods of integerising the results. Each dot represents one variable for one area, 528 dots in each plot (22 variables multiplied by 24 areas).

Results
To confirm that the TRS method advocated in the paper is also the most accurate when it is used on different input data, a basic analysis script has been compiled ('basic-analysis.R' with the folder 'OA-eg'). These commands calculate the correlation between the simulated and census data at the aggregate data and illustrate the results. The results demonstrate that the TRS method is also more accurate than the others for these new constraints, as expected. The level of correlation rises (from 0.948 through 0.976, 0.975, and 0.981 to 0.987) for the threshold, rounding, counter-weight, proportional probabilities and TRS methods respectively. Note, the order of accuracy is the same as the same as presented in paper which this Supplementary Information accompanies, except for the counter-weight method performs worse than the inclusion threshold approach with the new input datasets. These results can be visualised in scatter plots of census vs simulated results (Fig. 3). This figure can be replicated using the last section of code in 'basic-analysis.R', provided the packages 'reshape2' and 'ggplot2' have been installed.
We encourage users to test the integerisation methods described in this user manual on a wider range of datasets, citing the authors where appropriate. This will help to check the replicability of the results presented in the paper that accompanies this code. It is also hoped that the code and the findings will be of use to researchers developing, evaluating and using spatial microsimulation models.
Any feedback would be gratefully received by robin.lovelace at shef.ac.uk. There is also the possibility to clone, branch and commit to a larger code development project related to this research: https://github.com/Robinlovelace/ IPF-performance-testing.