A synthetic Longitudinal Study dataset for England and Wales

This article describes the new synthetic England and Wales Longitudinal Study ‘spine’ dataset designed for teaching and experimentation purposes. In the United Kingdom, there exist three Census-based longitudinal micro-datasets, known collectively as the Longitudinal Studies. The England and Wales Longitudinal Study (LS) is a 1% sample of the population of England and Wales (around 500,000 individuals), linking individual person records from the 1971 to 2011 Censuses. The synthetic data presented contains a similar number of individuals to the original data and accurate longitudinal transitions between 2001 and 2011 for key demographic variables, but unlike the original data, is open access.


How data was acquired
Through a synthetic estimation process

Data
The main data file spreadsheet accompanying this article contains 569,741 rows of data (representing 1 individual person per row) with the first 17 columns (in green) containing variables derived from responses to the 2011 Census. The 8 columns immediately following (in yellow) are synthetic longitudinal transition variables estimating the individual's state in the 2001 Census. The final two columns contain synthetic estimates of whether the individual would have given birth to children (and how many) or died over the 10-year period. Metadata for all variables are contained in the first two sheets. Supplementary materials also accompanying this article include transitional probability tables for each synthetic variable and R code to generate the synthetic variables from these transitions.

Experimental design, materials and methods
The method we employ is, at its core, a simple one-dimensional proportional fitting exercise, making it somewhat more straightforward than the multi-dimensional iterative proportional fitting first proposed by Deming and Stephan [1]. It was necessary to avoid multi-dimensional variable interactions because of the small cell counts that would otherwise occur in the transition matrices.
Our base dataset is the 2011 Census Microdata Teaching File. Transitional probabilities for each variable (for example, not married to married, or good health to bad health; see Supplementary material) are derived from the LS for a series of 10-year age groups. All transitions are accurate when aggregated to these age groups, although not necessarily when aggregated by another variable such as geographic region.
In the Census Microdata Teaching File, age is recorded for 8 uneven age groups. To re-estimate these into new groups, the single year of age of each person in each original age group is estimated first, after which the person can be allocated to a new broad age group. To estimate the single year of age for each of the 569,741 individuals in the dataset, we use data on single year of age for each UK region from the 2011 Census aggregate tables. These Census tables can be aggregated into any age group required, and the relative proportion that each single year of age makes up within each group can then be calculated. In doing this, single year of age counts are disaggregated by region because the proportion of the population in each age group differs greatly between London and all other regions in England and Wales.
The total number of individuals of single year of age $a$ in region $r$, written $a_r$, will be a fraction of the total number of individuals in age group $A$ in region $r$, written $A_r$:
$$a_r = \frac{a_r}{A_r} A_r$$
such that:
$$\sum_{a \in A} \frac{a_r}{A_r} = 1$$
By calculating all proportions $a_r / A_r$ for each age group $A_r$ using the Census aggregate tables single year of age file, it is possible to decompose and re-estimate age group data as required.
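The age decomposition can be illustrated with a minimal Python sketch. It is not the authors' R implementation: the counts are invented for illustration (they are not Census figures), the function names are hypothetical, and drawing ages at random in proportion to the regional shares is an assumption about how the proportions might be applied to individual records.

```python
import random

# Hypothetical single-year-of-age counts for one region and one age group
# (illustrative numbers only, not taken from the 2011 Census tables).
single_year_counts = {16: 120, 17: 130, 18: 150, 19: 100}


def age_proportions(counts):
    """Return each single year's share of its age-group total."""
    group_total = sum(counts.values())
    return {age: n / group_total for age, n in counts.items()}


def estimate_single_years(n_people, counts, rng):
    """Draw a single year of age for n_people in this region and age group.

    Assumption: ages are sampled with probability equal to their share.
    """
    props = age_proportions(counts)
    ages = list(props)
    weights = [props[a] for a in ages]
    return rng.choices(ages, weights=weights, k=n_people)


rng = random.Random(42)  # fixed seed so the sketch is reproducible
proportions = age_proportions(single_year_counts)
estimated_ages = estimate_single_years(1000, single_year_counts, rng)
```

Once every person has an estimated single year of age, allocating them to any new broad age group is a simple lookup on that value.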
The estimation of each longitudinal variable transition is carried out in almost exactly the same way for each variable (with some minor variations). Below the general process is described using Approximated Social Grade as the exemplar.
Stage 1 - Transitional matrices of the same format are generated for each variable of interest from the ONS Longitudinal Study. These are broadly comparable to the example table below (Table 1), which shows the transitional counts for the Approximated Social Grade variable.
As 2011 is our base population, transitional probabilities are calculated from the transition counts, with each 2001 state expressed as a proportion of the corresponding 2011 state total. Table 2 exemplifies this more clearly. Taking the first row of Table 1 (transitions between social grade 1 (AB) in 2011 and social grade 1 in 2001), we can observe that at age group 20-29 (2011 age group), 200 individuals in the LS underwent that transition. Table 2 shows that this is a proportion of 0.045 (4.5%) of all people of social grade 1 at age group 20-29 in 2011 (200/(200 + 1033 + 183 + 3003) = 0.045). For each 2011 state, all 2001 state proportions at each age group will sum to 1. Similar transitional probability tables are generated for each of the variables we transition.
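The worked example can be reproduced with a short Python sketch. The four counts come from the text; the text identifies only the first (200) as 2001 social grade 1, so the other three labels are hypothetical placeholders, as is the function name.

```python
# Counts of 2001 states for people in social grade 1, aged 20-29, in 2011.
# Only "grade_1" is labelled in the text; the other keys are placeholders.
counts_2001_states = {
    "grade_1": 200,
    "state_b": 1033,
    "state_c": 183,
    "state_d": 3003,
}


def transition_probabilities(counts):
    """Express each 2001 state count as a share of the 2011 state total."""
    total = sum(counts.values())
    return {state: n / total for state, n in counts.items()}


probs = transition_probabilities(counts_2001_states)
print(round(probs["grade_1"], 3))  # 0.045
```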
Stage 2 - We apply transitional probabilities to the Microdata Teaching File data to create estimates of the total number of people undergoing each transition.
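A minimal Python sketch of this stage follows. The probabilities, group size and state labels are hypothetical, and the largest-remainder rounding used to keep whole-person counts summing to the group total is an assumption rather than something stated in the article.

```python
import math


def expected_transition_counts(n_people, probs):
    """Split n_people across 2001 states, keeping the total exactly n_people."""
    raw = {state: n_people * p for state, p in probs.items()}
    counts = {state: math.floor(x) for state, x in raw.items()}
    shortfall = n_people - sum(counts.values())
    # Give the leftover people to the states with the largest fractional parts.
    by_remainder = sorted(raw, key=lambda s: raw[s] - counts[s], reverse=True)
    for state in by_remainder[:shortfall]:
        counts[state] += 1
    return counts


# Hypothetical probabilities for one 2011 state and age group (they sum to 1).
probs = {"state_a": 0.045, "state_b": 0.234, "state_c": 0.041, "state_d": 0.680}

# Hypothetical number of Teaching File records in that 2011 state and age group.
counts = expected_transition_counts(777, probs)
```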
Stage 3 - We use the estimates of the total number of people undergoing each transition to update (randomly) the Microdata Teaching File with expected transitions for the correct number of people. Some small variations in the estimation process were required for variables such as religion and for the estimation of births and deaths. For the full estimation process for each variable, see the accompanying processing scripts written in the R language and the transitional probability files (which include 2011 to 2001 transitional probabilities for each variable and single year of age counts by region from the 2011 Census).
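The random update can be sketched in Python as below. This is an illustration under stated assumptions, not the authors' R code: the record identifiers, state labels and counts are hypothetical, and shuffling a list of labels is one simple way to give each 2001 state exactly its estimated number of people.

```python
import random


def assign_transitions(person_ids, transition_counts, rng):
    """Randomly allocate 2001 states so each state receives its exact count."""
    if sum(transition_counts.values()) != len(person_ids):
        raise ValueError("transition counts must sum to the number of people")
    # Build one label per person, then shuffle the labels across the people.
    labels = [s for s, n in transition_counts.items() for _ in range(n)]
    rng.shuffle(labels)
    return dict(zip(person_ids, labels))


rng = random.Random(2011)  # fixed seed so the sketch is reproducible
person_ids = list(range(10))  # hypothetical record identifiers
transition_counts = {"state_a": 2, "state_b": 5, "state_c": 3}
state_2001 = assign_transitions(person_ids, transition_counts, rng)
```

Because the labels are shuffled rather than sampled independently, the aggregate transition counts match the estimates exactly, while which individual receives which 2001 state is arbitrary.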