Longitudinal child data: What can be gained by linking administrative data and cohort data?

Abstract
Introduction: Linked administrative data sets are an emerging tool for studying the health and well-being of the population. Previous papers have described methods for linking Canadian data, although few have specifically focused on children, nor have they described linkage between tax outcomes and a cohort of children who are particularly at risk for poor financial outcomes.
Objective and methods: This paper describes a probabilistic linkage performed by Statistics Canada linking the Montreal Longitudinal Experimental Study (MLES) and the Quebec Longitudinal Study of Kindergarten Children (QLSKC) survey cohorts and administrative tax data from 1992 through 2012.
Results: The number of valid cases in the original cohort file with valid tax records was approximately 84%. Rates of false positives, false negatives, sensitivity, and specificity of the linkage were all acceptable. Using the linked file, the relationship of childhood behavioural indicators and adult income can be investigated in future studies.
Conclusions: Innovative methods for creating longitudinal datasets on children will assist in examining long-term outcomes associated with early childhood risk and protective factors as well as an evidence base for interventions that promote child well-being and positive outcomes.

A true understanding of the influence of the early years on long-term individual well-being requires extensive data spanning from childhood into adulthood. Most researchers have relied on longitudinal survey designs to capture functioning in early childhood, mediating or moderating factors, and outcomes later in life (e.g., NICHD ECCD cohort [1,2]). Surveys provide a wealth of information on a specific topic and can enlighten our understanding of complex phenomena in specific groups of people or in the population at large. However, although longitudinal population-based surveys are rich in providing descriptive characteristics and outcome data, they can be difficult and expensive to administer. They also place a burden on respondents due to frequent contact. Furthermore, longitudinal survey data is often hampered by attrition that may impact the ability to detect associations between variables (i.e., low power) or may introduce bias if the sample is differentially lost over time [3].
On the other hand, administrative data is data collected for administrative or service-related purposes such as hospital admissions, billing, or registries. For example, vital statistics registries capture birth and death information, census data captures a count of a national population, and justice data identifies those who have been in contact with the justice system. Administrative data has the advantage of being free or low-cost, collected on the entire population (including both those at risk and those who are less likely to respond to surveys), is often updated over a long period of time, and may include "objective" reports or information that are not subject to respondent biases [3]. Moreover, administrative data often requires low respondent burden and the relatively large sample size increases the power for robust analyses. Nonetheless, since administrative data is not collected for the purpose of research, it is often lacking in basic identifiers or in-depth information to describe the individual.
To capitalize on the advantages of both survey and administrative data, an increasing number of data linkages are being performed which merge individual-level survey information with population-based administrative files [e.g., 4-7]. Survey information documents the cohort in terms of descriptive characteristics, behaviour, well-being, etc.; prospective and historical administrative files can provide information on all persons in a population and thus maximize the likelihood of having complete long-term outcome data. In this way, linked files combine descriptive and cohort variables with long-term outcomes such as educational attainment, hospitalization, participation in the justice system, or financial outcomes. The benefits of survey data can thus be married with the benefits of administrative data, creating a file that allows for a broader scope of analyses.
As an illustration, researchers using the Montreal Longitudinal Experimental Study (MLES) and the Quebec Longitudinal Study of Kindergarten Children (QLSKC) survey cohort data have linked administrative data to report educational (i.e., secondary-school degree; [8,9]) and justice outcomes (number of criminal offenses and having a record for a violent or non-violent offense [4,9]) for a cohort of children who participated in an early behavioural intervention program. These linked data have been used in a large number of studies to examine the relationship between childhood behaviour and experiences and young adult outcomes [for example, 10-15].
There is strong public policy interest in better understanding the mechanisms by which childhood environments and conditions may impact outcomes. For example, childhood opposition (for example, disobedience and being inconsiderate of others) in this sample has been associated in particular with theft in later years [16], and it is reasonable to expect that there may be a relationship with income. Linked data thus allow researchers to examine long-term outcomes for the cohort that go beyond the original longitudinal data collection.
The purpose of the current paper is to describe the methods of linking administrative tax information with child-focused survey data (in this case, the MLES and QLSKC cohorts). The advantage of this particular linked data set is that the outcomes include earnings data, by year, over a longer time frame than could previously be achieved using self-reported data. Furthermore, the linked data file allows researchers to benefit from the completeness of tax information (as opposed to tracking individuals for long-term data collection) and thus lower rates of attrition. Tax information is advantageous in that many employment earnings and income-related measures can be examined, including total income and use of social assistance programs in Canada. For the purpose of demonstrating the usefulness of the linked dataset, a final objective of the current study is to explore the association between child behaviour outcomes (from the cohort file) and later tax outcomes as an example of how this particular linked dataset can produce unique findings.

Tax records
The T1 Family File (T1FF) is an annual administrative database containing the income tax records of all families who file taxes in Canada. T1FF tax files were made available to Statistics Canada from the Canada Revenue Agency and preprocessed for linkage. The tax files include individual social insurance numbers which are used in the record linkage process [17] but are excluded from the analytical file. Information from the T1FF also includes a broad range of demographic and tax variables, including: year of birth, sex, marital status, number of children, year of birth of spouse, and province of residence. Income variables that were available for linkage included: wages and salaries, self-employment income, Provincial Parental Insurance Plan (PPIP) premiums, employment insurance premiums and benefits, income from social assistance, disability benefits/tax credit, total income before tax, contributions to a registered pension plan, pension adjustment, donation amount, income from workers' compensation, income from capital gains, investment income, and union dues. Spousal income data is also included on the T1FF by using the family identification number of the respondent and the family flag of family members. All told, these variables enable researchers to define the participant's economic outcomes including annual earnings, total personal and household income, whether or not the individual lived in low income, whether or not the individual ever claimed a charitable donation, and receipt of income assistance ever or in the past year (i.e., unemployment or social assistance).

MLES-QLSKC
The first study, the MLES, initially assessed the behaviour and school adjustment of 1,161 boys enrolled in kindergarten in 1983-84 and living in low income neighborhoods in Montréal, Canada [18]. The objective of the MLES was to prospectively examine the development of a sample of boys attending inner-city kindergartens in 1983-84 who had backgrounds of low socio-economic status, with a particular focus on antisocial behaviour and school adjustment. To obtain a high base rate of boys at risk for delinquent behaviour, the 53 schools with the lowest socioeconomic index were chosen and teachers were asked to rate each boy in their classes using the Social Behaviour Questionnaire [SBQ; 19]. Ratings were returned by 87% of the teachers. Questionnaires were also filled out by parents. These boys were then assessed annually to age 17, and again at ages 21, 27, and 30.
The second study, the QLSKC, is similar to the MLES but assessed 4,648 children, including both boys and girls from low, middle, and higher income neighborhoods, enrolled in kindergarten in Quebec in 1986-1987 [20]. The QLSKC was created to obtain data similar to that of the MLES on children's behavioural outcomes from a random sample of kindergarten children attending French-speaking public schools in the province of Québec over a 2-year period (1986 and 1987). This strategy was used to obtain a sample of children that was (1) representative of all regions of Quebec, and (2) representative of urban and rural settings. A total of 4,648 students were rated on the Social Behaviour Questionnaire by both parents and teachers. A subsample of 2,000 students was assessed annually between 6 and 12 years and again at 15, 21, and 30 years. This sample provides a control group for the MLES study.
The members of these two cohorts are now in their thirties, and researchers have collected detailed longitudinal data on parent behaviours; child outcomes, behaviours, attitudes, and activities; education outcomes; psychological markers; as well as physical data. These data have been used to construct trajectories of psychological development over time. Results from both the MLES and the QLSKC have been frequently published in peer-reviewed journals.

Derived Record Depository (DRD)
In order to link these cohorts with the T1FF, the cohorts were first linked to Statistics Canada's Derived Record Depository (DRD). The DRD is a national longitudinal database of individuals derived from a number of Statistics Canada data files and it contains only basic personal identifiers. The purpose of the DRD is to populate a Key Registry through record linkage. It is not a data source for analytical purposes, and it is an evolving database. New individuals are routinely added and existing records are updated by linking administrative data. For the purpose of the current project, the DRD (version 4) was filtered to include only individuals born between 1977 and 1981 to match the cohort birth years. This included 2,771,101 individuals and 14,504,919 records¹. The DRD contains name variables as well as parents' surnames, birth and death dates, and geographical variables (postal code, city, CMA, CSD, province). More information about the DRD is available on the Statistics Canada website at http://www.statcan.gc.ca/eng/sdle/status.

Protecting respondent privacy
Statistics Canada ensures respondent privacy and confidentiality during the linkage process and subsequent use of linked files. Only employees directly involved in the linkage process have access to the unique identifying information required for linkage (such as names and birthdates), and they have no access to the analysis variables. Once the data linkage process is complete, the resulting linked keys are used to create a linked file without identifying information, and only the de-identified file is accessed by analysts for research purposes including validation and statistical analyses. The application for the record linkage was reviewed and approved by the Executive Management Board at Statistics Canada under the Statistics Canada Policy on Record Linkage (see http://www.statcan.gc.ca/eng/record/policy4-1).

Linkage methods and results
The principal files for linkage were the combined cohort data (hereafter labeled the 'cohort file') from the 5,804 children included in the Montreal Longitudinal Experimental Study (n=1,161) and the Quebec Longitudinal Study of Kindergarten Children (n=4,643). Probabilistic record linkage was carried out using G-Link² 3.2 [21]. Probabilistic record linkage methodology uses non-unique identifiers (e.g., name and birth date) to calculate the likelihood that matched records refer to the same entity (e.g., an individual). This was accomplished in two steps:

Primary linkage
First, the cohort file was cleaned and standardized to prepare for the matching process (e.g., removing trailing blanks and accents, converting all characters to uppercase, etc.). An initial quality assessment was also performed in order to ascertain the completeness of the data (i.e., percent of data available on each matching variable) and likelihood of matching (i.e., number of distinct values). Efficiency of the record linkage process depends on relatively complete data with a relatively high number of distinct values in order to discriminate between cases.
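The cleaning and quality-assessment steps described above can be illustrated with a minimal Python sketch. It is not the procedure Statistics Canada used; the function names, the field name `surname`, and the choice to collapse internal runs of blanks are assumptions made for illustration only.

```python
import unicodedata


def standardize_name(raw: str) -> str:
    """Remove accents and surrounding blanks, then uppercase a name field."""
    # Decompose accented characters (e.g. "é" becomes "e" plus a combining
    # accent) and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", raw)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumption: internal runs of blanks are also collapsed to one space.
    return " ".join(no_accents.split()).upper()


def completeness(records, field):
    """Percent of records with a non-empty value on a matching variable."""
    filled = sum(1 for r in records if r.get(field))
    return 100.0 * filled / len(records)


def distinct_values(records, field):
    """Number of distinct non-empty values on a matching variable."""
    return len({r[field] for r in records if r.get(field)})


# Hypothetical records, for illustration only.
sample = [
    {"surname": standardize_name(" Côté ")},
    {"surname": standardize_name("Tremblay")},
    {"surname": ""},
    {"surname": standardize_name("cote")},
]
```

A variable with high completeness and many distinct values (surname, date of birth) discriminates between individuals better than one with few values (sex), which is why both quantities are checked before matching.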
Next, a series of twenty-four conditions were created based on surnames (including parents' surnames), given names, date of birth (and its components), province of residence, and the first three characters of postal code to limit the number of potential pairs generated. Comparison rules were created and applied to the potential pairs in order to compare surnames, given names, dates of birth and death, sex, and different geographic levels (postal code, city, census subdivision) between the two files. Based on the theory of probabilistic record linkage [22], each rule outcome was assigned a weight based on the ratio of the estimated probability of the outcome occurring for true matches to the estimated probability of the outcome occurring for non-matches. Linkage states were then assigned to the pairs based on probability ratios and thresholds. The total weight for each pair was a quantitative representation of the likelihood that the record pair was the same individual. This total weight was then compared to a lower and an upper threshold to assess whether the pair was considered definitely linked, possibly linked or not linked.
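The weighting and threshold logic described above can be sketched as follows. This is a simplified illustration of the general probabilistic linkage approach, not G-Link's implementation: the m- and u-probabilities, the field names, the base-2 logarithm, the handling of missing values, and the two threshold values are all hypothetical.

```python
import math

# Hypothetical probabilities per comparison rule:
# m = P(fields agree | true match), u = P(fields agree | non-match).
RULES = {
    "surname": (0.95, 0.01),
    "given_name": (0.90, 0.05),
    "birth_date": (0.98, 0.003),
    "postal_fsa": (0.70, 0.02),  # first three characters of the postal code
}


def outcome_weight(m, u, agrees):
    """Log of the ratio of the outcome's probability among matches to non-matches."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def total_weight(rec_a, rec_b):
    """Sum the rule weights for one candidate pair."""
    total = 0.0
    for field, (m, u) in RULES.items():
        if rec_a.get(field) is None or rec_b.get(field) is None:
            continue  # assumption: a missing value contributes no evidence
        total += outcome_weight(m, u, rec_a[field] == rec_b[field])
    return total


def classify(weight, lower=0.0, upper=10.0):
    """Compare the total weight to a lower and an upper threshold."""
    if weight >= upper:
        return "definite"
    if weight >= lower:
        return "possible"
    return "not linked"


# Hypothetical records, for illustration only.
cohort_rec = {"surname": "ROY", "given_name": "MARC",
              "birth_date": "1978-04-02", "postal_fsa": "H2X"}
drd_same = dict(cohort_rec)
drd_other = {"surname": "GAGNON", "given_name": "MARC",
             "birth_date": "1980-11-19", "postal_fsa": "G1R"}
```

Agreement on a rare value (a low u-probability, as for date of birth) adds a large positive weight, while disagreement on a normally reliable field (a high m-probability) subtracts weight, so the total separates likely matches from chance agreements.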
Record pairs that were determined to be definite or possible matches were then grouped to bring together all pairs that referred to the same individual. However, grouped pairs may not demonstrate the expected one-to-one mapping between the cohort and DRD; conflicts were resolved via mapping or correspondence analysis in G-Link and a linked file was created.
Overall, 98.4% (5,713 out of 5,804) of the individuals in the Quebec cohorts linked to the DRD. There were minor variations in linkage rates between the MLES and QLSKC cohorts, as well as by demographic variables (sex, birth year; see Table 1). The largest difference in linkage rates was seen when comparing cases with available geographical information (postal codes) and cases with missing geography, which was not surprising given that geography was used in the linkage process.

Secondary linkage
The cohort-DRD linkage keys and the T1PMF-DRD and CCTB-Ident-DRD linkage keys were used to link the cohort to historical tax records (all years of the T1FF between 1992 and 2012). These data provide unique information on the cohort with respect to their administratively-collected income in adulthood. For the purpose of this study, tax records from 1992 through 2012 were linked to the cohort file, although the linkage rate was low between 1992 and 1998 due to the young age of the cohort, who were not yet likely to be filing taxes (i.e., less than 18 years of age). The number of valid cases in the original cohort file with valid tax records was approximately 84%. Other linkages with Canadian tax data have shown similar linkage rates between a cohort and historical tax data [23].

Footnote 1: There are many more records than individuals in the DRD because a person's information (address, name, etc.) may change over time. The DRD file prepared for record linkage can contain as separate records up to 99 different combinations of identifiers for a given individual.
Footnote 2: Statistics Canada's generalized system for record linkage.
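The chaining of linkage keys through the DRD described in this section can be sketched in Python. The identifiers, the dictionary layout, and the function name are hypothetical, and the sketch assumes one tax identifier per DRD individual; the actual key files are not described at this level of detail in the text.

```python
def link_cohort_to_tax(cohort_to_drd, tax_to_drd, tax_records):
    """Chain cohort -> DRD -> tax keys and collect each person's tax years.

    cohort_to_drd: {cohort_id: drd_id}
    tax_to_drd:    {tax_id: drd_id}
    tax_records:   {(tax_id, year): record}
    """
    # Invert the tax key file so a DRD individual can be looked up directly.
    drd_to_tax = {drd_id: tax_id for tax_id, drd_id in tax_to_drd.items()}
    linked = {}
    for cohort_id, drd_id in cohort_to_drd.items():
        tax_id = drd_to_tax.get(drd_id)
        if tax_id is None:
            continue  # linked to the DRD but no tax key available
        years = {year: rec for (tid, year), rec in tax_records.items()
                 if tid == tax_id}
        if years:
            linked[cohort_id] = years
    return linked


# Hypothetical keys and records, for illustration only.
cohort_keys = {"C1": "D10", "C2": "D20", "C3": "D30"}
tax_keys = {"T7": "D10", "T8": "D20"}
records = {("T7", 1999): {"earnings": 12000},
           ("T7", 2000): {"earnings": 15000}}

linked = link_cohort_to_tax(cohort_keys, tax_keys, records)
```

In this toy example only one of three cohort members ends up with tax records: one has no tax key and one has a key but never filed, which mirrors why the share with valid tax records is lower than the share linked to the DRD.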

Linkage accuracy
Linkage errors may create bias in the analyses, and were identified through manual review and reported as false positive rate, false negative rate, sensitivity, and specificity. False negatives occurred when matches were missed, that is, a record was not linked when it should have been. False negatives were identified from a review of all unlinked records. Fourteen of the 91 non-linked individuals were found to be good links (i.e., a false negative rate of 15.38%).
In order to measure the false positive rate, a sample of 500 was randomly selected across different total pair weight intervals. The selected records were manually and independently reviewed by three people. The results of the manual review were reconciled and a majority-rule decision was made (at least two reviewers had to agree on the decision). False positives occurred when a record was linked to the wrong record or was linked when there was no true match. The false positive rate was estimated at 0.04%. The sensitivity or true positive rate was calculated as the number of pairs linked probabilistically and identified through manual review divided by the number of pairs linked through manual review. In this project, the sensitivity rate was found to be 99.76%. Specificity or true negative rate was calculated as the number of pairs not linked probabilistically nor identified through manual review divided by the number of pairs not linked through manual review. In this project, the specificity rate was 97.47%.
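The rates reported in this section follow from simple ratios, which the short Python sketch below reproduces. Only the counts 5,713, 5,804, 14, and 91 come from the text; the function names are illustrative, and the counts behind the sensitivity and specificity figures are not reported, so those functions are shown without inputs.

```python
def rate(numerator, denominator):
    """Express a ratio as a percentage."""
    return 100.0 * numerator / denominator


def majority_decision(votes):
    """True when at least two of the three reviewers judged the pair a match."""
    return sum(bool(v) for v in votes) >= 2


def sensitivity(true_pos, false_neg):
    """Pairs linked by both methods divided by all pairs linked by manual review."""
    return rate(true_pos, true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Pairs linked by neither method divided by all pairs not linked by manual review."""
    return rate(true_neg, true_neg + false_pos)


# Counts reported in the text.
link_rate = rate(5713, 5804)        # cohort members linked to the DRD
false_negative_rate = rate(14, 91)  # good links found among unlinked individuals

print(round(link_rate, 1))            # 98.4
print(round(false_negative_rate, 2))  # 15.38
```

The false negative rate here uses the 91 unlinked individuals as its denominator, so it is not the complement of the sensitivity figure, which is computed over reviewed pairs.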

Coverage analysis
Since the Quebec cohorts were not representative of any particular population, comparisons of the linked data set to known estimates could not be made as has been done with previous linkage projects [e.g., 24]. For example, it was not possible to compare the distribution of the linked sample on certain sociodemographic characteristics (e.g., income) to the Census of Canada since the cohort was designed as an at-risk sample. However, it was possible to identify potential biases in the results by comparing the distribution of some key variables for the linked dataset, the cohorts, and the non-linked cases. Linkage rates to the DRD by demographics and variables of interest such as sex and birth year are presented in Tables 1-3. No distinct trends were noted, other than an effect of time. Fewer records were linked in 1992 through 1998 as compared to after 1998, although again this was to be expected since many individuals in the cohort would have been too young to file income taxes in those years.

Results from analyses of the linked data
As a simple example of the value of this linked dataset, we examined the changes in the correlation of childhood behaviour measures in the cohort data and adult earnings over time from the tax data. We calculated the correlation between two childhood behaviour measures (taken in 1986 or 1987) and adult individual earnings at intervals of five years. The childhood behaviour measures were opposition and anxiety [see 19 for a description of the measures and the underlying questionnaire]. Figure 1 demonstrates the evolution of these correlations over time. Both anxiety and opposition were negatively correlated with individual earnings, and after an initial increase in negative correlation, the relationship seems to stabilize. The pattern is quite similar for anxiety and opposition behaviours in early childhood. Figure 2 shows the same set of correlations for household income instead of individual earnings, that is, tax-based income of all members of the household instead of the individual as shown in Figure 1. Here, however, the findings suggest different patterns for anxiety and opposition. In young adulthood, anxiety has a lower correlation (closer to zero) with household income than opposition, but as individuals age, the correlations become similar. In addition, the relationship of opposition to household income is fairly stable over time. These examples demonstrate the manner in which the tax income can be used for future analyses with the MLES-QLSKC cohort data in order to examine associations between early child behaviour and income-related outcomes.
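The type of analysis described above, a correlation between a childhood behaviour score and earnings computed separately for selected tax years, can be sketched in Python. The text does not state which correlation coefficient was used, so Pearson's r is an assumption here; the data layout, function names, and all values are hypothetical and do not reproduce the study's results.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def correlations_by_year(behaviour, earnings_by_year, years):
    """Correlate a childhood score with earnings in each selected tax year.

    behaviour:        {person_id: score}
    earnings_by_year: {year: {person_id: earnings}}
    """
    result = {}
    for year in years:
        pairs = [(behaviour[pid], amount)
                 for pid, amount in earnings_by_year.get(year, {}).items()
                 if pid in behaviour]
        if len(pairs) < 2:
            continue  # not enough linked filers in this year
        scores, amounts = zip(*pairs)
        result[year] = pearson(scores, amounts)
    return result


# Hypothetical data, for illustration only.
opposition = {1: 0, 2: 1, 3: 2, 4: 3}
earnings = {
    2002: {1: 30000, 2: 28000, 3: 25000, 4: 20000},
    2007: {1: 45000, 2: 41000, 3: 39000, 4: 30000},
    2012: {1: 52000},
}
by_year = correlations_by_year(opposition, earnings, [2002, 2007, 2012])
```

Because only tax filers contribute in a given year, the set of individuals behind each coefficient can differ from year to year, which is worth keeping in mind when comparing correlations over time.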

Discussion
Future research on healthy child development will increasingly rely on linked datasets to capitalize on the advantages of individual-level survey information along with prospective, population-based administrative files. In fact, Canada has several provincial linked datasets that capitalize on the linkage of several administrative data bases (e.g., Manitoba Population Research Data Repository, Population Data BC, Institute for Clinical Evaluative Sciences data in Ontario). The present study demonstrates one such linkage for child data, which links a large developmental, longitudinal cohort (which includes many child behaviour and other contextual measures) and tax information (T1FF). Using probabilistic linkage methods, Statistics Canada was able to link 84% of valid cases in the original cohort file to valid tax records. This yields a child cohort dataset with administrative tax outcomes in order to examine long term earnings and related information. This is a significant contribution to child cohort data in that previous studies have relied on self-reported income information which can be limited.
An example of the potential use of the data was provided, demonstrating the correlations between two different early child behaviours (anxiety and opposition) and individual earnings and household income. Correlations were modest, although this is roughly equivalent to the relationship between parent and child income demonstrated in several Scandinavian countries [25]. Anxiety and opposition showed similar patterns of negative correlation to individual earnings over time, but different patterns of correlation to household income: the correlation of household income with anxiety in young adulthood was lower (closer to zero) than that with opposition, but the two became more similar as individuals aged [8,15]. One possible reason for the different patterns observed between individual earnings and household income is the possibility of belonging to a multiple earner household (through marriage or living with parents). Understanding how and why opposition and anxiety behaviours in early childhood are differently related to income over time (as well as other contextual factors included in the broader cohort data) is an interesting puzzle that can be addressed by a rich dataset such as this, which links multi-dimensional data on economic well-being (individual and household) with multi-dimensional data on childhood measures.
The correlational analysis presented above would not be possible in the absence of linked administrative data. Furthermore, self-reported income can be subject to higher levels of attrition, non-response bias, and recall errors. This linkage opens an array of analyses that were previously impossible, which can provide long-term insight into the relationship of early childhood circumstances to adult outcomes. Such studies can yield substantial new insight into the relationship of early socioeconomic status and childhood behaviour and environment to adult financial and labor market outcomes.
Despite these advantages, some limitations of linked data should be acknowledged. First, the value of linked data is dependent on complete, good quality data in the original files (in this case, cohort, DRD, and tax information) in order to optimize linkage rates between the cohort and the administrative data. Of particular importance for the present linkage were keys on name, address, and date of birth. In this case, linkage rates were found to be quite high and of sound quality. Second, the cohort linked in this study represents a sample of children who attended kindergarten in Quebec only (and the MLES cohort only includes boys); the fact that the cohort is from Quebec facilitated linkage due to a decreased likelihood of name changes as a result of marriage. However, the results are not necessarily generalizable to the Canadian population. Further analyses may also be limited by the sample size in the cohort, and by the years of administrative data available.

Conclusions
The linked data described in the present paper expands opportunities available with cohort data, for example, tracking childhood characteristics and early environmental and contextual variables with adult outcomes. Linked data such as this could be used to address policy-relevant questions such as the impact of socio-economic status, early behaviours, and intervention programs on adult earnings. Researchers should consider other sources of available data such as education, justice, income, health/hospitalization, and vital statistics information when developing data collection activities and planning for possible future linkages. In doing so, researchers must carefully consider respondent data, requests for sharing and linkage, and questions that can be addressed beyond the life of primary data collection activities.

Statement on conflicts of interest
The authors declare they have no conflicts of interest.