Automation of cleaning and reconstructing residential address histories to assign environmental exposures in longitudinal studies

Abstract
Background: We have developed an open-source ALgorithm for Generating Address Exposures (ALGAE) that cleans residential address records to construct address histories and assign spatially-determined exposures to cohort participants. The first application of this algorithm was to construct prenatal and early life air pollution exposure for individuals of the Avon Longitudinal Study of Parents and Children (ALSPAC) in the South West of England, using previously estimated particulate matter ≤10 µm (PM10) concentrations.
Methods: ALSPAC recruited 14 541 pregnant women between 1991 and 1992. We assigned trimester-specific estimated PM10 exposures for 12 752 pregnancies, and first year of life exposures for 12 525 births, based on maternal residence and residential mobility.
Results: Average PM10 exposure was 32.6 µg/m3 [standard deviation (S.D.) 3.0 µg/m3] during pregnancy and 31.4 µg/m3 (S.D. 2.6 µg/m3) during the first year of life; 6.7% of women changed address during pregnancy, and 18.0% moved during the first year of life of their infant. Exposure differences ranged from −5.3 µg/m3 to 12.4 µg/m3 (up to 26% difference) during pregnancy and −7.22 µg/m3 to 7.64 µg/m3 (up to 27% difference) in the first year of life, when comparing estimated exposure using the address at birth and that assessed using the complete cleaned address history. For the majority of individuals exposure changed by <5%, but some relatively large changes were seen both in pregnancy and in infancy.
Conclusions: ALGAE provides a generic and adaptable, open-source solution to clean addresses stored in a cohort contact database and assign life stage-specific exposure estimates with the potential to reduce exposure misclassification.


Introduction
Longitudinal birth cohort studies provide an important resource to study the onset and development of disease associated with pre- or postnatal environmental exposures through childhood into adulthood. [1][2][3] Certain phases of rapid human development, including pregnancy and the first year of life, have been studied in relation to exposure to environmental pollutants and adverse health outcomes. [4][5][6][7][8][9] The first and last trimesters of pregnancy, for example, have been identified as key air pollution exposure stages associated with preterm birth and smallness for gestational age, respectively. 10,11 Most of these studies assign environmental exposure during pregnancy based on a single point in space and time, such as the residential address of the mother at the time of birth. Residential mobility during pregnancy, however, is common, varying between 10% and 30% according to a recent review. 12 There is potential for large exposure misclassification from using address at birth or having inaccurate or incomplete address histories, depending on both the characteristics of the move (for example, from city to rural village or within a city) 13 and the spatial and temporal variability of the pollutant under study. 14 Ignoring residential changes during pregnancy could, therefore, result in under- or overestimation of effect sizes in epidemiological studies using birth cohorts.
Some countries collect detailed information on residential mobility as part of national registries. In a study on maternal exposure to air pollution and birthweight in Oslo, for example, routinely collected information on residential address was linked to records from the Medical Birth Registry of Norway to account for residential mobility. 15 Most countries, however, do not maintain such detailed address records of their residents. Cohort studies do not routinely collect residential address histories of their participants either and might need to collect such information retrospectively, for example via person- or computer-assisted phone interviews. 13,14,16,17 Such retrospective data collections are resource intensive and may be prone to recall bias. 12 An often overlooked alternative is the use of cohort contact databases, i.e. administrative systems set up to audit current addresses of cohort participants. Administrative systems have the advantage that address information is readily available in an electronic format, which allows the gathering of mobility data for large cohort populations without the resources needed for individual data collection. Deriving residential address histories from administrative systems, however, can be challenging. Contact databases are typically designed to audit current addresses, not to track past ones. This means that addresses are usually updated in the database whenever cohort members notify the cohort study of any address changes. Such data management systems are set up to create a new record in the database with a time stamp for the date on which the address was changed in the database which, importantly, is not the date on which the cohort member moved. Instead of updating existing records, new records are commonly created every time changes are made to the address database, including the correction of errors such as spelling mistakes or adding additional information to an address. In order to reconstruct address histories from contact databases, address records need to be cleaned in a systematic way such that cohort members are recorded as resident at only one location on any given day.

Key Messages
• Longitudinal birth cohort studies often assign environmental exposure during pregnancy based on residential address of the mother at the time of birth which, depending on the pollutant under study, might introduce exposure misclassification.
• We developed an ALgorithm for Generating Address Exposures (ALGAE), a generic, automated process for assigning life stage-specific environmental exposures to birth cohort participants using the cohort contact database.
• We applied ALGAE to assign previously modelled spatiotemporal high-resolution air pollution exposure to ~14 000 pregnant women recruited as part of the Avon Longitudinal Study of Parents and Children (ALSPAC) birth cohort.
• The successful implementation of ALGAE to ALSPAC demonstrates its potential to reduce exposure misclassification in birth cohort studies.
• Its generic code base makes ALGAE re-usable for other cohort studies, providing an accessible and low-cost means to enhance cohort studies with environmental exposure data.
Here we explore the use of a contact database to reconstruct residential history for assigning environmental exposures. We developed an ALgorithm for Generating Address Exposures (ALGAE), a generic, automated process for assigning life stage-specific environmental exposures to cohort participants. We demonstrate this application for an English birth cohort study, the Avon Longitudinal Study of Parents and Children (ALSPAC). We constructed pregnancy trimester- and first year of life-specific exposure estimates based on previously modelled spatially and temporally detailed concentrations of particulate matter of diameter ≤10 µm (PM10) 18 using: (i) residential address at birth; and (ii) reconstructed address histories for each participant to account for mobility during pregnancy and the infants' first year of life. We then compared the PM10 concentrations obtained by these two methods.

Methods
ALSPAC, a prospective observational study, is one of the best-characterized birth cohort studies in the world. 19 It was set up to explore modifiable influences on health across the life course. Centred on the city of Bristol in the South West of England, ALSPAC recruited 14 541 pregnant women with expected dates of delivery between 1 April 1991 and 31 December 1992. This resulted in 14 062 live-born children, of whom 13 985 survived to the end of the first year of life. 19 Those children have been followed up multiple times and follow-up is still ongoing.
Using a bespoke geocoding algorithm and Ordnance Survey's AddressBase Plus©, we geocoded all residential addresses held in the ALSPAC contact database (n = 45 771), allowing us to assign geographical coordinates to 96.2% of addresses (n = 40 446). We restricted our study to children who did not move outside the original ALSPAC study area, which was the extent of the air pollution modelling domain (1333 km2), resulting in 36 986 addresses for which we modelled daily air pollution concentrations.
The air pollution modelling is described in detail elsewhere. 18 In brief, daily average PM10 concentrations were modelled for all residential addresses of ALSPAC mothers and their children, from the estimated date of conception of the first baby born in 1991 (1 August 1990) until the end of the first year of life of the last baby born in 1992 (31 December 1993). Dispersion models were used to separately model local (i.e. traffic, housing, industry) and regional (i.e. long-range transport) anthropogenic particulate sources, and a time-invariant constant was added to reflect background, non-anthropogenic sources. We focused on PM10, although particulate matter of diameter ≤2.5 µm (PM2.5) might be a more relevant pollutant in terms of birth outcomes; 11 such measurements, however, were not available before 2008, which was outside our study period.
In order to assign daily exposure values to ALSPAC participants, we developed ALGAE to systematically clean the addresses stored in the ALSPAC contact database and account for temporal gaps or overlaps between successive address periods that might be present. Having assigned daily exposure estimates to each participant, based on the clean address history, ALGAE then aggregated daily exposures for each pregnancy trimester (T), early infancy (EI, 0-6 months) and late infancy (LI, 7-12 months) accounting for residential mobility.
ALGAE is described in detail elsewhere. 20 Here we cover some of the key aspects of the process.
The temporal boundaries of life stages were based on date of birth (DoB) and date of conception (DoC), where DoC is defined as DoB − (7 × gestational age at birth in weeks) − 1 day, as shown in Table 1. For some premature births, the third trimester (T3) did not exist and the second trimester (T2) overlapped with days in EI. We fixed these overlaps by deleting T3 and setting the end date of T2 to DoB − 1 day. The ALGAE code clearly highlights the trimester calculations, so modifications to the temporal boundaries of life stages can easily be made.
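The date arithmetic above can be illustrated with a short Python sketch. It is not taken from the ALGAE code base: the function name `pregnancy_stage_boundaries` is hypothetical, and the trimester lengths (`t1_days`, `t2_days`, defaulting to 13 weeks each) are placeholder assumptions, because the boundaries of Table 1 are not reproduced in this text. Only the DoC formula and the premature-birth rule follow the description above.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)


def pregnancy_stage_boundaries(dob, gestation_weeks, t1_days=91, t2_days=91):
    """Return {stage: (start, end)} with inclusive dates for T1, T2 and T3.

    t1_days and t2_days are ASSUMED trimester lengths (13 weeks each);
    replace them with the study's own definitions.
    """
    # DoC = DoB - (7 x gestational age at birth in weeks) - 1 day
    doc = dob - timedelta(days=7 * gestation_weeks) - ONE_DAY
    last_pregnancy_day = dob - ONE_DAY

    t1_end = doc + timedelta(days=t1_days - 1)
    t2_start = t1_end + ONE_DAY
    t2_end = t2_start + timedelta(days=t2_days - 1)

    stages = {"T1": (doc, t1_end)}
    if t2_end >= last_pregnancy_day:
        # Premature birth: T3 does not exist, and T2 is cut at DoB - 1 day
        # so that it cannot overlap with early infancy.
        stages["T2"] = (t2_start, last_pregnancy_day)
    else:
        stages["T2"] = (t2_start, t2_end)
        stages["T3"] = (t2_end + ONE_DAY, last_pregnancy_day)
    return stages


term = pregnancy_stage_boundaries(date(1992, 1, 1), gestation_weeks=40)
preterm = pregnancy_stage_boundaries(date(1992, 1, 1), gestation_weeks=24)
```

In this sketch a term birth yields three stages, whereas the 24-week birth yields only T1 and a truncated T2.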
The ALSPAC contact address database records each instance of change of address, not when study members began living at an address. The ALGAE protocol, therefore, favours preserving start dates over end dates when gaps and overlaps are encountered, assuming the start date of an address period to be a stronger and more reliable signal of location than the end date. This assumption is based on the fact that in an administrative system, start dates will likely correspond to time stamps but end dates will likely be imputed in relation to start dates. Where dates are missing, ALGAE imputes missing start dates with DoC and missing end dates with the current date. Figure 1 illustrates the three scenarios: temporally contiguous address periods (Contiguous); gap between two address periods (Gap); overlap of two address periods (Overlap).
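One way to read the rule of favouring start dates is that each address period keeps its start date and has its end date reset to the day before the next period starts, which closes gaps and trims overlaps in the same step. The Python sketch below illustrates that reading only; the function name `clean_address_periods`, the tuple layout and the decision to drop periods that are entirely superseded by a record with the same start date are assumptions, not a description of the ALGAE implementation.

```python
from datetime import date, timedelta


def clean_address_periods(periods, doc, today):
    """Make address periods contiguous while preserving start dates.

    periods: list of (address_id, start, end); start or end may be None.
    doc:     date of conception, used to impute missing start dates.
    today:   current date, used to impute missing end dates.
    """
    # Impute missing dates as described in the text.
    filled = [(a, s if s else doc, e if e else today) for a, s, e in periods]
    # Order by start date; Python's sort is stable, so records with equal
    # start dates keep their original (database) order.
    filled.sort(key=lambda p: p[1])

    cleaned = []
    for i, (address_id, start, end) in enumerate(filled):
        if i + 1 < len(filled):
            # Keep the start date; move the end date to the day before the
            # next period starts (closes gaps and removes overlaps).
            end = filled[i + 1][1] - timedelta(days=1)
        if end >= start:  # ASSUMPTION: drop fully superseded records
            cleaned.append((address_id, start, end))
    return cleaned


raw = [
    ("A", None, date(1991, 5, 1)),               # missing start date
    ("B", date(1991, 5, 10), date(1991, 9, 1)),  # gap after A, overlaps C
    ("C", date(1991, 8, 1), None),               # missing end date
]
history = clean_address_periods(raw, doc=date(1991, 1, 1), today=date(1993, 12, 31))
```

After cleaning, every day between the first start date and the last end date belongs to exactly one address period.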
In addition, we corrected address periods if we could not assign geographical coordinates to an address because the address was unknown or fell outside the study area. Corrections were made only if the address period was immediately followed by an address period with a valid geocode, and the address period with the invalid geocode did not overlap by more than 25% of days with any life stage. Such address periods were corrected by allowing the subsequent address period to subsume them. This process places each individual at one address for each day of each life stage. ALGAE could then assign life stage-specific exposures based on the modelled daily PM10 concentrations for each location, and computed mean, median and cumulative exposures across each life stage using the address history.
We compared estimated exposures obtained using the reconstructed address histories with those obtained using residence at birth for the whole duration of pregnancy and infancy (i.e. as often used in epidemiological studies), to explore the impact of reconstructing address histories on potential exposure misclassification. We used descriptive statistics, R2 and Spearman's correlation to describe the relationship between the two different exposure estimates.
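The aggregation step can be sketched as follows, assuming a cleaned address history and a lookup of modelled daily concentrations keyed by address and date. The function name `stage_exposure`, the data layout and the choice to return no estimate when any day of the life stage lacks an address or a concentration are illustrative assumptions rather than ALGAE's actual behaviour.

```python
from datetime import date, timedelta
from statistics import mean, median


def stage_exposure(history, daily_pm10, stage_start, stage_end):
    """Aggregate daily exposures over one life stage (inclusive dates).

    history:    cleaned, non-overlapping list of (address_id, start, end).
    daily_pm10: dict mapping (address_id, day) to a modelled concentration.
    Returns a dict of mean, median and cumulative exposure, or None if any
    day cannot be assigned a value (an ASSUMED rule for this sketch).
    """
    values = []
    day = stage_start
    while day <= stage_end:
        # Find the single address occupied on this day.
        address_id = next((a for a, s, e in history if s <= day <= e), None)
        if address_id is None or (address_id, day) not in daily_pm10:
            return None
        values.append(daily_pm10[(address_id, day)])
        day += timedelta(days=1)
    if not values:
        return None
    return {"mean": mean(values), "median": median(values), "cumulative": sum(values)}


# Toy example: two days at address A (30 µg/m3), then two days at B (34 µg/m3).
history = [
    ("A", date(1991, 6, 1), date(1991, 6, 2)),
    ("B", date(1991, 6, 3), date(1991, 6, 4)),
]
daily_pm10 = {("A", date(1991, 6, d)): 30 for d in (1, 2)}
daily_pm10.update({("B", date(1991, 6, d)): 34 for d in (3, 4)})
result = stage_exposure(history, daily_pm10, date(1991, 6, 1), date(1991, 6, 4))
```

An estimate based on address at birth alone would correspond to calling the same function with a history that contains only the birth address for the whole life stage.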
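For illustration, the two agreement measures can be computed with the standard library alone. The sketch assumes that R2 is the squared Pearson correlation between the two sets of estimates (equivalent to the R2 of a simple linear regression); the text does not state how R2 was derived, and all function names here are hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5  # undefined if either input is constant


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def compare_exposures(at_birth, with_mobility):
    """R2 (squared Pearson r, an ASSUMPTION) and Spearman's rho."""
    r = pearson(at_birth, with_mobility)
    rho = pearson(average_ranks(at_birth), average_ranks(with_mobility))
    return {"r_squared": r * r, "spearman_rho": rho}


# Made-up exposure values (µg/m3), for illustration only.
agreement = compare_exposures([30.0, 31.5, 33.0, 35.0], [30.2, 31.0, 34.1, 34.8])
```

Spearman's rho is obtained here as the Pearson correlation of the ranked values, which is the standard definition when ties are given average ranks.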

Results
Of all address periods processed, ALGAE corrected the start and/or end dates of 69% of records; 19% of address periods had more than one date changed. Based on the corrected address periods, we reconstructed residential address histories for 14 027 pregnant women, 10 028 of whom had gaps and overlaps in their address periods corrected. We assigned life stage-specific exposure to ~92% of women.
Accounting for residential mobility, mean PM10 exposure across all women was 32.6 µg/m3 (S.D. = 3.0 µg/m3) during pregnancy (n = 12 752) and 31.4 µg/m3 (S.D. = 2.6 µg/m3) during infancy (n = 12 525); 3414 women included in the study (24%) changed address during pregnancy or the first year of life of their baby. The majority of those moves occurred after birth: 6.7% (n = 937) of mothers moved during pregnancy and 18.0% (n = 2477) of mothers moved during the infancy of their baby. Among those who moved, 95% moved once, 4.5% moved twice and 0.5% moved three times. The average number of addresses women lived at was 1.3.
In comparing estimated PM10 exposures using address at birth and those accounting for residential mobility (Figure 2), differences were up to 26% during pregnancy, ranging from −5.3 µg/m3 to 12.4 µg/m3 (5th to 95th percentile, −1.1 to 1.0 µg/m3), and up to 27% during infancy, ranging from −7.2 µg/m3 to 7.6 µg/m3 (5th to 95th percentile, −1.7 to 1.3 µg/m3). Using address at birth explained 92% of the variability in estimated PM10 exposures for pregnancy and 85% for infancy, compared with exposures accounting for residential mobility. For most individuals, exposure changed by <5% (95% and 90% of individuals for pregnancy and infancy, respectively), despite the relatively large changes that were seen both in pregnancy and infancy (Figure 2).
Figure 1. ALGAE's conditions for cleaning address periods (a_n) derived from a contact database, showing cases of: (i) contiguous address periods with complete information on address start (a_n start) and end (a_n end) dates; (ii) gaps in address periods where one or more days are missing; and (iii) overlap in address periods where more than one address is recorded for the same day for the same individual. In case of gaps or overlaps in address periods, ALGAE favours preserving the start date of address periods over the end date.

Discussion
Our study used a cohort contact database to clean and reconstruct residential address histories for cohort members for a rich and well-characterized birth cohort, ALSPAC. To do so we developed ALGAE, an automated protocol to assess historical exposure to air pollution for members of longitudinal cohort studies. The protocol could easily be adapted to support different environmental pollutants and is intended as a generic and re-usable tool to link environmental exposures to cohort studies by: (i) reconstructing residential address histories; (ii) assigning exposure estimates to cohort participants based on residential address; and (iii) aggregating exposure estimates over different time periods such as pregnancy, infancy and childhood through to adolescence. We used as an example an English birth cohort, but ALGAE is readily transferable and adaptable to other settings with similar demands on reconstructing address histories.
By correcting address histories for 69% of participants, we were able to assign daily exposure estimates to 92% of all women (n = 12 905). We were limited in that we could not include women who moved outside the study area (~3.5%) as we did not have exposure data outside the modelling domain. The proportion of women changing address during pregnancy within the study area (~7%) was consequently lower than in other studies. In the UK, Hodgson et al. (2015), 21 for example, identified 24% of women as movers in a study of 5399 pregnant women in the North East of England. Canfield et al. (2006) 13 reported that out of 1085 mothers in a Texan case-control study on birth defects, ~30% changed address during pregnancy. Chen et al. (2010) 14 found that in a New York birth cohort study, 16.5% of expectant mothers moved. The consensus across these studies was that mobility varied significantly by maternal age and socioeconomic deprivation, with older, more affluent mothers less likely to move. Such sociodemographic data were not analysed as part of the present study.
When comparing estimated exposures obtained using the reconstructed address histories with those obtained using address at birth for the whole duration of pregnancy, exposure estimates varied little overall, with differences between the two methods smaller than 5% for the majority of individuals. This is consistent with previous studies, which reported only small changes between exposures at birth and those taking residential mobility into account. At the extreme end, however, we observed differences in exposure estimates of up to 26% during pregnancy and up to 27% during the first year of the child's life. The extent of exposure misclassification introduced by ignoring these observed residential mobility patterns, and instead assigning exposure to the address at delivery, will depend on the degree of spatial and temporal variability in the exposures and the geographical resolution at which they are estimated. 21 Chen et al. (2010), 14 for example, noted that the majority of moves during pregnancy occurred over short distances (median distance 4 km) and, therefore, exposure estimates did not change substantially when compared with those obtained only at place of birth. They concluded that the level of observed agreement is likely to decrease for studies using higher resolution exposure estimates. We did not have information about the average distance women moved as part of this study, in order to preserve the non-identifiability of study participants. Our primary interest was in air pollution exposure, but derived distance output could be added to the ALGAE code base if required for other studies, subject to ethical approval.
The granularity of exposure assignment relates to the type of geographical unit used to represent the address, from areas such as regions and districts to addresses (points with X/Y coordinates). Using address locations is especially important in studies where local, discrete air pollution sources (e.g. roads, industrial stacks) are modelled, as exposure estimates may vary substantially over short distances from sources, as shown in Figure 3. In ALSPAC, for example, we modelled PM10 emissions from >1500 road sources, so it was important to have a complete record of spatially resolved address locations for exposure assignment.
Another consideration is the temporal resolution of the exposure data. For this study we computed daily exposure estimates for each address. Reconstructing address histories may therefore only be worthwhile if the exposure modelling is sufficiently granular relative to the exposure window, such as daily or weekly averages for pregnancy exposure.
Our study is the largest to date to explore the effect of residential mobility during pregnancy on exposure misclassification. A considerable strength is the availability of daily exposure estimates at each residential address. By developing ALGAE, we were able to reconstruct residential mobility and assign trimester-specific exposure estimates for 12 752 pregnancies. The use of ambient air pollution exposure estimates at a residential address, however, does not take account of personal activity patterns of individuals or air pollution from indoor sources. The results presented here, therefore, are only a proxy for personal exposure which might vary substantially by individual, depending on daily activity or occupation.
Our study highlights the need for temporally and spatially detailed information on residential location in environmental epidemiological studies. Residential location is often used to assign spatially variable environmental risk factors as a proxy for individual exposure. Our case study demonstrates how the cohort contact database can be used and enhanced to achieve a consecutive temporal address history in cases where recalled address information is not available. We were not able to compare our results with those from recalled address histories as this information was not available in ALSPAC. Previous studies have, however, found up to 90% agreement between recalled address histories and those obtained from public record databases. 22 Also, we assume that the start date of an address period is a stronger signal than the end date of the address period, a decision which was taken in collaboration with ALSPAC. The ALGAE code base has the flexibility to allow alternative assumptions which might be more appropriate within other cohort settings.
Our paper highlights ways to improve exposure assessment in cohort studies, where exposures relate to aspects of the environment associated with location of residence. This includes exposures related to the physical environment such as air and noise pollution, as well as social factors such as access to health care services or area-level deprivation. Brokamp et al. (2016), 23 for example, showed that using a single address at one point in time to assign environmental exposures, and other place-based factors such as socioeconomic status, can result in differential exposure misclassification leading to bias towards the null. The ability to identify frequent moves may also be important for other studies, for example those looking at mental health outcomes, as frequent moving in childhood has been associated with poorer mental health in adulthood. 24
Figure 3. The effect of geographical resolution on exposure assessments that take account of mobility: left, low-resolution exposure (E) will result in the same exposure estimates across all addresses (a); middle, medium-resolution exposure will have low impact on exposure misclassification; right, exposure misclassification is potentially substantial if high-resolution exposure (e.g. address-specific) is available.
In conclusion, whereas ignoring residential changes during pregnancy may on average have a relatively small effect on environmental exposure estimates at residence, at least in the ALSPAC cohort studied here, for some individuals there may be quite marked exposure misclassification which could introduce bias into the study through either random or systematic errors (or both). Differences in exposure are likely to be larger in more mobile populations. Our bespoke software to assign air pollution data to residential histories, dealing with gaps, overlaps and errors in address records, offers a ready solution to link environmental data to individuals in longitudinal cohort studies. Its generic code base makes ALGAE re-usable for other cohort studies in the UK and internationally, providing an accessible and low-cost means to enhance such studies with environmental exposure data.