Quality of age data in the Sierra Leone Ebola database

Introduction: While it is suspected that some ages were misreported during the 2014-2016 West African Ebola outbreak, an analysis examining age data quality has not been conducted. The study objective was to examine age heaping and terminal digit preference as indicators for quality of age data collected in the Sierra Leone Ebola Database (SLED).
Methods: Age data quality for adult patients was analyzed within SLED for the Viral Hemorrhagic Fever (VHF) database and the laboratory testing dataset by calculating Whipple's index and Myers's blended index, stratified by sex and region.
Results: Age data quality was low in both the VHF database (Whipple's index for the 5-year range, 229.2) and the laboratory testing dataset (Whipple's index for the 5-year range, 236.4). Age was reported most accurately in the Western Area and least accurately in the Eastern Province. Age data for females were less accurate than for males.
Conclusion: Age data quality was low in adult patients during the 2014-2016 Ebola outbreak in Sierra Leone, which may limit the usefulness of age as an identifying or stratifying variable. These findings inform future analyses using this database and describe a phenomenon that has relevance in data collection methods and analyses for future outbreaks in developing countries.


Introduction
The West African Ebola outbreak in 2014-2016 resulted in over 28,000 cases and 11,000 deaths [1]. Deficiencies in pre-existing public health infrastructure information systems in Sierra Leone exacerbated data collection difficulties and complicated the public health response to the outbreak [2]. The Sierra Leone Ministry of Health and Sanitation (MoHS) used the Centers for Disease Control and Prevention (CDC) Epi Info Sierra Leone Viral Hemorrhagic Fever (VHF) application as a surveillance system to monitor the epidemic [3,4]. The resulting VHF database contains clinical information, such as symptoms and date of onset, and demographic data reported by suspected case patients or their relatives and collected by case investigators [3]. While this database is often used for national and international level analyses because it provides the most comprehensive epidemiologic data on Ebola cases available in Sierra Leone, there were considerable difficulties in ensuring consistency and completeness of the data [5]. These difficulties impaired contact tracing and links with other databases, such as those created or recorded by the Ebola Treatment Centers' data managers and the burial teams, which experienced similar data collection and quality problems. The Sierra Leone MoHS, with assistance from the CDC, consolidated available records to form a more comprehensive and complete database, referred to as the Sierra Leone Ebola Database (SLED) [6].
At the time of the epidemic, Sierra Leone did not have a widely used unique identifier system for the population (e.g. equivalent to a social security number in the United States). Careful recording of personal information, such as name and age of the patient, was particularly important during the Ebola outbreak to ensure accurate connection of laboratory testing results to individual patients, avoid duplication within a database and ensure accurate comparisons across databases.
However, it is common for individuals in Sierra Leone to be unaware of their exact age and not to possess a birth certificate, even if their birth was registered with the Registrar of Births and Deaths [7].
The tendency to report certain ages instead of others (e.g. rounding to the nearest age that ends in '0' or '5'), referred to as age heaping, has been shown previously in Sierra Leone census data and survey data [8-10]. In addition to complicating the linkage of databases, inaccurately reported age data can have implications for the quality of analyses using the data. While it is suspected that there was a preference for reporting ages with a terminal digit of either '0' or '5' during the Ebola virus disease outbreak, an analysis to examine the quality of age data from the SLED database has not been conducted.
This study examined the quality of age data collected for adult patients during the Ebola virus disease outbreak. Our objective was to describe age heaping as an indicator for inaccurate age data collected during the Ebola virus disease outbreak in Sierra Leone, with the goal of informing future SLED analyses and assessing implications for data management of other large-scale public health responses. The project was approved by the Sierra Leone MoHS and the CDC IRB.

Methods
Data: within the SLED database, data were analyzed separately for the VHF database and the laboratory testing dataset for the years 2014 and 2015. A de-identified analytic project data package was prepared by SLED data managers in Sierra Leone and transferred to the National Center for Health Statistics Research Data Center (RDC) [11] by secure file transfer protocol for analysis. Records with missing values for age or sex were excluded (3.0% from the VHF database and 7.3% from the laboratory testing dataset). To maintain confidentiality, analysis groups by sex and region with fewer than 15 records were suppressed. In addition, an RDC analyst reviewed the output to prevent disclosure of personally identifiable information (PII). This study also served as a testing project for providing secure data access to SLED.
VHF study dataset: records for adult patients with recorded age were included in the VHF study dataset. Records for children were not included because the ages of childhood are less affected by age heaping. SLED data managers in Sierra Leone extracted two datasets for analysis: a national dataset by single age and a dataset by age group and region, to avoid PII disclosure. The exact method of age documentation for each individual was not analyzed. For instance, there is no indication in the datasets of whether a patient provided their own age or another individual gave the information.
Laboratory testing study dataset: records for initial laboratory tests for adult patients were included in the laboratory testing study dataset. Records for children were not included because the ages of childhood are less affected by age heaping. Because laboratory testing results were recorded for each sample tested, we only included initial tests (generally the first test for the patient) to exclude follow-up testing records for the same patient. SLED data managers in Sierra Leone extracted the data from the laboratory testing dataset by age group to avoid small counts. The exact method of age documentation for each patient was not recorded.

Measurement of age heaping and age accuracy
VHF study dataset: age distribution was analyzed using a single-year age plot by sex (national data only). Age heaping was calculated using Whipple's index [12], stratified by sex and region of residence (Western Area, Northern Province, Eastern Province, Southern Province). Terminal digit preference was calculated using Myers's blended index [12], stratified by sex (national data only).
Laboratory testing study dataset: age heaping was calculated using Whipple's index, stratified by sex and region of residence.
Additionally, a sub-analysis was conducted on patients who were tested for the Ebola virus prior to their death, for the Western Area and Northern Province. This sub-analysis was conducted to determine if age was collected more or less accurately in patients who were tested for Ebola virus prior to death. Terminal digit preference could not be calculated because SLED data managers in Sierra Leone extracted the data by age group to avoid small counts.
Whipple's index: it was developed to detect a preference or avoidance for ages ending in '0', '5', or both [12]. The index measures age heaping in the range of 23 to 62 years and assumes uniform distribution within a 5 or 10 year range. The ages of childhood (<20 years) and old age (>79 years) are excluded because they are more strongly affected by other types of error of reporting than by terminal digit preference [12]. The formula to calculate Whipple's index is as follows:

Whipple's index for the 10-year range:

$$W_{10} = \frac{P_{30} + P_{40} + P_{50} + P_{60}}{\frac{1}{10}\sum_{a=23}^{62} P_{a}} \times 100$$

Whipple's index for the 5-year range:

$$W_{5} = \frac{P_{25} + P_{30} + P_{35} + \dots + P_{60}}{\frac{1}{5}\sum_{a=23}^{62} P_{a}} \times 100$$

where $P_{a}$ represents the size of the population for each single-year age group $a$. Whipple's index varies between 100 and 500, with 100 indicating no preference for ages ending in '0' or '5' and 500 indicating only ages ending in '0' and '5' were reported [12]. The United Nations scale for interpreting Whipple's index is as follows: <105 = highly accurate; 105-109.9 = fairly accurate; 110-124.9 = approximate; 125-174.9 = rough; ≥175 = very rough [13]. The proportion of individuals with an age reported with a terminal digit of '0' and '5' was evaluated using a two-tailed z-test for difference of proportion at the 0.05 level.
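As an illustration of the calculation described above, the following is a minimal Python sketch of Whipple's index and the United Nations interpretation scale. It is not the code used in this study. The function names, the `counts` mapping (single-year age to number of records) and the synthetic data are assumptions introduced for the example.

```python
def whipple_index(counts, digits=(0, 5), low=23, high=62):
    """Whipple's index from a mapping {single-year age: number of records}.

    digits=(0, 5) gives the 5-year-range index; digits=(0,) gives the
    10-year-range index. Only ages low..high (23 to 62) are used.
    """
    ages = range(low, high + 1)
    total = sum(counts.get(a, 0) for a in ages)
    if total == 0:
        raise ValueError("no records in the 23-62 age range")
    heaped = sum(counts.get(a, 0) for a in ages if a % 10 in digits)
    # Under uniform reporting, len(digits)/10 of the records are expected
    # at ages ending in the chosen digits.
    expected = total * len(digits) / 10
    return 100 * heaped / expected


def un_whipple_category(w):
    """United Nations scale for interpreting Whipple's index."""
    if w < 105:
        return "highly accurate"
    if w < 110:
        return "fairly accurate"
    if w < 125:
        return "approximate"
    if w < 175:
        return "rough"
    return "very rough"


if __name__ == "__main__":
    # Synthetic, illustrative counts only (not SLED data): 10 records at
    # every age, with extra records piled on ages ending in 0 or 5.
    counts = {a: 10 for a in range(23, 63)}
    for a in range(25, 61, 5):
        counts[a] += 30
    w5 = whipple_index(counts)
    print(round(w5, 1), un_whipple_category(w5))
```

On the scale above, the values reported in this study for the 5-year range (229.2 and 236.4) fall in the "very rough" category.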

Myers's blended index: Myers's blended index is similar to Whipple's index, except that it detects the preference or avoidance for ages ending in any of the ten digits [12]. The index assumes that the population is equally distributed among the different ages. Therefore, the expected frequency of each of the ten digits is ten percent. Myers's blended index indicates the preference for each terminal digit, represented as a deviation from ten percent.

Results
(Table 1). Whipple's index for the 5-year range followed a similar pattern, ranging from 182.7 in the Western Area to 275.0 in the Eastern Province (Table 1). Females had a higher Whipple's index for both the 10-year and 5-year range than males for all regions, except the 10-year range in the Western Area (Table 1).
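Comparisons of heaping between groups, such as females and males, rest on the two-tailed z-test for a difference of proportions described in the Methods. The sketch below shows one common pooled-variance form of that test. The function name and the counts in the usage example are hypothetical and are not taken from SLED, and the exact variant used in the study is not stated in the text.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-tailed z-test for a difference of two proportions.

    x1, x2: records with an age ending in '0' or '5' in each group.
    n1, n2: total records in each group.
    Returns (z, p_value) using the pooled standard error.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-tailed p-value from the standard normal distribution:
    # 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    z, p = two_proportion_z_test(x1=480, n1=1000, x2=430, n2=1000)
    print("significant at 0.05" if p < 0.05 else "not significant at 0.05")
```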
Myers's blended index for the national study population was 29.65 overall, 28.34 for males, and 29.43 for females. Preference or avoidance for ages ending in any of the ten digits by sex is shown in Figure 3. Preference for ages ending in '0' and '5' was shown for both men and women.
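For readers unfamiliar with how a Myers's blended index is obtained, the following Python sketch follows the commonly described blending procedure: counts for each terminal digit are summed twice (from the starting age, and from ten years later) and combined with weights 1 to 10 and 9 to 0. Several choices here are assumptions rather than details from this study: the age range (`start`, `end`), the function name, and the convention of reporting half the sum of absolute deviations from ten percent (some sources report the full sum).

```python
def myers_blended_index(counts, start=20, end=89):
    """Myers's blended index from a mapping {single-year age: records}.

    Returns (index, deviations), where deviations maps each terminal
    digit to its blended percentage minus the expected 10 percent.
    The age range start..end must span a whole number of decades.
    """
    if (end - start + 1) % 10 != 0:
        raise ValueError("age range must cover a multiple of 10 years")
    blended = {}
    for i in range(10):
        first_age = start + i
        digit = first_age % 10
        # Sum over all ages with this terminal digit, from the first one.
        from_start = sum(counts.get(a, 0)
                         for a in range(first_age, end + 1, 10))
        # Same sum, but skipping the first age in the sequence.
        from_next = sum(counts.get(a, 0)
                        for a in range(first_age + 10, end + 1, 10))
        blended[digit] = (i + 1) * from_start + (9 - i) * from_next
    total = sum(blended.values())
    if total == 0:
        raise ValueError("no records in the selected age range")
    deviations = {d: 100 * v / total - 10 for d, v in blended.items()}
    # Assumed convention: half the sum of absolute deviations (0 to 90).
    index = sum(abs(x) for x in deviations.values()) / 2
    return index, deviations


if __name__ == "__main__":
    # Synthetic, illustrative counts only (not SLED data).
    counts = {a: 100 - a for a in range(20, 90)}
    for a in range(20, 90, 5):
        counts[a] *= 3  # pile extra records on ages ending in 0 or 5
    index, deviations = myers_blended_index(counts)
    for digit in sorted(deviations):
        print(digit, round(deviations[digit], 2))
    print("index:", round(index, 2))
```

Positive deviations mark preferred terminal digits and negative deviations mark avoided ones, which is the information summarized by sex in Figure 3.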
Laboratory testing study dataset: the total number of individuals between the ages of 23 and 62 included in the study population from the laboratory testing dataset was 18,698 (Table 2). Within the laboratory testing dataset, Whipple's index for individuals with an initial blood test result for the Ebola virus prior to their death was lower than that for all individuals within the dataset for the Western Area and Northern Province (Table 3). Age distribution and terminal digit preference could not be assessed because the data were extracted by age group to avoid small counts and maintain the confidentiality of individuals.

Discussion
Terminal digit preference is not limited to populations in Sierra Leone nor to age data alone. Inaccurately reporting age is common in demographic studies and has also been shown in clinical cohorts [12,14]. In demographic studies, preference for ages ending with terminal digits of '0' and '5' was correlated with low education level [12]. Digit preference bias has also been described when patients are asked to report data such as year of menopause and smoking rate, and when clinicians are responsible for recording measurements such as blood pressure and birthweight [15-18]. Emergency departments also show considerable digit preference bias in the recording of patient time of departure from the emergency department [19,20].
There are several possible explanations related to the Ebola outbreak for the inaccurate age data reported in SLED. First, both the reporter and the recorder influence what number is entered for age. As mentioned earlier, substantial evidence of misreporting of age has been documented previously for population-based samples in Sierra Leone [8,10], indicating that the misreporting of age in the SLED Ebola outbreak data may reflect national age-reporting behavior in Sierra Leone. However, because we cannot determine exactly what method was used to document age, we do not know whether the patient or a family member or neighbor reported the age (e.g. if the patient was too ill to self-report or had died prior to reporting). Depending on the identity and relationship of that family member or neighbor, it is possible that they did not know the patient's exact age. Alternatively, the person recording or collecting the data may have estimated the patient's age in cases where the patient or a proxy was unavailable to respond. Both types of estimates may have favored reporting ages ending in either '0' or '5'. Data quality problems in Sierra Leone may have been further exacerbated by the crisis situation [21], which may have implications for data collection and reporting of data in future humanitarian emergencies.
In 2011, the United Nations Office for the Coordination of Humanitarian Affairs issued a report stating that information gaps on sex and age limit the effectiveness of humanitarian response in all phases of a crisis [22]. The report argues that proper collection, analysis and use of sex and age disaggregated data allow operational agencies to deliver assistance more effectively and efficiently [22].
The SLED Ebola outbreak data collection and maintenance effort is commendable in that age data were collected to better inform the outbreak response and to inform analyses of the outbreak. However, our study results highlight the difficulties of collecting accurate age data to be used as an identifying or stratifying variable during humanitarian emergencies, especially in developing countries where age may already be more likely to be misreported [14]. In Sierra Leone, efforts are underway to improve civil registration, which may result in better availability and knowledge of birth dates and exact ages [23].
Our study has limitations. We were unable to determine if age was reported by the patient or by proxy. While this indication would not alter the accuracy of the age data, it would allow us to determine where the source of error might have originated. Additionally, we were not able to calculate Myers´s blended index in the laboratory testing dataset or by region in the VHF dataset. This analysis would have allowed us to detect the preference or avoidance for ages ending in any of the ten digits and not only ages ending in '0' or '5'. However, our overall conclusion that the quality of age data was poor would remain the same. Finally, best efforts were made to de-duplicate records in both the VHF dataset and laboratory testing dataset; however, a small number of duplicate records may have remained in the files.
This study highlights that during humanitarian emergencies, age data may be collected inaccurately. Specifically, our study shows that age data quality was low during the 2014-2016 Ebola outbreak in Sierra Leone, and therefore may have had limited use as an identifying or stratifying variable. In addition to informing future analyses using this database, these findings describe a phenomenon that may have relevance in data collection methods for future humanitarian emergencies.

Conclusion
Age data quality was low in adult patients during the 2014-2016 Ebola outbreak in Sierra Leone, which may reduce its use as an identifying or stratifying variable. These findings inform future analyses using this database and describe a phenomenon that has relevance in data collection methods and analyses for future outbreaks in developing countries.

What is known about this topic
• Deficiencies in pre-existing public health infrastructure information systems in Sierra Leone exacerbated data collection difficulties and complicated the public health response to the West African Ebola outbreak in 2014-2016;
• The tendency of reporting certain ages instead of others (e.g. rounding to the nearest age that ends in '0' or '5'), referred to as age heaping, has been shown previously in Sierra Leone census data and survey data.

What this study adds
• Our analysis revealed significant age heaping in two essential databases from the 2014-2016 Ebola outbreak in Sierra Leone;
• Our study shows that age data quality was low during the 2014-2016 Ebola outbreak in Sierra Leone, and therefore may have had limited use as an identifying or stratifying variable;
• These findings describe a phenomenon that may have relevance in data collection methods for future humanitarian emergencies, in addition to informing future analyses using this database.
Center staff, and SLED CDC principal investigator Yelena Gorina for their support of this project. The SLED team, the CDC Research Data Center staff and Yelena Gorina did not receive any compensation.