A de-identified database of 11,979 verbal autopsy open-ended responses

As part of the Gates Grand Challenge 13, the Population Health Metrics Research Consortium (PHMRC) collected data to enable the development and validation of methods that measure cause-specific mortality in populations with incomplete or inadequate cause of death coding. This work yielded 11,979 verbal autopsy interviews (VAIs). In each, a field interviewer spoke with an individual familiar with the deceased and their final illness, and used a semi-structured questionnaire to collect information about the symptoms of the deceased in their final illness. The VAI collected demographic characteristics, possible risk factors (such as tobacco use), and other potentially contributing characteristics. It also included the open-ended question, “Could you please summarize, or tell us in your own words, any additional information about the illness and/or death of your loved one?” (open narrative). The VAI data were released in a de-identified format in September 2013 through the Global Health Data Exchange, in files that contain verbal autopsies that were collected at six sites in four countries (India, Mexico, Tanzania, and the Philippines). Due to research interest, we have now created redacted versions of the open narratives from the open-ended question of the questionnaire. We hope that this database will be the source of innovations that increase our knowledge about the causes of ill health and, through this knowledge, produce improvements in health for individuals and populations.


Introduction
Population health information that is both accurate and comprehensive can aid program implementation, monitoring and evaluation, resource allocation, and planning. However, there are currently large gaps in the technologies and measurement methods available to generate this information, which makes it difficult to address health inequities through effective policy 1.
The Population Health Metrics Research Consortium (PHMRC) conducted data collection to enable the development and validation of methods that measure cause-specific mortality in populations with incomplete or inadequate cause of death coding. This work produced around 12,000 verbal autopsy interviews (VAIs), in which a relative or someone else familiar with the final illness of the deceased provides information about the signs and symptoms of the final illness, as well as demographic characteristics, risk factor exposures (such as tobacco use), and other potentially relevant characteristics 2.
The VAI data were released in a de-identified format in September 2013 through the Global Health Data Exchange, in files that contain verbal autopsies collected at six sites in four countries (India, Mexico, Tanzania, and the Philippines) using a standardized VA questionnaire developed by the PHMRC. The data are organized into three parts corresponding to the questionnaire modules for each age group: neonate, child, and adult. Each VAI in the database is matched with a "gold standard" diagnosis of the underlying cause of death, typically identified from medical records using stringent diagnostic criteria (such as laboratory, pathology, or medical imaging findings) 3. One portion of a VAI is the "open narrative," where the respondent has the opportunity to tell, in their own words, what happened during the illness that led to the death being investigated. This was collected as the final question in the PHMRC survey, after the structured interview, when the respondent was asked, "Could you please summarize, or tell us in your own words, any additional information about the illness and/or death of your loved one?" The full response to this question was transcribed and translated into English. The 2013 data release included counts of stemmed keywords as variables in the final dataset, which gave researchers access to this rich source of unstructured data while removing any potentially personally identifiable information (PII) in that portion of the interview.
Due to research interest, we have now created redacted versions of 11,979 open narratives to allow researchers the opportunity to learn even more about how deaths are described. We hope that this database will be the source of innovations that increase our knowledge about the causes of ill health and, through this knowledge, produce improvements in health for individuals and populations.

Methods
The process of collecting the VAIs has been described in detail previously 1. In this article, we provide a detailed account of the protocol used to redact personal information from the responses to the open-ended question, which allows the release of the full text of the open narratives collected in the VAIs.
Study participants provided their consent to participate with the knowledge that "reports of the data … will not identify any individual person." We chose also to redact the names of specific health facilities to avoid the risk of identifying individual health service providers indirectly, through their association with individual facilities. To retain the most information possible for future research, we replaced PII with "tags" that denote what sort of information has been redacted.
An example makes this clear: a typical text was redacted to read, "vaginal bleeding and delay to receive care at [HOSPITAL] was the main cause of death. he said that his wife arrive at the hospital at 8pm and didn't receive any care until 8am." Instead of including the name of the specific hospital, we redacted it to [HOSPITAL]. We included all VAIs for which there was an open-response string available to redact, even when the response was devoid of information.
We implemented the redaction process in a spreadsheet using Excel 2010. Redaction was performed manually by a single data analyst (LH), who read each open response and replaced each piece of PII with the appropriate tag.
"there was a break on his forehead. was operated after 2 days. after operation he got fever. the deceased also had cough. as per respondent, it was not just the accident alone who led the deceased to death. there was also a complication of his kidney disease. long before (respondent was not able to remember the exact date), the deceased experienced inability to walk but it was not consulted to the doctor for the deceased doesn't want to. they only went to a traditional healer for treatment. the deceased can't walk for about 7 months but then later on he was able to walk again. after he was also hospitalized at [HOSPITAL5], it was known that he have kidney disease."
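Because redacted items are replaced by bracketed tags, users of the database can locate and tally them programmatically. The following is a minimal sketch in Python, not part of the redaction protocol itself. It assumes that every tag is an upper-case label in square brackets, optionally followed by a number (as in [HOSPITAL5]) or a signed offset (such as a hypothetical [YEAR+2]); the tag format in the released files should be checked before relying on this pattern. The example string joins two fragments quoted in this article.

```python
import re
from collections import Counter

# Assumed tag shape: an upper-case label in square brackets, optionally
# followed by a number ([HOSPITAL5]) or a signed offset ([YEAR+2]).
TAG_PATTERN = re.compile(r"\[([A-Z]+)(\d+|[+-]\d+)?\]")


def count_tags(narrative):
    """Return a Counter of tag labels found in the text, suffixes dropped."""
    return Counter(match.group(1) for match in TAG_PATTERN.finditer(narrative))


# Two fragments quoted in this article, joined for illustration only.
example = (
    "vaginal bleeding and delay to receive care at [HOSPITAL] was the main "
    "cause of death. after he was also hospitalized at [HOSPITAL5], it was "
    "known that he have kidney disease."
)
tag_counts = count_tags(example)
print(tag_counts)
```

Dropping the numeric suffix groups all hospital mentions under one label; keeping it instead would allow distinct facilities within a single narrative to be told apart.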

Additional clarifications
Midwife names were redacted to [DOCTOR].

Dataset validation
We reviewed progress weekly and discussed emerging challenges as they arose. For example, we determined that the original plan of redacting dates entirely to [DATE] was obscuring valuable information about the time between symptoms. One week later, we determined that our first attempt at a remedy, to include [DATE+days], was too labor-intensive and would have prevented redaction from being completed within our budget. Our next remedy worked, and that is how we developed the [YEAR+n] approach described above. When redaction was completed, we reviewed a simple random sample of redacted texts and confirmed that all were devoid of PII.
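A simple random sample of this kind can be drawn reproducibly in a few lines. The sketch below is illustrative only: the record identifiers, the sample size of 100, and the seed are hypothetical, as this article does not state them.

```python
import random


def draw_review_sample(record_ids, sample_size, seed=0):
    """Simple random sample without replacement, reproducible via the seed."""
    rng = random.Random(seed)
    return rng.sample(list(record_ids), sample_size)


# Hypothetical identifiers 1..11979 (one per narrative) and sample size.
all_ids = range(1, 11980)
review_ids = draw_review_sample(all_ids, sample_size=100)
print(len(review_ids), len(set(review_ids)))
```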

Ethics approval
This study was approved by the Human Subjects Division of the University of Washington (application number 34413). The ethical approval obtained for the VAIs is described in 1. All data were collected with informed verbal consent, which participants gave before the interview began.

Open Peer Review
The release of the data is welcome, as including the open narrative in automated VA analysis programs can significantly alter output diagnoses. Better understanding of the pros and cons of using free text, and the appropriate weighting for different sources of free text, is likely to be of further interest to researchers as their methods and technology develop. To illustrate, simple text mining for keywords such as "malaria" could correctly provide positive evidence towards a diagnosis if the free text read "She suffered from malaria..." but incorrectly if the free text read "Her malaria tests at the health centre were negative".
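The malaria illustration can be made concrete with a minimal sketch in Python. The function names and the short list of negation cues are hypothetical and chosen only for this example; they are not taken from any VA analysis program, and real negation handling is considerably harder than this.

```python
import re

# Hypothetical negation cues for illustration only.
NEGATION = re.compile(r"\b(negative|no|not|without)\b")


def mentions_keyword(text, keyword):
    """Naive text mining: True if the keyword appears as a whole word."""
    pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
    return re.search(pattern, text.lower()) is not None


def mentions_keyword_unnegated(text, keyword):
    """Crude refinement: skip sentences that also contain a negation cue."""
    pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
    for sentence in re.split(r"[.!?]", text.lower()):
        if re.search(pattern, sentence) and not NEGATION.search(sentence):
            return True
    return False


positive = "She suffered from malaria..."
negative = "Her malaria tests at the health centre were negative"

for text in (positive, negative):
    print(mentions_keyword(text, "malaria"),
          mentions_keyword_unnegated(text, "malaria"))
```

The naive matcher treats both sentences as evidence for malaria, whereas the sentence-level negation check separates them, at the cost of discarding any sentence that happens to contain a cue word for an unrelated reason.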
Research may use the narrative free text alone but its use in VA analysis algorithms is likely to complement answers to the structured interview component of VA. This dataset includes the gold standard diagnosis and narrative free text but no link to the structured answers. The authors may wish to comment on this.
The methods are well described, with appropriate examples, and the released dataset is clear to understand whilst minimising the risk of individual identification.

Typos
Page 3: "Each VAI in the database is matched with a "gold standard" diagnoses of underlying causes of death" should rather read "Each VAI in the database is matched with a "gold standard" diagnosis of underlying cause of death".
Is the rationale for creating the dataset(s) clearly described?