Automated verbal autopsy classification: using one-against-all ensemble method and Naïve Bayes classifier

Verbal autopsy (VA) deals with post-mortem surveys about deaths, mostly in low- and middle-income countries where the majority of deaths occur at home rather than in a hospital, for the retrospective assignment of causes of death (COD) and, subsequently, evidence-based health system strengthening. Automated algorithms for VA COD assignment have been developed and their performance has been assessed against physician and clinical diagnoses. Since the performance of automated classification methods remains low, we aimed to enhance the Naïve Bayes Classifier (NBC) algorithm to produce better-ranked COD classifications on 26,766 deaths from four globally diverse VA datasets, compared against some of the leading VA classification methods, namely Tariff, InterVA-4, InSilicoVA and NBC. We used a different strategy: training multiple NBC algorithms using the one-against-all approach (OAA-NBC). To compare performance, we computed the cumulative cause-specific mortality fraction (CSMF) accuracies for population-level agreement from rank one to rank five COD classifications. To assess individual-level COD assignments, cumulative partially chance-corrected concordance (PCCC) and sensitivity were measured for up to five ranked classifications. Overall, the results show that, based on the cumulative CSMF accuracy, PCCC and sensitivity scores, OAA-NBC consistently assigns CODs that most closely match physician and clinical COD assignments compared with some of the leading algorithms. The results demonstrate that our approach improves classification performance (sensitivity) by 6% to 8% compared with the other VA algorithms. Population-level agreements for OAA-NBC and NBC were found to be similar to or higher than those of the other algorithms used in the experiments.
Although OAA-NBC still requires improvement for individual-level COD assignment, the one-against-all approach improved its ability to assign CODs that more closely resemble physician or clinical COD classifications compared to some of the other leading VA classifiers.


Introduction
Verbal autopsy (VA) is increasingly being used in developing countries where most deaths occur at home rather than in hospitals, and causes of death (COD) information remains unknown 1 . This gap in information prevents the evidence-based healthcare programming and policy reform needed to reduce the global burden of disease 2 . VA consists of a structured questionnaire used to gather information on the symptoms and risk factors leading up to death from family members of the deceased. Each completed survey is then typically reviewed independently by two physicians, and COD assignment is done using World Health Organization (WHO) International Classification of Disease (ICD) codes 3 . If there is disagreement in assignment, the VA undergoes further review by a senior physician 4,5 . Efforts are underway to make verbal autopsies part of the civil registration systems of countries to ensure that effective policies can be developed to address the global burden of disease 6 .
In recent years, efforts have been made to automate VA COD assignment using various computational algorithms in an attempt to further standardize VA COD assignment and alleviate physician time and costs 7-14 . The current leading computational VA techniques include InterVA-4 8 , Tariff 7 , InSilicoVA 9 , King-Lu 11 , and the Naïve Bayes Classifier (NBC) 12 . InterVA-4 employs medical-expert-defined static weights for symptoms and risk factors given a particular COD, and subsequently calculates the sum of these weights to determine the most likely COD 8 . Conversely, Tariff was pre-trained on the Population Health Metrics Research Consortium (PHMRC) VA data to compute tariffs, which express the strength of association between symptoms and CODs and are later summed and ranked to determine a COD; the same procedure is used on the test dataset, with the resultant summed and ranked tariff scores compared against the pre-trained COD rankings 15 . InSilicoVA assigns CODs by employing a hierarchical Bayesian framework with a naïve Bayes calculation component; it also computes the uncertainty for individual CODs and population-level COD distributions 9 . The King-Lu method measures the distribution of the CODs and symptoms in the VA training dataset and uses these to predict CODs in the VA test dataset 11 . Lastly, NBC predicts the COD after computing the conditional probabilities of observing a symptom for a given COD from the VA training dataset, and then applying the Bayes rule to these probabilities 12 . These existing automated classification algorithms, however, generate low predictive accuracy when compared against physician VA or hospital-based COD diagnoses 9,12,16,17 . Leitao et al. 18 , in their systematic review of automated verbal autopsy classification algorithms, concluded that there is a need to improve automated classification techniques to enable wider and more reliable employment in the field.
The aim of our research is likewise a classification method for predicting CODs using responses to structured questions in a VA survey. We used a different strategy: training multiple NBC algorithms 19 using the one-against-all approach (OAA-NBC) 20,21 . We chose the NBC algorithm and the one-against-all ensemble method of machine learning because the former has shown better results on VA surveys 12 and the latter has shown better results in the machine learning literature 20,21 . OAA-NBC generates ranked assignments of CODs for 26,766 deaths from four globally diverse VA datasets (one VA dataset was divided into four datasets; a total of seven datasets were used for analysis). We also compare our technique against the current leading algorithms Tariff 7 , InterVA-4 8 , NBC 12 and InSilicoVA 9 on the same deaths used for OAA-NBC.

Datasets
In order to test the performance of the algorithms, we used four main datasets, containing information on a total of 26,766 deaths: three physician COD diagnosed VA datasets, namely the Indian Million Death Study (MDS) 22 , South African Agincourt Demographic Surveillance Sites (HDSS) dataset 23 , and Bangladeshi Matlab HDSS dataset 24 , and one health facility diagnosed COD dataset, namely the PHMRC VA data collected from six sites in four countries (India, Mexico, the Philippines and Tanzania) 25,26 . We used four combinations of the PHMRC data by age group (adult and child) and by site (all versus India-only); this filtering was done to determine the effect on results when deaths were collected from the same geographical setting. A total of seven datasets were used and are summarized in Table 1. These datasets are publicly available, except for the MDS, and have been used in other studies 12,16,26 .
The MDS VA dataset used in this study contained information on 12,225 child deaths from ages one to 59 months. For each death, two trained physicians independently and anonymously assigned a WHO ICD version 10 code 27 . In cases where the two physicians did not initially agree or reconcile on a COD, a third senior physician adjudicated 22 . Similarly, the Agincourt dataset 23 underwent dual physician COD assignment for its 5,823 deaths at ages 15 to 64 years. COD assignment was slightly different for the Matlab dataset, which had 2,000 deaths at ages 20 to 64 years; a single physician assigned a COD, followed by review and verification by a second physician or an experienced paramedic 24 . In contrast, the PHMRC dataset comprised 6,718 hospital deaths that were assigned a COD based on defined clinical diagnostic criteria, including laboratory, pathology and medical imaging findings 25,26 . For each VA dataset, we grouped the physician-assigned CODs into 17 broad categories (see Table 2). We also show the distribution of records for each COD for each of the seven datasets used in our study.

Amendments from Version 1
We have made minor modifications to the text in all sections of the paper for clarification, as required by reviewer 1. In particular, we made a major change in Figure 1, and minor changes in Figure 3 and in the tables' descriptions. We also added Supplementary file 1, which shows tables with the complete results, as requested by reviewer 1. For the exact reasons and locations of changes, please read the response to reviewer 1 below. We also added new future work directions to the Conclusion section of the paper, as recommended by reviewer 2.

One-against-all Naïve Bayes (OAA-NBC) approach
An overview of our approach is shown in Figure 1. We transformed each VA dataset into binary format, with the VA survey questions as the attributes (columns), the answers as the cell values in each row (re-coded into binary format with 'Yes' coded as 1 and 'No' as 0) and the CODs (the group number identifiers listed in Table 2) as the last (or first) column. In every VA dataset, each death was represented as a row (record).
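The re-coding described above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation (which used Java and Weka); the column names and records are hypothetical examples.

```python
# Sketch: transform VA survey answers into the binary matrix format
# described above ('Yes' -> 1, 'No' -> 0, COD group id kept as the label).
def to_binary(records, symptom_columns):
    """records: list of dicts; returns list of (binary features, cod_group)."""
    rows = []
    for rec in records:
        features = [1 if rec[col] == "Yes" else 0 for col in symptom_columns]
        rows.append((features, rec["cod_group"]))
    return rows

# Hypothetical symptom columns and death records for illustration only.
symptoms = ["fever", "cough", "diarrhoea"]
deaths = [
    {"fever": "Yes", "cough": "No", "diarrhoea": "Yes", "cod_group": 3},
    {"fever": "No", "cough": "Yes", "diarrhoea": "No", "cod_group": 7},
]
binary = to_binary(deaths, symptoms)
print(binary)  # [([1, 0, 1], 3), ([0, 1, 0], 7)]
```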

We divided each VA dataset into training and testing datasets. We trained multiple NBC models 19 on the transformed training datasets using the one-against-all approach 20,21 . We chose NBC because it showed better results on VA surveys in the past 12 .
The one-against-all approach was used because it improves an algorithm's classification accuracy on datasets with several categories of dependent variables, as demonstrated in past literature 20,21 . This is explained in detail in the next section. During testing, the trained NBC models assigned CODs to each death in the testing dataset. The assigned causes were ordered by their probabilities, with the assumption that the top-ranked cause would most likely be the real cause.
Training Naïve Bayes using one-against-all approach. NBC uses a training dataset to learn the probabilities of symptoms and their CODs 12,19 . NBC first measures the probability of each COD, P(COD), in the training dataset. Secondly, it determines the conditional probabilities of each symptom given a particular COD, P(Sym|COD). Thirdly, NBC determines the probability of every COD given a VA record in the test set, i.e., P(COD|VA).
P(COD|VA) ∝ P(COD) × ∏ P(Sym|COD)
Equation 1. Conditional probability of a COD given a VA record.

P(COD|VA) is determined by taking the product of all P(Sym|COD) values (i.e., over all symptoms in the VA record) and P(COD). The COD with the highest P(COD|VA) value is selected as the predicted COD:

COD_NBC = argmax over COD ∈ CODs of P(COD|VA)
Equation 2. Select the class with the maximum probability.

In particular, we chose the Naïve Bayes Multinomial classification algorithm, which estimates probabilities using a maximum likelihood estimate and is readily available in data mining software applications such as Weka 19,20 .
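The calculation in Equations 1 and 2 can be sketched as follows. This is an illustrative pure-Python sketch, not the Weka Naïve Bayes Multinomial implementation the study used; add-one (Laplace) smoothing and log-probabilities are assumptions added here for numerical robustness.

```python
import math
from collections import defaultdict

def train_nbc(rows):
    """rows: list of (binary symptom vector, COD label).
    Learns P(COD) and P(Sym|COD) from the training data."""
    cod_counts = defaultdict(int)
    sym_counts = defaultdict(lambda: defaultdict(int))
    for features, cod in rows:
        cod_counts[cod] += 1
        for i, v in enumerate(features):
            sym_counts[cod][i] += v
    total = len(rows)
    n_sym = len(rows[0][0])
    model = {}
    for cod, n in cod_counts.items():
        prior = n / total                              # P(COD)
        # P(Sym_i = 1 | COD) with add-one smoothing (an added assumption)
        cond = {i: (sym_counts[cod][i] + 1) / (n + 2) for i in range(n_sym)}
        model[cod] = (prior, cond)
    return model

def predict(model, features):
    """Return CODs sorted by log P(COD|VA); element 0 is Equation 2's argmax."""
    scores = {}
    for cod, (prior, cond) in model.items():
        s = math.log(prior)
        for i, v in enumerate(features):
            p = cond[i] if v == 1 else 1 - cond[i]
            s += math.log(p)
        scores[cod] = s
    return sorted(scores, key=scores.get, reverse=True)
```

For example, training on records where symptom pattern [1, 0] always carries label "A" and [0, 1] always carries "B" makes `predict(model, [1, 0])[0]` return "A".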
In the one-against-all approach, we built an NBC model for each COD instead of one model for all CODs. In this approach, a dataset with M categories of CODs (dependent variables) was decomposed into M datasets with binary categories (CODs). Each binary dataset Di had a COD Ci (where i = 1 to M) labelled as positive and all other CODs labelled as negative with no two datasets having the same CODs labelled as positive. Finally, NBC was trained on each dataset Di resulting in M Naïve Bayes models, as shown in Figure 2. Each model was then used to classify the CODs for records in the test dataset producing a probability of classification. The cause Ci (where i=1 to M) with the highest probability was considered as the correct classification.
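The decomposition described above can be sketched as follows: a dataset with M COD categories becomes M binary datasets, each labelling one COD as positive (1) and all others as negative (0). Function and label names are ours, for illustration only.

```python
# Sketch of the one-against-all decomposition: one binary dataset D_i per COD.
def one_against_all(rows):
    """rows: list of (features, cod). Returns {cod: binary dataset D_i}."""
    cods = sorted({cod for _, cod in rows})
    datasets = {}
    for target in cods:
        datasets[target] = [(f, 1 if cod == target else 0) for f, cod in rows]
    return datasets

# One NBC model would then be trained on each D_i; at test time the
# positive-class probabilities of the M models are ranked per record.
rows = [([1, 0], "A"), ([0, 1], "B"), ([1, 1], "C")]
d = one_against_all(rows)
print(sorted(d))   # ['A', 'B', 'C']
print(d["B"])      # [([1, 0], 0), ([0, 1], 1), ([1, 1], 0)]
```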
Testing OAA-NBC on new surveys. During testing, each Naïve Bayes model predicted a COD for each VA record in the test dataset, resulting in a list of CODs per record. The list of assigned CODs was sorted by the COD probabilities. We made a minor modification to the one-against-all approach; instead of selecting only the COD with the highest probability, we ranked the CODs in descending order of their probabilities for each VA record. We kept the ranked probabilities to generate cumulative performance measures, which are described in detail in the next section.

Assessment methods
A VA algorithm's performance is measured by quantifying the similarity between the algorithm's COD assignments and physician review (or, for PHMRC, clinical diagnosis) assignments. Since the community VA datasets included in this study come from countries that have weak civil and death registration, physician review is the most practical, relatively accurate (and only) option for assessing algorithm performance. Moreover, given that these deaths were unattended, it follows that there is no 'gold standard' for such community VA datasets. Nevertheless, we are confident in the robustness of dual physician review, as initial physician agreement (i.e., where two physicians agreed at the onset of COD coding) was relatively high; e.g., 79% for MDS and 77% for Agincourt.
We measured and compared the individual- and population-level performance of all of the algorithms using the following metrics: sensitivity, partially chance-corrected concordance (PCCC) and cause-specific mortality fraction (CSMF) accuracy. These measures are commonly used in VA studies 12,16,28 and are shown in Equations 3-5. They are helpful in objectively assessing the performance of VA algorithms, as they provide a robust strategy for assessing an algorithm's classification ability on test datasets with widely varying COD distributions 13,28 .

Sensitivity = True positives / (True positives + False negatives)
Equation 3. Sensitivity of classification.

Sensitivity and PCCC are metrics that assess the performance of an algorithm for correctly classifying the CODs at the individual level. Sensitivity measures the proportion of death records that are correctly assigned for each COD 13 . Similarly, PCCC computes how well a VA classification algorithm classifies the CODs at the individual level while also taking chance (the likelihood that a COD was randomly assigned) into consideration 9,12,13,16 .

PCCC = (C/N − k/n) / (1 − k/n)
Equation 4. Partially chance-corrected concordance: C is the number of correctly classified records, k is the number of ranked causes considered, n is the total number of CODs and N is the total number of records.

CSMF accuracy = 1 − (Σ |CSMF(pred) − CSMF(true)|) / (2 × (1 − min CSMF(true)))
Equation 5. Cause-specific mortality fraction (CSMF) accuracy of classification: the sum and the minimum are taken over the n CODs.
In contrast, CSMF accuracy is a measure for assessing how closely the algorithms reproduce the overall COD distribution at the population level 13 . It can be observed from Equation 5 that CSMF accuracy computes the absolute error between the COD distributions predicted by an algorithm (pred) and the observed (true) COD distributions.
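Under their standard definitions in the VA literature, the three metrics can be sketched as small functions. This is our own illustrative sketch (function and variable names are assumptions), not the evaluation code used in the study.

```python
# Sketch of the individual- and population-level metrics in Equations 3-5.
def sensitivity(tp, fn):
    """Equation 3: proportion of records of a COD that are correctly assigned."""
    return tp / (tp + fn)

def pccc(correct, total_records, k, n_causes):
    """Equation 4: partially chance-corrected concordance for top-k assignments."""
    return (correct / total_records - k / n_causes) / (1 - k / n_causes)

def csmf_accuracy(pred_fracs, true_fracs):
    """Equation 5: pred_fracs/true_fracs map COD -> mortality fraction (sum to 1)."""
    cods = set(pred_fracs) | set(true_fracs)
    err = sum(abs(pred_fracs.get(c, 0) - true_fracs.get(c, 0)) for c in cods)
    return 1 - err / (2 * (1 - min(true_fracs.values())))
```

A perfectly reproduced COD distribution gives a CSMF accuracy of 1, and chance-level individual agreement (correct/total equal to k/n) gives a PCCC of 0.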
We measured the cumulative values of sensitivity, PCCC and CSMF accuracy at each rank and for each algorithm; e.g., sensitivity at rank two represented the sensitivity of both rank one and rank two classifications, which facilitated measuring the overall performance of the algorithms for classifications at the top two or more ranks. For example, if the sensitivity was 60% at rank one and 15% at rank two for a method, then the cumulative sensitivity was 75% at rank two. The use of cumulative values for reporting results is common in the applied machine learning literature (e.g., see Murtaza et al. 29 and Wong et al. 30 ). It adds information to the traditional way of reporting results (which considers only rank one) and is useful when there are multiple classes (causes of death). Finally, we also performed a statistical test of significance on the results for all the datasets to ascertain that the differences in results were not due to chance. The choice of statistical test depends on the data distribution and the association between experiments. We used the Wilcoxon signed rank test as we were unsure whether our results were normally distributed. Our null hypothesis was that there was no significant difference between OAA-NBC and another algorithm. This is further discussed in the results section.
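The cumulative rank-k scoring described above can be sketched as follows: a record counts as correct at rank k if the true COD appears anywhere in its top k ranked assignments. The function and data below are our own illustration.

```python
# Sketch: cumulative sensitivity at ranks 1..max_rank.
def cumulative_sensitivity(ranked_preds, truths, max_rank=5):
    """ranked_preds: one ranked COD list per record; truths: true COD per record."""
    out = {}
    for k in range(1, max_rank + 1):
        hits = sum(1 for preds, t in zip(ranked_preds, truths) if t in preds[:k])
        out[k] = hits / len(truths)
    return out

# Hypothetical example: three records, true cause always "A".
preds = [["A", "B", "C"], ["B", "A", "C"], ["C", "B", "A"]]
truth = ["A", "A", "A"]
result = cumulative_sensitivity(preds, truth, max_rank=3)
# rank 1 -> 1/3, rank 2 -> 2/3, rank 3 -> 1.0
```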

Experimental setup
In order to compare the performance of OAA-NBC, InterVA-4 8 , Tariff 7 , InSilicoVA 9 and NBC 12 , we followed a seven-step procedure. In Step one, we partitioned each VA dataset using a commonly used evaluation criterion in data mining: 10-fold cross validation 20 . In 10-fold cross validation, a dataset is divided into 10 parts. Each part was created using a stratified sampling method, i.e., each part contained the same proportion of standardized CODs as the original dataset. In Step two, we selected one part for testing and nine parts for training from each VA dataset. In Step three, we trained OAA-NBC, InterVA-4, Tariff, InSilicoVA and NBC on the designated training data subsets from each partitioned VA dataset. In Step four, we generated ranked classifications for each algorithm on the test part per VA dataset. In Step five, we calculated the cumulative sensitivity, PCCC and CSMF accuracy for each rank per VA dataset. In Step six, we repeated the process from Step two to Step five 10 times, with a different part used for testing in each turn, for each VA dataset. In Step seven, we computed the mean sensitivity, PCCC and CSMF accuracy for each rank per VA dataset and algorithm.
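Step one's stratified partitioning can be sketched as below. The study used Weka's stratified sampling API; this pure-Python equivalent is an illustrative sketch under that assumption, dealing each COD's records round-robin so every fold keeps roughly the original COD proportions.

```python
import random
from collections import defaultdict

# Sketch: stratified k-fold partitioning (illustrative, not the Weka API).
def stratified_folds(rows, k=10, seed=0):
    """rows: list of (features, cod). Returns k folds with a similar COD mix."""
    by_cod = defaultdict(list)
    for row in rows:
        by_cod[row[1]].append(row)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cod_rows in by_cod.values():
        rng.shuffle(cod_rows)
        for i, row in enumerate(cod_rows):
            folds[i % k].append(row)   # deal this COD's records round-robin
    return folds

# Each fold in turn serves as the test set; the other k-1 form the training set.
```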
We implemented OAA-NBC in Java with the Weka API 20 . Weka provides APIs for the one-against-all approach and the Naïve Bayes Multinomial classifier 20 . We used the OpenVA package version 1.0.2 in R to run the InterVA-4, Tariff, InSilicoVA and NBC algorithms. The data were also transformed into the InterVA-4 input format ('Y' for 1 and empty for 0 values). It is important to note that the Tariff version provided in the OpenVA package is computationally different from IHME's SmartVA-Analyze application tool. We used the custom training option for InterVA-4 and InSilicoVA as provided in the OpenVA package in R. In custom training, the names of symptoms do not need to be in the WHO standardised format, and the rankings of the conditional probability P(symptom|cause) are determined by matching the same quantile distributions as in the default InterVA P(symptom|cause). The reason for choosing customized training instead of pre-trained global models is that different datasets have different proportions of symptoms and causes of death, and custom training allows the algorithms to generate models customized for the dataset. It also allows for a fair evaluation across algorithms, especially for those that only work by using customized training on datasets and acquire more knowledge of the dataset during testing.
We performed data partitioning, as discussed in Step one, using Java and Weka's 20 stratified sampling API. Each algorithm was executed on the partitioned data. We used a separate Java program to compute the cumulative measures of sensitivity, PCCC and CSMF accuracy on the COD assignments of each algorithm for each VA dataset. This ensured that our evaluation measures were calculated in the exact same manner. Our source code for all the experiments is available on GitHub and is archived at Zenodo 31 .

Ranked CSMF accuracy comparison
Figure 3 shows the mean CSMF accuracy values by algorithm across all VA datasets, from rank one (most likely) cause assignments to the fifth most likely cause assignments (rank five). Note that the fifth rank shows the cumulative CSMF accuracy values from rank one to rank five, as described earlier. OAA-NBC produced the highest CSMF accuracy values for most of the VA datasets, ranging from 86% to 90% for rank one; it came second or identical to NBC for the PHMRC child datasets (global and India). Furthermore, the CSMF accuracy values for OAA-NBC were relatively consistent across the VA datasets, whereas some of the other algorithms, such as Tariff, InterVA-4 and InSilicoVA, varied considerably. As expected, the cumulative CSMF accuracy values increased the overall CSMF accuracy for each algorithm when the top five ranked classifications were included for every VA dataset.

Ranked sensitivity comparison
Individual-level cumulative sensitivity results for classification ranks one and five are shown in Table 3; cumulative PCCC values are not shown, as they were very close to the cumulative sensitivity values. It can be observed from Table 3 that OAA-NBC achieved the highest sensitivity values for the first ranked (most likely) COD assignments compared to the other algorithms, ranging from 53% to 63%. When considering all top five ranked classifications, OAA-NBC improved sensitivity values by 31-38%, with cumulative values ranging from 91% to 95%. For Tariff, InterVA-4 and InSilicoVA, the sensitivity values were substantially lower (by 10-40%) in comparison to OAA-NBC; NBC did not differ substantially from OAA-NBC, with differences of only 3-7%. These results show that OAA-NBC consistently yields closer agreement with the physician review or clinical diagnoses at the individual level than the other algorithms on most of the VA datasets.
We also performed a Wilcoxon signed rank statistical test on the sensitivity values reported in Table 3, generated from the five algorithms (we also included the rank two to rank four values, which are not shown in the table to save space but are present in Supplementary file 1). For 35 observations (five ranks and seven datasets), the Wilcoxon signed rank test yielded a Z-score of 5.194 and a two-tailed p-value of 2.47 × 10^-7 between OAA-NBC and NBC. It yielded the same Z-scores and two-tailed p-values against InSilicoVA, InterVA-4 and Tariff. Thus, there were statistically significant differences between the sensitivity values generated by OAA-NBC and the four other algorithms (p < 0.05). Similarly, we conducted the Wilcoxon signed rank test on 35 observations of CSMF accuracy values for the five algorithms. The Z-score was 4.248 and the p-value 2.15 × 10^-5 between OAA-NBC and NBC. The same Z-score and p-value were also obtained for the tests between OAA-NBC and the other algorithms: InSilicoVA, InterVA-4 and Tariff. We found statistically significant differences between OAA-NBC and the other algorithms in all the comparisons.
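The paper does not say which statistical package was used, so the sketch below is a generic normal-approximation Wilcoxon signed rank test (the approximation that yields a Z-score, as reported above). It drops zero differences and averages tied ranks, but omits further tie corrections.

```python
import math

# Illustrative normal-approximation Wilcoxon signed rank Z-score.
def wilcoxon_z(xs, ys):
    """Paired samples xs, ys; returns the Z statistic of W+ (positive-rank sum)."""
    diffs = [x - y for x, y in zip(xs, ys) if x != y]   # drop zero differences
    n = len(diffs)
    ranked = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:                                        # average tied |diff| ranks
        j = i
        while j + 1 < n and abs(diffs[ranked[j + 1]]) == abs(diffs[ranked[i]]):
            j += 1
        avg = (i + j) / 2 + 1
        for idx in ranked[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (w_plus - mean) / sd
```

For 10 pairs where the first sample is always larger, the Z-score is about 2.8, i.e., significant at the 5% level, matching the direction of the comparisons reported above.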
Thus, the use of the one-against-all approach with NBC (OAA-NBC) improved the performance of COD classification for VA records and yielded better COD assignments at the population- and individual-levels, differences that were statistically significant and not attributable to chance when compared to the four other algorithms. This also conforms with the machine learning literature, in which the one-against-all approach has improved the performance of classification algorithms when there were more than two classes (CODs) 21 . However, this does not mean that OAA-NBC requires no improvement, as the overall sensitivity for the top ranked CODs per VA record was still lower than 80%. We also made an additional assessment of sensitivity by COD (for the COD groups listed in Table 2). In general, the algorithms' performances varied across CODs for certain conditions in the VA datasets. For example, classifications were equal to or under 10% across all algorithms for HIV/AIDS, cancers, cardiovascular disease and chronic respiratory diseases in the MDS dataset. Algorithms like OAA-NBC and NBC mostly had better sensitivity for COD groups that had a higher proportion of records in the training dataset. However, this was not always the case, and better sensitivity values also depended on how distinguishable the VA records of a COD group were from all other COD groups. In the next section, we discuss the problem and effects of imbalance within datasets on the algorithms' classification accuracy.

Discussion
Our approach (OAA-NBC) produces better population- and individual-level agreement (sensitivity) across different VA surveys compared to the other algorithms. However, the overall sensitivity values are still in the range of 55-61% and do not exceed 80% for the top ranked COD assignments. There are several reasons for the low sensitivity values. Firstly, each VA dataset is unique, with varying amounts of overlapping or differing symptoms. In this respect, the symptom-cause information (SCI) is unique to each VA dataset, and so some of the algorithms could have had more trouble generating adequate SCIs due to the logic employed by the algorithm itself and the VA data. This could help explain the low sensitivity scores by cause and per algorithm for the MDS data, which is one of the VA datasets with the fewest symptoms, and which could have impacted the SCI used for COD assignment by the algorithms. Conversely, some algorithms like InterVA-4 (when the format is specified as following the WHO 2012 or 2014 VA instrument) require a set of predefined symptoms, or else prefer independent symptoms (e.g., 'had a fever') over dependent symptoms (e.g., 'fever lasted for a week') or interdependent symptoms (e.g., 'did s/he have diarrhoea and dysentery'); the absence of such symptoms would also impact the algorithms' ability to classify VA records correctly. A solution to this problem is to have better differentiating symptoms for each COD.
One may argue that algorithms such as InterVA-4 and InSilicoVA (non-training option), which use a different input, namely a symptom list based on WHO's forms, for assigning CODs and do not need training on data, would be unfairly evaluated by using customized training. We therefore converted the symptoms in our datasets to WHO standardised names and evaluated InterVA-4 and InSilicoVA on the datasets. We used the same 10-fold cross validation method as in our earlier experiments, but only provided the test set of each fold to the algorithms for assigning causes of death based on standardised symptom names. The output of these algorithms was one of the 63 standardised CODs. We mapped these 63 causes to our 17 CODs for a fair evaluation (see Table 6 for complete details of the mapping to the 17 COD categories). We observed that rank one sensitivity for InterVA-4 remained between approximately 25% and 42%, and for InSilicoVA between 20% and 43%, on all datasets. The use of pre-trained models on standardized VA data inputs did not yield better results than customized training on the datasets.
One may also argue for the use of more recent algorithm versions, such as InterVA-5, for the assessments. Because the VA data used were captured prior to the release of the WHO 2016 forms, the resultant binary files would have many missing symptoms. Furthermore, InterVA-5 was only recently released for public use, specifically in September 2018. Although an enhanced algorithm may perform more effectively due to the logic employed, the VA data are also very relevant to performance. Since the VA data used in this study conformed better with the 2014 forms, we ran experiments using algorithms that were designed for the WHO 2014 VA forms or did not require a specific input, for a fair comparison.
VA datasets also differ in COD composition counts; some CODs in the VA datasets have a large number of records, while other CODs have fewer. The composition of these CODs is highly imbalanced, which can make any algorithm biased towards the CODs with a higher proportion of records in the training set. This implies that the overall agreement would most likely remain low for the algorithms in such cases. COD balancing can be performed by duplicating records for the minority CODs (those with the fewest records) or by reducing the number of records for the majority CODs (those with the most records) 20 . However, these types of artificial balancing approaches do not always yield improvements in results.
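The minority-COD duplication just described can be sketched as random oversampling up to the size of the largest class. This is an illustrative sketch of the generic technique, not code from the study; as noted above, such artificial balancing does not always improve results.

```python
import random
from collections import Counter, defaultdict

# Sketch: balance CODs by duplicating records of the minority classes.
def oversample(rows, seed=0):
    """rows: list of (features, cod). Duplicates rare-COD records until
    every COD has as many records as the largest COD."""
    rng = random.Random(seed)
    by_cod = defaultdict(list)
    for row in rows:
        by_cod[row[1]].append(row)
    target = max(len(v) for v in by_cod.values())
    balanced = []
    for cod_rows in by_cod.values():
        balanced.extend(cod_rows)
        balanced.extend(rng.choice(cod_rows) for _ in range(target - len(cod_rows)))
    return balanced

rows = [([0], "A")] * 8 + [([1], "B")] * 2
counts = Counter(cod for _, cod in oversample(rows))
# counts is {'A': 8, 'B': 8}: the minority COD "B" was duplicated six times.
```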
A point for discussion relates to the distribution of CODs in training and test datasets. In machine learning, records of classes (e.g., CODs) are kept in the same proportion in the training and test sets as in the original dataset when performing experiments 20 . This allows for a fair evaluation of the algorithm; otherwise, too many VA records of a COD in the test set and too few in the training set would only result in poor performance of the algorithm for that COD. In real-life situations, when a machine learning application is in production, we may not capture all of the variations in the training (historical) set and may see more variations of a particular COD in newly collected data. The common solution to this problem is to update the training data and re-train the algorithm to reflect newer SCI variations as they are observed 20 . Nonetheless, to understand the effect of differing COD variations in the training and test sets, we performed another experiment using the Dirichlet distribution, which allowed us to vary the composition of records in the test set 32 . Dirichlet distribution-based sampling models variability in the occurrences of classes (CODs) by applying resampling with replacement. We divided the dataset into 10 parts using the 10-fold cross validation method 20 , as in our experiments above.

Table 6. Complete mapping of ICD-10 (International Classification of Diseases, 10th revision) and WHO (World Health Organization) cause labels to the cause list used for performance assessments.
On each fold, we resampled the test set with replacement using the Dirichlet distribution 32 , resulting in a different number of records for each COD. OAA-NBC, InterVA-4, Tariff, InSilicoVA and NBC were then evaluated on the resampled test sets with different distributions of CODs. The results for the Matlab and MDS datasets are shown in Table 5. The overall classification performance decreased, as expected, because CODs with too few VA records in the actual training set were duplicated many times by the Dirichlet distribution in the new test set. For example, if a record of a COD was not classified correctly by an algorithm and was repeated many times in the test set, then the sensitivity for that COD would decrease. OAA-NBC and NBC still yielded better performance than all the other algorithms. We show results for these two datasets only, as the other VA datasets showed a similar dip in performance. An ideal training dataset would be a large repository of clinically verified community VA deaths with enough variation in symptom patterns for each COD; however, no such repository exists. The whole purpose of training on VA datasets is to be able to help classify CODs in situations where deaths occur unattended.
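The Dirichlet-based resampling just described can be sketched as follows: draw a new COD distribution from a flat Dirichlet (via normalized Gamma draws), then resample the test records with replacement to match it. This is our own illustrative sketch of the technique; the concentration parameters and function names are assumptions.

```python
import random

# Sketch: resample a test set with replacement under a Dirichlet-drawn
# COD distribution, so class proportions differ from the original split.
def dirichlet_resample(test_rows, seed=0):
    """test_rows: list of (features, cod). Returns a same-sized resampled set."""
    rng = random.Random(seed)
    cods = sorted({cod for _, cod in test_rows})
    # Dirichlet(1, ..., 1) sample via normalized Gamma(1, 1) draws
    gammas = [rng.gammavariate(1.0, 1.0) for _ in cods]
    total = sum(gammas)
    probs = [g / total for g in gammas]
    by_cod = {c: [r for r in test_rows if r[1] == c] for c in cods}
    resampled = []
    for _ in range(len(test_rows)):
        c = rng.choices(cods, weights=probs)[0]   # pick a COD by new proportion
        resampled.append(rng.choice(by_cod[c]))   # draw one of its records
    return resampled
```

Because sampling is with replacement, a rare COD can end up heavily duplicated in the resampled test set, which is exactly the stress condition discussed above.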
Finally, the performance of machine learning algorithms depends on the logic employed by the algorithm and on the VA data, in terms of generating an adequate SCI for COD classification that discriminates between the different classes (CODs). To mitigate the effects of using one set of training data on all VA data, we trained the algorithms on data derived from their origin dataset using the 10-fold cross validation method. By doing so, only the SCIs generated from each separate VA dataset were considered when the algorithms were classifying deaths for that VA dataset. For the most part, the algorithms performed consistently, with OAA-NBC performing better the majority of the time. Our results are reproducible; all of the scripts used and sample datasets are publicly available (see the Experimental setup section).

Conclusion
In this study, we have enhanced the NBC algorithm using the one-against-all approach to assign CODs to records in multiple VA datasets from different settings. The results show that our approach has 6-8% better sensitivity and PCCC for individual-level COD classification than some of the current best performing computer-coded VA algorithms (i.e., Tariff, InterVA-4, NBC and InSilicoVA). Population-level agreements for OAA-NBC and NBC are found to be similar to or higher than those of the other algorithms used in the experiments. Overall, the results show that OAA-NBC classification results are the most similar to dual physician VA assignments and clinical diagnostic COD assignments when compared against some of the leading algorithms, based on the cumulative sensitivity, PCCC and CSMF accuracy scores. The performance results are not due to chance, as indicated by the Wilcoxon signed rank test.
Thus, we conclude that using the one-against-all approach with NBC helps improve the accuracy of COD classification. The one-against-all approach (and other ensemble methods of machine learning) can also be used with other VA algorithms instead of just Naïve Bayes. Although OAA-NBC generates the highest cumulative CSMF accuracy values, it still requires improvements to produce the most accurate COD classifications, especially for individual-level classification, which is still below 80%. In the future, we plan to extend this work to include the narratives present in the VA surveys for automated classification. Another endeavour would be to apply the one-against-all approach to the other algorithms to determine whether they can be improved further to classify community VA deaths more similarly to dual physician review. We also plan to explore different feature selection techniques and prediction weighting methods (e.g., using the CSMF distribution) for each individual NBC in the OAA-NBC approach.

Data availability
Some of the data used in the analysis have already been made available, specifically the PHMRC data, which can be found at: http://ghdx.healthdata.org/record/population-health-metrics-research-consortium-gold-standard-verbal-autopsy-data-2005-2011. License: MIT License.

Grant information
This work was supported by the Bill and Melinda Gates Foundation (OPP51447).
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
will be different. When the final cause is determined, it seems that these individual probabilities should be weighted rather than simply taking the max of all. The weights can be chosen to be the values that optimize the overall cause-specific mortality fraction (CSMF) distribution.
For each NBC, it seems that some feature selections can be done to improve the accuracy of these individual predictions.

Is the rationale for developing the new method (or application) clearly explained? Yes
Is the description of the method technically sound? Yes

Thank you for reviewing this article. Please find below replies to your questions.
Q1. How well does the one-against-all method perform when the number of disease categories increases? Will the uncertainty go up significantly?
REPLY 1: We are not sure whether the reviewers mean CODs or symptoms (features) by disease categories. We will answer from both perspectives.
Number of CODs increases: It will depend on the dataset. If we add a new COD with few (e.g., 10) records in the dataset, then the accuracy of classification will decrease slightly (e.g., by approximately 1-2%). This is because a small proportion of records is not helpful in classification when some other CODs have a very high proportion of records (e.g., a few thousand records). Also, if the newer COD shares most of its symptoms with another COD, then there are insufficient discriminating factors between the records of those CODs, and the classification accuracy for the new category would remain low in this case too. If the number of records for the newer category is sufficient in proportion (e.g., at least 50-100) and there is sufficient discrimination in terms of symptoms, then machine learning approaches, such as the one-against-all method with the Naïve Bayes algorithm, will be able to classify records with good accuracy. In the case of current VA datasets, the accuracy of classification can be improved by increasing the proportion of records for disease categories that have very few records compared to others, and also by introducing better discriminating symptoms. The newer symptoms can also be synthetic, generated using other machine learning approaches (see Reply 3).
Number of symptoms (features) increases: If the number of symptoms increases and the new symptoms increase the discriminating power between CODs, then accuracy will improve; otherwise, the increase in symptoms will either not affect accuracy or will decrease it.
REPLY 2: Each NBC generates a probability for its COD, and the final list of predicted CODs from all NBCs is generated by sorting them in descending order by their probabilities. However, not all NBCs predict a COD with a probability; some NBCs instead predict the cause "Others" - recall that each NBC has two causes to predict: its COD and "Others". When the "Others" cause is predicted, it means that the NBC is predicting that the COD it knows is not the real cause, and we can simply ignore the "Others" prediction. In this way, for 15 NBCs there are different numbers of predicted CODs in the final list depending on the VA record. It is a good suggestion to weight the predictions of CODs and then sort the predicted CODs by their final weighted probabilities. The CSMF distribution is highly imbalanced across CODs in the VA datasets, so assigning weights proportional to the CSMF distribution would increase the chances of predicting the majority CODs; but these are already predicted accurately because of their large number of records, so this could eventually decrease accuracy. In our view, a better way would be to use weights inversely proportional to the CSMF distribution, because that would give a better chance to those CODs which have fewer records and which are not correctly predicted by the individual NBCs. This is a very good direction of research; we would like to explore it further in our future work, and we have added it to the future work section of our paper.
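The assembly step described in this reply can be sketched as follows. This is an illustrative reconstruction, not the paper's actual code: the function and variable names are ours, and we assume each binary NBC reports a probability for its own COD and for "Others".

```python
def assemble_ranking(per_nbc_outputs):
    """Sketch of OAA-NBC output assembly: per_nbc_outputs maps each
    COD to the (p_cod, p_others) pair produced by that COD's binary NBC.
    NBCs that predict "Others" are ignored; the rest are sorted by
    probability, descending, to give the final ranked COD list."""
    predicted = [
        (cod, p_cod)
        for cod, (p_cod, p_others) in per_nbc_outputs.items()
        if p_cod > p_others  # keep only NBCs that did NOT predict "Others"
    ]
    return sorted(predicted, key=lambda kv: kv[1], reverse=True)

# Three of the (hypothetical) binary classifiers; the "Stroke" NBC
# predicts "Others", so stroke is absent from the final list.
outputs = {
    "Pneumonia": (0.81, 0.19),
    "Stroke":    (0.30, 0.70),
    "TB":        (0.55, 0.45),
}
print(assemble_ranking(outputs))  # [('Pneumonia', 0.81), ('TB', 0.55)]
```

Note that, as the reply states, the length of the ranked list varies per record: it equals the number of NBCs that did not predict "Others".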
Q3. For each NBC, it seems that some feature selection can be done to improve the accuracy of these individual predictions.
REPLY 3: This is correct; better discriminating symptoms (features) can improve the accuracy of prediction for each COD (i.e., each NBC). Feature selection can be done subjectively using expert judgement, or by using feature selection algorithms from machine learning. Accuracy could also be improved by introducing additional features, which could be synthetic; e.g., a feature X can be transformed into a new feature X + c, where c is a constant, or by taking a power such as X^(1/2), and by using similar techniques. This could generate a new feature space that helps to better classify the CODs. There are many feature selection and feature transformation methods, and another set of exploratory experiments would be required to determine which ones actually improve the accuracy of COD classification. This is a good direction for future work and we have added it to the future work section of our paper.
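As a minimal illustration of the synthetic-feature idea in this reply: from one non-negative numeric feature, derive a constant shift and a fractional power. The function name and the value of the constant c are hypothetical, chosen only for this sketch.

```python
import math

def expand_feature(x):
    """Illustrative synthetic-feature expansion: from a non-negative
    numeric feature x, derive x + c (a constant shift) and x**(1/2)
    (a fractional power), enlarging the feature space."""
    c = 1.0  # illustrative constant, not taken from the paper
    return [x, x + c, math.sqrt(x)]

print(expand_feature(4.0))  # [4.0, 5.0, 2.0]
```

Whether any such transformed feature actually improves COD discrimination would have to be checked experimentally, as the reply notes.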
No competing interests were disclosed.

Introduction
This provides a good overview of the current state of automated VA classification (though describing King-Lu as a 'current leading' method seems a bit of a stretch). The justification for the development of this method could be fleshed out a little more, perhaps explaining (for those unfamiliar with how VA data feed into policy) why it is important for these methods to be more accurate. To this end, the authors may want to consider citing the 2014 systematic review by Leitao et al. comparing PCVA with CCVA in LMIC and mentioning, even briefly, the large project underway to incorporate VA into CRVS systems (https://crvsgateway.info/A-stepwise-process~503).
The NBC is the model chosen for testing the one-against-all approach; it would be good to include a couple of sentences justifying this choice based on previous literature before the final paragraph of the introduction. In general, we found the use of the term 'CoD diagnosis' (used in the introduction and elsewhere in the manuscript) a little confusing. We would suggest using 'assignment' consistently throughout, to differentiate from 'diagnoses' made by clinicians during life.

Methods
Though it is made reasonably clear in the text that the MDS, Agincourt, and Matlab CoD are based on physician review of VA data, compared with physician review of clinical data for PHMRC CoD, we think that this distinction could (and should) be made more clearly and repeatedly throughout the manuscript, including in Tables 1, 2, and 4. As the authors are no doubt aware, the use of PCVA CoD as a gold standard is not ideal, constituting, to some extent, a 'circular' comparison, as both methods ultimately rely on the quality of the VA data. We feel that the authors could make greater efforts to make clear (to the non-expert reader) this key difference between the different datasets. (Note the justification for the use of PCVA CoD as gold standard [page 6, under 'assessment methods'] does not really address this issue; a high level of agreement between physicians reading the same VA data does not have any bearing on the objective 'truth' of their assignments.) A minor point: the use of 'historical' and 'new' VA surveys in Figure 1 is potentially misleading, as it suggests that new data were collected and used to test the algorithm/s. Would 'train' and 'test' be more appropriate? Figure 2 is a helpful representation of the OAA approach. It may be useful to combine Figures 1 and 2, showing in one place the workings of the method, how it fits into the process and, perhaps, showing in more detail how outputs from the multiple models are 're-assembled' to give one list of causes and probabilities that can then be interpreted or compared with the outputs from other..........
REPLY: Table 4 already had a footnote; however, based on your suggestion we clarified it further, and we added footnotes for Table 1 and Table 2.
REPLY 6: This is a good suggestion, but we would like to keep the figures separate, as combining them would clutter one figure, and the abstraction makes it easier to understand.
However, we made some changes in Figure 1 which capture your suggestion. The output of each model is assembled in a simple manner: COD predictions/assignments from all models are simply sorted by their probability of prediction/assignment.
REPLY 7: CSMF accuracy is the most widely used measure in VA assessment studies, and this is the primary reason for choosing it in our study too. Chance-corrected CSMF (CCCSMF) accuracy could have been used, but it would not have changed the overall results other than reducing the CSMF accuracy values for each algorithm/method. This can be understood from the equation presented by Flaxman et al. for chance-correcting previous results: CCCSMF = (CSMF - mean random allocation) / (1 - mean random allocation). The mean random allocation value in this equation is measured, for a given dataset, by performing random predictions using the Dirichlet distribution many times and taking their mean. This is a constant for a dataset, so it would only reduce each CSMF accuracy value at a constant rate.
Furthermore, we have shown results separately using the Dirichlet distribution for different datasets and methods. We have also shown results for individual causes alongside individual sensitivity measures. All these different perspectives mitigate any doubts about incorrectly reported performances of the methods in our study.
On another note, the Dirichlet distribution method only duplicates or removes VA records in a training or test dataset, which in practice only reduces the performance of the methods. A more appropriate approach would be to have a training set containing all the variations of a cause of death that are expected to be observed in the field.
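The chance-correction equation from Flaxman et al. quoted in Reply 7 can be written as a small helper. The constant 0.632 used in the example line is illustrative only; in practice the mean random allocation must be estimated per dataset from many random Dirichlet-drawn cause assignments.

```python
def cccsmf_accuracy(csmf_accuracy, mean_random_allocation):
    """Chance-corrected CSMF accuracy (Flaxman et al.):
    CCCSMF = (CSMF - mean random allocation) / (1 - mean random allocation).
    mean_random_allocation is a per-dataset constant obtained by scoring
    many random (Dirichlet-drawn) cause assignments and averaging."""
    return (csmf_accuracy - mean_random_allocation) / (1.0 - mean_random_allocation)

# Because the correction is an affine map with a per-dataset constant,
# it rescales every method's score but preserves their ordering.
print(round(cccsmf_accuracy(0.70, 0.632), 3))  # 0.185
```

This makes concrete the point in the reply: applying the correction shifts all CSMF accuracy values by the same dataset-specific transformation, leaving the comparison between methods unchanged.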
Q8. Referring to CSMF accuracy as "agreement" is potentially confusing -we would suggest using the full term or "CSMFa" throughout.
REPLY 8: We have changed agreement to CSMF accuracy throughout the entire paper as suggested.
Q9. The description of the calculation of cumulative sensitivity is a little confusing. ............ provide some justification for the choice of reporting cumulative sensitivity to rank 5.
REPLY 9: Reporting cumulative results is a popular approach in the applied machine learning and software engineering literature (see for example [1][2][3]). The random probability of predicting a cause of a problem is 1/N, where N is the number of causes. When the data are not big, not separable, and have many causes, the first-rank predictions of any algorithm will not come close to the 100% mark. It is then useful to know how an algorithm fares on its top few predicted causes (e.g., top 3 ranks, top 5 ranks, etc.): an accuracy of 90% on the top 4 causes means that reviewing only those 4 causes (a 1/4, or 25%, random chance per cause) yields 90% accurate sensitivity. This is better than reviewing all N causes (approximately 15 in our datasets), where a random guess has only a 6.6% probability of success.
Yes, if an algorithm has 15% sensitivity at rank 1 and 20% sensitivity at rank 2, then the cumulative sensitivity at rank 2 would be 35%. Cumulative sensitivity at rank N is the sum of the sensitivity values from rank 1 to rank N. Consider a method A with sensitivity values of 30% and 20% for the top two ranks, and a method B with sensitivity values of 20% and 30% for the top two ranks: the cumulative sensitivity at rank 2 for both methods would be 50%. However, this was not the case in our experiments; OAA-NBC consistently yielded better results at all ranks (from 1 to 5 and beyond). The choice of the top 5 ranked predictions is subjective; it could equally have been the top 4 or top 3.
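The rank-wise accumulation described in this reply is simply a running sum. A minimal sketch, with sensitivity values in percent and the example figures taken from the reply:

```python
from itertools import accumulate

def cumulative_sensitivity(per_rank):
    """Cumulative sensitivity at rank k is the sum of the per-rank
    sensitivity values from rank 1 to rank k (a running sum)."""
    return list(accumulate(per_rank))

# 15% at rank 1 and 20% at rank 2 gives 35% cumulative at rank 2.
print(cumulative_sensitivity([15, 20]))  # [15, 35]

# Methods A and B from the reply tie on the cumulative value at rank 2,
# even though they differ at rank 1 -- which is why the rank-1 value is
# still reported alongside the cumulative curve.
method_a = cumulative_sensitivity([30, 20])  # [30, 50]
method_b = cumulative_sensitivity([20, 30])  # [20, 50]
assert method_a[1] == method_b[1] == 50
```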
The concept of cumulative reporting is straightforward; it does not affect the traditional method of reporting results (which concerns only the first rank) and only adds information to the existing way of reporting. This should not be a source of concern for evaluation. We have modified the text in the last paragraph of the Assessment Methods section (in Methods) to make the explanation clearer.

Q10. A minor point: the description of computing the 'average' sensitivity, PCCC, and agreement (page 7, column 2, end of paragraph 1) is a little vague - please consider using 'mean' or another appropriate technical term.
REPLY 10: We have made the modification.

Q11. It is not clear from the text in paragraph 1 of the results that agreements..........to signpost more clearly what the numbers represent.
REPLY 11: We have made modifications everywhere in the text to further articulate that the fifth rank represents the cumulative value from rank 1 to rank 5, as per your suggestion. "We also performed a Wilcoxon signed rank statistical test on the reported sensitivity in Table 3, generated from the five…………….. the Wilcoxon signed rank test yielded Z-score = 5.194 and two-tailed p-value = 2.47 x 10 between OAA-NBC and NBC." For the Wilcoxon test on CSMF accuracy, we found Z-score = 4.248 and p-value = 2.15 x 10 between OAA-NBC and NBC. Exactly the same values were also obtained for tests of OAA-NBC against the other algorithms in a pairwise manner.
The p-values are extremely small in all the comparisons of OAA-NBC against the other algorithms for both sensitivity and CSMF accuracy. Since the values are the same (for CSMF accuracy and for sensitivity; see the Results section), it is not worth showing this many similar p-values, especially now that all the data are present in Appendix A and it is trivial to determine the p-values.
We did not perform the test for PCCC, as those values are similar to the sensitivity values and would not add any additional information. Finally, we have made modifications in the text to show exact p-values for the CSMF accuracy values of OAA-NBC vs NBC too.

Q15. A minor point: when discussing the Wilcoxon signed rank......... would this be left out?
REPLY 15: Thank you for pointing this out. It was a typo, and we have changed it to "rank two to rank four" in the text. All rank 1 to rank 5 values were used.
..........studies have used this method for the evaluation of algorithms, so for consistent comparison with the literature we have also performed experiments using the Dirichlet distribution. We have also added details of the results based on the Dirichlet distribution in Appendix A. Similarly, we would like to keep the pre-trained models separate too, as all the other algorithms have customized training. Pre-trained models actually generate poor results, and it is not fair to compare them with the customized models in the Results section.
REPLY 17: Thank you for noting this; it was a typo, and we have fixed the error in Table 6.

Q18. It would be pertinent to acknowledge, again, ................. describing "dual physician and clinical diagnostic COD…" would more clearly make the assignment based on VA data distinction.
REPLY 18: Thank you for pointing this out, we have modified the text as suggested.
Q19. It would be useful to include a paragraph comparing the results .......... If not, what are the differences in the exercises undertaken?
REPLY 19: In terms of the paper on the Tariff algorithm pointed out by the reviewers, our results show that for PHMRC adult and child, PCCC values remain around 30% for the first rank (see Appendix A), while James et al. reported values in the range of 22-40% (for the first rank only). Similarly, mean CSMF accuracy values in our case remain close to 70%, and their median CSMF values also remain close to 70%. The main difference, however, is that they partitioned the PHMRC data based on health care experience, whereas we used all PHMRC data and a partition of PHMRC based on Indian origin. It is not possible to compare the results exactly due to the different partitions.
We have added complete details of the results in Appendix A, and the comparison with any paper should now be transparent. Due to the many differences in the setup of the experiments (as noted above), it is not possible to summarize the similarities and differences with all past studies in one paragraph. This is mitigated by the fact that we have executed all the algorithms and shown all the results in a transparent manner; thus, individual comparisons with past studies would not add any value in terms of comparison of results.
Q20. The reporting of cumulative sensitivity as the main measure of agreement is an unusual aspect of this study; acknowledgment of this as a potential limitation would help provide context for comparisons with other similar studies.
REPLY 20: We disagree with the reviewers on this comment. There seems to be some confusion around this concept with the reviewers, and we have added clarification in the text as per their earlier comment. We have not introduced any new way of measuring the performance of algorithms; cumulative frequencies, cumulative distributions, etc. are common concepts in statistics, and cumulative reporting is also common in the applied machine learning literature (see above). In the case of top-rank prediction (rank 1), the results are the same as the traditional method of reporting sensitivity, PCCC, CSMF accuracy or any other measure. For the next most likely predictions, i.e., rank 2 and onwards, the cumulative values simply show the sum of the previous values. It is a very simple concept; it only provides additional information and does not hide or conceal any results. This is actually a strength of the information in the paper, not a weakness, because earlier researchers have not shown such information. We believe that the public health community will only benefit more