Real-Time Classification of Causes of Death Using AI: Sensitivity Analysis

Background: In 2021, the European Union reported >270,000 excess deaths, including >16,000 in Portugal. The Portuguese Directorate-General of Health developed a deep neural network, AUTOCOD, which determines the primary causes of death by analyzing the free text of physicians’ death certificates (DCs). Although AUTOCOD’s performance has been established, it remains unclear whether this performance is consistent over time, particularly during periods of excess mortality.
Objective: This study aims to assess the sensitivity and other performance metrics of AUTOCOD in classifying underlying causes of death compared with manual coding to identify specific causes of death during periods of excess mortality.
Methods: We included all DCs between 2016 and 2019. AUTOCOD’s performance was evaluated by calculating various performance metrics, such as sensitivity, specificity, positive predictive value (PPV), and F1-score, using a confusion matrix. This compared International Statistical Classification of Diseases and Health-Related Problems, 10th Revision (ICD-10), classifications of DCs by AUTOCOD with those by human coders at the Directorate-General of Health (gold standard). Subsequently, we compared periods without excess mortality with periods of excess, severe, and extreme excess mortality. We defined excess mortality as 2 consecutive days with a Z score above the 95% baseline limit, severe excess mortality as 2 consecutive days with a Z score >4 SDs, and extreme excess mortality as 2 consecutive days with a Z score >6 SDs. Finally, we repeated the analyses for the 3 most common ICD-10 chapters focusing on block-level classification.
Results: We analyzed a large data set comprising 330,098 DCs classified by both human coders and AUTOCOD. AUTOCOD demonstrated high sensitivity (≥0.75) for 10 ICD-10 chapters examined, with values surpassing 0.90 for the more prevalent chapters (chapter II-“Neoplasms,” chapter IX-“Diseases of the circulatory system,” and chapter X-“Diseases of the respiratory system”), accounting for 67.69% (223,459/330,098) of all human-coded causes of death. No substantial differences were observed in these high-sensitivity values when comparing periods without excess mortality with periods of excess, severe, and extreme excess mortality. The same holds for specificity, which exceeded 0.96 for all chapters examined, and for PPV, which surpassed 0.75 in 9 chapters, including the more prevalent ones. When considering block classification within the 3 most common ICD-10 chapters, AUTOCOD maintained a high performance, demonstrating high sensitivity (≥0.75) for 13 ICD-10 blocks, high PPV for 9 blocks, and specificity of >0.98 in all blocks, with no significant differences between periods without excess mortality and those with excess mortality.
Conclusions: Our findings indicate that, during periods of excess and extreme excess mortality, AUTOCOD’s performance remains unaffected by potential text quality degradation because of pressure on health services. Consequently, AUTOCOD can be dependably used for real-time cause-specific mortality surveillance even in extreme excess mortality situations.


Background
In 2021, over 270,000 excess deaths were registered in the European Union, with >16,000 attributable to Portugal [1]. Although most of these excess deaths were possibly related to the COVID-19 pandemic, excess deaths are generally attributable to preventable causes, making a case for the importance of real-time cause-specific mortality surveillance and the subsequent timely and appropriate public health response and suitable health policies in periods of excess mortality [2].
The Portuguese Directorate-General of Health (DGS) is responsible for processing data from the Death Certificate Information System (SICO) and ensuring the epidemiological surveillance of mortality [3]. SICO all-cause mortality data are automatically analyzed and can be publicly accessed [4]. However, the analysis of death certificates (DCs) requires manual coding of the primary causes of death according to the International Statistical Classification of Diseases and Health-Related Problems, 10th Revision (ICD-10) [5]. This manual coding is a resource-intensive task that hinders real-time cause-specific mortality surveillance.
Excess mortality is defined by the World Health Organization as mortality above what would be expected. It allows for assessing the magnitude of a potential public health crisis by checking the additional deaths compared with a reference period and subsequently analyzing their causes in depth [6,7].
Excess mortality can be estimated in several ways. In Portugal, a period of excess mortality is defined as a consecutive period starting with 2 observed numbers of deaths above the baseline's upper 95% confidence limit or with only 1 observed number of deaths above the upper 99% confidence limit of the baseline. The period ends with 2 consecutive values below this limit [8]. This methodology is aligned with the practice of the European mortality monitoring project (EuroMOMO), which allows for the detection and measurement in real time of periods of excess mortality from all causes as a result of threats to public health in Europe [9].
Most excess mortality surveillance systems such as EuroMOMO or national systems are based on all-cause mortality surveillance to ensure real-time surveillance. However, in many countries, information on cause of death is not readily available as it requires a human step to code the basic cause of death, delaying the surveillance and monitoring of cause-specific mortality. For instance, in Portugal, the manual establishment of the primary causes of death for the previous year is completed by March of the following year [10,11].
To overcome this problem, Portugal developed a deep neural network called AUTOCOD [12,13], which allows for presuggesting primary causes of mortality based on historical data of DCs (except for neonatal and perinatal mortality), achieving accuracies of 89% and 81% for ICD-10 chapters and blocks, respectively. AUTOCOD can also analyze data from autopsy reports and clinical bulletins (deaths occurring in health care facilities). Ultimately, the developed algorithm increased the productivity of coders, sped up the issuance of results and information, and ensured near-real-time mortality surveillance [12,13].
To our knowledge, complex artificial intelligence (AI) algorithms that suggest underlying causes of death through free-text analysis of DCs, as AUTOCOD does, are not yet widely disseminated [14].

Objectives
This study aimed to determine the sensitivity and specificity of AUTOCOD for classifying the underlying cause of death compared with manual coding to ascertain the specific causes of death in periods of excess mortality. AUTOCOD has already proven to have high sensitivity, specificity, and accuracy in periods without excess mortality. However, it had not yet been determined whether this performance would be maintained in periods of excess mortality, in which the recording of free text in DCs could change owing to the pressure felt in health services and the need to respond to more requests for DCs. A satisfactory performance by AUTOCOD could pave the way for its implementation as a real-time surveillance tool to monitor cause-specific mortality even during periods in which the national health system experiences severe pressure [14,15].

Study Population
In this study, we included all DCs registered in Portugal's SICO from January 1, 2016, to August 8, 2019. We excluded DCs referring to neonatal, perinatal, and maternal mortality as the AUTOCOD algorithm is not trained for these underlying causes of death [13]. Each DC was classified according to the ICD-10 both manually by human coders at the DGS (gold standard) and automatically by AUTOCOD.

Study Design and Data Sets
The methods behind the construction of the AUTOCOD algorithm have been explained in detail in previous publications. The algorithm was initially trained and tested using a data set different from the one chosen for this study [12,13]. The manual codification of causes of death adheres to the World Health Organization Nomenclature Regulations specified in the ICD-10. In addition, it follows the international ICD-10 rules for selecting the underlying cause of death as the primary cause of death [5].
The DC data set was then linked with 2 dictionaries of the ICD-10 to translate block and chapter codes into text descriptions. The DC data set was also linked to the national surveillance all-cause mortality data set [4], which defines the baseline for expected deaths according to the EuroMOMO methodology [16] and the daily count of observed deaths.

Excess Mortality Definition
Using this data set, we defined the periods in which excess mortality was observed according to the EuroMOMO Z score for excess mortality and the Westgard rules [17] (ie, we considered excess mortality when there were 2 consecutive days with a Z score above the 95% limit of the baseline or just 1 day above the 99% limit). The period of excess mortality ended with 2 consecutive days below the 95% limit of the baseline. A flowchart of the study population inclusion criteria can be found in Figure 1.
We also defined 2 metrics for periods of severe and extreme excess mortality. These were 2 consecutive days with a Z score above the limit of 4 SDs and 6 SDs, respectively. The Westgard functions used to classify the different periods can be found in Multimedia Appendix 1 [17-19].
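The start and end rules described in this section can be sketched in code. The Python function below is only an illustration and is not the Westgard functions from Multimedia Appendix 1, which were written in R. The function name, the default thresholds of 1.96 and 2.58 standing in for the 95% and 99% limits, and the choice to close a period on the last day before the 2 below-limit days are all assumptions made for this example.

```python
from typing import List, Tuple


def excess_periods(z: List[float],
                   start_limit: float = 1.96,   # assumed stand-in for the 95% limit
                   alarm_limit: float = 2.58    # assumed stand-in for the 99% limit
                   ) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) day indices of excess-mortality periods.

    A period starts on the first of 2 consecutive days above start_limit,
    or on a single day above alarm_limit. It is closed on the last day
    before 2 consecutive days at or below start_limit (an assumption; the
    text only says the period "ends with" those 2 days). If the series
    stops before 2 such days are seen, the period runs to the last day.
    """
    periods = []
    n = len(z)
    i = 0
    while i < n:
        two_above = i + 1 < n and z[i] > start_limit and z[i + 1] > start_limit
        one_alarm = z[i] > alarm_limit
        if not (two_above or one_alarm):
            i += 1
            continue
        start = i
        j = i + 1
        # Extend the period until 2 consecutive days are at or below the limit.
        while j < n:
            if z[j] <= start_limit and j + 1 < n and z[j + 1] <= start_limit:
                break
            j += 1
        periods.append((start, j - 1))
        i = j
    return periods


# Hypothetical daily Z scores, for illustration only.
daily_z = [0.5, 2.1, 2.3, 1.0, 2.2, 0.3, 0.1, 3.0, 0.2, 0.1, 0.0]
print(excess_periods(daily_z))
```

Under the same assumptions, calling `excess_periods(daily_z, start_limit=4.0, alarm_limit=float("inf"))` applies the 2-consecutive-day rule alone at 4 SDs, which is one possible reading of the severe excess mortality definition; the end rule for severe and extreme periods is not stated in the text.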

Statistical Analysis
To obtain the multiclass confusion matrix, we used the "confusionMatrix" function of the caret package (version 6.0-90) in RStudio (Posit, PBC) [18,19]. In a multiclass problem such as classifying ICD-10 chapters and blocks, the "confusionMatrix" function shows a set of "one-versus-all" results. For example, in a 3-class problem, the sensitivity of the first class is calculated against all the samples in the second and third classes (and so on). The resulting confusion matrix summarizes the prediction results for a classification problem.
The number of correct and incorrect predictions is summarized with count values and broken down by each class. The confusion matrix shows how a classification model such as AUTOCOD is confused when it makes predictions. These numbers are then organized into a table or matrix. Each row of the matrix corresponds to a predicted class (ie, AUTOCOD). Each matrix column corresponds to an actual class (ie, human coders at the DGS).
The numbers of correct and incorrect classifications are then filled into the table. The total number of correct predictions for a class goes into the diagonal cell, that is, the predicted row and the actual column for that same class value. In the same way, the incorrect predictions go into the off-diagonal cells, in the row of the class that AUTOCOD predicted and the column of the class that the human coders assigned.
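As an illustration of the one-versus-all reading of the confusion matrix, the following Python sketch derives per-class sensitivity, specificity, PPV, and F1-score from paired labels. It is not the caret code used in the study; the function name and the toy labels are hypothetical, and undefined ratios are returned as NaN by assumption.

```python
from collections import Counter
from typing import Dict, List


def one_vs_all_metrics(predicted: List[str],
                       actual: List[str]) -> Dict[str, Dict[str, float]]:
    """Per-class metrics, treating each class as 'this class vs all others'."""
    # Confusion matrix as counts keyed by (predicted row, actual column).
    cells = Counter(zip(predicted, actual))
    classes = sorted(set(predicted) | set(actual))
    total = len(actual)
    nan = float("nan")
    results = {}
    for c in classes:
        tp = cells[(c, c)]                                              # diagonal cell
        fp = sum(v for (p, a), v in cells.items() if p == c and a != c)  # rest of row c
        fn = sum(v for (p, a), v in cells.items() if p != c and a == c)  # rest of column c
        tn = total - tp - fp - fn                                       # everything else
        results[c] = {
            "sensitivity": tp / (tp + fn) if tp + fn else nan,
            "specificity": tn / (tn + fp) if tn + fp else nan,
            "ppv": tp / (tp + fp) if tp + fp else nan,
            "f1": 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else nan,
        }
    return results


# Toy example with 3 chapters (hypothetical labels, not study data).
human = ["II", "II", "II", "IX", "IX", "X"]
autocod = ["II", "II", "IX", "IX", "IX", "II"]
for chapter, m in one_vs_all_metrics(autocod, human).items():
    print(chapter, m)
```

Because every record that belongs to neither the predicted nor the actual class counts as a true negative, specificity tends to be high in this setup whenever a class is a small share of the records.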
Finally, we performed a sensitivity analysis (also using the R package caret) to compare the classification results obtained using the AUTOCOD algorithm (index test) with the classification made by human coders (gold standard) [20]. This allowed us to obtain the number of true positives and false positives as well as additional metrics such as sensitivity (recall), specificity, accuracy, positive predictive value (PPV), and F1-score [13]. This step was performed over time, including a comparison between periods of excess and no excess mortality and between periods of extreme excess mortality and no excess mortality at both the chapter and block classification levels of the ICD-10 [13]. We present this comparison as the difference in absolute values and with the Kullback-Leibler divergence (KLD), which compares the distribution of a metric for a chapter or block during a specific period of excess or extreme mortality with its distribution in periods of no excess mortality. In other words, the KLD measures the difference between 2 probability distributions. We used the kullback_leibler_distance function of the R package philentropy [21].
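For readers unfamiliar with the KLD, a minimal Python sketch of the discrete form, the sum of p × log(p/q), is given here. It is not the philentropy implementation. How each metric was turned into a probability distribution is not described in this section, so the helper that treats a sensitivity value as a 2-outcome distribution (detected vs missed), the function names, the direction of the comparison, and the epsilon guard are all assumptions for illustration.

```python
import math
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float],
                  epsilon: float = 1e-5) -> float:
    """Discrete Kullback-Leibler divergence KL(P || Q) using natural logs."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # the term 0 * log(0 / q) is taken as 0
        qi = qi if qi > 0 else epsilon  # assumed guard against division by zero
        total += pi * math.log(pi / qi)
    return total


def sensitivity_as_distribution(sensitivity: float) -> List[float]:
    """Hypothetical helper: [share detected, share missed] for one class."""
    return [sensitivity, 1.0 - sensitivity]


# Hypothetical values: identical sensitivities give a KLD of 0,
# and the KLD grows as the two periods diverge.
no_excess = sensitivity_as_distribution(0.90)
same = sensitivity_as_distribution(0.90)
lower = sensitivity_as_distribution(0.60)
print(kl_divergence(same, no_excess))
print(kl_divergence(lower, no_excess))
```

A KLD of 0 therefore means that the 2 distributions coincide, which is how the values of 0 reported for the most prevalent chapters and blocks should be read.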
The formulas used for all these performance metrics can be found in Table S1 in Multimedia Appendix 1 [17-19].
To assess the quality of AUTOCOD, we opted to present the weighted average of performance metrics such as sensitivity, precision, and F1-scores by taking the mean of all class performance metrics while considering each class's number of actual occurrences in the data set. The "weight" refers to the proportion of each class's actual occurrences in the data set relative to the sum of all occurrences. The full formula for this calculation of the weighted average is provided in Multimedia Appendix 1 [17-19]. This choice was made as opposed to presenting the macroaverage of performance metrics (ie, macroaverages assign equal importance to each chapter or block, thus calculating the arithmetic mean of performance metrics) [13] as the latter methodology would artificially increase the weight of rare or infrequent cause-of-death chapters and blocks in the average.
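The contrast between the weighted average and the macroaverage can be shown with a short Python sketch. The function names and the 3 classes with their metric values and counts are hypothetical and chosen only to make the effect visible; the exact formula used in the study is the one in Multimedia Appendix 1.

```python
from typing import Dict


def weighted_average(metric: Dict[str, float], support: Dict[str, int]) -> float:
    """Mean of per-class metrics, weighted by each class's actual occurrences."""
    total = sum(support[c] for c in metric)
    return sum(metric[c] * support[c] for c in metric) / total


def macro_average(metric: Dict[str, float]) -> float:
    """Plain arithmetic mean: every class counts equally, however rare."""
    return sum(metric.values()) / len(metric)


# Hypothetical per-class sensitivity and number of human-coded DCs.
sensitivity = {"A": 0.95, "B": 0.90, "C": 0.50}
occurrences = {"A": 800, "B": 150, "C": 50}

print(round(weighted_average(sensitivity, occurrences), 4))  # dominated by class A
print(round(macro_average(sensitivity), 4))                  # pulled down by rare class C
```

In this toy case the rare class C lowers the macroaverage far more than the weighted average, which is the distortion the weighted approach is meant to avoid.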
In the data set, 1 DC was not adequately codified by AUTOCOD, so the ICD-10 classifications of that DC from both AUTOCOD and the DGS were excluded.

Ethical Considerations
The DGS is the national entity responsible for data treatment and data protection of the SICO. The data provided were only for the purposes strictly necessary for this study within the competencies of the DGS. Data were previously anonymized. Patient consent was waived as the data were deidentified and processed for reasons of public interest in public health. This research received previous authorization from the DGS following positive advice from its data protection officer. In this way, the research complies with the best practices of the General Data Protection Regulation. This study was exempt from an ethics review board assessment following the self-assessment checklist for ethics of the Ethics Committee of the National School of Public Health [26].

Description of the Data Set
The data set (Table 1) comprised 330,098 DCs, each classified twice, meaning that we had all DCs classified by human coders and by AUTOCOD. The 3 most common ICD-10 chapters classified by human coders were chapter IX-"Diseases of the circulatory system" (97,420/330,098, 29.51%), chapter II-"Neoplasms" (85,837/330,098, 26%), and chapter X-"Diseases of the respiratory system" (40,202/330,098, 12.18%). A more extensive and detailed descriptive analysis of this data set can be found in Multimedia Appendix 1 [17-19], including the disaggregation of DCs by year, ICD-10 chapter or block, and period.
As expected, there were fewer DCs for periods of excess mortality (n=186,834; 93,417/330,098, 28.3% of the total DCs from each source) than for periods without excess mortality (n=473,362; 236,681/330,098, 71.7% of the total DCs from each source). When considering the periods of severe and extreme excess mortality either for Z scores of >4 SDs (n=60,220; 30,110/330,098, 9.12% from each source) or Z scores of >6 SDs (n=12,480; 6240/330,098, 1.89% from each source), the DCs were even fewer.
Considering only the 3 most common chapters of the data set (chapters II, IX, and X), which accounted for 67.69% (223,459/330,098) of the total DCs throughout the period, we performed the same analysis for the classification of ICD-10 blocks (Table 2). The 5 most common blocks classified in DCs were C00-C97 (malignant neoplasms), I60-I69 (cerebrovascular diseases), I30-I52 (other forms of heart disease), I20-I25 (ischemic heart diseases), and J09-J18 (influenza and pneumonia).

Results for ICD-10 Chapters
The caret package provides the confusion matrix, which evaluates AUTOCOD's performance by calculating some performance metrics. The full performance metrics calculated for AUTOCOD can be found in Multimedia Appendix 1 [17-19].
Specificity in all ICD-10 chapters was >0.96 for the excess mortality periods. The highest values of sensitivity (or recall) were for chapter II-"Neoplasms" (0.95), chapter XVIII-"Symptoms, signs, and abnormal clinical and laboratory findings not elsewhere specified" (0.93), and chapter IX-"Diseases of the circulatory system" (0.91). Considering the PPV (or precision), the highest values were for chapter II-"Neoplasms" (0.97), chapter IX-"Diseases of the circulatory system" (0.91), and chapter XVIII-"Symptoms, signs, and abnormal clinical and laboratory findings not elsewhere specified" (0.88). The highest F1-scores were for chapter II-"Neoplasms" (0.96), chapter IX-"Diseases of the circulatory system" (0.91), and chapter XVIII-"Symptoms, signs, and abnormal clinical and laboratory findings not elsewhere specified" (0.90).
Specificity in periods with severe excess mortality (>4 SDs) was >0.96 in all ICD-10 chapters. The highest values of sensitivity (or recall) were for chapter II-"Neoplasms" (0.94), chapter XVIII-"Symptoms, signs, and abnormal clinical and laboratory findings not elsewhere specified" (0.92), and chapter IX-"Diseases of the circulatory system" (0.91). Considering the PPV (or precision), the highest values were for chapter II-"Neoplasms" (0.97), chapter IX-"Diseases of the circulatory system" (0.91), and chapter XVIII-"Symptoms, signs, and abnormal clinical and laboratory findings not elsewhere specified" (0.88). The highest F1-scores were for chapter II-"Neoplasms" (0.96), chapter IX-"Diseases of the circulatory system" (0.91), and chapter XVIII-"Symptoms, signs, and abnormal clinical and laboratory findings not elsewhere specified" (0.90).
Considering the weighted average of all chapters, the results we obtained for the performance metrics of AUTOCOD are presented in Table 3. For sensitivity, PPV, and F1-score, there was no difference between periods without excess mortality and those with excess mortality (<0.01). There was a decrease of 0.01 from periods without excess mortality to periods with severe excess mortality (>4 SDs). There was a decrease of 0.04 when comparing the weighted average of periods without excess mortality and periods with extreme excess mortality (>6 SDs). It is vital to analyze the differences between periods without excess mortality and periods of excess mortality, severe excess mortality, or extreme excess mortality and which chapters perform better.
According to Table 4, the biggest differences in the sensitivity values of AUTOCOD between periods without excess mortality and periods with excess mortality were found in chapter XVI-"Certain conditions originating in the perinatal period" (0.07), chapter XVII-"Congenital malformations, deformations, and chromosomal abnormalities" (0.05), chapter VIII-"Diseases of the ear and mastoid process" (−0.07), and chapter XII-"Diseases of the skin and subcutaneous tissue" (−0.08). For the 3 most common chapters, the differences were 0.00 (chapter II-"Neoplasms"), 0.00 (chapter IX-"Diseases of the circulatory system"), and 0.01 (chapter X-"Diseases of the respiratory system"). Regarding the differences in sensitivity values between periods without excess mortality and periods of severe excess mortality (Z score of >4 SDs), the biggest differences were found in chapter VIII-"Diseases of the ear and mastoid process" (−0.22), chapter XII-"Diseases of the skin and subcutaneous tissue" (−0.12), chapter XVI-"Certain conditions originating in the perinatal period" (0.07), and chapter XVII-"Congenital malformations, deformations, and chromosomal abnormalities" (0.07). For the 3 most common chapters, the differences were 0.01 (chapter II-"Neoplasms"), 0.01 (chapter IX-"Diseases of the circulatory system"), and 0.00 (chapter X-"Diseases of the respiratory system"). When comparing the sensitivity values of AUTOCOD for periods without excess mortality and periods of extreme excess mortality (Z score of >6 SDs), the biggest differences were found in chapter XVII-"Congenital malformations, deformations, and chromosomal abnormalities" (0.19), chapter III-"Diseases of the blood and blood-forming organs and certain disorders involving the immune system" (0.17), chapter XIII-"Diseases of the musculoskeletal system and connective tissue" (0.10), and chapter XII-"Diseases of the skin and subcutaneous tissue" (0.08). For the 3 most common chapters, the differences were 0.00 (chapter II-"Neoplasms"), 0.00 (chapter IX-"Diseases of the circulatory system"), and 0.00 (chapter X-"Diseases of the respiratory system").
In addition, Table 4 shows the KLD between periods without excess mortality and periods of excess mortality. For 9 chapters, including 2 of the most prevalent (chapter II-"Neoplasms" and chapter IX-"Diseases of the circulatory system"), the KLD was 0, indicating that the distribution of values for periods of excess mortality was similar to that for periods of no excess mortality. For other chapters, such as chapter X-"Diseases of the respiratory system," the KLD was close to 0. In chapter XVI-"Certain conditions originating in the perinatal period," the KLD was particularly high, implying a large difference in the probability distributions. Regarding the KLD between periods without excess mortality and periods of severe excess mortality (Z score of >4 SDs), the sensitivity had a KLD of 0 for 9 chapters, including chapter X-"Diseases of the respiratory system." It also had a KLD close to 0 for chapter II-"Neoplasms" and chapter IX-"Diseases of the circulatory system." Regarding the KLD between periods without excess mortality and periods of extreme excess mortality (Z score of >6 SDs), the sensitivity had a KLD of 0 in the 3 most prevalent chapters as well as in chapter XV-"Pregnancy, childbirth, and the puerperium."
The differences in the performance measures of AUTOCOD between periods without excess mortality and periods of excess or extreme excess mortality are shown in Figure 2. The absolute values of the observations for each period analyzed and additional comparisons of AUTOCOD performance measures can be found in Multimedia Appendix 1 [17-19].

Results for ICD-10 Blocks
This section analyzes the ICD-10 classification by blocks for only the 3 most common chapters (chapter II-"Neoplasms," chapter IX-"Diseases of the circulatory system," and chapter X-"Diseases of the respiratory system").
Table 5 presents AUTOCOD's performance metrics for the weighted average of all the blocks analyzed. For sensitivity, PPV, and F1-score, there was a decrease of 0.01 from periods without excess mortality to periods with excess mortality, severe excess mortality (>4 SDs), and extreme excess mortality (>6 SDs).
Considering the differences between periods of excess mortality and periods without excess mortality, it is important to analyze which blocks had the biggest differences.
According to Table 6, the largest differences in the sensitivity of AUTOCOD between periods without excess mortality and periods of excess mortality were in blocks J00-J06 (acute upper respiratory infections; 0.34), J30-J39 (other diseases of the upper respiratory tract; 0.28), and I95-I99 (other and unspecified disorders of the circulatory system; 0.08). Regarding the difference in sensitivity between periods without excess mortality and periods of severe excess mortality (>4 SDs), the largest differences were in blocks J00-J06 (acute upper respiratory infections; 0.41), J85-J86 (suppurative and necrotic conditions of the lower respiratory tract; 0.23), J30-J39 (other diseases of the upper respiratory tract; 0.20), and I05-I09 (chronic rheumatic heart diseases; −0.22). The largest differences in the sensitivity of AUTOCOD between periods without excess mortality and periods of extreme excess mortality (>6 SDs) were in blocks J00-J06 (acute upper respiratory infections; 0.41), J85-J86 (suppurative and necrotic conditions of the lower respiratory tract; 0.31), and I05-I09 (chronic rheumatic heart diseases; −0.26).
Table 6 also shows the KLD between periods without excess mortality and periods of excess mortality. For 7 blocks, including C00-C97 (malignant neoplasms) and I60-I69 (cerebrovascular diseases), the KLD was 0. Several blocks had values of KLD very close to 0, such as I20-I25 (ischemic heart diseases) and J09-J18 (influenza and pneumonia). Regarding the KLD between periods without excess mortality and periods of severe excess mortality (Z score of >4 SDs), the sensitivity had a KLD of 0 in 2 blocks: D37-D48 (neoplasms of uncertain or unknown behavior) and J95-J99 (other diseases of the respiratory system). It also showed a KLD very close to 0 in blocks such as C00-C97 (malignant neoplasms) and I60-I69 (cerebrovascular diseases). Regarding the KLD between periods without excess mortality and periods of extreme excess mortality (Z score of >6 SDs), the sensitivity had a KLD of 0 for I26-I28 (pulmonary heart disease and diseases of pulmonary circulation) and J40-J47 (chronic lower respiratory diseases) and a KLD very close to 0 for C00-C97 (malignant neoplasms), I20-I25 (ischemic heart diseases), and J09-J18 (influenza and pneumonia). Some blocks, such as J00-J06 (acute upper respiratory infections) and J85-J86 (suppurative and necrotic conditions of the lower respiratory tract), had a particularly high KLD for increasing mortality periods.
The differences in the performance measures of AUTOCOD among periods without excess mortality, with excess mortality, and with extreme excess mortality according to ICD-10 blocks are shown in Figure 3. Additional AUTOCOD performance comparisons between periods can be found in Multimedia Appendix 1 [17-19].

Principal Findings
Continuous and systematic mortality data collection is crucial for monitoring the population's health and complementing epidemiological studies. This national study is the first to demonstrate the robustness of deep neural networks in classifying primary causes of death even during periods of excess mortality, enabling cause-specific mortality surveillance, which is not widely performed worldwide. This study demonstrated a consistently good performance of AUTOCOD in different periods regardless of excess mortality rates. The results demonstrate the potential of AI algorithms to expedite disease classification and coding, making them a valuable tool for real-time surveillance, timely assessment of public health risks, and planning of responses. Proving that these algorithms can operate effectively despite external factors in different environments reinforces the case for their implementation.
AUTOCOD showed high sensitivity (≥0.75) in 10 chapters, with values of >0.90 for the 3 most common ones (chapter II-"Neoplasms," chapter IX-"Diseases of the circulatory system," and chapter X-"Diseases of the respiratory system," which together account for 223,459/330,098, 67.69% of all human-codified causes of death). The weighted average of sensitivity in the ICD-10 chapter analysis showed no difference between periods without excess mortality and periods of excess mortality, a difference of 0.01 between periods without excess mortality and periods of severe excess mortality (>4 SDs), and a difference of 0.04 between periods without excess mortality and periods of extreme excess mortality (>6 SDs). In the ICD-10 block analysis, the weighted average of sensitivity differed by 0.01 between periods without excess mortality and each of the periods of excess mortality, severe excess mortality (>4 SDs), and extreme excess mortality (>6 SDs).
In the different periods considered for the ICD-10 chapter analysis, AUTOCOD showed a consistently good performance, demonstrating a sensitivity (or recall), a PPV (or precision), and an F1-score as high as 0.88 for periods without excess mortality and periods of excess mortality and as low as 0.84 in periods of extreme excess mortality (>6 SDs). When we considered only the most common chapters (chapter II-"Neoplasms," chapter IX-"Diseases of the circulatory system," and chapter X-"Diseases of the respiratory system"), sensitivity ranged from 0.94 to 0.95 in chapter II, 0.91 in chapter IX, and 0.89 to 0.90 in chapter X in the different periods analyzed. The same happened with the PPV, which ranged from 0.96 to 0.98 in chapter II, 0.90 to 0.92 in chapter IX, and 0.83 to 0.86 in chapter X. Regarding the F1-score, the performance of AUTOCOD was 0.96 in chapter II, 0.91 in chapter IX, and 0.86 to 0.88 in chapter X. When we considered only the most common blocks, namely C00-C97 (malignant neoplasms), I60-I69 (cerebrovascular diseases), I30-I52 (other forms of heart disease), I20-I25 (ischemic heart diseases), and J09-J18 (influenza and pneumonia), the sensitivity ranged from 0.91 to 0.98, the PPV ranged from 0.89 to 0.99, and the F1-score ranged from 0.90 to 0.99. AUTOCOD presented high specificity and negative predictive values in all the analyses performed. This was expected as the number of true negatives was consistently much higher than that of true positives. This is not a characteristic of AUTOCOD itself but rather a result of our handling of the sample and our interpretation of the question as a classification problem with a one-versus-all solution. This method is widely used for multiple-output class classification problems. In our case, the individual ICD-10 chapters or blocks were handled as if they were in a binary model, thus assessing each class individually against all the other classes in the model.
It should be noted that chapter XVIII ("Symptoms, signs, and abnormal clinical and laboratory findings not elsewhere specified") consistently presented high performance metrics in AUTOCOD. This does not translate to a correct certification of the cause of death, but it could imply that, when human coders have difficulties classifying the cause of death, so does AUTOCOD.
These results are aligned with those of previous studies using AUTOCOD [12,13] and, in general, with the literature on deep neural networks applied to the automatic classification of DCs [14,27,28]. Falissard et al [14] developed a deep neural network for automated coding of the underlying cause of death with a test accuracy of 0.978 (95% CI 0.977-0.979) and an F-measure value of 0.952 (95% CI 0.946-0.957) [27]. The proposed approach by Della Mea et al [28] for automated coding of causes of death had an accuracy of 0.990 (95% CI 0.990-0.991) and a macroaveraged accuracy and F1-score of 0.974 and 0.968, respectively. Similarly to our study, Della Mea et al [28] found that accuracy was low for chapters with rare causes of death and, therefore, rare causes of death could be ignored.
However, to the best of our knowledge, this is the first time that a deep neural network that classifies basic causes of death has been evaluated while comparing its performance across different time frames according to their excess mortality rates.
Automatic classification of DCs relies on natural language processing (NLP) techniques and algorithms. NLP can translate free text written by the physician who certified the death into classification codes based on the ICD-10. However, this process depends on the text quality of the analyzed DCs. By text quality, we mean how successfully we can automatically classify, retrieve, or extract information from them [29]. Thus, text quality does not involve a single aspect but combines numerous criteria, including spelling, grammar, organization, informative nature, and page layout [30]. Extracting these attributes can become problematic in low-quality texts (poor grammar, many abbreviations, and short sentences). This is a known problem in medical and clinical texts such as patient records or DCs [30]. Because NLP systems rely on these attributes of text quality, their performance affects the overall performance of the algorithms: a text of bad quality may result in poor-quality prediction results. To overcome this limitation, after the development of AUTOCOD, a processing layer was added to the neural network that reads each word in the text fields as the closest word the model knows (eg, for the word Alzheimer, it currently identifies >25 ways of misspelling it). Therefore, this processing layer can help minimize the effect of text field errors or abbreviations in periods of excess mortality [31-33].
Our results suggest that, even in periods of excess, severe, and extreme excess mortality when the volume of deaths and the pressure on health services might increase, with a consequent impact on physicians who certify deaths and a potential impact on the quality of the text in the DC, AUTOCOD's performance remains unhindered. Future work should analyze the linguistic properties of the DC, such as variations in text size and the number of fields filled in by physicians.

Limitations
An important limitation of this study is that the human coders had access to the automatic classification of the DC by AUTOCOD, meaning that the gold standard we used in this research might be biased by the same algorithm we were trying to evaluate. However, this implementation only entered production on July 26, 2019, meaning that manual classification was unbiased for most of the data used in this study.
In addition, there is the matter of ICD-10 code ambiguity. This is a known limitation of the ICD-10 for both human coders and automatic classification algorithms, and it can be explained by the sometimes subtle differences between codes for similar causes of death. This might explain the difference in sensitivity between, for example, respiratory blocks such as J00-J06 (acute upper respiratory infections) and J09-J18 (influenza and pneumonia), with the latter presenting a less ambiguous cause of death when compared with the former both for human classification and automatic classification. Unspecified codes do not necessarily reflect an error rate; rather, they indicate the completeness of clinical information in DCs for which sufficient clinical information is not known or available to assign a more specific code. Human coders commonly look for more clinical information in electronic health records. However, AUTOCOD is restricted to the information included in the DC. This stresses the importance of a well-filled and detailed DC by the physician who certifies the death even in periods of excess mortality.
Data on racial, ethnic, or socioeconomic groups are not routinely collected in the DC. Although other proxies of social vulnerability can be used, such as the municipality of residence, the focus of this research was not the study of differences in subgroups, making this an important next step of investigation.
The human coders that we set as our ground truth were not mistake free. Current research puts the reliability of human coders at approximately 70% to 89% (reliability is a measure for calculating agreement between coders and the consistency of each coder individually) [34]. These performance scores can be in part explained by the use of different codes for similar diseases. Moreover, the DGS has had a range of human coders that varies in number, typically from 4 to 6, and in experience in classifying causes of death. This may also affect the reliability and accuracy of the ground-truth labels we used in this study. Another possible limitation, known in the field of AI algorithms, is the generalization of our results to other countries [35]. This question of model transferability requires further study. However, we feel confident that our results can be generalized to other algorithms that rely on NLP for automatic classification without a profound impact on the model's performance even in periods of excess mortality.

Strengths
In Portugal, Law 15/2012 of April 3, 2012, established the SICO, a mortality information system based on the electronic registration of DCs [36]. Since then, SICO has become a widespread tool used by physicians nationally. Therefore, it is a well-established source of data and information related to mortality and an international example of the timeliness of mortality statistics [3].
AUTOCOD was built on the already widespread availability of DCs in electronic format and has since been validated as an essential tool for the automatic assignment of ICD-10 codes for causes of death [13]. However, this validation never considered differences between periods that might affect the quality of the DC and, consequently, the performance of AUTOCOD. The method we used for evaluating the performance of AUTOCOD during periods of excess mortality, severe excess mortality, and extreme excess mortality is a known method for comparing the performance of a given index test with a given ground truth or gold standard. This makes a case for the importance of evaluating algorithms and models in different periods and in ever-changing environments that might affect the overall performance of the models.
Although the current use of AUTOCOD is limited to supporting human coders, the research findings suggest a compelling case for enhancing the algorithms used for the automated classification of causes of death. In a completed DC, AUTOCOD can be used to accurately classify underlying causes of death in real time even in periods of excess mortality, attesting that deep neural networks are robust to possible changes in the underlying quality of the text. Furthermore, by defining a baseline from the past (and Portugal has digital DC data going back to 2014), we can detect in real time, with high sensitivity, changes in mortality and periods of excess mortality without the need to wait for human classification of cause of death, especially for the more common and less ambiguous causes of death. Finally, with this algorithm, we can use our data to predict excess deaths that rely on seasonality, such as those from influenza and pneumonia.
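As an illustration of this kind of comparison between an index test and a gold standard, the following minimal sketch computes sensitivity, specificity, PPV, and F1-score for one ICD-10 chapter in a one-vs-rest fashion. It is not the analysis code of this study; the function name `one_vs_rest_metrics` and the toy labels are hypothetical and serve only to show how the metrics derive from the confusion matrix.

```python
def one_vs_rest_metrics(gold, predicted, label):
    """Compute one-vs-rest metrics for a single label.

    gold: labels assigned by human coders (treated here as the gold standard).
    predicted: labels assigned by the automatic classifier (the index test).
    """
    tp = fp = fn = tn = 0
    for g, p in zip(gold, predicted):
        if p == label and g == label:
            tp += 1  # both agree on the label
        elif p == label:
            fp += 1  # classifier assigns the label, gold standard does not
        elif g == label:
            fn += 1  # gold standard assigns the label, classifier misses it
        else:
            tn += 1  # neither assigns the label

    def ratio(numerator, denominator):
        # Undefined metrics (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


if __name__ == "__main__":
    # Toy chapter labels, invented for illustration.
    gold = ["II", "II", "IX", "X", "IX", "II"]
    predicted = ["II", "IX", "IX", "X", "IX", "II"]
    for chapter in ["II", "IX", "X"]:
        print(chapter, one_vs_rest_metrics(gold, predicted, chapter))
```

Running the same function separately on the DCs of each period (without excess, excess, severe excess, and extreme excess mortality) would give the period-specific values that are compared in this kind of analysis.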
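The baseline-based detection described above can be sketched as follows, using the study's definition of excess mortality as 2 consecutive days with a Z score above a threshold (>4 SDs for severe and >6 SDs for extreme excess mortality). This is a minimal sketch under stated assumptions, not the surveillance system's implementation: the baseline expected counts and SDs are taken as given inputs, the function names are hypothetical, and the value 1.96 used for the 95% baseline limit is an assumption.

```python
def z_scores(observed, expected, sd):
    """Daily Z scores of observed deaths against a baseline (expected count and SD per day)."""
    return [(o - e) / s for o, e, s in zip(observed, expected, sd)]


def flag_runs(z, threshold, min_run=2):
    """Flag every day that belongs to a run of at least `min_run` consecutive days with Z > threshold."""
    flags = [False] * len(z)
    start = None
    # A sentinel below any threshold closes a run that reaches the last day.
    for i, value in enumerate(list(z) + [float("-inf")]):
        if value > threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                for j in range(start, i):
                    flags[j] = True
            start = None
    return flags


if __name__ == "__main__":
    # Invented daily counts and baseline, for illustration only.
    observed = [300, 350, 342, 320, 390, 430, 424, 300]
    expected = [290] * 8
    sd = [20] * 8
    z = z_scores(observed, expected, sd)
    print("excess :", flag_runs(z, 1.96))  # 1.96 is an assumed value for the 95% limit
    print("severe :", flag_runs(z, 4.0))
    print("extreme:", flag_runs(z, 6.0))
```

Applied to daily counts of deaths classified automatically by cause, the same flags could be computed per ICD-10 chapter or block without waiting for manual coding.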

Implications of Our Work
Our work makes a case for using AUTOCOD for real-time mortality surveillance by ICD-10 codes. It can be further validated by other countries wishing to train their neural networks for medical and clinical text classification. Our research also makes a case for auditing, evaluating, and consistently monitoring AI algorithms to identify potential barriers, strengths, and opportunities [37].
As the AUTOCOD algorithm is robust, it can be used to classify the underlying causes of death in periods of excess mortality with no need to wait for manual coding. This allows for adequate real-time cause-specific mortality surveillance, timely assessment of risks to public health, and definition of priorities and planning of responses in periods both with and without excess mortality. This cause-specific mortality surveillance in real time is not carried out widely worldwide and might benefit from further investigation and real-world intervention. This investigation is a step forward in Portugal for the widespread use of the classification of specific causes of death by AUTOCOD, with renewed confidence in its results regardless of the presence of excess mortality, and for the implementation of targeted public health interventions and practices.
Further investigations should be carried out, such as a comparison of AUTOCOD with other automated coding systems and a new evaluation of the behavior of AUTOCOD during periods of excess mortality caused by the COVID-19 pandemic, including retraining the algorithm with the new codes for COVID-19 that were not present in the ICD-10 when AUTOCOD was built [14,16,28]. To strengthen coding practices, conducting a reliability study among coders at the DGS would also be important.

Conclusions
This study makes the case for deep neural networks as powerful tools for automatically classifying primary causes of death according to the ICD-10 even during periods of excess mortality. Our work could potentially further the use of deep neural networks to facilitate automatic clinical codification, such as of diseases, medical procedures, or DCs. In addition, it may serve as a basis for the real-time monitoring and surveillance of public health threats and problems, allowing for timely action. More broadly, this study highlights the importance of AI algorithms as an advisory tool for public health policies and measures.

Figure 1. Flowchart of the study population inclusion criteria. DC: death certificate; DGS: Directorate-General of Health.

a Percentage values represent the proportion of death certificates for each period analyzed considering the total of each chapter except for the total column, which gives the proportion of each chapter for all the death certificates.
b Certain infectious and parasitic diseases.
c Neoplasms.
d Diseases of the blood and blood-forming organs and certain disorders involving the immune system.
e Endocrine, nutritional, and metabolic diseases.
f Mental and behavioral disorders.
g Diseases of the nervous system.
h Diseases of the eye and adnexa.
i Missing values.
j Diseases of the ear and mastoid process.
k Diseases of the circulatory system.
l Diseases of the respiratory system.
m Diseases of the digestive system.
n Diseases of the skin and subcutaneous tissue.

j Pulmonary heart disease and diseases of pulmonary circulation.
k Other forms of heart disease.
l Cerebrovascular diseases.
m Diseases of the arteries, arterioles, and capillaries.
n Diseases of the veins, lymphatic vessels, and lymph nodes not elsewhere classified.
o Other and unspecified disorders of the circulatory system.
p Acute upper respiratory infections.
q Influenza and pneumonia.
r Other acute lower respiratory infections.
s Other diseases of the upper respiratory tract.
t Chronic lower respiratory diseases.
u Lung diseases owing to external agents.

Figure 2. Comparison between performance metrics of AUTOCOD during periods of excess mortality, severe excess mortality, and extreme excess mortality and periods without excess mortality for International Statistical Classification of Diseases and Health-Related Problems, 10th Revision (ICD-10), chapters. DGS: Directorate-General of Health; SICO: Death Certificate Information System.

Figure 3. Comparison between performance metrics of AUTOCOD during periods of excess mortality and periods without excess mortality for International Statistical Classification of Diseases and Health-Related Problems, 10th Revision (ICD-10), blocks. DGS: Directorate-General of Health; SICO: Death Certificate Information System.

Table 1. Description of the study population by excess mortality and type of death certificate coding (N=330,098)a.

Table 2. Description of the study population for the 3 most common chapters (II, IX, and X) for all the periods analyzed (N=330,098)a.
a Percentage values represent the proportion of death certificates for each period analyzed considering the total of each block except for the total column, which gives the proportion of each block for all the death certificates.

Table 3. Average performance metrics for different periods for the International Statistical Classification of Diseases and Health-Related Problems, 10th Revision, chapter classification of AUTOCOD.

Table 4. Comparison among sensitivity values of AUTOCOD depending on the period (without excess mortality and with excess mortality, severe excess mortality, or extreme excess mortality) by chapter of the International Statistical Classification of Diseases and Health-Related Problems, 10th Revision.

Table 5. Weighted averages of performance metrics for different periods for the International Statistical Classification of Diseases and Health-Related Problems, 10th Revision, block classification of AUTOCOD.
Only 1 human coder classifies each DC, and the DGS regularly conducts an in-house auditing process in which 2 human coders check for internal reliability by classifying a small sample of DCs.