Critical success index or F measure to validate the accuracy of administrative healthcare data identifying epilepsy in deceased adults in Scotland

Background: Methods to undertake diagnostic accuracy studies of administrative epilepsy data are challenged by the lack of a way to reliably rank case-ascertainment algorithms in order of their accuracy. This is because it is difficult to know how to prioritise positive predictive value (PPV) and sensitivity (Sens). Large numbers of true negative (TN) instances frequently found in epilepsy studies make it difficult to discriminate algorithm accuracy on the basis of negative predictive value (NPV) and specificity (Spec), as these become inflated (usually > 90%). This study demonstrates the complementary value of using the weather forecasting or machine learning metrics critical success index (CSI) or F measure, respectively, as unitary metrics combining PPV and sensitivity.


Introduction
In a recently published diagnostic accuracy study using administrative healthcare data to identify epilepsy in deceased adults in Scotland, four different sources of administrative data were used to develop diagnostic algorithms whose accuracy was then examined (Mbizvo et al., 2020a). Algorithms developed from one, two, or three database coding or antiepileptic drug (AED) strategies, respectively labelled levels 1, 2 and 3, were ranked according to the outcomes of interest. These were positive predictive value (PPV) and sensitivity (Sens), the most commonly used outcomes in diagnostic accuracy studies of administrative epilepsy data (Mbizvo et al., 2020b). Whilst negative predictive value (NPV) and specificity (Spec) can be used as outcome measures for diagnostic accuracy ranking (Chubak et al., 2012), these values are often > 90% because of the large numbers of true negative (TN) instances found in the base data of epilepsy studies, making it difficult to discriminate algorithm accuracy on the basis of these measures (Mbizvo et al., 2020b; Mbizvo et al., 2023). Therefore, algorithms are typically ranked in order of highest to lowest PPV and sensitivity, with priority given to those with higher values in both estimates (Horrocks et al., 2017; Mbizvo et al., 2020b; Mbizvo et al., 2020a; Wilkinson et al., 2018). However, such a method is challenging to apply objectively because there is a trade-off relationship between PPV and sensitivity, where one decreases as the other increases (Wang et al., 2021; Wilkinson et al., 2018). This makes it difficult to know which estimate to prioritise (PPV or sensitivity) when trying to rank the diagnostic algorithms in order of their accuracy.
There is clearly a need to consider novel ways to combine PPV and sensitivity into a single metric to make it easier and more objective to rank diagnostic algorithms by their accuracy. We propose the use of the critical success index (CSI) (Schaefer, 1990) or F measure (Powers, 2015) (also known as the Dice coefficient) (Jolliffe, 2016) for this purpose. CSI is commonly used in meteorology to verify the accuracy of weather forecasts (Doswell et al., 1990; Gerapetritis and Pelissier, 2004; Palmer and Allen, 1949; Schaefer, 1990; Space Weather Prediction Center, 2022; Spyrou et al., 2020; World Meteorological Organization, 2014). In signal detection theory, CSI is defined as the ratio of hits to the sum of hits, false alarms, and misses (Larner, 2021, 2024; Space Weather Prediction Center, 2022). Therefore, it includes a measure of both PPV and sensitivity. F measure (or F1 score) is a machine learning evaluation metric that measures a model's accuracy as the weighted harmonic mean of precision (PPV) and recall (sensitivity) (Hicks et al., 2022; Powers, 2015). CSI values range from 0 to 1, interpreted as 0 = unable to forecast and 1 = perfect forecast (Spyrou et al., 2020; World Meteorological Organization, 2014). F measure is also bounded 0-1, where 1 represents perfect precision and recall values, and 0 represents absent precision and/or recall (Hicks et al., 2022). In effect, CSI or F measure combines PPV and sensitivity values into a convenient single metric that is easier to interpret and rank in terms of diagnostic accuracy than the two measures considered alongside one another. However, CSI is seldom used outside weather forecasting, with, to our knowledge, no studies published using CSI in medicine beyond our recent proof-of-concept works (Larner, 2021; Mbizvo and Larner, 2022; Mbizvo et al., 2023). Although use of the F measure is commonplace in artificial intelligence (AI), its applicability and convenience as a non-AI diagnostic accuracy tool is yet to be fully demonstrated (Larner, 2021, 2024).
In the current study, we aim to reanalyse data published in a Scottish diagnostic accuracy study of administrative epilepsy mortality data (Mbizvo et al., 2020a) by calculating CSI scores for each of the diagnostic algorithms. We do this in order to see whether and how this alters the original diagnostic accuracy rankings proposed by the authors, which were made in order of highest to lowest PPV and sensitivity (Mbizvo et al., 2020a). This will help further understanding of the role CSI may play as a medical diagnostic accuracy measure. Corresponding F measures will also be reported for illustrative purposes, as there is a monotonic relation between CSI and F (Jolliffe, 2016), meaning the rankings of CSI and F values calculated for any dataset are the same.

Methods
Details of the study design, study population, linkage of datasets and approvals accessed, as well as the algorithms used and their rankings, may be found in the previous publication (Mbizvo et al., 2020a) (rankings are reproduced here in Tables 1, 2 and 3). In brief, the following administrative healthcare databases were used to create algorithms: National Records of Scotland (NRS) death records; Scottish Morbidity Record 01 (SMR01) hospital admissions; Prescribing Information System (PIS) Scottish prescribing data; and GP primary care data. PIS was used to screen for deceased adults prescribed one or more AEDs, using both broad (36 AEDs) and narrow (21 AEDs) filters (Mbizvo et al., 2020a). The study was designed to conform to the Standards for Reporting of Diagnostic Accuracy studies (STARD 2015 iteration) (Bossuyt et al., 2015).

Statistical analysis
From the numbers of true positive (TP), false positive (FP), and false negative (FN) cases, CSI values were calculated for each algorithm according to the formula (Larner, in press):

CSI = TP / (TP + FP + FN)

This eschews true negatives (TN) (Mbizvo et al., 2023). CSI may also be expressed in terms of PPV and sensitivity:

1/CSI = 1/PPV + 1/Sens − 1

This relation of CSI to PPV means that CSI is affected by prevalence, the probability of a positive diagnosis (P), as well as by test threshold, the probability of a positive test (Q), since PPV = (Sens × P)/Q:

CSI = (Sens × P) / (P + Q − Sens × P)

F values are also based on the same base data (Larner, in press):

F = 2TP / (2TP + FP + FN)

Or from PPV and sensitivity values:

F = (2 × PPV × Sens) / (PPV + Sens)

This relation of F to PPV means that F is affected by P, as well as Q:

F = (2 × Sens × P) / (P + Q)

The monotonic relation between CSI and F is such that:

F = 2CSI / (1 + CSI), equivalently CSI = F / (2 − F)
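The formulas above can be sketched as a few short Python helpers (a minimal illustration of the calculations; the function names are ours, not from the study):

```python
# Diagnostic accuracy metrics from the 2x2 confusion counts.
# TP/FP/FN follow the paper's notation; TN is deliberately unused,
# since both CSI and F eschew true negatives.

def csi(tp: int, fp: int, fn: int) -> float:
    """Critical success index: hits / (hits + false alarms + misses)."""
    return tp / (tp + fp + fn)

def f_measure(tp: int, fp: int, fn: int) -> float:
    """F (Dice) measure: 2*TP / (2*TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)

def csi_from_ppv_sens(ppv: float, sens: float) -> float:
    """CSI via PPV and sensitivity: 1/CSI = 1/PPV + 1/Sens - 1."""
    return 1.0 / (1.0 / ppv + 1.0 / sens - 1.0)

def f_from_ppv_sens(ppv: float, sens: float) -> float:
    """F is the harmonic mean of PPV (precision) and sensitivity (recall)."""
    return 2 * ppv * sens / (ppv + sens)

def f_from_csi(c: float) -> float:
    """Monotonic relation between the two metrics: F = 2*CSI / (1 + CSI)."""
    return 2 * c / (1 + c)
```

For example, an algorithm with TP = 90, FP = 10, FN = 9 gives the same CSI whether computed from the counts or from its PPV (90/100) and sensitivity (90/99), and its F value follows from its CSI via the monotonic relation.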

Results
CSI and F values for each of the 60 algorithms for which calculations could be made were plotted along with the corresponding PPV and sensitivity values (Fig. 1). The plot shows CSI scores were conservative (range 0.02-0.826), always less than or equal to the lower of the corresponding PPV (range 39-100%) and sensitivity (range 2-93%).
Unlike CSI, F values were less conservative (range 0.039-0.905), sometimes higher than either PPV or sensitivity, but were always higher than CSI. Low CSI and F values occurred when there was a large difference between PPV and sensitivity, e.g. CSI was 0.02 and F was 0.039 in an instance when PPV was 100% and sensitivity was 2%. Algorithms with both high PPV and sensitivity performed best in terms of CSI and F measure, e.g. CSI was 0.826 and F was 0.905 in an instance when PPV was 90% and sensitivity was 91%.
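The two worked examples above can be reproduced directly from the PPV and sensitivity forms of the formulas (a minimal check; values rounded to three decimal places as reported):

```python
def csi(ppv: float, sens: float) -> float:
    """CSI from PPV and sensitivity: 1/CSI = 1/PPV + 1/Sens - 1."""
    return 1.0 / (1.0 / ppv + 1.0 / sens - 1.0)

def f_measure(ppv: float, sens: float) -> float:
    """F as the harmonic mean of PPV and sensitivity."""
    return 2 * ppv * sens / (ppv + sens)

# Imbalanced algorithm: PPV 100%, sensitivity 2%
print(round(csi(1.00, 0.02), 3), round(f_measure(1.00, 0.02), 3))  # 0.02 0.039
# Balanced algorithm: PPV 90%, sensitivity 91%
print(round(csi(0.90, 0.91), 3), round(f_measure(0.90, 0.91), 3))  # 0.826 0.905
```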
Results were further considered according to the three levels of validation published in the original study (Mbizvo et al., 2020a).

Level 1 validation results assessing nine algorithms (Table 1)
CSI and F values could not be calculated for one of these algorithms because of the absence of information on FN instances, and hence sensitivity. Amongst the remaining eight level 1 algorithms, the optimal coding strategy by PPV was the narrow list of AEDs from the PIS dataset, and this remained the optimal algorithm according to CSI values (= 0.826). The algorithm using the broad AED list rose from fifth ranking by PPV to second by CSI, displacing the NRS death records using either G40 (epilepsy) alone or G40-41 (epilepsy and/or status epilepticus) codes, whose 2nd and 3rd overall rankings by PPV became 4th and 3rd, respectively, by CSI.

Level 2 validation results assessing 25 algorithms (Table 2)
CSI and F values could be calculated for all 25 algorithms in level 2 validation. The optimal coding strategy by PPV, combining F25 epilepsy Read codes in primary care with R56.8 seizures codes within the NRS causes of death, dropped to 22nd of 25 on the ranking by CSI values (= 0.067) as a consequence of its very low sensitivity (7%).
The algorithm with the (joint) highest sensitivity (93%), combining F25 epilepsy Read codes from primary care with AEDs in the narrow list, became the new optimal coding strategy according to CSI (= 0.804) values, rising from 16th of 25 based on PPV (86%).

Level 3 validation results assessing 27 algorithms (Table 3)
CSI and F values could be calculated for all 27 algorithms in level 3 validation. The top six ranked algorithms all achieved maximal PPV but had very low sensitivity (range 2-6%), and hence all dropped in the ranking according to CSI values (= 0.02-0.058) to 20th or lower. The new optimal algorithm was previously ranked 13th of 27 (PPV = 90%), but achieved a CSI of only 0.612.

Discussion
The key finding in our study was of substantial changes in the accuracy ranking of the diagnostic algorithms compared to the original rankings in the published study (Mbizvo et al., 2020a). These changes can perhaps be considered objective improvements in the rankings because the original rankings were based on a narrative comparison of PPV and sensitivity magnitudes against one another, as is frequently done in diagnostic accuracy studies of administrative data (Horrocks et al., 2017; Kee et al., 2012; Mbizvo et al., 2020b; Wilkinson et al., 2018). Algorithms with high PPV but low sensitivity were ranked lower using CSI and F as outcome measures than algorithms with both high PPV and sensitivity. This is because CSI and F prioritise instances where both PPV and sensitivity are high over instances where there are large differences between PPV and sensitivity (even if one of these is very high). This may potentially allow diagnostic accuracy thresholds based on combined PPV and sensitivity to be determined in future, e.g. CSI ≥ 0.80 (Mbizvo et al., 2023).
The findings of our study showed that the choice of "optimal" algorithm is influenced by the outcome measure used. Reasons for selecting or privileging PPV, sensitivity (or indeed NPV or specificity) as outcome measures have been discussed (Chubak et al., 2012). We understand that investigators may wish to prioritise PPV or sensitivity depending on the use case and the relative cost of false positives and false negatives for various applications. However, using either CSI or F values to complement these measures presents the opportunity to select the best balance of both PPV and sensitivity, and in a manner that can be standardised between studies. As there is a monotonic relation between CSI and F, their rankings will always be the same, so there is no a priori reason to choose one measure over the other. However, our preference is for CSI since it is easy to calculate from the base data, is easily understood in terms of signal detection theory, and is a more conservative measure than F. Moreover, various shortcomings of the F measure have been described (Powers, 2015).
To our knowledge, this is the first study to use CSI scores and F measures to assist in establishing the diagnostic accuracy of administrative epilepsy data. These measures were proposed because they combine information from both PPV and sensitivity, the most commonly reported measures of diagnostic accuracy in studies validating administrative epilepsy data (Mbizvo et al., 2020b), thereby standardising their interpretation. We illustrate how it is easier to rank diagnostic algorithms in order of their relative accuracy when using unitary CSI or F measures. Like PPV and sensitivity, CSI and F values avoid the risk of very high values of NPV and specificity consequent upon large numbers of TN instances by eschewing this number. Like PPV, CSI and F values are inherently affected by disease prevalence. An alternative would be to use the Gilbert skill score (GS) (Gilbert, 1884), another metric used in weather forecasting thought to be less biased because it is less affected by rare events (Space Weather Prediction Center, 2022). However, we have shown there is a monotonic relation between CSI and GS (Larner, 2021) that, in practice, leads to little overall difference in conclusions between the two metrics in the epilepsy literature (Mbizvo et al., 2023). Area under the receiver operating characteristic curve (AUC) would be another unitary metric to consider (Metz, 1978). This measures the ability of a test to discriminate whether a specific condition is present or not. However, its calculation depends on having data available on specificity, measured from TN instances. TNs and specificity are seldom reported in diagnostic accuracy studies of administrative epilepsy data (Mbizvo et al., 2020b). This is likely due to the challenges researchers face in gathering all of the clinical information needed to confirm that cases without a coded diagnosis in the available administrative dataset are truly negative everywhere else in the health record (e.g. in primary care or specialist records that may be difficult to access). Therefore, AUC is rarely used in diagnostic accuracy studies of administrative epilepsy data (Mbizvo et al., 2020b). The wide availability of PPV and sensitivity in such studies leaves room to consider using CSI or F instead.
We suggest that CSI and/or F metrics should be further explored in diagnostic accuracy studies of administrative epilepsy data, as well as more broadly in any screening or test accuracy studies involving people with epilepsy, e.g. those assessing the accuracy of electroencephalography. For example, ranking by CSI or F value may have been helpful in the largest systematic review of diagnostic accuracy studies of administrative epilepsy data (Mbizvo et al., 2020b), as the researchers identified the optimal diagnostic algorithms by ranking them in order of PPV and sensitivity and making a judgement on which had the best balance of both high PPV and sensitivity, selecting an arbitrary threshold of > 80% to represent accuracy. It was also difficult to identify the optimal diagnostic algorithms by NPV and specificity in that systematic review, as these were nearly 100% across most studies due to very high numbers of TN, often far outnumbering TP, FP, and FN. Another systematic review of diagnostic accuracy studies of administrative epilepsy data took a similar approach of narrative comparison of PPV and sensitivity balances (Kee et al., 2012), as did a systematic review of administrative dementia data (Wilkinson et al., 2018) and a systematic review of administrative motor neurone disease data (Horrocks et al., 2017).
Ranking by CSI or F values may have been helpful in these reviews. Additional ways to explore CSI and F measure in future might be to consider whether both track predominantly with sensitivity, much more so than with PPV, and whether this relationship changes with prevalence. To examine this question, researchers could use the equations expressing CSI and F in terms of P and substitute in different values of P, or perhaps calculate rescaled PPVs for different values of P using Bayes' equation and thence calculate rescaled CSI and F values. We have not examined these possibilities here as they fall beyond the scope of the current study.
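As a sketch of the rescaling suggested above, PPV can be recomputed for a chosen prevalence P via Bayes' theorem, holding sensitivity and specificity fixed, and CSI and F recalculated from the rescaled PPV (the sensitivity and specificity values below are illustrative assumptions, not taken from the study):

```python
def ppv_bayes(sens: float, spec: float, p: float) -> float:
    """Rescale PPV for prevalence p via Bayes' theorem:
    PPV = Sens*P / (Sens*P + (1 - Spec)*(1 - P))."""
    return sens * p / (sens * p + (1 - spec) * (1 - p))

def csi(ppv: float, sens: float) -> float:
    """CSI from PPV and sensitivity: 1/CSI = 1/PPV + 1/Sens - 1."""
    return 1.0 / (1.0 / ppv + 1.0 / sens - 1.0)

def f_measure(ppv: float, sens: float) -> float:
    """F as the harmonic mean of PPV and sensitivity."""
    return 2 * ppv * sens / (ppv + sens)

sens, spec = 0.91, 0.95  # illustrative values only
for p in (0.005, 0.05, 0.5):
    ppv = ppv_bayes(sens, spec, p)
    print(f"P={p}: PPV={ppv:.3f}, CSI={csi(ppv, sens):.3f}, "
          f"F={f_measure(ppv, sens):.3f}")
```

At low prevalence the rescaled PPV, and with it CSI and F, falls sharply, which is one way the prevalence dependence noted in the Discussion could be examined systematically.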

Fig. 1: Dotted line plot of CSI, PPV and sensitivity estimates across the diagnostic study algorithms. Abbreviations: CSI = critical success index; PPV = positive predictive value; Sens = sensitivity; F = F value; SMR = Scottish Morbidity Record; NRS = National Records of Scotland; AED = antiepileptic drug; F25 = primary care diagnostic Read codes for epilepsy; G40-41 = International Classification of Diseases 10 (ICD-10) codes for epilepsy and status epilepticus; R56.8 = ICD-10 code for seizures; NL = AEDs on the narrow list; BL = AEDs on the broad list.

Table 2
Results of Level 2 validation of algorithms combining two database coding or AED strategies together.

Table 3
Results of Level 3 validation of algorithms combining three database coding or AED strategies together.