Computer-assisted initial diagnosis of rare diseases

Introduction. Most documented rare diseases have a genetic origin. Because of their low individual frequency, an initial diagnosis based on phenotypic symptoms is not always easy, as practitioners might never have been exposed to patients suffering from the relevant disease. It is thus important to develop tools that facilitate symptom-based initial diagnosis of rare diseases by clinicians. In this work we aimed at developing a computational approach to aid in that initial diagnosis, at implementing this approach in a user-friendly web prototype, which we call Rare Disease Discovery, and at testing the performance of that prototype.

Methods. Rare Disease Discovery uses the publicly available ORPHANET data set of associations between rare diseases and their symptoms to automatically predict the most likely rare diseases based on a patient's symptoms. We apply the method to retrospectively diagnose a cohort of 187 rare disease patients with confirmed diagnoses. Subsequently, we test the precision, sensitivity, and global performance of the system under different scenarios by running large-scale Monte Carlo simulations. All settings account for situations where absent and/or unrelated symptoms are considered in the diagnosis.

Results. We find that this expert system has high diagnostic precision (≥80%) and sensitivity (≥99%), and is robust to both absent and unrelated symptoms.

Discussion. The Rare Disease Discovery prediction engine appears to provide a fast and robust method for initial assisted differential diagnosis of rare diseases. We coupled this engine with a user-friendly web interface; it can be freely accessed at http://disease-discovery.udl.cat/. The code and the most current database for the whole project can be downloaded from https://github.com/Wrrzag/DiseaseDiscovery/tree/no_classifiers.


The technology behind the web application is the GRAILS framework, a web application framework built for the Java Virtual Machine that uses the Groovy programming language. GRAILS uses an MVC (Model View Controller) pattern that allows for full integration between the model (and the database) and the view (user interface). With built-in database access and modeling, it enables easy abstraction and decoupling between these two parts of the application, permitting easy database migrations. This also helps hide database complexity and provides access to information in an object-oriented way. JQuery and Ajax were also used in order to provide dynamic web capabilities such as autocomplete. Twitter Bootstrap, a powerful front-end framework for faster and easier web development, was included, streamlining the styling and design of the web interface. Because it is built for the JVM, GRAILS also enables easy integration with Java packages, plugins, and wrappers.

The database design provides a welcome positive side effect: it is trivial to keep the database up to date. A periodic download of the ORPHANET data every three months, followed by an upload of that data to our database, can be done in minutes, ensuring that RDD remains up to date and usable over the long run. Currently, the database has a total of 6 915 diseases and 2 110 symptoms, with a total of 101 840 records representing relations between symptoms and diseases.

We note that the symptom-disease association file from the ORPHANET dataset is re-curated by us in order to ensure that the automated processing of the xml file made available by ORPHANET is done without mistakes. Although it does not always happen, some versions of the xml files we downloaded had one or more tags that were not properly closed.
In addition, in the earlier versions of the files, the symptoms had not yet been fully converted to their synonymous terms in the HPO (Human Phenotype Ontology) [1]. We performed a script analysis to identify the terms that were not in HPO and transformed them into their HPO synonyms. In the last three versions of the ORPHANET xml file, we found that the HPO nomenclature has been fully implemented.
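These two curation steps (detecting malformed xml and converting legacy symptom labels to their HPO synonyms) can be sketched as follows. This is a minimal illustration, not our production script: the `Association`, `Disease`, and `Symptom` element names and the synonym table below are assumptions for the sake of the example, not the actual ORPHANET schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from legacy symptom labels to HPO terms;
# the real table was built by manual curation.
LEGACY_TO_HPO = {
    "Mental retardation": "Intellectual disability",
}

def load_associations(xml_text):
    """Parse a disease-symptom XML export, failing loudly on malformed markup."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        # An unclosed tag in the export surfaces here and is then fixed by hand.
        raise ValueError(f"Malformed ORPHANET export: {err}")
    records = []
    for assoc in root.iter("Association"):  # assumed element name
        disease = assoc.findtext("Disease")
        symptom = assoc.findtext("Symptom")
        # Normalize any pre-HPO label to its HPO synonym.
        records.append((disease, LEGACY_TO_HPO.get(symptom, symptom)))
    return records
```

Failing on the first parse error, rather than skipping bad records silently, is what lets curation problems in a new ORPHANET release be caught before they reach the database.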

Other classification approaches to predicting rare diseases based on symptoms were tested. First, we tested an additional ranking function that takes into consideration how frequently each symptom is thought to be associated with the disease. This information is provided in the ORPHANET dataset, which associates qualitative frequency information to a symptom when it is associated to a disease (Very Frequent, Frequent, and Occasional). This tested function has the form:

DS'_i = 1 − (Σ_{j=1}^{n} f_ij) / Max[S_user, S_Disease_i]        (Eq. A1)

In Eq. A1, S_user represents the number of symptoms provided by the user, S_Disease_i represents the number of symptoms of disease i stored in the database, and Max[S_user, S_Disease_i] represents the largest of S_user and S_Disease_i. n represents the number of symptoms that are different between the set submitted by the user and the set associated to any given rare disease in the database. f_ij measures the qualitative frequency at which symptom j has been found to associate to disease i in the past (see above). Given that there were only three categorical frequency associations (Very Frequent, Frequent, Occasional), f_ij was considered to have one of three values: f_ij = 1 if the symptom is very frequently associated to disease i or is a symptom that is provided by the user; f_ij = 0.75 if the symptom is frequently associated to disease i; and f_ij = 0.5 if the symptom is only occasionally associated to disease i. It can be shown that −1 ≤ DS_i ≤ DS'_i ≤ 1. However, even though DS'_i ≠ DS_i, the list of diseases is ranked in the same order by both scores (data not shown). Because more calculations are required to estimate DS'_i, using this score for ranking leads to slower computation. Hence, we discarded DS'_i.

Second, we also trained and tested algorithms based on Support Vector Machines, Neural Networks, Bayesian Networks, Random Trees, and Random Forests.
Invariably, these algorithms required extensive training and prediction time, and their best performance was always about one order of magnitude lower than that of the algorithm and score described in this paper. They were also orders of magnitude slower in predicting the disease and required more computational resources for doing so.
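As an illustration of how the two scores relate, the following sketch implements DS_i and the weighted DS'_i under the definitions above. The data structures (a set of user symptoms, and a symptom-to-frequency map per disease) are our own illustrative choice, not the application's actual representation.

```python
# Qualitative frequency weights f_ij, as defined in the text.
WEIGHT = {"very_frequent": 1.0, "frequent": 0.75, "occasional": 0.5}

def ds(user_symptoms, disease_symptoms):
    """Plain score: DS_i = 1 - n / max(|S_user|, |S_disease|)."""
    n = len(user_symptoms ^ disease_symptoms)  # symptoms present in only one set
    return 1.0 - n / max(len(user_symptoms), len(disease_symptoms))

def ds_prime(user_symptoms, disease_freqs):
    """Weighted score DS'_i: each differing symptom contributes its weight f_ij.

    disease_freqs maps each symptom of disease i to its qualitative
    frequency label; user-only symptoms always weigh 1.0.
    """
    weighted_n = sum(1.0 for s in user_symptoms if s not in disease_freqs)
    weighted_n += sum(WEIGHT[f] for s, f in disease_freqs.items()
                      if s not in user_symptoms)
    return 1.0 - weighted_n / max(len(user_symptoms), len(disease_freqs))
```

Because every f_ij is at most 1, the weighted penalty never exceeds the unweighted one, which is why DS_i ≤ DS'_i always holds.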

RAMEDIS is a server that provides management services for medical doctors diagnosing, treating, and managing rare disease patients. Its database contains short report cards, with at most 3 sentences, about 1099 patients with confirmed rare disease clinical diagnoses. The information for about 60% of these patients is public, although anonymized. Of these approximately six hundred patients, nearly half have metabolic rare diseases that were diagnosed in screening programs at a preclinical stage. Of the remaining three hundred patients, one hundred and eighty-seven had a confirmed clinical diagnosis associated with a report card that described at least one symptom.

We took these 187 patients and reconstructed their symptoms from the individual report cards. In some cases this was easy, and the report cards were very clear (for example: patient with seizures or hypotonia). In other cases the symptoms were vaguely described and hard to reconstruct. For example, "hearing problems" or "hearing loss" could be any of the following: "Conductive deafness/hearing loss", "Central deafness/hearing loss", "Sensorineural deafness/hearing loss", or "Hearing loss/hypoacusia/deafness". As another example, "Infection" could be any of the following: "Immunodeficiency/increased susceptibility to infections/recurrent infections", "Recurrent urinary infections", "Chronic skin infection/ulcerations/ulcers/cancrum", or "Repeat respiratory infections". In these cases, we opted for including all possibilities rather than eliminating the symptom. This decision was made because eliminating the symptom would have meant discarding additional patients from an already small set, as the reported symptoms were often ambiguous.
An example of two report cards and their processing is shown in the Supporting Information.

Effects of noisy symptoms on prediction accuracy of the Rare Disease Discovery Algorithm

The first benchmark test was done by generating several random sets of 10 000 patients, each with all the symptoms associated to a specific but randomly chosen rare disease. Then, for increasing percentages of the patients in a given random set, either 1, 2, 3, 4, 5, 10, or 20 symptoms were randomly added or deleted to create noise. The noisy sets of symptoms were then used by the RDD algorithm to predict the rare disease that generated them. The precision p, sensitivity s, and F-Score of the RDD prediction algorithm were calculated for each set of patients. The results are summarized in Figure 2 of the main text and discussed in the main manuscript.
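The noise-injection step of this benchmark can be sketched as below. `perturb` is a hypothetical helper name, and deciding with a coin flip whether each of the k operations adds or deletes a symptom is an assumption about a detail the text leaves open.

```python
import random

def perturb(symptoms, all_symptoms, k, rng):
    """Return a noisy copy of a synthetic patient's symptom set,
    with k symptoms randomly added or deleted."""
    noisy = set(symptoms)
    for _ in range(k):
        if rng.random() < 0.5 and len(noisy) > 1:
            noisy.discard(rng.choice(sorted(noisy)))  # delete a present symptom
        else:
            noisy.add(rng.choice(all_symptoms))       # add a random symptom
    return noisy
```

Each noisy symptom set is then fed to the prediction engine, and the predicted disease is compared against the disease that generated the patient.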

Effects of unreported symptoms on prediction accuracy of the Rare Disease Discovery Algorithm
The second benchmark test was done by again generating several random sets of 10 000 patients, each with all the symptoms associated to a specific but randomly chosen rare disease. Then, for increasing percentages of the patients in a given random set, either 25%, 50%, or 75% of the symptoms were deleted to create noise. Finally, the noisy sets of symptoms were used by the RDD algorithm to predict the rare disease that generated them. These simulations represent situations where not all symptoms are known to the user during diagnosis. The precision p, sensitivity s, and F-Score of the RDD prediction algorithm were calculated for each set of patients. The results are summarized in Figure 3 of the main text and discussed in the main manuscript.
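The three metrics can be computed from the outcome of each simulated patient in the standard way. The sketch below assumes that a trial records the engine's top-ranked prediction (or None when no disease is returned) next to the true disease; whether this matches the paper's exact bookkeeping is an assumption.

```python
def benchmark_metrics(trials):
    """trials: list of (predicted_disease_or_None, true_disease) pairs.

    Returns (precision, sensitivity, F-score) with the usual definitions:
    precision = TP / (TP + FP), sensitivity = TP / (TP + FN),
    F = 2ps / (p + s).
    """
    tp = sum(1 for pred, true in trials if pred == true)
    fp = sum(1 for pred, true in trials if pred is not None and pred != true)
    fn = sum(1 for pred, true in trials if pred is None)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * sensitivity / (precision + sensitivity)
               if precision + sensitivity else 0.0)
    return precision, sensitivity, f_score
```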

Estimating significance levels for DS_i scores and benchmarking RDD with patients that do not suffer from rare diseases
It is important to estimate how large DS_i must be for a user to be sure that the set of symptoms being submitted to RDD (Rare Disease Discovery) is not the result of randomly associated symptoms. A third benchmark of the RDD algorithm was done to estimate this DS_i value. This estimation was done in the following way. Consider that there are 13 698 diseases and 2 528 symptoms in our database. The average number of symptoms associated to a disease is 42, with a standard deviation of 59. To calculate the probability that a given DS_i for a set of symptoms produced by a user is statistically significant, we generated 10 000 random vectors of symptoms. The population of the 10 000 vectors had an average number of symptoms equal to 42, with a standard deviation of 59. Given that these vectors were random, by plotting P(DS ≥ DS_i) = (1 − CDF(DS_i)) as a function of DS_i (Supporting Figure 3), we are able to estimate the probability that a given score is achieved simply by choosing a random combination of symptoms. This experiment estimates that a score DS_i ≥ 0.5 has a probability lower than 0.0001 of being obtained by choosing a random set of symptoms. If we relax the probability to 0.01, then DS_i ≥ 0.25. In fact, the median DS_i score for a random choice of symptoms is less than 0.01.
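The tail-probability estimate 1 − CDF(DS_i) can be obtained from an empirical sample of random-patient scores, for example as follows. `best_random_score` and the toy database are illustrative stand-ins for the real engine and the full ORPHANET matrix.

```python
import random

def best_random_score(diseases, all_symptoms, n_symptoms, rng):
    """Best DS_i over all diseases for one random synthetic patient,
    using DS_i = 1 - n / max(|S_user|, |S_disease|)."""
    patient = set(rng.sample(all_symptoms, n_symptoms))
    return max(1.0 - len(patient ^ d) / max(len(patient), len(d))
               for d in diseases.values())

def empirical_tail(scores, threshold):
    """Fraction of sampled scores at or above the threshold: an
    estimate of P(DS >= threshold) = 1 - CDF(threshold)."""
    return sum(1 for s in scores if s >= threshold) / len(scores)
```

Sampling 10 000 random patients, scoring each, and plotting `empirical_tail` over a grid of thresholds reproduces the kind of curve described above.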

Estimating significance levels for the differences between two DS_i scores
In the previous section we described an experiment that allowed us to estimate that, if DS_i > 0.5, one can be 99.99% sure that the score was not obtained by choosing a random set of symptoms. Another issue is that of determining how significant the differences between two DS_i scores for the same set of symptoms are. Estimating this is much more complicated, because the significance will depend on the number of symptoms one submits for the prediction. A final benchmark experiment was done in order to provide a best-scenario estimation of how statistically significant the differences between two DS_i scores are.

In this fourth and final benchmark we performed the following Monte Carlo simulation experiments. For each disease we created all possible sets of k symptoms associated to that disease, where k = 1, 2, 3, 4, 5, 10, 20, and 50 (taking care to eliminate from the simulation diseases that had fewer than the simulated number of symptoms). Then, for each k, we calculated DS_i for the correct disease. We call this list DS_i^correct. In parallel, for each k and for each set of symptoms, we calculated DS_i for all diseases that were not the one from which we had extracted the set of symptoms. We call this list DS_i^incorrect. Then, for each k, we compared the quantiles of the two lists. The results are summarized in Supporting Table 2 and interpreted in the following way. For the same number of submitted symptoms, in the context of the disease-symptoms association matrix, the differences between corresponding quantiles of the DS_i^correct and DS_i^incorrect lists provide a proxy to evaluate how different two DS_i scores (one correct and one incorrect) must be for that difference to be significant. Thus, if users submit for example one symptom and want a certainty of 99.9% that two DS_i scores are different, Supporting Table 2 tells us that the two scores should differ by at least 0.14. How can this be interpreted?
For example, the score for the most highly ranked disease and that for the second best guess by RDD need to differ by at least 0.14 if one wants to state that the prediction is significantly (p<0.001) better than the second best guess.

It is important to benchmark the performance of RDD with patients that have symptom(s) associated to rare diseases without suffering from those diseases. This is a very real scenario, as many symptoms are common between rare and non-rare diseases. A possible test would be to create synthetic patients from other diseases, adding random rare disease symptoms, and running RDD. However, we note that RDD only allows users to choose symptoms that have been previously associated to at least one rare disease. Hence, testing RDD's performance with synthetic patients from non-rare diseases is formally equivalent to generating synthetic patients with random associations of rare-disease symptoms. This is the same test that was done to determine significance for DS_i scores. In other words, only when DS_i is larger than 0.5 does RDD ensure, with a probability higher than 0.9999, that the patient has a rare disease.
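One plausible reading of the quantile comparison (our reconstruction; the paper's exact construction may differ) can be sketched as:

```python
def quantile(scores, q):
    """Empirical q-quantile of a list of scores."""
    ordered = sorted(scores)
    idx = min(int(q * len(ordered)), len(ordered) - 1)
    return ordered[idx]

def quantile_gap(correct, incorrect, q):
    """Difference between the q-quantile of the DS_i^correct list and
    that of the DS_i^incorrect list: a proxy for the minimum gap two
    scores must show to be considered different at confidence level q."""
    return quantile(correct, q) - quantile(incorrect, q)
```

With the real per-k score lists, a call such as `quantile_gap(correct_k, incorrect_k, 0.999)` would play the role of the 0.14 threshold cited above for k = 1.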

Accurate predictions in the absence of statistically significant DS_i scores
Taken together, the four benchmark experiments described in the main manuscript show that DS_i decreases sharply with noise; however, even if DS_i is below the statistical significance level, it can still be used to accurately predict the correct rare disease, although with a lower confidence (see above). For example, in Supporting Figure 4 we show box plots of the maximum DS_i scores for all patients in the second benchmark test. We see that when patients have 50% absent symptoms, the maximum score is still almost always above 0.5, which is the 0.0001 significance level determined in benchmark 3. Only when 75% of the symptoms are absent do we get maximum DS_i scores that are equal to or lower than 0.5 for more than 50% of the patients.

Given that the dataset we used is annotated by humans and evolves, we wanted to have an estimate of how much the changes might affect the predictive capabilities of RDD. To achieve this we repeated the tests described in all the previous subsections of "BENCHMARKING THE RARE DISEASE DISCOVERY ALGORITHM" for the ORPHANET dataset of 2015. We found that the difference in F-Score of RDD between the two sets was smaller than 3% when noise was large (20 noisy symptoms) and less than 0.2% when symptoms were accurate (Supporting Figure 5). We also observed that the median score of the correct prediction when 25%, 50%, or 75% of symptoms are absent increased by approximately 20% when we changed the 2014 dataset for the 2015 dataset (Supporting Figure 6). These results suggest that the human curation of the ORPHANET dataset is improving over time, which also improves the quality of the results of computer-assisted differential diagnosis (DDX) tools that use it, as is the case of RDD.