GRIPT: a novel case-control analysis method for Mendelian disease gene discovery

Despite rapid progress of next-generation sequencing (NGS) technologies, the disease-causing genes underpinning about half of all Mendelian diseases remain elusive. One main challenge is the high genetic heterogeneity of Mendelian diseases in which similar phenotypes are caused by different genes and each gene only accounts for a small proportion of the patients. To overcome this gap, we developed a novel method, the Gene Ranking, Identification and Prediction Tool (GRIPT), for performing case-control analysis of NGS data. Analyses of simulated and real datasets show that GRIPT is well-powered for disease gene discovery, especially for diseases with high locus heterogeneity. Electronic supplementary material The online version of this article (10.1186/s13059-018-1579-x) contains supplementary material, which is available to authorized users.

for a small proportion of the patients, pathogenic mutations of the same gene was 136 randomly selected from The Human Gene Mutation Database (HGMD) and spiked into 137 a small proportion (e.g. 0.5%, 1%, 2%, or 3%, respectively) of individuals in the patient 138 cohort (see methods). The size of patient cohort was set at 600 and the control cohort 139 at 5000. The simulation for each scenario was repeated 30 times. A genome-wide 140 statistical significance level (GWSL) of 2.7´10 -6 was used as the significant p-value 141 cutoff for multi-testing correction (given about 18500 autosomal protein-coding genes 142 annotated by RefSeq genes). The performance of GRIPT was measured with three 143 parameters: 1) the ranking of the disease gene with spike-in pathogenic mutations, 144 indicating the sensitivity of the tool; 2) the percentage of simulation runs in which the 145 disease gene passes GWSL, indicating the statistical power of the tool; and 3) the 146 number of significant autosomal candidate genes, indicating the specificity of the tool. 147 Furthermore, the performance of GRIPT was compared with four popular cohort 148 analysis tools, including the Mendelian disease gene finder, VAAST2, and three group-149 wise association tests, the CMC (burden test), SKAT and KBAC (kernel model), on the 150 same datasets. 151

152
The sensitivity and specificity of GRIPT under the AR and AD models 153 To test the performance of GRIPT in identifying AR disease gene, RPE65 was used as 154 an example. RPE65 is a well-studied gene with mutations known to cause AR Leber 155 congenital amaurosis (LCA) and Retinitis Pigmentosa (RP) [20][21][22]. The performance of 156 the four tests was summarized in Figure 3 and Supplementary table S1. Figure 3A-C and 157 table S1 demonstrate that GRIPT has great sensitivity and specificity in detecting RPE65, 158 even when the proportion of RPE65 patients was very low, mimicking the scenario of 159 patient cohort with high locus heterogeneity. When the RPE65 patient proportion was as 160 low as 0.5%, GRIPT ranked RPE65 on average sixth, achieving 66.67% power. When 161 the RPE65 patient proportion reached ≥ 1%, GRIPT ranked RPE65 first in all trials with 162 100% power. Across the range of RPE65 patient proportions, GRIPT identified on 163 average three significant candidates per simulation. In contrast, with a low proportion of 164 RPE65 patients, the other four algorithms had significantly lower sensitivity and power 165 than GRIPT (WRST, p-value see Supplementary table S1). For example, when the 166 RPE65 patient proportion was ≤ 1%, the powers of the other four tests were ≤ 10% and 167 the mean rank of RPE65 was between 38 and 3068. Each of the other four methods 168 identified on average zero or one significant candidate gene. 169

170
In parallel, the performance of GRIPT in identifying AD disease gene was tested using 171 TINF2 as an example. TINF2 is a known, disease-causing gene of AD Revesz syndrome 172 and Dyskeratosis congenital [23][24][25]. As shown in Figure 3D-F and table S1, GRIPT 173 lacked power when the TINF2 patient proportion was very low, but its performance was 174 greatly improved as the TINF2 patient proportion increased. Specifically, as TINF2 patient 175 proportion increased from 0.5% to 1%, the power of GRIPT increased from 3.33% to 176 53.33%. When the TINF2 patient proportion reached ≥ 2%, TINF2 was always ranked 177 first by GRIPT with 100% power. On average, GRIPT identified about two significant 178 candidate genes. In comparison, the other four methods had significantly worse 179 performance than GRIPT (WRST, p-value see Supplementary table S1). For example, 180 when TINF2 patient proportion increased from 0.5% to 1%, the power of VAAST2 181 increased from 0% to 13.33%, CMC from 0% to 36.67%, SKAT from 0% to 6.67%, and 182 KBAC from 0% to 6.67%. 183 184 Benchmark on 400 randomly selected known disease genes 185 To further expand the evaluation of GRIPT, we performed simulation using 400 Mendelian 186 disease-causing genes randomly selected from the OMIM database, including 200 AR 187 and 200 AD disease genes. For each gene, we simulated the patient cohorts with a size 188 of 600 and used the same simulated control cohort with a size of 5000. The results were 189 summarized in Figure 4 and Supplementary table S2. 190 191 Consistent with the results for RPE65, GRIPT showed outstanding sensitivity and 192 specificity in detecting the 200 AR genes even when the proportion of patients attributed 193 to the same disease gene was very low ( Figure 4A-C). Consistently, VAAST2, CMC,194 SKAT and KBAC showed significantly worse performance than GRIPT when patient 195 cohort had high locus heterogeneity ( Figure 4A-C, WRST, p-value see Supplementary  196 table S2). When the proportion of patients attributed to the same disease gene was as 197 low as 0.5%, the disease genes were ranked on average 24 th by GRIPT achieving 52.5% 198 power, whereas the other four methods had 0% power. When the patient proportion 199 equaled to 1%, the disease genes were ranked on average first by GRIPT with 97% power. 200 In contrast, the power of the other four methods were between 0.5% and 11.5%. When 201 the patient proportion reached ≥ 2%, the disease genes were always ranked first by 202 GRIPT with 100% power. In comparison, the power of the four methods were between 203 11.5% and 97.5%. Across the range of patient proportions, GRIPT identified on average 204 one significant candidate gene compared to zero or one candidate by each of the other 205 four methods. 206 207 Consistent to the results of TINF2, the overall performance of GRIPT was better than or 208 comparable to the other four methods in detecting the 200 AD genes (WRST, p-value see 209 Supplementary table S2). When the proportion of patients attributed to the same disease 210 gene was ≤ 1%, GRIPT and the other four tests have very low power, i.e. ≤ 29.5% for 211 GRIPT, ≤ 13% for VAAST2, ≤ 21.5% for CMC, ≤ 31% for SKAT, ≤ 4.5% for KBAC ( Figure  212 4D-F). When the patient proportion attributed to the same gene increased to 2%, the 213 disease genes were ranked on average third by GRIPT with 87% power. In comparison, 214 the power of the other four tests were between 68% and 85.5%. When the patient 215 proportion reached 3%, the disease genes were ranked first in 97.5% of simulations by 216 GRIPT with 99% power. Comparably, the power of the other four tests increased to 93% 217 -99%. Across the range of patient proportions, on average one to two significant 218 candidate genes were identified by GRIPT compared to between zero and five candidates 219 by the other four methods. 220 221

Simulations suggest GRIPT is highly robust 222
The performance of case-control cohort analysis can be potentially impacted by several 223 confounding factors, such as patient cohort size, population stratification, and the cutoff 224 of variant filtering frequency, and the control cohort size. To assess their impact, we 225 performed simulations using RPE65 and TINF2 as examples under the AR and AD 226 models respectively, and compared GRIPT with VAAST2, CMC, SKAT and KBAC using 227 the same datasets under each scenario. In addition, we tested the effect of different 228 variant score systems on the performance of GRIPT. 229

230
The sample size of the patient cohort 231 We simulated the patient cohorts in a range of sizes, i.e. 50, 100, 300, 600, and 800, with 232 2% of patients carrying the pathogenic mutations of the same disease genes, and control 233 cohorts with a size of 5000. The results were summarized in Figure 5 and Supplementary 234 As shown in Figure 5A-C, under the AR model, GRIPT maintains high sensitivity for 237 patient cohorts with a variety of sizes and high locus heterogeneity although its specificity 238 decreased for small patient cohorts with high locus heterogeneity. In comparison, the 239 other four methods performed significantly worse than GRIPT under the same situations 240 (WRST, p-value see Supplementary table S3). Specifically, as the patient cohort size 241 increased from 50 to 300 with 2% of patients carrying the RPE65 pathogenic mutations, 242 the mean rank of RPE65 increases from 31 to 1 by GRIPT with 100% power. The number 243 of significant candidates identified by GRIPT decreased from 107 to 8. When the patient 244 cohort size reached ≥ 300, GRIPT always ranked RPE65 first with 100% power. The 245 average number of significant candidates decreased to between one and eight. In 246 contrast, the power of the other four methods was 0% when the patient cohort size < 300. 247 When the patient cohort size reached ≥ 300, the power was 33.33%-100% for VAAST2, 248 0%-40% for CMC, 3%-56.67% for SKAT, and 0%-16.67% for KBAC. And the average 249 number of significant candidates identified by each of the four methods was between 0 250 and 26. 251 252 Under the AD model, when patient cohort was small and had high locus heterogeneity, 253 GRIPT had low sensitivity and specificity, but its performance was greatly improved as 254 the patient cohort size increased ( Figure 5D-F). The other four methods performed 255 comparably or significantly worse under the same scenarios ( Figure 5D-F, WRST, p-256 value see Supplementary table S3). Specifically, when the patient cohort size increased 257 from 50 to 100 with 2% of patients attributed to TINF2, the power of GRIPT increased 258 from 6.67% to 33.33% and the average number of significant candidates decreased from 259 79 to 28. When the patient cohort size increased to ≥ 300, TINF2 was ranked on average 260 first by GRIPT with 100% power. The average number of significant candidates by GRIPT 261 was between two and eight. In comparison, when the patient cohort size < 300, the power 262 increased from 6.67% to 36.67% for CMC and remained at 0% for VAAST2, SKAT and 263 KBAC. When the patient cohort size reached ≥ 300, the power was between 3.33% and 264 100% for the four tests. The average number of significant candidates by each of the four 265 tests was between 0 and 103. between case and control cohorts were simulated at 0%, 20%, 40%, 60%, 80% and 100%. 275 The size of patient cohort was set at 500 and the control cohort at 5000. The proportion 276 of patients carrying the pathogenic mutations of the same gene was set at 1%. The results 277 were summarized in Figure 6 and Supplementary table S4. 278

279
As shown in Figure 6A-F, the sensitivity and specificity of GRIPT slightly decreased as 280 unmatched ethnicity proportion between cases and controls increased. However, GRIPT 281 is significantly less affected by population stratification than the other four methods even 282 when patient cohort had high locus heterogeneity (WRST, p-value see Supplementary  283   table S4). Specifically, under the AR model, as the unmatched ethnicity proportion 284 between patients and controls increased from 0% to 100% (namely, from the completely 285 matched to the completely unmatched), the mean rank of RPE65 dropped from 1 to 32 286 by GRIPT but always with 100% power ( Figure 6A-C). Specificity was reduced as the 287 average number of significant candidate genes increased from 2 to 111 ( Figure 6A-C). In 288 comparison, the powers of CMC, SKAT and KBAC were between 0% and 20%. The 289 average number of significant candidate genes increased from 1 to 1929 for CMC, from 290 0 to 2603 for SKAT, and from 0 to 1921 for KBAC. In addition, as the unmatched ethnicity 291 proportion increased, the running time for VAAST2 dramatically increased (e.g. needs 292 120-240 hours with 5 parallel CPUs to finish one simulation run), VAAST2 was only tested 293 for the unmatched ethnicity proportion ranging from 0% to 60%. Under those scenarios, 294 the power of VAAST2 was between 10% and 26.7%. The average number of significant 295 candidate genes identified by VAAST2 increased from 0 to 1502. increased from 0% to 100%, the mean rank of TINF2 dropped from two to nine by GRIPT 300 with 96.67%-100% power ( Figure 6D-F). The mean number of significant candidate 301 genes increased from 3 to 19. In comparison, the mean rank of TINF2 dropped from 3 to 302 75 for VAAST2, from 7 to 57 for CMC, and from 44 to 166 for SKAT, and from 3 to 33 for 303 KBAC. The power was 0%-13.33% for VAAST2, 53.33%-66.67% for CMC, 0%-3.33% for 304 SKAT, and 0%-6.67% for KBAC. The average number of significant candidate genes 305 increased from zero to five for VAAST2, from 4 to 35 for CMC, from zero to two for SKAT, 306 and from zero to one for KBAC. (Figure 6D-F). 307 308

Variant frequency filtering 309
Mendelian disease-causing mutations are expected to be very rare in the population, and 310 common human variants are likely benign for rare Mendelian diseases. Therefore, to 311 reduce the analysis/computation complexity, variants from WES are conventionally first 312 filtered out common human genome variants based on allele frequency in large database 313 of human genome variants, e.g. gnomAD and ExAC. To mimic this scenario, the above 314 patient and control cohorts were simulated using the variants whose maximum population 315 frequency ≤ 0.5% in ExAC database for the AR model, and whose maximum population 316 frequency ≤ 0.01% for the AD model. Here, we examined the impact of a relaxed (i.e. 317 higher) frequency filtering cutoff on the disease gene identification methods. We 318 simulated the WES data of patient and control cohorts using a range of variant frequency 319 cutoffs respectively: ≤ 0.5%, ≤ 1% and ≤ 2% for the AR model, and ≤ 0.01%, ≤ 0.5% and 320 ≤ 1% for the AD model. The proportion of patients attributed to the same gene was set at 321 1%. The size of patient cohort was set at 600 and control cohort at 5000. The results 322 show that inclusion of more variants/noise per individual by using higher frequency 323 filtering cutoff had little impact on GRIPT's performance under the AR model, but it 324 reduced its power under the AD model. The performance of the other four methods were 325 largely compromised and were significantly worse than or comparable to that of GRIPT 326  In contrast, the ranking of RPE65 by the other four tests was largely decreased, with ≤ 332 10% power for VAAST2, 0% power for CMC, SKAT and KBAC. Under the AD model, as 333 the variant frequency cutoff increased from 0.01% to 1%, the average rank of TINF2 334 dropped from 5 to 590 by GRIPT with power decreasing from 53.33% to around 3%. The 335 average number of significant candidate genes was between zero to two ( Figure 7D-F). 336 The power of VAAST2 decreased from 13.33% to 10%, CMC from 36.67% to 0%, SKAT 337 from 6.67% to 0% for SKAT, and KBAC from 6.67% to 0%. 338

339
The effect of the control cohort size 340 Theoretically, the variant spectrum of a gene in a large control cohort should be less 341 biased and closer to the true distribution than that in a small control cohort. Thus, large 342 control cohorts can better serve as the control/baseline, for example, to exclude the genes 343 with large numbers of rare benign variants in population. To test the effect of control 344 cohort size, we simulated smaller control cohorts with a size of 600 and used the previous 345 case cohorts with a size of 600 to repeat the analysis. The results were summarized in 346 Figure 8 and Supplementary table S6.  347 348 Under the AR model, GRIPT remained sensitive in ranking RPE65. When the RPE65 349 patient proportion increased from 0.5% to ≥ 2%, the mean rank of RPE65 increased from 350 45 to 1. However, the p-value of RPE65 did not pass the GWSL in any of the simulations, 351 showing GRIPT with 0% power. Consistent to the results with larger control cohort, the 352 other four tools performed significantly worse than GRIPT ( Figure 8A To test whether the performance of GRIPT will be affected by different variant score 370 systems, besides CADD score, we applied the DANN and REVEL scores to annotate the 371 variant scores in GRIPT respectively, and repeated the aforementioned analyses. DANN

Comparison to the traditional GWAS single variant test 391
To compare the performance of GRIPT with the traditional GWAS single variant test, we 392 simulated the basic scenario with 0.5%-3% of patients carrying the pathogenic mutations 393 of RPE65 and TINF2 respectively, and applied GRIPT and Fisher's exact test to the data. 394 As shown in Figure 9 and Supplementary table S1, Fisher's exact test performed much 395 worse than GRIPT. Under the AR model, when the RPE65 patient proportion was 0.5%, 396 RPE65 was ranked on average sixth by GRIPT with 66.67% power. When the RPE65 397 patient proportion was ≥ 1%, RPE65 was always ranked first by GRIPT with 100% power. 398 In contrast, the average ranking of RPE65 by Fisher's exact test was in the range of 890 399 to 32000, always with 0% power. Under the AD model, as TINF2 patient proportion 400 increased from 0.5% to 1%, the power of GRIPT increased from 3.33% to 53.33%. When 401 the TINF2 patient proportion was ≥ 2%, GRIPT always ranked TINF2 first with 100% 402 power. In comparison, as the proportion of TINF2 patients increased, the average ranking 403 of TINF2 by Fisher's exact test was improved from 12675 th to 23 th , but the test power 404 remained at 0%. The reasons may be: 1) GRIPT is a gene-wise test that ranks the 405 functional effects of variants and incorporates the Mendelian inheritance models to 406 compute the gene score. In contrast, the traditional single variant test considers one 407 variant in a gene each time, and is mainly based on the allele frequency difference 408 between cases and controls. Thus, the single variant test does not have sufficient power 409 to detect the heterogeneous rare deleterious variants in Mendelian disease cohorts, 410 although it might be suitable for common complex diseases. 2) the multiple test correction 411 requests a much more stringent p-value cutoff for the single variant test than the gene-412 wise GRIPT due to the larger number of tests applied in the single variant test than in 413 GRIPT (i.e. variants vs. genes). 414 415

Analysis of real patient cohort data display GRIPT's excellent performance 416
To further validate the performance of GRIPT, we applied it to real WES data of three 417 different patient cohorts respectively, including a Leber's congenital amaurosis (LCA) 418 cohort, a Retinitis pigmentosa (RP) cohort, and a congenital disorder of glycosylation 419 (CDG) cohort. Both the LCA cohort and RP cohort were composed of the patients carrying 420 the pathogenic mutations of different genes, and the proportion of patients attributed to 421 each disease gene was small. Furthermore, the patient ethnicity of the LCA cohort or RP 422 cohort was an admixture of Caucasian, African American, Latino, and Asian. Whereas, 423 the CDG cohort was composed of the patients all attributed to PGM3 from two families. 424 The performance of GRIPT was also compared with VAAST2, CMC, SKAT and KBAC on 425 the same datasets. 426

427
The LCA cohort 428 LCA is a genetic heterogeneous disease and can be caused by mutations in at least 22 429 genes ( http://www.sph.uth.tmc.edu/RetNet, accessed as September 3rd, 2017). We 430 performed WES on 115 sporadic LCA patients. As LCA is a rare Mendelian disorder, 431 variants with maximum population allele frequency > 0.5% were filtered out based on 432 the allele frequency in the large public databases of normal populations (i.e. 1000 433 genome, dbSNP, ESP6500, ExAC, gnomAD) and an internal database. We only 434 focused on rare protein-changing variants including nonsense variants, splicing 435 donor/acceptor variants, missense variants, and small INDELs, since they are more 436 likely to be the disease-causing mutations. One previously simulated control cohort 437 (n=5000) was used as the control cohort for these tests. 438

439
GRIPT showed high sensitivity for the LCA cohort with high locus and ethnicity 440 heterogeneity. It successfully detected the disease gene that only accounted for ≤ 1% of 441 the patients. Specifically, the first nine candidate genes ranked by GRIPT were all known 442 retinal disease genes (Table 1). Among a total of 203 significant candidates, 19 genes 443 were known disease genes, each of which accounted for 0.87%-6.09% (one to seven 444 patients) of the cohort. Most interestingly, GRIPT was able to identify novel retinal disease 445 genes, i.e. POMGNT1 (p = 2.81 ´ 10 -10 ) and MFSD8 (p = 2.81 ´ 10 -10 ). In comparison, the other tools lacked power in detecting the disease genes accounting 453 for small proportions of this cohort. A total of 7 significant candidates were identified by 454 VAAST2, 27 by CMC, 6 by SKAT, and 1 by KBAC. Among them, 5 genes by VAAST2 455 were known disease genes, 3 genes by CMC, 2 genes by SKAT, and 1 genes by KBAC, 456 each of which accounted for 2.61%-6.09% (three to seven patients) of the cohort. 457 However, none of these known genes were the recently identified novel retinal disease 458 genes. 459

460
The RP cohort 461 RP is an inherited retinal disease with even greater genetic heterogeneity compared to 462 LCA. So far, mutations in more than 65 genes were found to cause the disease 463 ( http://www.sph.uth.tmc.edu/RetNet, accessed by September 3rd, 2017). WES was 464 performed for 154 sporadic RP patients. After filtering, the WES data of the real patient 465 cohort and a simulated control cohort (n=5000) were subjected to analysis. GRIPT again 466 showed excellent power in identifying low frequency disease genes underlying the cohort 467 with high locus and ethnicity heterogeneity. As shown in Table 2, eight genes whose 468 rankings ranged from first to eleventh by GRIPT were known retinal disease genes. 469 Among the 157 significant candidates (p < 2.7e-6) identified by GRIPT, 17 are known 470 disease genes, each of which explained 0.649%-8.44% (1 to 13 patients) of the cohort. 471 In comparison, the other tools had weak power in detecting the low frequency disease 482 genes underlying this cohort. A total of 4 significant candidate genes were identified by 483 VAAST2, 25 by CMC, 6 by SKAT, and 2 by KBAC. Among them, 2 genes by VAAST2 484 were known disease genes, 0 by CMC, 1 by SKAT and 0 by KBAC, each of which 485 accounted for 5.19%-8.44% (8 to 13 patients) of the cohort. And none of these known 486 genes were the novel retinal disease genes recently identified. 487

488
The CDG cohort 489 The CDG cohort was composed of six patients from two families who all carry the 490 pathogenic mutations of PGM3 gene. The WES data were downloaded from dbGaP 491 (phs000809.v1.p1). Thus, this cohort serves as a real data example of a genetic 492 homogeneous disease with extremely small case cohort size from an independent 493 external source. After filtering and annotation, the real WES data and a simulated control 494 cohort (n = 5000) were analyzed by the five tools. GRIPT showed the highest accuracy 495 and efficiency in analyzing this homogeneous external cohort. GRIPT correctly ranked 496 PGM3 first (p =0), taking less than 30 minutes with one CPU. VAAST2 also ranked PGM3 497 first (p= 2.50 ´ 10 -6 ) but took about 6 hours with 5 parallel CPUs. CMC ranked PGM3 11 th 498 (p = 3.79 ´ 10 -64 ) and took 2.5 hours with one CPU. The p-value of PGM3 by SKAT equals 499 to 0 but is the same as the other 162 genes (p = 0), taking 9.3 hour with one CPU. The 500 p-value of PGM3 by KBAC equals to 2 ´ 10 -6 but is the same as the other 62 genes (p = 501 2 ´ 10 -6 ), taking 7.8 hour and one CPU. 502 503

504
In this study, we developed a novel computational method named GRIPT for Mendelian 505 disease gene discovery through analyzing the NGS data of patient-control cohorts. The 506 null hypothesis of GRIPT is that a non-disease gene should have similar deleterious 507 mutations load in cases and in controls. GRIPT scores and compares the deleterious 508 mutations load of each gene in the genome between patients and controls using a 509 composite Fisher's test, and prioritizes the genes that have significant higher deleterious 510 mutation loads in cases than in controls as the candidate disease genes. 511 512 Both simulation and real data tests indicate that GRIPT has great sensitivity and 513 specificity and is highly reliable in discovering Mendelian disease genes. For example, as 514 shown in the benchmark of 400 known disease genes, under the AR model, GRIPT 515 ranked the disease gene first in 97.5% of the simulations for a patient cohort with a size 516 of 600 and with only 1% of patients carrying the pathogenic mutations of the same gene. 517 In addition, the disease gene was usually the only significant candidate gene identified 518 by GRIPT ( Figure 4A-C). Under the AD model, GRIPT ranked the disease genes in the 519 top three in 93.5% of the simulations when 2% of patients (cohort size =600) were 520 attributed to the same gene ( Figure 3D-F). The average number of significant candidates 521 was about two. Furthermore, the results from analysis of real patient data were consistent 522 with the benchmark results. For the LCA cohort (size n = 115), GRIPT was able to 523 systematically and accurately identify 19 disease genes (5 genes by VAAST2, 3 genes 524 by CMC, 2 genes by SKAT, and 1 by KBAC). The candidates ranked from first to ninth 525 were all real disease genes. For the RP cohort (size n = 154), GRIPT was able to 526 accurately identify 17 genes (2 genes by VAAST2, 0 genes by CMC, 1 genes by SKAT, 527 and 0 by KBAC) with seven of the top 10 candidates being real disease genes. Each of 528 the disease genes identified by GRIPT only accounted for 0.649%-8.44% (1 to 13 patients) 529 of patients in the LCA or RP cohort. Moreover, as shown in the simulation, GRIPT reached 530 around 100% power and always ranked the genes to the top for large patient cohorts (e.g. 531 size n ≥ 300) and/or more homogeneous patients (e.g. the same gene explaining ≥ 3% 532 of the patients), which was also demonstrated by the analysis of the CGD cohort with a 533 size of six and all attributed to the gene PGM3. Most interestingly, GRIPT was able to 534 discover four newly reported disease genes in the analysis of real patient data. Each of 535 these newly discovered genes only accounted for one or two (0.649%-1.3%) patients in 536 the patient cohort. Overall, GRIPT shows the great power in discovering known and novel 537 Mendelian disease genes. It is especially well suited to analyze diseases with high locus 538 (and ethnicity) heterogeneity, which is a major challenge for solving the underlying 539 genetics mechanisms of Mendelian disorders. 540 541 GRIPT is also more robust and significantly less affected by potential confounding factors 542 than other disease gene finders. For example, GRIPT remained powerful for small patient 543 cohorts with high locus heterogeneity. In simulation, under the AR model, for a patient 544 cohort with a size of 100 and only two (2%) patients carrying the pathogenic mutations of 545 the same gene, the disease gene was ranked on average third by GRIPT with 100% 546 power. In contrast, the mean ranking of the disease gene by other tools was between 547 ~150 and ~3300 and all with 0% power. This result was also consistent with results from 548 real data as previously discussed. Furthermore, using higher allele frequencies as the 549 variant filtering cutoff, which presumably adds more noise to the analysis, had little impact 550 on the performance of GRIPT under the AR model. In the simulation, for a patient cohort 551 with a size of 600 and with six (1%) patients attributed to the same gene, as the cutoff of 552 variant frequency filtering increased from 0.5% to 2%, the disease gene was ranked first 553 in 98.89% of simulations by GRIPT with 100% power. In comparison, the mean rank of 554 the gene was between 11 and 38 by VAAST2, between 2953 and 4420 by CMC, between 555 269 and 2095 by SKAT, and between 1306 and 1655 by KBAC, all of which had power 556 below 10%. More importantly, GRIPT is significantly less affected by the combined effect 557 of population stratification and high locus heterogeneity, which occur frequently in real 558 data and severely impair the performance of other tools as shown in the simulation and 559 real data analysis. In the simulation of the worst-case scenario where the ethnicity of the 560 patient cohort was completely unmatched by that of the control cohort and with only 1% 561 of the patient cohort (with a cohort size of 500) attributed to the same disease gene under 562 the AR model, GRIPT ranked the disease gene, on average, 32 th with 100% power 563 although it generated around 107 significant candidates. In contrast, the mean ranking of 564 the disease gene by other tools was greater than 3500 (power ≤ 20%), each of which 565 generated more than 1500 significant candidates. Consistently, the other tools displayed 566 lack of power in the real LCA and RP cohorts with mixed ethnicity and high locus 567 heterogeneity. 568 569 The performance advantage of GRIPT might be partly due to that it scores the mutation 570 Simulation results also suggest that to optimize the performance of GRIPT, the following 587 conditions should be considered. First, as one of the key factors affecting sensitivity is the 588 proportion of patients attributed to the same gene, it is highly desirable to increase the 589 homogeneity of patient cohort. One possible approach is to perform detailed phenotyping 590 and gather the patients who share similar phenotypes and are likely due to mutations in 591 one or a small number of genes. Second, while maintaining the homogeneity of the patient 592 cohort, increasing the patient cohort size can also improve sensitivity. For example, by 593 increasing the patient cohort size from 50 to 100 while maintaining 2% of patients carrying 594 disease mutations of the same gene under the AR model, the average rank of the disease 595 gene increased from 31 to 3 by GRIPT. Third, using the correct inheritance model when 596 running GRIPT can leverage its power. If the inheritance model of the diseases is unclear, 597 GRIPT should be run using different models, including AD, AR, XD and XR, respectively. 598 Fourth, reduction of the noises in the input variants will improve the outcome. For example, 599 large databases of "normal" populations, e.g. gnomAD and ExAC should be used to pre-600 filter variants and remove common benign variants that are unlikely to cause diseases, 601 while filtering with internal databases can weaken the error/bias from the sequencing 602 platforms and variant callers. Furthermore, under different inheritance models, the 603 mutations should be pre-filtered with different frequency cutoffs (for example, the variant 604 filtering frequency for AD model should be more stringent, namely lower than for AR 605 model). Additionally, removing the genes that are highly mutable but known not causing 606 diseases can reduce noise as well. Fifth, the accuracy of variant function/pathogenicity 607 prediction will also impact the performance of GRIPT. Currently GRIPT applies the well-608 established integrative allele prediction score, i.e. CADD score, to predict the 609 pathogenicity of variants. However, as the scoring system of GRIPT is flexible, users can 610 easily substitute the CADD score with any other score generated by better algorithms for 611 variant pathogenicity prediction. In aforementioned analysis, we also used DANN and 612 REVEL scores as the variant score, which generates the consistent results, suggesting 613 the reliability and robustness of the statistic test framework of GRIPT. The thumb of rules 614 for using variant score systems is: 1) the scoring systems should reliably and 615 quantitatively predict the deleteriousness of variants. 2) the scores should be 616 scaled/normalized into a genome-wide ranked score to allow the comparison 617 implemented in the statistic test of GRIPT.
3) The score system should be comprehensive 618 and cover all the possible SNP and INDEL in the genome. 619 620 Although GRIPT does not directly identify pathogenic mutations, by identifying candidate 621 (novel) disease genes, it will dramatically reduce the number of variants to be considered 622 for each patient and therefore greatly facilitate the identification of potential mutations.

635
In summary, we developed a highly accurate and robust case-control analysis method, 636 GRIPT, for discovery of Mendelian disease genes. It is especially powerful in detecting 637 disease genes underlying diseases with high locus heterogeneity and is less affected by 638 population stratification. It is also efficient, portable, and flexible. In addition, we generated 639 a WES data simulator which is capable of unbiasedly simulating the WES data of control 640 cohorts with any sample size, gender ratio, and population ethnicity for the usage of 641 GRIPT or other tools. As NGS technology advances (e.g. the decrease in cost and time) 642 and greater amounts of large cohort data become available, we envision that GRIPT will 643 make a significant contribution to the discovery of novel Mendelian disease genes and 644 pave the way for better understanding, diagnosis, prevention, and treatment of Mendelian 645

Each variant is scored to quantify the deleteriousness 649
The hypothesis that GRIPT tests is whether the deleterious mutation loads of a disease-650 causing gene is significantly higher in case cohort than in control cohort. To quantify the 651 deleteriousness of variants, in this study, we applied Combined Annotation Dependent 652 Gene score distribution is highly skewed for rare Mendelian disorders 688 As mentioned above, each gene has a score in each case or control individual, ranging 689 from 0 to 1 (for dominant models) or 2 (for recessive models). Then, for each gene, we 690 compare the gene score distribution in case cohort to that in control cohort. The null 691 hypothesis is that the deleterious mutations load of a gene is not significantly different 692 between cases and controls. Thus, the significance of the one-tailed alternative 693 hypothesis that the deleterious mutations load is higher in cases than controls could 694 suggest the likelihood of the gene associated with the disease. 695

696
To choose the appropriate statistic test, we first characterized the gene score distribution. 697 We found that the score distributions of most genes are highly skewed with excesses of 698 zeros. This is expected mainly because Mendelian diseases are rare and so are the 699 disease-causing mutations. Usually, after filtering out known common human variants 700 which are likely benign, only a small number of rare variants (e.g. MAF ≤ 0.5%) in cases 701 and controls will be kept. Moreover, among the filtered rare variants, only some of them 702 have deleterious effects, therefore, only these rare, deleterious variants will have positive 703 variant-scores. In addition, the recessive model requires a biallelic state to assign a 704 positive gene score in one individual. Thus, the scores of a gene in most case individuals 705 and control individuals are zeros. An example of USH2A gene score distributions in our 706 retinal disease patient cohort (n = 250) and an internal control cohort (n = 250) is shown 707 in Figure 2. 708 709

Combining two separate statistical tests with Fisher's test 710
To compare the highly skewed distributions of gene scores in case and control cohorts 711 derived above, we test a composite null hypothesis by applying a Fisher's test to 712 combine two separate tests including a binomial test and a WRS test [19]. The 713 composite null hypothesis is designed to answer two questions. The first question is 714 whether the proportions of non-zero scores are similar in case cohort and control cohort 715 (Z 1 =0). The second question is whether the values of non-zero scores are similar in 716 case cohort and control cohort (Z 2 = 0). Namely, Fisher's method will test the H 0 : Z 1 =0 717 and Z 2 =0 versus the one-tailed alternative H 1 : Z 1 > 0 and/or Z 2 >0 [19]. 718 719 Let N 1 and N 2 be the total number of cases and controls. Let n 1 and n 2 be the number of 720 non-zero score in cases and controls respectively.

728
The second statistic, Z 2 , represents the difference of the non-zero scores between 729 cases and controls. The standardized Wilcoxon rank sum test was applied to test 730 whether the gene cores in cases are significantly higher than those in controls. Let . 731 denote the corresponding one-tailed p-value. 732 Finally, Fisher's method is used to test the composite null hypothesis H 0 : Z 1 =0 and Z 2 734 =0 at one-tailed level based on a combination of Z 1 and Z 2 or * and . as follow: frequency of "A>T" is 0.2% and the allele frequency of "A>G" is 0.5%, then in the 762 simulated WES data of one person, there is 0.2% of chances the simulator will output 763 the SNP "A>T", 0.5% of chances will output the SNP "A>G", and 99.3% of chances the 764 simulator will generate "A>A", namely not output any SNP in chr1:10000. Thus, each 765 generated variant follows a multinomial distribution according to its frequency in the 766 user-selected ethnic population based on the ExAC database. For a given number (N) 767 of individuals with a given sex ratio, the simulator will generate "N" WES data file 768 individually. Each WES data file includes information such as reference nucleotides, 769 altered nucleotides, the coordinates in the genome, and the CADD scores of the 770 variants. 771 772

Simulation of patient and control cohorts 773
To evaluate the performance of GRIPT, we performed the simulation tests on GRIPT 774 and similar tools, i.e. VAAST2, CMC, SKAT and KBAC. The WES data of the patient 775 cohort and control cohort were first generated using the WES data simulator mentioned 776 above. Given the rare frequency of Mendelian disease-causing variants in normal 777 population, for the AR model, the WES data were simulated based on the variants 778 whose maximum population frequency was ≤ 0.5% in ExAC database by default, while 779 for the AD model, based on the variants whose maximum population frequency was ≤ 780 0.01% in ExAC database by default, unless otherwise specified. We used "adjusted" 781 average population frequency as the default variant frequency, unless otherwise 782 specified. Then, we randomly selected pathogenic variants of a given disease gene 783 from HGMD database with MAF ≤ 0.5% in ExAC database, and inserted them into a 784 given percentage of individuals randomly selected from the patient cohort to mimic the 785 patient cohort with genetic heterogeneity. In the AR model, two variants were 786 respectively selected from HGMD and spiked into each selected individual. Thus, the 787 two variants spiked into the same individual can be the same (homozygous) or different 788 (heterozygous). In the AD model, only one variant was randomly selected and spiked 789 into each selected individual. Therefore, under the AR or AD model, the pathogenic 790 mutations of a given gene can be the same or different within and between the patients.

Preprocessing the variants in cis 806
To reduce false positive, we recommend the users to handle the variants in cis before 807 inputting data into GRIPT . However, given that it is not always possible to obtain accurate 808 phasing information, GRIPT can tolerate imperfect phasing as shown in the 809 aforementioned simulation and real data analyses. Currently, a preprocessing script 810 included in the GRIPT package was used to handle variants in cis, which perform the 811 following operations: 812 813 1) If the genomic coordinates of two variants are within 100bp, Fisher's exact test will be 814 performed to determine whether the two variants are in cis by comparing the ratio of the 815 variant base sequencing coverage to the reference base sequencing coverage of the two 816 variants. If the two variants are in cis and within 100bp, they can be covered by a large 817 number of the same sequencing reads, therefore their read coverage ratios would be 818 similar and Fisher's exact test p-value would be large. In contrast, if they are in trans and 819 close to each other, they would be covered by different sequencing reads, thus the read 820 coverage ratios of the two variants would be different and Fisher's exact test p-value 821 would be small. We take Fisher's exact test p ≥ 0.4 as the cutoff to deduce the read 822 coverage ratios of the two variants are similar, namely, the two variants as in cis, 823 otherwise as in trans. Using different p-value cutoff does not significantly impact on the 824 result. For example, we have used p < 0.05 as the cutoff to assign the variants in trans, 825 and p ≥ 0.05 to assign the two variants in cis. Although this could mistakenly assign a few 826 in-trans variants as in-cis, the results remained consistent. Because GRIPT is built on the 827 mutation burden in case cohort and control cohort but not a single case, a few imperfect 828 phasing cases can be tolerated. If the two variants are determined to be in cis by Fisher 829 test, the variant with higher variant score (e.g. CADD score) will be passed on to the 830 subsequent analysis, while the one with lower variant score will be ignored.  First, the samples of the case and control cohorts will be collected and be subjected to 915 NGS, e.g. WES. After variant calling, the known common and/or benign variants will be 916 filtered out based on the variant annotation and their allele frequency in large databases 917 of normal populations. Thus, for each gene, only a few rare variants will be left. Then 918 GRIPT will annotate and rank the deleteriousness of each variant, e.g. using CADD 919 score. Based on the variant scores, a gene score will be calculated to measure the 920 deleterious mutation load of each gene in every individual according to a given 921 inheritance model (see Methods). Next, a Fisher's test built upon the combination of a 922 binomial test and a Wilcoxon rank sum test (WRST) will be calculated to measure the 923 difference of gene score distributions between patient cohort and control cohort for each 924 gene, and a significance p-value associated with the test statistic will be assigned. This 925 composite test is especially well suited to measure the difference of two highly skewed The rankings of AR/AD genes generated by GRIPT were compared to those generated 962 by the other four methods respectively with one-tailed WRST. The methods that 963 generated significantly worse ranking than GRIPT were marked with ' * ' if p-value < 0.05, 964 ' ** ' if p-value < 0.01, and ' *** ' if p-value < 0.001. The patient cohort sizes were tested at 50, 100, 300, 600 and 800. The control cohort 968 size was set at 5000. The percentage of patients carrying the pathogenic mutations of 969 were compared to those generated by the other four methods respectively with one-tailed 976 WRST. The methods that generated significantly worse ranking than GRIPT were marked 977 with ' * ' if p-value < 0.05, ' ** ' if p-value < 0.01, and ' *** ' if p-value < 0.001. GRIPT were compared to those generated by the other four methods respectively with 990 one-tailed WRST. The methods that generated significantly worse ranking than GRIPT 991 were marked with ' * ' if p-value < 0.05, ' ** ' if p-value < 0.01, and ' *** ' if p-value < 0.001. The cutoff of variant filtering frequency was tested at 0.5%, 1%, and 2% under the AR 995 model, and at 0.01%, 0.5%, and 1% under the AD model. The percentage of patients 996 carrying the RPE65 or TINF2 pathogenic mutations was set at 1%. The patient cohort 997 size was 600. The control cohort size was 5000. The performance of GRIPT, VAAST2, 998 CMC, SKAT and KBAC are shown in red, blue, green, purple, and orange, respectively.   *** ***