CalPen (Calculator of Penetrance), a web-based tool to estimate penetrance in complex genetic disorders

Mutations conferring susceptibility to complex disorders also occur in healthy individuals, but at significantly lower frequencies than in patients, indicating that these mutations are not completely penetrant. It is therefore important to estimate the penetrance, that is, the likelihood of developing a disease in the presence of a mutation. Recently, a method to calculate penetrance and its credible intervals was developed on the basis of the Bayesian method and has since been used in the literature. However, in its present form, this approach demands programming skills. Here, we developed ‘CalPen’, a web-based tool for straightforward calculation of penetrance and its credible intervals by entering the number of mutations identified in controls and patients, and the number of patients and controls studied. For validation purposes, we show that CalPen-derived penetrance values are in good agreement with the published values. As a further demonstration of its utility, we used schizophrenia as an example of a complex disorder and estimated penetrance values for 15 different copy number variants (CNVs) reported in 39,059 patients and 55,084 controls, and 145 SNPs reported in 45,405 patients and 122,761 controls. CNVs showed an average penetrance of 7%, with 22q11.21 CNVs having the highest value (~20%) and 15q11.2 deletions the lowest value (~1.4%). Most SNPs, on the other hand, showed a penetrance of 0.7%, with rs1801028 having the highest penetrance (1.6%). In summary, CalPen is an accurate and user-friendly web-based tool useful in human genetic research to ascertain the ability of a mutation/variant to cause a complex genetic disorder.


Introduction
In contrast to simple Mendelian disorders, where mutation of a gene is always associated with the disease, in complex disorders mutations in multiple genes are not always associated with the disease but contribute as risk factors. Therefore, the risk-conferring mutations are also found in normal controls. In this context, geneticists use statistics to determine first whether the mutation occurs at a significantly higher frequency in patients than in controls to establish an association. Once such a significant association is established, it becomes important to determine the penetrance (likelihood of causing the disease) of the mutation. Accordingly, a mutation can be completely (100%) penetrant, in which case the presence of the mutation definitively causes the disease; highly penetrant (there is more than a 50% chance that the presence of the mutation causes the disease); or low penetrant (there is less than a 50% likelihood that the mutation causes the disease) [1]. However, in complex genetic disorders, the identified mutations are often incompletely (<100%) penetrant [2]. In order to calculate penetrance for a mutation conferring risk for a complex disease, Vassos et al. [3] developed a Bayesian method, which is currently being used worldwide. According to this method, five types of data are needed: (i) the number of mutations identified in a patient sample, (ii) the number of patients studied, (iii) the number of mutations identified in the control sample, (iv) the number of controls studied and (v) the general incidence of the disease under investigation in the population from which patients and controls are sampled (lifetime morbidity risk or baseline risk). This method involved simulation using the R statistical package to make prior distributions based on the observed frequencies in controls and patients.
From these curves, the 2.5th, 50th and 97.5th percentiles are extracted to obtain the median penetrance and its 95% credible intervals. However, these calculations require programming skills, which geneticists generally do not have. Given these difficulties, a user-friendly web-based tool is desirable for direct determination of penetrance and its credible intervals without the need for programming skills. If such a tool were available, it would enable calculation of penetrance for mutations reported in a number of studies of common disorders and may eventually result in a database that will be useful in prenatal genetic screening and genetic counselling.
Here, we used Python to develop 'CalPen', a free web-based tool that enables calculation of penetrance and its 95% credible intervals. We show that the values obtained using CalPen are in good agreement with the reported values. Using published data, we also estimated penetrance and credible intervals for the 15 most replicated copy number variants reported in schizophrenia patients from multiple studies. We also show the wider utility of CalPen by estimating penetrance using information on 145 SNPs significantly associated with schizophrenia.

Description of the code
The Python script developed is deposited on the GitHub software development platform (https://github.com/dyex719/penetrance) and in the supporting files (S1 and S2 Files). Briefly, penetrance is calculated by a population-based probabilistic method using the published dataset of copy number variations (CNVs) at different locations in the genomes of schizophrenia patients and controls, wherein the data selection criteria were identical to those described by Vassos et al. [3]. Based on the number of mutations identified in a given sample of controls and patients, median penetrance values were calculated from the formula:

$$P(D \mid G) = \frac{P(G \mid D)\,P(D)}{P(G \mid D)\,P(D) + P(G \mid \bar{D})\,P(\bar{D})}$$

wherein P(D|G) is the penetrance, or the probability of developing the disease (D) for individuals with genotype (G) carrying the CNV; P(G|D) is the frequency of the CNV in patients; P(D) is the lifetime morbidity risk (baseline risk) for the disease; P(G|D̄) is the frequency of the CNV in controls; and P(D̄) is the probability that an individual is normal (1 - P(D)). For determination of the credible intervals, we first derived binomial prior distributions for controls and patients for each mutation by Wilson's method [4] using Python ver. 3.5 and the SciPy package ver. 1.2. These prior distributions were generated using the respective frequencies of mutations as the mean values for both cases and controls. For each prior distribution, the value corresponding to Mean + 2σ gave the 97.5th percentile value, whereas Mean - 2σ gave the 2.5th percentile value (σ represents the standard deviation). From the 2.5th and 97.5th percentiles of controls and patients, the upper and lower bounds of the credible intervals were calculated for each mutation by the Bayesian method.
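The calculation described in this section can be sketched in Python as below. This is a minimal illustration and not the deposited script: the function names are hypothetical, the Wilson score interval is written out directly with z = 1.96, and the pairing used for the bounds (upper patient frequency with lower control frequency for the upper bound, and the reverse for the lower bound) is an assumption made for the example.

```python
import math


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def bayes_penetrance(freq_patients, freq_controls, baseline_risk):
    """P(D|G) from P(G|D), P(G|not D) and P(D) via Bayes' theorem."""
    numerator = freq_patients * baseline_risk
    return numerator / (numerator + freq_controls * (1 - baseline_risk))


def penetrance_with_interval(k_pat, n_pat, k_ctl, n_ctl, baseline_risk):
    """Return (lower, point, upper) penetrance estimates.

    Assumption: the bounds pair opposite ends of the patient and
    control frequency intervals; this pairing is illustrative only.
    """
    point = bayes_penetrance(k_pat / n_pat, k_ctl / n_ctl, baseline_risk)
    pat_lo, pat_hi = wilson_interval(k_pat, n_pat)
    ctl_lo, ctl_hi = wilson_interval(k_ctl, n_ctl)
    lower = bayes_penetrance(pat_lo, ctl_hi, baseline_risk)
    upper = bayes_penetrance(pat_hi, ctl_lo, baseline_risk)
    return lower, point, upper


# Illustrative inputs: 10 mutations in 1000 patients, 5 in 1000 controls,
# and a hypothetical baseline risk of 1%.
lower, point, upper = penetrance_with_interval(10, 1000, 5, 1000, 0.01)
print(f"penetrance = {point:.4f} (interval {lower:.4f} to {upper:.4f})")
```

With these inputs the point estimate is 0.0001 / (0.0001 + 0.00495), or about 2%. The sketch does not handle the case in which no mutation is seen in either group, where the formula is undefined.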

Development of the web-based tool
The web application was created using Python Flask (a web application framework written in Python to develop websites; www.flask.pocoo.org) and PythonAnywhere (www.pythonanywhere.com; an online integrated development environment and web-hosting service based on the Python programming language, www.python.org). HTML was used to provide a graphical user interface (GUI). Features were added so that this web-based HTML application alerts the user if a valid value is not provided for any of the five inputs.
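The input check mentioned above could look roughly like the following framework-independent sketch. The field names, the error messages and the choice to express the baseline risk as a fraction between 0 and 1 are all assumptions for illustration; the actual form fields used by CalPen are not described in the text.

```python
# Hypothetical names for the five inputs; not taken from the CalPen source.
REQUIRED_FIELDS = (
    "patient_mutations",
    "patients_total",
    "control_mutations",
    "controls_total",
    "baseline_risk",
)


def validate_inputs(form):
    """Return (values, errors) for a dict-like object of submitted strings."""
    values, errors = {}, []
    for name in REQUIRED_FIELDS:
        raw = str(form.get(name, "")).strip()
        if not raw:
            errors.append(f"{name} is required")
            continue
        try:
            values[name] = float(raw)
        except ValueError:
            errors.append(f"{name} must be a number")
    if errors:
        return values, errors

    # Range checks run only once all five values have been parsed.
    if values["patients_total"] <= 0 or values["controls_total"] <= 0:
        errors.append("sample sizes must be positive")
    if not 0 <= values["patient_mutations"] <= values["patients_total"]:
        errors.append("patient_mutations must lie between 0 and patients_total")
    if not 0 <= values["control_mutations"] <= values["controls_total"]:
        errors.append("control_mutations must lie between 0 and controls_total")
    if not 0 < values["baseline_risk"] < 1:
        errors.append("baseline_risk must be a fraction between 0 and 1")
    return values, errors
```

In a Flask view, a function of this kind would typically be called on the submitted form data, and any returned messages rendered back into the HTML page as alerts.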

Validation of CalPen
We used the Pearson correlation coefficient as a means of validating the CalPen-derived penetrance values against those obtained in two published reports [3,5] (Table 1). For this purpose, the lifetime morbidity risk (baseline risk), the number of mutations and the sample sizes used were obtained from the reports, and the values were entered in the dialog boxes of CalPen to obtain penetrance values and their credible intervals. The correlation coefficient for median penetrance was calculated and scatter plots were made using Microsoft Excel 2016 [6]. In the case of credible intervals, the coefficient of range, (upper bound - lower bound)/(upper bound + lower bound), was calculated using values from CalPen and from the published reports. The individual values of the coefficients were then used to calculate the correlation coefficient and obtain scatter plots.
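The two quantities used in this validation can be sketched as follows. The sketch assumes the standard definition of the coefficient of range, (largest - smallest)/(largest + smallest), applied to the bounds of each credible interval; the function names and the numbers in the usage lines are placeholders and are not values from Table 1.

```python
import math


def coefficient_of_range(lower, upper):
    """Standard coefficient of range applied to an interval's bounds."""
    return (upper - lower) / (upper + lower)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Placeholder intervals (lower, upper), for illustration only.
published = [(0.02, 0.10), (0.05, 0.30), (0.01, 0.04)]
recomputed = [(0.02, 0.11), (0.06, 0.29), (0.01, 0.05)]

cr_published = [coefficient_of_range(lo, hi) for lo, hi in published]
cr_recomputed = [coefficient_of_range(lo, hi) for lo, hi in recomputed]
print(pearson_r(cr_published, cr_recomputed))
```

Reducing each interval to a single coefficient lets the intervals be compared with the same correlation analysis used for the median penetrance values.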

Tests of significance
We used two methods to determine whether there is a significant difference between the reported penetrance values and those obtained by CalPen. First, we used an online tool (https://www.socscistatistics.com/pvalues/pearsondistribution.aspx) to calculate the P value for the correlation coefficient (r) with n - 2 degrees of freedom. As an independent approach, we also performed a χ2 test of association using the reported penetrance values as expected values and the CalPen values as observed values (https://www.graphpad.com/quickcalcs/chisquared1/?Format=C). A P value < 0.05 was taken as significant.
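For readers who prefer to reproduce these tests offline, the test statistics themselves are simple to compute. The sketch below is an illustration under stated assumptions, not the procedure of the online tools: it uses the usual conversion of r to a t statistic with n - 2 degrees of freedom and the uncorrected Pearson χ2 sum over observed and expected values. The function names are hypothetical, and converting each statistic to a P value would additionally require the corresponding t or χ2 distribution.

```python
import math


def t_from_r(r, n):
    """t statistic for a Pearson r from n pairs (n - 2 degrees of freedom)."""
    return r * math.sqrt((n - 2) / (1 - r**2))


def chi_square_statistic(observed, expected):
    """Uncorrected Pearson chi-square: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Placeholder values, for illustration only.
print(t_from_r(0.5, 5))
print(chi_square_statistic([12.0, 8.0], [10.0, 10.0]))
```

A χ2 statistic close to zero means the observed values barely depart from the expected ones, which is why agreement between two sets of values corresponds to a P value near 1 rather than below 0.05.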

Results and discussion
In order to develop a web-based tool for calculation of penetrance (CalPen), we used the same parameters as described by Vassos et al. [3] and a Python script using the SciPy package (see Materials and Methods). A schematic of the workflow resulting in computation of median penetrance and credible intervals is shown in Fig 1A. An example of the steps involved in using CalPen is shown in Fig 1B, wherein the user needs to enter the appropriate number in each of the dialog boxes given. For example, the data in Fig 1B indicate that there are 10 mutations in 1000 patients but five among 1000 controls. After entering the baseline risk (lifetime morbidity risk), the user needs to click 'Calculate Penetrance', and is then forwarded to the next webpage, which shows the values of penetrance and credible intervals at the bottom. The user does not need to go back to the previous webpage to start with another mutation but can continue from the same page by entering a new set of relevant numbers. In order to validate the performance of CalPen, we used the published data from Vassos et al. [3] and Rosenfeld et al. [5]. In both cases, we used the baseline risks given by the authors. Data from CalPen and the two published reports are shown in Table 1. A comparison of the median penetrance values obtained using CalPen with the reported values gave a coefficient of correlation (r) of 0.992 (Fig 2A), with a P value < 0.001, indicating a significant degree of association between reported and CalPen-derived penetrance data. As an independent measure, we used a χ2 test with Yates' correction, which gave a value of 0.04, corresponding to a P value of 1.0, indicating a high degree of agreement between the two sets of values. In the case of credible intervals, we first calculated the coefficient of range of these intervals and then used it for calculation of the correlation coefficient.
The data shown in Fig 2B gave an r value of ~0.95, again indicating that there is a significant similarity between the published and CalPen-calculated credible intervals (P < 0.001). As in the case of penetrance values, a χ2 test with Yates' correction gave a P value of 1.0, indicating that the published credible intervals were very similar to those calculated using CalPen. Taken together, these results suggest that the CalPen software gives reliable values of both penetrance and credible intervals.
To demonstrate the wider utility of CalPen, we chose schizophrenia (SZ) as an example of a complex disorder in which two categories of variants, viz. copy number variants (CNVs) and single nucleotide polymorphisms (SNPs), are widely studied. CNVs are sub-microscopic deletions and duplications ranging in size from a few kilobases to a few megabases, affecting one to many genes and constituting about 5-10% of human variation [31]. Among the different CNVs identified in SZ, meta-analysis resulted in identification of a specific set of 15 CNVs that are more likely to be replicated in a diverse set of populations [7]. Data on these 15 CNVs were obtained from different published reports [8][9][10][11][12][13][14] (Table 2) and penetrance values were calculated using CalPen (Fig 2C). The data suggest that, for a given CNV, there was a range of penetrance values across different reports. For example, 3q29 deletions showed penetrance values ranging from 1.8% to 15.3%. Overall, the average penetrance of the 15 CNVs is ~7%, meaning that among 100 individuals carrying one of these CNVs, about seven are expected to develop the disorder. CNVs of 22q11.21 appear to have the highest average penetrance (~20%), whereas 15q11.2 deletions, which also represent variants of uncertain significance, have the lowest average penetrance (~1.4%). We also estimated the penetrance values of 128 SNPs using a large set of data reported by the Psychiatric Genomics Consortium [29] and of 17 SNPs identified among schizophrenia patients in other reports [15][16][17][18][19][20][21][22][23][24][25][26][27][28] (Table 3). In contrast to CNVs, the odds ratios of the SNPs are always lower, rarely approaching 1.5 [32], and the SNPs are therefore likely to have lower penetrance than CNVs. In agreement with this expectation, 117 out of the 145 SNPs studied showed a penetrance of 0.7%. Eleven SNPs showed the lowest (0.6%) and rs1801028 showed the highest (1.6%) penetrance values (Fig 2D; S1 Table).

Conclusion
In conclusion, CalPen is a straightforward tool for accurate determination of penetrance and credible intervals of mutations/variants associated with complex disorders, and it circumvents the bottleneck of requiring programming skills. At this juncture, the tool can calculate penetrance for one variant at a time and does not allow a set of variants identified in case-control studies to be analyzed together. Also, it is not possible to perform complex calculations that estimate combined penetrance in patients with more than one variant. Once these limitations are addressed, CalPen may in future enable a better understanding of the phenotypic outcomes in complex disease genetics. For routine penetrance calculations, this web-based tool can be accessed from http://calpen.pythonanywhere.com/#about.