DCMP: database of cancer mutant protein domains

Abstract Protein domains are the functional and structural units of proteins. Each domain is responsible for a particular function that contributes to the protein’s overall role. Because of this essential role, the majority of genetic variants occur in domains. In this study, somatic mutations across 21 cancer types were mapped to individual protein domains. To map the mutations to the domains, we used the whole human proteome to predict the domains in each protein sequence and recognized about 149 668 domains. A novel Perl-API program was developed to convert protein domain positions into genomic positions, and users can freely access it through GitHub. Using these genomic positions, we determined the distribution of protein domains across 23 chromosomes. Interestingly, chromosome 19 has more protein domains than any other chromosome. We then mapped the cancer mutations to all the protein domains. Around 46–65% of mutations were mapped to their corresponding protein domains, and significantly mutated domains for all the cancer types were determined using the local false discovery rate (locfdr). The chromosome positions for all the protein domains can be verified by cross-referencing the Ensembl database. Database URL: https://dcmp.vit.ac.in/


Introduction
Cancers are triggered by collective changes in genetic and nongenetic materials, which are induced by environmental factors that elicit inappropriate activation or inactivation of specific genes (1). Cancer begins with the disruption of the pathways of cellular proliferation and differentiation, leading to neoplastic transformation or abnormal cell growth (2). It is a large family of diseases that can invade or spread to other parts of the body. Analyses of well-studied cancers, such as colorectal cancer and retinoblastoma, have suggested that three or fewer mutations are sufficient for cancer initiation (3)(4)(5).
Many researchers have carried out detailed studies on how to stop this deadly disease in its tracks. One such line of work applies genomics and proteomics to cancer biology, which holds great potential for identifying the mechanisms that lead to malignancy and for developing therapeutic strategies (6). Several cancer genomes have been sequenced, documenting thousands of DNA mutations and other genomic alterations (7)(8)(9). These efforts were made by The Cancer Genome Atlas, the International Cancer Genome Consortium and the Catalogue of Somatic Mutations in Cancer (COSMIC) (10)(11)(12). In recent years, the mutational landscapes of several cancer types have been revealed. However, the process of extracting knowledge from these immense sequence resources has only just begun. Each cancer can contain thousands of somatic mutations, which pose challenges to therapy but also provide a basic understanding of the disease.
The volume of cancer genome sequence data grows with the number of tumor samples, and the prediction of driver mutations in these genomes suffers from false positives (24,25). Hence, determining the effects of mutations on the structure and function of the protein remains challenging (26). Recent computational structural studies have revealed that gene-based approaches usually consider neither the position of the mutation within the gene nor the functional context of that position. Computational structural studies have explored mutational effects on specific regions of a protein (e.g. the binding site) (27)(28)(29). In this study, the somatic mutations of 21 different cancers were mapped to individual protein domains to identify the significantly mutated domains (SMDs) across the cancer types. To map the mutations, the protein domains were predicted from the human proteome, and the domain positions were converted into their nucleotide or chromosomal locations. Converting peptide positions into nucleotide positions thus offered a reliable method of mapping mutations to protein domains. The top 10 significant protein domains were determined using the local false discovery rate. Users can access the chromosomal positions of the protein domains through the developed database.

Human protein sequences
The human protein sequences were retrieved from Ensembl using genome assembly GRCh38.p13 (Genome Reference Consortium Human Build 38), INSDC Assembly GCA_000001405.28, December 2013 (30). The protein domains in each protein sequence were predicted using the PfamScan tool, and we retained only the domains with an e-value ≤0.01 (31).
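The e-value cut-off described above can be sketched as a small filter over PfamScan result rows. This is a minimal illustration, not the authors' pipeline: it assumes the whitespace-separated column layout commonly written by 'pfam_scan.pl' (sequence id, alignment start and end, envelope start and end, HMM accession, HMM name, type, HMM start, end and length, bit score, e-value, significance, clan), and the sequence ids, accessions and scores in the sample rows are hypothetical.

```python
def filter_pfam_hits(lines, max_evalue=0.01):
    """Return domain hits whose e-value is at or below max_evalue.

    Assumes the e-value is the 13th whitespace-separated column
    (an assumption about the PfamScan output layout).
    """
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split()
        if len(fields) < 13:
            continue  # not a complete hit row
        evalue = float(fields[12])
        if evalue <= max_evalue:
            hits.append({
                "seq_id": fields[0],
                "aln_start": int(fields[1]),
                "aln_end": int(fields[2]),
                "hmm_name": fields[6],
                "evalue": evalue,
            })
    return hits


# Hypothetical rows: ids, accessions and scores are made up for illustration.
sample_lines = [
    "# <seq id> <alignment start> <alignment end> ...",
    "protA 112 194 110 196 PF00000.1 DomainA Family 1 83 83 75.2 1.3e-20 1 No_clan",
    "protB 10 40 8 42 PF00000.2 DomainB Domain 1 30 60 12.1 0.5 0 No_clan",
]
hits = filter_pfam_hits(sample_lines)
print(hits)
```

Only the first sample row passes the filter, because the e-value of the second row (0.5) exceeds the 0.01 threshold.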

Prediction of protein domains from the human proteome
The Homo sapiens proteome containing 109 095 sequences was obtained from the Ensembl database using genome assembly GRCh38. The PfamScan search tool was installed locally; it incorporates HMMER and BLAST to search against the Pfam domain libraries. Each protein sequence of the target species was searched against the Pfam libraries, and the total estimate of 169 745 protein domains  To run searches using 'pfam_scan.pl':

Cancer mutations from the COSMIC database
Mutations for 21 different cancers were downloaded from the COSMIC database, using the GRCh38 genome version, as shown in Table 2. The mutations were obtained from the COSMIC Complete Mutation Data (Targeted Screens) file, a tab-separated table of the complete, curated COSMIC dataset, in January 2020 (33). COSMIC is the most comprehensive resource for exploring the impact of somatic mutations in human cancer. Nonsense, missense, coding silent and complex mutations (the last involving multiple insertions, deletions and substitutions) were included. Intronic and unknown mutations were excluded from the dataset because intronic mutations occur outside the coding domains and unknown mutations carry no detailed information.

Mapping cancer mutation to protein domains
The domains predicted from the protein sequence are reported in peptide positions, whereas the cancer mutations are reported in genomic locations. Before mapping the cancer mutations to their corresponding protein domains, we should change  count. Users can freely access the Perl-API program from the GitHub link https://github.com/iarnoldemerson/Proteinto-genome-position.git, and supplementary file 1 provides the program instructions. After the domain positions are converted into genomic positions using the Perl-API program, the cancer mutations are ready to be mapped to their protein domains. Figure 2 illustrates the methodology for mapping the mutations to the protein domains. Every mutation is searched against all the Pfam domains. If the mutation position lies between a domain's start and end, the mutation count is increased by 1; otherwise, the next mutation is examined. Some mutations do not map to any Pfam domain because they are not located within a protein domain.
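The search described above can be sketched as a nested loop over mutations and domains. This is an illustrative reading of the procedure rather than the authors' code: the tuple layouts and the function name are assumptions, the domain coordinates are those quoted for the Acetyltransf_1 example in this article, and the mutation positions are hypothetical.

```python
from collections import Counter


def count_mutations_per_domain(mutations, domains):
    """Count mutations falling inside each domain's genomic span.

    mutations: iterable of (chromosome, position)
    domains:   iterable of (name, chromosome, start, end); start may be
               greater than end for domains on the negative strand.
    Returns (Counter of counts per domain name, number of unmapped mutations).
    """
    counts = Counter()
    unmapped = 0
    for chrom, pos in mutations:
        hit = False
        for name, d_chrom, d_start, d_end in domains:
            lo, hi = sorted((d_start, d_end))  # strand-independent bounds
            if chrom == d_chrom and lo <= pos <= hi:
                counts[name] += 1
                hit = True
        if not hit:
            unmapped += 1  # mutation lies outside every domain
    return counts, unmapped


domains = [("Acetyltransf_1", "2", 73_700_972, 73_700_724)]
mutations = [            # hypothetical positions for illustration
    ("2", 73_700_800),   # inside the domain
    ("2", 73_700_724),   # on the domain boundary, counted as inside
    ("2", 73_701_000),   # same chromosome, outside the domain
    ("3", 73_700_800),   # different chromosome
]
counts, unmapped = count_mutations_per_domain(mutations, domains)
print(counts, unmapped)
```

Sorting each start and end pair lets the same comparison serve both strands, since negative-strand domains are reported with the larger coordinate first.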

Calculation of normalized mutation frequency and SMDs
After all the cancer mutations are mapped, the mutation count for each Pfam domain needs to be normalized. In this study, we normalized the mutation counts by the cumulative length of all occurrences of the Pfam domain within the cancer set. Figure 3 illustrates the normalization for the OSR1_C domain, which is located in three genes, namely, WNK1, WNK2 and OXSR1. The accumulated SNP count is the sum of the mutations that occurred in the OSR1_C domains, whereas the cumulative domain length is obtained by summing the domain lengths in all three genes.
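The normalization can be written as a ratio of two sums. The sketch below assumes that the normalized frequency is simply the accumulated mutation count divided by the cumulative domain length; the per-gene counts and lengths are hypothetical values chosen for illustration, not those of Figure 3.

```python
def normalized_mutation_frequency(occurrences):
    """Accumulated mutations divided by cumulative domain length.

    occurrences: list of (mutation_count, domain_length), one entry per
    occurrence of the same Pfam domain in different genes.
    """
    total_mutations = sum(count for count, _ in occurrences)
    total_length = sum(length for _, length in occurrences)
    if total_length <= 0:
        raise ValueError("cumulative domain length must be positive")
    return total_mutations / total_length


# Hypothetical OSR1_C occurrences: (mutations, domain length).
osr1_c = {
    "WNK1": (2, 40),
    "WNK2": (1, 40),
    "OXSR1": (3, 40),
}
freq = normalized_mutation_frequency(list(osr1_c.values()))
print(f"normalized frequency = {freq:.3f}")
```

With these made-up numbers, 6 accumulated mutations over a cumulative length of 120 give a frequency of 0.05.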
To determine the SMDs, we adapted the method of Efron et al. for estimating the local false discovery rate in microarray experiments. The relative frequency is used as the success probability (p). It was then normalized using the signal-to-noise ratio of the Bernoulli distribution, which yields the normalized score, z. The null distribution was estimated using the 'locfdr' package in R, and these statistics were used to identify statistically significant domains with a local false discovery rate of <0.1. The false discovery rate (FDR) controls the expected proportion of false positives among the results called significant, and it has a greater ability to find truly significant results. For example, an FDR of 0.1 implies that 10% of significant tests are expected to be false positives. In a gene expression study, seven genes with a significant difference were found when the FDR was fixed at 0.1. However, the number of significant differences decreased to 1 under a more stringent FDR of 0.05. Furthermore, it has been shown that the number of false positives recovered is considerably higher than the number expected (34). We therefore chose an FDR of 0.1 to limit the false positives among the SMDs. We created a heat map of the hierarchical clustering of SMDs in different cancers using the 'heatmap' R package, based on the 'locfdr' values.
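One plausible reading of the Bernoulli signal-to-noise normalization is sketched below. It assumes the ratio is the mean of a Bernoulli variable divided by its standard deviation, i.e. z = p / sqrt(p(1 − p)); this form is an assumption made for illustration and may differ from the authors' exact expression. The function name and the input value are hypothetical, and the null-distribution fit performed by 'locfdr' in R is not reproduced here.

```python
import math


def bernoulli_snr_score(p):
    """Signal-to-noise score for a Bernoulli success probability p.

    Assumed form: mean / standard deviation = p / sqrt(p * (1 - p)).
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return p / math.sqrt(p * (1.0 - p))


# Hypothetical relative mutation frequency for one domain.
z = bernoulli_snr_score(0.05)
print(f"z = {z:.4f}")
```

Under this assumed form the score increases monotonically with p, so domains with a higher relative mutation frequency receive a higher z before the null distribution is fitted.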

Results and discussion
Conversion of protein to genomic positions
As shown in Figure 1, the protein domains were predicted in peptide positions, whereas the cancer mutations were reported in genomic or chromosome positions. To map the cancer mutations to the protein domains, either the peptide positions or the chromosome positions need to be converted. The most efficient method is to convert the peptide positions to their corresponding chromosome positions, which provides a more straightforward way to map all the cancer mutations to the protein domains. The steps required for converting peptide positions to chromosome positions are described in the 'Materials and methods' section (Figure 4).
The peptide-to-genome conversion program takes the peptide start and end as input (Figure 5, green table) and provides the corresponding chromosome positions as output (Figure 5, blue table). The program output can be validated using the transcript id in the Ensembl database. For example, for the first transcript id in Figure 5, ENST00000377712.3, the Acetyltransf_1 domain spans positions 112 to 194 and contains 83 amino acids. Since each amino acid is encoded by three nucleotides, the domain requires 249 bases. The result shows that the Pfam domain resides on chromosome 2 and runs from 73 700 972 to 73 700 724 (negative strand). Thus, the total length is 249 bases, which code for 83 amino acids. This transcript contains only one exon and no introns, so the chromosome length is precisely equal to the coding length of the peptide (i.e. 249 nucleotides/3 = 83 amino acids). This equality does not hold for many chromosome positions.
In most cases, the transcript has multiple exons and introns, and the protein domain starts and ends in different exons. One such example is the last transcript   we subtract the intron length from the total length (2427 − 2230 = 198 bases), the actual 198 bases that code for 66 amino acids remain, as shown in Figure 6.
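The exon-aware conversion can be illustrated with a short sketch. It is written in Python for illustration and is not the Perl-API program itself: the function name, the exon representation and the toy exon coordinates are assumptions, and only the coding parts of exons are expected as input. The last lines re-check the arithmetic of the single-exon Acetyltransf_1 example quoted in the text.

```python
def peptide_to_genome(aa_start, aa_end, cds_exons, strand):
    """Map a peptide range to genomic segments.

    cds_exons: coding exon parts as (start, end) genomic coordinates,
               inclusive, start <= end, listed in transcript (5'->3') order.
    strand:    1 for the positive strand, -1 for the negative strand.
    Returns a list of (first_base, last_base) segments in transcript order.
    """
    cds_from = (aa_start - 1) * 3   # 0-based CDS offset of the first base
    cds_to = aa_end * 3 - 1         # 0-based CDS offset of the last base
    segments = []
    offset = 0                      # CDS offset where the current exon begins
    for ex_start, ex_end in cds_exons:
        ex_len = ex_end - ex_start + 1
        lo = max(cds_from, offset)
        hi = min(cds_to, offset + ex_len - 1)
        if lo <= hi:                # the domain overlaps this exon
            if strand == 1:
                segments.append((ex_start + lo - offset, ex_start + hi - offset))
            else:
                segments.append((ex_end - (lo - offset), ex_end - (hi - offset)))
        offset += ex_len
    if not segments:
        raise ValueError("peptide range lies outside the coding sequence")
    return segments


# Toy transcript: two 30-base coding exons separated by a 70-base intron.
plus = peptide_to_genome(8, 13, [(100, 129), (200, 229)], strand=1)
minus = peptide_to_genome(8, 13, [(200, 229), (100, 129)], strand=-1)
print(plus, minus)

# Genomic span minus intron length equals the coding length (6 aa * 3).
span = plus[-1][1] - plus[0][0] + 1
intron = 200 - 129 - 1
print(span, intron, span - intron)

# Arithmetic of the single-exon example quoted in the text.
print((194 - 112 + 1) * 3, 73_700_972 - 73_700_724 + 1)
```

For a domain that crosses an intron, the function returns one segment per exon, so the genomic span from the first to the last base is longer than three times the peptide length by exactly the intron length.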

Protein domains in human chromosomes
Pfam domains with an e-value ≤0.01 were selected for higher accuracy, and we subsequently examined around 149 668 domains from the human proteome. Each chromosome contains hundreds to thousands of genes, which carry the instructions for making proteins. Each of the estimated 30 000 genes in the human genome makes an average of three proteins. A single gene can produce multiple different RNAs, i.e. transcripts. The actual transcript observed will depend on the tissue, developmental time point, environmental factors, etc. The number of coding genes and protein-coding transcripts in each chromosome was determined and compared with the number of protein domains across 23 chromosomes, as shown in Figure 7. In our study, the estimated number of unique genes is around 15 096, and these genes account for 73 311 transcripts; thus, the average number of transcripts per gene is approximately 4.85. Figure

Mapping mutations to individual protein domains
We used the developed Perl-API program to transform all the Pfam domain positions into chromosome positions. Thus, the mutation and domain positions are expressed in the same coordinate system (i.e. chromosome position). The next step is to map the mutations to each protein domain; this step requires more computation time because each mutation position is compared with all the domain positions.
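The all-against-all comparison is what makes this step costly. As an aside, and not as a description of the authors' implementation, the sketch below shows one common alternative: a simple binned index that restricts each lookup to domains overlapping the same chromosome bin. The bin size, function names and mutation positions are hypothetical; the domain coordinates are those of the Acetyltransf_1 example in this article.

```python
from collections import defaultdict

BIN_SIZE = 100_000  # hypothetical bin width in bases


def build_domain_index(domains):
    """Index domains by (chromosome, bin) for fast position lookups.

    domains: iterable of (name, chromosome, start, end); start may exceed
    end for negative-strand domains.
    """
    index = defaultdict(list)
    for name, chrom, start, end in domains:
        lo, hi = min(start, end), max(start, end)
        for b in range(lo // BIN_SIZE, hi // BIN_SIZE + 1):
            index[(chrom, b)].append((lo, hi, name))
    return index


def domains_at(index, chrom, pos):
    """Return names of domains whose span contains the given position."""
    candidates = index.get((chrom, pos // BIN_SIZE), [])
    return [name for lo, hi, name in candidates if lo <= pos <= hi]


index = build_domain_index([("Acetyltransf_1", "2", 73_700_972, 73_700_724)])
inside = domains_at(index, "2", 73_700_800)    # hypothetical mutation
outside = domains_at(index, "2", 73_701_000)   # hypothetical mutation
print(inside, outside)
```

Each lookup then compares a mutation only with the domains registered in its own bin, which gives the same result as the exhaustive search as long as every domain is registered in all bins it overlaps.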
Mutations were mapped for all 21 cancer types, and Table 3 gives the percentage of mutations mapped to the protein domains. The percentage of mapped mutations ranges from 46% to 65%; the remaining mutations do not lie within protein domain locations. The number of mutated domains was also calculated and is given in Table 3. After the mutations were mapped to individual protein domains, we calculated the number of mutations in each cancer type.
Interestingly, we found that the "large intestine" cancer type acquired the most mutations (518 025), as shown in Figure 8.

Significantly mutated domains
The locfdr was used to determine the statistically significant domains for all the cancer types.  in Figure 9. Among the cancer-specific SMDs, most were significantly mutated in only a single cancer type. Each column represents a cancer type, and the same color indicates that the SMDs belong to that particular cancer type. Moreover, P53 was the only domain observed among the SMDs of the "testis" cancer type, and we excluded it from the heatmap.
Interestingly, the p53 protein domain was found in the top 10 list of all cancer types. TP53 is mutated in many cancers and is the most commonly mutated gene in cancer cells. It is a tumor-suppressor gene that codes for a protein that inhibits the development and growth of tumors. Since over 50% of human cancers carry loss-of-function mutations in the p53 gene, p53 is considered one of the classical tumor suppressors. Three protein domains, namely, PI3Ka, Nebulin and zf-H2C2_2, occur in >10 cancer subtypes. PI3Ka is believed to be one of the significant therapeutic targets for cancer treatment (35). Hyperactivity of PI3K signaling is significantly associated with human tumor progression and the invasive potential of cancer cells. The NEBL (nebulette) gene is located on chromosome 10p12.31 and encodes the nebulin-like protein, and studies indicate a role for NEBL as an oncogene and tumor suppressor in cancer (36). ZF domains are significant determinants of human regulatory networks, as they are contained in nearly half of human transcription factors. Studies indicate that mutated ZF genes are expressed at levels comparable to those of other cancer-relevant genes (37).