PsyMuKB: A De Novo Variant Knowledge Base Integrating Transcriptional and Translational Information to Identify Isoform-specific Mutations in Developmental Disorders

De novo variants (DNVs) are one of the most significant contributors to severe early-onset genetic disorders such as autism spectrum disorder, intellectual disability, and other developmental and neuropsychiatric (DNP) disorders. Currently, a plethora of DNVs have been identified through the use of next-generation sequencing, and much effort has been made to understand their impact at the gene level; however, there has been little exploration of their impact at the isoform level. The brain contains a high level of alternative splicing and regulation, and exhibits a more divergent splicing program than other tissues; therefore, it is crucial to explore variants at the transcriptional regulation level to better interpret the mechanisms underlying DNP disorders. To facilitate better usage and improve the isoform-level interpretation of variants, we developed PsyMuKB (NeuroPsychiatric Mutation Knowledge Base), a knowledge base containing a comprehensive, carefully curated list of DNVs with transcriptional and translational annotations to enable identification of isoform-specific mutations. PsyMuKB allows a flexible search of genes or variants and provides both table-based descriptions and associated visualizations, such as expression, transcript genomic structures, protein interactions, and the mutation sites mapped on the protein structures. It also provides an easy-to-use web interface, allowing users to rapidly visualize the locations and characteristics of mutations and the expression patterns of the impacted genes and isoforms. PsyMuKB thus constitutes a valuable resource for identifying tissue-specific de novo mutations for further functional studies of related disorders. PsyMuKB is freely accessible at http://psymukb.net.


Introduction
In addition to inheriting half of each parent's genome, each individual is born with a small set of novel genetic changes, referred to as de novo variants (DNVs), that occur during gametogenesis [1,2]. These variants, which are identified in parent-offspring trios, range in size from single-nucleotide variants (SNVs) and small insertions and deletions (indels), collectively termed de novo mutations (DNMs), to larger structural variations such as de novo copy number variants (CNVs), and have been implicated in various human diseases [3,4].
Recently, a large number of DNVs have been discovered by whole-exome sequencing (WES) and whole-genome sequencing (WGS), and have been explored and analyzed at the gene level to assess their contributions to complex diseases [5][6][7][8][9][10]. However, isoform-level information has rarely been explored. As many as 95% of genes are subject to alternative splicing (AS), initiation, and promoter usage to produce various isoforms, increasing human transcriptomic and proteomic diversity [11,12], with approximately four to seven isoforms per gene [12,13].
An isoform is highly specific, and its expression is often restricted to certain organs, tissues, or even cell types within the same tissue [14][15][16]. Notably, this occurs at a high frequency in brain tissues [17,18] and regulates biological processes during neural development, including cell-fate decisions, neuronal migration, axon guidance, and synaptogenesis [19,20].
Exons are differentially used in isoforms of the same gene; therefore, it is likely that disease mutations may selectively impact only isoforms with mutation-carrying exons.
Moreover, if some isoforms are not expressed in a particular developmental period or in a specific tissue, then the disease mutations affecting such isoforms may not manifest their functional impact at that period or in that tissue. Thus, correlating tissue-specific isoforms with disease mutations is an important and necessary task for refining our understanding of human diseases. Because the brain is subject to the highest number of AS events [17,18], it is imperative to study mutations related to brain disorders at the isoform level with brain-specific expression. However, the association between isoforms and DNMs in developmental and neuropsychiatric (DNP) disorders, such as autism (ASD), schizophrenia (SCZ), early-onset Alzheimer disorder (AD), and congenital heart disorder (CHD), has rarely been investigated on a large scale.
In this study, we present the NeuroPsychiatric Mutation Knowledge Base (PsyMuKB), a unique DNV database that we have developed. PsyMuKB serves as an integrative platform that enables exploration of the association between tissue-specific regulation and DNVs in DNP disorders (Figure 1). It provides a comprehensive collection of DNVs, both DNMs hitting coding and non-coding regions and de novo CNVs, spanning 25 different clinical phenotypes and reported in 123 studies as of May 2019, including DNP disorders such as ASD, SCZ, and early-onset AD. In addition, based on the genomic position of each mutation, transcriptional features, and the genomic structures of transcripts, we developed a novel pipeline that allows flexible, user-specified filtering and exploration of isoforms that are impacted by mutations and/or expressed in the brain.
Finally, PsyMuKB allows the searching and browsing of genes by their IDs, symbols, or genomic coordinates, and provides detailed gene information, including descriptions and summaries, exon-intron structures of transcripts, expression of the gene and/or protein in various tissues, and protein-protein interactions (PPIs). Therefore, PsyMuKB is a comprehensive resource for exploring disease risk factors by transcriptional and translational information with associated visualizations. Herein, we describe the architectural features of PsyMuKB, including both the variants and their annotations, and a system for understanding the impact of mutations on tissue-specific isoforms and protein structures in brain-related complex disorders. It highlights novel mechanisms underlying the genetic basis of DNP disorders.

DNV curation
PsyMuKB catalogues two types of DNVs: (1) DNMs that include de novo point mutations and small indels; (2) de novo CNVs that involve deletions or duplications in copy numbers of specific regions of DNA. We first surveyed the literature for all published studies where human DNVs, including DNMs and CNVs, had been identified at a genome-wide scale [21]. All studies were then carefully curated to maintain essential information on each DNV, including sample identifier (if available), chromosomal locations of the reference and alternative alleles, and validation status. All variant coordinates are shown in GRCh37 (hg19) in PsyMuKB for both DNMs and de novo CNVs. If source variant coordinates were not originally provided in GRCh37, they were converted using the "LiftOver" tool from the UCSC Genome Browser (https://genome.ucsc.edu/cgi-bin/hgLiftOver) for annotation consistency.
The vast majority of DNM studies published and included in PsyMuKB have employed large-scale parallel sequencing using mostly WES but sometimes WGS, in conjunction with large sample sizes (hundreds to thousands of samples). These samples were collected mostly from trio families, but sometimes from quad families [21]. By comparing the DNA sequences obtained from affected children to those from their parents, it is possible to identify DNMs after filtering out sequencing artifacts and variant-calling errors. The variant-calling process requires a detailed bioinformatics pipeline involving the application of different thresholds to filter for various quality parameters, such as allele balance (e.g. AB between 0.3 and 0.7), allele depth (e.g. DP≥20), genotype quality (e.g. GQ≥20), mapping quality (e.g. MQ≥30), allele frequency in the general population (usually <1%, or <0.1% as a more stringent cutoff), etc. [5,22]. Nonetheless, all DNMs (or randomly selected subsets) are re-sequenced by other methods, usually Sanger sequencing, to check the accuracy of the findings. As a result, the average rate of DNMs is estimated to be 1-3 per individual in the whole exome and 60-80 per individual in the whole genome [23]. During our data collection and curation process, we ensured that all the DNM data included in PsyMuKB came from discovery pipelines with reasonable quality parameters, such as those used in the 2018 study by Werling et al. [5]. Next, all the collected DNMs were batch-processed for systematic annotation using the ANNOVAR annotation platform [24] to include annotations such as variant function (exonic, intronic, intergenic, UTR, etc.), exonic variant function (non-synonymous, synonymous, etc.), amino acid changes, frequency in the 1000 Genomes and ExAC databases [25], and variant functional predictions by SIFT [26], Polyphen2 [27], GERP++ [28], and CADD [29].
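The threshold-based filtering described above can be sketched as follows. This is a minimal illustration rather than the pipeline of any curated study: the `Call` record, its field names, and the `passes_filters` helper are hypothetical, the toy values are invented, and the cutoffs are simply the example values quoted in the text.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """Hypothetical record for one candidate DNM call in the child."""
    allele_balance: float   # fraction of reads supporting the alternative allele (AB)
    depth: int              # read depth at the site (DP)
    genotype_quality: int   # genotype quality (GQ)
    mapping_quality: float  # mapping quality (MQ)
    population_af: float    # allele frequency in the general population


def passes_filters(call: Call, max_af: float = 0.01) -> bool:
    """Apply the example thresholds quoted in the text (AB, DP, GQ, MQ, AF)."""
    return (
        0.3 <= call.allele_balance <= 0.7
        and call.depth >= 20
        and call.genotype_quality >= 20
        and call.mapping_quality >= 30
        and call.population_af < max_af
    )


# Invented toy calls, for illustration only.
candidates = [
    Call(0.48, 35, 99, 60.0, 0.0),    # passes every filter
    Call(0.15, 40, 99, 60.0, 0.0),    # skewed allele balance
    Call(0.50, 12, 99, 60.0, 0.0),    # low depth
    Call(0.50, 30, 99, 60.0, 0.005),  # passes at the 1% cutoff, fails at 0.1%
]

kept = [c for c in candidates if passes_filters(c)]
kept_strict = [c for c in candidates if passes_filters(c, max_af=0.001)]
```

Passing `max_af=0.001` corresponds to the more stringent 0.1% population-frequency cutoff mentioned above.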
Since the emphasis of many available functional annotations of variants is on coding regions, we included the DeepSea scores in the variant annotation table to help users evaluate the impact of the variants at non-coding locations. In addition, for each gene we included the Haploinsufficiency Score [30] for assessing the likelihood of the gene exhibiting haploinsufficiency and the pLI score [25] for assessing the probability of it being intolerant to loss-of-function (LoF) variants.

Collection and processing of expression datasets
PsyMuKB includes five different datasets for expression annotations, of which four are transcriptomic data and one is protein expression data. We selected four large-scale transcriptomic datasets to comprehensively annotate and illustrate transcriptional expression, including human tissue expression from the Genotype-Tissue Expression (GTEx) consortium [31] (http://www.gtexportal.org/home/), the BrainSpan Atlas of the Developing Human Brain [32] (www.brainspan.org), and human embryonic prefrontal cortex single-cell expression [33]. Considering that the majority of developmental regulation modules are preserved between human and mouse [34], we also integrated adult mouse brain single-cell expression atlas data (DropViz: http://dropviz.org/) [35] to expand the interpretive annotations of genes associated with DNVs. Gene expression levels were summarized as either Reads Per Kilobase Million (RPKM) or Transcripts Per Million (TPM), as provided by their respective sources; we then calculated and visualized all expression levels as either original or log10-based normalized values. The BrainSpan data were plotted across six brain regions and nine developmental periods, while GTEx data were plotted by listing all human tissues in alphabetical order. All neuronal cell types were annotated by their major cell types, such as neuron, interneuron, microglia, stem cell, oligodendrocyte progenitor cell (OPC), astrocyte, etc. The human brain single-cell expression data were visualized by developmental periods and cell types, while the mouse brain single-cell expression data were visualized by brain regions and cell types. These gene expression patterns mainly aid exploration of the role of a gene in normal tissues or developmental periods, but no specific transcripts of the gene were revealed in abnormal situations.
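As a small illustration of the log10-based normalization mentioned above, the sketch below transforms a set of expression values. The pseudocount of 1 (used so that zero expression maps to zero) is an assumption made here for illustration; the text does not state how zero values are handled. The tissue names and values are invented.

```python
import math


def log10_normalize(values, pseudocount=1.0):
    """Return log10(value + pseudocount) for each expression value.

    The pseudocount is an assumption of this sketch, not a documented
    PsyMuKB parameter.
    """
    return [math.log10(v + pseudocount) for v in values]


# Invented TPM values for three tissues.
tpm = {"cortex": 0.0, "cerebellum": 9.0, "liver": 99.0}
normalized = dict(zip(tpm, log10_normalize(tpm.values())))
```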
We then focused on the transcripts where DNMs were mapped to their exon locations, and where the specific location in the brain where they were expressed was recorded. To associate mutations with the brain-expressed transcripts, we mapped the genomic locations of DNMs to the exon-intron structures of each gene isoform expressed.
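The mapping of a DNM position onto the exon-intron structures of each isoform can be sketched as a simple interval test. This is a minimal sketch under stated assumptions: exon coordinates are taken to be 1-based and inclusive, and the transcript identifiers, coordinates, and the `isoforms_hit` helper are hypothetical.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # (start, end), assumed 1-based and inclusive


def isoforms_hit(position: int, isoforms: Dict[str, List[Exon]]) -> List[str]:
    """Return the IDs of isoforms with at least one exon covering the position."""
    return [
        transcript_id
        for transcript_id, exons in isoforms.items()
        if any(start <= position <= end for start, end in exons)
    ]


# Two invented isoforms of one gene; the middle exon is used only by "tx_long".
isoforms = {
    "tx_long": [(100, 200), (300, 400), (500, 600)],
    "tx_short": [(100, 200), (500, 600)],
}

hits_shared = isoforms_hit(150, isoforms)    # exon shared by both isoforms
hits_specific = isoforms_hit(350, isoforms)  # exon present only in tx_long
hits_intronic = isoforms_hit(250, isoforms)  # position in an intron of both
```

A mutation at position 350 in this toy example is isoform-specific: it lies in an exon of one isoform and in an intron of the other.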
To associate the mutations with the protein-level annotations, we extracted the protein isoform expression data of various human tissues from ProteomicsDB (https://www.proteomicsdb.org/). Protein isoform expression data were directly extracted from ProteomicsDB with median log10-based normalized iBAQ intensities as the expression levels. To associate the mutations with the protein isoforms expressed in the brain, we first mapped the mutation genomic locations to all the Gencode mRNA transcripts. Then, we linked Gencode mRNA transcript IDs and UniProt IDs, which were used to identify protein isoform expression data provided by ProteomicsDB. After this, we mapped the expression data to all proteins and their isoforms by UniProt IDs, and all protein expression information was plotted as histograms by different tissue type, e.g. the brain.

Regulatory element curation and mutation mapping
Currently, functional annotations mostly emphasize mutations in coding regions. However, more than 90% of all the reported DNMs are located in non-coding regions of the genome (Figure 2A), which can be potentially functionally important due to the sheer size. To facilitate the usage of these variants and better explore the potential impact of mutations hitting the non-translated genomic regions, PsyMuKB provides regulatory element annotations to help investigate whether a non-coding mutation hits a regulatory element, potentially influencing downstream gene/isoform targets. This information is located in the "Transcripts" subsection of the "Gene Information" page. There were 250,733 gene enhancer regions defined by GeneHancer [36] and 82,149 promoters defined in phase 2 of FANTOM5 [37]. We mapped the curated DNMs located in non-coding regions of the genome to all of these regulatory regions and listed the overlaps as part of the mutation annotations (Figure 3).
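Mapping a non-coding DNM to regulatory regions amounts to an overlap query against enhancer and promoter intervals. The sketch below illustrates this with a per-chromosome index and a linear scan; the element tuples, names, and coordinates are hypothetical, coordinates are assumed to be 1-based and inclusive, and a production system holding hundreds of thousands of regions would more likely use an interval tree or a sorted index.

```python
from collections import defaultdict


def build_index(elements):
    """Group (chrom, start, end, kind, name) tuples by chromosome."""
    index = defaultdict(list)
    for chrom, start, end, kind, name in elements:
        index[chrom].append((start, end, kind, name))
    return index


def annotate(chrom, pos, index):
    """Return (kind, name) for every regulatory element covering the position."""
    return [
        (kind, name)
        for start, end, kind, name in index.get(chrom, [])
        if start <= pos <= end
    ]


# Invented regulatory elements, for illustration only.
elements = [
    ("chr1", 1000, 1500, "promoter", "prom_1"),
    ("chr1", 1400, 5000, "enhancer", "enh_1"),
    ("chr2", 1000, 2000, "enhancer", "enh_2"),
]
index = build_index(elements)

in_both = annotate("chr1", 1450, index)       # overlaps promoter and enhancer
in_none = annotate("chr1", 6000, index)       # outside every element
unknown_chrom = annotate("chr3", 100, index)  # chromosome without elements
```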

Interaction data curation
We extracted PPI data from BioGRID [38] to construct a comprehensive map of physically interacting human proteins. After removing non-physical interactions as defined in BioGRID, we obtained 409,173 human PPIs for annotation integration, allowing users to explore the potential functional pathways involving the proteins impacted. For each interaction, we have kept the annotations, such as official symbols of both protein interactors, experiment detection method, and publication PMID.
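The filtering step described above can be sketched as follows. The record layout is a simplified assumption (the keys `symbol_a`, `symbol_b`, `system_type`, `method`, and `pmid` are hypothetical and do not reproduce BioGRID's actual column names), and the gene symbols and PMIDs are placeholders rather than real interactions.

```python
from collections import defaultdict

# Placeholder records with hypothetical keys; not real BioGRID data.
records = [
    {"symbol_a": "GENE_A", "symbol_b": "GENE_B", "system_type": "physical",
     "method": "Affinity Capture-Western", "pmid": "PMID_1"},
    {"symbol_a": "GENE_A", "symbol_b": "GENE_C", "system_type": "genetic",
     "method": "Synthetic Lethality", "pmid": "PMID_2"},
    {"symbol_a": "GENE_B", "symbol_b": "GENE_D", "system_type": "physical",
     "method": "Two-hybrid", "pmid": "PMID_3"},
]


def keep_physical(records):
    """Drop every interaction not labelled as physical."""
    return [r for r in records if r["system_type"] == "physical"]


def build_partner_map(records):
    """Build an undirected map from each protein to its interaction partners."""
    partners = defaultdict(set)
    for r in records:
        partners[r["symbol_a"]].add(r["symbol_b"])
        partners[r["symbol_b"]].add(r["symbol_a"])
    return partners


physical = keep_physical(records)
partner_map = build_partner_map(physical)
```

Each retained record still carries its detection method and publication identifier, mirroring the annotations kept for every interaction.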

Database architecture
PsyMuKB has been designed as an expandable big data platform using MongoDB, a high-performance NoSQL database management system. This provides sufficient scalability and extensibility for easy and fast data integration and module expansion in future updates.
All metadata in PsyMuKB are stored in the MongoDB database, while the graphical representations, such as expression profiles, mutation mapping to the transcripts, and the PPI network, are mapped and drawn in real time when the related data are queried. The web interface and data visualization of PsyMuKB were implemented mostly in Python, together with HTML5, Cascading Style Sheets (CSS), and JavaScript (JS). The expression data visualization and regulatory element mapping were implemented using Plotly. The interaction network visualization was implemented using Cytoscape.js [39]. Illustration of the mutation site in a 3D protein structure is provided by a link to the corresponding visualization on the muPIT [40] interactive web server (http://mupit.icm.jhu.edu/MuPIT_Interactive/).

Database Content and Usage
Mutation data statistics (Figure 2A)
It has been shown that CNVs have contributed significantly to the disease etiology of psychiatric disorders [43][44][45][46]. Thus, it is vital that such variants are included in the database as well. Therefore, we have curated 841 de novo CNVs from reported genome-scale studies, covering eight different clinical phenotypes and affecting 369 non-overlapping genomic regions (Figure 2B, Data collection and processing), ranging from 1Kb to 600Mb. More than half of the de novo CNVs (58%, n=486) are ASD CNVs, followed by control (14%), intellectual disability (9.6%), and SCZ (7.8%) CNVs. In this set of curated CNV data, 61% were deletions and 39% were duplications. Additionally, CNVs were shown to hit regions of chromosome 16 most frequently (10%, n=85), followed by chromosomes 22, 2, 7, and 1 (Figure 2B).

Novelty of PsyMuKB
Unlike three existing databases, the Developmental Brain Disorder Genes Database (DBD) [47], denovo-db [48], and NPdenovo [49], PsyMuKB does not limit its collection of variants to DNMs. information" sub-section, PsyMuKB also provides an "Assessment Table", which includes several brain- or disease-related genetic features, such as pLI score, haploinsufficiency score rank, expressed or not-expressed in the brain, etc., to help the user better understand the relationship between the gene and diseases.
DNVs can be accessed via two different approaches: (1) through the "De novo variants" statistic table (Figure 3G) of the gene information page after searching by "Gene ID" or This mutation-level assessment, together with gene-level assessment, supports a better understanding of the queried gene and the specific mutation carried by it.
PsyMuKB also provides basic genomic information on annotated regulatory elements, such as promoters and enhancers, by visualizing their locations on mRNA transcripts of the queried gene (Figure 3F). Moreover, all reported DNMs are mapped and visualized on top of the exon-intron structure of the mRNA transcripts, together with their regulatory elements, which may aid elucidation of the potential roles of the regulatory elements. In addition, PsyMuKB utilizes alternatively spliced isoforms with tissue-specific expression information, together with DNM mapping on top of the isoform structures, in order to provide isoform-specific mutation selections (Figure 3H).
PsyMuKB also provides a human protein interaction map for the queried protein (Figure 3D). The interaction network is constructed using both first- and second-degree interactions and interactively visualized using Cytoscape.js [39]. The first-degree interactions are defined as the interactions between all proteins and the queried protein. reported publications and total evidence count, regardless of the amount of evidence.

Exploring mutations at the isoform-level
One of the key features of PsyMuKB is that it allows visualization of the DNM locations at the transcript level and identification of affected isoforms with tissue-specific expression annotations, at both the mRNA and protein levels. Here, we first assessed the necessity of studying DNMs at the isoform level and explored the scale of the DNMs that satisfy the criteria above. We used all mRNA transcripts from Gencode v19 and protein isoforms from UniProtKB (release 2018_07), which have been integrated into PsyMuKB, and defined three types of isoforms: "longest isoform", "brain-expressed isoform", and "not brain-expressed isoform" (Figure 4A-B). At the mRNA level, the "longest isoform" is the isoform with the longest coding sequence compared to all other isoforms of the same gene; the "brain-expressed isoform" is an isoform with expression of TPM≥1 in at least one brain tissue from GTEx data; and the "not brain-expressed isoform" is an isoform that is not expressed (TPM<1) in any brain tissue sample from GTEx data. At the protein level, the "longest isoform" is the isoform with the longest amino acid sequence among all isoforms of the same protein; the "brain-expressed isoform" is an isoform with expression of iBAQ intensity ≥1 in at least one brain tissue from ProteomicsDB data; and the "not brain-expressed isoform" is an isoform that is not expressed (iBAQ intensity <1) in any brain tissue sample from ProteomicsDB data.
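The three isoform definitions at the mRNA level can be expressed compactly in code. The following is a minimal sketch, not the PsyMuKB implementation: the `Isoform` record, its field names, the `classify` helper, and the toy values are hypothetical; only the TPM≥1 threshold and the longest-coding-sequence rule come from the definitions above. The same logic would apply at the protein level with sequence length and iBAQ intensity.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Isoform:
    """Hypothetical record for one mRNA isoform of a gene."""
    transcript_id: str
    cds_length: int              # coding sequence length
    brain_tpm: Dict[str, float]  # brain tissue -> expression in TPM


def classify(isoforms: List[Isoform], min_tpm: float = 1.0) -> Dict[str, List[str]]:
    """Label each isoform of one gene with the categories defined in the text."""
    longest_id = max(isoforms, key=lambda iso: iso.cds_length).transcript_id
    labels = {}
    for iso in isoforms:
        tags = []
        if iso.transcript_id == longest_id:
            tags.append("longest isoform")
        # Brain-expressed: TPM >= threshold in at least one brain tissue.
        if any(tpm >= min_tpm for tpm in iso.brain_tpm.values()):
            tags.append("brain-expressed isoform")
        else:
            tags.append("not brain-expressed isoform")
        labels[iso.transcript_id] = tags
    return labels


# Invented isoforms of a single gene.
gene_isoforms = [
    Isoform("tx_a", 3000, {"cortex": 5.2, "cerebellum": 0.4}),
    Isoform("tx_b", 1200, {"cortex": 0.2, "cerebellum": 0.9}),
    Isoform("tx_c", 900, {"cortex": 1.0, "cerebellum": 0.0}),
]
labels = classify(gene_isoforms)
```

Note that "longest" and "brain-expressed" are independent labels, so the longest isoform of a gene may or may not be expressed in the brain.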
We annotated those DNMs in PsyMuKB hitting brain-expressed isoforms and identified these as "brain-expressed" mutations, as well as identifying "not-brain-expressed" mutations. Although DNMs can occur anywhere in the genome, the exome, or protein-coding region of the genome, is often investigated first when studying human disease [6,7,50]. Therefore, "not-brain-expressed" mutations may not be as interesting to researchers studying tissue-specific disease biology.
Using the "longest isoform" as the reference isoform has been a common practice in many studies and databases. Here, we ask whether the longest isoform strategy is still applicable for studying tissue-specific mutations. First, we looked at the exonic DNMs that impact isoforms and observed that the majority would hit the longest isoforms, as expected given their length: 97% at the mRNA level and 99% at the protein level (Figure 4). However, when checking whether most DNMs would hit at least one brain-expressed isoform, we observed that about 28% of DNMs do not hit any brain-expressed mRNA isoforms (Figure 4A), and as many as 64% of DNMs do not hit any brain-expressed protein isoforms (Figure 4B), based on the current protein isoform annotation and protein expression information from ProteomicsDB. These results show that investigating the impact of disease variants at the isoform level and with tissue specificity is imperative. This is a key reason for PsyMuKB to include tissue- and isoform-specific expression for investigating disease-relevant mutations.
To illustrate the exploration of isoform-specific features using PsyMuKB (Figure 3H), we have showcased this functionality with the neuropsychiatric disease-associated gene Chromodomain-helicase-DNA-binding protein 8, CHD8 (Figure 4C), which has multiple alternatively spliced isoforms and widespread expression across many tissues. CHD8 is believed to affect the expression of many other genes that are involved in prenatal brain development and is a strong risk factor for DNP disorders, such as ASD [51][52][53]. Figure 4C demonstrates the isoform-specific filtering process to identify suitable models for the study of mutations in CHD8.
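The two proportions discussed above (DNMs hitting a longest isoform, and DNMs hitting no brain-expressed isoform) can be computed from a mutation-to-isoform map. The sketch below uses invented identifiers and counts purely for illustration; the `summarize` helper and its inputs are hypothetical and the resulting percentages bear no relation to the figures reported in the text.

```python
def summarize(hits, longest, brain_expressed):
    """Summarize exonic DNMs by the kinds of isoforms they hit.

    hits: dict mapping each DNM ID to the set of isoform IDs it hits.
    longest / brain_expressed: sets of isoform IDs carrying those labels.
    """
    total = len(hits)
    hit_longest = sum(1 for isoforms in hits.values() if isoforms & longest)
    no_brain = sum(1 for isoforms in hits.values() if not (isoforms & brain_expressed))
    return {
        "pct_hit_longest": 100.0 * hit_longest / total,
        "pct_no_brain_expressed": 100.0 * no_brain / total,
    }


# Invented toy data: four DNMs and three isoforms.
hits = {
    "dnm1": {"tx1", "tx2"},
    "dnm2": {"tx1"},
    "dnm3": {"tx3"},
    "dnm4": {"tx2"},
}
longest = {"tx1", "tx3"}
brain_expressed = {"tx2", "tx3"}

summary = summarize(hits, longest, brain_expressed)
```

In this toy example, "dnm2" hits a longest isoform that is not brain-expressed, which is the kind of mutation a longest-isoform-only analysis would not flag as tissue-irrelevant.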