An update on the CHDGKB for the systematic understanding of risk factors associated with non-syndromic congenital heart disease

The Congenital Heart Disease Genetic Knowledge Base (CHDGKB) was established in 2020 to provide comprehensive knowledge about the genetics and pathogenesis of non-syndromic CHD (NS-CHD). In addition to the genetic causes of NS-CHD, environmental factors such as maternal drug use, as well as gene-environment interactions, can also lead to CHD. There is a need to integrate this information into a platform that helps clinicians and researchers better understand the overall risk factors associated with NS-CHD. The updated CHDGKB contains genetic and non-genetic risk factors from over 4200 PubMed records that were manually curated to include the information associated with NS-CHD. The current version of CHDGKB, named CHD-RF-KB (KnowledgeBase for non-syndromic Congenital Heart Disease-associated Risk Factors), is an important tool that allows users to evaluate the recurrence risk and prognosis of NS-CHD, to guide treatment, and to highlight precautions for NS-CHD. In this update, we performed extensive functional analyses of the genetic and non-genetic risk information in CHD-RF-KB. These data can be used to systematically understand the heterogeneous relationship between risk factors and NS-CHD phenotypes.


Introduction
Congenital heart disease (CHD) is the most common cause of heart disease with an estimated incidence of 0.7-1% per live birth [1,2]. Reports have shown that genetics plays an important role in the process of CHD and that chromosomal abnormalities, copy number variations, mutations (including single nucleotide polymorphisms) [3][4][5], hypomethylation [6] and functional variants in microRNAs contribute to the development of CHD [7]. These genetic variations disrupt or alter the function of genes during the normal development of the heart. Whilst genetics plays a vital role in the development of CHD, only 20-30% of individuals with CHD can be identified based on a single genetic factor [8]. Large-scale studies have suggested that environmental factors such as parental drug profiles and maternal health status can cause CHD or interact with genetic variations to contribute to it [9][10][11][12].
Advances in genetic testing and surgical techniques have led to a decrease in the prevalence of CHD. However, there is currently no comprehensive resource covering the genetic and non-genetic risk factors associated with NS-CHD. Syndromic CHD describes CHD with syndrome-associated abnormalities such as Noonan, DiGeorge, Holt-Oram, Marfan, Chat and other syndromes, often with cardiac and non-cardiac abnormalities. Non-syndromic CHD refers to CHD with only cardiac abnormalities, including simple and severe congenital heart disease.
The current version of CHDGKB was developed from articles available on PubMed. We established a genetic variation database and included an analysis of the molecular mechanism of NS-CHD. The updated database presented in this study provides a useful tool for researchers to systematically study the prognosis and recurrence risk of NS-CHD and to evaluate its treatments. In the current version, we also performed extensive functional analyses aiming to better understand the complex relationships between genes, NS-CHD subtypes, and other risk factors.

Data collection
Based on the CHDGKB database, we expanded all of the non-genetic risk information associated with NS-CHD. In the updated version, we collected all data for the KnowledgeBase on non-syndromic Congenital Heart Disease-associated Risk Factors (CHD-RF-KB) manually from PubMed. The literature searches covered publications prior to May 5th, 2020, using the following keywords: (

Inclusion and exclusion criteria
The inclusion criteria for the non-genetic risk data in the CHD-RF-KB were partly the same as those for the CHDGKB [13]. The studies had to meet the following criteria: 1) Patients presented with the clinical features of CHD and had echocardiographic evidence of disease or surgical records; 2) Studies conformed to approved institutional guidelines and all patients were recruited with written informed consent; 3) Patients had established environmental risk factors for CHD including maternal illnesses, drug use during the first trimester of pregnancy, parental smoking, and chronic exposure to toxic substances or ionizing radiation.
The exclusion criteria for the non-genetic risk data were the same as those for the CHDGKB [13] criteria (i), (ii), (iii).

Database construction
The CHD-RF-KB web interface was constructed with MySQL (10.4.6-MariaDB), Apache (2.4.39), PHP (7.3.8), HTML, Bootstrap 4, and JavaScript. An overview of the construction of CHD-RF-KB with non-genetic factors is shown in Fig. 1.

Updating genetic information
The updated version of CHDGKB includes the details from 284 individual studies. Up to 5th May 2020, the data from 697 studies were manually mined in the CHD-RF-KB version. The genetic information was updated to include 5521 items consisting of 4830 small variations, 657 copy number variations (CNVs), 17 methylations, and 17 other genetic variations. The small variations included 3714 SNPs, 1057 mutations (not SNPs), 12 haplotypes, and 47 other variations. In the current version, we also extended the related statistical function between the NS-CHD subtypes and variant genes (correlation criterion: P < 0.05). Taking atrial septal defects (ASDs) as an example, when "ASD" is entered as the "subtype" on the "Statistics" interface, the webpage shows a correlation diagram between "ASD" and all related genes, as presented in Fig. 2. When "GATA4" is entered into the "Gene" interface, a correlation diagram between "GATA4" and the related subtypes is presented together with the corresponding genetic information (Fig. 3).
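The item counts reported above can be cross-checked with a few lines of arithmetic. The following is only an illustrative sketch: the dictionary keys and the helper name `total` are our own labels, not identifiers from CHD-RF-KB, and the numbers are copied from the text.

```python
# Counts as reported in the text for the updated genetic information.
# Dictionary keys are illustrative labels, not database field names.
genetic_items = {
    "small variations": 4830,
    "copy number variations": 657,
    "methylations": 17,
    "other genetic variations": 17,
}
small_variations = {
    "SNPs": 3714,
    "mutations (not SNPs)": 1057,
    "haplotypes": 12,
    "other variations": 47,
}

def total(counts):
    """Sum the per-category counts of one breakdown."""
    return sum(counts.values())

# The top-level breakdown should add up to the reported 5521 items,
# and the small-variation breakdown to the reported 4830.
assert total(genetic_items) == 5521
assert total(small_variations) == genetic_items["small variations"]
print("reported totals are internally consistent")
```

A check of this kind is a cheap way to catch transcription errors when counts are updated between database releases.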

Extension to the non-genetic factors
CHD-RF-KB was extended to include the non-genetic factors associated with NS-CHD. The risk factors were classified into five groups as shown in Table 1 [14,15]. Based on these definitions, the 4,236 non-genetic risk factors were distributed as risk (23%), protective (5.2%), non-influencing (1.6%), unrelated (1.7%) and unknown factors (68.6%) (Fig. 4A). The non-genetic risk factors were further divided into seven subgroups: clinical (42.3%), environmental (1.0%), lifestyle (2.5%), molecular (2.4%), physiological (36.05%), psychosocial (4.6%) and combined factors (11.14%). Each of the seven sub-classifications had specific details that were sorted according to the top 10 sub-classifications of risk factors that were correlated with all of the NS-CHD subtypes. These data are presented in Fig. 5.
Similar to the correlation functions at the genetic interface, this new function was extended to the "non-genetic" interface, in which users can search for all of the risk factors associated with a certain subtype. For example, when "ASD" is entered as the type in the "non-genetic" details interface, the webpage shows a classification of the risk factors related to ASD (Fig. 4B) along with a sub-classification of the factors associated with ASD (Fig. 4C). Users can also search all of the NS-CHD subtypes that are correlated with a specific factor. When "treatment" is entered in the sub-classification of factor interface, a correlation diagram between the treatment and the associated subtypes is shown in the statistics interface, along with the correlated risk factor information (Fig. 6).

Data browsing and retrieval
Users can browse the risk factor data by choosing the classification, sub-classification, factors and risk factors (e.g. protective or unrelated factors). Users can search for information on the non-genetic risk factors related to a certain CHD subtype on the query interfaces in two ways: 1. Search the CHD subtype in the "Contain" menu. 2. Search the "exact" menu, a precise query in which users select any of the NS-CHD types/subtypes from the dropdown menu.
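The difference between the "Contain" and "exact" queries can be sketched as a substring match versus an equality match. The example below is a minimal illustration only: it uses Python's built-in `sqlite3` as a stand-in for the MySQL back end, and the table name `risk_factor`, its columns, and the demo rows are hypothetical, not the actual CHD-RF-KB schema or data.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory table with a few illustrative rows (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE risk_factor (subtype TEXT, factor TEXT, classification TEXT)"
    )
    conn.executemany(
        "INSERT INTO risk_factor VALUES (?, ?, ?)",
        [
            ("ASD", "demo factor A", "risk"),
            ("ASD; VSD", "demo factor B", "risk"),
            ("VSD", "demo factor C", "unknown"),
        ],
    )
    return conn

def search_contain(conn, term):
    """'Contain'-style query: subtype field includes the term as a substring."""
    cur = conn.execute(
        "SELECT subtype, factor, classification FROM risk_factor "
        "WHERE subtype LIKE '%' || ? || '%' ORDER BY factor",
        (term,),
    )
    return cur.fetchall()

def search_exact(conn, term):
    """'Exact'-style query: subtype field equals the term."""
    cur = conn.execute(
        "SELECT subtype, factor, classification FROM risk_factor "
        "WHERE subtype = ? ORDER BY factor",
        (term,),
    )
    return cur.fetchall()

conn = build_demo_db()
print(search_contain(conn, "ASD"))  # rows whose subtype mentions ASD
print(search_exact(conn, "ASD"))    # rows whose subtype is exactly ASD
```

In this sketch the "Contain" query also returns records annotated with several subtypes at once, whereas the "exact" query returns only records whose subtype field matches the selected term.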

Data download and submission
Similar to the CHDGKB version, all of the NS-CHD non-genetic information can be downloaded in Excel format (http://www.sysbio.org.cn/CHDRFKB/Download.html). The risk factor data can be submitted to repositories at http://www.sysbio.org.cn/CHDRFKB through the "Submit" interface for further validation and updating.

Non-genetic risk factors correlated with NS-CHD subtypes
ASD, reported as one of the most common subtypes in the CHD-RF-KB, was selected as an example. Five types of risk factors correlated with ASD are shown in Fig. 4B. These risk factors can be separated into those related to ASD risk and those correlated with ASD prognosis. The distribution of these factors in the application is shown in Fig. 7. A total of 89 risk factor items related to ASD risk were classified following the single-factor classification of cardiovascular diseases [16]. These were divided into seven sub-classifications: clinical (30 items), physiological (10 items), molecular (16 items), environmental (9 items), psychosocial (7 items), combined (7 items) and lifestyle factors (4 items).

Genetic risk factors correlated with NS-CHD subtype
Using ASD as an example and based on a criterion of P < 0.05, when "genetic" and the CHD subtype "ASD" were selected on the web statistics interface, a list of 205 items with genetic variations was shown. Amongst the genetic variations that were correlated with ASD, a total of 31 genes were identified. The 11 variation types related to ASD are shown in Fig. 8. Amongst these, missense and intron variants accounted for the two largest proportions at 37.07% (76 items) and 24.39% (50 items), respectively. The remaining 9 variation types included downstream, upstream, synonymous, 3 prime UTR, 5 prime UTR, intergenic, non-coding transcript exon, frameshift and unknown variants.
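The two reported proportions follow directly from the item counts (76 and 50 out of 205). A minimal sketch of the calculation, with `proportion` as a helper name of our own choosing:

```python
def proportion(count, total):
    """Share of `count` in `total`, as a percentage rounded to two decimals."""
    return round(100 * count / total, 2)

TOTAL_ASD_ITEMS = 205   # genetic variation items correlated with ASD (from the text)
MISSENSE_ITEMS = 76     # missense variants (from the text)
INTRON_ITEMS = 50       # intron variants (from the text)

missense_pct = proportion(MISSENSE_ITEMS, TOTAL_ASD_ITEMS)
intron_pct = proportion(INTRON_ITEMS, TOTAL_ASD_ITEMS)

# Items left for the remaining nine variation types, derived by subtraction.
remaining_items = TOTAL_ASD_ITEMS - MISSENSE_ITEMS - INTRON_ITEMS

print(missense_pct, intron_pct, remaining_items)
```

The remaining nine variation types therefore share 79 items between them; their individual counts are not given in the text.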

GO enrichment analysis and pathway mapping
The R package clusterProfiler was used for the GO (Gene Ontology) analysis of the ASD subtype at the biological process (BP), cellular component (CC), and molecular function (MF) levels. The associated genes and the number of enriched GO terms are listed in Table 2. The top 10 significantly enriched terms (p < 0.05) on the two process levels for ASD are summarized in Fig. 9A and 9B. For the ASD subtype, at the BP level, the most significantly enriched terms were mainly related to stem cell differentiation, mesenchyme development, cardiac septum morphogenesis/development and cardiac chamber morphogenesis/development.
At the MF level, the most significantly enriched terms were mainly correlated with DNA-binding transcription activator activity, RNA polymerase II-specific, RNA polymerase II transcription factor binding and activating transcription factor binding. KEGG pathways were also generated based on the enrichment analysis. The top four significantly enriched terms of the KEGG pathways for ASD are summarized in Fig. 9C. The cGMP-PKG signaling pathway, Human T-cell leukemia virus 1 infection, AGE-RAGE signaling pathway in diabetic complications, and the one-carbon pool by folate were pathways identified as essential for the occurrence of the ASD subtype.
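Enrichment analyses of this kind are commonly based on an over-representation test using the hypergeometric distribution. The sketch below illustrates that test in plain Python; it is not the clusterProfiler implementation, it omits multiple-testing correction, and the background size, term size, and overlap in the usage lines are hypothetical numbers chosen only for illustration (the 31 ASD-related genes are taken from the text).

```python
from math import comb

def hypergeom_enrichment_p(k, n, K, N):
    """Upper-tail hypergeometric p-value P(X >= k).

    N: number of genes in the background
    K: number of background genes annotated to the term
    n: number of genes in the query list
    k: number of query genes annotated to the term
    """
    upper = min(n, K)
    # Sum the probabilities of observing k, k+1, ..., upper annotated genes.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

# Hypothetical example: 31 query genes, 5 of which fall in a term that
# annotates 100 of 20000 background genes (all but the 31 are assumptions).
p_value = hypergeom_enrichment_p(k=5, n=31, K=100, N=20000)
print(p_value < 0.05)
```

A small p-value indicates that the term is annotated to more query genes than expected by chance; in practice the p-values for all tested terms would also be adjusted for multiple comparisons.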

Correlation analysis of the non-genetic risk factors associated with ASD
As shown in Fig. 9, the 441 risk factors associated with ASD were correlated with complications, mortality, and ASD occurrence risk. The 89 items that were ASD risk factors were selected for further analysis (Fig. 10).
The physiological factors for ASD were mainly birth weight, detection age, and gender. Birth weight below 2500 g, screening at 0-3 months of age, and neonatal asphyxia or hypoxia were suggested as risk factors for an ASD diagnosis [17,18]. Individuals aged 10-40 years and females had a high risk of ASD compared to males aged 0-9 years [19]. Amongst the molecular factors, the genotype of the MTHFR gene (c.677C>T: CT or TT) was correlated with ASD [20]. Wang et al. [21] reported the prevalence of ASD to be 43% in first-degree relatives, which was significantly higher than the 4.4% in second-degree relatives. Furthermore, the prevalence of ASD in twins (90%) was significantly higher than in siblings (62.2%). These data indicate that genetic factors play an important role in the development of CHD.
Amongst the combined factors, genetic variation combined with harmful parental environments or unhealthy lifestyles was associated with ASD risk. For instance, a functional aryl hydrocarbon receptor (AhR) genetic variant (p.Arg554Lys; rs2066853) is a risk factor for ASD on its own. When individuals carrying the variant (Lys/Lys or Arg/Lys genotypes) also had a parental history of exposure to toxic environments or smoking, their risk of ASD was significantly higher than that of individuals without exposure histories [12]. Genetic and environmental factors may therefore jointly contribute to the development of CHD. Furthermore, the prevalence of ASD increased with fetal order [18] and with ascending altitude [22]. Therefore, age at screening, female sex, high-altitude environments, and having first- or second-degree relatives with CHD are all associated with risk.
A total of 52 ASD risk factors associated with offspring mainly included maternal diseases, environmental exposures, maternal psychosocial and physiological factors, unhealthy lifestyles, and drug treatments. First, maternal illnesses such as diabetes mellitus (type 1 or 2), hypertension before and during pregnancy, anemia, epilepsy, connective tissue disorders, and mood disorders were all identified as risk factors for ASD [23]. Moreover, maternal respiratory tract infections [18], vaginal infections, and clotting disorders were all significantly associated with ASD [24]. Malnutrition during pregnancy and histories of abnormal childbearing were also found to be correlated with ASD in offspring [25]. Maternal illness and diabetes mellitus (type 2) were related to the risk of ASD occurrence in offspring and also increased the severity of CHD [23].
Dolk et al. found that increased paternal blood pressure and the use of anti-clotting medications (enoxaparin and aspirin) in the first three months of pregnancy were correlated with ASD [24]. Therefore, high-risk diseases such as diabetes mellitus should be prevented before and during pregnancy, especially where high blood pressure may be involved in both parents. Upper respiratory tract infections and the use of medication during early pregnancy should also be avoided. Health education should be offered to women of childbearing age, along with the use of improved obstetric procedures and techniques, to reduce the risk of CHD.
Medication history and exposure to adverse environmental factors were shown to increase the prevalence of ASD in offspring. For example, exposure to home decoration environments during pregnancy increases the risk of isolated CHD such as ASD or VSD in offspring and is significantly correlated with complex CHD. Moreover, exposure to housing renovations in the first trimester (less than one month after renovation) increases the risk of ASD in offspring more than exposure before pregnancy [26]. This may be because the first trimester is the teratogen-sensitive period of the embryo.
In addition, unhealthy maternal lifestyles are related to the occurrence of ASD. Studies have shown that poor maternal sleep quality can increase the risk for ASD and other CHD subtypes in offspring. Within the same group of pregnant women with poor sleep quality, the concurrence of daytime naps decreases the risk of simple CHD [27]. Dolk's research also showed that mothers who drank fizzy or high-energy drinks every day had a higher risk of ASD in their offspring [24]. Maternal physiological and psychosocial factors were also correlated with the risk for CHD in the offspring. Maternal age over 40 years and gestational age of less than 37 weeks were both associated with a higher risk of ASD compared with younger pregnant women and full-term deliveries. Mothers with blue-collar occupations [28], lower education levels, multiple stresses in the periconceptional period, and other psychosocial factors also had a higher prevalence of CHD in offspring [24].

Correlation analysis of genetic risk factors and ASD
From Fig. 8 it can be seen that intron mutations accounted for the second-highest proportion of ASD-related genetic variations (24.4%). Across all NS-CHD-related small variations (a total of 992 variation records), intron mutations ranked third of all the small variation types (20.5%). These data suggest that intron mutations play a pivotal role in the occurrence of CHD. It is generally assumed that intron sequences do not play a role in the pre-mRNA splicing process because they lie far from the classical splice sites. However, an increasing number of studies have shown that mutations in the intron regions of many disease-related genes, including single base mutations at the junction sites between introns and exons, can affect the splicing of pre-mRNA. This alternative splicing often results in the generation of new exons in the mature mRNA product. It has been reported that in CHD and related complications, the c.3964+1G>T mutation in intron 32 of the FBN1 gene can contribute to Marfan syndrome [29]. Zhao et al. also found that the functional SNP c.56+781A>C, located in an intron of the MTRR gene, which is associated with the cysteine/folate metabolic pathway, is an important genetic marker for ASD [30].
The diverse functions of introns, such as enhancement effects, promoter functions and other mediating factors, can give introns more significant biological functions. The conservation of huge intron sequences in the human genome has special functions in biological evolution [31]. Therefore, more attention should be paid to intron mutations in genetic analysis. Based on the high percentage of intron mutations in ASD found in our database, the discovery and annotation of mutations in non-coding regions should be a particular focus when analyzing CHD-related genetic variations, as this can help to improve the diagnostic efficiency of genetic factors associated with CHD.
The GO annotation and the enriched GO terms at three major process levels are summarized in Table 2. At the BP level, the most frequently annotated gene was NRP1 (NM_003873.6) which had a total of 285 annotations. The most significantly enriched GO terms with target genes were mainly related to multiple cardiac septum development processes such as cardiac septum morphogenesis, outflow tract septum morphogenesis, ventricular chamber morphogenesis, and cardiac septum morphogenesis (Fig. 9A).
Comparative transcriptomics analysis demonstrated that cardiac-specific transcription factors (GATA4 and NKX2-5, which were annotated in our correlation analysis with ASD), extracellular signal molecules, and cardiac sarcomeric proteins were downregulated in ASD. These changes may influence the formation of the heart atrial septum, cardiomyocyte proliferation, and cardiac muscle development [32]. The study also showed that the decreased expression of cell cycle proteins may affect cardiomyocyte growth and differentiation during atrial septum formation. At the MF level, the most annotated gene was NOS3 (NM_000603.5). A study on the role of NOS3 in myocardial performance indicated that NOS3 contributes to the bioactive NO pool during the development of sepsis and results in impaired cardiac contractility [33].
The most significantly enriched GO terms were mapped to NADP binding, oxidoreductase activity, acting on NAD(P)H, coenzyme binding, and flavin adenine dinucleotide binding. Compared to the enriched terms of isolated NS-CHD in the CHDGKB version [13], the enriched terms associated with ASD focused on sequence-specific DNA binding, DNA-binding transcription activator activity, enhancer binding, and RNA polymerase II transcription factor binding (Fig. 9B). Previously, it has been shown that specific NKX2-5 mutations result in abnormal protein degradation through the ubiquitin-proteasome system and can contribute to CHD due to partially impaired transcriptional activity [34]. Furthermore, enhancers regulate transcription by binding to transcription factors, which in turn can recruit cofactors to activate RNA polymerase II at core promoters [35]. These findings demonstrate interactions between the processes of the ASD-related enriched terms described above. At the CC level, although the GO terms were not significantly enriched, the most frequently annotated gene was also NRP1 (NM_003873.6), a vital gene annotated to the intermediate filament cytoskeleton and a key receptor in the outflow tract of the developing heart septum [36]. The NOS3 gene was annotated at all three GO levels.
Based on the target genes related to ASD, we performed enrichment analysis of KEGG pathways and annotated a total of 107 KEGG metabolic pathways. These included the cGMP-PKG signaling pathway, fluid shear stress and atherosclerosis, and cellular senescence. The significantly enriched pathways were mainly the cGMP-PKG signaling pathway and the AGE-RAGE signaling pathway in diabetic complications (Fig. 9C). In the cGMP-PKG signaling pathway, four genes (GATA4, NOS3, EDNRA, and NFATC1) were found to be associated with ASD. Among these, GATA4 is a zinc finger transcription factor that is essential for heart development and disease onset [37]. Other studies have shown that the transcriptional activity of GATA4 is mediated by cell signals that are dependent on cGMP-PKG-1a activity.
Protein kinase G (PKG) is a serine/threonine-specific kinase and the main effector of cGMP signal transduction. Enhanced transcriptional activity induced by the co-expression of GATA4 and PKG-1a has also been observed. Phosphorylation of GATA4 can be detected on serine 261 (S261), and the C-terminal activation domain of GATA4 is related to PKG-1a. PKG-1a enhances the DNA binding activity of GATA4 through phosphorylation and physical connection processes. Many GATA4 mutations are associated with human diseases and exhibit impaired phosphorylation on S261, indicating that S261 phosphorylation defects are involved in human heart diseases [38,39]. In summary, cGMP-PKG signaling mediates the transcriptional activity of GATA4, connecting GATA4 and PKG-1a mutations with human heart disease.
Another significantly enriched pathway involving ASD target genes was the AGE-RAGE signaling pathway which may be closely related to the influence of diabetes regulatory gene NOS3. Diabetes is also a risk factor related to CHD. It has been reported that NOS3, combined with TBX5 haploinsufficiency can cause abnormal heart formation [40]. These observations provide a new perspective on the molecular mechanisms of the combined impacts between genes, the environment, and other CHD risk factors.

Conclusions and future directions
Based on the risk factor information correlated with the ASD subtype in our CHD-RF-KB, further correlation analysis was performed between other risk factors, complications, prognosis, and therapies for ASD. These data enable the development of a prediction model for ASD diagnosis and prognosis using logistic regression [41] or other methods [42]. These applications could be extended to other NS-CHD subtypes to help users make precise assessments of the risk of NS-CHD onset and prognosis, and to inform diagnosis and treatment strategies.
The purposes and content domains of other existing congenital heart disease databases are largely different from those of CHD-RF-KB [43,44], as our data were curated from original research in PubMed. However, CHD-RF-KB has several limitations. Firstly, transcriptional information was not included in the current database. We could include more CHD-associated functional variations aiming to determine the complex relationships between genes and regulatory networks. Secondly, data from clinical or animal studies could be used to validate the findings and to demonstrate the underlying mechanisms of multiple risk factors in the development of NS-CHD. Finally, as the scientific discovery paradigm has shifted to a data-driven model [45], we will ensure that our knowledgebase is regularly updated and expanded to include new associations with environmental factors and to integrate proteomic and epigenetic data and artificial intelligence models of NS-CHD.
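To make the idea of a logistic regression risk model concrete, the sketch below fits one by plain gradient descent. It is a generic illustration under strong assumptions, not a model built from CHD-RF-KB: the two binary features and all twelve training records are synthetic and carry no clinical meaning, and the function names are our own.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(w, b, x):
    """Predicted probability of the outcome for one feature vector."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and intercept by batch gradient descent on the log-loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    m = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / m for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / m
    return w, b

# Synthetic records: [hypothetical factor 1 present, hypothetical factor 2 present]
X = [[0, 0], [0, 0], [0, 0], [0, 0], [1, 0], [1, 0], [1, 0],
     [0, 1], [0, 1], [1, 1], [1, 1], [1, 1]]
y = [0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1]  # synthetic outcome labels

w, b = fit_logistic(X, y)
print(predict_proba(w, b, [1, 1]) > predict_proba(w, b, [0, 0]))
```

A real model would be trained on curated patient-level data, validated on held-out records, and would typically rely on an established statistics package rather than a hand-written optimizer.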