Mutations in the tail domain of MYH3 contributes to atrial septal defect.

Atrial septal defect (ASD) is one of the most common congenital heart defects diagnosed in children. Sarcomeric genes has been attributed to ASD and knockdown of MYH3 functionally homologues gene in chick models indicated abnormal atrial septal development. Here, we report for the first time, a case-control study investigating the role of MYH3 among non-syndromic ASD patients in contributing to septal development. Four amplicons which will amplifies the 40 kb MYH3 were designed and amplified using long range-PCR. The amplicons were then sequenced using indexed paired-end libraries on the MiSeq platform. The STREGA guidelines were applied for planning and reporting. The non-synonymous c. 3574G>A (p.Ala1192Thr) [p = 0.001, OR = 2.30 (1.36–3.87)] located within the tail domain indicated a highly conserved protein region. The mutant model of c. 3574G>A (p.Ala1192Thr) showed high root mean square deviation (RMSD) values compared to the wild model. To our knowledge, this is the first study to provide compelling evidence on the pathogenesis of MYH3 variants towards ASD hence, suggesting the crucial role of non-synonymous variants in the tail domain of MYH3 towards atrial septal development. It is hoped that this gene can be used as panel for diagnosis of ASD in future.


Introduction
Atrial septal defect (ASD) is one of the most common forms of congenital heart defects (CHD), with an estimated incidence of an approximately 100 per 100 000 live births [1] and accounting for 10-15% of CHD worldwide [2][3][4]. Based on the location relative to the fosa ovalis, ASD refers to a communication between the right and left atria, anatomically classified from either the parents of patients below 18 years old or directly from the patients who are 18 years old and above. Additionally, all patients and/or parent' of patients has signed written informed consents to allow publication of their medical and/or genetic information. A total of 69 non-syndromic ASD patients (cases) confirmed with ASD were recruited within almost one year. Baseline assessment of individual and family histories, review of the medical records and two-dimensional echocardiography with colour flow Doppler by an experienced paediatric cardiologist was recorded. For each case recruited, a gender-ethnicity matched control was recruited, this is to avoid analysis stratification. Echocardiography was also carried-out on the controls to eliminate potential bias. The recruitment of patients and controls cohort was conducted at Echocardiography Unit, Hospital Universiti Sains Malaysia (HUSM), Kubang Kerian, Kelantan, which is the main tertiary cardiac referral centre in the northeast region of Peninsular Malaysia. Following signing of written informed consents, peripheral blood (3 ml) was collected in EDTA tubes and stored at 4˚C until further analysis.

Long-range PCR development and amplification
Genomic DNA of the recruited samples were extracted from peripheral blood from patients (n = 44) using a commercially available kit GeneAll 1 Exgene TM Blood SV mini, GeneAll, Korea, at the Human Genome Center, Universiti Sains Malaysia (HGC-USM). For patients who refused to give their blood (n = 25), saliva samples were collected instead using PSP1 Sal-ivaGene DNA kit, STRATEC. Findings from a high-throughput genotyping study reported that DNA isolates from blood draws and DNA isolates form saliva showed no comparable differences in term of results [21]. The concentration and purity of the extracted DNA was measured using NanoQuant (Tecan, USA).
MYH3 sequence from Ensembl Genome Browser (ID: ENSG00000109063, GRCh38) was used to design the LR-PCR primers. The entire promoter, 5' and 3' MYH3 untranslated and coding regions were amplified in four distinct LR-PCR reactions. LR-PCR was performed using a Max Taq Polymerase (Vivantis, USA), according to the manufacturer's protocol at the HGC-USM.
DNA thermal cycling was performed using a SureCycler 8800 (Agilent Technologies, UK); 94˚C for 2 min, 94˚C 12 sec, annealing temperature for 30 secs, 68˚C for 10 mins, 68˚C for 7 mins for a total of 35 cycles. The size of the amplified PCR products was determined by gel electrophoresis using SYBR1 Green I nucleic acid gel stain (Life Technologies, USA) under ultraviolet light. The amplified products were then purified by using an Illustra Exo-ProStar (GE Healthcare, UK). Subsequently, the LR-PCR fragments were quantified using Qubit1 1.0 Fluorometer (Invitrogen, USA) with Qubit dsDNA BR Assay Kits (Invitrogen, USA) and were pooled together at equal molar ratios.

NGS library preparation and sequencing
For each sample, 5 μl (1 ng in total) at 0.2 ng/ul of pooled LR-PCR products was used to generate indexed paired-end libraries with Nextera XT DNA Sample Preparation Kit (Illumina, San Diego, CA) according to the manufacturer's protocol. A fragment length of the libraries was ascertained using the High Sensitivity DNA kit (Agilent Technologies, Santa Clara, CA) on the 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA). Normalized libraries were subjected to a 300-cycle sequencing run with MiSeq Reagent Nano Kit v2 (Illumina, San Diego, CA) on the MiSeq (Illumina, San Diego, CA), were carried out the Genetic Laboratory, faculty of Science, University of Malaya.

Data enrichment and variant calling
The sequencing data was evaluated with FastQC (version 1.0, Babraham Bioinformatics, UK) for quality control on the raw data. Subsequently, the reads were aligned to a reference genome (hg19) by BWA-MEM (version 1.0) algorithm based on the default settings. The SNPs were filtered by a set of filters. A variant Studio v2.1 software (Illumina, San Diego, CA) was used to identify and annotate exonic and intronic variants and to determine if the variants have been reported in public databases. An integrative Genomics Viewer (IGV version v1.0.0) was used to examine the read counts of the target amplicons and confirm the detected variants. All variants identified by NGS were resequenced with Sanger sequencing.

Genotyping in controls cohort
The identified variants were genotyped in the controls cohort by made-to-order TaqMan SNP Genotyping Assay (ThermoFisher, USA). Variants that are not readily-available were custom designed and genotyped using the TaqMan SNP Genotyping Assay (ThermoFisher, USA). Variants that were not readily available as pre-designed Made-to-order TaqMan1 SNP Genotyping Assay and cannot be custom designed were then genotyped using Sanger sequencing method.

Statistical and bioinformatics analyses
The study protocol and reporting were developed according to the STREGA guidelines for case-control studies [22]. The allelic and genotypic frequencies of the patient and control groups were determined using a Fisher's exact test with odds ratio and 95% confidence interval (CI). A x 2 P-value of < 0.05 was considered as statistically significant. Association was evaluated for every variant and False Discovery Rate and Bonferroni adjustments were used for multiple-testing corrections. Haploview version 4.2 (Broad Institute of MIT and Harvard, USA) was used for analysis of linkage disequilibrium (LD) and haplotype block [23]. LD blocks were defined according to the haplotype block definition of Gabriel and colleagues [24]. The method defines pairs to be in strong LD if the one-sided upper 95% confidence bound on D 0 is >0.98 and the lower bound is above 0.7.
In-silico analysis of pathogenic potential of identified non-synonymous variants were predicted using Sorting Intolerant From Tolerant (SIFT) and Polymorphism Phenotyping v2 (PolyPhen-2). SIFT scores which range between 0 and 1 as well as scores below 0.05 suggest that the amino acid change is not tolerated whereas, PolyPhen-2 scores >0.85 are interpreted as "probably damaging" and scores 0.15-0.85 as "possibly damaging". The evolutionary conservation of the identified non-synonymous variants was conducted using a Clustal Omega programme (EMBL-EBI, European Bioinformatics Institute). The splicing effects were predicted by the Human Splicing Finder (HSF) version 3.0 (http://www.umd.be/HSF/) and ESEfinder 3.0 (http://rulai.cshl.edu/).

Modelling of the mutant protein structure
Prior to the template selection, both wild and mutant types of complete sequences of human myosin MYH3 were aligned against the sequences from Protein Data Bank (PDB), (www.rcsb. org/pdb). Our results showed that protein structures with the PDB ID: 5TBY and 5WJ7 with 72% and 83% sequence identity, respectively are suitable templates for homology modelling. Multiple templates approach homology modelling was adopted in this study. Both the selected templates indicated to belong to the family protein from Homo sapiens myosin 7, which is closely related to the modeled sequences.
A total of 20 models for wild and mutant type sequence were built using Modeller 9v20 (www.salilab.org/modeller) and the best models with the lowest discrete optimize potential energy (DOPE) scoring was selected for each wild and mutant types. The structural differences between the mutant against the wild type built model was calculated using a root mean square deviation (RMSD) method.

Nomenclature and classification for detected variants
All sequence variants are named according to the nomenclature guidelines published by the Human Genome Variation Society (HGVS) [25]. Nucleotide and amino acid numbering were based on GenBank reference sequences NM_002470.3 and NP_002461.2 respectively.

Results
In the present study, a total of 69 non-syndromic ASD patients were recruited. However, only 51 samples were further proceeded as 18 samples were excluded due to low genomic DNA concentration (<50 ng/ul) and low sequencing libraries of <10 nM.
The entire 40 kb of MYH3 was successfully sequenced in 51 non-syndromic ASD patients using LRPCR-NGS approach. The validity of the results was confirmed with Sanger sequencing.

Demographic and clinical data
Although a total of 69 non-syndromic ASD patients where recruited (blood sample, n = 44 and saliva sample, n = 25), only 51 samples were further proceeded for sequencing. The demographic data of these 51 samples were; aged between 3-62 years old with mean age of 4.72 ± 1.77 years. Based on ethnicity, 47 (92.16%) were Malays, two (3.92%) were Chinese and two (3.92%) were Siamese. In terms of gender, 4 (28.6%) were male and 47 (71.4%) were female patients. The cardiac phenotypes of the recruited patients were categorized into four groups 1) ASD secundum (82.2%) 2) ASD primum (6.5%) and 3) fenestrated ASD (11.3%). The recruited controls cohort consisting of 73 healthy individuals were matched for age, ethnicity and gender with the cases. The age of controls ranged from 3-62 years old with mean age of 4.72 ± 1.77 years. In term of ethnicity and gender; 69 Malays, two Chinese and two Siamese, with 59 of them being females and 14 males.

Coverage and sequencing depth of targets of LRPCR-NGS
Four pairs of LR-PCR primers were designed to amplify the entire 40 kb of MYH3. These primers were optimized at annealing temperatures of 63.1˚C, 62.4˚C, 64.7˚C and 64.7˚C respectively, employing LR-PCR cycling method within a short period of time. Targeted sequencing using Illumina MiSeq paired-end sequence resulted in 11,947,785 reads (98.05%) of which 99.8% may be aligned with the human genome (S1 Fig). This provided 99.8% coverage of the sequence at a minimum sequencing depth of 6×. The run had a cluster density of 615 +/-6 K/mm 2 with approximately 92.63 +/-2.05% of the clusters passing the QC filter. The observed read depth in all the samples was > 100 X per base with Q � 30 (99.9% base call accuracy).

MYH3 variants analysis
Variants with minor allele frequency (MAF) <5% and phred scale of <30 were excluded, resulting in 11 variants and one indel. Fig 1B and 1C show the diagrammatic position of the MYH3 variants (Fig 1) detected while Table 1 summarizes the characteristic of the identified variants. To evaluate the performance and validity of the LRPCR-NGS, resequencing of the identified variants was carried out using a Sanger sequencing.  (Table 2). These variants were further subjected to haplotype analysis. The LD block of rs2285477 (c.3574G>A), rs2285474 (c.2952T>C) and rs2285473 (c.2926-12A>G) showed a strong correlation with

Statistical and bioinformatics analysis
ASD (p-value; GTA: 0.0003, ACG: 2.878e-5) (Fig 2), haplotypes: GTA, ACG and ATG also showed statistical significant (p<0.05) when compared to the control group. The non-synonymous c.5254G>A (p. Ala1752Thr) was predicted to be deleterious by SIFT tool (0.01) whereas PolyPhen-2 predicts it to be benign (0.144). The non-synonymous c.3574G>A (p. Ala1192Thr) was predicted to be tolerated and benign based on SIFT and Poly-Phen-2 tool with scores of 0.44 and 0.003 respectively ( Table 1). The Ala1751 (p. Ala1752Thr) and Ala1191 (p. Ala1192Thr) residues were identified to be highly conserved in human MYH7B, MYH3, MYH6 and MYH7 whereas in orthologous sequence it is conserved among chimp, mouse, rat, human and dog (Fig 3).
The putative 5' and 3' splice sites as well as branch sites were predicted to determine if the identified mutations would disrupt these sites. The results were interpreted by comparing the percentage of consensus values (CV%) between the wild-type and the mutant variant [26]. CV of higher than 80 indicates strong sites, and a negative CV variation percentage between the wild-type and mutant variant indicates that the variant affects splicing mechanism. The c.2916A>G mutation that occurs at late exonic position was predicted to affect the splicing mechanisms. The HSF analysis for c.2916A>G generated consensus values of 80.66 and 51.71 for the wild and mutant types respectively with predicted consensus deviation value of -35.89%. The A to G synonymous substitution (exon 23, 324 th nucleotide) affects the SC35 binding motif TGCCACAG, reducing the score from 2.90 to 0.0 (below the default threshold) and creating a sequence containing a potential SRp40 binding motif (TGCCACG score: 2.80) (S2 Fig)   IL). The native and wild types models were superimposed, generating RMSD value of 5.4Å. To assess the stereochemical properties of the built model, Ramachandran plot analysis was conducted. Ramachandran plot analysis showed that out of 1121 total residues in both the built models, the wild and mutant type models were observed with 86.5% and 86.1%, residues fall in most favoured region, respectively.

Discussion
To our knowledge, this is the first study to elucidate the role of MYH3 towards ASD in nonsyndromic patients employing LRPCR-NGS approach. A total of 12 genetic variants were detected of which seven (58%) were located within the tail domain of MYH3. The non-synonymous variant c.5254G>A (p. Ala1752Thr) was identified in two female ASD patients having ASD secundum with defect size of 0.5 cm and 0.8 cm respectively but was absent in the 73 unrelated controls. Nevertheless, the non-synonymous c.3574G>A (p. Ala1192Thr) variant was identified in six patients and was associated with ASD [p = 0.001 (OR = 2.30 [1.36-3.87]). Evolutionary conservation analysis of MYH3 amino acid sequence showed that these variants are highly conserved in the sarcomeric MYH genes expressed in the heart, and orthologous sequence alignment showed it to be conserved in chimp, mouse, rat, human and dog. The amino acid substitution of hydrophobic alanine to hydrophilic theornine in p. Ala1752Thr and p. Ala1192Thr, places a hydrophobic side chain into a hydrophilic environment in the tail domain, which suggest that the mutant complex might be destabilized (ΔΔG of binding) relative to the wild-type complex. This action may pertubate the ability of the long coiled-coil tails of the individual myosin molecules to form thick filaments of the sarcomere. Thus further differing the ability of C-terminal tail domain's ability for the heavy chains to self-associate and/or bind the myosin to its cargo [27]. A less stable thick filament could also affect the coupling between ATP hydrolysis and force generation by deficient transmission of energy from the motor domain to the sarcomeric structure [28].
Studies have also reported that mutations in the tail domain could lead towards disruption of either the head or the rod domain of embryonic myosin [29]. This result is similar to that conducted by Tajsharghi and colleagues [30], indicating that the tail domain is a functionally important residue for myosin heavy change (MHC) association, which binds thick filament in the cardiac muscle. A study by Ching and colleagues [31] identified I820N mutation that causes hydrophobic isoleucine to change to hydrophilic asparagine, thus predicting that the change may affect the binding of MHC to its regulatory light chain thus suggesting that these mutations may affect the normal formation of the myofibrils, which is in-hand with the findings of current study. However, histological analysis of the myocardium is substantially important to justify the said postulation.
The present study also identified c.2916A>G which is located at the late exonic position of exon 23 of the tail domain to affect splicing mechanism. In silico analysis suggested that this mutation will result in alterations of the exonic splicing enhancer (ESE) site, that would cause either skipping of the respective exon, activation of cryptic site or intron retention. Homology modelling of c. 3574G>A (p.Ala1192Thr) mutant against the wild type protein, showed some degree of difference with RMSD value of 5.4Å. However, these predictions were identified by the combined use of different algorithms where it is not yet possible to consistently predict the effects of nucleotide substitutions on splicing [32]. Therefore, in vitro analysis can be employed to further validate these predictions thus allowing a more accurate determination of the type of nucleotide changes which may affect the splicing mechanism.
MYH3 plays important role in skeletal muscle development [33]. Other genes have additionally been reported to play important roles in skeletal muscle developments such as MYH2, MYH7 and MYH8 where myosin myopathies involving these genes have been widely reported. However, the phenotypes associated with these genes vary from one skeletal muscle gene to another, ranging from prenatal non-progressive arthrogryposis syndromes to adult-onset progressive muscle weakness [33]. Additionally, mutations in MYH2 have been associated with dominant myopathy characterized by ophthalmoplegia, congenital joint contractures, and rimmed vacuoles in muscle fibers [34,35]. Mutations in MYH7 cause skeletal myopathies such as myosin storage myopathy and Laing early-onset distal myopathy [33,34,36,37].
Although studies have reported the contributing factors of MYH3 towards various syndromes including distal arthrogryposis type I, 2A (Freeman-Sheldon syndrome) and 2B (Sheldon-Hall syndrome), the association of MYH3 towards heart abnormalities in humans has not been reported [29,38]. The data presented in this study provided novel insights into the contribution of molecular genetics of MYH3 towards the development of atrial septa.
Nevertheless, our study has some limitations, only MYH3 gene was focussed, thus novel variants or other genomic loci beyond MYH3, was not identified. The identified variants were also detected among controls, therefore further study on the mendelian segregation among familial ASD patient for mutation in MYH3 is important. However, despite its limitations, targeted NGS is able to obtain high-quality data, within a hypothesis-driven approach, while remaining less expensive than its whole genome sequencing and whole exome sequencing counterparts. This methodology is appropriate for efficient and directed research, in elucidating molecular pathways of various diseases.

Conclusion
In conclusion, this study proposes non-synonymous mutations in the tail domain of MYH3 as a potential autosomal cause of ASD among non-syndromic patients. Hence, if a number of genes with contributions of similar magnitude as MYH3 to the risk of ASD could be identified, these genes can be used as panel for diagnosis of ASD.