Identification of potential causal variants for premature ovarian failure by whole exome sequencing

Premature ovarian failure (POF) is a highly heterogeneous disorder that occurs in 1% of women of reproductive age. Very few causative genes and variants contributing to POF have been detected, and the disease remains incompletely understood. In this study, we used whole exome sequencing (WES) to identify potential causal variants leading to POF. WES was conducted to identify variants in 34 Korean patients with POF, alongside 10 normal controls. Detected variants were filtered using a range of characterized bioinformatics analyses, and the machine learning tools, CADD and VEST, were used to predict pathogenic variants that could cause disease. VarSome was used for a comprehensive interpretation of the variants. Potential causal variants finally screened by these analyses were confirmed using Sanger sequencing. We identified nine potential causative variants in genes previously associated with POF in 8 of 34 (24%) Korean patients by WES variant analysis. These potentially pathogenic variants included mutations in the MCM8, MCM9, and HFM1 genes, which are involved in homologous recombination, DNA repair, and meiosis, and are established as causing POF. Using a combination of CADD and VEST, 72 coding variants were also identified in 72 genes, including ADAMTSL1 and FER1L6, which have plausible functional links to POF. WES is a useful tool to detect genetic variants that cause POF. Accumulation and systematic management of data from a number of WES studies in specialized groups of patients with POF (family data, severe case populations) are needed to better comprehend the genetic landscape underlying POF.


Background
When women are around 50 years old, ovarian function generally decreases, secretion of the female hormone estrogen declines, and menopause occurs. Menopause increases the incidence of various diseases, such as hypertension, depression, and obesity [1], with consequences for wider society. Against this background, premature ovarian failure (POF) is receiving increasing attention. POF, sometimes known as premature ovarian insufficiency, refers to the loss of ovary function before 40 years-of-age. POF occurs in 1% of all women, with 0.1% experiencing POF before 30 yearsof-age [2]. POF is defined as amenorrhea, with at least 4 months loss of ovarian function, before 40 yearsof-age and levels of follicle stimulating hormone (FSH) > 40 IU/L and estradiol < 50 pg/mL (measured twice, at least a month apart, by blood test) [3,4].
Possible causes of POF include chromosomal and genetic abnormalities, autoimmune disease, viral infectious diseases, and congenital uterine anomalies. Further, surgery due to endometriosis, ovarian cancer, and chemotherapy are also risk factors for POF. Chromosomal abnormalities are detected in 10-15% of patients with POF and other associated genetic factors are detected in 20-25% [5], while 68% of POF cases have an idiopathic character, with no apparent cause, making it difficult to identify the exact etiology.
In the last decade, variants that may be causative for POF have been discovered using various methods. Some were identified in candidate genes for POF with known roles in various relevant processes, such as genital development, DNA replication, meiosis, immune function, and metabolic function [6]. Next generation sequencing (NGS) is a powerful tool to generate accurate results by analysis of complex genetic systems. Whole exome sequencing (WES) is an NGS approach that can be applied for discovery of rare variants within protein encoding gene regions that can cause changes in protein structure [7][8][9]. There have been no comprehensive WES studies of patients with POF in Korea.
Recently, various methods integrating machine learning for predicting the deleteriousness of variants have been developed, based on advances in computing performance. There are three main categories of mutation risk prediction methods: (1) prediction by phylogeny and sequence structure, (2) predictions based on evolutionary conservation of sequences, and (3) machine learning-based prediction methods that integrate data from various sources [10]. In this study, two methods, Variant Effect Scoring Tool (VEST) [11] and Combined Annotation Dependent Depletion (CADD) [12], were used to predict pathogenic variants. Ranked CADD values have been widely applied to predict and classify the deleteriousness of variants in several diseases [13,14]. VEST, which can measure the deleteriousness of sequence insertions and deletions, is also established as a tool for variant classification [15]. For interpreting deleteriousness of variants, VarSome is an impact analysis tool for human genetic variation, providing a powerful analytical resource and a repository for accumulated global knowledge of the genomics community [16].
In this study, we analyzed WES data from 34 Korean patients with POF and 10 normal controls to identify potential causal variants. Bioinformatics analysis was used to characterize protein coding sequence variants corresponding to 131 known POF-candidate genes selected from public databases. We also describe novel variants that have not previously been associated with POF.

Subjects
Unrelated patients (n = 37) diagnosed with POF with amenorrhea for at least 6 months before the age of 40 years, and with FSH plasma levels > 40 IU/L, were included in this study. WES was performed on 37 patients with POF (P1-P37) reported in a previous study, as well as ten normal control samples collected from postmenopausal women ≥ 48 years old who had not had POF and had no known genetic disease. All patients with POF were of Korean ethnicity and recruited from Seoul CHA Hospital. This study was approved by the Institutional Review Boards at CHA University and the Institutional Review Boards of CHA Bundang Medical Center.

Whole exome sequencing
Genomic DNA (gDNA) was isolated from patient peripheral blood using the high-salt buffer method. A minimum of 100 ng gDNA per sample was used to generate sequencing libraries. Genomic DNA was fragmented to 150-200 bp using an E220 Focused-ultrasonicator (Covaris, Woburn, MA, USA). After fragmentation, libraries were prepared using Agilent's SureSelect XT Human All Exon V4 + UTRs Enrichment kit, according to the manufacturer's recommendations (Agilent Technologies, Santa Clara, CA, USA). Paired-end sequencing (100 bp) was performed on the HiSeq2500 platform, according to the manufacturer's instructions (Illumina Inc., San Diego, CA, USA). FastQC software (version 0.11.4) was used to assess read quality, and Trimmomatic (version 0.32) was used to remove low quality reads (base quality < 20). Sequence reads were aligned into human reference genome (hg19/GRCh37) using BWA-MEM (version 0.7.7) [17] and converted into BAM files using SAMtools (version 1.3) [18]. Sorting and indexing of BAM files was also conducted using SAMtools, and Picard (version 1.14) was used to remove duplicate reads. The mean alignedread depth value on target regions (approximately 70 Mb) was approximately 75× (range, 50-70×) with more than 68% of target regions covered by at least 20×, and 57% covered by at least 30×. Single nucleotide variant (SNV) and insertion/deletion (InDel) variants were called using the GATK (version 4.0.7.0) multiple-sample analysis pipeline (HaplotypeCaller, CombineGVCFs) and functionally annotated using SnpEffect (version 4.2), as well as in-house scripts.

Annotation and variants filtering
Preprocessing, variant calling, prediction of deleterious variants, and functional annotation were conducted to identify potential POF causative variants (Fig. 1). Only SNVs (non-synonymous, stop gain, stop loss, and splice sites) and InDel (frameshift, non-frameshift) variants that corresponded to coding regions and may affect protein function were collated. The obtained variants were filtered by location (target regions of the SureSelect XT Human All Exon V4 + UTR) and quality (read depth ≥ 20, variant allele frequency ≥ 0.25, genotype quality ≥ 20). Due to negative selection, deleterious variants are expected to have a lower frequency than neutral mutations. This theory has been proven in past large human genome studies [19]; therefore, variants were also filtered according to minor allele frequency (MAF) > 1% in the 1000 Genomes, Genome Aggregation Database (gnomAD), and Exome Aggregation Consortium (ExAC) databases [20,21] to identify deleterious changes. In addition, variants in the Korean Reference Genome Database (KRGDB) [22] with MAF > 1% were filtered to remove common variants in the ethnic Korean population. Variants found in both control and patient samples were also filtered. Gene-disease association databases were used to create a list of genes associated with POF (Additional file 1: Table S1), and variants mapping to those genes were selected as potential variants causative of POF. Identified candidate variants were finally confirmed using the Integrative Genomics Viewer. In addition, the aggregated knowledge-based tool, VarSome, was used to comprehensively review variants [23].

Pathogenicity prediction of variants
To reduce the range of candidate variants, variants were filtered using population allele frequency databases (ExAC, gnomAD, 1K genome, and KRGDB) and non-synonymous variants, predicted to result in protein modifications, were selected. To narrow the search even further and identify deleterious variants, additional annotation was conducted using the prediction tools CADD (version 1.4) and VEST (version 4). CADD is an ensemble method that integrates multiple annotations to predict the potential pathogenicity of variants, and is trained to identify potential pathogenic variants by integrating and combining several different annotation algorithms, based on a logistic regression model. VEST uses a training set, based on a random forest model, to distinguish pathogenic variants using databases of mutations classified as positive or negative. Both methods use machine learning-based algorithms and can be used to measure the deleteriousness of various types of short variants (missense SNV, nonsense SNV, splice site SNV, frameshift InDel, and non-frameshift InDel). A CADD (PHRED-like scaled) score ≥ 20 indicates that a substitution is predicted to be among the 1% most deleterious in the human genome. Therefore, CADD variants with scores ≥ 20 were classified as pathogenic variants and those with scores < 20 as benign. VEST scores were calculated using the CRAVAT web server (https ://www. crava t.us). Variants with VEST scores ≥ 0.5 were classified as pathogenic and those with scores < 0.5 as benign. Finally, variants with both CADD scores ≥ 20 and VEST scores ≥ 0.5 were classified as potentially pathogenic candidate variants.

Sanger validation
After filtering against multiple databases, candidate variants were re-sequenced using Sanger sequencing using an Applied Biosystems 3730xl DNA Analyzer. Targeted primers were designed using Primer3 software [24] (Additional file 1: Table S2). SeqMan 7.0 (DNAStar Lasergene, USA) was used for variant analysis of Sanger sequencing data.

Result
To find potential causal variants of POF, we conducted WES using DNA from 37 patients diagnosed with POF and ten controls. Patients P1 and P31 were identified as XY females and data from patient P4 had biased GC content; therefore, data from 34 patients were included in our sequencing analysis [16]. Unfortunately, additional clinical information, such as XY karyotyping and ultrasound findings, were not available for these patients. Patient exomes were sequenced to an average sequencing depth of 100×. Clinical interpretation and evidence for the variants were carefully reviewed through VarSome.

Identification of variants in genes previously associated with POF
Screening of genes known to cause POF (Additional file 1: Table S1) led to identification of nine plausible heterozygous variants (Table 1). These nine pathogenic variants in genes known to contribute to POF were carried by 8  The genes corresponding to the nine identified variants included participants in key biological processes, such as homologous recombination, DNA repair, and meiosis (MCM8, MCM9, and HFM1). The nine potential causal variants were confirmed by Sanger sequencing (Additional file 2: Figure S1).

Identification of novel, candidate variants associated with POF
In addition to nine variants in genes associated with POF, 72 variants in 72 genes were found in two or more individuals in genes not previously associated with POF (Tables S3, S4). Of these, genes of particular interest, based on their known functions, included identical variants in two individuals found in ADAMTSL1 and FER1L6, which are genes associated with the development of reproductive organs and folliculogenesis. The ADAMTS Like 1 (ADAMTSL1) variant (NM_001040272.5; c.397G>A; p.Glu133Lys) was found in two individuals. This variant generated high risk scores using both VETS (0.924) and CADD (32.0). The variant in FER1L6 was c.814G>A; p.Gly272Ser (NM_001039112.2).

Discussion
In this study, we recruited 34 Korean patients clinically diagnosed with POF and performed WES to identify pathogenic variants. Nine variants were detected that corresponded to genes known to potentially cause POF. Among these genes, Minichromosome maintenance 8 (MCM8) and Minichromosome maintenance 9 (MCM9) are implicated in homologous recombination and repair of DNA double-strand breaks. Female mice with MCM8 and MCM9 defects exhibit early oocyte loss or severe early proliferative defects in germ cells in testes [25]. The protein encoded by eukaryotic translation initiation factor 2B, subunit gamma (EIF2B3), is a subunit of the initiation factor eIF2B, which is associated with vanishing white matter (VWM) disease and leukodystrophy. Ovarian failure is frequently reported in women with VWM disease [26]. Mutations of the EIF2B3 gene are very rarely found in patients with POF [27]. Mutations in prolyl endopeptidase like (PREPL) are associated with hypotonia-cystinuria syndrome (HCS), which is caused by deletions including SLC3A1. PREPL deficiency causes hypergonadotropic hypogonadism and HCS [28]. The ERCC6-PGBD3 fusion protein and mutations in the ERCC excision repair 6 (ERCC6) gene are reported to cause POF and are among several proteins that participate in, or regulate, DNA repair [29]. The protein encoded by helicase for meiosis 1 (HFM1) is thought to be an ATP-dependent DNA helicase and is primarily expressed in germ-line cells, such as testis and ovary. Mutations in the HFM1 gene, which encodes a protein necessary for homologous recombination and synapsis during meiosis, are a cause of POF [30]. Spalt like transcription factor 4 (SALL4) plays important roles in maintaining the pluripotency of embryonic stem cells and regulating the embryonic development of many organisms. Similar to adult mice, SALL4 is highly expressed in adult humans, with expression restricted to the testis and ovary [31]. A recent study of 50 patients with POF reported pathogenic mutations in SALL4 (3/50 patients, p.Val181Met; p.Lys597Arg; p.Thr760Ile) and predicted that these variants had different effects on SALL4 activity [32]. The SALL4 variant (NM_020436.3; c.3149T>C; p.Ile1050Thr) detected in our study was not one of the three previously reported mutations; however, it could be expected to also influence SALL4 activity. Thyroglobulin (TG) is a precursor of the iodinated thyroid hormones, thyroxine (T4) and triiodothyronine (T3), and is synthesized predominantly by the thyroid gland. Mutations in TG are associated with the development of thyroid disorders, such as congenital hypothyroidism, thyroid cancer, and autoimmunity [33]. Previously, we reported that hypothyroidism due to TG mutations could lead to POF. [34,35]. The TG variant (NM_003235.4; c.7364G>A; p.Arg2455His) identified in P3 may have resulted in hypothyroidism.
We also reported identical variants that did not correspond to known POF genes but were found in two or more individuals with POF (Additional file 1: Table S4). The ADAMTS protease family contributes to aspects of reproductive organ development and tissue morphogenesis, and dysregulation or functional alteration of ADAMTS proteins are associated with reproductive disorders, such as polycystic ovary syndrome and POF [36]. In chickens, expression of ADAMTSL1 increases in the developing ovaries during gonadal differentiation [37]. FER1L6 homologs in model organisms are involved in folliculogenesis and a copy number variant was reported in a patient with POF [38]. Solute carrier family 3 member 1 (SLC3A1) frameshift variant (NM_000341.3; c.1820delT; p.Leu607fs) has very low allele frequency in both the ExAC (MAF = 8.24E−06) and gnomAD (4.08E−06) databases. According to VarSome, this SLC3A1 variant has a strong association with Cystineuria. Cystineuria is a rare hereditary disease associated with kidney stones, which is primarily caused by mutations in two protein subunits (rBAT and b 0,+ AT) encoded by SLC3A1 and SLC7A9 [39].
POF is a highly heterogeneous disorder that can be caused by a variety of factors, suggesting that an unspecified number of variants can cause the condition. The increase of research use of NGS has led to the discovery of numerous disease-specific genes and variants in patients with POF; however, these results are rarely verified through sufficient downstream experiments and are not managed systematically, making in-depth study of the disease difficult. To establish the genetic pathogenesis of POF, efforts are needed to understand the complex genetic mechanisms involved; for example, using integrated databases and approaches to predict combinations of disease-associated variants.

Conclusion
Sporadic POF is a highly heterogeneous condition with no common genetic cause. Hence, the scope for identifying genes and variants common to many individuals is limited. Here, we identified several variants and genes that may cause disease in 34 Korean patients with apparent isolated POF. We report novel candidate variants in ADAMTSL1 and FER1L6, as well as alterations in genes previously reported to be associated with POF. In the study of disease, prediction of the risk associated with variants detected in patients by WES and in the candidate genes associated with the disease is a useful approach to find causative genetic variants. Our data confirm that WES is an efficient tool for studying the genetic causes of POF and can contribute to understanding of disease etiology. For more accurate identification of pathogenic variants, integration and systematic management of data collected from a number of WES studies of specific POF populations (e.g., family data and severe case populations) will be necessary.
Additional file 1: Table S1. List of POF/POI genes (131). List of candidate POF genes obtained from four public databases (DisGeNET, Monarch, MalaCards and NCBI:Gene) that collect genes and variants associated with human disease. Table S2. Primers used for confirming selected variants. Table S3. Genes with identical rare variants (MAF ≤ 0.01) detected in more than one individual. Table S4. Identical coding variants found in two or more individuals via whole-exome sequencing. VEST analysis generates values between 0 and 1, and scores ≥ 0.5 were classified as pathogenic variants and those < 0.5 as benign. CADD analysis generates a PHREDlike scaled value. Scores ≥ 20 were classified as pathogenic variants and those < 20 as benign. Variants satisfying both CADD ≥ 20 and VEST ≥ 0.5 were classified as pathogenic.