Whole Exome-Trio Analysis Reveals Rare Variants Associated with Congenital Pouch Colon

Anorectal malformations (ARM) are individually common, but Congenital Pouch Colon (CPC) is a rare anorectal anomaly that causes a dilated pouch and communication with the genitourinary tract. In this work, we attempted to identify de novo heterozygous missense variants, and further discovered variants of unknown significance (VUS) which could provide insights into CPC manifestation. From whole exome sequencing (WES) performed earlier, the trio exomes were analyzed from those who were admitted to J.K. Lon Hospital, SMS Medical College, Jaipur, India, between 2011 and 2017. The proband exomes were compared with the unaffected sibling/family members, and we sought to ask whether any variants of significant interest were associated with the CPC manifestation. The WES data from a total of 64 samples including 16 affected neonates (11 male and 5 female) with their parents and unaffected siblings were used for the study. We examined the role of rare allelic variation associated with CPC in a 16 proband/parent trio family, comparing the mutations to those of their unaffected parents/siblings. We also performed RNA-Seq as a pilot to find whether or not the genes harboring these mutations were differentially expressed. Our study revealed extremely rare variants, viz., TAF1B, MUC5B and FRG1, which were further validated for disease-causing mutations associated with CPC, further closing the gaps of surgery by bringing intervention in therapies.


Introduction
Whole Exome Sequencing (WES) has been an invaluable and cost-effective approach to identify genetic variants responsible for both Mendelian and polygenic diseases [1]. In the recent past, it has allowed the detection of clinically relevant genomic regions spanning the known unknown regions, disease-associated sites and untranslated regions (UTRs) [2]. In addition to the well-known diseases, prenatal abnormalities, structural anomalies and congenital defects were studied, demonstrating a good diagnostic yield [3,4]. While WES approaches are abundant, they are limited if the disease in question is characteristically rare and medically inconclusive. This could be a deterrent because of the challenges in variant discovery, including rare and low-frequency mutations using next generation sequencing (NGS) technologies. A recent decrease in the cost of WES and the accuracy of the NGS enabled the researchers to study a large number of WES samples, but in case of rare diseases, studying exome-trios (proband/parents), or quads, with an addition of siblings, to discover their children. Blood samples were collected from all the probands, parents and unaffected siblings, if any. The WES data from a total of 64 samples, including 16 affected neonates with their parents and unaffected siblings, were used for the study. We confined our pool of analyses to all probands (11 male and 5 female) and unaffected parents/siblings. The methods and pipeline leading to family/quad analyses are summarized in Fig

Variant Annotation, Filtering and Quality Control
The details of sequencing and variant calling in CPC subjects have been described previously [17]. Briefly, WES was performed on an Illumina multiplexed sequencer with paired-end chemistry and 110x effective coverage. Using our in-house developed pipeline [19], all unmapped sequence reads were aligned to the human reference genome (hg38) and variants were called. The mutations from WES study were checked to discover the variants, if any, across the samples ( Figure 2). All variants with MAF < 0.05 were checked for whether or not they were present in probands but absent in their respective parents (Supplementary Information) and healthy siblings. After checking the variants, we confined the prioritization of variants with filters set to an average depth of 250 and MAF < 0.01 and MAF ≤ 0.01% across all the trio samples [20]. For further checking with dbSNP [21], GnomAD [22], ClinVar [23], and COSMIC [24] databases, we used SNP-Nexus [25] to filter mutations listed in a cohort of databases, viz., SIFT [26], PolyPhen-2 [27], Ensembl Variant Effect Predictor [28], MutationTaster [29], CADD [30] and GERP [31], and prioritized the pathogenic mutations, if they were deleterious in nature. The CNVs and variants of unknown significance (VUS) were inferred by mapping the final list of variants to SNP-Nexus. As a final check in reaching a consensus for the variants present in all the probands, we checked the variants with multiple bioinformatics tools so as to find bona fide variants at the union of intersection of these methods, which we construed to be associated with CPC.

Variant Annotation, Filtering and Quality Control
The details of sequencing and variant calling in CPC subjects have been described previously [17]. Briefly, WES was performed on an Illumina multiplexed sequencer with pairedend chemistry and 110x effective coverage. Using our in-house developed pipeline [19], all unmapped sequence reads were aligned to the human reference genome (hg38) and variants were called. The mutations from WES study were checked to discover the variants, if any, across the samples ( Figure 2). All variants with MAF < 0.05 were checked for whether or not they were present in probands but absent in their respective parents (Supplementary Information) and healthy siblings. After checking the variants, we confined the prioritization of variants with filters set to an average depth of 250 and MAF < 0.01 and MAF ≤ 0.01% across all the trio samples [20]. For further checking with dbSNP [21], GnomAD [22], ClinVar [23], and COSMIC [24] databases, we used SNP-Nexus [25] to filter mutations listed in a cohort of databases, viz., SIFT [26], PolyPhen-2 [27], Ensembl Variant Effect Predictor [28], MutationTaster [29], CADD [30] and GERP [31], and prioritized the pathogenic mutations, if they were deleterious in nature. The CNVs and variants of unknown significance (VUS) were inferred by mapping the final list of variants to SNP-Nexus. As a final check in reaching a consensus for the variants present in all the probands, we checked the variants with multiple bioinformatics tools so as to find bona fide variants at the union of intersection of these methods, which we construed to be associated with CPC.

Downstream Analyses
We used the vcftools package (https://vcftools.github.io; last accessed on 6 March 2023), and the mutations in probands were further checked with their relative parents/siblings for heterozygosity transmission. We also looked into the homozygous variants and considered them as associated with CPC, where the proband was found to be homozygous and their respective parents and unaffected siblings, if any, were heterozygous for the specific allele. Enrichment analysis of the data was carried out to calculate the inclusiveness of parameters such as binomial probability and hypergeometric distribution [32]. After high throughput screening, we undertook candidate gene-set analysis based upon significantly enriched sets or rare mutations, specifically in colon related disorders. Seeking novel insights into the disorder, we used pathway analysis based upon gene ontology (GO) derived from PANTHER ontology [33] (http://pantherdb.org/tools; last accessed on 6 March 2023) and EnrichR [34] (https://amp.pharm.mssm.edu/Enrichr; last accessed on 6 March 2023) annotation terms. Gene Ontology-based annotations included a biological network gene ontology tool (BinGO) [35], a plug-in for ontology annotation in Cytoscape [36] used for ontological analysis in the form of biological, cellular and metabolic processes.

Identification of Transcripts for Comparative Screening
From RNA-Seq, a quality check ensued after total RNA was isolated and after cDNA double strand synthesis on a pair of CPC type-4 samples. The RNA-Seq was performed on an Illumina HiSeq 2000 platform with 2 × 100 bp paired end sequencing chemistry which generated ca. 32 million read pairs. The pair of samples was run through differential gene expression analysis using Cufflinks [37] and DESeq [38] pipelines and a consensus was reached. For inferring the role of lncRNAs, UVA FASTA software (https://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml; last accessed on 6 March 2023) (v36. 6.8 version) and the NONCODE FASTA repository [39] were downloaded and the intergenic regions of the genes from WES samples were queried. The lncRNA-NONHSAT002007 was identified based on the query coverage e-value < 0.01. The sequences were carefully

Downstream Analyses
We used the vcftools package (https://vcftools.github.io; last accessed on 6 March 2023), and the mutations in probands were further checked with their relative parents/siblings for heterozygosity transmission. We also looked into the homozygous variants and considered them as associated with CPC, where the proband was found to be homozygous and their respective parents and unaffected siblings, if any, were heterozygous for the specific allele. Enrichment analysis of the data was carried out to calculate the inclusiveness of parameters such as binomial probability and hypergeometric distribution [32]. After high throughput screening, we undertook candidate gene-set analysis based upon significantly enriched sets or rare mutations, specifically in colon related disorders. Seeking novel insights into the disorder, we used pathway analysis based upon gene ontology (GO) derived from PANTHER ontology [33] (http://pantherdb.org/tools; last accessed on 6 March 2023) and EnrichR [34] (https://amp.pharm.mssm.edu/Enrichr; last accessed on 6 March 2023) annotation terms. Gene Ontology-based annotations included a biological network gene ontology tool (BinGO) [35], a plug-in for ontology annotation in Cytoscape [36] used for ontological analysis in the form of biological, cellular and metabolic processes.

Identification of Transcripts for Comparative Screening
From RNA-Seq, a quality check ensued after total RNA was isolated and after cDNA double strand synthesis on a pair of CPC type-4 samples. The RNA-Seq was performed on an Illumina HiSeq 2000 platform with 2 × 100 bp paired end sequencing chemistry which generated ca. 32 million read pairs. The pair of samples was run through differential gene expression analysis using Cufflinks [37] and DESeq [38] pipelines and a consensus was reached. For inferring the role of lncRNAs, UVA FASTA software (https://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml; last accessed on 6 March 2023) (v36. 6.8 version) and the NONCODE FASTA repository [39] were downloaded and the intergenic regions of the genes from WES samples were queried. The lncRNA-NONHSAT002007 was identified based on the query coverage e-value < 0.01. The sequences were carefully checked for bidirectional blast hits, and the lncRNA was visualized using an Ensembl genome browser for fidelity.

SNP Genotyping and Burden Tests
For Sequenom genotyping, a multiplexed iPLEX assay was designed for 20 ng of DNA per sample to determine SNP calls using the Agena biosciences assay design suite. The MALDI-TOF-MS analysis was performed using the Agena biosciences massArray analyzer platform on 29 SNPs in 37 DNA samples (16 probands and 21 controls). This method consisted of five steps: PCR amplification, shrimp alkaline phosphatase treatment, single base extension, nano dispensing, and matrix-assisted laser desorption/ionizationtime of flight (MALDI-TOF) mass spectrometry [40]. Data acquisition was automatically performed, and the mass window of analyte peak observation was set at 4500-9000 Da. Call frequencies, expressed as percentages, were calculated for each SNP. The final list of bona fide variants' mutational burden was checked by comparing the previous analysis [17]. All the variants reported were carefully taken into consideration through the iterative process of the American College of Medical Genetics and Genomics (ACMG) [41] and Sherloc guidelines [42].

Variant Downstream Analysis
We achieved a mean read depth of 80 in the targeted regions for an average depth of coverage of 110x. We generated an average of 840,667 total variants, comprising 775,262 SNPs and 65,405 indels per exome in probands, as compared to 736,820 SNPs and 63,418 indels in unaffected samples from a total of 777,216 variants ( Figure 3). Missense variants constituted the large variation, followed by loss of function variants in cases when compared to unaffected samples, with overall 0.35% missense, 0.02% frameshift, 0.004% stop gain and 0.0009% stop lost. The downstream analysis leading to the variant calling was carried out carefully to yield the list of the final number of variants. This was also checked with gene-density and high linkage disequilibrium (LD) regions even as MAF ≤ 0.01 was sought for, with SNPsnap [43] having no candidate matches, thereby confirming that these variants are extremely rare. Given the rare phenotype, the frequency of the prioritized variants was checked for agreement with that of 1000 genome, GnomAD and ExAC databases. From the reported familial history, the relatedness tests were not felt necessary as the pedigree confirmed correct parenthood for all affected/unaffected samples (Supplementary Table S1). From the final list of segregated variants (Table 1), we identified three mutations in MUC5B, FRG1 and TAF1B genes, which we deemed extremely rare variants (Table 2). Furthermore, we also found an AK9 copy number variant (CNV) which was run through SNPsnap [43] for assessing whether the rare allelic variation was seen as enriched for particular biological annotations (Supplementary Table S6). Finally, a set of candidate variants inferred from all samples was checked for validation using Sequenom array/plex (Table 3).     We found two independent lines of evidence from various tools, and reached a consensus in detecting variants from each sample with all filtering steps. For example, the pathogenic mutations contributing to each relevant proband were screened first, wherein three variants, viz., FRG1 (NC_000004.12:g.189957414T > A), TAF1B (NC_000002.12:g.9904885C > T) and MUC5B (NC_000011.10:g.1238987C > T), were confirmed through ClinVar and GnomAD. On the other hand, we sought to ask whether or not major differences in functional alleles are seen when compared to unaffected controls. From the MAF cutoff of these rare variants, we observed that these variants came up with other alternative allele burdens not seen in our samples. For example, rs79638064 (C/A) was reflected in SIFT, whereas C/T was seen in our probands, indicating that these could be extremely rare pathogenic conditions.
While the FRG1 mutation happens to be a CNV-associated missense seen in Z15 and Z99 samples, it is further augmented by the fact that the low expression of FRG1 is associated with tumor progression in the colon [47]. The TAF1B mutation is highly associated with colorectal tissues, as we found these stop gain/missense mutations to be highly prolific in some of the aggressive CPC probands, viz., Z15, Z19, Z34, Z42, Z46, Z54, Z62, Z66 and Z74. The TAF1B is known to be the second-largest subunit of the TATA box-binding protein (TBP)-containing promoter selectivity factor TIF-1B/SL1, and is connected with ribosomal transcription [48]. We found that a number of detected variants were associated with colon, rectal and genitourinary tissues. The density of these pouchrelated tissues with the presence of missense/stop gain mutations could be attributed to the aggressive CPC type IV.
Dissecting the genetic architecture of a rare disease is certainly an arduous task. Our CPC exome analysis was intended to fill this gap by detecting SNVs and CNVs affecting the focal genes/loci. Assessing these variants in CPC has provided ample evidence of strata coherent to CPC traits. In our study, at least three other genes from trio-exome analysis were reported in colon related ailments, viz., DLC1, HAVCR1 and GBA3, but their allele frequencies were not bona fide or comparable. While we have aimed to characterize the variants by employing different approaches, a polygenic model was assumed. This could be compounded with two assumptions, (a) capturing exomes and identifying deleterious mutations from a high depth of coverage exomes, and (b) identifying large cohorts of mutations that fall in a low depth of coverage exomes. Though we observed both classes of genetic variation contributing to the etiology of the disease, inferring proband-parent trios and detecting de novo and transmitted genetic variants is quite a challenge. By considering extremely rare variants and adopting a strategy of identifying them in the high depth exome, we validated all 16 trios. Nevertheless, we could not compare the detection yield inherent to this spectrum of patients owing to a lack of CPC phenotype and trio-exome studies of similar design. Although previous studies have shown relatively similar methods, they detected medically relevant variants in the majority of the diseased phenotypes [11].

Variants of Unknown Significance
Our findings were in agreement with a large number of reports for rare variants, suggesting that the cumulative contribution of variants across different genes is associated with distinct phenotypes. In addition, an important challenge for researchers and clinicians nowadays in investigating rare disorders involves predicting pathogenicity for VUS. With several guidelines mentioned for predicting the pathogenicity of variants [8,47,48], molecular investigators face a daunting task in considering a rare variant as benign or pathogenic and inferring it to be pathogenic. In explaining the germline/heritability of complex variants based on the rare variant hypothesis, we argue that the extreme rare variants are associated with phenotype sampling [49]. Next, we showed how we can influence and prioritize these extreme rare variants and further propose an optimization procedure to check the variants identified between MAF < 0.01 and MAF 0.01%. To address this, we discarded many variants that had MAF < 0.01 and finally expanded the current annotation and prioritization to accommodate the CPC framework.
In rare diseases where only a minority of the population is affected, or where they are prevalent to a specific geographical location, identifying and considering VUS will require thoughtful consideration. Notable among them, the sequestome (SQSTM1 or P62) gene encodes a multifunctional scaffolding protein involved in multiple cellular processes [50,51] besides showing mitochondrial integrity, import and dynamics as a discriminating autophagy receptor [52]. In addition, P62/SQSTM1 is expressed ubiquitously in various cell types such as cytoplasm, nucleus and lysosomes [53] and is known to be overexpressed in various human genitourinary diseases including colon cancer [54], hepatocellular carcinoma [55] and prostate cancer [56] (Supplementary Table S2). While KCNJ12 was also among the genes harboring bona fide variants, our variant classification did not compel us to consider it as extremely rare. It is known to initiate transcription by RNA polymerase I and acts as a channel for regulatory signals, while KCNJ12 encodes the ATP-sensitive inward rectifier potassium channel [12] and is subtly associated with the repolarization of channels [57,58]. A list of extremely rare variants in the form of SQSTM1 and KCNJ12 was also considered for validation. The SQSTM1 mutation is invariably inherited from the father in the case of the Z12 index case, while being inherited from the mother in the case of Children 2023, 10, 902 9 of 17 Z54. As the aforementioned variants were considered to be extremely rare variants, we also found AK9 (6:109528998..109528999 C|-) to be consistently seen across all probands. This deletion is invariably associated with nucleotide metabolism pathways and maintains the homeostasis of cellular nucleotides [59]. Although AK9 (deletion) is seen phenomenally in all probands, we considered it as a rare variant candidate for validation. However, multiple lines of evidence suggest that apart from VUS, mutations yielding somatic copy number alterations (SCNA) could not be ruled as we found some uncommon missense mutations with MAF < 0.05 in C7orf57, C9orf84, ORF5AR1, FGFR4, HLA-DRB5, NOTCH2NLA and MUC5B genes, which could be turn out to be non-pathogenic/causal to CPC.

Role of Hypothetical Genes in CPC Pathogenesis
One of the interesting findings that has emerged from our study is the role of the known unknowns or hypothetical genes, which could be predecessors of non-coding, and hence their establishing roles in diseases such as CPC is limited [60]. Notable among them is the C10orf120 gene, which harbors CTCF binding sites, as these mutations remain undefined for most disease types, including cancer [61]. We observed that there was a significant enrichment of CNVs and indels associated with intestinal/colon-related specific genes as they are widespread in tissues showing chromosomal instability, cooccurring with neighboring chromosomal aberrations, and are frequent in colon, rectum and gastrointestinal tumors but rare in other diseases. We argue that this mutational disruption, associated with CTCF binding sites, could be associated with pathogenesis as it appears to be conserved in a majority of CPC probands (Supplementary Table S3). Another orphan ORF, viz., C7orf31, also harbors CTCF binding sites and this, in fact, showed significant enrichment for biological processes associated with regulatory, cellular and metabolic pathways (Table 4). Another C10orf120 has somatic variants subtly contributing to CPC pathogenesis and a maximum number of gains and losses observed in the form of CNVs in CDC27, HLA-DRB5 and MST1L genes (Table 2b). Although some of these ORFs' maternal associations and pathogenicity cannot be ruled out, we construe that there are candidate genes that could be promising biomarkers as precursors of CPC, which is beyond the scope of this canonical hypothesis (Supplementary Table S4). In addition, we screened our variants from the Indian Genome Variant Database (IGVDB) and found that they are already reported in the Indian subpopulation [62], and this stratification allowed us to review the patterns influencing common and rare variants. In principle, the rare variants were found to have stronger patterns when compared to the common variants. Thus, there is an inherent need to study the mutations in the known unknown regions which would possibly delve into understanding rare diseases.

RNA-Seq Analysis
To gain insights into the role of lncRNAs, we revisited our hypothesis from our previous study [18] and reconfirmed whether NONHSAT002007 was inferred in WES samples with predictions from the NONCODE database. While we did not find mutations in lncRNAs from trio-exome analyses, we argue that the mutations in essential genes tend to be associated or causal for rare diseases, paving way for driver mutations with the mutations in non-coding genes suppressed for selective pressure. On a different note, we aimed to check whether any of the genes harboring mutations were differentially expressed.
To check this, we employed RNA-Seq from the transcriptome pair of CPC type-4 (proband and its unaffected parent), and we observed several transcripts, alternative splice variants and fusion genes. However, none of them could be associated with the causal genes inferred from the exome study (Supplementary Table S5). Although we found RGPD2 and RGPD4 genes known to be significantly associated with bowel/colon as among the top enriched, nevertheless it is hypothetical to infer global gene expression from just a pair of datasets. This approach, if studied on all samples, we believe, could identify transcripts present at low levels, which in fact could be associated with the pathogenesis of CPC. As the CPC cases emerge, it is difficult to classify the clinical significance of pathogenic variants without the trio analysis, especially the interpretation of de novo variants. The trio exome data could provide insights into whether or not the variant could be inherited, and the carrier mutation could be transcended to the progeny. In conclusion, we argue that the genetic variability in CPC has remarkable significance not only with the parents, but also within each proband. With genetic diseases being leading causes of death in infants, rapid clinical/trio exome sequencing could provide early diagnoses which could make an impact on decision making in critically ill pediatric patients. Although more efficient early diagnosis could be made with interventions using chromosomal microarray (CMA), it was shown that the clinical utility of trio/exome/whole genome sequencing is more than the CMA [63]. The discovery of causal mutations could provide insights into developmental disorders/anorectal malformation such as CPC and its etiology, which closes the gaps of surgery, moving precision therapy forward.

Conclusions
Our findings confirm de novo heterozygous missense mutations in 16 proband-parent trios, which could provide insights into CPC manifestation and its etiology. This study sheds light on how trio-based WES technologies can play a significant role in the identification of associated/causal mutations for rare diseases. In this work, we identify de novo heterozygous missense mutations in 16 proband-parent trios, and further discover VUS which could provide insights into CPC manifestation and its etiology. Our study confirms candidate mutations in genes, viz., TAF1B, MUC5B and FRG1, cause this developmental disorder. In addition, hypothetical genes including C10orf120 harboring CTCF binding sites were predominant in disease causing variants. A significant enrichment of CNVs and indels associated with colon specific genes was found to be predominant. While the RNA-Seq analysis did not delve into characteristic DEGs, we consider that we need a greater sample size to check the gene expression patterns. Variant validation revealed disease-causing mutations associated with CPC and genitourinary diseases, which could close the gaps of surgery by bringing intervention in therapies.