A (GCC) Repeat in the Untranslated Region of Human SBF1 Departs from Hardy-Weinberg Equilibrium in Human and Links to Late-onset Neurocognitive Disorder

Mina Ohadi (ohadi.mina@yahoo.com), University of Social Welfare and Rehabilitation Sciences
Safoura Khamse, University of Social Welfare and Rehabilitation Sciences
Samira Alizadeh, University of Social Welfare and Rehabilitation Sciences
Stephan H Bernhart, IZBI, Universität Leipzig
Hossein Afshar, University of Social Welfare and Rehabilitation Sciences
Ahmad Delbari, University of Social Welfare and Rehabilitation Sciences

Among various categories of STRs, CGG/GCC repeats are overrepresented in the exons of the human genome, and have been studied mainly because of their involvement in neurological disorders [11][12][13][14]. The human gene SBF1 (SET binding factor 1), also known as MTMR5 (Myotubularin-related protein 5), contains an annotated (GCC)-repeat of 9 repeats in the interval between +1 and +60 of the transcription start site (TSS) (SBF1-202 ENST00000380817.8), which is in the top 1 percentile of (GCC)-repeats with respect to length [15]. SBF1 is located at the extreme end of the long arm of chromosome 22 (22q13.33), and across all human tissues, reaches maximum expression in the human cortex (https://www.proteinatlas.org/ENSG00000100241-SBF1/tissue). In comparison with other primate species, SBF1 reaches maximum expression quantiles in the human brain and skeletal muscle (https://www.ncbi.nlm.nih.gov/IEB/Research/Acembly) [16]. In line with the above, aberrant regulation of the gene networks in which SBF1 plays a role has been reported in late-onset neurocognitive disorders (NCDs), such as Alzheimer's disease (AD) [17].
Here we sequenced the SBF1 (GCC)-repeat in a sample of humans, consisting of patients with late-onset NCD and controls. We also studied the status of this STR across vertebrates.

Subjects
Five hundred forty-two unrelated Iranian subjects of ≥60 years of age, consisting of late-onset NCD patients (DSM-5) (N=260) and controls (N=282) were recruited from the provinces of Tehran, Qazvin, and Rasht. In each NCD case, the Persian version of the Abbreviated Mental Test Score (AMTS) [18,19] was implemented (AMTS<7 was an inclusion criterion for NCD), medical records were reviewed in all participants, and CT-scans were taken where possible. Furthermore, in a number of subjects, the Mini-Mental State Exam (MMSE) Test [20] was implemented in addition to the AMTS. A score of <24 was an inclusion criterion for NCD. The Persian version of the AMTS is a valid cognitive assessment tool for older Iranian adults, and can be used for NCD screening in Iran [18]. The control group was selected based on cognitive AMTS of >7 and MMSE>24, lack of major medical history, and normal CT-scan where possible.
The cases and controls were matched based on age, gender, and residential district. The subjects' informed consent was obtained (from their guardians where necessary) and their identities remained confidential throughout the study. The research was approved by the Ethics Committee of the Social Welfare and Rehabilitation Sciences, Tehran, Iran, and was consistent with the principles outlined in an internationally recognized standard for the ethical conduct of human research. All methods were performed in accordance with the relevant guidelines and regulations.
Allele and genotype analysis of the SBF1 (GCC)-repeat.
Genomic DNA was obtained from peripheral blood using a standard salting-out method. PCR reactions for the amplification of the SBF1 (GGC)-repeat were set up with the following primers:
Forward: TCTGGACCAATGGAGATGCG
Reverse: GAAGTAGTCCGCGAGCCG
PCR reactions were carried out in a final volume of 20 µl, at a final concentration of 30% high-GC buffer, in a thermocycler (Peqlab-PEQStar) under the following conditions: initial denaturation at 95 °C for 5 min, 40 cycles of denaturation at 95 °C for 45 s, annealing at 55 °C for 45 s, and extension at 72 °C for 1 min, and a final extension at 72 °C for 10 min. All samples included in this study were sequenced by the forward primer, using an ABI 3130 DNA sequencer.

Statistical analysis
The OpenEpi software (https://www.openepi.com/TwobyTwo/TwobyTwo.htm) was implemented to analyze the allele and genotype data in the human samples studied.
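For readers who wish to reproduce this kind of two-by-two comparison outside OpenEpi, the following is a minimal sketch of a Yates-corrected chi-square test with one degree of freedom. It is not the OpenEpi implementation; the function name and the counts in the example call are hypothetical and are not taken from the study data.

```python
import math

def yates_chi2(a, b, c, d):
    """Yates-corrected chi-square for a 2x2 table [[a, b], [c, d]].

    Returns (chi2, two_sided_p) for 1 degree of freedom.
    """
    n = a + b + c + d
    # Continuity correction: shrink |ad - bc| by n/2, never below zero.
    corrected = max(abs(a * d - b * c) - n / 2.0, 0.0)
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = n * corrected ** 2 / denom
    # Survival function of the chi-square distribution with 1 df.
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p

# Hypothetical counts (NOT study data): rows = NCD / control,
# columns = allele of interest / all other alleles.
chi2, p = yates_chi2(10, 20, 20, 10)
print(f"Yates-corrected chi2 = {chi2:.2f}, two-sided p = {p:.4f}")
```

The p-value returned here is two-sided; a one-sided value, where appropriate, would be half of it.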
Structural analysis of the human SBF1 with different numbers of (GCC)-repeats
We investigated accessibility (probability to be unpaired) differences of exon 1 of the human SBF1 gene with five to nine (GCC)-repeats, using the accessibility computation of the ViennaRNA package (RNAplfold with -W 300 -L 300 -u 10) [22,23]. We compared the accessibilities of all regions of 10 nt length, and found distinct differences at 3 regions of exon 1. Furthermore, we used RNAup -b [24] to compare possible interactions in homodimeric and heterodimeric SBF1 first exon with different numbers of (GCC)-repeats.
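As an illustration of how input sequences for such a comparison might be prepared, the sketch below builds RNA variants that differ only in the number of (GCC)-repeats and formats them as FASTA. The flanking sequences are hypothetical placeholders and are not the real SBF1 exon 1 sequence; the helper names are likewise assumptions made for this example.

```python
# Hypothetical placeholder flanks: NOT the real SBF1 exon 1 sequence.
FLANK_5 = "AUGGCUAGCUAGGAUC"
FLANK_3 = "CUAGGAUCCGAUAGCU"

def build_variant(n_repeats, unit="GCC", flank5=FLANK_5, flank3=FLANK_3):
    """Return flank5 + n_repeats copies of unit + flank3."""
    return flank5 + unit * n_repeats + flank3

def to_fasta(variants):
    """Format {repeat_count: sequence} as FASTA text, sorted by repeat count."""
    lines = []
    for n, seq in sorted(variants.items()):
        lines.append(f">exon1_{n}xGCC")
        lines.append(seq)
    return "\n".join(lines) + "\n"

# Variants with five to nine repeats, matching the range analyzed in the text.
variants = {n: build_variant(n) for n in range(5, 10)}
fasta_text = to_fasta(variants)
print(fasta_text)

# The FASTA text could then be saved to a file and passed to the command
# named in the text, for example:
#   RNAplfold -W 300 -L 300 -u 10 < variants.fa
```

Keeping the flanks constant means that any difference in computed accessibility between variants can be attributed to the repeat number alone.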
Analysis of the SBF1 (GCC)-repeat across vertebrates.
The interval between +1 and +100 of the TSS of the SBF1 was searched across all species in which SBF1 was annotated, based on Ensembl 104. The Ensembl alignment program was implemented for the sequence alignments across the selected species.
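A simple way to count repeats in such an interval is sketched below, assuming the +1 to +100 sequence of each species has already been retrieved as a plain string. The function name and the toy sequence are hypothetical; this is not the Ensembl alignment procedure itself.

```python
import re

def longest_repeat(seq, unit="GCC"):
    """Return (repeat_count, start_index) of the longest uninterrupted run
    of `unit` in `seq`; (0, -1) if the unit does not occur."""
    pattern = re.compile(f"(?:{re.escape(unit)})+")
    best = (0, -1)
    for match in pattern.finditer(seq.upper()):
        count = len(match.group(0)) // len(unit)
        if count > best[0]:
            best = (count, match.start())
    return best

# Toy sequence (NOT a real SBF1 sequence): nine GCC units after a 3-nt prefix.
toy = "ATG" + "GCC" * 9 + "TTA"
print(longest_repeat(toy))
```

Only uninterrupted runs are counted, so a repeat broken by a single nucleotide substitution is reported as two shorter runs.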

Results
The SBF1 (GCC)-repeat allele frequency compartment was significantly skewed in the NCD group vs. controls.
We detected two predominantly abundant alleles of 8 and 9 repeats in the human subjects studied in both groups (Table 1, Fig. 1). At significantly lower frequencies, we detected repeats of 5, 6, 7, and 10, with frequencies of <0.03. The allele frequency compartment was significantly skewed in the NCD group vs. controls, mainly due to the 8 and 9-repeat alleles (Yates-corrected χ² for the 8-repeat = 9.54, p=0.001; Yates-corrected χ² for the 9-repeat = 6.29, p=0.006).
The SBF1 (GCC)-repeat genotype compartment was significantly anomalous across the NCD and control groups.
Overall heterozygosity for the observed alleles was significantly less than expected in the NCD and control groups, at 22.3% and 16.31%, respectively (p=0.000). Specifically, rather than an expected >45% 8/9 genotype based on the predominantly bi-allelic 8 and 9-repeat allele frequencies, we detected <18% of that genotype across the two groups (p=0.000) (Table 1, Fig. 2). There were other discrepancies in the genotype distribution of alleles across the two groups. The 6/8 genotype was detected significantly more often than the 6/9 genotype in both case and control groups (p<0.05), and the skew was even more significant when the two groups were pooled (p<0.0002). Similar discrepancies were detected for other heterozygous genotypes, such as the 9/10 and 8/10 genotypes, in which there was skew of the homo/hetero ratio in the observed vs. expected compartments in both groups in the context of the 10/10 homozygotes (p<0.000).
Between the two groups, we detected significant skew, primarily due to the significant enrichment of the 8/9 genotype in the NCD group vs. controls, and reverse ratio of 8/8 and 9/9 genotypes (p=0.001), which was primarily a consequence of excess 8/9 genotypes in the former.
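The expected genotype proportions referred to above follow from the Hardy-Weinberg relation: for alleles with frequencies p and q, homozygotes are expected at p² and q², and the heterozygote at 2pq. The following minimal sketch computes these expectations for any set of alleles. The allele frequencies in the example are hypothetical round numbers chosen for illustration and are not the frequencies observed in this study.

```python
from itertools import combinations_with_replacement

def hwe_expected_genotypes(allele_freqs):
    """Expected genotype frequencies under Hardy-Weinberg equilibrium.

    allele_freqs maps allele label -> frequency (should sum to 1).
    Returns a dict keyed by (allele_a, allele_b) with allele_a <= allele_b.
    """
    expected = {}
    for a, b in combinations_with_replacement(sorted(allele_freqs), 2):
        p, q = allele_freqs[a], allele_freqs[b]
        expected[(a, b)] = p * p if a == b else 2 * p * q
    return expected

def expected_heterozygosity(allele_freqs):
    """Expected fraction of heterozygotes: 1 minus the sum of squared frequencies."""
    return 1.0 - sum(p * p for p in allele_freqs.values())

# Hypothetical bi-allelic example (NOT the observed study frequencies).
example_freqs = {8: 0.5, 9: 0.5}
example = hwe_expected_genotypes(example_freqs)
for genotype, freq in example.items():
    print(genotype, round(freq, 3))
print("expected heterozygosity:", expected_heterozygosity(example_freqs))
```

Comparing such expected proportions with the observed genotype counts, for instance with a chi-square goodness-of-fit test, is the usual way to quantify a departure from equilibrium.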
Identification of an extreme genotype in the NCD group only.
We detected a genotype at the extreme short end of the allele range in one instance of late-onset NCD.
This genotype was 5/6 (Fig. 3), and was detected in an 85-year-old female case of NCD with AMTS=3, and suspected of having late-onset AD. Whereas the 5-repeat was the shortest allele detected in the NCD group, the shortest allele detected in the control group was 6-repeats.
The number of (GCC)-repeats may change the RNA secondary structure and interaction sites.

Discussion
Here we report the first indication of purifying selection at an STR locus in human. The primary importance of (GCC)-repeats stems from a possible link between that type of STR and natural selection, mainly for two reasons: Firstly, (GCC)-repeats are specifically enriched in the exons. Secondly, GC-rich sequences are mutation hotspots [25], and frequently interrupted by single nucleotide substitutions. Specific expansion of the SBF1 (GCC)-repeat in primates, and not in any other order, supports selective advantage of the STR in this order.
In both NCD and control groups, the expected heterozygosity for the observed allele frequencies was dramatically compromised, most likely due to selection against heterozygous genotypes. As a consequence, the homozygote compartment expanded significantly beyond expectation and over 77% across the two groups. This anomaly could not be attributed to the excess of consanguineous marriages in Iran, as excess of homozygosity in consanguineous societies can contribute to between 2 and 11% homozygosity at a given locus [26,27]. The homozygous genotypes could not be attributed to allele dropout either, as the frequency of such an event is less than 0.004 in amplification-based approaches [28]. Sampling error is another possible explanation for the observed genotypes. However, all samples were collected from the same districts in Iran, and the results were replicated in both groups, such as the shrunk 8/9 genotype compartment, and the excess of the 6/8 vs. 6/9 genotypes. Nevertheless, it should be noted that this is a pilot study, and warrants replication by independent studies.
Searching the Genome Aggregation database (gnomAD) for the human SBF1 (GCC)-repeat revealed inconclusive data for the annotated alleles and genotypes, which spanned across all the populations studied (https://gnomad.broadinstitute.org). The above findings are most likely due to the frequent failure of the general whole-exome sequencing methods to capture GC-rich sequences. Successful PCR amplification of the human SBF1 gene necessitates stringent conditions and special GC-rich buffer preparations as described in the Methods.
A likely hypothesis that may be put forward is that the heterozygous genotypes might have been selected against in human in the process of evolution. The studied (GCC)-repeat is located in the 5′ UTR, and it may be speculated that the heterodimer RNAs of, for example, 8 and 9 repeats, and 6 and 9, have a detrimental effect on the downstream events, such as transcript processing and translation. A possible mechanism might be connected to RNA structure and accessibility, which we could show does change with the number of (GCC)-repeats, and can affect at least exon 1.
An example of RNA heterodimer formation exists in the 5' regulatory regions of human HIV-1/HIV-2 RNAs [29,30], which requires GC-rich palindromic sequences among a number of other motifs [31]. It may be speculated that similar sequences in the GC-rich human SBF1 RNAs fulfill the conditions for potential RNA dimerization. Experimental synthetic stem-loop RNAs have been reported to alter the expression of a number of genes in bacteria [32].
SBF1 is predominantly expressed in the brain and skeletal muscle, and the protein encoded by this gene is a member of the myotubularin family. Myotubularin-related proteins, namely MTMR2, MTMR13/SBF2 and MTMR5/SBF1, are mainly involved in regulating endolysosomal trafficking [33] and mitochondrial functioning [34]. Dysregulation of SBF1 is linked to late-onset NCDs such as AD [17], which is also indicated by the observed genotype anomalies in the NCD group vs. controls in our study. An isolated instance of an NCD patient harboring a genotype that consisted of extreme short alleles may be of significance, while random co-occurrence should also be considered as a possibility. The secondary structure and accessibility effect of the 5/6 genotype were dramatically divergent, and the 5-repeat allele length was not detected in the control group. It is possible that low frequency alleles at the extreme ends of the allele distribution curve are subject to negative natural selection [8,12,35].
It remains to be clarified how certain heterozygous genotypes might have been selected against in human, and may increase the risk of late-onset AD. It is also warranted that this STR is sequenced in larger samples and in a spectrum of neurological disorders.

Conclusion
We report indication of a novel biological phenomenon, in which there is significant selection against heterozygous genotypes at an STR locus in the human population. In view of the location of the (GCC)-repeat in the 5′ UTR of the SBF1 gene, it is speculated that specific RNA/RNA or DNA/RNA heterodimers may exert effects that are selected against in the course of evolution. We also report skewed genotypes in late-onset NCD vs. controls at this locus. It should be noted that this is a pilot study, warranting replication by independent studies.

Figure 1
Allele frequency of the SBF1 (GCC)-repeat in the human samples studied. The locus was predominantly biallelic, consisting of 8 and 9-repeat alleles.

Figure 2
Genotype frequency of the SBF1 (GCC)-repeat in the human samples studied. Significant anomaly was detected in the NCD and control groups as a result of compromised heterozygous genotypes, mainly the 8/9 genotype. This anomaly was aggravated in the NCD group.

Figure 3
Identification of a genotype at the short extreme of the allele range in one instance of late-onset NCD.

Figure 4
Accessibility (probability to be unpaired) of all regions of 10 nt length, ending at base x, for the first exon of human SBF1 with 5 (red), 6 (green), 7 (black), 8 (blue), and 9 (black) (GCC)-repeats. Differences in 3 regions were detected, at about nt 50, about nt 200, and about nt 220.

Figure 5
Sequence alignment of the SBF1 (GCC)-repeat across selected vertebrate species. The (GCC)-repeat expands beyond 2-repeats in primates.