Directional Decremental Abundance of (GA)9 and (GA)11 Blocks in Primate Speciation

Background: Recent evidence indicates that expansion of a number of short tandem repeats (STRs) may be a result of natural selection. The human neuron-specific genes, RIT2 and GPM6B, contain two of the longest GA-STRs at 11 and 9 repeats, respectively, the length ranges of which are functional, and exceedingly rare alleles at the extreme ends of those STRs co-occur with major human disorders. To examine the evolutionary trend of (GA)11 and (GA)9 blocks, two sets of chromosomal regions (each spanning 10 Mb of genomic DNA) across all chromosomes were searched for those blocks across rodent and primate orders. We also sequenced the RIT2 and GPM6B STRs in 600 human subjects, consisting of late-onset neurocognitive disorder (n=200), multiple sclerosis (n=200), and controls (n=200). Results: We detected a directional decremental abundance of (GA)11 and (GA)9 blocks, matching the phylogenetic distance of the selected species as follows: mouse>macaque>great apes (p=0.000006). The RIT2 and GPM6B GA-repeats were at strict lengths of 11 and 9 repeats in human, respectively, and were predominantly human-specific in formula. The exception was a 9/11 genotype of the RIT2 GA-STR in an isolated case of multiple sclerosis. Conclusion: We report the first evidence of a massive directional trend of STRs linked to speciation. Genes such as RIT2 and GPM6B may be suitable candidates to explore the evolutionary impact of STR blocks.


Introduction
Short tandem repeats (STRs) are a significant source of variation across species. This is a reflection of the highly polymorphic nature and plasticity of these genetic elements. Evolving evidence indicates crucial roles for STRs in gene expression (1-4) and translation (5). A number of STRs may function as evolutionary switch codes for speciation (6,7). A minority of human STRs reach exceptional lengths of ≥ 6 repeats in the critical core promoter and 5′ untranslated region (UTR) (hence the term "exceptionally long"), a number of which may be of prime importance with respect to speciation and adaptive evolution (8,9). As an example, an exceptionally long CT/GA complex in the core promoter of PAXBP1 links to the evolution of primates and reaches exceptional length and complexity in human (OMIM: 617621) (10).
The 11-repeat functional GA-STR in the core promoter of the human neuron-specific gene, RIT2 (Ras-like without CAAX 2), is a prime example of possible natural selection for a particular STR length in human and of the species-specificity of that length (12). A 5-repeat allele of this GA-STR was detected in the homozygous state (5/5) in a patient with schizophrenia. While the 5-repeat allele was most likely selected against following the Neolithic revolution (this allele is non-existent in the Genome Aggregation database, the 1000 Genomes database, and our study of over two thousand human subjects to date), it is annotated in one of four hunter-gatherer men sequenced from southern Africa (BUSHMAN KB1: rs113265205). mRNA expression data in normal human tissues from GTEx, Illumina, BioGPS, and SAGE indicate the highest expression of this gene in the nervous system, particularly in the cerebellum.
Another exceptionally long GA-repeat is located in the 5′ UTR, in the interval between +1 and +60 relative to the TSS of the gene glycoprotein membrane 6B (GPM6B) (Transcript: GPM6B-203 ENST00000356942.9) (8). GPM6B encodes an abundant cell surface protein in CNS neurons (13), which regulates oligodendrocyte myelination in the central and peripheral nervous system (14,15). mRNA expression data in normal human tissues from GTEx, Illumina, BioGPS, and SAGE indicate the highest expression of this gene in the nervous system, particularly in the frontal cortex. Moreover, integrated proteomics data from ProteomicsDB and MOPED indicate the highest protein expression in the frontal cortex. Interestingly, transcribed STR allele lengths in the UTRs are correlated with gene expression in plant populations (1).
Here we studied the evolutionary trend of perfect (GA)9 and (GA)11 blocks across rodent and primate orders. In order to assess the polymorphism status of the RIT2 and GPM6B exceptionally long STRs, we also sequenced these two STRs in 600 human subjects, consisting of late-onset neurocognitive disorder (NCD), multiple sclerosis (MS) and controls.

Subjects
Six hundred unrelated Iranian subjects, consisting of NCD patients (n=200), MS patients (n=200), and controls (n=200), were recruited from the provinces of Tehran, Qazvin, and Rasht. In each NCD case, the Persian version of the Abbreviated Mental Test Score (AMTS) (16,17) was administered (an AMTS of <7 was an inclusion criterion for NCD), medical records were reviewed in all participants, and CT scans were taken where possible (approximately 40% of instances). The AMTS is currently one of the most accurate primary screening instruments for NCD (18). The Persian version of the AMTS is a valid cognitive assessment tool for older Iranian adults and can be used for NCD screening in Iran (16). Diagnosis in the MS cases was performed independently by two neurologists. The control group was selected based on a normal AMTS of >7, lack of major medical history, and a normal CT scan where possible. The cases and controls were matched for age, gender, and residential district.
The subjects' consent was obtained (from their guardians where necessary) and their identities remained confidential throughout the study. This research was approved by the Ethical Committee of the University of Social Welfare and Rehabilitation Sciences, Tehran, Iran, and was consistent with the principles outlined in an internationally recognized standard for the ethical conduct of human research. All methods were performed in accordance with the relevant guidelines and regulations.
Evolutionary analysis of perfect (GA)9 and (GA)11 blocks across rodent and primate species.

Extraction of STRs from genomic sequences
The abundance of perfect (GA)9 and (GA)11 blocks was studied in six selected species, including mouse, macaque, gorilla, chimpanzee, bonobo, and human, using a software package designed in the C# environment. Using the REST API service of Ensembl release 101 (https://asia.ensembl.org), the first and second 10 Mb from the end of each chromosome were selected (Fig. 1). This selection was arbitrary and was intended to represent two independent regions of each chromosome, across all chromosomes. Subsequently, for each selected region, the STRs and their abundance were calculated, and the abundance of STRs was compared chromosome by chromosome. Finally, the data of the selected regions for all chromosomes in the six species were aggregated and analyzed.
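The counting step can be illustrated with a minimal Python sketch. This is not the C# package used in the study, which is not shown here. The function name `count_perfect_ga_blocks` is hypothetical, and the sketch assumes that a "perfect" block is a maximal, uninterrupted run of exactly 9 or 11 GA units on the strand supplied, so that a (GA)12 run, for instance, is not counted as containing a (GA)11 block. Whether the original software applied the same definition is not stated.

```python
import re

# A maximal run of consecutive "GA" dinucleotides on the given strand.
GA_RUN = re.compile(r"(?:GA)+")


def count_perfect_ga_blocks(sequence, lengths=(9, 11)):
    """Count maximal GA runs whose repeat number is exactly one of `lengths`.

    Assumption (not taken from the paper): longer runs are not counted
    as containing shorter blocks, and only the supplied strand is scanned.
    """
    counts = {n: 0 for n in lengths}
    for match in GA_RUN.finditer(sequence.upper()):
        repeats = len(match.group()) // 2  # two bases per GA unit
        if repeats in counts:
            counts[repeats] += 1
    return counts


# Toy sequence with one (GA)9, one (GA)11 and one (GA)10 run.
toy = "TT" + "GA" * 9 + "CC" + "GA" * 11 + "TT" + "GA" * 10 + "C"
print(count_perfect_ga_blocks(toy))
```

In practice the function would be applied to each 10 Mb region retrieved for a chromosome, and the per-region counts would then be aggregated per species.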
Allele and genotype analysis of the RIT2 and GPM6B GA-repeats in human.
Genomic DNA was obtained from peripheral blood using a standard salting out method. PCR reactions for the amplification of the RIT2 GA-repeat were set up as previously described (12). The following primers were used to amplify the region containing the GPM6B GA-STR: Forward: CTCCTTCACATCCCCTCCTC, Reverse: GTGCCTACAGTCTCAATGCG. PCR reactions for the GPM6B GA-repeat were carried out in a thermocycler (Creacon, model 0005.401) under the following conditions: 95 °C for 3 min, 35 cycles of denaturation at 94 °C for 30 s, annealing at 56 °C for 40 s, and extension at 72 °C for 40 s, and a final extension at 72 °C for 5 min. As necessitated by the nature of dinucleotide repeats, all the samples included in this study were sequenced with the forward primer, using an ABI PRISM 377 DNA sequencer.
Evolutionary analysis of the RIT2 and GPM6B GA-STRs across vertebrates.
The RIT2 and GPM6B core promoter and 5′ UTR sequences from -120 to +120 of the TSS were screened in 63 and 76 species, respectively, selected from major orders of vertebrates based on the Ensembl database version 101 (http://asia.ensembl.org/index.html). In the species in which the transcript boundaries were not determined, the GA-repeat location was estimated based on the length of the transcript in human, as reference.
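As an illustration of the screening interval, the following Python sketch extracts the -120 to +120 window around a TSS and reports the longest GA run within it. It rests on several assumptions that are not taken from the study: the input sequence is already oriented along the transcript strand, `tss_index` is the 0-based index of position +1, there is no position 0, and the helper names `tss_window` and `longest_ga_repeat` are hypothetical.

```python
import re

GA_RUN = re.compile(r"(?:GA)+")


def tss_window(transcript_strand_seq, tss_index, flank=120):
    """Return the -flank..+flank window and the index of +1 inside it."""
    start = max(0, tss_index - flank)
    end = min(len(transcript_strand_seq), tss_index + flank)
    return transcript_strand_seq[start:end].upper(), tss_index - start


def longest_ga_repeat(window, tss_offset):
    """Return (repeat_count, start_position_relative_to_TSS) or None.

    Positions follow the convention -1, +1 with no zero position.
    If several runs share the maximum length, the first one is returned.
    """
    best = None
    for match in GA_RUN.finditer(window):
        repeats = len(match.group()) // 2
        if best is None or repeats > best[0]:
            i = match.start()
            position = i - tss_offset + 1 if i >= tss_offset else i - tss_offset
            best = (repeats, position)
    return best


# Toy example: a (GA)9 run starting 10 bases downstream of position +1.
toy = "C" * 200 + "T" * 10 + "GA" * 9 + "C" * 200
window, offset = tss_window(toy, tss_index=200)
print(longest_ga_repeat(window, offset))
```

For species without annotated transcript boundaries, `tss_index` would have to be estimated, as described above, from the transcript length in human.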
DNA reconstruction of the RIT2 and GPM6B GA-repeats for different repeat lengths.

Molecular dynamics simulation details
The DNA structures of the RIT2 and GPM6B GA-STRs and the immediate flanking sequences (up to 10 bp) were reconstructed as follows: Web 3DNA 2.0 (http://web.x3dna.org) was used to model the three-dimensional structure of DNA, and all-atom molecular dynamics simulations were performed using the Gromacs-2020 package (19). The CHARMM-GUI web server (20) was used to create the topology files required for the simulation. Water molecules were modeled using the TIP3P solvent type. A suitable number of ions was added to the simulation box in order to maintain a 0.15 M salt concentration.
CHARMM36 was applied as the force field in our simulations (21). Periodic boundary conditions (PBC) were applied throughout the simulations (22). The particle mesh Ewald (PME) method was used to describe long-range Coulombic interactions (23). The bond lengths were constrained using the LINCS algorithm (24). The cut-off distance for Coulombic and van der Waals interactions was set to 1.2 nm. The steepest descent algorithm was used for energy minimization (25). Next, the equilibration phases were applied with position restraint on the system for 0.25 via standard coupling methods (19). Finally, main runs were performed without any restraint on the molecules for 10 ns. VMD was used to visualize the results of the simulations (26).

Statistical Analysis
The STRs across the six selected species were described with mean ± standard deviation and box plot diagrams. The Wilcoxon Signed Ranks Test was used for pairwise comparisons between species, and a p-value >0.2 was defined as similarity between species.
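The pairwise comparison can be sketched in Python as follows. The statistical software used in the study is not stated, so this is only an illustrative, self-contained implementation of a two-sided Wilcoxon signed-rank test. It assumes that the two species are compared on paired per-region block counts, drops zero differences, assigns average ranks to ties, and computes an exact p-value from the sign-flip distribution of the observed ranks. The function name and the example counts are hypothetical.

```python
def wilcoxon_signed_rank(x, y):
    """Two-sided exact Wilcoxon signed-rank test for paired samples.

    Returns (W, p), where W is the smaller of the positive and
    negative rank sums. Zero differences are discarded.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Average ranks of |d|, stored doubled so that tied ranks stay integers.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks2 = [0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = i + j + 2  # 2 * mean of ranks (i+1)..(j+1)
        i = j + 1

    total2 = sum(ranks2)
    w_plus2 = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    stat2 = min(w_plus2, total2 - w_plus2)

    # counts[s] = number of sign assignments whose doubled positive
    # rank sum equals s (subset-sum counting over the observed ranks).
    counts = [0] * (total2 + 1)
    counts[0] = 1
    for r in ranks2:
        for s in range(total2, r - 1, -1):
            counts[s] += counts[s - r]

    lower_tail = sum(counts[: stat2 + 1])
    p_value = min(1.0, 2 * lower_tail / 2 ** n)
    return stat2 / 2, p_value


# Hypothetical per-region block counts for two species.
species_a = [12, 15, 9, 20, 11, 17]
species_b = [11, 17, 6, 24, 6, 11]
w, p = wilcoxon_signed_rank(species_a, species_b)
print(w, p)
```

Under the criterion stated above, a pair of species would be regarded as similar when the returned p-value exceeds 0.2.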

Results
Non-random trend of perfect (GA)9 and (GA)11 blocks in primate speciation.
To examine whether (GA)11 and (GA)9 blocks evolved as a directional trend or as a result of random evolution, two arbitrary chromosomal regions were screened across all chromosomes for perfect (GA)9 and (GA)11 blocks in six species of rodent and primate orders (Tables 1 and 2). We observed a predominant directional decremental trend in the abundance of those repeat blocks, which was replicated in datasets 1 and 2, and matched the phylogenetic distance of the selected species as follows: mouse>macaque>great apes (p=0.000006) (Tables 3 and 4) (Figs. 2 and 3).
Strict monomorphism of the exceptionally long GA-repeats in the core promoter and 5′ UTR of RIT2 and GPM6B in human.
We detected strict monomorphism of the GA-repeats in both RIT2 and GPM6B genes in 599 of the 600 human subjects studied, at 11 and 9 repeats, respectively (Fig. 4).
A 9/11 genotype of the RIT2 GA-repeat in a female case of MS.
The exception to monomorphism was a 9-repeat allele of the RIT2 GA-STR, found in a 9/11 genotype in a female case of MS (Fig. 5). While this allele was not detected in the 1000 Genomes or Genome Aggregation databases, it was annotated at a frequency of 0.0002 in the Trans-Omics for Precision Medicine (TOPMed) database.
Evolutionary status of the RIT2 and GPM6B GA-repeats across vertebrates.
While we detected a wide range of GA-repeat lengths in the selected species across major orders of vertebrates (Tables 5 and 6), these repeats were predominantly specific in formula in human. An exception was Ma's night monkey, in which a (GA)11 was detectable in the RIT2 gene, but the immediate 5′ flanking sequence to the GA-repeat was divergent from human, resulting in a significantly different molecular simulation pattern (Fig. 6). Various molecular patterns were also observed for different lengths of the GA-repeat in human.
Similarly, molecular simulation revealed divergent patterns for different lengths of the GPM6B GA-repeat in human (Fig. 7).

Discussion
Here we report a massive and directional evolutionary trend of GA-STR blocks of 9 and 11 repeats in primate speciation, which was replicated in two independent sets of data. We also provide examples of monomorphism of those blocks in the critical regulatory regions of two neuron-specific genes, RIT2 and GPM6B. The STR formulas were unique to those genes and no other genes were associated with STRs of those lengths in the specified interval (8,9).
The bulk of the literature on STRs proposes that these highly polymorphic elements evolve randomly for the most part, or as a result of genetic drift in bottleneck events. However, the detection of a directional trend that matched the genetic distance and phylogeny of the six selected species, spanning a time-scale of at least 80 million years, is the first instance of a directional evolutionary trend for STRs (Fig. 8). STRs form non-B (alternative) DNA structures, which can serve as building blocks for genetic computers, programmed for diverse evolutionary, physiological, and pathological processes (27,28).
The emerging comparative and functional analyses support adaptive evolutionary patterns for the expansion of a number of STRs, and the co-occurrence of alleles at the extreme ends of these STRs with major human cognitive disorders (10,12,29-31). Recent reports indicate that STR length influences expression quantitative trait loci (eQTL) associations (32). It should be noted that instances of STR allele selection in animals have been reported in the 5′ regulatory region of the arginine vasopressin 1a receptor and oxytocin receptor loci (33).
The monomorphic GA-repeat blocks in the RIT2 and GPM6B regulatory regions provide examples to further examine the evolutionary implication of GA blocks. Exceedingly rare alleles of other lengths do exist in humans for the two GA-repeats, and are likely to result in disease phenotypes. For example, a 9/11 genotype of the RIT2 GA-repeat in an isolated case of MS in this study and a 5/5 homozygous genotype in a consanguineous case of schizophrenia in our previous study (12) strengthen the hypothesis that deviation from monomorphism for those repeats links to disease. The 9-repeat allele was detected at a frequency of 0.0002 in the TOPMed database. This database is aimed at cataloging whole-genome data for a vast range of human disorders and phenotypes. We did not detect any deviation from monomorphism for the GPM6B GA-repeat in the human subjects studied. Remarkably, TOPMed is enriched for very rare allele lengths of the GPM6B GA-repeat, as opposed to significantly restricted allele lengths in the 1000 Genomes (https://www.internationalgenome.org) and Genome Aggregation (https://gnomad.broadinstitute.org) databases. The identified monomorphism and the association of deviations from this monomorphism with disease phenotypes, experimental evidence on the functionality of the GA blocks per se and of their various lengths (12), as well as molecular simulation findings on the significantly divergent molecular patterns of various lengths, support the view that those GA blocks are not neutral.
Genome editing techniques such as CRISPR/Cas9 may be employed to explore the role of STR blocks, for example in the differentiation of various neural cells from stem cells in human (34).

Conclusion
This research is the first to propose a massive directional trend of STR blocks co-occurring with the evolution of various species. The resulting research outcome may change the perspective on evolution and complex traits at the molecular level.

Tables
Due to technical limitations, Tables 1-6 are only available as a download in the Supplemental Files section.

Figure 1
Schematic representation of the approach used for collecting datasets 1 and 2. This approach was applied to all chromosomes across the six selected species. Only one chromosome is shown as an example.

Figure 3
Abundance trend of perfect (GA)9 and (GA)11 blocks in mouse and five primate species in dataset 2.

Figure 4
Strict monomorphism of the RIT2 (A) and GPM6B (B) GA-repeats in 600 human individuals studied.
Monomorphism was documented by Sanger sequencing of all samples. Only one sequence is depicted for each STR as an example.

Figure 6
Molecular simulation of the RIT2 GA-repeats at the inter- and intraspecies levels.

Figure 7
Molecular simulation of the GPM6B GA-repeat with various lengths in human.