Method for identification of 10 SSR markers from monkey genomes and its statistical inference with One & Two-way ANOVA

Simple sequence repeats (SSRs), sometimes known as genetic "stutters", are DNA tracts composed of a few to many tandem repetitions of a short base-pair motif. These sequences mutate frequently, changing the number of repetitions. Because SSRs are often found in promoters, untranslated regions, and even coding sequences, these alterations can significantly affect practically every aspect of gene activity. SSR alleles can also contribute to normal diversity in brain and behavioural features, and mutational expansion of certain triplet repeats causes a number of inherited neurodegenerative diseases. Given the importance of SSRs in genetic research, in this paper we explore ten SSR markers (TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA) identified in the genomes of eleven distinct monkeys (A.Nancymaae, C.C.Imitator, C.Atys, M.Leucophaeus, P.Paniscus, R.Bieti, R.Roxellana, S.Boliviensis, T.Syrichta, C.A.Palliatus and M.Nemestrina) using a pattern matching mechanism. We mainly identify 4 bp SSRs in the Unchr chromosome of the eleven monkey datasets. The proposed approach finds the exact location of each SSR and the number of times it appears in the given genome sequence. The identified patterns are analysed with One-way and Two-way ANOVA, which provides a statistical analysis useful for genomic studies. The data on these ten 4 bp SSR markers are also valuable for illustrating genetic variation in genomic studies.
• The great specificity of data sets produced from monkey genomes with pattern matching has been demonstrated.
• These findings show that SSR identification could be a useful tool for determining genome similarity and comparability.
• Researchers can use the raw sequencing data to conduct additional bioinformatics analysis.


Method details
Simple sequence repeats (SSRs), tandem repeats of short DNA sequences, are present in varying quantities in the majority of genomes. These repeats have been widely employed in genetic mapping and population research. SSRs also provide molecular tools for understanding spatial links between chromosome segments, which in turn help in the analysis of temporal linkages between species and genera.
Studying the frequency of repeats and their distribution pattern in the genome is expected to help clarify their significance. Accumulating evidence suggests that SSRs influence gene expression [1][2][3].
Complete genome sequences are available for several species, and genome-wide analyses have been carried out. In this study, we analysed the Unchr chromosome of eleven different monkeys (A.Nancymaae, C.C.Imitator, C.Atys, M.Leucophaeus, P.Paniscus, R.Bieti, R.Roxellana, S.Boliviensis, T.Syrichta, C.A.Palliatus and M.Nemestrina) and investigated ten SSR loci for their spread and frequency of occurrence.
Previously, a few studies have evaluated tandem repeat distributions in monkey genomes [4], but they are restricted to a single genome or a small number of genomes. Mining multiple genomes and analysing the results with Analysis of Variance (ANOVA) helps to understand and resolve biological questions.
The proposed structure of the method is shown in Fig. 1. It comprises data set collection and reading, the SSR identification and search process, and analysis of variance (ANOVA).

Search process
Algorithm 1 describes the complete SSR identification heuristic procedure. The chromosomes and SSRs are given as input, and the procedure then invokes the first_occurance_position_heuristic, bad_character_heuristic, and good_suffix_heuristic procedures. Finally, the algorithm displays each pattern and its position and continues the search until the end of the given chromosome sequence.
Algorithm 2 describes the first occurrence position heuristic procedure. Here the pattern's rightmost character, pattern[m-1], is compared with the corresponding character in the genome sequence; if a match is found, the match position is returned; otherwise, the comparison continues until the end of the genome sequence.
Algorithm 3 describes the bad character heuristic procedure. This procedure is invoked when a mismatch occurs and returns the shifted position of the pattern. If none of the pattern characters matched the genome sequence, the entire pattern position is shifted; otherwise, the number of characters matched is used to shift the pattern. Algorithm 4 describes the good suffix heuristic procedure. This procedure is invoked on a complete pattern match and returns the search position. If a substring of the pattern (a good suffix) has been matched before a bad character is reached, and the mismatch would cause a negative shift under the bad character heuristic, we step forward by the length of the suffix found.

Algorithm 2. First Occurrence Position Heuristic Process.

Algorithm 3. Bad Character Heuristic Process (returns β).
This procedure is repeated for every SSR over the entire sequence data of the chromosomes.
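The search described above can be illustrated with a short sketch. It is not a transcription of Algorithms 1 to 4: it uses the Horspool simplification of Boyer-Moore, which compares right to left starting at pattern[m-1] and shifts with a bad-character table only, and it omits the good suffix rule. The function names `bad_character_table`, `find_ssr_positions` and `count_ssrs`, and the toy sequence in the usage note, are assumptions made for illustration.

```python
from typing import Dict, List


def bad_character_table(pattern: str) -> Dict[str, int]:
    """Horspool shift table: distance from each character to the pattern end."""
    m = len(pattern)
    # The last character is excluded so that every shift is at least 1.
    return {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}


def find_ssr_positions(sequence: str, pattern: str) -> List[int]:
    """Return the 0-based start position of every occurrence of pattern."""
    n, m = len(sequence), len(pattern)
    if m == 0 or n < m:
        return []
    shifts = bad_character_table(pattern)
    positions: List[int] = []
    pos = 0
    while pos <= n - m:
        # Compare right to left, starting with the rightmost character.
        j = m - 1
        while j >= 0 and sequence[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            positions.append(pos)  # complete match found at pos
        # Shift using the sequence character aligned with the pattern's end;
        # characters absent from the pattern move the window by m.
        pos += shifts.get(sequence[pos + m - 1], m)
    return positions


def count_ssrs(sequence: str, motifs: List[str]) -> Dict[str, int]:
    """Count the occurrences of each motif in one sequence."""
    return {motif: len(find_ssr_positions(sequence, motif)) for motif in motifs}
```

For the made-up sequence `"TAGATAGACCTCTA"`, `find_ssr_positions(sequence, "TAGA")` returns `[0, 4]`, and `count_ssrs` gives one count per motif, which is the kind of per-dataset table the ANOVA step takes as input.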

Analysis of variance (ANOVA)
Algorithm 4. Good_Suffix_Heuristic Process: Good_Suffix_Heuristic(Patt, m), returns γ.

The analysis of variance (ANOVA) [5][6][7] is a set of statistical models and estimation procedures for analysing differences between group means in a sample. It is useful for testing the statistical significance of differences among three or more group means. For this, we calculated the $SS_{between}$, $MS_{between}$, $df_{between}$, $SS_{within}$, $MS_{within}$, $df_{within}$ and $F$ values. For between-group variability, each squared deviation is multiplied by the corresponding sample size and the products are summed; this is known as the sum of squares, as shown in Eq. (1).

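$$SS_{between} = \sum_{j=1}^{k} n_j \left(\bar{x}_j - \bar{x}\right)^2 \tag{1}$$
where $k$ is the number of groups, $n_j$ and $\bar{x}_j$ are the size and mean of group $j$, and $\bar{x}$ is the overall mean.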
Within-group variability is measured by how far each sample's value deviates from the sample mean.
$SS_{within}$ is calculated with Eq. (4), $df_{within}$ with Eq. (5) and $MS_{within}$ with Eq. (6).
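The quantities named above can be computed directly from the group values. The sketch below uses the standard textbook definitions of the one-way ANOVA terms, assumed here to correspond to the paper's equations: degrees of freedom of k - 1 between groups and N - k within groups, mean squares equal to each sum of squares divided by its degrees of freedom, and F as the ratio of the two mean squares. The function name `one_way_anova` and the toy numbers in the usage note are hypothetical, not values from Table 1.

```python
from typing import Dict, Sequence


def one_way_anova(groups: Sequence[Sequence[float]]) -> Dict[str, float]:
    """Compute the one-way ANOVA sums of squares, mean squares and F."""
    k = len(groups)                             # number of groups
    n_total = sum(len(g) for g in groups)       # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between groups: squared deviation of each group mean from the grand
    # mean, weighted by the group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within groups: squared deviation of each value from its own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "SS_between": ss_between, "df_between": df_between, "MS_between": ms_between,
        "SS_within": ss_within, "df_within": df_within, "MS_within": ms_within,
        "F": ms_between / ms_within,
    }
```

With the toy groups `[[1, 2, 3], [2, 3, 4], [5, 6, 7]]`, the between-group sum of squares is 26 on 2 degrees of freedom and the within-group sum of squares is 6 on 6 degrees of freedom, so F is approximately 13.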

F-Statistic
The F-statistic assesses whether the means of two or more samples differ significantly. When its value is small, the sample means are close to each other, and the null hypothesis cannot be rejected. It is calculated with Eq. (7):
$$F = \frac{\text{Between-group variability}}{\text{Within-group variability}} \tag{7}$$
$$\text{Null hypothesis} = \begin{cases} \text{Rejected}, & F_{statistic} > F_{critical\ value} \\ \text{Accepted}, & \text{otherwise} \end{cases}$$
Table 1 shows data from ten SSR markers taken from the genomes of eleven monkeys. The genome datasets were obtained from NCBI (https://www.ncbi.nlm.nih.gov). The results suggest that SSR identification with pattern matching was useful in revealing variation in the chosen genome libraries. These SSR markers may be used to compare and quantify genomic similarities.
The numbers of patterns identified in the different chromosomes are shown as a bar plot in Fig. 2. Fig. A1(a) to (j) in Appendix A (figures part) show the maximum pattern size for the considered genome datasets. Fig. A2(a) to (k) in Appendix A (figures part) show the ten patterns for each of the 11 datasets.

One-way analysis of variance
This method was employed to compare the averages of two or more samples (using the F distribution). It applies to a numerical response (the "Y"), which is generally one variable, and a numerical or (more often) categorical input (the "X"), which is always one variable, hence "One-way". One-way analysis of variance was performed among A.Nancymaae, C.C.Imitator, C.Atys, M.Leucophaeus, P.Paniscus, R.Bieti, R.Roxellana, S.Boliviensis, T.Syrichta, C.A.Palliatus and M.Nemestrina for the ten patterns. The results are shown in Tables 2 and 3.
From Table 2, the null hypothesis is TRUE for every monkey dataset except C.Atys. From these p values, we conclude that C.Atys is similar to the other monkeys. The One-way ANOVA statistics for the different chromosomes are shown as a bar plot in Fig. 3.
From Table 3, the null hypothesis is likewise TRUE for every pattern except four: TCAT, GAAT, CTTT and TCTG. From these p values, we conclude that TCAT, GAAT, CTTT and TCTG are similar to the other patterns. The One-way ANOVA statistics for the different patterns are shown as a bar plot in Fig. 4.

Two-way analysis of variance
Two-way ANOVA examines the impact of two categorical independent variables on a continuous dependent variable. It is used to determine not only the main effect of each independent variable, but also whether they interact. It was performed for each of the 11 datasets (A.Nancymaae, C.C.Imitator, C.Atys, M.Leucophaeus, P.Paniscus, R.Bieti, R.Roxellana, S.Boliviensis, T.Syrichta, C.A.Palliatus and M.Nemestrina) across the ten patterns.
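As a rough illustration of the two-factor decomposition, the sketch below runs a two-way ANOVA without replication on a table whose rows stand for datasets and whose columns stand for patterns. It assumes a single count per cell, in which case the interaction cannot be separated from the error term, so only the two main effects are tested; this is a simplification and may differ from the design behind Tables A.1 to A.11. The function name `two_way_anova_no_replication` and the toy table in the usage note are hypothetical.

```python
from typing import Dict, Sequence


def two_way_anova_no_replication(table: Sequence[Sequence[float]]) -> Dict[str, float]:
    """Main-effects two-way ANOVA for a rows-by-columns table, one value per cell."""
    r, c = len(table), len(table[0])
    grand_mean = sum(sum(row) for row in table) / (r * c)
    row_means = [sum(row) / c for row in table]
    col_means = [sum(table[i][j] for i in range(r)) / r for j in range(c)]

    # Sums of squares for the row factor, the column factor and the total.
    ss_rows = c * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = r * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in table for x in row)
    # With one value per cell, the residual is what the two factors leave over.
    ss_error = ss_total - ss_rows - ss_cols

    df_rows, df_cols = r - 1, c - 1
    df_error = df_rows * df_cols
    ms_error = ss_error / df_error
    return {
        "SS_rows": ss_rows, "SS_cols": ss_cols, "SS_error": ss_error,
        "df_rows": df_rows, "df_cols": df_cols, "df_error": df_error,
        "F_rows": (ss_rows / df_rows) / ms_error,
        "F_cols": (ss_cols / df_cols) / ms_error,
    }
```

For the toy table `[[4, 6, 8], [5, 9, 7], [9, 9, 6]]`, both factors have a sum of squares of 6 and the error sum of squares is 16 on 4 degrees of freedom, so both F ratios are approximately 0.75. Testing the interaction as well would need more than one observation per cell.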
The actual results are uploaded to Mendeley as Appendix A and Appendix B. Tables A.1 to A.11 show the Two-way ANOVA statistics and p values of the 11 datasets for the ten patterns. These results show which relations support the null hypothesis and which support the alternative hypothesis. For example, from the statistics and p values, the relation between TAGA and AGAA supports the alternative hypothesis, whereas the relation between TCTA and GAAT supports the null hypothesis.
From Table B.1, for the TAGA pattern, A.Nancymaae and C.C.Imitator support the alternative hypothesis based on the statistics and p value, whereas A.Nancymaae and C.Atys support the null hypothesis. Tables B.12 to B.21 show the Two-way ANOVA hypothesis rejection (TRUE/FALSE) among the 11 datasets for the ten patterns respectively. From Table B.12, for the TCAT pattern, it is observed that the relation between A.Nancymaae and C.C.Imitator ...
From Fig. 5(a) to (k), covering the 11 monkey datasets for the 10 patterns, the graphs match the results discussed in the previous paragraphs. Fig. 6(a) to (j) show the multiple comparisons between all pairs (Tukey) among the 10 patterns for all 11 datasets. From Fig. 6(a) to (j), covering the 10 patterns for the 11 monkey datasets, the graphs likewise match the results discussed in the previous paragraphs.

Ethics statements
This work has never been published or submitted to another journal. This information and analysis will not hurt humans or animals.

Declaration of Competing Interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data Availability
Data will be made available on request.