MONO, DI and TRI SSRs data extraction & storage from 1403 virus genomes with next generation retrieval mechanism

Simple sequence repeats (SSRs) now play a prominent role in several areas of bioinformatics, including identification of new viruses, DNA fingerprinting, paternity and maternity identification, disease identification, and assessment of future disease likelihood. Because of their wide applications and significance, SSRs have attracted the interest of many researchers. SSR extraction relies on retrieval algorithms; if the quality of the retrieval algorithm improves, the extraction system returns more relevant results. This paper proposes a new retrieval mechanism that extracts MONO, DI and TRI patterns. To extract these patterns with the proposed mechanism, the DNA sequences of 1403 virus genome data sets are considered, and different MONO, DI and TRI patterns are searched for in the genome sequence files. The proposed Next Generation Sequencing (NGS) retrieval mechanism extracted the MONO, DI and TRI patterns without missing any occurrences. It is observed that the retrieval mechanism reduces unnecessary comparisons. Finally, the extracted SSRs provide a useful, single-view resource for researchers.


Specifications
These data suggest that SSR extraction is a useful method for providing information for various applications related to studies of viruses.
Access to the raw sequence data for viruses allows researchers to perform further bioinformatics analyses using their own computational algorithms.

Data
The database was developed using MySQL. The information stored in the database includes virus names, genome id, A, C, G and T percentages, tract length, category, motif types (MONO, DI and TRI), the sequences of the motifs and their frequencies of occurrence in the entire genome. The overall database process is shown in Fig. 1.

Structure of the database
In this paper, we consider three tables from the database, whose structure we changed to our own format so that additional analysis can be done easily: (1) virus_category, (2) virus_acgt_count and (3) virus_ssrs.

Virus category
This table holds the virus category information taken from the virus files. Its structure is shown in Table 1 and the actual data are shown in Fig. 1.

Virus ACGT count
This table holds the A, C, G and T counts of each virus, their percentages and the tract length. Its structure is shown in Table 2 and the actual data are shown in Fig. 2.
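The kind of values held in virus_acgt_count can be illustrated with a minimal Java sketch. It assumes that the tract length is the total sequence length and that each percentage is the base count divided by that length; the class and method names (AcgtCount, count, percentage) are hypothetical and are not taken from the original implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AcgtCount {
    /** Counts A, C, G and T (case-insensitive); any other symbol is ignored. */
    static Map<Character, Integer> count(String sequence) {
        Map<Character, Integer> counts = new LinkedHashMap<>();
        for (char base : "ACGT".toCharArray()) {
            counts.put(base, 0);
        }
        for (char c : sequence.toUpperCase().toCharArray()) {
            if (counts.containsKey(c)) {
                counts.put(c, counts.get(c) + 1);
            }
        }
        return counts;
    }

    /** Percentage of one base relative to the tract length (assumed to be the sequence length). */
    static double percentage(int baseCount, int tractLength) {
        return tractLength == 0 ? 0.0 : 100.0 * baseCount / tractLength;
    }

    public static void main(String[] args) {
        String sequence = "ACGTACGTAA"; // toy sequence, not taken from the data set
        Map<Character, Integer> counts = count(sequence);
        for (Map.Entry<Character, Integer> e : counts.entrySet()) {
            System.out.printf("%c: %d (%.1f%%)%n", e.getKey(), e.getValue(),
                    percentage(e.getValue(), sequence.length()));
        }
    }
}
```

Each genome would contribute one such set of counts and percentages, stored alongside its tract length.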

Virus SSRs
This table holds the virus_name, genome_id, motif, frequency and position of each SSR. Its structure is shown in Table 3 and the actual data are shown in Fig. 3.

Description
In this section we give a detailed description of the 1403 virus genomes.

Category wise description
We used a total of 1403 virus genome sequences. We categorized these genomes as shown in the

Frequency description
We extracted the overall, MONO, DI and TRI frequencies from the virus_ssrs table; these are shown in Table 4. Of these, the MONO motifs show the highest maximum frequency (99) and therefore have a high impact.

Virus size description
In this section, we describe the genomes by executing SQL queries on virus_category for category-wise counts; the results are shown in Table A2 (presented in Appendix A). Table A2 summarizes the number of genomes in each virus category, grouped by genome size. Two of the Mimiviridae genomes are very large (greater than 1 Mb). 81 ssRNA negative-strand viruses, 89 ssRNA positive-strand viruses and no DNA viruses are found to be between 10 Kb and 50 Kb. 31 virus genomes are smaller than 1 Kb.
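The grouping by genome size can be sketched in Java as follows. The source obtains these counts with SQL queries; this is an in-memory equivalent for illustration only. The thresholds 1 Kb, 10 Kb, 50 Kb and 1 Mb come from the text, but the exact bin layout of Table A2, the choice of 1 Kb = 1000 bp, and the names GenomeSizeBins, sizeClass and countBySizeClass are assumptions.

```java
import java.util.Map;
import java.util.TreeMap;

public class GenomeSizeBins {
    /** Maps a genome length in bp to a size-class label (bin edges are illustrative). */
    static String sizeClass(long lengthBp) {
        if (lengthBp < 1_000) return "< 1 Kb";
        if (lengthBp < 10_000) return "1 Kb to 10 Kb";
        if (lengthBp <= 50_000) return "10 Kb to 50 Kb";
        if (lengthBp <= 1_000_000) return "50 Kb to 1 Mb";
        return "> 1 Mb";
    }

    /** Counts how many genome lengths fall into each size class. */
    static Map<String, Integer> countBySizeClass(long[] lengths) {
        Map<String, Integer> counts = new TreeMap<>();
        for (long length : lengths) {
            counts.merge(sizeClass(length), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // 216 bp and 1,241,026 bp are the smallest and largest genome sizes
        // reported in the text; 12,000 bp is a made-up intermediate value.
        long[] lengths = {216, 12_000, 1_241_026};
        countBySizeClass(lengths).forEach(
                (label, n) -> System.out.println(label + ": " + n));
    }
}
```

Running the same grouping once per virus category would give a category-by-size table of the kind the text describes.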

MIN, MAX and AVG tract length description
We carried out a preliminary study of the genome sizes of all viruses, as shown in Table A3 (presented in Appendix A). From Table A3, we observed that the smallest virus genome belongs to Satellite Nucleic Acids (216 bp), whereas the largest belongs to Mimiviridae (1,241,026 bp). When average genome sizes are considered by category, the average length of the Mimiviridae genomes is much higher than that of Herpesvirales and Baculoviridae (see Fig. 5). The Mimiviridae genomes are around 6 times larger than the Herpesvirales genomes and 7 times larger than the Baculoviridae genomes.

MONO MOTIF description
A total of 4,692,149 continuous MONO SSRs were extracted from the 1403 genomes. Table A4 (presented in Appendix A) shows the maximum frequency of the MONO motifs.

DI MOTIF description
A total of 12,853,740 continuous DI SSRs were extracted from the 1403 genomes. Table A5 (presented in Appendix A) shows the maximum frequency of the DI motifs.

TRI MOTIF description
A total of 14,469,215 continuous TRI SSRs were extracted from the 1403 genomes. Table A6 (presented in Appendix A) shows the maximum frequency of the TRI motifs.

SSR extraction
The availability of next-generation sequencing techniques has made genome sequences accessible, including those of viruses, fungi and bacteria. Studying hypermutable SSR repeats [1][2][3][4][5][6] in virus genomes with a bioinformatics approach is interesting and informative, because SSR mining not only helps in understanding and addressing biological questions but also helps in making the best use of these repeats in diverse applications. A few earlier studies have attempted to analyze the distribution of SSR repeats in virus genomes, but they are confined to a single genome or a small set of genomes. So far, there are no comprehensive reports in the literature that show the distribution of microsatellite repeats in all sequenced virus genomes. In the remaining part of this study, we analyze SSR repeats in 1403 virus genomes and present a brief note on the distribution and frequency of these repeats.
This approach scans the input virus genome sequence file and the pattern files for MONO, DI and TRI patterns to find all occurrences of these patterns within the genome file using next generation retrieval mechanisms [7][8][9]. If a pattern repeats, the successive logic is applied, where successive logic means the continuous occurrence of the same pattern. If the successive pattern size is greater than 1, the successive occurrence information is stored in the database. The process is shown in Fig. 6. The database is constructed in MySQL using Java.
The SSR NGS retrieval algorithm describes the Next Generation Sequencing (NGS) retrieval algorithm in detail. It consists of five segments: I/O, Main, search, tandem repeat checking and database insertion. In the input segment, the virus and pattern files are taken as input. In the output segment, the mechanism provides the number of occurrences and the positions of the MONO, DI and TRI patterns. In the Main segment, the lengths of the file and the pattern are read and, for each pattern, the ngs_search, check_for_tandem_repeat and ngs_database_insertion segments are called over the entire length of the input file. In the search segment, the pattern is searched for in the input file; each match increments the occurrence count. In the tandem repeat checking segment, the difference between successive occurrence positions is measured; if it equals the length of the pattern, the occurrences are counted as one tandem repeat. In the database insertion segment, the virus name, genome id, pattern, count and position are stored in the database.
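The search and tandem repeat checking segments can be sketched in Java as follows. This is a minimal illustration of the position-difference idea, not the authors' NGS retrieval algorithm: it uses a plain String.indexOf scan, omits the database insertion segment, and reads the storage condition as "more than one successive copy" (minCopies = 2). The names SsrScanSketch, Repeat, search and tandemRepeats are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class SsrScanSketch {
    /** One tandem repeat: 0-based start position and number of consecutive copies. */
    static final class Repeat {
        final String pattern;
        final int start;
        final int copies;

        Repeat(String pattern, int start, int copies) {
            this.pattern = pattern;
            this.start = start;
            this.copies = copies;
        }
    }

    /** Search segment: every 0-based position where the pattern occurs (overlaps included). */
    static List<Integer> search(String genome, String pattern) {
        if (pattern.isEmpty()) {
            throw new IllegalArgumentException("pattern must not be empty");
        }
        List<Integer> positions = new ArrayList<>();
        int from = genome.indexOf(pattern);
        while (from >= 0) {
            positions.add(from);
            from = genome.indexOf(pattern, from + 1);
        }
        return positions;
    }

    /** Tandem repeat check: chains occurrences whose positions differ by the pattern length. */
    static List<Repeat> tandemRepeats(String pattern, List<Integer> positions, int minCopies) {
        TreeSet<Integer> occurrences = new TreeSet<>(positions);
        int length = pattern.length();
        List<Repeat> repeats = new ArrayList<>();
        for (int start : occurrences) {
            if (occurrences.contains(start - length)) {
                continue; // this occurrence belongs to a chain that started earlier
            }
            int copies = 1;
            while (occurrences.contains(start + copies * length)) {
                copies++;
            }
            if (copies >= minCopies) {
                repeats.add(new Repeat(pattern, start, copies));
            }
        }
        return repeats;
    }

    public static void main(String[] args) {
        String genome = "GGATATATCCAAAAT"; // toy sequence, not taken from the data set
        for (String pattern : new String[] {"A", "AT"}) {
            for (Repeat r : tandemRepeats(pattern, search(genome, pattern), 2)) {
                System.out.println(r.pattern + " start=" + r.start + " copies=" + r.copies);
            }
        }
    }
}
```

Looking up start + copies * length in a sorted set, rather than comparing only adjacent positions, keeps the check correct when a pattern overlaps itself in the sequence. In a full pipeline, each Repeat would be passed on to the database insertion segment together with the virus name and genome id.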