ON THE HIGH PERFORMANCE COMPUTING FOR MOTIF DISCOVERY IN DNA SEQUENCES.


In bioinformatics, one of the most important research problems is motif discovery in DNA sequences. An algorithm that is both accurate and fast has always been the goal of research on this problem. Therefore, the idea of this study is to modify the random projection algorithm so that it can be implemented with a high performance computing technique (i.e., the R package pbdMPI). The main focus of this study is the steps needed to achieve this objective: preprocessing the data, splitting the data according to the number of batches, modifying and implementing random projection with the pbdMPI package, and then aggregating the results. To validate this approach, experiments were conducted on several benchmark data sets, with a sensitivity analysis on the number of cores and batches. Experimental results show that the computational cost can be reduced. Thus, the proposed approach can be used for motif discovery effectively and efficiently.
Problems in motif discovery can be categorized into three types, namely Simple Motif Search (SMS), Edit-distance-based Motif Search (EMS), and Planted Motif Search (PMS) [3]. The purpose of SMS is to find all the motifs of lengths from 1 up to a specified length in all sequences [4], while the purpose of EMS is to find all the motifs that occur in a desired number of sequences [5]. PMS aims to find the motifs that appear in every sequence [6].
In PMS, there are two important input parameters: the desired length of the motif, denoted by l, and the number of mismatches, denoted by d [7]. For example, consider three DNA sequences: S1 = ATTGCTGA, S2 = GCATTGAA, and S3 = CATGCTTG. With l = 4 and d = 1, we obtain the following repeated motifs: ATTG and TTGC. PMS is an NP-hard problem, so if an algorithm searches for all possible motifs that appear in all sequences, the time spent will be exponential [1].
[...] the random position determined by k (k-mers) values [1,9]. Random projection (RP) reflects that mutations can occur anywhere, so the projection is done at random. Even though many algorithms have been introduced, PMS is NP-hard, so implementing the algorithms with parallel computing is necessary. Therefore, this research aims to design and implement RP for PMS with parallel computing in the R programming language. The R programming language [10] is chosen since it has become the de facto standard for statistics, data analysis, and visualization. Nowadays, many algorithms, collected in software libraries/packages, have been implemented and made available on the Comprehensive R Archive Network (CRAN) at https://cran.r-project.org/. In this repository, one of the R packages for high performance computing and big data analysis is pbdMPI [11], which is used in this research.
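The (l, d) example and the projection idea above can be sketched in a few lines. The sketch is written in Python purely for illustration; it is not the R/pbdMPI implementation used in this study. The function names, the restriction of candidates to the l-mers of the first sequence, the reading of k as the number of sampled positions, and the fixed random seed are all assumptions made for the example.

```python
import random
from collections import defaultdict


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(candidate, seq, d):
    """True if some window of seq is within d mismatches of candidate."""
    l = len(candidate)
    return any(
        hamming(candidate, seq[i:i + l]) <= d
        for i in range(len(seq) - l + 1)
    )


def planted_motifs(sequences, l, d):
    """Keep the l-mers of the first sequence (an assumption of this sketch)
    that occur in every sequence with at most d mismatches."""
    first = sequences[0]
    found = []
    for i in range(len(first) - l + 1):
        cand = first[i:i + l]
        if cand not in found and all(occurs_within(cand, s, d) for s in sequences):
            found.append(cand)
    return found


def projection_buckets(sequences, l, k, seed=0):
    """Group every l-mer by the letters at k randomly chosen positions."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    positions = sorted(rng.sample(range(l), k))
    buckets = defaultdict(list)
    for s_idx, seq in enumerate(sequences):
        for start in range(len(seq) - l + 1):
            lmer = seq[start:start + l]
            key = "".join(lmer[p] for p in positions)
            buckets[key].append((s_idx, start))  # (sequence index, offset)
    return positions, dict(buckets)


sequences = ["ATTGCTGA", "GCATTGAA", "CATGCTTG"]
motifs = planted_motifs(sequences, l=4, d=1)
positions, buckets = projection_buckets(sequences, l=4, k=2)
```

With the three example sequences, `motifs` contains ATTG and TTGC. In `buckets`, l-mers that share a key agree on the sampled positions even if they differ elsewhere, which is why a projection can bring mutated copies of a motif together.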
In the literature, we found some relevant articles discussing implementations of motif discovery in parallel computing. For example, in Clemente & Adorna's study [12], the random projection algorithm was developed for GPUs (Graphics Processing Units). Each processor is directed into threads that work within the device (GPU), while the sequential process is executed on the host (CPU). TEIRESIAS was introduced to improve the speed of finding maximal patterns [13]. An enhancement of the PMSPRUNE algorithm has been proposed with two additional features: neighbor generation on a demand basis and omission of the duplicate neighbor check [14]. Furthermore, there are different approaches for dealing with pattern matching in various fields. For instance, multiple pattern matching methods were introduced for large-scale multi-pattern matching [15]. The scanning mode of the Square Non-symmetry and Antipacking Model (SNAM) for binary images was improved by proposing a new neighbor-finding algorithm [16].
The rest of the paper is organized as follows. First, the global procedure of this research is presented in Section 2. Section 3 discusses the main contribution, which is a modification and implementation of parallel random projection using the pbdMPI package. To validate and analyze the proposed computational model, we conduct some experiments in Section 4 and analyze them in Section 5. Finally, we conclude the research in Section 6.

Research Method:-
Figure 1 shows the research design of this study. First, we perform some preparation, such as identifying the problems, setting the research objectives, and studying the literature. These activities have been presented in the previous section. Then, we present the main contribution of this research, which is designing and implementing parallel random projection with R high performance computing (i.e., the pbdMPI package). This part is explained in the next section. After that, we conduct some experiments and analyze their results. Finally, some conclusions are drawn.

Parallel Random Projection with the pbdMPI package:-
The computational model proposed in this research can be seen in Figure 2. First, after reading and converting the input data from the .fasta file, we perform a modification of random projection by utilizing R high performance computing (i.e., the pbdMPI package), called parallel random projection with pbdMPI. A detailed explanation of the proposed approach can be seen in Figure 3. The results of this model are all motifs, their starting indices, and the computational costs. From Figure 3, it can be seen that besides supplying some parameters related to the RP algorithm, we need to input the number of cores and batches. Since the R programming language needs to load data into random access memory (RAM), we need to define the number of batches so that each batch takes less than 20% of the total memory capacity.
Furthermore, Steps 1 to 3 and Steps 6 to 8 are the same as in the RP algorithm in standalone mode. However, Steps 4 and 5 are conducted with parallel computing by using pbdMPI commands. An important part of these steps is the rule that divides the sequence into a number of batches. The rule should preserve every possible motif in the sequence even after the sequence has been split into several batches. So, in this case we implement Equations (1) and (2), which give the starting and ending indices for cutting each batch in terms of the length of the sequence, the number of batches, and the length of the pattern. It should be noted that the starting-index rule applies from the second batch onward. For example, suppose the sequence is S = CAGTGACGTAATCA and the length of the pattern is 3. According to (1) and (2), we obtain the following batches: S1 = CAGT, S2 = GTGACG, and S3 = CGTAATCA. By following how the random projection algorithm generates k-mers, we obtain the following k-mers over all batches, which are the same as the k-mers of the whole sequence (without splitting into batches): CAG, AGT, GTG, TGA, GAC, ACG, CGT, GTA, TAA, AAT, ATC, and TCA. This means that even though the sequence has been split and processed by different cores, the results of RP and parallel random projection are the same.
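The batching rule can be illustrated with a short sketch. The index rule below is inferred from the worked example above (equal-width batches, an overlap of pattern length minus one characters, and a last batch that runs to the end of the sequence); it is an assumption, not a transcription of Equations (1) and (2). The sketch is in Python, and a thread pool stands in for the cores that pbdMPI would manage in R; all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_with_overlap(seq, n_batches, pattern_len):
    """Cut seq into n_batches pieces that overlap by pattern_len - 1 letters.

    Assumed rule (1-based indices): batch i ends at i * width, the last batch
    ends at len(seq), and every batch after the first starts pattern_len - 1
    letters before the previous batch ends.
    """
    n = len(seq)
    width = n // n_batches
    if n_batches > 1 and width < pattern_len - 1:
        raise ValueError("batches are too short for the requested pattern length")
    batches = []
    for i in range(1, n_batches + 1):
        start = 1 if i == 1 else (i - 1) * width - (pattern_len - 1) + 1
        end = n if i == n_batches else i * width
        batches.append(seq[start - 1:end])  # convert to 0-based slicing
    return batches


def kmers(text, k):
    """All length-k windows of text, in order."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def parallel_kmers(seq, n_batches, pattern_len, workers=3):
    """Enumerate k-mers per batch concurrently, then merge in batch order."""
    batches = split_with_overlap(seq, n_batches, pattern_len)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_batch = list(pool.map(lambda b: kmers(b, pattern_len), batches))
    return [km for batch in per_batch for km in batch]


S = "CAGTGACGTAATCA"
batches = split_with_overlap(S, n_batches=3, pattern_len=3)
merged = parallel_kmers(S, n_batches=3, pattern_len=3)
```

Here `batches` reproduces CAGT, GTGACG, and CGTAATCA, and `merged` equals `kmers(S, 3)`, so under the assumed rule no k-mer is lost or duplicated at a batch boundary.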

Experimental Study:-
Data Gathering:-
The data used in this study were obtained from the research in [17]. They can be downloaded from the University of Washington Computer Science & Engineering site at http://bio.cs.washington.edu/research/download. In total, there are 52 data sets of DNA sequences derived from four species: 6 from Drosophila melanogaster, 26 from human sequences, 12 from rat sequences, and 8 from Saccharomyces cerevisiae. Each data file contains between 1 and 35 sequences, and each sequence has a length ranging from 500 to 3000 base pairs.
In this case, we use only four data sets as input: dm01r.fasta and dm05r.fasta, which are DNA sequences of Drosophila melanogaster; hm01r.fasta, derived from the human sequence; and mus04r.fasta, which is the rat DNA sequence. The dm01r.fasta file contains 4 DNA sequences with a total length of 6000, while the dm05r.fasta file consists of 3 DNA sequences with a total length of 7500. The hm01r.fasta and mus04r.fasta files have total DNA sequence lengths of 36000 and 7000, respectively.
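The sequence counts and total lengths quoted above can be tabulated directly from FASTA text. The following Python sketch shows one minimal way to do so; the parser, its name, and the toy records are assumptions for illustration and do not come from the benchmark files.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    parts = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            parts[header] = []
        elif header is not None:
            parts[header].append(line.upper())
    return {name: "".join(chunks) for name, chunks in parts.items()}


# Toy input (hypothetical), with one sequence wrapped over two lines.
toy = ">seq1\nATTG\nCTGA\n>seq2\nGCATTGAA\n"
records = read_fasta(toy)
n_sequences = len(records)
total_length = sum(len(seq) for seq in records.values())
```

Applied to a benchmark file instead of the toy string, `n_sequences` and `total_length` would correspond to the per-file figures listed above.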

Experimental Design:-
In this study we conduct two simulations: standalone and parallel computing (i.e., multicore) modes. Each mode uses all the data sets mentioned previously: dm01r.fasta, dm05r.fasta, hm01r.fasta, and mus04r.fasta. Furthermore, in accordance with the algorithm, some parameters should be assigned, as follows: the length of motif and mismatches

Results and Analysis:-
Because of limited space, in this section we illustrate the results and their analysis for particular data sets only. For example, in standalone mode, a comparison of the number of motifs found according to m, θ, and (l, d) on the dm01r data set is shown in Figure 4. It can be seen that a higher number of mismatches yields a higher number of motifs. Furthermore, in standalone mode, we can compare the computational cost against the length of the DNA sequence for different (l, d) and θ, as shown in Figure 5. The longer the DNA sequence, the higher the computational cost. It should be noted that these lengths also represent the data sets used in the experiments; for example, dm01r has a length of 6000.
In parallel computing mode, Figure 6 shows the comparison between the computational cost and the number of cores when we used the dm01r data set with (l, d) = (6, 2), θ = 3, and b = 10. It can be seen that the proposed model is successful since, generally speaking, the computational cost can be reduced by adding cores. Specifically, the standalone mode takes four times longer than the run with 2 cores. Moreover, the standalone mode took more than ten times as long as parallel computing with 3 cores (i.e., 2.52 seconds). With 6 cores, the computation is around 34 times faster than in standalone mode. So, the proposed model is much faster than the standalone mode.
We also compared the computational time with the experimental results of the previous research [1], even though the data in the files dm01r and mus04r differ. The file dm01r contains 4 DNA sequences, each with a length of 1500, while in the research [1] the data set contains 5 DNA sequences.
In the file mus04r, 7 sequences are used in this experiment, each with a length of 1000, while only 6 sequences were used in the previous research. The comparison can be seen in Table 1, which reports the computational cost in seconds. All experiments conducted in this research are faster than those in the study [1].
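The ratios reported for dm01r can be restated as speedup and parallel efficiency (speedup divided by the number of cores). The Python sketch below uses only the ratios stated in the text; entering "more than ten times" as its lower bound of 10 and reading 2.52 seconds as the 3-core time are assumptions of this sketch.

```python
def parallel_efficiency(speedup, cores):
    """Speedup divided by the number of cores used."""
    return speedup / cores


# Speedup over standalone mode as stated in the text, keyed by core count.
# The 3-core entry is a lower bound ("more than ten times"), an assumption here.
reported_speedup = {2: 4.0, 3: 10.0, 6: 34.0}

efficiency = {
    cores: parallel_efficiency(s, cores)
    for cores, s in reported_speedup.items()
}

# Lower bound on the standalone time implied by 2.52 s on 3 cores.
standalone_lower_bound = reported_speedup[3] * 2.52
```

An efficiency above 1 would mean that the gain exceeds what the core count alone explains, so these ratios are best read together with the batch setting (b = 10) used in the parallel runs.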

Conclusion:-
The main contributions of this research are as follows: (1) to propose a computational model that modifies the random projection algorithm, called parallel random projection, for planted motif search by utilizing R high performance computing (i.e., the pbdMPI package); and (2) to implement the proposed model and then validate it for finding motifs in DNA sequences. According to the experiments, the proposed model is able to reduce the computational cost significantly. Moreover, a comparison with the previous study shows that the proposal produces better results in terms of computational cost.
In the future, we plan to improve the model by using a Big Data platform, such as the MapReduce programming model on Apache Hadoop [18] and Resilient Distributed Datasets on Apache Spark [19]. Moreover, different tools for parallel computing, e.g., the foreach package [20], can be used, as in the study in [21]. Other bioinformatics-related tasks can also be used to test the proposed model, such as prediction of cancer [22], kidney disease [23], and sleep disorders [24].