A WORK DISTRIBUTION STRATEGY FOR GLOBAL SEQUENCE ALIGNMENT

Sequence alignment identifies the similarities and dissimilarities between two given sequences. In this paper, we propose a work distribution strategy for implementing DNA global sequence alignment. The main objective of this work is to minimize the execution time required for the global alignment of large biological sequences. The proposed approach addresses both memory optimization and the minimization of execution time. We consider biological sequences of different sizes that fit into the global memory of the system. The proposed strategy is implemented on a shared memory architecture using OpenMP. Parallelization with OpenMP directives is relatively easy and yields fast code. We ran our experiments on a Dell Precision Tower 7910 with an Intel Xeon processor, 32GB RAM and 28 CPU cores. Efficient use of global memory and cache memory optimization dominate the execution time results. The results demonstrate a significantly higher speedup using OpenMP compared with other implementations.


INTRODUCTION
One of the principal applications of the edit distance algorithm in bioinformatics is to determine the similarity of macromolecules such as DNA sequences composed of the letters A, T, C and G. Replicas of DNA are imperfect: they are changed by mutations at random places [1]. These mutations can grow exponentially with large sequences. The mutations can cause transformations to both sequences. These transformations are: 1) insertion of character x before position m, denoted by mx; 2) deletion of character x at position m, denoted by x i_; and 3) substitution of character x by character y at position m, denoted by x my.
The transformations due to mutations incur weights defined by a predefined weight function W. The assumptions made for the weight function [1,2] are shown in Table 1.
The series of transformations forms a metric for both Seq#1 and Seq#2, which is referred to as the Score Matrix (SM). Thus, the optimum alignment is the alignment with the minimum-weight series of transformations. One optimal alignment is shown in Example 1.
- A C G - C
G A C T A C
A pair of characters in a position is called an aligned pair. The weight of a series of transformations is the sum of the weights of the aligned pairs. Global alignment between two large DNA sequences is the problem of finding the optimum alignment under the given scoring scheme. BLAST [3] is used for sequence comparison; it compares sequences and reports statistical information. FASTA [4] provides sequence similarity searching against protein databases. An adaptive grid implementation of DNA sequence alignment was proposed by C. Chen et al. [5]; the authors described a dynamic programming algorithm to compute k non-intersecting near-optimal alignments in linear space. To reduce the runtime significantly, they used a hierarchical grid system as the computing platform and applied static as well as dynamic load balancing techniques. In [6], work on the graphical representation and alignment of DNA sequences is presented. The graphical alignment approach outlined there, which is neither conceptually nor computationally involved, is designed to quickly find the two best global alignments. An optimization approach and its application to comparing DNA sequences were proposed in [7]. It compares strings using linear programming analysis methods based on the LZ algorithm and the phylogenetic tree obtained by ClustalW from 48 HEV sequences. A parallel architecture for inexact DNA sequence matching with the Burrows-Wheeler Transform (BWT) was proposed in [8]; it is a novel hardware architecture that parallelizes the BWT-based inexact matching algorithm and is implemented on FPGAs.
F. Saeed et al. [9] proposed a high-performance multiple sequence alignment system for pyrosequencing reads from multiple reference genomes. This work was based on domain decomposition to align a large number of reads from single or multiple reference genomes. The alignment algorithm accurately aligns the erroneous reads and has been implemented on a cluster of workstations using the MPI library. A tiling-based sequence alignment is proposed in [10]. A combination of the OpenMP and MPI paradigms is used for load balancing on a parallel architecture. The metrics targeted for DNA sequence alignment were time, speedup and efficiency.
An optimized technique for DNA sequence data compression using OpenMP and MPI was illustrated in [11]. Such compression is vital for massive data storage and transmission. A parallel algorithm for DNA sequencing on a heterogeneous platform using a supervised machine learning approach is described in [12]. A FED-based parallel algorithm using the Message Passing Interface (MPI) was proposed by Q. Xue et al. [13]. The results reported the matches found in the given sequence and an improved speedup using MPI. A performance evaluation of DNA sequencing using OpenMP and CUDA is presented in [14]. A parallel approach for solving the k-differences prime problem is presented there, with speedups of up to 5.6 and 72.8 achieved on OpenMP and GPU, respectively.

NEEDLEMAN-WUNSCH ALGORITHM
The Needleman-Wunsch (N-W) algorithm is a dynamic programming method that evaluates the alignment recursion efficiently. The algorithm takes two input sequences seq#1 and seq#2 and builds a score matrix SM, where SM[n,m] represents the score of the optimal alignment of seq#1[1..n] and seq#2[1..m], n is the length of seq#1 and m is the length of seq#2. The recursion, equation (1), follows [15,16]. Initially, the algorithm sets SM[0,0] = 0 and then proceeds to fill the matrix from the top left corner to the bottom right corner by applying equation (1) for each i and j. The Sscheme is a predefined matrix that compares the characters at individual positions and assigns weights. In this paper, we use the Sscheme shown in Table 2. Various scoring schemes, such as BLOSUM and PAM, are presented in the literature; for more detail see [1]. The N-W algorithm also creates a Direction Matrix (DM), which stores the direction of movement (pointer) used to find the optimal alignment. The purpose of the DM is to backtrack through the cells that yielded the maximum value in the recursion. Backtracking is the reverse of the score calculation: it starts from the bottom right corner and proceeds towards the top left corner [15,17].
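The score matrix fill described above can be sketched in Python as follows. This is a minimal illustration, not the implementation evaluated in this paper: it assumes the widely used form of the recursion, SM[i,j] = max(SM[i-1,j-1] + s(a_i, b_j), SM[i-1,j] + gap, SM[i,j-1] + gap), a linear gap penalty on the first row and column, and the match/mismatch/gap values +2/-1/-1 stated later in the paper. The function name `nw_score_matrix` is hypothetical.

```python
def nw_score_matrix(seq1, seq2, match=2, mismatch=-1, gap=-1):
    """Fill a Needleman-Wunsch score matrix; rows follow seq1, columns follow seq2."""
    n, m = len(seq1), len(seq2)
    sm = [[0] * (m + 1) for _ in range(n + 1)]

    # Assumed boundary condition: a prefix aligned against nothing
    # costs one gap penalty per character.
    for i in range(1, n + 1):
        sm[i][0] = i * gap
    for j in range(1, m + 1):
        sm[0][j] = j * gap

    # Fill from the top left corner to the bottom right corner.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            sm[i][j] = max(
                sm[i - 1][j - 1] + pair,  # align the two characters
                sm[i - 1][j] + gap,       # gap in seq2
                sm[i][j - 1] + gap,       # gap in seq1
            )
    return sm
```

The optimal global alignment score is then the bottom right entry, `nw_score_matrix(a, b)[-1][-1]`. Each cell depends only on its upper, left and upper-left neighbours, which is the dependency pattern that the tiling strategies in the next section exploit.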

PARALLELIZATION OF NEEDLEMAN-WUNSCH ALGORITHM
Various parallel techniques have been proposed in the literature for implementing the N-W algorithm used in DNA sequence alignment. Thread-level parallelization using pthreads, proposed in [18, 19], achieves high performance. However, synchronization among threads and the overhead of switching control from one thread to another become the bottleneck for these methods. Fig. 1 and Fig. 2 illustrate the typical tiling implementation techniques used for the various algorithms after dependency analysis. Tiling is an optimization that can yield large performance gains when the tile size is chosen to fit into the cache memory. Tiling is a compiler-based optimization that divides the original task into sub-tasks, whose computation is assigned to different threads to be performed in parallel. In Fig. 1 and Fig. 2, the blocks from B00 to Bn-1n-1 are the sub-tasks of the original task, which are independent of each other according to their order of execution. In Fig. 1, the blocks are processed diagonally, as the blocks on a diagonal are independent of the other blocks in the computation. In Fig. 2, the blocks are executed in parallel in the vertical direction to complete the computation. Tiling makes the algorithm efficient for large sequences. However, the number of threads required is large [20], each thread performs a large amount of computation, and managing a large number of threads increases overheads such as the cost of computation.
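The diagonal processing order of the tiled scheme can be illustrated with a short Python sketch. It assumes that a block at block-row r and block-column c depends only on its upper, left and upper-left neighbours, so that all blocks with the same r + c can run concurrently. The function name `wavefront_order` and the (r, c) block coordinates are illustrative choices, not notation from the cited works.

```python
def wavefront_order(num_block_rows, num_block_cols):
    """Group block coordinates (r, c) into waves by anti-diagonal index r + c.

    Blocks within one wave have no dependencies on each other, so a wave
    can be handed to parallel threads once the previous wave is complete.
    """
    waves = []
    for d in range(num_block_rows + num_block_cols - 1):
        wave = [
            (r, d - r)
            for r in range(num_block_rows)
            if 0 <= d - r < num_block_cols
        ]
        waves.append(wave)
    return waves
```

For a 3 × 3 block matrix this yields five waves; the first contains only block (0, 0) and the widest contains three blocks, which shows why the available parallelism is limited at the start and end of the computation.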
The best alignment of two sequences of lengths n and m is determined in three steps: 1) initialization, 2) scoring and 3) traceback. In our approach, the scoring criteria are match = +2, mismatch = -1 and gap = -1. The initialization of the n × m matrix is shown in Table 3. The traceback, which yields the global alignment, is demonstrated in Table 4. The traceback starts from the bottom right cell of the matrix. The next cell is identified by moving diagonally, left or up according to the computed values.
The alignment of sequences n and m is illustrated in Fig. 3.
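The three steps (initialization, scoring and traceback) can be sketched together in Python. This is an illustrative sketch under stated assumptions: the first row and column are initialized with cumulative gap penalties, ties are broken in favour of the diagonal move, and the names `nw_align`, `sm` and `dm` are hypothetical. It uses the scoring criteria match = +2, mismatch = -1 and gap = -1 given above.

```python
def nw_align(seq1, seq2, match=2, mismatch=-1, gap=-1):
    """Return (score, aligned_seq1, aligned_seq2) for a global alignment."""
    n, m = len(seq1), len(seq2)
    sm = [[0] * (m + 1) for _ in range(n + 1)]   # score matrix
    dm = [[""] * (m + 1) for _ in range(n + 1)]  # direction matrix

    # Step 1: initialization (assumed cumulative gap penalties).
    for i in range(1, n + 1):
        sm[i][0], dm[i][0] = i * gap, "up"
    for j in range(1, m + 1):
        sm[0][j], dm[0][j] = j * gap, "left"

    # Step 2: scoring, recording which move produced each maximum.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            candidates = [
                (sm[i - 1][j - 1] + pair, "diag"),
                (sm[i - 1][j] + gap, "up"),
                (sm[i][j - 1] + gap, "left"),
            ]
            # max() keeps the first best candidate, so ties prefer "diag".
            sm[i][j], dm[i][j] = max(candidates, key=lambda c: c[0])

    # Step 3: traceback from the bottom right cell to the top left cell.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = dm[i][j]
        if move == "diag":
            top.append(seq1[i - 1])
            bottom.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif move == "up":
            top.append(seq1[i - 1])
            bottom.append("-")
            i -= 1
        else:  # "left"
            top.append("-")
            bottom.append(seq2[j - 1])
            j -= 1
    return sm[n][m], "".join(reversed(top)), "".join(reversed(bottom))
```

A call such as `nw_align("ACGC", "GACTAC")` returns the optimal score together with two gapped strings of equal length. Several alignments can share the optimal score, so the strings returned depend on the tie-breaking rule.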

WORK DISTRIBUTION STRATEGY
In this section, we introduce the nomenclature, the scheduling algorithm for global alignment and the OpenMP implementation.
Table 5 lists the abbreviations used in this paper and their meanings. The parallel algorithm for finding the global alignment is given in Algorithm 1.
7. (continued) Horizontal and Vertical direction, with threadID 1, 2, and 3 respectively.
8. Find BlockID for the blocks down and left of the diagonal block.
9. Compute the blocks successively to the block down to the diagonal block. Also, compute the blocks successively to the block left of the diagonal block. Two independent threads are assigned to complete these block computations. This process repeats until all the diagonal blocks are consumed.
Fig. 4 shows typical block matrix addresses for the computation of the score matrix. In this block matrix, every block is about the same size. The score computation for every element takes place within its block. The starting and ending elements are computed separately.
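The mapping from a block to the range of score matrix cells it covers can be sketched as follows. The sketch assumes that blocks are addressed by a (block row, block column) pair rather than by the BlockID encoding of Fig. 4, that cell indices are 1-based with row 0 and column 0 reserved for initialization, and that the last block in each direction may be smaller than BS. The names `num_blocks` and `block_bounds` are hypothetical.

```python
def num_blocks(length, bs):
    """Number of blocks of size bs needed to cover `length` cells."""
    return (length + bs - 1) // bs


def block_bounds(block_row, block_col, bs, n, m):
    """Return (start_row, end_row, start_col, end_col), 1-based and inclusive.

    n and m are the sequence lengths; the trailing block in each
    direction is clipped so that it does not run past the matrix.
    """
    start_row = block_row * bs + 1
    end_row = min(start_row + bs - 1, n)
    start_col = block_col * bs + 1
    end_col = min(start_col + bs - 1, m)
    return start_row, end_row, start_col, end_col
```

For example, with n = 10, m = 10 and a block size of 4, `block_bounds(2, 1, 4, 10, 10)` gives rows 9 to 10 and columns 5 to 8, so the blocks on the lower edge are only "about the same size" as the others.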
Fig. 5 demonstrates the execution of blocks by the different threads. Initially, block0 is executed by thread0; then block1 and block10 are executed by thread1 and thread2. This process continues until the computation of the left diagonal blocks has finished, after which the threads begin to disband. The SM computation starts with the diagonal blocks, from B00 to Bn-1n-1. This computation is done by a single thread. Once a diagonal block is finished, two new threads become active to compute the blocks down and left of the diagonal block. This process continues until the last diagonal block has been computed.

OPENMP IMPLEMENTATION
Listing 1 gives a high-level description of the OpenMP [20] code with the load distribution strategy for computing the score matrix using the N-W algorithm.
Listing 1: High level description of N-W algorithm.
//section 1 inside section 3 for every q and w 17.
//section 1 inside section 3 for every q and w 24.
Compute Scores for every block with blockid(e, f); 28.} 29.} 30.} 31.} 32.} 33.}
Line No. 3 and Line No. 7 in Listing 1 can be executed in parallel. Line No. 3 computes the blocks whose addresses correspond to the first column in Fig. 3, and Line No. 7 computes the blocks whose addresses are indicated in the first row in Table 3. These parallel computations begin after block '0' has been computed. Once block '1' and block '10' have finished, Line No. 11 becomes active and the computation is distributed to 3 threads. Nested parallelism is used as follows: when a block finishes its computation, two new sections are initiated to compute in parallel, as indicated in Line No. 17. For example, once block '11' has been processed completely, as shown in Fig. 3, 4 threads become active to compute the blocks '20', '21', '2' and '12' in parallel. The creation of sections and the assignment of blocks to them continues for every diagonal block execution, as indicated by Line No. 17. With this approach, more threads are available for the computation at successive stages of the block computation. Hence, this increases the overall performance of the parallel N-W computation.

RESULTS AND DISCUSSION
Performance is measured only for the computation of the score matrix. The time required by the sequential algorithm is compared with the time required by the parallel algorithm. We compared genomes of equal length, ranging from 10000 to 100000 residues, on a Dell Precision Tower 7910 with an Intel Xeon processor, 32GB RAM and 28 CPU cores. The OMP_NESTED environment variable is set to true. The performance evaluation covers the score matrix computation time for different DNA lengths.
We first evaluated the time required to compute the score matrix sequentially and in parallel. We considered different thread counts for the performance measurement: 8, 12 and 28 threads. Fig. 7 to Fig. 9 compare the time required by the sequential and parallel implementations of the proposed approach for the different thread counts and DNA sequence lengths. The x-axis represents the sequence length in characters, and the y-axis represents the computation time in seconds. The speedups achieved for the different DNA lengths are reported in this section. The speedup is computed using the formula Speedup = time_sequential / time_parallel. We evaluated the speedup for all the above computations. The speedup ranges from 1.5x to 4x on 8 cores, from 1.2x to 5.5x on 12 cores, and from 2x to 13x on 28 cores. The maximum speedup of 13x is achieved for a string length of 100000 on the 28-core system. Higher speedups may be achievable for longer sequences on systems with more cores.
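The speedup formula can be written as a small Python helper. The timing values in the usage note are hypothetical and chosen only to make the arithmetic easy to follow; they are not measurements from this work. The `efficiency` helper is an addition for illustration that uses the common definition of speedup divided by the number of threads.

```python
def speedup(time_sequential, time_parallel):
    """Speedup = time_sequential / time_parallel."""
    return time_sequential / time_parallel


def efficiency(time_sequential, time_parallel, num_threads):
    """Parallel efficiency: speedup divided by the number of threads used."""
    return speedup(time_sequential, time_parallel) / num_threads
```

With hypothetical timings of 26.0 s for the sequential run and 2.0 s for the parallel run, `speedup(26.0, 2.0)` evaluates to 13.0, and `efficiency(26.0, 2.0, 28)` is 13/28, a little under one half.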
We compared our work distribution approach with [21-24]. The performance analysis shows that the speedup achieved is 2.63x by [21], 3x by [22], 7x by [23], 2x by [24] and 13x by our proposed approach. As highlighted in Table 6, the strategy proposed in this paper outperforms [21-24] by achieving the highest speedup. In this work, we mainly focused on memory optimization and the minimization of execution time.

CONCLUSION
In this paper, we have presented a work distribution strategy using OpenMP to speed up global sequence alignment for DNA. We used the OpenMP nested parallelism strategy for our algorithm implementation. The nested parallelism supported by OpenMP increases the efficiency of the system and achieves a high speedup. The speedup is computed for DNA sequences of different sizes, ranging from 10000 to 100000. The performance evaluation shows that our algorithm achieves a high speedup over the sequential algorithm. The speedup tends to increase for larger sequences. The backtracking process is not considered in this work; it can be addressed in future work, and the approach is also well suited to MPI and CUDA implementations.

Algorithm 1: Scheduling for Global Alignment
1. Find the length of the strings.
2. Define the BS.
3. Divide the strings to form a block matrix.
4. Create a block matrix with entries such as 1, 2, 3, ….
5. Find StartRowID, EndRowID, StartColID, EndColID from BlockID.
6. Compute the scores for the individual blocks assigned to each thread.
7. Assign threads to the blocks in Columnar,

Figure 6 - Proposed Load Distribution Strategy