BLAST-i2b2: BLAST for Biological Sequence Comparison in the i2b2 Platform

Finding similarities between biological sequences is an important step in analyzing them. These similarities can be determined with sequence alignment methods. BLAST (Basic Local Alignment Search Tool) is a search tool designed to perform sequence alignment. The current BLAST platform compares a biological sequence with the sequences in a biological sequence database, lists the similar results


Introduction
In the field of bioinformatics, comparing biological sequences is the most widely used strategy to determine the function of newly sequenced genes, identify new members of a specific gene family, and find evolutionary relationships between sequences. Sequence alignment is one of the main techniques for finding similarities between biological sequences. It aligns a given biological sequence with a set of biological sequences stored in a given database. BLAST (Basic Local Alignment Search Tool) [1] is a search tool that implements the sequence alignment method and is available at the NCBI (National Center for Biotechnology Information) website (https://blast.ncbi.nlm.nih.gov/Blast.cgi). Biological sequence databases contain a huge number of DNA, RNA, or protein sequences and are periodically updated. An update may add, delete, or modify sequences, and updates are usually minor. The processing time of a BLAST job increases with the size of the target database. The current BLAST algorithm compares the query sequence with each sequence in the target database and only returns the similar sequences, without storing any results centrally. If the database is updated, the user must resubmit the query, and BLAST runs again against the whole updated target database. This shows the need for a data warehouse that stores each query submitted by the user together with its results. Then, if the target database is updated, the user can request an updated result for a previously submitted query and even keep previous versions if needed. In this case, BLAST can run against only the updated sequences, which considerably reduces the run time.
In addition, storing the BLAST results enables further analysis and tracking of similarity changes over time as databases are updated. As the target databases keep changing, the data warehouse should also be capable of handling these updates along with the time required for searching the updated database.
The i2b2 (Informatics for Integrating Biology & the Bedside) is an informatics platform that allows the integration of clinical and genomics data in a single repository. It provides researchers with software tools for collecting and managing biomedical research data from a variety of resources. The i2b2 platform's design is scalable and can be extended by developing new components or "plug-ins" to suit the needs of the researchers. It is also packaged with built-in tools for data query, analysis, and visualization [2]. This paper presents the BLAST-i2b2 platform, an implementation of BLAST functionality on i2b2 that enhances the storage and update of BLAST results using the capabilities of the i2b2 data warehouse. One of the important features of our proposed approach is that we save the BLAST results in a well-designed i2b2 warehouse. In BLAST-i2b2, BLAST compares the query sequence with each sequence in the target database to find the similar sequences, shows the BLAST results, and allows the user to save the query parameters and BLAST results in the data warehouse. In addition, on user request, it can update the saved results in the data warehouse whenever the target biological database is updated.
In this scenario, the user can view the BLAST history and select one of the submitted queries to check whether the target database has been updated. If the target database has been modified, BLAST compares the stored query sequence with only the modified sequences in the database instead of comparing it with all sequences in that database, which makes BLAST much faster. Since the results are stored in the data warehouse, researchers can later reuse them for further analysis and research using the built-in data exploration and visualization capabilities of the i2b2 platform. The next section provides an overview of basic sequence alignment techniques, the BLAST algorithm, the NCBI database, and the i2b2 platform. Section 3 presents the proposed BLAST-i2b2 along with the save and update algorithms. It also describes the data warehouse design to illustrate the data storage mechanism. Section 4 describes the experimental setup of BLAST-i2b2, reports the performance results, and evaluates the proposed approach. Finally, the conclusion section provides a summary and future work.

Sequence alignment
The similarities between biological sequences can be found by using the sequence alignment method [3]. An alignment is done by pairing letters from two sequences: one is the query sequence given by the user, and the other is the database sequence that resides in the biological sequence database. The result is a similarity sequence, as illustrated in Figure 1. Each letter in the similarity sequence (or alignment) is either a match or a gap. If a letter in the query sequence is identical to the corresponding letter in the database sequence, a match is inserted in the similarity sequence; if they are not identical, a gap is inserted. To quantify the similarity between two sequences, a similarity score is calculated using a scoring matrix. This matrix is usually called a substitution matrix, and it assigns predetermined scores to both matched and unmatched letter pairs. The final similarity score of the two aligned segments is the sum of the scores of all letter pairs. PAM [4] and BLOSUM [5] are the most commonly used scoring matrices [6]. Different similarity searching algorithms have been proposed to find similarities between sequences, such as the Needleman-Wunsch algorithm [7], the Smith-Waterman algorithm [8], and the BLAST algorithm [1]. In this paper, we focus on the BLAST algorithm.
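The pairing and scoring described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the two sequences are taken from Figure 1, and the +1/-1 toy matrix stands in for a real substitution matrix such as PAM or BLOSUM.

```python
def similarity_sequence(query, subject):
    """Pair letters position by position: keep matches, write '-' otherwise."""
    if len(query) != len(subject):
        raise ValueError("this sketch only pairs sequences of equal length")
    return "".join(q if q == s else "-" for q, s in zip(query, subject))


def toy_matrix(alphabet="ACGT", match=1, mismatch=-1):
    """Toy substitution matrix (an assumption, not PAM or BLOSUM)."""
    return {(a, b): (match if a == b else mismatch)
            for a in alphabet for b in alphabet}


def similarity_score(query, subject, matrix):
    """Sum the substitution scores of all paired letters."""
    return sum(matrix[(q, s)] for q, s in zip(query, subject))


query = "ATGGCGAAATCA"      # query sequence from Figure 1
subject = "ATGGCGTACCCA"    # database sequence from Figure 1
print(similarity_sequence(query, subject))
print(similarity_score(query, subject, toy_matrix()))
```

With nine matching and three non-matching positions, the toy matrix gives a score of 9 - 3 = 6; a real substitution matrix would weight each letter pair differently.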

BLAST
The BLAST algorithm was developed by Altschul, Gish, Miller, Myers, and Lipman in 1990 [1]. It was designed to compare a given DNA, RNA, or protein query sequence against a list of database sequences and identify the database sequences whose similarity exceeds a certain value.
BLAST uses PAM and BLOSUM matrices to calculate the similarity scores [6].
BLAST algorithm: The BLAST algorithm [6] consists of four main processes: pre-processing, seeding, extension, and evaluation, as illustrated in Figure 2. The main goal of BLAST is to find all HSPs (High-scoring Segment Pairs) that have a score equal to or greater than a given value S [6]. The inputs to the BLAST algorithm are: the query sequence (the sequence that the user wants to compare), the word size W (the length of the subsequences that initiate the alignment), the threshold T (the minimum similarity score a neighboring word must reach to be kept in the neighbors list), and the cut-off S (the minimum score an alignment must reach to be reported as a high-scoring segment pair).
Pre-process step: Using the query sequence, BLAST generates a word list and a neighbors list. To generate the word list, BLAST breaks the query sequence into words of size W. An example is shown in Figure 3a, where W is set to 3; it shows the query sequence and the generated word list. For each word in the word list, a list of W-length neighbors whose similarity to the word is equal to or greater than T is generated. Figure 3a shows the neighbors of the word HEA that have a similarity equal to or greater than T=13 using the BLOSUM62 matrix.
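As a rough illustration of the pre-process step, the sketch below builds the word list and the neighbors list for a DNA query. It is not the BLAST implementation: the function names are hypothetical, and the +5/-4 match/mismatch scores are an assumed toy scoring used here instead of BLOSUM62, which is what the protein example in Figure 3a relies on.

```python
from itertools import product


def word_list(query, w):
    """Break the query into all overlapping words of length w."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def word_score(a, b, match=5, mismatch=-4):
    """Toy similarity score between two equal-length words (assumed values)."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def neighbors(word, t, alphabet="ACGT"):
    """All words of the same length whose score against `word` is at least t."""
    candidates = ("".join(letters) for letters in product(alphabet, repeat=len(word)))
    return [c for c in candidates if word_score(word, c) >= t]


words = word_list("ATGGC", 3)          # ['ATG', 'TGG', 'GGC']
near = neighbors("ATG", t=6)           # the word itself plus single-letter variants
print(words, len(near))
```

With these toy scores, a word with one substituted letter scores 5 + 5 - 4 = 6, so a threshold of T=6 keeps the word itself and its nine single-letter variants.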
Seeding step: For each word in the word list, the algorithm scans for matches (seeds) between the word's neighbors and the target database sequences. Figure 3b shows the seeds of HEA's neighbors.
Extension step: For each match (hit) found in the database, the query sequence is aligned to the database sequence at the seed position. Then, the alignment is extended in both directions until it reaches the highest score. Figure 3c shows the extension of the alignment between the query sequence and seq1 in the target database.
Evaluation step: The obtained alignments are evaluated to determine whether they are High-scoring Segment Pairs (HSPs). To evaluate these alignments, the algorithm compares the alignments' scores (significance) with a given cut-off value S. The highest segment score that is equal to or greater than S is selected. Figure 3d shows the evaluation of HEA's seeds with a significance score greater than or equal to S=40.
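The seeding, extension, and evaluation steps can be sketched as follows. This is a simplified illustration, not the BLAST code: seeds are exact word matches only (no neighbors list), the extension is ungapped, the +5/-4 scores are assumed toy values, and the `x_drop` stopping rule (stop once the running score falls too far below the best score seen) is an assumption about how "until it reaches the highest score" might be realized. All function names are hypothetical.

```python
def pair_score(a, b, match=5, mismatch=-4):
    """Toy substitution score (assumption, not PAM/BLOSUM)."""
    return match if a == b else mismatch


def find_seeds(query, subject, w):
    """Seeding: (query_pos, subject_pos) for every exact word match of length w."""
    seeds = []
    for qi in range(len(query) - w + 1):
        word = query[qi:qi + w]
        si = subject.find(word)
        while si != -1:
            seeds.append((qi, si))
            si = subject.find(word, si + 1)
    return seeds


def extend(query, subject, qi, si, w, x_drop=10):
    """Extension: grow the seed right, then left, keeping the best-scoring extent."""
    best = score = sum(pair_score(query[qi + k], subject[si + k]) for k in range(w))
    best_right = k = 0
    while qi + w + k < len(query) and si + w + k < len(subject):
        score += pair_score(query[qi + w + k], subject[si + w + k])
        k += 1
        if score > best:
            best, best_right = score, k
        elif best - score > x_drop:
            break
    score = best                      # continue from the best right-hand extent
    best_left = k = 0
    while qi - k - 1 >= 0 and si - k - 1 >= 0:
        score += pair_score(query[qi - k - 1], subject[si - k - 1])
        k += 1
        if score > best:
            best, best_left = score, k
        elif best - score > x_drop:
            break
    return best, qi - best_left, qi + w + best_right   # (score, query start, query end)


def high_scoring_pairs(query, subject, w, s, x_drop=10):
    """Evaluation: keep distinct extended alignments whose score is at least s."""
    found = {extend(query, subject, qi, si, w, x_drop)
             for qi, si in find_seeds(query, subject, w)}
    return sorted(hit for hit in found if hit[0] >= s)


print(high_scoring_pairs("ATGGCGAAATCA", "ATGGCGTACCCA", w=3, s=30))
```

For the two sequences of Figure 1, all four seeds extend to the same full-length segment, so a single HSP is reported once duplicates are removed.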
Figure 1. Query sequence: A T G G C G A A A T C A; similarity sequence: A T G G C G - A - - C A; database sequence: A T G G C G T A C C C A.

BLAST graphical user interface: NCBI provides many variations of BLAST programs (https://blast.ncbi.nlm.nih.gov/Blast.cgi) that allow the user to search for sequence similarities, such as BLASTN, BLASTP, BLASTX, TBLASTN, and TBLASTX. BLASTN compares a given DNA sequence to a DNA database. BLASTP compares a given protein sequence to a protein database. BLASTX compares a given translated DNA sequence to a protein database. TBLASTN compares a given protein sequence to a translated DNA database. TBLASTX compares a given translated DNA sequence to a translated DNA database. The BLASTN program interface is shown in Figure 4. The interfaces of the other BLAST programs are similar to that of BLASTN. The BLASTN interface has three main sections:
Enter query sequence: This section allows the user to enter a query sequence or upload it as a FASTA file. A query sequence may be entered in three different ways: as the accession number [9] of the query sequence, as the gi [9] of the query sequence, or as the query sequence in FASTA format [10]. The other choice is to upload a sequence file. The file may contain a single sequence or a list of sequences. The data in the file may be a list of accession numbers, a list of gi numbers, or sequences in FASTA format.
[...] similar sequences (megablast) [11], or with more dissimilar sequences (discontiguous megablast), or with somewhat similar sequences (blastn).
Algorithm parameters: This section is optional and allows the user to modify the default values of the following parameters: maximum number of aligned sequences to display, expected threshold value, word size value, match/mismatch scores, and gap costs. For each parameter shown in Figure 4, there is a small question mark icon that the user can click to see a more detailed description of that parameter. The output of BLASTN is the list of all alignments (hits) in increasing order of E-value (expected threshold), which measures the hit quality (smaller numbers mean better hits). It is displayed in HTML format. The result page contains three parts: a graphical view of the hits found, as illustrated in Figure 5a; a table with the hit sequence identifiers and scoring data, as illustrated in Figure 5b; and the alignments between the query sequence and the hits, as illustrated in Figure 5c. The output can be downloaded to the local machine as a text, GenBank, CSV, XML, or ASN file.

i2b2
As part of the NIH (National Institutes of Health) plan, the mission of the NCBC (National Center for Biomedical Computing, http://www.ncbcs.org/) is to develop and implement innovative software programs that are needed in biomedical research. One of the projects sponsored by NCBC is i2b2, an open source, scalable informatics framework that provides clinical investigators with the software tools they need to collect, manage, and combine medical data and research data. The platform design is scalable, and its architecture can be extended with additional plug-in functionalities. Developers are able to create custom plug-ins based on the needs of clinical and research investigators. The platform enables the integration of patients' clinical and genomics data from heterogeneous resources. It is packaged with a query tool that can be used to query patient information and can provide users with de-identified data for a group of patients meeting certain inclusion or exclusion criteria. Major applications supported by i2b2 include cohort management [12], which consists of finding a group of patients sharing common criteria for analysis, and population-based studies [13], which combine data from multiple centers to include a large number of participants from a wider range of population groups. The platform is widely used by academic medical centers and hospitals around the world [14].
i2b2 Users: The i2b2 platform is designed for the following users:
• Clinical investigators: who need to collect and manage project-related clinical research data.
• Bioinformatics scientists: who need to customize the flow of biomedical data and interactions.
• Bio-computational software developers: who are responsible for developing new software services that can be integrated into the i2b2 platform [2].
i2b2 Architecture: The i2b2 architecture is based on Service Oriented Architecture (SOA). It consists of a set of interconnected cells (software modules) that form a hive; the cells share a common messaging protocol and interact using web services and XML messages. Each cell can be developed independently for a specific goal and then integrated into the hive to enhance the functionality of i2b2.
There are two types of cells in the hive: core cells and plug-in cells, as shown in Figure 6. Core cells constitute the back-end infrastructure, i.e., server side components, and establish the basic services of the hive such as managing the hive setup, security, users, projects, files, and data. Plug-in cells are client side components, which consist of an application suite of data querying and mining tools. The platform design is scalable and can be extended to provide independent software services. Each plug-in can be developed independently by different investigators to achieve specific analytic goals. Then, they can be integrated into the hive to enhance the functionality of i2b2.
i2b2 Data Model: The i2b2 platform was created to host many projects, and each project has its own data warehouse, i.e., a Clinical Research Chart (CRC) cell. The data warehouse was designed using the star schema [15], as illustrated in Figure 7. The i2b2 star schema consists of one fact [...]
i2b2 Client Applications: The client side applications available to users for query and analysis include:
• The Workbench (https://www.i2b2.org/software/), a fat-client application, which can be [...]
• The Web client (https://www.i2b2.org/webclient/), a thin-client application that can run on any modern web browser.

Proposed BLAST-i2b2 Approach
The main issue with the current BLAST platform is the storage of BLAST results. Biological sequence databases are periodically updated; thus, BLAST results may differ when the user runs BLAST against different versions of a specific biological sequence database. In other words, if the user runs BLAST against a biological sequence database, the current platform displays the results and allows the user to download them to the local machine. If that target database is later updated, the previously downloaded results may be out of date, and the user needs to run BLAST against the updated version to get updated results. In this case, the user must enter the query sequence again, and BLAST runs against all sequences in the database each time the database is updated. This wastes time because an update usually changes only a small part of the target database. To reduce the BLAST run time, BLAST should therefore run against the updated sequences only instead of against all sequences. To solve this problem, we designed a data warehouse to store BLAST results. In this scenario, if the user runs BLAST against a biological sequence database, the system displays the results and allows the user to save them into the data warehouse. If that target database is updated later and the user requests an updated result, BLAST runs against the updated sequences only, which considerably reduces the BLAST run time.
We extended the i2b2 platform by developing a new plug-in, BLAST-i2b2. BLAST-i2b2 provides two main features: saving and updating BLAST results. To implement this new tool, we incorporated the BLAST tool into the i2b2 platform and added the required features for saving and updating the BLAST results. In this process, the following steps were performed:
• First, we selected which client to use: the i2b2 web client or the i2b2 workbench. The i2b2 web client (https://www.i2b2.org/webclient/) was selected because it is easily accessible from web browsers.
• Second, we incorporated NCBI BLAST into the i2b2 platform. To do that, we downloaded the BLAST executable (command line) programs from the NCBI web page. We also downloaded some of the BLAST databases. When the user runs BLAST, the system executes the BLAST executable against the locally downloaded BLAST databases.
• Third, we modified the BLAST-i2b2. To save BLAST results, we designed the data warehouse, which stores the query parameters and the query results. To update the BLAST results, we first run a script every three days to download the latest version of the BLAST databases locally; the user can also trigger this download manually. Second, we allow the user to review the query history. The user can select one of the previous queries and request an update. In this case, if a new BLAST database version has been downloaded, the system extracts both the deleted and the newly added sequences, runs BLAST against the newly added sequences, deletes the stored results that match the deleted sequences, and updates the result. The next section gives a detailed description of the save and update processes.
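The second step, running a locally installed BLAST executable against a locally downloaded database, might look like the sketch below. It assumes the NCBI BLAST+ `blastn` program is installed and on the PATH; the paper does not state which BLAST release or output format was used, so the helper name, the file names, and the choice of tabular output (`-outfmt 6`) are illustrative assumptions.

```python
import subprocess


def build_blastn_command(query_fasta, database, evalue, word_size, out_path):
    """Build a BLAST+ blastn command line (hypothetical helper, assumed setup)."""
    return [
        "blastn",
        "-query", query_fasta,          # FASTA file holding the query sequence
        "-db", database,                # name of a locally downloaded BLAST database
        "-evalue", str(evalue),         # expected threshold value
        "-word_size", str(word_size),   # word size value
        "-outfmt", "6",                 # tabular output, one hit per line
        "-out", out_path,
    ]


def run_blastn(query_fasta, database, evalue, word_size, out_path):
    """Run blastn and raise CalledProcessError if it exits with an error."""
    cmd = build_blastn_command(query_fasta, database, evalue, word_size, out_path)
    subprocess.run(cmd, check=True)
    return out_path


# Example command only; file and database names are placeholders.
print(" ".join(build_blastn_command("query.fa", "refseq_rna", 10, 11, "hits.tsv")))
```

Keeping the command construction separate from its execution makes it easy to log or store the exact parameters alongside the saved results.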

Save process
An innovative functionality, saving the BLAST results, was added to the proposed tool. If the user enters a query sequence and selects a target database, BLAST-i2b2 runs the BLAST algorithm to find the similarities between the query sequence and each sequence in the target database, displays the BLAST results, and gives the user the ability to save these results, as shown in Figure 8. The benefits of saving BLAST results are:
• An up-to-date storage of BLAST results as well as the storage of queries in a local data warehouse, which is directly connected to the target database for periodic updates.
• Faster access to the stored results and options for running and customizing the previous queries.
• The ability to utilize the data for data mining, analysis, and visualization in the future using the set of client side tools provided by the i2b2 platform.
• Reusability of the stored data in the data warehouse for future analytical experimentation and research purposes.
BLAST-i2b2 saves both the query parameters (the query sequence, the target database name and version, the expected threshold value, and the word size value) and the BLAST results of the query. The BLAST-i2b2 data warehouse design is described in Section 3.3.

Update process
In the original BLAST process, the user enters a query sequence, selects a target database, and runs BLAST. The BLAST algorithm compares the submitted query sequence to all sequences in the target database and displays the BLAST results, as illustrated in Figure 9a. If a new version of that target database is downloaded, the user must enter the query parameters again (because they are not saved) to get updated results. In this case, BLAST repeats its process and compares the re-submitted query sequence to all sequences in the updated target database, as illustrated in Figure 9b, which may take a long time. Therefore, we propose the 'save process', which saves the query sequence and BLAST results to reduce the BLAST run time when the target database is updated.
In BLAST-i2b2, the user enters a query sequence, selects a target database, and runs BLAST. BLAST-i2b2 compares the submitted query sequence to all sequences in the target database, displays the BLAST results, and allows the user to save those results, as illustrated in Figure 10a.
If a new version of that target database is downloaded, the system informs the user, who can simply update the results. In this case, BLAST compares the saved query sequence with only the modified sequences in the updated version of the target database instead of comparing it to all sequences, which reduces the BLAST execution time, as illustrated in Figure 10b. To implement the update process, we used the following strategy:
• Extract the sequence GI numbers: A modification of a database can be an added sequence, a deleted sequence, or a modified sequence. Each sequence in a database is identified by a GI number. To find the modifications between database versions (the old version that the user ran a query against and the new version), we extract the GI numbers of all sequences from both the old and the new database versions into two text files, as illustrated in Figure 11a.
• Find the modifications between database versions: We compare the two text files to extract the newly added sequences and the deleted sequences into another two files, as illustrated in Figure 11b. One file contains the GI numbers of the newly added sequences, which exist in the new database version and do not exist in the old database version. The other file contains the GI numbers of the deleted sequences, which exist in the old database version and do not exist in the new database version. We do the comparison using the UNIX "join" utility.
• Run BLAST against new sequences: BLAST-i2b2 runs BLAST against the newly added sequences and displays the new BLAST results. If the user wants to save the new BLAST results, the system performs these two steps:
• It merges the new results with the stored BLAST results.
• It checks for matching between the deleted sequences and the old stored BLAST results. If there are any matches, the system deletes those results from the data warehouse.
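The update strategy above can be sketched with set operations. This is an illustration only: the paper performs the comparison on text files with the UNIX "join" utility, whereas this sketch uses in-memory Python sets; the `>gi|<number>|` header layout, the function names, and the dictionary representation of stored results are assumptions.

```python
import re

# Assumed header layout: ">gi|<number>|..." at the start of each FASTA record.
GI_PATTERN = re.compile(r"^>gi\|(\d+)\|")


def extract_gis(fasta_lines):
    """Collect the GI number of every sequence header in a database version."""
    gis = set()
    for line in fasta_lines:
        match = GI_PATTERN.match(line)
        if match:
            gis.add(match.group(1))
    return gis


def diff_versions(old_gis, new_gis):
    """Return (newly added GIs, deleted GIs) between two database versions."""
    return new_gis - old_gis, old_gis - new_gis


def update_saved_hits(saved_hits, new_hits, deleted_gis):
    """Drop stored hits for deleted sequences, then merge in the new hits.

    Both hit collections map a GI number to whatever is stored for that hit.
    """
    merged = {gi: hit for gi, hit in saved_hits.items() if gi not in deleted_gis}
    merged.update(new_hits)
    return merged


old = extract_gis([">gi|1|ref|A", "ACGT", ">gi|2|ref|B", "GGCC"])
new = extract_gis([">gi|2|ref|B", "GGCC", ">gi|3|ref|C", "TTAA"])
added, deleted = diff_versions(old, new)
print(added, deleted)
print(update_saved_hits({"1": 50.0, "2": 40.0}, {"3": 60.0}, deleted))
```

Only the sequences whose GI numbers land in `added` need to be searched again, which is what keeps the update run short when a new database version differs only slightly from the old one.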

BLAST-i2b2 data warehouse design
The BLAST-i2b2 data warehouse is designed as a star schema with four tables: one fact [...] Figure 14. Three extra variables, NewSeqDatabase, DelSeqDatabase, and UpdatedVersion, are set to NULL unless the user requests an update. In that case, these variables are filled as follows: NewSeqDatabase is filled with the name of the file that contains the newly added sequences, which exist in the version given by the UpdatedVersion field and do not exist in the version given by the DatabaseVersion field; DelSeqDatabase is filled with the name of the file that contains the deleted sequences, which exist in the version given by the DatabaseVersion field and do not exist in the version given by the UpdatedVersion field; UpdatedVersion is filled with the updated database version number. The BlastDatabaseDim table, shown in Figure 15, stores a list of all the BLAST databases the user can search, with their version numbers and download dates.
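To make the role of the three update variables concrete, the sketch below sets up a small star schema in SQLite. Only the names BlastDatabaseDim, DatabaseVersion, NewSeqDatabase, DelSeqDatabase, and UpdatedVersion come from the text; every other table and column name, the placement of the update variables in a query table, and the use of SQLite are assumptions made for illustration and do not reproduce the actual BLAST-i2b2 schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE BlastDatabaseDim (            -- named in the text
    DatabaseId      INTEGER PRIMARY KEY,
    DatabaseName    TEXT NOT NULL,
    DatabaseVersion TEXT NOT NULL,
    DownloadDate    TEXT NOT NULL
);
CREATE TABLE QueryDim (                    -- hypothetical table name
    QueryId           INTEGER PRIMARY KEY,
    QuerySequence     TEXT NOT NULL,
    ExpectedThreshold REAL,
    WordSize          INTEGER,
    DatabaseVersion   TEXT NOT NULL,
    NewSeqDatabase    TEXT,                -- NULL until an update is requested
    DelSeqDatabase    TEXT,                -- NULL until an update is requested
    UpdatedVersion    TEXT                 -- NULL until an update is requested
);
CREATE TABLE HitSequenceDim (              -- hypothetical table name
    HitId       INTEGER PRIMARY KEY,
    GiNumber    TEXT NOT NULL
);
CREATE TABLE BlastResultFact (             -- hypothetical fact table
    QueryId    INTEGER REFERENCES QueryDim(QueryId),
    DatabaseId INTEGER REFERENCES BlastDatabaseDim(DatabaseId),
    HitId      INTEGER REFERENCES HitSequenceDim(HitId),
    Score      REAL,
    EValue     REAL
);
"""


def create_warehouse(conn):
    """Create the four tables of the sketch schema."""
    conn.executescript(SCHEMA)


def record_update(conn, query_id, new_seq_file, del_seq_file, updated_version):
    """Fill the three update variables when the user requests an update."""
    conn.execute(
        "UPDATE QueryDim SET NewSeqDatabase = ?, DelSeqDatabase = ?, "
        "UpdatedVersion = ? WHERE QueryId = ?",
        (new_seq_file, del_seq_file, updated_version, query_id),
    )


conn = sqlite3.connect(":memory:")
create_warehouse(conn)
conn.execute(
    "INSERT INTO QueryDim (QueryId, QuerySequence, ExpectedThreshold, WordSize, "
    "DatabaseVersion) VALUES (1, 'ATGGCGAAATCA', 10, 11, 'v1')"
)
record_update(conn, 1, "new_gis.txt", "deleted_gis.txt", "v2")
print(conn.execute("SELECT UpdatedVersion FROM QueryDim WHERE QueryId = 1").fetchone())
```

Keeping DatabaseVersion alongside UpdatedVersion lets the warehouse record which two versions the stored difference files refer to.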

Results and Discussion
We tested the performance of BLAST-i2b2 to assess its capability. Since the key feature of BLAST-i2b2 is that it stores BLAST results and reduces the BLAST search time through the update process, the performance test focuses on the execution time of BLAST against an updated BLAST database. We conducted three experiments, each against a database of a different size. The following section presents the test parameters, including the submitted query sequence, the searched BLAST databases, the expected threshold value, and the word size.

Experimental Settings
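• Query sequence: [...]
• BLAST databases: RefSeqGene (small sized database, used in experiment 1), EST_mouse (medium sized database, used in experiment 2), and RefSeq_RNA (large sized database, used in experiment 3).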
• Expected threshold value (The number of different alignments that are expected to be matched by chance): 10.
The performance test was conducted using two scenarios for each database size.
• First Scenario: Run the query sequence against the first version of a BLAST database without saving BLAST results. Then, run the same query sequence against the second version of the same BLAST database.
• Second Scenario: Run the query sequence against the first version of a BLAST database and save BLAST results. Then, update the saved BLAST results using the BLAST-i2b2 update algorithm.

Discussion
The experimental results show that the BLAST-i2b2 update process reduces the search time by 84% for the small sized database (RefSeqGene), by 93% for the medium sized database (EST_mouse), and by 97% for the large sized database (RefSeq_RNA). As shown in Figure 17, the left bars of the figure present the run time against the small sized [...] Figure 18, running BLAST against all sequences in the RefSeq_RNA database v.2 takes 474 sec, whereas updating the result takes only 14 sec. From these results, we can conclude that the BLAST-i2b2 update process enhances BLAST by reducing its run time and storing the results for further analysis.
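The reported reduction for the large database can be checked from the two run times given above. The small helper below is only a worked calculation; its name is hypothetical.

```python
def reduction_percent(full_run_sec, update_run_sec):
    """Percentage of run time saved by the update run relative to a full run."""
    return (full_run_sec - update_run_sec) / full_run_sec * 100


# RefSeq_RNA: 474 sec for the full search, 14 sec for the update.
print(round(reduction_percent(474, 14), 1))
```

The result, (474 - 14) / 474, is about 97%, which agrees with the reduction stated for the large sized database.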

Conclusions
In this paper, we introduced concepts related to biological sequences, such as sequence similarity, its importance, and how to find it using different sequence alignment methods. We discussed BLAST and its functionality in detail. We provided an overview of the i2b2 architecture and features. Then, we described the proposed approach of developing BLAST-i2b2 on the i2b2 platform. This new tool gives researchers the ability to compare DNA sequences with some of the NCBI DNA databases. We presented the BLAST-i2b2 methodology and the data warehouse design. The paper showed that BLAST-i2b2 enhances the functionality of both BLAST and the i2b2 platform. From the BLAST perspective, the major added features are the ability to store BLAST results for future analysis and the ability to update the results of a previous query whenever the target database is updated. We modified BLAST by allowing it to compare the query sequence with only the modified sequences in the target database instead of comparing it with all sequences. From the i2b2 perspective, a new BLAST-i2b2 plug-in was added to the i2b2 platform that is capable of storing the BLAST search queries and results locally and reusing them to enhance the BLAST functionality. We tested the performance of the BLAST-i2b2 update process against three different database sizes. The experimental results showed that the BLAST-i2b2 update process reduces the BLAST run time by 84% for the small sized database, 93% for the medium sized database, and 97% for the large sized database. The work in progress focuses on extending the BLAST-i2b2 functionality with importing, mapping, and customization of locally stored search results for the i2b2 web client application. Future work aims at the utilization and reusability of these stored results for data mining and analytics, which will enable users to query, mine, analyze, and visualize the BLAST search results using the BLAST-i2b2 platform.
Due to the generic nature of the proposed approach, BLAST-i2b2 can also be extended in the future to search other available DNA databases as well as protein sequence databases.