Ori-Finder 2, an integrated tool to predict replication origins in the archaeal genomes

DNA replication is one of the most basic processes in all three domains of cellular life. With the advent of the post-genomic era, the increasing number of complete archaeal genomes has created an opportunity to explore the molecular mechanisms that initiate cellular DNA replication, both by in vivo experiments and by in silico analysis. However, the location of replication origins (oriCs) in many sequenced archaeal genomes remains unknown. We present a web-based tool, Ori-Finder 2, to predict oriCs in archaeal genomes automatically. It is based on an integrated method comprising the analysis of base composition asymmetry using the Z-curve method, the distribution of origin recognition boxes identified by the FIMO tool, and the occurrence of genes frequently close to oriCs. The web server can also analyze unannotated genome sequences by integrating gene prediction pipelines and the BLAST software for gene identification and functional annotation. The predicted oriCs are displayed as an HTML table, which offers an intuitive way to browse the results in graphical and tabular form. The software presented here is accurate for genomes with a single oriC, but it does not necessarily find all the origins of replication for genomes with multiple oriCs. Ori-Finder 2 aims to become a useful platform for the identification and analysis of oriCs in archaeal genomes, which would provide insight into the replication mechanisms in archaea. The web server is freely available at http://tubic.tju.edu.cn/Ori-Finder2/.


INTRODUCTION
DNA replication is one of the essential and conserved features among all three domains of life. In bacteria, DNA replication initiates from a single replication origin (oriC), which is often adjacent to replication-related genes and contains DnaA box motifs, whereas eukaryotic organisms use significantly more replication origins, ranging from hundreds in yeast to tens of thousands in humans (Gao et al., 2012). Archaea are classified as a separate domain in the three-domain system, and share some features with both bacteria and eukaryotes (Woese and Fox, 1977). As in bacteria, the oriCs in archaea are located in the intergenic regions around the genes encoding replication-related proteins and contain origin recognition boxes (ORBs). The ORB motifs are conserved sequences that serve as recognition sites for the Orc1/Cdc6 initiation proteins (Barry and Bell, 2006). In some organisms, G-stretches are also observed at the end of ORBs. On the other hand, the origin binding proteins in archaea are homologous to the eukaryotic Orc1/Cdc6 proteins, and some archaea use more than one oriC to initiate DNA replication. With the increasing availability of complete archaeal genomes, identification of their oriCs would provide further insight into the mechanism of DNA replication in archaea and reveal the evolutionary history between bacteria and eukaryotes (Barry and Bell, 2006; Wu et al., 2014b).
The first putative oriC of archaea was identified in Halobacterium sp. strain NRC-1 by the GC-skew method and demonstrated by cloning into a non-replicating plasmid (Myllykallio et al., 2000). The Z-curve method is an alternative technique that detects the asymmetrical nucleotide distribution around replication origins. The three components of the Z-curve, x_n, y_n, and z_n, display the distributions of purine versus pyrimidine (R vs. Y), amino versus keto (M vs. K) and strong H-bond versus weak H-bond (S vs. W) bases along the sequence, respectively. The x_n and y_n components are termed the RY and MK disparity curves, respectively. The AT and GC disparity curves are defined by (x_n + y_n)/2 and (x_n − y_n)/2, which show the excess of A over T and of G over C, respectively, along the sequence (Zhang and Zhang, 2005; Gao, 2014). Based on the Z-curve analysis, we have identified a single oriC in Methanocaldococcus jannaschii and Methanosarcina mazei, two oriCs in Halobacterium sp. strain NRC-1, and three oriCs in Sulfolobus solfataricus P2, which are consistent with the subsequent experiments (Soppa, 2006). Recently, multiple orc1/cdc6-associated oriCs in all the available haloarchaeal genomes have been predicted by identification of putative ORBs (Wu et al., 2012). Based on these discoveries, several basic features of the oriCs in archaea can be summarized. Firstly, most oriCs are located in proximity to the genes encoding archaeal replication-related proteins, such as archaeal Orc/Cdc6 protein, Whip (Winged-Helix Initiator Protein) and DNA primase. Secondly, oriCs are often located around the extremes of disparity curves. Finally, most of the oriCs contain AT-rich unwinding elements and conserved ORBs (Zhang and Zhang, 2005; Barry and Bell, 2006; Wu et al., 2014a).
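The disparity curves described above can be illustrated with a minimal Python sketch. It is not the code used by Ori-Finder 2. The function names `z_curve` and `extreme_positions` are hypothetical, and the sign convention for z_n, taken here as (A + T) minus (G + C), is an assumption based on the commonly used Z-curve definition.

```python
def z_curve(seq):
    """Return cumulative x, y, z components plus AT and GC disparity curves."""
    # Per-base increments (dx, dy, dz):
    #   x: purine (A, G) vs. pyrimidine (C, T)
    #   y: amino (A, C) vs. keto (G, T)
    #   z: weak (A, T) vs. strong (G, C); sign convention assumed
    steps = {"A": (1, 1, 1), "G": (1, -1, -1), "C": (-1, 1, -1), "T": (-1, -1, 1)}
    x = y = z = 0
    xs, ys, zs = [], [], []
    for base in seq.upper():
        dx, dy, dz = steps.get(base, (0, 0, 0))  # ambiguous bases add nothing
        x, y, z = x + dx, y + dy, z + dz
        xs.append(x)
        ys.append(y)
        zs.append(z)
    at = [(a + b) / 2 for a, b in zip(xs, ys)]  # excess of A over T
    gc = [(a - b) / 2 for a, b in zip(xs, ys)]  # excess of G over C
    return xs, ys, zs, at, gc


def extreme_positions(curve):
    """Return the 1-based positions of the minimum and maximum of a curve."""
    lo = min(range(len(curve)), key=curve.__getitem__)
    hi = max(range(len(curve)), key=curve.__getitem__)
    return lo + 1, hi + 1
```

For a whole chromosome, the extremes returned by `extreme_positions` for the GC or MK disparity curve would be the first places to look for a candidate oriC, in line with the second feature listed above.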
Our group has developed a web-based system, Ori-Finder 1, to find oriCs in bacterial genomes based on the Z-curve method with high accuracy and reliability (Gao and Zhang, 2008). Now, with the knowledge of oriCs in archaeal genomes, we present an online tool, Ori-Finder 2, to identify the oriCs in archaeal genomes. It is based on an integrated method comprising the analysis of base composition asymmetry using the Z-curve method, the distribution of ORB elements identified by the FIMO tool, and the occurrence of genes frequently close to replication origins. The tool is available at http://tubic.tju.edu.cn/Ori-Finder2/.

METHODS AND IMPLEMENTATION
Ori-Finder 2 uses an integrated approach to predict oriCs in user-supplied archaeal genomes automatically. Figure 1 presents the workflow of Ori-Finder 2. Users submit an annotated or unannotated genome sequence to the web server. For an annotated genome, we recommend that users submit the sequence file in GenBank format, or upload the sequence file in FASTA format together with its corresponding protein table (PTT) file. The web server can also analyze unannotated genomes by integrating two gene prediction pipelines, ZCURVE1.02 and Glimmer3 (Guo et al., 2003; Delcher et al., 2007), for gene identification and the BLAST program for functional annotation of genes. Then all the intergenic sequences are scanned by Find Individual Motif Occurrences (FIMO), a software tool for scanning DNA or protein sequences with motifs described as position-specific scoring matrices (Grant et al., 2011), to obtain the ORB sequences, and also by the REPuter program, a classic pipeline to compute exact repeats and palindromes in complete genomes (Kurtz et al., 2001), to identify the repeats. Finally, all the intergenic sequences that are adjacent to replication-related genes and contain ORB sequences are predicted as oriCs. Since the approach relies on prior knowledge of oriCs in archaea, it may fail to identify oriCs adjacent to unknown genes that might be involved in DNA replication. To mitigate this drawback, intergenic sequences that contain more than two conserved motifs are also predicted as oriCs. BLAST searches are performed against DoriC, a database of bacterial and archaeal replication origins, to search for homologs (Gao and Zhang, 2007; Gao et al., 2013). Here, the conserved motifs of ORB sequences used in FIMO were obtained from DoriC. All the records in DoriC were organized into several taxonomic clusters, including Methanobacteriaceae, Methanomicrobia, Methanococcaceae, Sulfolobaceae and Thermococcaceae.
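The two prediction rules described above can be sketched as a small Python filter. This is only an illustration under stated assumptions, not the server's implementation. The class `IntergenicRegion`, the keyword list `REPLICATION_KEYWORDS`, and the requirement of at least one ORB hit for the first rule are all hypothetical choices made for the example.

```python
from dataclasses import dataclass

# Hypothetical keyword list; the gene list actually used by the server
# is not given in the text.
REPLICATION_KEYWORDS = ("orc1", "cdc6", "whip", "primase")


@dataclass
class IntergenicRegion:
    start: int
    end: int
    flanking_products: list  # annotations of the genes on either side
    orb_hits: int = 0        # number of ORB motif matches in the region


def near_replication_gene(region):
    """True if any flanking gene annotation mentions a replication keyword."""
    return any(
        keyword in product.lower()
        for product in region.flanking_products
        for keyword in REPLICATION_KEYWORDS
    )


def is_candidate_oric(region):
    """Apply the two rules: motif-rich regions, or ORBs next to known genes."""
    if region.orb_hits > 2:  # "more than two conserved motifs"
        return True
    # Assumption: one ORB hit is enough when a replication gene is adjacent.
    return region.orb_hits >= 1 and near_replication_gene(region)
```

Keeping the two rules as separate branches makes it easy to report which rule triggered a prediction, which matters when a region is accepted only because of its motif count.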
The conserved ORB motifs for each taxonomic cluster were calculated by the Multiple EM for Motif Elicitation (MEME) program, a tool used to discover motifs in a group of related DNA or protein sequences (Bailey et al., 2009). Table 1 displays the regular expressions of the ORB motifs. Note that the common motif is calculated from all the records in DoriC. The motif logos are shown in the submission form, and the position specific probability matrix (PSPM) is available on the documentation webpage. Each Ori-Finder 2 job is assigned a unique ID, and the whole process takes several minutes to complete. Users can retrieve their results with the job ID, or be notified by email if an address is specified on the submission page.
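To make the role of a PSPM concrete, the following Python sketch scores candidate sites with a log-odds score against a uniform background, which is the general kind of score a motif scanner reports. The matrix `TOY_PSPM` is a made-up three-column example, not one of the ORB motifs from DoriC, and the functions `log_odds_score` and `best_hit` are hypothetical. The sketch does not attempt to reproduce the scores or P values computed by FIMO.

```python
import math

# Toy three-column matrix (each column sums to 1); NOT a real ORB motif.
TOY_PSPM = [
    {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125},
    {"A": 0.25, "C": 0.5, "G": 0.125, "T": 0.125},
    {"A": 0.125, "C": 0.125, "G": 0.5, "T": 0.25},
]


def log_odds_score(site, pspm, background=None):
    """Sum log2(p / background) over the columns of the matrix."""
    background = background or {base: 0.25 for base in "ACGT"}
    if len(site) != len(pspm):
        raise ValueError("site and matrix must have the same length")
    return sum(
        math.log2(column[base] / background[base])
        for base, column in zip(site.upper(), pspm)
    )


def best_hit(sequence, pspm):
    """Return (score, 0-based start) of the best-scoring window."""
    width = len(pspm)
    scored = [
        (log_odds_score(sequence[i:i + width], pspm), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scored)
```

A real matrix would need pseudocounts so that no probability is zero, and both strands of each intergenic sequence would have to be scanned.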
The result page displays, as an HTML table, the genome size, the GC content, the locations of the replication-related genes and the predicted oriCs, as well as the Z-curve (AT, GC, RY, and MK disparity curves) for the input genome.
In addition, detailed information about the repeats identified by the REPuter program, the ORBs recognized by FIMO and the homologs in DoriC is presented in the corresponding subtables. The ORB motifs in all the intergenic regions are also available for download from the provided URL. Users can click to enlarge the embedded figure and obtain a high-resolution version, which displays the RY, MK, GC, and AT disparity curves, the replication-related proteins, and the predicted oriCs. The result webpage and figures are stored on the web server for 7 days.
Ori-Finder 2 is developed using Python and PHP on a Unix platform with an Apache web server. The web interface is implemented using Common Gateway Interface (CGI) Python scripts, and the webpage is designed with HTML, CSS, and JavaScript. The pipeline of Ori-Finder 2 uses the Biopython library, and the output graphs are generated by the Python module Matplotlib (Hunter, 2007; Cock et al., 2009).

RESULTS AND DISCUSSION
Based on this online system, we predicted the oriCs for all the available complete archaeal genomes in GenBank. For example, Pyrococcus abyssi is a classical model of DNA replication in archaea. Similar to bacteria, it has only one oriC in its circular chromosome, which was identified by cumulative oligomer skew and confirmed by an in vivo method. With the annotated genome file, the oriC predicted by Ori-Finder 2 is in accordance with the experimental result and is located at the peak of the MK disparity curve. Several ORB sequences are recognized in the oriC. Figure 2 is a screenshot of the result produced by Ori-Finder 2. In addition, some archaea use more than one oriC during DNA replication, and in such cases Ori-Finder 2 can predict multiple oriCs in the genome. Haloferax volcanii DS2 has a chromosome with multiple oriCs. Five oriCs were identified in silico, and three of them have been confirmed in vitro (Norais et al., 2007; Wu et al., 2012; Hawkins et al., 2013). With the annotated genome file, all five oriCs mentioned above were predicted successfully by Ori-Finder 2, and another oriC with three ORB motifs, adjacent to the genes purO and cgi, was also found. Moreover, the oriCs identified in the unannotated genomes are consistent with the previous results.

[Figure caption fragment; the text for panel (A) is missing in the source.] (B) The detailed information of the predicted oriC region including size, GC content, homologs in DoriC and sequence, as well as the information of the identified ORBs including the ORB motif (also referred to as "Pattern name"), location, strand, the associated log-likelihood ratio score, P value and the matched sequences. Note that the log-likelihood ratio score and P value are computed by FIMO to measure the similarity between the ORB motif and the matched sequence, and the P value cutoff for FIMO motif searching is 10⁻⁴. The ORB motif used here is the common motif.
To estimate the performance of Ori-Finder 2, we used 13 annotated archaeal chromosomes whose oriCs have been confirmed experimentally or identified in silico by other groups (Table 2). Compared with the records in DoriC, the sensitivity and precision are 66.7% and 62.1%, respectively. The precision and sensitivity are lower than those of programs that detect bacterial origins, such as Ori-Finder 1, because bacteria have only one oriC in their chromosomes, whereas archaea tend to have more than one. Furthermore, oriCs in archaea show more diversity than those in bacteria, such as more complex ORBs in comparison with the DnaA boxes, and more unknown species-specific replication-related genes. It is difficult to predict the oriCs in archaea with high precision and sensitivity because of the limited amount of experimental data. For example, not all the oriCs in genomes with multiple oriCs are found, and the ORBs with unique features need to be further explored by experimental methods. For the convenience of users, the oriCs confirmed by in vivo or in silico methods have been collected into DoriC, which is freely available at http://tubic.tju.edu.cn/doric/.
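For readers who want to repeat this kind of evaluation, the two measures can be sketched in Python as follows. Sensitivity is the fraction of reference oriCs that are recovered, and precision is the fraction of predictions that match a reference oriC. The matching criterion used here, any overlap between coordinate intervals, is an assumption, since the text does not state how predictions were matched to DoriC records. The function names are hypothetical.

```python
def overlaps(a, b):
    """True if two closed intervals (start, end) share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def evaluate(predicted, reference):
    """Return (sensitivity, precision) for lists of (start, end) intervals."""
    # Reference origins hit by at least one prediction.
    recovered = sum(any(overlaps(p, r) for p in predicted) for r in reference)
    # Predictions that hit at least one reference origin.
    correct = sum(any(overlaps(p, r) for r in reference) for p in predicted)
    sensitivity = recovered / len(reference) if reference else 0.0
    precision = correct / len(predicted) if predicted else 0.0
    return sensitivity, precision
```

Counting recovered references and correct predictions separately avoids inflating either measure when one prediction overlaps several reference origins or the reverse.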

CONCLUSION
Here, we presented Ori-Finder 2, a user-friendly interactive web-based platform to predict the oriCs in archaeal genomes. The tool integrates several genomic pipelines, including FIMO, BLAST, ZCURVE, Glimmer, and REPuter, to comprehensively annotate and analyze the oriCs. Moreover, the ORB motifs are calculated by MEME and organized by taxonomy. The software presented here does not necessarily find all the origins of replication in genomes that contain multiple ones. However, we will continue to improve our approach to make it more accurate and sensitive as more oriCs are confirmed experimentally in archaea. Ori-Finder 2 is the only currently available auto-annotation system for archaeal replication origins at the sequence level, and we believe that it will be helpful for predicting archaeal replication origins and will provide insight into DNA replication in archaea.

AUTHOR CONTRIBUTIONS
Hao Luo designed the computer program and drafted the manuscript. Chun-Ting Zhang and Feng Gao supervised the study and revised the manuscript. All authors read and approved the final manuscript.