Ori-Finder 2022: A Comprehensive Web Server for Prediction and Analysis of Bacterial Replication Origins

The replication of DNA is a complex biological process that is essential for life. Bacterial DNA replication is initiated at genomic loci referred to as replication origins (oriCs). Integrating the Z-curve method, DnaA box distribution, and comparative genomic analysis, we developed a web server called Ori-Finder in 2008 to predict bacterial oriCs, which has helped to clarify the characteristics of bacterial oriCs. The oriCs of hundreds of sequenced bacterial genomes have been annotated in genome reports using Ori-Finder, and the predicted results have been deposited in DoriC, a manually curated database of oriCs. This has facilitated large-scale data mining of functional elements in oriCs and strand-biased analysis. Here, we describe Ori-Finder 2022 with an updated prediction framework, an interactive visualization module, a new analysis module, and a user-friendly interface. More species-specific indicator genes and functional elements of oriCs are integrated into the updated framework, which has also been redesigned to predict oriCs in draft genomes. The interactive visualization module displays more genomic information related to oriCs and their functional elements. The analysis module includes regulatory protein annotation, repeat sequence discovery, homologous oriC search, and strand-biased analyses. The redesigned interface provides additional customization options for oriC prediction. Ori-Finder 2022 is freely available at http://tubic.tju.edu.cn/Ori-Finder/ and https://tubic.org/Ori-Finder/.


Introduction
As a complex and essential process of cell life, DNA replication is strictly regulated to ensure the accurate transfer of genetic material from parents to offspring. Identification and characterization of replication origins (oriCs) can provide new insights into the mechanisms of DNA replication as well as cell cycle regulation, and can facilitate drug development [1], genome design [2], and plasmid construction [3], among other applications. Therefore, various experimental approaches such as two-dimensional agarose gel electrophoresis [4], assay of autonomously replicating sequence activity [5], and marker frequency analysis (MFA) [6] have been developed to identify bacterial oriCs. Microarray-based whole-genome MFA [7] as well as high-throughput sequencing-based MFA [8] with higher resolution have been proposed to generate the replication maps of genomes and to locate oriCs. Detecting interactions between origin DNA and initiator proteins can also provide evidence for predicting oriCs [9].
However, the rapid accumulation of sequenced genomes has made it impossible to identify oriCs in all of them using experimental methods. Therefore, the development of bioinformatics algorithms to predict oriCs on a large scale is particularly important. Classical in silico methods, such as GC skew [10], cumulative GC skew [11], and oligomer skew [12], have been proposed based on DNA asymmetry. Furthermore, Oriloc was developed to predict bacterial oriCs by analyzing local and systematic deviations of base composition within each strand [13]. However, these methods provide only the approximate location of a predicted oriC, not its precise boundaries. In addition, they cannot accurately predict oriCs in bacterial genomes without a typical GC skew, a situation that is sometimes universal within certain phyla, such as Cyanobacteria [14,15]. Although DNA asymmetry is the most common characteristic used for predicting oriCs, Mackiewicz et al. [16] also found that the prediction could be improved by considering dnaA and DnaA box clusters. However, DnaA box motifs are often species-specific, and the oriC is not always close to the dnaA gene in some species. Considering these factors, the Ori-Finder web server was developed to provide users with a more convenient and accurate tool for predicting oriCs [17].
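The skew-based methods mentioned above can be illustrated with a minimal Python sketch. It is not the code of any cited tool: the function names, the non-overlapping window scheme, and the toy sequence are assumptions made for illustration. GC skew is taken as (G - C)/(G + C) per window, and the cumulative curve is its running sum.

```python
from itertools import accumulate


def gc_skew(seq, window=1000):
    """Return (G - C) / (G + C) for consecutive non-overlapping windows."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        # Windows without any G or C get a skew of 0 to avoid division by zero.
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


def cumulative_gc_skew(seq, window=1000):
    """Running sum of the windowed GC skew."""
    return list(accumulate(gc_skew(seq, window)))


if __name__ == "__main__":
    # Toy sequence (assumption): a C-rich half followed by a G-rich half.
    toy = "C" * 30 + "G" * 30
    curve = cumulative_gc_skew(toy, window=10)
    turning_window = min(range(len(curve)), key=curve.__getitem__)
    print(curve, turning_window)
```

On a real chromosome the extrema of the cumulative curve only suggest an approximate region, which is the limitation noted above: the curve says nothing about the boundaries of the oriC itself.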
Since it was introduced in 2008, Ori-Finder has been widely used to help investigators identify oriCs. To date, Ori-Finder has been used to identify oriCs in hundreds of sequenced bacterial genomes in genome reports [18][19][20], and dozens of the predicted oriCs have been experimentally confirmed [21][22][23][24]. Furthermore, Ori-Finder predictions have led to new discoveries. For example, each bacterial chromosome is generally considered to carry a single oriC. However, Ori-Finder predictions indicate that multiple oriCs may occur on a bacterial chromosome [25,26], and this view has been used to explain the experimental results of investigations into single Achromatium cells [27]. The naturally occurring single chromosome of a Vibrio cholerae strain harbors two functional oriCs, which provides strong support for this view [28]. Ori-Finder also provides a large number of oriCs as resources for data mining. In particular, the oriCs identified by Ori-Finder, including those confirmed by experiments in vivo and in vitro, have been organized into the DoriC database [29][30][31], available at https://tubic.org/doric/. Therefore, the data for oriC characteristics can be mined on a large scale [32,33]. For example, vast amounts of oriC data can be used to identify and analyze functional elements, such as DnaA boxes and DnaA-trios [34,35]. Finally, Ori-Finder facilitates analyses of strand-biased biological characteristics that are closely associated with DNA replication, transcription, and other biological processes [10,36].
The Ori-Finder web server and DoriC database have been extensively applied to strand-biased analyses, such as base composition [37,38], gene orientation [39], and codon usage [40]. Ori-Finder has also been referred to as a software tool to identify replichores [41].
Bacterial oriCs generally contain several functional elements, such as DnaA-binding sites, AT-rich DNA unwinding elements (DUEs), and binding sites for proteins that regulate replication initiation [42]. These functional elements play important roles in the initiation of DNA replication and should therefore be considered in the prediction of oriCs. Most bacterial oriCs contain DnaA box clusters that are recognized and bound by DnaA proteins. Therefore, the DnaA box cluster is considered an important characteristic for predicting oriCs [16]. The DnaA box is usually a 9-bp non-palindromic motif, such as the perfect Escherichia coli DnaA box TTATCCACA. Species-specific DnaA box motifs, such as TTTTCCACA in Cyanobacteria and AAACCTACCACC in Thermotoga maritima, have been identified [43]. In addition, degenerate DnaA boxes have also been identified within oriCs in some species, such as the 6-mer ATP-DnaA boxes (AGATCT) in E. coli [44]. Although degenerate DnaA boxes can also bind DnaA protein, only the broadly conserved DnaA box is considered for oriC prediction here.
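As an illustration of how DnaA boxes can be located, the following Python sketch scans both strands for a motif while tolerating a fixed number of mismatches. It is a simplified stand-in rather than the Ori-Finder implementation; the function names, the Hamming-distance criterion, and the default of one mismatch are assumptions.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_dnaa_boxes(seq, motif="TTATCCACA", max_mismatches=1):
    """Return (position, strand, mismatches) for every motif-like window.

    A window is reported when its Hamming distance to the motif ('+' strand)
    or to the reverse complement of the motif ('-' strand) does not exceed
    max_mismatches. Positions are 0-based on the forward strand.
    """
    seq = seq.upper()
    k = len(motif)
    targets = (("+", motif), ("-", reverse_complement(motif)))
    hits = []
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        for strand, target in targets:
            mismatches = sum(a != b for a, b in zip(window, target))
            if mismatches <= max_mismatches:
                hits.append((i, strand, mismatches))
    return hits


if __name__ == "__main__":
    # Toy sequence (assumption): one perfect box and one reverse-strand
    # box carrying a single mismatch.
    toy = "GG" + "TTATCCACA" + "CC" + "TGTGGATAT" + "GG"
    for hit in find_dnaa_boxes(toy):
        print(hit)
```

Because the motif is non-palindromic, both orientations must be scanned. A species-specific motif such as TTTTCCACA can be passed through the `motif` argument.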
The DnaA protein not only interacts with the double-stranded DnaA box, but also binds to single-stranded DNA to promote unwinding. For example, DnaA protein can bind to the single-stranded ATP-DnaA boxes mentioned above. The two-state and loop-back models can explain how DnaA protein melts DNA and stabilizes the unwound region by DnaA-ssDNA interaction [42]. In the two-state model, DnaA protein guided from double-stranded DnaA boxes to the adjacent single-stranded DNA changes from a double- to a single-stranded binding mode. A new oriC element comprising a repeated 3-mer motif (DnaA-trio), found in Bacillus subtilis, promotes DNA unwinding by stabilizing DnaA filaments on a single DNA strand [45]. Consequently, a basal unwinding system (BUS) comprising DnaA boxes and DnaA-trios in bacterial oriCs has been proposed [46]. Subsequent bioinformatic analyses of oriCs from over 2000 bacterial species, together with molecular biology studies of six representative species, found that the BUS is broadly conserved in bacteria [35]. In the loop-back model, integration host factor (IHF) induces DNA to bend backwards, bringing the DUE close to the DnaA protein bound to the DnaA box and thus facilitating protein binding to double- and single-stranded DNA sequences simultaneously. This mechanism has been identified in E. coli [47], and a similar mechanism might also be found in Helicobacter pylori [48], which has a bipartite oriC, and in V. cholerae chromosome 2, whose replication initiation requires the RctB protein rather than the DnaA protein [49].
In addition to binding sites for the DnaA protein, oriC has other binding sites for proteins that regulate replication initiation. Factor for inversion stimulation (Fis) and IHF bind to specific sites and bend oriC DNA to inhibit or facilitate DnaA binding in E. coli [47]. SeqA blocks the recognition of oriC by DnaA by binding to the transiently hemimethylated GATC sequence cluster [50]. The regulatory mechanisms might differ because of the diversity of regulatory proteins and their binding motifs among species. For example, CtrA in Caulobacter crescentus plays a similar role to SeqA and inhibits replication initiation by binding to its motifs (TTAA-N7-TTAA) [51,52]. Wolanski et al. [53] comprehensively summarized the proteins that regulate DNA replication initiation and their binding sites.
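The two sequence motifs named above, GATC and TTAA-N7-TTAA, are simple enough to be located with regular expressions. The sketch below is a hypothetical illustration in Python and not the annotation procedure of any tool; the helper name and the toy sequence are assumptions. Both motifs equal their own reverse complement at the fixed positions, so scanning one strand is sufficient.

```python
import re

# Zero-width lookaheads let finditer report overlapping occurrences.
GATC_SITE = re.compile(r"(?=GATC)")
CTRA_SITE = re.compile(r"(?=TTAA[ACGT]{7}TTAA)")


def motif_positions(seq, pattern):
    """Return the 0-based start positions of every match of pattern in seq."""
    return [m.start() for m in pattern.finditer(seq.upper())]


if __name__ == "__main__":
    # Toy sequence (assumption): two GATC sites and one TTAA-N7-TTAA site.
    toy = "AA" + "GATC" + "TT" + "GATC" + "AA" + "TTAA" + "C" * 7 + "TTAA" + "GG"
    print("GATC:", motif_positions(toy, GATC_SITE))
    print("TTAA-N7-TTAA:", motif_positions(toy, CTRA_SITE))
```

A motif match alone does not show that the corresponding protein binds there, or even that the genome encodes it, so such hits are best read as candidate sites.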
To facilitate a comprehensive understanding of the replication mechanism and sequence characteristics related to oriCs, Ori-Finder 2022 annotates various regulatory proteins and functional elements within oriCs. The updated user interface, prediction framework, visualization module, and analysis module are described in detail below.

Software implementation
Ori-Finder 2022 was deployed using a Linux-Apache-MySQL-PHP structure and was mainly developed in Python and C++. We packaged the pipeline into a container using Docker to ensure reproducible and reliable execution. We also integrated the third-party tools BLAST+ 2.11.0 [54], Prodigal [55], stress-induced structural transitions (SIST) [56], and MEME 5.4.1 [57] into Ori-Finder 2022, and tested the updated server on web browsers such as Firefox, Chrome, Safari, and Microsoft Edge.

Input file
By December 28, 2021, 91.5% of the 362,223 bacterial genomes in the National Center for Biotechnology Information (NCBI) Genome database were draft genomes with scaffold or contig assembly levels. To meet the pressing need to annotate oriCs in these genomes, we updated Ori-Finder to enable oriC prediction in draft genomes (Figure 1A). The updated web server can consequently handle complete or draft bacterial genomes with or without annotations. Ori-Finder 2022 integrates the gene-finding algorithm Prodigal [55] to predict protein-coding genes in unannotated genomes in the FASTA format. If an annotated genome file is uploaded in the GenBank (GBK) format, the annotation information is automatically extracted by parsing the text.

Updated prediction framework
Ori-Finder was originally developed with DNA asymmetry analysis using the Z-curve method, the distribution of DnaA boxes, and indicator genes close to oriCs [17]. To take more oriC characteristics into account, the updated prediction framework of Ori-Finder 2022 adopts a new scoring criterion that quantitatively reflects these characteristics for each intergenic sequence (IGS), and the IGSs with the highest score are predicted as potential oriCs (Figure 1B; Table S1). As a characteristic of base composition, GC asymmetry is widely used for predicting oriCs. Ori-Finder 2022 scores the characteristics of base composition according to the distance to the minimum of the GC disparity (Table S1). Bacterial oriCs are usually adjacent to a dnaA gene, which can serve as an indicator for oriCs, but such indicator genes often differ among bacterial species. Ori-Finder 2022 scores indicator genes based on the lineage and chromosome type entered by users (Table S2). Ori-Finder 2022 scores DnaA boxes according to their numbers and mismatches. In addition, Ori-Finder 2022 identifies other functional elements of oriC, such as the Dam methylation site (GATC) and the DnaA-trio, to screen prediction results if several IGSs share the same highest score during the prediction process. Ori-Finder 2022 can also predict the replication terminus of a complete genome according to the dif motif or the maximum of the GC disparity. For draft genomes, each sequence fragment is predicted using Ori-Finder 2022, and all results are considered together using the same prediction framework. Unlike for a complete genome, the GC disparity minimum of each sequence fragment is used when scoring base composition.
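The disparity curves underlying the base-composition score can be sketched in a few lines of Python. This is a minimal illustration based on the standard cumulative-count definitions of the Z-curve components, not the Ori-Finder 2022 code; the function names and the toy sequence are assumptions, and the scoring weights of Table S1 are not reproduced.

```python
def disparity_curves(seq):
    """Return the AT, GC, RY and MK disparity curves of a DNA sequence.

    At each position n the curves hold, for the first n bases:
      AT = A - T,  GC = G - C,
      RY = (A + G) - (C + T),  MK = (A + C) - (G + T).
    Characters other than A, C, G, T leave the counts unchanged.
    """
    counts = dict.fromkeys("ACGT", 0)
    curves = {"AT": [], "GC": [], "RY": [], "MK": []}
    for base in seq.upper():
        if base in counts:
            counts[base] += 1
        a, c, g, t = (counts[b] for b in "ACGT")
        curves["AT"].append(a - t)
        curves["GC"].append(g - c)
        curves["RY"].append((a + g) - (c + t))
        curves["MK"].append((a + c) - (g + t))
    return curves


def gc_disparity_extrema(seq):
    """Return the 0-based indices of the GC disparity minimum and maximum."""
    gc = disparity_curves(seq)["GC"]
    lowest = min(range(len(gc)), key=gc.__getitem__)
    highest = max(range(len(gc)), key=gc.__getitem__)
    return lowest, highest


if __name__ == "__main__":
    # Toy sequence (assumption): GC disparity falls, rises, then falls again.
    toy = "CCC" + "GGGGGG" + "CCC"
    print(gc_disparity_extrema(toy))
```

In the framework described above, the minimum serves as the reference point for scoring base composition and the maximum as one of the two cues for the replication terminus. For a draft genome the same calculation would be applied to each sequence fragment separately.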

Updated user interface
In line with the updated prediction framework, the user interface for data submission was redesigned to enhance user experience (Figure 2A). Ori-Finder 2022 only requires users to upload the genome file in the FASTA or GBK format to deliver a default oriC prediction, but it also provides several customization parameters. In Ori-Finder 2022, the principal indicator gene is dnaA by default and is adjusted according to the lineage and chromosome type entered by users (Table S2). The default DnaA box is the standard motif (TTATCCACA) of E. coli, while a built-in DnaA box motif can be selected according to the organism or lineage of the input genome. The drop-down checkboxes of the DnaA box motif and dif motif are linked to each other for user convenience. Because of the diversity of DnaA boxes, Ori-Finder 2022 allows users to define their own DnaA box motifs. Users can select or define the dif motif in a similar way. Users can also choose to perform strand-biased analysis for complete genomes.

Updated visualization module
The updated visualization module in Ori-Finder 2022 contains an interactive Z-curve graph and a characteristic visualization of the oriC sequence (Figure 1C). Global or local information of the genome can be grasped at a glance from the interactive Z-curve graph, which displays the four disparity curves representing the distributions of adenine/thymine (A/T), guanine/cytosine (G/C), purine/pyrimidine (R/Y), and amino/keto (M/K) bases, respectively, together with the distributions of DnaA boxes, indicator genes, potential oriCs, and the replication terminus (Figure 2B). The red, green, blue, and yellow line graphs indicate the AT, GC, RY, and MK disparity curves, respectively, calculated according to the Z-curve method. The purple vertical lines display the density of DnaA boxes, which indicates the existence of DnaA box clusters. Red, dark blue, and light blue dotted lines indicate the locations of indicator genes, oriCs, and the replication terminus, respectively. The indicator genes are identified by parsing the annotation information of the genome or by BLAST searches with protein sequences of known indicator genes. Users can select all the information or only several datasets to analyze according to their requirements. The graph also supports zooming for detailed analysis. Moreover, when users hover the cursor over the dotted lines marking predicted oriCs, indicator genes, or the replication terminus, the exact locations and other related information are automatically displayed. The other visualization result provided by Ori-Finder 2022 is the characteristic visualization of the oriC sequence, which displays the distribution of functional elements in oriC. The first part is a line graph (Figure 2C, top), which shows the transition probability of each base pair in the oriC sequence calculated using the stress-induced duplex destabilization method [56], which analyzes stress-driven DNA strand separation. Five lines with gradient colors are calculated using different negative superhelicity values, and the peaks correspond to the AT-rich sequence that might serve as a DUE. The second part is an oriC sequence schematic diagram showing the distribution of functional elements, such as DnaA boxes, DnaA-trios, ATP-DnaA boxes, and binding sites of SeqA, CtrA, Fis, and IHF found in the predicted oriC (Figure 2C, middle). The third part is the sequence of the predicted oriC, in which the functional elements are labeled with different colors or symbols (Figure 2C, bottom). Indicator genes upstream and downstream of the predicted oriC are also labeled. To display the possible functional elements as comprehensively as possible, all possible DnaA-trios are labeled, and a less conserved DnaA box with 4 mismatches from the standard DnaA box motif that lies adjacent to potential DnaA-trios is also labeled, even though its number of mismatches might exceed the limit entered by users.

Updated analysis module
Ori-Finder 2022 was expanded to include new analysis modules (Figure 1D). Combined with the different elements labeled in the oriC sequence (Figure 2C), the annotation of the corresponding regulatory proteins, such as Fis, SeqA, and CtrA (Figure 2D), by Ori-Finder 2022 might provide new insights into the related regulatory mechanisms. In addition, the repeat sequences in predicted oriCs discovered by MEME are listed in an HTML table to reveal possible new motifs (Figure 2E). Strand-biased analysis can reveal the distributions of genes and bases in the leading and lagging strands of a complete genome (Figure 2F). Sequences homologous to predicted oriCs are searched using BLAST against the DoriC database [31], and the BLAST results, linked to the corresponding entries in the DoriC database, are also provided (Figure 2G).
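The gene-level part of a strand-biased analysis can be sketched as follows, assuming a circular chromosome with a single origin and a single terminus. This Python example is an illustration only, not the Ori-Finder 2022 module; the function names, the use of gene midpoints, and the toy coordinates are assumptions. A gene is counted on the leading strand when its coding strand runs in the direction of replication fork movement.

```python
def on_leading_strand(gene_mid, gene_strand, ori, ter, genome_len):
    """Return True if a gene is co-oriented with replication.

    Coordinates are 0-based positions on the forward strand of a circular
    chromosome. The replichore running from ori to ter in the direction of
    increasing coordinates replicates in the '+' direction; the other
    replichore replicates in the '-' direction.
    """
    gene_offset = (gene_mid - ori) % genome_len
    ter_offset = (ter - ori) % genome_len
    in_forward_replichore = gene_offset < ter_offset
    return (gene_strand == "+") == in_forward_replichore


def leading_strand_fraction(genes, ori, ter, genome_len):
    """Fraction of (midpoint, strand) genes located on the leading strand."""
    leading = sum(
        on_leading_strand(mid, strand, ori, ter, genome_len)
        for mid, strand in genes
    )
    return leading / len(genes)


if __name__ == "__main__":
    # Toy genome (assumption): 100 bp circle, oriC at 10, terminus at 60.
    toy_genes = [(20, "+"), (20, "-"), (80, "-"), (5, "-")]
    print(leading_strand_fraction(toy_genes, ori=10, ter=60, genome_len=100))
```

The modulo arithmetic handles replichores that wrap around the coordinate origin, as for the last toy gene at position 5.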

Results and discussion
Here, Yersinia pestis KIM+ is presented to illustrate details of the predicted results of Ori-Finder 2022. The structure of oriC in Y. pestis KIM+ is similar to that in E. coli [58]. Figure 2 shows the main visualization and analytical results of the oriC predicted by Ori-Finder 2022, and the complete predicted results are available as a sample result at our website (http://tubic.tju.edu.cn/Ori-Finder2022/public/index.php/retrieve/sample_result/). Possibly because of rearrangement, the four disparity curves of this genome fluctuate at their extrema [58] and therefore do not seem to provide sufficient evidence to identify an oriC (Figure 2B). Ori-Finder 2022 identified an IGS of 380 bp as the potential oriC by taking more characteristics into consideration, such as indicator genes, DnaA box clusters, and other functional elements. Like that in E. coli, the predicted oriC in Y. pestis KIM+ is located between gidA and mioC. The sequence corresponding to the peak of the lines calculated by SIST also contains DnaA-trios and three ATP-DnaA boxes (AGATCT) and is therefore likely to contain a site of DNA duplex unwinding (Figure 2C). The genome of Y. pestis KIM+ encodes regulatory proteins such as Fis, SeqA, IHF, and Dam (Figure 2D), and possible binding sites for the corresponding proteins are also found in the predicted oriC. Although the genome of Y. pestis KIM+ does not appear to encode CtrA proteins, two possible CtrA binding sites were identified within the predicted oriC. The repeat sequences in the predicted oriC were discovered using MEME and might reveal new oriC motifs. For example, two of the five motifs in the first set (ARGATC) overlapped with predicted ATP-DnaA boxes. In the second set (GTTATGCACAT), three of the five motifs overlapped with the predicted DnaA boxes, and the other two contained DnaA box-like motifs with three and four mismatches, respectively, from the perfect DnaA box (TTATCCACA) in E. coli. A dif site was located near the top of the GC disparity curve. Strand-biased analysis revealed biases in some features between the leading and lagging strands. The lengths of the leading (50.66%) and lagging (49.34%) strands were almost identical. The leading strand included 2317 genes (57.32%); this weak strand bias of genes is probably a result of rearrangement. Base contents of the leading and lagging strands were also calculated (Figure 2F). The predicted result was considered reliable because homologous sequences were found in the DoriC database (Figure 2G).

Conclusion
Ori-Finder has been widely applied by biologists over the past decade to predict bacterial oriCs, and some predictions have been experimentally confirmed [21][22][23][24] or supported by various studies [45,59]. For example, the oriCs of 132 gut microbes in metagenomic samples predicted by metagenomic analyses and by Ori-Finder were consistent (R² = 0.98, P < 1 × 10⁻³⁰) [59]. The bacterial oriC element, DnaA-trio, was found in 85% of oriCs predicted or confirmed from > 2000 species. Numerous bacterial oriCs predicted by Ori-Finder have been used for large-scale data mining and analysis. Ori-Finder 2022 can now predict oriCs in complete or draft genomes based on an updated prediction framework and provides an interactive visualization module as well as a new analysis module. The oriCs predicted by Ori-Finder 2022 and by its original version match those deposited in DoriC 6.5 for 85% and 79% of the genomes, respectively. DoriC 6.5 is a widely used and thoroughly checked database version with oriCs in 2196 genomes, including those experimentally confirmed. Ori-Finder will be continuously improved by incorporating state-of-the-art research results and integrating additional analysis modules. We plan to provide users with an integrated platform for comprehensive prediction, analysis, and knowledge mining to determine microbial replication origins. In the future, this will be achieved by integrating Ori-Finder 2 [60], which predicts archaeal oriCs, and Ori-Finder 3 [61], an online service for predicting replication origins in Saccharomyces cerevisiae.