iORI-PseKNC: A predictor for identifying origin of replication with pseudo k-tuple nucleotide composition

https://doi.org/10.1016/j.chemolab.2014.12.011

Highlights

  • A novel feature, called PseKNC, was proposed to formulate the DNA sequences.

  • The overall accuracy of 83.72% was achieved for predicting origin of replication.

  • A free web-server iORI-PseKNC was constructed.

Abstract

Initiation at replication origins is an extremely important step of DNA replication. The distribution of replication origin regions (ORIs) is the major determinant of the timing of genome replication. Thus, correctly identifying ORIs is crucial to understanding the DNA replication mechanism. With the avalanche of genome sequences generated in the post-genomic age, it is highly desirable to develop computational methods for rapidly, effectively and automatically identifying the ORIs in genomes. In this paper, we developed a predictor called iORI-PseKNC for identifying ORIs in the Saccharomyces cerevisiae genome. In the predictor, based on the concept of the global and long-range sequence-order effects of DNA sequences, a feature called “pseudo k-tuple nucleotide composition” (PseKNC) was used to encode the DNA sequences by incorporating six local structural properties of the 16 dinucleotides. An overall success rate of 83.72% was achieved in the jackknife cross-validation test on an objective benchmark dataset. Comparisons demonstrate that the new predictor is superior to other methods. As a user-friendly web-server, iORI-PseKNC is freely accessible at http://lin.uestc.edu.cn/server/iORI-PseKNC. We hope that iORI-PseKNC will become a useful tool, or at least a complement to existing methods, for identifying ORIs.

Introduction

In cell division, DNA replication is a highly orchestrated process of producing an identical replica from the original DNA molecule [1]. This process commonly initiates at specific regions called origin of replication regions (ORIs). Most bacterial genomes have only a single ORI [2]. However, in eukaryotic genomes, because of their large size and the limits on nucleotide incorporation during DNA synthesis, multiple ORIs are needed to complete replication in a reasonable period of time [3]. In Saccharomyces cerevisiae, the ORI lies within an autonomously replicating sequence (ARS) element [4], which consists of three domains: A, B and C. The A domain contains an essential ARS consensus sequence (T/A)TTTAT(A/G)TTT(T/A), while the B domain tends to be helically unstable and additionally contains a number of short sequence motifs that contribute to origin activity [5], [6]. The C domain plays an important role in the interaction between DNA and regulatory proteins [5].
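
As an illustration of the consensus notation above, the following minimal sketch scans a sequence for matches to (T/A)TTTAT(A/G)TTT(T/A) on both strands. It assumes the consensus exactly as printed here, allows no mismatches, and accepts only the letters A, C, G and T; the function names `find_acs` and `reverse_complement` are hypothetical helpers introduced for this example and are not part of iORI-PseKNC.

```python
import re

# Consensus (T/A)TTTAT(A/G)TTT(T/A) as a regular expression.
# The lookahead (?=...) lets overlapping matches be reported as well.
ACS_PATTERN = re.compile(r"(?=([TA]TTTAT[AG]TTT[TA]))")
ACS_LENGTH = 11

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_acs(seq):
    """Return sorted (start, strand, match) tuples for exact consensus matches.

    start is 0-based and always refers to the forward strand, so a
    minus-strand hit covers seq[start:start + ACS_LENGTH] as well.
    """
    seq = seq.upper()
    hits = [(m.start(), "+", m.group(1)) for m in ACS_PATTERN.finditer(seq)]
    rc = reverse_complement(seq)
    for m in ACS_PATTERN.finditer(rc):
        # Map the position on the reverse complement back to the forward strand.
        start = len(seq) - (m.start() + ACS_LENGTH)
        hits.append((start, "-", m.group(1)))
    return sorted(hits)
```

Calling `find_acs` on a chromosome sequence returns candidate sites only. An exact match to the consensus does not by itself indicate a functional origin.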

During replication, the two strands of the DNA double helix at ORIs are separated and unwound by helicases to give DNA polymerase access [7]. Subsequently, the semiconservative replication strategy is used to synthesize daughter strands based on the parental template strands [8]. Replication is activated only once in each cell cycle to avoid amplification and maintain genome integrity [9]. DNA replication is associated with gene transcription and expression [10]. For example, an analysis of the distribution of ORIs showed that replication initiation events were absent from transcription start sites but were highly enriched in adjacent, downstream sequences [11]. Therefore, it is crucial to understand the regulatory mechanism of cell division and establish the network of a cell cycle so as to reveal the mechanism involved in DNA replication. Accurate identification of ORIs is an essential prerequisite for further studying and understanding the DNA replication mechanisms.

Chromatin immunoprecipitation (ChIP) is the most popular technique to determine ORIs [12]. Although the technique can precisely identify the ORIs, with the avalanche of genome sequences generated in the post-genomic era, it is expensive and time-consuming for experimental approaches to perform genome-wide identification of ORIs. Computational methods, in contrast, can be applied to the entire genome without these disadvantages. Based on the consensus sequence [13], several theoretical methods have been proposed to identify ORIs accurately. Marie-Claude et al. predicted the ORIs by analyzing asymmetry indices of sequence [14]. The signal of nucleosome occupancy was used as a likely candidate to determine ORI distribution [15], [16]. Although it is of great interest and value, the method based on the ARS consensus sequence (ACS) is not sufficient to predict ORIs [17] because there are 12,000 ACS sites in the S. cerevisiae genome, and only 400 (roughly 1 in 30) are associated with ORIs [18]. Recently, two DNA structural properties, namely DNA bendability [19] and hydroxyl radical cleavage intensity [20], [21], were proposed to predict ORIs in the S. cerevisiae genome [22]. Although these methods have achieved encouraging results, they are still limited in their accuracy and resolution. Moreover, no web-server was provided for most of these methods, and hence their usage is quite limited, especially for the majority of experimental scientists.

It has been reported that the local DNA structural properties [23] and their impacts on the global sequence effects are important feature signals for DNA functional elements and have been used to identify nucleosome occupancy [24], recombination spots [25] and exon/intron splice sites [26]. DNA conformation may be changed by the ionic bonding effects of the methylated form of specific bases [27]. In addition, cell differentiation is driven by the dynamic positioning of nucleosomes arising from chemical reactions, and different cell lines have different ORIs [28].

In view of this, the present study was initiated in an attempt to develop a new method for predicting S. cerevisiae ORIs based on the physicochemical properties of DNA. First, a valid benchmark dataset was constructed to train and test the proposed method. Second, the DNA sequences were encoded with the pseudo k-tuple nucleotide composition (PseKNC), which can reflect the intrinsic correlation between local/global features and the ORIs. Third, a support vector machine (SVM) was used to perform the prediction, and the rigorous jackknife cross-validation test was used to evaluate the performance of the proposed method. Finally, based on the proposed method, a user-friendly web-server, called iORI-PseKNC, was established for basic academic study and application of ORIs.
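
To make the encoding step more concrete, the sketch below shows one common way a PseKNC-style feature vector can be assembled: normalized k-tuple frequencies (local composition) followed by λ correlation factors computed from dinucleotide property values (longer-range sequence order). It follows the general PseKNC formulation as we understand it and is not the authors' implementation. The values of `k`, `lam` and `w` are arbitrary illustration choices, and `PLACEHOLDER_PROPS` contains seeded random numbers standing in for the six standardized structural parameters, not real measurements. Input is assumed to contain only A, C, G and T.

```python
from itertools import product
import random

BASES = "ACGT"
DINUCLEOTIDES = ["".join(p) for p in product(BASES, repeat=2)]  # 16 dinucleotides
PROPERTIES = ["twist", "tilt", "roll", "shift", "slide", "rise"]

# PLACEHOLDER values only: seeded random numbers, NOT real structural data.
# Replace with standardized property values before any real use.
_rng = random.Random(0)
PLACEHOLDER_PROPS = {
    d: [_rng.gauss(0.0, 1.0) for _ in PROPERTIES] for d in DINUCLEOTIDES
}


def kmer_frequencies(seq, k):
    """Normalized frequency of every k-tuple, in lexicographic order."""
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1
    for i in range(total):
        counts[seq[i:i + k]] += 1
    return [counts[m] / total for m in kmers]


def correlation_factor(seq, j, props):
    """Tier-j factor: mean squared property difference between
    dinucleotides located j positions apart along the sequence."""
    n_pairs = len(seq) - j - 1
    total = 0.0
    for i in range(n_pairs):
        a = props[seq[i:i + 2]]
        b = props[seq[i + j:i + j + 2]]
        total += sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return total / n_pairs


def pseknc(seq, k=3, lam=2, w=0.5, props=PLACEHOLDER_PROPS):
    """Return a vector with 4**k composition terms and lam correlation terms."""
    seq = seq.upper()
    freqs = kmer_frequencies(seq, k)
    thetas = [correlation_factor(seq, j, props) for j in range(1, lam + 1)]
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]
```

For a 300 bp sample, `pseknc(sample)` with these illustrative settings yields a 66-dimensional vector (64 trinucleotide terms plus 2 correlation terms) that sums to 1 and could be passed to a classifier such as an SVM. The weight `w` controls how much the sequence-order terms contribute relative to the composition terms.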

Section snippets

Datasets

A total of 740 S. cerevisiae ORIs were collected from OriDB [29] (http://www.oridb.org/). The following steps were used to construct a reliable benchmark dataset. First, the ORIs with ambiguous annotations such as “likely” and “dubious” were excluded because they lack confidence. This left 410 experimentally confirmed ORIs, each 300 bp long. Subsequently, 410 non-ORI samples, each 300 bp long, were extracted from −600 bp to −300 bp upstream of the 410 ORIs. It is well known that

Profile analysis of local structural properties

The specific conformation of a DNA sequence can be recognized by regulatory proteins [47], [48], [49]. In order to explore the specific features possessed by ORI sequences, six structural parameters (twist, tilt, roll, shift, slide and rise) of both ORI and non-ORI sequences were calculated to characterize the local geometry with a step of one base pair in the S. cerevisiae genome. Using graphic approaches to study ORIs can provide an intuitive picture and useful insights for revealing

Conclusion

Correct identification of ORIs is the first step toward understanding the replication mechanisms. The current study developed a PseKNC-based method which can incorporate the local and global sequence-order information for identifying the ORIs. The physicochemical properties were proposed to formulate the DNA sequences. Statistical analysis shows that ORI sequences are dramatically different from the non-ORI sequences in the sequence structure which may be the key feature recognized by regulatory

Conflict of interests

The authors have no conflict of interest concerning this work.

Acknowledgments

The authors would like to thank the anonymous reviewers for their constructive comments. This work was supported by the National Natural Science Foundation of China (Nos. 61202256, 61301260 and 61100092), the Natural Science Foundation of Hebei Province (No. C2013209105), and the Fundamental Research Funds for the Central Universities (No. ZYGX2013J102).

References (53)

  • I. Jimenez-Useche et al., The effect of DNA CpG methylation on the dynamic conformation of a nucleosome, Biophys. J. (2012)
  • X. Zhou et al., Predicting methylation status of human DNA sequences by pseudo-trinucleotide composition, Talanta (2011)
  • Q.Z. Li et al., The recognition and prediction of sigma70 promoters in Escherichia coli K-12, J. Theor. Biol. (2006)
  • W. Chen et al., PseKNC: a flexible web server for generating pseudo k-tuple nucleotide composition, Anal. Biochem. (2014)
  • Y.C. Zuo et al., The hidden physical codes for modulating the prokaryotic transcription initiation, Phys. A (2010)
  • Y.X. Zhang, An improved QSPR method based on support vector machine applying rational sample data selection and genetic algorithm-controlled training parameters optimization, Chemometr. Intell. Lab. Syst. (2014)
  • X. Huang et al., A novel tree kernel support vector machine classifier for modeling the relationship between bioactivity and molecular descriptors, Chemometr. Intell. Lab. Syst. (2013)
  • C. Nantasenamat et al., Quantitative structure–property relationship study of spectral properties of green fluorescent protein with support vector machine, Chemometr. Intell. Lab. Syst. (2013)
  • P.M. Feng et al., iHSP-PseRAAAC: identifying the heat shock protein families using pseudo reduced amino acid alphabet composition, Anal. Biochem. (2013)
  • E.N. Trifonov, Base pair stacking in nucleosome DNA and bendability sequence pattern, J. Theor. Biol. (2010)
  • T.D. Halazonetis, Conservative DNA replication, Nat. Rev. Mol. Cell Biol. (2014)
  • F. Coin et al., DNA in 3R: repair, replication, and recombination, Mol. Biol. Int. (2012)
  • C. Cayrou et al., New insights into replication origin characteristics in metazoans, Cell Cycle (2012)
  • T. Valovka et al., Transcriptional control of DNA replication licensing by Myc, Sci. Rep. (2013)
  • M.M. Martin et al., Genome-wide depletion of replication initiation events in highly transcribed regions, Genome Res. (2011)
  • J.V. Van Houten et al., Mutational analysis of the consensus sequence of a replication origin from yeast chromosome III, Mol. Cell. Biol. (1990)