Identification of exon locations in DNA sequences using a fractional digital anti-notch filter

https://doi.org/10.1016/j.bspc.2022.104362

Highlights

  • An accurate exon identification method using a new low order fractional digital anti-notch filter was proposed.

  • The proposed filter extracts the period-3 frequency component more selectively.

  • The proposed method achieves an accuracy of 96% and an AUC of 93%.

Abstract

Identification of protein coding region (exon) locations in DNA sequences is a fundamental initial step in genomic signal processing (GSP). Several techniques have already been applied to this challenging task, but improvements are still needed. Transform-based methods and digital filtering are among the most widely used techniques. They exploit the period-3 property of protein coding regions. This paper proposes the application of a narrowband bandpass fractional digital filter to extract more selectively the single frequency component at f=1/3 from DNA sequences. The ideal fractional digital anti-notch filter has an infinite amplitude at the central frequency and two tuning parameters, which may be used to independently adjust the central frequency and the amplitude frequency response. The ideal filter has been approximated and implemented efficiently as an infinite impulse response (IIR) filter. The effectiveness of the proposed method has been assessed with common performance evaluation metrics on DNA sequences taken from the National Center for Biotechnology Information (NCBI) and HMR195 datasets, using different numerical transformations including Voss mapping and the electron–ion interaction potential (EIIP) representation. In addition to overcoming the sliding window size problem encountered in transform-based methods, the proposed method has demonstrated superiority over existing state-of-the-art methods for exon location identification on benchmark datasets.

Introduction

According to their cell structures, living organisms are divided into two categories: prokaryotes and eukaryotes. Prokaryotes lack a defined nucleus and their deoxyribonucleic acid (DNA) is located in a region of the cytoplasm called the nucleoid. In eukaryotic cells, the DNA, mostly contained in the nucleus, is made up of genes and intergenic regions. In turn, eukaryotic genes are composed of alternating exons (protein coding regions) and introns (non-coding regions). Unlike eukaryotic cells, prokaryotic cells do not have introns. DNA is a molecule formed of two complementary strands of four types of nucleotides. The four nucleotide bases are adenine (A), cytosine (C), guanine (G), and thymine (T). The order of nucleotides along the DNA molecule is called the DNA sequence. It carries the genetic information that is used to make proteins. Proteins are large molecules made up of several amino acids, synthesized via a two-step biological process called gene expression, which consists of transcription followed by translation. In this process, introns are removed and exons are reconnected by splicing, leading to the production of mature messenger RNA (mRNA) molecules containing only the coding sequence that will be used for protein synthesis (see Fig. 1). Each amino acid in the protein is specified by codons. A codon, also called a triplet, is a group of three bases. Since four bases (A, T, C, G) give 4^3 = 64 possible codons but only 20 amino acids are encoded, the genetic code is redundant, i.e., a single amino acid may be specified by more than one codon [1].

Fast and accurate identification of exon locations in DNA sequences is a fundamental initial step in genomic data analysis that would lead to a better understanding of the structures and functions of proteins. Exon location identification can also help in disease diagnosis and drug discovery. Since biological experiments are costly and time-consuming, computational approaches are widely used instead and can accomplish this task quickly. From a computational perspective, the presence of introns in eukaryotic DNA makes exon location detection more difficult than in prokaryotic DNA. Methods proposed in the literature for identifying protein coding regions computationally include machine learning, support vector machines, artificial neural networks, and digital signal processing (DSP) techniques such as Fourier transform-based methods, wavelet-based methods, and digital filtering techniques [2]. The use of DSP techniques to process genomic data such as DNA sequences is referred to as genomic signal processing (GSP) [3]. The main steps involved in GSP methods for exon identification are shown in Fig. 2. To apply GSP methods, the character string of the DNA sequence first needs to be mapped into numerical sequences. These may therefore be viewed as discrete signals that are a function of the position of the base in the sequence. Many transformations exist to convert DNA sequences into discrete signals, such as the Voss representation, electron–ion interaction potentials (EIIP), 2-bit binary representation, Z-curve, etc. Most of these transformations are reviewed in [2]. After numerical mapping come the DSP techniques, which exploit the triplet periodicity or period-3 property (the period being equal to 3 bases) of the base sequences within protein coding regions to discriminate between exons and introns. This property is usually not found in introns and intergenic regions. The triplet periodicity property states that the power spectral density (PSD) of a DNA sequence of length N, mapped to a numerical sequence, exhibits a high spectral peak at the discrete frequency k=N/3. It has been used as a good indicator in many exon location identification algorithms [3], [4], [5], [6].
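The period-3 property described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the Voss mapping (one binary indicator sequence per base), uses the commonly cited EIIP values (the text only names the mapping), and evaluates the DFT directly at the single bin k=N/3 on toy sequences. The function names are hypothetical.

```python
import cmath

# Commonly cited EIIP values (an assumption; the text only names the mapping).
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}


def voss_indicators(seq):
    """Voss mapping: one 0/1 indicator sequence per base."""
    seq = seq.upper()
    return {b: [1.0 if s == b else 0.0 for s in seq] for b in "ACGT"}


def eiip_signal(seq):
    """EIIP mapping: a single real-valued sequence."""
    return [EIIP[s] for s in seq.upper()]


def dft_power(x, k):
    """|X[k]|^2 of a real sequence x, evaluated directly at DFT bin k."""
    n = len(x)
    acc = sum(v * cmath.exp(-2j * cmath.pi * k * i / n) for i, v in enumerate(x))
    return abs(acc) ** 2


def period3_power_voss(seq):
    """Sum of the four indicator spectra at k = N/3 (N a multiple of 3)."""
    k = len(seq) // 3
    return sum(dft_power(x, k) for x in voss_indicators(seq).values())


# Toy inputs: a perfectly period-3 string and a period-4 string, both of length 60.
period3_toy = "ATG" * 20
period4_toy = "ACGT" * 15

print(period3_power_voss(period3_toy))  # large peak at k = N/3
print(period3_power_voss(period4_toy))  # essentially zero at k = N/3
print(dft_power(eiip_signal(period3_toy), len(period3_toy) // 3))
```

Real exons are far less regular than the toy string, so in practice the peak at k=N/3 is measured relative to the rest of the spectrum or tracked along the sequence rather than read as an absolute value.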

Among the first GSP algorithms using the triplet periodicity property is the Fourier transform-based spectral analysis method proposed by Tiwari et al. [7]. It has been improved since, but the challenge of Fourier transform-based methods such as the short-time Fourier transform (STFT) is the size of the sliding window over which the DSP is computed. Several techniques have been proposed to deal with this by fixing the window length or introducing the wavelet transform [8], [9]. However, a direct application of the wavelet transform to exon identification is inappropriate since the protein coding regions present the same frequency under different scales. This issue has been addressed by introducing modifications such as the Gaussian window in the STFT, which can analyze a specific periodicity at a continuously varying scale [8].

To avoid the sliding window, GSP techniques based on digital filters were developed. Both narrowband bandpass infinite impulse response (IIR) and finite impulse response (FIR) filters have been used in exon location identification, with their passband centered at the discrete frequency f=1/3 according to the aforementioned period-3 property [10]. Among the first methods is the multistage IIR anti-notch filter with good stopband attenuation proposed by Vaidyanathan and Yoon [11]. Similarly, Guan et al. [12] used an IIR multirate filter model to reduce the background noise of the output DNA signal, while Kakumani et al. [13] proposed a digital statistically optimal null filter (SONF) to detect short exons. For further improvement, Tomar et al. [14] introduced a harmonic suppression (HS) filter with minimum variance (MV) to reduce intron region power where many DSP-based methods fail. Heba et al. [15] and Ramachandran et al. [16] exploited the performance of the inverse Chebyshev-II digital filter, whereas Barman et al. [17] studied the IIR anti-notch filter with different structures such as harmonic suppression comb, lattice, and cascaded lattice in terms of signal-to-noise ratio (SNR). An improvement in background noise reduction was proposed by Singh and Srivastava [18] with a Savitzky-Golay (S-G) filter. Some recent works have focused on comparative studies of the performance of IIR and FIR filters in exon location detection [10]. Although these methods are efficient, improvements are still needed.
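To make the filtering idea concrete, the sketch below applies a conventional integer-order, second-order IIR anti-notch filter (derived from a second-order allpass section) with its peak at f=1/3. It is only an illustration of this family of methods and is not the fractional filter proposed in this paper. The pole radius r=0.98, the toy sequence, and all function names are assumptions made for the example.

```python
import cmath
import math
import random

OMEGA0 = 2 * math.pi / 3  # period-3 frequency: f = 1/3 cycles per base


def antinotch_coeffs(r, omega0=OMEGA0):
    """Coefficients (g, a1, a2) of H(z) = g(1 - z^-2) / (1 + a1 z^-1 + a2 z^-2).

    H(z) = (1 - A(z)) / 2 for a second-order allpass A(z) with pole radius r,
    which gives unit gain at omega0 and zero gain at DC and at omega = pi.
    """
    g = (1 - r * r) / 2
    a1 = -(1 + r * r) * math.cos(omega0)
    a2 = r * r
    return g, a1, a2


def gain(omega, r):
    """Magnitude response |H(e^{j omega})|."""
    g, a1, a2 = antinotch_coeffs(r)
    z1 = cmath.exp(-1j * omega)
    return abs(g * (1 - z1 * z1) / (1 + a1 * z1 + a2 * z1 * z1))


def antinotch_filter(x, r):
    """Run the difference equation y[n] = g(x[n] - x[n-2]) - a1 y[n-1] - a2 y[n-2]."""
    g, a1, a2 = antinotch_coeffs(r)
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = g * (xn - x2) - a1 * y1 - a2 * y2
        y.append(yn)
        x2, x1 = x1, xn
        y2, y1 = y1, yn
    return y


def period3_profile(seq, r=0.98):
    """Sum over the four Voss indicator sequences of the squared filter output."""
    seq = seq.upper()
    profile = [0.0] * len(seq)
    for base in "ACGT":
        x = [1.0 if s == base else 0.0 for s in seq]
        for n, yn in enumerate(antinotch_filter(x, r)):
            profile[n] += yn * yn
    return profile


# Toy sequence: random flanks around an artificial, perfectly period-3 region.
rng = random.Random(0)
flank = lambda n: "".join(rng.choice("ACGT") for _ in range(n))
toy = flank(300) + "ATG" * 100 + flank(300)
profile = period3_profile(toy)
```

No sliding window is involved: the filter runs once over each indicator sequence, and the pole radius r sets the trade-off between frequency selectivity (r close to 1) and how quickly the output follows exon boundaries.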

Large amounts of genomic data are continuously generated by intensive sequencing of many organisms. As a result, fast and efficient techniques to analyze the content of these data have become increasingly important over the years. Accurate identification of protein coding region locations by GSP methods is one of the fundamental issues in genomic data analysis. The motivation behind the work described in this paper was to develop a simple and accurate method to detect the locations of coding regions in DNA sequences of eukaryotes by using a new narrowband anti-notch fractional digital filter to extract more selectively the single frequency component at f=1/3 from DNA sequences. Fractional filters have gained researchers' attention in recent years because of their design flexibility and better performance compared to their integer-order counterparts [19], [20], [21], [22], [23]. Further, as a filtering approach, the method avoids the sliding window size problem encountered in transform-based methods. The proposed filter is characterized by two tuning parameters that determine the central frequency and the amplitude at that frequency. The resulting frequency response has a high amplitude, a narrow passband, and no stopband ripples. These features fit the specifications of very narrowband anti-notch filters and therefore make the proposed filter well suited to accurate exon identification by exploiting the period-3 property. The proposed approach outperforms state-of-the-art results on DNA sequences from benchmark datasets.

The remainder of the paper is organized as follows. The proposed methodology is described in detail in Section 2. The datasets and the obtained results, together with their discussion, are presented in Section 3. Section 4 provides concluding remarks and directions for future work.

Section snippets

Numerical representation of DNA sequences

As shown in Fig. 2, the first step in GSP methods is the numerical transformation of DNA sequences. According to [2], [24], several transformations exist and can be divided into two groups. The first group, "fixed mapping" (FM), comprises methods that use arbitrary numbers to convert the DNA character string. These numbers have no biological meaning and serve only computational purposes. The FM group includes the Voss representation, real number, integer, complex, 2-bit binary, tetrahedron, and quaternion

DNA databases

To evaluate the proposed method, eukaryotic genes containing single to multiple exons were taken from the HMR195 [35] and NCBI [36] datasets and are reported in Table 1. These genes have been used in previous GSP studies based on digital filters for exon identification [11], [18], [37], [38]. The main study focuses on the F56F11.4 gene (C. elegans), and simulations were run in MATLAB on a 64-bit machine with an i7-6700HQ CPU and 8 GB of RAM.

Fractional digital filter parameter selection

In order to find the proposed fractional digital filter best

Conclusion

In this paper, a fractional digital anti-notch filter-based method is proposed for the identification of coding regions in the DNA sequences of eukaryotic cells. The proposed filter has favorable features, such as a high-amplitude frequency response, the absence of stopband ripples, and a narrow passband, which are required to achieve accurate exon detection by exploiting the period-3 property. The bandpass filter frequency is centered at f=1/3 as required by the period-3 property in coding

Funding resources

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (45)

  • Q. Zheng et al., Gene prediction by the noise-assisted MEMD and wavelet transform for identifying the protein coding regions, Biocybern. Biomed. (2021)
  • M.K. Hota et al., Identification of protein coding regions using antinotch filters, Digit. Signal Process. A Rev. J. (2012)
  • S.S. Sahu et al., Identification of Protein-Coding Regions in DNA Sequences Using A Time-Frequency Filtering Approach, Genomics, Proteomics & Bioinformatics (2011)
  • P.P. Vaidyanathan, Genomics and proteomics: a signal processor’s tour, IEEE Circuits Syst. Mag. (2004)
  • N. Yu et al., Survey on encoding schemes for genomic data representation and feature learning—from signal processing to machine learning, Big Data Min. Anal. (2018)
  • S.A. Marhon et al., Gene prediction based on DNA spectral analysis: a literature review, J. Comput. Biol. (2011)
  • J. Tuqan, A. Rushdi, A DSP perspective to the period-3 detection problem, IEEE International Workshop on Genomic Signal...
  • S. Tiwari et al., Prediction of probable genes by Fourier analysis of genomic sequences, Bioinformatics (1997)
  • J.P. Mena-Chalco et al., Identification of protein coding regions using the modified Gabor-Wavelet transform, IEEE/ACM Trans. Comput. Biol. Bioinforma. (2008)
  • S.A. Marhon et al., Prediction of Protein Coding Regions Using a Wide-Range Wavelet Window Method, IEEE/ACM Trans. Comput. Biol. Bioinforma. (2016)
  • P. Vaidyanathan, B.-J. Yoon, Digital filters for gene prediction applications, in: Conference Record of the...
  • Raymond Guan, J. Tuqan, Multirate DSP models for gene detection, Record of the Thirty-Eighth Asilomar Conference on...