Genomic Island Prediction via Chi-Square Test and Random Forest Algorithm

Genomic islands are related to microbial adaptation and carry genomic characteristics different from those of the host. Therefore, many methods have been proposed to distinguish genomic islands from the rest of the genome by evaluating their sequence composition. Many sequence features have been proposed, but most of them have not been applied to the identification of genomic islands. In this paper, we present a scheme to predict genomic islands using the chi-square test and the random forest algorithm. We extract seven kinds of sequence features and select the important ones with the chi-square test. All the selected features are then input into the random forest to predict the genomic islands. Three experiments and comparisons show that the proposed method achieves the best performance. This understanding can be useful for designing more powerful methods for genomic island prediction.


Introduction
Horizontal gene transfer (HGT) is one of the main factors affecting bacterial adaptability. Hacker et al. found some viral gene clusters in E. coli genomes that did not exist in closely related species and denoted them as pathogenicity islands (PAIs) [1]. Since then, at least a dozen PAIs have been detected, such as the "secretion island," "antimicrobial island," and "metabolic island" [2]. These regions were later generalized as genomic islands (GIs) and are named according to the functions they encode, which relate to complex changes of niche [3]. For example, GIs are responsible for the type III secretion system, iron absorption, toxins, and adhesin secretion, which enhance the survival ability of pathogens in the host body and lead to disease [4,5]. Some researchers reported that pathogenicity can be regulated by the selective loss or recovery of specific GIs [6,7], and PAIs can be spontaneously removed from chromosomes at a detectable rate, resulting in different pathogenic phenotypes [8,9]. Therefore, the detection of GIs has become an important part of microbial evolution and function research.
With the help of large-scale comparative genomics, researchers found that GIs differ from the host in sequence composition and often carry direct flanking repeats, mobility genes, and tRNA genes. In turn, exploring and utilizing these features can lead to better detection of GIs [3,10-12]. GIs are scattered among close relatives and carry species patterns different from the host's. Researchers can identify distant relatives by comparing the differences of 16S rRNA or other homologous sequences [13]. Some alignment-based methods have been developed to detect GIs, such as the basic local alignment method [14] and the whole-genome alignment method [15]. These tools rely on the observation that, compared with conserved regions, genomic regions that are not aligned across multiple genomes or that align with only one genome are more likely to be putative GIs. For more complex cases, several methods that construct and apply multilayer or large-scale genome comparisons have been reported. For example, MobilomeFINDER first finds shared tRNA genes in several related genomes and then uses Mauve to search for GIs in the regions upstream and downstream of homologous tRNA genes [16]. Since the GIs identified with this method are tied to tRNA disruption, GIs that do not use a tRNA gene as an insertion site will be missed. To address this problem, MOSAIC identifies strain-specific regions that are not necessarily inserted at tRNA genes [17]. Unfortunately, inversions and translocations are often mistaken for strain-specific regions. IslandPick is one of the most widely used tools for GI detection [18]. Given a genome, IslandPick first automatically selects appropriate comparison genomes without bias and then uses Mauve to construct the whole-genome alignment. To avoid false calls caused by duplications, IslandPick uses BLAST as a secondary filter to recheck the regions aligned by Mauve. IslandPick has been integrated into the IslandViewer website, where a dataset of precomputed GIs can be downloaded [19-21].
In addition to comparative genomics, composition-based methods are also very sensitive for GI detection. Considering that GIs usually show significantly different sequence composition from the host, an effective detection algorithm can distinguish the anomalous region from the rest of the genome according to this compositional deviation. In practice, composition-based methods are desirable because they can rapidly detect GIs from the analyzed sequence without the need for additional genomes. G + C content and oligonucleotides of lengths 2-9 are widely used to describe the sequence composition in GI detection [10,22-25]. For example, PAI-Finder calculates G + C content abnormality and codon usage deviation to detect GIs and further evaluates a candidate PAI only when a PAI-like region partially or completely overlaps a GI [26]. PAI-Finder has been integrated into the PAI database, where comprehensive information on all annotated PAIs and predicted PAIs in prokaryotic genomes can be downloaded [27,28]. HMM models have also been introduced to detect anomalous regions with compositional deviation [22,29-31]. For example, SIGI-HMM constructs an HMM over codon usage to identify alien genes while discounting the biased codon usage of highly expressed regions such as ribosomal genes [29,30], and IslandPath-DIMOB [31] uses HMMs to identify mobility genes by searching the Pfam mobility gene profiles [32] of each predicted gene [11]. Alien_Hunter introduced a scoring system based on k-mers and refines the boundaries of predicted GIs using an HMM [22].
Although the performance of the above algorithms is good, there are still some problems: (1) comparative genomics relies heavily on the genomes used in the comparison, so it can only be applied during annotation or when closely related genomes are available; even when more genomes are available, researchers have to spend time selecting genomes from the species of interest. (2) Although the HMM-based methods show better performance in GI detection, they involve relatively many parameters and a large amount of training computation, so detecting GIs takes a long time. (3) In recent years, various sequence features have been proposed, but these features are rarely applied to genomic island prediction; fusing and selecting effective features is another way to improve the efficiency of genomic island detection.
With the above problems in mind, we present a scheme to predict genomic islands using the chi-square test and random forest algorithm. We first extract seven kinds of widely used sequence features and compare their performance in GI detection. The chi-square test is then used to select the important features. Finally, all the selected features are input into the random forest to detect genomic islands. Through a comprehensive comparison and discussion, some valuable guidelines on the use of sequence features, feature selection, and prediction methods are obtained.

Materials and Methods
2.1. Datasets. Three standard datasets are used in this study. The first dataset, PICK108, consists of 108 complete bacterial genome sequences and their annotations; the numbers of positive and negative GI samples in this dataset are 3868 and 679, respectively [33]. The second dataset, denoted CF15, consists of 15 complete bacterial genome sequences and their annotations; the numbers of positive and negative GI samples are 6070 and 5833, respectively [34]. The third dataset, denoted RGP104, consists of 104 complete bacterial genomes and their annotations; the numbers of positive and negative GI samples are 1846 and 3267, respectively [35].

2.2. Sequence Features. Seven kinds of widely used sequence features are extracted for genomic island detection. They are composition of k-spaced nucleic acid pairs (CKSNAP), dinucleotide composition (DNC), nucleic acid composition (NAC), pseudodinucleotide composition (PseDNC), electron-ion interaction pseudopotentials of trinucleotide (PseEIIP), reverse complement k-mer (RCKmer), and trinucleotide composition (TNC). The above features are obtained with iLearn, a comprehensive Python-based toolkit that integrates feature extraction, computation, analysis, and predictor construction [36].

Dinucleotide Composition (DNC). DNC expresses the composition of consecutive pairs of nucleotides [36,39]. The DNC encoding uses 16 descriptors, defined as

D(i, j) = N_ij / (N − 1),  i, j ∈ {A, C, G, T},

where N_ij denotes the number of dinucleotides composed of nucleotide types i and j, and N is the length of the nucleotide sequence.

Trinucleotide Composition (TNC). TNC refers to the composition of three consecutive nucleotides in biological sequences [40]. The TNC encoding uses 64 descriptors ("AAA," "AAC," "AAG," "AAT," …, "TTT"), defined as

f(i, j, k) = N_ijk / (N − 2),  i, j, k ∈ {A, C, G, T},

where N_ijk denotes the number of trinucleotides composed of nucleotide types i, j, and k, and N is the length of the nucleotide sequence.
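Both DNC and TNC are plain k-mer compositions (k = 2 and k = 3, respectively). A minimal sketch of the counting, not the iLearn implementation:

```python
from itertools import product

def kmer_composition(seq, k):
    """Normalized frequencies of all overlapping k-mers (DNC for k=2, TNC for k=3)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = {km: 0 for km in kmers}
    for i in range(len(seq) - k + 1):
        counts[seq[i:i + k]] += 1
    total = len(seq) - k + 1  # number of overlapping windows, i.e., N - k + 1
    return [counts[km] / total for km in kmers]

dnc = kmer_composition("ACGTACGT", 2)  # 16-dimensional vector
tnc = kmer_composition("ACGTACGT", 3)  # 64-dimensional vector
```

The denominators N − 1 and N − 2 in the definitions above are exactly the window counts used here.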

Pseudodinucleotide Composition (PseDNC). PseDNC converts the local sequence arrangement and global sequence information into the feature vector [39]. The PseDNC vector is expressed as

d_k = f_k / (Σ_{i=1..16} f_i + w Σ_{j=1..λ} θ_j),  1 ≤ k ≤ 16,
d_k = w θ_{k−16} / (Σ_{i=1..16} f_i + w Σ_{j=1..λ} θ_j),  17 ≤ k ≤ 16 + λ,

where f_k (k = 1, 2, ⋯, 16) reflects the normalized frequency of occurrence of the k-th dinucleotide, λ represents the highest counted rank of the correlation along the biological sequence, w (0 to 1) is the weight factor, and θ_j (j = 1, 2, ⋯, λ) is the j-tier correlation factor, defined as

θ_j = (1 / (L − 1 − j)) Σ_{i=1..L−1−j} Θ(R_i R_{i+1}, R_{i+j} R_{i+j+1}),

where the correlation function is defined as

Θ(R_i R_{i+1}, R_j R_{j+1}) = (1/μ) Σ_{u=1..μ} [C_u(R_i R_{i+1}) − C_u(R_j R_{j+1})]²,

where μ denotes the number of physicochemical indices, C_u(R_i R_{i+1}) is the numerical value of the u-th physicochemical index of the dinucleotide R_i R_{i+1}, and C_u(R_j R_{j+1}) denotes the corresponding value of the dinucleotide R_j R_{j+1} at position j.
2.2.6. Nucleic Acid Composition (NAC). NAC assesses the frequency of each nucleic acid along the sequence. The frequencies of all 4 natural nucleic acids (i.e., "ACGT") are calculated as

f(t) = N(t) / N,  t ∈ {A, C, G, T},

where N(t) represents the number of nucleic acids of type t, and N is the length of the nucleotide sequence [36].

2.2.7. Electron-Ion Interaction Pseudopotentials of Trinucleotide (PseEIIP). EIIP_A, EIIP_T, EIIP_G, and EIIP_C represent the EIIP values of nucleotides A, T, G, and C, respectively. The mean EIIP of the trinucleotides in each sample is used to construct the feature vector, which is described as

V = [f_AAA · EIIP_AAA, f_AAC · EIIP_AAC, ⋯, f_TTT · EIIP_TTT],

where f_xyz represents the normalized frequency of the trinucleotide xyz, EIIP_xyz = EIIP_x + EIIP_y + EIIP_z represents the EIIP value of a trinucleotide, and x, y, z ∈ {A, C, G, T} [36].
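A short sketch of the PseEIIP encoding under the standard EIIP values of Nair and Sreenadhan (A = 0.1260, C = 0.1340, G = 0.0806, T = 0.1335); this is an illustration, not the iLearn code:

```python
from itertools import product

# Standard electron-ion interaction pseudopotential values per nucleotide
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}

def pseeiip(seq):
    """64-dim vector: f_xyz * (EIIP_x + EIIP_y + EIIP_z) for each trinucleotide xyz."""
    trimers = ["".join(p) for p in product("ACGT", repeat=3)]
    counts = {t: 0 for t in trimers}
    for i in range(len(seq) - 2):
        counts[seq[i:i + 3]] += 1
    total = max(len(seq) - 2, 1)  # number of overlapping trinucleotide windows
    return [counts[t] / total * sum(EIIP[c] for c in t) for t in trimers]
```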
2.3. Chi-Square Test. All the sequence features are fused together in order to improve the prediction efficiency, but the redundancy among different features cannot be ignored. Therefore, one of the primary tasks in genomic island prediction is to select the best features from the given feature set to achieve the best prediction. This work uses the chi-square test to select the best features for genomic island prediction. The chi-square (X²) test measures the deviation of observed counts from their expected distribution [40,41]. Statistically, X² tests the independence of two variables, where two variables A and B are defined as independent if P(AB) = P(A)P(B), or equivalently P(A | B) = P(A) and P(B | A) = P(B). In feature selection, the two variables are the occurrence of a feature and the occurrence of a class. Let w_i be a random variable that takes the value w_i = 1 (presence of feature i) or w_i = 0 (absence of feature i), and let e_j be a random variable that takes the value e_j = 1 (the sample belongs to class j) or e_j = 0 (the sample does not belong to class j). The statistic is

X² = Σ_{w∈{0,1}} Σ_{e∈{0,1}} (F_we − E_we)² / E_we,

where F_we is the observed count of samples with the indicated values of w_i and e_j, and E_we is the corresponding expected count under independence. For example, F_10 is the number of samples that contain feature i (w_i = 1) and are not in class j (e_j = 0); F_1· = F_10 + F_11 is the number of samples that contain feature i, counted independently of class membership; and F = F_00 + F_01 + F_10 + F_11 is the total number of samples [42]. X² thus measures how much the expected counts E and the observed counts F deviate from each other.
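In practice, this kind of chi-square feature scoring is available off the shelf; a minimal illustration with scikit-learn's chi2 scorer on synthetic data (the matrix and labels are placeholders, not the study's datasets):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy data standing in for the fused feature matrix (rows: candidate regions,
# columns: fused sequence features); labels 1 = GI, 0 = non-GI.
rng = np.random.default_rng(0)
X = rng.random((200, 50))        # chi2 requires non-negative feature values
y = rng.integers(0, 2, 200)

selector = SelectKBest(score_func=chi2, k=30)  # keep the 30 highest-scoring features
X_sel = selector.fit_transform(X, y)
print(X_sel.shape)  # (200, 30)
```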
A high value of X² indicates that the hypothesis of independence, under which expected and observed counts are similar, is incorrect. An arithmetically simpler way of computing X² is

X² = F (F_11 F_00 − F_10 F_01)² / ((F_11 + F_01)(F_11 + F_10)(F_10 + F_00)(F_01 + F_00)).

2.4. Prediction Algorithm. Random forest (RF) is among the best classification algorithms and is widely applied to biological problems. It works by building an ensemble of weak tree classifiers and combining them into a strong classifier: the model grows multiple decision trees during training and outputs the modal class among the classes predicted by the individual trees. Formally, a random forest is an ensemble of tree predictors h(X; ω_i), i = 1, ⋯, I, where X represents the observed input (covariate) vector of length p and the ω_i are independent and identically distributed (iid) random vectors; each tree depends on the value of its independently sampled random vector, and all trees in the forest share the same distribution [43]. Following [44], we focus on the regression setting, in which the outcome Y is numerical, but we make some points of contact with classification (categorical outcome) problems. The observed (training) data are assumed to be drawn independently from the joint distribution of (X, Y) and comprise n (p + 1)-tuples (x_1, y_1), ⋯, (x_n, y_n).
For regression, the random forest prediction is the average over the collection of trees,

ĥ(x) = (1/I) Σ_{i=1..I} h(x; ω_i).

As I → ∞, the law of large numbers ensures

E_{X,Y} (Y − ĥ(X))² → E_{X,Y} (Y − E_ω h(X; ω))² = PE*_f.

The quantity on the right is the prediction (or generalization) error of the random forest, denoted PE*_f. This convergence implies that random forests do not overfit as more trees are added. Now, define the average prediction error of an individual tree h(X; ω),

PE*_t = E_ω E_{X,Y} (Y − h(X; ω))².

Assume that for all ω the tree is unbiased, i.e., E Y = E_X h(X; ω). Then

PE*_f ≤ μ PE*_t,

where μ is the weighted correlation between the residuals Y − h(X; ω) and Y − h(X; ω′) for independent ω, ω′. The above inequality pinpoints what is required for accurate random forest regression: low correlation between the residuals of different trees in the forest and low prediction error for the individual trees [44]. In other words, the forest reduces the individual tree error PE*_t by the factor μ.
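The classifier itself is available off the shelf; a hedged sketch of feeding the selected features to scikit-learn's implementation (synthetic placeholder data, not the authors' exact configuration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.random((300, 30))    # stand-in for the 30 chi-square-selected features
y = rng.integers(0, 2, 300)  # 1 = GI, 0 = non-GI

rf = RandomForestClassifier(n_estimators=100, random_state=1)
scores = cross_val_score(rf, X, y, cv=10)  # 10-fold cross-validation accuracies
print(scores.mean())
```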

2.5. Performance Evaluation. This work uses cross-validation to evaluate the proposed method and calculates accuracy, precision, recall (sensitivity), specificity, and F-measure as standard performance indicators. They are defined as

Accuracy = (TP + TN) / (TP + TN + FP + FN),
Precision = TP / (TP + FP),
Recall (Sensitivity) = TP / (TP + FN),
Specificity = TN / (TN + FP),
F-measure = 2 × Precision × Recall / (Precision + Recall),

where TP is the number of true positives, FP is the number of false positives, TN is the number of true negatives, and FN is the number of false negatives.
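The indicators above can be computed directly from the four confusion-matrix counts; a small helper (illustrative, not the authors' code):

```python
def metrics(tp, fp, tn, fn):
    """Standard confusion-matrix indicators used in the text."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)   # recall
    specificity = tn / (tn + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, precision, sensitivity, specificity, f_measure

print(metrics(90, 10, 80, 20))
```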
Results and Discussion

3.1. Comparison with the Existing Methods. We further compare the proposed method with the current methods. For the convenience of comparison, we compare our results with the published results of the existing methods. Therefore, different evaluation metrics are used for different datasets, as summarized in Tables 1-3.
3.2. Influence of the Different Features. The above results show that the proposed method outperforms the available genomic island prediction methods. The influence of the different sequence features is shown in Figure 2, which indicates that each feature makes its own positive contribution to the predictions, although different features have certain preferences for different datasets. On the whole, PseEIIP, RCKmer, and TNC achieve the best performance among all the sequence features. It is easy to note that PseEIIP and RCKmer not only reflect the component content but also capture the local sequence arrangement and global sequence information and, in the case of PseEIIP, the energy of delocalized electrons in nucleotides through the electron-ion interaction. Compared with NAC and DNC, PseEIIP and RCKmer are more closely related to the genomic islands, which is why they achieve better performance in genomic island prediction.
3.3. Influence of the Different Feature Selections. A feature of the proposed method is the feature selection based on the chi-square test. For a better understanding of the feature selection, we vary the selected feature set size from 5 to 120. All experiments are performed with each selected feature set using the 10-fold cross-validation test, and overall accuracy is chosen as the score for this prediction. Figure 3 shows the overall accuracies of all experiments with the selected feature sets for the three datasets.
As would be expected, the overall accuracy first increases and then decreases as the selected feature size grows. When the selected feature set size is less than 30, all datasets have already reached their best prediction; as the number of selected features increases further, the overall accuracy decreases. The chi-square test is further compared with feature importance (FI), Pearson correlation (PC), ROC-AUC, mutual information gain (MIG), linear discriminant analysis (LDA), and principal component analysis (PCA), and it is easy to note that the chi-square test achieves the best performance among the seven feature selection methods.
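The sweep over selected-feature-set sizes can be reproduced in outline as follows (synthetic placeholder data; the real study uses the extracted sequence features). Placing the selector inside the pipeline ensures the chi-square scores are computed on training folds only:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(2)
X = rng.random((200, 120))   # 120 fused features, as in the largest sweep
y = rng.integers(0, 2, 200)

accuracies = {}
for k in range(5, 121, 25):  # a coarse sweep over selected-feature sizes
    pipe = make_pipeline(SelectKBest(chi2, k=k),
                         RandomForestClassifier(random_state=2))
    accuracies[k] = cross_val_score(pipe, X, y, cv=10).mean()
```

On real data, the curve traced by `accuracies` is what Figure 3 reports: a rise to a peak below k = 30, then a decline as redundant features are added.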

3.4. Influence of the Different Prediction Algorithms. Random forest (RF) was employed as the classifier in this work. To compare the performance of different classifiers, support vector machine (SVM), k-nearest neighbor (KNN), gradient boosting (GB), AdaBoost (AB), decision tree (DT), bagging, extra trees (ET), stochastic gradient descent (SGD), and multilayer perceptron (MLP) were also adopted for genomic island prediction. All experiments are performed with each selected feature set using the 10-fold cross-validation test, and overall accuracy is chosen as the score for this prediction. Figure 4 summarizes the overall accuracies of all experiments with the different prediction algorithms for the three datasets.
From Figure 4, it is easy to note that the random forest (RF) achieves the best performance among the ten classifiers. Specifically, its average overall prediction accuracy across the PICK108, RGP104, and CF15 datasets is 95%, compared with 91% for gradient boosting (GB) and 92% for bagging. These results indicate that the random forest is a more powerful classifier for genomic island prediction.
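The classifier comparison benefits from scikit-learn's uniform estimator API; a sketch with three of the ten classifiers on synthetic placeholder data:

```python
import numpy as np
from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                              BaggingClassifier)
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.random((200, 30))    # stand-in for the chi-square-selected features
y = rng.integers(0, 2, 200)

classifiers = {
    "RF": RandomForestClassifier(random_state=3),
    "GB": GradientBoostingClassifier(random_state=3),
    "Bagging": BaggingClassifier(random_state=3),
}
# Mean 10-fold cross-validation accuracy per classifier
results = {name: cross_val_score(clf, X, y, cv=10).mean()
           for name, clf in classifiers.items()}
```

Because all estimators share fit/predict, the remaining seven classifiers drop into the same loop unchanged.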

Conclusion
Genomic islands are related to the rapid adaptation of prokaryotes and have important medical, economic, or environmental significance. Existing methods usually evaluate sequence features and focus on whether the local features of a certain region differ significantly from those of the host. Although these methods have achieved good experimental results, and various feature extraction methods have been proposed, these features are rarely used to predict genomic islands. With these problems in mind, we present a scheme to predict genomic islands using the chi-square test and random forest algorithm. We extract seven kinds of widely used sequence features and select the important ones with the chi-square test. Finally, all the selected features are input into the random forest to predict the genomic islands. Three experiments show that the proposed method performs better than previous methods. The first contribution concerns the influence of the different features: we find that PseEIIP, RCKmer, and TNC are more closely related to the genomic islands and achieve the best performance among all the sequence features. The second contribution concerns the influence of the different feature selections: the chi-square test achieves the best performance among the seven feature selection methods. The final contribution concerns the influence of the different prediction algorithms: the random forest (RF) achieves the best performance among the ten classifiers, with an accuracy 3% higher than that of the next best. This understanding can be used to develop more powerful methods for genomic island prediction.

Data Availability
All the data used to support the findings of this study are available on https://github.com/Onesime243/Chi_square_Genomic_Islands_predicton_data-and-result.git.

Conflicts of Interest
The authors declare that they have no conflicts of interest.