Building a Model for Predicting Target Gene Expression in Rice T-DNA Insertional Mutants by Machine Learning Approaches


 Background

T-DNA activation-tagging technology is widely used to enhance flanking gene expression near the site of insertion for functional genomics research in rice. However, whether the expression of a gene of interest is enhanced must be validated experimentally.
Results

In this study, we built a model to predict gene expression in T-DNA mutants by machine learning approaches, thereby improving the efficiency of screening for activated genes. We gathered experimental consisting of gene expression data in T-DNA mutants and captured the PROMOTER and MIDDLE sequences for encoding. In first-layer models, SVM models were constructed with nine features consisting of information about biological function and local and global sequences. Feature-encoding based on the PROMOTER sequence was weighted by logistic regression. The second-layer models integrated 16 first-layer models with feature selection and the algorithm, which were selected from nine feature selection methods and 65 classified methods, respectively. The accuracy of the final two-layer machine learning model, referred to as was 99.3% based on five-fold cross-validation, and 85.6% based on independent-testing.
Conclusion

We discovered that the information within the local sequence had a greater contribution than the global sequence with respect to classification had a good predictive ability for target genes within 20 from the 35S enhancer. Based on the analysis of significant sequences, the G-box regulatory sequence may also play an important role in the mechanism of activation of the 35S enhancer.


Background
Rice is one of the most important models of monocotyledon plants for the analysis of plant gene function. Rice is one of three major food crops throughout the world, and it is the staple food of more than half of the world's population. In the past 30 years, rice production has doubled, although the supply of rice is expected to gradually become insufficient with the rapid increase in the world population, climate change and a shortage of water [1]. It will not be easy to increase food production to the necessary levels. In 2004, the International Rice Genome Sequencing Project (IRGSP) completed the sequencing of the rice genome [2]. The ultimate goal of genome analysis is to realize the structure and function of each gene within an organism. To further confirm the function of and metabolic pathways related to each gene in rice, scientists have focused their efforts on analyzing the rice genome and are committed to promoting rice genome annotation to move rice research into the postgenome era.
T-DNA insertion activation-tagging technology is widely used in the analysis of the function of rice genes [3,4]. This method results in the construction of four tandem cauliflower mosaic virus (CaMV) 35S enhancers on a T-DNA plasmid; when this T-DNA is inserted into the rice genome, it activates genes that flank the T-DNA insertion site [5]. The CaMV 35S enhancer can activate gene expression in dicots and monocots and is widely used in T-DNA transformation. Gene expression gradually increases with the number of 35S enhancers on T-DNA, which led to the incorporation of four tandem repeat CaMV 35S enhancers for enhanced gene expression with this approach [6][7][8][9][10][11]. Agrobacterium-mediated T-DNA transformation tends to insert one copy of T-DNA, an average of 1.4 loci of T-DNA inserts in transgenic plants [12], reducing the complexity of rice gene research. T-DNA inserted into the rice genome with 35S enhancer resulted in two states. 1) Gene knockdown: when T-DNA is inserted into the coding region or promoter of a gene, it is likely to destroy the structure of the gene, resulting in reduced function or loss of function of the gene. 2) Activation-tagging: T-DNA might enhance the activity of genes that flank the T-DNA insertion site through the effect of the 35S enhancers. Thus, we can make use of T-DNA insertion activation-tagging to study the association between genetic function and morphological traits [5]. However, there has been no basis for determining whether a target gene is activated by the enhancer prior to experimental analyses. There has even been a study indicating that the enhancer can activate genes that are millions of base pairs away from the enhancer [13]. Not all of the genes that flank the T-DNA insertion site are expected to be activated by the 35S enhancer in our experiments. In some T-DNA mutants, the 35S enhancer does not activate the closer gene but instead activates a gene that is farther away from the 35S enhancer [14]. Researchers thus cannot rely on the distance between the enhancer and a particular gene to judge whether that gene would be activated. They must instead determine the activated genes experimentally to explore the related genetic function and morphological traits. Therefore, it is a timeconsuming and laborious process to check for the expression of a target gene.
In this study, we used machine learning approach to predict target gene expression in rice T-DNA insertion mutants and improved the efficiency of finding activated target genes. Our team had developed a website platform, EAT-Rice [15], for predicting the expression status of rice genes that flank the T-DNA insertion site in activating mutants. The system of EAT-Rice applied the distance factor from T-DNA insertion site to gene loci to weight feature encoding, and used two kinds of algorithms to build two-layer model of machine learning. In this study, we based on EAT-Rice with a modified sequence capturing method, system architecture and other additional features to build a more comprehensive system for target gene expression prediction in T-DNA insertion mutants.
The datasets used in this study were experimentally validated. We first characterized genes based on their activation by the 35S enhancer, these genes were divided into activated genes and nonactivated genes. The system we built refer to the EAT-Rice. We captured the DNA sequence of the promoter and the central region of each activated gene from the start codon of the target gene to the 35S enhancer and used nine features for encoding. We then used LibSVM (Library Support Vector Machine) [16] and LADTree [17] algorithms to build two-layer models of machine learning. In the first-layer model, we carried out a logistic regression to weight the features that depending on the probability of gene activation and the distance from the enhancer to the gene start codon. Moreover, we used the mRMR (Minimum Redundancy Maximum Relevance) [18] method and incremental feature selection to determine the most relevant features in the second layer. This system is referred to as TIMgo.
The TIMgo performance was 99.3% based on five-fold cross-validation and 85.6% based on independent-testing. TIMgo had >80% accuracy for target genes within 20 kb from the 35S enhancer, but genes that were >20 kb away were still predicted with >60% accuracy. We also discovered that the value of the k parameter for Kmer, RevKmer and PseKNC encoding within the PROMOTER sequences was higher than that of MIDDLE sequences. This suggested that for the analysis of longer sequences a greater number of features was needed to improve the prediction performance. Finally, the G-box cis-element has an important function in gene activation by the 35S enhancer based on the motif analysis, and among the G-box−associated binding proteins most are bZip (basic region/leucine zipper) transcription factors.

Sources for T-DNA Mutant Data and Datasets
The experimental data were collected from 11 rice T-DNA mutants from Liang-Jwu Chen's laboratory at NCHU and 316 mutants from Su-May Yu's research team at the Academia Sinica. These data consisted of the T-DNA insertion point and expression status of flanking genes (as detected by RT-PCR [28]). The expression status of each gene was characterized based on the four following categories: activated gene (Ac), no significant effect (NE), non-detectable (ND) and knockout (Ko). The data distribution for the expression status of these genes is shown in Table 1.
To maintain dataset quality and consistency, we removed the 30 ND genes from the dataset. The collected data included two Ko genes, in which the T-DNA insertion point was located inside the gene, thus disrupting the gene structure and most likely leading to a loss of function. Because Ko genes were not a focus of this study, we removed them from the dataset. We defined NE genes as non-activated (NAc) genes to differentiate them from the Ac genes. Ultimately, data for 453 genes were collected in this study.
A training set was used to determine the performance of the subsequent system. As the ratio of positive data (Ac genes) to negative data (NAc genes) affects the performance of the machine learning [29], this study used EAT-Rice with a 1:1 ratio to carry out the selection of the training dataset. We used data from 300 genes in the training dataset, which was referred to as D300. Data from the remaining 153 genes were used for independent-testing to evaluate system accuracy; this dataset was referred to as D153 (Table 2).

Target Gene Sequence Retrieval
The analyzed genes provided from Liang-Jwu Chen's laboratory and Su-May Yu's team were annotated according to the Rice Genome Automated Annotation System (RiceGAAS) [30] and the MSU Rice Genome Annotation Project (TIGR) [31,32] rice gene annotation database. We hypothesized that we could predict the expression status of a target gene by analyzing the sequence of Ac and NAc genes. Thus, with reference to the EAT-Rice construction process and the enhancerrelated hypothesis mechanisms [33,34], we extracted nucleotide sequences for each gene from two regions: 1) a 1500-bp region upstream relative to the translation start site (TLS), referred to as the PROMOTER region, and 2) a central region of 300 bp centered between the TLS of the target gene and the 35S enhancer, referred to as the MIDDLE region (Supplementary Figure S1).

Feature Encoding
In this study, we encoded information about nine features of the sequences: five sequence information codes and four biological functional codes. The sequence codes consisted of two local sequence codes, two global sequence codes and a code to reflect both the local and global sequence 4 information simultaneously. The local sequence characteristics consisted of Kmer and Reverse complementary kmer (RevKmer) values, which were coded by the DNA composition; such characteristics have been successfully applied toward human gene regulatory sequence prediction [35,36] and enhancer identification [37], among others. The two global sequence codes, dinucleotidebased auto-cross covariance (DACC) and trinucleotide-based auto-cross covariance (TACC), were coded by calculating the sequence autocorrelation as global sequence characteristics; this type of feature has been used to predict sequence-based protein-protein interactions [38]. Another coding method, PseKNC, has been used to identify promoters in prokaryotes [39] and incorporates the contiguous local sequence-order information and the global sequence-order information into the feature vector. The biological characteristics included the presence of CpG-islands (CGIs), regulatory cis-elements (Motif) and conformational and physicochemical properties of dinucleotide and trinucleotide sequences (DNP and TNP, respectively). Each of these features is described in more detail below.

CpG-Islands (CGIs)
DNA methylation on CpG-islands reduces or silences gene expression based on enhancerpromoter interactions [40,41]. For this analysis, we used the EMBOSS Newcpgreport tools of EMBL-EBI to predict CpG-islands and encoded their corresponding number, length, distance from the TLS, CpG ratio and OE (observed/expected) value, resulting in the feature CGIs were as follows: Number coding was represented by the number of CGIs (j) predicted in the sequence (Equation 1). Length coding consisted of the ratio of the CGIs length dived by the PROMOTER or MIDDLE sequence length (Equation 2). Distance was encoded as the distance from the CpG-island to the TLS (Equation 3). The CG ratio was calculated as the ratio of CpG fragments in the CpG-island by dividing total number of CGIs (Equation 4). The OE value indicates the ratio of the number of CpGs present in the CpG-island relative to the expected value of CpG fragments and was calculated by dividing the number of CpG fragments in the sequence by the product of the number of Cs and the number of Gs in the CpG-island (Equation 5).

Regulatory cis-Elements (Motif)
To consider that the transcription factor binding sites (TFBSs) of rice have been confirmed may not be comprehensive enough yet, we therefore incorporated proven TFBSs from other plants. Data for 2087 motifs were collected from PLACE [42] and the RegSite database (http://linux1.softberry. com/berry.phtml?topic=regsitelist). The tool Find Individual Motif Occurrences (FIMO) [43] in the MEME suite was used to scan for regulatory sequences in the PROMOTER region, and the scanning results were encoded by FIMO [44,45]. This features encoding are listed as follows:

CGI_LengthRatio = length of CGI length of sequence
(2) Motif_Conserve ( ) = alignment score in promoter Motif_Number( ) The number of regulatory elements was coded by the number (j) of motifs found in the PROMOTER (Equation 6). The conservation score was calculated by FIMO; we summed up the conserved scores of specific motifs and divided by the number of motifs in the sequence (Equation 7). As motifs can be located on both the DNA coding strand (codons) and the template strand (anticodons), the orientation characteristic was calculated to determine the proportion of motifs on the coding strand. We thus used the number of motifs on the coding strand (i.e., positive motifs, pos) as the numerator, and the denominator is the number of all motifs (Equation 8). The distance characteristic was determined based on the distance (in base pairs) from each motif to the TLS, which was summed for all motif sites within a given sequence, divided by the number of motifs (Equation 9). In these equations, i indicates the kinds of motifs, Mi indicates a specific motif and the geneTLS refers to the translation start site of a target gene.

Kmer and Reverse Complementary Kmer (Kmer and RevKmer)
Kmer refers to the local sequence information and indicates a subsequence containing k neighboring nucleic acids in a DNA sequence. Using a coding strand as the template, the Kmer feature will scan for the number of occurrences of the nucleic acid subsequence in the template. For example, when k is 2, the subsequence composition of a Kmer will be called a 2-mer, which contains 16 subsequences (based on the four nucleotides, G, A, T and C). In the case of the AA dinucleotide, if this subsequence appeared twice in the DNA template, it would be encoded as 2; if it was not present in the template, it would be encoded as 0. In eukaryotes, the average length of TFBSs is 10 bps [46], which suggests that the number of k neighboring nucleic acids in this study could be increased. We encoded the sequence with 3-to 6-mer, 3-to 7-mer, 3-to 8-mer and 3-to 9-mer, which produced 5440, 21824, 87360 and 349504 different nucleotide compositions, respectively. The Kmer encoding was carried out based on the number of occurrences in the template sequence (Equation 10).
Reverse complementary kmer (RevKmer) is a variant of kmer-in which the kmers are not expected to be strand specific, so reverse complements are collapsed into a single value. In this study, the RevKmer feature was encoded in the same manner as Kmer and produced 2760, 10952, 43848 and 174920 nucleotide compositions for the 3-to 6-mer, 3-to 7-mer, 3-to 8-mer and 3-to 9-mer, respectively. RevKmer encoding was carried out according to the number of occurrences in the template sequence (Equation 11).
where i indicates kinds of nucleotide combinations and j indicates the number of matches with pattern (i) in a specific sequence. If there are no matches for a pattern in a specific sequence, this is coded as zero.
Nucleotide Conformational and Physicochemical Properties (DNP and TNP) The nucleotide conformation and physicochemical properties of dinucleotides and trinucleotides were also encoded. DiProDB provides information about 125 properties of dinucleotides, and these 125 properties were integrated into 15 characteristics through a statistical Principal components analysis (PCA) method [47]. The value of each property is based on the dinucleotide as a unit, and each property has 16 values corresponding to all possible dinucleotide combinations. We used the property of the dinucleotide to produce a training model with 240 dimensions; this feature is referred to as the DNP (dinucleotide conformation and physicochemical properties) (Equation 12). PseKNC-General (the general form of pseudo k-tuple nucleotide composition) is a tool that provides the conformation and physicochemical properties of oligonucleotides [48]. In this study, 12 trinucleotide properties were used for coding. There were 64 combinations of trinucleotides, which generated a training model with 768 dimensions based on the 12 trinucleotide properties; this feature is referred to as the TNP (trinucleotide conformation and physicochemical properties) (Equation 13).

Autocorrelation (DACC and TACC)
Pse-in-One provides a pseudo-components mode reflecting the correlation between two dinucleotides or trinucleotides within a DNA sequence via their physicochemical properties [49]. In this study, we used dinucleotide-based auto-cross covariance (DACC) and trinucleotide-based autocross covariance (TACC) as provided by Pse-in-One for encoding (Equations 14-16).
In this study, DACC was based on the 15 properties from DiProDB and the lag value was 4, generating a training model with 900 dimensions. TACC used the 12 Pse-in-One built-in properties and the lag value was 4; it generated a training model with 576 dimensions. There formulas were show as follows: for a DNA sequence D with L nucleic acid residues, where R1 represents the nucleic acid residue at the sequence position 1, R2 the nucleic acid residue at position 2 and so on (Equation 14). DACC and TACC measure the correlation of the same physicochemical index between two dinucleotides or 7 trinucleotides separated by a distance of lag along the sequence (Equations 15, 16). In these equations, u1 and u2 are two different physicochemical indices, Pu(RiRi+1) and Pu(RiRi+1Ri+2) are the numerical value of the physicochemical index u for the dinucleotide RiRi+1 or trinucleotide RiRi+1Ri+2 at position i and u P is the average value for the physicochemical index value u along the whole sequence. The number of features represented by DACC and TACC, will be the lag value multiplied by the square of the properties number.
Pseudo K-Tuple Nucleotide Composition (PseKNC) Pseudo k-tuple nucleotide composition (PseKNC) is one of the encoding modes supplied by Psein-One. It incorporates both the contiguous local sequence order information (like Kmer and RevKmer) and the global sequence order information (like DACC and TACC) into the feature vector of the DNA sequence.
For a DNA sequence D with L nucleic acid residues, where R1 represents the nucleic acid residue at the sequence position 1, R2 the nucleic acid residue at position 2 and so on (Equation 17). PseKNC will calculate the occurrence frequency (f) of dinucleotides in the DNA sequence and the correlation between two oligonucleotides that are 1 to λ nucleotides apart from each other. In equation 18, fu is the occurrence frequency of dinucleotides in the DNA sequence, which is ). The feature number of PseKNC will be λ multiplied by 4 to the power k. In this study, the PseKNC feature was determined with a λ value of 4, w is 0.2, and k is from 2 to 6.

Significant Sequence Fragments Analysis
Because there are numerous features in this first-layer model, the complexity of the model is relatively high. To reduce the interference of excessive noise, we used independent two-sample ttests (implemented in R) to select features from the high-dimension models. We used the occurrence of specific oligonucleotides in the Ac and NAc groups to generate the t-test (Supplementary Figure  S2) and retained the oligonucleotides with P < 0.05 to encode these significant fragments.

Model Evaluation and Cross-Validation
We used a five-fold cross-validation method and independent-testing data to evaluate the predictive performance of the model. Our evaluation methods included Accuracy (Acc), Sensitivity (Sn), Specificity (Sp) and Matthews Correlation Coefficient (MCC). Acc is used to estimate the prediction accuracy of the global prediction capability, with values closer to 100% indicating the better overall predictive performance of a model (Equation 20). Sn and Sp evaluate the accuracy of 8 the prediction of positive and negative data, respectively (Equations 21,22). When the number of positive and negative data differs, Acc is not a good evaluation indicator. MCC is, however, suitable for assessing a dataset in which there is an imbalance between positive data and negative data (Equation 23). When the MCC score is closer to 1, the prediction capability is better; a score closer to −1 indicates a worse prediction capability.

Framework of TIMgo
TIMgo is a two-layer machine learning model constructed for predicting the effect of 35S enhancer on the expression of target gene (Figure 1). The D453 were dived into training dataset (D300) and independent-testing data (D153). The DNA sequence of Promoter and Middle were retrieved for analyzing between NAc and Ac gene. In the first-layer module, the SVM models were constructed within 9 feature encoding methods. And the significant sequences were analyzed by student's t-test and a model of logistic regression was used to assist training, which based on the relationship between distance from the 35S enhancer to the target gene and states of gene expression. The features encoded from Promoter region were weighted by logical regression model for probability of gene activation. Then, we adopted feature selection by LIBSVM built-in tool in the partial SVM models. The prediction results of first layer module were integrated in the second layer model, and mRMR [50] were used to feature selection and built the LAD tree model. At the last, we evaluated the prediction efficacy of TIMgo with D153 independent-testing dataset.

Correlation between Gene Activation and Distance from the 35S Enhancer to the TLS
The distance between the enhancer and a target gene cannot be directly used to determine whether the target gene will be activated, although it does have some relevance for determining gene activation [19,20]. A target gene is more likely to be activated if it is closer to the enhancer [21]. We characterized each of the 453 genes in the entire dataset (D453) based on the distance from the CaMV 35S enhancer on the inserted T-DNA to TLS and calculated the ratio of Ac genes and NAc genes. We found a negative correlation between this distance and gene activation. Genes closer to the 35S enhancer had a greater probability of activation (P < 0.001) (Supplementary Figure S3). The results are the same as those indicated in a previous study [15] (Figure 2, Supplementary Table S1).
Among the D453 dataset, there were 94 sets of duplicated data which consist of multiple genes, and the PROMOTER sequences corresponding to these genes were identical. Each of the experimental data in this study represented the effect of a single insertion event on its target gene. In the experimental data collected in this study, when the same gene was detected for multiple T-DNA insertion events, the PROMOTER sequences from those genes were identical. For different T-DNA insert events, the 35S enhancer may result in different states of expression for the same target gene, which will lead to contradictory results while building the machine learning model. To distinguish between these PROMOTER sequences, we used logistic regression to build a regression model of the distance coefficient and the target gene activation probability (Supplementary Equation S1). In this study, the values calculated by logistic regression were used to weight the promoter sequence feature, so that the same sequence could be distinguished when quantified based on numerical values.

Comparison of Kmer and RevKmer Combined with Motif
In the Kmer and RevKmer feature models, a t-test was used to calculate the number of occurrences of specific sequence fragments in Ac and NAc genes, respectively, from sequence lengths (k) of 3−9 nucleotides. The specific sequence fragments with P < 0.05 were then used for encoding. These fragments were combined as 3−6, 3−7, 3−8 and 3−9 combinations for Kmer and RevKmer. The Motif feature was used to carry out a similar analysis. The Kmer and RevKmer features associated with the PROMOTER region were combined with the Motif feature (Supplementary Table S2). The features from Kmer, RevKmer, Kmer + Motif and RevKmer + Motif were used to build SVM models, and the best model was selected for the second-layer model integration (Supplementary Table S3).
Before combining the Motif with Kmer or RevKmer, the Acc scores of the SVM models of Kmer and RevKmer were 55−85%, whereas the Acc scores of the Motif models were 52−75%. After combining the Motif with Kmer or RevKmer, the Acc scores were 78−86%, and the Acc consistently increased with the k value for Kmer and RevKmer (Table 3).

First-Layer Models Evaluation
In the first-layer models, nine feature coding methods and two types of sequences were used to construct into 16 feature models (Supplementary Table S4). The prediction ability of each feature model was evaluated with five-fold cross-validation and independent-testing with the D153 data (Table 4). For the Pre-in-One feature encoding, one gene sequence from the training dataset (D300) did not conform to the encoding requirements. Therefore, in the DACC, TACC and PseKNC model, this information was removed from the training data, and the training dataset consisting of the remaining 299 genes was referred to as D299. The PseKNC models used k values of 2−7, and eight models each were established for the PROMOTER and MIDDLE sequences. A PseKNC model with k = 6 that was selected among the PROMOTER models had an Acc of 75.3% with five-fold crossvalidation. The PseKNC model with k = 2 that was selected among the MIDDLE models had an Acc of 59.5% (Supplementary Table S5).
In the evaluation results of the first-layer feature models (Table 4), the Kmer, RevKmer, Kmer + Motif and RevKmer + Motif had the best predictive performance based on the Kmer feature provided. Their Acc values were 79−88.3% with five-fold cross-validation. With independent-testing, their Acc values were 80.4−84.3%, with the exception for RevKmer, which was 67.3%. The PseKNC model built using the PROMOTER sequence was slightly inferior to the model built using Kmer-related features. The Acc and MCC values for PseKNC were 75.3% and 0.53 with cross-validation, respectively, and 56.2% for Acc and 0.165 for MCC with independent-testing. The DACC, TACC, DNP, CGIs and TNP constructed by the PROMOTER sequence and the PseKNC constructed by the MIDDLE sequence had lower predictive performance, with Acc values of 58.2−69.9% and MCC values of 0.164−0.398. Among these 16 models, CGIs and TNP constructed using the MIDDLE sequence were the least accurate in cross-validation, with an Acc of ~47%. Their Acc values for independent-testing were 11.8% and 62.1%, respectively. In terms of overall predictive performance, the PROMOTER sequence is thus more important than the MIDDLE sequence, and Kmer, RevKmer, Kmer + Motif and RevKmer + Motif features have the highest correlation with the activation of genes.

Comprehensive Feature Selection in Second-Layer Model
The second-layer model integrated the prediction results from the 16 feature models in the first layer as features to build model by machine learning and obtained the ultimate prediction result. The features used in the second-layer model of this study included predictive results and positive and negative predictive confidence scores, generating 48 features. We used incremental feature selection and an SVM model with cross-validation to carry out comprehensive feature selection among these 48 features to pick out the best feature combinations with nine feature selection methods. The top 33 features of the mRMR [18] were selected as the best feature combination with the highest Acc and the fewest features (Figure 3, Supplementary Table S6). Among the 33 selected features, we knew that the encoding contributed to classification is DACC and Kmer related principally, the PseKNC and TACC are secondary, and the few are CGIs, TNP and DNP.

Second-Layer Model Evaluation
We assessed the best-suited machine learning algorithm for the second-layer model through the WEKA [22] analysis platform. In this study, we use the 65 algorithms provided by WEKA to establish the model separately and evaluated the effectiveness of these models with cross-validation (Supplementary Table S7). In this experiment, the LADTree algorithm was used to construct the second-layer integration model according to the above conditions. The Acc was 99.3%, MCC was 98.7% and Sn and Sp were 0.993. In independent-testing, the model Acc reached 85.6%, MCC was 35.3%, Sn was 0.891 and Sp was 0.533. Among the testing data, there were only 15 negative data, such that each predictive result with these data would lead to a substantial impact on the overall predictive effectiveness assessment. Among these models built with multiple algorithms, Sp values ranged from 0.467 to 0.733, which corresponded to a difference of only six correctly predicted negative data.

Correlation between Predictive Accuracy and Distance from 35S Enhancer to TLS
To analyze the relationship between distance and TIMgo prediction accuracy, the training dataset and independent-testing dataset were grouped according to the distance between the TLS and 35S enhancer (Figure 4). In cross-validation, Acc was 99.3%, and predictions for only two genes were incorrect (Table 5); these two genes were 10−15 kb away from the 35S enhancer. In independenttesting, the prediction accuracy for genes within 20 kb from the 35S enhancer was >84%. For genes located >20 kb from the 35S enhancer, the prediction accuracy decreased with increasing distance, but still was >60% (Table 6).

Differences of the Framework between TIMgo and EAT-Rice
In a previous study, the PROMOTER region for most genes was defined as the upstream region from the transcription start site (TSS) [23]. For the EAT-Rice analysis, we collected gene data that only include the information of TLS, however, the promoter region usually includes the upstream sequence of the TSS, which was based on a 1000-bp region upstream of the TLS. Upstream sequence of the TSS contain the 5' untranslated region of the mRNA and sequences downstream of the TSS may also be involved with transcription factor regulation of gene expression [24]. Therefore, we considered an average length of 500 bp for 5' untranslated regions in rice and the 1000-bp upstream of the TSS as the condition. We used the 1500-bp sequence upstream of the TLS as the PROMOTER in this study.
For our prediction models, we retained the EAT-Rice CGIs and DNP (dinucleotide conformation and physicochemical properties encoding) and increased the TNP coding with the DNP coding concept. We also used the Pse-in-One tool to generate code for DACC, TACC and PseKNC. Given the strand specificity of Kmer, we added RevKmer coding, and the Motif coding of the PROMOTER region was combined with Kmer and with RevKmer. The ranges of overall predictive accuracy for Kmer + Motif and RevKmer + Motif models were small, which indicated that Motif was complementary with Kmer and RevKmer, the combination of these two features could improve the classification ability. Predictive accuracy increased with the length of k for both Kmer and RevKmer, because that Motif feature consisted of experimentally validated regulatory sequences, but the number of proven regulatory sequences in plants is limited, whereas Kmer and RevKmer considered all the sequence combinations that provided higher data integrity than Motif, so using longer Kmer and RevKmer should lead to better prediction performance. Although Kmer and RevKmer had higher data integrity than Motif, the complexity of the Kmer and RevKmer data increased exponentially with the increase in sequence length, resulting in processing time that was too lengthy. Therefore, we used Kmer (and RevKmer) with limited k length and retained Motif with longer sequences, to preserve important regulatory sequence data and reduce the computational complexity significantly.

Specific Regulatory Sequences within Genes Activated by the 35S Enhancer
To find out whether a specific regulatory sequence was related to gene activation in the T-DNA insertion mutants, we analyzed the 2087 motifs with a t-test. We found that there were 181 regulatory sequences that had significant difference in their occurrence frequency between Ac and NAc genes. Among these 181 regulatory sequences, 20 were G-box and G-box related sequences. The G-box contains a core region, CACGTG, and flanking sequences that are composed of other nucleotides. The G-box binding protein has different binding preferences and affinities according to the different flanking sequences in the G-box. bZip (basic region/leucine zipper) transcription factors account for the majority of G-box binding proteins. Transcription regulation in plants is often affected by G-box sequences, such as stress hormones (e.g., abscisic acid), seed germination, protein storage and the light response [25][26][27]. Thus, the G-box may have important biological significance in the regulation of gene expression by the 35S enhancer and may affect whether the 35S enhancer will activate a target gene in rice.

Correlation between Length of Sequence and Nucleotide Length Parameter
In the feature coding of TIMgo, the coding of Kmer, RevKmer and PseKNC can be adjusted based on the nucleotide length parameter (k). We needed to find a suitable nucleotide length parameter for encoding. For these three kinds of coding, the k value selected for the PROMOTER region was greater than that for the MIDDLE region. A higher value for k results in a higher number of features being generated, which requires that more features need to be improved to increase the predictive accuracy of the PROMOTER region, relative to the MIDDLE region. Thus, an excessive number of features would reduce the predictive performance of the model. From the optimal k value for the MIDDLE sequence, we could see that a higher number of features did not necessarily make the classification better. By comparing the optimal k value selected for the PROMOTER and MIDDLE region, we note that a longer sequence does seem to require more features to make the classification better. Moreover, among the local, global and local + global sequence characteristics used to build the TIMgo, the local sequences had a greater contribution with respect to identifying activation of the target genes (Table 4).

Performance Comparison of TIMgo and EAT-Rice
To confirm that the model constructed by the framework of TIMgo is superior to that of EAT-Rice, the training dataset and testing dataset used to develop EAT-Rice were used to build models in the TIMgo framework and to evaluate TIMgo for comparing their predictive performance. The training dataset used with EAT-Rice had data for 280 validated genes, and these 280 data points were separated into two subsets (subset1, subset2) with 180 validated genes [15]. The independent-testing dataset used with EAT-Rice had 48 validated genes. Two training datasets (subset1 and subset2) were used to build training models within the framework of TIMgo, and the predictive efficacy of EAT-Rice and TIMgo was evaluated with an independent-testing dataset consisting of an additional 48 validated genes (Table 7). Using subset1 as the training dataset and using the EAT-Rice system to establish the model, the Acc in the independent-testing was 72.9%; the Acc for TIMgo was 79.2%, and the Sp value of TIMgo was 12.8% higher than that of EAT-Rice. Using subset2 as the training dataset, the Acc with independent-testing was 77.1% for EAT-Rice and 77.6% for TIMgo. In the case of using the same training dataset and testing dataset, the accuracy of the TIMgo framework is better than EAT-Rice.

Conclusions
In this study, we analyzed the DNA sequence and constructed a two-layer model system using the machine learning method to predict whether the 35S enhancer would affect the expression of a target gene in T-DNA insertion mutants. The first layer of the system was built with the PROMOTER and MIDDLE sequences and was encoded using nine features. We analyzed significant sequence fragments in Motif, Kmer and RevKmer and weighted the PROMOTER based on a logistic regression analysis of the distance between the 35S enhancer and the TLS of each gene. Some of the first-layer SVM models were built with LIBSVM feature selection. The second-layer model used the mRMR feature selection tool to select the predicted values from the 16 models in the first layer, and these were integrated with the LADTree algorithm as the second-layer model. The predictive performance of TIMgo had Acc of 99.3% with cross-validation and of 85.6% with independent-testing, and the predictive efficiency of TIMgo was better than that of EAT-Rice. TIMgo can more accurately predict the activation of genes located within 20 kb of the 35S enhancer. We analyzed the 2087 motifs and found that there was a significant difference in the frequency of G-box sequences between Ac and NAc genes, suggesting that the G-box may play an important role in the mechanism of 35S enhancer−activation of genes. Our model has improved the predictive ability of determining target gene activation based on the CaMV 35S enhancer in rice T-DNA insertion mutants. This system could help researchers to improve their efficiency when screening rice genes.

Availability of data and materials
Data cannot be shared publicly because of this data contain each gene expressed information are from the C4 Rice Project -is founded by Bill & Melinda Gates Foundation, which is not published yet, so we only provide the gene sequence for interest researcher to confirm. It can be obtained in supplementary material (Please see the additional file 2 of "Additional Material"). equipment and office supplies for the printing fee. (c) Advanced Plant Biotechnology Center from The Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education (MOE) in Taiwan. This funding was used for English editing and publication fees. The funders had no role in the study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author Contributions
C.-C.L. and J.-J.C. contributed to data collection, design of experimental processes and system architecture, and C.    . Accuracy trend of TIMgo for cross-validation and independent-testing of data within different distances. Train represents the Acc from 5-fold cross validation with D299; Test represents the Acc from independent testing with D153. The x axis indicates each distance interval, and the y axis indicates the predictive accuracy. Table 1. Data distribution of flanking analyzed genes in rice T-DNA mutants a Validated genes indicate the target genes that were detected by RT-PCR. b NCHU, experimental data were collected from Liang-Jwu Chen's laboratory. c Academia-Sinica, experimental data were collected by Su-May Yu's research team. Ac: activated gene; NE: non-activated gene; ND: non-detectable gene; Ko: knockout gene