PromoterPredict: sequence-based modelling of Escherichia coli σ70 promoter strength yields logarithmic dependence between promoter strength and sequence

We present PromoterPredict, a dynamic multiple regression approach to predict the strength of Escherichia coli promoters binding the σ70 factor of RNA polymerase. σ70 promoters are ubiquitously used in recombinant DNA technology, but characterizing their strength is demanding in terms of both time and money. We parsed a comprehensive database of bacterial promoters for the −35 and −10 hexamer regions of σ70-binding promoters and used these sequences to construct the respective position weight matrices (PWM). Next we used a well-characterized set of promoters to train a multivariate linear regression model and learn the mapping between PWM scores of the −35 and −10 hexamers and the promoter strength. We found that the log of the promoter strength is significantly linearly associated with a weighted sum of the −10 and −35 sequence profile scores. We applied our model to 100 sets of 100 randomly generated promoter sequences to generate a sampling distribution of mean strengths of random promoter sequences and obtained a mean of 6E-4 ± 1E-7. Our model was further validated by cross-validation and on independent datasets of characterized promoters. PromoterPredict accepts −10 and −35 hexamer sequences and returns the predicted promoter strength. It is capable of dynamic learning from user-supplied data to refine the model construction and yield more robust estimates of promoter strength. PromoterPredict is available as both a web service (https://promoterpredict.com) and standalone tool (https://github.com/PromoterPredict). Our work presents an intuitive generalization applicable to modelling the strength of other promoter classes.

Theory has yielded a linear relationship between the total promoter score and the natural log of promoter strength (Berg & Von Hippel, 1987;Li & Zhang, 2014). Nucleotide occurrence frequencies were first used by Weller & Recknagel (1994) in promoter strength prediction. Additivity in promoter-polymerase interaction has been affirmed by Benos, Bulyk & Stormo (2002). Patterns in s 70 promoters have been quantified by Huerta & Collado-Vides (2003). Strength of E. coli s E RNA polymerase promoters were studied by Rhodius & Mutalik (2010). The complexity of E. coli s 70 promoter sequences has been treated from an information theoretic standpoint by Shultzaberger et al. (2007). More recently, an support vector machines (SVM) model has been successfully applied to predicting the strength of a mutation library of E. coli Trc promoter sequences (Meng et al., 2017). One drawback with an SVM or artificial neural networks (ANN) machine learning model is the 'black-box' approach; that is, the absence of any mechanistic insights that could be gleaned with respect to the relationship between promoter sequence and strength. Such an understanding could be vital in the prediction of promoter strengths in different contexts, as well as the forward design of promoters in finely-tuned genetic circuits (see Endy, 2005;De Mey et al., 2007;Salis, Mirsky & Voigt, 2009;Li & Zhang, 2014). Many freely available resources predict the location of promoters in a genomic sequence mainly by identifying the -10 and -35 regulatory sequences (De Jong et al., 2012), but very few tools are available to predict the strength of such sequences. One tool provides qualitative predictions ('strong' or not) of promoter strength based on the occurrence of a triad pattern (Dekhtyar, Morin & Sakanyan, 2008), and is available as a macro. Here, we present a two-step approach to the predictive modelling of the strength of s 70 core promoters, and a companion web-based platform and Python standalone tool that implement our method along with the option to dynamically include user data into the prediction model. Our implementation is the first freely available tool/web-server for the quantitative prediction of promoter strength.

Generative model of promoter sequences
A generative model of the -10 and -35 promoter sequences is constructed using two position weight matrices (PWM -10 and PWM -35 ) in the following manner.
A comprehensive set of s 70 -binding promoter sequences was extracted from the RegulonDB (Gama-Castro et al., 2016). For each promoter sequence, we extracted a -35 region of 13 nucleotides centred at -35 position, and a -10 region of 13 nucleotides centred at the -10 position, to allow for uncertainties in the precise position of occurrence of the hexamers. For each -35 region, we used FIMO (Grant, Bailey & Noble, 2011) to find the best match to the consensus -35 motif, and similarly for the -10 regions, to obtain a dataset of -35 and -10 hexamer sequences. This dataset was then filtered for only significant hits to the consensi motifs (p-value < 0.05) and the resulting dataset was used to determine the weights of each nucleotide at each position of the -35 and -10 hexamers. Nucleotide-wise counts at each position of the hexamer motifs were augmented by a pseudo-count prior to correct for E. coli GC content of 50.8% and the resulting frequency matrices were converted into log-odds matrices. Biopython routines (www.biopython.org) were used.
Linear modelling of promoter strength Following Berg & Von Hippel (1987), we modelled the relationship between the promoter sequences and the ln of the promoter strength using multiple linear regression. The training set of 18 promoters is drawn from the Anderson library of activatorindependent plasmid tet promoter variants maintained at the Registry of standard biological parts (http://parts.igem.org/Promoters/Catalog/Anderson). Each promoter sequence is scored with respect to the generative models of the -10 and -35 motifs (i.e. the PWM -10 and PWM -35 matrices) and the two scores obtained formed the feature space of the regression modelling. The regression coefficients to be determined represent the weights of the -10 and -35 regions in the regression analysis. The Anderson library provided promoter strengths spanning two orders of magnitude and normalized in the range 0.00-1.00 with respect to the strongest (i.e. reference) promoter. It was noted that the normalisation step would not affect a linear relationship, altering only the constant of the regression. The normalised strength values were log-transformed to obtain the required response variable values. Since the ln function rapidly descends towards-Inf with decreasing promoter strength, we capped the infimum of promoter strength at 0.0001 prior to log-transformation. The least-squares cost function was minimized using iterative gradient descent. The model parameters were assessed using t-statistics, and the overall model was assessed using F-statistic and the adjusted multiple coefficient of determination given by: where m is the number of features and n is the number of instances. The adjustment is a penalty for increasing model complexity.

Model validation
The model of promoter strength was validated in three ways: i) The model was validated using leave-one-out cross-validation (LOOCV).
ii) We generated 100 sets of 100 randomly generated promoter sequences each, using the sample function in Python. From the obtained sampling distribution of mean strengths of random promoter sequences, we calculated the estimate of the true mean strength of a random promoter sequence, together with its standard error.

RESULTS
The entire datasets of 1,004 -35 hexamers and 1,046 -10 hexamers parsed out of RegulonDB are available as Supplementary Information. The conservation profiles of the extracted -35 and -10 hexamer sequences of the promoters in the RegulonDB were visualized and shown in Fig. 1. Based on these PWMs, the site scores of each promoter sequence in the Anderson library were regressed on the corresponding ln of the promoter strength. A summary of this process with the training data, log-transformation of the promoter strength and predicted response values is presented in Table 1. The modelling process converged within 10 5 iterations by tuning the gradient descent to a learning rate (a) of 0.015, and the following model was obtained: We derived an independent solution of the multiple regression using R (www.r-project.org) and obtained a correlation coefficient of 0.998 between the fitted values of the two models. The interval estimates of the coefficients of the regression were computed in R using confint (fit, level = 0.95), and obtained the following 95% confidence intervals: Intercept : (-6.4974449, -3.7118421) The interval estimates did not include zero, and this implied that the coefficients were significant at the 0.05 level. In fact, all the three estimates were significant at a p-value of 1E-3. The F-statistic of the overall regression was significant at a p-value of 2E-4 and adj. R 2 was ≈0.65. The plane of best fit corresponding to the above model is visualized in Fig. 2.
The model was then cross-validated using a 18-fold LOOCV (similar to jack-knife). Cross-validation yielded a correlation coefficient of ∼0.76 (Table 2). We sought to benchmark our model on a negative test set by generating random -35 and -10 hexamer sequences. To this end, we applied our model to 100 sets of 100 random promoter sequences each (available in Supplementary Information) and estimated the true mean of the sampling distribution as 0.00055. The standard error of the estimate was 1.04E-7. The low predicted strength along with the very small standard error indicated that the model predicted these instances to be non-promoter sequences with good certainty. This affirmed the specificity of our model for true promoters.
To validate our model further on true promoter sequences and experimentally characterized promoter strengths, we used datasets available in the literature and  The promoter activities (strengths) are seen to span two orders of magnitude in the range (0.0, 1.0). The promoters follow the naming in the Anderson dataset.   (2011), we ranked the promoters in Table 1 of the same reference according to their strengths and observed a 1,000-fold span of promoter strengths, 1E-3 to 1 (Table 3). Promoters 2 and 3 were identically strong, hence we took the average of their predicted strengths in ranking the promoters. With this arrangement, we found that the predicted order of promoters in terms of strength exactly reproduced the experimentally characterized order. Despite the fact that Anderson library and these promoters were characterized and normalized using different systems, the model was able to predict surprisingly well across a promoter strength spectrum spanning three orders of magnitude.
ii) Next, we applied our model to the set of 13 strong promoter candidates of Thermotoga maritima discussed in Dekhtyar, Morin & Sakanyan (2008). Using the hexamer sequences provided in Fig. 5 of the same reference, we applied our model and obtained quantitative predictions of promoter strengths (Table 4). Almost all the promoters had predicted strengths >0.38 and promoters with canonical hexamers even had strengths >1.00. One promoter (TM0032) was predicted as 'weak' with a strength ∼0.056 and seemed to point to an apparent anomaly in the relationship between promoter sequence and strength, possibly highlighting the need for further experimentation on this promoter. Our observations were corroborated by Fig. 4 in the same reference that showed the least and greatly reduced expression from this particular promoter. These results taken in conjunction with the results on random promoter sequences affirmed the ability of our model to discriminate between promoters at opposite ends of the strength spectrum.
iii) We also applied our model on the five promoters discussed in Dayton et al. (1984). Of these, the first three are known as 'major' promoters that are active even at low concentrations of the polymerase, whereas the last two are 'minor', less strong promoters that are only active when the polymerase is present at high concentrations. We applied our model on the promoter sequences found in Fig. 5 of the same reference and found the predictions in line with the nature of these promoters ( Table 5). The activity of the least strong 'major' promoter is about two times more than the activity of the strongest 'minor' promoter. Hence, our modelling approach was able to discriminate between major and minor promoters.

DISCUSSION
In addition to the independent contributions of -35 and -10 sites to promoter strength, we were interested in exploring if any interactions between them could contribute to promoter strength. To this end, we examined the following model in R: where PWM35 and PWM10 represent the corresponding site scores. This model resulted in a lower adj. R 2 -value than that without any interactions. Further, the p-value of the PWM 10 score dropped below significance (0.31), and the interaction term turned out to be totally insignificant (p-value: 0.97), thus discounting any interaction between the sites in the present dataset. On this basis, the null hypothesis of absence of any interaction could not be rejected, and we concluded that there is little evidence for interaction between the -35 and -10 sites in contributing to promoter strength.
Our model assumed that both the predictors carried independent information about the promoter strength, and together they are able to provide sufficient information about the strength. The basis of this assumption was probed to determine if both predictors are necessary to the model. Could one predictor provide sufficient information about the promoter strength in the absence of the other? There are at least three angles to address this question, and all of them were considered to interpret the model better. The promoters were ordered based on the rank of their strength, and given as input to our model. The predicted promoter log strengths were then examined for agreement with the actual rank and the ordering obtained matched the original ordering. The individual predicted values for pro2 and pro3 were 0.0024 and 0.059, respectively.
Since there is not much difference between R 2 and adj. R 2 , we could say that both predictors contribute substantially to the response variable (promoter strength) and account for about 65% of its variance.
2. Since the p-values of both predictors are significant, it would be interesting to observe their effect on the response variable in more detail. This was performed using the effects package in R: library(effects) fit = lm(logStrength∼ PWM35+ PWM10, data) plot(allEffects(fit))  The results are shown in Fig. 3 where the PWM scores are plotted against the level of confidence in the predicted response. Confidence in the effect of -35 site increases with the score from 0 to about 7, and then is susceptible to edge effects as the score reaches 8. Confidence in the effect of the -10 site increases with the score from -4 to about 5, and then is susceptible to edge effects as the score reaches 10.
3. Another way to address the question is to compute the correlation coefficients between all the variables of interest, including a variable with the combined effects of -35 and -10 sites. This is shown in Table 6. Three features were used, namely PWM -10 score, PWM -35 score, and the combined score (i.e. PWM -10 + PWM -35 ). These feature variables were correlated with two response variables, namely promoter strength and its corresponding log-transformation. It was first observed that the PWM -10 and PWM -35 scores were anti-correlated with each other (correlation coefficient = -0.37), thus supporting the hypothesis that they are two independent features that could compensate for each other in determining promoter strength. It was significant that the each feature was better correlated with the log of the strength than the strength itself. We tried to regress the strength on the PWM scores, but the model had a very low adj. R 2 (≈0.40) and the intercept term was not significant at the 0.05 level. Further, the highest correlation between the features and response variable was observed between the combined score and log of the promoter strength (∼0.79), but the combined score showed only a moderate correlation with the promoter strength prior to log-transformation (∼0.63). This was in keeping with similar observations for the strength of s E promoters (Rhodius & Mutalik, 2010) and underscored the logarithmic dependence between the promoter strength and sequence.
Finally, the assumptions of linear modelling were investigated with reference to our problem. Model diagnostics of four basic assumptions were plotted (shown in Fig. 4). Specifically: Plot A: The residuals were plotted against the fitted values. No trend was visible in the plot, indicating the residuals did not increase with the fitted values and followed a random pattern about zero. This validated the assumption that the errors were independent.
Plot B: The square root of the relative error (standardized residual) was plotted against the fitted value. An almost flat trend was observed, indicating that the standardized residual did not vary with the fitted value. This further validated the assumption that the errors were independent.
Plot C: To test the assumption that the errors were normally distributed, the standardized residuals were plotted against the theoretical quantiles of a normal distribution. The residual distribution closely followed the theoretical quantiles, except for minor deviations towards the tails of the distribution.
Plot D: Since the least-squares cost function is sensitive to outliers, the number of outliers should be kept to a minimum. This was investigated by plotting the standardized residual against the corresponding instance's model leverage. This plot showed that there were no significant outliers in the dataset that could exert an undue influence on the regression parameters.  An alternative univariate regression model using only the combined score of the PWMs found the coefficient of regression and the F-statistic significant (both p-values ≈ 10 -4 ). However, the adj. R 2 of the model (≈0.59) was much lower than that for Eq. (2), so the original multiple linear regression model was retained for the estimation of the promoter strength.
In summary, our model performed equally well on datasets of strong promoter sequences and datasets of weak random promoter sequences. Our model was consistent in detecting promoter strengths across a 1,000-fold span of promoter strengths in E. coli as well as the promoter strengths of a different species, T. maritima. The model was further able to discriminate between the major and minor promoters of bacteriophage T7.
Based on these results, an open-access open-source web server and standalone tool offering the prediction service have been implemented. Since the linear modelling results are dependent on the dataset, our implementation provides a facility to augment the learning based on user-provided inputs. The web interface is based on Python web module (web.py) and nginx server. The computational layer is based on numpy, Biopython and matplotlib. The user is provided with an option to add any number of promoter instances with -10 and -35 sequences and the corresponding strengths to augment the training data of the supervised model. The measurement of promoter strength could be done in the manner of Kelly et al. (2009), where the GFP (reporter gene) synthesis rate is measured per unit biomass, and this could be normalized relative to the reference promoter. In order to assess the goodness of fit of the updated model, the R 2 -value is re-computed, along with the 3D plot of the regression surface. This would enable the user to decide whether the data added to the model has improved its performance for further experiments with the software. Based on the trained model, the user could predict the strength of an uncharacterised promoter given its -10 and -35 hexamers.

CONCLUSION
The following important conclusions were drawn from our study. (1) Sequence-based modelling yielded a non-linear, logarithmic dependence between promoter strength and sequence.
(2) The model was able to discriminate equally well between strong/major promoters and weak/minor/random promoter sequences, indicating successful learning of the essential features of promoter strength prediction. (3) The combined score (PWM -35 + PWM -10 ) emerged as the single most important predictor of the promoter strength. Our model yielded robust quantitative prediction across a 1,000-fold span of promoter strengths. It is straightforward to extend our methodology to the study of new promoter classes of other s factors. Our implementation and web service could be useful in characterizing promoters identified in genome sequencing projects as well in engineering promoters for the design of finely-tuned genetic circuits in synthetic biology. The dynamic feature of our implementation would enable users to incorporate their own data into the model and obtain more reliable estimates of promoter strength. The service will be periodically updated based on the availability of new training instances, user input data and/or models for promoters of other s factors.