Exploring comprehensive within-motif dependence of transcription factor binding in Escherichia coli

Modeling the binding of transcription factors helps to decipher the control logic behind transcriptional regulatory networks. Position weight matrix is commonly used to describe a binding motif but assumes statistical independence between positions. Although current approaches take within-motif dependence into account for better predictive performance, these models usually rely on prior knowledge and incorporate simple positional dependence to describe binding motifs. The inability to take complex within-motif dependence into account may result in an incomplete representation of binding motifs. In this work, we applied association rule mining techniques and constructed models to explore within-motif dependence for transcription factors in Escherichia coli. Our models can reflect transcription factor-DNA recognition where the explored dependence correlates with the binding specificity. We also propose a graphical representation of the explored within-motif dependence to illustrate the final binding configurations. Understanding the binding configurations also enables us to fine-tune or design transcription factor binding sites, and we attempt to present the configurations through exploring within-motif dependence.


Contents Supplementary Methods.
Positive and negative association rule mining.
Selection of the α value during the elastic net regularization.
Workflow for sequence analysis with ELRM.

ELRM visualization.
Supplementary Figure S1. Correlations between information content and ratio of association features.   Positive and negative association rule mining. We considered each sequence as a collection of items. Each item is the single-base feature. For example, TTGAT, is a collection of five items, {T 1, T 2, G3, A4, T 5}. We applied the Apriori algorithm 1 to search the frequent itemsets from collections. These collections were the non-redundant sequences which have PWM scores greater than zero. After performing the Apriori algorithm, we searched the frequent itemsets and their support values. For each frequent itemset X, we can find any two subsets A and B, which satisfy the conditions, A ∪ B = X and A ∩ B = / 0. Then, we can calculate the support values for the negation itemsets (¬A ∪ B or ¬A ∪ ¬B) as follows: Workflow for sequence analysis with ELRM. The Perl and R scripts used to construct ELRMs and perform sequence scan are publicly available at https://github.com/chiyang/elrm. Currently, the work was implemented for microorganisms with one chromosome. To perform genome scan or promoter analysis with ELRM, we suggest the following workflow.
1. Model construction (a) To construct an ELRM, fixed-length binding sequences of a TF are required as the positive sequences. These sequences can be obtained from Regulon DB, literatures, or ChIP-chip/ChIPseq experiments.
(b) Construct the PWM from the positive sequences.
(c) Perform PWM scan on genome sequence(s) and collect sequences whose PWM scores are greater than zero. We suggest using PWM scanning tools which are suitable for chromosome-sized scanning. For exampe, MOODS, 3 TFM-scan, 4 or PWMscan (http://ccg.vital-it.ch/pwmtools/pwmscan.php) are suitable for genome-wide PWM scanning. (d) Perform Apriori algorithm for mining frequently co-occurring single-base features (Apriori.pl).
In this step, the minimum support has to be specified.
(e) Perform positive and negative association rule mining from the previous step and integrate rules into association features (FindRules.pl). In this step, the minimum confidence and the minimum correlation strength are required.
(f) Code the sequences of training sets for model construction (SequenceCoding.pl). The training sets should contain positive and negative sequences. In this study, the negative sequences were generated as described in the main text. The training sequences of the 86 TFs in E. coli are also available from the ELRM website.
(g) Regress the model (GetTheModel.R). After coding the training sets, an R script was implemented to construct the ELRM. An α value has to be given. The λ value is optional. When λ is not set, the script automatically performs 10-fold cross validation to generate a λ that gives the minimum mean cross-validation error. The output will include the λ value, and this λ can be given later to get the same model. (h) Visualize the ELRM. We provided an interactive user interface to investigate the model (please see the next section and Supplementary Fig. S5).

Sequence scan
(a) In the testing steps, interested sequences such as promoter regions or the whole genome have to be coded into a matrix as shown in the Figure 1 in the main text. A Perl script was implemented to convert these long sequences into sliding windows of the motif length with 1-bp offset (SequenceCoding.pl). (b) After the sequence coding, an R script was provided to perform ELRM scan (ELRMscan.R).
For detailed usage of these scripts, please see the instructions on the ELRM website.

ELRM visualization
To assist in the understanding of the ELRM and fast investigation of the binding motif, we created a graphical representation of ELRM with an interactive user interface. The interface was written in HTML5 and can be easily viewed on web browsers that support HTML5. This interactive tool is also available on the ELRM website (https://github.com/chiyang/elrm). Supplementary Figure S5 explains the ELRM representation. Through this visualization, we aim to easily view the within-motif dependence and then interpret the characteristics of a binding motif. With the interactive user interface, one can inspect a singlebase feature by clicking it and view its coefficient as well as the coefficients of the involved association features.
The user interface also allows users to test an interested sequence and see if the sequence is preferred for TF binding. The interface will highlight the involved single-base and association features of the given sequence and output a probability score of it being a binding site. Therefore, users can quickly test if their sequence of interest is a TF binding site and understand the reasons why the model describes the sequence as preferred/non-preferred. Ratio of association features

Normalized information content
Supplementary Figure S1. Correlations between width-normalized information content and ratio of associations in ELRMs. This scatter plot shows the distribution of normalized information contents versus the ratio of association features. The ratio of association features is the number of association features over the total feature number (i.e. total number of coded single-base features and association features). The linear regression line shows the Pearson correlation coefficient of 0.71.  Figure S2. Screening threshold settings. Three thresholds (minimum support, minimum confidence, and minimal correlation strength) determine the quantity and quality of association features for our model construction. We constructed ELRMs with the 48 combinations of three minimum supports (0.1, 0.15, and 0.2), four minimum confidences (0.5, 0.6, 0.7, 0.8), and four minimum correlation strengths (0.05, 0.1, 0.15, 0.2). For each threshold set, the 86 TF binding models were trained and the cross-validation AUCs of these 86 models were used to assess the learning performance and shown in the boxplot. The gray bars indicate the threshold sets which passed our criteria, i.e. the worst cross-validation AUC is more than 0.8 and the number of outliers is less than seven. The dark gray bar shows the threshold set which passed the criteria with the smallest standard deviation.  Figure  The α value

Ratio of feature number
Supplementary Figure S4. The ratios of feature numbers at different α values. The ratios were calculated relative to the feature number at the α of 0.1 under the same parameter settings. For each α, models from the 48 configurations were constructed, and the feature numbers were summarized and presented in this boxplot.

9/21
Supplementary Figure S5. Graphical representation of ELRM. The representation contains three parts, association features, independent single-base features, and irrelevant features. The single-base features were represented as rectangular boxes, and association features were edges that link among the boxes. It is worth noting that an association feature may be constituted of more than two single-base features. In this case, any two single-base features in the association feature will be joined by an edge. For example, in the above figure, the three association features G1:T8, G1:C15, and T8:C15 had positive coefficients and G1:T8:C15 had a negative coefficient. Therefore, the three single-base features G1, T8, C15 were linked by both red (positive) and green (negative) edges. On the other hand, two single-base features may be involved in more than one association features. This explains the presence of multiple edges that link between two single-base features. Although this ELRM representation can not directly present which single-base features form an association feature, we created a web-based interactive user interface that provides tooltips to assist investigation of the association features (see Supplementary Methods).

18/21
Supplementary Table S3. Performance assessment of ELRM and other approaches. We assessed the learning performance of ELRM and three other modeling approaches: PWM, 5 TFFM, 6 DWM. 7 We conducted the ten-fold cross-validation when the positive training size is greater than ten sequences and used the leave-one-out cross-validation for the remaining sets. The same cross-validation procedure was performed for the four approaches. The results from the N-folds were combined and cross-validation AUCs was calculated and shown in this table. In average, the ELRM has significantly lower average AUCs of 0.0172 ± 0.0296 and 0.0180 ± 0.0297 (mean±SD), compared to PWM and TFFM, respectively (Both p-values < 1e − 6). The ELRM has slightly higher AUCs of 0.0243 ± 0.0733 than the AUC of DWM (p-value = 0.0014).