Modelling the evolution of transcription factor binding preferences in complex eukaryotes

Transcription factors (TFs) exert their regulatory action by binding to DNA with specific sequence preferences. However, different TFs can partially share their binding sequences due to their common evolutionary origin. This “redundancy” of binding defines a way of organizing TFs in “motif families” by grouping TFs with similar binding preferences. Since these ultimately define the TF target genes, the motif family organization entails information about the structure of transcriptional regulation as it has been shaped by evolution. Focusing on the human TF repertoire, we show that a one-parameter evolutionary model of the Birth-Death-Innovation type can explain the TF empirical repartition in motif families, and allows to highlight the relevant evolutionary forces at the origin of this organization. Moreover, the model allows to pinpoint few deviations from the neutral scenario it assumes: three over-expanded families (including HOX and FOX genes), a set of “singleton” TFs for which duplication seems to be selected against, and a higher-than-average rate of diversification of the binding preferences of TFs with a Zinc Finger DNA binding domain. Finally, a comparison of the TF motif family organization in different eukaryotic species suggests an increase of redundancy of binding with organism complexity.


Maximum Likelihood Estimation (MLE)
In this section we derive the maximum likelihood estimator for the parameter θ of the discrete logarithmic distribution. The properly normalized discrete logarithmic distribution, for k ≥ 1 and 0 < θ < 1, is given by: Given the data sample {k j } j=1,...F , where k j is the size of j-th family and F the total number of families, the likelihood is defined as: As usual, we will treat the log-likelihood: where k is the arithmetic mean over the sample and M g is the geometric mean over the sample. To maximize the log-likelihood: So the ML estimator θ M LE depends only on the arithmetic mean over the sample. To obtain the explicit form of θ M LE we rewrite the equation in terms of t = ln(1 − θ M LE ) and take advantage of the Lambert W Function defined by z = W (z)e W (z) for any complex number z. Then So the relative error is given by where W (z) denotes, as above, the Lambert function. This is the minimum acceptable value of M for which the relative error is smaller than for a given θ. For example for = 0.01 at θ = 0.74 M needs to be greater than 10 a condition which is indeed fulfilled by our data, As θ decreases also the threshold M min decreases. In the fouth cases discussed in the main text we find M min = 1 for θ = 0.22 (yeast), M min = 2 for θ = 0.30 (caenorhabditis), M min = 4 for θ = 0.51 (drosophila) and M min = 10 for θ = 0.75 (mouse). We checked that in all the above cases this condition is indeed fulfilled by the data.   Figure S2: θ does not depend on the number of TFs. We ran 5 * 10 3 simulations of the model for different system sizes, corresponding to the different numbers of TFs. The rates of duplication, deletion and cis-innovation are set by θ = 0.7 and by the additional constraint λ = δ. Simulations are let to run for 5 * 10 4 to the steady state and the θ M LE is computed.

Three illustrative examples of the splitting of DBD families in motif families
This section discusses three prototypical examples of DBD families and how they split into motif families. More specifically these examples cover three possible scenarios: a DBD family that splits into a set of single copy transcription factors (thus potential singletons); a family in which the DBDs are so well conserved that all the components belong to the same motif family; and finally an intermediate situation in which the DBD family splits into several motif families of different sizes.

The {IRF} family
In human, the interferon regulatory transcription factor (IRF) family is composed by nine elements denoted as IRF1....IRF9. For each of them, except for IRF6, there are several experimentally validated PWMs. For instance, for IRF1 (which was the first member of the IRF family identified) there are 6 known PWMs: four are present in the TRANSFAC database [1], one in the HOCOMOCO database [2] and one in the JASPAR database [3]. These PWMs are very similar, although not identical, and are thus treated in the CIS-BP database as independent entries. The DBD domain of IRF1 has a strong homology with the DBDs of the orthologous IRF1 in other vertebrates, but low homology with any other human TF. Therefore, there are no PWMs from other TFs that can be additionally associated to IRF1 by inference. The same occurs for all the remaining IRFs, except for IRF6 for which there is no direct experimental evidence. There is however a PWM for the mouse version of IRF6 obtained using Protein Binding Microarrays and reported in ref. [4]. Thanks to the very high level of homology (100% identity of the two DBDs) between the human and mouse versions of IRF6, this PWM can be inferred to hold also for the human version of IRF6. In this way, we end up with at least one PWM for each IRF gene. Since there is no common PWM between any pair of IRFs, each of them is present as a singleton in our network. It is easy to see that this result is a consequence of the low level of homology between the DBDs of different human IRF TFs (so that no PWM of a given human IRF could be inferred to hold also for another human IRF) and thus correctly reproduces the fact that they substantially diverged during evolution. As a matter of fact, a direct inspection ( Figure S4, panel a) immediately shows that the PWMs of different human IRFs are indeed very different among them, in contrast with the high similarity of PWMs obtained with different experimental methods for the same IRF, as mentioned above in the IRF1 case.

The {IRX} family
A radically different situation is that of the IRX family. The Iroquois homeobox factors (IRX) compose a family of homeodomain TFs that play a crucial role in many developmental processes [5]. In human, this family is composed by six elements denoted as IRX1...IRX6. Only for two of them: IRX2 and IRX5 there is an experimentally validated PWM, obtained, in both cases, using SELEX technology [6]. The PWMs of the other four can be inferred from these two, thanks to the high degree of homology among these TFs. More precisely the DBDs of IRX2 and IRX5 have a 100% level of homology between them (and indeed the two PWMs are very similar and almost coincide), IRX6 has a 90% homology with (IRX2, IRX6) and IRX1, IRX3, IRX4 (which in turn are very similar among them) have 88% level of homology with (IRX2, IRX6). As a final result all the 6 TFs share the same pair of PWMs and are thus joined together in the same motif family. In Figure S4b a typical PWM similarity within the group is shown.

The {HOX} family
As a final example let us study an intermediate situation. The HOX family is composed by a large number of TFs which turn out to split in three motif families in our network. There is a large motif family composed by 34 TFs out of which 31 comes from HOX genes, and two smaller families of 4 and 2 elements respectively. In particular, the family with 4 elements is composed by the {HOXA13, HOXB13, HOXC13, HOXD13} proteins. These are well studied and important TFs for which several PWMs have been characterized experimentally with a wide range of different techniques. More precisely: -HOXA13 is associated in CIS-BP to 7 different experimentally validated PWMs. Two are taken from the Transfac database [1], four are the results of SELEX experiments [6] and the last one is taken from the HOCOMOCO database [2].
-HOXB13 is associated to three PWMs, one obtained with Protein Binding Microarrays [7], and the other two with Selex [6].
-HOXD13 is associated to three PWMs, one is taken from the HOCOMOCO database [2] and the other two were obtained with Selex [6].
What is important for our analysis is that the DBDs of these four TFs show a very high level of homology among them ( Figure S3). Thus all the PWMs listed above can be inferred to hold also for all the remaining HOX*13 genes. On the contrary, these proteins show a rather low degree of homology in their DBDs (below 70%) with the remaining HOX. For this reason, no PWM corresponding to any other HOX TF can be associated to the HOX*13 TFs by inference. As a final result, these four HOX*13 TFs compose a single motif family, which is distinct from the main one containing almost all the remaining HOX TFs. Figure S4 show that indeed the PWM similarity among HOX*13 TF is really high (panel c) with respect to the negligible similarity of PWMs of elements of HOX*13 with other HOX TFs (panel d). This last example allows us to discuss another important issue related to the inference protocol of the CIS-BP database. Each DBD within the CIS-BP database is characterized by a different threshold which separates highhomology TFs (from which the PWMs can be inferred) from low-homology ones. In the HOX case discussed here, the threshold is located at 70% of homology, but it can be different for different DBDs depending on their specific physical properties. On the other hand, the IRX1..6 family (panel b) does not split into independent motif families and coherently the associated PWMs have a high degree of similarity. Finally, motifs associated to the motif family of HOX*13 TFs have a strikingly high value of in-group similarity (panel c), as opposed to a very low value when compared to motifs associated to other HOX member that still belong to the same DBD family but are associated to a different motif family by our procedure.

Robustness of the motif family organization
Our definition of motif families is based on the CIS-BP database (version number 1.02), which in turn is based on a combination of experimental evidence and DBD homology. In order to assess the robustness of our construction, this section studies the changes induced in the TF-TF network, and consequently in the composition of its connected components (i.e., the motif families), if a different and less stringent definition of links is adopted. So far, we considered the TF-TF network in which a link is present if two TFs share at least one identical PWM according to the CIS-BP database. This choice could in principle generate errors in the network construction if two very similar PWMs are actually annotated as different in the data source. To test this possible issue, we introduce a measure of similarity between PWMs and thus expand the TF-TF network using different thresholds for this similarity measure to define new potential links.

PWM similarity as an alternative way to group transcription factor binding preferences
We downloaded all PWMs and their associations to TFs from CIS-BP (version 1.02). To avoid the noise due to PWMs with low information content (IC), we filtered out the matrices with IC less than 10 bits. We measured the IC as the Kullback-Leibler (KL) distance between the motif and the overall genome composition where L is the length of the motif, p i (b) is the frequency of base b at position i in the motif and q(b) is its background frequency. After this filtering step, we performed a similarity analysis between pairs of PWMs using the Jaccard index as in ref. [8]. In this context, the Jaccard index measures the fraction of sequences recognized by a pair of PWMs in the larger set of sequences recognized by any of the two. A completely connected weighted network of all PWMs with IC > 10 bits can now be built with weights defined by this similarity measure. The distribution of this weights turned out to be roughly exponential.

A TF-TF network expansion based on PWM similarity does not significantly alter the motif family organization
The TF-PWM network defined by the associations in the CIS-BP database is now coupled to the weighted PWM-PWM network defined above. A threshold can now be defined on the PWM similarity measure to prune the PWM-PWM network and keep only the connections corresponding to a sufficiently high similarity. For any given value of the threshold, there is a natural way to define the TF projected network that we used in the main text. This is a TF-TF network in which a link is present between two TFs if there is at least one connected path joining them through their associated motifs in the combined TF-PWM network. Once again, the network connected components define the motif families. This procedure is sketched in Figure S5. With this procedure, the motif family organization in principle depends on the choice of the threshold for PWM similarity. Two extreme cases can be easily defined: -a threshold value close to zero conserves all the links between PWMs, thus defining a fully connected network of TFs; -if a perfect identity between PWMs is required to cross the threshold, we recover the procedure described in the main text, in which a link between two TFs is present only if they have at least one motif in common.
Exploring threshold values in between these two extreme scenarios opens the possibility to test the robustness of the motif family organization as a function of the similarity threshold between PWMs. In other words, we can check how much our definition of the motif family structure is resilient to the addition of new links in the underlying PWM network and thus indirectly how much can be affected by possible annotation errors in the original database. Note that the threshold value corresponds to the level of similarity considered sufficient to state that two TFs have essentially the same binding preferences, and thus potentially the same set of target genes. Supplementary Figure S6a reports the number of links that are added to the projected network between TFs as a function of the threshold value for PWM similarity (measured with the Jaccard index), and of the corresponding added links between PWMs. Even if this absolute number can be substantial, it does not correspond to a significant reorganization of the motif family organization. This is shown in Figure S6b: the fraction of TFs that change motif family due to this network expansion is very low for a wide range of threshold values. For example, this fraction is still below 10% for a threshold value of 0.7 which corresponds to more than 400 new potential links in the TF network. In fact, the number of new links in the TF network increases almost linearly with the threshold (with bumps which may be related to the modular structure of the network), but most of them coincide with already existing links between TFs or with links that do not merge different motif families. This is clearly illustrated by three examples reported in Figure S7.