Lineage-specific expansion of DNA-binding transcription factor families

DNA-binding domains (DBDs) are essential components of sequence-specific transcription factors (TFs). We have investigated the distribution of all known DBDs in more than 500 completely sequenced genomes from the three major superkingdoms (Bacteria, Archaea and Eukaryota) and documented conserved and specific DBD occurrence in diverse taxonomic lineages. By combining DBD occurrence in different species with taxonomic information, we have developed an automatic method for inferring the origins of DBD families and their specific combinations with other protein families in TFs. We found only three out of 131 (2%) DBD families shared by the three superkingdoms.


Genomes used in this analysis
Genomes used in this analysis were taken from the DBD database [1], which contains TF annotation of more than one thousand completely sequenced genomes from diverse lineages across the tree of life. A representative non-redundant group of organisms was selected from the DBD database to represent the DBD occurrence expansion in different lineages in the heatmap.
One possible source of bias in using all the available genomes is the variance in the number of genomes in different lineages. For example, certain types of pathogenic bacteria and fungi are important from a medical and agricultural point of view and have been intensively studied. To minimise this bias, only the most well-characterised strain was selected to represent each particular species. Other redundant strains were excluded from this list. For instance, Escherichia coli K12 and Staphylococcus aureus NCTC 8325 were used to represent Escherichia coli and Staphylococcus aureus, respectively. For eukaryotic genomes, only the longest transcript per gene was considered. We note that an analysis of splice-variants across many genomes is confounded by the heterogeneity of the data available for different organisms. For instance, mouse is extremely well-characterised, while chimpanzee is not. As a result, alternative splicing was excluded entirely from this study. This also allows the numbers of eukaryotic TFs to be compared with the bacterial TFs, which do not contain splice-variants.
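The "longest transcript per gene" rule can be illustrated with a minimal sketch. The record layout (gene identifier, transcript identifier, length) and all names below are assumptions made for illustration; they are not the format used by the DBD database.

```python
def longest_transcript_per_gene(transcripts):
    """Return {gene_id: transcript_id}, keeping the longest transcript of each gene.

    `transcripts` is an iterable of (gene_id, transcript_id, length) tuples.
    This layout is hypothetical and used only for illustration.
    """
    best = {}
    for gene_id, transcript_id, length in transcripts:
        # Strict comparison: on ties, the first transcript seen is kept.
        if gene_id not in best or length > best[gene_id][1]:
            best[gene_id] = (transcript_id, length)
    return {gene: tid for gene, (tid, _) in best.items()}


# Toy records (invented values, not real annotation data).
records = [
    ("geneA", "tA.1", 900),
    ("geneA", "tA.2", 1500),
    ("geneB", "tB.1", 700),
]
selected = longest_transcript_per_gene(records)
```

With one transcript retained per gene, TF counts in eukaryotes become directly comparable to those in bacteria, where each gene contributes a single product.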
To ensure a clear and meaningful DBD expansion in our heatmap, we further refined our genome list by filtering out species with a small number of predicted TFs, which exhibit negligible expansion. Organisms that possess fewer than 50 predicted TFs were excluded from this analysis. The majority of these species are obligate parasites such as Plasmodium (eukaryotic microbes), Mycoplasma and Chlamydophila bacteria. Other poorly characterised eukaryotic genomes were also removed manually if bacterial contamination was detected.
These contaminated genomes displayed a large number of bacterial-specific DBD families that are not observed in other closely related species. The eukaryotic genomes removed during this process include Apis mellifera (honey bee), Ricinus communis (castor bean), Capitella sp. I (segmented worm), Populus trichocarpa (western balsam poplar), Physcomitrella patens subsp. patens (moss) and Xenopus tropicalis (frog). The contamination in the frog genome, in particular, has been observed before [2], and the honey bee genome has been removed from Ensembl. After this species refinement, the final number of organisms was 538, comprising 160 from Eukaryota, 30 from Archaea, and 348 from Bacteria. A table containing a complete list of genomes can be obtained from the authors' project website.
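The two refinement steps described above (the threshold of 50 predicted TFs and the manual removal of contaminated genomes) can be sketched as a simple filter. The function name, the input dictionary and the species labels are hypothetical; the contaminated set stands in for the manually curated list.

```python
MIN_PREDICTED_TFS = 50  # threshold stated in the text


def filter_genomes(tf_counts, contaminated=frozenset(), min_tfs=MIN_PREDICTED_TFS):
    """Keep genomes with at least `min_tfs` predicted TFs that are not flagged
    as contaminated. `tf_counts` maps a genome name to its predicted TF count."""
    return sorted(
        name
        for name, n_tfs in tf_counts.items()
        if n_tfs >= min_tfs and name not in contaminated
    )


# Toy input (invented counts and placeholder names, for illustration only).
toy_counts = {"species_A": 320, "species_B": 12, "species_C": 50, "species_D": 400}
kept = filter_genomes(toy_counts, contaminated={"species_D"})
```

In this toy case, species_B is dropped for having fewer than 50 predicted TFs and species_D is dropped because it is flagged as contaminated.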

DBD families used in this analysis
The DBD families used were also obtained from the DBD database. The prediction was performed based on the presence of DBDs, from two HMM libraries: SUPERFAMILY and Pfam. SUPERFAMILY HMM models are designed to identify members of superfamilies, based on the domain definition of Structural Classification Of Proteins, SCOP [3]. Since protein domain members in SCOP superfamilies tend to be functionally diverse, manual curation in the DBD database was done at the SCOP family level instead [4]. Moreover, it has been shown that many SCOP families have homologous connections to Pfam families [5]. For these reasons, the analysis was performed for all Pfam and SCOP family DBDs.

DBD expansion heatmap
To survey the presence and absence of DBDs in different lineages, we collected the number of TFs containing each DBD family in each of the 538 representative organisms obtained from the refinement procedure explained above. We divided this absolute TF number by the number of genes in each species and present the result in a single-colour heatmap (see Figure 1; a high-resolution version can be obtained from our project website). It is clear from this heatmap that the number of DBD families shared between prokaryotes and eukaryotes is very small, but the contractions of DBDs are not visualised. The table containing the number of TFs in each DBD family normalised by the total number of genes in each genome, which was used to generate this heatmap, can be obtained from our project website. To further improve the presentation, we developed a two-colour heatmap that represents the expansion, as well as contraction/depletion, within a particular DBD family. To do so, we computed Z-scores of TFs containing DBDs across all species. For a particular DBD family $D$ in species $i$, we counted the total number of TFs containing the DBD family, $t_{D,i}$. As before, the transcription factor counts were normalised by the number of genes in that species, $G_i$ (Equation 1). This is to minimise the bias due to the differences in the number of genes in diverse species. We refer to this as the normalised TF count, $T_{D,i}$:

$$T_{D,i} = \frac{t_{D,i}}{G_i} \qquad (1)$$

For the DBD family $D$, we calculate the Z-score, $Z_{D,i}$, of the normalised count of TFs containing this DBD in species $i$. This Z-score represents the relative expansion or contraction in different lineages, compared to all other species, as described in Equation 2:

$$Z_{D,i} = \frac{T_{D,i} - \overline{T}_D}{SD_D} \qquad (2)$$

where $\overline{T}_D$ is the mean of the normalised counts of TFs containing DBD family $D$ across all species and $SD_D$ is the corresponding standard deviation.
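The normalisation and Z-score calculation for a single DBD family can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function names and the toy counts are invented, and the use of the population standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, pstdev


def normalised_counts(tf_counts, gene_counts):
    """Divide the TF count of one DBD family by the gene count, per species."""
    return {sp: tf_counts[sp] / gene_counts[sp] for sp in tf_counts}


def z_scores(norm):
    """Z-score of each species' normalised count for one DBD family.

    Assumption: population standard deviation (pstdev) across all species.
    """
    mu = mean(norm.values())
    sd = pstdev(norm.values())
    if sd == 0:
        # No variation across species: no expansion or contraction to report.
        return {sp: 0.0 for sp in norm}
    return {sp: (value - mu) / sd for sp, value in norm.items()}


# Toy data for one DBD family in three species (invented values).
tf = {"sp1": 10, "sp2": 20, "sp3": 30}
genes = {"sp1": 1000, "sp2": 1000, "sp3": 1000}
z = z_scores(normalised_counts(tf, genes))
```

Here sp1 receives a negative Z-score (relative contraction), sp3 a positive one (relative expansion), and sp2 sits at the mean.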
Z-scores of all DBD families in all species were visualised as a heatmap using the Genesis 1.7.2 software package [6]. Heatmap columns correspond to DBD families, hierarchically clustered using the complete linkage method and Pearson correlation. Rows correspond to species, ordered according to the NCBI taxonomic tree. The NCBI taxonomy is an expertly curated organism hierarchy which contains more than 300,000 species [7]. Since the taxonomic tree contains more species than any current phylogenetic tree and is manually curated, we preferred it over available phylogenetic trees. A positive Z-score indicates that the DBD is relatively highly expanded in that species and is shown in orange. A negative Z-score represents DBD contraction in the genome and is shown in blue. High-resolution heatmaps for both Pfam and SCOP family DBDs, with family and species names labelled, can be obtained from our project website. Although the distribution of each DBD may not be strictly Gaussian, it is clear that a high relative abundance of DBDs gives positive Z-scores (orange), while depletion corresponds to negative Z-scores (blue). This representation has been successfully used before to explain expansion and contraction of DBD families by us and others [1,8].

Taxonomic limits and conservation densities

a. Estimating taxonomic limits of DBD families
To obtain a "Taxonomic limit" for a particular DBD family D, we first collected all species which have the DBD family predicted in their genomes. Based on the NCBI taxonomic tree, [...] Consequently, the trade-off between contaminations and horizontal gene transfer is a crucial issue for taxonomic limit inference.

b. Calibration of the taxonomic limit method and cut-off threshold
To correct the taxonomic limit assignments of these DBD families with potential horizontal gene transfer, we found that the taxonomic limit needed to be shifted to the parental node. The taxonomic rank of the parental node to be shifted to should be close to the incorrectly assigned child node, and its frequency fraction should be large enough. After a careful manual investigation, we found that the taxonomic limit correction returned the most accurate results when two conditions were met. First, the parental node to be shifted to was not more than 5 taxonomic ranks above the node with the highest frequency fraction. Second, the ratio of the frequency fraction of the node to be shifted to, to the highest fraction (hereafter called the "Frequency fraction ratio"), was greater than the cut-off of 0.2. On manual inspection, we found that the DBD families with a ratio of less than 0.2 were indeed bacterial contaminations. The bacterial-specific DBDs that showed contamination traces in Eukaryota include HTH_1, FUR, GntR and MerR.
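The two conditions for shifting a taxonomic limit can be written as a small decision function. This is a sketch under stated assumptions: the frequency fractions are taken as precomputed inputs, the function and parameter names are hypothetical, and the toy values are invented.

```python
def should_shift_limit(parent_fraction, best_fraction, ranks_above,
                       cutoff=0.2, max_ranks_above=5):
    """Decide whether to move the taxonomic limit to a candidate parental node.

    parent_fraction: frequency fraction of the candidate parental node
    best_fraction:   highest frequency fraction observed for the family
    ranks_above:     number of taxonomic ranks between the two nodes
    """
    if best_fraction <= 0:
        raise ValueError("best_fraction must be positive")
    ratio = parent_fraction / best_fraction  # the "Frequency fraction ratio"
    return ranks_above <= max_ranks_above and ratio > cutoff


# Toy values (invented) covering the three possible outcomes.
shift_ok = should_shift_limit(0.30, 0.60, ranks_above=2)     # ratio 0.5, close node
low_ratio = should_shift_limit(0.05, 0.60, ranks_above=2)    # ratio below the cut-off
too_distant = should_shift_limit(0.30, 0.60, ranks_above=6)  # more than 5 ranks above
```

Only the first case satisfies both conditions; the second would be treated as likely contamination and the third as too distant a parental node.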
In addition to the manual inspection of the bacterial contaminations in eukaryotic genomes, we assessed this cut-off threshold more systematically by investigating the number of taxonomic limits assigned to different taxonomic groups, using different cut-offs (Figure 2). The higher the cut-off, the more stringent and narrower the taxonomic limit assignment. [...] when the cut-off is lowered to 0. This is also likely due to the bacterial contaminations in eukaryotic genomes.
We included a complete list of the taxonomic limits of all DBD families when different cut-offs were used as a supplementary file available from our project website.

(i) Between Eukaryota and Bacteria
As shown in Figure 2a, as we lowered the cut-off from 0.2 to 0, we observed many DBDs that are well-known for their regulatory roles in bacterial species having their taxonomic limits switched from Bacteria to "cellular organisms". When we investigated their presence in all species, we found that most of these DBDs were detected in the majority of genomes involved in sugar uptake [11] and sporulation regulation [12] in bacteria, respectively. It is clear, even from the heatmap (Figure 1 in the main text), that they are also detected in more than half of the fungal genomes in our dataset and might have been disseminated through horizontal gene transfer. These two families are also discussed in the main text. The other two DBD families falling in this category are AP2 and KilA-N. All four of these families shared by Bacteria and Eukaryota have "cellular organisms" as their taxonomic limits.

(ii) Between Eukaryota and Archaea
The number of archaeal genomes available is very small compared to the bacterial or eukaryotic genomes and thus the presence of DBDs in Archaea is harder to verify.
Nonetheless, our method regards the eukaryotic DBD zf-C2H2, which is found in half of the archaeal genomes, as a true hit and thus assigns "cellular organisms" as the taxonomic limit of the family.

(iii) Between Bacteria and Archaea
DBD family sharing between the two prokaryotic superkingdoms is common and already well-studied [9]. We have already discussed in the main text that the large number of DBD families shared by bacterial and archaeal genomes might be a result of horizontal gene transfer between the two prokaryotic superkingdoms. As described above, it is harder to verify confidently whether DBDs detected in Archaea are true hits because of the small number of archaeal genomes. Using a cut-off of 0.2, our method regards most bacterial DBDs found in Archaea as true hits, except for a small number of cases such as LexA_DNA_bind, which is found in only one archaeal genome. In such cases, our method assigns Bacteria as the taxonomic limit instead of "cellular organisms".

(iv) Within Bacteria
Horizontal gene transfer is known to play a crucial role in shaping the phylogenetic profiles of prokaryotes [9,10]. This corresponds to our taxonomic limit assignments, where approximately half of the DBD families found in bacterial species have Bacteria as their taxonomic limits, rather than a more specific bacterial subgroup. However, we also found a number of phylum-specific DBDs. These families are truly phylum-specific because the numbers of DBDs assigned to phyla drop only slightly even when the cut-off was lowered. There are no phylum-specific DBDs that have their taxonomic limits switched to Bacteria when the cut-off was lowered to 0. This shows that the number of bacterial phylum-specific DBDs is not overestimated by the method, and that horizontal gene transfer within the bacterial species is not underestimated.

(v) Within Eukaryota
Horizontal gene transfer between eukaryotic species is thought to be rare, especially in multicellular organisms [13]. Nevertheless, we have discussed the presence of some animal-specific DBDs in the choanoflagellate M. brevicollis, which is interesting from the point of view of multicellular eukaryotic evolution.

c. Taxonomic conservation densities and Monophyletic clades
To estimate the proportion of child species within a taxonomic limit clade that actually contain the DBD family of interest, we calculated the fraction of species containing the DBD over the total number of species within that taxonomic node. We termed this fraction the "Taxonomic conservation density". "Monophyletic clades" were defined as nodes below the taxonomic limit where at least 98% of their members contain the DBD family of interest (having a taxonomic conservation density greater than or equal to 0.98). We decided to use this 0.98 cut-off instead of 1 (all child nodes contain the DBD) because we observed, in the majority of lineage-specific families, that the DBD assignments were missing for technical reasons in a small number of species within the lineage (Figure 3). This process was used to find all monophyletic clades at taxonomic ranks below the taxonomic limit nodes for all
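The conservation density and the 0.98 threshold for monophyletic clades can be sketched as below. The function names, the representation of a clade as a collection of species names and the toy clade are assumptions made for illustration.

```python
def conservation_density(clade_species, species_with_dbd):
    """Fraction of species in a clade in which the DBD family is detected."""
    clade = set(clade_species)
    if not clade:
        raise ValueError("clade has no species")
    return len(clade & set(species_with_dbd)) / len(clade)


def is_monophyletic(clade_species, species_with_dbd, threshold=0.98):
    """True if the conservation density reaches the 0.98 threshold from the text."""
    return conservation_density(clade_species, species_with_dbd) >= threshold


# Toy clade of 100 placeholder species (invented, not real genomes).
clade = [f"s{i}" for i in range(100)]
density_99 = conservation_density(clade, clade[:99])  # DBD missing in 1 species
mono_99 = is_monophyletic(clade, clade[:99])
mono_97 = is_monophyletic(clade, clade[:97])          # DBD missing in 3 species
```

A clade missing the DBD in a single species out of 100 still counts as monophyletic, which mirrors the tolerance for a small number of missing assignments.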

d. Examples of taxonomic limit and conservation density calculations
In addition to the discussion on how we assessed the taxonomic limit results when different cut-offs were used, we provide examples of taxonomic limit and conservation density calculations for four DBD families in Figure 4 (DBD name followed by the frequency fraction ratio in brackets):
i. MerR (0.14): The MerR family is known to mediate the mercuric-dependent induction of the mercury resistance operon in bacterial species [14]. The family is detected in most prokaryotic genomes but also in a few eukaryotic genomes (i.e. X. tropicalis and R. communis). Because the frequency fraction ratio of "cellular organisms" over Bacteria (0.086/0.611 = 0.14) is less than the cut-off of 0.2, our method does not shift the taxonomic limit to "cellular organisms" but instead identifies Bacteria as the taxonomic limit, in line with a previous study suggesting that there is bacterial contamination in some eukaryotic genomes including X. tropicalis [2].
ii. HTH_AraC (0.21): The HTH_AraC family is related to the arabinose operon regulatory protein AraC in bacteria [11]. The family is present in approximately 80% of bacterial genomes but also in approximately half of fungal genomes. Our taxonomic limit method shifted the taxonomic limit from Bacteria to "cellular organisms" because the frequency fraction ratio of "cellular organisms" over Bacteria (0.112/0.528 = 0.21) is greater than 0.2.
iii. Homeobox (0.65): The Homeobox family is well known for its role in morphogenesis and animal body development [15]. Despite the small number of plant genomes available, the Homeobox family is also found in almost all plants. Our method shifted taxonomic limit to Eukaryota because the frequency fraction ratio of Eukaryota over Fungi/metazoa (0.238/0.364 = 0.65) is greater than 0.2.
iv. zf-C2H2 (0.30): zf-C2H2 domains are found in all Eukaryota but are also found in nearly half of Archaea. Even though Eukaryota has the highest frequency fraction, the method shifted the taxonomic limit to "cellular organisms" because the frequency fraction ratio of "cellular organisms" over Eukaryota (0.093/0.306 = 0.30) is greater than 0.2.
These examples demonstrate that our taxonomic limit method has the power to distinguish contaminations from true hits. Apart from these four examples, we have included a complete list of the taxonomic limits of all DBD families when different cut-offs between 0 and 0.4 were used as a supplementary file available from our project website.
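The arithmetic of the four examples can be reproduced with a few lines of code. The frequency fractions and node names are those quoted in the text; the function name and data layout are hypothetical, and the sketch applies only the 0.2 ratio cut-off, leaving out the limit on the number of taxonomic ranks.

```python
CUTOFF = 0.2

# (family, fraction at candidate parental node, highest fraction,
#  candidate parental node, node with the highest fraction)
EXAMPLES = [
    ("MerR", 0.086, 0.611, "cellular organisms", "Bacteria"),
    ("HTH_AraC", 0.112, 0.528, "cellular organisms", "Bacteria"),
    ("Homeobox", 0.238, 0.364, "Eukaryota", "Fungi/metazoa"),
    ("zf-C2H2", 0.093, 0.306, "cellular organisms", "Eukaryota"),
]


def assign_limits(examples, cutoff=CUTOFF):
    """Return {family: (ratio rounded to 2 d.p., assigned taxonomic limit)}."""
    limits = {}
    for family, parent_frac, best_frac, parent, best in examples:
        ratio = parent_frac / best_frac
        # Shift to the parental node only if the ratio exceeds the cut-off.
        limits[family] = (round(ratio, 2), parent if ratio > cutoff else best)
    return limits


limits = assign_limits(EXAMPLES)
```

Only MerR falls below the cut-off and keeps Bacteria as its taxonomic limit; the other three families are shifted to the parental node, as described above.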

Figure 4 Examples of DBD occurrence on a simplified phylogenetic tree
(a) The MerR family is detected in most prokaryotic genomes but also in a few eukaryotic genomes. Because the ratio of the frequency fraction at cellular organisms to that at Bacteria is less than the cut-off of 0.2, our method does not shift the taxonomic limit to cellular organisms but instead identifies Bacteria as the taxonomic limit. (b-d) For the HTH_AraC, Homeobox, and zf-C2H2 families, the ratio of the frequency fraction of the node to be shifted to, to the highest one, is greater than 0.2, so our method shifts the taxonomic limits to the parental nodes and regards the DBDs found in other branches as true hits. The frequency fraction at each node is shown in bold and the taxonomic conservation density is shown in brackets and italics.

e. Taxonomic limit method and previous literature
In this section, we provide a detailed discussion of previously published approaches for inferring the evolutionary scenarios of proteins or protein families. Although they might seem similar to our method, they are not identical and are not suitable for our purpose. Here we discuss the strengths of our new taxonomic limit method and compare it to the existing approaches that are most relevant.
Many studies have shown that the phylogenetic relationship of proteins can be inferred by building a phylogenetic tree, based either on sequence similarity or on a presence/absence profile (as reviewed by [16]). However, such a tree would only contain the species where the DBD is present, which means gene losses cannot be assessed. Moreover, the trees built using different families are often different. Consequently, we focus on the methods that are, like ours, based on presence/absence profiles of protein families and a reference tree.
To the best of our knowledge, there are only a small number of groups that have combined the gene-content profiles and phylogenetic relationships to reconcile evolutionary scenarios. Koonin and Mushegian [17] and Kyrpides et al. [18] were among the first to do so.
However, they only focused on the minimal gene set of the last universal common ancestor, which is not relevant to our analysis.
More relevant methods to what we describe were published by Snel et al. [19], Kunin et al. [20] and Mirkin et al. [21]. They all focus on constructing the most parsimonious evolutionary scenarios (gene gain, loss, horizontal gene transfer occurring at all internal nodes) for protein families given a species tree. Although such methods could, in principle, be used to estimate the earliest node where the DBD family became present, none of these methods or papers actually does this. There are also a number of technical and conceptual issues that make them unsuitable for our purpose.
i. All three methods require an accurate bifurcated species tree (branched into two at each internal node). Because their parsimony algorithms start from the terminal (leaf) nodes of the tree and move towards the root, it is important that the relationships between species are accurate near the terminal nodes. To the best of our knowledge, there is no existing phylogenetic tree that matches the number of genomes in the DBD database (over 1000 genomes to date). We thus decided to use the NCBI taxonomy tree, which provides a manually curated organism hierarchy of more than 300,000 species [7]. However, the taxonomy tree is not bifurcated, as more than two species can share the same parental node. For instance, the taxonomic node Escherichia is parent to at least 50 Escherichia species including E. coli K12. Thus, accurate phylogenetic relationships between species under the same parent might not always be attainable. These parameters have been carefully explored for prokaryotic genomes but the same set of parameters will not be suitable for eukaryotic genomes, where HGT contributes to the evolution of genomes at a much lower rate, especially after the emergence of multicellular organisms [13].
iii. The methods published by Snel et al. [19] and Mirkin et al. [21] are only applicable to groups of proteins for which orthologue definitions are available, because additional orthologous information is required. Consequently, they do not suit our purpose, where we want to estimate the taxonomic limit of an entire family whose protein members are not always orthologous.
iv. Our method is flexible and not restricted to the taxonomic limits of protein families. It can also be used to estimate when a domain combination between two domains occurred. To the best of our knowledge, this is the first time such an analysis has been done at the domain architectural level.
In summary, our method for estimating taxonomic limits is simple, intuitive, fast and robust to uncertainties of trees near species nodes. We have shown that the method has sufficient power to distinguish contaminations from horizontal gene transfer.

Colour codes are as described in the main text. White means the DBD is also shared with other superkingdoms (Eukaryota and/or Archaea). DBDs which occur alone as single-domain TFs in more than 25% of all their architectural patterns have orange borders.

Network representation of TF domain architectures
The most common domain architectures of bacterial and eukaryotic TFs are illustrated separately using a network representation. Partner domains and architectures that occur in more than 5% of TFs for each DBD family were gathered and TF architecture networks were generated using Cytoscape 2. Both domain occurrence and domain combinations are labelled according to their taxonomic limits. Nodes and arrows are coloured according to their taxonomic limits, which were derived from the method described above. Colour codes are as described in the main text.
DBDs that occur as single-domain TFs for more than 25% of all architectures are highlighted using orange (in bacterial network) or green borders (in eukaryotic network). Tables describing the numbers of domain combinations normalised by numbers of genes, which were used to generate these networks, can be obtained from our project website. In addition to the eukaryotic TF architectural network shown in the main text, a complete bacterial TF network is shown in Figure 5.
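The two thresholds used to build the networks (architectures found in more than 5% of the TFs of a DBD family, and single-domain TFs making up more than 25%) can be sketched as follows. Representing each TF as a tuple of domain names is an assumption, as is computing the 25% figure over TFs rather than over distinct architectures; the domain names and counts are placeholders.

```python
from collections import Counter


def common_architectures(architectures, min_fraction=0.05):
    """Return {architecture: fraction} for architectures found in more than
    `min_fraction` of the TFs of one DBD family.

    `architectures` holds one tuple of domain names per TF (assumed layout).
    """
    counts = Counter(architectures)
    total = len(architectures)
    return {arch: n / total for arch, n in counts.items() if n / total > min_fraction}


def is_mostly_single_domain(architectures, dbd, min_fraction=0.25):
    """True if TFs consisting of the DBD alone exceed `min_fraction` of all TFs."""
    total = len(architectures)
    single = sum(1 for arch in architectures if arch == (dbd,))
    return total > 0 and single / total > min_fraction


# Toy family of 100 TFs with placeholder domain names (invented counts).
toy = (
    [("DBD_X",)] * 60
    + [("DBD_X", "partner_A")] * 36
    + [("DBD_X", "partner_B")] * 4
)
common = common_architectures(toy)
highlighted = is_mostly_single_domain(toy, "DBD_X")
```

In this toy family the architecture with partner_B (4% of TFs) is left out of the network, and DBD_X would receive a highlighted border because 60% of its TFs are single-domain.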

Additional discussion
In this section we provide additional discussion on conserved and lineage-specific DBD families across the tree of life. The literature on the biological processes in which the DBDs are implicated is also extensively documented. In addition to the number of DBDs shared by Archaea, Bacteria, and Eukaryota described by a Venn diagram in the main text, here we provide a simplified taxonomic tree with the number of DBD families and Pfam families having their taxonomic limits assigned to each node (Figure 6). These results show that the proportion of DBD families having "cellular organisms" as their taxonomic limit (15%) is significantly smaller than that of all Pfam families (33%). This confirms that the repertoires of DBD families are more lineage-specific than other proteins. In addition, we also show a Venn diagram representing the number of SCOP families classified as DBDs which have taxonomic limits belonging to the three major superkingdoms (Figure 7). Eight out of 88 SCOP families (9%) are shared by the three major superkingdoms, compared to 2% of Pfam DBDs.
The distribution of DBDs in prokaryotes is not only widespread at the superkingdom level but there is also no clearly distinguishable expansion scheme within the three major bacterial phyla: Actinobacteria (purple), Firmicutes (dark blue) and Proteobacteria (light blue). According to Figure 8a, which summarises the taxonomic limits of bacterial DBDs, each phylum seems to possess a small number of phylum-specific DBD families, but it is apparent that the majority of DBDs are shared by all bacterial lineages.
These conserved DBDs participate not only in basic carbon source metabolism, such as HTH_AraC [11], LacI [30] and GntR [31], but also in more specific functions, such as FUR (Ferric uptake regulator) [32], MerR (mercury resistance) [14], LexA repressor (DNA repair system) [33], GerE (Lux family, quorum sensing) [34] and HTH_8 (Fis family, virulence gene expression) [35]. These bacteria-specific DBDs are all found in more than 60% of bacterial species (conservation densities greater than 0.60) and are most likely inherited from the last common ancestor of all bacterial species. Since most of the Actinobacteria are filamentous, it makes sense that this bacterial phylum has a DBD, WhiB [36], specific to the regulation of mycelium formation. A number of DBDs, which control expression of genes in different pathways, are specific to the Firmicutes: CodY GAF-like domain [37], ComK protein [38], and Firmicute transcriptional repressor of class III stress genes (CtsR) [39], for instance. The sporulation initiation factor (Spo0A) is, however, the most interesting of all. This family reflects the lifestyle of many Firmicutes, which reproduce by forming spores in undesirable conditions [40].
Owing to the greater number of completely sequenced genomes available, Proteobacteria possess more phylum-specific DBDs than any other bacterial lineage. The DBD families that fall into this category include Crl (fibronectin binding activators) [41], ROS/MUCR (virulence region repressor) [42] and Met repressor (MetJ, methionine synthesis) [43]. The FlhC and FlhD TFs have been shown to be global regulators involved in many cellular processes as well as flagella transcriptional activators [44]. They are only present in Gram-negative Proteobacteria but not in Firmicutes and Actinobacteria. The phylogenetic pattern of these DBD families may be linked to the four-support-ring flagella in Gram-negative bacteria, as opposed to the two-support-ring flagella in Gram-positive bacteria.

b. Conserved and lineage-specific DBDs in eukaryotes
In contrast to the dispersed DBD occurrence in bacterial species, Figure 1 in the main text demonstrates the distinct expansion patterns between the three main eukaryotic kingdoms: Metazoa (pink), Fungi (orange) and Viridiplantae (yellow), and other unicellular eukaryotic organisms. Metazoans (animals) possess a considerably larger DBD repertoire than the Fungi and Viridiplantae kingdoms (Figure 8b). This reflects the greater morphological complexity and body structures of animals, as well as a potential bias towards the study of animal model organisms.
Only a small proportion of DBD families are ubiquitously present across the eukaryotic superkingdom. These families include the majority of Zinc finger families [45,46], helix-loop-helix (HLH) [47] and basic leucine zippers (bZIPs) [48]. Surprisingly, the Homeobox family, famous for its role in morphogenesis and animal body development [15,49], is also found throughout eukaryotic organisms, including fungi and plants.
Distinct expansion schemes are also observed among animal species. The most notable difference is between vertebrates and invertebrates. The majority of DBDs found in metazoans are present in both animal groups; however, the expansion in invertebrates is significantly less pronounced in many DBD families. The DBDs with particularly extensive expansion in vertebrates include: STAT (signal transduction) [50], T-box (body plan and organogenesis) [51] and p53 (cell cycle arrest and apoptosis) [52]. Other DBDs such as Interferon regulatory factor (IRF, regulation of immunity) [53], Churchill (neural development) [54] and the oncogene Myc [55] are entirely absent from invertebrates. This is most likely due to more elaborate immune and nervous systems in vertebrates. On the contrary, the Runt [56] and GCM [57] families regulate fundamental developmental processes in both vertebrates and invertebrates, and are equally expanded in both groups. It is worth mentioning that the BESS and HTH_psq domains are particularly highly represented in insects, the taxonomic group that dominates the invertebrate genomes.
Being phylogenetically closer to Metazoa, Fungi share more DBD families with animals than with plants (Figure 8b). DBDs which are common to metazoans and fungi but completely absent in plants include: CP2, Fork head, NDT80/PhoG and Tea domains. DBD occurrence patterns between fungal organisms are more uniform than in metazoans, as all fungi possess similar DBD repertoires. In accordance with previous work [58], we observed a number of DBDs detected in Fungi but not in other eukaryotes. Interestingly, not all fungal-specific DBDs are restricted to fungal-specific processes. To illustrate the point, the DNA-binding domain of Mlu1-box binding protein MBP1 is mainly involved in regulation of the cell cycle [59]. Zn2/Cys6 (Zn cluster) has many regulatory roles such as in sugar and amino acid metabolism, cell cycle, as well as drug resistance [60]. The Copper-fist domain participates in copper utilisation and stress response processes [61]. MAT alpha1 and APSES, however, do regulate fungal-specific processes as they activate mating-type specific genes [62] and regulate yeast-hyphal transitions [63], respectively. HTH_AraC [11] and FMN (Flavin mononucleotide) binding domain [12] are exceptional cases of bacterial DBDs found in many fungi. The families have been experimentally shown to be involved in sugar uptake and sporulation regulation in bacterial species, respectively, but their functionality in fungi has yet to be investigated.
Apart from the majority of plant DBD families which are also found in animals and fungi, a set of DBD families are specific to plants including: AP2/GCC-box binding domain (activation of defence genes) [64], SBP (flowering development) [65] and WRKY (pathogen defence and biosynthesis of secondary metabolites) [66]. Additionally, we observe a number of DBD families found in the Streptophyta phylum (land plants) are absent in Chlorophyta (green algae). These families are discussed in the next section.

c. From uni- to multicellular eukaryotes: additional DBD families emerge
Apart from the three major kingdoms, the DBD database also provides TF predictions for many unicellular eukaryotes. Among the unicellular eukaryotic species available, Monosiga brevicollis is one of the most interesting organisms as it is the only well-annotated representative of choanoflagellates, the closest known relatives of metazoans [67]. Previous studies on the organism have concentrated on its signal transduction mechanisms and found that the species contained a considerable amount of signalling components in common with animals [68,69].
Besides the more elaborate signalling machineries, uni- to multicellular transitions also require a greater number of components that contribute to the more complex genetic regulatory networks in functionally diverse cell types [8]. One possible way to enhance regulation capacity is by recruiting novel sets of TFs. By investigating the Monosiga brevicollis genome, we observed not only DBDs common to the Fungi/metazoa group such as Homeobox, HLH, Fork_head and bZIPs, but also many metazoan-specific DBDs not found in fungi. Among the animal-specific DBDs, there are families which regulate animal-specific processes such as STAT (signal transduction), p53 (apoptosis), Tub (nervous system development) [70], as well as those involved in more general pathways like E2F/DP (cell cycle) [71] and Cold-shock domain (CSD, low temperature response) [72].