Origin and evolution of the enhancer of split complex

The Enhancer of split complex is an unusual gene complex found in arthropod genomes. Where known, this complex of genes is often regulated by Notch cell signalling and is critically important for neurogenesis. The Enhancer of split complex is made up of two different classes of genes: basic helix-loop-helix-orange domain transcription factors and Bearded class genes. The association of these genes has been detected in the genomes of insects and crustaceans. Tracing the evolution of the Enhancer of split complex in recently sequenced arthropod genomes indicates that Enhancer of split basic helix-loop-helix-orange domain genes arose before the common ancestor of insects and Crustacea, and before the formation of the complex. Throughout insect and crustacean evolution, a four-gene cluster has been present, with lineage-specific gene losses and duplications. The complex can be found in the vast majority of genomes, but appears to be missing from the genomes of chalcid wasps, raising questions as to how they carry out neurogenesis in the absence of these crucial genes. The Enhancer of split complex arose in the common ancestor of Crustacea and insects, probably through the linkage of a basic helix-loop-helix-orange domain gene and a Bearded class gene. The complex has been maintained, with variations, throughout insect and crustacean evolution, indicating that some function of the complex, such as coordinate regulation, may maintain its structure through evolutionary time.


Background
Evolutionarily conserved complexes of genes are rare in insect genomes [1] while relatively common in vertebrates [2][3][4]. The best characterised is the Hox complex, found in the genomes of widely diverse animals, in which interlinked and coordinate gene expression appears to stabilise the genomic structure of the complex over evolutionary time [5]. Complexes of genes that remain intact over wide evolutionary distances in insect genomes presumably have features similar to those of the Hox complex, such as coordinate regulation, that maintain their genomic structure while the genome is rearranged around them.
The Enhancer of Split Complex (E(spl)-C) is an unusual and conserved complex of genes first identified in Drosophila melanogaster. This complex differs from most in that the genes it contains encode two completely different sorts of proteins: bHLH-Orange domain transcription factors [6,7] (bHLHO), and Bearded class proteins [8,9] (Brd). The association between these two types of genes has been found in the genomes of both insects and crustaceans [10], implying that this complex first formed through the association of these genes rather than the more usual gene duplication.
A limited survey of insect and crustacean genomes has shown that the E(spl) complex is ancestrally made up of three bHLHO-encoding genes (bHLH1, bHLH2 and her) and a single Brd-class gene [10]. The structure of the complex is modified in some insects. In Drosophila, for example, two bHLH-orange domain genes are absent from the complex and there are seven copies of the remaining one; the Brd-class gene, mα, has been duplicated; and a range of unrelated genes are inserted in the complex [6,9][10][11][12][13][14][15][16].
The E(spl)-C was first identified as a modifier of the Notch mutant Split [17]. Subsequent studies have shown that the bHLH proteins encoded by E(spl)-C act as effectors of Notch cell signalling [13,18]. During neurogenesis in Drosophila, presumptive neuroblasts signal to surrounding cells in proneural clusters through the Notch cell-signalling pathway (reviewed in [19][20][21]). This pathway leads to expression of E(spl)-C bHLH genes [18], which act to repress the expression [22], and function [23], of proneural genes, a set of genes encoding transcription factors that promote neural cell fate [24,25]. Presumptive neuroblasts thus signal surrounding cells to block their differentiation as neuroblasts through the activation of Notch cell signalling, and the expression of the E(spl) bHLHO domain proteins [26][27][28][29]. E(spl) bHLHO domain proteins carry a C-terminal WRPW motif that recruits the transcriptional repressor Groucho, which in turn attenuates gene expression by promoting RNA polymerase pausing and causing local histone deacetylation [30]. E(spl) bHLHO domain proteins thus target transcriptional repression to proneural genes, and proneural gene targets.
Brd-class proteins from the E(spl)-C, in contrast, antagonise Notch signalling by interacting with Neuralised [31], an E3 ubiquitin ligase, stimulating the degradation of a Notch ligand, Delta [32]. In Drosophila, Brd-class proteins particularly act to pattern adult sensory precursor formation [8,9]. Bearded class genes encode small proteins with amphipathic alpha helices, and little sequence conservation [9].
The genes contained within the E(spl)-C are regulated in a range of ways. While individual enhancer elements for some of the genes have been identified [33,34], the entire complex also appears to be activated by Su(H) [9,35][36][37][38][39][40], a transcription factor usually regulated by Notch signalling [41]. This Notch responsiveness is also found in a crustacean, Daphnia [42]. Individual transcripts are repressed, in Drosophila, by miRNAs binding to sites located in the 3′UTR (named GY, Brd and K-boxes) [15,43,44]. There is also evidence for coordinate regulation of the E(spl)-C by cohesin [45], a chromatin structure regulating protein that coats the E(spl)-C in cells and represses expression, and for repression by Polycomb group proteins [46]. The genomic DNA containing the E(spl)-C is also structured in three dimensions in cells such that the chromatin of the complex interacts with itself, forming an isolated domain, but does not interact with flanking regions [46]. This self-interactive structure implies the complex is regulated in a coordinated manner [46]. E(spl)-C bHLH proteins are closely related to other clades of insect bHLH-orange domain proteins, including clockwork orange (cwo), Similar to Deadpan (Side), Hairy and E(spl) related with a YRPW domain (Hey), hairy (h) and deadpan (dpn). These genes have multiple roles in insects.
Hey has been shown to regulate neuron fate determination [55]. Cwo (clockwork orange) has a role in regulating the circadian clock [56][57][58][59]. Side has no identified function in Drosophila. In non-arthropod animals, closely related genes, the HES genes (hairy-enhancer of split genes), act, in most situations, as effectors of Notch signalling [60,61], though some of these genes do have non-Notch related roles [62,63]. Given this close phylogenetic relationship, and the role in Notch signalling conserved between arthropods and vertebrates, it seems likely that Notch responsiveness, and a role as an effector of Notch signalling, may be an ancestral function for this group of transcription factors. Indeed, E(spl)-mδ produces a neurogenic effect when mis-expressed in Xenopus embryos [64].
The unusual nature of the E(spl)-C, containing two types of genes, and its potential coordinate regulation, make understanding the dynamics of its evolution important. Here I examine the structure and relationships of the E(spl)-C from arthropod and onychophoran genomes, particularly those provided by the i5K consortium [65,66]. The i5K consortium provides high quality Arthropod genome sequences that allow both the identification of genes, and examination of genome structure. This dataset allows us to trace both the origins and subsequent evolution of the E(spl)-C in arthropods.

Results and discussion
Phylogeny of all isolated E(spl) bHLH genes recovers three major clades
Searching for E(spl)-like bHLH sequences in arthropod genomes identifies large numbers of sequences with similarity to E(spl) bHLHs, and other clades of bHLHO domain sequences. In most species of insects and Crustacea, multiple bHLHO sequences can be identified with strong similarity to E(spl)-C bHLHs, and these genes are often found adjacent to each other in contigs, linkage groups or genome scaffolds. Aligning the protein products of these genes and reconstructing their relationships using Bayesian phylogenetics [67] identifies the three major clades of E(spl) bHLHO proteins as first described in Duncan and Dearden [10] (her, bHLH1 and bHLH2) (Fig. 1). The analysis also indicates that the Strigamia maritima (centipede) genome contains a number of E(spl) bHLH proteins, but these do not fall into the three classes of E(spl) bHLH proteins found in crustaceans and insects. Orthologues of non-E(spl) bHLHO proteins robustly fall outside the E(spl) bHLH clades (protein identifiers refer to names in Additional file 1: Table S1).
Branch lengths within the E(spl) bHLH clades are short, indicating strong conservation of sequence, but these branch lengths increase between bHLH1 proteins from Diptera, implying more sequence divergence in that group. The assignment of E(spl) bHLH genes from crustacean and insect genomes to these three clades allows us to interpret the genomic structure of the E(spl)-C in crustacean and insect genomes.

Genomic analysis of E(spl) genes indicates clusters and conserved structure
Duncan and Dearden [10] identified a four-gene E(spl) complex as the ancestral state for insects and Crustacea. This four-gene cluster (bHLH-2, her, mα and bHLH-1) was identified in the genome of the crustacean Daphnia and in sequenced insect genomes [10]. I have expanded this analysis, using the i5k consortium data, to identify E(spl)-Cs within arthropod and onychophoran genomes and then used phylogenetic analyses (Fig. 1) to categorise those genes. This analysis indicates that the bHLH complex is a component of the vast majority of insect genomes, but there are clade and species-specific losses and expansions. In Chelicerates, and the partial genome of an onychophoran (velvet worm, a non-arthropod ecdysozoan, basal to arthropods), I can find no evidence for bHLHO proteins that are orthologous (by reciprocal blast) to E(spl) bHLHO proteins (Fig. 2). A myriapod genome [68] (Strigamia) encodes 5 E(spl) proteins, but these are not co-located in the genome. In two crustacean genomes (Daphnia and Eurytemora), the E(spl)-C is more apparent. In Daphnia (Water Flea), as described previously, a four-gene complex is present, with an identifiable Brd-class gene, mα, embedded within it.
Fig. 1 Relationships of E(Spl) bHLH genes. Bayesian phylogram of E(spl)-C related bHLH proteins generated using the WAG model of amino-acid evolution. Clades of bHLH proteins are marked by coloured areas and named. Node labels indicate posterior probabilities. The tree is rooted with hairy sequences from insects and chelicerates. Protein identifiers refer to names in Additional file 1: Table S1
Mα genes are often difficult to identify by Blast alone because their sequence evolves rapidly. In all species in which mα genes could be identified by tblastn [69] of the genome, they were found in the E(spl)-C except in Drosophila, where a complex of Brd-class genes lies outside the E(spl)-C [31]. Given the difficulties in identifying these genes, it is possible that other members of the Brd-class are present outside the E(spl)-C in other species.
Figure legend (partially recovered): colours follow Fig. 1 (bHLH-1, light blue; bHLH2, dark blue; her, green). Red hexagons indicate mα genes. White ovals indicate inserted genes with no homology to bHLHO or mα; brown ovals represent Tubulin Tyrosine ligase genes. Purple circles mark gooseberry genes. Where the colour of a bHLH gene is lightened, identification of this gene is only through placement in the complex due to gaps in the genome sequence. Where a white square is shown, placement in the complex cannot indicate the identity of the partial gene sequence. Yellow squares indicate E(spl) type bHLHs from Strigamia that are not able to be classified. Arrows indicate direction of transcription. Structures for the Endopterygota are predicted based on analyses of complexes from these groups (Figs. 3, 4, 5 and 6)
In Eurytemora, a copepod crustacean, the E(spl)-C is modified with duplications of all three E(spl) bHLH genes (two bHLH-1, three her and 5 bHLH2). Alongside this, no mα gene can be identified.
In insects, a four-gene complex can also be identified in deeply branching clades (Fig. 2). In a mayfly (Ephemera) and a termite (Zootermopsis), a four-gene complex, identical to Daphnia, is present.
In the large hemipteroid clade of insects, examples of a four-gene complex are relatively common (Frankliniella (Western flower thrips), Acyrthosiphon (Pea Aphid, though this is split between two contigs) [10], and Homalodisca (Glassy-winged sharpshooter)). Despite this conservation, gene loss has occurred in some hemipteroid lineages, with species such as Cimex (bedbug) and Oncopeltus (milkweed bug) having only two bHLH genes (her and bHLH2) and species such as Gerris (Waterstrider) and Halyomorpha (Brown marmorated stink bug) having a single E(spl) gene (bHLH2). The patterns of change in the complex imply that such losses are lineage or species specific, and that the selective pressure to maintain the full E(spl)-C is somewhat reduced in this assemblage. In species without the full complex of genes, mα is invariably missing, though its fast sequence evolution means that it could be located elsewhere in the genome.
Within Endopterygote lineages, the E(spl)-C appears more stable. In the Hymenoptera (Fig. 3), the classic four-gene complex is present in most species, though in Orussus (Parasitic woodwasp) one gene (her) is missing, and in chalcid wasps (Trichogramma, Copidosoma and Nasonia) no complex is detectable (see later). With these exceptions, the E(spl)-C, including flanking genes and unrelated genes inserted into the complex (see later), is completely conserved in gene complement, order and orientation.
In Coleoptera, by contrast, the complex is stable, but only contains three genes (her, mα and bHLH2) (Fig. 4). An orthologue of bHLH1 is not present in any of the coleopteran genomes examined. In Leptinotarsa (Colorado potato beetle), her and mα are missing, and bHLH2 has been duplicated. The fact that bHLH1 is absent from all coleopteran species examined indicates that it was likely lost early in coleopteran evolution.
The basal four-gene complex can be found in the genomes of the Lepidopterans [...] the complex is split across two contigs. The complex is reduced to two genes (bHLH1 and her) in Heliconius (Postman Butterfly). This is the only insect species of the 42 studied that does not have a bHLH2 gene, raising the possibility that this may be a genome sequencing error. In a species of Caddisfly (Limnephilus, sister group to the Lepidoptera) only bHLH2 can be found. It is unclear if this is species specific or lost in the entire lineage.
Figure legend (partially recovered): colours follow Fig. 1 (bHLH-1, light blue; bHLH2, dark blue; her, green). Red hexagons indicate mα genes. White ovals indicate inserted genes with no homology to bHLHO or mα; brown ovals represent Tubulin Tyrosine ligase genes. Purple circles mark gooseberry genes. Where the colour of a bHLH gene is lighter, identification of this gene is only through placement in the complex due to gaps in the genome sequence. Arrows indicate direction of transcription
In Diptera (Fig. 6), bHLH1 is absent from all genomes examined. In the Culicidae (Anopheles, Aedes and Culex), her is also absent from the genome, and mα and bHLH2 make up the complex. In the Brachycera, the complex is expanded, with multiple copies of bHLH2: 7 in Ceratitis (Medfly), 7 in Drosophila species and 9 in Lucilia (Common Green Bottle fly). Mα is also expanded, with two copies in each genome. In Drosophila, her is present in the genome, though not linked to the E(spl)-C. In Lucilia and Ceratitis, no her orthologue is present.
Her is present in the E(spl)-C of Mayetiola (Hessian fly), a more deeply branching fly, and there is an unidentifiable gene (due to a gap in genome sequence) in the same position in the fragmentary genome of Lutzomyia (sand fly). These patterns imply that at least a three-gene complex (her, mα and bHLH2) was present in the common ancestor of Diptera, but that gene loss (Culicidae) and expansion of the complex (Brachycera) have extensively modified the E(spl)-C in this group.
In most of the complexes I have identified, the orientation of genes with respect to the direction of transcription of mα is conserved. bHLH-1 and her are transcribed on the opposite strand to mα, and bHLH2 on the same strand. The only variations to this pattern are in Diptera (Fig. 6) where the multiple copies of bHLH2 in Brachycera are transcribed from either strand, and in Culex (Mosquito), where mα and bHLH2 are transcribed from opposite strands.
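The orientation rule described above can be written as a small check. The following is a minimal sketch only, not the procedure used in this study: the gene labels, the `(class, strand)` representation and the function name are assumptions made for illustration, and the example strands are hypothetical rather than taken from any annotated genome.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: (gene class, strand), strand is "+" or "-".
Gene = Tuple[str, str]

# Expected orientation relative to malpha, as stated in the text:
# bHLH2 on the same strand; bHLH1 and her on the opposite strand.
SAME_STRAND_AS_MALPHA = {"bHLH2": True, "bHLH1": False, "her": False}


def follows_conserved_orientation(complex_genes: List[Gene]) -> Optional[bool]:
    """Return True/False for complexes with exactly one malpha gene.

    Returns None when the rule cannot be applied (no malpha, or more
    than one copy, as in Brachyceran flies).
    """
    malpha_strands = [strand for name, strand in complex_genes if name == "malpha"]
    if len(malpha_strands) != 1:
        return None
    reference = malpha_strands[0]
    for name, strand in complex_genes:
        if name in SAME_STRAND_AS_MALPHA:
            same = strand == reference
            if same != SAME_STRAND_AS_MALPHA[name]:
                return False
    return True  # inserted, unrelated genes are ignored


# Illustrative inputs (strands are hypothetical).
ancestral_like = [("bHLH2", "+"), ("her", "-"), ("malpha", "+"), ("bHLH1", "-")]
culex_like = [("malpha", "+"), ("bHLH2", "-")]

print(follows_conserved_orientation(ancestral_like))
print(follows_conserved_orientation(culex_like))
```

Returning `None` rather than a boolean for complexes with zero or several mα genes keeps "rule not applicable" separate from "rule violated".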
These data imply that the origins of the E(spl)-C lie in the pan-crustacean clade. I can find no evidence for E(spl)-C bHLH genes in chelicerates (4 genomes), the most basally branching clade of arthropods, nor the onychophora, the closest non-arthropod ecdysozoan group. While this analysis is not conclusive as to the presence of these genes in onychophorans (due to the partial nature of the genome sequence), their absence in all chelicerates examined is best explained by absence of these genes in this lineage, but could plausibly be due to gene loss.
Figure legend (partially recovered): colours follow Fig. 1 (bHLH-1, light blue; bHLH2, dark blue; her, green). Red hexagons indicate mα genes. White ovals indicate inserted genes with no homology to bHLHO or mα. Arrows indicate direction of transcription
The patterns of conservation, gene loss and expansion indicate that the E(spl)-C has a history of conservation of an ancestral four-gene structure, with gene-loss in some lineages, usually not affecting bHLH2, and expansion via gene duplication in Brachyceran flies, and the copepod Eurytemora. These expansions are difficult to explain, as their patterns of expansion are different. In Eurytemora, all bHLH genes are expanded, while mα is missing. In Brachycera, bHLH1 and her are missing from the complex, with an expansion of bHLH2 and mα. Given there are many species with two, or only one class of E(Spl) gene, these expansions are not best explained as a way to replace missing members of the complex, but may be related to complexity of gene regulation, or pattern formation, required from the complex.

Insertions into the E(spl)-C
In Drosophila melanogaster (Fig. 6), 2 non-bHLHO/mα genes are found in the E(spl)-C. These genes, m1 (encoding a Kazal-type protease inhibitor) and m6 (encoding a protein with a Myelin proteolipid protein PLP), do not produce Notch signalling-like defects when mutant, though m6, like the other E(spl)-C genes, is regulated by Notch signalling [70]. The insertions of m1 and m6 in the complex are conserved in other Drosophilids [10,71], but only m1 is conserved in the Brachyceran flies Ceratitis and Lucilia. Flanking the Drosophila complex is another Notch related gene, groucho, which encodes a protein that interacts with E(spl) bHLH proteins to suppress gene expression [30,72]. This gene does not flank the E(spl)-C outside Drosophila species.
Unrelated genes are inserted into the E(spl)-C in many insect species, but these insertions are most often not conserved between species. The exception to this is a set of tubulin tyrosine ligase genes inserted between bHLH1 and her in Hymenopteran genomes (Fig. 3). One or two of these genes are present in this location in all Hymenoptera examined that have an E(spl)-C. The maintenance of this insertion over 250 million years of evolutionary time, and the expression pattern of one of these genes in Honeybees [10], imply that these tubulin tyrosine ligase genes may be regulated by Notch signalling.
The stability of the E(spl)-C in hymenopteran genomes extends to flanking genes. All Hymenopteran complexes are also flanked by a gene named gooseberry (Fig. 3). Gooseberry is a paired-box containing transcription factor that has been shown to have roles in patterning the nervous system and cuticle in a number of insect species [73][74][75][76][77][78][79] including hymenoptera [80,81]. This gene is also found flanking the E(spl)-C in Homalodisca (Fig. 2), implying the association of gooseberry and E(spl)-C may date from the common ancestor of Endopterygota and the hemipteroid assemblage.

Chalcid Wasps have lost the E(spl) complex
While the E(spl)-C in the hymenoptera is highly conserved and stereotyped, I can find no evidence for E(spl)-C bHLHO genes in the genomes of three chalcid wasps. These species have a full complement of other bHLHO genes (Fig. 7), but no E(spl)-C orthologues are present in these genomes, either distributed, or in a complex. That this pattern of loss is present in three genomes, one of which is the well-sequenced (both genomes and transcriptomes) Nasonia genome, implies that loss of the complex in this group is not a sequencing error, but that chalcid wasps have lost their E(spl)-C. This is the only group yet found in insects that does not have identifiable E(spl)-C genes. All of these wasps' genomes encode the core components of the Notch signalling pathways, as well as other direct targets of Notch signalling (e.g. glass, sugarless etc.) (data not shown). This deficit is specific, therefore, to the genes of the E(spl)-C. That the E(spl)-C is missing from the genomes of these wasps raises important questions as to how cell specification in the nervous system of these animals is achieved.
Interestingly, studies of the effect of Nasonia venom on their fly hosts indicate that E(spl)-C genes are upregulated in the host in response to the venom [82], possibly to trigger developmental arrest. Is it possible that the evolution of resistance to their own venom has necessitated the deletion of the E(spl)-C from the genomes of wasps that use this mechanism?
Origins of the E(spl) complex
E(spl) bHLHO proteins are related to a broad family of bHLHO proteins found in animal genomes. In arthropods, 5 major families (Hairy/deadpan, Side, clockwork orange, hey and E(spl)) are present. These proteins are related to HES genes (Hairy/E(spl)-like proteins) found in deuterostome and lophotrochozoan genomes [61]. I reconstructed the relationships between these genes using Bayesian techniques, focussing particularly on deep arthropod relationships, in order to understand the origins of E(spl)-C bHLHO proteins (Fig. 8). This analysis indicates that all of the 5 families of arthropod bHLHO proteins are related to the HES genes of other metazoa. Hairy/deadpan, side, cwo, hey and E(spl) are all equally related to the Notch regulated HES genes. E(spl) proteins are, however, restricted to Myriapods, Crustacea and Insects. The case of Strigamia maritima is an illuminating one. In this genome there are five genes encoding proteins closely related to E(spl)-C proteins, as well as examples of the other arthropod bHLHO families (excepting Side). Strigamia also encodes two proteins similar to HES from Lophotrochozoa. Chelicerates (including 1 mite, 1 tick and two spiders) have a range of bHLHO proteins, but no E(spl)-C related genes.
Fig. 7 Relationships of bHLHO proteins from chalcid wasps and other insects. Bayesian phylogram of representative hymenopteran bHLH-orange domain proteins reconstructed using the WAG model of protein evolution. While chalcid wasps (genes identified by the prefixes CFL (Copidosoma), TRE (Trichogramma) and XP_ or GLEAN (Nasonia)) have members of the hairy, deadpan, clockwork orange, Hey and Side families, they do not have proteins related to E(spl)-C bHLH proteins from Apis (GB identifiers), Orussus (woodwasp, OAB identifiers), or Drosophila (gene names)
Strigamia contains E(spl)-related bHLHO proteins, but these form a separate clade to those in insects and Crustacea that make up the E(spl)-C.
The E(spl)-like genes from Strigamia do not form a cluster in the genome, and the proteins encoded by these genes form a clade separate to the three crustacean/insect E(spl)-C bHLH clades. As most phylogenetic examinations of the placement of myriapods within the arthropods indicate that they are the sister group to crustaceans and insects, the origins of E(spl)-C bHLH proteins must lie somewhere after the separation of the lineage leading to chelicerates, but before the last common ancestor of insects/crustacea and myriapods. E(spl)-C bHLHO proteins thus pre-date the origin of the E(spl)-C, which is present only in the genomes of insects and crustacea.
These differences in the organisation of the E(spl)-C genes and complex mirror, to some extent, differences between these clades in the presence of the neural stem cells, neuroblasts, that the E(spl)-C regulates. In chelicerates and myriapods there is no evidence for cells similar to the neural stem cells that arise out of proneural clusters and repress their neighbours through Notch signalling and the E(spl)-C in Drosophila [83][84][85][86][87]. Neither of these groups has an E(spl)-C in our analyses, while Crustacea and insects, which do have identifiable neuroblasts, do. There is some evidence that Myriapods, which have no E(spl)-C but do have E(spl) bHLHO proteins, may have specialised neural precursors in groups of cells specified to become neural [83]. Understanding how neural cells are specified in these groups, and how this is related to Notch signalling, will allow us to determine if the formation of the E(spl)-C is linked to the evolution of neuroblasts.

Conclusions
The genomes of arthropods contain few evolutionarily conserved gene complexes [1], the most well known being the Hox [5], runt [88] and E(spl)-C [10]. The E(spl)-C is restricted to Crustacea and insects, but the key bHLH genes arose before the formation of the complex. The complex appears to have become assembled in the lineage leading to insects and crustaceans, possibly through the association of the bHLH genes (with a long evolutionary history of Notch responsiveness) with the mα Brd-class gene (Fig. 9). Presumably the formation of this complex gave some advantage in the regulation or expression of these genes, cementing the structure of the complex. The regulation of this complex in Drosophila through chromatin conformation regulators [45,46], and the suggestion of coordinate regulation [46], may provide an explanation for the conservation of the complex through 540 million years of arthropod evolution. This complex has remained stable in insect genomes; while gene loss and duplication have reshaped it in some lineages, its complete absence has only been detected in a group of chalcid wasps.
The pattern of evolution of the E(spl)-C implies some regulatory reason for the conservation of its structure, perhaps on a par with the well-described coordinate regulation of the Hox complex [5]. Examining the expression and function of these genes in species with variations of the complex, and in deeply branching groups such as myriapods, will provide insight into the reasons behind the conservation of this remarkable gene complex.
Fig. 9 Summary of inferred evolutionary events in the evolution of the E(Spl)-C in insects and Crustacea

Gene identification
bHLHO domain genes and Brd-class genes were identified in arthropod genomes using Blast [69], with orthology assigned using a reciprocal blast best-hit approach. Coding sequences were either extracted from gene prediction sets or, if such predictions were absent or erroneous, predicted using FGENESH [89] on contigs identified as containing bHLHO sequences using tblastn [69]. Predicted proteins were generated and aligned using CLC Genomics Workbench (http://www.clcbio.com).
Genomic analyses were carried out using CLC Genomics Workbench to visualise the placement of bHLHO genes on scaffolds and contigs. Predicted proteins encoded in these regions were analysed with blastp [69], in the first instance, and HMMER [90] (to identify Brd-class encoding genes).
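The reciprocal best-hit step mentioned above can be illustrated with a short sketch. This is an assumption-laden illustration, not the pipeline used here: it assumes Blast results have already been reduced to `(query, subject, bit score)` tuples, the function names are hypothetical, and the identifiers in the example are invented.

```python
from typing import Dict, List, Tuple

# Hypothetical, pre-parsed Blast hit: (query_id, subject_id, bit_score).
Hit = Tuple[str, str, float]


def best_hits(hits: List[Hit]) -> Dict[str, str]:
    """Map each query to its single highest-scoring subject."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: List[Hit], b_vs_a: List[Hit]) -> List[Tuple[str, str]]:
    """Return (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Invented identifiers and scores, for illustration only.
species_a_vs_b = [
    ("A_gene1", "B_gene7", 210.0),
    ("A_gene1", "B_gene9", 95.0),
    ("A_gene2", "B_gene9", 180.0),
]
species_b_vs_a = [
    ("B_gene7", "A_gene1", 205.0),
    ("B_gene9", "A_gene1", 99.0),  # best hit is A_gene1, not A_gene2
]

print(reciprocal_best_hits(species_a_vs_b, species_b_vs_a))
```

In this toy input only `A_gene1` and `B_gene7` are each other's best hit, so `A_gene2` would not be assigned an orthologue by this criterion.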

Phylogenetics
All phylogenetic analyses were carried out using MrBayes [67] with the WAG model of protein evolution [91], which proved to be the most appropriate model after testing using mixed models. Markov chain Monte Carlo runs were carried out for 1,000,000 generations, with the initial 25 % of trees discarded as burn-in. Consensus trees were visualised with Dendroscope [92] or CLC Genomics Workbench.
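As a worked illustration of the burn-in described above, the sketch below drops the first 25 % of a list of sampled trees. It is not MrBayes code: the function name is hypothetical, and the number of samples in the example is an assumed value, because the sampling frequency is not stated here.

```python
from typing import List, TypeVar

T = TypeVar("T")


def discard_burn_in(samples: List[T], burn_in_fraction: float = 0.25) -> List[T]:
    """Drop the initial fraction of MCMC samples and return the rest."""
    if not 0.0 <= burn_in_fraction < 1.0:
        raise ValueError("burn_in_fraction must be in [0, 1)")
    n_discard = int(len(samples) * burn_in_fraction)
    return samples[n_discard:]


# Assumption for illustration: 1,000 sampled trees, represented here by
# their sample indices rather than real tree objects.
sampled_trees = list(range(1000))
retained = discard_burn_in(sampled_trees, burn_in_fraction=0.25)

print(len(retained), retained[0])
```

Only the retained samples would then contribute to a consensus tree and its posterior probabilities.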