The genetic factors of bilaterian evolution

The Cambrian explosion was a unique animal radiation ~540 million years ago that produced the full range of body plans across bilaterians. The genetic mechanisms underlying these events are unknown, leaving a fundamental question in evolutionary biology unanswered. Using large-scale comparative genomics and advanced orthology evaluation techniques, we identified 157 bilaterian-specific genes. They include the entire Nodal pathway, a key regulator of mesoderm development and left-right axis specification; components for nervous system development, including a suite of G-protein-coupled receptors that control physiology and behaviour, the Robo-Slit midline repulsion system, and the neurotrophin signalling system; a high number of zinc finger transcription factors; and novel factors that previously escaped attention. Contradicting the current view, our study reveals that genes with bilaterian origin are robustly associated with key features in extant bilaterians, suggesting a causal relationship.

To create a test dataset for the R version of OrthoMCL, we merged the first 100 (of 10,000) BLAST output tables of the full dataset, comprising 243,586,115 BLAST hits, and processed the resulting file in parallel with the R version of OrthoMCL and with the original version [Li et al. 2003]. Both pipelines delivered identical ortholog, inparalog, and coortholog tables, which serve as input for the final clustering steps. This table shows the number of rows present in the BLAST output table (wc -l BLAST_output_tbl) and in the three OrthoMCL output tables (wc -l orthologs.txt inparalogs.txt coorthologs.txt) as well as the associated MD5 message digest checksums. In addition, the first five rows of all three orthology tables are displayed (head -5 orthologs.txt inparalogs.txt coorthologs.txt). All numbers and row contents are identical between the original and the R version of OrthoMCL.

Table 6: Orthogroup composition of NK cluster genes in the BigWen database. The table shows, from left to right, Drosophila NK genes, their corresponding human orthologs and paralogs (GenBank accession numbers in square brackets), the orthogroup(s) to which an NK gene was assigned in our dataset, and the composition of this orthogroup, denoting the number of species from Po (Porifera), Pl (Placozoa), Ct (Ctenophora), Cn (Cnidaria), and Bi (Bilateria) that belong to this group. In contrast to other NK genes, D. melanogaster tinman is not placed with its human counterparts, but is a member of an extra orthogroup restricted to Endopterygota (OG 92160, see text). NKX2.3 is the only NK cluster gene with a potential ortholog in Placozoa (bold). *: Lbe/lbl in Drosophila and LBX1/LBX2 in Homo are both independent duplications of a single ancestral ladybird gene [Cande et al. 2009].

Orthogroup ID, corresponding gene name, and orthogroup composition are shown for several reciprocal HMM-HMM best hit orthogroups of two example proteins, Sprouty and GAGA factor.
Columns 4-7 summarise results of HMM-HMM searches in our database. Columns four and five show the orthogroup ID and composition of the three best search hits for each query. Columns six and seven report the last common ancestor and corresponding E-value of the hit orthogroups. B = Bilateria; Cn = Cnidaria; Ct = Ctenophora; D = Deuterostomia; E = Ecdysozoa; L = Lophotrochozoa; Pl = Placozoa; n.d. = not determined. Orthogroups are coloured to illustrate reciprocal best hit relationships. Note that the four orthogroups OG 7767, OG 46647, OG 58139, and OG 73898 are each other's hits in reciprocal HMM-HMM searches, suggesting that a combination of the four orthogroups with the inferred ancestor "Metazoa" (ancestor of bilaterians, cnidarians, and placozoans) correctly reflects evolutionary history, in agreement with the description of Sprouty in cnidarians [Matus et al. 2007]. Similarly, OG 56080 and OG 165721 are each other's reciprocal best hit, while another orthogroup, OG 26633, does not satisfy this criterion. The complete orthogroup for GAGA factor is therefore a combination of OG 56080 and OG 165721 with the ancestor "Pterygota", as published previously [Heger et al. 2013].

Supplementary Table 9: Novel protein domains of bilaterian origin. Left column: Orthogroup identifier in BigWenDB. Second and third columns: Sequence ID and E-value of the best HMM search hit in the entire BigWenDB (HMM search in fasta-formatted BigWenDB sequence collection). In contrast, the fourth and last columns show sequence ID and E-value of the best HMM search hit from non-bilaterian metazoans. *: abbreviated for space constraints. Full IDs are 7868|predict SINCAMP00000020970 0 and 137513|trs comp59583 c0 seq3 48 0, respectively. Red colour indicates the only bilaterian-specific orthogroup for which we found similar sequences of non-bilaterian origin, indicating the existence of nonorthologous proteins with a similar domain in non-bilaterian metazoans.
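The reciprocal best hit criterion used above to combine the Sprouty and GAGA factor orthogroups can be illustrated with a minimal Python sketch. It assumes that each query orthogroup comes with a list of its top non-self HMM-HMM hits, treats two orthogroups as linked when each appears in the other's hit list, and merges linked orthogroups with a small union-find. The function name `reciprocal_groups`, the input layout, and the toy identifiers A to E are hypothetical and are not taken from the pipeline described here.

```python
def reciprocal_groups(top_hits):
    """Merge orthogroups that are connected by reciprocal hits.

    top_hits maps each query orthogroup to a list of its top non-self
    hits (hypothetical layout, for illustration only).
    """
    parent = {og: og for og in top_hits}

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, hits in top_hits.items():
        for hit in hits:
            # A link requires the hit to list the query among its own hits.
            if hit in top_hits and query in top_hits[hit]:
                parent[find(query)] = find(hit)

    groups = {}
    for og in top_hits:
        groups.setdefault(find(og), set()).add(og)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Toy input: A and B hit each other; C hits D, but D does not hit C.
toy_hits = {
    "A": ["B", "C"],
    "B": ["A"],
    "C": ["D"],
    "D": ["E"],
    "E": [],
}
merged = reciprocal_groups(toy_hits)
print(merged)
```

In this toy input only A and B are merged; C is a hit of A but does not return the hit, which mirrors the situation described for OG 26633.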
Hox-D8 Sequence-specific transcription factor which is part of a developmental regulatory system that provides cells with specific positional identities on the anterior-posterior axis.
5339 Prospero homeobox protein 2 Transcription factor involved in developmental processes such as cell fate determination, gene transcriptional regulation and progenitor cell regulation in a number of organs. Plays a critical role in embryonic development and functions as a key regulatory protein in neurogenesis and the development of the heart, eye lens, liver, pancreas and the lymphatic system. Involved in the regulation of the circadian rhythm.
[. . . ]
11804 Homeobox-containing protein 1 Transcription factor. Isoform 1 acts as a transcriptional repressor. Isoform 4 has very low activity as a transcriptional repressor.
11810 Paired mesoderm homeobox protein 1 Acts as a transcriptional regulator of muscle creatine kinase (MCK) and so has a role in the establishment of diverse mesodermal muscle types. The protein binds to an A/T-rich element in the muscle creatine enhancer.

Homeobox protein Mohawk
May act as a morphogenetic regulator of cell adhesion.
18424 Intestine-specific homeobox Transcription factor expressed in neurons of the brain that regulates the excitatory-inhibitory balance within neural circuits and is required for contextual memory in the hippocampus (By similarity). Plays a key role in the structural and functional plasticity of neurons (By similarity). Acts as an early-response transcription factor in both excitatory and inhibitory neurons, where it induces distinct but overlapping sets of late-response genes in these two types of neurons, allowing the synapses that form on inhibitory and excitatory neurons to be modified by neuronal activity in a manner specific to their function within a circuit, thereby facilitating appropriate circuit responses to sensory experience.
[. . . ]
9983 Achaete-scute homolog 2 AS-C proteins are involved in the determination of the neuronal precursors in the peripheral nervous system and the central nervous system.

Supplementary Table 13: HMM-HMM search results for three uncharacterised arthropod-specific proteins. Orthogroup ID, corresponding D. melanogaster gene name, and orthogroup composition are shown for three arthropod-specific proteins without described function. Columns 4-7 summarise results of HMM-HMM searches in our database. Columns four and five show the orthogroup ID and species number within the four best search hits for each query. Columns six and seven report the last common ancestor and corresponding E-value of the orthogroup hits. Ch = Chelicerata; Cr = Crustacea; He = Hexapoda; My = Myriapoda. Note that each query HMM detects itself as best hit. All next most similar HMM-HMM hits correspond either to orthogroups within Arthropoda (Endopterygota, Panarthropoda, Formicidae, Obtectomera) or to very small, phylogenetically distant orthogroups outside arthropods with low similarity (Neognathae, Trichoderma, Rhizoctonia, Gnathostomata), illustrating that these arthropod-specific proteins do not possess related domains outside their lineage.
For corresponding multiple sequence alignments, see Figure 4-Figure Supplement 1.

Orthogroup ID, corresponding gene name, and orthogroup composition are shown for Nodal and a Hydra magnipapillata Nodal-related gene as published in [Watanabe et al. 2014]. Columns 4-7 summarise results of HMM-HMM searches in our database. Columns four and five show the orthogroup ID and corresponding gene name of the four best search hits for each query. Columns six and seven report the last common ancestor and composition of the hit orthogroups. Abbreviations: B = Bilateria; Cn = Cnidaria; Ct = Ctenophora; D = Deuterostomia; E = Ecdysozoa; L = Lophotrochozoa; Pl = Placozoa; Po = Porifera; n.d. = not determined; GDF = Growth/differentiation factor. Note that each query HMM detects itself as best hit. OG 12210 (Nodal), OG 9136 (containing Hydra Nodal-related), and OG 9136 Cn, containing only cnidarian Nodal-related genes, are not engaged in a reciprocal best hit relationship in HMM-HMM searches, arguing against a common evolutionary origin.

Supplementary Table 17: HMM-HMM search results for bilaterian-specific G protein-coupled receptors. Orthogroup ID, corresponding human gene name (UniProt ID), and orthogroup composition are shown for eight bilaterian-specific G protein-coupled receptor proteins. Columns 4-7 summarise results of HMM-HMM searches. Columns four and five show orthogroup ID and composition of the four best search hits for each query HMM. Columns six and seven report their last common ancestor and corresponding E-value. Cn = Cnidaria; Ct = Ctenophora; D/Deuterost. = Deuterostomia; E = Ecdysozoa; Eumeta. = Eumetazoa; Gnathost. = Gnathostomata; L = Lophotrochozoa; Protost. = Protostomia. Note that each query HMM detects itself as best hit. Most of the next most similar hits correspond to orthogroups within Bilateria (Deuterostomia, Gnathostomata, Nematoda, Protostomia).
Four HMM-HMM hits belong to more ancient orthogroups with the potential to shift the inferred bilaterian ancestor (highlighted in red), but in all these cases the composition of the hit orthogroup does not support a (eu)metazoan ancestor on a broad phylogenetic basis, as only one or two non-bilaterian species are present. In addition, the reciprocal best hit criterion is fulfilled in only one of these cases (OG 23231), arguing that most of the eight GPCRs originated in the ancestor of bilaterians.

Table 17 with a best HMM-HMM hit relationship to bilaterian-specific GPCRs and origin prior to the bilaterian ancestor. Columns 3-6 summarise results of HMM-HMM searches in our database. Columns three and four show orthogroup ID and composition of the four best search hits for each query HMM. Columns five and six report their last common ancestor and corresponding E-value. B = Bilateria; Cn = Cnidaria; Ct = Ctenophora; D = Deuterostomia; E = Ecdysozoa; L = Lophotrochozoa; Pl = Placozoa; Po = Porifera. Note that each query HMM detects itself as best hit. The HMM-HMM analysis indicates that no reciprocal best-hit orthogroups for the bilaterian-specific GPCRs of

Supplementary Table 19: HMM-HMM search results for two major axon guidance pathways. Orthogroup ID, corresponding gene name, and orthogroup composition are shown for the Netrin-DCC and Slit-Robo axon guidance molecules. Columns 4-7 summarise results of HMM-HMM searches in our database. Columns four and five show orthogroup ID and composition of the four best search hits for each query HMM. Columns six and seven report their last common ancestor and corresponding E-value. Cn = Cnidaria; Ct = Ctenophora; D = Deuterostomia; E = Ecdysozoa; L = Lophotrochozoa; Pl = Placozoa; Po = Porifera. Note that each query HMM detects itself as best hit. OG 51853 is the reciprocal best hit in cnidarians (green highlight) of Robo orthogroup OG 4128, suggesting a (eu)metazoan origin of this receptor.
Similarly, the composition of the Netrin and DCC orthogroups and of their HMM search hits suggests a pre-bilaterian origin of these factors.

In each of the subfigures B-D, one of three sequences from fungi or cnidarians (highlighted in red) was added. These sequences were present in the original orthogroup OG 5226 and shifted its ancestor to opisthokonts. The added sequences either reduced bootstrap support of the OSR clade (B) or clustered with the outgroup clade (C and D), suggesting that they do not belong to the OSR orthogroup. Branch labels correspond to the results of SH-aLRT (Shimodaira-Hasegawa-like approximate likelihood ratio test, left) and UFBoot (ultrafast bootstrap approximation, right) as implemented in IQ-TREE [Nguyen et al. 2015].
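As a reading aid for branch labels of the form "SH-aLRT/UFBoot", the following minimal Python sketch splits such a label and checks it against support cut-offs. The default cut-offs (SH-aLRT >= 80 and UFBoot >= 95) follow the guideline commonly given in the IQ-TREE documentation; they are an assumption here and are not stated as the criterion used in this study. The function names and example labels are hypothetical.

```python
def parse_support(label):
    """Split a 'SH-aLRT/UFBoot' branch label into two floats."""
    sh_alrt, ufboot = label.split("/")
    return float(sh_alrt), float(ufboot)


def is_well_supported(label, min_sh_alrt=80.0, min_ufboot=95.0):
    """Return True if both support values reach their cut-offs.

    The default cut-offs are an assumed guideline, not values
    reported in the text.
    """
    sh_alrt, ufboot = parse_support(label)
    return sh_alrt >= min_sh_alrt and ufboot >= min_ufboot


# Hypothetical labels, for illustration only.
for label in ["98.5/100", "75.2/96", "88/90"]:
    print(label, is_well_supported(label))
```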

[Figure panel labels: Plal1 (OG_11041), poly-ZF region]

Sequence logos [Crooks et al. 2004] of two bilaterian-specific orthogroups, CTCF (OG 6452) and Plal1 (OG 11041), both containing multiple C2H2 zinc fingers in their central region (see Supplementary Table 8). The Plal1 logo was built from 92 sequences aligned over 1053 positions. The CTCF logo represents 175 sequences and 939 alignment positions, as obtained in the clustering. Indels and unaligned regions have been removed from the corresponding multiple sequence alignments, and the first two zinc finger domains of each protein are highlighted by shading and a red label. The sequence logos demonstrate that the central ZF domains are conserved beyond the Cys and His residues needed for Zn2+ complexation. Individual zinc fingers in both proteins display highly informative and unique signatures that distinguish them from other zinc fingers in the same or similar proteins.
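The notion of an "informative" logo position can be made concrete with a minimal Python sketch of per-column information content. It computes log2(20) minus the Shannon entropy of the residue frequencies in each alignment column, so a fully conserved protein column reaches about 4.32 bits. The small-sample correction applied by logo software is omitted, and the function names and toy alignment are hypothetical, so the values only approximate the heights of a published logo.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Information content (bits) of one alignment column.

    Gap characters are ignored; no small-sample correction is applied.
    """
    residues = [r for r in column if r not in "-."]
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(alphabet_size) - entropy


def logo_information(alignment):
    """Per-column information for equal-length aligned sequences."""
    return [column_information(col) for col in zip(*alignment)]


# Toy alignment: conserved Cys and His flanking a variable position.
toy_alignment = ["CAH", "CGH", "CSH", "CTH"]
bits = logo_information(toy_alignment)
print([round(b, 2) for b in bits])
```

In the toy alignment the conserved Cys and His columns reach the maximum, while the variable middle column carries two bits less.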