Parallel Expansions of Sox Transcription Factor Group B Predating the Diversifications of the Arthropods and Jawed Vertebrates

Group B of the Sox transcription factor family is crucial in embryo development in the insects and vertebrates. Sox group B, unlike the other Sox groups, has an unusually enlarged functional repertoire in insects, but the timing and mechanism of the expansion of this group were unclear. We collected and analyzed data for Sox group B from 36 species of 12 phyla representing the major metazoan clades, with an emphasis on arthropods, to reconstruct the evolutionary history of SoxB in bilaterians and to date the expansion of Sox group B in insects. We found that the genome of the bilaterian last common ancestor probably contained one SoxB1 and one SoxB2 gene only and that tandem duplications of SoxB2 occurred before the arthropod diversification but after the arthropod-nematode divergence, resulting in the basal repertoire of Sox group B in diverse arthropod lineages. The arthropod Sox group B repertoire expanded differently from the vertebrate repertoire, which resulted from genome duplications. The parallel increases in the Sox group B repertoires of the arthropods and vertebrates are consistent with the parallel increases in the complexity and diversification of these two important organismal groups.


Introduction
Sox (Sry-related high-mobility-group box) group B belongs to the Sox family of proteins, which are transcription factors essential in diverse developmental processes [1,2], including neurogenesis [3,4], gonadogenesis [5], and lymphopoiesis [6]. The Sox family was initially identified in relation to the mammalian testis-determining factor, SRY, based on the sequence conservation of the single HMG (high-mobility group) domain, which is a domain of about 79 residues [7] that functions in DNA binding, DNA bending, protein interactions, and nuclear transport [1]. Their interaction with other tissue-specific transcription factors and their spatiotemporal expression patterns, together with mutations in the HMG domain, allow different Sox transcription factors to specify their target selection [1,8,9]. After earlier phylogenetic analyses on the HMG superfamily involving the Sox family [10,11], the analysis conducted by Bowles et al. (2000) based on the HMG domain sequences and other structural indicators, including intron positions, suggested that the Sox family can be classified into groups A-J [12]: A refers to the Sry proteins restricted to some mammals; B, C, D, E, and F are the major groups expressed by a broad range of metazoan taxa [13,14]; and G-J are particular lineage-specific proteins. This transcription factor family first emerged in the stem of the metazoa, and the bilaterian last common ancestor (LCA) already contained all the major Sox groups in its genome [13,14].
Group B Sox proteins play crucial roles in neurogenesis, gonadogenesis, morphogenesis, etc. in vertebrates and insects [1,2,15,16,17]. Within Sox group B, the division into subgroups B1 and B2 has been proposed based on a full-length protein sequence alignment and the functional roles of the group B proteins in chicken [17] and some other vertebrates [12]. In the vertebrates, members of the same subgroup share high similarity of their full-length protein sequences but no observable similarities with members of the other subgroup in the regions outside the HMG domain and a short C-proximal region of this domain. In terms of function, SoxB1 proteins act as transcriptional activators, whereas SoxB2 proteins play a role as repressors in the chicken [17]. SoxB1- and SoxB2-like proteins have also been identified in bilaterian invertebrates and assigned to the two subgroups based on BLAST searches and tree-based analyses, although less confidently [12]. SoxB1- and SoxB2-like proteins have also been identified in the cnidarians [18] and demosponges [13,19], although with much less confidence, which implies that the division into subgroups B1 and B2 might have taken place before the demosponges diverged from the eumetazoans [13].
However, there is negligible similarity between the protein sequences of the non-HMG domain regions in the members of different SoxB subgroups as in the members of different Sox groups. Therefore, the tree-based phylogenetic analysis of SoxB proteins is actually restricted to the HMG domain as is the analysis of the whole Sox family. However, although the HMG domain sequence has been demonstrated to be sufficient to group the different Sox groups, this domain inadequately resolves the grouping of the subgroups within Sox group B. On published trees constructed from the sequences of Sox group B and the other Sox groups of the bilaterians, the Sox subgroup B2 almost always (and subgroup B1 sometimes) shows paraphyly [12,20,21,22]. When nonbilaterian SoxB sequences are included in the tree construction, the situation becomes more complicated, because the nonbilaterian sequences are often highly divergent and lineage-specific duplications seem to have occurred [13,18,23]. On the tree reported in the paper of Shinzato et al. (2008), the previously assigned cnidarian and demosponge SoxB1s and SoxB2s all cluster outside the bilaterian Sox subgroup B1 and B2 representatives, which prompted the suggestions that the partition of group B to subgroups B1 and B2 only occurred in the bilaterians and that both SoxB1 and SoxB2 were generated from a SoxB1-like precursor [23]. However, as the authors noted, the tree may contain bias, so the relationship between subgroups B1 and B2 remains unresolved.
Within the Bilateria, vertebrates such as the human and mouse have several representatives from each major Sox group. In contrast, the bilaterian invertebrates typically have only one family member from each of groups C-F, and two members from group B [2,12]. This difference is considered to have arisen from genome duplications during the early evolution of the vertebrates [24,25]. However, the situation is quite different in the Sox group B of the insects, in that the insect genomes contain at least four members of this group [20]. There are discrepancies in the assignment of these members to the B1 or B2 subgroups. The early phylogenetic analysis of the Drosophila Sox family put three of the four SoxB members in subgroup B2, and suggested that the additional SoxB2 members might have been produced by recent lineage-specific duplications [12]. However, later studies that included more insect genomes revealed that the basal four-member inventory of Sox group B is conserved, at least in the holometabolous insects, and the previously defined insect SoxB2 members have specific sequence and functional features distinguishable from the vertebrate features, which make it difficult to clarify the orthologies between the SoxB members of the insects and the vertebrates [20,26]. A model has been suggested for the expansion of Sox group B in the insects [26], in which the previously defined SoxB2 member Dichaete of Drosophila is orthologous to the vertebrate SoxB1 members, rather than to the vertebrate SoxB2 members. However, this model seems implausible after our investigation, in which we consider orthology a strictly evolutionary concept.
In published papers, only a single species of the noninsect arthropods, the millipede Glomeris marginata, has been investigated in terms of its Sox genes, and three SoxB members were found in that species [27]. However, only fragments of the HMG domain of the Sox sequences were obtained and no clear orthologies of the SoxB members have been resolved.
Here, we address the questions that underlie the issues discussed above to achieve a more confident and clear understanding of the evolution of Sox group B. In summary, the questions are as follows: When did the subdivision of Sox group B into subgroups B1 and B2 take place? What is the evolutionary trajectory of the expansion of Sox group B found in the insects? Is this expansion insect specific or did it occur homologously in other arthropods, such as crustaceans, myriapods, and chelicerates, or in even broader taxa? To answer these questions, we collected data from representative metazoan lineages and the metazoans' closest relative, and reconstructed the evolutionary scenario of Sox group B in the metazoans.

Results and Discussion
1. Phylogenetic origin of the Sox subgroup B1/B2
Larroux et al. (2008) suggested that the metazoan LCA had one or two proto-SoxB members, and because the genomes of the fungi and choanoflagellates, the closest relatives of metazoans, contain no Sox sequences, SoxB must have originated after the divergence of the metazoan and choanoflagellate lineages [13]. However, King et al. (2008) identified a Sox-like sequence in the genome of the choanoflagellate Monosiga brevicollis using BLAST [28]. In our analysis, this Sox-like sequence shares only relatively low identities (<40%) with metazoan Sox proteins in the HMG domain, which is significantly below the identities (≥46%; Lefebvre et al., 2007 [1]) shown by the metazoan Sox proteins in the HMG domain. The choanoflagellate Sox-like sequence clusters with the Capicua (Cic) sequences on the unrooted Bayesian and Maximum likelihood (ML) trees (Fig. 1A; the ML tree is not shown because it has nearly the same topology as the Bayesian tree) reconstructed with the HMG domains of representative metazoan Sox proteins and the choanoflagellate Sox-like protein, together with representative metazoan nonSox HMG box proteins: T-cell factor (TCF)/lymphoid enhancer binding factor (LEF-1)/pangolin (Pan) and Cic proteins, and the choanoflagellate Cic-like protein. Because conflicting and ambiguous phylogenetic signals can be visualized with split networks [29], we reconstructed a split network using the neighbor-net method based on the same data used in the tree-based analyses. On the split network constructed with the same data (Fig. 1B), the conflicting topologies are displayed simultaneously, and the clustering of the choanoflagellate Sox-like protein with the metazoan Sox proteins is observed.
However, the choanoflagellate Sox-like protein is outside the metazoan Sox family, even on the split network, so even if this protein is orthologous to the metazoan Sox family, it is not directly orthologous to a specific Sox group of the metazoans. Therefore, the genesis of Sox group B must have occurred after the divergence of the metazoan lineages and the choanoflagellate lineage.
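The identity comparison used above (<40% for the choanoflagellate Sox-like sequence versus ≥46% among metazoan Sox HMG domains) can be illustrated with a minimal sketch. The function name, the gap handling (gapped columns are skipped), and the short toy fragments are assumptions made for illustration only; they are not the sequences or the software used in this study.

```python
# Minimal sketch: percent identity between two aligned protein sequences.
# Assumption: columns in which either sequence has a gap ("-") are ignored.

SOX_HMG_MIN_IDENTITY = 46.0  # threshold cited in the text (Lefebvre et al., 2007 [1])


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Return percent identity over ungapped aligned columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue  # skip gapped columns
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Hypothetical 10-residue aligned fragments (NOT real HMG domain sequences).
query = "MKRPMNAFMV"
subject = "MKRPLNAF-V"

identity = percent_identity(query, subject)
within_sox_range = identity >= SOX_HMG_MIN_IDENTITY
```

In this toy case eight of the nine ungapped columns match, so the pair would fall inside the range reported for metazoan Sox proteins; a real comparison would use the full 79-residue HMG domain alignment.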
As mentioned in the Introduction, the subdivision of Sox group B was proposed based on a comparison of the full-length protein sequences and functions of the Sox group B members in chicken [17]. However, the extent to which this subdivision applies should be assessed based on sufficient representative species. We collected and aligned the full-length protein sequences of the Sox group B members of species representing the three major clades (lophotrochozoa, ecdysozoa, and deuterostomia) of bilaterians, nonbilaterian eumetazoans, and basal metazoans, and found that subgroup-specific conservative motifs exist in the region outside the HMG domain in both subgroup B1 [23] and B2 throughout almost all the metazoan representatives, although the conservation becomes less clear when extended to the demosponge SoxB2, and the subgroup-specific conservative motif of subgroup B2 seems to have been lost in some of the protostomes (Fig. S1).
With a more extensive sampling of the bilaterian SoxB proteins, we first classified the collected SoxB sequences into subgroup B1 or B2 according to the best hits of BLASTP searches of the RefSeq protein database of Homo sapiens. The alignments of these collected sequences confirmed that the two previously proposed signature residues at positions 2 and 78 of the HMG domain [30], which distinguish Sox subgroups B1 and B2 within group B, are conserved in our much broader sample of taxa, with a few exceptions, which often correspond to highly divergent sequences (Fig. 2). When nonbilaterian SoxBs are included, the signature residues in the HMG domain are incomplete. This either reflects a loss of conservation or suggests that some of the signature residues were derived after the divergence of the bilaterian lineage from the other metazoan lineages. However, in most places, the signature residues histidine (H) at position 2 and proline (P) at position 78 are conserved in the SoxB2 HMG domains of the nonbilaterian metazoans, and the signature residue arginine (R) at position 2 is conserved in the SoxB1 HMG domains of the nonbilaterian eumetazoans (Fig. S2A).
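The signature-residue check described above can be sketched as a simple rule. This is an illustrative reading of the text, not the procedure used in the study: the function name is hypothetical, only position 2 is used for subgroup B1 because the text does not state the B1 residue at position 78, and the toy domains are placeholders filled with "X".

```python
# Minimal sketch: classify a SoxB HMG domain from the signature residues
# at positions 2 and 78 (1-based), as described in the text.

HMG_LENGTH = 79


def classify_soxb(hmg_domain: str) -> str:
    """Return "B1", "B2", or "ambiguous" for an ungapped HMG domain."""
    if len(hmg_domain) < 78:
        return "ambiguous"  # incomplete domain: position 78 is unavailable
    pos2 = hmg_domain[1]    # position 2 (1-based)
    pos78 = hmg_domain[77]  # position 78 (1-based)
    if pos2 == "H" and pos78 == "P":
        return "B2"
    if pos2 == "R":
        return "B1"
    return "ambiguous"


def toy_domain(pos2: str, pos78: str) -> str:
    """Build a placeholder 79-residue domain (hypothetical, not a real sequence)."""
    residues = ["X"] * HMG_LENGTH
    residues[1] = pos2
    residues[77] = pos78
    return "".join(residues)


b2_like = classify_soxb(toy_domain("H", "P"))
b1_like = classify_soxb(toy_domain("R", "X"))
unclear = classify_soxb(toy_domain("H", "X"))
```

A rule of this kind only flags candidates; as noted above, highly divergent and nonbilaterian sequences can carry incomplete signatures, so such a check would complement rather than replace BLAST and phylogenetic evidence.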
We also performed tree-based and net-based phylogenetic analyses of the SoxB HMG domain sequences of species representing 11 phyla of the animal kingdom, excluding some of the highly divergent cnidarian SoxB duplicates. Both the ML and Bayesian trees (Fig. 3A) maintained the split between the SoxB1s and SoxB2s, which was confirmed by the branch supports calculated with the approximate likelihood ratio test and the SH-like test, although the bootstrap value for the ML tree was less than 50%. This low bootstrap value was probably caused by the short length of the sequences (79 residues) and their high identities (>65%), rather than refuting the split of subgroups B1 and B2. Low bootstrap values are prevalent in phylogenetic analyses of the SoxB HMG domains, and these bootstrap values decrease as the number of sequences analyzed increases and/or the length of the sequences decreases [13,14]. Therefore, the bootstrap test does not seem sufficiently powerful and may be inappropriate for the evaluation of the statistical confidence in the phylogenetic analysis of the SoxB HMG domains. We also reconstructed a split network using the neighbor-net method based on the same data used in the tree-based analyses (Fig. 3B). The split network shows the SoxB1/B2 split and also the possible existence of long-branch attraction (LBA) between the nonbilaterian SoxB1s and SoxB2s, which probably caused the nonbilaterian SoxB1s to be placed outside the bilaterian SoxB1s and SoxB2s in the tree of Shinzato et al. (2008) [23].
Considering the evidence presented above, we can state that the partition of Sox group B to subgroups B1 and B2 makes sense, and reflects the true phylogenetic relationships, and that the SoxB1/B2 division occurred after the divergence of the metazoans and the choanoflagellate but before the demosponges diverged from the eumetazoan lineages.

Parallel expansions of Sox group B before the arthropod and jawed vertebrate radiations
Previous studies have suggested two incompatible models for the expansion of Sox group B in Drosophila [12,26]. One of these models places one of the four SoxB members into subgroup B1, and the other three into subgroup B2 [12]. Although there is agreement on the assignment of SoxNeuro (SoxB1) into subgroup B1 and Sox21a (SoxB2a) into subgroup B2, the other model maintains that Dichaete (SoxB2b1) and Sox21b (SoxB2b2) are both co-orthologous to both vertebrate Sox1 and Sox2 rather than to the vertebrate SoxB2 members, and that the Protostome-Deuterostome LCA had a three-member complement of Sox group B proteins [26]. The resolution of this dispute lies in the correct orthology assignments of the Drosophila SoxB members with the vertebrate ones, and a valid reconstruction of the ancestral SoxB repertoire at key phylogenetic nodes. As we mentioned in the Introduction, a related and interesting question concerns the phylogenetic timing of the expansion of Sox group B in Drosophila. Initially, this expansion was attributed to relatively recent duplications [12], but later research [20,26] involving more insect taxa indicated that the four-member SoxB inventory is phylogenetically old, and was at least present in the LCA of the Hymenoptera and Diptera. However, whether this expansion is even older remained an open question at that time.
To resolve these linked questions, we based our research on an extensive sample of taxa derived from database searches, text mining, and DNA sequencing, which involved 31 species from nine major phyla of bilaterians and five species from three major phyla of nonbilaterian metazoans (Table 1). To represent the major taxonomic groups of Arthropoda, because the arthropods were the focus of this study, our first phylogenetic analysis contained complete data for the Sox group B of eight insects and one branchiopod (subphylum Pancrustacea), and one arachnid (subphylum Chelicerata), and the subsequent analysis added partial data for the Sox group B (Fig. S2B): we retrieved the incomplete HMG domain sequences of one diplopod (subphylum Myriapoda) from the text [27], and newly obtained sequences of one malacostracan (subphylum Pancrustacea), one chilopod (subphylum Myriapoda) and another arachnid (subphylum Chelicerata) by degenerate PCR and genome walking technique; SoxB sequences of the microscopic tardigrade Macrobiotus areolatus, which belongs to the superphylum Panarthropoda, were also obtained by our de novo sequencing, giving the first records of Sox genes for the mysterious phylum Tardigrada. In preliminary analyses of the incomplete HMG box sequences we newly determined, some of the sequences show ambiguous orthology (Table 1) and they were excluded from the subsequent phylogenetic reconstructions. Because the nonbilaterian Sox sequences are typically highly divergent, as revealed by previous studies and also our preliminary analyses, we excluded nonbilaterian sequences from the phylogenetic analysis undertaken to resolve the SoxB duplication within the Bilateria, to lessen the effects of LBA.
As mentioned above, we first classified the collected SoxB sequences into subgroup B1 or B2 according to the best hits of BLASTP searches of the RefSeq protein database of Homo sapiens. The subgroup-specific residues at positions 2 and 78 of the HMG domain [30] are well conserved in the full-length HMG domain alignment of our broad sample of taxa, with only a few exceptions, and none of the exceptions occurs in the arthropods (Fig. 2). We then constructed both tree- and net-based phylogenies based on the alignment in Fig. 2, which contains all the full-length HMG domain sequences of SoxB from the bilaterian species for which the full SoxB complements were available, except three divergent sequences from Branchiostoma floridae and Saccoglossus kowalevskii (discussed in subsection 4). The ML and Bayesian trees and the split network (Figs. 4A and 5A) all maintained the split between Sox subgroup B1 and subgroup B2, giving support to the previous classification, although the bootstrap value on the ML tree was low, probably because the sequences are short, have high similarity, and there are large numbers of sequences. The high similarity between the subgroup B1 and B2 HMG domains reflects the fact that the net difference between the HMG domains of subgroups B1 and B2, calculated based on the sequences in the alignment (Fig. 2), is 0.04 using p-distances, which corresponds to about three amino acid residues in the 79-residue HMG domain. As discussed in subsection 1, the bootstrap test may be inappropriate for evaluating the statistical confidence at the nodes of the trees constructed for the SoxB HMG domains.
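The conversion from a net p-distance of 0.04 to roughly three residues in a 79-residue domain can be made explicit with a short sketch. The text does not give the formula behind the "net difference"; the code assumes the common definition of a net between-group mean distance (the between-group mean minus the average of the two within-group means). The function names and five-residue toy sequences are hypothetical.

```python
# Minimal sketch: p-distance, net between-group distance, and the
# arithmetic linking a net distance of 0.04 to ~3 residues out of 79.
from itertools import combinations, product


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites over ungapped aligned columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    sites = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not sites:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in sites) / len(sites)


def mean(values) -> float:
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def net_between_group_distance(group_x, group_y) -> float:
    """Assumed definition: between-group mean minus mean of within-group means."""
    between = mean(p_distance(x, y) for x, y in product(group_x, group_y))
    within_x = mean(p_distance(a, b) for a, b in combinations(group_x, 2))
    within_y = mean(p_distance(a, b) for a, b in combinations(group_y, 2))
    return between - (within_x + within_y) / 2


# Hypothetical toy groups (NOT real SoxB1/SoxB2 sequences).
toy_b1 = ["RAAAA", "RAAAC"]
toy_b2 = ["HAAAA", "HAACA"]
toy_net = net_between_group_distance(toy_b1, toy_b2)

# Arithmetic from the text: 0.04 net p-distance over a 79-residue domain.
HMG_LENGTH = 79
REPORTED_NET_DISTANCE = 0.04
expected_differences = REPORTED_NET_DISTANCE * HMG_LENGTH  # about 3.16 residues
```

The last line reproduces the statement in the text: 0.04 × 79 ≈ 3.16, that is, about three residue differences attributable to the B1/B2 split once within-subgroup variation is subtracted.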
When the Sox sequences of the other four major Sox groups (C, D, E, and F) of Drosophila melanogaster, Capitella sp. I, and Homo sapiens were used as the outgroups, the clade of subgroup B1 was maintained, but subgroup B2 showed paraphyly in the reconstructed phylogenetic trees and network (Figs. 6A and S3), as in the trees of previously published papers [12,13,20]. The paraphyly of Sox subgroup B2 can be explained as a combination of several effects: the metazoan lineages diverged before any large divergence occurred between SoxB1 and SoxB2 in the HMG domain, which caused the internal branch separating Sox subgroup B1 and B2 to be short; SoxB2 experienced fewer functional constraints than did SoxB1 [25], causing more lineage-specific substitutions to accumulate in SoxB2; and SoxB2 displayed nonhomogeneous sequence evolution in different lineages and also between the duplicates generated by multiple lineage-specific duplications, which together resulted in a combination of long and short branches, causing LBA [31]. Even so, the monophyly of subgroup B2 is supported to some extent by the split network (Fig. 6B), from which four arthropod SoxB2 sequences that formed long branches in the previous network (Fig. S3) were excluded. Therefore, the nonmonophyly of Sox subgroup B2 in the reconstructed trees probably reflects the influence of statistical errors, such as LBA, rather than refuting the monophyly of Sox subgroup B2. A systematic nomenclature was developed to better reflect the orthologies between the arthropod subgroup B2 Sox genes (Figs. 2 and S2B, and Table 2). The BLAST best matches, phylogenetic groupings (Figs. 4A and 5A), signature residues at positions 2 and 78 of the HMG domain (Fig. 2), and the preservation of the conserved SoxB2-specific non-HMG motif (Fig. S1B) [16] is not strong evidence for orthology, because functional equivalence is not a part of the definition of orthology as a strict evolutionary concept [32].
The functional equivalence of Drosophila Dichaete (SoxB2b1) and mouse Sox2 may reflect either the retention of the function of the ancestral SoxB or convergent evolution.
Because there were no SoxB sequences from noninsect arthropods in their study, McKimmie et al. (2005) suggested that, although Sox21a (SoxB2a) retained the ancestral form of SoxB2, Dichaete (SoxB2b1) and Sox21b (SoxB2b2) might represent an insect-specific group. The phylogenetic trees and networks of our study (Figs. 4 and 5) include a wider range of arthropod taxa and suggest that counterparts of the insect SoxB2b genes occur in the genomes of the branchiopod, malacostracan, myriapods, and chelicerates. The SoxB2b proteins of the arthropods form a monophyletic group on the trees and network (when the tardigrade SoxB2 is excluded, discussed below) (Figs. 4 and 5) and the signature residue isoleucine (I) at position 21 of the HMG domain in the insect SoxB2b proteins [26] also occurs in the SoxB2b proteins of the noninsect arthropods (Figs. 2 and S2B). None of the SoxB2 proteins in the alignments, other than the arthropod SoxB2b proteins, has isoleucine at position 21, which implies that isoleucine at position 21 is a synapomorphy of the arthropod SoxB2b proteins.
Our preliminary analyses indicated that BLAST methods and tree/net-based methods were incapable of fully deciding the clear orthologies of the SoxB2bs of different insects or those of the insects and noninsect arthropods. However, the conserved gene neighborhoods (CGNs) and conserved intron positions of the SoxB2 genes of the insects and Daphnia pulex allowed the assignment of gene orthologies (Table 1 and Fig. 2). The genome of the shrimp Macrobrachium nipponense also contains one intronless SoxB2b gene and one SoxB2b gene containing an intron at the position conserved among the SoxB2b2 genes of Daphnia and the insects, which indicates that SoxB2b1 and SoxB2b2 were present in the genome of the insect-malacostracan LCA. The direct orthologies between the SoxB2bs of the chelicerate Ixodes scapularis and the insect SoxB2bs are less clear because there is no intron in the SoxB HMG boxes of Ixodes scapularis and data on the CGNs of this species are not yet available. However, there is a conserved signature that might distinguish the two SoxB2b paralogues in the protein sequences of the regions outside the HMG domain (Fig. S1B). Because the available sequences are incomplete and there is also no intron in the SoxB2b genes of the other chelicerate or the myriapods, no clear orthology assignment among these SoxB2b genes is possible at present. However, the myriapods and chelicerates have two or more SoxB2b genes, like the insects and crustaceans (Figs. 2 and S2B), so it is highly probable that the arthropod LCA had two SoxB2b genes.
The tardigrade Macrobiotus areolatus, which is not an arthropod but belongs to the superphylum Panarthropoda, seems to have an enormously large repertoire of SoxB (Table 1). Although the tardigrade samples are microscopic, this large repertoire is unlikely to be attributable to contamination of the genomic DNA by other organisms in the environment, such as mosses, fungi, or bacteria, because mosses, fungi, and bacteria have no Sox genes in their genomes, and the tardigrade SoxB sequences obtained are distinct from the SoxB sequences of species from the other taxonomic groups in BLAST searches and phylogenetic analyses. The full-length HMG domain sequence of a tardigrade SoxB2 was newly determined by us, but this sequence is highly divergent, and nested in the arthropod SoxB2b clade in the trees and network (Figs. 4B and 5B). Currently, it cannot be determined whether the SoxB duplicates of the tardigrade arose from the same duplications as the SoxB duplicates of the arthropods, because the incompleteness and/or high level of divergence of the tardigrade sequences make clear orthology assignments impossible. However, the ongoing genome project of the tardigrade Hypsibius dujardini [33] will shed light on this issue.
From the gene inventories and orthology assignments based on the best matches of BLAST, phylogenetic trees and networks, signature residues, conserved intron positions, conserved non-HMG motifs, and the CGNs of Sox group B of the metazoan species examined, we have constructed the evolutionary history of the SoxB1/B2 genes in the bilaterians. When the gene repertoires and CGNs of SoxB1/B2 were mapped onto the well-established metazoan phylogeny in the form of a timetree [34,35,36,37,38,39,40] and the ancestral states of SoxB1/B2 at important phylogenetic nodes were reconstructed based on the principle of parsimony, a picture of the SoxB1/B2 evolution in the bilaterians emerged (Fig. 7). The bilaterian LCA had one linked pair of SoxB1 and SoxB2 genes, which was probably generated by a tandem duplication of the ancestral SoxB in the metazoan stem before the demosponge-eumetazoan split. The expansions of Sox group B observed in the vertebrates and arthropods occurred later, in mutually independent duplications. During the early evolution of the vertebrates, before the diversification of the jawed vertebrates, two rounds of whole-genome duplication occurred [41,42,43], and subsequent gene losses reduced the Sox group B repertoire to the complement of three SoxB1 and two SoxB2 duplicates we find in the land vertebrates. A third round of genome duplication took place in the stem of the teleost fishes [44,45,46], which together with subsequent gene losses, led to the Sox group B repertoires observed in the teleosts [21]. In almost the same period that the vertebrate ancestors underwent their whole-genome duplications, a tandem duplication of SoxB2 in the genome of the common arthropod ancestor gave rise to SoxB2a and the ancestral SoxB2b, and a subsequent tandem duplication of SoxB2b generated SoxB2b1 and SoxB2b2 before the arthropod diversification leading to the extant lineages.
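The parsimony-based reconstruction of ancestral states mentioned above can be illustrated with a minimal sketch. The text does not name the parsimony algorithm that was used; the code below applies the bottom-up pass of Fitch parsimony as one common choice. The simplified tree topology, the presence/absence coding of SoxB2b, and all names are assumptions for illustration and do not reproduce the data in Table 1 or Fig. 7.

```python
# Minimal sketch: bottom-up Fitch parsimony on a binary tree, applied to a
# toy presence/absence coding of SoxB2b. Tree and coding are illustrative.


def fitch(tree, leaf_states):
    """Return (candidate state set at this node, minimum changes in subtree).

    `tree` is either a leaf name (str) or a 2-tuple of subtrees.
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, leaf_states)
    right_set, right_cost = fitch(right, leaf_states)
    shared = left_set & right_set
    if shared:
        # Children agree on at least one state: no extra change needed.
        return shared, left_cost + right_cost
    # Children disagree: keep both options and count one change.
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical, simplified topology: a nematode outgroup plus four arthropods.
arthropods = (("chelicerate", "myriapod"), ("crustacean", "insect"))
tree = ("nematode", arthropods)

# Toy coding: True if SoxB2b is present in the lineage.
soxb2b_present = {
    "nematode": False,
    "chelicerate": True,
    "myriapod": True,
    "crustacean": True,
    "insect": True,
}

arthropod_lca_states, _ = fitch(arthropods, soxb2b_present)
root_states, changes = fitch(tree, soxb2b_present)
```

Under this toy coding, the arthropod LCA is reconstructed unambiguously as carrying SoxB2b and a single change on the whole tree suffices, which mirrors the reasoning that the duplication predates the arthropod diversification but postdates the arthropod-nematode divergence.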
This scenario of SoxB evolution counters the model proposed by McKimmie et al. (2005) and adopted by others [25] (Fig. 8A), which suggests that the bilaterian LCA had a total of three Sox group B members. The McKimmie model was perhaps prompted by the false assumptions that Dichaete (SoxB2b1) is orthologous to vertebrate Sox2 and that the dislinkage of SoxB1 and Dichaete in Drosophila reflects the ancestral state. Because the SoxB1 and SoxB2 genes are on one chromosome in the genomes of Anopheles gambiae and Tribolium castaneum (Table 1), they were probably clustered on one chromosome in the insect LCA. Therefore, the break in the linkage between the SoxB1 and SoxB2 genes observed in Drosophila must have occurred after the divergence of Drosophila from Anopheles. A similar linkage break in Apis mellifera must have been an independent event. Our model of SoxB evolution (Figs. 7 and 8B) also better fits the prevailing hypothesis that two rounds of genome duplication (and a further round in the teleosts) occurred during the evolution of the vertebrates [42].

Evolutionary significance of the expansion of Sox group B in the arthropods
Gene duplicates can be preserved permanently by neofunctionalization and/or subfunctionalization, thus generating biological novelty and diversity [47]. The lineage-specific expansion of transcription factor families is believed to have played an important role in the increased complexity of animals and in their diversification [48]. Arthropods first appeared near the base of the Cambrian [49] and constitute the most species-rich and ecologically diverse phylum of the animal kingdom. Considering the Sox family's core role in diverse developmental processes [1,2,12] and the long-term preservation of the duplicate genes throughout the Arthropoda, the arthropod-specific expansion of the SoxB2 inventory might have provided a versatile genetic tool kit that contributed to the arthropod radiation. Gene duplication provides new genetic material that allows new functions to evolve under relaxed functional constraints and the subfunctionalization of gene duplicates by the divergence of protein expression patterns and/or functions contributes to the establishment of more sophisticated gene networks [47,50,51]. Functional studies of SoxB2b1 (Dichaete) in Drosophila melanogaster have indicated that this SoxB2 duplicate is crucial to segmentation, neurogenesis, hindgut morphogenesis, cuticle differentiation, and oogenesis [3,15,16,52,53,54]. The functional role of Drosophila SoxB2b1 in segmentation may be an example of neofunctionalization, because segmentation probably evolved in parallel in the arthropods, chordates, and annelids [55] and the vertebrate SoxB proteins seem to have no such function in embryo development [1]. Interestingly, a similar pattern can be found in the gene Pax3/7, the products of which functioned in neurogenesis in the Protostome-Deuterostome LCA but gained a pair-rule function in the common arthropod ancestors [56]. Moreover, species-specific neofunctionalization or subfunctionalization of the SoxB2  [20,26,57].
The genomic integrity of the SoxB2 cluster is also retained, at least in insects and Daphnia, which diverged over 400 million years ago [40], implying that there are evolutionary constraints on this organization. It will be intriguing to test and compare the functions and organization of the SoxB2 duplicate genes in species of other arthropod groups.

Independent duplications of SoxB in several other metazoan lineages
In previous studies [25,58], it was found that the amphioxus Branchiostoma floridae has three SoxB1 genes, which are not directly orthologous to the vertebrate SoxB1 genes. Our analysis of the amphioxus SoxB1 genes indicated that amphioxus SoxB1a probably evolved from a SoxB2 duplicate generated by a segmental duplication that produced an additional SoxB1/B2 cluster, and gained the SoxB1 characteristics through convergent evolution with SoxB1b. This suggestion is based on the facts that SoxB2 proteins were the best hits when BLASTP searches were performed in the RefSeq protein database of Homo sapiens using the HMG domain of amphioxus SoxB1a as the query; SoxB1a contains the SoxB2 signature residue H at position 2 of the HMG domain, and lacks the conserved SoxB1-specific motifs outside the HMG domain; and the signals for the convergent evolution of amphioxus SoxB1a can be visualized with a split network reconstructed with SoxB sequences of representative bilaterian species (Fig. S4, based on the alignment in Fig. S2C).
In our study, two other bilaterian species provided evidence of the occurrence of a lineage-specific duplication of Sox group B. One species is the platyhelminth Schmidtea mediterranea, which contains two paralogous SoxB2 genes, which may have resulted from a recent duplication. The other species is the hemichordate Saccoglossus kowalevskii, which contains an additional divergent SoxB1 sequence, which has no direct orthologue in other species.
Figure 8. (A) [...] [26]. In this model, an ancestral SoxB generated the original Dichaete and SoxNeuro by an ancient genome duplication, and a subsequent tandem duplication generated the original Sox21a before the Deuterostome/Protostome split. After the Deuterostome/Protostome split, a further tandem duplication generated Sox21b in insects and an independent genome duplication event increased the copy number of SoxB in vertebrates. (B) The model for Sox group B evolution proposed in this study. In this model, the Protostome-Deuterostome LCA had one SoxB1 and one SoxB2 generated by an ancient tandem duplication of an ancestral SoxB. After the Deuterostome/Protostome split, two further tandem duplications gave rise to the additional two copies of SoxB2 in arthropods, and a linkage break between SoxB1 and the SoxB2s occurred in the ancestor of Drosophila, resulting in the different chromosome locations of SoxB1 and the SoxB2s in Drosophila; independently, the vertebrates increased their copy number of SoxB through the two rounds of genome duplication. Forks on the rectangles indicate pseudogenization leading to gene loss. SoxB2b1, SoxB2b2, and SoxB2a are the preferred synonyms for Dichaete, Sox21a, and Sox21b, respectively. Sry is currently considered to have evolved from allele Sox3 on the Y chromosome [69], and is therefore not shown in the models. doi:10.1371/journal.pone.0016570.g008
Outside the Bilateria, the anthozoan cnidarian Nematostella vectensis has a large repertoire of the Sox family, containing 14 members [18]. Six of the 14 members can be classified into Sox group B but have diverged markedly, and three of them are characterized by an additional residue in the HMG domain [13,23]. Direct orthologues of some of these divergent SoxBs were found in the genome of the hydrozoan cnidarian Hydra magnipapillata in our analysis (Fig. S5, based on the alignment in Fig. S2D).
Because Anthozoa and Hydrozoa are basal clades of the Cnidaria, these duplications of SoxB must have occurred before the cnidarian diversification. Interestingly, the duplications of SoxB in the common cnidarian ancestor might have occurred during almost the same period in which the SoxB repertoires of the common jawed vertebrate ancestor and the common arthropod ancestor increased in parallel, roughly around 600 million years ago (Fig. 8).
The placozoan Trichoplax adhaerens also has an additional SoxB, with a residue insertion in the same position of the HMG domain as that in the cnidarians (Fig. S2A). It is currently unclear whether this SoxB member emerged before the Placozoan-Cnidarian divergence or independently in the placozoan lineage, because the nonbilaterian SoxB sequences are generally divergent and a more extensive sample of taxa is required to resolve the question.

Conclusion
We have reconstructed the evolutionary history of the Sox subgroups B1 and B2 in the metazoans, reconfirmed that the subdivision of Sox group B into subgroups B1 and B2 took place in the metazoan stem after the metazoan-choanoflagellate divergence but before the demosponge-eumetazoan divergence, and found that after the arthropod-nematode divergence but before the arthropod diversification, the Sox subgroup B2 expanded in the common arthropod ancestor to include three members after two successive tandem gene duplications. The bilaterian LCA had only one member from each of the Sox subgroups B1 and B2. Sox group B expanded independently in the genomes of the vertebrates and arthropods via different trajectories. This parallel increase in complexity at the molecular level was coincident with parallel increases in complexity and diversification at the organismal level in the vertebrates and arthropods. Functional studies of the Sox subgroup B2 proteins of the arthropods are warranted, and a comparison of the different neofunctionalizations and subfunctionalizations of SoxB2 duplicates in different arthropod groups and between the arthropods and vertebrates should be very interesting and insightful in terms of evolutionary developmental research.
... agarose gels, and the expected bands were excised and gel purified. The purified products were then subcloned into the pMD18-T vector (TaKaRa Biotechnology [Dalian] Co., Ltd, Dalian, China). The positive clones were sequenced on an ABI 3730 capillary sequencer (Applied Biosystems, Foster City, CA). The Genome Walking Kit (TaKaRa Biotechnology [Dalian] Co., Ltd, Dalian, China) was used to determine the flanking regions of two intron-containing Sox products (later shown to be MnSoxB2b2 and MaSoxB2).
The identities of these newly sequenced Sox fragments (GenBank accession numbers: FJ805198-FJ805217 and FJ976523) were determined tentatively by BLASTX searches against the RefSeq protein databases of Homo sapiens and Drosophila melanogaster at NCBI and then by iterative phylogenetic analysis together with the already-defined Sox genes. The orthologies of these new Sox sequences to those of the model species were assigned after extensive phylogenetic analyses. Finally, a revised nomenclature for arthropod SoxB genes was developed to better reflect the gene phylogeny.

Phylogenetic analysis
2.1. Sequence alignment and distance calculations. The full lengths or HMG domains of the collected protein sequences were aligned using ClustalW [63] as implemented in the software MEGA4 [64], and the alignment was adjusted by manual inspection. Pairwise p-distances, within-group mean p-distances, and between-group mean p-distances were computed using MEGA4.
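The distance measures named in section 2.1 can be illustrated with a minimal sketch. The p-distance between two aligned sequences is the proportion of compared sites at which they differ, and the group means average it over sequence pairs. The code below is not MEGA4's implementation: the function names are hypothetical, the sequences in the usage note are toy strings rather than data from this study, and it assumes pairwise deletion of gapped sites (the text does not state which gap treatment was used).

```python
from itertools import combinations, product

# Characters treated as alignment gaps or missing data (an assumption of this sketch).
GAP_CHARS = {"-", "?"}


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap are skipped (pairwise deletion).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a in GAP_CHARS or b in GAP_CHARS:
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def within_group_mean(group):
    """Mean p-distance over all unordered pairs of sequences in one group."""
    pairs = list(combinations(group, 2))
    if not pairs:
        raise ValueError("a group needs at least two sequences")
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


def between_group_mean(group_x, group_y):
    """Mean p-distance over all pairs with one sequence from each group."""
    pairs = list(product(group_x, group_y))
    if not pairs:
        raise ValueError("both groups must be non-empty")
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)
```

For example, with the toy groups `["DRVKRPMNAF", "DRVKRPMNAY"]` and `["DHIKRPMNAF", "DHIKRP-NAF"]`, `within_group_mean` on the first group compares one pair that differs at one of ten sites, and `between_group_mean` averages the four cross-group comparisons, skipping the gapped site where it occurs.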
2.2. Phylogenetic reconstruction. ProtTest 2.4 [65], a program for selecting models of protein evolution, was used to determine the models that best fitted our different data sets. ML trees were constructed with the ML method implemented in PhyML 3.0 [66]. Both nearest-neighbor interchange (NNI) and subtree pruning and regrafting (SPR) tree topology searches were used to avoid local optima. The statistical confidence in the nodes was assessed with an approximate likelihood ratio test, which returns χ²-based parametric branch support, an SH-like test, and 100 bootstrap replicates. The MrBayes 3.1.2 program [67] was also used to construct the Bayesian trees with the best available model selected by ProtTest 2.4. Two independent Bayesian analyses were run simultaneously for 10 million generations each. Metropolis-coupled Markov chain Monte Carlo with one cold chain and three heated chains was run for each analysis and sampled every 100th generation. A burn-in of 25,000 trees was removed. The convergence of each run was evaluated by plotting the log likelihood value against the number of generations. The statistical confidence in the nodes of the Bayesian trees was evaluated with posterior probabilities. The software SplitsTree4 [29] was used to generate the split networks using the neighbor-net method [68]. Both p-distances and ML distances under the Jones-Taylor-Thornton (JTT) model were used in this analysis to compare and visualize the conflicting signals.
The red line indicates the conserved SoxB2-specific motif. Abbreviations of species names are as in Table 1. (PDF)
Figure S2 Additional alignments of the SoxB HMG domains. Abbreviations of species names are as in Table 1. (PDF)
Figure S3 Split network of the bilaterian SoxB1/B2 proteins based on the HMG domain with sequences from Sox groups C, D, E, and F of human, Drosophila, and the annelid Capitella sp. I. The split network was reconstructed under the JTT model. Abbreviations of species names are as in Table 1. (PDF)
Figure S4 Split network of the HMG domain sequences of the SoxB1/B2 proteins in the full complements of Branchiostoma floridae and representative bilaterians, showing the signals of convergent evolution in BfSoxB1a and BfSoxB1b. The split network is based on the alignment shown in Fig. S2C. Abbreviations of species names are as in Table 1. (PDF)
Figure S5 Phylogenetic tree and split network of the HMG domain sequences of the SoxB1/B2 proteins of three cnidarians and representative bilaterians. (A) Bayesian tree based on the alignment shown in Fig. S2D. Statistical support values for the SoxB1/SoxB2 split and the arthropod SoxB2b clade were derived with different methods, as described in Fig. 3. The model for the Bayesian reconstruction was RtREV + I + G; the model for the ML reconstruction was LG + I + G. (B) Split network under the JTT model. Abbreviations of species names are as in Table 1.