Partial Gene Duplication and the Formation of Novel Genes

Lineage-specific genes
The availability of several genomes from related organisms permits the identification of newly evolved genes in different lineages or species, the study of their mechanisms of formation and the investigation of their role in adapting to new environments or physiological conditions (Domazet-Loso & Tautz, 2003; Guo et al., 2007; Khalturin et al., 2009; Kuo & Kissinger, 2008; Siepel, 2009; Toll-Riera et al., 2009a; Zhou et al., 2008). Recently formed genes give us the opportunity to study the action of natural selection in recent times and to investigate the processes associated with gene creation (Zhou & Wang, 2008). The number of species-specific genes, or orphan genes, is not insignificant. They represent around 14% of the genes in 60 fully sequenced microbial genomes (Siew & Fischer, 2003) and between 20-30% in Drosophila species (Domazet-Loso & Tautz, 2003; Drosophila 12 Genomes Consortium, 2007). Genes restricted to particular lineages include vomeronasal receptors and casein milk proteins in mammals, which are known to be involved in specific physiological adaptations in this lineage (International Chicken Genome Sequencing Consortium, 2004). Additionally, several lineage-specific genes have been found to be involved in defence against pathogens, such as dermcidin in primates (Toll-Riera et al., 2009a) and surface antigens in apicomplexan parasites (Kuo & Kissinger, 2008). Interestingly, it has been noticed that rice orphan genes are more often expressed under environmental pressure (injury and hormone treatment) than non-orphan genes, indicating that novel genes help in adaptation to changing conditions (Guo et al., 2007). Many newly evolved genes are derived from partial or complete gene duplication of preexisting genes (Long et al., 2003; Marques et al., 2005; Toll-Riera et al., 2009a; Zhou et al., 2008). Alternative processes of gene formation include exaptation from mobile elements


Introduction
The publication of the first fully sequenced genomes represented a landmark in the biological sciences. The comparison of genomes from different organisms provides us with unprecedented opportunities to address many long-standing evolutionary questions in a more comprehensive way.

separation of New World Monkeys, and which gave rise to differentiated red and green pigments (Nathans et al., 1986). Zhang and colleagues (Zhang et al., 1998) reported another example of the action of positive selection after gene duplication. The eosinophil cationic protein (ECP) and eosinophil-derived neurotoxin (EDN) genes are present in Old World Monkeys and hominoids, and probably originated by tandem gene duplication after the divergence of New World Monkeys. EDN is an antiviral agent (Domachowske & Rosenberg, 1997) and ECP is a potent toxin for bacteria and parasites (Rosenberg & Dyer, 1995). The authors detected a non-random accumulation of arginine substitutions in ECP, which may contribute to the generation of pores in pathogens' membranes. Another example is pancreatic ribonuclease 1B (RNASE1B), which originated through gene duplication of RNASE1, an enzyme used to digest bacteria in the small intestine, in the douc langur (Pygathrix nemaeus) around 2-4 million years ago (Zhang et al., 2002). Douc langurs are folivorous monkeys, in which leaves are digested through fermentation by symbiotic bacteria residing in the foregut. The newly duplicated copy, RNASE1B, has evolved very rapidly (ratio of non-synonymous to synonymous nucleotide substitution rates of 4.03), in contrast to the paralogous copy, RNASE1, which has not undergone change. These results indicate a burst of positive selection acting on the duplicated copy. Moreover, most of the substitutions involve the gain of negatively charged residues, lowering the optimal pH for RNASE1B, which could be related to an increase in digestive efficiency, given the lower pH found in the small intestine of douc langurs.

Partial gene duplication
Not all duplicated proteins are identical to their parental copies at birth. In fact, it has been reported that in C. elegans only about 40% of the new duplicates arise from complete gene duplications, the remainder representing cases of partial gene duplication (Katju & Lynch, 2003). These partially duplicated genes may recruit sequences from their genomic neighbourhood or from other genes (Katju & Lynch, 2006). In the first case, adjacent non-coding sequences are co-opted for a coding function. Katju and Lynch (2006) found that about half of the partially duplicated genes did not recruit any surrounding sequences but accumulated mutations, for example in initiation or termination codons, that altered the coding sequence. In Drosophila melanogaster, around 30% of the newly formed genes recruited various genomic sequences or formed chimeric gene structures. Partially duplicated and chimeric genes are expected to adopt new functions immediately, which may increase their probability of being retained (Patthy, 1999). An example of a gene that has arisen by partial duplication is the Hun gene in Drosophila, located on the X chromosome. Hun arose from a partial duplication of the Bällchen gene, which is on chromosome 3R. Hun lacks 3' coding sequence with respect to Bällchen, but has gained 33 amino acids from a nearby intergenic sequence. Further, while Bällchen is expressed ubiquitously, Hun shows testis-specific expression (Arguello et al., 2006). The sequence similarity that exists between completely duplicated gene copies and parental gene copies is often sufficient to detect homologues in a whole range of organisms. However, this is often not the case for partially duplicated genes, especially if the sequence common to both duplicates is short and the rate of divergence of the novel gene duplicate is abnormally high.
As a result, many partially duplicated genes are identified as orphan or lineage-specific genes, that is, genes that do not yield any significant hits in database protein searches of more distant organisms (Chen et al., 2010; Domazet-Loso & Tautz, 2003; Toll-Riera et al., 2009a). In a recent study that showed that newly formed genes in Drosophila melanogaster are as likely to perform essential functions as older genes, it was found that 28 out of the 50 new genes that had arisen through gene duplication corresponded to partial duplications (Chen et al., 2010). These young genes were found to evolve very rapidly, showing a median of 47.3% divergence, at the amino-acid level, from their parents. In an analysis of the mechanisms of formation of primate-specific genes, we observed that about 24% of the newly formed genes had originated through gene duplication, frequently involving partial gene duplication and the recruitment of additional sequences (Toll-Riera et al., 2009a). One example is human XAGE-1, a cancer/testis-associated gene that has partial homology to human XAGE-2, a gene that is well conserved in other mammals. The similarity is limited to the C-terminal half of the orphan XAGE-1 protein. We showed that, in the conserved region, the rate of amino acid sequence evolution of XAGE-1 was double that of XAGE-2, suggesting that the recruitment of additional sequences in XAGE-1 resulted in a marked asymmetry in the evolutionary rates of the two copies. Partial gene duplication is likely to be very important for the formation of novel gene structures and the evolution of new protein functions, but studies focusing on this type of gene duplication are still scarce. To shed new light on this issue, we decided to analyse the evolutionary patterns of several primate-specific genes (orphan genes) formed, at least partially, by gene duplication.
The results show that increased evolutionary rates in the partially duplicated copy are the norm, reinforcing the role of partial gene duplication in the formation of novel genes with distinct functions.

Results
Here we use an approach similar to that employed in Toll-Riera et al. (2009a) to identify a set of primate-specific genes that show significant similarity to human genes (parental genes) that are well conserved in non-primate species. We investigate the differences in the rate of evolution of the novel and parental genes and discuss the role of partial duplication in increasing the protein functional repertoire.

Identification of primate lineage-specific genes formed by gene duplication
We identified a set of genes present in human and macaque but absent in 13 non-primate genomes (Mus musculus, Rattus norvegicus, Bos taurus, Canis familiaris, Gallus gallus, Xenopus tropicalis, Danio rerio, Takifugu rubripes, Caenorhabditis elegans, Drosophila melanogaster, Saccharomyces cerevisiae, Schizosaccharomyces pombe, and Arabidopsis thaliana). The existence of a homologue in a specific genome was determined by the presence of a BLASTP (Altschul et al., 1997) hit with an expectation value (E-value) smaller than 10⁻⁴, as previously described (Alba & Castresana, 2005). Orphan genes were defined as those for which we could not detect any homologues in any of the species mentioned above. As they were, by definition, present in human and macaque, our collection of orphan genes corresponded to primate-specific genes, presumably formed after the split of the rodent and primate branches and before the speciation of the human and macaque lineages. Once we had this set of orphan genes, we investigated which ones could have arisen through gene duplication by performing BLASTP searches against all human proteins, using a relaxed E-value threshold (E < 0.5). We kept those cases for which we could identify human paralogues that were not primate-specific. In such cases, the closest hit in human was considered the putative human parental gene, and the closest non-primate orthologue of the parental gene was taken as the outgroup gene (Figure 1). Protein sequences were aligned with T-Coffee (Notredame et al., 2000), and the alignments between primate-specific genes and parental genes were carefully examined to discard any spurious associations. We also removed any regions that were completely divergent (non-alignable) between the orphan and parental genes. The final set consisted of 14 orphan genes.
Table 1 shows the orphan, parental and outgroup gene names, protein identifiers, and the percent of parental protein that could be reliably aligned with the orphan protein, corresponding to the portion of the protein that had duplicated. Of the 14 orphan genes, 4 represented single copies and the rest belonged to orphan gene families. In only one case, dermcidin, sequence similarity supported a complete gene duplication event. We used the protein multiple alignments to estimate the number of amino acid substitutions per site (K) in the orphan, parental and outgroup branches. We used PROML, a maximum likelihood based method in the Phylip package for this purpose (Felsenstein, 2005). The results of these computations are discussed below.
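The two-step BLASTP filtering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline actually used: the function names, the tuple layout of the hit lists and the gene labels are hypothetical, and we read "closest hit" as the lowest E-value hit among human proteins that are not themselves primate-specific. Only the two E-value thresholds come from the text.

```python
HOMOLOGUE_EVALUE = 1e-4  # threshold given in the text for calling a homologue
PARALOGUE_EVALUE = 0.5   # relaxed threshold given in the text for the human search


def find_orphans(primate_genes, outgroup_hits):
    """Return genes with no hit below HOMOLOGUE_EVALUE in any non-primate genome.

    outgroup_hits: iterable of (query_gene, species, evalue) tuples (hypothetical layout).
    """
    with_homologue = {q for q, _species, e in outgroup_hits if e < HOMOLOGUE_EVALUE}
    return sorted(g for g in primate_genes if g not in with_homologue)


def assign_parents(orphans, human_hits):
    """Map each orphan to its best human hit that is not itself an orphan.

    human_hits: iterable of (query_gene, subject_gene, evalue) tuples (hypothetical layout).
    Orphans without any qualifying hit are left out of the result.
    """
    orphan_set = set(orphans)
    best = {}
    for query, subject, evalue in human_hits:
        if query not in orphan_set or subject == query:
            continue
        if subject in orphan_set or evalue >= PARALOGUE_EVALUE:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return {query: subject for query, (subject, _e) in best.items()}


# Toy data, invented for illustration only.
primate_genes = ["geneA", "geneB", "geneC"]
outgroup_hits = [("geneA", "Mus musculus", 1e-30), ("geneB", "Danio rerio", 0.02)]
human_hits = [
    ("geneB", "geneA", 0.01),
    ("geneB", "geneD", 0.3),
    ("geneB", "geneC", 1e-10),  # ignored: geneC is itself primate-specific
    ("geneC", "geneB", 1e-10),  # ignored for the same reason
]

orphans = find_orphans(primate_genes, outgroup_hits)
parents = assign_parents(orphans, human_hits)
```

In this toy case geneB and geneC are classed as orphans, and only geneB receives a putative parental gene. The manual inspection of alignments that followed in the study is not captured by such a filter.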

Dermcidin and lacritin
The first example of an orphan gene that arose through gene duplication is dermcidin. This gene encodes a short protein of 110 amino acids. The corresponding parental gene is lacritin, which has orthologues in other mammals, and is located on chromosome 12 adjacent to the dermcidin gene. The two genes have a similar exonic structure, and although they are highly divergent, sequence similarity between the two is still detectable. Dermcidin is secreted in sweat glands and has antimicrobial activity (Schittek et al., 2001), and may also be involved in neural survival and cancer (Porter et al., 2003), whereas lacritin is expressed in the lacrimal glands (Ma et al., 2008).

Partially duplicated orphan genes
The remaining primate-specific genes that have arisen through gene duplication corresponded to partial duplications of the parental gene (Table 1). They included 3 individual genes (AL023807.2, NPIP-like 1 and AL133216.1) and 3 gene families (FAM9, XAGE-1 and C2orf27). The percentage of protein sequence from the parental protein that could be identified as homologous in the orphan protein ranged from 9.4 to 87.3% (Table 1).
With the exception of NPIP-like 1, the orphan gene is located on a different chromosome from the parental gene, although the presence of introns in all orphan genes suggests that they were not retrotransposed copies. We aligned the conserved regions of orphan, parental and outgroup proteins (Figure 3). These alignments were used for the estimation of the number of amino acid substitutions per site in the orphan, parental and outgroup branches. We also investigated the presence of any known protein domains in the region conserved between parental and orphan proteins, using the Pfam web server (Finn et al., 2010). The largest orphan gene family is XAGE-1, which has 5 members with identical amino acid sequences that are contiguous on the X chromosome. The region conserved between the XAGE-1s and XAGE-2 includes the GAGE domain. The function of GAGE (G antigen) and XAGE (X antigen) domains is unknown, but XAGE and GAGE proteins have been implicated in several human cancers (Zendman et al., 2002). The two genes belonging to the C2orf27 family are contiguous in the genome, though C2orf27A is located on the forward strand of chromosome 2 whereas C2orf27B is located on the reverse strand. Their function is unknown, but they derive from a protein annotated as Ral guanine nucleotide dissociation stimulator-like 4. The parental protein contains the RasGEF domain, which is a guanine nucleotide exchange factor for Ras-like small GTPases. The duplicated region overlaps minimally with this domain (14 amino acids). NPIP-like 1 belongs to the nuclear pore complex-interacting protein (NPIP) family. The parental protein contains two AMP-binding domains that are at the N-terminal region of the protein, not the area conserved in the orphan protein, which is the C-terminal part. The NPIP family (Nuclear Pore Interacting Protein), also named morpheus, is located on a duplicated segment of chromosome 16.
It has been suggested to have experienced a burst of positive selection during the emergence of Homininae (Johnson et al., 2001). Finally, AL133216.1 and AL023807.2 are two primate-specific genes of unknown function containing putative coding sequences of length 151 and 121 amino acids respectively. The parental copy of AL133216.1 modulates arsenic sensitivity, and is involved in cell cycle progression and in RNA-mediated gene silencing by microRNA (Gruber et al., 2009). It also contains an arsenite-resistance protein 2 domain (Pfam hit E-value = 3.1e-18). The orphan copy does not contain this domain even though it is located in the conserved region, suggesting that this region has lost its ancestral function in the orphan protein. Table 2 shows the estimated amino acid substitution rates in the orphan, parental and outgroup branches. In the case of identical copies (for example C2orf27A and C2orf27B) only one is taken as representative. In the case of divergent copies (the FAM9 family) the amino acid substitution rates are summed over all branches from the ancestor to the derived node (see Figure 4). In all cases the duplicated protein is evolving much faster than the parental protein, and in some cases, such as the FAM9 and NPIP-like 1 proteins, more than six times faster. These results indicate that orphan proteins are evolving under much more relaxed constraints, and/or adapting to a new function, with respect to their parental copies. See Table 1 for more details.
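The rate comparison just described reduces to simple arithmetic on branch lengths: for families with divergent copies, the substitutions per site are summed along the path from the duplication node to the tip, and the orphan total is divided by the parental total. The sketch below illustrates this with invented branch lengths; the function names and numbers are hypothetical and are not taken from Table 2.

```python
def path_rate(branch_lengths):
    """Sum amino acid substitutions per site over all branches on a path."""
    return sum(branch_lengths)


def orphan_to_parental_ratio(orphan_branches, parental_branches):
    """Ratio of orphan to parental substitutions per site since the duplication."""
    orphan_k = path_rate(orphan_branches)
    parental_k = path_rate(parental_branches)
    if parental_k == 0:
        # A parental branch with no inferred change makes the ratio undefined;
        # returning infinity is an arbitrary convention chosen for this sketch.
        return float("inf")
    return orphan_k / parental_k


# Invented example: an orphan reached through two branches (ancestral node plus
# terminal branch) compared with a single parental branch.
ratio = orphan_to_parental_ratio([0.75, 0.25], [0.125])
```

With these made-up values the orphan path carries 1.0 substitutions per site against 0.125 for the parental branch, an eightfold difference.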

Role of low-complexity sequences
Low complexity regions (LCRs) are sequences in which one or a few residues are highly overrepresented. Several studies have shown that duplicated gene copies can gain new functions through the acquisition of LCRs (Fondon & Garner, 2004; Salichs et al., 2009). It has also been shown that young proteins contain more LCRs than old proteins (Alba & Castresana, 2005). Therefore, we searched for LCRs in our set of orphan proteins using the SEG algorithm with default parameters (Wootton & Federhen, 1996). We found that the FAM9A protein contained a very conspicuous low-complexity sequence. Figure 4 shows the detailed phylogenetic tree of the FAM9 gene family (including the parental and outgroup SYCP3 genes). The ancestral FAM9 evolved very rapidly and eventually underwent two duplication events, leading to FAM9A, FAM9B and FAM9C. The multiple alignment of the region surrounding the LCR in FAM9A shows how, from a small region containing several acidic residues in SYCP3, a larger acidic region was formed in the common FAM9 ancestor, which finally expanded to a 75 amino acid stretch in FAM9A containing a long glutamic acid repeat, as well as poly-alanine and poly-glycine repeats.
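To make the notion of low complexity concrete, the following sketch scores sliding windows of a protein sequence by Shannon entropy and reports the windows that fall at or below a cutoff. It is a simplified stand-in and not the SEG algorithm itself, which uses a more elaborate two-stage procedure. The window length of 12 and the cutoff of 2.2 bits are illustrative assumptions, and the test sequence is invented.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (bits per residue) of the residue composition of a segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(seq, window=12, threshold=2.2):
    """Return start positions of windows whose entropy is at or below the threshold."""
    return [
        i
        for i in range(len(seq) - window + 1)
        if shannon_entropy(seq[i:i + window]) <= threshold
    ]


# Invented sequence: a compositionally diverse stretch followed by a poly-E run.
toy_protein = "ACDEFGHIKLMN" + "E" * 12
flagged = low_complexity_windows(toy_protein)
```

A window made of twelve different residues has an entropy of log2(12), about 3.58 bits, and is not flagged, whereas a window consisting only of glutamic acid has zero entropy and is.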
As is the case for the SYCP3 proteins, all three human FAM9 proteins show testis-specific expression. However, the cellular localization is different depending on the protein studied: FAM9B and FAM9C are localized in the nucleus with low protein levels being detectable in the cytoplasm, whereas FAM9A is present at high levels in the nucleolus (Martinez-Garay et al., 2002). The distinct location of FAM9A may be due to the long glutamic acid repeat, as acidic clusters have been shown to mediate protein nucleolar retention (Ochs et al., 1996; Shu-Nu et al., 2000; Ueki et al., 1998). In FAM9A, the low complexity sequence is located within the Cor1/Xlr/Xmr conserved region, perhaps interfering with its function. In fact, FAM9A shows higher sequence divergence from the common ancestor than FAM9B.

Fig. 3. Multiple alignments of the conserved regions between orphan, parental and outgroup proteins. For the XAGE-1 family only XAGE-1A is shown, as the other orphan sequences were identical at the amino acid level. The same is true for the C2orf27 family. Identical residues are in green, similar residues in yellow. See Table 1 for more details.

Fig. 4. Phylogenetic tree of the FAM9 gene family. Branch lengths correspond to the estimated number of amino acid substitutions per site, using the alignment in Fig. 3. The protein alignment shown corresponds to exon 5 in FAM9B and FAM9C and to exon 6 in FAM9A, human SYCP3 and mouse SYCP3. The expanded low-complexity region in FAM9A is depicted above the alignment.

Discussion
The role of partial gene duplication in the formation of novel genes is still poorly understood, although recent reports in Drosophila (Chen et al., 2010) and C. elegans (Katju & Lynch, 2006) indicate that partially duplicated gene copies are very frequent. The present study analyses a set of primate-specific genes formed by partial gene duplication. We find that the rate of divergence of the partially duplicated copy is, in all cases, higher than the rate of divergence of the parental copy, generalizing previous observations for XAGE-1A (Toll-Riera et al., 2009a). This, together with the fact that most partially duplicated genes recruit additional sequences, strengthens the notion that partial duplication is a major process for the formation of genes with novel structures and functions. In these genes, any remaining similarity to the homologous proteins is being quickly erased by high sequence turnover. As a consequence, distant homologues are difficult to identify and these proteins end up being classified as orphans. This fits the model of Domazet-Loso and Tautz in explaining the high number of orphan genes in Drosophila: orphan genes are created by gene duplication followed by a period of rapid sequence divergence that erases the similarity with their homologues (Domazet-Loso & Tautz, 2003). Although we now have evidence that not all orphan genes are generated in this manner (Toll-Riera et al., 2009a; Toll-Riera et al., 2009b), a significant portion is. A large fraction of the duplicated gene copies that become fixed in a population are subsequently lost, presumably because the new copy is completely redundant and thus dispensable. However, the formation of chimeric gene structures, encoding part of an existing protein together with additional sequences, could in principle favour their retention, as these genes are not going to be functionally equivalent to the ancestral gene (Patthy, 1999).
In support of this, in Drosophila it was found that the proportion of novel genes corresponding to complete gene duplications decreased with gene age, suggesting that complete gene duplications had a shorter lifespan than partial gene duplications. Orphan genes are in general poorly annotated and their function is unknown in most cases (Kuo & Kissinger, 2008). The fact that organisms had lived perfectly well without them until recent times, when they made their appearance, has led scientists to think that orphan genes were, for the most part, dispensable. However, a recent study by Chen and colleagues (Chen et al., 2010) has challenged this viewpoint. In their study, the authors identified new young genes in Drosophila melanogaster (around 34 million years old) and designed RNA interference lines to knock each of them out (KO). Surprisingly, they found that 30% of these young gene KOs were lethal, as Drosophila could not survive without them. These young genes had mainly arisen through duplication and they showed higher evolutionary rates than the parental gene, indicating the action of positive selection, or relaxation of functional constraints. They hypothesized that new genes are quickly integrated into existing pathways, and hence many of them soon become essential for the viability of the organism. Capra and colleagues (Capra et al., 2010) compared the evolutionary patterns of genes that arose by duplication with those that did not (named novel genes). They argued that the evolutionary pressures should be different in each case as, contrary to novel genes, duplicated genes were functionally and structurally well formed from birth. They showed that although duplicated genes are initially more integrated into cellular networks, both types of new genes gain functions and interactions with time, though novel genes do it more rapidly than duplicated genes.
Additionally, novel genes also increase in length through the incorporation of transposable elements or surrounding sequences. This increase in length could be related to the rapid gain of function and interactions experienced by novel genes. They also found that genes tended to interact with genes similar in age and mode of origin. Thus, the mechanism by which a gene originates seems to have a significant impact on its subsequent evolution. Several studies have demonstrated that duplicated genes show increased protein evolutionary rates with respect to non-duplicated genes in the same lineage (Castillo-Davis et al., 2004; Cusack & Wolfe, 2007; Kondrashov et al., 2002; Lynch & Conery, 2000; Nembaware et al., 2002; Scannell & Wolfe, 2008; Van de Peer et al., 2001). Here we identified a very strong asymmetry in the rates of evolution of the newly evolved copy (orphan) and the well-conserved copy (parental), the former evolving much faster than the latter. Surprisingly, the parental protein copy did not evolve consistently faster than the outgroup protein (not duplicated), highlighting the fact that we are dealing with a special type of gene duplication in which the copy containing the partially duplicated segment rapidly departs from the ancestral family, which remains essentially unaffected. Increased evolutionary rates may reflect either relaxation of purifying selection, positive selection, or the combined effects of both these forces. The orphan genes under study predated the split of the human and macaque lineages, which occurred approximately 25 million years ago. If relaxed selection were the only factor behind their increased rates, the genes should by now have become pseudogenes and no longer be expressed. However, all genes were expressed at the RNA level in one or several tissues. Therefore we must hypothesize that, at least to some extent, positive selection has influenced the evolution of these genes.
We compared the rates of evolution of the protein regions that were conserved between orphan and parental proteins, but what about the unique sequences contained in the orphan proteins? These sequences lacked any similarity to other protein-coding genes, so they may be ancestral non-coding sequences that have been co-opted for a coding function (Long et al., 2003). Genes generated de novo from non-coding sequences are among the fastest evolving genes (Levine et al., 2006), and there is no reason to believe that unique sequences in orphan proteins will evolve more slowly than the conserved protein regions; if anything, the contrary would seem more logical. In a previous study we showed that the ratios of non-synonymous to synonymous nucleotide substitution rates of primate-specific genes, measured for human and macaque orthologues, were, on average, twice as high as those of mammalian-specific genes and five times higher than those of deeply conserved eukaryotic proteins (Toll-Riera et al., 2009a). The differences in amino acid substitution rates between orphan and parental genes described here reinforce the idea that the evolution of a new gene is strongly associated with very rapid sequence change.

Concluding remarks and future research
We have examined the evolutionary dynamics of a group of novel primate-specific genes (orphan genes) that have arisen by gene duplication. These genes typically form new structures in which only part of the protein sequence is shared with the parental copy, presumably because of partial gene duplication, and the rest of the protein sequence is unique. The orphan proteins accumulate a much larger number of amino acid substitutions per site than the parental proteins, denoting rapid functional diversification. The parental gene copies appear to act as "donors" of sequence but do not experience any obvious sequence evolution alterations, thus they probably preserve their ancestral functions. Future research in this area, using computational as well as experimental studies, should help clarify how frequent is partial gene duplication with respect to complete gene duplication, the differences in gene copy survival in both cases, and how partial and complete gene duplication contribute to the generation of evolutionary novelties.