Recent gene duplication in the MCP gene family
The MCP family of enzymes presents evidence of common ancestry, both in terms of sequence and structural homology and in the chromosomal arrangement of genes. We were interested to know if any of these MCP genes continued to be duplicated more recently, and therefore if we could learn about mechanisms of duplication and subsequent selection through this larger dataset. Therefore, we explored the Ensembl database (release 98), containing genome sequence information for 188 vertebrate species, to identify vertebrate organisms containing multiple copies of these genes. This data was subjected to extensive manual analysis to ensure, as much as possible, that our conclusions were not based on incomplete and fragmented genomic data or incorrect genomic annotations.
While most MCP genes were present in duplicate form in a small number of species (1–10), several genes were found more frequently duplicated, such as the CPO gene within the A/B subfamily (Fig. 2). Notably, while a large majority of species contained only one CPO gene, and 29 vertebrate species contained two CPO genes, a number of species, including many, although not all, rodents, contained no functional CPO gene, but rather a pseudogene 25. At the other extreme, some species, notably many fishes and Xenopus tropicalis, contained more than two copies of the CPO gene (Fig. 2A). A phylogenetic analysis of the proteins predicted to be translated by these CPO genes suggested a number of duplication events throughout the natural history of these fishes (see Supplemental Fig. S1 online). Evidence for gene conversion was found in a number of cases. For example, the zig-zag eel and swamp eel both contained 3 homologous CPO genes with very similar gene synteny, suggesting duplication events prior to the divergence of these two species; however, intraspecific homology between these protein sequences was greater than interspecific homology.
Within the N/E subfamily, AEBP1, CPXM1, and CPZ were found frequently in duplicate form (Fig. 2B). Nearly all ray-finned fishes (54 of 60 species) contained two AEBP1 paralogs, one annotated as aebp1 and the other as si:ch1073-459j12.1. The huchen appears to have an additional duplication of each of these genes. Only one other species in Ensembl release 98, the chimpanzee, exhibited a duplication of AEBP1 (upon manual analysis, others reported in release 98 as duplicated were in fact two halves of the same gene). CPXM1 and CPZ present a similar story, in which 49 and 53, respectively, of 60 ray-finned fishes contained two paralogs (cpxm1a and cpxm1b, CPZ and cpz), with no duplicates present in other vertebrate groups (Fig. 2B). No CCP genes were found frequently in duplicate form (Fig. 2C).
Gene synteny suggests duplication mechanisms
The presence of AEBP1, CPXM1 and CPZ gene duplicates throughout the ray-finned fish lineage suggested they were the result of a gene duplication event prior to the divergence of this group. The members of these gene duplicate pairs were on separate chromosomes and, in the cases of AEBP1 and CPZ, were surrounded by similar genes (grk5 and grk5l upstream of aebp1 duplicates, and gpr78- and hmx-like genes surrounding both cpz duplicates; Fig. 3A), suggesting that these duplication events impacted large chromosomal segments, possibly the result of a whole-genome duplication event 10. Phylogenetic analysis confirmed this proposed relationship (see Supplemental Fig. S2 online).
In contrast, many CPO gene copies were arranged tandemly (Fig. 3B). Examples could be found of fish species with two to four tandemly arranged CPO genes, all with some level of shared syntenic relationships with neighboring genes, depending on evolutionary relationships of species. For example, many of these gene clusters were flanked by fn1b on one side and/or nudt15 on the other side (Fig. 3B). The Ensembl release 100 genome assembly for the jewelled blenny contained five CPO genes in close chromosomal proximity. The reason for this chromosomal amplification of the CPO gene within fish lineages was not apparent from a survey of the DNA sequences. Repetitive sequences were found in abundance throughout introns and intergenic sequence, although it was not clear that this was abnormal for such regions. However, the tandem arrangement of CPO genes was suggestive of an origin in unequal crossing over 8.
An analysis of CPO gene synteny in the Xenopus lineage suggested a possible cause for duplication in this case. The parental CPO gene, that is, the CPO paralog found with identical synteny throughout non-fish vertebrates (eutherians, saurians, and amphibians), was found just upstream of the gamma-crystallin gene cluster in both eutherians and Xenopus (Fig. 3C). In Xenopus tropicalis, three additional copies of the CPO gene could be found between these gamma crystallin genes and the following grip1 gene. Within the last intron of this grip1 gene there were two additional annotated open reading frames, one predicted to encode a gamma-crystallin-1-like protein, and another with homology to DNA transposases (Fig. 3C). While DNA transposases are not commonly thought to be involved in this kind of tandem duplication, the observation of a CPO gene cluster together with a gamma-crystallin gene cluster and a putative transposase was suggestive of a link between all three. Examples of transposable elements involved in tandem duplication have been shown in maize 26 and humans 27.
Retention of duplicated genes could be aided by reduced gene size
We wished to further investigate the reason for the unique expansion of, and particularly the retention of, CPO genes within fish and amphibian lineages. Gene synteny suggested an origin in meiotic crossing over events, with a possible role of a transposase in Xenopus. However, further selective pressures must be present to ensure maintenance of these genes in the genome. We considered that both the duplication of a gene and its retention in the genome could be influenced by gene size. Complete gene duplication could be aided by a smaller gene size, while smaller genes might avoid detrimental mutation longer than larger genes, resulting in their maintenance in the genome longer than larger genes. Therefore, MCP gene size was analyzed using information provided by Ensembl release 100. While gene size varied dramatically across taxa, a statistically significant reduction in gene size (p = 1.593 × 10⁻³⁷) was observed for the four commonly duplicated MCP genes (CPO, AEBP1, CPXM1, and CPZ) when compared to the other MCP genes. The mean gene size for 1181 vertebrate CPO, AEBP1, CPXM1, and CPZ genes was 15,214 base-pairs, while that for the remaining 4120 vertebrate MCP genes found in the Ensembl database was 83,159 base-pairs (Fig. 4). While the number of genes analyzed here was minimal, this did suggest a role for gene size in this process. It is interesting to note that, while the CPXM1 and CPXM2 genes produce proteins with very similar structures and functions 28–30, the CPXM1 gene is much smaller than CPXM2 and is maintained in duplicate form in fish genomes (Fig. 4; see also Figs. 2B and 3A).
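The kind of two-group comparison described above can be sketched as follows. The text does not state which statistical test produced the reported p value, so the use of Welch's unequal-variance t statistic here is an assumption, and the gene sizes are hypothetical placeholders rather than the Ensembl data.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance comparison of two samples."""
    na, nb = len(a), len(b)
    # Squared standard error of each sample mean
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical gene sizes in base pairs (illustration only, not the real data)
duplicated = [12_000, 15_500, 9_800, 21_000, 17_300]
other = [60_000, 95_000, 120_000, 48_000, 88_000]

t, df = welch_t(duplicated, other)
print(f"t = {t:.2f}, df = {df:.1f}")
```

A negative t indicates that the first group has the smaller mean. With SciPy available, `scipy.stats.ttest_ind(duplicated, other, equal_var=False)` would return the same statistic together with a p value.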
Some studies have suggested that gene expression can affect the rate of gene evolution through dosage sensitivity 12. Some genes are particularly sensitive to an increased expression level and so a second copy is rapidly purged from the genome. A recent study examined questions surrounding the impact of gene duplication on expression using a large comparative RNA-seq dataset 31. However, there is limited data available on the impact of expression levels on gene duplication 32. An analysis of human MCP gene expression did not reveal any unique characteristics in terms of gene expression for the CPO, AEBP1, CPXM1 and CPZ genes (see Supplemental Fig. S3 online).
Duplicated MCP genes present evidence of neofunctionalization
Following gene duplication, selection has the opportunity to relax and explore new function through neofunctionalization or subfunctionalization. A wealth of information is available regarding MCP protein structure and enzymatic mechanism and this can allow us to make predictions regarding putative MCP gene function, whether it be the same function as the parental gene, neofunctionalized to produce an enzyme with a new substrate specificity, neofunctionalized to produce a protein with no enzyme activity (yet perhaps retaining another role in protein-protein interactions as a pseudoenzyme), or degraded to become a nonfunctional pseudogene. We focused on the CPA/B subfamily, as the tandem amplification and retention of CPO genes was of most interest.
We first set out to predict whether CPA/B genes found in duplicate form produced active enzymes, pseudoenzymes, or were unlikely to produce a functional protein at all (i.e., were pseudogenes). In order to classify predicted proteins in this manner, we needed to have an inventory of amino acid residues critical for carboxypeptidase structure or enzymatic function. To develop this inventory, we prepared a multiple alignment containing protein representatives of each member of the CPA/B subfamily from a broad range of taxa, typically including five mammal sequences (including one afrotherian, one marsupial, and one monotreme sequence), four sauropsid sequences (two bird and two reptile), one amphibian sequence, and two fish sequences. The resulting alignment indicated conservation of key catalytic residues that would be required for an active enzyme (H69, E72, and H196, involved in coordination of the required zinc ion; R127, N144, and R145, required for binding the carboxyl group at the C-termini of substrates; and E270, the key residue involved in acid-base catalysis; see Supplemental Fig. S4 online; 33), as well as many other entirely conserved residues that are likely necessary for the structural integrity of these proteins. With this as a guide, we classified each predicted protein arising from duplicated genes as an active protein (all residues conserved in our pan-alignment are present), a pseudoenzyme (one or more residues necessary for enzyme activity are substituted, but other conserved residues retained), or a pseudogene (multiple conserved residues are substituted; see Supplementary Table S1 online). In some cases, predicted proteins were partial or fragmented in some manner that excluded key segments. Genes encoding these proteins were classified as pseudogenes, although in some cases they might simply be a result of incomplete genome sequencing or annotation.
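The classification rule described above can be sketched in a few lines. This is a minimal illustration, not the procedure actually used: the function name, its arguments, and the threshold of two substitutions standing in for "multiple" are all assumptions, and cases the text does not explicitly cover are returned as "ambiguous".

```python
# Catalytic residues named in the text (position -> expected amino acid)
CATALYTIC = {69: "H", 72: "E", 196: "H", 127: "R", 144: "N", 145: "R", 270: "E"}


def classify(residues, other_substitutions=0, fragment=False):
    """Classify a predicted protein from its aligned residues.

    residues: dict mapping catalytic position -> observed amino acid.
    other_substitutions: number of substituted non-catalytic conserved
        residues (hypothetical input, counted from the alignment).
    fragment: True if the predicted protein is partial or fragmented.
    """
    # Assumption: "multiple conserved residues substituted" means two or more
    if fragment or other_substitutions >= 2:
        return "pseudogene"
    catalytic_changed = any(residues.get(pos) != aa for pos, aa in CATALYTIC.items())
    if other_substitutions == 0:
        return "pseudoenzyme" if catalytic_changed else "active"
    # A single non-catalytic substitution is not covered by the stated rule
    return "ambiguous"


print(classify(dict(CATALYTIC)))                 # active
print(classify({**CATALYTIC, 270: "Q"}))         # pseudoenzyme
print(classify(dict(CATALYTIC), fragment=True))  # pseudogene
```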
Following this approach, 83% (63/76) of all duplicated CPO genes, and 82% (42/51) of all other duplicated CPA/B genes (all CPA and CPB genes combined, as numbers of each were low), were predicted to encode active enzymes (Fig. 5A, B). Eight percent (6/76) of duplicated CPO genes were predicted to encode pseudoenzymes due to one or more substitutions in strictly conserved active site residues, in contrast to other CPA/B genes, in which only one pseudoenzyme was predicted (1/51). Likewise, most duplicated genes predicted to encode enzymes retained their expected substrate specificity: 80% (32/40) of CPA genes were predicted to encode enzymes with specificity to cleave aliphatic/aromatic C-terminal amino acids (having a hydrophobic residue at position 255) and eight of nine CPB genes were predicted to encode enzymes with specificity to cleave basic C-terminal amino acids (having an acidic amino acid at position 255; Fig. 5C, D). Only 70% (52/74) of CPO genes were predicted to encode enzymes with specificity to cleave acidic C-terminal amino acids (having a basic amino acid at position 255). However, many of the remaining CPO-like enzymes contained polar amino acids at residue 255 (26%; 19/74), with unknown impact on the substrate specificity of these enzymes.
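The specificity prediction from position 255 and the reported percentages can be sketched as below. The grouping of amino acids into hydrophobic, acidic, and basic sets is an assumption made for illustration (the text does not list the exact sets used), and any residue outside those sets is reported as "unknown".

```python
# Assumed residue groupings (one-letter codes); not taken from the source
HYDROPHOBIC = set("AVLIMFW")
ACIDIC = set("DE")
BASIC = set("RK")


def predicted_specificity(residue_255):
    """Predict the C-terminal residue class cleaved, from the residue at 255."""
    if residue_255 in HYDROPHOBIC:
        return "aliphatic/aromatic"  # CPA-like
    if residue_255 in ACIDIC:
        return "basic"               # CPB-like
    if residue_255 in BASIC:
        return "acidic"              # CPO-like
    return "unknown"                 # e.g. polar residues


def percent(part, total):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / total)


print(predicted_specificity("R"))  # acidic
print(percent(63, 76), percent(42, 51), percent(52, 74), percent(19, 74))
```

The counts passed to `percent` are those given in the text, and the rounded values reproduce the reported 83%, 82%, 70%, and 26%.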
In order to investigate the selective pressures placed on these genes, we compared the identified gene paralogs using the method of Nei and Gojobori to estimate synonymous (dS) and nonsynonymous (dN) nucleotide substitutions per site. The probability of rejecting the null hypothesis of strict-neutrality (dN = dS) in favor of the alternative hypothesis of purifying selection (dN < dS) was determined for each paralog pair. The resulting statistic (dS - dN) suggested widely varying levels of purifying selection for each gene within the A/B subfamily (Fig. 5E). Purifying selection was more likely for CPA1, CPA3, and CPO, although the sample size for non-CPO genes was small. When combined and compared with CPO genes (Fig. 5F), the data suggested statistically significant purifying selection acting upon CPO duplicates, with less stringent selection or neutral selection acting upon other CPA/B gene duplicates. This was also reflected in the p values reported for this test of purifying selection (Fig. 5G). CPO paralogs suggested a bimodal distribution, with one group of values for dS - dN in the range of 1 to 4.5, and another in the range of 7 to 14 (see Fig. 5F). Interestingly, most of the low values involved genes predicted to be either pseudogenes or pseudoenzymes, which might be expected if selective pressures are not acting upon such genes.
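For orientation, the distance step of the Nei and Gojobori method can be sketched as follows. This minimal example assumes that the numbers of synonymous and nonsynonymous sites and differences have already been counted for a paralog pair (the codon-by-codon counting is omitted), and the counts shown are hypothetical. It applies the Jukes-Cantor correction to the proportions of differences; it does not reproduce the significance test or the values plotted in Fig. 5.

```python
from math import log


def jukes_cantor(p):
    """Convert a proportion of differences into substitutions per site."""
    if not 0 <= p < 0.75:
        raise ValueError("proportion must be in [0, 0.75)")
    return -0.75 * log(1 - 4 * p / 3)


def ng_distances(syn_sites, nonsyn_sites, syn_diffs, nonsyn_diffs):
    """Return (dS, dN) from pre-computed site and difference counts."""
    d_s = jukes_cantor(syn_diffs / syn_sites)
    d_n = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    return d_s, d_n


# Hypothetical counts for one paralog pair (not from the paper)
d_s, d_n = ng_distances(syn_sites=200, nonsyn_sites=700, syn_diffs=60, nonsyn_diffs=35)
print(f"dS = {d_s:.3f}, dN = {d_n:.3f}, dS - dN = {d_s - d_n:.3f}")
```

A pair with dS well above dN, as in this toy input, is the pattern expected under purifying selection.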
Xenopus CPO paralogs show evidence of neofunctionalization
The data thus far suggests that CPO genes are frequently duplicated and maintained in fish and Xenopus genomes, and that these duplicates are maintained through purifying selection, suggesting functional importance. In many cases (~30% of duplicated CPO genes), the specificity-determining residue has changed from the stereotypical arginine found in most CPO proteins, suggesting a new function that could be selected for. In order to experimentally examine such predictions, we obtained cDNAs for the four Xenopus tropicalis CPO paralogs, engineering each of these cDNAs to include an HA tag for protein detection (see Supplementary Methods online for sequence details).
A number of characteristics could be determined simply from the amino acid sequences of these four predicted proteins, using available prediction programs. In contrast to most fish and mammalian CPO orthologs, which lack a complete prodomain, these four Xenopus CPO paralogs retained a prodomain (see Supplementary Fig. S5 online for multiple alignment), although only CPO1 was predicted to be cleaved by a proprotein convertase at the junction between the prodomain and the CP domain (ProP 1.0; https://services.healthtech.dtu.dk/service.php?ProP-1.0) 34. One of the four, CPO2, was not predicted to encode a GPI signal peptide (PredGPI; http://gpcr.biocomp.unibo.it/predgpi/) 35, although all were predicted to encode an ER signal peptide (SignalP-5.0; https://services.healthtech.dtu.dk/service.php?SignalP-5.0) 36. Only CPO1 encodes an arginine at the substrate specificity-determining position. Based on this site alone, the other CPO paralogs would be predicted to exhibit differing substrate specificity: a glycine is found at this position in CPO2, a glutamine in CPO3, and a cysteine in CPO4. To our knowledge, the consequence of these amino acids in determining substrate specificity has not been previously investigated experimentally. A phylogenetic analysis of these four proteins (see Supplementary Fig. S1 online) suggests that CPO3 was the progenitor gene, duplicating to produce CPO1 (which then freed CPO3 to adopt an alternate substrate specificity), which subsequently duplicated to produce CPO2 and CPO4. This is consistent with gene synteny shown in Fig. 3.
Expression plasmids encoding HA-tagged Xenopus CPO paralogs were transfected into HEK293T cells and protein expression analyzed by western blotting. All four CPO paralogs were detected in cell extracts, with CPO1 expressed most highly and expression of CPO2 being the weakest (Fig. 6A). No expression was detected in media, and no enzymatic activity could be detected from either media or cell extracts, even following limited digestion with trypsin.
Immunocytochemistry was employed to determine the subcellular distribution of these Xenopus CPO paralogs in comparison to that of human CPO in HEK293T cells. Expression of Xenopus paralogs partially overlapped that of human CPO in a punctate pattern (Fig. 6B) consistent with that observed previously 37. Xenopus CPO2 was not detected.
As enzymatic activity was not detected when Xenopus CPO paralogs were expressed in the HEK293T system, we used the Sf9 insect cell system for more robust expression. Infection of these cells with baculovirus encoding Xenopus CPO paralogs resulted in expression of all four paralogs, as detected by western blot, both in the cell lysates and in the conditioned medium (Fig. 6C). Although band intensity appeared similar in both cell lysates and media, these bands resulted from the analysis of 1% of the total cell lysate and 0.06% of the total media collected, indicating the majority of the protein to be secreted into the media. Highest expression was detected for CPO4, followed by CPO3 and CPO1. Expression of CPO2 was at the detection limit of our western blotting system.
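The reasoning behind the secretion estimate can be made explicit with a short calculation. This sketch assumes that the lysate and media bands are of equal intensity and that the western blot signal scales linearly with protein amount; the function name and its `intensity_ratio` parameter are introduced here for illustration only.

```python
def fraction_secreted(lysate_loaded, media_loaded, intensity_ratio=1.0):
    """Estimate the fraction of total protein found in the media.

    lysate_loaded, media_loaded: fraction of each sample loaded on the gel.
    intensity_ratio: media band intensity divided by lysate band intensity
        (assumed to be 1.0 when the bands look similar).
    """
    total_lysate = 1.0 / lysate_loaded            # relative protein in all lysate
    total_media = intensity_ratio / media_loaded  # relative protein in all media
    return total_media / (total_media + total_lysate)


# 1% of the lysate and 0.06% of the media were analyzed
frac = fraction_secreted(lysate_loaded=0.01, media_loaded=0.0006)
print(f"{frac:.1%} of the protein is in the media")
```

Under these assumptions the media holds roughly 17 times more protein than the lysate, or about 94% of the total.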
In order to cleave the N-terminal prodomains from these proteins, conditioned media were incubated with small amounts of trypsin at room temperature for five minutes, followed by inactivation of trypsin with PMSF. Western blotting showed that this cleavage was effective in producing mature enzymes, with molecular weights similar to those predicted for the mature enzymes, with the exception of the weakly-expressed CPO2 (Fig. 6D). Enzyme assays confirmed this: while no enzyme activity could be detected in conditioned medium, trypsinized medium showed robust activity for CPO4 and weaker activity for CPO1 and CPO3, consistent with the lesser expression levels of CPO1 and CPO3 (Fig. 6E). Most interesting was a comparison of substrate specificity, utilizing the three synthetic substrates available. CPO1 was able to cleave FA-EE, but not FA-FA or FA-FF, consistent with the presence of an arginine at position 255. CPO3 appeared equally able to cleave FA-EE and FA-FA, although the p-value for activity against FA-FA was 0.09, and thus not statistically significant in this analysis, while CPO4 exhibited a clear preference for FA-FA over the other two substrates.