An improved Plasmodium cynomolgi genome assembly reveals an unexpected methyltransferase gene expansion

Background: Plasmodium cynomolgi, a non-human primate malaria parasite species, has been an important model parasite since its discovery in 1907. Similarities in the biology of P. cynomolgi to the closely related, but less tractable, human malaria parasite P. vivax make it the model parasite of choice for liver biology and vaccine studies pertinent to P. vivax malaria. Molecular and genome-scale studies of P. cynomolgi have relied on the current reference genome sequence, which remains highly fragmented with 1,649 unassigned scaffolds and little representation of the subtelomeres. Methods: Using long-read sequence data (Pacific Biosciences SMRT technology), we assembled and annotated a new reference genome sequence, PcyM, sourced from an Indian rhesus monkey. We compare the newly assembled genome sequence with those of several other Plasmodium species, including a re-annotated P. coatneyi assembly. Results: The new PcyM genome assembly is of significantly higher quality than the existing reference, comprising only 56 pieces, no gaps and an improved average gene length. Detailed manual curation has ensured a comprehensive annotation of the genome with 6,632 genes, nearly 1,000 more than previously attributed to P. cynomolgi. The new assembly also has an improved representation of the subtelomeric regions, which account for nearly 40% of the sequence. Within the subtelomeres, we identified more than 1300 Plasmodium interspersed repeat ( pir) genes, as well as a striking expansion of 36 methyltransferase pseudogenes that originated from a single copy on chromosome 9. Conclusions: The manually curated PcyM reference genome sequence is an important new resource for the malaria research community. The high quality and contiguity of the data have enabled the discovery of a novel expansion of methyltransferase in the subtelomeres, and illustrates the new comparative genomics capabilities that are being unlocked by complete reference genomes.


Introduction
Plasmodium cynomolgi, a non-human primate malaria parasite first mentioned by Mayer in 1907 1 and established as a separate species from P. inui by Mulligan in 1935 2 , has been used as a model parasite species since its discovery. First used to establish the level of susceptibility of Malaysian Anophelines to non-human primate malaria 3 , P. cynomolgi forms hypnozoites (a dormant liver stage), similar to those of human-infective P. vivax and P. ovale species. Other shared characteristics between P. cynomolgi and P. vivax include erythrocyte morphology (e.g. Schüffner's stippling), amoeboidity and the tertian periodicity of intraerythrocytic asexual development (48h life-cycle). P. cynomolgi is thus regarded as a powerful model for P. vivax and potentially P. ovale human malaria. The use of P. cynomolgi as a model organism is further reinforced by it being readily infective to and transmitted by a large number of mosquito species [4][5][6][7] , and by having a wide range of natural 8-10 and experimental hosts 3,11 .
A particular strength of the P. cynomolgi system is access to chronic infections and to the developing and dormant liver stages in a parasite similar to P. vivax. An in vivo-vitro shuttle system for the study of P. cynomolgi liver stages 12 is being exploited to better understand hypnozoite biology using molecular tools and genomescale approaches, which rely on the availability of a complete and well annotated P. cynomolgi reference genome sequence. However, the current P. cynomolgi B reference is very fragmented 13 , and lacks large parts of the subtelomeric regions, thought to harbour genes involved in host-parasite interactions. Other closely related malaria parasite species have been sequenced, including P. coatneyi 14 which is closely related to P. knowlesi, and P. simiovale that was sequenced but never systematically assembled 15 .
In this paper, we describe the improved genome sequence assembly of the P. cynomolgi M strain and compare it the genomes of five other Plasmodium species (P. vivax, P. falciparum, P. knowlesi, P. coatneyi, P. simiovale) that infect humans or monkeys, to uncover similarities and differences that may inform future studies aimed at harnessing P. cynomolgi as a model for P. vivax human malaria.

Methods
Samples DNA was obtained from a blood stage infection of an Indian rhesus macaque donor with P. cynomolgi M strain stocks originally provided by Dr. Bill Collins from the Center for Disease Control, Atlanta. After PlasmodiPur filtration, parasites were matured in vitro overnight. Parasites were purified over a 15.1% (w/v) Nycodenz gradient and DNA was isolated using the Gentra Puregene Blood kit (Qiagen) and processed according to the manufacturers' instructions. The material was handled carefully in order to ensure the integrity of the DNA was maintained. The donor monkey (Macaca mulatta, male, age 5 years, Indian origin) used in this study was captive-bred and socially housed. Animal housing was according to international guidelines for nonhuman primate care and use. Besides the standard feeding regime, and drinking water ad libitum via an automatic watering system, the animal followed an environmental enrichment program in which, next to permanent and rotating non-food enrichment, an item of food-enrichment was daily offered to the macaque. Monitoring of parasitemia was done by thigh pricks each time followed by a reward. The intravenous injection and large blood collection were performed under ketamine sedation, and all efforts were made to minimize any suffering of the animal. The monkey was daily monitored for health and discomfort. Immediately after taking blood from the monkey, the monkey was cured from malaria by intramuscular injection of chloroquine (7.5 mg/kg, on 3 consecutive days) and the absence of parasites was verified two weeks after treatment by microscopy of Giemsa stained slides of thigh prick blood of the monkey.

Ethical approval
Sequencing, assembly and annotation of P.

Methyltransferases
Genes with the product 'methyltransferase' were all selected as nucleotide sequences. A selection of these genes, based on sequence similarity, was aligned with mafft. The phylogenetic tree was generated as the MSP tree, using the PROTGAMMAIGTR model. Potential transpososons were analysed with http://www.girinst.org 42 (using the RepbaseSubmitter section).

PIR analysis
The amino acid sequences of the Plasmodium interspersed repeat (pir) genes were extracted from five genomes (PcyM, P. vivax P01, P. coatneyi, P. ovale curtisi and P. knowlesi). First, low complexity sequences were trimmed with seg 43 . Next, proteins smaller than 250aa were excluded. A BLASTp all-against-all comparison was run (E-value 1e-6, -F F, allowing for up to 4500 hits). The results were visualized in Gephi 36 , clustered with the force field and the Reingold-Watermann algorithm. We also clustered the pir genes from the same BLAST with TribeMCL 44 , using an inflation coefficient of 1.5.

Results and discussion
Improved genome assembly and annotation The existing P. cynomolgi reference (B-strain, referred henceforth as PcyB) is highly fragmented, with 1,649 unassigned scaffolds. We generated a new reference genome sequence (P. cynomolgi M strain -PcyM) using high-depth (>100x) Pacific Bioscience long-read sequence data and further improved it with Illumina sequencing reads. The new PcyM assembly is significantly larger than the PcyB assembly (31 versus 26 Mb) (see Table 1), more contiguous (N90 of 370kb versus 3.9kb), and has no sequencing gaps (0 versus 1943 gaps). The unassigned scaffolds have been reduced from 1,649 in PcyB to just 40 in the new PcyM assembly (see Figure 1).    The 242 clusters between P. vivax and P. cynomolgi emphasise that they are closer related. The 103 clusters shared between P. ovale, P. vivax and P. cynomolgi might give insight into genes associated to the hypnozoite stages.
These improvements in contiguity and reduction of gaps had a large impact on the quality of the gene models. Overall, genes in PcyM are similar in size to their orthologues in P. vivax P01, while those in PcyB are around 20% shorter. In terms of annotation, 966 new genes were found in the PcyM assembly compared to PcyB, with most of these genes being found in the subtelomeres (see Table 2).
The new genes, however, also include 119 genes that are 1-1 orthologous to genes in P. vivax. Due to the manual curation, 12% more genes have been assigned a gene function in the new assembly. These systematic improvements make the PcyM genome sequence a better reference for the community to use when studying the biology of P. cynomolgi and relapsing malaria parasites in general.
The genome sequences were obtained from samples that were originally described as being two different strains, Mulligan (M strain) and Bastianelli (B-strain). However, a genome-wide comparison of the gene repertoires reveals that 67% of the 1:1 orthologues are identical, which is much more than the number of identical genes observed (32%) between two P. vivax isolates (P01 versus C01). This is in line with the findings in the original publication describing the PcyB genome assembly 13 , suggesting that the two strains are likely derived from the same isolate. This was further confirmed by a recent study that analysed the diversity of several P. cynomolgi isolates 45 . Although the authors proposed to call the isolate M/B, we will use the M(ulligan) nomenclature for continuity.

OrthoMCL clustering
To look for conserved orthologues between species, an OrthoMCL 28 clustering of genes from eleven genome assemblies was performed (see Methods and Supplementary Table 1). We used the clustering to look further into genes potentially involved in the formation and development of the dormant hypnozoite stage. There are 103 gene clusters (see Figure 2) that are common to the relapsing parasites, but absent in P. knowlesi and P. coatneyi. Of these, 73 gene clusters are uniquely shared between P. vivax P01, PcyM and P. ovale curtisi GH01. The remaining 30 clusters are either shared with various combinations of the other nine parasite species (see Supplementary  Table 1) or only with P. malariae (20 out of the 30 clusters).
The 73 clusters unique to the relapsing parasites include three tryptophan rich protein clusters where the orthology is 1:1:1 with the exception of one cluster in which P. vivax presents an expansion to four genes; two PHIST proteins (before named RAD and Pv-fam-e) clusters containing 1:1:1 orthologs; 11 clusters featuring 1:1:1 orthologs annotated as 'Plasmodium exported proteins'; three clusters of 1:1:1 hypothetical protein orthologs; one cluster annotated as MSP-7 or MSP-7-like and 56 pir gene clusters showing different degrees of expansion in the three relapsing species. While their specificity is interesting, clusters corresponding to multigene families are probably less likely to have a direct function in dormancy. The hypothetical protein clusters (PcyM_0326800, PcyM_ 0423700 and PcyM_0904700), however, being specific to the three relapsing Plasmodium species, are intriguing, as is the MSP-like protein cluster.  Figure 3A). In contrast, MSP-3 and MSP-7/7-like are highly expanded. MSP-3 is expanded in P. vivax, P. malariae, P. ovale and P. cynomolgi (see Figure 3B). Interestingly, while in P. malariae and to P. ovale, MSP-3 paralogs seem to be species-specific, in P. cynomolgi, P. vivax, P. coatneyi and P. knowlesi many of the paralogs seem to predate speciation, indicating that MSP-3 duplicated in the common ancestor of the latter four species. These findings of MSP-3 expansions are in line with the finding of multi-allelic diversification reported previously 46 , but also confirm the expansion in P. malariae and P. ovale. In addition to the pre-speciation expansion in P. cynomolgi, a species-specific expansion of MSP-3 (see area indicated with '*' in Figure 3B) genes suggests ongoing evolutionary pressure on these genes.

Paralogous
We also observed an expansion of MSP-7/7-like genes. In the OrthoMCL clustering, the genes were distributed in nine different clusters: 108, 4913, 5404, 5550, 5065, 6376 and 5765-5767 (Supplementary Table 1). A phylogenetic tree of the MSP-7/7-like proteins revealed a complex evolutionary relationship (see Figure 3C), splitting the tree into three major clades. Across the tree we find paralogous expansions of different ages, some of which predate speciation. A particularly striking branch comprises only genes from the three hypnozoite-forming species. As a result of the large amount of genome sequences now available for different Plasmodium species, a complex pattern now emerges in the MSP7/7-like tree, suggesting that the different MSP7 proteins likely have different functions.

Improved sub-telomeres reveals insights into subtelomeric gene families
The new high-quality PcyM assembly has an improved representation of the subtelomeric regions of the genome, which now encompass nearly 40% of the genome sequence. Manual curation of the gene annotation enabled the complete set of subtelomeric genes to be resolved (see Table 2). In P. vivax, genes encoding the exported protein family 'PHIST', and exported proteins in general (as predicted by ExportPred 47 ), have paralogously expanded compared to P. cynomolgi (84 vs 54). It is tempting to speculate about the reason for the higher number of exported proteins in P. vivax. One hypothesis is that it could be due to differences in the blood cells of humans compared to primates; while another could be that they are involved in the regulation of genes involved in host parasite interaction. In P. falciparum, it was suggested that PHISTb regulates var genes 48 . In P. cynomolgi, we observed an expansion of the STP1 family (51 genes). STP1 proteins are common in P. malariae and P. ovale curtisi (166 and 70 genes, respectively), but are contracted in number in P. vivax (10 genes). One may also speculate that the expansion of PHIST and exported proteins in P. vivax compensates for the lack of STP1 proteins.
The largest multigene family in P. cynomolgi comprises pir genes.
The pir superfamily occurs in all Plasmodium species 49 , but their function remains poorly understood. Recent studies suggest a possible role in the regulation of the establishment of chronic infections 50 and they have been found expressed in liver stage infections of rodent parasites 51 . An extensive repertoire of 1373 pir genes was identified in the PcyM assembly, compared to 263 in PcyB. This updated number puts the P. cynomolgi pir gene repertoire at a similar size to that of P. vivax (1,216), while P. ovale curtisi has an even larger repertoire (1,949). Conversely, P. knowlesi, has only 70 pir genes present. Interestingly, the re-annotated P. coatneyi genome that clusters closely to P. knowlesi has 827 pir genes (see Figure 4B). In the published annotation it has just 256 pir genes.
As previously reported 29,52 , the pir genes can be grouped based on sequence similarity. We observe that the diversity of the pir repertoire is dramatically reduced in P. coatneyi and P. knowlesi. Most of the pir genes form the same cluster (cluster 0; Figure 4A). However, that cluster splits into two groups in the gene-gene network due to the different lengths of the pir genes in P. coatneyi and P. knowlesi (see Figure 4B). One hypothesis for the loss of other pir types might be the occurrence of sicaVAR genes in P. knowlesi and P. coatneyi 33 . The reduction of the pir repertoire is an interesting parallel to the Laverania, where the amount of rif genes (analogous to pir genes) is reduced but a new gene family evolved, the var genes. Additionally, in the Laverania the number of rif genes drops further when the parasite is in the human compared to the primate 31 .
As for the other clusters, it seems that the underlying structure of the pir genes predates the speciation of P. ovale, P. vivax and P. cynomolgi. Depending on the type of pir, the amount can fluctuate, as can be seen by the large variance in number of genes per cluster. Some clusters are specific to P. ovale and some others contain just the two human malaria parasites, P. vivax and P. ovale. Interestingly, several pir genes have 1:1 orthologues across the different species (Supplementary Table 1, see Figure 4B). As those genes seem to be conserved across evolutionary time, it is unlikely that they are extracellular (where they would be under immune pressure), rather they must have more conserved core functions.

Expansion of methyltransferases.
While paralogous expansions of pir genes and genes encoding MSP genes have been described in other Plasmodium species, P. cynomolgi exhibits an unexpected expansion of 36 methyltransferase pseudogenes. These pseudogenes are found in the subtelomeres, and were annotated as encoding 26 hypothetical proteins in the PcyB assembly. The role of pseudogenes in Plasmodium is little understood, but in several malaria parasite species conserved pseudogenes are found in the subtelomeres. In the OrthoMCL clustering, all 36 methyltransferase  Network illustrating the relatedness of PIR between P. ovale (red), P. cynomogli (light blue) and P. vivax (blue), and between P. coatneyi (light green) and P. knowlesi (green). P. ovale, P. cynomogli and P. vivax clearly share the same general topology of PIR architecture. In contrast, P. coatneyi and P. knowlesi have a reduced number PIRs and also of PIR classes.
pseudogenes cluster with one full-length core gene (PcyM_0947500, Figure 5A). This gene is found on chromosome 9 and has one conserved orthologue across all other Plasmodium species (cluster 51, Supplementary Table 1), and is found in many other species on OrthoMCL as cluster OG5_129798. The 36 copies are spread evenly throughout the subtelomeres, without evidence of spatial clustering.
The methyltransferase pseudogenes contain motifs of the Caulimovirus, a virus often found integrated in to plant genomes, and of different retrotransposons families such as aedes aegypti, Gypsy, Helitron-5, CACTA-1, RTEX and CR1 (see Supplementary Table 2). While the Caulimovirus insert was mostly found to have occurred in an antisense orientation hinting towards a role in stability, the LTR and non-LTR insertions were found most often to have occurred in a sense orientation 53 . The hits were mostly to low complexity regions, suggesting that recombination in the subtelomeres may be a result of mechanisms similar to those used by retro elements.
We also found evidence that this duplication of methyltransferases was also found in P. simiovale, a close outgroup to P. cynomolgi, P. vivax, and P. knowlesi. Fewer copies were observed in the P. simiovale assembly (13), but this may be due to the fragmentation of the assembly. Although they are generally less degenerate at their 5' ends, they are nevertheless pseudogenized.
To further understand the duplication, we mapped the reads of P. cynomolgi, P. simiovale and P. vivax P01 against the locus on chromosome 9 containing the ancestral methyltransferase in P. cynomolgi (see Figure 5B). Although the coverage is shown as log Interestingly, the locus in P. cynomolgi has an insertion with a subtelomeric gene that has a weak hit with to a putative DNA translocase Ftsk domain. Coverage plot mapped from P. cynomolgi reads (black), P. vivax (blue) and P. simiovale (magneta) is shown in log scale on P. cynomolgi. The methyltransferases are duplicated more than 35 times. As the height is roughly similar between the two duplications, we expect around the same number of methyltransferases in P. simiovale than in P. cynomolgi. The insert of the green gene is found just in P. cynomolgi, due to the missing coverage. The upper panel shows the distance of read pairs; the insertion of the region probably occurred after the duplication of the gene into the subtelomeres, as all reads from the duplications are connected over the insertion. The next core gene is also duplicated.
scale, the coverage across the methyltransferase seems to be identical for P. simiovale and P. cynomolgi, but significantly lower for P. vivax. This leaves us to speculate that the number of methyltransferases is roughly the same in both P. simiovale and P. cynomolgi. Further, the coverage plot also reveals that the next core gene of unknown function, PcyM_0947600, is also duplicated. In PcyM we find two further paralogous genes: PcyM_0054800 PcyM_0012100. Furthermore, it is more often duplicated in P. simiovale, as the coverage of that gene is high ( Figure 5B). A search for structural similarity using I-TASSER 54 yielded no conclusive results.
A phylogenetic tree (see Figure 5A) shows the methyltransferase paralogs in P. cynomolgi and P. simiovale compared to the orthologues in the other species. The genes generally follow the species tree, but they are expanded in P. cynomolgi and P. simiovale. As P. simiovale is thought to be an outgroup to P. cynomolgi and P. vivax, we expect that P. vivax has lost the expansions.
We compared the location of the ancestral methyltransferase between PcyM and P. vivax. To our surprise, we found a potential open reading frame inserted between two methyltransferases in P. cynomolgi. A tBLASTn of that CDS against the Nucleotide NCBI database revealed no significant similarity to any other sequence, except for the subtelomeres of P. vivax and P. cynomolgi. A very weak hit (e-value of e-4) to a DNA translocase FtsK, is an interesting finding, in light of the potential LTR transposon-like sequences discussed previously, but is to be taken with caution. This particular open reading frame is absent in P. simiovale and seems that have occurred subsequent to the expansion and is not likely to be implicated in the expansion itself.
It remains speculative if the paralogs of the methyltransferase genes and the adjacent gene were functional in the ancestor. Hypothetical roles of the methyltransferase could involve any of the following: 1) the epigenetic control of differential pir gene expression in acute and chronic infections 50 , 2) the sequence may have a role in genome stability and recombination, or 3) this could be a selfish gene that was able to transpose.

Conclusion
The availability of a new and improved P. cynomolgi reference genome sequence will enable in-depth studies of this widely used model parasite, including investigations into dormant stages and the selection of new drug targets and vaccine candidates. High quality genomics related studies will now be possible, including studies of previously missed core genes. In particular, the improved subtelomeres have enabled us to dissect the pir gene family further, and have revealed a novel and unexpected expansion of methyltransferase genes.

Data and software availability
The project number of the P. cynomolgi raw reads is deposited in the European Nucleotide Archive under accession number ERP000298. The submitted genome is under the project number PRJEB2243.

Competing interests
None of the authors declared competing interest.

Grant information
This work was supported by the Wellcome Trust (098051), EVI-MalaR (contract number 242095) and Gates Foundation Project OPP1023583. GGR is supported by the Medical Research Council (MR/J004111/1).
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Click here to access the data. It is nice that the same color scheme is used throughout the manuscript, but the color of Pcy, Pc and Pk are rather similar and difficult to tell apart. In particular on Figure 4 this is problematic.

Supplementary
Perhaps it would be more logic to discuss the improved subtelomeres earlier in the manuscript (i.e before the OrthoMCL clustering) Figure 3 B and C could be better labelled.
The improved reference genome should be made available in GeneDB and PlasmoDB as well.
Finally, the authors mention that genes specific to hypnozoite forming parasites mainly belong to variant multigene families and unlikely to be relevant for hypnozoite formation. Perhaps it would be worthwhile identifying genes which are not unique, but substantially different in these parasites (indels, unexpectedly high number of SNPs, etc). 1.

6.
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard. The current manuscript describes the resequencing and finishing to near chromosomal completion of the genome using high coverage PacBIO sequencing, the reannotation of the Plasmodium cynomolgi parasite's ~6,500 coding gene models and a detailed comparative analysis of its major gene families relative to related primate clade species of , including and as Plasmodium P. vivax, P. ovale P. knowelsi well as a draft assembly and annotation of and a reannotation of . As the authors P. simiovale P. coatneyi describe, is an important model for a relapsing human infectious species that now P. cynomolgi P. vivax, predominates as the primary cause of malaria in the Asia-Pacific and Americas. The authors correctly note that the current draft assembly the genome, which was produced several years ago, is P. cynomolgi highly fragmented, incomplete and has truncated / inadequate gene models. Though of an acceptable standard at the time, it is clear that this prior genome is no longer adequate to act as a reference of P.
research. Based on this, the current manuscript is a timely contribution that will provide an cynomolgi excellent resource for the malaria research community. The paper itself is well written, the figures are very nicely conceived and presented and the methods used are appropriate and expertly applied. I have no hesitation in recommending this study for publication. Could the authors please provide a more detailed explanation either in their methods section or in the results section for how they define the expanded methyltransferases they identify in P.
as pseudogenes? I wonder if this will otherwise not be immediately clear to the reader. cynomolgi Under the 'Paralogous expansion of ... MSP' section -'while in and to ' -I P. malaria P. ovale assume 'to' should be deleted here.
In Figure 3C, should this be 'expansion in msp3-like'? If not, could the authors explain what they mean by 'expansion like'?

Is the work clearly and accurately presented and does it cite the current literature?
Yes