Progression of the canonical reference malaria parasite genome from 2002–2019

Here we describe the ways in which the sequence and annotation of the Plasmodium falciparum reference genome have changed since its publication in 2002. P. falciparum is the malaria species responsible for the most deaths worldwide, so the richness of its annotation and the accuracy of its sequence are important resources for the P. falciparum research community, as well as the basis for interpreting the genomes of subsequently sequenced species. At the time of publication in 2002, over 60% of predicted genes had unknown functions. As of March 2019, this number has decreased to 33%. The reduction is due to the inclusion of genes that were subsequently characterised experimentally and genes with significant similarity to others with known functions. In addition, the structural annotation of genes has been significantly refined; 27% of gene structures have been changed since 2002, comprising changes in exon-intron boundaries, addition or deletion of exons and the addition or deletion of genes. The sequence has also undergone significant improvements. In addition to the correction of a large number of single-base and insertion or deletion errors, a major misassembly between the subtelomeres of chromosomes 7 and 8 has been corrected. As the number of sequenced isolates continues to grow rapidly, a single reference genome will not be an adequate basis for interpreting intra-species sequence diversity. We therefore describe in this publication a population reference genome of P. falciparum, called Pfref1. This reference will enable the community to map to regions that are not present in the current assembly. P. falciparum 3D7 will continue to be maintained, with ongoing curation ensuring continual improvements in annotation quality.


Introduction
The genome of Plasmodium falciparum 3D7 (a clone from the NF54 (Walliker et al., 1987) isolate), the species responsible for the most severe form of malaria, was the first reference genome published to support Plasmodium research. Its publication almost two decades ago (Gardner et al., 2002) was a milestone, the impact of which is reflected in several thousand citations that mention the P. falciparum 3D7 genome. The sequencing of P. falciparum was initially accompanied by the draft genome of a rodent malaria species, P. yoelii (Carlton et al., 2002). These genomes were followed by those of several other Plasmodium spp., sequenced using Sanger sequencing technology, including the human-infective species P. vivax (Carlton et al., 2008), the monkey and human malaria parasite P. knowlesi (Pain et al., 2008) and further rodent Plasmodium spp. (Hall et al., 2005). With the advent of much cheaper short-read technology, many more genomes have been sequenced, including the chimpanzee parasite P. reichenowi (Otto et al., 2014), the monkey malaria parasites P. cynomolgi (Tachibana et al., 2012), P. coatneyi (Chien et al., 2016), P. inui and P. fragile, the murine parasite P. vinckei, the human parasites P. malariae and P. ovale (Rutledge et al., 2017) and the avian malaria parasites P. gallinaceum and P. relictum (Böhme et al., 2018). Although many of these genomes are highly fragmented draft assemblies, algorithms that use high coverage of aligned short reads have enabled a variety of cost-effective genome assembly improvements for several species (Swain et al., 2012). P. falciparum 3D7 is a major focus of malaria research and the accuracy of its reference genome and annotation is vital for accelerating hypothesis-driven research.
Moreover, the availability of a reference genome has additional importance: it underpins genome comparisons, across the suite of genome sequences that are now available for multiple Plasmodium species, and the global efforts to analyse genome variation amongst thousands of clinical and lab isolates. The need for a commitment to maintain and improve this genome has long been recognized by the Wellcome Sanger Institute. Through careful manual curation, highly accurate predictions of coding and non-coding genes have been added. Functional descriptions of genes have also been kept up to date, to reflect the growing volume of P. falciparum related scientific literature. In many ways the depth of annotation is similar to that more commonly associated with model organisms. For instance, Gene Ontology terms have been manually selected that capture from the scientific literature the richness of gene roles in a format that can be easily queried or used for inference in genome-wide analyses. Recent examples include the genome-wide analysis of transcriptional dynamics (Painter et al., 2018) and the uncovering of common functions in essential genes (Zhang et al., 2018).
Genome improvement and curation have resulted in thousands of individual changes over more than 15 years. In particular, the resolution of subtelomeric regions has been transformed, along with the ability to annotate important multigene families that are often found in those regions. This is the first paper to describe the changes since the P. falciparum 3D7 genome was first published. Originally 5,268 protein-coding genes were annotated and, of those, over 60% (3,465 genes) had unknown functions (Table 1). Despite a persistent perception that over 50% of genes remain functionally unannotated (Briquet et al., 2018; Tang et al., 2019), the number of predicted genes has risen to 5,438 and the proportion without ascribed functions has now shrunk to 33% (1,776 genes) (Table 1). Since 2002, 27% of genes have undergone structural changes or have been added based on RNA-Seq data and other data from publications. New ncRNAs have also been added and complete apicoplast and mitochondrial genomes have been assembled. One of the many purposes of a reference genome is to interpret natural variation data. In the latest version, we have therefore included alternative contigs representing major haplotypic differences. This reference dataset has been named Pfref1 to reflect the fact that it does not simply comprise P. falciparum 3D7 data but has been supplemented with other reference data to better represent the pan-genome for this species. The aim of Pfref1 is to enable robust mapping to analyse genome variation in regions of Plasmodium genomes where the current Pf3D7 genome (v3.2) is an unsuitable reference.
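As a quick check on the proportions quoted above, the arithmetic can be reproduced from the gene counts given in the text. This is a minimal sketch: the function name is ours, chosen for illustration, and the 2019 denominator is assumed to be the 5,438 predicted genes quoted alongside the 33% figure.

```python
def percent_unknown(unknown_genes: int, total_genes: int) -> float:
    """Return the percentage of genes without an ascribed function."""
    return 100.0 * unknown_genes / total_genes


# Counts quoted in the text (Table 1).
pct_2002 = percent_unknown(3465, 5268)
# Assumption: the 2019 proportion is taken over all 5,438 predicted genes.
pct_2019 = percent_unknown(1776, 5438)

print(f"2002: {pct_2002:.1f}% of genes with unknown function")
print(f"2019: {pct_2019:.1f}% of genes with unknown function")
```

Both values are consistent with the "over 60%" and "33%" figures stated in the text.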

Curation and annotation
Changes to the genome annotation reflect ongoing work at the Wellcome Sanger Institute. The software Artemis (version 10 to version 18) was adapted to use a CHADO database schema (Carver et al., 2008) and has been used continuously for manual curation and annotation. This database system is directly connected to GeneDB. Every 4 to 6 months, data are transferred to PlasmoDB. To update functional annotation, PubMed was searched (search terms Plasmodium and apicomplexa) on a regular basis for publications related to Plasmodium. Relevant information, i.e. gene product descriptions, EC numbers, gene names and functional descriptions to be captured by Gene Ontology terms, was extracted and changes were manually added in Artemis. RNA-Seq data and TBLASTX comparisons were the primary supporting evidence for manual improvements to gene models. Information from user comments submitted to gene record pages in PlasmoDB was assessed and, where relevant, used to update annotation. Evidence codes that support product descriptions are available as GFF format genome annotation files from the following FTP site: ftp://ftp.sanger.ac.uk/pub/genedb/releases/latest/Pfalciparum/. To find annotation differences

Amendments from Version 1
We would like to thank the reviewers for their helpful feedback. We have responded to the three reviewers' comments. We have made minor changes to the manuscript and improved Figure 2. Gene IDs have been added on the panels. The TBLASTX matches are now shown in grey and the height of the TBLASTX matches has been shortened. All genes are shown in red.
In Figure 3 the description of 2007-2010 has been corrected to "69 new".
Box 1). One of the goals of the workshop was to ascribe updated functions to predicted proteins, check gene structures and systematically revisit the nomenclature for large gene families. A major new addition to the evidence was genome-wide TBLASTX comparisons between species, which were used to highlight conserved regions at the protein level and therefore identify positionally conserved orthologues and refine their exon-intron boundaries. In 2010, we published the first RNA-Seq data for this species (Otto et al., 2010b). These data were used to further evaluate gene models and improve the accuracy of gene structures. As a result, 27% of genes have been added or had their structure changed since 2002 (Figure 1). A total of 1255 genes had changes to exon-intron boundaries or exons added or removed (Figure 2A); this number includes genes that were merged (Figure 2B) or split (Figure 2C). Since 2002, 244 genes have been added (Figure 2D, Extended data: Table 1 (Böhme, 2019)) and 36 predicted genes have been deleted due to a lack of evidence supporting their earlier prediction in regions of repetitive or unusual sequence, or because later RNA-Seq evidence (including strand-specific information) suggested that they were ncRNAs rather than protein-coding (Figure 2E, Extended data: Table 2 (Böhme, 2019)). In addition, a number of genes were created after 2002 based on algorithmic predictions but subsequently deleted due to a lack of supporting evidence (Extended data: Table 3 (Böhme, 2019)).
Altogether there are 1302 genes annotated using GO and supported by experimental evidence: 1095 genes captured by the "component" aspect of GO, 609 captured by "molecular function" and 369 captured by "biological process". Because individual genes have been annotated with multiple terms, the number of individually curated and experimentally verified GO terms is much higher. There are 1867 GO components annotated, 979 GO functions and 857 GO processes. The manual GO annotation also includes 342 protein binding interactions (Table 2). Annotation is updated continuously as new literature is published.
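The difference between the number of annotated genes and the number of individual annotations can be illustrated with a toy example. This is only a sketch: the gene names and records below are hypothetical (they are not taken from the curated annotation), and the tuple layout is an assumption made for illustration rather than the format used by GeneDB.

```python
from collections import defaultdict

# Hypothetical (gene, GO aspect, GO term) records, for illustration only.
records = [
    ("geneA", "component", "GO:0005634"),
    ("geneA", "component", "GO:0005737"),
    ("geneA", "function", "GO:0003677"),
    ("geneB", "component", "GO:0005634"),
    ("geneB", "process", "GO:0006412"),
]

genes_by_aspect = defaultdict(set)  # aspect -> distinct annotated genes
terms_by_aspect = defaultdict(int)  # aspect -> number of individual annotations
for gene, aspect, _term in records:
    genes_by_aspect[aspect].add(gene)
    terms_by_aspect[aspect] += 1

all_genes = {gene for gene, _aspect, _term in records}
per_aspect_gene_sum = sum(len(genes) for genes in genes_by_aspect.values())

print("distinct genes:", len(all_genes))
print("sum of per-aspect gene counts:", per_aspect_gene_sum)
print("annotations per aspect:", dict(terms_by_aspect))

# The same effect appears in the counts quoted in the text: the per-aspect
# gene counts sum to more than the 1302 annotated genes, because a single
# gene can be annotated under several aspects.
assert 1095 + 609 + 369 > 1302
```

In the toy data, two genes carry five annotations, and the per-aspect gene counts add up to four, which mirrors why the per-aspect figures in the text cannot simply be summed to give the number of annotated genes.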

Throughout the annotation improvement process, engagement of the malaria research community has played an important role. The process started with the workshop in 2007 (Box 1) and has continued through the activities of a dedicated full-time curator, aided by direct feedback and by comments that the community can add to gene record pages at PlasmoDB. These comments are continually evaluated and incorporated where relevant. The ongoing annotation is physically housed at the Sanger Institute, with updates regularly passed on to PlasmoDB (every 4 to 6 months).

Population reference genome Pfref1
One of the many purposes of a reference is to interpret natural variation data, the aim being to enable robust mapping of re-sequencing reads from subsequent isolates. In the latest version we have incorporated sequence differences derived from three lab isolates assembled de novo as part of a collection of 15 PacBio reference genomes (Otto et al., 2018). The differences have been incorporated into the reference as three classes (Figure 5). The first (type-1) are "patches" to correct errors in Pf3D7 (version 3.2), for example a missing centromere on chromosome 10 and a missing gene on chromosome 13 (Figure 5, Figure 6A); the second (type-2) are core genes that are present in other sequenced isolates, i.e. P. falciparum IT or P. falciparum DD2, but are missing in Pf3D7 (Figure 5, Figure 6B); and the third (type-3) are dimorphic genes where alternative alleles cannot be mapped to the one currently present in Pf3D7 (Figure 5, Figure 6C, Figure 6D). In total, there are now 17 type-1, four type-2 and 17 type-3 patches. The type-2 patches include genes encoding gamete associated protein (GAP), CLAG and hypothetical proteins. Type-3 include dimorphic genes encoding DBL-containing protein (PF3D7_0113800), Surfin 1.
3D7 (version 3.1, 14.02.2019).
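The three patch classes and their counts, as stated in the text, can be summarised in a small data structure. This is a minimal sketch for orientation only: the class and variable names are hypothetical and do not correspond to any file format or identifier scheme used in Pfref1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatchClass:
    """One class of Pfref1 patch (names here are illustrative, not official)."""
    name: str
    description: str
    count: int


# Descriptions and counts are taken from the text.
PFREF1_PATCH_CLASSES = [
    PatchClass("type-1", "corrections of errors in Pf3D7 (version 3.2)", 17),
    PatchClass("type-2", "core genes present in other isolates but missing in Pf3D7", 4),
    PatchClass("type-3", "dimorphic genes whose alternative alleles cannot be mapped to the Pf3D7 allele", 17),
]

total_patches = sum(patch.count for patch in PFREF1_PATCH_CLASSES)

for patch in PFREF1_PATCH_CLASSES:
    print(f"{patch.name}: {patch.count} ({patch.description})")
print("total patches:", total_patches)
```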

Discussion
In this paper we have provided an overview of the sequence and annotation changes the P. falciparum 3D7 genome has undergone since the initial publication in 2002. The inclusion of long-read sequencing has been critically important for spanning gaps that persisted for years in the reference assembly due to their extreme AT-richness and length. A previous attempt to produce an improved P. falciparum 3D7 reference assembly used Pacific Biosciences sequencing data assembled de novo (Vembar et al., 2016). Although the assembly contiguity metrics were impressive, the authors did not attempt error-correction. As a consequence, a high proportion of gene sequences contained frameshifts and there were many unresolved repetitive sequences. In the present study, we have used automated error-correction assisted by a high coverage of aligned short reads, plus extensive manual review of individual read alignments. This has enabled us to drive up the accuracy of the underlying reference sequence, bringing a range of benefits. First, end users interested in individual genes have access to the most up-to-date information. Second, users interested in high-throughput functional genomics or genome variation need the most up-to-date and complete sequence for mapping purposes to be available and used by all labs. Third, detailed curation in P. falciparum has a knock-on effect across other important species of the genus because functional insights from one species can be projected to others based on homology. Diminishing returns are inevitable. However, with 33% of genes still of unknown function, it is essential that ongoing maintenance, annotation and curation are continued. In particular, our future plans include the annotation of untranslated regions (UTRs) and the annotation of additional common alternative splice-forms for genes.
We also plan to provide better visibility for evidence codes that support protein descriptions but are currently only available as GFF format genome annotation files. New possibilities are now being explored for the community to get involved with annotation. GeneDB will soon provide an opportunity for the community to contribute directly to structural annotation. With a view equivalent to that of an annotator, users will be able to view the curated Plasmodium genomes in Apollo (Lee et al., 2013), a collaborative genomic annotation editor that allows multiple users to access the data.
loci, rather than simply excluding them from population analyses, remains a challenge and will require the further development of variant-calling methods. However, as a first step, the popular short-read alignment tool BWA-MEM (Li, 2013), used in the GATK variant call pipeline, has been able to perform alignments in an alternative-aware mode for several years.
In some regards it is a historical Plasmodium falciparum review detailing the evolution of this critical genome, highlighting a significant number of gene modifications, losses as well as additions. As described, the current status of the P. falciparum genome should be the envy of any field focused on a single model species. The examples of modifications to the genome are adequately depicted throughout and easy to follow. Overall, the exceptional curation of this genome is largely due to the diligent and consistent attention to the quality and completeness that it has received over the years from this team at the Wellcome Sanger Institute.

Data availability
One new feature for the malaria parasitology community is the creation of a "population" reference genome, Pfref1, that accounts for sequence differences between strains. While I applaud this effort since there is a need to encompass the diversity of parasite strains, it is not clear to me whether 3 isolates are sufficient to capture the known major haplotypes, as claimed. Perhaps this is true, but do IT, P. falciparum DD2 and Pf3D7 provide sufficient haplotype diversity? If so, can you demonstrate this? And what type of genes account for the greatest haplotype diversity? Also, with Pfref1, how many genes are now included (and nucleotide increase) to map against? I think the manuscript could also be somewhat enhanced by describing which sequencing and/or annotation algorithms/approaches have contributed the most significantly to the changes and improvements to the current genome overall.

Although the manuscript describes a large gain in pseudogenes, given recent reports that some pseudogenes may actually express functional proteins this should be mentioned as a possible outcome for genes predicted to be non-functional.
On page 5, it is stated that the number of verified genes has changed from 597 to 1296 genes. It is unclear what "experimentally verified" implies. What verification was applied to these genes?

Minor comment:
Please change reference to "AP2 proteins" to ApiAP2 proteins. AP2 refers solely to the DNA binding domain region of these proteins. On page 7, three "types" of genes are described (type-1, type-2, and type-3). For the non-experts, please describe what these types refer to?
References
1. Macedo Silva T, Duque Araujo R, Wunderlich G: The pseudogene SURFIN 4.1 is vital for merozoite formation in blood stage P. falciparum.

Are sufficient details of methods and materials provided to allow replication by others? Yes
Are the datasets clearly presented in a useable and accessible format? Yes
No competing interests were disclosed.
Thank you for your constructive feedback.
1. One new feature for the malaria parasitology community is the creation of a "population" reference genome, Pfref1, that accounts for sequence differences between strains. While I applaud this effort since there is a need to encompass the diversity of parasite strains, it is not clear to me whether 3 isolates are sufficient to capture the known major haplotypes, as claimed. Perhaps this is true, but do IT, P. falciparum DD2 and Pf3D7 provide sufficient haplotype diversity? If so, can you demonstrate this? And what type of genes account for the greatest haplotype diversity?
Our intention is to provide a reference that will be sufficient to capture the major haplotypes by sequence-mapping approaches. IT, DD2 and Pf3D7 do not represent the full universe of haplotype diversity. In the paper we state that this supplemented sequence "better represents the pan genome for this species". Additional sequences can be easily added as they are identified, using this consistent format.
2. Also, with Pfref1, how many genes are now included (and nucleotide increase) to map against?
We found 4 core genes that are present in other isolates.
3. I think the manuscript could also be somewhat enhanced by describing which sequencing and/or annotation algorithms/approaches have contributed the most significantly to the changes and improvements to the current genome overall.

As shown in Figure 3, the largest number of changes was made between 2007 and 2010. This is due to the RNA-Seq data and TBLASTX comparisons described in the methods section: "RNA-Seq data and TBLASTX comparisons were the primary supporting evidence for manual improvements to gene models."
4. Although the manuscript describes a large gain in pseudogenes, given recent reports that some pseudogenes may actually express functional proteins this should be mentioned as a possible outcome for genes predicted to be non-functional.
We've added a sentence to the Table 1 legend to state that pseudogenes have simply been defined operationally as genes with premature stop codons or at least one frameshift.
5. On page 5, it is stated that the number of verified genes has changed from 597 to 1296 genes. It is unclear what "experimentally verified" implies. What verification was applied to these genes?
These are genes that have been mentioned by others in a peer-reviewed paper containing some level of experimental evidence.

Minor comment:
1. Please change reference to "AP2 proteins" to ApiAP2 proteins. AP2 refers solely to the DNA binding domain region of these proteins.
AP2 has now been changed to ApiAP2.
2. On page 7, three "types" of genes are described (type-1, type-2, and type-3). For the non-experts, please describe what these types refer to?
This has now been clarified in the text. "The differences have been incorporated into the reference as three classes (Figure 5). The first are "patches" to correct errors in Pf3D7 (version 3.2), for example a missing centromere (type-1) on chromosome 10 and a missing gene on chromosome 13 (Figure 5, Figure 6A), the second are core genes that are present in other sequenced isolates, i.e. P. falciparum IT or P.
In the almost 20 years since the publication of the Plasmodium falciparum 3D7 genome sequence there have been tremendous improvements in the completeness and accuracy of the sequence and genome structure, as well as an impressive improvement in the annotation. These advances have been achieved by a combination of further sequence analysis of both genome and transcripts, as well as automatic and literature and community-based manual annotation. This paper spells out the patient and often painstaking strategies that have been used since the first publication to ensure that there is as far as possible a complete and fully annotated genome sequence available to the scientific community. Such information is not only essential for much hypothesis-driven research into parasite biology and parasite-host interactions, or for studies of parasite evolution and epidemiology, but also underpins other global approaches, such as proteomics analysis. The work highlights some of the problems that needed to be addressed and provides some examples of the solutions that were found and applied. The ongoing improvements in annotation will facilitate research into specific areas of cell biology and metabolism, and contribute to efforts to translate this knowledge into products useful for interventions to control malaria.
In addition to the work on the reference cloned 3D7 parasite line, the authors also describe the establishment of an extended data set comprised of a population reference genome of P. falciparum, called Pfref1. This will facilitate mapping and comparative studies of isolates and lines that cannot be achieved using the 3D7 reference alone. Such a reference is likely to be very useful for studies of parasite evolution and spread, for example in response to selective pressures.
The data are updated regularly and readily available, for example through GeneDB or PlasmoDB and through the documented web sites for further specific details.

Minor textual corrections:
The last sentence of the abstract needs improvement.
In the Methods section, subsection 'PfRef1 reference genome', the second to last sentence requires attention. On page 5, second paragraph, the amino acid names isoleucine and valine are not normally capitalized.
In Figure 3, during the period 2007-2010, presumably the blue bar should represent 69 new rather than 69 changed.

Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound? Yes

Are sufficient details of methods and materials provided to allow replication by others? Yes
Are the datasets clearly presented in a useable and accessible format? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: I have worked for many years on malarial parasite-host interactions including host cell invasion and cell and molecular biological aspects of parasite growth and proliferation in red blood cells.
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

3. Figure 2 can be improved and made easier to read by providing the gene IDs directly on the panels. The red blocks between v1 and v3.2 can be much less tall and the coding regions heightened to draw better attention to the gene information.
4. In Figure 4, Figure 3 is labeled as 2.1.5. Correct/clarify the labels of the different versions.
5. In Table 2, should the "Number of GO annotations (2018)" be corrected to read "Number of GO annotations (2019)"?
6. For the Pfref1 population reference, major sequence variations are included from the three different lab isolates. Please clarify how copy number variations are treated.
7. Introduction, page 3. P. falciparum 3D7 is from the NF54 isolate (Walliker et al., Science 1987, footnote 11). NF54 has served along with 3D7 as a major focus of malaria research. Inclusion of this information would be helpful to readers, especially as the close relationship of 3D7 to NF54 is often unrecognized or forgotten.
Thank you for this suggestion. This information has now been added in the figure legend.
"Genes above the chromosome lines are located on the forward strand, genes below the chromosome lines are on the reverse strand."
2. The numbers of genes in several versions listed in Table 1 appear to differ from the numbers in the corresponding versions of Figure 4. For example, Table 1 lists 5280 genes in Version 3.2 but Figure 4 notes 5438 genes. Check and correct/explain the differences.
As stated in the figure legend, Table 1 lists the gene numbers without pseudogenes. Pseudogenes are separately listed in Table 1. Version 3.2 has 5280 protein-coding genes and 158 pseudogenes. Adding both numbers gives a total of 5438 genes. We've now added the following sentence to the legend of Figure 4 to make this clearer.
3. Figure 2 can be improved and made easier to read by providing the gene IDs directly on the panels. The red blocks between v1 and v3.2 can be much less tall and the coding regions heightened to draw better attention to the gene information.
Thank you for the suggestion to improve this figure. Gene IDs have been added on the panels. The TBLASTX matches are now shown in grey and the height of the TBLASTX matches has been shortened. All genes are shown in red. This is an ACT screenshot, which means that coding regions cannot be heightened. But we hope that the other changes will draw more attention to the gene information.
Thank you for noticing this. The number in the legend of Figure 3 has now been changed from 2.1.5 to 2.1.4.
5. In Table 2, should the "Number of GO annotations (2018)" be corrected to read "Number of GO annotations (2019)"?
This has now been corrected.
6. For the Pfref1 population reference, major sequence variations are included from the three different lab isolates. Please clarify how copy number variations are treated.
Most CNVs can be detected and quantified using the strictly 3D7 sequence as a reference (based on mapped coverage of reads). Pfref1 simply extends the reference to those areas that are missing or highly diverged from 3D7.
7. Introduction, page 3.
P. falciparum 3D7 is from the NF54 isolate (Walliker et al., Science 1987). NF54 has served along with 3D7 as a major focus of malaria research. Inclusion of this information would be helpful to readers, especially as the close relationship of 3D7 to NF54 is often unrecognized or forgotten.
We have addressed this with the following text "the genome of Plasmodium falciparum 3D7 (a