Diversity pattern of Plasmodium knowlesi merozoite surface protein 4 (MSP4) in natural population of Malaysia

Human infections due to the monkey malaria parasite Plasmodium knowlesi are increasingly being reported from Malaysia. The parasite causes high parasitaemia, severe and fatal malaria in humans thus there is a need for urgent measures for its control. The MSP4 is a potential vaccine candidate, which is well studied in Plasmodium falciparum and Plasmodium vivax; however, no study has been conducted in the orthologous gene of P. knowlesi. In this study, we investigated the level of polymorphisms, haplotypes, natural selection and population structure of full-length pkmsp4 in 32 clinical samples from Malaysian Borneo along with 4 lab-adapted strains. We found low levels of polymorphism across the gene with exon I showing higher diversity than the exon II. The C- terminal epidermal growth factor (EGF) domains and GPI-anchored region within exon II were mostly conserved with only 2 non-synonymous substitutions. Although 21 amino acid haplotypes were found, the frequency of mutation at the majority of the polymorphic positions was low. We found evidence of negative selection at the exon II of the gene indicating existence of functional constraints. Phylogenetic haplotype network analysis identified shared haplotypes and indicated geographical clustering of samples originating from Peninsular Malaysia and Malaysian Borneo. High population differentiation values were observed within parasite populations originating from Malaysian Borneo (Kapit, Sarikei and Betong) and laboratory-adapted strains obtained from Peninsular Malaysia and Philippines indicating distinct population structure. This is the first study to genetically characterize the full-length msp4 gene from clinical isolates of P. knowlesi from Malaysia and thus would be very useful for future rational vaccine studies. Further studies with higher number of samples and functional characterization of the protein will be necessary.


Introduction
Plasmodium knowlesi, a zoonotic malaria parasite of long-tailed and pig-tailed macaques is now considered as the fifth Plasmodium species infecting humans and now the most common cause of malaria in Malaysia. Most Southeast Asian countries have reported cases of this infection in humans [1,2]. As per the World Malaria Report 2017, there is a rapid increase of human cases in Malaysia [3] and within Malaysia, highest incidence of knowlesi malaria in human have been documented from Malaysian Borneo [4][5][6] highlighting the need of immediate and comprehensive approaches integrating multiple strategies such as vector control, anti-malarial treatment and development of effective vaccines. Almost 70-78% of malaria cases reported from Malaysian Borneo (Sarawak and Sabah) were due to P. knowlesi [6,7]. Death due to P. knowlesi malaria has also been reported in Sarawak and Sabah of Malaysian Borneo and rapid increase in parasitaemia has been shown to be associated with severe malaria and in some cases fatal in Malaysian Borneo [8,9]. Genetic studies and genomic studies on P. knowlesi clinical isolates from Malaysia have identified at least 3 sub-populations with their overall diversity even higher than P. falciparum and P. vivax, 2 of the populations were associated with primary primate hosts and one with geographical location [1,[10][11][12]. Mitochondrial gene cytochrome oxidase I(cox 1) and the smaller subunit ribosomal rRNA (ssrRNA) of P. Knowlesi from clinical isolates and macaques also identified two distinct clusters which clustered geographically to Peninsular Malaysia and Malaysian Borneo [13].
One of the strategies to develop a vaccine against Plasmodium species is based on targeting apical organelle antigens or the merozoite surface antigens involved in the asexual-stage of the parasite life cycle, which are accessible to the host immune system [14]. Immune response induced by such antigens has the potential to block parasite entry into RBCs. However, because of high antigenic diversity (in field isolates), which is one of the main mechanisms through which the malaria parasites evade host immune responses remains as one of the challenges to design a strain-transcending vaccine. Major vaccine candidates studied till date in P. falciparum (like CSP, AMA1) show high polymorphism, evolve under positive natural selection and show high antibody response but rendered non-efficacious vaccine trial because of strain-specific immune response [15]. Merozoite surface protein family (MSPs) forms the most abundant protein, which are targets of immune attack by host antibodies and thus considered excellent targets for vaccine development. One of these proteins is the merozoite surface protein 4 (MSP4), an abundant glycosylphosphatidylinositol (GPI) anchored protein which contains a single epidermal growth factor (EGF)-like domain both these are towards the carboxyl terminus of the protein [16,17]. MSP4 has been considered as a promising subunit vaccine candidate in P. falciparum and naturally acquired antibody response hasbeen reported from malaria endemic regions [18,19].Vaccine trials in mouse models in P. berghei and P. yoelli have shown significant protective efficacy [20]. The structural conformation of the EGFdomain in P. falciparum MSP4 has been found to be essential for binding to host erythrocytes and antigenicity [21]. Recently, protective role of naturally acquired anti-PfMSP4 antibodies was found to be associated with clinical malaria in an endemic region of Senegal [22] supporting further development of MSP4 as a vaccine candidate. Genetic studies in both P. falciparum and P. vivax MSP4 gene have been extensively conducted in different endemic areas of the world and the gene is found to possess low level of polymorphism and under purifying selection [23][24][25]. Vaccine studies in P. knowlesi are still in its nascent stage. High genetic diversity has been observed in clinical isolates of P. knowlesi in several ortholog vaccine antigens (like NBPXA, MSP1 and MSP7D) [11,[26][27][28]. However, no genetic study has been done in the pkmsp4 gene from clinical isolates of Malaysia. In this study, 36 pkmsp4 full-length sequences (32 clinical isolates from Sarawak, Malaysian Borneo and 4 long-time lab-adapted strains) were obtained from published genome studies and the level of nucleotide diversity, haplotypes, and natural selection acting at full-length MSP4 gene were determined. Information on genetic diversity and natural selection acting at msp4 gene will be essential for a rational approach for vaccine design and functional assays.

Materials and methods
pkmsp4 sequence data pkmsp4 sequence data were obtained from genomes of 32 clinical samples originating from Sarawak, Malaysian Borneo obtained from a previous genome study along with 4 long-time isolated lines originated from Peninsular Malaysia and Philippines (along with the H-strain, PKNH_0414100) [1,10] which were orthologous to P. vivax (PVX_003775). These 4 long-time isolated laboratory lines were maintained in rhesus macaques which were originally obtained from Peninsular Malaysia and Philippines in 1960 [10]. The original genome study was conducted with appropriate informed consent from patients and with clearance from ethical committees [10]. The accession numbersof the sequences along with the location of sample collection are listed in S1 Table. The genomes were downloaded from the European Nucleotide Archive (https://www.ebi.ac.uk/ena). Sequence data were aligned using the CLUSTAL-W program in MegAlignLasergene v 7.0 (DNASTAR). The signal peptide within the full-length PkMSP4 amino acid sequence was predicted using Signal IP-5.0 software [29].

Sequence diversity and natural selection
Sequence diversity (π), which is defined as the average number of nucleotide differences per site between two sequences was determined by DnaSP v5.10 software. Number of polymorphic sites, singleton sites (a nucleotide variant that appears only once in among the sequences), number of synonymous (silent mutations) and non-synonymous substitutions (replacement mutations or mutations leading to change in amino acids), number of haplotypes (H) and haplotype diversity (Hd) within the pkmsp4 sequences were determined by DnaSP v5.10 software [30].
Natural selection was determined at the intra and inter-species levels. At the intra-population level, natural selection was determined by calculating the rates of synonymous substitutions per synonymous site (dS) and non-synonymous substitutions per non-synonymous site (dN) as computed by using Nei and Gojobori's method and robustness was estimated by the bootstrap method with 1000 pseudo replicates as implemented in the MEGA 5.0 software [31]. Difference between dN and dS was determined by applying codon based Z-test (P <0.05) in MEGA software v.5 with 1000 bootstrap replications [31]. The Tajima's D, Fu & Li's D � and F � neutrality tests were performed as implemented in DnaSP v5.10 software. Tajima's D is expected to be 0 for a gene which is not under the influence of any selection pressure. When Tajima's D values are positive and significant, it indicates positive/balancing selection, whereas negative values suggest negative selection or population expansion. Significant positive values for Fu & Li's D � and F � also indicate population contraction due to a selection event while negative values indicate population expansion and excess of singletons. To test whether the pkmsp4 gene is under the influence of natural selection in the inter-species level, the robust McDonald and Kreitman(MK) test was performed with P. coatneyi (PCOAH_00008580) msp4 gene as an out-groups using DnaSP v5.10 software [30]. Graphical representation of nucleotide diversity and Tajima's D across the full-length pkmsp4 genes were conducted using the same software with window length 100 and step size 50 bp using.

Haplotype network analysis
Genealogical relationships between the pkmsp4 nucleotide haplotypes were constructed using the median-joining method with default parameters in NETWORK software (version 4.6.1.2, FluxusTechnology Ltd, Suffolk, UK). The analysis aimed to reconstruct haplotype networks of the entire set of P. knowlesi msp4 genes, with color-coded haplotypes for geographical origins. Straight lines connect pairs of haplotypes that differ by a single mutational step.

Schematic structure and polymorphism within PkMSP4
The schematic structure of the PkMSP4 protein based on the H-strain with 2 exons (Exon I, 280 bp, and Exon II, 235 bp), C-terminal single EGF-domain and GPI-anchored region is described in Fig 1. The signal peptide of the PkMSP4 protein was detected between amino acid positions 22 to 26 by the SignalP server S1

Nucleotide diversity and polymorphisms
Analysis of nucleotide alignment of 36 full-length pkmsp4 sequences (606 bp) revealed that there were 29 (4.7%) polymorphic sites, of which 13 were singleton sites and 16 were parsimony informative sites. The overall nucleotide diversity across the full-length gene was π = 0.007 ± SD 0.000 Table 1 which was higher compared to MSP4 orthologs in P. vivax and P. falciparum [23,25]. There were 29 SNPs (17 non-synonymous and 12 synonymous substitutions) across the full-length gene ( Table 2). These 29 SNPs led to 24 nucleotide haplotypes and the haplotype diversity was 0.975, which was higher compared to exon I and exon II ( Table 1). The sliding window analysis of nucleotide diversity across the full-length gene also indicated higher level of diversity in exon I compared to exon II which constituted the single EGF domain and the GPI-anchored region (Fig 3A). The diversity ranged from 0 to 0.28 while the higher values were towards the exon I Fig 3A. Domain wise analysis of the exons indicated that exon I had higher number (SNPs = 21) of polymorphic sites of which 16 non-synonymous (including five complex sites) and 5 synonymous sites, compared to exon II; SNPs = 9 (2 non-synonymous and 7 synonymous) sites ( Table 2 (Fig 2). The higher number of SNPs in exon I (with 8 singleton sites) led to a higher number of haplotypes, haplotype diversity and nucleotide diversity (H = 19, Hd = 0.946 and π = 0.012) compared to exon II (Table 1).

Natural selection
To determine whether natural selection contributes to the polymorphism in the pkmsp4 fulllength gene and at each exon, multiple tests were conducted both at the inter-as well as intraspecies levels. At the intra-species level, the full-length genes showed negative values were obtained for dN-dS, Tajima's D and Fu and Li's D � and F � values (Table 1) indicating purifying/negative selection and population expansion. However, values obtained were not significant. Independent tests for both exon I and II also showed similar results except for exon I which showed positive values for dN-dS = 0.23 (Table 1). Indeed this was obvious because of higher number of non-synonymous substitutions in Exon I ( Table 2), but most of these were due to low frequency singleton variable sites indicating negative natural selection and parasite population expansion. At the inter-species level, the robust MK test was performed with P. coatneyi msp4 gene as outgroup sequence. Test results showed significant negative natural selection for exon II (NI = 0.105, P < 0.05) which contained the EGF-domain and the GPI-    (Table 3). However, MK test for exon I showed NI = 1.28 but not significant (Table 3). Sliding window plot analysis of Tajima's D across the full-length pkmsp4 gene also indicated most values below 0 indicating purifying selection, however SNPs from 280-320 showed positive D values. (Fig 3B).

Haplotype network analysis
Genealogical haplotype network analysis identified two distinct population clusters based on the geographical origin; one cluster originating in the laboratory-adapted strains (i.e. H, Malayan, Philippine and MR4H) and the other sub-cluster was the clinical isolates from Sarawak, Malaysian Borneo (Fig 4). Shared haplotypes between parasite populations from Kapit, Betong and Sarikei (H_4) (Fig 4) indicating a common origin of parasites from the region. Similar findings with clinical isolates from Malaysian Borneo have been reported earlier with other merozoite surface proteins [26,28].The 24 pkmsp4 nucleotide haplotypes identified in this study are listed in S2 Table. Genetic differentiation within P. knowlesi populations  Table 4 indicating localized transmission within Sarawak.

Discussion
P. knowlesi has gained substantial research interest in recent years as a high proportion of human cases specifically from Malaysia and most Southeast Asian countries have been reported and it can cause high parasitemia in humans which in certain cases become severe disease and can be fatal [1]. Blood stage antigens localized at the merozoite surface play an important role in invasion into erythrocytes and these antigens are directly exposed to host immune response during merozoite egress and thus are considered excellent vaccine candidates. A candidate antigen should optimally possess low polymorphism to be efficacious across different geographical locations and avoid allele-specific immune response. Merozoite surface proteins (MSPs), specifically MSP4 is recognized as a potential vaccine candidate for P.  falciparum as it has been found to elicit a strong antibody response in patients and bound to RBCs [21,22]. Thus, in this study, we studied the level of polymorphism and natural selection of msp4 from clinical isolates of P. knowlesi from Sarawak, Malaysian Borneo and lab-adapted strains from Peninsular Malaysia and Philippines. We found low levels of polymorphism (SNP = 29) across the full-length pkmsp4 gene and the majority of the SNPs were localized toward the exon I. There were only 2 non-synonymous substitutions in exon II (P107S and G171A), of which the latter was a single amino acid change present in only one isolate from Malaysian Borneo indicating conserved function within the single EGF-domain. The 6 cysteine residues within the EGF-domain were conserved within the 36 isolates (including the lab adapted strains) indicating conserved functional activity. It is interesting to note that pvmsp4 and pfmsp4 show a similar pattern of higher polymorphism in exon I compared to exon II [23,24,33]. The GPI-anchored region was completely conserved in pkmsp4 clinical isolates from Sarawak and similar conservation was observed in pfmsp4 and pvmsp4 and these anchored regions were found to be essential for proper folding and immunogenicity of the whole protein [21]. The PkMSP4 protein did not possess any tandem repeat units as observed for PvMSP4 and PfMSP4 proteins [23,24]. Among all the MSPs studied till date in P. knowlesi, low polymorphisms towards the C-terminal EGF-domains have been shown in only a few antigens example msp1p [34].
We observed negative/purifying selection within the 36 pkmsp4 genes from Malaysia. We verified natural selection tests both at the inter-and intra-species levels and found significant tests results for the exon II which constituted the EGF and GPI-anchored domains indicating functional constraints. The natural selection results obtained in this study indicate that msp4 gene may not be exposed to host immune pressure however, graphical representation of Tajima's D values indicated some SNPs with positive values within Exon II. Thus, immunological and functional characterization would be necessary as ortholog in P. vivax and P. falciparum elicit a strong immune response in patient sera [19,22]. Haplotype data from both amino acid and nucleotide sequences showed a similar pattern of clustering of isolates based on geographical origin i.e. from Peninsular Malaysia and Malaysian Borneo. Indeed, the haplotype network tree generated using the 24 nucleotide haplotypes showed geographical clustering more clearly, as observed for other MSPs of P. knowlesi [34].
Population differentiation index F ST based on pkmsp4 showed high genetic differentiation values between the long-term isolated laboratory-adapted strains and Sarawak, Malaysian Borneo and this was due to the geographical distance between the two regions, which is separated by the South China Sea. These results indicate localized transmission in Sarawak, Malaysian Borneo. However, a vaccine designed based on the low polymorphic pkmsp4 domains could still be effective for all sub-populations. Thus, further characterization through genetic as well as immunological studies is necessary.

Conclusions
The present study is the first to investigate genetic diversity and natural selection of the pkmsp4 gene from clinical samples of Sarawak and laboratory-adapted strains. Low level of genetic diversity was observed across the gene with only two non-synonymous substitutions within the EGF-domain. Overall, the gene was under negative/purifying natural selection, however, certain regions in Exon II showed high Tajima's D values thus could be under balancing selection. Distinct population structure was observed and high genetic differences between parasites populations originating from Peninsular Malaysia and Malaysian Borneo was noted indicating absence of gene flow between the two regions. Further genetic studies with a higher number of clinical isolates (specifically form Peninsular Malaysia) as well as immunological studies characterizing the functional domains would be necessary to validate pkmsp4 as a potential vaccine candidate for P. knowlesi.