Whole genome sequencing and phylogenetic analysis of human metapneumovirus strains from Kenya and Zambia

Background Human metapneumovirus (HMPV) is an important cause of acute respiratory illness in young children. Whole genome sequencing enables better identification of transmission events and outbreaks, which is not always possible with sub-genomic sequences. Results We report a 2-reaction amplicon-based next generation sequencing method to determine the complete genome sequences of five HMPV strains, representing three subgroups (A2, B1 and B2), directly from clinical samples. In addition to reporting five novel HMPV genomes from Africa we examined genetic diversity and sequence patterns of publicly available HMPV genomes. We found that the overall nucleotide sequence identity was 71.3 and 80% for HMPV group A and B, respectively, the diversity between HMPV groups was greater at amino acid level for SH and G surface protein genes, and multiple subgroups co-circulated in various countries. Comparison of sequences between HMPV groups revealed variability in G protein length (219 to 241 amino acids) due to changes in the stop codon position. Genome-wide phylogenetic analysis showed congruence with the individual gene sequence sets except for F and M2 genes. Conclusion This is the first genomic characterization of HMPV genomes from African patients.


Background
Human metapneumovirus (HMPV) is a single-stranded RNA virus in the family Paramyxoviridae and closely related to human respiratory syncytial virus (RSV) [1]. HMPV causes respiratory disease similar to RSV, ranging from mild upper respiratory infection to bronchiolitis and pneumonia [2]. HMPV infections are seasonal and coinfection with other respiratory pathogens is common [1]. The HMPV genome is approximately 13 kb and comprises eight open reading frames (ORFs) encoding nucleoprotein (N), phosphoprotein (P), matrix protein (M), fusion glycoprotein (F), transcription enhancer protein (M2), small hydrophobic protein (SH), attachment glycoprotein (G), and large polymerase protein (L) [3]. The membrane glycoproteins F and G sequences are used to define two major genotypes or groups, A and B, which are further classified into four subgroups (A1, A2, B1, and B2). HMPV A2, the most frequently observed subgroup, is further divided into two proposed sub-lineages (A2a and A2b) [3].
HMPV is reported to have an important contribution to acute respiratory infections (ARI) in Africa. For instance, HMPV-associated hospitalization was estimated at 6.5 per 1000 person years in infants in Soweto, South Africa [4]; at 4% in hospitalized children with severe ARI during a 2-year period in Cameroon [5]; and in rural western Kenya, incidence of HMPV associated with ARI cases in outpatient clinic visits was estimated at 0.43 per 100 person-years among outpatients [6]. In Kilifi coastal Kenya, between January 2007 to December 2011, children under 6 months of age accounted for 44% of HMPV positive cases, while 74% were children under 1 year, and 1.3% (2/160) were children > 36 months [7]. In Dadaab and Kakuma refugee camps in Kenya, HMPV was detected in 5.7% hospitalizations, and virus-positive crude hospitalization rate (per 1000 children < 5 years old) was 4 for HMPV [8]. In Mali, contribution of HMPV to pneumonia had a population attributable fraction of 9% (95% CI: 7-11%) [9]; while in Morocco [10], 8.9% of children < 5 years admitted with severe pneumonia were infected with HMPV. HMPV prevalence and incidence elsewhere globally, is indicated in Additional file 4: Table S1. Of note is that the variations in incidence rates could be attributed to study population, seasonality and even detection methods. Nonetheless, genomic epidemiology of HMPV in Africa is inadequately reported, and comparison of genetic similarity and differences between African and global strains is not documented.
Genome sequences provide valuable resources for characterizing viral evolution and disease epidemiology, and for identifying transmission events and outbreaks, which is not always possible with sub-genomic fragments [11][12][13]. The increased number of phylogenetically informative variant sites obtained from full genomes may allow better linking of cases and aid public health interventions in real time during epidemics [14,15]. PCR approaches for targeted whole genome sequencing, in contrast to random amplification, can preferentially amplify the target virus over host or environmental nucleic acids [16,17] potentially focusing sequencing on the virus of interest. To date, the largest dataset of HMPV whole genomes (n = 61) sequenced from any tropical country is from three Peruvian cities, Lima, Piura and Iquitos [18]. In Africa, apart from one metapneumovirus genome identified from a wild mountain gorilla in Rwanda (GenBank accession number HM197719), there are no HMPV genomes reported according to the NIAID Virus Pathogen Database and Analysis Resource (ViPR, http://www.viprbrc. org/, accessed April 30, 2019). This has led to limited understanding of the genetic and genomic diversity of HMPV in the continent.
This work describes a whole genome sequencing (WGS) approach for HMPV from a small number of HMPV positive clinical samples collected at Kilifi County Hospital in Kilifi, Kenya and University Teaching Hospital in Lusaka, Zambia. The genomes were generated by sequencing overlapping PCR amplicons spanning the entire genome. These are the first reported complete genome sequences of locally circulating HMPV strains obtained directly from clinical samples in Africa. We also combined the new genomes with publicly available sequences to examine patterns in global HMPV genetic diversity.

Genome characteristics
Whole genome sequencing was successful for all 5 clinical samples that were attempted. A single genomic sequence was obtained from each sample, and the length of the 5 new HMPV genomes ranged from 13,097 to 13, 134 nt (> 95% length coverage). Sequencing and data assembly parameters, including coverage depth are shown in Table 1.
For the 143 HMPV genomes, we checked sequence conservation at transcriptional control regions, at the termini of each gene, as well as the lengths of intergenic sequences between gene boundaries. The length of the F-M2 intergenic region was different between group A and B viruses, that is, 13 nt and 2 nt, respectively. The SH-G and G-L intergenic regions were the longest, up to 125 nt and to 190 nt, respectively. Consensus nucleotides (9 to 19 length) at the putative start and end regions flanking the ORF of the viral genes are shown in Fig. 1. The gene-start and -end regions of N and P were conserved (> 90% average pairwise identity) in both HMPV groups, and the M2 and M gene-start and -end were also conserved in HMPV group A and B, respectively. The putative ATG start codon was consistently located at positions 14-16 upstream of a gene start motif (consensus: GG/AGAC/TAAA/GTnnnnATG), except for the internal M2-2. An additional ATG start codon upstream of the gene-start motif was observed in the SH gene for the B1 and B2 strains. In five of the eight annotated genes (N, P, F, M2, and G (B1 and B2 strains only)), the intergenic regions were short and the ORFs for these 5 genes terminated within the propositioned gene-end motifs.

Sequence diversity and phylogenetic relationships
We combined the five genome sequences from Kenya and Zambia with available global sequences, aligned individual genes and calculated the percent nucleotide (nt) and amino acid (aa) identity ( Table 2).
The coding sequences of N, M, F, M2-1, M2-2, and L genes were conserved at nucleotide and amino acid levels, by sharing > 85% between-subgroup nucleotide identity and 90% protein identity ( Table 3). The nucleoprotein gene was the most conserved among all subgroups at the nt and aa levels. SH and G glycoprotein genes were more divergent between the HMPV subgroups at the nucleotide level with 76 and 63% identity, respectively. The SH protein length was variable between group A and B strains due to a nucleotide substitution (CAA ➔ TAA) at gene position 532 in group B, resulting in protein lengths of 178 and 180 aa, respectively. The predicted G protein length also varied among the different HMPV subgroups, between 219 and 241 aa, due to different positions of the Stop codon. Amino acid sequence diversity for G and SH glycoproteins is depicted in Fig. 2 and Additional file 2: Figure S2, respectively. The diversity of the complete nucleotide sequences of SH and G genes is depicted in phylogenetic trees in Fig. 3.
We evaluated phylogenetic classification and relationship between the 5 new genomes obtained in this study and previously published genomes (Fig. 3). Full genome classification was consistent with that based on partial genomic fragments (F and G genes). Two genomes from the samples collected in Kenya (HMPV/03/KEN/2013) and (HMPV/01/KEN/2015), clustered closely to viruses  Figure S3. There was phylogenetic congruence with the individual gene sequence sets as with the full genome dataset, except for F and M2 gene (Additional file 3: Figure S3).

Sequence diversity at rRT-PCR target region
Variant or drifted viral strains may lower the sensitivity of detection resulting in a decreased quantitation of the viral load and underestimation of disease incidence [19]. We checked the new HMPV genomes for nucleotide differences in the genomic regions targeted by our diagnostic rRT-PCR primers and probes (Additional file 7: Table S4) used for HMPV detection. Up to eight primer-and probe-template mismatches were identified (Fig. 4): one mismatch in the forward primer region in HMPV group A (F gene-based rRT-PCR assay, Fig. 4a); one mismatch in each of the forward and probe target regions in group B (F gene-based rRT-PCR assay, Fig. 4b); and 5 different mismatches with the N-gene based rRT-PCR assay (Fig. 4c). Note, the F gene-based rRT-PCR assays are different or specific to the two HMPV groups.

Discussion
HMPV causes respiratory illness presenting as mild upper respiratory tract infection or life-threatening severe bronchiolitis and pneumonia primarily in children, sometimes adults as well as immunocompromised individuals [2]. However, HMPV genome sequence data from Africa is sparse and information on genome-wide diversity is limited. In the present study, the whole genome sequences of five HMPV strains from Kenya and Zambia were determined and compared with the genomes published previously from around the world. Comparative sequence analysis indicated fairly conserved positioning of the gene-start and -end regions as well as translational start and -end codons. Variation in genestart and -end sequences can have significant impact on transcription initiation and termination efficiency so that there is more selective pressure preventing changes in these regions [20], and this likely explains our observation. The additional ATG start codon found upstream of the gene-start motif of the SH gene was consistent with a previous report [21], though its role in gene expression is yet to be identified.
These observed sequence conservation in N, M, F, M2-1, M2-2, and L genes is not unusual and is suggestive of functional and structural constraints on diversity, but less expected of the F gene because of its status as a neutralization and protective antigen, similar to its close 'relative' RSV [22]. It has also been suggested that the low diversity in F gene might make a substantial contribution to cross-neutralization and cross-protection between the HMPV subgroups [21]. The relatively high frequency of amino acid diversity in G (and to a lesser extent SH) could be attributable to selective pressure for amino acid change coming from host immunity; and the  ability of the protein to tolerate substitutions, which might be due to its proposed extended, unfolded nature [22]. The phylogenetic incongruence observed between whole genome tree and the F and G gene trees, is as reported previously for HMPV [23], and could be attributed to differential rates of evolution, selection pressure or past recombination events [24]. The prevalence of HMPV in hospitalized pediatric population in Kilifi county in coastal Kenya has been reported [7,25]. However, it is notable that in recent years, HMPV has been detected at low prevalence in Kilifi (unpublished observations from hospital-based pneumonia surveillance). Whether this low prevalence is due to reduced virus transmission, or decreased sensitivity of our HMPV molecular diagnostic assay due to progressive primer/probe mismatches, is yet to be established.

Conclusion
We present the first full genome sequences of circulating HMPV strains from sub-Saharan Africa. A limitation of our sequencing method, as is common with amplicon sequencing protocols [26,27], was absent coverage at the 3′ leader and 5′ trailer regions not captured by these primers. Our results demonstrate the application of amplicon sequencing to generate full length HMPV genomes directly from clinical samples. The observed diversity of the individual genes is comparable to that described previously [20][21][22]. This method and data provide a useful reference for design of local molecular diagnostics and for studies aimed at understanding HMPV epidemiology and evolution in Africa.

HMPV detection and genotype assignment
Nasopharyngeal and oropharyngeal (NP-OP) swab samples were collected from children (1-59 months) hospitalized with pneumonia, four of whom were enrolled in the PERCH study [18] in 2012. The fifth sample was collected from a child enrolled in the routine pneumonia surveillance study at Kilifi County Hospital, Kenya, in 2015. The samples were tested for HMPV by multiplex semi-quantitative real-time reverse transcription PCR (rRT-PCR) assays. The rRT-PCR primers and probes used, cycling conditions and assay set up have been Fig. 2 Consensus nucleotide sequences of putative gene-start (13 nucleotides upstream of ATG codon) and gene-end signals (6-16 nucleotides from Stop codon) visualized as sequence logos, for HMPV group (a) and (b). The height of each character in the sequence logo plots is proportional to its relative frequency. The green color on the bar at the bottom of the consensus sequence logo indicates 100% average pairwise identity, brown indicates at least 30 to < 100% identity and red indicates < 30% identity described elsewhere [28,29]. Fusion (F) and glycoprotein (G) encoding genes of the HMPV positive samples were amplified in a one-step RT-PCR assay (OneStep RT-PCR kit, QIAGEN), as described previously [7]. Partial G or F nucleotide sequences were analyzed by maximum likelihood (ML) phylogenetic trees using IQ-TREE [30], together with reference strains of HMPV subgroups (accession numbers AF371337.2, FJ168779, AY297749, AY530095, JN184401 and AY297748). Five HMPV positive samples from the Kenya and Zambia study sites, belonging to the A2a (n = 1), A2b (n = 2), B1 (n = 1) and B2 (n = 1) genetic subgroups based on their G and F gene sequences, were selected for whole genome sequencing. Data on age, sex and clinical assessment information collected at the time of sample collection, for the five selected samples, are shown in Table 3.

Whole genome sequencing
The sequencing protocol consisted of four steps as follows: (i) primer design, (ii) preparation of primer mixes, (iii) cDNA and PCR (iv) Illumina sequencing and data analysis.

Preparation of HMPV Tm48 full genome primers
All human metapneumovirus (HMPV) full genome sequences were retrieved from GenBank (January 2018) using the query (txid162145 (Organism) AND 12000(SLEN): 14000(SLEN) NOT patent). Sequence entries with gaps larger than 6 nt were excluded to generate a set of yielding 178 genomes. All possible 23 nt sequences were generated from the genomes dataset and trimmed to a final calculated melting temperature (Tm) of 47.9-49.5°C. Sequences with homology to rRNA sequences, with GC content outside < 0.3 or > 0.75 or with a single nucleotide fractional content of > 0.6 were discarded. The primer set was then made nonredundant yielding 60,746 potential primers. All potential primers were mapped against the 178 HMPV full genomes and the number of perfect matches (frequency score) was determined as a measure of primer sequence conservation. To select primers, the HMPV genome sequences were divided into amplicons with 222 nt overlap spanning the virus genome. Potential primers that mapped within the terminal 5′ and 3′ 222 nt of each amplicon were identified and the sequence with the highest frequency score was selected, and primers mapping to the reverse bins were reverse complemented. In this manner, 24 primers were selected for each of the 4 HMPV genotype representative genomes (GenBank accession number HMPV A1: AF371337, HMPV A2: FJ168779; HMPV B1: AY525843, and HMPV B2: FJ168778). Because of conservation between genotypes, there was primer redundancy which was removed. The final set of 65 primer sequences, their lengths, calculated Tm, fractional GC content and mapping position on the HMPV genome are presented in Additional file 5: Table S2. The primers were computationally tested against each of the 4 HMPV subgroups. A graphical representation of the primer target sites is presented in Additional file 1: Figure S1.

Preparation of primer mixes
Amplification was performed in two reactions. To avoid generating small products from adjacent forward and reverse primers, amplicons were assigned to alternate reactions, with reaction 1 containing primers for   Table S3). Bootstrap support values (evaluated by 1000 replicates) are indicated along the branches. Genetic subgroups A1, A2a, A2b, B1, and B2, are indicated. Multiple sequence alignment was done using MAFFT and the ML phylogeny inferred using GTR + Γ nucleotide substitution model and ultrafast bootstrap approximation in IQ-TREE. The genotype B2 Sabana strain sequence (GenBank accession number HM197719) reported from a wild mountain gorilla in Rwanda is marked in blue. The scaled bar indicates nucleotide substitutions per site

Illumina sequencing and data analysis
Libraries were prepared using Nextera XT kit (Illumina) and pair-end sequencing (2 × 300 base pairs) with the MiSeq Reagent V3 kit (Illumina), following the manufacturer's instructions. The Nextera enzyme mix was used to simultaneously fragment input DNA and tag with universal adapters in a single tube reaction, followed by 12-cycle PCR reaction for dual indexing. Agencourt AMPure XP beads (Beckman Coulter) were used for all purification steps and libraries were quantified and quality-checked using the Qubit (Thermo Fisher) and Bioanalyzer (Agilent). Adapter trimming, quality  Table S4 filtering, kmer normalization of sequencing reads, de novo assembly, calculation of mean genome coverage was as previously described [31].

Phylogenetic analyses
A dataset of HMPV genome sequences was retrieved from ViPR in order to infer relationship between HMPV viruses from Kenya and Zambia and viral populations sampled globally. The dataset included 138 sequence entries (> 13,000 nt) that included date (year) and location of sample collection (Additional file 6: Table S3). Sequence alignment was done using MAFFT v.7.221 [32] using the parameters 'localpair -maxiterate 1000'. IQ-TREE was used to infer maximum likelihood (ML) trees of the complete genome and individual genes under general time-reversible (GTR) substitution model with gamma-distributed among-site rate heterogeneity. A summary of the methodology outlined here is depicted in Fig. 5.

Availability of data and materials
The assembled sequences for the five genomes from Kenya and Zambia are available in the GenBank nucleotide database with accession numbers MK588633 to MK588637, and the raw sequence data are available in the NCBI SRA archive as BioProject PRJNA523302. The datasets and scripts used in analysis are available on Harvard Dataverse site (doi:https://doi.org/10. 7910/DVN/TVB65V).

Ethics approval and consent to participate
The Institutional Review Board (IRB) at Johns Hopkins Bloomberg School of Public Health (JHSPH) reviewed the overall clinical protocol for the study. The protocol was then adapted at the Kenyan and Zambian sites and reviewed by the local ethic committees. For Kenya, the KEMRI Scientific and Ethics Review Unit (SERU, KEMRI-Wellcome Trust Research Program, Kilifi, Kenya) and Oxford Tropical Research Ethics Committee (University of Oxford, UK) approved the study protocol (IRB numbers 1526 and 60-09, respectively). For Zambia, the ERES Converge (University Teaching Hospital, Lusaka, Zambia) and Boston University Medical Campus and Boston Medical Center Institutional Review Board (Boston University School of Public Health, Boston, MA, USA) approved the study at the University Teaching Hospital (IRB numbers 00005948 and H-29860, respectively). Parents or guardians of participants from whom the respiratory specimens were derived provided written informed consent to participate in the study.

Consent for publication
Written informed consent for publication of the participants' clinical details was obtained from the parents or legal guardians of each participant.