Streptococcus pneumoniae genomic datasets from an Indian population describing pre-vaccine evolutionary epidemiology using a whole genome sequencing approach

Globally, India has a high burden of pneumococcal disease, and pneumococcal conjugate vaccine (PCV) has been rolled out in different phases across the country since May 2017 in the national infant immunization programme (NIP). To provide a baseline for assessing the impact of the vaccine on circulating pneumococci in India, genetic characterization of pneumococcal isolates detected prior to introduction of PCV would be helpful. Here we present a population genomic study of 480 Streptococcus pneumoniae isolates collected across India and from all age groups before vaccine introduction (2009–2017), including 294 isolates from pneumococcal disease and 186 collected through nasopharyngeal surveys. Population genetic structure, serotype and antimicrobial susceptibility profile were characterized and predicted from whole-genome sequencing data. Our findings revealed high levels of genetic diversity represented by 110 Global Pneumococcal Sequence Clusters (GPSCs) and 54 serotypes. Serotype 19F and GPSC1 (CC320) were the most common serotype and pneumococcal lineage, respectively. Coverage of PCV13 (Pfizer) and 10-valent Pneumosil (Serum Institute of India) serotypes in age groups of ≤2 and 3–5 years was 63–75 % and 60–69 %, respectively. Coverage of PPV23 (Merck) serotypes in age groups of ≥50 years was 62 % (98/158). Among the top five lineages causing disease, GPSC10 (CC230), which ranked second, is the only lineage that expressed both PCV13 (serotypes 3, 6A, 14, 19A and 19F) and non-PCV13 (7B, 13, 10A, 11A, 13, 15B/C, 22F, 24F) serotypes. It exhibited multidrug resistance and was the largest contributor (17 %, 18/103) of non-vaccine types (NVTs) in the disease-causing population. Overall, 42 % (202/480) of isolates were penicillin-resistant (minimum inhibitory concentration ≥0.12 µg ml−1) and 45 % (217/480) were multidrug-resistant. Nine GPSCs (GPSC1, 6, 9, 10, 13, 16, 43, 91, 376) were penicillin-resistant and among them six were multidrug-resistant.
Pneumococci expressing PCV13 serotypes had a higher prevalence of antibiotic resistance. Sequencing of pneumococcal genomes has significantly improved our understanding of the biology of these bacteria. This study, describing the pneumococcal disease and carriage epidemiology pre-PCV introduction, demonstrates that 60–75 % of pneumococcal serotypes in children ≤5 years are covered by PCV13 and Pneumosil. Vaccination against pneumococci is very likely to reduce antibiotic resistance. A multidrug-resistant pneumococcal lineage, GPSC10 (CC230), is a high-risk clone that could mediate serotype replacement.

protocols have been provided within the article or through supplementary data files.

BACKGROUND
Streptococcus pneumoniae is a human nasopharyngeal commensal and a respiratory pathogen causing a spectrum of diseases ranging from mild respiratory illness (e.g. otitis media) to severe diseases (e.g. pneumonia and meningitis) [1]. In 2015, India was estimated to have the highest burden of pneumococcal deaths [2]; India, Nigeria, Democratic Republic of the Congo and Pakistan accounted for 50 % of all pneumococcal deaths. In India, 68 700 [uncertainty range (UR) 44 600–86 000] pneumococcal deaths were estimated to have occurred in children aged 1–59 months. Severe pneumococcal disease in India manifests primarily as severe pneumonia. There were 1.6 million (UR 1.2–1.8) estimated cases of severe pneumococcal pneumonia in 2015, accounting for more than 97 % of all severe pneumococcal disease [2]. The recent roll-out (May 2017) of pneumococcal conjugate vaccine (PCV) in the national infant immunization schedule is expected to contribute to reductions in this disease burden [3]. Pneumococcal vaccination is not recommended for healthy adults under the age of 65 years in India. Vaccination with PPV23 in adults above 65 years of age is recommended because of the overall higher incidence of invasive pneumococcal disease in this age group [3][4][5][6][7].
Changes in epidemiology and population structure are likely to follow vaccine introduction [4]. Understanding the changes requires reproducible and robust molecular typing methods. Molecular typing of S. pneumoniae helps to delineate the genetic structure of bacterial populations and infer evolutionary relationships between isolates. Whole-genome sequencing (WGS) with its high discriminatory power has become a feasible tool for bacterial typing, given steadily decreasing associated costs [8,9].
With 20 027 pneumococcal genomes sequenced, the Global Pneumococcal Sequencing project (GPS, http://www.pneumogen.net/gps/) defined 621 pneumococcal lineages, named Global Pneumococcal Sequence Clusters (GPSCs). This has contributed to the increased understanding of the pneumococcal population structure globally and provided further information on the distribution of serotypes and antibiotic resistance [10]. As part of the GPS project, we analysed the WGS data of invasive and carriage pneumococcal isolates (n=480) from Indian adults and children over an 8-year period (2009–2017) before the introduction of 13-valent PCV (PCV13) in the national infant immunization programme (NIP). Serotype distribution, antibiotic resistance and capsular switching in this sample of pneumococcal isolates are reported and discussed.

Pneumococcal isolates
We collected pneumococcal isolates from 14 regions in India through PNEUMONET [11] and the multicentric PIDOPS project [12] between 2009 and 2017. PNEUMONET and the PIDOPS project targeted routine collection of disease and carriage pneumococcal isolates among all age groups across sentinel sites. The isolates were collected prior to the introduction of PCV13 (May 2017) in the NIP. The isolates were stored at the Central Research Laboratory, KIMS, Bangalore, for further analyses.
The collection consisted of 480 pneumococcal isolates, including carriage isolates (n=186) and disease isolates (n=294) (Fig. 1, Table S1, Table S8, available in the online version of this article). The disease isolates were collected across all sampling sites except for Kharagpur, while carriage isolates were from Bangalore, Delhi, Hyderabad, Kharagpur, Mumbai and Pondicherry. The disease isolates were recovered from blood culture (n=226), cerebrospinal fluid (n=36), pleural fluid (n=7) and other sources such as eye swabs, ascitic fluid and lung abscess (n=25).

Impact Statement
This study provides a detailed report of the population genetic structure of a collection of pneumococcal disease and carriage isolates from children and adults in India. It provides genomic data to understand the prevalence of serotypes, pneumococcal lineages and antimicrobial resistance prior to vaccine introduction, so as to enable future studies to assess these changes after the roll-out of vaccines. This study also highlights a high-risk clone, GPSC10 (CC230), that could potentially evade PCV13. The current findings demonstrate the usefulness of genomic surveillance in understanding pneumococcal epidemiology and evolution so as to inform disease prevention.

Genome sequencing and analyses
The pneumococcal isolates were subjected to WGS on an Illumina HiSeq platform to produce paired-end reads with an average length of 151 bases, and the raw data were deposited in the ENA (Table S8). WGS data were processed as previously described [10]. Briefly, we derived the serotype using SeroBA [13] and multilocus sequence types (MLSTs) using MLSTcheck [14]. We defined MLST clonal complexes (CC) as STs with single locus variant (SLV) differences, within the GPS dataset as previously described [10]. Antibiotic resistance profiles and presence of pili were predicted using the CDC pipeline from genome data [15][16][17][18]. The CDC pipeline script and reference database are deposited at https://github.com/BenJamesMetcalf/Spn_Scripts_Reference. In silico predictions of both serotypes and antibiotic resistance were compared with available phenotypic testing results and showed high concordance [10]. Multidrug resistance (MDR) was defined as resistance to three or more classes of antibiotics. The population structure was defined by assigning a GPSC to each isolate using a k-mer-based clustering method, PopPUNK [19], and a reference list of pneumococcal genomes (n=34 780) that is available at https://www.pneumogen.net/gps/assigningGPSCs.html. Phylogenetic analysis was performed on all isolates by constructing a maximum-likelihood tree using FastTree version 2.1.10, which uses heuristics to restrict the search for better trees and estimates a rate of evolution for each site [20]. The phylogeny was based on SNPs extracted from an alignment generated by mapping reads to the reference genome of S. pneumoniae ATCC 700669 (NCBI accession number FM211187) using Smalt, version 0.7.4, with default settings [21]. The metadata and analysis results can be interactively visualized online using Microreact at https://microreact.org/project/GPS_India.
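The MDR definition used in the sequencing methods (resistance to three or more classes of antibiotics) can be illustrated with a minimal sketch. The drug-to-class mapping and the function names below are hypothetical and purely illustrative; they are not taken from the CDC pipeline, whose actual outputs and class groupings may differ.

```python
from typing import Dict, Iterable, Set

# Hypothetical mapping from antibiotic to antibiotic class (illustrative only;
# not the grouping used by the CDC pipeline).
ANTIBIOTIC_CLASS: Dict[str, str] = {
    "penicillin": "beta-lactam",
    "cefotaxime": "beta-lactam",
    "erythromycin": "macrolide",
    "tetracycline": "tetracycline",
    "cotrimoxazole": "folate pathway inhibitor",
    "chloramphenicol": "phenicol",
}


def resistant_classes(resistant_drugs: Iterable[str]) -> Set[str]:
    """Collapse a list of drugs an isolate resists into distinct classes."""
    return {ANTIBIOTIC_CLASS[drug] for drug in resistant_drugs}


def is_mdr(resistant_drugs: Iterable[str], min_classes: int = 3) -> bool:
    """Return True if resistance spans at least `min_classes` distinct classes."""
    return len(resistant_classes(resistant_drugs)) >= min_classes


if __name__ == "__main__":
    # Two beta-lactams plus a macrolide cover only two classes.
    print(is_mdr(["penicillin", "cefotaxime", "erythromycin"]))
    # Three drugs from three different classes.
    print(is_mdr(["penicillin", "erythromycin", "tetracycline"]))
```

Counting classes rather than individual drugs matters here: an isolate resistant to several beta-lactams alone would not meet the definition.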

Capsular switching
Histories of capsular switching were inferred for isolates in the Indian dataset that shared an identical ST but expressed different serotypes. For each such ST, we examined the genetic relatedness of isolates in a lineage-specific phylogeny. The lineage-specific phylogeny was constructed using GPS published isolates belonging to this ST and other related STs within a GPSC [22]. Including the GPS isolates from other countries provided a global context so as to better understand if the observation was a result of (1) an in-country capsular switching (isolates from India clustered together in the global phylogeny) or (2) importations of isolates with identical ST but different serotype from other countries (isolates from India did not cluster together but clustered with isolates from other countries). In brief, for each GPSC, the lineage-specific phylogeny was built from a recombination-free SNP alignment. This alignment was created by first mapping reads to a lineage-specific reference genome using Burrows-Wheeler Aligner (BWA) version 0.7.17-r1188, then removing recombination regions and extracting SNPs for tree reconstruction using Gubbins version 2.4.1 [23,24].
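The first screening step described above, finding STs that are associated with more than one serotype, can be sketched as follows. This is only an illustration of the grouping logic: the record layout, the function name and the isolate identifiers and ST numbers in the example are hypothetical, and the phylogenetic follow-up (mapping, recombination removal, tree building) is not reproduced here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Each record is (isolate_id, sequence_type, serotype); this layout is an
# assumption made for the sketch, not the format of the study's metadata.
Record = Tuple[str, int, str]


def candidate_switch_sts(isolates: Iterable[Record]) -> Dict[int, Set[str]]:
    """Return STs that carry more than one serotype, with their serotypes."""
    serotypes_by_st: Dict[int, Set[str]] = defaultdict(set)
    for _isolate_id, st, serotype in isolates:
        serotypes_by_st[st].add(serotype)
    # Keep only STs with at least two distinct serotypes.
    return {st: sero for st, sero in serotypes_by_st.items() if len(sero) > 1}


if __name__ == "__main__":
    # Made-up isolates and ST numbers, for illustration only.
    toy = [
        ("iso1", 9001, "19A"),
        ("iso2", 9001, "24F"),
        ("iso3", 9002, "4"),
        ("iso4", 9002, "4"),
    ]
    print(candidate_switch_sts(toy))
```

Each ST returned by such a screen is only a candidate; distinguishing in-country switching from importation still requires the lineage-specific phylogeny.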

Prevalence of pneumococcal serotypes
Serotype prediction from WGS data revealed 54 serotypes plus one isolate identified as non-typeable. Stratified by carriage and disease, the number of serotypes was 41 and 48, respectively. Thirteen serotypes (1, 7F, 8, 20, 33F, 2, 38, 12F, 27, 45, 25A, 25F, 9L, in decreasing order of prevalence) were only found among disease isolates, while seven serotypes (6D, 7C, 19B, 48, 9N, 18B, non-typeable, in decreasing order of prevalence) were only detected among carriage isolates. Examination of the cps region (flanked by dexB and aliA) of the non-typeable isolate showed an insertion of a surface protein, NspA, which was previously described by Salter et al. [25]. There was no significant difference in serotype prevalence between children aged under 5 years and adults aged 50 and above (Tables S6 and S7).
The top ten serotypes were PCV13 serotypes, except for serotypes 15B/C and 24A/F among disease isolates and except for 11A, 22F and 28A in carriage isolates (Fig. 2). Overall, serotype 19F was the most prevalent serotype among both disease (13 %, 38/294) and carriage (13 %, 24/186) isolates. It is of note that serotype 4, which was significantly associated with high invasive disease potential [10], was found to be carried by eight children aged <9 years in the Pondicherry area in 2014. They all belonged to a single ST205 (GPSC27), suggesting local clonal transmission. Four serotype 4 isolates were also detected among disease isolates in Pondicherry (n=1, GPSC27) and Bangalore (n=3, GPSC86). Rare, but invasive, serotypes [...] Fig. S2).
Table 2. The five most prevalent pneumococcal lineages and their associated serotypes in a collection of disease (n=294) and carriage (n=184) isolates from India. The non-PCV13 serotypes are underlined. GPSC, Global Pneumococcal Sequence Cluster; CC, clonal complex.
The top five pneumococcal lineages and their associated serotypes are summarized in Table 2 (Fig. 3).

Antibiotic resistance
The predicted prevalence of antibiotic resistance in disease-causing and carriage populations is shown in Table 4. Disease-causing isolates had a higher prevalence of resistance to three beta-lactam antibiotics (penicillin, meropenem, cefuroxime) while chloramphenicol and cotrimoxazole resistance was higher among carriage isolates (Table 4). In both populations, vaccine-type (VT) isolates had a significantly higher prevalence of resistance to all six beta-lactam antibiotics (penicillin, amoxicillin, meropenem, cefotaxime, ceftriaxone and cefuroxime) (Tables S2 and S3). In the disease-causing population, VT isolates also showed a higher prevalence of resistance to macrolide and cotrimoxazole and of multidrug resistance (Table S2). Among the non-PCV13 serotypes in the disease-causing population, serotypes 24F, 10A and 15B showed multidrug resistance.
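A comparison of resistance prevalence between two groups of isolates (for example VT versus non-VT) reduces to a test on a 2×2 table of resistant and susceptible counts. The text does not state which statistical test was applied; the sketch below uses a two-sided Fisher's exact test as one common choice, which is an assumption, and the counts in the example are invented for illustration rather than taken from Table 4 or the supplementary tables.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Rows could be, e.g., VT and non-VT isolates; columns resistant and
    susceptible. The p-value sums the hypergeometric probabilities of all
    tables (with the same margins) that are no more likely than the observed.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


if __name__ == "__main__":
    # Invented counts: 30/50 resistant in one group versus 10/50 in the other.
    print(fisher_exact_two_sided(30, 20, 10, 40))
```

An exact test is convenient for sparse serotype-level tables, where expected counts can be too small for a chi-squared approximation.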

DISCUSSION
This study analysed the genetic lineages underlying both disease-causing and carriage isolates of pneumococci in India pre-PCV introduction using a WGS approach. The predicted serotype distribution of the isolates based on the sequences of their capsular genes revealed a wide variety of capsular types. Of particular importance is the distribution of serotypes among children <2 years old, as 10-valent Pneumosil (Serum Institute of India) is being considered for expansion in the NIP to cover all children. Our data suggest that the existing (PCV13) and proposed (Pneumosil) vaccines for use in India will cover 63 % and 60 % of serotypes among children <2 years of age, respectively.
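Vaccine serotype coverage of the kind quoted above is a simple proportion: the number of isolates whose serotype is in a vaccine formulation divided by the number of isolates in the group. The sketch below assumes coverage is counted per isolate, as the abstract's 98/158 figure for PPV23 suggests. The serotype sets are the commonly cited PCV13 and Pneumosil formulations and are not listed in this article, so they should be checked against product documentation; the function name and example isolates are hypothetical.

```python
from typing import FrozenSet, Iterable

# Commonly cited formulations (assumption: not stated in this article).
PCV13: FrozenSet[str] = frozenset(
    {"1", "3", "4", "5", "6A", "6B", "7F", "9V", "14", "18C", "19A", "19F", "23F"}
)
PNEUMOSIL: FrozenSet[str] = frozenset(
    {"1", "5", "6A", "6B", "7F", "9V", "14", "19A", "19F", "23F"}
)


def coverage(serotypes: Iterable[str], formulation: FrozenSet[str]) -> float:
    """Fraction of isolates whose serotype is included in the formulation."""
    observed = list(serotypes)
    if not observed:
        raise ValueError("coverage is undefined for an empty set of isolates")
    covered = sum(1 for serotype in observed if serotype in formulation)
    return covered / len(observed)


if __name__ == "__main__":
    # Made-up serotype calls for four isolates, for illustration only.
    toy = ["19F", "19F", "3", "24F"]
    print(f"PCV13: {coverage(toy, PCV13):.0%}")
    print(f"Pneumosil: {coverage(toy, PNEUMOSIL):.0%}")
```

The same arithmetic underlies the reported PPV23 coverage in adults aged ≥50 years, where 98 covered isolates out of 158 gives 62 %.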
India is in the process of expanding the coverage of PCVs in its Universal Immunization Programme (UIP). In this study, 33 % (41/124) of invasive isolates recovered from blood and cerebrospinal fluid (CSF) in children below the age of 5 years were NVTs, which is slightly higher than a previous report of ~25 % NVT causing invasive disease in the same age group from Vellore, India [26,27]. Similar to these previous reports, this study also observed an even distribution of prevalence among NVTs, which makes it difficult to suggest any NVT for inclusion in future vaccine formulations based solely on prevalence. Among the NVTs observed in this study, some have high invasive disease potential, for example serotypes 2, 8, 12F and 24F. Serotypes 2, 8 and 12F are covered by PCV24, which is under development, and there are also plans to include 24F in future vaccines [28,29].
Among the non-vaccine serotypes found in children were two rarely reported serotypes, 2 and 45. Two serotype 2 isolates caused bacteraemia during 2015, one in a child from Bangalore and one in an adult from Delhi. Serotype 2 strains were common in adults a century ago and have since rarely been identified as causing invasive disease [30,19]. However, they have recently been described in meningitis cases among children from Bangladesh [31] and as the cause of a widespread outbreak in Israel [32]. Similar to most of the serotype 2 isolates identified elsewhere, these two isolates belong to GPSC96 (CC74) [10]. The other rare serotype, 45, found in India had the genetic background of ST3022 (GPSC245); this strain caused meningitis in a 5-month-old infant. The same clone expressing serotype 45 was recovered from a CSF sample in Niger in 2006 [33,34], and a clonally related strain, ST2212 (a triple locus variant, TLV, of ST3022), causing meningitis was identified in Bangladesh during 2007–2013 [17,35]. Serotype 45 was also found in other genetic backgrounds, for example ST3332 in The Gambia [36]. Serotypes 2 and 45 were the 9th and 14th most common serotypes causing invasive pneumococcal disease among children under 5 years in Gavi countries [37]. Therefore, they are potentially important serotypes to be considered for inclusion in future pneumococcal conjugate vaccines.
GPSC10 (CC230) is the only sequence cluster among the top five lineages in the disease-causing population to have both VTs and NVTs. It is the largest contributor of NVTs and accounted for 50 % of the major non-PCV13 serotype 15B/C in the disease-causing population [27], highlighting the potential of GPSC10 to mediate serotype replacement in the post-vaccine era. The NVT GPSC10 isolates expressing serotype 24F have an invasive disease potential similar to serotype 19A [24]. Increases in invasive disease caused by serotype 24F pneumococci were also observed in children in Argentina (unpublished data), France [28] and Spain after the introduction of PCV13 [38]. In Spain, the increase was largely due to CC230 (the major CC in GPSC10). GPSC10 is a multidrug-resistant lineage that is associated with resistance to penicillin, erythromycin, cotrimoxazole and tetracycline.
Antibiotic resistance is significantly higher among VT pneumococci, especially in disease-causing isolates. This finding suggests that the reduction of antibiotic resistance after the use of pneumococcal vaccines in developed countries could also occur in India via directly removing VTs that are associated with antibiotic resistance and via a reduction in febrile illnesses that often require antibiotic use [39,40].
A limitation of this study is the relatively small sample size for each region, which does not allow us to investigate the potential differences in serotypes, strains and antibiotic resistance between regions during the phased introduction of the conjugate vaccines. To detect a 20 % difference in prevalence with 95 % confidence level, at least 196 samples are required for each region. However, achieving this sample size may not be feasible because of the challenges of isolating pneumococci from suspected cases of pneumococcal disease, namely antibiotic use prior to sampling and varied healthcare infrastructure in different regions. While obtaining a statistically sufficient number of disease isolates is not likely, cross-sectional studies sampling nasopharyngeal isolates from healthy carriers could be an alternative method to detect the impact of the vaccine.
This study, describing the pneumococcal disease and carriage epidemiology, demonstrates that 60–75 % of pneumococcal serotypes in children younger than 5 years are covered by PCV13 and Pneumosil. Vaccination against pneumococci is very likely to reduce antibiotic resistance. A multidrug-resistant pneumococcal lineage, GPSC10, is a high-risk clone that could mediate serotype replacement. This study describes pneumococcal strain characteristics prior to vaccination that will help to evaluate changes associated with the NIP in the future.

Funding information
This study was co-funded by the Bill and Melinda Gates Foundation (grant code OPP1034556), the Wellcome Sanger Institute (core Wellcome grants 098051 and 206194), and the US Centers for Disease Control and Prevention.