Pan-Genome-Wide Association Study of Serotype 19A Pneumococci Identifies Disease-Associated Genes

ABSTRACT Despite the widespread implementation of pneumococcal vaccines, hypervirulent Streptococcus pneumoniae serotype 19A is endemic worldwide. It is still unclear whether specific genetic elements contribute to complex pathogenicity of serotype 19A isolates. We performed a large-scale pan-genome-wide association study (pan-GWAS) of 1,292 serotype 19A isolates sampled from patients with invasive disease and asymptomatic carriers. To address the underlying disease-associated genotypes, a comprehensive analysis using three methods (Scoary, a linear mixed model, and random forest) was performed to compare disease and carriage isolates to identify genes consistently associated with disease phenotype. By using three pan-GWAS methods, we found consensus on statistically significant associations between genotypes and disease phenotypes (disease or carriage), with a subset of 30 consistently significant disease-associated genes. The results of functional annotation revealed that these disease-associated genes had diverse predicted functions, including those that participated in mobile genetic elements, antibiotic resistance, virulence, and cellular metabolism. Our findings suggest the multifactorial pathogenicity nature of this hypervirulent serotype and provide important evidence for the design of novel protein-based vaccines to prevent and control pneumococcal disease. IMPORTANCE It is important to understand the genetic and pathogenic characteristics of S. pneumoniae serotype 19A, which may provide important information for the prevention and treatment of pneumococcal disease. This global large-sample pan-GWAS study has identified a subset of 30 consistently significant disease-associated genes that are involved in mobile genetic elements, antibiotic resistance, virulence, and cellular metabolism. These findings suggest the multifactorial pathogenicity nature of hypervirulent S. pneumoniae serotype 19A isolates and provide implications for the design of novel protein-based vaccines.

S. pneumoniae serotype 19A was associated with several invasive diseases and highlevel antimicrobial resistance (4)(5)(6). After the widespread use of 7-valent as well as 10valent PCVs, serotype 19A emerged as a predominant serotype in many countries with vaccination programs and other countries without vaccination programs (7)(8)(9). In the era of 13-valent PCV (covering serotype 19A), S. pneumoniae 19A remains the prevalent serotype found in IPD patients (2,(10)(11)(12). Although serotype 19A pneumococci may be more prone to invade human hosts, it remains unclear whether the serotype 19A isolates from IPD patients represent a unique subgroup genetically different from those from asymptomatic individuals (11,12). Therefore, it is important to understand the genetic and pathogenic characteristics of S. pneumoniae serotype 19A, which may provide important information for the prevention and treatment of S. pneumoniae disease and the development of proteomic vaccines in the future.
Whole-genome sequencing (WGS) with its high discriminatory power has become a feasible tool for bacterial typing, given steadily decreasing associated costs. The increasing availability of high-throughput WGS and generation of high-dimensional genomic data led to the development of numerous comparative bacterial genome analyses. Pan-genomewide association studies (pan-GWASs) are increasingly used to explore the statistical association between genotypic variation (associated with virulence, antibiotic resistance, cellular metabolism, and disease susceptibility) and bacterial phenotypes such as disease susceptibility (13)(14)(15)(16)(17)(18)(19)(20)(21)(22)(23)(24). To date, few studies have focused on evaluating genetic variants between pathogenic and nonpathogenic serotype 19A isolates at the pan-genome level, so it is unclear whether there are disease-associated genotypes of S. pneumoniae. In order to identify the key factors of S. pneumoniae pathogenesis, we carried out a pan-GWAS of 1,292 S. pneumoniae genomes (belonging to serotype 19A) available in the NCBI GenBank database. This study aimed to compare the genetic divergences between invasive disease and asymptomatic carriage isolates of S. pneumoniae serotype 19A so as to identify diseaseassociated genotypes that have a key role in the pathogenic process of S. pneumoniae, including the regulation of bacterial virulence, drug resistance, and transfer events of genetic elements. Our findings are essential to help understand what is causing pneumococcal disease, providing new guidance for future preventive and therapeutic interventions.

RESULTS
Characteristics of the S. pneumoniae serotype 19A isolates. We analyzed the genomes of 1,292 S. pneumoniae serotype 19A isolates collected between 1905 and 2019, including 883 (68.34%) isolates from patients with invasive disease and 409 (31.66%) isolates from individuals with asymptomatic carriage. All of the IPD specimens were isolated from cerebrospinal fluid, blood, joint fluid, pleural fluid, peritoneal fluid, and so on. The numbers of the patients by age group were as follows: #2 years old, n = 695 (53.8%); 3 to 5 years old, n = 196 (15.2%); 6 to 59 years old, n = 179 (13.9%); $60 years old, n = 74 (5.7%); and age unknown, n = 148 (11.4%). In addition, there was a nonlinear (reverse V-shaped) annual trend for the number of the patients ( Fig. 1a  Pan-genome construction and phylogenetic analysis. A pan-genome of the cohort consisting of 11,935 genes was constructed using 1,292 S. pneumoniae serotype 19A isolates. A visualization of the pangenetic results is shown in Fig. 1c to e. The diversity of the S. pneumoniae pan-genome, including 11,935 genes, is highlighted by the relatively large number of accessory genes (n = 11,078) and a high variability with a large proportion of unique genes (n = 3,342), suggesting that it belongs to an open pan-genome. The proportion of accessory gene increased as the sampling increased, suggesting widespread acquisition of accessory genes through horizontal gene transfer.
Multiple pan-GWAS methods for identifying disease-associated genotypes. In the pan-GWAS analysis using the phylogenetic-based Scoary method ( Fig. 3a and b), there were 828 genes statistically associated with disease phenotypes upon surviving an adjusted P value threshold (4.19 Â 10 26 ). The majority of disease-associated genes were annotated as hypothetical proteins, indicating that their functions were unknown. Notably, we found that ;30% of these disease-associated genes were annotated as having mobile elementrelated functions, indicating that horizontal gene transfer elements may be associated with causing disease.
In the pan-GWAS analysis using a linear mixed-model (LMM) method ( Fig. 3c and d), there were 195 genes statistically associated with disease phenotypes upon surviving an adjusted P value threshold (1.15 Â 10 25 ). More than 30% of them were annotated as hypothetical proteins, and roughly 20% were annotated as mobile genetic elements.
In the pan-GWAS analysis using the random forest (RF) model, the importance of the 100 top-ranked genes is shown in Fig. 3e, with high accuracy (91.4%; 95% To avoid inherent limitations of individual pan-GWAS methods, we use three methods (Scoary, LMM, and RF) to minimize false-positive associations and triangulate potential hits. Interestingly, we found consensus on statistically significant associations based on the three pan-GWAS methods, including 30 genes identified by all three methods, 86 genes identified by both Scoary and LMM, 80 genes identified by both Scoary and RF, and 36 genes identified by both RF and LMM (Fig. 4). For 30 significant genes found simultaneously by three methods, Fig. 5 illustrates the overall effect of different genes on the estimated risk score. A point above the diagonal indicates that the risk score is increased when the gene is present. Importantly, we observed 18 genes above the diagonal, suggesting that the presence of these genes may increase the risk of causing invasive disease in humans.
Functional annotation of disease-associated genes. After filtering the orthologous genes to identify the disease-associated genotypes in S. pneumoniae, there were a subset of 30 genes found simultaneously by three methods, for which 19 encoded hypothetical proteins and 11 encoded known functional proteins. These disease-associated genes had diverse predicted functions, including mobile genetic elements, antibiotic resistance, virulence, and cellular metabolism ( Table 2).
The COG annotation revealed that the percentages of genes associated with (i) cell replication, recombination, and repair, (ii) cell wall, membrane, and envelope biogenesis, (iii)  Disease-Associated Genes of Pneumococci Microbiology Spectrum nucleotide transport and metabolism, and (iv) transcription were 20%, 10%, 10%, and 10%, respectively (Table 2). Furthermore, the GO annotation showed that the cytoplasm and integral component of membrane in cellular component and metal iron binding in molecular function play important roles in pathogenic process of S. pneumoniae.

DISCUSSION
Despite the widespread implementation of PCV13, S. pneumoniae serotype 19A remains the leading cause of pneumococcal disease, especially for life-threatening IPD (1)(2)(3)25). This phenomenon may be related to low vaccine coverage, delayed effect, antibiotic pressure, capsular switching, and immune evasion. In this study, we performed pan-GWAS analysis of 1,292 hypervirulent S. pneumoniae serotype 19A isolates to compare the genetic divergences between invasive disease and asymptomatic carriage isolates and found several disease-specific STs, indicating that ST2062, ST320, ST199, ST172, and ST695 isolates are associated with causing IPD. Additionally, we used three pan-GWAS methods to identify 30 simultaneously significant disease-associated genes, which had diverse predicted functions, including those participated in mobile genetic elements, antibiotic resistance, virulence and cellular metabolism. Our findings provide evidence for the presence of specific pathogenic clones and genetic elements that promote infection and invasion, suggesting the multifactorial nature of the pathogenicity.
Multilocus sequence typing in the present study showed a considerable diversity among S. pneumoniae 19A isolates, with the prevalent STs being ST2062, ST320, ST199, ST847, ST172, ST695, and ST1701. The S. pneumoniae 19A ST320 isolate, known as multidrug resistant and hypervirulent, was derived from the ancestral clone Taiwan 19F -14 FIG 3 pan-GWAS analyses for identifying disease-associated genotypes using the Scoary, LMM, and random forest methods. Manhattan plots show statistical significance (2log 10 ) of the genes using Scoary (a) and LMM (c). Volcano plots show the relationship between statistical significance (2log 10 ) and the effect size in terms of the log-transformed (base 2) carriage-to-disease odds ratio for the genetic variants using Scoary (b) and LMM (d). The horizontal lines represent the pan-WGAS statistical significance threshold based on the Bonferroni adjustment. The importance score for the 100 top-ranked genes (e) and ROC curves (f) for random forest with the 100 top-ranked genes are shown.

Disease-Associated Genes of Pneumococci
Microbiology Spectrum (ST236). Although ST320 is a double-locus variant of ST236, the genetic evolution from Taiwan 19F -14 (ST236) to 19A ST320 contributed to a significant competitive advantage in enhancing colonization capacity and causing invasive disease (26). Notably, S. pneumoniae 19A ST320 was prevalent in many countries in both adults and children, as well as in both pre-PCV and post-PCV periods; therefore, the spread of this clone poses a threat to human health worldwide (27,28). Despite the diversity of STs, IPD isolates were present across the phylogenetic tree, indicating that these disease clones come from multiple genetic backgrounds. Moreover, this study revealed that ST320 (odds ratio [OR] = 6.79), ST199 (OR = 1.56), ST172 (OR = 22.42), ST695 (OR = 3.02), and ST2062 (OR = 1.56) isolates were significantly and positively associated with the IPD phenotype, suggesting the present specific pathogenic clones promote invasion. Consistent with this, previous studies have demonstrated that ST320, ST199, and ST172 belonged to highly pathogenic clones and became increasing causes of infection in various geographic regions, since these STs have the ability to compete in the nasopharyngeal niche (29,30). The above findings show a diverse genetic background of S. pneumoniae 19A and give an insight into the molecular typing of disease and carriage isolates. There is increasing evidence that S. pneumoniae infection isolates are a subset of asymptomatic colonization clones (26), indicating that there may be specific virulence factors associated with the evolution of pneumococcal clones from harmless ancestors. In addition, the virulence factors may vary among different pneumococcal clones, so knowledge of pathogenicity determinants is crucial for identifying risk genotypes and predicting disease phenotypes during the early onset of the disease. To obtain a higherresolution snapshot of S. pneumoniae genetic background, further pan-GWAS analyses are urgently needed to reveal disease-associated virulence factors, which may provide important evidence of the pathogenesis of pneumococcal disease. The latest study has revealed that the absence of consensus genetic variation associated with disease status supports the opportunistic infection model for S. pneumoniae serotype 1 isolates, suggesting that disease and carriage pneumococci are intrinsically hyperinvasive and

Disease-Associated Genes of Pneumococci
Microbiology Spectrum equally able to cause disease (31). However, we used three pan-GWAS methods (Scoary, LMM, and RF) to triangulate potential genetic variation and surprisingly found consensus evidence that certain genetic variation is significantly overrepresented in IPD than in carriage isolates, indicating the complex multifactorial pathogenicity of S. pneumoniae serotype 19A. For example, the disease-associated group_114, group_115, group_190, and group_142 genes were found to encode IS5 family transposases. Transposition of genetic elements (e.g., transposons, plasmids, and some other kinds of mobile genetic elements) has significant effects on divergence and evolution of S. pneumoniae (32,33). These findings have emphasized the crucial role of mobile genetic elements on the pathogenic potential of S. pneumoniae. In addition, desK, ydhF, and gdhA are involved in encoding metabolism-associated proteins such as histaminase, oxidoreductase, and NADP-specific glutamate dehydrogenase, which can modulate bacterial response to stimuli and environmental changes of S. pneumoniae (34,35). Carbon and nitrogen metabolism is essential for each biological system, since every cellular component (such as proteins, genetic elements, and energy carrier molecules) is derived from metabolism. Consistent with previous studies, we observed that disease-associated glf, prsA, dtxR, and rsmD genes are involved in encoding virulence-and resistance-related proteins, which have the ability to modulate the bacterial virulence and antibiotic resistance (36)(37)(38)(39)(40)(41)(42). In summary, our findings suggest that certain genomic variations modulate S. pneumoniae colonization and invasion, which may shed light on the complex multifactorial nature of the pathogenicity. Change in risk score for the gene when a disease-associated gene is present (y axis) compared to absent (x axis). A point above the diagonal implies that the risk score is increased when the gene is present.

Disease-Associated Genes of Pneumococci Microbiology Spectrum
Recent advances in high-throughput WGS and its cost reductions have increased the applicability of GWAS to explore the statistical associations between genomic variation and bacterial phenotypes such as disease susceptibility and virulence, which has the potential to reveal the complex multifactorial pathogenicity and inform disease prevention measures (31). In particular, the LMM-based GWAS has led to the widespread application of genotype-phenotype association studies in bacterial genomic data, which use the random effects of kinship matrix from the phylogeny to robustly control for population structure (43). The Scoary pan-genome construction program is an easy-to-use and ultrafast tool for exploring pan-genome association between genetic variation and clinically relevant phenotypes in bacteria, which accounts for population stratification using the phylogenetic tree or Hamming distances in the genotype matrix (typically with gene presence/absence). RF, known as a nonparametric tree-based machine learning method, can effectively deal with variable interaction and correlation and also provide rapidly computable variable importance measures for variable ranking (44). This makes RF particularly attractive for the analysis of complex and high-dimensional genomic data. Many previous studies that mostly focused on singlenucleotide polymorphism (SNP)-based GWAS approaches to reveal structural variations relied on a single reference genome, but a single reference genome uncovering the complete set of genes in the entire species is inadequate for many purposes, suggesting that a pan-genome-based method is especially important for identifying diseaseassociated genes of the hypervirulent serotype 19A isolates (20). Considering the highly variable genomes of S. pneumoniae and inherent limitations of individual methods, we use three pan-GWAS methods to identify consistently significant disease- a OR, odds ratio; 95% CI, 95% confidence interval. b COG annotation category: C, energy production and conversion; F, nucleotide transport and metabolism; K, transcription; L, replication, recombination, and repair; M, cell wall, membrane, and envelope biogenesis; O, posttranslational modification, protein turnover, and chaperones; S, function unknown; T, signal transduction mechanisms.
Disease-Associated Genes of Pneumococci Microbiology Spectrum associated genotypes, which may make our findings more comprehensive and credible to reveal the complex pathogenicity of this hypervirulent serotype.
In contrast with previous bacterial GWAS studies (28,31,45,46), this study has some strengths. This study first focused on a specific highly virulent S. pneumoniae serotype, 19A, covering majority of countries around the world and multiple patients with a wide range of symptoms. This improved the representativeness of S. pneumoniae isolates and excluded the effect of capsular diversity in the pan-GWAS analysis, making our results more convincing. Additionally, we minimized the opportunity of S. pneumoniae serotype 19A isolates transitioning to causing IPD by including the carriage isolates from asymptomatic individuals and not infection isolates from patients with mild NIPD, which may improve the statistical power to detect disease-associated genes. Moreover, we performed pan-GWAS analyses using multiple methods to identify consistently significant genetic variants associated with IPD, thereby avoiding inherent limitations of individual methods and minimizing false-positive associations. These consensus findings may provide comprehensive evidence for clarifying the complex multifactorial pathogenicity of S. pneumoniae serotype 19A.
However, there are some inherent limitations that need to be considered. First, the number of disease and carriage isolates in this study was limited by current publicly available S. pneumoniae genomes in the NCBI GenBank database. However, this study represents a unique and large-scale genomic analysis of S. pneumoniae serotype 19A isolates covering a majority of countries around the world. Second, most of the diseaseassociated genes identified in this pan-GWAS analysis were annotated as hypothetical proteins with no known function. Based on the evidence that several hypothetical genes are consistently associated with IPD in three pan-GWAS analyses, we confirmed that these hypothetical genes are highly likely to participate in colonization and invasion. Therefore, these hypothetical proteins are worthy of further research to reveal their pathogenic mechanisms. Finally, all genes in this study were marked as having a presence or absence status, which did not take into account their extent of insertions and deletions and variation. In the future, more comprehensive SNP-and k-mer-based GWAS studies are required to confirm the effects of these disease-associated genes.
In conclusion, this global large-sample pan-GWAS study based on three methods has identified a subset of 30 consistently significant disease-associated genes that are involved in mobile genetic elements, antibiotic resistance, virulence and cellular metabolism. These findings suggest the multifactorial pathogenicity nature of hypervirulent S. pneumoniae serotype 19A isolates and provide important evidence for the design of novel protein-based vaccines to prevent and control pneumococcal disease.

MATERIALS AND METHODS
Sample selection and quality control. A total of 1,292 genome assemblies of S. pneumoniae serotype 19A isolates were downloaded from the NCBI GenBank database between 1905 and 2019, which included isolates from the nasopharynx of asymptomatic carriers and clinical specimens from IPD patients. IPD was defined as an infection in which S. pneumoniae was recovered from normally sterile sites (e.g., blood, cerebrospinal fluid, or pleural fluid). S. pneumoniae assemblies were checked for lowlevel contamination using Kraken (v.1.1.1) (http://ccb.jhu.edu/software/kraken/) (47), which uses exact alignment to accurately assign taxonomic labels to metagenomic DNA sequences. Briefly, S. pneumoniae genomes were excluded from our analyses if more than 5% of the total sequence belonged to a different species. In addition, genome completeness and contamination were evaluated using the default parameters of CheckM (v.1.2.0) (48).
Pan-genome construction and phylogenetic analysis. The genome assemblies were reannotated using Prokka (v.1.14.6) (51). Then, these annotated assemblies were fed to Roary (v.3.13.0) for the pangenome (core and accessory genes) construction, with the Roary parameters being 90% for minimum blastp identity and 1.5 for MCL inflation value (52). The pan-genome was visualized by the python opensource script "roary_plots.py." A maximum likelihood phylogenetic tree based on 857 core genes was constructed with Fasttree, using the GTR1CAT model (v.2.1.11) (53). The whole-genome alignments for the variant sites with Disease-Associated Genes of Pneumococci Microbiology Spectrum single-nucleotide polymorphisms (SNPs) were generated by Snippy (v.4.6.0) (https://github.com/ tseemann/snippy), using S. pneumoniae TIGR4 (NCBI accession no. PRJNA277) as the reference genome. A maximum likelihood tree based on SNPs was also generated from the recombination-filtered alignment with RAxML (v.8.2.12), using the GTR1C (gamma) model and 100 bootstrap replicates (54). Visualization and annotation of the phylogenetic tree were performed using Chiplot (https://www .chiplot.online). pan-GWAS analysis for disease-associated genotypes. Because pathogenicity is a complex multifactorial property, we used multiple pan-GWAS methods, including a phylogenetic-based approach (Scoary), a linear mixed model (LMM), and a machine learning method (random forest) (55) to explore the pan-GWAS between genotypes and disease phenotypes, so as to identify disease-associated genes at the pan-genome level. In the above pan-GWAS analysis, we used the disease phenotype (IPD or carriage) as the outcome variable and the pan-genome matrix of genotypes (presence or absence) as the independent variable. We carried out the univariate pan-GWAS using Scoary analysis (Scoary v.1.6.16) and LMM (Pyseer v.1.2.0) to identify disease-associated genes (56,57), using Hamming distances and core SNPs to correct for the population structure. To control for false positives due to multiple comparisons, the Bonferroni correction was used to calculate the adjusted P value threshold (4.19 Â 10 26 for Scoary and 1.15 Â 10 25 for LMM) (58,59). To capture the complex nonlinear association between genotypes and disease phenotypes, the RF classification model was also used to identify disease-associated genes of high variable importance, using the Random Forest package in R (v.4.1.3). The importance of the characteristic variables was rated by calculating the mean decrease in impurity (mean decrease in Gini [MDG]). We used sensitivity, specificity, positive predictive value, negative predictive value, k value, receiver-operating characteristic (ROC) curve, and 10-fold cross-validation to evaluate the predictive effect of the RF model. Since similar accuracy occurs when we included more genes in the RF model, the 100 top-ranked genes were regarded as a conservative threshold to reduce false-positive genes.
Functional annotation of disease-associated genes. All gene sequences significantly associated with disease phenotype by three methods (Scoary, LMM, and RF) were first functionally annotated by Prokka and Roary. In addition, the disease-associated gene sequences were extracted from the pan-genome using TBtools (v.1.098746), and then each DNA was checked for similarity against known genes and proteins using eggNOG (http://eggnog-mapper.embl.de/) and UniProt (https://beta.uniprot.org/). Data availability. NCBI accession numbers, associated metadata, and reference for each genome included in this study are listed in the supplemental material. Detailed information on pan-GWAS analyses, DNA sequences, and annotations of genetic variants are available in the supplemental material.