Diversity and Complexity of the Large Surface Protein Family in the Compacted Genomes of Multiple Pneumocystis Species.

Pneumocystis continues to be a major cause of disease in humans with immunodeficiency, especially those with HIV/AIDS and organ transplants, and is being seen with increasing frequency worldwide in patients treated with immunodepleting monoclonal antibodies. Annual health care associated with Pneumocystis pneumonia costs ∼$475 million dollars in the United States alone. In addition to causing overt disease in immunodeficient individuals, Pneumocystis can cause subclinical infection or colonization in healthy individuals, which may play an important role in species preservation and disease transmission. Our work sheds new light on the diversity and complexity of the msg superfamily and strongly suggests that the versatility of this superfamily reflects multiple functions, including antigenic variation to allow immune evasion and optimal adaptation to host environmental conditions to promote efficient infection and transmission. These findings are essential to consider in developing new diagnostic and therapeutic strategies.

ABSTRACT Pneumocystis, a major opportunistic pathogen in patients with a broad range of immunodeficiencies, contains abundant surface proteins encoded by a multicopy gene family, termed the major surface glycoprotein (Msg) gene superfamily. This superfamily has been identified in all Pneumocystis species characterized to date, highlighting its important role in Pneumocystis biology. In this report, through a comprehensive and in-depth characterization of 459 msg genes from 7 Pneumocystis species, we demonstrate, for the first time, the phylogeny and evolution of conserved domains in Msg proteins and provide a detailed description of the classification, unique characteristics, and phylogenetic relatedness of five Msg families. We further describe, for the first time, the relative expression levels of individual msg families in two rodent Pneumocystis species, the substantial variability of the msg repertoires in P. carinii from laboratory and wild rats, and the distinct features of the expression site for the classic msg genes in Pneumocystis from 8 mammalian host species. Our analysis suggests multiple functions for this superfamily rather than just conferring antigenic variation to allow immune evasion as previously believed. This study provides a rich source of information that lays the foundation for the continued experimental exploration of the functions of the Msg superfamily in Pneumocystis biology. IMPORTANCE Pneumocystis continues to be a major cause of disease in humans with immunodeficiency, especially those with HIV/AIDS and organ transplants, and is being seen with increasing frequency worldwide in patients treated with immunode-pleting monoclonal antibodies. Annual health care associated with Pneumocystis pneumonia costs ϳ$475 million dollars in the United States alone. In addition to causing overt disease in immunodeficient individuals, Pneumocystis can cause subclinical infection or colonization in healthy individuals, which may play an important role in species preservation and disease transmission. Our work sheds new light on the diversity and complexity of the msg superfamily and strongly suggests that the versatility of this superfamily reflects multiple functions, including antigenic variation to allow immune evasion and optimal adaptation to host environmental conditions to promote efficient infection and transmission. These findings are essential to consider in developing new diagnostic and therapeutic strategies. KEYWORDS classification, conserved domains, major surface glycoprotein, phylogenetic analysis, Pneumocystis P neumocystis continues to be a major cause of disease in humans with immunodeficiencies, especially those with HIV/AIDS and organ transplants, and is being seen with increasing frequency in patients treated with immunodepleting monoclonal antibodies. As an atypical fungus, Pneumocystis has highly adapted to the mammalian lung environment (1), with a high level of host specificity; P. jirovecii infects humans, P. carinii infects Norway rats (Rattus norvegicus), and P. murina infects house mice (Mus musculus). In addition, Pneumocystis cell walls are structurally unique and differ significantly from typical fungal cell walls that are composed of polysaccharides (mainly glucan and chitin) and highly mannosylated proteins. Both genomic and experimental analyses have shown the absence of chitin and outer chain N-mannans in Pneumocystis cell walls (1). Furthermore, beta-1,3-glucan is absent in the trophic form but masked in the cyst form of Pneumocystis (2).
An integral component of the Pneumocystis cell wall in both the cyst and trophic forms is the major surface glycoprotein (Msg) (also known as gp95, gp115, gp120, and gpA) (3)(4)(5)(6)(7)(8). Ever since its identification in 1982 (9), Msg has been a focus of research, in part because it is the most abundant Pneumocystis protein as assessed by SDS-PAGE. Msg is present in all Pneumocystis species studied to date (3,4,6,7,10,11) and appears to play an important role in pathogen-host interactions as well as in evasion of host immune responses. Based on studies of Pneumocystis in humans, rats, and mice, Msg is encoded by a multicopy gene family with an estimated ϳ30 to 100 copies per genome (5,6,8,10,12). Msg genes (up to ϳ3 kb each) are closely related to but clearly distinct from each other and are clustered together in the subtelomeric regions of multiple chromosomes (1, 13) (see Text S1 in the supplemental material). While there is no apparent variation in the msg repertoire among laboratory-bred P. murina or P. carinii isolates, extensive variation is present among P. jirovecii isolates (14).
Recently, we utilized long-read sequencing technology (15,16) to identify the most complete set to date of msg genes in three Pneumocystis species (P. jirovecii, P. carinii, and P. murina) as part of the Pneumocystis genome project (1). Based on our studies, each Pneumocystis genome harbors approximately 60 to 180 msg genes, depending on the species, including the classical msg genes, msg-related (termed msr) genes, and additional related genes. These genes are collectively termed the msg superfamily. We previously reported on the first systematic classification of the msg superfamily (1) but did not provide a detailed description of the unique characteristics and phylogenetic relationships of individual domains and families or subfamilies. A recent report identified a small subset of msg genes in P. jirovecii from a single patient and described potential mechanisms of recombination, but this report did not include any other Pneumocystis species (17).
In the current report, we expanded our published analysis (1) to include msg genes in Pneumocystis from other mammalian host species. The goals of the current report were to (i) identify msg genes from P. oryctolagi (infecting rabbits), P. wakefieldiae (infecting rats), Pneumocystis sp. "macacae" (infecting rhesus macaques), and Pneumocystis sp. "canis" (infecting dogs); (ii) describe the characteristics and phylogenetic evolution of individual msg domains; (iii) illustrate the characteristics, phylogenetic evolution, and relative expression levels of individual msg families or subfamilies; (iv) compare the variability of the msg repertoires in P. carinii from laboratory and wild Norway rats; and (v) characterize the variation of the expression sites or upstream conserved sequences (UCSs) of the classic msg genes in Pneumocystis from 11 mammalian host species.

RESULTS
Sources of Pneumocystis msg sequences. msg sequences for P. murina, P. carinii, and P. jirovecii were obtained primarily from our previous msg and genome sequencing studies (1,15,16). The accuracy of these sequences was maximized by integrating Illumina high-throughput sequencing of genomic DNA, PacBio long-read sequencing of msg repertoire amplicons, and Sanger sequencing of cloned msg genes. Additional msg sequences were identified by Sanger sequencing of cloned msg amplicons and nextgeneration sequencing of whole genomes of the following Pneumocystis species: P. wakefieldiae, P. carinii (in wild rats), P. oryctolagi, Pneumocystis sp. "macacae," and Pneumocystis sp. "canis" (Table 1 and see Table S1 in the supplemental material). Due to the low-throughput nature and high cost of Sanger sequencing of cloned msg amplicons and the difficulty in assembling short reads from Illumina sequencing, only a small number of full-length msg genes were obtained from these species (1 to 13 genes per species). As whole-genome assembly of these species is still in progress, the msg genes reported for these species are only representative, not all inclusive. All msg sequences are available from the Zenodo database (data sets 1 to 8 available at https://zenodo.org/record/3523554#.XbpSjjd7mpo) as well as the BioProject database with accession number (no.) PRJNA560924.
Characteristics and phylogenetic relationships of individual Msg domains. We identified a total of 9 conserved domains ( Fig. 1 and Text S1). Classic Msg (Msg-A1) proteins contain 5 domains that presumably arose by gene duplication. Based on phylogenetic trees constructed using only these 5 domains, we found that each forms its own cluster, regardless of the origin of the species of the domains (Fig. 2). This strongly suggests that the most recent common ancestor to these Pneumocystis species already had developed this Msg domain structure and organization and that, subsequently, these domains evolved with no further duplication or recombination among domains across or within species. We also found that within each of these 5 Msg domains, individual domains clustered according to Pneumocystis species, suggesting that significant Msg family expansion occurred after the separation of Pneumocystis species. In addition, P. carinii and P. murina form two separate clusters, with each cluster containing both species, suggesting that those two clusters arose before separation of these two species.
Unique characteristics of each Msg family and subfamily. Based on domain structure, phylogeny analysis, and expression control mechanisms of the msg superfamily, we previously proposed a classification of five families, named Msg-A, Msg-B, Msg-C, Msg-D, and Msg-E (1), as summarized in Table 1. According to the chromosomelevel assemblies of the P. murina, P. carinii, and P. jirovecii genomes, msg genes are located almost exclusively in subtelomeric regions and are usually present in clusters (Text S1). Different msg families differ in the numbers of members, distributions among different Pneumocystis species, sequence structures (gene length, location and number of introns, and number of conserved domains), and expression control mechanisms, as summarized in Table 1. In addition, there is a bias of amino acid distribution among different Msg families (see Fig. S1 and Text S1).
(i) Msg-A family. Msg-A family is by far the largest among the 5 families of the Msg superfamily. This family is divided into three subfamilies: Msg-A1, Msg-A2, and Msg-A3  Results represent a potentially complete set for P. murina, P. carinii, and P. jirovecii as described in reference (1) but only a subset of members for the other four species (P. wakefieldiae, P. oryctolagi, Pneumocystis sp. "macacae," and Pneumocystis sp. "canis"), whose genome sequencing is ongoing.
b Based on P. murina, P. carinii, and P. jirovecii genes with complete or nearly complete reading frames.
c Subfamilies belonging to msg-A family.
( Fig. 3 and Fig. S2 and S3). In addition to differences in phylogenetic relationships, these three subfamilies have significant differences in the expression control mechanisms and sequence structures of the 5=-end leaders. The Msg-A1 subfamily includes all classic msg genes. The most striking characteristics of this subfamily are its dominance among all Msg families/subfamilies across all Pneumocystis species (Table 1) and its unique expression control mechanism. It has been well established that expression of this subfamily is controlled by a dedicated, single-copy subtelomeric expression site, also known as the upstream conserved sequence (UCS) (18-21) (Fig. 4). The UCS is expressed in fusion with an msg gene; the region between UCS and its downstream msg gene is termed the conserved recombination joint element (CRJE), which is highly conserved among all msg-A1 genes and potentially serves as an anchor for recombination (22). Available data suggest that these msg genes are not expressed unless they are translocated downstream of and in-frame with the UCS (18)(19)(20)(21). This mechanism allows only one msg-A1 gene to be expressed in a single organism at a given time, although multiple msg-A1 genes are expressed at the population level in immunosuppressed hosts. In a phylogenetic analysis of 183 full-length msg-A1 genes from 7 Pneumocystis species, these genes clustered by species; as expected, genes from all three rodent Pneumocystis species formed a strong monophyletic group (Fig. 3). As previously noted (14), the Msg-A1 family in P. jirovecii is composed of two phylogenetically distinct groups; such separation is also seen in P. murina and P. carinii.
The Msg-A2 subfamily represents the previously reported msr genes in P. carinii (23,24) and differs from Msg-A1 as follows: (i) the abundant presence in rodent Pneumocystis, but absence in all other species examined thus far (Table 1); (ii) the presence of a short intron near the 5= end; (iii) the presence of a unique highly conserved exon 1; and (iv) independent expression of each individual gene (not under the control of UCS).  Alignment of 73 full-length msg-A2 genes from rodent Pneumocystis revealed two groups of genes with sizes of ϳ2 kb and ϳ3 kb. The group with a larger size contains all 9 Msg domains, while the other group lacks three of them (M5, M6, and C1). In a phylogenetic analysis, all msg-A2 genes from P. carinii formed a strong clade (with 99% bootstrap support), while msg-A2 genes from P. wakefieldiae were interspersed among msg-A2 genes from P. murina (Fig. S2A). In P. carinii, 11 msg-A2 genes show higher sequence identities to msg-A1 genes than other msg-A2 genes (53% to 63% versus 35% to 44%) and are clustered together with msg-A1 genes from P. carinii in a phylogenetic analysis (Fig. S2B). Similarly, one of the 6 msg-A2 genes in P. wakefieldiae shows higher sequence identity to and is clustered with msg-A1 genes.
The Msg-A3 subfamily includes genes with substantial sequence identity to the Msg-A1 and Msg-A2 subfamilies but without the CRJE element of the msg-A1 genes or the highly conserved exon 1 of the msg-A2 genes (Fig. S3A). This subfamily has a significant expansion in P. jirovecii with 33 copies but only 1 to 6 copies in other species (Table 1). With an overall highly variable 5=-end leader, members of this subfamily are expected to be expressed independently, similarly to the Msg-A2 subfamily. Nevertheless, 5 of the 6 msg-A3 genes in P. murina contain an ϳ600-bp leader sequence with significant identity and structural similarity to the UCS (termed UCS-like), including a Based on reverse transcription-PCR studies, each of these 5 genes was expressed independently (Text S1). Of note, the Msg-A3 subfamily encompasses both the Msg-II and Msg-III families reported by others (17), as illustrated in Fig. S3B. Given the complex (B) Schematic representation of the UCS sequence structures. The number in each box is the sequence length (base pairs) for each region. The approximate location of the tandem repeats in P. jirovecii, P. oryctolagi, and P. carinii are indicated by ovals, with more details on tandem repeat variation in P. oryctolagi and P. carinii shown in Fig. 6. (C) Alignment of deduced protein sequences of UCSs. Asterisks indicate the KR site potentially for proprotein cleavage by endoprotease. All sequences are available from GenBank with accession numbers or gene locus tag numbers indicated in parentheses, including P. murina (PNEG_04309), P. carinii (T552_04149), P. wakefieldiae (AF164562), P. jirovecii (T551_00002), P. oryctolagi (MN509824), Pneumocystis sp. "macacae" (MN509821), Pneumocystis sp. "canis" (MN509823), Pneumocystis sp. "fulvescens" (MN509819), Pneumocystis sp. "muelleri" (MN509817), Pneumocystis sp. "tanezumi" (MN509820), and Pneumocystis sp. "exulans" (MN509818). Details about the nomenclature of Pneumocystis and its host species are available in Table S1 in the supplemental material. clustering patterns in phylogenetic analysis and unknown functions of these genes, it appears not meaningful to further divide the Msg-A3 subfamily.
(ii) Msg-B family. This represents the only Msg family completely absent in all rodent Pneumocystis species sequenced to date but with exceptionally high abundance in P. jirovecii (Table 1). With a highly variable 5=-end leader, members of this family are expected to be expressed independently. In a phylogenetic analysis (see Fig. S4), the family separated into two major groups, which also differ in size (1.3 kb and 1.6 kb).
(iii) Msg-C family. The prominent characteristics of this family are its significant presence in P. murina and unique chromosomal organization (see Fig. S5). This family consists of a tandem array of 6 genes in chromosome 17 of P. murina, which represents the largest tandem array of genes of the same family identified so far in any Pneumocystis species. In contrast, there are no more than two msg-C genes in other Pneumocystis species. The two msg-C genes in P. wakefieldiae share similar sizes, intron-exon structures, and domain compositions (N1, M2, and M3) with msg-C genes in P. murina. However, in all other species examined, the msg-C genes are smaller (0.8 to 1 kb) with different intron-exon structures and/or lack the highly conserved exon 1 sequence of P. murina. In addition, they have different domain compositions and are only distantly related to the 6 genes in P. murina by phylogenetic analysis (Fig. S5A). Furthermore, the  (Table S2 and Text S1). Pneumocystis Major Surface Glycoprotein Superfamily ® chromosomal arrangement of the msg-C genes in P. jirovecii and Pneumocystis sp. "macacae" is different from that in rodent Pneumocystis (Fig. S5C). It is likely that the msg-C genes in P. carinii, P. jirovecii, and Pneumocystis sp. "macacae" represent degenerate genes or pseudogenes, as supported by the low-level transcription of this gene in P. carinii (Table 2).
(iv) Msg-D family. This family is related to the previously reported A12 antigen gene in P. murina (25). Like the Msg-A3 subfamily and Msg-B family, this family is rarely present in rodent Pneumocystis but significantly expanded in P. jirovecii and perhaps in Pneumocystis sp. "macacae" as well (Table 1 and Fig. S6). However, most of the Msg-D members contain 6 conserved domains compared to 9 and 3 domains in Msg-A3 and Msg-B, respectively. In a phylogenetic analysis, all single-copy Msg-D members in rodent Pneumocystis tightly clustered into one clade, which is well separated from Msg-D members in all other species. Consistent with the phylogenetic analysis, Msg-D members in all rodent Pneumocystis species lack both N1 and M2 domains, which are present in most of Msg-D members in other species.
(v) Msg-E family. This family is related to two previously reported p55 antigen genes (26)(27)(28). Unique among all Msg families, the Msg-E family is the smallest in member size, molecular size, and number of conserved domains. It is relatively equally distributed across all Pneumocystis species examined (Table 1) and among the most highly expressed families in rodent Pneumocystis (Table 2). Members did not cluster by species in a phylogenetic analysis (see Fig. S7). In P. murina, there are three members with nearly identical sequences and molecular sizes (termed p57), which are located in separate chromosomes (1,29). Each of these 3 genes is present as a tandem array with one msg-A2 gene and one msg-A1 gene downstream (1). Homologs to these 3 genes are also present in three separate chromosomes of P. wakefieldiae, though their downstream genes have not been identified, presumably due to incomplete genome assembly. No other species sequenced to date have close homologs to these 3 genes. These findings further suggest that duplication of the p57 gene in P. murina and P. wakefieldiae occurred before separation of these two species or there was introgression.
(vi) Unclassified genes. In P. murina, P. carinii, and P. jirovecii, there are 8 to 13 genes related to Msg that are unable to be reliably classified due to their shorter length (ϳ970 bp on average), presence of multiple introns, or lack of unique sequences (CRJE, KR site, or conserved leader sequences). The shorter length in most of them is not due to incomplete sequencing, as they are present within well-covered contigs. Almost all of these genes in P. murina and P. carinii have a low expression level ( Table 2), suggesting they are degenerate genes or pseudogenes.
Highly variable expression levels among different msg families in P. murina and P. carinii. Transcriptome sequencing (RNA-Seq) data indicate that all msg genes in P. murina and P. carinii are transcribed except for two unclassified msg genes in P. murina and 5 msg-A2 genes in P. carinii (Table 2). Strikingly, the UCS genes in both P. murina and P. carinii were the most highly expressed protein-coding genes of the whole genome (with their fragments per kilobase of exon per million fragments mapped [FPKM] values being more than 100 times higher than the median expression level for the whole gene set); as expected, individual msg-A1 members were expressed at lower levels. This high expression level is consistent with SDS-PAGE analysis of Pneumocystis proteins, which demonstrates that Msg is the most abundant protein as estimated by Coomassie blue staining (1). In P. murina, the highest expression level of individual Msg genes was observed in the Msg-C family, followed by the Msg-D, Msg-E, Msg-A3, and Msg-A1 families or subfamilies, all of which showed an expression level at least 3 times higher than the median of the whole gene set. The expression level of the Msg-A2 family was slightly higher than the median. In P. carinii, the highest expression level was observed in the single Msg-D gene, followed by the Msg-E family, both of which showed an expression level at least 11 times higher than the median. The expression level of the Msg-A family (including all 3 subfamilies) was similar to the median. Significant diversity of UCS in Pneumocystis from 10 mammalian host species. UCS was previously reported for P. murina, P. carinii, P. wakefieldiae, and P. jirovecii (18)(19)(20)(21)30). In the present study, we identified UCS in Pneumocystis species infecting rhesus macaques, dogs, rabbits, chestnut white-bellied rats, Müller's giant Sunda rats, Asian house rats, and Polynesian rats. Details about the nomenclature of these mammalian species and Pneumocystis species are listed in Table S1. As shown in Fig. 4, the Pneumocystis UCSs from all these animals show the sequence organization in known UCSs, including two exons that are interrupted by a variably sized intron. While exon 1 is identical in size (97 bp) among all UCSs, exon 2 is highly variable in size, with the shortest size present in P. oryctolagi (230 bp) and the longest in P. murina (314 bp).
The predicted UCS protein sequences vary in size from 110 to 138 amino acids, with 24% to 97% sequence identity (Fig. 4C). Despite these variations, all UCSs contain a pair of basic amino acid residues in the carboxyl end, Lys-Arg, known as the KR site (19,31,32). Phylogenetic analysis showed a clear separation between the UCSs in Pneumocystis species from rodents and those from other mammalian species (Fig. 4A). Consistent with the phylogenetic relationships, the UCSs in all rodent Pneumocystis species have an extra 13 to 15 amino acid residues at the beginning of exon 2 and a unique hexapeptide of PGVDYF near the center of exon 2 compared to Pneumocystis species from other mammalian species.
Similar to exon 2, the intron is also highly variable in size, with the shortest present in P. carinii (150 bp) and the longest in P. oryctolagi (515 bp). In addition, different levels of inter-or intrastrain sequence variation were observed in UCSs from P. carinii, P. wakefieldiae, Pneumocystis sp. "macacae" and P. oryctolagi (Text S1 and Fig. 6). The highest variation was observed in P. oryctolagi isolates, which displayed extensive interand intrastrain variations, including two single nucleotide polymorphisms (SNPs) in exon 2 and many SNPs, indels, and tandem repeat variations in the intron (Fig. 5).
Substantial variation of the msg-A1 gene repertoires in P. carinii from laboratory and wild rats. To compare the msg diversity between Pneumocystis from laboratory-bred animals and that from wild animals, we analyzed the restriction fragment length polymorphism (RFLP) patterns of the msg-A1 repertoires in P. carinii from 8 wild Norway rats collected in Ontario, Canada, in comparison with P. carinii from 8 laboratory Norway rats collected in three different animal facilities in United States. P. carinii from all laboratory rats showed almost identical RFLP patterns, while substantial variations in the RFLP patterns were observed within P. carinii isolates from wild rats and between P. carinii from wild and laboratory rats (Fig. 7).
To further confirm these variations, we determined the full-length msg-A1 sequences in P. carinii from one wild rat by Sanger sequencing of cloned PCR products. Sequence analysis of 28 random clones identified 13 unique msg sequences, with identities of 78% to 96% at the nucleotide level and 63% to 95% at the amino acid level. All but 1 of these 13 genes were clearly different from the 65 msg-A1 genes of P. carinii from the laboratory rats. In a phylogenetic analysis, msg-A1 genes from wild and laboratory rats were interspersed (Fig. 3). These findings suggest a closely related but clearly distinct repertoire of msg-A1 genes in P. carinii from wild and laboratory rats.

DISCUSSION
Over the past several decades, Msg has been the most extensively studied molecule in Pneumocystis, primarily due to its abundance, its role in pathogen-host interactions, and its potential as a target for diagnosis of Pneumocystis infection. In this report, we isolates. Five different sequence populations were identified and named types I to V indicated at the end of the sample codes, including RAB_M from MI, USA, RAB_F from Tours, France, and PRAB1 and PRAB2 from Lille, France. The 3 types of sequences (III, IV, and V) in sample PRAB2 were obtained from sequencing of 2, 6, and 3 plasmid clones, respectively, while the 2 types of sequences (II and III) in sample PRAB1 were obtained from 3 and 5 plasmid clones, respectively. The other 2 samples showed no variation based on Illumina sequencing of genomic DNA; their PCR products showed homogeneous sequences in direct Sanger sequencing and were not further subcloned. Three types of tandem repeats are indicated by colored lines. (B) UCS in 8 P. wakefieldiae isolates. Sequences for the first two isolates were reported by Schaffzin et al. (24), with GenBank accession no. AF164574.1 and AF164569.1. Sequences for the other 6 isolates were obtained in this study, as determined by next-generation sequencing (NGS; isolates Pw1A, Pw2A, Pw3A, and PwC1 from laboratory rats) and PCR (isolates P0025 and P0034 from wild rats). (C) UCS in P. carinii isolates. The first sequence is the P. carinii UCS gene from the P. carinii genome assembly (1). The last 3 sequences indicated by GenBank accession no. D31910 to D31912 were reported by Wada et al. (21). Sequences B70_1 and B70_2 were assembled in this study using previous NGS data from one rat (1). Sequences UC_1 and UC_2 were assembled in this study using Sanger sequence reads from http://pgp.cchmc.org (73). The 11-bp tandem repeat unit is underlined. Numbers at both sides of the alignments refer to the nucleotide positions relative to the predicted UCS translational start site. The 3= end of exon 2 is not shown due to space limitation. The intron is indicated in red brackets. All new UCS sequences obtained in this study are available from GenBank with accession no. MN509813 to MN509830. present an in-depth analysis of the msg domain structure and the characteristics of each individual msg family or subfamily, including new msg genes identified from P. oryctolagi, Pneumocystis sp. "macacae," Pneumocystis sp. "canis," P. wakefieldiae, and P. carinii infecting wild rats. The results from our analysis demonstrate a much greater complexity to this superfamily than was previously appreciated, expand the understanding of the primary structure, organization, phylogeny, and expression patterns of the Msg superfamily, and provide a comprehensive basis for further investigation of the role of the Msg superfamily in Pneumocystis biology.
The Msg superfamily, particularly, in P. jirovecii (179 members), represents the largest surface protein family identified to date in the fungal kingdom (33), which is surprising for an organism whose genome size is the smallest in the fungal kingdom sequenced to date, after the intracellular Microsporidia (34). msg genes are unique to Pneumocystis and account for 3% to 6% of the total genome, suggesting a critical role in the organism's survival (1). The vast majority of msg genes are clustered in subtelomeric regions, which are presumably advantageous to foster DNA recombination and antigenic variation, as has been found for surface protein genes in other pathogens (35). Their positioning is consistent with the notion that subtelomeric regions are favorable locations for fungal pathogens to acquire novel genes and foster evolution (36,37).
By domain structure, phylogenetic relationships, and expression control mechanisms, we have been able to classify the Msg superfamily into discrete families and subfamilies. Our classification based on exhaustive cataloging of Ͼ400 full-length msg genes from seven Pneumocystis species is more comprehensive than the one described in a recent report, which was based on 113 msg genes from a single species, P. jirovecii, of which only 55 were full-length sequences (17). Thus, despite the consistency of four families/subfamilies between these two systems (Msg-A1, -B, -D, and -E versus Msg-I, -IV, -V, and -VI), two families (msg-A2 and msg-C), which are almost exclusively present in rodent Pneumocystis, are absent in that report (17). We also elected not to subdivide the msg-A3 subfamily due to the complex clustering patterns in the phylogenetic

FIG 7
Comparative restriction fragment length polymorphism (RFLP) analysis of msg-A1 of P. carinii infecting laboratory and wild Norway rats. msg-A1 repertoire was amplified by PCR with genomic DNA prepared from P. carinii-infected lung samples from 8 immunosuppressed laboratory Norway rats (with three representatives show in lanes indicated Lab) and 8 wild Norway rats (in lanes indicated Wild). While R22 is most clearly representative of a different RFLP pattern, more subtle differences are also apparent in some of the other wild rats (e.g., R5 and R11) compared to that from laboratory rats. The PCR products were digested with restriction enzyme DraI and separated in 2% agarose gels containing SYBR Safe. Lane M, DNA size marker containing DNA digested with HindIII and ØX174 DNA digested with HaeIII.
analysis (see Fig. S4 in the supplemental material) and unknown functions of these genes. This classification will likely be refined when our understanding of the function of Msg is improved and this superfamily is better characterized for other Pneumocystis species.
Based on our analysis, there is substantial conservation among most Msg families or subfamilies across different Pneumocystis species, but there are also species-specific expansions or contractions. Among the three Pneumocystis species with the most complete data set, P. murina has the fewest number of genes in the Msg superfamily, while P. jirovecii has the most. These differences may be related to the larger body and therefore lung size, as well as the longer life span, in humans versus in rodents and the consequent need for a higher degree of antigenic variation to avoid the longer duration of immunologic memory in individual human hosts. The larger size of the Msg superfamily in P. jirovecii is attributable in part to the expansion of the classic Msg-A1 subfamily as well as other families (including Msg-A3, Msg-B, and Msg-D), which have no or limited representation in rodent Pneumocystis. Of note, P. murina possesses a set of 6 msg genes (Msg-C family) that are clustered as a tandem array in one chromosome and are the most highly expressed msg genes ( Table 2).
The functions of Msg remain unknown or poorly understood. To date, the best studied genes of the Msg superfamily are those classical Msg genes in the Msg-A1 family, whose expression is regulated by the single-copy UCS expression site, which allows antigenic variation through DNA recombination (14,17,38). Such variation can potentially serve as a mechanism to facilitate evasion of host immune responses, enabling the organism to persist longer in the host and transmit to a new host. This mechanism evolved presumably to operate in immunocompetent hosts. The expression of multiple msg-A1 variants in the lungs of immunodeficient hosts presumably results from ongoing recombination at the UCS in the absence of immune pressure. For all three Pneumocystis species with nearly fully sequenced genomes, the msg-A1 genes account for approximately 50% of all msg genes, supporting their potential to efficiently generate a large number of variants allowing immune evasion. In support of this hypothesis, our RNA-Seq analysis of P. murina and P. carinii revealed an exceptionally high-level expression of UCSs and a variable level of expression of all individual msg-A1 genes ( Table 2).
UCS is known to have a highly variable number of tandem repeats in the intron in P. jirovecii (19,39,40). In this study, we demonstrated for the first time the presence of inter-and intrastrain variations in tandem repeats in the intron of UCSs in P. carinii and P. oryctolagi. UCS in P. oryctolagi appears to have a higher degree of variation in tandem repeats as well as SNPs than P. jirovecii UCS. The intron in UCS (150 to 515 bp) is among the longest introns in Pneumocystis species studied to date. The retention of such a long intron with high variability in these species in an otherwise highly reduced genome suggests a critical role in organism survival, e.g., transcriptional regulation by a recursive splicing mechanism (41,42).
Of note, while the UCS is present as a single-copy gene per genome in all Pneumocystis species, there are 5 msg-A3 genes in P. murina, each of which contains a UCS-like leader sequence (Fig. 5) and is expressed at a relatively high level independent of the classic UCS (Table 2). These may have arisen from gene duplication in P. murina; alternatively, it is possible that a common ancestor of Pneumocystis had multiple UCSs, which have been gradually lost as a result of evolving an efficient recombination system involving only a single UCS (for the msg-A1 family).
Previous studies have demonstrated a conservation of the msg-A1 repertoires in Pneumocystis in colony-bred laboratory rats and mice in contrast to the highly variable msg repertoires in P. jirovecii (14), suggesting a homogeneous population of rodent Pneumocystis due to closed breeding conditions. In support of this, we observed substantial variations in the RFLP patterns among P. carinii isolates from wild rats and between P. carinii from wild and laboratory rats, supporting the former possibility. The absence of clustering of Msg-A1 variants based on geographic origin of the isolates suggests that the repertoire variation was not driven by geographic isolation of the organisms. These variations may reflect the difference in immune system development and modulation in wild animals, as they are continuously exposed to high levels of immune challenges in an open environment and experience high levels of infection with a wide range of pathogens (43)(44)(45). We hypothesize that this diversity is driven in part by a need for antigenic variation in response to T cell-rather than B cell-mediated immune responses and potential adaptation to the diverse HLA repertoire that would be present in a natural community of host species (46) versus the limited diversity present in inbred communities.
Domains M1 to M5 of msg-A1-encoded proteins likely arose by gene duplication given their conserved pattern of cysteine residues, and in fact, only a single M domain is categorized in Pfam. However, more detailed analysis clearly allows the identification of 5 related but unique domains. It is noteworthy that by phylogenetic analysis (Fig. 2), individual domains are observed as more closely related to each other across species than to other M domains in the same species, which suggests that there is a critical function for each domain and its evolution is restricted by negative selection. Furthermore, given that msg-A1-encoded proteins with these domains have been identified in all Pneumocystis species studied to date, this gene organization appears to have developed in an ancestor common to all Pneumocystis species and may have been a critical factor that allowed Pneumocystis to successfully infect mammalian hosts.
Unlike P. jirovecii, and perhaps Pneumocystis sp. "macacae," Pneumocystis sp. "canis," and P. oryctolagi, rodent Pneumocystis species (P. murina, P. carinii, and P. wakefieldiae) have a large number of msg-A2 genes, which are only slightly less frequent than msg-A1 genes. Previous studies of P carinii have shown that msg-A2 genes are expressed independent of the UCS (23,24). Nevertheless, the possibility of homologous recombination between msg-A2 and msg-A1 genes cannot be ruled out due to their high sequence identities, as previously suggested (13,47). Eleven msg-A2 genes in P. carinii show higher identities to msg-A1 genes than to other msg-A2 genes in this organism. It is likely that these 11 msg-A2 genes (the second exons) have arisen as a result of reciprocal recombination with msg-A1 genes (through a mechanism unrelated to UCS or CRJE). While it appears that msg-A2 expression is not regulated by UCS, nothing is known yet about what mechanisms control msg-A2 expression or whether the msg-A2 family contributes to antigenic variation in response to immune pressure, environmental changes, or life cycle phases. The presence of a long monoguanosine repeat in some msg-A2 genes has raised the possibility that variation in the length of this repeat may cause frameshifts, thus altering the amino sequence downstream of the repeat (13,47). However, based on the high-throughput genome sequencing data with at least 150ϫ coverage (1), sequence reads for the monoguanosine repeat region in all involved msg-A2 genes appeared highly uniform, though a small number of reads (Ͻ5%) showed different numbers of repeats. We could not determine if this was caused by sequence errors or in vivo changes. The presence of such a small number of variable reads does not seem to support an involvement of this repeat in altering the antigenicity or other functions of the msg-A2 genes. Of note, a polyguanosine repeat encodes a polyglycine peptide, which has been shown in other organisms to play various critical roles, such as in protein-to-protein interactions, cell wall plasticity, and modulation of developmental stages (48)(49)(50). Whether the polyglycine peptide in Msg-A2 proteins has these functions awaits future investigation.
Despite their potential importance in Pneumocystis' survival, the functions of the vast majority of members of the msg superfamily remain poorly understood or uncharacterized. Even for the most extensively studied msg-A1 genes, while it has been generally believed that their primary function is to confer antigenic variation and immune evasion, there are only limited experimental data supporting this potential function (46). There are also multiple studies showing an involvement of Msg proteins in mediating adherence of Pneumocystis organisms to host alveolar epithelial cells and macrophages (51)(52)(53), though it is unknown if the Msg proteins involved in these studies represent Msg-A1 or other Msg proteins, especially Msg-A2 and Msg-A3 proteins, which are highly similar to Msg-A1 proteins in sequence and length. The functions of all non-msg-A1 genes remains unknown. Given that the Pneumocystis genome is highly compacted and that the DNA recombination system associated with msg-A1 genes is presumably sufficient for antigenic variation and immune evasion, there seems no reason to assume other msg genes perform the same function. We speculate that non-msg-A1 genes may perform other functions, such as mediating developmental states, optimizing mobility and adhesion ability, and adapting to specific host niches or environmental conditions. In support of this hypothesis, one such gene of the msg-E family in P. murina, termed p57, has been shown to be a stagespecific antigen that is expressed exclusively on intracystic bodies and young trophic forms, suggesting a role in the Pneumocystis life cycle development (29).
In conclusion, despite a highly reduced genome, Pneumocystis is equipped with a large complex superfamily of msg genes. These genes exhibit conservation among msg families and subfamilies across different Pneumocystis species as well as species-specific expansions or contractions. The versatility of these genes may mirror their association with a wide variety of functions rather than just conferring antigenic variation to allow immune evasion as previously believed. Our results provide a rich source of information that lays the foundation for the continued experimental exploration of the function of the Msg superfamily in Pneumocystis biology.

MATERIALS AND METHODS
Sources of Pneumocystis msg sequences. The primary source of msg sequences for P. murina, P. carinii, and P. jirovecii was from our previous studies (1,15,16), which are available from the NCBI Umbrella project PRJNA223519 at https://www.ncbi.nlm.nih.gov/bioproject/?termϭPRJNA223519. In this study, we obtained additional Pneumocystis msg and UCS sequences from various animals as listed in Table S1 in the supplemental material, which includes new tentative names for Pneumocystis organisms not reported previously. The methods to obtain these new sequences are described below.
Pneumocystis sample sources and DNA extraction. Agarose gel blocks containing P. wakefieldiae and P. carinii were obtained from 4 Norway rats immunosuppressed once per week by 4 mg/kg of methylprednisolone acetate (Depo-Medrol; Pharmacia and Upjohn Co. a division of Pfizer, Inc.) at the animal facility of the University of Cincinnati, OH, USA (54). Genomic DNA in gel blocks was extracted using the Zymoclean Gel DNA Recovery kit (Zymo Research).
P. carinii-infected lung tissues were obtained from 8 immunosuppressed Sprague Dawley rats collected between 1986 and 2013 from the animal facilities at NIH, Bethesda, MD (14,55), Indiana University, Indianapolis, IN (56), and Louisiana State University Health Science Center, New Orleans, LA. Genomic DNA was isolated by use of either a QIAamp DNA minikit (Qiagen) or a traditional method utilizing proteinase K digestion, phenol-chloroform extraction, and ethanol precipitation (14). In addition, P. carinii-infected lung tissues were obtained from 8 wild Norway rats (R. norvegicus) from 5 different pig farms in Ontario, Canada, in 2015 as previously described (57); exact locations and names of these farms were kept anonymous based on agreement with farm owners. Genomic DNA was extracted using the MasterPure Yeast DNA purification kit (Epicentre). All wild rats appeared to be healthy upon capture and were confirmed to be infected by P. carinii alone based on PCR and sequence analysis of two Pneumocystis mitochondrial genes, including the large subunit of rRNA (mtLSU) and ATPase subunit 6 genes (unpublished data). The P. carinii mtLSU sequence in all rats was identical to that in the laboratory Norway rats (GenBank JX499145).
DNA samples for Pneumocystis species infecting other wild rat species in Southeast Asia, including chestnut white-bellied rats, Müller's giant Sunda rats, Asian house rats, and Polynesian rats, were obtained from previous studies (58). All animals appeared to be healthy upon capture.
Pneumocystis sp. "macacae"-infected lungs were obtained from two simian immunodeficiency virus (SIV)-infected rhesus macaques at the NIH Animal Center, Bethesda, MD, USA (59,60). Genomic DNA was extracted following a Pneumocystis DNA enrichment protocol as described previously (1). An additional two Pneumocystis sp. "macacae" samples were obtained as formalin-fixed paraffin-embedded (FFPE) tissue sections prepared from SIV-infected rhesus macaques at the Tulane National Primate Research Center, Covington, LA (61), and the UC Davis California National Primate Research Center, Davis, CA, USA. Genomic DNA was extracted using the AllPrep DNA/RNA FFPE kit (Qiagen).
Pneumocystis sp. "canis" DNA samples were obtained from one Cavalier King Charles Spaniel dog at the University of Helsinki, Finland (62,63), and one Whippet dog at the University of Veterinary Medicine, Vienna, Austria (64).
Four P. oryctolagi DNA samples were obtained from previous studies of one rabbit with severe combined immunodeficiency at the University of Michigan, Ann Arbor, MI, USA (65), and three immunosuppressed rabbits at the Institut Pasteur de Lille (66) and the Institut National de la Recherche Agronomique de Tours Pathologie Aviaire et Parasitologie, Tours (67), France.
Animal experimentation guidelines of the National Institutes of Health were followed in the conduct of these studies.
Illumina sequencing. DNA extracts for 4 P. wakefieldiae, 4 Pneumocystis sp. "macacae," 2 Pneumocystis sp. "canis," and 4 P. oryctolagi samples were subjected to whole-genome sequencing commercially in an Illumina HiSeq platform using a 150-bp paired-end library and/or a 250-bp paired-end library. Genome assembly was performed essentially as previously described (1,68); detailed analyses of these genomes will be published separately.
RNA-Seq analysis of different msg families in P. murina and P. carinii. The relative expression level for each gene was estimated using RNA-Seq data from three heavily infected laboratory animals each for P. murina and P. carinii as previously described (1). RNA-Seq reads from each of the three samples were aligned to the coding DNA sequences (CDSs) using bowtie (69). The alignment bam files were then used to quantify transcript abundances by RSEM (70). The relative expression level for each gene was expressed as fragments per kilobase of exon per million fragments mapped (FPKM). msg sequences of P. wakefieldiae. To amplify the repertoire of the classical msg-A1 genes in full length, the forward primer (WSG.f3) was designed from the 3= end of the previously reported UCS (within CRJE) of P. wakefieldiae (GenBank accession no. AF164562) (30). The reverse primer (WSG.r5) was designed from the highly conserved 3=-end coding region near the stop codon based on an alignment of more than 3,000 Illumina HiSeq reads. Primer sequences are listed in Table S2. Both primers were specific for P. wakefieldiae, with no cross-amplification of P. carinii. PCR was performed using P. wakefieldiae genomic DNA and the AccuPrime Pfx SuperMix kit (Thermo Fisher Scientific) with the following cycling conditions: 95°C for 5 min and then 35 cycles at 95°C for 15 s, 55°C for 30 s, and 68°C for 3 min, with a final extension at 68°C for 5 min. The PCR product was subcloned into the pCR2.1 TOPO vector by use of the TOPO TA Cloning kit (Invitrogen, Carlsbad, CA). Two clones containing the full-length msg-A1 gene were sequenced commercially by Sanger sequencing.
To sequence the P. wakefieldiae homologue of the 6-gene cluster of the msg-C family in P. murina, we first used the Illumina reads of P. wakefieldiae (mixed with P. carinii reads) to assemble the P. wakefieldiae homologues of PNEG_03432 and PNEG_03438, which are flanking the 6-gene cluster in P. murina. Subsequently, we designed a primer set (3432.f1 and 3438.r1) (Table S2) specific for these two genes in P. wakefieldiae. With these two primers, we amplified an 8-kb fragment from P. wakefieldiae DNA and sequenced its full length by Sanger sequencing with primer walking. From a draft P. wakefieldiae genome assembly, we identified members of the msg-A2, msg-A3, msg-D, and msg-E families or subfamilies based on homology to known genes in P. murina, P. carinii and P. jirovecii (1). Full-length msg-A1 genes sequences were unable to be assembled from the short HiSeq reads (16).
msg-A1 sequences of P. carinii from wild rats. To determine whether the msg-A1 repertoires are identical in P. carinii from wild and laboratory Norway rats, we performed RFLP analysis of P. carinii isolates from 8 wild rats in comparison with those from 8 laboratory rats. The msg-A1 repertoires were amplified by PCR using primers RSG.f10 and RSG.r8 (Table S2), which are located in the highly conserved regions at the beginning and end of the msg coding regions, respectively, among 57 known full-length msg-A1 genes in P. carinii (1). Amplification was performed using the LongAmp Master Mix (New England Biolabs) with the following parameters: 94°C for 2 min and then 35 cycles at 94°C for 15 s, 55°C for 30 s, and 68°C for 3 min, with a final extension at 68°C for 5 min. PCR products were purified and subjected to restriction digestion with DraI (New England BioLabs) at 37°C for 2 h. The resulting digests were purified and separated in 2% E-gel containing ethidium bromide (Invitrogen, Carlsbad, CA).
The msg-A1 repertoire from one wild rat (no. R5), which showed a distinct RFLP pattern compared to those of laboratory rats, was chosen for sequencing after PCR amplification using the primer pair RSG.f10-RSG.r8 and the LiSpark Max SuFi PCR Master Mix kit (LifeSct LLC, Rockville, MD). The PCR product was subcloned into the pCR-XL-2 TOPO vector by use of the TOPO XL-2 Complete PCR Cloning kit (Invitrogen, Carlsbad, CA). A total of 28 clones containing the full-length msg-A1 gene were sequenced commercially by Sanger sequencing. msg sequences of Pneumocystis sp. "macacae," Pneumocystis sp. "canis," and P. oryctolagi. Illumina HiSeq reads from one Pneumocystis sp. "macacae" sample were aligned to all known full-length msg-A1 genes of P. jirovecii (1), resulting in at least 1,000 reads for the very 5= and 3= ends of the msg-A1 coding regions. Two primers (KSG.f3 and KSG.r2) (Table S2) were designed from highly conserved regions based on alignment of these reads. The full-length msg-A1 repertoire in Pneumocystis sp. "macacae" was amplified using these two primers and the LiSpark Max SuFi PCR Master Mix kit, followed by subcloning into the pCR-XL-2 TOPO vector as described above. Two clones containing the full-length msg-A1 gene were sequenced commercially by Sanger sequencing.
For other msg families and subfamilies, we identified a small number of representative genes from a partial genome assembly of Pneumocystis sp. "macacae" based on homology to known genes in P. murina, P. carinii, and P. jirovecii (1).
For Pneumocystis sp. "canis" and P. oryctolagi, a small number of genes representing each msg family were identified from a partial genome assemblies of Pneumocystis sp. "canis" and P. oryctolagi, respectively.
UCSs of msg-A1 genes in Pneumocystis from various mammalian host species. The UCS and its 5= untranscribed region (UTR) sequences in Pneumocystis sp. "macacae," Pneumocystis sp. "canis," and P. oryctolagi were first obtained by assembling Illumina HiSeq reads from whole-genome sequencing as described above, followed by confirmation by PCR amplification and Sanger sequencing of genomic DNA. Based on sequence alignment of these UCSs and know UCSs of P. murina, P. carinii, P. wakefieldiae, and P. jirovecii, we designed one forward primer (5UTR) from the conserved region in the 5= UTR and one reverse primer (CRJE.r3) from the conserved region in the CRJE (Table S2). This primer set was used to amplify the UCS along with its 5= UTR in Pneumocystis species from other mammal species, including dogs, rabbits, chestnut white-bellied rats, Müller's giant Sunda rats, Asian house rats, and Polynesian rats ( Fig. 4 and Table S1). To study the variability of UCSs and downstream msg-A1 coding regions in different P. oryctolagi isolates, PCR was performed using a pair of primers, OSG.f3 and OSG.r9, which are located at the very 5= end of UCS and one highly conserved region near the 5= end of the msg-A1 coding region (Table S2). All PCR products were sequenced directly and/or after subcloning into the pCR2.1 TOPO vector as described above.
Phylogenetic analysis. To analyze phylogenetic relationships, deduced protein sequences were aligned using MUSCLE (71), and phylogenetic trees were constructed based on maximum likelihood (ML) using RAxML (v8.2.5) (72) with 100 bootstraps as support values. The best amino acid model was estimated using the PROTGAMMAAUTO option.
Data availability. Annotated genomic sequences of all new msg genes identified in this study are available from the BioProject database (https://www.ncbi.nlm.nih.gov/bioproject) under accession no. PRJNA560924. All new UCS sequences obtained in this study are available from GenBank with accession no. MN509813 to MN509830. Coding DNA sequences (CDSs) and deduced amino sequences for all msg genes according to the family/subfamily are available at https://zenodo.org/record/3523554 #.XjLZ7UBFyF4 (excel file for data sets 1 to 8). Hidden Markov models (HMMs) for Msg domains are available at https://zenodo.org/record/3515473#.XjLaaEBFyF4.

SUPPLEMENTAL MATERIAL
Supplemental material is available online only. TEXT S1, DOCX file, 0.1 MB. We thank Rene Costello for providing animal care, Nicolas Cere (Institut National de la Recherche Agronomique de Tours Pathologie Aviaire et Parasitologie, Tours, France) for kindly providing P. oryctolagi samples, and B. Scandrett, C. Roehrig, K. Konecsni, and participating farmers for coordinating rat sample collection in Ontario. We also thank Phillippe Hauser (University of Lausanne, Lausanne, Switzerland) for providing information about msg sequences in their studies (17) and members of the Nomenclature Committee for Fungi, Scott Redhead and Konstanze Bensch, for their advice on Pneumocystis names.
We declare no conflicts of interests.